Fighting Spam With Statistics

Various sources estimate that junk mail is responsible for 40% to 95% of all SMTP traffic. There is no shortage of vendors offering anti-spam products, but all those products fail to address the root cause of the problem: Internet messaging allows anyone to send as many messages as s/he wants.


Why currently employed methods to fight spam fail

1. Blacklisting is ineffective because it focuses on blocking specific senders. Spammers can easily get around this by using botnets. Blacklisting is also prone to false positives, i.e. legitimate senders or ISP relays get blocked.

2. Content scanning. No matter how smart your content scanner is, it is an automated process that cannot outwit a determined mass-mailing professional. Statistical (Bayesian) scanning is easily defeated by randomization; numerous techniques exist to avoid keyword-based detection, and new methods surface regularly. Content scanning is also known to suffer from false positives.

3. Greylisting (temporary mail rejection and manipulating MX records) might be effective for small ISPs, but if it were used globally, junk mailers would easily script their way around it. Greylisting also delays, and sometimes disrupts, delivery of legitimate mail.

4. Teergrubing (delaying MTA sessions) might help slow spammers down, but overall it is like shooting a bear with a BB gun.


The ultimate challenge

Spammers can find ways to beat spam filters. When this happens, they will send hundreds of thousands of messages. The filters then get adjusted and the process starts over again. To fight spam successfully we need to be able to limit spammers' ability to send large numbers of messages in short periods of time. Once this is achieved -- voila -- junk mail is dead (well, not quite dead, but at least reduced to acceptable levels). Is this really possible? Yes, to some extent. How? Statistics to the rescue!


The anatomy of junk mail traffic

1. Junk messages are relatively small
2. Junk messages are sent in chunks
3. Junk messages within one chunk are very similar
4. Each junk source usually emits many messages in a short period of time

There are only two properties that spammers cannot forge: the IP addresses of both ends of the TCP connection used to emit junk messages. This is our key to containing spam traffic.


Fighting spam with data analysis

Proposed solution: pattern analysis. Since all spam fighting methods use bold names, let's call this methodology Source Trust Prediction (STP). Here is how it works:


Before accepting a message, an MTA connects to the STP server and sends just one variable: the source IP address of the client requesting an SMTP session. The STP server also knows the IP address of the MTA, since the MTA is the one making the request. Other properties could be used for correlation (the size of the message, the count and sizes of attached files), but doing so would significantly impact the performance of the STP server, so let's focus on just the source and destination IPs. Another important parameter is the time of the request. The STP server correlates this information with the data received from other MTAs and replies with a number that reflects how likely the sender is to be a junk mail source. The MTA then decides whether to drop or accept the message, or take other appropriate action.


The STP server must be capable of identifying patterns and trends in real time, which is extremely challenging, but not impossible for the statistics and data mining experts out there. The database of source and destination IPs can be mined to identify links, trends, and patterns. In other words, the STP server knows who sends how many messages where and when. This information can be used to calculate a number (the STP value) for each sender to represent its credibility, or trust.
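
To make the bookkeeping concrete, here is a minimal sketch in Python of a server that logs each (source, MTA, time) observation and derives a number from it. Everything specific here is an assumption made for illustration: the class name StpServer, the 10-minute window, the message limit, and the scoring formula itself. In this sketch a higher value means a more suspicious source. Real pattern analysis across many MTAs would be far more involved.

```python
from collections import defaultdict, deque


class StpServer:
    """Toy in-memory STP server (hypothetical; formula and limits are assumptions)."""

    def __init__(self, window_seconds=600, message_limit=1000):
        self.window_seconds = window_seconds
        self.message_limit = message_limit
        # source IP -> deque of (timestamp, MTA IP), oldest first
        self.log = defaultdict(deque)

    def query(self, source_ip, mta_ip, now):
        """Record one SMTP session request and return a score in [0.0, 1.0]."""
        events = self.log[source_ip]
        events.append((now, mta_ip))
        # Forget observations that have fallen out of the time window.
        while events[0][0] < now - self.window_seconds:
            events.popleft()
        messages = len(events)
        distinct_mtas = len({mta for _, mta in events})
        # Assumed heuristic: message volume, weighted up when the traffic
        # is spread across many different MTAs.
        spread = distinct_mtas / messages
        return min(1.0, messages * (0.5 + 0.5 * spread) / self.message_limit)
```

An MTA would call query() with the connecting client's address, its own address, and the current time (for example time.time()), and act on the returned number.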


It is not necessary to keep the data for long periods of time. Two weeks' worth of data (or even less) should be enough. The crucial part of a successful implementation is wide adoption: the more clients the STP server has, the more precise its replies will be.


While this will not stop spam, it will reduce it dramatically. The STP value of a spam source will grow in proportion to the number of junk messages sent. The first several thousand emails will reach unlucky recipients when spamming starts, but the remaining hundreds of thousands will not.


How spammers can fight the STP server

1. By slowing down message rates for each source
2. By limiting the number of messages sent by each source
3. By feeding STP server with useless noise
4. By DoS-ing the STP server(s)

The first two methods are not acceptable for junk mailers.
Method #3 can be countered effectively, again, by statistics.
And there are ways to minimize the likelihood of #4.


Typical junk mail scenarios and how STP can help



One-To-Many


The most common scheme is when one junk mail source sends messages to a number of MTAs. This case is very easy to handle. Unless you are a large ISP or an official mass-mailing source (for example, an organization sending periodic newsletters to customers), there is no need for you to send thousands of messages within minutes. Official bulk mail sources can be exempted (whitelisted) if necessary. Large SMTP sources (ISPs or webmail providers) with more or less constant traffic volumes can also be statistically identified. But if a source suddenly appears and starts sending hundreds of messages in all directions, it is likely a junk mail transmitter.
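
A rough sketch of that distinction in Python: a source is flagged only when its current volume is both large in absolute terms and far above its own past volume, so a steady high-volume ISP is left alone while a source that appears out of nowhere is not. The function name, the thresholds (min_burst, growth_factor), and the use of a simple mean as the baseline are all assumptions chosen for illustration.

```python
from statistics import mean


def is_sudden_bulk_sender(source_ip, recent_count, history, whitelist,
                          min_burst=100, growth_factor=10):
    """Return True if the source's volume is large and far above its own past.

    recent_count: messages seen from source_ip in the current interval
    history: counts from the same source in earlier intervals (may be empty)
    whitelist: set of exempted (official bulk mail) source IPs
    """
    if source_ip in whitelist:
        return False  # official bulk senders are exempt
    if recent_count < min_burst:
        return False  # too little traffic to matter
    baseline = mean(history) if history else 0.0
    # A brand-new source has a baseline of 0, so any large burst is "sudden".
    return recent_count > growth_factor * baseline
```

With these assumed thresholds, a new source sending 500 messages is flagged, while a provider that sends about 5000 messages in every interval is not.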

Smarthost Abuse and Open Mail Relays


In this scenario junk mail traffic is directed to a relay host that distributes it. The relay host could be an open relay, a spammer's server, or an abused ISP smarthost. If the abused server is not a valid SMTP relay, this case can be handled like the previous one: the relaying host essentially becomes a single source that distributes mail to many destinations, and this behavior is easy to detect. The real problem is when a valid SMTP relay is abused. For example, a virus or worm could take advantage of the SMTP smarthost that relays messages for ISP customers. The preferred way to address this problem is at the ISP level. For example, the smarthost could throttle each customer's connections by accepting only a fixed number of messages in a given period of time. This would make smarthost abuse extremely inefficient for spammers.
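
The throttling idea can be sketched as a per-customer counter over a fixed time window. This is only one of several ways to rate-limit, and the class name SmarthostThrottle and the default limit of 50 messages per hour are assumptions, not recommendations.

```python
class SmarthostThrottle:
    """Per-customer message cap per fixed time window (limits are assumptions)."""

    def __init__(self, max_messages=50, period_seconds=3600):
        self.max_messages = max_messages
        self.period_seconds = period_seconds
        # customer IP -> (window start time, messages accepted in that window)
        self.windows = {}

    def allow(self, customer_ip, now):
        """Return True if the customer may relay one more message at time `now`."""
        start, count = self.windows.get(customer_ip, (now, 0))
        if now - start >= self.period_seconds:
            start, count = now, 0  # the old window expired, start a fresh one
        if count >= self.max_messages:
            return False  # over the cap: refuse until the window expires
        self.windows[customer_ip] = (start, count + 1)
        return True
```

An infected customer machine would hit the cap after a handful of messages, while ordinary users would rarely notice the limit.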

Many-To-One


Now let's take a look at a more challenging problem: many junk mail generators sending messages to one mail relay. This could happen, for example, when spammers use a botnet to target users in a single domain. Statistics might help detect patterns in this case, but it will not be effective unless each spam source sends a high number of messages. Manual intervention and/or additional heuristic filtering is required. Here STP does not provide any immediate benefit, but it logs all the sources and will use this data in future pattern analysis. As soon as that botnet is used to send messages to other MTAs, STP will detect the patterns and assign appropriate scores to the junk mail sources, and MTAs accepting mail for other domains will be able to take appropriate action.

One-To-One


Another unlikely but tough-to-handle scenario is when each junk mail source is dedicated to sending messages to only one MX. In this case the senders can be statistically detected because of their unusual behavior. Unless you are a hop in the SMTP chain, there is no reason for you to send thousands of messages to a single SMTP server within a short period of time.
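
As a hedged illustration, the check could look like the Python function below: given the (source, MTA) pairs logged during one short interval, it reports sources that sent a large number of messages, all to a single MTA, and that are not known relays. The function name, the min_messages threshold of 1000, and the known_relays exemption set are assumptions made for this sketch.

```python
from collections import Counter


def single_target_floods(events, known_relays=frozenset(), min_messages=1000):
    """Find sources that flooded exactly one MTA within the observed interval.

    events: iterable of (source_ip, mta_ip) pairs from one short interval
    known_relays: source IPs that are legitimate hops in the SMTP chain
    Returns a dict mapping each flagged source IP to the MTA it targeted.
    """
    per_source = {}
    for source_ip, mta_ip in events:
        per_source.setdefault(source_ip, Counter())[mta_ip] += 1

    flagged = {}
    for source_ip, targets in per_source.items():
        if source_ip in known_relays:
            continue  # relays legitimately forward bulk traffic to one server
        if len(targets) == 1 and sum(targets.values()) >= min_messages:
            flagged[source_ip] = next(iter(targets))
    return flagged
```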



What to do with the STP score?


So, what action should an MTA take when it receives a score from the STP server that marks an SMTP session as possibly initiated by a junk mail sender? Well, it depends. Simply dropping the message could occasionally cause valid messages to be discarded, but if the STP score is high enough, it might be an option. Temporarily refusing delivery is another option, as is teergrubing. The most effective approach would probably be a combination of all three, chosen according to the value of the STP score.
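
One way to picture such a combination is a simple threshold ladder, sketched here in Python. It assumes the score is a number between 0 and 1 where higher means more likely junk, and the three cut-off values are invented for illustration; a real deployment would have to tune them.

```python
def choose_action(stp_score, tarpit_at=0.5, tempfail_at=0.7, drop_at=0.95):
    """Map an STP score (assumed 0.0 to 1.0, higher = more suspicious) to an action."""
    if stp_score >= drop_at:
        return "drop"      # almost certainly junk: discard the message
    if stp_score >= tempfail_at:
        return "tempfail"  # refuse temporarily, e.g. with an SMTP 4xx reply
    if stp_score >= tarpit_at:
        return "tarpit"    # teergrubing: accept, but slow the session down
    return "accept"
```

The ordering reflects how costly a mistake would be: the harsher the action, the higher the score required before the MTA applies it.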


STP is not a remedy


Again, STP will not prevent spam, but it can dramatically reduce it. In ideal conditions when a high number of mail relays report source addresses to the STP server for analysis, it might reduce junk mail traffic to a small fraction of what it is now. Spammers will only be able to send relatively small numbers of messages until STP patterns become obvious and MTAs start taking actions.


STP Implementation challenges


1. A high-performance distributed system capable of processing large volumes of data in real time to identify trends and patterns is very hard to implement.
2. The system cannot function unless it is widely adopted and many MX servers participate in reporting.
3. Privacy concerns over mail sending habits logged by STP servers (no, I am not kidding).




Comments -> vtalk@hexview.com