Anyone with an inbox nowadays knows that email spam is a real and serious problem. Luckily there are a number of techniques you can employ to contain this phenomenon and decrease the amount of junk mail you receive.
Spammers harvest email addresses using bots that crawl the Web looking for them. If an address is disguised in some way when it’s published, a bot may miss it. Address munging is the process of hiding or disguising an address. For instance, you can write an address like this: name [AT] domain [DOT] com, or create an image that displays the address, or write the address using HTML character entities. For example, when you put &#64; in the HTML code, the browser renders it as @.
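A minimal sketch of two of these disguises: the bracketed-word form, and numeric HTML character references, where @ is written as &#64; (64 is its character code). The function names are made up for this illustration.

```python
def munge_as_words(address: str) -> str:
    """Replace '@' and '.' with bracketed words, as in name [AT] domain [DOT] com."""
    return address.replace("@", " [AT] ").replace(".", " [DOT] ")


def munge_as_entities(address: str) -> str:
    """Encode every character as a decimal HTML character reference."""
    return "".join(f"&#{ord(ch)};" for ch in address)


print(munge_as_words("name@domain.com"))
print(munge_as_entities("a@b.c"))
```

A browser decodes the entity form back into the readable address, so visitors see it normally. A bot that decodes entities will see it too, so this only stops the simpler harvesters.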
Once the spammers have your email address, the fight moves to your mail server and inbox. A simple approach to reducing spam is to filter each message’s content. With content filters, the body of the message is scanned for trigger words, such as Viagra or free money. If one or more of these trigger words are found, the message is marked as spam. Some implementations replace the plain “spam/not spam” verdict with a score (the higher the score, the higher the chance the message is spam), so you can tune how aggressive the filter is.
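The scoring variant can be sketched in a few lines. The trigger phrases come from the paragraph above, but the weights and the threshold are invented for illustration and are not taken from any real filter.

```python
import re

# Hypothetical weights and threshold, chosen only for this example.
TRIGGER_WEIGHTS = {"viagra": 3.0, "free money": 2.5}
SPAM_THRESHOLD = 3.0


def spam_score(body: str) -> float:
    """Add up the weights of every trigger phrase found in the message body."""
    text = body.lower()
    score = 0.0
    for phrase, weight in TRIGGER_WEIGHTS.items():
        # \b makes the phrase match only as whole words.
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            score += weight
    return score


def is_spam(body: str, threshold: float = SPAM_THRESHOLD) -> bool:
    """Mark the message as spam once its score reaches the threshold."""
    return spam_score(body) >= threshold
```

Raising or lowering the threshold is the customization mentioned above: a higher value lets more borderline mail through, a lower one catches more spam at the cost of more mistakes.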
The main disadvantage of this method is that spammers often misspell or obscure trigger words to avoid recognition. Moreover, a large list of trigger words can increase the number of false positives.
The real evolution of these methods uses statistical analysis of a message’s contents (typically a Bayes classifier) to recognize spam in a more adaptive way. In a mail client that employs Bayesian filtering, the user marks a message as spam or not, and over time the filter learns which messages are good and bad. This method can be used on both the client side, with software such as Mozilla Thunderbird, and server side, with packages like SpamAssassin.
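To make the learning step concrete, here is a small naive Bayes sketch. It is an illustration of the general technique under simple assumptions (whitespace tokenization, add-one smoothing) and not the algorithm used by Thunderbird or SpamAssassin. The class and method names are hypothetical.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy spam filter that learns word frequencies from user-marked messages."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text: str) -> list:
        return text.lower().split()

    def train(self, text: str, label: str) -> None:
        """Record a message the user marked as 'spam' or 'ham' (not spam)."""
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def _log_score(self, words: list, label: str) -> float:
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        # Vocabulary size, plus one slot reserved for words never seen before.
        vocabulary = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"])) + 1
        total_messages = sum(self.message_counts.values())
        # Smoothed log prior, then one smoothed log likelihood per word.
        score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
        for word in words:
            score += math.log((counts[word] + 1) / (total_words + vocabulary))
        return score

    def is_spam(self, text: str) -> bool:
        words = self.tokenize(text)
        return self._log_score(words, "spam") > self._log_score(words, "ham")
```

Each call to `train` plays the role of the user pressing the "spam" or "not spam" button. The more messages are marked, the better the word statistics become.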
Although Bayesian techniques address some of the limitations of keyword-based filters, you can still get false positives. Moreover, spammers can embed the message in an image or craft the text to try to slip past the filter.
Sender Policy Framework
Spammers often send mail from forged addresses. A system called Sender Policy Framework (SPF) uses the Domain Name System (DNS) to decide when to reject or accept a message.
To implement this technique, you add a TXT record to your domain’s DNS zone, using a special syntax. You can use the wizard on the SPF homepage to generate one. The record specifies which hosts and IP addresses are allowed to send mail from your domain. Then, when a mail server receives a message from firstname.lastname@example.org, it makes a DNS query for example.org, looking for an SPF record. If a record is found, the mail server checks whether the host that sent the message is in the list of allowed ones; if it is not, the message is rejected.
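As a rough illustration, the check a receiving server performs can be sketched as follows. The record is a made-up example built from documentation IP ranges, and the function handles only the `ip4:` and `ip6:` mechanisms. Real SPF evaluation also covers mechanisms such as `a`, `mx` and `include`, plus qualifiers, none of which are attempted here.

```python
import ipaddress

# Hypothetical TXT record for example.org: two allowed sources, reject the rest.
EXAMPLE_RECORD = "v=spf1 ip4:192.0.2.0/24 ip4:198.51.100.25 -all"


def ip_allowed_by_spf(record: str, sender_ip: str) -> bool:
    """Return True if sender_ip matches an ip4:/ip6: mechanism in the record."""
    terms = record.split()
    if not terms or terms[0] != "v=spf1":
        raise ValueError("not an SPF record")
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        if term.startswith(("ip4:", "ip6:")):
            network = ipaddress.ip_network(term[4:], strict=False)
            if ip.version == network.version and ip in network:
                return True
    return False


print(ip_allowed_by_spf(EXAMPLE_RECORD, "192.0.2.10"))
print(ip_allowed_by_spf(EXAMPLE_RECORD, "203.0.113.7"))
```

In a real deployment the record string would come from the DNS query rather than from a constant.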
Again, SpamAssassin is one popular open source application that can implement SPF. Many popular mail servers support it directly or through patches or plugins.
SPF is a good technique but it has two drawbacks. First, SPF records are not widely used, and on domains without an SPF record, the SMTP server will accept any message. Second, often it’s difficult to decide which hosts are allowed to send mail using a given domain as sender.
Real time blacklist
Another server-side technique, the real-time blacklist (RBL), uses a central database of untrusted IP addresses that are known to have delivered spam. When the SMTP server receives a message, it queries one or more of these lists, looking for the sender’s IP address. If the address is found, the message is rejected.
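Most of these lists are published through DNS: the server reverses the octets of the sender's IPv4 address, appends the list's zone name, and looks the result up. The sketch below assumes that convention. The zone `dnsbl.example.org` is a placeholder and not a real list.

```python
import ipaddress
import socket


def dnsbl_query_name(sender_ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets followed by the zone."""
    ip = ipaddress.IPv4Address(sender_ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(ip).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(sender_ip: str, zone: str) -> bool:
    """Return True if the lookup resolves, which is how a listing is signalled.

    Any resolution failure counts as "not listed" here, including DNS outages;
    a production server would tell those cases apart.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(sender_ip, zone))
    except socket.gaierror:
        return False
    return True


print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```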
There are a lot of lists available. SORBS and the Spamhaus Block List (SBL) are two widely used ones.
This method works well, but some lists are too restrictive and others are too permissive. It can also lead to false positives: if someone runs a home server (with a dynamic IP address) or was added to a list by mistake, you won’t receive messages from that sender.
Another server-side approach, greylisting, relies on the way spammers use SMTP servers. When a destination mail server is not available, a normal sending server tries to send the message again later. Many servers used by spammers are simpler, and care more about the number of messages sent than whether every message arrives, so when a spammer’s SMTP server gets an error during delivery, it gives up instead of trying again.
With greylisting, a destination mail server will reject every message from an unknown IP address with a temporary error. A traditional mail server will retry later, and at that time the message will be accepted. This approach requires the receiving server to save the IP address of the sender so it can recognize it later and then accept the message.
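A minimal in-memory sketch of that bookkeeping follows. The minimum delay of 300 seconds is an assumption made for the example, and the class name and return values are hypothetical. A real server would answer "tempfail" with a 4xx SMTP reply and would also expire old entries.

```python
import time


class Greylist:
    """Remember when each sender IP was first seen and accept only on a later retry."""

    def __init__(self, min_delay_seconds: float = 300.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock  # injectable so the behaviour can be tested without waiting
        self.first_seen = {}  # sender IP -> time of the first delivery attempt

    def check(self, sender_ip: str) -> str:
        """Return 'tempfail' for new or too-early senders, 'accept' for valid retries."""
        now = self.clock()
        first = self.first_seen.setdefault(sender_ip, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"
```

Once an address has passed, every later message from it is accepted immediately, so only the first message from a new server is delayed.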
The main drawback of this method is increased latency when receiving messages, though you can ameliorate that problem with techniques such as whitelisting trusted servers.
An interesting variant of greylisting applies the method described above only if the sender is found on an RBL (typically a very restrictive one). That way the majority of messages arrive instantly, and the rest arrive with a short delay.
Vipul’s Razor fights spam by promoting collaboration between users. Cloudmark maintains centralized databases that collect a kind of hash (in effect, a small fingerprint) of spam messages. When a user receives a message, the software automatically queries these servers for the hash of the message. If there is a match, the message is rejected. If there is no match, but the message is junk anyway, its hash can be sent (manually or automatically) to the centralized servers. To resist hashbusters (random data added to a message to change its hash), the system uses ephemeral signatures, which hash only a randomly chosen part of the message.
There are three main drawbacks of this approach:
- Hash values of some junk messages may not be in the database when we query the system.
- Two completely different messages could have the same hash. Although it is very uncommon, it must be taken into consideration when evaluating false positives.
- Because the system is powered by users, someone could report a message as spam even though it isn’t. In recent versions a Truth Evaluation System (a sort of user reputation system) improved this, but again the problem should be considered when evaluating false positives.
Distributed Checksum Clearinghouse
The Distributed Checksum Clearinghouse (DCC) works in a way similar to Razor. It also uses a kind of hash of the message (a checksum) and queries a centralized server that holds all the checksums. However, in this technique there isn’t direct cooperation between users; instead, the system is fully automatic. The mail server or client sends the checksums of all messages (spam or not) to the central server with no user interaction. The central system counts the occurrences of every checksum and, when a count exceeds a certain threshold, the message is marked as spam. The reasoning is that if the same message is received by a lot of users, it is probably junk. Fuzzy checksums, which are designed to stay the same despite small changes to a message, are used to resist hashbusters.
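The counting idea can be sketched as below. This is an illustration under stated assumptions and not the clearinghouse's actual protocol or checksum algorithm: the normalization step (lowercasing and collapsing whitespace) is only a crude stand-in for real fuzzy matching, and the class name and threshold are hypothetical.

```python
import hashlib
from collections import Counter


class ChecksumClearinghouse:
    """Toy central server that counts how often each message checksum is reported."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.counts = Counter()

    @staticmethod
    def checksum(body: str) -> str:
        # Crude normalization so trivial case/spacing changes give the same checksum.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report(self, body: str) -> bool:
        """Count one sighting; return True once the count exceeds the threshold."""
        key = self.checksum(body)
        self.counts[key] += 1
        return self.counts[key] > self.threshold
```

Every mail server calls `report` for each message it receives, spam or not. Only messages seen in bulk cross the threshold.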
Although this technique does not require huge bandwidth, it can slow down an already overloaded mail server. Large organizations should run local DCC servers instead of relying on one master server.
DomainKeys Identified Mail
This server-side method uses public-key cryptography to authenticate the sending domain, and it guarantees the integrity of the message too. The sending mail server adds a header to the message containing a digital signature of the message content. The sending domain also publishes a special DNS record that holds its public key (similar to SPF). On the receiving end, the mail server determines the domain of the sender and retrieves its public key with a DNS query. With the public key and a signature verification algorithm, the receiving server can verify that the message really was sent on behalf of that domain and that it wasn’t modified in transit.
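The DNS lookup step can be sketched without any cryptography. In DKIM the signature header carries, among other tags, the signing domain (`d=`) and a selector (`s=`), and the public key is published as a TXT record at `<selector>._domainkey.<domain>`. The header value below is a made-up example with placeholder hashes, and the helper names are hypothetical. Verifying the signature itself needs an RSA or Ed25519 library and is not shown.

```python
# Hypothetical DKIM-Signature header value; bh= and b= hold placeholders, not real data.
EXAMPLE_SIGNATURE = (
    "v=1; a=rsa-sha256; d=example.org; s=mail2024; "
    "h=from:to:subject; bh=PLACEHOLDER=; b=PLACEHOLDER=="
)


def parse_dkim_tags(header_value: str) -> dict:
    """Split a 'tag=value; tag=value' list into a dictionary."""
    tags = {}
    for part in header_value.split(";"):
        if "=" in part:
            name, value = part.split("=", 1)
            # Whitespace inside values comes from header folding and is dropped.
            tags[name.strip()] = "".join(value.split())
    return tags


def dkim_key_query_name(header_value: str) -> str:
    """Return the DNS name whose TXT record holds the signer's public key."""
    tags = parse_dkim_tags(header_value)
    return f"{tags['s']}._domainkey.{tags['d']}"


print(dkim_key_query_name(EXAMPLE_SIGNATURE))
```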
The main drawback of this system is its limited adoption. Although big companies like Yahoo! implement it, many small servers don’t.
White list / black list
Whitelisting and blacklisting aren’t really antispam techniques but rather additional controls that one can use with almost every method. In a whitelist one can specify a series of trusted addresses or domains. If a sender is in this list, all controls are skipped and the message is received without delays or the risk of a false positive.
A blacklist collects addresses that users don’t want to receive mail from. Depending on the implementation, messages from those addresses can be rejected or marked in some way.
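A small sketch of how such lists can sit in front of the other checks. Matching on either the full address or its domain, and letting the whitelist win when a sender appears on both lists, are assumptions made for this example.

```python
def list_decision(sender: str, whitelist: set, blacklist: set) -> str:
    """Return 'accept', 'reject', or 'filter' (hand over to the other checks)."""
    sender = sender.lower()
    domain = sender.rpartition("@")[2]
    if sender in whitelist or domain in whitelist:
        return "accept"  # trusted: skip every other control
    if sender in blacklist or domain in blacklist:
        return "reject"  # unwanted: could also be tagged instead of rejected
    return "filter"


trusted = {"partner.example"}
blocked = {"ads@bulk.example"}
print(list_decision("Alice@partner.example", trusted, blocked))
print(list_decision("ads@bulk.example", trusted, blocked))
print(list_decision("bob@other.example", trusted, blocked))
```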
What’s the best antispam technique? The answer depends on the kind and volume of mail you receive. For example, if you don’t receive much email, you would probably prefer a system with no false positives at all. Mail administrators who don’t want to maintain a complex infrastructure should avoid using Vipul’s Razor or content filters that must be trained.
You can even mix techniques, or customize them in any way you like.