B@Bel: Leveraging Email Delivery for Spam Mitigation
Total Page:16
File Type:pdf, Size:1020Kb
B@bel: Leveraging Email Delivery for Spam Mitigation Gianluca Stringhinix, Manuel Egelex, Apostolis Zarrasz, Thorsten Holzz, Christopher Kruegelx, and Giovanni Vignax xUniversity of California, Santa Barbara z Ruhr-University Bochum fgianluca,maeg,chris,[email protected] fapostolis.zarras,[email protected] Abstract paigns generate between $400K and $1M revenue per Traditional spam detection systems either rely on con- month [20]. tent analysis to detect spam emails, or attempt to detect Nowadays, about 85% of world-wide spam traffic is spammers before they send a message, (i.e., they rely sent by botnets [40]. Botnets are networks of compro- on the origin of the message). In this paper, we intro- mised computers that act under the control of a single duce a third approach: we present a system for filtering entity, known as the botmaster. During recent years, a spam that takes into account how messages are sent by wealth of research has been performed to mitigate both spammers. More precisely, we focus on the email de- spam and botnets [18, 22, 29, 31, 33, 34, 50]. livery mechanism, and analyze the communication at the Existing spam detection systems fall into two main SMTP protocol level. categories. The first category focuses on the content of We introduce two complementary techniques as con- an email. By identifying features of an email’s content, crete instances of our new approach. First, we leverage one can classify it as spam or ham (i.e., a benign email the insight that different mail clients (and bots) imple- message) [16, 27, 35]. The second category focuses on ment the SMTP protocol in slightly different ways. We the origin of an email [17, 43]. By analyzing distinctive automatically learn these SMTP dialects and use them features about the sender of an email (e.g., the IP address to detect bots during an SMTP transaction. Empiri- or autonomous system from which the email is sent, or cal results demonstrate that this technique is successful the geographical distance between the sender and the re- in identifying (and rejecting) bots that attempt to send cipient), one can assess whether an email is likely spam, emails. Second, we observe that spammers also take into without looking at the email content. account server feedback (for example to detect and re- While existing approaches reduce spam, they also suf- move non-existent recipients from email address lists). fer from limitations. For instance, running content anal- We can take advantage of this observation by returning ysis on every received email is not always feasible for fake information, thereby poisoning the server feedback high-volume servers [41]. In addition, such content anal- on which the spammers rely. The results of our experi- ysis systems can be evaded [25, 28]. Similarly, origin- ments show that by sending misleading information to a based techniques have coverage problems in practice. spammer, it is possible to prevent recipients from receiv- Previous work showed how IP blacklisting, a popular ing subsequent spam emails from that same spammer. origin-based technique [3], misses a large fraction of the IP addresses that are actually sending spam [32, 37]. 1 Introduction In this paper, we propose a novel, third approach to Email spam, or unsolicited bulk email, is one of the ma- fight spam. Instead of looking at the content of mes- jor open security problems of the Internet. Accounting sages (what) or their origins (who), we analyze the way for more than 77% of the overall world-wide email traf- in which emails are sent (how). More precisely, we focus fic [21], spam is annoying for users who receive emails on the email delivery mechanism. That is, we look at the they did not request, and it is damaging for users who communication between the sender of an email and the fall for scams and other attacks. Also, spam wastes re- receiving mail server at the SMTP protocol level. Our sources on SMTP servers, which have to process a sig- approach can be used in addition to traditional spam de- nificant amount of unwanted emails [41]. fense mechanisms. We introduce two complementary A lucrative business has emerged around email spam, techniques as concrete instances of our new approach: and recent studies estimate that large affiliate cam- SMTP dialects and Server feedback manipulation. SMTP dialects. This technique leverages the observa- a mail server identifies the sender as a bot, instead of tion that different email clients (and bots) implement the dropping the connection, the server could simply reply SMTP protocol in slightly different ways. These de- that the recipient address does not exist. To identify a bot, viations occur at various levels, and range from differ- one can either use traditional origin-based approaches or ences in the case of protocol keywords, to differences in leverage the SMTP dialects proposed in this paper. When the syntax of individual messages, to the way in which the server feedback is poisoned in this fashion, spammers messages are parsed. We refer to deviations from the have to decide between two options. One possibility is to strict SMTP specification (as defined in the correspond- continue to consider server feedback and, as a result, re- ing RFCs) as SMTP dialects. As with human language move valid email addresses from their email list. This dialects, the listener (the server) typically understands reduces the spam emails that these users will receive in what the speaker (a legitimate email client or a bot) is the future. Alternatively, spammers can decide to distrust saying. This is because SMTP servers, similar to many and discard any server feedback. This reduces the effec- other Internet services, follow Postel’s law, which states: tiveness of future campaigns since emails will be sent to “Be liberal in what you accept, and conservative in what non-existent users. you send.” Our experimental results demonstrate that our tech- We introduce a model that represents SMTP dialects niques are successful in identifying (and rejecting) bots as state machines, and we present an algorithm that that attempt to send unwanted emails. Moreover, we learns dialects for different email clients (and their re- show that we can successfully poison spam campaigns spective email engines). Our algorithm uses both pas- and prevent recipients from receiving subsequent emails sive observation and active probing to efficiently gener- from certain spammers. However, we recognize that ate models that can distinguish between different email spam is an adversarial activity and an arms race. Thus, engines. Unlike previous work on service and protocol a successful deployment of our approach might prompt fingerprinting, our models are stateful. This is impor- spammers to adapt. We discuss possible paths for spam- tant, because it is almost never enough to inspect a single mers to evolve, and we argue that such evolution comes message to be able to identify a specific dialect. at a cost in terms of performance and flexibility. Leveraging our models, we implement a decision pro- To summarize, the paper makes the following main con- cedure that can, based on the observation of an SMTP tributions: transaction, determine the sender’s dialect. This is use- ful, as it allows an email server to terminate the con- • We introduce a novel approach to detect and mit- nection with a client when this client is recognized as a igate spam emails. This approach focuses on the spambot. The connection can be dropped before any con- email delivery mechanism — the SMTP communi- tent is transmitted, which saves computational resources cation between the email client and the email server. at the server. Moreover, the identification of a sender’s It is complementary to traditional techniques that dialect allows analysts to group bots of the same family, operate either on the message origin or on the mes- or track the evolution of spam engines within a single sage content. malware family. • We introduce the concept of SMTP dialects as one Server feedback manipulation. The SMTP protocol concrete instance of our approach. Dialects capture is used by a client to send a message to the server. Dur- small variations in the ways in which clients imple- ing this transaction, the client receives from the server ment the SMTP protocol. This allows us to distin- information related to the delivery process. One impor- guish between legitimate email clients and spam- tant piece of information is whether the intended recipi- bots. We designed and implemented a technique to ent exists or not. The performance of a spam campaign automatically learn the SMTP dialects of both legit- can improve significantly when a botmaster takes into imate email clients and spambots. account server feedback. In particular, it is beneficial • We implemented our approach in a tool, called for spammers to remove non-existent recipient addresses B@bel. Our experimental results demonstrate that from their email lists. This prevents a spammer from B@bel is able to correctly identify spambots in a sending useless messages during subsequent campaigns. real-world scenario. Indeed, previous research has shown that certain bots re- port the error codes received from email servers back to • We study how the feedback provided by email their command and control nodes [22, 38]. servers to bots is used by their botmasters. As a sec- To exploit the way in which botnets currently lever- ond instance of our approach, we show how provid- age server feedback, it is possible to manipulate the re- ing incorrect feedback to bots can