A Campaign-Based Characterization of Spamming Strategies
Total Page:16
File Type:pdf, Size:1020Kb
A Campaign-based Characterization of Spamming Strategies Pedro H. Calais, Douglas E. V. Pires Cristine Hoepers, Dorgival Olavo Guedes, Wagner Meira Jr. Klaus Steding-Jessen Computer Science Department Computer Emergency Response Team Brazil Federal University of Minas Gerais Network Information Center Brazil Belo Horizonte, MG - Brazil S~aoPaulo, SP - Brazil Abstract technique employed by spammers to maximize the ef- fectiveness of their attacks, reducing the probability This paper presents a methodology for that the message is blocked by spam filters and pre- the characterization of spamming strategies venting their activities of being identified and tracked. based on the identification of spam cam- Our approach is to characterize spam campaigns in ad- paigns. To deeply understand how spammers dition to individual messages. We define a campaign abuse network resources and obfuscate their as a set of messages that have the same goal (e.g., messages, an aggregated analysis of spam advertising a specific product) and employ the same messages is not enough. Grouping spam obfuscation strategy, which comprises either content messages into campaigns is important to un- obfuscation and network exploitation strategies. In veil behaviors that cannot be noticed when general, spammers obfuscate and change the content looking at the whole set of spams collected. of their messages on a systematic and automated way. We propose a spam identification technique They try to avoid sending identical messages, which based on a frequent pattern tree, which natu- would make the task of detecting his or her messages rally captures the invariants on message con- easier. Thus, in order to characterize the strategies tent and detect campaigns that differ only and traffic generated by different spammers, it is neces- due to obfuscated fragments. After that, we sary to identify groups of messages that are generated characterize these campaigns both in terms following the same procedure and are part of the same of content obfuscation and exploitation of spam campaign. In this paper, we propose a novel network resources. Our methodology in- and scalable methodology for identifying spam cam- cludes the use of attribute association anal- paigns. After the campaigns have been identified and ysis: by applying an association rule min- messages are associated with those campaigns, we can ing algorithm, we were able to determine co- then characterize how each campaign have exploited occurrence of campaign attributes that un- the network resources and how their contents have veil different spamming strategies. In partic- been obfuscated. ular, we found strong relations between the origin of the spam and how it abused the We consider the identification of spam campaigns a network, and also between operating systems crucial step for identifying spamming strategies and and types of abuse. improving our understanding of how spammers abuse network resources for a number of reasons. First, the identification of campaigns creates new dimensions 1 Introduction that can be analyzed and correlated. Aggregate anal- ysis on spam data is limited in determining spamming strategies. By grouping spam messages in their associ- Despite current strategies to minimize the impact of ated campaigns, we can characterize how the spammer spams, it is necessary a continuous effort to understand disseminated his or her messages. Second, the volume in detail how spammers generate, distribute and dis- of spam messages is huge, and processing such amount seminate their messages in the network, to maintain of data is costly and sometimes unfeasible. Group- and even improve the effectiveness of anti-spam mech- ing messages into campaigns provides a summariza- anisms (Pu & Webb, 2006). The goal of this paper tion criteria, drastically reducing the amount of data is to characterize different spamming strategies em- to be treated, while maintaining their key characteris- ployed by spam senders. We define as an strategy any tics. Finally, the identification of spam campaigns neu- that graph were analyzed, for example, the identifica- tralize the effect of the variable volume of messages as- tion of large groups of IPs that send spam messages sociated to each spam campaign, which might hide fre- with the same URL. The notion of campaign was im- quent behaviors happening only on smaller campaigns. plicitly used, by grouping messages by their URLs, but, as URL obfuscation is a common practice, the After identifying the messages that were generated groups of IPs referencing the same URLs could be even from the same spam campaign, we propose a method- bigger. Our work considers not only URLs, but also ology for characterizing spam dissemination strategies. other features while identifying campaigns. The methodology is based on the detection of invari- ants and co-occurrence of mechanisms for sending mes- SpamScatter (Anderson et al., 2007) is a technique sages adopted by a single campaign. These invariants that determines spam campaigns by performing im- and patterns represent spamming behaviors and may age shingling, which looks for similarities between im- be used for definition of criteria for detection, identi- ages from different spam web pages. The methodology fication and minimization of the impact of spam. adopted by the authors is similar to ours: campaigns are first identified and then characterized. However, We applied our characterization methodology to 97 while their work analyzes the scam hosting infrastruc- million spam messages captured during 12 months ture, our focus is on the characterization of the net- by low-interaction honeypots (Provos & Holz, 2007), work infrastructure abuse. which were configured to emulate computers with open relays and open proxies. We were able to find strong In fact, the idea of identifying spam campaigns is not relations between the origin of the spam and how it new. In the literature, most research on grouping near- abused the network and also between operating sys- duplicate spam messages aims to detect campaigns as tems and these abuse types. a strategy for blocking them, based on the fact that an inherent characteristic of unsolicited e-mails is that they are sent in high volumes during short periods 2 Related Work of time. Different strategies for grouping messages into campaigns can be mentioned, such as techniques Many recent works have studied spammers' abuse that consider URL information (Yeh & Lin, 2006), strategies, both considering network behavior and con- signature-based approaches such as I-Match (Kolcz & tent obfuscation. Chowdhury, 2007) and techniques that compute simi- In (Ramachandran & Feamster, 2006) the authors ana- larities between spam images (Wang et al., 2007). Our lyze how spammers exploit the Internet infrastructure goal is different from those works in the sense that to send their messages, including the most popular IP we intend to identify spam campaigns to characterize ranges exploited for sending spam and the more com- them in terms of content and network obfuscation. Al- mon abuse types, such as bots and BGP hijacking. In though our findings may support the development and particular, the authors show that spam messages tend improvement of anti-spam techniques, this is not our to be sent from very restricted IP ranges. Some statis- main objective. tics about the origin of the messages show the most Regarding content characterization, (Pu & Webb, common operating systems originating spams and the 2006) presented some analysis of temporal evolution of autonomous systems (AS) that account for the high- spammer strategies regarding the techniques they use est volume of spams. Our paper also characterizes to construct their messages. These techniques were spamming network strategies, but instead of looking extracted from the rules identified by the anti-spam at the group of messages as a whole, we group mes- filter SpamAssassin. The authors showed that some sages into campaigns and then analyze how the groups obfuscation techniques are abandoned over time, pos- of IPs that disseminated each campaign have abused sibly due to changes in the environment, such as a bug network resources. This approach provides insights on fix on an e-mail client program. On the other hand, how spammers act, which would not be possible on some strategies are able to persist for long periods of an aggregated analysis. Moreover, we also found rela- time. Our work is complementary to theirs and also tions among operating systems, spam origin and abuse provides an interesting framework for trend detection, types, which extends the analysis presented on (Ra- which would be a future work direction. machandran & Feamster, 2006). Another characteristic of our work is the use of data A recent work on characterization of strategies of spam mining techniques to unveil spamming strategies. Al- dissemination is presented in (Li & Hsieh, 2006). The though data mining has been extensively applied on authors grouped spams according to the messages' a wide range of contexts such as e-commerce, bio- URLs and analyzed the graph representing the rela- informatics and industrial applications (Tan et al., tionships between IPs and URLs. Some properties of 2005), we are not aware of any work that applied logs the Operating System of the source IP for each data mining techniques for spam characterization pur- TCP connection, using passive fingerprinting tech- poses. niques (Provos & Holz, 2007). All logs and data cap- tured were then collected by a central server. 3 Methodology Any spammer trying to abuse one of these honeypots to send spam would tend to believe that the emails In this section, we present our methodology for char- were delivered successfully. The message, however, acterizing spammers' strategies for dissemination of was stored locally and never delivered to its recipients. spam messages. The methodology is divided into three The only exceptions were emails sent by spammers to distinct phases: data collection, campaign identifica- test if the proxy/relay was delivering messages. Each tion, and characterization. These three phases will be test message was specially crafted, and contained in- detailed in the next subsections.