A Study of Spam Origins
Total Page:16
File Type:pdf, Size:1020Kb
Spamology: A Study of Spam Origins ∗ Craig A. Shue Minaxi Gupta Chin Hua Kong Oak Ridge National Computer Science Dept. Computer Science Dept. Laboratory Indiana University Indiana University Oak Ridge, Tennessee, USA Bloomington, Indiana, USA Bloomington, Indiana, USA [email protected] [email protected] [email protected] John T. Lubia Asim S. Yuksel Computer Science Dept. Computer Science Dept. Indiana University Indiana University Bloomington, Indiana, USA Bloomington, Indiana, USA [email protected] [email protected] ABSTRACT email lists for very little money.” This paper is motivated by The rise of spam in the last decade has been staggering, with the question: How are these bulk email lists created? While the rate of spam exceeding that of legitimate email. While previous work has explored one avenue for how spammers conjectures exist on how spammers gain access to email ad- might be harvesting email addresses [1], our goal is to un- dresses to spam, most work in the area of spam containment dertake a systematic study to understand the phenomenon. has either focused on better spam filtering methodologies Our general approach is to register a dedicated domain, or on understanding the botnets commonly used to send run an email server for that domain, and post email ad- spam. In this paper, we aim to understand the origins of dresses belonging to the domain at strategic places and watch spam. We post dedicated email addresses to record how the inflow of spam. The first step we take is to register for and where spammers go to obtain email addresses. We find free offers on the Web and give each web site a unique email that posting an email address on public web pages yields address. This is motivated by the simple observation that immediate and high-volume spam. Surprisingly, even sim- Internet users are frequently asked to trust and confide in ple email obfuscation approaches are still sufficient today a variety of web sites. Many sites offer their services, such to prevent spammers from harvesting emails. We also find as email accounts, and news for free. However, to access that attempts to find open relays continue to be popular these services, users must register and provide some per- among spammers. The insights we gain on the use of web sonal information. This often includes an email address, al- crawlers used to harvest email addresses and the common- lowing the site to contact the user. While users may implic- alities of techniques used by spammers open the door for itly assume that the site will securely and responsibly han- radically different follow-up work on spam containment and dle the confidential registration information, this assump- even systematic enforcement of spam legislation at a large tion may not hold in practice. Our next step is to post scale. unique, site-specific email addresses at popular sites with a blog or a comment section, such as washingtonpost.com and nytimes.com. The intent behind this step is to see if 1. INTRODUCTION the posted email addresses get discovered by web crawlers Spam exceeds legitimate email today. It wastes band- that eventually lead to spam. To compare the likelihood width and storage, delays valid emails, and hurts human pro- of getting email addresses harvested from popular sites ver- ductivity. Given these implications, significant efforts have sus less popular ones, and to understand the behavior of rightly been devoted to spam containment. While we have crawlers that lead to spam, we also post multiple email ad- made great leaps in developing novel spam-filtering tech- dresses on various web pages belonging to our department. niques and in understanding the botnets typically used to We exposed 22,230 unique email addresses and monitored send spam, our understanding of where spammers get the the inflow of spam for a period of almost 5 months. The re- email addresses to spam is largely confined to “they buy bulk sulting data helped us draw various interesting conclusions ∗ and also helped confirm some prevailing wisdom. Specifi- Craig Shue performed this work while a Ph.D. candidate at Indiana University. cally, we concluded that: Final edits of the submitted manuscript have been authored • by a contractor of the U.S. Government under contract DE- Users must exercise caution when divulging AC05-00OR22725. Accordingly, the U.S. Government re- their email addresses. Though none of the popular tains a nonexclusive, royalty-free license to publish or re- web sites where we registered for accounts spammed, produce the published form of this contribution, or allow a few of the less popular ones did send spam. Publicly others to do so, for U.S. Government purposes. posting email addresses at both popular and unpopu- lar sites led to the most spam. • Spam arrives instantly. Our first spam arrived in less than one hour, indicating that new email addresses get discovered quickly on the Internet. web sites in a manner that were available to web crawlers. Our general approach was to create a new email address to • Commonly-used email obfuscation techniques provide in each case. This allowed us to track the outcome are offering protection (for now). It is common on a per-site basis. In the passive approach, we simply ob- practice to replace the conventional @ in email ad- served the unsolicited spam sent to our mail server. We now dresses by an AT in order to defeat email harvesting. describe each in detail. We found that the spammers are still not parsing sim- ple obfuscations as of now. However, one should not 1. Signing up for mailing lists and newsletters: Our count on the protection offered by such simple obfus- first data set was motivated by the question “How often do cation schemes, for they are trivial to defeat. web sites leak email addresses provided to them?” To col- lect data to answer this question, we signed up and created • Spamming crawlers exist and can be tracked. accounts at 70 different types of web sites, including those Our scheme of dynamically presenting each new vis- that promise free offers and mailing lists and newsletters. itor to our department web pages with a new email We provided a unique email address belonging to our regis- address allowed us to track which crawlers, if any, led tered domain to each site in order to track which of these to spam. We found that crawling for harvesting email sites lead to spam. We used two different ways to find these address is occurring regularly. Thus, blocking access sites. First, we used the Alexa Web Information Service [3] for spamming crawlers could be an important new step to find popular web sites which allowed creation of accounts. in spam containment. This confirms findings from the 14 of the 70 accounts we created were in Alexa’s top 1000 work in Project Honey Pot [1]. web sites according to their traffic rankings. Second, we searched for pages that had forms that asked for email ad- • Top spammers use multiple email-harvesting dresses. This led to the creation of 56 additional accounts. strategies. Top spammers crawled popular and un- Combined, these 70 accounts helped us understand the be- popular web pages to harvest email addresses. They havior of popular web sites in comparison to the unpopular also made attempts to check if our mail server could be ones. We refer to this data set as the ACT (standing for used as an open relay. On the one hand, it shows the account) data set throughout the rest of this paper. aggressiveness of spammers. On the other, it opens up new avenues for fighting spam because anybody who 2. Posting emails on popular web sites: The second is doing both could be filtered. data set was motivated by the question “What is the likeli- hood of receiving spam on an email address posted at a pop- ular site?” To collect data to answer this question, we posted Roadmap: The remainder of this paper is organized as fol- 9 unique email addresses on 4 popular web sites. 3 of these lows: We explain our data collection technique in Section 2. sites were in the Alexa top 1000 popular sites. The email ad- In Section 3, we provide an overview of the spam we collect dresses were often added in comment sections of news stories and in Section 4, we analyze how quickly spam is received or in blogs. To avoid removal of the email addresses by site once an email address is exposed. Section 5 examines spam- moderators, yet still avoid legitimate emails, we included in- ming trends at the domain and IP granularity and Section 6 nocuous commentary along with each email address. Since investigates the behavior of web crawlers that are tied to our comments are unlikely to garner responses from regular spam campaigns. We review related work in Section 7 and visitors to these sites, we consider any email sent to these 9 discuss the implications of this study on spam containment addresses to be spam. Subsequently, this data set is denoted in Section 8. as the PST (standing for post) data set. 3. Posting static email addresses on less popular 2. DATA COLLECTION METHODOLOGY web pages: This data set was motivated by the ques- Spammers can harvest email addresses from multiple sour- tion: “What is the likelihood of receiving spam on an email ces on the Internet, including web pages, blogs, newsgroups, address at a less popular web site?” To collect this data social networking sites, and mailing lists. In our study, we set, we posted two different email addresses on two depart- examine the ramifications of making email addresses avail- mental course web pages.