Understanding How Spammers Steal Your E-Mail Address: an Analysis of the First Six Months of Data from Project Honey Pot

Understanding How Spammers Steal Your E-Mail Address: An Analysis of the First Six Months of Data from Project Honey Pot Matthew B. Prince† Lee Holloway, Arthur M. Keller CEO, Unspam, LLC Eric Langheinrich, Information Systems and Adjunct Professor of Law & Benjamin M. Dahl Technology Management John Marshall Law School Unspam, LLC Baskin School of Engineering 850 W. Adams, Suite 4e 850 W. Adams, Suite 4e Univ. of California, Santa Cruz Chicago, IL 60607-3096 Chicago, IL 60607-3096 & Advisor, Unspam, LLC Abstract CEAS 2004 and opened to public volunteers October 14, 2004. Since its launch, the Project’s software has This paper summarizes and analyses data been installed by more than 5,000 users on websites compiled on the activities of email harvesters worldwide. The Project’s honey pots are running in at gathered through a 5,000+ member honey pot least 80 countries and on every inhabited continent. system that issues unique addresses based on a visitor’s IP address and specific spidering As of June 20, 2005, the Project is monitoring more time. The project, known as Project Honey than 250,000 active spamtrap e-mail honey pots. Pot, has provided data about the geographical Thousands of spamtrap addresses are distributed source of harvesting and mail processing, the through honey pots each day and the Project is on pace sending patterns of different types of to have more than 1 million active spamtraps monitored spammers as well as list management by the end of 2005. behavior. In addition to providing guidance for website administrators trying to forestall This paper is the first thorough analysis of the data gathered by Project Honey Pot. Understanding the harvesting, the Project data also suggest that behavior of harvesters is critical to controlling the spam anti-harvesting regulations offer a new, problem. Harvesters sit at the beginning of the spam potentially successful prosecutorial avenues cycle. Studies by the Pew Internet Project, the Center against spam as well as inform potential for Democracy and Technology, as well as the Federal amendments to current anti-spam laws that may help those efforts. Trade Commission have found that harvesting is the primary way spammers obtain new e-mail addresses.1 Understanding harvesting and the resulting address 1 Introduction† distribution can provide not only a mechanism to keep e-mail addresses out of the hands of spammers, but may It is axiomatic to say that the best way to stop spam is also help identify spam gangs and give law enforcement to keep spammers from getting your e-mail address. officials a new cause of action for prosecutions. While e-postage, challenge-response systems, Bayesian filters, realtime block lists, and reputation services may be necessary once an address is widely distributed, all 2 Technical Background of these anti-spam measures can be made more Project Honey Pot consists of two primary components: effective if the process of obtaining e-mail addresses in 1) the honey pot software installed on machines the first place is made difficult and auditable. To that worldwide, and 2) the centralized server which collects end, Project Honey Pot was created to understand the data from and distributes spamtrap addresses to the primary way by which spammers obtain new e-mail honey pots. The Project currently supports honey pot addresses. software for platforms running the following scripting Project Honey Pot (www.projecthoneypot.org) is a 1 distributed honey pot network to track e-mail See “Spam: How it is hurting email and degrading life on the Internet,” Deborah Fallows, Pew Internett & Amer. Life Project, Oct. harvesters, and, subsequently, the spammers who send 22, 2003 <http://www.pewinternet.org/report_display.asp?r=102>; to harvested addresses. The Project was announced at “Why Am I Getting All This Spam?” Center for Democracy and Technology, March 2003 <http://www.cdt.org/speech/spam/ 030319spamreport.shtml>; “Email Address Harvesting: How † Corresponding author. Email: [email protected]. Spammers Reap What You Sow,” FTC Report, Nov. 2002 Telephone: +01.312.543.3045 (direct). <http://www.ftc.gov/bcp/conline/pubs/alerts/spamalrt.htm>. language: PHP, ASP, ASP.NET, Perl, mod_perl, as well as thousands of additional domains donated by ColdFusion, SAP Netweaver BSP, and Python. We also our members. These donations take place by members provide a wrapper for users of the MovableType pointing their donated domains’ MX record to our blogging software to allow for easy installation. servers. Website administrators download and install the honey By combining our donated domains with our possible pot software. The static content of the honey pots, usernames, we can currently create approximately 10 which primarily consists of a legal disclaimer trillion unique e-mail addresses that will resolve to our forbidding the harvesting of the addresses displayed on mail servers. This allows us to distribute a unique the page, is randomized for each download in order to spamtrap to every visitor to a honey pot for the make the Project’s honey pots difficult to recognize and foreseeable future. Moreover, it means it is difficult for avoid. On a few high-traffic websites, we have further spammers to determine what e-mail addresses on their customized the boilerplate legal disclaimer as well as lists are, in fact, spamtraps. To further disguise our the look of the honey pot for particular members’ spamtraps we rotate the IP addresses of our mail servers needs. and are continuously looking for ways to further hide what addresses belong to the Project. After the honey pot script is installed and activated, we provide instructions to the website administrator on linking from his current web pages to the honey pot 3 Data Analysis page. These links are generally formatted to be invisible Harvesters make up a significant percentage of the to human visitors to the website, but to be followed by robot traffic currently trolling the Internet. web spiders and robots. We test these formats to ensure Approximately 6.5 percent of the traffic visiting our they are followed by the latest crop of spam harvesters. honey pots subsequently turns out to be spam When one of these links is followed and a honey pot is harvesters. While some human traffic inevitably finds accessed by a visitor, the honey pot script installed on our honey pots, the vast majority of visitors to these the webserver instantly contacts the centralized Project pages are automated spiders. We estimate, therefore, Honey Pot servers. The honey pot script passes to the that harvesters make up at least 5 percent of all centralized servers an array that includes the IP address automated traffic online. of the visitor, the useragent of the visitor, and the The average time from a spamtrap address being referer string of the visitor. The servers record this harvested to when it receives its first message is visitor information as well as a timestamp and return a currently 11 days, 7 hours, 43 minutes, and 10 seconds. unique spamtrap e-mail address to the honey pot script. The fastest turnaround is less than 1 second, and the The spamtrap address is handed out only once and is slowest is 223 days, 19 hours, 37 minutes, and 8 tied to both a moment in time and visitor information. seconds. The slowest time is just under the total online The honey pot script combines the spamtrap address age of the Project. As a result, we believe that the average turnaround time will continue to rise slightly as with the static content and displays a web page. The 2 process from access to page display typically takes less the Project ages. than a second and creates little additional load for the We’ve been surprised, so far, by how slow the web server where the honey pot script is installed. turnaround for some spammers has been. This lends While every spamtrap address is unique, they are support to the hypothesis that there is a class of designed to look like real addresses. There are two parts individuals involved in the spam trade who to every e-mail address: 1) the username, which appears methodically gather addresses. These individuals could before the @ sign, and 2) the domain, which appears be spammers who also send to those addresses, or they after the @ sign. We construct usernames with a list of could sit at the top of the spam food chain, selling the more than 6,000 common first names, 12,000 common lists they obtain to the spammers who then send last names, a 60,000 word dictionary, and random other messages to those lists. There is additional evidence letters and numbers. These components are combined to from recent legal cases that these “listmen” do exist as form typical usernames used by legitimate mailing part of the spam economy. Identification of these systems. For example: listmen, with an understanding and control of their behavior following closely thereafter, we believe offers • john.smith a critical choke point in the spam cycle both legally and • john_smith technologically. • johnasmith • jsmith We have also been surprised by how clearly many • orange42 harvesters identify themselves. A substantial percentage • orangegrasslands of harvesters can be identified by the “useragent” they For the domain portion of the spamtrap address, we use a number of domains controlled by Project Honey Pot 2 For the latest stats, see the Project Honey Pot statistics available online at: http://www.projecthoneypot.org/statistics.php. broadcast when visiting a website. While some universe of harvesters. In fact, only 25 harvesters are harvester disguise their identity pretending to be a responsible for more than 50 percent of the volume of typical website visitor (e.g., Mozilla/4.0 (compatible; spam that has been sent to the Project.

Understanding How Spammers Steal Your E-Mail Address: an Analysis of the First Six Months of Data from Project Honey Pot

Prospects, Leads, and Subscribers

Address Munging: the Practice of Disguising, Or Munging, an E-Mail Address to Prevent It Being Automatically Collected and Used

Technical and Legal Approaches to Unsolicited Electronic Mail†

Presentations Made by Senders

Asian Anti-Spam Guide 1

Taxonomy of Email Reputation Systems (Invited Paper)

Factors Involved in Estimating Cost of Email Spam

NIST SP 800-44 Version 2

WHITE PAPER Email Deliverability Review

How Do Spammers Harvest Email Addresses ?

Spam and Spam Prevention

84. Blog Spam Detection Using Intelligent Bayesian Approach