A Tool for Evaluating Spam Filters
UNIVERSITY OF CALGARY

SpamTestSim: A Tool For Evaluating Spam Filters

by

Nathan Mark Friess

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

DEPARTMENT OF COMPUTER SCIENCE

CALGARY, ALBERTA

January, 2009

© Nathan Mark Friess 2009

ISBN: 978-0-494-51104-6

Abstract

This thesis describes a new framework for evaluating spam filters: SpamTestSim. SpamTestSim is a simulator that employs a multi-agent system, where agents simulate both legitimate email users and spammers. Agents send email messages to each other, and also reply to and forward messages between each other. These messages are then fed through a real mail server running a spam filter, with the goal of identifying weaknesses in the filter. This thesis also focuses on two important sub-topics of the overall simulator: the construction of the social network of agents, and a new approach to gathering non-spam text for the agents to use in the simulation. Finally, SpamTestSim is compared to another well-known spam filter evaluation framework, and a selection of those results is presented here. One notable experiment demonstrates the advantage of using SpamTestSim over the other framework in uncovering weaknesses in two different spam filters.

Acknowledgements

I would like to thank my supervisor, Dr. John Aycock, for his guidance and his continual patience throughout my studies, especially during the writing phase. I would also like to thank my friends at Lyryx for their suggestions on the small but important writing technicalities, as well as their moral support during my studies. Finally, I'd like to thank Chris Walpole for his encouragement and insights about school and life in general.

Table of Contents

Approval Page
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Code Listings
1 Introduction
    1.1 Definitions
    1.2 Motivation
    1.3 Multi-Agent Systems
    1.4 Thesis Outline
2 Related Work
    2.1 Overview
    2.2 Current Evaluation Techniques
    2.3 Distributed Testing
    2.4 Testing Theory
    2.5 Alternative Evaluation Techniques
    2.6 Summary
3 SpamTestSim
    3.1 Simulation Phases
    3.2 Underlying Infrastructure
    3.3 Agents
        3.3.1 BaseAgent
        3.3.2 SpamAgent
        3.3.3 MailingListAgent
        3.3.4 HamAgent
    3.4 Summary
4 Generating Social Networks
    4.1 Basic Social Network Generation
    4.2 Social Networks Over Time Zones
    4.3 The multiTzOrg Algorithm
    4.4 The indepOrg Algorithm
    4.5 Study Of The Scale-Free Property Of indepOrg And multiTzOrg
    4.6 Anomaly In The Basic Algorithm
    4.7 Summary
5 Building A Ham Corpus From Usenet
    5.1 Experiments
        5.1.1 Custom Newsgroup List
        5.1.2 Top 100 Newsgroup List
    5.2 Limitations
    5.3 Summary
6 Results
    6.1 TREC 2006 Spam Evaluation Kit
    6.2 SpamTestSim: Base Configuration
    6.3 SpamTestSim: Sending More Unsolicited Messages
    6.4 SpamTestSim: Replying More Often
    6.5 Summary
7 Conclusion
    7.1 Future Work
    7.2 Summary
References
A Example Of Agent Configuration Parameters
    A.1 Per-Class Parameters Only
    A.2 Per-Instance Parameters Using A Regular Expression
    A.3 Per-Instance Parameters Using A Python Lambda Function
    A.4 Interaction Of Per-Instance Parameters
    A.5 Summary
B Usenet Newsgroup Listings
    B.1 Custom Newsgroup List
    B.2 Top 100 Newsgroup List

List of Tables

5.1 DSPAM Classification Of Custom Newsgroup List
5.2 Manual Classification Of Replies
5.3 Manual Classification Of Non-Replies
5.4 DSPAM Classification Of Top 100 Newsgroup List
6.1 Results of TREC 2006 Spam Evaluation Kit Against Spam Filters
6.2 Results of Running SpamTestSim Against Spam Filters
6.3 Results of Running SpamTestSim With More Unsolicited Messages
6.4 Results of Running SpamTestSim With More Replies

List of Figures

1.1 Public Interface Of A Spam Filter
2.1 Spam Filter Evaluation Framework
3.1 Agent Class Diagram
3.2 Example Of Per-Class Parameters For HamAgent
3.3 Example Of Per-Instance Parameters
3.4 Example Of Input And Output For The Thunderbird 2.0 Mail Client
4.1 Social Network Generated By multiTzOrg
4.2 Social Network Generated By indepOrg
4.3 Plots Of Degree Versus Frequency Of Nodes In Generated Social Networks
5.1 Simple Usenet Reply Message
6.1 SpamTestSim Base Configuration
6.2 SpamTestSim Configuration With More Unsolicited Messages
6.3 SpamTestSim Configuration With More Replies
A.1 Example Social Network
A.2 Example SpamTestSim Configuration

List of Code Listings

3.1 Simulation Pseudo-Code
4.1 B-A Scale-Free Network Generation Algorithm
4.2 multiTzOrg Social Network Generation Algorithm
4.3 indepOrg Social Network Generation Algorithm
4.4 Improved B-A Scale-Free Network Generation Algorithm

Chapter 1

Introduction

Spam is a complex issue that affects everyone with an email account, whether they realize it or not. Cisco Systems Inc. estimated that spam made up between 85% and 90% of all email received in August and September 2008[7], or in raw numbers, on the order of 100 to 200 billion messages.¹ While many individual users are annoyed by the extraneous messages that regularly appear in their inboxes, spam has greater effects on email and the Internet in general than mere annoyance. With such a high percentage of spam messages being received by mail servers around the world, any organization that operates a mail server wastes a large amount of computing resources such as CPU time, memory, and disk space. This is costly, to say the least, but even worse is the time wasted by system administrators and individual users who must wade through the messages that their spam filter is unable to classify correctly.

¹ Statistics about spam are difficult to calculate globally; however, companies like Cisco have sensors spread across the Internet, providing a reasonable estimate for the purposes of this discussion.

Spam is used as a vector for computer crimes, including the spread of computer viruses and worms, identity theft, and fraud. An early form of fraud widely seen in spam is advance fee fraud[55], carried over from the pre-Internet era, where victims are encouraged to pay the fraudster a small fee with the promise of receiving a large amount of money in the future. While this is the basic description, advance fee fraud has taken many different forms over the years[33]. Spam is also a common vehicle for phishing[56], where fraudsters trick victims into giving out personal information such as passwords for various online services, bank account details, or Social Security numbers. These are only a couple of types of particularly harmful spam, among the many more messages selling (sometimes illegal) goods and the many other schemes conducted using spam.

The most widely used defense against spam is to deploy spam filters on mail servers and in email clients. An ideal spam filter would prevent all of that junk from reaching the user, while never losing an email that the user wants to see. Somewhere in the vast sea of noise are important email messages relating to business transactions, where every lost email can cost hundreds or thousands of dollars. Unfortunately, no spam filter is perfect, and so the battle between those who send spam and those who protect the users and computers continues, with economic losses estimated in the billions of dollars[16].

It is important to be able to evaluate spam filters and to have a useful comparison of the accuracy and performance of the various spam filters available. Thorough evaluations help users decide which filter to use for their situation, help researchers focus their efforts, and provide a benchmark for new ideas that are developed.

This thesis discusses the importance of evaluating spam filters and the limitations of current tools, and introduces a new spam filter evaluation framework called SpamTestSim. SpamTestSim is designed to evaluate spam filter accuracy and performance by simulating certain human behaviours. SpamTestSim is a multi-agent system, where the agents simulate individual email users, spammers, and other automated email systems. The simulation focuses on emulating behaviours exhibited by humans with the goal of exposing weaknesses in spam filters that result in