Mitigating Spam Using Network-Level Features
MITIGATING SPAM USING NETWORK-LEVEL FEATURES

A Thesis Presented to The Academic Faculty
by Anirudh V. Ramachandran

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in Computer Science

School of Computer Science
Georgia Institute of Technology
August 2011

Approved by:

Professor Nick Feamster, Advisor, School of Computer Science, Georgia Institute of Technology
Professor Patrick Traynor, School of Computer Science, Georgia Institute of Technology
Dr. Anirban Dasgupta, Research Scientist, Yahoo! Research
Professor Kilian Weinberger, Department of Computer Science and Engineering, Washington University in St. Louis
Professor Wenke Lee, School of Computer Science, Georgia Institute of Technology

Date Approved: 20 May 2011

To mom and dad for their love, support, and encouragement.

ACKNOWLEDGEMENTS

Most of the good work that my name appears on is directly due to my advisor, Prof. Nick Feamster. Nick is one of those few professors bold enough to offer funding to a green first-year PhD with no experience in networking or security, solely on the basis of a 7001 mini-project, all before he even started at Tech. I am glad he took that chance. Nick taught me almost everything I know about picking problems, crafting clever solutions, and, just as importantly, presenting it well. I’m also grateful for his suggestions, insights, and much-needed nudges in the right direction, as well as his informal advising style (hurrah for open-ended 3 AM meetings!). Despite his brilliance, Nick is also one of the hardest-working people I know, and I strive to emulate his work ethic.

I would also like to thank the rest of my committee. Prof. Wenke Lee’s courses were enlightening, and he has offered precise and constructive suggestions whenever I have had the fortune to bounce my ideas off of him. Prof.
Patrick Traynor gave excellent comments on my drafts, and his “Destructive Research” course was one of the most interesting courses I have taken at Tech: his choice of readings showed how and why many crucial real-world security mechanisms are fundamentally broken. My external committee members, Dr. Anirban Dasgupta and Prof. Kilian Weinberger, hosted me at Yahoo! Research during my most enjoyable summer internship. They tolerated my limited knowledge of theory and made that summer a great experience: Kilian provided great Machine Learning advice and first suggested that I try clustering on voting data from Yahoo! Mail (which became the key idea behind Chapter 6). Anirban helped me explore the various nooks and crannies of the behemoth that is Yahoo! Mail, and provided solutions, both theory and code, the many times that I got stuck.

Several other people have directly influenced my research. I’m grateful to the hosts of my previous internships and have learned a great deal from each: Prof. Scott Shenker of UC Berkeley for hosting me during summer 2006, and Dr. Balachander Krishnamurthy, Dr. Kobus van der Merwe, and Dr. Oliver Spatscheck of AT&T Research for summer 2007. I’m also lucky to have worked with Prof. Santosh Vempala; Santosh’s keen intellect and unquenchable enthusiasm for applying theoretical ideas to practical problems were awe-inspiring. I also thank my other collaborators over the years for all the discussions and insights: Kaushik Bhandankar, Atish Das Sarma, David Dagon, Shuang Hao, Hitesh Khandelwal, Yogesh Mundada, Srini Seetharaman, and Mukarram Tariq. Though not everything resulted in publications, their ideas and thoughts have shaped my research.
The research in this dissertation would not have been interesting without the underlying data itself being interesting; I thank David Mazières for his spam trap data, Suresh Ramasubramanian for mail data from Outblaze, David Dagon for traces of botnet activity and DNSBL lookups, and Xiaopeng Xi and Jay Pujara of the Yahoo! anti-spam team for their mail logs.

My years at Tech would have been a lot less fun had it not been for the company of several people outside work. Thanks to my roommates over the years (Atish, Murtaza, Srikanth, and Yogesh) for the company and for putting up with me. I owe my sanity to labmates and local friends: Ahmed, Amogh, Atish, Bilal, Ilias, Manish, Millo, Mukarram, Partha, Ravi & Yarong, Ruomei, Saamer, Sai, and Srini. Amogh dutifully listened to my music recommendations though he probably hated most of them. Atish and I had some great conversations and I really enjoyed swapping crazy theories and startup ideas with him. Srini is generally awesome and has saved my skin so many times. Mukarram is one of the smartest people I have met and has been a constant source of inspiration. Of late, Ilias has provided excellent company, both for playing music and as a concert buddy. Thanks are also due to my mates from Narmada Hostel and the 6th wing for all the good times and memories.

My parents have been pillars of support all through my PhD, providing constant encouragement, advice, news from home, and a clearer view of my own priorities. I dedicate this dissertation to them. My sister Inku and brother-in-law Pramu have provided encouragement and advice from not-too-far-away, and have helped me unwind after stressful periods so many times these past 6 years. Grandma has always been supportive and loving though I forget to call her for months on end. I am indebted to all of them.

BIBLIOGRAPHIC NOTES

Chapter 3 extends work from our paper “Understanding the Network-level Behavior of Spammers” [103] by adding newer longitudinal analyses.
Chapter 4 contains material from the paper “Revealing Botnet Membership Using DNSBL Counter-Intelligence” [107]. Chapter 5 contains material from the paper “Filtering Spam with Behavioral Blacklisting” [108]. Chapter 6 contains material from the paper “Spam or Ham? Characterizing and Detecting Fraudulent Not Spam Reports in Webmail Systems” [102] that is under submission. Chapter 7 contains some material from the unpublished work “Spotting Spammers with a Dynamic Reputation System” [109] as well as new material.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
PREFACE
LIST OF TABLES
LIST OF FIGURES
SUMMARY

1 INTRODUCTION
  1.1 Spam Filtering: Techniques and Challenges
    1.1.1 Types of Spam Filters
    1.1.2 Problems with Spam Filtering
  1.2 Thesis Statement
  1.3 Contributions
  1.4 Roadmap

2 BACKGROUND AND RELATED WORK
  2.1 Spamming Methods
    2.1.1 Direct Spamming
    2.1.2 Open Relays and Proxies
    2.1.3 Spam Using Unorthodox BGP Route Announcements
  2.2 Spam from Botnets
    2.2.1 Direct Spamming to the Recipient
    2.2.2 Spamming via Webmail Services
  2.3 Mitigation Techniques
    2.3.1 Content-based Filtering
    2.3.2 IP Blacklists
    2.3.3 Other Filtering Methods
    2.3.4 Analysis of the Economics of Spam
  2.4 Related Work in Network-level Spam Filtering
    2.4.1 Characterization of Network-level Properties
    2.4.2 Network-level Spam Mitigation Techniques
  2.5 Where This Dissertation Fits in

3 CHARACTERIZING THE NETWORK-LEVEL BEHAVIOR OF SPAMMERS
  3.1 Introduction
  3.2 Data Collection
    3.2.1 Spam Email Traces
    3.2.2 Legitimate Email Traces
    3.2.3 Botnet Command and Control Data
    3.2.4 BGP Routing Measurements
  3.3 Network-level Characteristics of Spammers
    3.3.1 Distribution Across Networks
    3.3.2 Distribution Across ASes and Countries
    3.3.3 Sending Patterns of Spammers
    3.3.4 The Effectiveness of Blacklists
    3.3.5 Message Size Distribution of Spam
  3.4 Spam from Botnets
    3.4.1 Bobax Topology
    3.4.2 Operating Systems of Spamming Hosts
    3.4.3 Spamming Bot Activity Profile
  3.5 Spam from Short-lived BGP Announcements
    3.5.1 BGP Spectrum Agility
    3.5.2 Prevalence of BGP Spectrum Agility
    3.5.3 How Much Spam from Spectrum Agility?
  3.6 Lessons for Better Spam Mitigation
  3.7 Summary

4 IDENTIFYING BOTNETS THAT PERFORM IP BLACKLIST RECONNAISSANCE
  4.1 Introduction
  4.2 Model of Reconnaissance Techniques
    4.2.1 Properties of Reconnaissance Queries
    4.2.2 Reconnaissance Techniques
  4.3 Data and Analysis
    4.3.1 Data Collection and Processing
    4.3.2 Analysis and Detection
  4.4 Preliminary Results
  4.5 Countermeasures
  4.6 Summary

5 FILTERING SPAM USING BEHAVIORAL BLACKLISTING
  5.1 Introduction
  5.2 Motivation
    5.2.1 The Behavior of Spamming IP Addresses
    5.2.2 The Performance of IP-Based Blacklists
    5.2.3 The Case for Behavioral Blacklisting
  5.3 Clustering Algorithm
    5.3.1 Spectral Clustering
    5.3.2 SpamTracker: Clustering Email Senders
  5.4 Design
    5.4.1