Early Detection of Spam-Related Activity

EARLY DETECTION OF SPAM-RELATED ACTIVITY A Thesis Presented to The Academic Faculty by Shuang Hao In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Computer Science Georgia Institute of Technology December 2014 Copyright c 2014 by Shuang Hao EARLY DETECTION OF SPAM-RELATED ACTIVITY Approved by: Professor Nick Feamster, Advisor Professor Mostafa Ammar School of Computer Science School of Computer Science Georgia Institute of Technology Georgia Institute of Technology Professor Vern Paxson Dr. Christian Kreibich Department of Electrical Engineering Senior Research Scientist and Computer Sciences International Computer Science Institute University of California, Berkeley & In- & Lastline, Inc. ternational Computer Science Institute Professor Wenke Lee Date Approved: 22 October 2014 School of Computer Science Georgia Institute of Technology To my dear family, Thank you for all of your love, support and encouragements. iii ACKNOWLEDGEMENTS This thesis would not have been completed without the invaluable support, guidance, and inspiration I have received from many people during my Ph.D. study. My sincere gratitude goes to my advisor, Prof. Nick Feamster, for his caring guidance and consistent support. He led me on this exciting and fruitful journey, encouraged me to explore research topics, and provided me with both insightful advice and intellectual help. His boundless enthusiasm always inspires my passion for our work. He taught me indispens- able research skills including picking problems, crafting solutions, presenting to audiences, and cultivating research taste. I feel extremely fortunate to have had the opportunity to work with him closely for many years. The example he set up as a wonderful researcher and teacher will continue to light my way in my future career. I am deeply grateful to Prof. Vern Paxson for mentoring me during my summer internships at ICSI. His valuable knowledge and constructive suggestions helped me to better address the research problems, and guided me to learn to think more critically and creatively as a scientist. The detailed suggestions and comments that he gave on earlier versions of this dissertation greatly helped me to improve my work. I would like to thank the members of my committee, Prof. Wenke Lee, Prof. Mostafa Ammar, and Dr. Christian Kreibich, for their interest and help in my work. Prof. Wenke Lee has offered insightful suggestions through my study at Georgia Tech. Prof. Mostafa Ammar gave thought-provoking feedback on my thesis. Dr. Christian Kreibich helped me to understand the details of spam campaigns, and provided great advice during my internships at ICSI. I also wish to express my gratitude to all coauthors and collaborators for their support and assitance, which makes the research much stronger and solid. I have also been very fortunate to be a member of an excellent networking research group at Georgia Tech. I wish to thank Anirudh Ramachandran, Mukarram bin Tariq, Murtaza Motiwala, Valas Valancius, Sam Burnett, Srikanth Sundaresan, Robert Lychev, Hyojoon iv Kim, Yogesh Mundada, Bilal Anwer, Maria Konte, Ben Jones, Sean Donovan, Sarthak Grover, and Mi Seon Park for their valuable discussions and feedback on the research, and thank Cong Shi, Samantha Lo, Nan Hua, Tongqing Qiu, Partha Kanuparthy, and Ahmed Mansy for getting the lab life enjoyable and enriching. Their friendship and company has made my years at Georgia Tech full of wonderful memories. I dedicate this dissertation to my family. My parents, my brother Yang, and my grand- parents have always been supportive, loving, and encouraging. I am deeply indebted to them. v TABLE OF CONTENTS DEDICATION ...................................... iii ACKNOWLEDGEMENTS .............................. iv LIST OF TABLES ................................... x LIST OF FIGURES .................................. xi SUMMARY ........................................ xiii I INTRODUCTION ................................. 1 1.1 Spam: A Widespread Security Threat . .1 1.2 Detecting Spam-Related Activity: Why Is It Hard? . .4 1.3 Approach Overview . .5 1.4 Contributions . .6 1.5 Lessons Learned . .9 1.6 Bibliographic Notes . 10 II BACKGROUND AND RELATED WORK ................. 11 2.1 The Evolution and Components of Spam . 11 2.2 Detection Methods . 13 2.2.1 Content-based Detection . 13 2.2.2 Sender-based Detection . 14 2.2.3 DNS-based Detection . 14 2.2.4 Web-based Detection . 16 2.2.5 Other Mitigation Methods and Effort . 16 III SNARE: FILTERING SPAM WITH NETWORK-LEVEL FEATURES 18 3.1 Introduction . 18 3.2 Background . 21 3.2.1 Email Sender Reputation Systems . 21 3.2.2 Data and Deployment Scenario . 22 3.2.3 Supervised Learning: RuleFit . 24 3.3 Network-level Features . 26 3.3.1 Single-Packet Features . 27 vi 3.3.2 Single-Header and Single-Message Features . 33 3.3.3 Aggregate Features . 35 3.4 Evaluating the Reputation Engine . 36 3.4.1 Setup . 36 3.4.2 Accuracy of Reputation Engine . 38 3.4.3 Other Considerations . 40 3.5 A Spam-Filtering System . 44 3.5.1 System Overview . 44 3.5.2 Evaluation . 46 3.6 Discussion and Limitations . 48 3.6.1 Evasion-Resistance and Robustness . 48 3.6.2 Other Limitations . 50 3.7 Summary . 51 IV MONITORING THE INITIAL DNS LOOKUPS AND HOSTING OF SPAMMER DOMAINS ............................. 52 4.1 Introduction . 52 4.2 Context and Data Collection . 54 4.2.1 DNS Resource Records and Lookups . 54 4.2.2 Data Collection . 55 4.3 Registration & Resource Records . 57 4.3.1 Time Between Registration and Attack . 57 4.3.2 Location of DNS Infrastructure . 58 4.4 Early Lookup Behavior . 62 4.4.1 Network-Wide Patterns . 63 4.4.2 Evolution of Lookup Traffic . 64 4.5 Summary . 65 V UNDERSTANDING THE DOMAIN REGISTRATION BEHAVIOR OF SPAMMERS .................................... 67 5.1 Introduction . 67 5.2 Background: DNS Registration Process and Life Cycle . 69 5.3 Data Collection . 71 vii 5.4 Longevity of Spammer Domains . 73 5.4.1 Age of Domains Used for Spamming . 73 5.4.2 Duration-of-Use in Spam Campaigns . 74 5.4.3 Lifetime of Recently Registered Domains . 75 5.5 Spam Domain Infrastructure . 77 5.5.1 Registrars Used for Spammer Domains . 77 5.5.2 Authoritative Nameservers . 79 5.6 Detecting Registration Spikes . 82 5.6.1 Bulk Registrations by Spammers . 82 5.6.2 Detecting Abnormal Registration Batches . 84 5.6.3 Refining Threshold Probabilities . 88 5.7 Domain Registration Patterns . 89 5.7.1 Domain Categories . 90 5.7.2 Prevalence of Registration Patterns . 90 5.7.3 Retread Registration Patterns . 92 5.7.4 Naming Patterns for Brand-New Domains . 94 5.8 Summary . 95 VI PREDATOR: PROACTIVE DETECTION OF SPAMMER DOMAINS AT TIME-OF-REGISTRATION ........................ 97 6.1 Introduction . 97 6.2 Architecture . 99 6.2.1 Design Goals . 99 6.2.2 Operation . 100 6.3 Feature Extraction . 101 6.3.1 Case Study of Spammer Domain Registrations . 101 6.3.2 Domain Profile Features . 104 6.3.3 Life Cycle Features . 108 6.3.4 Batch Correlation Features . 109 6.4 Classifier Design . 111 6.4.1 Supervised Learning: CPM . 111 6.4.2 Building Detection Models . 112 viii 6.4.3 Assessing Feature Importance . 113 6.5 Evaluation . 114 6.5.1 Data Set and Labels . 114 6.5.2 Detection Accuracy . 115 6.5.3 Comparison to Existing Blacklists . 119 6.5.4 Feature Ranking . 124 6.6 Discussion . 125 6.6.1 Actionable Policies . 125 6.6.2 Evasion Resistance . 126 6.7 Summary . 127 VII CONCLUDING REMARKS .......................... 128 7.1 Summary of Contributions . 128 7.2 Future Work . 130 REFERENCES ..................................... 132 ix LIST OF TABLES 1 Description of data used from the McAfee dataset. 22 2 SNARE performance using RuleFit....................... 37 3 Ranking of feature importance in SNARE.................... 41 4 Mutual information among features in SNARE................. 42 5 DNZA format examples. 55 6 Top three ASes containing domains' records. 61 7 Five largest clusters based on lookup networks. 63 8 Summary of data feeds in the domain registration analysis. 71 9 Monthly data statistics of .com domain registrations. 71 10 The 10 registrars with the greatest number of spammer domains. 77 11 Top nameservers hosting spammer domains. 81 12 Epoch examples from registrar Moniker. 103 13 Domain examples from registrar Moniker. 104 14 Summary of.

Early Detection of Spam-Related Activity

Search Engine Optimization: a Survey of Current Best Practices

Click Trajectories: End-To-End Analysis of the Spam Value Chain

Prospects, Leads, and Subscribers

Fully Automatic Link Spam Detection∗ Work in Progress

Clique-Attacks Detection in Web Search Engine for Spamdexing Using K-Clique Percolation Technique

Passive Monitoring of DNS Anomalies Bojan Zdrnja1, Nevil Brownlee1, and Duane Wessels2

Sidekick Suspicious Domain Classification in the .Nl Zone

Proactive Cyberfraud Detection Through Infrastructure Analysis

I Wish to Thank the United States Department of Commerce's

Understanding and Combating Link Farming in the Twitter Social Network

Fast-Flux Botnet Detection Based on Traffic Response and Search Engines Credit Worthiness

Outcomes Report of the GNSO Ad Hoc Group on Domain Tasting