Clustering Spam Domains and Hosts: Anti-Spam Forensics with Data Mining
CLUSTERING SPAM DOMAINS AND HOSTS: ANTI-SPAM FORENSICS WITH DATA MINING

by

CHUN WEI

ALAN P. SPRAGUE, COMMITTEE CHAIR
ANTHONY SKJELLUM
CHENGCUI ZHANG
KENT R. KERLEY
RANDAL VAUGHN

A DISSERTATION

Submitted to the graduate faculty of The University of Alabama at Birmingham, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

BIRMINGHAM, ALABAMA

2010

Copyright by Chun Wei 2010

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

CHAPTER

1 INTRODUCTION
  1.1 Current Spam Trend
  1.2 Protective Mechanisms of Spammers
    1.2.1 Word Obfuscation
    1.2.2 Botnet
    1.2.3 Spam Hosting Infrastructure
    1.2.4 Fast-Flux Service Networks
  1.3 Research Problem, Goal and Impact

2 LITERATURE REVIEW
  2.1 Anti-Spam Research
    2.1.1 Spam Filtering
    2.1.2 Message Obfuscation
    2.1.3 Research on Botnet Detection
    2.1.4 Research on URLs and Spam Hosts
    2.1.5 Scam vs. Spam Campaign
  2.2 Research on Data Clustering
    2.2.1 Linkage Based Clustering
    2.2.2 Connected Components
    2.2.3 Research on Data Streams

3 HIERARCHICAL CLUSTERING
  3.1 Attribute Extraction
  3.2 Clustering Methods
    3.2.1 Agglomerative Hierarchical Clustering Based on Common Attributes
    3.2.2 Connected Components with Weighted Edges
  3.3 Experimental Results
    3.3.1 Data Collection
    3.3.2 Results of Agglomerative Hierarchical Clustering
    3.3.3 Validation of Results
    3.3.4 Results of Weighted Edges
  3.4 Discussion

4 FUZZY STRING MATCHING
  4.1 String Similarity
    4.1.1 Inverse Levenshtein Distance
    4.1.2 String Similarity
  4.2 Subject Similarity
    4.2.1 Subject Similarity Score Based on Partial Token Matching
    4.2.2 Adjustable Similarity Score Based on Subject Length
  4.3 Subject Clustering Algorithms
    4.3.1 Simple Algorithm
    4.3.2 Recursive Seed Selection Algorithm
  4.4 Experimental Results

5 CLUSTERING SPAM DOMAINS
  5.1 Retrieval of Spam Domain Data
    5.1.1 Wildcard DNS Record
    5.1.2 Retrieval of Hosting IP Addresses
  5.2 Daily Clustering Methods
    5.2.1 Hosting IP Similarity between Two Domains
    5.2.2 Subject Similarity between Two Domains
    5.2.3 Overall Similarity between Two Domains
    5.2.4 Bi-connected Component Algorithm
    5.2.5 Labeling Emails Based on Domain Clusters
  5.3 Day to Day Clustering Method
    5.3.1 Similarity between Two Clusters
    5.3.2 Linking Two Clusters
  5.4 Experimental Results
    5.4.1 Daily Clustering Results
    5.4.2 Tracing Clusters over the Experiment Period of Time
  5.5 Discussion

6 TRACKING CLUSTERS USING HISTORICAL DATA
  6.1 Historical Cluster Repository
  6.2 Experiment on IP Tracing
    6.2.1 Canadian Pharmacy Scam
    6.2.2 Ultimate Replica Watches Scam
    6.2.3 Tracing a Phishing Campaign
    6.2.4 Other Scams and IP Addresses
  6.3 Discussion

7 CONCLUSION AND FUTURE WORK
  7.1 Benefits and Impact
    7.1.1 Improving Domain Black Listing
    7.1.2 Forensic Applications
    7.1.3 Contributions to Data Mining
  7.2 Future Work

LIST OF REFERENCES

APPENDIX
  A Spam Database Description
  B Recursive Seed Selection Algorithm (Pseudo Code)
  C Bi-connected Component Algorithm