ADVANCES IN ONLINE LEARNING-BASED SPAM FILTERING

A dissertation

submitted by

D. Sculley, M.Ed., M.S.

In partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

TUFTS UNIVERSITY

August 2008

ADVISER: Carla E. Brodley

Acknowledgments

I would like to take this opportunity to thank my advisor Carla Brodley for her patient guidance, my parents David and Paula Sculley for their support and encouragement, and my bride Jessica Evans for making everything worth doing. I gratefully acknowledge Rediff.com for funding the writing of this dissertation.

D. Sculley

TUFTS UNIVERSITY

August 2008

Abstract

The low cost of digital communication has given rise to the problem of spam, which is unwanted, harmful, or abusive electronic content. In this thesis, we present several advances in the application of online machine learning methods for automatically filtering spam. We detail a sliding-window variant of Support Vector Machines that yields state-of-the-art results for the standard online filtering task. We explore a variety of feature representations for spam data. We reduce human labeling cost through the use of efficient online active learning variants. We give practical solutions to the one-sided feedback scenario, in which users only give labeling feedback on messages predicted to be non-spam. We investigate the impact of class label noise on machine learning-based spam filters, showing that previous benchmark evaluations rewarded filters prone to overfitting in real-world settings, and proposing several modifications for combating these negative effects. Finally, we investigate the performance of these filtering methods on the more challenging task of abuse filtering in blog comments. Together, these contributions enable more accurate spam filters to be deployed in real-world settings, with greater robustness to noise, lower computation cost, and lower human labeling cost.

Contents

Acknowledgments

List of Tables

List of Figures

Chapter 1 Introduction
1.1 Idealized Online Filtering
1.2 Online Learners for the Idealized Scenario
1.3 Contributions: Beyond Idealized Online Filtering
1.3.1 Online Filtering with Reduced Human Effort
1.3.2 Online Filtering with One-Sided Feedback
1.3.3 Online Filtering with Noisy Feedback
1.3.4 Online Filtering with Feedback from Diverse Users
1.4 Defining Spam
1.4.1 Conflicting Definitions
1.4.2 Scope and Scale
1.5 Machine Learning Problems
1.6 Overview of Dissertation

Chapter 2 Online Filtering Methods
2.1 Notation
2.2 Feature Mappings
2.2.1 Hand-crafted features
2.2.2 Word Based Features
2.2.3 k-mer Features
2.2.4 Wildcard and Gappy Features
2.2.5 Normalization
2.2.6 Message Truncation
2.2.7 Semi-structured Data
2.3 Online Machine Learning Algorithms for Online Spam Filtering
2.3.1 Naive Bayes Variants
2.3.2 Compression-Based Methods
2.3.3 Perceptron Variants
2.3.4 Logistic Regression
2.3.5 Ensemble Methods
2.4 Experimental Comparisons
2.4.1 TREC Spam Filtering Methodology
2.4.2 Data Sets
2.4.3 Parameter Tuning
2.4.4 The (1-ROCA)% Evaluation Measure
2.4.5 Comparison Results

Chapter 3 Online Filtering with Support Vector Machine Variants
3.1 An Anti-Spam Controversy
3.1.1 Contributions
3.2 Spam and Online SVMs
3.2.1 Background: SVMs
3.2.2 Online SVMs
3.2.3 Tuning the Regularization Parameter, C
3.2.4 Email Spam and Online SVMs
3.2.5 Computational Cost
3.3 Relaxed Online SVMs (ROSVM)
3.3.1 Reducing Problem Size
3.3.2 Reducing Number of Updates
3.3.3 Reducing Iterations
3.4 Experiments
3.4.1 ROSVM Tests
3.4.2 Online SVMs and ROSVM
3.4.3 Results
3.5 ROSVMs at the TREC 2007 Spam Filtering Competition
3.5.1 Parameter Settings
3.5.2 Experimental Results
3.6 Discussion

Chapter 4 Online Active Learning Methods for Spam Filtering
4.1 Re-Thinking Active Learning for Spam Filtering
4.2 Related Work
4.2.1 Pool-based Active Learning
4.2.2 Online Active Learning
4.2.3 Semi-Supervised Learning and Spam Filtering
4.3 Online Active Learning Methods
4.3.1 Label Efficient b-Sampling
4.3.2 Logistic Margin Sampling
4.3.3 Fixed Margin Sampling
4.3.4 Baselines
4.4 Experiments
4.4.1 Data Sets
4.4.2 Classification Performance
4.4.3 Comparing Online and Pool-Based Active Learning
4.4.4 Online Sampling Rates
4.4.5 Online Active Learning at the TREC 2007 Spam Filtering Competition
4.5 Conclusions

Chapter 5 Online Filtering with One-Sided Feedback
5.1 The One-Sided Feedback Scenario
5.2 Contributions
5.3 Preliminaries and Background
5.3.1 Breaking Classical Learners
5.3.2 An Apple Tasting Solution
5.3.3 Improving on Apple Tasting
5.4 Label Efficient Online Learning
5.5 Margin-Based Learners
5.5.1 Two Margin-Based Learners
5.5.2 Margin-Based Pushes and Pulls
5.5.3 Margins, One-Sided Feedback, and Active Learning
5.5.4 Exploring and Exploiting
5.5.5 Pathological Distributions
5.6 Minority Class Problems
5.7 Experiments
5.8 Conclusions

Chapter 6 Online Filtering with Noisy Feedback
6.1 Noise in the Labels
6.1.1 Causes of Noise
6.1.2 Contributions
6.2 Related Work
6.2.1 Label Noise in Email Spam
6.2.2 Avoiding Overfitting
6.3 Label Noise Hurts Aggressive Filters
6.3.1 Evaluation
6.3.2 Data Sets with Synthetic Noise
6.3.3 Filters
6.3.4 Initial Results
6.4 Filtering without Overfitting
6.4.1 Tuning Learning Rates
6.4.2 Regularization
6.4.3 Label Cleaning
6.4.4 Label Correcting
6.5 Experiments
6.5.1 Synthetic Label Noise
6.5.2 Natural Label Noise
6.6 Discussion

Chapter 7 Online Filtering with Feedback from Diverse Users
7.1 Blog Comment Filtering
7.1.1 User Flags and Community Standards
7.1.2 Contributions
7.2 Related Work
7.2.1 Blog Comment Abuse Filtering
7.2.2 Splog Detection
7.2.3 Comparisons to Email Spam Filtering
7.3 The msgboard1 Data Set
7.3.1 Noise and User Flags
7.3.2 Patterns of Abuse
7.3.3 Understanding User Flags
7.4 Online Filtering Methods with Class-Specific Costs
7.4.1 Feature Sets
7.4.2 Alternatives
7.5 Experiments
7.5.1 Experimental Design
7.5.2 Parameter Tuning
7.5.3 Results Using User-Flags for Evaluation
7.5.4 Filtering Thresholds
7.5.5 Global Versus Per-Topic Filtering
7.6 Gold-Standard Evaluation
7.6.1 Constructing a Gold Standard Set
7.6.2 Gold Standard Results
7.6.3 Filters Versus User Flags
7.6.4 Filters Versus Dedicated Adjudicators
7.7 Discussion
7.7.1 Two-Stage Filtering
7.7.2 Feedback to and from Users
7.7.3 Individual Thresholds

Chapter 8 Conclusions
8.1 How Can We Benefit from Unlabeled Data?
8.2 How Can We Attain Better User Feedback?
8.3 How Can the Academic Research Community Gain Access to Larger-Scale, Real-World Benchmark Data Sets?

Bibliography

List of Tables

2.1 Comparison results for methods on trec05p-1 data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parentheses.
2.2 Comparison results for methods on trec06p data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parentheses.
2.3 Comparison results for methods on trec07p data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parentheses.
3.1 Results for Email Spam filtering with Online SVM on benchmark data sets. Score reported is (1-ROCA)%, where 0 is optimal. These results are directly comparable to those on the same data sets with other filters, reported in Chapter 2.
3.2 Execution time for Online SVMs with email spam detection, in CPU seconds. These times do not include the time spent mapping strings to feature vectors. The number of examples in each data set is given in the last row as corpus size.
3.3 Email Spam Benchmark Data. These results compare Online SVM and ROSVM on email spam detection, using binary 4-mer feature space. Score reported is (1-ROCA)%, where 0 is optimal.
3.4 Results for ROSVMs and comparison methods at the TREC 2007 Spam Filtering track. Score reported is (1-ROCA)%, where 0 is optimal, with 0.95 confidence intervals in parentheses.
5.1 Results for Email Spam filtering. We report F1 score, Recall, Precision, number of False Spams (lost ham) and number of False Hams (spam in inbox) with one-sided feedback. We report results with full feedback for comparison.
6.1 Results for prior methods on trec06p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result for a given noise level, or confidence interval overlapping with confidence interval of best result.
6.2 Results for prior methods on trec07p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result for a given noise level, or confidence interval overlapping with confidence interval of best result. Methods unable to complete a given task are marked with dnf.
6.3 Results for modified methods on trec06p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result, or confidence interval overlapping with confidence interval of best result.
6.4 Results for modified methods on trec07p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result, or confidence interval overlapping with confidence interval of best result.
6.5 Results for natural and synthetic noise at identical noise levels. Natural label noise for trec05p-1 was uniformly sampled from human labelings collected by the spamorham.org project. Results are reported as (1-ROCA)%, with 0.95 confidence intervals.
7.1 Summary statistics for the msgboard1 corpus of blog comments, broken out by topic.
7.2 Selected words with high information gain, for flagged and non-flagged comments. Obscenities and references to specific religious figures have been removed from the flagged list for display, and stop words have been removed from the non-flagged list.
7.3 ROCA results of topic-specific versus global filtering. Generative methods benefit from topic-specific filtering, while discriminative methods are not significantly harmed by global filtering.
7.4 Summary statistics for the gold standard evaluation set. Adjudication and correction rates vary widely by topic. The news topic, in particular, required extensive adjudication of religious and racial comments.
7.5 Results for F1 Measure, Gold Standard Evaluation. F1 Measure is computed using precision and recall, where an abusive comment is considered a positive example. For all filters, the F1 measure was computed at the precision-recall break-even point.

List of Figures

1.1 Idealized online filtering scenario.
2.1 Obscured Text. These are the 25 most common variants of the word 'Viagra' found in the 2005 TREC spam data set, illustrating the problem of word obfuscation.
2.2 Pseudo-code for Classical Perceptron update rule and classification function.
2.3 Pseudo-code for Perceptron with Margins update rule; note that the classification function is the same as for classical Perceptron.
2.4 Logistic Regression employs a logistic loss function for positive and negative examples, which punishes mistakes made with high confidence more heavily than mistakes made with low confidence.
3.1 Visualizing SVM Classification. An SVM learns a hyperplane that separates the positive and negative data examples with the maximum possible margin. Error terms ξ_i > 0 are given for examples on the wrong side of their respective margin.
3.2 Pseudo-code for Online SVM.
3.3 Tuning the Regularization Parameter C. Tests were conducted with Online SMO, using binary feature vectors, on the spamassassin data set of 6034 examples. Graph plots C versus Area under the ROC curve.
3.4 Visualizing the effect of C. Hyperplane A maximizes the margin while accepting a small amount of training error. This corresponds to setting C to a low value. Hyperplane B accepts a smaller margin in order to reduce training error. This corresponds to setting C to a high value. Content-based spam filtering appears to do best with high values of C.
3.5 Pseudo-code for Relaxed Online SVM.
3.6 Reduced Size Tests.
3.7 Reduced Iterations Tests.
3.8 Reduced Updates Tests.
4.1 Online Active Learning.
4.2 Label Efficient b-Sampling Probabilities.
4.3 Logistic Margin Sampling Probabilities.
4.4 Online Active Learning using Perceptron with Margins, on trec05p-1 data.
4.5 Online Active Learning using Logistic Regression, on trec05p-1 data.
4.6 Online Active Learning using ROSVM, on trec05p-1 data.
4.7 Online Active Learning using Perceptron with Margins, on trec06p data.
4.8 Online Active Learning using Logistic Regression, on trec06p data.
4.9 Online Active Learning using ROSVM, on trec06p data.
4.10 Online Active Learning using Perceptron with Margins, on trec07p data.
4.11 Online Active Learning using Logistic Regression, on trec07p data.
4.12 Comparing Pool-based and Online Active Learning on trec06p.
4.13 Perceptron with Margins, sampling rate over time, trec05p-1.
4.14 Screen shot of proposed user interface for active requests for label feedback. In this framework, the user would be encouraged to label a small number of informative messages.
5.1 Spam Filtering with One-Sided Feedback.
5.2 One-Sided Feedback Breaks Perceptron. Here, white dots are ham examples, the black dots are spam, the dashed line is the prediction hyperplane, and the shaded area predicts spam. Examples 1, 2, and 3 each cause no updates: 1 and 3 are correct, and no feedback is given on 2. Examples 4 and 30 are the only examples causing updates, ratcheting the hyperplane until no hams are correctly identified.
5.3 Pseudo-code for Label Efficient active learner.
5.4 Margin-Based Pushes and Pulls. Examples 1, 2, and 3 cause no updates, as before. But Examples 4 and 25, each correctly classified but within the margins, push the hyperplane towards the spam. Example 15, a misclassified spam, pulls the hyperplane towards the ham.
5.5 Implicit Uncertainty Sampling for Perceptron with Margins. The margin-based learner with hypothesis h and margins m+ and m−, learning from one-sided feedback, reduces to an active learner with hypothesis h′ and margins m+ and h, using uncertainty sampling in the region between h and h′.
5.6 Exploration and Exploitation. If the initial hypothesis is h, then examples 1 and 2 cause margin updates pushing h_e out towards m−, but not beyond it unless an example is found to lie between h and h_e.
5.7 Pathological Distributions for One-Sided Feedback.
6.1 Results for varying learning rate η for Logistic Regression, on spamassassin tuning data with varying levels of synthetic uniform label noise. For clarity, the order of results is consistent between legend and figure.
6.2 Results for varying C in ROSVM for regularization, on spamassassin tuning data with varying levels of synthetically added label noise. For clarity, the order of results is consistent between legend and figure.
6.3 Pseudo-code for Online Label Cleaning.
6.4 Pseudo-code for Online Label Correcting.
7.1 Flag rates over time for the most popular blog. The spikes indicate periods of high amounts of flagging, often caused by abusive flame wars among users. Graphs for other blogs show similar patterns.
7.2 ROCA results for User-Flag evaluation (top) and Gold Standard Evaluation (bottom). Legend is the same for both graphs.
7.3 ROC Curves using User Flag Evaluation (top) and Gold Standard Evaluation (bottom), for all-test.
7.4 Screen shot of the blog comment rating tool used by adjudicators.

Chapter 1

Introduction

The task of email spam filtering – automatically removing unwanted, harmful, or offensive email messages before they are delivered to a user – is an important, large-scale application area for machine learning methods [37]. However, to date there has been a rift between academic researchers and industrial practitioners in spam filtering research. Academic evaluations, such as those conducted at the TREC spam filtering tracks [24, 16, 18], have reported that near-perfect filtering results may be obtained with a variety of machine learning methods [15]. Yet these methods and performance levels are not reflected in real-world practice.

This dissertation suggests that the rift between the academic and industrial spam filtering communities is rooted in an overly optimistic evaluation scenario used by academic researchers, which we refer to as the idealized online filtering scenario [15]. This idealized scenario assumes that a machine learning-based filter will be given perfectly accurate label feedback for every message that the filter encounters. In practice, this label feedback is provided only sporadically by users, and is far from perfectly accurate. Thus, practitioners have tended to discount the performance claims made by academics as being difficult or impossible to replicate in real-world settings.

This dissertation shows that high levels of spam filtering performance are achievable in settings that are more realistic than the idealized scenario. These settings include online scenarios in which users are willing to label only a subset of messages, in which users are willing to label predicted ham (messages that are not spam are referred to as ham) but not predicted spam, in which users give erroneous feedback, or in which diverse users disagree on what is spam and what is ham.

In the remainder of this introductory chapter, we review the idealized online filtering scenario in detail, and discuss machine learning methods that have performed well in this setting. We detail the modified filtering scenarios that form the bulk of this dissertation. We discuss the scope and scale of the spam filtering problem, and define the term “spam” for the purposes of this dissertation. Finally, we explain why these problems are of general interest to the machine learning community, as well as to the spam filtering community.

1.1 Idealized Online Filtering

Before exploring modified filtering scenarios, it is first necessary to understand the idealized online filtering scenario that has been traditionally used to evaluate learning-based filters [24, 16, 18, 15].

The idealized filtering scenario is depicted in Figure 1.1. Messages are assumed to arrive in streaming fashion, one message at a time. For each message, the learning-based filter makes a prediction of spam or ham. A human user then observes the message and the predicted label, and delivers feedback to the learning-based filter, providing it with the true label of that message. The filter is then able to use this label feedback to update its internal model, ideally improving future predictive performance.

Figure 1.1: Idealized online filtering scenario. (Diagram: an incoming message stream enters the learning-based filter, which predicts spam or ham; the user returns label feedback to the filter.)

This idealized scenario is useful for three reasons. First, it is a reasonable first approximation of the setting in which spam filters are deployed in real-world settings, where messages do, indeed, arrive in streaming fashion and users do provide label feedback. Second, it is a natural adaptation of the online learning scenario that has been well studied in machine learning [65]. Thus, a range of existing machine learning algorithms may be applied to this task. Third, this scenario makes clear that the stream of messages may be unbounded. This emphasizes the need for solutions allowing updates that are efficient in both computation and memory requirements, and that may adjust to changing patterns in the data over time.

1.2 Online Learners for the Idealized Scenario

The idealized online filtering scenario has enabled the development and empirical evaluation of a wide range of machine learning methods for spam filtering. These methods are briefly discussed here, and are reviewed in detail in Chapters 2 and 3.

The use of machine learning methods for online filtering dates back at least as far as 2002, when Paul Graham proposed using a variant of the Naive Bayes classifier for filtering email spam [39, 40]. Since then, a number of machine learning methods have been proposed for filtering, including several additional variants of the Naive Bayes classifier [63], as well as random decision-tree forests [72], support vector machines (SVMs) [34, 51, 72, 20, 84], logistic regression [38, 19], compression-based methods [6], and ensemble methods [62].

Our contributions in this regard have included the application of the Perceptron Algorithm with Margins [52, 48, 86], and the development of a fast, online SVM variant called Relaxed Online SVM (ROSVM) for spam filtering [84]. ROSVM has given top-level performance on several filtering tasks at TREC 2007 [18].

Despite this variety of possible methods, a recent survey of open-source spam filtering methods revealed that only Naive Bayes variants are commonly used in real-world filters [15]. Naive Bayes variants belong to the generative family of classifiers, which model the underlying processes generating the two classes of data (spam and ham) as an intermediate step [66]. The other main family of classifiers is discriminative; these classifiers seek to find boundaries between classes in the data space, without modeling the functions that produce these classes [66]. It has been found that the generative family performs better than the discriminative family when training data is scarce. However, when training data is plentiful, discriminative methods achieve lower asymptotic error [68]. For spam filtering systems collecting large amounts of training data, it seems reasonable to conjecture that discriminative methods would be superior in this setting. Indeed, in recent experiments, logistic regression and ROSVMs – both discriminative methods – have outperformed all known Bayesian competitors [38, 19, 84].

Of equal importance to the particular learning method is the feature representation chosen to represent this semi-structured email data. The majority of filters in the literature have relied on some form of word-based features [93].

However, spammers have developed attacks specifically designed to defeat word-based models, including tokenization and obfuscation attacks [93] that produce the now-familiar character-level modifications often seen in email spam messages. An exception is the family of compression-based filtering methods [6], which essentially rely on short character substrings [82]. We extend work in this area by proposing and testing the use of a variety of feature representations for spam data. Some of these feature mappings were originally developed in the field of computational biology, where character-level mutations are common [86]. Top-level performers from the most recent TREC evaluation used the binary 4-mer feature space we proposed for spam filtering [19, 85].

1.3 Contributions: Beyond Idealized Online Filtering

The user feedback in the idealized scenario, depicted by the dashed gray line in Figure 1.1, may be far from perfect. In this dissertation, we examine several ways that this assumption of perfect feedback may be modified to better reflect the needs and behaviors of real human users. First, in the idealized scenario, human users are required to perform significant amounts of hand labeling. This would ideally be reduced to require only a small fraction of examples to be labeled [79]. Second, in many settings users never give feedback for any messages predicted to be spam [80]. Third, users may give mistaken or even maliciously inaccurate feedback [83]. Fourth, when many users view the same message, there may be significant disagreement about its “true” label [81]. Each of these observations motivates a contribution of this dissertation, as detailed in the remainder of this section.

1.3.1 Online Filtering with Reduced Human Effort

Examining the idealized online filtering scenario, one obvious problem is the assumption that humans will give feedback for every message in the message stream [15]. There are several obvious flaws with this assumption, most notably the fact that requiring this effort from users reduces much of the benefit the filter is meant to provide. Gordon Cormack, among others, has proposed a scheme that allows users to report only errors made by filters [17], but even this requires users to scan every message. Industry experts have disclosed that real users label only a fraction of the total messages presented to them.

The active learning paradigm is a machine learning approach to reducing the labeling effort required by humans [14]. Such methods can reduce labeling effort dramatically, without significant reduction in classification performance. Although the idea of using active learning is a natural fit for reducing human labeling effort in the spam filtering domain, prior applications of active learning in this setting have been both computationally expensive and have harmed classification performance [16]. This is because these methods have used a pool-based approach, in which the active learner selects a number of examples from a large pool of unlabeled examples, iterating through many rounds. Cost is incurred as each example in the pool is considered many times. (Segal et al. have proposed an active learning variant which reduces this re-assessment cost, but still requires every example in the pool to be labeled before active learning can commence [87].) Additionally, many methods for pool-based active learning are prone to selecting redundant examples in such a setting, reducing the benefit of the human labeling effort.

In this dissertation, we propose and evaluate the use of online active learning methods for spam filtering [79]. Online active learners do not rely on a pool of unlabeled examples, but consider each unlabeled message as it arrives. For each message, the learner not only predicts ham or spam, but also determines whether or not to request a label from a human. We test several prior online active learning methods, showing that the required human labeling effort may be significantly reduced with little reduction in classification performance and negligible additional computational cost. Furthermore, our simple and novel fixed-margin sampling method gave the best results across a majority of data sets and base learning methods. Our proposal of online active learning as the natural form of active learning for spam filtering was adopted by the TREC 2007 spam filtering track [18].

1.3.2 Online Filtering with One-Sided Feedback

A second possible problem with the idealized online filtering scenario is the assumption that users will give feedback on both messages predicted to be ham and those predicted to be spam. There are a number of real-world settings in which users will never give feedback for predicted spam messages. For example, some systems remove predicted spam messages before they are shown to the user. In other cases, non-expert users may not know how to view predicted spam. Other users may simply choose never to view predicted spam messages. In all of these cases, feedback will only be given for predicted ham examples.

This scenario, which we refer to as the one-sided feedback scenario, was first examined by Helmbold et al., who called it the apple tasting problem, wherein labels were only provided for predicted positive examples [43]. They showed that one-sided feedback would break several online-learning algorithms such as Perceptron and Winnow, and gave a solution to this problem that involved sampling from the predicted negatives in a uniformly random manner at a rate determined by the past performance of the model.

We confirm that in the spam filtering domain, mistake-driven learners such as Perceptron are indeed broken by one-sided feedback. We apply the apple-tasting solution, and propose additional variants of online active learning to deal with this problem. However, we find the surprising result that margin-based learners such as the Perceptron Algorithm with Margins and ROSVMs are able to filter effectively in this setting without modification [80], as they implicitly perform a variant of fixed-margin uncertainty sampling on predicted spam messages.

1.3.3 Online Filtering with Noisy Feedback

A further issue with the idealized online filtering scenario is the assumption that user feedback will be accurate. In reality, users may give feedback that is mistaken, or even maliciously inaccurate [83]. Thus, the feedback may contain class label noise, wherein training data is incorrectly labeled. Indeed, John Graham-Cumming's spamorham.org project found that human labeling error rates approached 10% [41], and industry experts have cited a 3% error rate from users [96]. The errors included in these figures are objective errors, on the order of a spam email being reported as ham. Clearly, real-world spam filters must be robust to class label noise, but this is not considered in the idealized scenario.

There are several machine learning methods for dealing with class label noise, including various forms of regularization [78], as well as methods for cleaning [7, 71] or correcting training examples suspected to be mislabeled [98]. However, previous filtering methods, such as top performers from TREC spam filtering competitions, do not employ any such measures. Indeed, the current “folk wisdom” in the spam filtering community is that methods such as regularization only hurt filtering performance [17].

We show that even low levels of uniform class label noise harm or even break top-performing filtering methods from TREC. We explore several methods of ameliorating the effects of class label noise, finding that uniform label noise can be successfully handled with a variety of different methods. However, non-uniform label noise, as is produced by real users, remains a more difficult challenge.

1.3.4 Online Filtering with Feedback from Diverse Users

We extend our investigation of noisy feedback by considering cases where label feedback is inconsistent. The idealized online filtering scenario implicitly assumes that the user feedback is consistent – that is, that there is a single, objectively true label applied to each message. However, if many different users view the same message, it is possible that they will have differing perceptions of the “true” label for that message. This scenario is reflected in the domain of blog comment filtering.

In a blog, or internet journal, the blogger periodically posts entries. The readers of that blog then may post comments about the entry, which are added to the blog as a form of community discussion. Such postings may contain abuse, such as obscenities, personal attacks, or remarks degrading a particular race, religion, or nationality. However, because each comment may be read by many different users, there may be different viewpoints as to which comments are abusive and which are not. Indeed, our recent study of blog comments found pairwise inter-annotator agreement to be below 75% among the three volunteer human adjudicators.

This environment is an extreme challenge for filters developed for the online filtering scenario. We explore the ability of such filters to perform in this challenging environment of filtering blog comment abuse, finding that regularization and the use of class-dependent misclassification costs both give improvements for filtering methods in this domain [81]. Additionally, we suggest methods for improving the use of feedback from diverse users [81].

1.4 Defining Spam

As the primary focus of this dissertation is on detecting and filtering content-based email spam, it is worth taking the time to define this term. Although the concept of spam as unwanted mass email has entered the general lexicon [1], the term remains slightly ambiguous owing to several different conflicting usages.

1.4.1 Conflicting Definitions

In 1998, Cranor and LaMacchia defined spam as “unsolicited bulk email” [26]. This is a strict, unambiguous definition; however, in practice it is overly narrow. There may be messages that a user has “solicited” by neglecting to un-check a box when using a web form for a purchase, for example. More importantly, unwanted, offensive, or harmful messages that are not sent in “bulk” still negatively impact the user. Such messages should ideally be filtered regardless of whether or not they are sent as part of a larger campaign.

A broader definition harkens back to Peter Denning's 1982 ACM President's Letter, titled “Electronic Junk” [32]. Defining spam as electronic junk, or perhaps more specifically as unwanted or harmful electronic messages [84], encompasses a wide range of user needs. However, these terms themselves are subjective, and different users may have different perceptions of what is junk, unwanted, or harmful. Thus, this second definition gains increased coverage, but loses the ability to be objectively applied.

A definition of spam which may be objectively applied is anything marked as spam by a user. However, this form of definition by example has limited ability to generalize – it essentially requires that every possible message be labeled by a given user.

For the purposes of this dissertation, we rely on the definition of spam used by the experts who provided the gold-standard judgments of spam and ham in the TREC spam filtering benchmark data sets, using a boot-strapping process [23].

These experts used the following definition [23] of spam:

Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient.

Thus, these gold-standard labels attempt to provide an objective, consistent labeling as far as possible. In Chapters 6 and 7, we examine the impact of using actual user feedback labels for training, rather than gold-standard labels.

1.4.2 Scope and Scale

In practical settings, spam filtering is a large-scale problem: recent estimates show that as much as 80% of all email traffic is spam, amounting to billions of messages per day worldwide [37]. This level of spam email creates cost in network capacity and storage, to say nothing of the human cost involved in sorting through unfiltered spam.

Furthermore, email spam is prototypical of the family of content-based spam, wherein the goal of the spammer is to deliver a spam message that will be understood by a human user, for possible commercial or social gain. Other forms of content-based spam include blog comment spam, which involves spam comments posted to message boards on electronic journals (blogs) [81, 64], as well as SMS spam sent by text-messaging services [21] on cellphones and IM spam sent by instant-messaging services. Because of the similarity of these domains, it is reasonable to conjecture that advances in email spam filtering may be applied in these other domains as well. This dissertation also includes a chapter on filtering blog comment spam, highlighting both similarities and differences of this domain.

1.5 Machine Learning Problems

We believe that developing effective, efficient spam filtering will provide unambiguous social benefit. Yet aside from issues of serving society, we are also motivated in this work by the important machine-learning challenges inherent in this task, which make results from spam filtering applicable to other domains ranging from general text classification to computational biology to optimizing online advertising systems.

This domain typically involves semi-structured data. For example, an email message includes not only textual data, but also header information with routing and meta-information, and may additionally include images, attachments, and links.

Furthermore, the textual data may contain obfuscations that spammers employ in an attempt to avoid detection [93]. Feature representations for this domain typically result in high dimensionality, in which there may be many relevant features.

In real-world settings, filtering is a large-scale task that may involve billions of messages per day, placing a premium on efficient, scalable solutions that update in real-time, or near real-time. Finally, such systems rely on non-expert humans to provide label feedback, and must be robust to a variety of imperfections in the training data. Thus, advances in this application area not only have practical benefit in the domain of spam filtering, but may also be of use in other large-scale, time-sensitive domains involving semi-structured, high-dimensional, noisy data.

1.6 Overview of Dissertation

The remainder of this dissertation is organized as follows. Chapter 2 reviews prior feature mappings and machine learning algorithms used for online spam filtering, including variants of Naive Bayes, compression methods, Perceptron variants, Logistic Regression, and ensemble methods, and provides an empirical comparison of these approaches. Chapter 3 introduces our novel ROSVM algorithm, an efficient SVM variant for streaming data suited to the online filtering task. Chapter 4 explores the use of online active learning for spam filtering. The problem of one-sided feedback is presented in Chapter 5, and issues of noisy feedback are discussed in Chapter 6. Chapter 7 shows the ability of online learning-based filters to perform on the blog comment abuse filtering task, which involves feedback from diverse users. Our conclusions and plans for future work appear in the final chapter.

Chapter 2

Online Filtering Methods

A wide range of machine learning methods have been applied to the online filtering scenario. In this chapter, we review the most successful of these methods, including variants of the Naive Bayes classifier, compression-based methods, Perceptron variants, Logistic Regression, and ensemble methods. We also describe several different feature mappings that have been used to transform semi-structured email data, containing text, header information, and possible attachments, into feature vectors usable by machine learning methods. Thus, this chapter serves as a review of background and related work.

This chapter is divided into four sections. The first reviews basic notation used throughout this dissertation. The second reviews methods for feature mapping, the pre-processing step necessary to convert semi-structured email data into numerical feature vectors used by most machine learning methods. The third section reviews online machine learning algorithms for online filtering, and the final section of this chapter includes an experimental comparison of these various methodologies.

This chapter does not review Support Vector Machines or variants; these are presented in the next chapter of this dissertation, with experimental comparisons to the methods detailed in this chapter.

2.1 Notation

Before describing online filtering methods, it is first necessary to outline the notation used in this dissertation.

In the online filtering scenario, the filter is shown one message (or example) at a time, in a time-ordered sequence where i is the current time step. Each message is represented by a feature vector x_i ∈ X, where X ⊆ R^n, with an associated label y_i ∈ Y, where Y ⊆ {−1, +1} for ham and spam messages, respectively. For now, we will assume that this label is ground truth, and represents the “true” classification of the message. In Chapters 6 and 7, we will explore the case where the label may be noisy. In these later chapters, the example's “true” label is y′_i, and it is not necessarily the case that y_i = y′_i.

For each example the filter is asked to predict ham or spam using a function f(x_i), with the label hidden. This function may make use of a weight vector w ∈ R^n, which maintains a set of weights, where each weight is associated with a particular dimension in the feature space.

Once the prediction is made, the example's label y_i is revealed to the filter, which may then update its prediction model f(·) as needed using (x_i, y_i). This update is often performed by changing the weights stored in w.

The sparsity of an example vector x_i is given by s, and represents the number of non-zero values in the vector. (Note that this usage of the term “sparsity” is slightly different than the common English usage of this word.) An individual feature in X is referred to by X_j. The value of a particular feature X_j in a specific feature vector x_i is given by x_ij.
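To make this protocol concrete, the following minimal sketch expresses the online loop in Python. The filter_model object (with predict and update methods) and the message_stream of (x_i, y_i) pairs are hypothetical placeholders for whatever learner and corpus are in use, not part of any particular filter described in this dissertation.

    def run_online_filter(filter_model, message_stream):
        """Idealized online filtering: predict with the label hidden, then update."""
        mistakes = 0
        for x_i, y_i in message_stream:             # y_i in {-1, +1}: ham or spam
            prediction = filter_model.predict(x_i)  # label is hidden at prediction time
            if prediction != y_i:
                mistakes += 1
            filter_model.update(x_i, y_i)           # label feedback revealed; adjust weights w
        return mistakes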

2.2 Feature Mappings

As noted in the previous section, each message is represented by a feature vector x ∈ R^n. That is, a set of numerical features is extracted from the message, and these numerical scores are stored in a vector. This process is called feature mapping, and allows generic machine learning methods to operate on semi-structured email data. The choice of feature mapping is an important one. Experience in machine learning has shown that choosing an appropriate feature mapping can have more impact on the quality of results than the choice of learning method. In this section, we review several possible feature mappings and discuss the strengths of each.

(Those familiar with kernel methods will note that string kernels exist that can allow kernel-based learning methods to operate directly on string-based data, without an intermediate feature mapping step [78]. However, the increased classification cost of such methods makes them impractical for this large-scale filtering setting. Indeed, recent work in computational biology has shown that explicit feature mapping for string-based genomic data reduces computational cost in comparison to the use of string kernels in practical settings [89]. In this dissertation, we consider only explicit feature mappings in order to maintain scalability.)

2.2.1 Hand-crafted features

One possibility is to hand-craft specific features that are uniquely suited to the spam filtering task. Several systems employ this approach, including one major industrial spam filtering system as well as the open-source filter SpamAssassin [2].

This approach requires experts to identify features within messages that may help to distinguish spam from ham. For example, the hand-crafted features used by SpamAssassin version 3.2 include the following [2]:

• Subject contains ``As Seen''
• Subject starts with dollar amount
• Offers an alert about a stock
• Money back guarantee
• Message talks about a replica watch
• Uses a numeric IP address in URL
• Phrase: extra inches
• Phrase: L0an

In SpamAssassin version 3.2, there are 748 of these hand-crafted features, which is a relatively low number of features. The human effort in hand-crafting these domain-specific features results in a focused feature space in which there are few irrelevant features, ensuring that computation and storage costs are kept to a minimum.

However, this approach has significant drawbacks. First, the human effort required in crafting appropriate features is substantial, and it is often difficult for humans to guess which characteristics of a message are most informative [39]. Second, such features are not robust to changing circumstances, and may be easily attacked by intelligent spammers [93]. Third, such features are language-specific, requiring the entire feature space to be reformulated for each new language.
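To make the hand-crafted approach concrete, the sketch below evaluates a handful of such rules as binary features. The regular expressions are illustrative stand-ins for rules like those listed above, not the actual SpamAssassin rule set.

    import re

    # Illustrative hand-crafted rules (hypothetical regexes, not SpamAssassin's own).
    HAND_CRAFTED_RULES = {
        "subject_contains_as_seen": re.compile(r"^subject:.*as seen", re.I | re.M),
        "subject_starts_with_dollar": re.compile(r"^subject:\s*\$", re.I | re.M),
        "numeric_ip_in_url": re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}"),
        "phrase_extra_inches": re.compile(r"extra\s+inches", re.I),
    }

    def hand_crafted_features(raw_message):
        """Return a binary (0/1) score for each hand-crafted rule that fires."""
        return {name: int(bool(rule.search(raw_message)))
                for name, rule in HAND_CRAFTED_RULES.items()}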

2.2.2 Word Based Features

An alternative to hand-crafting a small set of focused features is to employ a wider set of generic features, with the hope that this will capture more patterns in the data with less human effort.

One simple feature space is the word-based feature space. This is constructed as follows. Define a “word” as a contiguous substring of non-whitespace characters [38]. If we consider only words of a maximum finite length, then there are n possible words, and we can construct a feature space R^n with n dimensions, each indexed by a unique word. To map a message M to a feature vector x ∈ R^n, assign a score for each dimension in the vector based on the number of times that the indexing word appears in M.

There are several possible scoring schemes. The count-based method assigns the raw number of occurrences of a given word in the message as its score. The TF-IDF scoring method [75] has been used in information retrieval to weight rare (and presumably more informative) words more heavily. However, several tests (including our own) have found that the simple binary scoring method is most effective for spam filtering tasks [34, 63]. In this system, a score of 1 indicates that a word occurs in the message, and a score of 0 indicates that it does not.

Note that although the feature space may be as large as the total number of possible words, typical feature vectors will be sparse, containing a relatively small number of non-zero values. Thus, sparse vector data structures allow for efficient storage of these vectors. These may be implemented as linked lists or hash tables containing index-value pairs. In the binary case, it is particularly efficient to store sparse vectors as arrays containing non-zero index values, which may be sorted for efficient computation of inner products.
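The sketch below illustrates the binary word-based mapping and the sorted-index-array representation just described. The whitespace tokenizer and the growing vocabulary dictionary are simplifying assumptions for illustration; a deployed filter might instead hash tokens directly to dimensions.

    def binary_word_features(message, vocab):
        """Map a message to a sorted list of dimension indices (binary word scores)."""
        indices = set()
        for word in message.split():                # a "word" is a whitespace-delimited token
            j = vocab.setdefault(word, len(vocab))  # assign a new dimension to unseen words
            indices.add(j)
        return sorted(indices)                      # sorted, for fast inner products

    def sparse_binary_dot(a, b):
        """Inner product of two binary vectors stored as sorted index arrays."""
        i = j = total = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                total += 1
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return total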

To our knowledge, the word-based feature space was first employed by Graham [39], using a Naive Bayes variant. This work highlighted the benefit of a wide-reaching feature space: his filter found the “word” ff0000, the HTML code for the color red, to be one of the most informative indicators of spaminess [39]. Such features are quickly identified by machine learning methods, but are far from obvious to humans attempting to hand-craft features. As an implicit admission of the limited utility of hand-crafted features, the SpamAssassin team has recognized these problems, and includes binary features that encode the output of a Naive Bayes classifier using a word-based feature space [2].

Figure 2.1: Obscured Text. These are the 25 most common variants of the word ‘Viagra’ found in the 2005 TREC spam data set, illustrating the problem of word obfuscation: Viagra VIAGRA Viiagrra viagra visagra Vi@gra Viaagrra Viaggra Viagraa Viiaagra Via-ggra Viia-gra V1AAGRRA Viiagra Via-gra Vigraa Viagra viagra Viagrra V&Igra VIAgra V|agra Viaaggra vaigra V’iagra

Although word-based features are an improvement over hand-crafted features, they are still subject to attack. Spammers routinely attempt to defeat word-based filters using techniques such as intentional misspellings, character substitutions, and insertions of white-space, all of which can pose problems for word-based filters [93].

2.2.3 k-mer Features

As noted above, word-based features are subject to attack by spammers using word-obfuscation methods [93], which include intentional misspellings, character substitutions, and insertions of white-space. For example, Figure 2.1 shows the 25 most common obfuscations of the word viagra in the TREC 2005 public corpus of email spam, trec05p-1 [86]. There were hundreds of other obfuscations for this word alone. While a case could be made that a word-based filter would eventually encounter all possible obfuscations of a given word, the combinatorics of obfuscation render this possibility impractical. Instead, we suggest that feature mappings that allow for inexact string matching provide a practical alternative [86].

One such feature mapping employs a feature space using overlapping k-mers, which are contiguous substrings of k symbols [60, 53]. For example, the 4-mers of the string ababbacb are:

abab babb abba bbac bacb

This feature space was originally designed for use on genomic data, in which genomic sequences are encoded as strings [60, 53]. As with spam data, character-level substitutions, insertions, and deletions are common in this setting, and the use of overlapping k-mers provides a measure of robustness to these variations. As with the word-based features, several scoring methods are possible, including count-based scores and TF-IDF scoring. However, our tests have found that binary scoring is most effective [84].

This feature space requires a unique dimension for each possible unique k-mer. Thus, the dimensionality of this space is |Σ|^k, where |Σ| is the size of the alphabet of available symbols, and the value of each dimension in the space corresponds to the score associated with a particular k-mer. In email and spam classification tasks, which may include attachments, the available alphabet of symbols is quite large, consisting of all 256 possible single-byte characters. However, sparse data structures may also be employed here in similar fashion to those suggested for the word-based feature space discussed above.

The first use of k-mers in spam detection was by Hershkop and Stolfo, who tested a spam filter using the cosine similarity measure between k-mer vectors in conjunction with a centroid-based variant of the Nearest Neighbor classifier [44]. Also, note that k-mers are sometimes referred to as character-level n-grams. We choose to use the term k-mers to avoid confusion with word-level n-grams, which are commonly employed and discussed in the information retrieval literature.

For those familiar with kernel methods, note that although k-mers may be employed in conjunction with string kernels [53, 54], we follow the recommendation of Sonnenburg [89] and represent k-mer features in explicit sparse feature vectors.

Our tests have shown that a setting of k = 4 is often optimal for spam classification; thus the resulting feature space can still be stored explicitly, and doing so increases computational efficiency for classification.
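A minimal sketch of the binary k-mer mapping follows; the function name is ours, and the mapping simply collects the set of overlapping length-k substrings, so the 4-mers of ababbacb come out exactly as listed above.

    def binary_kmer_features(text, k=4):
        """Return the set of overlapping k-mers of a string (binary scoring)."""
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    # binary_kmer_features("ababbacb", 4) == {"abab", "babb", "abba", "bbac", "bacb"}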

2.2.4 Wildcard and Gappy Features

In computational biology, it has been found that k-mers alone are not expressive enough to give optimal classification performance on genomic data. In general, a k-mer feature space may have reduced efficacy on substrings in which at least two character substitutions, insertions, or deletions occur no more than k positions apart. Computational biologists have developed extended forms of inexact string matching features to address this issue. These include wildcard features and gappy features [55], which allow k-mers to match even if there have been a small (specified) number of insertions, deletions, or character substitutions.

We applied variants of these modifications in our TREC 2006 Spam Filtering competition entries [86], in order to test the effectiveness of added flexibility in inexact string matching. Our subsequent tests (a subset of which are included in the results section of this chapter) showed that the simple binary k-mer feature space gave optimal results – the added flexibility of wildcards and gaps did not give added benefit in the spam filtering domain. For completeness, we describe these feature mappings in this section. As with the word-based and k-mer features above, these feature mappings may be performed explicitly using sparse vector data structures for computational efficiency [86].

Wildcards. The (k, w) wildcard mapping maps each k-mer to a set of k-mers in which up to w “wildcard” characters replace the characters in the original k-mer [55]. For example, the (3, 1) wildcard mapping of the k-mer abc is the set {abc, *bc, a*c, ab*}. The wildcard character is a special symbol that is allowed to match with any other character. Naturally, allowing wildcards increases computational cost. However, in our testing with spam data, we have found that a fixed wildcard variant gives results equivalent in strength to those of the standard wildcard kernel [86]. A fixed (k, p) wildcard mapping maps a given k-mer to a set of two k-mers: the original, and a k-mer with a wildcard character at position p. Note that the first position in a string is position 0. Thus, the fixed (3, 1) wildcard mapping of abc is {abc, a*c}. This fixed mapping gives more flexibility to the k-mer feature space, but only increases computational cost by a constant factor of two.

Gaps. The (g, k) gappy mapping (where g ≤ k) allows g-mers to match with k-mers by inserting k − g gaps into the g-mer [55]. Note that this is equivalent to allowing k-mers to match with g-mers by deleting up to k − g characters from the k-mer. Thus, the (2, 3) gappy mapping of the string acbd includes positive values for features indexed by {acb, cbd, ab, cb, cd, bd}. As with the wildcard mappings, we reduce computational cost with a fixed (k, p) gappy variant, in which a k-mer is mapped to a set of k-mers: the original k-mer, and a k-mer in which the character at position p has been removed [86].
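The fixed variants are simple enough to sketch directly. The helper names below are ours, and positions are 0-indexed as in the text; the wildcard example reproduces the fixed (3, 1) mapping of abc given above.

    def fixed_wildcard(kmer, p):
        """Fixed (k, p) wildcard mapping: the original k-mer plus a copy with '*' at position p."""
        return {kmer, kmer[:p] + "*" + kmer[p + 1:]}

    def fixed_gappy(kmer, p):
        """Fixed (k, p) gappy mapping: the original k-mer plus a copy with position p deleted."""
        return {kmer, kmer[:p] + kmer[p + 1:]}

    # fixed_wildcard("abc", 1) == {"abc", "a*c"}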

22 2.2.5 Normalization

Email messages are of varying lengths, causing feature vectors mapped from these messages to be of varying magnitudes. This can cause problems for some machine learning methods, especially those that compute an inner product <w, x_i>. In such cases, the learner may make predictions with exceptionally low confidence for short messages, even when those messages are clearly spam or ham.

One standard approach from information retrieval and machine learning that reduces the impact of message length is to normalize the feature vectors representing the messages. The method employed in this dissertation is to normalize using the Euclidean norm (also known as the L2 norm) of the feature vectors. That is:

x_i-normalized = x_i / √(<x_i, x_i>)
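As a brief illustration of this formula: with binary scoring, a feature vector with s non-zero entries satisfies <x_i, x_i> = s, so normalization replaces each non-zero value with 1/√s. A message with 100 distinct binary features is mapped to a vector whose non-zero entries are each 0.1, while a message with 25 features receives entries of 0.2.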

2.2.6 Message Truncation

In 2006, Cormack first noted that the use of message truncation [17] led to improved filtering results as well as increased computational efficiency. In message truncation, the email message is truncated after n characters, including any header information and attachments. Typical values of n in message truncation have been 2500 [19] or 3000 [84]. This tends to emphasize information contained in the headers, including routing information, sender and recipient information, and the subject line. Computational cost is reduced in cases where the message is otherwise very long: some emails contain several megabytes of data. Furthermore, truncation provides a measure of implicit normalization, as truncated messages do not vary as greatly in length as un-truncated messages. Finally, truncation provides a measure of resistance against the good word attack, in which objectively “good” words are stuffed into the end of a spam email in an attempt to defeat learning-based spam filters [61].
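As a minimal sketch, truncation is a single slice over the raw message text (headers first), with n as described above.

    def truncate_message(raw_message, n=3000):
        """Keep only the first n characters of the raw message, headers included."""
        return raw_message[:n]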

2.2.7 Semi-structured Data

As noted in Chapter 1, email data is best described as semi-structured data. That is, email data contains information in different forms such as header information, message text, and attachments that may include images or other files. Furthermore, the message text may include obfuscations such as intentional misspellings, character substitutions, and the like.

Some attempts have been made to exploit the structure in emails for more effective feature mappings. These have included hand-crafted rules discussed above, and also more general strategies such as treating words or features drawn from the message header differently than words or features drawn from the message body [38].

Interestingly, the 4-mer feature space with message truncation has out-performed all such attempts. One could argue that message truncation implicitly takes some structure into account: it clearly places emphasis on the message header and the early parts of the message body. Yet it is striking that apparently the most effective method for dealing with this semi-structured information is primarily to ignore the structure.

2.3 Online Machine Learning Algorithms for Online Spam Filtering

The previous section reviewed several feature mapping methods that transform labeled email data into feature vectors for training machine learning methods. In theory, any standard supervised machine learning method could be applied to such data. Yet the practical application of online spam filtering methods has strict requirements that are influenced by the scale of contemporary email usage. This scale renders many supervised machine learning methods impractical.

We identify five requirements that a given machine learning method must satisfy to be appropriate for the spam filtering domain. These are:

• Classification Performance. The first goal of any filter is to filter effectively. We assess classification performance with the (1-ROCA)% measure, which we use as our primary evaluation measure in this dissertation. The (1-ROCA)% measure is reviewed in Section 2.4.4, along with a discussion of alternate possible metrics.

• Fast Prediction. The prediction for a given example xi should be computable with O(s) computation cost, where s is the sparsity of xi.

• Scalable Online Updates. The cost of updating the model should not depend on the amount of data in the training data set. In the online case the size of the data set increases over time, and may be effectively unbounded.

• Fast Adaptation. In the real world, spammers adapt to filters by changing the patterns of their spam attacks [93]. An effective filtering method adapts to new attacks quickly; ideally after only a single example of a new attack.

• Robustness to High Dimensionality. As described in the previous section, several natural feature representations for spam filtering are of high dimensionality. Not all learning methods perform well with high dimensional data [45].

In the remainder of this section, we review the machine learning methods that satisfy these requirements and are thus suitable for the task of online spam filtering. In doing so, we will note when a given method fails to meet any of the above requirements. Assessments of classification performance are given in the experimental section at the end of this chapter.

2.3.1 Naive Bayes Variants

Variants of the Naive Bayes classifier are among the first machine learning methods to be applied to the spam filtering problem [39, 40, 63]. We first review the general principles of the Naive Bayes classifier, and then describe several variants that have been proposed for spam filtering.

Overview of Naive Bayes. In the supervised learning methodology, we assume that there is an unknown target function f : X → Y, mapping example vectors to class labels [66]. This target may be expressed as a probability distribution P(Y | X), in which the probability of a class label depends on a distribution over the space of possible messages [66]. We seek to model this target function using labeled training data.

The Bayes rule is given by:

P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}

Estimating P(Y) from observation is straightforward and requires only a relatively small sample of training data, but estimating P(X | Y) may require considerably more data. When X ⊆ R^n, there are 2(2^n − 1) parameters to estimate [66]. This is impractical in the general case with high dimensional data.

For the case in which the features in X are conditionally independent given Y, then:

P(X \mid Y) = \prod_{j=1}^{n} P(X_j \mid Y)

When both the elements of X and the class values of Y are binary attributes, then there are now only 2n parameters to estimate, dramatically reducing the amount of training data needed from the general case, above [66].

Note that for numerical reasons, it is infeasible to compute products of many small fractions, as is required here, because these are quickly driven to zero due to round-off errors in finite precision computing. Thus, it is preferred to work with log probabilities [65] for all Naive Bayes variants.

This observation forms the basis of the Naive Bayes classifier: we (naively) assume that the elements of X are conditionally independent, and then use training data to estimate values for each P(X_j | Y) [66]. Although the assumption of conditional independence is often objectively untrue, in practice effective classifiers are still often obtained.

The training of a Naive Bayes classifier produces a model of the distribution that generates the data as an intermediate step. For this reason, these methods are said to belong to the generative family of classifiers [66].

There are several variants of Naive Bayes, which differ primarily in their assumptions about the probability distributions P(X_j | Y). Of these, the Multi-Variate Bernoulli Naive Bayes does not meet the criterion of predictions in O(s) time. Furthermore, all of the Naive Bayes variants deal poorly with high dimensional data, as shown in the experimental section.

Multi-Variate Bernoulli Naive Bayes. Perhaps the most commonly applied variant [63] is the Multi-Variate Bernoulli Naive Bayes. This variant assumes that a message is generated in the following way. There are two boxes, one marked spam and one marked ham, with each box containing a distinct biased coin for each unique feature in the feature space X. A message is generated by first flipping a biased coin to determine the class label yi of the message. Depending on the result of the class label, the appropriate box of biased coins is selected. Each coin in the box is then flipped, one per feature in X. For each coin that comes up heads, the associated feature in the feature vector xi is scored 1, and is scored 0 otherwise. This reflects the assumption that the features in messages are conditionally independent [63].

The probability of a given coin j coming up heads is P(X_j | y_i). We can estimate this probability from data using a Laplacian prior that we refer to as the Document Prior:

P(X_j \mid y_i) = \frac{1 + M_{X_j, y_i}}{2 + M_{X_j}}

Here, M_{X_j, y_i} is the number of seen messages with class label y_i for which the feature X_j = 1, and M_{X_j} is the total number of messages seen for which the feature X_j = 1.

To classify a new message x_i, we can compute the probability that either box of coins would generate the message. For a given class label y_i, this probability is:

P(x_i \mid y_i) = \prod_{j=1}^{n} P(X_j \mid y_i)^{x_{ij}} \cdot (1 - P(X_j \mid y_i))^{(1 - x_{ij})}

To classify a message as spam, we first set a threshold τ. We return a classification of spam only when:

\frac{P(y_i = +1)\,P(x_i \mid y_i = +1)}{P(y_i = +1)\,P(x_i \mid y_i = +1) + P(y_i = -1)\,P(x_i \mid y_i = -1)} > \tau

In the online scenario, updating a Naive Bayes model simply requires updating the probability estimate P(X_j | y_i) for each feature.

Although Metsis et al. found that this variant was the most widely applied Naive Bayes variant in deployed systems [63], it is undesirable for two reasons. First, classification using this method requires a computation for each feature in the feature space, and is therefore O(n) rather than O(s). In general, the dimensionality n of the feature space may be many orders of magnitude larger than the sparsity s of a typical example vector, especially when message truncation is used. Second, this variant gave the worst classification performance of all variants tested by Metsis et al. in a comparison of Naive Bayes variants for spam filtering [63].

Multinomial Naive Bayes with Boolean Attributes (Token Prior). In the multinomial model, we assume that a message is generated in a different way. We still have two boxes, one for spam and one for ham as before. However, now the boxes are filled with tokens of varying sizes, and each token has a feature from X written on it. To generate a message, we first determine the number of tokens to pull by randomly selecting d from a distribution of possible message lengths.3 We then choose a box by flipping a biased coin to determine the class label y_i of our new message. Finally, we blindly pull a total of d tokens from the box, with replacement. For each token pulled, we add the feature listed on the token to the generated message [63].

3 Note that this distribution of message lengths does not depend on the class label of the message [63]. Although this may not be a good assumption in the general case for spam filtering, it is not unreasonable when message truncation is performed.

Each feature is randomly selected from the box with probability P(X_j | y_i), which we can estimate from training data using a Laplacian prior based on counts of tokens, referred to here as the Token Prior:

P(X_j \mid y_i) = \frac{1 + T_{X_j, y_i}}{n + T_{X_j}}

where n is the total number of possible tokens, T_{X_j, y_i} is the number of times that token X_j occurs in messages of class label y_i, and T_{X_j} is the total number of times token X_j occurs in all messages. Note that in the general case, a token may occur in a message more than once. Yet Metsis et al. found that restricting these counts to binary values in the resulting feature vector x_i improved classification performance. This agrees with work using other methods that has also found binary feature values to be most effective [34, 84]. When the token counts are restricted to binary values, this variant is referred to as Multinomial Naive Bayes with Boolean Attributes [63], and it was one of the best performing Naive Bayes variants tested by Metsis et al.

The probability that a message x_i is generated by the box with class label y_i is given by [63]:

P(x_i \mid y_i) = P(y_i)\,P(d)\,d!\,\prod_{j=1}^{n} \frac{P(X_j \mid y_i)^{x_{ij}}}{x_{ij}!}

Note that when x_{ij} = 0, then P(X_j \mid y_i)^{x_{ij}} = 1, and these terms may be excluded from the product. Thus, it is only necessary to compute probabilities for those features that actually occur in the message, reducing classification cost from O(n) for the Multivariate Naive Bayes to O(s) with this variant.

Furthermore, the terms P(d), d, and x_{ij}! are constant across all possible class labels, and cancel out when computing relative likelihood. Thus, to classify a new message, select a threshold τ as before, and return spam only when [63]:

\frac{P(y = +1)\,\prod_{j=1}^{n} P(X_j \mid y = +1)^{x_{ij}}}{\sum_{y \in \{-1,+1\}} P(y)\,\prod_{j=1}^{n} P(X_j \mid y)^{x_{ij}}} > \tau
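The following sketch shows one way the Multinomial Naive Bayes with Boolean Attributes (Token Prior) could be implemented for online filtering, working with log probabilities as recommended above. The class name and interface are our own, and the smoothing of the class prior P(y) is a simplification added for illustration; it is not specified in the text above.

import math
from collections import defaultdict

class MultinomialNBBoolean:
    # Sketch of Multinomial Naive Bayes with Boolean attributes (Token Prior),
    # scored in log space. The class-prior smoothing is our own simplification.
    def __init__(self, n_features):
        self.n = n_features                      # size of the token vocabulary
        self.msg_counts = {+1: 0, -1: 0}         # messages seen per class
        self.token_counts = {+1: defaultdict(int), -1: defaultdict(int)}
        self.token_totals = defaultdict(int)     # occurrences of each token over all classes

    def update(self, features, label):
        # Online update: features is the set of tokens present in the message.
        self.msg_counts[label] += 1
        for f in features:
            self.token_counts[label][f] += 1
            self.token_totals[f] += 1

    def log_joint(self, features, label):
        total = sum(self.msg_counts.values())
        log_p = math.log((self.msg_counts[label] + 1.0) / (total + 2.0))  # smoothed P(y)
        for f in features:
            # Token Prior estimate of P(X_j | y), as given above.
            p = (1.0 + self.token_counts[label][f]) / (self.n + self.token_totals[f])
            log_p += math.log(p)
        return log_p

    def spam_score(self, features):
        # Return P(spam | x), computed from the two log-joint scores.
        ls, lh = self.log_joint(features, +1), self.log_joint(features, -1)
        m = max(ls, lh)
        return math.exp(ls - m) / (math.exp(ls - m) + math.exp(lh - m))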

Multinomial Naive Bayes with Boolean Attributes (Document Prior). Observe that, in practical terms, there are two differences between the Multivariate Bernoulli Naive Bayes and the Multinomial Naive Bayes with Boolean Attributes (Token Prior) discussed above.

The first difference is in the manner in which the token probabilities are estimated. In the Multivariate case, the default assumption is that a given token has a P = 1/2 probability of appearing in a message, while in the Multinomial case the default assumption is that a given token has a P ≈ d/n probability of appearing in a given message.

The second difference is that the Multivariate method explicitly includes the probabilities of absent tokens (those not occurring in the message) when computing the likelihood that a message was generated by a given box of coins, while the Multinomial method does not include this information. This makes the Multinomial method more computationally efficient. The Multinomial method is also more effective in terms of classification performance. Is this difference in performance due to the different estimate of probability, or to the exclusion of token absence information?

We investigated this question by proposing a hybrid variant, which we refer to as Multinomial Naive Bayes with Boolean Attributes (Document Prior). In this case, the per-token conditional probabilities are estimated with the document prior used by the Multivariate method:

P(X_j \mid y_i) = \frac{1 + M_{X_j, y_i}}{2 + M_{y_i}}

Classification is performed using the Multinomial method:

\frac{P(y = +1)\,\prod_{j=1}^{n} P(X_j \mid y = +1)^{x_{ij}}}{\sum_{y \in \{-1,+1\}} P(y)\,\prod_{j=1}^{n} P(X_j \mid y)^{x_{ij}}} > \tau

Surprisingly, this variant gave the best results of all Naive Bayes methods tested on noisy blog comment data (see Chapter 7). That is, the traditional Multinomial (Token Prior) method out-performs the Multivariate method (Document Prior), and the Multinomial (Document Prior) method out-performs the Multinomial (Token Prior) method. Thus, we conclude that both the Document Prior and the exclusion of absence information are helpful in spam filtering.

2.3.2 Compression-Based Methods

The Naive Bayes variants operate best under the condition that features are conditionally independent. When this condition breaks down, classification performance may suffer. For example, the Naive Bayes variants perform poorly with the feature space of overlapping k-mers. In this case, the features are inter-dependent, and treating them as independent tokens violates the assumption of conditional independence.

A second set of methods in the generative family of supervised machine learning methods employs techniques from data compression [6] as a means to deal with this problem.

In this methodology, a message is assumed to be generated roughly as follows. Assume that we are dealing with a Markov process of order k that generates messages. There are two sets of (many) boxes, one set for each class label. Each box is filled with a set of tokens of differing sizes, with a unique token for each possible character in the alphabet Σ of all possible characters, including an "end of message" character. Each box is marked with a unique substring of up to k − 1 characters (there is one box for each possible substring of length 0 to k − 1), and is also marked with a class label.

To generate a message, we follow these instructions.

1. Begin with a blank piece of paper.

2. Flip a biased coin to determine the class label of the message.

3. Look at the k − 1 most recent characters written down in the message, and select a token from the box with the selected class label that matches this substring of the k − 1 most recent characters.

4. Write down the character on the token, replace the token in the box, and repeat from step 3 until an "end of message" token is pulled.

With this generative framework, it is possible to estimate the probabilities of pulling tokens from the various boxes from data. Conveniently, it turns out that standard adaptive data-compression algorithms do exactly this, and may be employed as supervised machine learning methods in their own right. It has been shown that such a classifier implicitly employs a feature space of k-mers [82].

Compression and Classification. Basic information theory shows that the number of bits needed to encode a message depends on the entropy of that message, which is defined in terms of the probability that the sender will choose to send exactly that message [88]. Formally, the minimum number of bits L(m) required to encode a message m is:

L(m) = -\log P(m)

where P(m) is the probability that the generating process (i.e., the sender) will generate message m [88]. When the probability distribution over messages is known exactly, methods such as Arithmetic Encoding can be used to encode messages efficiently, approaching this theoretical lower bound for message encoding [95].

In practice, the true probability distribution is not known, and must be estimated from data. The goal of adaptive data compression algorithms such as Prediction by Partial Matching (PPM) [13] and Dynamic Markov-chain Compression (DMC) [22] is to estimate the probabilities of message generation as closely as possible, in order to enable efficient compression. However, learning a distribution over all possible messages is intractable in the general case [6]. To make the problem tractable, these methods assume that messages are generated in the order-k Markovian manner described above.

Essentially, PPM models these probabilities by keeping frequency counts of all substrings of length 1 to k in a table [13], while DMC models probabilities by constructing a finite state machine with transition probabilities learned from the message data [22]. (Full details of the PPM algorithm [13] and the DMC algorithm

[22] are available in the original papers describing these algorithms; for our purposes they may be treated as “black box” algorithms for probability estimation.)

Let the i-th character in a message m be denoted m_i, and the string of characters from the i-th character to the j-th character (inclusive) be denoted m_{i,j}. Then, assuming adequate training of the order-k models, either DMC or PPM will estimate the probability that the sender generates message m as:

P(m) = \prod_{i=1}^{|m|} P(m_i \mid m_{(i-1)-k,\,(i-1)})

Thus, either method will encode the given message with L(m) bits:

L(m) \approx -\log \prod_{i=1}^{|m|} P(m_i \mid m_{(i-1)-k,\,i-1}) = -1 \cdot \sum_{i=1}^{|m|} \log P(m_i \mid m_{(i-1)-k,\,i-1})

For classification in the spam filtering domain, we can train two models, one on spam messages and one on ham messages, by feeding each set of messages to a given compression method. Once the models have been trained, we can classify a new message m by finding argmin(L(m)_spam, L(m)_ham), where L(m)_spam is the length of compressing message m with the spam model, and L(m)_ham is the length of compressing message m with the ham model [6].

In practice, it is possible to use compression algorithms “off the shelf” for such classification. However, it is not necessary to actually compress the messages in order to classify them; learning and applying the probability estimates is sufficient.

Thus, practical uses of data-compression techniques for spam filtering do not include the Arithmetic Encoding step found in “off the shelf” compression algorithms, and allow for efficient online updates [6].

A PPM compression-based method was the best performing spam filter at

TREC 2005 [24], and compression-based methods remained competitive with the best methods at TREC 2006 and 2007 [16, 18]. These methods satisfy all of the criteria needed to be considered for effective spam filtering.
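As a rough illustration of classification by compression, the sketch below uses a simple order-k character model with Laplace smoothing as a stand-in for PPM or DMC, which use more sophisticated probability estimators. The class and function names are our own; this is not an implementation of either published algorithm.

import math
from collections import defaultdict

class OrderKCharModel:
    # A simplified order-k character model with Laplace smoothing, used as a
    # stand-in for the probability estimators inside PPM or DMC.
    def __init__(self, k=4, alphabet_size=256):
        self.k, self.a = k, alphabet_size
        self.counts = defaultdict(int)     # counts of (context, next character)
        self.totals = defaultdict(int)     # counts of contexts

    def train(self, message):
        for i in range(len(message)):
            ctx = message[max(0, i - self.k):i]
            self.counts[(ctx, message[i])] += 1
            self.totals[ctx] += 1

    def code_length(self, message):
        # Approximate L(m) = -log2 P(m) under the trained model, in bits.
        bits = 0.0
        for i in range(len(message)):
            ctx = message[max(0, i - self.k):i]
            p = (self.counts[(ctx, message[i])] + 1.0) / (self.totals[ctx] + self.a)
            bits -= math.log2(p)
        return bits

def classify(message, spam_model, ham_model):
    # Predict the class whose model compresses the message into fewer bits.
    if spam_model.code_length(message) < ham_model.code_length(message):
        return 'spam'
    return 'ham'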

PerceptronClassify(x, w, τ)
Given: example x ∈ X,
       weight vector w ∈ R^n,
       threshold τ
Begin
    return <w, x> − τ
End

PerceptronUpdate(x, w, y, η, τ)
Given: example x ∈ X,
       weight vector w ∈ R^n,
       true label y,
       learning rate η,
       threshold τ
Begin
    If y · PerceptronClassify(x, w, τ) ≤ 0
        w ← w + yηx
End

Figure 2.2: Pseudo-code for Classical Perceptron update rule and classification function.

2.3.3 Perceptron Variants

We now leave aside the generative family of supervised machine learning methods, and focus on the discriminative family of methods. In this set of methods, we no longer attempt to model the underlying process that generates ham and spam messages. Rather, we consider the feature space X, and attempt to find boundaries in this space separating regions of ham and spam.

Classical Perceptron. Perhaps the oldest of the discriminative machine learning algorithms is Rosenblatt's Perceptron algorithm from the late 1950's [73]. In its classical form, Perceptron learns a set of weights w ∈ R^n that defines a hyperplane separating two classes: in our case, ham and spam. Classical Perceptron is an online learner using a 0-1 loss function, updating its hypothesis on each new mistake.

Pseudo-code for classical Perceptron is given in Figure 2.2.

In online tests, Perceptron and the other discriminative methods described in this thesis begin with a null hypothesis, where w = 0. The classification function is then used to predict the class of each new example as it arrives.

The hyperplane found by Perceptron intersects the origin of the coordinate space. If we wish to bias this hyperplane, we may do so by including a special feature X_0 that is always set to 1 for every feature vector in our data set [35]. That is, an offset value is implicitly included in the weight vector as w_0, and each example x_i has x_{i0} = 1. Classification with Perceptron is performed by fixing a classification threshold τ, which may be used to account for asymmetric misclassification costs between classes. The function PerceptronClassify is used to make predictions; when the value returned by this function is positive, spam is predicted, and ham is predicted otherwise.

Perceptron has the benefits of being fast and straightforward, satisfying all of the computational requirements of our criteria for practical spam filtering. It is employed by SpamAssassin to train weights for the hand-crafted features used by this filter [2]. Yet in practice, it has been found to give poor classification performance in the presence of noise [48]. Perceptron's poor classification performance on spam filtering is demonstrated in our experimental section.

Perceptron with Margins. A Perceptron variant with better classification performance is Perceptron with Margins [52], a noise-tolerant [48] variant of the classical Perceptron algorithm.

PerceptronWithMarginsUpdate(x, w, y, η, m, τ)
Given: example x ∈ X,
       weight vector w ∈ R^n,
       true label y,
       learning rate η,
       margin parameter m,
       threshold τ
Begin
    If y · PerceptronClassify(x, w, τ) ≤ m
        w ← w + yηx
End

Figure 2.3: Pseudo-code for Perceptron with Margins update rule; note that the classification function is the same as for classical Perceptron.

Perceptron with Margins uses the same classification function as Classical Perceptron, but has a different update rule. It accepts a margin parameter m, and performs an update whenever a mistaken prediction is made, or whenever an example x is found to lie within distance m of the separating hyperplane. This fixed margin has been likened to a cheap approximation of large-margin classifiers such as Support Vector Machines, and provides additional tolerance to noise [48].

We used Perceptron with Margins at TREC 2006, achieving second place results [86]. To our knowledge, this was the first use of this algorithm in the spam

filtering domain. Both variants are able to perform both classification and training updates in O(s) time.
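A minimal sketch of the two Perceptron variants over sparse feature vectors follows. The representation of examples and weights as Python dictionaries is our own choice for illustration; setting m = 0 in the update recovers the classical Perceptron.

def dot(w, x):
    # Sparse inner product: w and x are dicts mapping feature -> value.
    return sum(w.get(f, 0.0) * v for f, v in x.items())

def perceptron_classify(x, w, tau=0.0):
    # A positive return value predicts spam, a negative value predicts ham.
    return dot(w, x) - tau

def perceptron_with_margins_update(x, y, w, eta, m, tau=0.0):
    # y is the true label in {-1, +1}. Update when the example is misclassified
    # or lies within the fixed margin m; m = 0 gives the classical Perceptron.
    if y * perceptron_classify(x, w, tau) <= m:
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + y * eta * v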


Figure 2.4: Logistic Regression employs a logistic loss function for positive and negative examples, which punishes mistakes made with high confidence more heavily than mistakes made with low confidence.

2.3.4 Logistic Regression

The Perceptron variants are structured on a 0-1 loss function. However, in the noiseless case it is better to consider mistakes made with low confidence (near the hyperplane) as less costly than mistakes made with high confidence (far from the hyperplane). One way to do this is to use a logistic loss function in place of the 0-1 loss, as is done by Logistic Regression. Logistic Regression was recently proposed for email spam filtering [38], and has given state of the art performance at TREC 2007 [19].

Like the Perceptron variants, Logistic Regression stores a linear hypothesis with a weight vector w. Here, the prediction function maps an input example to the probability that the example has a positive label, based on that example's distance from the separating hyperplane [66]. That is:

f(x_i) = p(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-\langle w, x_i \rangle}}

We predict spam whenever f(x_i) > τ, and ham otherwise.

A fast update procedure for Logistic Regression uses an online gradient descent method. We start with w ← 0 and update for each new example. Assuming that y ∈ {0, 1} rather than y ∈ {−1, +1}, so that y may be treated as the true probability of class membership, the update for w is:

w \leftarrow w + \eta\,x_i\,(y_i - f(x_i))

Like the discriminative methods previously described, both classification and updates are performed in O(s) time. Logistic Regression satisfies all of the criteria for practical spam filtering.
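The sketch below illustrates the Logistic Regression prediction and online gradient descent update given above, again over sparse dictionaries. The clamping of the inner product is our own addition for numerical safety and is not part of the method as described.

import math

def lr_predict(w, x):
    # f(x) = p(y = 1 | x) for a sparse example x and weight dict w.
    z = sum(w.get(f, 0.0) * v for f, v in x.items())
    z = max(min(z, 35.0), -35.0)   # clamp for numerical safety (our own addition)
    return 1.0 / (1.0 + math.exp(-z))

def lr_update(w, x, y, eta=0.1):
    # Online gradient step; y must be 0 or 1, so (y - f(x)) is the prediction error.
    err = y - lr_predict(w, x)
    for f, v in x.items():
        w[f] = w.get(f, 0.0) + eta * v * err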

2.3.5 Ensemble Methods

In machine learning, it has often been found that different classifiers make mistakes on different kinds of examples. When varied methods make uncorrelated errors, they may be combined in ensembles that often result in stronger classification performance than is given by any single classifier [33].

Lynam and Cormack found that combining the output of the 53 filters from

TREC 2005 into an ensemble filter produced filtering performance far exceeding that of any single filter [62].

They experimented with several methods for combining the raw output of different filters into a single classifier. These methods included simple voting across classifications and computing a composite score based on the average log-odds of per-filter classifications (see Lynam and Cormack [62] for details). Furthermore, they also tested methods where the outputs of the filters were treated as the input features for a higher-level learner, such as a logistic regression learner or a Support Vector Machine (SVM) [62]. The logistic regression-based combination gave the best performance, followed closely by the SVM-based combination, but even the simpler methods of score combination gave results exceeding those of the best single classifier.

Ensemble methods may be costly, due to their reliance on multiple learners as inputs. However, when these input learners are efficient, then combining them in an ensemble does not add computational cost in asymptotic analysis. Thus, when the input learners satisfy the computational requirements set for practical spam

filters, then an ensemble of these methods is also practical by these criteria.
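As a simple illustration of score combination, the sketch below averages the log-odds of the individual filters' spamminess scores. This is in the spirit of the log-odds averaging combiner studied by Lynam and Cormack, but it is our own simplified version, not their exact procedure.

import math

def log_odds(p, eps=1e-6):
    # Map a filter's spamminess score in (0, 1) to log-odds, clipping for stability.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def combine_log_odds(filter_scores):
    # Average the log-odds of the individual filters and map back to (0, 1).
    avg = sum(log_odds(p) for p in filter_scores) / len(filter_scores)
    return 1.0 / (1.0 + math.exp(-avg))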

2.4 Experimental Comparisons

In this section, we empirically compare the methods described in this chapter, using benchmark data sets and evaluation methods developed for the TREC spam filtering competitions [24, 16, 18]. This also allows apples-to-apples comparisons with prior published results on these data sets, which we include in this chapter.

2.4.1 TREC Spam Filtering Methodology

The primary task at the TREC spam filtering competitions has been to filter spam in the online filtering scenario, described in Chapter 1. We focus on this task in this chapter, noting that different years have included tasks for delayed feedback

[16, 18], partial feedback [18], pool-based active learning [16], and our suggestion of online active learning [18].

In this setting, a data set is provided in a canonical order for repeatable online evaluation. Each message in the data set has a gold-standard label of spam or ham, which has been supplied by a careful bootstrapping methodology which includes hand labeling by human experts [23].

A learning-based filter is then shown one message at a time from the data set, in order. For each message, the filter first makes a prediction of "spaminess", expressed as a real number where higher numbers indicate a stronger prediction that the message is spam and lower numbers indicate a stronger prediction that the message is ham [24]. After the prediction has been made, the gold-standard label is revealed to the filter, which may then use the label information to update its internal model. This process repeats until all of the messages in the corpus have been exhausted.

2.4.2 Data Sets

We consider three benchmark data sets, as well as one separate data set used for parameter tuning as described below. The three benchmark data sets are trec05p-1

[24], trec06p [16], and trec07p [18].

The trec05p-1 corpus contains a total of 92,189 messages, of which 39,399 are ham and 52,790 are spam [24]. The trec06p corpus contains 37,822 messages, with 12,910 ham and 24,912 spam [16]. The trec07p corpus contains 75,419 messages: 25,220 ham and 50,199 spam [18].

2.4.3 Parameter Tuning

Several of the above algorithms include parameters that must be tuned, such as the learning rate η in Logistic regression or the margin parameter m in Perceptron with

Margins. It is well known in machine learning that such parameters cannot be set by evaluation on the test data at hand, as this may yield unduly favorable results

[76].

The accepted practice is to tune such parameters on an independent data set, called the tuning set. That is, evaluate the performance of the learner on a

tuning set with many different parameter values. Select the parameter values giving the best performance on the tuning data, and use those values in the final tests. This ensures that the generalization performance achieved by the classifier has not been unfairly influenced by advance knowledge of the test data [76].

Following this practice, we use the publicly available spamassassin corpus for parameter tuning, available at http://spamassassin.apache.org. This data set contains 6,034 examples. Because there is no canonical ordering associated with this data set, we created a randomized canonical ordering used for all of our tuning trials. (This ordering is available from the author on request.) We used the standard (1-ROCA)% evaluation metric, described below, as our criterion for selecting the optimal parameter values.

The parameter values selected were as follows. Logistic Regression learning rate η was set to η = 0.1. Classical Perceptron learning rate η was set to η = 0.5.

Perceptron with Margins learning rate was set to η = 0.5 with margin parameter m = 8.
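A sketch of this tuning procedure is given below: one online pass over the tuning stream per candidate parameter value, keeping the value with the best (lowest) (1-ROCA)% score. The make_filter, score, update, and evaluate interfaces are hypothetical placeholders used only for illustration.

def tune_parameter(make_filter, param_values, tuning_stream, evaluate):
    # tuning_stream: a list of (features, label) pairs in the canonical tuning order.
    # make_filter(value) builds a fresh filter; filter.score(x) and filter.update(x, y)
    # are assumed interfaces; evaluate(results) computes (1-ROCA)% from (score, label) pairs.
    best_value, best_score = None, float('inf')
    for value in param_values:
        f = make_filter(value)
        results = []
        for x, y in tuning_stream:
            results.append((f.score(x), y))   # predict before the label is revealed
            f.update(x, y)                    # then update on the revealed label
        score = evaluate(results)
        if score < best_score:
            best_value, best_score = value, score
    return best_value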

2.4.4 The (1-ROCA)% Evaluation Measure

There are several possible evaluation measures for comparing the classification performance of different filtering methods [94].

Clearly, we would like to maximize the number of True Positives (TPs), which are actual spam messages correctly predicted to be spam, and the number of

True Negatives (TNs), which are actual ham messages correctly predicted as such.

Furthermore, we would like to minimize the number of False Positives (FPs), which are good ham messages wrongly predicted to be spam, and to minimize the number of False Negatives (FNs), which are spam messages predicted to be ham. These

quantities can be combined into familiar measures such as the following:

\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

\mathrm{precision} = \frac{TP}{TP + FP}

\mathrm{recall} = \frac{TP}{TP + FN}

However, each of the methods described earlier in this chapter contains a threshold parameter τ, which can be adjusted to trade off the gain of additional TPs, for example, against the cost of additional FPs. Because the cost of an FP or FN may vary by user, it is considered beneficial for a filter to give good performance over a range of possible threshold values.

One evaluation measure that takes all possible thresholds into account is the

Area under the ROC curve (ROCA) measure. A ROC curve (pictured in Figure

7.3, for example) plots the performance of a classifier at different thresholds in a two dimensional space, with TP rate on the vertical axis and FP rate on the horizontal axis. When these points are connected into a curve, the area under this curve, called the ROCA score when normalized to the range [0, 1], has a statistical interpretation.

The ROCA score gives the probability that a randomly selected spam message will, indeed, be predicted to be more “spammy” than a randomly selected ham message

[15].

In this dissertation, we have followed standard practice in the spam filtering community by focusing on ROCA as our standard evaluation measure because different users may prefer different thresholds. For example, one set of users might weigh FNs as very costly, while another group may prefer as few FPs as possible. Because ROCA is a composite metric computed across all possible thresholds, this single-number evaluation measure takes different users' preferences into account.

Furthermore, results from TREC evaluations show that filters scoring well on the

ROCA measure also tend to score well on other evaluation measures, such as FP rate at 0.1% FN rate [24]. However, these results have also shown that although the relative ranking of filters tends to remain consistent across a range of thresholds, these relative rankings may change when extreme thresholds are considered, such as thresholds tuned for 0.001% FN rate. The exploration of filters tuned for such extremes remains outside the scope of this dissertation.

It has become standard practice in the TREC spam filtering competitions to deal with the quantity (1-ROCA)% rather than ROCA, for readability. This can be interpreted as the percentage chance that a randomly selected spam message will be mistakenly predicted to be less “spammy” than a randomly selected ham message [24]. Numbers closest to 0 for the (1-ROCA)% score are optimal, with a

(1-ROCA)% score of 50 representing a result from random guessing.
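The sketch below computes (1-ROCA)% directly from its probabilistic interpretation: the area under the ROC curve equals the probability that a randomly selected spam message receives a higher score than a randomly selected ham message, with ties counted as one half. This simple O(|spam| · |ham|) estimate is for illustration only; the TREC evaluation toolkit computes the same quantity more efficiently and also reports confidence intervals.

def one_minus_roca_percent(scores_and_labels):
    # scores_and_labels: list of (score, label) pairs, with label +1 for spam, -1 for ham.
    spam = [s for s, y in scores_and_labels if y == +1]
    ham = [s for s, y in scores_and_labels if y == -1]
    # Count spam/ham pairs ranked correctly, with ties counted as one half.
    wins = sum(1.0 if s > h else (0.5 if s == h else 0.0) for s in spam for h in ham)
    auc = wins / (len(spam) * len(ham))
    return 100.0 * (1.0 - auc)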

2.4.5 Comparison Results

We tested each of the methods described in this chapter, and report results in Tables 2.1, 2.2, and 2.3. For all of our tests, we used normalized binary feature mappings and message truncation. The different feature mappings tested are noted in the third column of each table. We also tested a range of mappings that included wildcards and gaps in different combinations. In all cases, these did not improve performance.

We also report results from a small number of previous filters that gave good results on the same data sets, noting the citation in the appropriate row for the publication in which the result appeared.

Table 2.1: Comparison results for methods on the trec05p-1 data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parentheses.

filter name      | learning method              | feature space             | (1-ROCA)% score
SpamProbe [24]   | Naive Bayes                  | word-based                | 0.059 (0.049 - 0.071)
BogoFilter [24]  | Naive Bayes                  | word-based                | 0.048 (0.038 - 0.062)
PPM [6]          | compression                  | implicit k-mers           | 0.019 (0.015 - 0.023)
DMC [6]          | compression                  | implicit k-mers           | 0.013 (0.010 - 0.018)
-                | MultiNomial Naive Bayes (TP) | binary words              | 0.351 (0.3255 - 0.3781)
-                | MultiNomial Naive Bayes (TP) | binary 4-mers             | 0.871 (0.8314 - 0.9132)
-                | MultiNomial Naive Bayes (DP) | binary words              | 0.431 (0.3941 - 0.4708)
-                | MultiNomial Naive Bayes (DP) | binary 4-mers             | 0.321 (0.2895 - 0.3562)
-                | MultiVariate Naive Bayes     | binary words              | 0.179 (0.1608 - 0.1983)
-                | Classical Perceptron         | binary words              | 0.105 (0.0912 - 0.1196)
-                | Classical Perceptron         | binary 2-mers             | 0.139 (0.1261 - 0.1536)
-                | Classical Perceptron         | binary 3-mers             | 0.078 (0.0682 - 0.0895)
-                | Classical Perceptron         | binary 4-mers             | 0.060 (0.0513 - 0.0697)
-                | Perceptron with Margins      | binary words              | 0.037 (0.0301 - 0.0461)
-                | Perceptron with Margins      | binary 2-mers             | 0.053 (0.0462 - 0.0614)
-                | Perceptron with Margins      | binary 3-mers             | 0.030 (0.0236 - 0.0368)
-                | Perceptron with Margins      | binary 4-mers             | 0.022 (0.0184 - 0.0265)
-                | Logistic Regression          | binary words              | 0.046 (0.0407 - 0.0515)
-                | Logistic Regression          | binary 2-mers             | 0.038 (0.0335 - 0.0432)
-                | Logistic Regression          | binary 4-mers + wildcards | 0.019 (0.0143 - 0.0255)
-                | Logistic Regression          | binary 3-mers             | 0.016 (0.0140 - 0.0188)
-                | Logistic Regression          | binary 4-mers             | 0.013 (0.0107 - 0.0152)
53-ensemble [62] | voting combiner              | other filters             | 0.013 (0.010 - 0.018)
53-ensemble [62] | logodds combiner             | other filters             | 0.009 (0.007 - 0.011)
53-ensemble [62] | SVM combiner                 | other filters             | 0.008 (0.005 - 0.013)
53-ensemble [62] | Log-Reg combiner             | other filters             | 0.007 (0.005 - 0.008)

Table 2.2: Comparison results for methods on the trec06p data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parentheses.

filter name      | learning method              | feature space             | (1-ROCA)% score
BogoFilter       | Naive Bayes                  | word-based                | 0.087 (0.066 - 0.114)
PPM [16]         | compression                  | implicit k-mers           | 0.061 (0.048 - 0.076)
DMC              | compression                  | implicit k-mers           | 0.031 (0.021 - 0.044)
-                | MultiNomial Naive Bayes (TP) | binary words              | 0.145 (0.1162 - 0.1809)
-                | MultiNomial Naive Bayes (TP) | binary words              | 0.477 (0.4255 - 0.5352)
-                | MultiNomial Naive Bayes (DP) | binary words              | 0.147 (0.1250 - 0.1724)
-                | MultiNomial Naive Bayes (DP) | binary 4-mers             | 0.340 (0.3069 - 0.3753)
-                | MultiVariate Naive Bayes     | binary words              | 0.138 (0.1179 - 0.1621)
-                | Classical Perceptron         | binary words              | 0.163 (0.1303 - 0.2047)
-                | Classical Perceptron         | binary 2-mers             | 0.307 (0.2617 - 0.3600)
-                | Classical Perceptron         | binary 3-mers             | 0.170 (0.1408 - 0.2054)
-                | Classical Perceptron         | binary 4-mers             | 0.229 (0.1859 - 0.2824)
-                | Perceptron with Margins      | binary words              | 0.047 (0.0342 - 0.0641)
-                | Perceptron with Margins      | binary 2-mers             | 0.104 (0.0815 - 0.1330)
-                | Perceptron with Margins      | binary 3-mers             | 0.053 (0.0381 - 0.0730)
-                | Perceptron with Margins      | binary 4-mers             | 0.049 (0.0340 - 0.0704)
-                | Logistic Regression          | binary words              | 0.087 (0.0728 - 0.1042)
-                | Logistic Regression          | binary 2-mers             | 0.086 (0.0710 - 0.1046)
-                | Logistic Regression          | binary 3-mers             | 0.037 (0.0292 - 0.0471)
-                | Logistic Regression          | binary 4-mers             | 0.032 (0.0247 - 0.0406)
-                | Logistic Regression          | binary 4-mers + wildcards | 0.034 (0.0253 - 0.0447)
53-ensemble      | Log-Reg combiner             | other filters             | 0.020 (0.007 - 0.050)

Table 2.3: Comparison results for methods on the trec07p data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parentheses.

filter name      | learning method              | feature space             | (1-ROCA)% score
BogoFilter       | Naive Bayes                  | word-based                | 0.027 (0.017 - 0.043)
DMC              | compression                  | implicit k-mers           | 0.006 (0.003 - 0.016)
-                | MultiNomial Naive Bayes (TP) | binary words              | 0.145 (0.1162 - 0.1809)
-                | MultiNomial Naive Bayes (TP) | binary words              | 0.477 (0.4255 - 0.5352)
-                | MultiNomial Naive Bayes (DP) | binary words              | 0.147 (0.1250 - 0.1724)
-                | MultiNomial Naive Bayes (DP) | binary 4-mers             | 0.340 (0.3069 - 0.3753)
-                | MultiVariate Naive Bayes     | binary words              | 0.138 (0.1179 - 0.1621)
-                | Classical Perceptron         | binary words              | 0.042 (0.0265 - 0.0668)
-                | Classical Perceptron         | binary 2-mers             | 0.030 (0.0237 - 0.0389)
-                | Classical Perceptron         | binary 3-mers             | 0.035 (0.0241 - 0.0513)
-                | Classical Perceptron         | binary 4-mers             | 0.050 (0.0334 - 0.0729)
-                | Perceptron with Margins      | binary words              | 0.019 (0.0105 - 0.0324)
-                | Perceptron with Margins      | binary 2-mers             | 0.017 (0.0118 - 0.0253)
-                | Perceptron with Margins      | binary 3-mers             | 0.011 (0.0056 - 0.0212)
-                | Perceptron with Margins      | binary 4-mers             | 0.011 (0.0060 - 0.0185)
-                | Logistic Regression          | binary words              | 0.011 (0.0063 - 0.0205)
-                | Logistic Regression          | binary 2-mers             | 0.010 (0.0055 - 0.0186)
-                | Logistic Regression          | binary 3-mers             | 0.006 (0.0022 - 0.0162)
-                | Logistic Regression          | binary 4-mers             | 0.005 (0.0017 - 0.0166)
-                | Logistic Regression          | binary 4-mers + wildcards | 0.006 (0.0025 - 0.0163)

Several trends are apparent from these results. First, the various Naive Bayes variants all perform poorly with the feature mappings tested here, which included no feature selection. Previous Naive Bayes variants such as BogoFilter and SpamProbe performed significantly better than the Naive Bayes variants tested here, in part because these methods employ feature selection to reduce the dimensionality of the feature space. However, even these best performing Naive Bayes variants with feature selection did not perform as well as the best discriminative methods using no feature selection.

Second, Perceptron with Margins out-performs Classical Perceptron in all cases, with all feature spaces. Because the computational cost for these two methods is the same, Perceptron with Margins should be preferred exclusively.

Third, Logistic Regression out-performs Perceptron with Margins. Interestingly, Chapter 6 will show that Perceptron with Margins can match or even out-perform Logistic Regression in the presence of noise.

Fourth, we see that the results from the 53-ensemble out-perform Logistic

Regression. These ensemble results can be matched by Online SVMs, as described in the next chapter.

Chapter 3

Online Filtering with Support Vector Machine Variants

In the previous chapter, we described a variety of machine learning methods that have been applied to the online filtering scenario, with the exception of methods based on Support Vector Machines (SVMs). This is because there has been debate between academic researchers and industrial practitioners about the applicability of

SVMs to the spam filtering problem. The former have advocated the use of SVMs for such filtering, because SVMs give state-of-the-art performance for text classification.

However, similar performance gains were yet to be demonstrated for online spam

filtering before our work in 2007. Additionally, practitioners have cited the high cost of SVMs as reason to prefer faster (if less statistically robust) methods such as

Naive Bayes variants. In this chapter, we offer a resolution to this controversy. First, we show that online SVMs indeed give state-of-the-art classification performance on online spam filtering on large benchmark data sets. Second, we show that nearly equivalent performance may be achieved by a Relaxed Online SVM (ROSVM) at

greatly reduced computational cost. This chapter is based on our work with Gabriel

Wachman [84, 85].

3.1 An Anti-Spam Controversy

Through 2007, the anti-spam community had been divided on the choice of the best machine learning method for content-based spam detection. Academic researchers have tended to favor the use of Support Vector Machines (SVMs), a statistically robust machine learning method [27, 78] which yields state-of-the-art performance on general text classification [45]. However, SVMs typically require training time that is quadratic in the number of training examples, and are impractical for large-scale email systems. Practitioners requiring content-based spam filtering have typically chosen to use the faster (if less statistically robust) machine learning method of Naive Bayes text classification [39, 40, 63]. This Bayesian method requires only linear training time, and is easily implemented in an online setting with incremental updates. This allows a deployed system to easily adapt to a changing environment over time. Other fast methods for spam filtering include compression models [6] and logistic regression [38]. Furthermore, until our work in 2007 it had not been empirically demonstrated that SVMs give improved performance over these methods in an online spam detection setting [20].

3.1.1 Contributions

In this chapter, we address the anti-spam controversy and offer a potential resolution.

We first demonstrate that online SVMs do indeed provide state-of-the-art spam detection through empirical tests on TREC benchmark data sets of email spam. We then analyze the effect of the regularization parameter in the SVM objective function,

which shows that the expensive SVM methodology may, in fact, be overkill for spam detection. We reduce the computational cost of SVM learning by relaxing this requirement on the maximum margin in online settings, and create a Relaxed Online

SVM (ROSVM) appropriate for high performance content-based spam filtering in large-scale settings [84].

3.2 Spam and Online SVMs

The controversy that existed through 2007 between academics and practitioners in spam filtering centered on the use of SVMs. Academics advocated their use, but had yet to demonstrate strong performance with SVMs for online spam filtering.

Indeed, the results of [20] showed that, when used with default parameters, SVMs actually perform worse than other methods. In this section, we review the basic workings of SVMs and describe a simple Online SVM algorithm. We then show that Online SVMs indeed achieve state-of-the-art performance on filtering email spam, so long as the regularization parameter C is set to a high value. However, the cost of Online SVMs turns out to be prohibitive for large-scale applications. These

findings motivate our proposal of Relaxed Online SVMs in the following section.

3.2.1 Background: SVMs

SVMs are a robust machine learning methodology which has been shown to yield state-of-the-art performance on text classification [45] by finding a hyperplane that separates two classes of data in data space while maximizing the margin between them.

The linear SVMs we employ in this chapter use a hypothesis vector w and bias term b to classify a new example x, by generating a prediction value using the


Figure 3.1: Visualizing SVM Classification. An SVM learns a hyperplane that separates the positive and negative data examples with the maximum possible margin. Error terms ξ_i > 0 are given for examples on the wrong side of their respective margin.

function f(x):

f(x) = \langle w, x \rangle + b

As with the methods described in Chapter 2, when a threshold parameter τ is fixed, we predict spam when f(x) ≥ τ and ham otherwise.

SVMs find the hypothesis w, which defines the separating hyperplane, by minimizing the following objective function over all n training examples:

g(w, \xi) = \frac{1}{2}\,||w||^2 + C \sum_{i=1}^{n} \xi_i

under the constraints that

\forall i \in \{1, \ldots, n\} : \; y_i \cdot f(x_i) \geq 1 - \xi_i, \quad \xi_i \geq 0

In this objective function, each slack variable ξ_i shows the amount of error (relative to the margins) that the classifier makes on a given example x_i. Minimizing the sum of the slack variables corresponds to minimizing the loss function on the training data, while minimizing the term \frac{1}{2}||w||^2 corresponds to maximizing the margin between the two classes [78]. These two optimization goals are often in conflict; the regularization parameter C determines how much importance to give each of these tasks.

Linear SVMs exploit data sparsity to classify a new instance in O(s) time, where s is the number of non-zero features. This is the same classification time as the other linear classifiers from Chapter 2. Training SVMs, however, typically takes O(n^2) time for n training examples. A variant for linear SVMs was recently proposed which trains in O(ns) time [47], but because this method has a high constant, we do not explore it here.

Given: data set X = (x1, y1), ..., (xn, yn), C, m:
Initialize w := 0, b := 0, seenData := { }
For each xi ∈ X do:
    Classify xi using f(xi) = <w, xi> + b
    If yi f(xi) < 1:
        Find w', b' using SMO on seenData,
            using w, b as seed hypothesis.
    Add xi to seenData
done

Figure 3.2: Pseudo code for Online SVM.

3.2.2 Online SVMs

In many traditional machine learning applications, SVMs are applied in batch mode.

That is, an SVM is trained on an entire set of training data, and is then tested on a separate set of testing data. Spam filtering is typically tested and deployed in an online setting, which proceeds incrementally. Here, the learner classifies a new example, is told if its prediction is correct, updates its hypothesis accordingly, and then awaits a new example. Online learning allows a deployed system to adapt itself in a changing environment.

Re-training an SVM from scratch on the entire set of previously seen data for each new example is cost prohibitive. However, using an old hypothesis as the starting point for re-training reduces this cost considerably. One method of incremental and decremental SVM learning was proposed in [9]. Because we are only concerned with incremental learning, we apply a simpler algorithm for converting a batch SVM learner into an online SVM (see Figure 3.2 for pseudo-code), which is similar to the approach of [49].

Each time the Online SVM encounters an example that was poorly classified, it retrains using the old hypothesis as a starting point. Note that due to the Karush-Kuhn-Tucker (KKT) conditions, it is not necessary to re-train on well-classified examples that lie outside the margins [78].

We used Platt's SMO algorithm [69] as a core SVM solver, because it is an iterative method that is well suited to converging quickly from a good initial hypothesis. Because previous work (and our own initial testing) indicates that binary feature values give the best results for spam filtering [63, 34], we optimized our implementation of the Online SMO to exploit fast inner-products with binary vectors.1

We performed feature mapping as described in Chapter 2. We tested normalized binary k-mer feature mappings, and a normalized binary word-based feature mapping. We used message truncation, truncating each message after 3000 characters.
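A sketch of the Online SVM wrapper of Figure 3.2 is given below. The smo_train function is a hypothetical placeholder for a batch SMO solver that accepts a seed hypothesis; its signature is our own invention for illustration and does not correspond to any particular library.

def online_svm(stream, smo_train, C=100.0):
    # stream: iterable of (x, y) pairs, x a sparse dict of binary features, y in {-1, +1}.
    # smo_train: hypothetical batch SMO solver, smo_train(data, C, w_seed, b_seed) -> (w, b).
    w, b, seen_data, scores = {}, 0.0, [], []
    for x, y in stream:
        f = sum(w.get(k, 0.0) * v for k, v in x.items()) + b
        scores.append(f)                # predict before the true label is used
        seen_data.append((x, y))
        if y * f < 1.0:                 # the example violates the KKT conditions
            w, b = smo_train(seen_data, C, w, b)   # re-train, seeded with the old hypothesis
    return scores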

3.2.3 Tuning the Regularization Parameter, C

The SVM regularization parameter C must be tuned to balance the (potentially conflicting) goals of maximizing the margin and minimizing the training error. Early work on SVM-based spam detection [34] showed that high values of C give the best performance with binary features. Later work has not always followed this lead: a

(low) default setting of C was used on splog detection [50], and also on email spam

[20].

Following standard machine learning practice, we tuned C on separate tuning data not used for later testing. We used the publicly available spamassassin email spam data set, and created an online learning task by randomly interleaving all 6034 labeled messages to create a single ordered set.

For tuning, we performed a coarse parameter search for C using powers of ten from .0001 to 10000. We used the Online SVM described above, and tested both binary bag-of-words vectors and n-mer vectors with n ∈ {2, 3, 4}. We used the first 3000 characters of each message, which included header information, the body of the email, and possibly attachments. Following the recommendation of [25], we use Area under the ROC curve as our evaluation measure. The results (see Figure 3.3) agree with [34]: there is a plateau of high performance achieved with all values of C ≥ 10, and performance degrades sharply with C < 1. For the remainder of our experiments with SVMs in this chapter, we set C = 100, enforcing little regularization.

1Our source code is freely available at www.cs.tufts.edu/∼dsculley/onlineSMO.


Figure 3.3: Tuning the Regularization Parameter C. Tests were conducted with Online SMO, using binary feature vectors, on the spamassassin data set of 6034 examples. Graph plots C versus Area under the ROC curve.

Table 3.1: Results for Email Spam filtering with Online SVM on benchmark data sets. Score reported is (1-ROCA)%, where 0 is optimal. These results are directly comparable to those on the same data sets with other filters, reported in Chapter 2.

                | trec05p-1           | trec06p
OnSVM: words    | 0.015 (.011-.022)   | 0.034 (.025-.046)
OnSVM: 3-mers   | 0.011 (.009-.015)   | 0.025 (.017-.035)
OnSVM: 4-mers   | 0.008 (.007-.011)   | 0.023 (.017-.032)
SpamProbe       | 0.059 (.049-.071)   | 0.092 (.078-.110)
BogoFilter      | 0.048 (.038-.062)   | 0.077 (.056-.105)
TREC Winners    | 0.019 (.015-.023)   | 0.054 (.034-.085)
Log-Reg 4-mers  | 0.013 (.011-.019)   | 0.032 (.025-.041)
53-Ensemble     | 0.007 (.005-.008)   | 0.020 (.007-.050)


In Chapters 6 and 7 we will return to this issue of regularization, when we explore spam filtering in the presence of class-label noise. Later in this chapter, we will come back to the observation that very high values of C do not degrade performance as support for the intuition that relaxed SVMs should perform well on spam.

3.2.4 Email Spam and Online SVMs

With C tuned on a separate tuning set, we then tested the performance of Online

SVMs in spam detection. We used two large benchmark data sets of email spam as our test corpora. These data sets are the 2005 TREC and the 2006 TREC public data sets, trec05p-1 [24] and trec06p [16], described in Chapter 2.

Results for these experiments, with bag-of-words vectors and k-mer vectors, appear in Table 3.1. To compare our results with previous scores on these data sets, we use the same (1-ROCA)% measure described in Chapter 2.

These results, current through mid-2007, show that Online SVMs indeed gave state of the art performance on email spam. The only known system that out-performed the Online SVMs on the trec05p-1 data set is an ensemble classifier which combines the results of 53 unique spam filters [62]. To our knowledge, the

Online SVM has out-performed every other single filter on these data sets, including those using Bayesian methods [24, 16], compression models [24, 16], and perceptron variants [16], the TREC competition winners [24, 16], and open source email spam

filters BogoFilter v1.1.5 and SpamProbe v1.4d. Online SVM also out-performed logistic regression using word-based features [38], although later work by Cormack applying logistic regression with a binary 4-mer feature space and message truncation achieved results that are competitive with those of the Online SVM [19].

3.2.5 Computational Cost

The results presented in this section demonstrate that linear SVMs give state of the art performance on content-based spam filtering. However, this performance comes at a price. Although the blog comment spam and splog data sets are too small for the quadratic training time of SVMs to appear problematic, the email data sets are

large enough to illustrate the problems of quadratic training cost.

Table 3.2 shows computation time versus data set size for each of the online learning tasks (run on the same system with no other significant load). The training cost of SVMs is prohibitive for large-scale content-based spam detection or for a large blog host. In the following section, we reduce this cost by relaxing the expensive requirements of SVMs.

3.3 Relaxed Online SVMs (ROSVM)

One of the main benefits of SVMs is that they find a decision hyperplane that maximizes the margin between classes in the data space. Maximizing the margin is expensive, typically requiring quadratic training time in the number of training examples. However, as we saw in the previous section, the task of content-based spam detection is best achieved by SVMs with a high value of C. Setting C to a high value for this domain implies that minimizing training loss is more important than maximizing the margin (see Figure 3.4).

Online SVMs re-compute an exact solution to a full SVM optimization problem on the entire set of seen data each time a new example violates the KKT conditions. While the method of seeding an SVM with the previous best hypothesis speeds this computation, the overall computation time incurred by an Online SVM is still prohibitive for large-scale spam detection systems.

Table 3.2: Execution time for Online SVMs with email spam detection, in CPU seconds. These times do not include the time spent mapping strings to feature vectors. The number of examples in each data set is given in the last row as corpus size.

features    | trec06p | trec05p-1
words       | 12196s  | 66478s
3-mers      | 44605s  | 128924s
4-mers      | 87519s  | 242160s
corpus size | 37822   | 92189


Figure 3.4: Visualizing the effect of C. Hyperplane A maximizes the margin while accepting a small amount of training error. This corresponds to setting C to a low value. Hyperplane B accepts a smaller margin in order to reduce training error. This corresponds to setting C to a high value. Content-based spam filtering appears to do best with high values of C.

Given: dataset X = (x1, y1), ..., (xn, yn), C, m, p:
Initialize w := 0, b := 0, seenData := { }
For each xi ∈ X do:
    Classify xi using f(xi) = <w, xi> + b
    If yi f(xi) < m:
        Find w', b' using SMO on seenData,
            using w, b as seed hypothesis.
    If size(seenData) > p:
        remove oldest example from seenData
    Add xi to seenData
done

Figure 3.5: Pseudo-code for Relaxed Online SVM.

Thus, while SVMs do create high performance spam filters, applying them in practice is overkill. The full margin maximization feature that they provide is unnecessary, and relaxing this requirement can reduce computational cost. We propose three ways to relax Online SVMs:

• Reduce the size of the optimization problem by only optimizing over the last p examples.

• Reduce the number of training updates by only training on actual errors.

• Reduce the number of iterations in the iterative SVM solver by allowing an approximate solution to the optimization problem.

As we describe in the remainder of this section, all of these methods trade statistical robustness for reduced computational cost. Experimental results reported in the following section show that they equal or approach the performance of full Online SVMs on content-based spam detection.

3.3.1 Reducing Problem Size

In the full Online SVMs, we re-optimize over the full set of seen data on every update, which becomes expensive as the number of seen data points grows. We can bound this expense by only considering the p most recent examples for optimization

(see Figure 3.5 for pseudo-code).

Note that this is not equivalent to training a new SVM classifier from scratch on the p most recent examples, because each successive optimization problem is seeded with the previous hypothesis w [30]. This hypothesis may contain values for features that do not occur anywhere in the p most recent examples, and these will not be changed. This allows the hypothesis to remember rare (but informative) features that were learned further than p examples in the past.

Formally, the optimization problem is now defined most clearly in the dual form [78]. In this case, the original soft-margin SVM is computed by maximizing, at example n:

W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle,

subject to the constraints [78]:

\forall i \in \{1, \ldots, n\} : \; 0 \leq \alpha_i \leq C \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0

63 To this, we add the additional lookback buffer constraint

j 1,..., (n p) : αj = cj ∀ ∈ { − }

where cj is a constant, fixed as the last value found for αj while j > (n p). Thus, − the margin found by an optimization is not guaranteed to be one that maximizes the margin for the global data set of examples x ,..., xn) , but rather one that { 1 } satisfies a relaxed requirement that the margin be maximized over the examples

x ,..., xn , subject to the fixed constraints on the hyperplane that were { (n−p+1) } found in previous optimizations over examples x ,..., x . (For completeness, { 1 (n−p)} when p n, define (n p) = 1.) This set of constraints reduces the number of free ≥ − variables in the optimization problem, reducing computational cost.

3.3.2 Reducing Number of Updates

As noted before, the KKT conditions show that a well-classified example will not change the hypothesis; thus it is not necessary to re-train when we encounter such an example. Under the KKT conditions, an example xi is considered well-classified when yi f(xi) > 1. If we re-train on every example that is not well-classified, our hyperplane will be guaranteed to be optimal at every step.

The number of re-training updates can be reduced by relaxing the definition of well-classified. An example xi is now considered well-classified when yi f(xi) > M, for some 0 ≤ M ≤ 1. Here, each update still produces an optimal hyperplane. The learner may encounter an example that lies within the margin but still satisfies yi f(xi) > M; such an example means the hypothesis is no longer globally optimal for the data set, but it is considered good enough for continued use without immediate retraining.

This update procedure is similar to that used by variants of the Perceptron algorithm [52]. In the extreme case, we can set M = 0, which creates a mistake-driven Online SVM. In the experimental section, we show that this version of Online SVMs, which updates only on actual errors, does not significantly degrade performance on content-based spam detection, but does significantly reduce cost.

3.3.3 Reducing Iterations

As an iterative solver, SMO makes repeated passes over the data set to optimize the objective function. SMO has one main loop, which can alternate between passing over the entire data set and passing over the smaller active set of current support vectors [69]. Successive iterations of this loop bring the hyperplane closer to an optimal value. However, it is possible that these iterations provide less benefit than their expense justifies; that is, a close first approximation may be good enough. We introduce a parameter T to control the maximum number of iterations we allow. As we will see in the experimental section, this parameter can be set as low as 1 with little impact on the quality of results, providing computational savings.
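Taken together, the three relaxations can be summarized in a short sketch of the online update loop. The sketch below is illustrative only: the inner solver is abstracted as a caller-supplied function (here named retrain, standing in for an SMO-style optimizer seeded with the current hypothesis), and the names stream, buffer_size, update_margin, and max_iters are our own, corresponding to the example stream and the parameters p, M, and T.

    from collections import deque

    def relaxed_online_svm(stream, retrain, buffer_size, update_margin, max_iters):
        """Sketch of a relaxed online SVM update loop (not the exact implementation).

        stream       : iterable of (x, y) pairs; x is a dict of feature -> value, y in {-1, +1}
        retrain      : callable (w, b, seen_data, max_iters) -> (w, b), e.g. a capped
                       SMO-style solver seeded with the current hypothesis; w is a dict
        buffer_size  : p, the number of most recent examples kept for optimization
        update_margin: M, retrain only when y * f(x) < M
        max_iters    : T, iteration cap passed through to the solver
        """
        w, b = {}, 0.0                          # sparse weight vector and bias
        seen_data = deque(maxlen=buffer_size)   # lookback buffer of recent examples
        for x, y in stream:
            score = sum(w.get(f, 0.0) * v for f, v in x.items()) + b
            yield 1 if score >= 0 else -1       # emit the prediction for this message
            if y * score < update_margin:       # relaxed check: skip well-classified examples
                seen_data.append((x, y))        # the deque drops the oldest example when full
                w, b = retrain(w, b, list(seen_data), max_iters)

Each relaxation can be effectively disabled by setting buffer_size to the full stream length, update_margin to 1, or max_iters to a large value.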

3.4 Experiments

In Section 3.2, we argued that the strong performance of SVMs with a high value of C on content-based spam detection shows that the maximum margin criterion is overkill, incurring unnecessary computational cost. In Section 3.3, we proposed ROSVM to address this issue; its relaxations trade away guarantees on the maximum margin hyperplane in return for reduced computational cost. In this section, we test these methods on the same benchmark data sets to see if state of the art performance may be achieved by these less costly methods. We find that ROSVM is capable of achieving these high levels of performance with greatly reduced cost. Our main tests on content-based spam detection are performed on large benchmark sets of email data. We then apply these methods to the smaller data sets of blog comment spam and splogs, with similar performance.

3.4.1 ROSVM Tests

In Section 3.3, we proposed three approaches for reducing the computational cost of Online SMO: reducing the problem size, reducing the number of optimization iterations, and reducing the number of training updates. Each of these approaches relaxes the maximum margin criterion on the global set of previously seen data. Here we test the effect that each of these methods has on both effectiveness and efficiency.

In each of these tests, we use the large benchmark email data sets, trec05p-1 and trec06p.

Testing Reduced Size

For our first ROSVM test, we examine the effect of reducing the size of the optimization problem by only considering the p most recent examples, as described in the previous section. For this test, we use the same 4-mer mappings as for the reference experiments in Section 3.2, with the same value C = 100. We test a range of values of p in a coarse grid search. Figure 3.6 reports the effect of the buffer size p on the (1-ROCA)% performance measure (top) and the number of CPU seconds required (bottom). These results were obtained on machines with 2.8 GHz Intel Xeon CPUs and 4 gigabytes of RAM, run with no other system load during execution.

The results show that values of p < 100 do result in degraded performance,

Figure 3.6: Reduced Size Tests. [Plots of (1-ROCA)% (top) and CPU seconds (bottom) against buffer size p, for trec05p-1 and trec06p.]

although they evaluate very quickly. However, p values from 500 to 10,000 perform almost as well as the original Online SMO (represented here as p = 100,000), at dramatically reduced computational cost.

These results are important for making state of the art performance on large-scale content-based spam detection practical with online SVMs. Ordinarily, the training time would grow quadratically with the number of seen examples. However, fixing a value of p ensures that the cost of each update does not grow with the total number of examples seen. Furthermore, a lookback buffer allows the filter to adjust to concept drift.

Testing Reduced Iterations

In the second ROSVM test, we experiment with reducing the number of iterations. Our initial tests showed that the maximum number of iterations used by Online SMO was rarely much larger than 10 on content-based spam detection; thus we tested values of T ∈ {1, 2, 5, ∞}. Other parameters were identical to the original Online SVM tests.

The results on this test were surprisingly stable (see Figure 3.7). Reducing the maximum number of SMO iterations per update had essentially no impact on classification performance, but did result in a moderate increase in speed. This suggests that any additional iterations are spent attempting to find improvements to a hyperplane that is already very close to optimal. These results show that for content-based spam detection, we can reduce computational cost by allowing only a single SMO iteration (that is, T = 1) with effectively equivalent performance.

Testing Reduced Updates

For our third ROSVM experiment, we evaluate the impact of adjusting the parameter M to reduce the total number of updates. As noted before, when M = 1, the hyperplane is globally optimal at every step.

Figure 3.7: Reduced Iterations Tests. [Plots of (1-ROCA)% (top) and CPU seconds (bottom) against the maximum number of SMO iterations, for trec05p-1 and trec06p.]

Figure 3.8: Reduced Updates Tests. [Plots of (1-ROCA)% (top) and CPU seconds (bottom) against the update margin M, for trec05p-1 and trec06p.]

Reducing M allows a slightly inconsistent hyperplane to persist until it encounters an example for which it is too inconsistent. We tested values of M from 0 to 1, at increments of 0.1. (Note that we used p = 10000 to decrease the cost of evaluating these tests, but did not reduce iterations for these tests.)

The results for these tests appear in Figure 3.8, and show that there is a slight degradation in performance with reduced values of M, and that this degradation is accompanied by an increase in efficiency. Values of M > 0.7 give effectively equivalent performance to M = 1, and still reduce cost.

3.4.2 Online SVMs and ROSVM

We now compare ROSVM against Online SVMs on the email spam, blog comment spam, and splog detection tasks. These experiments show comparable performance on these tasks, at radically different costs. In the previous section, the effect of each relaxation method was tested separately; here, we test the methods together to create a full implementation of ROSVM. We chose the values p = 10000, T = 1, and M = 0.8 for the email spam detection tasks. Note that these parameter values were selected as ones allowing ROSVM to achieve performance comparable to Online SVMs, in order to test the total difference in computational cost.

Experimental Setup

We compared Online SVMs and ROSVM on email spam, blog comment spam, and splog detection. For the email spam task, we used the benchmark corpora trec05p-1 and trec06p, in the standard online ordering. We ran each method on each task, and report the results in Table 3.3. Note that the CPU time reported for each method was generated on the same computing system. This time reflects only the time needed to complete online learning on tokenized data. We do not report the time taken to tokenize the data into binary 4-mers, as this is the same additive constant for all methods on each task. In all cases, ROSVM was significantly less expensive computationally.

Table 3.3: Email Spam Benchmark Data. These results compare Online SVM and ROSVM on email spam detection, using the binary 4-mer feature space. Score reported is (1-ROCA)%, where 0 is optimal.

           trec05p-1                  trec06p
           (1-ROCA)%    CPU sec.      (1-ROCA)%    CPU sec.
OnSVM      0.0084       242,160       0.0232       87,519
ROSVM      0.0090       24,720        0.0240       18,541

3.4.3 Results

The comparison results shown in Table 3.3 are striking in two ways. First, they show that the performance of Online SVMs can be matched and even exceeded by relaxed margin methods. Second, they show a dramatic disparity in computational cost: ROSVM is far more efficient than the normal Online SVM, and gives comparable results. Furthermore, the fixed lookback buffer ensures that the cost of each update does not depend on the size of the data set already seen, unlike Online SVMs. Note that the blog and splog data sets are relatively small, and results on these data sets must be considered preliminary. Overall, these results show that there is no need to pay the high cost of SVMs to achieve this level of performance on content-based detection of spam. ROSVMs offer a far cheaper alternative with little or no performance loss.

3.5 ROSVMs at the TREC 2007 Spam Filtering Competition

The experiments reported in this chapter were performed in late 2006, and suggested that ROSVMs are, indeed, a strong spam filtering methodology. This algorithm was then tested on new data at the TREC 2007 spam filtering competition, again with strong results.

3.5.1 Parameter Settings

Under the TREC guidelines, we were allowed to test our method at a small number of different parameter settings. We tested ROSVMs at three settings, all of which were chosen before the data sets were released. These settings were intended to compare the performance tradeoffs of different sizes of the lookback buffer p and the update margin parameter M. For each trial, the regularization parameter was set to C = 100 and the maximum number of iterations to T = 1. Settings for p and M appear in Table 3.4.

3.5.2 Experimental Results

TREC 2007 employed two different data sets for tests in the idealized filtering scenario. The trec07p data set has been described in Chapter 2, and is now publicly available for research purposes [18]. A second data set, MrX3, is a private corpus that has not been publicly released [18]. This private corpus contains 161,975 examples, of which 8,082 are ham and 153,893 are spam. Note that for this corpus, ham is the minority class. Because the MrX3 corpus is composed of all emails sent to an anonymous but real human user, we can expect that this proportion of ham and spam reflects a real-world distribution in contemporary email filtering.

Table 3.4: Results for ROSVMs and comparison methods at the TREC 2007 Spam Filtering track. Score reported is (1-ROCA)%, where 0 is optimal, with .95 confidence intervals in parentheses.

filter name    details                 trec07p                  MrX3
Bogo           Naive Bayes             0.027  (.017-.043)       N/A
ijsppm [18]    PPM compression         0.0299 (.0177-.0504)     0.0397 (.0228-.0690)
wat1 [19]      Log-Reg 4-mers          0.0105 (.0053-.0208)     0.0096 (.0068-.0137)
wat2 [19]      DMC compression         0.0207 (.0158-.0272)     0.0219 (.0155-.0319)
wat3 [19]      ensemble: wat1, wat2    0.0086 (.0038-.0195)     0.0076 (.0054-.0108)
ROSVM          p = 500, M = 0.8        0.0099 (.0032-.0308)     0.0166 (.0070-.0396)
ROSVM          p = 1000, M = 0.5       0.0103 (.0031-.0337)     0.0054 (.0036-.0080)
ROSVM          p = 5000, M = 0.8       0.0093 (.0021-.0406)     0.0042 (.0024-.0073)

We report results for the ROSVMs at various parameter settings, along with results from BogoFilter (a Naive Bayes filter), PPM compression [6], DMC compression [6, 19], logistic regression using binary 4-mers [19], and a small ensemble using results from both DMC and logistic regression [19]. In general, ROSVMs gave best or near-best performance on both data sets. However, the confidence bounds for the (1-ROCA)% score overlap with those of several methods on trec07p, and with the results for the wat1 logistic regression filter and the wat3 ensemble method on MrX3. Thus, we conclude that ROSVMs are one of several methods that give near-perfect results on these data sets.

3.6 Discussion

In the past, academic researchers and industrial practitioners have disagreed on the best method for online content-based detection of spam on the web. We have presented one resolution to this debate. Online SVMs do, indeed, produce state-of-the-art performance on this task with proper adjustment of the regularization parameter C, but with cost that grows quadratically with the size of the data set.

The high values of C required for best performance with SVMs show that the margin maximization of Online SVMs is overkill for this task. Thus, we have proposed a less expensive alternative, ROSVM, that relaxes this maximum margin requirement, and produces nearly equivalent results.

It is natural to ask why ROSVM performs so strongly on the task of content-based spam detection. After all, not all data allows the relaxation of SVM requirements. We conjecture that email spam has the characteristic that a subset of rare but informative features is particularly indicative of content being either spam or not spam. These indicative features may be sparsely represented in the data set because of spam methods such as word obfuscation, in which common spam words are intentionally misspelled in an attempt to reduce the effectiveness of word-based spam detection. Maximizing the margin may cause these sparsely represented features to be ignored, creating an overall reduction in performance.

It appears that spam data is highly separable, allowing ROSVM to be successful with high values of C and little effort given to maximizing the margin. Future work will determine how applicable relaxed SVMs are to the general problem of text classification.

Indeed, the success of ROSVMs with high values of C suggests that the benchmark data sets are remarkably free from noise, as very little regularization is required to achieve strong classification performance. In Chapters 6 and 7, we will explore the impact of added noise in these data sets, and will show that additional regularization (with lower values of C) is necessary in such cases.

Finally, we note that the success levels of ROSVMs, along with ensemble methods and logistic regression, are essentially perfect, making one or fewer errors per thousand messages in the idealized online scenario. While some observers have concluded that the spam filtering problem is therefore solved, we contend that the filtering problem can only be considered solved when such performance levels are evident in experimental tests reflecting real-world conditions rather than the idealized case. Thus, in the remainder of this dissertation we move forward by investigating more realistic filtering scenarios.

Chapter 4

Online Active Learning Methods for Spam Filtering

In the previous chapter, it was demonstrated that machine learning methods such as Logistic Regression and ROSVMs give near-perfect classification performance for online filtering of email spam data when label feedback is provided for each message. However, in real-world settings, users may be unwilling to label every message sent to them, in part because being required to give exhaustive feedback on all messages largely defeats the purpose of automated filtering. This chapter investigates the use of online active learning to reduce the amount of label feedback required from users without a large reduction in classification performance compared with results from the idealized filtering scenario.

4.1 Re-Thinking Active Learning for Spam Filtering

Active learning methods have been developed in the machine learning community to reduce labeling cost by identifying informative examples for which to request labels. It has been shown in practice that only a small portion of a large unlabeled data set may need to be labeled to train an active learner whose classification performance rivals (or in some cases exceeds) that attained when all of the training data has been labeled [56, 36, 74, 91, 10].

Thus, active learning is an appealing tool for real-world spam filtering.

The pool-based approach to active learning has previously been applied to spam filtering, with good results [57, 87, 16]. Similar to prior results in text classification [56], it has been shown that only a small subset of a larger unlabeled email data set needs to be labeled to achieve strong performance. Several active learning methods are appropriate for this task, including uncertainty sampling [56], version space reduction [91], and query by committee [36]. However, the iterative pool-based approach is computationally expensive, often requiring many passes through the entire unlabeled data set. Segal et al. introduced an efficient approximation that reduced this cost, but still required at least one full pass through an entire unlabeled data set before any labels could be requested [87].

We depart from the pool-based approach and investigate the novel use of online active learning methods for spam filtering. In this online case, the filter is exposed to a stream of messages and is asked to classify them one by one. At each point in time, the active filter may choose to request a label for the given example.

The goal here is to create a strong classifier while requesting as few labels as possible.

This approach has several advantages over the pool-based methodology. First, it reflects the actual application scenario of real-world spam filters, which are applied in online (and often real-time) settings. A user may be much more willing to label a new incoming message than one from the past, especially if label requests are made relatively infrequently. Second, online active learning enables solutions with only O(1) additional computation and storage costs. The online active learning scenario involves no repeated passes over unlabeled data, which is the primary source of computational cost in pool-based active learning, and does not require storage of a large pool of unlabeled examples.

In the remainder of this chapter, we review related work in both pool-based and online active learning. We then describe several online active learning strategies for linear classifiers. We test these methods on spam filtering tasks using three of the best performing learners from Chapters 2 and 3: Perceptron with Margins, Logistic Regression, and ROSVMs. We find that online active learning methods greatly reduce the number of labels needed to achieve strong classification performance on two large benchmark data sets. These results exceed the performance of uniform subsampling on both data sets, and also out-perform the pool-based active learning methods from the 2006 TREC spam filtering competition by at least an order of magnitude on the (1-ROCA)% evaluation metric. We conclude with a discussion of the implications of these results for spam filtering and user interface design.

4.2 Related Work

There are many active learning approaches in the machine learning literature [56, 91, 36, 74, 42, 11, 29]. Although pool-based active learning has received more attention, there has also been significant work in online active learning (sometimes referred to as label efficient learning). In this section, we explore connections between the pool-based approach and the online approach to the goal of spam filtering with reduced human labeling effort. We also discuss related attempts to reduce human labeling effort using a semi-supervised learning approach.

4.2.1 Pool-based Active Learning

As discussed in the introduction, a variety of pool-based active learning methods have been proposed in the literature [56, 91, 36, 74]. There have been several examinations of pool-based active learning for spam filtering [87], including a task in the 2006 TREC spam filtering competition [16]. Pool-based active learning assumes that the learner has access to a pool of n unlabeled examples, and is able to request labels for up to m ≪ n of these examples.

There are several methods for selecting these m examples. Uncertainty sampling [56] requests the m examples for which the current hypothesis has the least confidence in classification. Other methods explicitly seek examples whose labels are expected to most reduce the size of the version space [91]. The Query by Committee algorithm is another approach to version space reduction that relies on predictions from hypotheses sampled from the version space [36]. It is also possible to request labels for those examples that are estimated to most greatly reduce training error [74].

There are two issues with pool-based active learners as applied to spam filtering. The first is cost. In an iterative pool-based scheme, each of the n examples must be re-evaluated on each iteration. Some methods, such as version space reduction, Query by Committee, and estimation of error rate reduction, are expensive in practice. But even with inexpensive evaluation methods, such as uncertainty sampling or the simple method of version space reduction [91], the cost of active learning is still O(ni) for a pool of n examples over i iterations. For large email systems, this cost may be prohibitive. Segal et al. have proposed a method of reducing this cost for spam filtering with an approximation to uncertainty sampling [87], but even here the entire pool of unlabeled examples must be examined at least once before labels may be requested. Thus, the main overhead in pool-based active learning is in choosing examples for which to request labels.

The second issue with pool-based active learning is that in practical settings, spam filtering is most naturally viewed as an online task. Emails enter a system in a stream, not a pool. Online active learning enables a natural user interface for requesting labels from actual users in real time.

Pool-based active learning does have one potential advantage over online active learning. Because pool-based active learning considers an entire data set at once, it is possible that pool-based active learning may identify the optimal subset of training examples. Online active learning necessarily uses a greedy strategy that may not select the optimal subset.

4.2.2 Online Active Learning

To our knowledge, ours is the first examination of online active learning methods for spam filtering [79]. The first analysis of online active learning for general machine learning was performed by Helmbold and Panizza [42], under the heading of Label Efficient Learning. This work examined the tradeoffs between the cost of label requests and the cost of errors in an online learning setting. There have since been several proposals for Label Efficient learning methods for linear classifiers [11, 29], but to our knowledge the b-Sampling approach (reviewed in Section 4.3.1) is the only approach that has been analyzed that does not require either the total number of examples in the data stream or the maximum number of label requests to be specified in advance. Because practical spam filtering is performed in an essentially unbounded online setting, it is important not to have such restrictions.

We should also point out that Query by Committee [36] is essentially an online active learning algorithm. We do not examine it in this chapter because the process of sampling hypotheses from the version space is too expensive for practical spam filtering.

4.2.3 Semi-Supervised Learning and Spam Filtering

Active learning is not the only machine learning approach for reducing dependency on labeled data. The semi-supervised learning paradigm assumes that a small amount of labeled data is available for training, along with a large pool of unlabeled data [12]. However, where pool-based active learning seeks to select useful examples from the unlabeled pool for labeling, semi-supervised learning attempts to make use of the unlabeled data directly.

There are several common approaches to semi-supervised learning. Self-training applies a learner trained on a small amount of data to a larger, unlabeled data pool [12]; examples that the learner labels with high confidence are then treated as labeled examples for additional training. Co-training relies on redundancy in the feature space, using independent learners trained on conditionally independent subsets of the feature space to train each other on the unlabeled data [5]. Transductive SVMs (TSVMs) seek to solve a joint optimization problem, in which the classical soft-margin SVM problem for labeled data is combined with an optimization problem seeking to assign labels to unlabeled data with the minimum possible disagreement [46].

Although attempts have been made to apply semi-supervised learning methods to spam filtering, thus far these have not been successful on benchmark filtering data sets. The 2006 ECML/PKDD learning challenge tested several methods for semi-supervised learning on a small data set of email spam and ham of reduced dimensionality, with initially promising results that showed TSVMs to be a strong methodology [4]. However, a following study attempting to apply both self-training and TSVMs to online spam filtering using the TREC data sets showed that these semi-supervised methods only degraded results [67]. Our own attempts in this regard, conducted independently in the winter of 2008, found that both self-training and co-training degraded results in cases where more than ten examples were labeled in the data set. We conjecture that the setting of high dimensionality with many sparsely represented but relevant features causes difficulty for semi-supervised approaches. Thus, at the time of writing, the effective application of semi-supervised learning methods for spam filtering remains an open problem.

Figure 4.1: Online Active Learning. [Flowchart: each new incoming message from the message stream is classified as ham or spam; the filter then decides whether to request a label, and if so, obtains the label and updates the filter.]

4.3 Online Active Learning Methods

The basic online active learning framework is shown in Figure 4.1. Messages come to the filter in a stream, and the filter must classify them one by one. At each point, the filter has the option of requesting a label for the given message. The goal is for the filter to achieve strong classification performance with as few label requests as possible.

The issue considered in this section is: given a stream of unlabeled examples, how should the filter decide when to request a label? We examine several schemes for making this decision. The first is a randomized label efficient method first proposed for linear classifiers such as classical Perceptron [10]. The second is similar, but uses a logistic sampling rule that we introduce for comparison. The third is our fixed margin variant, which borrows from the idea of uncertainty sampling by requesting labels for examples lying within a fixed distance of the classification hyperplane. Finally, we include two baseline methods for comparison: uniform random subsampling and first-n sampling.

The notation in this section uses pi as the signed distance of example xi from the decision hyperplane. For Perceptron with Margins and ROSVMs, pi is given by f(xi). For Logistic Regression, we can recover pi from the classification function f(xi) using the transformation:

    pi = log( f(xi) / (1 − f(xi)) ) = < w, xi > + b.
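As a minimal illustration of this transformation (the function name below is ours, not part of any filter implementation), the conversion from a logistic output in (0, 1) back to the signed margin is one line:

    import math

    def signed_margin(prob_spam):
        # Invert the logistic link: pi = log(f(xi) / (1 - f(xi))) = <w, xi> + b.
        # prob_spam is assumed to lie strictly between 0 and 1.
        return math.log(prob_spam / (1.0 - prob_spam))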

4.3.1 Label Efficient b-Sampling

Cesa-Bianchi et al. introduced a label efficient method of selective sampling for linear classifiers such as classical Perceptron and Winnow, and gave theoretical mistake bounds and expected sampling rates [10]. We refer to this method as b-Sampling, and describe it here.

Figure 4.2: Label Efficient b-Sampling Probabilities. [Plot of sampling probability against distance from the hyperplane, for b = 0.001, 0.01, 0.1, and 1.]

The b-Sampling rule [10] is: given a sampling parameter b > 0, request a label for example xi with probability Pi, where:

    Pi = b / (b + |pi|).

As |pi| approaches zero, the probability of a label request for xi approaches 1. This makes intuitive sense: the closer an example is to the hyperplane, the less confidence we have in the learner's prediction. There is always some non-zero probability of requesting a label for examples far from the hyperplane, to ensure that the hypothesis is performing well across the entire data space.

The parameter b defines a function relating the sampling probability Pi to the classification confidence |pi|. To help illustrate the effects of varying b, we have mapped Pi against |pi| for several values of b, as shown in Figure 4.2. Note that when |pi| = 0, then Pi = 1 for all values of b > 0. That is, a label is always requested when the hypothesis has zero confidence. A particular choice of b trades off the number of requested labels against classification performance, and will depend on the particular data and user needs.
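As a small illustration (our own sketch, not part of the original b-Sampling implementation), the sampling decision can be written in a few lines; note how larger distances from the hyperplane drive the request probability toward zero, while an example on the hyperplane is always sampled:

    import random

    def b_sampling_request(distance, b):
        """Decide whether to request a label under b-Sampling.

        distance: signed distance pi of the example from the hyperplane
        b       : sampling parameter, b > 0
        """
        probability = b / (b + abs(distance))   # Pi = b / (b + |pi|)
        return random.random() < probability

    # For b = 0.1: an example on the hyperplane (pi = 0) is always sampled,
    # while an example with |pi| = 1 is sampled with probability 0.1 / 1.1, about 0.09.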

4.3.2 Logistic Margin Sampling

The b-Sampling method of online active learning used one particular method of mapping |pi| distance values to Pi sampling probabilities, in part because this method allowed clean theoretical analysis. For comparison, we propose and test another natural mapping from |pi| to Pi, based on a logistic model of confidence probabilities (without theoretical analysis), that we call Logistic Margin Sampling.

For an example xi, let probability qi represent the probability that the predicted label given by sign(pi) matches the true label yi. (As before, pi is the signed distance of xi from the classification hyperplane.) Thus, qi is the confidence that our label is correct.

Figure 4.3: Logistic Margin Sampling Probabilities. [Plot of sampling probability against distance from the hyperplane, for γ = 1, 2, 4, and 8.]

In this Logistic Margin approach, we model the confidence value qi using a logistic function:

    qi = 1 / (1 + e^(−γ|pi|)).

This approximation is reasonable given the work of Platt [70], who showed that a similar sigmoid function is a good model for confidence values for linear classifiers.

Following the intuition behind uncertainty sampling, we request a label for xi with probability:

    Pi = e^(−γ|pi|).

Like b-Sampling, Logistic Margin sampling gives the highest sampling probability to those examples lying closest to the classification hyperplane. Those examples xi with |pi| = 0 are always sampled, and every example has a non-zero sampling probability. The difference is in the shape of the distribution, shown in Figure 4.3 for different values of γ.
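A compact sketch (ours, mirroring the b-Sampling helper above) shows both the modeled confidence qi and the sampling probability Pi under this rule; gamma stands for the shape parameter γ:

    import math
    import random

    def logistic_margin_request(distance, gamma):
        """Decide whether to request a label under Logistic Margin sampling."""
        confidence = 1.0 / (1.0 + math.exp(-gamma * abs(distance)))   # qi, shown for comparison only
        probability = math.exp(-gamma * abs(distance))                # Pi
        # qi rises toward 1 and Pi falls toward 0 as |distance| grows;
        # at distance 0, qi = 0.5 and Pi = 1, so a label is always requested.
        return random.random() < probability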

4.3.3 Fixed Margin Sampling

The previous two methods of online active learning are probabilistic, mapping classification confidence values to sampling probabilities. For comparison, we also propose and test a deterministic variant that we call fixed margin sampling. Fixed margin sampling is a sampling heuristic that can reduce the total number of label requests needed for a given performance level, but offers no theoretical guarantees.

In fixed margin sampling, a confidence threshold c is set as a parameter. The sampling rule is straightforward: request a label for an example xi when (and only when) |pi| < c. Fixed margin sampling is thus visualized as a step function. Unlike the two prior methods, fixed margin sampling does not assign a non-zero sampling probability to all examples. Examples xi with |pi| ≥ c will never have their labels requested. Thus, the theoretical guarantees of b-Sampling do not apply to fixed margin sampling: it is possible that a bad initial hypothesis will continually make mistakes with high confidence. A learner with this hypothesis, “never in doubt, but never correct,” will not receive any label information and thus will never update.

However, in practice, we have found that fixed margin sampling can be effective for spam filtering. This is because the online linear classifiers tend to make low-confidence mistakes before they make high-confidence mistakes, due to the incremental nature of online updates for the linear classifiers we examined. Thus, they avoid this theoretical problem in actual tests. Furthermore, because fixed margin sampling does not request labels for examples about which the learner is confident, this approach may require fewer labels in the long run. This is especially true when the learner is able to achieve a strong hypothesis, as is the case in spam filtering, where filters can achieve extremely high classification performance.
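Putting the pieces together, the overall online active learning loop of Figure 4.1 can be sketched as follows. This is an illustrative sketch of our own: the learner object and its predict_margin(x) and update(x, y) methods are hypothetical names, and fixed margin sampling is used as the decision rule (either probabilistic rule above could be substituted).

    def online_active_learning(stream, learner, threshold):
        """Sketch of the online active learning loop with fixed margin sampling.

        stream   : iterable of (x, y) pairs; y is consulted only when a label is requested
        learner  : object with predict_margin(x) -> signed distance pi, and update(x, y)
        threshold: the fixed margin c; request a label only when |pi| < c
        """
        for x, y in stream:
            margin = learner.predict_margin(x)
            yield 1 if margin >= 0 else -1     # classify the message as ham (+1) or spam (-1)
            if abs(margin) < threshold:        # low-confidence prediction: ask for the label
                learner.update(x, y)           # train only on the examples the user labels

Because updates happen only on requested labels, the training cost falls automatically as the hypothesis improves and fewer messages land inside the margin.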

4.3.4 Baselines

For comparison, we provide two baseline methods for evaluating performance.

• Uniform Subsampling. With this method, a fixed probability value q is set as a parameter, and a label is requested for each example with probability q. Using uniform subsampling as a baseline comparison is common practice in active learning research, including active learning for spam filtering [16]. This allows us to examine the difference between a learner trained on n examples drawn at random versus a learner trained on n examples selected by an active learning method.

• First-n Sampling. We also test the performance of methods that simply request labels for the first n messages in the message stream. This allows us to compare a learner trained on n actively sampled examples against a learner trained on the first n examples. Note that if we can assume that examples are randomly ordered, this first-n approach is effectively a uniform subsampling, but one that allows the learner's knowledge to be exploited for the maximum possible time.

4.4 Experiments

In this section, we report results from experiments testing the effectiveness of the online active learning methods of Section 4.3, used with learning methods from Chapters 2 and 3, on spam filtering tasks. These results show strong support for the use of online active learning in spam filtering. We used Perceptron with Margins, ROSVMs, and Logistic Regression as filtering methods for our experiments in this chapter. The binary 4-mer feature space was used as described in Chapter 2.

4.4.1 Data Sets

For the first experiments in this chapter, we use the trec05p-1 and trec06p data sets described in Chapter 2 [24, 16], and test Perceptron with Margins, Logistic Regression, and ROSVMs in conjunction with each online active learning method. Initial testing and parameter tuning were performed on a separate tuning data set, the publicly available spamassassin corpus. We also tested Perceptron with Margins and Logistic Regression on the trec07p data set. Results for ROSVMs with the trec07p data set are described separately, in Section 4.4.5, where we describe the inclusion of an online active learning task at the TREC 2007 spam filtering competition.

4.4.2 Classification Performance

We tested each of the active learning methods with each of the base machine learning methods on both trec05p-1 and trec06p. We varied the parameter values of the active learning methods to assess performance with different total numbers of label requests. For b-Sampling, b was varied between 0.001 and 1; for Logistic Margin sampling, γ was varied between 1 and 16; for Fixed Margin sampling, the threshold c was varied from .001 to 2.4; for uniform subsampling, the sampling probability q was varied from .001 to 1; and for first-n sampling, n was varied from 10 to 75,000. Probabilistic tests were repeated five times each, with mean results reported.

The results are given in Figures 4.4 through 4.11, using the (1-ROCA)% measure [24] as the performance measure, with 0.95 confidence intervals shown as vertical bars. Each graph shows the (1-ROCA)% score (on the vertical axis) achieved over the entire online test by an active learner requesting a given number of labels (on the horizontal axis) during that test. Probabilistic methods requested differing numbers of labels for each test; horizontal bars indicate 0.95 confidence intervals for label requests at a given parameter setting.

These results show a clear win for the online active learning methods, compared to both first-n sampling and uniform random subsampling. First-n sampling also performs worse than uniform random subsampling; we conjecture that this is because first-n sampling is more vulnerable to concept drift and new patterns in the data stream. (Note that we did not test first-n sampling with ROSVMs.)

The online active learning methods dominated random subsampling in nearly the full range of label requests tested. Exceptions were the trec07p data set, where the confidence intervals of all methods began to overlap above 10,000 requested labels, and results for very small numbers of requested labels (such as 100 labels), as shown on the far left in Figures 4.5 and 4.6. In all cases, the online active learning methods achieved, using only 1,000 labels, results that the baseline methods required ten times as many labeled examples to achieve. Note also that although the different learning methods achieve different performance levels, with Perceptron with Margins being the weakest of the methods and Online SVMs and Logistic Regression being the strongest, the online active learning methods give the same improvement over uniform random subsampling in all cases.

[Figures 4.4 through 4.11 plot the (1-ROCA)% score against the number of label requests for each sampling method, with 0.95 confidence intervals.]

Figure 4.4: Online Active Learning using Perceptron with Margins, on trec05p-1 data.

Figure 4.5: Online Active Learning using Logistic Regression, on trec05p-1 data.

Figure 4.6: Online Active Learning using ROSVM, on trec05p-1 data.

Figure 4.7: Online Active Learning using Perceptron with Margins, on trec06p data.

Figure 4.8: Online Active Learning using Logistic Regression, on trec06p data.

Figure 4.9: Online Active Learning using ROSVM, on trec06p data.

Figure 4.10: Online Active Learning using Perceptron with Margins, on trec07p data.

Figure 4.11: Online Active Learning using Logistic Regression, on trec07p data.

There are interesting comparisons among the online active learning methods as well. Both Fixed Margin and Logistic Margin sampling tended to out-perform label efficient b-Sampling, often requiring only a fraction of the labels to achieve equivalent performance levels. This is most often true when the total number of labels requested is fewer than 10,000; with higher numbers of label requests, the performance of the different methods tends to converge. An exception to this observation occurred with the trec05p-1 data set, where at very low levels of label requests the randomized b-Sampling and Logistic Margin sampling methods out-performed Fixed Margin sampling.

4.4.3 Comparing Online and Pool-Based Active Learning

How effective are the online active learning methods compared to pool-based active learners? On the surface, it would appear that pool-based active learners would have a significant advantage over the online active learners, as the pool-based methods have more information at their disposal. In particular, pool-based methods are able to compare among many examples in the unlabeled pool, while online active learning methods are restricted to evaluating each example in isolation. In this section, we provide an empirical evaluation of these two approaches.

Experimental Design

As discussed in Section 4.2.1, pool-based active learning methods were tested in the 2006 TREC spam filtering competition [16]. In this scenario, the learners were allowed to select up to n examples for labeling from a pool composed of the first 90% of the trec06p data set for training. Evaluation was then performed on the remaining 10% of this data set, in batch mode. The two best performing methods used pool-based uncertainty sampling with the ijs compression-based method described in Chapter 2, and pool-based uncertainty sampling with the osbf-lua method (a Naive Bayes variant) described more fully in Section 6.3.3.

To provide an apples-to-apples comparison between online and pool-based methods, we tested an online active learning method in a batch setting identical to that employed by the pool-based methods in TREC 2006. That is, we used online active learning with Fixed Margin sampling to train on examples in the first 90% of the trec06p data set, and then applied this trained filter to the remaining 10% of the data set for evaluation.

Results

The results of this test, using Logistic Regression and Fixed Margin sampling, were surprising to us. The results, given in Figure 4.12, show that the online active learning methods actually give superior results when compared to the best pool-based methods, for similar numbers of label requests. How is this possible? The best performing methods from TREC 2006 used variants of pool-based uncertainty sampling [16], which proceeded in rounds requesting n examples for labeling. However, because there is significant redundancy in email spam data, it is possible that in any given round many redundant examples will have similar uncertainty scores. In such cases, pool-based uncertainty sampling will request labels for all of these redundant examples, reducing the benefit of the requested labels.

In contrast, the online active methods update for each new labeled example and benefit from new label information immediately, reducing the tendency to request labels for redundant examples. Pool-based methods could, of course, easily be made resistant to redundant examples by only selecting a single example for labeling in each particular round. However, the pool-based methods would still need to re-assess all of the examples in the unlabeled pool after each update, so this strategy would increase computational cost.

Figure 4.12: Comparing Pool-based and Online Active Learning on trec06p. [Plot of (1-ROCA)% against label requests for pool-based uncertainty sampling with the ijs and osbf-lua filters, and for fixed margin sampling with logistic regression.]

Figure 4.13: Perceptron with Margins, sampling rate over time, trec05p-1. [Plot of sampling rate against examples seen for the uniform, b-Sampling, logistic margin, and fixed margin methods.]

4.4.4 Online Sampling Rates

Aside from examining overall performance levels, it is useful to consider how the sampling rates of the online active learners change over time. In Figure 4.13, the sampling rate for each active learning method is plotted against the number of examples seen, with parameter values for each method set so that each requests roughly 3,000 labels over the entire data set. (Results are shown for Perceptron with Margins on trec05p-1; other results are similar.)

Over time, the number of labels requested by the active learning methods tends to decrease, with the Logistic Margin and Fixed Margin methods requesting labels for less than 1% of the examples by the end of the trial, compared to an overall sampling rate of 3.2% on these tests. The sampling rate of b-Sampling decreases steadily, but more slowly, over time. Naturally, the sampling rate of uniform subsampling remains constant.

The decrease in sampling rate over time by the active learners is due to the fact that the quality of the hypothesis for each learner improves with additional labeled examples. This allows the learner to make more predictions with high confidence over time, reducing the number of label requests.

This observation is of practical value in large-scale email systems. Online active learning methods not only reduce the number of labeled examples needed to make a spam filtering system operable, but will also greatly reduce the number of labels needed to maintain strong classifiers over time.

4.4.5 Online Active Learning at the TREC 2007 Spam Filtering Competition

As previously noted, our suggestion of the use of online active learning was adopted for the 2007 TREC spam filtering competition [18]. In this competition, an additional constraint was added to the problem: each filter was given a label quota of n labels, and allowed to make no more than n total label requests during a given online active learning filtering run.

The best performing methods in this competition all used some form of the fixed-margin sampling that we had earlier proposed [79]. One issue that was highlighted in these experiments was the difficulty of selecting an appropriate parameter for fixed-margin sampling for a given value of the label quota [85]. The best approach to this problem was to tune on prior TREC data sets [19]. However, ideally one would have a closed-form solution that allows a user to fix a sampling parameter in advance with a reasonable expectation of the resulting labeling effort and classification performance. This remains an open problem.

4.5 Conclusions

We have proposed an online active learning framework for spam filtering, and have explored several reasonable approaches to determining when to request labels for new examples. We believe that online active learning is the most appropriate form of active learning for spam filtering. These methods give improved results over uniform subsampling and over prior pool-based active learning methods that select batches of n unlabeled examples for labeling, reducing the number of labels needed to achieve high performance with negligible computational cost. Furthermore, the online active learning approach is well suited to this domain, because spam filtering is an inherently online task.

Not only do online active learning methods reduce the number of email labels needed for strong performance, they also reduce the computational cost of training. Training updates in the online active learning framework occur only when a label request has been made. Because these methods require only a fraction of the total possible labels, training cost is necessarily reduced. This finding is of key importance for large email systems filtering millions or billions of messages each day.

These results have implications not only for the statistical side of spam filtering, but for user interface design as well. Because online active learners require only a few label requests, and the rate of label requests decreases over time, it is possible to envision an email system that asks a user to label perhaps one or two messages per day (see Figure 4.14). Such a system would have strong filtering performance while requiring little feedback from the user. Given the low cost and high performance of this approach, we recommend online active learning methods as an effective general strategy for real-world spam filtering when users are willing and able to provide accurate label feedback for a small fraction of all messages received. In the following chapter, we examine the scenario in which users are unwilling or unable to give feedback for messages predicted to be spam.

Figure 4.14: Screen shot of proposed user interface for active requests for label feedback. In this framework, the user would be encouraged to label a small number of informative messages.

Chapter 5

Online Filtering with One-Sided Feedback

In many real-world filtering settings, online labeling feedback is only available for examples which were predicted to be ham. One-sided feedback can cripple the performance of classical mistake-driven online learners such as Perceptron. Previous theoretical work under the Apple Tasting framework showed how to transform standard online learners into successful learners from one-sided feedback [43]. However, we find in practice that this transformation may request more labels than necessary to achieve strong performance. In this chapter, we modify two online active learning methods to suit the one-sided feedback scenario, and find that both reduce the number of labels requested in practice. One method is the use of Label Efficient active learning. The other method, somewhat surprisingly, is the use of margin-based learners without modification, which we show combines implicit active learning with a greedy strategy for managing the exploration/exploitation tradeoff. Experimental results show that these methods can be significantly more effective in practice than those using the Apple Tasting transformation, even on minority class problems.

Figure 5.1: Spam Filtering with One-Sided Feedback. [Diagram: the learner classifies each message from the stream into the inbox or the spam box; user feedback is received only for messages placed in the inbox.]

5.1 The One-Sided Feedback Scenario

The problem of learning from one-sided feedback was introduced by Helmbold, Littlestone, and Long [43], who described it as the Apple Tasting problem: the problem of learning to identify sweet apples from visual cues. Of course, an apple taster gets no feedback from those apples it rejects, only from those that it actually chooses to taste. This is a variant of the standard online learning framework. In one-sided filtering, the learner only receives feedback when it predicts a ham label for the given example (see Figure 5.1). That is, the only way a learner can see the true label of an example is to predict that it is a member of the ham class. This setting may occur in practice if a user never checks any messages sent to the spam folder in a given filtering system, whether out of ignorance or laziness. It would also occur in systems where predicted spam messages are deleted or removed entirely from the system, in which case the user would never be given the opportunity to give feedback on such messages.

The problem of learning from one-sided feedback defeats several classical online learning algorithms, such as Perceptron [73] and Winnow [59]. These mistake-driven algorithms suffer in this scenario, especially in the presence of noise, as the online updates tend to sacrifice recall for precision and may recognize very few hams. Helmbold, Littlestone, and Long showed how to convert any standard online learner, including these mistake-driven methods, into an apple tasting algorithm by randomly sampling from those examples predicted to be in the spam class [43], with resultant mistake bounds. However, this method samples uniformly from the predicted spams, and thus does not necessarily request labels for the most informative examples.

5.2 Contributions

We propose that practical filtering on one-sided feedback streams is best done with active learning methods. We show that Label Efficient active learners perform well from one-sided feedback, requesting fewer labels than the Apple Tasting methods. We also show, somewhat surprisingly, that margin-based learners such as ROSVMs and Perceptron with Margins both learn effectively from one-sided feedback without modification. In the one-sided feedback scenario, it turns out that margin-based learners implicitly use an active learning strategy and a greedy search solution to the exploration/exploitation tradeoff. Our experiments show that both types of active methods can achieve high levels of performance with many fewer labels than the Apple Tasting solution, and that the margin-based methods are often the most effective.

Figure 5.2: One-Sided Feedback Breaks Perceptron. Here, white dots are ham examples, the black dots are spam, the dashed line is the prediction hyperplane, and the shaded area predicts spam. Examples 1, 2, and 3 each cause no updates: 1 and 3 are correct, and no feedback is given on 2. Examples 4 and 30 are the only examples causing updates, ratcheting the hyperplane until no hams are correctly identified.

The remainder of this chapter proceeds as follows. Section 5.3 gives preliminary background on one-sided feedback and reviews the Apple Tasting transformation with an eye towards possible improvements for practical use. In Section 5.4, we discuss the application of Label Efficient active learners to one-sided feedback problems. In Section 5.5, we show that in many cases margin-based methods can learn effectively from one-sided feedback without transformation, due to implicit uncertainty sampling and a greedy approach to the exploration/exploitation tradeoff. Section 5.6 covers difficulties posed for learning from one-sided feedback in minority-class distributions, in a text classification scenario similar to spam filtering. Experimental results follow, and the final section contains our conclusions for this chapter.

5.3 Preliminaries and Background

In this chapter, we are concerned with the problem of online learning from one-sided feedback, first described as the Apple Tasting problem [43]. We assume a distribution D on a space of examples X = R^d, and each example xi has an associated label yi. There is a learner L with a hypothesis function h(·) : R^d → {−1, 1} predicting the label of a given example. The learner is allowed to update its hypothesis when it is shown an example and label pair (xi, yi). There is an oracle T that returns a (possibly noisy) label yi for a given xi.

Learning proceeds in a (potentially unbounded) number of rounds {t1, ..., tmax}. Given D, L, T, for each round ti:

• An example xi is drawn from D.

• L guesses a label h(xi) for xi.

• If h(xi) = 1, then oracle T returns a (possibly noisy) label y′i and L may update its hypothesis using (xi, y′i).

• However, if h(xi) = −1, then y′i is never revealed to L.

In this chapter, we assume that the cost of requesting a label for an actual spam example is equivalent to the cost of misclassifying a spam as ham, while the cost of requesting a label for an actual ham example is zero. This is equivalent to saying that the only way to request a label for a given example is to predict that it is ham.
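A minimal simulation of this protocol makes the asymmetry concrete. The sketch below is ours and assumes a learner object exposing predict(x) and update(x, y) methods (hypothetical names); labels are revealed to the learner only when it predicts ham (+1).

    def run_one_sided_feedback(stream, learner):
        """Sketch of the one-sided feedback protocol (illustrative only).

        stream : iterable of (x, y) pairs, with y = +1 for ham and y = -1 for spam
        learner: object with predict(x) -> {-1, +1} and update(x, y) methods
        """
        mistakes = 0
        for x, y in stream:
            guess = learner.predict(x)
            if guess != y:
                mistakes += 1          # scoring uses the true label,
            if guess == 1:             # but the label is revealed to the learner
                learner.update(x, y)   # only when the message is predicted ham
            # if guess == -1, the true label y is never shown to the learner
        return mistakes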

5.3.1 Breaking Classical Learners

To illustrate the issues surrounding the one-sided feedback problem, we first show that noisy one-sided feedback can break classical mistake-driven online learners such as Perceptron [73] and Winnow [59]. These learners update their hypotheses only on mistakes. However, in the one-sided feedback scenario they are never told about mistakes made when they predict a spam label, so those mistakes cause no updates. The only mistakes that result in updates are those for which a ham label is predicted for a spam example.

Updating on one-sided errors creates a ratcheting effect shown in Figure 5.2.

Once the hyperplane has been shifted towards the ham side, it can never be shifted back. If the noise rate p > 0, then the hypothesis will converge to one which predicts a spam label for every example. This can cause recall levels for the ham class to suffer greatly, as we show in our experiments. Thus, purely mistake-driven learners are unsuitable for one-sided feedback.

5.3.2 An Apple Tasting Solution

Helmbold et al. proposed a solution to the one-sided feedback problem and analyzed it theoretically using the mistake bound model from learning theory [43]. They showed that if a learner can be forced to make a maximum of either M_p mistakes on ham examples or M_n mistakes on spams from full feedback from a given (noiseless) distribution, then it can be transformed into a learner making at most M_p − M_n + 2√(T · M_n) mistakes from one-sided feedback on that distribution. These mistake conditions can be met for Perceptron or Winnow by setting an initial bias.

Their solution (hereafter, the "Apple Tasting method") relies on occasional random sampling from those examples which are predicted to have spam labels. When an example is sampled, a label request is made to the oracle by flipping the predicted label from −1 to 1. A label request is made on step i when h(x_i) = −1, with probability p = (1 + m_n)/i, where m_n is the number of mistakes found so far among the examples for which labels have been specifically requested [43]. Intuitively, this method samples the learner's error rate to determine how much exploration is needed. As (1 + m_n)/i grows, more labels are requested because the observed error rate is high. When this estimate of the error rate decreases, fewer labels are requested.
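As a small illustration, the sampling rule just described can be written in a few lines of Python; the function below is a sketch under the stated rule, not code from the Apple Tasting paper.

    import random

    def apple_tasting_should_sample(step_i, mistakes_found):
        # Request a label for a predicted spam with probability (1 + m_n) / i,
        # where m_n counts mistakes among previously requested labels.
        p = min(1.0, (1.0 + mistakes_found) / float(step_i))
        return random.random() < p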

Given: β > 0, data set X = (x_1, y_1), ..., (x_n, y_n)
Initialize: w := 0, K := 0
For each x_i ∈ X do:
    Compute f(x_i) = <w, x_i>
    Classify x_i using y'_i = sign(f(x_i))
    Draw Bernoulli random variable Z_i ∈ {0, 1} with parameter b_i / (b_i + |f(x_i)|), where b_i = β√(1 + K)
    If Z_i = 1 then
        Request label y_i
        If y'_i ≠ y_i then
            Update w
            K := K + 1

Figure 5.3: Pseudo-code for the Label Efficient active learner.
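A runnable Python version of Figure 5.3 might look as follows. This is a sketch assuming normalized dense example vectors and labels in {−1, +1}; the oracle object and its label() method are hypothetical.

    import math
    import random
    import numpy as np

    def label_efficient_perceptron(stream, oracle, beta, dim):
        w = np.zeros(dim)
        K = 0                                            # mistakes on requested labels
        for x_i in stream:
            f = float(np.dot(w, x_i))
            y_hat = 1 if f >= 0 else -1                  # y'_i = sign(f(x_i))
            b_i = beta * math.sqrt(1 + K)
            if random.random() < b_i / (b_i + abs(f)):   # Bernoulli draw Z_i
                y_i = oracle.label(x_i)                  # request the true label
                if y_i != y_hat:
                    w += y_i * x_i                       # standard Perceptron update
                    K += 1
        return w

Examples far from the hyperplane (large |f(x_i)|) are sampled with low probability, which gives the uncertainty-driven behavior described in the text.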

5.3.3 Improving on Apple Tasting

While this Apple Tasting transformation offers a robust solution to the problem of learning from one-sided feedback, we note that there are areas of possible improvement for practical use, using ideas from active learning.

Active Learning The Apple Tasting method samples from the spam predictions in a uniform manner that ignores the confidence of the prediction. Although uniform sampling enables theoretical guarantees of correctness for purely separable data [43], it is not always the most efficient way to learn a good hypothesis in practice. Active learning methods attempt to choose informative examples to learn from. Uncertainty sampling is one such method, in which examples are chosen based on how uncertain the current hypothesis is about their label [56]. Other active learning methods include Query by Committee, in which disagreement among possible learners is cause for sampling [36]; choosing examples based on how much they would reduce the current version space [14]; and estimating how much the example would reduce training error if its label were known [74].

We propose that active learning can improve on the Apple Tasting bounds in practice on one-sided feedback problems. The methods we explore in this chapter are based on uncertainty sampling, which is computationally efficient. Label Efficient learners use uncertainty sampling methods to adjust the probability that a label will be requested for a predicted spam. We also show that margin-based learners implicitly use a fixed form of uncertainty sampling to request labels. Exploring other active learning methods in one-sided feedback problems remains for future work.

Exploration/Exploitation Tradeoff The Apple Tasting solution strikes a particular balance between exploration and exploitation, to use terminology from reinforcement learning [90], by requesting more labels when the estimated error rate is high. Exploration of the data space allows the learner to acquire new knowledge and better estimate the optimal hypothesis, but this exploration may incur cost associated with label requests. Exploiting previous knowledge carries no exploration cost, but may incur misclassification cost if the hypothesis is faulty. Determining the optimal balance between exploration and exploitation a priori for an arbitrary task is an open problem, and different approaches may be better for different situations. The Label Efficient learner uses a strategy similar to that of Apple Tasting, attempting to explore more when the observed error rate is high. Margin-based learners implicitly use a greedy exploration strategy, described in Section 5.5, that can request many fewer labels in practice but offers no theoretical guarantees.

Categorizing the Learners The methods we explore in this chapter can be organized as follows. Classical Perceptron, with no active learning and no exploration, fails on one-sided learning. Apple Tasting adds exploration to solve this problem, but without active learning may request more labels than necessary. Label Efficient methods request fewer labels and maintain theoretical guarantees. Margin-based methods use a greedy exploration to further reduce the number of needed labels in many cases, but at the sacrifice of theoretical guarantees. The following sections examine these last two learners in more detail.

Learner         | Active         | Exploration
Perceptron      | no             | none
Apple Tasting   | no             | error-rate driven
Label Efficient | yes            | error-rate driven
Margin-Based    | yes (implicit) | greedy

5.4 Label Efficient Online Learning

An online active learner must decide whether or not to request a label for a given example at that particular time step, without knowledge of the future or the ability to reconsider at a future point. When labels are costly, this creates resource allocation issues. The Label Efficient problem is to learn well with few label requests. Although this problem was posed in the standard online learning setting [42], it has natural application with one-sided feedback, where sampling from the spam predictions carries cost, and sampling from ham predictions is essentially free.

To our knowledge, this is the first use of label efficient learners on one-sided feedback problems.

Cesa-Bianchi et al. proposed a label efficient active learner based on the Perceptron algorithm (see Figure 5.3 for pseudo-code, simplified for the case of normalized example vectors) and gave bounds on the expected number of mistakes and the expected number of label requests for linearly separable data [10]. The method adapts to the number of known mistakes seen so far, and samples more frequently when higher error rates are observed. Furthermore, unlike other label efficient learners that have been proposed, this method does not require the user to specify a maximum number of examples to label, and instead manages the exploration/exploitation problem adaptively, given an initial setting of parameter β.

Finally, this active method takes uncertainty into account and is more likely to sample points that lie close to the classification hyperplane.

This method can be applied in the case of one-sided feedback. Here, requesting a label for a given example forces the learner to predict a ham label for that example. Label requests are made on all ham examples. In terms of the pseudo-code, Z_i = 1 whenever y'_i = 1. The method was originally analyzed in terms of the classical Perceptron algorithm; here we apply it to other linear classifiers as well.
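The one-sided modification described here amounts to changing only the sampling decision of Figure 5.3. A hedged sketch, reusing the notation above:

    import random

    def should_request_label(y_hat, f, b_i):
        # One-sided variant of the sampling rule in Figure 5.3: Z_i = 1 whenever
        # the prediction is ham (that label arrives for free); for a predicted spam,
        # requesting a label means delivering the message to the inbox.
        if y_hat == 1:
            return True
        return random.random() < b_i / (b_i + abs(f))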

5.5 Margin-Based Learners

One claim of this chapter is that margin-based learners can learn effectively from one-sided feedback. In this section, we demonstrate how margin updates enable learning from one-sided feedback, revealing implicit uncertainty sampling and a greedy exploration/exploitation tradeoff strategy. This section concludes with an examination of conditions that can cause this greedy strategy to fail.

5.5.1 Two Margin-Based Learners

Both Perceptron with Margins and ROSVMs are linear classifiers that update their linear hypothesis not only on mistakes, but also on correctly classified examples that lie close to the classification hyperplane, enabling learning from one-sided feedback.

Figure 5.4: Margin-Based Pushes and Pulls. Examples 1, 2, and 3 cause no updates, as before. But Examples 4 and 25, each correctly classified but within the margins, push the hyperplane towards the spam. Example 15, a misclassified spam, pulls the hyperplane towards the ham.

Like ROSVM, the Perceptron with Margins updates its hypothesis both on mistaken predictions and on correctly predicted examples that lie within the margin of the classification hyperplane. Note that classical Perceptron is equivalent to Perceptron with Margins using parameter m = 0, as classical Perceptron only updates on mistakes. This is the critical distinction that allows Perceptron with Margins to learn from one-sided feedback, while classical Perceptron fails.
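The update rule itself is small. The sketch below shows a single Perceptron with Margins step, with the default parameters used later in this chapter (m = 2, learning rate 1) as assumed defaults; setting margin = 0 recovers the classical Perceptron.

    import numpy as np

    def pwm_update(w, x_i, y_i, margin=2.0, eta=1.0):
        # Update on mistakes *and* on correctly classified examples inside the
        # margin; labels y_i are in {-1, +1}.
        if y_i * float(np.dot(w, x_i)) <= margin:
            w = w + eta * y_i * x_i
        return w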

5.5.2 Margin-Based Pushes and Pulls

At first, it may seem counter-intuitive that any learner can learn effectively from one- sided feedback without modification. We now show the intuition driving the finding that margin-based learners can indeed learn in this scenario with no modification.

Recall that classical learners such as Perceptron are subject to ratcheting because they can only recognize one kind of mistake in the one-sided feedback scenario. Margin-based learners are resistant to ratcheting as they can update their classification in both directions. As before, misclassified spams still cause updates moving the hyperplane more towards the ham; correctly classified spams have no effect, and misclassified hams have no effect as no feedback is given. Furthermore, correctly classified hams that lie outside the margins also cause no update to occur.

Figure 5.5: Implicit Uncertainty Sampling for Perceptron with Margins. The margin-based learner with hypothesis h and margins m+ and m−, learning from one-sided feedback, reduces to an active learner with hypothesis h′ and margins m+ and h, using uncertainty sampling in the region between h and h′.

The key difference is this: margin-based learners update their hypothesis on correctly classified examples that lie within the margin (see Figure 5.4). The hyperplane may be pulled towards the ham region by misclassified spam and pushed towards the spam region by ham examples classified within the margins. These hypothesis updates are not irreversible, and the hyperplane can converge to a good hypothesis.

5.5.3 Margins, One-Sided Feedback, and Active Learning

Here, we demonstrate the claim that margin-based methods that are applied to one-sided feedback problems implicitly use active learning to sample from spam predictions. We offer a reduction showing that Perceptron with Margins with margin parameter m using one-sided feedback reduces exactly to Perceptron with Margins with margin m/2 using fixed-margin active learning on predicted spams. (An analogous reduction for ROSVMs is possible, but results in only an approximate reduction, as the two learners would have different values of the regularization parameter C.)

Reduction A Perceptron with Margins learner L with margin m learning from one-sided feedback reduces to an active margin-based learner L′ with margin m/2. The sampling rule for this active learner is to perform uncertainty sampling on any example x_i for which the prediction f′(x_i) ≥ −m/4.

Assume the learner has classification hyperplane h defined by its weight vector w, with margin planes m+ on the ham side and m− on the spam side, and that the distance between each margin and the hyperplane is m/2 (see Figure 5.6). Now consider the hyperplane h′, which lies halfway between h and m+. We can view h′ as a classification hyperplane for learner L′, with margins m+ on the ham side and h on the spam side, each at a distance of m/4 from h′. Because h′ is translated distance m/4 from h, L′ will score each x_i with f′(x_i) = f(x_i) + m/4.

Active learner L′ requests the label y_i for any example x_i found to lie between h and h′ – that is, for any example such that f′(x_i) ≥ −m/4. This label y_i is always available, because x_i lies to the ham side of h and one-sided feedback will be provided to L. L′ requests labels for its predicted spams that lie close to its classification hyperplane: a simple form of uncertainty sampling [77]. Furthermore, L′ requests labels for all examples that lie to the ham side of h′, and such labels are also always available to L under the one-sided feedback scenario. Thus, L′ performs uncertainty sampling on any x_i such that f′(x_i) ≥ −m/4.

As a Perceptron with Margins learner, L′ updates on mistakes or examples

Figure 5.6: Exploration and Exploitation. If the initial hypothesis is h, then examples 1 and 2 cause margin updates pushing h_e out towards m−, but not beyond it unless an example is found to lie between h and h_e.

found within the margins of h′, computing a new hypothesis h′′. L then adopts a new hypothesis from L′ as follows: the new hypothesis for L will be h = h′′ − m/4. Thus, L reduces to L′.

5.5.4 Exploring and Exploiting

One of the primary problems in online active learning is resource allocation [42], often referred to as the exploration/exploitation tradeoff [90]. It is difficult to determine a priori the best balance between sampling (which may incur labeling cost) and prediction without sampling (which may incur misclassification cost) for arbitrary distributions [90]. The Apple Tasting [43] and Label Efficient [10] methods both attempt to strike a balance by estimating the error rate. Determining the error rate incurs sampling cost, and the upper bounds computed for these methods may not be tight in practice. For example, the current hypothesis may be good, but many labels may need to be requested before the error rate confirms this. Margin-based methods use a greedy approach to balancing this tradeoff that can require many fewer labels in practice. However, performance for this greedy method cannot be guaranteed, due to the possibility of malicious or pathological distributions.

The greedy strategy is the following. When h is a hyperplane consistent with all seen labeled data, the learner L only requests labels for examples that it predicts to be ham. This is a conservative strategy emphasizing exploitation, and incurring zero labeling cost. Note that at this time, examples on the spam side of m− are strongly believed to be spams, and those between h and m− are suspected to be spam.

When a new example x_i is found to lie within the margins, between h and m+, the learner is willing to explore, and the hyperplane is shifted through margin updates to create a new hyperplane h_e, with margins m_e+ and m_e− (see Figure 5.6). Assuming a moderate learning rate (that is, η ≤ m/2 for Perceptron with Margins, or non-extreme values of C for ROSVMs), h_e will lie somewhere between h and m−, causing the learner to sample from the suspected spams (from the perspective of h), but never from examples strongly believed to be spams. Each new ham example between h and the new m_e+ will push h_e closer to m−. However, each such update will shrink the gap between h and m_e+, until h = m_e+. At this point, h_e can be located no further toward spam than m−.

Note that h is still consistent with all the seen, labeled data at this point (although h is no longer maximum margin). The only thing that will cause an update now is misclassifying a spam, or finding a ham between h and h_e. Either of these cases would show that the original h is no longer consistent with the seen data, and L must recompute a new h and start again with the conservative strategy.

Figure 5.7: Pathological Distributions for One-Sided Feedback. (Panel labels: Gappy, Striped.)

Thus, unlike the Apple Tasting or Label Efficient strategies, margin-based learners do not sample from all predicted spams with non-zero probability. Sampling is performed on only those predicted spams that are close to h, and only when there have been sufficiently many hams found between h and m+ to encourage further exploration.

5.5.5 Pathological Distributions

Before moving on in this discussion, some caveats are in order, as this greedy exploration strategy can be stalled or defeated by certain pathological distributions. Linearly separable distributions that include large gaps may cause margin-based methods to cease making progress. This will occur whenever the probability of an example landing in the space between h and h_e is zero. For many interesting distributions, this only occurs in the margin between the two classes. However, it is possible to have such a gap within a single class (see Figure 5.7), which will have the same effect when the size of the gap is greater than m/2. Gappy distributions may be dealt with by increasing the margin size. Note that this provides a second intuition for the failure of classical Perceptron: when m = 0, every distribution is a gappy distribution.

In some cases, a distribution may not be linearly separable in the feature space, but we may still wish to find a hypothesis that minimizes loss, as with the soft margin SVM. In many of these cases, the margin-based methods will be successful, as the greedy exploration will continue in the limit so long as the expected loss per example from examples lying between h and m+ is less than some set threshold. However, there is the possibility of striped distributions (see Figure 5.7) that can cause the greedy exploration to fail. A stripe is a region of at least width m/2 where the expected cost of applying the surrounding area's class label to examples from that region is greater than the cost of applying the opposing label. As with gaps, stripes may be dealt with by increasing the margin size in some cases, by adjusting misclassification costs, or by finding a transformation of the feature space that renders the data linearly separable (removing the stripes).

Another possible failure of margin-based learners can be caused by malicious orderings of the examples. In a noisy domain, it is possible for an adversary to select a sequence of incorrectly labeled examples that will cause ratcheting by one-sided learners. Such malicious orderings may be possible in certain one-sided feedback applications such as optimal placement of banner advertisements.

Finally, when the learning rates are set too high, the learner may overshoot the ham class after it has misclassified a spam (or a series of spams). If the resulting hyperplane is placed beyond the ham class, classifying everything as a spam, then the learner will be ratcheted and no further learning will take place.

5.6 Minority Class Problems

Learning from one-sided feedback is particularly challenging when the ham class is a minority in the distribution. This is a difficult problem for active learning in general. It has been shown that for non-homogeneous distributions, such as minority-class distributions, a linear classifier such as Perceptron using active learning can require as many labels to achieve a given error rate as the same method without active learning [28]. That is, active learning may give no benefit in these situations. This is unfortunate, as minority class problems are common in practical data mining.

Thus, active learning solutions to the one-sided feedback problem necessarily suffer in the case of minority class distributions. Furthermore, Apple Tasting and Label Efficient methods that attempt to sample the error rate may have difficulty, because the measured error may be low even when the entire minority class is misclassified. Margin-based methods may have similar difficulty, where a long sequence of observed spams may cause ratcheting in the hyperplane. These issues may be ameliorated by assigning different misclassification costs to the two classes. We explored this issue in depth for general text classification problems in our paper on this topic [80], but omit further reports of this work here as it is not directly related to spam filtering.

5.7 Experiments

In this section, we report experimental results for spam filtering in the one-sided feedback scenario.

We construct an online learning task with one-sided feedback as follows. On each round, a learner is shown a message and asked to predict its label from {−1, 1}, for spam and ham respectively. When a ham label is predicted, the true label is revealed to the learner and it may update its hypothesis. If the learner wishes to sample the label for a message it predicts to be spam, it directs that message into the inbox by predicting a ham label. Thus, label requests have the same cost as false hams.

We map email messages to feature vectors using the methodology from Chapter 2, using normalized binary 4-mer feature vectors drawn from the first 3000 characters of each email.
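For concreteness, a sketch of this feature mapping in Python is given below; it assumes the message is already a single string and returns a sparse dictionary of normalized binary 4-mer features.

    import math

    def binary_kmer_features(message, k=4, max_chars=3000):
        text = message[:max_chars]                    # message truncation
        kmers = {text[i:i + k] for i in range(max(0, len(text) - k + 1))}
        norm = math.sqrt(len(kmers)) if kmers else 1.0
        return {kmer: 1.0 / norm for kmer in kmers}   # binary features, Euclidean-normalized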

We tested three methods: ROSVMs, Perceptron with Margins, and classical Perceptron (which is equivalent to Perceptron with Margins where m = 0).¹ For ROSVMs, we set parameters C = 100 and buffer size 1000, updating on all margin errors. For Perceptron with Margins, we used m = 2 as a default parameter, and learning rate 1. For classical Perceptron, we used the same learning rate and m = 0.

We tested these methods on one-sided feedback in three ways: in their unmodified version, with the addition of Label Efficient sampling, and with the addition of Apple Tasting sampling. The results reported for the Label Efficient methods were for optimal values of the parameter β > 0, found in coarse-grained trials on spamassassin tuning data. Both Apple Tasting and Label Efficient methods are randomized algorithms; we report average results over ten trials with each. We tested other sampling strategies, including threshold biasing, ε-greedy, and the Softmax variant, but do not report these results as they were not competitive on these data sets. Finally, we report results on these data sets using full feedback for comparison.

¹Note that at the time this work was performed, the strength of Logistic Regression as an online learner for spam filtering had not yet been fully appreciated. Applying Logistic Regression in the one-sided feedback scenario remains an area for interesting future work.

Data Sets We use the two largest publicly available labeled data sets (at the time this work was conducted) of spam and ham email, which are the trec05p-1 data set of 92,189 messages with a 43% ham rate [24] and the trec06p data set of 37,822 messages with a 34% ham rate [16], described in Chapter 2. Both of these benchmark corpora have a canonical ordering for online learning, which we use for repeatability.

In preliminary tests and where parameter tuning was needed, we used a separate corpus, the smaller publicly available spamassassin corpus of 6032 examples.

Evaluation Metrics Evaluating the performance of spam filtering methods is typically done by measuring the area under the ROC curve [24], which accounts for potentially uneven misclassification costs by assuming the ability to freely vary the classification threshold. In the one-sided feedback scenario, threshold modification after the fact is problematic, as the predicted class has implications on what feedback was available during learning. Thus, we evaluate performance using precision, recall, and the F-measure from the classifier’s actual threshold. However, as a sanity check, we did calculate the ROC curve areas for each of the results, and these results agreed with the trends reported in this section.

Precision (P), recall (R), and the F_α measure are defined in terms of true hams (ham in the inbox), false hams (spam in the inbox), true spams (spam in the spam box), and false spams (ham in the spam box). When a learner makes a label request, it is counted as a false ham when the example is a spam. Label requests for ham examples are counted as true hams. The measures are computed as follows:

P = TP / (TP + FP)        R = TP / (TP + FN)        F_α = (1 + α)(P · R) / (αP + R)

The F_α measure gives a single-number summary of classifier performance, where the parameter α determines how much weight to assign to precision and recall. We report the F1 measure results as a conservative view; we computed scores for F2 through F4 and found that they emphasized our reported trends.
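The computation is straightforward; the following sketch computes the three measures with ham treated as the positive class, as in the definitions above.

    def precision_recall_f(tp, fp, fn, alpha=1.0):
        # tp = true hams (ham in the inbox), fp = false hams (spam in the inbox),
        # fn = false spams (ham in the spam box).
        p = tp / float(tp + fp)
        r = tp / float(tp + fn)
        f = (1 + alpha) * (p * r) / (alpha * p + r)
        return p, r, f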

Results The results for experiments on both data sets are reported in Table 5.1. In general, they show a clear win for the margin-based methods, which had the highest F1 scores and the lowest false ham rates, which is equivalent to making the fewest label requests. A spam filter using the unmodified ROSVM or Perceptron with Margins would place roughly half as much spam in the user's inbox as the Apple Tasting methods, while making roughly the same number of misclassifications on ham (or fewer). These results were not far from those achieved with full feedback.

As expected, the classical Perceptron algorithm was defeated by the one-sided feedback scenario. Of the three methods of fixing classical Perceptron, the addition of margins (turning classical Perceptron into Perceptron with Margins) was more effective than either the Label Efficient or Apple Tasting strategies with classical Perceptron on all one-sided feedback tests.

Finally, the Label Efficient method outperformed the Apple Tasting method on all trials. Recall that the Label Efficient method also makes implicit use of uncertainty sampling, while Apple Tasting relies on uniform sampling of the spam predictions. This supports the claim that active learning is a good strategy for one-sided learning.

5.8 Conclusions

The goal of this chapter was to show that active learning improves performance from one-sided feedback. We have shown that the Label Efficient active learner can be applied to one-sided learning with good results. We have also shown that

Table 5.1: Results for Email Spam filtering. We report F1 score, Recall, Precision, number of False Spams (lost ham), and number of False Hams (spam in inbox) for filtering with one-sided feedback. We report results with full feedback for comparison.

trec05p-1                        F1      Rec.    Prec.    # FN    # FP
ROSVM         Margins           0.993   0.996   0.990      141     396
              Label Efficient   0.991   0.997   0.985      135     609
              With AppleTaste   0.989   0.996   0.981      145     757
              Full Fb.          0.996   0.997   0.995      121     200
Pcptrn. Mgn.  Margins           0.990   0.998   0.983       68     697
              Label Efficient   0.989   0.998   0.980       68     785
              With AppleTaste   0.984   0.998   0.970       91    1211
              Full Fb.          0.994   0.996   0.992      153     318
Perceptron    No Margins        0.516   0.348   0.999    25686       2
              Label Efficient   0.960   0.964   0.956     1430    1748
              With AppleTaste   0.930   0.920   0.940     3143    2330
              Full Fb.          0.991   0.991   0.991      343     343

trec06p                          F1      Rec.    Prec.    # FN    # FP
ROSVM         Margins           0.988   0.996   0.981       51     253
              Label Efficient   0.984   0.995   0.973       67     362
              With AppleTaste   0.976   0.994   0.958       75     557
              Full Fb.          0.993   0.994   0.992       81     100
Pcptrn. Mgn.  Margins           0.984   0.998   0.970       32     397
              Label Efficient   0.983   0.997   0.968       35     423
              With AppleTaste   0.975   0.997   0.953       44     627
              Full Fb.          0.993   0.995   0.991       69     111
Perceptron    No Margins        0.005   0.003   0.945    12875       2
              Label Efficient   0.938   0.943   0.933      738     876
              With AppleTaste   0.891   0.886   0.897     1475    1309
              Full Fb.          0.986   0.987   0.986      174     180

margin-based learners are active learners in the one-sided feedback scenario, and can exceed the performance of both the Apple Tasting and Label Efficient methods under many circumstances. One interesting ancillary contribution is the notion that margin-based learners may be used in conjunction with Apple Tasting or Label

Efficient methods for additional robustness to pathological cases.

The Label Efficient method is well designed for active learning in an online setting, and adapts naturally to the problem of one-sided feedback. While margin-based methods are not generally considered to be active learners, we have shown that they do perform implicit active learning using uncertainty sampling in the one-sided feedback scenario. The margin-based methods gave the best performance on spam data, approaching that of the full-feedback scenario. This performance was possible because the data contain a relatively even class distribution with little noise.

In addition to the problem of spam filtering, there are numerous practical machine learning applications that suffer from the problem of one-sided feedback.

Future applications involving one-sided feedback may range from geo-statistical data mining to discover ore and oil deposits to personalized online news agents that learn effectively from one-sided feedback about the relevance of shown articles.

Chapter 6

Online Filtering with Noisy Feedback

In preceding chapters of this dissertation, we have examined the online filtering scenario both with idealized assumptions and with various assumptions of incomplete user label feedback. In each of these cases, we have found that near-perfect filtering results are achievable despite incomplete user feedback, as long as the given feedback is perfectly accurate. However, this assumption of accuracy is optimistic. Real users give feedback that is often mistaken, inconsistent, or even maliciously inaccurate. To our knowledge, the impact of this noisy labeling feedback on current spam

filtering methods has not been previously explored in the literature. In this chapter, we show that noisy feedback may harm or even break state-of-the-art spam filters, including recent TREC winners. We then describe and evaluate several approaches to make such filters robust to label noise. We find that although such modifications are effective for uniform random label noise, more realistic “natural” label noise from human users remains a difficult challenge. This chapter is based on our work

131 with Gordon Cormack on filtering with noisy label feedback [83].

6.1 Noise in the Labels

The promising results from previous chapters were attained in a laboratory setting where gold-standard feedback was given to the filters for training. That is, the labels of spam and ham assigned to each message were carefully vetted, and such labeling was done in a consistent and accurate fashion [23]. In real-world systems involving users, it is unlikely that these (possibly anonymous) humans will consistently give label feedback of gold-standard quality. Instead, real users give noisy feedback, and the labels used for training real-world filters may contain errors which can reduce classification performance.

6.1.1 Causes of Noise

Label noise may come from a variety of causes. Several different insiders in industrial anti-spam settings have reported that at least 3% of all user feedback is simply mistaken. That is, these labels are objectively wrong, such as email lottery scams being incorrectly reported as ham. (This 3% user mistake rate is also reported in Yin et al., 2006 [96].) Sources of labeling mistakes include misunderstanding the feedback mechanisms, accidental clicks, and even users who are actually fooled by such scams.

Inconsistent labels arise when similar messages are perceived differently among users. A common example of this is gray mail messages (such as email newsletters) that some users value and others prefer to block [97]. In a recent paper studying gray mail, it was found that one sample of 418 messages¹ contained 163 gray mail messages – nearly 40% [97]. An additional data point comes from John Graham-Cumming's spamorham.org project,² in which human users across the internet were invited to manually label messages in the trec05p-1 data set. In this project, individual human labelers disagreed with the gold standard labels 10.9% of the time, in large part due to inconsistency or human error [41].

¹This was apparently randomly sampled from a large corpus of representative spam emails.

Finally, maliciously inaccurate feedback is an issue in large free email systems.

For example, a spammer may acquire many free accounts, send large amounts of spam to these accounts, and then report such messages as being ham. Industry insiders at several different large free email systems have confirmed that this is a common tactic by spammers. Indeed, at least one such system chooses to completely ignore all ham labels given by users.

Thus, the possibility of label noise is an important real-world consideration for spam filtering. Spam filters based on machine learning techniques must be robust to a variety of label noise levels, and not be narrowly optimized to perform well only in the noise-free case.

6.1.2 Contributions

This chapter makes three main contributions. First, it is shown that uniformly random noise in label feedback significantly harms the performance of state-of-the-art spam filters. Second, several modifications are proposed for making filters robust to label noise, including making less aggressive updates, label cleaning, label correcting, and various forms of regularization. It is shown that the best of these methods make filters significantly more robust to uniform label noise at a small cost in classification performance in the noiseless case. Third, it is found that natural noise from real users is more challenging than uniform noise.

²Sadly, this project ended in early 2007 and the spamorham.org domain is now apparently controlled by web spammers.

6.2 Related Work

There is relatively little published work on the impact of label noise in spam filtering.

However, the problem of noisy class labels and avoiding overfitting is well studied in general machine learning literature.

6.2.1 Label Noise in Email Spam

To our knowledge, ours is the first published work to explicitly explore the problem of label noise in email spam filtering [83]. However, label noise has been considered as an issue by industry experts in spam filtering, and has been given as a reason to prefer filtering methods that do not rely on user feedback. The published acknowledgment of 3% labeling errors in data gathered by Hotmail implies that this problem has been previously studied internally [96]. Additionally, Yin et al. studied gray mail detection as a subset of this general label noise problem [97]. They found that adding a gray mail detector as the first stage of a two-stage filtering process reduced false negatives by between two and six percent.

6.2.2 Avoiding Overfitting

In machine learning, it has long been known that overfitting is a potential problem when applying optimization methods to noisy training data [65, 78]. For iterative methods such as gradient descent, early stopping is a common and effective practice for reducing overfitting [65]. In the online filtering scenario for streaming data, the effect of early stopping may be approximated by selecting a conservative learning rate. Another common approach is regularization, in which a penalty for model complexity is added to the optimization problem (see, for example, Scholkopf & Smola [78]).

The problem of label noise in data has also been well studied. Zhu and Wu observed that class label noise can be even more harmful in training classifiers than feature noise [99].³ Several methods have been proposed for cleaning data containing class label noise by discarding training examples that are suspected to have incorrect labels (see Zhu and Wu [99] for an overview). A more aggressive approach is label correcting, in which instances suspected to be mislabeled are automatically re-labeled [98]. Rebbapragada and Brodley showed that correcting and cleaning can be considered within the same unifying framework, but also note that label correcting is a more difficult task that may introduce additional errors into the data [71].

6.3 Label Noise Hurts Aggressive Filters

We first wished to investigate how well top-performing filters from recent TREC evaluations fared with respect to label noise. This section describes our basic experimental design, the filters under consideration, and results from this first exploration.

6.3.1 Evaluation

We use a noisy variant of the online filtering scenario as our experimental framework, in which messages are presented to the filter one at a time. For each message, the filter is asked to give a prediction score of ham or spam. After the prediction is made, a (possibly noisy) label is given to the filter, and this information may be used for a training update. We evaluate performance using the (1-ROCA)% measure described in Chapter 2.

³In the spam setting, feature noise is typified by the good word attack [61].

Note that although the labels given to the filter during online filtering may be noisy, the (1-ROCA)% evaluation score is computed with respect to the gold-standard labels supplied with the original data set [23]. Thus, our goal is to assess the impact of noisy training labels on the filters' ability to predict "true" gold-standard labels. In an ideal world, a small amount of label noise would do only minimal harm to the filters' classification performance.

6.3.2 Data Sets with Synthetic Noise

For these initial tests, we built noisy data sets from two large, publicly available benchmark data sets for spam filter evaluation: the trec06p data set of 37,822 emails [16] and the trec07p data set of 75,419 emails [18], both of which were originally constructed for the TREC spam filtering competitions. (All of the noisy data sets created for this work are publicly available for research purposes; contact the author.)

For this initial evaluation, we chose to investigate the effect of uniform, random label noise. (Natural label noise from actual human feedback is tested in Section 6.5.2.) Synthetic noise was added to each data set as follows: for each message, the label of that message was flipped with uniform probability p.

We created one test set for each data set and each of seven noise levels, with p ∈ {0, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25}. Note that when p = 0, the test set is identical to the original, unaltered TREC data set.
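A sketch of this noise-injection step, assuming gold-standard labels in {−1, +1}:

    import random

    def add_uniform_label_noise(labels, p, seed=0):
        # Flip each gold-standard label independently with uniform probability p.
        rng = random.Random(seed)
        return [-y if rng.random() < p else y for y in labels]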

6.3.3 Filters

We tested a range of current statistical spam filtering methods, as follows. Where noted, parameters were set by tuning on the separate spamassassin data set as described in Chapter 2. Other filters were tested with parameters set as given by default, or as given in the reference describing that filter. Unless noted otherwise, the filters tested used a feature space of binary 4-mers drawn from the first 3000 characters of each message, and all feature vectors were normalized with the Euclidean norm, as described in Chapter 2.

• Multi-Nomial Naive Bayes (MN NB). Metsis et al. tested a number of Naive Bayes variants, and found Multi-Nomial Naive Bayes with binary feature values to be one of the top performing methods [63]. We apply this variant here.

• Logistic Regression. As described in Chapter 2, this machine learning method gave best results on several tasks at TREC 2007 when coupled with binary 4-mers [18, 19]. Online logistic regression has a parameter η that controls learning rate, determining how aggressively to update the model on each new example [66]. We set η = 0.1 after tuning on spamassassin data.

• Perceptron with Margins (PwM). This noise-tolerant algorithm [48] was the base learner in an approach that gave strong results at TREC 2006 [86]. After tuning, we set the fixed margin m = 8 and learning rate η = 0.5.

• Relaxed Online Support Vector Machine (ROSVM). In previous applications of SVM variants for spam filtering, it was found that high values of the cost parameter C, encouraging little regularization, gave best results for spam filtering tasks. This result agreed with a prior finding by Drucker et al. (1999), which also found that SVMs gain best performance on spam data with high values of C. Using C = 100, the ROSVM method gave best results on several tasks at TREC 2007 [18].

• Dynamic Markov Chain Compression (DMC). Perhaps the best performing of the compression-based spam filters [6], DMC was tested at TREC 2007 as wat2 with strong results [19].

• BogoFilter. BogoFilter has been the best performing open-source spam filter at TREC for several years [16, 18], and employs a fast variant of the Naive Bayes classifier.

• WAT1. This is the same filter submitted as wat1 at TREC 2007 [19], which gave best results on several tasks. This filter employs logistic regression and binary 4-mers, but does not normalize the feature vectors.

• OSBF-Lua. This is the same filter [3] that gave best overall performance at TREC 2006 [16]. OSBF-Lua uses an aggressive variant of the Naive Bayes classifier, in which the filter continues to re-train on the header of a given email message so long as the filter scores that message near the classification boundary, or until other stopping conditions such as maximum number of iterations are met [3]. We refer to this update strategy as train until no error.

6.3.4 Initial Results

The results of this first experiment reveal a disturbing trend, as shown in Tables

6.1 and 6.2. The methods giving best results without label noise give worst results with moderate to large amounts of label noise. This is true for Logistic Regression,

ROSVMs, DMC, BogoFilter, and even Perceptron with Margins, each of which has given strong performance in the TREC spam filtering competitions. In contrast, the

Multi-Nomial Naive Bayes method gives relatively modest results without noise, but is much more robust to increasing levels of label noise.

138 Table 6.1: Results for prior methods on trec06p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result for a given noise level, or confidence interval overlapping with confidence interval of best result.

trec06p, (1-ROCA)% with 0.95 confidence intervals; noise levels 0, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25:

MN NB:     0.477 (0.425-0.535) | 0.513 (0.460-0.571) | 0.517 (0.459-0.582) | 0.517 (0.459-0.583) | 0.624 (0.557-0.698) | 0.665 (0.594-0.744) | 0.685 (0.620-0.758)
LogReg:    0.032 (0.025-0.041) | 0.035 (0.027-0.046) | 0.118 (0.099-0.140) | 0.615 (0.558-0.677) | 2.107 (1.985-2.236) | 4.914 (4.732-5.102) | 9.077 (8.774-9.390)
PwM:       0.049 (0.034-0.070) | 0.069 (0.050-0.094) | 0.181 (0.149-0.221) | 0.577 (0.526-0.632) | 1.517 (1.424-1.615) | 3.328 (3.173-3.491) | 6.666 (6.423-6.918)
ROSVM:     0.031 (0.021-0.044) | 0.328 (0.288-0.373) | 2.430 (2.305-2.561) | 6.532 (6.297-6.775) | 11.512 (11.182-11.850) | 16.852 (16.449-17.263) | 21.680 (21.262-22.104)
DMC:       0.031 (0.024-0.041) | 0.053 (0.040-0.070) | 0.183 (0.150-0.222) | 0.619 (0.542-0.706) | 1.430 (1.308-1.564) | 3.044 (2.869-3.230) | 5.208 (4.986-5.439)
Bogo:      0.087 (0.066-0.114) | 0.096 (0.071-0.130) | 0.277 (0.231-0.332) | 1.203 (1.110-1.304) | 3.168 (2.999-3.346) | 7.336 (7.066-7.616) | 11.478 (11.148-11.818)
WAT1:      0.036 (0.027-0.049) | 0.075 (0.058-0.096) | 0.389 (0.347-0.435) | 1.839 (1.723-1.963) | 4.548 (4.360-4.743) | 8.358 (8.073-8.651) | 13.112 (12.755-13.478)
OSBF-Lua:  0.054 (0.034-0.085) | 0.075 (0.053-0.107) | 0.316 (0.155-0.644) | 29.575 (28.622-30.546) | 35.011 (34.371-35.657) | 38.486 (37.855-39.122) | 39.699 (39.046-40.356)

Table 6.2: Results for prior methods on trec07p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result for a given noise level, or confidence interval overlapping with confidence interval of best result. Methods unable to complete a given task are marked with dnf.

trec07p, (1-ROCA)% with 0.95 confidence intervals; noise levels 0, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25:

MN NB:     0.168 (0.151-0.185) | 0.181 (0.163-0.203) | 0.237 (0.216-0.259) | 0.249 (0.227-0.273) | 0.297 (0.273-0.324) | 0.298 (0.272-0.326) | 0.342 (0.309-0.377)
LogReg:    0.005 (0.002-0.017) | 0.006 (0.004-0.009) | 0.071 (0.061-0.084) | 0.550 (0.512-0.590) | 2.266 (2.184-2.351) | 5.192 (5.045-5.343) | 9.501 (9.271-9.737)
ROSVM:     0.010 (0.003-0.030) | 0.031 (0.020-0.048) | 0.473 (0.436-0.512) | 2.246 (2.157-2.339) | 5.604 (5.410-5.804) | 9.531 (9.260-9.809) | 14.714 (14.215-15.228)
DMC:       0.006 (0.003-0.016) | 0.021 (0.014-0.031) | 0.103 (0.084-0.126) | 0.242 (0.209-0.280) | 0.594 (0.533-0.661) | 1.208 (1.123-1.300) | 2.484 (2.354-2.621)
Bogo:      0.027 (0.017-0.043) | 0.033 (0.023-0.049) | 0.097 (0.077-0.122) | 0.264 (0.231-0.302) | 0.504 (0.454-0.559) | 3.221 (3.083-3.365) | 10.294 (10.041-10.551)
WAT1:      0.006 (0.002-0.015) | 0.019 (0.014-0.026) | 0.428 (0.396-0.462) | 1.984 (1.904-2.068) | 5.221 (5.043-5.405) | 9.226 (9.000-9.457) | 14.116 (13.851-14.385)
OSBF-Lua:  0.029 (0.015-0.059) | 0.054 (0.016-0.184) | 0.290 (0.097-0.859) | 29.478 (27.591-31.432) | dnf | dnf | dnf

What causes the steep degradation in performance with the state-of-the-art filters? Each of these methods is tuned to perform aggressive online updates, necessary to attain competitive results in the noise-free TREC evaluations. For example, the Logistic Regression method is tuned with an aggressive learning rate η and uses no regularization. The ROSVM method is tuned with the cost parameter

C set to a high value, discouraging regularization. Such settings allow the filters to quickly adapt to new spam attacks when user feedback contains no noise, but make these filters subject to overfitting when label noise is present.

As an extreme case, OSBF-Lua, the top performer from TREC 2006, was actually broken by the noisy data, eventually giving results of nan on messages in all noisy data sets with p > 0. The train-until-no-error approach severely overfit mislabeled instances, resulting in a useless model.

These are troubling initial results. Together, they call into question the real-world utility of results from evaluations assuming noiseless user feedback. Are the strong performance levels from TREC and similar evaluations only achievable in laboratory settings, with users willing and able to give perfectly accurate feedback?

The remainder of this chapter investigates this question.

6.4 Filtering without Overfitting

In this section, we suggest several strategies for making learning-based filtering methods more robust to noise in feedback. These strategies include tuning parameters to prevent overly aggressive updates, various forms of regularization for logistic regression and SVM variants, and methods that attempt to automatically clean or even correct labels given for training.

For preliminary experiments and tuning runs in this section, we created noisy versions of the spamassassin data set, adding uniform synthetic label noise at different levels as described in Section 6.3.2.

Figure 6.1: Results for varying learning rate η for Logistic Regression, on spamassassin tuning data with varying levels of synthetic uniform label noise. For clarity, the order of results is consistent between legend and figure.

6.4.1 Tuning Learning Rates

As discussed above, both Logistic Regression and Perceptron with Margins utilize a learning rate parameter η that controls the size of the step taken on any given update during online gradient descent. Lower values of η lead to less aggressive updates, giving an online approximation of the early stopping strategy that gives good results when gradient descent is applied in batch mode [65].

Figure 6.1 shows the effect of varying η for Logistic Regression at different levels of label noise on the noisy spamassassin data sets. (Similar effects are seen with Perceptron with Margins.) Note that when there is little or no label noise, high values of η give best results, but when label noise becomes more prevalent, lower η values (centering on η = 0.02) improve results.

For our final experiments in the next section, we set η = 0.02 for Logistic Regression and η = 0.02 with margin m = 2 for Perceptron with Margins, as these values give best results at noise level p = 0.25 and near-best results for other noise values at or above p = 0.1 on the noisy spamassassin tuning data.

6.4.2 Regularization

Another general strategy for reducing overfitting is regularization, requiring that the learned model not only describes the training data well, but also has low complexity

(see, for example, Scholkopf and Smola [78]). One measure of model complexity is the L2-norm (or Euclidean norm) of the weight vector. Thus, L2 regularization seeks to ensure that the Euclidean norm of the weight vector is as small as possible, while still describing the training data well. These goals of fitting the training data and reducing model complexity are often in conflict, and the balance between these goals is controlled by a parameter.

Regularization with SVM variants

As described in Chapter 3, the classic soft-margin SVM optimization problem is to minimize:

||w||² + C Σ_{i=1..m} ξ_i

Figure 6.2: Results for varying C in ROSVM for regularization, on spamassassin tuning data with varying levels of synthetically added label noise. For clarity, the order of results is consistent between legend and figure.

Here, w is the weight vector storing the model, and each ξ_i is a slack term describing the amount of error associated with a particular training example x_i [78]. Thus, the optimization problem seeks to minimize both model complexity (the L2-norm of w) and training error (given by the sum of the slack terms), and the cost parameter C controls how much emphasis to place on each of these tasks in training.

A high value of C focuses on reducing training error by enforcing little regularization, resulting in the possibility of overfitting. Both Sculley and Wachman [84] and Drucker [34] found that high values of C gave best performance on spam data for ROSVMs and SVMs, respectively, but these results were gained with no label noise in the data. As shown in Figure 6.2, lower values of C give much improved performance in the presence of noise. We set C = 0.5 for our final experiments, as this gives best results for noise level p = 0.25.

Regularization for Logistic Regression

Logistic Regression is often considered to be especially prone to overfitting in the absence of regularization [66], as the base optimization problem allows weights to be driven to arbitrarily large values with noiseless, linearly separable data.

For the online gradient descent algorithm commonly used for online logistic regression, L2 regularization is achieved with a modified update rule [66]:

w ← w + η(y_i − f(x_i)) x_i − ηλw

As before, w is the weight vector, x_i is an individual training example with label y_i ∈ {0, 1}, and the learning rate is given by η. The prediction function f(x_i) returns a value between 0 and 1 indicating the predicted "spamminess" of x_i. Regularization is controlled by the parameter λ, where larger values of λ enforce more regularization and reduce overfitting.

for each new example x_i:
    use the filter to make a prediction, using f(x_i)
    get a (possibly noisy) label y_i from the oracle
    if (f(x_i) < t_+1 and y_i == -1) or (f(x_i) > t_-1 and y_i == +1) then
        update the model using (x_i, y_i)
    else
        discard x_i and skip the model update

Figure 6.3: Pseudo-code for Online Label Cleaning.

Note that this modified update rule increases computational cost. Where before each update could be performed in time O(|x_i|), where |x_i| is the number of nonzero elements of the sparse vector x_i, now each update requires O(|w|), the number of non-zero features in w, which may be considerably larger.
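A sketch of the regularized update in Python is shown below, with the tuned values reported in this chapter (η = 0.02, λ = 0.0001) as assumed defaults; dense vectors are used for simplicity, which makes the O(|w|) cost noted above explicit.

    import numpy as np

    def logistic_sgd_update(w, x_i, y_i, eta=0.02, lam=0.0001):
        # w <- w + eta * (y_i - f(x_i)) * x_i - eta * lambda * w, with y_i in {0, 1}
        f = 1.0 / (1.0 + np.exp(-float(np.dot(w, x_i))))   # predicted spamminess
        return w + eta * (y_i - f) * x_i - eta * lam * w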

After a grid search for parameter values of λ and η using spamassassin tuning data, we were surprised to find that L2 regularization with values of λ ranging from λ = 10^-7 to λ = 10^2 did not improve results at any noise level. Values above λ = 10^-4 monotonically degraded results, and smaller values gave results effectively equivalent to λ = 0. To investigate this further, we chose a value of λ = 0.0001 (with η = 0.02) for our final experiments on test data, as this was the largest value that did not significantly decrease performance on tuning data compared to λ = 0.

6.4.3 Label Cleaning

Another machine learning approach for coping with noisy labels is automated label cleaning, in which examples that are suspected to be incorrectly labeled are discarded from the data set [7]. To use this approach in the online filtering scenario, we suggest an obvious online algorithm, given in Figure 6.3. This method uses confidence thresholds t_+1 and t_-1 to define criteria for cleaning. In our experiments, we apply this algorithm using Logistic Regression as the base learner, so that t_+1 and t_-1 may be interpreted as probability thresholds. After tuning on noisy spamassassin data, we set t_+1 = 0.7 and t_-1 = 0.3, as these values gave best results for noise level p = 0.25 with η = 0.02.

for each new example x_i:
    use the filter to make a prediction, using f(x_i)
    get a (possibly noisy) label y_i from the oracle
    if (f(x_i) > t_+1) then set y_i := +1
    if (f(x_i) < t_-1) then set y_i := -1
    update the model using (x_i, y_i)

Figure 6.4: Pseudo-code for Online Label Correcting.

6.4.4 Label Correcting

In the label correcting method, the filter proactively changes the labels of examples for which the filter strongly disagrees with the given label [98], at the risk of introducing additional noise into the data [71]. We propose a simple online method for label correcting, given in Figure 6.4, similar to the online label cleaning method. After tuning, we set t_+1 = 0.95 and t_-1 = 0.05 with η = 0.02, using Logistic Regression as the base learner.
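Both rules are easy to state in code. The sketch below is a hedged rendering of Figures 6.3 and 6.4, using the tuned thresholds above as assumed defaults; f_xi is the filter's predicted spamminess in [0, 1], and labels use +1 for spam and −1 for ham, as in the figures.

    def clean_label(f_xi, y_i, t_pos=0.7, t_neg=0.3):
        # Online label cleaning (cf. Figure 6.3): return True if the (possibly noisy)
        # label y_i should be used for a training update, False if it should be
        # discarded because the filter strongly disagrees with it.
        if y_i == +1:                # labeled spam
            return f_xi > t_neg      # discard if the filter is confident it is ham
        return f_xi < t_pos          # labeled ham: discard if confidently spam

    def correct_label(f_xi, y_i, t_pos=0.95, t_neg=0.05):
        # Online label correcting (cf. Figure 6.4): overwrite the given label when
        # the filter is extremely confident it is wrong, then train on the result.
        if f_xi > t_pos:
            return +1
        if f_xi < t_neg:
            return -1
        return y_i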

Table 6.3: Results for modified methods on trec06p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result, or confidence interval overlapping with confidence interval of best result.

trec06p, (1-ROCA)% with 0.95 confidence intervals; noise levels 0, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25:

LogReg (η = 0.02):                       0.070 (0.057-0.085) | 0.060 (0.049-0.073) | 0.047 (0.038-0.058) | 0.064 (0.051-0.080) | 0.074 (0.062-0.089) | 0.142 (0.120-0.169) | 0.401 (0.363-0.444)
LogReg L2-rglz. (η = 0.02, λ = 0.0001):  0.068 (0.057-0.082) | 0.059 (0.049-0.071) | 0.046 (0.037-0.059) | 0.064 (0.049-0.082) | 0.074 (0.060-0.090) | 0.141 (0.118-0.168) | 0.398 (0.361-0.439)
LogReg Labl-Corr.:                       0.107 (0.087-0.132) | 0.112 (0.094-0.134) | 0.106 (0.086-0.132) | 0.136 (0.111-0.167) | 0.093 (0.077-0.113) | 0.147 (0.122-0.176) | 0.403 (0.366-0.443)
LogReg Labl-Clean:                       0.049 (0.037-0.065) | 0.049 (0.038-0.062) | 0.045 (0.035-0.058) | 0.060 (0.047-0.076) | 0.055 (0.043-0.069) | 0.086 (0.068-0.110) | 0.107 (0.090-0.128)
PwM (η = 0.02, m = 2):                   0.036 (0.027-0.047) | 0.035 (0.026-0.048) | 0.043 (0.032-0.058) | 0.049 (0.035-0.068) | 0.053 (0.040-0.070) | 0.066 (0.048-0.089) | 0.082 (0.066-0.103)
ROSVM (C = 0.5):                         0.033 (0.025-0.044) | 0.032 (0.023-0.044) | 0.036 (0.026-0.052) | 0.040 (0.029-0.054) | 0.046 (0.034-0.062) | 0.062 (0.047-0.081) | 0.095 (0.076-0.119)

trec07p, (1-ROCA)% with 0.95 confidence intervals; noise levels 0, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25:

LogReg (η = 0.02):                       0.009 (0.004-0.018) | 0.007 (0.004-0.011) | 0.007 (0.005-0.010) | 0.014 (0.008-0.023) | 0.036 (0.028-0.045) | 0.091 (0.078-0.107) | 0.363 (0.332-0.396)
LogReg L2-rglz. (η = 0.02, λ = 0.0001):  0.009 (0.004-0.020) | 0.007 (0.004-0.011) | 0.007 (0.005-0.009) | 0.014 (0.008-0.025) | 0.035 (0.027-0.047) | 0.091 (0.077-0.107) | 0.359 (0.326-0.395)
LogReg Labl-Corr.:                       0.009 (0.005-0.018) | 0.010 (0.005-0.018) | 0.011 (0.006-0.020) | 0.016 (0.011-0.024) | 0.037 (0.029-0.047) | 0.092 (0.079-0.107) | 0.364 (0.338-0.393)
LogReg Labl-Clean:                       0.010 (0.005-0.018) | 0.011 (0.005-0.021) | 0.013 (0.008-0.023) | 0.010 (0.005-0.019) | 0.010 (0.006-0.018) | 0.017 (0.011-0.025) | 0.018 (0.013-0.025)
PwM (η = 0.02, m = 2):                   0.008 (0.004-0.017) | 0.011 (0.006-0.020) | 0.021 (0.013-0.033) | 0.021 (0.013-0.035) | 0.043 (0.030-0.062) | 0.056 (0.040-0.079) | 0.087 (0.069-0.110)
ROSVM (C = 0.5):                         0.006 (0.002-0.018) | 0.007 (0.003-0.014) | 0.010 (0.006-0.017) | 0.008 (0.005-0.013) | 0.027 (0.017-0.044) | 0.033 (0.020-0.054) | 0.048 (0.034-0.068)

6.5 Experiments

In this section, we test the modified filters, with parameters tuned as described in Section 6.4, on both synthetic label noise and on natural noise from real users.

6.5.1 Synthetic Label Noise

We first tested the modified methods on synthetic label noise, using the same data and evaluation methods described in Section 6.3. Results for these experiments are shown in Tables 6.3 and 6.4.

First, we note that simply reducing the learning rate η makes Logistic Regression and Perceptron with Margins much more resistant to label noise for both data sets, at a slight cost in performance without label noise. Perceptron with Margins, in particular, demonstrates its effectiveness as a “noise-tolerant” algorithm [48].

Second, additional regularization gives strong results with ROSVM, but does not give added benefit for Logistic Regression with λ = 0.0001. To check if this was because of a particular λ value, we ran additional tests and found that higher values of λ monotonically degraded classification performance at all noise levels, while lower values of λ converged to the results where λ = 0. Thus, it appears that L2 regularization for logistic regression is simply not helpful for email spam filtering.

We believe this is due to the fact that in the online gradient descent with L2 penalty used for Logistic Regression, rare-but-informative features are penalized over time.

In contrast, the SVM variant is better suited to maintaining values for many relevant features [45].
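To make this argument concrete, the sketch below shows one online gradient step for L2-regularized logistic regression over sparse feature dictionaries; the function name and data structures are illustrative assumptions, not our exact implementation. Because the weight-decay term shrinks every weight on every update, a rare but informative feature loses weight on all of the many updates in which it does not appear.

    import math

    # One online gradient step for L2-regularized logistic regression,
    # assuming labels y in {0, 1} and a sparse feature dict x.
    def logreg_l2_update(w, x, y, eta=0.02, lam=0.0001):
        z = sum(w.get(f, 0.0) * v for f, v in x.items())
        p = 1.0 / (1.0 + math.exp(-z))       # predicted P(y = 1 | x)
        for f in w:                           # weight decay from the L2 penalty,
            w[f] -= eta * lam * w[f]          # applied to every feature
        for f, v in x.items():                # gradient step on the log loss
            w[f] = w.get(f, 0.0) + eta * (y - p) * v
        return w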

Third, the approach of label cleaning gave excellent results, clearly improving on base Logistic Regression results on both data sets at moderate to high levels of noise. Label correcting, on the other hand, did not give added benefit. When we explored other parameter settings, we found that more aggressive label correcting only degraded results on these data sets.

6.5.2 Natural Label Noise

The previous experiments show that there are several methods available for dealing with uniform label noise, an interesting result given the failure of the best TREC filters on the same task. But is a uniform model of label noise always realistic? It seems reasonable that messages such as gray mail may have higher rates of labeling inconsistency than average. Similarly, if spammers are inside the labeling system, then certain spam messages may have a disproportionately high noise rate. In this section, we experiment with our best available approximation of label noise caused by actual users, using human labels collected by the spamorham.org project [41] for the trec05p-1 data set.

To prepare test data with natural label noise, for each message in the trec05p-1 data set we sampled one human labeling from the set of all human labelings for that message, uniformly at random. Thus, the final data set contained the same messages as trec05p-1 in the same order, but with labels that reflected the distribution of label noise produced by human users. In comparison with the trec05p-1 gold standard labels, this test set contained 6.75% incorrect labels.

For comparison, we then created a synthetic data set from trec05p-1, with a uniform p = 0.0675 noise rate identical to that of the natural label noise data set. Finally, we also tested all methods on the original trec05p-1 data with gold-standard labels.
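A minimal sketch of the synthetic-noise construction, assuming gold-standard labels in {+1, -1}; the random seed is an arbitrary illustrative choice.

    import random

    def add_uniform_label_noise(labels, p, seed=0):
        """Flip each gold-standard label independently with probability p."""
        rng = random.Random(seed)
        return [-y if rng.random() < p else y for y in labels]

    # e.g. noisy_labels = add_uniform_label_noise(trec05p_labels, p=0.0675)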

The results, given in Table 6.5, show that natural label noise appears to be more challenging to filters than uniform noise.

Table 6.5: Results for natural and synthetic noise at identical noise levels. Natural label noise for trec05p-1 was uniformly sampled from human labelings collected by the spamorham.org project. Results are reported as (1-ROCA)%, with 0.95 confidence intervals.

trec05p-1, results for no noise / synthetic noise / natural noise:

MN NB: 0.871 (0.831 - 0.913), 1.270 (1.210 - 1.333), 1.425 (1.364 - 1.489)

LogReg, η = 0.1: 0.013 (0.011 - 0.015), 0.249 (0.232 - 0.267), 0.563 (0.531 - 0.596)

PwM, η = 0.5, m = 8: 0.022 (0.018 - 0.027), 0.324 (0.297 - 0.352), 1.056 (0.982 - 1.137)

ROSVM, C = 100: 0.012 (0.010 - 0.016), 2.063 (1.987 - 2.142), 2.172 (2.088 - 2.259)

DMC: 0.013 (0.009 - 0.017), 0.241 (0.217 - 0.268), 0.574 (0.535 - 0.616)

BogoF: 0.042 (0.031 - 0.056), 7.215 (7.021 - 7.413), 0.551 (0.512 - 0.593)

WAT1: 0.012 (0.010 - 0.015), 1.073 (1.017 - 1.131), 1.294 (1.233 - 1.358)

OSBF-Lua: 0.011 (0.008 - 0.014), 37.025 (34.895 - 39.207), 32.770 (32.436 - 33.105)

LogReg, η = 0.02: 0.031 (0.027 - 0.034), 0.037 (0.324 - 0.043), 0.156 (0.145 - 0.168)

LogReg labl. corr.: 0.030 (0.028 - 0.036), 0.039 (0.034 - 0.043), 0.146 (0.135 - 0.159)

LogReg labl. clean: 0.022 (0.018 - 0.027), 0.025 (0.020 - 0.030), 0.463 (0.423 - 0.506)

PwM, η = 0.02, m = 2: 0.022 (0.018 - 0.027), 0.047 (0.038 - 0.060), 0.304 (0.0276 - 0.336)

ROSVM, C = 0.5: 0.019 (0.015 - 0.023), 0.030 (0.023 - 0.039), 0.294 (0.264 - 0.327)

The unmodified filters perform badly on both the synthetic noise and the natural noise. In contrast, the modified filters perform relatively well on the synthetic noise, but give results roughly an order of magnitude worse on the natural noise (although still better than the unmodified filters). These results agree with previous observations that uniform class label noise is easier to filter than label noise that skews the label distribution in certain regions of the feature space [7, 71]. Additional work is needed to cope with this kind of natural label noise.

6.6 Discussion

At the outset of this investigation, our goal was to find out how top-performing filters from TREC competitions fared in the presence of label noise. To our dismay, we found that even uniform label noise dramatically reduces the effectiveness of these state of the art filters when run “out of the box.” We then found inexpensive modifications enabling TREC filters to become significantly more tolerant of label noise. Uniform label noise, which models random user errors in feedback, is well handled by several modified methods. Natural noise, reflecting inconsistent or malicious judgments, remains more difficult.

We observe that these “noise tolerant” filters would not necessarily have achieved best performance on the tasks as given in TREC-style evaluations. The noiseless evaluation setting rewards aggressive online updates, and promotes filters that may be prone to overfitting in real-world applications. However, we feel that a slight decrease in classification performance in the noiseless setting is more than compensated for by improved performance in the noisy setting.

It is critical that we are able to distinguish those filters that are robust to label noise (or may be made robust with appropriate parameter settings) from those that fail in noisy settings. Thus, we would like to propose that future spam filtering evaluations include filtering tasks with various levels of label noise. Ideally, this label noise would be natural noise from real human users rather than synthetic, wherever possible, as this is the more challenging case.

In the following chapter, we extend this exploration by examining the problem of natural label noise in a more challenging setting in which diverse users give radically inconsistent label feedback.

Chapter 7

Online Filtering with Feedback from Diverse Users

In the previous chapter, we found that uniform class label noise was not problematic when filters were tuned away from overly aggressive online updates. However, non-uniform class label noise, such as that produced by real users, is still a challenge. In this chapter, we further investigate the impact of non-uniform class label noise on machine learning methods for spam filtering by examining an extreme setting in which diverse users give label feedback. Diversity in a user base leads to the possibility of inconsistent label feedback, which arises when different users disagree as to what constitutes spam.

The filtering of blog comment abuse provides exactly this setting. Internet blogs provide forums for discussions within virtual communities, allowing readers to post comments on what they read. However, such comments may contain abuse, such as personal attacks, offensive remarks about race or religion, or commercial advertisements, all of which reduce the value of community discussion. Ideally, filters would promote civil discourse by removing abusive comments while protecting free speech by not removing any comments unnecessarily. The readership of a blog may contain multiple factions or groups with highly divergent opinions on what constitutes abuse. Thus, the label feedback from such a readership is inconsistent, resulting in high levels of non-uniform label noise.

This chapter has two primary goals. First, we wish to explore the effectiveness of noise-tolerant filtering methods from the previous chapter in this more difficult setting. We find that class-specific misclassification costs give additional benefit here, due not only to minority class issues but also to the observation that positive class labels may be more reliable than negative class labels in this setting.

Second, we wish to determine the reliability of evaluation using noisy class labels as “ground truth” in computing classification performance statistics. We determine that constructing gold-standard evaluation labels may give a more accurate view of the true performance of the filtering methods than using user-generated labels.

However, it turns out that evaluations using user-generated labels show the same relative performance rankings as more expensive gold-standard labels, and may thus be sufficient for future evaluation. This chapter is based on work that originally appeared in our paper on blog comment abuse filtering [81].

7.1 Blog Comment Filtering

Internet blogs – online journals – have become important forums for fostering community discussion. In a typical blog, readers may write blog comments in response to a blog posting. While the majority of blog comments are not abusive, some comments do contain abuse. Unlike email spam, blog comment abuse is not primarily commercial in nature. More often, comment abuse contains personal attacks, obscenities, and even messages of hate based on race, religion, or nationality. Such comments mar the ability of blogs to foster constructive discussion in a virtual community, especially among a diverse readership with multiple factions or points of view.

7.1.1 User Flags and Community Standards

Typical blog services allow the community to protect itself from abusive comments through the process of user flagging, in which readers are given the option to flag abusive comments. Comments receiving a sufficient number of flags are removed from view. This flagging process enables a virtual community to enforce its own standards for civility in discourse. However, when a readership has multiple factions, such flagging information becomes noisy. Users may flag inconsistently, inaccurately, or even maliciously. Thus, care is needed to construct and evaluate filters capable of learning from this data.

7.1.2 Contributions

Our investigation in this chapter centers on a multi-lingual corpus of comments with associated user flag information that is three orders of magnitude larger than the data set used in the largest prior study. To our knowledge, this is the first reported use of user flags for training blog comment abuse filters. We show that, despite noise, user flag data can train filters that approach the performance of dedicated human annotators. Additionally, this chapter gives analysis of blog comment abuse, compares several filtering methods, and offers suggestions for practical application of filters in real-world blog hosting systems.

7.2 Related Work

Relatively little work on blog comment abuse filtering appears in the literature, and none on using user flag data for training learning-based filters. However, there is significant prior work in splog detection, and email spam filtering remains a related but distinct area of research.

7.2.1 Blog Comment Abuse Filtering

To our knowledge, the only prior work on blog comment abuse filtering centered on a small, hand labeled data set of roughly one thousand examples [64]. Mishne et al. proposed filtering blog comments by measuring the disagreement between language models for comments and associated blog entries.

In this work, we do not consider language-model disagreement methods for two reasons. First, it has been shown that SVM variants exceed the performance of the language-model disagreement method on the same data set, using information only from the comment [84]. Second, although our multi-lingual corpus contains comments primarily in English, many comments are either partly or completely written in other languages. This renders language-model disagreement methods problematic.

7.2.2 Splog Detection

A related task involving filtering and blogs is splog detection. A splog (or spam blog) is a fake blog intended to fool search engines into assigning undue importance to an associated website [50]. This is done by inserting links from the splog to the website, increasing the site’s PageRank (an importance measure based on link structure). The splog detection problem is important to search engines, and has been approached by content analysis [50] and by temporal and link analysis [58].

This task is essentially distinct from the task of filtering blog comment abuse. In the abuse filtering task we are concerned with removing obscene, offensive, or commercial comments from real blogs rather than detecting fake blogs. These tasks do overlap, however, in relatively rare cases when abusers insert links within comments to deceive search engines.

7.2.3 Comparisons to Email Spam Filtering

There are important differences between filtering email spam and blog comment abuse. In the TREC-style tests for email spam filtering, it is assumed that all of the training data has gold-standard quality labels [23] that are equivalent to a consistent, accurate human judgment for each new message. Thus, even when noise is introduced into the training data, as in the previous chapter, there is an objective standard to use for evaluation. In contrast, the training labels for blog comment abuse are given by user flags from thousands of users, who may be inconsistent or even malicious in their flagging. Thus, the class label noise in this case is far from uniform, and there is no objective gold-standard set of labels to use for evaluation.

Later in this chapter, we show that the subjective nature of abuse makes constructing gold-standard labels a challenging task, with far lower inter-annotator agreement than is seen in the email gold-standard labelings.

Another difference is that blog comments are read by a community of readers, rather than an individual email recipient, and abuse is defined by the standards of the community. While email spam is motivated primarily by commercial intent [15], the majority of this blog comment abuse appears to be socially motivated. Thus, abusive blog comments are most often unique. With the exception of relatively infrequent commercial blog comment abuse, campaigns of abusive comments similar to high-volume email spam campaigns are rare.

                   cricket   getahead     money    movies      news    sports      total
unique blogs           416        188     1,380       627     1,748       405      4,764
unique userids      28,051      7,967    42,339    44,940    61,138     8,262    130,885
total comments     101,662     13,545   124,597   188,222   497,312    22,524    947,862
flag rate             0.12       0.07      0.11      0.22      0.23      0.19       0.20

Table 7.1: Summary statistics for the msgboard1 corpus of blog comments, broken out by topic.

7.3 The msgboard1 Data Set

This chapter centers on the msgboard1 corpus, a data set of nearly one million blog comments with user flag information (see Table 7.1). The corpus of blog comments was provided by Rediff.com, a leading blog hosting service catering to the needs of India’s large expatriate community.1

The corpus consists of blog comments gathered from January through August of 2007. The corpus contains comments from blogs on six self-identified topics, listed in Table 7.1. There are comments from a total of 4,764 unique blogs, of which 2,901 contribute at least ten comments to the corpus. Although the primary language of the comments is English, there are also comments written either partly or entirely in Hindi, Tamil, and many other languages of India, all represented with the standard ASCII character set. Misspellings and character substitutions are common.

Each comment is annotated with a userid identifying the author of the comment, a blog title showing the blog in which it was posted, and a flag variable showing whether or not the comment had been flagged by users.

1This corpus may be available on a per-case basis for research purposes. Contact pranshus@rediff.co.in.

Figure 7.1: Flag rates (fraction of comments flagged, per day) over time for the most popular blog. The spikes indicate periods of high amounts of flagging, often caused by abusive flame wars among users. Graphs for other blogs show similar patterns.

Over half of the 130,855 unique userids contributed only a single comment, rendering user history insufficient for reliable filtering.

7.3.1 Noise and User Flags

As noted above, typical blog hosting services remove comments that receive a sufficient number of flags. Although some blog hosting services require several flags to be received before a comment is removed from view, the msgboard1 data marked any comment receiving even a single user flag as a flag comment. The flag label is thus a binary label, because any comments receiving a flag were removed from view, making the accumulation of additional flags impossible.

This aggressive application of user flags has the benefit of removing abusive comments from view as quickly as possible. However, it has the drawback that flag information may contain significant noise, as individual users may flag without checks or balances. For example, in the Rediff.com system, there is no way for users to mark a comment as “non-abusive”. This enables vindictive users to flag comments without repercussion.

Furthermore, note that not all abusive messages will receive user flags. Many users will simply skip over abusive comments without flagging them. Thus, class label noise is present in the data for both flag and non-flag labels.

Finally, the subjective nature of the concept of abuse creates large levels of inconsistency in flagging. As we describe in Section 7.6, we found that the pairwise inter-annotator agreement was around 74%. That is, dedicated human adjudicators disagreed about the class label of comments 26% of the time. Thus, the level of class label noise in this data set is significantly higher than noise levels encountered in email spam filtering.

7.3.2 Patterns of Abuse

Are comments flagged at a steady rate, or are there flagging peaks and lows? To answer this, we plotted the flag rate (fraction of comments getting flagged) per day for the most popular blogs over the span of data collection, applying Laplace smoothing to the rates to remove the effect of low-traffic days. Figure 7.1 shows results for the most popular blog (with over 40,000 comments); results for other high-volume blogs were similar. Interestingly, the graph shows several aperiodic spikes, denoting dates on which a large percentage of comments were flagged.2 We speculate that these spikes are caused primarily by flame wars, often seen in blog comments, in which users direct abuse at each other in heated conflict.
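As an illustration, a Laplace-smoothed daily flag rate can be computed as follows; the smoothing pseudo-count used for Figure 7.1 is not specified above, so the default value here is an assumption.

    def smoothed_flag_rate(num_flagged, num_comments, alpha=1.0):
        """Laplace-smoothed fraction of comments flagged on one day.
        alpha is a smoothing pseudo-count (the exact value used for
        Figure 7.1 is an assumption here)."""
        return (num_flagged + alpha) / (num_comments + 2.0 * alpha)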

7.3.3 Understanding User Flags

What causes a comment to get flagged? To explore this question, we computed information gain [94] values for words in comments that have been flagged, and for non-flagged comments. (A selection of these is shown in Table 7.2.) Brief examination of each case showed that obscenities and commercial terms ranked highly for flagged comments, which agrees with the intuitive notion of comment abuse. Religious terms also ranked highly for flagged comments, highlighting the large amount of abuse found in the corpus containing hateful messages directed against members of several different religions. For non-flagged terms, it was surprising to note that common stop words (such as pronouns and articles) scored highly – perhaps because comments written with proper grammatical usage tended not to be abusive. Other words that stand out are terms associated with civil discourse.
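For reference, the information gain of a binary word feature with respect to the flagged/non-flagged label can be computed from the four cell counts of its contingency table, as in the following sketch. This is a standard formulation, not necessarily the exact implementation used to produce Table 7.2.

    import math

    def entropy(pos, neg):
        total = pos + neg
        h = 0.0
        for c in (pos, neg):
            if c > 0:
                p = c / total
                h -= p * math.log(p, 2)
        return h

    def information_gain(n_pos_with, n_neg_with, n_pos_without, n_neg_without):
        """Information gain of a binary word feature for the flagged label,
        given counts of flagged/non-flagged comments with/without the word."""
        n_with = n_pos_with + n_neg_with
        n_without = n_pos_without + n_neg_without
        n = n_with + n_without
        prior = entropy(n_pos_with + n_pos_without, n_neg_with + n_neg_without)
        cond = (n_with / n) * entropy(n_pos_with, n_neg_with) \
             + (n_without / n) * entropy(n_pos_without, n_neg_without)
        return prior - cond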

To further understand the breakdown of abuse, we examined a uniform random sample of 100 flagged comments. In this sample, 39 were found to contain obscenities or personal attacks, 39 were found to contain racially, nationally, or religiously motivated comments, 9 were found to be of commercial intent, and the remaining 12 were flagged for other reasons. We were surprised both at the high amount of socially motivated abuse, together totaling almost 80% of the flags, and at the relatively low amount of commercial abuse.

2It is possible that these temporal patterns may be useful in abuse filtering, an area for future work.

flagged                   non-flagged
com          wife         india        right
prophet      dog          people       please
contest      lovers       good         best
hmm          launched     think        same
annihilate   sexual       country      agree
ipods        lord         time         money
women        alternative  mr           life
causing      bangladesh   team         president
defending    hassle       world        work

Table 7.2: Selected words with high information gain, for flagged and non-flagged comments. Obscenities and references to specific religious figures have been removed from the flagged list for display, and stop words have been removed from the non-flagged list.

7.4 Online Filtering Methods with Class-Specific Costs

As with our prior experiments with email spam filtering, we treat the blog comment abuse filtering task as an online filtering task, to model the effect of filtering a stream of incoming comments over time [15].

However, we note that the class labels of flag and non-flag may have different levels of reliability. For example, users may skip over an abusive comment without flagging it, making the fact that a given comment received a non-flag label less reliable than a flag label. Additionally, note that the flag class is a minority class in this data set (see Table 7.1).

To cope with these conditions, we modify several of the online filtering methods to include class-specific costs. Previously, class-specific costs have been explored in spam filtering to cope with situations in which the cost of misclassifying ham as spam was different from the cost of misclassifying spam as ham [51].

• Perceptron with Margins. Training with class-specific misclassification costs may be implemented by using class-specific learning rate values η+ and η− for positive and negative example updates.

• Logistic Regression. As with Perceptron with Margins, class-specific misclassification costs for Logistic Regression (and Logistic Regression with 2-Norm Regularization) may be implemented using separate η+ and η− (a sketch of this style of update is given after this list).

• Relaxed Online SVMs. Class-specific misclassification costs in SVM variants are implemented by modifying the standard soft-margin SVM optimization problem to include separate cost parameters, C+ and C−. Thus, we seek to minimize

    (1/2) ||w||^2 + C+ Σ_{j : yj = +1} ξj + C− Σ_{j : yj = −1} ξj

subject to the same constraints described in Chapter 3.
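As a concrete illustration of the class-specific learning rates mentioned above for Perceptron with Margins and Logistic Regression, the following Python sketch shows one online logistic regression step with separate η+ and η−. The function and data-structure names, and the default parameter values, are illustrative assumptions rather than the interface of our actual implementation.

    import math

    def cost_sensitive_logreg_update(w, x, y, eta_pos=0.1, eta_neg=0.05):
        """One online logistic regression step with class-specific learning
        rates, for a sparse feature dict x and label y in {+1, -1}."""
        z = sum(w.get(f, 0.0) * v for f, v in x.items())
        p = 1.0 / (1.0 + math.exp(-z))        # P(y = +1 | x)
        eta = eta_pos if y == +1 else eta_neg  # class-specific step size
        target = 1.0 if y == +1 else 0.0
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + eta * (target - p) * v
        return w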

7.4.1 Feature Sets

Each of the above methods was tested in conjunction with the binary 4-mer feature space, which gave best results on the small set of blog comment spam developed by Mishne [64, 84]. We also tested a binary word-based feature space and found that this feature space decreased performance for the above methods. However, the word-based feature space proved superior for the two Naive Bayes variants we tested; the reported results for Naive Bayes variants use the word-based feature space. In addition, with either the word-based or 4-mer features, we included distinct binary features based on blog title, and others on userid.
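For concreteness, a minimal sketch of the binary 4-mer feature mapping with the additional blog-title and userid indicator features might look as follows. Whether the text is lower-cased, and the exact feature-name prefixes, are assumptions made for illustration only.

    def binary_4mer_features(comment, blog_title=None, userid=None, k=4):
        """Binary k-mer feature map for a blog comment, with optional
        distinct indicator features for blog title and author userid."""
        feats = set()
        text = comment.lower()
        for i in range(len(text) - k + 1):
            feats.add("kmer:" + text[i:i + k])
        if blog_title is not None:
            feats.add("blog:" + blog_title)
        if userid is not None:
            feats.add("user:" + str(userid))
        return {f: 1.0 for f in feats}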

7.4.2 Alternatives

We also investigated the utility of ensemble methods combining the output of several different filters, which were found effective for email spam [62]. Our experiments with these methods showed no improvement over the best filter included in the ensemble, as the filters tended to make correlated errors. We also experimented with the online methods of cleaning or correcting labels described in the previous chapter. However, neither label cleaning nor label correcting improved results.

7.5 Experiments

In this section, we detail our experimental framework and report results using user flag data for evaluation, both for testing individual filters and for comparing global versus per-topic filtering.

7.5.1 Experimental Design

For each filtering method, we performed separate online filtering experiments for each topic and a global all-test experiment using data from all topics excluding sports. The sports topic was reserved for initial tests and parameter tuning. We used user-flag labels as ground truth for training, with flagged examples counting as positives.

Our primary evaluation measure is Area under the ROC Curve (ROCA). We use this measure instead of reporting (1-ROCA)% because the (1-ROCA)% score is designed to make high ROC areas easy to compare. In these experiments, our ROC areas are not high enough to warrant the (1-ROCA)% transformation.
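For concreteness, the sketch below computes ROC area in its rank-statistic form: the probability that a randomly chosen positive (abusive) example receives a higher score than a randomly chosen negative one, with ties counted as one half. This is a standard textbook formulation written with illustrative names; it is not the evaluation code used in our experiments.

    def roc_area(scores, labels):
        """Area under the ROC curve for scores and labels in {+1, -1};
        higher scores mean 'more likely abusive'. Quadratic-time sketch,
        assuming at least one positive and one negative example."""
        pos = [s for s, y in zip(scores, labels) if y == +1]
        neg = [s for s, y in zip(scores, labels) if y == -1]
        wins = 0.0
        for sp in pos:
            for sn in neg:
                if sp > sn:
                    wins += 1.0
                elif sp == sn:
                    wins += 0.5
        return wins / (len(pos) * len(neg))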

7.5.2 Parameter Tuning

Following standard machine learning methodology, we used the sports topic data as a separate tuning set to tune parameters for all filters before running the full experiments. Coarse grid search was performed to tune all parameters, using ROCA as our evaluation measure and user flag labels as ground truth for both training and evaluation.

The use of class-specific misclassification costs gave improved performance for each of the discriminative filters in tuning tests. This was due both to the fact that flagged comments are a minority of the data (see Table 7.1), and to the fact that the presence of a flag may be more reliable information than a non-flag. For Perceptron with Margins, tuning selected η+ = 0.7 and η− = 0.2. For both Logistic Regression variants, tuning selected η+ = 0.3 and η− = 0.08, with λ = 0.00004 for 2-norm regularization. For ROSVM, tuning selected C+ = 1.5 and C− = 0.5.

Note that ROSVM performed best with low values of the cost parameter C, which enforces regularization to help cope with label noise. This agrees with results from the previous chapter, which showed that low values of C are helpful for combating class label noise in spam filtering. This is a radical difference from prior email spam filtering evaluations, where very high values of C (representing minimal regularization) have been found most effective [34, 84] due to the lack of noise in the training labels.

7.5.3 Results Using User-Flags for Evaluation

ROCA scores using user-flag labels as ground truth for evaluation are given for each filter and task in the top graph of Figure 7.2. We make three observations from these results.

First, the discriminative classifiers strongly out-perform the generative Naive Bayes classifiers. Within the Naive Bayes classifiers, the document prior was much superior to the term prior. ROSVM gives best results across the majority of tasks, but both Logistic Regression and Perceptron with Margins yield close performance and may be preferred for lower computational cost.

Figure 7.2: ROCA results for User-Flag evaluation (top) and Gold Standard Evaluation (bottom), for the tasks cricket, getahead, money, movies, news, and all-test. The legend is the same for both graphs: NB Term Prior, NB Doc Prior, Perceptron w/Margins, Log. Reg. (Classic), Log. Reg. (Regularized), and ROSVM.

Second, using the 2-norm regularization for Logistic Regression was not beneficial, and actually reduced performance in almost all cases compared to the non-regularized variant. We believe this is because in abuse filtering there are many rare-but-informative features, such as misspellings and intentional word obfuscations. The 2-norm regularization aggressively reduces the influence of rare features. The regularization provided by the low C values of ROSVMs appeared better able to handle these sparse, informative features.

Third, the overall ROCA performance of the filters is far below the levels commonly seen in email spam filtering. As we show in Section 7.6, the noise in user flags causes this evaluation to be an underestimate of true performance.

7.5.4 Filtering Thresholds

The full ROC curves for the all-test task with all filters are given in Figure 7.3, showing the effect of varying the classification threshold between abuse and non-abuse. As discussed at the beginning of this chapter, the goal of promoting civil discourse requires removing abusive comments from view; but to protect free speech we must be careful to keep false positive rates low. Thus, we would prefer the ROC curve to remain tight against the left vertical axis (showing 0 false positive rate) for as long as possible. The ROSVM, classical Logistic Regression, and Perceptron with Margins all perform well on this task, filtering out 30% of user-flagged abuse with a 1% false positive rate, and roughly half of all flagged abuse with a 5% false positive rate.

Figure 7.3: ROC curves (true positive rate versus false positive rate) for the all-test task, using User Flag Evaluation (top) and Gold Standard Evaluation (bottom). Curves are shown for ROSVM, Logistic Regression (Classic), Perceptron with Margins, Logistic Regression (Regularized), Multinomial NB (Doc Prior), Multinomial NB (Term Prior), and a uniform random baseline.

                           Cumulative         Global
                           Topic Specific     All-Test
Naive Bayes Term Prior     0.630              0.608
Naive Bayes Doc. Prior     0.787              0.716
Perceptron w/Margins       0.848              0.855
Logistic Regression        0.852              0.857
Regularized Log. Reg.      0.850              0.845
ROSVM                      0.862              0.863

Table 7.3: ROCA results of topic-specific versus global filtering. Generative methods benefit from topic-specific filtering, while discriminative methods are not significantly harmed by global filtering.

7.5.5 Global Versus Per-Topic Filtering

Looking at the above results, it is natural to ask if it is better to apply one global filter for all topics or specialized filters for each topic. To answer this, we compared the results of each filter on the global all-test topic to the cumulative result across each of the per-topic tasks.

The results in Table 7.3 show that topic-specific filtering improves the results of the generative filters using Naive Bayes. This is because the distributions of abuse vary by topic; when all topics are combined the generative methods are less able to accurately estimate these distributions, reducing predictive performance. However, discriminative filters give nearly equivalent results for topic-specific and global filtering, because these methods do not estimate distributions as an intermediate step. Thus, for the discriminative methods global filtering may be preferred for simplicity.

7.6 Gold-Standard Evaluation

The results of automated filtering using user flags for training are significantly lower than we have seen for email spam filtering, even for the best performing methods. However, these results were computed using the user-flag labels not only for training, but also for evaluation. In contrast, our experiments for email spam filtering in the presence of class label noise used noisy labels for training, but gold-standard labels for evaluation. Evaluating with noisy labels may have under-estimated the true performance of the filters.

Furthermore, in order to interpret these results, it is important to know how much noise exists in the user flags. We estimated this in email spam filtering by comparing the noisy user-generated labels to the gold-standard labels provided with the data sets.

Clearly, gold-standard labels were needed for the msgboard1 data set, both for reliable evaluation and meaningful interpretation. However, hand labeling nearly a million comments was beyond our capacity. Thus, we used a bootstrapping method similar to that employed by Cormack and Lynam in the creation of the gold-standard labels for the TREC email spam corpora [23].

7.6.1 Constructing a Gold Standard Set

Our goal was to construct a gold standard set of labels for a subset of the msgboard1 data set, large enough for meaningful performance evaluation and small enough to be labeled by volunteer human adjudicators. Our basic methodology was to sample examples uniformly from each topic in the data set. We then followed the bootstrapping methodology for efficient use of human adjudication proposed by Cormack and Lynam [23], using multiple automated filters to assess each message and only asking humans to adjudicate messages for which there was disagreement among the filters.

We first sampled a pool of examples uniformly at random from each topic, stratifying the sampling by topic. This uniform sampling ensures that the gold standard set may be used for future evaluations. We then removed all examples from the pool for which the user-flag label and the predicted label given by each of four filters3 unanimously agreed. Such unanimous labels were considered trustworthy [23] and were used as gold-standard labels for these examples.

From the sampled pool, there were 2,767 examples for which there was disagreement either among filter predictions or between a filter and the user flag label. These examples were human adjudicated. Three volunteer human adjudicators (one computer scientist and two non-computer scientists) were each independently asked to label each example as abuse, non-abuse (ok), or unsure. Adjudicators were shown the text of the comment, the title of the blog in which it was posted, and the topic of the blog, along with brief guidelines for defining abuse, as shown in Figure 7.4. Adjudicators were not shown any of the predicted labels, nor the user flag label, nor any of the other adjudicators’ ratings. Furthermore, the userid was withheld for privacy reasons.

Gold standard labels for the adjudicated set were determined by majority vote. The inter-annotator agreement for abuse and non-abuse was 0.742 ± 0.006, excluding unsure ratings, and the kappa measure of agreement [8] was 0.48. Comments receiving predominantly unsure ratings – most often because the comment was in a non-English language – were given to a language expert for final adjudication. A small number of these were also labeled unsure by the language expert, and were removed from the set. The final gold standard set, which includes both adjudicated and unanimous examples, is described in Table 7.4.
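For reference, pairwise agreement and a kappa statistic for two annotators can be computed as in the sketch below. This is standard Cohen's kappa over a single pair of annotators; the kappa of [8] generalizes this to the multi-annotator setting, so the sketch is illustrative rather than the exact computation used above.

    def cohen_kappa(labels_a, labels_b):
        """Cohen's kappa for two annotators over the same items (labels are
        'abuse' / 'ok'; 'unsure' ratings are assumed excluded beforehand)."""
        pairs = list(zip(labels_a, labels_b))
        n = len(pairs)
        observed = sum(1 for a, b in pairs if a == b) / n
        expected = 0.0
        for c in set(labels_a) | set(labels_b):
            pa = sum(1 for a in labels_a if a == c) / n
            pb = sum(1 for b in labels_b if b == c) / n
            expected += pa * pb
        return (observed - expected) / (1.0 - expected)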

3The four filters we used were multinomial Naive Bayes with the document prior, perceptron with margins, classical logistic regression, and ROSVM.

Figure 7.4: Screen shot of the blog comment rating tool used by adjudicators.

            total      fraction       fraction
            sampled    adjudicated    corrected
cricket        994       0.22           0.07
getahead      1310       0.10           0.03
money         1275       0.22           0.05
movies        1909       0.34           0.12
news          2610       0.43           0.16
sports        1178       0.30           0.08
all-test      8096       0.30           0.10

Table 7.4: Summary statistics for the gold standard evaluation set. Adjudication and correction rates vary widely by topic. The news topic, in particular, required extensive adjudication of religious and racial comments.

7.6.2 Gold Standard Results

As before, each filter was tested on each task, using the online filtering methodology and the noisy user flag labels for training. Here, however, the evaluation was performed using gold standard labels, considering results only on examples in the gold standard set. The ROCA results for this evaluation are given in Figure 7.2 (bottom).

Note that the results for gold standard evaluation are uniformly higher than those using user-flag labels for evaluation. This shows that the user flag evaluation is, indeed, an under-estimate of true performance. However, it is interesting to observe that the relative performance of filters for tasks is largely preserved between the two evaluations. Thus, in practical settings, it may be best to use the inexpensive user-flag labels for system tuning, testing, and monitoring to track relative changes. The more costly gold standard evaluation would only be required for occasional confirmation of true performance levels.

Results by topic (NB Term Prior / NB Doc. Prior / Perceptron w/Margins / Logistic Regression / Regularized Log. Reg. / ROSVM / User Flags):

cricket: 0.384 ±0.075, 0.323 ±0.071, 0.677 ±0.080, 0.616 ±0.081, 0.626 ±0.081, 0.717 ±0.078, 0.735 ±0.073

getahead: 0.132 ±0.059, 0.191 ±0.069, 0.500 ±0.097, 0.441 ±0.095, 0.456 ±0.095, 0.485 ±0.097, 0.753 ±0.086

money: 0.313 ±0.062, 0.398 ±0.067, 0.711 ±0.069, 0.711 ±0.069, 0.695 ±0.070, 0.727 ±0.068, 0.791 ±0.062

movies: 0.380 ±0.045, 0.405 ±0.046, 0.624 ±0.048, 0.638 ±0.048, 0.595 ±0.049, 0.663 ±0.048, 0.662 ±0.043

news: 0.336 ±0.035, 0.426 ±0.038, 0.610 ±0.040, 0.617 ±0.040, 0.581 ±0.040, 0.652 ±0.039, 0.642 ±0.036

all-test: 0.316 ±0.022, 0.395 ±0.024, 0.629 ±0.026, 0.629 ±0.026, 0.573 ±0.026, 0.651 ±0.026, 0.681 ±0.023

Table 7.5: Results for F1 Measure, Gold Standard Evaluation. F1 Measure is computed using precision and recall, where an abusive comment is considered a positive example. For all filters, the F1 measure was computed at the precision-recall break-even point.

7.6.3 Filters Versus User Flags

The gold standard set also allows us to compare the performance of the filters with the process of user-flagging itself. Using user-flags as a prediction, we computed recall (r), the fraction of all abuse that was detected, and precision (p), the rate of correct predictions found when abuse was predicted. For each task, we then computed the F1 measure [94], defined as F1 = 2pr/(p + r), for the user-flag predictions, and compared these to the F1 measure for each filter when the classification threshold τ was set at the precision-recall break-even point (where p = r).
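To make this evaluation concrete, the following is a minimal sketch of computing the F1 measure at the precision-recall break-even point from per-message scores and gold-standard labels. The threshold sweep shown here, and the function name, are assumptions about implementation detail, not the exact code used in our experiments.

    def f1_at_breakeven(scores, gold_labels):
        """F1 at the precision-recall break-even point (abuse = +1).
        Sweeps the threshold over observed scores and returns F1 at the
        threshold where |precision - recall| is smallest; at the exact
        break-even point F1 equals precision equals recall."""
        best_gap, best_f1 = float("inf"), 0.0
        for tau in sorted(set(scores)):
            tp = sum(1 for s, y in zip(scores, gold_labels) if s >= tau and y == +1)
            fp = sum(1 for s, y in zip(scores, gold_labels) if s >= tau and y == -1)
            fn = sum(1 for s, y in zip(scores, gold_labels) if s < tau and y == +1)
            if tp == 0:
                continue
            p, r = tp / (tp + fp), tp / (tp + fn)
            if abs(p - r) < best_gap:
                best_gap, best_f1 = abs(p - r), 2 * p * r / (p + r)
        return best_f1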

The results, shown in Table 7.5 (with 0.95 confidence intervals), show that user flags give slightly superior performance to the discriminative filters, and far out-perform the generative filters. An exception is the small getahead topic, where lack of training data made the filters much inferior to user flags. Furthermore, ROSVMs give improved performance on the contentious news task. However, in most cases, the confidence intervals for discriminative filters and user flags overlap, showing that these filters are able to approach the performance of user flagging. While it is encouraging that automated filters approach the effectiveness of user flags, as we discuss in Section 7.7 it is not necessary to choose between these approaches.

7.6.4 Filters Versus Dedicated Adjudicators

As noted before, the inter-annotator agreement for the human adjudicated examples was 0.742 ± 0.006. The discriminative classifiers approached this level of accuracy on the human adjudicated examples (a subset of the gold standard set). ROSVM and Perceptron with Margins both gave accuracy of 0.728 ± 0.018, and Logistic Regression had accuracy of 0.722 ± 0.018. For contrast, the user flag accuracy on the adjudicated examples was 0.689 ± 0.018. Thus, the discriminative filters agreed nearly as well with the human adjudications as the human adjudicators agreed with each other. This shows that these filters may be approaching the limits of human subjectivity on this task despite the noisy training labels.

7.7 Discussion

We have shown that effective filters for blog comment abuse may be trained using user flag labels, despite the noise inherent in this signal. Online SVM variants give best results, but other discriminative classifiers give nearly as strong performance and are considerably cheaper. Generative models, represented here by Naive Bayes, fare relatively poorly.

This filtering domain has proven to be considerably more difficult than the email spam filtering domain. This is because the labels for training are inherently noisy, there are more varied forms of abuse in this domain than in commercial email spam, and subtle cases of abuse and non-abuse are difficult to distinguish in this domain. Nevertheless, when plentiful training data is available, discriminative filters approach the performance of user flagging as a filtering methodology, and rival the effectiveness of dedicated human annotators. In the remainder of this section, we consider issues surrounding real world application of automated abuse filtering for blog comments, including plans for future work.

7.7.1 Two-Stage Filtering

In a real-world application, it is not necessary to choose between automated filtering and user flagging. They can be used in series, with an automated filter serving as a first level filter, and user flagging serving as a second level. Initial offline analysis using the gold standard set shows that this approach can cut the amount of abuse shown to users for flagging down to a third of current rates with little loss of non-abuse comments. Although live testing will be required for confirmation, it is our belief that limiting the amount of abuse shown to users will increase their ability to flag effectively. This conjecture is supported by the observation that the topics with the lowest amount of abuse in our corpus (Table 7.1) also had the most accurate user flagging (Table 7.4).

7.7.2 Feedback to and from Users

Of course, any comments that are automatically filtered will not be seen by users, and thus cannot be used for training the filter. This is an extension of the one-sided feedback scenario from Chapter 5. Exploring this effect in blog comment abuse filtering is an area for future work.

A larger concern is the feedback that is provided to users, who will immediately see when their comments are filtered out. This will allow abusers to adopt a trial-and-error approach to posting abuse, which may be detrimental to filtering performance. One possible alternative is to use a filter to adjust the flagging threshold needed to remove a comment from view. This way, comments predicted to be abusive would require fewer flags to be removed from view, and comments predicted to be non-abusive would require many more. This element of non-determinism would reduce abusers’ ability to break the filter, while maintaining the ability of a community to enforce its standards.
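As an illustration of this proposal, a deployment might map the filter's abuse score to the number of flags required before a comment is hidden; every constant in the sketch below is hypothetical rather than a value from any deployed system.

    def flags_required(filter_score, base_flags=3, min_flags=1, max_flags=10):
        """Number of user flags needed to remove a comment from view,
        adjusted by the filter's abuse score in [0, 1]. All constants
        here are illustrative assumptions."""
        if filter_score > 0.9:     # confidently abusive: remove quickly
            return min_flags
        if filter_score < 0.1:     # confidently benign: require strong consensus
            return max_flags
        return base_flags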

7.7.3 Individual Thresholds

Part of the noise in user-flag data is caused by the inherent subjectivity of the flagging task: different users have different standards for acceptable modes of discourse. As we have seen, this makes it difficult to create a single filter with a single threshold for defining abuse or non-abuse. However, enabling individualized filters for unique users is costly and inefficient. An alternative would be to allow users to select their own threshold for defining abuse. Some users would prefer to see all comments, regardless of obscenities, while others would prefer to be shown only the most civil. This information could be efficiently stored as a user preference, and used at serving time to hide particular comments. Furthermore, such a mechanism would enable the collection of more fine-grained flagging information. Comments flagged by tolerant users are likely to be highly offensive, and may be scored differently from comments that are flagged by sensitive users but unflagged by others. In this way, we could tailor the protection of civil discourse and free speech not only to a community, but to individual users as well.

Chapter 8

Conclusions

Some have considered the near-perfect classification performance attained in the idealized online filtering scenario by current machine learning methods, such as those presented in Chapters 2 and 3 of this dissertation, as reason to believe that the problem of spam filtering is essentially solved and deserving of little further research.

We strongly disagree with this sentiment. In this dissertation, we have furthered research in this field by examining the assumptions of the idealized filtering scenario. We have explored methods for reducing the requirements on human labeling effort, for coping with realistic scenarios where humans do not give feedback on predicted spam, and for dealing with noisy feedback in both the email and blog comment domains. Together, these advances enable more accurate spam filters in real world settings, with lower cost and greater robustness to noise.

However, important questions in the filtering domain remain open, requiring further investigation. We conclude this dissertation with a discussion of these open questions, as a door to future work.

8.1 How Can We Benefit from Unlabeled Data?

In Chapter 4, we showed how online active learning can be used in the online filtering scenario to dramatically reduce human labeling effort. Such methods select unlabeled examples to give to users for labeling; however, they do not make use of unlabeled data directly. As noted in Section 4.2.3, attempts to use semi-supervised learning methods to improve filtering performance with the use of unlabeled data have been essentially unsuccessful [67].

Thus, unlabeled data still seems to be an untapped resource. A typical large-scale spam filtering system for email spam may see billions of messages per day, only a small fraction of which are actually labeled. Future work should continue exploring ways of capitalizing on this vast resource of unlabeled data.

Possibilities that are, to our knowledge, unexplored in the published literature include the use of unlabeled data for dimensionality reduction along the lines of Latent Semantic Analysis [31] and the use of unlabeled data to provide additional information to generative models as a means of combating data sparsity [12].

8.2 How can we attain better user feedback?

Chapter 6 showed that class label noise from unreliable user feedback reduces filter performance. Chapters 6 and 7 both showed that natural label noise from human users is particularly problematic. In this dissertation, we have focused on ameliorating the impact of class label noise with modified machine learning algorithms. However, in practice, it may be more beneficial to modify the manner in which user feedback is collected.

One intriguing possibility along these lines is suggested by Luis von Ahn’s work in constructing online games to encourage humans to hand-label images for research in computer vision [92]. In this “ESP Game”, humans compete in teams of two, assigned randomly and anonymously online, and attempt to agree on words describing images that are shown to them simultaneously. The structure of the game incentivizes users to work by rewarding them with a point scoring system, and makes “cheating” impossible through the factors of anonymity, limited communication, and statistical tests comparing the results across groups of users [92]. This game has resulted in a high quality hand-labeled corpus of over 260,000 images [92].

One obvious application of this work would be to construct a similar “spam game” to acquire reliable ground truth labels for email messages. Other possibilities include tracking the reputation of human labelers, or creating incentive structures to encourage accurate labeling.

8.3 How can the academic research community gain access to larger scale, real world benchmark data sets?

The emergence of the three public TREC benchmark data sets from 2005, 2006, and 2007 has resulted in a significant improvement in filtering performance in the idealized online filtering scenario [24, 16, 18]. These data sets are the first public corpora of their size for spam filter benchmark testing, ranging from roughly 37,822 to 92,189 messages. However, even these corpora are small compared to the filtering load of typical email systems, which may handle millions or billions of messages per day. Furthermore, these corpora do not reflect natural labeling behavior by real users. This limits the applicability of academic research in this important domain.

We encourage industrial practitioners to participate in the academic spam filtering community through the construction and administration of large scale, real world benchmark data sets. Because of privacy issues, it is impractical that such data sets be made public. However, following the lead of the TREC competitions, it is possible for industrial practitioners to maintain such benchmark data sets privately and report aggregate results of new filtering methods without revealing private message content [24].

184 Bibliography

[1] The American Heritage Dictionary of the English Language, Fourth Edition.

spam. Houghton Mifflin, 2004.

[2] http://spamassassin.apache.org/tests 3 2 x.html, 2008.

[3] F. Assis. OSBF-lua – a text classification module for lua: the importance of the

training method. In The Fifteenth Text REtrieval Conference (TREC 2006)

Proceedings, 2006.

[4] S. Bickel. ECML-PKDD discovery challenge overview. In Proceedings of the

ECML/PKDD Discovery Challenge Workshop, 2006.

[5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-

training. In COLT’ 98: Proceedings of the eleventh annual conference on Com-

putational learning theory, 1998.

[6] A. Bratko, B. Filipiˇc, G. Cormack, T. Lynam, and B. Zupan. Spam filtering

using statistical data compression models. J. Mach. Learn. Res., 7, 2006.

[7] C. E. Brodley and M. A. Friedl. Identifying mislabeled training data. Journal

of Artificial Intelligence Research, (11), 1999.

185 [8] J. Carletta. Assessing agreement on classification tasks: the kappa statistic.

Comput. Linguist., 22(2), 1996.

[9] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector

machine learning. In NIPS, pages 409–415, 2000.

[10] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Worst-case analysis of selec-

tive sampling for linear classification. Journal of Machine Learning Research,

7:1205–1230, 2006.

[11] N. Cesa-Bianchi, G. Lugosi, and G. Stoltz. Minimizing regret with label efficient

prediction. In COLT, pages 77–92, 2004.

[12] A. Chapelle, B. Scholkopf, and A. Zien.

[13] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and

partial string matching. IEEE Transactions on Communications, 32(4), April

1984.

[14] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learn-

ing. Mach. Learn., 15(2):201–221, 1994.

[15] G. Cormack and T. Lynam. On-line supervised spam filter evaluation. ACM

Transactions on Information Systems, 25(3), July 2007.

[16] G. V. Cormack. TREC 2006 spam track overview. In TREC 2006: Proceedings

of the The Sixteenth Text REtrieval Conference, 2006.

[17] G. V. Cormack. Personal communication, 2007.

[18] G. V. Cormack. TREC 2007 spam track overview. In TREC 2007: Proceedings

of the The Sixteenth Text REtrieval Conference, 2007.

186 [19] G. V. Cormack. University of waterloo participlation in the TREC 2007 spam

track. In TREC 2007: Proceedings of the The Sixteenth Text REtrieval Con-

ference, 2007.

[20] G. V. Cormack and A. Bratko. Batch and on-line spam filter comparison. In

Proceedings of the Third Conference on Email and Anti-Spam (CEAS), 2006.

[21] G. V. Cormack, J. M. G. Hidalgo, and E. P. S´anz. Spam filtering for short

messages. In CIKM ’07: Proceedings of the sixteenth ACM conference on Con-

ference on information and knowledge management, 2007.

[22] G. V. Cormack and R. N. S. Horspool. Data compression using dynamic markov

modelling. Comput. J., 30(6), 1987.

[23] G. V. Cormack and T. R. Lynam. Spam corpus creation for TREC. In Pro-

ceedings of the Second Conference on Email and Anti-Spam (CEAS), 2005.

[24] G. V. Cormack and T. R. Lynam. TREC 2005 spam track overview. In The

Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, 2005.

[25] G. V. Cormack and T. R. Lynam. On-line supervised spam filter evaluation.

Technical report, David R. Cheriton School of Computer Science, University of

Waterloo, Canada, February 2006.

[26] L. F. Cranor and B. A. LaMacchia. Spam! Commun. ACM, 41(8), 1998.

[27] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines.

Cambridge University Press, 2000.

[28] S. Dasgupta. Analysis of a greedy active learning strategy. NIPS: Advances in

Neural Information Processing Systems, 2004.

187 [29] S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based

active learning. In COLT, pages 249–263, 2005.

[30] D. DeCoste and K. Wagstaff. Alpha seeding for support vector machines. In

KDD ’00: Proceedings of the sixth ACM SIGKDD international conference on

Knowledge discovery and data mining, pages 345–349, 2000.

[31] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A.

Harshman. Indexing by latent semantic analysis. Journal of the American

Society of Information Science, 41(6):391–407, 1990.

[32] P. J. Denning. ACM President’s letter: electronic junk. Commun. ACM, 25(3),

1982.

[33] T. G. Dietterich. Ensemble methods in machine learning. In MCS ’00: Pro-

ceedings of the First International Workshop on Multiple Classifier Systems,

2000.

[34] H. Drucker, V. Vapnik, and D. Wu. Support vector machines for spam catego-

rization. IEEE Transactions on Neural Networks, 10(5):1048–1054, 1999.

[35] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd Ed. Wiley

Interscience, 2001.

[36] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using

the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.

[37] J. Goodman, G. V. Cormack, and D. Heckerman. Spam and the ongoing battle

for the inbox. Commun. ACM, 50(2), 2007.

[38] J. Goodman and W. Yin. Online discriminative spam filter training. In Pro-

ceedings of the Third Conference on Email and Anti-Spam (CEAS), 2006.

188 [39] P. Graham. A plan for spam. 2002.

[40] P. Graham. Better bayesian filtering. 2003.

[41] J. Graham-Cumming. There’s one born every minute: spam and .

http://www.jgc.org/blog/2006 05 01 archive.html, 2006.

[42] D. Helmbold and S. Panizza. Some label efficient learning results. In COLT ’97:

Proceedings of the tenth annual conference on Computational learning theory,

pages 218–230, 1997.

[43] D. P. Helmbold, N. Littlestone, and P. M. Long. Apple tasting. Inf. Comput.,

161(2):85–139, 2000.

[44] S. Hershkop and S. J. Stolfo. Combining email models for false positive re-

duction. In KDD ’05: Proceeding of the eleventh ACM SIGKDD international

conference on Knowledge discovery in data mining, pages 98–107, 2005.

[45] T. Joachims. Text categorization with suport vector machines: Learning with

many relevant features. In ECML ’98: Proceedings of the 10th European Con-

ference on Machine Learning, pages 137–142, 1998.

[46] T. Joachims. Transductive inference for text classification using support vec-

tor machines. In Proceedings of ICML-99, 16th International Conference on

Machine Learning, 1999.

[47] T. Joachims. Training linear svms in linear time. In KDD ’06: Proceedings of

the 12th ACM SIGKDD international conference on Knowledge discovery and

data mining, pages 217–226, 2006.

[48] R. Khardon and G. Wachman. Noise tolerant variants of the perceptron algo-

rithm. J. Mach. Learn. Res., 8, 2007.

189 [49] J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels. In

Advances in Neural Information Processing Systems 14, pages 785–793. MIT

Press, 2002.

[50] P. Kolari, T. Finin, and A. Joshi. SVMs for the : Blog identification

and splog detection. AAAI Spring Symposium on Computational Approaches

to Analyzing Weblogs, 2006.

[51] A. Kolcz and J. Alspector. SVM-based filtering of e-mail spam with content-

specific misclassification costs. In Proceedings of the TextDM’01 Workshop on

Text Mining - held at the 2001 IEEE International Conference on Data Mining,

2001.

[52] W. Krauth and M. M´ezard. Learning algorithms with optimal stability in

neural networks. Journal of Physics A, 20(11):745–752, 1987.

[53] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for

svm protein classification. Proceedings of the Pacific Symposium on Biocom-

puting, pages 564–575, January 2002.

[54] C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for

svm protein classification. Neural Information Processing Systems, (15):1441–

1448, 2003.

[55] C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein

sequences. Journal of Machine Learning Research, (5):1435–1455, 2004.

[56] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classi-

fiers. In SIGIR ’94: Proceedings of the 17th annual international ACM SIGIR

conference on Research and development in information retrieval, pages 3–12,

1994.

[57] K. Li, K. Li, H. Huang, and S. Tian. Active learning with simplified svms for

spam categorization. In International Conference on Machine Learning and

Cybernetics, volume 3, pages 1198–1202, 2002.

[58] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog de-

tection using self-similarity analysis on blog temporal dynamics. In AIRWeb

’07: Proceedings of the 3rd international workshop on Adversarial information

retrieval on the web, 2007.

[59] N. Littlestone. Learning quickly when irrelevant attributes abound: A new

linear-threshold algorithm. Mach. Learn., 2(4):285–318, 1988.

[60] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins.

Text classification using string kernels. Journal of Machine Learning Research,

(2):419–444, 2002.

[61] D. Lowd and C. Meek. Good word attacks on statistical spam filters. In

Proceedings of the Second Conference on Email and Anti-Spam (CEAS), 2005.

[62] T. Lynam and G. Cormack. On-line spam filter fusion. In SIGIR

’06: Proceedings of the 29th annual international ACM SIGIR conference on

Research and development in information retrieval, pages 123–130, 2006.

[63] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive

bayes – which naive bayes? Third Conference on Email and Anti-Spam

(CEAS), 2006.

[64] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model

disagreement. Proceedings of the 1st International Workshop on Adversarial

Information Retrieval on the Web (AIRWeb), May 2005.

[65] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[66] T. M. Mitchell. Generative and discriminative classifiers: Naive bayes and

logistic regression. In Machine Learning. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf, 2005.

[67] M. Mojdeh and G. Cormack. Semi-supervised spam filtering: Does it work? In

The Thirty-First Annual ACM SIGIR Conference Proceedings, 2008.

[68] A. Ng and M. Jordan. On discriminative vs. generative classifiers: A compar-

ison of logistic regression and naive bayes. Advances in Neural Information

Processing Systems, (14), 2002.

[69] J. Platt. Sequential minimal optimization: A fast algorithm for training support

vector machines. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances

in Kernel Methods - Support Vector Learning. MIT Press, 1998.

[70] J. Platt. Probabilities for sv machines. In A. Smola, P. Bartlett, B. Schölkopf,

and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74.

MIT Press, 1999.

[71] U. Rebbapragada and C. E. Brodley. Class noise mitigation through instance

weighting. In ECML 2007: 18th European Conference on Machine Learning,

2007.

[72] G. Rios and H. Zha. Exploring support vector machines and random forests for

spam detection. In Proceedings of the First Conference on Email and Anti-Spam

(CEAS), 2004.

[73] F. Rosenblatt. The perceptron: A probabilistic model for information storage

and organization in the brain. Psychological Review, 65:386–407, 1958.

[74] N. Roy and A. McCallum. Toward optimal active learning through sampling

estimation of error reduction. In ICML ’01: Proceedings of the Eighteenth

International Conference on Machine Learning, pages 441–448, 2001.

[75] G. Salton and C. Buckley. Improving retrieval performance by relevance feed-

back. Readings in information retrieval, pages 355–364, 1997.

[76] S. L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended

approach. Data Min. Knowl. Discov., 1(3), 1997.

[77] G. Schohn and D. Cohn. Less is more: Active learning with support vector ma-

chines. In ICML ’00: Proceedings of the Seventeenth International Conference

on Machine Learning, pages 839–846, 2000.

[78] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines,

Regularization, Optimization, and Beyond. MIT Press, 2001.

[79] D. Sculley. Online active learning methods for fast label-efficient spam filtering.

In CEAS 2007: Proceedings of the Fourth Conference on Email and Anti-Spam, 2007.

[80] D. Sculley. Practical learning from one-sided feedback. In KDD ’07: Proceedings

of the 13th ACM SIGKDD international conference on Knowledge discovery

and data mining, 2007.

[81] D. Sculley. On free speech and civil discourse: filtering abuse in blog comments.

Under review at CEAS, 2008.

[82] D. Sculley and C. E. Brodley. Compression and machine learning: A new

perspective on feature space vectors. DCC: Data Compression Conference,

2006.

[83] D. Sculley and G. V. Cormack. Filtering spam in the presence of noisy user

feedback. Under review at CEAS, 2008.

[84] D. Sculley and G. Wachman. Relaxed online SVMs for spam filtering. In The

Thirtieth Annual ACM SIGIR Conference Proceedings, 2007.

[85] D. Sculley and G. Wachman. Relaxed online SVMs in the TREC spam filtering

track. In The Sixteenth Text REtrieval Conference (TREC 2007) Proceedings,

2007.

[86] D. Sculley, G. Wachman, and C. Brodley. Spam filtering using inexact string

matching in explicit feature space with on-line linear classifiers. In The Fifteenth

Text REtrieval Conference (TREC 2006) Proceedings, 2006.

[87] R. Segal, T. Markowitz, and W. Arnold. Fast uncertainty sampling for labeling

large e-mail corpora. In CEAS 2006: Conference on Email and Anti-Spam,

2006.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication.

University of Illinois Press, 1949.

[89] S. Sonnenburg, G. Rätsch, and B. Schölkopf. Large scale genomic

sequence svm classifiers. In ICML ’05: Proceedings of the 22nd international

conference on Machine learning, pages 848–855, 2005.

[90] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT

Press, 1998.

[91] S. Tong and D. Koller. Support vector machine active learning with applications

to text classification. Journal of Machine Learning Research, 2:45–66, 2002.

[92] L. von Ahn and L. Dabbish. Labeling images with a computer game. In CHI

’04: Proceedings of the SIGCHI conference on Human factors in computing

systems, 2004.

[93] G. L. Wittel and S. F. Wu. On attacking statistical spam filters. CEAS: First

Conference on Email and Anti-Spam, 2004.

[94] I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and

Techniques, 2nd ed. Morgan Kaufmann, 2005.

[95] I. H. Witten, R. M. Neal, and J. G. Cleary. Arithmetic coding for data com-

pression. Commun. ACM, 30(6), 1987.

[96] W. Yih, J. Goodman, and G. Hulten. Learning at low false positive rates. In Proceedings of the Third Conference on Email and Anti-Spam (CEAS), 2006.

[97] W. Yih, R. McCann, and A. Kolcz. Improving spam filtering by detecting gray

mail. In CEAS 2007: Proceedings of the Fourth Conference on Email and Anti-Spam, 2007.

[98] X. Zeng and T. Martinez. An algorithm for correcting mislabeled data. Intel-

ligent Data Analysis, 5(1), 2001.

[99] X. Zhu and X. Wu. Class noise vs. attribute noise: a quantitative study of their

impacts. Artif. Intell. Rev., 22(3), 2004.
