ADVANCES IN ONLINE LEARNING-BASED SPAM FILTERING

A dissertation

submitted by

D. Sculley, M.Ed., M.S.

In partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

TUFTS UNIVERSITY

August 2008

ADVISER: Carla E. Brodley

Acknowledgments

I would like to take this opportunity to thank my advisor Carla Brodley for her patient guidance, my parents David and Paula Sculley for their support and encouragement, and my bride Jessica Evans for making everything worth doing. I gratefully acknowledge Rediff.com for funding the writing of this dissertation.

D. Sculley

TUFTS UNIVERSITY

August 2008

Abstract

The low cost of digital communication has given rise to the problem of spam, which is unwanted, harmful, or abusive electronic content. In this thesis, we present several advances in the application of online machine learning methods for automatically filtering spam. We detail a sliding-window variant of Support Vector Machines that yields state-of-the-art results for the standard online filtering task. We explore a variety of feature representations for spam data. We reduce human labeling cost through the use of efficient online active learning variants. We give practical solutions to the one-sided feedback scenario, in which users only give labeling feedback on messages predicted to be non-spam. We investigate the impact of class label noise on machine learning-based spam filters, showing that previous benchmark evaluations rewarded filters prone to overfitting in real-world settings, and proposing several modifications for combating these negative effects. Finally, we investigate the performance of these filtering methods on the more challenging task of abuse filtering in blog comments. Together, these contributions enable more accurate spam filters to be deployed in real-world settings, with greater robustness to noise, lower computation cost, and lower human labeling cost.

Contents

Acknowledgments

List of Tables

List of Figures

Chapter 1 Introduction
1.1 Idealized Online Filtering
1.2 Online Learners for the Idealized Scenario
1.3 Contributions: Beyond Idealized Online Filtering
1.3.1 Online Filtering with Reduced Human Effort
1.3.2 Online Filtering with One-Sided Feedback
1.3.3 Online Filtering with Noisy Feedback
1.3.4 Online Filtering with Feedback from Diverse Users
1.4 Defining Spam
1.4.1 Conflicting Definitions
1.4.2 Scope and Scale
1.5 Machine Learning Problems
1.6 Overview of Dissertation

Chapter 2 Online Filtering Methods
2.1 Notation
2.2 Feature Mappings
2.2.1 Hand-crafted features
2.2.2 Word Based Features
2.2.3 k-mer Features
2.2.4 Wildcard and Gappy Features
2.2.5 Normalization
2.2.6 Message Truncation
2.2.7 Semi-structured Data
2.3 Online Machine Learning Algorithms for Online Spam Filtering
2.3.1 Naive Bayes Variants
2.3.2 Compression-Based Methods
2.3.3 Perceptron Variants
2.3.4 Logistic Regression
2.3.5 Ensemble Methods
2.4 Experimental Comparisons
2.4.1 TREC Spam Filtering Methodology
2.4.2 Data Sets
2.4.3 Parameter Tuning
2.4.4 The (1-ROCA)% Evaluation Measure
2.4.5 Comparison Results

Chapter 3 Online Filtering with Support Vector Machine Variants
3.1 An Anti-Spam Controversy
3.1.1 Contributions
3.2 Spam and Online SVMs
3.2.1 Background: SVMs
3.2.2 Online SVMs
3.2.3 Tuning the Regularization Parameter, C
3.2.4 Email Spam and Online SVMs
3.2.5 Computational Cost
3.3 Relaxed Online SVMs (ROSVM)
3.3.1 Reducing Problem Size
3.3.2 Reducing Number of Updates
3.3.3 Reducing Iterations
3.4 Experiments
3.4.1 ROSVM Tests
3.4.2 Online SVMs and ROSVM
3.4.3 Results
3.5 ROSVMs at the TREC 2007 Spam Filtering Competition
3.5.1 Parameter Settings
3.5.2 Experimental Results
3.6 Discussion

Chapter 4 Online Active Learning Methods for Spam Filtering
4.1 Re-Thinking Active Learning for Spam Filtering
4.2 Related Work
4.2.1 Pool-based Active Learning
4.2.2 Online Active Learning
4.2.3 Semi-Supervised Learning and Spam Filtering
4.3 Online Active Learning Methods
4.3.1 Label Efficient b-Sampling
4.3.2 Logistic Margin Sampling
4.3.3 Fixed Margin Sampling
4.3.4 Baselines
4.4 Experiments
4.4.1 Data Sets
4.4.2 Classification Performance
4.4.3 Comparing Online and Pool-Based Active Learning
4.4.4 Online Sampling Rates
4.4.5 Online Active Learning at the TREC 2007 Spam Filtering Competition
4.5 Conclusions

Chapter 5 Online Filtering with One-Sided Feedback
5.1 The One-Sided Feedback Scenario
5.2 Contributions
5.3 Preliminaries and Background
5.3.1 Breaking Classical Learners
5.3.2 An Apple Tasting Solution
5.3.3 Improving on Apple Tasting
5.4 Label Efficient Online Learning
5.5 Margin-Based Learners
5.5.1 Two Margin-Based Learners
5.5.2 Margin-Based Pushes and Pulls
5.5.3 Margins, One-Sided Feedback, and Active Learning
5.5.4 Exploring and Exploiting
5.5.5 Pathological Distributions
5.6 Minority Class Problems
5.7 Experiments
5.8 Conclusions

Chapter 6 Online Filtering with Noisy Feedback
6.1 Noise in the Labels
6.1.1 Causes of Noise
6.1.2 Contributions
6.2 Related Work
6.2.1 Label Noise in Email Spam
6.2.2 Avoiding Overfitting
6.3 Label Noise Hurts Aggressive Filters
6.3.1 Evaluation
6.3.2 Data Sets with Synthetic Noise
6.3.3 Filters
6.3.4 Initial Results
6.4 Filtering without Overfitting
6.4.1 Tuning Learning Rates
6.4.2 Regularization
6.4.3 Label Cleaning
6.4.4 Label Correcting
6.5 Experiments
6.5.1 Synthetic Label Noise
6.5.2 Natural Label Noise
6.6 Discussion

Chapter 7 Online Filtering with Feedback from Diverse Users
7.1 Blog Comment Filtering
7.1.1 User Flags and Community Standards
7.1.2 Contributions
7.2 Related Work
7.2.1 Blog Comment Abuse Filtering
7.2.2 Splog Detection
7.2.3 Comparisons to Email Spam Filtering
7.3 The msgboard1 Data Set
7.3.1 Noise and User Flags
7.3.2 Patterns of Abuse
7.3.3 Understanding User Flags
7.4 Online Filtering Methods with Class-Specific Costs
7.4.1 Feature Sets
7.4.2 Alternatives
7.5 Experiments
7.5.1 Experimental Design
7.5.2 Parameter Tuning
7.5.3 Results Using User-Flags for Evaluation
7.5.4 Filtering Thresholds
7.5.5 Global Versus Per-Topic Filtering
7.6 Gold-Standard Evaluation
7.6.1 Constructing a Gold Standard Set
7.6.2 Gold Standard Results
7.6.3 Filters Versus User Flags
7.6.4 Filters Versus Dedicated Adjudicators
7.7 Discussion
7.7.1 Two-Stage Filtering
7.7.2 Feedback to and from Users
7.7.3 Individual Thresholds

Chapter 8 Conclusions
8.1 How Can We Benefit from Unlabeled Data?
8.2 How Can We Attain Better User Feedback?
8.3 How Can the Academic Research Community Gain Access to Larger-Scale, Real-World Benchmark Data Sets?

Bibliography

List of Tables

2.1 Comparison results for methods on trec05p-1 data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parentheses.
2.2 Comparison results for methods on trec06p data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parentheses.
2.3 Comparison results for methods on trec07p data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parentheses.
3.1 Results for Email Spam filtering with Online SVM on benchmark data sets. Score reported is (1-ROCA)%, where 0 is optimal. These results are directly comparable to those on the same data sets with other filters, reported in Chapter 2.
3.2 Execution time for Online SVMs with email spam detection, in CPU seconds. These times do not include the time spent mapping strings to feature vectors. The number of examples in each data set is given in the last row as corpus size.
3.3 Email Spam Benchmark Data. These results compare Online SVM and ROSVM on email spam detection, using binary 4-mer feature space. Score reported is (1-ROCA)%, where 0 is optimal.
3.4 Results for ROSVMs and comparison methods at the TREC 2007 Spam Filtering track. Score reported is (1-ROCA)%, where 0 is optimal, with 0.95 confidence intervals in parentheses.
5.1 Results for Email Spam filtering. We report F1 score, Recall, Precision, number of False Spams (lost ham) and number of False Hams (spam in inbox) with one-sided feedback. We report results with full feedback for comparison.
6.1 Results for prior methods on trec06p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result for a given noise level, or confidence interval overlapping with confidence interval of best result.
6.2 Results for prior methods on trec07p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result for a given noise level, or confidence interval overlapping with confidence interval of best result. Methods unable to complete a given task are marked with dnf.
6.3 Results for modified methods on trec06p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result, or confidence interval overlapping with confidence interval of best result.
6.4 Results for modified methods on trec07p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result, or confidence interval overlapping with confidence interval of best result.
6.5 Results for natural and synthetic noise at identical noise levels. Natural label noise for trec05p-1 was uniformly sampled from human labelings collected by the spamorham.org project. Results are reported as (1-ROCA)%, with 0.95 confidence intervals.
7.1 Summary statistics for the msgboard1 corpus of blog comments, broken out by topic.
7.2 Selected words with high information gain, for flagged and non-flagged comments. Obscenities and references to specific religious figures have been removed from the flagged list for display, and stop words have been removed from the non-flagged list.
7.3 ROCA results of topic-specific versus global filtering. Generative methods benefit from topic-specific filtering, while discriminative methods are not significantly harmed by global filtering.
7.4 Summary statistics for the gold standard evaluation set. Adjudication and correction rates vary widely by topic. The news topic, in particular, required extensive adjudication of religious and racial comments.
7.5 Results for F1 Measure, Gold Standard Evaluation. F1 Measure is computed using precision and recall, where an abusive comment is considered a positive example. For all filters, the F1 measure was computed at the precision-recall break-even point.

List of Figures

1.1 Idealized online filtering scenario.
2.1 Obscured Text. These are the 25 most common variants of the word 'Viagra' found in the 2005 TREC spam data set, illustrating the problem of word obfuscation.
2.2 Pseudo-code for Classical Perceptron update rule and classification function.
2.3 Pseudo-code for Perceptron with Margins update rule; note that the classification function is the same as for classical Perceptron.
2.4 Logistic Regression employs a logistic loss function for positive and negative examples, which punishes mistakes made with high confidence more heavily than mistakes made with low confidence.
3.1 Visualizing SVM Classification. An SVM learns a hyperplane that separates the positive and negative data examples with the maximum possible margin. Error terms ξ_i > 0 are given for examples on the wrong side of their respective margin.
3.2 Pseudo-code for Online SVM.
3.3 Tuning the Regularization Parameter C. Tests were conducted with Online SMO, using binary feature vectors, on the spamassassin data set of 6034 examples. Graph plots C versus Area under the ROC curve.
3.4 Visualizing the effect of C. Hyperplane A maximizes the margin while accepting a small amount of training error. This corresponds to setting C to a low value. Hyperplane B accepts a smaller margin in order to reduce training error. This corresponds to setting C to a high value. Content-based spam filtering appears to do best with high values of C.
3.5 Pseudo-code for Relaxed Online SVM.
3.6 Reduced Size Tests.
3.7 Reduced Iterations Tests.
3.8 Reduced Updates Tests.
4.1 Online Active Learning.
4.2 Label Efficient b-Sampling Probabilities.
4.3 Logistic Margin Sampling Probabilities.
4.4 Online Active Learning using Perceptron with Margins, on trec05p-1 data.
4.5 Online Active Learning using Logistic Regression, on trec05p-1 data.
4.6 Online Active Learning using ROSVM, on trec05p-1 data.
4.7 Online Active Learning using Perceptron with Margins, on trec06p data.
4.8 Online Active Learning using Logistic Regression, on trec06p data.
4.9 Online Active Learning using ROSVM, on trec06p data.
4.10 Online Active Learning using Perceptron with Margins, on trec07p data.
4.11 Online Active Learning using Logistic Regression, on trec07p data.
4.12 Comparing Pool-based and Online Active Learning on trec06p.
4.13 Perceptron with Margins, sampling rate over time, trec05p-1.
4.14 Screen shot of proposed user interface for active requests for label feedback. In this framework, the user would be encouraged to label a small number of informative messages.
5.1 Spam Filtering with One-Sided Feedback.
5.2 One-Sided Feedback Breaks Perceptron. Here, white dots are ham examples, the black dots are spam, the dashed line is the prediction hyperplane, and the shaded area predicts spam. Examples 1, 2, and 3 each cause no updates: 1 and 3 are correct, and no feedback is given on 2. Examples 4 and 30 are the only examples causing updates, ratcheting the hyperplane until no hams are correctly identified.
5.3 Pseudo-code for Label Efficient active learner.
5.4 Margin-Based Pushes and Pulls. Examples 1, 2, and 3 cause no updates, as before. But Examples 4 and 25, each correctly classified but within the margins, push the hyperplane towards the spam. Example 15, a misclassified spam, pulls the hyperplane towards the ham.
5.5 Implicit Uncertainty Sampling for Perceptron with Margins. The margin-based learner with hypothesis h and margins m+ and m−, learning from one-sided feedback, reduces to an active learner with hypothesis h′ and margins m+ and h, using uncertainty sampling in the region between h and h′.
5.6 Exploration and Exploitation. If the initial hypothesis is h, then examples 1 and 2 cause margin updates pushing h_e out towards m−, but not beyond it unless an example is found to lie between h and h_e.
5.7 Pathological Distributions for One-Sided Feedback.
6.1 Results for varying learning rate η for Logistic Regression, on spamassassin tuning data with varying levels of synthetic uniform label noise. For clarity, the order of results is consistent between legend and figure.
6.2 Results for varying C in ROSVM for regularization, on spamassassin tuning data with varying levels of synthetically added label noise. For clarity, the order of results is consistent between legend and figure.
6.3 Pseudo-code for Online Label Cleaning.
6.4 Pseudo-code for Online Label Correcting.
7.1 Flag rates over time for the most popular blog. The spikes indicate periods of high amounts of flagging, often caused by abusive flame wars among users. Graphs for other blogs show similar patterns.
7.2 ROCA results for User-Flag evaluation (top) and Gold Standard Evaluation (bottom). Legend is the same for both graphs.
7.3 ROC Curves using User Flag Evaluation (top) and Gold Standard Evaluation (bottom), for all-test.
7.4 Screen shot of the blog comment rating tool used by adjudicators.

Chapter 1

Introduction

The task of email spam filtering – automatically removing unwanted, harmful, or offensive email messages before they are delivered to a user – is an important, large-scale application area for machine learning methods [37]. However, to date there has been a rift between academic researchers and industrial practitioners in spam filtering research. Academic evaluations, such as those conducted at the TREC spam filtering tracks [24, 16, 18], have reported that near-perfect filtering results may be obtained with a variety of machine learning methods [15]. Yet these methods and performance levels are not reflected in real-world practice.

This dissertation suggests that the rift between the academic and industrial spam filtering communities is rooted in an overly optimistic evaluation scenario used by academic researchers, which we refer to as the idealized online filtering scenario [15]. This idealized scenario assumes that a machine learning-based filter will be given perfectly accurate label feedback for every message that the filter encounters. In practice, this label feedback is provided only sporadically by users, and is far from perfectly accurate. Thus, practitioners have tended to discount the performance claims made by academics as being difficult or impossible to replicate in real-world settings.

This dissertation shows that high levels of spam filtering performance are achievable in settings that are more realistic than the idealized scenario. These settings include online scenarios in which users are willing to label only a subset of messages, in which users are willing to label predicted ham (messages that are not spam are referred to as ham) but not predicted spam, in which users give erroneous feedback, or in which diverse users disagree on what is spam and what is ham.

In the remainder of this introductory chapter, we review the idealized online filtering scenario in detail, and discuss machine learning methods that have performed well in this setting. We detail the modified filtering scenarios that form the bulk of this dissertation. We discuss the scope and scale of the spam filtering problem, and define the term “spam” for the purposes of this dissertation. Finally, we explain why these problems are of general interest to the machine learning community, as well as to the spam filtering community.

1.1 Idealized Online Filtering

Before exploring modified filtering scenarios, it is first necessary to understand the idealized online filtering scenario that has been traditionally used to evaluate learning-based filters [24, 16, 18, 15].

The idealized filtering scenario is depicted in Figure 1.1. Messages are assumed to arrive in streaming fashion, one message at a time. For each message, the learning-based filter makes a prediction of spam or ham. A human user then observes the message and the predicted label, and delivers feedback to the learning-based filter, providing it with the true label of that message. The filter is then able to use this label feedback to update its internal model, ideally improving future predictive performance.

Figure 1.1: Idealized online filtering scenario. (Diagram: an incoming message stream enters the learning-based filter, which predicts spam or ham; the user returns label feedback to the filter.)

This idealized scenario is useful for three reasons. First, it is a reasonable first approximation of the setting in which spam filters are deployed in real-world settings, where messages do, indeed, arrive in streaming fashion and users do provide label feedback. Second, it is a natural adaptation of the online learning scenario that has been well studied in machine learning [65]. Thus, a range of existing machine learning algorithms may be applied to this task. Third, this scenario makes clear that the stream of messages may be unbounded. This emphasizes the need for solutions allowing updates that are efficient in both computation and memory requirements, and that may adjust to changing patterns in the data over time.

1.2 Online Learners for the Idealized Scenario

The idealized online filtering scenario has enabled the development and empirical evaluation of a wide range of machine learning methods for spam filtering. These methods are briefly discussed here, and are reviewed in detail in Chapters 2 and 3.

The use of machine learning methods for online filtering dates back at least as far as 2002, when Paul Graham proposed using a variant of the Naive Bayes classifier for filtering email spam [39, 40]. Since then, a number of machine learning methods have been proposed for filtering, including several additional variants of the Naive Bayes classifier [63], as well as random decision-tree forests [72], support vector machines (SVMs) [34, 51, 72, 20, 84], logistic regression [38, 19], compression-based methods [6], and ensemble methods [62].

Our contributions in this regard have included the application of the Perceptron Algorithm with Margins [52, 48, 86], and the development of a fast, online SVM variant called Relaxed Online SVM (ROSVM) for spam filtering [84]. ROSVM has given top-level performance on several filtering tasks at TREC 2007 [18].

Despite this variety of possible methods, a recent survey of open-source spam filtering methods revealed that only Naive Bayes variants are commonly used in real-world filters [15]. Naive Bayes variants belong to the generative family of classifiers, which model the underlying processes generating the two classes of data (spam and ham) as an intermediate step [66]. The other main family of classifiers is discriminative; these classifiers seek to find boundaries between classes in the data space, without modeling the functions that produce these classes [66]. It has been found that the generative family performs better than the discriminative family when training data is scarce. However, when training data is plentiful, discriminative methods achieve lower asymptotic error [68]. For spam filtering systems collecting large amounts of training data, it seems reasonable to conjecture that discriminative methods would be superior in this setting. Indeed, in recent experiments, logistic regression and ROSVMs – both discriminative methods – have outperformed all known Bayesian competitors [38, 19, 84].

Of equal importance to the particular learning method is the feature representation chosen to represent this semi-structured email data. The majority of filters in the literature have relied on some form of word-based features [93].

However, spammers have developed attacks specifically designed to defeat word-based models, including tokenization and obfuscation attacks [93] that produce the now-familiar character-level modifications often seen in email spam messages. An exception is the family of compression-based filtering methods [6], which essentially rely on short character substrings [82]. We extend work in this area by proposing and testing the use of a variety of feature representations for spam data. Some of these feature mappings were originally developed in the field of computational biology, where character-level mutations are common [86]. Top-level performers from the most recent TREC evaluation used the binary 4-mer feature space we proposed for spam filtering [19, 85].

1.3 Contributions: Beyond Idealized Online Filtering

The user feedback in the idealized scenario, depicted by the dashed gray line in Figure 1.1, may be far from perfect. In this dissertation, we examine several ways that this assumption of perfect feedback may be modified to better reflect the needs and behaviors of real human users. First, in the idealized scenario, human users are required to perform significant amounts of hand labeling. This would ideally be reduced to require only a small fraction of examples to be labeled [79]. Second, in many settings users never give feedback for any messages predicted to be spam [80]. Third, users may give mistaken or even maliciously inaccurate feedback [83]. Fourth, when many users view the same message, there may be significant disagreement about its “true” label [81]. Each of these observations motivates a contribution of this dissertation, as detailed in the remainder of this section.

1.3.1 Online Filtering with Reduced Human Effort

Examining the idealized online filtering scenario, one obvious problem is the assumption that humans will give feedback for every message in the message stream [15]. There are several obvious flaws with this assumption, most notably the fact that requiring this effort from users reduces much of the benefit the filter is meant to provide. Gordon Cormack, among others, has proposed a scheme that allows users to report only errors made by filters [17], but even this requires users to scan every message. Industry experts have disclosed that real users label only a fraction of the total messages presented to them.

The active learning paradigm is a machine learning approach to reducing the labeling effort required by humans [14]. Such methods can reduce labeling effort dramatically, without significant reduction in classification performance. Although the idea of using active learning is a natural fit for reducing human labeling effort in the spam filtering domain, prior applications of active learning in this setting have been both computationally expensive and have harmed classification performance [16]. This is because these methods have used a pool-based approach, in which the active learner selects a number of examples from a large pool of unlabeled examples, iterating through many rounds. Cost is incurred as each example in the pool is considered many times. (Segal et al. have proposed an active learning variant which reduces this re-assessment cost, but still requires every example in the pool to be labeled before active learning can commence [87].) Additionally, many methods for pool-based active learning are prone to selecting redundant examples in such a setting, reducing the benefit of the human labeling effort.

In this dissertation, we propose and evaluate the use of online active learning methods for spam filtering [79]. Online active learners do not rely on a pool of unlabeled examples, but consider each unlabeled message as it arrives. For each message, the learner not only predicts ham or spam, but also determines whether or not to request a label from a human. We test several prior online active learning methods, showing that the required human labeling effort may be significantly reduced with little reduction in classification performance and negligible additional computational cost. Furthermore, our simple and novel fixed-margin sampling method gave the best results across a majority of data sets and base learning methods. Our proposal of online active learning as the natural form of active learning for spam filtering was adopted by the TREC 2007 spam filtering track [18].

1.3.2 Online Filtering with One-Sided Feedback

A second possible problem with the idealized online filtering scenario is the assumption that users will give feedback on both messages predicted to be ham and those predicted to be spam. There are a number of real-world settings in which users will never give feedback for predicted spam messages. For example, some systems remove predicted spam messages before they are shown to the user. In other cases, non-expert users may not know how to view predicted spam. Other users may simply choose never to view predicted spam messages. In all of these cases, feedback will only be given for predicted ham examples.

This scenario, which we refer to as the one-sided feedback scenario, was first examined by Helmbold et al., who called it the apple tasting problem, wherein labels were only provided for predicted positive examples [43]. They showed that one-sided feedback would break several online-learning algorithms such as Perceptron and Winnow, and gave a solution to this problem that involved sampling from the predicted negatives in a uniformly random manner at a rate determined by the past performance of the model.

We confirm that in the spam filtering domain, mistake-driven learners such as Perceptron are indeed broken by one-sided feedback. We apply the apple-tasting solution, and propose additional variants of online active learning to deal with this problem. However, we find the surprising result that margin-based learners such as the Perceptron Algorithm with Margins and ROSVMs are able to filter effectively in this setting without modification [80], as they implicitly perform a variant of fixed-margin uncertainty sampling on predicted spam messages.

1.3.3 Online Filtering with Noisy Feedback

A further issue with the idealized online filtering scenario is the assumption that user feedback will be accurate. In reality, users may give feedback that is mistaken, or even maliciously inaccurate [83]. Thus, the feedback may contain class label noise, wherein training data is incorrectly labeled. Indeed, John Graham-Cumming's spamorham.org project found that human labeling error rates approached 10% [41], and industry experts have cited a 3% error rate from users [96]. The errors included in these figures are objective errors, on the order of a spam email being reported as ham. Clearly, real-world spam filters must be robust to class label noise, but this is not considered in the idealized scenario.

There are several machine learning methods for dealing with class label noise, including various forms of regularization [78], as well as methods for cleaning [7, 71] or correcting training examples suspected to be mislabeled [98]. However, previous filtering methods, such as top performers from TREC spam filtering competitions, do not employ any such measures. Indeed, the current “folk wisdom” in the spam filtering community is that methods such as regularization only hurt filtering performance [17].

We show that even low levels of uniform class label noise harm or even break top-performing filtering methods from TREC. We explore several methods of ameliorating the effects of class label noise, finding that uniform label noise can be successfully handled with a variety of different methods. However, non-uniform label noise, as is produced by real users, remains a more difficult challenge.

1.3.4 Online Filtering with Feedback from Diverse Users

We extend our investigation of noisy feedback by considering cases where label feedback is inconsistent. The idealized online filtering scenario implicitly assumes that the user feedback is consistent – that is, that there is a single, objectively true label applied to each message. However, if many different users view the same message, it is possible that they will have differing perceptions of the “true” label for that message. This scenario is reflected in the domain of blog comment filtering.

In a blog, or internet journal, the blogger periodically posts entries. The readers of that blog then may post comments about the entry, which are added to the blog as a form of community discussion. Such postings may contain abuse, such as obscenities, personal attacks, or remarks degrading a particular race, religion, or nationality. However, because each comment may be read by many different users, there may be different viewpoints as to which comments are abusive and which are not. Indeed, our recent study of blog comments found pairwise inter-annotator agreement to be below 75% among the three volunteer human adjudicators.

This environment is an extreme challenge for filters developed for the online filtering scenario. We explore the ability of such filters to perform in this challenging environment of filtering blog comment abuse, finding that regularization and the use of class-dependent misclassification costs both give improvements for filtering methods in this domain [81]. Additionally, we suggest methods for improving the use of feedback from diverse users [81].

1.4 Defining Spam

As the primary focus of this dissertation is on detecting and filtering content-based email spam, it is worth taking the time to define this term. Although the concept of spam as unwanted mass email has entered the general lexicon [1], the term remains slightly ambiguous owing to several different conflicting usages.

1.4.1 Conflicting Definitions

In 1998, Cranor and LaMacchia defined spam as “unsolicited bulk email” [26]. This is a strict, unambiguous definition; however, in practice it is overly narrow. There may be messages that a user has “solicited” by neglecting to un-check a box when using a web form for a purchase, for example. More importantly, unwanted, offensive, or harmful messages that are not sent in “bulk” still negatively impact the user. Such messages should ideally be filtered regardless of whether or not they are sent as part of a larger campaign.

A broader definition harkens back to Peter Denning's 1982 ACM President's Letter, titled “Electronic Junk” [32]. Defining spam as electronic junk, or perhaps more specifically as unwanted or harmful electronic messages [84], encompasses a wide range of user needs. However, these terms themselves are subjective, and different users may have different perceptions of what is junk, unwanted, or harmful. Thus, this second definition gains increased coverage, but loses the ability to be objectively applied.

A definition of spam which may be objectively applied is anything marked as spam by a user. However, this form of definition by example has limited ability to generalize – it essentially requires that every possible message be labeled by a given user.

For the purposes of this dissertation, we rely on the definition of spam used by the experts who provided the gold-standard judgments of spam and ham in the TREC spam filtering benchmark data sets, using a boot-strapping process [23].

These experts used the following definition [23] of spam:

Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient.

Thus, these gold-standard labels attempt to provide an objective, consistent labeling as far as possible. In Chapters 6 and 7, we examine the impact of using actual user feedback labels for training, rather than gold-standard labels.

1.4.2 Scope and Scale

In practical settings, spam filtering is a large-scale problem: recent estimates show that as much as 80% of all email traffic is spam, amounting to billions of messages per day worldwide [37]. This level of spam email creates cost in network capacity and storage, to say nothing of the human cost involved in sorting through unfiltered spam.

Furthermore, email spam is prototypical of the family of content-based spam, wherein the goal of the spammer is to deliver a spam message that will be understood by a human user, for possible commercial or social gain. Other forms of content-based spam include blog comment spam, which involves spam comments posted to message boards on electronic journals (blogs) [81, 64], as well as SMS spam sent by text-messaging services [21] on cellphones and IM spam sent by instant-messaging services. Because of the similarity of these domains, it is reasonable to conjecture that advances in email spam filtering may be applied in these other domains as well. This dissertation also includes a chapter on filtering blog comment spam, highlighting both similarities and differences of this domain.

1.5 Machine Learning Problems

We believe that developing effective, efficient spam filtering will provide unambiguous social benefit. Yet aside from issues of serving society, we are also motivated in this work by the important machine-learning challenges inherent in this task, which make results from spam filtering applicable to other domains ranging from general text classification to computational biology to optimizing online advertising systems.

This domain typically involves semi-structured data. For example, an email message includes not only textual data, but also header information with routing and meta-information, and may additionally include images, attachments, and links.

Furthermore, the textual data may contain obfuscations that spammers employ in an attempt to avoid detection [93]. Feature representations for this domain typically result in high dimensionality, in which there may be many relevant features.

In real-world settings, filtering is a large-scale task that may involve billions of messages per day, placing a premium on efficient, scalable solutions that update in real-time, or near real-time. Finally, such systems rely on non-expert humans to provide label feedback, and must be robust to a variety of imperfections in the training data. Thus, advances in this application area not only have practical benefit in the domain of spam filtering, but may also be of use in other large-scale, time-sensitive domains involving semi-structured, high-dimensional, noisy data.

1.6 Overview of Dissertation

The remainder of this dissertation is organized as follows. Chapter 2 reviews prior feature mappings and machine learning algorithms used for online spam filtering, including variants of Naive Bayes, compression methods, Perceptron variants, Logistic Regression, and ensemble methods, and provides an empirical comparison of these approaches. Chapter 3 introduces our novel ROSVM algorithm, an efficient SVM variant for streaming data suited to the online filtering task. Chapter 4 explores the use of online active learning for spam filtering. The problem of one-sided feedback is presented in Chapter 5, and issues of noisy feedback are discussed in Chapter 6. Chapter 7 shows the ability of online learning-based filters to perform on the blog comment abuse filtering task, which involves feedback from diverse users. Our conclusions and plans for future work appear in the final chapter.

Chapter 2

Online Filtering Methods

A wide range of machine learning methods have been applied to the online filtering scenario. In this chapter, we review the most successful of these methods, including variants of the Naive Bayes classifier, compression-based methods, Perceptron variants, Logistic Regression, and ensemble methods. We also describe several different feature mappings that have been used to transform semi-structured email data, containing text, header information, and possible attachments, into feature vectors usable by machine learning methods. Thus, this chapter serves as a review of background and related work.

This chapter is divided into four sections. The first reviews basic notation used throughout this dissertation. The second reviews methods for feature mapping, the pre-processing step necessary to convert semi-structured email data into numerical feature vectors used by most machine learning methods. The third section reviews online machine learning algorithms for online filtering, and the final section of this chapter includes an experimental comparison of these various methodologies.

This chapter does not review Support Vector Machines or variants; these are presented in the next chapter of this dissertation, with experimental comparisons to the methods detailed in this chapter.

2.1 Notation

Before describing online filtering methods, it is first necessary to outline the notation used in this dissertation.

In the online filtering scenario, the filter is shown one message (or example) at a time, in a time-ordered sequence where i is the current time step. Each message is represented by a feature vector x_i ∈ X, where X ⊆ R^n, with an associated label y_i ∈ Y, where Y ⊆ {−1, +1} for ham and spam messages, respectively. For now, we will assume that this label is ground truth, and represents the “true” classification of the message. In Chapters 6 and 7, we will explore the case where the label may be noisy. In these later chapters, the example's “true” label is y′_i, and it is not necessarily the case that y_i = y′_i.

For each example the filter is asked to predict ham or spam using a function f(x_i), with the label hidden. This function may make use of a weight vector w ∈ R^n, which maintains a set of weights, where each weight is associated with a particular dimension in the feature space.

Once the prediction is made, the example's label y_i is revealed to the filter, which may then update its prediction model f(·) as needed using (x_i, y_i). This update is often performed by changing the weights stored in w.

The sparsity of an example vector x_i is given by s, and represents the number of non-zero values in the vector. (Note that this usage of the term “sparsity” is slightly different than the common English usage of this word.) An individual feature in X is referred to by X_j. The value of a particular feature X_j in a specific feature vector x_i is given by x_ij.
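To make this protocol concrete, the following minimal sketch expresses the online loop in Python. The filter_model object (with predict and update methods) and the message_stream of (x_i, y_i) pairs are hypothetical placeholders for whatever learner and corpus are in use, not part of any particular filter described in this dissertation.

    def run_online_filter(filter_model, message_stream):
        """Idealized online filtering: predict with the label hidden, then update."""
        mistakes = 0
        for x_i, y_i in message_stream:             # y_i in {-1, +1}: ham or spam
            prediction = filter_model.predict(x_i)  # label is hidden at prediction time
            if prediction != y_i:
                mistakes += 1
            filter_model.update(x_i, y_i)           # label feedback revealed; adjust weights w
        return mistakes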

2.2 Feature Mappings

As noted in the previous section, each message is represented by a feature vector x ∈ R^n. That is, a set of numerical features is extracted from the message, and these numerical scores are stored in a vector. This process is called feature mapping, and allows generic machine learning methods to operate on semi-structured email data. The choice of feature mapping is an important one. Experience in machine learning has shown that choosing an appropriate feature mapping can have more impact on the quality of results than the choice of learning method. In this section, we review several possible feature mappings and discuss the strengths of each.

(Those familiar with kernel methods will note that string kernels exist that can allow kernel-based learning methods to operate directly on string-based data, without an intermediate feature mapping step [78]. However, the increased classification cost of such methods makes them impractical for this large-scale filtering setting. Indeed, recent work in computational biology has shown that explicit feature mapping for string-based genomic data reduces computational cost in comparison to the use of string kernels in practical settings [89]. In this dissertation, we consider only explicit feature mappings in order to maintain scalability.)

2.2.1 Hand-crafted features

One possibility is to hand-craft specific features that are uniquely suited to the spam filtering task. Several systems employ this approach, including one major industrial spam filtering system as well as the open-source filter SpamAssassin [2].

This approach requires experts to identify features within messages that may help to distinguish spam from ham. For example, the hand-crafted features used by SpamAssassin version 3.2 include the following [2]:

• Subject contains ``As Seen''
• Subject starts with dollar amount
• Offers an alert about a stock
• Money back guarantee
• Message talks about a replica watch
• Uses a numeric IP address in URL
• Phrase: extra inches
• Phrase: L0an

In SpamAssassin version 3.2, there are 748 of these hand-crafted features, which is a relatively low number of features. The human effort in hand-crafting these domain-specific features results in a focused feature space in which there are few irrelevant features, ensuring that computation and storage costs are kept to a minimum.

However, this approach has significant drawbacks. First, the human effort required in crafting appropriate features is substantial, and it is often difficult for humans to guess which characteristics of a message are most informative [39]. Second, such features are not robust to changing circumstances, and may be easily attacked by intelligent spammers [93]. Third, such features are language-specific, requiring the entire feature space to be reformulated for each new language.
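To make the hand-crafted approach concrete, the sketch below evaluates a handful of such rules as binary features. The regular expressions are illustrative stand-ins for rules like those listed above, not the actual SpamAssassin rule set.

    import re

    # Illustrative hand-crafted rules (hypothetical regexes, not SpamAssassin's own).
    HAND_CRAFTED_RULES = {
        "subject_contains_as_seen": re.compile(r"^subject:.*as seen", re.I | re.M),
        "subject_starts_with_dollar": re.compile(r"^subject:\s*\$", re.I | re.M),
        "numeric_ip_in_url": re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}"),
        "phrase_extra_inches": re.compile(r"extra\s+inches", re.I),
    }

    def hand_crafted_features(raw_message):
        """Return a binary (0/1) score for each hand-crafted rule that fires."""
        return {name: int(bool(rule.search(raw_message)))
                for name, rule in HAND_CRAFTED_RULES.items()}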

2.2.2 Word Based Features

An alternative to hand-crafting a small set of focused features is to employ a wider set of generic features, with the hope that this will capture more patterns in the data with less human effort.

One simple feature space is the word-based feature space. This is constructed as follows. Define a “word” as a contiguous substring of non-whitespace characters [38]. If we consider only words of a maximum finite length, then there are n possible words, and we can construct a feature space R^n with n dimensions, each indexed by a unique word. To map a message M to a feature vector x ∈ R^n, assign a score for each dimension in the vector based on the number of times that the indexing word appears in M.

There are several possible scoring schemes. The count-based method assigns the raw number of occurrences of a given word in the message as its score. The TF-IDF scoring method [75] has been used in information retrieval to weight rare (and presumably more informative) words more heavily. However, several tests (including our own) have found that the simple binary scoring method is most effective for spam filtering tasks [34, 63]. In this system, a score of 1 indicates that a word occurs in the message, and a score of 0 indicates that it does not.

Note that although the feature space may be as large as the total number of possible words, typical feature vectors will be sparse, containing a relatively small number of non-zero values. Thus, sparse vector data structures allow for efficient storage of these vectors. These may be implemented as linked lists or hash tables containing index-value pairs. In the binary case, it is particularly efficient to store sparse vectors as arrays containing non-zero index values, which may be sorted for efficient computation of inner products.
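The sketch below illustrates the binary word-based mapping and the sorted-index-array representation just described. The whitespace tokenizer and the growing vocabulary dictionary are simplifying assumptions for illustration; a deployed filter might instead hash tokens directly to dimensions.

    def binary_word_features(message, vocab):
        """Map a message to a sorted list of dimension indices (binary word scores)."""
        indices = set()
        for word in message.split():                # a "word" is a whitespace-delimited token
            j = vocab.setdefault(word, len(vocab))  # assign a new dimension to unseen words
            indices.add(j)
        return sorted(indices)                      # sorted, for fast inner products

    def sparse_binary_dot(a, b):
        """Inner product of two binary vectors stored as sorted index arrays."""
        i = j = total = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                total += 1
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return total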

To our knowledge, the word-based feature space was first employed by Graham [39], using a Naive Bayes variant. This work highlighted the benefit of a wide-reaching feature space: his filter found the “word” ff0000, the HTML code for the color red, to be one of the most informative indicators of spaminess [39]. Such features are quickly identified by machine learning methods, but are far from obvious to humans attempting to hand-craft features. As an implicit admission of the limited utility of hand-crafted features, the SpamAssassin team has recognized these problems, and includes binary features that encode the output of a Naive Bayes classifier using a word-based feature space [2].

Figure 2.1: Obscured Text. These are the 25 most common variants of the word ‘Viagra’ found in the 2005 TREC spam data set, illustrating the problem of word obfuscation: Viagra VIAGRA Viiagrra viagra visagra Vi@gra Viaagrra Viaggra Viagraa Viiaagra Via-ggra Viia-gra V1AAGRRA Viiagra Via-gra Vigraa Viagra viagra Viagrra V&Igra VIAgra V|agra Viaaggra vaigra V’iagra

Although word-based features are an improvement over hand-crafted features, they are still subject to attack. Spammers routinely attempt to defeat word-based filters using techniques such as intentional misspellings, character substitutions, and insertions of white-space, all of which can pose problems for word-based filters [93].

2.2.3 k-mer Features

As noted above, word-based features are subject to attack by spammers using word-obfuscation methods [93], which include intentional misspellings, character substitutions, and insertions of white-space. For example, Figure 2.1 shows the 25 most common obfuscations of the word viagra in the TREC 2005 public corpus of email spam, trec05p-1 [86]. There were hundreds of other obfuscations for this word alone. While a case could be made that a word-based filter would eventually encounter all possible obfuscations of a given word, the combinatorics of obfuscation render this possibility impractical. Instead, we suggest that feature mappings that allow for inexact string matching provide a practical alternative [86].

One such feature mapping employs a feature space using overlapping k-mers, which are contiguous substrings of k symbols [60, 53]. For example, the 4-mers of the string ababbacb are:

abab babb abba bbac bacb

This feature space was originally designed for use on genomic data, in which genomic sequences are encoded as strings [60, 53]. As with spam data, character-level substitutions, insertions, and deletions are common in this setting, and the use of overlapping k-mers provides a measure of robustness to these variations. As with the word-based features, several scoring methods are possible, including count-based scores and TF-IDF scoring. However, our tests have found that binary scoring is most effective [84].

This feature space requires a unique dimension for each possible unique k-mer. Thus, the dimensionality of this space is |Σ|^k, where |Σ| is the size of the alphabet of available symbols, and the value of each dimension in the space corresponds to the score associated with a particular k-mer. In email and spam classification tasks, which may include attachments, the available alphabet of symbols is quite large, consisting of all 256 possible single-byte characters. However, sparse data structures may also be employed here in similar fashion to those suggested for the word-based feature space discussed above.

The first use of k-mers in spam detection was by Hershkop and Stolfo, who tested a spam filter using the cosine similarity measure between k-mer vectors in conjunction with a centroid-based variant of the Nearest Neighbor classifier [44]. Also, note that k-mers are sometimes referred to as character-level n-grams. We choose to use the term k-mers to avoid confusion with word-level n-grams, which are commonly employed and discussed in the information retrieval literature.

For those familiar with kernel methods, note that although k-mers may be employed in conjunction with string kernels [53, 54], we follow the recommendation of Sonnenburg [89] and represent k-mer features in explicit sparse feature vectors.

Our tests have shown that a setting of k = 4 is often optimal for spam classification; thus the resulting feature space can still be stored explicitly, and doing so increases computational efficiency for classification.
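A minimal sketch of the binary k-mer mapping follows; the function name is ours, and the mapping simply collects the set of overlapping length-k substrings, so the 4-mers of ababbacb come out exactly as listed above.

    def binary_kmer_features(text, k=4):
        """Return the set of overlapping k-mers of a string (binary scoring)."""
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    # binary_kmer_features("ababbacb", 4) == {"abab", "babb", "abba", "bbac", "bacb"}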

2.2.4 Wildcard and Gappy Features

In computational biology, it has been found that k-mers alone are not expressive enough to give optimal classification performance on genomic data. In general, a k-mer feature space may have reduced efficacy on substrings in which at least two character substitutions, insertions, or deletions occur no more than k positions apart. Computational biologists have developed extended forms of inexact string matching features to address this issue. These include wildcard features and gappy features [55], which allow k-mers to match even if there have been a small (specified) number of insertions, deletions, or character substitutions.

We applied variants of these modifications in our TREC 2006 Spam Filtering competition entries [86], in order to test the effectiveness of added flexibility in inexact string matching. Our subsequent tests (a subset of which are included in the results section of this chapter) showed that the simple binary k-mer feature space gave optimal results – the added flexibility of wildcards and gaps did not give added benefit in the spam filtering domain. For completeness, we describe these feature mappings in this section. As with the word-based and k-mer features above, these feature mappings may be performed explicitly using sparse vector data structures for computational efficiency [86].

Wildcards. The (k, w) wildcard mapping maps each k-mer to a set of k-mers in which up to w “wildcard” characters replace the characters in the original k-mer [55]. For example, the (3, 1) wildcard mapping of the k-mer abc is the set {abc, *bc, a*c, ab*}. The wildcard character is a special symbol that is allowed to match with any other character. Naturally, allowing wildcards increases computational cost. However, in our testing with spam data, we have found that a fixed wildcard variant gives results equivalent in strength to those of the standard wildcard kernel [86]. A fixed (k, p) wildcard mapping maps a given k-mer to a set of two k-mers: the original, and a k-mer with a wildcard character at position p. Note that the first position in a string is position 0. Thus, the fixed (3, 1) wildcard mapping of abc is {abc, a*c}. This fixed mapping gives more flexibility to the k-mer feature space, but only increases computational cost by a constant factor of two.

Gaps. The (g, k) gappy mapping (where g ≤ k) allows g-mers to match with k-mers by inserting k − g gaps into the g-mer [55]. Note that this is equivalent to allowing k-mers to match with g-mers by deleting up to k − g characters from the k-mer. Thus, the (2, 3) gappy mapping of the string acbd includes positive values for features indexed by {acb, cbd, ab, cb, cd, bd}. As with the wildcard mappings, we reduce computational cost with a fixed (k, p) gappy variant, in which a k-mer is mapped to a set of k-mers: the original k-mer, and a k-mer in which the character at position p has been removed [86].
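The fixed variants are simple enough to sketch directly. The helper names below are ours, and positions are 0-indexed as in the text; the wildcard example reproduces the fixed (3, 1) mapping of abc given above.

    def fixed_wildcard(kmer, p):
        """Fixed (k, p) wildcard mapping: the original k-mer plus a copy with '*' at position p."""
        return {kmer, kmer[:p] + "*" + kmer[p + 1:]}

    def fixed_gappy(kmer, p):
        """Fixed (k, p) gappy mapping: the original k-mer plus a copy with position p deleted."""
        return {kmer, kmer[:p] + kmer[p + 1:]}

    # fixed_wildcard("abc", 1) == {"abc", "a*c"}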

22 2.2.5 Normalization

Email messages are of varying lengths, causing feature vectors mapped from these messages to be of varying magnitudes. This can cause problems for some machine learning methods, especially those that compute an inner product <w, x_i>. In such cases, the learner may make predictions with exceptionally low confidence for short messages, even when those messages are clearly spam or ham.

One standard approach from information retrieval and machine learning that reduces the impact of message length is to normalize the feature vectors representing the messages. The method employed in this dissertation is to normalize using the Euclidean norm (also known as the L2 norm) of the feature vectors. That is:

x_i-normalized = x_i / √(<x_i, x_i>)
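As a brief illustration of this formula: with binary scoring, a feature vector with s non-zero entries satisfies <x_i, x_i> = s, so normalization replaces each non-zero value with 1/√s. A message with 100 distinct binary features is mapped to a vector whose non-zero entries are each 0.1, while a message with 25 features receives entries of 0.2.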

2.2.6 Message Truncation

In 2006, Cormack first noted that the use of message truncation [17] led to improved filtering results as well as increased computational efficiency. In message truncation, the email message is truncated after n characters, including any header information and attachments. Typical values of n in message truncation have been 2500 [19] or 3000 [84]. This tends to emphasize information contained in the headers, including routing information, sender and recipient information, and the subject line. Computational cost is reduced in cases where the message is otherwise very long: some emails contain several megabytes of data. Furthermore, truncation provides a measure of implicit normalization, as truncated messages do not vary as greatly in length as un-truncated messages. Finally, truncation provides a measure of resistance against the good word attack, in which objectively “good” words are stuffed into the end of a spam email in an attempt to defeat learning-based spam filters [61].
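As a minimal sketch, truncation is a single slice over the raw message text (headers first), with n as described above.

    def truncate_message(raw_message, n=3000):
        """Keep only the first n characters of the raw message, headers included."""
        return raw_message[:n]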

2.2.7 Semi-structured Data

As noted in Chapter 1, email data is best described as semi-structured data. That is, email data contains information in different forms such as header information, message text, and attachments that may include images or other files. Furthermore, the message text may include obfuscations such as intentional misspellings, character substitutions, and the like.

Some attempts have been made to exploit the structure in emails for more effective feature mappings. These have included hand-crafted rules discussed above, and also more general strategies such as treating words or features drawn from the message header differently than words or features drawn from the message body [38].

Interestingly, the 4-mer feature space with message truncation has out-performed all such attempts. One could argue that message truncation implicitly takes some structure into account: it clearly places emphasis on the message header and the early parts of the message body. Yet it is striking that apparently the most effective method for dealing with this semi-structured information is primarily to ignore the structure.

2.3 Online Machine Learning Algorithms for Online Spam Filtering

The previous section reviewed several feature mapping methods that transform labeled email data into feature vectors for training machine learning methods. In theory, any standard supervised machine learning method could be applied to such data. Yet the practical application of online spam filtering methods has strict requirements that are influenced by the scale of contemporary email usage. This scale renders many supervised machine learning methods impractical.

We identify five requirements that a given machine learning method must satisfy to be appropriate for the spam filtering domain. These are:

• Classification Performance. The first goal of any filter is to filter effectively. We assess classification performance with the (1-ROCA)% measure, which we use as our primary evaluation measure in this dissertation. The (1-ROCA)% measure is reviewed in Section 2.4.4, along with a discussion of alternate possible metrics.

• Fast Prediction. The prediction for a given example xi should be computable with O(s) computation cost, where s is the sparsity of xi.

• Scalable Online Updates. The cost of updating the model should not depend on the amount of data in the training data set. In the online case the size of the data set increases over time, and may be effectively unbounded.

• Fast Adaptation. In the real world, spammers adapt to filters by changing the patterns of their spam attacks [93]. An effective filtering method adapts to new attacks quickly; ideally after only a single example of a new attack.

• Robustness to High Dimensionality. As described in the previous section, several natural feature representations for spam filtering are of high dimensionality. Not all learning methods perform well with high dimensional data [45].

In the remainder of this section, we review the machine learning methods that satisfy these requirements and are thus suitable for the task of online spam filtering. In doing so, we will note when a given method fails to meet any of the above requirements. Assessments of classification performance are given in the experimental section at the end of this chapter.

2.3.1 Naive Bayes Variants

Variants of the Naive Bayes classifier are among the first machine learning methods to be applied to the spam filtering problem [39, 40, 63]. We first review the general principles of the Naive Bayes classifier, and then describe several variants that have been proposed for spam filtering.

Overview of Naive Bayes. In the supervised learning methodology, we assume that there is an unknown target function f : X → Y, mapping example vectors to class labels [66]. This target may be expressed as a probability distribution P(Y | X), in which the probability of a class label depends on a distribution over the space of possible messages [66]. We seek to model this target function using labeled training data.

The Bayes rule is given by:

P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}

Estimating P(Y) from observation is straightforward and requires only a relatively small sample of training data, but estimating P(X | Y) may require considerably more data. When X ⊆ R^n, there are 2(2^n − 1) parameters to estimate [66]. This is impractical in the general case with high dimensional data.

For the case in which the features in X are conditionally independent given Y, then:

P(X \mid Y) = \prod_{j=1}^{n} P(X_j \mid Y)

When both the elements of X and the class values of Y are binary attributes, then there are now only 2n parameters to estimate, dramatically reducing the amount of training data needed from the general case, above [66].

Note that for numerical reasons, it is infeasible to compute products of many small fractions, as is required here, because these are quickly driven to zero due to round-off errors in finite precision computing. Thus, it is preferred to work with log probabilities [65] for all Naive Bayes variants.

This observation forms the basis of the Naive Bayes classifier: we (naively) assume that the elements of X are conditionally independent, and then use training data to estimate values for each P(X_j | Y) [66]. Although the assumption of conditional independence is often objectively untrue, in practice effective classifiers are still often obtained.

The training of a Naive Bayes classifier produces a model of the distribution that generates the data as an intermediate step. For this reason, these methods are said to belong to the generative family of classifiers [66].

There are several variants of Naive Bayes, which differ primarily in their assumptions about the probability distributions P(X_j | Y). Of these, the Multi-Variate Bernoulli Naive Bayes does not meet the criterion of predictions in O(s) time. Furthermore, all of the Naive Bayes variants deal poorly with high dimensional data, as shown in the experimental section.

Multi-Variate Bernoulli Naive Bayes. Perhaps the most commonly applied variant [63] is the Multi-Variate Bernoulli Naive Bayes. This variant assumes that a message is generated in the following way. There are two boxes, one marked spam and one marked ham, with each box containing a distinct biased coin for each unique feature in the feature space X. A message is generated by first flipping a biased coin to determine the class label yi of the message. Depending on the result of the class label, the appropriate box of biased coins is selected. Each coin in the box is then flipped, one per feature in X. For each coin that comes up heads, the associated feature in the feature vector xi is scored 1, and is scored 0 otherwise. This reflects the assumption that the features in messages are conditionally independent [63].

The probability of a given coin j coming up heads is P(X_j | y_i). We can estimate this probability from data using a Laplacian prior that we refer to as the Document Prior:

P(X_j \mid y_i) = \frac{1 + M_{X_j, y_i}}{2 + M_{X_j}}

Here, M_{X_j, y_i} is the number of seen messages with class label y_i for which the feature X_j = 1, and M_{X_j} is the total number of messages seen for which the feature X_j = 1.

To classify a new message x_i, we can compute the probability that either box of coins would generate the message. For a given class label y_i, this probability is:

P(x_i \mid y_i) = \prod_{j=1}^{n} P(X_j \mid y_i)^{x_{ij}} \cdot (1 - P(X_j \mid y_i))^{(1 - x_{ij})}

To classify a message as spam, we first set a threshold τ. We return a classification of spam only when:

\frac{P(y_i = +1)\,P(x_i \mid y_i = +1)}{P(y_i = +1)\,P(x_i \mid y_i = +1) + P(y_i = -1)\,P(x_i \mid y_i = -1)} > \tau

In the online scenario, updating a Naive Bayes model simply requires updating the probability estimate P(X_j | y_i) for each feature.

Although Metsis et al. found that this variant was the most widely applied Naive Bayes variant in deployed systems [63], it is undesirable for two reasons. First, classification using this method requires a computation for each feature in the feature space, and is therefore O(n) rather than O(s). In general, the dimensionality n of the feature space may be many orders of magnitude larger than the sparsity s of a typical example vector, especially when message truncation is used. Second, this variant gave the worst classification performance of all variants tested by Metsis et al. in a comparison of Naive Bayes variants for spam filtering [63].

Multinomial Naive Bayes with Boolean Attributes (Token Prior). In the multinomial model, we assume that a message is generated in a different way. We still have two boxes, one for spam and one for ham as before. However, now the boxes are filled with tokens of varying sizes, and each token has a feature from X written on it. To generate a message, we first determine the number of tokens to pull by randomly selecting d from a distribution of possible message lengths.3 We then choose a box by flipping a biased coin to determine the class label y_i of our new message. Finally, we blindly pull a total of d tokens from the box, with replacement. For each token pulled, we add the feature listed on the token to the generated message [63].

3 Note that this distribution of message lengths does not depend on the class label of the message [63]. Although this may not be a good assumption in the general case for spam filtering, it is not unreasonable when message truncation is performed.

Each feature is randomly selected from the box with probability P(X_j | y_i), which we can estimate from training data using a Laplacian prior based on counts of tokens, referred to here as the Token Prior:

P(X_j \mid y_i) = \frac{1 + T_{X_j, y_i}}{n + T_{X_j}}

where n is the total number of possible tokens, T_{X_j, y_i} is the number of times that token X_j occurs in messages of class label y_i, and T_{X_j} is the total number of times token X_j occurs in all messages. Note that in the general case, a token may occur in a message more than once. Yet Metsis et al. found that restricting these counts to binary values in the resulting feature vector x_i improved classification performance. This agrees with work using other methods that has also found binary feature values to be most effective [34, 84]. When the token counts are restricted to binary values, this variant is referred to as Multinomial Naive Bayes with Boolean Attributes [63], and it was one of the best performing Naive Bayes variants tested by Metsis et al.

The probability that a message x_i is generated by the box with class label y_i is given by [63]:

P(x_i \mid y_i) = P(y_i)\,P(d)\,d!\,\prod_{j=1}^{n} \frac{P(X_j \mid y_i)^{x_{ij}}}{x_{ij}!}

Note that when x_{ij} = 0, then P(X_j \mid y_i)^{x_{ij}} = 1, and these terms may be excluded from the product. Thus, it is only necessary to compute probabilities for those features that actually occur in the message, reducing classification cost from O(n) for the Multivariate Naive Bayes to O(s) with this variant.

Furthermore, the terms P(d), d, and x_{ij}! are constant across all possible class labels, and cancel out when computing relative likelihood. Thus, to classify a new message, select a threshold τ as before, and return spam only when [63]:

\frac{P(y = +1)\,\prod_{j=1}^{n} P(X_j \mid y = +1)^{x_{ij}}}{\sum_{y \in \{-1,+1\}} P(y)\,\prod_{j=1}^{n} P(X_j \mid y)^{x_{ij}}} > \tau
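The following sketch shows one way the Multinomial Naive Bayes with Boolean Attributes (Token Prior) could be implemented for online filtering, working with log probabilities as recommended above. The class name and interface are our own, and the smoothing of the class prior P(y) is a simplification added for illustration; it is not specified in the text above.

import math
from collections import defaultdict

class MultinomialNBBoolean:
    # Sketch of Multinomial Naive Bayes with Boolean attributes (Token Prior),
    # scored in log space. The class-prior smoothing is our own simplification.
    def __init__(self, n_features):
        self.n = n_features                      # size of the token vocabulary
        self.msg_counts = {+1: 0, -1: 0}         # messages seen per class
        self.token_counts = {+1: defaultdict(int), -1: defaultdict(int)}
        self.token_totals = defaultdict(int)     # occurrences of each token over all classes

    def update(self, features, label):
        # Online update: features is the set of tokens present in the message.
        self.msg_counts[label] += 1
        for f in features:
            self.token_counts[label][f] += 1
            self.token_totals[f] += 1

    def log_joint(self, features, label):
        total = sum(self.msg_counts.values())
        log_p = math.log((self.msg_counts[label] + 1.0) / (total + 2.0))  # smoothed P(y)
        for f in features:
            # Token Prior estimate of P(X_j | y), as given above.
            p = (1.0 + self.token_counts[label][f]) / (self.n + self.token_totals[f])
            log_p += math.log(p)
        return log_p

    def spam_score(self, features):
        # Return P(spam | x), computed from the two log-joint scores.
        ls, lh = self.log_joint(features, +1), self.log_joint(features, -1)
        m = max(ls, lh)
        return math.exp(ls - m) / (math.exp(ls - m) + math.exp(lh - m))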

Multinomial Naive Bayes with Boolean Attributes (Document Prior). Observe that, in practical terms, there are two differences between the Multivariate Bernoulli Naive Bayes and the Multinomial Naive Bayes with Boolean Attributes (Token Prior) discussed above.

The first difference is in the manner in which the token probabilities are estimated. In the Multivariate case, the default assumption is that a given token has a P = 1/2 probability of appearing in a message, while in the Multinomial case the default assumption is that a given token has a P ≈ d/n probability of appearing in a given message.

The second difference is that the Multivariate method explicitly includes the probabilities of absent tokens (those not occurring in the message) when computing the likelihood that a message was generated by a given box of coins, while the Multinomial method does not include this information. This makes the Multinomial method more computationally efficient. The Multinomial method is also more effective in terms of classification performance. Is this difference in performance due to the different estimate of probability, or to the exclusion of token absence information?

We investigated this question by proposing a hybrid variant, which we refer to as Multinomial Naive Bayes with Boolean Attributes (Document Prior). In this case, the per-token conditional probabilities are estimated with the document prior used by the Multivariate method:

P(X_j \mid y_i) = \frac{1 + M_{X_j, y_i}}{2 + M_{y_i}}

Classification is performed using the Multinomial method:

\frac{P(y = +1)\,\prod_{j=1}^{n} P(X_j \mid y = +1)^{x_{ij}}}{\sum_{y \in \{-1,+1\}} P(y)\,\prod_{j=1}^{n} P(X_j \mid y)^{x_{ij}}} > \tau

Surprisingly, this variant gave the best results of all Naive Bayes methods tested on noisy blog comment data (see Chapter 7). That is, the traditional Multinomial (Token Prior) method out-performs the Multivariate method (Document Prior), and the Multinomial (Document Prior) method out-performs the Multinomial (Token Prior) method. Thus, we conclude that both the Document Prior and the exclusion of absence information are helpful in spam filtering.

2.3.2 Compression-Based Methods

The Naive Bayes variants operate best under the condition that features are conditionally independent. When this condition breaks down, classification performance may suffer. For example, the Naive Bayes variants perform poorly with the feature space of overlapping k-mers. In this case, the features are inter-dependent, and treating them as independent tokens violates the assumption of conditional independence.

A second set of methods in the generative family of supervised machine learning methods employs techniques from data compression [6] as a means to deal with this problem.

In this methodology, a message is assumed to be generated roughly as follows. Assume that we are dealing with a Markov process of order k that generates messages. There are two sets of (many) boxes, one set for each class label. Each box is filled with a set of tokens of differing sizes, with a unique token for each possible character in the alphabet Σ of all possible characters, including an "end of message" character. Each box is marked with a unique substring of up to k − 1 characters (there is one box for each possible substring of length 0 to k − 1), and is also marked with a class label.

To generate a message, we follow these instructions.

1. Begin with a blank piece of paper.

2. Flip a biased coin to determine the class label of the message.

3. Look at the k − 1 most recent characters written down in the message, and select a token from the box with the selected class label that matches this substring of the k − 1 most recent characters.

4. Write down the character on the token, replace the token in the box, and repeat from step 3 until an "end of message" token is pulled.

With this generative framework, it is possible to estimate the probabilities of pulling tokens from the various boxes from data. Conveniently, it turns out that standard adaptive data-compression algorithms do exactly this, and may be employed as supervised machine learning methods in their own right. It has been shown that such a classifier implicitly employs a feature space of k-mers [82].

Compression and Classification. Basic information theory shows that the number of bits needed to encode a message depends on the entropy of that message, which is defined in terms of the probability that the sender will choose to send exactly that message [88]. Formally, the minimum number of bits L(m) required to encode a message m is:

L(m) = -\log P(m)

where P(m) is the probability that the generating process (i.e., the sender) will generate message m [88]. When the probability distribution over messages is known exactly, methods such as Arithmetic Encoding can be used to encode messages efficiently, approaching this theoretical lower bound for message encoding [95].

In practice, the true probability distribution is not known, and must be estimated from data. The goal of adaptive data compression algorithms such as Prediction by Partial Matching (PPM) [13] and Dynamic Markov-chain Compression (DMC) [22] is to estimate the probabilities of message generation as closely as possible, in order to enable efficient compression. However, learning a distribution over all possible messages is intractable in the general case [6]. To make the problem tractable, these methods assume that messages are generated in the order-k Markovian manner described above.

Essentially, PPM models these probabilities by keeping frequency counts of all substrings of length 1 to k in a table [13], while DMC models probabilities by constructing a finite state machine with transition probabilities learned from the message data [22]. (Full details of the PPM algorithm [13] and the DMC algorithm

[22] are available in the original papers describing these algorithms; for our purposes they may be treated as “black box” algorithms for probability estimation.)

Let the i-th character in a message m be denoted m_i, and the string of characters from the i-th character to the j-th character (inclusive) be denoted m_{i,j}. Then, assuming adequate training of the order-k models, either DMC or PPM will estimate the probability that the sender generates message m as:

P(m) = \prod_{i=1}^{|m|} P(m_i \mid m_{(i-1)-k,\,(i-1)})

Thus, either method will encode the given message with L(m) bits:

L(m) \approx -\log \prod_{i=1}^{|m|} P(m_i \mid m_{(i-1)-k,\,i-1}) = -1 \cdot \sum_{i=1}^{|m|} \log P(m_i \mid m_{(i-1)-k,\,i-1})

For classification in the spam filtering domain, we can train two models, one on spam messages and one on ham messages, by feeding each set of messages to a given compression method. Once the models have been trained, we can classify a new message m by finding argmin(L(m)_spam, L(m)_ham), where L(m)_spam is the length of compressing message m with the spam model, and L(m)_ham is the length of compressing message m with the ham model [6].

In practice, it is possible to use compression algorithms “off the shelf” for such classification. However, it is not necessary to actually compress the messages in order to classify them; learning and applying the probability estimates is sufficient.

Thus, practical uses of data-compression techniques for spam filtering do not include the Arithmetic Encoding step found in “off the shelf” compression algorithms, and allow for efficient online updates [6].

A PPM compression-based method was the best performing spam filter at

TREC 2005 [24], and compression-based methods remained competitive with the best methods at TREC 2006 and 2007 [16, 18]. These methods satisfy all of the criteria needed to be considered for effective spam filtering.
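As a rough illustration of classification by compression, the sketch below uses a simple order-k character model with Laplace smoothing as a stand-in for PPM or DMC, which use more sophisticated probability estimators. The class and function names are our own; this is not an implementation of either published algorithm.

import math
from collections import defaultdict

class OrderKCharModel:
    # A simplified order-k character model with Laplace smoothing, used as a
    # stand-in for the probability estimators inside PPM or DMC.
    def __init__(self, k=4, alphabet_size=256):
        self.k, self.a = k, alphabet_size
        self.counts = defaultdict(int)     # counts of (context, next character)
        self.totals = defaultdict(int)     # counts of contexts

    def train(self, message):
        for i in range(len(message)):
            ctx = message[max(0, i - self.k):i]
            self.counts[(ctx, message[i])] += 1
            self.totals[ctx] += 1

    def code_length(self, message):
        # Approximate L(m) = -log2 P(m) under the trained model, in bits.
        bits = 0.0
        for i in range(len(message)):
            ctx = message[max(0, i - self.k):i]
            p = (self.counts[(ctx, message[i])] + 1.0) / (self.totals[ctx] + self.a)
            bits -= math.log2(p)
        return bits

def classify(message, spam_model, ham_model):
    # Predict the class whose model compresses the message into fewer bits.
    if spam_model.code_length(message) < ham_model.code_length(message):
        return 'spam'
    return 'ham'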

PerceptronClassify(x, w, τ)
Given: example x ∈ X,
       weight vector w ∈ R^n,
       threshold τ
Begin
    return <w, x> − τ
End

PerceptronUpdate(x, w, y, η, τ)
Given: example x ∈ X,
       weight vector w ∈ R^n,
       true label y,
       learning rate η,
       threshold τ
Begin
    If y · PerceptronClassify(x, w, τ) ≤ 0
        w ← w + yηx
End

Figure 2.2: Pseudo-code for Classical Perceptron update rule and classification function.

2.3.3 Perceptron Variants

We now leave aside the generative family of supervised machine learning methods, and focus on the discriminative family of methods. In this set of methods, we no longer attempt to model the underlying process that generates ham and spam messages. Rather, we consider the feature space X, and attempt to find boundaries in this space separating regions of ham and spam.

Classical Perceptron. Perhaps the oldest of the discriminative machine learning algorithms is Rosenblatt's Perceptron algorithm from the late 1950's [73]. In its classical form, Perceptron learns a set of weights w ∈ R^n that defines a hyperplane separating two classes: in our case, ham and spam. Classical Perceptron is an online learner using a 0-1 loss function, updating its hypothesis on each new mistake.

Pseudo-code for classical Perceptron is given in Figure 2.2.

In online tests, Perceptron and the other discriminative methods described in this thesis begin with a null hypothesis, where w = 0. The classification function is then used to predict the class of each new example as it arrives.

The hyperplane found by Perceptron intersects the origin of the coordinate space. If we wish to bias this hyperplane, we may do so by including a special feature X_0 that is always set to 1 for every feature vector in our data set [35]. That is, an offset value is implicitly included in the weight vector as w_0, and each example x_i has x_{i0} = 1. Classification with Perceptron is performed by fixing a classification threshold τ, which may be used to account for asymmetric misclassification costs between classes. The function PerceptronClassify is used to make predictions; when the value returned by this function is positive, spam is predicted, and ham is predicted otherwise.

Perceptron has the benefits of being fast and straightforward, satisfying all of the computational requirements of our criteria for practical spam filtering. It is employed by SpamAssassin to train weights for the hand-crafted features used by this filter [2]. Yet in practice, it has been found to give poor classification performance in the presence of noise [48]. Perceptron's poor classification performance on spam filtering is demonstrated in our experimental section.

Perceptron with Margins. A Perceptron variant with better classification performance is Perceptron with Margins [52], a noise-tolerant [48] variant of the classical Perceptron algorithm.

PerceptronWithMarginsUpdate(x, w, y, η, m, τ)
Given: example x ∈ X,
       weight vector w ∈ R^n,
       true label y,
       learning rate η,
       margin parameter m,
       threshold τ
Begin
    If y · PerceptronClassify(x, w, τ) ≤ m
        w ← w + yηx
End

Figure 2.3: Pseudo-code for Perceptron with Margins update rule; note that the classification function is the same as for classical Perceptron.

Perceptron with Margins uses the same classification function as Classical Perceptron, but has a different update rule. It accepts a margin parameter m, and performs an update whenever a mistaken prediction is made, or whenever an example x is found to lie within distance m of the separating hyperplane. This fixed margin has been likened to a cheap approximation of large-margin classifiers such as Support Vector Machines, and provides additional tolerance to noise [48].

We used Perceptron with Margins at TREC 2006, achieving second place results [86]. To our knowledge, this was the first use of this algorithm in the spam

filtering domain. Both variants are able to perform both classification and training updates in O(s) time.
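A minimal sketch of the two Perceptron variants over sparse feature vectors follows. The representation of examples and weights as Python dictionaries is our own choice for illustration; setting m = 0 in the update recovers the classical Perceptron.

def dot(w, x):
    # Sparse inner product: w and x are dicts mapping feature -> value.
    return sum(w.get(f, 0.0) * v for f, v in x.items())

def perceptron_classify(x, w, tau=0.0):
    # A positive return value predicts spam, a negative value predicts ham.
    return dot(w, x) - tau

def perceptron_with_margins_update(x, y, w, eta, m, tau=0.0):
    # y is the true label in {-1, +1}. Update when the example is misclassified
    # or lies within the fixed margin m; m = 0 gives the classical Perceptron.
    if y * perceptron_classify(x, w, tau) <= m:
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + y * eta * v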


Figure 2.4: Logistic Regression employs a logistic loss function for positive and negative examples, which punishes mistakes made with high confidence more heavily than mistakes made with low confidence.

2.3.4 Logistic Regression

The Perceptron variants are structured on a 0-1 loss function. However, in the noiseless case it is better to consider mistakes made with low confidence (near the hyperplane) as less costly than mistakes made with high confidence (far from the hyperplane). One way to do this is to use a logistic loss function in place of the 0-1 loss, as is done by Logistic Regression. Logistic Regression was recently proposed for email spam filtering [38], and has given state of the art performance at TREC 2007 [19].

Like the Perceptron variants, Logistic Regression stores a linear hypothesis with a weight vector w. Here, the prediction function maps an input example to the probability that the example has a positive label, based on that example's distance from the separating hyperplane [66]. That is:

f(x_i) = p(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-\langle w, x_i \rangle}}

We predict spam whenever f(x_i) > τ, and ham otherwise.

A fast update procedure for Logistic Regression uses an online gradient descent method. We start with w ← 0 and update for each new example. Assuming that y ∈ {0, 1} rather than y ∈ {−1, +1}, so that y may be treated as the true probability of class membership, the update for w is:

w \leftarrow w + \eta\,x_i\,(y_i - f(x_i))

Like the discriminative methods previously described, both classification and updates are performed in O(s) time. Logistic Regression satisfies all of the criteria for practical spam filtering.
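The sketch below illustrates the Logistic Regression prediction and online gradient descent update given above, again over sparse dictionaries. The clamping of the inner product is our own addition for numerical safety and is not part of the method as described.

import math

def lr_predict(w, x):
    # f(x) = p(y = 1 | x) for a sparse example x and weight dict w.
    z = sum(w.get(f, 0.0) * v for f, v in x.items())
    z = max(min(z, 35.0), -35.0)   # clamp for numerical safety (our own addition)
    return 1.0 / (1.0 + math.exp(-z))

def lr_update(w, x, y, eta=0.1):
    # Online gradient step; y must be 0 or 1, so (y - f(x)) is the prediction error.
    err = y - lr_predict(w, x)
    for f, v in x.items():
        w[f] = w.get(f, 0.0) + eta * v * err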

2.3.5 Ensemble Methods

In machine learning, it has often been found that different classifiers make mistakes on different kinds of examples. When varied methods make uncorrelated errors, they may be combined in ensembles that often result in stronger classification performance than is given by any single classifier [33].

Lynam and Cormack found that combining the output of the 53 filters from

TREC 2005 into an ensemble filter produced filtering performance far exceeding that of any single filter [62].

They experimented with several methods for combining the raw output of different filters into a single classifier. These methods included simple voting across classifications and computing a composite score based on the average log-odds of per-filter classifications (see Lynam and Cormack [62] for details). Furthermore, they also tested methods where the outputs of the filters were treated as the input features for a higher-level learner, such as a logistic regression learner or a Support Vector Machine (SVM) [62]. The logistic regression-based combination gave the best performance, followed closely by the SVM-based combination, but even the simpler methods of score combination gave results exceeding those of the best single classifier.

Ensemble methods may be costly, due to their reliance on multiple learners as inputs. However, when these input learners are efficient, then combining them in an ensemble does not add computational cost in asymptotic analysis. Thus, when the input learners satisfy the computational requirements set for practical spam

filters, then an ensemble of these methods is also practical by these criteria.
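As a simple illustration of score combination, the sketch below averages the log-odds of the individual filters' spamminess scores. This is in the spirit of the log-odds averaging combiner studied by Lynam and Cormack, but it is our own simplified version, not their exact procedure.

import math

def log_odds(p, eps=1e-6):
    # Map a filter's spamminess score in (0, 1) to log-odds, clipping for stability.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def combine_log_odds(filter_scores):
    # Average the log-odds of the individual filters and map back to (0, 1).
    avg = sum(log_odds(p) for p in filter_scores) / len(filter_scores)
    return 1.0 / (1.0 + math.exp(-avg))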

2.4 Experimental Comparisons

In this section, we empirically compare the methods described in this chapter, using benchmark data sets and evaluation methods developed for the TREC spam filtering competitions [24, 16, 18]. This also allows apples-to-apples comparisons with prior published results on these data sets, which we include in this chapter.

2.4.1 TREC Spam Filtering Methodology

The primary task at the TREC spam filtering competitions has been to filter spam in the online filtering scenario, described in Chapter 1. We focus on this task in this chapter, noting that different years have included tasks for delayed feedback

[16, 18], partial feedback [18], pool-based active learning [16], and our suggestion of online active learning [18].

In this setting, a data set is provided in a canonical order for repeatable online evaluation. Each message in the data set has a gold-standard label of spam or ham, which has been supplied by a careful bootstrapping methodology which includes hand labeling by human experts [23].

A learning-based filter is then shown one message at a time from the data set, in order. For each message, the filter first makes a prediction of "spaminess", expressed as a real number where higher numbers indicate a stronger prediction that the message is spam and lower numbers indicate a stronger prediction that the message is ham [24]. After the prediction has been made, the gold-standard label is revealed to the filter, which may then use the label information to update its internal model. This process repeats until all of the messages in the corpus have been exhausted.

2.4.2 Data Sets

We consider three benchmark data sets, as well as one separate data set used for parameter tuning as described below. The three benchmark data sets are trec05p-1

[24], trec06p [16], and trec07p [18].

The trec05p-1 corpus contains a total of 92,189 messages, of which 39,399 are ham and 52,790 are spam [24]. The trec06p corpus contains 37,822 messages, with 12,910 ham and 24,912 spam [16]. The trec07p corpus contains 75,419 messages: 25,220 ham and 50,199 spam [18].

2.4.3 Parameter Tuning

Several of the above algorithms include parameters that must be tuned, such as the learning rate η in Logistic regression or the margin parameter m in Perceptron with

Margins. It is well known in machine learning that such parameters cannot be set by evaluation on the test data at hand, as this may yield unduly favorable results

[76].

The accepted practice is to tune such parameters on an independent data set, called the tuning set. That is, evaluate the performance of the learner on a

tuning set with many different parameter values. Select the parameter values giving the best performance on the tuning data, and use those values in the final tests. This ensures that the generalization performance achieved by the classifier has not been unfairly influenced by advance knowledge of the test data [76].

Following this practice, we use the publicly available spamassassin corpus for parameter tuning, available at http://spamassassin.apache.org. This data set contains 6,034 examples. Because there is no canonical ordering associated with this data set, we created a randomized canonical ordering used for all of our tuning trials. (This ordering is available from the author on request.) We used the standard (1-ROCA)% evaluation metric, described below, as our criterion for selecting the optimal parameter values.

The parameter values selected were as follows. Logistic Regression learning rate η was set to η = 0.1. Classical Perceptron learning rate η was set to η = 0.5.

Perceptron with Margins learning rate was set to η = 0.5 with margin parameter m = 8.
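A sketch of this tuning procedure is given below: one online pass over the tuning stream per candidate parameter value, keeping the value with the best (lowest) (1-ROCA)% score. The make_filter, score, update, and evaluate interfaces are hypothetical placeholders used only for illustration.

def tune_parameter(make_filter, param_values, tuning_stream, evaluate):
    # tuning_stream: a list of (features, label) pairs in the canonical tuning order.
    # make_filter(value) builds a fresh filter; filter.score(x) and filter.update(x, y)
    # are assumed interfaces; evaluate(results) computes (1-ROCA)% from (score, label) pairs.
    best_value, best_score = None, float('inf')
    for value in param_values:
        f = make_filter(value)
        results = []
        for x, y in tuning_stream:
            results.append((f.score(x), y))   # predict before the label is revealed
            f.update(x, y)                    # then update on the revealed label
        score = evaluate(results)
        if score < best_score:
            best_value, best_score = value, score
    return best_value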

2.4.4 The (1-ROCA)% Evaluation Measure

There are several possible evaluation measures for comparing the classification performance of different filtering methods [94].

Clearly, we would like to maximize the number of True Positives (TPs), which are actual spam messages correctly predicted to be spam, and the number of

True Negatives (TNs), which are actual ham messages correctly predicted as such.

Furthermore, we would like to minimize the number of False Positives (FPs), which are good ham messages wrongly predicted to be spam, and to minimize the number of False Negatives (FNs), which are spam messages predicted to be ham. These

quantities can be combined into familiar measures such as the following:

\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

\mathrm{precision} = \frac{TP}{TP + FP}

\mathrm{recall} = \frac{TP}{TP + FN}

However, each of the methods described earlier in this chapter contains a threshold parameter τ, which can be adjusted to trade off the gain of additional TPs, for example, against the cost of additional FPs. Because the cost of an FP or FN may vary by user, it is considered beneficial for a filter to give good performance over a range of possible threshold values.

One evaluation measure that takes all possible thresholds into account is the

Area under the ROC curve (ROCA) measure. A ROC curve (pictured in Figure

7.3, for example) plots the performance of a classifier at different thresholds in a two dimensional space, with TP rate on the vertical axis and FP rate on the horizontal axis. When these points are connected into a curve, the area under this curve, called the ROCA score when normalized to the range [0, 1], has a statistical interpretation.

The ROCA score gives the probability that a randomly selected spam message will, indeed, be predicted to be more “spammy” than a randomly selected ham message

[15].

In this dissertation, we have followed standard practice in the spam filtering community by focusing on ROCA as our standard evaluation measure because different users may prefer different thresholds. For example, one set of users might weigh FNs as very costly, while another group may prefer as few FPs as possible. Because ROCA is a composite metric computed across all possible thresholds, this single-number evaluation measure takes different users' preferences into account.

Furthermore, results from TREC evaluations show that filters scoring well on the

ROCA measure also tend to score well on other evaluation measures, such as FP rate at 0.1% FN rate [24]. However, these results have also shown that although the relative ranking of filters tends to remain consistent across a range of thresholds, these relative rankings may change when extreme thresholds are considered, such as thresholds tuned for 0.001% FN rate. The exploration of filters tuned for such extremes remains outside the scope of this dissertation.

It has become standard practice in the TREC spam filtering competitions to deal with the quantity (1-ROCA)% rather than ROCA, for readability. This can be interpreted as the percentage chance that a randomly selected spam message will be mistakenly predicted to be less “spammy” than a randomly selected ham message [24]. Numbers closest to 0 for the (1-ROCA)% score are optimal, with a

(1-ROCA)% score of 50 representing a result from random guessing.
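The sketch below computes (1-ROCA)% directly from its probabilistic interpretation: the area under the ROC curve equals the probability that a randomly selected spam message receives a higher score than a randomly selected ham message, with ties counted as one half. This simple O(|spam| · |ham|) estimate is for illustration only; the TREC evaluation toolkit computes the same quantity more efficiently and also reports confidence intervals.

def one_minus_roca_percent(scores_and_labels):
    # scores_and_labels: list of (score, label) pairs, with label +1 for spam, -1 for ham.
    spam = [s for s, y in scores_and_labels if y == +1]
    ham = [s for s, y in scores_and_labels if y == -1]
    # Count spam/ham pairs ranked correctly, with ties counted as one half.
    wins = sum(1.0 if s > h else (0.5 if s == h else 0.0) for s in spam for h in ham)
    auc = wins / (len(spam) * len(ham))
    return 100.0 * (1.0 - auc)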

2.4.5 Comparison Results

We tested each of the methods described in this chapter, and report results in Tables 2.1, 2.2, and 2.3. For all of our tests, we used normalized binary feature mappings and message truncation. The different feature mappings tested are noted in the third column of each table. We also tested a range of mappings that included wildcards and gaps in different combinations. In all cases, these did not improve performance.

We also report results from a small number of previous filters that gave good results on the same data sets, noting the citation in the appropriate row for the publication in which the result appeared.

Table 2.1: Comparison results for methods on the trec05p-1 data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parentheses.

filter name      | learning method              | feature space             | (1-ROCA)% score
SpamProbe [24]   | Naive Bayes                  | word-based                | 0.059 (0.049 - 0.071)
BogoFilter [24]  | Naive Bayes                  | word-based                | 0.048 (0.038 - 0.062)
PPM [6]          | compression                  | implicit k-mers           | 0.019 (0.015 - 0.023)
DMC [6]          | compression                  | implicit k-mers           | 0.013 (0.010 - 0.018)
-                | MultiNomial Naive Bayes (TP) | binary words              | 0.351 (0.3255 - 0.3781)
-                | MultiNomial Naive Bayes (TP) | binary 4-mers             | 0.871 (0.8314 - 0.9132)
-                | MultiNomial Naive Bayes (DP) | binary words              | 0.431 (0.3941 - 0.4708)
-                | MultiNomial Naive Bayes (DP) | binary 4-mers             | 0.321 (0.2895 - 0.3562)
-                | MultiVariate Naive Bayes     | binary words              | 0.179 (0.1608 - 0.1983)
-                | Classical Perceptron         | binary words              | 0.105 (0.0912 - 0.1196)
-                | Classical Perceptron         | binary 2-mers             | 0.139 (0.1261 - 0.1536)
-                | Classical Perceptron         | binary 3-mers             | 0.078 (0.0682 - 0.0895)
-                | Classical Perceptron         | binary 4-mers             | 0.060 (0.0513 - 0.0697)
-                | Perceptron with Margins      | binary words              | 0.037 (0.0301 - 0.0461)
-                | Perceptron with Margins      | binary 2-mers             | 0.053 (0.0462 - 0.0614)
-                | Perceptron with Margins      | binary 3-mers             | 0.030 (0.0236 - 0.0368)
-                | Perceptron with Margins      | binary 4-mers             | 0.022 (0.0184 - 0.0265)
-                | Logistic Regression          | binary words              | 0.046 (0.0407 - 0.0515)
-                | Logistic Regression          | binary 2-mers             | 0.038 (0.0335 - 0.0432)
-                | Logistic Regression          | binary 4-mers + wildcards | 0.019 (0.0143 - 0.0255)
-                | Logistic Regression          | binary 3-mers             | 0.016 (0.0140 - 0.0188)
-                | Logistic Regression          | binary 4-mers             | 0.013 (0.0107 - 0.0152)
53-ensemble [62] | voting combiner              | other filters             | 0.013 (0.010 - 0.018)
53-ensemble [62] | logodds combiner             | other filters             | 0.009 (0.007 - 0.011)
53-ensemble [62] | SVM combiner                 | other filters             | 0.008 (0.005 - 0.013)
53-ensemble [62] | Log-Reg combiner             | other filters             | 0.007 (0.005 - 0.008)

Table 2.2: Comparison results for methods on the trec06p data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parentheses.

filter name      | learning method              | feature space             | (1-ROCA)% score
BogoFilter       | Naive Bayes                  | word-based                | 0.087 (0.066 - 0.114)
PPM [16]         | compression                  | implicit k-mers           | 0.061 (0.048 - 0.076)
DMC              | compression                  | implicit k-mers           | 0.031 (0.021 - 0.044)
-                | MultiNomial Naive Bayes (TP) | binary words              | 0.145 (0.1162 - 0.1809)
-                | MultiNomial Naive Bayes (TP) | binary words              | 0.477 (0.4255 - 0.5352)
-                | MultiNomial Naive Bayes (DP) | binary words              | 0.147 (0.1250 - 0.1724)
-                | MultiNomial Naive Bayes (DP) | binary 4-mers             | 0.340 (0.3069 - 0.3753)
-                | MultiVariate Naive Bayes     | binary words              | 0.138 (0.1179 - 0.1621)
-                | Classical Perceptron         | binary words              | 0.163 (0.1303 - 0.2047)
-                | Classical Perceptron         | binary 2-mers             | 0.307 (0.2617 - 0.3600)
-                | Classical Perceptron         | binary 3-mers             | 0.170 (0.1408 - 0.2054)
-                | Classical Perceptron         | binary 4-mers             | 0.229 (0.1859 - 0.2824)
-                | Perceptron with Margins      | binary words              | 0.047 (0.0342 - 0.0641)
-                | Perceptron with Margins      | binary 2-mers             | 0.104 (0.0815 - 0.1330)
-                | Perceptron with Margins      | binary 3-mers             | 0.053 (0.0381 - 0.0730)
-                | Perceptron with Margins      | binary 4-mers             | 0.049 (0.0340 - 0.0704)
-                | Logistic Regression          | binary words              | 0.087 (0.0728 - 0.1042)
-                | Logistic Regression          | binary 2-mers             | 0.086 (0.0710 - 0.1046)
-                | Logistic Regression          | binary 3-mers             | 0.037 (0.0292 - 0.0471)
-                | Logistic Regression          | binary 4-mers             | 0.032 (0.0247 - 0.0406)
-                | Logistic Regression          | binary 4-mers + wildcards | 0.034 (0.0253 - 0.0447)
53-ensemble      | Log-Reg combiner             | other filters             | 0.020 (0.007 - 0.050)

Table 2.3: Comparison results for methods on the trec07p data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parentheses.

filter name      | learning method              | feature space             | (1-ROCA)% score
BogoFilter       | Naive Bayes                  | word-based                | 0.027 (0.017 - 0.043)
DMC              | compression                  | implicit k-mers           | 0.006 (0.003 - 0.016)
-                | MultiNomial Naive Bayes (TP) | binary words              | 0.145 (0.1162 - 0.1809)
-                | MultiNomial Naive Bayes (TP) | binary words              | 0.477 (0.4255 - 0.5352)
-                | MultiNomial Naive Bayes (DP) | binary words              | 0.147 (0.1250 - 0.1724)
-                | MultiNomial Naive Bayes (DP) | binary 4-mers             | 0.340 (0.3069 - 0.3753)
-                | MultiVariate Naive Bayes     | binary words              | 0.138 (0.1179 - 0.1621)
-                | Classical Perceptron         | binary words              | 0.042 (0.0265 - 0.0668)
-                | Classical Perceptron         | binary 2-mers             | 0.030 (0.0237 - 0.0389)
-                | Classical Perceptron         | binary 3-mers             | 0.035 (0.0241 - 0.0513)
-                | Classical Perceptron         | binary 4-mers             | 0.050 (0.0334 - 0.0729)
-                | Perceptron with Margins      | binary words              | 0.019 (0.0105 - 0.0324)
-                | Perceptron with Margins      | binary 2-mers             | 0.017 (0.0118 - 0.0253)
-                | Perceptron with Margins      | binary 3-mers             | 0.011 (0.0056 - 0.0212)
-                | Perceptron with Margins      | binary 4-mers             | 0.011 (0.0060 - 0.0185)
-                | Logistic Regression          | binary words              | 0.011 (0.0063 - 0.0205)
-                | Logistic Regression          | binary 2-mers             | 0.010 (0.0055 - 0.0186)
-                | Logistic Regression          | binary 3-mers             | 0.006 (0.0022 - 0.0162)
-                | Logistic Regression          | binary 4-mers             | 0.005 (0.0017 - 0.0166)
-                | Logistic Regression          | binary 4-mers + wildcards | 0.006 (0.0025 - 0.0163)

Several trends are apparent from these results. First, the various Naive Bayes variants all perform poorly with the feature mappings tested here, which included no feature selection. Previous Naive Bayes variants such as BogoFilter and SpamProbe performed significantly better than the Naive Bayes variants tested here, in part because these methods employ feature selection to reduce the dimensionality of the feature space. However, even these best performing Naive Bayes variants with feature selection did not perform as well as the best discriminative methods using no feature selection.

Second, Perceptron with Margins out-performs Classical Perceptron in all cases, with all feature spaces. Because the computational cost for these two methods is the same, Perceptron with Margins should be preferred exclusively.

Third, Logistic Regression out-performs Perceptron with Margins. Interestingly, Chapter 6 will show that Perceptron with Margins can match or even out-perform Logistic Regression in the presence of noise.

Fourth, we see that the results from the 53-ensemble out-perform Logistic

Regression. These ensemble results can be matched by Online SVMs, as described in the next chapter.

Chapter 3

Online Filtering with Support Vector Machine Variants

In the previous chapter, we described a variety of machine learning methods that have been applied to the online filtering scenario, with the exception of methods based on Support Vector Machines (SVMs). This is because there has been debate between academic researchers and industrial practitioners about the applicability of

SVMs to the spam filtering problem. The former have advocated the use of SVMs for such filtering, because SVMs give state-of-the-art performance for text classification.

However, similar performance gains were yet to be demonstrated for online spam

filtering before our work in 2007. Additionally, practitioners have cited the high cost of SVMs as reason to prefer faster (if less statistically robust) methods such as

Naive Bayes variants. In this chapter, we offer a resolution to this controversy. First, we show that online SVMs indeed give state-of-the-art classification performance on online spam filtering on large benchmark data sets. Second, we show that nearly equivalent performance may be achieved by a Relaxed Online SVM (ROSVM) at

greatly reduced computational cost. This chapter is based on our work with Gabriel

Wachman [84, 85].

3.1 An Anti-Spam Controversy

Through 2007, the anti-spam community had been divided on the choice of the best machine learning method for content-based spam detection. Academic researchers have tended to favor the use of Support Vector Machines (SVMs), a statistically robust machine learning method [27, 78] which yields state-of-the-art performance on general text classification [45]. However, SVMs typically require training time that is quadratic in the number of training examples, and are impractical for large-scale email systems. Practitioners requiring content-based spam filtering have typically chosen to use the faster (if less statistically robust) machine learning method of Naive Bayes text classification [39, 40, 63]. This Bayesian method requires only linear training time, and is easily implemented in an online setting with incremental updates. This allows a deployed system to easily adapt to a changing environment over time. Other fast methods for spam filtering include compression models [6] and logistic regression [38]. Furthermore, until our work in 2007 it had not been empirically demonstrated that SVMs give improved performance over these methods in an online spam detection setting [20].

3.1.1 Contributions

In this chapter, we address the anti-spam controversy and offer a potential resolution.

We first demonstrate that online SVMs do indeed provide state-of-the-art spam detection through empirical tests on TREC benchmark data sets of email spam. We then analyze the effect of the regularization parameter in the SVM objective function,

which shows that the expensive SVM methodology may, in fact, be overkill for spam detection. We reduce the computational cost of SVM learning by relaxing this requirement on the maximum margin in online settings, and create a Relaxed Online

SVM (ROSVM) appropriate for high performance content-based spam filtering in large-scale settings [84].

3.2 Spam and Online SVMs

The controversy that existed through 2007 between academics and practitioners in spam filtering centered on the use of SVMs. Academics advocated their use, but had yet to demonstrate strong performance with SVMs for online spam filtering.

Indeed, the results of [20] showed that, when used with default parameters, SVMs actually perform worse than other methods. In this section, we review the basic workings of SVMs and describe a simple Online SVM algorithm. We then show that Online SVMs indeed achieve state-of-the-art performance on filtering email spam, so long as the regularization parameter C is set to a high value. However, the cost of Online SVMs turns out to be prohibitive for large-scale applications. These

findings motivate our proposal of Relaxed Online SVMs in the following section.

3.2.1 Background: SVMs

SVMs are a robust machine learning methodology which has been shown to yield state-of-the-art performance on text classification [45] by finding a hyperplane that separates two classes of data in data space while maximizing the margin between them.

The linear SVMs we employ in this chapter use a hypothesis vector w and bias term b to classify a new example x, by generating a prediction value using the


Figure 3.1: Visualizing SVM Classification. An SVM learns a hyperplane that separates the positive and negative data examples with the maximum possible margin. Error terms ξ_i > 0 are given for examples on the wrong side of their respective margin.

function f(x):

f(x) = \langle w, x \rangle + b

As with the methods described in Chapter 2, when a threshold parameter τ is fixed, we predict spam when f(x) ≥ τ and ham otherwise.

SVMs find the hypothesis w, which defines the separating hyperplane, by minimizing the following objective function over all n training examples:

g(w, \xi) = \frac{1}{2}\,||w||^2 + C \sum_{i=1}^{n} \xi_i

under the constraints that

\forall i \in \{1, \ldots, n\} : \; y_i \cdot f(x_i) \geq 1 - \xi_i, \quad \xi_i \geq 0

In this objective function, each slack variable ξ_i shows the amount of error (relative to the margins) that the classifier makes on a given example x_i. Minimizing the sum of the slack variables corresponds to minimizing the loss function on the training data, while minimizing the term \frac{1}{2}||w||^2 corresponds to maximizing the margin between the two classes [78]. These two optimization goals are often in conflict; the regularization parameter C determines how much importance to give each of these tasks.

Linear SVMs exploit data sparsity to classify a new instance in O(s) time, where s is the number of non-zero features. This is the same classification time as the other linear classifiers from Chapter 2. Training SVMs, however, typically takes O(n^2) time for n training examples. A variant for linear SVMs was recently proposed which trains in O(ns) time [47], but because this method has a high constant, we do not explore it here.

Given: data set X = (x1, y1), ..., (xn, yn), C, m:
Initialize w := 0, b := 0, seenData := { }
For each xi ∈ X do:
    Classify xi using f(xi) = <w, xi> + b
    If yi f(xi) < 1:
        Find w', b' using SMO on seenData,
            using w, b as seed hypothesis.
    Add xi to seenData
done

Figure 3.2: Pseudo code for Online SVM.

3.2.2 Online SVMs

In many traditional machine learning applications, SVMs are applied in batch mode.

That is, an SVM is trained on an entire set of training data, and is then tested on a separate set of testing data. Spam filtering is typically tested and deployed in an online setting, which proceeds incrementally. Here, the learner classifies a new example, is told if its prediction is correct, updates its hypothesis accordingly, and then awaits a new example. Online learning allows a deployed system to adapt itself in a changing environment.

Re-training an SVM from scratch on the entire set of previously seen data for each new example is cost prohibitive. However, using an old hypothesis as the starting point for re-training reduces this cost considerably. One method of incremental and decremental SVM learning was proposed in [9]. Because we are only concerned with incremental learning, we apply a simpler algorithm for converting a batch SVM learner into an online SVM (see Figure 3.2 for pseudo-code), which is similar to the approach of [49].

Each time the Online SVM encounters an example that was poorly classified, it retrains using the old hypothesis as a starting point. Note that due to the Karush-Kuhn-Tucker (KKT) conditions, it is not necessary to re-train on well-classified examples that lie outside the margins [78].

We used Platt's SMO algorithm [69] as a core SVM solver, because it is an iterative method that is well suited to converging quickly from a good initial hypothesis. Because previous work (and our own initial testing) indicates that binary feature values give the best results for spam filtering [63, 34], we optimized our implementation of the Online SMO to exploit fast inner-products with binary vectors.1

We performed feature mapping as described in Chapter 2. We tested normalized binary k-mer feature mappings, and a normalized binary word-based feature mapping. We used message truncation, truncating each message after 3000 characters.
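A sketch of the Online SVM wrapper of Figure 3.2 is given below. The smo_train function is a hypothetical placeholder for a batch SMO solver that accepts a seed hypothesis; its signature is our own invention for illustration and does not correspond to any particular library.

def online_svm(stream, smo_train, C=100.0):
    # stream: iterable of (x, y) pairs, x a sparse dict of binary features, y in {-1, +1}.
    # smo_train: hypothetical batch SMO solver, smo_train(data, C, w_seed, b_seed) -> (w, b).
    w, b, seen_data, scores = {}, 0.0, [], []
    for x, y in stream:
        f = sum(w.get(k, 0.0) * v for k, v in x.items()) + b
        scores.append(f)                # predict before the true label is used
        seen_data.append((x, y))
        if y * f < 1.0:                 # the example violates the KKT conditions
            w, b = smo_train(seen_data, C, w, b)   # re-train, seeded with the old hypothesis
    return scores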

3.2.3 Tuning the Regularization Parameter, C

The SVM regularization parameter C must be tuned to balance the (potentially conflicting) goals of maximizing the margin and minimizing the training error. Early work on SVM-based spam detection [34] showed that high values of C give the best performance with binary features. Later work has not always followed this lead: a

(low) default setting of C was used on splog detection [50], and also on email spam

[20].

Following standard machine learning practice, we tuned C on separate tuning data not used for later testing. We used the publicly available spamassassin email spam data set, and created an online learning task by randomly interleaving all 6034 labeled messages to create a single ordered set.

For tuning, we performed a coarse parameter search for C using powers of ten from .0001 to 10000. We used the Online SVM described above, and tested both binary bag-of-words vectors and n-mer vectors with n ∈ {2, 3, 4}. We used the first 3000 characters of each message, which included header information, the body of the email, and possibly attachments. Following the recommendation of [25], we use Area under the ROC curve as our evaluation measure. The results (see Figure 3.3) agree with [34]: there is a plateau of high performance achieved with all values of C ≥ 10, and performance degrades sharply with C < 1. For the remainder of our experiments with SVMs in this chapter, we set C = 100, enforcing little regularization.

1Our source code is freely available at www.cs.tufts.edu/∼dsculley/onlineSMO.


Figure 3.3: Tuning the Regularization Parameter C. Tests were conducted with Online SMO, using binary feature vectors, on the spamassassin data set of 6034 examples. Graph plots C versus Area under the ROC curve.

Table 3.1: Results for Email Spam filtering with Online SVM on benchmark data sets. Score reported is (1-ROCA)%, where 0 is optimal. These results are directly comparable to those on the same data sets with other filters, reported in Chapter 2.

                | trec05p-1           | trec06p
OnSVM: words    | 0.015 (.011-.022)   | 0.034 (.025-.046)
OnSVM: 3-mers   | 0.011 (.009-.015)   | 0.025 (.017-.035)
OnSVM: 4-mers   | 0.008 (.007-.011)   | 0.023 (.017-.032)
SpamProbe       | 0.059 (.049-.071)   | 0.092 (.078-.110)
BogoFilter      | 0.048 (.038-.062)   | 0.077 (.056-.105)
TREC Winners    | 0.019 (.015-.023)   | 0.054 (.034-.085)
Log-Reg 4-mers  | 0.013 (.011-.019)   | 0.032 (.025-.041)
53-Ensemble     | 0.007 (.005-.008)   | 0.020 (.007-.050)


In Chapters 6 and 7 we will return to this issue of regularization, when we explore spam filtering in the presence of class-label noise. Later in this chapter, we will come back to the observation that very high values of C do not degrade performance as support for the intuition that relaxed SVMs should perform well on spam.

3.2.4 Email Spam and Online SVMs

With C tuned on a separate tuning set, we then tested the performance of Online

SVMs in spam detection. We used two large benchmark data sets of email spam as our test corpora. These data sets are the 2005 TREC and the 2006 TREC public data sets, trec05p-1 [24] and trec06p [16], described in Chapter 2.

Results for these experiments, with bag-of-words vectors and k-mer vectors, appear in Table 3.1. To compare our results with previous scores on these data sets, we use the same (1-ROCA)% measure described in Chapter 2.

These results, current through mid-2007, show that Online SVMs indeed gave state of the art performance on email spam. The only known system that out-performed the Online SVMs on the trec05p-1 data set is an ensemble classifier which combines the results of 53 unique spam filters [62]. To our knowledge, the

Online SVM has out-performed every other single filter on these data sets, including those using Bayesian methods [24, 16], compression models [24, 16], and perceptron variants [16], the TREC competition winners [24, 16], and open source email spam

filters BogoFilter v1.1.5 and SpamProbe v1.4d. Online SVM also out-performed logistic regression using word-based features [38], although later work by Cormack applying logistic regression with a binary 4-mer feature space and message truncation achieved results that are competitive with those of the Online SVM [19].

3.2.5 Computational Cost

The results presented in this section demonstrate that linear SVMs give state of the art performance on content-based spam filtering. However, this performance comes at a price. Although the blog comment spam and splog data sets are too small for the quadratic training time of SVMs to appear problematic, the email data sets are

large enough to illustrate the problems of quadratic training cost.

Table 3.2 shows computation time versus data set size for each of the online learning tasks (run on the same system with no other significant load). The training cost of SVMs is prohibitive for large-scale content-based spam detection or for a large blog host. In the following section, we reduce this cost by relaxing the expensive requirements of SVMs.

3.3 Relaxed Online SVMs (ROSVM)

One of the main benefits of SVMs is that they find a decision hyperplane that maximizes the margin between classes in the data space. Maximizing the margin is expensive, typically requiring quadratic training time in the number of training examples. However, as we saw in the previous section, the task of content-based spam detection is best achieved by SVMs with a high value of C. Setting C to a high value for this domain implies that minimizing training loss is more important than maximizing the margin (see Figure 3.4).

Online SVMs re-compute an exact solution to a full SVM optimization problem on the entire set of seen data each time a new example violates the KKT conditions. While the method of seeding an SVM with the previous best hypothesis speeds this computation, the overall computation time incurred by an Online SVM is still prohibitive for large-scale spam detection systems.

Table 3.2: Execution time for Online SVMs with email spam detection, in CPU seconds. These times do not include the time spent mapping strings to feature vectors. The number of examples in each data set is given in the last row as corpus size.

features    | trec06p | trec05p-1
words       | 12196s  | 66478s
3-mers      | 44605s  | 128924s
4-mers      | 87519s  | 242160s
corpus size | 37822   | 92189


Figure 3.4: Visualizing the effect of C. Hyperplane A maximizes the margin while accepting a small amount of training error. This corresponds to setting C to a low value. Hyperplane B accepts a smaller margin in order to reduce training error. This corresponds to setting C to a high value. Content-based spam filtering appears to do best with high values of C.

Given: dataset X = (x1, y1), ..., (xn, yn), C, m, p:
Initialize w := 0, b := 0, seenData := { }
For each xi ∈ X do:
    Classify xi using f(xi) = <w, xi> + b
    If yi f(xi) < m:
        Find w', b' using SMO on seenData,
            using w, b as seed hypothesis.
    If size(seenData) > p:
        remove oldest example from seenData
    Add xi to seenData
done

Figure 3.5: Pseudo-code for Relaxed Online SVM.

Thus, while SVMs do create high performance spam filters, applying them in practice is overkill. The full margin maximization feature that they provide is unnecessary, and relaxing this requirement can reduce computational cost. We propose three ways to relax Online SVMs:

• Reduce the size of the optimization problem by only optimizing over the last p examples.

• Reduce the number of training updates by only training on actual errors.

• Reduce the number of iterations in the iterative SVM solver by allowing an approximate solution to the optimization problem.

As we describe in the remainder of this section, all of these methods trade statistical robustness for reduced computational cost. Experimental results reported in the following section show that they equal or approach the performance of full Online SVMs on content-based spam detection.

3.3.1 Reducing Problem Size

In the full Online SVMs, we re-optimize over the full set of seen data on every update, which becomes expensive as the number of seen data points grows. We can bound this expense by only considering the p most recent examples for optimization

(see Figure 3.5 for pseudo-code).

Note that this is not equivalent to training a new SVM classifier from scratch on the p most recent examples, because each successive optimization problem is seeded with the previous hypothesis w [30]. This hypothesis may contain values for features that do not occur anywhere in the p most recent examples, and these will not be changed. This allows the hypothesis to remember rare (but informative) features that were learned further than p examples in the past.

Formally, the optimization problem is now defined most clearly in the dual form [78]. In this case, the original soft-margin SVM is computed by maximizing, at example n:

W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle,

subject to the constraints [78]:

\forall i \in \{1, \ldots, n\} : \; 0 \leq \alpha_i \leq C \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0

63 To this, we add the additional lookback buffer constraint

j 1,..., (n p) : αj = cj ∀ ∈ { − }

where cj is a constant, fixed as the last value found for αj while j > (n p). Thus, − the margin found by an optimization is not guaranteed to be one that maximizes the margin for the global data set of examples x ,..., xn) , but rather one that { 1 } satisfies a relaxed requirement that the margin be maximized over the examples

x ,..., xn , subject to the fixed constraints on the hyperplane that were { (n−p+1) } found in previous optimizations over examples x ,..., x . (For completeness, { 1 (n−p)} when p n, define (n p) = 1.) This set of constraints reduces the number of free ≥ − variables in the optimization problem, reducing computational cost.

3.3.2 Reducing Number of Updates

As noted before, the KKT conditions show that a well-classified example will not change the hypothesis; thus it is not necessary to re-train when we encounter such an example. Under the KKT conditions, an example xi is considered well-classified when yi f(xi) > 1. If we re-train on every example that is not well-classified, our hyperplane will be guaranteed to be optimal at every step.

The number of re-training updates can be reduced by relaxing the definition of well-classified. An example xi is now considered well-classified when yi f(xi) > M, for some 0 ≤ M ≤ 1. Here, each update still produces an optimal hyperplane. The learner may encounter an example that lies within the margin but still satisfies yi f(xi) > M; such an example means the hypothesis is no longer globally optimal for the data set, but it is considered good enough for continued use without immediate retraining.

This update procedure is similar to that used by variants of the Perceptron algorithm [52]. In the extreme case, we can set M = 0, which creates a mistake-driven Online SVM. In the experimental section, we show that this version of Online SVMs, which updates only on actual errors, does not significantly degrade performance on content-based spam detection, but does significantly reduce cost.

3.3.3 Reducing Iterations

As an iterative solver, SMO makes repeated passes over the data set to optimize the objective function. SMO has one main loop, which can alternate between passing over the entire data set and passing over the smaller active set of current support vectors [69]. Successive iterations of this loop bring the hyperplane closer to an optimal value. However, it is possible that these iterations provide less benefit than their expense justifies; that is, a close first approximation may be good enough. We introduce a parameter T to control the maximum number of iterations we allow. As we will see in the experimental section, this parameter can be set as low as 1 with little impact on the quality of results, providing computational savings.
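Taken together, the three relaxations can be summarized in a short sketch of the online update loop. The sketch below is illustrative only: the inner solver is abstracted as a caller-supplied function (here named retrain, standing in for an SMO-style optimizer seeded with the current hypothesis), and the names stream, buffer_size, update_margin, and max_iters are our own, corresponding to the example stream and the parameters p, M, and T.

    from collections import deque

    def relaxed_online_svm(stream, retrain, buffer_size, update_margin, max_iters):
        """Sketch of a relaxed online SVM update loop (not the exact implementation).

        stream       : iterable of (x, y) pairs; x is a dict of feature -> value, y in {-1, +1}
        retrain      : callable (w, b, seen_data, max_iters) -> (w, b), e.g. a capped
                       SMO-style solver seeded with the current hypothesis; w is a dict
        buffer_size  : p, the number of most recent examples kept for optimization
        update_margin: M, retrain only when y * f(x) < M
        max_iters    : T, iteration cap passed through to the solver
        """
        w, b = {}, 0.0                          # sparse weight vector and bias
        seen_data = deque(maxlen=buffer_size)   # lookback buffer of recent examples
        for x, y in stream:
            score = sum(w.get(f, 0.0) * v for f, v in x.items()) + b
            yield 1 if score >= 0 else -1       # emit the prediction for this message
            if y * score < update_margin:       # relaxed check: skip well-classified examples
                seen_data.append((x, y))        # the deque drops the oldest example when full
                w, b = retrain(w, b, list(seen_data), max_iters)

Each relaxation can be effectively disabled by setting buffer_size to the full stream length, update_margin to 1, or max_iters to a large value.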

3.4 Experiments

In Section 3.2, we argued that the strong performance of SVMs with a high value of C on content-based spam detection shows that the maximum margin criterion is overkill, incurring unnecessary computational cost. In Section 3.3, we proposed ROSVM to address this issue; its relaxations trade away guarantees on the maximum margin hyperplane in return for reduced computational cost. In this section, we test these methods on the same benchmark data sets to see if state of the art performance may be achieved by these less costly methods. We find that ROSVM is capable of achieving these high levels of performance with greatly reduced cost. Our main tests on content-based spam detection are performed on large benchmark sets of email data. We then apply these methods to the smaller data sets of blog comment spam and splogs, with similar performance.

3.4.1 ROSVM Tests

In Section 3.3, we proposed three approaches for reducing the computational cost of Online SMO: reducing the problem size, reducing the number of optimization iterations, and reducing the number of training updates. Each of these approaches relaxes the maximum margin criterion on the global set of previously seen data. Here we test the effect that each of these methods has on both effectiveness and efficiency.

In each of these tests, we use the large benchmark email data sets, trec05p-1 and trec06p.

Testing Reduced Size

For our first ROSVM test, we examine the effect of reducing the size of the optimization problem by only considering the p most recent examples, as described in the previous section. For this test, we use the same 4-mer mappings as for the reference experiments in Section 3.2, with the same value C = 100. We test a range of values of p in a coarse grid search. Figure 3.6 reports the effect of the buffer size p on the (1-ROCA)% performance measure (top) and the number of CPU seconds required (bottom). These results were obtained on machines with 2.8 GHz Intel Xeon CPUs and 4 gigabytes of RAM, run with no other system load during execution.

The results show that values of p < 100 do result in degraded performance,

Figure 3.6: Reduced Size Tests. [Plots of (1-ROCA)% (top) and CPU seconds (bottom) against buffer size p, for trec05p-1 and trec06p.]

although they evaluate very quickly. However, p values from 500 to 10,000 perform almost as well as the original Online SMO (represented here as p = 100,000), at dramatically reduced computational cost.

These results are important for making state of the art performance on large-scale content-based spam detection practical with online SVMs. Ordinarily, the training time would grow quadratically with the number of seen examples. However, fixing a value of p ensures that the cost of each update does not grow with the total number of examples seen. Furthermore, a lookback buffer allows the filter to adjust to concept drift.

Testing Reduced Iterations

In the second ROSVM test, we experiment with reducing the number of iterations. Our initial tests showed that the maximum number of iterations used by Online SMO was rarely much larger than 10 on content-based spam detection; thus we tested values of T ∈ {1, 2, 5, ∞}. Other parameters were identical to the original Online SVM tests.

The results on this test were surprisingly stable (see Figure 3.7). Reducing the maximum number of SMO iterations per update had essentially no impact on classification performance, but did result in a moderate increase in speed. This suggests that any additional iterations are spent attempting to find improvements to a hyperplane that is already very close to optimal. These results show that for content-based spam detection, we can reduce computational cost by allowing only a single SMO iteration (that is, T = 1) with effectively equivalent performance.

Testing Reduced Updates

For our third ROSVM experiment, we evaluate the impact of adjusting the parameter M to reduce the total number of updates. As noted before, when M = 1, the hyperplane is globally optimal at every step.

Figure 3.7: Reduced Iterations Tests. [Plots of (1-ROCA)% (top) and CPU seconds (bottom) against the maximum number of SMO iterations, for trec05p-1 and trec06p.]

Figure 3.8: Reduced Updates Tests. [Plots of (1-ROCA)% (top) and CPU seconds (bottom) against the update margin M, for trec05p-1 and trec06p.]

Reducing M allows a slightly inconsistent hyperplane to persist until it encounters an example for which it is too inconsistent. We tested values of M from 0 to 1, at increments of 0.1. (Note that we used p = 10000 to decrease the cost of evaluating these tests, but did not reduce iterations for these tests.)

The results for these tests appear in Figure 3.8, and show that there is a slight degradation in performance with reduced values of M, and that this degradation is accompanied by an increase in efficiency. Values of M > 0.7 give effectively equivalent performance to M = 1, and still reduce cost.

3.4.2 Online SVMs and ROSVM

We now compare ROSVM against Online SVMs on the email spam, blog comment spam, and splog detection tasks. These experiments show comparable performance on these tasks, at radically different costs. In the previous section, the effect of each relaxation method was tested separately; here, we test the methods together to create a full implementation of ROSVM. We chose the values p = 10000, T = 1, and M = 0.8 for the email spam detection tasks. Note that these parameter values were selected as ones allowing ROSVM to achieve performance comparable to Online SVMs, in order to test the total difference in computational cost.

Experimental Setup

We compared Online SVMs and ROSVM on email spam, blog comment spam, and splog detection. For the email spam task, we used the benchmark corpora trec05p-1 and trec06p, in the standard online ordering. We ran each method on each task, and report the results in Table 3.3. Note that the CPU time reported for each method was generated on the same computing system. This time reflects only the time needed to complete online learning on tokenized data. We do not report the time taken to tokenize the data into binary 4-mers, as this is the same additive constant for all methods on each task. In all cases, ROSVM was significantly less expensive computationally.

Table 3.3: Email Spam Benchmark Data. These results compare Online SVM and ROSVM on email spam detection, using the binary 4-mer feature space. Score reported is (1-ROCA)%, where 0 is optimal.

           trec05p-1                  trec06p
           (1-ROCA)%    CPU sec.      (1-ROCA)%    CPU sec.
OnSVM      0.0084       242,160       0.0232       87,519
ROSVM      0.0090       24,720        0.0240       18,541

3.4.3 Results

The comparison results shown in Table 3.3 are striking in two ways. First, they show that the performance of Online SVMs can be matched and even exceeded by relaxed margin methods. Second, they show a dramatic disparity in computational cost: ROSVM is far more efficient than the normal Online SVM, and gives comparable results. Furthermore, the fixed lookback buffer ensures that the cost of each update does not depend on the size of the data set already seen, unlike Online SVMs. Note that the blog and splog data sets are relatively small, and results on these data sets must be considered preliminary. Overall, these results show that there is no need to pay the high cost of SVMs to achieve this level of performance on content-based detection of spam. ROSVMs offer a far cheaper alternative with little or no performance loss.

3.5 ROSVMs at the TREC 2007 Spam Filtering Competition

The experiments reported in this chapter were performed in late 2006, and suggested that ROSVMs are, indeed, a strong spam filtering methodology. This algorithm was then tested on new data at the TREC 2007 spam filtering competition, again with strong results.

3.5.1 Parameter Settings

Under the TREC guidelines, we were allowed to test our method at a small number of different parameter settings. We tested ROSVMs at three settings, all of which were chosen before the data sets were released. These settings were intended to compare the performance tradeoffs of different sizes of the lookback buffer p and the update margin parameter M. For each trial, the regularization parameter was set to C = 100 and the maximum number of iterations to T = 1. Settings for p and M appear in Table 3.4.

3.5.2 Experimental Results

TREC 2007 employed two different data sets for tests in the idealized filtering scenario. The trec07p data set has been described in Chapter 2, and is now publicly available for research purposes [18]. A second data set, MrX3, is a private corpus that has not been publicly released [18]. This private corpus contains 161,975 examples, of which 8,082 are ham and 153,893 are spam. Note that for this corpus, ham is the minority class. Because the MrX3 corpus is composed of all emails sent to an anonymous but real human user, we can expect that this proportion of ham and spam reflects a real-world distribution in contemporary email filtering.

Table 3.4: Results for ROSVMs and comparison methods at the TREC 2007 Spam Filtering track. Score reported is (1-ROCA)%, where 0 is optimal, with .95 confidence intervals in parentheses.

filter name    details                 trec07p                  MrX3
Bogo           Naive Bayes             0.027  (.017-.043)       N/A
ijsppm [18]    PPM compression         0.0299 (.0177-.0504)     0.0397 (.0228-.0690)
wat1 [19]      Log-Reg 4-mers          0.0105 (.0053-.0208)     0.0096 (.0068-.0137)
wat2 [19]      DMC compression         0.0207 (.0158-.0272)     0.0219 (.0155-.0319)
wat3 [19]      ensemble: wat1, wat2    0.0086 (.0038-.0195)     0.0076 (.0054-.0108)
ROSVM          p = 500, M = 0.8        0.0099 (.0032-.0308)     0.0166 (.0070-.0396)
ROSVM          p = 1000, M = 0.5       0.0103 (.0031-.0337)     0.0054 (.0036-.0080)
ROSVM          p = 5000, M = 0.8       0.0093 (.0021-.0406)     0.0042 (.0024-.0073)

We report results for the ROSVMs at various parameter settings, along with results from BogoFilter (a Naive Bayes filter), PPM compression [6], DMC compression [6, 19], logistic regression using binary 4-mers [19], and a small ensemble using results from both DMC and logistic regression [19]. In general, ROSVMs gave best or near-best performance on both data sets. However, the confidence bounds for the (1-ROCA)% score overlap with those of several methods on trec07p, and with the results for the wat1 logistic regression filter and the wat3 ensemble method on MrX3. Thus, we conclude that ROSVMs are one of several methods that give near-perfect results on these data sets.

3.6 Discussion

In the past, academic researchers and industrial practitioners have disagreed on the best method for online content-based detection of spam on the web. We have presented one resolution to this debate. Online SVMs do, indeed, produce state-of-the-art performance on this task with proper adjustment of the regularization parameter C, but with cost that grows quadratically with the size of the data set.

The high values of C required for best performance with SVMs show that the margin maximization of Online SVMs is overkill for this task. Thus, we have proposed a less expensive alternative, ROSVM, that relaxes this maximum margin requirement, and produces nearly equivalent results.

It is natural to ask why ROSVM performs so strongly on the task of content-based spam detection. After all, not all data allows the relaxation of SVM requirements. We conjecture that email spam has the characteristic that a subset of rare but informative features is particularly indicative of content being either spam or not spam. These indicative features may be sparsely represented in the data set because of spam methods such as word obfuscation, in which common spam words are intentionally misspelled in an attempt to reduce the effectiveness of word-based spam detection. Maximizing the margin may cause these sparsely represented features to be ignored, creating an overall reduction in performance.

It appears that spam data is highly separable, allowing ROSVM to be successful with high values of C and little effort given to maximizing the margin. Future work will determine how applicable relaxed SVMs are to the general problem of text classification.

Indeed, the success of ROSVMs with high values of C suggests that the benchmark data sets are remarkably free from noise, as very little regularization is required to achieve strong classification performance. In Chapters 6 and 7, we will explore the impact of added noise in these data sets, and will show that additional regularization (with lower values of C) is necessary in such cases.

Finally, we note that the success levels of ROSVMs, along with ensemble methods and logistic regression, are essentially perfect, making one or fewer errors per thousand messages in the idealized online scenario. While some observers have concluded that the spam filtering problem is therefore solved, we contend that the filtering problem can only be considered solved when such performance levels are evident in experimental tests reflecting real-world conditions rather than the idealized case. Thus, in the remainder of this dissertation we move forward by investigating more realistic filtering scenarios.

Chapter 4

Online Active Learning Methods for Spam Filtering

In the previous chapter, it was demonstrated that machine learning methods such as Logistic Regression and ROSVMs give near-perfect classification performance for online filtering of email spam data when label feedback is provided for each message. However, in real-world settings, users may be unwilling to label every message sent to them, in part because being required to give exhaustive feedback on all messages largely defeats the purpose of automated filtering. This chapter investigates the use of online active learning to reduce the amount of label feedback required from users without a large reduction in classification performance compared with results from the idealized filtering scenario.

4.1 Re-Thinking Active Learning for Spam Filtering

Active learning methods have been developed in the machine learning community to reduce labeling cost by identifying informative examples for which to request labels. It has been shown in practice that only a small portion of a large unlabeled data set may need to be labeled to train an active learner whose classification performance rivals (or in some cases exceeds) that attained when all of the training data has been labeled [56, 36, 74, 91, 10].

Thus, active learning is an appealing tool for real-world spam filtering.

The pool-based approach to active learning has previously been applied to spam filtering, with good results [57, 87, 16]. Similar to prior results in text classification [56], it has been shown that only a small subset of a larger unlabeled email data set needs to be labeled to achieve strong performance. Several active learning methods are appropriate for this task, including uncertainty sampling [56], version space reduction [91], and query by committee [36]. However, the iterative pool-based approach is computationally expensive, often requiring many passes through the entire unlabeled data set. Segal et al. introduced an efficient approximation that reduced this cost, but still required at least one full pass through an entire unlabeled data set before any labels could be requested [87].

We depart from the pool-based approach and investigate the novel use of online active learning methods for spam filtering. In this online case, the filter is exposed to a stream of messages and is asked to classify them one by one. At each point in time, the active filter may choose to request a label for the given example.

The goal here is to create a strong classifier while requesting as few labels as possible.

This approach has several advantages over the pool-based methodology. First, it reflects the actual application scenario of real-world spam filters, which are applied in online (and often real-time) settings. A user may be much more willing to label a new incoming message than one from the past, especially if label requests are made relatively infrequently. Second, online active learning enables solutions with only O(1) additional computation and storage costs. The online active learning scenario involves no repeated passes over unlabeled data, which is the primary source of computational cost in pool-based active learning, and does not require storage of a large pool of unlabeled examples.

In the remainder of this chapter, we review related work in both pool-based and online active learning. We then describe several online active learning strategies for linear classifiers. We test these methods on spam filtering tasks using three of the best performing learners from Chapters 2 and 3: Perceptron with Margins, Logistic Regression, and ROSVMs. We find that online active learning methods greatly reduce the number of labels needed to achieve strong classification performance on two large benchmark data sets. These results exceed the performance of uniform subsampling on both data sets, and also out-perform the pool-based active learning methods from the 2006 TREC spam filtering competition by at least an order of magnitude on the (1-ROCA)% evaluation metric. We conclude with a discussion of the implications of these results for spam filtering and user interface design.

4.2 Related Work

There are many active learning approaches in the machine learning literature [56, 91, 36, 74, 42, 11, 29]. Although pool-based active learning has received more attention, there has also been significant work in online active learning (sometimes referred to as label efficient learning). In this section, we explore connections between the pool-based approach and the online approach to the goal of spam filtering with reduced human labeling effort. We also discuss related attempts to reduce human labeling effort using a semi-supervised learning approach.

4.2.1 Pool-based Active Learning

As discussed in the introduction, a variety of pool-based active learning methods have been proposed in the literature [56, 91, 36, 74]. There have been several examinations of pool-based active learning for spam filtering [87], including a task in the 2006 TREC spam filtering competition [16]. Pool-based active learning assumes that the learner has access to a pool of n unlabeled examples, and is able to request labels for up to m ≪ n of these examples.

There are several methods for selecting these m examples. Uncertainty sampling [56] requests the m examples for which the current hypothesis has the least confidence in classification. Other methods explicitly seek examples whose labels are expected to most reduce the size of the version space [91]. The Query by Committee algorithm is another approach to version space reduction that relies on predictions from hypotheses sampled from the version space [36]. It is also possible to request labels for those examples that are estimated to most greatly reduce training error [74].

There are two issues with pool-based active learners as applied to spam filtering. The first is cost. In an iterative pool-based scheme, each of the n examples must be re-evaluated on each iteration. Some methods, such as version space reduction, Query by Committee, and estimation of error rate reduction, are expensive in practice. But even with inexpensive evaluation methods, such as uncertainty sampling or the simple method of version space reduction [91], the cost of active learning is still O(ni) for a pool of n examples over i iterations. For large email systems, this cost may be prohibitive. Segal et al. have proposed a method of reducing this cost for spam filtering with an approximation to uncertainty sampling [87], but even here the entire pool of unlabeled examples must be examined at least once before labels may be requested. Thus, the main overhead in pool-based active learning is in choosing examples for which to request labels.

The second issue with pool-based active learning is that in practical settings, spam filtering is most naturally viewed as an online task. Emails enter a system in a stream, not a pool. Online active learning enables a natural user interface for requesting labels from actual users in real time.

Pool-based active learning does have one potential advantage over online active learning. Because pool-based active learning considers an entire data set at once, it is possible that pool-based active learning may identify the optimal subset of training examples. Online active learning necessarily uses a greedy strategy that may not select the optimal subset.

4.2.2 Online Active Learning

To our knowledge, ours is the first examination of online active learning methods for spam filtering [79]. The first analysis of online active learning for general machine learning was performed by Helmbold and Panizza [42], under the heading of Label Efficient Learning. This work examined the tradeoffs between the cost of label requests and the cost of errors in an online learning setting. There have since been several proposals for Label Efficient learning methods for linear classifiers [11, 29], but to our knowledge the b-Sampling approach (reviewed in Section 4.3.1) is the only approach that has been analyzed that does not require either the total number of examples in the data stream or the maximum number of label requests to be specified in advance. Because practical spam filtering is performed in an essentially unbounded online setting, it is important not to have such restrictions.

We should also point out that Query by Committee [36] is essentially an online active learning algorithm. We do not examine it in this chapter because the process of sampling hypotheses from the version space is too expensive for practical spam filtering.

4.2.3 Semi-Supervised Learning and Spam Filtering

Active learning is not the only machine learning approach for reducing dependency on labeled data. The semi-supervised learning paradigm assumes that a small amount of labeled data is available for training, along with a large pool of unlabeled data [12]. However, where pool-based active learning seeks to select useful examples from the unlabeled pool for labeling, semi-supervised learning attempts to make use of the unlabeled data directly.

There are several common approaches to semi-supervised learning. Self-training applies a learner trained on a small amount of data to a larger, unlabeled data pool [12]; examples that the learner labels with high confidence are then treated as labeled examples for additional training. Co-training relies on redundancy in the feature space, using independent learners trained on conditionally independent subsets of the feature space to train each other on the unlabeled data [5]. Transductive SVMs (TSVMs) seek to solve a joint optimization problem, in which the classical soft-margin SVM problem for labeled data is combined with an optimization problem seeking to assign labels to unlabeled data with the minimum possible disagreement [46].

Although attempts have been made to apply semi-supervised learning methods to spam filtering, thus far these have not been successful on benchmark filtering data sets. The 2006 ECML/PKDD learning challenge tested several methods for semi-supervised learning on a small data set of email spam and ham of reduced dimensionality, with initially promising results that showed TSVMs to be a strong methodology [4]. However, a following study attempting to apply both self-training and TSVMs to online spam filtering using the TREC data sets showed that these semi-supervised methods only degraded results [67]. Our own attempts in this regard, conducted independently in the winter of 2008, found that both self-training and co-training degraded results in cases where more than ten examples were labeled in the data set. We conjecture that the setting of high dimensionality with many sparsely represented but relevant features causes difficulty for semi-supervised approaches. Thus, at the time of writing, the effective application of semi-supervised learning methods for spam filtering remains an open problem.

Figure 4.1: Online Active Learning. [Flowchart: each new incoming message from the message stream is classified as ham or spam; the filter then decides whether to request a label, and if so, obtains the label and updates the filter.]

4.3 Online Active Learning Methods

The basic online active learning framework is shown in Figure 4.1. Messages come to the filter in a stream, and the filter must classify them one by one. At each point, the filter has the option of requesting a label for the given message. The goal is for the filter to achieve strong classification performance with as few label requests as possible.

The issue considered in this section is: given a stream of unlabeled examples, how should the filter decide when to request a label? We examine several schemes for making this decision. The first is a randomized label efficient method first proposed for linear classifiers such as classical Perceptron [10]. The second is similar, but uses a logistic sampling rule that we introduce for comparison. The third is our fixed margin variant, which borrows from the idea of uncertainty sampling by requesting labels for examples lying within a fixed distance of the classification hyperplane. Finally, we include two baseline methods for comparison: uniform random subsampling and first-n sampling.

The notation in this section uses pi as the signed distance of example xi from the decision hyperplane. For Perceptron with Margins and ROSVMs, pi is given by f(xi). For Logistic Regression, we can recover pi from the classification function f(xi) using the transformation:

    pi = log( f(xi) / (1 − f(xi)) ) = < w, xi > + b.
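As a minimal illustration of this transformation (the function name below is ours, not part of any filter implementation), the conversion from a logistic output in (0, 1) back to the signed margin is one line:

    import math

    def signed_margin(prob_spam):
        # Invert the logistic link: pi = log(f(xi) / (1 - f(xi))) = <w, xi> + b.
        # prob_spam is assumed to lie strictly between 0 and 1.
        return math.log(prob_spam / (1.0 - prob_spam))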

4.3.1 Label Efficient b-Sampling

Cesa-Bianchi et al. introduced a label efficient method of selective sampling for linear classifiers such as classical Perceptron and Winnow, and gave theoretical mistake bounds and expected sampling rates [10]. We refer to this method as b-Sampling, and describe it here.

Figure 4.2: Label Efficient b-Sampling Probabilities. [Plot of sampling probability against distance from the hyperplane, for b = 0.001, 0.01, 0.1, and 1.]

The b-Sampling rule [10] is: given a sampling parameter b > 0, request a label for example xi with probability Pi, where:

    Pi = b / (b + |pi|).

As |pi| approaches zero, the probability of a label request for xi approaches 1. This makes intuitive sense: the closer an example is to the hyperplane, the less confidence we have in the learner's prediction. There is always some non-zero probability of requesting a label for examples far from the hyperplane, to ensure that the hypothesis is performing well across the entire data space.

The parameter b defines a function relating the sampling probability Pi to the classification confidence |pi|. To help illustrate the effects of varying b, we have mapped Pi against |pi| for several values of b, as shown in Figure 4.2. Note that when |pi| = 0, then Pi = 1 for all values of b > 0. That is, a label is always requested when the hypothesis has zero confidence. A particular choice of b trades off the number of requested labels against classification performance, and will depend on the particular data and user needs.
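As a small illustration (our own sketch, not part of the original b-Sampling implementation), the sampling decision can be written in a few lines; note how larger distances from the hyperplane drive the request probability toward zero, while an example on the hyperplane is always sampled:

    import random

    def b_sampling_request(distance, b):
        """Decide whether to request a label under b-Sampling.

        distance: signed distance pi of the example from the hyperplane
        b       : sampling parameter, b > 0
        """
        probability = b / (b + abs(distance))   # Pi = b / (b + |pi|)
        return random.random() < probability

    # For b = 0.1: an example on the hyperplane (pi = 0) is always sampled,
    # while an example with |pi| = 1 is sampled with probability 0.1 / 1.1, about 0.09.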

4.3.2 Logistic Margin Sampling

The b-Sampling method of online active learning used one particular method of mapping |pi| distance values to Pi sampling probabilities, in part because this method allowed clean theoretical analysis. For comparison, we propose and test another natural mapping from |pi| to Pi, based on a logistic model of confidence probabilities (without theoretical analysis), that we call Logistic Margin Sampling.

For an example xi, let probability qi represent the probability that the predicted label given by sign(pi) matches the true label yi. (As before, pi is the signed distance of xi from the classification hyperplane.) Thus, qi is the confidence that our label is correct.

Figure 4.3: Logistic Margin Sampling Probabilities. [Plot of sampling probability against distance from the hyperplane, for γ = 1, 2, 4, and 8.]

In this Logistic Margin approach, we model the confidence value qi using a logistic function:

    qi = 1 / (1 + e^(−γ|pi|)).

This approximation is reasonable given the work of Platt [70], who showed that a similar sigmoid function is a good model for confidence values for linear classifiers.

Following the intuition behind uncertainty sampling, we request a label for xi with probability:

    Pi = e^(−γ|pi|).

Like b-Sampling, Logistic Margin sampling gives the highest sampling probability to those examples lying closest to the classification hyperplane. Those examples xi with |pi| = 0 are always sampled, and every example has a non-zero sampling probability. The difference is in the shape of the distribution, shown in Figure 4.3 for different values of γ.
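A compact sketch (ours, mirroring the b-Sampling helper above) shows both the modeled confidence qi and the sampling probability Pi under this rule; gamma stands for the shape parameter γ:

    import math
    import random

    def logistic_margin_request(distance, gamma):
        """Decide whether to request a label under Logistic Margin sampling."""
        confidence = 1.0 / (1.0 + math.exp(-gamma * abs(distance)))   # qi, shown for comparison only
        probability = math.exp(-gamma * abs(distance))                # Pi
        # qi rises toward 1 and Pi falls toward 0 as |distance| grows;
        # at distance 0, qi = 0.5 and Pi = 1, so a label is always requested.
        return random.random() < probability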

4.3.3 Fixed Margin Sampling

The previous two methods of online active learning are probabilistic, mapping classification confidence values to sampling probabilities. For comparison, we also propose and test a deterministic variant that we call fixed margin sampling. Fixed margin sampling is a sampling heuristic that can reduce the total number of label requests needed for a given performance level, but offers no theoretical guarantees.

In fixed margin sampling, a confidence threshold c is set as a parameter. The sampling rule is straightforward: request a label for an example xi when (and only when) |pi| < c. Fixed margin sampling is thus visualized as a step function. Unlike the two prior methods, fixed margin sampling does not assign a non-zero sampling probability to all examples. Examples xi with |pi| ≥ c will never have their labels requested. Thus, the theoretical guarantees of b-Sampling do not apply to fixed margin sampling: it is possible that a bad initial hypothesis will continually make mistakes with high confidence. A learner with this hypothesis, “never in doubt, but never correct,” will not receive any label information and thus will never update.

However, in practice, we have found that fixed margin sampling can be effective for spam filtering. This is because the online linear classifiers tend to make low-confidence mistakes before they make high-confidence mistakes, due to the incremental nature of online updates for the linear classifiers we examined. Thus, they avoid this theoretical problem in actual tests. Furthermore, because fixed margin sampling does not request labels for examples about which the learner is confident, this approach may require fewer labels in the long run. This is especially true when the learner is able to achieve a strong hypothesis, as is the case in spam filtering, where filters can achieve extremely high classification performance.
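Putting the pieces together, the overall online active learning loop of Figure 4.1 can be sketched as follows. This is an illustrative sketch of our own: the learner object and its predict_margin(x) and update(x, y) methods are hypothetical names, and fixed margin sampling is used as the decision rule (either probabilistic rule above could be substituted).

    def online_active_learning(stream, learner, threshold):
        """Sketch of the online active learning loop with fixed margin sampling.

        stream   : iterable of (x, y) pairs; y is consulted only when a label is requested
        learner  : object with predict_margin(x) -> signed distance pi, and update(x, y)
        threshold: the fixed margin c; request a label only when |pi| < c
        """
        for x, y in stream:
            margin = learner.predict_margin(x)
            yield 1 if margin >= 0 else -1     # classify the message as ham (+1) or spam (-1)
            if abs(margin) < threshold:        # low-confidence prediction: ask for the label
                learner.update(x, y)           # train only on the examples the user labels

Because updates happen only on requested labels, the training cost falls automatically as the hypothesis improves and fewer messages land inside the margin.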

4.3.4 Baselines

For comparison, we provide two baseline methods for evaluating performance.

• Uniform Subsampling. With this method, a fixed probability value q is set as a parameter, and a label is requested for each example with probability q. Using uniform subsampling as a baseline comparison is common practice in active learning research, including active learning for spam filtering [16]. This allows us to examine the difference between a learner trained on n examples drawn at random versus a learner trained on n examples selected by an active learning method.

• First-n Sampling. We also test the performance of methods that simply request labels for the first n messages in the message stream. This allows us to compare a learner trained on n actively sampled examples against a learner trained on the first n examples. Note that if we can assume that examples are randomly ordered, this first-n approach is effectively a uniform subsampling, but one that allows the learner's knowledge to be exploited for the maximum possible time.

4.4 Experiments

In this section, we report results from experiments testing the effectiveness of the online active learning methods of Section 4.3, used with learning methods from Chapters 2 and 3, on spam filtering tasks. These results show strong support for the use of online active learning in spam filtering. We used Perceptron with Margins, ROSVMs, and Logistic Regression as filtering methods for our experiments in this chapter. The binary 4-mer feature space was used as described in Chapter 2.

4.4.1 Data Sets

For the first experiments in this chapter, we use the trec05p-1 and trec06p data sets described in Chapter 2 [24, 16], and test Perceptron with Margins, Logistic Regression, and ROSVMs in conjunction with each online active learning method. Initial testing and parameter tuning were performed on a separate tuning data set, the publicly available spamassassin corpus. We also tested Perceptron with Margins and Logistic Regression on the trec07p data set. Results for ROSVMs with the trec07p data set are described separately, in Section 4.4.5, where we describe the inclusion of an online active learning task at the TREC 2007 spam filtering competition.

4.4.2 Classification Performance

We tested each of the active learning methods with each of the base machine learning methods on both trec05p-1 and trec06p. We varied the parameter values of the active learning methods to assess performance with different total numbers of label requests. For b-Sampling, b was varied between 0.001 and 1; for Logistic Margin sampling, γ was varied between 1 and 16; for Fixed Margin sampling, the threshold c was varied from .001 to 2.4; for uniform subsampling, the sampling probability q was varied from .001 to 1; and for first-n sampling, n was varied from 10 to 75,000. Probabilistic tests were repeated five times each, with mean results reported.

The results are given in Figures 4.4 through 4.11, using the (1-ROCA)% measure [24] as the performance measure, with 0.95 confidence intervals shown as vertical bars. Each graph shows the (1-ROCA)% score (on the vertical axis) achieved over the entire online test by an active learner requesting a given number of labels (on the horizontal axis) during that test. Probabilistic methods requested differing numbers of labels for each test; horizontal bars indicate 0.95 confidence intervals for label requests at a given parameter setting.

These results show a clear win for the online active learning methods, compared to both first-n sampling and uniform random subsampling. First-n sampling also performs worse than uniform random subsampling; we conjecture that this is because first-n sampling is more vulnerable to concept drift and new patterns in the data stream. (Note that we did not test first-n sampling with ROSVMs.)

The online active learning methods dominated random subsampling in nearly the full range of label requests tested. Exceptions were the trec07p data set, where the confidence intervals of all methods began to overlap above 10,000 requested labels, and results for very small numbers of requested labels (such as 100 labels), as shown on the far left in Figures 4.5 and 4.6. In all cases, the online active learning methods achieved, using only 1,000 labels, results that the baseline methods required ten times as many labeled examples to achieve. Note also that although the different learning methods achieve different performance levels, with Perceptron with Margins being the weakest of the methods and Online SVMs and Logistic Regression being the strongest, the online active learning methods give the same improvement over uniform random subsampling in all cases.

[Figures 4.4 through 4.11 plot the (1-ROCA)% score against the number of label requests for each sampling method, with 0.95 confidence intervals.]

Figure 4.4: Online Active Learning using Perceptron with Margins, on trec05p-1 data.

Figure 4.5: Online Active Learning using Logistic Regression, on trec05p-1 data.

Figure 4.6: Online Active Learning using ROSVM, on trec05p-1 data.

Figure 4.7: Online Active Learning using Perceptron with Margins, on trec06p data.

Figure 4.8: Online Active Learning using Logistic Regression, on trec06p data.

Figure 4.9: Online Active Learning using ROSVM, on trec06p data.

Figure 4.10: Online Active Learning using Perceptron with Margins, on trec07p data.

Figure 4.11: Online Active Learning using Logistic Regression, on trec07p data.

There are interesting comparisons among the online active learning methods as well. Both Fixed Margin and Logistic Margin sampling tended to out-perform label efficient b-Sampling, often requiring only a fraction of the labels to achieve equivalent performance levels. This is most often true when the total number of labels requested is fewer than 10,000; with higher numbers of label requests, the performance of the different methods tends to converge. An exception to this observation occurred with the trec05p-1 data set, where at very low levels of label requests the randomized b-Sampling and Logistic Margin sampling methods out-performed Fixed Margin sampling.

4.4.3 Comparing Online and Pool-Based Active Learning

How effective are the online active learning methods compared to pool-based active learners? On the surface, it would appear that pool-based active learners would have a significant advantage over the online active learners, as the pool-based methods have more information at their disposal. In particular, pool-based methods are able to compare among many examples in the unlabeled pool, while online active learning methods are restricted to evaluating each example in isolation. In this section, we provide an empirical evaluation of these two approaches.

Experimental Design

As discussed in Section 4.2.1, pool-based active learning methods were tested in the 2006 TREC spam filtering competition [16]. In this scenario, the learners were allowed to select up to n examples for labeling from a pool composed of the first 90% of the trec06p data set for training. Evaluation was then performed on the remaining 10% of this data set, in batch mode. The two best performing methods used pool-based uncertainty sampling with the ijs compression-based method described in Chapter 2, and pool-based uncertainty sampling with the osbf-lua method (a Naive Bayes variant) described more fully in Section 6.3.3.

To provide an apples-to-apples comparison between online and pool-based methods, we tested an online active learning method in a batch setting identical to that employed by the pool-based methods in TREC 2006. That is, we used online active learning with Fixed Margin sampling to train on examples in the first 90% of the trec06p data set, and then applied this trained filter to the remaining 10% of the data set for evaluation.

Results

The results of this test, using Logistic Regression and Fixed Margin sampling, were surprising to us. The results, given in Figure 4.12, show that the online active learning methods actually give superior results when compared to the best pool-based methods, for similar numbers of label requests. How is this possible? The best performing methods from TREC 2006 used variants of pool-based uncertainty sampling [16], which proceeded in rounds requesting n examples for labeling. However, because there is significant redundancy in email spam data, it is possible that in any given round many redundant examples will have similar uncertainty scores. In such cases, pool-based uncertainty sampling will request labels for all of these redundant examples, reducing the benefit of the requested labels.

In contrast, the online active methods update for each new labeled example and benefit from new label information immediately, reducing the tendency to request labels for redundant examples. Pool-based methods could, of course, easily be made resistant to redundant examples by only selecting a single example for labeling in each particular round. However, the pool-based methods would still need to re-assess all of the examples in the unlabeled pool after each update, so this strategy would increase computational cost.

Figure 4.12: Comparing Pool-based and Online Active Learning on trec06p. [Plot of (1-ROCA)% against label requests for pool-based uncertainty sampling with the ijs and osbf-lua filters, and for fixed margin sampling with logistic regression.]

Figure 4.13: Perceptron with Margins, sampling rate over time, trec05p-1. [Plot of sampling rate against examples seen for the uniform, b-Sampling, logistic margin, and fixed margin methods.]

4.4.4 Online Sampling Rates

Aside from examining overall performance levels, it is useful to consider how the sampling rates of the online active learners change over time. In Figure 4.13, the sampling rate for each active learning method is plotted against the number of examples seen, with parameter values for each method set so that each requests roughly 3,000 labels over the entire data set. (Results are shown for Perceptron with Margins on trec05p-1; other results are similar.)

Over time, the number of labels requested by the active learning methods tends to decrease, with the Logistic Margin and Fixed Margin methods requesting labels for less than 1% of the examples by the end of the trial, compared to an overall sampling rate of 3.2% on these tests. The sampling rate of b-Sampling decreases steadily, but more slowly, over time. Naturally, the sampling rate of uniform subsampling remains constant.

The decrease in sampling rate over time by the active learners is due to the fact that the quality of the hypothesis for each learner improves with additional labeled examples. This allows the learner to make more predictions with high confidence over time, reducing the number of label requests.

This observation is of practical value in large-scale email systems. Online active learning methods not only reduce the number of labeled examples needed to make a spam filtering system operable, but will also greatly reduce the number of labels needed to maintain strong classifiers over time.

4.4.5 Online Active Learning at the TREC 2007 Spam Filtering Competition

As previously noted, our suggestion of the use of online active learning was adopted for the 2007 TREC spam filtering competition [18]. In this competition, an additional constraint was added to the problem: each filter was given a label quota of n labels, and allowed to make no more than n total label requests during a given online active learning filtering run.

The best performing methods in this competition all used some form of the fixed-margin sampling that we had earlier proposed [79]. One issue that was highlighted in these experiments was the difficulty of selecting an appropriate parameter for fixed-margin sampling for a given value of the label quota [85]. The best approach to this problem was to tune on prior TREC data sets [19]. However, ideally one would have a closed-form solution that allows a user to fix a sampling parameter in advance with a reasonable expectation of the resulting labeling effort and classification performance. This remains an open problem.

4.5 Conclusions

We have proposed an online active learning framework for spam filtering, and have explored several reasonable approaches to determining when to request labels for new examples. We believe that online active learning is the most appropriate form of active learning for spam filtering. These methods give improved results over uniform subsampling and over prior pool-based active learning methods that select batches of n unlabeled examples for labeling, reducing the number of labels needed to achieve high performance with negligible computational cost. Furthermore, the online active learning approach is well suited to this domain, because spam filtering is an inherently online task.

Not only do online active learning methods reduce the number of email labels needed for strong performance, they also reduce the computational cost of training. Training updates in the online active learning framework occur only when a label request has been made. Because these methods require only a fraction of the total possible labels, training cost is necessarily reduced. This finding is of key importance for large email systems filtering millions or billions of messages each day.

These results have implications not only for the statistical side of spam filtering, but for user interface design as well. Because online active learners require only a few label requests, and the rate of label requests decreases over time, it is possible to envision an email system that asks a user to label perhaps one or two messages per day (see Figure 4.14). Such a system would have strong filtering performance while requiring little feedback from the user. Given the low cost and high performance of this approach, we recommend online active learning methods as an effective general strategy for real-world spam filtering when users are willing and able to provide accurate label feedback for a small fraction of all messages received. In the following chapter, we examine the scenario in which users are unwilling or unable to give feedback for messages predicted to be spam.

Figure 4.14: Screen shot of proposed user interface for active requests for label feedback. In this framework, the user would be encouraged to label a small number of informative messages.

Chapter 5

Online Filtering with One-Sided Feedback

In many real-world filtering settings, online labeling feedback is only available for examples which were predicted to be ham. One-sided feedback can cripple the performance of classical mistake-driven online learners such as Perceptron. Previous theoretical work under the Apple Tasting framework showed how to transform standard online learners into successful learners from one-sided feedback [43]. However, we find in practice that this transformation may request more labels than necessary to achieve strong performance. In this chapter, we modify two online active learning methods to suit the one-sided feedback scenario, and find that both reduce the number of labels requested in practice. One method is the use of Label Efficient active learning. The other method, somewhat surprisingly, is the use of margin-based learners without modification, which we show combines implicit active learning with a greedy strategy for managing the exploration/exploitation tradeoff. Experimental results show that these methods can be significantly more effective in practice than those using the Apple Tasting transformation, even on minority class problems.

Figure 5.1: Spam Filtering with One-Sided Feedback. [Diagram: the learner classifies each message from the stream into the inbox or the spam box; user feedback is received only for messages placed in the inbox.]

5.1 The One-Sided Feedback Scenario

The problem of learning from one-sided feedback was introduced by Helmbold, Littlestone, and Long [43], who described it as the Apple Tasting problem: the problem of learning to identify sweet apples from visual cues. Of course, an apple taster gets no feedback from those apples it rejects, only from those that it actually chooses to taste. This is a variant of the standard online learning framework. In one-sided filtering, the learner only receives feedback when it predicts a ham label for the given example (see Figure 5.1). That is, the only way a learner can see the true label of an example is to predict that it is a member of the ham class. This setting may occur in practice if a user never checks any messages sent to the spam folder in a given filtering system, whether out of ignorance or laziness. It would also occur in systems where predicted spam messages are deleted or removed entirely from the system, in which case the user would never be given the opportunity to give feedback on such messages.

The problem of learning from one-sided feedback defeats several classical online learning algorithms, such as Perceptron [73] and Winnow [59]. These mistake-driven algorithms suffer in this scenario, especially in the presence of noise, as the online updates tend to sacrifice recall for precision and may recognize very few hams. Helmbold, Littlestone, and Long showed how to convert any standard online learner, including these mistake-driven methods, into an apple tasting algorithm by randomly sampling from those examples predicted to be in the spam class [43], with resultant mistake bounds. However, this method samples uniformly from the predicted spams, and thus does not necessarily request labels for the most informative examples.

5.2 Contributions

We propose that practical filtering on one-sided feedback streams is best done with active learning methods. We show that Label Efficient active learners perform well from one-sided feedback, requesting fewer labels than the Apple Tasting methods. We also show, somewhat surprisingly, that margin-based learners such as ROSVMs and Perceptron with Margins both learn effectively from one-sided feedback without modification. In the one-sided feedback scenario, it turns out that margin-based learners implicitly use an active learning strategy and a greedy search solution to the exploration/exploitation tradeoff. Our experiments show that both types of active methods can achieve high levels of performance with many fewer labels than the Apple Tasting solution, and that the margin-based methods are often the most effective.

Figure 5.2: One-Sided Feedback Breaks Perceptron. Here, white dots are ham examples, the black dots are spam, the dashed line is the prediction hyperplane, and the shaded area predicts spam. Examples 1, 2, and 3 each cause no updates: 1 and 3 are correct, and no feedback is given on 2. Examples 4 and 30 are the only examples causing updates, ratcheting the hyperplane until no hams are correctly identified.

The remainder of this chapter proceeds as follows. Section 5.3 gives preliminary background on one-sided feedback and reviews the Apple Tasting transformation with an eye towards possible improvements for practical use. In Section 5.4, we discuss the application of Label Efficient active learners to one-sided feedback problems. In Section 5.5, we show that in many cases margin-based methods can learn effectively from one-sided feedback without transformation, due to implicit uncertainty sampling and a greedy approach to the exploration/exploitation tradeoff. Section 5.6 covers difficulties posed for learning from one-sided feedback in minority-class distributions, in a text classification scenario similar to spam filtering. Experimental results follow, and the final section contains our conclusions for this chapter.

5.3 Preliminaries and Background

In this chapter, we are concerned with the problem of online learning from one-sided feedback, first described as the Apple Tasting problem [43]. We assume a distribution D on a space of examples X = R^d, and each example xi has an associated label yi. There is a learner L with a hypothesis function h(·) : R^d → {−1, 1} predicting the label of a given example. The learner is allowed to update its hypothesis when it is shown an example and label pair (xi, yi). There is an oracle T that returns a (possibly noisy) label yi for a given xi.

Learning proceeds in a (potentially unbounded) number of rounds {t1, ..., tmax}. Given D, L, T, for each round ti:

• An example xi is drawn from D.

• L guesses a label h(xi) for xi.

• If h(xi) = 1, then oracle T returns a (possibly noisy) label y′i and L may update its hypothesis using (xi, y′i).

• However, if h(xi) = −1, then y′i is never revealed to L.

In this chapter, we assume that the cost of requesting a label for an actual spam example is equivalent to the cost of misclassifying a spam as ham, while the cost of requesting a label for an actual ham example is zero. This is equivalent to saying that the only way to request a label for a given example is to predict that it is ham.
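A minimal simulation of this protocol makes the asymmetry concrete. The sketch below is ours and assumes a learner object exposing predict(x) and update(x, y) methods (hypothetical names); labels are revealed to the learner only when it predicts ham (+1).

    def run_one_sided_feedback(stream, learner):
        """Sketch of the one-sided feedback protocol (illustrative only).

        stream : iterable of (x, y) pairs, with y = +1 for ham and y = -1 for spam
        learner: object with predict(x) -> {-1, +1} and update(x, y) methods
        """
        mistakes = 0
        for x, y in stream:
            guess = learner.predict(x)
            if guess != y:
                mistakes += 1          # scoring uses the true label,
            if guess == 1:             # but the label is revealed to the learner
                learner.update(x, y)   # only when the message is predicted ham
            # if guess == -1, the true label y is never shown to the learner
        return mistakes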

5.3.1 Breaking Classical Learners

To illustrate the issues surrounding the one-sided feedback problem, we first show that noisy one-sided feedback can break classical mistake-driven online learners such as Perceptron [73] and Winnow [59]. These learners update their hypotheses only on mistakes. However, in the one-sided feedback scenario they are never told about mistakes made when they predict a spam label, so those mistakes cause no updates. The only mistakes that result in updates are those for which a ham label is predicted for a spam example.

Updating on one-sided errors creates a ratcheting effect shown in Figure 5.2.

Once the hyperplane has been shifted towards the ham side, it can never be shifted back. If the noise rate p > 0, then the hypothesis will converge to one which predicts a spam label for every example. This can cause recall levels for the ham class to suffer greatly, as we show in our experiments. Thus, purely mistake-driven learners are unsuitable for one-sided feedback.

5.3.2 An Apple Tasting Solution

Helmbold et al. proposed a solution to the one-sided feedback problem and analyzed it theoretically using the mistake bound model from learning theory [43]. They showed that if a learner can be forced to make a maximum of either M_p mistakes on ham examples or M_n mistakes on spams from full feedback from a given (noiseless) distribution, then it can be transformed into a learner making at most M_p − M_n + 2√(T · M_n) mistakes from one-sided feedback on that distribution. These mistake conditions can be met for Perceptron or Winnow by setting an initial bias.

Their solution (hereafter, the "Apple Tasting method") relies on occasional random sampling from those examples which are predicted to have spam labels. When an example is sampled, a label request is made to the oracle by flipping the predicted label from −1 to 1. A label request is made on step i when h(x_i) = −1, with probability p = (1 + m_n)/i, where m_n is the number of mistakes found so far among the examples for which labels have been specifically requested [43]. Intuitively, this method samples the learner's error rate to determine how much exploration is needed. As (1 + m_n)/i grows, more labels are requested because the observed error rate is high. When this estimate of the error rate decreases, fewer labels are requested.
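As a small illustration, the sampling rule just described can be written in a few lines of Python; the function below is a sketch under the stated rule, not code from the Apple Tasting paper.

    import random

    def apple_tasting_should_sample(step_i, mistakes_found):
        # Request a label for a predicted spam with probability (1 + m_n) / i,
        # where m_n counts mistakes among previously requested labels.
        p = min(1.0, (1.0 + mistakes_found) / float(step_i))
        return random.random() < p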

Given: β > 0, data set X = (x_1, y_1), ..., (x_n, y_n)
Initialize: w := 0, K := 0
For each x_i ∈ X do:
    Compute f(x_i) = <w, x_i>
    Classify x_i using y'_i = sign(f(x_i))
    Draw Bernoulli random variable Z_i ∈ {0, 1} with parameter b_i / (b_i + |f(x_i)|), where b_i = β√(1 + K)
    If Z_i = 1 then
        Request label y_i
        If y'_i ≠ y_i then
            Update w
            K := K + 1

Figure 5.3: Pseudo-code for the Label Efficient active learner.
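A runnable Python version of Figure 5.3 might look as follows. This is a sketch assuming normalized dense example vectors and labels in {−1, +1}; the oracle object and its label() method are hypothetical.

    import math
    import random
    import numpy as np

    def label_efficient_perceptron(stream, oracle, beta, dim):
        w = np.zeros(dim)
        K = 0                                            # mistakes on requested labels
        for x_i in stream:
            f = float(np.dot(w, x_i))
            y_hat = 1 if f >= 0 else -1                  # y'_i = sign(f(x_i))
            b_i = beta * math.sqrt(1 + K)
            if random.random() < b_i / (b_i + abs(f)):   # Bernoulli draw Z_i
                y_i = oracle.label(x_i)                  # request the true label
                if y_i != y_hat:
                    w += y_i * x_i                       # standard Perceptron update
                    K += 1
        return w

Examples far from the hyperplane (large |f(x_i)|) are sampled with low probability, which gives the uncertainty-driven behavior described in the text.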

5.3.3 Improving on Apple Tasting

While this Apple Tasting transformation offers a robust solution to the problem of learning from one-sided feedback, we note that there are areas of possible improvement for practical use, using ideas from active learning.

Active Learning The Apple Tasting method samples from the spam predictions in a uniform manner that ignores the confidence of the prediction. Although uniform sampling enables theoretical guarantees of correctness for purely separable data [43], it is not always the most efficient way to learn a good hypothesis in practice. Active learning methods attempt to choose informative examples to learn from. Uncertainty sampling is one such method, in which examples are chosen based on how uncertain the current hypothesis is about their label [56]. Other active learning methods include Query by Committee, in which disagreement among possible learners is cause for sampling [36]; choosing examples based on how much they would reduce the current version space [14]; and estimating how much the example would reduce training error if its label were known [74].

We propose that active learning can improve on the Apple Tasting bounds in practice on one-sided feedback problems. The methods we explore in this chapter are based on uncertainty sampling, which is computationally efficient. Label Efficient learners use uncertainty sampling methods to adjust the probability that a label will be requested for a predicted spam. We also show that margin-based learners implicitly use a fixed form of uncertainty sampling to request labels. Exploring other active learning methods in one-sided feedback problems remains for future work.

Exploration/Exploitation Tradeoff The Apple Tasting solution strikes a particular balance between exploration and exploitation, to use terminology from reinforcement learning [90], by requesting more labels when the estimated error rate is high. Exploration of the data space allows the learner to acquire new knowledge and better estimate the optimal hypothesis, but this exploration may incur cost associated with label requests. Exploiting previous knowledge carries no exploration cost, but may incur misclassification cost if the hypothesis is faulty. Determining the optimal balance between exploration and exploitation a priori for an arbitrary task is an open problem, and different approaches may be better for different situations. The Label Efficient learner uses a strategy similar to that of Apple Tasting, attempting to explore more when the observed error rate is high. Margin-based learners implicitly use a greedy exploration strategy, described in Section 5.5, that can request many fewer labels in practice but offers no theoretical guarantees.

Categorizing the Learners The methods we explore in this chapter can be organized as follows. Classical Perceptron, with no active learning and no exploration, fails on one-sided learning. Apple Tasting adds exploration to solve this problem, but without active learning may request more labels than necessary. Label Efficient methods request fewer labels and maintain theoretical guarantees. Margin-based methods use a greedy exploration to further reduce the number of needed labels in many cases, but at the sacrifice of theoretical guarantees. The following sections examine these last two learners in more detail.

Learner         | Active         | Exploration
Perceptron      | no             | none
Apple Tasting   | no             | error-rate driven
Label Efficient | yes            | error-rate driven
Margin-Based    | yes (implicit) | greedy

5.4 Label Efficient Online Learning

An online active learner must decide whether or not to request a label for a given example at that particular time step, without knowledge of the future or the ability to reconsider at a future point. When labels are costly, this creates resource allocation issues. The Label Efficient problem is to learn well with few label requests. Although this problem was posed in the standard online learning setting [42], it has natural application with one-sided feedback, where sampling from the spam predictions carries cost, and sampling from ham predictions is essentially free.

To our knowledge, this is the first use of label efficient learners on one-sided feedback problems.

Cesa-Bianchi et al. proposed a label efficient active learner based on the Perceptron algorithm (see Figure 5.3 for pseudo-code, simplified for the case of normalized example vectors) and gave bounds on the expected number of mistakes and the expected number of label requests for linearly separable data [10]. The method adapts to the number of known mistakes seen so far, and samples more frequently when higher error rates are observed. Furthermore, unlike other label efficient learners that have been proposed, this method does not require the user to specify a maximum number of examples to label, and instead manages the exploration/exploitation problem adaptively, given an initial setting of parameter β.

Finally, this active method takes uncertainty into account and is more likely to sample points that lie close to the classification hyperplane.

This method can be applied in the case of one-sided feedback. Here, requesting a label for a given example forces the learner to predict a ham label for that example. Label requests are made on all ham examples. In terms of the pseudo-code, Z_i = 1 whenever y'_i = 1. The method was originally analyzed in terms of the classical Perceptron algorithm; here we apply it to other linear classifiers as well.
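The one-sided modification described here amounts to changing only the sampling decision of Figure 5.3. A hedged sketch, reusing the notation above:

    import random

    def should_request_label(y_hat, f, b_i):
        # One-sided variant of the sampling rule in Figure 5.3: Z_i = 1 whenever
        # the prediction is ham (that label arrives for free); for a predicted spam,
        # requesting a label means delivering the message to the inbox.
        if y_hat == 1:
            return True
        return random.random() < b_i / (b_i + abs(f))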

5.5 Margin-Based Learners

One claim of this chapter is that margin-based learners can learn effectively from one-sided feedback. In this section, we demonstrate how margin updates enable learning from one-sided feedback, revealing implicit uncertainty sampling and a greedy exploration/exploitation tradeoff strategy. This section concludes with an examination of conditions that can cause this greedy strategy to fail.

5.5.1 Two Margin-Based Learners

Both Perceptron with Margins and ROSVMs are linear classifiers that update their linear hypothesis not only on mistakes, but also on correctly classified examples that lie close to the classification hyperplane, enabling learning from one-sided feedback.

Figure 5.4: Margin-Based Pushes and Pulls. Examples 1, 2, and 3 cause no updates, as before. But Examples 4 and 25, each correctly classified but within the margins, push the hyperplane towards the spam. Example 15, a misclassified spam, pulls the hyperplane towards the ham.

Like ROSVM, the Perceptron with Margins updates its hypothesis both on mistaken predictions and on correctly predicted examples that lie within the margin of the classification hyperplane. Note that classical Perceptron is equivalent to Perceptron with Margins using parameter m = 0, as classical Perceptron only updates on mistakes. This is the critical distinction that allows Perceptron with Margins to learn from one-sided feedback, while classical Perceptron fails.
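The update rule itself is small. The sketch below shows a single Perceptron with Margins step, with the default parameters used later in this chapter (m = 2, learning rate 1) as assumed defaults; setting margin = 0 recovers the classical Perceptron.

    import numpy as np

    def pwm_update(w, x_i, y_i, margin=2.0, eta=1.0):
        # Update on mistakes *and* on correctly classified examples inside the
        # margin; labels y_i are in {-1, +1}.
        if y_i * float(np.dot(w, x_i)) <= margin:
            w = w + eta * y_i * x_i
        return w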

5.5.2 Margin-Based Pushes and Pulls

At first, it may seem counter-intuitive that any learner can learn effectively from one- sided feedback without modification. We now show the intuition driving the finding that margin-based learners can indeed learn in this scenario with no modification.

Recall that classical learners such as Perceptron are subject to ratcheting because they can only recognize one kind of mistake in the one-sided feedback scenario. Margin-based learners are resistant to ratcheting as they can update their classification in both directions. As before, misclassified spams still cause updates moving the hyperplane more towards the ham; correctly classified spams have no effect, and misclassified hams have no effect as no feedback is given. Furthermore, correctly classified hams that lie outside the margins also cause no update to occur.

Figure 5.5: Implicit Uncertainty Sampling for Perceptron with Margins. The margin-based learner with hypothesis h and margins m+ and m−, learning from one-sided feedback, reduces to an active learner with hypothesis h′ and margins m+ and h, using uncertainty sampling in the region between h and h′.

The key difference is this: margin-based learners update their hypothesis on correctly classified examples that lie within the margin (see Figure 5.4). The hyperplane may be pulled towards the ham region by misclassified spam and pushed towards the spam region by ham examples classified within the margins. These hypothesis updates are not irreversible, and the hyperplane can converge to a good hypothesis.

5.5.3 Margins, One-Sided Feedback, and Active Learning

Here, we demonstrate the claim that margin-based methods that are applied to one-sided feedback problems implicitly use active learning to sample from spam predictions. We offer a reduction showing that Perceptron with Margins with margin parameter m using one-sided feedback reduces exactly to Perceptron with Margins with margin m/2 using fixed-margin active learning on predicted spams. (An analogous reduction for ROSVMs is possible, but results in only an approximate reduction, as the two learners would have different values of the regularization parameter C.)

Reduction A Perceptron with Margins learner L with margin m learning from one-sided feedback reduces to an active margin-based learner L′ with margin m/2. The sampling rule for this active learner is to perform uncertainty sampling on any example x_i for which the prediction f′(x_i) ≥ −m/4.

Assume the learner has classification hyperplane h defined by its weight vector w, with margin planes m+ on the ham side and m− on the spam side, and that the distance between each margin and the hyperplane is m/2 (see Figure 5.6). Now consider the hyperplane h′, which lies halfway between h and m+. We can view h′ as a classification hyperplane for learner L′, with margins m+ on the ham side and h on the spam side, each at a distance of m/4 from h′. Because h′ is translated distance m/4 from h, L′ will score each x_i with f′(x_i) = f(x_i) + m/4.

Active learner L′ requests the label y_i for any example x_i found to lie between h and h′ – that is, for any example such that f′(x_i) ≥ −m/4. This label y_i is always available, because x_i lies to the ham side of h and one-sided feedback will be provided to L. L′ requests labels for its predicted spams that lie close to its classification hyperplane: a simple form of uncertainty sampling [77]. Furthermore, L′ requests labels for all examples that lie to the ham side of h′, and such labels are also always available to L under the one-sided feedback scenario. Thus, L′ performs uncertainty sampling on any x_i such that f′(x_i) ≥ −m/4.

As a Perceptron with Margins learner, L′ updates on mistakes or examples

Figure 5.6: Exploration and Exploitation. If the initial hypothesis is h, then examples 1 and 2 cause margin updates pushing h_e out towards m−, but not beyond it unless an example is found to lie between h and h_e.

found within the margins of h′, computing a new hypothesis h′′. L then adopts a new hypothesis from L′ as follows: the new hypothesis for L will be h = h′′ − m/4. Thus, L reduces to L′.

5.5.4 Exploring and Exploiting

One of the primary problems in online active learning is resource allocation [42], often referred to as the exploration/exploitation tradeoff [90]. It is difficult to determine a priori the best balance between sampling (which may incur labeling cost) and prediction without sampling (which may incur misclassification cost) for arbitrary distributions [90]. The Apple Tasting [43] and Label Efficient [10] methods both attempt to strike a balance by estimating the error rate. Determining the error rate incurs sampling cost, and the upper bounds computed for these methods may not be tight in practice. For example, the current hypothesis may be good, but many labels may need to be requested before the error rate confirms this. Margin-based methods use a greedy approach to balancing this tradeoff that can require many fewer labels in practice. However, performance for this greedy method cannot be guaranteed, due to the possibility of malicious or pathological distributions.

The greedy strategy is the following. When h is a hyperplane consistent with all seen labeled data, the learner L only requests labels for examples that it predicts to be ham. This is a conservative strategy emphasizing exploitation, and incurring zero labeling cost. Note that at this time, examples on the spam side of m− are strongly believed to be spams, and those between h and m− are suspected to be spam.

When a new example x_i is found to lie within the margins, between h and m+, the learner is willing to explore, and the hyperplane is shifted through margin updates to create a new hyperplane h_e, with margins m_e+ and m_e− (see Figure 5.6). Assuming a moderate learning rate (that is, η ≤ m/2 for Perceptron with Margins, or non-extreme values of C for ROSVMs), h_e will lie somewhere between h and m−, causing the learner to sample from the suspected spams (from the perspective of h), but never from examples strongly believed to be spams. Each new ham example between h and the new m_e+ will push h_e closer to m−. However, each such update will shrink the gap between h and m_e+, until h = m_e+. At this point, h_e can be located no further toward spam than m−.

Note that h is still consistent with all the seen, labeled data at this point (although h is no longer maximum margin). The only thing that will cause an update now is misclassifying a spam, or finding a ham between h and h_e. Either of these cases would show that the original h is no longer consistent with the seen data, and L must recompute a new h and start again with the conservative strategy.

Figure 5.7: Pathological Distributions for One-Sided Feedback. (Panel labels: Gappy, Striped.)

Thus, unlike the Apple Tasting or Label Efficient strategies, margin-based learners do not sample from all predicted spams with non-zero probability. Sampling is performed on only those predicted spams that are close to h, and only when there have been sufficiently many hams found between h and m+ to encourage further exploration.

5.5.5 Pathological Distributions

Before moving on in this discussion, some caveats are in order, as this greedy exploration strategy can be stalled or defeated by certain pathological distributions. Linearly separable distributions that include large gaps may cause margin-based methods to cease making progress. This will occur whenever the probability of an example landing in the space between h and h_e is zero. For many interesting distributions, this only occurs in the margin between the two classes. However, it is possible to have such a gap within a single class (see Figure 5.7), which will have the same effect when the size of the gap is greater than m/2. Gappy distributions may be dealt with by increasing the margin size. Note that this provides a second intuition for the failure of classical Perceptron: when m = 0, every distribution is a gappy distribution.

In some cases, a distribution may not be linearly separable in the feature space, but we may still wish to find a hypothesis that minimizes loss, as with the soft margin SVM. In many of these cases, the margin-based methods will be successful, as the greedy exploration will continue in the limit so long as the expected loss per example from examples lying between h and m+ is less than some set threshold. However, there is the possibility of striped distributions (see Figure 5.7) that can cause the greedy exploration to fail. A stripe is a region of at least width m/2 where the expected cost of applying the surrounding area's class label to examples from that region is greater than the cost of applying the opposing label. As with gaps, stripes may be dealt with by increasing the margin size in some cases, by adjusting misclassification costs, or by finding a transformation of the feature space that renders the data linearly separable (removing the stripes).

Another possible failure of margin-based learners can be caused by malicious orderings of the examples. In a noisy domain, it is possible for an adversary to select a sequence of incorrectly labeled examples that will cause ratcheting by one-sided learners. Such malicious orderings may be possible in certain one-sided feedback applications such as optimal placement of banner advertisements.

Finally, when the learning rates are set too high, the learner may overshoot the ham class after it has misclassified a spam (or a series of spams). If the resulting hyperplane is placed beyond the ham class, classifying everything as a spam, then the learner will be ratcheted and no further learning will take place.

5.6 Minority Class Problems

Learning from one-sided feedback is particularly challenging when the ham class is a minority in the distribution. This is a difficult problem for active learning in general. It has been shown that for non-homogeneous distributions, such as minority-class distributions, a linear classifier such as Perceptron using active learning can require as many labels to achieve a given error rate as the same method without active learning [28]. That is, active learning may give no benefit in these situations. This is unfortunate, as minority class problems are common in practical data mining.

Thus, active learning solutions to the one-sided feedback problem necessarily suffer in the case of minority class distributions. Furthermore, Apple Tasting and Label Efficient methods that attempt to sample the error rate may have difficulty, because the measured error may be low even when the entire minority class is misclassified. Margin-based methods may have similar difficulty, where a long sequence of observed spams may cause ratcheting in the hyperplane. These issues may be ameliorated by assigning different misclassification costs to the two classes. We explored this issue in depth for general text classification problems in our paper on this topic [80], but omit further reports of this work here as it is not directly related to spam filtering.

5.7 Experiments

In this section, we report experimental results for spam filtering in the one-sided feedback scenario.

We construct an online learning task with one-sided feedback as follows. On each round, a learner is shown a message and asked to predict its label from {−1, 1}, for spam and ham respectively. When a ham label is predicted, the true label is revealed to the learner and it may update its hypothesis. If the learner wishes to sample the label for a message it predicts to be spam, it directs that message into the inbox by predicting a ham label. Thus, label requests have the same cost as false hams.

We map email messages to feature vectors using the methodology from Chapter 2, using normalized binary 4-mer feature vectors drawn from the first 3000 characters of each email.
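For concreteness, a sketch of this feature mapping in Python is given below; it assumes the message is already a single string and returns a sparse dictionary of normalized binary 4-mer features.

    import math

    def binary_kmer_features(message, k=4, max_chars=3000):
        text = message[:max_chars]                    # message truncation
        kmers = {text[i:i + k] for i in range(max(0, len(text) - k + 1))}
        norm = math.sqrt(len(kmers)) if kmers else 1.0
        return {kmer: 1.0 / norm for kmer in kmers}   # binary features, Euclidean-normalized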

We tested three methods: ROSVMs, Perceptron with Margins, and classical Perceptron (which is equivalent to Perceptron with Margins where m = 0).¹ For ROSVMs, we set parameters C = 100 and buffer size 1000, updating on all margin errors. For Perceptron with Margins, we used m = 2 as a default parameter, and learning rate 1. For classical Perceptron, we used the same learning rate and m = 0.

We tested these methods on one-sided feedback in three ways: in their unmodified version, with the addition of Label Efficient sampling, and with the addition of Apple Tasting sampling. The results reported for the Label Efficient methods were for optimal values of the parameter β > 0, found in coarse-grained trials on spamassassin tuning data. Both Apple Tasting and Label Efficient methods are randomized algorithms; we report average results over ten trials with each. We tested other sampling strategies, including threshold biasing, ε-greedy, and the Softmax variant, but do not report these results as they were not competitive on these data sets. Finally, we report results on these data sets using full feedback for comparison.

¹Note that at the time this work was performed, the strength of Logistic Regression as an online learner for spam filtering had not yet been fully appreciated. Applying Logistic Regression in the one-sided feedback scenario remains an area for interesting future work.

Data Sets We use the two largest publicly available labeled data sets (at the time this work was conducted) of spam and ham email, which are the trec05p-1 data set of 92,189 messages with a 43% ham rate [24] and the trec06p data set of 37,822 messages with a 34% ham rate [16], described in Chapter 2. Both of these benchmark corpora have a canonical ordering for online learning, which we use for repeatability.

In preliminary tests and where parameter tuning was needed, we used a separate corpus, the smaller publicly available spamassassin corpus of 6032 examples.

Evaluation Metrics Evaluating the performance of spam filtering methods is typically done by measuring the area under the ROC curve [24], which accounts for potentially uneven misclassification costs by assuming the ability to freely vary the classification threshold. In the one-sided feedback scenario, threshold modification after the fact is problematic, as the predicted class has implications on what feedback was available during learning. Thus, we evaluate performance using precision, recall, and the F-measure from the classifier’s actual threshold. However, as a sanity check, we did calculate the ROC curve areas for each of the results, and these results agreed with the trends reported in this section.

Precision (P), recall (R), and the F_α measure are defined in terms of true hams (ham in the inbox), false hams (spam in the inbox), true spams (spam in the spam box), and false spams (ham in the spam box). When a learner makes a label request, it is counted as a false ham when the example is a spam. Label requests for ham examples are counted as true hams. The measures are computed as follows:

P = TP / (TP + FP)        R = TP / (TP + FN)        F_α = (1 + α)(P · R) / (αP + R)

The F_α measure gives a single-number summary of classifier performance, where the parameter α determines how much weight to assign to precision and recall. We report the F1 measure results as a conservative view; we computed scores for F2 through F4 and found that they emphasized our reported trends.
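The computation is straightforward; the following sketch computes the three measures with ham treated as the positive class, as in the definitions above.

    def precision_recall_f(tp, fp, fn, alpha=1.0):
        # tp = true hams (ham in the inbox), fp = false hams (spam in the inbox),
        # fn = false spams (ham in the spam box).
        p = tp / float(tp + fp)
        r = tp / float(tp + fn)
        f = (1 + alpha) * (p * r) / (alpha * p + r)
        return p, r, f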

Results The results for experiments on both data sets are reported in Table 5.1. In general, they show a clear win for the margin-based methods, which had the highest F1 scores and the lowest false ham rates, which is equivalent to making the fewest label requests. A spam filter using the unmodified ROSVM or Perceptron with Margins would place roughly half as much spam in the user's inbox as the Apple Tasting methods, while making roughly the same number of misclassifications on ham (or fewer). These results were not far from those achieved with full feedback.

As expected, the classical Perceptron algorithm was defeated by the one-sided feedback scenario. Of the three methods of fixing classical Perceptron, the addition of margins (turning classical Perceptron into Perceptron with Margins) was more effective than either the Label Efficient or Apple Tasting strategies with classical Perceptron on all one-sided feedback tests.

Finally, the Label Efficient method outperformed the Apple Tasting method on all trials. Recall that the Label Efficient method also makes implicit use of uncertainty sampling, while Apple Tasting relies on uniform sampling of the spam predictions. This supports the claim that active learning is a good strategy for one-sided learning.

5.8 Conclusions

The goal of this chapter was to show that active learning improves performance from one-sided feedback. We have shown that the Label Efficient active learner can be applied to one-sided learning with good results. We have also shown that

Table 5.1: Results for Email Spam filtering. We report F1 score, Recall, Precision, number of False Spams (lost ham), and number of False Hams (spam in inbox) for filtering with one-sided feedback. We report results with full feedback for comparison.

trec05p-1                        F1      Rec.    Prec.    # FN    # FP
ROSVM         Margins           0.993   0.996   0.990      141     396
              Label Efficient   0.991   0.997   0.985      135     609
              With AppleTaste   0.989   0.996   0.981      145     757
              Full Fb.          0.996   0.997   0.995      121     200
Pcptrn. Mgn.  Margins           0.990   0.998   0.983       68     697
              Label Efficient   0.989   0.998   0.980       68     785
              With AppleTaste   0.984   0.998   0.970       91    1211
              Full Fb.          0.994   0.996   0.992      153     318
Perceptron    No Margins        0.516   0.348   0.999    25686       2
              Label Efficient   0.960   0.964   0.956     1430    1748
              With AppleTaste   0.930   0.920   0.940     3143    2330
              Full Fb.          0.991   0.991   0.991      343     343

trec06p                          F1      Rec.    Prec.    # FN    # FP
ROSVM         Margins           0.988   0.996   0.981       51     253
              Label Efficient   0.984   0.995   0.973       67     362
              With AppleTaste   0.976   0.994   0.958       75     557
              Full Fb.          0.993   0.994   0.992       81     100
Pcptrn. Mgn.  Margins           0.984   0.998   0.970       32     397
              Label Efficient   0.983   0.997   0.968       35     423
              With AppleTaste   0.975   0.997   0.953       44     627
              Full Fb.          0.993   0.995   0.991       69     111
Perceptron    No Margins        0.005   0.003   0.945    12875       2
              Label Efficient   0.938   0.943   0.933      738     876
              With AppleTaste   0.891   0.886   0.897     1475    1309
              Full Fb.          0.986   0.987   0.986      174     180

margin-based learners are active learners in the one-sided feedback scenario, and can exceed the performance of both the Apple Tasting and Label Efficient methods under many circumstances. One interesting ancillary contribution is the notion that margin-based learners may be used in conjunction with Apple Tasting or Label

Efficient methods for additional robustness to pathological cases.

The Label Efficient method is well designed for active learning in an online setting, and adapts naturally to the problem of one-sided feedback. While margin-based methods are not generally considered to be active learners, we have shown that they do perform implicit active learning using uncertainty sampling in the one-sided feedback scenario. The margin-based methods gave the best performance on spam data, approaching that of the full-feedback scenario. This performance was possible because the data contain a relatively even class distribution with little noise.

In addition to the problem of spam filtering, there are numerous practical machine learning applications that suffer from the problem of one-sided feedback.

Future applications involving one-sided feedback may range from geo-statistical data mining to discover ore and oil deposits to personalized online news agents that learn effectively from one-sided feedback about the relevance of shown articles.

Chapter 6

Online Filtering with Noisy Feedback

In preceding chapters of this dissertation, we have examined the online filtering scenario both with idealized assumptions and with various assumptions of incomplete user label feedback. In each of these cases, we have found that near-perfect filtering results are achievable despite incomplete user feedback, as long as the given feedback is perfectly accurate. However, this assumption of accuracy is optimistic. Real users give feedback that is often mistaken, inconsistent, or even maliciously inaccurate. To our knowledge, the impact of this noisy labeling feedback on current spam

filtering methods has not been previously explored in the literature. In this chapter, we show that noisy feedback may harm or even break state-of-the-art spam filters, including recent TREC winners. We then describe and evaluate several approaches to make such filters robust to label noise. We find that although such modifications are effective for uniform random label noise, more realistic “natural” label noise from human users remains a difficult challenge. This chapter is based on our work

131 with Gordon Cormack on filtering with noisy label feedback [83].

6.1 Noise in the Labels

The promising results from previous chapters were attained in a laboratory setting where gold-standard feedback was given to the filters for training. That is, the labels of spam and ham assigned to each message were carefully vetted, and such labeling was done in a consistent and accurate fashion [23]. In real-world systems involving users, it is unlikely that these (possibly anonymous) humans will consistently give label feedback of gold-standard quality. Instead, real users give noisy feedback, and the labels used for training real-world filters may contain errors which can reduce classification performance.

6.1.1 Causes of Noise

Label noise may come from a variety of causes. Several different insiders in industrial anti-spam settings have reported that at least 3% of all user feedback is simply mistaken. That is, these labels are objectively wrong, such as email lottery scams being incorrectly reported as ham. (This 3% user mistake rate is also reported in Yin et al., 2006 [96].) Sources of labeling mistakes include misunderstanding the feedback mechanisms, accidental clicks, and even users who are actually fooled by such scams.

Inconsistent labels arise when similar messages are perceived differently among users. A common example of this is gray mail messages (such as email newsletters) that some users value and others prefer to block [97]. In a recent paper studying gray mail, it was found that one sample of 418 messages¹ contained 163 gray mail messages – nearly 40% [97]. An additional data point comes from John Graham-Cumming's spamorham.org project,² in which human users across the internet were invited to manually label messages in the trec05p-1 data set. In this project, individual human labelers disagreed with the gold standard labels 10.9% of the time, in large part due to inconsistency or human error [41].

¹This was apparently randomly sampled from a large corpus of representative spam emails.

Finally, maliciously inaccurate feedback is an issue in large free email systems.

For example, a spammer may acquire many free accounts, send large amounts of spam to these accounts, and then report such messages as being ham. Industry insiders at several different large free email systems have confirmed that this is a common tactic by spammers. Indeed, at least one such system chooses to completely ignore all ham labels given by users.

Thus, the possibility of label noise is an important real-world consideration for spam filtering. Spam filters based on machine learning techniques must be robust to a variety of label noise levels, and not be narrowly optimized to perform well only in the noise-free case.

6.1.2 Contributions

This chapter makes three main contributions. First, it is shown that uniformly random noise in label feedback significantly harms the performance of state-of-the-art spam filters. Second, several modifications are proposed for making filters robust to label noise, including making less aggressive updates, label cleaning, label correcting, and various forms of regularization. It is shown that the best of these methods make filters significantly more robust to uniform label noise at a small cost in classification performance in the noiseless case. Third, it is found that natural noise from real users is more challenging than uniform noise.

²Sadly, this project ended in early 2007 and the spamorham.org domain is now apparently controlled by web spammers.

6.2 Related Work

There is relatively little published work on the impact of label noise in spam filtering.

However, the problem of noisy class labels and avoiding overfitting is well studied in general machine learning literature.

6.2.1 Label Noise in Email Spam

To our knowledge, ours is the first published work to explicitly explore the problem of label noise in email spam filtering [83]. However, label noise has been considered as an issue by industry experts in spam filtering, and has been given as a reason to prefer filtering methods that do not rely on user feedback. The published acknowledgment of 3% labeling errors in data gathered by Hotmail implies that this problem has been previously studied internally [96]. Additionally, Yin et al. studied gray mail detection as a subset of this general label noise problem [97]. They found that adding a gray mail detector as the first stage of a two-stage filtering process reduced false negatives by between two and six percent.

6.2.2 Avoiding Overfitting

In machine learning, it has long been known that overfitting is a potential problem when applying optimization methods to noisy training data [65, 78]. For iterative methods such as gradient descent, early stopping is a common and effective practice for reducing overfitting [65]. In the online filtering scenario for streaming data, the effect of early stopping may be approximated by selecting a conservative learning rate. Another common approach is regularization, in which a penalty for model complexity is added to the optimization problem (see, for example, Scholkopf & Smola [78]).

The problem of label noise in data has also been well studied. Zhu and Wu observed that class label noise can be even more harmful in training classifiers than feature noise [99].³ Several methods have been proposed for cleaning data containing class label noise by discarding training examples that are suspected to have incorrect labels (see Zhu and Wu [99] for an overview). A more aggressive approach is label correcting, in which instances suspected to be mislabeled are automatically re-labeled [98]. Rebbapragada and Brodley showed that correcting and cleaning can be considered within the same unifying framework, but also note that label correcting is a more difficult task that may introduce additional errors into the data [71].

6.3 Label Noise Hurts Aggressive Filters

We first wished to investigate how well top-performing filters from recent TREC evaluations fared with respect to label noise. This section describes our basic experimental design, the filters under consideration, and results from this first exploration.

6.3.1 Evaluation

We use a noisy variant of the online filtering scenario as our experimental framework, in which messages are presented to the filter one at a time. For each message, the filter is asked to give a prediction score of ham or spam. After the prediction is made, a (possibly noisy) label is given to the filter, and this information may be used for a training update. We evaluate performance using the (1-ROCA)% measure described in Chapter 2.

³In the spam setting, feature noise is typified by the good word attack [61].

Note that although the labels given to the filter during online filtering may be noisy, the (1-ROCA)% evaluation score is computed with respect to the gold-standard labels supplied with the original data set [23]. Thus, our goal is to assess the impact of noisy training labels on the filters' ability to predict "true" gold-standard labels. In an ideal world, a small amount of label noise would do only minimal harm to the filters' classification performance.

6.3.2 Data Sets with Synthetic Noise

For these initial tests, we built noisy data sets from two large, publicly available benchmark data sets for spam filter evaluation: the trec06p data set of 37,822 emails [16] and the trec07p data set of 75,419 emails [18], both of which were originally constructed for the TREC spam filtering competitions. (All of the noisy data sets created for this work are publicly available for research purposes; contact the author.)

For this initial evaluation, we chose to investigate the effect of uniform, random label noise. (Natural label noise from actual human feedback is tested in Section 6.5.2.) Synthetic noise was added to each data set as follows: for each message, the label of that message was flipped with uniform probability p.

We created one test set for each data set and each of seven noise levels, with p ∈ {0, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25}. Note that when p = 0, the test set is identical to the original, unaltered TREC data set.
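A sketch of this noise-injection step, assuming gold-standard labels in {−1, +1}:

    import random

    def add_uniform_label_noise(labels, p, seed=0):
        # Flip each gold-standard label independently with uniform probability p.
        rng = random.Random(seed)
        return [-y if rng.random() < p else y for y in labels]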

6.3.3 Filters

We tested a range of current statistical spam filtering methods, as follows. Where noted, parameters were set by tuning on the separate spamassassin data set as described in Chapter 2. Other filters were tested with parameters set as given by default, or as given in the reference describing that filter. Unless noted otherwise, the filters tested used a feature space of binary 4-mers drawn from the first 3000 characters of each message, and all feature vectors were normalized with the Euclidean norm, as described in Chapter 2.

• Multi-Nomial Naive Bayes (MN NB). Metsis et al. tested a number of Naive Bayes variants, and found Multi-Nomial Naive Bayes with binary feature values to be one of the top performing methods [63]. We apply this variant here.

• Logistic Regression. As described in Chapter 2, this machine learning method gave best results on several tasks at TREC 2007 when coupled with binary 4-mers [18, 19]. Online logistic regression has a parameter η that controls learning rate, determining how aggressively to update the model on each new example [66]. We set η = 0.1 after tuning on spamassassin data.

• Perceptron with Margins (PwM). This noise-tolerant algorithm [48] was the base learner in an approach that gave strong results at TREC 2006 [86]. After tuning, we set the fixed margin m = 8 and learning rate η = 0.5.

• Relaxed Online Support Vector Machine (ROSVM). In previous applications of SVM variants for spam filtering, it was found that high values of the cost parameter C, encouraging little regularization, gave best results for spam filtering tasks. This result agreed with a prior finding by Drucker et al. (1999), which also found that SVMs gain best performance on spam data with high values of C. Using C = 100, the ROSVM method gave best results on several tasks at TREC 2007 [18].

• Dynamic Markov Chain Compression (DMC). Perhaps the best performing of the compression-based spam filters [6], DMC was tested at TREC 2007 as wat2 with strong results [19].

• BogoFilter. BogoFilter has been the best performing open-source spam filter at TREC for several years [16, 18], and employs a fast variant of the Naive Bayes classifier.

• WAT1. This is the same filter submitted as wat1 at TREC 2007 [19], which gave best results on several tasks. This filter employs logistic regression and binary 4-mers, but does not normalize the feature vectors.

• OSBF-Lua. This is the same filter [3] that gave best overall performance at TREC 2006 [16]. OSBF-Lua uses an aggressive variant of the Naive Bayes classifier, in which the filter continues to re-train on the header of a given email message so long as the filter scores that message near the classification boundary, or until other stopping conditions such as maximum number of iterations are met [3]. We refer to this update strategy as train until no error.

6.3.4 Initial Results

The results of this first experiment reveal a disturbing trend, as shown in Tables

6.1 and 6.2. The methods giving best results without label noise give worst results with moderate to large amounts of label noise. This is true for Logistic Regression,

ROSVMs, DMC, BogoFilter, and even Perceptron with Margins, each of which has given strong performance in the TREC spam filtering competitions. In contrast, the

Multi-Nomial Naive Bayes method gives relatively modest results without noise, but is much more robust to increasing levels of label noise.

138 Table 6.1: Results for prior methods on trec06p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result for a given noise level, or confidence interval overlapping with confidence interval of best result.

trec06p, (1-ROCA)% with 0.95 confidence intervals; noise levels 0, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25:

MN NB:     0.477 (0.425-0.535) | 0.513 (0.460-0.571) | 0.517 (0.459-0.582) | 0.517 (0.459-0.583) | 0.624 (0.557-0.698) | 0.665 (0.594-0.744) | 0.685 (0.620-0.758)
LogReg:    0.032 (0.025-0.041) | 0.035 (0.027-0.046) | 0.118 (0.099-0.140) | 0.615 (0.558-0.677) | 2.107 (1.985-2.236) | 4.914 (4.732-5.102) | 9.077 (8.774-9.390)
PwM:       0.049 (0.034-0.070) | 0.069 (0.050-0.094) | 0.181 (0.149-0.221) | 0.577 (0.526-0.632) | 1.517 (1.424-1.615) | 3.328 (3.173-3.491) | 6.666 (6.423-6.918)
ROSVM:     0.031 (0.021-0.044) | 0.328 (0.288-0.373) | 2.430 (2.305-2.561) | 6.532 (6.297-6.775) | 11.512 (11.182-11.850) | 16.852 (16.449-17.263) | 21.680 (21.262-22.104)
DMC:       0.031 (0.024-0.041) | 0.053 (0.040-0.070) | 0.183 (0.150-0.222) | 0.619 (0.542-0.706) | 1.430 (1.308-1.564) | 3.044 (2.869-3.230) | 5.208 (4.986-5.439)
Bogo:      0.087 (0.066-0.114) | 0.096 (0.071-0.130) | 0.277 (0.231-0.332) | 1.203 (1.110-1.304) | 3.168 (2.999-3.346) | 7.336 (7.066-7.616) | 11.478 (11.148-11.818)
WAT1:      0.036 (0.027-0.049) | 0.075 (0.058-0.096) | 0.389 (0.347-0.435) | 1.839 (1.723-1.963) | 4.548 (4.360-4.743) | 8.358 (8.073-8.651) | 13.112 (12.755-13.478)
OSBF-Lua:  0.054 (0.034-0.085) | 0.075 (0.053-0.107) | 0.316 (0.155-0.644) | 29.575 (28.622-30.546) | 35.011 (34.371-35.657) | 38.486 (37.855-39.122) | 39.699 (39.046-40.356)

Table 6.2: Results for prior methods on trec07p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result for a given noise level, or confidence interval overlapping with confidence interval of best result. Methods unable to complete a given task are marked with dnf.

trec07p, (1-ROCA)% with 0.95 confidence intervals; noise levels 0, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25:

MN NB:     0.168 (0.151-0.185) | 0.181 (0.163-0.203) | 0.237 (0.216-0.259) | 0.249 (0.227-0.273) | 0.297 (0.273-0.324) | 0.298 (0.272-0.326) | 0.342 (0.309-0.377)
LogReg:    0.005 (0.002-0.017) | 0.006 (0.004-0.009) | 0.071 (0.061-0.084) | 0.550 (0.512-0.590) | 2.266 (2.184-2.351) | 5.192 (5.045-5.343) | 9.501 (9.271-9.737)
ROSVM:     0.010 (0.003-0.030) | 0.031 (0.020-0.048) | 0.473 (0.436-0.512) | 2.246 (2.157-2.339) | 5.604 (5.410-5.804) | 9.531 (9.260-9.809) | 14.714 (14.215-15.228)
DMC:       0.006 (0.003-0.016) | 0.021 (0.014-0.031) | 0.103 (0.084-0.126) | 0.242 (0.209-0.280) | 0.594 (0.533-0.661) | 1.208 (1.123-1.300) | 2.484 (2.354-2.621)
Bogo:      0.027 (0.017-0.043) | 0.033 (0.023-0.049) | 0.097 (0.077-0.122) | 0.264 (0.231-0.302) | 0.504 (0.454-0.559) | 3.221 (3.083-3.365) | 10.294 (10.041-10.551)
WAT1:      0.006 (0.002-0.015) | 0.019 (0.014-0.026) | 0.428 (0.396-0.462) | 1.984 (1.904-2.068) | 5.221 (5.043-5.405) | 9.226 (9.000-9.457) | 14.116 (13.851-14.385)
OSBF-Lua:  0.029 (0.015-0.059) | 0.054 (0.016-0.184) | 0.290 (0.097-0.859) | 29.478 (27.591-31.432) | dnf | dnf | dnf

What causes the steep degradation in performance with the state-of-the-art filters? Each of these methods is tuned to perform aggressive online updates, necessary to attain competitive results in the noise-free TREC evaluations. For example, the Logistic Regression method is tuned with an aggressive learning rate η and uses no regularization. The ROSVM method is tuned with the cost parameter

C set to a high value, discouraging regularization. Such settings allow the filters to quickly adapt to new spam attacks when user feedback contains no noise, but make these filters subject to overfitting when label noise is present.

As an extreme case, OSBF-Lua, the top performer from TREC 2006, was actually broken by the noisy data, eventually giving results of nan on messages in all noisy data sets with p > 0. The train-until-no-error approach severely overfit mislabeled instances, resulting in a useless model.

These are troubling initial results. Together, they call into question the real-world utility of results from evaluations assuming noiseless user feedback. Are the strong performance levels from TREC and similar evaluations only achievable in laboratory settings, with users willing and able to give perfectly accurate feedback?

The remainder of this chapter investigates this question.

6.4 Filtering without Overfitting

In this section, we suggest several strategies for making learning-based filtering methods more robust to noise in feedback. These strategies include tuning parameters to prevent overly aggressive updates, various forms of regularization for logistic regression and SVM variants, and methods that attempt to automatically clean or even correct labels given for training.

For preliminary experiments and tuning runs in this section, we created noisy versions of the spamassassin data set, adding uniform synthetic label noise at different levels as described in Section 6.3.2.

Figure 6.1: Results for varying learning rate η for Logistic Regression, on spamassassin tuning data with varying levels of synthetic uniform label noise. For clarity, the order of results is consistent between legend and figure.

6.4.1 Tuning Learning Rates

As discussed above, both Logistic Regression and Perceptron with Margins utilize a learning rate parameter η that controls the size of the step taken on any given update during online gradient descent. Lower values of η lead to less aggressive updates, giving an online approximation of the early stopping strategy that gives good results when gradient descent is applied in batch mode [65].

Figure 6.1 shows the effect of varying η for Logistic Regression at different levels of label noise on the noisy spamassassin data sets. (Similar effects are seen with Perceptron with Margins.) Note that when there is little or no label noise, high values of η give best results, but when label noise becomes more prevalent, lower η values (centering on η = 0.02) improve results.

For our final experiments in the next section, we set η = 0.02 for Logistic Regression and η = 0.02 with margin m = 2 for Perceptron with Margins, as these values give best results at noise level p = 0.25 and near-best results for other noise values at or above p = 0.1 on the noisy spamassassin tuning data.

6.4.2 Regularization

Another general strategy for reducing overfitting is regularization, requiring that the learned model not only describes the training data well, but also has low complexity

(see, for example, Scholkopf and Smola [78]). One measure of model complexity is the L2-norm (or Euclidean norm) of the weight vector. Thus, L2 regularization seeks to ensure that the Euclidean norm of the weight vector is as small as possible, while still describing the training data well. These goals of fitting the training data and reducing model complexity are often in conflict, and the balance between these goals is controlled by a parameter.

Regularization with SVM variants

As described in Chapter 3, the classic soft-margin SVM optimization problem is to minimize:

||w||² + C Σ_{i=1..m} ξ_i

Figure 6.2: Results for varying C in ROSVM for regularization, on spamassassin tuning data with varying levels of synthetically added label noise. For clarity, the order of results is consistent between legend and figure.

Here, w is the weight vector storing the model, and each ξ_i is a slack term describing the amount of error associated with a particular training example x_i [78]. Thus, the optimization problem seeks to minimize both model complexity (the L2-norm of w) and training error (given by the sum of the slack terms), and the cost parameter C controls how much emphasis to place on each of these tasks in training.

A high value of C focuses on reducing training error by enforcing little regularization, resulting in the possibility of overfitting. Both Sculley and Wachman [84] and Drucker [34] found that high values of C gave best performance on spam data for ROSVMs and SVMs, respectively, but these results were gained with no label noise in the data. As shown in Figure 6.2, lower values of C give much improved performance in the presence of noise. We set C = 0.5 for our final experiments, as this gives best results for noise level p = 0.25.

Regularization for Logistic Regression

Logistic Regression is often considered to be especially prone to overfitting in the absence of regularization [66], as the base optimization problem allows weights to be driven to arbitrarily large values with noiseless, linearly separable data.

For the online gradient descent algorithm commonly used for online logistic regression, L2 regularization is achieved with a modified update rule [66]:

w ← w + η(y_i − f(x_i)) x_i − ηλw

As before, w is the weight vector, x_i is an individual training example with label y_i ∈ {0, 1}, and the learning rate is given by η. The prediction function f(x_i) returns a value between 0 and 1 indicating the predicted "spamminess" of x_i. Regularization is controlled by the parameter λ, where larger values of λ enforce more regularization and reduce overfitting.

for each new example x_i:
    use the filter to make a prediction, using f(x_i)
    get a (possibly noisy) label y_i from the oracle
    if (f(x_i) < t_+1 and y_i == -1) or (f(x_i) > t_-1 and y_i == +1) then
        update the model using (x_i, y_i)
    else
        discard x_i and skip the model update

Figure 6.3: Pseudo-code for Online Label Cleaning.

Note that this modified update rule increases computational cost. Where before each update could be performed in time O(|x_i|), where |x_i| is the number of nonzero elements of the sparse vector x_i, now each update requires O(|w|), the number of non-zero features in w, which may be considerably larger.
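A sketch of the regularized update in Python is shown below, with the tuned values reported in this chapter (η = 0.02, λ = 0.0001) as assumed defaults; dense vectors are used for simplicity, which makes the O(|w|) cost noted above explicit.

    import numpy as np

    def logistic_sgd_update(w, x_i, y_i, eta=0.02, lam=0.0001):
        # w <- w + eta * (y_i - f(x_i)) * x_i - eta * lambda * w, with y_i in {0, 1}
        f = 1.0 / (1.0 + np.exp(-float(np.dot(w, x_i))))   # predicted spamminess
        return w + eta * (y_i - f) * x_i - eta * lam * w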

After a grid search for parameter values of λ and η using spamassassin tuning data, we were surprised to find that L2 regularization with values of λ ranging from λ = 10^-7 to λ = 10^2 did not improve results at any noise level. Values above λ = 10^-4 monotonically degraded results, and smaller values gave results effectively equivalent to λ = 0. To investigate this further, we chose a value of λ = 0.0001 (with η = 0.02) for our final experiments on test data, as this was the largest value that did not significantly decrease performance on tuning data compared to λ = 0.

6.4.3 Label Cleaning

Another machine learning approach for coping with noisy labels is automated label cleaning, in which examples that are suspected to be incorrectly labeled are discarded from the data set [7]. To use this approach in the online filtering scenario, we suggest an obvious online algorithm, given in Figure 6.3. This method uses confidence thresholds t_+1 and t_-1 to define criteria for cleaning. In our experiments, we apply this algorithm using Logistic Regression as the base learner, so that t_+1 and t_-1 may be interpreted as probability thresholds. After tuning on noisy spamassassin data, we set t_+1 = 0.7 and t_-1 = 0.3, as these values gave best results for noise level p = 0.25 with η = 0.02.

for each new example x_i:
    use the filter to make a prediction, using f(x_i)
    get a (possibly noisy) label y_i from the oracle
    if (f(x_i) > t_+1) then set y_i := +1
    if (f(x_i) < t_-1) then set y_i := -1
    update the model using (x_i, y_i)

Figure 6.4: Pseudo-code for Online Label Correcting.

6.4.4 Label Correcting

In the label correcting method, the filter proactively changes the labels of examples for which the filter strongly disagrees with the given label [98], at the risk of introducing additional noise into the data [71]. We propose a simple online method for label correcting, given in Figure 6.4, similar to the online label cleaning method. After tuning, we set t_+1 = 0.95 and t_-1 = 0.05 with η = 0.02, using Logistic Regression as the base learner.
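Both rules are easy to state in code. The sketch below is a hedged rendering of Figures 6.3 and 6.4, using the tuned thresholds above as assumed defaults; f_xi is the filter's predicted spamminess in [0, 1], and labels use +1 for spam and −1 for ham, as in the figures.

    def clean_label(f_xi, y_i, t_pos=0.7, t_neg=0.3):
        # Online label cleaning (cf. Figure 6.3): return True if the (possibly noisy)
        # label y_i should be used for a training update, False if it should be
        # discarded because the filter strongly disagrees with it.
        if y_i == +1:                # labeled spam
            return f_xi > t_neg      # discard if the filter is confident it is ham
        return f_xi < t_pos          # labeled ham: discard if confidently spam

    def correct_label(f_xi, y_i, t_pos=0.95, t_neg=0.05):
        # Online label correcting (cf. Figure 6.4): overwrite the given label when
        # the filter is extremely confident it is wrong, then train on the result.
        if f_xi > t_pos:
            return +1
        if f_xi < t_neg:
            return -1
        return y_i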

Table 6.3: Results for modified methods on trec06p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result, or confidence interval overlapping with confidence interval of best result.

trec06p, (1-ROCA)% with 0.95 confidence intervals; noise levels 0, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25:

LogReg (η = 0.02):                       0.070 (0.057-0.085) | 0.060 (0.049-0.073) | 0.047 (0.038-0.058) | 0.064 (0.051-0.080) | 0.074 (0.062-0.089) | 0.142 (0.120-0.169) | 0.401 (0.363-0.444)
LogReg L2-rglz. (η = 0.02, λ = 0.0001):  0.068 (0.057-0.082) | 0.059 (0.049-0.071) | 0.046 (0.037-0.059) | 0.064 (0.049-0.082) | 0.074 (0.060-0.090) | 0.141 (0.118-0.168) | 0.398 (0.361-0.439)
LogReg Labl-Corr.:                       0.107 (0.087-0.132) | 0.112 (0.094-0.134) | 0.106 (0.086-0.132) | 0.136 (0.111-0.167) | 0.093 (0.077-0.113) | 0.147 (0.122-0.176) | 0.403 (0.366-0.443)
LogReg Labl-Clean:                       0.049 (0.037-0.065) | 0.049 (0.038-0.062) | 0.045 (0.035-0.058) | 0.060 (0.047-0.076) | 0.055 (0.043-0.069) | 0.086 (0.068-0.110) | 0.107 (0.090-0.128)
PwM (η = 0.02, m = 2):                   0.036 (0.027-0.047) | 0.035 (0.026-0.048) | 0.043 (0.032-0.058) | 0.049 (0.035-0.068) | 0.053 (0.040-0.070) | 0.066 (0.048-0.089) | 0.082 (0.066-0.103)
ROSVM (C = 0.5):                         0.033 (0.025-0.044) | 0.032 (0.023-0.044) | 0.036 (0.026-0.052) | 0.040 (0.029-0.054) | 0.046 (0.034-0.062) | 0.062 (0.047-0.081) | 0.095 (0.076-0.119)

trec07p, (1-ROCA)% with 0.95 confidence intervals; noise levels 0, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25:

LogReg (η = 0.02):                       0.009 (0.004-0.018) | 0.007 (0.004-0.011) | 0.007 (0.005-0.010) | 0.014 (0.008-0.023) | 0.036 (0.028-0.045) | 0.091 (0.078-0.107) | 0.363 (0.332-0.396)
LogReg L2-rglz. (η = 0.02, λ = 0.0001):  0.009 (0.004-0.020) | 0.007 (0.004-0.011) | 0.007 (0.005-0.009) | 0.014 (0.008-0.025) | 0.035 (0.027-0.047) | 0.091 (0.077-0.107) | 0.359 (0.326-0.395)
LogReg Labl-Corr.:                       0.009 (0.005-0.018) | 0.010 (0.005-0.018) | 0.011 (0.006-0.020) | 0.016 (0.011-0.024) | 0.037 (0.029-0.047) | 0.092 (0.079-0.107) | 0.364 (0.338-0.393)
LogReg Labl-Clean:                       0.010 (0.005-0.018) | 0.011 (0.005-0.021) | 0.013 (0.008-0.023) | 0.010 (0.005-0.019) | 0.010 (0.006-0.018) | 0.017 (0.011-0.025) | 0.018 (0.013-0.025)
PwM (η = 0.02, m = 2):                   0.008 (0.004-0.017) | 0.011 (0.006-0.020) | 0.021 (0.013-0.033) | 0.021 (0.013-0.035) | 0.043 (0.030-0.062) | 0.056 (0.040-0.079) | 0.087 (0.069-0.110)
ROSVM (C = 0.5):                         0.006 (0.002-0.018) | 0.007 (0.003-0.014) | 0.010 (0.006-0.017) | 0.008 (0.005-0.013) | 0.027 (0.017-0.044) | 0.033 (0.020-0.054) | 0.048 (0.034-0.068)

6.5 Experiments

In this section, we test the modified filters, with parameters tuned as described in Section 6.4, on both synthetic label noise and on natural noise from real users.

6.5.1 Synthetic Label Noise

We first tested the modified methods on synthetic label noise, using the same data and evaluation methods described in Section 6.3. Results for these experiments are shown in Tables 6.3 and 6.4.

First, we note that simply reducing the learning rate η makes Logistic Regression and Perceptron with Margins much more resistant to label noise for both data sets, at a slight cost in performance without label noise. Perceptron with Margins, in particular, demonstrates its effectiveness as a “noise-tolerant” algorithm [48].

Second, additional regularization gives strong results with ROSVM, but does not give added benefit for Logistic Regression with λ = 0.0001. To check if this was because of a particular λ value, we ran additional tests and found that higher values of λ monotonically degraded classification performance at all noise levels, while lower values of λ converged to the results where λ = 0. Thus, it appears that L2 regularization for logistic regression is simply not helpful for email spam filtering.

We believe this is due to the fact that in the online gradient descent with L2 penalty used for Logistic Regression, rare-but-informative features are penalized over time.

In contrast, the SVM variant is better suited to maintaining values for many relevant features [45].
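To make this argument concrete, the sketch below shows one online gradient step for L2-regularized logistic regression over sparse feature dictionaries; the function name and data structures are illustrative assumptions, not our exact implementation. Because the weight-decay term shrinks every weight on every update, a rare but informative feature loses weight on all of the many updates in which it does not appear.

    import math

    # One online gradient step for L2-regularized logistic regression,
    # assuming labels y in {0, 1} and a sparse feature dict x.
    def logreg_l2_update(w, x, y, eta=0.02, lam=0.0001):
        z = sum(w.get(f, 0.0) * v for f, v in x.items())
        p = 1.0 / (1.0 + math.exp(-z))       # predicted P(y = 1 | x)
        for f in w:                           # weight decay from the L2 penalty,
            w[f] -= eta * lam * w[f]          # applied to every feature
        for f, v in x.items():                # gradient step on the log loss
            w[f] = w.get(f, 0.0) + eta * (y - p) * v
        return w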

Third, the approach of label cleaning gave excellent results, clearly improving on base Logistic Regression results on both data sets at moderate to high levels of noise. Label correcting, on the other hand, did not give added benefit. When we explored other parameter settings, we found that more aggressive label correcting only degraded results on these data sets.

6.5.2 Natural Label Noise

The previous experiments show that there are several methods available for dealing with uniform label noise, an interesting result given the failure of the best TREC filters on the same task. But is a uniform model of label noise always realistic? It seems reasonable that messages such as gray mail may have higher rates of labeling inconsistency than average. Similarly, if spammers are inside the labeling system, then certain spam messages may have a disproportionately high noise rate. In this section, we experiment with our best available approximation of label noise caused by actual users, using human labels collected by the spamorham.org project [41] for the trec05p-1 data set.

To prepare test data with natural label noise, for each message in the trec05p-1 data set we sampled one human labeling from the set of all human labelings for that message, uniformly at random. Thus, the final data set contained the same messages as trec05p-1 in the same order, but with labels that reflected the distribution of label noise produced by human users. In comparison with the trec05p-1 gold standard labels, this test set contained 6.75% incorrect labels.

For comparison, we then created a synthetic data set from trec05p-1, with a uniform p = 0.0675 noise rate identical to that of the natural label noise data set. Finally, we also tested all methods on the original trec05p-1 data with gold-standard labels.
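A minimal sketch of the synthetic-noise construction, assuming gold-standard labels in {+1, -1}; the random seed is an arbitrary illustrative choice.

    import random

    def add_uniform_label_noise(labels, p, seed=0):
        """Flip each gold-standard label independently with probability p."""
        rng = random.Random(seed)
        return [-y if rng.random() < p else y for y in labels]

    # e.g. noisy_labels = add_uniform_label_noise(trec05p_labels, p=0.0675)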

The results, given in Table 6.5, show that natural label noise appears to be more challenging to filters than uniform noise.

Table 6.5: Results for natural and synthetic noise at identical noise levels. Natural label noise for trec05p-1 was uniformly sampled from human labelings collected by the spamorham.org project. Results are reported as (1-ROCA)%, with 0.95 confidence intervals.

trec05p-1, results for no noise / synthetic noise / natural noise:

MN NB: 0.871 (0.831 - 0.913), 1.270 (1.210 - 1.333), 1.425 (1.364 - 1.489)

LogReg, η = 0.1: 0.013 (0.011 - 0.015), 0.249 (0.232 - 0.267), 0.563 (0.531 - 0.596)

PwM, η = 0.5, m = 8: 0.022 (0.018 - 0.027), 0.324 (0.297 - 0.352), 1.056 (0.982 - 1.137)

ROSVM, C = 100: 0.012 (0.010 - 0.016), 2.063 (1.987 - 2.142), 2.172 (2.088 - 2.259)

DMC: 0.013 (0.009 - 0.017), 0.241 (0.217 - 0.268), 0.574 (0.535 - 0.616)

BogoF: 0.042 (0.031 - 0.056), 7.215 (7.021 - 7.413), 0.551 (0.512 - 0.593)

WAT1: 0.012 (0.010 - 0.015), 1.073 (1.017 - 1.131), 1.294 (1.233 - 1.358)

OSBF-Lua: 0.011 (0.008 - 0.014), 37.025 (34.895 - 39.207), 32.770 (32.436 - 33.105)

LogReg, η = 0.02: 0.031 (0.027 - 0.034), 0.037 (0.324 - 0.043), 0.156 (0.145 - 0.168)

LogReg labl. corr.: 0.030 (0.028 - 0.036), 0.039 (0.034 - 0.043), 0.146 (0.135 - 0.159)

LogReg labl. clean: 0.022 (0.018 - 0.027), 0.025 (0.020 - 0.030), 0.463 (0.423 - 0.506)

PwM, η = 0.02, m = 2: 0.022 (0.018 - 0.027), 0.047 (0.038 - 0.060), 0.304 (0.0276 - 0.336)

ROSVM, C = 0.5: 0.019 (0.015 - 0.023), 0.030 (0.023 - 0.039), 0.294 (0.264 - 0.327)

The unmodified filters perform badly on both the synthetic noise and the natural noise. In contrast, the modified filters perform relatively well on the synthetic noise, but give results roughly an order of magnitude worse on the natural noise (although still better than the unmodified filters). These results agree with previous observations that uniform class label noise is easier to filter than label noise that skews the label distribution in certain regions of the feature space [7, 71]. Additional work is needed to cope with this kind of natural label noise.

6.6 Discussion

At the outset of this investigation, our goal was to find out how top-performing filters from TREC competitions fared in the presence of label noise. To our dismay, we found that even uniform label noise dramatically reduces the effectiveness of these state of the art filters when run “out of the box.” We then found inexpensive modifications enabling TREC filters to become significantly more tolerant of label noise. Uniform label noise, which models random user errors in feedback, is well handled by several modified methods. Natural noise, reflecting inconsistent or malicious judgments, remains more difficult.

We observe that these “noise tolerant” filters would not necessarily have achieved best performance on the tasks as given in TREC-style evaluations. The noiseless evaluation setting rewards aggressive online updates, and promotes filters that may be prone to overfitting in real-world applications. However, we feel that a slight decrease in classification performance in the noiseless setting is more than compensated for by improved performance in the noisy setting.

It is critical that we are able to distinguish those filters that are robust to label noise (or may be made robust with appropriate parameter settings) from those that fail in noisy settings. Thus, we would like to propose that future spam filtering evaluations include filtering tasks with various levels of label noise. Ideally, this label noise would be natural noise from real human users rather than synthetic, wherever possible, as this is the more challenging case.

In the following chapter, we extend this exploration by examining the problem of natural label noise in a more challenging setting in which diverse users give radically inconsistent label feedback.

Chapter 7

Online Filtering with Feedback from Diverse Users

In the previous chapter, we found that uniform class label noise was not problematic when filters were tuned away from overly aggressive online updates. However, non-uniform class label noise, such as that produced by real users, is still a challenge. In this chapter, we further investigate the impact of non-uniform class label noise on machine learning methods for spam filtering by examining an extreme setting in which diverse users give label feedback. Diversity in a user base leads to the possibility of inconsistent label feedback, which arises when different users disagree as to what constitutes spam.

The filtering of blog comment abuse provides exactly this setting. Internet blogs provide forums for discussions within virtual communities, allowing readers to post comments on what they read. However, such comments may contain abuse, such as personal attacks, offensive remarks about race or religion, or commercial advertisements, all of which reduce the value of community discussion. Ideally, filters would promote civil discourse by removing abusive comments while protecting free speech by not removing any comments unnecessarily. The readership of a blog may contain multiple factions or groups with highly divergent opinions on what constitutes abuse. Thus, the label feedback from such a readership is inconsistent, resulting in high levels of non-uniform label noise.

This chapter has two primary goals. First, we wish to explore the effectiveness of noise-tolerant filtering methods from the previous chapter in this more difficult setting. We find that class-specific misclassification costs give additional benefit here, due not only to minority class issues but also to the observation that positive class labels may be more reliable than negative class labels in this setting.

Second, we wish to determine the reliability of evaluation using noisy class labels as “ground truth” in computing classification performance statistics. We determine that constructing gold-standard evaluation labels may give a more accurate view of the true performance of the filtering methods than using user-generated labels.

However, it turns out that evaluations using user-generated labels show the same relative performance rankings as more expensive gold-standard labels, and may thus be sufficient for future evaluation. This chapter is based on work that originally appeared in our paper on blog comment abuse filtering [81].

7.1 Blog Comment Filtering

Internet blogs – online journals – have become important forums for fostering community discussion. In a typical blog, readers may write blog comments in response to a blog posting. While the majority of blog comments are not abusive, some comments do contain abuse. Unlike email spam, blog comment abuse is not primarily commercial in nature. More often, comment abuse contains personal attacks, obscenities, and even messages of hate based on race, religion, or nationality. Such comments mar the ability of blogs to foster constructive discussion in a virtual community, especially among a diverse readership with multiple factions or points of view.

7.1.1 User Flags and Community Standards

Typical blog services allow the community to protect itself from abusive comments through the process of user flagging, in which readers are given the option to flag abusive comments. Comments receiving a sufficient number of flags are removed from view. This flagging process enables a virtual community to enforce its own standards for civility in discourse. However, when a readership has multiple factions, such flagging information becomes noisy. Users may flag inconsistently, inaccurately, or even maliciously. Thus, care is needed to construct and evaluate filters capable of learning from this data.

7.1.2 Contributions

Our investigation in this chapter centers on a multi-lingual corpus of comments with associated user flag information that is three orders of magnitude larger than the data set used in the largest prior study. To our knowledge, this is the first reported use of user flags for training blog comment abuse filters. We show that, despite noise, user flag data can train filters that approach the performance of dedicated human annotators. Additionally, this chapter gives analysis of blog comment abuse, compares several filtering methods, and offers suggestions for practical application of filters in real-world blog hosting systems.

7.2 Related Work

Relatively little work on blog comment abuse filtering appears in the literature, and none on using user flag data for training learning-based filters. However, there is significant prior work in splog detection, and email spam filtering remains a related but distinct area of research.

7.2.1 Blog Comment Abuse Filtering

To our knowledge, the only prior work on blog comment abuse filtering centered on a small, hand labeled data set of roughly one thousand examples [64]. Mishne et al. proposed filtering blog comments by measuring the disagreement between language models for comments and associated blog entries.

In this work, we do not consider language-model disagreement methods for two reasons. First, it has been shown that SVM variants exceed the performance of the language-model disagreement method on the same data set, using information only from the comment [84]. Second, although our multi-lingual corpus contains comments primarily in English, many comments are either partly or completely written in other languages. This renders language-model disagreement methods problematic.

7.2.2 Splog Detection

A related task involving filtering and blogs is splog detection. A splog (or spam blog) is a fake blog intended to fool search engines into assigning undue importance to an associated website [50]. This is done by inserting links from the splog to the website, increasing the site’s PageRank (an importance measure based on link structure). The splog detection problem is important to search engines, and has been approached by content analysis [50] and by temporal and link analysis [58].

This task is essentially distinct from the task of filtering blog comment abuse. In the abuse filtering task we are concerned with removing obscene, offensive, or commercial comments from real blogs rather than detecting fake blogs. These tasks do overlap, however, in relatively rare cases when abusers insert links within comments to deceive search engines.

7.2.3 Comparisons to Email Spam Filtering

There are important differences between filtering email spam and blog comment abuse. In the TREC-style tests for email spam filtering, it is assumed that all of the training data has gold-standard quality labels [23] that are equivalent to a consistent, accurate human judgment for each new message. Thus, even when noise is introduced into the training data, as in the previous chapter, there is an objective standard to use for evaluation. In contrast, the training labels for blog comment abuse are given by user flags from thousands of users, who may be inconsistent or even malicious in their flagging. Thus, the class label noise in this case is far from uniform, and there is no objective gold-standard set of labels to use for evaluation.

Later in this chapter, we show that the subjective nature of abuse makes constructing gold-standard labels a challenging task, with far lower inter-annotator agreement than is seen in the email gold-standard labelings.

Another difference is that blog comments are read by a community of readers, rather than an individual email recipient, and abuse is defined by the standards of the community. While email spam is motivated primarily by commercial intent [15], the majority of this blog comment abuse appears to be socially motivated. Thus, abusive blog comments are most often unique. With the exception of relatively infrequent commercial blog comment abuse, campaigns of abusive comments similar to high-volume email spam campaigns are rare.

                   cricket   getahead     money    movies      news    sports      total
unique blogs           416        188     1,380       627     1,748       405      4,764
unique userids      28,051      7,967    42,339    44,940    61,138     8,262    130,885
total comments     101,662     13,545   124,597   188,222   497,312    22,524    947,862
flag rate             0.12       0.07      0.11      0.22      0.23      0.19       0.20

Table 7.1: Summary statistics for the msgboard1 corpus of blog comments, broken out by topic.

7.3 The msgboard1 Data Set

This chapter centers on the msgboard1 corpus, a data set of nearly one million blog comments with user flag information (see Table 7.1). The corpus of blog comments was provided by Rediff.com, a leading blog hosting service catering to the needs of India’s large expatriate community.1

The corpus consists of blog comments gathered from January through August of 2007. The corpus contains comments from blogs on six self-identified topics, listed in Table 7.1. There are comments from a total of 4,764 unique blogs, of which 2,901 contribute at least ten comments to the corpus. Although the primary language of the comments is English, there are also comments written either partly or entirely in Hindi, Tamil, and many other languages of India, all represented with the standard ASCII character set. Misspellings and character substitutions are common.

Each comment is annotated with a userid identifying the author of the comment, a blog title showing the blog in which it was posted, and a flag variable showing whether or not the comment had been flagged by users.

1This corpus may be available on a per-case basis for research purposes. Contact pranshus@rediff.co.in.

Figure 7.1: Flag rates (fraction of comments flagged, per day) over time for the most popular blog. The spikes indicate periods of high amounts of flagging, often caused by abusive flame wars among users. Graphs for other blogs show similar patterns.

Over half of the 130,855 unique userids contributed only a single comment, rendering user history insufficient for reliable filtering.

7.3.1 Noise and User Flags

As noted above, typical blog hosting services remove comments that receive a sufficient number of flags. Although some blog hosting services require several flags to be received before a comment is removed from view, the msgboard1 data marked any comment receiving even a single user flag as a flag comment. The flag label is thus a binary label, because any comments receiving a flag were removed from view, making the accumulation of additional flags impossible.

This aggressive application of user flags has the benefit of removing abusive comments from view as quickly as possible. However, it has the drawback that flag information may contain significant noise, as individual users may flag without checks or balances. For example, in the Rediff.com system, there is no way for users to mark a comment as “non-abusive”. This enables vindictive users to flag comments without repercussion.

Furthermore, note that not all abusive messages will receive user flags. Many users will simply skip over abusive comments without flagging them. Thus, class label noise is present in the data for both flag and non-flag labels.

Finally, the subjective nature of the concept of abuse creates large levels of inconsistency in flagging. As we describe in Section 7.6, we found that the pairwise inter-annotator agreement was around 74%. That is, dedicated human adjudicators disagreed about the class label of comments 26% of the time. Thus, the level of class label noise in this data set is significantly higher than noise levels encountered in email spam filtering.

7.3.2 Patterns of Abuse

Are comments flagged at a steady rate, or are there flagging peaks and lows? To answer this, we plotted the flag rate (fraction of comments getting flagged) per day for the most popular blogs over the span of data collection, applying Laplace smoothing to the rates to remove the effect of low-traffic days. Figure 7.1 shows results for the most popular blog (with over 40,000 comments); results for other high-volume blogs were similar. Interestingly, the graph shows several aperiodic spikes, denoting dates on which a large percentage of comments were flagged.2 We speculate that these spikes are caused primarily by flame wars, often seen in blog comments, in which users direct abuse at each other in heated conflict.
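As an illustration, a Laplace-smoothed daily flag rate can be computed as follows; the smoothing pseudo-count used for Figure 7.1 is not specified above, so the default value here is an assumption.

    def smoothed_flag_rate(num_flagged, num_comments, alpha=1.0):
        """Laplace-smoothed fraction of comments flagged on one day.
        alpha is a smoothing pseudo-count (the exact value used for
        Figure 7.1 is an assumption here)."""
        return (num_flagged + alpha) / (num_comments + 2.0 * alpha)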

7.3.3 Understanding User Flags

What causes a comment to get flagged? To explore this question, we computed information gain [94] values for words in comments that have been flagged, and for non-flagged comments. (A selection of these is shown in Table 7.2.) Brief examination of each case showed that obscenities and commercial terms ranked highly for flagged comments, which agrees with the intuitive notion of comment abuse. Religious terms also ranked highly for flagged comments, highlighting the large amount of abuse found in the corpus containing hateful messages directed against members of several different religions. For non-flagged terms, it was surprising to note that common stop words (such as pronouns and articles) scored highly – perhaps because comments written with proper grammatical usage tended not to be abusive. Other words that stand out are terms associated with civil discourse.
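For reference, the information gain of a binary word feature with respect to the flagged/non-flagged label can be computed from the four cell counts of its contingency table, as in the following sketch. This is a standard formulation, not necessarily the exact implementation used to produce Table 7.2.

    import math

    def entropy(pos, neg):
        total = pos + neg
        h = 0.0
        for c in (pos, neg):
            if c > 0:
                p = c / total
                h -= p * math.log(p, 2)
        return h

    def information_gain(n_pos_with, n_neg_with, n_pos_without, n_neg_without):
        """Information gain of a binary word feature for the flagged label,
        given counts of flagged/non-flagged comments with/without the word."""
        n_with = n_pos_with + n_neg_with
        n_without = n_pos_without + n_neg_without
        n = n_with + n_without
        prior = entropy(n_pos_with + n_pos_without, n_neg_with + n_neg_without)
        cond = (n_with / n) * entropy(n_pos_with, n_neg_with) \
             + (n_without / n) * entropy(n_pos_without, n_neg_without)
        return prior - cond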

To further understand the breakdown of abuse, we examined a uniform random sample of 100 flagged comments. In this sample, 39 were found to contain obscenities or personal attacks, 39 were found to contain racially, nationally, or religiously motivated comments, 9 were found to be of commercial intent, and the remaining 12 were flagged for other reasons. We were surprised both at the high amount of socially motivated abuse, together totaling almost 80% of the flags, and at the relatively low amount of commercial abuse.

2It is possible that these temporal patterns may be useful in abuse filtering, an area for future work.

flagged                   non-flagged
com          wife         india        right
prophet      dog          people       please
contest      lovers       good         best
hmm          launched     think        same
annihilate   sexual       country      agree
ipods        lord         time         money
women        alternative  mr           life
causing      bangladesh   team         president
defending    hassle       world        work

Table 7.2: Selected words with high information gain, for flagged and non-flagged comments. Obscenities and references to specific religious figures have been removed from the flagged list for display, and stop words have been removed from the non-flagged list.

7.4 Online Filtering Methods with Class-Specific Costs

As with our prior experiments with email spam filtering, we treat the blog comment abuse filtering task as an online filtering task, to model the effect of filtering a stream of incoming comments over time [15].

However, we note that the class labels of flag and non-flag may have different levels of reliability. For example, users may skip over an abusive comment without flagging it, making the fact that a given comment received a non-flag label less reliable than a flag label. Additionally, note that the flag class is a minority class in this data set (see Table 7.1).

To cope with these conditions, we modify several of the online filtering methods to include class-specific costs. Previously, class-specific costs have been explored in spam filtering to cope with situations in which the cost of misclassifying ham as spam was different from the cost of misclassifying spam as ham [51].

• Perceptron with Margins. Training with class-specific misclassification costs may be implemented by using class-specific learning rate values η+ and η− for positive and negative example updates.

• Logistic Regression. As with Perceptron with Margins, class-specific misclassification costs for Logistic Regression (and Logistic Regression with 2-Norm Regularization) may be implemented using separate η+ and η− (a sketch of this style of update is given after this list).

• Relaxed Online SVMs. Class-specific misclassification costs in SVM variants are implemented by modifying the standard soft-margin SVM optimization problem to include separate cost parameters, C+ and C−. Thus, we seek to minimize

    (1/2) ||w||^2 + C+ Σ_{j : yj = +1} ξj + C− Σ_{j : yj = −1} ξj

subject to the same constraints described in Chapter 3.
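As a concrete illustration of the class-specific learning rates mentioned above for Perceptron with Margins and Logistic Regression, the following Python sketch shows one online logistic regression step with separate η+ and η−. The function and data-structure names, and the default parameter values, are illustrative assumptions rather than the interface of our actual implementation.

    import math

    def cost_sensitive_logreg_update(w, x, y, eta_pos=0.1, eta_neg=0.05):
        """One online logistic regression step with class-specific learning
        rates, for a sparse feature dict x and label y in {+1, -1}."""
        z = sum(w.get(f, 0.0) * v for f, v in x.items())
        p = 1.0 / (1.0 + math.exp(-z))        # P(y = +1 | x)
        eta = eta_pos if y == +1 else eta_neg  # class-specific step size
        target = 1.0 if y == +1 else 0.0
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + eta * (target - p) * v
        return w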

7.4.1 Feature Sets

Each of the above methods was tested in conjunction with the binary 4-mer feature space, which gave best results on the small set of blog comment spam developed by Mishne [64, 84]. We also tested a binary word-based feature space and found that this feature space decreased performance for the above methods. However, the word-based feature space proved superior for the two Naive Bayes variants we tested; the reported results for Naive Bayes variants use the word-based feature space. In addition, with either the word-based or 4-mer features, we included distinct binary features based on blog title, and others on userid.
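For concreteness, a minimal sketch of the binary 4-mer feature mapping with the additional blog-title and userid indicator features might look as follows. Whether the text is lower-cased, and the exact feature-name prefixes, are assumptions made for illustration only.

    def binary_4mer_features(comment, blog_title=None, userid=None, k=4):
        """Binary k-mer feature map for a blog comment, with optional
        distinct indicator features for blog title and author userid."""
        feats = set()
        text = comment.lower()
        for i in range(len(text) - k + 1):
            feats.add("kmer:" + text[i:i + k])
        if blog_title is not None:
            feats.add("blog:" + blog_title)
        if userid is not None:
            feats.add("user:" + str(userid))
        return {f: 1.0 for f in feats}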

7.4.2 Alternatives

We also investigated the utility of ensemble methods combining the output of several different filters, which were found effective for email spam [62]. Our experiments with these methods showed no improvement over the best filter included in the ensemble, as the filters tended to make correlated errors. We also experimented with the online methods of cleaning or correcting labels described in the previous chapter. However, neither label cleaning nor label correcting improved results.

7.5 Experiments

In this section, we detail our experimental framework and report results using user flag data for evaluation, both for testing individual filters and for comparing global versus per-topic filtering.

7.5.1 Experimental Design

For each filtering method, we performed separate online filtering experiments for each topic and a global all-test experiment using data from all topics excluding sports. The sports topic was reserved for initial tests and parameter tuning. We used user-flag labels as ground truth for training, with flagged examples counting as positives.

Our primary evaluation measure is Area under the ROC Curve (ROCA). We use this measure instead of reporting (1-ROCA)% because the (1-ROCA)% score is designed to make high ROC areas easy to compare. In these experiments, our ROC areas are not high enough to warrant the (1-ROCA)% transformation.
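For concreteness, the sketch below computes ROC area in its rank-statistic form: the probability that a randomly chosen positive (abusive) example receives a higher score than a randomly chosen negative one, with ties counted as one half. This is a standard textbook formulation written with illustrative names; it is not the evaluation code used in our experiments.

    def roc_area(scores, labels):
        """Area under the ROC curve for scores and labels in {+1, -1};
        higher scores mean 'more likely abusive'. Quadratic-time sketch,
        assuming at least one positive and one negative example."""
        pos = [s for s, y in zip(scores, labels) if y == +1]
        neg = [s for s, y in zip(scores, labels) if y == -1]
        wins = 0.0
        for sp in pos:
            for sn in neg:
                if sp > sn:
                    wins += 1.0
                elif sp == sn:
                    wins += 0.5
        return wins / (len(pos) * len(neg))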

7.5.2 Parameter Tuning

Following standard machine learning methodology, we used the sports topic data as a separate tuning set to tune parameters for all filters before running the full experiments. Coarse grid search was performed to tune all parameters, using ROCA as our evaluation measure and user flag labels as ground truth for both training and evaluation.

The use of class-specific misclassification costs gave improved performance for each of the discriminative filters in tuning tests. This was due both to the fact that flagged comments are a minority of the data (see Table 7.1), and to the fact that the presence of a flag may be more reliable information than a non-flag. For Perceptron with Margins, tuning selected η+ = 0.7 and η− = 0.2. For both Logistic Regression variants, tuning selected η+ = 0.3 and η− = 0.08, with λ = 0.00004 for 2-norm regularization. For ROSVM, tuning selected C+ = 1.5 and C− = 0.5.

Note that ROSVM performed best with low values of the cost parameter C, which enforces regularization to help cope with label noise. This agrees with results from the previous chapter, which showed that low values of C are helpful for combating class label noise in spam filtering. This is a radical difference from prior email spam filtering evaluations, where very high values of C (representing minimal regularization) have been found most effective [34, 84] due to the lack of noise in the training labels.

7.5.3 Results Using User-Flags for Evaluation

ROCA scores using user-flag labels as ground truth for evaluation are given for each filter and task in the top graph of Figure 7.2. We make three observations from these results.

First, the discriminative classifiers strongly out-perform the generative Naive Bayes classifiers. Within the Naive Bayes classifiers, the document prior was much superior to the term prior. ROSVM gives best results across the majority of tasks, but both Logistic Regression and Perceptron with Margins yield close performance and may be preferred for lower computational cost.

Figure 7.2: ROCA results for User-Flag evaluation (top) and Gold Standard Evaluation (bottom), for the tasks cricket, getahead, money, movies, news, and all-test. The legend is the same for both graphs: NB Term Prior, NB Doc Prior, Perceptron w/Margins, Log. Reg. (Classic), Log. Reg. (Regularized), and ROSVM.

Second, using the 2-norm regularization for Logistic Regression was not beneficial, and actually reduced performance in almost all cases compared to the non-regularized variant. We believe this is because in abuse filtering there are many rare-but-informative features, such as misspellings and intentional word obfuscations. The 2-norm regularization aggressively reduces the influence of rare features. The regularization provided by the low C values of ROSVMs appeared better able to handle these sparse, informative features.

Third, the overall ROCA performance of the filters is far below the levels commonly seen in email spam filtering. As we show in Section 7.6, the noise in user flags causes this evaluation to be an underestimate of true performance.

7.5.4 Filtering Thresholds

The full ROC curves for the all-test task with all filters are given in Figure 7.3, showing the effect of varying the classification threshold between abuse and non-abuse. As discussed at the beginning of this chapter, the goal of promoting civil discourse requires removing abusive comments from view; but to protect free speech we must be careful to keep false positive rates low. Thus, we would prefer the ROC curve to remain tight against the left vertical axis (showing 0 false positive rate) for as long as possible. The ROSVM, classical Logistic Regression, and Perceptron with Margins all perform well on this task, filtering out 30% of user-flagged abuse with a 1% false positive rate, and roughly half of all flagged abuse with a 5% false positive rate.

Figure 7.3: ROC curves (true positive rate versus false positive rate) for the all-test task, using User Flag Evaluation (top) and Gold Standard Evaluation (bottom). Curves are shown for ROSVM, Logistic Regression (Classic), Perceptron with Margins, Logistic Regression (Regularized), Multinomial NB (Doc Prior), Multinomial NB (Term Prior), and a uniform random baseline.

                           Cumulative         Global
                           Topic Specific     All-Test
Naive Bayes Term Prior     0.630              0.608
Naive Bayes Doc. Prior     0.787              0.716
Perceptron w/Margins       0.848              0.855
Logistic Regression        0.852              0.857
Regularized Log. Reg.      0.850              0.845
ROSVM                      0.862              0.863

Table 7.3: ROCA results of topic-specific versus global filtering. Generative methods benefit from topic-specific filtering, while discriminative methods are not significantly harmed by global filtering.

7.5.5 Global Versus Per-Topic Filtering

Looking at the above results, it is natural to ask if it is better to apply one global filter for all topics or specialized filters for each topic. To answer this, we compared the results of each filter on the global all-test topic to the cumulative result across each of the per-topic tasks.

The results in Table 7.3 show that topic-specific filtering improves the results of the generative filters using Naive Bayes. This is because the distributions of abuse vary by topic; when all topics are combined the generative methods are less able to accurately estimate these distributions, reducing predictive performance. However, discriminative filters give nearly equivalent results for topic-specific and global filtering, because these methods do not estimate distributions as an intermediate step. Thus, for the discriminative methods global filtering may be preferred for simplicity.

7.6 Gold-Standard Evaluation

The results of automated filtering using user flags for training are significantly lower than we have seen for email spam filtering, even for the best performing methods. However, these results were computed using the user-flag labels not only for training, but also for evaluation. In contrast, our experiments for email spam filtering in the presence of class label noise used noisy labels for training, but gold-standard labels for evaluation. Evaluating with noisy labels may have under-estimated the true performance of the filters.

Furthermore, in order to interpret these results, it is important to know how much noise exists in the user flags. We estimated this in email spam filtering by comparing the noisy user-generated labels to the gold-standard labels provided with the data sets.

Clearly, gold-standard labels were needed for the msgboard1 data set, both for reliable evaluation and meaningful interpretation. However, hand labeling nearly a million comments was beyond our capacity. Thus, we used a bootstrapping method similar to that employed by Cormack and Lynam in the creation of the gold-standard labels for the TREC email spam corpora [23].

7.6.1 Constructing a Gold Standard Set

Our goal was to construct a gold standard set of labels for a subset of the msgboard1 data set, large enough for meaningful performance evaluation and small enough to be labeled by volunteer human adjudicators. Our basic methodology was to sample examples uniformly from each topic in the data set. We then followed the bootstrapping methodology for efficient use of human adjudication proposed by Cormack and Lynam [23], using multiple automated filters to assess each message and only asking humans to adjudicate messages for which there was disagreement among the filters.

We first sampled a pool of examples uniformly at random from each topic, stratifying the sampling by topic. This uniform sampling ensures that the gold standard set may be used for future evaluations. We then removed all examples from the pool for which the user-flag label and the predicted label given by each of four filters3 unanimously agreed. Such unanimous labels were considered trustworthy [23] and were used as gold-standard labels for these examples.

From the sampled pool, there were 2,767 examples for which there was disagreement either among filter predictions or between a filter and the user flag label. These examples were human adjudicated. Three volunteer human adjudicators (one computer scientist and two non-computer scientists) were each independently asked to label each example as abuse, non-abuse (ok), or unsure. Adjudicators were shown the text of the comment, the title of the blog in which it was posted, and the topic of the blog, along with brief guidelines for defining abuse, as shown in Figure 7.4. Adjudicators were not shown any of the predicted labels, nor the user flag label, nor any of the other adjudicators’ ratings. Furthermore, the userid was withheld for privacy reasons.

Gold standard labels for the adjudicated set were determined by majority vote. The inter-annotator agreement for abuse and non-abuse was 0.742 ± 0.006, excluding unsure ratings, and the kappa measure of agreement [8] was 0.48. Comments receiving predominantly unsure ratings – most often because the comment was in a non-English language – were given to a language expert for final adjudication. A small number of these were also labeled unsure by the language expert, and were removed from the set. The final gold standard set, which includes both adjudicated and unanimous examples, is described in Table 7.4.
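For reference, pairwise agreement and a kappa statistic for two annotators can be computed as in the sketch below. This is standard Cohen's kappa over a single pair of annotators; the kappa of [8] generalizes this to the multi-annotator setting, so the sketch is illustrative rather than the exact computation used above.

    def cohen_kappa(labels_a, labels_b):
        """Cohen's kappa for two annotators over the same items (labels are
        'abuse' / 'ok'; 'unsure' ratings are assumed excluded beforehand)."""
        pairs = list(zip(labels_a, labels_b))
        n = len(pairs)
        observed = sum(1 for a, b in pairs if a == b) / n
        expected = 0.0
        for c in set(labels_a) | set(labels_b):
            pa = sum(1 for a in labels_a if a == c) / n
            pb = sum(1 for b in labels_b if b == c) / n
            expected += pa * pb
        return (observed - expected) / (1.0 - expected)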

3The four filters we used were multinomial Naive Bayes with the document prior, perceptron with margins, classical logistic regression, and ROSVM.

Figure 7.4: Screen shot of the blog comment rating tool used by adjudicators.

            total      fraction       fraction
            sampled    adjudicated    corrected
cricket        994       0.22           0.07
getahead      1310       0.10           0.03
money         1275       0.22           0.05
movies        1909       0.34           0.12
news          2610       0.43           0.16
sports        1178       0.30           0.08
all-test      8096       0.30           0.10

Table 7.4: Summary statistics for the gold standard evaluation set. Adjudication and correction rates vary widely by topic. The news topic, in particular, required extensive adjudication of religious and racial comments.

7.6.2 Gold Standard Results

As before, each filter was tested on each task, using the online filtering methodology and the noisy user flag labels for training. Here, however, the evaluation was performed using gold standard labels, considering results only on examples in the gold standard set. The ROCA results for this evaluation are given in Figure 7.2 (bottom).

Note that the results for gold standard evaluation are uniformly higher than those using user-flag labels for evaluation. This shows that the user flag evaluation is, indeed, an under-estimate of true performance. However, it is interesting to observe that the relative performance of filters for tasks is largely preserved between the two evaluations. Thus, in practical settings, it may be best to use the inexpensive user-flag labels for system tuning, testing, and monitoring to track relative changes. The more costly gold standard evaluation would only be required for occasional confirmation of true performance levels.

Results by topic (NB Term Prior / NB Doc. Prior / Perceptron w/Margins / Logistic Regression / Regularized Log. Reg. / ROSVM / User Flags):

cricket: 0.384 ±0.075, 0.323 ±0.071, 0.677 ±0.080, 0.616 ±0.081, 0.626 ±0.081, 0.717 ±0.078, 0.735 ±0.073

getahead: 0.132 ±0.059, 0.191 ±0.069, 0.500 ±0.097, 0.441 ±0.095, 0.456 ±0.095, 0.485 ±0.097, 0.753 ±0.086

money: 0.313 ±0.062, 0.398 ±0.067, 0.711 ±0.069, 0.711 ±0.069, 0.695 ±0.070, 0.727 ±0.068, 0.791 ±0.062

movies: 0.380 ±0.045, 0.405 ±0.046, 0.624 ±0.048, 0.638 ±0.048, 0.595 ±0.049, 0.663 ±0.048, 0.662 ±0.043

news: 0.336 ±0.035, 0.426 ±0.038, 0.610 ±0.040, 0.617 ±0.040, 0.581 ±0.040, 0.652 ±0.039, 0.642 ±0.036

all-test: 0.316 ±0.022, 0.395 ±0.024, 0.629 ±0.026, 0.629 ±0.026, 0.573 ±0.026, 0.651 ±0.026, 0.681 ±0.023

Table 7.5: Results for F1 Measure, Gold Standard Evaluation. F1 Measure is computed using precision and recall, where an abusive comment is considered a positive example. For all filters, the F1 measure was computed at the precision-recall break-even point.

7.6.3 Filters Versus User Flags

The gold standard set also allows us to compare the performance of the filters with the process of user-flagging itself. Using user-flags as a prediction, we computed recall (r), the fraction of all abuse that was detected, and precision (p), the rate of correct predictions found when abuse was predicted. For each task, we then computed the F1 measure [94], defined as F1 = 2pr/(p + r), for the user-flag predictions, and compared these to the F1 measure for each filter when the classification threshold τ was set at the precision-recall break-even point (where p = r).
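To make this evaluation concrete, the following is a minimal sketch of computing the F1 measure at the precision-recall break-even point from per-message scores and gold-standard labels. The threshold sweep shown here, and the function name, are assumptions about implementation detail, not the exact code used in our experiments.

    def f1_at_breakeven(scores, gold_labels):
        """F1 at the precision-recall break-even point (abuse = +1).
        Sweeps the threshold over observed scores and returns F1 at the
        threshold where |precision - recall| is smallest; at the exact
        break-even point F1 equals precision equals recall."""
        best_gap, best_f1 = float("inf"), 0.0
        for tau in sorted(set(scores)):
            tp = sum(1 for s, y in zip(scores, gold_labels) if s >= tau and y == +1)
            fp = sum(1 for s, y in zip(scores, gold_labels) if s >= tau and y == -1)
            fn = sum(1 for s, y in zip(scores, gold_labels) if s < tau and y == +1)
            if tp == 0:
                continue
            p, r = tp / (tp + fp), tp / (tp + fn)
            if abs(p - r) < best_gap:
                best_gap, best_f1 = abs(p - r), 2 * p * r / (p + r)
        return best_f1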

The results, shown in Table 7.5 (with 0.95 confidence intervals), show that user flags give slightly superior performance to the discriminative filters, and far out-perform the generative filters. An exception is the small getahead topic, where lack of training data made the filters much inferior to user flags. Furthermore, ROSVMs give improved performance on the contentious news task. However, in most cases, the confidence intervals for discriminative filters and user flags overlap, showing that these filters are able to approach the performance of user flagging. While it is encouraging that automated filters approach the effectiveness of user flags, as we discuss in Section 7.7 it is not necessary to choose between these approaches.

7.6.4 Filters Versus Dedicated Adjudicators

As noted before, the inter-annotator agreement for the human adjudicated examples was 0.742 ± 0.006. The discriminative classifiers approached this level of accuracy on the human adjudicated examples (a subset of the gold standard set). ROSVM and Perceptron with Margins both gave accuracy of 0.728 ± 0.018, and Logistic Regression had accuracy of 0.722 ± 0.018. For contrast, the user flag accuracy on the adjudicated examples was 0.689 ± 0.018. Thus, the discriminative filters agreed nearly as well with the human adjudications as the human adjudicators agreed with each other. This shows that these filters may be approaching the limits of human subjectivity on this task despite the noisy training labels.

7.7 Discussion

We have shown that effective filters for blog comment abuse may be trained using user flag labels, despite the noise inherent in this signal. Online SVM variants give best results, but other discriminative classifiers give nearly as strong performance and are considerably cheaper. Generative models, represented here by Naive Bayes, fare relatively poorly.

This filtering domain has proven to be considerably more difficult than the email spam filtering domain. This is because the labels for training are inherently noisy, there are more varied forms of abuse in this domain than in commercial email spam, and subtle cases of abuse and non-abuse are difficult to distinguish in this domain. Nevertheless, when plentiful training data is available, discriminative filters approach the performance of user flagging as a filtering methodology, and rival the effectiveness of dedicated human annotators. In the remainder of this section, we consider issues surrounding real world application of automated abuse filtering for blog comments, including plans for future work.

7.7.1 Two-Stage Filtering

In a real-world application, it is not necessary to choose between automated filtering and user flagging. They can be used in series, with an automated filter serving as a first level filter, and user flagging serving as a second level. Initial offline analysis using the gold standard set shows that this approach can cut the amount of abuse shown to users for flagging down to a third of current rates with little loss of non-abuse comments. Although live testing will be required for confirmation, it is our belief that limiting the amount of abuse shown to users will increase their ability to flag effectively. This conjecture is supported by the observation that the topics with the lowest amount of abuse in our corpus (Table 7.1) also had the most accurate user flagging (Table 7.4).

7.7.2 Feedback to and from Users

Of course, any comments that are automatically filtered will not be seen by users, and thus cannot be used for training the filter. This is an extension of the one-sided feedback scenario from Chapter 5. Exploring this effect in blog comment abuse filtering is an area for future work.

A larger concern is the feedback that is provided to users, who will immediately see when their comments are filtered out. This will allow abusers to adopt a trial-and-error approach to posting abuse, which may be detrimental to filtering performance. One possible alternative is to use a filter to adjust the flagging threshold needed to remove a comment from view. This way, comments predicted to be abusive would require fewer flags to be removed from view, and comments predicted to be non-abusive would require many more. This element of non-determinism would reduce abusers’ ability to break the filter, while maintaining the ability of a community to enforce its standards.
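As an illustration of this proposal, a deployment might map the filter's abuse score to the number of flags required before a comment is hidden; every constant in the sketch below is hypothetical rather than a value from any deployed system.

    def flags_required(filter_score, base_flags=3, min_flags=1, max_flags=10):
        """Number of user flags needed to remove a comment from view,
        adjusted by the filter's abuse score in [0, 1]. All constants
        here are illustrative assumptions."""
        if filter_score > 0.9:     # confidently abusive: remove quickly
            return min_flags
        if filter_score < 0.1:     # confidently benign: require strong consensus
            return max_flags
        return base_flags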

7.7.3 Individual Thresholds

Part of the noise in user-flag data is caused by the inherent subjectivity of the flagging task: different users have different standards for acceptable modes of discourse. As we have seen, this makes it difficult to create a single filter with a single threshold for defining abuse or non-abuse. However, enabling individualized filters for unique users is costly and inefficient. An alternative would be to allow users to select their own threshold for defining abuse. Some users would prefer to see all comments, regardless of obscenities, while others would prefer to be shown only the most civil. This information could be efficiently stored as a user preference, and used at serving time to hide particular comments. Furthermore, such a mechanism would enable the collection of more fine-grained flagging information. Comments flagged by tolerant users are likely to be highly offensive, and may be scored differently from comments that are flagged by sensitive users but unflagged by others. In this way, we could tailor the protection of civil discourse and free speech not only to a community, but to individual users as well.

Chapter 8

Conclusions

Some have considered the near-perfect classification performance attained in the idealized online filtering scenario by current machine learning methods, such as those presented in Chapters 2 and 3 of this dissertation, as reason to believe that the problem of spam filtering is essentially solved and deserving of little further research.

We strongly disagree with this sentiment. In this dissertation, we have furthered research in this field by examining the assumptions of the idealized filtering scenario. We have explored methods for reducing the requirements on human labeling effort, for coping with realistic scenarios where humans do not give feedback on predicted spam, and for dealing with noisy feedback in both the email and blog comment domains. Together, these advances enable more accurate spam filters in real world settings, with lower cost and greater robustness to noise.

However, important questions in the filtering domain remain open, requiring further investigation. We conclude this dissertation with a discussion of these open questions, as a door to future work.

8.1 How Can We Benefit from Unlabeled Data?

In Chapter 4, we showed how online active learning can be used in the online filtering scenario to dramatically reduce human labeling effort. Such methods select unlabeled examples to give to users for labeling; however, they do not make use of unlabeled data directly. As noted in Section 4.2.3, attempts to use semi-supervised learning methods to improve filtering performance with the use of unlabeled data have been essentially unsuccessful [67].

Thus, unlabeled data still seems to be an untapped resource. A typical large-scale spam filtering system for email spam may see billions of messages per day, only a small fraction of which are actually labeled. Future work should continue exploring ways of capitalizing on this vast resource of unlabeled data.

Possibilities that are, to our knowledge, unexplored in the published literature include the use of unlabeled data for dimensionality reduction along the lines of Latent Semantic Analysis [31] and the use of unlabeled data to provide additional information to generative models as a means of combating data sparsity [12].

8.2 How can we attain better user feedback?

Chapter 6 showed that class label noise from unreliable user feedback reduces filter performance. Chapters 6 and 7 both showed that natural label noise from human users is particularly problematic. In this dissertation, we have focused on ameliorating the impact of class label noise with modified machine learning algorithms. However, in practice, it may be more beneficial to modify the manner in which user feedback is collected.

One intriguing possibility along these lines is suggested by Luis von Ahn’s work in constructing online games to encourage humans to hand-label images for research in computer vision [92]. In this “ESP Game”, humans compete in teams of two, assigned randomly and anonymously online, and attempt to agree on words describing images that are shown to them simultaneously. The structure of the game incentivizes users to work by rewarding them with a point scoring system, and makes “cheating” impossible through the factors of anonymity, limited communication, and statistical tests comparing the results across groups of users [92]. This game has resulted in a high quality hand-labeled corpus of over 260,000 images [92].

One obvious application of this work would be to construct a similar “spam game” to acquire reliable ground truth labels for email messages. Other possibilities include tracking the reputation of human labelers, or creating incentive structures to encourage accurate labeling.

8.3 How can the academic research community gain access to larger scale, real world benchmark data sets?

The emergence of the three public TREC benchmark data sets from 2005, 2006, and 2007 has resulted in a significant improvement in filtering performance in the idealized online filtering scenario [24, 16, 18]. These data sets are the first public corpora of their size for spam filter benchmark testing, ranging from roughly 37,822 to 92,189 messages. However, even these corpora are small compared to the filtering load of typical email systems, which may handle millions or billions of messages per day. Furthermore, these corpora do not reflect natural labeling behavior by real users. This limits the applicability of academic research in this important domain.

We encourage industrial practitioners to participate in the academic spam filtering community through the construction and administration of large scale, real world benchmark data sets. Because of privacy issues, it is impractical that such data sets be made public. However, following the lead of the TREC competitions, it is possible for industrial practitioners to maintain such benchmark data sets privately and report aggregate results of new filtering methods without revealing private message content [24].

184 Bibliography

[1] The American Heritage Dictionary of the English Language, Fourth Edition.

spam. Houghton Mifflin, 2004.

[2] http://spamassassin.apache.org/tests 3 2 x.html, 2008.

[3] F. Assis. OSBF-lua – a text classification module for lua: the importance of the

training method. In The Fifteenth Text REtrieval Conference (TREC 2006)

Proceedings, 2006.

[4] S. Bickel. ECML-PKDD discovery challenge overview. In Proceedings of the

ECML/PKDD Discovery Challenge Workshop, 2006.

[5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-

training. In COLT’ 98: Proceedings of the eleventh annual conference on Com-

putational learning theory, 1998.

[6] A. Bratko, B. Filipiˇc, G. Cormack, T. Lynam, and B. Zupan. Spam filtering

using statistical data compression models. J. Mach. Learn. Res., 7, 2006.

[7] C. E. Brodley and M. A. Friedl. Identifying mislabeled training data. Journal

of Artificial Intelligence Research, (11), 1999.

185 [8] J. Carletta. Assessing agreement on classification tasks: the kappa statistic.

Comput. Linguist., 22(2), 1996.

[9] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector

machine learning. In NIPS, pages 409–415, 2000.

[10] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Worst-case analysis of selec-

tive sampling for linear classification. Journal of Machine Learning Research,

7:1205–1230, 2006.

[11] N. Cesa-Bianchi, G. Lugosi, and G. Stoltz. Minimizing regret with label efficient

prediction. In COLT, pages 77–92, 2004.

[12] A. Chapelle, B. Scholkopf, and A. Zien.

[13] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and

partial string matching. IEEE Transactions on Communications, 32(4), April

1984.

[14] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learn-

ing. Mach. Learn., 15(2):201–221, 1994.

[15] G. Cormack and T. Lynam. On-line supervised spam filter evaluation. ACM

Transactions on Information Systems, 25(3), July 2007.

[16] G. V. Cormack. TREC 2006 spam track overview. In TREC 2006: Proceedings

of the The Sixteenth Text REtrieval Conference, 2006.

[17] G. V. Cormack. Personal communication, 2007.

[18] G. V. Cormack. TREC 2007 spam track overview. In TREC 2007: Proceedings

of the The Sixteenth Text REtrieval Conference, 2007.

186 [19] G. V. Cormack. University of waterloo participlation in the TREC 2007 spam

track. In TREC 2007: Proceedings of the The Sixteenth Text REtrieval Con-

ference, 2007.

[20] G. V. Cormack and A. Bratko. Batch and on-line spam filter comparison. In

Proceedings of the Third Conference on Email and Anti-Spam (CEAS), 2006.

[21] G. V. Cormack, J. M. G. Hidalgo, and E. P. S´anz. Spam filtering for short

messages. In CIKM ’07: Proceedings of the sixteenth ACM conference on Con-

ference on information and knowledge management, 2007.

[22] G. V. Cormack and R. N. S. Horspool. Data compression using dynamic markov

modelling. Comput. J., 30(6), 1987.

[23] G. V. Cormack and T. R. Lynam. Spam corpus creation for TREC. In Pro-

ceedings of the Second Conference on Email and Anti-Spam (CEAS), 2005.

[24] G. V. Cormack and T. R. Lynam. TREC 2005 spam track overview. In The

Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, 2005.

[25] G. V. Cormack and T. R. Lynam. On-line supervised spam filter evaluation.

Technical report, David R. Cheriton School of Computer Science, University of

Waterloo, Canada, February 2006.

[26] L. F. Cranor and B. A. LaMacchia. Spam! Commun. ACM, 41(8), 1998.

[27] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines.

Cambridge University Press, 2000.

[28] S. Dasgupta. Analysis of a greedy active learning strategy. NIPS: Advances in

Neural Information Processing Systems, 2004.

187 [29] S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based

active learning. In COLT, pages 249–263, 2005.

[30] D. DeCoste and K. Wagstaff. Alpha seeding for support vector machines. In

KDD ’00: Proceedings of the sixth ACM SIGKDD international conference on

Knowledge discovery and data mining, pages 345–349, 2000.

[31] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A.

Harshman. Indexing by latent semantic analysis. Journal of the American

Society of Information Science, 41(6):391–407, 1990.

[32] P. J. Denning. ACM President’s letter: electronic junk. Commun. ACM, 25(3),

1982.

[33] T. G. Dietterich. Ensemble methods in machine learning. In MCS ’00: Pro-

ceedings of the First International Workshop on Multiple Classifier Systems,

2000.

[34] H. Drucker, V. Vapnik, and D. Wu. Support vector machines for spam catego-

rization. IEEE Transactions on Neural Networks, 10(5):1048–1054, 1999.

[35] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd Ed. Wiley

Interscience, 2001.

[36] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using

the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.

[37] J. Goodman, G. V. Cormack, and D. Heckerman. Spam and the ongoing battle

for the inbox. Commun. ACM, 50(2), 2007.

[38] J. Goodman and W. Yin. Online discriminative spam filter training. In Pro-

ceedings of the Third Conference on Email and Anti-Spam (CEAS), 2006.

188 [39] P. Graham. A plan for spam. 2002.

[40] P. Graham. Better bayesian filtering. 2003.

[41] J. Graham-Cumming. There’s one born every minute: spam and .

http://www.jgc.org/blog/2006 05 01 archive.html, 2006.

[42] D. Helmbold and S. Panizza. Some label efficient learning results. In COLT ’97:

Proceedings of the tenth annual conference on Computational learning theory,

pages 218–230, 1997.

[43] D. P. Helmbold, N. Littlestone, and P. M. Long. Apple tasting. Inf. Comput.,

161(2):85–139, 2000.

[44] S. Hershkop and S. J. Stolfo. Combining email models for false positive re-

duction. In KDD ’05: Proceeding of the eleventh ACM SIGKDD international

conference on Knowledge discovery in data mining, pages 98–107, 2005.

[45] T. Joachims. Text categorization with suport vector machines: Learning with

many relevant features. In ECML ’98: Proceedings of the 10th European Con-

ference on Machine Learning, pages 137–142, 1998.

[46] T. Joachims. Transductive inference for text classification using support vec-

tor machines. In Proceedings of ICML-99, 16th International Conference on

Machine Learning, 1999.

[47] T. Joachims. Training linear svms in linear time. In KDD ’06: Proceedings of

the 12th ACM SIGKDD international conference on Knowledge discovery and

data mining, pages 217–226, 2006.

[48] R. Khardon and G. Wachman. Noise tolerant variants of the perceptron algo-

rithm. J. Mach. Learn. Res., 8, 2007.

189 [49] J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels. In

Advances in Neural Information Processing Systems 14, pages 785–793. MIT

Press, 2002.

[50] P. Kolari, T. Finin, and A. Joshi. SVMs for the : Blog identification

and splog detection. AAAI Spring Symposium on Computational Approaches

to Analyzing Weblogs, 2006.

[51] A. Kolcz and J. Alspector. SVM-based filtering of e-mail spam with content-

specific misclassification costs. In Proceedings of the TextDM’01 Workshop on

Text Mining - held at the 2001 IEEE International Conference on Data Mining,

2001.

[52] W. Krauth and M. M´ezard. Learning algorithms with optimal stability in

neural networks. Journal of Physics A, 20(11):745–752, 1987.

[53] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for

svm protein classification. Proceedings of the Pacific Symposium on Biocom-

puting, pages 564–575, January 2002.

[54] C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for

svm protein classification. Neural Information Processing Systems, (15):1441–

1448, 2003.

[55] C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein

sequences. Journal of Machine Learning Research, (5):1435–1455, 2004.

[56] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classi-

fiers. In SIGIR ’94: Proceedings of the 17th annual international ACM SIGIR

conference on Research and development in information retrieval, pages 3–12,

1994.

[57] K. Li, K. Li, H. Huang, and S. Tian. Active learning with simplified svms for

spam categorization. In International Conference on Machine Learning and

Cybernetics, volume 3, pages 1198–1202, 2002.

[58] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog de-

tection using self-similarity analysis on blog temporal dynamics. In AIRWeb

’07: Proceedings of the 3rd international workshop on Adversarial information

retrieval on the web, 2007.

[59] N. Littlestone. Learning quickly when irrelevant attributes abound: A new

linear-threshold algorithm. Mach. Learn., 2(4):285–318, 1988.

[60] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins.

Text classification using string kernels. Journal of Machine Learning Research,

(2):419–444, 2002.

[61] D. Lowd and C. Meek. Good word attacks on statistical spam filters. In

Proceedings of the Second Conference on Email and Anti-Spam (CEAS), 2005.

[62] T. Lynam and G. Cormack. On-line spam filter fusion. In SIGIR

’06: Proceedings of the 29th annual international ACM SIGIR conference on

Research and development in information retrieval, pages 123–130, 2006.

[63] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive

bayes – which naive bayes? Third Conference on Email and Anti-Spam

(CEAS), 2006.

[64] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model

disagreement. Proceedings of the 1st International Workshop on Adversarial

Information Retrieval on the Web (AIRWeb), May 2005.

[65] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[66] T. M. Mitchell. Generative and discriminative classifiers: Naive bayes and

logistic regression. In Machine Learning. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf, 2005.

[67] M. Mojdeh and G. Cormack. Semi-supervised spam filtering: Does it work? In

The Thirty-First Annual ACM SIGIR Conference Proceedings, 2008.

[68] A. Ng and M. Jordan. On discriminative vs. generative classifiers: A compar-

ison of logistic regression and naive bayes. Advances in Neural Information

Processing Systems, (14), 2002.

[69] J. Platt. Sequential minimal optimization: A fast algorithm for training support

vector machines. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances

in Kernel Methods - Support Vector Learning. MIT Press, 1998.

[70] J. Platt. Probabilities for sv machines. In A. Smola, P. Bartlett, B. Schölkopf,

and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74.

MIT Press, 1999.

[71] U. Rebbapragada and C. E. Brodley. Class noise mitigation through instance

weighting. In ECML 2007: 18th European Conference on Machine Learning,

2007.

[72] G. Rios and H. Zha. Exploring support vector machines and random forests for

spam detection. In Proceedings of the First Conference on Email and Anti-Spam

(CEAS), 2004.

[73] F. Rosenblatt. The perceptron: A probabilistic model for information storage

and organization in the brain. Psychological Review, 65:386–407, 1958.

[74] N. Roy and A. McCallum. Toward optimal active learning through sampling

estimation of error reduction. In ICML ’01: Proceedings of the Eighteenth

International Conference on Machine Learning, pages 441–448, 2001.

[75] G. Salton and C. Buckley. Improving retrieval performance by relevance feed-

back. Readings in information retrieval, pages 355–364, 1997.

[76] S. L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended

approach. Data Min. Knowl. Discov., 1(3), 1997.

[77] G. Schohn and D. Cohn. Less is more: Active learning with support vector ma-

chines. In ICML ’00: Proceedings of the Seventeenth International Conference

on Machine Learning, pages 839–846, 2000.

[78] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines,

Regularization, Optimization, and Beyond. MIT Press, 2001.

[79] D. Sculley. Online active learning methods for fast label-efficient spam filtering.

In CEAS 2007: Proceedings of the Fourth Conference on Email and Anti-Spam, 2007.

[80] D. Sculley. Practical learning from one-sided feedback. In KDD ’07: Proceedings

of the 13th ACM SIGKDD international conference on Knowledge discovery

and data mining, 2007.

[81] D. Sculley. On free speech and civil discourse: filtering abuse in blog comments.

Under review at CEAS, 2008.

[82] D. Sculley and C. E. Brodley. Compression and machine learning: A new

perspective on feature space vectors. DCC: Data Compression Conference,

2006.

[83] D. Sculley and G. V. Cormack. Filtering spam in the presence of noisy user

feedback. Under review at CEAS, 2008.

[84] D. Sculley and G. Wachman. Relaxed online SVMs for spam filtering. In The

Thirtieth Annual ACM SIGIR Conference Proceedings, 2007.

[85] D. Sculley and G. Wachman. Relaxed online SVMs in the TREC spam filtering

track. In The Sixteenth Text REtrieval Conference (TREC 2007) Proceedings,

2007.

[86] D. Sculley, G. Wachman, and C. Brodley. Spam filtering using inexact string

matching in explicit feature space with on-line linear classifiers. In The Fifteenth

Text REtrieval Conference (TREC 2006) Proceedings, 2006.

[87] R. Segal, T. Markowitz, and W. Arnold. Fast uncertainty sampling for labeling

large e-mail corpora. In CEAS 2006: Conference on Email and Anti-Spam,

2006.

[88] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication.

University of Illinois Press, 1949.

[89] S. Sonnenburg, G. Rätsch, and B. Schölkopf. Large scale genomic

sequence svm classifiers. In ICML ’05: Proceedings of the 22nd international

conference on Machine learning, pages 848–855, 2005.

[90] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT

Press, 1998.

[91] S. Tong and D. Koller. Support vector machine active learning with applications

to text classification. Journal of Machine Learning Research, 2:45–66, 2002.

[92] L. von Ahn and L. Dabbish. Labeling images with a computer game. In CHI

’04: Proceedings of the SIGCHI conference on Human factors in computing

systems, 2004.

[93] G. L. Wittel and S. F. Wu. On attacking statistical spam filters. CEAS: First

Conference on Email and Anti-Spam, 2004.

[94] I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and

Techniques, 2nd ed. Morgan Kaufmann, 2005.

[95] I. H. Witten, R. M. Neal, and J. G. Cleary. Arithmetic coding for data com-

pression. Commun. ACM, 30(6), 1987.

[96] W. Yih, J. Goodman, and G. Hulten. Learning at low false positive rates. In Proceedings of the Third Conference on Email and Anti-Spam (CEAS), 2006.

[97] W. Yih, R. McCann, and A. Kolcz. Improving spam filtering by detecting gray

mail. In CEAS 2007: Proceedings of the Fourth Conference on Email and Anti-Spam, 2007.

[98] X. Zeng and T. Martinez. An algorithm for correcting mislabeled data. Intel-

ligent Data Analysis, 5(1), 2001.

[99] X. Zhu and X. Wu. Class noise vs. attribute noise: a quantitative study of their

impacts. Artif. Intell. Rev., 22(3), 2004.
