“In Vivo” Spam Filtering: a Challenge Problem For
Total Page:16
File Type:pdf, Size:1020Kb
“In vivo” spam filtering: A challenge problem for KDD Tom Fawcett HewlettPackard Laboratories 1501 Page Mill Road Palo Alto, CA USA [email protected] ABSTRACT vivo" we mean the problem as it is truly faced in an op- erating environment, that is, by an on-line filter on a mail Spam, also known as Unsolicited Commercial Email (UCE), account that receives realistic feeds of email over time, and is the bane of email communication. Many data mining serves a human user. In this context, spam filtering faces researchers have addressed the problem of detecting spam, issues of skewed and changing class distributions; unequal generally by treating it as a static text classification prob- and uncertain error costs; complex text patterns; a complex, lem. True in vivo spam filtering has characteristics that disjunctive and drifting target concept; and challenges of make it a rich and challenging domain for data mining. In- intelligent, adaptive adversaries. Many real-world domains deed, real-world datasets with these characteristics are typ- share these characteristics and would benefit indirectly by ically difficult to acquire and to share. This paper demon- work on spam filtering. strates some of these characteristics and argues that re- searchers should pursue in vivo spam filtering as an accessi- Improving spam filtering is a worthy goal in itself, but this ble domain for investigating them. paper takes the (admittedly selfish) position that data min- ing researchers should study the problem for the benefit of data mining. It is unclear whether spam filtering ef- General Terms forts could genuinely benefit from data mining research. On the other hand, one of the persistent difficulties of research spam, text classification, challenge problems, class skew, im- in many real-world domains is that of acquiring and shar- balanced data, cost-sensitive learning, data streams, concept ing datasets. Most companies, for example, do not release drift datasets containing real customer transactions; we are aware of no public domain datasets containing genuine fraudu- 1. INTRODUCTION lent transactions for studying fraud detection. Even sharing Spam, also known as Unsolicited Commercial Email (UCE) such data between partner companies usually requires for- and Unsolicited Bulk Email (UBE), is commonplace every- mal non-disclosure agreements. In other domains datasets where in email communication1. Spam is a costly problem may still have copyright or privacy issues. Few datasets and many experts agree it is only getting worse [7; 24; 31; 34; involving concept drift or changing class distributions are 14]. Because of the economics of spam and the difficulties publicly available. Without such datasets, the ability to inherent in stopping it, it is unlikely to go away soon. replicate results and compare algorithm performance is hin- Many data mining and machine learning researchers have dered and progress on these research topics will be impaired. worked on spam detection and filtering, commonly treat- Spam data are easily accessible and shareable, which makes ing it as a basic text classification problem. The problem spam filtering a good domain testbed for investigating many is popular enough that it has been the subject of a Data of the same issues. Mining Cup contest [10] as well as numerous class projects. The remainder of the paper enumerates these research issues Bayesian analysis has been very popular [28; 30; 16; 3], but and describes how they are manifested in in vivo spam filter- researchers have also used SVMs [20], decisions trees [4], ing. The final section of the paper describes how researchers memory and case-based reasoning [29; 8], rule learning [27] could begin exploring the domain. and even genetic programming [19]. But researchers who treat spam filtering as an isolated text 2. CHALLENGES classification task have only addressed a portion of the prob- lem. This paper argues that real-world in vivo spam filtering 2.1 Skewed and drifting class distributions is a rich and challenging problem for data mining. By \in Like most text classification domains, spam presents the 1 problem of a skewed class distribution, i.e., the proportion of The term \spam" is sometimes used loosely to mean any spam to legitimate email is uneven. There are no generally message broadcast to multiple senders (regardless of intent) or any message that is undesired. Here we intend the nar- agreed upon class priors for this problem. G´omez Hidalgo rower, stricter definition: unsolicited commercial email sent [15] points out that the proportion of spam messages re- to an account by a person unacquainted with the recipient. ported in research datasets varies considerably, from 16.6% to 88.2%. This may be simply because the proportion varies considerably from one individual to another. The amount of spam received depends on the email address, the degree SIGKDD Explorations. Volume 5,Issue 2 - Page 140 300 30 250 25 200 20 150 15 100 10 50 5 Number of messages per week Number of spam messages per week 0 0 0 10 20 30 40 50 0 10 20 30 40 50 Week of 2002 Week of 2002 (a) Spam messages (b) Legitimate messages Figure 1: Weekly variation in message traffic, spam versus legitimate email 1 0.9 0.8 p(spam) 0.7 0.6 0.5 0 10 20 30 40 50 Week of 2002 Figure 2: SpamCop: Spam forwarded and reports sent Figure 3: Drifting priors: weekly estimates of p(spam) taken from data in figure 1. of exposure, the amount of time the address has been pub- lic and the upstream filtering. The amount of legitimate several datasets we can make a case that spam priors change email received similarly varies greatly from one individual significantly over time. to another. Figure 1a shows a graph of spam volume received in 2002 Perhaps more importantly, spam varies over time as well. by Paul Wouters of Xtended Internet2. In 2002 the spam This was demonstrated dramatically in 2002 when a large volume was 146 55 messages per week, indicating a great number of open relays and open proxies were brought on-line deal of variation in spite of its upward trend. For most in Asian countries, primarily Korea and China. Such a large people, the volume of the legitimate email received varies as new pool of unprotected machines provided great opportu- well. Figure 1b shows a graph of the number of legitimate nities for spammers, and soon email servers throughout the messages saved by the author over the weeks in 2002. The world experienced a huge surge in the amount of spam they volume is 12:3 6:4 messages per week. forwarded and received. The problem became so bad that Figure 2 shows the volume of reports issued from Spam- for a brief time all email from certain Asian countries was Cop's website3 This graph also demonstrates some of spam's blocked completely by some ISPs [9]. episodic nature. SpamCop is a service used by many peo- In spite of claims that spam is generally increasing [7; 24; ple to filter spam and to submit reports (complaints) to the 5], the volume varies considerably and non-monotonically originators of spam. Both the amount of spam submitted on a daily or weekly scale. Calculating spam proportion and the number of reports sent show clear episodic behavior. even approximately is difficult. Although some public spam These graphs show time variation in both the volume of datasets are available (see Appendix A), we are aware of no personal email datasets arranged over time, so it is difficult 2http://spamarchive.xtdnet.nl/ to match the two to establish priors. Nevertheless, using 3http://www.spamcop.net/spamstats.shtml SIGKDD Explorations. Volume 5,Issue 2 - Page 141 1 0.9 0.8 0.7 0.6 0.5 p(spam) 0.4 0.3 0.2 0.1 0 0 10 20 30 Day (a) (b) Figure 4: Email volume from Eide's trial, (a) absolute volume, (b) resulting prior p(spam). spam and the volume of legitimate email received, something If we assume that the cost of a false positive (that is, of that researchers have not generally acknowledged. Since the classifying a legitimate message as spam) is about ten times two sources of email|senders of spam and senders of legiti- that of a false negative, this reduces to mate email|are independent parties with little in common, p(spam) we can expect their variation to be statistically uncorrelated, PCFspam = and the class priors will vary over time. No fixed prior will p(spam) + p(legit) × 10 be correct. How much could we expect class priors to vary? If we assume that a user received the spam shown in figure 1a and the Using the p(spam) range from figure 4b, the PCF range of legitimate email shown in figure 1b, we can estimate the class interest for spam filtering is :04 ≤ PCFspam ≤ :47. Since the prior p(spam) simply as the proportion of weekly messages entire PCF range is [0,1], this means nearly half of cost space that are spam. Figure 3 shows a graph of this value, which is influenced by this variation in priors. Any classifier whose ranges between about .67 to .99. performance lies within this 44% could be a competitive A further demonstration of changing priors appears in Kris- solution. This is a wide range, and it is reasonable to expect tian Eide's study of bayesian spam filters [12]. In evaluating classifier superiority to vary within it. these filters he measured the volume of spam and legitimate The purpose of this analysis is not to call into question the email he received over the course of one month. These vol- validity of prior work, but to point out that changing class umes are graphed in figure 4a, and the computed daily spam distributions are a reality in this domain and their influence prior is graphed in figure 4b.