Tightening the net: a review of current and next generation spam filtering tools

James Carpinter & Ray Hunt∗
Department of Computer Science and Software Engineering, University of Canterbury

Abstract

This paper provides an overview of current and potential future spam filtering approaches. We examine the problems spam introduces, what spam is and how we can measure it. The paper primarily focuses on automated, non-interactive filters, with a broad review ranging from commercial implementations to ideas confined to current research papers. Both machine learning and non-machine learning based filters are reviewed as potential solutions, and a taxonomy of known approaches is presented. While a range of different techniques have been and continue to be evaluated in academic research, heuristic and Bayesian filtering dominate commercial filtering systems; therefore, a case study of these techniques is presented to demonstrate and evaluate their effectiveness.

Keywords: spam, ham, heuristics, machine learning, non-machine learning, Bayesian filtering, blacklisting.

1 Introduction

The first message recognised as spam was sent to the users of Arpanet in 1978 and represented little more than an annoyance. Today, email is a fundamental tool for business communication and modern life, and spam represents a serious threat to user productivity and IT infrastructure worldwide. While it is difficult to quantify the level of spam currently sent, many reports suggest it represents substantially more than half of all email sent, and predict further growth for the foreseeable future [18, 43, 30].

For some, spam represents a minor irritant; for others, a major threat to productivity. According to a recent study by Stanford University [36], the average Internet user loses ten working days each year dealing with incoming spam. Costs beyond those incurred sorting legitimate email from spam are also present: 15% of all email contains some type of virus payload, and one in 3,418 emails contains pornographic images particularly harmful to minors [54]. It is difficult to estimate the ultimate dollar cost of such expenses; however, most estimates place the worldwide cost of spam in 2005, in terms of lost productivity and IT infrastructure investment, at well over US$10 billion [29, 52].

The magnitude of the problem has introduced a new dimension to the use of email: the spam filter. Such systems can be expensive to deploy and maintain, placing a further strain on IT budgets. While the reduced flow of spam email into a user's inbox is generally welcomed, the existence of false positives often necessitates the user manually double-checking filtered messages; this reality somewhat counteracts the assistance the filter delivers. The effectiveness of spam filters in improving user productivity is ultimately limited by the extent to which users must manually review filtered messages for false positives.

∗ email: [email protected]

Unfortunately, the underlying business model of bulk emailers (spammers) is simply too attractive. Commissions to spammers of 25–50% on products sold are not unusual [30]. On a collection of 200 million email addresses, a response rate of 0.001% would yield a spammer a return of $25,000, given a $50 product. Any solution to this problem must reduce the profitability of the underlying business model, by either substantially reducing the number of emails reaching valid recipients or increasing the expenses faced by the spammer.

Regrettably, no solution has yet been found to this vexing problem. The classification task is complex and constantly changing. Constructing a single model to classify the broad range of spam types is difficult; this task is made near impossible with the realisation that spam types are constantly moving and evolving. Furthermore, most users find false positives unacceptable. The active evolution of spam can be partially attributed to changing tastes and trends in the marketplace; however, spammers often actively tailor their messages to avoid detection, adding a further impediment to accurate detection.

The similarities between junk postal mail and spam can be immediately recognised; however, the nature of the Internet has allowed spam to grow uncontrollably. Spam can be sent with no cost to the sender: the economic realities that regulate junk postal mail do not apply to the Internet. Furthermore, the legal remedies that can be taken against spammers are limited: it is not difficult to avoid leaving a trace, and spammers easily operate outside the jurisdiction of those countries with anti-spam legislation.

The remainder of this section provides supporting material on the topic of spam. Section 2 provides an overview of spam classification techniques. Sections 3.1 and 3.2 provide a more detailed discussion of the known spam filtering techniques; given the rapidly evolving nature of this field, it should be considered a snapshot of the critical areas of current research. Section 4 details the evaluation of spam filters, including a case study of the PreciseMail Anti-Spam system operating at the University of Canterbury. Section 5 finishes the paper with some conclusions on the state of this research area.

1.1 Definition

Spam is briefly defined by the TREC 2005 Spam Track as "unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient" [12]. The key elements of this definition are expanded on in a more extensive definition provided by Mail Abuse Prevention Systems [35], which specifies three requirements for a message to be classified as spam. Firstly, the message must be equally applicable to many other potential recipients (i.e. the identity of the recipient and the context of the message is irrelevant). Secondly, the recipient has not granted 'deliberate, explicit and still-revocable permission for it to be sent'. Finally, the communication of the message gives a 'disproportionate benefit' to the sender, as solely determined by the recipient. Critically, they note that simple personalisation does not make the identity of the sender relevant, and that failure by the user to explicitly opt out during a registration process does not form consent.

Both these definitions identify the predominant characteristic of spam email: that a user receives unsolicited email that has been sent without any concern for their identity.

1.2 Solution strategies

Proposed solutions to spam can be separated into three broad categories: legislation, protocol change and filtering.

A number of governments have enacted legislation prohibiting the sending of spam email, including the USA (Can Spam Act 2004) and the EU (directive 2002/58/EC).

American legislation requires an 'opt-out' list that bulk mailers are required to provide; this is arguably less effective than the European (and Australian) approach of requiring explicit 'opt-in' requests from consumers wanting to receive such emails. At present, legislation has appeared to have little effect on spam volumes, with some arguing that the law has contributed to an increase in spam by giving bulk advertisers permission to send spam, as long as certain rules were followed.

Many proposals to change the way in which we send email have been put forward, including the required authentication of all senders, a per-email charge and a method of encapsulating policy within the email address [28]. Such proposals, while often providing a near complete solution, generally fail to gain support given the scope of a major upgrade or replacement of existing email protocols.

Interactive filters, often referred to as 'challenge-response' (C/R) systems, intercept incoming emails from unknown senders or those suspected of being spam. These messages are held by the recipient's email server, which issues a simple challenge to the sender to establish that the email came from a human sender rather than a bulk mailer. The underlying belief is that spammers will be uninterested in completing the 'challenge' given the huge volume of messages they send; furthermore, if a fake email address is used by the sender, they will not receive the challenge. Selective C/R systems issue a challenge only when the (non-interactive) spam filter is unable to determine the class of a message. Challenge-response systems do slow down the delivery of messages, and many people refuse to use them (a cynical consideration of this approach may conclude that the recipient considers their time to be of more value than the sender's).

Non-interactive filters classify emails without human interaction (such as that seen in C/R systems). Such filters often permit user interaction with the filter to customise user-specific options or to correct filter misclassifications; however, no human element is required during the initial classification decision. Such systems represent the most common solution to the spam problem, precisely because of their capacity to execute their task without supervision and without requiring a fundamental change in underlying email protocols.

1.3 Statistical evaluation

Common experimental measures include spam recall (SR), spam precision (SP), F1 and accuracy (A); figure 1 gives formal definitions:

    SR = (# spam correctly classified) / (total # of spam messages)
    SP = (# spam correctly classified) / (total # of messages classified as spam)
    F1 = (2 × SP × SR) / (SP + SR)
    A  = (# emails correctly classified) / (total # of emails)

Figure 1: Common experimental measures for the evaluation of spam filters.

Spam recall is effectively spam accuracy. A legitimate email classified as spam is considered to be a 'false positive'; conversely, a spam message classified as legitimate is considered to be a 'false negative'.

The accuracy measure, while often quoted by product vendors, is generally not useful when evaluating anti-spam solutions. The level of misclassifications (1 − A) consists of both false positives and false negatives; clearly, a 99% accuracy rate with 1% false negatives (and no false positives) is preferable to the same level of accuracy with 1% false positives (and no false negatives). The level of false positives and false negatives is of more interest than total system accuracy. Furthermore, accuracy can be severely distorted by the composition of the corpus: if the false positive and false negative rates differ, overall accuracy will largely be determined by the ratio of legitimate email to spam.
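The measures in figure 1 follow directly from the four counts of a confusion matrix; a minimal sketch (function and variable names are illustrative):

```python
def spam_filter_measures(tp, fp, tn, fn):
    """Compute the figure 1 measures from a confusion matrix.

    tp: spam correctly classified as spam
    fp: legitimate email wrongly classified as spam (false positives)
    tn: legitimate email correctly classified
    fn: spam wrongly classified as legitimate (false negatives)
    """
    sr = tp / (tp + fn)                  # spam recall: share of all spam caught
    sp = tp / (tp + fp)                  # spam precision: share of flagged mail that is spam
    f1 = 2 * sp * sr / (sp + sr)         # harmonic mean of precision and recall
    a = (tp + tn) / (tp + fp + tn + fn)  # overall accuracy
    return sr, sp, f1, a
```

Note how two filters with identical A can differ sharply in fp and fn, which is exactly why accuracy alone is a poor yardstick.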

A clear trade-off exists between the false positive and false negative statistics: reducing false positives often means letting more spam through the filter. Therefore, the reported levels of either statistic will be significantly affected by the classification threshold employed during the evaluation. False positives are regarded as having a greater cost than false negatives; cost-sensitive evaluation can be used to reflect this difference. This imbalance is reflected in the λ term: misclassification of a legitimate email as spam is considered to be λ times as costly as misclassifying a spam email as legitimate. λ values of 1, 9 and 999 are often used [47, 26] to represent the cost differential between false positives and false negatives; however, no evidence exists [26] to support the assumption that a false positive is 9 or 999 times more costly than a false negative. The value of λ is difficult to quantify, as it depends largely on the likelihood of a user noticing a misclassification and on the importance of the email in question. The definition and measurement of this cost imbalance (λ) is an open research problem.

The recall measure (see figure 1) defines the number of relevant documents identified as a percentage of all relevant documents; this measures a spam filter's ability to accurately identify spam (as 1 − SR is the false negative rate). The precision measure defines the number of relevant documents identified as a percentage of all documents identified; this shows the noise that the filter presents to the user (i.e. how many of the messages classified as spam will actually be spam). A trade-off, similar to that between false positives and negatives, exists between recall and precision. F1 is the harmonic mean of the recall and precision measures and combines both into a single measure.

As an alternative measure, Hidalgo [26] suggests ROC curves (Receiver Operating Characteristics). The curve shows the trade-off between true positives and false positives as the classification threshold parameter within the filter is varied. If the curve corresponding to one filter is uniformly above that corresponding to another, it is reasonable to infer that its performance exceeds that of the other for any combination of evaluation weights and external factors [10]; the performance differential can be quantified using the area under the ROC curves. The area represents the probability that a randomly selected spam message will receive a higher 'score' than a randomly selected legitimate email message, where the 'score' is an indication of the likelihood that the message is spam.

2 Overview

Filter classification strategies can be broadly separated into two categories: those based on machine learning (ML) principles and those not based on ML (see figure 2). Traditional filter techniques, such as heuristics, blacklisting and signatures, have been complemented in recent years with new, ML-based technologies. In the last 3–4 years, a substantial academic research effort has taken place to evaluate new ML-based approaches to filtering spam; however, this work is ongoing.

ML filtering techniques can be further categorised (see figure 2) into complete and complementary solutions. Complementary solutions are designed to work as a component of a larger filtering system, offering support to the primary filter (whether it be ML or non-ML based). Complete solutions aim to construct a comprehensive knowledge base that allows them to classify all incoming messages independently. Such complete solutions come in a variety of flavours: some aim to build a unified model, some compare incoming email to previous examples (previous likeness), while others use a collaborative approach, combining multiple classifiers to evaluate email (ensemble).
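The area-under-the-ROC-curve measure introduced in section 1.3 can be estimated directly from its probability interpretation: compare every (spam, legitimate) score pair and count how often the spam message scores higher, with ties counting half. A minimal sketch (scores are assumed to be spam-likelihood values):

```python
def auc_estimate(spam_scores, ham_scores):
    """Estimate the ROC area as the probability that a randomly selected
    spam message outscores a randomly selected legitimate message."""
    wins = 0.0
    for s in spam_scores:
        for h in ham_scores:
            if s > h:
                wins += 1.0      # spam correctly ranked above ham
            elif s == h:
                wins += 0.5      # ties count half
    return wins / (len(spam_scores) * len(ham_scores))
```

This quadratic pairwise comparison is fine for illustration; production evaluation tools compute the same quantity from a sorted ranking.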

Figure 2: Classification of the various approaches to spam filtering detailed in section 2.

Filtering solutions operate at one of two levels: at the mail server or as part of the user's mail program. Server-level filters examine the complete incoming email stream, and filter it based on a universal rule set for all users. Advantages of such an approach include centralised administration and maintenance, limited demands on the end user, and the ability to reject or discard email before it reaches the destination.

User-level filters are based on a user's terminal, filtering incoming email from the network mail server as it arrives. They often form a part of a user's email program. ML-based solutions often work best when placed at the user level [19], as the user is able to correct misclassifications and adjust rule sets.

Spam filtering systems can operate either on-site or off-site. On-site solutions can give local IT administrators greater control and more customisation options, in addition to relieving any security worries about redirecting email off-site for filtering. According to Cain [5], of the META Group, it is likely that on-site solutions are cheaper than their service (off-site) counterparts. He estimates on-premises solutions have a cost of US$6–12 per user (based on one gateway server and 10,000 users), compared to a cost of US$12–24 per user for a similar hosted (off-site) solution. On-site filtering can take place at both the hardware and software level.

Software-based filters comprise many commercial and most open source products, which can operate at either the server or user level. Many software implementations will operate on a variety of hardware and software combinations [49].

Appliance (hardware-based) on-site solutions use a piece of hardware dedicated to email filtering. These are generally quicker to deploy than a similar software-based solution, given that the device is likely to be transparent to network traffic [37]. The appliance is likely to contain optimised hardware for spam filtering, leading to potentially better performance than a general-purpose machine running a software-based solution. Furthermore, general-purpose platforms, and in particular their operating systems, may have inherent security vulnerabilities: appliances may have pre-hardened operating systems [8].

Off-site solutions (service) are based on the subscribing organisation redirecting their MX records (mail exchange records are found in a domain name database and specify the email server used for handling emails addressed to that domain) to the off-site vendor, who then filters the incoming email stream, before redirecting the email back to the subscriber [41].

Theoretically, spam email will never enter the subscriber's network. Given that the organisation's email traffic will flow through external data centres, this raises some security issues: some vendors will only process incoming email in memory, while others will store it to disk [5]. Initial setup of an off-site filter is substantially quicker: it can be operational within a week, while similar software solutions can take IT staff between 4–8 weeks to install, tune and test [5]. Off-site solutions require only a supervisory presence from local IT staff and no upfront hardware or software investments, in exchange for a monthly fee.

3 Filter technologies

3.1 Non-machine learning

3.1.1 Heuristics

Heuristic, or rule-based, analysis uses regular expression rules to detect phrases or characteristics that are common to spam; the quantity and seriousness of the spam features identified will suggest the appropriate classification for the message. The historical and current popularity of this technology has largely been driven by its simplicity, speed and consistent accuracy. Furthermore, it is superior to many advanced filtering technologies in the sense that it does not require a training period.

However, in light of new filtering technologies, it has several drawbacks. It is based on a static rule set: the system cannot adapt the filter to identify emerging spam characteristics. This requires the administrator to construct new detection heuristics or regularly download new generic rule files. The rule set used by a particular product will be well known: it will be largely identical across all installation sites. Therefore, if a spammer can craft a message to penetrate the filter of a particular vendor, their messages will pass unhindered to all mail servers using that particular filter. Open source heuristic filters provide both the filter and the rule set for download, allowing the spammer to test their message for its penetration ability.

Graham [22] acknowledges the potentially high levels of accuracy achievable by heuristic filters, but believes that as they are tuned to achieve near 100% accuracy, an unacceptable level of false positives will result. This prompted his investigation of Bayesian filtering (see sections 3.2.1 and 4.2).

3.1.2 Signatures

Signature-based techniques generate a unique hash value (signature) for each known spam message. Signature filters compare the hash value of an incoming email against all stored hash values of previously identified spam emails to classify the email. Signature generation techniques make it statistically improbable for a legitimate email message to have the same hash as a spam message. This allows signature filters to achieve a very low level of false positives.

Cloudmark (http://www.cloudmark.com) provides a commercial implementation of a signature filter, integrating with the network mail server and communicating with the Cloudmark server to submit and receive spam signatures. Vipul's Razor (http://razor.sourceforge.net) is an open source alternative, using a distributed, collaborative mechanism to distribute signatures with appropriate trust safeguards that prohibit the network's penetration by a malicious spammer.

However, signature-based filters are unable to identify spam emails until such time as the email has been reported as spam and its hash distributed. Furthermore, if the signature distribution network is disabled, local filters will be unable to catch newly created spam messages.

Simple signature matching filters are trivial for spammers to work around. By inserting a string of random characters in each spam message sent, the hash value of each message will be changed.
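The rule-based scoring described in section 3.1.1 can be sketched as a list of weighted regular expressions compared against a threshold; the rules, weights and threshold below are invented purely for illustration and are not drawn from any real product:

```python
import re

# Illustrative rules only: real products ship far larger, regularly
# updated rule files with carefully tuned weights.
RULES = [
    (re.compile(r"viagra|cialis", re.I), 2.5),
    (re.compile(r"100% free", re.I), 1.5),
    (re.compile(r"click here", re.I), 1.0),
    (re.compile(r"^Subject:.*!{3,}", re.I | re.M), 1.2),
]
THRESHOLD = 3.0  # messages scoring at or above this are marked as spam

def heuristic_score(message: str) -> float:
    # Sum the weights of every rule whose pattern occurs in the message.
    return sum(weight for pattern, weight in RULES if pattern.search(message))

def is_spam(message: str) -> bool:
    return heuristic_score(message) >= THRESHOLD
```

The static nature of such a rule set, identical at every installation, is exactly the weakness discussed above: a spammer can tune a message against it offline.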

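An exact-match signature filter, in the spirit of section 3.1.2, can be sketched as a shared set of message hashes; the class and method names are illustrative, and production systems use fuzzy rather than exact hashes:

```python
import hashlib

class SignatureFilter:
    """Exact-match signature filter: trivially evaded by random padding,
    which is what motivates the advanced hashing discussed in the text."""

    def __init__(self):
        self.known_spam = set()  # shared database of spam signatures

    def signature(self, message: str) -> str:
        return hashlib.sha256(message.encode("utf-8")).hexdigest()

    def report_spam(self, message: str) -> None:
        # In Razor-style systems this would submit the hash to a shared network.
        self.known_spam.add(self.signature(message))

    def is_spam(self, message: str) -> bool:
        return self.signature(message) in self.known_spam
```

Appending even a few random characters to a message changes its SHA-256 value entirely, so the padded copy sails past the filter.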
This evasion has led to new, advanced hashing techniques, which can continue to match spam messages that have minor changes aimed at disguising the message.

Spammers do have a window of opportunity to promote their messages before a signature is created and propagated amongst users. Furthermore, for the signature filter to remain efficient, the database of spam hashes has to be properly managed; the most common technique is to remove older hashes [42]. Once the spammer's message hash has been removed from the network, they can resume sending their message.

Yoshida et al. [57] use a combination of hashing and document space density to identify spam. Substrings of length L are extracted from the email, and hash values generated for each. The first N hash values form a vector representation of the email. This allows similar emails to be identified and their frequency recorded; given the high volumes of email spammers are required to send to generate a worthwhile economic benefit, there is a heavy maldistribution of spam email traffic which allows for easy identification. Document space density is therefore used to separate spam from legitimate email; when this method is combined with a short whitelist for solicited mass email, the authors report results of 98% recall and 100% precision, using over 50 million actual pieces of email traffic.

Damiani et al. [15] use message digests, addresses of the originating mail servers and URLs within the message to identify spam mail. Each message maps to a 256-bit digest, and is considered the same as another message if it differs by at most 74 bits. Previous work [16] has identified that this approach is robust against attempts to disguise the message. This email identification approach is then implemented within a P2P (peer-to-peer) architecture. Similarly, Gray & Haahr [25] present the CASSANDRA architecture for a personalised, collaborative spam filtering system, using a signature-based filtering technology and P2P distribution network.

3.1.3 Blacklisting

Blacklisting is a simplistic technique that is common within nearly all filtering products. Also known as block lists, blacklists filter out emails received from a specific sender. Whitelists, or allow lists, perform the opposite function, automatically allowing email from a specific sender. Such lists can be implemented at the user or at the server level, and represent a simple way to resolve minor imperfections created by other filtering techniques, without drastically overhauling the filter.

Given the simplistic nature of the technology, it is unsurprising that it can be easily penetrated. The sender's email address within an email can be faked, allowing spammers to easily bypass blacklists by inserting a different (fake) sender address with each bulk mailing. Correspondingly, whitelists can also be targeted by spammers. By predicting likely whitelisted addresses (e.g. all internal email addresses, or your boss's email address), spammers can penetrate other filtering solutions in place by appropriately forging the sender address.

DNS blacklisting operates on the same principles, but maintains a substantially larger database. When an SMTP session is started with the local mail server, the foreign host's address is compared against a list of networks and/or servers known to allow the distribution of spam. If a match is recorded, the session is immediately closed, preventing the delivery of the spam message. This filtering approach is highly effective at discarding substantial amounts of spam email, yet has low system requirements, and enabling it often requires only minimal changes to the mail server and filtering solution.

However, such lists often have a notoriously high rate of false positives, making them "dangerous" to use as a standalone filtering system [51]. Once blacklisted, spammers can cheaply acquire new addresses. Often several people must complain before an address is blacklisted; by the time the list is updated and distributed, the spammer can often send millions of spam messages.
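The 74-bit tolerance used by Damiani et al. amounts to a Hamming-distance test over 256-bit digests; a sketch (the digest values used in testing are placeholders, not real message digests):

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length digests."""
    assert len(a) == len(b)
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def same_message(digest_a: bytes, digest_b: bytes, tolerance: int = 74) -> bool:
    # Two 256-bit digests are treated as the same spam message if they
    # differ in at most `tolerance` bits.
    return hamming_distance(digest_a, digest_b) <= tolerance
```

The tolerance is what gives the scheme its robustness: small disguising edits perturb only a few digest bits, while a genuinely different message changes roughly half of them.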

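A DNS blacklist query, as described above, reverses the connecting host's IPv4 address and looks it up under the list's zone; the zone name below is a placeholder, and a real deployment would query an established DNSBL:

```python
import socket

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNSBL query name: reverse the IPv4 octets and append
    the blacklist zone, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def dnsbl_listed(ip: str, zone: str = "dnsbl.example.org") -> bool:
    # An answer (conventionally an address in 127.0.0.0/8) means the
    # host is listed; a resolution failure (NXDOMAIN) means it is not.
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

A mail server would typically run this check during the SMTP session, dropping the connection on a positive answer before the message body is ever transferred.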
Spammers can also masquerade as legitimate sites. Their motivation here is twofold: either they will escape being blacklisted, or they will cause a legitimate site to be blacklisted (reducing the accuracy, and therefore the attractiveness, of the DNS blacklist) [42].

Several filters now use such lists as part of a complete filtering solution, weighting information provided by the DNS blacklist and incorporating it into results provided by other techniques to produce a final classification decision.

3.1.4 Traffic analysis

While strictly not a spam filtering technology at present, Gomes et al. [21] provide a characterisation of spam traffic patterns. By examining a number of email attributes, they are able to identify characteristics that separate spam traffic from non-spam traffic. Several key workload aspects differentiate spam traffic, including the email arrival process, email size, number of recipients per email, and popularity and temporal locality among recipients. An underlying difference in purpose gives rise to these differences in traffic: legitimate mail is used to interact and socialise, where spam is typically generated by automatic tools to contact many potential, mostly unknown users. They consider their research as the first step towards defining a spam signature for the construction of an advanced spam detection tool.

3.2 Machine learning

3.2.1 Unified model filters

Bayesian filtering now commonly forms a key part of many enterprise-scale filtering solutions. No other machine learning or statistical filtering technique has achieved such widespread implementation, and it therefore represents the 'state-of-the-art' approach in industry.

It addresses many of the shortcomings of heuristic filtering. It uses a rule set unknown to the sender: the tokens and their associated probabilities are manipulated according to the user's classification decisions and the types of email received. Therefore each user's filter will classify emails differently, making it impossible for a spammer to craft a message that bypasses a particular brand of filter. The rule set is also adaptive: Bayesian filters can adapt their concepts of legitimate and spam email, based on user feedback, which continually improves filter accuracy and allows detection of new spam types.

Bayesian filters maintain two tables: one of spam tokens and one of 'ham' (legitimate) mail tokens. Associated with each spam token is a probability that the token suggests that the email is spam, and likewise for ham tokens. For example, Graham [22] reports that the word 'sex' indicates a 0.97 probability that an email is spam. Probability values are initially established by training the filter to recognise spam and legitimate email, and are then continually updated (and created) based on email that the filter successfully classifies. Incoming email is tokenised on arrival, and each token is matched with its probability value from the user's records. The probabilities associated with the tokens are then combined, using Bayes' Rule, to produce an overall probability that the email is spam. An example is provided in figure 3.

Bayesian filters perform best when they operate at the user level, rather than at the network mail server level. Each user's email and definition of spam differs; therefore a token database populated with user-specific data will result in more accurate filtering [19].

The use of Bayes' formula as a tool to identify spam was initially applied to spam filtering in 1998 by Sahami et al. [46] and Pantel & Lin [39]. Graham [22, 23] later implemented a Bayesian filter that caught 99.5% of spam with 0.03% false positives. Androutsopoulos et al. [2] established that a naive Bayesian filter clearly surpasses keyword-based filtering, even with a very small training corpus.

For example, the following set of keywords were extracted from an unseen email:

    prescription (0.9)   tomorrow (0.1)   student (0.1)   james (0.01)   quality (0.85)

A value of 0.9 for prescription indicates 90% of previously seen emails that included that word were ultimately classified as spam, with the remaining 10% classified as legitimate email.

To calculate the overall probability of an email being spam (P):

    P = (x1 · x2 ··· xn) / (x1 · x2 ··· xn + (1 − x1) · (1 − x2) ··· (1 − xn))
      = (0.9 · 0.1 · 0.1 · 0.01 · 0.85) / (0.9 · 0.1 · 0.1 · 0.01 · 0.85 + (1 − 0.9) · (1 − 0.1) · (1 − 0.1) · (1 − 0.01) · (1 − 0.85))
      = 0.006 (to three decimal places)

This value indicates that it is unlikely that the email message is spam; however, the ultimate classification decision would depend on the decision boundary set by the filter.

Figure 3: A simple example of Bayesian filtering.
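The combination step in figure 3 can be reproduced in a few lines; given the per-token spam probabilities x1…xn, the overall probability follows directly:

```python
def combine(probs):
    """Combine per-token spam probabilities as in figure 3."""
    p_spam = 1.0  # product of the token probabilities
    p_ham = 1.0   # product of their complements
    for x in probs:
        p_spam *= x
        p_ham *= 1.0 - x
    return p_spam / (p_spam + p_ham)
```

combine([0.9, 0.1, 0.1, 0.01, 0.85]) returns approximately 0.006, matching the worked example; practical filters typically combine only the most extreme tokens rather than every token in the message.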

More recently, Zdziarski [58] has introduced Bayesian noise reduction as a way of increasing the quality of the data provided to a naive Bayes classifier. It removes irrelevant text to provide more accurate classification, by identifying patterns of text that are commonplace for the user.

Given the high levels of accuracy that a Bayesian filter can potentially provide, it has unsurprisingly emerged as a standard against which new filtering technologies are evaluated. Despite such prominence, few commercial Bayesian filters are fully consistent with Bayes' Rule, creating their own artificial scoring systems rather than relying on the raw probabilities generated [53]. Furthermore, filters generally use 'naive' Bayesian filtering, which assumes that the occurrences of events are independent of each other; i.e. such filters do not consider that the words 'special' and 'offers' are more likely to appear together in spam email than in legitimate email.

In an attempt to address this limitation of standard Bayesian filters, Yerazunis et al. [56, 50] introduced sparse binary polynomial hashing (SBPH) and orthogonal sparse bigrams (OSB). SBPH is a generalisation of the naive Bayesian filtering method, with the ability to recognise mutating phrases in addition to individual words or tokens, and uses the Bayesian Chain Rule to combine the individual feature conditional probabilities. Yerazunis et al. reported results that exceed 99.9% accuracy on real-time email without the use of whitelists or blacklists. An acknowledged limitation of SBPH is that the method may be too computationally expensive; OSB generates a smaller feature set than SBPH, decreasing memory requirements and increasing speed. A filter based on OSB, along with the non-probabilistic Winnow algorithm as a replacement for the Bayesian Chain Rule, saw accuracy peak at 99.68%, outperforming SBPH by 0.04%; however, OSB used just 600,000 features, substantially fewer than the 1,600,000 features required by SBPH.
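The orthogonal sparse bigrams described above can be illustrated as token pairs generated within a sliding window, with the gap between the tokens recorded as part of the feature; this sketch follows the general idea rather than any specific implementation:

```python
def osb_features(tokens, window=5):
    """Generate sparse bigram features: each token paired with every
    later token up to `window` - 1 positions away.  Recording the gap
    keeps 'special offers' distinct from 'special ... offers'."""
    features = []
    for i in range(len(tokens)):
        for gap in range(1, window):
            if i + gap < len(tokens):
                features.append((tokens[i], gap, tokens[i + gap]))
    return features
```

Each resulting feature, rather than each single token, is then assigned a spam probability (or a Winnow weight), which is how phrase-level dependence such as 'special offers' enters the model.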

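The OSB feature generation described above can be sketched as follows; the window size and the feature string format are illustrative assumptions, not the CRM114 implementation. Each token is paired with each earlier token inside a sliding window, tagged with the gap between them, so a mutated phrase still shares features with the original.

```python
def osb_features(tokens, window=5):
    """Orthogonal sparse bigram sketch: pair each token with each
    earlier token inside the window, tagging the skip distance.
    This yields far fewer features than SBPH's full subphrase set."""
    feats = []
    for i, tok in enumerate(tokens):
        for dist in range(1, window):
            if i - dist >= 0:
                feats.append(f"{tokens[i - dist]} <skip {dist - 1}> {tok}")
    return feats

print(osb_features("get special offers now".split()))
```

With a window of five, a message of n tokens produces at most 4n features, which is the source of OSB's memory advantage over SBPH noted above.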
Support vector machines (SVMs) are generated by mapping training data in a nonlinear manner to a higher-dimensional feature space, where a hyperplane is constructed that maximises the margin between the two sets. The hyperplane is then used as a nonlinear decision boundary when exposed to real-world data. Drucker et al. [17] applied the technique to spam filtering, testing it against three other text classification algorithms: Ripper, Rocchio and boosted decision trees. Both boosted trees and SVMs provided "acceptable" performance, with SVMs preferable given their lesser training requirements. An SVM-based filter for Microsoft Outlook has also been tested and evaluated [55]. Rios & Zha [45] also experiment with SVMs, along with random forests (RFs) and naive Bayesian filters. They conclude that the SVM and RF classifiers are comparable, with the RF classifier more robust at low false positive rates; both outperform the naive Bayesian classifier.

While chi by degrees of freedom has been used in authorship identification, it was first applied to spam filtering by O'Brien & Vogel [38]. Ludlow [34] concluded that tens of millions of spam emails may be attributable to just 150 spammers; authorship identification techniques should therefore be able to identify the textual fingerprints of this small group, allowing a significant proportion of spam to be effectively filtered. This technique, when compared with a Bayesian filter, was found to provide equally good or better results.

Clark et al. [9] construct a backpropagation-trained artificial neural network (ANN) classifier named LINGER. ANNs require a relatively substantial amount of time for parameter selection and training when compared against other previously evaluated methods. The classifier can go beyond the standard spam/legitimate decision, instead classifying incoming email into an arbitrary number of folders. LINGER outperformed naive Bayesian, k-NN, stacking, stumps and boosted tree filtering techniques, based on their reported results, recording perfect results (across many measures) on all tested corpora, for all λ. LINGER also performed well when feature selection was based on a different corpus from that on which it was trained and tested.

Chhabra et al. [7] present a spam classifier based on a Markov random field (MRF) model. This approach allows the classifier to consider the importance of the neighbourhood relationship between words in an email message (MRF cliques). The inter-word dependence of natural language can therefore be incorporated into the classification process; this is normally ignored by naive Bayesian classifiers. Characteristics of incoming emails are decomposed into feature vectors and weighted in a superincreasing manner, reflective of inter-word dependence. Several weighting schemes are considered, each of which evaluates increasingly long matches differently. Accuracy over 5,000 test messages is shown to be superior to that of a naive Bayesian-equivalent classifier (97.98% accurate), with accuracy reaching 98.88% with a window size (i.e. maximum phrase length) of five and an exponentially superincreasing weighting model.

3.2.2 Previous likeness based filters

Memory-based, or instance-based, machine learning techniques classify incoming email according to its similarity to stored examples (i.e. training emails). Defined email attributes form a multi-dimensional space, in which new instances are plotted as points; the k-Nearest-Neighbour (k-NN) algorithm classifies each new instance by assigning it to the majority class of its k closest training instances. Sakkis et al. [47] [3] use a k-NN spam classifier implemented with the TiMBL memory-based learning software [14]. The basic k-NN classifier was extended to weight attributes according to their importance, and to weight nearer neighbours with greater importance (distance weighting). The classifier was compared with a naive Bayesian classifier using cost-sensitive evaluation. The memory-based classifier compares "favourably" to the naive Bayesian approach, with spam recall improving at all levels (1, 9, 999) of λ, at a small cost in precision at λ = 1, 9.
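The distance-weighted k-NN scheme described above can be sketched as follows; the toy feature vectors and the inverse-distance weighting are illustrative assumptions, not TiMBL's actual metrics.

```python
import math
from collections import Counter

# Toy training instances: (feature vector, class). Vectors and
# labels are illustrative only.
train = [((1.0, 0.9), "spam"), ((0.9, 1.0), "spam"), ((0.8, 0.7), "spam"),
         ((0.1, 0.2), "ham"), ((0.2, 0.1), "ham")]

def classify(x, k=3):
    """Vote among the k nearest neighbours, weighting each vote by
    inverse distance so that closer instances count for more."""
    nearest = sorted((math.dist(x, v), c) for v, c in train)[:k]
    votes = Counter()
    for d, c in nearest:
        votes[c] += 1.0 / (d + 1e-9)  # distance weighting
    return votes.most_common(1)[0][0]

print(classify((0.95, 0.95)))  # "spam"
print(classify((0.15, 0.15)))  # "ham"
```

With uniform weighting the third neighbour can flip a close vote; inverse-distance weighting makes the two nearest instances dominate, which is the extension Sakkis et al. found beneficial.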

The authors conclude that this is a "promising" approach, with a number of research possibilities to explore.

Case-based reasoning (CBR) systems maintain their knowledge in a collection of previously classified cases, rather than in a set of rules. Incoming email is matched against similar cases in the system's collection, which provide guidance towards its correct classification. The final classification, along with the email itself, then forms part of the system's collection for the classification of future email. Cunningham et al. [13] construct a case-based reasoning classifier that can track concept drift. They propose that the classifier both adds new cases to and removes old cases from the system collection, allowing the system to adapt to the drift of characteristics in both spam and legitimate mail. An initial evaluation of their classifier suggests that it outperforms naive Bayesian classification. This is unsurprising given that naive Bayesian filters attempt to learn a "unified spam concept" that will identify all spam email; spam email differs significantly depending on the product or service on offer.

Rigoutsos and Huynh [44] apply the Teiresias pattern discovery algorithm to email classification. Given a large collection of spam email, the algorithm identifies patterns that appear more than twice in the corpus. Negative training occurs by running the pattern identification algorithm over legitimate email; patterns common to both corpora are removed from the spam vocabulary. Successful classification relies on training the system on a comprehensive and representative collection of spam and legitimate email. Experimental results are based on a training corpus of 88,000 pieces of spam and legitimate email. Spam precision was reported at 96.56%, with a false positive rate of 0.066%.

3.2.3 Ensemble filters

Stacked generalisation is a method of combining classifiers, resulting in a classifier ensemble. Incoming email messages are first given to the ensemble's component classifiers, whose individual decisions are combined to determine the class of the message. Improved performance is expected given that different ground-level classifiers generally make uncorrelated errors. Sakkis et al. [48] create an ensemble of two different classifiers: a naive Bayesian classifier ([2] [1]) and a memory-based classifier ([47] [3]). Analysis of the two component classifiers indicated that they do tend to make uncorrelated errors. Unsurprisingly, the stacked classifier outperforms both of its component classifiers on a variety of measures.

The boosting process combines many moderately accurate weak rules (decision stumps) to induce one accurate, arbitrarily deep, decision tree. Carreras and Márquez [6] use the AdaBoost boosting algorithm and compare its performance against spam classifiers based on decision trees, naive Bayesian and k-NN methods. They conclude that their boosting-based methods outperform standard decision trees, naive Bayes, k-NN and stacking, with their classifier reporting F1 rates above 97% (see section 1.3). The AdaBoost algorithm provides a measure of confidence with its predictions, allowing the classification threshold to be varied to produce a very high precision classifier.

3.2.4 Complementary filters

Adaptive spam filtering [40] targets spam by category and is proposed as an additional spam filtering layer. It divides an email corpus into several categories, each with a representative text. Incoming email is then compared with each category, and a resemblance ratio is generated to determine the likely class of the email. When combined with Spamihilator, the adaptive filter caught 60% of the spam that passed through Spamihilator's keyword filter.
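The stacked generalisation scheme of section 3.2.3 can be sketched as follows; the two component classifiers are hypothetical stand-ins rather than the trained naive Bayesian and memory-based classifiers of Sakkis et al., and the averaging meta-rule is a simplification of a trained meta-classifier.

```python
# Stacked generalisation sketch: ground-level classifiers emit
# confidence scores and a meta-level rule combines them. Both
# component classifiers below are hypothetical stand-ins.
def bayesian_score(email):      # pretend ground-level classifier 1
    return 0.9 if "offer" in email else 0.2

def memory_based_score(email):  # pretend ground-level classifier 2
    return 0.8 if "free" in email else 0.3

def stacked_classify(email, threshold=0.5):
    """Meta-level combiner: here a simple average of component
    confidences; a real stacked ensemble learns this rule."""
    score = (bayesian_score(email) + memory_based_score(email)) / 2
    return "spam" if score >= threshold else "legitimate"

print(stacked_classify("free offer inside"))   # "spam"
print(stacked_classify("minutes of meeting"))  # "legitimate"
```

The combination only helps when the components err on different messages; if their errors were perfectly correlated, the meta-rule would have nothing to exploit.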

Boykin & Roychowdhury [4] identify a user's trusted network of correspondents with an automated graph method in order to distinguish between legitimate and spam email. The classifier was able to determine the class of 53% of all emails evaluated, with 100% accuracy. The authors intend this filter to be part of a more comprehensive filtering system, with a content-based filter responsible for classifying the remaining messages. Golbeck and Hendler [20] constructed a similar network from 'trust' scores, assigned by users to people they know. Trust ratings can then be inferred for unknown users, provided the users are connected via a mutual acquaintance(s).

Content-based email filters work best when the words inside the email text are lexically correct; i.e. most will rapidly learn that the word 'viagra' is a strong indicator of spam, but may not draw the same conclusion from the word 'V.i-a.g*r.a'. Assuming the spammer continues to use an obfuscated word, the content-based filter will eventually learn to identify it as spam; however, given the number of possibilities available to disguise a word, most standard filters will be unable to detect these terms in a reasonable amount of time. Lee and Ng [31] use a hidden Markov model to deobfuscate text. Their model is robust to many types of obfuscation, including substitutions and insertions of non-alphabetic characters, straightforward misspellings and the addition and removal of unnecessary spaces. When exposed to 60 obfuscated variants of 'viagra', their model successfully deobfuscated 59, and it recorded an overall deobfuscation accuracy of 94% across all test data.

Spammers typically use purpose-built applications to distribute their spam [27]. Greylisting tries to deter spam by rejecting email from unfamiliar IP addresses, replying with a soft fail (i.e. a 4xx error). It is built on the premise that such 'spamware' [33] does little or no error recovery, and will not retry sending the message. Any correct client should retry; however, some do not (whether due to a bug or to policy), so there is the potential to lose legitimate email. Legitimate email can also be unnecessarily delayed; this is mitigated by automatically whitelisting source IP addresses once they have successfully retried. In an analysis performed by Levine [33] over a seven-week period (covering 715,000 delivery attempts), 20% of attempts were greylisted; of those, only 16% retried. Careful system design can minimise the potential for lost legitimate email; certainly, greylisting is an effective technique for rejecting spam generated by poorly implemented spamware.

SMTP Path Analysis [32] learns the reputation of IP addresses and email domains by examining the paths used to transmit known legitimate and spam email. It uses the 'received' line that the SMTP protocol requires each relay to add to the top of each email processed, which details the relay's identity, the processing timestamp and the source of the message. Despite the fact that these headers can easily be spoofed, when the technique operates in combination with a Bayesian filter, overall accuracy is approximately doubled.

4 Evaluation

4.1 Barriers to comparison

This paper outlines many new techniques researched to filter spam email. It is difficult to compare the reported results of classifiers presented in various research papers, given that each author selects a different corpus of email for evaluation. A standard 'benchmark' corpus, comprising both spam and legitimate email, is required to allow meaningful comparison of the reported results of new spam filtering techniques against existing systems. However, this is far from a straightforward task. Legitimate email is difficult to find: several publicly available repositories of spam exist (e.g. www.spamarchive.org); however, it is significantly more difficult to locate a similarly vast collection of legitimate emails, presumably due to privacy concerns.

Spam is also constantly changing. Techniques used by spammers to communicate their message are continually evolving [27]; this is also seen, to a lesser extent, in legitimate email. Therefore, any static spam corpus would, over time, no longer resemble the makeup of current spam email. Graham-Cumming [24], maintainer of the Spammers' Compendium, identified 18 new techniques used by spammers to disguise their messages between 14 July 2003 and 14 January 2005. A total of 45 techniques are currently listed (as of 11 December 2005). While the introduction of modern spam construction techniques will affect a spam filter's ability to detect the actual content of a message, it is important to note that most heuristic filter implementations are updated regularly, both in terms of the rule set and the underlying software.

Several alternatives to a standard corpus exist. SpamAssassin (spamassassin.apache.org) maintains a collection of legitimate and spam emails, categorised into easy and hard examples; however, the corpus is now more than two years old. Androutsopoulos et al. [1] have built the 'Ling-Spam' corpus, which imitates legitimate email using the postings of the moderated 'Linguist' mailing list. The authors acknowledge that the messages may be more specialised in topic than those received by a standard user, but suggest that the corpus can serve as a reasonable substitute for legitimate email in preliminary testing. SpamArchive maintains an archive of spam contributed by users; archives are created containing all spam received on a particular day, providing researchers with an easily accessible collection of up-to-date spam emails. Finally, as a result of the Enron bankruptcy, 400 MB of realistic workplace email has become publicly available: it is likely that this will form part of future standard corpora, despite some outstanding issues [11].

Building an artificial corpus, or a corpus from presorted user email, ensures the class of each message is known with certainty. However, when dealing with a public corpus (such as the Enron corpus), it is more difficult to determine the actual class of a message for accurate evaluation of filter performance. Cormack and Lynam [11] therefore propose establishing a 'gold standard' for each message, which is considered to be the message's actual class. They use a bootstrap method based on several different classifiers to simplify the task of sorting through this massive collection of email; it remains a work in progress. Their filter evaluation toolkit, given a corpus and a filter, compares the filter's classification of each message with the gold standard to report effectiveness measures with 95% confidence limits.

In order to compare different filtering techniques, a standard set of legitimate and spam email must be used, both for the testing and the training (if applicable) of filters. Independent tests of filters are generally limited to usable commercial and open source products, excluding experimental classifiers appearing only in research. Experimental classifiers are generally compared only against standard techniques (e.g. Bayesian filtering) in order to establish their relative effectiveness; this makes it difficult to isolate the most promising new techniques. NetworkWorldFusion [51] review 41 commercial filtering solutions, while Cormack and Lynam review six open source filtering products [10].

4.2 Case study

Throughout this paper we have discussed the advances made in spam filtering technology. In this section, we evaluate the extent to which users at the University of Canterbury could potentially benefit from these advances in filtering techniques. Furthermore, we hope to collect data to substantiate some recommendations for evaluating spam filters.

The University of Canterbury maintains a two-stage email filtering solution: a subscription DNS blacklisting system is used in conjunction with Process Software's PreciseMail Anti-Spam System (PMAS). The University receives approximately 110,000 emails per day, of which approximately 50,000 are eliminated by the DNS blacklisting system before delivery is complete. Of those emails that are successfully delivered, PMAS discards around 42% and quarantines around 35% for user review. In its standard state, PMAS filtering is based on a comprehensive heuristic rule collection, which can be combined with both server-level and user-level block and allow lists. The software also has a Bayesian filtering option that works in conjunction with the heuristic filter; this option was not active prior to the evaluation.

Two experiments were conducted. The first used the publicly available SpamAssassin corpus to provide a comparable evaluation of PMAS in terms of false positives and false negatives. This experiment aimed to evaluate the overall performance of the filter, as well as the relative performance of the heuristic and Bayesian components. The second used spam collected from the SpamArchive repository to evaluate false positive levels on spam collected at various points over the last two years. The aim of this experiment was to observe whether the age of spam has any effect on the effectiveness of the filter, as well as to compensate for the age of the SpamAssassin corpus.

The training of the PMAS Bayesian filter took place over two weeks. PMAS automatically trains the Bayesian filter (as recommended by the vendor) by showing it emails that score, according to the heuristic filter, above and below defined thresholds, as examples of spam and non-spam respectively.

The results of passing the partial SpamAssassin corpus through the PMAS filter can be seen in figure 4. The partial corpus has the 'hard' spam removed, which consists of email with unusual HTML markup, coloured text, spam-like phrases, etc. Using the full corpus increases the false positives made by the overall filter from 1% to 4% of all legitimate messages filtered.

The spam corpus drawn from SpamArchive was constructed from the spam email submitted manually (by users) to SpamArchive on the 14th, 15th and 16th of each month used; these dates were randomly chosen. The total number of emails collected at each point varied from approximately 1,700 to 3,200.

The performance of each filter (heuristic, Bayesian and combined) steadily declined over time as newer spam from the SpamArchive collections was introduced. It is assumed that spam more recently submitted to the archive is more likely to employ newer message construction techniques; no effort was made to individually examine the test corpus to identify these characteristics. Any person with an email account can submit spam to the archive: this should create a sufficiently diverse catchment base, ensuring a broad range of spam messages is archived. A broad corpus of spam should reflect, to some extent, new spam construction techniques. The fact that updates are regularly issued by major anti-spam product vendors indicates that such techniques are becoming widespread.

Overall results are consistent with those published by NetworkWorldFusion [51]: they recorded 0.75% false positives and 96% accuracy, while we recorded 0.75% false positives (with the partial SpamAssassin corpus) and 97.67% accuracy.

Under both the full and partial SpamAssassin corpora, the combined filtering option surpasses the alternatives in the two key areas: a lower level of false positives, and a higher level of spam caught (i.e. discarded). This can be clearly seen in figure 4. In terms of these measures, the heuristic filter is closest to the performance of the combined filter. This is unsurprising given that the Bayesian component of the combined filter contributes relatively little and that it was initially trained by the heuristic filter. The Bayesian filter performs comparatively worse than the other two filtering options, as less email is correctly treated (i.e. spam discarded or ham forwarded) and notably more email is quarantined for user review.
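The automatic training scheme described above can be sketched as follows; the threshold values and token handling are hypothetical illustrations, not PMAS's actual behaviour. Messages that the heuristic filter scores confidently train the Bayesian token database; ambiguous messages provide no training signal.

```python
from collections import Counter

spam_tokens, ham_tokens = Counter(), Counter()

# Hypothetical heuristic-score thresholds; real deployments differ.
SPAM_THRESHOLD, HAM_THRESHOLD = 10.0, 2.0

def auto_train(message, heuristic_score):
    """Train the Bayesian token database only on messages the
    heuristic filter scores confidently; skip the ambiguous middle."""
    if heuristic_score >= SPAM_THRESHOLD:
        spam_tokens.update(message.split())
    elif heuristic_score <= HAM_THRESHOLD:
        ham_tokens.update(message.split())
    # scores between the thresholds are ignored

auto_train("cheap meds online", 14.2)
auto_train("agenda for tomorrow", 0.7)
auto_train("conference offer details", 5.0)  # ambiguous: ignored
print(spam_tokens["cheap"], ham_tokens["agenda"], spam_tokens["offer"])
```

One consequence of this design is visible in the sketch: the Bayesian database can only learn distinctions the heuristic filter already makes confidently, which helps explain why the Bayesian component initially mirrors the heuristic filter's judgements.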

[Figure 4 is a bar chart showing the percentage of spam and ham forwarded, quarantined and discarded by the combined, Bayesian and heuristic filters.]

Figure 4: Performance of the PMAS filtering elements using the partial SpamAssassin public corpus.

This is consistent with Garcia et al. [19], who suggested such a filtering solution was best placed at the user, rather than the server, level.

The performance of the heuristic filter deteriorates as messages get more recent. This would suggest that the PMAS rule set and underlying software have greater difficulty identifying a spam message when its content is deliberately obscured by advanced spam construction techniques. This is despite regular updates to the filter rule set and software. The combined filter performs similarly to the heuristic filter. This is unsurprising given that the heuristic filter contributes the majority of each message's score (which then determines the class of the message). The introduction of Bayesian filtering improved overall filter performance in all respects when dealing with both the SpamAssassin archive and the SpamArchive collections.

The results from the Bayesian filter are less obvious. One would expect the Bayesian filter to become more effective over time, given that it has been trained exclusively on more recent messages. In the broadest sense, this can be observed: the filter's performance improves by 7% on the January 2005 collection when compared against the July 2003 collection. However, the filter appears to perform best on the 2004 collections (January and July). It is possible that this is due to the training of the Bayesian filter: the automated training performed by PMAS may have incorrectly added some tokens to the ham/spam databases. Furthermore, the spam received by the University of Canterbury may not reflect the spam received by SpamArchive; this would also affect the training of the Bayesian filter.

New spam construction techniques are likely to have contributed to the lower spam accuracy scores; heuristic filters seem especially vulnerable to these developments. It is reasonable to say that such techniques are effective: a regularly updated heuristic filter becomes less effective, and this reinforces the need for a complementary machine learning approach when assembling a filtering solution.

Broadly, one can conclude two things from this experiment. Firstly, the use of a Bayesian filtering component improves overall filter performance; however, it is not a substitute for the traditional heuristic filter, but rather a complement (at least at the server level). Secondly, the concerns raised about the effects of time on the validity of the corpora seem to be justified: older spam does seem to be more readily identified, suggesting changing techniques.
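A minimal sketch of the incremental training that underlies such a Bayesian component is shown below; the class structure and token handling are illustrative assumptions. It also shows why reversing a message's effect on a unified model requires retaining the message text: the same counts must be subtracted again.

```python
from collections import Counter

class IncrementalBayes:
    """Minimal incremental trainer sketch: every training message
    updates the token counts, so the model adapts continuously but
    the token database only ever grows."""
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        self.counts[label].update(text.split())

    def untrain(self, text, label):
        # Only possible if the original text (or an equivalent
        # representation) was retained.
        self.counts[label].subtract(text.split())

f = IncrementalBayes()
f.train("cheap offer now", "spam")
f.train("cheap offer now", "spam")
f.untrain("cheap offer now", "spam")
print(f.counts["spam"]["cheap"])  # back to 1
```

Without the retained text, `untrain` cannot be implemented, which is precisely the difficulty with reversing changes to a unified model discussed below in relation to the deactivated PMAS component.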

It is interesting to note that, despite the improved performance, the Bayesian filtering component was deactivated some months after the completion of this evaluation due to increasing CPU and memory demands on the mail filtering gateway. This can be primarily attributed to the growth of the internal token database, as the automatic training system remained active throughout the period; arguably, training could have been disabled once a reasonably sized database had been constructed, but this would have negated some of the benefits realised by a machine learning-based filtering system (such as an adaptive rule set). This is a weakness both of the implementation, as no mechanism was provided to reduce the database size, and of the Bayesian approach and unified-model machine learning approaches in general. When constructing a unified model, the text of each incoming message affects the current model; however, reversing these changes can be particularly difficult. In the case of a Bayesian filter, a copy of each message processed (or some kind of representative text) would be necessary to reverse the impact of past messages on the model.

5 Conclusion

Spam is rapidly becoming a very serious problem for the internet community, threatening both the integrity of networks and the productivity of users. Anti-spam vendors offer a wide array of products designed to keep spam out; these are implemented in various ways (software, hardware, service) and at various levels (server and user). The introduction of new technologies, such as Bayesian filtering, is improving filter accuracy; we have confirmed this for ourselves by examining the PreciseMail Anti-Spam system. The net is being tightened even further: a vast array of new techniques have been evaluated in academic papers, and some have been taken to the community at large via open source products. The implementation of machine learning algorithms is likely to represent the next step in the ongoing fight to reclaim our inboxes.

References

[1] I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras, and C. Spyropoulos. An evaluation of naive Bayesian anti-spam filtering. In Proc. of the Workshop on Machine Learning in the New Information Age, 2000.

[2] I. Androutsopoulos, J. Koutsias, K. Chandrinos, and C. Spyropoulos. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 160–167. ACM Press, 2000.

[3] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. Spyropoulos, and P. Stamatopoulos. Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach. In Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 2000.

[4] P.O. Boykin and V. Roychowdhury. Personal email networks: An effective anti-spam tool. In MIT Spam Conference, Jan 2005.

[5] M. Cain. Spam blocking: What matters. META Group, 2003. www.postini.com/brochures.

[6] X. Carreras and L. Márquez. Boosting trees for anti-spam email filtering. In Proceedings of RANLP-01, 4th International Conference on Recent Advances in Natural Language Processing, 2001.

[7] S. Chhabra, W. Yerazunis, and C. Siefkes. Spam filtering using a Markov random field model with variable weighting schemas. In Data Mining, Fourth IEEE International Conference on, pages 347–350, 1–4 Nov. 2004.

[8] T. Chiu. Anti-spam appliances are better than software. NetworkWorldFusion, March 1 2004. www.nwfusion.com/columnists/2004/0301faceoffyes.html.

[9] J. Clark, I. Koprinska, and J. Poon. A neural network based approach to automated e-mail classification. In Web Intelligence, 2003. WI 2003. Proceedings. IEEE/WIC International Conference on, pages 702–705, 13–17 Oct 2003.

[10] G. Cormack and T. Lynam. A study of supervised spam detection applied to eight months of personal e-mail. http://plg.uwaterloo.ca/~gvcormac/spamcormack.html, July 1 2004.

[11] G. Cormack and T. Lynam. Spam corpus creation for TREC. In Conference on Email and Anti-Spam, 2005.

[12] G. Cormack and T. Lynam. TREC 2005 spam track overview. In Text Retrieval Conference, 2005.

[13] P. Cunningham, N. Nowlan, S. Delany, and M. Haahr. A case-based approach to spam filtering that can track concept drift. In ICCBR'03 Workshop on Long-Lived CBR Systems, June 2003.

[14] W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. TiMBL: Tilburg memory based learner, version 3.0, reference guide. ILK, Computational Linguistics, Tilburg University. http://ilk.kub.nl/~ilk/papers, 2000.

[15] E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, and P. Samarati. P2P-based collaborative spam detection and filtering. In P2P '04: Proceedings of the Fourth International Conference on Peer-to-Peer Computing (P2P'04), pages 176–183. IEEE Computer Society, 2004.

[16] E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, and P. Samarati. Using digests to identify spam messages. Technical report, University of Milan, 2004.

[17] H. Drucker, D. Wu, and V.N. Vapnik. Support vector machines for spam categorization. Neural Networks, IEEE Transactions on, 10(5):1048–1054, Sep. 1999.

[18] T. Espiner. Demand for anti-spam products to increase. ZDNet UK, Jun 2005.

[19] F.D. Garcia, J.-H. Hoepman, and J. van Nieuwenhuizen. Spam filter analysis. In Proceedings of the 19th IFIP International Information Security Conference, WCC2004-SEC, Toulouse, France, Aug 2004. Kluwer Academic Publishers.

[20] J. Golbeck and J. Hendler. Reputation network analysis for email filtering. In Conference on Email and Anti-Spam, 2004.

[21] L.H. Gomes, C. Cazita, J.M. Almeida, V. Almeida, and W. Meira Jr. Characterizing a spam traffic. In IMC '04: Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, pages 356–369. ACM Press, 2004.

[22] P. Graham. A plan for spam. http://paulgraham.com/spam.html, August 2002.

[23] P. Graham. Better Bayesian filtering. In Proc. of the 2003 Spam Conference, January 2003.

[24] J. Graham-Cumming. The spammers' compendium. www.jgc.org/tsc/index.htm, Feb 2005.

[25] A. Gray and M. Haahr. Personalised, collaborative spam filtering. In Conference on Email and Anti-Spam, 2004.

[26] J.M.G. Hidalgo. Evaluating cost-sensitive unsolicited bulk email categorization. In SAC '02: Proceedings of the 2002 ACM Symposium on Applied Computing, pages 615–620. ACM Press, 2002.

[27] R. Hunt and A. Cournane. An analysis of the tools used for the generation and prevention of spam. Computers & Security, 23(2):154–166, 2004.

[28] J. Ioannidis. Fighting spam by encapsulating policy in email addresses. In Network and Distributed System Security Symposium, Feb 6–7 2003.

[29] R. Jennings. The global economic impact of spam, 2005 report. Technical report, Ferris Research, 2005.

[30] T. Zeller Jr. Law barring junk e-mail allows a flood instead. The New York Times, Feb 1 2005.

[31] H. Lee and A. Ng. Spam deobfuscation using a hidden Markov model. In Conference on Email and Anti-Spam, 2005.

[32] B. Leiba, J. Ossher, V. Rajan, R. Segal, and M. Wegman. SMTP path analysis. In Conference on Email and Anti-Spam, 2005.

[33] J. Levine. Experiences with greylisting. In Conference on Email and Anti-Spam, 2005.

[34] M. Ludlow. Just 150 'spammers' blamed for e-mail woe. The Sunday Times, 1 December 2002.

[35] Mail Abuse Prevention Systems. Definition of spam. www.mail-abuse.com/spam_def.html, 2004.

[36] N. Nie, A. Simpser, I. Stepanikova, and L. Zheng. Ten years after the birth of the internet, how do Americans use the internet in their daily lives? Technical report, Stanford University, 2004.

[37] R. Nutter. Software or appliance solution? NetworkWorldFusion, March 1 2004. www.nwfusion.com/columnists/2004/0301nutter.html.

[38] C. O'Brien and C. Vogel. Spam filters: Bayes vs. chi-squared; letters vs. words. In ISICT '03: Proceedings of the 1st International Symposium on Information and Communication Technologies. Trinity College Dublin, 2003.

[39] P. Pantel and D. Lin. SpamCop—a spam classification & organisation program. In Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, 1998. AAAI Technical Report WS-98-05.

[40] L. Pelletier, J. Almhana, and V. Choulakian. Adaptive filtering of spam. In Communication Networks and Services Research, Second Annual Conference on, pages 218–224, 19–21 May 2004.

[41] Postini Inc. Postini Perimeter Manager makes encrypted mail easy and painless. www.postini.com/brochures, 2004.

[42] Process Software. Explanation of common spam filtering techniques (white paper). http://www.process.com/, 2004.

[43] Radicati Group. Anti-spam 2004 executive summary. Technical report, Radicati Group, 2004.

[44] I. Rigoutsos and T. Huynh. Chung-Kwei: a pattern-discovery-based system for the automatic identification of unsolicited e-mail messages (spam). In Conference on Email and Anti-Spam, 2004.

[45] G. Rios and H. Zha. Exploring support vector machines and random forests for spam detection. In Conference on Email and Anti-Spam, 2004.

[46] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, 1998. AAAI Technical Report WS-98-05.

[47] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. Spyropoulos, and P. Stamatopoulos. A memory-based approach to anti-spam filtering. Technical Report DEMO 2001, 2001.

[48] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and P. Stamatopoulos. Stacking classifiers for anti-spam filtering of e-mail. In Empirical Methods in Natural Language Processing, pages 44–50, 2001.

[49] K. Schneider. Anti-spam appliances are not better than software. NetworkWorldFusion, March 1 2004. www.nwfusion.com/columnists/2004/0301faceoffno.html.

[50] C. Siefkes, F. Assis, S. Chhabra, and W. Yerazunis. Combining Winnow and orthogonal sparse bigrams for incremental spam filtering. In Proceedings of ECML/PKDD 2004, LNCS. Springer Verlag, 2004.

[53] S. Vaughan-Nichols. Saving private e-mail. Spectrum, IEEE, pages 40–44, Aug 2003.

[54] M. Wagner. Study: E-mail viruses up, spam down. Internetweek.com, Nov 9 2002. http://www.internetweek.com/story/INW20021109S0002.

[55] M. Woitaszek, M. Shaaban, and R. Czernikowski. Identifying junk electronic email in Microsoft Outlook with a support vector machine. In Applications and the Internet, 2003 Symposium on, pages 166–169, 27–31 Jan. 2003.

[56] W. Yerazunis. Sparse binary polynomial hashing and the CRM114 discriminator. In MIT Spam Conference, 2003.

[57] K. Yoshida, F. Adachi, T. Washio, H. Motoda, T. Homma, A. Nakashima, H. Fujikawa, and K. Yamazaki. Density-based spam detector. In KDD '04: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 486–493. ACM Press, 2004.

[58] J. Zdziarski. Bayesian noise reduction: contextual symmetry logic utilizing pattern consistency analysis. http://www.nuclearelephant.com/papers/bnr.html, 2004.

[51] J. Snyder. Spam in the wild, the sequel. http://www.nwfusion.com/- reviews/2004/122004spampkg.html, Dec 2004.

[52] J. Spira. Spam e-mail and its impact on it spending and productivity. Technical report, Basex Inc., 2003.
