A Review of Current and Next Generation Spam Filtering Tools
Tightening the net: a review of current and next generation spam filtering tools

James Carpinter & Ray Hunt∗
Department of Computer Science and Software Engineering
University of Canterbury

∗email: [email protected]

Abstract

This paper provides an overview of current and potential future spam filtering approaches. We examine the problems spam introduces, what spam is and how we can measure it. The paper primarily focuses on automated, non-interactive filters, with a broad review ranging from commercial implementations to ideas confined to current research papers. Both machine learning and non-machine learning based filters are reviewed as potential solutions and a taxonomy of known approaches is presented. While a range of different techniques have been, and continue to be, evaluated in academic research, heuristic and Bayesian filtering dominate commercial filtering systems; therefore, a case study of these techniques is presented to demonstrate and evaluate their effectiveness.

Keywords: spam, ham, heuristics, machine learning, non-machine learning, Bayesian filtering, blacklisting.

1 Introduction

The first message recognised as spam was sent to the users of Arpanet in 1978 and represented little more than an annoyance. Today, email is a fundamental tool for business communication and modern life, and spam represents a serious threat to user productivity and IT infrastructure worldwide. While it is difficult to quantify the level of spam currently sent, many reports suggest it represents substantially more than half of all email sent and predict further growth for the foreseeable future [18, 43, 30].

For some, spam represents a minor irritant; for others, a major threat to productivity. According to a recent study by Stanford University [36], the average Internet user loses ten working days each year dealing with incoming spam. Costs beyond those incurred sorting legitimate email from spam are also present: 15% of all email contains some type of virus payload, and one in 3,418 emails contained pornographic images particularly harmful to minors [54]. It is difficult to estimate the ultimate dollar cost of such expenses; however, most estimates place the worldwide cost of spam in 2005, in terms of lost productivity and IT infrastructure investment, to be well over US$10 billion [29, 52].

The magnitude of the problem has introduced a new dimension to the use of email: the spam filter. Such systems can be expensive to deploy and maintain, placing a further strain on IT budgets. While the reduced flow of spam email into a user’s inbox is generally welcomed, the existence of false positives often necessitates the user manually double-checking filtered messages; this reality somewhat counteracts the assistance the filter delivers. The effectiveness of spam filters in improving user productivity is ultimately limited by the extent to which users must manually review filtered messages for false positives.

Unfortunately, the underlying business model of bulk emailers (spammers) is simply too attractive. Commissions to spammers of 25–50% on products sold are not unusual [30]. On a collection of 200 million email addresses, a response rate of 0.001% would yield a spammer a return of $25,000, given a $50 product. Any solution to this problem must reduce the profitability of the underlying business model, by either substantially reducing the number of emails reaching valid recipients or increasing the expenses faced by the spammer.

Regrettably, no solution has yet been found to this vexing problem. The classification task is complex and constantly changing. Constructing a single model to classify the broad range of spam types is difficult; this task is made near impossible with the realisation that spam types are constantly moving and evolving. Furthermore, most users find false positives unacceptable. The active evolution of spam can be partially attributed to changing tastes and trends in the marketplace; however, spammers often actively tailor their messages to avoid detection, adding a further impediment to accurate detection.

The similarities between junk postal mail and spam can be immediately recognised; however, the nature of the Internet has allowed spam to grow uncontrollably. Spam can be sent with no cost to the sender: the economic realities that regulate junk postal mail do not apply to the Internet. Furthermore, the legal remedies that can be taken against spammers are limited: it is not difficult to avoid leaving a trace, and spammers easily operate outside the jurisdiction of those countries with anti-spam legislation.

The remainder of this section provides supporting material on the topic of spam. Section 2 provides an overview of spam classification techniques. Sections 3.1 and 3.2 provide a more detailed discussion of some of the spam filtering techniques known: given the rapidly evolving nature of this field, it should be considered a snapshot of the critical areas of current research. Section 4 details the evaluation of spam filters, including a case study of the PreciseMail Anti-Spam system operating at the University of Canterbury. Section 5 finishes the paper with some conclusions on the state of this research area.

1.1 Definition

Spam is briefly defined by the TREC 2005 Spam Track as “unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient” [12]. The key elements of this definition are expanded on in a more extensive definition provided by Mail Abuse Prevention Systems [35], which specifies three requirements for a message to be classified as spam. Firstly, the message must be equally applicable to many other potential recipients (i.e. the identity of the recipient and the context of the message are irrelevant). Secondly, the recipient has not granted ‘deliberated, explicit and still-revocable permission for it to be sent’. Finally, the communication of the message gives a ‘disproportionate benefit’ to the sender, as solely determined by the recipient. Critically, they note that simple personalisation does not make the identity of the sender relevant and that failure by the user to explicitly opt-out during a registration process does not form consent.

Both these definitions identify the predominant characteristic of spam email: that a user receives unsolicited email that has been sent without any concern for their identity.

1.2 Solution strategies

Proposed solutions to spam can be separated into three broad categories: legislation, protocol change and filtering.

A number of governments have enacted legislation prohibiting the sending of spam email, including the USA (Can Spam Act 2004) and the EU (directive 2002/58/EC). American legislation requires an ‘opt-out’ list that bulk mailers are required to provide; this is arguably less effective than the European (and Australian) approach of requiring explicit ‘opt-in’ requests from consumers wanting to receive such emails. At present, legislation appears to have had little effect on spam volumes, with some arguing that the law has contributed to an increase in spam by giving bulk advertisers permission to send spam, as long as certain rules were followed.

Many proposals to change the way in which we send email have been put forward, including the required authentication of all senders, a per email charge and a method of encapsulating policy within the email address [28]. Such proposals, while often providing a near complete solution, generally fail to gain support given the scope of a major upgrade or replacement of existing email protocols.

Interactive filters, often referred to as ‘challenge-response’ (C/R) systems, intercept incoming emails from unknown senders or those suspected of being spam. These messages are held by the recipient’s email server, which issues a simple challenge to the sender to establish that the email came from a human sender rather than a bulk mailer. The underlying belief is that spammers will be uninterested in completing the ‘challenge’ given the huge volume of messages they send; furthermore, if a fake email address is used by the sender, they will not receive the challenge. Selective C/R systems issue a challenge only when the (non-interactive) spam filter is unable to determine the class of a message. Challenge-response systems do slow down the delivery of messages, and many people refuse to use the system.¹

Non-interactive filters classify emails without human interaction (such as that seen in C/R systems). Such filters often permit user interaction with the filter to customise user-specific options or to correct filter misclassifications; however, no human element is required during the initial classification decision. Such systems represent the most common solution to resolving the spam problem, precisely because of their capacity to execute their task without supervision and without requiring a fundamental change in underlying email protocols.

1.3 Statistical evaluation

Common experimental measures include spam recall (SR), spam precision (SP), F1 and accuracy (A) (see figure 1 for formal definitions of these measures). Spam recall is effectively spam accuracy. A legitimate email classified as spam is considered to be a ‘false positive’; conversely, a spam message classified as legitimate is considered to be a ‘false negative’.

$$SR = \frac{\text{\# spam correctly classified}}{\text{Total \# of spam messages}}$$

$$SP = \frac{\text{\# spam correctly classified}}{\text{Total \# of messages classified as spam}}$$

$$F_1 = \frac{2 \times SP \times SR}{SP + SR}$$

$$A = \frac{\text{\# email correctly classified}}{\text{Total \# of emails}}$$

Figure 1: Common experimental measures for the evaluation of spam filters.

The accuracy measure, while often quoted by product vendors, is generally not useful when evaluating anti-spam solutions. The level of misclassifications (1 − A) consists of both false positives and false negatives; clearly a 99% accuracy rate with 1% false negatives (and no false positives) is preferable to the same level of accuracy with 1% false positives (and no false negatives).
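
To make the point about accuracy concrete, the measures of figure 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the `ConfusionCounts` container, the function names and the two example filters (1,000 messages, half of them spam) are assumptions introduced here, not part of the paper or of any evaluated product.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Outcome counts for one evaluation run (hypothetical container)."""
    spam_as_spam: int  # spam correctly classified
    spam_as_ham: int   # false negatives: spam classified as legitimate
    ham_as_ham: int    # legitimate email correctly classified
    ham_as_spam: int   # false positives: legitimate email classified as spam


def spam_recall(c: ConfusionCounts) -> float:
    """SR: spam correctly classified / total spam messages."""
    total_spam = c.spam_as_spam + c.spam_as_ham
    return c.spam_as_spam / total_spam if total_spam else 0.0


def spam_precision(c: ConfusionCounts) -> float:
    """SP: spam correctly classified / all messages classified as spam."""
    flagged = c.spam_as_spam + c.ham_as_spam
    return c.spam_as_spam / flagged if flagged else 0.0


def f1(c: ConfusionCounts) -> float:
    """F1: harmonic mean of spam precision and spam recall."""
    sp, sr = spam_precision(c), spam_recall(c)
    return 2 * sp * sr / (sp + sr) if (sp + sr) else 0.0


def accuracy(c: ConfusionCounts) -> float:
    """A: all correctly classified email / total email."""
    total = c.spam_as_spam + c.spam_as_ham + c.ham_as_ham + c.ham_as_spam
    return (c.spam_as_spam + c.ham_as_ham) / total if total else 0.0


# Two made-up filters with identical accuracy but different error types.
only_false_negatives = ConfusionCounts(
    spam_as_spam=490, spam_as_ham=10, ham_as_ham=500, ham_as_spam=0)
only_false_positives = ConfusionCounts(
    spam_as_spam=500, spam_as_ham=0, ham_as_ham=490, ham_as_spam=10)

if __name__ == "__main__":
    for name, counts in [("1% false negatives", only_false_negatives),
                         ("1% false positives", only_false_positives)]:
        print(f"{name}: A={accuracy(counts):.4f} "
              f"SR={spam_recall(counts):.4f} "
              f"SP={spam_precision(counts):.4f} "
              f"F1={f1(counts):.4f}")
```

Both example filters misclassify 10 of 1,000 messages, so accuracy cannot tell them apart. Only spam precision drops when the errors are false positives, which is the error type the text argues users find least acceptable.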