Comparing Naïve Bayesian and k-NN algorithms for automatic email classification

Louis Eisenberg
Stanford University M.S. student
PO Box 18199, Stanford, CA 94309
650-269-9444
[email protected]

ABSTRACT

The problem of automatic email classification has numerous possible solutions; a wide variety of natural language processing algorithms are potentially appropriate for this text classification task. Naïve Bayes implementations are popular because they are relatively easy to understand and implement, they offer reasonable computational efficiency, and they can achieve decent accuracy even with a small amount of training data. This paper compares the performance of an existing Naïve Bayesian system, POPFile [1], to a hand-tuned k-nearest-neighbors system. Previous research has generally shown that k-NN should outperform Naïve Bayes in text classification. My results fail to support that trend, as POPFile significantly outperforms the k-NN system. The likely explanation is that POPFile is a system specifically tuned to the email classification task and refined by numerous people over a period of years, whereas my k-NN system is a crude attempt at the problem that fails to exploit the full potential of the general k-NN algorithm.

INTRODUCTION

Using machine learning to classify email messages is an increasingly relevant problem as the rate at which Internet users receive email continues to grow. Though classification of desired messages by content is still quite rare, many users are already the beneficiaries of machine learning algorithms that attempt to distinguish spam from non-spam (e.g. SpamAssassin [2]). In contrast to the relative simplicity of spam filtering, which is a binary decision, filing messages into many folders can be fairly challenging. The most prominent non-commercial email classifier, POPFile, is an open-source project that wraps a user-friendly interface around the training and classification of a Naïve Bayesian system. My personal experience with POPFile suggests that it can achieve respectable results but leaves considerable room for improvement. In light of the conventional wisdom in NLP research that k-NN classifiers (and many other types of algorithms) should be able to outperform a Naïve Bayes system in text classification, I adapted TiMBL [3], a freely available k-NN package, to the email filing problem and sought to surpass the accuracy obtained by POPFile.

DATA

I created the experimental dataset from my own inbox, considering as candidates the more than 2000 non-spam messages that I received in the first quarter of 2004. Within that group, I selected approximately 1600 messages that I felt confident classifying into one of the twelve "buckets" that I arbitrarily enumerated (see Table 1). I then split each bucket, allocating half of the messages to the training set and half to the test set. As input to POPFile, I kept the messages in Eudora mailbox format. For TiMBL, I had to convert each message to a feature vector, as described in the k-NN section below.

Code   Size*   Description
ae     86      academic events, talks, seminars, etc.
bslf   63      buy, sell, lost, found
c      145     courses, course announcements, etc.
hf     43      humorous forwards
na     37      newsletters, articles
p      415     personal
pa     53      politics, advocacy
se     134     social events, parties
s      426     sports, intramurals, team-related
ua     13      University administrative
w      164     websites, accounts, e-commerce, support
wb     36      work, business

* training and test combined
Table 1. Classification buckets

POPFILE

POPFile implements a Naïve Bayesian algorithm. Naïve Bayesian classification depends on two crucial assumptions (both of which follow from the single Naïve Bayes assumption of conditional independence among features, as described in Manning and Schütze [4]): 1. each document can be represented as a bag of words, i.e. the order and syntax of words is completely ignored; 2. in a given document, the presence or absence of a given word is independent of the presence or absence of any other word. Naïve Bayes is thus incapable of appropriately capturing any conditional dependencies between words, guaranteeing a certain level of imprecision; however, in many cases this flaw is relatively minor and does not prevent the classifier from performing well.

To train and test POPFile, I installed the software on a Windows system and then used a combination of Java and Perl to perform the necessary operations. To train the classifier I fed the mbx files (separated by category) directly to the provided utility script insert.pl. For testing, I split each test-set mbx file into its individual messages, then used a simple Perl script to feed the messages one at a time to the provided script pipe.pl, which reads in a message and outputs the same message with POPFile's classification decision prepended to the Subject header and/or added in a new header called X-Test-Classification. After classifying all of the messages, I ran another Java program, popfilescore, to tabulate the results and generate a confusion matrix.

k-NN

To implement my k-NN system I used the Tilburg Memory-Based Learner, a.k.a. TiMBL, which I installed and ran on various Unix-based systems. TiMBL is an optimized version of the basic k-NN algorithm, which attempts to classify new instances by seeking "votes" from the k existing instances that are closest (most similar) to the new instance. The TiMBL reference guide [5] explains:

    Memory-Based Learning (MBL) is based on the idea that intelligent behavior can be obtained by analogical reasoning, rather than by the application of abstract mental rules as in rule induction and rule-based processing. In particular, MBL is founded in the hypothesis that the extrapolation of behavior from stored representations of earlier experience to new situations, based on the similarity of the old and the new situation, is of key importance.

Preparing the messages to serve as input to the k-NN algorithm was considerably more difficult than in the Naïve Bayes case. A major challenge in using this algorithm is deciding how to represent a text document as a vector of features. I chose to consider five separate sections of each email: the attachments; the From, To, and Subject headers; and the body. For attachments, each feature was a different file type, e.g. jpg or doc. For the other four sections, each feature was an email address, hyperlink URL, or stemmed and lowercased word or number. I discarded all other headers. I also ignored any words of length less than 3 letters or greater than 20 letters, as well as any words that appeared on POPFile's brief stopwords list. Altogether this resulted in each document in the data set being represented as a vector of 15,981 features. For attachments, subject, and body, I used tf.idf weighting according to the equation:

    weight(i,j) = (1 + log tf_i,j) * log(N / df_i)   if tf_i,j >= 1, and 0 otherwise,

where i is the term index and j is the document index. For the To and From fields, each feature was a binary value indicating the presence or absence of a word or email address.

The Java program mbx2featurevectors parses the training or test set and generates a file containing all of the feature vectors, represented in TiMBL's Sparse format.

TiMBL processes the training and test data in response to a single command. It has a number of command-line options with which I experimented in an attempt to extract better accuracy. Among them:

- k, the number of neighbors to consider when classifying a test point: the literature suggests that anywhere between one and a handful of neighbors may be optimal for this type of task
- w, the feature weighting scheme: the classifier attempts to learn which features have more relative importance in determining the classification of an instance; this can be absent (all features get equal weight) or based on information gain or slight variations of it such as gain ratio and shared variance
- m, the distance metric, i.e. how to calculate the nearness of two points based on their features: options that I tried included overlap (basic equal-or-not-equal for each feature), the modified value difference metric (MVDM), and Jeffrey divergence
- d, the class vote weighting scheme for neighbors: this can be simple majority (all neighbors have equal weight) or various alternatives, such as inverse linear and inverse distance, that assign higher weight to those neighbors that are closer to the instance

For distance metrics, MVDM and Jeffrey divergence are similar, and on this task, with its numeric feature vectors, both are clearly preferable to basic overlap, which draws no distinction between two values that are almost but not quite equivalent and two values that are very far apart. The other options have no clearly superior setting a priori, so I relied on the advice of the TiMBL reference guide and the results of my various trial runs.

RESULTS/CONCLUSIONS

The confusion matrices for POPFile and for the most successful TiMBL run are reproduced in Tables 2 and 3. Figure 1 compares the accuracy scores of the two algorithms on each category. Table 4 lists accuracy scores for various combinations of TiMBL options. The number of TiMBL runs possible was limited considerably by the length of time that each run takes, up to several hours even on a fast machine, depending greatly on the exact options specified.

       ae  bslf    c   hf   na    p   pa   se    s   ua    w   wb
ae      3     0    0    0    0    1    0   25   14    0    0    0
bslf    0     5    0    0    0    3    0    4   19    0    0    0
c       0     1   38    0    0   12    0    8   13    0    0    0
hf      0     1    0    5    0   10    0    0    5    0    0    0
na      1     1    0    0    5   11    0    0    0    0    0    0
p       0     0    0    2    0  189    0    0   15    0    1    0
pa      0     0    0    0    0    2   13    6    5    0    0    0
se      0     2    0    1    0    8    0   27   29    0    0    0
s       0     1    0    0    0   28    0    6  178    0    0    0
ua      0     0    0    0    0    1    0    0    0    5    0    0
w       2     0    0    0    0   41    0    0   12    0   27    0
wb      0     0    0    0    0   18    0    0    0    0    0    0

Table 2. Confusion matrix for best TiMBL run (rows: actual bucket; columns: assigned bucket)

       ae  bslf    c   hf   na    p   pa   se    s   ua    w   wb
ae     38     0    1    0    0    0    0    0    2    0    2    0
bslf    0    10    0    0    0    0    0    0   21    0    0    0
c       8     3   51    0    0    4    1    0    2    1    0    0
hf      0     0    0    7    0    7    1    1    4    0    0    0
na      0     0    0    1   32    0    0    0    0    0    0    0
p       0    10    3    8    0  140    2    7   20    0    4    4
pa      3     1    0    0    0    0   18    0    2    0    1    0
se      0     5    2    1    0    3    0   33   20    0    0    0
s       0    14    3    2    0   15    0    2  173    0    0    3
ua      0     0    0    0    0    0    0    0    0    6    0    0
w       1     0    7    0    0    4    1    2    4    2   59    0
wb      0     0    0    1    0    2    0    0    0    0    0   14

Table 3. Confusion matrix for POPFile (rows: actual bucket; columns: assigned bucket)
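The tf.idf weighting used for the attachment, subject, and body features can be sketched in a few lines. This is an illustrative sketch, not code from mbx2featurevectors: the function name is mine, and base-10 logarithms are an assumption, since the paper does not specify a base.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """weight(i,j) = (1 + log tf_ij) * log(N / df_i) when tf_ij >= 1, else 0.

    tf:     occurrences of term i in document j
    df:     number of documents containing term i
    n_docs: N, the total number of documents
    Base-10 logs are an assumption; the paper does not specify a base."""
    if tf < 1:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(n_docs / df)

# A term appearing 10 times in a document but in only 1 of 100 documents
# gets weight (1 + 1) * 2 = 4; a term found in every document gets weight 0.
print(tfidf_weight(10, 1, 100))    # 4.0
print(tfidf_weight(1, 100, 100))   # 0.0
```

Note how the log(N/df) factor zeroes out terms that occur in every document, which is why ubiquitous words carry no discriminative weight under this scheme.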

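To make the k-NN voting schemes concrete, here is a toy k-NN classifier using inverse-distance vote weighting, one of the d settings tried with TiMBL. This is a minimal sketch, not TiMBL's implementation: plain Euclidean distance stands in for the overlap/MVDM/Jeffrey metrics discussed above, and all names are mine.

```python
import math
from collections import defaultdict

def knn_classify(train, query, k=3):
    """Toy k-NN: find the k training points nearest to the query and let them
    vote on the label, weighting each vote by inverse distance so that closer
    neighbors count more. train is a list of (feature_vector, label) pairs."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda p: dist(p[0], query))[:k]
    votes = defaultdict(float)
    for vec, label in neighbors:
        votes[label] += 1.0 / (dist(vec, query) + 1e-9)  # avoid division by zero
    return max(votes, key=votes.get)

# Two tight clusters: a query near the first cluster is pulled toward "a"
# even though one "b" point sneaks into the k=3 neighborhood, because the
# two nearby "a" votes carry far more inverse-distance weight.
train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b")]
print(knn_classify(train, (0.5, 0.5), k=3))  # a
```

Simple majority voting would replace the inverse-distance increment with a constant 1.0 per neighbor, which is exactly the distinction the d option controls.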
As the tables and figure indicate, POPFile clearly outperformed even the best run by TiMBL. POPFile's overall accuracy was 72.7%, compared to only 61.1% for the best TiMBL trial. In addition, POPFile's accuracy was well over 60% in almost all of the categories; by contrast, the k-NN system performed well in only three categories. Interestingly, it performed best in the two largest categories, personal and sports; in those two, in fact, it was more accurate than POPFile. Apparently it succeeded in distinguishing those categories from the rest of the buckets and from each other, but failed to pick up on most of the other important differences across buckets.

m        w           k   d            accuracy
MVDM     gain ratio   9  inv. dist.   51.0%
overlap  none         1  majority     54.9%
overlap  inf. gain   15  inv. dist.   53.7%
MVDM     shared var   3  inv. linear  61.1%
Jeffrey  shared var   5  inv. linear  60.2%
overlap  shared var   9  inv. linear  58.9%
MVDM     gain ratio  21  inv. dist.   49.4%
MVDM     inf. gain    7  inv. linear  57.4%
MVDM     shared var   1  inv. dist.   61.0%
MVDM     shared var   5  majority     54.6%

Table 4. Sample of TiMBL trials
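As a sanity check, the overall accuracy quoted above can be recomputed from the diagonal of the POPFile confusion matrix: overall accuracy is simply correctly classified messages divided by total messages. A quick sketch, with the rows transcribed from Table 3:

```python
# POPFile confusion matrix from Table 3; row/column order:
# ae, bslf, c, hf, na, p, pa, se, s, ua, w, wb
popfile = [
    [38,  0,  1, 0,  0,   0, 0,  0,   2, 0,  2,  0],
    [ 0, 10,  0, 0,  0,   0, 0,  0,  21, 0,  0,  0],
    [ 8,  3, 51, 0,  0,   4, 1,  0,   2, 1,  0,  0],
    [ 0,  0,  0, 7,  0,   7, 1,  1,   4, 0,  0,  0],
    [ 0,  0,  0, 1, 32,   0, 0,  0,   0, 0,  0,  0],
    [ 0, 10,  3, 8,  0, 140, 2,  7,  20, 0,  4,  4],
    [ 3,  1,  0, 0,  0,   0, 18, 0,   2, 0,  1,  0],
    [ 0,  5,  2, 1,  0,   3, 0, 33,  20, 0,  0,  0],
    [ 0, 14,  3, 2,  0,  15, 0,  2, 173, 0,  0,  3],
    [ 0,  0,  0, 0,  0,   0, 0,  0,   0, 6,  0,  0],
    [ 1,  0,  7, 0,  0,   4, 1,  2,   4, 2, 59,  0],
    [ 0,  0,  0, 1,  0,   2, 0,  0,   0, 0,  0, 14],
]

def overall_accuracy(matrix):
    """Diagonal mass (correct classifications) over total mass."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

print(round(100 * overall_accuracy(popfile), 1))  # 72.7, matching the figure quoted above
```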

Figure 1. Accuracy by category (per-bucket accuracy of TiMBL and POPFile, 0–100%; chart not reproduced)

The various TiMBL runs provide evidence for a few minor insights about how to get the most out of the k-NN algorithm. The overwhelming conclusion is that shared variance is far superior to the other weighting schemes for this task. Based on the explanation given in the TiMBL documentation, this performance disparity likely reflects the ability of shared variance (and chi-squared, which is very similar) to avoid a bias toward features with more values, a significant problem with gain ratio. The results also suggest that k should be a small number; the highest values of k gave the worst results. The effect of the m and d options is unclear, though simple majority voting seems to perform worse than inverse distance and inverse linear.

It is also important to recognize the impact of the original construction of the feature vectors. Perhaps the k-NN system's poor performance was a result of unwise choices in mbx2featurevectors: focusing on the wrong headers, not parsing symbols and numbers as elegantly as possible, not trying a bigram or trigram model on the message body, choosing a poor tf.idf formula, etc.

OTHER RESEARCH

A vast amount of research already exists on this and similar topics. Some authors, e.g. Rennie et al. [6], have investigated ways to overcome the faulty Naïve Bayesian assumption of conditional independence. Kiritchenko and Matwin [7] found that support vector machines are superior to Naïve Bayesian systems when much of the training data is unlabeled. Other researchers have attempted to use semantic information to improve accuracy [8]. In addition to the two models discussed in this paper, there exist many other options for text classification: support vector machines, maximum entropy and logistic models, decision trees, and neural networks, for example.

REFERENCES

[1] POPFile: http://popfile.sourceforge.net
[2] SpamAssassin: http://www.spamassassin.org
[3] TiMBL: http://ilk.kub.nl/software.html#timbl
[4] Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. 2000.
[5] TiMBL reference guide: http://ilk.uvt.nl/downloads/pub/papers/ilk0310.pdf
[6] Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the Twentieth International Conference on Machine Learning. 2003.
[7] Svetlana Kiritchenko and Stan Matwin. Email Classification with Co-training. Proceedings of the 2001 Conference of the Centre for Advanced Studies on Collaborative Research. 2001.
[8] Nicolas Turenne. Learning Semantic Classes for Improving Email Classification. Biométrie et Intelligence. 2003.