Proceedings of the Twenty-First International FLAIRS Conference (2008)

On Using SVM and Kolmogorov for Spam Filtering

Sihem Belabbes² and Gilles Richard¹,²
¹British Institute of Technology and E-commerce, Avecina House, 258-262 Romford Road, London E7 9HZ, UK
²Institut de Recherche en Informatique de Toulouse, 118 Rte de Narbonne, 31062 Toulouse, France
[email protected], {belabbes,richard}@irit.fr

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

As a side effect of e-marketing strategies, the number of spam e-mails is rocketing, and so are the time and cost needed to deal with them. Spam filtering is one of the most difficult tasks among diverse kinds of text categorization, a sad consequence of spammers' dynamic efforts to escape filtering. In this paper, we investigate the use of Kolmogorov complexity theory as a backbone for spam filtering, avoiding the burden of text analysis and of keyword and blacklist updates. Exploiting the fact that we can estimate a message's informative content through compression techniques, we represent an e-mail as a multi-dimensional real vector and then implement a support vector machine classifier to classify new incoming e-mails. The first results we get exhibit interesting accuracy rates and emphasize the relevance of our idea.

Introduction

Detecting and filtering spam e-mails faces a number of complex challenges due to the dynamic and malicious nature of spam. A truly effective spam filter must block the maximum of unwanted e-mails while minimizing the number of legitimate messages wrongly identified as spam (namely false positives). However, individual users may not share the same view on what a spam really is.

A great deal of existing methods, generally rule-based, proceed by checking the content of incoming e-mail looking for specific keywords (dictionary approach), and/or comparing with blacklists of hosts and domains known to issue spam (see (Graham 2002) for a survey). In one case, the user can define his own dictionary, thus adapting the filter to his own use. In the other, the blacklist needs to be regularly updated.

Anyway, getting rid of spam remains a classification problem and it is quite natural to apply machine learning methods. The main idea on which they rely is to train a learner on a sample e-mail set (known as the witness or training set) clearly identified as spam or ham (legitimate e-mail), and then to use the system output as a classifier for the next incoming e-mails.

Amongst the most successful approaches to deal with spam is the so-called Bayesian technique, which is based on the probabilistic notion of Bayesian networks (Pearl & Russell 2003), and has so far proved to be a powerful filtering tool. First described by P. Graham (Graham 2002), Bayesian filters block over 90% of spam. Due to their statistical nature they are adaptive, but they can suffer from statistical attacks simply obtained by adding a huge number of relevant ham words to a spam message (see (Lowd & Meek 2005)). Next-generation spam filtering systems may thus depend on merging rule-based practices (like blacklists and dictionaries) and machine learning techniques.

A useful underlying classification notion is the mathematical concept of distance. Roughly speaking, it relies on the fact that when 2 objects are close to one another, they share some similarity. In spam filtering settings, if an e-mail m is close to a junk e-mail, then m is likely a junk e-mail as well. Such a distance takes parameters like the sender domain or specific word occurrences. An e-mail is then represented as a point in a vector space and spam is identified by simply implementing a distance-based classifier. At first glance an e-mail can be identified as spam by looking at its origin. We definitely think that this can be refined by considering the whole e-mail content, including header and body. An e-mail is classified as spam when its informative content is similar or close to that of another spam. So we are interested in a distance between 2 e-mails ensuring that when they are close, we can conclude that they have similar informative content, without distinction between header and body. One can wonder about the formal meaning of that 'informative content' concept. Kolmogorov (also known as Kolmogorov-Chaitin) complexity is a strong tool for this purpose. In its essence, Kolmogorov's work considers the informative content of a string s to be the size of the ultimate compression of s, noted K(s).

In this paper we aim at showing that Kolmogorov complexity and its associated distance can be pretty reliable in classifying spam. Most important is that this can be achieved:

• without any body analysis
• without any header analysis

The originality of our approach is the use of support vector machines for classification, such that we represent every e-mail in the training set and in the testing set as a multi-dimensional real vector. This can be done by considering a set of typical e-mails previously identified as spam or ham.

The rest of this paper is organized as follows: we briefly review the Kolmogorov theory and its very meaning for a given string s. Then we describe the main idea behind practical applications of Kolmogorov theory, which is defining a suitable information distance. Since K(s) is an ideal number, we explain how to estimate K and the relationship with commonly used compression techniques. We then exhibit our first experiment with a simple k-nearest neighbors algorithm and show why we move to more sophisticated tools. After introducing support vector machine classifiers, we describe our multi-dimensional real vector representation of an e-mail. We exhibit and comment on our new results. We discuss related approaches before concluding and outlining future perspectives.

What is Kolmogorov complexity?

The data a computer deals with are digitalized, namely described by finite binary strings. In our context we focus on pieces of text. Kolmogorov complexity K(s) is a measure of the descriptive complexity contained in an object or string s. A good introduction to Kolmogorov complexity is contained in (Kolmogorov 1965), with a solid treatment in (Li & Vitányi 1997; Kirchherr, Li, & Vitányi 1997). Kolmogorov complexity is related to Shannon entropy (Shannon 1948) in that the expected value of K(s) for a sequence is approximately the entropy of the source distribution for the process generating this sequence. However, Kolmogorov complexity differs from entropy in that it is related to the specific string being considered rather than to the source distribution. Kolmogorov complexity can be roughly described as follows, where T represents a universal computer (Turing machine), p represents a program, and s represents a string:

K(s) is the size |p| of the smallest program p s.t. T(p) = s

Thus p can be considered as the essence of s since there is no way to get s with a shorter program than p. It is logical to consider p as the most compressed form of s. The size of p, K(s), is a measure of the amount of information contained in s. Consequently K(s) is the lower bound of all possible compressions of s: it is the ultimate compression size of the string. Random strings have rather high Kolmogorov complexity, on the order of their length, as patterns cannot be discerned to reduce the size of a program generating such a string. On the other hand, strings with a large amount of structure have fairly low complexity and can be massively compressed.

Universal computers can be equated through programs of constant length, thus a mapping can be made between universal computers of different types, and the Kolmogorov complexity of a given string on two computers differs by known or determinable constants.

In some sense, this complexity is a kind of universal characteristic attached to a given piece of data. For a given string s, K(s) is a strange number which cannot be computed, just because it is an ideal lower limit related to an ideal machine (a universal Turing machine or calculator). This is a major difficulty which can be overcome by using the fact that the length of any program producing s is an upper bound of K(s). We develop this fact in a further section. But now we need to understand how to build a suitable distance starting from K. This is our aim in the next section.

Information distance

When considering data mining and knowledge discovery, mathematical distances are powerful tools to classify new data. The theory around Kolmogorov complexity helps to define a distance called 'Information Distance' or 'Bennett's distance' (see (Bennett et al. 1998) for a complete study). The informal idea is that given two strings or files, a and b, we can say:

K(a) = K(a \ b) + K(a ∩ b)
K(b) = K(b \ a) + K(a ∩ b)

It is important to point out that those 2 equations do not belong to the theoretical framework but are approximate formulas allowing to understand the final computation. The first equation says that a's complexity (its information content) is the sum of a's proper information content, denoted a \ b, and the content common with b, denoted a ∩ b. Concatenating a with b yields a new file denoted a.b whose complexity is K(a.b):

K(a.b) = K(a \ b) + K(b \ a) + K(a ∩ b),

since there is no redundancy with Kolmogorov compression. So the following number:

m(a, b) = K(a) + K(b) − K(a.b) = K(a ∩ b)

is a relevant measure of the information content common to a and b. Normalizing this number in order to avoid side effects due to string sizes helps in defining the information distance:

d(a, b) = 1 − m(a, b)/max(K(a), K(b))

Let us understand the very meaning of d. If a = b, then K(a) = K(b) and m(a, b) = K(a), thus d(a, b) = 0. On the opposite side, if a and b do not have any common information, m(a, b) = 0 and then d(a, b) = 1. Formally, d is a metric satisfying d(a, a) = 0, d(a, b) = d(b, a) and d(a, b) ≤ d(a, c) + d(c, b) over the set of finite strings. d is called the information distance or the Bennett distance. If d(a, b) is very small, it means a and b are very similar. d(a, b) close to 1 means a and b have very little information in common. This is exactly what we need to start with classification. As we understand now, the basic quantity we need to compute or estimate for a given string s is K(s). This is addressed in the next section.

Kolmogorov complexity estimation

In this section we are interested in adapting the formal tool introduced so far to real-world applications. We have mentioned above that the exact value of the Kolmogorov complexity can never be computed. Though, K(s) being the lower limit of all possible compressions of s means that every compression C(s) of s approximates K(s). A decompression algorithm is then considered as our universal machine T such that T(C(s)) = s. Hence it is obvious that we focus our attention on lossless compression algorithms, to preserve the validity of the previous equality.
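As a concrete illustration of this estimation step, here is a minimal sketch that approximates K(s) by the size of a losslessly compressed version of s and derives the distance d defined above. It uses Python's standard bz2 module as one possible lossless compressor (the choice of compressor is discussed next); the function names and file names are ours, not part of the original implementation.

```python
import bz2

def K(data: bytes) -> int:
    """Estimate K(s) by the size of a lossless (bzip2) compression of s.
    Any real compressor only gives an upper bound of the ideal K(s)."""
    return len(bz2.compress(data))

def info_distance(a: bytes, b: bytes) -> float:
    """Information (Bennett) distance estimated through compression:
    d(a, b) = 1 - m(a, b) / max(K(a), K(b)),
    with m(a, b) = K(a) + K(b) - K(a.b) and a.b the concatenation of a and b."""
    ka, kb, kab = K(a), K(b), K(a + b)
    m = ka + kb - kab
    return 1.0 - m / max(ka, kb)

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    a = open("some_spam.eml", "rb").read()
    b = open("incoming.eml", "rb").read()
    print(info_distance(a, b))
```

Because real compressors are imperfect, d(m, m) computed this way is not exactly 0; the authors come back to this point in the conclusion.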

Fortunately there exists a variety of such algorithms coming from the 'compression' community. Discussing those tools falls out of the scope of this paper, and we briefly overview some classical compression tools available on the market. On one side there are formal lossless compression algorithms: LZW, standing for Lempel-Ziv-Welch (Welch 1984), Huffman (Huffman 1952), Burrows-Wheeler (Burrows & Wheeler 1994). On the other side there are real implementations adding some clever manipulations to the previous algorithms:

• the Unix compress utility, based on LZ, which is a less elaborate version of LZW
• zip and gzip, which are a combination of LZ and Huffman encoding (for instance, Adobe Reader contains an implementation of LZW)
• bzip2, which first uses the Burrows-Wheeler transform and then a Huffman coding stage.

The bzip2 tool is known to achieve interesting compression ratios and we choose it to approximate Kolmogorov complexity. In the following, instead of dealing with the ideal number K(s) where s is a file, we deal with the size of the corresponding compressed file. The better the algorithm compresses the data s, the better the estimation of K(s). When replacing K(s) by C(s), it becomes clear that the previous information distance is no longer a true mathematical distance, but it remains sufficient for our purpose.

Our first tests with k-nn

In order to quickly validate our approach, we start from a classical k-nearest-neighbors (k-nn) algorithm where all neighbors of a given incoming e-mail equally contribute to determining the actual class to which it belongs (either spam or ham). The training set S is constituted of both spam and ham e-mails. Complexity estimation is obtained by the bzip2 compression tool, whose C source code is freely available (http://www.bzip.org/) and easy to integrate. This technique performs without information loss and is quite good in terms of runtime efficiency, which is exactly what we need.

Basically, starting from an incoming e-mail, we compute its k-nearest neighbors belonging to a set of witness e-mails previously classified as spam/ham. Then we choose the most frequent class among those k neighbors as the class for the incoming e-mail. The distance we use is just the information distance estimated through bzip2 compression. Our algorithm is implemented in C and our programs (source and executable) are freely available on request.

Using the previous software, our experimentation has been run over a set of 200 new e-mails to classify. We then performed 4 test series using 4 different training sets with cardinalities as below:

• 25 spam/25 ham,
• 50 spam/50 ham,
• 75 spam/75 ham,
• 100 spam/100 ham.

The only parameter we tune is k, and we choose k as 11, 21, 31, 41, 51, 61, 71. As expected, when increasing k beyond a certain threshold, the quality of the results decreased. That is why we finally choose the value of k as the square root of the training set cardinality. This is just an empirical value which works well. Below we provide only one graphic (with 200 checked e-mails) which gives an intuition of our results:

[Figure 1: 50 spam/50 ham witness e-mails.]

Visual inspection of our curve provides clear evidence that the information distance really captures an underlying 'semantics'. Since in our samples there are e-mails in diverse languages (French, English), we can think that there is a kind of common underlying information structure for spam. The accuracy rate remains in the range of 80%-87%, whatever the number of witness e-mails, and the false negative level remains low. Up to now there is no existing filtering method ensuring a perfect 100% accuracy, thus the false positive level implies that the individual user still needs to check his junk folder to avoid losing important e-mails.

Since our first results are quite encouraging, we decide to move to a completely different and more sophisticated technique. Instead of investigating the tuning of the k-nn algorithm, we work with a powerful training component, namely Support Vector Machines (SVM), and we adapt our implementation. We briefly introduce SVM in the next section.
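Before moving on, note that the whole k-nn baseline amounts to only a few lines once the compression distance is available. The sketch below is our own illustration (the authors' implementation is in C and is not reproduced here); it assumes the info_distance function from the earlier example and a witness set given as (raw e-mail, label) pairs.

```python
from collections import Counter

def knn_classify(mail: bytes, witnesses, k: int) -> str:
    """Majority vote among the k witness e-mails closest to `mail`,
    closeness being the compression-based information distance.
    `witnesses` is a list of (raw_email_bytes, label) pairs,
    with label in {"spam", "ham"}."""
    ranked = sorted(witnesses, key=lambda w: info_distance(mail, w[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Following the empirical choice above, k can be set to the square root
# of the training set cardinality:
# k = round(len(witnesses) ** 0.5)
```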

SVM for spam classification

Support vector machines are powerful classifiers. It is out of the scope of this paper to review SVM (see (Burges 1998)) and we only give a flavor. We are given a set of k-dimensional vectors xi, each one labeled 1 or -1 depending on its classification. Given a problem, a discriminating hyperplane w is one that creates a decision function satisfying the following constraint for all i:

yi(xi.w + b) − 1 ≥ 0.

The learning problem is called linearly inseparable when no such hyperplane exists. This drawback can be dealt with following some strategies. Indeed, Support Vector Machines can use a nonlinear kernel function defining a new inner product on the input space. This inner product may be used to calculate many higher-power terms of combinations of samples in the input space. This yields a higher-dimensional space such that, when it is large enough, there will be a separating hyperplane. An SVM tool is controlled by 2 parameters. The main one relates to the kernel function. The Radial Basis Function (RBF) kernel is widely used since it somewhat captures other kernel functions. It takes the value 1 in case of two equal input vectors. Contrarily, its value slowly decreases towards 0 in a radially symmetric fashion:

K(xi, xj) = exp(−||xi − xj||² / (2α²)).

Here, α is a parameter controlling the decreasing rate and is set prior to the SVM learning procedure. Other parameters may be considered as well, but it is not our aim to tune such parameters in the present paper. That is why we leave this task to the SVM package we use, since there is a way to automatically provide suitable values for these parameters.

As we understand, running an SVM algorithm depends on a vector representation of e-mails. We obtain this by exploiting a quite simple idea. We choose a set of typical e-mails Typ = {ti | i ∈ [1, n]} mixing 50% spam and 50% ham. For every e-mail m not belonging to Typ we compute the information distances mi = d(m, ti). We get #Typ coordinates (where #Typ denotes the cardinality of the set Typ), building a #Typ-dimensional vector representing m. The choice of Typ sets up the dimension of the vector space we deal with. Later on we will see that Typ's quality is crucial for the filter performance. In terms of implementation, we use the LIBSVM software (Chang & Lin 2001).

When it comes to evaluating the relevance of a spam filtering system, the standard accuracy rate (which is a usual measure for classification purposes) is not really sufficient, since all errors are treated on an equal footing. If we have a 90% accuracy rate, it is possible to have 10% of errors only by misclassifying the ham. It means that your inbox is clean but you have to check your junk box to get the missing ham. So it is important for an end-user to know the rate of ham (or legitimate e-mails) lost during the process. If this rate (False Positive Rate) is very low (for instance around 1%), then there is no need to deeply examine the junk box. Vice versa, if the number of unidentified spam is important, then a lot of spam remains in the inbox: so we want a high rate of identified spam (True Positive Rate). Another interesting value is the number of spam properly identified among the total number of e-mails considered as spam (true spam and false positives). This value is the Positive Predictive Value (or precision). Let us give formal definitions:

• fn = number of spam identified as ham
• fp = number of ham identified as spam
• s = number of spam in the test set
• h = number of ham in the test set (s + h is the test set cardinality)
• accuracy = 1 − (fn + fp)/(s + h)
• FPR = fp/h (False Positive Rate or fall-out)
• TPR = (s − fn)/s (True Positive Rate or recall)
• PPV = (s − fn)/(s − fn + fp) (Positive Predictive Value, also known as precision).

It is quite clear that those numbers are not independent and sometimes it can be interesting to represent TPR as a function of FPR (ROC analysis). Due to space limitations, we do not present a graphical representation of our results.

Our results with SVM

We have run a set of experiments on the publicly available Spamassassin corpus. This set is constituted of 9348 e-mails, of which 2397 are spam. We performed no pre-processing on those e-mails, which we consider as raw input data for our system. Below is our protocol:

• First of all, we have defined 4 training sets of 50, 100, 150 and 200 e-mails.
• We have then defined 5 sets of typical e-mails with cardinality 8, 10, 12, 16 and 20, each one on a 50/50% spam/ham basis.
• Those 5 typical sets gave rise to 5 experiment types with the same training and testing sets: each type is associated to an 8, 10, 12, 16 and 20 dimensional vector space.
• Then, instead of working on the full dataset, we have chosen 3 test sets: 2 with 500 and one with 1000 e-mails of both types. We give here the results for the final 1000 set.

Our experiment series are aimed to:

• allow to choose the most effective vector space dimension,
• allow to choose the most effective training set cardinality,
• validate the combination of Kolmogorov complexity and SVM as a really effective framework for spam filtering.

Table 1 presents our results (all numbers are percentages, the testing set contains 1000 new e-mails to classify).

train50     8      10     12     16     20
Accuracy    94.60  95.60  95.40  94.60  95.00
FPR         1.60   2.80   4.00   5.60   5.20
TPR         90.80  94.00  94.80  94.80  95.20

train100    8      10     12     16     20
Accuracy    94.60  96.20  95.40  94.80  94.80
FPR         1.60   2.40   2.40   1.60   1.20
TPR         90.80  94.80  93.20  91.20  90.80

train150    8      10     12     16     20
Accuracy    94.80  94.60  94.60  96.40  96.20
FPR         1.20   2.40   2.40   2.40   2.40
TPR         90.80  91.60  91.60  95.20  94.80

train200    8      10     12     16     20
Accuracy    93.80  93.40  93.60  93.80  93.40
FPR         1.60   1.60   2.00   1.60   1.60
TPR         89.20  88.40  89.20  89.20  88.40

Table 1: Classification results with bzip2
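For concreteness, the following sketch summarizes the pipeline that produces such numbers: e-mails are turned into #Typ-dimensional distance vectors, an RBF-kernel SVM is trained on them, and the measures defined above are computed on a test set. It reuses info_distance from the earlier example and uses scikit-learn's SVC, which wraps LIBSVM, as a stand-in for the authors' setup (its gamma parameter corresponds to 1/(2α²) in the paper's notation and is left at its default here); this is an illustrative sketch, not the code used for Table 1.

```python
from sklearn import svm

def to_vector(mail: bytes, typicals) -> list:
    """Represent an e-mail by its information distances to the typical
    e-mails Typ = {t1, ..., tn} (one coordinate per typical e-mail)."""
    return [info_distance(mail, t) for t in typicals]

def train_filter(train_mails, train_labels, typicals):
    """train_labels uses 1 for spam and -1 for ham, as in the SVM formulation."""
    X = [to_vector(m, typicals) for m in train_mails]
    clf = svm.SVC(kernel="rbf")   # RBF kernel, as in the paper
    clf.fit(X, train_labels)
    return clf

def evaluate(clf, test_mails, test_labels, typicals):
    """Compute accuracy, FPR, TPR and PPV as defined above."""
    pred = clf.predict([to_vector(m, typicals) for m in test_mails])
    s = sum(1 for y in test_labels if y == 1)    # spam in the test set
    h = sum(1 for y in test_labels if y == -1)   # ham in the test set
    fn = sum(1 for y, p in zip(test_labels, pred) if y == 1 and p == -1)
    fp = sum(1 for y, p in zip(test_labels, pred) if y == -1 and p == 1)
    return {"accuracy": 1 - (fn + fp) / (s + h),
            "FPR": fp / h,
            "TPR": (s - fn) / s,
            "PPV": (s - fn) / (s - fn + fp)}
```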

In our experiment, and except with a training set of 150 e-mails, it is clear that the best dimensional representation should be in the range of 8-10. We think there is a relationship between the size of the vector representation and the training set size: it is well known that over-fitting can occur with a huge training set, thus reducing the prediction power of the classifier. In fact, when we increase the vector dimension (e.g. 16 or 20 in our experiment), we increase the quantity of information carried by an e-mail. It is realistic to consider that relatively small training sets perform better than bigger ones since they are less sensitive to over-fitting. This relationship needs to be more deeply investigated.

Regarding the training set size, 100 seems to be quite effective. It can be seen from our results that when dealing with a bigger training set (150 or 200), some kind of over-fitting effect harms the classifier accuracy, and in particular the false positive and true positive rates. Obviously a smaller training set (50 e-mails) is not sufficient to output a precise classifier via the SVM tool. Moreover, our experiment shows that, in that case, the false positive rate reaches its worst values as the vector space dimension increases (12, 16 and 20). It is worth noting that, except for the smaller training set, our accuracy rates range from 93% to 97% with FPR between 1% and 3%, which is usually quite acceptable.

Despite the fact that more investigation has to be performed, it is obvious that the pair Kolmogorov complexity-SVM provides a clean theoretical framework to accurately deal with practical applications like spam filtering. In that case, our simple implementation is equivalent in terms of performance to more mature tools (such as SpamBayes or Spamassassin). It becomes definitely clear that a deep analysis of the incoming e-mail itself or a clever pre-processing are not a prerequisite to get accurate results. Better performances could be expected (e.g. by combining diverse approaches), however we are not sure this is feasible since Kolmogorov complexity theory is a strong substratum. Of course there is room for improving our model due to the number of parameters controlling it!

Related works

Despite their relative novelty, complexity-based approaches have proved quite successful in numerous fields of IT: data mining, classification, clustering, fraud detection, and so on. Within the IT security field, we can refer to the works of (Kulkarni & Bush 2001; Bush 2002; 2003), and more recently (Wehner 2006): they are mainly devoted to analyzing data flow coming from network activities and to detecting intrusions or viruses. From a spam filtering perspective, using the compression paradigm is not a completely new idea, despite the fact that our way to mix Kolmogorov complexity with an SVM training component is quite new (as far as we know, and except for other applications such as protein sequence classification (Kocsor et al. 2006)).

Probably the most closely related work is that of Spracklin and Saxton (Spracklin & Saxton 2007): they use Kolmogorov complexity without referring to the information distance. In each e-mail, they just consider the body, which is first pre-processed and cleaned up (only lower case letters, removal of common words and HTML tags, etc.). Then each word is converted into 0 if it appears frequently in spam or into 1 if it appears more frequently in ham. The final result is a string of 0s and 1s which is then compressed. If the complexity is below a certain threshold the e-mail is classified as ham, otherwise as spam. Their accuracy rates vary in the range of 80%-96% depending on their test sets. This is a good result, similar to ours, with the difference that we do not pre-process our e-mails. In terms of complexity, our process is much simpler since we only compute distances.

The work of Bratko et al. (Bratko et al. 2006) also has to be cited. It is not exactly based on Kolmogorov complexity but it is compression-based. They use an adaptive statistical data compression process. The main idea is to consider the messages as issued by an unknown information source and to estimate the probability of a symbol x being produced. Such a probability then exhibits a way to encode the given symbol and to compress. So, in order to filter spam, it is sufficient to compress the training ham set into a file Ham, and to do the same for the training spam set, getting a file Spam. When a new e-mail m arrives, it is added to the ham folder and the resulting folder is compressed into a file Ham + m. The same is done for the spam folder, getting Spam + m. If the difference between the size of Spam + m and the size of Spam is small, it means m does not bring new information to the training spam and is likely to be a spam. Otherwise m is considered as a ham. This method, which does not use the information distance, provides good results on the standard databases whatever the compression model. Our rates are currently similar but we have not yet investigated the whole collection of available databases.
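To make the comparison concrete, here is a rough sketch of that compress-and-compare idea as we read it from the description above (Bratko et al. actually use adaptive statistical compression models rather than an off-the-shelf compressor, so this is only an approximation of their method):

```python
import bz2

def compressed_size(data: bytes) -> int:
    return len(bz2.compress(data))

def classify_by_increment(mail: bytes, ham_corpus: bytes, spam_corpus: bytes) -> str:
    """Assign the incoming e-mail to the class whose compressed corpus
    grows the least when the e-mail is appended to it."""
    ham_increase = compressed_size(ham_corpus + mail) - compressed_size(ham_corpus)
    spam_increase = compressed_size(spam_corpus + mail) - compressed_size(spam_corpus)
    return "spam" if spam_increase < ham_increase else "ham"
```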

Conclusion and Future works

In this paper we have investigated the anti-spamming field and have considered the Kolmogorov complexity of an e-mail as a tool for classifying it as spam or ham. In order to quickly validate our idea, we first implemented a basic version of the k-nearest neighbors algorithm, using the information distance deduced from K as a closeness measure. Our tests have been processed on relatively small data sets. Still, the first obtained results clearly show that the compression distance is meaningful in that situation. This does not come completely as a surprise: the seminal works of (Bennett et al. 1998) and then (Cilibrasi & Vitányi 2005) already paved the way and highlighted the real practical power of compression methods. One major advantage in using compression for e-mail classification is that there is no need to deeply analyze the diverse parts of an e-mail (header and body). Our algorithm can of course be refined, and simply having a more powerful compression technique would probably bring more accurate results. Multiple complexity estimators could also be combined in order to improve discrimination accuracy. A more sophisticated machine learning component based on SVM could then be implemented to further investigate our compression-based classification method.

In the present work, using SVM required a real vector representation of e-mails, and it appears that 10 is a suitable dimension for the vector space model. Then we trained our SVM on diverse e-mail sets: by examining our experimental results it appears that a suitable size for the training set is in the range of 100. On our test sets (coming from the Spamassassin repository), we got accuracy rates in the range of 93%-97%, with false positive rates between 1% and 2%. It is quite clear that the compression method coupled with the SVM training component is successful, and our results emphasize the idea that there is no need to separate the header from the body and to analyze them separately. A great advantage of this kind of 'blind' technique is that there is:

• no need to update a dictionary or a blacklist of domain names. The system automatically updates when the training base evolves over time (i.e. the database of junk/legitimate e-mails, which is basically unique to a given user): it can be trained on a per-user basis, like Bayesian spam filtering.
• no need to pre-process the e-mails (like tokenization, HTML tag and capital letter removal).

A similarity with Bayesian filtering is that complexity-based methods do not bring any explanation about their results. Fortunately this is generally not a requirement for an e-mail filtering system, so in that particular case it cannot be considered as a drawback. Even if some commercial software claims 99% of success in spam elimination (e.g. see http://www.liveprism.com/), it is well known that end users are not entirely satisfied by the current performance of anti-spam tools. For instance, in (SPAMfighter 2007) people estimate at around 15% the amount of daily spam they receive after the spam filters have done their job. That corresponds to 85% accuracy, which is clearly outperformed in our experiments. It is quite clear that compression-based approaches are not the ultimate Grail. Though we strongly believe that, when mixed with other classical strategies, they will bring the spam filtering field a definite step further. It remains for us to investigate how to overcome the fact that we only estimate the Kolmogorov complexity. Using compression makes the information distance calculation approximative, since d(m, m) is definitely not 0 but a small value between 0.1 and 0.3. This error has to be properly managed to provide accurate vectors for the training components. This is our next step to conceive a more accurate system and to ultimately tend towards a 'spam-free' internet!

Acknowledgment

The authors are grateful to Ivan José Varzinczak for his help in integrating the bzip2 code into the programs used.

References

Bennett, C.; Gacs, P.; Li, M.; Vitányi, P.; and Zurek, W. 1998. Information distance. IEEE Transactions on Information Theory 44(4):1407–1423.

Bratko, A.; Cormack, G. V.; Filipic, B.; Lynam, T. R.; and Zupan, B. 2006. Spam filtering using statistical data compression models. Journal of Machine Learning Research 7:2673–2698.

Burges, C. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2):121–167.

Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation.

Bush, S. F. 2002. Active virtual network management prediction: Complexity as a framework for prediction, optimization, and assurance. In Proceedings of the 2002 DARPA Active Networks Conference and Exposition (DANCE), 534–553.

Bush, S. F. 2003. Extended abstract: Complexity and vulnerability analysis. In Complexity and Inference.

Chang, C.-C., and Lin, C.-J. 2001. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cilibrasi, R., and Vitányi, P. 2005. Clustering by compression. IEEE Transactions on Information Theory 51(4).

Graham, P. 2002. A plan for spam. Available online at http://www.paulgraham.com/spam.html.

Huffman, D. 1952. A method for the construction of minimum redundancy codes. In Proceedings of the IRE.

Kirchherr, W.; Li, M.; and Vitányi, P. 1997. The miraculous universal distribution. The Mathematical Intelligencer 19(4).

Kocsor, A.; Kertész-Farkas, A.; Kaján, L.; and Pongor, S. 2006. Application of compression-based distance measures to protein sequence classification: a methodological study. Bioinformatics 22(4):407–412.

Kolmogorov, A. N. 1965. Three approaches to the quantitative definition of information. Problems in Information Transmission 1(1):1–7.

Kulkarni, P., and Bush, S. F. 2001. Active network management and Kolmogorov complexity. In OpenArch 2001.

Li, M., and Vitányi, P. 1997. Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag.

Lowd, D., and Meek, C. 2005. Anti-spam products give unsatisfactory performance. In Proceedings of the Second Conference on E-mail and Anti-spam (CEAS), 125–132.

Pearl, J., and Russell, S. 2003. Bayesian networks. In Arbib, M. A., ed., Handbook of Brain Theory and Neural Networks. Cambridge, MA: MIT Press. 157–160.

Shannon, C. 1948. A mathematical theory of communication. Bell System Technical Journal 27:379–423, 623–656.

SPAMfighter. 2007. Anti-spam products give unsatisfactory performance. Available online at http://www.spamfighter.com/News_List_Other.asp?D=2007.7.31.

Spracklin, L., and Saxton, L. 2007. Filtering spam using Kolmogorov complexity estimates. In 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW), 321–328.

Wehner, S. 2006. Analyzing worms and network traffic using compression. Available online at http://arxiv.org/abs/cs.CV/0504045.

Welch, T. 1984. A technique for high performance data compression. IEEE Computer 17(6).
