Fast Statistical Spam Filter by Approximate Classifications Kang Li Zhenyu Zhong Department of Computer Science Department of Computer Science University of Georgia University of Georgia Athens, Georgia, USA Athens, Georgia, USA
[email protected] [email protected] ABSTRACT based on its contents, have found wide acceptance in tools Statistical-based Bayesian filters have become a popular and used to block spam. These filters can be continually trained important defense against spam. However, despite their ef- on updated corpora of spam and ham (good email), resulting fectiveness, their greater processing overhead can prevent in robust, adaptive, and highly accurate systems. them from scaling well for enterprise-level mail servers. For Bayesian filters usually perform a dictionary lookup on example, the dictionary lookups that are characteristic of each individual token and summarize the result in order to this approach are limited by the memory access rate, there- arrive at a decision. It is not unusual to accumulate over fore relatively insensitive to increases in CPU speed. We 100,000 tokens in a dictionary, depending on how training is address this scaling issue by proposing an acceleration tech- handled [23]. Unfortunately, the performance of these dic- nique that speeds up Bayesian filters based on approximate tionary lookups is limited by the memory access rate, there- classification. The approximation uses two methods: hash- fore relatively insensitive to increases in CPU speed. As a based lookup and lossy encoding. Lookup approximation is result of this lookup overhead, classification can be relatively based on the popular Bloom filter data structure with an ex- slow.