A Spam-Detecting Artificial Immune System By
Total Page:16
File Type:pdf, Size:1020Kb
A Spam-Detecting Artificial Im m une System by Terri Oda A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Computer Science Ottawa-Carleton Institute for Computer Science School of Computer Science Carleton University Ottawa, Canada January, 2005 Copyright © 2005 Terri Oda Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Library and Bibliotheque et 1*1 Archives Canada Archives Canada Published Heritage Direction du Branch Patrimoine de I'edition 395 Wellington Street 395, rue Wellington Ottawa ON K1A 0N4 Ottawa ON K1A 0N4 Canada Canada Your file Votre reference ISBN: 0-494-00764-8 Our file Notre reference ISBN: 0-494-00764-8 NOTICE: AVIS: The author has granted a non L'auteur a accorde une licence non exclusive exclusive license allowing Library permettant a la Bibliotheque et Archives and Archives Canada to reproduce, Canada de reproduire, publier, archiver, publish, archive, preserve, conserve,sauvegarder, conserver, transmettre au public communicate to the public by par telecommunication ou par I'lnternet, preter, telecommunication or on the Internet,distribuer et vendre des theses partout dans loan, distribute and sell theses le monde, a des fins commerciales ou autres, worldwide, for commercial or non sur support microforme, papier, electronique commercial purposes, in microform, et/ou autres formats. paper, electronic and/or any other formats. The author retains copyright L'auteur conserve la propriete du droit d'auteur ownership and moral rights in et des droits moraux qui protege cette these. this thesis. Neither the thesis Ni la these ni des extraits substantiels de nor substantial extracts from it celle-ci ne doivent etre imprimes ou autrement may be printed or otherwise reproduits sans son autorisation. reproduced without the author's permission. In compliance with the Canadian Conformement a la loi canadienne Privacy Act some supporting sur la protection de la vie privee, forms may have been removed quelques formulaires secondaires from this thesis. ont ete enleves de cette these. While these forms may be included Bien que ces formulaires in the document page count, aient inclus dans la pagination, their removal does not represent il n'y aura aucun contenu manquant. any loss of content from the thesis. i * i Canada Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. A bstract Electronic junk mail, colloquially known as spam, has increasingly become a part of life for users of email. Although people dislike the constant barrage, it is difficult to design an anti-spam solution that can adapt appropriately to the constantly changing messages. Inspired by models of the human immune system, the spam immune system is a spam classifier that protects users from spam. The metaphor of the immune system is a compelling one because of the immune system’s ability to protect the body by distinguishing self from non-self, even as the non-self changes. The spam immune system creates digital antibodies which are used as email clas sifiers. As the system is exposed to messages, it learns about the user’s classification of spam and non-spam. Once trained, the resulting system can achieve classification rates comparable to those of existing commercial anti-spam products. iii Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Acknowledgements Many people have helped me bring the spam immune system from an idea to an entire thesis. Foremost among them is my supervisor, Tony White. His swarm intelligence class introduced me to the idea of artificial immune systems, and his encouragement, support and enthusiasm for this project has made it not only possible, but also a great experience. My undergraduate supervisor, Irwin Pressman, was the first to encourage me to consider graduate school, and has provided a wealth of useful advice. He has been a great inspiration, and he was right when he said that graduate work was more fun than undergraduate work. I wish to thank my thesis committee, who have helped make this possible even before they agreed to be part of the committee. Stan Matwin’s course in data min ing helped me understand many of the other approaches to text classification, and Anil Somayaji’s “Computer Immunology” paper was among the first immune sys tems papers I read, and it is one that made me think this idea might just be worth pursuing. This work would not have been possible without the support of my family and friends. They cheered me on, provided advice and background information (especially about biology), and often listened to me talk about my system, asking questions that helped me make my explanations more clear. In particular, I would like to thank Ken O’Byrne for providing not only invaluable support, but also providing extra computing power when I needed it most. iv Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Contents 1 Introduction 1 1.1 Motivation ................................. 1 1.2 Problem Statement ........................................... 2 1.3 Contributions . ...................................................................................... 3 1.4 Thesis Organization ........................... 4 2 Background 5 2.1 Spam ............................................................................................................ 5 2.1.1 About Spam .................................................................................... 6 2.1.1.1 What is Spam? . ........................................................ 6 2.1.1.2 How pervasive is the spam p ro b lem ? ......................... 10 2.1.1.3 Why is the problem so severe? ................................... 11 2.1.1.4 How do spam senders get email addresses? ................ 12 2.1.1.5 Anatomy of E m a il ........................................................ 14 2.1.1.6 Why is spam classification hard? ...................... 15 2.1.2 Classifying Current S o lu tio n s ......................... 17 2.1.2.1 When do these solutions w o rk ? .................................. 19 2.1.3 Current Solutions ................... 21 2.1.3.1 Spam Legislation and Legal A c tio n ............................ 21 2.1.3.2 Copyright L a w .............................................................. 23 v Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 2.1.3.3 Grassroots M e th o d s ................... 23 2.1.3.4 Address Obfuscation ................... 24 2.1.3.5 Heuristics...................................... 25 2.1.3.6 Sender A uthentication ................................ 26 2.1.3.7 Blacklisting ................... 28 2.1.3.8 Whitelisting ................................... 29 2.1.3.9 Challenge-Response . ................................................ 30 2.1.3.10 Spamtrap email addresses ............................................ 33 2.1.3.11 Bayesian-inspired F ilters ................................................ 34 2.1.4 S u m m ary ....................................................................................... 38 2.2 Immune S y stem s ...................................................................................... 40 2.2.1 What does the immune system d o ? ............................................ 41 2.2.2 The Adaptive Immune System ......................................... 42 2.2.2.1 Creation of Antibodies ................................................... 43 2.2.2.2 Autoimmune reactions ................................................... 44 2.2.2.3 Detection/Binding ......................................................... 44 2.2.2.4 Memory . ..................................................................... 46 2.2.3 S u m m ary ........................................................................... 47 3 The Spam AIS 49 3.1 Parts of the Spam A IS ................................................................ 50 3.1.1 Antigen and protein signatures: Email ................. 50 3.1.1.1 The Non-self: S p a m .......................... 52 3.1.1.2 The Self: N on-spam ...................................................... 53 3.1.2 The Lymphocytes and their Antibodies: Digital Detectors . 53 3.1.2.1 Digital Antibodies ............................................. 53 3.1.2.2 W eights ................... 55 3.1.3 The L ibrary ............................ 57 vi Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 3.1.3.1 All possible characters ...................... 58 3.1.3.2 A dictionary of English words ............. 58 3.1.3.3 Bayesian-style tokens ..................................................... 59 3.1.3.4 Heuristics ............................... 59 3.1.3.5 Other Libraries ............................... 61 3.2 Lifecycle . ................................................................................................ 63 3.2.1 Creation of Lymphocytes and their Antibodies .............. 65 3.2.1.1 Regular Expressions ..................................................... 69 3.2.2 Training of Lymphocytes ................ 71 3.2.3 Application and weighting of lym phocytes ......................... 72 3.2.3.1 But what if none of the antibodies match? ...... 75 3.2.4 Culling of antibodies: Ageing and D e a th ..................................