CLASSIFICATION OF IMAGE SPAM

A Thesis
Presented to
The Graduate Faculty of The University of Akron

In Partial Fulfillment
of the Requirements for the Degree
Master of Computer Science

Shruti Wakade
August, 2011

CLASSIFICATION OF IMAGE SPAM

Shruti Wakade

Thesis

Approved:
Advisor: Dr. Kathy J. Liszka
Committee Member: Dr. Zhong-Hui Duan
Committee Member: Dr. Chien-Chung Chan

Accepted:
Department Chair: Dr. Chien-Chung Chan
Dean of the College: Dr. Chand Midha
Dean of the Graduate School: Dr. George R. Newkome
Date: ________________________________

ABSTRACT

Image spam has been one of the most prevalent forms of spam since its inception. Spammers have refined their techniques to use smaller, more colorful, photo-quality images. Despite numerous efforts by researchers and free e-mail services such as Yahoo Mail and Gmail to build efficient filters against e-mail spam, spam filters still fail to stop image spam. This research attempts to understand the techniques used in spamming and to identify a set of features that help distinguish image spam from photographs. A set of eight features was identified based on observations and existing research in this area; six were introduced by us and two were taken from previous research. Data mining techniques were then applied to separate image spam from photographs. Identifying a set of efficient yet computationally inexpensive features was the objective that guided this research work. We achieved a classification accuracy of 89% on the test samples. A detailed trail of image spam was also studied to identify its most prevalent types and patterns.
Our results indicate that five of the six features we introduced proved to be highly significant in distinguishing image spam from photographs.

ACKNOWLEDGEMENTS

I extend my heartfelt gratitude and appreciation to Dr. Kathy J. Liszka, an extremely helpful teacher and a wonderful advisor who is the guiding force behind this research work. Without her guidance, inputs and encouragement this work would not have been possible. I express my sincere appreciation and gratitude to Dr. Chan for helping me with the data mining experiments and for insightful corrections. I appreciate my committee member Dr. Duan for her thoughtful inputs. I wish to thank Chuck Van Tilburg for extending his help in the research labs and providing a workable environment there. I also wish to thank Knujon for contributing spam images, which helped me build a substantial corpus for this research. Last but not least, I would like to convey my heartfelt gratitude to my family and friends for their constant encouragement, support and timely help.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER

I. INTRODUCTION

II. SPAM DEFINITION AND TYPES
    2.1 Overview
    2.2 Types of Spam
    2.3 Image Spam
    2.4 Related Research

III. SPAM IMAGES AND DATASET
    3.1 Types of Images
    3.2 Image Spam Dataset
    3.3 Corpus
        3.3.1 Statistics of Images in the Corpus
    3.4 Preprocessing
        3.4.1 Feature Selection
    3.5 Feature Extraction Process

IV. DATA MINING TECHNIQUES
    4.1 Data Mining Overview
    4.2 Classification
    4.3 Decision Trees
        4.3.1 J48
        4.3.2 RepTree

V. EXPERIMENTS AND RESULTS
    5.1 Weka Data Mining Tool
    5.2 Data Set Preparation
    5.3 Methodology
        5.3.1 Run 1 - Using J48 Classifier
        5.3.2 Run 2 - Using RepTree Classifier
        5.3.3 Depth of the RepTree
        5.3.4 Dataset Proportions
        5.3.5 Training and Testing Data Selection
        5.3.6 Testing on Unseen Data

VI. VALIDATION BY FEATURE ANALYSIS

VII. TRENDS IN IMAGE SPAM
    7.1 Count of Image Spam
    7.2 Trend of the Month
    7.3 New Trends in Image Spam
        7.3.1 Scraped Images
        7.3.2 Malware Embedding in Images

VIII. CONCLUSIONS AND FUTURE WORK

REFERENCES

APPENDICES
    APPENDIX A. DATA ANALYSIS
    APPENDIX B. GENERATING MD5SUM AND SELECTING UNIQUE FILES

LIST OF TABLES

Table
3.1 Statistics of the images collected to form the corpus
4.1 Example data for classification
4.2 Example data for clustering
5.1 Depth value of RepTree
5.2 Accuracy of classification for different ratios of ham and spam images
5.3 Count of spam images in 2010
5.4 Accuracy of classification for unseen samples
5.5 Computing time for extracting features
7.1 Image spam count in 2008-2011

LIST OF FIGURES

Figure
1.1 Example of image spam
2.1 Adding noise to the image
2.2 Wavy images
2.3 Rotating image and adding noise
3.1 Text only image spam