CLASSIFICATION OF IMAGE SPAM


A Thesis
Presented to
The Graduate Faculty of The University of Akron

In Partial Fulfillment
of the Requirements for the Degree
Master of Computer Science

Shruti Wakade
August, 2011

Thesis Approved:
  Advisor: Dr. Kathy J. Liszka
  Committee Member: Dr. Zhong-Hui Duan
  Committee Member: Dr. Chien-Chung Chan

Accepted:
  Department Chair: Dr. Chien-Chung Chan
  Dean of the College: Dr. Chand Midha
  Dean of the Graduate School: Dr. George R. Newkome
  Date: ____________

ABSTRACT

Image spam has been one of the most prevalent forms of spam since its inception. Spammers have refined their techniques to use smaller, more colorful, photo-quality images as spam. In spite of numerous efforts by researchers and free e-mail services such as Yahoo Mail and Gmail to build efficient filters against e-mail spam, spam filters still fail to stop image spam. This research is an attempt to understand the techniques used in spamming and to identify a set of features that can help distinguish image spam from photographs. A set of eight features was identified based on observations and existing research in this area. Six of the eight features were introduced by us, and the other two were taken from previous research. Data mining techniques were then applied to classify images as spam or photographs. Identifying a set of effective yet computationally inexpensive features was the objective that guided this research. We achieved a classification accuracy of 89% on the test samples. A detailed trail of image spam was also studied to identify its most prevalent types and patterns. Our results indicate that five of the six features we introduced were highly significant in distinguishing image spam from photographs.

ACKNOWLEDGEMENTS

I extend my heartfelt gratitude and appreciation to Dr. Kathy J. Liszka, an extremely helpful teacher and a wonderful advisor, who is the guiding force behind this research work. Without her guidance, input and encouragement this work would not have been possible. I express my sincere appreciation and gratitude to Dr. Chan for helping me with the data mining experiments and for his insightful corrections. I appreciate my committee member Dr. Duan for her thoughtful input. I wish to thank Chuck Van Tilburg for his help in the research labs and for providing a workable environment there. I also wish to thank Knujon for contributing spam images, which helped me build a substantial corpus for this research. Last but not least, I would like to convey my heartfelt gratitude to my family and friends for their constant encouragement, support and timely help.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER

I. INTRODUCTION

II. SPAM DEFINITION AND TYPES
    2.1 Overview
    2.2 Types of Spam
    2.3 Image Spam
    2.4 Related Research

III. SPAM IMAGES AND DATASET
    3.1 Types of Images
    3.2 Image Spam Dataset
    3.3 Corpus
        3.3.1 Statistics of Images in the Corpus
    3.4 Preprocessing
        3.4.1 Feature Selection
    3.5 Feature Extraction Process

IV. DATA MINING TECHNIQUES
    4.1 Data Mining Overview
    4.2 Classification
    4.3 Decision Trees
        4.3.1 J48
        4.3.2 RepTree

V. EXPERIMENTS AND RESULTS
    5.1 Weka Data Mining Tool
    5.2 Data Set Preparation
    5.3 Methodology
        5.3.1 Run 1 - Using J48 Classifier
        5.3.2 Run 2 - Using RepTree Classifier
        5.3.3 Depth of the RepTree
        5.3.4 Dataset Proportions
        5.3.5 Training and Testing Data Selection
        5.3.6 Testing on Unseen Data

VI. VALIDATION BY FEATURE ANALYSIS

VII. TRENDS IN IMAGE SPAM
    7.1 Count of Image Spam
    7.2 Trend of the Month
    7.3 New Trends in Image Spam
        7.3.1 Scraped Images
        7.3.2 Malware Embedding in Images

VIII. CONCLUSIONS AND FUTURE WORK

REFERENCES

APPENDICES
    APPENDIX A. DATA ANALYSIS
    APPENDIX B. GENERATING MD5SUM AND SELECTING UNIQUE FILES

LIST OF TABLES

Table
3.1 Statistics of the images collected to form the corpus
4.1 Example data for classification
4.2 Example data for clustering
5.1 Depth value of RepTree
5.2 Accuracy of classification for different ratios of ham and spam images
5.3 Count of spam images in 2010
5.4 Accuracy of classification for unseen samples
5.5 Computing time for extracting features
7.1 Image spam count in 2008-2011

LIST OF FIGURES

Figure
1.1 Example of image spam
2.1 Adding noise to the image
2.2 Wavy images
2.3 Rotating the image and adding noise
3.1 Text-only image spam