Digital Ecosystems and Business Intelligence Institute

Addressing the New Generation of Spam (Spam 2.0) Through Web Usage

Models

Pedram Hayati

This thesis is presented for the Degree of

Doctor of Philosophy

of

Curtin University

July 2011


Abstract

New collaborative media introduce new ways of communicating that are not immune

to abuse. A fake eye-catching profile on a social networking website, a promotional review, a response to a thread in an online forum with unsolicited content, or a manipulated wiki page are examples of the new generation of spam on the web, referred to as Web 2.0 Spam or

Spam 2.0. Spam 2.0 is defined as the propagation of unsolicited, anonymous, mass content to

infiltrate legitimate Web 2.0 applications.

The current literature does not address Spam 2.0 in depth, and the outcomes of efforts to date

are inadequate. The aim of this research is to formalise a definition for Spam 2.0 and provide

Spam 2.0 filtering solutions. Early-detection, extendibility, robustness and adaptability are

key factors in the design of the proposed method.

This dissertation provides a comprehensive survey of the state-of-the-art web spam and Spam

2.0 filtering methods to highlight the unresolved issues and open problems, while at the same

time effectively capturing the knowledge in the domain of spam filtering.

This dissertation proposes three solutions in the area of Spam 2.0 filtering including: (1)

characterising and profiling Spam 2.0, (2) Early-Detection based Spam 2.0 Filtering (EDSF)

approach, and (3) On-the-Fly Spam 2.0 Filtering (OFSF) approach. All the proposed

solutions are tested against real-world datasets and their performance is compared with that

of existing Spam 2.0 filtering methods.

This work has coined the term ‘Spam 2.0’, provided insight into the nature of Spam 2.0, and

proposed filtering mechanisms to address this new and rapidly evolving problem.


Acknowledgements

It is a great pleasure to thank everyone who helped me write my dissertation successfully.

I am sincerely and heartily grateful to my supervisor, Dr. Vidyasagar Potdar. His enthusiasm, inspiration, and patient efforts to explain things clearly and simply, provided me with encouragement, sound advice and teaching, good company, and numerous ideas. I would have been lost without him.

I also extend my deep gratitude to my co-supervisor, Dr. Alex Talveski, for his valuable critique, insights and suggestions for improving my research.

I am obliged to many of my colleagues for their helpful suggestions and peer reviews of my research. In particular, I would like to express my gratitude to Prof. William F. Smyth, Prof.

Elizabeth Chang , Prof. Tharam Dillon, Dr. Chen Wu, Dr. D. Sculley, Prof. Takayuki Ito, Dr.

Andrew Over, Prof. Costas S. Iliopoulos, Prof. Ian Sommerville, Prof. Robert Meersman, Dr.

Fedja Hazdic, Dr. Jaipal Singh, Dr. Song Han, Prof. Heinz Dreher, Dr. Peter Dell, Dr. Ashley

Aitken, Dr. Michael Brodie, Prof. Jianwei Han and my best friend, Kevin Chai.

I am truly indebted and thankful to my parents, Heshmatollah Hayati and Dr. Mina Davoodi, for their faith in me and allowing me to be as ambitious as I wanted. It was under their watchful eye and with their encouragement that I remained motivated, committed, and able to tackle challenges head on.

Lastly, I am grateful to my sister, my colleagues at the Institute for Advanced Studies in

Basic Sciences, and my close friends for all their emotional support, camaraderie, entertainment and care.


Contents

Abstract ...... I

Acknowledgements ...... II

Contents ...... III

List of Figures ...... XIII

List of Tables ...... XVIII

List of Algorithms ...... XXI

Chapter 1: Introduction ...... 1

1.1 Introduction ...... 1

1.2 What is spam? ...... 2

1.3 Origin of the term ‘spam’...... 2

1.4 Different forms of spam ...... 2

1.4.1 Email Spam ...... 4

1.4.2 Instant Messenger Spam (SPIM) ...... 4

1.4.3 Mobile Spam (M-spam) ...... 4

1.4.4 Webpage/Website Spam (Web spam or Spamdexing) ...... 4

1.4.5 Spam over Internet Telephony (SPIT) ...... 4

1.5 Spam in Web 2.0 Applications (Spam 2.0) ...... 5

1.5.1 Definition of Spam 2.0 ...... 5


1.5.2 Spam 2.0 vs. Email Spam ...... 7

1.5.3 Spam 2.0 vs. Web spam ...... 8

1.6 Anti-Spam Solutions (Spam Filtering) ...... 9

1.6.1 Importance of spam filtering ...... 9

1.6.2 Anti-Spam concepts ...... 9

1.6.3 Legal Approach ...... 11

1.6.4 Technical Approach ...... 11

1.7 Research Motivations ...... 18

1.7.1 Spam 2.0 is not addressed as a new type of spam ...... 19

1.7.2 Spam 2.0 is parasitic by nature ...... 19

1.7.3 Spam 2.0 causes user inconvenience ...... 19

1.7.4 Existing detection approaches are not effective in filtering Spam 2.0 ...... 20

1.7.5 The increasing rate of Spam 2.0 distribution is out of control ...... 20

1.7.6 Online fraud and scam are facilitated by Spam 2.0 ...... 21

1.7.7 Trust, credibility and quality of information on Web 2.0 platforms are compromised

because of Spam 2.0 ...... 21

1.8 Significance of Research ...... 22

1.8.1 Social...... 22

1.8.2 Economic ...... 23

1.8.3 Technical ...... 24

1.9 Research Objectives ...... 25

1.10 Structure of Dissertation ...... 26


1.11 Conclusion ...... 27

Chapter 2: Literature Review ...... 29

2.1 Introduction ...... 29

2.2 Web Spam ...... 30

2.2.1 Definition ...... 30

2.2.2 Preliminary Concepts ...... 32

2.2.3 Tactics ...... 36

2.2.4 Survey ...... 39

2.2.5 Evaluation ...... 57

2.3 Spam 2.0 ...... 62

2.3.1 Definition ...... 63

2.3.2 Preliminary Concepts ...... 64

2.3.3 Tactics ...... 67

2.3.4 Survey ...... 69

2.3.5 Evaluation ...... 89

2.4 Conclusion ...... 93

Chapter 3: Problem Definition ...... 94

3.1 Introduction ...... 94

3.2 Problem Definition ...... 95

3.2.1 Problems with detection-based Spam 2.0 filtering methods ...... 95

3.2.2 Problem of prevention-based Spam 2.0 filtering methods ...... 97

3.2.3 Problem of early-detection based Spam 2.0 filtering methods...... 99


3.3 Research Issues ...... 101

3.3.1 Research Issue 1: The nature of Spam 2.0 is not well understood ...... 101

3.3.2 Research Issue 2: Existing Spam 2.0 filtering solutions are afterthoughts ...... 102

3.3.3 Research Issue 3: Platform and language-dependent approaches ...... 102

3.3.4 Research Issue 4: Current methods waste network and storage resources...... 103

3.3.5 Research Issue 5: Prevention-based methods pose inconvenience to human users ...... 103

3.3.6 Research Issue 6: Inability to identify Web 2.0 spam users ...... 104

3.3.7 Research Issue 7: Machine learning based methods are not effective for real time detection

...... 105

3.4 Research Methodology ...... 105

3.4.1 Problem definition ...... 106

3.4.2 Conceptual solution ...... 106

3.4.3 Implementation, test and evaluation ...... 106

3.5 Conclusion ...... 106

Chapter 4: Overview of Solution and Conceptual Process ...... 108

4.1 Introduction ...... 108

4.2 Overview of the Solution ...... 108

4.3 Solution Description ...... 110

4.3.1 Characterising Spam 2.0 and Spam 2.0 distribution model ...... 111

4.3.2 Early-Detection based Spam 2.0 Filtering (EDSF) ...... 112

4.3.3 On-the-Fly Spam 2.0 Filtering (OFSF) ...... 114

4.4 Comparative view of the proposed solutions ...... 116


4.5 Conceptual Process ...... 117

4.5.1 Requirements Elicitation and Prioritisation ...... 117

4.5.2 Design Rationale ...... 118

4.5.3 Theoretical Foundation ...... 118

4.5.4 Prototype Implementation ...... 118

4.5.5 Experimental Setting ...... 118

4.5.6 Results and Observation ...... 118

4.5.7 Validation and Comparative Analysis ...... 119

4.6 Conclusion ...... 119

Chapter 5: Tracking Spam 2.0 ...... 120

5.1 Introduction ...... 120

5.2 General Overview of HoneySpam 2.0 ...... 121

5.3 Requirements ...... 122

5.4 Design Rationale ...... 123

5.5 Theoretical Foundation ...... 124

5.5.1 Tracking ...... 124

5.5.2 Pre-processing ...... 128

5.5.3 Knowledge discovery ...... 132

5.6 Prototype Implementation ...... 136

5.6.1 Tracking part ...... 137

5.6.2 Pre-processing part ...... 145

5.6.3 Knowledge discovery part ...... 147


5.7 Experimental Setting ...... 151

5.7.1 Deployment ...... 152

5.7.2 Customisation ...... 154

5.7.3 Promotion ...... 156

5.8 Results and Observations ...... 157

5.8.1 General statistical results ...... 158

5.8.2 Content categories results ...... 160

5.8.3 Origin countries results ...... 162

5.8.4 Internet browsers results ...... 164

5.9 Validation and Comparative Analysis ...... 165

5.9.1 Comparative analysis ...... 167

5.10 Conclusion ...... 172

Chapter 6: Profiling Spam 2.0 ...... 174

6.1 Introduction ...... 174

6.2 Model Spam User Behaviour ...... 175

6.2.1 Transaction ...... 176

6.2.2 Action ...... 177

6.2.3 Behaviour vector ...... 179

6.3 Cluster Spam User Behaviour ...... 181

6.3.1 Self-Organising Map (SOM) ...... 181

6.4 Algorithm ...... 183

6.4.1 Flowchart ...... 184


6.5 Implementation ...... 185

6.5.1 Transaction identification ...... 185

6.5.2 Action Set vector ...... 186

6.5.3 Action Time vector ...... 187

6.5.4 Action Frequency vector ...... 188

6.5.5 Input vector ...... 189

6.6 Experimental Setting ...... 191

6.7 Results ...... 193

6.7.1 Experiment 1: Session level result ...... 193

6.7.2 Experiment 2: User level results ...... 195

6.7.3 Experiment 3: IP level results ...... 197

6.7.4 Discussion ...... 198

6.8 Conclusion ...... 199

Chapter 7: Early-Detection based Spam 2.0 Filtering ...... 200

7.1 Introduction ...... 200

7.2 General Overview of EDSF ...... 201

7.3 Requirements ...... 202

7.4 Design Rationale ...... 202

7.5 Theoretical Foundation ...... 203

7.5.1 Tracking ...... 204

7.5.2 Pre-processing ...... 206

7.5.3 Feature Measurement ...... 207


7.5.4 Classification ...... 208

7.6 Prototype Implementation ...... 210

7.6.1 Tracking part ...... 211

7.6.2 Pre-processing part ...... 212

7.6.3 Feature measurement ...... 213

7.6.4 Classification ...... 214

7.7 Experimental Settings ...... 215

7.7.1 Data collection ...... 216

7.7.2 Parameter Specification ...... 217

7.7.3 Performance Measurement ...... 220

7.8 Results and Observations ...... 221

7.8.1 Experiment 1 (Action Set) ...... 221

7.8.2 Experiment 2 (Action Time) ...... 222

7.8.3 Experiment 3 (Action Frequency) ...... 222

7.9 Validation and Comparative Analysis ...... 223

7.9.1 Comparative Analysis ...... 225

7.10 Conclusion ...... 233

Chapter 8: On-the-Fly Spam 2.0 Filtering ...... 235

8.1 Introduction ...... 235

8.2 General Overview of OFSF ...... 236

8.2.1 Action String ...... 239

8.3 Requirements ...... 240


8.4 Design Rationale ...... 240

8.5 Theoretical Foundation ...... 241

8.5.1 Tracking ...... 242

8.5.2 Pre-processing ...... 243

8.5.3 Feature Measurement ...... 244

8.5.4 Construct Classifier ...... 246

8.5.5 Classification ...... 248

8.6 Prototype Implementation ...... 249

8.6.1 Tracking part ...... 250

8.6.2 Pre-processing part ...... 250

8.6.3 Feature measurement ...... 251

8.6.4 Construct classifier ...... 252

8.6.5 Classification ...... 253

8.7 Experimental Settings ...... 253

8.7.1 Data collection ...... 254

8.7.2 Parameter specification ...... 254

8.7.3 Performance measurement ...... 256

8.8 Results and Observations ...... 256

8.9 Validation and Comparative Analysis ...... 259

8.9.1 Comparative analysis ...... 260

8.10 Conclusion ...... 263

Chapter 9: Conclusion and Future Work ...... 265


9.1 Introduction ...... 265

9.2 Problems and Issues ...... 266

9.2.1 Problems with detection-based Spam 2.0 filtering methods ...... 266

9.2.2 Problems with prevention-based spam 2.0 filtering methods ...... 267

9.2.3 Problems with early-detection based spam 2.0 filtering methods ...... 267

9.3 Dissertation Contributions ...... 268

9.3.1 Study and evaluation of existing spam filtering research ...... 268

9.3.2 Characterising spam 2.0, spam 2.0 distribution model and generation approaches ...... 270

9.3.3 Early-Detection based Spam 2.0 Filtering (EDSF) approach ...... 273

9.3.4 On-the-Fly Spam 2.0 Filtering (OFSF) approach ...... 274

9.4 Future Works ...... 275

Bibliography ...... 277

Appendix I ...... 286

Appendix II ...... 290


List of Figures

Figure 1.1: Evolution and adaptability of spam once new media emerge ...... 3

Figure 1.2: Number of hits for different types of spam...... 3

Figure 1.3: Examples of Spam 2.0...... 6

Figure 1.4: Example of misleading comment spam ...... 8

Figure 1.5: Anti-spam solutions main activities ...... 10

Figure 1.6: The procedure for the detection-based spam filtering approach ...... 12

Figure 1.7: The procedure of the link-based spam filtering approach ...... 14

Figure 1.8: The procedure for the prevention-based spam filtering approach...... 15

Figure 1.9: A recent CAPTCHA challenge known as reCAPTCHA...... 16

Figure 1.10: The procedure of early-detection based spam filtering approach...... 17

Figure 1.11: Two sample CAPTCHAs that are difficult even for humans to decipher ...... 19

Figure 2.1: Sample ROC curve...... 36

Figure 2.2: Coverage of state of the art web spam filtering methods on web spam tactics...... 61

Figure 2.3: The number of web spam filtering methods per technical approach...... 61

Figure 2.4: Examples of Web 2.0 platforms...... 63

Figure 2.5: Example of misleading comment spam on author’s homepage...... 68

Figure 2.6: Example of spam in wiki platform...... 68

Figure 2.7: The coverage of the Spam 2.0 filtering method based on different evaluation criteria ...... 92


Figure 2.8: The number of Spam 2.0 filtering methods based on different technical approaches...... 92

Figure 3.1: Overall process in the detection-based Spam 2.0 filtering methods...... 96

Figure 3.2: The process followed by prevention-based Spam 2.0 filtering methods...... 98

Figure 3.3: The process used by early-detection-based Spam 2.0 filtering methods...... 100

Figure 4.1: Overview of the conceptual solution...... 110

Figure 4.2: HoneySpam 2.0 simple deployment network diagram...... 111

Figure 4.3: The overview process of the EDSF...... 114

Figure 4.4: The overview process of the OFSF...... 115

Figure 5.1: Theoretical foundation of HoneySpam 2.0 ...... 124

Figure 5.2: HoneySpam 2.0 tracking part architecture...... 125

Figure 5.3: Sample data captured by NTC...... 126

Figure 5.4: Sequence diagram for HoneySpam 2.0 tracking process...... 128

Figure 5.5: Flowchart for pre-processing step...... 132

Figure 5.6: Flowchart showing knowledge discovery process of HoneySpam 2.0...... 136

Figure 5.7: Class diagram for the NTC...... 138

Figure 5.8: Class diagram of DB class ...... 139

Figure 5.9: Class diagram for the NTC ...... 140

Figure 5.10: Class diagram for the FTC...... 141

Figure 5.11: Class diagram for the FTC server-side code...... 142

Figure 5.12: Class diagram for the FTC client-code...... 144

Figure 5.13: Class diagram for the FTC client-code event handler...... 145

Figure 5.14: WhoIS information for the Curtin University network extracted from APNIC...... 150


Figure 5.15: A sample header...... 150

Figure 5.16: User Agent header...... 151

Figure 5.17: Entire implementation of HoneySpam 2.0 for 6 commercial and 5 free hosting providers...... 152

Figure 5.18: (a) Frequency of spammers’ visits to each web page in HSCOM1. (b) Frequency of spammers’ return visits on HSCOM1. (c) Frequency of spammers’ visit time (sec) per session on HSCOM1...... 158

Figure 5.19: (a) Frequency of spammers’ visits to each web page during the HSCOM1 second period. (b) Frequency of return visits by spammers during the HSCOM1 second period. (c) Frequency of spammers’ visit time per session during the HSCOM1 second period...... 159

Figure 5.20: (a) Frequency of spammers’ visits to each web page for HSCOM6. (b) Frequency of return visits by spammers for HSCOM6. (c) Frequency of spammers’ visit time per session for HSCOM6...... 160

Figure 5.21: Top content categories on HSCOM1 ...... 161

Figure 5.22: Top content categories for HSCOM1 in the second period of running ...... 161

Figure 5.23: Top content categories on HSCOM6 (only spam topics) ...... 162

Figure 5.24: Top demographic location of spammers on HSCOM1 ...... 162

Figure 5.25: Top demographic locations of spammers on HSCOM1, period 2...... 163

Figure 5.26: Top demographic location of spammers on HSCOM6 ...... 163

Figure 5.27: Top user agent fields for HSCOM1 ...... 164

Figure 5.28: Top user agent fields for HSCOM1 second period ...... 164

Figure 5.29: Top user agents fields for HSCOM6 ...... 165

Figure 5.30: An example of 4 HTTP access log entries ...... 169

Figure 5.31: Framework for social honeypot-based approach (Lee et al. 2010)...... 171

Figure 5.32: Architecture of HoneySpam to fight email spam (Andreolini et al. 2005)...... 172


Figure 6.1: Relationship of the three proposed approaches to transaction identification ...... 177

Figure 6.2: U-Matrix for mapping of 8 RGB colour vector on two dimensions ...... 183

Figure 6.3: Flowchart for the proposed algorithm ...... 185

Figure 6.4: Frequency of Actions on HSCOM1 forum data ...... 192

Figure 6.5: Result of U-matrix for session and frequency of actions...... 194

Figure 6.6: U-Matrix representation of user level and action frequency...... 196

Figure 6.7: U-Matrix for IP level and Action Frequency ...... 198

Figure 7.1: EDSF framework consisting of four parts ...... 203

Figure 7.2: The sequence diagram of the EDSF tracking process ...... 205

Figure 7.3: The flow chart diagram for the pre-processing part of EDSF ...... 207

Figure 7.4: Flowchart for feature measurement part of EDSF ...... 208

Figure 7.5: SVM decision function ...... 209

Figure 7.6: Classification part of EDSF ...... 210

Figure 7.7: Sample ARFF file ...... 215

Figure 7.8: Sample confusion matrix ...... 220

Figure 7.9: Sample CAPTCHA used for the experiment ...... 229

Figure 8.1: Series of spam users’ HTTP requests ...... 237

Figure 8.2: The problem of input vector length for machine learning classifiers ...... 238

Figure 8.3: The overall framework for OFSF ...... 242

Figure 8.4: Flowchart for Pre-processing part of OFSF ...... 244

Figure 8.5: The flowchart for feature measurement part of OFSF...... 245

Figure 8.6: Sample Trie data structure for sets of genuine human users and spam users...... 247


Figure 8.7: The flowchart of OFSF rule-based classifier ...... 247

Figure 8.8: The flowchart for OFSF classification part ...... 248

Figure 8.9: Average accuracy per dataset for different window sizes ...... 257

Figure 8.10: MCC value per different datasets on each window sizes ...... 258

Figure 8.11: Performance of OFSF based on different thresholds for different window sizes ...... 259

Figure 8.12: The performance of OFSF on HSCOM6 ...... 264

Figure 9.1: Three main parts of the HoneySpam 2.0 framework ...... 270


List of Tables

Table 1.1: Examples of spam features in different platforms...... 13

Table 1.2: Sample parameters in link-based spam filtering approaches ...... 14

Table 1.3: Examples of behavioural features in early-detection approach...... 18

Table 2.1: A comparative summary of link-based web spam filtering methods...... 50

Table 2.2: A comparative summary of link-based web spam filtering methods...... 56

Table 2.3: Evaluation of state of the art web spam filtering methods*...... 59

Table 2.4: Comparative summary of detection-based Spam 2.0 filtering methods...... 81

Table 2.5: Comparative summary of early-detection based Spam 2.0 filtering methods...... 88

Table 2.6: Evaluation of state of the art Spam 2.0 filtering methods*...... 90

Table 5.1: URL suffixes that need to be removed from the captured data...... 129

Table 5.2: Event handlers for different form fields...... 144

Table 5.3: Data cleansing tasks in the pre-processing part of HoneySpam 2.0...... 146

Table 5.4: SQL syntax statements to query database for general usage analysis...... 148

Table 5.5: Deployment specification of each host...... 153

Table 5.6: Summary of tracking data for HSCOM1 for two periods...... 157

Table 5.7: Comparative study of tracking approaches ...... 168

Table 5.8: An example of usage of web usage mining technique ...... 169

Table 6.1: Examples of general (G) and specific (S) Actions ...... 178


Table 6.2: Nine feature vectors used in SOM...... 184

Table 6.3: Parameter specifications for the experiment ...... 191

Table 6.4: Summary of Actions performed by spammers on HSCOM1 forum data ...... 191

Table 6.5: Sample Action Time input vector used in SOM experiment ...... 193

Table 6.6: Major characteristics of spam users at the session level ...... 194

Table 6.7: Major characteristics of spam users in the user level...... 196

Table 6.8: Major characteristics of spam users in the IP level ...... 198

Table 7.1: Summary of the dataset for the EDSF experiment ...... 217

Table 7.2: Parameters and their associated values used in the EDSF ...... 218

Table 7.3: The list of all the extracted Actions ...... 219

Table 7.4: Summary of experimental results for Action Set, aS, feature set ...... 221

Table 7.5: Confusion matrix for the experiment on Action Set ...... 221

Table 7.6: Summary of experimental results for Action Time, aT, feature set ...... 222

Table 7.7: Confusion matrix for experiment on Action Time ...... 222

Table 7.8: Summary of experimental results for Action Frequency, aF, feature set ...... 223

Table 7.9: Confusion matrix for the experiment on Action Frequency ...... 223

Table 7.10: Summary of the collected data on HSCOM6 ...... 225

Table 7.11: Confusion matrix corresponding to the Akismet experiment ...... 227

Table 7.12: Confusion matrix for Typepad AntiSpam experiment ...... 227

Table 7.13: Confusion matrix for Defensio experiment ...... 228

Table 7.14: Parameters and their associated values for the EDSF experiment ...... 230

Table 7.15: Name and description of each Action...... 231


Table 7.16: Confusion matrix for Action Set experiment ...... 232

Table 7.17: Confusion matrix for Action Frequency experiment ...... 232

Table 7.18: Summary of the EDSF comparison analysis ...... 233

Table 8.1: Summary of common Actions on Web 2.0 platforms ...... 244

Table 8.2: Summary of the OFSF experiment dataset ...... 254

Table 8.3: Parameters and their associated values used in OFSF experiment ...... 255

Table 8.4: Overall average accuracy per dataset and for each different window size ...... 257

Table 8.5: The overall average MCC value per dataset and for each window size ...... 258

Table 8.6: The average precision, recall and F1 measure value over all datasets ...... 258

Table 8.7: The result of OFSF experiment on HSCOM6 ...... 261

Table 8.8: The number of Non-Match Action Strings in OFSF experiment...... 262

Table 8.9: The summary of OFSF comparison analysis ...... 263


List of Algorithms

Code 5.1: Pseudo-code for NTC ...... 141

Code 5.2: Pseudo-code for the FTC server-side...... 143

Code 5.3: Pseudo-code for session identification...... 147

Code 5.4: Pseudo-code for the word frequency algorithm...... 149

Code 6.1: Pseudo-code for Action Set ...... 186

Code 6.2 : Pseudo-code for Action Time algorithm...... 188

Code 6.3: Pseudo-code for Action Frequency algorithm...... 189

Code 6.4: Pseudo-codes for the construction of feature input vectors...... 190


Chapter 1

Introduction

This chapter

► Provides an introduction to different types of spam.

► Gives an overview of current anti-spam technologies.

► Presents the focus of this research, which is the new generation of spam called Spam 2.0.

► Explains the significance and importance of Spam 2.0 management.

► Explains the motivation and objective of this research.

► Presents the structure of this dissertation.

1.1 Introduction

The rapid adoption of the Internet in the day-to-day lives of people has provided a platform for

information generation and consumption. The Internet is used on a daily basis to search for

information and acquire knowledge. For instance, blogs are well-known online diary applications, wikis are known as huge knowledge bases, online discussion boards (forums) are places for knowledge sharing, brainstorming and discussion, and online communities make it possible for

people to create a virtual identity for themselves on the Internet. As newer platforms are created for

the Internet, content generation and consumption become faster and easier than ever before. However,

the ease with which content can be generated and published has also made it easier to create

inappropriate, unsolicited, and irrelevant content. Spam, as this content is known, does not add value

for web users.


1.2 What is spam?

There is no exact definition of spam. Generally speaking, spam can be termed as an “unwanted email message”. However, not all unwanted emails are spam (e.g. emails that come through mailing lists). Another term could be “promotional commercial email”, but again not all promotional commercial emails are spam (e.g. newsletters for software updates that the user has requested to receive in his/her inbox). Spam can also be referred to as “junk email” but this begs the question,

“what is junk email?”

One general and accepted definition of spam is: mass, unsolicited and unwanted email (Junod

1997).

1.3 Origin of the term ‘spam’

The term ‘spam’ originates from a popular 1970s TV show called “Monty Python's Flying Circus”. In one of the sketches, a customer orders a meal from a restaurant whose menu includes a canned, pre-processed meat called Spam® in every dish. The restaurant’s waitress insists on serving unwanted Spam® to customers (Pfleeger & Bloom 2005). Hence, ‘spam’ denotes something unwanted that causes annoyance.

1.4 Different forms of spam

The most widely recognised form of spam is spam in email communications. However, spam exists in other media such as instant messengers (Spam in Instant Messenger or SPIM), Internet Telephony (Spam over Internet Telephony or SPIT), mobile phones (M-spam), websites (Web spam), and finally Web 2.0 (Spam 2.0). Figure 1.1 presents the adaptability of different spam types to different platforms. The figure illustrates different types of spam (Email Spam, SPIM, M-Spam, Web Spam, SPIT and Spam 2.0) for different media (Email, Instant Messenger, Mobile, Search Engines, Internet Telephony and Web 2.0). A detailed description of each type of spam is given in the remainder of this section.


Figure 1.1: Evolution and adaptability of spam once new media emerge

Although spam has evolved over time, most research efforts have focused on earlier types of spam

such as email spam. Figure 1.2 illustrates the number of hits for different spam types on the Internet.

This figure has been derived from the number of results for each spam type, from major search

engines on the Internet.¹

Figure 1.2: Number of hits for different types of spam.

¹ There is a possibility of repeated results in the figure; however, this would affect all types of spam equally, so the effects would cancel each other out.


1.4.1 Email Spam

Email spam, the most common type of spam, targets the email system and is often referred to simply as ‘spam’.

1.4.2 Instant Messenger Spam (SPIM)

Spam in instant messaging programs such as AIM, Yahoo! Messenger, XMPP-based clients, etc. is referred to as SPIM (Zhijun Liu et al. 2005) (Jun Bi et al. 2008). Users of such systems are often solicited with unwanted messages and links.

1.4.3 Mobile Spam (M-spam)

The abuse of the text messaging service of mobile phones for the purpose of spam is referred to as M-spam or SMS spam (Cormack et al. 2007). As the cost of text services continues to fall, the impact of M-spam is likely to increase rapidly in the future.

1.4.4 Webpage/Website Spam (Web spam or Spamdexing)

Web spam is defined as web pages that are deliberately created to trick search engines into offering unsolicited, redundant and misleading search results, as well as to mislead users to unwanted and unsolicited web pages (Gyongyi & Garcia-Molina 2005). This form of spam is also known as Spamdexing or Black Hat SEO since it targets Search Engine Optimisation (SEO) algorithms. SEO refers to techniques designed to improve the ranking of a particular website or web page in search engine indexing results in an ethical way (e.g. more links to a website can improve its importance and ranking).

Web spam is quite widespread, and there have been many efforts (ref. Chapter 2) in the literature to prevent the propagation of web spam as well as to remove spam websites from search engine indexes.

1.4.5 Spam over Internet Telephony (SPIT)

Voice over Internet Protocol (VoIP) is a communication protocol for the transmission of voice and multimedia through Internet Protocol (IP)-based networks. Since VoIP communications can reduce both infrastructure and communication costs, they are rapidly becoming popular. Unsolicited phone calls received via the VoIP system are known as SPIT (Quittek et al. 2008). This form of spam is


more annoying than other forms since it directly targets VoIP users’ phone devices, causing

distraction and discomfort.

So far, the most common spam types, i.e. SPIM, SPIT, M-spam, and Web spam, have been

explained. In the next section, a detailed description of the next generation of spam, i.e. Spam 2.0, is

given. This is the main focus of this research study.

1.5 Spam in Web 2.0 Applications (Spam 2.0)

Web 2.0 is commonly associated with web applications that facilitate interactive information sharing,

interoperability, user-centred design and collaboration on the World Wide Web (O'Reilly 2007). The

read/write Web 2.0 concepts are the backbone of the vast majority of web services. In contrast to non-

interactive websites, Web 2.0 places greater emphasis on human collaboration through a participation

architecture that encourages users to add value to web applications as they use them (O'Reilly 2007).

Today, Web 2.0 functions are commonly found in web-based communities, applications, social-

networking sites, media-sharing sites, wikis, blogs, mashups, and folksonomies. They are widely

provided by government, public, private and individual entities.

New media bring new means of communication that are not immune to abuse. A fake eye-catching profile on social networking websites, a promotional review, a response to a thread in online forums with unsolicited content, a manipulated wiki page, etc. are examples of a new form

of spam on the web that is referred to as Web 2.0 Spam or Spam 2.0 (Hayati, Potdar, Sarenche, et al.

2010). Spam 2.0 (or Web 2.0 Spam) is defined as below.

1.5.1 Definition of Spam 2.0

“Spam 2.0 is defined as the propagation of unsolicited, anonymous, mass content to infiltrate

legitimate Web 2.0 applications” (Hayati, Potdar, Sarenche et al. 2010). The key term in this

definition of Spam 2.0 is the word “legitimate” because this type of spam is distributed through

legitimate websites. This differentiates Spam 2.0 from other types of spam (Hayati, Potdar, Sarenche,

et al. 2010).

Previously, spammers used media such as email, instant messengers, internet telephony etc to spread

spam; hence, such media served as a communication medium between spammers and genuine users.

However, Spam 2.0 acts differently as it uses legitimate Web 2.0 applications to host spam. In Spam


2.0, spammers no longer host their own email/web servers. Instead, they post spam on legitimate websites like genuine users do. Legitimate websites here refer to genuine websites used by real users, including those of government departments, universities, companies, personal homepages, etc. Spam 2.0 infiltrates legitimate websites by posting/hosting spam content on them. Figure 1.3 illustrates examples of Spam 2.0 on legitimate websites: blog comments, forum posts and online communities.

Figure 1.3: Examples of Spam 2.0. (a) A comment spam, (b) an online community spam, and (c) a forum post spam.

Spam 2.0 has the following characteristics:

Spam 2.0 targets Web 2.0 platforms. Spam 2.0 spreads through any web application that allows users the privilege of contributing to content generation, including forums, comment systems, wikis, online communities, etc.


Spam 2.0 is parasitic in nature. It is hosted on legitimate and often official websites. If

such information persists, the credibility of such websites is diminished, and many users can be

misdirected to scams and computer malware; websites may be blacklisted, thereby depriving all other

users from accessing legitimate content.

Spam 2.0 aims to deceive web users. Spam 2.0 posts can be attractive enough to deceive and

misdirect users to scams or malware (e.g. a comment on a user’s profile page). This affects the user’s

experience of surfing the web, causes frustration, and diminishes the quality and credibility of the web

content.

Spam 2.0 can propagate web spam. Spam 2.0 can be generated using web spam tactics in order to

manipulate the result of search engine ranking. URLs for spam websites can be distributed via

numerous public Web 2.0 platforms. Hence, it can tamper with the targeted website’s ranking in search

engine results (ref. Section 2.2.3, Web spam tactics).

Spam 2.0 distribution is automated. Typically, Spam 2.0 is distributed through automated tools.

This allows Spam 2.0 to increase more rapidly and infiltrate a significant number of targets.

Additionally, there is evidence of real human activity through the hiring of low-income workers (Workathome 2009).

Spam 2.0 wastes network and web resources. Network bandwidth and web storage are misused for

the distribution and hosting of Spam 2.0. Additionally, this may have ramifications for computational

and power resources used by network facilities to handle Spam 2.0 content.

To date, as a result of its success and impact rates, Spam 2.0 has become far more popular among spammers and has a far greater negative socio-economic impact than its predecessors. The next section discusses the significance and nature of Spam 2.0 compared with previous types of spam.

1.5.2 Spam 2.0 vs. Email Spam

Spam 2.0 offers a far more attractive proposition compared with traditional spam, specifically email

spam (Hayati & Potdar 2009). Web 2.0 applications can be discovered through a simple search engine

query that contains “domain keywords” and a “web application name”. Email addresses are procured


in a similar fashion except that details are commonly tightly controlled online and are far more difficult to source. In contrast, Web 2.0-based websites try to promote themselves so that web users are able to easily find them and contribute. Such applications rely upon relatively anonymous social network users to freely interact online.

A single Spam 2.0 attack may reach many targeted and domain-specific users, whereas a single email message would potentially reach only one random individual, provided the email address is real and the message is not stopped by today's effective email spam filters (Hayati, Potdar, Sarenche et al. 2010).

Furthermore, when individuals discover an email message that has bypassed their filters, they are able to remove it. However, spam messages posted online typically cannot be deleted by regular users and persist until a website operator deals with them, often affecting many users in the meantime.

1.5.3 Spam 2.0 vs. Web spam

In contrast to web spam, Spam 2.0 posts are spread on legitimate websites: they are posted to publicly accessible blogs, discussion boards, wikis, etc.

Furthermore, Spam 2.0 not only targets search engines, but can also be formulated to mislead human users. For example, a comment in a blog post, as illustrated in Figure 1.4, might mislead human users into visiting the associated link.

Thanks, I am interested in your post. Could you please let me know what you think about this article?

WWW.SPAMWEBSITE.COM

Figure 1.4: Example of misleading comment spam

To date, spam has seriously compromised the quality of the web. This

problem has been exacerbated by the emergence of Spam 2.0, which

infiltrates the content of legitimate websites. Hence, research in Spam 2.0

management is of prime importance if web content is to maintain its

quality and credibility.


1.6 Anti-Spam Solutions (Spam Filtering)

1.6.1 Importance of spam filtering

The significant increase in the amount of spam on the web and web applications has made it

imperative to find reliable anti-spam solutions. A report carried out by Sophos in 2008 revealed that

every 3 seconds a new spam-related web page is created (net-security.org 2008). In other words,

every day 23,300 spam-related web pages are hosted on the web. Live Spam Zeitgeist has shown an

increase of over 50% in the number of spam messages (e.g. spam comments in blogs) since 2008

(Live-Spam-Zeitgeist 2009). This report showed a dramatic increase in the amount of Spam 2.0.

Additionally, Kolari et al. (2006) pointed out that over 75% of blogs are splogs, and this figure may

have increased even more by now.

1.6.2 Anti-Spam concepts

To provide a better understanding of the concepts involved in the area of anti-spam, an analogy between

spam filtering approaches and the clinical service chain in a health care system is discussed here.

Figure 1.5 illustrates 4 main groups of activities in the clinical service chain – prevention, diagnosis

(or detection), treatment, and recovery. An explanation of these concepts in terms of anti-spam is as

follows.


Figure 1.5: Anti-spam solutions main activities

1.6.2.1 Prevention

In the health care system, ‘prevention’ refers to those measures taken to prevent infection or disease (Horstkotte et al. 2004). Similarly, in the domain of spam management, prevention constitutes the measures taken to prevent spam rather than detecting and removing spam from the system (ref. Section 1.6.4.3). Prevention activities restrict spam content by increasing distribution costs through time, computation, and economic challenges.

1.6.2.2 Diagnosis (detection)

Regarding health care, ‘diagnosis’ is the process of identifying diseases (Horstkotte et al. 2004). Similarly, in terms of spam management, diagnosis or detection includes any measures taken to identify spam patterns. This task can be done through an analysis of spam content or behaviour patterns (ref. Section 1.6.4.1). The spam patterns within content can be identified by extracting spam keywords from email messages, analysing request header information etc.

1.6.2.3 Treatment

In the health care system, ‘treatment’ consists of those measures taken to cure a disease or health problem (Horstkotte et al. 2004). Similarly, with spam, treatment focuses on removing spam from the


system and enables system operators or system users to effectively remove the infiltrating spam. For

example, a feature in email systems that facilitates the removal of spam in bulk can be considered as a

treatment activity. This area of research is quite young and not many studies have been undertaken so

far.

1.6.2.4 Recovery

The restoration of health to a normal condition after a health problem has been treated is the recovery

process in the health care system (Horstkotte et al. 2004). Similarly, in the spam management domain,

recovery is any kind of method that returns the system to its former status and retains the details of

recent spam distribution for future spam filtering. For example, a system can blacklist the IP address

of origin of recent spam. Similar to treatment, this area of research has not received much attention to

date.

Based on the concepts explained above, the next section discusses the existing approaches to spam

filtering.

There is a broad spectrum of potential solutions for spam filtering, ranging from legal measures to technical

filters. The aim of each and every solution is to make spam unprofitable. The following sections

discuss each solution.

1.6.3 Legal Approach

Legal solutions strive to regulate spam through appropriate legislation and enforcements. For

example, according to the Australian Spam Act 2003 it is illegal to “send or cause to be sent” spam

through email, instant messengers, and mobile phones. Although legislation makes it easier for spam

to be filtered out, there are no world-wide legal agreements regarding spam filtering. Therefore, anti-spam regulations might not be effective enough (e.g. spam originating from a jurisdiction that has no anti-spam legislation might not be easily prosecuted) (B. Whitworth & E. Whitworth

2004).

1.6.4 Technical Approach

Technical solutions are practical real-world solutions that can be implemented in software. Technical solutions can be categorised into four approaches – detection-based, link-based, early-detection based, and prevention-based approaches.


With regard to the clinical service chain, detection-based, link-based, and early-detection based approaches correspond to the diagnosis step, while prevention-based approaches correspond to the prevention step.

The details of each approach are discussed in the following sections.

1.6.4.1 Detection-based approach

Methods in this category deal with the content of spam in order to discover spam patterns (Paul et al. 2007). These detection methods search for spam words, templates, attachments, etc. inside the content. For example, an email with the subject “$$$$ BIG MONEY $$$$” would be marked as spam by a keyword-based spam filter (Ion et al. 2000). As shown in Figure 1.6, detection methods follow three steps. The first step is to take in content from different systems (e.g. email, web pages). The second step is to mine features such as keywords from the content and meta-content. Finally, the last step is to perform classification, which labels the content as spam or non-spam.

Classification criteria can be rule-based or statistics-based. The former employ heuristic rules concerning, for example, the organisation of the content or the existence of specific keywords. Rule lists need to be updated regularly with new rule sets; otherwise, the detection system cannot perform effectively.

Figure 1.6: The procedure for the detection-based spam filtering approach


Detection-based spam filtering methods can investigate different parts of the content in order to find

spam patterns. These parts are platform-dependent. Table 1.1 lists examples of such parts for email, web pages, blog comments and forum posts.

Table 1.1: Examples of spam features in different platforms.

Platform        Where to look (part of content)
Email           Message content; headers (origin and destination address); message path route
Web page        Title of page; structure of the page; content of the page; amount of content vs. hyperlinks
Blog comment    No. of links per comment; similarity of comment content and blog post
Forum post      Content of post; no. of images per post; no. of links per post
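To make the three-step procedure concrete, the following minimal sketch (in Python) mines keyword features from content and classifies it with a weighted rule list. The keywords, weights and threshold are illustrative assumptions, not values proposed in this dissertation:

    # Minimal rule-based detection sketch (Step 2: feature mining,
    # Step 3: classification). Rules and threshold are illustrative only.
    SPAM_RULES = {"big money": 3.0, "$$$": 3.0, "click here": 1.5}
    SPAM_THRESHOLD = 4.0

    def extract_features(content):
        """Mine keyword features from the content (Step 2)."""
        text = content.lower()
        return {keyword: text.count(keyword) for keyword in SPAM_RULES}

    def classify(content):
        """Label the content as spam (True) or non-spam (False) (Step 3)."""
        features = extract_features(content)
        score = sum(SPAM_RULES[kw] * n for kw, n in features.items())
        return score >= SPAM_THRESHOLD

    print(classify("$$$$ BIG MONEY $$$$"))               # True: spam
    print(classify("Thanks for the informative post."))  # False: non-spam

In a real filter, the hand-written rule list would be updated regularly or replaced by a statistics-based classifier trained on labelled messages.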

1.6.4.2 Link-based approach

In the link-based approach, spam filtering methods employ various link-analysis, graph-analysis or network-analysis techniques to separate spam nodes from genuine ones (Caverlee et al. 2009). By constructing a graph view of web/email communication and looking for differentiating parameters (e.g. the number of links between two web pages), such methods are capable of discovering spam websites

(Gyongyi et al. 2004). Figure 1.7 illustrates a sample procedure in link-based approaches. The vertices

of the graph can be web pages, websites, hostnames, email contacts etc and edges can be defined

accordingly.


Figure 1.7: The procedure of the link-based spam filtering approach

Link-based approaches have been widely studied in web spam filtering (ref. Section 2.2.5), and there have been some efforts in email spam filtering as well. Table 1.2 presents several sample parameters used in link-based approaches.

Table 1.2: Sample parameters in link-based spam filtering approaches

Platform    Where to look (parameter)
Email       No. of sent emails vs. no. of received emails; no. of emails a user has sent within a specific period of time; no. of emails a user has sent at a particular time of day
Web         No. of links a web page links to vs. no. of links pointing to it; no. of suspicious links pointing to a web page; no. of links pointing to a web page from a specific host
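As a concrete illustration of the graph view, the sketch below (Python; the node names and the threshold are invented for illustration) builds a small directed link graph and flags nodes whose ratio of out-links to in-links is suspiciously high, in the spirit of the parameters in Table 1.2:

    # Link-based sketch: vertices are web pages/sites, edges are hyperlinks.
    # The example graph and the ratio threshold are illustrative assumptions.
    from collections import defaultdict

    edges = [
        ("spam-site", "victim-blog"),    # the spam site links out aggressively
        ("spam-site", "victim-forum"),
        ("spam-site", "victim-wiki"),
        ("victim-blog", "victim-forum"),
        ("victim-forum", "victim-blog"),
    ]

    out_links, in_links = defaultdict(int), defaultdict(int)
    for source, target in edges:
        out_links[source] += 1
        in_links[target] += 1

    for node in sorted(set(out_links) | set(in_links)):
        # Parameter: links a page points to vs. links pointing back to it.
        ratio = out_links[node] / max(in_links[node], 1)
        if ratio > 2.0:  # assumed threshold for a suspicious node
            print("%s looks suspicious (out/in ratio %.1f)" % (node, ratio))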


1.6.4.3 Prevention-based approach

Prevention methods strive to limit automated access to the system (Paul et al. 2007) (Hayati & Potdar 2008). By utilising computational, time, and economic challenges, prevention methods make spam distribution difficult (Paul et al. 2007). As illustrated in Figure 1.8, prevention methods allow the genuine user to enter the system while keeping the spammer out.

Figure 1.8: The procedure for the prevention-based spam filtering approach.

Examples of widely used prevention-based spam filtering approaches are as follows.

Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA): a challenge-response test designed to distinguish human users from automatically generated requests (von Ahn et al. 2004). CAPTCHA tests need to be automatically generated, easy for human users to solve, but hard for machines to bypass. Figure 1.9 illustrates an example of a recent CAPTCHA known as reCAPTCHA (von Ahn et al. 2008). Such tests have been widely used in email and web systems.

Most of the current web applications include CAPTCHA as part of their content submission and

registration process.


Figure 1.9: A recent CAPTCHA challenge known as reCAPTCHA.

Proof-of-Work (POW): The idea of this test is based on asymmetry: on the client side, the task is feasibly hard to perform, but it can be easily verified on the server side (Paul et al. 2007). HashCash is an example of a POW test designed to limit email spam: the sender of an email is required to calculate a stamp prior to sending it (Back 2002). Stamp generation requires CPU time, which increases the cost per email message (more details in

Section 2.3.4.2).
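A minimal HashCash-style sketch is shown below, assuming a SHA-1 digest and a difficulty of 20 leading zero bits; the real HashCash stamp format carries additional fields (version, date, recipient, random salt) that are omitted here:

    import hashlib
    from itertools import count

    DIFFICULTY_BITS = 20  # assumed difficulty: required leading zero bits

    def mint_stamp(recipient):
        """Client side: expensive brute-force search for a valid stamp."""
        for counter in count():
            stamp = "%s:%d" % (recipient, counter)
            digest = int.from_bytes(hashlib.sha1(stamp.encode()).digest(), "big")
            if digest >> (160 - DIFFICULTY_BITS) == 0:  # SHA-1 digests are 160 bits
                return stamp

    def verify_stamp(stamp):
        """Server side: a single hash suffices to verify the work."""
        digest = int.from_bytes(hashlib.sha1(stamp.encode()).digest(), "big")
        return digest >> (160 - DIFFICULTY_BITS) == 0

    stamp = mint_stamp("user@example.com")  # about 2**20 hashes: costly for the sender
    print(verify_stamp(stamp))              # True, at the cost of one hash for the receiver

The asymmetry is the point: minting takes on the order of a million hash evaluations per message, while verification takes exactly one.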

Challenge-response (CR): A CR test requires the sender to successfully overcome a challenge before the message is allowed to pass through to the destination (Paul et al. 2007). CR is normally used in email systems. For example, the sender of an unknown incoming email needs to click on a specific URL in an auto-generated reply from the spam filter.

Gray listing: This technique delays or rejects a message the first time it is received, requiring the sender to re-submit it (Levine 2005). Typically, gray listing is used in email systems to verify the sender.

Time challenges limit the rate of actions in the system based on time constraints. For instance, flood control is a type of time challenge that limits the number of requests a client can send to a forum within a specified time interval. Another example is when a user creates a new thread in a forum; s/he may have to wait for some time before creating another thread.
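A flood-control check can be as simple as the following sketch (a hypothetical in-memory rate limiter; the 30-second minimum interval and the per-client dictionary are assumptions made for illustration):

    import time

    MIN_INTERVAL = 30.0  # assumed: seconds a client must wait between posts
    last_post = {}       # client identifier -> time of last accepted request

    def allow_post(client_id, now=None):
        """Deny a request that arrives too soon after the client's previous one."""
        now = time.time() if now is None else now
        previous = last_post.get(client_id)
        if previous is not None and now - previous < MIN_INTERVAL:
            return False  # flood control: request rate too high
        last_post[client_id] = now
        return True

    print(allow_post("client-1", now=0.0))  # True: first request accepted
    print(allow_post("client-1", now=5.0))  # False: only 5 s since the last post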

Economic challenges are designed to make spam distribution non-economical, thereby making the sending of spam unaffordable as the cost per spam message is more than the potential revenue to be generated. Techniques such as E-stamp require the sender of the email message to pay a specific amount of money to the receiver (Fahlman 2002). Once the receiver accepts the sender’s email message, this payment would be refunded.


1.6.4.4 Early-detection based approach

Early-detection methods prevent the distribution of spam throughout the system. Early-detection

methods typically examine the nature and behaviour of spam generation (Hayati, Potdar, Chai et al.

2010). By studying the behaviour of content distribution and investigating behavioural features, early-

detection methods are able to distinguish spam from legitimate content. Such methods are not

detection-based since they do not investigate content to find spam patterns; nor do they impose costs

on user actions to limit access to the system. Figure 1.10 shows the process of the early-detection

based approach that investigates users’ behaviour and interaction inside the system for the detection

of spam users.

Figure 1.10: The procedure of early-detection based spam filtering approach.

Behavioural features vary across platforms. Table 1.3 lists some of the behavioural features for blog comments and general Web 2.0 applications.


Table 1.3: Examples of behavioural features in early-detection approach.

Platform                       Behavioural feature
Blog comment                   Existence of keyboard interaction while filling in the comment form; existence of mouse interaction while moving across the comment form’s fields; amount of time the user spends on submitting the comment form
General Web 2.0 application    Navigation pattern; amount of time the user spends on each page; amount of time the user spends on filling in each field of the form
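For illustration, the features in Table 1.3 could feed a behavioural check like the sketch below. The field names, thresholds and decision rule are assumptions made for this example; they are not the EDSF or OFSF classifiers developed later in this dissertation:

    # Behavioural early-detection sketch over features from Table 1.3.
    # Thresholds and the simple rule are illustrative assumptions only.

    def looks_automated(session):
        """Flag a form submission whose behaviour suggests an automated tool."""
        no_keyboard = session["key_events"] == 0    # no keystrokes observed
        no_mouse = session["mouse_events"] == 0     # no pointer movement
        too_fast = session["form_time_sec"] < 2.0   # form filled in under 2 s
        return (no_keyboard and no_mouse) or too_fast

    human = {"key_events": 42, "mouse_events": 118, "form_time_sec": 35.0}
    spambot = {"key_events": 0, "mouse_events": 0, "form_time_sec": 0.4}

    print(looks_automated(human))    # False: behaves like a genuine user
    print(looks_automated(spambot))  # True: behaves like an automated tool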

The existing approaches to combat spam have been discussed. The next section explains the main research motivations for this dissertation.

1.7 Research Motivations

The main purpose of this research is to address the issue of Spam 2.0 filtering and develop a framework for early-detection and prevention of spam on Web 2.0 platforms. However, it should be duly noted that Spam 2.0 filtering has many applications including, but not limited to, preventing scam and online fraud, and improving content quality. Current literature highlights some key problems associated with Spam 2.0 filtering; these problems serve as the motivation for this study.

These key problems include:

1. Spam 2.0 is not addressed as a new type of spam.

2. Spam 2.0 is parasitic by nature.

3. Spam 2.0 causes user inconvenience.

4. Existing detection approaches are not effective in filtering Spam 2.0.

5. The increasing rate of Spam 2.0 distribution is out of control.

6. Online fraud and scam are facilitated by Spam 2.0.

7. Trust, credibility and quality of information on Web 2.0 platforms are compromised because of Spam 2.0.


1.7.1 Spam 2.0 is not addressed as a new type of spam

The existing literature focuses mainly on spam filtering in different web applications but does not

address Spam 2.0 as a new generation of spam. For example, some works have addressed platform-

dependent spam filtering in social bookmarking, video sharing, and comment systems; however, there has been no effort to address Spam 2.0 as a whole (Hayati & Potdar 2009).

In order to effectively combat Spam 2.0, it is necessary to approach it as a new type of spam which

has its own characteristics. Therefore, a specific solution is needed to address such characteristics.

Otherwise, an anti-Spam 2.0 solution would be limited to specific platforms only (i.e. platform

dependent) and could not be extended to other web applications. As a result, solutions might not be effective enough for combating spam.

1.7.2 Spam 2.0 is parasitic by nature

Spam 2.0 is different in nature from previous types of spam. For instance, Spam 2.0 has a parasitic nature, infiltrating legitimate applications. Hence, traditional spam filters might not be adequate for

combating this new type of spam. To date, most of the efforts on Spam 2.0 filtering have been

concerned with adapting old methods applied to email spam filtering to detect Spam 2.0. Therefore,

the outcome of such efforts is inadequate for Spam 2.0 filtering (Zinman & Donath 2007) (Jindal &

Bing 2007) (Benevenuto et al. 2008).

1.7.3 Spam 2.0 causes user inconvenience

Some anti-spam approaches, mainly prevention-based ones, inconvenience users and can drive them away. For instance, a typical CAPTCHA requires users to read and retype phrases shown inside images. Often the image is deformed and not easy to read. This is inconvenient for humans (Figure 1.11). However, as

computers become more powerful, they might decipher such prevention techniques better than

humans can. Additionally, prevention-based solutions are extremely challenging for people with a

disability (May 2005).

Figure 1.11: Two sample CAPTCHAs that are difficult even for humans to decipher


1.7.4 Existing detection approaches are not effective in filtering Spam 2.0

Since most of the current efforts adapt old techniques for Spam 2.0 filtering, the outcome has not been effective enough and most current techniques have low accuracy in detecting Spam 2.0 (Nitin & Bing

2008) (Zinman & Donath 2007) (Jindal & Bing 2007) (Benevenuto et al. 2008).

1.7.5 The increasing rate of Spam 2.0 distribution is out of control

As Web 2.0 grows, there are more opportunities for Spam 2.0 to flourish. The distribution of Spam 2.0 is made faster by automated tools such as auto submitters and web spambots.

An auto submitter is an automated tool used to perform repetitive tasks on the web (Hayati, Potdar, Sarenche et al. 2010). Such tools are used to distribute spam to many websites quickly, within a short period. They target online input forms, such as contact forms or comment forms, for multiple consecutive submissions.

Web spambots (or simply spambots), a type of web robot, are another automated but more sophisticated Spam 2.0 technique (Hayati et al. 2009). Web robots, also called Internet robots, are automated agents or scripts that perform specific tasks over the web in an automatic fashion (Koster 1993). They execute structured repetitive tasks faster than humans can. Since the web is increasing enormously in size and value, web robots play an essential role in the Internet community (Tan &

Kumar 2002).

Web robots are a double-edged sword: they can assist humans or pose a threat to web applications (Koster 1995). The early versions of spambots were email spambots that harvested email addresses from web pages, mailing lists, directory databases, and chat rooms in order to send numerous unwanted emails (Tan & Kumar 2002). Email spambots can send many emails efficiently and reliably. However, as Web 2.0 opens new avenues for interacting with people, a new version of spambot, known as the web spambot, has been developed to infiltrate Web 2.0 applications. Web spambots are intended to submit spam content to Web 2.0 applications. They surf the web to find Web 2.0 applications and spread Spam

2.0. They can target one specific Web 2.0 application, such as wikis, as well as focus on one particular website such as MySpace (Hayati et al. 2009).


Automated techniques have increased the stream of Spam 2.0 but have not been thoroughly

investigated. Therefore, an efficient Spam 2.0 solution requires countermeasures to hinder such

automated techniques.

1.7.6 Online fraud and scam are facilitated by Spam 2.0

Some might think that spam itself is not a significant enough problem to merit real attention.

However, the rapid increase in the amount of online fraud and scam highlights the need for effective

spam filtering. To date, the key factor that facilitates online fraud is its fast distribution over a small

timeframe. This can be achieved by utilising spam techniques to target as many people as possible.

Hence, Spam 2.0 filtering can significantly and indirectly reduce the amount of scam and online

fraud.

1.7.7 Trust, credibility and quality of information on Web 2.0 platforms are compromised because of Spam 2.0

Spam 2.0 causes frustration for users who are searching for good quality information, for website operators who have to filter out spam content and let genuine content come through, for network administrators who have to block unwanted spam traffic in order to conserve valuable network resources, and for hosting providers who need to protect their storage from spam content. As more spam content

infiltrates Web 2.0, the quality and credibility of the content decrease; moreover, the users’ search

experience is degraded.

Based on the foregoing issues, it is clear that this area of research is quite young but needs immediate

attention. There is an urgent need for an efficient and practical filtering mechanism for Spam 2.0.

Policies for current filters need to be updated, and web application developers need to be aware of current Spam 2.0 techniques and of ways to secure their applications.

This research tackles a unique spam filtering approach: it examines

spam-content generation behaviour on the Web 2.0 platform in order to

distinguish spam from legitimate content. The research also provides a

definition for this new type of spam.


1.8 Significance of Research

The significance of the problems addressed in this thesis is threefold and the benefits resulting from this research include social, economic and technical (scientific) advantages.

1.8.1 Social

1.8.1.1 Improve quality of the content on the web

Spam 2.0 leads users to draw misinformed conclusions or make unwise decisions, spend more time finding their desired content, visit unsolicited websites, and become angry or frustrated. The elimination of false information on the web is important to protect organisations from the adverse impacts of poor information quality. For instance, in a recent case, incorrect news about United Airlines filing for a second bankruptcy led to its shares tumbling (Berti-Equille et al. 2009) (Xiaoxin et al. 2007).

This research will help to improve the quality of information on the web by identifying and eliminating information which is not truthful or which can be categorised as spam.

1.8.1.2 Immunise Web 2.0 platforms against Spam 2.0

Spam 2.0 filtering can be a key factor in keeping Web 2.0 platforms clean of junk, unrelated and unwanted content, saving resources (e.g. network bandwidth, storage, computation, etc.), and protecting genuine websites from being infiltrated by spam.

By providing an early-detection method, this research can stop the propagation of Spam 2.0 inside the web, thereby conserving many valuable resources.

1.8.1.3 Provide inputs for the Australian Spam Act 2003

Currently, the Australian Spam Act 2003 does not cover the vast majority of today’s spam concerns. This research will provide inputs for the Australian Spam Act 2003 by: firstly, providing a comprehensive study of the new generation of spam that can serve as an input to the Act; and secondly, proposing a methodology for spam filtering that can be integrated with web application development.


1.8.1.4 Contribution to the Australian National Broadband Network (NBN) initiative

The main outcome of this research is intended to combat spam, online fraud and scam. This result is also in line with a National Research Priority, namely “Safeguarding Australia” through securing critical infrastructure such as the NBN initiative.

1.8.2 Economic

1.8.2.1 Reducing individual, organisational and government monetary costs

In terms of monetary costs, Spam 2.0 forces web operators to spend time moderating content and removing spam, network administrators to implement monitoring facilities that track network traffic and block spam content, web hosting providers to filter out the creation of spam pages on their hosts, users to spend more time finding quality content, governments to investigate the sources of spam generation, and search engines to improve their algorithms to avoid spam content.

Briefly, such costs include storage, network bandwidth, computation, and time, as well as purchasing the latest filtering products and hiring people to manage spam. Some studies (Takemura & Ebara 2008) also show that email spam decreases GDP and productivity. Although other research has not yet confirmed the cost of Spam 2.0, we believe it far exceeds that of email spam.

1.8.2.2 Reducing online fraud and scam

As mentioned earlier, spam is a key factor in the propagation of online fraud and scam on the internet. Spam 2.0 facilitates online fraud and scam on Web 2.0 platforms through the fast and anonymous propagation of scam links and web pages; it can therefore target more people more quickly. Online fraud and scams highlight the importance of efficient spam management, specifically Spam 2.0 filtering.


Australians lost over $70 million to online fraud and scams in 2009 (AdelaideNow 2011). However, such figures represent only the reported portion of the total scams occurring on today’s web. Economically, this research can assist individuals, organisations and governments to avoid the monetary costs of Spam 2.0, as well as reduce the amount of online fraud and scam.

1.8.3 Technical

1.8.3.1 Coining a new concept of Spam 2.0

There is no comprehensive study in the current literature on the new generation of spam in Web 2.0 applications. The major works on Spam 2.0 focus on one particular type of spam in web applications (e.g. spam in social networking websites, spam in comments, spam in video sharing websites, etc.). Hence, there is a need to define Spam 2.0 and explicitly describe its particular characteristics. The main technical significance of this research is to introduce and define the concept of Spam 2.0.

1.8.3.2 Proposing an early-detection Spam 2.0 filtering approach

Many existing studies on Spam 2.0 filtering are based mainly on detection methods and do not provide satisfactory results. Prevention methods such as CAPTCHA decrease user convenience and are losing their effectiveness (ref. Chapter 2). Additionally, there is no work on early-detection-based Spam 2.0 filters.

This research provides an early-detection-based Spam 2.0 filter to combat Spam 2.0.

1.8.3.3 Web spambot characterisation and behaviour analysis

As mentioned earlier, spambots are automated agents used for Spam 2.0 propagation. Spambots make spam distribution much faster and easier. This research provides a first step in characterising spambots and utilising this knowledge for subsequent spambot detection.

1.8.3.4 Proposing an automated early-detection Spam 2.0 filtering approach

This research proposes an early-detection-based Spam 2.0 filtering approach. The proposed approach utilises both rule-based and statistical techniques to exclude automated requests and spambots from Web 2.0 platforms.


1.9 Research Objectives

The main research objective of this study is to develop a framework to study Spam 2.0 and propose a

Spam 2.0 filtering solution. This will require an evaluation of the related literature, the development

of solutions to specific research issues, the construction of models for these solutions, and the

validation of the model using real-world data. Seven research objectives have been derived from the

main goal of this dissertation as follows.

Objective 1: Provide an introduction to Spam 2.0 and its unique features

The existing literature focuses mainly on spam filtering in different web applications and does not

address Spam 2.0 as a new generation of spam. The first objective focuses on providing an

introduction to Spam 2.0 and showing how it is different from other types of spam.

Objective 2: Capture Spam 2.0 generation behaviour by creating a dataset

The current literature lacks any systematic method for tracking, investigating and studying Spam 2.0

generation behaviour. For the second objective of this research, a monitoring framework is proposed

to investigate a Spam 2.0 generation model.

Objective 3: Profile spam behaviour (i.e. Spam 2.0)

In this objective, this research aims to characterise Spam 2.0 behaviour and its model. The outcome is then used in the subsequent Spam 2.0 filtering solutions.

Objective 4: Develop a Spam 2.0 filtering mechanism that is extendible to any Web 2.0 platform

In this objective, an early-detection-based Spam 2.0 filtering method is proposed to address current

gaps in the Spam 2.0 filtering methods and associated Spam 2.0 filtering issues.

Objective 5: Develop an on-the-fly spambot filtering method for early Spam 2.0 detection

In this objective, a practical and spontaneous Spam 2.0 filtering method is proposed to identify spam users on-the-fly, while they are interacting with any Web 2.0 platform.

Objective 6: Provide a publicly available dataset to serve as a standard benchmark for research on

Spam 2.0 filtering


In this objective, two datasets containing Spam 2.0 content and Spam 2.0 behavioural data are provided. These datasets are publicly accessible online through the Anti-Spam Research Lab website (http://asrl.debii.curtin.edu.au).

Objective 7: Validate the proposed methods using a number of operational Web 2.0 platforms and evaluate their performance against existing models of Spam 2.0 filtering techniques.

In this objective, the proposed Spam 2.0 filtering solutions are compared against common existing Spam 2.0 filtering methods, and the performance of each method is evaluated.

1.10 Structure of Dissertation

The remainder of this dissertation is organized as follows.

Chapter 2 presents a comprehensive survey of recent related research on spam filtering approaches for web spam and Spam 2.0, and provides an empirical comparison of such approaches.

Chapter 3 defines three research problems – the problems of detection-based, prevention-based, and early-detection-based Spam 2.0 filtering methods – based on insights gained from the literature review in Chapter 2. It defines several preliminary concepts that are used in defining solution requirements. Given the clearly-defined problems, issues, and requirements, it then discusses the science and engineering approach that will be used to solve these problems and address the research issues.

Chapter 4 proposes the Spam 2.0 filtering conceptual framework that addresses the selected research issues. The platform infrastructure is defined based on the requirements stated in Chapter 3. Conceptual models are constructed and seven steps are provided in order to illustrate the conceptual framework.

Chapter 5 presents a mechanism named HoneySpam 2.0 to monitor, track and store Spam 2.0. This information is later analysed and investigated to discover the characteristics of Spam 2.0 and its distribution model. HoneySpam 2.0 is tested against real-world Spam 2.0 data, and its performance is compared with that of existing solutions.

Chapter 6 presents a framework to investigate, formulate, and cluster Spam 2.0 behaviours. Based on the initial results of HoneySpam 2.0 (Section 5.8), three feature sets – Action Set, Action Time, and Action Frequency – are presented here to formulate spammers’ behaviour on Web 2.0 platforms. Also, a neural network clustering mechanism is used to cluster spammers’ behaviour based on the three feature sets.

Chapter 7 proposes the first solution for Spam 2.0 filtering called early-detection based Spam 2.0

filtering method (EDSF). EDSF automatically identifies Spam 2.0 submissions from genuine human

content contribution inside Web 2.0 platforms. The proposed method studies, investigates and

identifies Spam 2.0 in Web 2.0 platforms. EDSF makes use of web usage navigation behaviour to

build up a discriminative feature set. This feature set is used later in a machine learning classifier

known as a Support Vector Machine (SVM) for classification. EDSF is tested against real-world

datasets and its performance is compared against existing Spam 2.0 filtering methods.

Chapter 8 proposes the On-the-Fly early-detection based Spam 2.0 filtering method (OFSF). OFSF

automatically separates Spam 2.0 submissions from genuine human content contribution inside Web

2.0 platforms. The proposed method studies, investigates and identifies Spam 2.0 in the Web 2.0

platform. Similar to the EDSF that is discussed in Chapter 7, OFSF makes use of web usage

navigation behaviour to create a discriminative feature set. However, OFSF considers the sequence

and order of user web navigation in order to build up the feature set for classification. Unlike EDSF,

OFSF uses a novel rule-based classifier to store, retrieve, and separate spam users from genuine

human ones. OFSF makes it possible to spot spam users on-the-fly. In other words, while users are

interacting with the Web 2.0 platform, OFSF is able to track and identify possible spam navigation

patterns. OFSF is tested against real-world datasets and its performance is compared against existing Spam 2.0 filtering methods.

Chapter 9 concludes the whole thesis work by summarising what has been achieved and its major

benefits, identifying the remaining work that needs to be done, and envisioning future work under this

research direction.

1.11 Conclusion

Spam is a major problem for today’s web. Spam 2.0 – a new generation of spam that targets Web 2.0 platforms – has spread widely into legitimate websites using automated tools. The current literature has not covered this type of spam in depth and the outcome of efforts to date is inadequate. The aim of this research is to formalise a definition for Spam 2.0 and provide Spam 2.0 filtering solutions. Early-detection, extendibility, robustness and adaptability are key factors in the design of the proposed method.


Chapter 2

Literature Review

This chapter covers

► Survey and evaluation of current web spam filtering methods.

► Survey and evaluation of current Spam 2.0 filtering methods.

► Discussion of current research gaps in spam filtering domain.

2.1 Introduction

This chapter provides a detailed overview of the literature pertaining to spam filtering as well as

giving an evaluation of recent anti-spam methods designed for web and Web 2.0 platforms. As

discussed in Section 1.6.4, these methods can be broadly categorised as:

• detection-based (i.e. content-based, statistics-based, and rule-based),

• link-based (i.e. mainly web spam filtering methods)

• prevention-based, and

• early-detection based

This chapter serves as a review of related works and current literature in the area of web spam and

Spam 2.0. This dissertation covers the following studies in the literature:

Web spam – Web spam refers to the tactics employed by spammers to increase the visibility of unsolicited websites in search engine results. It involves deception and degrades users’ search experience (Z. Gyongyi & H. Garcia-Molina 2005). The tactics currently used in web spam, along with web spam filtering solutions, are covered in detail in this chapter. This chapter also outlines

with web spam filtering solutions are covered in detail in this chapter. This chapter also outlines

existing open issues.


Spam 2.0 – Spam 2.0 is an infiltration of spam content inside legitimate Web 2.0 platforms. This is a new generation of spam and is the focus of this dissertation (Hayati et al. 2010). A total of 22 Spam

2.0 filtering studies are evaluated and open issues are then outlined.

For each type of spam, a preliminary list of terms relevant to the literature survey, an explanation of spam filtering approaches, and a critical evaluation of the proposed methods are presented.

2.2 Web Spam

This section provides:

• a definition of web spam,

• an overview of web spam preliminary knowledge,

• tactics that are being used in web spam,

• a survey of current web spam filtering methods, and

• an evaluation of existing web spam filtering methods.

2.2.1 Definition

Search engines are widely used to extract information from the World Wide Web. By querying various keywords, search engines are able to rank web pages containing the keywords. Search engines employ indexing algorithms to provide web page rankings. As the use of search engines increases in popularity and more traffic is directed from search engines to websites, website operators strive for a higher ranking in search engine results, creating an opportunity for both legitimate and illicit efforts to increase website popularity; the illicit efforts are known as web spam.

Web spam refers to those features that website operators or developers would not add to their website if search engines did not exist (Perkins 2001). However, this definition may also capture some legitimate practices, e.g. Search Engine Optimisation (SEO) (ref. Section 2.1.2) used by website developers. In another definition, web spam is defined as the techniques used to improve the visibility of unsolicited websites in search engine results, which involve deception and degrade the user experience of search results (Z. Gyongyi & H. Garcia-Molina 2005). This form of spam is also known as Spamdexing or Black hat SEO, since such techniques do not conform to appropriate SEO practice.


As a rough estimate of the total amount of spam web pages, Alexandros et al. (2006) conducted research on the prevalence of web spam. The results of their study of 105,484,446 downloaded web pages show that 95% of top eight-level domains are spam, and that the .biz and .us domains contain almost 70% and 35% spam web pages respectively.

There have been studies to investigate the motivation behind web spam

(Chellapilla & Chickering 2006a)(Yi-Min et al. 2007). The outcome of

such studies shows that monetary gain is the main motivation for web

spam. Malicious users can earn money by subscribing to pay-per-click

affiliate programs, selling search ranking for specific keywords etc

(Gyöngyi et al. 2004).

Web spam formal definition

Suppose G = (W, E) is a model of the web where W (nodes) is a set of web pages and E (edges) a set of directed links among web pages (Gyöngyi et al. 2004). If a web page p has more than one link to web page q, they are considered as a single link. ι(p) is the number of web pages that have links to p; ω(p) is the number of web pages that p links to. The former is also referred to as indegree or inlink,

while the latter is referred to as outdegree or outlink. A web page without inlinks is called an

unreferenced page. Similarly, a web page without outlinks is called a non-referencing page. A web

page without either inlink or outlink is an isolated page.

$W = \{w_1, w_2, \ldots, w_n\}$ (2.1)

where $w_i$ is the $i$-th web page in graph G.

$C = \{c_l, c_s\}$ (2.2)

where C is the set of overall web page classes, $c_l$ refers to a legitimate web page and $c_s$ refers to a spam web page. The decision function φ can be defined as in Eq. 2.3.

$\varphi : W \rightarrow C$ (2.3)

φ is a binary classification function where

$\varphi(w_i) = \begin{cases} c_l & \text{if } w_i \text{ is a legitimate web page} \\ c_s & \text{if } w_i \text{ is a spam web page} \end{cases}$ (2.4)

2.2.2 Preliminary Concepts

This section outlines several preliminary concepts pertaining to web spam and web spam filtering that assist in the understanding of the literature review.

Hypertext Transfer Protocol (HTTP) header

HTTP is a foundation protocol for data transmission on the Internet (R. Fielding et al. 1996). The HTTP header contains information about the HTTP request and response, the sender’s IP address, the time of the request, the sender’s browser identity (user-agent), the referrer URL (the URL of the web page that links to the currently requested web page), etc.

Search Engine Optimising (SEO)

SEO refers to techniques designed to improve the ranking of a particular website or web page in search engine indexing results (Perkins 2001). SEO techniques are considered legitimate while they conform to search engine guidelines (i.e. creating content for users and considering their accessibility, rather than creating content for search engines). Other attempts to increase the visibility of web pages are considered illegitimate. Such attempts are known as Spamdexing or web spam tactics and are penalised by search engines.

Search Engine Crawlers

Search engine crawlers, search-bots, and spiders are automated agents that traverse web pages to collect information (Koster 1993). They are used mainly to create copies of web pages for future indexing by search engines, as well as to extract up-to-date versions of website content.

Search Engine Indexing Algorithms or Search Engine Ranking

Search engine indexing algorithms or web indexing refer to scoring techniques utilised by search engines to rank web pages (Perkins 2001). The intention of an indexing algorithm is to create a ranked list of relevant documents for various queries (e.g. keywords). Hyperlink Induced Topic Search

(HITS) and PageRank are examples of the most common search engine indexing algorithms that have been used in practice.


Hyperlink Induced Topic Search (HITS)

HITS is a web indexing algorithm that assigns two scores – Authority and Hub – to each web page

(Kleinberg 1999). A web page that links to many other web pages is known as a hub while, conversely, a web page that obtains many links from other web pages is known as an authority. The original HITS algorithm has since been modified owing to its limitations and the increase in web spam: it can operate only on a small subset of the data, and it produces two scores instead of one for ranking web pages.
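To make the two scores concrete, the following is a minimal sketch of the HITS power iteration on a toy link graph. The dict-of-lists graph representation, the iteration count, and the toy pages are illustrative assumptions, not part of Kleinberg's original formulation.

```python
# A minimal HITS sketch; `graph` maps each page to the pages it links to
# (an assumed toy representation used only for illustration).
def hits(graph, iterations=50):
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ()))
                for p in pages}
        # Hub score: sum of authority scores of the pages p links to.
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # Normalise so that the scores stay bounded across iterations.
        a = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a for p, v in auth.items()}
        hub = {p: v / h for p, v in hub.items()}
    return hub, auth

hub, auth = hits({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```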

PageRank

PageRank is a well-known indexing algorithm used by the Google search engine. PageRank is based

on the notion that a web page with a high rank obtains many incoming links from other high ranking

web pages (Page et al. 1998). This is a recursive idea which substantiates the PageRank of a particular

web page, by first considering the PageRank of the other referring web pages.
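As an illustration of this recursive idea only, the sketch below implements the PageRank iteration on a toy dict-of-lists graph; the damping factor d = 0.85 is the conventional choice and the graph representation is an assumption made for the example, with dangling pages and convergence checks ignored for brevity.

```python
# A minimal PageRank sketch (after Page et al. 1998).
def pagerank(graph, d=0.85, iterations=50):
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # A page's rank is fed by the ranks of the pages linking to it,
            # each divided by that linking page's outdegree.
            incoming = sum(rank[q] / len(graph[q])
                           for q in graph if p in graph[q])
            new_rank[p] = (1 - d) / n + d * incoming
        rank = new_rank
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```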

HITS and PageRank are the most common indexing algorithms. So far,

the majority of search engines have not published a specification of many

other features that they consider when ranking a web page. Search

engines conceal the specifications about their actual indexing algorithm to prevent tampering, specifically through web spam tactics.

True positive, true negative, false positive, and false negative

True positive (TP), true negative (TN), false positive (FP), and false negative (FN) are four measures

used to calculate the output performance of any classification method (Alpaydin 2004).

In the spam filtering domain, true positive is the number of correctly classified items that belong to

the positive class (e.g. genuine email in user’s inbox folder), true negative is the number of correctly

classified items that belong to the negative class (e.g. spam email in user’s junk folder), false positive

is the number of items incorrectly classified as belonging to the positive class (e.g. genuine email in

user’s junk folder), and false negative is the number of items incorrectly classified as belonging to the

negative class (e.g. spam email in user’s inbox folder).


C4.5 Decision tree

C4.5 is a decision tree algorithm commonly used for classification. Input to C4.5 is a vector Si = (f1, f2, …, fn), where the fi are features or attributes that describe the sample (Quinlan 1992). Already-classified samples are known as training data. At each tree node, C4.5 starts by choosing the attribute that is most effective in splitting the samples into different classes. This process is followed recursively in order to classify each sample.

In web spam filtering, C4.5 is trained with both spam and genuine data to distinguish spam from genuine content.
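C4.5 itself is not shipped with common Python libraries; as a hedged illustration, the sketch below trains scikit-learn's CART decision tree with the entropy (information-gain) criterion, which is a close stand-in for C4.5's splitting rule. The feature vectors (link ratio, word count, promotional-keyword flag) and labels are invented purely for this example.

```python
# Illustrative decision-tree spam classification; "entropy" approximates
# C4.5's information-gain splitting criterion.
from sklearn.tree import DecisionTreeClassifier

X_train = [[0.9, 120, 1], [0.1, 500, 0], [0.8, 90, 1], [0.2, 400, 0]]
y_train = [1, 0, 1, 0]  # 1 = spam, 0 = genuine (hypothetical labels)

tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X_train, y_train)
print(tree.predict([[0.85, 100, 1]]))  # -> [1], classified as spam
```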

Naïve Bayes classifier

Naïve Bayes is a simple and common machine learning classifier used for spam classification (Alpaydin 2004). In spam filtering, training naïve Bayes with fairly small amounts of sample data yields the probability that a message belongs to a class C (i.e. spam or non-spam) given that it includes the features F1, …, Fn (Eq. 2.5).

$P(C \mid F_1, \ldots, F_n) \propto P(C) \prod_{i=1}^{n} P(F_i \mid C)$ (2.5)

Support Vector Machine (SVM)

Support Vector Machine (SVM) is a machine learning algorithm designed to be robust for classification, especially binary classification (Chang & Lin 2001). SVM is trained with n data points {(x1, y1), (x2, y2), …, (xn, yn)}, where each data point xi comes with a class label yi. SVM then tries to find an optimum hyperplane that separates the two classes (e.g. spam and genuine) and maximises the margin between them. The decision function on a new data point x is defined as $\varphi(x) = \operatorname{sgn}(w \cdot x + b)$, where w is the weight vector and b is the bias term.

In the spam filtering domain, SVM can be trained with spam and genuine data to distinguish spam content from genuine content.
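To make both classifiers concrete, the sketch below trains scikit-learn's multinomial naïve Bayes and a linear SVM on a handful of invented text snippets; the snippets and their labels are illustrative assumptions only.

```python
# Bag-of-words spam/genuine classification with naive Bayes and a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["cheap pills buy now", "meeting agenda attached",
         "win money fast", "project status update"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = genuine (hypothetical)

vec = CountVectorizer()
X = vec.fit_transform(texts)
nb = MultinomialNB().fit(X, labels)
svm = LinearSVC().fit(X, labels)  # decision rule: sign(w . x + b)

test = vec.transform(["buy cheap money now"])
print(nb.predict(test), svm.predict(test))  # both should flag this as spam
```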


Language Model and N-gram(s)

A language model is a statistical estimate that models the distribution of natural language (Rijsbergen 1979). The most common such model is the n-gram model. An n-gram is a subsequence of length n extracted from a given sequence; for a sentence S, the n-gram model is defined as:

$P(S) = \prod_{i} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$ (2.6)

In the spam filtering domain, language models can be used to model the content of spam web pages and articles. This model is later used in input vectors to train classifiers.
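As a minimal sketch of the first step in building such features, the function below extracts word n-grams from a sentence; whitespace tokenisation is an assumption made to keep the example short.

```python
# Extract all word n-grams of length n from a sentence.
def ngrams(sentence, n):
    tokens = sentence.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("buy cheap pills online now", 2))
# [('buy', 'cheap'), ('cheap', 'pills'), ('pills', 'online'), ('online', 'now')]
```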

Zipf’s law

Zipf’s law is a discrete distribution that can model the frequency of occurrence of certain data types (Rijsbergen 1979). Zipf’s law is formally defined as follows:

$f(k; s, N) = \dfrac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}}$ (2.7)

where k is the rank, N is the number of elements and s is an exponent characterising the distribution. Zipf’s law predicts the relative frequency of the element of rank k.

In the spam filtering domain, Zipf’s law is used to model some particular behaviour or is employed to

measure a threshold for discriminative features.
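As a small worked example of Eq. 2.7 (the parameter values below are chosen purely for illustration):

```python
# Expected relative frequency of the rank-k element among N elements
# under a Zipf distribution with exponent s (Eq. 2.7).
def zipf_frequency(k, s, N):
    return (1.0 / k ** s) / sum(1.0 / n ** s for n in range(1, N + 1))

# With s = 1 and N = 10, the top-ranked element is predicted to account
# for about 34% of all occurrences.
print(zipf_frequency(k=1, s=1.0, N=10))
```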

Precision, Recall and F1-measure

Precision and Recall are two measures used to evaluate the accuracy or correctness of a classification algorithm (Alpaydin 2004). Precision measures the exactness of the result, while recall measures its completeness. Precision and recall are defined as follows:

$\text{Precision} = \dfrac{TP}{TP + FP}, \qquad \text{Recall} = \dfrac{TP}{TP + FN}$ (2.8)

It is common to combine precision and recall in order to generate one value (known as the F-measure) to compare or evaluate different classification results (Alpaydin 2004). The F-measure or F1-measure is defined below, where P and R are the precision and recall values respectively:

$F_1 = \dfrac{2PR}{P + R}$ (2.9)

The F1-value ranges from 0 to 1; the higher the F1-value, the better the performance, while a value close to 0 indicates performance no better than a random prediction.
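A minimal sketch computing Eqs. 2.8 and 2.9 directly from confusion-matrix counts; the example counts are invented for illustration.

```python
# Precision, recall and F1 from confusion-matrix counts (Eqs. 2.8-2.9).
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 80 spam items caught, 10 genuine items wrongly flagged, 20 spam missed:
print(precision_recall_f1(tp=80, fp=10, fn=20))  # (0.889, 0.800, 0.842)
```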

Receiver Operating Characteristic (ROC) curve and Area Under the ROC Curve (AUC)

The ROC curve is commonly a plot of the true positive rate against the false positive rate, presenting the trade-off between these two values at various thresholds (Alpaydin 2004). The more the curve skews to the top left, the better the accuracy of the result (Figure 2.1); the closer the curve is to the diagonal, the closer the accuracy of the classification is to random.

Figure 2.1: Sample ROC curve. The closer the curve is to the left top boundaries, the more accurate is the result.

It is common to measure the Area Under the ROC Curve (AUC) to compare different ROCs for different classification tests (Alpaydin 2004). AUC can be measured by simply computing the area under the ROC curve. ROC and AUC are used in the spam filtering domain to measure the performance of classifiers.
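A minimal sketch using scikit-learn's metrics module to compute ROC points and the AUC; the labels and scores below are invented for illustration.

```python
# ROC points and AUC for a toy set of gold labels and classifier scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
print(roc_auc_score(y_true, y_score))              # area under that curve
```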

2.2.3 Web Spam Tactics

Web spam tactics can be grouped into two classes – content-based tactics (CB) and link-based tactics (LB). The former alter the content or the appearance of web pages, while the latter take advantage of link-based algorithms to give websites higher rankings in search engine results. These tactics are listed below.

2.2.3.1 Content Based

Content-based web spam tactics alter the look and feel of web pages in order to obtain higher rankings

in search engine results. Content-based tactics can be grouped into four categories: Keyword stuffing, Scraper websites, Hidden content, and Cloaking.

Keyword Stuffing

Keyword stuffing refers to the placement of irrelevant keywords on the web page, in order to increase

keyword counts of a particular web page (Alexandros et al. 2006). A search engine indexing

algorithm counts the number of keywords in a web page to determine its level of relevance. Keywords can be placed in the web page body, header, image description attributes, etc.

Scraper Websites

The content of scraper websites is stolen from other sources such as search engine results and other

legitimate web pages (J. Caverlee et al. 2009). The purpose is to make a seemingly high-quality website gain popularity and later pass this authority on to spam web pages. Occasionally, recently expired domain names that have an already-established search engine ranking are used for this purpose.

Another common technique in scraper websites is article spinning: content scraped from other sources is rewritten manually or automatically in order to avoid the content duplication penalties imposed by search engines.

Hidden Content

Hidden content refers to the placement of unrelated text or keywords on a web page with the same

colour as the web page background; hence, it can be read by search engine crawlers but is not visible

to human users (Alexandros et al. 2006). Similarly, a website can be populated with invisible hyperlinks to other websites. This technique inflates the reputation of the linked websites.

Cloaking

Cloaking, the most common web spam tactic, delivers to search engine crawlers content that is different from that shown to human users (Z. Gyongyi & H. Garcia-Molina 2005). The decision to cloak can be


based on the visitor’s IP address, type of Internet browser, navigation behaviour, etc. With the cloaking technique, a server-side script shows a search engine crawler search-engine-friendly content that is different from the content visible to a human user. Mosaic cloaking and page hijacking are the most recent types of cloaking technique (J. Caverlee et al. 2009). The former replaces only selected parts of a web page – not the whole web page – depending on whether the visitor is a search engine crawler or a human user, and keeps the rest of the web page unmodified. The latter copies popular websites’ content; this copy is shown to the search engine crawler while the human user is redirected to a different web location or content.

2.2.3.2 Link-based

Link-based web spam tactics involve strategically modifying the way web pages or websites are linked together in order to gain higher rankings in search engine results. Link-based tactics can be grouped into five categories: Redirection, Scraper websites, Linkfarms, and Hijacking.

Redirection

Redirection refers to entrance web pages that forward visitors to other spam web pages (Baoning Wu & Brian D. Davison 2005). Redirection web pages are stuffed with relevant keywords and phrases to obtain a higher rank in search engines. Search engines typically do not follow the redirection chain and rank the web page based on the keywords in the redirecting web page, as opposed to the actual destination web page. This technique is also known as a Doorway web page, which typically has a mechanism that sends visitors to different web pages (Wu & Davison 2005).

Linkfarms

A linkfarm is a group of websites that are densely hyperlinked to each other (J. Caverlee et al. 2009). The more inbound links a website obtains, the higher the rank it gains. Although new search engine indexing algorithms employ other ranking features, linkfarms still influence search engine results (J. Caverlee et al. 2009).

Splog

Splog refers to the creation of multiple blogs, usually on legitimate blog hosting services, to generate hyperlinks to spam web pages (P. Kolari et al. 2006). Occasionally, the content of a splog is stolen from other sources. The more inlinks a website obtains, the higher will be its rank.


Hijacking

Hijacking refers to linking from legitimate websites to spam web pages by distributing links inside public, editable Web 2.0 platforms such as forums, wikis, comments, etc. (J. Caverlee et al. 2009).

Hijacking can also be regarded as a Spam 2.0 tactic where links to spammers’ campaigns are

propagated in Web 2.0 platforms.

In this chapter, web spam tactics are the criteria against which each web spam filtering method is

evaluated. After discussing the preliminary concepts and providing background information, a

comprehensive study of the existing web spam filtering methods in the literature is presented in the

following sections. A summary of each method along with a comparative analysis of the methods is

then discussed.

2.2.4 Web Spam Filtering Methods

This section provides a comprehensive study of state-of-the-art web spam filtering methods in the following categories: link-based, detection-based, and early-detection-based.

2.2.4.1 Link-based Web Spam Filtering Methods

As mentioned in Section 2.2.3.2, in the link-based approach, web spam filters investigate various

link-analysis, graph-analysis or network analysis techniques to separate spam web pages from genuine

ones (J. Caverlee et al. 2009). By constructing a graph and looking for different parameters, such

methods are capable of detecting spam websites. There has been a significant amount of attention paid to the link-based approach in web spam filtering (ref. Section 2.2.5). This section covers recent works

on link-based web spam filters.

TrustRank is one of the earliest and most cited works for filtering spam web pages in search ranking

results (Z Gyongyi et al. 2004). TrustRank assists search engine indexing algorithms to separate

reputable and good websites from spam-infested ones. It is semi-automatic, since an expert initially needs to rank a small subset of websites.

This set is considered as an initial reliable seed to rank other good pages. Here, the TrustRank of

spam websites that are not hyperlinked from good websites is diminished. The authors’ results show that, using a small seed set (200 websites), their method can effectively filter out spam web pages. Eq. 2.10 defines the TrustRank score (TR).


$\mathbf{TR} = \alpha \cdot \Gamma \cdot \mathbf{TR} + (1 - \alpha) \cdot \mathbf{d}$ (2.10)

where α is a decay factor, d is a static score distribution vector (a score manually assigned by an expert as the initial good seed), Γ is a matrix as defined in Eq. 2.11, p and q are web pages, and E is the set of edges.

$\Gamma(p, q) = \begin{cases} 1/\omega(q) & \text{if } (q, p) \in E \\ 0 & \text{otherwise} \end{cases}$ (2.11)

A sketch of this iteration is given after the following list. TrustRank suffers from the following drawbacks:

• Similar to PageRank, TrustRank is not effective on small-scale datasets.

• TrustRank is unable to assign adequate scores to web pages without in-links.

• TrustRank fails to detect spam web pages if there is a link from a legitimate web page to the spam one (Gyöngyi et al. 2004) (Asano et al. 2007).

• TrustRank requires manual effort in order to construct the initial trusted-seed set.
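The sketch below illustrates the iteration in Eq. 2.10; the dict-of-lists link graph and the small expert-labelled seed set are illustrative assumptions, not TrustRank's actual data structures.

```python
# TrustRank power iteration sketch (Eq. 2.10): trust mass starts on the
# expert-labelled good seed and flows along outlinks, split by outdegree.
def trustrank(graph, good_seed, alpha=0.85, iterations=50):
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    d = {p: (1.0 / len(good_seed) if p in good_seed else 0.0) for p in pages}
    tr = dict(d)
    for _ in range(iterations):
        new_tr = {}
        for p in pages:
            incoming = sum(tr[q] / len(graph[q])
                           for q in graph if p in graph[q])
            new_tr[p] = alpha * incoming + (1 - alpha) * d[p]
        tr = new_tr
    return tr

# Spam pages unreachable from the trusted seed end up with zero trust.
print(trustrank({"good": ["a"], "a": ["b"], "spam": ["spam2"], "spam2": []},
                good_seed={"good"}))
```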

Web spam filtering methods in the literature strive either to enhance web page rankings or to discover spam web pages in the search results (or

both). Some such methods are inspired by the idea of the PageRank (e.g.

TrustRank), some try to improve the HITS algorithm and the rest are more general approaches toward identification of spam web pages.

Given the above drawbacks, (A. Benczúr et al. 2005) proposed a web spam filtering method that tries to resolve some of the problems in the original PageRank and TrustRank approaches by investigating the link exchanges among web pages. One of the main drawbacks of PageRank is that if a website A is linked from many low-ranked web pages, PageRank would increase ranking for A by aggregating proportion of rankings from low-ranked web pages. To identify such link exchanges and provide a more automated way to identify spam web pages (i.e. without the need for human experts), authors in

(A. Benczúr et al. 2005) proposed a technique called SpamRank. The main assumption in their method is that the PageRank of a spam web page derives mainly from the many low-ranked web pages (referred to as supporters) that point to it. Overall, SpamRank has three phases, as follows.


Phase 1: Selecting the supporters of a given page i by using personalised PageRank (PPR). PPR (Eq.

2.12) is a simulation algorithm described in (Fogaras & B. Rácz 2004) that can be interpreted as the

probability of a random walk terminating at i in the web graph (G=(W,E)).

$\mathbf{PPR} = \varepsilon \cdot \mathbf{r} + (1 - \varepsilon) \cdot \mathbf{PPR} \cdot A$ (2.12)

where r is the teleportation vector and is not uniform, ε is the probability of teleportation and A is a

random walk matrix over a given web graph.

Phase 2: Penalising web pages that originate a suspicious PageRank, by measuring the similarity between the PageRank histogram and a power-law distribution. As stated in (Adamic et al. 2000), the web graph is believed to follow a power-law distribution.

Phase 3: Applying the penalties to the PageRank of the suspicious web pages and their supporters iteratively.

In order to evaluate the proposed method, the authors used a web-graph dataset consisting of .de

domains, which contains 31.2M web pages. They manually classified 910 sample web pages and divided them equally and randomly into 20 buckets. Each bucket contained 5% of the total PageRank, with the first bucket containing the pages with the highest PageRank.

The authors left the performance measurement of their method as an open question; however, they found that the proportion of spam web pages is higher in the bucket with the highest PageRank value. This makes it possible to observe how much spam impacts PageRank relative to the average number of reputable pages.

The main drawback of PageRank is that if a website A is linked from many low-ranked web pages, PageRank increases the ranking of A by aggregating a proportion of the rankings from those low-ranked web pages.

The other drawback of PageRank-based methods is their reliance on a significant amount of data in

order to initiate the ranking process. Authors in (Asano et al. 2007) addressed this problem by

proposing methods for improving the HITS algorithm that can be run on proportionally smaller scale


datasets. The aim of their proposed methods was to remove linkfarms and spam web pages from search engine results. The authors proposed three new factors – Nameserver, Domain name, and IP address – to be used with a modified version of the HITS algorithm known as BHITS.

BHITS (Zhou & Dang 2007) is an enhanced version of the HITS algorithm designed to handle the mutual reinforcement problem. The mutual reinforcement problem can be stated simply as supportive link-exchange behaviour between two websites or web pages designed to influence the web pages’ rankings. For example, if a web page is linked from a number of other web pages on the same host, the authority score of the web page unfairly increases. In BHITS, the authors incorporated the hostname when calculating the authority and hub scores for each incoming and outgoing link in order to overcome the mutual reinforcement problem.

‘Mutual reinforcement problem’ is a term used to describe supportive link-exchange behaviour among colluding web pages. Linkfarms, splogs, and occasionally hijacking tactics are real-world examples of the mutual reinforcement problem on today’s web.

Authors in (Asano et al. 2007) proposed using the Nameserver, Domain name, or IP address instead of the hostname in the BHITS algorithm, naming the resulting methods N+BHITS, D+BHITS, and I+BHITS. For example, web pages that are linked from the same Nameserver, Domain name, or IP address are counted only once when measuring the web page score. The authors also proposed a trust-score algorithm based on the HITS algorithm to score the reliability of web pages. Their method is similar to TrustRank (Gyöngyi et al. 2004) but is applied to the HITS algorithm. A reliable hub web page is defined as a web page that has outgoing links to a reliable seed (web page), while a reliable authority web page is a web page that is linked from a number of reliable hubs. The authors combined the trust-score algorithm with N+BHITS, D+BHITS, and I+BHITS and proposed two weighted measures to rank a web page. In their experiment, the authors extracted the top 200 web pages from Google by querying 14 keywords. Their algorithm showed good results for linkfarm detection and web page ranking without the need for a significant amount of resources (TrustRank, by comparison, requires a significant amount of data for processing).


However, the proposed method can be executed only at system run time; hence, it cannot be utilised to pre-process web page rankings when the system is not online. This is the major drawback of most HITS-based algorithms.

Wu & Davison (2005) conducted another study to provide more relevant search results and enhance

the HITS algorithm. The authors proposed a method to remove linkfarms from search engine results.

Overall, the idea behind this method is that if a web page is part of a linkfarm, its popularity will be damaged after applying the algorithm; otherwise, it will retain a reasonable ranking.

The method consists of three steps, as follows (a sketch of the first two steps is given after the list):

• Generate a seed set based on the number of common inlinks and outlinks for each web page. If this number exceeds the given threshold (here 3), it would be inserted in the seed set and

considered as a spam or bad web page.

• The badness value of web pages outside the seed set is evaluated based on the number of their outgoing links that point to web pages inside the seed set. For a specific web page, if this number of links exceeds the given threshold (here 3), it is included in the seed set.

• In order to incorporate the above values into the web page ranking, the entries for the bad web pages in the adjacency matrix are deleted or down-weighted. The final matrix is used for ranking web pages.
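Below is a minimal sketch of one reading of the first two steps, assuming inlink and outlink maps from each page to a set of pages; the exact seed criterion in the paper may differ, so treat this as illustrative only.

```python
# Seed-based linkfarm detection sketch (threshold 3 as in the paper).
def build_bad_seed(inlinks, outlinks, threshold=3):
    # Step 1: seed with pages whose inlink and outlink sets overlap heavily,
    # i.e. pages with many common in- and out-neighbours.
    seed = {p for p in outlinks
            if len(inlinks.get(p, set()) & outlinks[p]) > threshold}
    # Step 2: grow the seed with pages whose outlinks point into it too often.
    grew = True
    while grew:
        grew = False
        for p, outs in outlinks.items():
            if p not in seed and len(outs & seed) > threshold:
                seed.add(p)
                grew = True
    return seed
```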

In order to examine the effectiveness of the method, the authors used two data sets. The first dataset

was gathered by means of the HITS algorithm, while the second dataset was obtained from search.ch

search engine. Manual evaluation was used to measure the accuracy of the algorithm. The result of the

algorithm was compared with both the HITS and the BHITS methods. The relevance of the search results to the queries was 23.6% and 42.8% for the HITS and BHITS methods respectively. However,

this value was 72.6% for the proposed algorithm based on the same datasets.

Although the proposed algorithm performed well in diminishing the effect of linkfarms inside search

results, it still suffers from the mutual reinforcement problem that comes with the original HITS

algorithm.

Another approach to web spam filtering that attempts to overcome the mutual reinforcement problem

has been studied by (Carvalho et al. 2006). The authors proposed a site-level method for detecting


noisy links (e.g. spam and nepotistic links). The proposed method has been employed to detect various linkfarm tactics, as follows (a sketch of the BMSR test is given after the list).

1. Mutual reinforcement: Two algorithms have been proposed here to discover this link exchange behaviour. (1) Bi-Directional Mutual Site Reinforcement (BMSR): this algorithm is

based on the number of link exchanges between the web pages of two target websites. If the

number of link exchanges exceeds a given threshold (here 2), all links between these two

websites would be considered as noisy and would be removed. (2) Uni-Directional Mutual

Site Reinforcement (UMSR): this algorithm is based on link density which includes all uni-

directional links between web pages of two websites. If the UMSR value exceeds a given

threshold (here 250), all links between web pages of two websites would be removed.

2. Abnormal support, in which the majority of links from one website point to another website through a chain of websites built in between. For detection, the number of links from website A to

website B is divided by the total number of links which point to B. If the ratio is above a given

threshold (here 2%), all links between A and B (which point to either A or B) would be

considered as noisy links and would be removed.

3. Link alliances, in which a wide range of websites is hosted and linked together in order to influence website rankings. To discover this type of tactic, all web pages that originate from a specific web page are investigated. If these web pages are highly connected to each other, this is an indication that the link structure aims to manipulate the overall ranking. Here, instead of removing links, a susceptibility value is calculated.
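A minimal sketch of the BMSR test from item 1 above; the `links` set of (source, target) page pairs, the `site` mapping, and the site names are assumed representations used only for illustration, with threshold 2 following the paper's setting.

```python
# Flag all links between two sites as noisy when reciprocal page-level
# link exchanges between them exceed the threshold (BMSR).
def bmsr_noisy(links, site, site_a, site_b, threshold=2):
    exchanges = sum(1 for (p, q) in links
                    if site(p) == site_a and site(q) == site_b
                    and (q, p) in links)
    return exchanges > threshold
```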

The authors studied the impact of their method on the link database of the TodoBR search engine, which includes 12,020,513 pages extracted from Brazilian websites connected by 139,402,245 links. They also extracted 11,246,351 queries as test queries. The authors investigated the performance of each single algorithm and of combinations of them, which identified 16.7% of links as noisy.


Until 2008, web spam filtering methods strove to detect spam web pages by analysing the hyperlink structure of the web-graph (ref.

Section 2.2.1 ). However, other logical relationships amongst web pages

that can be exploited to construct the web-graph have not been

investigated (e.g. User Browsing Graph). The methods discussed next

utilise such relationships for spam web pages identification.

(Yuting Liu et al. 2008) proposed a web spam filtering method by studying the User Browsing Graph

instead of hyperlink graph to rank the web pages. The proposed method (known as BrowseRank)

utilises a User Browsing Graph as a means of measuring web page rank. A user browsing graph is a

graph that is extracted from the users’ click-through navigations on the web; the vertices are the

visited web pages and the edges are constructed based on the user navigations among web pages. The

authors believe that a User Browsing Graph is more reliable than a hyperlink graph since a User

Browsing Graph:

• is not as dynamic as a hyperlink graph,

• contains additional factors that can be used for better web page scoring (e.g. the length of time users spend on web pages).

The authors accumulate user browsing data through toolbars that are installed on the users’ Internet

browsers. This data consists of the visited web page, the time of visit and the type of visit. The type of visit can

be a click through or direct input of the web page URL in the Internet browser’s address bar. A

similar approach to PageRank is applied to the User Browsing Graph to discover the importance of

each web page.

As a result, BrowseRank demonstrated better results than the PageRank and TrustRank algorithms.

Additionally, BrowseRank outperformed each of the others with regard to web spam, since BrowseRank

is based on a user navigational behaviour graph rather than a hyperlink graph.


Although authors in (Yuting Liu et al. 2008) mentioned that BrowseRank

outperforms PageRank and TrustRank, (Yu et al. 2009) conducted a

comprehensive study comparing TrustRank, PageRank and BrowseRank

on different user browsing graphs, hyperlink graphs and a combination

of the two datasets. Their findings showed that TrustRank outperforms the

others.

(Yu et al. 2009) exploited User Browsing Graph, Hyperlink Graph and a combination of the two in

PageRank, BrowseRank and TrustRank. This is a novel approach because it uses a combination of user browsing and hyperlink graphs in order to rank web pages.

The TrustRank algorithm was applied to the improved hyperlink graph with user access information. The results show that a combination of the user browsing and hyperlink graphs gives better accuracy in spam web page detection than either graph used individually.

(Yuting Liu et al. 2008) and (Yu et al. 2009) proposed methods that provide more accurate and relevant rankings for web pages, rather than eliminating spam web pages from the search results. To cover this issue, (Yiqun et al. 2008) conducted research on the effectiveness of a User Browsing Graph for spam web page identification. The authors proposed a statistical classification method based on three features, as follows (a sketch of these computations is given after the list).

• Search Engine Oriented Visit Rate: the number of visits originating from search engines divided by the total number of visits for a given web page. Spam web pages are more common in search engine results than in users’ bookmarks.

• Source Page Rate: the number of times a given web page appears as a source web page divided by the number of times the web page was actually visited by users. Spam web pages are more likely to be destination web pages rather than source web pages.

• Short-time Navigation Rate: the number of web pages that users visit in a given website divided by the number of times users visit the website. The frequency of visits to spam web pages is proportionally lower than to legitimate ones, since spam web pages do not provide any useful information for users.
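A minimal sketch of the three rates for a single website, assuming a hypothetical aggregated visit log; the variable names and counts are invented for illustration and simplify the per-page definitions above.

```python
# The three user-browsing features for one website, from aggregate counts.
def browsing_features(page_views, site_visits, search_visits, source_appearances):
    seov_rate = search_visits / site_visits        # search-engine-oriented visit rate
    source_rate = source_appearances / page_views  # source page rate
    navigation_rate = page_views / site_visits     # short-time navigation rate
    return seov_rate, source_rate, navigation_rate

# A spam-like profile: almost all traffic arrives from search engines,
# pages rarely act as navigation sources, and visits are shallow.
print(browsing_features(page_views=120, site_visits=100,
                        search_visits=90, source_appearances=10))
```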


The authors ran their Bayesian method over 800 million Chinese web pages to evaluate the

effectiveness of their proposed method. Results showed 79.26% accuracy in classifying spam web

pages.

The proposed methods discussed so far, consider only the link structure

of the web-graph in order to rank the web pages or eliminate the spam

web pages. However, recent methods have incorporated both the link

structure and the content features of the web pages for spam web page identification.

(Abernethy et al. 2010) proposed the first learning algorithm called WITCH (Webspam Identification

through Content and Hyperlinks) which uses both the link structure and the content features for

classifying spam hosts (i.e. the set of web pages that share the same domain name). The authors

employ a graph regularisation linear classifier on the hyperlink graph and SVM-based classifier on

content features. Two graph regularisation functions determine how the locality of a host’s spamminess is applied in this method. The first penalises the squared deviation between the predicted values for two specific nodes. In the second, predicted spam scores are penalised if the node at the source of a link has a lower spam value than the destination node. Although the second function performs better, the best choice is a combination of the two approaches, since relying only on the second one will misclassify nodes that have spam incoming links and non-spam outgoing links.

To examine the performance of WITCH, the authors utilised Webspam-UK 2006 spam collection and

their graph included 11,402 hosts, 7,473 of which were labelled. The hyperlink graph also included

730,774 triples which showed the number of links between two hosts. In addition, 236 features were

extracted and normalised. The proposed method used a transductive setting whereby the test set is known during the training process. For all results, the test set was fixed at 1,851 hosts.

As a result, the authors showed that a combination of web page content and hyperlink graph gave the

best results. The authors also compared their results with Webspam Challenge, Stacked Graphical

Learning and Transductive Link Spam Detection (which is a link-based method) results. Results

showed that WITCH had the highest AUC of those compared. WITCH increased the AUC from 0.95 to 0.96, which corresponds to a 20% reduction in ranking error (from 0.05 to 0.04).


The main assumption in the proposed method is that there is no link from the genuine host to the spam host. This assumption can be false in most hijacking and splog tactics where spam hosts are linked from the genuine ones. The method proposed by (Araujo & Martinez-Romo 2010) can overcome this problem by analysing the Qualified Link (QL) and Language Model (LM) of the web pages.

The main idea of their method is that two linked web pages or websites should be semantically related. The authors used LM analysis methods for feature selection based on the content of web pages and used them as the classifier criteria.

In the LM approach, a certain triple set of features is extracted from each two linked web pages - source page and target page. The first set drawn from the source page includes anchor text, URL terms, and surrounding anchor text. The second one drawn from the target page includes title, content of the page, and meta-tags. Moreover, two combined features are added: (1) anchor text and URL terms (AU), and (2) surrounding anchor text and URL terms (SU). Finally, there are 14 cross-referred features extracted from each web page including anchor text and web page content, surrounding anchor text and title, or anchor text and meta-tags.

In the QL approach, indicators of spam web pages include an excessive number of recovered links relative to non-recovered links (a recovered link is one that can be retrieved in the top ten search engine results), a high amount of empty anchor text, a high number of incoming links relative to the number of outgoing links, a negative difference between the internal and external links, and finally, the approximate number of broken links.

In order to evaluate the method, the authors used two datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007, which contain 77.9 million and 105.9 million web pages respectively. Their work, based on the SVM classifier, achieved an F1-measure of 0.86 on the WEBSPAM-UK2006 dataset and 0.40 on the WEBSPAM-UK2007 dataset.

The users’ judgment must be taken into consideration when classifying

spam web pages since it is a subjective issue. A website might be spam for

one user but not for other users. This issue has not been considered in the

majority of automated (link or content based) web spam filtering

methods. The following methods strive to rectify this oversight.


(Metaxas & Destefano 2005) proposed a method for finding untrustworthy websites in the search

engine results by utilising social networks. The authors highlight the similarity between web spam tactics and propagandistic techniques in society. They proposed the use of anti-propagandistic techniques

to detect spam web pages. The proposed anti-propagandistic method is user-initiated. Users

themselves should determine whether or not a web page is trustworthy. By finding other mutually

connected websites in the web graph of the particular untrustworthy website, other non-trustworthy

websites or spam web pages can be discovered.

The proposed method starts from a specific website s which is manually marked as untrustworthy. By conducting a breadth-first search (BFS) to a defined depth, the websites which point to s are extracted and a directed graph known as the trust neighbourhood of s is created. After ignoring websites such as educational institutions, popular directories and blog sites, the bi-connected components (BCCs) which point to s can be found. A BCC contains websites that strongly support s; hence, they can be considered to be untrustworthy as well.

For experimental purposes, the authors used 8 websites including 6 untrustworthy and 2 trustworthy.

The authors investigated BCC’s trustworthiness for each site by considering 20% of BCC websites as

the sample. The results showed that trustworthiness or untrustworthiness of a BCC highly depends on

the starting website. If the starting website is untrustworthy, the majority of the BCC websites are also

untrustworthy, and vice versa. The proposed method operates at the user level, relying on the users’ decisions about the trustworthiness of a particular website. However, the authors did not consider the existence of malicious users within the system who might misleadingly mark an initial spam website as non-spam, or vice versa.

Table 2.1 presents a comparative summary of link-based web spam filtering methods that have been

discussed in this section.


Table 2.1: A comparative summary of link-based web spam filtering methods.

Method | Vertex type | Edge type | Graph | HITS/PR based | Dataset | Uses content features? | Performance
(Z Gyongyi et al. 2004) | Website | Link | Hyperlink | PR | 1000 samples, 31M websites | No | Avg. 0.85 Precision/Recall
(Benczúr et al. 2005) | Website | Link | Hyperlink | PR | 910 samples, 32M websites | No | Future work
(Wu et al. 2005) | Web page | Link | Hyperlink | HITS | 200 samples, 20M+2.1M web pages | No | Avg. 0.72 Accuracy
(Metaxas et al. 2005) | Website | Link | Social | - | 8 websites | No | -
(Carvalho et al. 2006) | Website | Link | Hyperlink | - | 200 samples, 12M web pages | Imp. | 0.59+ Mean Average Precision
(Zhou et al. 2007) | Hostname | Link | Hyperlink | HITS | - | No | -
(Asano et al. 2007) | Nameserver, Domain, IP | Link between Nameserver, Domain, IP | Hyperlink | HITS | 200 web pages | No | Avg. 0.87 Accuracy
(Yuting Liu et al. 2008) | Website | Users' navigation | User Browsing | PR | 5.6M websites | No | Best 0.87 Normalised Discounted Cumulative Gain
(Yiqun et al. 2008) | Web page | Users' navigation | User Browsing | - | 800M websites | No | Avg. 0.79 Accuracy
(Yu et al. 2009) | Web page | Link & users' navigation | Hyperlink & User Browsing | PR | 2646 samples, 3B web pages | No | Avg. 0.77 AUC
(Abernethy et al. 2010) | Domain | Link | Hyperlink | - | Webspam-UK2006 | Yes | Avg. 0.96 AUC
(Araujo et al. 2010) | Web pages | Link | Hyperlink | - | Webspam-UK2006 & 2007 | Yes | Avg. 0.63 F1-measure

Technically, link-based approaches have the following drawbacks:

Methods that strive to enhance the PageRank algorithm need significant amounts of resources. This

makes them usable only for large scale implementations (e.g. major search engines); they are not

efficient on a smaller scale.

Methods that strive to enhance the HITS algorithm cannot be used in offline mode. This might cause

system performance issues during the system run time.

In the next section, detection-based (content) web spam filtering methods are discussed.

2.2.4.2 Detection-based Web Spam Filtering Methods

In the detection-based web spam filtering method, the focus is more on the content of the web pages

rather than their position in the web graph. By extracting discriminative features from the content or headers of web pages or from HTTP requests, filtering methods are capable of detecting web spam tactics (ref.

Section 1.6.4.1).

(Fetterly et al. 2004) proposed a method using a statistical approach to identify machine-generated

spam web pages. They utilised some heuristics to differentiate spam web pages from genuine ones,

which include:

• various features of the website host,

• IP addresses referred to by an excessive number of symbolic host names,

• outliers in the distribution of in-degrees and out-degrees of web pages,

• the rate of structural change of web pages, and

• excessive replication of content.


They set the thresholds for the extracted features both manually and statistically (using Zipf's law distributions).

In this study, two datasets were used. The first dataset consisted of 150 million URLs, and the second covered about 429 million HTML pages and 38 million HTTP redirects. The results indicated that web pages generated by machines differ from those created by humans with regard to the abovementioned heuristics.

A similar study was conducted by (Urvoy et al. 2008), who examined the non-textual (HTML) content of web pages rather than content features to identify auto-generated spam web pages. The authors argue that many spam web pages share a similar look-and-feel since they have been built using the same generation process. Six different parsers were used to extract textual content from web pages and eliminate the structural and styling information of each web page. These parsers parse the HTML page into six different versions: (1) filters the alphanumeric characters, (2) same as the previous one but also replaces multiple spaces with a single one, (3) extracts only the HTML styling tags, (4) same as the previous one but also retains non-alphanumeric characters, (5) retains only the alphanumeric characters, and (6) retains the full web page content.

Since the amount of data is large, the authors utilised locality-sensitive hashing algorithms (Broder's MinHashing and Charikar's fingerprints) to create a fingerprint of each web page, which was later used for similarity comparison. By clustering similar web pages, the authors also extracted features such as the number of URLs, hosts and domains in each cluster to detect new spam web pages. A 130 GB dataset was used for the clustering, and the Webspam-UK2006 dataset was used to evaluate the proposed method. The best F1-measure value (0.57) was achieved when both the HTML tags and the alphanumeric data were incorporated into the clustering.
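As an illustration of the fingerprinting step, the toy sketch below estimates the Jaccard similarity of two pages via minwise hashing in the spirit of Broder's scheme. The shingle size, the number of simulated hash functions and the salted-MD5 construction are assumptions made for the example, not the implementation used in the paper.

```python
# Minwise-hashing sketch: near-duplicate pages yield near-identical signatures.
import hashlib

def shingles(text, k=4):
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def minhash(shingle_set, num_hashes=64):
    # Simulate independent hash functions by salting a single hash with a seed.
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching minimum values estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

page_a = "cheap pills buy now cheap pills buy now best offer today only"
page_b = "cheap pills buy now cheap pills buy now great offer today only"
print(estimated_jaccard(minhash(shingles(page_a)), minhash(shingles(page_b))))
```

Pages built by the same generation process share most of their shingles, so their signatures agree on most positions and fall into the same cluster.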

The identification of cloaked web pages is one of the major challenges in the web spam filtering area, and it is not addressed in the previous two studies. In (Wu & Davison 2005), the authors investigated cloaking and redirection web pages.

Their content-based detection method is based on the comparison of three copies of the same web page: terms and links are extracted from the copies, and the repeated content is added to separate lists. If the difference among the lists is higher than a given threshold, the web page is considered to be a cloaking web page.


The proposed method was evaluated on two datasets containing a total of 297,170 URLs along with their copies. The authors identified that about 3% of the first dataset and 9% of the second dataset contain cloaking web pages that had already manipulated search engine ranking results.

A similar study was undertaken in (Chellapilla & Chickering 2006a) by mimicking the user-agent field in the HTTP header to obtain different copies of websites. Their method downloads two copies of each website and then compares the content and structure of the copies. If the difference between the copies is bigger than a given threshold, the website is marked as a cloaked website. They used a dataset consisting of 3 million URLs drawn from 10,000 search queries for the experiment. The results of their work show that the rate of cloaking depends on the popularity and monetisability of search queries - i.e. as a query gains popularity, the number of cloaking attempts increases accordingly.
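The copy-comparison idea behind these cloaking detectors can be sketched as follows: the same URL is fetched with a browser-like and a crawler-like User-Agent header, and the page is flagged when the two term sets diverge beyond a threshold. The header strings and the threshold are illustrative assumptions, not values from either study.

```python
# Fetch two copies of a page under different User-Agent headers and compare
# their term sets; a large symmetric difference suggests cloaking.
import urllib.request

def fetch_terms(url, user_agent):
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return set(resp.read().decode("utf-8", errors="ignore").split())

def looks_cloaked(url, threshold=0.5):
    browser = fetch_terms(url, "Mozilla/5.0 (Windows NT 10.0)")
    crawler = fetch_terms(url, "Googlebot/2.1 (+http://www.google.com/bot.html)")
    union = browser | crawler
    difference = len(browser ^ crawler) / len(union) if union else 0.0
    return difference > threshold
```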

Various techniques have been utilised to fool search crawlers into generating cloaked web pages. For example, early search crawlers could not execute JavaScript embedded inside HTML web pages; hence, JavaScript cloaking techniques could be utilised to show two different versions of a web page to users and to search crawlers. The methods described next all try to investigate even newer, more sophisticated cloaking techniques.

The Strider Search Ranger system described in (Yi-Min Wang & Ming Ma 2007) is a system to discover cloaking (e.g. redirection and doorway) web pages in search results. The detection mechanism is based on similarity-based grouping. The process starts by selecting a list of keywords and querying them against popular search engines to extract the top-N URLs. Keywords were extracted from spam forums, spam URLs, popular keywords, etc. A fully-fledged crawler (referred to as SearchMonkeys) was used to visit each URL and bypass possible complex cloaking efforts (e.g. JavaScript cloaking) that may be embedded in the spam web pages. The initial URLs, along with the other URLs in the redirection chain, were stored for similarity analysis. Five types of similarity-based grouping are undertaken for each URL:

• Redirection domains: grouping based on the source URLs.


• Final user-seen web page content: grouping based on the content of destination web page.

• Domain whois and IP Address: grouping based on the same IP address.

• Link-query results: grouping based on the overlap among inlinks.

• Click-through analysis: grouping based on the similar redirection chain.

The proposed system was tested on the initial list of 4803 URLs that was extracted from search engines. This list was later expanded to about 6 million web pages.

The main aim of the system is to detect cloaked web pages by following their redirection chain. Some of the heuristics such as “grouping based on similar host IP address” might wrongly classify good websites as spam or vice versa since spam websites can be hosted on multiple web servers with different IP addresses. The authors made use of the proposed system in another work to investigate the nature of redirection chains in more detail (Yi-Min et al. 2007).

A novel study has been done in (Wu & Davison 2006) to identify semantically cloaked web pages.

The authors utilised a machine learning approach to propose the first automatic method for detecting semantically cloaked web pages. Semantic cloaking is a cloaking attempt in which the difference between the cloaked versions lies in the meaning of the content rather than its syntax.

The proposed method consists of two steps:

1. A copy of the web page fetched as a crawler and a copy fetched as an Internet browser are given to the filtering algorithm. After applying the term rule (the number of differences between dictionary terms) and the link rule (the number of unique links between the two copies), if the value is above a given threshold (here 3), the web page is passed to step 2.

2. One more copy of the web page is downloaded from each perspective, and the four copies are classified by an SVM as either cloaking or non-cloaking web pages (a sketch of this two-step filter follows).
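The two-step design can be sketched as follows, assuming a toy term-difference rule and invented training data; the original work used richer term and link features, four page copies and its own SVM training set.

```python
# Step 1: a cheap rule prunes pages whose copies barely differ.
# Step 2: a classifier (here a toy SVM) decides the remaining pages.
from sklearn.svm import SVC

def term_difference(copy_a, copy_b):
    return len(set(copy_a.split()) ^ set(copy_b.split()))

def featurise(copy_a, copy_b):
    # Toy feature vector; the paper used term- and link-based features.
    return [term_difference(copy_a, copy_b), abs(len(copy_a) - len(copy_b))]

# Invented labelled pairs: 1 = cloaking, 0 = non-cloaking.
X_train = [[0, 2], [1, 5], [40, 300], [55, 420]]
y_train = [0, 0, 1, 1]
clf = SVC().fit(X_train, y_train)

def classify(crawler_copy, browser_copy, threshold=3):
    if term_difference(crawler_copy, browser_copy) <= threshold:
        return "non-cloaking"              # step 1: rule filter
    label = clf.predict([featurise(crawler_copy, browser_copy)])[0]
    return "cloaking" if label == 1 else "non-cloaking"
```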

For the evaluation, the authors accumulated 257 unique queries from Google Zeitgeist, Lycos and Ask Jeeves. For each query, the top 200 web pages were obtained from Google search results. For training, a labelled list of 1,285 URLs comprising 539 positive and 746 negative examples of semantic cloaking was used in the classifier. The results show 93% precision and 85% recall.


To examine the performance of the proposed method, the authors used 4,378,870 URLs from the 2004 Open Directory Project (ODP) RDF file. It was concluded that over 1% of ODP web pages contain semantic cloaking (91.5% precision and 82.7% recall).

Apart from the above methods, which focus only on cloaking detection, (Alexandros et al. 2006) addressed the problem of classifying spam web pages based on their content. The authors investigated 10 content-based features such as the number of words, the average word length, the fraction of anchor text, the amount of visible content, the compression ratio, the fraction of popular text in the web page, the amount of widespread words, and the probability of independent and conditional n-grams.
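A few of these features are straightforward to compute. The sketch below shows illustrative versions of three of them, with zlib standing in for the paper's compression measure; the exact feature definitions in the original study differ.

```python
# Toy content features: word count, average word length and compression ratio.
# Heavily repetitive (machine-generated) text compresses unusually well.
import zlib

def content_features(visible_text):
    words = visible_text.split()
    num_words = len(words)
    avg_word_len = sum(map(len, words)) / num_words if num_words else 0.0
    raw = visible_text.encode("utf-8")
    compression_ratio = len(zlib.compress(raw)) / len(raw) if raw else 0.0
    return {"num_words": num_words,
            "avg_word_len": avg_word_len,
            "compression_ratio": compression_ratio}

print(content_features("buy cheap cheap cheap pills pills pills now now now"))
```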

The authors constructed a C4.5 decision tree for the classification. The decision-tree classifier detected 82.1% of all spam web pages and 97.5% of all non-spam web pages. They also utilised bagging and boosting techniques to improve their classifier, which resulted in 86.2% accuracy in detecting spam web pages. Their evaluation dataset contained 17,168 web pages, of which 2,364 were spam web pages.

A similar study was undertaken in (Pranam Kolari et al. 2006) to identify splogs. The authors trained an SVM classifier with three feature types – bag-of-anchors, bag-of-urls, and bag-of-n-grams. Bag-of-anchors features are extracted from the anchor text inside blog pages. Bag-of-urls features are produced by splitting (tokenising) URLs on the special characters they contain, such as "/" and "?". Bag-of-n-grams features are character subsequences of length n taken from the given words (here 4-grams); a sketch of both tokenisation steps follows.
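In the sketch below, the splitting characters and the n-gram size follow the description above, while the helper names and the example URL are invented.

```python
# Split a URL into tokens on special characters, then take character 4-grams.
import re

def bag_of_urls(url):
    return [t for t in re.split(r"[/?.:=&_-]", url) if t]

def char_ngrams(token, n=4):
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(bag_of_urls("http://example.com/cheap-pills?ref=spam"))
# ['http', 'example', 'com', 'cheap', 'pills', 'ref', 'spam']
print(char_ngrams("cheap"))   # ['chea', 'heap']
```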

The dataset contained 5 million blog homepages, with 2,600 blogs as labelled samples. Their SVM achieved an F1-measure of 88% in detecting splogs; however, the results included a considerable number of false positives.

Another content-based approach was studied in (A. A. Benczúr et al. 2008) to classify spam websites inside web archives, which create and store snapshots of web pages. This work was novel in that it applied a content-based detection method to web archives. The ensemble classifier in the proposed method was trained on content features (e.g. terms inside the web page) and link features (e.g. web page outlinks).

Two datasets were used – Webspam-UK 2006 and 2007. The authors achieved an AUC value of 0.84 for their classification.


Table 2.2 presents a comparative summary of detection-based web spam filtering methods that have been discussed in this section.

Table 2.2: A comparative summary of detection-based web spam filtering methods.

Method | Content features | Non-content features | Link features? | Dataset | Classifier | Performance
(Fetterly et al. 2004) | Web page | Host features | Yes | 150M URLs + 429M web pages | - | Avg. 0.14 False positive
(Baoning et al. 2005) | Terms and links inside the web page | - | No | 297K web pages | - | Avg. 0.74 F1-measure
(Wu et al. 2006) | Terms and links inside the web page | - | No | 47K web pages | - | 0.93 & 0.85 Precision & Recall
(Chellapilla et al. 2006) | Web page | - | No | 3M web pages | - | -
(Alexandros et al. 2006) | Web page content | - | No | 17K web pages | C4.5 | Avg. 0.82 Accuracy
(Pranam et al. 2006) | Blog page | - | No | 2K samples; 5M blogs | SVM | Avg. 0.88 F1-measure
(Yi-Min et al. 2007) | Web page | IP and host | Yes | 6M web pages | - | -
(Urvoy et al. 2008) | Web page | HTML structure | No | 150GB URLs + Webspam-UK 2006 | - | Avg. 0.57 F1-measure
(Benczúr et al. 2008) | Content and links inside the web page | Link graph, robot meta data, etc. | Yes | Webspam-UK 2007 | SVM | Avg. 0.84 F1-measure

Overall, detection-based approaches are afterthoughts: they deal with the web spam problem after spam web pages have already infiltrated the web. Hence, these methods cannot prevent the distribution of web spam or the resulting waste of web resources.


The next section offers an evaluation of early detection-based web spam filtering methods.

2.2.4.3 Early Detection-based Web Spam Filtering Methods

Early detection-based web spam filtering methods investigate specific features such as meta-data and HTTP header information in websites to detect spam web pages. This section discusses early detection-based web spam filtering methods.

A real-time spam web page detection technique was proposed in (Steve Webb et al. 2008). The

authors proposed a classification method based on HTTP session (header) information which can be

implemented inside client applications (e.g. web browsers) to detect spam web pages before the actual

content is transferred.

HTTP sessions contain information such as the request and response messages exchanged with the web server. After the client (web browser) sends an HTTP request message to the web server, the web server responds to that request with an HTTP response message (i.e. status, timestamp, content length, web server IP address, etc.) along with the web page content. The authors proposed a method that investigates the HTTP response message to identify spam web pages. Features such as the timestamp, content length and status in the HTTP response are used for classification.
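A minimal sketch of this early-detection idea is given below: a HEAD request retrieves only the response headers, from which a small feature dictionary is built before any page content is transferred. The particular feature choice is an assumption for illustration; the original method used a larger set of header features.

```python
# Collect response-header features without downloading the page body.
import urllib.request

def header_features(url):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return {"status": resp.status,
                "content_length": int(resp.headers.get("Content-Length", 0)),
                "server": resp.headers.get("Server", ""),
                "has_date": resp.headers.get("Date") is not None}
```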

The authors utilised 40 different machine learning algorithms and selected the best result for classification purposes. Their dataset consisted of over 750,000 web page samples. The results showed an F1-measure of 93.5% with an accuracy of 93.9%. The authors also investigated the resource usage and computational costs of their proposed method and claim that it can save an average of 15.4 KB of bandwidth and storage resources for each spam web page, while adding an average of 101 μs to each web page retrieval.

Research in the area of early detection-based web spam filtering methods has not received much

attention from the community. The method discussed above is the most recent work in this field. The

next section provides detailed evaluation of all the methods for web spam filtering.

2.2.5 Evaluation

In the previous section, a comprehensive survey of web spam filtering methods was presented. In this

section a critical evaluation of those methods is discussed in detail.


The majority of the proposed web spam filtering methods attempt to counteract specific web spam tactics. Therefore, it is interesting to see the success of state of the art methods in combating current web spam tactics. This provides a good benchmark with which to evaluate the current literature and discover the research gaps. Table 2.3 presents an evaluation of web spam filtering methods based on the various web spam tactics that were discussed in Section 2.2.3. The hijacking method has been removed from the evaluation list since none of the current web spam filtering methods is specifically designed to address this tactic; hijacking is mainly the focus of Spam 2.0 filtering methods (Section 2.3.4). The general column is reserved for the methods that aim to detect spam web pages in general.

Link-based web spam filtering methods have mainly addressed linkfarm tactics. However, splogs can also be employed to generate linkfarms; hence, such methods might be applicable to the identification of splogs. The link-based method presented by (Yiqun et al. 2008) might be capable of detecting redirection attacks as well as linkfarms and splogs. The method presented by (Metaxas et al. 2005) incorporates users' knowledge into the classification of spam web pages; it can be employed to detect any type of spam web page regardless of tactic.

In the detection-based web spam filtering methods, the majority of the techniques focus on identification of cloaking and redirection tactics. Some of the methods such as (Fetterly et al. 2004) or

(Alexandros et al. 2006) are not focused on specific web spam tactics and hence they can be used to detect spam web pages in general.

The state of the art prevention-based and early detection-based methods have not focused on a specific type of web spam tactic, but propose to detect spam web pages in general.


Table 2.3: Evaluation of state of the art web spam filtering methods *.

Method | Cloaking | Linkfarms | Splog | Keyword stuffing | Hidden content | Redirection | Scraper websites | General
(Fetterly et al. 2004) | - | - | ? | ? | ? | - | ? | ×
(Gyongyi et al. 2004) | - | × | ? | - | - | - | - | -
(Benczúr et al. 2005) | - | × | ? | - | - | - | - | -
(Metaxas et al. 2005) | - | - | - | - | - | - | - | ×
(Wu et al. 2005) | - | × | ? | - | - | - | - | -
(B. Wu et al. 2005) | × | - | - | - | - | × | - | -
(Carvalho et al. 2006) | - | × | ? | - | - | - | - | -
(Chellapilla & Chickering 2006b) | × | - | - | - | - | - | - | -
(Alexandros et al. 2006) | - | - | - | - | - | - | - | ×
(Wu et al. 2006) | × | - | - | - | - | - | - | -
(Pranam et al. 2006) | - | - | × | - | - | - | - | -
(Zhou et al. 2007) | - | × | ? | - | - | - | - | -
(Asano et al. 2007) | - | × | ? | - | - | - | - | -
(Yi-Min Wang & Ming Ma 2007) | × | - | - | - | - | × | - | -
(Goodstein et al. 2007) | - | - | - | - | - | - | - | ×
(Yuting et al. 2008) | - | × | ? | - | - | - | - | -
(Webb et al. 2008) | - | - | - | - | - | - | - | ×
(Benczúr et al. 2008) | - | - | - | - | - | - | - | ×
(Urvoy et al. 2008) | - | - | ? | ? | ? | - | ? | ×
(Yiqun et al. 2008) | - | × | ? | - | - | - | - | -
(Yu et al. 2009) | - | × | ? | - | - | - | - | -
(Abernethy et al. 2010) | - | × | ? | - | - | - | - | -
(Araujo et al. 2010) | - | × | ? | - | - | - | - | -

* Cross sign (×): the method can counteract the web spam tactic (the authors have specifically mentioned this in their study).

Question mark (?): the method might counteract the web spam tactic (the authors have not mentioned this in their study).

Figure 2.2 illustrates the coverage of web spam filtering methods in terms of web spam tactics. It is clear from the figure that research attention in this survey has been directed towards splog, linkfarm, general websites and cloaking tactics.


Figure 2.2: Coverage of state of the art web spam filtering methods on web spam tactics.

Figure 2.3 presents the number of web spam filtering methods under each technical category in this

survey. It is clear that most of the efforts in the web spam filtering domain have focused on link-based

web spam filtering. There is scant literature on the prevention and early detection-based methods.

Figure 2.3: The number of web spam filtering methods per technical approach.

Technically, the methods focus mainly on link- and detection-based approaches, and they raise the

following concerns:

1. Methods that strive to enhance the PageRank algorithm need significant amounts of resources. This makes them usable only for large scale implementations (e.g. major search engines); they are not efficient on a smaller scale.

2. Methods that strive to enhance the HITS algorithm cannot be used in offline mode. This might cause system performance issues during the system run time.


3. Both link- and detection-based approaches are afterthoughts. These methods deal with the web spam problem after spam web pages have already infiltrated the web. Hence, they cannot prevent the distribution of web spam and the wasting of web resources.

4. Early detection-based methods are not effectively designed to prevent web spam distribution. Moreover, there is no study on prevention-based web spam filtering methods.

5. Hijacking is one of the major tactics that has recently been used by spammers. This tactic has not specifically been covered by state of the art web spam filtering methods.

6. The majority of web spam filtering methods have addressed link-based tactics; however, content-based tactics have not received much attention.

It should be noted that the focus of this dissertation is on Spam 2.0 filtering rather than web spam filtering. The issues that have been discussed so far are mainly associated with web spam filtering methods. Spam 2.0 produces some of the abovementioned issues as well as a set of new problems. The following sections provide an in-depth evaluation of Spam 2.0 filtering methods.

2.3 Spam 2.0

This section provides:

• a definition of Spam 2.0,

• an overview of Spam 2.0 preliminary knowledge,

• tactics that are being used in Spam 2.0,

• a survey of current Spam 2.0 filtering methods,

• and finally, an evaluation of existing Spam 2.0 filtering methods.


2.3.1 Definition

Web 2.0 is a term associated with the read/write web, where users can contribute to content generation. Facilities created with Web 2.0 give users a place to collaborate, share information and generate content. Some examples of Web 2.0 include blogs, wikis, forums and online communities (Figure 2.4). Any type of web application which supports user content generation can be considered part of the Web 2.0 platform.

Figure 2.4: Examples of Web 2.0 platforms.

However, Web 2.0 platforms have not been immune from malicious usage, i.e. spam. A malicious user can sign up to such platforms in order to distribute spam content. Later, the spam content is seen by both genuine users and search crawlers. Spam content on a legitimate website is more trusted than an isolated spam web page on a different website. Examples of Spam 2.0 include: a fake eye-catching profile on a social networking website, a promotional review, an unsolicited response to a thread in an online forum, and a manipulated wiki page.

Spam 2.0 is different from previous types of spam such as web spam. Spam 2.0 infiltrates legitimate web pages; it is therefore more trusted and highly influential for users. It is also more challenging for search crawlers to prune spam from genuine content on legitimate websites (Hayati et al. 2010).


As mentioned in Section 1.5.1, Spam 2.0 is defined as the “propagation of unsolicited, anonymous, mass content to infiltrate legitimate Web 2.0 applications” (Hayati, Potdar, Sarenche, et al. 2010).

Similar to other spam filtering methods, Spam 2.0 filtering methods search for spam patterns inside the spam content or the spam distribution behaviour. However, some efforts have been made to prevent Spam 2.0 distribution as well.

2.3.2 Preliminary Concepts

In this section, some preliminary knowledge and concepts are discussed that would be useful in understanding the relevant literature in the area of Spam 2.0.

Web usage data

Web users navigate through web objects such as web pages, image, , and documents in the websites. Such navigation can be recorded and later mined to extract valuable information (Cooley et al. 1999). This tracking data is referred to as ‘web usage data’. Web usage data has been widely employed in many studies to understand visitor browsing patterns, make personalised websites, improve web performance, and to implement decision support and recommendation system.

In Spam 2.0 filtering, web usage data has been used to identify web crawlers and automated bots.

Social networking websites

Literature on Spam 2.0 filtering methods has paid a lot of attention to spam filtering on social networking websites. Social networking websites provide a platform for people to share their thoughts and exchange messages, photos, etc. An example of a social networking website is Twitter (www.twitter.com). Twitter is a popular micro-blogging service that allows users to share a short message (called a tweet). Tweets must contain no more than 140 characters. Twitter allows users to create their own public profile where their tweets appear, and users can subscribe to other users' profiles (becoming their followers) to receive updated tweets.

Social networking websites such as Twitter have received much attention from the Spam 2.0 filtering community and many filtering methods have been proposed for this service.


Blog

Blog or web-log is a type of Web 2.0 platform that individuals maintain for various purposes. For

example, blogs can serve as an online diary, a photo gallery, or a music sharing platform. As most blogs

provide a read/write platform where blog operators can publish their desired content and blog visitors

can submit their comments, blogs receive attention from both genuine and malicious users. For example, spam users post spam comments on individuals' blogs.

Public wikis (e.g. Wikipedia), public forums, social networks (e.g. Facebook, Delicious, etc.), organisations' websites, etc. are some examples of legitimate hosts that can be infiltrated by Spam 2.0. Almost any website that provides a Web 2.0 platform can be infiltrated by Spam 2.0.

Honeypot

Honeypot is a technique in computer security to trap an adversary by appearing to be a real computer, file, record or other entity in a computer network (Spitzner 2002). In the spam filtering domain,

honeypots are used to appear as an email server or mimic a real user in online communities to track

spammers’ behaviour.

Information gain and Chi-Squared (X²)

Information gain is frequently used in machine learning for term selection (Michalski et al. 1985). Information gain considers both the existence and the non-existence of a term in the content. In the spam filtering domain, specifically in detection-based methods, information gain is used to reduce the dimensionality of the term space used in an input vector.

Similar to information gain, in the spam filtering domain X² is used in the statistical language modelling of word associations and for term selection. However, X² measures the lack of independence between a term and its class (i.e. whether the term belongs to the spam or genuine class).
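As a concrete illustration, the toy example below ranks the terms of an invented corpus by their X² score using scikit-learn; the documents and labels are fabricated for the example.

```python
# Chi-squared term selection over a toy spam/genuine corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = ["buy cheap pills now", "meeting agenda attached",
        "cheap pills discount", "project status report"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = genuine

vectoriser = CountVectorizer()
X = vectoriser.fit_transform(docs)
scores, _ = chi2(X, labels)

# Rank terms by how strongly they depend on the class label.
ranked = sorted(zip(vectoriser.get_feature_names_out(), scores),
                key=lambda pair: -pair[1])
print(ranked[:3])
```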


Cosine similarity

Cosine similarity is used to calculate the similarity between two vectors by measuring the cosine of the angle between them. Cosine similarity can show whether or not two vectors are pointing in the same direction (Steinbach et al. 2000).

In spam filtering, cosine similarity can be used to measure the similarity between two content vectors.
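The computation itself is compact; a small sketch with invented vectors:

```python
# Cosine similarity between two term-count vectors.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 2, 0], [2, 4, 0]))   # 1.0: same direction
print(cosine_similarity([1, 0, 0], [0, 1, 0]))   # 0.0: orthogonal
```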

Linear regression

Linear regression is a method used to model the linear relationship between a scalar dependent variable and one or more explanatory variables (Yan & Su 2009).

In spam filtering, linear regression is used as a machine learning tool to estimate a linear function from a training set of values; this linear function is later used to separate spam values from legitimate ones.

K-nearest neighbours

K-nearest neighbours (k-NN) is a machine learning algorithm that classifies an object based on the K training samples closest to it. Training samples are labelled vectors; using k-NN, unlabelled vectors are classified according to the classes of their nearest training samples (Alpaydin 2004). In spam filtering, k-NN is used to distinguish spam objects (e.g. content, behaviour) from genuine ones.

Back propagation neural networks

A back propagation neural network is a multi-layered, feed-forward neural network whose weights are adapted from a known set of inputs and desired outputs. Once the network has learned the problem, it estimates outputs for previously unseen inputs (Michalski et al. 1985).

In spam filtering, the back propagation neural network is used as a machine learning tool for classification.


Similar concepts that have been discussed in Section 2.2.2 include precision, recall, F1-measure, SVM, the Naïve Bayes classifier, and true positives, true negatives, false positives and false negatives – all of which have been utilised in the Spam 2.0 domain.

2.3.3 Spam 2.0 Tactics

Spam 2.0 is employed by a spam user to distribute content via Web 2.0 platforms. Depending on the

structure of the Web 2.0 platform, spam users can propagate spam content in different sections of the

website. The following are examples of common strategies that spam users employ to infiltrate web

applications with spam content.

Spam online profile (spam profile)

Profiles are a type of 'About me' web page that summarises information about a particular user. A typical

profile contains information about the user’s contact details, photo, and latest activities in a Web 2.0

platform.

With this tactic, spam users register/sign up in the Web 2.0 platform (e.g. forums, wikis, social

networking website etc.) and fill the profile fields with irrelevant/spam content. For example, in most

of the profiles, there is a field for the user’s homepage link. A spammer uses that field to direct a

genuine user to his/her spam campaign.

Occasionally, spam profiles are filled with appealing content (e.g. celebrities’ photos) to attract or

mislead genuine users.

Spam comment (comment spam)

With this tactic, comment systems (e.g. blog comments) are targeted by spam content. Spam users fill

up comment forms with their spam content. Spam content typically contains URLs to spam

campaigns. Spam comments might be easy to discover, or so misleading that normal human users cannot identify them. Figure 2.5 illustrates a misleading spam comment on the author's homepage. This comment carries a link to a spam website in the commenter's name, i.e. maladireta.


Figure 2.5: Example of misleading comment spam on author’s homepage.

Spam wiki (wiki spam)

Wikis provide a platform where users can create a number of web pages and provide links among them. In this tactic, wiki pages are manipulated by spam users to present spam content or a link to spam campaigns. Figure 2.6 presents an example of spam on a wiki: a link to the "Arena51" website has been modified to point to a spam website.

Figure 2.6: Example of spam in wiki platform.

Spam personal message

Web 2.0 platforms such as forums and online communities provide messaging systems through which users can exchange personal messages. With this tactic, the personal messaging system is infiltrated by spam users who send spam content to genuine users.


Spam online community

With this tactic, online communities or social networks are infiltrated by spam users who create spam profiles, send spam messages to real human users, post spam comments, etc. Depending on the facilities provided by an online community, these can be misused by a spam user to propagate spam content.

Spam specific website/web application (targeted spam)

With this tactic, spam users target a Web 2.0 platform on a particular website. For example, Twitter is

targeted by many spam users. Spam users distribute spam tweets or create spam profiles on Twitter.

Additionally, some popular web applications, specifically open-source or free web applications, are also targeted by spam. Web operators use these web application packages to rapidly set up online collaboration platforms, so the packages receive much attention. Once the common features and facilities of these web applications have been targeted by spam, spam users can distribute spam content across all websites that use the same web application. Simple Machines Forum (SMF) (http://www.simplemachines.org), phpBB (http://www.phpbb.com), and Wordpress (http://www.wordpress.org) are examples of common open-source web applications.

These Spam 2.0 tactics will be used as the criteria to evaluate each Spam 2.0 filtering method. After

discussing the preliminary concepts and background information, a comprehensive study of the

existing Spam 2.0 filtering methods in the literature is presented in the following sections. A summary

of each method, along with a comparative analysis of the methods, is then discussed.

2.3.4 Spam 2.0 Filtering Methods

This section provides a comprehensive study of state of the art Spam 2.0 filtering methods based on three categories: detection-based, prevention-based, and early detection-based.

2.3.4.1 Detection-based Spam 2.0 Filtering Methods

In detection-based Spam 2.0 filtering methods, the focus is on identifying spam patterns inside the

content of Spam 2.0. The techniques employed here are similar to those described for web spam.

A content-based method is proposed by (Narisawa et al. 2007) to detect spam messages in forums.

The authors believed that spam messages are designed to distribute advertisement messages in

forums.


The authors started by creating equivalence classes for the content of each message. An equivalence class groups together substrings of a message that have identical sets of occurrences. By employing three features – length (the length of the representative of a spam equivalence class is higher), size (the size of the representative of a spam equivalence class is higher) and Maximin (the length of the representative relative to the length of the longest minimal element of the equivalence class) – the authors could distinguish spam messages from legitimate ones.

They evaluated their proposed method on four Japanese forums (24,591 messages) and used the F1-measure to evaluate its performance. The results varied from 68% to 80% across the four datasets.

Similar to the work mentioned above, the authors in (Uemura et al. 2008) proposed a content-based method to detect spam in forums and blogs. They claim that the document complexity of normal messages is much higher than that of spam, since spam messages are auto-generated and their generation costs are lower. The authors focused on detecting Mutation (swapping random letters in the document), Crossover (exchanging a random part of a pair of documents) and Word Salads (replacing random words in the document).

The authors used the same four datasets that had been used in (Narisawa et al. 2007). The result showed that the proposed method works well on word salad detection.

Each and every Web 2.0 platform provides different facilities for users. For example, in forums, users can reply to other users' topics, while in wikis, users can edit others' contributions. The variety of such facilities has made Spam 2.0 filtering more challenging, since it requires adaptable solutions.

Apart from the abovementioned studies that target Spam 2.0 in specific web applications (e.g. blogs and forums), the majority of recent work focuses on Spam 2.0 filtering on particular websites (e.g. Twitter, Facebook, MySpace, etc.). Among such websites, Twitter has received considerable attention from the research community. The methods which will be discussed


next have focused on various features of the Twitter platform in an attempt to identify spam messages

or spam profiles.

(Benevenuto et al. 2010) focus on detecting malicious users who post spam tweets on Twitter. Their content-based method is based on extracting discriminative features from the content of tweets and from spam profiles. After analysing the content of tweets, 39 features are extracted as descriptors of the tweets. The number of URLs per tweet, the proportion of spam words, and hashtags (keywords that refer to other topics in Twitter) were the three best features for identifying spam profiles. For example, spam profiles follow an excessive number of members in the hope of being followed back; as they cannot create a balance between the accounts they follow and those that follow them, this ratio differs markedly from that of genuine users.

For the dataset, the authors crawled 54 million Twitter profiles. They manually labelled a sample, resulting in 8,207 profiles of which 355 were spam profiles and 7,852 were genuine profiles.

SVM is applied for the classification task, and Chi-Squared (X²) and information gain methods are used for feature selection. Apart from the above features, some attributes related to the tweet content are also extracted; for instance, the number of hashtags, the number of URLs, the number of characters, etc. are considered. In total, the classifier managed to identify 70% of spam users and 96% of genuine ones.
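A toy extraction of a few of these tweet-level attributes might look as follows; the regular expressions and feature names are illustrative rather than those of the original study.

```python
# Count URLs, hashtags and characters in a tweet.
import re

def tweet_features(tweet):
    return {"num_urls": len(re.findall(r"https?://\S+", tweet)),
            "num_hashtags": len(re.findall(r"#\w+", tweet)),
            "num_chars": len(tweet)}

print(tweet_features("Win FREE pills!! http://spam.example #luck #win"))
# {'num_urls': 1, 'num_hashtags': 2, 'num_chars': 47}
```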

The above study was extended by (Irani et al. 2010), who proposed a method for detecting unrelated tweets in trending topics, considering both the content of tweets and their associated websites. Twitter lists popular topics, named trending topics, on its homepage. Since trending lists are growing in popularity, they have attracted malicious users and spam content. The proposed method has the following three steps.

• Tweet classification based on the content of each tweet, to find out whether or not it is related to its trending topic.

• Tweet classification based on the content of the web page to which the tweet refers.

• The results of both the abovementioned steps are combined using 'AND' and 'OR' operators. This step is incorporated in order to detect spam tweets which look legitimate in content or have too few characters for classification.


For simplicity, tweets created by suspended accounts are assumed to be spam. The authors examined the performance of three classifiers: Naïve Bayes, C4.5 decision trees and Decision Stumps. For the dataset, they collected 1.3 million tweets from 2,067 trends along with 1.3 million web pages related to tweets. For tweet text, Naïve Bayes and C4.5 decision trees performed well. Overall, the OR operator had better performance for the combined classifiers.

Instead of finding spam tweets on Twitter, work has been done to identify the spam profiles which are used to distribute spam tweets. By eliminating spam profiles, their spam tweets can be eliminated as well.

Similar work has been done in (A. H. Wang 2010) to detect spam profiles. However, the author created a social graph to model users and spam users in Twitter. A directed social graph is constructed to model four relationships among users' profiles: follower, friend, mutual friend, and stranger. Spam detection is based on (1) graph-based and (2) content-based features. The former are derived from the social graph constructed for each user account, e.g. the number of friends (in-degree), the number of followers (out-degree) and reputation (the ratio between in-degree and out-degree). The latter focus on the content of tweets, e.g. black-listed URLs, duplicate tweets, the number of reply tweets, and unrelated posts under the same trending topic. The results showed that the reputation of spam profiles is low, since they have an excessive number of followers compared with the number of friends, and that a high number of duplicate tweets is an important factor in detecting spam profiles.

The author first created a collection containing 25,847 users, around 500,000 tweets, and almost 49M relationships, gathered via Twitter's API methods over three weeks. Evaluation was carried out on a manually labelled dataset of 500 profiles. The author chose the Naïve Bayesian technique as the classifier, which correctly identified 89% of the spam profiles in the test collection.
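The graph-based reputation feature can be sketched on a toy directed graph with networkx, following the in-degree/out-degree mapping described above; the graph and node names are invented for the example.

```python
# Reputation = in-degree / out-degree: a spam profile with many out-edges
# but few in-edges scores low.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("spammer", f"user{i}") for i in range(50)])  # out-edges
G.add_edge("user1", "spammer")                                   # one in-edge

def reputation(g, node):
    out = g.out_degree(node)
    return g.in_degree(node) / out if out else float("inf")

print(reputation(G, "spammer"))   # 0.02: a low-reputation profile
```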

The abovementioned studies focus only on the Twitter platform. However, in the study of (K. Lee et al. 2010), the authors proposed an automatic honeypot-based method for detecting spam profiles in online social communities. The goal of this method is to use honeypots in order to decrease human participation in spam profile classification, create a statistical user model to distinguish spam profiles from legitimate ones and, finally, detect spammers among a large number of unknown profiles. The method proposes the following solution.

If a honeypot receives a friend request from a user (in online communities, users can add others as friends), it considers the sender of the request to be a potentially suspicious user and starts to extract information from the user's profile. Next, these features are coupled with some legitimate profiles and used as the training set of a classifier. By using a human inspector who verifies the predicted spam profiles, new instances are added to the training set in order to obtain a more powerful classifier and, subsequently, better predictions.

The effectiveness of this method was studied over two social communities, Twitter and MySpace

(http://www.myspace.com). The authors randomly selected 388 legitimate profiles from MySpace and

627 spam profiles from the set of profiles detected by honeypot as being spam. After extracting

features such as the number of friends, marital status, etc., it was shown that all top ten classifiers for

MySpace had good performance with accuracy over 98.4%, F1-measure over 0.98 and false positive

rate less than 1.6%, which was robust against different training sets. They also investigated the

discrimination power of extracted features using an ROC curve. It was shown that marital status and

gender are the least discriminative features, while bag-of-words and features related to the “about me”

section are the most discriminative ones.

To evaluate spam classification in Twitter, authors randomly selected 104 legitimate users. They also

considered two classes for this step: spammers and promoters which had 61 and 17 users respectively.

Information about user profile, tweets, following and followers were gathered for all these users.

Features such as longevity of account in Twitter, average number of tweets per day etc. were also

extracted. Content similarity between pairs of tweets was also calculated using standard cosine

similarity. When investigating discrimination power, average posted tweets per day, percentage of

bidirectional friends, ratio of the number of following and followers were found to have the least

discrimination power, whereas account age, text-based features extracted from tweets, and number of

URLs were found to be the most powerful discriminating features. Furthermore, in the case of content

similarity, spam tweets were more similar to one another than other tweets, while promoters' tweets had the highest number of URLs. Using these features, the top ten classifiers again had good performance for Twitter with

accuracy over 82.7%, F1-measure over 0.82 and a false positive rate less than 10.3%.

(Zinman & Donath 2007) is another work that has been done on MySpace to detect spam profiles.

However, the result of this work was not satisfactory due to the nature of the features authors utilised.

The authors proposed an approach to detect spam users in Social Networking Services (SNS). They

utilised features of senders’ profiles as well as network features. In SNS, not all the unsolicited

messages are spam, so the authors tried to identify the sender instead of the sent messages. In


addition, making a decision about requests is extremely subjective. In other words, the desirability or undesirability of a sender depends on receivers’ opinions.

MySpace profiles were examined in this work based on two dimensions: sociability and promotion.

Sociability refers to the number of social activities in MySpace, while promotion relates to the amount of information intended to influence users. Based on these two dimensions, four prototypes have been created as follows.

• Prototype 1 (low sociability and low promotion): includes new members of the site as well as mostly inactive spam profiles.

• Prototype 2 (low sociability and high promotion): here profiles are promotional and use SNS as a marketing means.

• Prototype 3 (high sociability and low promotion): this prototype is related to an ordinary user and most of the SNS users belong to this group.

• Prototype 4 (high sociability and high promotion): unlike prototype 2, here entities have individual relationships with their friends.

The authors randomly selected several profiles and assigned values from one to five to each profile's sociability (s) and promotion (p). They considered only those profiles with s or p values higher than one.

They examined four machine learning algorithms - linear regression, K-nearest neighbours, back propagation neural networks, and naïve Bayesian networks - with different sets of features: profile-based, network-based, and mixed. They had 40 feature dimensions, as well as 600 training data points and 200 test data points. The best performance was obtained using the mixed set of features in a neural network with a single layer. Network-based tests, on the other hand, performed poorly in comparison with profile-based ones, which achieved 78%-83%.

Apart from MySpace and Twitter, which were the focus of the previous studies, (H. Gao et al. 2010) proposed a method to detect spam campaigns in online social networks, specifically on Facebook (http://www.facebook.com). The authors proposed to use text-based similarity and URL correlation between different wall posts (wall posts in Facebook are spaces on each user's profile where known users can post information for other users). Similar posts are clustered based on the post template, as well as on shared destination URLs. Two threshold values are applied to the resulting groups: a minimum number of users in each cluster and a maximum time interval between two consecutive wall posts.
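The URL-correlation side of this clustering can be sketched as follows, with invented posts and only the minimum-users threshold; the template matching and the time-interval threshold are omitted for brevity.

```python
# Group wall posts by destination URL and flag clusters with enough users.
import re
from collections import defaultdict
from urllib.parse import urlparse

posts = [("alice", "check this out http://scam.example/win"),
         ("bob", "free prize here http://scam.example/win"),
         ("carol", "lunch photos http://blog.example/post")]

clusters = defaultdict(list)
for user, text in posts:
    for url in re.findall(r"https?://\S+", text):
        parsed = urlparse(url)
        clusters[parsed.netloc + parsed.path].append(user)

MIN_USERS = 2   # minimum number of distinct users per suspicious cluster
suspicious = {k: v for k, v in clusters.items() if len(set(v)) >= MIN_USERS}
print(suspicious)   # {'scam.example/win': ['alice', 'bob']}
```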

They examined their method on the Facebook community with a dataset of 3.5 million users and 187 million wall posts. The method detected 297 clusters, containing 212,863 wall posts, as malicious clusters. The authors did not report the performance of their study.

YouTube (http://www.youtube.com) is another popular social video sharing website that is infiltrated by spam users to promote spam videos, post spam comments on other users' videos, etc.

(Benevenuto et al. 2009) proposed a content-based method for the classification of spam, promoter, and genuine profiles in social video sharing systems, i.e. YouTube. A collection of YouTube users was built and then manually labelled according to the three mentioned classes: the authors pre-classified 31 users as promoters, 157 as spam and 641 as legitimate profiles. Three sets of attributes were extracted, covering the video, the user, and the social network. Features such as the number of views, the video's duration, and the video's favourite count were considered in the video attribute set. The second set, which describes the behaviour of users, includes the number of uploaded and watched videos, the number of favourite videos, the frequency of uploading, etc. Finally, in the social network set, four attributes (e.g. clustering coefficient, reciprocity) were used. Then, X² and information gain techniques were used to assign priorities to the 60 extracted features. SVM was used for classification based on two approaches - flat and hierarchical. In the flat approach, 56.69% of spam, 96.13% of promoter, and 94.66% of genuine profiles were correctly categorised. In the hierarchical approach, the classifier first separated 92.26% of promoters and 99.45% of non-promoters; for the non-promoters, the classifier was used again to separate spam profiles from genuine ones. The best F1-measure value achieved was 0.8.

Opinion gathering websites such as Amazon, CNET, and Epinions are places where users can share reviews of products, services and events. They are useful resources that enable consumers to discover quality products and manufacturers to receive users' feedback. However, such open platforms are not immune from malicious use, i.e. spam reviews.


(Nitin & Bing 2008) proposed the first content-based spam detection system inside opinion gathering websites. Opinions are given as reviews or comments from customers on services, events, products etc. Currently, customers can post their opinions on merchant websites. Some websites such as

Amazon provide an interface to accumulate users’ reviews. Since there is no quality control over the context of the reviews, this results in the generation of unsolicited or spam reviews. The authors classify spam reviews into three types: untruthful reviews, brand-only reviews and non-reviews, and employ a machine learning algorithm to distinguish spam reviews from legitimate ones.

The authors started by finding duplicate and near-duplicate reviews and used them as a training set to detect the brand-only and untruthful reviews. Logistic regression (Alpaydin 2004) has been used for classification. The authors extracted 36 features that were review, reviewer and product specific (e.g. number of feedbacks, ratio of number of reviews written as first reviews, price of product etc.).

Their investigation is based on 5.8 million reviews crawled from the Amazon website

(http://www.amazon.com), 470 of which they manually labelled for classification purposes. For the classification of brand-only and non-reviews, the average performance using AUC was 98.7%.

However, the results for classification of untruthful reviews were not promising. The authors argue that detection of untruthful reviews is more difficult even for a human inspector. Hence, automatic classification of those reviews might not produce effective results.

The authors also found that:

• reviews from biased reviewers who give a negative ranking to a product are very likely to be spam,

• reviews of products with just one review are very likely to be spam,

• top-ranked reviewers are more likely to write spam reviews,

• feedback on reviews is not a good indicator of spam vs. non-spam reviews, and

• spam reviews are more likely to be given for low-selling products.

The other way to detect opinion spam is to detect the spam users who spread it. Lim et al. (2010) proposed a method for detecting spam profiles inside opinion gathering websites based on behavioural features (mainly content-based). The authors extracted behavioural features from the content of reviews and used them inside machine learning classifiers. The intention is to detect spam


profiles or users who distribute spam reviews in order to eliminate spam reviews inside opinion

gathering websites.

The authors used the Amazon API to extract 48,894 reviews and 11,038 user profiles from the Amazon website. The following features were measured for classification purposes: (1) multiple ratings of the same product (spam users tend to post multiple high ratings for the same product), (2) multiple reviews of the same product (similar to the previous feature but considering review content), (3) high/low ratings of similar products that share the same attribute within a short time span (spam users assign high/low ratings to similar products – products of the same brand – within a few hours), (4) deviation from the ratings of other reviewers for the same product (spam users try to demote or promote a product; hence, their ratings differ from those of others), and (5) early deviation (similar to the previous feature, but spam users attempt to post reviews as soon as a product is made available for review; this prejudices others' opinions and attracts more attention).

For evaluation, they randomly selected 62 user profiles and recruited three human evaluators to validate the results. The authors discovered that the first and second features achieve better accuracy for the detection of spam profiles. By applying a regression model to the 11,038 user profiles, they found that over 4.65% of the reviewers are spammers.

Social bookmarking or social tagging websites have recently gained attention in the Spam 2.0 filtering domain. On a social bookmarking website, users can share links to other web pages, comment on other users' links, and tag the links (a tag is a short, descriptive annotation for a resource). On the dark side, however, such websites can become a potential channel for the distribution of spam links.

The following studies focused on the detection of spam content inside social bookmarking websites.

(Paul et al. 2007) and (Koutrika et al. 2008) provided a comprehensive study of tag spam in social tagging systems, i.e. del.icio.us (Delicious). The authors studied the effect of different spam attacks on tagging systems and on tagging search results. They investigated the effects on tagging systems of the following factors: the number of spam users, the number of good users, a limited number of tags per user, and the


use of trusted moderators to manually block spam tags. The authors also proposed a coincidence-based search algorithm that considers users' reliability when generating search results. The results of the coincidence-based search algorithm were better than those of the state of the art search mechanisms in the tagging system.

The authors developed a simulator of a social tagging system and used a base set containing over 380,923 documents from Delicious. Their results showed that social countermeasures have a significant effect on the relevance of search results. That is, if the number of good users in the system is lower than the number of spam users, the system suffers substantially from tag spam. Moderators help to reduce tag spam, although this requires a considerable amount of effort and time, and the coincidence-based search showed better results compared with those of other search algorithms.

Another work on Delicious was carried out by (K. Liu et al. 2009); however, these authors evaluated their method on real data rather than simulated data. They made use of social countermeasures and collaborative knowledge to identify spam tags in social bookmarking systems. The collaborative knowledge was implicitly gathered from the way users utilise the system and annotate the content. The authors measured two features: post-oriented and user-oriented. The former is used to calculate the value of each tag compared with other tags for a specific resource; the latter is the inverse value of a user's tags on the system.

Their dataset contained 82,541 users and about 1 million tags. The authors manually classified 2,000 posts and 500 users to evaluate their proposed method. They achieved an accuracy of 93.75% for spam post detection and 96.20% for spam user detection.

The content of tags reveals some characteristics of their nature; therefore, it can be exploited to distinguish spam tags from genuine ones in social bookmarking systems. (Bogers & Bosch 2008) proposed an approach for detecting spam in a social bookmarking system (i.e. BibSonomy). Their approach is based on the assumption that the spam language model is different from the genuine one. In the proposed method, similarity comparisons between genuine and spam language models are carried out at two levels: post-level and user-level. At the post level, language models are built for each post in the system and later matched against the language model of each new post in order to find the most similar posts. Using the spam status of the best-matching posts in the training set, it is possible to predict the spam status of new posts. At the user level, the same procedure is applied to users' profiles.


For the experiment and evaluation, the authors gathered resources posted to BibSonomy of two types – bookmarks and BibTeX records. They manually labelled users as spam or non-spam (2,400 genuine users and more than 29,000 spam profiles).

The results showed that better performance is achieved by building the language models from user-level data rather than from post-level data. At the user level, an AUC score of 0.9784 and an F-measure of 0.9767 were achieved.

Another work on BibSonomy was conducted by (Krause et al. 2008); however, in this study, the authors employed a machine learning approach instead of a language model to identify spam tags. Twenty-five features were extracted to describe the system, grouped into four categories: (1) the user's profile (e.g. the number of digits in the email address or name, and the length of the real name), (2) the user's location (e.g. the number of users from different domains), (3) the user's integration (e.g. the number of tags in each post), and (4) semantic features (e.g. the number of "group=public" settings, and the ratio of spam tags to all the tags in the system).

In this paper, BibSonomy was used as the social bookmarking website to build the dataset. The dataset consists of 18,681 spam users and 1,411 non-spammers, which were labelled manually.

Four machine learning mechanisms were applied, including Naïve Bayes, SVM, logistic regression, and C4.5. The best performance was achieved by the SVM classifier and logistic regression. Since the misclassification in terms of the false-positive rate was considerable, a cost-sensitive learning approach was used for training purposes. The four machine learning algorithms were then applied to all the features, and the highest F1-measure, 0.927, was achieved by the logistic regression classifier.

Apart from utilising social knowledge, a language model and statistical approach (Dellschaft & Staab

2010) investigated tagging behaviour in both Delicious and BibSonomy. The authors studied tagging

behaviour in two models as well as two properties. They used the notion of co-occurrence streams in

tagging systems to describe all tags that have occurred simultaneously. Fifteen different co-occurrence

streams were extracted from two specific datasets – Delicious and BibSonomy (10 streams from the

first one and 5 streams from the second one) - and they removed spam profiles from their dataset in

order to minimize their impact on the results. Two properties of co-occurrence streams used in

tagging models were sub-linear vocabulary growth and tag frequency distributions. The former


The former captures how the tag vocabulary grows over time, while the latter captures the distribution of tags and users in the system.

The authors investigated two models: the Epistemic model and the Yule-Simon model with memory. These models aim to simulate how micro-level user behaviours give rise to the two properties of the co-occurrence streams. In the Epistemic model, two factors influence users' tagging behaviour: (1) the content of resources, which relates to users' background knowledge about a specific resource, and (2) other users' tag assignments, which encourage users to assign particular tags. The Yule-Simon model simulates only the imitation process and ignores users' background knowledge. Consequently, while it can faithfully reproduce tag frequency distributions, it cannot explain sub-linear vocabulary growth.

The results showed that in order to simulate spam tag assignments, a model quite different from the one used for regular users is required; alternatively, the same model can be used but with a different set of parameters. In addition, a comparison of the two models showed that for both the tag frequency and vocabulary growth properties, the Epistemic model performs better than the Yule-Simon model, which can be attributed to its consideration of both the imitation process and users' background knowledge.

Apart from the above studies, which focused on specific websites, the following work targets Spam 2.0 filtering in blog comments. A content-based method to detect spam comments in blogs was proposed in (Huang et al. 2010). The authors used a machine learning algorithm over four features to distinguish spam comments from legitimate ones: (1) the length of the comment, (2) the similarity between the comment and the blog post, (3) the divergence between the comment and the blog post, and (4) the ratio of propaganda and popular words in the comment. The authors argue that no individual feature is a good discriminator for classification purposes, and hence a combination of them needs to be considered. The sketch below suggests how such features might be computed.
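The following sketch is one plausible reading of the four features, assuming cosine similarity for feature (2), a smoothed KL divergence for feature (3), and a hypothetical propaganda word list; none of these specifics are confirmed by Huang et al.

```python
import math
from collections import Counter

PROPAGANDA_WORDS = {"free", "cheap", "winner", "casino"}   # hypothetical list

def comment_features(comment, post):
    c = Counter(comment.lower().split())
    p = Counter(post.lower().split())
    length = sum(c.values())                                # feature 1

    # Feature 2: cosine similarity between comment and post term vectors.
    dot = sum(c[w] * p[w] for w in c)
    norm = math.sqrt(sum(v * v for v in c.values())) * \
           math.sqrt(sum(v * v for v in p.values()))
    similarity = dot / norm if norm else 0.0

    # Feature 3: add-one-smoothed KL divergence of the comment from the post.
    vocab = set(c) | set(p)
    cn, pn = sum(c.values()) + len(vocab), sum(p.values()) + len(vocab)
    divergence = sum(((c[w] + 1) / cn) *
                     math.log(((c[w] + 1) / cn) / ((p[w] + 1) / pn))
                     for w in vocab)

    # Feature 4: ratio of propaganda words in the comment.
    ratio = sum(c[w] for w in PROPAGANDA_WORDS) / length if length else 0.0
    return [length, similarity, divergence, ratio]
```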

SVM, Naïve Bayes and C4.5 were applied to 2,646 manually labelled samples for the classification task. The best result, 84% accuracy, was achieved by the last classifier (C4.5). The authors extended their work by building a rule-based classifier based on statistical measurements of each feature, and discovered that the propaganda ratio, the similarity and divergence of the comments, and the comment length were the best features. The rule-based method was evaluated on 500 manually obtained samples, yielding an average precision of 90.4% and an average recall of 84.5%.

Table 2.4 presents an overview and comparative summary of detection-based Spam 2.0 filtering methods.

Table 2.4: Comparative summary of detection-based Spam 2.0 filtering methods.

Method | Classifier | Dataset | Result
(Zinman et al. 2007) on MySpace | Linear regression, K-NN, back-propagation NN, Naïve Bayes | 800 profiles | Best 0.83 Accuracy
(Narisawa et al. 2007) on forums | - | 24,591 forum messages | Avg. 0.80 F1-measure
(Nitin et al. 2008) on Amazon | Logistic regression | 5.8M reviews | Avg. 0.98 AUC
(Uemura et al. 2008) on forums and blogs | - | 24,591 forum messages | Best 0.73 F1-measure
(Bogers et al. 2008) on BibSonomy | - | 31K user profiles with posts | Best 0.97 F1-measure & AUC
(Krause et al. 2008) on BibSonomy | Naïve Bayes, SVM, Logistic Regression, C4.5 | 20K user profiles | Best 0.93 F1-measure
(Koutrika et al. 2008) on Delicious | - | 38K documents | -
(Benevenuto et al. 2009) on YouTube | SVM | 829 profiles | Best 0.8+ F1-measure
(Liu et al. 2009) on Delicious | - | 1M tags, 2,000 samples | Best 0.93 and 0.92 Accuracy
(Dellschaft et al. 2010) on Delicious & BibSonomy | - | 15 co-occurrence streams | -
(Lim et al. 2010) on Amazon | - | 11K profiles | -
(Irani et al. 2010) on trending topics in Twitter | Naïve Bayes, C4.5 decision trees & Decision Stumps | 1.3M tweets + 470 web page samples | Best 0.8+ F1-measure
(Benevenuto et al. 2010) on Twitter | SVM | 54M Twitter profiles (8,207 samples) | Best 0.94 Accuracy
(Wang 2010) on Twitter profiles | Decision Tree, Neural Networks, SVM, Naïve Bayes | 25K users (500 samples) | Best 0.91 F1-measure
(Lee et al. 2010) on MySpace & Twitter | 10 classifiers | 1,015 MySpace & 104 Twitter profiles | Avg. 0.98 & 0.82 F1-measure
(Gao et al. 2010) on Facebook | - | 187M posts | -
(Huang et al. 2010) on blog comments | SVM, Naïve Bayes, C4.5 | 500 comments | Avg. 0.84 Accuracy

Overall, the detection-based Spam 2.0 filtering methods can be summarised as follows:

• Methods are platform-dependent. They cannot be extended to other platforms.

• They are mainly content-based and can be bypassed by spammers.

• Content-based methods are language-dependent. Hence, a solution might not be applicable to different languages.

The next section provides a discussion of state of the art prevention-based Spam 2.0 filtering methods.

2.3.4.2 Prevention-based Spam 2.0 Filtering Methods

This section discusses the current prevention-based Spam 2.0 filtering methods. Prevention-based methods strive to block or slow down spam distribution inside the system by posing economic or computational challenges to the client.

One of the widely-used methods to block spam in different systems is blacklisting or whitelisting. Blacklists typically contain information about forbidden entities/senders (e.g. email addresses, IP addresses, hostnames, etc.) that are denied permission to transmit information. Conversely, whitelists contain information about trusted entities/senders (e.g. friends, known email addresses, etc.).

Using blacklists to detect spam tweets inside Twitter is a method proposed in (Grier et al. 2010). The authors analysed more than 2 million tweeted URLs, which fall into three categories of spam: scams, phishing sites and malware. This work also includes a study of the success of spam at enticing users to click on spam URLs (click-through analysis), as well as an investigation of different spam campaigns which deceive users by taking advantage of Twitter's features.

Three blacklists were studied in this work: Google Safebrowsing, which includes phishing and malware data, and URIBL and Joewein, which include both spam and non-spam domains for email spam. Using the Twitter streaming API, the authors gathered 200 million tweets. A custom web crawler, designed to find each URL's final landing page, followed the URLs in each tweet, resulting in 25 million URLs. As a result, 2 million URLs (8% of the unique URLs) were recognised as spam; 5% of these pointed to malware and phishing sites, while the remaining 95% were related to scam websites.
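The crawler's core step, resolving a shortened URL to its final landing page and checking the landing domain against a blacklist, might look like the sketch below; the local blacklist set is a hypothetical stand-in for the Google Safebrowsing, URIBL and Joewein lookups used in the study.

```python
from urllib.parse import urlparse

import requests

# Hypothetical local blacklist, standing in for the external blacklist queries.
BLACKLISTED_DOMAINS = {"spam-example.com", "malware-example.net"}

def is_blacklisted(url):
    """Resolve shortener/redirect chains, then check the landing domain."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=5)
        landing = response.url          # final landing page after redirects
    except requests.RequestException:
        return False                    # unreachable: treat as unknown
    return urlparse(landing).netloc.lower() in BLACKLISTED_DOMAINS
```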

With regard to click-through analysis, the authors used the bit.ly API (a URL shortening service) to find out how frequently URLs in tweets are clicked. Out of 245,000 spam links belonging to bit.ly, 97% did not obtain any hits, while the other 3% received 1.6 million clicks in total. The authors claim that features such as the number of spam profiles that spread spam tweets, the number of followers who receive spam links, hashtags, and retweets with hashtags have increased the click-through ratio. They found that the click-through rate is 0.13% (clicks per tweet sent), which is two orders of magnitude higher than that of spam emails. This higher rate is attributed to the little information available to users when deciding whether to click (only 140 characters), the lack of a spam filter in Twitter, and the low cost of broadcasting tweets.

The authors also studied the blacklists' performance based on three aspects: blacklist delay, evasion, and domain blacklist limitations. They found that most spam messages are blacklisted 2-40 days after their appearance on Twitter. Since 90% of clicks occur in the first two days of tweeting, it can be concluded that blacklists are too slow to be a powerful protection mechanism. Moreover, 39% of malware and phishing URLs, once detected, evaded blacklisting by using URL shorteners.


The use of automated tools such as spambots is quite effective for the fast distribution of spam messages to many platforms. By providing a mechanism to prevent the automated distribution of spam messages, Spam 2.0 filtering methods can reduce this problem significantly. The majority of prevention-based methods therefore try to identify automated attempts in order to counteract Spam 2.0.

One of the most popular prevention-based methods to stop automated requests and the automated distribution of spam content is CAPTCHA (von Ahn et al. 2003). As discussed previously, CAPTCHA is a challenge-response test designed to separate human users' requests from automatically generated ones. CAPTCHA tests need to be automatically generated, easy for human users to solve, but hard for machines to bypass. Such tests have been widely used in email and on the web; most current web applications include CAPTCHA as part of their content submission and registration processes. CAPTCHA is currently the most common solution for slowing and preventing Spam 2.0 distribution.

Apart from CAPTCHA, which is designed mainly to prevent automated requests, there have been some Proof-of-Work (POW) based methods (ref. Section 1.6.4.3) focused on slowing down or stopping automated requests.

Hashcash is an example of a POW-based method used for slowing down automated form submissions (Mertz 2004). The idea behind hashcash is to increase the cost of web form submission, thereby increasing the computational, time and economic costs of spam distribution. With this method, the sender needs to calculate a stamp for the submitted content. The calculation of the stamp is difficult and time-consuming, but comparatively cheap and fast for the receiver to verify, as the sketch below illustrates.
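A minimal hashcash-style sketch, assuming a SHA-1 partial pre-image target as in the original hashcash scheme; the stamp format and the 20-bit difficulty are illustrative only.

```python
import hashlib
from itertools import count

def leading_zero_bits(stamp):
    digest = int.from_bytes(hashlib.sha1(stamp.encode()).digest(), "big")
    return 160 - digest.bit_length()          # SHA-1 digests are 160 bits

def mint(payload, bits=20):
    """Costly for the sender: search for a nonce meeting the difficulty."""
    for nonce in count():
        stamp = f"{payload}:{nonce}"
        if leading_zero_bits(stamp) >= bits:
            return stamp

def verify(stamp, bits=20):
    """Cheap for the receiver: a single hash computation."""
    return leading_zero_bits(stamp) >= bits

stamp = mint("post-comment:example.com:2011-07-01")   # about a million hashes
assert verify(stamp)                                   # one hash
```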

However, by employing a pool of computational resources, an adversary can bypass the time and cost of computation to calculate the stamp. This makes Hashcash an impractical solution for the current trend of spam distribution.

There are some technical solutions aimed at the detection of spambots before they visit a website. However, most of these approaches are not effective enough to prevent spammers, mainly because they are outdated or spammers can easily bypass them. The following is a discussion of these technical solutions, along with the tactics spammers use to bypass them.

Robots.txt: Also known as the Robots Exclusion Protocol, robots.txt is a text file normally placed in the root directory of a website. It contains a list of access-restricted places (e.g. web pages, directories, etc.) that web robots are forbidden from accessing (Koster 1994). According to this protocol, web robots should examine the robots.txt file whenever they visit a website. However, compliance with the Robots Exclusion Protocol is voluntary, and many web robots do not follow it (Koster 1995). As spambots are tasked with distributing promotional and junk content, it is doubtful that they would observe the robot rules stated in the robots.txt file. Therefore, this solution is ineffective in dealing with spambots.
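For illustration, a well-behaved robot can consult robots.txt with Python's standard-library parser, as below; spambots simply skip this step. The URL and user-agent string are placeholders.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()                     # fetch and parse the exclusion rules

# An ethical robot checks before each request; a spambot never does.
if rp.can_fetch("MyCrawler/1.0", "https://www.example.com/forum/"):
    print("allowed to crawl")
```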

User-agent: User-agent is a field inside an HTTP request header that identifies the client application to the web server. A web robot should declare its identity to the web server by specifying this field (Koster 1993). For instance, the user-agent for the Yahoo! web crawler is Slurp (Yahoo! 2009). By simply examining the user-agent field, web administrators are able to restrict access for specific web robots. However, spambots often hide their identity or rename their user-agent to a name that is not restricted (Hayati et al. 2009). Therefore, this technique is ineffective in restricting spambots.

Head Request: A web server's response to an HTTP HEAD request consists of header information without a message body. Web robots use HEAD requests to check the validity of hyperlinks, accessibility, and recent modifications of the requested web page (Fielding et al. 1999). A large number of HEAD requests in the incoming traffic can therefore indicate web robot activity. However, this heuristic is not effective at detecting spambots, as they normally request the message body rather than the header.

Referrer: Referrer is a field inside the HTTP request that contains a link to the web page the client followed to reach the currently requested page. For instance, if a user reaches www.example.com/page2.html from the www.example.com page, the referrer field is www.example.com. Ethical web robots normally do not assign any value to the referrer field, so they can be differentiated from human users. However, spambots often assign a value to the referrer field to mimic a human user, and can also provide fake links to conceal their navigation (Park et al. 2006).

Flood Control: Flood control is a technique to limit the number of requests a client can send within a specified time interval. The idea is to restrict the number of automated submission requests to the server (Ogbuji 2008). For example, once a user creates a new thread in a forum, s/he may have to wait ten minutes before creating another thread (see the sketch below). This technique may slow down automatic submissions by spambots, but it also inconveniences genuine users. Additionally, it cannot stop spambots, but only delays the submission process; spambots often create multiple user accounts and can therefore easily bypass this technique (Hayati et al. 2009).
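A minimal flood-control sketch, with an in-memory store and a hypothetical ten-minute cooldown:

```python
import time

COOLDOWN_SECONDS = 600        # e.g. ten minutes between new threads
last_post = {}                # username -> timestamp of last submission

def allow_submission(username, now=None):
    """Reject a submission arriving inside the per-account cooldown window."""
    now = time.time() if now is None else now
    if now - last_post.get(username, 0.0) < COOLDOWN_SECONDS:
        return False          # too soon: delay the client
    last_post[username] = now
    return True
```

Note that the limit is keyed to the account name, which is precisely why registering multiple accounts defeats it.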

Nonce: A nonce is a technique to stop automated form submission. It is a randomly generated set of characters placed in a web page containing a form (Ogbuji 2008). When the user submits the form, the nonce is sent back to the server. If the nonce is absent from the form submission, the form was not loaded by the client, indicating the possibility of an automated attack by a spambot (see the sketch below). This technique only ensures that submitted content originates from a web form loaded by the client; it cannot stop spambots that submit content through the website's forms.
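A minimal sketch of issuing and checking a nonce, using only the Python standard library; the in-memory storage and form-rendering details are illustrative.

```python
import secrets

issued_nonces = set()

def render_form():
    """Embed a fresh nonce as a hidden field when serving the form."""
    nonce = secrets.token_hex(16)
    issued_nonces.add(nonce)
    return f'<input type="hidden" name="nonce" value="{nonce}">'

def accept_submission(form_data):
    """Accept only submissions carrying a nonce that was actually issued."""
    nonce = form_data.get("nonce")
    if nonce in issued_nonces:
        issued_nonces.discard(nonce)   # single use
        return True
    return False                       # missing/unknown nonce: likely a bot
```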

Form variation: This technique involves varying the objects inside a form upon each web page request (Ogbuji 2008). For example, the names of the input fields within a form change for each request. The idea is similar to the nonce, in that it aims to ensure that a server-generated form is used during submission. However, like the nonce, it cannot block spambots that use the website's forms.

Overall, prevention-based methods are becoming obsolete as computers become more powerful. Studies show that state-of-the-art prevention methods such as CAPTCHA are weak against current spam tactics.

Apart from detection and prevention based Spam 2.0 filtering methods, there are some studies in the literature which aim to identify spam behaviour inside the system. The next section presents a survey of these methods known as early-detection based Spam 2.0 filtering methods.

2.3.4.3 Early-Detection based Spam 2.0 Filtering Methods

The early-detection based Spam 2.0 filtering methods try to discover spam behaviour inside the system prior to or just after spam content distribution. These methods investigate the behaviour of spam origin, such as spambots.

(Tan & Vipin Kumar 2002) proposed a rule-based framework to discover search engine crawlers and camouflaged web robots. They utilised navigation patterns such as session length, set of visited web pages, and requested method type (e.g. GET, POST, etc) for classification. Their main assumption


was that navigational patterns of web robots are different from those of human users. For instance, a

web crawler intends to maximise coverage of a website; hence, it typically crawls in a breadth-first

manner. A link checker robot (to validate whether a link in a web page is still valid or not) normally

sends HEAD requests to obtain header information of a web page. This characteristic is quite different

from that of human users who typically request the entire web page.

The authors utilised the following properties to distinguish web robots from human users (see the sketch after this list):

• The total number of requested web pages,

• the number of image requests,

• the total time of each session, and

• the number of different file format requests.
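A hedged sketch of deriving these four session properties from parsed web-server log entries; the entry fields are simplified assumptions, not the paper's actual log schema.

```python
IMAGE_EXTENSIONS = (".gif", ".jpg", ".png")

def session_features(entries):
    """entries: list of dicts with 'url' and 'timestamp' (seconds) keys."""
    urls = [e["url"] for e in entries]
    times = [e["timestamp"] for e in entries]
    return {
        "total_pages": len(urls),
        "image_requests": sum(u.lower().endswith(IMAGE_EXTENSIONS)
                              for u in urls),
        "session_seconds": (max(times) - min(times)) if times else 0,
        "file_formats": len({u.rsplit(".", 1)[-1].lower()
                             for u in urls if "." in u}),
    }

print(session_features([
    {"url": "/index.html", "timestamp": 0},
    {"url": "/logo.png", "timestamp": 2},
    {"url": "/thread.html", "timestamp": 30},
]))
```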

Their proposed method starts by labelling each session in the web server log files based on known heuristics such as the user-agent field, IP address, etc. This task resulted in four session types: known web robots, known human users, unknown robots, and unknown human users, with ambiguous sessions tending to be assigned to the non-robot types. The authors utilised a C4.5 decision tree classifier and evaluated their method using the standard precision, recall and F1-measure.

Their dataset for this study contained 180,602 sessions with 1,639,119 web log entries. For training purposes, the authors chose web robot and non-robot sessions in equal numbers. The results showed that after 4 requests, their method was capable of differentiating human users from web robots with 90% accuracy.

However, assigning ambiguous sessions to the non-robot types to simplify the model can result in misclassification, and the authors did not investigate the existence of web spambots that mimic human users' behaviour.

Another early-detection based method focusing on spambot navigational patterns was implemented by (Park et al. 2006). The authors proposed a malicious robot detection method employing both rule-based and machine learning algorithms. For the former, the main assumption is that robots typically do not exhibit any mouse-movement activity while interacting with websites, and do not request specific web objects (e.g. style sheets, Flash files, etc.). For the latter, the authors trained machine learning algorithms on 12 features (e.g. the number of HEAD requests in HTTP, the number of HTML requests, the number of CGI requests, etc.).

The authors experimented on 929,922 sessions. They discovered that 24.2% of the total sessions belonged to humans, having either mouse-movement activity or a request for the style sheet file. However, the percentage of sessions with mouse-movement activity alone was 22.3%, implying 1.9% false positives in the result. The authors explained that this might be related to Internet configuration settings on users' computers that disallow the execution of specific scripts. The authors labelled 167,246 sessions, consisting of 42,975 human and 124,271 robot sessions, for use with a machine learning algorithm. The result showed 95% classification accuracy after 160 requests.

The proposed method needs a significant number of requests in order to achieve reasonable classification accuracy. Hence, it might not be applicable in real-world practice.

Table 2.5 presents a comparative summary of early-detection based Spam 2.0 filtering methods.

Table 2.5: Comparative summary of early-detection based Spam 2.0 filtering methods.

Method | Platform | Classifier | Dataset | Accuracy | Extendable to other platforms?
(Tan et al. 2002) | Websites | C4.5 | 1M web server logs | Best 0.9 Accuracy | Yes
(Park et al. 2006) | Websites | 8 classifiers | 16K sessions in the web server logs | Avg. 0.95 Accuracy | Yes

Overall, early detection-based methods did not fully consider spam user (e.g. spambot) identification; they were designed to identify web crawlers in general. Additionally, they are platform-dependent and cannot be extended to different platforms.

Given the Spam 2.0 filtering methods discussed in previous sections, the next section provides a comprehensive evaluation of the current state of the art Spam 2.0 filtering methods.


2.3.5 Evaluation

It was clear from the literature on spam filtering that, once a web platform increases in popularity, it

attracts malicious users and spam content. The area of Spam 2.0 filtering is very young but recently it

has received much attention from the research community. The comprehensive survey of the state of

the art Spam 2.0 filtering methods has been discussed in the previous sections. This section gives a

critical evaluation of such methods to point out the gaps and problems in the Spam 2.0 literature.

Table 2.6 presents an evaluation of current Spam 2.0 filtering methods.

The main drawback of the majority of Spam 2.0 filtering methods is that they depend on one specific

platform. For example, many Spam 2.0 filtering methods have been proposed for Twitter. However,

they are not extendable to the other platforms such as online discussion boards, blogs, wikis etc. This

is due to the nature of the features, which are heavily platform-dependent (e.g. the ratio of hashtags to tweet content, the number of tweets per user, etc.).

The other problem associated with the majority of Spam 2.0 filtering methods is that they are mainly

content-based. Previously, it has been shown that content-based methods can be easily overcome by

spammers. Additionally, the content-based methods are language-dependent and might not be

extendable to other languages.

The majority of detection-based methods are afterthoughts: only once spam content has already infiltrated the platform can such methods be applied to discover spam patterns. Detection-based methods also require more resources in terms of CPU and time.

Prevention-based Spam 2.0 filtering methods, such as CAPTCHA, inconvenience end users and waste

their time, cause distraction, are unpleasant and at times intimidating. A number of recent works have

reported approaches to defeat CAPTCHAs automatically by using computer programs (Abram et al.

2008). CAPTCHA’s drawbacks include the following:

• Decreases user convenience and increases complexity of human computer interaction.

• As programs become better at deciphering CAPTCHA, the image may become increasingly difficult for humans to decipher.

• As computers become more powerful, they will be able to decipher CAPTCHA better than humans can.


Therefore, CAPTCHA is a short-term solution that may not prove effective in the future. Spambots armed with anti-CAPTCHA tools are able to bypass this restriction and spread spam content easily, while the inconvenience to genuine users remains.

Finally, early-detection based Spam 2.0 filtering methods have not comprehensively considered current trends in Spam 2.0 and spambots. To date, spambots can mimic human user behaviour; however, the main intention of the early-detection based methods was to identify web robots in general, not spambots specifically.

Table 2.6: Evaluation of state of the art Spam 2.0 filtering methods*.

Method | Comment (personal message) | Forum | Wiki | Social Net (profile, targeted) | Content-based detection | Extendable | On-the-fly
(Tan et al. 2002) | ? | ? | ? | ? | - | × | ×
(von Ahn et al. 2003) | ? | ? | ? | ? | - | × | ×
(Mertz 2004) | ? | ? | ? | ? | × | × | -
(Park et al. 2006) | ? | ? | ? | ? | - | × | ×
(Zinman et al. 2007) | - | - | - | × | × | - | -
(Narisawa et al. 2007) | - | × | - | - | × | ? | -
(Nitin et al. 2008) | - | - | - | × | × | - | -
(Uemura et al. 2008) | × | × | - | - | × | ? | -
(Bogers et al. 2008) | - | - | - | × | × | - | -
(Krause et al. 2008) | - | - | - | × | × | - | -
(Koutrika et al. 2008) | - | - | - | × | × | - | -
(Benevenuto et al. 2009) | - | - | - | × | × | - | -
(Liu et al. 2009) | - | - | - | × | × | - | -
(Dellschaft et al. 2010) | - | - | - | × | × | - | -
(Lim et al. 2010) | × | - | - | - | × | - | -
(Irani et al. 2010) | - | - | - | × | × | - | -
(Benevenuto et al. 2010) | - | - | - | × | × | - | -
(Wang 2010) | - | - | - | × | × | - | -
(Lee et al. 2010) | - | - | - | × | × | - | -
(Gao et al. 2010) | - | - | - | × | × | - | -
(Huang et al. 2010) | × | - | - | - | × | - | -
(Grier et al. 2010) | ? | ? | ? | × | - | ? | ?

* Cross sign (×): the method can counter the web spam tactic (the authors have specifically mentioned this in their study). Question mark (?): the method might counter the web spam tactic (the authors have not mentioned this in their study).

Figure 2.7 presents the coverage of Spam 2.0 filtering methods based on different evaluation criteria

in this survey. It is clear from the figure that most of the efforts are on Spam 2.0 detection in social

networking systems and the majority of the methods are content-based.


Figure 2.7: The coverage of the Spam 2.0 filtering methods based on different evaluation criteria.

Figure 2.8 illustrates the number of Spam 2.0 filtering methods based on each technical approach in this survey. It is clear from the figure that the majority of the studies use detection-based approaches.

Figure 2.8: The number of Spam 2.0 filtering methods based on different technical approaches.

The overall issues that exist in the current Spam 2.0 filtering methods are as follows:

1. Methods are platform-dependent. They cannot be extended to other platforms.

2. Detection-based methods are mainly content-based and can be bypassed by spammers.

3. Content-based methods are language-dependent. Hence, a solution might not be applicable to different languages.

4. Prevention-based methods are becoming obsolete as computers become more powerful. Studies show that state-of-the-art prevention methods such as CAPTCHA are weak against current spam tactics.


5. Early detection-based methods did not fully consider spambot identification; rather, they have been designed to identify web robots in general.

Spam 2.0 filtering is the main focus of this research, which proposes a new approach for the elimination of Spam 2.0, discussed in the following chapters. The next section concludes this chapter.

2.4 Conclusion

In this chapter an overview of the relevant literature and the current efforts in the spam filtering field

has been presented, together with an investigation and evaluation of the most successful anti-spam

methods. A survey was conducted of spam filtering efforts on the web and Web 2.0 platforms

including detection-based (i.e. content-based, link-based and rule-based), prevention-based and early-detection based approaches. Although this research presents a novel Spam 2.0 concept that has not been investigated previously, approaches used against other related types of spam (i.e. web spam) are also examined for the purposes of contrast and to motivate this work.

Through study and evaluation of existing spam filtering approaches in the web spam area, it has

become clear that state of the art methods focus on providing a more robust web page ranking

algorithm or identification of web spam tactics in the search results. Technically, the methods are

mainly based on link- and detection-based approaches.

This dissertation focuses on the comprehensive study of Spam 2.0 in terms of several web spam

concerns as well as new problems. A comprehensive survey of the state of the art web spam and Spam

2.0 filtering methods that have been examined in this chapter is one of the main contributions of this

study to highlight the unresolved issues and at the same time effectively capture the knowledge in the

domain of spam filtering.


Chapter 3

Problem Definition

This chapter covers

► A formal definition for spam filtering categories.

► Problems associated with Spam 2.0 filtering approaches.

► The research issues that need to be addressed.

► The research methodology that will be adopted in this research to systematically address the

identified research issues.

3.1 Introduction

Spam filtering has been a challenging problem on the web. Search indexing algorithms need to prune spam web pages from their indexes, website operators need to eliminate spam content (e.g. comment spam) from their websites, and web developers need to be aware of the latest spamming tactics in order to ensure their software is robust against spam attacks.

A comprehensive survey and detailed evaluation of the state-of-the-art spam filtering methods in web spam and Spam 2.0 were discussed in Chapter 2. Although there has been considerable research in the spam filtering area, no single approach has addressed the problems produced by Spam 2.0.

This chapter outlines the main problems and challenges in Spam 2.0 detection, prevention and early detection approaches. It provides the problem definition for Spam 2.0 filtering categories. It then follows with a formal description of the research issues that will be addressed in this dissertation. The chapter concludes with a brief discussion of research methodologies and a summary of the problem definition.


3.2 Problem Definition

The problems and challenges associated with Spam 2.0 filtering can be grouped into three categories as follows:

• Problems with detection-based Spam 2.0 filtering methods.

• Problems with prevention-based Spam 2.0 filtering methods.

• Problems with early-detection based Spam 2.0 filtering methods.

For each category, a formal definition is given and the technical concerns are discussed. The formal

definition is used to define the technical problem. The problems associated with the technical

concerns will form the research issues for the development of the new solution.

3.2.1 Problems with detection-based Spam 2.0 filtering methods

3.2.1.1 Formal definition

The detection-based approach strives to discover spam patterns inside the content of user-submitted data. In Spam 2.0, the content can be in the form of a comment, a forum post, a wiki page, a social profile, etc. Content is represented as a set of characteristic values or features (F), as in Eq. 3.1:

F = \{ f_1, f_2, \ldots, f_n \}    (3.1)

where f_i is the value of the i-th feature associated with the content. For example, in online discussion boards, f_i can represent the existence of a given term inside the forum post: f_i = 1 if the i-th term exists in the forum post, otherwise f_i = 0. In this case, F is a Boolean content vector. The process of representing content in the form of F is also known as feature extraction.

A binary decision function φ is defined over F for classification as follows:

\varphi : F \rightarrow C    (3.2)

where C is the set of content classes:

C = \{ c_l, c_s \}    (3.3)

and c_l refers to legitimate content while c_s refers to spam.
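A toy illustration of Eqs. 3.1-3.3, assuming a tiny hypothetical term vocabulary; the decision function here is a trivial placeholder for whatever trained classifier a real method would use.

```python
VOCABULARY = ["cheap", "pharmacy", "meeting", "agenda"]   # hypothetical terms

def extract_features(post):
    """Map content to the Boolean vector F of Eq. 3.1."""
    words = set(post.lower().split())
    return [1 if term in words else 0 for term in VOCABULARY]

def phi(F):
    """Placeholder decision function (Eq. 3.2): flag posts containing either
    of the first two, spam-indicative, terms."""
    return "c_s" if F[0] or F[1] else "c_l"

print(phi(extract_features("Cheap pharmacy deals here")))   # -> c_s
```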


The overall process in the detection-based Spam 2.0 filtering methods is illustrated in Figure 3.1.

Figure 3.1: Overall process in the detection-based Spam 2.0 filtering methods.

3.2.1.2 Technical concerns

The detection-based Spam 2.0 filtering methods seek spam patterns within user-submitted content in Web 2.0 platforms. The main technical drawbacks of these approaches are:

1. Passive solutions. Detection-based Spam 2.0 filtering methods are afterthought approaches. That is, once the spam content has already infiltrated Web 2.0 platforms, a detection-based

method can be implemented to distinguish spam from genuine content i.e. cl from cs.

2. Consume network and storage resources. Spam consumes network bandwidth when distributed to Web 2.0 platforms, wasting the limited bandwidth available to websites. Additionally, storing spam content wastes storage space and incurs monetary costs.

3. Inability to automatically adapt to new spam patterns. The majority of detection-based approaches are supervised (ref. Section 2.3.5). There is a need to constantly update spam

pattern datasets to address new spam patterns. Maintenance and storage of such datasets is

expensive in terms of time and money.

4. Inability to adapt with different platforms and languages. Most of the methods are designed to identify spam inside specific Web 2.0 platforms (ref. Section 2.3.5). Hence, the

features (F) are platform-dependent and currently there is no single solution that can be

applied to multiple Web 2.0 platforms. Moreover, the majority of detection-based methods

are content-dependent. Hence, they might be useful for a particular language (e.g. English)

but cannot be extended to identify spam in other languages.

The detection-based Spam 2.0 filtering methods are still burdened by the above issues, and the weak classification results of such methods show that no method to date has addressed all of these problems.


3.2.2 Problems with prevention-based Spam 2.0 filtering methods

3.2.2.1 Formal Definition

The prevention-based Spam 2.0 filtering methods impede various functionalities of the system in order to keep malicious users out. The majority of state-of-the-art prevention-based methods pose time and computational challenges. Some may also pose economic challenges, requiring online payments to accomplish various tasks in the system.

A user (client) who wants to perform an action in the system (e.g. register an account) needs to

overcome a challenge that is issued by the server. The server can quickly and easily verify the client’s

solution to the challenge and may grant permission to the client to perform the action.

The server uses the function CHAL() to issue a challenge C. The client produces a solution S for C using the function SOL(). The server validates S through the function VER() and, if the output has the required value V, allows the client to complete the action. Eq. 3.4 illustrates the formal procedure of the prevention-based methods:

C = \mathrm{CHAL}(), \quad S = \mathrm{SOL}(C), \quad V = \mathrm{VER}(C, S)    (3.4)

In the prevention-based Spam 2.0 filtering methods, C is either a human-solvable challenge that can be met only by humans (e.g. CAPTCHA), or a computational challenge which requires computational resources in order to be met (e.g. POW).
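The round-trip of Eq. 3.4 can be sketched with a toy computational (POW-style) challenge; all function bodies and the difficulty target below are illustrative assumptions, not a real deployment.

```python
import hashlib
import os

TARGET = 2 ** 240   # digest must fall below this (roughly 16 bits of work)

def CHAL():
    """Server issues a random challenge C."""
    return os.urandom(8).hex()

def SOL(C):
    """Client searches for a solution S (a nonce) meeting the target."""
    nonce = 0
    while int(hashlib.sha256(f"{C}:{nonce}".encode()).hexdigest(), 16) >= TARGET:
        nonce += 1
    return nonce

def VER(C, S):
    """Server verifies with a single hash; V is True when S is valid."""
    return int(hashlib.sha256(f"{C}:{S}".encode()).hexdigest(), 16) < TARGET

C = CHAL()
S = SOL(C)
assert VER(C, S)    # V has the required value: the action is permitted
```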

The process followed by the prevention-based Spam 2.0 filtering methods is presented in Figure 3.2.


Figure 3.2: The process followed by prevention-based Spam 2.0 filtering methods.

3.2.2.2 Technical concerns

The main technical problems associated with prevention-based Spam 2.0 filtering methods are:

1. Weak/complex challenges result in ineffective Spam 2.0 filtering. The goal of CAPTCHA is to allow humans to pass while preventing automated systems from gaining access. It is

important to consider the types of challenges posed by CAPTCHA. Complex challenges

inconvenience the users as they waste their time, cause distraction, are unpleasant and, at

times, intimidating. Visual and textual CAPTCHA challenges pose major problems for human

users with specific disabilities such as vision impairment (May 2005). Some challenges might

also restrict them from accessing the system.

2. Computer programs are becoming better at processing CAPTCHA. As computers become more powerful, they will be able to decipher CAPTCHA better than humans can. A number of recent works have reported approaches to defeat CAPTCHAs automatically by using computer programs (Hindle et al. 2008; Chellapilla and Simard 2004). The latest version of CAPTCHA, known as reCAPTCHA (von Ahn et al. 2008), has also been bypassed by computer programs (Houck 2010).

3. Computer programs are becoming more proficient at overcoming POW challenges. By accessing a network of compromised computers (i.e. a botnet), an adversary can gain access to a significant pool of computational resources with which to compute various POW challenges. Hence, the current computationally expensive challenges might not be effective for spam filtering in the near future (ref. Section 2.3.4.2).

4. Current prevention-based methods are vulnerable to relay attacks. In a relay attack, adversaries use humans to overcome a challenge. For example, a spammer can hire someone

to defeat CAPTCHA and bypass the restrictions (Workathome 2009).

5. Blacklists are too slow to protect Web 2.0 platforms. Studies (Grier et al. 2010) show that most of the spam hosts are blacklisted 2-40 days after their appearance. Since 90% of clicks

occur in the first two days of spam distribution, it can be concluded that blacklists are too

slow to offer effective protection against spam.

6. Blacklists are not effective at Spam 2.0 filtering. Spam URLs evade detection by using URL shorteners. Additionally, the maintenance of blacklists is time-consuming and labour

intensive. Hence, blacklists are not effective enough to protect Web 2.0 platforms against

Spam 2.0.

The state of the art prevention-based Spam 2.0 filtering methods are still subject to the

abovementioned technical concerns and to date no study has been able to resolve all of these issues.

3.2.3 Problems with early-detection based Spam 2.0 filtering methods

3.2.3.1 Formal Definition

Similar to the detection-based Spam 2.0 filtering methods, early-detection based Spam 2.0 filtering identifies spam generation behaviour. Methods in this category are not detection-based, i.e. they do not search for spam patterns inside the content; nor are they prevention-based, i.e. they do not pose challenges to clients.

Typically, such early-detection based spam filtering methods are capable of detecting spam by investigating the behavioural aspects of spam content generation. This behaviour is represented as a set of features (F), as in Eq. 3.5:

F = \{ f_1, f_2, \ldots, f_n \}    (3.5)


where f_i is the value of a feature associated with the behaviour. For example, in an online discussion board, f_i can represent the total amount of time a user spent generating a reply to a thread, or the existence of mouse activity while the user was navigating the web platform.

Similar to the detection-based methods, a binary decision function φ is defined over F:

\varphi : F \rightarrow C    (3.6)

where C is the set of behaviour classes:

C = \{ c_l, c_s \}    (3.7)

and c_l refers to legitimate behaviour while c_s refers to spam behaviour.
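A small sketch of Eqs. 3.5-3.7 for an online discussion board follows, with a hypothetical session record and an illustrative threshold-based placeholder for φ.

```python
def behaviour_features(session):
    """Map one user session to a behavioural feature vector F (Eq. 3.5).
    The session dict fields are hypothetical simplifications."""
    return [
        session["seconds_to_compose_reply"],       # time spent writing
        1 if session["mouse_events"] > 0 else 0,   # any mouse activity?
        session["requests_per_minute"],            # navigation speed
    ]

def phi(F, max_speed=30):
    """Placeholder decision function (Eq. 3.6): mouse-less, very fast
    sessions are labelled spam behaviour; thresholds are illustrative."""
    return "c_s" if F[1] == 0 and F[2] > max_speed else "c_l"

print(phi(behaviour_features(
    {"seconds_to_compose_reply": 2, "mouse_events": 0,
     "requests_per_minute": 90})))                 # -> c_s
```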

The process used by early-detection-based Spam 2.0 filtering methods is illustrated in Figure 3.3.

Figure 3.3: The process used by early-detection-based Spam 2.0 filtering methods.

3.2.3.2 Technical concerns

The technical concerns associated with early-detection based Spam 2.0 filtering methods are as follows.

1. Inability to identify Web 2.0 spam users. Existing early-detection methods are not capable of identifying web spam users (e.g. spambots) and there is no study in the literature that

addresses the detection and prevention of web spambots.

2. Machine learning based methods are not effective for real-time detection. Machine learning based solutions require a considerable amount of time and computational resources

(Sculley and Gabriel 2007). Therefore, they might not be suitable for spontaneous spam user

identification on the web.


3. The nature of Spam 2.0 is not well understood. As discussed in Chapters 1 and 2, no study has yet focused on characterising the new generation of spam in Web 2.0 platforms. Hence, Spam 2.0 characteristics are not well enough defined for effective filtering approaches.

The above list presents issues and weaknesses of the current early-detection based Spam 2.0 filtering

methods. To date, no study has been conducted to resolve all of these issues.

3.3 Research Issues

Given the problem definition presented in the above sections and the evaluation of the state-of-the-art Spam 2.0 filtering methods discussed in Chapter 2, this section discusses the research issues addressed in this study. From Chapter 4 onwards, solutions that address these research issues and the research objectives (Chapter 1) will be pursued.

3.3.1 Research Issue 1: The nature of Spam 2.0 is not well understood

Based on the survey of current Spam 2.0 filtering methods that have been discussed in Chapter 2 and

problem definitions for Spam 2.0 filtering methods, it is clear that Spam 2.0 is not well understood.

The literature contains no general definitions of such spam. The majority of the methods are directed

at applications of existing techniques as used in other spam filtering domains (e.g. email/webspam).

They are not effective enough to identify and combat Spam 2.0.

In order to address this problem, a number of research questions need to be investigated:

1. What are the differences between Spam 2.0 and other types of spam? What are the aims of Spam 2.0 distribution? How is Spam 2.0 created?

2. What is the nature of Spam 2.0 generation? What tools do spammers use to generate and distribute Spam 2.0 content? Is Spam 2.0 distributed automatically or manually?

3. What tactics do spammers employ to find and choose their targets (Web 2.0 platform)?

4. How effective are current Spam 2.0 filtering methods in identifying and blocking spammers? How do spammers bypass current detection, prevention and early-detection based Spam 2.0

filtering methods?


5. What are the main topics of Spam 2.0-related content used by spam users?

6. What are the main characteristics of Spam 2.0 distribution? How are these different from real human user behaviour?

7. What possible strategies or approaches can be used to block Spam 2.0?

By creating a platform to monitor Spam 2.0 generation and archive Spam 2.0 content, this research addresses all the above questions. This platform is discussed in Chapters 5 and 6.

3.3.2 Research Issue 2: Existing Spam 2.0 filtering solutions are afterthoughts

As discussed in Chapter 2, current Spam 2.0 filtering methods are mainly detection-based. The majority of detection-based approaches are not capable of eliminating spam content before its submission. Therefore, the main technical problem associated with current methods is that they are passive and afterthought solutions.

In order to address this issue, the following questions need to be addressed.

1. How can a spam propagator be identified before content submission?

2. How can an appropriate approach be selected to detect Spam 2.0 in a more proactive way?

In this research, all the above questions have been addressed in the design of the proposed Spam 2.0 filtering method. The conceptual framework for this design is presented in Chapter 7.

3.3.3 Research Issue 3: Platform and language-dependent approaches

Existing Spam 2.0 filtering approaches are either platform-dependent or language-dependent. They might be effective only in particular platforms or languages. Therefore, the main technical problem here is that current methods are platform and language dependent.

In order to address this issue, the following questions need to be addressed.

1. What approaches are required to make spam filters language-independent?

2. Which features are language-dependent?


3. How can a spam filter be extended to multiple platforms?

4. What discriminative features need to be investigated that are common among the majority of Web 2.0 platforms for Spam 2.0 filtering?

In this research, all the above questions have been addressed in the design of the proposed Spam 2.0

filtering method. The conceptual framework for this design is presented in Chapter 7.

3.3.4 Research Issue 4: Current methods waste network and storage resources

As mentioned in Research Issue 2, existing methods are not capable of eliminating spam content

before its submission. Therefore, spam content is transmitted and hosted on a web platform. This

results in network and storage wastage. To address this wastage issue, the following questions need to

be considered.

1. How can spam content be identified before its submission on a web platform?

2. What other features can assist the minimisation of resource wastage?

In this research, all the above questions have been addressed in design of the proposed Spam 2.0

filtering method. The conceptual framework for this design is presented in Chapter 7.

3.3.5 Research Issue 5: Prevention-based methods pose inconvenience to human users

A survey of prevention-based Spam 2.0 filtering methods was presented in Chapter 2. Based on the problem definition associated with such methods, the following main technical problems have been addressed in this research.

• Weak/complex challenges result in ineffective Spam 2.0 filtering.

• Computer programs are becoming better at processing CAPTCHA.

• Computer programs are becoming proficient at overcoming POW challenges.

• Current prevention-based methods are susceptible to relay attacks.


The above technical problems give rise to a number of research questions. In order to overcome the problems, the following research questions need to be considered.

1. What factors are important in slowing/blocking spam distribution? Do these factors limit real user experience as well? How can spam be kept out of the system without limiting the

genuine user experience?

2. How can spammers be explicitly blocked from the system while letting genuine users enter it?

3. In order to avoid the limitations of prevention-based methods (e.g. CAPTCHA) for disabled people, what approach needs to be investigated?

The above research questions need to be addressed in order to build an effective Spam 2.0 filter. The conceptual framework for this design is presented in Chapters 7 and 8.

3.3.6 Research Issue 6: Inability to identify Web 2.0 spam users

Existing early-detection based methods that have been discussed in Chapter 2 aim to identify automated requests. For example, such methods have been designed to detect web robots (e.g. link checkers) activities on the web platform. Although malicious web robots are employed by spammers to distribute Spam 2.0 content, existing early-detection based methods are not specific and effective enough to identify web spam users.

In order to address this issue, the following research questions need to be considered.

1. What are spam users? What do spammers use as tools (e.g. spambots)? How do spam users navigate the Web 2.0 platforms?

2. What do spam users aim to achieve?

3. What are the key features of spam users' behaviour? What are the differences between spam and genuine users’ behaviour?

In this research, all the above questions have been addressed in the design of the proposed Spam 2.0 filtering method. The conceptual framework for this design is presented in Chapters 7 and 8.


3.3.7 Research Issue 7: Machine learning based methods are not effective for real-time detection

Machine learning algorithms need a considerable amount of time and computational resources

(Sculley and Gabriel 2007). Therefore, they are not effective for spontaneous classification tasks in

the spam filtering process due to the rapid rate of data generation on Web 2.0 platforms and the need

to identify spam content quickly. In order to address this issue, the following research questions need

to be considered.

1. How can the generated data be modelled so that it is suitable for online scenarios where the incidence of data generation is high?

2. What classifier can perform well in on-the-fly situations?

To address the problem of existing early-detection methods for the identification of spambots, the

above research questions need to be addressed. The conceptual framework for this design is presented

in Chapters 7 and 8.

The next section describes the research methodology that will be adopted for this research.

3.4 Research Methodology

A science and engineering-based research approach will be adopted for this research. Science and

engineering research leads to the development of new techniques, architecture, methodologies,

devices or a set of concepts, which can be combined to form a new theoretical framework. This

research approach commonly identifies problems and proposes solutions to these problems.

The science and engineering approach that has been utilised in this research consists of: problem

definition, conceptual solution, implementation, experimentation, testing and validation of prototypes

against existing solutions. It consists of three main stages:

• Problem definition

• Conceptual solution

• Implementation, testing and evaluation


3.4.1 Problem definition

In the problem definition stage, the aim is to highlight the significance of the research questions; this stage has been covered in this chapter. Problems for Spam 2.0 filtering methods have been grouped into three categories: detection-based, prevention-based, and early-detection based. For each category, the discussion is carried out from three perspectives: a formal definition of the category, the socio-economic concerns, and the technical concerns. The problems associated with the technical concerns led to the research issues for the new solution development.

3.4.2 Conceptual solution

The conceptual solution focuses on formulating a new method and approach through the design and building of tools, an environment or system through implementation. In this stage, a conceptual framework is designed for the proposed solution. A conceptual framework is an abstract model of the practical solution. It provides a road map for building the actual solution for the system and can be regarded as a blueprint for the whole system.

3.4.3 Implementation, test and evaluation

In this stage, testing and validation are carried out through experimentation with real-world examples and field testing. The process of testing and validating a working system provides unique insights into the benefits of the proposed concepts, frameworks and alternatives.

By building a prototype system, implementing, testing and evaluating, a better insight into the feasibility and functionality of the conceptual framework as well as the whole solution, is provided.

In this dissertation, Chapter 4 provides a conceptual framework for the proposed solution, and Chapters 5 to 8 carry out the implementation and experimentation of the conceptual framework, along with the testing of the proposed solutions in real-world settings and comparisons of the proposed methods against existing practical solutions.

3.5 Conclusion

This chapter provides a problem definition for existing Spam 2.0 filtering methods and approaches.

Based on the socio-economic and technical problems of existing solutions, seven research issues have been defined. For each research issue, a number of research questions have been proposed; these research questions need to be addressed in the development of any new Spam 2.0 filtering method. To address the research issues, three solutions have been proposed. Furthermore, the research methodology for this research has been discussed.

In the next chapter, an overview of the proposed solution along with its conceptual framework will be

provided. The conceptual framework is designed to address all the issues that have been discussed in

this chapter.


Chapter 4

Overview of Solution and Conceptual

Process

This chapter presents

► an overview of the proposed solutions

► a conceptual framework of the proposed solutions

► a conceptual process adopted in the development of the proposed solutions

4.1 Introduction

As outlined in Chapter 2, a number of works have addressed the issue of Spam 2.0 filtering. Existing Spam 2.0 filtering methods strive to identify spam content inside Web 2.0 platforms. However, this area of research is quite young and it is evident that, to date, no research has resolved the Spam 2.0 problem entirely. Chapter 3 presented seven major research issues pertaining to the existing solutions that are addressed in this research.

This chapter provides an overview of the solutions to each of the research issues discussed in Section

3.3.

4.2 Overview of the Solution

In this section, three solutions are briefly introduced to address the research issues discussed in Section 3.3. These three solutions are:

1. HoneySpam 2.0.

2. Early detection-based Spam 2.0 filtering method (EDSF)

3. On-the-Fly Spam 2.0 filtering method (OFSF)


Each of these proposed methods is discussed below.

A framework named HoneySpam 2.0 is proposed in Chapter 5 to monitor and store spam users’

activity on Web 2.0 platforms. This solution is proposed to address the following:

• Research Issue 1: The nature of Spam 2.0 is not well understood

HoneySpam 2.0 is capable of tracking and archiving current Spam 2.0 web navigational behaviour

and its associated submitted content on active Web 2.0 platforms. The outcome of this investigation

provides a better insight into the nature of Spam 2.0 for future Spam 2.0 filtering approaches.

Based on initial tracking results from HoneySpam 2.0, Chapter 6 provides a framework to investigate, formulate and cluster spam user behaviour models. This is achieved by discovering previously unknown spammer behaviour on Web 2.0 platforms, understanding spammers' targets, formulating spammers' behaviour in order to develop discriminative feature sets, and clustering spam users' activities based on similarity.

The Spam 2.0 filtering scheme named Early-Detection based Spam 2.0 Filtering (EDSF) is proposed

in Chapter 7 to identify and eliminate spam users from Web 2.0 platforms. This solution is proposed

to address the following issues:

• Research Issue 2: Existing Spam 2.0 filtering solutions are afterthoughts

• Research Issue 3: Platform and language-dependent approaches

• Research Issue 4: Current methods waste network and storage resources

• Research Issue 5: Prevention-based methods pose inconvenience for human users

• Research Issue 6: Inability to identify Web 2.0 spam users

EDSF implicitly investigates Spam 2.0 content generation and spam users on any Web 2.0 platform;

hence, it does not pose inconvenience for human users. The proposed solution minimises resource

wastage by preventing submission of spam content and it is language and platform independent.

The EDSF scheme is then extended in Chapter 8 where an On-the-Fly Spam 2.0 Filtering (OFSF)

scheme is presented. OFSF improves the practicality, effectiveness and robustness of the EDSF. This

solution is proposed to address the following issue:


• Research Issue 7: Machine learning-based methods are not effective for real-time detection

The idea is to propose a Spam 2.0 filtering method that can spontaneously detect spam users. EDSF uses a machine learning classifier, which is not effective for the spontaneous detection of spam behaviour. OFSF, by contrast, is capable of on-the-fly detection of spam behaviour: it monitors users' behaviour as they use the system, using a novel rule-based classifier to store, retrieve, and classify behaviours, separating spam users from genuine human ones.

Figure 4.1: Overview of the conceptual solution.

Figure 4.1 shows how the dissertation is organised from this chapter onwards. The conceptual solution is proposed to address the four main problems concerning Spam 2.0 filtering, which were outlined in Section 3.2. The figure also shows how each of these problems is addressed in the respective chapters. In the next section, the details of each proposed solution are discussed.

4.3 Solution Description

This section provides an overview of the solutions developed to address the issues highlighted earlier in this chapter. The three solutions presented in this dissertation are:

1. Characterising Spam 2.0 and Spam 2.0 distribution model.

2. Early-Detection based Spam 2.0 Filtering (EDSF) method to implicitly detect spam users, be language and platform independent and minimise user inconvenience.

3. On-the-Fly Spam 2.0 Filtering (OFSF) method to spontaneously identify and prevent spam users.


Each of the above solutions is explained in the following sections.

4.3.1 Characterising Spam 2.0 and the Spam 2.0 distribution model

HoneySpam 2.0 is a monitoring tool to track, log and archive spam user activities on Web 2.0 platforms. It can be embedded inside any web application to monitor Spam 2.0 activities.

HoneySpam 2.0 is capable of storing and logging all information related to spam and non-spam users

on Web 2.0 platforms such as web usage data, user-profile related data, submitted content, etc. Each

part of this information provides a better insight into the way Spam 2.0 works. Web usage data can be

employed to find out the origin of Spam 2.0, demographic information, timestamps of different web

requests, etc. Web usage data can also be used to analyse the behaviour of Spam 2.0 distribution.

Additionally, it provides a model for human content contribution as well.

Submitted content contains information related to the contributed content on the Web 2.0 platform.

Examples of such submitted content include: a comment, a reply to a discussion or a wiki page. This

information can be used to analyse the nature of submitted content for Spam 2.0. Chapter 5 is focused

on HoneySpam 2.0 and its conceptual model.

Figure 4.2: HoneySpam 2.0 simple deployment network diagram.

Figure 4.2 illustrates a simple deployment network diagram of HoneySpam 2.0.

Based on the initial result of HoneySpam 2.0, a more in-depth analysis of Spam 2.0 is presented in

Chapter 6 including:

• Discovering previously unobserved spammer behaviour on Web 2.0 platforms


• Understanding spammers’ intentions on Web 2.0 platforms

• Formulating spammers’ behaviour in order to develop discriminative feature sets

• Clustering spammers' activities into groups based on feature sets

• Characterising spammers’ behaviour in each cluster

• Developing a strategy to utilise feature sets for Spam 2.0 filtering

Chapter 6 presents a framework to investigate, formulate, and cluster Spam 2.0 behaviours. Three feature sets – Action Set, Action Time, and Action Frequency – are presented to formulate spammers' behaviour on Web 2.0 platforms. A neural network clustering mechanism is used to cluster spammers' behaviour based on these three feature sets.

HoneySpam 2.0, along with the more in-depth analysis of Spam 2.0 in Chapter 6, addresses the following research issue:

• Research Issue 1: The nature of Spam 2.0 is not well understood

An Action is a set of user-requested web objects used to perform a certain task or purpose. For instance, a user can navigate to the registration page of an online discussion board, fill in the required fields and press the 'submit' button in order to register a new user account. This procedure can be modelled as the "Registering a user account" Action. Action is explained in Section 6.2.2.
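To make the notion concrete, the following minimal PHP sketch maps a session's requested pages to a named Action; the page names, the pattern table and the helper formulateAction are hypothetical illustrations, not part of the thesis prototype.

<?php
// Hypothetical lookup table: a known sequence of requested pages -> a named Action.
$actionPatterns = [
    'register.php,register_submit.php' => 'Registering a user account',
    'login.php,login_submit.php'       => 'Logging in',
    'post.php,post_submit.php'         => 'Submitting a post',
];

// Map a session's requested pages to a named Action, if the sequence is known.
function formulateAction(array $requestedPages, array $patterns): string
{
    $key = implode(',', $requestedPages);
    return $patterns[$key] ?? 'Unknown Action';
}

// Example: the registration procedure described above.
echo formulateAction(['register.php', 'register_submit.php'], $actionPatterns);
// Prints: Registering a user account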

4.3.2 Early-Detection based Spam 2.0 Filtering (EDSF)

A novel Early-Detection based Spam 2.0 Filtering (EDSF) method is proposed to filter out Spam 2.0 from Web 2.0 platforms. The EDSF is language-independent since it does not use any language-dependent features (e.g. content features), and it can be extended to multiple platforms since it exploits behavioural features that are common across Web 2.0 platforms. Additionally, it minimises network and storage resource usage by preventing the submission of Spam 2.0 and blocking spammers' activity inside the system.


The proposed method addresses the following issues.

• Research Issue 2: Existing Spam 2.0 filtering solutions are afterthoughts

• Research Issue 3: Platform and language-dependent approaches

• Research Issue 4: Current methods waste network and storage resources

• Research Issue 5: Prevention-based methods pose inconvenience for human users

• Research Issue 6: Inability to identify Web 2.0 spam users

The fundamental assumption in the EDSF is that spammers behave differently from genuine users within Web 2.0 platforms. Spammers develop or make use of technologies such as auto-submitters, spambots, etc. (Hayati et al. 2010) to assist them in the fast distribution of spam content. As a result, spam users' web usage data can be expected to differ from that of genuine users.

EDSF makes use of web usage navigation behaviour to build up a discriminative set of features

including Action Set, Action Time, and Action Frequency. Such a feature set is used later in a

machine learning classifier known as Support Vector Machine (SVM) for classification.

Given that there is no work in the literature on Spam 2.0 behaviour classification, the development of EDSF is an initial step towards the identification of Spam 2.0 by utilising web usage behaviour.

Figure 4.3 presents the overview process of the EDSF:

• Step 1: Web usage data for the incoming traffic (e.g. user navigation through website) is implicitly gathered.

• Step 2: Based on gathered web usage data, Actions are formulated i.e. Action Set, Action Time and Action Frequency.

• Step 3: Actions are grouped into two sets: Spam Actions and Genuine Actions. The former is a representation of spam behaviour while the latter is a representation of genuine human user

behaviour.


• Step 4: A machine learning classifier (SVM) is trained on a set of known Spam and Genuine Actions. The classifier will be used to identify spam Actions in new incoming

traffic. Users who perform spam Actions are marked as spammers and blocked from the

system.
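As an illustration of Step 4, the following minimal PHP sketch applies the decision function of an already-trained linear SVM, f(x) = w · x + b, to a behavioural feature vector; the feature layout, weights and bias are hypothetical placeholders rather than values from the experiments reported later.

<?php
// Hypothetical feature vector for one user session:
// [number of distinct Actions, average Action Time (s), Action Frequency per hour]
$features = [2.0, 1.5, 40.0];

// Hypothetical weights and bias of an already-trained linear SVM.
$weights = [-0.8, -1.2, 0.15];
$bias    = -0.5;

// Linear SVM decision function: f(x) = w . x + b.
function svmDecision(array $x, array $w, float $b): float
{
    $sum = $b;
    foreach ($x as $i => $value) {
        $sum += $w[$i] * $value;
    }
    return $sum;
}

// A positive score is taken here to mean 'spam Action'.
$label = svmDecision($features, $weights, $bias) > 0 ? 'spam' : 'genuine';
echo "Session classified as: $label\n";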

Figure 4.3: The overview process of the EDSF.

The detail of the EDSF is discussed in Chapter 7.

4.3.3 On-the-Fly Spam 2.0 Filtering (OFSF)

The EDSF can detect Spam 2.0 in Web 2.0 platforms. However, due to its design constraints and use of a machine learning classifier, it cannot be used for fast, on-the-fly identification of spam behaviour. Compared with EDSF, the OFSF method is capable of on-the-fly detection of spam behaviour in Web 2.0 platforms. Hence, OFSF can satisfy the following research issue that could not be addressed in the design of EDSF:

• Research Issue 7: Machine learning based methods are not effective for real-time detection

A novel classifier built on the Trie data structure has been utilised to provide on-the-fly functionality.

The OFSF tracks and records users’ activities inside the system. After a certain threshold, the method can provide a decision as to whether or not a particular user is considered to be spamming. Based on previously observed behaviour of spam and genuine users, OFSF can provide a degree of certainty for the decision at each stage of execution. For example, while the user is filling out web forms and


navigating through web pages, the OFSF measures and investigates specific behavioural features (e.g.

the amount of time taken to navigate between web pages) to determine the probability of spam

behaviour.

The OFSF uses an extended version of Action to formulate web usage data. Here, each user Action is

assigned an Action Index Key. The Action Index Keys for each user are concatenated to create an

Action String. Action Strings are representations of user behaviour for a given period. Action Strings

contain the sequence order of user Actions inside the system. The order of performed Actions

provides a better insight into users’ behaviour and intentions.

Figure 4.4: The overview process of the OFSF.

Figure 4.4 illustrates the overview process of the OFSF.

• Step 1: Web usage data for the incoming traffic (e.g. user navigation through the website) is implicitly gathered.

• Step 2: Based on gathered web usage data, Actions are formulated.

• Step 3: Each Action is assigned an Action Index Key. The Action Index Keys are concatenated in order to generate Action Strings.

• Step 4: Action Strings are used to construct the Trie-based classifier. A spam and non-spam probability value is associated with each Action String prefix and added to the Trie.


• Step 5: Each new Action String is fed into the classifier and is assigned a spam behaviour probability value. If this value is greater than a given threshold, the activity is marked as

spam.
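To make Steps 3 to 5 concrete, the following is a simplified PHP sketch of a Trie that stores spam and genuine counts for each Action String prefix and returns a spam probability for a new, possibly incomplete, Action String; the index keys and training strings are hypothetical, and the full OFSF classifier is described in Chapter 8.

<?php
// Build a trie in which each node counts how often a prefix of an
// Action String was observed in spam and in genuine behaviour.
function trieInsert(array &$trie, string $actionString, bool $isSpam): void
{
    $node = &$trie;
    foreach (str_split($actionString) as $key) {   // each character = one Action Index Key
        if (!isset($node[$key])) {
            $node[$key] = ['spam' => 0, 'genuine' => 0, 'children' => []];
        }
        $node[$key][$isSpam ? 'spam' : 'genuine']++;
        $node = &$node[$key]['children'];
    }
}

// Walk the trie along the Action String and return P(spam) at the deepest
// known prefix; null if the prefix was never observed.
function trieSpamProbability(array $trie, string $actionString): ?float
{
    $node = $trie;
    $prob = null;
    foreach (str_split($actionString) as $key) {
        if (!isset($node[$key])) {
            break;                                  // unseen prefix: stop here
        }
        $total = $node[$key]['spam'] + $node[$key]['genuine'];
        $prob  = $node[$key]['spam'] / $total;
        $node  = $node[$key]['children'];
    }
    return $prob;
}

// Hypothetical training data: 'R' = register, 'P' = post content, 'V' = view page.
$trie = [];
trieInsert($trie, 'RPPP', true);    // register then post repeatedly: spam
trieInsert($trie, 'RVVP', false);   // register, browse, then post: genuine

// Classify a new, incomplete Action String on the fly.
$p = trieSpamProbability($trie, 'RPP');
echo $p !== null && $p > 0.5 ? "spam behaviour\n" : "genuine behaviour\n";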

A detailed description of the OFSF is provided in Chapter 8.

This concludes the overview of the proposed solutions presented in this dissertation. A comparison of the proposed solutions is discussed in the next section highlighting the main differences.

4.4 Comparative view of the proposed solutions

As mentioned earlier in this chapter, EDSF and OFSF are the two Spam 2.0 filtering methods proposed in this dissertation to combat Spam 2.0. The similarities between these two methods are as follows.

• Early-detection based: both methods are early-detection based and they investigate behavioural data.

• Usage of Web usage data: both methods gather web usage data to formulate Actions.

• Supervised: both methods require a pre-classified set of spam and non-spam behaviours for classification.

• Behavioural based: both methods investigate web usage behavioural data and Spam 2.0 generation models to identify spam behaviour.

The following is a list of differences between the EDSF and OFSF methods.

• Machine learning classifier vs. a novel classifier based on Trie: EDSF uses a machine learning classifier to identify spam behaviour while the OFSF uses a novel classifier based on

Trie to represent Action Strings and compare new Action Strings based on constructed Trie.

Each Trie node contains the probability of a specific Action String being either genuine or

spam.

• Offline vs. spontaneous detection: EDSF requires complete feature vectors for use inside the machine learning classifier, so it might not be applicable in online scenarios. OFSF can classify an incoming Action String while a user is still using the system (i.e. with an incomplete feature vector).

• Set vs. sequence: An EDSF feature vector is a set of Actions, while an OFSF feature vector is an ordered sequence of Actions (an Action String). The order of Actions can provide better insight into a user's behaviour on the Web 2.0 platform.

This concludes the comparison of the proposed solutions. The following section describes the science and engineering-based research approach that is adopted in this research.

4.5 Conceptual Process

The overview of the proposed solutions has been discussed in Section 4.2. In order to develop these

solutions, a conceptual process needs to be followed. The conceptual process has the following 7

steps.

1. Requirement Elicitation

2. Design Rationale

3. Theoretical Foundation

4. Prototype Implementation

5. Experimental Setting

6. Results and Observation

7. Validation and Comparative Analysis

The above steps are consistent with a science and engineering approach which is adopted here. This

section discusses the conceptual process which is applied throughout all three research projects

covered in this dissertation.

4.5.1 Requirements Elicitation and Prioritisation

This stage starts with a survey of existing solutions and identification of the main issues relevant to these

solutions. The issues are then prioritised according to their importance. The prioritised issues act as a

framework for the proposed solution.


4.5.2 Design Rationale

The second stage in the conceptual process is design rationale. This stage describes the basic design decisions made to meet the requirements elicited in Stage 1. All the requirements need to be considered and contribute to the design.

4.5.3 Theoretical Foundation

The third stage in the conceptual process is theoretical foundation. This stage builds on the design rationale from Stage 2 and provides a solution that addresses all the requirements elicited in Stage 1. A detailed algorithm is proposed in this stage to cover all the requirements.

The feasibility, computational complexity, scalability, real time deployment and reliability of each design decision are investigated. The final algorithm is the accumulation of all the design decisions that meet these criteria.

The final algorithm needs to be tested for functionality. Later on, the results of the tests can be used as insights to optimise the initial algorithm. To test the algorithm, it needs to be implemented, which is described in the next stage.

4.5.4 Prototype Implementation

The fourth stage in the conceptual process is prototype implementation. This stage provides an implementation of the theoretical foundation and prototype algorithm. The prototype is used to test, experiment with, and validate the theoretical foundation.

4.5.5 Experimental Setting

The fifth stage in the conceptual process is experimental setting. This stage provides test cases to examine the prototype before conducting further experiments. The test cases analyse the performance of the prototype under different conditions and real-world scenarios. Once a reliable result is obtained, the full experiments can be conducted.

4.5.6 Results and Observation

The sixth stage in the conceptual process is concerned with results and observation. This stage discusses the experiments conducted by running the prototype, analyses them and describes the observations. For example, a spam filter examines real-world spam data contents to identify the


performance of spam filter classification. The observations are then cross-related with the theoretical

foundations to validate the results.

4.5.7 Validation and Comparative Analysis

The final stage in the conceptual process is validation and comparative analysis. This stage compares

the result from the experiments (actual results) with the theoretical model (expected results). The

expected results might be similar to or quite different from actual results. In the case of difference, the

theoretical model needs to be adjusted and all the steps from Stages 1 to 6 might need to be amended.

After validation of the expected results, the proposed solution is compared against existing solutions

to discover its strong and weak points. Comparative analysis provides new avenues for future

research.

4.6 Conclusion

This chapter has provided an overview of the solutions that are proposed in this dissertation. These

solutions are proposed to address research issues that have been discussed in Chapter 3.

Three solutions that have been discussed in this chapter are as follows.

• Characterising Spam 2.0 and Spam 2.0 distribution model: Current literature lacks a comprehensive understanding of Spam 2.0 and the way it is propagated on Web 2.0 platforms. In this dissertation, these problems are addressed by providing a mechanism named HoneySpam 2.0 to monitor, track and store Spam 2.0. This information is later analysed and investigated to discover the characteristics of Spam 2.0 and its distribution model.

• Early-Detection based Spam 2.0 Filtering (EDSF) method to implicitly detect spam users, be language and platform independent and minimise user inconvenience: The early-detection based Spam 2.0 filtering methods attempt to identify spam behaviour prior to or just after spam content submission. Such methods investigate the behaviour of a spam submitter, e.g. automated tools, spambots etc.

• On-the-Fly Spam 2.0 Filtering (OFSF) method to spontaneously identify and prevent spam users: OFSF makes use of web usage navigation behaviour to establish discriminative feature sets for spam user detection. OFSF considers the sequence and order of user web navigation in order to build up the feature sets for classification. Unlike EDSF, OFSF uses a novel rule-based classifier to store, retrieve, and classify spam users separately from genuine human users. OFSF makes it possible to spot spam users on-the-fly. While users are interacting with the Web 2.0 platform, OFSF is able to track and identify possible spam navigation patterns.

The conceptual process used to develop these solutions has also been discussed in this chapter.

A detailed discussion of these solutions is the focus of Chapter 5 to Chapter 8.


Chapter 5

Tracking Spam 2.0

This chapter covers

► Introduction to HoneySpam 2.0 as a mechanism to understand Spam 2.0.

► Characterisation of Spam 2.0.

► Experimentation and testing of HoneySpam 2.0 on real data.

► Validation and comparative study of HoneySpam 2.0 against state-of-the-art solutions.

5.1 Introduction

As mentioned earlier in this dissertation, current literature lacks a comprehensive understanding of

Spam 2.0 and the way it is propagated on Web 2.0 platforms. This chapter addresses these problems by providing a mechanism named HoneySpam 2.0 to monitor, track and store Spam 2.0.

This information is later analysed and investigated to discover the characteristics of Spam 2.0.

HoneySpam 2.0 is tested against real-world Spam 2.0 data and its performance is compared against existing solutions.

The theoretical foundation for HoneySpam 2.0, along with its algorithm design to address all the requirements laid out in Chapter 4, is also discussed here. Finally, the details of tests, experiments and observations of the results, along with their validation, are presented in this chapter.

In this chapter, Unified Modelling Language (UML) is used to visualise and present the artefacts of

HoneySpam 2.0. UML is the de facto modelling language used in software engineering to visualise, document and model object-oriented software (Pilone & Pitman 2005).

This chapter is organised as follows. A general overview of the HoneySpam 2.0 framework is presented in Section 5.2. The requirements and design rationale of the HoneySpam 2.0 framework


development is presented in Sections 5.3 and 5.4 respectively. The theoretical foundation of the

proposed framework along with its algorithms is covered in Section 5.5. Section 5.6 presents the

prototype of HoneySpam 2.0. Experimental settings are discussed in Section 5.7. Finally, the results of the experiments, along with validation and a comparative analysis, are presented in Sections 5.8 and 5.9

respectively. Chapter 5 is concluded in Section 5.10.

The following section describes the main requirements and design rationale for HoneySpam 2.0, representing stages 1 and 2 of the conceptual process.

5.2 General Overview of HoneySpam 2.0

Spam content is prolific on the web as spammers no longer need to purchase, host and promote their

own domains. This is due to the development and widespread adoption of Web 2.0 platforms.

Spammers are now able to use legitimate websites to host spam content. Examples include fake eye-

catching profiles in social networking websites, fake promotional reviews, responses to threads in

online forums with unsolicited content and manipulated wiki pages. This new generation of spam is

referred to as Spam 2.0 as defined in Section 1.5.1.

Little is known about the nature of Spam 2.0 and its distribution model. As discussed in Chapter 2,

existing Spam 2.0 filtering solutions are not effective in detecting and preventing Spam 2.0 and there

is no clear understanding of Spam 2.0 characteristics. Therefore, this research initially proposes a

solution to investigate and study Spam 2.0. The proposed solution includes two parts – tracking and

analysing. The former is done through a framework named HoneySpam 2.0, the latter through

statistical analysis.

HoneySpam 2.0 draws on the idea of honeypots – a technique used to attract and track cyber attackers

(Spitzner 2002). It implicitly tracks web usage data which includes detailed monitoring of click

streams, page navigation, forms, mouse movements, keyboard actions and page scrolling.

HoneySpam 2.0 is a tracking framework that can be integrated in any web application ranging from

blogging tools to social networking applications. The tracking data is later analysed to generate initial

reports on Spam 2.0 characteristics. Extensive profiling and characterising of Spam 2.0 is the focus of

Chapter 6.

The main objectives of this chapter are to:


• Present a novel framework to monitor, track and store Spam 2.0.

• Demonstrate the potential of HoneySpam 2.0 to monitor and track Spam 2.0.

• Statistically analyse Spam 2.0 and its generation model.

• Provide a study of the strengths and limitations of the proposed solution.

Given the brief insight into the research conducted in this chapter, the following section discusses the requirements for this research.

5.3 Requirements

This section represents the first stage of the conceptual process where requirements are elicited and prioritised. The following requirements are laid down for the proposed solution.

1. Openness: The solution should gather real-world examples of Spam 2.0 and be further extended to include different types of Spam 2.0 activities.

2. Implicitness: The solution should implicitly accumulate data from spam users and their interactions.

3. Analysability: The captured data needs to be complete and analysable for future investigation. The captured data needs to be traceable, and it should be easy to weed out

useless data.

4. Security: The solution should be deployed in a secure environment. It should capture and store spammers’ data in a safe place and deny any unauthorised access to the other parts of

the system. Otherwise, the system could be used for malicious purposes such as hosting

malicious executable programs, illegal content, etc.

5. Integrity: The captured data should be safe from malicious and genuine human tampering.

6. Accessibility: The solution should be accessible and discoverable in order to capture spam tactics.

This concludes stage 1 of the conceptual process. The design rationale which is discussed in the following section is based on the abovementioned requirements. A design decision for each


requirement is made in the design rationale accompanied by a discussion of how the requirement is

fulfilled.

5.4 Design Rationale

This section provides a design rationale to fulfil the requirements that are outlined in Section 5.3. This

is the second stage of the conceptual process. The following design decisions are proposed in order to

address each requirement.

1. Develop a plug-in based solution using real-world best practices that can be integrated into multiple (live and operating) Web 2.0 platforms. This would satisfy the requirement for

gathering real-world examples of Spam 2.0 and different types of Spam 2.0 activities (Req.

1).

2. Capture server-side data (e.g. web/form usage data) rather than executing scripts on the client-side. This would satisfy the requirement for implicit tracking of data (Req.

2).

3. Assign identifiers to incoming traffic in order to trace users operating on the Web 2.0 platform (Req. 3).

4. Host the framework in a secure environment with an isolated network connection in order to decrease the risk of security attacks on the local network (Req. 4).

5. Manually monitor the execution of the framework in order to eliminate malicious or genuine human data from spam users’ data (Req. 5)

6. Develop marketing/promotional strategies to advertise the URL of the hosted framework, i.e. HoneySpam 2.0 through different channels in order to attract spam users (Req. 6). A detailed

discussion of this design decision is provided in Section 5.7.3.

In the development of any solution, the above design decisions should be considered. All these design

decisions are included in the proposed solution to track and characterise Spam 2.0. This concludes

stage two of the conceptual process. The following stage is the theoretical foundation for the proposed

solution.


5.5 Theoretical Foundation

The theoretical foundation for HoneySpam 2.0 discussed in this section represents the third stage of the conceptual process. The theoretical foundation of HoneySpam 2.0 has three main parts (Figure

5.1): (1) tracking, (2) pre-processing and (3) knowledge discovery.

Figure 5.1: Theoretical foundation of HoneySpam 2.0

The first part of the framework is designed for the tracking mechanism. The tracking part provides input data for the other parts of the system. The input data consists of navigation usage data (i.e. IP address, referrer URL, browser identity, etc.) and form usage data (i.e. mouse movements, keyboard actions, form load/unload, form field focus/un-focus, etc.).

The second part is pre-processing which includes tasks such as data cleansing, user identification, and session identification. In the data cleansing task, irrelevant records would be removed. User identification is the task of associating each usage data with a specific user. Finally, session identification refers to splitting usage data into different sessions. Each session contains information about usage data for each user’s visit.

Knowledge discovery is the last part of the proposed solution. The output of the previous part of the system would be used here for mining and characterising Spam 2.0. General data mining techniques including statistical and demographic analysis are used here.

The proposed solution explained above addresses all of the requirements described in Section 5.3.

The next sections provide a detailed description of each part of HoneySpam 2.0.

5.5.1 Tracking

The tracking part of HoneySpam 2.0 is capable of tracking spam users on the Web 2.0 platforms, and monitoring Spam 2.0 distribution behaviour. The tracking part consists of two main components:


Navigation Tracking Component (NTC) and Form Tracking Component (FTC). Figure 5.2 illustrates

the architecture of the tracking part.

Figure 5.2: HoneySpam 2.0 tracking part architecture.

5.5.1.1 Navigation Tracking Component

This component is designed to track web page navigation patterns and header information of

incoming traffic. The following information is captured by this component:

• Requested URL: The requested URL is the web page address that is requested by the client. This contains the domain name address and the path and filename of the requested web page.

Occasionally, the requested URL contains information about the parameters that are passed

via the web page (e.g. http://www.example.com/dir/file.php?parameter=value).

• Type of request method: NTC captures the two most common HTTP request methods (i.e. GET and POST). A GET method is used to retrieve or send data to the

web server. To retrieve data, the URL of the web object is sent to the web server and in return

the content of the web object or the data-producing process would be passed to the client. To

send the data, the form data is encoded and attached to the URL. This URL is sent to the web

server. A POST method is used for submitting data to the web server. The data is included in

the body of the request. The requested URL here is not used for retrieving; it is a program that

can handle the submitted data. The POST method is usually used for submitting form data.

• Time of request (timestamp): The time of the request is stored as a UNIX timestamp. UNIX time is a well-known time format to record time. It is in the format of seconds elapsed since

midnight Coordinated Universal Time of January 1, 1970.


• IP address: Internet Protocol (IP) address is a unique numerical label assigned to each device connected to the Internet. NTC records the IP addresses to identify each user.

• User Agent: The User Agent contains information about the software program used by the client. It contains the software name, its version and some other optional information such as

operating system name, operating system version etc.

• Referrer URL: Suppose a web page A has a link to a web page B. If a user navigates from A to B, web page A is the referrer URL of B. NTC stores this information to find the origin of

each request, determining whether it comes through internal web pages or other websites. For

example, the referrer URL is very useful in discovering whether a user was directed to the

website through the search engines. The search query can be extracted from the referrer URL.

Figure 5.3 presents sample data that has been captured by NTC.

Figure 5.3: Sample data captured by NTC.

5.5.1.2 Form Tracking Component

Other than the NTC, HoneySpam 2.0 has additional functionality that can provide insights into how spam users use online forms. It is a common belief that spambots do not interact with forms (Hayati et al. 2009); however, this research aims to examine this assumption. The Form Tracking Component (FTC) is designed to investigate form input behaviour. It captures and stores the following form-related data:

• Mouse clicks and movements: Mouse clicks and movements indicate whether the client is using a mouse when using forms. The timestamps of mouse movements over form fields are

transmitted to the server.


• Keyboard action: The keyboard action checks whether the client is using a keyboard when filling in fields in a form. The timestamp when the user presses any key, along with the key code,

is captured by the FTC and is then transmitted to the server.

• Form field focus and un-focus: An online form consists of various fields. Once the user clicks or selects a field in the form, the Field Focus action would be triggered. When the user

un-selects or moves to other fields, the Un-focus action would be triggered. The Form Field

Focus and Un-focus can show users’ form field interaction as well as the amount of time

taken to fill each field.

• Form load and submission: Similar to the Form field focus and un-focus, the Form load and submission can show users form interaction as well as the amount of time that is taken to fill

the whole form.

FTC uses Asynchronous JavaScript and XML (AJAX) technology to capture the above information

and transmit it to the web server (Zakas et al. 2006). AJAX provides data exchange with a server

without reloading the whole web page. For reasons of practicality and implicit data collection, AJAX

has been implemented to capture user-form interaction and transmit such data to the server behind the

scenes. Other technologies (e.g. traditional HTML-based loading) interfere with user and system

interactions; hence, they do not meet the framework’s requirements (Holdener III 2008).

There are many other form events that can be captured, such as selection of elements, release of a key, occurrence of an error, etc. However, the above form usage data satisfies the requirements. Moreover, for simplicity and extendibility of the proposed solution, only the abovementioned form actions are captured.

Figure 5.4 presents the sequence diagram for the tracking part, the process of which is as follows:

Step 1: User requests a web page that contains a form.

Step 2: NTC first captures the web usage data associated with the initial request.

Step 3: NTC forwards the request to the web application to be processed.


Step 4: The web application returns the produced web page.

Step 5: HoneySpam 2.0 attaches its FTC client code (AJAX) and sends the page back to the user.

Step 6: Once the form has been loaded in the user's browser and the user starts to interact with the form, the FTC client code starts to aggregate form usage data and implicitly transmits it to HoneySpam 2.0.

Step 7: Once the user submits the form, NTC captures the web usage data of the incoming requests.

Step 8: HoneySpam 2.0 passes the submitted form data to the web application for further processing.

Figure 5.4: Sequence diagram for HoneySpam 2.0 tracking process.

5.5.2 Pre-processing

Pre-processing the captured data is the second part of HoneySpam 2.0.


Input: Raw form and navigation usage data

Output: Session and user identified usage data

There are three main tasks: data cleansing, user identification, and session identification.

5.5.2.1 Data cleansing

Data cleansing is the process of removing any irrelevant records from the data. The data stored by

HoneySpam 2.0 needs to be accurate, precise, and clean from noise.

As stated before, NTC captures only web page requests. However, in practice, each HTTP

request contains separate requests for other entities inside the web page such as graphics, style and

script files. This is done automatically by the user’s browser and not explicitly by the user. NTC

ignores other requests since the main purpose is to study users’ navigation. Other automatic requests

do not provide valuable information for this study since the user does not explicitly request them.

Therefore, all the requests that contain URL suffixes as shown in Table 5.1 are removed.

Table 5.1: URL suffixes that need to be removed from the captured data.

Graphic files: .jpg, .jpeg, .gif, .png, .tiff, .bmp, .ico
Script files: .js
Animation files: .flv, .swf
Style files: .css

Additionally, the data related to researchers monitoring the forum and the data related to known web

crawlers and robots is also removed from the results.

5.5.2.2 User Identification

User identification is the task of identifying users from the web usage data (Robert Cooley et al.

1999). HoneySpam 2.0 is designed to study spammers’ behaviour on a Web 2.0 platform. Hence, it is

important to identify unique spam users on web applications in order to discover different spam

behaviours. Previous studies show that user identification is a complicated task (Robert Cooley et al.

1999), since web usage data does not provide enough information to differentiate requests between

different users. User identification becomes even more complicated with the existence of:


• Proxy and cache servers: These servers are designed to speed up access to a specific web resource. A proxy server caches previous requests and returns subsequent requests without

connecting to the remote server. Therefore, it makes user identification very challenging.

• Local Internet browser cache: Similar to the caching in the proxy server, here the local user’s Internet browser caches requests so future requests for that data can be served faster.

To deal with these problems, some heuristics are used in an attempt to identify unique users:

• IP address: A change in IP address can represent a different user. However, multiple users might share the same IP address and be mistaken for a single user.

• User Agent: A change in the User Agent field for the same IP address can represent a different user. However, one user using two different Internet browsers on the same machine can be

mistaken as multiple users.

Therefore, the above heuristics are not sufficient to effectively discover unique users (R.

Cooley et al. 1997).

In order to overcome the above problems, the proposed method uses a different approach to identify unique users. The majority of web applications require registration of a valid username before further interaction with the system. Moreover, users need to log in to the system in order to use it. Each request to the web server needs to be authenticated before any processing occurs. It is not convenient for users to be asked for their user account credentials each time they send a request to the web server.

A user's credentials can be stored on the user's machine and automatically sent to the web server with each request. This piece of information stored on the user's computer is known as an HTTP cookie (or simply a cookie). Cookies can be used for a variety of purposes such as authentication and sessionisation, and are sent to the web server through the HTTP header.

In the proposed method, the user credentials are extracted from each user’s request to the web server and are stored along with other navigational usage data. Therefore, each web request is associated with a username. This information can be used to discover unique users from the web usage data.


To characterise Spam 2.0 behaviour, it is necessary to investigate spam users' behaviour on Web 2.0 platforms. After the user identification step, the usage data for website visitors (i.e. users who do not have a user account) is removed. Website visitors cannot contribute to content generation unless they sign up to the web application.

5.5.2.3 Session identification

Session identification or sessionisation is the task of identifying a user’s web usage behaviour during

each visit (R. Cooley et al. 1997). Each session contains specific information about the usage data for

each website visit.

Previous studies use simple heuristics to divide usage data into different sessions. For example, if the

time between different requests exceeds a given time-limit, it is considered as a new user session (R.

Cooley et al. 1997). Many studies use 30 minutes as a time-out for each session (Robert Cooley et al.

1999). However, such a simple heuristic is not robust enough to discover all user sessions.

For example, it might take more than 30 minutes for a user to reply to a post in the forum; therefore,

this might be mistaken for two sessions.

In order to overcome this problem, the proposed method produces a session cookie for usage data to

differentiate new sessions. A session cookie is a type of cookie that is valid only while users use a

website. Once users close their Internet browsers, do not visit the website for a certain amount of time (30 minutes), or log out of the website, the session cookie expires. For

each session cookie, the proposed method assigns a unique session number (session ID) to

differentiate user sessions.

Figure 5.5 presents a flowchart for the pre-processing part of the proposed method. The process is as

follows.

Step 1: The pre-processing tasks start by obtaining HTTP requests.

Step 2: Unneeded data is eliminated by the data cleansing task.


Step 3: The next two steps – user and session identification – are done in parallel. For user identification, the HTTP cookie header is investigated. The username is extracted from the cookie and associated with the HTTP request. Session identification is done through session cookies. A session ID is assigned to each session cookie to track the user during each session.

Step 4: All HTTP requests that do not contain any usernames are removed.

Figure 5.5: Flowchart for pre-processing step.

The above process makes the data ready for the next step – knowledge discovery.

5.5.3 Knowledge discovery

Knowledge discovery is the third and the last part of HoneySpam 2.0.

Input: Session and user identified usage data


Output: Statistical report of spam users’ activity (e.g. usage and demographic analysis)

The pre-processed data from the previous part needs to be analysed to reveal the characteristics of Spam

2.0.

In this part of the proposed method, general data mining techniques are used to generate usage

statistics as follows:

• Web page visit dwell time

• Form usage dwell time

• Field focus dwell time

• Amount of content-submission per hour/day

• Frequency of Internet browsers

• Referrer URLs

• Return rate

• Submitted content topics

• Demographic analysis (e.g. country name)

Web page visit dwell time

The web page visit dwell time is the amount of time a user spends on each web page. It can be

calculated by looking at each web page timestamp. Dwell time is defined as:

d_i = t_{i+1} − t_i (5.1)

where d_i is the dwell time of the ith requested web page in session S and t_i is the timestamp of the ith request.

It is clear from this definition that it is not possible to calculate the dwell time for the last visited

page in a session. For example, a user navigates to the last page on the website then closes his/her web

browser. To overcome this problem, the average dwell time spent on other web pages in the same

session is considered as the dwell time for the last page.


The rationale for investigating this measurement is to study how long spam users spend on a website.

It is generally understood that spam users do not spend as much time on a website as genuine users do. This claim can be investigated through this measurement.

Form usage dwell time

The form usage dwell time is the amount of time a user spends filling in all the fields in a form and then submitting it. The form usage dwell time can be calculated through Eq. 5.2.

d_form = t_β − t_α (5.2)

where α is the web page which contains the form, β is the web script that processes the submitted form data, and t_α and t_β are the timestamps of the corresponding requests.

The rationale for investigating this measurement is to study the spam users’ form usage.

Field focus dwell time

The field dwell time can be calculated through each field-focus and un-focus timestamp. The field dwell time definition is as follows.

d_field = t_u − t_f (5.3)

where t_u is the time of the field-unfocus event and t_f is the time of the field-focus event. The rationale for investigating this measurement is to study whether or not spam users are using a mouse or keyboard to interact with the website. It can also show the speed of spam users' form completion.
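A minimal PHP sketch of the dwell-time computations in Eqs. 5.1 and 5.3, using hypothetical UNIX timestamps, is given below.

<?php
// Timestamps (UNIX time) of the pages requested in one session, in order.
$requestTimes = [1300000000, 1300000042, 1300000050, 1300000131];

// Eq. 5.1: the dwell time of page i is the gap to the next request.
$dwell = [];
for ($i = 0; $i < count($requestTimes) - 1; $i++) {
    $dwell[] = $requestTimes[$i + 1] - $requestTimes[$i];
}

// The last page has no following request, so its dwell time is taken
// as the average of the other dwell times in the same session.
$dwell[] = (int) round(array_sum($dwell) / count($dwell));

// Eq. 5.3: field dwell time from focus/unfocus event timestamps.
$fieldFocus   = 1300000055;
$fieldUnfocus = 1300000063;
$fieldDwell   = $fieldUnfocus - $fieldFocus;   // 8 seconds

print_r($dwell);                               // [42, 8, 81, 44]
echo "Field dwell: {$fieldDwell}s\n";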

Amount of content-submission per hour/day

The amount of content submission per hour/day represents how frequently spam users distribute spam content on Web 2.0 applications. It can also provide insight into the speed of Spam 2.0 distribution in a sample web application. This amount is measured in terms of the number of posts, replies, etc.

Frequency of Internet browsers

The frequency of Internet browsers can reveal the data mimicked by spam users in the User Agent field of the HTTP header. Spam users use their own tools rather than common Internet browsers; this can be investigated through this measurement.

Referrer URLs


Referrer URLs can indicate the origin of user navigation - whether it was through search engines,

other websites, or direct visits. Hence, this measurement can provide insights into how spam users find

their targets.

Return rate

The return rate can show whether or not spam users are willing to distribute Spam 2.0 to targets that

have already received their spam. It can also show whether or not they use the same user account to

distribute spam content.

Submitted content topic

The submitted content topic provides insight into the nature of the Spam 2.0 content. It shows the

frequency of top content categories and Spam 2.0 topics.

Demographic analysis

The demographic analysis provides insight into the origin of each spam user. By examining the IP

address and hostname, it is possible to discover the request’s country of origin.

Figure 5.6 illustrates the flowchart for the knowledge discovery part of HoneySpam 2.0. The process

consists of the following steps.

Step 1: Content, web and form usage data is gathered from the previous part of HoneySpam 2.0.

Step 2: Form usage reports, including form and field dwell times, are measured based on the form usage data. General usage reports, including content submission rate, frequency of Internet browsers, frequency of referrer URLs, return rate and demographics, are generated based on the web usage data. The content theme is generated based on the submitted content.

Step 3: The above reports are used to generate statistical reports.


Figure 5.6: Flowchart showing knowledge discovery process of HoneySpam 2.0.

5.6 Prototype Implementation

The latest, commonly-used technologies are used to implement each part of HoneySpam 2.0.

Tracking part: To implicitly capture real examples of Spam 2.0, web development best practices are used to implement the proposed method (Cosentino 2002). This allows HoneySpam 2.0 to be operational in real-world scenarios. Additionally, it is necessary for the test environment to provide real-world inputs for HoneySpam 2.0.

To implement and test the proposed method, server-side programming and scripting languages are used. HoneySpam 2.0 server-side code is implemented using PHP. The database for collecting the


captured data is implemented using MySQL engine. MySQL is a type of relational database

management system (RDBMS) and the data is stored in the form of tables (e.g. columns and rows)

(DuBois 2006). Structured Query Language (SQL) is used to access and manage the MySQL database

including insert, update, delete, and query the data and the schema (Molinaro 2005). The HoneySpam

2.0 client side code (FTC) is implemented using AJAX.

PHP is selected due to its popularity as a web programming/scripting language and because the majority of popular Web 2.0 platforms are programmed in PHP. JavaScript and the AJAX technique are selected because they are well accepted for programming live and interactive web applications (Holdener III 2008). Additionally, most Internet browsers currently support JavaScript; hence, the HoneySpam 2.0 client-side code can be executed in the majority of Internet browsers.

This makes HoneySpam 2.0 adaptable to current web applications; therefore, HoneySpam 2.0 can easily be integrated into live web applications to capture and characterise real-world examples of Spam 2.0.

Pre-Processing part: The pre-processing part is implemented using PHP. The PHP code fetches the data from the MySQL database tables and performs the pre-processing tasks. The output is saved in a special file format or directly in the database for further processing.

Knowledge Discovery part: The last part of HoneySpam 2.0 is implemented using PHP and MATLAB. MATLAB is a well-known tool for data processing, and its extensive feature set facilitates data analysis. In the proposed method, MATLAB is used to generate the general statistical reports, while PHP code is used to prepare the input data for MATLAB.

Two common test environments – an online discussion board and a social bookmarking system – are selected to test HoneySpam 2.0. The online discussion board is implemented using the popular open source forum, Simple Machines Forum (SMF) (SMF Team 2011). The social bookmarking system is implemented using Pligg, a common social bookmarking Web 2.0 platform (Heikkinen 2011).

5.6.1 Tracking part

The tracking part of HoneySpam 2.0 is composed of the FTC and NTC. It is modelled using three PHP and two JavaScript files as follows:


• sb_unb.php

• sb_mysql.php

• sb_ft.php

• sb_ft.js

• sb_form_assign_[FORM NAME].js

The PHP files are executed on a virtual web server running Linux (SUSE Linux Enterprise Server 11), the Apache web server and PHP version 5.

As mentioned earlier, adaptability is one of the main design factors in HoneySpam 2.0 since it is required to be integrated with multiple Web 2.0 platforms and capture the data. HoneySpam 2.0 hooks its custom commands and specialised features to the web application in order to process and capture the usage data. The following sections provide a detailed description of the tracking part of

HoneySpam 2.0. Figure 5.7 presents the class diagram for the NTC.

Figure 5.7: Class diagram for the NTC.


A class diagram in UML is a static diagram used to model objects, attributes, operations, and interactions among objects in the software (Miles & Hamilton 2006). The upper part of a class diagram shows the class name, the middle part holds the class attributes, and the lower part contains the methods (operations) of the class.

5.6.1.1 Navigation Tracking Component (NTC)

The Navigation Tracking Component (NTC) is coded in the sb_unb.php and sb_mysql.php files. sb_unb.php

contains an algorithm to capture HTTP header data, extract user account information from HTTP

cookies, and handle session ID in a session cookie. sb_mysql.php contains the DB class to interact

with the MySQL database, by inserting, fetching, updating, and removing the captured data.

sb_mysql.php

sb_mysql.php defines the DB class to handle the connection to MySQL database. Figure 5.8 illustrates

the class diagram for sb_mysql.php file. This class is used for all MySQL connections.

As shown in Figure 5.8, three variables contain MySQL database credential information and four

methods are used to handle MySQL interactions.

Figure 5.8: Class diagram of DB class

sb_unb.php

sb_unb.php contains the main logic for capturing users' navigational behaviour (UNB) data. It extracts the following fields from the HTTP header:


• requested URL,

• type of request method (i.e. GET and POST method),

• IP address,

• User Agent,

• referrer URL.

sb_unb.php assigns a session ID and a timestamp to each HTTP request and stores them, along with the above extracted fields, in the database. The session ID is handled through PHP's session_start() local function. This function creates or resumes a session based on an identifier in the cookie

(Cosentino 2002).

Two parameters – web page title and username – are passed to this class and are stored in the database.

The former is the title of the web page and the latter is the username of the user visiting the website.

Different web applications store usernames in different ways (e.g. in an HTTP cookie); hence, this information needs to be extracted beforehand and then passed to sb_unb.php. Figure 5.9 shows the class diagram for sb_unb.php.

Figure 5.9: Class diagram for the NTC

As shown in the figure, the unb function receives the web page title (pageTitle) and the username as parameters each time a new request is sent to it. unb obtains HTTP header information from PHP's server and execution environment array ($_SERVER). This array is created by the web server once a new connection is made (Cosentino 2002).


Depending on the web server configuration, the client IP address is stored in different fields in the

HTTP header. The getRealIpAddr() verifies each HTTP field to extract a client’s valid IP address. The

IP address is later passed to the unb for storage.

The algorithm for the NTC is provided in Code 5.1.

Code 5.1: Pseudo-code for NTC

Algorithm navigation_usage_tracking
input: webpage title, user account name, HTTP header array $_SERVER
output: web navigation usage along with username, session ID and webpage title
1. get the client IP address by examining the HTTP_CLIENT_IP, HTTP_X_FORWARDED_FOR and REMOTE_ADDR fields in the HTTP header
2. construct the request URL from SERVER_NAME, SERVER_PORT and REQUEST_URI
3. call session_start()
4. make a new instance of the DB class
5. insert session ID, timestamp, HTTP_REFERER, HTTP_USER_AGENT, webpage title, request URL, client IP address and username into the database
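A condensed PHP rendering of Code 5.1 is sketched below; the table and column names, and the DB::insert signature, are hypothetical, since the thesis only summarises the actual sb_unb.php.

<?php
require_once 'sb_mysql.php';   // provides the DB class described above

// Step 1: examine the header fields in order to find the client's real IP.
function getRealIpAddr(): string
{
    foreach (['HTTP_CLIENT_IP', 'HTTP_X_FORWARDED_FOR', 'REMOTE_ADDR'] as $field) {
        if (!empty($_SERVER[$field])) {
            return $_SERVER[$field];
        }
    }
    return '';
}

function unb(string $pageTitle, string $username): void
{
    // Step 2: reconstruct the requested URL from the HTTP header.
    $url = $_SERVER['SERVER_NAME'] . ':' . $_SERVER['SERVER_PORT'] . $_SERVER['REQUEST_URI'];

    // Step 3: create or resume the session to obtain a session ID.
    session_start();

    // Steps 4-5: store the navigation usage record via the DB class
    // (hypothetical table name and insert() signature).
    $db = new DB();
    $db->insert('usage_log', [
        'session_id' => session_id(),
        'timestamp'  => time(),
        'referrer'   => $_SERVER['HTTP_REFERER'] ?? '',
        'user_agent' => $_SERVER['HTTP_USER_AGENT'] ?? '',
        'page_title' => $pageTitle,
        'url'        => $url,
        'ip'         => getRealIpAddr(),
        'username'   => $username,
    ]);
}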

5.6.1.2 Form Tracking Component (FTC)

FTC is implemented in sb_ft.php, sb_ft.js, and sb_form_assign_[FORM NAME].js. Figure 5.10 shows

the class diagram for the FTC.

Figure 5.10: Class diagram for the FTC.


sb_ft.php

sb_ft.php is the server-side code for the form tracking component. It captures the form event actions that are sent through the client-side code and stores them in the database using sb_mysql.php. The following form events' timestamps are captured by sb_ft.php:

• Field focus, when a field receives user focus. It can be caused by a user clicking on the field in the form.

• Field blur, when a user leaves a field. It can be caused by a user selecting another field in the form.

• Field key up, when a user releases a key on the keyboard. The key-up timestamp, along with the key code, is captured.

• Field mouse movement, when the mouse passes over a field.

• Form load, when a form is fully loaded on the user’s Internet browser.

• Form unload, when a form is unloaded on the user’s Internet browser.

Additionally, sb_ft.php stores the HTTP header data for each form event action including IP address, the User Agent, session ID, and username.

Figure 5.11 illustrates the class diagram for the FTC server-side code.

Figure 5.11: Class diagram for the FTC server-side code.

For simplicity, functions for storage of HTTP header data, extracting IP address, session ID and username are not included in the figure. The algorithm for the FTC server-side code is presented in

Code 5.2.


Code 5.2: Pseudo-code for the FTC server-side

Algorithm form_tracking_server_side
input: field name, event name, key code, user account name, HTTP header array $_SERVER
output: form usage data along with HTTP header information
1. get field name, event name
2. get HTTP header information from $_SERVER
3. if (event = keyup) { get key code name }
4. switch (event)
   a. case focus: call focus()
   b. case blur: call blur()
   c. case keyup: call keyup()
   d. case mouseover: call mouseover()
   e. case load: call load()
   f. case unload: call unload()
5. insert session ID, field timestamp, event name, HTTP_REFERER, HTTP_USER_AGENT, webpage title, request URL, client IP address and username into the database
6. if (event = keyup) { insert key code into the database }

sb_ft.js

sb_ft.js is the client-side code for the FTC. This JavaScript code handles the connection to the FTC server-side code (i.e. sb_ft.php) using the AJAX web development technique. In traditional web

development, web pages need to be reloaded to process each HTTP request/response to/from the

server. However, with the emergence of the AJAX web development technique, web pages can be

updated in the background (Holdener III 2008).

sb_ft.js uses XMLHttpRequest (XHR) object to transmit data between the client and the server

(Flanagan 2006). XHR is used to transmit form event actions to the FTC server-side code without

reloading the web page. XHR is an API that is available in the JavaScript language.

Figure 5.12 shows the class diagram for sb_ft.js. Two functions – open() and send() – are required

for AJAX connections.


Figure 5.12: Class diagram for the FTC client-code.

sb_form_assign_[FORM NAME].js

sb_form_assign_[FORM NAME].js is FTC client-code designed to assign an event handler to each form field and capture form events. Depending upon the number of forms and form fields, this file is created and customised. As shown in Table 5.2, different event handlers are assigned to different form fields.

Table 5.2: Event handlers for different form fields.

Form field Event handler

Text field Focus, blur, key up, mouse over

Checkbox Focus, blur, mouse over

Button Focus, blur, mouse over

Select list Focus, blur, mouse over

sb_form_assign_[FORM NAME].js operates as follows.

1. Searches for given field ID or name

2. Assigns event handler to the form field

3. Captures events

4. Calls AJAX function declared in sb_ft.js to transmit the data to the FTC server-side code.

The class diagram for the FTC client-code event handler is shown in Figure 5.13.


Figure 5.13: Class diagram for the FTC client-code event handler.

5.6.2 Pre-processing part

Since the data is captured on the web server, it is easier to utilise server-side code to process the data; hence, the pre-processing part is implemented using PHP. However, the computation is complex and time consuming, and PHP scripts invoked through the web server can only run for a limited period of time; scripts which exceed this limit are terminated by the web server. To overcome this problem, PHP command line scripts are used to process the data (Gutmans et al. 2004). A PHP command line script has no execution time limit and can be used to accomplish time consuming tasks.

The pre-processing part of HoneySpam 2.0 is modelled to:

• Cleanse the data

• Identify users and sessions

Pre-processing is performed when data is retrieved from the database or during data capture (i.e. in the tracking part). Hence, the pre-processing tasks are not coded in separate files. Data cleansing is performed when data is being fetched from the database, while user and session identification are done while HoneySpam 2.0 is capturing the data.

5.6.2.1 Data cleansing

Data cleansing tasks are done through the construction of specific SQL database queries. Table 5.3 lists

the conditions that need to be considered when data is fetched from the database.


Table 5.3: Data cleansing tasks in the pre-processing part of HoneySpam 2.0.

Username NOT IN ('', 'administrator') – excludes data from visitors and from researchers who monitor the system.

UserAgent NOT IN ('Googlebot', 'Slurp', …) – excludes data from known web robots (Google, Yahoo! etc.).

URL NOT LIKE ('%.jpg', '%.js', '%.css', …) – excludes requests for image, JavaScript, stylesheet etc. files.
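As a sketch, the conditions in Table 5.3 might be combined into a single SQL cleansing query as follows; the table and column names are hypothetical.

<?php
// Hypothetical table/column names; the conditions mirror Table 5.3.
$sql = "SELECT * FROM usage_log
        WHERE username NOT IN ('', 'administrator')
          AND user_agent NOT IN ('Googlebot', 'Slurp')
          AND url NOT LIKE '%.jpg'
          AND url NOT LIKE '%.js'
          AND url NOT LIKE '%.css'";
// Executing $sql returns only records produced by registered users'
// explicit page requests, i.e. the data kept after cleansing.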

5.6.2.2 User Identification

Unlike the majority of web usage mining algorithms, which use specific heuristics to identify users in the logs (R. Cooley et al. 1997), the proposed method uses a cookie to identify users.

The majority of Web 2.0 platforms require users to sign up or register as a member in order to contribute to the website content. After registration, users are required to log in to the website for authentication. Once users are logged in, their user credential information, including the username, is stored as a cookie.

The cookie information is sent through HTTP headers to the web server. Usernames for each HTTP request to the web server can be extracted from the cookie data.

In the proposed method, cookie information is extracted and stored when the NTC stores usage information to the database. In the PHP language, cookie information can be retrieved from

the $_COOKIE array (Cosentino 2002). The proposed method uses the $_COOKIE array to extract the username and uses it as a parameter in other functions such as unb, focus, keyup, etc.
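A minimal sketch of this username extraction is given below; the cookie key member_name is a hypothetical example, since every Web 2.0 platform stores its credential cookie under its own name and layout.

<?php
// Sketch of username extraction from the login cookie before calling unb().
function extractUsername(): string
{
    if (!isset($_COOKIE['member_name'])) {   // hypothetical cookie key
        return '';                           // anonymous visitor
    }
    // Basic sanitising before the value is stored with the usage data.
    return preg_replace('/[^A-Za-z0-9_\-]/', '', $_COOKIE['member_name']);
}

// The extracted username is then passed to the tracking functions,
// e.g. unb($pageTitle, extractUsername());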

5.6.2.3 Session Identification

Unlike traditional web usage mining algorithms (R. Cooley et al. 1997), the proposed method uses session cookies to handle session identification tasks.

As mentioned earlier, for each HTTP request, the proposed method calls session_start() to create or update the current session. The session identifier is passed through the cookie along with the HTTP request.


One of the main challenges in session identification is the session expiry time: the number of seconds for which a session remains valid for the client, after which a new session ID is assigned. The session ID needs to be updated once it has expired. A common heuristic used in web usage mining is to set this time to 30 minutes (Spiliopoulou et al. 2003), and the proposed method adopts the same 30-minute expiry time. The following algorithm presents pseudo-code for the session identification task.

Code 5.3: Pseudo-code for session identification.

Algorithm session_timeout
input:  session creation timestamp, c
        session expiration time, T
output: session identification
1. if (session.c = null)
2.   session = session_start()
3.   session.c = current.time()
4. if (current.time() - session.c > T)
5.   session = regenerate session id
6.   session.c = current.time()

In Code 5.3, the function current.time() returns the current timestamp. For regeneration of the session ID, PHP's native session_regenerate_id() function is used.
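A minimal PHP sketch of Code 5.3, assuming the creation timestamp is kept under an illustrative session key 'created', could read as follows.

<?php
// Sketch of Code 5.3 in PHP. The 30-minute expiry time follows the
// heuristic adopted above; the session key 'created' is an assumption.
session_start();

$T = 30 * 60; // session expiration time, in seconds

if (!isset($_SESSION['created'])) {
    // first request of a new session: record the creation timestamp c
    $_SESSION['created'] = time();
} elseif (time() - $_SESSION['created'] > $T) {
    // the session has expired: issue a new session ID and reset c
    session_regenerate_id(true);
    $_SESSION['created'] = time();
}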

5.6.3 Knowledge discovery part

The knowledge discovery part is the last part of HoneySpam 2.0. It generates general usage analysis (GUA) reports and consists of PHP files for extracting the data related to this analysis.

The data for usage analysis is extracted using sb_gua.php and sb_mysql.php. sb_gua.php is used to query the database and fetch the data to measure the following:

• Total and average tracking records

• Content categories

• Origin countries

• Internet Browsers


Additionally, sb_gua.php extracts and stores data in CSV file format. The CSV files are used in MATLAB to draw figures, as discussed in the next section. sb_mysql.php is used to handle the database connections.

sb_gua.php

This PHP script is used to query the MySQL database and fetch the appropriate data to measure features. It is also used to store the data in CSV file format.

Total and average tracking reports

Table 5.4 lists the SQL syntax statements used to query the MySQL database and fetch the data for the total and average tracking reports.

Table 5.4: SQL syntax statements to query database for general usage analysis.

SQL Syntax – Description

COUNT(expression) – to calculate the total number of records, or of values in specific columns, in the tables.

SELECT DISTINCT expression – to return distinct (non-duplicated) values in a table.

GROUP BY expression – to group and return the data in one or more columns.

AVG(expression) – to return the average value of a numeric column.
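For instance, a query of the kind sb_gua.php might issue to count records per user would combine these constructs as follows; the table name `tracking` and column `username` are illustrative assumptions.

<?php
// Sketch: aggregate queries combining the constructs of Table 5.4.
$sql = "SELECT username,
               COUNT(*) AS total_records
        FROM   tracking
        GROUP  BY username";

// The average number of records per user can then be obtained with AVG()
// over the grouped totals (derived table alias required by MySQL):
$avg = "SELECT AVG(total)
        FROM (SELECT COUNT(*) AS total
              FROM   tracking
              GROUP  BY username) AS t";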

Content Categories

To extract content categories or themes for the submitted content, the word frequency of the submitted contents is calculated. Next, by manually investigating the top 100 most frequent words, the category of the submitted content is determined. The following assumptions are made when measuring the word frequency:

• Words are separated using one space.

• Words are made up of alphanumeric characters.

• Words might contain a hyphen or apostrophe.


• HTML tags, words starting with an apostrophe or hyphen, and words starting with numbers (e.g. 123ab) are not counted.

• ‘the’, ‘a’, ‘of’, ‘is’, ‘to’, ‘for’, ‘by’, ‘be’, ‘and’, ‘are’ etc. are not counted.

The algorithm to measure word frequencies is presented in Code 5.4.

Code 5.4: Pseudo-code for the word frequency algorithm.

Algorithm word_frequency
input:  set of submitted contents by users, C
        set of excluded words, C’
output: a sorted list of word frequencies, L
1.  for each ci in C
2.    if ci in C’
3.      go to the next ci
4.    end if
5.    if ci in L
6.      ci.count = ci.count + 1
7.      update ci.count in L
8.    else
9.      ci.count = 1
10.     add ci & ci.count to L
11.   end if
12.   sort L descending ci.count
13. repeat
14. return L

Origin Countries

A well-known protocol, WhoIS, is used to find out the country of origin of the incoming request.

WhoIS is a standard protocol for querying information about domains, hosts, networks and IP addresses from remote WhoIS databases known as Regional Internet Registries (RIRs) (Daigle 2004). An RIR can be queried using a domain/host name or an IP address to obtain information about the network, nameservers, registration date, and the demographic and contact information of Internet resources (Harrenstien 1982).

The American Registry for Internet Numbers (ARIN) and the Asia-Pacific Network Information Centre (APNIC) are used as RIRs to look up the network information (Hunt 2002). The IP address of each incoming request is queried using ARIN's Whois RESTful Web Service (Whois-RWS) to discover the country of origin. Figure 5.14 presents the WhoIS information for the Curtin University network obtained from APNIC; it is clear from the record that the network belongs to Australia.


% [whois.apnic.net node-3]
% Whois data copyright terms http://www.apnic.net/db/dbcopyright.html
inetnum:  134.7.0.0 - 134.7.255.255
netname:  CUT-NET
descr:    Curtin University of Technology
country:  AU
...
person:   Ian Hill
nic-hdl:  IH156-AP
e-mail:   [email protected]
address:  Curtin University of Technology
address:  Curtin IT Services
address:  Building 309 Kent Street
address:  Bentley 6102
address:  Western Australia

Figure 5.14: WhoIS information for the Curtin University network extracted from APNIC.
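A lookup of this kind could be scripted as in the following sketch, which assumes ARIN's public Whois-RWS endpoint, a `.txt` suffix for a plain-text representation, and that allow_url_fopen is enabled; error handling is omitted.

<?php
// Sketch: querying ARIN's Whois-RWS for the origin of an IP address.
// The illustrative address is taken from the Curtin range shown above.
$ip   = '134.7.0.1';
$info = file_get_contents('http://whois.arin.net/rest/ip/' . $ip . '.txt');
echo $info; // plain-text WhoIS record, including the country field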

Internet Browsers

The User Agent header in the HTTP request contains information about the client’s Internet browser, its version, and details about the client’s system (Berners-Lee 1945). This information is used to identify the client’s software while navigating through the website. Figure 5.15 illustrates a sample

User Agent header. From the figure it can be inferred that:

• The Internet browser’s name is Internet Explorer (Internet Explorer identifies itself as Mozilla/4.0).

• The compatibility flag indicates that this browser is compatible with a common set of features.

• The Version token defines the browser and its version (Internet Explorer 7).

• The Platform token identifies the operating system and its version (Windows Vista).

Figure 5.15: A sample User Agent header.
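For illustration, a header carrying the tokens described above would read along these lines (a representative example; the exact string in Figure 5.15 may differ):

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)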

Web crawlers also identify themselves using the User agent field.


Figure 5.16: Google web crawler User Agent header.

Figure 5.16 shows the user agent header for a Google robot. It claims to be a Mozilla-based user agent, version 5, with a compatibility flag. The crawler's name is Googlebot and its version is 2.1, and the URL given for more information is www.google.com/bot.html.
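Assembled from these tokens, the header would read approximately as follows (reconstructed from the description above):

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)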

The User Agent header is therefore used to discover how frequently spam users mimic genuine user identities.

CSV file

sb_gua.php also extracts the required data from the MySQL database and stores it in CSV file format.

The CSV files are used in MATLAB to generate the following figures and graphs.

• Frequency of posts per day

• Frequency of user registrations per day

• Frequency of online users per day

• Frequency of visits on each web page

• Frequency of users spending time on each session

• Frequency of revisits

MATLAB is used to produce each figure. Additionally, some figures are enhanced using Microsoft Excel 2007 (Jacobs et al. 2008).
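The CSV export itself is straightforward; a sketch of the kind of code sb_gua.php might use is shown below, where the rows and file name are illustrative stand-ins for data fetched via sb_mysql.php.

<?php
// Sketch: writing fetched records to a CSV file for MATLAB.
$rows = array(
    array('2009-05-25', 12),  // date, number of posts (illustrative)
    array('2009-05-26', 47),
);

$fp = fopen('posts_per_day.csv', 'w');
foreach ($rows as $row) {
    fputcsv($fp, $row); // one CSV line per record
}
fclose($fp);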

This concludes stage four of the conceptual process. The next section focuses on experimental settings

where the proposed prototype is tested.

5.7 Experimental Setting

The experimental settings constitute the fifth stage of the conceptual process, where the proposed prototype is tested and examined. Deploying HoneySpam 2.0 involves three steps: Deployment, Customisation and Promotion.


The deployment step presents a detailed description of the software and hardware infrastructure of the

HoneySpam 2.0 server. The customisation step explains different configuration settings for each copy of HoneySpam 2.0. The last step, Promotion, presents the strategy for attracting real-world samples of

Spam 2.0.

Figure 5.17: Entire implementation of HoneySpam 2.0 for 6 commercial and 5 free hosts.

Figure 5.17 illustrates the entire implementation of HoneySpam 2.0. HoneySpam 2.0 is embedded inside the web application and stores tracking data in its own database. The web application runs inside a web server and has its own database. The web server is accessible publicly through the

Internet. Finally, HoneySpam 2.0 client-side code is executed on the user’s machine.

5.7.1 Deployment

HoneySpam 2.0 was deployed on 6 commercial servers, named HSCOM1 – HSCOM6, and 5 free web hosting servers, named HSFREE1 – HSFREE5. The software and hardware specifications of each web server are presented in Table 5.5.


Table 5.5: Deployment specification of each host.

HSCOM1: Linux (SUSE) 64-bit, Australia, Intel® Xeon® 2.66 GHz; Apache 2.2.10, PHP 5.2.6; MySQL 5.0.67; uptime 25/5/09 – 9/8/09 * and 27/7/10 – 16/11/10

HSCOM2 – HSCOM5: Linux (SUSE) 64-bit, Australia, Intel® Xeon® 2.27 GHz; Apache 2.2.10, PHP 5.2.6; MySQL 5.0.67; uptime 5/1/10 – 16/2/11

HSCOM6: Linux (SUSE) 64-bit, Australia, Intel® Xeon® 2.27 GHz; Apache 2.2.10, PHP 5.2.6; MySQL 5.0.67; uptime 21/5/10 – 17/2/11

HSFREE1 – HSFREE5: Linux, USA, AMD Opteron 1.9 GHz; Apache 2.2, PHP 5; MySQL 5; uptime 25/5/09 – 9/8/09 *

* Only 60 days of data was collected due to server downtime periods caused by maintenance issues.

All the commercial hosts run under a Virtual host environment to meet the security requirements of

HoneySpam 2.0 (ref. Section 5.3). The detailed deployment description of each host is as follows.

5.7.1.1 HSCOM1

HoneySpam 2.0 is integrated with two web applications – an online discussion board and a social bookmarking platform. The online discussion board is implemented using a common open source forum, Simple Machines Forum (SMF) (SMF Team 2011). The social bookmarking system is implemented using Pligg, a common social bookmarking Web 2.0 platform (Heikkinen 2011).

A copy of HoneySpam 2.0 was created for each web application, with two separate directories and MySQL databases. Additionally, HSCOM1 has an Internet connection provider different from those of the other commercial hosts.

HSCOM1 operated for two periods (HSCOM1 period 1 and period 2). The dates for each period are

shown in Table 5.6.

5.7.1.2 HSCOM2-HSCOM5

HoneySpam 2.0 is integrated with 5 common web applications – wiki, blog, social bookmarking, online discussion board and commenting platforms – as well as a simple contact form. The popular MediaWiki web application is used as the wiki (Mediawiki 2011); MediaWiki is a free wiki web application used by Wikipedia. Wordpress Multi-User (Wordpress Mu), a blog and publishing platform for creating multiple blogs, is used as the blog and commenting platform (Wordpress 2011). SMF and phpBB are used for the online discussion boards (phpBB 2011). Finally, Pligg is used as the social bookmarking web application.

5.7.1.3 HSCOM6

HoneySpam 2.0 is integrated with phpBB, a popular open source online discussion platform.

HSCOM6 is run and configured separately from the other commercial hosts (ref. Section 5.7.2.3).

5.7.1.4 HSFREE1-HSFREE5

Pligg and SMF are used as the social bookmarking and online discussion board platforms on all the free web hosts.

It is worth mentioning that the client-side code is executed on the user’s Internet browser to accumulate real-world data for FTC.

The following section discusses each host configuration.

5.7.2 Customisation

The customisation step explains the different configuration settings for each copy of HoneySpam 2.0. The aim of the different customisation settings is to address the following:

1. Dedicated vs. shared IP address: A dedicated IP address assigns one IP address to one particular website, while a shared IP address is used by multiple websites. This choice is made in order to investigate whether or not spammers try to discover their targets by examining IP addresses in the same range; for example, by examining all the IP addresses that belong to the same sub-network, other active hosts become identifiable.

2. Empty vs. non-empty website: Web applications can be launched either empty (with the default content of a standard installation) or pre-populated with some content. A non-empty website is filled with relevant content accumulated from other similar websites. This choice is made to investigate whether spammers target websites with content or empty websites. Empty websites do not rank well in search engines and hence might not appear in top search engine results, so this choice can also reveal whether spammers find their targets through search engine results.

3. Live vs. non-active website: Live websites are actively populated by active users, whereas non-active websites are not being used. This choice investigates whether spammers target active or inactive websites.

A detailed configuration description of each host setting is presented next.

5.7.2.1 HSCOM1

HSCOM1 is hosted on a shared IP address; this IP address is used by other websites hosted on HSCOM1's web server. A well-ranked domain name that had previously been used for other purposes is mapped to this web server.

No content was added to the Pligg and SMF installations running on this host, and the web applications' default configuration settings are used. The registration spam filtering methods (e.g. CAPTCHA) were disabled on both web applications.

5.7.2.2 HSCOM2 – HSCOM5

HSCOM2 – HSCOM5 are hosted on dedicated IP addresses, but in the same sub-network. The IP addresses had not been used previously and were newly registered. Four new domain names were registered, one assigned to each IP address.

The choice of domain names is based on popular search keywords. A list of popular keywords was extracted from common search engines, and then, using online domain-suggestion tools such as www.domaintools.com, domain names were generated. A careful choice of domain name can make a website more visible in search engine results (Perkins 2001).

No content has been added to any web applications running on the hosts. The registration spam

filtering methods (e.g. CAPTCHA) were disabled on all the web applications.

5.7.2.3 HSCOM6

HSCOM6 uses an IP address shared with HSCOM5. The hosted website was regularly filled with

relevant content. Ten users were actively contributing to the content generation of this live website.

The registration spam filters were enabled on this host.


5.7.2.4 HSFREE1 – HSFREE5

All the free hosts share their IP addresses. Five sub-domain names were assigned to each host. Sub-domains consist of the domain name of the hosting provider, such as http://name.FREE-HOSTPROVIDENAME.com. Five free redirection domains are assigned to each free host. The choice of the redirection domains was based on popular search keywords.

The following section explains the different strategies that are used to advertise each host.

5.7.3 Promotion

Four different strategies have been used to attract spam users as follows.

5.7.3.1 Online Advertisement

The Google AdWords service was used to advertise HSCOM2-HSCOM5. Google AdWords provides keyword-based advertisement services (Geddes 2010). For example, when people search for keywords in the Google search engine, if the search keyword is relevant to the keywords chosen for a particular website, the advertisement appears next to the search results.

5.7.3.2 Linking from other websites

Hyperlinks were created from other websites to all hosts. This strategy was used to promote all the hosts including both commercial and free.

5.7.3.3 Listing in the public directory service

The addresses of the free hosts were added to public directory services such as dmoz.org (dmoz 2011). A public directory service provides a list of website links organised by category. This strategy was used to promote HSFREE1-HSFREE5. The links were removed after the experiment to preserve the quality of the directory.

5.7.3.4 Traffic redirection

By employing the server-side redirection technique (Smith 2002), the traffic to HSCOM1 was redirected to HSCOM2 for a period of 7 months.


During implementation of the above strategies, one of the important challenges was to ensure that

genuine human users did not interact with any of the hosts. To achieve this, the following approaches

were adopted.

• Websites were kept empty or were filled up with non-useful content. This was done to ensure that genuine human users did not contribute to the website content. HSCOM1-HSCOM5 as

well as HSFREE1-HSFREE5 were purposely kept empty.

• All the hosts were also manually moderated to ensure that no human users participated in content or web usage generation.

HSCOM6 is the only host that contains both human and spam user data. Initially, it was launched

only as a forum for human content contribution; however, later spam users also contributed to the

website. Human and spam users were manually identified.

This concludes stage five of the conceptual process. The next section focuses on the results and

observations based on the proposed prototype experiments.

5.8 Results and Observations

This is the sixth stage of the conceptual process. This section discusses the results that were observed

after running the proposed prototype, i.e. HoneySpam 2.0.

All the hosts were running for specific periods as shown in Table 5.5. The results and observations are

for general statistical analysis purposes.

The summary of the data that was tracked by HoneySpam 2.0 on each host is provided in Table 5.6.

Table 5.6: Summary of tracking data for HSCOM1 (two periods) and HSCOM6.

                                        HSCOM1 (1)   HSCOM1 (2)   HSCOM6

No. of all records                      11,481       2,756,825    322,207

No. of records after data cleansing     6,794        1,867,767    42,336

Total no. of spam users                 643          16,930       171

Total no. of unique used IP addresses   314          4,234        191


HoneySpam 2.0 running on HSCOM2-HSCOM5 and HSFREE1-HSFREE5 did not attract any spam, which shows that there were no Spam 2.0 activities on these hosts. The reason for this is discussed in Section 5.9. All the results and observations discussed in this section are based on HSCOM1 period 1, HSCOM1 period 2, and HSCOM6.

5.8.1 General statistical results

Figure 5.18a shows the frequency of spam users’ visits to each web page on HSCOM1. It is clear from the figure that spammers mainly visited only one web page, and that no page received more than 300 visits.

Figure 5.18: (a) Frequency of spammers’ visits to each web page in HSCOM1. (b) Frequency of spammers’ return visits on HSCOM1. (c) Frequency of spammers’ visit times (sec) per session on HSCOM1.

The frequency of spammers’ return visits is presented in Figure 5.18b. The majority of spammers visited HSCOM1 only once; however, there are spam users that have more than 5 return visits.


The amount of time spammers spent on HSCOM1 each time they visited the website is illustrated in Figure 5.18c. It is clear from the figure that the majority of spammers spent between 1 and 5 seconds per session. However, some spent no time at all on the website, which means that they sent just one request to the web server (ref. Section 5.9).

Figure 5.19a shows the frequency of spammers’ visits to each HSCOM1 web page during the second period. It is clear from the figure that the majority of spammers visit between 5 and 20 web pages. Some spammers visit only one web page, while others record more than 1000 web page visits.

Figure 5.19: (a) Frequency of spammers’ visits to each web page during the HSCOM1 second period. (b) Frequency of return visits by spammers during the HSCOM1 second period. (c) Frequency of spammers’ visit times per session during the HSCOM1 second period.

Figure 5.19b illustrates the number of spammers’ return visits to HSCOM1. It is clear from the figure that most of the spammers re-visited only once, followed by those with fewer than 5 re-visits to the host (a value of 1 means only one visit). The frequency of spammers’ visit times per session for the HSCOM1 second period is presented in Figure 5.19c. Spammers mainly spend 10 to 50 seconds on the host; the rest spend either 100 to 200 seconds or 1 to 5 seconds. Some spammers send only one request to the host; hence, their visit time is zero.

Figure 5.20a presents the frequency of spammers’ visits to each web page in HSCOM6. It is clear

from the figure that the majority of spammers’ visits range from 5 to 50.


Figure 5.20: (a) Frequency of spammers’ visits to each web page for HSCOM6. (b) Frequency of return visits by spammers for HSCOM6. (c) Frequency of spammers’ visit times per session for HSCOM6.

Figure 5.20b shows the frequency of spammers’ return visits to HSCOM6. As the figure shows, the majority of spammers visit the host only once. Figure 5.20c presents the frequency of spammers’ visit times per session on HSCOM6. As is clear from the figure, spammers mainly spend between 20 and 50 seconds on the host. There are very few spammers who spend less than 1 second on the host.

5.8.2 Content categories results

Figure 5.21 illustrates the percentage of the top content categories for the HSCOM1 host. Adult-related keywords are the most dominant content. The drug category promotes regular and adult-related medicine. The movie category includes topics about watching the latest movies, YouTube, and adult-related cartoons and animations. In the banking category, the main content themes are low-service-fee credit cards, making money online etc. The mobile phone category is about the latest mobile devices (e.g. iPhone). Other categories contain various keywords such as driving test, gym, hair loss etc.


Figure 5.21: Top content categories on HSCOM1

Figure 5.22 presents the percentages of the top content categories for the HSCOM1 host in the second period of operation. Adult-related keywords are the most dominant content captured by HSCOM1. The movie category also contains adult-related keywords; therefore, it can be considered as adult content. The free category includes topics about downloading free software, free gifts, free postcards etc. The dating category mainly promotes online dating websites. The ringtone category includes topics about the latest ringtones for mobile phones and the latest songs as ringtones. The games and music categories are about computer games and the latest top hit songs respectively.

Figure 5.22: Top content categories for HSCOM1 in the second period of running

As illustrated in Figure 5.23, the top content categories on HSCOM6 include only spammers’ content contributions. Here, rather than the adult category, shopping-related topics are the most common among spam topics; this category contains topics on buying bags, shoes, boots etc. The swear category includes topics built around offensive words. The free category concerns downloading free material online. Finally, the movie category concerns watching movies online and online flash movies.

Figure 5.23: Top content categories on HSCOM6 (only spam topics)

5.8.3 Origin countries results

Figure 5.24 presents the top demographic location of spammers’ origins on HSCOM1.

Demographically, 51% of observed spammers originated from the United Kingdom, followed by the

United States of America at 22%.

Figure 5.24: Top demographic location of spammers on HSCOM1

Figure 5.25 presents the top demographic locations of spammers on HSCOM1 in the second period of operation. Spammers originated mainly from Germany, with 67.92%. The remaining 32.08% is divided amongst 8 countries: Luxembourg 8.54%, Netherlands 7.79%, Russia 6.79%, Ukraine 3.23%, USA 3.22%, Latvia 1.85%, UK 0.42% and France 0.22%, all significantly lower than Germany.

Figure 5.25: Top demographic locations of spammers on HSCOM1, period 2.

Figure 5.26 shows the top demographic locations of spammers on HSCOM6. The UK has the highest percentage with 56.17%, followed by Ukraine (20.50%) and Germany (16.27%). The remaining countries are as follows: Netherlands 3.77%, Russia 3.02%, USA 0.15%, Latvia 0.04%, China 0.03%, Poland 0.03%, and Sweden 0.01%.

Figure 5.26: Top demographic location of spammers on HSCOM6


5.8.4 Internet browsers results

Figure 5.27 illustrates the top user agent values on HSCOM1. The Internet browsers commonly used by spammers are Internet Explorer at 71% (including all versions) and Opera at 24%. Surprisingly, the relatively unknown browser OffByOne is also used. Additionally, Firefox was used by only 2 web spambots, and Safari was not used at all.

Figure 5.27: Top user agent fields for HSCOM1

Figure 5.28 presents the top Internet browsers that were used to send requests on HSCOM1 in period 2. Internet Explorer dominates at 75.56%, followed by Opera at 19.03%, the unknown browser OffByOne at 2.72% and AOL at 2.69%.

Figure 5.28: Top user agent fields for HSCOM1 second period


Figure 5.29 shows the usage percentage of Internet browsers on HSCOM6. Internet Explorer 6.0 is used in over 70% of requests, followed by Opera at 14%, Internet Explorer 5.5 at 8%, AOL at 5% and Internet Explorer 5.0 at 3%.

Figure 5.29: Top user agents fields for HSCOM6

This section examined the results of experimentation using the proposed prototype. This concludes stage six of the conceptual process and paves the way for the next stage. Key discussion and observations are presented in the following section.

5.9 Validation and Comparative Analysis

The following observations have been made during general statistical analyses.

Using search engines to find target websites

HSCOM1 (for two periods) and HSCOM6 received the most attention from the spammers. This

highlights the possibility that spammers are using search engine intelligence (e.g. Google PageRank

scores) to identify and focus on higher ranked websites that potentially yield more visitor traffic. The

second period of HSCOM1 proportionally attracted more spammers than did the first period. The

reason behind this could be related to search engine indexing, which also suggests that some

spammers find their targets through search engine results.

Creating numerous user accounts


It was identified from the experiment that spammers would create a number of new accounts to post their content rather than using their existing accounts. This is shown in Table 5.6, where the total number of user accounts is much higher than the number of unique IP addresses used between these accounts.

It is possible that spammers adopt this behaviour for additional robustness, to delay account banning (e.g. one user account associated with numerous spam content items is easier to detect and manage) and/or to continue operating after their previous accounts have been banned.

Low hits and revisit rates

The observed spammers adopt surfing patterns similar to one another and commonly visit a number of specific web pages. As illustrated in Figure 5.18a-20a, most spammers perform fewer than 20 page hits in total during their account lifetime.

Figure 5.18b-20b supports the claim that spammers do not revisit web pages many times. It is likely that these spammers revert to creating new user accounts to continue their spamming activities.

Distributing spam content in a short period of time

The amount of time most spammers spend on each visit (session) is less than 5 seconds (Figure 5.18c-

20c). This indicates that spammers are distributing spam content within a short period of time, possibly to move on to other websites to distribute more spam content.

However, this behaviour changed during the second operational period of HSCOM1 and on HSCOM6, where the spammers stayed longer (10 to 50 seconds) on the website. The reason for this is the Flood Control protection mechanism in the forum, which forced spammers to wait at least 30 seconds before posting new content with the same username.

No web form interaction

While not shown in the figures, the form-tracking component of HoneySpam 2.0 was unable to track any form usage behaviour from spammers’ tools. These tools adopt a conventional method of posting directly through their own forms/interfaces to submit content; they have not been designed to be intelligent enough to mimic human behaviour when submitting a web form.


Only two hosts, i.e. HSCOM1 and HSCOM6, received Spam 2.0 traffic. The rest of the hosts did not receive any spam traffic. The reasons for this are as follows.

• HSCOM2 – HSCOM5 and HSFREE1 – HSFREE5 did not contain any content. Hence, they were not effectively indexed by search engines. As the results for the other hosts show, spammers use search engines to find and target websites; since these hosts were not indexed highly, they did not receive Spam 2.0 traffic.

• HSFREE1 – HSFREE5 were hosted on free hosting providers. Usually, free hosts rank lower than commercial hosts in search engine results; hence, they might not be listed highly.

• The HSCOM1 domain name had already been used for other purposes. Hence, it had already been indexed by search engines when the domain name was repurposed to track spammers.

• HSCOM6 was an operational, live online discussion forum. Once it was indexed by search engines, HoneySpam 2.0 started receiving Spam 2.0 traffic.

However, more in-depth analysis needs to be undertaken to extract other characteristics of spam users

from the accumulated data. This is the focus of the next chapter on Profiling Spam 2.0 behaviour using

clustering tools. The next section provides a comparative analysis of HoneySpam 2.0 with existing

solutions.

5.9.1 Comparative analysis

In this section, the proposed method is compared with the existing solutions from the literature. As

mentioned previously, no studies have been conducted on Spam 2.0 and its characteristics.

HoneySpam 2.0 is intended to track and characterise Spam 2.0. However, there are several similar

works in the literature which study spam in other domains including email and social networks. These

methods will now be compared with the proposed method.

There are three approaches for monitoring spam which include mining web server logs, HoneySpam,

and Social Honeypots. A comparison study is depicted in Table 5.7 which shows the proposed

HoneySpam 2.0 method compared with existing approaches.


Table 5.7: Comparative study of tracking approaches

Mining web server logs (Cooley et al. 1999) – Spam 2.0 user identification: No; Spam 2.0 session identification: No; strategy: mining HTTP access log files; form usage tracking: No; domain: general.

HoneySpam (Andreolini et al. 2005) – Spam 2.0 user identification: No; Spam 2.0 session identification: No; strategy: deployment of a fake web server, open relay, open proxy and SMTP server; form usage tracking: No; domain: email.

Social Honeypots (Webb et al. 2008) – Spam 2.0 user identification: No; Spam 2.0 session identification: No; strategy: creation of profiles in a social networking website using automated bots; form usage tracking: No; domain: social networking websites.

Social Honeypots (Lee et al. 2010) – Spam 2.0 user identification: No; Spam 2.0 session identification: No; strategy: creation of profiles in a social networking website using automated bots; form usage tracking: No; domain: social networking websites.

HoneySpam 2.0 – Spam 2.0 user identification: Yes; Spam 2.0 session identification: Yes; strategy: tracking navigational and form usage behaviour on web applications; form usage tracking: Yes; domain: Web 2.0 platforms.

5.9.1.1 Mining web server logs

Web server logs are created and updated by the web server whenever a client requests information from a web server using the Hypertext Transfer Protocol (HTTP) (Robert Cooley et al. 1999). These access log files contain information about the client IP address, date and time of the request, the requested web page, the HTTP status code, the referrer web page, and the user agent. Figure 5.30 presents an example of 4 HTTP access log entries.


Entry No. Detail

1 111.111.111.111 - - [15/Aug/2009:07:05:10 +0800] "GET /forum/index.php HTTP/1.0" 200 60 "http://example.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13 (.NET CLR 3.5.30729)"

2 111.111.111.111 - - [15/Aug/2009:07:06:12 +0800] "GET /forum/index.php?board=1 HTTP/1.0" 200 6248 "http://example.com/forum/index.php" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13 (.NET CLR 3.5.30729)"

3 111.111.111.111 - - [15/Aug/2009:07:08:32 +0800] "GET /forum/index.php?action=post;board=1 HTTP/1.0" 200 7852 "http://example.com/forum/index.php?board=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13 (.NET CLR 3.5.30729)"

4 111.111.111.111 - - [15/Aug/2009:07:10:27 +0800] "GET /forum/index.php?topic=1397 HTTP/1.0" 200 524 "http://example.com/forum/index.php? action=post;board=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13 (.NET CLR 3.5.30729)"

Figure 5.30: An example of 4 HTTP access log entries

Such web server logs can be analysed through the Web Usage Mining technique to produce inputs for

restructuring website designs (Robert Cooley et al. 1999). Web usage mining can be used to aid

website navigation, provide similar web page recommendation, and place web pages for easier access.

For example, web usage mining can be utilised to extract the following association rules from a sample

pharmaceutical website as presented in Table 5.8.

Table 5.8: Examples of association rules extracted using the web usage mining technique

Rule No. Detail

1 47% of the visitors who access a web page about Hair Loss also accessed a web page on Acne.

2 52.1% of the visitors who accessed web pages about Chest Pain and Blood Pressure also accessed a web page on Breast Cancer.

However, web usage mining techniques cannot provide effective insights into Spam 2.0 content for

the reasons given below.

1. Identifying spam usernames and tracking their navigation usage is critical in characterising Spam 2.0 behaviour. As is clear from Figure 5.30, usernames are not identifiable from raw web server logs. Additionally, the user identification heuristics in web usage mining mentioned in Section 5.6.2.2 are not effective enough to differentiate users' exact behaviour on Web 2.0 applications.

2. The session identification heuristics (e.g. the 30-minute timeout) in web usage mining mentioned in Section 5.6.2.3 cannot effectively identify spam users' sessions. Figure 5.18c-20c shows that the majority of spam user sessions last less than 50 seconds. Additionally, spam users' User Agent and IP address did not change within a session.

3. Web server logs do not provide information about form usage. Such usage data can currently be tracked only through other techniques such as client-side scripting.

The above problems have been addressed by HoneySpam 2.0, which can effectively identify spam usernames and sessions. The AJAX technology used inside HoneySpam 2.0 can track form usage data.

5.9.1.2 Social honeypots

Regarding Spam 2.0 detection, (Webb et al. 2008) proposed social honeypots for the detection of spammers in social networking websites. They hosted 51 social honeypots in a social networking website. The results of their work show that the temporal and geographic patterns of social spam have unique characteristics. An extension of this work was proposed in (Lee et al. 2010).

Social honeypots are capable of monitoring spammers’ behaviour and logging data such as:

• Profile information such as ‘about me’, age, relationship status, number of friends etc.,

• Background images which link to other websites,

• Unsolicited friend requests in the social networks,

• Personal messages and comments in the social networks,

• Other textual information inside the profile.

The overall framework for a social honeypot-based approach is depicted in Figure 5.31. Each social honeypot is associated with a legitimate profile inside a social network website to collect social spam behaviour.


Figure 5.31: Framework for social honeypot-based approach (Lee et al. 2010).

However, the application of a social honeypot is not feasible for monitoring Spam 2.0 behaviour in

general due to a number of practical considerations:

• Social honeypots collect mainly contextual information from spam profiles. Hence, they cannot be applied to collect spam users’ navigational data.

• Social honeypots cannot track and monitor form usage data.

• Social honeypots are limited to only one spam domain i.e. social networking website. They cannot be extended to other platforms due to the nature of the tracking strategy.

HoneySpam 2.0 has a broader scope than social honeypots, which are limited to the social spam domain and social networking websites. The information collected by HoneySpam 2.0 is not traceable by social honeypots due to the latter's design constraints.

5.9.1.3 HoneySpam

(Andreolini et al. 2005) proposed a framework to address email spamming actions within an organisation. The authors proposed a method called HoneySpam to combat email spam activities such as email harvesting and anonymous operations. Email harvesting is an operation performed by spammers to collect and extract email addresses from web pages by browsing websites. Anonymous operations are activities performed by spammers to hide their traces. This can be done through the use of open relays (SMTP servers that do not require authentication to forward emails) and open proxies (network nodes that forward traffic between clients and servers without authentication). The use of multiple open relays and open proxy chains can hide a spammer's origin IP address.


HoneySpam includes four components: a fake web server, a fake open relay, a fake open proxy, and a fake destination SMTP server (Figure 5.32). The fake web server dynamically generates web pages that are linked together and contain specially generated email addresses. Fake web servers not only slow down email harvesting tasks but can also pollute spammers' email databases; the special email addresses, if used by a spammer, help to track the spammer's activity. The fake relay and open proxy, if used by a spammer, reveal origin information such as the IP address, and all communication to and from them is logged. The fake destination SMTP server redirects all the traffic sent to the special email addresses generated by the fake web server into a single mailbox for later analysis.

Figure 5.32: Architecture of HoneySpam to fight email spam (Andreolini et al. 2005)

However, it is clear that HoneySpam is unable to address Spam 2.0 problems since it is a framework to counteract email spam and not Spam 2.0. HoneySpam can track and compromise only email spammers, and therefore, it cannot be applied to Spam 2.0.

In this section, the validation and comparative study of the proposed method, HoneySpam 2.0, is discussed. It shows that HoneySpam 2.0 is superior to other existing solutions in addressing Spam 2.0 problems.

5.10 Conclusion

In this chapter, HoneySpam 2.0 was proposed to track, monitor and analyse Spam 2.0 behaviour.

HoneySpam 2.0 is used to monitor Spam 2.0 distribution, track spammer interaction with web applications and store Spam 2.0 content. This information is later analysed and investigated to discover the characteristics of Spam 2.0.


HoneySpam 2.0 can be integrated with any web application. It contains three parts: Tracking, Pre-

Processing, and Knowledge discovery. The tracking part captures incoming traffic to the web

application including web navigational and form usage data. The pre-processing part is for data

cleansing, user and session identification of incoming requests. Finally, the knowledge discovery part

identifies and characterises Spam 2.0 behaviour.

HoneySpam 2.0 has been executed on multiple web hosts including 6 commercial hosts and 5 free

web hosts.

The following chapter provides an in-depth analysis of Spam 2.0 behaviour by using clustering tools.


Chapter 6

Profiling Spam 2.0

This chapter covers:

► Definition of Actions as a feature set to characterise Spam 2.0

► Profiling spam behaviour using Self-Organising Map (SOM)

► Evaluation and validation of SOM study on real-world data.

6.1 Introduction

Several characteristics of Spam 2.0 were discussed in Chapter 5 when introducing HoneySpam 2.0.

HoneySpam 2.0 is used to monitor, track and store spam user activity on Web 2.0 platforms. General Spam 2.0 analyses such as web and form usage activity, demographic analysis, and frequency of content submission were discussed in Chapter 5, which provided an initial understanding of the current trend of Spam 2.0.

However, more in-depth analysis of Spam 2.0 is required in order to:

• Investigate hidden spam user behaviour

• Understand the intentions of spam users

• Formulate spam users’ behaviour in order to develop discriminative feature sets

• Cluster spam users’ activities into groups based on feature sets

• Characterise spam users’ behaviour in each cluster

• Develop a strategy to utilise feature sets for Spam 2.0 filtering


To achieve the abovementioned goals, this chapter presents a framework to model, formulate, and

cluster spam users’ behaviour. Based on the initial results of HoneySpam 2.0 (Section 5.8), three feature sets – Action Set, Action Time, and Action Frequency – are presented here to formulate spam users’ behaviour on Web 2.0 platforms, and a neural network clustering mechanism is used to cluster spam users’ behaviour based on these feature sets. The results of this chapter pave the way

for the development of a Spam 2.0 filtering solution.

This chapter is organised as follows: Section 6.2 provides an introduction to feature sets to model

spam users’ behaviour. Section 6.3 introduces the Kohonen’s Self-Organising Map (SOM), a neural

network clustering mechanism for clustering spam users’ behaviour. Section 6.4 provides the

framework and algorithm for profiling Spam 2.0. Section 6.5 presents the implementation of the

algorithm. The experimental settings are presented in Section 6.6. Finally, the results and observations from the experiment are discussed in Section 6.7.

6.2 Model Spam User Behaviour

Chapter 5 presented HoneySpam 2.0 as a framework to track spammers on Web 2.0 platforms. The

tracked data contain information about spam users’ web/form usage and submission. However, such

raw data cannot reveal many covert spam user behaviours. The data need to be formulated and

formatted in order to provide more insight into spam users’ activities and model their behaviour.

As mentioned in Section 5.9, there were no interactions with forms. Hence, form usage data is

excluded throughout this study.

The focus of this research is on behavioural analysis of Spam 2.0 and researchers have already

presented detailed studies in the literature on spam in general (Hayati and Potdar 2008)(Hayati and

Potdar 2009a) (Le, Jingbo, and Tianshun 2004). This section presents a framework for modelling and

formulating usage behaviour of spam users on Web 2.0 platforms.

In the literature, there is no mechanism provided which models spam users’ usage behaviour,

especially on Web 2.0 platforms (Hayati, Potdar et al. 2010) (Hayati, Chai et al. 2010). Moreover, raw

web usage data by itself is not descriptive and discriminative enough to allow a study of the

behavioural aspects of spam users’ activities on Web 2.0 platforms. Therefore, in order to formulate

web usage data as a more descriptive model, the proposed framework introduces three key concepts:


Transaction identification, Action identification, and Behaviour vector construction, which are discussed in the following sections.

6.2.1 Transaction

Transaction refers to the meaningful clustering of usage data (Robert Cooley et al. 1999). Transactions are defined so that they are appropriate and meaningful for the given analysis. Transaction identification involves the activities needed to construct meaningful clusters of usage data. In the proposed framework, three transaction identification approaches – Session level, User level and IP level – are used to make meaningful transactions.

6.2.1.1 Session level transactions

The session level transaction identification approach studies usage behaviour in each website visit.

This approach reveals whether or not user behaviour changes for different website visits. For example, in an online discussion board, during one session the user might just read the other users’ comments. While in another session, the user might reply to other users’ posts. The usage behaviour for these two sample transactions is different. The session level transaction identification approach groups the set of user web page requests based on the same session ID.

The session level transactions can provide detailed information about spam user behaviour each time they visit the website. It investigates whether users change their behaviour when they have different intentions, or whether they follow similar behaviour most of the time.

6.2.1.2 User level transactions

The User level transaction identification approach investigates users’ behaviour for the total time a user is active on a particular website. User level transaction contains entire sessions that belong to a user. It groups users’ behaviour for all of their website visits. The User level transaction identification approach groups the set of user web page requests based on the same username.

The User level transaction studies the behaviour of spam users over their whole lifetime on the website. It investigates the clusters formed by different spam users for the whole time they have been active on the website.

6.2.1.3 IP level transactions

The IP level transaction identification approach studies the behaviour of a specific host during the total amount of time that a host is connected to the website. The IP level might have more than one


user. The IP level transaction identification approach groups the set of user web page requests based

on the same IP address.

The IP level transaction studies behaviour of spam hosts by investigating the behaviour of all the

spam users that have the same IP address and clusters different behaviours within that host set.

Figure 6.1: Relationship of the three proposed approaches to transaction identification

Figure 6.1 illustrates the relationship among the three transaction identification approaches. It is clear from the figure that IP level transactions have a many-to-many relationship with the User level, while User level transactions contain multiple sessions (a one-to-many relationship).

The above three transaction identification approaches are used to define transactions in the proposed

framework. In the next step, web usage data is organised into meaningful groups based on each

transaction.

6.2.2 Action

‘Action’ refers to a set of user-requested web objects that are required to perform a certain task or achieve a particular purpose.

For instance, (1) a user can navigate to the registration page in an online discussion board, (2) fill in

the required fields and (3) press the submit button in order to register a new user account. This three-

step procedure can be modelled as a “Registering a user account” Action.

An Action integrates users’ activity on web platforms into meaningful groups. Each group contains

meaningful information about users’ tasks. The importance of Action is as follows.

• It acts as a suitable discriminative criterion to model user behaviour on Web 2.0 platforms. Actions highlight the intentions of users who perform tasks on Web 2.0 platforms.

• It is extendable to many other Web 2.0 platforms. For instance, the “Registering a user account” Action is performed on numerous Web 2.0 platforms (forums, social networks etc.),

as users often need to create an account in order to contribute content.


• It is language-independent. Actions are content-independent and model web usage behaviour, which is then language-independent.

• It can be implicitly formulated. Actions can be implicitly gathered and formulated while users navigate through websites. The collection and monitoring of Actions happen in the

background and do not interrupt users while they are using the system.

Actions can be generated manually. They can be specific to one Web 2.0 platform or general across multiple platforms. Table 6.1 presents examples of Actions.

Table 6.1: Examples of general (G) and specific (S) Actions

Action Web 2.0 platform Type

Registration of user account Forums, Wikis, Social networks, Comment systems etc. G

Login & Logout Forums, Wikis, Social networks, Comment systems etc. G

Search Forums, Wikis, Social networks, etc. G

Read content Forums, Wikis, Social networks, Comment systems etc. G

Edit / Update content Forums, Wikis, Social networks, Comment systems etc. G

Add content Forums, Wikis, Social networks, Comment systems etc. G

Update user’s profile Forums, Wikis, Social networks, Comment systems etc. G

Compose private message Forums, Social networks S

Send friendship request Social networks S

Create new page Wikis S

Send event invitation Social networks S

The main reason for studying Actions is that spam users’ intentions on Web 2.0 platforms are different from those of genuine humans. Spam users are comparatively more active in content distribution, whereas genuine humans spend more time on content consumption than distribution (Steven J. J.

Tedjamulia 2005). Therefore, factors such as frequency of different Actions, the amount of time a user spends on performing each Action etc. can be intelligently interpreted to differentiate between spam users and genuine ones.


In this research, three models or feature sets – Action Set, Action Time, and Action Frequency - are

proposed to formulate web usage data. These feature sets are used to differentiate between spam and

genuine behaviour.

6.2.2.1 Action Set

An Action Set is a binary model. It shows whether or not an Action is undertaken by a user. This

feature set shows whether or not spam users perform different Actions in different transactions. For

example, a spam user might concentrate only on those Actions that relate to submitting spam content.

6.2.2.2 Action Time

Action Time is a feature vector which represents the time taken to perform each Action. This feature

set shows how much time spam users would spend on performing each Action in different

transactions. For example, the amount of time to write a reply for a spam user might be a lot shorter

than for a human user since spam content distribution is typically done automatically.

6.2.2.3 Action Frequency

Action Frequency is a feature vector representing the frequency of each Action. This feature set shows how frequently spammers perform each Action in different transactions, and which Actions are common and which are not. For example, the content submission Action is more common among spam users than other Actions are.

Actions are formulated in a vector format known as a Behaviour vector to model a user’s web usage

behaviour. Behaviour vector construction is discussed in the next section.

6.2.3 Behaviour vector

The behaviour vector is a representation or model of a user’s web usage behaviour in the form of a

multi-dimensional vector. It is populated with a user’s Actions at different transaction levels. Three

Behaviour vectors are presented for Action Set, Action Time and Action Frequency.

6.2.3.1 Action Set vector

Given a set of web pages W = {w1,w2, … ,w|W|}, A is defined as a set of Actions, such that

A = \{a_1, a_2, \ldots, a_{|A|}\}, \quad a_i \subseteq W \quad (6.1)


Correspondingly, s_i is defined as

s_i \subseteq A, \quad i = 1, \ldots, T \quad (6.2)

where s_i refers to the set of Actions performed in transaction i, and T is the total number of transactions.

In order to build the input vector, each Action (a_i) is assigned as a feature item. Hence, an Action Set is represented as a bit vector

\mathbf{aS}_i = (b^i_1, b^i_2, \ldots, b^i_{|A|}) \quad (6.3)

where

b^i_j = \begin{cases} 1 & \text{if } a_j \in s_i \\ 0 & \text{otherwise} \end{cases} \quad (6.4)

6.2.3.2 Action Time vector

The amount of time a user spends on a specific web page in his/her navigation sequence is considered as ‘dwell time’. As defined in Eq. 6.5, dwell time is used to construct the Action Time vector.

d_i = t_{i+1} - t_i \quad (6.5)

where d_i is the dwell time for the i-th requested web page in transaction s, and t_i is the time at which that page was requested.

Under Eq. 6.5, it is not possible to calculate the dwell time for the last visited web page in a session; for example, a user navigates to the last web page on the website and then closes his/her web browser. In the proposed method, the average dwell time spent on the other web pages in the same session is taken as the dwell time for the last web page.

Action Time is defined as a vector whose components are

d^i_j = \sum_{w \in a_j} d_w \quad (6.6)

where d^i_j is the dwell time for Action a_j in s_i, equal to the total amount of time spent on the web pages in a_j. In cases where a_j occurs more than once, the average dwell time is calculated. The Action Time vector is:

\mathbf{aT}_i = (d^i_1, d^i_2, \ldots, d^i_{|A|}) \quad (6.7)

6.2.3.3 Action Frequency vector

Action Frequency is a vector in which h^i_j is the frequency of the j-th Action in s_i, and zero if that Action does not occur. The Action Frequency vector is:

\mathbf{aF}_i = (h^i_1, h^i_2, \ldots, h^i_{|A|}) \quad (6.8)
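As a small hypothetical illustration of the three vectors (the numbers are invented for exposition, not taken from the experiments): suppose |A| = 3 and, in one transaction, a_1 is performed twice (with dwell times of 1 s and 3 s), a_3 is performed once (6 s), and a_2 is not performed. Then:

% Hypothetical illustration only; values are not from the experiments.
\mathbf{aS} = (1,\, 0,\, 1), \qquad
\mathbf{aT} = (2,\, 0,\, 6), \qquad % a_1 averaged: (1 + 3)/2 = 2 s
\mathbf{aF} = (2,\, 0,\, 1)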

Each of the above Behaviour vectors is multi-dimensional. In order to understand the Behaviour vectors, as well as to profile the characteristics of each, a tool called a Self-Organising Map (SOM) is used. SOM is utilised to visualise as well as to cluster the data. The next section provides a discussion on clustering spam user behaviour using a SOM.

6.3 Cluster Spam User Behaviour

Behavioural vectors are multi-dimensional and are therefore difficult to visualise. In order to understand this data, a Self-Organising Map (SOM) is used to reduce the dimensionality of the data as well as to cluster the data into similar groups. Each cluster provides an insight into the nature of Spam 2.0 behaviours and their characteristics.

6.3.1 Self-Organising Map (SOM)

SOM – a neural network clustering algorithm – can be employed for clustering and visualisation of

high dimensional data (Kohonen 1990). The output of the SOM is a map which is a visual

representation of an input vector. Maps can be generated in two or three dimensions but two-

dimensional maps are more common (Kohonen 1990).


SOM has two phases: training and mapping. In the training phase, an input vector in the form of a one-dimensional array is presented to the map. Next, the weights of the map components (called nodes) that neighbour the input vector are strengthened, i.e. moved closer to the input. The node closest to the input vector is called the Best Matching Unit (BMU). Consequently, in the mapping phase, similar input vectors fall in the same region of the map and can be grouped together as a cluster.

6.3.1.1 SOM definition

SOM starts by initialising each node's weight vector, w_i, with small random values (w_i here denotes a SOM weight vector, not the set of web pages discussed in Section 6.2.3.1). Next, an input vector x is presented to the map, and the distance between x and every node on the map is calculated as in Eq. 6.9.

d_j = \lVert x - w_j \rVert \quad (6.9)

The node with the minimum d_j is selected as the BMU. After finding the BMU, SOM updates w_i for the BMU and its neighbouring nodes according to Eq. 6.10 so that they move closer to the input vector.

w_i(t+1) = w_i(t) + h_{ci}(t)\,[x(t) - w_i(t)] \quad (6.10)

where t is the time-step and h_{ci}(t) is a Gaussian neighbourhood kernel that decreases with time:

h_{ci}(t) = \alpha(t)\, \exp\!\left(-\frac{\lVert r_c - r_i \rVert^2}{2\sigma^2(t)}\right) \quad (6.11)

where r_c and r_i are the positions of the BMU and of node i on the map, σ(t) is the neighbourhood radius, α(t) is a linear learning rate, and T is the training length (the neighbourhood size being less than the number of nodes in one dimension of the map).

6.3.1.2 Unified distance matrix

Unified distance matrix (U-Matrix) visualises the distance between adjacent nodes in SOM. It can be employed to determine the number of clusters in SOM as well as their closeness. In this study, SOM and the U-matrix are used as visualisation tools to cluster different Spam 2.0 behaviours.


Figure 6.2: U-Matrix for mapping of 8 RGB colour vector on two dimensions

For example, in the Red Green Blue (RGB) colour model, three primary colours generate all other colours (Desai 2008). An RGB colour is presented as a triple (red, green, blue) containing three components corresponding to red, green and blue; the value of each component varies from zero to a defined maximum value. Figure 6.2 illustrates the U-Matrix for 8 RGB triple vectors on two

dimensions.

SOM and U-Matrix are used to cluster and project spam users’ data. The following section provides

discussion on input vectors that are used with SOM.

6.4 Algorithm

Three Behavioural vectors – Action Set, Action Time and Action Frequency – are used as inputs to

SOM. The output of this process is a 2D U-Matrix map for each Behavioural vector. The projection of

input data from a high dimension to a 2D map allows understanding of each feature set, grouping

similar input vectors, and studying each cluster of Spam 2.0 behaviour.

The three transaction levels and Behavioural vectors that were discussed earlier in this chapter are

combined to construct 9 input vectors. Table 6.2 presents 9 feature vectors that are used in SOM.


Table 6.2: Nine feature vectors used in SOM.

Action Set (aS) Action Time (aT) Action Frequency (aF)

Session (S)* (S, aS) (S, aT) (S, aF)

User (U) (U, aS) (U, aT) (U, aF)

IP (I) (I, aS) (I, aT) (I, aF)

* Where S, U and I are the sets of sessions, users and IP addresses, respectively.

Each input vector is used as input for SOM and is projected using the U-Matrix. To study the clusters, the centre node of each cluster is chosen to describe the characteristics of that particular cluster.

6.4.1 Flowchart

Figure 6.3 illustrates the flowchart diagram for the clustering algorithm. This process consists of the following steps:

Step 1: Get Behavioural vectors for each feature set – Action Set, Action Time and Action Frequency.

Step 2: Generate 9 input vectors based on each feature set and three transaction levels.

Step 3: Use input vectors to train SOM.

Step 4: Generate U-Matrix corresponding to each input vector.

Step 5: Investigate the centre node of each cluster in U-Matrix to profile Spam 2.0 behaviour.


Figure 6.3: Flowchart for the proposed algorithm

6.5 Implementation

PHP and MATLAB are used to implement the proposed framework. MATLAB is well-known tool for

data processing and data analysis. In the proposed method, MATLAB is used to generate general

statistical reports. Additionally, the MATLAB Neural Network Tool Box (SOM Tool Box) is utilised

for SOM and U-Martix generation. The PHP codes are used to perform transaction identification,

generating Actions and Behavioural input vectors.

6.5.1 Transaction identification

As mentioned earlier, the transaction identification task considers three levels: session, user and IP. In

order to group transactions at each level, the following SQL statement is employed.


GROUP BY (session, user, IP column name)

For example, in order to group data in session, user, and IP transaction levels, the following SQL statements are created.

• GROUP BY (session_id)

• GROUP BY (username)

• GROUP BY (clientIP)

These SQL statements are queried in conjunction with the other SQL queries to retrieve data for each transaction level.
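As a concrete illustration, the session-level grouping can be combined with a SELECT to pull back each session's ordered navigation in one query. The sketch below is hypothetical: the table and column names (tracking, session_id, url, ts) are illustrative and not the actual HoneySpam 2.0 schema.

<?php
// Hypothetical session-level transaction retrieval from the tracking table.
$pdo = new PDO('mysql:host=localhost;dbname=honeyspam', 'user', 'pass');

// One row per session, with the ordered list of requested URLs.
$sql = "SELECT session_id,
               GROUP_CONCAT(url ORDER BY ts SEPARATOR ' -> ') AS navigation
        FROM tracking
        GROUP BY session_id";

foreach ($pdo->query($sql) as $row) {
    echo $row['session_id'] . ': ' . $row['navigation'] . PHP_EOL;
}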

6.5.2 Action Set vector

The Action Set is constructed by the algorithm in Code 6.1.

Code 6.1: Pseudo-code for the Action Set

Algorithm action_set(T, A)
input:  set of transactions by the given user, T
        set of Actions, A
output: set of Action Set for the given user, s
 1. s = null
 2. for each ti in T
 3.   create new action a
 4.   for each wj in ti
 5.     add wj to a
 6.     if a in A
 7.       if a in s
 8.         discard a
 9.       else
10.         s = s + a
11.       end if
12.       ti = ti - a
13.       if length(ti) > 0 then
14.         create new action a
15.       end if
16.     else
17.       if j = length(ti) then
18.         discard a
19.       end if
20.     end if
21.   repeat
22. repeat
23. return s

For example, suppose


W = {w1, w2, w3, w4, w5}

T = {t1, t2} (set of transactions for the given user)

where

t1 = {w1, w2, w1, w5} (in this transaction the user navigates from w1 to w2 and then again to w1 and w5)

t2 = {w1, w3} (the user navigates to w1 and then to w3)

The set of defined Actions is A = {a, b, c}

where

a = {w1, w2} (register to the website)

b = {w3, w4} (post new topic)

c = {w1, w5} (login to the website)

the set of Actions performed by the given user is

s = {a, c} (the user registers and then logs in to the website)

On line 7, the algorithm checks whether the Action has already been added to the set, s; if so, a is discarded (line 8), since s is a set and holds each Action only once. Otherwise, a is added to the output set (line 10).

6.5.3 Action Time vector

The Action Time algorithm – which measures the amount of time spent performing each Action – is presented in Code 6.2.


Code 6.2: Pseudo-code for Action Time algorithm.

Algorithm action_time(T, A)
input:  set of transactions by the given user, T
        set of Actions, A
output: set of Action Time for the given user, s
 1. s = null
 2. for each ti in T
 3.   create new action a
 4.   for each wj in ti
 5.     add wj to a
 6.     if a in A
 7.       if a in s
 8.         sa.time = (sa.time + a.time) / 2
 9.       else
10.         add a, a.time to s
11.       end if
12.       ti = ti - a
13.       if length(ti) > 0 then
14.         create new action a
15.       end if
16.     else
17.       if j = length(ti) then
18.         discard a
19.       end if
20.     end if
21.   repeat
22. repeat
23. return s

The main difference between the Action Set and Action Time algorithms is on lines 8 and 10. On line 8, the Action time is averaged with the time already stored in the set, and on line 10, the Action name and Action time are added to the set.

6.5.4 Action Frequency vector

Similarly to the Action Set algorithm, the Action Frequency algorithm is defined in Code 6.3.


Code 6.3: Pseudo-code for Action Frequency algorithm.

Algorithm action_frequency(T, A)
input:  set of transactions by the given user, T
        set of Actions, A
output: set of Action Frequency for the given user, s
 1. s = null
 2. for each ti in T
 3.   create new action a
 4.   for each wj in ti
 5.     add wj to a
 6.     if a in A
 7.       s = s + a
 8.       ti = ti - a
 9.       if length(ti) > 0 then
10.         create new action a
11.       end if
12.     else
13.       if j = length(ti) then
14.         discard a
15.       end if
16.     end if
17.   repeat
18. repeat
19. return s

In the Action Frequency algorithm, the count for each Action a is added to the output set.

6.5.5 Input vector

The combination of the three feature vectors – Action Set, Action Time, and Action Frequency – and the three transaction levels – Session, User, and IP – results in 9 input vectors that can be used for SOM and U-Matrix generation. The three algorithms in Code 6.4 are used to construct the Action Set, Action Time, and Action Frequency input vectors for each transaction level.


Code 6.4: Pseudo-codes for the construction of feature input vectors.

Algorithm action_set_input_vector(k, sk, A)
input:  given user, k
        set of user Action Set, sk
        set of Actions, A
output: Action Set input vector, aS
1. for each ai in A
2.   if ai in sk
3.     aSi = 1
4.   else
5.     aSi = 0
6. return aS

Algorithm action_time_input_vector(k, sk, A)
input:  given user, k
        set of user Action Time, sk
        set of Actions, A
output: Action Time input vector for the given user, aT
1. for each ai in A
2.   if ai in sk
3.     aTi = ai.time
4.   else
5.     aTi = 0
6. return aT

Algorithm action_frequency_input_vector(k, sk, A)
input:  given user, k
        set of user Action Frequency, sk
        set of Actions, A
output: Action Frequency input vector for the given user, aF
1. for each ai in A
2.   if ai in sk
3.     aFi = ai
4.   else
5.     aFi = 0
6. return aF

The algorithm action_set_input_vector(k, sk, A) obtains the Action Set vector for a user. It iterates through A and, if Action ai has been performed by the user, sets aSi to 1, otherwise to 0. The algorithm returns the Action Set input vector, aS, on line 6.

The algorithm action_time_input_vector(k, sk, A) iterates through the Actions; if an Action exists in sk, the algorithm assigns its dwell time to aTi (line 3), otherwise it assigns a zero value (line 5). On line 6, the algorithm returns the Action Time input vector, aT.

Similar to the Action Set and Action Time input vector algorithms, the Action Frequency input vector algorithm action_frequency_input_vector(k, sk, A) iterates through the Actions; if an Action exists in the user's set, sk, its frequency is assigned to aFi (line 3), otherwise a zero value is assigned (line 5). The algorithm returns the Action Frequency input vector, aF, on line 6.
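The three input-vector algorithms can be collapsed into a single pass over the Action list A. The PHP sketch below illustrates this, assuming (hypothetically) that the user's data has already been aggregated into a map from Action key to a (count, total time) pair; this shape is for illustration only, not the actual implementation.

<?php
// Build the aS, aT and aF input vectors for one user in a single pass.
// $actions is the ordered Action list A; $userActions maps an Action key to
// [count, totalTime] for the given user (hypothetical, pre-aggregated shape).
function buildVectors(array $actions, array $userActions): array {
    $aS = $aT = $aF = [];
    foreach ($actions as $i => $key) {
        if (isset($userActions[$key])) {
            [$count, $totalTime] = $userActions[$key];
            $aS[$i] = 1;                   // Action Set: Action was performed
            $aT[$i] = $totalTime / $count; // Action Time: average dwell time
            $aF[$i] = $count;              // Action Frequency: how often
        } else {
            $aS[$i] = 0; $aT[$i] = 0; $aF[$i] = 0;
        }
    }
    return [$aS, $aT, $aF];
}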

6.6 Experimental Setting

This section discusses the SOM and U-Matrix experiment settings. The dataset is collected through

HSCOM1 forum data (Section 5.7.1.1) over a period of one month. Table 6.3 presents the parameter

specifications of the experiment.

Table 6.3: Parameter specifications for the experiment

Parameter Value Description

No. of session transactions 3,726 Total number of transactions at the session level

No. of user transactions 1,067 Total number of transactions at the user level

No. of IP transactions 871 Total number of transactions at the IP level

No. of Actions, |A| 11 Manually defined Action list

For transaction identification, HSCOM1 forum data is grouped into three transaction levels: session,

user and IP. Eleven Actions are manually extracted from the dataset (A). The details of each Action are presented in Table 6.4. The feature sets – Action Set, Action Time, and Action Frequency – are measured based on this pre-defined Action list.

Table 6.4: Summary of Actions performed by spammers on HSCOM1 forum data

Action Key Action Index Description

1 A View root page

2 B View topics

3 C Start new topic

4 D View topic (view their own created topic)

5 E View user profile information

6 F Edit user information

7 G Start new poll

8 H User authentication (Login)

9 I Reply to topic


10 J Recently submitted topics

11 K View overall forum statistic

In order to provide a clear insight into each Action, Figure 6.4 represents the frequency of each Action at the three transaction levels in the dataset.

Figure 6.4: Frequency of Actions on HSCOM1 forum data

As is clear from the figure, "Starting new topic" is the most common Action among spam users at all the transaction levels. The Actions "View root page" and "View user profile information" are common in session-level transactions. The Action "View topic" is also common in IP-level transactions.

Nine input vectors are constructed. Table 6.5 presents a sample input vector for Action Time that is used in SOM.


Table 6.5: Sample Action Time input vector used in SOM experiment

a1 a2 … a|A|

t1 5 4 … 0

t2 0 4 … 0

… … … … …

t|T| 0 2 … 1

6.7 Results

In the following section, the results from each experiment and the characteristics of the centre node

inside each cluster are discussed.

6.7.1 Experiment 1: Session level result

Based on the three experiments at the session level, Figure 6.5 illustrates the U-Matrix for (S, aF). The dark lines represent large distances between nodes, while the light areas represent nodes that are close to each other. The dark lines separate the map into the major clusters. The details of each cluster are

shown in Table 6.6.


Figure 6.5: Result of U-matrix for session and frequency of actions.

Table 6.6: Major characteristics of spam users at the session level

Cluster 1 Cluster 2 Cluster 3

Dwell time N/A 2,3 4,6,7

Frequency 1 2,3 4,9

Actions C, A AE, BB, BCB AEFF, AEFFABCBD

Three major clusters are extracted from the above experiment and are labelled: Content Submitters,

Profile Editors and Mixture.

Cluster 1 (Content Submitter): Spam users in this cluster perform the Starting New Topic Action. They did not perform any other Action at this level; hence, the dwell time for their Action cannot be calculated. This behaviour reveals that spammers do not navigate through the website: they directly submit the content to the forum. Additionally, their Action Set contains no request for the form page associated with Action C (start new topic), indicating that they do not interact with the server form in order to submit their content.

Cluster 2 (Profile Editor): The goal of spam users in this cluster is to edit their own profile page. A profile page consists of various types of information such as name, email address, homepage URL,


signature, etc. Spam users navigate from Action A (view root page) to Action F (edit user

information) with a frequency of one for each Action. The dwell time for their action is between 2-3

seconds. This strengthens the claim that they do not use forms to submit their profile details, since

their dwell time is very low (2-3 seconds).

Spam users, by modifying the profile information, are able to promote either their spam URLs or

content.

Cluster 3 (Mixture): Spammers in this cluster performed both of the abovementioned sets of Actions.

They navigate through the following action sequence (A,E,F,F,A,B,C,B,D). The average frequency

for each Action is 2, which means they perform the above sequence twice in each session. This

behaviour again shows that spammers did not use forms to submit content or modify their profile,

since this sequence cannot feasibly be performed in 4 – 7 seconds. Additionally, they navigate to

view their submitted content in order to increase the view count of their topic, as well as to possibly

ensure that their submitted content is published.

6.7.2 Experiment 2: User level results

User level tracking provides information about the characteristics of each spammer for their total

active time in the forum. It can show whether their behaviour has changed over time. The results of

three different experiments at the user level are:

Cluster 1 (Content Submitter): Spam users use their user account only to submit spam content

directly to the website. They use their username only once, since only one session belongs to each user in this cluster. The dwell time is zero (or unknown, as there is only one web page request), the frequency of the Action is either one or two, and they perform only Action C (start new topic).

Cluster 2 (Content Submitter and Profile Editor): Spam users start their Action navigation by

visiting the root page, modifying their profile page and then submitting spam content. Their average

dwell time is 2.5 seconds for each Action with a total frequency of 8 actions per user level. Their

navigation sequence is (A,E,F,F,A,B,C,B). Once the Actions were performed, the spammers did not

use the user account again.

Cluster 3 (Mixture): In this cluster, the average dwell time of spam users is 7 seconds while navigating through (A,A,E,F,F,A,B,C,B,B,D) with a frequency of 11. They combine both profile


editing and content submission behaviours. They view their own created topic at the end of their navigation set. Similar to the previous clusters, spam users have just one session for each user account in this cluster.

An investigation of spam users at the user level reveals that once they have created a username, they use it only once, and they do not change their behaviour during their lifetime. Table 6.7 presents details of each cluster. Figure 6.6 illustrates the U-Matrix map for (U, aF). It shows that the majority of spam users at the user level have similar characteristics, as the map is dominated by light areas.

Figure 6.6: U-Matrix representation of user level and action frequency.

Table 6.7: Major characteristics of spam users in the user level

Cluster 1 Cluster 2 Cluster 3

Dwell time N/A 2.3 2.5

Frequency 1 8 11

Actions C, CC AEFFABCB AAEFABCBBD


6.7.3 Experiment 3: IP level results

The IP level presents a high level view of spam users’ behaviour. The experiment resulted in the

discovery of three major clusters which include:

Cluster 1 (Mix): This cluster has a frequency of 4–5 for each Action in the Action Set. Spam users in this cluster navigate through the root web page, modify their profile and submit spam content. Finally, they end their navigation by viewing their own created content. The frequency shows that they created multiple accounts under one IP address to perform their Actions.

Cluster 2 (Content Viewer): Spam users in this cluster only perform Action D (viewing their own topic). This behaviour was not seen at the previous levels. Further investigation revealed that spam users change their IP address once they submit the spam content. Therefore, some use multiple IP addresses for one user (a many-to-many relationship between the IP and user transaction levels).

Cluster 3 (Big Hit Mix): The Action Frequency in this cluster is more than 100 for submitting spam content or modifying profiles. This cluster has similar characteristics to those of cluster 1. However,

the higher Action Frequencies show that spammers use multiple usernames to perform the same

Actions.

The experiment involving an IP address dwell time map does not result in clear clusters. This may be

due to a change of IP address by spammers at the end of navigation which results in unknown dwell

time as well as multiple usernames per IP address that produce various dwell times. Table 6.8 presents

details of each experiment at IP level. Figure 6.7 represents the U-Matrix map of IP level and Action

Frequency.


Figure 6.7: U-Matrix for IP level and Action Frequency

Table 6.8: Major characteristics of spam users in the IP level

Cluster 1 Cluster 2 Cluster 3

Frequency 8 11 100+

Actions AEFFABCB D… AEFABC…

6.7.4 Discussion

The U-Matrix visualisation in some of the experiments produced a number of minor clusters. Although these minor clusters were investigated, they did not reveal useful distinctive behaviour for Spam 2.0 characterisation. For example, a cluster was identified for spam users who perform Action A (visiting the root page) multiple times. This Action does not reveal any interesting characteristics since it can be seen among genuine users as well. Hence, reports for such minor clusters were not included in the results.

Self-organising maps, and neural networks in general, do not offer much interpretability as to why nodes belong to each cluster. They can be viewed as black-box algorithms which are fed inputs and generate outputs.


6.8 Conclusion

This chapter presents a framework to investigate, formulate, and cluster Spam 2.0 behaviour. Based

on initial results of HoneySpam 2.0 (Section 5.8), three feature sets – Action Set, Action Time, and

Action Frequency - are presented here to formulate Spam users’ behaviour on Web 2.0 platforms. A

neural network clustering mechanism is used to cluster spammer behaviour based on the three feature

sets. The following observations have been made during the unsupervised clustering experiments.

• Spam users focus on specific and limited Actions in the web applications.

• Spam users use multiple user accounts to spread spam content, hide their identity and bypass restrictions.

• Spam users do not fill in submission forms and directly submit the content to the Web server in order to efficiently spread spam.

• Spam users can be grouped into four different categories based on their Actions: content submitters, profile editors, content viewers and mixed behaviour.

• Spam users change their IP address for different Actions to hide their tracks.

The following chapter provides the first early-detection based Spam 2.0 filtering method based on

behaviour analysis.


Chapter 7

Early-Detection based Spam 2.0

Filtering

This chapter covers:

► Introduction to Early-detection-based Spam 2.0 filtering solution.

► Presentation of a framework to detect Spam 2.0 by analysing web usage data and evaluate its

feasibility in combating the distribution of Spam 2.0.

► Utilising Action Set, Action Time and Action Frequency for Spam 2.0 identification.

► Evaluation, validation and a comparative study of the performance of the proposed solution

with real-world data and state of the art approaches.

7.1 Introduction

This chapter presents a method for Early-Detection based Spam 2.0 Filtering called EDSF. This

Spam 2.0 filtering method is used to identify spam behaviour prior to, or shortly after, spam content submission. EDSF investigates the behaviour of spam users, e.g. automated tools and spambots.

EDSF, the proposed method presented in this dissertation, automatically distinguishes Spam 2.0 submissions from genuine human content contributions on Web 2.0 platforms. EDSF makes use of web usage navigation behaviour to build a discriminative feature set. This feature set is later passed to a machine learning classifier known as a Support Vector Machine (SVM) for classification.

The theoretical foundation for EDSF, along with its algorithm design to address all the requirements laid out in Chapter 4, is discussed here. Testing, experimentation and observation of the results along with their validation are carried out at the end of this chapter.


7.2 General Overview of EDSF

The main assumption of EDSF is that human web usage behaviour is intrinsically different from

Spam 2.0 behaviour. The reason for this, as outlined in Section 6.7, is due to spam users having

different intentions. Spam users visit the websites mainly to spread spam content rather than

contribute to the content. Hence, by investigating and mining web usage data, it is possible to

distinguish them from genuine human users.

Web usage data can be implicitly gathered while users surf through websites. This raw data needs to be

converted into a different format to highlight behavioural differences between spam and genuine

users. Therefore, three new feature sets namely, Action Set, Action Time, and Action Frequency, are

used in EDSF to study spammers’ behaviour.

As mentioned earlier, an Action can be defined as a set of web objects requested by a user in order to perform a certain task or purpose (Section 6.2.2). Actions are a suitable discriminative feature for modelling user behaviour within forums, and are also extendable to many other Web 2.0 platforms.

EDSF utilises Action Set, Action Time, and Action Frequency to formulate web usage behaviour. As

discussed in Section 6.2.2, Action Set is the set of Actions performed by a user. Action Time is the amount of time spent performing a particular Action. Action Frequency is the frequency with which a certain Action is performed.

EDSF uses these three features in a machine learning classifier (SVM) to differentiate spam from

genuine users.

This chapter presents:

1. An introduction to Early-Detection based Spam 2.0 Filtering (EDSF) approach.

2. A framework to detect Spam 2.0 by analysing web usage data and evaluate its feasibility in combating the distribution of Spam 2.0.

3. The implementation of Action Set, Action Time and Action Frequency for Spam 2.0 identification.

4. An evaluation and validation of EDSF with real-world data.


5. A comparative study of EDSF with state-of-the-art Spam 2.0 filtering approaches.

7.3 Requirements

The following requirements are laid down for the proposed solution. This represents the first stage of the conceptual process where requirements are elicited and prioritised. The requirements of the solution are:

1. Extendibility: the method should be extendable to all Web 2.0 platforms.

2. Adaptability: the method should be adaptable to new Spam 2.0 behaviours.

3. Automaticity: the method should automatically distinguish Spam 2.0 from genuine human behaviour.

4. Independence: the method should be independent of its platform. For example, it should be language- and content-independent.

5. Implicitness: the method should implicitly accumulate behavioural data.

6. Discriminative features: the features should be discriminative enough to differentiate genuine human user behaviour from spammers’ behaviour.

The design rationale is based on the abovementioned requirements. A design decision for each requirement is made in the design rationale along with a discussion of how the requirement is met.

7.4 Design Rationale

This section provides a design rationale to meet all the requirements listed above. This is the second stage of the conceptual process. The following design decisions are proposed in order to address each requirement.

1. Develop an early-detection based Spam 2.0 filtering method that is language- and platform- independent (Req. 1, 4).

2. Capture server side web usage data while the user interacts with the website (Req. 1, 5).


3. Develop an automated supervised machine learning classifier that can be trained with new instances of Spam 2.0 (Req. 2, 3).

4. Formulate web usage data into the Action format i.e. Action Set, Action Time, and Action Frequency (Req. 6).

In the development of any solution, the above design decisions should be considered. All these design

decisions are included in the proposed method to detect Spam 2.0 behaviour on web applications. This

concludes stage two of the conceptual process. The next stage is the theoretical foundation for the

proposed solution.

7.5 Theoretical Foundation

The theoretical foundation for the EDSF is discussed in this section, which represents the third stage of the conceptual process. All design decisions are analysed and an algorithm is proposed. The

theoretical foundation as illustrated in Figure 7.1 has four main parts: tracking, pre-processing, feature

measurement, and classification.

Figure 7.1: EDSF framework consisting of four parts


Incoming Traffic

Incoming traffic shows a user entering the website through a web interface such as the homepage of a forum.

Tracking

Tracking records web usage data including the user’s IP address, username, requested webpage URL, session identity, and timestamp.

Pre-processing

Pre-processing includes three components: data cleansing, transaction identification and dwell time completion.

Feature Measurement

Feature Measurement extracts and measures three feature sets: Action Set, Action Time, and Action

Frequency from web usage data.

Classification

A machine learning classifier known as a Support Vector Machine (SVM) is used to distinguish spam user behaviour from genuine human behaviour.

The proposed solution explained above addresses all the requirements described in Section 7.3. A detailed description of each part of EDSF is given below.

7.5.1 Tracking

Input: Incoming traffic to the Web 2.0 platform

Output: Web usage data

Tracking is the entry point for all incoming traffic. It tracks the requested URL, the type of request method (i.e. GET or POST), the time of the request, the IP address, the User Agent, and the referrer URL.


The tracking part of EDSF is similar to HoneySpam 2.0 tracking (Section

5.5.1). However, since current trends of Spam 2.0 do not interact with the

form (Section 5.9), no form usage data is collected during the tracking

period. Therefore, the existence of any form usage data can be an

indicator of genuine behaviour on a web application. In order to

disregard such simple features for classification and further investigate

navigational behaviour, form usage was not included in the EDSF

experiment.

Similar to the Session identification task in HoneySpam 2.0 (Section 5.5.2.3), a unique session

identifier is assigned for individual browsing sessions. This makes it possible to track navigations on

each occasion that the user visits the website. The tracker stores the gathered data along with the

corresponding username for each record. The users' credential information is extracted from HTTP cookies and stored along with other navigational usage data, similar to the User identification task in HoneySpam 2.0 (Section 5.5.2.2).

Figure 7.2 presents a sequence diagram of the tracking process.

Figure 7.2: The sequence diagram of the EDSF tracking process

Step 1: The user requests a web page.

Step 2: The tracking component captures the web usage data associated with the request.


Step 3: The tracking component forwards the request to the web application to be processed.

Step 4: The web application returns the produced web page to the tracking component.

Step 5: The tracking component returns the web page to the client.

It is clear from the above steps that there is no form tracking module involved in the tracking process of EDSF. The only data that is accumulated is the client’s navigational data.

7.5.2 Pre-processing

Input: Raw web navigation usage data

Output: Clean and transaction-identified usage data

Pre-processing includes three tasks: data cleansing, transaction identification and dwell time completion.

7.5.2.1 Data cleansing

Data cleansing removes irrelevant web usage data, including that of:

• researchers who monitor the forum,

• visitors who did not create a user account, and

• crawlers and other Web robots that are not spambots.

7.5.2.2 Dwell time completion

Dwell time is defined as the amount of time a user spends on a specific web page in his/her navigation sequence. It can be calculated from the timestamps of consecutive records:

d_i = t_{i+1} - t_i    (7.1)

where d_i is the dwell time for the ith requested web page in session S, and t_i is the timestamp of its request.

In E.q. 7.1, it is not possible to calculate the dwell time for the last visited web page in a session. For example, a user navigates to the last web page on the website then closes his/her web browser. In such cases, the average dwell time spent on other web pages during the same session is considered.


7.5.2.3 Transaction identification

In transaction identification, a meaningful cluster of user navigation data is extracted. Web usage data

is grouped into three levels of abstraction: IP, User and Session. The highest level of abstraction is IP

and each IP address in web usage data may consist of multiple users. The middle level is the user

level where each user can have multiple browsing sessions. Finally, the lowest level is the session

level which contains detailed information of how the user behaved for each website visited.

In the EDSF, transaction identification is done at the session level for the following reasons:

• The session level can be built and analysed quickly while other levels need more tracking time to obtain a complete view of the user’s behaviour.

• The session level provides more in-depth information about user behaviour when compared with the other levels of abstraction.

A transaction is defined as a set of web pages that a user requests during each browsing session.

Figure 7.3 presents a flowchart for the pre-processing of the EDSF. The procedure of the pre-

processing part is as follows.

Figure 7.3: The flow chart diagram for the pre-processing part of EDSF

Step 1: The pre-processing tasks begin with raw web navigational data.

Step 2: Unwanted data is wiped out during data cleansing.

Step 3: The dwell time for each web page is calculated.

Step 4: The data is grouped at the session transaction level and is passed to the next stage.

7.5.3 Feature Measurement

Feature measurement extracts and measures three feature sets: Action Set, Action Time, and Action

Frequency from web usage data to build input vectors for classification. These features are measured as described in Section 6.2.3; however, here the features are used inside the classifier for the detection of

Spam 2.0 behaviour.

Figure 7.4 presents the flowchart for the feature measurement of EDSF. The process consists of the following steps.

Figure 7.4: Flowchart for feature measurement part of EDSF

Step 1: Obtain the transaction set from the previous part of the EDSF.

Step 2: Calculate Action Set, Action Time, and Action Frequency for each transaction.

Step 3: Construct input vectors based on the features. The input vectors are used in the next part of

EDSF.

7.5.4 Classification

A Support Vector Machine (SVM) is used as the machine learning classifier in EDSF. SVM is a machine learning algorithm designed to be robust for classification, especially binary classification (Chang and Lin 2001).

Given a set of input data, SVM predicts the class/category to which the input belongs. Similar to other supervised machine learning classifiers, SVM requires a set of training examples. SVM builds a model based on the training examples, and the model is used to predict the category of a new example. The SVM model generates a hyperplane in a finite- or infinite-dimensional space that maximises the margin between the different categories. This hyperplane is used to separate the classes.


Figure 7.5: SVM decision function

SVM is trained on n data points {(x1, y1), (x2, y2), …, (xn, yn)}, where each point comes with a class label yi. In the spam filtering task there are two classes, {human, spammer}, which can be assigned the numerical values -1 and +1 respectively. SVM then tries to find an optimum hyperplane that separates the two classes and maximises the margin between them. The decision function on a new data point x is defined in Eq. 7.2, where w is the weight vector, φ is the feature mapping and b is the bias term:

f(x) = sign(w · φ(x) + b)    (7.2)

As illustrated in Figure 7.5, there are two classes of data (black dots and white dots) that have been separated by a line (a hyperplane in higher-dimensional spaces). The decision function assigns a class to each new example x.
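As a toy illustration of the decision function in Eq. 7.2, the snippet below evaluates a linear decision boundary on one input vector. The weights and bias are made-up values rather than learned ones, and the actual experiments use WEKA's LibSVM with an RBF kernel rather than this hand-written linear form.

<?php
// Toy linear SVM decision function f(x) = sign(w . x + b); +1 = spam, -1 = human.
function svmDecision(array $w, float $b, array $x): int {
    $score = $b;
    foreach ($w as $i => $wi) { $score += $wi * $x[$i]; }
    return $score >= 0 ? 1 : -1;
}

$w = [0.8, -0.3, 0.5];               // hypothetical weights over three Action features
$b = -0.2;                           // hypothetical bias term
echo svmDecision($w, $b, [1, 0, 1]); // classify one Action Set input vector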

Figure 7.6 presents the flowchart for the classification of EDSF. The process consists of the following

steps.


Figure 7.6: Classification part of EDSF

Step 1: Obtain the input vectors from the feature measurement part.

Step 2: Select a set of input vectors as a training set for the classifier, and use the remainder as a validation set.

Step 3: Train the classifier on each training set.

Step 4: Run the classifier on each input vector – Action Set (aS), Action Time (aT), Action Frequency (aF).

Step 5: Generate results based on the outcome of the classifier.

This concludes the third stage of conceptual process where the algorithm for EDSF has been proposed. The following section describes the prototype implementation of EDSF.

7.6 Prototype Implementation

The fourth stage of the conceptual process is where the theoretical foundation is implemented as a prototype. In order to fulfil the requirements (Section 7.3), the following technologies are used to implement the prototype.

Tracking

Similar to the tracking part of HoneySpam 2.0 described in Section 5.6.1, the PHP programming language and a MySQL database are used to implement the tracking part of EDSF. However, as


described previously (Section 6.5.1), form usage data is not monitored by EDSF; hence, there is no

prototype implementation for the form tracking component.

The tracking part of EDSF can implicitly gather web usage data and is extendable to all web

applications.

Pre-processing

The pre-processing part is implemented in PHP. The outcome is stored in the MySQL database and is used later for feature measurement.

Feature Measurement

The feature measurement part of EDSF is implemented in PHP. Once the input vectors are constructed, the outcome is stored in a specific file format (ARFF, described below) to be used inside the classifier.

Classification

For the classification, the Waikato Environment for Knowledge Analysis (WEKA) is used. WEKA is a collection of machine learning algorithms including SVM (Hall et al. 2009). It is developed in the Java programming language.

7.6.1 Tracking part

Similar to the tracking part of HoneySpam 2.0 (Section 5.6.1), the EDSF tracking part is implemented using two PHP files; the form tracking PHP files are not included here:

• sb_unb.php

• sb_mysql.php

The PHP files are executed on a virtual web server running Linux (SUSE Enterprise Server 11), the Apache web server and PHP version 5.

The descriptions of sb_unb.php and sb_mysql.php are the same as given in Section 5.6.1.1.


7.6.2 Pre-processing part

Server-side programming is utilised to process the data. All the data is accumulated on the server (MySQL database). Since these tasks are time-consuming, PHP command-line scripts are used to pre-process the data.

Pre-processing includes the following tasks:

• Cleansing the data

• Calculation of dwell time

• Identification of transactions

The sections below detail the above tasks.

7.6.2.1 Data cleansing

SQL queries similar to those in Section 5.6.2.1 are constructed for this task. The SQL queries are used to:

• remove redundant data,

• remove visitors’ data,

• remove the data of researchers who monitor the system,

• remove standard and known robot data,

• remove data for JavaScript, image, stylesheet and other file requests.

7.6.2.2 Dwell time completion

The following algorithm (Code 7.1) is used to calculate the dwell time for each user. The average dwell time is used for the last visited web page in a session.


Code 7.1: Dwell time algorithm

Algorithm dwell_time_completion(S)
input:  set of webpages in a session, S
output: set of dwell times for the session, dwell_time
1. for each wi in S
2.   if i = |S|
3.     dwell_timei = (total session time) / |S|
4.   else
5.     dwell_timei = wi+1.time - wi.time
6.   end if
7. repeat
8. return dwell_time

The algorithm iterates through all the web pages in the session S. The calculation of dwell time is on line 5, where the timestamp of the web page is subtracted from the timestamp of the next visited web page in the session.

Line 2 checks whether the web page is the last web page in the session. If so, on line 3 the dwell time

for the last seen web page is calculated by measuring the average dwell time for other web pages in

the session.

The algorithm returns the set dwell_time on line 8.
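A PHP rendering of Code 7.1 could look as follows. The input shape (an ordered array of Unix timestamps for one session) is assumed for illustration; the last page's dwell time is filled in with the average of the other pages, as described above.

<?php
// Dwell-time completion for one session (sketch of Code 7.1).
// $times is the ordered list of request timestamps (Unix seconds).
function dwellTimes(array $times): array {
    $n = count($times);
    $d = [];
    for ($i = 0; $i < $n - 1; $i++) {
        $d[$i] = $times[$i + 1] - $times[$i];     // line 5 of Code 7.1
    }
    // Last page: average dwell time of the other pages in the session.
    $d[$n - 1] = $n > 1 ? array_sum($d) / ($n - 1) : 0;
    return $d;
}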

7.6.2.3 Transaction identification

For the transaction identification task, an SQL statement similar to that in Section 6.5.1 is constructed to identify session-level transactions. The following SQL clause is used to group transactions at the session level.

GROUP BY (session)

Since all records have session ids, the above SQL statement can identify all the records that belong to

a specific session. Transaction identification in EDSF generates a list of available session ids which

can be used later to extract information from each session.

7.6.3 Feature measurement

Three feature sets – Action Set, Action Time and Action Frequency – are measured in this part. These

features are used later to construct input vectors to be used inside the classifier. PHP command line

scripts are utilised to measure each feature set.


The description of each input vector is discussed in Section 6.5.5.

7.6.4 Classification

To distinguish Spam 2.0 behaviour from genuine human behaviour, a machine learning classifier known as SVM is employed. The well-known machine learning workbench WEKA (Hall et al. 2009) is used for the classification. WEKA is implemented in the Java programming language and can be used either through its Graphical User Interface (GUI) or through its API from other Java code.

The LibSVM classifier function in WEKA is used to implement SVM. The LibSVM function is an implementation of Chang and Lin's class library for SVM (Chang and Lin 2001). LibSVM includes the following features:

• Multi-classifier

• Different versions of the SVM formulation

• Cross validation

• Probability estimation

• Implementations in various programming languages, etc.

The input vector is formatted into ARFF format, which is a standard format for WEKA. The ARFF file contains two parts: header and data. The header begins with the @RELATION keyword and contains information including:

• name of the relation,

• a list of the attributes (columns), and

• each attribute type.

The data part starts with @DATA. Each record is on one line and each attribute inside each

record is separated by a comma. Figure 7.7 presents an example of an ARFF file.


@RELATION EDSF

@ATTRIBUTE ActionA NUMERIC
@ATTRIBUTE ActionB NUMERIC
@ATTRIBUTE ActionC NUMERIC

@DATA
1,0,1
0,0,1
0,1,0

Figure 7.7: Sample ARFF file
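For illustration, the ARFF file in Figure 7.7 could be produced with a few lines of PHP; the attribute names and vectors below simply mirror the sample above.

<?php
// Write feature vectors out in WEKA's ARFF format (mirrors Figure 7.7).
$attributes = ['ActionA', 'ActionB', 'ActionC'];
$vectors = [[1, 0, 1], [0, 0, 1], [0, 1, 0]];

$fh = fopen('edsf.arff', 'w');
fwrite($fh, "@RELATION EDSF\n\n");
foreach ($attributes as $a) {
    fwrite($fh, "@ATTRIBUTE $a NUMERIC\n");
}
fwrite($fh, "\n@DATA\n");
foreach ($vectors as $v) {
    fwrite($fh, implode(',', $v) . "\n");
}
fclose($fh);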

The algorithm for the classifier is presented in Code 7.2. If the result of the classification for a given user i is 1,

the user is marked as a spam user and all of the user's content contributions are classified as spam content.

Otherwise, the user is a genuine human user.

Code 7.2: Algorithm for the classifier

Algorithm classification(Vi)
input:  input vector for the given user, Vi
output: the boolean classification result (1 = spam user), Ri
1. Ri = φ(Vi)
2. if Ri = 1
3.   Mark i as a spam user
4.   Mark all i's posts as spam
5. else
6.   Mark i as a genuine human user
7. return Ri

This concludes stage four of the conceptual process. The next section focuses on experimental settings

where the proposed prototype is tested.

7.7 Experimental Settings

The experimental settings comprise the fifth stage of the conceptual process where the proposed

prototype is tested and examined. The EDSF experiment involves three processes: Data Collection,

Parameter Specification, and Performance Measurement.

Data collection explains the dataset that is used to conduct the experiment. It includes both Spam 2.0

and genuine human data.

Parameter Specification is a discussion about all the parameters and their associated values used for

gathering the results. The parameters that are used in the machine learning classifier are also

discussed here.


Performance Measurement discusses the method which is used for measuring the performance of

EDSF under different parameter settings and datasets.

7.7.1 Data collection

As mentioned previously, there are no publicly available Spam 2.0 data collections which contain both human and spam web usage data for Web 2.0 platforms. Hence, the dataset for this experiment is collected using the proposed HoneySpam 2.0 tool. The dataset contains both Spam 2.0 and genuine human behaviour. The Spam 2.0 dataset is collected through HSCOM1 (Section 5.7.1.1) over a period of one month. This data is collected from the online discussion board (i.e. SMF) running on

HSCOM1. The genuine users’ data is collected from a live forum with the same configuration as

HSCOM1. However, the following concerns are taken into account when using the genuine users' and Spam 2.0 data for the EDSF experiment.

• Same web application: The same version of SMF web application is run on the genuine users’ forum.

• Same site structure: Both genuine users’ and spam forums have the same site structure (e.g. sitemap).

• Same tracking tool: The same version of HoneySpam 2.0 is run on the genuine users’ forum.

• Same type of data: The web usage navigational data is collected over a period of one month for the genuine users’ forum which is the same as Spam 2.0 data.

• Neutralised data: Both genuine users' and spam data are cleansed of any potentially biasing information. For example, domain-specific information such as the URL name is removed from the genuine users' data.

The genuine users’ data is frequently moderated by manual investigators; hence, there is no tracking of spam activity in this data.


In order to further validate the performance of the EDSF, another

experiment is conducted on a different dataset. The dataset is

accumulated through a live forum that is used by both genuine and spam

users. The result of this experiment is discussed in Section 7.9.1.

Table 7.1 presents a summary of the collected data. The dataset contains 16,594 entries consisting of

11,039 spam records and 5,555 genuine users’ records.

Table 7.1: Summary of the dataset for the EDSF experiment

Type of data Frequency

No. of genuine users’ records 5,555

No. of spam records 11,039

7.7.2 Parameter Specification

All the parameters and their associated values used are listed in Table 7.2. It also includes the

parameter settings for the SVM classifier.


Table 7.2: Parameters and their associated values used in the EDSF

Parameter Value Description

No. of transactions, |T| 4,227 Result of transaction identification at the session level during the pre-processing part.

No. of spam transactions 3,726 Manually classified spam transactions

No. of non-spam transactions 501 Manually classified non-spam transactions

No. of Actions, |A| 34 Number of different types of Actions identified during feature measurement.

SVM type C-SVC C-SVC is a standard SVM algorithm.

Kernel type Radial basis function

Degree 3 Degree of the closeness in the kernel function

Gamma 0.029 1 / number of features

Cost 1 To create soft margin between classes and allow training error. It might result in misclassification.

Epsilon 0.001 Tolerance of termination factor.

Cross-validation 10 To observe the efficiency of the classifier on different training – test (unseen) data.

Table 7.3 lists the 34 Actions that are extracted and selected from the dataset. The details of each Action are explained in the table. These Actions are extendable to other web applications, specifically online discussion boards.


Table 7.3: The list of all the extracted Actions

Action Key Description

1 View homepage
2 View all the discussion boards
3 View topics in one discussion board
4 View topic (view user own created topic)
5 View overall forum statistic
6 View recently submitted topics
7 View search result
8 View user profile information
9 View other users profile
10 View personal message (Inbox)
11 View members lists
12 View latest reply to a specific topic
13 View help
14 View all the posts by specific user
15 Modify user own profile (user account details)
16 Modify user own profile (personal information)
17 Modify user own profile (change website look-and-feel)
18 Modify user own profile (change notification settings)
19 Modify user own profile (change personal messaging system settings)
20 Use search facility
21 Preview the topic before posting
22 Start new poll
23 Start new topic
24 Reply to topic
25 User authentication (Login)
26 Send personal message to other users
27 Modify personal message
28 Quote other users posts
29 Spell check
30 Find other users in system
31 Use password forgot facility to remember the password
32 Edit user own topic/reply
33 Mark all the topics in the system as read
34 Update user own topic/reply

The dataset is split into two sections: training and test. Two-thirds of the data (2,818 records) is used

for the training set and the remaining 1,409 records are used as the test set. Each input vector has 34 features, and each of the three input vectors is experimented with separately using the SVM classifier. Based on the

above parameter specifications, three experiments are conducted.


7.7.3 Performance Measurement

7.7.3.1 F1-measure

A common set of performance measures known as Precision, Recall and the F1-measure is used to calculate the performance of the EDSF in each of the three experiments.

As mentioned in Section 2.2.2, precision (Eq. 7.3) refers to the exactness of the result, while recall (Eq. 7.4) shows its completeness:

P = TP / (TP + FP)    (7.3)

R = TP / (TP + FN)    (7.4)

It is common to combine precision and recall in order to generate one value (known as the F-measure) to compare or evaluate different classification results. The F-measure (F1-measure) is defined as:

F_1 = 2PR / (P + R)    (7.5)

where P and R are the precision and recall values respectively.

If the outcome of these three measures is close to 1, this indicates better system performance. If the resulting value is closer to 0.5, then the output of the system is similar to random prediction. Finally, a value close to 0 indicates that the classifier is consistently predicting the opposite class.

7.7.3.2 Confusion matrix

A confusion matrix is a method used to visually represent actual and predicted classification results (Kohavi and Provost 1998). By evaluating this matrix, the performance of the classifier can be investigated. Figure 7.8 presents a sample confusion matrix.

                    Predicted
                 Negative  Positive
Actual Negative     a         b
       Positive     c         d

Figure 7.8: Sample confusion matrix


In the figure, a is the number of correctly classified negative instances, b is the number of incorrectly

classified negative instances, c is the number of incorrectly classified positive instances and finally, d

is the number of correctly classified positive instances.
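As a worked example, the snippet below computes Eqs. 7.3–7.5 from confusion-matrix counts, using as an example the spam-class counts that will be reported in Table 7.9 (the Action Frequency experiment, with spam as the positive class).

<?php
// Precision, recall and F1 (Eqs. 7.3-7.5) from confusion-matrix counts.
function f1(int $tp, int $fp, int $fn): array {
    $p = $tp / ($tp + $fp);
    $r = $tp / ($tp + $fn);
    return [$p, $r, 2 * $p * $r / ($p + $r)];
}

// Spam-class counts from Table 7.9: tp = 3717, fp = 150, fn = 9.
[$p, $r, $f] = f1(3717, 150, 9);
printf("P=%.3f R=%.3f F1=%.3f\n", $p, $r, $f); // P=0.961 R=0.998 F1=0.979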

This concludes stage five of the conceptual process. The following section focuses on the results and

observations of the experiments for the proposed prototype.

7.8 Results and Observations

This section discusses the results that were observed after running the proposed prototype i.e. EDSF.

This is the sixth stage of the conceptual process.

7.8.1 Experiment 1 (Action Set)

The summary of the experiment on the Action Set input vector is presented in Table 7.4. This experiment achieved a true-positive rate of 99.9% for spam, and the false-positive rate for genuine users is less than 0.002%. The average F1 value for the whole experiment is 0.958.

Table 7.4: Summary of experimental results for Action Set, aS, feature set

TP FP P R F

Spam 0.999 0.323 0.958 0.999 0.979

Non-Spam 0.677 0.00002 0.997 0.677 0.806

Average 0.961 0.285 0.963 0.961 0.958

The confusion matrix corresponding to the first experiment is presented in Table 7.5.

Table 7.5: Confusion matrix for the experiment on Action Set

Spam Non-Spam

Spam 3,725 1

Non-Spam 162 339

It is clear from the table that only one spam user is classified as a non-spam user. However, 162 non-

spam users are classified as spam users.


7.8.2 Experiment 2 (Action Time)

The result of the second experiment is presented in Table 7.6. The false-positive value for the wrongly classified genuine users is about 2%. The average F1 value for the whole experiment is 0.928.

Table 7.6: Summary of experimental results for Action Time, aT, feature set

TP FP P R F

Spam 0.976 0.399 0.948 0.976 0.962

Non-Spam 0.601 0.024 0.774 0.601 0.676

Average 0.932 0.355 0.927 0.932 0.928

The confusion matrix corresponding to the experiment on Action Time is presented in Table 7.7.

Table 7.7: Confusion matrix for experiment on Action Time

Spam Non-Spam

Spam 3,638 88

Non-Spam 200 301

The number of wrongly classified spam users is 88 and the number of genuine users wrongly classified as spam is 200.

7.8.3 Experiment 3 (Action Frequency)

Table 7.8 shows the results of the experiment on Action Frequency. The best average F1 for the whole experiment, 0.960, is achieved in this experiment. However, the percentage of wrongly classified genuine users is 0.2%.


Table 7.8: Summary of experimental results for Action Frequency, aF, feature set

TP FP P R F

Spam 0.998 0.299 0.961 0.998 0.979

Non-Spam 0.701 0.002 0.975 0.701 0.815

Average 0.962 0.264 0.963 0.962 0.960

The confusion matrix for the third experiment is presented in Table 7.9.

Table 7.9: Confusion matrix for the experiment on Action Frequency

Spam Non-Spam

Spam 3,717 9

Non-Spam 150 351

The number of wrongly classified spam users is 9. The number of wrongly classified genuine human

users is 150.

This section examines the results after experimentation with the proposed prototype. This concludes

stage six of the conceptual process and paves the way for the next stage.

7.9 Validation and Comparative Analysis

EDSF achieved an average accuracy of 94.70%, which ranges from 93.18% for Action Time to

96.23% for Action Frequency. For the first experiment on Action Set, the number of correctly

classified instances is 4,064 and the number of incorrectly classified instances is 163. The number of

correctly classified instances for the second and third experiments is 3,939 and 4,068 respectively.

The number of incorrectly classified instances for the last two experiments is 288 and 159 respectively.

Although the first and third experiments have similar numbers of correctly classified instances, the number of incorrectly classified instances for the third experiment is lower than for the first experiment; hence, it results in better performance of the classifier. The better performance seen in the Action Frequency experiment shows that spammers tend to repeat certain tasks more often when


compared with the genuine humans who perform a larger variety of tasks. Given this result, it is not surprising that spammers utilise automated tools to distribute spam content. These tools are machine coded and designed to perform similar Actions over a period of time.

EDSF performance draws heavily on Spam 2.0 distribution behaviour.

Although spam users’ routines that are coded inside their automated

tools can be changed over time, this occurs at the cost of slower spam

distribution.

One of the drawbacks of the EDSF is the significant number of false-positives (the number of incorrectly classified genuine users). Wrongly classified spam users are less problematic than incorrectly classified genuine human users. The reasons for these false-positives are as follows.

• The nature of EDSF does not make a distinction between genuine and spam uses of the same Actions; hence, EDSF marks genuine users who perform a considerable number of spam-like Actions as spam users.

• EDSF investigates only the behaviour of the users at the session level. Although session level transactions give a detailed profile of the users’ visits, they do not provide an overall view of

the users’ behaviour.

• Although Actions are good discriminative features for classification, some Actions might be similar for both genuine and spam users. This increases misclassifications in the results. For

example, a genuine user who directly enters the "create new thread" URL could be wrongly classified as a spam user, since spam users mainly aim for content-contribution web pages rather than navigating through other web pages.

• The feature sets – Action Set, Action Time, and Action Frequency – that are used inside EDSF are not capable of modelling the sequence and order of the user navigation through the

web platform. This order can provide more information about the characteristics and nature of

web usage within Web 2.0 platforms which may result in better classification results. The

focus of the next chapter is on incorporating the sequence of requested webpages into Spam

2.0 filtering.


Given that there is no work in the literature on the classification of Spam 2.0 behaviour, the results of

EDSF are the initial steps toward identifying Spam 2.0 by utilising web usage behaviour.

7.9.1 Comparative Analysis

In this section, the proposed method is compared with existing solutions from the literature. As

mentioned earlier, there are no studies on Spam 2.0 identification using web usage. However, there

are some similar works in the literature to detect Spam 2.0 using mainly content-based features. The

comparison of these methods with the proposed method is discussed here.

There are three well-known approaches for detecting Spam 2.0 including Akismet, Defensio, and

TypePad. Additionally, CAPTCHA as a prevention-based Spam 2.0 filtering method is investigated.

All approaches have been implemented and examined against HSCOM6

(Section 5.7.1.3). An online discussion board is run on HSCOM6. The

data accumulated from HSCOM6 includes both spam and genuine users’

forum posts and web usage data. Therefore, HSCOM6 data is quite suitable for comparative analysis of different spam filtering methods.

The summary of HSCOM6 data is depicted in Table 7.10.

Table 7.10: Summary of the collected data on HSCOM6

Item Value

Running time 21/5/10 – 17/2/11

No. of spam posts 3,698

No. of non-spam posts 274

No. of spam users’ navigation 42,336

No. of non-spam users’ navigation 138,394

The following is a detailed description of each Spam 2.0 filtering approach.


7.9.1.1 Akismet

Akismet is a detection/prevention-based Spam 2.0 filtering service that was originally built for blog comment spam filtering (Akismet 2011). Akismet provides a public Spam 2.0 filtering API. Developers can integrate the Akismet spam filtering service into their applications by using this API. To date, Akismet has been implemented for most Web 2.0 platforms, including blogs, comment systems, forums, wikis and content management systems (CMS).

Akismet mainly identifies a spam post as distinct from a genuine one by examining the content features of the post. The following information must be sent to the Akismet spam filtering service for classification:

• Author: the author of the submitted content such as the username of a user who posts a reply in a forum.

• Author’s email: the email address of the author of the content.

• User Agent: the user agent field of the Internet browser submitting the content.

• IP address: the IP address of the author of the content.

• Body: the content of the post whether it be a comment, forum reply, wiki page etc.

Other information, such as the referrer URL, type of content and name of the author, is optional. Akismet also blacklists known spam traffic, including spam users' IP addresses, for future use.

For the comparative analysis, the Akismet API is implemented in PHP. All of the API's required information – author, author email address, User Agent, IP address, and body of content – is provided to Akismet.
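A sketch of such a wrapper is given below; the endpoint and parameter names follow Akismet's public comment-check API, but the helper itself is illustrative rather than the exact code used in this experiment.

<?php
// Hedged sketch of querying Akismet's comment-check endpoint with cURL.
// Returns true when Akismet classes the submission as spam.
function akismetCheck(string $apiKey, string $blog, array $post): bool {
    $fields = http_build_query([
        'blog'                 => $blog,
        'user_ip'              => $post['ip'],
        'user_agent'           => $post['user_agent'],
        'comment_author'       => $post['author'],
        'comment_author_email' => $post['email'],
        'comment_content'      => $post['body'],
    ]);
    $ch = curl_init("https://$apiKey.rest.akismet.com/1.1/comment-check");
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $fields,
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response === 'true'; // Akismet replies with the literal string 'true'/'false'
}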

The initial test on Akismet shows comparatively good classification results. However, the main reason for this result was the time difference between data accumulation and the experiment, and the fact that Akismet had already blacklisted the spammers' IP addresses. Therefore, another experiment is conducted with modified IP addresses. The IP addresses of all the requests to the

Akismet service have been set to a single IP address which has not been blacklisted. In this experiment setting, it is possible to evaluate the performance of the Akismet service by simulating real-world scenarios where spammers’ IP addresses are not blacklisted.


The confusion matrix for the experiment on Akismet Spam 2.0 filtering service is presented in Table

7.11.

Table 7.11: Confusion matrix corresponding to the Akismet experiment

                  Classified Spam   Classified Non-Spam
Actual Spam            3,526                 172
Actual Non-Spam            4                 270

The overall accuracy achieved is 0.95, with precision 0.98 and recall 0.61. The F1-measure value corresponding to the Akismet experiment is 0.75.

7.9.1.2 Typepad AntiSpam

Typepad AntiSpam is another detection/prevention-based Spam 2.0 filtering method (TypePad AntiSpam 2011). It was originally designed to block spam comments on blogs. Typepad AntiSpam provides a publicly available API for Spam 2.0 filtering. Its API is fully compatible with the Akismet API and it can likewise be extended to different Web 2.0 platforms.

The experiment setting for Typepad AntiSpam was kept the same as for the Akismet experiment because of their compatibility. Table 7.12 presents the confusion matrix for the result of the experiment with the Typepad AntiSpam service.

Table 7.12: Confusion matrix for the Typepad AntiSpam experiment

                  Classified Spam   Classified Non-Spam
Actual Spam            3,651                  47
Actual Non-Spam           14                 260

The following results were achieved during the Typepad AntiSpam experiment: accuracy 0.98,

precision 0.94, recall 0.84 and F1-measure 0.89.

7.9.1.3 Defensio

Defensio is another detection/prevention-based Spam 2.0 filtering service (Defensio 2011). It provides a public API so that its Spam 2.0 filtering service can be integrated into Web 2.0 platforms.


To conduct the experiment, the following information is required to be passed to the Defensio Spam

2.0 filtering servers:

• Platform: Type of Web 2.0 platform that is analysed such as phpbb, wordpress (blog system), drupal (CMS) etc.

• Type of content: Type of content that is sent to be analysed which may be comment, wiki, forum etc.

• Content: the main body of the content that needs to be analysed.

• Client: the name, version and author name and author email address of the program that is used for querying Defensio API.
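For illustration only, the sketch below assembles these four mandatory fields into a single request in Python; the endpoint URL and field names here are hypothetical placeholders rather than the actual Defensio API specification, and only the four fields listed above are taken from the experiment description.

# Illustrative Defensio-style request (Python). DEFENSIO_URL and the exact
# field names are hypothetical placeholders, not the real API specification.
import requests

DEFENSIO_URL = "https://defensio.example/check"  # hypothetical endpoint

def check_content(content):
    payload = {
        "platform": "phpbb",   # type of Web 2.0 platform being analysed
        "type": "forum",       # type of content sent for analysis
        "content": content,    # main body of the content
        # client: name, version, author name and author email of the querying program
        "client": "ExperimentClient | 1.0 | Jane Doe | jane@example.com",
    }
    return requests.post(DEFENSIO_URL, data=payload)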

A program is coded in PHP to implement the Defensio API and conduct the experiment. All the abovementioned information is mandatory and is passed to the Defensio API. The confusion matrix corresponding to the Defensio experiment is presented in Table 7.13.

Table 7.13: Confusion matrix for the Defensio experiment

                  Classified Spam   Classified Non-Spam
Actual Spam            3,454                 244
Actual Non-Spam           15                 259

The overall accuracy for the Defensio experiment is 0.93, precision 0.94, recall 0.51 and F1-measure

0.66.

7.9.1.4 CAPTCHA

As mentioned in Chapter 2, CAPTCHA is the most well-known prevention-based spam filtering method for blocking automatically generated requests on web platforms (von Ahn et al. 2003). It has been widely used on Web 2.0 platforms such as forums, social networks, comment systems etc. Once a user intends to perform specific actions on a web platform (e.g. register a user account, post a comment), CAPTCHA can be utilised to ensure that the request has not originated from automated code. CAPTCHA uses three main steps to ascertain the identity of users (i.e. whether it is a real human user or machine code), as follows (a minimal sketch of this flow is given after the list):


• Step 1: A challenge is generated in the server and sent to the user to solve.

• Step 2: User tries to solve the challenge and returns the response.

• Step 3: The server checks and grades the user’s response and provides privileges to perform the user requested action.
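The following minimal sketch (in Python, assuming simple text challenges rather than the distorted-image challenges typically used) illustrates the server side of this three-step flow; the session store and challenge format are placeholders.

# Minimal sketch of the CAPTCHA challenge-response flow (placeholder session
# store; a text challenge stands in for a distorted image).
import random, string

SESSIONS = {}  # session_id -> expected answer

def generate_challenge(session_id):
    """Step 1: generate a challenge and remember the expected answer server-side."""
    answer = "".join(random.choices(string.ascii_uppercase, k=6))
    SESSIONS[session_id] = answer
    return answer  # in practice rendered for the user to solve (Step 2)

def grade_response(session_id, user_response):
    """Step 3: grade the user's response; grant the requested privilege only on success."""
    expected = SESSIONS.pop(session_id, None)
    return expected is not None and user_response.strip().upper() == expected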

The main part of CAPTCHA is the design of the challenge. CAPTCHA challenges which are not well

designed might allow a machine code to bypass the restriction, while complex CAPTCHA challenges

might keep human users out of the system.

In the CAPTCHA experiment, the default CAPTCHA that comes with the latest version of the phpBB

forum platform is used. In this experiment setting, all requests for creating a new user account are

protected with CAPTCHA. Figure 7.9 illustrates an example of CAPTCHA used in the experiment.

Figure 7.9: Sample CAPTCHA used for the experiment

CAPTCHA was used for the entire running duration of HSCOM6. All attempts to create new user accounts (i.e. visiting the user registration page of the forum) and to overcome CAPTCHA were recorded. Hence, a figure that measures the performance of CAPTCHA can be derived. All the requests from genuine users to solve CAPTCHA on HSCOM6 are removed from the data. Additionally, all the requests coming from known web crawlers are also removed from the data.

For the period of the HSCOM6 experiment, of the 4,159 visits to the user registration page, 171 spam

user accounts could successfully bypass CAPTCHA and register a user account in the forum. Hence,

about 95.88% of requests are blocked by CAPTCHA. However, this figure might not represent the

real performance of CAPTCHA since:

• There may be some attempts to visit the registration page of the forum, but not necessarily to create a user account. Such attempts are included in the above performance measurements.

Therefore, the real performance of CAPTCHA might be lower than the abovementioned

figure.


• The inconvenience of the CAPTCHA challenges could not be measured. Although complex CAPTCHA challenges might block out most of the spammers, it would also keep real users

out of the system, resulting in a higher number of false-positives.

However, the above figure gives an estimate of the performance of CAPTCHA that can be compared with that of other Spam 2.0 filtering methods.

7.9.1.5 EDSF

In order to compare the results of the proposed method on an equal footing with the other Spam 2.0 filtering methods, an experiment is conducted on the HSCOM6 data to measure the performance of EDSF. Table 7.14 provides a summary of the parameters and their associated values in the EDSF.

Table 7.14: Parameters and their associated values for the EDSF experiment

Parameter Value Description

No. of transactions, |T| 6,285 Result of transaction identification at session level during the pre-processing part.

No. of spam transactions 6,230 Manually classified spam transactions

No. of non-spam transactions 55 Manually classified non-spam transactions

No. of Actions, |A| 29 Result of the feature measurement part for different types of Actions.

SVM type C-SVC C-SVC is a standard SVM algorithm.

Kernel type Linear

Degree 3 Degree of the closeness in the kernel function

Gamma 0.029 1 / number of features

Cost 1 Creates a soft margin between classes and allows training error; it might result in misclassification.

Epsilon 0.001 Tolerance of the termination factor.

Cross-validation 10 To observe the efficiency of the classifier on different training and test (unseen) data.
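As an illustrative sketch only (using scikit-learn's libsvm-based SVC rather than the exact tooling of the experiment), training a C-SVC with the parameters of Table 7.14 might look as follows; X and y are placeholders for the 6,285 input vectors over 29 Action features and their manual spam/non-spam labels.

# Sketch: C-SVC with a linear kernel and the Table 7.14 parameters (scikit-learn).
# X and y are placeholders for the real HSCOM6 transaction vectors and labels.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X = np.random.rand(6285, 29)       # placeholder: one row per transaction, one column per Action
y = np.random.randint(0, 2, 6285)  # placeholder: 1 = spam, 0 = non-spam

clf = SVC(C=1.0,            # cost: soft margin, allows some training error
          kernel="linear",  # linear kernel (degree and gamma are not used by it)
          tol=1e-3)         # epsilon: termination tolerance
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
print("mean accuracy: %.3f" % scores.mean())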


Table 7.15 presents 29 Actions that are used in the experiment. Each Action is a feature inside the

input vector.

Table 7.15: Name and description of each Action

Action Key  Description
1           Visit homepage
2           Visit user control panel
3           Login
4           Edit profile
5           Edit avatar
6           Manage draft
7           Logout
8           Edit signature
9           Manage subscriptions
10          Manage bookmarks
11          Compose message
12          Edit account settings
13          Edit global settings
14          Manage friends
15          Solve CAPTCHA
16          Delete cookies
17          Register
18          View members
19          View profile
20          Find member
21          Post topic
22          Post reply
23          Quote post
24          Edit post
25          View forum
26          View topic
27          View FAQ
28          Search
29          View feed

Considering that the best performance results were achieved in the Action Set and Action Frequency experiments in Section 7.8, two experiments, one for Action Set and one for Action Frequency, are conducted. Action Set and Action Frequency are used as feature sets to build the input vectors for the SVM classifier. Table 7.16 and Table 7.17 provide the confusion matrices for Action Set and Action Frequency on the HSCOM6 data.

Table 7.16: Confusion matrix for the Action Set experiment

                  Classified Spam   Classified Non-Spam
Actual Spam            6,226                   4
Actual Non-Spam           13                  42

Table 7.17: Confusion matrix for the Action Frequency experiment

                  Classified Spam   Classified Non-Spam
Actual Spam            6,228                   2
Actual Non-Spam           18                  37

The average precision, recall, and F1-measure for the Action Set experiment are 0.95, 0.88 and 0.91 respectively, while for the Action Frequency experiment these values are 0.97, 0.83 and 0.89. A summary of the comparative analysis of the abovementioned methods is presented and discussed in the next section.

7.9.1.6 Summary

The five experiments conducted with Akismet, Typepad AntiSpam, Defensio, CAPTCHA and the proposed EDSF yielded the results presented in Table 7.18.


Table 7.18: Summary of the EDSF comparison analysis

                      Akismet      Typepad      Defensio     CAPTCHA      EDSF (aS, aF)
                                   AntiSpam
F1-Measure            0.75         0.89         0.66         -            0.91, 0.89
Accuracy              0.95         0.98         0.93         0.95         0.99, 0.99
False-positive rate   0.04         0.01         0.06         -            0.23, 0.32
False-negative rate   0.01         0.05         0.06         -            0.001, 0.0003
Strategy              Detection/   Detection/   Detection/   Prevention-  Early-detection
                      Prevention-  Prevention-  Prevention-  based        based
                      based        based        based
Language Dependency   Yes          Yes          Yes          No           No

The overall F1-measure value of the EDSF is higher than that of the state-of-the-art Spam 2.0 filtering methods, showing that the proposed method performs better with regard to Spam 2.0 filtering. The other advantage of the EDSF is its detection strategy: EDSF is an early-detection based method and can therefore detect spammers before content distribution. Akismet, Typepad AntiSpam and Defensio are mainly content-based, so such methods are not able to prevent spam content from being submitted.

7.10 Conclusion

The early-detection based Spam 2.0 filtering methods attempt to identify spam behaviour prior to or

just after spam content submission. Such methods investigate the behaviour of spam users, e.g.

automated tools, spambots etc.

The proposed EDSF method in this dissertation automatically distinguishes Spam 2.0 submission

from genuine human content contribution on Web 2.0 platforms. The proposed method studies,

investigates and identifies Spam 2.0 on a Web 2.0 platform. EDSF makes use of web usage navigation

behaviour to build up a discriminative feature set including Action Set, Action Time, and Action

Frequency. Such a feature set is later used in a machine learning classifier known as a Support Vector

Machine (SVM) for classification.

The experiments illustrated in this dissertation collected data from the HoneySpam 2.0 framework as

well as a live moderated forum. The F1-measure is used to determine the performance of the EDSF.


The EDSF achieved an average accuracy of 94.70%, ranging from 93.18% for Action Time to 96.23% for Action Frequency. For the first experiment, on the Action Set, the number of correctly classified instances is 4,064 and the number of incorrectly classified instances is 163. The number of correctly classified instances for the second and third experiments is 3,939 and 4,068 respectively, and the number of incorrectly classified instances for these two experiments is 288 and 159 respectively.

The overall performance of EDSF was compared against four other state-of-the-art Spam 2.0 filtering methods. The overall F1-measure value of EDSF is higher than those of existing Spam 2.0 filtering methods. It shows that the proposed method has better performance in regard to Spam 2.0 filtering.

The other advantage of EDSF is its detection strategy. EDSF is an early-detection based method and therefore can detect spam users before content distribution.

Given that there is no work in the literature on Spam 2.0 behaviour classification, the results of EDSF are the initial steps toward identification of Spam 2.0 by utilising web usage behaviour. EDSF outperforms current state-of-the-art Spam 2.0 filtering methods for the detection and prevention of spam users on Web 2.0 platforms.

The following chapter describes a novel method for on-the-fly Spam 2.0 filtering.


Chapter 8

On-the-Fly Spam 2.0 Filtering

This chapter presents:

► An introduction to On-the-Fly Spam 2.0 Filtering.

► A framework for detecting Spam 2.0 by analysing web navigation sequence data

► A proposed Action String, a novel feature set for Spam 2.0 detection.

► The evaluation, validation and comparative study of the proposed framework performance

with real-world data.

8.1 Introduction

This chapter presents a method for On-the-Fly early-detection based Spam 2.0 Filtering (OFSF). Early-detection based Spam 2.0 filtering methods strive to identify spam behaviour prior to, or just after, spam content submission.

OFSF automatically distinguishes Spam 2.0 submissions from genuine human content contributions

inside Web 2.0 platforms. Similar to the EDSF that is discussed in Chapter 7, OFSF makes use of web

usage navigation behaviour to create discriminative feature sets. However, OFSF considers the

sequence and order of user web navigation in order to build up a feature vector for classification.

Unlike EDSF, OFSF uses a novel rule-based classifier to store, retrieve, and separate spam users from

genuine human ones. OFSF makes it possible to spot spam users on-the-fly. While users are

interacting with the Web 2.0 platform, OFSF is able to track and identify possible spam navigation

patterns.

To handle, i.e. store, retrieve, update, and look up, navigation data, OFSF uses a Trie data structure. A Trie is a tree-structured data store in which each edge is labelled with an identifier of a user's web navigation Action, so that a path from the root encodes the order and sequence of Actions.


The theoretical foundation for OFSF, along with its algorithm design which addresses all the requirements specified in Chapter 4, is discussed here. Finally, testing, experimentation and validation are discussed at the end of this chapter.

8.2 General Overview of OFSF

This research continues from the previous work, i.e. EDSF, in combating Spam 2.0. The main idea is to actively investigate spam users' behaviour, the origin of Spam 2.0, instead of seeking discriminative attributes inside spam content.

The viewpoint of OFSF is identical to EDSF. However, OFSF orders the

users’ Actions in order to distinguish spam users from genuine human

ones. Due to the design constraints, the sequence of users’ Actions is not

included in EDSF.

Figure 8.1 illustrates a sample series of HTTP requests sent from both genuine human users and spam users to submit new content to an online forum (“Post new topic” Action).

1  123.123.123.123 - - [10/Jul/2009:00:19:25 +0800] "GET /forum/index.php?action=post;board=1 HTTP/1.0" 200 7852 "http://example.com/forum/index.php?action=post;board=1" "Opera/9.0 (Windows NT 5.1; U; en)"
2  123.123.123.123 - - [10/Jul/2009:00:19:25 +0800] "GET /forum/index.php?action=post;board=1 HTTP/1.0" 200 7852 "http://example.com/forum/index.php?action=post;board=1" "Opera/9.0 (Windows NT 5.1; U; en)"
3  123.123.123.123 - - [10/Jul/2009:00:19:25 +0800] "GET /forum/index.php?action=post;board=1 HTTP/1.0" 200 7852 "http://example.com/forum/index.php?action=post;board=1" "Opera/9.0 (Windows NT 5.1; U; en)"
4  123.123.123.123 - - [10/Jul/2009:00:19:30 +0800] "GET /forum/index.php?board=1 HTTP/1.0" 200 6248 "http://example.com/forum/index.php?board=1" "Opera/9.0 (Windows NT 5.1; U; en)"

(a) Sequence of spam user HTTP header requests


1  111.111.111.111 - - [15/Aug/2009:07:05:10 +0800] "GET /forum/index.php HTTP/1.0" 200 60 "http://example.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13 (.NET CLR 3.5.30729)"
2  111.111.111.111 - - [15/Aug/2009:07:06:12 +0800] "GET /forum/index.php?board=1 HTTP/1.0" 200 6248 "http://example.com/forum/index.php" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13 (.NET CLR 3.5.30729)"
3  111.111.111.111 - - [15/Aug/2009:07:08:32 +0800] "GET /forum/index.php?action=post;board=1 HTTP/1.0" 200 7852 "http://example.com/forum/index.php?board=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13 (.NET CLR 3.5.30729)"
4  111.111.111.111 - - [15/Aug/2009:07:10:27 +0800] "GET /forum/index.php?topic=1397 HTTP/1.0" 200 524 "http://example.com/forum/index.php?action=post;board=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13 (.NET CLR 3.5.30729)"

(b) Sequence of genuine human HTTP header requests

Figure 8.1: Series of spam users’ HTTP requests (a) vs. genuine human HTTP requests (b) in an online forum to

create a new topic

Each HTTP entry consists of an IP address, a timestamp, the request method, the requested URL, the response status code, the response size, the referrer URL and the User Agent. There are some obvious differences between spam users' and genuine humans' requests, including:

• The genuine human’s sequence of HTTP header requests is more expected or on average. For example, Figure 8.1a shows the normal way that a website is navigated. It starts with users

browsing the main page of the forum (line 1), visiting one of the boards (line 2), clicking on

the “Post new topic” button to visit create new topic webpage (line 3) and finally visiting

their own created topic (line 4).

• The spam user's order of HTTP header requests does not show an expected sequence. For example, in Figure 8.1a, the spam user's navigation starts by directly requesting the create new topic page (line 1) and repeating the same request two more times (lines 2 and 3). Finally, it finishes by viewing one of the forum boards (line 4).

This factor shows that spam users follow pre-programmed and optimised procedures. Therefore, such

key differences in web usage behaviour can be utilised for identification of spam users.


In order to formulate the sequence of Actions and use it for identification of spam users, one possible solution is to assign an identifier (e.g. a unique character) to each Action and build an input vector by putting together these identifiers in the same sequence as they occur. However, in order to use such input vectors in a machine learning classifier, the following problems would arise.

• The number of features is not limited: For example, user A performs 3 Actions in a given transaction, the input vector length for A is 3; user B performs 5 Actions in the same

transaction, the input vector length for B is 5. If the feature set is limited to 3, the last two

Actions of B would be omitted; while if the feature set is set to 5, A’s input vector would

contain missing values (Figure 8.2). This problem becomes more significant when the length

of the users’ input vector substantially varies.

• The number of input vectors is high: Since the sequence of users’ Actions is an important criterion for classification, a slight change in the order of Actions would result in a new input

vector.

Figure 8.2: The problem of input vector length for machine learning classifiers

To address the problems above, OFSF introduces a new feature set called Action String and a novel rule-based classifier to formulate and distinguish spam users from genuine human users.

The abovementioned problems also highlight a drawback of the EDSF design: its inability to include the sequence of Actions.


8.2.1 Action String

Action String is a sequence of users’ Actions in a given transaction where each Action is identified by

an identifier, i.e. alphabet. By placing each Action identifier together in the same order that the user

performs the Action, an Action String is generated.

The main advantage of Action String compared to other feature sets – Action Set, Action Time, and

Action Frequency – is that it maintains the order of users’ Actions. This information can reveal more

accurate characteristics of spam or genuine human users’ behaviour on the Web 2.0 platform. The

following scenario illustrates an advantage of Action Strings.

Suppose user A navigates through “View main page”, “View board”, “Post new topic”, and “View

main page” Actions and spends 5, 5, 10, and 5 seconds on each Action respectively. Spam user S

navigates through “Post new topic”, “View board”, “View main page”, and “View main page”

Actions in the same span of time. Assume the total number of Actions is 7. To formulate A and S

navigational behaviour using Action Set, Action Time, and Action Frequency, the following input

vectors can be built up.

• Action Set input vector. A: (1,1,1,0,0,0,0), S: (1,1,1,0,0,0,0),

• Action Time input vector. A: (5,5,10,0,0,0,0), S: (5,5,10,0,0,0,0),

• Action Frequency input vector. A: (2,1,1,0,0,0,0), S: (2,1,1,0,0,0,0).

Clearly, there is no difference between the A and S input vectors although the order of performed

Actions is different.

Under the same assumptions, the Action Strings for A and S are created as follows, with each Action identified by a unique letter.

• Action String: A: “abca” S: “cbaa”

This example shows that Action String can clearly differentiate between A and S web navigational

behaviour.
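This contrast can be reproduced with a short sketch; the code below is an illustrative rendering of three of the feature sets (Action Time behaves like Action Set and Action Frequency in this scenario), with the Action-to-letter assignments assumed as in the example above.

# Sketch: feature sets for the scenario above ('a' = View main page,
# 'b' = View board, 'c' = Post new topic; 7 Actions in total).
from collections import Counter

ACTIONS = ["a", "b", "c", "d", "e", "f", "g"]

def action_set(seq):        # 1 if the Action occurs in the transaction, else 0
    return [1 if x in seq else 0 for x in ACTIONS]

def action_frequency(seq):  # how many times each Action occurs
    counts = Counter(seq)
    return [counts[x] for x in ACTIONS]

def action_string(seq):     # keeps the order of the performed Actions
    return "".join(seq)

A = ["a", "b", "c", "a"]    # genuine user A
S = ["c", "b", "a", "a"]    # spam user S

assert action_set(A) == action_set(S)              # both (1,1,1,0,0,0,0)
assert action_frequency(A) == action_frequency(S)  # both (2,1,1,0,0,0,0)
assert action_string(A) != action_string(S)        # "abca" vs "cbaa"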

OFSF makes use of Action String in order to formulate users’ web navigational behaviour. Next,

OFSF employs a rule-based classifier to differentiate spam users from genuine human users.


8.3 Requirements

The following requirements are specified for the proposed solution. This represents the first stage of the conceptual process where requirements are elicited and prioritised. The requirements are as follows.

1. Extendibility: the method should be extendable to all Web 2.0 platforms.

2. Adaptability: the method should be adaptable to new Spam 2.0 behaviours.

3. Automaticity: the method should automatically differentiate Spam 2.0 from genuine human behaviour.

4. Independency: the method should be independent of its platform. For example, it should be language- and content-independent.

5. Implicitness: the method should implicitly accumulate behavioural data.

6. Discriminative features: the features should be discriminative enough to differentiate genuine human user behaviour from spam users' behaviour.

7. Chronological Order: the order and sequence of users' web navigational behaviour should be clear.

8. Spontaneity: the method should differentiate spam users from genuine human users on-the-fly.

The following section defines the design rationale based on the abovementioned requirements. A design decision for each requirement is made in the design rationale along with a discussion of how the requirement is met.

8.4 Design Rationale

This section provides a design rationale to meet all the requirements that were listed in the previous section. This is the second stage of the conceptual process. The following design decisions are proposed to address each requirement.


1. Develop an early-detection based Spam 2.0 filtering method that is language- and platform-independent (Req. 4).

2. Capture server side web usage data while the user interacts with the website for implicit data accumulation (Req. 5) and extendibility to all Web 2.0 platforms (Req. 1).

3. Develop an automated rule-based classifier that can be trained with new instances of Spam 2.0 (Req. 2) and automatically identify spam users (Req. 3).

4. Formulate web usage data into the Action String to build a feature set that both identifies spam users (Req. 6) and maintains the sequence of performed Actions (Req. 7).

5. Use a tree-based data structure (i.e. Trie) to spontaneously store, retrieve, update, and identify spam navigational patterns (Req. 8).

In the development of any solution, the above design decisions should be considered. All these design

decisions are included in the proposed method to detect Spam 2.0 behaviour on web applications. This

concludes stage two of the conceptual process. The next stage is the establishment of the theoretical

foundation for the proposed solution.

8.5 Theoretical Foundation

In this section, the theoretical foundation for the OFSF is discussed, the design decisions are analysed,

and an algorithm is proposed. The theoretical foundation as depicted in Figure 8.3 consists of five

main parts: tracking, pre-processing, feature measurement, classifier construction, and classification.

Tracking

Records web usage data including the user’s IP address, username, requested webpage URL, session

identity, and timestamp.

Pre-processing

Data cleansing, transaction identification and Action identification.

Feature Measurement


Extracts and measures the Action String feature set from the web usage data based on each Action identifier.

Classifier Construction

Populates the rule-based classifier based on Action Strings and prepares the classifier for classification.

Classification

A rule-based classifier based on Trie data structure is used here to distinguish Spam 2.0 from genuine human behaviour.

Figure 8.3: The overall framework for OFSF

The proposed solution explained above addresses the requirements described in Section 8.3. Below is a detailed description of each part of OFSF.

8.5.1 Tracking

Input: Incoming traffic

Output: User and session identified web usage data


Tracking monitors web usage data and the navigation stream. For each incoming HTTP request, IP

address, visited URL, referrer URL, browser identity, timestamp, session ID and user account name

are tracked.

8.5.2 Pre-processing

Input: User and session identified web usage data

Output: cleansed and transaction-identified usage data

This part includes three tasks, which are data cleansing, transaction identification and Action

identification.

8.5.2.1 Data cleansing

Once data is aggregated through Tracking, it needs to be cleansed of irrelevant information such as visitors who did not create a user account in the system, researchers who monitor the forum, and genuine and known web robots.

8.5.2.2 Transaction identification

Next, the data needs to be grouped into meaningful user navigation clusters. This task is known as transaction identification. Each tracking record is accompanied by a unique session ID, and in OFSF data is grouped based on these session IDs. Therefore, requests made by a user during a single visit are clearly identifiable. Session level transactions make it possible to track user requests at each visit. Moreover, defining transactions at the session level supports fast, on-the-fly classification as well as providing in-depth insights into the users' behaviour. Comparatively, at other transaction levels (e.g. IP and User) a longer time span is required to group user behavioural data.
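As a minimal sketch (assuming each tracking record already carries a session_id field, as produced by the Tracking part), transaction identification reduces to grouping records by session ID:

# Sketch: group tracking records into session-level transactions by session ID.
from collections import defaultdict

def identify_transactions(records):
    """records: iterable of dicts with at least 'session_id' and 'timestamp' keys."""
    transactions = defaultdict(list)
    for record in sorted(records, key=lambda r: r["timestamp"]):  # keep chronological order
        transactions[record["session_id"]].append(record)
    return transactions  # one transaction (list of requests) per session ID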

8.5.2.3 Action identification

Based on a predefined set of Actions, the Action identification task extracts Actions from users' transactions. The predefined Action set needs to be created for the different types of tasks that a user can accomplish on a given Web 2.0 platform. The majority of Actions are common to many Web 2.0 platforms. A summary of such Actions is presented in Table 8.1.


Table 8.1: Summary of common Actions on Web 2.0 platforms

Action Name Description

Registration Create a user account

Login/Logout Login and logout using pre-registered user account

Submitting content Adding new content as a topic/comment/reply/etc.

Searching Search the Web 2.0 platform for specific information

Viewing content Reading articles/posts/comments/etc

For each Action, a unique identifier such as a character is assigned. This identifier is later used to construct the feature set, i.e. Action String.
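As a small illustrative sketch (the URL fragments below are assumed, phpBB-like patterns, not the exact rules of the prototype), identifier assignment can be as simple as mapping each predefined Action to one character:

# Sketch: map predefined Actions to unique character identifiers.
# The URL fragments are illustrative assumptions, not the prototype's exact rules.
ACTION_PATTERNS = {
    "mode=register": "Registration",
    "mode=login": "Login/Logout",
    "action=post": "Submitting content",
    "action=search": "Searching",
    "viewtopic": "Viewing content",
}
# e.g. {'Registration': 'a', 'Login/Logout': 'b', ...}
ACTION_IDS = {name: chr(ord("a") + i)
              for i, name in enumerate(ACTION_PATTERNS.values())}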

Figure 8.4 presents a flowchart for the pre-processing part of OFSF. The pre-processing proceeds as follows.

Figure 8.4: Flowchart for Pre-processing part of OFSF

Step 1: The pre-processing tasks start by getting user- and session-identified web usage data.

Step 2: Unneeded data is wiped out during the data cleansing task.

Step 3: The data is grouped at the session transaction level.

Step 4: Actions are extracted from user transactions based on the predefined Action set. Additionally, each Action is assigned a unique character.

8.5.3 Feature Measurement

Input: Cleansed and transaction-identified usage data

Output: Action Strings

This part extracts and measures the Action String feature set from users' transactions to build input vectors for classification. An Action String is a sequence of a user's Actions in a given transaction where each Action is identified by a character, i.e. a letter of an alphabet. By placing the Action identifiers together in the same order in which the user performs the Actions, an Action String is generated.

Given a set of web pages W = {w1, w2, …, w|W|}, A is defined as a set of Actions, such that

A = {a1, a2, …, a|A|},  ai ⊆ W    (8.1)

Respectively, aSti is defined as

aSti = ai1 ai2 … aik,  aij ∈ A,  1 ≤ i ≤ |T|    (8.2)

aSti refers to the sequence of Actions (called an Action String) performed in a given transaction i, and |T| is the total number of transactions.

Figure 8.5: The flowchart for feature measurement part of OFSF

Figure 8.5 illustrates the flowchart diagram for the feature measurement part of OFSF. The process consists of the following steps.

Step 1: Get the users’ transactions from the pre-processing part.


Step 2: Construct Action Strings (aSt) based on the user's set of Actions (A) and the user's transactions (T), to be used in the next part of the OFSF framework. Here, each Action is represented by an identifier ai, and aSt is constructed by placing the ai together in sequence.

8.5.4 Construct Classifier

Input: Action Strings

Output: Trie data structure associated with probability of being spam and non-spam for each Action

String

Once Action Strings have been created, OFSF constructs a rule-based classifier. The classifier needs to spontaneously identify spam users as distinct from non-spam ones. The OFSF classifier is based on a Trie structure.

A Trie is a tree-based data structure for storing and retrieving information (Edward 1960). Its main advantages are ease of updating and handling, and short access times. A Trie is a tree structure in which each node contains the value of the key with which it is associated. A Trie does not store duplicates, so it is suitable for situations where there is redundancy in the data, e.g. web navigational patterns.

OFSF stores the Action Strings in a Trie data structure. Each Trie edge carries an Action identifier (i.e. a character) and each node holds two probabilities: the probability of the corresponding Action String belonging to spam users and the probability of it belonging to non-spam users. These probabilities are measured through Eq. 8.3 and Eq. 8.4.

PH(si) = fH(si) / (fH(si) + fS(si))    (8.3)

PS(si) = fS(si) / (fH(si) + fS(si))    (8.4)

where si is an Action String, fH(si) is the frequency of non-spam Action Strings with prefix si, fS(si) is the frequency of spam Action Strings with prefix si, and PH(si) and PS(si) are the probabilities of si being non-spam and spam respectively.


Zero probability means that a particular Action String has never occurred before. For example, Figure

8.6 illustrates a sample Trie data structure for a set of {ABC, ABCD, ABDE, ABC} Action Strings for a

genuine human user and {ABD, ABD, ABDE} Action Strings for a spam user.

root
└─ A  (H:0, S:0)
   └─ B  (H:0, S:0)
      ├─ C  (H:1, S:0)
      │  └─ D  (H:1, S:0)
      └─ D  (H:0, S:1)
         └─ E  (H:0.5, S:0.5)

Figure 8.6: Sample Trie data structure for the sets of genuine human and spam users' Action Strings above. H and S represent the probability of an Action String belonging to genuine human and spam users respectively.

The main advantage of OFSF over EDSF is that EDSF requires users to

complete their web navigation at a given transaction in order to classify

users. However, OFSF can spontaneously classify users as they navigate

through a Web 2.0 platform.

The Trie used as the data structure in the OFSF classifier makes it possible to spontaneously look up a specific pattern. Therefore, as a user navigates through the Web 2.0 platform, OFSF is capable of classifying the user based on the user's navigational behaviour.

Figure 8.7: The flowchart of OFSF rule-based classifier


Figure 8.7 presents the flowchart of the OFSF rule-based classifier. The process consists of the following steps.

Step 1: Get the Action Strings from the previous part of OFSF.

Step 2: Build a Trie data structure based on each Action String.

Step 3: Measure PH(si) and PS(si) for each Action String.

Step 4: Assign the above-measured probabilities to the corresponding node in the Trie.

8.5.5 Classification

Input: Trie data structure associated with probability of being spam and non-spam for each Action

String

Output: the classification result whether or not an Action String belongs to a spam user

A new incoming Action String si is validated against the Trie data structure built in the previous part of OFSF (i.e. classifier construction). The validation of si falls into one of two categories: Matched and Not-Matched. In the former case, the framework looks at PH(si); if 1 - PH(si) > Threshold, the sequence si is classified as a spam Action String. In the latter case, no decision is made (refer to Section 8.9 for validation and discussion).

Figure 8.8: The flowchart for OFSF classification part

Figure 8.8 presents a flowchart for OFSF classification. This process consists of the following steps.

Step 1: Get Trie from previous part of OFSF.


Step 2: Get new incoming Action String.

Step 3: Get the Threshold.

Step 4: Measure PH(si) for the new incoming Action String.

Step 5: If 1 - PH(si) is above the given threshold, the Action String indicates a spam user.

8.6 Prototype Implementation

The fourth stage of the conceptual process is where the theoretical foundation is implemented as a

prototype. In order to fulfil the requirements, the following technologies are used to implement the

prototype.

Tracking part

Similarly to the tracking part of HoneySpam 2.0 described in Section 5.6.1, the PHP programming language and a MySQL database are used to implement the tracking part of OFSF. However, form usage data is not monitored by OFSF; hence, there is no prototype implementation for the form tracking component.

Pre-processing part

The pre-processing part is implemented using the PHP language. The outcome of this part is stored in

a MySQL database which can be used later for feature measurement.

This part also assigns a unique identifier to each Action. These identifiers are used to build Action

Strings.

Feature Measurement

The feature measurement part of the OFSF is implemented using PHP language. Once the input

vectors are constructed, the outcome is stored in a MySQL table and used inside the classifier.

Construct Classifier

The rule-based classifier is implemented in the Java language. This Java code retrieves the Action Strings, calculates the probabilities of spam activity and constructs the Trie data structure.


Classification

This classifier is used to generate the classification results. Additionally, output of the classification is stored in a specific file format to be used within a Microsoft Excel Spreadsheet to generate reports and figures.

8.6.1 Tracking part

Similar to the tracking part of HoneySpam 2.0 (Section 5.6.1), the OFSF tracking part is modelled using two PHP files. The form tracking PHP files are not included here.

The PHP files are executed on a virtual web server running Linux (SUSE Linux Enterprise Server 11), the Apache web server and PHP version 5.

8.6.2 Pre-processing part

The pre-processing part includes the following tasks:

• Cleansing the data

• Identification of transactions

• Identification of Actions

The sections below provide detailed descriptions of the above tasks.

8.6.2.1 Data cleansing

SQL queries similar to those in Section 5.6.2.1 are constructed for this task. The SQL queries are used to:

• remove redundant data,

• remove visitors data,

• remove researchers data who monitor the system,

• remove standard and known robot data, and

• remove requests for JavaScript, image, stylesheet and similar files.


8.6.2.2 Transaction identification

For the transaction identification task, an SQL statement similar to that in Section 6.5.1 is constructed to identify session level transactions. Transaction identification in OFSF generates a list of available session IDs which can later be used to extract information from each session.

8.6.2.3 Action identification

Based on a defined set of Actions, this task extracts relevant Actions from users' transactions and assigns a unique identifier (i.e. a character) to each Action. The following algorithm is used for the Action identification task.

Code 8.1: Action identification algorithm

Algorithm action_identification(T)
input:  set of users' transactions, T
        a set of Actions, A'
output: set of Actions performed by the user, A
1. for each t in T and a in A'
2.   if t = a            # if t is a subset of a
3.     assign an identifier to a
4.     add a to A
5.   end if
6. repeat
7. return A

Code 8.1 begins by obtaining a list of a user's transactions and the defined set of Actions. On line 1, it iterates through each user transaction and each defined Action. On line 2, if a transaction is a subset of or equal to a defined Action, the Action is assigned an identifier (i.e. a character) and, on line 4, the corresponding Action is added to the user's list of Actions (A).

8.6.3 Feature measurement

The feature measurement part constructs an Action String feature set as a sequence of Actions that are

performed by a user in a given transaction. Therefore, Action Strings not only contain the Actions

performed by users, but also maintain the order or sequence of these Actions.

The following algorithm is used to construct Action Strings.


Code 8.2: Action String algorithm

Algorithm action_string(T, A)
input:  set of transactions by a given user, T
        set of Actions with their identifiers, A
output: user's set of Action Strings, aSt
1. for each t in T and a in A
2.   if t = a
3.     add a to aSt
4. return aSt

Code 8.2 starts by getting a list of user’s transactions and performed Actions. On line 1, it iterates through each transaction and performed Action in their sets to define the order of the performed

Action in the user’s transaction. Once this match is found (Line 2), the algorithm adds an Action identifier to the sequence of user Action Strings (aSt). The algorithm returns the set of user’s Action

Strings on Line 4.

8.6.4 Construct classifier

A rule-based classifier based on the Trie data structure is constructed in this part of OFSF. This classifier is used in the classification part to identify spam users.

Code 8.3 illustrates the pseudo-code of this part.

Code 8.3: OFSF rule-based classifier construction

Algorithm classifier_construction(aSti)
input:  set of users' Action Strings, aSti
output: a rule-based classifier, ф
1.  ф = create an empty Trie
2.  for each aSt in aSti
3.    string = null           # reset the accumulated prefix for each Action String
4.    for each a in aSt
5.      ф = Trie.add(a, 0)    # Trie.add(key, values)
6.      string = string + a
7.      if string.length() = aSt.length()
8.        ф = Trie.add(a, PH(string), PS(string))
9.      end if
10.   repeat
11. repeat
12. return ф

The algorithm starts by creating an empty Trie with an empty root. For each Action String in the set of all the user's Action Strings, the algorithm iterates through its Actions and adds each Action identifier (e.g. a character) to the Trie data structure. Trie.add() takes two arguments, a key and its values. If the key (in letter alphabetical order) is smaller than the previous Trie key, the new node is added as the Trie's left child; otherwise, the new node is added as the Trie's right child. The value 0 is assigned to such intermediate nodes. Once the algorithm reaches the last Action in the Action String, it measures PH(string) and PS(string) for the Action String, and these measures are added to the value of the last node. The algorithm returns the Trie structure with its corresponding values on line 12.

8.6.5 Classification

Code 8.4 provides an overview of the OFSF classification part. Each user (u) has multiple Action Strings, and each Action String contains a series of performed Actions (ai). For every new incoming u, the algorithm examines the user's Action String against the rule-based classifier (ф). If 1 - PH(Action String) is higher than the given threshold, the classifier marks the incoming u as a spam user.

Code 8.4: OFSF classification algorithm

Algorithm classification(u)
input:  incoming user transactions, u
        given threshold value, threshold
output: whether the user is spam or non-spam, result
1. for each u
2.   aStu = action_string(u, action_identification(u))
3.   if 1 - PH(aStu) > threshold
4.     result = 1
5.   else
6.     result = 0
7. repeat
8. return result
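To make classifier construction and classification concrete, the sketch below re-implements the ideas of Code 8.3 and Code 8.4 in Python (the prototype's classifier is written in Java); a dictionary-based Trie stands in for the left/right-child layout, and the probabilities follow Eq. 8.3 and Eq. 8.4.

# Sketch of the OFSF Trie classifier (Python; the prototype uses Java).
# Terminal nodes hold (fH, fS) counts from which PH and PS are derived.

class TrieNode:
    def __init__(self):
        self.children = {}   # Action identifier (character) -> TrieNode
        self.f_h = 0         # frequency as a complete non-spam Action String
        self.f_s = 0         # frequency as a complete spam Action String

class OFSFClassifier:
    def __init__(self, threshold=0.0):
        self.root = TrieNode()
        self.threshold = threshold

    def add(self, action_string, is_spam):
        """Classifier construction: insert one labelled Action String."""
        node = self.root
        for a in action_string:
            node = node.children.setdefault(a, TrieNode())
        if is_spam:
            node.f_s += 1
        else:
            node.f_h += 1

    def classify(self, action_string):
        """Returns True (spam), False (non-spam) or None (Not-Matched, no decision)."""
        node = self.root
        for a in action_string:
            if a not in node.children:
                return None                 # Not-Matched: no decision is made
            node = node.children[a]
        total = node.f_h + node.f_s
        if total == 0:
            return None                     # prefix exists but never seen as a full string
        p_h = node.f_h / total              # Eq. 8.3
        return (1 - p_h) > self.threshold   # spam if 1 - PH(si) > threshold

Fed with the labelled Action Strings of Figure 8.6, classify("ABD") returns True, since PS("ABD") = 1 in that sample.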

This concludes stage four of the conceptual process. The next section, experimental settings, is where

the proposed prototype is tested.

8.7 Experimental Settings

The experimental settings constitute the fifth stage of the conceptual process where the proposed

prototype is tested and examined. The OFSF experiment involves 3 parts as follows:


• Data Collection: explains the dataset that is used to conduct the experiment. It includes both Spam 2.0 and non-spam data.

• Parameter Specification: discusses all the parameters and their associated values used to gather the results.

• Performance Measurement: discusses the methods used for measuring the performance of OFSF.

8.7.1 Data collection

Since there is no publicly available dataset containing web usage data for spam and non-spam users, the dataset used in this experiment is the same as the dataset used for the EDSF experiment (Section

7.7.1). The summary of the dataset is presented in Table 8.2.

Table 8.2: Summary of the OFSF experiment dataset

Type of data Frequency

No. of genuine users’ records 5,555

No. of spam records 11,039

Additionally, another experiment is conducted for OFSF and is discussed in Section 8.9.1. The dataset for this experiment was collected from a live forum that is used by both genuine human users and spam users. This experiment further validates the performance of OFSF.

8.7.2 Parameter specification

All the parameters used in the OFSF experiment, together with their associated values, are listed in

Table 8.3.


Table 8.3: Parameters and their associated values used in OFSF experiment

Parameter Value Description

No. of transactions, |T| 4,227 Result of transaction identification at session level during the pre-processing part.

No. of spam transactions 3,726 Manually classified spam transactions

No. of non-spam transactions 501 Manually classified non-spam transactions

No. of Actions, |A| 34 Result of the feature measurement part for different types of Actions.

Window sizes 2 to 10 The window size over the test set of Action Strings. The length of the longest Action String in the dataset is 10.

Threshold -0.05 to 0.05 The threshold value for classification.

Table 7.3 provides a description of 34 Actions that are extracted from the dataset.

To evaluate the effects on OFSF of different training and test sets, the dataset is randomly divided into

five datasets (DS1 to DS5). About 65% of the data is used to construct the classifier (training set) and

the remaining 35% is used for evaluation purposes (test set).

OFSF has real-time detection or on-the-fly detection capability. OFSF measures Action Strings as

they occur. The aim is to identify the spam behaviour pattern in the least number of Actions.

For each Action String in the test set, the following steps are taken:

1. The window size is set to 2.

2. A window over test Action Strings is made.

3. Action String’s length is limited to the size of the window. For example, if the length of an Action String is 5 and window size is 2, only the first 2 characters of the Action String are

considered.

4. The rule-based classifier is run against the window size of the Action String derived from the last step to generate the classification results.

5. The window size is increased by one character.

6. Repeat above steps until maximum window size is reached.


The previous steps simulate real-world practice, where a user's Action String grows over time.
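A sketch of this sliding-window evaluation, assuming a classifier object of the kind sketched in Section 8.6.5 (a classify method returning spam, non-spam or None for Not-Matched), could read:

# Sketch: evaluate the classifier over growing window sizes (steps 1-6 above).
def windowed_evaluation(classifier, test_strings, max_window=10):
    """test_strings: list of (action_string, is_spam_label) pairs."""
    for window in range(2, max_window + 1):
        correct, decided = 0, 0
        for action_string, is_spam in test_strings:
            prediction = classifier.classify(action_string[:window])  # truncate to window
            if prediction is None:
                continue              # Not-Matched: no decision in this experiment
            decided += 1
            if prediction == is_spam:
                correct += 1
        yield window, (correct / decided if decided else 0.0)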

8.7.3 Performance measurement

Common performance measurement techniques known as the confusion matrix, Precision, Recall and F1-measure are used to calculate the performance of OFSF in the experiment. Moreover, a further performance measurement method known as the Matthews Correlation Coefficient (MCC) is used as well (Matthews 1975). MCC is one of the best performance measures for binary classification, especially when the two classes are unbalanced (Baldi et al. 2000). It considers true and false positives and negatives, and returns a value between -1 and +1. If the returned value is closer to +1, the classification result is better and the decision can be considered to have greater certainty. A value close to 0 indicates that the output of the framework is similar to random prediction, while a value closer to -1 shows a strong inverse ability of the classifier. MCC is defined as follows:

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (8.5)

In Eq. 8.5, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. These values can be extracted from a confusion matrix.
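For reference, a direct transcription of Eq. 8.5 as a small function (the four counts are read off a confusion matrix such as those in Section 7.9.1):

# Sketch: Matthews Correlation Coefficient from confusion-matrix counts (Eq. 8.5).
import math

def mcc(tp, tn, fp, fn):
    """Returns a value in [-1, +1]; values near 0 indicate near-random prediction."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0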

This concludes stage five of the conceptual process. In the next section, the results and observations based on the proposed prototype experiments are presented.

8.8 Results and Observations

This section discusses the results that are observed after running OFSF. This is the sixth stage of the conceptual process.

Figure 8.9 illustrates the accuracy of OFSF for different window sizes on the five random datasets

(DS1 – DS5).


Figure 8.9: Average accuracy per dataset for different window sizes

Table 8.4 presents the average accuracy per dataset, together with the accuracy per dataset for each window size.

Table 8.4: Overall average accuracy per dataset and for each different window size

DS1 DS2 DS3 DS4 DS5

Accuracy 0.939 0.930 0.926 0.933 0.934

Len2 0.727 0.734 0.719 0.732 0.721

Len3 0.965 0.955 0.952 0.958 0.960

Len4 0.965 0.955 0.952 0.958 0.960

Len5+ 0.966 0.955 0.952 0.958 0.960

The best accuracy of 96% was achieved on DS1 with window size equal to or larger than 5, while

overall accuracy is 93% over all datasets. Figure 8.10 presents the MCC value for different window

sizes. Overall, the average MCC on all datasets is 0.744. The best MCC value 0.808 was achieved on

DS1 for a window size equal to or larger than 5.


Figure 8.10: MCC value per different datasets on each window sizes

Table 8.5 presents the average MCC per dataset and MCC per dataset for different window sizes.

Table 8.5: The overall average MCC value per dataset and for each window size

DS1 DS2 DS3 DS4 DS5

MCC 0.762 0.754 0.725 0.729 0.749

Len2 0.406 0.430 0.412 0.410 0.405

Len3 0.802 0.795 0.764 0.769 0.792

Len4 0.802 0.795 0.764 0.769 0.792

Len5+ 0.808 0.795 0.764 0.769 0.792

Table 8.6 presents the average F1-measure for each dataset. The best F1 value of 0.82 is achieved on DS1 with a window size equal to or greater than 5. The overall average F1 value across all datasets is 0.75.

Table 8.6: The average precision, recall and F1 measure value over all datasets

DS1 DS2 DS3 DS4 DS5

Precision 0.91 0.96 0.92 0.89 0.92

Recall 0.69 0.64 0.62 0.65 0.66

F1 Measure 0.77 0.76 0.73 0.74 0.76


The threshold value for all the above experiments is set to zero. However, in order to investigate the

effect of different thresholds on the OFSF performance, the threshold is changed from -0.05 to 0.05.

Figure 8.11 illustrates the performance of OFSF on different thresholds for window sizes varying

from 2 to 6 characters.

Figure 8.11: Performance of OFSF based on different thresholds for different window sizes (Len2 to Len6)

The best MCC value of 0.88 is achieved for thresholds between 0 and 0.03 with a window size equal to or larger than 4 (Len4 and up).

8.9 Validation and Comparative Analysis

The overall average accuracy achieved in the OFSF experiment on 5 datasets is 93% where the best

accuracy is 96% on window size equal to or greater than 5 characters. The best F1 value is 0.82 while

the average F1 value is 0.75 on all the datasets. The average MCC value is 0.744 where the best MCC

value is 0.808 for a window size equal to or greater than 5 characters.

The performance of OFSF becomes better as the window size increases. This shows that OFSF can

make better predictions over time as the user uses the system. However, the performance remains the

same after certain window sizes. The reason is that the datasets are randomly selected and the training

sets do not contain all the Action Strings. Hence, some of the Action Strings are not inside the data

structure of the rule-based classifier and OFSF cannot find a match for the incoming Action String.

The same trend can be seen in the accuracy results, where accuracy remains the same above a window size of 5. Regarding the threshold, moving it toward negative values increases the number of false-negatives, while moving it toward positive values increases the number of false-positives. However, the best results are achieved by keeping the threshold between 0.0 and 0.03.

OFSF can be manually updated by including new Action Strings during Classifier Construction.

However, current implementation of OFSF does not automatically update with new Action Strings.

For example, if there is no match for the incoming Action String, the classifier does not produce any decision for that Action String. Another drawback of OFSF is that some Action Strings are the same for both genuine human users and spam users. This situation produces false-positives and false-negatives because the current Action String considers only the web navigation behaviour of the users.

As future work, OFSF can be improved by considering the following:

• An adaptive mechanism to automatically update a classifier based on new unseen Action Strings.

• Include other usage behaviours in Action Strings, such as the time of each Action and the existence of mouse/keyboard activity, in order to tell genuine human and spam Action Strings apart.

Overall, the average performance of OFSF compared to EDSF is lower.

However, EDSF’s better performance is achieved at the cost of time and

computational resources. Additionally, in terms of on-the-fly

identification of spam users, OFSF surpasses EDSF since the EDSF

classifier cannot be trained with a partially completed input vector.

8.9.1 Comparative analysis

As mentioned previously, there have been no studies on Spam 2.0 identification by using web usage data. Therefore, to compare the proposed solution with similar solutions and best practices, OFSF is compared with four well-known approaches for detecting Spam 2.0 including Akismet, Defensio,

TypePad, and CAPTCHA as well as EDSF.


The same experimental setting that was described in Section 7.9.1 is used here to conduct the

comparative analysis. Hence, the HSCOM6 dataset with the same parameters is used.

The results of the experiment on Akismet, Defensio, TypePad, CAPTCHA and EDSF are presented in

Section 7.9.1.

In the OFSF experiment, all the parameter settings are the same as those described in Section 8.7.2.

However, all non-match Action Strings are considered as false-negatives. A false-negative is preferable to a false-positive since, for example, it is far more annoying for a genuine user to be blocked from a website than it is for a spam user to be allowed to submit spam content.

Table 8.7 presents the accuracy, F1 and MCC values of the OFSF experiment on HSCOM6 for window sizes 2 to 10.

Table 8.7: The result of OFSF experiment on HSCOM6

Accuracy MCC F1 Measure

Len2 0.36 0.04 0.01

Len3 0.36 0.04 0.009

Len4 0.98 0.34 0.25

Len5 0.97 0.25 0.15

Len6 0.56 0.05 0.009

Len7 0.97 0.29 0.16

Len8 0.96 0.27 0.14

Len9 0.97 0.29 0.16

Len10 0.98 0.38 0.26

Average 0.79 0.22 0.12

The best accuracy of 0.98, MCC of 0.38 and F1 of 0.26 are achieved at window size 10. The averages over the whole experiment are accuracy 0.79, MCC 0.22 and F1 0.12.

The performance of the system increases from window size 2 to 4, drops for window sizes 5 to 6, and increases again for window sizes 7 to 10. This behaviour differs from the previous OFSF experiment, in which performance increased steadily as the window size became longer. This is because, unlike the previous experiment, in this experiment all the non-match Action Strings are considered as false-negatives. This assumption has a significant effect on the performance of the system. Table 8.8 shows the number of non-match Action Strings for each window size.

Table 8.8: The number of Non-Match Action Strings in OFSF experiment

          Len2    Len3    Len4   Len5   Len6    Len7   Len8   Len9   Len10
No Match  2,816   3,965   69     133    2,720   168    209    186    108

It is clear from the table that at window size 6 the number of non-matched Action Strings is very high, and that this number is at its lowest value at window size 4. However, by deducting the number of non-matches from the false-negative rate, the best accuracy of 0.99, MCC of 0.89 and F1 of 0.88 can be achieved at window sizes equal to or greater than 5. This shows the same behaviour as the previous OFSF experiment described in Section 8.8.

In order to equally compare the result of OFSF with other existing

solutions, the OFSF classifier is forced to produce a result even in the case

of non-match Action Strings. All non-match Action Strings are

considered as false-negatives.

8.9.1.1 Summary

Table 8.9 presents the summary of the OFSF comparison analysis against the four existing solutions as well as EDSF.


Table 8.9: The summary of the OFSF comparison analysis

                      Akismet      Typepad      Defensio     CAPTCHA      EDSF (aS, aF)    OFSF
                                   AntiSpam
F1-Measure            0.75         0.89         0.66         -            0.91, 0.89       0.12
Accuracy              0.95         0.98         0.93         0.95         0.99, 0.99       0.79
False-positive rate   0.01         0.05         0.05         -            0.23, 0.32       0.04
False-negative rate   0.04         0.01         0.06         -            0.001, 0.0003    0.20
Strategy              Detection/   Detection/   Detection/   Prevention-  Early-detection  On-the-fly
                      Prevention-  Prevention-  Prevention-  based        based            detection
                      based        based        based                                      based
Language Dependency   Yes          Yes          Yes          No           No               No

Although the performance of OFSF is lower than that of other methods, the false-positive rate for

OFSF is the second lowest among other methods. This gives OFSF an advantage where the cost of

false-positives is high.

8.10 Conclusion

This chapter presents an On-the-Fly early-detection based Spam 2.0 filter (OFSF). OFSF is capable of identifying spam users spontaneously as they use a Web 2.0 platform. OFSF uses a novel rule-based classifier based on the Trie data structure, which allows OFSF to detect spam users quickly and on-the-fly. OFSF exploits a new feature set known as Action String, which is the sequence of a user's Actions in a given transaction where each Action is identified by a character, i.e. a letter of an alphabet.


Figure 8.12: The performance of OFSF on HSCOM6

Figure 8.12 illustrates the performance of OFSF on HSCOM6 dataset. The overall performance of

OFSF without considering non-match Action Strings is promising. By considering non-match Action

Strings as false-negatives and forcing the OFSF rule-based classifier to produce results, the OFSF performance decreases. However, the comparative false-positive ratio is fairly low.

The following chapter concludes this dissertation.


Chapter 9

Conclusion and Future Work

This chapter covers:

► The current issues and problems with spam filtering.

► Solutions proposed by this dissertation to address spam filtering problems.

► Conclusion and future work.

9.1 Introduction

Web 2.0 is commonly associated with web applications that facilitate interactive information sharing,

interoperability, user-centred design and collaboration on the World Wide Web. The read/write Web 2.0 concepts are the backbone of the vast majority of the web services that we use today. In contrast to

non-interactive websites, Web 2.0 increasingly emphasises human collaboration through a

participation architecture that encourages users to add value to web applications as they use them.

Today, Web 2.0 functions are commonly found in web-based communities, applications, social-networking sites, media-sharing sites, wikis, blogs, mashups, and folksonomies. They are widely provided by government, public, private and personal entities.

New media introduce new ways of communicating that are not immune to abuse. A fake eye-catching profile on a social networking website, a promotional review, a response to a thread in an online forum with unsolicited content and a manipulated wiki page are examples of the new forms of spam on the web, referred to as Web 2.0 Spam or Spam 2.0. Spam 2.0 is simply unsolicited content that is hosted on legitimate Web 2.0 platforms. As mentioned in Chapters 1 and 2, this new form of spam has produced new challenges for the spam filtering community.

The current literature highlights several key problems associated with Spam 2.0 filtering, including the parasitic nature of Spam 2.0 and the user inconvenience associated with current anti-spam methods, among others.


The main motivation of this research is to address the Spam 2.0 problem and to develop a framework for early-detection and prevention of Spam 2.0.

This chapter provides a summary of the issues and problems addressed by this study along with its main contributions. The next section explains the main problems with which this dissertation was concerned.

9.2 Problems and Issues

In this dissertation, three main problems and challenges associated with Spam 2.0 filtering have been addressed. As mentioned in Chapter 3, these issues are:

• Problems with detection-based Spam 2.0 filtering methods.

• Problems with prevention-based Spam 2.0 filtering methods.

• Problems with early-detection based Spam 2.0 filtering methods.

9.2.1 Problems with detection-based Spam 2.0 filtering methods

As mentioned in Chapter 1, detection-based spam filters deal with the content of spam in order to identify spam patterns. Detection methods look for spam words, templates, attachments etc. inside the content.

However, the state-of-the-art detection-based spam filtering approaches are very fragile. The main socio-economic and technical problems with existing solutions that were outlined in Chapter 3 include:

1. Solutions are passive

2. Consumption of network and storage resources

3. Inability to automatically adapt to new spam patterns

4. Inability to adapt to different platforms and languages

It is clear that current detection-based Spam 2.0 filtering methods are still susceptible to the aforementioned issues, which need to be tackled in order to eliminate Spam 2.0. Therefore, these issues are the drivers of this research and the subsequent development of a new Spam 2.0 filtering method that can be extended across multiple platforms and languages as well as offer robust resistance against new Spam 2.0 tactics.

9.2.2 Problems with prevention-based Spam 2.0 filtering methods

As mentioned in Chapter 1, prevention methods strive to limit automated access to the system. By utilising computational, time-based, and economic challenges, prevention methods make spam distribution difficult.

However, as mentioned in Chapter 2, the current state-of-the-art prevention-based spam filtering approaches are very fragile. The main socio-economic and technical problems associated with the existing solutions that were outlined in Chapter 3 are as follows:

1. Weak/complex challenges result in ineffective Spam 2.0 filtering

2. Computer programs are becoming better at bypassing CAPTCHA

3. Computer programs are becoming faster at meeting POW challenges

4. Current prevention-based methods are vulnerable to relay attacks

5. Blacklists are too slow to protect Web 2.0 platforms

6. Blacklists are not effective for Spam 2.0 filtering

The current prevention-based Spam 2.0 filtering methods are still susceptible to the abovementioned issues, which need to be tackled effectively in order to eliminate Spam 2.0.

9.2.3 Problems with early-detection based Spam 2.0 filtering methods

As mentioned in Chapter 1, early-detection methods prevent the distribution of spam. However, such methods are not detection-based, since they do not investigate content to find spam patterns, nor do they apply costs to user actions to limit access to the system. Early-detection methods typically look into the nature and behaviour of spam generation. By studying the behaviour of content distribution and examining behavioural features, early-detection methods are able to differentiate spam from legitimate content.


However, as mentioned in Chapter 2, the state-of-the-art early-detection based spam filtering approaches are very fragile. As outlined in Chapter 3, the main socio-economic and technical problems with existing solutions are as follows:

1. Inability to identify Web 2.0 spam users

2. Machine learning-based methods are not effective for real-time detection

3. The nature of Spam 2.0 is not well understood

These three issues and weaknesses of the current early-detection based Spam 2.0 filtering methods have not been adequately addressed by current studies.

9.3 Dissertation Contributions

To address and analyse the problems outlined in Chapter 3, this dissertation proposes the following solutions in the area of Spam 2.0 filtering. To evaluate the proposed solutions, three prototype systems have been developed:

• Characterising Spam 2.0, Spam 2.0 distribution model and generation approaches.

• Early-Detection based Spam 2.0 Filtering (EDSF) approach.

• On-the-Fly Spam 2.0 Filtering (OFSF) approach.

9.3.1 Study and evaluation of existing spam filtering research

This dissertation has provided an overview of the literature, has investigated current efforts in the spam filtering field, and has evaluated the most successful anti-spam methods. A survey of spam filtering efforts on the web and on Web 2.0 platforms, including detection-based (i.e. content-based, link-based and rule-based), prevention-based and early-detection based approaches, has been conducted. Although this research introduces the new concept of Spam 2.0, which has not been covered before, methods that have been applied to other types of spam (i.e. web spam) are also investigated to gain additional understanding.

Through the study and evaluation of existing spam filtering approaches in the web spam area, it became clear that the state-of-the-art methods focus either on providing a more robust webpage ranking algorithm or on identifying web spam tactics in search results. Technically, the methods lean mainly toward link-based and detection-based approaches, which raises the following concerns:

1. Methods that strive to enhance the PageRank algorithm need significant amounts of resources. This makes them usable only for large-scale implementations (e.g. major search engines), and they are not efficient on a smaller scale.

2. Methods that strive to enhance the HITS algorithm cannot be used in offline mode. This might cause system performance issues at run time.

3. Both link and detection-based approaches are afterthoughts. The methods deal with the web spam problem after spam web pages have already infiltrated the web. Hence, they cannot prevent web spam distribution from wasting resources on the web.

4. Early detection-based methods are not effectively designed to prevent web spam distribution. Moreover, there is no study on prevention-based web spam filtering methods.

5. Hijacking is one of the major tactics that has recently been used by spammers. This tactic has not specifically been covered by state-of-the-art web spam filtering methods.

6. The majority of web spam filtering methods have addressed link-based tactics; however, content-based tactics have not received much attention.

Since the focus of this dissertation is on Spam 2.0, a comprehensive study has been conducted on existing Spam 2.0 filtering methods. Spam 2.0 is associated with existing web spam concerns as well as producing new problems. The overall problems with the current Spam 2.0 filtering methods are as follows:

1. Methods are platform-dependent. They cannot be extended to other platforms.

2. Detection-based methods are mainly content-based and can be bypassed by spammers.

3. Content-based methods are language-dependent. Hence, a solution might not be applicable to different languages.


4. Prevention-based methods are about to become outdated as computers become more powerful. Studies show that state-of-the-art prevention methods such as CAPTCHA are weak against current spam tactics.

5. Early detection-based methods do not fully consider spambot identification because they have been designed to identify only web crawlers in general.

Hence, one of the main contributions of this study was a survey of web spam and Spam 2.0 filtering approaches to highlight the unresolved issues and at the same time properly structure the knowledge in the domain of spam filtering.

9.3.2 Characterising Spam 2.0, the Spam 2.0 distribution model and generation approaches

The current literature lacks a comprehensive understanding of Spam 2.0 and the way it is propagated on Web 2.0 platforms. Chapter 5 addresses these problems by providing a mechanism named HoneySpam 2.0 to monitor, track and store Spam 2.0. This information is later analysed and investigated to discover the characteristics of Spam 2.0 and its distribution model. HoneySpam 2.0 is tested against real-world Spam 2.0 data and its performance is compared with that of existing solutions.

HoneySpam 2.0 has three main parts as illustrated in Figure 9.1.

Figure 9.1: Three main parts of the HoneySpam 2.0 framework

Tracking presents a framework for monitoring web and form usage data. It provides input data for the other parts of the system. The input data consists of navigation usage data (i.e. IP address, referrer URL, browser identity etc.) and form usage data (i.e. mouse movements, keyboard actions, form load/unload, form field focus/un-focus etc.).


Pre-processing includes tasks such as data cleansing, user identification, and session identification. In the data cleansing task, unnecessary data is removed. User identification is the task of associating usage data with a specific user. Finally, session identification refers to the splitting of usage data into different sessions. Each session contains information about the usage data of a website visit.
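As a concrete illustration of the session-identification step, the sketch below groups usage records into sessions using the common 30-minute inactivity timeout from the web usage mining literature; the record layout is a hypothetical assumption, not HoneySpam 2.0's actual schema.

```python
from datetime import datetime, timedelta

# Sketch of session identification with the conventional 30-minute inactivity
# timeout; the (user_id, timestamp) record layout is assumed for illustration.
TIMEOUT = timedelta(minutes=30)

def sessionise(records):
    """Group (user_id, timestamp) records, already sorted by time,
    into per-user sessions split on gaps longer than TIMEOUT."""
    sessions = {}    # user_id -> list of sessions (each a list of records)
    last_seen = {}   # user_id -> timestamp of that user's previous request
    for user_id, ts in records:
        if user_id not in sessions or ts - last_seen[user_id] > TIMEOUT:
            sessions.setdefault(user_id, []).append([])   # start a new session
        sessions[user_id][-1].append((user_id, ts))
        last_seen[user_id] = ts
    return sessions

log = [("u1", datetime(2011, 7, 1, 9, 0)),
       ("u1", datetime(2011, 7, 1, 9, 10)),
       ("u1", datetime(2011, 7, 1, 11, 0))]   # > 30 min gap => second session
print({u: len(s) for u, s in sessionise(log).items()})    # {'u1': 2}
```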

Knowledge discovery is the last part of the proposed solution. The output of the previous parts of the system is used here to mine and characterise Spam 2.0. General data mining techniques are used to generate usage statistics, the average amount of time spent on each page, return visit rates etc.

HoneySpam 2.0 has been executed on multiple web hosts, including six commercial hosts and five free web hosts. The following observations about spam user activities were made during the general statistical analysis:

• Use of search engines to find target websites

• Creation of numerous user accounts

• Low webpage hit and revisit rates

• Distribution of spam content in a short period of time

• No web form interaction

The in-depth analysis presented in Chapter 6 is based on the following factors in order to extract valuable information from the accumulated data:

• Discovered spammer behaviour on Web 2.0 platforms

• Understood spammer intentions on Web 2.0 platforms

• Formulated spammer behaviour in order to develop discriminative feature sets

• Clustered spammer activities into groups based on features sets

• Characterised spammer behaviour in each cluster

• Developed a strategy to utilise feature sets for Spam 2.0 filtering


Three feature sets – Action Set, Action Time, and Action Frequency – are presented to formulate spam user behaviour on Web 2.0 platforms, and a neural network clustering mechanism is used to cluster spammer behaviour based on these three feature sets.

An Action is a set of web objects requested by a user in order to perform a certain task. Actions integrate user activity on web platforms into meaningful groups, and each group contains meaningful information about user tasks. An Action Set is a binary model that shows whether an Action is undertaken by a user; this feature set shows whether or not spam users perform different Actions in different transactions. Action Time is a feature vector representing the time taken to perform each Action; this feature set shows how much time spam users spend performing each Action in different transactions. Action Frequency is a feature vector representing the frequency of each Action; this feature set shows how frequently spammers perform each Action in different transactions. Actions are used to create behavioural vectors that model users' web usage behaviour.
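To make the three feature sets concrete, the following Python sketch derives all three vectors from a single transaction. The Action universe and the transaction format are illustrative assumptions; the thesis's actual extraction pipeline operates on parsed web usage data.

```python
# Minimal sketch, assuming an Action has already been identified per request
# and a transaction is a list of (action, seconds_spent) pairs.

ACTIONS = ["view_topic", "start_topic", "post_reply", "edit_profile"]  # example Action universe

def behavioural_vectors(transaction):
    """Return (action_set, action_time, action_frequency) vectors
    for one transaction, indexed in the order of ACTIONS."""
    action_set = [0] * len(ACTIONS)     # binary: was the Action performed?
    action_time = [0.0] * len(ACTIONS)  # total seconds spent per Action
    action_freq = [0] * len(ACTIONS)    # how many times each Action occurred
    for action, seconds in transaction:
        i = ACTIONS.index(action)
        action_set[i] = 1
        action_time[i] += seconds
        action_freq[i] += 1
    return action_set, action_time, action_freq

tx = [("view_topic", 4.0), ("post_reply", 2.5), ("post_reply", 1.0)]
print(behavioural_vectors(tx))
# ([1, 0, 1, 0], [4.0, 0.0, 3.5, 0.0], [1, 0, 2, 0])
```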

Behavioural vectors are high-dimensional, which makes them difficult to visualise. In order to understand this data, the Self-Organising Map (SOM) is used to reduce the dimensionality of the data as well as to cluster the data into similar groups. The SOM – a neural network clustering algorithm – is employed for the clustering and visualisation of high-dimensional data. The output of the SOM is a map that is a visual representation of the input vectors. Each SOM cluster provides insight into the nature of Spam 2.0 behaviour and its characteristics.
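A minimal sketch of this clustering step is given below, using the open-source MiniSom package purely for illustration; the grid size, training parameters and random stand-in data are assumptions, not the configuration used in the thesis.

```python
import numpy as np
from minisom import MiniSom   # pip install minisom; used here only for illustration

# Cluster behavioural vectors with a small SOM; 10x10 grid and parameters are
# illustrative choices. Random data stands in for 34-dimensional behaviour vectors.
data = np.random.rand(500, 34)

som = MiniSom(10, 10, input_len=34, sigma=1.0, learning_rate=0.5)
som.random_weights_init(data)
som.train_random(data, 1000)   # 1000 training iterations

# Each vector maps to its best-matching unit (BMU); vectors sharing a BMU form
# a cluster whose weight vector can then be inspected and characterised.
bmu = som.winner(data[0])
print("Best-matching unit for the first vector:", bmu)
```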

Nine experiments based on three transaction levels and three feature sets have been conducted to profile Spam 2.0 behaviour. The overall results from each experiment and the characteristics of the centre node inside each cluster are as follows:

• Spam users focus on specific and limited Actions compared to real human users.

• Spam users use multiple user accounts to spread spam content, hide their identity and bypass restrictions.


• Spam users bypass filling in submission forms and directly submit the content to the Web server in order to efficiently spread spam.

• Spam users can be grouped into four different categories based on their actions – content submitters, profile editors, content viewers and mixed behaviour.

• Spam users change their IP address based on different Actions to hide their tracks.

The results of this study paved the way for the development of a Spam 2.0 filtering solution.

9.3.3 Early-Detection based Spam 2.0 Filtering (EDSF) approach

The early-detection based Spam 2.0 filtering methods attempt to identify spam behaviour prior to, or just after, spam content submission. Such methods investigate the behaviour of the spam submitter, e.g. automated tools, spambots etc.

The EDSF method proposed in this dissertation automatically distinguishes Spam 2.0 submissions from genuine human content contributions inside Web 2.0 platforms. The proposed method studies, investigates and identifies Spam 2.0 on Web 2.0 platforms. EDSF makes use of web usage navigation behaviour to build up a discriminative feature set including Action Set, Action Time, and Action Frequency. This feature set is later used in a machine learning classifier known as a Support Vector Machine (SVM) for classification.
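The classification step can be sketched as follows. This example substitutes scikit-learn's SVC for the WEKA/LibSVM wrapper that was actually used (see Appendix I), and random data stands in for the behavioural feature vectors.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Sketch only: scikit-learn in place of the WEKA/LibSVM wrapper used in the
# experiments; the random data below stands in for behavioural feature vectors.
X = np.random.rand(200, 34)               # e.g. Action Frequency vectors
y = np.random.randint(0, 2, size=200)     # 0 = human, 1 = spam user

clf = SVC(kernel="rbf", C=1.0)            # RBF kernel, matching the -K 2 LibSVM setting
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation, as in Appendix I
print("mean accuracy: %.3f" % scores.mean())
```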

For the experiments used in this research, data was collected from the HoneySpam 2.0 framework as well as from a live moderated forum. The performance of EDSF is measured by the F1-measure. EDSF achieved an average accuracy of 94.70%, ranging from 93.18% for Action Time to 96.23% for Action Frequency. For the first experiment, on the Action Set, the number of correctly classified instances is 4,064 and the number of incorrectly classified instances is 163. The numbers of correctly classified instances for the second and third experiments are 3,939 and 4,068 respectively, and the numbers of incorrectly classified instances for these two experiments are 288 and 159 respectively.

The overall performance of EDSF has been compared against four other state-of-the-art Spam 2.0 filtering methods. The overall F1-measure value of EDSF is higher than those of the existing Spam 2.0 filtering methods, which shows that the proposed method performs better in regard to Spam 2.0 filtering.


The other advantage of EDSF is its detection strategy. EDSF is an early-detection based method, so it can detect spam users before content distribution.

Given that there is no work in the literature on the classification of Spam 2.0 behaviour, the results of EDSF are the initial steps toward identifying Spam 2.0 by examining web usage behaviour.

9.3.4 On-the-Fly Spam 2.0 Filtering (OFSF) approach

The OFSF method presented in this dissertation automatically differentiates Spam 2.0 submissions from genuine human content inside Web 2.0 platforms. The proposed method studies, investigates and identifies Spam 2.0 on a Web 2.0 platform. In a similar fashion to EDSF, OFSF makes use of web usage navigation behaviour to establish discriminative feature sets. However, OFSF considers the sequence and order of user web navigation in order to build up the feature sets for classification.

Unlike EDSF, OFSF uses a novel rule-based classifier to store, retrieve, and classify spam users separately from genuine human users. OFSF makes it possible to spot spam users on-the-fly: while users are interacting with the Web 2.0 platform, OFSF is able to track and identify possible spam navigation patterns. To handle (i.e. store, retrieve, update, and look up) navigation data, OFSF uses a trie data structure: a tree in which each node stores an identifier for one of a user's web navigation Actions, while the tree edges hold the order and sequence of those Actions.

OFSF uses Action Strings (the sequence of a user's Actions in a given transaction, where each Action is identified by a text character). By placing each Action identifier in its specific Action order, an Action String is formed and inserted into the trie. The main advantages of this structure are its ease of updating and handling and its short access times. To classify spam users apart from real human users, each trie node stores the value (class label) associated with the key that ends at that node.
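A minimal trie sketch for Action Strings is shown below. The node layout and labels are assumptions for illustration, not the thesis's implementation; the key property it demonstrates is that a partially observed navigation sequence can already be looked up, which is what enables on-the-fly classification.

```python
# Minimal trie sketch for Action Strings (assumed node layout). Each edge is
# one Action character; a node's label records the class observed for the
# Action-String prefix ending there, so a partially observed navigation
# sequence can already be looked up.

class TrieNode:
    def __init__(self):
        self.children = {}   # Action character -> child node
        self.label = None    # e.g. "spam" or "human" once a rule ends here

class ActionStringTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, action_string: str, label: str) -> None:
        node = self.root
        for action in action_string:    # follow/extend one edge per Action
            node = node.children.setdefault(action, TrieNode())
        node.label = label

    def lookup(self, action_string: str):
        """Walk the trie as Actions arrive; return the deepest label seen,
        or None if the sequence matches no stored rule (a non-match)."""
        node, result = self.root, None
        for action in action_string:
            node = node.children.get(action)
            if node is None:
                break
            if node.label is not None:
                result = node.label
        return result

trie = ActionStringTrie()
trie.insert("abq", "spam")   # hypothetical rule: Actions a, b, q in order
print(trie.lookup("abqc"))   # -> 'spam', known before the transaction finishes
```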

For the experiments outlined in this dissertation, the same dataset that was used for EDSF was also used for OFSF. The F1-measure and the Matthews Correlation Coefficient (MCC) were used to measure the performance of OFSF. The overall average accuracy achieved in the OFSF experiments was 93%, with the best accuracy of 96% at window sizes equal to or greater than 5 characters. The best F1 value achieved was 0.82, while the average F1 value over all the datasets was 0.75. The average MCC value was 0.744, with the best MCC value of 0.808 at window sizes equal to or greater than 5 characters.
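For reference, both measures follow directly from the confusion-matrix counts; the sketch below recaps the standard formulas and applies them to the Action Set confusion matrix reported in Appendix I (treating Bot as the positive class).

```python
import math

# Standard confusion-matrix definitions of the two evaluation measures
# (not thesis-specific code).
def f1_and_mcc(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return f1, mcc

# e.g. the Action Set confusion matrix from Appendix I (Bot as positive class)
print(f1_and_mcc(tp=3725, fp=162, tn=339, fn=1))
```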


Overall, the average performance of OFSF is lower than that of EDSF. However, EDSF's better performance is achieved at the cost of time and computational resources. Additionally, the on-the-fly identification of spam users in OFSF surpasses EDSF, since the EDSF classifier cannot be trained with a partially completed input vector.

9.4 Future Work

The work undertaken in this dissertation has been published in peer-reviewed international conferences, journals and book chapters; over the course of this research, 15 such papers have been published. Selected publications are attached in the Appendix. Although a significant amount of effort has been invested in this research, there is still scope for future work. The following projects are aims for future work:

• Integrate HoneySpam 2.0 to collect data from various Web 2.0 platforms other than online forums. HoneySpam 2.0 could also be made publicly available as a web service to accumulate more data from online web applications.

• Improve the performance of EDSF by investigating the following issues:

o The nature of EDSF does not allow a distinction to be made between fair and spam Actions; hence, EDSF can mark as spam users those genuine users with few Actions.

o EDSF investigates the behaviour of users only at the session level. Although a session-level transaction provides a detailed profile of a user's visit, it does not provide an overall view of the user's behaviour.

o Although Actions are good discriminative features for classification, some Actions might be similar for both genuine and spam users. For example, a genuine user who directly inputs a "creating new thread" link address might be wrongly classified as a spam user, since spam users mainly aim for content contribution web pages rather than surfing through other web pages. This increases the chance of misclassification; therefore, further classification of Actions may be required.

• OFSF can be improved by considering the following:


o An adaptive mechanism to automatically update the classifier based on new unseen Action Strings.

o Inclusion of other usage behaviours in Action Strings, such as the time of each Action and the existence of mouse/keyboard activity, in order to differentiate between genuine human and spam Action Strings.

o A mechanism which will evolve as a site is updated and changed.

This work has coined the term ‘Spam 2.0’, provided insight into the nature of Spam 2.0 and proposed filtering mechanisms to combat this new and rapidly evolving problem.


Bibliography

Abernethy, J., Chapelle, O. & Castillo, C., 2010. Graph regularization methods for Web spam detection. Machine Learning, 81(2), pp.207-225.


Adamic, L.A. et al., 2000. Power-Law Distribution of the World Wide Web. Science, 287(5461), p.2115.

AdelaideNow, 2011. Online scams cost Australians $1bn a year | Adelaide Now. Available at: http://www.adelaidenow.com.au/news/south-australia/online-scams-cost-australians-1bn-a- year/story-e6frea83-1226017977647 [Accessed March 23, 2011].

Akismet. 2011. About Akismet - The automattic spam killer. http://akismet.com/about/.

Ntoulas, A. et al., 2006. Detecting spam web pages through content analysis.

Alpaydin, E., 2004. Introduction to Machine Learning, The MIT Press.

Andreolini, M. et al., 2005. HoneySpam: honeypots fighting spam at the source. In Proceedings of the Steps to Reducing Unwanted Traffic on the Internet on Steps to Reducing Unwanted Traffic on the Internet Workshop (2005). Cambridge, MA: USENIX Association, p. 11.

Araujo, L. & Martinez-Romo, J., 2010. Web Spam Detection: New Classification Features Based on Qualified Link Analysis and Language Models. Information Forensics and Security, IEEE Transactions on, 5(3), pp.581-590.

Asano, Y., Tezuka, Y. & Nishizeki, T., 2007. Improvements of HITS Algorithms for Spam Links. In Advances in Data and Web Management. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, pp. 479-490. Available at: http://dx.doi.org/10.1007/978-3-540-72524- 4_50.

Back, A., 2002. Hashcash - A Denial of Service Counter-Measure, Available at: http://www.hashcash.org/ [Accessed December 23, 2010].

Baldi, Pierre, Soren Brunak, Yves Chauvin, Claus A. F. Andersen, and Henrik Nielsen. 2000. “Assessing the accuracy of prediction algorithms for classification: an overview.” Bioinformatics 16 (5): 412-424. doi:10.1093/bioinformatics/16.5.412.

Benczúr, A. et al., 2005. SpamRank – Fully Automatic Link Spam Detection (work in progress). In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web.

Benczúr, A.A. et al., 2008. Web Spam: a Survey with Vision for the Archivist. In International Web Archiving Workshop (IWAW). Aaarhus, Denmark.

Benevenuto, F. et al., 2008. Identifying Video Spammers in Online Social Networks. In AIRWeb ’08.


Benevenuto, F. et al., 2009. Detecting spammers and content promoters in online video social networks. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. Boston, MA, USA: ACM, pp. 620-627.

Benevenuto, F. et al., 2010. Detecting Spammers on Twitter. In CEAS2010. Redmond, Washington.

Berners-Lee, T., 1996. RFC 1945 - Hypertext Transfer Protocol -- HTTP/1.0. Available at: http://tools.ietf.org/html/rfc1945 [Accessed February 9, 2011].

Berti-Equille, L. et al., 2009. Sailing the Information Ocean with Awareness of Currents: Discovery and Application of Source Dependence. In the Conference on Innovative Data Systems Research. CA, USA.

Bogers, A.M. & Bosch, A. van den, 2008. Using language models for spam detection in social bookmarking. In Proceedings of 2008 ECML/PKDD discovery challenge workshop. pp. 1-12.

Carvalho, A.L. da C. et al., 2006. Site level noise removal for search engines. In Proceedings of the 15th international conference on World Wide Web. Edinburgh, Scotland: ACM, pp. 73-82.

Caverlee, J. et al., 2009. A Parameterized Approach to Spam-Resilient Link Analysis of the Web. Parallel and Distributed Systems, IEEE Transactions on, 20(10), pp.1422-1438.

Chang, C., and C. Lin. 2001. LIBSVM: a library for support vector machines. Ed. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chellapilla, K. & Chickering, D., 2006. Improving Cloaking Detection using Search Query Popularity and Monetizability. In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb). pp. 17-24.

Chellapilla, Kumar, and Patrice Simard. 2004. Using Machine Learning to Break Visual Human Interaction Proofs (HIPs). In NIPS. citeulike-article-id:2719029 http://dblp.uni- trier.de/rec/bibtex/conf/nips/ChellapillaS04.

Cooley, R., Mobasher, B. & Srivastava, J., 1997. Web mining: information and pattern discovery on the World Wide Web. In Tools with Artificial Intelligence, 1997. Proceedings., Ninth IEEE International Conference on. pp. 558-567.

Cooley, R., Mobasher, B. & Srivastava, J., 1999. Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, 1(1), pp.5-32.


Cormack, G.V., Hidalgo, J.M.G. & Sánz, E.P., 2007. Feature engineering for mobile (SMS) spam filtering. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. Amsterdam, The Netherlands: ACM, pp. 871-872.

Cosentino, C., 2002. Advanced PHP for Web Professionals, Prentice Hall.

Daigle, L., 2004. RFC 3912 - WHOIS Protocol Specification. Available at: http://tools.ietf.org/html/rfc3912 [Accessed February 9, 2011].

Defensio. 2011. Defensio · Security for the Social Web. http://defensio.com/.


Dellschaft, K. & Staab, S., 2010. On Differences in the Tagging Behavior of Spammers and Regular Users. In Proceedings of the Web Science Conference 2010.

Desai, Apurva. 2008. Computer Graphics. PHI Learning.

dmoz, 2011. ODP - Open Directory Project. Available at: http://www.dmoz.org/ [Accessed June 7, 2011].

DuBois, P., 2006. MySQL Cookbook, Second Edition, O’Reilly Media.

Fredkin, E., 1960. Trie memory. Commun. ACM, 3(9), pp.490-499. doi:10.1145/367390.367400.

Fahlman, S.E., 2002. Selling interrupt rights: A way to control unwanted e-mail and telephone calls [Technical forum].

Fetterly, D., Manasse, M. & Najork, M., 2004. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004. Paris, France: ACM, pp. 1- 6.

Fielding, R. et al., 1999. Hypertext Transfer Protocol -- HTTP/1.1 - 9 Method Definitions [Accessed Online Dec 2009]. http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html.

Flanagan, D., 2006. JavaScript: The Definitive Guide, Fifth Edition, O’Reilly Media.

Fogaras, D. & Rácz, B., 2004. Towards Scaling Fully Personalized PageRank. In Algorithms and Models for the Web-Graph. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, pp. 105-117. Available at: http://dx.doi.org/10.1007/978-3-540-30216-2_9.

Gao, H. et al., 2010. Detecting and characterizing social spam campaigns. In Proceedings of the 17th ACM conference on Computer and communications security. Chicago, Illinois, USA: ACM, pp. 681-683.

Geddes, B., 2010. Advanced Google AdWords, Sybex.

Grier, C. et al., 2010. @spam: the underground on 140 characters or less. In Proceedings of the 17th ACM conference on Computer and communications security. Chicago, Illinois, USA: ACM, pp. 27-37.


Gutmans, A., Bakken, S. & Rethans, D., 2004. PHP 5 power programming, Prentice Hall.

Gyöngyi, Z. & Garcia-Molina, H., 2005. Spam: It’s Not Just for Inboxes Anymore. Computer, 38(10), pp.28-34.

Gyöngyi, Z. & Garcia-Molina, H., 2005. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web. Chiba, Japan.

Gyöngyi, Z., Garcia-Molina, Hector & Pedersen, J., 2004. Combating web spam with trustrank. In Proceedings of the Thirtieth international conference on Very large data bases - Volume 30. Toronto, Canada: VLDB Endowment, pp. 576-587. Available at: http://portal.acm.org/citation.cfm?id=1316689.1316740# [Accessed March 8, 2010].


Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. “The WEKA data mining software: an update.” SIGKDD Explor. Newsl. 11 (1): 10-18.

Harrenstien, K., 1982. RFC 812 - NICNAME/WHOIS. Available at: http://tools.ietf.org/html/rfc812 [Accessed February 9, 2011].

Hayati, P. & Potdar, V., 2008. Evaluation of Spam Detection and Prevention Frameworks for Email and Image Spam - A State of Art. In The 2nd International Workshop on Applications of Information Integration in Digital Ecosystems (AIIDE 2008). Linz, Austria.

Hayati, P. & Potdar, V., 2009. Toward Spam 2.0: An Evaluation of Web 2.0 Anti-Spam Methods. In 7th IEEE International Conference on Industrial Informatics. Cardiff, Wales.

Hayati, P. et al., 2009. HoneySpam 2.0: Profiling Web Spambot Behaviour. In 12th International Conference on Principles of Practise in Multi-Agent Systems. Nagoya, Japan: Lecture Notes in Artificial Intelligence, pp. 335-344.


Hayati, P., Potdar, V., Chai, K., et al., 2010. Web Spambot Detection Based on Web Navigation Behaviour. In 24th IEEE International Conference on Advanced Information Networking and Applications (AINA 2010). Perth, Western Australia: IEEE.

Hayati, P., Potdar, V., Sarenche, S., et al., 2010. Definition of Spam 2.0: New Spamming Boom. In IEEE International Conference on Digital Ecosystems and Technologies (DEST 2010). Dubai, UAE: IEEE.

Hayati, Pedram, Kevin Chai, Alex Talevski, and Vidyasagar Potdar. 2010. Behaviour-Based Web Spambot Detection by Utilising Action Time and Action Frequency. In The 2010 International Conference on Computational Science and Applications (ICCSA 2010). Fukuoka, Japan: Lecture Notes in Computer Science (LNCS), Springer.


Hayati, Pedram, Vidyasagar Potdar, W. F. Smyth, and Alex Talevski. 2010. Rule-Based Web Spambot Detection Using Action Strings. In The Seventh Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS 2010) (Under Review), ed. 2010. Redmond, Washington.

Heikkinen, E., 2011. Pligg CMS | Open Source Content Management System. Available at: http://pligg.com/ [Accessed June 7, 2011].

Hindle, A, M. W. Godfrey, and R.C. Holt. 2008. “Reverse Engineering CAPTCHAs.” Proceedings of the 2008 15th Working Conference on Reverse Engineering - Volume 00. doi:http://dx.doi.org/10.1109/WCRE.2008.35.

Holdener III, A.T., 2008. Ajax: The Definitive Guide, O’Reilly Media.

Horstkotte, D. et al., 2004. Guidelines on Prevention, Diagnosis and Treatment of Infective Endocarditis Executive Summary. European Heart Journal, 25(3), pp.267-276.

Houck, Chad. 2010. Decoding reCAPTCHA DEF CON 18 Hacking Conference.


Huang, C., Jiang, Q. & Zhang, Yan, 2010. Detecting Comment Spam through Content Analysis. In Web-Age Information Management. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, pp. 222-233. Available at: http://dx.doi.org/10.1007/978-3-642-16720-1_23.

Hunt, C., 2002. TCP/IP Network Administration, O’Reilly Media.

Androutsopoulos, I. et al., 2000. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages.

Irani, D. et al., 2010. Study of TrendStuffing on Twitter through Text Classification. In CEAS2010. Redmond, Washington.

Jacobs, K., Frye, C. & Frye, D., 2008. Excel 2007 Charts Made Easy, McGraw-Hill Osborne Media.

Jindal, N. & Liu, B., 2007. Analyzing and Detecting Review Spam. In Data Mining, 2007 (ICDM 2007), Seventh IEEE International Conference on, pp. 547-552.

Jun Bi, Jianping Wu & Wenmao Zhang, 2008. A Trust and Reputation based Anti-SPIM Method. In INFOCOM 2008: The 27th Conference on Computer Communications, IEEE, pp. 2485-2493. Available at: 10.1109/INFOCOM.2008.319.

Junod, J., 1997. Servers to spam: drop dead, Computers and Security. Elsevier, 16, pp.623-623.

Kleinberg, J.M., 1999. Authoritative sources in a hyperlinked environment. J. ACM, 46(5), pp.604- 632.

Kohavi, and Provost. 1998. “Glossary of Terms.” Machine Learning 30 (2) (February 1): 271-274.

Kohonen, T. 1990. “The self-organizing map.” Proceedings of the IEEE 78 (9): 1464-1480.

Kolari, P., Java, A. & Finin, T., 2006. Characterizing the Splogosphere. In Proceedings of the 3rd Annual Workshop on the Weblogging Ecosystem. Available at: citeulike-article-id:2914188.

Kolari, Pranam, Finin, Tim & Joshi, A., 2006. SVMs for the blogosphere: Blog identification and splog detection. In AAAI Spring Symposium on Computational Approaches to Analysing Weblogs. Stanford University, California. Available at: citeulike-article-id:2913705.

Koster, M., 1993. Guidelines for Robot Writers [Accessed Online Dec 2009]. http://www.robotstxt.org/guidelines.html.

Koster, M., 1994. A Standard for Robot Exclusion [Accessed Online Dec 2009]. http://www.robotstxt.org/orig.html.

Koster, M., 1995. Robots in the Web: threat or treat? [Accessed Online Dec 2009]. http://www.robotstxt.org/threat-or-treat.html.

Koutrika, G. et al., 2008. Combating spam in tagging systems: An evaluation. ACM Trans. Web, 2(4), pp.1-34.

Krause, B. et al., 2008. The anti-social tagger: detecting spam in social bookmarking systems. In Proceedings of the 4th international workshop on Adversarial information retrieval on the web. Beijing, China: ACM, pp. 61-68.


Le, Zhang, Zhu Jingbo, and Yao Tianshun. 2004. “An evaluation of statistical spam filtering techniques.” ACM Transactions on Asian Language Information Processing (TALIP) 3 (4): 243-269. doi:http://doi.acm.org/10.1145/1039621.1039625.

Lee, K., Caverlee, James & Webb, S., 2010. Uncovering social spammers: social honeypots + machine learning. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval. Geneva, Switzerland: ACM, pp. 435-442.

Levine, J.R., 2005. Experiences with greylisting. In Conference on Email and Anti-Spam.

Lim, E.-P. et al., 2010. Detecting Product Review Spammers using Rating Behaviors. In The 19th ACM International Conference on Information and Knowledge Management. Toronto, Canada. Available at: http://ink.library.smu.edu.sg/sis_research/623/ [Accessed November 22, 2010].

Liu, K., Fang, B. & Zhang, Yu, 2009. Detecting tag spam in social tagging systems with collaborative knowledge. In Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 7. Tianjin, China: IEEE Press, pp. 427-431.

Liu, Yuting et al., 2008. BrowseRank: letting web users vote for page importance. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. Singapore, Singapore: ACM, pp. 451-458.

Live-Spam-Zeitgeist, 2009. Some Stats, Akismet.

Matthews, B. W. 1975. “Comparison of the predicted and observed secondary structure of T4 phage lysozyme.” Biochim Biophys Acta 405 (2): 442-451.

May, M., 2005. Inaccessibility of CAPTCHA. W3C Working Group. Available at: http://www.w3.org/TR/turingtest/ [Accessed December 23, 2010].


Mediawiki, 2011. MediaWiki. Available at: http://www.mediawiki.org/wiki/MediaWiki [Accessed June 7, 2011].

Mertz, D., 2004. Charming Python: Beat spam using hashcash.

Metaxas, P. & Destefano, J., 2005. Web Spam, Propaganda and Trust. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web.

Michalski, R.S., Carbonell, J.G. & Mitchell, T.M., 1985. Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann.

Miles, R. & Hamilton, K., 2006. Learning UML 2.0, O’Reilly Media. Available at: http://oreilly.com/catalog/9780596009823.

Molinaro, A., 2005. SQL Cookbook, O’Reilly Media.

Narisawa, K. et al., 2007. Unsupervised Spam Detection Based on String Alienness Measures. In Discovery Science. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, pp. 161-172. Available at: http://dx.doi.org/10.1007/978-3-540-75488-6_16.

net-security.org, 2008. Latest spam statistics. Available at: http://www.net-security.org/secworld.php?id=6056.


Jindal, N. & Liu, B., 2008. Opinion spam and analysis.

Ogbuji, U., 2008. Real Web 2.0: Battling Web spam.

Oreilly, T., 2007. What is Web 2.0: Design Patterns and Business Models for the Next Generation of Software. Communications & Strategies, 1, p.17.

Page, L. et al., 1998. The PageRank Citation Ranking: Bringing Order to the Web, Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768.

Park, K. et al., 2006. Securing Web Service by Automatic Robot Detection. In USENIX 2006 Annual Technical Conference Refereed Paper.

Heymann, P., Koutrika, G. & Garcia-Molina, H., 2007. Fighting Spam on Social Web Sites: A Survey of Approaches and Future Challenges. IEEE Internet Computing, 11(6), pp.36-45.

Perkins, A., 2001. The Classification of Search Engine Spam. Available at: http://www.silverdisc.co.uk/articles/spam-classification [Accessed November 4, 2010].

Pfleeger, S.L. & Bloom, G., 2005. Canning SPAM: Proposed solutions to unwanted email. Security & Privacy, IEEE, 3(2), pp.40-47.

phpBB, 2011. phpBB • Free and Open Source Forum Software. Available at: http://www.phpbb.com/ [Accessed June 7, 2011].

Pilone, D. & Pitman, N., 2005. UML 2.0 in a Nutshell, O’Reilly Media. Available at: http://oreilly.com/catalog/9780596007959.

Quinlan, J.R., 1992. C4.5: Programs for Machine Learning, Morgan Kaufmann.

Quittek, J. et al., 2008. On Spam over Internet Telephony (SPIT) Prevention. Communications Magazine, IEEE, 46(8), pp.80-86.

R. Fielding et al., 1996. Hypertext Transfer Protocol - HTTP/1.1. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.49.3801.

Rijsbergen, C.J.V., 1979. Information retrieval, Butterworths.

Sculley, D., and M. Wachman Gabriel. 2007. “Relaxed online SVMs for spam filtering.” Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. doi:http://doi.acm.org/10.1145/1277741.1277813.

SMF Team, 2011. Simple Machines. Available at: http://www.simplemachines.org/ [Accessed June 7, 2011].

Smith, R.W., 2002. Advanced Linux Networking, Addison-Wesley Professional.

Spiliopoulou, M. et al., 2003. A Framework for the Evaluation of Session Reconstruction Heuristics in Web-Usage Analysis. INFORMS JOURNAL ON COMPUTING, 15(2), pp.171-190.

Spitzner, L., 2002. Honeypots tracking hackers, Addison-Wesley Professional. Available at: http://www.informit.com/store/product.aspx?isbn=9780321108951.

Steinbach, M., Karypis, G. & Kumar, V, 2000. A comparison of document clustering techniques. In KDD workshop on text mining. Available at: http://www.cs.cmu.edu/~dunja/KDDpapers/Steinbach_IR.pdf.


Steven J. J. Tedjamulia. 2005. Motivating Content Contributions to Online Communities: Toward a More Comprehensive Theory. In Hawaii International Conference on System Sciences, 7:193b. January 3. http://doi.ieeecomputersociety.org/10.1109/HICSS.2005.444.

Takemura, T. & Ebara, H., 2008. Spam Mail Reduces Economic Effects. In Digital Society, 2008 Second International Conference on the. pp. 20-24.

Tan, P.-N. & Kumar, V., 2002. Discovery of Web Robot Sessions Based on their Navigational Patterns. Data Mining and Knowledge Discovery, 6(1), pp.9-35.

TypePad AntiSpam. 2011. TypePad AntiSpam. http://antispam.typepad.com/.

Uemura, T., Ikeda, D. & Arimura, H., 2008. Unsupervised Spam Detection by Document Complexity Estimation. In Discovery Science. pp. 319-331. Available at: http://dx.doi.org/10.1007/978-3- 540-88411-8_30.

Urvoy, T. et al., 2008. Tracking Web spam with HTML style similarities. ACM Trans. Web, 2(1), pp.1-28.

von Ahn, L., Maurer, B., McMillen, C., Abraham, D. & Blum, M., 2008. reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science, 321(5895), pp.1465-1468.

von Ahn, L., Blum, M., Hopper, N. & Langford, J., 2003. CAPTCHA: Using Hard AI Problems for Security. In Advances in Cryptology — EUROCRYPT 2003. Available at: http://dx.doi.org/10.1007/3-540-39200-9_18.

Wang, A., 2010. Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach. In Data and Applications Security and Privacy XXIV. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, pp. 335-342. Available at: http://dx.doi.org/10.1007/978-3-642-13739-6_25.

Wang, A.H., 2010. Don’t follow me: Spam detection in twitter. In Int’l Conference on Security and Cryptography (SECRYPT).

Webb, S., Caverlee, J. & Pu, C., 2008. Social Honeypots: Making Friends with a Spammer Near You. In Proceedings of the Fifth Conference on Email and Anti-Spam (CEAS 2008). Mountain View, CA.

Webb, Steve, Caverlee, James & Pu, Calton, 2008. Predicting web spam with HTTP session information. In Proceeding of the 17th ACM conference on Information and knowledge management. Napa Valley, California, USA: ACM, pp. 339-348.

Whitworth, B. & Whitworth, E., 2004. Spam and the social-technical gap. Computer, 37(10), pp.38- 45.

Wordpress, 2011. Blog Tool and Publishing Platform. Available at: http://mu.wordpress.org/ [Accessed June 7, 2011].

Workathome, 2009. Work from home online ad placing work pay per posting.

Wu, B. & Davison, B.D., 2005a. Cloaking and Redirection: A Preliminary Study. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.3390.


Wu, B. & Davison, B.D., 2005b. Identifying spam pages. In Special interest tracks and posters of the 14th international conference on World Wide Web. Chiba, Japan: ACM, pp. 820-829.

Wu, B. & Davison, B.D., 2006. Detecting semantic cloaking on the web. In Proceedings of the 15th international conference on World Wide Web. Edinburgh, Scotland: ACM, pp. 819-828.

Xiaoxin, Y., Jiawei, H. & Philip, S.Y., 2007. Truth discovery with multiple conflicting information providers on the web.

Yahoo!, 2009. How to Prevent Your Site or Certain Subdirectories From Being Crawled [Accessed Online Dec 2009]. http://help.yahoo.com/l/us/yahoo/search/webcrawler/.

Yan, X. & Su, X.G., 2009. Linear Regression Analysis: Theory and Computing, World Scientific Publishing Company.

Yi-Min Wang & Ming Ma, 2007. Strider Search Ranger: Towards an Autonomic Anti-Spam Search Engine. In Autonomic Computing, 2007 (ICAC '07), Fourth International Conference on, p. 32. Available at: 10.1109/ICAC.2007.38.

Yi-Min, W. et al., 2007. Spam double-funnel: connecting web spammers with advertisers. In Proceedings of the 16th international conference on World Wide Web. New York, NY, USA.

Yiqun, L. et al., 2008. Identifying web spam with user behavior analysis. In Proceedings of the 4th international workshop on Adversarial information retrieval on the web. New York, NY, USA.

Yu, H. et al., 2009. Web Spam Identification with User Browsing Graph. In Information Retrieval Technology. pp. 38-49. Available at: http://dx.doi.org/10.1007/978-3-642-04769-5_4.

Zakas, N.C., McPeak, J. & Fawcett, J., 2006. Professional Ajax, Wrox.

Zhijun Liu et al., 2005. Detecting and filtering instant messaging spam - a global and personalized approach. In Secure Network Protocols, 2005 (NPSec), 1st IEEE ICNP Workshop on, pp. 19-24. Available at: 10.1109/NPSEC.2005.1532048.

Zhou, H. & Dang, X., 2007. A new generalized similarity-based topic distillation algorithm. Wuhan University Journal of Natural Sciences, 12(5), pp.789-792.

Zinman, A. & Donath, J., 2007. Is Britney Spears Spam?. In Fourth Conference on Email and Anti- Spam. Mountain View, CA.


Appendix I

Early-Detection based Spam 2.0 Filtering (EDSF) classification results:

Action Set

=== Run information ===

Scheme:       weka.classifiers.functions.LibSVM -S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1
Relation:     Man
Instances:    4227
Attributes:   35 (Index; View Board; View Topic; Start new topic; View the profile of; Set Search Parameter; Search Results; Preview; Profile - Notificati; Profile - Account Re; Modify message; Post reply; action=quotefast; Send message; action=spellcheck; action=findmember; Personal Messages In; Viewing Members 1 to; Recent Posts; View Lastest Post; View Help; Profile - Forum Prof; Profile - Look and L; Choose a theme...; Remnants Forum Passw; Recent Unread Topics; action=jsmodify; Personal Message Opt; Updated Topics; Remnants Forum - Sta; action=stats; All Unread Topics; Post new poll; Login; class)
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

LibSVM wrapper, original code by Yasser EL-Manzalawy (= WLSVM)

Time taken to build model: 0.38 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        4064               96.1438 %
Incorrectly Classified Instances       163                3.8562 %
Kappa statistic                          0.7856
Mean absolute error                      0.0386
Root mean squared error                  0.1964
Relative absolute error                 18.4413 %
Root relative squared error             60.7532 %
Total Number of Instances             4227

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.999     0.323     0.958       1        0.979       0.838      Bot
               0.677     0         0.997       0.677    0.806       0.838      Man
Weighted Avg.  0.961     0.285     0.963       0.961    0.958       0.838

=== Confusion Matrix ===

    a    b   <-- classified as
 3725    1 |  a = Bot
  162  339 |  b = Man

Action Time

=== Run information ===

Scheme:       weka.classifiers.functions.LibSVM -S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1
Relation:     Man
Instances:    4227
Attributes:   35 (Index; View Board; View Topic; Start new topic; View the profile of; Set Search Parameter; Search Results; Preview; Profile - Notificati; Profile - Account Re; Modify message; Post reply; action=quotefast; Send message; action=spellcheck; action=findmember; Personal Messages In; Viewing Members 1 to; Recent Posts; View Lastest Post; View Help; Profile - Forum Prof; Profile - Look and L; Choose a theme...; Remnants Forum Passw; Recent Unread Topics; action=jsmodify; Personal Message Opt; Updated Topics; Remnants Forum - Sta; action=stats; All Unread Topics; Post new poll; Login; class)
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

Appendix I 288

LibSVM wrapper, original code by Yasser EL-Manzalawy (= WLSVM)

Time taken to build model: 1.67 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        3939               93.1867 %
Incorrectly Classified Instances       288                6.8133 %
Kappa statistic                          0.639
Mean absolute error                      0.0681
Root mean squared error                  0.261
Relative absolute error                 32.5834 %
Root relative squared error             80.7554 %
Total Number of Instances             4227

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.976     0.399     0.948       0.976    0.962       0.789      Bot
               0.601     0.024     0.774       0.601    0.676       0.789      Man
Weighted Avg.  0.932     0.355     0.927       0.932    0.928       0.789

=== Confusion Matrix ===

    a    b   <-- classified as
 3638   88 |  a = Bot
  200  301 |  b = Man

Action Frequency

=== Run information ===

Scheme:       weka.classifiers.functions.LibSVM -S 0 -K 2 -D 3 -G 0.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.0010 -P 0.1
Relation:     Man
Instances:    4227
Attributes:   35 (Index; View Board; View Topic; Start new topic; View the profile of; Set Search Parameter; Search Results; Preview; Profile - Notificati; Profile - Account Re; Modify message; Post reply; action=quotefast; Send message; action=spellcheck; action=findmember; Personal Messages In; Viewing Members 1 to; Recent Posts; View Lastest Post; View Help; Profile - Forum Prof; Profile - Look and L; Choose a theme...; Remnants Forum Passw; Recent Unread Topics; action=jsmodify; Personal Message Opt; Updated Topics; Remnants Forum - Sta; action=stats; All Unread Topics; Post new poll; Login; class)
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

LibSVM wrapper, original code by Yasser EL-Manzalawy (= WLSVM)

Time taken to build model: 0.38 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        4068               96.2385 %
Incorrectly Classified Instances       159                3.7615 %
Kappa statistic                          0.795
Mean absolute error                      0.0376
Root mean squared error                  0.1939
Relative absolute error                 17.9887 %
Root relative squared error             60.0032 %
Total Number of Instances             4227

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.998     0.299     0.961       0.998    0.979       0.849      Bot
               0.701     0.002     0.975       0.701    0.815       0.849      Man
Weighted Avg.  0.962     0.264     0.963       0.962    0.96        0.849

=== Confusion Matrix ===

    a    b   <-- classified as
 3717    9 |  a = Bot
  150  351 |  b = Man


Appendix II

Selected Publications

• Hayati, P. et al., 2011. Characterisation of Web Spambots using Self Organising Maps. International Journal of Computer Systems Science and Engineering, vol. 1, pp.69-78.

• Hayati, Pedram, Vidyasagar Potdar, W. F. Smyth, and Alex Talevski. 2010. Rule-Based Web Spambot Detection Using Action Strings. In The Seventh Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS 2010) (under review). Redmond, Washington.

• Hayati, Pedram, Vidyasagar Potdar, Kevin Chai, and Alex Talevski. 2010. Web Spambot Detection Based on Web Navigation Behaviour. In 24th IEEE International Conference on Advanced Information Networking and Applications (AINA 2010). Perth, Western Australia: IEEE.

• Hayati, P. et al., 2010. Behaviour-Based Web Spambot Detection by Utilising Action Time and Action Frequency. In The 2010 International Conference on Computational Science and Applications (ICCSA 2010). Fukuoka, Japan: Lecture Notes in Computer Science (LNCS), Springer.

• Hayati, P. et al., 2009. HoneySpam 2.0: Profiling Web Spambot Behaviour. In 12th International Conference on Principles of Practise in Multi-Agent Systems. Nagoya, Japan: Lecture Notes in Artificial Intelligence, pp. 335-344.

International Journal of Computer Systems Science & Engineering
Comput Syst Sci & Eng (2011) 1: 69–85
© 2011 CRL Publishing Ltd

Characterisation of Web Spambots using Self Organising Maps

Pedram Hayati, Vidyasagar Potdar, Alex Talevski and Kevin Chai

Anti Spam Research Lab, Digital Ecosystems & Business Intelligence (DEBI) Institute, Curtin University of Technology, Perth, Australia. Email: [email protected], [email protected], [email protected], [email protected]

The growth of spam in Web 2.0 environments not only reduces the quality and trust of the content but also degrades the quality of search engine results. By means of web spambots, spammers are able to distribute spam content more efficiently to more targeted websites. Current anti-spam filtering solutions have not studied web spambots thoroughly and the characterisation of spambots remains an open area of research. In order to fill this research gap, this paper utilises Kohonen's Self-Organising Map (SOM) to characterise web spambots. We analyse web usage data to profile web spambots based on three novel sets of features, i.e. action set, action frequency and action time. Our experimental results uncovered important characteristics of web spambots: 1) they focus on specific and limited actions compared with humans; 2) they use multiple user accounts to spread spam content, hide their identity and bypass restrictions; 3) they bypass filling in submission forms and directly submit the content to the web server in order to efficiently spread spam; 4) they can be categorised into 4 different categories based on their actions – content submitters, profile editors, content viewers and mixed behaviour; 5) they change their IP address based on different actions to hide their tracks. Our results are promising and they suggest that our technique is capable of identifying spam in Web 2.0 applications.

Keywords: web spambots, spam 2.0, spam detection, behaviour analysis, web usage mining

1. INTRODUCTION tection. Zinman et al.(2007) use variety of content-based fea- tures in a machine learning classifier to classify social spam [4]. Web 2.0 brings freedom to the web environment by allowing Nitin et al (2008) extract 34 content-based features from online users to participate in content creation & management. However, opinions to build the spam classifier [5]. this ability is also granted to users that distribute spam content An unsupervised content-based spam detection method pro- (spammers). Spammers have spread spam content on Web 2.0 posed by Uemura et al. [6], employs an entropy-like feature applications including forums, blogs, wikis, social bookmarking called document complexity to detect spam content in blog and websites etc. [1] forums. Webb et al. (2008) monitored the interaction of social We refer to this new spamming technique as Spam 2.0 which spammers in an online community to characterise spam pro- is defined as the propagation of unsolicited, anonymous, mass files [7]. However, an in depth study at identifying the ori- content to infiltrate legitimateWeb2.0 applications. Examples of gin and source of spam content in Web 2.0 platforms i.e. a Spam 2.0 would include posting promotional threads in online study on characterising web spambots is missing in current lit- discussion boards, manipulating wiki pages and creating fake erature. user profiles in social networking websites [1]. Web spambot (i.e. spambot) is a type of web robot or Inter- Recent studies report an alarming situation that over 75% of net robot that spreads spam content in Web 2.0 applications [8]. blogs are spam blogs (splogs) [2] and the amount of comment Spammers automate the process of spreading spam content in spam has doubled in the 2009 compare with last year [3]. Web 2.0 applications (spam 2.0) by means of spambots. Spam- State-of-art anti-spam filters either employ content-based ap- bots can be programmed to automatically crawl the web, find proaches or out-link and in-link based approaches for spam de- Web 2.0 websites, register a user account and spread spam con- vol 00 no 1 March 2011 69 CHARACTERISATION OF WEB SPAMBOTS USING SELF ORGANISING MAPS tent. They can be deployed in any system ranging from genuine users computers (as a part of ) to servers. Spambots make it easier to quickly distribute spam content among a large number of websites without spammer interven- tion. They can also be programmed to be application-specific or website-specific. The former targets special web applications such as forums, blogs, wikis, etc. The latter is designed to target a specific Web 2.0 website such as Amazon, Youtube, MySpace, etc. It is worth mentioning that spambots are different from email spambots. Email spambots are designed to harvest email ad- Figure 1 Different types of domain that have been targeted by spammers. dresses from webpages and distribute spam content through emails. Web spambots however are active in Web 2.0 appli- cations and can mimic human user behaviour [1]. 1 illustrates the different domains that are currently targeted by Spambots waste network and server resources, pollute content spammers. in Web 2.0 applications in addition to increasing the need to The main disadvantages of spam are as follows: filter and manage unnecessary content. Additionally spambots • can place a legitimate websites in a danger of being blacklisted Wastes network bandwidth and storage / memory space: by search engines. 
• Frustrates users: for instance, spam contains fake information along with links to spam-related websites. Spam occasionally contains adult-related content, which can offend users.

• Misleads search engine spiders: spam content on the web abuses Search Engine Optimisation (SEO) techniques to obtain improved and undeserved search rankings. It manipulates search engine results for a set of keywords and builds up page rank scores by creating many links to the spammer's website. Hence, poor-quality and junk content receives more attention through higher rankings than genuine, high-quality content.

Therefore, characterising spambots is an important step toward controlling and eliminating Spam 2.0. In this paper we investigate spambot behaviour by employing Self-Organising Maps (SOM). SOM organises multi-dimensional data into a two-dimensional map. We feed the SOM with three feature sets and generate a map for each feature set. Each map reveals different characteristics of spambot behaviour. We run 9 experiments based on these three feature sets to characterise spambot behaviour. The main contributions of this paper are as follows:

• Utilising self-organising maps to profile spambot behaviour.

• Proposing three feature sets – action set, action time and action frequency – to formulate spambot behaviour.

• Evaluating the performance of our proposed framework with real-world data.

The rest of the paper is structured as follows. In Section 2 we present insights into the problems caused by spambots and the importance of eliminating them, in addition to providing background knowledge. Section 3 illustrates our proposed framework for characterising spambot behaviour. Section 4 presents our experimental results based on 9 different experiments. We formulate our conclusion in Section 5.

2. BACKGROUND

Spam is defined as mass, unsolicited and commercial content [9]. From simple text-based email messages to spam in Internet telephony systems, the purpose of spamming is the same, i.e. to attract users to view spam content that generates revenue [10]. To achieve such purposes, spammers look at any type of communication used by humans and employ various methods to attract their attention. Spam is currently prolific because it is comparatively cheaper and easier to spread spam content than to detect it [11]. As the Internet evolves, spam techniques will also evolve and branch into different types of media.

2.1 Spambot Countermeasures

To deal with spambots, some anti-robot techniques are used by a number of websites. In this section we recall some of the common techniques and explain their limitations.

2.1.1 IP Address

A simple way to detect a web robot is to examine its IP address. A list of known web robot IP addresses can be retrieved from public sources¹. However, it is hard to maintain an up-to-date list of IP addresses, since new web robots are constantly being deployed and some spambots run on legitimate hosts with legitimate IP addresses. Hence, this technique is an inefficient way of trying to stop spambots.

2.1.2 Robots.txt

Robots.txt, also known as the Robot Exclusion Protocol, is a text file normally placed in the root directory of a website. It contains a list of access-restricted locations (e.g. webpages, directories, etc.) that web robots are forbidden from accessing [12]. According to this protocol, web robots should examine the robots.txt file whenever they visit a website.

¹ IP Addresses of Search Engine Spiders: http://www.iplists.com/

Figure 2 is an example of an entry in the robots.txt file; it forbids all web robots from accessing the whole website.

User-agent: *
Disallow: /

Figure 2: Sample entry in a robots.txt file. It keeps out all web robots.

However, following the Robot Exclusion Protocol is voluntary, and many web robots do not follow it [13]. As spambots are tasked with distributing promotional and junk content to as many websites as possible, it is doubtful that they would observe the robot rules stated in the robots.txt file. Therefore, this technique is ineffective at dealing with spambots.

2.1.3 User-agent

User-agent is a field inside the HTTP request header sent to the web server that identifies the client application. A web robot should declare its identity to the web server by specifying this field [14]. For instance, the user-agent for a Yahoo! web crawler is Slurp [15]. By simply examining the user-agent field, web administrators are able to restrict access for specific web robots. However, spambots often hide their identity or rename their user-agent to a name that is not restricted [8]. Therefore this technique is ineffective at restricting spambots.

2.1.4 Head Request

The response of a web server to an HTTP HEAD request is header information without a message body. Web robots use head requests to check the validity of hyperlinks, accessibility and recent modifications of the requested webpage [16]. Therefore, a large number of head requests in the incoming traffic can indicate web robot activity. However, this heuristic is not effective at detecting spambots, as they normally request the message body rather than the header.

2.1.5 Referrer

Referrer is a field inside the HTTP request that contains a link to the webpage a client followed to reach the currently requested webpage. For instance, if a user reaches www.example/page2.html from the www.example.com page, the referrer field is www.example.com. Ethical web robots normally do not assign any value to the referrer field, so they can be differentiated from human users. However, spambots often assign a value to the referrer field to mimic a human user, and can also provide fake links to conceal their navigation [17].

2.1.6 Flood Control

Flood control is a technique to limit the number of requests a client can send within a specified time interval. The idea is to restrict the number of submission requests made to the server by spambots [18]. For example, once a user creates a new thread in a forum, they may have to wait 10 minutes before creating another thread. This technique may slow down automatic submissions by spambots, but it also causes inconvenience for genuine human users. Additionally, such a technique cannot stop spambots; it only delays their submission process. We discovered from our experiment that spambots often create multiple user accounts and would therefore bypass this technique.

2.1.7 Nonce

A nonce is a technique to stop automated form submission. The nonce is a randomly generated set of characters placed on a webpage containing a form [18]. When the user submits the form, the nonce is sent to the server. If the nonce does not exist in the form submission, it reveals that the form was not loaded by the client and indicates the possibility of an automated attack by a spambot. This technique only ensures that submitted content originates from a web form loaded by the client; it cannot stop spambots that submit content through the website forms.
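To make the nonce countermeasure concrete, the following minimal Python sketch shows one plausible implementation; the in-memory store, the 10-minute validity window and the function names are our illustrative assumptions, not a prescribed design:

import os
import time

ISSUED = {}  # nonce -> expiry timestamp; in practice a database or cache

def issue_nonce(ttl_seconds: int = 600) -> str:
    """Generate a random nonce to embed as a hidden field in the served form."""
    nonce = os.urandom(16).hex()
    ISSUED[nonce] = time.time() + ttl_seconds
    return nonce

def verify_nonce(nonce: str) -> bool:
    """Accept a submission only if its nonce was issued by us, is unexpired,
    and has not been used before (pop makes each nonce valid exactly once)."""
    expiry = ISSUED.pop(nonce, None)
    return expiry is not None and time.time() < expiry

One-time use prevents replaying a captured nonce, but, as the section above notes, a spambot that actually loads and parses the form page still receives a valid nonce, which is why this technique alone cannot stop spambots.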

2.1.8 Form variation

This technique involves varying the objects inside a form upon each webpage request [18]. For example, the names of the input fields within a form change for each user's webpage request. The idea is similar to the nonce, in that it aims to ensure a server-generated form is used during submission. However, like the nonce, it cannot block spambots that use the website forms.

2.1.9 Hashcash

The idea behind hashcash is to impose an increased cost of submission on the client side in order to make spreading spam more costly [19]. This technique involves the sender calculating a stamp for the submitted content. The calculation of the stamp is difficult and time-consuming, but comparatively cheap and fast for the receiver to verify. Similar to flood control, hashcash can only slow down, not stop, spambot activity.

2.1.10 Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA)

CAPTCHA is the most popular anti-robot technique adopted by many websites. It is a challenge-response technique, usually in the format of a distorted image of letters and numbers [20]. Users are asked to decipher a CAPTCHA image and type its letters into a form. However, CAPTCHA inconveniences users: it wastes their time, causes distraction, and is unpleasant and at times scary. A number of recent works have reported approaches to defeat CAPTCHAs automatically by using computer programs [21-23]. CAPTCHA's drawbacks include the following:

1. It decreases user convenience and increases the complexity of human-computer interaction.

2. As programs become better at deciphering CAPTCHAs, the images may become increasingly difficult for humans to decipher.

3. As computers get more powerful, they will be able to decipher CAPTCHAs better than humans.

Therefore, CAPTCHA is a short-term solution that may not prove effective in the future. Spambots armed with anti-CAPTCHA tools are able to bypass this restriction and spread spam content easily, while the futile user inconvenience remains.

2.2 Web Usage Data

User navigation through websites can be tracked and logged. This knowledge can be stored, and it is referred to as web usage data [24]. Such knowledge can be mined to extract valuable information such as user navigation and click-streams. Generally, web usage data includes information about the client IP address, the requested object (e.g. webpage), the request method (e.g. POST, GET), the time of the request and the web browser type (e.g. Internet Explorer, Mozilla Firefox, etc.). We evaluate web usage data to study spambot behaviour in Web 2.0 platforms.
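As an illustration of the fields listed above, the following sketch parses one Apache-style combined-log line into a record. The regular expression and class names are our own illustrative assumptions; as described later, the thesis's own tracker additionally records session IDs and usernames, which raw server logs lack:

import re
from dataclasses import dataclass
from typing import Optional

# Combined Log Format as commonly produced by Apache-style web servers.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" \d+ \d+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

@dataclass
class UsageRecord:
    ip: str        # client IP address
    time: str      # time of request
    method: str    # request method, e.g. GET or POST
    url: str       # requested object
    referrer: str  # referrer URL
    agent: str     # browser / operating system identity

def parse_line(line: str) -> Optional[UsageRecord]:
    """Return a UsageRecord for one log line, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    return UsageRecord(**m.groupdict()) if m else None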

3. FRAMEWORK – CHARACTERISATION OF SPAMBOT BEHAVIOUR USING KOHONEN'S SELF ORGANISING MAPS

3.1 Preliminary Concepts

In this paper we employ SOM to characterise spambot behaviour, so before describing the actual solution we first cover the basics of SOM.

Figure 3: Spambot characterisation framework.

3.1.1 Self Organising Map (SOM)

The SOM is a type of neural network based on competitive learning. SOM can be employed for clustering and visualisation of high-dimensional data. The output of the SOM is a map which is a visual representation of the input vectors. Maps can be generated in two or three dimensions, but two-dimensional maps are more common.

SOM has two phases: training and mapping. In the training phase, an input vector in the form of a one-dimensional array (sometimes two-dimensional) is presented to the map. Next, the weights of the components on the map (called nodes) which are closer to the input vector gain more strength, along with their neighbouring nodes. The node which has the closest distance to the input vector is called the Best Matching Unit (BMU). Consequently, in the mapping phase, similar input vectors fall in the same region and can be grouped together as a cluster.

3.1.2 SOM algorithm

SOM starts by initialising each node's weight vector $w_i$ on the map with small random values. Next, an input vector $x$ is presented to the map and the distance between every node and $x$ is calculated as

$d_j = \lVert x - w_j \rVert = \sqrt{\sum_{k=0}^{|w|} (x_k - w_{jk})^2}$   (1)

The node with the minimum $d_j$ is selected as the BMU. After finding the BMU, SOM updates $w_i$ for the BMU and its neighbouring nodes according to Eq. 2 so that they move closer to the input vector:

$w_i(t+1) = w_i(t) + \alpha(t)\, h_{ci}(t)\, [x - w_i(t)]$   (2)

where $t$ is the time-step, $h_{ci}(t)$ is a Gaussian neighbourhood kernel that decreases with time, $\alpha(t) = \alpha(0)(1 - t/T)$ is the linear learning rate, and $T$ is the training length (the neighbourhood size is less than the number of nodes in one dimension of the map).

3.1.3 Unified distance matrix (U-matrix)

The U-matrix visualises the distance between adjacent nodes in the SOM. It can be employed to determine the number of clusters in the SOM as well as how close they are to each other. In this paper, we utilise Kohonen's SOM to characterise spambots by monitoring their web usage data. In other words, we use the SOM and the U-matrix as visualisation tools to cluster different spambot web usage behaviours.

3.2 Overview

Our proposed framework for profiling spambots has 4 major modules, as illustrated in Figure 3. In order to study spambot behaviour, we use the data collected in our previous work, HoneySpam 2.0 [8], which contains spambot web usage data. However, this data requires preparation before it can be used for spambot characterisation. Hence we propose three new feature sets – action time, action set and action frequency. Our main intention is to choose feature sets which are:

1. Generic: so they can be extended to other forms of Web 2.0 platforms.

2. Differentiable: so they can distinguish spambot behaviour from that of humans.

We define an action as a set of user activities in a system to achieve a certain goal. For instance, in a forum, a user can navigate to the registration page, complete the registration form and click the submit button in order to create a new user account. This procedure can be formulated as the 'register a new account' action. An action set refers to the set of actions that have been performed. Action time refers to the amount of time (dwell time) spent on a particular action. Action frequency refers to the number of times a user performs a particular action. We provide a formal description of each feature set in Section 3.5.
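To make the training procedure of Section 3.1.2 concrete, the following minimal sketch implements Eqs. (1) and (2). The grid size, learning rate and neighbourhood width are illustrative placeholders rather than the settings used in the experiments:

import numpy as np

def train_som(data, rows=10, cols=10, T=1000, alpha0=0.5, sigma0=3.0, seed=0):
    """Minimal SOM trainer: find the BMU by Euclidean distance (Eq. 1),
    then pull the BMU and its neighbours toward the input (Eq. 2).

    data: array of shape (n_samples, dim)."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.random((rows, cols, dim))              # small random init
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)  # node coordinates
    for t in range(T):
        x = data[rng.integers(len(data))]
        # Eq. (1): distance of x to every node's weight vector
        d = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # Gaussian neighbourhood kernel h_ci(t), shrinking over time
        sigma = sigma0 * (1 - t / T) + 1e-3               # epsilon avoids /0
        h = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=2) / (2 * sigma**2))
        alpha = alpha0 * (1 - t / T)                      # linear learning rate
        # Eq. (2): w_i(t+1) = w_i(t) + alpha(t) h_ci(t) [x - w_i(t)]
        weights += alpha * h[..., None] * (x - weights)
    return weights

After training, mapping an input vector to its BMU groups similar vectors in the same region of the map, which is what the U-matrix of Section 3.1.3 then visualises.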


In order to characterise spambot behaviour, we employ SOM to organise each feature set on a map and visualise them by using the U-matrix. By studying the centre node inside each cluster we attempt to understand spambot behaviour from our dataset.

3.3 Web usage tracker

This module tracks spambot interaction with the system. It stores the IP address, username, requested webpage URL, session identity and timestamp of each request a spambot makes to the website. This information makes it possible to track spambot behaviour for each browsing session.

Conventionally, web usage navigation tracking is done through the web server logs described in Section 2.2. However, these logs do not specify usernames and the session identifier for each request. Therefore, we use our own web usage tracker, developed in our previous work, HoneySpam 2.0 [8]. Our tracker is designed to monitor spambots within an online forum void of any human activity; therefore, the data collected by this module consists solely of spambot web usage data.

Figure 4: Association relationships among the IP, user and session transaction levels.

3.4 Data Preparation

This module comprises three components: data cleaning, transaction identification and dwell time completion.

Data cleaning

This component removes irrelevant web usage data such as:

• Data related to researchers who monitor the forum.

• Data related to visitors who did not create a user account.

• Data related to crawlers and other web robots that are not spambots.

Transaction identification

This component involves the activities needed to construct meaningful clusters of user navigation data [24]. We group web usage data into three levels of abstraction – IP, user and session. The highest level of abstraction is the IP level; each IP address can be used by multiple users. The middle level is the user level; each user can perform multiple browsing sessions. Finally, the lowest level is the session level, which contains detailed information about the user's navigation in each website visit. In our proposed framework, we define a transaction at all three levels of abstraction in order to study the characteristics of spambots at each level. Figure 4 illustrates the association relationships between the IP, user and session levels.

Dwell time completion

Dwell time is the amount of time a user spends viewing a particular webpage in a session. As described in Section 3.3, the web usage tracking module records the timestamp of each request. Hence, dwell time can be defined as

$d_i = t_{i+1} - t_i$ where $1 \le i \le |S|$   (3)

where $d_i$ is the dwell time for the $i$-th requested webpage in session $S$. However, according to Eq. 3, it is not possible to calculate the dwell time of the last visited page in a session; for example, a user navigates to the last page on the website and then closes the web browser or navigates to a different website. Hence we use the average dwell time spent on the other webpages in the same session as the dwell time for the last page.

3.5 Feature Measurement

The feature measurement module formulates web usage data to be used in the SOM to reveal spambot characteristics. It formats the data into three different feature sets: action set, action time and action frequency.

Definition 1: Action Set ($\overrightarrow{aS}$). Given a set of webpages $W = \{w_1, w_2, \ldots, w_{|W|}\}$, $A$ is defined as a set of actions such that

$A = \{a_i \mid a_i \subset W\} = \{\{w_l, \ldots, w_k\}\}, \quad 1 \le l, k \le |W|$   (4)

Respectively, $s_i$ is defined as

$s_i = \{a_j\}, \quad 1 \le i \le |T|,\ 1 \le j \le |A|$   (5)

where $s_i$ refers to the set of actions performed in transaction $i$ and $|T|$ is the total number of transactions. In order to build the input vector we assign each action $a_i$ as a feature. Hence, we represent an action set as a bit vector

$\overrightarrow{aS} = \langle v^i_1, \ldots, v^i_{|A|} \rangle$   (6)

where

$v^i_j = \begin{cases} 1 & a_j \in s_i \\ 0 & \text{otherwise} \end{cases}$   (7)

Definition 2: Action Frequency ($\overrightarrow{aF} = \langle h^i_1, \ldots, h^i_{|A|} \rangle$). Action frequency is a vector where $h^i_j$ is the frequency of the $j$-th action in $s_i$; otherwise it is zero.

Definition 3: Action Time ($\overrightarrow{aT} = \langle d^i_1, \ldots, d^i_{|A|} \rangle$). We define action time as a vector where

$d^i_j = \begin{cases} \dfrac{\sum_{k \in a_j} d_k}{h^i_j} & a_j \in s_i \\ 0 & \text{otherwise} \end{cases}$   (8)

$d^i_j$ is the dwell time for action $a_j$ in $s_i$, which is equal to the total amount of time spent on the webpages inside $a_j$. In cases where $a_j$ occurs more than once, we divide by the action frequency $h^i_j$ to calculate the average dwell time. Table 1 shows a sample action time input vector.

Table 1: Sample dwell time input vector for Class 1 transactions.

          a_1    a_2    ...   a_|A|
d_1        5      4     ...     0
d_2        0     3.5    ...     0
...       ...    ...    ...    ...
d_|T|      0      2     ...     1

Table 2: All possible combinations of features and transaction levels, where S, U and I are the sets of all sessions, users and IP addresses.

              aT          aF          aS
Session (S)   (S, aT)     (S, aF)     (S, aS)
User (U)      (U, aT)     (U, aF)     (U, aS)
IP (I)        (I, aT)     (I, aF)     (I, aS)
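Definitions 1-3 can be computed directly from a transaction's list of (action, dwell time) pairs. The sketch below is a minimal illustration; the function and variable names are ours:

from collections import defaultdict

ACTIONS = list("ABCDEFGHIJK")  # the 11 action index keys of Table 3

def feature_vectors(transaction):
    """Build the action set, action frequency and action time vectors
    (Definitions 1-3) for one transaction.

    `transaction` is a list of (action_key, dwell_seconds) pairs, e.g.
    [("A", 2.0), ("E", 3.5), ("F", 2.5)]; the dwell time of the last
    request is assumed filled in by the dwell-time-completion step."""
    freq = defaultdict(int)
    total_dwell = defaultdict(float)
    for action, dwell in transaction:
        freq[action] += 1
        total_dwell[action] += dwell
    aS = [1 if freq[a] else 0 for a in ACTIONS]           # Eqs. (6)-(7)
    aF = [freq[a] for a in ACTIONS]                       # Definition 2
    aT = [total_dwell[a] / freq[a] if freq[a] else 0.0    # Eq. (8), averaged
          for a in ACTIONS]
    return aS, aF, aT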

3.6 Characterising

We present 9 combinations of the three feature sets and the three levels of abstraction, as shown in Table 2, to be used in our experiments with the SOM. We use the U-matrix to project the result of the SOM and study the clusters. We choose the centre node of each cluster to describe the characteristics of that particular cluster.

4. EXPERIMENTAL SETUP

4.1 Data Collection

For the purpose of this experiment we used the dataset that we collected in [8]. Our data consists of 19,348 records from an operational online forum.

4.2 Data Pre-processing

In our dataset, 11,039 records were generated by spambots; the rest contained irrelevant information, as described in Section 3.4, which we removed. We group the data into three transaction levels: session, user and IP address. The result of this task produced 3,726 distinct sessions, 1,067 spambot users and 871 unique IP addresses. To construct our input vectors, we extract 11 actions from our dataset; the details of each action are presented in Table 3. Accordingly, we build aS, aF and aT based on each action. Figure 5 presents the spambot action frequency for each action at the three transaction levels.

Table 3: Summary of actions performed by spambots.

Action Index   Description
A              View root page
B              View topics
C              Start new topic
D              View topic (view their own created topic)
E              View user profile information
F              Edit user information
G              Start new poll
H              User authentication (login)
I              Reply to topic
J              Recently submitted topics
K              View overall forum statistics

Figure 5: Spambot action frequency per action at the different transaction levels. The x-axis indicates actions A to K; the y-axis shows the frequency of each action.

4.3 Experiments

We run our experiments based on the 9 feature-set combinations presented in Section 3.6 and summarise the result of each experiment at its corresponding transaction level. We visualise the SOM output in the U-matrix. In the following sections we explain the results of each experiment and the characteristics of the centre node inside each cluster.

5. EXPERIMENTAL RESULTS

5.1 Experiment 1: Session level results

The three experiments at the session level reveal the detailed behaviour of each spambot every time it visited our system. Figure 6 illustrates the U-matrix for (S, aF). Dark lines represent large distances between nodes, while light areas represent nodes that are close to each other. The dark lines separate the map into 3 major clusters, which we label Content Submitters, Profile Editors and Mixture. The details of each cluster are shown in Table 4.

Figure 6: U-matrix for the session level and action frequency. Dark lines are where the distance between nodes is large; in the light areas, nodes are close to each other and represent a cluster. The dark lines separate the map into 3 major clusters.

Table 4: Major characteristics of spambots at the session level.

            Cluster 1   Cluster 2      Cluster 3
Dwell time  N/A         2, 3           4, 6, 7
Frequency   1           2, 3           4, 9
Actions     C, A        AE, BB, BCB    AEFF, AEFFABCBD

• Cluster 1 (Content Submitter). Spambots in this cluster perform the Start New Topic action. These bots did not perform any other action at this level, hence the dwell time for their action cannot be calculated. This behaviour reveals that these spambots do not navigate through the website; they are programmed to submit content directly. Additionally, in their action set there is no form request for action C, which shows they do not interact with the server form in order to submit their content. We believe that simple rules, such as blocking direct content submission, can prevent this abnormal behaviour on the website and stop these spambots.


• Cluster 2 (Profile Editor). The goal of spambots in this cluster is to edit their own profile page. The profile page consists of various pieces of information such as name, email address, homepage URL, signature, etc. Spambots navigate from action A (view root page) to action F (edit user information) with a frequency of one for each action. The dwell time for each action is between 2-3 seconds, which points to the fact that the spambots did not use the website forms to submit their profile details, since their dwell time is quite low. By modifying profile information, spambots are able to create a link farm and spread spam content; for instance, they place a link to their other campaigns/websites inside the profile signature field.

• Cluster 3 (Mixture). Spambots in this cluster performed both of the above-mentioned sets of actions. They navigate through the action sequence (A, E, F, F, A, B, C, B, D). The average frequency for each action is 2, which means they perform the above sequence twice in each session. The average dwell time is between 4-7 seconds. This behaviour again shows that the spambots did not use forms to submit content or modify their profile, since this sequence cannot feasibly be performed in 4-7 seconds. Additionally, they navigate to view their submitted content in order to increase the view count of their topic, as well as possibly to ensure their submitted content is published.

5.2 Experiment 2: User level results

User level tracking provides knowledge about the characteristics of each spambot over its total active time in the forum. It can show whether or not spambot behaviour changes over time. The results of the three different experiments at the user level are:

• Cluster 1 (Content Submitter). Spambots use their user account only to submit spam content directly to the website, similar to the discussion in Section 5.1. However, spambots use their username only once, since only one session belongs to each user in this cluster. The dwell time is zero (unknown, as there is only 1 page request), the frequency of the action is either one or two, and they only perform action C.

• Cluster 2 (Content Submitter and Profile Editor). Spambots start their navigation by visiting the root page, then modify their profile page and then submit spam content. Their average dwell time is 2.5 seconds for each action, with a total frequency of 8 actions at the user level. Their navigation sequence is (A, E, F, F, A, B, C, B). Once these actions are performed, the spambots do not log in and use the user account again.

• Cluster 3 (Mixture). In this cluster, the average dwell time of spambots was 7 seconds for navigating through (A, A, E, F, F, A, B, C, B, B, D), with a frequency of 11. They mix profile editing and content submission behaviour. They viewed their own created topic at the end of their navigation set. As in the previous clusters, spambots have just one session for each user account in this cluster.

Investigation of spambots at the user level reveals that once they have created a username they use it just once, and they do not change their behaviour during their lifetime. Table 5 presents the details of each cluster. Figure 7 illustrates the U-matrix map for (U, aF); it shows that the majority of spambots at the user level have similar characteristics, as the map is dominated by light areas.

Table 5: Major characteristics of spambots at the user level.

            Cluster 1   Cluster 2    Cluster 3
Dwell time  N/A         2.3          2.5
Frequency   1           8            11
Actions     C, CC       AEFFABCB     AAEFABCBBD

Figure 7: U-matrix representation of the user level and action frequency. White squares on the map show the number of nodes that are close together; larger squares represent more nodes.

5.3 Experiment 3: IP level results

The IP level presents a high-level view of spambot behaviour. The experiments resulted in the discovery of three major clusters:

• Cluster 1 (Mix). This cluster shows a frequency of 4-5 for each action set. Spambots in this cluster navigate through the root webpage, modify their profile and submit spam content. Finally, they end their navigation by viewing their own created content. The frequency indicates that they created multiple accounts under one IP address to perform their actions.

• Cluster 2 (Content Viewer). We discovered this strange cluster of spambots. Spambots in this cluster only navigate to action D (viewing their own topic). This behaviour was not seen at the previous levels. Further investigation revealed that these spambots change their IP address once they submit their spam content; therefore some spambots use multiple IP addresses for one user.

• Cluster 3 (Big Hit Mix). The action frequency in this cluster is more than 100, for submitting spam content or modifying profiles. This cluster has similar characteristics to Cluster 1; however, the higher action frequencies show that these spambots use multiple usernames to perform the same actions.

The experiment on the IP-level dwell time map does not result in clear clusters. The reason may be the change of IP address by spambots at the end of their navigation, which results in unknown dwell times, as well as multiple usernames per IP address, which results in varied dwell times. Table 6 presents the details of each experiment at the IP level. Figure 8 presents the U-matrix map for the IP level and action frequency.

Table 6: Major characteristics of spambots at the IP level.

           Cluster 1   Cluster 2   Cluster 3
Frequency  8           11          100+
Actions    AEFFABCB    D…          AEFABC…

Figure 8: U-matrix for the IP level and action frequency.

6. DISCUSSION & LIMITATION

The U-matrix visualisation in some of our experiments produced a number of minor clusters. While we investigated these minor clusters as well, they did not reveal distinct behaviour useful for spambot characterisation. For example, a cluster was identified for spambots which only perform action A (visiting the root page). This action is generic and does not reveal any interesting characteristic. Hence we did not report such minor clusters in our experiments.

Self-organising maps, and neural networks in general, do not allow much user interpretability in understanding why nodes belong to a particular cluster. They can be viewed as black-box algorithms which are fed inputs and generate outputs. Hence one challenge for future work is the investigation of the differences among clusters.

The data we collected for this work can be extended in future to evaluate spambots on other platforms such as blogs, wikis and other types of Web 2.0 applications.

7. RELATED WORKS

There has been an enormous amount of work dedicated to spam detection and filtering in recent years. This section reviews the important and related areas in the existing literature.

In the area of web usage mining and web robot detection, Tan et al. [25] proposed a rule-based framework to discover search engine crawlers and camouflaged web robots. They utilised navigation patterns such as session length, the set of visited webpages and the request method type (e.g. GET, POST, etc.).

Park et al. [17] presented a malicious web robot detection method based on the type of HTTP request and the existence of mouse movement. However, the focus of both of these works was on general web robot detection rather than spambot detection.


In HoneySpam 2.0 [8] we proposed our web tracking module to track spambot data. The focus of that work was to develop a framework to gather spambot data rather than to characterise it.

Göbel et al. [26] showed a proactive approach to track the current flow of spam messages inside botnets. They monitor email spambot controllers to gather recent spam messages and use them inside a spam-filtering technique to block such messages.

Yiqun et al. [27] and Yu et al. [28] employ user web access data to classify spam webpages from legitimate webpages. The main assumption in their frameworks is reliance on user web access data.

Based on all the existing work in the area of spambot detection, we are confident that no one has so far undertaken in-depth research into characterising spambot behaviour. In this paper, we have moved in this direction in order to provide a stepping stone for the development of the next generation of spam filtering technology.

8. CONCLUSION

In this paper, we presented a number of characteristics of spambot behaviour within forums (a Web 2.0 platform) by utilising Kohonen's self-organising maps. This study forms the basis for research into next-generation spam filtering techniques. We proposed a framework to profile spambot behaviour. We employed web usage data and proposed three unique feature sets – action set, action frequency and action time – to study spambots. Our main intention is to propose feature sets that are generic, so they are applicable to other Web 2.0 platforms, and differentiable, so they can distinguish spambots from humans based on usage behaviour. We grouped the data based on three levels of abstraction – session, user and IP address – and conducted 9 experiments. The main observations are as follows:

• Spambots focus on particular actions, such as submitting content, updating profile details, etc.

• Spambots do not reuse a single user account; they register new user accounts frequently because they do not want their user account to be blacklisted.

• Some spambots post spam content directly to the website and do not interact with input forms.

• Spambots are designed to repeat a limited set of tasks rather than a large range of tasks.

• Spambots can be categorised into four different categories based on four different goals, i.e. content submitters, profile editors, content viewers and mixed behaviour.

• Some spambots change their IP address once they complete an action and are preparing to perform another action.

Our promising results suggest that our technique is applicable to identifying spambots in Web 2.0 applications. We plan to extend this work in the future by proposing techniques to distinguish spambot and human users by evaluating their web usage data.

REFERENCES

1. P. Hayati and V. Potdar, "Toward Spam 2.0: An Evaluation of Web 2.0 Anti-Spam Methods," in 7th IEEE International Conference on Industrial Informatics, Cardiff, Wales, 2009.
2. P. Kolari, A. Java, and T. Finin, "Characterizing the Splogosphere," in Proceedings of the 3rd Annual Workshop on the Weblogging Ecosystem, 2006.
3. Live-Spam-Zeitgeist, "Some Stats, Akismet," http://akismet.com/stats/, 2009 [accessed May 2009].
4. A. Zinman and J. Donath, "Is Britney Spears spam?," in Fourth Conference on Email and Anti-Spam, Mountain View, California, 2007.
5. J. Nitin and L. Bing, "Opinion spam and analysis," in Proceedings of the International Conference on Web Search and Web Data Mining, Palo Alto, California, USA: ACM, 2008.
6. T. Uemura, D. Ikeda, and H. Arimura, "Unsupervised Spam Detection by Document Complexity Estimation," in Discovery Science, 2008, pp. 319-331.
7. S. Webb, J. Caverlee, and C. Pu, "Social Honeypots: Making Friends with a Spammer Near You," in Proceedings of the Fifth Conference on Email and Anti-Spam (CEAS 2008), Mountain View, CA, 2008.
8. P. Hayati, K. Chai, V. Potdar, and A. Talevski, "HoneySpam 2.0: Profiling Web Spambot Behaviour," in 12th International Conference on Principles of Practice in Multi-Agent Systems, Nagoya, Japan, 2009, pp. 335-344.
9. Z. Le, Z. Jingbo, and Y. Tianshun, "An evaluation of statistical spam filtering techniques," ACM Transactions on Asian Language Information Processing (TALIP), vol. 3, pp. 243-269, 2004.
10. S. Cobb, "The Economics of Spam," ePrivacyGroup, http://www.eprivacygroup.com, 2003 [accessed 9 Aug 2009].
11. B. Whitworth and E. Whitworth, "Spam and the social-technical gap," Computer, vol. 37, pp. 38-45, 2004.
12. M. Koster, "A Standard for Robot Exclusion," http://www.robotstxt.org/orig.html, 1994 [accessed Dec 2009].
13. M. Koster, "Robots in the Web: threat or treat?," http://www.robotstxt.org/threat-or-treat.html, 1995 [accessed Dec 2009].
14. M. Koster, "Guidelines for Robot Writers," http://www.robotstxt.org/guidelines.html, 1993 [accessed Dec 2009].
15. Yahoo!, "How to Prevent Your Site or Certain Subdirectories From Being Crawled," http://help.yahoo.com/l/us/yahoo/search/webcrawler/, 2009 [accessed Dec 2009].
16. R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, "Hypertext Transfer Protocol – HTTP/1.1 – 9 Method Definitions," http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html, 1999 [accessed Dec 2009].
17. K. Park, V. S. Pai, K.-W. Lee, and S. Calo, "Securing Web Service by Automatic Robot Detection," in USENIX 2006 Annual Technical Conference, 2006.
18. U. Ogbuji, "Real Web 2.0: Battling Web spam," http://www.ibm.com/developerworks/web/library/wa-realweb10/, 2008 [accessed 3 Aug 2009].
19. D. Mertz, "Charming Python: Beat spam using hashcash," http://www.ibm.com/developerworks/linux/library/l-hashcash.html, 2004 [accessed 3 Aug 2009].
20. L. von Ahn, M. Blum, N. Hopper, and J. Langford, "CAPTCHA: Using Hard AI Problems for Security," in Advances in Cryptology – EUROCRYPT 2003, 2003.
21. H. Abram, W. G. Michael, and C. H. Richard, "Reverse Engineering CAPTCHAs," in Proceedings of the 15th Working Conference on Reverse Engineering, IEEE Computer Society, 2008.
22. K. Chellapilla and P. Simard, "Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)," in NIPS, 2004.
23. G. Mori and J. Malik, "Recognizing objects in adversarial clutter: breaking a visual CAPTCHA," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003, pp. I-134–I-141.
24. R. Cooley, B. Mobasher, and J. Srivastava, "Web mining: information and pattern discovery on the World Wide Web," in Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence, 1997, pp. 558-567.
25. P.-N. Tan and V. Kumar, "Discovery of Web Robot Sessions Based on their Navigational Patterns," Data Mining and Knowledge Discovery, vol. 6, pp. 9-35, 2002.
26. J. Göbel, T. Holz, and P. Trinius, "Towards Proactive Spam Filtering (Extended Abstract)," in Proceedings of the 6th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Como, Italy: Springer-Verlag, 2009.
27. L. Yiqun, C. Rongwei, Z. Min, M. Shaoping, and R. Liyun, "Identifying web spam with user behavior analysis," in Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, Beijing, China: ACM, 2008.
28. H. Yu, Y. Liu, M. Zhang, L. Ru, and S. Ma, "Web Spam Identification with User Browsing Graph," in Information Retrieval Technology, 2009, pp. 38-49.

Rule-Based On-the-fly Web Spambot Detection Using Action Strings

Pedram Hayati, Vidyasagar Potdar, Alex Talevski
Anti Spam Research Lab, Digital Ecosystems & Business Intelligence Institute, Curtin Business School, Curtin University, Perth, Western Australia, Australia
[email protected], [email protected], [email protected]

William F. Smyth
Algorithms Research Group, Department of Computing & Software, McMaster University, Hamilton, Ontario, Canada L8S 4K1
[email protected]¹

¹ The work of the third author was supported in part by the Natural Sciences & Engineering Research Council of Canada.

CEAS 2010 – Seventh annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, July 13-14, 2010, Redmond, Washington, US.

ABSTRACT

Web spambots are a new type of Internet robot that spread spam content through Web 2.0 applications such as online discussion boards, blogs, wikis and social networking platforms. These robots are intelligently designed to act like humans in order to fool safeguards and other users. Such spam content not only wastes valuable resources and time but may also mislead users with unsolicited content. Spam content typically aims to misinform users (scams), generate traffic, make sales (marketing/advertising), and occasionally compromise parties, people or systems by spreading spyware or malware.

Current countermeasures do not effectively identify and prevent web spambots. Proactive measures to deter spambots from entering a site are limited to question/response scenarios. The remaining efforts focus on spam content identification as a passive activity. Spammers have evolved their techniques to bypass existing anti-spam filters.

In this paper, we describe a rule-based web usage behaviour feature – action strings – that can be analysed using trie data structures to detect web spambots. Our experimental results show that the proposed system is successful at on-the-fly classification of web spambots, hence eliminating spam in Web 2.0 applications.

1. INTRODUCTION

A fake profile in an online community, an unsolicited comment in a blog, an unwelcome commercial thread in an online discussion board, etc. are examples of a new form of spamming technique called spam 2.0 [1, 2]. Spam 2.0 offers a far more attractive proposition for spammers as compared to traditional spam, specifically email spam.

Spammers can discover Web 2.0 applications and use automated tools to distribute spam information targeted at a demographic of their choice with very little resistance. A single spam 2.0 attack may reach many targeted and domain-specific users, and such online messages typically cannot be deleted by regular users; they persist until an administrator deals with them, often impacting many users in the meantime.

Spam 2.0 posts also have a parasitic nature: they may exist on legitimate and often official websites. If such information persists, the trust in such pages is diminished, spam is effectively promoted by trusted sources, many users can be misled or led to scams and computer malware, and such legitimate sites may be blacklisted, which then deprives all others of legitimate content. As a result of the success and impact rates of spam 2.0, it is far more popular amongst spammers and has a far greater negative socio-economic impact.

Such spamming techniques promote the use of web spambots (or simply spambots): web crawlers that navigate the World Wide Web with the sole purpose of planting unsolicited content on external websites. As Web 2.0 platforms become more prevalent, spam 2.0 is becoming more and more widespread. In order to avoid being spuriously used in this way (and to eliminate the clutter and wasted time created by spambots), websites seek tools to recognise and deter spambots as they arrive. Usually the tools used for this purpose are based on challenge-response techniques, such as the Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA), a well-known technique, typically in the form of an image, used to block web crawlers' access to a website [3]. However, such techniques are about to expire, as they decrease human users' convenience and computers are getting more and more powerful day by day [2, 4, 5].

In this paper we propose a rule-based combinatorial approach that automatically distinguishes between spambots and regular human users based on their usage patterns. We:

• introduce the idea of using web usage behaviour to investigate spambot behaviour on the web;

• propose a new concept – action strings, a feature set extracted from web usage behaviour to model spam user vs. human user behaviour;

• illustrate a novel rule-based classifier using a trie structure [6-8] for fast and on-the-fly spambot detection.

Here we describe our experience with a prototype system we developed. In the next section we discuss the two major concepts – web usage data and the trie structure – used in our proposed system.

2. BACKGROUND

2.1 Web Usage Data

Users browsing a website navigate through web objects such as web pages, images, files, documents, etc. Such data can be recorded and later mined to extract valuable information [9]. This tracking data is referred to as web usage data. Web usage data has been widely employed in many studies to understand visitor browsing patterns, make personalised websites, improve web performance and implement decision support and recommendation systems [10, 11]. Web usage data may contain the following fields:

• the visited URL,
• the referrer URL,
• timestamp,
• IP address,
• browser/operating system identity.

In our work, we associate two additional attributes with each entry – a unique session identity (session ID) and the user account name of the user who makes the request. The former makes it possible to track a user while the user is visiting a website, while the latter includes the username in each record to make it possible to track user activities over a period of time.

2.2 Trie

A trie structure is a way to store and retrieve information. Its main advantages over other techniques are ease of updating and handling, shorter access time, and the removal of redundancies in the data structure [6]. A trie is a tree structure where each node contains the value of the key it is associated with. In this paper we construct a trie structure which consists of human and spambot behaviour patterns. The trie gives fast on-the-fly pattern matching ability, which we use for spambot pattern detection.

3. PROBLEM

To set the scene for this paper, we begin with a brief overview of the problems in spam 2.0 and spambot detection. A number of solutions have been proposed to classify spam in Web 2.0 environments, such as opinion spam detection [12], spam in social networks [13-15] and spam in video sharing websites [1, 16]. Most of these solutions have focussed on one particular form of spam. By extracting features from the content of spam 2.0, such techniques try to separate spam content from legitimate content. However, most of them did not achieve satisfactory classification results. Hence, this leads us to investigate different approaches to identifying spam 2.0.

This research continues in line with our previous work on combating spam 2.0 [2]. Our main idea is to actively investigate the spambot – as the origin of spam 2.0 – instead of seeking to discriminate attributes inside spam content.

The concept of spambot identification has not been investigated thoroughly. Hence, we first provide a formal definition of the problem. Similar to the spam classification problem [17], spambot detection can be formalised as a binary classification problem as follows:

$U = \{u_1, u_2, \ldots, u_{|U|}\}$   (1)

where $U$ is the set of users visiting a website and $u_i$ is the $i$-th user in the set.

$C = \{c_h, c_s\}$   (2)

where $C$ refers to the overall set of user classes, $c_h$ refers to the human class and $c_s$ refers to the spambot class. The binary classification $\phi(u_i, c_j)$ is

$\phi(u_i, c_j): U \times C \to \{0, 1\}$   (3)

where

$\phi(u_i, c_j) = \begin{cases} 1 & u_i \in c_s \\ 0 & \text{otherwise} \end{cases}$   (4)

Since in the spambot detection problem each $u_i$ belongs to one and only one class, the decision function can be simplified as $\phi_{spam}(u_i): U \to \{0, 1\}$.

4. HUMAN BEHAVIOUR VS. SPAMBOT BEHAVIOUR

The main assumption of our proposed method is that human web usage behaviour is intrinsically different from spambot behaviour. The reason is that spambots have different intentions: they visit a website mainly to spread spam content rather than to consume content. Hence, by investigating and mining web usage data it is possible to distinguish spambots from human users. This viewpoint differs from previous studies, which have focused on content and meta-content based features [1].

5. ACTION STRINGS

In order to turn web usage data into a behavioural model, we propose an action as a set of user efforts to achieve a certain purpose (Eq. 5). For example, in an online forum, in order to create a new user account, the user needs to navigate to the registration page, complete the registration form and submit this information. This procedure can be formulated as a "register a new user account" action. Actions abstract web usage data. Actions can be a suitable discriminative feature for modelling user behaviour and are also extendible to many other Web 2.0 platforms.

Given a set of webpages $W = \{w_1, w_2, \ldots, w_{|W|}\}$, $A$ is defined as a set of actions, such that

$A = \{a_i \mid a_i \subset W\} = \{\{w_l, \ldots, w_k\}\}, \quad 1 \le l, k \le |W|$   (5)

Respectively, $s_i$ is defined as

$s_i = (a_j), \quad 1 \le i \le |T|,\ 1 \le j \le |A|$   (6)

where $s_i$ refers to a sequence of actions, called an action string, performed in session $i$, and $|T|$ is the total number of sessions.

(a) Spambot HTTP header requests:

123.123.123.123 - - [10/Jul/2009:00:19:25 +0800] "GET /forum/index.php?action=post;board=1 HTTP/1.0" 200 7852 "http://example.com/forum/index.php?action=post;board=1" "Opera/9.0 (Windows NT 5.1; U; en)"

123.123.123.123 - - [10/Jul/2009:00:19:30 +0800] "GET /forum/index.php?board=1 HTTP/1.0" 200 6248 "http://example.com/forum/index.php?board=1" "Opera/9.0 (Windows NT 5.1; U; en)"

(b) Human HTTP header requests:

111.111.111.111 - - [15/Aug/2009:07:05:10 +0800] "GET /forum/index.php HTTP/1.0" 200 60 "http://example.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13 (.NET CLR 3.5.30729)"

111.111.111.111 - - [15/Aug/2009:07:06:12 +0800] "GET /forum/index.php?board=1 HTTP/1.0" 200 6248 "http://example.com/forum/index.php" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13 (.NET CLR 3.5.30729)"

111.111.111.111 - - [15/Aug/2009:07:08:32 +0800] "GET /forum/index.php?action=post;board=1 HTTP/1.0" 200 7852 "http://example.com/forum/index.php?board=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13 (.NET CLR 3.5.30729)"

111.111.111.111 - - [15/Aug/2009:07:10:27 +0800] "GET /forum/index.php?topic=1397 HTTP/1.0" 200 524 "http://example.com/forum/index.php?action=post;board=1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13 (.NET CLR 3.5.30729)"

Figure 1. Series of spambot HTTP requests (a) vs. human HTTP requests (b) in an online forum.

Figure 1 illustrates a simple series of HTTP requests sent from both a human and a spambot to submit new content to an online forum. Each entry consists of an IP address, timestamp, request method, the requested URL, the response status code, the referrer URL and the browser identity. There are some obvious differences between the spambot and human requests. For example, the human's sequence of requests is more expected, and on average spambot sessions last a few seconds while human sessions last around 5 minutes. This shows that spambots follow pre-programmed and optimised procedures. Such key differences in web usage behaviour can be utilised to distinguish humans from spambots.
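A minimal sketch of how requests such as those in Figure 1 could be reduced to action strings follows; the URL-fragment-to-key rules are hypothetical stand-ins for the full mapping used in the system:

# Hypothetical URL-to-action mapping for the forum of Figure 1; a real
# deployment would enumerate every meaningful page of the application.
URL_ACTIONS = [
    ("action=post", "C"),   # start new topic (posting form)
    ("topic=",      "D"),   # view a topic
    ("board=",      "B"),   # view topics on a board
    ("index.php",   "A"),   # view root page
]

def action_key(url):
    """Return the action index key for a requested URL (first rule wins)."""
    for fragment, key in URL_ACTIONS:
        if fragment in url:
            return key
    return "?"  # unmapped request

def action_string(session_urls):
    """Concatenate per-request action keys into the session's action string."""
    return "".join(action_key(u) for u in session_urls)

Under this mapping, the spambot session of Figure 1(a) yields the action string "CB", while the human session of Figure 1(b) yields "ABCD".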

6. RULE-BASED ON-THE-FLY SPAMBOT DETECTION METHOD

Figure 2 provides an overview of the monitoring strategy in algorithmic form. Each user (u) has multiple sessions and each session contains a series of performed actions (s_i). For every new incoming u, our algorithm walks down the trie to find matching nodes. If the probability for that specific action string, 1 − P_H(s_i), is higher than a given threshold, our algorithm classifies the incoming u as a spambot.

for every new user u do
    location ← root of trie
    for every action i do
        location ← child(location, i)
        P_H(s) ← eval(location)
        if 1 − P_H(s) > threshold then
            zap(u)
until u exits

Figure 2. Overview of the user monitoring strategy.

6.1 Framework

We propose a rule-based on-the-fly spambot detection framework using the trie data structure. Our framework consists of five steps, as follows.

6.1.1 Step 1: Monitoring web usage data. Track web usage data and the navigation stream. For each incoming HTTP request, the IP address, visited URL, referrer URL, browser identity, timestamp, session ID and user account name are tracked.

6.1.2 Step 2: Data cleaning and preparation. Once data is aggregated through Step 1, it needs to be cleansed of irrelevant information, such as visitors who did not create a user account in the system. Next, the data needs to be grouped into meaningful user navigation clusters. This task is known as transaction identification [9]. As discussed in Section 2.1, we accompany each tracking record with a unique session ID, which assists us in tracking a user's navigation. In our proposed framework we group data based on session IDs; therefore, requests made by a user in a single visit are clearly identifiable. For example, a user may visit a system three times in a day, and each time our proposed framework assigns a unique session ID to the user.
Hence, it is possible to track user requests in each visit. Moreover, defining transactions at the session level can be utilised for fast and on-the-fly classification, as well as providing in-depth insight into user behaviour.

6.1.3 Step 3: Action extraction and formulation. Table 1 provides a summary of the different actions that can be performed by users in our system. Each action – as defined in Section 5 – comes with an index key that is used to formulate a user's interaction type. Such action lists can be extended to other Web 2.0 applications and can be defined beforehand for a specific platform. Upon new incoming traffic, our framework assigns the associated actions. Then, by putting the action index keys together, action strings can be built up as defined in Section 5. Action strings grow over time as the user performs actions, while preserving the order of the action sequence. As mentioned in Section 4, such a sequence can be utilised as a discriminative feature to distinguish humans and spambots.

Table 1. Action index keys and action list

Key  Description
A    View root page
B    View topics
C    Start new topic
D    View topic (view their own created topic)
E    View user profile information
F    Edit user information
G    Start new poll
H    User authentication (login)
I    Reply to topic
J    View recently submitted topics
K    View statistics
L    Spell checking
M    Send private message
N    Search
O    Get new topics
P    Update topic
Q    Preview topic

6.1.4 Step 4: Building up a trie data structure. Once the action strings for both humans and spambots have been created, our framework builds a trie data structure based on each action string. Each trie edge carries an action index key and each node contains the probability of the corresponding action string being either human or spambot. This probability is computed through Eq. 7 and Eq. 8:

$P_H(s_i) = \dfrac{f_H(s_i)}{f_H(s_i) + f_B(s_i)}$   (7)

$P_B(s_i) = \dfrac{f_B(s_i)}{f_H(s_i) + f_B(s_i)}$   (8)

where $s_i$ is an action string, $f_H(s_i)$ is the frequency of human action strings that have prefix $s_i$, $f_B(s_i)$ is the frequency of spambot action strings with prefix $s_i$, and $P_H(s_i)$ and $P_B(s_i)$ are the probabilities of $s_i$ belonging to a human or a spambot, respectively. A zero probability means that that particular action string has never occurred before. For example, Figure 3 illustrates a trie data structure for the set of action strings {ABC, ABCD, ABDE, ABC} for a human and {ABD, ABD, ABDE} for a spambot.

Figure 3. Simple trie data structure for the set {ABC, ABCD, ABDE, ABC} for a human and {ABD, ABD, ABDE} for a spambot. S and H represent the probability of an action string belonging to spambots and humans respectively.

6.1.5 Step 5: Classification. A new incoming action string $s_i$ is validated against the trie structure built in Step 4. After validation, $s_i$ falls into one of two categories – matched or not-matched. In the former case, our framework looks at $P_H(s_i)$ and, if $1 - P_H(s_i) > \text{threshold}$, $s_i$ is classified as a spambot. In the latter case no decision is made; this is a focus of our future work.
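The following sketch illustrates Steps 4 and 5: a trie whose nodes store the prefix frequencies f_H and f_B of Eqs. (7)-(8), and an on-the-fly classifier that flags a session once 1 − P_H(s_i) exceeds the threshold. The class and function names are ours, not from the paper:

class TrieNode:
    __slots__ = ("children", "human", "bot")
    def __init__(self):
        self.children = {}   # action key -> TrieNode
        self.human = 0       # f_H: human action strings with this prefix
        self.bot = 0         # f_B: spambot action strings with this prefix

def insert(root, action_string, is_bot):
    """Add one training action string, counting every prefix along the path."""
    node = root
    for key in action_string:
        node = node.children.setdefault(key, TrieNode())
        if is_bot:
            node.bot += 1
        else:
            node.human += 1

def classify(root, action_string, threshold=0.0):
    """Walk the trie as actions arrive; flag a spambot as soon as
    1 - P_H(s_i) exceeds the threshold (Step 5). Returns 'spambot',
    'human', or 'unknown' for unseen (not-matched) prefixes."""
    node = root
    for key in action_string:
        node = node.children.get(key)
        if node is None:
            return "unknown"                            # no decision made
        p_human = node.human / (node.human + node.bot)  # Eq. (7)
        if 1 - p_human > threshold:
            return "spambot"
    return "human"

For the training strings of Figure 3 and a threshold of 0.5, classify flags "ABD" as a spambot at its third action (P_H = 1/4 at that node), while "ABC" is matched as human throughout.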

6.2 Performance Measurement

We used the Matthews Correlation Coefficient (MCC) to measure the performance of our proposed framework [18]. MCC is one of the best performance measures for binary classification, especially when the data in the two classes is not balanced [19]. It considers true and false positives and returns a value between -1 and +1. If the returned value is close to +1, the classification result is better and the decision can be considered to have greater certainty. If the value is close to 0, the output of the framework is similar to random prediction, while a value close to -1 shows a strong inverse ability of the classifier. MCC is defined as

$MCC = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$   (9)

In Eq. 9, TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives.
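Eq. (9) translates directly into code; a small helper such as the following (ours, for illustration) suffices, using the common convention of returning 0 when the denominator vanishes:

import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient, Eq. (9)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0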

7. EXPERIMENTAL RESULTS & DISCUSSION

7.1 Dataset
To the best of our knowledge, there is no publicly available collection that contains both human and spambot web usage data for a Web 2.0 platform. Hence, we use the same dataset that we collected in our previous study [2]. We collected human user data from a human-moderated online forum and our spambot data from our HoneySpam 2.0 project [2], which runs the same online forum application with the same configuration as the human forum. Each entry in our dataset contains the visited URL, referrer URL, timestamp, browser identity and IP address, and is associated with a session ID and forum username. Our dataset contains 16594 entries, consisting of 11039 spambot records and 5555 human records. In the action extraction and formulation step we identified 34 individual actions. We used 2/3 of the data to feed the trie in the trie-building step and the remaining 1/3 for evaluation purposes.

7.2 On-The-Fly Detection
Our proposed framework has a real-time, on-the-fly detection feature. The system creates action strings as they happen, i.e. based on the user's webpage behaviour. The aim is to identify the spam behaviour pattern in the least number of actions and then flag that transaction as a spambot. We place a window over the test action strings, run our classifier, and increase the window size by one character for the next run. This simulates real-world practice, where user action strings grow over time.

7.3 Results
We create five random datasets (DS1 to DS5) and run our framework on each dataset. The window size ranges from 2 to 10 characters (the length of the longest action string in our dataset is 10). Additionally, we vary the threshold from -0.05 to 0.05. We measure the performance of our framework using MCC for each experiment.

Figure 4 illustrates the accuracy of our proposed system for different window sizes on our five random datasets. The best accuracy of 96% was achieved on DS1 with a window size equal to or larger than 4 (Table 3), while overall accuracy is 93% across all datasets. Table 2 summarises the average accuracy for our 5 random datasets.

Figure 4. Accuracy of our proposed system for different window sizes on 5 random datasets. The threshold is kept at zero for all experiments.

Table 2. Average accuracy of datasets

          DS1    DS2    DS3    DS4    DS5
Accuracy  0.939  0.930  0.926  0.933  0.934

Table 3. Accuracy of different window sizes on our datasets

        DS1    DS2    DS3    DS4    DS5
Len 2   0.727  0.734  0.719  0.732  0.721
Len 3   0.965  0.955  0.952  0.958  0.960
Len 4   0.965  0.955  0.952  0.958  0.960
Len 5+  0.966  0.955  0.952  0.958  0.960

Figure 5 presents the MCC values for different window sizes. Overall, the average MCC on all datasets is 0.744. Table 4 shows the MCC values on the different datasets. The best MCC value of 0.802 was achieved on DS1 for a window size equal to or larger than 5 (Table 5).

Figure 5. MCC values of different window sizes on 5 random datasets.

Table 4. Average MCC of datasets

      DS1    DS2    DS3    DS4    DS5
MCC   0.762  0.754  0.725  0.729  0.749

Table 5. MCC values of different window sizes on our datasets

        DS1    DS2    DS3    DS4    DS5
Len 2   0.406  0.430  0.412  0.410  0.405
Len 3   0.802  0.795  0.764  0.769  0.792
Len 4   0.802  0.795  0.764  0.769  0.792
Len 5+  0.808  0.795  0.764  0.769  0.792
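A sketch of the sliding-window procedure from Section 7.2 follows, reusing the classify helper from the earlier trie sketch; stopping at the first spambot verdict is our assumption about the flagging policy.

    def on_the_fly(actions, root, threshold=0.0, min_window=2):
        """Grow the observed action string one action per request and
        re-run the Step 5 classifier, so a transaction can be flagged
        before the session ends (Section 7.2)."""
        observed = ""
        for a in actions:
            observed += a
            if len(observed) < min_window:
                continue
            if classify(root, observed, threshold) == "spambot":
                return "spambot", len(observed)   # flagged this early
        return "undecided", len(observed)

    # Using the trie built in the earlier sketch:
    print(on_the_fly("ABDE", root, threshold=0.5))   # ('spambot', 3)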

Figure 6 illustrates the performance of our proposed system for different thresholds at different window sizes (Len 2 to Len 6).

Figure 6. Performance of the system for different thresholds at different window sizes (Len 2 to Len 6)

The best MCC value of 0.88 is achieved for thresholds between 0 and 0.01 with a window size equal to or larger than 4 (Len 4 and up). The performance of our proposed system improves as the window size grows. This shows that our system predicts better as the user uses the system over time. However, the performance remains the same after a certain window size. The reason is that our datasets are randomly selected and not all action string combinations are included during Step 4, building up the trie data structure. Hence, some action strings are not included in the trie structure and our system may not be able to match incoming unseen action strings. The same behaviour can be seen in the accuracy of our results (Fig. 4), where accuracy remains the same after a certain window size (window size 4).

In Figure 6 it is evident that moving the threshold toward negative values increases the number of false negatives, while moving it toward positive values increases the number of false positives. Therefore, MCC at such values is low.

8. RELATED WORKS
Although the topic of spam has been investigated extensively, to the best of our knowledge research into spambot detection in Web 2.0 applications and Spam 2.0 is quite young and has not received comprehensive attention. In this section we review some of the important literature in this area.

Tan et al. [20] propose a web robot session identification method based on navigational patterns. The main assumption in their proposed system is that web robot navigational patterns, such as session length and the set of visited webpages (width and depth of visited webpages), are different from those of humans. Their study targets unknown and camouflaged web robots and web crawlers. Park et al. [21] provide a method for malicious web robot detection based on the types of requests for web objects (e.g. Cascading Style Sheet files, image files) and the existence of mouse/keyboard activity. However, neither of the above-mentioned studies focuses on spambot detection in Web 2.0 applications, where a spambot can mimic human user behaviour.

A proactive spam filtering approach has been proposed by Göbel et al. [22]. Their proposed framework includes interaction with spam botnet controllers, which can provide the latest spam messages. Later, it can present a template for current spam runs to improve spam filtering techniques.

Jindal and Liu [12] study opinion spam in review-gathering websites. They propose a machine learning approach based on 36 content-based features to differentiate opinion spam from legitimate opinion. Zinman and Donath [13] attempted to create a model to distinguish spam profiles from legitimate ones in Social Networking Services. Their machine learning based method uses content-based features to do the classification. Benevenuto et al. [16] provide a mechanism to identify video spammers in online social networks by means of a Support Vector Machine classifier over content-based features. Heymann et al. [15] survey spam filtering techniques on the social web and evaluate a spam filtering technique on a social tagging system. Most of the above studies focus on one particular type of spam and are limited to the content attributes of that particular domain. Moreover, they do not study the source of the spam problem, i.e. the spambot.

Yu et al. [23] and Yiqun et al. [24] separate spam webpages from legitimate webpages by employing user web access logs. Their framework relies on user web access logs as a trusted source for classifying webpages.

Last but not least, in HoneySpam 2.0 [2] we proposed our web tracking framework to track spambot data. There we developed a framework for accumulating spambot web usage data rather than detecting spambots.

9. CONCLUSION AND FUTURE WORKS
Research in the area of web spambot detection on the Web 2.0 platform is quite young. Most of the current studies have focussed on one particular type of spam rather than a general solution to block spammers. We aim to detect web spambots as a source of the spam problem on the Web 2.0 platform; hence our solution can be extended to other web applications. This paper provides a rule-based, on-the-fly web spambot detection method. Our method is based on web usage behaviour. We extract discriminative features called action strings from web usage data to classify spambots vs. humans. We propose an action as a set of user efforts to achieve a certain purpose, and an action string as the sequence of actions performed by a particular user in a transaction. In order to build a real-time and on-the-fly classification method, we build a trie data structure based on action strings.

We evaluated our method against an online forum and achieved an average accuracy of 93% in spambot detection. We measured the performance of our system using MCC; an average MCC value of 0.744 was achieved on our 5 randomly selected datasets. As future work, we intend to improve our detection method and develop an adaptive mechanism to update the system in real time. Additionally, we will evaluate our proposed method against other Web 2.0 platforms such as social networking websites, wikis, and blogs.

10. REFERENCES
[1] P. Hayati and V. Potdar, "Toward Spam 2.0: An Evaluation of Web 2.0 Anti-Spam Methods," in 7th IEEE International Conference on Industrial Informatics, Cardiff, Wales, 2009.
[2] P. Hayati, K. Chai, V. Potdar, and A. Talevski, "HoneySpam 2.0: Profiling Web Spambot Behaviour," in 12th International Conference on Principles of Practice in Multi-Agent Systems, Nagoya, Japan, 2009, pp. 335-344.
[3] L. von Ahn, M. Blum, and J. Langford, "Telling humans and computers apart automatically," Commun. ACM, vol. 47, pp. 56-60, 2004.
[4] K. Chellapilla and P. Simard, "Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)," in NIPS, 2004.
[5] J. Yan and A. S. El Ahmad, "Usability of CAPTCHAs or usability issues in CAPTCHA design," in Proceedings of the 4th Symposium on Usable Privacy and Security, Pittsburgh, Pennsylvania: ACM, 2008.
[6] E. Fredkin, "Trie memory," Commun. ACM, vol. 3, pp. 490-499, 1960.
[7] D. R. Morrison, "PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric," J. ACM, vol. 15, pp. 514-534, 1968.
[8] K. Maly, "Compressed tries," Commun. ACM, vol. 19, pp. 409-415, 1976.
[9] R. Cooley, B. Mobasher, and J. Srivastava, "Web mining: information and pattern discovery on the World Wide Web," in Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence, 1997, pp. 558-567.
[10] C. Aggarwal, J. L. Wolf, and P. S. Yu, "Caching on the World Wide Web," IEEE Transactions on Knowledge and Data Engineering, vol. 11, pp. 94-107, 1999.
[11] J. Dean and M. R. Henzinger, "Finding related pages in the World Wide Web," Computer Networks, vol. 31, pp. 1467-1479, 1999.
[12] N. Jindal and B. Liu, "Opinion spam and analysis," in Proceedings of the International Conference on Web Search and Web Data Mining, Palo Alto, California, USA: ACM, 2008.
[13] A. Zinman and J. Donath, "Is Britney Spears spam?," in Fourth Conference on Email and Anti-Spam, Mountain View, California, 2007.
[14] S. Webb, J. Caverlee, and C. Pu, "Social Honeypots: Making Friends with a Spammer Near You," in Proceedings of the Fifth Conference on Email and Anti-Spam (CEAS 2008), Mountain View, CA, 2008.
[15] P. Heymann, G. Koutrika, and H. Garcia-Molina, "Fighting Spam on Social Web Sites: A Survey of Approaches and Future Challenges," IEEE Internet Computing, vol. 11, pp. 36-45, 2007.
[16] F. Benevenuto, T. Rodrigues, V. Almeida, J. Almeida, C. Zhang, and K. Ross, "Identifying Video Spammers in Online Social Networks," in AIRWeb '08, Beijing, China, 2008.
[17] L. Zhang, J. Zhu, and T. Yao, "An evaluation of statistical spam filtering techniques," ACM Transactions on Asian Language Information Processing (TALIP), vol. 3, pp. 243-269, 2004.
[18] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochim. Biophys. Acta, vol. 405, pp. 442-451, 1975.
[19] P. Baldi, S. Brunak, Y. Chauvin, C. A. F. Andersen, and H. Nielsen, "Assessing the accuracy of prediction algorithms for classification: an overview," Bioinformatics, vol. 16, pp. 412-424, 2000.
[20] P.-N. Tan and V. Kumar, "Discovery of Web Robot Sessions Based on their Navigational Patterns," Data Mining and Knowledge Discovery, vol. 6, pp. 9-35, 2002.
[21] K. Park, V. S. Pai, K.-W. Lee, and S. Calo, "Securing Web Service by Automatic Robot Detection," USENIX 2006 Annual Technical Conference, 2006.
[22] J. Göbel, T. Holz, and P. Trinius, "Towards Proactive Spam Filtering (Extended Abstract)," in Proceedings of the 6th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Como, Italy: Springer-Verlag, 2009.
[23] H. Yu, Y. Liu, M. Zhang, L. Ru, and S. Ma, "Web Spam Identification with User Browsing Graph," in Information Retrieval Technology, 2009, pp. 38-49.
[24] Y. Liu, R. Cen, M. Zhang, S. Ma, and L. Ru, "Identifying web spam with user behavior analysis," in Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, Beijing, China: ACM, 2008.

Web Spambot Detection Based on Web Navigation Behaviour

Pedram Hayati, Vidyasagar Potdar, Kevin Chai, Alex Talevski Anti-Spam Research Lab (ASRL) Digital Ecosystem and Business Intelligence Institute Curtin University, Perth, Western Australia {pedram.hayati, kevin.chai}@postgrad.curtin.edu.au {v.potdar, a.talevski}@curtin.edu.au

Abstract— Web robots have been widely used for various beneficial and malicious activities. Web spambots are a type of web robot that spreads spam content throughout the web, typically targeting Web 2.0 applications. They are intelligently designed to replicate human behaviour in order to bypass system checks. Spam content not only wastes valuable resources but can also mislead users to unsolicited websites and award undeserved search engine rankings to spammers' campaign websites. While most of the research in anti-spam filtering focuses on the identification of spam content on the web, only a few studies have investigated the origin of spam content; hence the identification and detection of web spambots remains an open area of research. In this paper, we describe an automated supervised machine learning solution which utilises web navigation behaviour to detect web spambots. We propose a new feature set (referred to as an action set) as a representation of user behaviour to differentiate web spambots from human users. Our experimental results show that our solution achieves 96.24% accuracy in classifying web spambots.

Keywords— Web spambot detection, Web 2.0 spam, spam 2.0, user behaviour

I. INTRODUCTION
Spammers do not only deploy their own spam webpages (known as Web spam); they also spread spam content over Web 2.0 applications such as online communities, wikis, social bookmarking and online discussion boards [1]. Web 2.0 collaboration platforms like online discussion forums, wikis and blogs are misused by spammers to distribute spam content. This new spamming technique is called Spam 2.0 [1]. Examples of Spam 2.0 include spammers posting promotional threads in online discussion boards, manipulating wiki pages, and creating fake and attractive user profiles in online community websites [1].

According to live reports [2], the amount of comment spam on the Web has doubled within the past year. To date, spammers exploit new tools and techniques to achieve their purposes. An example of such a tool is a Web spambot (simply, spambot), a type of web robot designed to spread spam content on behalf of spammers [3]. Spambots are able to crawl the web, create user accounts and contribute to collaboration platforms by spreading spam content [4]. Spambots not only waste useful resources but also put legitimate websites in danger of being blacklisted; hence identifying and detecting spambots remains an open area of research. It should be noted that Web spambots are different from spambots, which are designed to harvest email addresses from webpages. For the sake of simplicity, here we refer to a web spambot as a spambot.

Current countermeasures, which solely focus on the detection and prevention of spam content, are not suitable for use in a Web 2.0 environment [1]. For example, most of the content-based methods in email spam [5], and web spam techniques such as link-based detection [6] and TrustRank [7], are not applicable in Web 2.0 environments since, unlike web spam, Spam 2.0 content involves spammers contributing to legitimate websites [1].

In this paper, we present a novel method to detect spambots inside Web 2.0 platforms (Spam 2.0) based on web usage navigation behaviour. The main contributions of this paper are to:
• Present a framework to detect spambots by analysing web usage data and evaluate its feasibility in combating the distribution of Spam 2.0.
• Propose an action set as a new feature set for spambot detection.
• Evaluate the performance of our proposed framework with real world data.

We make use of web usage navigation behaviour to build up our feature set and train our Support Vector Machine (SVM) classifier. Our preliminary results show that our framework is capable of detecting spambots with 96.24% accuracy.

The rest of the paper is structured as follows:
• Section II gives an overview of the problems in spambot detection along with the problem definition.
• Section III presents our proposed framework for spambot detection.
• Data preparation and our experimental results are discussed in Section IV.
• We conclude the paper in Section V along with our future works.

II. PROBLEM
Spambots mimic human behaviour in order to spread spam content. To hinder spambot activities, most websites adopt the Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA), a popular challenge-response technique to differentiate web robots from humans [8]. However, CAPTCHA is not a suitable solution for stopping spambots, and it inconveniences human users. Existing research shows that, by making use of machine learning algorithms, even CAPTCHA-based techniques can be deciphered programmatically [9-11]. Other filtering techniques are content based, i.e. they focus on spam content classification rather than spambot detection [1, 3]. The formal definition of the spambot detection problem is discussed in the following section.

A. Problem Definition
The problem of spambot detection is a binary classification problem, similar to the spam classification problem described in [12]. Suppose

D = {u_1, u_2, ..., u_|U|}   (1)

where D is a dataset of users visiting a website and u_i is the i-th user.

C = {c_h, c_s}   (2)

where C refers to the overall set of user classes, c_h refers to the human user class and c_s refers to the spambot user class.

Then the decision function is

φ(u_i, c_j) : D × C → {0, 1}   (3)

φ(u_i, c_j) is a binary classification function, where

φ(u_i, c_j) = 1 if u_i ∈ c_s, and 0 otherwise   (4)

In spambot detection each u_i belongs to one and only one class, so the classification function can be simplified as φ_spam(u_i) : D → {0, 1}.

III. PROPOSED SOLUTION
A. Solution Overview
While most of the research in anti-spam filtering has focused on the identification of spam content on the web, only a few studies have investigated the source of the spam problem [1, 4, 13-15]. One way of identifying the source of spam is to study spambot behaviour. In this paper our fundamental assumption is that spambot behaviour is intrinsically different from that of humans. In order to test this assumption, we make use of web usage data from both spambots and humans. Web usage data contains information regarding the way web users navigate websites and can be implicitly collected while a user browses the website. However, it is necessary to convert web usage data into a format that
• is a discriminative and reliable feature set that differentiates spambot behaviour from human behaviour, and
• can be extended to other Web 2.0 platforms such as wikis, blogs, online communities etc.

Hence, we propose a new feature set called an action set to formulate web usage data. We define an Action as a set of user-requested webpages that achieves a certain goal. For example, in an online forum a user navigates to a specific board and then goes to the New Thread page to start a new topic. This user navigation can be formulated as a "submit new content" action. Table I presents some examples of actions for different Web 2.0 platforms. We provide a formal description of the action set in Section 3.2. Our results show that the use of action sets is an effective representation of user behaviour and can be successfully used to classify spambot and human users.

TABLE I. EXAMPLES OF USER ACTIONS IN WEB 2.0 PLATFORMS, INCLUDING ONLINE DISCUSSION BOARDS (I.E. FORUMS), BLOGS, WIKIS AND ONLINE COMMUNITIES

Platform: Action
Online Discussion Boards: Post a topic, Reply to a topic, Contact other users, etc.
Blogs (comments): Read posts, Read others' comments, Post a comment, etc.
Wikis: Start new topic, Edit topic, Search, etc.
Online Communities: Adding a new friend, Sending a message, Writing comments, etc.

B. Framework
Our proposed framework consists of 3 main modules – Tracker, Extractor, and Classifier – as shown in Fig. 1.

Figure 1. Architecture of Proposed Framework

1) Incoming Traffic: Represents a new user who enters the system via a web interface like the index page of a discussion board.

2) Tracker: This is the entry module for all incoming traffic; it tracks IP addresses, usernames, session details and requested webpages. It starts by reading the HTTP request headers and extracts each of the above attributes. A unique session is assigned to each individual browsing session, so it is possible to track the navigation on each occasion the user visits the website. The Tracker module stores the gathered data along with the corresponding username for each record.

3) Extractor: This is the second module of the proposed framework and involves two main activities: transaction identification and action extraction.

a) Transaction Identification: Transaction identification involves creating meaningful groups of user requests [16]. In our framework, we group user requests based on three levels of abstraction, ranging from the highest to the lowest level (Fig. 2). At the highest level of abstraction we have the IP address, followed by the user, and at the lowest level of abstraction we have the session. The IP address, at the highest level, can contain more than one user. Inherently, at the user level, each user group can contain more than one session. The session level is the lowest level of abstraction and contains information about user navigation data for each website visit. The user level of abstraction captures the total active time a user spends on a website. The IP level can illustrate the behaviour of a specific host during the total amount of time that the host is connected to the website.

Figure 2. Levels of Web Usage Data Abstraction

In our proposed solution we chose the session abstraction level for the following reasons:
• The session level can be built and analysed quickly, while the other levels need more tracking time to give a complete view of user behaviour.
• The session level provides more in-depth information about user behaviour when compared with the other levels of abstraction.

Hence we define a transaction as a set of webpages that a user requests in each browsing session.

b) Action Extraction: Action extraction is the next activity in the Extractor module. Given a set of webpages W = {w_1, w_2, ..., w_|W|}, A is defined as a set of Actions, such that

A = {a_i} = {a_i ⊂ W | a_i = {w_l, ..., w_k}, 1 ≤ l, k ≤ |W|}   (5)

Respectively, s_k is defined as

s_k = {a_j}, 1 ≤ j ≤ |A|   (6)

where s_k refers to the set of actions performed by user k.
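By way of illustration, the mapping from requested URLs to actions might look like the following sketch; the URL-to-action table is hypothetical, since the real mapping is platform-specific and defined beforehand.

    # Hypothetical URL-to-action lookup for a forum-style application;
    # the URLs and action names are illustrative, not from the paper.
    ACTION_OF_URL = {
        "/index.php": "view_board",
        "/posting.php?mode=newtopic": "submit_new_content",
        "/posting.php?mode=reply": "reply_to_topic",
        "/profile.php?mode=register": "register_account",
    }

    def extract_actions(session_pages):
        """Extractor sketch: map one session's ordered page requests to
        the ordered list of actions that form the transaction."""
        return [ACTION_OF_URL[url] for url in session_pages
                if url in ACTION_OF_URL]

    print(extract_actions([
        "/index.php",
        "/posting.php?mode=newtopic",
        "/style.css",                    # non-action requests are ignored
    ]))
    # ['view_board', 'submit_new_content']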

4) Classifier: The Classifier module is the third module in the proposed framework and involves two main activities: feature measurement and spambot classification.

a) Feature Measurement: Our framework extracts and measures features to build an input vector for spambot classification. In order to build the input vector, we consider each action (a_i) as a feature in our system. Supposing there are n different actions, we represent the input vector, s'_i, as a bit vector (Eq. 7):

s'_i = {a_1^i, a_2^i, ..., a_n^i}   (7)

where

a_j^i = 1 if a_j ∈ s_i, and 0 otherwise   (8)

b) Spambot Classification: In order to perform spambot classification, we used an SVM on our extracted and measured features. SVM is selected as it has a solid theoretical foundation, can handle high dimensional data better than most classifiers, supports non-linearity, and can perform classification more accurately than other classifiers in applications involving relatively high dimensional data, as in our experiment [17]. We utilise the popular SVM tool LibSVM [18] for our spambot classification experiment.

5) Classifier Result: The result of the SVM classifier is the classification of user sessions into two classes, i.e. spambot or human.

The following section provides a detailed view of our algorithm along with the pseudo code.

C. Algorithm
Table II provides the steps of our proposed framework for spambot classification. Our algorithm starts with a loop over each session, t_i. For each requested webpage in t_i, if the webpage is associated with a particular action, our algorithm looks for the next requested webpage in the session. If a group of webpages forms an action, a', the system adds a' to the set of actions performed by user k. Next, for each set of actions, s_k, the framework uses the classifier to mark the session as a spambot or human session and classifies the corresponding user k who made s_k as a spambot or human user.

TABLE II. PROPOSED METHOD ALGORITHM
1.  Let T refer to the set of sessions.
2.  for each t_i ∈ T do
3.    create new action a'.
4.    for each w_j ∈ t_i do
5.      if w_j ∈ a then
6.        add w_j to a'.
7.        if a' ∈ A then
8.          add a' to s_k.
9.          remove all w_j in a' from t_i.
10.         if t_i.length > 0 then
11.           create new action a'.
12.     else
13.       continue to the next w_j.
14.   end.
15. end.
16. for each s_i do
17.   if φ_spam(s_i) = 1 then
18.     mark s_i as a spambot session and user i as a spambot user.
19.   else
20.     mark s_i as a human session and user i as a human user.
21. end.

IV. EXPERIMENTAL RESULT
A. Data Preparation
Two major tasks in web data mining are sessionisation and user identification from web server logs [21]. However, we do not need to conduct these tasks as we track human and spambot navigation directly through the forum for each user account and for each of their sessions.

The spambot data used for our experiment was collected from our previous work [3] over a period of 60 days. Additionally, we host an online discussion forum for a human community that is under human moderation. We removed forum-specific data such as the domain name, web server IP address, etc. from both datasets. We grouped user navigation records based on each session and extracted a set of actions from each dataset along with user accounts. We merged both datasets into a single dataset which contains the username and the set of actions that belong to each particular user. We split the dataset into a training set (2/3) and a test set (1/3). Table III presents a summary of our dataset.

TABLE III. SUMMARY OF COLLECTED DATA

Data                   Frequency
# of human records     5555
# of spambot records   11039
# of total sessions    4227 (training: 2818, test: 1409)
# of actions           34
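A training matrix in the format of Table IV below can be produced as in the following sketch; scikit-learn's SVC (which wraps LIBSVM) stands in here for the LibSVM tool used in the paper, and the sessions and labels are toy data for illustration only.

    # Sketch of Eq. 7-8 plus training: each session becomes a 34-bit
    # vector (one bit per action), labelled 1 = spambot, 0 = human.
    from sklearn.svm import SVC

    N_ACTIONS = 34

    def to_bit_vector(action_indices):
        v = [0] * N_ACTIONS
        for j in action_indices:
            v[j] = 1          # a_j = 1 iff action j occurs in the session
        return v

    X = [to_bit_vector([4, 5]),   # e.g. a session dominated by posting
         to_bit_vector([2, 3])]   # e.g. a session dominated by reading
    y = [1, 0]

    clf = SVC(kernel="linear").fit(X, y)
    print(clf.predict([to_bit_vector([4])]))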

As shown in Table IV, there are 34 actions (34 columns) used as features and 2818 sessions (2818 rows) to train the SVM classifier.

TABLE IV. TRAINING DATA FORMAT USED IN THE SVM CLASSIFIER

        a1   a2   ...  a34
t1      1    0    ...  1
t2      0    0    ...  1
...     ...  ...  ...  ...
t2818   0    1    ...  0

B. Performance Measurement
We used the Matthews Correlation Coefficient (MCC) to measure the performance of our proposed framework [19]. MCC is one of the best performance measures for binary classification, especially when the data in the two classes is not balanced [20]. It considers true and false positives and returns a value between -1 and +1. The closer the return value is to +1, the better the classification result and the greater the certainty of the decision. A value close to 0 indicates that the output of the framework is similar to random prediction, while a value closer to -1 shows a strong inverse ability of the classifier. MCC is defined as follows:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))   (9)

In Eq. 9, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives.

C. Results
We ran our experiment on 5 randomly split training and test datasets. For each dataset we measured the accuracy and MCC value. An average accuracy of 95.55% was achieved, ranging from 95.03% on dataset 4 to 96.24% on dataset 2, as shown in Table V.

TABLE V. EXPERIMENT RESULTS ON 5 RANDOM DATASETS

Dataset    Accuracy  MCC
Dataset 1  95.67%    0.780
Dataset 2  96.24%    0.801
Dataset 3  95.24%    0.751
Dataset 4  95.03%    0.757
Dataset 5  95.60%    0.809

Fig. 3 shows a comparison among the number of true negatives (correctly detected spambots), true positives (correctly detected humans), false negatives (spambots classified as humans) and false positives (humans classified as spambots) in each dataset. Although the highest accuracy was achieved by dataset 2, its MCC value is slightly lower than the MCC value for dataset 5. The reason is that the number of false positives in dataset 5 is a quarter of that of dataset 2, as shown in Figure 3(b).

Figure 3. (a) Left axis: True Negatives (TN); right axis: True Positives (TP). (b) Left axis: False Negatives (FN); right axis: False Positives (FP).

Figure 4. Frequency of humans and spambots for each of the 34 actions. Left axis: humans; right axis: spambots.

Figure 4 illustrates the action frequencies of spambots and humans. A closer look at Figure 4 reveals that spambots are more active in actions 5 and 6, which correspond to posting new threads in online discussion forums. On the other hand, humans are more active in actions 3 and 4, which correspond to reading forum topics.

V. RELATED WORK
The identification of web robots has been of interest to the research community, as robot navigation on websites adds noise and interferes with web usage mining of human navigation. Tan et al. [22] presented a framework to detect search engine crawlers that are camouflaged and previously unseen web robots. In their proposed framework, they use navigation patterns such as the session length, the depth and width of webpage coverage, and request methods. Park et al. [23] proposed a malicious web robot detection method based on the request type in HTTP headers along with mouse movement. However, the focus of these studies was not on spambot detection.

In the area of email spambots, Göbel et al. [24] introduced a proactive approach to monitoring current spam messages inside botnets. They interact with email spambot controllers to collect the latest email spam messages, generate templates, and employ them inside spam filtering techniques.

In the web spam domain, Yu [25] and Liu [26] proposed frameworks based on user web access data to classify spam and legitimate webpages. Their frameworks are based on the assumption that user navigation behaviour on spam webpages is different from that on legitimate webpages. However, our study shows that one cannot assume that the navigation behaviour being evaluated is from a human user; it could also be from a spambot.

VI. CONCLUSION AND FUTURE WORK
While most of the research in anti-spam filtering concentrates on the identification of spam content, the work proposed in this research is innovative in focusing on spambot identification to manage spam rather than analysing spam content. The importance of this approach (i.e. detecting spambots rather than spam content) is that it is a completely new approach to identifying spam content and can be extended to other Web 2.0 platforms rather than only forums.

We proposed a novel framework to detect spambots inside Web 2.0 applications, which leads us to Spam 2.0 detection. We proposed a new feature set, i.e. action navigations, to detect spambots. We validated our framework against an online forum and achieved 96.24% accuracy, using the MCC method to measure performance. In the future we will extend this work to a much larger dataset and improve our feature selection process.

REFERENCES
[1] Hayati, P., Potdar, V.: Toward Spam 2.0: An Evaluation of Web 2.0 Anti-Spam Methods. 7th IEEE International Conference on Industrial Informatics, Cardiff, Wales (2009)
[2] Live-Spam-Zeitgeist: Some Stats, Akismet. [Accessed online May 2009] http://akismet.com/stats/ (2009)
[3] Hayati, P., Chai, K., Potdar, V., Talevski, A.: HoneySpam 2.0: Profiling Web Spambot Behaviour. 12th International Conference on Principles of Practice in Multi-Agent Systems, Vol. 5925, Lecture Notes in Artificial Intelligence, Nagoya, Japan (2009) 335-344
[4] Webb, S., Caverlee, J., Pu, C.: Social Honeypots: Making Friends with a Spammer Near You. Proceedings of the Fifth Conference on Email and Anti-Spam (CEAS 2008), Mountain View, CA (2008)
[5] Hayati, P., Potdar, V.: Evaluation of Spam Detection and Prevention Frameworks for Email and Image Spam - A State of Art. The 2nd International Workshop on Applications of Information Integration in Digital Ecosystems (AIIDE 2008), Linz, Austria (2008)
[6] Gan, Q., Suel, T.: Improving web spam classifiers using link structure. Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web. ACM, Banff, Alberta, Canada (2007)
[7] Gyöngyi, Z., Garcia-Molina, H., Pedersen, J.: Combating web spam with TrustRank. Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30. VLDB Endowment, Toronto, Canada (2004)
[8] von Ahn, L., Blum, M., Hopper, N., Langford, J.: CAPTCHA: Using Hard AI Problems for Security. Advances in Cryptology — EUROCRYPT 2003 (2003) 646-646
[9] Hindle, A., Godfrey, M.W., Holt, R.C.: Reverse Engineering CAPTCHAs. Proceedings of the 2008 15th Working Conference on Reverse Engineering. IEEE Computer Society (2008)
[10] Baird, H.S., Bentley, J.L.: Implicit CAPTCHAs. Proceedings of the SPIE/IS&T Conference on Document Recognition and Retrieval XII (DR&R2005), San Jose, CA (2005)
[11] Chellapilla, K., Simard, P.: Using Machine Learning to Break Visual Human Interaction Proofs (HIPs). NIPS (2004)
[12] Zhang, L., Zhu, J., Yao, T.: An evaluation of statistical spam filtering techniques. ACM Transactions on Asian Language Information Processing (TALIP) 3 (2004) 243-269
[13] Jindal, N., Liu, B.: Opinion spam and analysis. Proceedings of the International Conference on Web Search and Web Data Mining. ACM, Palo Alto, California, USA (2008)
[14] Zinman, A., Donath, J.: Is Britney Spears spam? Fourth Conference on Email and Anti-Spam, Mountain View, California (2007)
[15] Uemura, T., Ikeda, D., Arimura, H.: Unsupervised Spam Detection by Document Complexity Estimation. Discovery Science (2008) 319-331
[16] Cooley, R., Mobasher, B., Srivastava, J.: Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems 1 (1999) 5-32
[17] Liu, B.: Web Data Mining. Springer Berlin Heidelberg (2007)
[18] Chang, C., Lin, C.: LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm (2001)
[19] Matthews, B.W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405 (1975) 442-451
[20] Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A.F., Nielsen, H.: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16 (2000) 412-424
[21] Cooley, R., Mobasher, B., Srivastava, J.: Web mining: information and pattern discovery on the World Wide Web. Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence (1997) 558-567
[22] Tan, P.-N., Kumar, V.: Discovery of Web Robot Sessions Based on their Navigational Patterns. Data Mining and Knowledge Discovery 6 (2002) 9-35
[23] Park, K., Pai, V.S., Lee, K.-W., Calo, S.: Securing Web Service by Automatic Robot Detection. USENIX 2006 Annual Technical Conference (2006)
[24] Göbel, J., Holz, T., Trinius, P.: Towards Proactive Spam Filtering (Extended Abstract). Proceedings of the 6th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer-Verlag, Como, Italy (2009)
[25] Yu, H., Liu, Y., Zhang, M., Ru, L., Ma, S.: Web Spam Identification with User Browsing Graph. Information Retrieval Technology (2009) 38-49
[26] Liu, Y., Cen, R., Zhang, M., Ma, S., Ru, L.: Identifying web spam with user behavior analysis. Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web. ACM, Beijing, China (2008)

Behaviour-Based Web Spambot Detection by Utilising Action Time and Action Frequency

Pedram Hayati, Kevin Chai, Vidyasagar Potdar, Alex Talevski

Anti-Spam Research Lab (ASRL) Digital Ecosystem and Business Intelligence Institute Curtin University, Perth, Western Australia @postgrad.curtin.edu.au @curtin.edu.au

Abstract. Web spam is an escalating problem that wastes valuable resources, misleads people and can manipulate search engines to achieve undeserved search rankings that promote spam content. Spammers have extensively used web robots to distribute spam content within Web 2.0 platforms. We refer to these web robots as spambots; they are capable of performing human tasks such as registering user accounts as well as browsing and posting content. Conventional content-based and link-based techniques are not effective in detecting and preventing web spambots, as their focus is on spam content identification rather than spambot detection. We extend our previous research by proposing two action-based feature sets, known as action time and action frequency, for spambot detection. We evaluate our new framework against a real dataset containing spambots and human users and achieve an average classification accuracy of 94.70%.

Keywords: Web spambot detection, Web 2.0 spam, spam 2.0, user behaviour

1 Introduction

Web spam is a growing problem that wastes resources, misleads people and can trick search engine algorithms into awarding unfair search result rankings [1]. As a result, spam can decrease the quality and reliability of content on the World Wide Web (WWW). As new web technologies emerge, new spamming techniques emerge to misuse them [2]. For instance, collaborative Web 2.0 websites have been targeted by Spam 2.0 techniques. Examples of Spam 2.0 techniques include creating fake and attractive user profiles in social networking websites, posting promotional content in forums and uploading advertisement comments within blogs. While similar to traditional spam, Spam 2.0 poses some additional problems. Spam content can be added to legitimate websites and therefore influence the quality of content within the website. A website that contains spam content can lose its

popularity among visitors, as well as being blacklisted for hosting unsolicited content, if the website providers are unable to effectively manage spam. A Spam 2.0 technique that has been used extensively is web spambots or spam web robots (which we refer to as spambots). Web robots are automated agents that can perform a variety of tasks such as link checking, page indexing and performing vulnerability assessments of targets [3]. However, spambots are specifically designed and employed to perform malicious tasks, i.e. to spread spam content in Web 2.0 platforms [4]. They are able to perform human-user tasks on the web such as registering user accounts, searching for and submitting content, and navigating through websites. In order to counter the Spam 2.0 problem at its source, we focus our research efforts on spambot detection. Many countermeasures have been used to keep general web robots away from websites [5-7]. However, such solutions are not sophisticated enough to deal with evolving spambots, and the existing literature lacks work dedicated to spambot detection within Web 2.0 platforms. The study performed in this paper continues from our previous work on spambot detection [4] and presents a new method to detect spambots based on web usage navigation behaviour. The main focuses and contributions of this paper are to:
• Propose a behaviour-based detection framework to detect spambots on the Web.
• Present two new feature sets that formulate spambot web usage behaviour.
• Evaluate the performance of our proposed framework with real data.
We extract feature sets from web usage data to formulate the web usage behaviour of spambots and use a Support Vector Machine (SVM) as a classifier to distinguish between human users and spambots. Our result is promising and shows a 94.70% average accuracy in spambot classification.

2 Spambot Detection

As previously discussed, one area of research to counter the Spam 2.0 problem is spambot detection. The main advantage of such an approach is that it stops spam at the source, so spambots do not continue to waste resources and mislead users. Additionally, spammers have shown that they use a variety of techniques to bypass content-based spam filters (e.g. word salad [8], Naïve Bayes poisoning [9]); hence spambot detection can be an effective solution for the current situation. The aim of spambot detection is to classify spambot users from human users while they are surfing a website. Some practical solutions such as the Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) [5], Hashcash [6], Nonce [10], Form Variation [10] and Flood Control [10] have been proposed to either prevent or slow down spambot activity within a website. Additionally, the increasing amount of Spam 2.0 and recent works prove that such techniques are not effective enough for spambot detection [11].

Behaviour-based spam detection is more capable of detecting new spamming patterns, as well as offering early detection and adaptation to legitimate and spam behaviour [12]. In this work we propose a behaviour-based spambot detection method based on web usage data.

2.1 Problem Definition

We can formulate the spambot detection problem as a binary classification problem, similar to the spam classification problem described in [13]:

D = {u_1, u_2, ..., u_|U|}   (1)

where D is a dataset of users visiting a website and u_i is the i-th user.

C = {c_h, c_s}   (2)

where C refers to the overall set of user classes, c_h refers to the human user class and c_s refers to the spambot user class.

Then the decision function is

φ(u_i, c_j) : D × C → {0, 1}   (3)

φ(u_i, c_j) is a binary classification function, where

φ(u_i, c_j) = 1 if u_i ∈ c_s, and 0 otherwise   (4)

In spambot detection each u_i belongs to one and only one class, so the classification function can be simplified as φ_spam(u_i) : D → {0, 1}.

3 Behaviour-Based Spambot Detection

3.1 Solution Overview

Our fundamental assumption is that spambots behave differently from human users within Web 2.0 applications. Hence, by evaluating the web usage data of spambots and human users, we believe we can identify the spambots. Web usage data can be implicitly gathered while users and spambots surf through websites. However, web usage data by itself is not effective in distinguishing spambots from human users.


Additional features need to be derived from web usage data in Web 2.0 applications for effective spambot detection. Therefore, we investigate two new feature sets, called Action Time and Action Frequency, to study spambot behaviour. An Action can be defined as a set of web objects requested by a user in order to perform a certain task or purpose. For instance, a user can navigate to the registration page in an online forum, fill in the required fields and press the submit button in order to register a new user account. This procedure can be formulated as a "Registering a user account" action. Actions can be a suitable discriminative feature to model user behaviour within forums, but are also extendible to many other Web 2.0 platforms. For instance, the "Registering a user account" action is performed in numerous Web 2.0 platforms, as users often need to create an account in order to read and write content. In this work we make use of action time and action frequency to formulate web usage behaviour. Action time is the amount of time spent performing a particular action. For instance, for the "Registering a new user account" action, action time is the amount of time the user spends navigating to the account registration page, completing the web form and submitting the inputted information. Similarly, action frequency is the number of times one certain action is performed. For example, if a user registers two accounts, their "Registering a new user account" action frequency is two. Section 3.2 provides a formal explanation of action time and action frequency. It is possible to classify spambots from human users once action time and action frequency are extracted from web usage data and fed into the SVM classifier.

3.2 Framework

Our proposed framework consists of 4 main modules: web usage tracking, data preparation, feature measurement and classification, as shown in Figure 1.

Fig. 1. Behaviour-Based Spambot detection framework

Incoming Traffic Incoming traffic shows a user entering the website through a web interface such as the homepage of an online forum.

Web Usage Tracking This module records web usage data including the user's IP address, username, requested webpage URL, session identity, and timestamp. The username and session ID make it possible to track each user when he/she visits the system. Conventionally, web usage navigation tracking is done through web server logs [14]. However, these logs cannot specify the username and session of each request. Hence we employ our own web usage tracking system, developed in our previous work HoneySpam 2.0 [4], to collect web usage data.

Data Preparation This module includes three components, which are data cleaning, transaction identification and dwell time completion.

Data cleaning This component removes irrelevant web usage data from:
• Researchers who monitor the forum
• Visitors who did not create a user account
• Crawlers and other Web robots that are not spambots

Transaction Identification This component performs the tasks needed to make meaningful clusters of user navigation data [15]. We group web usage data into three levels of abstraction: IP, User and Session. The highest level of abstraction is IP, and each IP address in the web usage data can consist of multiple users. The middle level is the user level, and each user can have multiple browsing sessions. Finally, the lowest level is the session level, which contains detailed information on how the user behaved during each website visit. In our proposed solution we performed spambot detection at the session level for the following reasons:
• The session level can be built and analysed quickly, while the other levels need more tracking time to give a complete view of user behaviour.
• The session level provides more in-depth information about user behaviour when compared with the other levels of abstraction.
Hence we define a transaction as a set of webpages that a user requests in each browsing session and extract features accordingly.


Dwell time completion Dwell time is the amount of time a user spends on a specific webpage in his/her navigation sequence. It can be calculated by looking at each record's timestamp. Dwell time is defined as:

d'_i = t_{i+1} − t_i, where 1 ≤ i ≤ |S|   (5)

where d'_i is the dwell time for the i-th requested webpage, visited at time t_i, in session S. Under Eq. 5 it is not possible to calculate the dwell time for the last visited page in a session; for example, a user navigates to the last page on the website and then closes his/her web browser. Hence we take the average dwell time spent on the other webpages in the same session as the dwell time for the last page.
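A small sketch of the dwell-time completion described above; the handling of sessions with fewer than two requests is our assumption.

    def dwell_times(timestamps):
        """Eq. 5 sketch: dwell time of page i is t[i+1] - t[i]; the last
        page, which has no successor, gets the session's average dwell
        time. Timestamps are in seconds, in visit order."""
        if len(timestamps) < 2:
            return [0.0] * len(timestamps)   # degenerate session; assumption
        d = [timestamps[i + 1] - timestamps[i]
             for i in range(len(timestamps) - 1)]
        d.append(sum(d) / len(d))            # dwell time for the last page
        return d

    print(dwell_times([0, 30, 45, 105]))     # [30, 15, 60, 35.0]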

Feature Measurement This module extracts and measures our proposed feature sets of action time and action frequency from web usage data.

Definition 1: Set of actions (s_i)

Given a set of webpages W = {w_1, w_2, ..., w_|W|}, A is defined as a set of Actions, such that

A = {a_i} = {a_i ⊂ W | a_i = {w_l, ..., w_k}, 1 ≤ l, k ≤ |W|}   (4)

Respectively, s_i is defined as

s_i = {a_j}, 1 ≤ i ≤ T; 1 ≤ j ≤ |A|   (5)

s_i refers to the set of actions performed in transaction i, and T is the total number of transactions.

Definition 2: Action Frequency (aF_i = ⟨h_1^i, ..., h_|A|^i⟩)

Action frequency (aF_i) is a vector where h_j^i is the frequency of the j-th action in s_i; if the action does not occur in s_i, the entry is zero.

Definition 3: Action Time (aT_i = ⟨d_1^i, ..., d_|A|^i⟩)

We define action time as a vector where

d_j^i = (Σ_{k ∈ a_j} d'_k) / h_j^i  if a_j ∈ s_i, and 0 otherwise   (7)

d_j^i is the dwell time for action a_j in s_i, which is equal to the total amount of time spent on the webpages inside a_j. In cases where a_j occurs more than once, we divide d_j^i by the action frequency h_j^i to calculate the average dwell time.
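The two definitions can be computed from a session's (action, dwell time) pairs as in this sketch; the action names are hypothetical.

    from collections import Counter

    def action_vectors(session, all_actions):
        """Sketch of Definitions 2-3: build the action-frequency (aF)
        and action-time (aT) vectors for one session. `session` is a
        list of (action, dwell_time) pairs from the earlier steps."""
        freq = Counter(a for a, _ in session)
        total_time = Counter()
        for a, dt in session:
            total_time[a] += dt
        aF = [freq.get(a, 0) for a in all_actions]
        # Average dwell time per action (total time / frequency), 0 if absent.
        aT = [total_time[a] / freq[a] if freq.get(a) else 0.0
              for a in all_actions]
        return aF, aT

    actions = ["register", "post_topic", "reply"]
    session = [("register", 40.0), ("post_topic", 12.0), ("post_topic", 8.0)]
    print(action_vectors(session, actions))
    # ([1, 2, 0], [40.0, 10.0, 0.0])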

Classification We employ a Support Vector Machine (SVM) as our machine learning classifier. SVM is a machine learning algorithm designed to be robust for classification, especially binary classification [16]. SVM is trained on n data points {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where each feature vector x_i comes with a class label y_i. As mentioned in the previous section, there are two classes, {human, spambot}, in spambot detection, to which we assign the numerical values -1 and +1 respectively. SVM then tries to find an optimum hyperplane that separates the two classes while maximising the margin between them. The decision function on a new data point x is defined as φ(x) = sgn(⟨w, x⟩ + b), where w is the weight vector and b is the bias term.
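A minimal sketch of this decision function, with toy weights; in practice w and b would come from training LIBSVM on the aT or aF vectors.

    import numpy as np

    def phi(w, b, x):
        """Decision function phi(x) = sgn(<w, x> + b); +1 = spambot,
        -1 = human, per the label convention stated above."""
        return 1 if np.dot(w, x) + b >= 0 else -1

    # Toy weights (illustrative only) that reward repeated posting:
    w = np.array([0.9, -0.3, 0.4])
    print(phi(w, b=-0.2, x=np.array([2.0, 1.5, 0.0])))   # 1 (spambot)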

3.3 Performance Measurement

We utilised the F-score to measure the performance of our classification results [17]. The F-score is defined in Eq. 8:

F = 2 × (Recall × Precision) / (Recall + Precision)   (8)

where

Recall = Detected Spambots / Spambots

Precision = Detected Spambots / (Spambots + Humans)
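A sketch of Eq. 8; reading the precision denominator as "everything flagged as a spambot" is our interpretation of the formula above.

    def f_score(detected_true, all_spambots, detected_total):
        """F-score per Eq. 8: harmonic mean of recall and precision."""
        recall = detected_true / all_spambots
        precision = detected_true / detected_total
        return 2 * recall * precision / (recall + precision)

    # 70 spambots correctly flagged out of 100, with 80 total flags:
    print(round(f_score(70, 100, 80), 3))   # 0.778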

The next section discusses the experimental results of our work.

4 Experimental Results

4.1 Data Set

We collected our spambot data from our previous work [4] over a period of a month. We combined this data with human data collected from an online forum with the same configuration as the spambot host. We removed domain-specific information from both datasets and then combined them into a single dataset for experimentation. Table 1 illustrates a summary of our collected data.

8 Pedram Hayati, Kevin Chai, Vidyasagar Potdar, Alex Talevski

Table 1. Summary of collected data

Data                   Frequency
# of human records     5555
# of spambot records   11039
# of total sessions    4227
# of actions           34

In the feature measurement module we identified 34 individual actions. We extract action time and action frequency from our dataset and use them separately in our classifier.

4.2 Results

We ran 2 experiments on our dataset, one for each feature set. We achieved an average accuracy of 94.70%, ranging from 93.18% for action time to 96.23% for action frequency. Table 2 and Table 3 summarise the results from each experiment, along with the ratio of true positives (TP) (the number of correctly classified spambots) and false positives (FP) (the number of incorrectly classified human users).

Table 2. Summary of experimental results on the action time (aT) feature set

C        TP     FP     Precision  Recall  F
c_h      0.976  0.399  0.948      0.976   0.962
c_s      0.601  0.024  0.774      0.601   0.676
Average  0.932  0.355  0.927      0.932   0.928

Table 3. Summary of experimental results on the action frequency (aF) feature set

C        TP     FP     Precision  Recall  F
c_h      0.998  0.299  0.961      0.998   0.979
c_s      0.701  0.002  0.975      0.701   0.815
Average  0.962  0.264  0.963      0.962   0.960

It is clear that action frequency is a slightly better feature for classifying spambots from human users. Spambots tend to repeat certain tasks more often, whereas humans perform a larger variety of tasks rather than focusing on specific ones. The results of our work show that action time and action frequency are good features for spambot detection and therefore for Spam 2.0 prevention.

5 Related Works

There has been extensive research focused on spam management and spam filtering. However, there has been little work dedicated to Spam 2.0 and spambot detection. In web robot detection, Tan et al. [3] propose a framework to detect unseen and camouflaged web robots. They use navigation patterns, session length and width, as well as the depth of webpage coverage, to detect web robots. Park et al. [7] present a malicious web robot detection method based on HTTP headers and mouse movement. However, none of these works has studied spambots in Web 2.0 applications. Yiqun et al. [18] and Yu et al. [19] utilise user web access logs to classify web spam from legitimate webpages. However, the focus of their work is different from ours, as they rely on user web access logs as a trusted source for web spam classification. In this work we show that web usage logs can be obtained from both humans and spambots, and such a distinction should be made. In our previous work on HoneySpam 2.0 [4], we proposed a web tracking system to track spambot data. The dataset collected in HoneySpam 2.0 is used in this work.

6 Conclusion

To the best of our knowledge, our research from [4] and this paper is the first work focused on spambot detection, while conventional research has focused on spam content detection. In this paper, we extended our previous work on spambot detection in Web 2.0 platforms by evaluating two new feature sets known as action time and action frequency. These feature sets offer a new perspective in examining web usage data collected from both spambots and human users. Our proposed framework was validated against an online forum, achieved an average accuracy of 94.70%, and its performance was evaluated using the F-score. Future work will focus on evaluating more feature sets and combinations of feature sets, decreasing the ratio of false positives, and extending our work to other Web 2.0 platforms to classify spambots from human users.

References

[1] Z. Gyongyi and H. Garcia-Molina, "Web spam taxonomy," in Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan, 2005.
[2] P. Hayati and V. Potdar, "Toward Spam 2.0: An Evaluation of Web 2.0 Anti-Spam Methods," in 7th IEEE International Conference on Industrial Informatics, Cardiff, Wales, 2009.
[3] P.-N. Tan and V. Kumar, "Discovery of Web Robot Sessions Based on their Navigational Patterns," Data Mining and Knowledge Discovery, vol. 6, pp. 9-35, 2002.
[4] P. Hayati, K. Chai, V. Potdar, and A. Talevski, "HoneySpam 2.0: Profiling Web Spambot Behaviour," in 12th International Conference on Principles of Practice in Multi-Agent Systems, Nagoya, Japan, 2009, pp. 335-344.


[5] L. von Ahn, M. Blum, N. Hopper, and J. Langford, "CAPTCHA: Using Hard AI Problems for Security," in Advances in Cryptology — EUROCRYPT 2003, 2003, pp. 646-646.
[6] D. Mertz, "Charming Python: Beat spam using hashcash," [Accessed: 3 Aug 09] http://www.ibm.com/developerworks/linux/library/l-hashcash.html, 2004.
[7] K. Park, V. S. Pai, K.-W. Lee, and S. Calo, "Securing Web Service by Automatic Robot Detection," USENIX 2006 Annual Technical Conference, 2006.
[8] T. Uemura, D. Ikeda, and H. Arimura, "Unsupervised Spam Detection by Document Complexity Estimation," in Discovery Science, 2008, pp. 319-331.
[9] S. Sarafijanovic and J.-Y. Le Boudec, "Artificial Immune System for Collaborative Spam Filtering," in Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), 2008, pp. 39-51.
[10] U. Ogbuji, "Real Web 2.0: Battling Web spam," [Accessed: 3 Aug 09] http://www.ibm.com/developerworks/web/library/wa-realweb10/, 2008.
[11] A. Hindle, M. W. Godfrey, and R. C. Holt, "Reverse Engineering CAPTCHAs," in Proceedings of the 2008 15th Working Conference on Reverse Engineering, IEEE Computer Society, 2008.
[12] S. J. Stolfo, S. Hershkop, C.-W. Hu, W.-J. Li, O. Nimeskern, and K. Wang, "Behavior-based modeling and its application to Email analysis," ACM Trans. Internet Technol., vol. 6, pp. 187-221, 2006.
[13] L. Zhang, J. Zhu, and T. Yao, "An evaluation of statistical spam filtering techniques," ACM Transactions on Asian Language Information Processing (TALIP), vol. 3, pp. 243-269, 2004.
[14] R. Cooley, B. Mobasher, and J. Srivastava, "Data Preparation for Mining World Wide Web Browsing Patterns," Knowledge and Information Systems, vol. 1, pp. 5-32, 1999.
[15] R. Cooley, B. Mobasher, and J. Srivastava, "Web mining: information and pattern discovery on the World Wide Web," in Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence, 1997, pp. 558-567.
[16] C. Chang and C. Lin, "LIBSVM: a library for support vector machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[17] C. J. van Rijsbergen, Information Retrieval, Butterworths, 1979.
[18] Y. Liu, R. Cen, M. Zhang, S. Ma, and L. Ru, "Identifying web spam with user behavior analysis," in Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, Beijing, China: ACM, 2008.
[19] H. Yu, Y. Liu, M. Zhang, L. Ru, and S. Ma, "Web Spam Identification with User Browsing Graph," in Information Retrieval Technology, 2009, pp. 38-49.

HoneySpam 2.0: Profiling Web Spambot Behaviour

Pedram Hayati*, Kevin Chai, Vidyasagar Potdar, and Alex Talevski

Digital Ecosystem and Business Intelligence Institute, Curtin University, Perth, Western Australia {Pedram.Hayati,Kevin.Chai}@postgrad.curtin.edu.au {v.potdar,a.talevski}@curtin.edu.au

Abstract. Internet bots have been widely used for various beneficial and malicious activities on the web. In this paper we provide new insights into a new kind of bot, termed a web spambot, which is primarily used for spreading spam content on the web. To gain insights into web spambots, we developed a tool (HoneySpam 2.0) to track their behaviour. This paper presents two main contributions: firstly, it describes the design of HoneySpam 2.0; secondly, we outline the experimental results that characterise web spambot behaviour. By profiling web spambots, we provide the foundation for identifying such bots and for preventing and filtering web spam content.

Keywords: honeyspam, web spambot, web robot detection, spam detection, web usage mining.

1 Introduction

Web content that provides false or unsolicited information is defined as Web Spam [1]. Spam content is prolific on the web due to the development and widespread adoption of Web 2.0, as spammers no longer need to purchase, host and promote their own domains. Spammers can now use legitimate websites to host spam content, such as fake eye-catching profiles on social networking websites, fake promotional reviews, responses to threads in online forums with unsolicited content, manipulated wiki pages, etc. This spamming technique, which differs from previous web spamming techniques, is referred to as Web 2.0 Spam or Spam 2.0 [2]. Figure 1 illustrates Spam 2.0 and its relation to other spamming campaigns.

Little is known about the amount of Spam 2.0 on the Internet. At the time of writing, Live Spam Zeitgeist had reported an increase of over 50% in the amount of spam messages (e.g. spam comments in blogging tools) since 2008 [3]. This report showed a dramatic increase in the number of successful Spam 2.0 attacks and the inadequacy of existing countermeasure tools [2].

In this paper, we study Spam 2.0 and the behaviour of web spambots. Web spambots are a new kind of Internet robot primarily used for spreading spam content on the web. To gain insights into this kind of bot, we have developed a tool (HoneySpam 2.0) to track their behaviour.

* Pedram Hayati is a PhD student at Curtin University.

J.-J. Yang et al. (Eds.): PRIMA 2009, LNAI 5925, pp. 335–344, 2009. © Springer-Verlag Berlin Heidelberg 2009

Fig. 1. Spam 2.0 among other spamming campaigns

HoneySpam 2.0 draws on the idea of honeypots (a technique for tracking cyber attackers). It implicitly tracks web usage data, including detailed monitoring of click streams, page navigation, form usage, mouse movements, keyboard actions and page scrolling. Our results illustrate that HoneySpam 2.0 is effective in profiling web spambots. HoneySpam 2.0 is a tracking solution that can be integrated into any web application, ranging from blogging tools to social networking applications. To the best of our knowledge, there is no existing literature that studies web spambots from a web usage perspective. We believe that this work is the first step in identifying the attributes of web spambots and provides a roadmap for future research.

This paper is organised as follows: Section 2 provides a taxonomy of Spam 2.0 and web spambots, Section 3 explains the HoneySpam 2.0 architecture, Section 4 discusses our experimental results, and Section 5 reviews previous work. This is followed by conclusions and future work in Section 6.

2 Taxonomy of Spam 2.0 and Web Spambots

Along with its web applications, Web 2.0 provides a collaboration platform for flexible web content management. Spam 2.0 has existed since Web 2.0 concepts first appeared online, as spammers misuse this functionality to distribute spam content. The act of spamming can be carried out through two channels: manual and automated. The former refers to using human operations, such as hiring cheap labour to manually distribute spam content [4] (a sample job advertisement for spam content publishing can be found in [5]). The latter is achieved by exploiting software or scripts, which can be in the form of client-side software or web robots.

Web or Internet robots are programming scripts or automated agents designed for specific tasks, such as crawling webpages and parsing URLs, without human interaction [6]. Unfortunately, web robots can be exploited for malicious activities such as checking web servers for vulnerabilities, harvesting email addresses from webpages, and performing Denial-of-Service (DoS) attacks [7]. Recently, some web robots have been used for distributing spam content on web applications by posting promotional comments, placing ads in online forums, submitting links through trackbacks, etc. [2]. We call this type of web robot a web spambot. Web spambots can be designed to be application-specific, targeting particular web applications such as the Wordpress blogging tool, phpBB or MediaWiki, or website-specific, infecting websites such as Amazon, MySpace and Facebook.

It should be noted that web spambots are different from spambots. According to [7], spambots are Internet robots that crawl webpages to harvest email addresses in order to send spam emails. Web spambots, however, act quite differently: they can crawl the web or discover new victims via search engines, register new user accounts, and submit spam content in Web 2.0 applications. Their main task is to distribute spam content onto the web. They can reside on different machines (e.g. infected user computers, free or commercial web hosting sites), and can be programmed once and used many times. The motivations behind such spamming activities are [2]:

• Deceiving search engines into ranking spam and junk content higher. For instance, the more spam content is added to different websites linking back to the spammer's website, the higher that website will rank.
• Misleading users into viewing unsolicited content such as webpages, link farms, ad-hosted websites, phishing websites, etc.

By employing automated web spambots, spammers are able to drive more traffic to their campaigns. This highlights the importance of web spambot detection as a possible way to eliminate Spam 2.0. In the next section we discuss current techniques to discover web spambots, along with our proposed solution.

2.1 Web Spambot Countermeasure Techniques

One of the most popular countermeasures for differentiating web robots (including web spambots) from humans is the Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) [8]. It is a type of challenge-response test, usually in the form of a distorted image of numbers and letters, which humans have to infer and type into a web form. However, CAPTCHA is inconvenient for users: it wastes their time, causes distraction, and is unpleasant and at times intimidating. A number of recent works have reported approaches to defeat CAPTCHAs automatically using computer programs [9-11]. CAPTCHA's drawbacks include the following:

1. It decreases user convenience and increases the complexity of human-computer interaction.
2. As programs become better at deciphering CAPTCHAs, the images may become too difficult for humans to decipher.
3. As computers become more powerful, they will eventually decipher CAPTCHAs better than humans.

Therefore, CAPTCHA is only a short-term solution that is nearing the end of its useful life. Web spambots armed with anti-CAPTCHA tools are able to bypass this restriction and spread spam content easily, while the user inconvenience remains to no benefit.

Other countermeasures for combating automated requests include Hashcash, nonces and form variation [12, 13]. The idea behind Hashcash is that the sender has to calculate a stamp for the submitted content [13]. Calculating the stamp is difficult and time-consuming for the sender, but comparatively cheap and fast for the receiver to verify (a minimal sketch of this idea is given at the end of this section). Although Hashcash cannot stop automated spamming, it can slow down the spamming process. Nonce and form-variation techniques, on the other hand, can only guarantee that users submit content through the site's own forms rather than posting directly. We believe that one effective way to detect web spambots is to examine their behaviour through page navigation, click-streams, mouse interactions, keyboard actions and form filling; we put forward the notion that their web usage behaviour is intrinsically different from that of humans. The focus of this work is on profiling web spambot behaviour, and we propose HoneySpam 2.0 to detect, track and monitor web spambots. We explain the proposed method in the following sections.
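To make the sender/receiver asymmetry of Hashcash concrete, the following is a minimal Python sketch of the idea. It is illustrative only and does not reproduce the exact stamp format of the Hashcash scheme described in [13]: the sender brute-forces a counter until the SHA-1 hash of the stamp begins with a chosen number of zero bits, while the receiver verifies any stamp with a single hash.

import hashlib
from datetime import date

def leading_zero_bits(digest):
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def mint_stamp(resource, difficulty=20):
    # Expensive for the sender: about 2**difficulty hashes on average.
    prefix = f"1:{difficulty}:{date.today():%y%m%d}:{resource}::"
    counter = 0
    while True:
        stamp = f"{prefix}{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= difficulty:
            return stamp
        counter += 1

def verify_stamp(stamp, resource, difficulty=20):
    # Cheap for the receiver: one hash plus a resource check.
    return (f":{resource}::" in stamp and
            leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= difficulty)

At a difficulty of 20 bits, minting costs roughly a million SHA-1 evaluations on average while verification always costs one, so raising the difficulty scales the sender's cost exponentially without affecting the receiver.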

3 HoneySpam 2.0 Architecture

HoneySpam 2.0 is a stand-alone tool designed to be integrated with any web application, with the aim of tracking and analysing web spambot behaviour. It is based on the idea of a honeypot, where a vulnerable server is deployed to attract cyber attackers in order to study their behaviour; in the spamming scenario, the web spambots are the attackers under study. HoneySpam 2.0 consists of two main components: the Navigation Tracking Component (NTC) and the Form Tracking Component (FTC). Figure 2 illustrates the detailed Model-View-Controller (MVC) architecture of HoneySpam 2.0 inside a web application.

3.1 Navigation Tracking Component

This component is designed to track the page navigation patterns and header information of incoming traffic. The following information is captured: the time of the request, the IP address, the browser identity and the referrer URL.

Fig. 2. HoneySpam 2.0 Architecture


NTC is also designed to automatically differentiate individual visit sessions. Session information is captured to identify a user's web usage behaviour during each visit. NTC stores the captured data, along with a session identity (session id), in the tracking database, which is used during the analysis and profiling stage.
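As an illustration of what such tracking might look like, the sketch below records one NTC-style entry per request and assigns session ids by inactivity timeout. The field names, the 30-minute timeout and the in-memory store are our own illustrative assumptions, not the actual HoneySpam 2.0 schema.

import time
import uuid

SESSION_TIMEOUT = 30 * 60   # assumed inactivity cut-off, in seconds
_sessions = {}              # (ip, browser) -> (session_id, last_seen)
tracking_db = []            # stand-in for the tracking database

def track_request(ip, browser, referrer, url, username="guest"):
    """Log one request; start a new session after a period of inactivity."""
    now = time.time()
    key = (ip, browser)
    session_id, last_seen = _sessions.get(key, (None, 0.0))
    if session_id is None or now - last_seen > SESSION_TIMEOUT:
        session_id = uuid.uuid4().hex   # new visit session
    _sessions[key] = (session_id, now)
    tracking_db.append({
        "time": now,             # time of the request
        "ip": ip,                # IP address
        "browser": browser,      # browser identity (user agent)
        "referrer": referrer,    # referrer URL
        "url": url,              # requested page
        "session_id": session_id,
        "username": username,    # registered user, or "guest"
    })
    return session_id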

3.2 Form Tracking Component

In addition to the NTC, we developed functionality to provide insights into how web spambots use web forms. It is commonly believed that web spambots do not interact with web forms at all; our study puts this belief to the test. The form tracking component specifically studies form input behaviour, capturing and storing the following form-related actions:

⇒ Mouse clicks and movements
⇒ Keyboard actions
⇒ Form field focus and un-focus
⇒ Form load and submission
⇒ Page navigation

For each of the above-mentioned actions we also captured header information, e.g. IP address, browser identity, referrer link, session id, etc. We store header information to ensure that form action events originate from the same session with the same header information. Additionally, both components check whether a request is initiated by a registered user or by a visitor. If it is initiated by a registered user, the username is stored along with the other web usage information for that request. This was done to differentiate the usage data of registered web users from that of visitors, which can be used for further analysis. Visitors are recorded as guests in our database; note that most web applications require the registration of a valid username before allowing further interaction with the system.
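A corresponding sketch of an FTC record is shown below, following the same conventions as the NTC sketch above; the event names and in-memory store are again illustrative assumptions. Each client-reported form action is stored with the same header fields so that it can be matched to an NTC session.

import time

FORM_ACTIONS = {"mouse_click", "mouse_move", "key_action",
                "field_focus", "field_unfocus", "form_load",
                "form_submit", "page_navigation"}
form_events_db = []   # stand-in for the form-tracking table

def track_form_action(action, field, ip, browser, referrer,
                      session_id, username="guest"):
    """Store one form action together with its header information."""
    if action not in FORM_ACTIONS:
        raise ValueError(f"unknown form action: {action}")
    form_events_db.append({
        "time": time.time(), "action": action, "field": field,
        "ip": ip, "browser": browser, "referrer": referrer,
        "session_id": session_id, "username": username,
    })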

3.3 Deploying HoneySpam 2.0

Deploying HoneySpam 2.0 involves implementing both client-side and server-side code. The client-side code assigns FTC actions to web forms, so that once a user triggers an action it is captured, transmitted to the FTC server-side code and recorded in the database. On the server side, the NTC tracks users' navigation behaviour based on incoming traffic requests. In the next section, we describe our experimental setting, outline where we integrated HoneySpam 2.0, and discuss our results and key observations.

4 Experimental Results

The HoneySpam 2.0 tool was deployed on one commercial and five free web hosting servers. The web applications provided on these servers included online discussion forums and social bookmarking tools. We did not add any initial content to these applications, and we placed a message (in the form of an image) notifying human users not to use them. Additionally, as described in Section 4.5, our FTC component did not capture any form activity records, which gives us confidence that there was no human activity inside our hosted campaigns: all of the content added to our web applications was added by web spambots. The data presented in this paper was collected over a period of 60 days¹ and is summarised in Table 1.

Table 1. Summary of data for HoneySpam 2.0, showing the total number of bots, total number of posts, average posts per day, etc.

No. of tracking records: 11481
No. of web spambot tracking records: 6794
Total no. of registered bots: 643
Total no. of unique IP addresses used: 314
Total no. of posts: 617
Average posts per day: 8.27
Average registered web spambots per day: 7.91
Average online web spambots per day: 2.72

In terms of content, 60% of the spam was adult-related, 13% consisted of an assorted category of content (e.g. free software, keyword spam), and 9% was related to drugs (pharmaceuticals) and movies (Fig. 3a). Demographically, 51% of the observed web spambots originated from the United Kingdom, followed by the United States at 22% (Fig. 3b). However, it is quite possible that spammers run web spambots on hijacked machines, so these figures do not necessarily reveal the true origin of the spammers. The web browsers most commonly reported by the web spambots were Internet Explorer and Opera (Fig. 3c). Surprisingly, two relatively unknown browsers, OffByOne and BrowseX, were also used; Firefox was used by only 2 web spambots and Safari was not used at all. It should be noted that this header information can be modified and is not a true indication of browser usage, or of the use of an actual browser at all. A number of specific observations emerged from examining the behaviour of the web spambots; these are discussed below.

Fig. 3. Percentage of top content categories (a), countries (b) and internet browsers (c)

¹ The HoneySpam 2.0 tool tracked spambot behaviour from 25 May 2009 to 9 August 2009. However, only 60 days of data were collected, due to server downtime caused by maintenance and other issues.

4.1 Use of Search Engines to Find Target Websites

Of the six monitored web hosts, the commercial server received the most attention from web spambots. This highlights the possibility that web spambots use search engine intelligence (e.g. Google PageRank scores) to identify and focus on higher-ranked websites that potentially yield more visitor traffic. The initial 15 days of hosting and tracking did not yield much activity, but a number of web spambots began registering and posting content after this period (Fig. 4a). The reason could be related to search engine indexing, which again suggests that some web spambots find their targets through search engine results.

4.2 Create Numerous User Accounts

Our experiment showed that web spambots create a number of new accounts to post their content rather than reusing their existing accounts. This is shown in Table 1, where the total number of web spambot accounts is 643 but only 314 unique IP addresses are used across these accounts. On average, 7.91 new accounts were created per day within our dataset. It is possible that web spambots adopt this behaviour for additional robustness: it delays account banning (a single user account associated with numerous spam content items is easier to detect and manage), and/or it prepares for the event in which their previous accounts are banned.

4.3 Low Webpage Hits and Revisit Rates

The observed web spambots adopt similar surfing patterns to one another and commonly visit a number of specific web pages. As illustrated in Fig. 4b, most web spambots performed fewer than 20 page hits in total during their account lifetime. Fig. 5b supports the claim that web spambots rarely return: 554 web spambots visited the websites only once, and a further 75 bots revisited fewer than 5 times. It is likely that these bots revert to creating new user accounts to continue their spamming activities, as discussed in Section 4.2.

Fig. 4. Number of posts, registered and online web spambots (a); frequency of web spambot hits on each webpage (b)

Fig. 5. Frequency of time (sec) spent by web spambots in each session (a); frequency of return visits by web spambots, where 1 means a single visit (b)

4.4 Distribute Spam Content in a Short Period of Time

Most web spambots spend less than 5 seconds on each visit (session). This indicates that web spambots distribute spam content in a short amount of time, possibly in order to move on to other websites and distribute more spam (Fig. 5a). Additionally, their average online activity time was 2.71 seconds.

4.5 No Web Form Interaction

While not shown in the figures, the form tracking component of the HoneySpam 2.0 tool was unable to record any form usage behaviour by web spambots. These bots adopt the conventional method of posting directly through their own forms / interfaces to submit content. We suspect they have not been designed to mimic the behaviour of a human submitting a web form because there has so far been no need to do so. Therefore, we believe that simple form usage heuristics can lead to web spambot detection and filtering, as sketched below.
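A minimal sketch of such a heuristic is given below, combining the form-usage signal with the session observations from Sections 4.3 and 4.4. The thresholds are illustrative values read off our figures, not a tuned filter evaluated in this work.

from dataclasses import dataclass

@dataclass
class SessionProfile:
    page_hits: int           # NTC requests in this session
    duration_seconds: float  # time from first to last request
    form_events: int         # FTC actions recorded for the session
    return_visits: int       # earlier sessions seen for this account

def looks_like_spambot(s):
    """Flag the profile observed in Sections 4.3-4.5: no form
    interaction plus a short, shallow, rarely-returning session."""
    no_form_interaction = s.form_events == 0
    short_and_shallow = s.duration_seconds < 5 and s.page_hits < 20
    rarely_returns = s.return_visits <= 1
    return no_form_interaction and (short_and_shallow or rarely_returns)

# e.g. looks_like_spambot(SessionProfile(3, 2.4, 0, 0)) -> True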

4.6 Generated Usernames

It is apparent that web spambots generate usernames in order to create numerous accounts. Examples of these usernames include Oxidylozaxygixoc, Ufigowigelyniwyl and Ynutaznazabalekil. Pattern analysis of these randomly generated usernames could serve as a means of identifying and blocking web spambots, which we will investigate in the future; a toy example of such a check follows.
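As a toy illustration of the kind of pattern analysis we have in mind (explicitly not a method evaluated in this work), the observed names alternate vowels and consonants with unusual regularity, which a simple score can expose:

VOWELS = set("aeiou")

def alternation_ratio(name):
    """Fraction of adjacent letter pairs that switch between vowel and
    consonant; rigidly generated names tend to score high."""
    letters = [c for c in name.lower() if c.isalpha()]
    if len(letters) < 2:
        return 0.0
    switches = sum((a in VOWELS) != (b in VOWELS)
                   for a, b in zip(letters, letters[1:]))
    return switches / (len(letters) - 1)

# alternation_ratio("Oxidylozaxygixoc") -> 0.8; a threshold on this
# score, combined with name length, could feed a registration filter.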

5 Related Work

The study of web robots has been thoroughly investigated by the research community, since bots consume network bandwidth and make web usage mining difficult. Conventional methods for detecting web robots are based on identifying IP addresses and user agents [14]. However, unknown and camouflaged web robots cannot be detected by these techniques.

In web usage mining, Tan and Kumar [6] proposed a model to discover web robots based on navigational patterns in click-stream data. They showed that features such as the request method, the length of the session, and the depth and width of webpage coverage differ between web robots and humans. They did not explicitly focus on web spambots, as the aim of their work was to detect web robots such as search engine crawlers, offline browsers, link checkers and email collectors. Our work, in contrast, concentrates on profiling web spambots, which differ from the types of web robots covered by Tan and Kumar.

Park et al. (2006) proposed a technique to detect malicious web robots by looking at the presence of HTTP requests for CSS and JavaScript files, along with mouse movement activity on web pages [7]. Their focus was on camouflaged web robots normally used for security attacks, harvesting email addresses, etc.; like the previous work, it was not on identifying web spambots.

In the area of spam detection, Webb et al. (2008) proposed a social honeypot for the detection of spammers on social networking websites [15]. They hosted 51 social honeypots on a well-known social networking website, and their results show that the temporal and geographic patterns of social spam have unique characteristics. Andreolini et al. (2005) proposed a method to slow down the email harvesting process and to poison spammers' email databases [16]; additionally, they proposed the creation of fake open proxies to increase the probability of email spam detection.

6 Conclusion and Future Work

In this paper we detailed the design and implementation of HoneySpam 2.0 for detecting, monitoring and analysing web spambots. HoneySpam 2.0 is based on the idea of honeypots, used here to monitor and characterise web spambots: a type of web robot programmed to surf the web and distribute spam content. We integrated HoneySpam 2.0 into popular open source web applications for a period of 60 days to monitor web spambot behaviour. Through our experiments, we discovered that web spambots use search engines to find target websites, create numerous user accounts, distribute spam content in a short amount of time, rarely revisit a website, do not interact with forms on the website, and register with randomly generated usernames. By profiling web spambots, we provide the foundation for identifying such bots and for filtering and removing their spam content. While this paper focused on analysing web spambots as a way to combat spam, in future work we will adopt clustering algorithms to detect web spambots on the fly. Additionally, artificial neural networks and other analytical techniques can be used to gain a better understanding of the data collected from web spambots.

References

[1] Gyongyi, Z., Garcia-Molina, H.: Web spam taxonomy. In: Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan (2005)
[2] Hayati, P., Potdar, V.: Toward Spam 2.0: An Evaluation of Web 2.0 Anti-Spam Methods. In: 7th IEEE International Conference on Industrial Informatics, Cardiff, Wales (2009)
[3] Live Spam Zeitgeist: Comment Spam. Akismet (2009), http://akismet.com/stats/
[4] Cobb, S.: The Economics of Spam. EPrivacyGroup (2003), http://www.eprivacygroup.com
[5] Workathome: Work from home online ad placing work pay per posting (2009), http://www.workathomeforum.in/online-adplacing-homejob.htm
[6] Tan, P.-N., Kumar, V.: Discovery of Web Robot Sessions Based on their Navigational Patterns. Data Mining and Knowledge Discovery 6, 9–35 (2002)
[7] Park, K., Pai, V.S., Lee, K.-W., Calo, S.: Securing Web Service by Automatic Robot Detection. In: USENIX 2006 Annual Technical Conference, Refereed Paper (2006)
[8] Chellapilla, K., Simard, P.: Using Machine Learning to Break Visual Human Interaction Proofs (HIPs). In: NIPS (2004)
[9] Hindle, A., Godfrey, M.W., Holt, R.C.: Reverse Engineering CAPTCHAs. In: Proceedings of the 2008 15th Working Conference on Reverse Engineering. IEEE Computer Society, Los Alamitos (2008)
[10] Mori, G., Malik, J.: Recognizing objects in adversarial clutter: breaking a visual CAPTCHA. In: Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. I-134–I-141 (2003)
[11] Baird, H.S., Bentley, J.L.: Implicit CAPTCHAs. In: Proceedings of the SPIE/IS&T Conference on Document Recognition and Retrieval XII (DR&R 2005), San Jose, CA (2005)
[12] Ogbuji, U.: Real Web 2.0: Battling Web spam (2008), http://www.ibm.com/developerworks/web/library/wa-realweb10/
[13] Mertz, D.: Charming Python: Beat spam using hashcash (2004), http://www.ibm.com/developerworks/linux/library/l-hashcash.html
[14] Cooley, R., Mobasher, B., Srivastava, J.: Web mining: information and pattern discovery on the World Wide Web. In: Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence, pp. 558–567 (1997)
[15] Webb, S., Caverlee, J., Pu, C.: Social Honeypots: Making Friends with a Spammer Near You. In: Proceedings of the Fifth Conference on Email and Anti-Spam (CEAS 2008), Mountain View, CA (2008)
[16] Andreolini, M., Bulgarelli, A., Colajanni, M., Mazzoni, F.: HoneySpam: honeypots fighting spam at the source. In: Proceedings of the Steps to Reducing Unwanted Traffic on the Internet Workshop (SRUTI 2005), Cambridge, MA, p. 11 (2005)