A Social-Spam Detection Framework

A Social-Spam Detection Framework De Wang, Danesh Irani, and Calton Pu Georgia Institute of Technology Atlanta, Georgia 30332-0765 {wang6, danesh, calton}@cc.gatech.edu ABSTRACT 100 million users a month, they are becoming an important Social networks such as Facebook, MySpace, and Twitter medium of communication. This in turn attracts spammers have become increasingly important for reaching millions to the social networks as well, causing an increase in the of users. Consequently, spammers are increasing using such incidence of social spam. networks for propagating spam. Existing filtering techniques Social spam is low-quality information on social networks such as collaborative filters and behavioral analysis filters that is similar to email spam in that it is unsolicited bulk are able to significantly reduce spam, each social network messages that users do not ask for or specifically subscribe needs to build its own independent spam filter and support to. Such spam, is a nuisance to people and hinders them a spam team to keep spam prevention techniques current. from consuming information that is pertinent to them or We propose a framework for spam detection which can be that they are looking for. Individual social networks are used across all social network sites. There are numerous capable of filtering a significant amount of the spam they benefits of the framework including: 1) new spam detected receive, although they usually require large amounts of re- on one social network, can quickly be identified across social sources (e.g, personnel) and incur a delay before detecting networks; 2) accuracy of spam detection will improve with new types of spam. a large amount of data from across social networks; 3) other We propose a social spam detection framework which can techniques (such as blacklists and message shingling) can be be used by any social network to detect spam. The frame- integrated and centralized; 4) new social networks can plug work will avoid the need for each social network to build into the system easily, preventing spam at an early stage. We their own spam detection mechanism and hire anti-spam provide an experimental study of real datasets from social personnel. Using this framework, once a new type of spam networks to demonstrate the flexibility and feasibility of our is detected on one network, it can automatically be identi- framework. fied on the other networks as well. In addition, new social networks can quickly protect their users from social spam by using the framework. Categories and Subject Descriptors The social-spam detection framework can be split into H.3.3 [Information Systems]: Information Search and Re- three main components. Figure 1 shows an overview of the trieval - Information filtering; I.5.2 [Pattern Recognition]: system and we provide a brief explanation for each part here: Design Methodology - Classifier design and evaluation 1) Mapping and Assembly: Mapping techniques are used to convert a social network specific object into a framework- General Terms defined standard model for the object (e.g., profile model, message model, or webpage model). If associated objects Algorithm, Experimentation can be fetched based on this object, it is assembled here; 2) Pre-filtering: Fast-path techniques (e.g., blacklists, hash- Keywords ing, and similarity matching) are used to check incoming objects against known spam objects; 3) Classification: Su- Social-spam, Framework, Detection, Classification pervised machine learning techniques are used to classify the incoming object and associated objects. We use a Bayesian 1. INTRODUCTION technique to combine the classification results into spam or Social networks are growing at an alarming rate, with net- non-spam. works like Facebook and Twitter attracting audiences of over More concretely, we make the following contributions: • Build a social-spam detection framework to filter spam on multiple social networks. We build the three main Permission to make digital or hard copies of all or part of this work for components of the system and demonstrate the use of personal or classroom use is granted without fee provided that copies are the system on data from Twitter, MySpace, and the not made or distributed for profit or commercial advantage and that copies WebbSpamCorpus. bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific • Demonstrate cross social-corpora classification and permission and/or a fee. measure the feasibility of doing so. Namely, we show CEAS 2011 - Eighth annual Collaboration, Electronic messaging, Anti- Abuse and Spam Conference Sept. 1-2, 2011, Perth, Australia that we can build a classifier for a particular model Copyright 2011 ACM 978-1-4503-0788-8/11/09 ...$10.00. on one social network and apply it to another social network. After doing so, we use existing datasets to work [19]. Most of these social networks use collaborative approximate the accuracy of this technique. filtering (where users report objects that are spammy) and behavioral analysis (where logs of interactions are used to • Demonstrate associative classification or classification detect spamming patterns) to detect spam on their network. in which the results depend not only on the object Such dynamic methods may be eventually able to detect so- being classified, but also on objects associated with cial spam, but require a non-trivial amount of lag time to it. e.g., classification of a message object takes into accumulate sufficient evidence. account classification outcomes of the associated web- Apart from this, social networks will also employ classifi- page objects that may be linked inside the message. cation based techniques which use labeled training data to We also measure the feasibility of this technique. find similar occurrences of spam on the social network. Due to the evolving nature of spam [21, 15], these classification based techniques need to be retrained and adapted to newer spam [22]. Although techniques to propagate spam may vary from one social network to another, due to specificities of each social network, anecdotal evidence suggests that spam gen- erally fall into the category of pharmaceutical, pornographic, phishing, stocks, and business promotion campaigns. Bot- nets have already been shown to use templates to send vary- ing spam campaigns messages (similar to the campaigns pre- viously mentioned) [18] to different targets. With techniques to distribute a legitimate message across multiple social networks already implemented [1, 2], it is only a matter of time before botnets and spammers also employ such techniques. We propose a social spam detection framework that uses general models of a profile, message, and webpage model to perform classification across social networks. Misclassifica- tions or feedback via other methods can be used to update the classification models which apply across social networks. This will allow new types of spam detected on one social network to be detected across social networks, and also reduce the burden of the social network spam teams responsible for keeping the classifiers updated. Further, new social networks which do not have any spam detection solutions can use the social spam detection framework to protect their users from spammers. 3. SOCIAL-SPAM DETECTION FRAMEWORK In this section we present the social spam detection framework. An overview of the framework is shown in Figure 1 Figure 1: Overview of the spam detection frame- and we present the three main parts in the following subsec- work tions. The remainder of the paper is organized as follows. We 3.1 Mapping and Assembly motivate the problem further in Section 2. Section 3 pro- To build a framework that is social network agnostic, we vides the structure and implementation of the framework. have to create a standard model for the objects within the Section 4 describes the experimental setup used to evaluate social network. We define a model of an object as a schema our framework and Section 5 presents our experimental re- containing the most common attributes of the object across sults. We introduce related work in Section 6 and conclude social networks. Once a model is defined, we need to map the paper in Section 7. incoming objects from the social networking into objects of the model. We discuss both these steps in more detail below. 2. MOTIVATION With the rise of social networks as an important medium 3.1.1 Models of communication, spammers have increasingly targeted so- Our framework defines three models representing the most cial networks with spam. In most social networks, spammers important objects in social networks, namely: profile model, can send spam to other users in a number of ways, such as message model, and web page model. We omit other models, messages, friend requests, wall posts, tweets, and profiles. as they are not required to demonstrate the feasibility of the In most cases spammers can also include links to a website framework. where the user will take another action. The profile model we defined has 74 attributes and is de- Facebook, Twitter, MySpace, and other major social net- rived from the Google Open Social Person API [3]. The at- works employ dozens of people to fight spam on their net- tributes we selected cover attributes most commonly used in user profiles across websites like Facebook, MySpace, Twit- Table 1: The name mapping between Twitter pro- ter, and Flickr. files and our profile model. Empty mappings are The message model we defined has 15 attributes based on omitted. common attributes used in messages – such as“To”,“From”, Twitter Profile Profile Model “Timestamp”, “Subject”, and “Content”. We also include in Id Id the message model a few attributes which would be com- Name NickName mon for a social-network to have and also found in e-mail Location CurrentLocation messages, e.g., “Sender-IP”, and other header attributes.

A Social-Spam Detection Framework

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support