Collective Spammer Detection in Evolving Multi-Relational Social Networks
Total Page:16
File Type:pdf, Size:1020Kb
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/278968259 Collective Spammer Detection in Evolving Multi-Relational Social Networks Conference Paper · August 2015 DOI: 10.1145/2783258.2788606 CITATIONS READS 17 282 4 authors, including: Shobeir Fakhraei Lise Getoor University of Southern California University of Maryland, College Park 16 PUBLICATIONS 169 CITATIONS 238 PUBLICATIONS 9,126 CITATIONS SEE PROFILE SEE PROFILE Some of the authors of this publication are also working on these related projects: Medical Decision Making for individual patients View project All content following this page was uploaded by Shobeir Fakhraei on 26 June 2015. The user has requested enhancement of the downloaded file. Collective Spammer Detection in Evolving Multi-Relational Social Networks Shobeir Fakhraei James Foulds Madhusudana Shashanka ∗ y University of Maryland University of California Santa Cruz, CA, USA if(we) Inc. College Park, MD, USA San Francisco, CA, USA [email protected] [email protected] [email protected] Lise Getoor University of California Santa Cruz, CA, USA [email protected] ABSTRACT Keywords Detecting unsolicited content and the spammers who create Social Networks, Spam, Social Spam, Collective Classifica- it is a long-standing challenge that affects all of us on a daily tion, Graph Mining, Multi-relational Networks, Heteroge- basis. The recent growth of richly-structured social net- neous Networks, Sequence Mining, Tree-Augmented Naive works has provided new challenges and opportunities in the Bayes, k-grams, Hinge-loss Markov Random Fields (HL- spam detection landscape. Motivated by the Tagged.com1 MRFs), Probabilistic Soft Logic (PSL), Graphlab. social network, we develop methods to identify spammers in evolving multi-relational social networks. We model a so- cial network as a time-stamped multi-relational graph where 1. INTRODUCTION vertices represent users, and edges represent different ac- Unsolicited or inappropriate messages sent to a large num- tivities between them. To identify spammer accounts, our ber of recipients, known as \spam", can be used for various approach makes use of structural features, sequence mod- malicious purposes, including phishing and virus attacks, elling, and collective reasoning. We leverage relational se- marketing of objectionable materials and services, and com- quence information using k-gram features and probabilistic promising the reputation of a system. From printed ad- modelling with a mixture of Markov models. Furthermore, vertisements to unsolicited phone calls, spam has been a in order to perform collective reasoning and improve the perennial problem in modern human communication. With predictive power of a noisy abuse reporting system, we de- the emergence of the Internet, spammers have found a cost- velop a statistical relational model using hinge-loss Markov effective medium to reach a broader audience than was pre- random fields (HL-MRFs), a class of probabilistic graphical viously possible. Email spam is almost as old as the Internet models which are highly scalable. We use Graphlab Cre- TM 2 itself. The first email spam was sent in 1978 to all several ate and Probabilistic Soft Logic (PSL) to prototype and hundred users of ARPANET [1]. experimentally evaluate our solutions on internet-scale data More recently, social media has given spammers a new from Tagged.com. Our experiments demonstrate the effec- and effective medium to spread their content. Using social tiveness of our approach, and show that models which in- media platforms, spammers can disguise themselves as le- corporate the multi-relational nature of the social network gitimate users and engage in realistic looking interactions. significantly gain predictive performance over those that do They can use these platforms to send messages to users, not. leave spam comments on popular pages, and reply to legiti- ∗Contribution partly performed while under internship at mate comments using spam content. Such diversity of choice if(we) Inc., formerly Tagged Inc. has often increased spammers' ability to conceal their inten- tions from traditional spam filters. According to a study by yCurrently with Niara, Inc., Sunnyvale, CA. Nexgate [2], social spam grew by more than 355% between 1Tagged.com was founded in 2004, has over 300 million reg- istered members, and is aimed towards fostering new con- January to July of 2013, one in 200 social messages contain nections between people. spam, and 5% of all social apps are spammy. 2http://psl.umiacs.umd.edu While content-based approaches have been shown to be effective in stopping spam in email and the web, they can Permission to make digital or hard copies of all or part of this work for be manipulated by sophisticated spammers via incorporat- personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies ing content randomness. Unlike in email and the web, social bear this notice and the full citation on the first page. To copy otherwise, to media enables spammers to split their content across multi- republish, to post on servers or to redistribute to lists, requires prior specific ple messages in order to bypass spam filters. Link-based permission and/or a fee. Request permissions from [email protected]. approaches that leverage the connectivity of the entities, KDD’15 August 11-14, 2015, Sydney, NSW, Australia have been combined with content-based methods to build Copyright is held by the owner/author(s). more effective methods. While it is easier to pass tradi- Publication rights licensed to ACM. ACM 978-1-4503-3664-2/15/08 ...$15.00 tional content-based filters, behavioral patterns and graph DOI: http://dx.doi.org/10.1145/2783258.2788606. properties of the users' interactions are harder to manipu- 1 t(1) 2. PROBLEM STATEMENT We represent a social network as a directed time-stamped t(10) dynamic multi-relational graph G = hV; Ei, where V is the t(3) t(5) set of vertices of the form v = hf1; : : : ; fni representing t(6) users and their demographic features fi, and E is the set t(7) t(4) of directed edges of the form e = hvsrc; vdst; ri; tii represent- ing their interactions, relation type r , and a discrete time- t(8) i stamp ti. The social spam detection problem is to predict t(8) t(3) t(2) whether vi with an unobserved label is a spammer or not, t(9) based on the given network G and a set of observed labels for already identified spammers. Since the deployed security system could employ different measures based on the clas- Figure 1: A time-stamped multi-relation social network with sification confidence, we are interested in (un-normalized) probabilities or ranking scores of the likelihood that each legitimate users and spammers. Each link hv1; v2i in the network represents an action (e.g. profile view, message, or user is a spammer. In other words, the problem is assign- ment of a score (e.g., a probability) to user accounts to rank poke) performed by v1 towards v2 at specific time t. them from the most to the least probable spammer in the system: c : vi ! [0; 1]. 3. OUR METHOD late. Furthermore, many social networks can not monitor In our framework, we focus on three different mechanisms all the generated contents due to privacy and resources con- to identify spammers and malicious activities. We first cre- cerns. Content-independent frameworks, such as the one ate networks from the user interactions and compute net- proposed in this paper, can be applied to systems that pro- work structure features from them. As these are evolving vide maximum user privacy with end-to-end encryption. networks, each user generates a sequence of actions with the Perhaps the most important difference between social net- passage of time. Mining these sequences can provide valu- works and email or web graphs is that social networks have able insights into the intentions of the user. We use two a multi-relational nature, where users have relationships of methods to study these sequences and extract features from different types with other users and entities in the networks. them. We use the output of these methods as features to For example, they can send messages to each other, add classify spammers. We then employ a collective model to each other as friends, \like"each other's posts, and send non- identify spammer accounts only based on the signals from verbal signals such as \winks" or \pokes." Figure1 shows a the abuse reporting system (Greport) as a secondary source representation of a social network as a time-stamped multi- to reassure predictions. The following sections discuss our relation graph. The multi-relational nature provides more framework and extracted features in more details. choices for spammers, but it also empowers detection sys- tems to monitor patterns across activity types, and time. 3.1 Graph Structure Features (XG ) In this paper, we propose a content-independent framework We create a directed graph Gr = hV; Eri for each relation r which is based on the multi-relational graph structure of in the social network, where vertices V consist of users, and different activities between users, and their sequences. edges Er represent interactions of type r between users, e.g. Our proposed framework is motivated by Tagged.com, a if user1 sends a message to user2 then Gmessage will contain v1 social network for meeting new people which was founded and v2 representing the two users, and e1;2 representing the in 2004 and has over 300 million registered members. More relation between them. We have ten different graphs each generally, the framework is applicable to any multi-relational containing the same users as vertices but different actions social network. Our goal is to identify sophisticated spam- as edges.