Online Social Spammer Detection
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence Online Social Spammer Detection Xia Hu, Jiliang Tang, Huan Liu Computer Science and Engineering, Arizona State University, USA xiahu, jiliang.tang, huan.liu @asu.edu { } Abstract been devoted to detect spammers in various social network- The explosive use of social media also makes it a popular ing sites, including Twitter (Lee et al. 2010), Renren (Yang platform for malicious users, known as social spammers, to et al. 2011), Blogosphere (Lin et al. 2007), etc. Exist- overwhelm normal users with unwanted content. One effec- ing methods can generally be divided into two categories. tive way for social spammer detection is to build a classifier First category is to employ content analysis for detecting based on content and social network information. However, spammers in social media. Profile-based features (Lee et al. social spammers are sophisticated and adaptable to game the 2010) such as content and posting patterns are extracted to system with fast evolving content and network patterns. First, build an effective supervised learning model, and the model social spammers continually change their spamming content is applied on unseen data to filter social spammers. Another patterns to avoid being detected. Second, reflexive reciprocity category of methods is to detect spammers via social net- makes it easier for social spammers to establish social influ- work analysis (Ghosh et al. 2012). A widely used assump- ence and pretend to be normal users by quickly accumulating a large number of “human” friends. It is challenging for ex- tion in the methods is that spammers cannot establish an ar- isting anti-spamming systems based on batch-mode learning bitrarily large number of social trust relations with legitimate to quickly respond to newly emerging patterns for effective users. The users with relatively low social influence or social social spammer detection. In this paper, we present a gen- status in the network will be determined as spammers. eral optimization framework to collectively use content and Traditional spammer detection methods become less ef- network information for social spammer detection, and pro- fective due to the fast evolution of social spammers. First, vide the solution for efficient online processing. Experimen- social spammers show dynamic content patterns in social tal results on Twitter datasets confirm the effectiveness and efficiency of the proposed framework. media. Spammers’ content information changes too fast to be detected by a static anti-spamming system based on of- fline modeling (Zhu et al. 2012). Spammers continue to Introduction change their spamming strategies and pretend to be nor- Social media services, like Facebook and Twitter, are in- mal users to fool the system. A built system may be- creasingly used in various scenarios such as marketing, jour- come less effective when the spammers create many new, nalism and public relations. While social media services evasive accounts. Second, many social media sites like have emerged as important platforms for information dis- Twitter have become a target of link farming (Ghosh et semination and communication, it has also become infa- al. 2012). The reflexive reciprocity (Weng et al. 2010; mous for spammers who overwhelm other users with un- Hu et al. 2013b) indicates that many users simply follow wanted content. The (fake) accounts, known as social spam- back when they are followed by someone for the sake of mers (Webb et al. 2008; Lee et al. 2010), are a special type courtesy. It is easier for spammers to acquire a large num- of spammers who coordinate among themselves to launch ber of follower links in social media. Thus, with the per- various attacks such as spreading ads to generate sales, dis- ceived social influence, they can avoid being detected by seminating pornography, viruses, phishing, befriending vic- network-based methods. Similar results targeting other plat- tims and then surreptitiously grabbing their personal infor- forms such as Renren (Yang et al. 2011) have been re- mation (Bilge et al. 2009), or simply sabotaging a system’s ported in literature as well. Existing systems rely on build- reputation (Lee et al. 2010). The problem of social spam- ing a new model to capture newly emerging content-based ming is a serious issue prevalent in social media sites. Char- and network-based patterns of social spammers. Given the acterizing and detecting social spammers can significantly rapidly evolving nature, it is necessary to have a framework improve the quality of user experience, and promote the that efficiently reflects the effect of newly emerging data. healthy use and development of a social networking system. One efficient approach to incrementally update existing Following spammer detection in traditional platforms like model in large-scale data analysis is online learning. While Email and the Web (Chen et al. 2012), some efforts have online learning has been studied for years and shown its ef- Copyright c 2014, Association for the Advancement of Artificial fectiveness in many applications such as image and video Intelligence (www.aaai.org). All rights reserved. processing (Mairal et al. 2009) and human computer in- 59 teraction (Madani et al. 2009), it has not been applied in al. 2010; Lee et al. 2010), we focus on classifying users as social spammer detection. In this paper, we study how to spammers or normal users, i.e., c =2. It is straightforward capture the fast evolving nature of social spammers using to extend this setting to a multi-class classification task. online learning. In particular, we investigate: (1) how do With the given notations, we formally define the problem we model the content and network information in a unified of online social spammer detection as follows: framework for effective social spammer detection?; and (2) Given k users with their content information Xk, social how do we update the built model to efficiently incorporate network information Gk, and identity label information Yk, newly emerging data objects? Our solutions to these two we learn a factorization model Vk and Uk which could be questions result in a new framework for Online Social Spam- used to learn a classifier Wk to automatically assign iden- mer Detection (OSSD). The proposed framework is a formu- tity labels for unknown users (i.e., test data) as spammers or lation based on directed Laplacian constrained matrix fac- normal users. Given one more user, our goal is to efficiently torization, and is used to incorporate refined social network update the built model Vk+1, Uk+1 and Wk+1 for social information into content modeling. Then we incrementally spammer detection based on k +1users with their content update the factors appropriately to reflect the rapidly evolv- information Xk+1, social network information Gk+1, and ing nature of the social spammers. The main contributions identity label information Yk+1. of this paper are outlined as follows: Formally define the problem of online social spammer de- Social Spammer Detection • tection with content and network information; In this section, we propose a general framework for social Introduce a unified framework that considers both of spammer detection. We first discuss the modeling of con- • the content and network information for effective social tent and social network information separately, and then in- spammer detection; troduce a unified framework to integrate both information. To use content information, one way is to learn a super- Propose a novel scheme to incrementally update the built • vised model, and apply the learned model for spammer de- model for social spammer detection; and tection. Due to the unstructured and noisy content infor- Empirically evaluate the proposed OSSD framework on mation in social media, this method yields two problems • real-world Twitter datasets. to be directly applied to our task. First, text representation models, like n-gram model, often lead to a high-dimensional The remainder of this paper is organized as follows. In feature space because of the large size of data and vocabu- the second section, we formally define the problem of on- lary (Hu et al. 2009). Second, In addition to the short form line social spammer detection. We then introduce a general of texts, abbreviations and acronyms are widely used in so- framework for social spammer detection with content and cial media, thus making the data representation very sparse. network information. In the fourth section, we propose a To tackle the problems, instead of learning word-level novel online learning scheme for the social spammer detec- knowledge, we propose to model the content informa- tion framework. In the experiment section, we report em- tion from topic-level. Motivated by topic modeling litera- pirical results on real-world Twitter datasets to evaluate the ture (Blei et al. 2003), a user’s posts usually focus on a few effectiveness and efficiency of the proposed method. Finally, topics, resulting in X very sparse and low-rank. The pro- we conclude the paper and present the future work. posed method is built on a non-negative matrix factorization model (NMF) (Lee and Seung 1999), which seeks a more Problem Statement compact but accurate low-rank representation of the users In this section, we first introduce the notations used in the by solving the following optimization problem: paper and then formally define the problem we study. 2 min X UH F , Notation: Scalars are denoted by lowercase letters, vec- U,H 0 k − k (1) tors by boldface lowercase letters, and matrices by boldface ≥ n r uppercase letters. Let A denote the Euclidean norm, and where X is the content matrix, U R ⇥ is a mixing matrix r m 2 A the Frobenius normk k of the matrix A, i.e., A = and H R ⇥ with r n is an encoding matrix that k kF k kF 2 ⌧ m n 2 T indicates a low-rank user representation in a topic space.