A Co-Classification Framework for Detecting Web Spam
Total Page:16
File Type:pdf, Size:1020Kb
A Co-classification Framework for Detecting Web Spam and Spammers in Social Media Web Sites Feilong Chen, Pang-Ning Tan, and Anil K. Jain Department of Computer Science & Engineering Michigan State University East Lansing, Michigan 48824 {chenfeil, ptan, jain}@cse.msu.edu ABSTRACT Despite the extensive research in this area, Web spam de- Social media are becoming increasingly popular and have tection remains a critically important but unsolved problem attracted considerable attention from spammers. Using a because spammers may adjust their strategies to adapt to sample of more than ninety thousand known spam Web the defense mechanisms employed against them. sites, we found between 7% to 18% of their URLs are posted Social media are becoming increasingly popular and have on two popular social media Web sites, digg.com and de- attracted considerable attention from spammers. From a licious.com. In this paper, we present a co-classi¯cation list of 94; 198 spam Web sites extracted from a benchmark framework to detect Web spam and the spammers who are email spam data [9], we found 6; 420 (¼ 7%) of them were responsible for posting them on the social media Web sites. posted at digg.com and 16; 537 (¼ 18%) of them were posted The rationale for our approach is that since both detection at delicious.com. While there has been considerable re- tasks are related, it would be advantageous to train them search to detect spam from hyperlinked Web pages to im- simultaneously to make use of the labeled examples in the prove search engine performance, spam detection from social Web spam and spammer training data. We have evaluated media is still in its infancy, with existing work focusing pri- the e®ectiveness of our algorithm on the delicious.com marily on detecting spam in blogs [3, 7, 6] and online review data set. Our experimental results showed that the pro- forums [4]. Unlike spam detection from hyperlinked Web posed co-classi¯cation algorithm signi¯cantly outperforms pages, there are richer amount of data available in social classi¯ers that learn each detection task independently. media that can be utilized for Web spam detection, such as the links between users, hyperlinks between the Web con- tent, and links between users and the Web content. Users Categories and Subject Descriptors may also assign a set of keyword tags to annotate the Web H.3.3 [Information Storage and Retrieval]: Information content they have posted. Therefore, a key challenge is to Search and Retrieval|Information ¯ltering systematically incorporate all the heterogeneous data in a uni¯ed framework to improve Web spam detection. General Terms In addition to Web spam detection, it is useful to iden- tify the spammers who are responsible for posting such links Algorithms, Experimentation, Measurement, Security to prevent them from future spamming activities. This is a challenging task because some legitimate users may unknow- Keywords ingly post links to spam Web sites, while some spammers may deliberately add links to non-spam Web sites to avoid Social Media, Web Spam, Classi¯cation detection. This paper assumes that spammers tend to post considerably higher amount of Web spam compared to non- 1. INTRODUCTION spammers. Since both Web spam and spammer detection Web spamming refers to any deliberate activity to pro- tasks are related, it would be advantageous to train their mote fraudulent Web pages in order to mislead Web users classi¯ers simultaneously to make use of their labeled exam- into believing they were viewing legitimate Web pages. Since ples. To the best of our knowledge, there has not been any search engines play a pivotal role in guiding users to ¯nd prior work on the joint detection of Web spam and spam- relevant information on the World Wide Web, much of the mers, an approach which we termed as co-classi¯cation. early spamming activities were directed toward misguiding This paper presents a co-classi¯cation framework for Web search engines into ranking spam pages unjusti¯ably higher. spam and spammer detection in social media based on the maximum margin principle. Speci¯cally, we formalize the joint detection tasks as a constraint optimization problem, in which the relationships between users and their submit- Permission to make digital or hard copies of all or part of this work for ted Web content are represented as constraints in the form personal or classroom use is granted without fee provided that copies are graph regularization. To ground the discussion of our frame- not made or distributed for profit or commercial advantage and that copies work, we use as an example the social bookmarking Web site bear this notice and the full citation on the first page. To copy otherwise, to delicious.com. While the concepts in this paper are pre- republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. sented for the social bookmarking domain [2], our proposed CIKM’09, November 2–6, 2009, Hong Kong, China. framework is applicable to other social media Web sites Copyright 2009 ACM 978-1-60558-512-3/09/11 ...$5.00. (u) where the following data are available: (1) links between vector for the i-th user and yi is its class label (+1 users, (2) links between users and their submitted Web con- for a spammer and ¡1 for non-spammer). tent (social news, blogs, opinions, etc), and (3) tags or other ² a set of n ¡ l unlabeled users, U = fx(u) ; x(u) , u u u lu+1 lu+2 content-based features derived from the Web content. We (u) ¢ ¢ ¢ , x g. also show that our framework is applicable to both super- nu ² a set of user-bookmark pairs, E , where (u; b) 2 E if vised and semi-supervised learning settings. Experimental b b user u posts a bookmark b. results using the delicious.com data set showed that our ² a set of user-user pairs, E , where (u; v) 2 E if user co-classi¯cation framework signi¯cantly outperforms classi- u u u is a fan of user v. ¯ers that learn each detection task independently. th Furthermore, let Xb (or Xu) be a matrix whose i row corresponds to the feature vector of bookmark (or user) i, 2. PRELIMINARIES (b)T (u)T i.e., xi (or xi ). We begin with a brief discussion of the terminology and Given Vb, Vu, Eb, and Eu, the goal of Web spam and notations used in this paper. While the terminology intro- spammer detection tasks is to learn a pair of classi¯ers: (1) duced here are based on the features available at delicious. (b) (b) fb(x ) that accurately maps a bookmark x to its class com, they are also applicable to other social media Web sites. label y(b) 2 f¡1; +1g and (2) f (x(u)) that accurately maps ² User: A registered visitor of the social bookmarking u a user x(u) to its class label y(u) 2 f¡1; +1g. Web site. Let U be the set of all users. ² Bookmark: A shortcut to the URL of a Web page. Let B be the set of all bookmarks. ² Tag: A keyword or text description assigned by a user 3.2 Maximum Margin Classifier to a bookmark. Let T be the set of all tags. The classi¯ers used in this paper are extensions of the ² Post: A 4-tuple (b, u, t, ¿), where b 2 B, u 2 U, least-square support vector machine (LS-SVM), a variant t ⊆ T , and ¿ is the timestamp at which the user posted of maximum margin classi¯er proposed by Suykens et al. the bookmark on the Web site. We denote the set [8]. Speci¯cally, we may construct a LS-SVM classi¯er for of all posts, also known as the posting history, as ¦. bookmarks by minimizing the following objective function: Furthermore, let Eb = f(u; b)ju 2 U; b 2 B; 9t; ¿ : (b; u; t; ¿) 2 ¦g. ² Fan: A directional link from one user to another. If a Xlb user u adds another user v to his/her network, then u 1 T 1 2 Lb(wb; bb; e; ®) = w wb + γ1 e becomes a fan of v. Let Eu = f(u; v)ju; v 2 Ug be the 2 b 2 i set of all pairs of users in which u is a fan of v. i=1 Xlb n h i o The overall data can be represented as a graph G = (V; E), (b) T (b) where V = B[U is the set of nodes (bookmarks and users) ¡ ®i yi wb xi + bb ¡ 1 + ei ; i=1 and E = Eb [ Eu is the set of links. Our work is based on the following two assumptions: (1) If (u; b) 2 Eb and b is a spam bookmark, then u is more likely to be a spammer. (2) If (u; v) 2 Eu and v is a spammer, then u is also likely to Similarly, the corresponding objective function for classify- be a spammer. As will be shown later in Section 3, these ing users can be expressed as follows: assumptions are enforced as graph regularization constraints in our proposed co-classi¯cation framework. Throughout this paper, matrices are denoted by boldface Xlu 1 T 1 2 capital letters like X and vectors are denoted by boldface Lu(wu; bu; »; ¯) = wu wu + γ2 »i T 2 2 small letters like y (for column vector) or y (for row vec- i=1 tor), where T is the transpose operator. Elements of matri- Xlu n h i o ces and vectors are of the form X and y , respectively.