A Co-classification Framework for Detecting Web Spam and Spammers in Social Media Web Sites

Feilong Chen, Pang-Ning Tan, and Anil K. Jain
Department of Computer Science & Engineering, Michigan State University
East Lansing, Michigan 48824
{chenfeil, ptan, jain}@cse.msu.edu

ABSTRACT

Social media are becoming increasingly popular and have attracted considerable attention from spammers. Using a sample of more than ninety thousand known spam Web sites, we found that between 7% and 18% of their URLs are posted on two popular social media Web sites, digg.com and delicious.com. In this paper, we present a co-classification framework to detect Web spam and the spammers who are responsible for posting it on social media Web sites. The rationale for our approach is that since both detection tasks are related, it is advantageous to train them simultaneously so as to make use of the labeled examples in both the Web spam and the spammer training data. We evaluated the effectiveness of our algorithm on the delicious.com data set. Our experimental results show that the proposed co-classification algorithm significantly outperforms classifiers that learn each detection task independently.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering

General Terms
Algorithms, Experimentation, Measurement, Security

Keywords
Social Media, Web Spam, Classification

1. INTRODUCTION

Web spamming refers to any deliberate activity to promote fraudulent Web pages in order to mislead Web users into believing they are viewing legitimate Web pages. Since search engines play a pivotal role in guiding users to relevant information on the Web, much of the early spamming activity was directed toward misguiding search engines into ranking spam pages unjustifiably higher. Despite the extensive research in this area, Web spam detection remains a critically important but unsolved problem because spammers may adjust their strategies to adapt to the defense mechanisms employed against them.

Social media are becoming increasingly popular and have attracted considerable attention from spammers. From a list of 94,198 spam Web sites extracted from a benchmark data set [9], we found that 6,420 (≈ 7%) of them were posted at digg.com and 16,537 (≈ 18%) were posted at delicious.com. While there has been considerable research on detecting spam among hyperlinked Web pages to improve search engine performance, spam detection in social media is still in its infancy, with existing work focusing primarily on detecting spam in blogs and tagging systems [3, 7, 6] and in online review forums [4]. Unlike spam detection from hyperlinked Web pages, a richer amount of data is available in social media for Web spam detection, such as the links between users, the hyperlinks between Web content, and the links between users and the Web content. Users may also assign a set of keyword tags to annotate the Web content they have posted. A key challenge, therefore, is to systematically incorporate all of this heterogeneous data in a unified framework to improve Web spam detection.

In addition to Web spam detection, it is useful to identify the spammers who are responsible for posting such links in order to prevent their future activities. This is a challenging task because some legitimate users may unknowingly post links to spam Web sites, while some spammers may deliberately add links to non-spam Web sites to avoid detection. This paper assumes that spammers tend to post a considerably higher amount of Web spam than non-spammers. Since the Web spam and spammer detection tasks are related, it is advantageous to train their classifiers simultaneously so as to make use of both sets of labeled examples. To the best of our knowledge, there has been no prior work on the joint detection of Web spam and spammers, an approach we term co-classification.

This paper presents a co-classification framework for Web spam and spammer detection in social media based on the maximum margin principle.
Specifically, we formalize the joint detection tasks as a constrained optimization problem in which the relationships between users and their submitted Web content are represented as constraints in the form of graph regularization. To ground the discussion of our framework, we use the Web site delicious.com as an example. While the concepts in this paper are presented for the social bookmarking domain [2], our proposed framework is applicable to other social media Web sites where the following data are available: (1) links between users, (2) links between users and their submitted Web content (social news, blogs, opinions, etc.), and (3) tags or other content-based features derived from the Web content. We also show that our framework is applicable to both supervised and semi-supervised learning settings. Experimental results using the delicious.com data set show that our co-classification framework significantly outperforms classifiers that learn each detection task independently.

[CIKM'09, November 2–6, 2009, Hong Kong, China. Copyright 2009 ACM 978-1-60558-512-3/09/11 ...$5.00.]

2. PRELIMINARIES

We begin with a brief discussion of the terminology and notation used in this paper. While the terminology introduced here is based on the features available at delicious.com, it is also applicable to other social media Web sites.

• User: A registered visitor of the social bookmarking Web site. Let U be the set of all users.
• Bookmark: A shortcut to the URL of a Web page. Let B be the set of all bookmarks.
• Tag: A keyword or text description assigned by a user to a bookmark. Let T be the set of all tags.
• Post: A 4-tuple (b, u, t, τ), where b ∈ B, u ∈ U, t ⊆ T, and τ is the timestamp at which the user posted the bookmark on the Web site. We denote the set of all posts, also known as the posting history, as Π. Furthermore, let E_b = {(u, b) | u ∈ U, b ∈ B, ∃ t, τ : (b, u, t, τ) ∈ Π}.
• Fan: A directional link from one user to another. If a user u adds another user v to his/her network, then u becomes a fan of v. Let E_u = {(u, v) | u, v ∈ U} be the set of all pairs of users in which u is a fan of v.

The overall data can be represented as a graph G = (V, E), where V = B ∪ U is the set of nodes (bookmarks and users) and E = E_b ∪ E_u is the set of links. Our work is based on the following two assumptions: (1) if (u, b) ∈ E_b and b is a spam bookmark, then u is more likely to be a spammer; (2) if (u, v) ∈ E_u and v is a spammer, then u is also likely to be a spammer. As will be shown in Section 3, these assumptions are enforced as graph regularization constraints in our proposed co-classification framework.

Throughout this paper, matrices are denoted by boldface capital letters such as X, and vectors by boldface small letters such as y (for a column vector) or y^T (for a row vector), where T is the transpose operator. Elements of matrices and vectors are written X_{ij} and y_j, respectively.

3. METHODOLOGY

In this section, we first formalize the Web spam and spammer detection problems. We then present the derivation of our proposed co-classification framework.

3.1 Problem Formulation

Suppose we are given:

• a set of l_b labeled bookmarks, V_b = {(x_1^{(b)}, y_1^{(b)}), (x_2^{(b)}, y_2^{(b)}), ..., (x_{l_b}^{(b)}, y_{l_b}^{(b)})}, where x_i^{(b)} is a d_b-dimensional feature vector for the i-th post and y_i^{(b)} is its class label (+1 for a spam bookmark and −1 for non-spam);
• a set of n_b − l_b unlabeled bookmarks, U_b = {x_{l_b+1}^{(b)}, x_{l_b+2}^{(b)}, ..., x_{n_b}^{(b)}};
• a set of l_u labeled users, V_u = {(x_1^{(u)}, y_1^{(u)}), (x_2^{(u)}, y_2^{(u)}), ..., (x_{l_u}^{(u)}, y_{l_u}^{(u)})}, where x_i^{(u)} is a d_u-dimensional feature vector for the i-th user and y_i^{(u)} is its class label (+1 for a spammer and −1 for a non-spammer);
• a set of n_u − l_u unlabeled users, U_u = {x_{l_u+1}^{(u)}, x_{l_u+2}^{(u)}, ..., x_{n_u}^{(u)}};
• a set of user–bookmark pairs, E_b, where (u, b) ∈ E_b if user u posts a bookmark b;
• a set of user–user pairs, E_u, where (u, v) ∈ E_u if user u is a fan of user v.

Furthermore, let X_b (or X_u) be the matrix whose i-th row corresponds to the feature vector of bookmark (or user) i, i.e., x_i^{(b)T} (or x_i^{(u)T}).

Given V_b, V_u, E_b, and E_u, the goal of the Web spam and spammer detection tasks is to learn a pair of classifiers: (1) f_b(x^{(b)}), which accurately maps a bookmark x^{(b)} to its class label y^{(b)} ∈ {−1, +1}, and (2) f_u(x^{(u)}), which accurately maps a user x^{(u)} to its class label y^{(u)} ∈ {−1, +1}.

3.2 Maximum Margin Classifier

The classifiers used in this paper are extensions of the least-squares support vector machine (LS-SVM), a variant of the maximum margin classifier proposed by Suykens et al. [8]. Specifically, we may construct an LS-SVM classifier for bookmarks by minimizing the following objective function:

L_b(\mathbf{w}_b, b_b, \mathbf{e}, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}_b^T\mathbf{w}_b + \frac{\gamma_1}{2}\sum_{i=1}^{l_b} e_i^2 - \sum_{i=1}^{l_b} \alpha_i \left\{ y_i^{(b)} \left[ \mathbf{w}_b^T \mathbf{x}_i^{(b)} + b_b \right] - 1 + e_i \right\}.

Similarly, the corresponding objective function for classifying users can be expressed as follows:

L_u(\mathbf{w}_u, b_u, \boldsymbol{\xi}, \boldsymbol{\beta}) = \frac{1}{2}\mathbf{w}_u^T\mathbf{w}_u + \frac{\gamma_2}{2}\sum_{i=1}^{l_u} \xi_i^2 - \sum_{i=1}^{l_u} \beta_i \left\{ y_i^{(u)} \left[ \mathbf{w}_u^T \mathbf{x}_i^{(u)} + b_u \right] - 1 + \xi_i \right\}.

However, by training them independently, the classifiers make use of neither the labeled examples across the Web spam and spammer training data nor the link information between the users and their corresponding bookmarks.

3.3 Co-Classification of Bookmarks and Users

Instead of solving the optimization problems for classifying bookmarks and users independently, our co-classification framework utilizes additional information from the link structure between users and bookmarks in E_b to ensure that their solutions are consistent with each other. The objective function for our co-classification framework of bookmarks and users can be written as follows:

L = \frac{1}{2}\mathbf{w}_b^T\mathbf{w}_b + \frac{\gamma_1}{2}\sum_{i=1}^{l_b} e_i^2 + \frac{1}{2}\mathbf{w}_u^T\mathbf{w}_u + \frac{\gamma_2}{2}\sum_{i=1}^{l_u} \xi_i^2
    - \sum_{i=1}^{l_b} \alpha_i \left\{ y_i^{(b)} \left[ \mathbf{w}_b^T \mathbf{x}_i^{(b)} + b_b \right] - 1 + e_i \right\}
    - \sum_{i=1}^{l_u} \beta_i \left\{ y_i^{(u)} \left[ \mathbf{w}_u^T \mathbf{x}_i^{(u)} + b_u \right] - 1 + \xi_i \right\}
    - \gamma_3 \sum_{(i,j)\in E_u} \delta_{ij} \left[ \mathbf{w}_u^T \mathbf{x}_i^{(u)} - \mathbf{w}_u^T \mathbf{x}_j^{(u)} \right]
    - \gamma_4 \sum_{(i,j)\in E_b} \eta_{ij} \left[ \mathbf{w}_u^T \mathbf{x}_i^{(u)} + b_u - \mathbf{w}_b^T \mathbf{x}_j^{(b)} - b_b \right],    (1)

where (w_b, w_u, b_b, b_u, e, ξ, α, β) are the model parameters to be estimated from the training data and (γ_1, γ_2, γ_3, γ_4) are user-specified parameters. For our experiments, we set γ_1 = γ_2 = 1, whereas γ_3 and γ_4 are estimated from the data via ten-fold cross validation.

The term associated with γ_3 is used to enforce the constraint due to links between users. To understand the rationale for this term, note that the value of the objective function increases whenever w_u^T x_i^{(u)} + b_u < w_u^T x_j^{(u)} + b_u (i.e., whenever a non-spammer is a fan of a spammer). Thus, the graph regularization term can be viewed as penalizing models that assign non-spammers as fans of spammers. This idea was inspired by prior work on using graph regularization methods to combine link structure and content information in Web pages [10, 1]. The parameter δ_{ij} is the weight of the directed edge E_u(i, j). Instead of assigning a binary 0/1 weight to every pair of nodes (users), we normalize the weight by the out-degree of the source node, i.e., δ_{ij} = 1 / Σ_j E_u(i, j).

Similarly, the last term in the objective function is used to penalize models in which non-spammers are allowed to bookmark many spam pages:

η_{ij} [w_u^T x_i^{(u)} + b_u − w_b^T x_j^{(b)} − b_b],  ∀(i, j) ∈ E_b,

where η_{ij} = 1 / Σ_j E_b(i, j) is the weight of E_b(i, j).

The objective function given in (1) can be solved in a supervised or semi-supervised learning setting, depending on the nodes used to express the graph regularization constraints in E_b and E_u. In a supervised learning setting, E_b and E_u involve only the bookmarks and users that are part of the labeled training data. For semi-supervised learning, the constraints due to E_b and E_u in (1) include unlabeled bookmarks and users as well. The solution of the objective function is obtained by taking the derivative of L with respect to each of the model parameters and setting it to zero. This reduces to a system of linear equations that can be expressed in matrix notation as follows:

\begin{bmatrix}
I_{d_u} & 0 & 0 & -Z_u^T & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \mathbf{y}^{(u)T} & 0 & 0 & 0 & 0 \\
0 & 0 & \gamma_2 I_{l_u} & -I_{l_u} & 0 & 0 & 0 & 0 \\
Z_u & \mathbf{y}^{(u)} & I_{l_u} & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & I_{d_b} & 0 & 0 & -Z_b^T \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \mathbf{y}^{(b)T} \\
0 & 0 & 0 & 0 & 0 & 0 & \gamma_1 I_{l_b} & -I_{l_b} \\
0 & 0 & 0 & 0 & Z_b & \mathbf{y}^{(b)} & I_{l_b} & 0
\end{bmatrix}
\begin{bmatrix} \mathbf{w}_u \\ b_u \\ \boldsymbol{\xi} \\ \boldsymbol{\beta} \\ \mathbf{w}_b \\ b_b \\ \mathbf{e} \\ \boldsymbol{\alpha} \end{bmatrix}
=
\begin{bmatrix} \mathbf{p}_1 \\ -q \\ \mathbf{0}_{l_u} \\ \mathbf{1}_{l_u} \\ \mathbf{p}_2 \\ q \\ \mathbf{0}_{l_b} \\ \mathbf{1}_{l_b} \end{bmatrix}

where I_d is a d × d identity matrix, 0_d is a d-dimensional vector of 0s, 1_d is a d-dimensional vector of 1s, and

Z_u = [x_1^{(u)T} y_1^{(u)}; \ldots; x_{l_u}^{(u)T} y_{l_u}^{(u)}],
Z_b = [x_1^{(b)T} y_1^{(b)}; \ldots; x_{l_b}^{(b)T} y_{l_b}^{(b)}],
p_1 = \gamma_3 \sum_{(i,j)\in E_u} \delta_{ij} [x_i^{(u)} - x_j^{(u)}] + \gamma_4 \sum_{(i,j)\in E_b} \eta_{ij} x_i^{(u)},
p_2 = -\gamma_4 \sum_{(i,j)\in E_b} \eta_{ij} x_j^{(b)},
q = \gamma_4 \sum_{(i,j)\in E_b} \eta_{ij}.

The block structure of the matrix equation suggests that the system of linear equations can be further decoupled into two subproblems, one for learning the parameters of the bookmark classifier and the other for the user classifier:

\begin{bmatrix} Z_u Z_u^T + \gamma_2 I_{l_u} & \mathbf{y}^{(u)} \\ \mathbf{y}^{(u)T} & 0 \end{bmatrix}
\begin{bmatrix} \boldsymbol{\beta} \\ b_u \end{bmatrix}
=
\begin{bmatrix} \mathbf{1} - Z_u \mathbf{p}_1 \\ -q \end{bmatrix}    (2)

\begin{bmatrix} Z_b Z_b^T + \gamma_1 I_{l_b} & \mathbf{y}^{(b)} \\ \mathbf{y}^{(b)T} & 0 \end{bmatrix}
\begin{bmatrix} \boldsymbol{\alpha} \\ b_b \end{bmatrix}
=
\begin{bmatrix} \mathbf{1} - Z_b \mathbf{p}_2 \\ q \end{bmatrix}    (3)

Equations (2) and (3) can be solved for the model parameters (α, β, b_u, b_b). Furthermore, w_u and w_b can be expressed in terms of β, α, Z_u, and Z_b. As a result, a user x_test^{(u)} and a bookmark x_test^{(b)} can be classified as follows: f_u(x_test^{(u)}) = w_u^T x_test^{(u)} + b_u and f_b(x_test^{(b)}) = w_b^T x_test^{(b)} + b_b. Finally, although the proposed co-classification framework assumes a linear classifier, it can be extended to non-linear models using the well-known kernel trick. However, we have considered only linear kernels in this study.

Algorithm 1 summarizes a high-level overview of our Co-Class algorithm. The LinearSolver function solves the system of linear equations given in (2) and (3). The Classify function takes as input the model parameters and unlabeled examples and generates their predicted class labels.

Algorithm 1 Co-Class Algorithm
Input: V_b, V_u, U_b, U_u, E_b, E_u, γ
Output: {y_i^{(u)} | i = l_u + 1, ..., n_u}, {y_j^{(b)} | j = l_b + 1, ..., n_b}
Method:
1. (α, β, b_u, b_b) ← LinearSolver(V_b, V_u, E_b, E_u, γ)
2. y^{(u)} ← Classify(U_u, V_u, β, b_u)
3. y^{(b)} ← Classify(U_b, V_b, α, b_b)

4. EXPERIMENTAL EVALUATION

This section presents our experimental results to demonstrate the effectiveness of the Co-Class algorithm. We use a real-world data set acquired from delicious.com to evaluate our algorithms. The data set consists of the posting history of nearly 3 million users, whose feature vectors are generated from about 2.5 million tags. In addition, the data set contains about 110,000 bookmarks, whose feature vectors include about 300,000 tags. The user–bookmark links E_b and the user–user links E_u are obtained by preprocessing the user profile pages.

Table 1: Experimental results on the delicious.com data set (training data = 10%).
             Bookmark                                              User
             Recall            Precision         F1                Recall            Precision         F1
linear SVM   0.4266 ± 0.0335   0.7587 ± 0.0150   0.5451 ± 0.0255   0.2537 ± 0.0397   0.6873 ± 0.0220   0.3686 ± 0.0408
SVM-rbf      0.4320 ± 0.0415   0.7591 ± 0.0136   0.5517 ± 0.0310   0.2541 ± 0.0276   0.6901 ± 0.0408   0.3720 ± 0.0541
Co-class     0.5101 ± 0.0251   0.7702 ± 0.0326   0.6136 ± 0.0430   0.3802 ± 0.0390   0.7476 ± 0.0275   0.5131 ± 0.0316

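The scores in Table 1 are the standard precision, recall, and F1 for the positive (spam or spammer) class. As a quick reference, a toy computation with hypothetical labels (not taken from the paper's data):

```python
# Precision, recall, and F1 for the positive (+1) class,
# as reported in Table 1. The labels below are hypothetical toy values.
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, -1, 1, -1, -1, -1, -1, -1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)

precision = tp / (tp + fp)       # 2 / 3
recall = tp / (tp + fn)          # 2 / 4
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 4))
```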
URLs harvested from the e-mail spam benchmark data in [9] are used to label the spam bookmarks. A user is labeled as a spammer if he/she posted at least one of those spam bookmarks. To test the performance of our algorithms, we randomly selected a sample of 20,000 bookmarks and 20,000 users; 20% of the data set were identified as spam (or spammers) and the remaining 80% were non-spam (or non-spammers).

For comparison purposes, we use SVM-light [5] to build a pair of support vector machine classifiers that learn the classification of users and bookmarks independently using their respective training data. We report the precision, recall, and F1-measures after performing repeated 10-fold cross validation on the data. To make the problem more challenging, for each fold we use 10% of the data for training while the remaining 90% are used as test data. We repeated the experiment 20 times, each time sampling another 20,000 bookmarks and users from the original data.

Table 1 shows the relative performance of linear and nonlinear SVM against the semi-supervised Co-Class algorithm. Clearly, the F1-measure for Co-Class is significantly higher than that for SVM, both for classifying bookmarks and for classifying users. The F1-measure for bookmark classification is higher than that for user classification, but the margin of improvement for Co-Class is larger for classifying users than for classifying bookmarks. We suspect this is due to the additional information in the link structure of E_u, which helps to boost the performance of the Co-Class algorithm when classifying users. The results also show that most of the improvement in Co-Class is due to its higher recall. Since the data set is moderately skewed, SVM tends to classify many of the examples into the larger class, which explains its lower recall. In contrast, graph regularization using the link structure in E_b and E_u allows the Co-Class algorithm to identify more spam and spammers from the data set without losing precision.

We also increased the percentage of training data during 10-fold cross validation from 10% to 30%:

Table 1 (continued): Experimental results on the delicious.com data set (training data = 30%).
             Bookmark                                              User
             Recall            Precision         F1                Recall            Precision         F1
linear SVM   0.5783 ± 0.0177   0.8110 ± 0.0163   0.6749 ± 0.0103   0.3521 ± 0.0162   0.7097 ± 0.0206   0.4703 ± 0.0147
Co-class     0.6300 ± 0.0186   0.8203 ± 0.0200   0.7103 ± 0.0317   0.4502 ± 0.0319   0.7330 ± 0.0211   0.5647 ± 0.0280

Our algorithm again significantly outperformed linear SVM, though the margin of improvement is smaller than that obtained with 10% training data.

5. CONCLUSION

This paper focuses on the problem of Web spam and spammer detection in social media Web sites. Unlike search engine spam, Web spam from social bookmarking and other social Web sites is potentially damaging because it may direct users to malicious Web sites that compromise browser security. To overcome this problem, we presented a co-classification framework that simultaneously trains classifiers for detecting Web spam and spammers. We demonstrated that such a strategy is more effective than learning each task independently. For future work, we plan to extend the methodology to incorporate data from multiple social media Web sites. For example, an interesting research direction is to investigate the co-classification of bookmarks and users from both delicious.com and digg.com. Furthermore, it would be useful to investigate methods for reducing the number of user parameters in our learning framework.

6. ACKNOWLEDGMENTS

This research was partially supported by ONR Grant Number N00014-09-1-0663 and the Army Research Office.

7. REFERENCES

[1] J. Abernethy, O. Chapelle, and C. Castillo. Web spam identification through content and hyperlinks. In Proc. of the SIGIR Workshop on Adversarial Information Retrieval on the Web (AIRWEB'08), Beijing, China, April 2008.
[2] F. Chen, J. Scripps, and P. Tan. Link mining for a social bookmarking web site. In Proc. of the IEEE/WIC/ACM Int'l Conf. on Web Intelligence, 2008.
[3] K. Ishida. Extracting spam blogs with co-citation clusters. In Proc. of the 17th Int'l Conf. on World Wide Web, pages 1043–1044, New York, NY, 2008.
[4] N. Jindal and B. Liu. Opinion spam and analysis. In Proc. of the Int'l Conf. on Web Search and Web Data Mining (WSDM'08), 2008.
[5] T. Joachims. SVM-light. http://svmlight.joachims.org/.
[6] G. Koutrika, F. A. Effendi, Z. Gyongyi, P. Heymann, and H. Garcia-Molina. Combating spam in tagging systems: An evaluation. ACM Trans. Web, 2(4):1–34, 2008.
[7] Y. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Detecting splogs via temporal dynamics using self-similarity analysis. ACM Trans. Web, 2(1):1–35, Feb. 2008.
[8] J. Suykens, T. Gestel, J. Brabanter, B. Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific Pub, Singapore, 2002.
[9] S. Webb, J. Caverlee, and C. Pu. Introducing the Webb spam corpus: Using email spam to identify web spam automatically. In Proc. of CEAS'06, 2006.
[10] T. Zhang, A. Popescul, and B. Dom. Linear prediction models with graph regularization for web-page categorization. In Proc. of the ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pages 821–826, Philadelphia, PA, 2006.