Link Spam Detection Based on Mass Estimation
Zoltan Gyongyi∗ (Computer Science Department, Stanford University, Stanford, CA 94305, USA)
Pavel Berkhin (Yahoo! Inc., 701 First Avenue, Sunnyvale, CA 94089, USA)
Hector Garcia-Molina (Computer Science Department, Stanford University, Stanford, CA 94305, USA)
Jan Pedersen (Yahoo! Inc., 701 First Avenue, Sunnyvale, CA 94089, USA)

∗ Work performed during a summer internship at Yahoo! Inc.

VLDB '06, September 12-15, 2006, Seoul, Korea. Copyright 2006 VLDB Endowment, ACM 1-59593-385-9/06/09.

ABSTRACT

Link spamming intends to mislead search engines and trigger an artificially high link-based ranking of specific target web pages. This paper introduces the concept of spam mass, a measure of the impact of link spamming on a page's ranking. We discuss how to estimate spam mass and how the estimates can help identify pages that benefit significantly from link spamming. In our experiments on the host-level Yahoo! web graph we use spam mass estimates to successfully identify tens of thousands of instances of heavy-weight link spamming.

1. INTRODUCTION

In an era of search-based web access, many attempt to mischievously influence the page rankings produced by search engines. This phenomenon, called web spamming, represents a major problem for search engines [14, 11] and has a negative economic and social impact on the whole web community. Initially, spammers focused on enriching the contents of spam pages with specific words that would match query terms. With the advent of link-based ranking techniques, such as PageRank [13], spammers started to construct spam farms, collections of interlinked spam pages. This latter form of spamming is referred to as link spamming, as opposed to the former technique, term spamming.

The plummeting cost of web publishing has led to a boom in link spamming. The size of many spam farms has increased dramatically, and many farms span tens, hundreds, or even thousands of different domain names, rendering naive countermeasures ineffective. Skilled spammers, whose activity remains largely undetected by search engines, often manage to obtain very high rankings for their spam pages.

This paper proposes a novel method for identifying the largest and most sophisticated spam farms, by turning the spammers' ingenuity against themselves. Our focus is on spamming attempts that target PageRank. We introduce the concept of spam mass, a measure of how much PageRank a page accumulates through being linked to by spam pages. The target pages of spam farms, whose PageRank is boosted by many spam pages, are expected to have a large spam mass. At the same time, popular reputable pages, which have high PageRank because other reputable pages point to them, have a small spam mass.

We estimate the spam mass of all web pages by computing and combining two PageRank scores: the regular PageRank of each page and a biased one, in which a large group of known reputable pages receives more weight. Mass estimates can then be used to identify pages that are significant beneficiaries of link spamming with high probability.

The strength of our approach is that we can identify any major case of link spamming, not only farms with regular interconnection structures or cliques, which represent the main focus of previous research (see Section 5). The proposed method also complements our previous work on TrustRank [10] in that it detects spam, as opposed to "detecting" reputable pages (see Sections 3.4 and 5).
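Although the precise definitions are deferred to Section 3, the combination step just described can be sketched in a few lines. The following Python fragment is a minimal illustration, assuming (as one natural reading) that the difference between the regular score and the reputable-biased score serves as the mass estimate; the function and variable names are ours, not the paper's.

    # Hypothetical sketch: combining a regular PageRank vector p with a
    # reputable-biased vector p_good into spam mass estimates. The paper's
    # actual definitions appear in Section 3; names here are illustrative.
    def spam_mass_estimates(p, p_good):
        # p, p_good: dicts mapping node -> score, computed on the same graph
        # with a uniform and a reputable-biased random jump vector.
        absolute = {x: p[x] - p_good[x] for x in p}
        # Fraction of x's PageRank not explained by reputable pages; values
        # close to 1 would flag likely beneficiaries of link spamming.
        relative = {x: absolute[x] / p[x] for x in p if p[x] > 0}
        return absolute, relative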
This paper is organized as follows. We start with some background material on PageRank and link spamming. The first part of Section 3 introduces the concept of spam mass through a transition from simple examples to formal definitions. Then, the second part of Section 3 presents an efficient way of estimating spam mass and a practical spam detection algorithm based on mass estimation. In Section 4 we discuss our experiments on the Yahoo! search engine index and offer evidence that spam mass estimation is helpful in identifying heavy-weight link spam. Finally, Section 5 places our results into the larger picture of link spam detection research and PageRank analysis.

2. PRELIMINARIES

2.1 Web Graph Model

Information on the web can be viewed at different levels of granularity. For instance, one could think of the web of individual HTML pages, the web of hosts, or the web of sites. Our discussion will abstract from the actual level of granularity, and see the web as an interconnected structure of nodes, where nodes may be pages, hosts, or sites, respectively.

We adopt the usual graph model of the web, and use G = (V, E) to denote the web graph that consists of a set V of nodes (vertices) and a set E of directed links (edges) that connect nodes. We disregard the exact definition of a link, although it will usually represent one or a group of hyperlinks between corresponding pages, hosts, or sites. We use unweighted links and disallow self-links.

Each node has some incoming links, or inlinks, and some outgoing links, or outlinks. The number of outlinks of a node x is its outdegree, whereas the number of inlinks is its indegree. The nodes pointed to by a node x are the out-neighbors of x. Similarly, the nodes pointing to x are its in-neighbors.

2.2 Linear PageRank

A popular discussion and research topic, PageRank as introduced in [13] is more of a concept that allows for different mathematical formulations than a single clear-cut algorithm. From among the available approaches (for an overview, see [3], [4], and [5]), we adopt the linear system formulation of PageRank, which is briefly introduced next. The extended version of this paper [8] offers a more detailed presentation of linear PageRank and, most importantly, discusses its equivalence to the probabilistic formulation based on the random surfer model.

Given a damping factor c ∈ (0, 1) (usually around 0.85), the PageRank score p_x of a page x is defined as the solution of the following equation, involving the PageRank scores of the pages y pointing to x:

    p_x = c · Σ_{(y,x)∈E} p_y / out(y) + (1 − c)/n,        (1)

where out(y) is the outdegree of node y and n = |V| is the number of nodes of the web graph. The second additive term on the right side of the equation, (1 − c)/n, is traditionally referred to as the random jump component or teleportation component and corresponds to a fixed, minimal amount of PageRank that every node gets by default.

Consider the transition matrix T corresponding to the web graph G, defined as

    T_xy = 1/out(x),  if (x, y) ∈ E,
           0,         otherwise,

and let v be the uniform random jump distribution vector, with v_x = 1/n for all x. Then equation (1) can be written for all nodes at once as the linear system

    (I − c·T^T) p = (1 − c) v,        (2)

where I is the identity matrix.

We adopt the notation p = PR(v) to indicate that p is the (unique) vector of PageRank scores satisfying (2) for a given v. In general, we will allow for non-uniform random jump distributions. We even allow v to be unnormalized, that is, 0 < ‖v‖ ≤ 1, and leave the PageRank vector unnormalized as well. The linear system (2) can be solved, for instance, by using the Jacobi method, shown as Algorithm 1.

    input : transition matrix T, random jump vector v,
            damping factor c, error bound ε
    output: PageRank score vector p

    i ← 0
    p[0] ← v
    repeat
        i ← i + 1
        p[i] ← c·T^T·p[i−1] + (1 − c)·v
    until ‖p[i] − p[i−1]‖ < ε
    p ← p[i]

    Algorithm 1: Linear PageRank.

A major advantage of the adopted formulation is that the PageRank scores are linear in v: for p = PR(v) and v = v1 + v2 we have p = p1 + p2, where p1 = PR(v1) and p2 = PR(v2). Further details on linear PageRank are provided in [3].
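As a concrete companion to Algorithm 1, the sketch below implements the Jacobi iteration in Python over an adjacency-list representation instead of an explicit matrix T; the representation and all names are our choices, not the paper's. Dangling nodes (no outlinks) simply leak score, which is consistent with a substochastic T and an unnormalized PageRank vector.

    # A direct transcription of Algorithm 1. graph maps each node to the
    # list of its out-neighbors; v maps each node to its random jump
    # weight (1/n for the uniform distribution).
    def linear_pagerank(graph, v, c=0.85, eps=1e-8):
        p = {x: v.get(x, 0.0) for x in graph}          # p[0] <- v
        while True:
            # p[i] <- c * T^T * p[i-1] + (1 - c) * v
            p_next = {x: (1.0 - c) * v.get(x, 0.0) for x in graph}
            for x, outs in graph.items():
                if outs:
                    share = c * p[x] / len(outs)       # x's score, split evenly
                    for y in outs:
                        p_next[y] += share
            # until ||p[i] - p[i-1]|| < eps, using the L1 norm
            if sum(abs(p_next[x] - p[x]) for x in graph) < eps:
                return p_next
            p = p_next

Because each iteration is linear in v, running the function with v = v1 + v2 yields the sum of the results for v1 and v2 computed separately, which is exactly the linearity property noted above.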
2.3 Link Spamming

In this paper we focus on link spamming that targets the PageRank algorithm. PageRank is fairly robust to spamming: a significant increase in score requires a large number of links from low-PageRank nodes and/or some hard-to-obtain links from popular nodes, such as The New York Times site www.nytimes.com. Spammers usually try to blend these two strategies, though the former is more prevalent.

In order to better understand the modus operandi of link spamming, we introduce the model of a link spam farm, a group of interconnected nodes involved in link spamming. A spam farm has a single target node, whose ranking the spammer intends to boost by creating the whole structure. A farm also contains boosting nodes, controlled by the spammer and connected so that they would influence the PageRank of the target. Boosting nodes are owned either by the author of the target, or by some other spammer (financially or otherwise) interested in collaborating with him/her. Commonly, boosting nodes have little value by themselves; they only exist to improve the ranking of the target. Their PageRank tends to be small, so serious spammers employ a large number of boosting nodes (occasionally, thousands of them) to trigger a high target ranking.

In addition to the links within the farm, spammers may gather some external links from reputable nodes.
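To make the farm model concrete, the following sketch (reusing linear_pagerank from the example in Section 2.2) builds a farm in which every boosting node links to the target. We additionally assume, as one plausible layout not spelled out here, that the target links back to the boosting nodes so that score recirculates within the farm; the node names and sizes are purely illustrative.

    # Illustrative spam farm: one target node "t" and k boosting nodes.
    def build_spam_farm(k):
        graph = {"t": []}
        for i in range(k):
            b = "b%d" % i
            graph[b] = ["t"]       # boosting node -> target
            graph["t"].append(b)   # assumed back-link: target -> boosting node
        return graph

    farm = build_spam_farm(1000)
    n = len(farm)
    uniform = {x: 1.0 / n for x in farm}
    p = linear_pagerank(farm, uniform)
    # The target's score ends up orders of magnitude above the (1 - c)/n
    # baseline, showing why many low-value boosting nodes suffice.
    print(p["t"], (1 - 0.85) / n)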