Link Spam Detection Based on Mass Estimation ∗ Zoltan Gyongyi Pavel Berkhin Computer Science Department Yahoo! Inc. Stanford University 701 First Avenue Stanford, CA 94305, USA Sunnyvale, CA 94089, USA
[email protected] [email protected] Hector Garcia-Molina Jan Pedersen Computer Science Department Yahoo! Inc. Stanford University 701 First Avenue Stanford, CA 94305, USA Sunnyvale, CA 94089, USA
[email protected] [email protected] ABSTRACT tivity remains largely undetected by search engines, often Link spamming intends to mislead search engines and trig- manage to obtain very high rankings for their spam pages. ger an artificially high link-based ranking of specific target web pages. This paper introduces the concept of spam mass, This paper proposes a novel method for identifying the largest a measure of the impact of link spamming on a page’s rank- and most sophisticated spam farms, by turning the spam- ing. We discuss how to estimate spam mass and how the es- mers’ ingenuity against themselves. Our focus is on spam- timates can help identifying pages that benefit significantly ming attempts that target PageRank. We introduce the from link spamming. In our experiments on the host-level concept of spam mass, a measure of how much PageRank a Yahoo! web graph we use spam mass estimates to success- page accumulates through being linked to by spam pages. fully identify tens of thousands of instances of heavy-weight The target pages of spam farms, whose PageRank is boosted link spamming. by many spam pages, are expected to have a large spam mass. At the same time, popular reputable pages, which have high PageRank because other reputable pages point to 1.