NEIGHBORWATCHER: A Content-Agnostic Comment Spam Inference System Jialong Zhang and Guofei Gu SUCCESS Lab, Department of Computer Science & Engineering Texas A&M University, College Station, TX fjialong,
[email protected] Abstract the search rank of a target website than it should have. The rise of such spam causes unnecessary work for search en- Comment spam has become a popular means for spam- gine crawlers, annoys search engine users with poor search mers to attract direct visits to target websites, or to manip- results, and even often leads to phishing websites or mal- ulate search ranks of the target websites. Through posting ware drive-by downloads. In a recent study [15], Google a small number of spam messages on each victim website reports about 95,000 new malicious websites every day, (e.g., normal websites such as forums, wikis, guestbooks, which results in 12-14 million daily search-related warnings and blogs, which we term as spam harbors in this paper) but and 300,000 download alerts. spamming on a large variety of harbors, spammers can not To boost the ranking of the target websites, spammers only directly inherit some reputations from these harbors have already developed lots of spamdexing techniques [25] but also avoid content-based detection systems deployed on in the past few years, most of which also called Black Hat these harbors. To find such qualified harbors, spammers SEO (Search Engine Optimization). Text and Link manipu- always have their own preferred ways based on their avail- lations are two main SEO techniques frequently abused by able resources and the cost (e.g., easiness of automatic post- spammers to exploit the incorrect application of page rank ing, chances of content sanitization on the website).