A Quantitative Study of Forum Spamming Using Context-Based Analysis
Total Page:16
File Type:pdf, Size:1020Kb
A Quantitative Study of Forum Spamming Using Context-based Analysis Yuan Niu†, Yi-Min Wang‡, Hao Chen†, Ming Ma‡, and Francis Hsu† †University of California, Davis {niu,hchen,hsuf}@cs.ucdavis.edu ‡Microsoft Research, Redmond {ymwang,mingma}@microsoft.com Abstract forums range from the older message boards and guestbooks to the newer blogs and wikis. Since forums are designed to facilitate collaborative content Forum spamming has become a major means of search contribution, they become attractive targets for engine spamming. To evaluate the impact of forum spammers. For example, spammers have created spam spamming on search quality, we have conducted a blogs (splogs) and have injected spam comments into comprehensive study from three perspectives: that of legitimate forums (See Table A1 in Appendix for the search user, the spammer, and the forum hosting sample spammed forums). Compared to the site. We examine spam blogs and spam comments in “propagandist” methods of web spamming through the both legitimate and honey forums. Our study shows use of link farms 2 [16], forum spamming poses new that forum spamming is a widespread problem. challenges to search engines because (1) it is easy and Spammed forums, powered by the most popular often free to create forums and to post comments in software, show up in the top 20 search results for all processes that can often be automated, and (2) search the 189 popular keywords. On two blog sites, more engines cannot simply blacklist forums that contain than half (75% and 54% respectively) of the blogs are spam comments because these forums may be spam, and even on a major and reputably well 1 legitimate and contain valuable information. maintained blog site, 8.1% of the blogs are spam . The Anecdotal evidence as well as our own experience observation on our honey forums confirms that indicates that spammers have successfully promoted spammers target abandoned pages and that most their web sites in search results through forum comment spam is meant to increase page rank rather spamming. To combat forum spamming and to gain than generate immediate traffic. We propose context- insight on effective defense mechanisms, we need a based analyses, consisting of redirection and cloaking comprehensive study of the problem, such as its scale analysis, to detect spam automatically and to overcome and its popular techniques. shortcomings of content-based analyses. Our study This paper reports our quantitative study of forum shows that these analyses are very effective in spamming. We examine the scale of forum spamming identifying spam pages. from three different perspectives: that of the search user, the spammer, and the forum hosting site. We 1 Introduction examine both spam blogs, and spam comments in forums. For such a large-scale study, manual Search engine spamming (or search spamming or identification of spam would be unscalable, so we need web spamming [1]) refers to the practice of using automated approaches. Most existing automatic spam questionable search engine optimization techniques to detection approaches are based on content analysis. improve the ranking of web pages in search results. However, since spammers have discovered many ways When search spamming started, spammers created a to circumvent content-based analysis (such as large number of web pages with crafted keywords and plagiarized legitimate content, redirection 3, and link structures to promote the ranking of their web sites cloaking 4), we propose context-based analyses of in search results. As search engines develop more redirection and cloaking. sophisticated techniques to identify spam web pages, spammers are moving towards more fertile 2 A link farm consists of a group of sites using specific linking playgrounds: web forums . We define a web forum as a structures to boost rankings of one or more pages in the farm. web page where visitors may contribute content. Web 3 Redirection changes the user’s final destination or fetches dynamic content from other web sites. 4 Cloaking serves different content to different web visitors. For 1 We detected only redirection spam, so the actual percentage of example, it may serve browsers a page with spam content while spam blogs could be much higher. serving crawlers a page optimized for improving ranking. 1.1 Three perspectives of forum spamming based approach that uses URL-redirection and cloaking analysis . Our work was primarily motivated First, we examine forum spamming from the by two key observations: search user’s perspective: when a user searches • Many spam pages use cloaking and forums, how much spam shows up among the top redirection techniques [1,4] so that search results? To answer this question, we search for 190 top engines see different content than human keywords in forums powered by nine most popular users. A common technique is to serve page forum programs. We will describe our findings in content that the browser will dynamically Section 3.1.1. rewrite through script executions but the Second, we examine forum spamming from the crawler will not. Our approach is to treat each spammer’s perspective: we try to understand the page as a dynamic program, and to use a spammer’s modus operandi. To this end, we set up “monkey program” [6] to visit each page with three honey blogs, which should attract no legitimate a full fledged browser (so that the program comments, and have collected 41,100 comments over a can be executed in full fidelity) while year. We will analyze these spam comments in analyzing the redirection traffic. Section 3.1.2. • Many successful, large-scale spammers have Finally, we examine forum spamming from the created a huge number of doorway pages that forum hosting site’s perspective: how many forums are generate redirection traffic to a single domain spam on a hosting site, and what typical techniques do that serves the spam content. By identifying spammers apply to avoid detection? We examine over those domains that serve content to a large 20,000 blogs from four blog sites and will describe our number of doorway pages, we can catch findings in Section 3.2 and 3.3. major spammers’ domains together with all their doorway pages and doorway domains. 1.2 Context-based analysis To avoid exposing their spam domains directly 1.3 Contributions and being blacklisted by search engines, many We make the following contributions: spammers are creating doorway pages on free-hosting • domains and using their URLs in comment spamming. We conduct a comprehensive, quantitative When a user clicks a doorway-page link in search study of forum spamming. We examine this results, her browser is instructed to either redirect to or problem from three perspectives: that of the fetch ad listings from the actual spam domain or search user, the spammer, and the hosting site. advertising companies that serve the spammer as a • To overcome the limitation of content-based customer. For example, the comment-spammed URL spam detection, we propose context-based http://mywebpage.netscape.com/fendiblog/fendi- detection techniques that use URL redirection handbag.html redirects to the well-known domain tracing and cloaking analysis. We show that topsearch10.com . our context-based detection is effective in Many spammers set up doorway pages on free identifying spam pages automatically. forum hosting websites such as blogspot.com , Particularly, the two domains of a prolific blogstudio.com , forumsity.com , etc. Such doorway spammer, who created a large percentage of pages are a form of spam blogs. For example, as of this spam on two blog sites, appear in the top writing, http://seiko-diver-watch.blogspot.com results returned by our analysis tool. appeared as the top Yahoo! search result for “Seiko diver watch”, and http://forumsity.com/mobile/1/free- • Our study confirmed that forum spamming is motorola-ringtone.html appears as the top MSN search a widespread problem. The observation on result for “free motorola ringtone”. Figure 1 illustrates our honey forums confirms that spammers the end-to-end search spamming activity involving target abandoned pages and that most both spam blog creation and spam comment injection. comment spam is meant to increase page rank To overcome the limitations of content-based rather than generate immediate traffic. spam detection, we propose an orthogonal context- Spammed Pages #3: Search-engine Spamming “http://foobar.blogspot.com ” “http://foobar1.blogspot.com ” Spammed Forum … Search Engine “http://foobar.blogspot.com ” Spammed “http://foobar1.blogspot.com ” Spammer … Message Board Search Results “http://foobar.blogspot.com ” Spammed (1) http://foobar.blogspot.com “http://foobar1.blogspot.com ” #2: Comment Spamming Guestbook (2) http://GoodSite.com … (3) http://foobar1.blogspot.com (4) http://OKSite.com … Spam Blogs (Splogs) as Doorways #4: User Clicking http://good.blogspot.com Blogspot.com http://foobar.blogspot.com #5: URL Redirection Spammer Domain http://foobar1.blogspot.com Spammer’s content #1: Splog Creation (Usually advertisement) http://blogstudio.com/good Blogstudio.com http://blogstudio.com/foobar http://blogstudio.com/foobar1 Figure 1: Spam Blogs (splogs) and Comment Spamming. #1: The spammer creates splogs. #2: The spammer spams forums with the URLs of his splogs. #3: These splog URLs are ranked high in search results. #4: The search user clicks the splog URLs in the search results. #5: The splogs redirect the user to the spammer domain. syndicators and web-analytics servers, may crowd the Top Domain view because they serve a large number 2 Redirection-based spam detection of non-spam URLs. To filter out the noise, we scanned the top one million (mostly non-spam) click-through 2.1 URL redirection tracing and analysis URLs [6] to obtain a “known-good whitelist” of top third-party domains and filtered them out from the Top Our context-based spam detection process starts Domain view. The remaining ranked Top Domain list with the collection of a list of “URLs of interest”.