Characterizing Web Spam Using Content and HTTP Session Analysis

Steve Webb, James Caverlee, Calton Pu
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332

CEAS 2007 - Fourth Conference on Email and Anti-Spam, August 2-3, 2007, Mountain View, California USA

ABSTRACT
Web spam research has been hampered by a lack of statistically significant collections. In this paper, we perform the first large-scale characterization of web spam using content and HTTP session analysis techniques on the Webb Spam Corpus – a collection of about 350,000 web spam pages. Our content analysis results are consistent with the hypothesis that web spam pages are different from normal web pages, showing far more duplication of physical content and URL redirections. An analysis of session information collected during the crawling of the Webb Spam Corpus shows significant concentration of hosting IP addresses in two narrow ranges as well as significant overlaps among session header values. These findings suggest that content and HTTP session analysis may contribute a great deal towards future efforts to automatically distinguish web spam pages from normal web pages.

1. INTRODUCTION
Web spam has grown to a significant percentage of all web pages (between 13.8% and 22.1% of all web pages [2, 8]), threatening the dependability and usefulness of web-based information in a manner similar to how email spam has affected email. Unfortunately, previous research on the nature of web spam [2, 5, 8, 10, 11, 13] has suffered from the difficulties associated with manually classifying and separating web spam pages from legitimate pages. As a result, these previous studies have been limited to a few thousand web spam pages, which is insufficient for an effective content analysis (as customarily performed in email spam research).

In this paper, we provide the first large-scale experimental study of web spam pages by applying content and HTTP session analysis techniques to the Webb Spam Corpus [12] – a collection of almost 350,000 web spam examples that is two orders of magnitude larger than the collections used in previous evaluations. Our main hypothesis in this study is that web spam pages are fundamentally different from “normal” web pages. To evaluate this hypothesis, we characterize the content and HTTP session properties of web spam pages using a variety of methods. The web spam content analysis is composed of two parts. The first part quantifies the amount of duplication present among web spam pages. Previous studies [1, 4, 6] have shown that only about two thirds of all web pages are unique; thus, we expected to find a similar degree of duplication among our web spam pages. To evaluate duplication in the corpus, we constructed clusters (equivalence classes) of duplicate or near-duplicate pages. Based on the sizes of these equivalence classes, we discovered that duplication is twice as prevalent among web spam pages (i.e., only about one third of the pages are unique).

The second part of the content analysis focuses on a categorization of web spam pages. Specifically, we identify five important categories of web spam: Ad Farms, Parked Domains, Advertisements, Pornography, and Redirection. The Ad Farms and Parked Domains categories consist of pages that are comprised exclusively of advertising links. These pages exist solely to generate traffic for other sites and money for web spammers (through pay-per-click advertising programs). The Advertisements category contains pages that advertise specific products and services, and the pages in the Pornography category are pornographic in nature. The Redirection category consists of pages that employ various redirection techniques. Within the Redirection category, we identify seven redirection techniques (HTTP-level redirects, 3 HTML-based redirects, and 3 JavaScript-based redirects), and we find that 43.9% of web spam pages use some form of HTML or JavaScript redirection.

The third component of our research is an evaluation of the HTTP session information associated with web spam. First, we examine the IP addresses that hosted our web spam pages and find that 84% of the web spam pages were hosted on the 63.* – 69.* and 204.* – 216.* IP address ranges. Then, we evaluate the most commonly used HTTP session headers and values. As a result of this evaluation, we find that many web spam pages have similar values for numerous headers. For example, we find that 94.2% of the web spam pages with a “Server” header were hosted by Apache (63.9%) or Microsoft IIS (30.3%). These results are particularly interesting because they suggest that HTTP session information might be extremely valuable for automatically distinguishing between web spam pages and normal pages.

The rest of the paper is organized as follows. Section 2 describes our web spam corpus and summarizes its collection methodology. In Section 3, we report the results of a content analysis of web spam, which consists of two parts. The first part evaluates the amount of duplication that appears in web spam. The second part identifies concrete web spam categories and provides an extensive description of the redirection techniques being used by web spammers. In Section 4, we report the results of an analysis of web spam HTTP session information, which identifies the most common hosting IP addresses and HTTP header values associated with web spam. Section 5 summarizes related work, and Section 6 concludes the paper and provides future research directions.

2. THE WEBB SPAM CORPUS
In our previous research [12], we developed an automatic technique for obtaining web spam examples that leverages the presence of URLs in email spam messages. Specifically, we extracted almost 1.2 million unique URLs from more than 1.4 million email spam messages. Then, we built a crawler to obtain the web pages that corresponded to those URLs. Our crawler attempted to access each of the URLs; however, many of the URLs returned HTTP redirects (i.e., 3xx HTTP status codes). The crawler followed all of these redirects until it finally accessed a URL that did not return a redirect.

Our crawler obtained two types of information for every successfully accessed URL (including those that returned a redirect): the HTML content of the page identified by the URL and the HTTP session information associated with the page request transaction. As a result, we created a file for every successfully accessed URL that contains all of this information. After our crawling process was complete, we had 348,878 web spam pages and 223,414 redirect files (i.e., files that correspond to redirect responses). These files are collectively referred to as the Webb Spam Corpus, and they provide the basis for our analysis in this paper. For a more detailed description of our collection methodology and the format of the files in the Webb Spam Corpus, please consult [12].

We acknowledge that our collection of web spam examples is not representative of all web spam; however, it is two orders of magnitude larger than any other available source of web spam to date, and as such, it currently provides the most realistic snapshot of web spammer behavior. Thus, although the characteristics of our corpus might not be indicative of all web spam, our observations still provide extremely useful insights about the techniques being employed by web spammers.

3. CONTENT ANALYSIS
In this section, we provide the results of our large-scale analysis of web spam content. This analysis consists of two parts. The first part, discussed in Section 3.1, quantifies the

Figure 1: Number and size of the shingling clusters. (Axes: cluster size vs. number of clusters, both logarithmic from 1 to 100,000.)

In our previous work [12], we identified the existence of duplicate URLs in the corpus (i.e., multiple web spam pages with the same URL), and we explained that these duplicate URLs are the result of multiple unique HTTP redirect chains that lead to the same destination. Specifically, we found that the corpus contains 263,446 unique URLs, which means about one fourth of the web spam pages have a URL that is the same as one of the web spam pages in the remaining three fourths.

To identify content duplication, we computed MD5 hashes for the HTML content of all of the web spam pages in our corpus. After evaluating these results, we found 202,208 unique MD5 values. Thus, 146,670 of the web spam pages (42%) have the exact same HTML content as one of the pages in a collection of 202,208 unique web spam pages. Many of these duplicates are explained by the URL duplication that exists in the corpus (described above), but since each of the duplicate URLs represents a distinct entry point (i.e., a unique HTTP redirect chain) to a given page, we consider them to be functionally equivalent to content duplicates.

To evaluate the amount of near-duplication in our corpus, we used the shingling algorithm that was developed by Fetterly et al. [3, 4, 6] to construct equivalence classes of duplicate and near-duplicate web spam pages. First, we preprocessed every web spam page in the corpus. Specifically, the HTML tags in each page were replaced by white space, and every page was tokenized into a collection of words, where a word is defined as an uninterrupted series of alphanumeric characters.
