Understanding Website Behavior Based on User Agent
Kien Pham, Aécio Santos, Juliana Freire
Computer Science and Engineering, New York University

SIGIR '16, July 17-21, 2016, Pisa, Italy. DOI: http://dx.doi.org/10.1145/2911451.2914757

ABSTRACT

Web sites have adopted a variety of adversarial techniques to prevent web crawlers from retrieving their content. While it is possible to simulate users' behavior using a browser to crawl such sites, this approach is not scalable. Therefore, understanding existing adversarial techniques is important to design crawling strategies that can adapt to retrieve the content as efficiently as possible. Ideally, a web crawler should detect the nature of the adversarial policies and select the most cost-effective means to defeat them. In this paper, we discuss the results of a large-scale study of web site behavior based on their responses to different user-agents. We issued over 9 million HTTP GET requests to 1.3 million unique web sites from DMOZ using six different user-agents and the TOR network as an anonymous proxy. We observed that web sites do change their responses depending on user-agents and IP addresses. This suggests that probing sites for these features can be an effective means to detect adversarial techniques.

Keywords

User-agent String, Web Crawler Detection, Web Cloaking, Stealth Crawling

1. INTRODUCTION

There has been a proliferation of crawlers that roam the Web. In addition to crawler agents from major search engines, there are shopping bots that gather product prices, email harvesters that collect addresses for marketing companies and spammers, and malicious crawlers attempting to obtain information for cyber-criminals. These crawlers can overwhelm web sites and degrade their performance. In addition, they affect log statistics, leading to an overestimation of user traffic. There are also sites connected to illicit activities, such as selling illegal drugs and human trafficking, that want to avoid crawlers to minimize their exposure and eventual detection by law enforcement agents.

A number of strategies can be adopted to prevent crawler access. Sites can use the Robots Exclusion Protocol (REP) to regulate what web crawlers are allowed to crawl. But since the REP is not enforced, crawlers may ignore the rules and access the forbidden information. Web sites may also choose what content to return based on the client identification included in the user-agent field of the HTTP protocol. However, this string is not reliable, since the identification is not secure and one can easily spoof this information in the HTTP requests. More robust detection methods have been developed, including the use of web server logs to build classification models that learn the navigational patterns of web crawlers [7] and the adoption of mechanisms that detect human activity (e.g., embedding JavaScript code in pages to obtain evidence of mouse movement) [5].
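As a concrete illustration of how little effort spoofing takes, the following sketch (Python with the requests library; the target URL is a placeholder, and the browser string is the one listed in Table 1) issues a GET request that identifies the client as a desktop Chrome browser. Nothing in the protocol verifies this self-declared identity.

import requests

# The User-Agent header is entirely client-controlled; a crawler can
# present itself as a regular desktop browser with a single header.
BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36")

response = requests.get("http://example.com/",
                        headers={"User-Agent": BROWSER_UA},
                        timeout=30)
print(response.status_code, response.headers.get("Server", ""))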
From a crawler's perspective, this raises a new question: how to find and retrieve content from sites that adopt adversarial techniques. It is possible to automate a browser to simulate human actions, but this approach is not scalable. Since pages need to be rendered and the crawler must simulate a user's click behavior, crawler throughput would be significantly hampered. Ideally, this expensive technique should only be applied to sites that require it, provided that the crawler can identify adversarial techniques and adapt its behavior accordingly.

In this paper, we take a first step in this direction by studying how web sites respond to different web crawlers and user-agents. In addition, we are also interested in determining whether the response patterns are associated with specific topics or site types. We issued over 9M HTTP GET requests to more than 1.3M unique web sites (obtained from DMOZ) using different user-agents. The user-agents included a web browser, different types of web crawlers (e.g., search engine providers, well-known and less-known crawlers), and an invalid user-agent string. We also issued requests in which we masked our IP address using proxies from the TOR network. To reduce the risk that sites could identify our experiment, and to reduce the chance that the content changed between requests, requests with different user-agents were sent concurrently from independent machines (with different IP addresses). As we discuss in Section 3, the analysis of the responses uncovered many interesting facts and insights that are useful for designing adversarial crawling strategies at scale. For example, we observed that requests from less-known crawlers have a higher chance of success. In contrast, when a TOR proxy is used, not only are most requests unsuccessful, but there is also a large number of exceptions. Another important finding is that response patterns vary for different topics: sensitive topics result in a larger number of 403 (forbidden) responses.

The URLs used in the experiment, the source code, and the response headers of all requests are available at https://github.com/ViDA-NYU/user-agent-study.

Table 1: User-agent strings used in our experiments

  ACHE     Ache
  Bing     Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
  Google   Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  Browser  Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36
  Nutch    Nutch
  Empty    (empty string)

2. RELATED WORK

Many techniques have been proposed to detect web crawlers, for example log analysis, heuristic-based learning techniques, traffic pattern analysis, and human-detection tests [1, 5, 6, 7]. Although relying on the user-agent string is a naïve approach, our study shows that it is still used by web sites.

To regulate the behavior of crawlers, web sites can implement the Robots Exclusion Protocol (REP) in a file called robots.txt. There have been studies that measured both REP adoption and the extent to which crawlers respect the rules set forth by sites [4, 2]. These studies found that while REP adoption is high, crawler compliance is low.
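A crawler that does wish to comply with the REP can consult a site's robots.txt before fetching a URL. The sketch below is a minimal example using Python's standard urllib.robotparser module (the site URL, path, and agent names are illustrative); note that compliance remains entirely voluntary on the crawler's side.

from urllib import robotparser

# Download and parse the site's robots.txt rules.
rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# Check whether different user-agents may fetch a given URL.
for agent in ("Googlebot", "ACHE", "*"):
    print(agent, rp.can_fetch(agent, "http://example.com/private/page.html"))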
Cloaking is a technique whereby sites serve different content or URLs to humans and search engines. While there are legitimate uses for cloaking, this technique is widely used for spamming and presents a challenge for search engines [9, 8]. Cloaking detection usually requires a crawler to acquire at least two copies of a web page: one from a browser's view and one from a web crawler's view. In this paper, we also acquire multiple versions of pages returned to different user-agents.

[Figure 1: Percentage of URLs with conflicts and status code 403 for each topic. The bar chart compares the conflict URL percentage and the 403 URL percentage (0-10%) across DMOZ topics.]

3. EXPERIMENTS

3.1 Data Collection

We used DMOZ, a human-created directory of web sites, to collect URLs for our experiment. First, we selected URLs from all topics in DMOZ except the topic World, which contains the non-English URLs of DMOZ. This resulted in 1.9M URLs. Then, we grouped URLs by their web sites so that only one URL per web site is kept for the experiment. We do this to avoid hitting a web site multiple times, since this may lead a site to detect and potentially block our prober. After filtering, the original list was reduced to 1.3M URLs that were used in the experiment. Note that since each URL is associated with a DMOZ topic, we can analyze the behavior of web sites in particular topics.

3.2 Methodology

We spoofed the identity of our prober by changing the user-agent string in the HTTP requests, sending one request to each URL for each of the user-agents in Table 1; these requests were sent simultaneously from machines with different IP addresses. For each request, we stored the entire response header and the exception type. We ran a second experiment in which our prober used a TOR proxy. The HTTP requests were routed through the TOR network, and thus the sites could not identify our IP address, although they could detect that the request was issued from within the TOR network. In this experiment, we used ACHE as the user-agent. If a request fails and raises an exception, i.e., HTTPConnectionError or Timeout, we repeat the request at most 5 times until the web site responds. We set a delay of 20 minutes between requests to avoid web sites detecting our actions by looking at request frequency.
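To make the probing procedure concrete, the sketch below is a simplified Python illustration (using the requests library; the URL, retry handling, and local TOR SOCKS port are assumptions for illustration, not the study's actual code, which is available in the repository listed above). It fetches a URL with a chosen user-agent string, optionally routes the request through a local TOR SOCKS proxy, retries on connection errors or timeouts up to a bounded number of attempts, and records the status code, response headers, and any exception type.

import time
import requests

# Assumed local TOR SOCKS proxy; requires the requests[socks] extra.
TOR_PROXIES = {"http": "socks5h://127.0.0.1:9050",
               "https": "socks5h://127.0.0.1:9050"}

def probe(url, user_agent, use_tor=False, max_retries=5, delay_secs=20 * 60):
    """Fetch a URL with a given user-agent and record the outcome."""
    record = {"url": url, "user_agent": user_agent,
              "status": None, "headers": None, "exception": None}
    for _ in range(max_retries):
        try:
            resp = requests.get(url,
                                headers={"User-Agent": user_agent},
                                proxies=TOR_PROXIES if use_tor else None,
                                timeout=30)
            record["status"] = resp.status_code
            record["headers"] = dict(resp.headers)
            record["exception"] = None
            break
        except (requests.ConnectionError, requests.Timeout) as exc:
            # Remember the failure type and wait before the next attempt.
            record["exception"] = type(exc).__name__
            time.sleep(delay_secs)
    return record

# Example: probe one site with the ACHE user-agent string from Table 1.
print(probe("http://example.com/", "Ache"))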