Understanding Website Behavior Based on User Agent


Kien Pham, Aécio Santos, Juliana Freire
Computer Science and Engineering, New York University
[email protected], [email protected], [email protected]

SIGIR '16, July 17-21, 2016, Pisa, Italy. ACM ISBN 978-1-4503-4069-4/16/07. DOI: http://dx.doi.org/10.1145/2911451.2914757

ABSTRACT

Web sites have adopted a variety of adversarial techniques to prevent web crawlers from retrieving their content. While it is possible to simulate users' behavior using a browser to crawl such sites, this approach is not scalable. Therefore, understanding existing adversarial techniques is important to design crawling strategies that can adapt to retrieve the content as efficiently as possible. Ideally, a web crawler should detect the nature of the adversarial policies and select the most cost-effective means to defeat them. In this paper, we discuss the results of a large-scale study of web site behavior based on their responses to different user-agents. We issued over 9 million HTTP GET requests to 1.3 million unique web sites from DMOZ using six different user-agents and the TOR network as an anonymous proxy. We observed that web sites do change their responses depending on user-agents and IP addresses. This suggests that probing sites for these features can be an effective means to detect adversarial techniques.

Keywords: User-agent String, Web Crawler Detection, Web Cloaking, Stealth Crawling

1. INTRODUCTION

There has been a proliferation of crawlers that roam the Web. In addition to crawler agents from major search engines, there are shopping bots that gather product prices, email harvesters that collect addresses for marketing companies and spammers, and malicious crawlers attempting to obtain information for cyber-criminals. These crawlers can overwhelm web sites and degrade their performance. In addition, they affect log statistics, leading to an overestimation of user traffic. There are also sites connected to illicit activities, such as selling illegal drugs and human trafficking, that want to avoid crawlers to minimize their exposure and eventual detection by law enforcement agents. A number of strategies can be adopted to prevent crawler access. Sites can use the Robot Exclusion Protocol (REP) to regulate what web crawlers are allowed to crawl. But since the REP is not enforced, crawlers may ignore the rules and access the forbidden information. Web sites may also choose what content to return based on the client identification included in the user-agent field of the HTTP protocol. However, this string is not reliable, since the identification is not secure and one can easily spoof this information in HTTP requests. More robust detection methods have been developed, including the use of web server logs to build classification models that learn the navigational patterns of web crawlers [7] and the adoption of mechanisms that detect human activity (e.g., embedding JavaScript code in pages to obtain evidence of mouse movement) [5].
From a crawler's perspective, this raises a new question: how to find and retrieve content from sites that adopt adversarial techniques. It is possible to automate a browser to simulate human actions, but this approach is not scalable. Since pages need to be rendered and the crawler must simulate a user's click behavior, crawler throughput would be significantly hampered. Ideally, this expensive technique should only be applied to sites that require it, provided that the crawler can identify adversarial techniques and adapt its behavior accordingly.

In this paper, we take a first step in this direction by studying how web sites respond to different web crawlers and user-agents. In addition, we are also interested in determining whether the response patterns are associated with specific topics or site types. We issued over 9M HTTP GET requests to more than 1.3M unique web sites (obtained from DMOZ) using different user-agents. User-agents included a web browser, different types of web crawlers (e.g., search engine providers, well-known and less-known crawlers), and an invalid user-agent string. We also issued requests in which we masked our IP address using proxies from the TOR network. To reduce the risk that sites could identify our experiment, and to reduce the chance of the content changing between requests, requests with different user-agents were sent concurrently from independent machines (with different IP addresses). As we discuss in Section 3, the analysis of the responses uncovered many interesting facts and insights that are useful for designing adversarial crawling strategies at scale. For example, we observed that requests from less-known crawlers have a higher chance of success. In contrast, when a TOR proxy is used, not only are most requests unsuccessful, but there is also a large number of exceptions. Another important finding is that response patterns vary for different topics: sensitive topics result in a larger number of 403 (Forbidden) responses.

The URLs used in the experiment, the source code, and the response headers of all requests are available at https://github.com/ViDA-NYU/user-agent-study.

Table 1: User-agent strings used in our experiments
  ACHE:    Ache
  Bing:    Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
  Google:  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  Browser: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36
  Nutch:   Nutch
  Empty:   (empty user-agent string)

2. RELATED WORK

Many techniques have been proposed to detect web crawlers, for example log analysis, heuristic-based learning techniques, traffic pattern analysis, and human-detection tests [1, 5, 6, 7]. Although relying on the user-agent string is a naïve approach, our study shows that it is still used by web sites.

To regulate the behavior of crawlers, web sites can implement the Robots Exclusion Protocol (REP) in a file called robots.txt. There have been studies that measured both REP adoption and the extent to which crawlers respect the rules set forth by sites [4, 2]. These studies found that while REP adoption is high, crawler compliance is low.
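To make the REP mechanics mentioned above concrete, the sketch below (our own illustration in Python, with a placeholder domain) reads a site's robots.txt with the standard urllib.robotparser module and asks whether particular user-agents may fetch a page; since nothing enforces the answer, a crawler can simply ignore it, which is exactly the compliance gap these studies report.

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt (placeholder domain).
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.org/robots.txt")
rp.read()

# REP rules are expressed per user-agent token; a polite crawler checks its
# own token before fetching, while an adversarial one can skip this step entirely.
for agent in ["Googlebot", "ACHE", "*"]:
    allowed = rp.can_fetch(agent, "https://www.example.org/some/page.html")
    print(f"{agent}: fetch allowed = {allowed}")
```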
Cloaking is a technique whereby sites serve different content or URLs to humans and search engines. While there are legitimate uses for cloaking, this technique is widely used for spamming and presents a challenge for search engines [9, 8]. Cloaking detection usually requires a crawler to acquire at least two copies of a web page: one from a browser's view and one from a web crawler's view. In this paper, we also acquire multiple versions of the pages returned to different user-agents.

3. EXPERIMENTS

3.1 Data Collection

We used DMOZ, a human-created directory of web sites, to collect URLs for our experiment. First, we selected URLs from all topics in DMOZ except the topic World, which contains the non-English URLs of DMOZ. This resulted in 1.9M URLs. Then, we grouped URLs by their web sites so that only one URL per web site was kept for the experiment. We do this to avoid hitting a web site multiple times, since this may lead a site to detect and potentially block our prober. After filtering, the original list was reduced to 1.3M URLs that were used in the experiment. Note that since each URL is associated with a DMOZ topic, we can analyze the behavior of web sites in particular topics.

[Figure 1: Percentage of URLs with conflicts and status code 403 for each topic. The chart reports, for each DMOZ topic (All topics, News, Regional, Society, Reference, Health, Home, Business, Arts, Recreation, Sports, Computers, Shopping, Games, Science), the conflict URL percentage and the 403 URL percentage.]

3.2 Methodology

We spoofed the identity of our prober by changing the user-agent string of the HTTP requests to the values listed in Table 1; these requests were sent simultaneously from machines with different IP addresses. For each request, we stored the entire response header and exception type. We ran a second experiment in which our prober used a TOR proxy. The HTTP requests were routed through the TOR network, and thus the sites could not identify our IP address, although they could detect that the request was issued from within the TOR network. In this experiment, we used ACHE as the user-agent. If a request fails and raises an exception, i.e., HTTPConnectionError or Timeout, we repeat the request at most 5 times until the web site responds. We set a delay of 20 minutes between requests to avoid the web site detecting our actions by looking at request frequency.
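As a sketch of the probing procedure described in Section 3.2 (not the authors' released code, which is linked above), the following Python function issues a request under a chosen user-agent string and retries on connection errors or timeouts with a fixed delay; the use of the requests library, the function names, and the placeholder URL are our assumptions.

```python
import time
import requests

USER_AGENTS = {
    "ACHE": "Ache",
    "Google": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Empty": "",
}

def probe(url, user_agent, max_retries=5, delay_seconds=20 * 60):
    """Request `url` with a spoofed User-Agent header, retrying on failures.

    Returns (status_code, headers) on success, or (exception_name, {}) once
    the retries are exhausted.
    """
    last_error = "no attempt made"
    for _ in range(max_retries):
        try:
            resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
            return resp.status_code, dict(resp.headers)
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout) as exc:
            last_error = type(exc).__name__
            time.sleep(delay_seconds)  # long delay so the retry pattern is not obvious to the site
    return last_error, {}

# Example: probe one site under the "Google" identity (placeholder URL).
print(probe("http://www.example.org/", USER_AGENTS["Google"]))
```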
Recommended publications
  • Browser Versions Carry 10.5 Bits of Identifying Information on Average [Forthcoming Blog Post]
    Browser versions carry 10.5 bits of identifying information on average [forthcoming blog post] Technical Analysis by Peter Eckersley This is part 3 of a series of posts on user tracking on the modern web. You can also read part 1 and part 2. Whenever you visit a web page, your browser sends a "User Agent" header to the website saying what precise operating system and browser you are using. We recently ran an experiment to see to what extent this information could be used to track people (for instance, if someone deletes their browser cookies, would the User Agent, alone or in combination with some other detail, be enough to re-create their old cookie?). Our experiment to date has shown that the browser User Agent string usually carries 5-15 bits of identifying information (about 10.5 bits on average). That means that on average, only one person in about 1,500 (2^10.5) will have the same User Agent as you. On its own, that isn't enough to recreate cookies and track people perfectly, but in combination with another detail like an IP address, geolocation to a particular ZIP code, or having an uncommon browser plugin installed, the User Agent string becomes a real privacy problem. User Agents: An Example of Browser Characteristics Doubling As Tracking Tools When we analyse the privacy of web users, we usually focus on user accounts, cookies, and IP addresses, because those are the usual means by which a request to a web server can be associated with other requests and/or linked back to an individual human being, computer, or local network.
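    The "one person in about 1,500" figure follows directly from the bits-of-information framing: a property carried by a fraction p of users conveys -log2(p) bits. The snippet below is only an illustration of that arithmetic (it is not EFF's measurement code).

```python
import math

def surprisal_bits(fraction_of_users):
    """Self-information (in bits) of a property shared by the given fraction of users."""
    return -math.log2(fraction_of_users)

# 10.5 bits corresponds to roughly one user in 2**10.5, which is about 1,448,
# rounded in the post to "one person in about 1,500".
print(2 ** 10.5)                  # ~1448.15
print(surprisal_bits(1 / 1500))   # ~10.55 bits
```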
  • Longitudinal Characterization of Browsing Habits
    Please cite this paper as: Luca Vassio, Idilio Drago, Marco Mellia, Zied Ben Houidi, and Mohamed Lamine Lamali. 2018. You, the Web, and Your Device: Longitudinal Characterization of Browsing Habits. ACM Trans. Web 12, 4, Article 24 (November 2018), 30 pages. https://doi.org/10.1145/3231466 You, the Web and Your Device: Longitudinal Characterization of Browsing Habits. LUCA VASSIO, Politecnico di Torino; IDILIO DRAGO, Politecnico di Torino; MARCO MELLIA, Politecnico di Torino; ZIED BEN HOUIDI, Nokia Bell Labs; MOHAMED LAMINE LAMALI, LaBRI, Université de Bordeaux. Understanding how people interact with the web is key for a variety of applications – e.g., from the design of effective web pages to the definition of successful online marketing campaigns. Browsing behavior has been traditionally represented and studied by means of clickstreams, i.e., graphs whose vertices are web pages, and edges are the paths followed by users. Obtaining large and representative data to extract clickstreams is however challenging. The evolution of the web questions whether browsing behavior is changing and, by consequence, whether properties of clickstreams are changing. This paper presents a longitudinal study of clickstreams from 2013 to 2016. We evaluate an anonymized dataset of HTTP traces captured in a large ISP, where thousands of households are connected. We first propose a methodology to identify actual URLs requested by users from the massive set of requests automatically fired by browsers when rendering web pages. Then, we characterize web usage patterns and clickstreams, taking into account both the temporal evolution and the impact of the device used to explore the web.
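    As an illustration of the clickstream representation described above (a directed graph whose vertices are pages and whose edge weights count user transitions), here is a minimal sketch with invented log records; the real methodological difficulty addressed by the paper, separating user-initiated clicks from requests fired automatically during rendering, is not shown.

```python
from collections import Counter

# Toy (referer, requested_url) pairs standing in for one user's HTTP trace.
requests_log = [
    ("news.example.com/home", "news.example.com/article1"),
    ("news.example.com/article1", "shop.example.com/product"),
    ("news.example.com/home", "news.example.com/article2"),
]

# Clickstream graph: vertices are pages, weighted edges are observed transitions.
edges = Counter(requests_log)
vertices = {page for pair in requests_log for page in pair}

print(f"{len(vertices)} pages, {len(edges)} distinct transitions")
for (src, dst), weight in edges.items():
    print(f"{src} -> {dst} (count: {weight})")
```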
  • RSA Adaptive Authentication (On-Premise) 7.2 Integration Guide
    RSA® Adaptive Authentication (On-Premise) 7.2 Integration Guide Contact Information Go to the RSA corporate website for regional Customer Support telephone and fax numbers: www.emc.com/domains/rsa/index.htm Trademarks RSA, the RSA Logo, BSAFE and EMC are either registered trademarks or trademarks of EMC Corporation in the United States and/or other countries. All other trademarks used herein are the property of their respective owners. For a list of EMC trademarks, go to www.emc.com/legal/emc-corporation-trademarks.htm#rsa. License agreement This software and the associated documentation are proprietary and confidential to EMC, are furnished under license, and may be used and copied only in accordance with the terms of such license and with the inclusion of the copyright notice below. This software and the documentation, and any copies thereof, may not be provided or otherwise made available to any other person. No title to or ownership of the software or documentation or any intellectual property rights thereto is hereby transferred. Any unauthorized use or reproduction of this software and the documentation may be subject to civil and/or criminal liability. This software is subject to change without notice and should not be construed as a commitment by EMC. Note on encryption technologies This product may contain encryption technology. Many countries prohibit or restrict the use, import, or export of encryption technologies, and current use, import, and export regulations should be followed when using, importing or exporting this product. Distribution Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
  • Lab Exercise – HTTP
    Lab Exercise – HTTP. Objective: HTTP (HyperText Transfer Protocol) is the main protocol underlying the Web. HTTP functions as a request–response protocol in the client–server computing model. A web browser, for example, may be the client and an application running on a computer hosting a website may be the server. The client submits an HTTP request message to the server. The server, which provides resources such as HTML files and other content, or performs other functions on behalf of the client, returns a response message to the client. The response contains completion status information about the request and may also contain requested content in its message body. A web browser is an example of a user agent (UA). Other types of user agent include the indexing software used by search providers (web crawlers), voice browsers, mobile apps, and other software that accesses, consumes, or displays web content. HTTP is designed to permit intermediate network elements to improve or enable communications between clients and servers. High-traffic websites often benefit from web cache servers that deliver content on behalf of upstream servers to improve response time. Web browsers cache previously accessed web resources and reuse them when possible to reduce network traffic. HTTP is an application layer protocol designed within the framework of the Internet protocol suite. Its definition presumes an underlying and reliable transport layer protocol, and Transmission Control Protocol (TCP) is commonly used. Step 1: Open the http Trace. Browser behavior can be quite complex, using more HTTP features than the basic exchange; this trace will show us how much gets transferred.
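    To make the request–response exchange concrete, the sketch below performs one HTTP transaction with Python's standard http.client module, printing the status line, the response headers, and the body size; the host and the User-Agent value are placeholders, not part of the lab handout.

```python
import http.client

# The client (a user agent) sends one request message over a TCP connection;
# the server replies with a status code, headers, and an optional message body.
conn = http.client.HTTPConnection("www.example.com", 80, timeout=10)
conn.request("GET", "/", headers={"User-Agent": "lab-exercise-client/1.0"})
response = conn.getresponse()

print(response.status, response.reason)         # e.g. 200 OK
for name, value in response.getheaders():       # response headers (Content-Type, caching, ...)
    print(f"{name}: {value}")
body = response.read()                           # the message body (HTML for a web page)
print(len(body), "bytes of content")
conn.close()
```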
  • Working with User Agent Strings in Stata: the Parseuas Command
    JSS Journal of Statistical Software, February 2020, Volume 92, Code Snippet 1. doi: 10.18637/jss.v092.c01. Working with User Agent Strings in Stata: The parseuas Command. Joss Roßmann (GESIS – Leibniz Institute for the Social Sciences), Tobias Gummer (GESIS – Leibniz Institute for the Social Sciences), Lars Kaczmirek (University of Vienna). Abstract: With the rising popularity of web surveys and the increasing use of paradata by survey methodologists, assessing information stored in user agent strings becomes inevitable. These data contain meaningful information about the browser, operating system, and device that a survey respondent uses. This article provides an overview of user agent strings, their specific structure and history, how they can be obtained when conducting a web survey, as well as what kind of information can be extracted from the strings. Further, the user-written command parseuas is introduced as an efficient means to gather detailed information from user agent strings. The application of parseuas is illustrated by an example that draws on a pooled data set consisting of 29 web surveys. Keywords: user agent string, web surveys, device detection, browser, paradata, Stata. 1. Introduction: In recent years, web surveys have become an increasingly popular and important data collection mode, and today, they account for a great share of the studies in the social sciences. Web surveys have been used to complement traditional offline surveys as a less expensive form of pre-test or a mode of choice for respondents who are not willing to use "traditional" modes, such as face-to-face or telephone interviews. Along with the rising popularity of web surveys, survey methodologists have taken an increasing interest in paradata, which are collected as a byproduct of the survey process (Couper 2000).
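    The following fragment is not the Stata parseuas syntax; it is a rough Python illustration of the kind of information a user agent string exposes (browser, version, platform), using a sample string and regular expressions that are our own choices.

```python
import re

ua = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36")

# Crude token extraction; real parsers such as parseuas rely on large curated
# rule sets. Note that Chrome's string also advertises Safari, so naive
# matching can easily misclassify browsers.
browser = re.search(r"(Chrome|Firefox|MSIE|Edge|Safari)/([\d.]+)", ua)
platform = re.search(r"\(([^)]*)\)", ua)

if browser:
    print("Browser:", browser.group(1), browser.group(2))
else:
    print("Browser: unknown")
print("Platform:", platform.group(1) if platform else "unknown")
```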
  • Why Is XHTML Needed? Isn't HTML Good Enough?
    Why is XHTML needed? Isn't HTML good enough? HTML is probably the most successful document markup language in the world. But when XML was introduced, a two-day workshop was organised to discuss whether a new version of HTML in XML was needed. The opinion at the workshop was a clear 'Yes': with an XML-based HTML, other XML languages could include bits of XHTML, and XHTML documents could include bits of other markup languages. We could also take advantage of the redesign to clean up some of the more untidy parts of HTML, and add some new needed functionality, like better forms. What are the advantages of using XHTML rather than HTML? If your document is just pure XHTML 1.0 (not including other markup languages) then you will not yet notice much difference. However as more and more XML tools become available, such as XSLT for transforming documents, you will start noticing the advantages of using XHTML. XForms for instance will allow you to edit XHTML documents (or any other sort of XML document) in simple controllable ways. Semantic Web applications will be able to take advantage of XHTML documents. If your document is more than XHTML 1.0, for instance including MathML, SMIL, or SVG, then the advantages are immediate: you can't do that sort of thing with HTML. Can I just put the XML declaration on top of existing HTML documents? Can I intermix HTML 4.01 and XHTML documents? No. HTML is not in XML format. You have to make the changes necessary to make the document proper XML before you can get it accepted as XML.
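    The point that HTML 4.01 cannot simply be relabeled as XML can be demonstrated with any XML parser; the sketch below, an illustration using Python's standard library rather than anything from the original FAQ, shows a typical HTML fragment being rejected until its tags are closed in XHTML style.

```python
import xml.etree.ElementTree as ET

html_fragment = "<p>Two lines<br>of text"         # acceptable HTML, not well-formed XML
xhtml_fragment = "<p>Two lines<br/>of text</p>"   # tags closed, well-formed XML

for label, doc in [("HTML", html_fragment), ("XHTML", xhtml_fragment)]:
    try:
        ET.fromstring(doc)
        print(label, "parses as XML")
    except ET.ParseError as err:
        print(label, "rejected by the XML parser:", err)
```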
  • Economic and Technical Drivers of Technology Choice: Browsers
    Economic and Technical Drivers of Technology Choice: Browsers (Working paper: please do not cite without permission from authors). Timothy F. Bresnahan and Pai-Ling Yin. First Draft: Nov 05, 2003. Abstract: The diffusion of new technologies is their adoption by different economic agents at different times. A classical concern in the diffusion of technologies (Griliches, 1957) is the importance of raw technical progress versus economic forces. We examine this classical issue in the modern market of web browsers. Using a new data source, we study browser brand shares and the diffusion of new browser versions. Features of that market also generate novel questions in the economics of diffusion. We find that browser distribution via the expanding market for a complementary technology, personal computers (PCs), had a larger effect on the rate and direction of technical change than technical browser improvements. Timothy F. Bresnahan, Landau Professor in Technology and the Economy, Stanford University, Department of Economics, 579 Serra St., Stanford, CA 94305 USA. Pai-Ling Yin, Harvard Business School, Morgan Hall 241, Soldiers Field, Boston MA 02163. [email protected] [email protected] 1) Introduction: A new invention creates a technological opportunity. The diffusion of the new technology to the economic agents who will use it determines the rate and direction of realized technical change in the economy. We have known since the work of Griliches (1957) that both economic and technical forces shape diffusion. Invention is only the beginning of technical progress. Inventors introduce new technologies into the field and start their diffusion. The movement of the overall economy toward realizing a technological opportunity depends on the behavior of the adopters of the technology.
  • Information Systems Modeling
    Information systems modeling Tomasz Kubik Design Pattern “Each pattern describes a problem which occurs over and over again in our environment and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it in the same way twice” Ch.W. Alexander, S. Ishikawa, M. Silverstein, M. Jacobson, I. Fiksdahl-King, S. Angel: A Pattern Language: Towns, Buildings, Construction, 1977 Design patterns are “descriptions of communicating objects and classes that are customized to solve a general design problem in a particular context.” E. Gamma, R. Helm, R. Johnson, J. Vlissides (Gang of Four): Design Patterns: Elements of Reusable Object-Oriented Software, 1994 Design patterns can be expressed at various abstraction levels – OO programming (like GoF patterns) • Structure, Creational, and Behavioral – Design and implementation of the multi-tier software • Presentation, Business and Integration Design Pattern “All well-structured systems are full of patterns. A pattern provides a good solution to a common problem in a given context. A mechanism is a design pattern that applies to a society of classes; a framework is typically an architectural pattern that provides an extensible template for applications within a domain. You use patterns to specify mechanisms and frameworks that shape the architecture of your system. You make a pattern approachable by clearly identifying the slots, tabs, knobs, and dials that a user of that pattern may adjust to apply the pattern
  • HTML, CSS, & Javascript Mobile Development for Dummies
    HTML, CSS & JavaScript Mobile Development For Dummies. Master the art of designing web pages for mobile devices: a site for small screens! When designing a web page for mobile devices, the big thing is — think small! Your objective is to provide what the mobile user wants without losing the “wow” in your website. This book shows you how to do it using three key technologies. Soon you'll be building mobile pages with forms, quizzes, appropriate graphics, shopping carts, and more!
    • Think mobile — consider screen size, lack of a mouse, dual orientation screens, and mobile browsers
    • Know your audience — understand how people use the mobile web and how their habits differ from those of desktop users
    • Get interactive — optimize multimedia files and develop contact forms that encourage visitors to interact with your site
    • Latest and greatest — maximize the new features of HTML5 and CSS3, automate your site with JavaScript, and use WebKit Extensions
    • Be sure they find you — make your mobile site both easily searchable and search engine-friendly
    Open the book and find: why you should know WURFL; a system for keeping your site up to date; all about bitmap and vector images; easy ways to adjust your site for different devices; powerful SEO ideas to get your site noticed; tips for creating a mobile shopping cart; how to take your blog theme mobile; ten mobile CSS-friendly apps and widgets.
    Learn to use standard web tools to build sites for iPhone, iPad, BlackBerry, and Android platforms. Visit the companion website at www.wiley.com/go/htmlcssjscriptmobiledev for code samples you can use when creating your mobile sites. Go to Dummies.com for videos, step-by-step examples, how-to articles, or to shop!
  • Economic and Technical Drivers of Technology Choice: Browsers
    Economic and Technical Drivers of Technology Choice: Browsers. Timothy F. Bresnahan and Pai-Ling Yin. First Draft: Nov 05, 2003. This Draft: August 12, 2005. Abstract: The diffusion of new technologies is their adoption by different economic agents at different times. A classical concern in the diffusion of technologies (Griliches 1957) is the importance of raw technical progress versus economic forces. We examine this classical issue in a modern market, web browsers. Using a new data source, we study the diffusion of new browser versions. In a second analysis, we study the determination of browser brand shares. Both analyses let us examine the importance of technical progress vs. economic forces. We find that the critical economic force was browser distribution with a complementary technology, personal computers (PCs). In both of our analyses, distribution had a larger effect on the rate and direction of technical change than technical browser improvements. This shows an important feedback. Widespread use of the Internet spurred rapid expansion of the PC market in the late 1990s. Our results show that the rapid expansion in PCs in turn served to increase the pace of diffusion of new browsers and thus move the economy toward new mass market commercial Internet use. Timothy F. Bresnahan, Landau Professor in Technology and the Economy, Stanford University, Department of Economics, 579 Serra St., Stanford, CA 94305 USA. Pai-Ling Yin, Harvard Business School, Morgan Hall 241, Soldiers Field, Boston MA 02163. [email protected] [email protected] 1) Introduction: A new invention creates a technological opportunity. The diffusion of the new technology to the economic agents who will use it determines the rate and direction of realized technical change in the economy.
  • HTTP Transactions
    3 HTTP Transactions. HTTP TRAFFIC CONSISTS OF REQUESTS AND RESPONSES. All HTTP traffic can be associated with the task of requesting content or responding to those requests. Every HTTP message sent from a Web browser to a Web server is classified as an HTTP request, whereas every message sent from a Web server to a Web browser is classified as an HTTP response. HTTP is often referred to as a stateless protocol. Although this is accurate, it does little to explain the nature of the Web. All this means, however, is that each transaction is atomic, and there is nothing required by HTTP that associates one request with another. A transaction refers to a single HTTP request and the corresponding HTTP response. Another fundamental topic related to the nature of the Web is the topic of connections. Connections: When I speak of a connection in HTTP, I refer to a TCP connection. As illustrated in Figure 3.1, a TCP connection requires three separate messages: TCP SYN, TCP SYN + ACK, TCP ACK. (Figure 3.1: A TCP connection requires three messages.) SYN and ACK are two flags within the TCP segment of a packet. Because TCP is such a common transport layer protocol to be used in conjunction with IP, the combined packet of an IP packet containing a TCP segment is sometimes called a TCP/IP packet, even though it would best be described as a packet within a packet. By this example, you can see that a connection is unlike what you might otherwise expect. After this exchange, both computers simply consider themselves connected.
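    As a rough companion to the chapter's description, the sketch below opens a TCP connection (the three-message handshake happens inside create_connection) and only then writes the bytes of a single HTTP request; the target host is a placeholder and the code is our illustration, not the book's.

```python
import socket

# The TCP three-way handshake (SYN, SYN+ACK, ACK) is performed during connect();
# only after the connection exists is the HTTP request message sent.
with socket.create_connection(("www.example.com", 80), timeout=10) as sock:
    request = (
        "GET / HTTP/1.1\r\n"
        "Host: www.example.com\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    sock.sendall(request.encode("ascii"))

    # Read the HTTP response: status line, headers, then the body.
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

print(response.split(b"\r\n", 1)[0].decode())   # e.g. HTTP/1.1 200 OK
```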
  • 3 User-Agent 17 3.1 Format
    Masaryk University, Faculty of Informatics. Detection of network attacks using HTTP related information. Master's Thesis, Lenka Kuníková, Brno, Spring 2017. This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document. Declaration: Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source. Lenka Kuníková. Advisor: RNDr. Pavel Minařík, PhD. Acknowledgement: I would like to thank my advisor RNDr. Pavel Minařík, PhD. and Mgr. Martin Juřen for guidance and useful advice. Abstract: This thesis deals with extended HTTP network flows and their application for the detection of various attacks and anomalies on the network. It highlights advantages of extended HTTP flows on chosen attacks, implements and tests existing detection methods, and suggests numerous improvements. Furthermore, the thesis analyses the User-Agent request header in detail. It describes how this field can be used for anomaly detection and explains problems related to User-Agent analysis. Keywords: Flow monitoring, HTTP, Anomaly detection, User-Agent, Brute-force attack, SQL injection. Contents: 1 Introduction; 2 HTTP; 2.1 Basic concepts; 2.2 URI; 2.3 Message format; 2.3.1 HTTP request .
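    To illustrate the kind of User-Agent screening the abstract alludes to, here is a small rule-based sketch over HTTP flow records; the tool signatures, length threshold, and sample values are illustrative assumptions, not rules taken from the thesis.

```python
# Illustrative screening of User-Agent values seen in HTTP flows.
SUSPICIOUS_TOKENS = ("sqlmap", "nikto", "masscan", "python-requests")

def flag_user_agent(ua):
    """Return a list of reasons why this User-Agent value looks anomalous."""
    reasons = []
    if not ua.strip():
        reasons.append("empty User-Agent")
    elif len(ua) < 10:
        reasons.append("unusually short User-Agent")
    if any(token in ua.lower() for token in SUSPICIOUS_TOKENS):
        reasons.append("known tool signature")
    return reasons

samples = [
    "",
    "sqlmap/1.1.8#stable (http://sqlmap.org)",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0 Safari/537.36",
]
for ua in samples:
    print(repr(ua[:40]), "->", flag_user_agent(ua) or "no flags")
```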