The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing

David Zeber (Mozilla, [email protected]), Sarah Bird (Mozilla, [email protected]), Camila Oliveira (Mozilla, [email protected]), Walter Rudametkin (Univ. Lille / Inria, [email protected]), Ilana Segall (Mozilla, [email protected]), Fredrik Wollsén (Mozilla, [email protected]), Martin Lopatka (Mozilla, [email protected])

ABSTRACT
Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at relatively low infrastructure cost and don't require handling sensitive user data such as browsing histories. However, the biases introduced by using crawls as a proxy for human browsing data have not been well studied. Crawls may fail to capture the diversity of user environments, and the snapshot view of the Web presented by one-time crawls does not reflect its constantly evolving nature, which hinders reproducibility of crawl-based studies. In this paper, we quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage. We quantify baseline variation of simultaneous crawls, then isolate the effects of time, cloud IP address vs. residential, and operating system. This provides a foundation to assess the agreement between crawls visiting a standard list of high-traffic websites and actual browsing behaviour measured from an opt-in sample of over 50,000 users of the Firefox Web browser. Our analysis reveals differences between the treatment of stateless crawling infrastructure and generally stateful human browsing, showing, for example, that crawlers tend to experience higher rates of third-party activity than human browser users when loading pages from the same domains.

CCS CONCEPTS
• Information systems → Online advertising; Web mining; Traffic analysis; Data extraction and integration; Browsers.

KEYWORDS
Web Crawling, Tracking, Online Privacy, Browser Fingerprinting, World Wide Web

ACM Reference Format:
David Zeber, Sarah Bird, Camila Oliveira, Walter Rudametkin, Ilana Segall, Fredrik Wollsén, and Martin Lopatka. 2020. The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing. In Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3366423.3380104

© 2020 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. This is the author's version of the work; it is posted here for your personal use, not for redistribution. The definitive Version of Record was published in Proceedings of The Web Conference 2020 (WWW '20), https://doi.org/10.1145/3366423.3380104.

1 INTRODUCTION
The nature, structure, and influence of the Web have been the subject of an overwhelming body of research. However, its rapid rate of evolution and sheer magnitude have outpaced even the most advanced tools used in its study [25, 44]. The expansive scale of the modern Web has necessitated increasingly sophisticated strategies for its traversal [6, 10]. Currently, computationally viable crawls are generally based on representative sampling of the portion of the Web most seen by human traffic [23, 47], as indicated by top-site lists such as Alexa [4] or Tranco [27].
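For illustration, defining a crawl list from such a ranking typically amounts to taking its first N entries. The following is a minimal sketch, assuming the list has been downloaded in the rank,domain CSV format in which both Alexa and Tranco lists are distributed; the filename and function name are placeholders, not artifacts of this study.

    import csv

    def load_top_sites(path, n=1000):
        """Return the first n domains from a rank,domain CSV ranking file."""
        domains = []
        with open(path, newline="") as f:
            for row in csv.reader(f):
                if not row:
                    continue  # skip any blank lines
                domains.append(row[1])  # column 0 is the rank
                if len(domains) >= n:
                    break
        return domains

    # e.g., seed a crawl with the 10,000 highest-ranked sites:
    # crawl_list = load_top_sites("top-1m.csv", n=10_000)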
Web crawls are used in a wide variety of research, including fundamental research on the nature of the Web [32], training language models [21], social network analysis [33], and medical research [60]. In particular, they are a core tool in privacy research, having been used in numerous studies (see Section 2.2) describing and quantifying online tracking. However, little is known about the representativeness of such crawls compared to the experience of human users browsing the Web, nor about how much they depend on the environment from which they are run. In order to contextualize the information obtained from Web crawls, this study evaluates the efficacy, reliability, and generalisability of the use of Web crawlers [25] as a surrogate for user interaction with the Web in the context of online tracking. The main contributions of this paper are:
• A systematic exploration of the sources of variation between repeated Web crawls under controlled conditions
• Quantification of the variation attributable to the dynamic nature of the Web versus that resulting from client-specific environment and contextual factors
• An assessment of the degree to which website ranking lists, often used to define crawl strategies, are representative of Web traffic measured over a set of real Web browsing clients
• A large-scale comparison between Web crawls and the experience of human browser users over a dataset of 30 million site visits across 50,000 users.
These results present a multifaceted investigation towards answering the question: when a crawler visits a specific page, is it served a comparable Web experience to that of a human user performing an analogous navigation?

2 BACKGROUND AND MOTIVATIONS
Seminal research into the graph-theoretical characteristics of the Web simplified its representation by assuming static relationships (edges) between static pages (nodes). Increasingly sophisticated Web crawlers [25] were deployed to traverse hyperlinkage structures, providing insights into its geometric characteristics [9]. Subsequent research has emphasized the importance of studying Internet topology [18, 31, 54]. Models accounting for the dynamic nature of the Web, and the morphological consequences thereof [31], have encouraged an emphasis on higher-traffic websites in the context of an increasingly nonuniform and ever-growing Web. Applied research into specific characteristics of the Web [5], including the study of the tracking ecosystem that underlies online advertising [17, 50], has leveraged Web crawlers to study an immense and diverse ecosystem of Web pages.
Indeed, much of our understanding of how the Web works, and of the privacy risks associated with it, results from research based on large-scale crawls. Yet most crawls are performed only once, providing a snapshot view of the Web's inner workings. Moreover, crawls are most often performed using dedicated frameworks or […]
[…] on the objectives of the crawl and are an active area of discussion [7, 48]. These are, however, outside the scope of this paper.
There are a multitude of crawler implementations that use different technologies with different tradeoffs. Simple crawlers are fast but do not run JavaScript, a central component of the modern Web environment. Other crawlers require more resources but are better at simulating a user's browser. The simplest forms of crawlers are scripts that use curl or wget to fetch a webpage's contents. More complete crawlers rely on frameworks like Scrapy [51], a Python-based framework for crawling and scraping data. Selenium [11] goes a step further by providing plugins to automate popular browsers, including Firefox, Chrome, Opera and Safari. Its popularity has led to the W3C WebDriver standard [12]. In fact, Firefox and Chrome now provide headless modes that allow them to be run fully automated from the command line, without the need for a graphical interface. Libraries such as Puppeteer [20] provide convenient APIs to control browsers. Finally, OpenWPM [17] is a privacy measurement framework built on Firefox that uses Selenium for automation and provides hooks for data collection to facilitate large-scale studies.
There are important considerations when choosing a crawler technology. Simple crawlers that do not run JavaScript are quite limited in modern dynamic environments. A stateful crawler, i.e., one that supports cookies, caches, or other persistent data, may be desirable to better emulate users, but the results of the crawl may then depend on the accumulated state, e.g., the order of pages visited. Finally, a distributed crawl may exploit the use of coordinated crawlers over multiple machines with distinct IP addresses.
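To make the browser-automation approach concrete, the following is a minimal sketch of a single page visit driven through Selenium's Python bindings with headless Firefox. It is not the instrumentation used in this paper; it assumes Selenium 4 and a working geckodriver installation, and the URL is only a placeholder.

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    def visit(url):
        """Load one page in headless Firefox and return basic metadata."""
        options = Options()
        options.add_argument("--headless")  # run without a graphical interface
        driver = webdriver.Firefox(options=options)
        try:
            driver.get(url)
            # Any further measurement (cookies, requests, DOM) would go here.
            return {"url": driver.current_url, "title": driver.title}
        finally:
            driver.quit()  # discard the browser and its temporary profile

    print(visit("https://example.com"))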
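The statefulness distinction above can likewise be sketched with the same tooling; the function names are illustrative only. Restarting the browser for every page yields a stateless crawl, whereas keeping one browser session for the whole crawl lets cookies and cache accumulate, so results may depend on visit order.

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    def make_driver():
        options = Options()
        options.add_argument("--headless")
        return webdriver.Firefox(options=options)

    def crawl_stateless(urls):
        """Fresh browser per page: no cookies or cache carry over."""
        for url in urls:
            driver = make_driver()
            try:
                driver.get(url)
            finally:
                driver.quit()

    def crawl_stateful(urls):
        """One browser session for the whole crawl: cookies and cache
        accumulate, so results may depend on the page visit order."""
        driver = make_driver()
        try:
            for url in urls:
                driver.get(url)
        finally:
            driver.quit()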
2.2 Web measurement studies
Crawls are a standard way of collecting data for Web measurements. Many crawl-based studies focus on