Apophanies or Epiphanies? How Crawlers Impact Our Understanding of the Web

Syed Suleman Ahmad (University of Wisconsin - Madison), Muhammad Daniyal Dar (University of Iowa), Muhammad Fareed Zaffar (LUMS), Narseo Vallina-Rodriguez (IMDEA Networks / ICSI), Rishab Nithyanand (University of Iowa)

ABSTRACT

Data generated by web crawlers has formed the basis for much of our current understanding of the Internet. However, not all crawlers are created equal, and crawlers generally find themselves trading off between computational overhead, developer effort, data accuracy, and completeness. Therefore, the choice of crawler has a critical impact on the data generated and the knowledge inferred from it. In this paper, we conduct a systematic study of the trade-offs presented by different crawlers and the impact that these can have on various types of measurement studies. We make the following contributions: First, we conduct a survey of all research published since 2015 in the premier security and Internet measurement venues to identify and verify the repeatability of crawling methodologies deployed for different problem domains and publication venues. Next, we conduct a qualitative evaluation of a subset of all crawling tools identified in our survey. This evaluation allows us to draw conclusions about the suitability of each tool for specific types of data gathering. Finally, we present a methodology and a measurement framework to empirically highlight the differences between crawlers and how the choice of crawler can impact our understanding of the web.

ACM Reference Format:
Syed Suleman Ahmad, Muhammad Daniyal Dar, Muhammad Fareed Zaffar, Narseo Vallina-Rodriguez, and Rishab Nithyanand. 2020. Apophanies or Epiphanies? How Crawlers Impact Our Understanding of the Web. In Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3366423.3380113

WWW '20, April 20–24, 2020, Taipei, Taiwan
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7023-3/20/04.
https://doi.org/10.1145/3366423.3380113

1 INTRODUCTION

Web crawlers and scrapers¹ have been a core component of the toolkits used by the Internet measurement, security & privacy, information retrieval, and networking communities. In fact, nearly 16% of all publications from the 2015-18 editions of the ACM Internet Measurement Conference (IMC) [23], Network and Distributed System Security Symposium (NDSS) [27], ACM Conference on Computer and Communication Security (CCS) [13], USENIX Security Symposium (SEC) [34], IEEE Symposium on Security & Privacy (S&P) [22], International World Wide Web Conference (WWW) [25], International AAAI Conference on Web and Social Media (ICWSM) [24], and Privacy Enhancing Technology Symposium (PETS) [31] relied on Web data obtained through crawlers. However, not all crawlers are created equal or offer the same features, and therefore the choice of crawler is critical. Crawlers generally find themselves trading off between computational overhead, flexibility, data accuracy & completeness, and developer effort. On one extreme, command-line tools such as Wget [35] and cURL [18] are lightweight and easy to use while providing data that is hardly accurate or complete – e.g., these tools are unable to handle dynamically generated content or execute JavaScript, which restricts the completeness of the gathered data [21]. As JavaScript is currently used by over 95% of the sites in the Alexa Top-100K web rank (as of August 2019), the use of such tools as crawlers in specific types of research is questionable [37]. On the other extreme, tools like Selenium [10] provide the ability to get extremely accurate and complete data at the cost of high developer effort (needed to develop automated web interaction models that can mimic real users) and computational overhead (needed to run a fully-fledged Selenium-driven browser such as Firefox or Chrome). Further complicating matters, in response to a growing variety of attacks, webservers are becoming increasingly sensitive to, and willing to block, traffic appearing to be anomalous – including traffic appearing to be generated by programmatic crawlers [15, 33].

¹Although there is a subtle distinction between a web crawler and a web scraper, we refer to both as "crawlers" in the remainder of this paper.

As a consequence of the above-mentioned trade-offs, it is important to understand how the choice of crawler can make an impact on the data gathered and the inferences drawn by a measurement study. What is lacking, however, is a systematic study of the trade-offs presented by different crawlers and the impact that these can have on various types of measurement studies. In this paper, we fill this gap. Our overall objective is to provide researchers with an understanding of the trade-offs presented by different crawlers and also to provide data-driven insight into their impact on Internet measurements and the inferences drawn from them. We accomplish this objective by making the following contributions.

Surveying crawlers used in prior work. (§2) We start our study by surveying research published since 2015 at the following venues: IMC [23], NDSS [27], S&P [22], CCS [13], SEC [34], WWW [25], ICWSM [24], and PETS [31]. In our survey of 2,424 publications, we find 350 papers that rely on data gathered by 19 unique crawlers, which we categorize by their intended purpose. We use the findings of this survey to select the most relevant crawlers in these academic communities for further analysis.

A systematic qualitative evaluation of crawling tools. (§3) We focus on performing a qualitative evaluation of different crawling tools used by previous research. For this evaluation, we rely on the results of our survey to select crawlers for further analysis. We study the capabilities and flexibility of each of these crawlers with the goal of understanding how they are likely to impact data collection, and their limitations. This allows us to draw conclusions about the suitability of each tool for specific types of research problems.

A framework for performing comparative empirical crawler evaluations. (§4) We present a methodology and accompanying framework for empirically highlighting the differences between crawlers in terms of the network and application data they generate. Further, in an effort to guide future researchers in making a choice between different crawlers in specific scenarios, and to make our work repeatable, we make this evaluation framework publicly available at https://sparta.cs.uiowa.edu/projects/methodologies.

Illustrating the impact of different crawlers on research results and inferences. (§5) We empirically demonstrate how different crawlers impact measurement results and the inferences drawn from them (and, consequently, our understanding of the web). Based on large amounts of prior work, we focus on three major problem domains: privacy, content availability, and security measurements. Through our experiments, we observe the impact on the inferences made about (1) the success of traffic analysis classifiers, (2) the third-party ecosystem, and (3) the incidence rates of server-side blocking. This illustrates that crawler selection does indeed impact measurement and experiment quality, soundness, and completeness.

2 SURVEY OF CRAWLING METHODOLOGIES

We begin our study by surveying the use of crawlers in recent academic research conducted by the Internet measurement, security, and privacy communities. Our goal is to understand how these research communities use web crawlers and to verify the repeatability of these crawls.

2.1 Survey methodology

To achieve our goal of understanding how crawlers were used by researchers in the fields of Internet measurement, security, and privacy, we conducted a survey of the publications from the 2015-2018 editions of the following conferences: ACM Internet Measurement Conference (IMC) [23], Network and Distributed System Security Symposium (NDSS) [27], ACM Conference on Computer and Communication Security (CCS) [13], USENIX Security Symposium (SEC) [34], IEEE Symposium on Security & Privacy (S&P) [22], International World Wide Web Conference (WWW) [25], International AAAI Conference on Web and Social Media (ICWSM) [24], and Privacy Enhancing Technology Symposium (PETS) [31]. At a high level, the survey methodology is designed to: (1) identify the crawling method and tool used in each publication – these results are used to inform our analysis in §3; (2) verify repeatability – i.e., whether the crawler is used in a repeatable manner; and (3) identify the purpose and research domain of the experiment and the associated publication – these results are used to inform our evaluation in §5.

Crawler identification. Our first goal in this survey is to identify which crawlers were most commonly used by the research community. To accomplish this goal, we used the following filtering process. First, we randomly sampled 300 publications for manual inspection from the total set of 2,424 papers. During this manual inspection process, we extracted keywords that were observed to be commonly associated with crawling (Table 1). Next, we filtered out publications not containing these keywords, based on the idea that these were unlikely to include a crawler as part of their data gathering. This left us with a total of 764 papers. Each of these was then manually inspected to verify the use of a crawler for data gathering. We found a total of 350 publications meeting this criterion. This manual inspection helped us to eliminate false positives generated by our keyword-based heuristic, often due to mentions of crawling tools and browsers in the related-work sections of these papers. During this second round of manual inspection, we identified the use of 19 unique crawling tools in the 350 papers. We labeled papers as having used custom-built crawlers if the crawling tool was developed for the purpose of that specific study. For example, several studies leveraged custom-built extensions or app-store-specific crawlers to overcome the inadequacies of popular off-the-shelf crawlers [38, 43]. Although we include statistics on the occurrences of custom tools, we do not consider them as part of our study. Additionally, we do not include papers relying on data generated using RESTful APIs as part of our study, since RESTful API responses are based on a protocol and are not impacted by crawler or browser technologies.

Category   Keywords
General    crawl*, scrap*, webkit, webdriver, spider, browser, headless
Browsers   Firefox, Chrome, Chromium
Tools      cURL, Wget, PhantomJS, Selenium, OpenWPM

Table 1: Keywords obtained from a manual analysis of 300 publications.

Repeatability verification. In order for a data-gathering crawl to be repeatable, we expect that the associated scientific publication at least specifies and describes the crawling tool used for data gathering. We argue that, in the absence of this information, it is impossible for other researchers to accurately replicate the data-gathering mechanism². Therefore, while conducting our manual inspection of the 350 papers using crawlers for data collection, we label those that do not report the tool or framework used as not repeatable. We note that this is in fact a lower bound on non-repeatability, since it is also important to have crawler-specific configuration details. For example, a study that simply mentions the use of the cURL crawler is marked as repeatable in our survey, when in reality additional configuration details, such as whether the crawler was configured to follow redirects, are also required to accurately replicate the study. We note that studies leveraging custom crawlers were labeled as repeatable to satisfy our lower bound. This is reasonable because papers using custom tools often explicitly specify their implementation details. We identified 136 studies which were not repeatable based on this definition.

²Our definition is based on previous works on repeatability in computer research [44].
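To make the two-stage filtering concrete, the keyword heuristic can be sketched as follows. This is an illustrative reimplementation, not the scripts used in the survey; the keyword list mirrors Table 1, while the record format and helper names are our own.

```python
# Illustrative sketch of the survey's two-stage filter: Stage 1 is the
# keyword prefilter (keywords from Table 1); Stage 2 (manual inspection)
# is not automatable and is therefore not modeled here.

KEYWORD_PREFIXES = ("crawl", "scrap")  # match crawl*, scrap*
KEYWORDS = frozenset([
    "webkit", "webdriver", "spider", "browser", "headless",
    "firefox", "chrome", "chromium",
    "curl", "wget", "phantomjs", "selenium", "openwpm",
])

def mentions_crawling(text: str) -> bool:
    """Return True if a paper's text matches the crawling-keyword heuristic."""
    words = text.lower().split()
    return any(w.startswith(KEYWORD_PREFIXES) or w in KEYWORDS for w in words)

def filter_candidates(papers):
    """Stage 1: keep only papers whose full text matches the heuristic."""
    return [p for p in papers if mentions_crawling(p["text"])]
```

Stage 1 intentionally over-approximates (matching 764 of the 2,424 papers); Stage 2, manual inspection, is what removes false positives such as papers that only mention crawling tools in their related-work sections.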

Research topic categorization. In order to make inferences about the types of crawling tools and methodologies preferred by researchers in different problem domains and scenarios, we taxonomized the research domains in which crawlers were frequently used. While manually inspecting the 350 publications, we classified each as belonging to one or more of the following eight classes: privacy, censorship, performance measurements, content testing, security measurements, web mining, automation, and surveying. These topics are described in Table 2.

Topic            Description
Privacy          Crawler-gathered data was used for the measurement and analysis of online advertising and tracking ecosystems.
Censorship       Crawler-gathered data was used for the measurement and analysis of Internet censorship and circumvention.
Performance      Crawler-gathered data was used to evaluate the performance (in terms of page-load times and bandwidth) of different networks, architectures, and tools.
Content testing  Crawler-gathered data was used to analyze the structure of webpages (DOM and element presence) and to fuzz test web applications.
Security         Crawler-gathered data was used to identify and characterize the use of different security protocols and mechanisms.
Web mining       Crawler-gathered data was used to understand web service usage and publication mechanics – e.g., scraping the Google Play store to gather data about the release schedule of Android app updates.
Automation       Crawlers were used to automate interactions with online services – e.g., automating account creation.
Surveying        Crawlers were used as a component of experiments and studies aimed at understanding human subjects – e.g., tracking eye movements to understand human perception of page-load completion.

Table 2: Topics used to categorize the final list of 350 publications which relied on crawlers for data gathering. A publication was placed in a category based on the purpose for which crawler-gathered data was used. Note that a publication may be assigned multiple classes.

2.2 Results

Which crawlers are most popular? In our manual inspection of 350 publications, we identified a total of 126 publications (36%) relying on one of the 19 different crawlers shown in Table 3. The most commonly observed subset of the 19 crawlers is broken down by use in different years, conference venues, and research domains in Table 4. Generally, we observe that there is a small number of popular crawling tools (4: Selenium, PhantomJS, cURL, and Wget) which were used by over one-quarter of all research papers relying on crawlers for data collection and analysis. In the remainder of this paper, we use these crawlers as the basis for comparison and evaluation.

Crawlers: Selenium [10], PhantomJS [59], cURL [18], Wget [35], ZGrab [36], OpenWPM [50], Tor-Browser-Crawler [12], Scrapy [5], Chrome Puppeteer [56], OWASP ZAP Spider [30], Selendroid [46], Tracking Excavator [62], [26], Ghost.py [3], Watir [70], GRUB [4], Raccoon [9], PlayDrone [77], LifeSpec [82].

Table 3: The 19 non-custom-built crawling tools identified in our survey.

What fraction of crawls are not repeatable? We now turn our focus to understanding the characteristics of crawls that we labeled as not repeatable. As described in §2.1, a crawl was labeled "not repeatable" if the corresponding research paper did not mention the crawling tool or framework used to generate the crawled data. While this metric is a strict lower bound on the actual non-repeatability of crawls, we found that 36% of the research papers analyzed in our survey meet this definition. Interestingly, we note that the incidence of non-repeatable crawls is most evident in publications at WWW, suggesting the need for additional research exposition and review considerations.

What research domains are most reliant on crawlers? Based on the methodology outlined in §2.1, we find that crawlers were most commonly used for the purposes of web mining, content testing, automated web interaction, and privacy measurements. A breakdown across all areas is shown in Table 4. Interestingly, we find that custom crawling solutions are most commonly observed in the content testing and security categories. This suggests that existing tools need to be augmented to fit the needs of each research domain. We also note that non-repeatable crawls were most common for web mining, performance testing, and automated web interaction purposes. This supports our previous suggestion of the need for additional research exposition and review considerations. In our analysis of the impact of different crawling methodologies on research inferences (§5), we focus on research in the privacy, content testing, and security communities, since they generally generate repeatable crawls and simultaneously show high variation in the tools that they rely on for these crawls.

Takeaways. We see that web crawls form a major component of the Internet measurement, security, and privacy communities, with over 16% of all research papers in the last four editions of the premier venues relying on data from crawlers. Rather worryingly, our survey also highlights the lack of specificity in research publications when describing crawling methodologies. In fact, over 35% of the crawl-dependent publications were not repeatable due to the absence of critical information regarding crawl methodology. Finally, we observe that crawl repeatability, the incidence of custom solutions, and the variation across crawling tools are all dependent on the research domain. We use the most popular crawling tools uncovered by this survey as the basis for our analysis in the remainder of this paper, and the research domains showing the most variance across crawling tools as the basis for our analysis in §5.

3 QUALITATIVE EVALUATION

In this section, we conduct a qualitative comparative evaluation of a selected set of eight crawling tools uncovered by our survey (§2). The chosen crawlers (listed in Table 5) reflect the most popular 'out-of-the-box' solutions that are used by the research studies analyzed in our survey. Our primary focus is on understanding the capabilities and flexibility afforded by each of these tools, with an eye on how these might impact the data generated by them.
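The per-venue repeatability figures discussed above amount to a simple aggregation over the survey's labels. The sketch below is illustrative; the (venue, repeatable) record format and function name are our own, not the survey's actual data format.

```python
from collections import defaultdict

def non_repeatable_fraction_by_venue(records):
    """Given (venue, repeatable) labels for crawl-dependent papers,
    compute the fraction of non-repeatable papers per venue.

    As in the survey, this is a lower bound: papers that name a tool
    but omit its configuration still count as repeatable.
    """
    totals = defaultdict(int)
    non_rep = defaultdict(int)
    for venue, repeatable in records:
        totals[venue] += 1
        if not repeatable:
            non_rep[venue] += 1
    return {venue: non_rep[venue] / totals[venue] for venue in totals}
```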

                      |        Year         |              Publication venue
               Total  | 2015 2016 2017 2018 | S&P  IMC  PETS  SEC  NDSS  CCS  WWW  ICWSM
Wget              11  |   4    3    2    2  |  2    2    0     3    1     2    1     0
cURL              11  |   3    0    3    5  |  1    3    1     3    2     1    0     0
ZAP Spider         2  |   0    1    0    1  |  0    0    0     1    1     0    0     0
PhantomJS         25  |   6    9    7    3  |  1    2    2     3    3     8    4     2
Selenium          52  |   6   12   20   14  |  3    6    4     7   11    12    9     0
Puppeteer          4  |   0    0    1    3  |  0    1    0     1    0     0    2     0
OpenWPM            8  |   2    1    3    2  |  0    0    2     1    2     2    1     0
Tor BC             5  |   0    0    2    3  |  0    0    2     0    1     2    0     0
Other (12)        24  |   3    9    8    4  |  2    6    1     2    5     4    3     1
Custom built     101  |  20   18   20   43  |  7    8    1    13   14    13   32    13
Not specified    136  |  36   33   29   38  | 11   13    7    17   11    18   47    12
Total            379  |  80   86   95  118  | 27   41   20    51   51    62   99    28

(a) Crawling tools broken down by year and publication venue. Note that a publication could use multiple crawling tools.

                                        Research domain
               Privacy  Censorship  Performance  Content  Security  Mining  Automation  Survey
Wget                 1           0            3        4         1       5           1       0
cURL                 2           1            5        3         1       2           2       0
ZAP Spider           0           0            1        1         1       0           0       0
PhantomJS            5           2            2        9         7       5           7       0
Selenium            20           2            1       14        14      13           7       1
Puppeteer            0           0            0        3         1       1           2       0
OpenWPM              6           0            0        1         1       0           2       0
Tor BC               5           0            0        3         0       2           0       0
Other (12)           2           2            3        6        12       9           2       0
Custom built        16           0            8       39        23      35          16       6
Not specified        5           1            6       17         8      93          33       1
Total               62           8           29      100        69     165          72       8

(b) Crawling tools broken down by their use in different research domains. Note that a single use of a crawler can be counted in multiple research domains – e.g., crawlers used for gathering data about fingerprints are counted in the "Security" and "Mining" categories.

Table 4: Breakdown of the most popular crawlers used by publication venue, year, and research domain.

3.1 Evaluation methodology

Characterizing crawling approaches. We begin our evaluation by first categorizing crawlers by their layer of operation. We make this categorization based on the observation that, at a fundamental level, web crawlers are tools to generate application-layer protocol requests and process the corresponding responses. The generation and processing of these requests and responses may occur in different ways. For example, some crawlers may directly interface with the application layer of the networking stack to create (e.g., HTTP GET) requests with specific parameters, some may rely on controlling a browser to generate application-layer requests, and others may rely on mimicking user interactions with a browser to generate these requests. More concretely, we classify crawling tools into one of the following three categories.

Application-layer protocol crawlers (ALPCs). Crawlers in this category operate by handling web content as static files and perform HTTP GETs or POSTs to "crawl" a website. They do not interface with web browsers. cURL, Wget, and ZAP Spider fall in this category.

Browser-layer crawlers (BLCs). Crawlers in this category operate by driving web browsers. These crawlers typically provide wrappers in many popular languages to interface with browser-provided JavaScript APIs which interact with and manipulate the DOM. For example, crawler-provided functions to navigate to specific webpages work by generating JavaScript which manipulates the window.location.assign property in the DOM of Blink-based browsers. PhantomJS, Selenium, and Puppeteer fall in this category.

User-layer crawlers (ULCs). Crawlers in this category operate by implementing a wrapper around standard browser-layer crawlers. These wrappers introduce new functionalities, such as mimicking user interaction with loaded pages, by chaining together multiple methods provided by browser-layer crawler bindings, thereby reducing the manual effort required to conduct specific types of crawls. OpenWPM and TorBC fall in this category.

Characterizing features of crawling tools. We characterize the features of different tools by the flexibility they offer to: (1) manipulate application-layer protocols (e.g., HTTP request parameters), (2) modify browser or client characteristics (e.g., enabling cookies, caching, or extensions), (3) automate user interactions (e.g., enabling rate limiting, automatic anti-bot-detection mechanisms, etc.), (4) facilitate data gathering (e.g., saving screenshots, network traces, etc.), and (5) automate and scale up crawls (e.g., permitting headless crawls, parallel crawling, etc.). Each of these categories is described in more detail below.

Application-layer protocol-level features. Features in this category capture the ability of a crawler to manipulate the parameters associated with typical web protocols. The ability to modify these protocol characteristics is particularly important for measuring the support of security protocols and characterizing server-side protocol behaviors – e.g., measuring server behaviour in response to different TLS options and HTTP headers. In our evaluation, we focus on the ability of the crawler to add new or modify existing HTTP headers, toggle the HTTP keepalive option, create new or manipulate generated HTTP methods (e.g., GET and POST), and set SSL/TLS options (e.g., version number).

Browser(client)-level features. Features in this category generally capture the ability of a crawler to manipulate browser characteristics. Browser characteristics have an impact on the types of requests generated by the crawler – e.g., clients configured to disable JavaScript will generate fewer requests for third-party content [49]. Therefore, it is crucial to be able to manipulate them. The features that we consider in this study are related to: (1) determining how the client handles web content – e.g., can the tool be set/configured to ignore locally cached content, not execute JavaScript, or disable cookie access; and (2) browser-specific settings, including which extensions should be loaded and whether proxies can be specified.

User-level features. Traffic generated by crawlers is increasingly seen as unwanted and often characterized as attack traffic by web administrators [15, 33]. Consequently, it is becoming increasingly important for researchers conducting web measurements to take steps to ensure that their crawls do not trigger events that result in blocking or differential treatment. This is especially important for studies seeking to attribute failed page loads to server-side policies [65, 76] or censorship [72, 80]. This requires some crawl-characteristic manipulation in order to match crawls to typical human behavior (or, at the very least, not look bot-like). Therefore, we evaluate each crawler on its ability to evade server-side blocking by emulating human characteristics. Specifically, we ask the following three questions of each crawler: Does the crawler provide a mechanism to (1) restrict the rate of crawling, (2) dynamically interact with a webpage – e.g., through clicking and typing, and (3) trigger mouse movements and rate-limited keystrokes?

Data-gathering features. Features in this category capture the variety of data that a crawler might facilitate. We consider whether a crawling tool provides access to data at the network/transport layer (packet captures), application layer (HTTP requests and responses), and content layer (screenshots and content fetches). Access to each of these is critical for different problem domains – e.g., packet captures are crucial for understanding the fingerprintability of webpages [42, 79], request and response pairs are crucial for understanding how trackers and third-party content are loaded [47, 49], and screenshots of pages can be crucial for automating the process of blockpage and censorship identification [76].

Crawl-level features. Generally speaking, conducting large-scale web crawls can be an expensive task to undertake – computationally and manually. Therefore, we consider whether each crawler provides features to ease the computational and developer efforts required to use it. Specifically, we check to see if each tool is capable of automatically splitting workload across multiple parallel crawls (multi-instance), able to conduct crawls in headless mode – i.e., a browser with in-memory DOM rendering – and able to conduct crawls using full-fledged browsers such as Chromium and Firefox.

3.2 Results

A complete breakdown of our evaluation is shown in Table 5. Below, we show how the characteristics of crawling tools and their features impact their suitability for different tasks.

Rich crawl-level and data-gathering features come at the cost of protocol-level flexibility. As one might expect, ULCs generally provide the largest number of crawl-level and data-gathering features. This is unsurprising, as ULCs are explicitly designed with the goal of simplifying data collection and facilitating parallelized BLCs. Unfortunately, this comes at a cost – the ability to manipulate lower-level details such as application-layer protocol parameters. Conversely, while ALPCs provide direct access to these parameters, the data they facilitate is incomplete due to the inability to actually render requested content or load dynamic content (since they treat web content as files and do not present a rendering or execution environment). Some BLCs offer a happy medium between these extremes – providing the flexibility to enable nearly all these features through customized scripts. This suggests that ALPCs are more suitable for crawls which seek to uncover server behavior in response to different headers (e.g., for security fuzzing, network performance measurements, etc.), while studies which seek to measure user experiences (e.g., for censorship, privacy measurements, page-load performance, etc.) need to rely on BLCs or ULCs.

BLCs provide the richest user-level and browser-level features. BLCs which drive standard browsers offer the capability to manipulate all the user- and client-level features we considered. This is not possible with ALPCs, since they do not use browser environments. Rather surprisingly, we found that the implementations of the ULCs in our study often removed access to important user- and browser-level features that are available in their underlying BLC (e.g., the proxy and caching options for both, and all user-level features for TorBC). This suggests that studies which need to avoid bot detection (e.g., the content testing required in censorship and privacy measurement) or to manipulate local state (e.g., network performance measurements) are better suited to BLCs and ULCs.

Takeaways. Our analysis hints that specific crawling tools are more suitable for use in different research domains and that an incorrect choice of crawler may significantly impact research results. For example, using an ALPC to gather network traces for input to a website-fingerprinting classifier model might over-estimate the attack's success rate due to the lack of dynamic content. Similarly, studies seeking to understand censorship or differential treatment of users will likely have different results when measured using BLCs and ULCs which incorporate bot-detection mitigation. We empirically evaluate several similar hypotheses in §5.

4 A COMPARATIVE FRAMEWORK

In this section, we describe our framework for empirically highlighting the differences in the network traffic generated by the crawlers described in §3. Table 6 shows the version information of the crawlers used in this framework. Our evaluation framework focuses on uncovering the impact of crawlers on the requests generated and responses received by clients, and the traffic characteristics observed by the network. Our hope is that this framework will enable more complete, sound, and correct research which considers and reports the impact of different crawling methodologies on their hypotheses.

4.1 Design considerations

Our framework was designed with two specific goals: extensibility and correctness. To achieve high extensibility, we seek to make our framework modular and easily configurable. Achieving correctness is more challenging due to a number of confounding factors when performing comparative crawls, including: (1) websites are frequently updated, and crawls starting at different times may observe different content; (2) web servers may provide localized results, making it challenging to rely on cloud infrastructure for measurements; and (3) websites might trigger IP-address-based discrimination against IPs suspected to be associated with crawlers.
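As a concrete illustration of the protocol-level control that ALPCs expose, the following sketch constructs (but does not send) an HTTP request with a modified method and headers using only Python's standard library; cURL exposes equivalent controls via command-line flags such as -H and -X. The URL is a placeholder, and the sketch is ours, not part of the framework.

```python
import urllib.request

# Build (but do not send) a request with a custom User-Agent, an extra
# header, and a non-default method -- the protocol-level knobs discussed
# above, which ALPCs expose directly and most BLCs do not.
req = urllib.request.Request(
    "https://example.com/",          # placeholder target
    method="POST",
    headers={
        "User-Agent": "Mozilla/5.0 (research-crawler)",
        "Connection": "close",       # i.e., keep-alive toggled off
    },
)
print(req.get_method())              # -> POST
print(req.get_header("Connection")) # -> close
```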

                       Crawl-level features         Data-gathering features       User-level features
                       Multi-  Headless  Full    |  Webpage  Network  Screen  |  Rate   Spider  Anti  Interact
                       inst.   mode      browser |  content  traffic  shots   |  limit          bot
Wget (ALPC) [35]         ●       ✘        ✘      |    ◗        ◗       ✘      |   ✔      ✔      ✘       ✘
cURL (ALPC) [18]         ●       ✘        ✘      |    ◗        ◗       ✘      |   ✔      ✘      ✘       ✘
ZAP Spider (ALPC) [8]    ●       ✘        ✘      |    ◗        ✔       ✘      |   ●      ●      ✘       ✘
PhantomJS (BLC) [59]     ●       ✔        ✘      |    ●        ✘       ✔      |   ●      ●      ✘       ✘
Selenium (BLC) [10]      ●       ✔        ✔      |    ●        ✘       ●      |   ●      ●      ●       ●
Puppeteer (BLC) [56]     ●       ✔        ✔      |    ●        ✘       ✔      |   ●      ●      ●       ●
OpenWPM (ULC) [50]       ✔       ✔        ✔      |    ✔        ◗       ✔      |   ●      ●      ✔       ●
Tor BC (ULC) [12]        ✔       ✔        ✔      |    ✘        ✔       ✔      |   ✘      ✘      ✘       ✘

(a) Crawler-level, data-gathering, and user-level features.

                       Client-level features (application layer)            Protocol-level features
                       Cookie   Proxy    Cache    JavaScript Exten- Browser  HTTP     HTTP     HTTP       SSL/TLS
                       options  options  options  options    sions  config.  headers  methods  keepalive  options
Wget (ALPC) [35]         ✔        ✔        ✔        ✘          ✘      ✘        ✔        ✔        ✔          ✔
cURL (ALPC) [18]         ✔        ✔        ✘        ✘          ✘      ✘        ✔        ✔        ✔          ✔
ZAP Spider (ALPC) [8]    ✔        ✔        ✘        ✘          ✘      ✘        ◗        ✘        ✘          ✔
PhantomJS (BLC) [59]     ✔        ✔        ✔        ✔          ✘      ✔        ✔        ✔        ✘          ✔
Selenium (BLC) [10]      ✔        ✔        ●        ✔          ●      ✔        ✘        ✘        ✘          ◗
Puppeteer (BLC) [56]     ✔        ✔        ✔        ✔          ✘      ✔        ●        ◗        ●          ◗
OpenWPM (ULC) [50]       ✔        ✘        ✘        ✔          ✔      ◗        ✘        ✘        ✘          ◗
Tor BC (ULC) [12]        ✘        ✘        ✘        ✘          ✔      ✘        ✘        ✘        ✘          ✘

(b) Client-level and (application layer) protocol-level features.

Table 5: Features available in different crawling tools. ‘✔’ indicates that the corresponding class of features is directly available through an API call or input argument. ‘●’ indicates that the class of features is not easily available, but an equivalent action can be achieved by custom scripts. ‘◗’ indicates that a subset of the class of features can be modified through custom scripts. ‘✘’ indicates the inability to attain the class of features through crawler-provided resources.
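Table 5 can also be consulted programmatically when shortlisting a tool for a study. The sketch below encodes a few rows in a hypothetical machine-readable form (the symbol-to-level mapping "✔ = direct, ● = script, ◗ = partial, ✘ = no" is our assumption, and only three crawlers are shown) and filters crawlers by required features:

```python
# Hypothetical machine-readable encoding of a few Table 5 rows.
# Assumed mapping: "direct" = ✔, "script" = ●, "partial" = ◗, "no" = ✘.
FEATURES = {
    "Wget":     {"multi_instance": "script", "headless": "no",
                 "full_browser": "no", "webpage_content": "partial",
                 "screenshots": "no"},
    "Selenium": {"multi_instance": "script", "headless": "direct",
                 "full_browser": "direct", "webpage_content": "script",
                 "screenshots": "script"},
    "OpenWPM":  {"multi_instance": "direct", "headless": "direct",
                 "full_browser": "direct", "webpage_content": "direct",
                 "screenshots": "direct"},
}

def supports(crawler, feature, allow_custom=False):
    """True if the crawler offers the feature directly
    (or, optionally, via custom scripting)."""
    level = FEATURES[crawler].get(feature, "no")
    return level == "direct" or (allow_custom and level in ("script", "partial"))

def candidates(required, allow_custom=False):
    """Crawlers satisfying every required feature."""
    return [c for c in FEATURES
            if all(supports(c, f, allow_custom) for f in required)]

print(candidates({"multi_instance", "headless"}))        # direct support only
print(candidates({"multi_instance", "headless"}, True))  # scripting allowed
```

With a strict "directly available" requirement, only OpenWPM satisfies multi-instance plus headless in this toy encoding; relaxing to custom scripts admits Selenium as well.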

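As a concrete illustration of the client-level options cataloged in Table 5b, the sketch below wires a cookie jar, an HTTP proxy, and a custom User-Agent into a single client using only the Python standard library. The proxy address is a placeholder, and this is not how any of the surveyed tools are implemented:

```python
import http.cookiejar
import urllib.request

# Client-level knobs of the kind Table 5b catalogs: cookie handling and
# a proxy, composed into one opener. The proxy address is a placeholder.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar),
    urllib.request.ProxyHandler({"http": "http://127.0.0.1:8080"}),
)
# A custom User-Agent: the simplest protocol-level (HTTP header) option.
opener.addheaders = [("User-Agent", "measurement-crawler/0.1")]
```

Tools marked ✔ expose such options as flags or API arguments; tools marked ● require wiring of this kind by hand.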
To achieve extensibility and address the above confounding factors which impact correctness, we take the following steps:
Rely on an easily configurable master-worker architecture. The current implementation of the master only requires three configuration parameters: a list of worker machines, the available IP addresses from which the crawls are to be conducted, and the target domains to crawl. Each worker machine only needs to implement a wrapper program which takes a domain name as input and then crawls this domain using any crawling tool. Our first release includes wrappers for the eight different crawlers. Workers only require two configuration parameters: the master's IP address and the command-line instruction to launch the local wrapper program.
Perform crawl synchronization. We make an effort to achieve a reasonable level of synchronization between crawls from different tools. Our master script coordinates efforts in such a way that all workers begin crawling a specific webpage at approximately the same time (at a granularity of seconds). This allows us to negate the impact of frequent website updates within a small margin of error.
Account for changing IP reputations. To account for web servers discriminating against our workers based on blacklisting and IP reputation scores, we integrate Cisco's public Talos Intelligence into our master program. Talos returns one of three classes for each queried IP address – 'Good', 'Neutral', or 'Poor'. If any worker is found to have an IP reputation that is lower than the others, a new IP address with a reputation similar to the IPs used by the other workers is dynamically reassigned to it. Therefore, each worker conducts its crawl from an IP address of comparable reputation to the other worker IP addresses. A limitation of our approach is that we rely on intelligence data from a single source – Talos [16]. However, our IP intelligence class can be easily extended to use APIs from other intelligence services.

Crawler          Version   Companion tools
Wget [7]         1.17.1
cURL [19]        7.64
OWASP ZAP [29]   2.7
PhantomJS [6]    2.1.1
Selenium [11]    3.141     Firefox v66.0
Puppeteer [32]   1.12.1    Chromium 73.0
OpenWPM [28]     0.8       Selenium v3.6, Firefox v52.9
Tor BC [12]      NA        Tor Browser v8.0, Selenium v2.53

Table 6: Crawler versions used for experimentation.

4.2 Deployment for this study
Master and worker machine setup. We used one virtual machine for each worker and for the master – each with a unique IP address. In total, we had eight workers – one for each crawler. Each worker was configured to use its crawling tool in its default configuration, with the exception of OpenWPM, for which we also enabled its "anti-bot detection" functions.
IP addresses. We obtained 22 unique IPv4 addresses which were used in our crawl. These IPv4 addresses were obtained from a regional ISP for the duration of these experiments and were found to have a "Good" IP reputation at the start of our measurements. Our IP addresses were geolocated to the same region, thereby reducing the impact of content localization.
URL list. Our list included the union of the Top 500 domains from the Alexa [14] and Umbrella [17] top sites lists. In total, there were 932 unique domains (of which 30 domains were removed as they returned SOA records) which were crawled by each of our workers. The domain lists were fetched on 25th April 2019.

5 QUANTITATIVE EVALUATION
In this section, we empirically demonstrate how different crawling tools impact measurement results and their resulting inferences. The analysis in this section is based on the data generated by our comparative framework (§4) using the parameters described in §4.2.
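The per-page crawl synchronization described in §4 can be simulated with a barrier: all workers are released on the same page at approximately the same instant. This toy sketch uses threads in place of worker machines and a no-op in place of the actual crawling tools:

```python
import threading

URLS = ["example.com", "example.org"]    # stand-ins for the target list
CRAWLERS = ["wget", "curl", "selenium"]  # one worker per tool (subset)

barrier = threading.Barrier(len(CRAWLERS))
logs = {c: [] for c in CRAWLERS}

def worker(crawler):
    for url in URLS:
        barrier.wait()  # all workers are released together for each URL
        # A real worker would launch its crawling tool on `url` here.
        logs[crawler].append(url)

threads = [threading.Thread(target=worker, args=(c,)) for c in CRAWLERS]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every tool visited the same pages in the same order.
assert all(log == URLS for log in logs.values())
```

The barrier bounds the skew between tools for each page load, which is the property the paper relies on to negate the effect of frequent website updates.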

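The URL-list construction from §4.2 amounts to a rank-preserving union of two top lists. A minimal sketch with toy stand-ins for the Alexa and Umbrella lists (the SOA-record filtering step is omitted):

```python
def merge_toplists(*lists):
    """Union of ranked domain lists, keeping first-seen order."""
    seen, merged = set(), []
    for ranked in lists:
        for domain in ranked:
            if domain not in seen:
                seen.add(domain)
                merged.append(domain)
    return merged

# Toy stand-ins for the two top-500 lists.
alexa = ["google.com", "youtube.com", "facebook.com"]
umbrella = ["netflix.com", "google.com", "microsoft.com"]

targets = merge_toplists(alexa, umbrella)
print(len(targets))  # 5 unique domains (one overlap removed)
```

With the real top-500 lists, the same union-and-dedupe step yields the 932 unique domains reported above.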
5.1 Generic data differences
We start by analyzing the differences in the data generated by each of our studied crawlers. We analyze these differences from two distinct vantage points: the differences observed by end hosts (e.g., requests generated by clients and responses received from servers) and the differences observed by the network (e.g., flow characteristics).
Host-observed differences. Our analysis shows that different crawler classes (ALPC, BLC, and ULC) cause different characteristics in requests and responses. We highlight two of these below.
ALPC request and response characteristics are drastically different from BLCs and ULCs. As one might expect, the inability of ALPCs to load dynamic content makes the characteristics of their requests and responses very different from those generated by BLCs and ULCs. Our analysis shows that the three ALPCs (Wget, cURL, and ZAP) in our study send requests to several orders of magnitude fewer destinations, in comparison to BLCs and ULCs, while loading a webpage. Further, the sizes of these requests (shown in Figure 1a) and the corresponding responses (shown in Figure 1b) are also orders of magnitude smaller. In comparison, we see smaller variations between different ULCs and BLCs. We attribute these variations to the differences in the browsers (and the browser versions) that they drive – suggesting that crawler specifications are important for accurate reproduction of research and measurement findings. These findings further suggest that ALPCs, despite their favorable performance, are not suitable for content analysis or third-party ecosystem measurements.

(a) Sizes of requests generated. (b) Sizes of responses received.
Figure 1: Host-observed differences between crawlers.

Ciphersuite measurements are crawler dependent. We find that there are not only notable differences in the ciphersuites made available by each crawler during TLS negotiation, but also noticeable differences in the negotiated ones. When simply considering the availability of ciphersuites, we see that ALPCs once again stand out (offering over 50 different ciphersuites during TLS negotiation) in comparison to ULCs and BLCs, which typically offer fewer than 20. It is important to note that while the suites offered by ULCs and BLCs are simply those offered by the browsers they drive, there are still significant differences since not all crawlers drive the same browser versions. For example, at the time of our experiments, OpenWPM was unable to drive Firefox versions later than v52.9, while Selenium drove Firefox v66.0. These browser releases were over a year apart and support quite different ciphersuites. The fraction of overlapping ciphersuites offered by the latest releases of our crawlers is shown in Figure 2. As expected, this difference in ciphersuite support has an impact on the finally negotiated TLS ciphersuite with servers. Measuring the usage of different ciphersuites has been utilized in prior studies to uncover the usage of broken cryptographic schemes [63], understand energy consumption [67], and identify how censorship circumvention traffic may better blend in with uncensored traffic [53]. Our findings suggest that such research needs to account for the fact that its conclusions may differ based on the choice of crawler (and consequently the browser and version) used in the measurements.

Figure 2: Ciphersuite overlap amongst crawlers.

Network-observed differences. In addition to differences at the end hosts, we find that the choice of crawler also has an impact on the traffic flow characteristics observed by the network. Specifically, we see that:
Burst characteristics vary by crawler. At the network layer, a burst is characterized by a sequence of consecutive packets (or bytes) all flowing in the same direction. Our analysis shows that the sizes of bursts are crawler dependent. Specifically, we find that ALPCs show much lower variation in burst sizes while more complex crawlers show much higher variation. TorBC is an exception since it relies on the Tor Browser, which uses padding to prevent traffic analysis. Our finding makes intuitive sense: crawlers operating complete browsers are likely to have more bursts, and more variation in their sizes, due to dynamic content on webpages. Further, we also see small variations in the mean burst sizes across different BLCs. This finding is particularly relevant to traffic analysis and TLS fingerprinting studies seeking to identify different classes of traffic based on network flows [51, 55, 74], since it shows that classifying traffic generated by different crawlers may yield different results.
Crawl performance characteristics vary by crawler. The rate of page requests is an important aspect to consider when selecting a crawler. As one might expect, the crawling rate (pages loaded per hour) is highest for ALPCs and lowest for ULCs. The differences between crawler page-load times are shown in Figure 3. Each point represents a successful page load with a response status of 200, and its x- and y-axis values denote the time-to-load and the bytes transferred. We see that ALPCs occupy the bottom left of the plot and ULCs occupy the top right corner. This plot, although difficult to inspect, hints at the inherent trade-offs at play when selecting a crawler: ALPCs provide high crawling performance at the cost of completeness, while ULCs provide completeness at the cost of performance.

Figure 3: Distribution of crawling time vs. content size.

5.2 Impact on research inferences
Our previous results (§5.1) have shown that the crawling method can have a significant impact on lower-level measurements pertaining to the characteristics of client-generated requests, server-generated responses, and network flows. In this section, we seek to understand how these differences can impact higher-level measurements which seek to characterize the state of the web. Specifically, we use data generated from our crawling framework to test the following hypotheses: (Hwf) choice of crawling method impacts the fingerprintability of websites, (Htp) choice of crawling method impacts the measurements of prevalence and prominence of online tracking third-parties, and (Hsb) choice of crawling method impacts measurements seeking to uncover server-side blocking (e.g., for censorship or discrimination measurements).

Hypothesis Hwf: Choice of crawling method impacts the fingerprintability of websites. In order to validate this hypothesis, we need to show that the success rates of website fingerprinting classifiers are impacted by the crawling method used to generate the data input to the classifiers.
Methodology. We selected three simple classifiers from previous website fingerprinting research for our study: a Naive Bayes (NB) classifier [64], a Multinomial Bayes (MB) classifier [57], and a Jaccard Similarity (JS) classifier [71]. We note that although these classifiers are not the current state-of-the-art, they are sufficient to prove our hypothesis. Implementations of these classifiers were obtained from [58, 78]. Data was generated by our crawling framework. The URL list (described in §4) was crawled through each crawler 10 times, and traces for each domain and each iteration were logged. These traces were split into training (70%) and testing (30%) datasets and used as input to our classifiers.
Results. Our results are highlighted in Table 7. Here we see that TorBC has the lowest fingerprintability, since it has explicit defenses to prevent such attacks. However, amongst the other crawlers we see a 2-22% difference in accuracy for the same fingerprinting classifiers when only changing the crawling tool. While this is sufficient to prove our hypothesis, we can show even more significant implications of not using uniform crawler configurations when conducting evaluations of classifier effectiveness. Consider a comparison of the JS and NB classifiers, both evaluated using OpenWPM – we see that the JS classifier has a success rate of 96% and the NB classifier has a success rate of 92%. This leads us to conclude that the JS classifier is more effective at launching website fingerprinting attacks on the specific list of domains. However, by evaluating the JS classifier using Selenium, or the NB classifier with any ALPC, we might conclude the opposite. It is important to note that while our toy experiment does not go into comprehensive evaluations of the state-of-the-art, the implications of not specifying the crawling methodology, or not maintaining a consistent methodology, in traffic analysis research are still significant and valid.

             NB    MB    JS
Wget         95%   94%   97%
cURL         96%   92%   97%
OWASP ZAP    90%   85%   95%
PhantomJS    92%   89%   95%
Selenium     86%   72%   91%
Puppeteer    90%   89%   93%
OpenWPM      92%   90%   96%
Tor BC        2%    5%    4%

Table 7: WF accuracy on traces generated by different crawlers.

Hypothesis Htp: Choice of crawling method impacts the measurements of prevalence and prominence of online tracking third-parties. In order to validate this hypothesis, we need to show that the prevalence and prominence rankings of embedded third-party advertising and tracking domains vary by crawler.
Methodology. In order to identify third-party advertising and tracking related domains, we rely on EasyList and EasyPrivacy [20]. While these lists may be incomplete, we only need to show that results change per crawler. We execute the rules in these lists to identify counts of requests to third-party advertising and tracking domains. We then rank these domains by two metrics for each crawler: prevalence and prominence. We borrow the metrics for prominence and prevalence from Englehardt et al. [49]. The general idea is that rankings of prevalence capture the exposure of different trackers to the web – since it is assumed that all sites are equally likely to be visited by users. On the other hand, rankings of prominence capture the exposure of different trackers to actual users – since it is assumed that users are likely to visit popular domains more than less popular domains (i.e., the power law distribution applies to website popularity). Finally, we compare the differences in the prevalence and prominence rankings obtained from the data generated by each crawler. We note that our results are impacted by our inability to measure the inclusion chains of third-party resources for all crawlers – a prohibitively expensive proposition. This might introduce variability across crawls due to real-time bidding (RTB)

(a) Prevalence of third party trackers. (b) Prominence of third party trackers.

Figure 4: Prevalence and prominence rankings of third-party advertising and tracking domains identified by each crawler. The OpenWPM crawler is used as the baseline ranking; darker and lighter colors show how much or how little the results from other crawlers deviated from this baseline.
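The prevalence and prominence rankings visualized in Figure 4 can be computed as follows. This is a sketch of one plausible formulation of the metrics borrowed from Englehardt et al. [49] (prominence rendered as a sum of reciprocal site ranks), not the paper's exact implementation:

```python
from collections import defaultdict

def tracker_metrics(site_trackers):
    """site_trackers: {site_rank (1 = most popular): set of tracker domains}.
    Returns (prevalence, prominence) dicts: prevalence counts embedding
    sites; prominence weights each embedding site by 1/rank, so trackers
    on popular sites score higher."""
    prevalence = defaultdict(int)
    prominence = defaultdict(float)
    for rank, trackers in site_trackers.items():
        for t in trackers:
            prevalence[t] += 1           # one count per embedding site
            prominence[t] += 1.0 / rank  # popular sites weigh more
    return dict(prevalence), dict(prominence)

# Toy crawl result: trackers observed on the three top-ranked sites.
observed = {1: {"a.example"}, 2: {"a.example", "b.example"}, 3: {"b.example"}}
prev, prom = tracker_metrics(observed)
# a.example and b.example tie on prevalence, but a.example is more
# prominent because it appears on higher-ranked sites.
```

Running the same computation over request logs from different crawlers, and comparing the resulting rankings, is the core of the Htp evaluation.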

[69] and header bidding [45]. We instead try to mitigate the impact of this noise by relying on data gathered over multiple crawl iterations from each crawler.
Results. Our results are shown in Figure 4. These results showcase how the measured prevalence and prominence of different trackers vary when computed on data generated from different crawlers. For both results, we use the data generated by the OpenWPM crawler as the baseline for comparison since it was explicitly built for the purpose of conducting privacy measurements. Specifically, Figure 4a shows how the prevalence of different trackers may be over- or under-estimated when compared to the baseline (OpenWPM) – e.g., ALPCs generally significantly underestimate the presence of trackers from sohu.com, since these are often dynamically loaded, while statically loaded trackers such as those from casalmedia.com are overestimated. According to Figure 4b, BLCs show varying results when compared to the baseline, suggesting that the underlying browser has greater implications for third-party tracking than the crawler type. For example, we see that the Firefox-driven Selenium performs fairly similarly to the (also Firefox-driven) baseline, while the Chrome-driven Puppeteer and the V8/Gecko-less PhantomJS differ significantly.

Hypothesis Hsb: Choice of crawling method impacts measurements of server-side blocking. In order to validate this hypothesis, we need to show that different crawling methods incur different levels of server-side blocking.
Methodology. In order to identify server-side blocking, we rely on methodologies constructed from prior work [60, 68]. Based on previous repositories of block pages [1, 2, 68] and manually identified block pages observed in our own crawls, we crafted generalized regular expressions to identify occurrences of the same or other block pages. Our regular expressions were crafted to identify blocking and discrimination due to HTTP 4XX status errors (i.e., client-side errors typically returned for unauthorized or unexpected client interactions with servers), the presence of CAPTCHAs (typically returned when bot activity is suspected), out-dated browser errors (returned when websites cannot be loaded or rendered correctly due to browser characteristics), geo-blocking (i.e., when websites block traffic originating from different regions) [66], and IP abuse (i.e., when the server suspects that the client is behaving maliciously). It is important to note that it is common for servers to simply return HTTP 4XX errors rather than more specific blockpages.
Results. Our results are shown in Table 8. Here we see obvious differences in the way servers respond to requests generated by different crawlers – i.e., we observe a variation of over 16% in the number of successful page loads. This immediately validates our hypothesis that different crawling tools can significantly impact the inferences drawn from content accessibility measurements. In fact, by simply changing our crawling tool, we are able to uncover over 160 sites which actually do implement some form of server-side blocking. Further inspecting the server-provided reasons for blocking, we see that a small fraction of servers consistently provide detailed reasons for blocking – regardless of crawler choice – while a majority simply return an HTTP 4XX status error. Rather surprisingly, we see that ALPCs are not blocked as frequently as previously expected, with the lowest numbers of both 4XX errors and CAPTCHA-based blocking. We hypothesize that this is because popular bot detection tools require JS execution (e.g., Distil Networks [15]). This suggests that ALPCs are suitable for measurements of content reachability which do not require dynamic loading or user interactions. We also see that OpenWPM's bot-mitigation method results in the least instances of CAPTCHA-based blocking among the BLCs and ULCs, thus highlighting server-side blocking against bot-like behavior and also the potential use of OpenWPM as a viable circumvention tool. Note that we do not see instances of IP-abuse based blockpages, indicating that the IP-reputation checks and dynamic reassignment in §4 are successful in removing IP reputation as a confounding variable.

             Success   4XX     GEO     BRO     CAP      ABU
Wget         82.94%    5.36%   2.68%   0.43%    9.01%   0.00%
cURL         82.62%    4.61%   2.68%   0.43%   10.09%   0.00%
OWASP ZAP    90.34%    9.66%   -       -       -        -
PhantomJS    81.97%    6.00%   0.54%   0.54%   11.48%   0.00%
Selenium     78.22%    7.40%   2.68%   0.54%   11.59%   0.00%
Puppeteer    77.04%    8.69%   2.79%   0.54%   11.16%   0.00%
OpenWPM      93.13%    7.51%   2.68%   0.54%   10.94%   0.00%

Table 8: Types of block pages observed by different crawlers. Success denotes the fraction of pages that were successfully loaded, 4XX denotes the rate of 4xx status code errors, GEO denotes the rate of regional blocking, BRO is the rate of browser errors, CAP is the rate of CAPTCHAs, and ABU denotes the rate of IP-abuse based blockpages. Note that ZAP Spider does not capture HTML files, hence we fail to capture blockpage HTMLs.

Takeaways. Our quantitative analysis shows that the data generated by different crawlers clearly impacts lower-level measurements (§5.1) and that crawlers can also impact inferences from higher-level measurements (e.g., Hwf, Htp, and Hsb). These findings highlight that researchers must not only consider crawler categories, but must also consider the underlying browser engine used by the crawler, as it has a significant influence on the results generated.

6 RELATED WORK
As shown in our literature review (§2), the use of web crawlers in academic research is ubiquitous, and there have been numerous works seeking to improve our understanding of the implications of different crawling approaches. Yadav and Goyal [81] focus on the comparison of 12 open-source crawlers, following the same direction as our work in determining the suitability of each tool; however, similar to Ricardo and Serrao [73], their work is limited to a comparative analysis of crawlers in terms of scalability and performance. Avraam et al. [40] review and compare focused web-crawling techniques by examining the proceedings of multiple web conferences. Glez-Peña et al. [54] assess existing commercial scraping frameworks by identifying their strengths and limitations in terms of data extraction capabilities, under the umbrella of biomedical applications. Kumar et al. [61] carry out a large literature survey to identify and compare types of crawling strategies, techniques, and policies for information retrieval through crawlers. Likewise, Anu et al. [39] survey and compare various crawling techniques specifically for mining web forums. Becchetti et al. [41] studied and compared the bias induced by different methods of web sampling. While all these studies carry out various comparisons of crawling methods at some level, our contributions differ by providing both a qualitative and a quantitative comparison of crawling methodologies and by focusing on how their use can directly impact various research inferences.
More recently, we have seen works that address reproducibility concerns. Flittner et al. [52] conducted an analysis of papers from four different ACM conferences and assessed the state of artifact availability. They highlight that the majority of the analyzed papers do not provide the complete toolset required to reproduce all results, and stress the release of artifacts by authors in the measurement community. Work by Scheitle et al. [75] indicates how Internet top lists show considerable churn across multiple snapshots in time, impacting the replicability of results. Collberg et al. [44] survey a number of computer systems papers to examine the repeatability of the experiments. They focus on their ability to locate, build, and run the system prototypes, and develop metrics that help assess the degree to which certain works are more repeatable than others. Elmenreich et al. [48] carry out an online survey to explore the need for reproducibility and provide guidelines on making research results reproducible.

7 DISCUSSION
Limitations. Our survey, qualitative, and quantitative studies each come with their own limitations. Some of these arise from our inability to perform large-scale manual analysis and others from our inability to make observations without interfering with natural web interactions. First, due to the scale of our survey, we rely on computational filtering heuristics to narrow down the set of publications to study. This introduces the possibility of missing important insights into how crawlers are used by the academic community. Our focus on a few select conferences does not invalidate the insights provided about how certain crawlers are used and the crawl-repeatability and specificity challenges faced by the research community. Second, our qualitative analysis of crawlers broadly categorizes the parameters exposed by different crawlers. This parameter generalization introduces the possibility of generalizing crawler (in)capabilities to the point of missing analysis of certain features that are important to specific communities – e.g., we do not explicitly verify the possibility of the crawler having or granting access to connected devices (e.g., cameras and microphones) – the topic of emerging studies in the privacy domain. Finally, our empirical analysis is complicated by several confounds: first, we are impacted by the several dynamic and real-time processes on the web (e.g., real-time bidding). Further, it is possible that our data has sources of bias and inconsistency since we currently only rely on IP intelligence data from Cisco's Talos. We enable other developers with access to more threat streams to circumvent this limitation by providing an easy-to-use extended API. Despite these limitations, our framework replicates, and in many cases pushes, the boundaries of the state-of-the-art in crawling.
Recommendations. Despite the above limitations, our study allows us to make recommendations aimed at improving the repeatability and correctness of research relying on data generated by web crawling tools. First, we urge researchers and reviewers to enforce standards of crawl specificity to improve repeatability – e.g., at the very least, require mention of the tool, tool version, and (if applicable) browser version. Currently, over one-third of all crawler-dependent research does not mention even the name of the tool used for data gathering. Second, research communities need to understand how crawling tools might impact their inferences – particularly when inferences are drawn over multiple studies, each using a different crawling methodology. As shown in our analysis of Hwf, Htp, and Hsb, this can lead to an incorrect understanding of the research domain. Therefore, we recommend that researchers aim to replicate exactly the crawling methodologies of prior work when proposing advancements to the problem domain. Alternately, there is value in evaluating results and analysis on datasets generated by multiple crawlers to ensure the correctness of inferences. To facilitate our call for repeatability and cross-crawler evaluation, we are releasing our crawling framework at https://sparta.cs.uiowa.edu/projects/methodologies.html.
Acknowledgements. We would like to thank our colleagues – Danyal Saeed Mirza, Hammas Bin Tanveer, Hussam Habib, John Thiede, Maaz Bin Musa, Muhammad Hassan, Muhammad Haroon, Ralph Nahra, Xiaoyu Xing, Yao Wang, and Zain Humayun – for their help in our survey. We also thank the anonymous reviewers whose valuable comments helped shape the paper into its final form. This work was partially funded by the Spanish national grant TIN2017-88749-R (DiscoEdge) and the Region of Madrid EdgeData-CM program (P2018/TCS-4499).

REFERENCES
[1] 2008. Blockpages Archived. https://advox.globalvoices.org/past-projects/blockpages/
[2] 2015. Collection of censorship blockpages as collected by various sources. https://github.com/citizenlab/blockpages
[3] 2015. ghost.py - Ghost.py 0.2.2 documentation. https://ghost-py.readthedocs.io/en/latest/
[4] 2015. grub.org - Distributed Internet Crawler download | SourceForge.net. https://sourceforge.net/projects/grub/
[5] 2016. Fast and powerful scraping and web crawling framework. https://scrapy.org/
[6] 2016. PhantomJS v2.1.1. https://phantomjs.org/release-2.1.html
[7] 2017. Wget v1.17.1. https://packages.ubuntu.com/xenial/wget
[8] 2018. OWASP ZAP Spider. https://www.owasp.org/index.php/OWASP_Zed_Attack_Proxy_Project
[9] 2018. Raccoon: reconnaissance and vulnerability scanning. https://www.prodefence.org/raccoon-reconnaissance-and-vulnerability-scanning/
[10] 2018. Selenium - Web Browser Automation. http://www.seleniumhq.org/
[11] 2018. Selenium v3.141.0. https://rubygems.org/gems/selenium-webdriver/versions/3.141.0
[12] 2018. Tor-Browser-Crawler. https://github.com/webfp/tor-browser-crawler
[13] 2019. ACM CCS. https://www.sigsac.org/ccs.html
[14] 2019. Alexa Top Sites. https://docs.aws.amazon.com/AlexaTopSites/latest/ApiReference_TopSitesAction.html
[15] 2019. Block Bots with Bot Mitigation and Detection Tools | Distil Networks. https://www.distilnetworks.com/block-bot-detection/
[16] 2019. Cisco Talos Intelligence Group - Comprehensive Threat Intelligence. https://www.talosintelligence.com/
[17] 2019. Cisco Umbrella 1 Million - OpenDNS Umbrella Blog. https://umbrella.cisco.com/blog/2016/12/14/cisco-umbrella-1-million/
[18] 2019. cURL - Command Line Tool. https://curl.haxx.se/
[19] 2019. cURL v7.64.0. https://curl.haxx.se/changes.html#7_64_0
[20] 2019. EasyList - Overview. https://easylist.to/
[21] 2019. GNU Wget 1.20 Manual. https://www.gnu.org/software/wget/manual/wget.html#Robot-Exclusion
[22] 2019. IEEE Symposium on Security and Privacy 2019. https://www.ieee-security.org/TC/SP2019/
[23] 2019. IMC Conference | ACM SIGCOMM. https://www.sigcomm.org/events/imc-conference
[24] 2019. International Conference on Web and Social Media Papers (ICWSM). http://www.aaai.org/Library/ICWSM/icwsm-library.php
[25] 2019. International World Wide Web Conference Committee (WWW). https://www.iw3c2.org
[26] 2019. internetarchive/heritrix3: Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. https://github.com/internetarchive/heritrix3
[27] 2019. NDSS Symposium - The Network and Distributed System Security Symposium (NDSS). https://www.ndss-symposium.org/
[28] 2019. OpenWPM v0.8.0. https://github.com/mozilla/OpenWPM/blob/b3ead7e38892095950806e8bcbb2e1129c27ca96/VERSION
[29] 2019. OWASP ZAP v2.7.0. https://github.com/zaproxy/zap-core-help/wiki/HelpReleases2_7_0
[30] 2019. OWASP Zed Attack Proxy Project - OWASP. https://www.owasp.org/index.php/OWASP_Zed_Attack_Proxy_Project
[31] 2019. PoPETs/PETS. https://petsymposium.org/
[32] 2019. Puppeteer v1.12.1. https://github.com/GoogleChrome/puppeteer/releases
[33] 2019. RedTeam Pentesting on Twitter. https://twitter.com/RedTeamPT/status/1110843396657238016
[34] 2019. USENIX Security Symposia | USENIX. https://www.usenix.org/conferences/byname/108
[35] 2019. Wget - Command Line Tool. https://www.gnu.org/software/wget/
[36] 2019. zmap/zgrab2: Go Application Layer Scanner. https://github.com/zmap/zgrab2
[37] 2020. JavaScript Usage Statistics. https://trends.builtwith.com/docinfo/Javascript
[38] Athanasios Andreou, Giridhari Venkatadri, Oana Goga, Krishna P. Gummadi, Patrick Loiseau, and Alan Mislove. 2018. Investigating Ad Transparency Mechanisms in Social Media: A Case Study of Facebook's Explanations. https://doi.org/10.14722/ndss.2018.23204
[39] Anu. 2015. A Survey on Web Forum Crawling Techniques.
[40] Ioannis Avraam and Ioannis Anagnostopoulos. 2011. A Comparison over Focused Web Crawling Strategies. In 2011 15th Panhellenic Conference on Informatics. 245–249. https://doi.org/10.1109/PCI.2011.53
[41] Luca Becchetti, Carlos Castillo, Debora Donato, and Adriano Fazzone. 2006. A comparison of sampling techniques for Web characterization.
[42] Xiang Cai, Rishab Nithyanand, Tao Wang, Rob Johnson, and Ian Goldberg. 2014. A systematic approach to developing and evaluating website fingerprinting defenses.
[48] […] PeerJ Computer Science 5 (12 2019), e240. https://doi.org/10.7717/peerj-cs.240
[49] Steven Englehardt and Arvind Narayanan. 2016. Online Tracking: A 1-million-site Measurement and Analysis. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). ACM, New York, NY, USA, 1388–1401. https://doi.org/10.1145/2976749.2978313
[50] Steven Englehardt and Arvind Narayanan. 2018. OpenWPM - Web Privacy Measurement Framework. https://github.com/mozilla/OpenWPM
[51] Jeffrey Erman, Martin Arlitt, and Anirban Mahanti. 2006. Traffic Classification Using Clustering. In Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data (MineNet '06). ACM, New York, NY, USA, 281–286. https://doi.org/10.1145/1162678.1162679
[52] Matthias Flittner, Mohamed Naoufal Mahfoudi, Damien Saucez, Matthias Wählisch, Luigi Iannone, Vaibhav Bajpai, and Alex Afanasyev. 2018. A Survey on Artifacts from CoNEXT, ICN, IMC, and SIGCOMM Conferences in 2017. SIGCOMM Comput. Commun. Rev. 48, 1 (April 2018), 75–80. https://doi.org/10.1145/3211852.3211864
[53] Sergey Frolov and Eric Wustrow. 2019. The use of TLS in Censorship Circumvention. In 26th Annual Network and Distributed System Security Symposium (NDSS 2019), San Diego, California, USA, February 24-27, 2019. https://www.ndss-symposium.org/ndss-paper/the-use-of-tls-in-censorship-circumvention/
[54] Daniel Glez-Peña, Anália Lourenço, Hugo López-Fernández, Miguel Reboiro-Jato, and Florentino Fdez-Riverola. 2013. Web scraping technologies in an API world. Briefings in Bioinformatics 15, 5 (04 2013), 788–797. https://doi.org/10.1093/bib/bbt026
[55] Roberto Gonzalez, Claudio Soriente, and Nikolaos Laoutaris. 2016. User Profiling in the Time of HTTPS. In IMC '16.
[56] Google. 2019. Chrome Puppeteer. https://github.com/GoogleChrome/puppeteer
[57] Dominik Herrmann, Rolf Wendolsky, and Hannes Federrath. 2009. Website Fingerprinting: Attacking Popular Privacy Enhancing Technologies with the Multinomial Naïve-Bayes Classifier. In Proceedings of the 2009 ACM Workshop on Cloud Computing Security (CCSW '09). ACM, New York, NY, USA, 31–42. https://doi.org/10.1145/1655008.1655013
[58] Dominik Herrmann, Rolf Wendolsky, and Hannes Federrath. 2009. Website fingerprinting: attacking popular privacy enhancing technologies with the multinomial naïve-bayes classifier. In Proceedings of the 2009 ACM workshop on Cloud computing security. ACM, 31–42.
[59] Ariya Hidayat. 2018. PhantomJS - Headless Web Browser. http://phantomjs.org
[60] Ben Jones, Tzu-Wen Lee, Nick Feamster, and Phillipa Gill. 2014. Automated Detection and Fingerprinting of Censorship Block Pages. In Proceedings of the 2014 Conference on Internet Measurement Conference (IMC '14). ACM, New York, NY, USA, 299–304. https://doi.org/10.1145/2663716.2663722
[61] Manish Kumar, Rajesh Bhatia, and Dhavleesh Rattan. 2017. A survey of Web crawlers for information retrieval. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 7 (08 2017). https://doi.org/10.1002/widm.1218
[62] Ada Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. 2016. Internet Jones and the Raiders of the Lost Trackers: An Archaeological Study of Web Tracking from 1996 to 2016. In 25th USENIX Security Symposium (USENIX Security 16). USENIX Association, Austin, TX. https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/lerner
[63] Olivier Levillain, Arnaud Ébalard, Benjamin Morin, and Hervé Debar. 2012. One Year of SSL Internet Measurement. In Proceedings of the 28th Annual Computer Security Applications Conference (ACSAC '12). ACM, New York, NY, USA, 11–20. https://doi.org/10.1145/2420950.2420953
[64] Marc Liberatore and Brian Neil Levine. 2006. Inferring the Source of Encrypted HTTP Connections. In Proceedings of the 13th ACM Conference on Computer and Communications Security (CCS '06). ACM, New York, NY, USA, 255–263. https://doi.org/10.1145/1180405.1180437
[65] Allison McDonald, Matthew Bernhard, Luke Valenta, Benjamin VanderSloot, Will Scott, Nick Sullivan, J. Alex Halderman, and Roya Ensafi. 2018. 403 Forbidden: A Global View of CDN Geoblocking. In Proceedings of the Internet Measurement Conference 2018 (IMC '18). ACM, New York, NY, USA, 218–230. https://doi.org/10.1145/3278532.3278552
[66] Allison McDonald, Matthew Bernhard, Luke Valenta, Benjamin VanderSloot, Will Scott, Nick Sullivan, J. Alex Halderman, and Roya Ensafi. 2018. 403 Forbidden: A Global View of CDN Geoblocking. In IMC '18.
[67] Pedro Miranda, Matti Siekkinen, and Heikki Waris. 2011. TLS and energy consumption on a mobile device: A measurement study. In 2011 IEEE Symposium on Computers and Communications (ISCC). 983–989. https://doi.org/10.1109/ISCC.2011.5983970
[68] Arian Akhavan Niaki, Shinyoung Cho, Zachary Weinberg, Nguyen Phong Hoang, Abbas Razaghpanah, Nicolas Christin, and Phillipa Gill. 2019. ICLab: A Global, Longitudinal Internet Censorship Measurement Platform. CoRR abs/1907.04245 (2019). arXiv:1907.04245 http://arxiv.org/abs/1907.04245
[69] Panagiotis Papadopoulos, Nicolas Kourtellis, Pablo Rodriguez Rodriguez, and Nikolaos Laoutaris. 2017. If You Are Not Paying for It, You Are the Product: How Much Do Advertisers Pay to Reach You?. In Proceedings of the 2017 Internet Measurement Conference (IMC '17). ACM, New York, NY, USA, 142–156. https://doi.org/10.1145/3131365.3131397
http://watir.com/
In Proceedings of the [70] Watir Project. 2019. Watir Project. 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 227–238. [71] Qixiang Sun, D. R. Simon, Yi-Min Wang, W. Russell, V. N. Padmanabhan, and Lili Qiu. 2002. Sta- [43] Neha Chachra, Stefan Savage, and Geoffrey M. Voelker. 2015. Affiliate Crookies: Character- tistical identification of encrypted Web browsing traffic. In Proceedings 2002 IEEE Symposium https://doi.org/10.1109/SECPRI.2002.1004359 izing Affiliate Marketing Abuse. In Proceedings of the 2015 Internet Measurement Conference on Security and Privacy. 19–30. (IMC ’15). ACM, New York, NY, USA, 41–47. https://doi.org/10.1145/2815675.2815720 [72] Abbas Razaghpanah, Anke Li, Arturo Filasto, Rishab Nithyanand, Vasilis Ververis, Will Scott, [44] Christian Collberg and Todd A. Proebsting. 2016. Repeatability in Computer Systems Research. and Phillipa Gill. 2016. Exploring the design space of longitudinal censorship measurement Commun. ACM 59, 3 (Feb. 2016), 62–69. https://doi.org/10.1145/2812803 platforms. arXiv preprint arXiv:1606.01979 (2016). [45] John Cook, Rishab Nithyanand, and Zubair Shafiq. 2019. Inferring Tracker-Advertiser Rela- [73] Andre Ricardo and Carlos Serrao. 2013. Comparison of existing open source tools for web tionships in the Online Advertising Ecosystem using Header Bidding. CoRR abs/1907.07275 crawling and indexing of free music. In Journal of Telecommunications Vol. 18. (2019). arXiv:1907.07275 http://arxiv.org/abs/1907.07275 [74] Roberto Gonzalez Sanchez, Claudio Soriente, and Nikolaos Laoutaris. 2017. Method for per- [46] Dominik Dary. 2019. Selendroid: Selenium for Android. http://selendroid.io/ forming user profiling from encrypted network traffic flows. US Patent App. 15/486,318. [47] Anupam Das, Gunes Acar, Nikita Borisov, and Amogh Pradeep. 2018. The Web’s Sixth Sense: [75] Quirin Scheitle, Oliver Hohlfeld, Julien Gamba, Jonas Jelten, Torsten Zimmermann, Stephen D. 
A Study of Scripts Accessing Smartphone Sensors. In Proceedings of the 2018 ACM SIGSAC Strowes, and Narseo Vallina-Rodriguez. 2018. A Long Way to the Top: Significance, Structure, Conference on Computer and Communications Security (CCS ’18). ACM, New York, NY, USA, and Stability of Internet Top Lists. In Proceedings of the Internet Measurement Conference 2018 https:// 1515–1532. https://doi.org/10.1145/3243734.3243860 (IMC ‘18). Association for Computing Machinery, New York, NY, USA, 478–493. [48] Wilfried Elmenreich, Philipp Moll, Sebastian Theuermann, and Mathias Lux. 2019. Making doi.org/10.1145/3278532.3278574 simulation results reproducible-Survey, guidelines, and examples based on Gradle and Docker. 11 WWW ’20, April 20–24, 2020, Taipei, Taiwan S. Ahmad et al.

[76] Rachee Singh, Rishab Nithyanand, Sadia Afroz, Paul Pearce, Michael Carl Tschantz, Phillipa Gill, and Vern Paxson. 2017. Characterizing the Nature and Dynamics of Tor Exit Blocking. In 26th USENIX Security Symposium (USENIX Security 17). USENIX Association, Vancouver, BC, 325–341. https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/singh
[77] Nicolas Viennot, Edward Garcia, and Jason Nieh. 2014. A Measurement Study of Google Play. In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '14). ACM, New York, NY, USA, 221–233. https://doi.org/10.1145/2591971.2592003
[78] Tao Wang. 2017. File directory: Website fingerprinting attacks. https://www.cse.ust.hk/~taow/wf/attacks/
[79] Tao Wang, Xiang Cai, Rishab Nithyanand, Rob Johnson, and Ian Goldberg. 2014. Effective attacks and provable defenses for website fingerprinting. In 23rd USENIX Security Symposium (USENIX Security 14). 143–157.
[80] Zachary Weinberg, Mahmood Sharif, Janos Szurdi, and Nicolas Christin. 2017. Topics of Controversy: An Empirical Analysis of Web Censorship Lists. PoPETs 2017, 1 (2017), 42–61.
[81] Monika Yadav and Neha Goyal. 2015. Comparison of Open Source Crawlers - A Review.
[82] Nicholas Yuan, Fuzheng Zhang, Defu Lian, Kai Zheng, Siyu Yu, and Xing Xie. 2013. We know how you live: Exploring the spectrum of urban lifestyles. In COSN 2013 - Proceedings of the 2013 Conference on Online Social Networks. 3–14. https://doi.org/10.1145/2512938.2512945
