Tracking the Trackers: a Large-Scale Analysis of Embedded Web Trackers
Proceedings of the Tenth International AAAI Conference on Web and Social Media (ICWSM 2016)

Sebastian Schelter, Technische Universität Berlin, [email protected]
Jérôme Kunegis, University of Koblenz–Landau, [email protected]

Abstract

We perform a large-scale analysis of third-party trackers on the World Wide Web. We extract third-party embeddings from more than 3.5 billion web pages of the CommonCrawl 2012 corpus, and aggregate those to a dataset representing more than 41 million domains. With that, we study global online tracking on two levels: (1) On a global level, we give a precise figure for the extent of tracking, and analyse which trackers (and subsequently, which companies) are used by how many websites. (2) On a country-specific level, we analyse which trackers are used by websites in different countries, and identify the countries in which websites choose significantly different trackers than in the rest of the world. We find that trackers are widespread (as expected), and that very few trackers dominate the web (Google, Facebook and Twitter), except for a few countries such as China and Russia.

Introduction

The ability of a website to track which pages its visitors read has been present since the beginnings of the World Wide Web. With the advent of social media and Web 2.0, however, another tracking mechanism has appeared: that of third-party websites embedded into the visited website by mechanisms such as JavaScript and images. Despite the widespread deployment of such tracking technologies, only small qualitative analyses of online tracking have been performed to date. We bridge this gap by performing a large-scale study of the distribution of third-party trackers on the web.

The majority of websites contain third-party content, i.e., content from another domain that a visitor's browser loads and renders upon displaying the website. Such an embedding of third-party content has always been possible, but was relatively rare, since most embedded images were located on the same server as the page itself. In any case, the embedding of content was not intended for tracking. With the rise of social media and Web 2.0, websites increasingly began to embed links (in various forms) to third-party content, allowing the providers of such content to track users on a wide scale. The inclusion of third-party content occurs for a variety of reasons, e.g., advertising, conversion tracking, acceleration of content loading, or provision of widgets.

Regardless of their primary purpose, third-party components can (and in many cases do) track web users across many sites and record their browsing behavior, and thereby constitute a privacy hazard. In order to understand and control this hazard, it is desirable to gain a deeper understanding of the 'online tracking sphere' as a whole. Previous research, however, has only studied small samples of this sphere due to the lack of comprehensive datasets. Recent developments allow us to study this online tracking sphere at a large scale: the availability of enormous web crawls comprising hundreds of terabytes of web data, such as CommonCrawl (https://commoncrawl.org/).

Online Tracking Fundamentals

Technical foundations. In its basic form, online tracking involves three types of actors: a user browsing the web, the website that she intentionally visits, and services called third parties, which record her browsing to the website. Specifically, the user visits a website whose HTML code typically contains references to external resources, such as style sheets, JavaScript code and images that are required to render the page in the client browser. These external resources allow online trackers to enter the communication when they reside on servers controlled by the third party. A typical example of such an external resource is a piece of JavaScript code, which the user's browser will automatically load from the third-party server and execute. This external loading enables the third party to record a wide variety of information about the user, such as the browser version, operating system, or approximate geolocation. This information can be used to compute a 'fingerprint' of the browser that works surprisingly well at recognizing individual users (Eckersley 2010). Furthermore, the third party has access to the URI of the page the user is visiting, the HTTP referrer, and, potentially, to previously set cookies (e.g., for a persistent login).
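To make the fingerprinting idea concrete, the following minimal Python sketch shows how a server hosting third-party resources could combine a handful of passively observable request attributes into a fingerprint-like identifier. The chosen attributes, the hashing scheme, and the function name are illustrative assumptions for this example, not the technique studied by Eckersley (2010) or an implementation used in this paper.

import hashlib

def browser_fingerprint(attributes: dict) -> str:
    # Illustrative only: hash a few attributes that a third-party server
    # can observe when the browser fetches one of its resources.
    keys = ["user_agent", "accept_language", "screen_resolution",
            "timezone", "installed_fonts"]
    canonical = "|".join(f"{k}={attributes.get(k, '')}" for k in keys)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# Example request as it might appear to a hypothetical tracker endpoint
print(browser_fingerprint({
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) ...",
    "accept_language": "en-US,en;q=0.9",
    "screen_resolution": "1920x1080",
    "timezone": "UTC+2",
    "installed_fonts": "Arial,Helvetica,Times",
}))

The point of the sketch is that none of these attributes is secret or identifying on its own; it is their combination that becomes discriminating across users.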
Privacy implications. There is a considerable variety of third parties, like advertisers, analytics services, social widgets, content delivery networks and image hosters, all of which have legitimate uses. However, the ability of many third parties to record large portions of the browsing behavior of many users across a huge number of sites on the web poses a privacy risk, and is the subject of ongoing legal disputes (The Guardian 2015). The data recorded by this tracking infrastructure has been reported to contain large portions of users' online news consumption (Trackography 2014), as well as intimate, health-related personal data (EFFHealth 2015). The ability to consume news and form a political opinion in an independent and unwatched manner, as well as the privacy of personal health-related data, are vital for an open society, and should not be subject to commercially motivated data collection. Furthermore, recent frightening reports suggest that intelligence agencies piggyback on online tracking identifiers to build databases of the surfing behavior of millions of people (The Intercept 2015).

Figure 1: The twenty most common third parties by rank share, among them google-analytics.com, google.com, googleapis.com, facebook.com, youtube.com, twitter.com, googlesyndication.com, facebook.net, adobe.com, macromedia.com, addthis.com, twimg.com, gravatar.com, flickr.com, amazonaws.com, wordpress.com, vimeo.com, feedburner.com, fbcdn.net and googleadservices.com. Tracking third parties are highlighted.
Figure 2: Cumulative distribution of the number of domains visible to tracking services (axes: degree, i.e., number of tracked websites, vs. P(x larger than degree)).

Data Acquisition

Collection methodology and limitations. We represent both websites and third parties by their pay-level domains (the pay-level domain is a sub-domain of a public top-level domain, for which users usually pay). We develop an extractor that takes an HTML document as input and retrieves all pay-level domains of third-party services that are embedded in the HTML code. In a first pass through the document, we investigate the src attribute of script, iframe, link and image tags. In order to also find third parties that are dynamically embedded via JavaScript code, we parse all JavaScript and collect string variables that match a URI pattern. We run this extractor on CommonCrawl 2012, which has a size of approximately 210 terabytes in uncompressed form (Spiegler 2013). We choose CommonCrawl 2012 over later corpora, as it has been created using a breadth-first search crawling strategy, which produces a much better representation of the underlying link structure of the web than the crawling of predefined lists used in later corpora (Lehmberg, Meusel, and Bizer 2014). We run our extraction via Hadoop in the Amazon Elastic MapReduce service. We aggregate our data by pay-level domain, as we focus on high-level patterns of the tracker distribution. A limitation of our collection methodology is that our extractor cannot find transient trackers (trackers which are created from dynamically fetched external JavaScript code), as we only parse the JavaScript in an HTML page, but cannot execute it for performance reasons.
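The following Python sketch illustrates the kind of two-step extraction described above on a single HTML document. It is a simplified reconstruction for illustration, not the paper's actual implementation (which ran as Hadoop jobs over the full corpus): the use of the standard html.parser module, the URI regular expression, and the naive pay-level-domain heuristic are assumptions made for the example.

import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Crude matcher for URI-like string literals inside inline JavaScript
URI_PATTERN = re.compile(r"https?://[^\s'\"<>]+")

def pay_level_domain(host: str) -> str:
    # Naive heuristic: keep the last two labels. A real implementation would
    # consult the public suffix list (e.g., via the tldextract package).
    return ".".join(host.split(".")[-2:]) if host else ""

class ThirdPartyExtractor(HTMLParser):
    """First pass: src/href attributes of script, iframe, link and img tags.
    Second pass: URI-like strings inside inline JavaScript."""
    def __init__(self, site_domain: str):
        super().__init__()
        self.site_domain = site_domain
        self.third_parties = set()
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self._in_script = (tag == "script")
        if tag in ("script", "iframe", "link", "img"):
            for name, value in attrs:
                if name in ("src", "href") and value:
                    self._record(value)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:  # scan inline JavaScript for embedded URIs
            for uri in URI_PATTERN.findall(data):
                self._record(uri)

    def _record(self, uri):
        pld = pay_level_domain(urlparse(uri).netloc)
        if pld and pld != self.site_domain:  # keep third-party domains only
            self.third_parties.add(pld)

# Usage on a toy document
extractor = ThirdPartyExtractor("example.org")
extractor.feed('<script src="https://www.google-analytics.com/ga.js"></script>'
               '<script>var u = "https://cdn.tracker.net/px.gif";</script>')
print(sorted(extractor.third_parties))  # ['google-analytics.com', 'tracker.net']

As in the paper's setup, such an extractor only sees the static JavaScript shipped with the page; trackers injected by dynamically fetched scripts remain invisible unless the code is actually executed.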
Dataset statistics and characteristics. We extract the pay-level domains of third parties contained in the 3,536,611,510 web pages of the corpus [...] websites into which these services are embedded. Thereby, we construct the bipartite tracking network, which represents the 36,982,655 embeddings of the 355 potentially privacy-threatening third parties in the 41,192,060 website pay-level domains of our corpus.

Analysis of Tracking Services

Ranking tracking services. We study the extracted tracking network to gain insights into the distribution of online tracking services on the web. We use Apache Flink (Alexandrov et al. 2014) for conducting the analysis. Our main question is to what proportion the browsing behavior of users on the web is visible to particular tracking services. Furthermore, we are interested in how strongly the tracking capabilities differ among the various services. Unfortunately, we are not aware of any data source available to scientists that would allow us to quantify the number of visitors over time for our 41 million pay-level domains in 2012. We therefore employ PageRank (Page et al. 1999), a well-known measure of the relevance of websites, as a proxy for ranking them by the traffic they attract. We obtain the network of 623,056,313 hyperlinks between the pay-level domains in CommonCrawl, and compute its PageRank distribution to get an importance ranking for all the pay-level domains in our corpus. We derive a ranking measure for third parties from the PageRank distribution p of the pay-level domains in the hyperlink network as follows. Let D denote a set of pay-level domains to inspect (e.g., all pay-level domains belonging to a certain top-level domain) and let t(D) denote the subset of pay-level domains contained in D having third party t embedded.
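The excerpt breaks off before the rank-share measure itself is stated, so the following Python sketch only illustrates one plausible reading of the setup above: the rank share of a third party t over D taken as the fraction of the PageRank mass of D that falls on the domains in t(D). This definition, the function name, and the toy data are assumptions for the example, not the paper's formula.

def rank_share(pagerank, domains_D, embeddings, tracker_t):
    # pagerank:   dict mapping pay-level domain -> PageRank value p_d
    # domains_D:  the set D of pay-level domains to inspect
    # embeddings: dict mapping pay-level domain -> set of embedded third parties
    # tracker_t:  the third party t under consideration
    t_of_D = {d for d in domains_D if tracker_t in embeddings.get(d, set())}
    total = sum(pagerank[d] for d in domains_D)
    return sum(pagerank[d] for d in t_of_D) / total if total > 0 else 0.0

# Toy example: two of three domains embed the tracker
pagerank = {"a.com": 0.5, "b.org": 0.3, "c.net": 0.2}
embeddings = {"a.com": {"google-analytics.com"},
              "b.org": set(),
              "c.net": {"google-analytics.com"}}
print(round(rank_share(pagerank, set(pagerank), embeddings, "google-analytics.com"), 2))  # 0.7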