On the Ubiquity of Web Tracking: Insights from a Billion-Page Web Crawl
Total Page:16
File Type:pdf, Size:1020Kb
Journal of Web Science, 2018, 4: 53–66 On the Ubiquity of Web Tracking: Insights from a Billion-Page Web Crawl Sebastian Schelter1 and Jérôme Kunegis2 1Database Systems and Information Management Group, Technische Universität Berlin, Germany [email protected] 2Namur Center for Complex Systems, University of Namur, Belgium [email protected] ABSTRACT We perform a large-scale analysis of third-party trackers on the World Wide Web. We extract third-party embed- dings from more than 3.5 billion web pages of the CommonCrawl 2012 corpus, and aggregate those to a dataset containing more than 140 million third-party embeddings in over 41 million domains. We study this data on several levels and provide the following contributions: (1) Our work leverages the largest empirical web tracking dataset collected so far, and exceeds related studies by more than an order of magnitude in the number of domains and web pages analyzed. As our dataset also contains the link structure of the web, we are able to derive a ranking measure for tracker occurrences based on aggregated network centrality rather than simple domain counts. We make our extracted data and computed rankings available to the research community. (2) On a global level, we give a precise figure for the extent of tracking, give insights into the structural properties of the ‘online tracking sphere’ and analyse which trackers (and subsequently, which companies) are used by how many websites, leveraging our ranking measure derived from the link structure of the web. (3) On a country-specific level, we analyse which trackers are used by websites in different countries, and identify the countries in which websites choose significantly different trackers than in the rest of the world. In particular, the three tracking domains with the highest PageRank are all owned by Google. The only exception to this pattern are a handful of countries such as China and Russia. Our results suggest that this dominance is strongly associated with country-specific political factors such as freedom of the press. (4) We investigate whether the content of websites influences the choice of trackers they use, leveraging more than ninety thousand categorized domains. In particular, we analyse whether highly privacy-critical websites about health and addiction make different choices of trackers than other websites. Our findings indicate that websites with highly privacy-critical content are less likely to contain trackers (60% vs 90% for other websites), even though the majority of them still do contain trackers. Keywords: Web, Tracking, CommonCrawl ISSN 2332-4031; DOI 10.1561/106.00000014 ©2018 S. Schelter and J. Kunegis 1 Introduction to embed links (using various technologies) to third-party con- tent, and thereby allowed the providers of such content to track The ability of a website to track the pages read by its visitors users on a wide scale. The inclusion of third-party content oc- has been present since the beginnings of the World Wide Web. curs for a variety of reasons, of which not all require tracking, In recent years however, another tracking mechanism has be- e.g., advertising, conversion tracking, acceleration of content come widespread: that of third-party websites embedded into loading, and provision of widgets. Social media and advertising the visited site by mechanisms such as JavaScript and images. companies primarily use the data collected via these tracking mechanisms to improve their capability to show personalized, In fact, the majority of websites contain third-party con- tailored advertisements to their users, which represent their tent, i.e., content from another domain that a visitor’s browser main source of income. loads and renders upon displaying the website. Such an em- bedding of third-party content has always been possible, but Regardless of their primary purpose, third-party compo- was relatively rare, since most embedded images were located nents can (and in many cases do) track web users across many on the same server as the page itself, and in any case, the em- sites and record their browsing behavior. Therefore, these bedding of content was not intended for tracking. With the third-parties constitute a privacy hazard in many aspects, such rise of social media and Web 2.0,websitesincreasinglybegan as the collection of data about health conditions (Electronic 54 Schelter and Kunegis Frontier Foundation, 2015; Libert, 2015b), news consumption as opposed to other websites, even though the majority of them (Trackography, 2014), or their instrumentalization for mass still do contain trackers (Section 7). We also show that certain surveillance by intelligence agencies (The Intercept, 2015). In types of trackers are in fact present with higher probability order to understand and control these hazards, it is desirable on websites with privacy-critical content, indicating a lack of to gain a deeper understanding of the ‘online tracking sphere’ awareness of the tracking ability of their underlying services. as a whole. Previous research has studied small samples of this (5) Finally, we provide our dataset to the scientific community3 sphere, e.g., 1,200 English-language domains from Alexa’s pop- and contribute it to the Koblenz Network Collection (Kunegis, ular sites (Krishnamurthy and Wills, 2009b), while researchers 2013)4. have only recently started to study larger datasets, e.g., track- ing on the Alexa top 1 million domains (Libert, 2015a). The main reason for this is that, until lately, such a study would 2 Online Tracking Fundamentals only have been possible for large companies possessing their own ‘copy’ of the web. Recent developments however allow In this section, we give a technical introduction to web tracking, us to study this online tracking sphere at a large scale: the and highlight resulting privacy implications. availability of enormous web crawls comprised of hundreds of terabytes of web data (Spiegler, 2013), such as CommonCrawl 2012.1 2.1 Technical Foundations We process all 3.5 billion webpages from this corpus (which In its basic form, online tracking involves users browsing the amounts to more than 200 terabytes of data), and extract the Web and the website that they intentionally visit. In addi- bipartite ‘tracking graph’, which describes the tracking of more tion to these two actors, one or more services may be present than 41 million pay-level domains by 355 tracking services. that record users’ browsing to the website; we call such ser- Apay-leveldomain(PLD)isasub-domainofapublictop- vices third-parties. We refer to third-parties as trackers if their level domain that users pay for individually. PLDs allow us main purpose or the business model of their owning company to identify a realm, where a single organization is likely to depends on collecting browsing data of users. Therefore, we 2 be in control , e.g., the PLD for www.example.com would be categorize advertising services and social network plugins as example.com.Subsequently,weusethisdatatostudythree trackers, because their main source of income comes from tar- key research questions: (i) Which are the predominant track- geted, personalized advertisements which heavily depend on ing companies on the Web, and how many websites do they user data. On the other hand, we do not consider content track? (ii) How does the distribution of web trackers vary from delivery networks as trackers, because their main business is country to country? How is that variation related to different to accelerate website loading (see Krishnamurthy and Wills, political, socio-cultural and economic factors? (iii) How are 2009b; Englehardt and Narayanan, 2016). trackers distributed with respect to highly privacy-critical top- Specifically, the user visits a website, which means that ics such as health and addiction? Do authors of such websites her browser issues an HTTP “GET” request to the web server avoid web trackers? hosting the website. The web server returns an HTTP response To the best of our knowledge, the data extracted in the pa- containing the HTML code of the website to render. This re- per constitutes the largest empirical dataset of web trackers col- turned HTML code typically contains references to external re- lected so far, containing an order of magnitude more domains sources, such as style sheets, JavaScript code and images that than comparable studies. In this article, we extend our findings are required to render the page in the client browser. Next, from previous work (Schelter and Kunegis, 2016) in several di- the user’s browser automatically issues requests for these ad- rections: (1) We conduct an analysis of tracking in the long ditional resources. In many cases, resources used by a web tail of domains on the web, looking into 41 million pay-level page reside on a different server than the one hosting the web- domains (Section 5), and identify global structural properties site. This possibility allows third-parties to track the user by of the online tracking sphere. (2) We analyze the tracking com- recording and inspecting the external requests to their servers. pany distribution in top-level domains (TLDs) of non-English In the case of online tracking, the website’s HTML code speaking countries, and find that a small set of US-based com- embeds external resources from a third-party which aims to panies have a dominating role in the majority of countries; a track the user. A typical example for such an external resource correlation analysis suggests that this role is strongly associated is a piece of JavaScript code, which the user’s browser will au- with political factors such as freedom of the press (Section 6). tomatically load and execute from the third-party server. This Furthermore, we investigate two additional research directions: external loading enables the third-party to record a wide va- (3) We uncover country-specific as well as category-specific pat- riety of information about the user: (i) The third-party sees terns in the relations between third-party trackers by clustering the current IP address of the user machine, which reveals the their co-occurrences (Section 5.5).