What a Tangled Web We Weave: Understanding the Interconnectedness of the Third Party Cookie Ecosystem
Xuehui Hu
King’s College London
London, UK
[email protected]

Nishanth Sastry
King’s College London, UK
University of Surrey, UK
[email protected]

ABSTRACT

When users browse to a so-called “First Party” website, other third parties are able to place cookies on the users’ browsers. Although this practice can enable some important use cases, in practice, these third party cookies also allow trackers to identify that a user has visited two or more first parties which share the same third party. This simple feature has been used to bootstrap an extensive tracking ecosystem that can severely compromise user privacy.

In this paper, we develop a metric called “tangle factor” that measures how a set of first party websites may be interconnected or tangled with each other based on the common third parties used. Our insight is that the interconnectedness can be calculated as the chromatic number of a graph where the first party sites are the nodes, and edges are induced based on shared third parties.

We use this technique to measure the interconnectedness of the browsing patterns of over 100 users in 25 different countries, through a Chrome browser plugin which we have deployed. The users of our plugin consist of a small, carefully selected set of 15 test users in the UK and China, and 1000+ in-the-wild users, of whom 124 have shared data with us. We show that different countries have different levels of interconnectedness; for example, China has a lower tangle factor than the UK. We also show that when visiting the same sets of websites from China, the tangle factor is smaller, due to blocking of major operators like Google and Facebook.

We show that selectively removing the largest trackers is a very effective way of decreasing the interconnectedness of third party websites. We then consider blocking practices employed by privacy-conscious users (such as ad blockers) as well as those enabled by default by Chrome and Firefox, and compare their effectiveness using the tangle factor metric we have defined. Our results help quantify for the first time the extent to which one ad blocker is more effective than others, and how Firefox defaults also greatly help decrease third party tracking compared to Chrome.

CCS CONCEPTS

• Security and privacy → Web application security; Social network security and privacy.

KEYWORDS

Container Identity, Third Party Cookie

ACM Reference Format:
Xuehui Hu and Nishanth Sastry. 2020. What a Tangled Web We Weave: Understanding the Interconnectedness of the Third Party Cookie Ecosystem. In 12th ACM Conference on Web Science (WebSci ’20), July 6–10, 2020, Southampton, United Kingdom. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3394231.3397897

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
WebSci ’20, July 6–10, 2020, Southampton, United Kingdom
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-7989-2/20/07...$15.00
https://doi.org/10.1145/3394231.3397897

1 INTRODUCTION

It is common knowledge that cookies are placed in a user’s browser when they visit a website. The websites that the users visit are called first party websites (FPs). Although some of these cookies are placed by FP domains, many are placed by third party affiliates (TPs) of the first party sites, for reasons such as advertising or analytics. Examples include advertising networks such as adnxs.com, amazon-adsystem.com and doubleclick.net, analytics platforms such as google-analytics.com, or social media trackers such as Facebook and Twitter.

Such TPs have proved to be an extremely powerful method for aggregating users’ browser histories. For example, a TP that appears on both https://www.bbc.com and https://www.nytimes.com is able to infer that a user visited both sites, and therefore can infer that the user might be someone who is interested in news and current affairs. An important motivation for such detailed profiling of users is that it can lead to more “targeted” advertisements, which are much more profitable. For instance, recent research shows that right-leaning hyperpartisan websites in the US use more sophisticated and more intensive tracking techniques and are therefore able to command higher prices than left-leaning websites [3].

As advertisers’ demand for detailed profiles continues to grow, so does the sophistication of third party tracking technologies. Users and many browsers have responded with new technologies to safeguard user privacy. Indeed, this has led to an “arms race”, with users installing ad blockers such as Adblock Plus [14], uBlock Origin [18] and Ghostery [9], and certain websites (e.g., memeburn.com, englishforum.ch) responding with anti ad-blocking technology [31, 37, 46] that refuses to deliver content unless users disable ad blockers whilst visiting their sites.

An important addition to the arsenal of technologies for preventing tracking of users is the notion of multi-account containers, often referred to simply as “containers”. Pioneered by Firefox [33], containers are a way to separate different sets of cookies from each other. Containers were initially intended for providing “contextual identities”¹, i.e., to create different user identities depending on the context of operation. For example, containers allow a user to cleanly separate their work and personal identities on the same browser. Different containers can also be used to log in to the same sites with multiple user ids (for example, users having several email accounts from the same provider can be simultaneously logged in to each of them in different containers). In this sense, containers are similar to other browser extensions such as Multifox [28] or CookieSwap [13] (or the similar Swap my Cookies extension on Chrome [15]).

¹ https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/Work_with_contextual_identities

Firefox containers essentially create a different user profile within each container, providing a different database of cookies and storage for each². Thus, each container identity is kept separate from the others, and information such as third party cookies is not shared across containers. Firefox suggests [33] that this can also be used to achieve additional privacy, for instance by placing a user’s shopping websites into a separate container from financial websites such as banks and credit cards.

² In practice, a userContextID column is added to the cookie database, and only cookies matching the context ID of the container are sent to the website.

Although containers offer a clean way to separate third (and first) parties from each other by creating different cookie stores, they are only effective if interfering parties are manually placed in separate containers. Recently, Firefox has come up with add-ons which automatically, or by default, provide protection for one of the most prevalent trackers: Facebook. This add-on creates a specialised container [36] for Facebook. Once installed, it opens Facebook in its own container, and the user is logged out of Facebook in other containers, thus automatically preventing Facebook Pixel [35] and other tracking by Facebook of users who are logged in.

While a solution for Facebook’s tracking is indeed important, as Facebook is widely used on many websites, it is not sufficient, as it does not offer protection against other third parties. Furthermore, Facebook is blocked by the Government in countries such as China, and thus is not a major third party in that country [20]. Thus, solutions are needed that work for other important third parties, as well as for other country and cultural contexts.

Generalising from the Facebook example, any third party can be prevented from learning the (partial) browsing histories of users if two first parties sharing the same third party are placed in different containers. Fig. 1 illustrates this with an example. The websites in green and red share one or more third parties (e.g., Facebook, Google DoubleClick etc.), and therefore need to be placed in separate containers; otherwise the common third parties (e.g., Google’s DoubleClick) will be able to infer that the same user visited both the green and red websites. However, the blue site does not share any third parties with either the green or red website and therefore can be placed in the same container with either of those two websites.

Figure 1: Overlapping first parties which share a third party tracker must be placed in separate containers, thus the red and green sites must be separated. The blue website does not share any third parties with either red or green and can be placed together with either of them in a container.

This paper asks the simple question: if the above policy is applied uniformly across different sets of websites, how many containers will be needed to separate different first party websites that share one or more third parties? This number provides us with a way to characterise and quantify the interconnectedness of the third party ecosystem for a given set of first party websites. We term this the Third Party Tangle Factor, or simply the Tangle Factor, of that set of websites. The higher the Tangle Factor, the more interconnected the third party cookie ecosystem.

We calculate the tangle factor by modelling the set of FPs as nodes of a graph, drawing edges between two FPs when they share one or more TPs. We call this the first party interconnection graph (FPIG). Two FPs in the FPIG that share an edge (i.e., share one or more TPs) must therefore be placed in separate containers in order to prevent tracking by the shared TPs. If we assign a colour to each container, and label FPs in the FPIG with the colour of the container they are placed in, it is easy to see that the vertex chromatic number of the FPIG, i.e., the number of colours needed for nodes or vertices of the FPIG such that neighbouring vertices which share an edge are coloured differently, gives the minimum number of containers needed to effectively separate that set of FPs. We call this the tangle factor of a given set of first party websites.

We apply the Tangle Factor metric to three different sets of first party websites. First, we look at the Alexa top-k websites for different values of k, and for two different countries, the UK and China. Second, we leverage an ongoing user study³ which developed a Chrome plugin and collected anonymised browsing histories for a period of one year from a small cohort of users in the same two countries (UK and China). Our plugin has since been released in-the-wild, to help users to visualise their own browsing histories. We also provide these users an option to manually send us their histories, and contribute to our study. Our third and final set of websites is two months of browsing histories of 124 users from 25 countries who have decided to voluntarily contribute data to our user study.

³ This study has been approved by King’s College London Research Ethics Committee (Approval no. MRS-1718-6539).

We can use the tangle factor to understand the third party ecosystem from different vantage points. For example, we visit the top-k most popular websites from the UK and China, and find that for the same set of websites (global top-2k websites according to Alexa), visiting from the UK results in much higher interconnectivity than from China. In other words, the most popular sites have a higher tangle factor from UK locations, i.e., it requires many more separate containers to prevent third parties from tracking browsing of top-k websites in the UK than in China.

A similar result carries over into actual browsing histories of real users in both countries, which are based on country-specific websites rather than synthetic “top-k” websites: browsing histories of users in China have a lower tangle factor, i.e., are more easily separated, and with fewer containers, than browsing histories in
the UK. We then expand to investigate the browsing histories of our in-the-wild users across 25 different countries, and show that the interconnectedness of websites has only a low (-0.0726) correlation with the actual numbers of third parties. Rather, it is the ubiquity of large third party trackers (e.g., Facebook and Google DoubleClick), which are present on a large proportion of websites, that increases the tangle factors.

Consequently, we explore a “what if” scenario where the most common third parties simply did not exist, or were prevented from operating (e.g., through ad blockers). We show that deleting the most common third party trackers, which corresponds to deleting the most common sources of edge creation in the FPIG, is a highly effective approach, and the tangle factor drops very quickly with the removal of the most common trackers. We then use this approach to measure the effectiveness of several ad blockers, and show that uBlock Origin is more effective than Adblock Plus, Ghostery and AdGuard. We also show that a) the largest containers of non-interfering first parties contain regional and computer-related websites, and b) using Adblock Plus and uBlock Origin results in UK interconnectivity dropping to levels similar to that of China. This lower interconnectivity is quantified in terms of the increase in the sizes of the largest set of non-interfering first parties (i.e., the size of the largest container when separating different first parties).

The rest of this paper is organised as follows: §2 defines the tangle factor, and provides details about our datasets. §3 quantifies the tangle factor by country, in terms of the numbers of containers needed to separate first party websites that share a third party; §4 uses the tangle factor to assess different methods to restrict the numbers of containers needed; §5 discusses related work and context, and §6 concludes.

2 DATA COLLECTION & METHODOLOGY

2.1 Browser plugin for data collection

The backbone of our data collection methodology is a plugin called “Thunderbeam”, which we have created for Google Chrome by extending a Firefox extension called Lightbeam [32]⁴. Thunderbeam not only allows users to log their browsing history, but also provides support to track third-party networks across sites. The original Lightbeam for Firefox was based on an add-on called Collusion developed by Mozilla in 2012 [43]. Branching from Collusion, there is an add-on called Disconnect [11] for Google Chrome browsers. However, as opposed to Lightbeam, Disconnect can only log trackers in an uncontextualized manner (i.e., by looking at websites individually). Thus, Disconnect does not support tracking third parties across sites, since there is no direct mechanism to capture and match the correspondence of first-party and third-party requests in Chrome. Our plugin develops several adaptations to make this work on Chrome, as detailed in a separate paper [20].

⁴ Lightbeam is no longer a supported Firefox extension after Oct 2019.

2.2 Third party disambiguation

In our analysis, we take the data provided by the Thunderbeam plugin, and identify different third parties. This requires some subtle disambiguation. First, we may have different domain names from the same third party entity. For instance, if there are both x.doubleclick.net and y.doubleclick.net tracking bbc.com, we record them both as doubleclick.net and only count them once. Thus, we refer to third party services based on the 2nd-level domain names [5].

Next, we may have different third party domains that all belong to the same parent entity. For example, DoubleClick and Google Analytics are both owned by Google. Following [23], we detect such cases by tracing the Authoritative DNS (ADNS) servers for the 2nd-level domain names of TP entities, and merging TPs that share the same ADNS server.

Finally, some third parties use a cookie synchronisation mechanism to establish a “data sharing tunnel” between different third-party vendors. Following guidelines in [39], we detect cookie synchronisation by correlating shared unique userIDs embedded in cookies stored by different third parties.

2.3 First Party Interconnectivity Graphs (FPIG)

After merging third parties using the techniques mentioned above, we consider all the third parties of a given set of first parties. We then model this as a graph, where the nodes are the first party websites, and edges are drawn between two nodes if the corresponding first party websites share a third party (after third parties are merged using the disambiguations discussed above).

2.4 Calculating Tangle Factor

The tangle factor of a given set of websites, i.e., a given FPIG, is calculated by computing an assignment of FPs to different containers, whilst respecting the restriction that two FPs that share a TP (i.e., two nodes connected by an edge in the FPIG) must be placed in different containers.

Figure 2: Restrictions in this first-party (FP) model example: (1) FP1 cannot be in the same container as FP3; (2) FP2 cannot be placed with FP4 or FP6; (3) FP6 cannot be placed with FP5.

Effectively, this corresponds to a graph vertex colouring problem, where nodes with the same colour can be placed within the same container, and nodes which share an edge must be assigned different colours. Fig. 2 illustrates how a particular set of edges between different first parties would give rise to a container assignment. The minimum number of colours for the vertex colouring problem, i.e., the minimum number of containers needed in the FPIG, is the tangle factor of a given FPIG.
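The pipeline of §2.2 to §2.4 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`registrable_domain`, `build_fpig`, `tangle_factor`), the `parent_of` mapping and the `*.example` hosts are hypothetical, and the text above does not say which colouring algorithm is used. The sketch assumes a naive "last two labels" rule for 2nd-level domains (a real pipeline would consult the Public Suffix List and the ADNS-based merging described above) and uses exhaustive backtracking, which is only practical for small graphs because graph colouring is NP-hard. The third party sets are chosen so that the resulting edges match the restrictions listed for Figure 2.

```python
from itertools import combinations


def registrable_domain(host):
    """Naive 2nd-level domain: keep the last two labels.

    Assumption for illustration only; this mislabels hosts such as
    www.bbc.co.uk, which a Public Suffix List lookup would handle.
    """
    return ".".join(host.lower().rstrip(".").split(".")[-2:])


def build_fpig(fp_to_tp_hosts, parent_of=None):
    """Build the FPIG as {fp: set of neighbouring fps}.

    parent_of is a hypothetical map from a TP domain to its parent entity,
    standing in for the ADNS-based merging step.
    """
    parent_of = parent_of or {}
    merged = {}
    for fp, hosts in fp_to_tp_hosts.items():
        domains = {registrable_domain(h) for h in hosts}
        merged[fp] = {parent_of.get(d, d) for d in domains}
    graph = {fp: set() for fp in merged}
    for a, b in combinations(merged, 2):
        if merged[a] & merged[b]:  # at least one shared (merged) third party
            graph[a].add(b)
            graph[b].add(a)
    return graph


def _can_colour(graph, order, k, colours, i=0):
    """Backtracking search for a proper colouring with at most k colours."""
    if i == len(order):
        return True
    node = order[i]
    used = {colours[n] for n in graph[node] if n in colours}
    # Symmetry breaking: try colours already in use plus one fresh colour.
    limit = min(k, max(colours.values(), default=-1) + 2)
    for c in range(limit):
        if c not in used:
            colours[node] = c
            if _can_colour(graph, order, k, colours, i + 1):
                return True
            del colours[node]
    return False


def tangle_factor(graph):
    """Return (chromatic number, container assignment) of the FPIG."""
    if not graph:
        return 0, {}
    order = sorted(graph, key=lambda n: len(graph[n]), reverse=True)
    for k in range(1, len(graph) + 1):
        colours = {}
        if _can_colour(graph, order, k, colours):
            return k, colours


# Hypothetical observations reproducing the Figure 2 restrictions.
visits = {
    "FP1": ["x.tracker-a.example"],
    "FP3": ["y.tracker-a.example"],
    "FP2": ["x.doubleclick.net", "cdn.tracker-c.example"],
    "FP4": ["www.google-analytics.com"],
    "FP6": ["cdn.tracker-c.example", "t.tracker-d.example"],
    "FP5": ["t.tracker-d.example"],
}
parent_of = {"doubleclick.net": "google", "google-analytics.com": "google"}

graph = build_fpig(visits, parent_of)
k, colours = tangle_factor(graph)
print(k)  # 2: two containers suffice for this example
```

In this example FP2 and FP4 become neighbours only because DoubleClick and Google Analytics are merged into one parent entity, which shows why the disambiguation step affects the resulting tangle factor.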
Table 1: Year-long (Jan 2018 to Jan 2019) data collection from 15 users in UK and China.

User Group    1st party sites    3rd party cookies
UK Users      8416               113,003
CN Users      6144               74,313
Total         14,827             187,316

Table 2: Data collected from extension installers (ver.2) from Dec. 2019 to Feb. 2020: 124 users across 25 countries in total. Countries with fewer than 5 users are shown in gray.

Continent             Country                     Num (Users)
Africa (8)            Tunisia                     5
                      Egypt                       3
Asia (20)             China                       8
                      India                       5
                      Israel                      3
                      Singapore                   2
                      Thailand                    2
Europe (70)           Germany                     13
                      United Kingdom              10
                      France                      10
                      Netherlands                 9
                      Denmark                     8
                      Spain                       6
                      Italy                       5
                      Belgium                     5
                      Switzerland                 3
                      Luxembourg                  2
                      Norway                      1
                      Sweden                      1
North America (15)    United States of America    8
                      Canada                      5
                      Mexico                      2
Oceania (3)           Austria                     3
South America (5)     Chile                       3
                      Brazil                      2

2.5 FPIG datasets

We apply the above methodology to obtain tangle factors for several different sets of first parties. First, we use an automated browsing system, built on top of Selenium [12] running in non-headless mode, and collect the cookies set by the Alexa⁵ top-k most popular websites. Specifically, we programmatically visit the country-specific Alexa top-500 sites in the UK and China, and also the global top-2k websites from UK and China locations, in order to obtain a comparison between the number of containers required in both countries. The degree of demand for containers is taken as an indication of the degree of third-party risks in different countries. Further, we visit the UK top-500 websites after installing different ad blockers (uBlock Origin, Adblock Plus, Ghostery and AdGuard AdBlocker) on two different browsers (Chrome and Firefox), to understand the privacy protection provided by different ad blockers.

To complement this dataset, we also collect data from real users who consented to support our work by providing anonymised browser histories⁶. Our users come in two cohorts.
First, we have a small cohort of 9 users in the UK and 6 users in China, whose browsing activities were collected for a year-long period from Jan 2018 to Jan 2019. Altogether, these users have visited around 15k first-party websites across one year, involving over 187k third-party domains (Table 1). We have also published the official version of our add-on in the Chrome Web Store as “Thunderbeam-Lightbeam for Chrome”, and have seen over 1000 installs of our plugin by Feb. 7th, 2020. This plugin offers users the option of submitting their data to our study. Through this mechanism, we have collected two months of data from 124 users, who collectively provide us a picture of the third party ecosystem from 25 countries, as detailed in Table 2.
3 CONTAINER DEMAND BY COUNTRY

For any two sets of websites, the set with the lower tangle factor (i.e., the number of separate containers needed to ensure that a common third party does not learn about two first parties visited) is the one with the better privacy, and with less powerful third party tracking practices.

We begin by exploring the tangle factor of popular sets of websites from different countries. We then go on to show that tangle factor depends to a large extent on the country rather than actual numbers of third parties involved.

Figure 3: Tangle factors for (global) Alexa top-2k websites visited from UK and China vantage points.
[Figure 4 plot: x-axis AVG(TP) per FP; y-axis AVG(FP) per Container; series UK and CN]

Figure 4: Linear correlation between TPs/FP and FPs/Container. Each point represents the average TPs/FP and FPs/Container for one of the different ranked bins depicted in Figure 3. The Pearson correlation coefficient between TPs/FP and FPs/Container in China is -0.681 and in the UK is -0.728.

(a) Regression between container numbers and the weekly FPs by each user with 95% confidence interval ($r^2_{\mathrm{UK}} = 0.9688$; $r^2_{\mathrm{CN}} = 0.9162$).
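For readers who want to reproduce the kind of statistic quoted in the Figure 4 caption, the following is a minimal sketch of a Pearson correlation computed over per-bin averages. The function name `pearson_r` and the numbers in `tps_per_fp` and `fps_per_container` are hypothetical placeholders, not the study's data; only the formula itself is standard.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) in (0, 1):
        raise ValueError("need two sequences of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-bin averages (illustration only, not measured values):
# one entry per ranked bin, average TPs per FP and average FPs per container.
tps_per_fp = [6.0, 8.0, 10.0, 12.0, 14.0]
fps_per_container = [7.0, 6.0, 4.5, 3.0, 2.0]

r = pearson_r(tps_per_fp, fps_per_container)
print(round(r, 3))
```

A negative coefficient, as reported in the caption for both countries, means that bins whose first parties embed more third parties tend to fit fewer first parties into each container.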