A comparison of web privacy protection techniques

Johan Mazel Richard Garnier Kensuke Fukuda National Institute of Informatics/JFLI ENSIMAG National Institute of Informatics/Sokendai [email protected] [email protected] [email protected]

Abstract—Tracking is pervasive on the web. Third party users regarding their interests. To this end, advertisers leverage trackers acquire user data through information leak from context (e.g. visited ) or previous browsing interests. , and user browsing history using cookies and device Advertisers use techniques such as cookies to identify users fingerprinting. In response, several privacy protection techniques (e.g. the ) have been developed. To the across websites and build their browsing history. Other tech- best of our knowledge, our work is the first study that proposes niques have also been developed to allow advertising actors a reliable methodology for privacy protection comparison, and to communicate with each other (such as cookie syncing [2]), extensively compares a wide set of privacy protection techniques. or circumvent cookie removal by respawning cookies using Our contributions are the following. First, we propose a robust diverse types of data storage inside the browser (e.g. using methodology to compare privacy protection techniques when crawling many websites, and quantify measurement error. To this Flash [3]). Browser fingerprinting [4] allows a tracking entity end, we reuse the privacy footprint [1] and apply the Kolmogorov- to follow a user across websites without any in-browser data Smirnov test on browsing metrics. This test is likewise applied storage. In response to these techniques, several counter- to HTML-based metrics to assess webpage quality degradation. measures were designed. We can here quote the To complement HTML-based metrics, we also design a manual HTTP header [5] by which a user can ask not to be tracked. analysis. Second, we study the overlap of blocking resources between most popular browser extensions, and compare the Browsers can also block some or all cookies. Finally, many performances using the proposed methodology. We show that browser extensions hinder third party tracking by preventing protection techniques have vastly different performances, and cookie creation and/or blocking requests to tracking services. that the best of them exhibit a wide overlap. Next, we analyze End-users thus have many techniques available but have the impact of privacy protection techniques on webpage quality. trouble picking one that offers good protection. Similarly, pri- We show that automated HTML-based analysis sometimes fails to expose quality reduction perceived by users. Finally, we provide vacy researchers often want to assess protection efficiency but a set of usage recommendations for end-users and research rec- do not want to test all available tools. Our goal is to compare ommendations for the scientific community. Ghostery and uBlock existing privacy protection techniques and provide efficiency- Origin provide the best trade-off between protection and webpage based recommendations. Some work previously tried to com- quality. Ghostery however requires a configuration step which is pare privacy protection techniques [6], [7], [8], [9], [10], difficult for users. The RequestPolicy Continued and NoScript extensions exhibit the best performances but reduce webpage [11], [12], [13]. Krishnamurthy et al. [6] provide the first quality. Ghostery and uBlock Origin use manually built blocking comparison of privacy protection methods but do not evaluate lists which are cumbersome to maintain. Research efforts should most of the browser extensions available today because they focus on improving existing approaches that do not rely on appeared after the publication of the paper. Balebako et al. [8] blocking lists (such as or MyTrackingChoices), focus on a very specific use case: behavioral advertising. and automatically building reliable blocking lists. Mayer et al. [7], Hill [9], [10] and Traverso et al. [13] evaluate 4, 4, 4 and 7 browser extensions on a limited set of I.INTRODUCTION websites (respectively Alexa Top 500, 45, 84 and 100 italian The huge growth of the Internet comes along with an ever- URLs). Wills et al. [11] crawl a thousand of websites but arXiv:1712.06850v1 [cs.CR] 19 Dec 2017 increasing advertising market. Internet users access content use only six browser extensions. Merzdovnik et al. [12] use provided for free by publishers. Consequently, publishers five browser extensions. Furthermore, these studies do not monetize their audience through advertisement. Companies use the same metrics, it is thus difficult to compare their thus buy online exposure to promote their products. In order results. Our extensive coverage improves the state of the art to maximize advertisement efficiency, advertisers tailor ads to regarding measurement methodology and error, and evaluated techniques. Our contributions are the following. First, we propose a robust methodology to compare privacy protection techniques against third party tracking when crawling many websites. To this end, we reuse the privacy footprint [1] and apply the Kolmogorov-Smirnov test on browsing metrics. This test is likewise applied to HTML-based metrics to assess webpage quality degradation. We also design a manual analysis to complement HTML-based metrics. Second, we analyze the Name License User number (K) Version The next subsections provide a breakdown of browser-related [16] GPL v3 13,917 2.7 techniques that we compare in this work. Unless specified uBlock Origin [17] GPL v3 3,759 1.6 otherwise, these extensions are open source. Ghostery [18] Proprietary 966 5.4.10 Disconnect [19] GPL v3 204 3.15.3 1) Extensions: We classify extensions regarding used third NoTrace [20] - 0.8 2.4 party tracking impediment methods: blocking lists, heuristics, DoNotTrackMe/Blur [21] Prop. 86 6.0.2091 Blocking lists indiscriminate blocking, or other. BeefTaco [22] Apa. 2.0 10 1.3.7.1 The following browser extensions use blocking lists made Privacy Badger [23] GPL v3 205 1.0.6 of regex-based rules on domain names. These lists are usually MyTrackingChoices [22] - 0.1 1.0 Heur. community maintained. Ghostery [18] is a proprietary, privacy- NoScript [24] GPL v2 1,765 2.9 focused extension that uses a specific tracker blocking list. Ind. RPC [25] GPL v3 6 1.0 It has recently been bought by the privacy-focused browser HTTPSEverywhere [26] GPL v2 353 5.2.5 Decentraleyes [27] MPL 2.0 69 1.2.2 [30]. Ghostery requires a configuration step to select

Other WebOfTrust [28] GPL v3 315 - categories of trackers to block. uBlock Origin [17] is a general purpose blocker. It can also understand the syntax used by TABLE I the famous ad-blocker AdBlock Plus. uBlock Origin includes PRIVACYPROTECTIONTECHNIQUECHARACTERISTICS.HEUR.: by default: EasyList, EasyPrivacy, Peter Lowe’s Adservers, HEURISTICS ;IND.:INDISCRIMINATE ;POPULARITY (K): POPULARITYIN Malware domains and some specific lists. Disconnect [19] is THOUSANDS. another blocker. Blur [21] (also known as Do Not Track Me) is a proprietary extension owned by Abine. It blocks trackers, protects mail addresses and passwords. NoTrace [20] uses a blocked resource overlap between privacy protection tech- wide range of techniques in order to enhance privacy on the niques, and compare their performances. Third, we assess the Internet. NoTrace needs to be configured after installation. impact of privacy protection techniques on website quality. Fi- One can also block tracking by setting opt-out cookies that nally, we provide recommendations for end-users and scientific block domains from setting standard cookies. Several entities community. [31], [32] provide opt-cookie lists. BeefTaco [33] creates opt-out cookies in the browser. Another privacy protection II.BACKGROUND approach uses ad-blockers to hinder tracking in the same way This section presents an overview of third party web track- they block advertisements. AdBlock Plus [16] is the most ing and existing privacy protection techniques. We refer the popular ad blocker. Using specific lists, it can block trackers, reader to [14], [15] for a more complete description of these social widgets, or malwares. Since 2011, AdBlock Plus is aspects. commercially exploited by Eyeo which monetizes domain whitelisting [34]. In addition to the ad-focused EasyList [35] A. Third party that is used by default in AdBlock Plus, there is a privacy- Websites massively rely on advertisement to monetize user focused list called EasyPrivacy [35]. visit. Advertisers purchase ads for their products directly from Unlike previous extensions relying on blocking list, some publishers or ad exchanges. When users access websites, use heuristics to block trackers. Privacy Badger [23] is devel- they communicate with all these actors. From the users’ oped by the Electronic Frontier Foundation and its behavior viewpoint, these entities belong to two categories: first and is further described in § V-B. By default, Privacy Badger third parties. First parties are the entities that users intend uses the Do Not Track HTTP header [5] and strips the to reach, here, publishers. Third parties can be advertisers, referrer field in HTTP requests. MyTrackingChoices [22] is ad exchanges or others actors that provide services to first an advertisement friendly privacy protection extension. It uses parties (such as web ). Using techniques such as an hybrid approach which leverages both heuristics similar to cookies, local data storage or fingerprinting, third parties can Privacy Badger and blocking list for bootstrapping. Users can identify users across websites. Combining HTTP referrer field allow tracking for some website categories. and user identification, they can reconstruct users’ browsing Some extensions indiscriminately block resources NoScript history. Using cookie syncing, trackers can also exchange user [24] disables JavaScript. As a side effect, this also disables identification and thus improve data collection. some tracking. RequestPolicy Continued [25] blocks all third party requests. B. Privacy protection techniques Finally, some extensions use other mechanisms to pro- Several techniques have been designed to protect users tect users’ privacy. HTTPSEverywhere [26] tries to replace from third party tracking. Network-based techniques use DNS HTTP connections with HTTPS ones if HTTPS is supported. filtering or proxy. They however exhibit several shortcomings: WebOfTrust (WOT) [28] provides website rating regarding proxies cannot analyze encrypted traffic while DNS filter- trustworthiness and child safety. It was temporarly removed ing only blocks entire domains [12]. User agent spoofing from extensions stores when it was revealed that WOT broke browser extensions may also improve privacy by hindering privacy rules of browser developers [36]. Decentraleyes uses fingerprinting, but exhibit poor performance in practice [29]. local files to emulate resources, e.g. freely available JavaScript

2 libraries, hosted on centralized entities. This blocks tracking lists (see Table I). Some proposals automatically build tracking from these entities. Table I provides popularity (measured as blocking lists using specific keys in URL that correspond to the number of users for ), source code license and user identifying data [52], machine learning on DOM structure version of the extensions used in our work. [68], Javascript [53], [55] or user browsing behavior [69]. The 2) Browsers: Some browsers also have built-in privacy same machine-learning-based approach was also applied to ad- protection features. DoNotTrack [5] is a field in the HTTP blocking list building using network traffic features [51]. header that asks the destination not to track the sender. It was proposed in 2009 and is currently under standardization in . Privacy protection techniques comparison W3C. This feature is turned off by default in Firefox. The While some extensions are used in almost all studies (e.g last proposed amendments to the European Union’s ePrivacy AdBlock Plus [16] and Ghostery [18]), many of them are Regulation hints at an increase interest of policy makers seldom employed. Similarly, some works compare between towards DoNotTrack [37]. Firefox [38] can block cookies: four and seven extensions [7], [8], [9], [10], [11], [12], [13] either all or only third parties’ ones or only third party cookies while another focuses on two [6]. Table II describes how from previously visited domains. A tracking protection [39] existing web privacy-related work used or analyzed existing has also been added in Firefox version 42. [40] and privacy protection techniques. Cliqz [41] are browsers that emphasize their privacy protection Features used to compare privacy protection techniques are features. very diverse and thus complicate result comparison across work. Some works compare privacy protection techniques in III.RELATED WORK terms of HTTP request number [7], [9], [10], [12], cookie A. Tracking on the Internet number [7], [8], [9], [10], domain number [9], [11], [13], The seminal contribution in the field of web privacy was by private information diffusion [6], and occurrence of behav- Krishnamurthy et al. They first studied the diffusion of privacy ioral advertising [8]. Some counter-measure proposals are information [1], then, analyzed the impact of counter-measures also comparing their approaches to others regarding tracker on this diffusion and webpage quality [6], and observed the blocking [49], [50], [39] or impact on browser performance increase of user-data aggregation by a small number of entities [22]. Table III lists metrics and features leveraged by existing [58]. Several works then tried to analyze this phenomenon web privacy-related studies. with more details by classifying tracking [59] or analyzing Our work is close to [6], [7], [8], [9], [10], [11], [12], [13] geographical variations of tracking [43], [44]. but improves the state of the art along four axes: protection While early studies focused on cookie-based tracking, two techniques, target websites, metrics, and reliability. First, main new tracking techniques trends emerged: resilient in- we compare more protection techniques (15 and additional browser data storage [3] and browser fingerprinting [4]. Local combination of blocking lists), than Krishnamurthy et al. [6] data storage allows one to store data in multiple locations (2), Mayer et al. [7] (4), Balebako et al. [8] (4), Hill [9], (for example, Flash Local Storage Object, HTML5 or ETags [10] (4), Wills et al. [11] (6), Merzdovnik et al. [12] (5) among other means) on a device to bypass standard cookie and Traverso et al. [13] (7). It is difficult to compare the removal. This technique thus provides resilient local identifier extensions covered by our work with Krishnamurthy et al. [6] storage. By using browser fingerprint, a tracker can completely since the ecosystem was much simpler at the time (AdBlock avoid using and storing an identifier inside the browser and Plus and NoScript were the only extensions available). We instead recognize a browser, or the device it is installed did not use some extensions that were addressed in previous on, using its characteristics such as fonts [60], battery [61], studies because they are either, not supported anymore (e.g. canvas [62] or hardware level features [63]. Several studies TACO which was owned by Abine [8], [56]), or not available then tried to detect fingerprinting [64], or both local storage for Firefox (AdBlock [46], [22], [11], Superblock Adblocker and fingerprinting [2], [48]. Furthermore, Lerner et al. [65] [22], Adremover [22] and Adblock Pro [22]). Furthermore, show that local storage fingerprinting increased between 1996 AdBlock (resp Firefox Tracking Protection [39]) uses the same and 2016. Another privacy threat is the practice of sharing blocking list as AdBlock Plus (resp. Disconnect). By analyzing user identification across tracking entities. This is called ID- AdBlock Plus and Disconnect, we thus provide performance sharing, cookie-syncing or cookie-matching. While the phe- bounds on these two techniques. Achara et al. [22] and Wills nomenon has been documented for some time [66], recent et al. [11] also evaluated the performance of an ad-blocking work analyzed its prevalence in the wild and its impact on extension that we do not address in this study: AdGuard user browsing history sharing among trackers [2], [67], [48]. Adblocker. Second, we us more websites than most existing Finally, the feasibility of user surveillance through bulk work. We use the Alexa Top 1000 while Mayer et al. [7], passive network traffic monitoring-based cookie observation Hill [9], [10] and Traverso et al. [13] use the Alexa Top 500, has also been analyzed [47]. 45, 84 and 100 URLs, and 100 italian URLs. Balebako et al. discussed five browsing scenarios that use a small number of B. Automated blocking list building websites to assess the occurrence of behavioral advertising, Many privacy protection tools have been proposed (see we thus cannot compare the websites we use. We crawled a § II-B). Most of these tools use manually maintained blocking number of websites similar to Wills et al. [11] but analyzed

3 Type Authors & references Extensions Browsers AdBlock [ 42 ] AdBlock Plus EL [ 16 ], [ 35 ] AdBlock Plus EP [ 16 ], [ 35 ] uBlockOrigin [ 17 ] Ghostery [ 18 ] Privacy Badger [ 23 ] Disconnect [ 19 ] NoTrace [ 20 ] WebOfTrust [ 28 ] DoNotTrackMe/Blur [ 21 ] MyTrackingChoices [ 22 ] RPC [ 25 ] Decentraleyes [ 27 ] NoScript [ 24 ] DNT [ 5 ] Opt-out cookies [ 31 ], [ 32 ], [ 33 ] HTTPSEverywhere [ 26 ] Firefox [ 39 ] Cliqz Krishnamurthy et al [1] Castelluccia et al [43] Web Fruchter et al [44] privacy Carrascosa et al [45] + analysis Metwalley et al [46] · · · · Englehardt et al [47] + + + + Englehardt et al [48] + + + + Malandrino et al [49] + + + + + Privacy Malandrino et al [50] + + + + + protection Kontaxis et al [39] + + Achara et al [22] + + + + + Gugelmann et al [51] +? +? Tracker Metwalley et al [52] detection Wu et al [53] ? Yu et al [54] + + + Ikram et al [55] + + + + + User Leon et al [56] + + + + ana. Malandrino et al [57] + Krishnamurthy et al [6] + + Mayer et al [7] + + + + Comp. Balebako et al [8] + + + + Hill [9] + + + + Hill [10] + + + + Wills et al [11] + + + + + + Merzdovnik et al [12] + + + + + Traverso et al [13] + + + + + + + Our work - ∼ + +++++++++++++++ ∼ TABLE II EXTENSIONSUSEDINPREVIOUSPUBLICATIONS (· FORUSAGEINTHEWILD, FORGROUND-TRUTH, FORGROUND-TRUTHANDUSAGE, + FOR COMPARISON, ⊕ FORCOMPARISONANDGROUND-TRUTH, ? FOR ML BOOTSTRAPPING , ∼MEANS THAT WE DID NOT EVALUATE THESE TECHNIQUES BUT THAT THEIR RESULTS CAN BE DERIVED FROM OUR WORK:ADBLOCK USES THE SAME DEFAULT BLOCKING LIST AS ADBLOCK PLUSAND FIREFOX TRACKING PROTECTIONUSESTHESAMEBLOCKINGLISTAS DISCONNECT).

many more protection techniques. We use a smaller number of crawls for each URL but does not provide any justification crawled websites than Merzdovnik et al. [12] (who uses Alexa for this number. To the best our knowledge, we here propose Top 100000) because some of our metrics (such as number of the first measurement error analysis for privacy protection cookies) require a stateful crawl and thus forbid parallelism. technique comparison (see § V-A). Third, we use more metrics (see Table III) than any existing work. We are thus able to provide a better description of IV. METHODOLOGY techniques’ performance. We did not compute the host number This section presents data collection and used metrics for metric used by Hill [9], [10] because it is very similar to the privacy protection and webpage quality. number of domains. This metric actually reflects the internal A. Data collection network architecture of trackers, which is not relevant for privacy protection comparison. Finally, none of these work We use OpenWPM [48], an open-source framework written performs several measurements on each website to remove in Python that relies on Selenium for browser automation. measurement error, except Mayer et al. [7] who perform three OpenWPM supports Firefox and provides some extensions. With minor modifications, we add extensions presented in

4 Type Authors & references Privacy protection Webpage quality Browser # domains # 3rd party domains # hosts # cookies # third party cookies # requests # first party requests # third party requests TP/FP/FN/TN Privacy footprint # script # image Data Script data size Image data size Manual analysis Loading time Browser CPU usage Browser memory usage Use-case specific metric Krishnamurthy et al. [1] Web Fruchter et al. [44] privacy Carrascosa et al. [45] analysis Englehardt et al. [47] Englehardt et al. [48] Malandrino et al. [49] Privacy Malandrino et al. [50] protection Kontaxis et al. [39] Achara et al. [22] Tracker Yu et al. [54] detection Ikram et al. [55] Krishnamurthy et al. [6] Mayer et al. [7] Comparison Balebako et al. [8] Hill [9] Hill [10] Wills et al. [11] Merzdovnik et al. [12] Traverso et al. [13] Our work - TABLE III METRICSUSEDFORPRIVACYPROTECTIONTECHNIQUECOMPARISON ( FORMETRICUSEDAND FORMETRICUSEDWITHBREAKDOWN).

Table I and § II-B. This study has been conducted with Firefox A protection technique is effective if the number of first 45. party requests is not impacted, and the number of third party We crawl the highest-ranked websites by Alexa [70]. Unless requests decreases. stated otherwise, we perform measurements in August 2017, The third metric is the number of third party domains using IP addresses located in Japan. Typically, a crawl on accessed during a crawl. This is complementary to the number the Alexa Top 1000 (§ VI-A) takes about three days with of third party requests. It provides a metric on the number of commodity hardware. blocked entities. There may be a few domains that generate B. Privacy protection a lot of requests, or a many domains that produce a few requests. Efficient protection techniques reduce the number of In this section, we present our robust methodology to third party domains. compare privacy protection techniques across several websites This is however not sufficient since third party requests and using several metrics. We also present privacy footprint are not always used to track users, they can also provide [1]. content to users (e.g. media resources) or contact non-tracking 1) Browsing metrics: As explained in § II-A, privacy leak- third parties (e.g. website analytics). The fourth metric is the age occurs through communications with trackers, and local number of the profile cookies obtained from a stateful crawl. data storage (e.g. cookie) allows trackers to easily identify users across website. We consider five simple browsing-related Protection techniques with good performances diminish the metrics that reflect the ability of protection techniques to number of cookies. hinder these two phenomena. We first focus on HTTP requests. The last metric is the total amount of data transferred. We separately count the requests that are made to the accessed It is especially important in low-bandwidth situations for domain (the number of first party requests), and the requests example on mobile. This metric is directly related to tracking that are made to other domains (the number of third party and advertisement. Advertisement often generates considerable requests). To this end, we use the second level domain name traffic during browsing. However, as we will see in § V-A, this obtained from the Public Suffix List [71]. These metrics give a metric shows high variability when crawling the same website. rough estimation on the performance of a particular extension. We thus did not use this metric in our analysis.

5 1.0 the same hosting service or Content Delivery Network (CDN) are grouped together by ADNS because they share authoritative 0.8 DNS servers. They thus develop a new method called root. With root, if a third party is located in a hosting service or a CDN, the second-level domain-name of the third party 0.6 domain is the node identifier. Otherwise, as with ADNS, the second-level domain name of the authoritative DNS server of ECDF 0.4 the considered third party is the node. We use three metrics that are built on the graph. (1) the 0.2 bare number of third parties reveals tracking breadth. (2) the mean uBlock Origin number of third parties per first party corresponds to tracking number of first parties associated with the top 0.0 intensity. (3) the 10 1 100 101 102 103 10 third parties estimates how much tracking is concentrated mean # 3rd party requests on the most prominent third parties.

Fig. 1. Example of ECDF of the mean number of third party requests for C. Webpage quality all websites in the Alexa Top 1000 with two configurations: default Firefox (bare) and Firefox with uBlock Origin. The arrow represents the KS statistic. Privacy protection techniques prevent privacy informa- tion leakage by hindering of communications with trackers and blocking cookie creation, among other techniques. This 2) Kolmogorov-Smirnov test-based browsing metrics com- may however have negative side-effects on webpage quality. parison: While the metrics introduced in § IV-B1 provide Blocked third party domains may actually host images or a good estimation of the privacy protection for a particu- JavaScript that are needed to render the webpage correctly. We lar website, summarizing this aspect for a set of websites first analyze the impact of privacy protection on browsing data is non-trivial. Figure 1 presents two empirical cumulative in an automated fashion. We then perform a manual analysis to distribution functions (ECDFs) built on the mean number assess privacy protection repercussion on rendered webpage. of third party requests sent, for each website of the Alexa 1) HMTL-based metrics: Our goal is to detect layout dif- Top 1000 world. We use the mean of each metrics for ten ferences or missing elements that may reduce webpage quality crawls to reduce measurement error (see § V-A). We here both in terms of rendering quality and functionality. We crawl use two Firefox configurations: one without any extension HTML data using OpenWPM. We devise several metrics that (bare) and one with uBlock Origin. To determine whether analyze the browsed page. The first metric is the HTML page one Firefox configuration exhibits the same performance as size. We then derive two metrics from the crawled HTML data another, we use the Kolmogorov-Smirnov (or KS) test. It is a itself: number of images and scripts. The last two metrics are non-parametric test that does not make any assumption on the the total size of images and scripts. For all metrics, a reduction underlying distributions which fit our context. This test relies is associated with a webpage quality decrease. We here reuse on the KS-statistic which is the maximum difference between the metric comparison method presented in § IV-B2. two cumulative distributions. The KS statistic is represented 2) Manual analysis: By analyzing HTML, we are able to by the arrow on Figure 1. The null hypothesis is that both determine how many elements and how much data is missing ECDFs belong to the same distribution. In our case, this when privacy protection is used. This, however, does not means that both configurations have the same performance. assess how well the webpage is rendered inside the browser. We use a significance level α of 0.05. If the p-value of the We thus perform a manual analysis to determine webpage KS-test is smaller than α, we consider the two ECDFs as quality obtained with privacy protection techniques. Webpage distinct. In other words, the two considered configurations screenshots are captured using OpenWPM. First, we want have performances that are statistically different. to determine whether privacy protection techniques impact 3) Privacy footprint: In this work, we also use the privacy webpage layout. Our first question thus is: “Please rate layout footprint proposed by Krishnamurthy et al. [1]. The privacy similarity between Bare (left) and Extension (right).”. Good footprint represents interactions between first parties and third layout similarity, however, does not guarantee lack of missing parties (i.e. potential trackers) in a graph. For each third party elements. The next step is to compare webpage elements when accessed by the user when visiting a first party, an edge is privacy protection is used and not used. We want to avoid added between the nodes representing the considered first and comparing elements of the same webpages that may change third parties. Privacy footprint thus captures both the potential for different users or crawling time (e.g. items). We thus information leaking from first parties, and the aggregating ask user to focus on webpage elements that are part of the behavior of third parties. Krishnamurthy et al. [1] used the user interface. Our second question thus is: “Please rate the second-level domain name of the authoritative DNS server of proportion of elements (image, text frame, widgets, etc.) of the third parties to group domains in the same entity on the same user interface (e.g.: login button, tabs, etc, but not news items node in the graph. This method is called ADNS. Krishnamurthy or pictures) in Bare (left) that are also in Extension (right)”. et al. [58] then noticed that unrelated third parties located on For both questions, users give a rating between 0 and 10, 0

6 80 1.0 # 1st party requests 70 # 3rd party requests # 3rd party domains 0.8 60 # bytes received

50 0.6

40 ECDF 0.4 30

20 # 1st party requests 0.2 # 3rd party requests 10 Relative standard error of the mean (%) errormean the of standard Relative # 3rd party domains

0 0.0 5 10 15 20 25 30 35 40 10-4 10-3 10-2 10-1 100 101 Number of measurements Relative standard error of the mean (%)

Fig. 2. Relative standard error of the mean of browsing metrics in function Fig. 3. ECDF of the relative standard error of the mean of browsing metrics of the number of measurements when accessing www.yahoo.jp. Solid lines on the Alexa Top 1000 websites crawled ten times. Vertical dashed lines represent the median, dashed lines correspond to 5 and 95-centiles. represent the relative standard error for www.yahoo.jp. being the worst and 10 the best. Our manual analysis was the Alexa Top 1000 crawled 10 times. Most websites have performed by six users. a relative standard error smaller than that of www.yahoo.jp (here in dashed lines). Overall, 99% of websites have a relative V. MEASUREMENT PARAMETERS standard error smaller than 1% for all observed features. We address two measurement parameters: the impact of crawling parameters on measurement error, and the training B. Privacy Badger training of Privacy Badger. Privacy Badger employs heuristics to determine if a domain A. Impact of crawling parameters on measurement error is performing tracking (see § II-B1). Privacy Badger measures During our preliminary experiment, we notice that measure- the number of times that a domain reads a cookie as a ment results are not stable. Querying twice a website usually third party. When this number is greater than three, the yields two different metric values. To determine the appropri- considered domain is blocked. On top of this, cookies are also ate number of measurements that generates a small error, we whitelisted using heuristics. This behavior causes a freshly conducted a detailed study on the website showing the high- installed Privacy Badger not to block any domain. As the est variability in our preliminary experiment: www.yahoo.jp. user browses websites, more and more domains are blocked. These measurements were performed is November 2016. Vary- To fairly evaluate Privacy Badger, we analyze the impact of ing the number of measurements, we computed the relative Privacy Badger training. In other words, we intend to quantify standard error of the mean of four browsing metrics. how many websites need to be accessed to ensure that Privacy Figure 2 shows that the relative standard error decreases Badger is completely trained. when the number of measurements increases. Performing ten If we were to determine the number of websites regular crawls reduces the relative standard error on the number of first users need to browse to train Privacy Badger, we should use party requests, third party requests, and third party domains to a browsing model such as AOL search logs [73] as Roesner less than 5%. The number of bytes received, however, still has et al. [59] do. Our use-case here is however to ensure that a large error, we thus discard this metric. One possible reasons Privacy Badger is trained for a specific set of websites. We for this measurement error is the webpage advertisement thus measure Privacy Badger’s performance on the Alexa Top churn. Guha et al. [72] analyze ad churn during page reloads. 100 websites after training. These crawls were performed is They show that the number of unique new ads increases November 2016. This training consists of several repeated quickly during the first ten reloads but only linearly thereafter. crawls targeting the same set of websites and performed This further justifies our choice to use ten measurements. An without reinitializing the user profile, i.e. Privacy Badger existing comparison of privacy protection techniques [7] uses continues its training. We use zero to five successive training three crawls for each website and extension. They however crawls and then perform a single measurement crawl. do not motivate this choice nor provided measurement error The results of this experiment are shown in Figure 4. The analysis. dashed line corresponds to a reference: crawling results of Figure 3 is the ECDF of the relative standard error of the a bare browser. KS test groups the six measurements with mean of the number of first and third party requests, and Privacy Badger together, but separates them from the bare of the number of third party domains for all websites in browser. Without training, Privacy Badger is already able to

7 3400 (EasyList, EasyPrivacy, Peter Lowe’s Ad Server list and Mal- 3200 3000 ware domains). One configuration uses NoTrace with the high 2800 2600 preset (the default configuration has no effect on our metrics).

# 1st party req. 1st party # The last 14 configurations are extensions with their default 8000 7500 setting. We do not use the Firefox Tracking Protection [39] 7000 6500 because OpenWPM does not support it yet. However, this 6000 5500 5000 technique uses the Disconnect blocking list. This means that

# 3rd party req. 3rdparty # by analyzing Disconnect, we provide a performance bound on 340 320 Firefox Tracking Protection’s performances. The results are 300 280 260 shown in Figure 5. The number on top of technique name is 240 220 the KS-based rank (see § IV-B2).

# 3rd party dom. 3rdparty # All techniques have a limited impact on first party requests. 1000 950 900 The most aggressive technique is NoScript [24]. This is due 850 800 750 to the fact that Javascript is pervasive on Internet. Other 700 # cookies # 650 0 1 2 3 4 5 techniques exhibit a limited impact on th enumber of blocked # of training crawls first party requests. Fig. 4. Performances of Privacy Badger depending on the number of crawl Except for the number of cookies, the two most effective performed for its training. The horizontal dashed line represents a default Firefox as reference. extensions are RequestPolicy Continued and NoScript which reduce the number of third party HTTP requests by 96% and 87%. However, as shown in § VI-B and other studies [6], [50], provide partial protection. After a single training pass, the they strongly impact webpage quality. number of third party requests is reduced by 10.3%. The Among other techniques, Ghostery [18] and uBlock Origin results however do not improve when the number of training [17] provide the best performances. They reduce the number crawl increases. We also ran a single training crawl on the of third party HTTP requests by 58% and 55%. Ghostery Alexa Top 500 websites and observed similar performance as however needs to be configured which is difficult for users with a single training crawl. The training is thus complete [56]. AdBlock Plus [16] uBO-like has similar performances with less than one crawl (here one hundred websites). My- with uBlock Origin and Ghostery. Our results show that TrackingChoices uses the same principle but without Privacy users can manually setup AdBlock to obtain good results, Badger’s cookie data-based white-listing heuristics. One can unfortunately this is difficult [56]. Furthermore, AdBlock Plus thus expect MyTrackingChoices to train faster. Moreover, used with EasyList is much more CPU and memory-hungry MyTrackingChoices is provided with a small bootstrapping than uBlock Origin [22]. tracker list which should further reduce training time. For the Another group of techniques provides average perfor- remainder of this paper, we train Privacy Badger using a single mances: Disconnect (and Firefox Tracking Protection), Ad- crawl on the considered Alexa Top websites. Out of the four Block Plus (with EasyList and EasyList + EasyPrivacy), existing work that uses Privacy Badger [55], [9], [12], [13], Privacy Badger, Blur, BeefTaco and NoTrace. Their results only two actually peform some training [9], [12]. We here are mostly consistent across third party requests and domains, provide the first estimation of how much website needs to be and cookies. Privacy Badger [23], Blur [21], Disconnect [19] visited to complete Privacy Badger’s training. and AdBlock Plus with the EasyList block a relatively large number of third party HTTP requests (between 33% and 48%), VI.RESULTS but a fairly small number of third party domains (between 8% and 33%). We hypothesize that these techniques block the A. Privacy protection main trackers but fail to impact the tracking long-tail [48]. We compare protection techniques, first in terms of perfor- MyTrackingChoices performances are worse than untrained mance and then, regarding their blocking overlap. Privacy Badger. We hypothesize that this is due to a GUI 1) Browsing metrics: We perform a general comparison bug that forbids users from customizing website categories of privacy protection techniques. We crawled the Alexa Top where blocking occurs. MyTrackingChoices thus always block 1000 World using 21 different browser configurations. One tracking on 13 websites categories out of 32. BeefTaco has a configuration is the default setup of the Firefox browser (Bare). noticeable impact on cookies but no impact on other metrics. One configuration is Firefox setup with the DoNotTrack This is consistent with previously observed tracking long tail HTTP header. Two configurations use Firefox’s ability to [48]. This phenomenom makes the opt-out cookie maintenance block cookies: third party ones or third party cookies except very difficult while the protection that they offer is limited from previously visited websites. We do not use complete to cookie blocking. NoTrace [20] provides average protection cookie blocking because we hypothesize that users want to despite being setup at its highest level. NoTrace increases keep using cookie for some domains. Three configurations users’ online privacy awareness [57], but unfortunately, users set up AdBlock Plus with various lists: EasyList, EasyList cannot reliably setup the extensions [56] which is required. and EasyPrivacy and a filter set similar to uBlock Origin Remaining techniques provide poor protection. Decen-

8 35000 80000 30000

25000 60000 20000

15000 40000

10000 Nb requests 1st party Nb requests 3rd party 20000 5000

0 0

PB - 7 PB - 7

MTC - 6 Blur - 4 RPC - 2 MTC - 8 Blur - 5 RPC - 1 WOT - 7 WOT - 10 FF DNT - 7 ABP EL - 4 FF DNT - 9 ABP EL - 6 FF Bare - 7 FF Bare - 9 Decentr. - 7 NoTrace - 5 NoScript - 1 Decentr. - 9 NoScript - 2 BeefTaco - 7 Ghostery - 4 BeefTaco - 8 Ghostery - 3 FF no 3PC - 7 NoTrace - 11 FF no 3PC - 9 HTTPS Eve.Disconnect - 6 - 5 ABP EL+EP - 3 HTTPS Eve. - 9 Disconnect - 5 ABP EL+EP - 4 PB (trained) - 7 PB (trained) - 6 uBlockOriginABP - uBo-like3 - 3 uBlockOriginABP - uBo-like3 - 3 FF no 3PC EV - 7 FF no 3PC EV - 9 (a) # first party requests (b) # third party requests

50000 2000 40000 1500 30000

1000 Nb domain Nb cookie 20000

500 10000

0 0

PB - 12 Blur - 7 RPC - 1 PB - 11 MTC - 13 Blur - 6 RPC - 1 WOT - 15 MTC - 12 ABP EL - 6 WOT - 15 FF DNT - 14 NoScript - 2 ABP EL - 9 FF Bare - 14 Ghostery - 5 NoTrace - 11 FF DNT - 14 Decentr. - 14 FF Bare - 14 NoScript - 1 BeefTaco - 14 Ghostery - 2 FF no 3PC - 10 Disconnect - 8 ABP EL+EP - 4 Decentr. - 14 NoTrace - 10 FF no 3PC - 4 PB (trained) - 9 BeefTaco - 13 HTTPS Eve. - 14 uBlockOriginABP - 3uBo-like - 3 Disconnect - 5 ABP EL+EP - 3 PB (trained) - 8 HTTPS Eve. - 14 uBlockOrigin - 2 ABP uBo-like - 2 FF no 3PC EV - 10 FF no 3PC EV - 7 (c) # third party domains (d) # cookies Fig. 5. Performance comparison of protection techniques regarding the mean # first party request, mean # third party requests, mean # third party domains and # cookies. Each metric value corresponds to the total obtained by crawling the Alexa Top 1000 websites. The mean and standard deviation are computed on ten crawls. The number on top of technique name is the KS-based rank. Bare: Firefox alone, MTC: MyTrackingChoices, PB: Privacy Badger, FF no 3PC: Firefox no 3rd party cookie, FF no 3PC EV: Firefox no 3rd party cookie blocking except from previously visited domains, ABP: AdBlockPlus, EL: EasyList, EP: EasyPrivacy, RPC: RequestPolicy Continued . AdBlock Plus uBo-like uses all uBlock Origin’s lists except uBlock Origin’s specific ones; it thus loads EasyList, EasyPrivacy, Peter Lowe’s Ad Server list and Malware domains. traleyes [27] has almost no effect on the results. We hypoth- cookies also slightly reduces the number of third party re- esize that blocking library loading requests only marginally quests. As hypothesized in [48], this may be due to impeded impacts HTTP traffic. The DoNotTrack HTTP header [5] also cookie-syncing interactions. However, contrary to Englehardt has barely no effect. If trackers complied with DoNotTrack, et al. [48], third party cookie blocking here has poor perfor- the number of cookies would significantly drop. We thus mances. This may be due to the fact that, unlike [48], we conclude that DoNotTrack is not respected by most trackers. perform a stateful crawl. In a previous study [47] that uses WebOfTrust and HTTPSEverywhere have no effect. stateful crawl and OpenWPM, third party cookie blocking also We note that WebOfTrust and NoTrace both yield signifi- exhibits poor performances. The metric used in this study is cantly more third party requests than Firefox without any ex- however extremely specific, it thus is very difficult to compare tensions. WebOfTrust also contacts more third party domains our results with theirs. Furthermore, some third parties from and increases the number of cookies. WebOfTrust annotate the Alexa Top 1000 are first party at some point (e.g. Twitter, hyperlinks in webpages with trust rating. We hypothesize that ). Their cookies are thus created as first parties. this rating is obtained from WebOfTrust servers and thus 2) Privacy footprint: We then compared extensions using impact our metrics. We do not have any explanation regarding the privacy footprint (see § IV-B3 and [1]). Figure 6 presents NoTrace’s behavior. these results. They are similar to browsing metrics’ Figure 5. When we block third party cookies in Firefox, there is a Six extensions exhibit much better results than the rest: reduction in the number of third party domains. Blocking RequestPolicy Continued, NoScript, Ghostery, uBlock Origin

9 1000 qaerpeet h oa ubro eore lce by blocked resources of | number total the represents row square on square E purple) (resp. green column the top), (resp. h ignloag qaesraedsly eore htare that resources displays surface square orange diagonal The between intersection the by blocked sstu ihalisfitrctgre,wieDsonc is Disconnect while row the categories, filter its filters. own all its Ghostery with with lists. list, employed up specific Server set uBlock Ad some is Lowe’s and domains, Peter Origin Malware uBlock default EasyPrivacy, EasyList. its EasyList, the with uses used loads Plus is AdBlock extension Alexa list. each the blocking and on websites trained been 100 third Top has (resp. Badger Javascript Privacy not requests). 25]) [ blocks do party indiscriminately Continued We it RequestPolicy 2016. because (resp. November here on [24] in third crawled NoScript websites data blocked 100 analyze here and We Top view. Alexa requests matrix the the a party regarding using third domains ones party blocked popular most of the number among extensions five h nqeeso hseyadulc rgnbokn lists blocking Origin that uBlock hypothesized and We (see Ghostery order. of this uniqueness follows is the and between Plus clear AdBlock hierarchy very customized The the here NoScript Ghostery’s. and parties. Origin to first but uBlock of close detected Ghostery, much set here are limited is are a parties parties on results third present third are some of they that that number is means total party This first the higher. per while parties third small of very number mean RequestPolicy For the list. Continued, Origin’s uBlock with Plus AdBlock and domains, EasyPrivacy. visited mean EP: EV: previously The EasyList, crawl. from ten world. EL: except the 1000 blocking AdBlockPlus, of Top cookie ABP: value Alexa party metric 3rd the the no on for Firefox computed [58] are [1], deviation standard footprint and Privacy 6. Fig. B 200 400 600 800 i e scnie w xesos Let extensions. two consider us Let Overlap: 3) i 0 | (resp. h ne upesur nrow on square purple inner The . §

stemi atro hi ueirperformances. superior their of factor main the is VI-A3) WebOfTrust i j and ,

FF Bare # 1stpartylink.totop103rdparties # 3rdparty ersnstettlnme frsuc lce by blocked resource of number total the represents ) E j

E BeefTaco .O ahline each On ).

i HTTPS E

(resp. Everywhere j et emaueteoelpbtenthe between overlap the measure we Next, Decentraleyes h xeso ntecolumn the on extension the FF no 3rd cookie

E FF no 3rd

j cookie EV r noted are ) E FF DNT i Privacy and i Badger

ftemti,tesraeo the of surface the matrix, the of Privacy Badger (trained) E

j ABP EL n t ufc is surface its and

B NoTrace i i n column and MyTracking Mean #3rdpartyby1st E

(resp. Choices i Disconnect eteetninon extension the be Blur j B

h resources The . ABP EL+EP j .O h left the On ).

j ABP uBo-like represents

| uBlockOrigin B i

i Ghostery ∩ (resp.

B NoScript E

j Request Policy i | : . Continued 10 h erc rsne in in exposed presented of metrics configurations browser the different automated 21 the HTML-based analysis. an manual use a perform We and quality. approach website pact quality Webpage B. low no almost to provides average protection. header HTTP provide DoNotTrack The techniques users protection. Remaining protect Origin less. uBlock the slightly and show Ghostery performances. NoScript best privacy and overall of RequestPolicyContinued terms In that protection, extensions. resources other specific by affected block not Origin are uBlock and Ghostery [72]. lap. al. thus et the We Guha of by side-effect Origin. observed during another churn uBlock is contacted ad observation with not this have made that were may hypothesize measurements that Plus ten parties uBlock AdBlock the by with third made employed some crawls also reached ten is The which Origin. EasyList uses however the exhibit they why (see blocking explain performance their not may that best This do fact unique. the are extensions due settings is other list this that that speculate We resources block. specific many block have thus [48]. trackers heuristics prominent These less websites. be- blocking several heuristics trouble across the seen Badger’s with are Privacy consistent that is (see is This havior difference smaller. major is The impact 7a. Figure reliable. to are heuristics Badger’s overlap. Privacy significant the a that exhibit proves extensions This lists these techniques, heuristics, blocking different and two Privacy using covers uBlock Despite completely Badger. by re- almost blocked party Ghostery also third Similarly, are All Origin. Badger extensions. Privacy all by across blocked overlap quests big a is is there badger Privacy with intersection above). the (see as represented in fashion Origin same uBlock depicts and the line Ghostery this Disconnect, on blocked with squares resources intersections Remaining only. the is Plus line represents AdBlock this and by on AdBlock diagonal, square the next by The on blocked Badger. located Privacy the resources by outside not the surface but the Plus green of describes The surface square 678. the purple here by square, encoded intersection purple is the inner The which represents 1356. Badger line Privacy Plus: considered number with AdBlock the the by on represent blocked square line first requests this party on surfaces third Square of 7a. Fig. of by blocked only )HM metrics: HTML 1) im- techniques protection privacy how analyze finally We Summary: It Plus. AdBlock by blocked only are resources Some Origin uBlock and Ghostery that illustrate figures Both similar are 7b Fig. on domains party third the on Results that shows 7a ) (Fig. requests party third the of overlap The line second the on Plus AdBlock see we example, an As § § peet h eut.W discard We results. the presents 8 Figure IV-B2. h otpplretnin hwawd over- wide a show extensions popular most The htol lc hr at oan that domains party third block only that V-B) E i : | B ecalteAeaTp10wrdon world 100 Top Alexa the crawl We § i VI-A1). { − § B adternigmethod ranking the and IV-C1 k ∀ k 6= i }| . § euse We VI-A1. (a) # third party requests (b) # third party domains Fig. 7. Overlap of third party requests and domains blocked by five extensions (PB: Privacy Badger, Dis: Disconnect, ADB: AdBlock Plus, Gho: Ghostery, uBO: uBlock Origin). Square surfaces represent the metric value for the extension on the row. the HTML size metric because there is no significant change We use the questions about layout similarity and missing across protection techniques. NoScript has a noticeable impact elements presented in § IV-C2. Figure 9 exposes our results. on all metrics. This is especially visible for the script number, The blue (resp. red) boxplot corresponds to the first (resp. and image and script size. RequestPolicy Continued’s impact is second) question. Overall, Privacy Badger, Blur, Disconnect, smaller than NoScript’s except for image size. We hypothesize uBlockOrigin and Ghostery have a very small impact on ren- that RequestPolicy Continued blocks third party image hosting dered webpages. On the other hand, RequestPolicy Continued services, and thus impacts many websites. NoScript greatly and NoScript significantly impact on pages both in terms of reduces the total size of scripts which is consistent with its layout and missing elements. This is consistent with § VI-B1 goal: blocking JavaScript on webpages. Other privacy protec- and previous results [6], [50]. The manual analysis uncovers tion techniques appear to have a limited impact of website RequestPolicy Continued’s strong impact on webpage quality quality. Compared to other techniques, Ghostery exhibits a which was not exposed by HTML-based metrics. smaller impact on images and a higher reduction of the number Summary: Both studies show that RequestPolicyContinued of script and their total size. This is consistent with its tracking and NoScript reduce webpage quality. Other extensions do not protection goal. This extension actually does not intend to have such impact. block media such as advertisements. 2) Manual analysis: The previous subsubsection analyzes C. Summary webpages quality in terms of HTML and associated data. This Figure 10 summarizes our findings. Each metric-based automated approach does not necessarily reflect the quality synthetic index is build as the mean of previously used of the rendered webpage. We perform a manual analysis to metrics (see § IV-B1 § IV-C1) normalized using the metric cope with this limitation. We gather screenshots of the Alexa value obtained with Firefox used alone. Manual analysis- Top 50 without pages from the same entity but with distinct based index is the mean of the mean rating of both ques- localization (e.g. Google and Amazon). These screenshots tions. Figure 10a shows that our manual analysis captures were gathered in November 2016. We use the seven techniques the negative effects of RequestPolicy Continued that are not that provided the best privacy protection (see § VI-A1): Privacy visible with HTML-based metrics. Figures 10b and 10c Badger trained, Blur, Disconnect, uBlock Origin, Ghostery, demonstrate RequestPolicy Continued and NoScript show the NoScript and RequestPolicy Continued. For each webpage and best protection performances but impact browsing experience. protection technique, we display a picture that contains two Ghostery and uBlock Origin protect users slightly less but have screenshots side-to-side to the user: the left one captured with a smaller impact on webpage quality. Remaining techniques Firefox alone, the other with Firefox used with an extension. provide average protection.

11 1e7

5000 7 6 4000 5

3000 4

# image 3 2000 2

1000 Image data size (byte) 1

0 0

PB - 2 PB - 4

MTC - 2 Blur - 2 RPC - 1 MTC - 4 Blur - 4 RPC - 1 WOT - 2 WOT - 4

ABP EL - 2 FF DNT - 2 ABP EL - 4 FF DNT - 4 FF Bare - 2 FF Bare - 4 Decentr. - 2 NoTrace - 2 NoScript - 1 Decentr. - 4 NoTrace - 3NoScript - 2 BeefTaco - 2 Ghostery - 2 BeefTaco - 4 Ghostery - 4 FF no 3PC - 2 FF no 3PC - 4 ABP EL+EP - 2 Disconnect - 2 HTTPS Eve. - 2 Disconnect - 4 ABP EL+EP - 4 HTTPS Eve. - 4 PB (trained) - 2 PB (trained) - 4 ABP uBo-like - 2 uBlockOrigin - 2 ABP uBo-like - 4 uBlockOrigin - 4 FF no 3PC EV - 2 FF no 3PC EV - 4 (a) # images (b) Image size

1e7 3.0 2500 2.5 2000 2.0 1500 1.5 # script 1000 1.0

500 Script data size (byte) 0.5

0 0.0

PB - 3 PB - 3

Blur - 4 MTC - 3 RPC - 2 Blur - 3 MTC - 2 RPC - 2 WOT - 4 WOT - 3

ABP EL - 3 FF DNT - 3 ABP EL - 3 FF DNT - 3 FF Bare - 3 FF Bare - 3 Decentr. - 3 NoTrace - 3 NoScript - 1 NoTrace - 4 Decentr. - 3 NoScript - 1 BeefTaco - 3 Ghostery - 2 BeefTaco - 3 Ghostery - 2 FF no 3PC - 3 FF no 3PC - 3 Disconnect - 3 HTTPS Eve. - 3 ABP EL+EP - 3 HTTPS Eve. - 3 Disconnect - 2 ABP EL+EP - 2 PB (trained) - 3 PB (trained) - 3 ABP uBo-like - 3 uBlockOrigin - 2 uBlockOrigin - 2 ABP uBo-like - 2 FF no 3PC EV - 3 FF no 3PC EV - 3 (c) # scripts (d) Script size Fig. 8. Comparison of protection techniques’ website quality regarding: number of image and scripts and size of images and scripts. We crawled the Alexa Top 100 websites. The mean and standard deviation are computed on ten crawls. The number on top of technique name is the KS-based rank. Bare: Firefox alone, FF: Firefox feature, MTC: MyTrackingChoices, PB (trained): Privacy Badger (trained), FF no 3rd coo. EV: Firefox no 3rd party cookie blocking except from previously visited domains, ABP: AdBlockPlus, EL: EasyList, EP: EasyPrivacy, RPC: Request Policy Continued. AdBlock Plus uBo-like uses all uBlock Origin’s lists except uBlock Origin’s specific ones; it thus loads EasyList, EasyPrivacy, Peter Lowe’s Ad Server list and Malware domains.

VII.DISCUSSIONS that users are more preoccupied by blocking ads than tracking. Malloy et al. [75] show that ad-blockers usage vary between A. Recommendations 18% and 37% in analyzed countries (US, UK, Germany, Our analysis shows that RequestPolicy Continued and No- France, and Canada). Metwalley et al. show [46] that 10 to script are the most effective privacy protection techniques. 18% of users had installed AdBlock Plus while less than 3.5% Similarly to previous work [6], [50], we confirm that they (resp. 2%) were using DoNotTrackMe/Blur (resp. Ghostery). affect webpage quality. Users that want to avoid website Similarly, Pujol et al. [74] speculate that out of the 19.7% breakage can instead use uBlock Origin or Ghostery. The latter of users that install AdBlock Plus, less than 15% of them however needs to be configured which is difficult for users setup the EasyPrivacy list. We show that the default setup [56]. We note that previous work [12], [13] also recommended of AdBlock Plus has average performances but that it can uBlock Origin and Ghostery. Their results were however noisy be configured to achieve good results. This task is however (percentiles are overlapping in [13] and some techniques difficult for users [56]. We thus advise ad-blocking users that actually observed more fingerprintting than the default browser want to improve their privacy to directly change to more in [12]) due to the use of single crawl to each website. Our efficient extensions (see above). robust measurement methodology does not suffer from the Extensions that do not rely on manually maintained filters, same weakness, and show that uBlock Origin or Ghostery Privacy Badger and MyTrackingChoices, exhibit average per- exhibit statistically better performance than other techniques. formances. It however is not clear how their performances Previous network traffic monitoring studies [46], [74] show may be improved. Analyzing extensions’ behavior against

12 Layout similarity rating tory) are less valued by advertisers. A user that blocks all third Proportion of element missing with extension parties cookies thus reduces his customer value for advertisers, and, consequently, publisher’s revenue. This however does not 10 completely block tracking and thus, advertisement transactions

8 (e.g. bidding in RTB). Indiscriminate third party blocking techniques (such as RequestPolicy Continued [25]) however 6 have a much bigger impact on advertising. They for example block interactions between ad exchange and bidders in the 4 case of RTB. All privacy protection techniques presented in this work may thus have very different impact on advertising. 2 We intend to address this aspect in future work.

0 Ghostery [18] and MyTrackingChoices [22] can block track- Bad Score Good Score Bad Blur ing for specific website categories, and thus, allow monetiza-

Ghostery NoScript tion for others. This is obviously less aggressive than most Disconnect Trained uBlock Origin Privacy Badger RequestContinued Policy approaches addressed in this work. Achara et al. [22] however report that 30% of users block the tracking for all categories. Fig. 9. Webpage quality manual analysis. Users are presented two screenshots These users thus have no reason to use these two techniques obtained with Firefox alone and with a privacy protection technique. The first question (blue) asks user to rate layout similarity. The second question (red) instead of other ones. corresponds to the proportion of elements observed with Firefox alone also present when a privacy protection technique is used. Screenshots are captured C. Future work on the Alexa Top 50. Boxplots extend to the upper and lower quartile and contain a line that represents the median. Dashed lines show min/max range. As shown in § III, existing work uses a vast array of metrics. When not visible, median is located on top of the upper quartile. Some of them are tailored to very specific use cases (such as behavioral advertising [8], [45] and cookie-based mass- surveillance [47]). Similarly, data-related metrics present an existing blocking lists would help to improve their results. obvious interest in a mobile context. We intend to add new Blocking list-based protection techniques are efficient but their metrics to this work. We also want to analyze the impact of scalability is limited due to human intervention. Current efforts all these protection techniques on specific attacks (browser [68], [52], [53], [51], [55] to automatically build blocking list fingerprinting [4], [48], cookie syncing [67], [2], [76], [48]). are thus extremely relevant. Analyzing extensions inside other browsers, such as Ghostery and uBlock Origin block a significant amount of Chrome, or , would provide a wider resources that are not blocked by other protection techniques. picture of privacy protection techniques in the wild. Sele- Both should thus be considered as relevant when assessing nium, the browser automation framework used by OpenWPM, users’ ability to protect themselves. supports these browsers. OpenWPM, however, currently only supports Firefox. Furthermore, browsers include more and B. Limitations more privacy protection (e.g. Firefox [39], Brave [40], Cliqz In this work, we estimate the impact of privacy protection [54]). Cliqz and Brave are not supported by Selenium, which techniques on webpage quality in terms of missing elements make both browsers impossible to use with OpenWPM (see and data (using HTML-based metrics § IV-C1 and alteration to § IV-A). We however intend to add them later on. the webpage layout and elements (using a user manual analysis While some works use domain name-based heuristics [77], § IV-C2). We however do not address the functionality loss. [67] to identify third parties, others use blocking lists-based We intend to explore this aspect in future work along two extensions as ground-truth (see Table II). When the latter axes. We plan to explore each website beyond its homepage method is used, tracking’s long tail [48] may cause many to improve website coverage. false negatives. Beyond our analysis in § VI-A, we intend Several work [59], [68], [51], [52], [53], [55] automatically to investigate blocking lists more thoroughly. build blocking lists. We did not evaluate them because pro- duced lists are not publicly available. VIII.CONCLUSIONS Evaluated approaches aim at blocking tracking. One side ef- This work extensively compares existing privacy protection fect is advertisement blocking. These approaches thus threaten techniques against third party tracking and provides four the current economic model of many publishers on the Inter- contributions. First, we propose a robust methodology to net. Techniques examined in this work however do not have the compare privacy protection techniques when crawling many same impact on third party tracking. For example, Ghostery websites, and quantify measurement error. We use the privacy and MyTrackingChoices can block tracking for specific web- footprint [1] and apply the Kolmogorov-Smirnov (KS) test on site categories. Similarly, Firefox’s third party cookie blocking browsing metrics to evaluate privacy protection. This test is only partially impacts advertisement techniques (such as Real- likewise applied to HTML-based metrics to assess the impact Time Bidding, RTB) that rely on user interests. Olejnik et al. on webpage quality. We also design a manual analysis to [67] show that new users (i.e. users without any browsing his- complement HTML-based metrics. Our second contribution is

13 1.05 Disconnect Disconnect Blur uBlockOrigin 9.5 9.5 Privacy Badger Ghostery 1.00 uBlockOrigin (trained) Ghostery Disconnect uBlockOrigin 0.95 Blur Blur Privacy Badger 9.0 (trained) 9.0 Ghostery 0.90

8.5 Privacy Badger 0.85 8.5 (trained)

0.80 Request Policy 8.0 Continued 8.0 0.75 Request Policy Request Policy Continued Continued 0.70 NoScript 7.5 NoScript 7.5 NoScript

0.65 Low Manual analysis quality index High Low Manual analysis quality index High Low HTML quality index High 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 1.05 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Low HTML quality index High High Privacy protection index Low High Privacy protection index Low (a) HTML-based vs manual analysis quality (b) Privacy protection vs HTML-based quality (c) Privacy protection vs manual analysis Fig. 10. Scatter plots of synthetic index for privacy protection, HTML-based quality and manual analysis-based quality. a privacy protection techniques comparison in terms of overlap [9] R. Hill, “Comparative benchmarks against widely used blockers: and performances. We show that the most popular privacy Top 15 most popular news websites,” 2014, accessed: 2017-07-30. [Online]. Available: ://github.com/gorhill/httpswitchboard/ protection techniques exhibit a common blocking baseline. We wiki/Comparative-benchmarks-against-widely-used-blockers: also highlight the fact that some extensions, namely Ghostery -Top-15-Most-Popular-News-Websites and uBlock Origin, have a very specific behavior and are the [10] ——, “ublock and others: Blocking ads, track- ers, malwares,” 2015, accessed: 2017-07-30. [Online]. only ones to block some domains. Our third contribution is Available: https://github.com/gorhill/uBlock/wiki/uBlock-and-others% a comparison of the impact of privacy protection techniques 3A-Blocking-ads%2C-trackers%2C-malwares on webpage quality. Our fourth contribution is a set of usage [11] C. E. Wills and D. C. Uzunoglu, “What ad blockers are (and are not) doing,” in HotWeb 2016, pp. 72–77. recommendations for end-users, and research recommendation [12] G. Merzdovnik, M. Huber, D. Buhov, N. Nikiforakis, S. Neuner, for the scientific community. Ghostery and uBlock Origin M. Schmiedecker, and E. Weippl, “Block me if you can: A large-scale provide the best trade-off between protection and webpage study of tracker-blocking tools,” in EuroSP 2017. [13] S. Traverso, M. Trevisan, L. Giannantoni, M. Mellia, and H. Metwalley, quality. Ghostery however requires a configuration step which “Benchmark and comparison of tracker-blockers:should you trust them?” is difficult for users [56]. The RequestPolicy Continued and in TMA 2017. NoScript extensions exhibit the best performances but reduce [14] T. Bujlow, V. Carela-Espanol,˜ J. Sole-Pareta,´ and P. Barlet-Ros, “A survey on web tracking: Mechanisms, implications, and defenses,” webpage quality. Ghostery and uBlock Origin use manually Proceedings of the IEEE, vol. PP, no. 99, pp. 1–35, 2017. built blocking lists which are cumbersome to maintain. Re- [15] J. Estrada-Jimenez,´ J. Parra-Arnau, A. Rodr´ıguez-Hoyos, and J. Forne,´ search efforts should focus on improving existing approaches “: Analysis of privacy threats and protection ap- proaches,” Computer Communications, 2016. that do not rely on blocking lists (such as Privacy Badger or [16] “AdBlock Plus,” accessed: 2017-07-30. [Online]. Available: https: MyTrackingChoices) and automating blocking list building. //adblockplus.org/ [17] “uBlock Origin,” accessed: 2017-07-30. [Online]. Available: https: //github.com/gorhill/uBlock REFERENCES [18] “Ghostery,” accessed: 2017-07-30. [Online]. Available: https://www. ghostery.com/ [19] “Disconnect,” accessed: 2017-07-30. [Online]. Available: https:// [1] B. Krishnamurthy and C. E. Wills, “Generating a privacy footprint on disconnect.me/ the internet,” in IMC 2006, pp. 65–70. [20] “NoTrace,” accessed: 2017-07-30. [Online]. Available: http://www. [2] G. Acar, C. Eubank, S. Englehardt, M. Juarez, A. Narayanan, and isislab.it/projects/NoTrace/ C. Diaz, “The web never forgets: Persistent tracking mechanisms in [21] “Blur,” accessed: 2017-07-30. [Online]. Available: https://dnt.abine.com the wild,” in CCS 2014, pp. 674–689. [22] J. P. Achara, J. Parra-Arnau, and C. Castelluccia, “Mytrackingchoices: [3] A. Soltani, S. Canty, Q. Mayo, L. Thomas, and C. J. Hoofnagle, “Flash Pacifying the ad-block war by enforcing user privacy preferences,” in cookies and privacy,” in SSRN, 2009, pp. 158–163. WEIS 2016. [4] P. Eckersley, “How unique is your ?” in PETS 2010, pp. [23] “Privacy Badger,” accessed: 2017-07-30. [Online]. Available: https: 1–18. //www.eff.org/fr/privacybadger [5] “DoNotTrack,” accessed: 2017-07-30. [Online]. Available: https: [24] “NoScript,” accessed: 2017-07-30. [Online]. Available: https://noscript. //tools.ietf.org/html/draft-mayer-do-not-track-00 net/ [6] B. Krishnamurthy, D. Malandrino, and C. E. Wills, “Measuring privacy [25] “RequestPolicyContinued,” accessed: 2017-07-30. [Online]. Available: loss and the impact of privacy protection in web browsing,” in SOUPS https://requestpolicycontinued.github.io/ 2007, pp. 52–63. [26] “HTTPS Everywhere,” accessed: 2017-07-30. [Online]. Available: [7] J. R. Mayer and J. C. Mitchell, “Third-party web tracking: Policy and https://www.eff.org/fr/https-everywhere technology,” in SP 2012, pp. 413–427. [27] “Decentraleyes,” accessed: 2017-07-30. [Online]. Available: https: [8] R. Balebako, P. Leon, R. Shay, B. Ur, Y. Wang, and L. Cranor, //decentraleyes.org “Measuring the effectiveness of privacy tools for limiting behavioral [28] “WebOfTrust,” accessed: 2017-07-30. [Online]. Available: https: advertising,” in W2SP 2012. //www.mywot.com/

14 [29] N. Nikiforakis, A. Kapravelos, W. Joosen, C. Kruegel, F. Piessens, and [58] B. Krishnamurthy and C. Wills, “Privacy diffusion on the web: a G. Vigna, “Cookieless monster: Exploring the ecosystem of web-based longitudinal perspective,” in WWW 2009, pp. 541–550. device fingerprinting,” in SP 2013, pp. 541–555. [59] F. Roesner, T. Kohno, and D. Wetherall, “Detecting and defending [30] J. Vincent, “Ghostery has been bought by the developer against third-party tracking on the web,” in NSDI 2012, pp. 12–26. of a privacy-focused browser,” 2 2017, accessed: 2017-07- [60] D. Fifield and S. Egelman, “Fingerprinting web users through font 30. [Online]. Available: http://www.theverge.com/2017/2/15/14622484/ metrics,” in FC 2015, pp. 107–124. ghostery-ad-tracking-plug-in-cliqz [61] Ł. Olejnik, G. Acar, C. Castelluccia, and C. Diaz, “The leaking battery,” [31] “Network advertising industry,” accessed: 2017-07-30. [Online]. in DPM 2015, pp. 254–263. Available: http://www.networkadvertising.org/choices/ [62] K. Mowery and H. Shacham, “Pixel perfect: Fingerprinting canvas in [32] “Digital advertising alliance,” accessed: 2017-07-30. [Online]. Available: ,” in W2SP 2012. http://www.aboutads.info/choices/ [63] Y. Cao, S. Li, and E. Wijmans, “(Cross-)Browser Fingerprinting via OS [33] “Beeftaco,” accessed: 2017-07-30. [Online]. Available: https://jmhobbs. and Hardware Level Features,” in NDSS 2017. github.io/beef-taco/ [64] G. Acar, M. Juarez, N. Nikiforakis, C. Diaz, S. Gurses,¨ F. Piessens, and [34] “Allowing acceptable ads in plus - agreements,” B. Preneel, “Fpdetective: dusting the web for fingerprinters,” in CCS accessed: 2017-07-30. [Online]. Available: https://adblockplus.org/ 2013, pp. 1129–1140. acceptable-ads-agreements [65] A. Lerner, A. K. Simpson, T. Kohno, and F. Roesner, “Internet jones and [35] “EasyList/EasyPrivacy,” accessed: 2017-07-30. [Online]. Available: the raiders of the lost trackers: An archaeological study of web tracking https://easylist.to/ from 1996 to 2016,” in Usenix Security 2016, pp. 997–1013. [36] V. S. Eckert, J. Klofta, and J. L. Strozyk, ““Web of Trust” spaht¨ [66] A. Ghosh and A. Roth, “Selling privacy at auction,” in EC 2011, pp. nutzer aus,” 11 2016, accessed: 2017-07-30. [Online]. Available: 199–208. https://www.tagesschau.de/inland/tracker-online-103. [67] L. Olejnik, T. Minh-Dung, and C. Castelluccia, “Selling off privacy at [37] L. Olejnik, “Proposed amendments to eprivacy reg- auction,” in NDSS 2014, pp. 104–114. ulation are great,” 6 2017, accessed: 2017- [68] J. Bau, J. Mayer, H. Paskov, and J. C. Mitchell, “A promising direction 07-30. [Online]. Available: https://blog.lukaszolejnik.com/ for web tracking countermeasures,” in W2SP 2013. proposed-amendments-to-eprivacy-regulation-are-great [69] V. Kalavri, J. Blackburn, M. Varvello, and K. Papagiannaki, “Like a [38] “Firefox,” accessed: 2017-07-30. [Online]. Available: https://www. pack of wolves: Community structure of web trackers,” pp. 42–54. .org/en-US/firefox [70] “Alexa top 500 sites on the web,” accessed: 2017-07-30. [Online]. Available: http://www.alexa.com [39] G. Kontaxis and M. Chew, “Tracking protection in firefox for privacy [71] “Public suffix list,” accessed: 2017-07-30. [Online]. Available: https: and performance,” in W2SP 2015. //publicsuffix.org/ [40] “Brave,” accessed: 2017-07-30. [Online]. Available: https://brave.com/ [72] S. Guha, B. Cheng, and P. Francis, “Challenges in measuring online [41] “Cliqz,” accessed: 2017-07-30. [Online]. Available: https://cliqz.com/en advertising systems,” in IMC 2010, pp. 81–87. [42] “AdBlock,” accessed: 2017-07-30. [Online]. Available: https: [73] G. Pass, A. Chowdhury, and C. Torgeson, “A picture of search.” //getadblock.com/ InfoScale, vol. 152, p. 1, 2006. [43] C. Castelluccia, S. Grumbach, and L. Olejnik, “Data harvesting 2.0: [74] E. Pujol, O. Hohlfeld, and A. Feldmann, “Annoyed users: Ads and ad- from the visible to the invisible web,” in WEIS 2013. block usage in the wild,” in IMC 2016, pp. 93–106. [44] N. Fruchter, H. Miao, S. Stevenson, and R. Balebako, “Variations in [75] M. Malloy, M. McNamara, A. Cahn, and P. Barford, “Ad blockers: tracking in relation to geographic location,” in W2SP 2015. Global prevalence and impact,” in IMC 2016, pp. 119–125. [45] J. M. Carrascosa, J. Mikians, R. Cuevas, V. Erramilli, and N. Laoutaris, [76] M. Falahrastegar, H. Haddadi, S. Uhlig, and R. Mortier, “Tracking “I always feel like somebody’s watching me. measuring online be- personal identifiers across the web,” in PAM 2016, pp. 30–41. havioural advertising,” in CoNext 2015. [77] ——, “The rise of panopticons: Examining region-specific third-party [46] H. Metwalley, S. Traverso, M. Mellia, S. Miskovic, and M. Baldi, “The web tracking.” in TMA 2014, pp. 104–114. online tracking horde: a view from passive measurements,” in TMA 2015, pp. 111–125. [47] S. Englehardt, D. Reisman, C. Eubank, P. Zimmerman, J. Mayer, A. Narayanan, and E. W. Felten, “Cookies that give you away: The surveillance implications of web tracking,” in WWW 2015, pp. 289–299. [48] S. Englehardt and A. Narayanan, “Online tracking: A 1-million-site measurement and analysis,” in CCS 2016. [49] D. Malandrino, A. Petta, V. Scarano, L. Serra, R. Spinelli, and B. Krish- namurthy, “Privacy awareness about information leakage: Who knows what about me?” in WPES 2013, pp. 279–284. [50] D. Malandrino and V. Scarano, “Privacy leakage on the web: Diffusion and countermeasures,” Computer Networks, vol. 57, no. 14, pp. 2833– 2855, 2013. [51] D. Gugelmann, M. Happe, B. Ager, and V. Lenders, “An automated approach for complementing ad blockers blacklists,” in PETS 2015, pp. 282–298. [52] H. Metwalley, S. Traverso, and M. Mellia, “Unsupervised detection of web trackers,” in GLOBECOM 2015, pp. 1–6. [53] Q. Wu, Q. Liu, Y. Zhang, and G. Wen, “Trackerdetector: A system to detect third-party trackers through machine learning,” Computer Networks, vol. 91, pp. 164–173, 2015. [54] Z. Yu, S. Macbeth, K. Modi, and J. M. Pujol, “Tracking the trackers,” in WWW 2016, pp. 121–132. [55] M. Ikram, H. J. Asghar, M. A. Kaafar, A. Mahanti, and B. Krishna- murthy, “Towards seamless tracking-free web: Improved detection of trackers via one-class learning,” ser. PETS 2017. [56] P. Leon, B. Ur, R. Shay, Y. Wang, R. Balebako, and L. Cranor, “Why johnny can’t opt out: A usability evaluation of tools to limit online behavioral advertising,” in CHI 2012, pp. 589–598. [57] D. Malandrino, V. Scarano, and R. Spinelli, “How increased awareness can impact attitudes and behaviors toward online privacy protection,” in SocialCom 2013, pp. 57–62.

15