Long-Term Observation on Browser Fingerprinting: Users' Trackability
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings on Privacy Enhancing Technologies ; 2020 (2):558–577 Gaston Pugliese*, Christian Riess, Freya Gassmann, and Zinaida Benenson Long-Term Observation on Browser Fingerprinting: Users’ Trackability and Perspective Abstract: Browser fingerprinting as a tracking technique plugins, were sufficient to uniquely recognize between to recognize users based on their browsers’ unique fea- 83.6% and 94.2% of browsers in his dataset. tures or behavior has been known for more than a Motivation. The collected browser fingerprints of decade. We present the results of a 3-year online study Panopticlick have not been published. Ten years later, on browser fingerprinting with more than 1,300 users. after many related studies [2–6], research on browser This is the first study with ground truth on user level, fingerprinting still lacks available and appropriate data. which allows the assessment of trackability based on Depending on the investigated aspects of browser finger- fingerprints of multiple browsers and devices per user. printing, the requirements for data collection are quite Based on our longitudinal observations of 88,000 mea- extensive: long-term, large-scale, fine-grained, diversi- surements with over 300 considered browser features, fied, sometimes cross-browser and cross-device. These we optimized feature sets for mobile and desktop de- requirements apply to research studying the evolution vices. Further, we conducted two user surveys to deter- of browser features over time, assessing users’ trackabil- mine the representativeness of our user sample based on ity, or examining the effectiveness of countermeasures. users’ demographics and technical background, and to At the time of writing, we are aware of only two learn how users perceive browser fingerprinting and how available datasets of browser fingerprints. Tillmann col- they protect themselves. lected fingerprints in 2012 [2] and published a dataset that does not contain raw values of the most discrimi- Keywords: browser fingerprinting, tracking, privacy nating features, like fonts or plugins, and the data col- DOI 10.2478/popets-2020-0041 lection was performed over a short period of one month. Received 2019-08-31; revised 2019-12-15; accepted 2019-12-16. Vastel et al. published an unfiltered sample of their data in 2018 of fingerprints collected via browser ex- tensions [3]. Although the dataset contains fingerprints 1 Introduction from 1.5 years, it lacks mobile devices, other browser families than Firefox and Chrome (after filtering) as well In 2020, the PETS paper How Unique Is Your Web as precise timestamps and frequencies of observations. Browser? [1] by Peter Eckersley will celebrate its 10th Furthermore, most studies relied on cookies to rec- publication anniversary. It was the first paper describ- ognize recurring browsers [1, 2, 4, 6]. This, however, ing a study on browser fingerprinting (Panopticlick) is not a reliable way to establish ground truth for long- that gained far-reaching attention and can thus be seen term observations, especially for privacy studies with al- as the origin of this research field. Back then, Eckersley legedly savvy users: cookies can be easily deleted, either showed that a small set of eight browser characteristics, manually by the user or automatically by the browser, including the user-agent string and the list of installed and they do not help to recognize users if they switch browsers or devices. Moreover, it diminishes the relia- bility of ground truth on fingerprints being unique-by- entity and trackable (Sec. 2.2). *Corresponding Author: Gaston Pugliese: Friedrich- Alexander University Erlangen-Nürnberg, E-mail address: In general, both the overall long-term trackability of [email protected] users across multiple browsers and devices, and users’ Christian Riess: Friedrich-Alexander University Erlangen- understanding of or actions against browser fingerprint- Nürnberg, E-mail address: [email protected] ing have received little attention so far. Furthermore, Freya Gassmann: Saarland University, E-mail address: the representativeness of fingerprint datasets has only [email protected] Zinaida Benenson: Friedrich-Alexander University Erlangen- been investigated with respect to technical aspects [1– Nürnberg, E-mail address: [email protected] 8], but not with respect to user demographics. Long-Term Observation on Browser Fingerprinting 559 Define Finally, the impact of participation in the studies on Collect browser features browser features users’ perception of fingerprinting remains unknown. to collect Research Questions. Based on the considerations above, we present a 3-year online study with the goals Preprocess Feature to collect longitudinal fingerprint data and to investi- browser features stemming gate the users’ perspective on browser fingerprinting for the first time. Our data collection (Sec. 3) estab- Fingerprint Compose Feature set lishes ground truth on user level instead of browser or linkability fingerprints optimization device level. Thereby, we can link fingerprints to indi- vidual study participants over longer periods of time Evaluate without relying on the persistence of client-side identi- fingerprint metrics fiers and regardless of the number of devices or browsers they used. Furthermore, we conducted two user surveys Fig. 1. Systemized workflow for browser fingerprinting to determine the demographic representativeness of our user sample and to understand the users’ perception of 3. We present a simple, yet effective approach for op- browser fingerprinting, the role of their study participa- timizing feature sets towards different metrics (e.g., tion on this perception, and their protection measures. stability of fingerprints) for desktop and mobile de- We aim to answer the following research questions: vices. Further, we introduce feature stemming as a – RQ1: How trackable are users based on their way to improve feature stability, e.g., by stripping browser fingerprints regardless of the number of off version substrings (Sec. 4). browsers and devices in use? 4. We make a dataset of our long-term study available – RQ2: How do different feature sets perform regard- for research purposes (Sec. 7). ing fingerprint stability and trackability of users? – RQ3: Do demographics of users, as well as their technical background, privacy concerns and privacy behavior, correlate with their trackability? 2 Background – RQ4: How do users perceive browser fingerprinting, what is the role of study participation in this per- A fingerprint is “a set of information elements that ception, and which countermeasures do they apply? identifies a device or application instance” and finger- Hypotheses. For RQ3, we formulate the follow- printing is the “process of uniquely identifying” these ing six hypotheses, here combined into one sentence for entities [9]. For browser fingerprinting, these informa- brevity. H1-6: The following user characteristics are re- tion elements (features) can be obtained passively from lated to their trackability: (1) age, (2) gender, (3) educa- the client HTTP headers (e.g., user-agent string or lan- tion level, (4) computer science background, (5) privacy guage), and actively using a client-side script to col- behavior, (6) privacy concerns. lect information like screen resolution or plugins. Un- Contributions. The main contributions of this pa- like cookies which are stateful identifiers stored on the per cover technical findings as well as insights in users’ client side, fingerprinting is considered a stateless track- perception of browser fingerprinting based on quantita- ing technique [10]. tive and qualitative analyses: In the following, we review concepts and termi- 1. We present a novel long-term study on browser nology of browser fingerprinting, and we provide an fingerprinting with ground truth on user level. overview of studies that collected browser fingerprint- Between 2016–2019, we collected 88,088 measure- ing data since 2009 (Table 1). ments of 305 browser features of 1,304 participants (Sec. 3). 2. We present two online surveys with study partic- 2.1 Evaluating Browser Fingerprints ipants to assess their demographic characteristics and thus the representativeness of our sample as Figure 1 shows a workflow for browser fingerprinting: well as to study participants’ privacy behavior, com- (1) The browser features that shall be collected are de- prehension of browser fingerprinting, and applied fined and the fingerprint script is implemented for the countermeasures (Sec. 3.2, 5, and 6). client and server side. (2) The fingerprint script is de- ployed to collect the clients’ browser features. If ap- Long-Term Observation on Browser Fingerprinting 560 plicable, these features are enriched with further state- Def. 1. A fingerprint f(e, t) w.r.t. T is unique-by- ful identifiers (e.g., cookies, session ID after authenti- entity if, and only if, it is linked to a single entity, i.e., cation, personalized token in URL). Depending on the 0 0 0 0 0 type of ground truth, a fingerprint can be linked to ei- ∀e ∈ E, ∀t ∈ T, e 6= e ⇒ f(e , t ) 6= f(e, t) . (5) ther an individual browser instance, device, or even user. (3) The collected browser features are preprocessed Def. 2. A fingerprint f(e, t) w.r.t. T is unique-by- which may include normalization (e.g., screen resolu- appearance if, and only if, it was observed once, i.e., tion [8]) or derivation of additional information (e.g., ∀e0 ∈ E, ∀t0 ∈ T, f(e0, t0) = f(e, t) ⇒ e0 = e ∧ t0 = t. (6) user-agent parsing). Our work contributes to this step of the workflow and proposes feature stemming and fea- Def. 3. A fingerprint f(e, t) is stable if, and only if, ture set optimization to improve the stability of features its stability period is > 0. and to compile an optimized feature set from collected features (Sec. 4). (4) The actual fingerprints are com- Def. 4. A fingerprint f(e, t) is trackable if, and only posed using a feature set. In practice, fingerprints can if, f(e, t) is (i) unique-by-entity and (ii) stable. be handled as MD5 hashes, or as vectors of raw feature values.