Automated Discovery of Privacy Violations on the Web
Total Page:16
File Type:pdf, Size:1020Kb
Automated discovery of privacy violations on the web Steven Tyler Englehardt A Dissertation Presented to the Faculty of Princeton University in Candidacy for the Degree of Doctor of Philosophy Recommended for Acceptance by the Department of Computer Science Adviser: Arvind Narayanan September 2018 c Copyright by Steven Tyler Englehardt, 2018. All rights reserved. Abstract Online tracking is increasingly invasive and ubiquitous. Tracking protection provided by browsers is often ineffective, while solutions based on voluntary cooperation, such as Do Not Track, haven't had meaningful adoption. Knowledgeable users may turn to anti-tracking tools, but even these more advanced solutions fail to fully protect against the techniques we study. In this dissertation, we introduce OpenWPM, a platform we developed for flexi- ble and modular web measurement. We've used OpenWPM to run large-scale studies leading to the discovery of numerous privacy violations across the web and in emails. These discoveries have curtailed the adoption of tracking techniques, and have in- formed policy debates and browser privacy decisions. In particular, we present novel detection methods and results for persistent track- ing techniques, including: device fingerprinting, cookie syncing, and cookie respawn- ing. Our findings include sophisticated fingerprinting techniques never before mea- sured in the wild. We've found that nearly every new API is misused by trackers for fingerprinting. The misuse is often invisible to users and publishers alike, and in many cases was not anticipated by API designers. We take a critical look at how the API design process can be changed to prevent such misuse in the future. We also explore the industry of trackers which use PII-derived identifiers to track users across devices, and even into the offline world. To measure these techniques, we develop a novel bait technique, which allows us to spoof the presence of PII on a large number of sites. We show how trackers exfiltrate the spoofed PII through the abuse of browser features. We find that PII collection is not limited to the web|the act of viewing an email also leaks PII to trackers. Overall, about 30% of emails leak the recipient's email address to one or more third parties. Finally, we study the ability of a passive eavesdropper to leverage tracking cookies for mass surveillance. If two web pages embed the same tracker, then the adversary iii can link visits to those pages from the same user even if the user's IP address varies. We find that the adversary can reconstruct 62|73% of a typical user's browsing history. iv Acknowledgements I am grateful to my advisor, Arvind Narayanan, for always being willing to provide assistance and advice. Arvind has encouraged me to do the uncomfortable, to be confident in my work, and to not be afraid to do things differently. I owe a lot of my professional growth over the past five years to his guidance. I've learned the value of real world impact in research, and perhaps more importantly, an understanding of how to choose the right projects. I am also grateful to my dissertation committee, Nick Feamster, Edward Felten, Jennifer Rexford, and Prateek Mittal for their assistance and advice, not only during the dissertation process, but throughout my time at Princeton. I am incredibly lucky to have worked with such talented collaborators. I've bene- fited from their advice, critiques, and insights. In particular, I'd like to thank G¨une¸s Acar, Claudia Diaz, Christian Eubank, Edward Felten, Jeffrey Han, Marc Juarez, Jonathan Mayer, Arvind Narayanan, Lukasz Olejnik, Dillion Reisman, and Peter Zimmerman. This research in this dissertation would not have been possible without their contributions. I could always count on the members of the security group to keep life interesting! At any time of day you could jump in to an engaging conversation, whether it was working through research ideas or playing along with one of our countless nonsensical jokes. In particular, I'm thankful to Steven Goldfeder for the late-night comradery, to Harry Kalodner who was never afraid to jump in to help me struggle through a research problem, to Joe Bonneau for his assistance and encouragement, to Pete and Janee Zimmerman for never failing to bring us all together, and to Ben Burgess for teaching me that no amount of physical security is enough physical security. My decision to pursue a career in research was the result of countless conversations and mentoring from friends and collaborators at Stevens Institute of Technology. My initial exposure to research was at the encouragement of Chris Merck, who introduced v me to the atmospheric research lab at Stevens. I owe a debt of gratitude to all of my mentors along the way, including Jeff Koskulics and Harris Kyriakou. I simply wouldn't have made it to where I am today without my wife, Kim. She is always encouraging and optimistic, helping me to be the best version of myself. She has provided unwavering support and patience through the ups and downs of grad school, and has been a constant source of happiness. Lastly, I am grateful to my family, who have always supported me in all my endeavours. My mother and father have taught me the value of hard and honest work. I couldn't ask for better role models in life. The work in this dissertation was supported in part by a National Science Foun- dation Grant No. CNS 1526353, a grant from the Data Transparency Lab, a research grant from Mozilla, a Microsoft Excellence Scholarship for Internships and Summer Research given through Princeton's Center for Information Technology Policy, and by Amazon Web Services Cloud Credits for Research. vi Contents Abstract . iii Acknowledgements . .v 1 Introduction 1 1.1 Overview . .1 1.2 Contributions . .7 1.2.1 OpenWPM: a web measurement platform . .8 1.2.2 The state of web tracking . .9 1.2.3 Measuring device fingerprinting . 11 1.2.4 PII collection and use by trackers . 12 1.2.5 The surveillance implications of web tracking . 14 1.3 Structure . 15 2 Background and related work 17 2.1 Third-party web tracking . 17 2.1.1 Stateful web tracking . 19 2.1.2 Stateless tracking . 24 2.1.3 Cookie syncing . 27 2.1.4 Personally Identifiable Information (PII) leakage . 28 2.1.5 Cross-device tracking . 29 2.1.6 Tracking in emails . 31 vii 2.2 The role of web tracking in government surveillance . 32 2.2.1 NSA and GCHQ use of third-party cookies . 33 2.2.2 United States Internet monitoring . 34 2.2.3 Surveillance: attacks, defenses, and measurement . 34 2.3 The state of privacy review in web standards . 35 2.3.1 The W3C standardization process . 36 2.3.2 W3C privacy assessment practices and requirements . 37 2.3.3 Past privacy assessment research . 38 3 OpenWPM: A web measurement platform 39 3.1 The design of OpenWPM . 40 3.1.1 Previous web tracking measurement platforms . 41 3.1.2 Design and Implementation . 43 3.1.3 Evaluation . 51 3.1.4 Applications of OpenWPM . 53 3.2 Core web privacy measurement methods . 54 3.2.1 Distinguishing third-party from first-party content . 54 3.2.2 Identifying trackers . 55 3.2.3 Browsing Models . 56 3.2.4 Detecting User IDs . 59 3.2.5 Detecting PII Leakage . 61 3.2.6 Measuring Javascript calls . 62 4 Web tracking is ubiquitous 64 4.1 A 1-million-site census of online tracking . 64 4.1.1 Measurement configuration . 65 4.1.2 Measuring stateful tracking at scale . 66 4.1.3 The long but thin tail of online tracking . 67 viii 4.1.4 Prominence: a third party ranking metric . 69 4.1.5 News sites have the most trackers . 71 4.1.6 Does tracking protection work? . 73 4.2 Measuring Cookie Respawning . 74 4.2.1 Flash cookies respawning HTTP cookies . 75 4.2.2 HTTP cookies respawning Flash cookies . 78 4.3 Measuring Cookie Syncing . 79 4.3.1 Detecting cookie synchronization . 79 4.3.2 Measurement configuration . 80 4.3.3 Cookie syncing is widespread on the top sites . 81 4.3.4 Back-end database synchronization . 83 4.3.5 Cookie syncing amplifies bad actors . 85 4.3.6 Opt-out doesn't help . 86 4.3.7 Nearly all of the top third parties cookie sync . 87 4.4 Summary . 88 5 Persistent tracking with device fingerprinting 90 5.1 Fingerprinting: a 1-Million site view . 91 5.1.1 Measurement configuration . 92 5.1.2 Canvas Fingerprinting . 93 5.1.3 Canvas Font Fingerprinting . 96 5.1.4 WebRTC-based fingerprinting . 97 5.1.5 AudioContext Fingerprinting . 99 5.1.6 Battery Status API Fingerprinting . 102 5.1.7 The wild west of fingerprinting scripts . 103 5.2 Case study: the Battery Status API . 104 5.2.1 The timeline of specification and adoption . 105 5.2.2 Use and misuse of the API in the wild . 111 ix 5.2.3 Lessons Learned & Recommendations . 114 5.3 Summary . 119 6 Third-party trackers collect PII 120 6.1 Trackers collect PII in emails . 122 6.1.1 Collecting a dataset of emails . 123 6.1.2 Measurement methods . 128 6.1.3 Privacy leaks when viewing emails . 132 6.1.4 Privacy leaks when clicking links in emails . 138 6.1.5 Evaluation of email tracking defenses . 140 6.1.6 Survey of tracking prevention in email clients . 143 6.1.7 Our proposed defense against email tracking . 144 6.1.8 Limitations . 146 6.2 Trackers collect PII on the web .