Towards a Privacy-Preserving Web
Total Page:16
File Type:pdf, Size:1020Kb
TOWARDS A PRIVACY-PRESERVING WEB by Umar Iqbal A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Computer Science in the Graduate College of The University of Iowa August 2021 Thesis Committee: Zubair Shafiq, Thesis Supervisor Omar Haider Chowdhury Rishab Nithyanand Juan Pablo Hourcade Supreeth Shastri ACKNOWLEDGEMENTS I have been incredibly fortunate to have helpful and supportive mentors, col- laborators, friends, and family. This thesis would not have been possible without their guidance and support. First and foremost, I would like to thank my advisor, Zubair Shafiq, for his invaluable mentorship. I am especially thankful to him for his openness to ideas, encouragement to go out of my comfort zone, teaching me to critically question my research, and to pick important research problems. I also enjoyed our informal dis- cussions, especially while playing cricket and walking around cities during conference travels. Reflecting back, it has been a very rewarding experience. I would also like to thank several folks at the University of Iowa, who have made my Ph.D. an incredible experience. Especially my comprehensive, proposal, and thesis defense committee members, Omar Chowdhury, Rishab Nithyanand, Juan Pablo, and Supreeth Shastri for their valuable feedback, that helped me improve my research. I am particularly thankful to Omar Chowdhury for providing immense support and writing tens of letters. I cannot forget my weekly meetings with my lab members and friends, Shehroze Farooqi, Adnan Ahmad, Hammad Mazhar, John Cook, and Huyen Le. They provided invaluable critical feedback and a friendly environment for me to learn and excel. Last, but not the least, I want to thank Sheryl Semler and Catherine Till for providing swift administrative support and tolerating my last minute requests. ii I would also like to thank my undergraduate research advisor, Fareed Zaffar, for introducing me to research, providing me an opportunity to work with talented individuals, and encouraging me to pursue a Ph.D. I would not have been where I am today, if it was not for the encouragement and advise of my middle school principal, Khalida Parveen. I have been fortunate to collaborate and learn from many talented people, both in academia and in industry. I would especially like to thank, Steven Englehardt, Zhiyun Qian, Carmela Troncoso, Shitong Zhu, Sandra Siby, John Hazen, and John Wilander. I am thankful to have friends who were always there for me through the ups and downs of graduate school. A special thank you to Bilal Khan, Bilal Bhatti, Hamza Naveed, Bilal Asif, Usman Rana, and Ankur Parupally. I owe everything to my parents, Iqbal and Farhana, who provided their utmost support in all my endeavors. I thank them for their unwavering love and guidance. I want to thank my wife, Jeehan, for supporting me through thick and thin. Especially for tolerating my busy work schedule and helping me get better at maintaining my work life balance : ) I dedicate this thesis to my parents and my wife. iii ABSTRACT Modern web applications are built by combining functionality from external third parties; with the caveat that the website developers trust them. However, third parties come from various sources and website developers are often unaware of their origin and their complete functionality. Thus, the presence of “trusted” third parties has lead to many security and privacy abuses on the web, with one of the most severe consequence being privacy-invasive cross-site tracking without the knowledge or consent of users. In this thesis, I aim to tackle the cross-site tracking menace to make the web more secure and private. Specifically, I build novel privacy-enhancing systems using system instrumentation, machine learning, program analysis, and internet measure- ments techniques. At a high level, my research process involves: instrumenting web browsers to capture detailed execution of webpages, conducting rigorous measure- ments of privacy and security abuses, and using insights from the measurement stud- ies to build machine learning based approaches that counter the privacy and security abuses. In the first half of this thesis, I build privacy-enhancing systems that counter third party cross-site tracking by blocking both stateful tracking, commonly referred to as cookie based tracking, and stateless tracking, commonly referred to as browser fingerprinting. In the second half of this thesis, I build privacy-enhancing systems that counter retaliation by third party circumvention services that evade blocking, by iv either detecting and removing them or by stealthily concealing the traces of blocking. At the end of the thesis, I highlight the research contributions made by this thesis and some of the open research problems. v PUBLIC ABSTRACT Modern web applications are built by combining functionality from external third parties; with the caveat that the website developers trust them. However, third parties come from various sources and website developers are often unaware of their origin and their complete functionality. Thus, the presence of “trusted” third parties has lead to many security and privacy abuses on the web, with one of the most severe consequence being privacy-invasive cross-site tracking without the knowledge or consent of users. In this thesis, I aim to tackle the cross-site tracking menace to make the web more secure and private. Specifically, I build novel privacy-enhancing systems using system instrumentation, machine learning, program analysis, and internet measure- ments techniques. At a high level, my research process involves: instrumenting web browsers to capture detailed execution of webpages, conducting rigorous measure- ments of privacy and security abuses, and using insights from the measurement stud- ies to build machine learning based approaches that counter the privacy and security abuses. In the first half of this thesis, I build privacy-enhancing systems that counter third party cross-site tracking by blocking both stateful tracking, commonly referred to as cookie based tracking, and stateless tracking, commonly referred to as browser fingerprinting. In the second half of this thesis, I build privacy-enhancing systems that counter retaliation by third party circumvention services that evade blocking, by vi either detecting and removing them or by stealthily concealing the traces of blocking. At the end of the thesis, I highlight the research contributions made by this thesis and some of the open research problems. vii TABLE OF CONTENTS LIST OF TABLES . xiii LIST OF FIGURES . xvii CHAPTER 1 INTRODUCTION . .1 1.1 Problem Statement . .1 1.2 Research Challenges . .1 1.3 Research Approach . .2 1.4 Research Contributions . .2 1.4.1 Content Blocking . .3 1.4.1.1 Stateful tracking (AdGraph [216], WebGraph [288], Khaleesi [217]) . .3 1.4.1.2 Stateless tracking (FP-Inspector [214]) . .5 1.4.2 Anti Circumvention . .6 1.4.2.1 Active blocking (AdWars [215]) . .6 1.4.2.2 Stealthy blocking (ShadowBlock [286]) . .7 1.4.2.3 Research Impact . .8 1.4.3 Thesis Organization . .9 2 ADGRAPH: A GRAPH-BASED APPROACH TO AD AND TRACKER BLOCKING . 10 2.1 Introduction . 10 2.2 Background and Related Work . 13 2.2.1 Problem Difficulty . 13 2.2.2 Existing Blocking Techniques . 14 2.2.3 JavaScript Attribution . 19 2.3 AdGraph Design . 23 2.3.1 Graph Representation . 24 2.3.2 Graph Construction . 27 2.3.3 Feature Extraction . 30 2.3.4 Classification . 32 2.4 AdGraph Evaluation . 34 2.4.1 Accuracy . 34 2.4.2 Disagreements Between AdGraph and Filter Lists . 38 2.4.3 Site Breakage . 43 2.4.4 Feature Analysis . 46 viii 2.4.5 Tradeoffs in Browser Instrumentation . 49 2.4.6 Performance . 50 2.5 Discussions . 52 2.5.1 Offline Application of AdGraph .............. 52 2.5.2 AdGraph Limitations And Future Improvements . 54 2.6 Conclusion . 56 3 WEBGRAPH: CAPTURING ADVERTISING AND TRACKING IN- FORMATION FLOWS FOR ROBUST BLOCKING . 57 3.1 Introduction . 57 3.2 Background & Related Work . 60 3.3 AdGraph Robustness . 64 3.3.1 Threat Model & Attack . 65 3.3.2 Results . 67 3.4 WebGraph ............................. 71 3.4.1 Design & Implementation . 74 3.4.1.1 Graph Construction . 74 3.4.1.2 Features . 79 3.4.2 Evaluation . 82 3.4.3 Efficiency . 84 3.5 WebGraph Robustness . 86 3.5.1 Content mutation attacks . 86 3.5.2 Structure mutation attacks . 87 3.5.3 Empirical evaluation . 88 3.5.3.1 Adversary’s success . 90 3.5.3.2 Impact of mutation choice . 95 3.5.3.3 Comparison with AdGraph............ 96 3.6 Limitations . 97 3.7 Conclusion . 99 3.8 Appendix . 100 3.8.1 Comparison between AdGraph and WebGraph features 100 3.8.2 Distribution of graph sizes . 100 3.8.3 Experimental run times . 100 3.8.4 Graph Mutation algorithm . 103 3.8.5 Mutations on a single web page. 107 3.8.6 WebGraph robustness experiments . 108 4 KHALEESI: BREAKER OF ADVERTISING & TRACKING REQUEST CHAINS . 112 4.1 Introduction . 112 4.2 Background & Related Work . 116 4.2.1 Background . 116 ix 4.2.2 Related Work . 119 4.3 Khaleesi .............................. 122 4.3.1 Motivation & Key Idea . 122 4.3.2 Request Chain Construction . 124 4.3.3 Feature Extraction . 126 4.3.4 Classification . 130 4.4 Evaluation . 131 4.4.1 Accuracy . 131 4.4.2 Robustness Analysis . 137 4.4.2.1 Classification accuracy over time . 137 4.4.2.2 Robustness against evasions . 138 4.5 Performance . 140 4.5.1 Breakage Analysis . 143 4.5.2 Performance . 145 4.6 Discussion . 146 4.6.1 Feature Analysis . 146 4.6.2 Request Chain Graph . 148 4.7 Concluding Remarks . 151 4.8 Appendix . 153 4.9 Features definitions . 153 4.9.1 Sequential features . 153 4.9.2 Response features .