Swedes Online: You Are More Tracked Than You Think by Joel Purra
Total Page:16
File Type:pdf, Size:1020Kb
Institutionen för datavetenskap Department of Computer and Information Science Final thesis Swedes Online: You Are More Tracked Than You Think by Joel Purra LIU-IDA/LITH-EX-A--15/007--SE 2015-02-19 Linköpings universitet Linköpings universitet SE-581 83 Linköping, Sweden 581 83 Linköping This page intentionally left blank. Almost. Linköping University Department of Computer and Information Science Final Thesis Swedes Online: You Are More Tracked Than You Think by Joel Purra LIU-IDA/LITH-EX-A--15/007--SE 2015-02-19 Supervisor: Patrik Wallström The Internet Infrastructure Foundation (.SE) Supervisor: Staffan Hagnell The Internet Infrastructure Foundation (.SE) Examiner: Niklas Carlsson Division for Database and Information Techniques (ADIT) This page intentionally left blank. Almost. Master’s thesis Swedes Online: You Are More Tracked Than You Think Joel Purra [email protected], joepu444 http://joelpurra.com/ +46 70 352 1212 v1.0.0 Abstract When you are browsing websites, third-party resources record your online habits; such tracking can be considered an invasion of privacy. It was previously unknown how many third-party resources, trackers and tracker companies are present in the different classes of websites chosen: globally popular websites, random samples of .se/.dk/.com/.net domains and curated lists of websites of public interest in Sweden. The in-browser HTTP/HTTPS traffic was recorded while downloading over 150,000 websites, allowing comparison of HTTPS adoption and third-party tracking within and across the different classes of websites. The data shows that known third-party resources including known trackers are present on over 90% of most classes, that third-party hosted content such as video, scripts and fonts make up a large portion of the known trackers seen on a typical website and that tracking is just as prevalent on secure as insecure sites. Observations include that Google is the most widespread tracker organization by far, that content is being served by known trackers may suggest that trackers are moving to providing services to the end user to avoid being blocked by privacy tools and ad blockers, and that the small difference in tracking between using HTTP and HTTPS connections may suggest that users are given a false sense of privacy when using HTTPS. Acknowledgments The Internet Infrastructure Foundation1 (.SE) .SE is also known as Stiftelsen f¨orinternetinfrastruktur (IIS). This thesis was written in the office of – and in collaboration with – .SE, who graciously supported me with domain data and internet knowledge. Part of .SE’s research efforts include continuously analyzing internet infrastructure and usage in Sweden. .SE is an independent organization, responsible for the Swedish top level domain, and working for the benefit of the public that promotes the positive development of the internet in Sweden. Thesis supervision Niklas Carlsson Associate Professor (Swedish: docent and universitetslektor) at Division for Database and Information Techniques (ADIT), Department of Computer and Information Science (IDA), Link¨oping University, Sweden. Thank you for being my examiner! Patrik Wallstr¨om Project Manager within R&D, .SE (The Internet Infrastructure Founda- tion), Sweden. Thank you for being my technical supervisor! Staffan Hagnell Head of New Businesses, .SE (The Internet Infrastructure Foundation), Swe- den. Thank you for being my company supervisor! Anton Nilsson Master student in Information Technology, Link¨opingUniversity, Sweden. Thank you for being my opponent! Domains, data and software .SE (Richard Isberg, Tobbe Carlsson, Anne-Marie Eklund-L¨owinder,Erika Lund), DK Hostmas- ter A/S (Steen Vincentz Jensen, Lise Fuhr), Reach50/Webmie (Mika Wenell, Jyry Suvilehto), Alexa, Verisign. Disconnect.me, Mozilla. PhantomJS, jq, GNU Parallel, LYX. Thank you! Tips, feedback, inspiration and help Dwight Hunter, Peter Forsman, Linus Nordberg, Pamela Davidsson, Lennart Bonnevier, Isabelle Edlund, Amar Andersson, Per-Ola Mj¨omark,Elisabeth Nilsson, Mats Dufberg, Ana Rodriguez Garcia, Stanley Greenstein, Markus Bylund. Thank you! And of course everyone I forgot to mention – sorry and thank you! 1https://www.iis.se/ 2 Contents 1 Introduction 11 2 Background 13 2.1 Trackers are a commercial choice . 13 2.2 What is known by trackers? . 14 2.3 What is the information used for? . 14 3 Methodology 16 3.1 High level overview . 16 3.2 Domain categories . 16 3.3 Capturing tracker requests . 17 3.4 Data collection . 19 3.5 Data analysis and validation . 19 3.6 High level summary of datasets . 21 3.7 Limitations . 22 4 Results 24 4.1 HTTP, HTTPS and redirects . 25 4.2 Internal and external requests . 27 4.3 Tracker detection . 28 5 Discussion 30 5.1 .SE Health Status comparison . 30 5.2 Cat and Mouse ..................................... 31 5.3 Follow the Money .................................... 32 5.4 Trackers which deliver content . 32 5.5 Automated data collection and analysis . 35 5.6 Tracking and the media . 36 5.7 Privacy tool reliability . 36 5.8 Open source contributions . 36 5.9 Platform for Privacy Preferences (P3P) analysis . 37 6 Related work 40 6.1 .SE Health Status .................................... 40 6.2 Characterizing Organizational use of Web-based Services .............. 41 6.3 Challenges in Measuring Online Advertising Systems ................ 41 6.4 .SE Domain Check . 42 6.5 Cookie syncing . 42 3 Contents Swedes Online: You Are More Tracked Than You Think 6.6 HTTP Archive . 42 7 Conclusions and future work 44 7.1 Conclusions . 44 7.2 Open questions . 44 7.3 Improving information sharing . 45 7.4 Improving domain retrieval . 46 7.5 Improved data analysis and accuracy . 48 Bibliography 52 A Methodology details 57 A.1 Domains . 57 A.2 Public suffix list . 58 A.3 Disconnect.me’s blocking list . 59 A.4 Retrieving websites and resources . 62 A.5 Analyzing resources . 68 B Software 74 B.1 Third-party tools . 74 B.2 Code . 75 C Detailed results 81 C.1 Differences between datasets . 81 C.2 Failed versus non-failed . 81 C.3 HTTP status codes . 84 C.4 Internal versus external resources . 88 C.5 Domain and request counts . 93 C.6 Requests per domain and ratios . 95 C.7 Insecure versus secure resources . 97 C.8 HTTP, HTTPS and redirects . 102 C.9 Content type group coverage . 106 C.10 Public suffix coverage . 110 C.11 Disconnect’s blocking list matches . 112 C.12 Undetected external domains . 132 4 List of Tables 3.1 Domain lists in use . 18 3.2 Disconnect’s categories . 18 3.3 Organizations in more than one category . 20 3.4 TLDs in dataset in use . 23 3.5 .SE Health Status domain categories . 23 4.1 Results summary . 25 5.1 .SE Health Status HTTPS coverage 2008-2013 . 33 5.2 Cat and Mouse ad coverage . 33 5.3 Cat and Mouse ad server match distribution . 34 5.4 Follow the Money aggregators . 34 5.5 Top P3P values . 39 A.1 Machine specifications . 67 A.2 Google Tag Manager versus Google Analytics and DoubleClick . 67 A.3 Output file size . 67 A.4 Dataset variations . 67 A.5 System load . 67 A.6 Mime-type grouping . 73 C.1 Dataset HAR failure rates . 83 C.2 Dataset origin HTTP response code/group coverage . 86 C.3 Internal versus external resources coverage . 90 C.4 Requests per domain . 95 C.5 Requests per domain and ratios . 97 C.6 Secure versus insecure resources coverage . 99 C.7 Origin domains with redirects . 104 C.8 Content type group coverage (internal) . 108 C.9 Content type group coverage (external) . 110 C.10 Public suffixes in external resources . 112 C.11 Disconnect coverage for internal, external and all requests . 114 C.12 Disconnect requests, organizations and categories counts and ratios . 118 C.13 Top Disconnect domain match coverage . 122 C.14 Top Disconnect Google domain match coverage . 124 C.15 Disconnect category match coverage . 126 C.16 Top Disconnect organization match coverage . 130 C.17 Requests per domain and ratios . 134 5 List of Figures 3.1 Domains per organization . 20 4.1 HTTP status, redirects, resource security . 26 4.2.