WEB TRACKER SCANNER

Bachelor's thesis



James Nolan, Bachelor candidate, HEIG-VD

Prof. Nastaran Fatemi, Ph.D., Supervising Professor, HEIG-VD

Félicien Fleury, Dipl.-Ing., Industrial partner representative, NGSENS SARL



University of Applied Sciences and Arts Western Switzerland, School of Engineering and Management Vaud, Yverdon-les-Bains, Switzerland

NGSENS SARL Geneva, Switzerland



July 29, 2016

Abstract

Technology users often don't know which organizations are present in the pages they visit and in the connected devices they carry, nor that these organizations may track them across multiple websites and devices.

To address this issue, this thesis presents methods to develop a tracker scanner, suggests strategies to aggregate trackers in a meaningful way and lays the groundwork for a scalable, cloud-based platform that allows users to automatically extract trackers from web pages.

We analyzed Alexa's list of Switzerland's 500 most visited websites, where a significant number of third party organizations were found. These results encourage pursuing the implementation of the platform so that all users, privacy researchers and website owners can benefit from an automated, on-demand custom analysis.

Contents

1 Introduction 5

1.1 Bachelor's thesis...... 5

1.2 Organization of the report...... 5

1.3 Context...... 6

1.4 Problem...... 7

1.5 Objectives...... 7

2 Web trackers 8

2.1 Definitions...... 8

2.1.1 Trackers...... 8

2.1.2 Parties...... 8

2.2 Tracking methods...... 10

2.2.1 IP address...... 10

2.2.2 Cookies and the HTML5 Web Storage API...... 11

2.2.3 HTML5 canvas and Web Audio API...... 12

2.2.4 Conclusion...... 12

3 Web tracker analyzer 14

3.1 Approaches...... 14


3.1.1 Browser extensions...... 15

3.1.1.1 Existing browser extensions...... 15

3.1.1.2 Custom browser extension...... 16

3.1.2 Black box model...... 16

3.1.2.1 Description...... 16

3.1.2.2 Traffic interception...... 18

3.1.2.2.1 Tcpdump...... 18

3.1.2.2.2 Tcpflow...... 19

3.1.2.2.3 Wireshark and tshark...... 20

3.1.2.2.4 Mitmproxy and mitmdump...... 21

3.1.2.2.5 Conclusion...... 21

3.1.2.3 Secure connections...... 21

3.1.2.4 Browser automation...... 22

3.1.2.4.1 Headless browsers...... 24

3.1.2.4.2 Full-featured browsers...... 24

3.1.2.5 Containerization and parallelization...... 25

3.1.2.5.1 Docker...... 25

3.1.2.5.2 Network namespaces...... 26

3.2 Test dataset...... 26

3.2.1 Switzerland country code top-level domain...... 26

3.2.2 Swiss IP blocks...... 27

3.2.3 Alexa ranking...... 27

3.2.4 Conclusion...... 28

3.3 Grouping methods...... 28


3.3.1 Domain parsing...... 28

3.3.2 Organization mapping...... 30

3.3.2.1 Whois records...... 30

3.3.2.2 Name servers...... 32

3.3.2.3 AS numbers...... 32

3.3.3 Location mapping...... 33

4 Data visualization 34

4.1 Features...... 34

4.2 User interface...... 35

4.2.1 Graphic charter...... 35

4.2.1.1 Colors...... 35

4.2.1.2 Fonts...... 36

4.2.2 Templates...... 36

4.2.3 Mock-ups...... 37

4.3 Infrastructure architecture...... 42

4.3.1 Requirements...... 42

4.3.2 Providers...... 42

4.3.3 Diagrams...... 43

4.4 T4WT Portal components...... 44

4.4.1 Web framework...... 46

4.4.2 Cloud scan poster...... 49

4.4.3 Cloud scan getter...... 50

4.4.4 Visualization app...... 51

4.4.4.1 Specifications...... 51


4.4.4.2 Mock-ups...... 52

4.4.4.3 Tooling suite...... 58

4.4.4.3.1 JavaScript view library...... 58

4.4.4.3.2 JavaScript map library...... 59

4.4.4.3.3 JavaScript chart library...... 60

4.4.4.3.4 JavaScript module bundler...... 60

4.5 T4WT Cloud API...... 61

4.5.1 RESTful API server...... 61

4.5.2 Queue manager...... 62

4.5.3 Scanner...... 62

5 Conclusion 64

5.1 Technical...... 64

5.2 Personal...... 65

6 Future directions 67

Appendices 71

A Scripts 72

A.1 scan_v2.sh...... 72

B Logbook 74

Chapter 1

Introduction

1.1 Bachelor's thesis

This document is a report on James Nolan's bachelor's thesis, supervised by Professor Nastaran Fatemi at the School of Engineering and Management Vaud (HEIG-VD).

This thesis is a collaboration with NGSENS SARL, a Geneva-based company specialized in security and information systems.

The first part of the thesis was carried out part-time (two days a week) over a period of 16 weeks during the spring semester of 2016. The second part was carried out full-time over a period of 6 weeks, from mid-June to the end of July 2016.

1.2 Organization of the report

This report is divided into four parts:

• Chapter 1 explains the problem that we identified and sets the goals we hope to reach.

• Chapter 2 gives key definitions of trackers and third party organizations, along with an overview of tracking methods.

• Chapter 3 discusses approaches to develop a tracker scanner, along with solutions to their technical challenges.

• Chapter 4 details the architecture and the components of a cloud-based platform for users, privacy researchers and website owners to extract trackers from web pages.

1.3 Context

Today, most web pages require some resources such as fonts, ads or analytics scripts to be loaded from the servers of third party organizations. A third party organization that serves resources in multiple web pages can track individual end users by logging their requests at different points of their navigation. This tracking history can be used to deduce information about end users for the purpose of targeted advertising, price discrimination or even surveillance.

Moreover, the same third party organizations may also serve content on connected devices, thus allowing them to track users even further in their everyday life.

In 2012, the European Network and Information Security Agency (ENISA) published a report describing privacy considerations in the field of online behavioral tracking. They pointed out that users are being increasingly tracked and that there is a gap between the legal regulations around personal data protection and the real-life practices in web tracking.

ENISA presents the following recommendations to address these concerns [1]:

• Mobile environment. Encourage the development of anti-tracking solutions for mobile applications, which are neglected compared to desktop applications. For now, the only way to disable tracking from applications is to uninstall them.

• Easy-to-use tools for transparency and control. Encourage the development of methods that help users be aware of how their personal data is collected, managed and transferred.

• Enforcement solutions. A technical framework should be established to easily monitor and block actors that don't comply with personal data protection regulations.

• Promotion of privacy-by-design. In addition to encouraging developers to use Privacy Enhancing Technologies (PETs) such as anonymizers, identity management systems and encryption mechanisms, privacy should be incorporated into systems at the early stages of their conception.

1.4 Problem

End users often don't know that third party organizations take part in loading the pages they visit, nor that these organizations can track them across multiple pages, websites and devices.

1.5 Objectives

Our main goal is to contribute to ENISA's second recommendation by developing a web platform that helps users be aware of how trackers can follow them across different web pages, websites and devices.

To do so, we split our work into three phases:

• Develop a web tracker scanner.

• Find a way to group trackers by relevant properties: name, physical locations and owners.

• Develop a web platform that allows users to automatically extract trackers from web pages and compare results with other datasets.

Chapter 2

Web trackers

2.1 Definitions

2.1.1 Trackers

In software engineering, a tracker is a component embedded in a product that regularly sends data to a remote server. The data transferred by a tracker is often not fundamental for the product to run. It is usually information specific to the user's environment and behavior that can be analyzed to profile his personality and habits. Users' profiles are valuable information for product owners. They have multiple uses, from fraud detection to marketing and targeted advertising. Every device that is able to communicate over the internet potentially runs trackers, from computers to smartphones and even some televisions, cars, refrigerators and light bulbs.

Sometimes, the company that collects information about the user is different from the product owner. This is a common scenario on web pages, because they often load resources from third party servers, as discussed in the next section.

It must also be noted that the term tracker often refers to the company that collects the user's information.
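To make this concrete, the following minimal Python sketch (with made-up identifiers and URLs; it is an illustration, not a component of the platform described later) shows how a third party that serves resources on many websites could reconstruct a per-user browsing history from its own access logs, assuming every request carries a stable identifier such as a cookie value and a Referer header.

from collections import defaultdict

# Hypothetical access-log entries as seen by a third-party server that delivers
# resources to many different websites.
# Each entry: (user identifier, referring page, timestamp).
access_log = [
    ("uid-42", "https://news.example/article", "2016-06-01T08:00:00"),
    ("uid-42", "https://shop.example/cart", "2016-06-01T08:05:00"),
    ("uid-77", "https://blog.example/post", "2016-06-01T08:07:00"),
    ("uid-42", "https://travel.example/booking", "2016-06-01T08:12:00"),
]

def build_profiles(log):
    """Group, per identifier, the pages on which that identifier was observed,
    in chronological order: a minimal browsing history per user."""
    profiles = defaultdict(list)
    for uid, page, timestamp in sorted(log, key=lambda entry: entry[2]):
        profiles[uid].append(page)
    return profiles

if __name__ == "__main__":
    for uid, pages in build_profiles(access_log).items():
        print(uid, "visited:", ", ".join(pages))

The richer the logged request metadata (user agent, IP address, cookies), the more reliable such grouping becomes, which is why the tracking methods of Section 2.2 matter.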

2.1.2 Parties

When a user wants to visit a web page, like http://www.example.com/page.html, his computer has to establish a connection to the server that hosts the content of www.example.com. To do so, it asks the DNS (Domain Name System), a global directory of all domain names, for the IP address that is in charge of delivering the content of this specific website. It then sends an HTTP request to that IP address to retrieve the content of the page.html document.
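As a minimal sketch of this lookup-then-fetch sequence (using only the Python standard library and the placeholder URL from the example above), the snippet below resolves the host through the DNS and then issues an HTTP GET request for the document:

import socket
from urllib.parse import urlparse
from urllib.request import urlopen

# Placeholder URL taken from the example above; substitute a real page to run the sketch.
PAGE_URL = "http://www.example.com/page.html"
host = urlparse(PAGE_URL).hostname

# Step 1: ask the DNS for the IP address in charge of delivering the content of this host.
ip_address = socket.gethostbyname(host)
print(f"{host} resolves to {ip_address}")

# Step 2: send an HTTP request to that host to retrieve the content of the page.html document.
response = urlopen(PAGE_URL)
html = response.read().decode("utf-8", errors="replace")
print(f"Received an HTML document of {len(html)} characters")

A bare HTTP client like this only sees the initial document; the approaches discussed in Chapter 3 rely on browser automation precisely because many third party resources are only requested once the page is actually rendered.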

The web server replies with the HTML structure of the page.html document. A web page usually requires additional resources (stylesheets, scripts, images, etc.) to be rendered as intended by the web developer. These referenced resources are loaded through HTML tags (