WEB TRACKER SCANNER

Bachelor's thesis



James Nolan, Bachelor candidate, HEIG-VD

Prof. Nastaran Fatemi, Ph.D., Supervising Professor, HEIG-VD

Félicien Fleury, Dipl.-Ing., Industrial partner representative, NGSENS SARL



University of Applied Sciences and Arts Western Switzerland, School of Engineering and Management Vaud, Yverdon-les-Bains, Switzerland

NGSENS SARL Geneva, Switzerland



July 29, 2016

Abstract

Technology users often don't know which organizations are present in the pages they visit and in the connected devices they carry, nor that these organizations may track them across multiple websites and devices.

To address this issue, this thesis presents methods to develop a tracker scanner, suggests strategies to aggregate trackers in a meaningful way and lays the groundwork for a scalable, cloud-based platform that allows users to automatically extract trackers from web pages.

We analyzed Alexa's list of Switzerland's 500 most visited websites, where a significant number of third party organizations were found. These results encourage pursuing the implementation of the platform so that all users, privacy researchers and website owners can benefit from an automated, on-demand custom analysis.

Contents

1 Introduction 5

1.1 Bachelor's thesis...... 5

1.2 Organization of the report...... 5

1.3 Context...... 6

1.4 Problem...... 7

1.5 Objectives...... 7

2 Web trackers 8

2.1 Definitions...... 8

2.1.1 Trackers...... 8

2.1.2 Parties...... 8

2.2 Tracking methods...... 10

2.2.1 IP address...... 10

2.2.2 Cookies and the HTML5 Web Storage API...... 11

2.2.3 HTML5 canvas and Web Audio API...... 12

2.2.4 Conclusion...... 12

3 Web tracker analyzer 14

3.1 Approaches...... 14


3.1.1 Browser extensions...... 15

3.1.1.1 Existing browser extensions...... 15

3.1.1.2 Custom browser extension...... 16

3.1.2 Black box model...... 16

3.1.2.1 Description...... 16

3.1.2.2 Traffic interception...... 18

3.1.2.2.1 Tcpdump...... 18

3.1.2.2.2 Tcpflow...... 19

3.1.2.2.3 Wireshark and tshark...... 20

3.1.2.2.4 Mitmproxy and mitmdump...... 21

3.1.2.2.5 Conclusion...... 21

3.1.2.3 Secure connections...... 21

3.1.2.4 Browser automation...... 22

3.1.2.4.1 Headless browsers...... 24

3.1.2.4.2 Full-featured browsers...... 24

3.1.2.5 Containerization and parallelization...... 25

3.1.2.5.1 Docker...... 25

3.1.2.5.2 Network namespaces...... 26

3.2 Test dataset...... 26

3.2.1 Switzerland country code top-level domain...... 26

3.2.2 Swiss IP blocks...... 27

3.2.3 Alexa ranking...... 27

3.2.4 Conclusion...... 28

3.3 Grouping methods...... 28


3.3.1 Domain parsing...... 28

3.3.2 Organization mapping...... 30

3.3.2.1 Whois records...... 30

3.3.2.2 Name servers...... 32

3.3.2.3 AS numbers...... 32

3.3.3 Location mapping...... 33

4 Data visualization 34

4.1 Features...... 34

4.2 User interface...... 35

4.2.1 Graphic charter...... 35

4.2.1.1 Colors...... 35

4.2.1.2 Fonts...... 36

4.2.2 Templates...... 36

4.2.3 Mock-ups...... 37

4.3 Infrastructure architecture...... 42

4.3.1 Requirements...... 42

4.3.2 Providers...... 42

4.3.3 Diagrams...... 43

4.4 T4WT Portal components...... 44

4.4.1 Web framework...... 46

4.4.2 Cloud scan poster...... 49

4.4.3 Cloud scan getter...... 50

4.4.4 Visualization app...... 51

4.4.4.1 Specifications...... 51


4.4.4.2 Mock-ups...... 52

4.4.4.3 Tooling suite...... 58

4.4.4.3.1 JavaScript view library...... 58

4.4.4.3.2 JavaScript map library...... 59

4.4.4.3.3 JavaScript chart library...... 60

4.4.4.3.4 JavaScript module bundler...... 60

4.5 T4WT Cloud API...... 61

4.5.1 RESTful API server...... 61

4.5.2 Queue manager...... 62

4.5.3 Scanner...... 62

5 Conclusion 64

5.1 Technical...... 64

5.2 Personal...... 65

6 Future directions 67

Appendices 71

A Scripts 72

A.1 scan_v2.sh...... 72

B Logbook 74

Chapter 1

Introduction

1.1 Bachelor's thesis

This document is a report on James Nolan's bachelor's thesis, supervised by Professor Nastaran Fatemi at the School of Engineering and Management Vaud (HEIG-VD).

This thesis is a collaboration with NGSENS SARL, a Geneva-based company specialized in security and information systems.

The first part of the thesis was carried out part-time (two days a week) over a period of 16 weeks during the spring semester of 2016. The second part was carried out full-time over a period of 6 weeks, from mid-June to the end of July 2016.

1.2 Organization of the report

This report is divided into four parts:

• Chapter 1 explains the problem that we identified and sets the goals we hope to reach.

• Chapter 2 gives key definitions of trackers and third party organizations, along with an overview of tracking methods.

• Chapter 3 discusses approaches to develop a tracker scanner, along with solutions to their technical challenges.

• Chapter 4 details the architecture and the components of a cloud-based platform for users, privacy researchers and website owners to extract trackers from web pages.

1.3 Context

Today, most web pages require some resources such as fonts, ads or analytics scripts to be loaded from the servers of third party organizations. A third party organization that serves resources in multiple web pages can track individual end users by logging their requests at different points of their navigation. This tracking history can be used to deduce information about end users for the purpose of targeted advertising, price discrimination or even surveillance.

Moreover, the same third party organizations may also serve content on connected devices, thus allowing them to track users even further in their everyday life.

In 2012, the European Network and Information Security Agency (ENISA) published a report describing privacy considerations in the field of online behavioral tracking. They pointed out that users are being increasingly tracked and that there is a gap between the legal regulations around personal data protection and the real-life practices in web tracking.

ENISA presents the following recommendations to address these concerns [1]:

• Mobile environment. Encourage the development of anti-tracking solutions for mobile applications, which are neglected compared to desktop applications. For now, the only way to disable tracking from applications is to uninstall them.

• Easy-to-use tools for transparency and control. Encourage the development of methods that help users be aware of how their personal data is collected, managed and transferred.

• Enforcement solutions. A technical framework should be established to easily monitor and block actors that don't comply with personal data protection regulations.

• Promotion of privacy-by-design. In addition to encouraging developers to use Privacy Enhancing Technologies (PETs) such as anonymizers, identity management systems and encryption mechanisms, privacy should be incorporated into systems at the early stages of their conception.

1.4 Problem

End users often don't know that third party organizations take part in loading the pages they visit, nor that these organizations can track them across multiple pages, websites and devices.

1.5 Objectives

Our main goal is to contribute to ENISA's second recommendation by developing a web platform that helps users be aware of how trackers can follow them across different web pages, websites and devices.

To do so, we split our work into three phases:

• Develop a web tracker scanner.

• Find a way to group trackers by relevant properties: name, physical locations and owners.

• Develop a web platform that allows users to automatically extract trackers from web pages and compare results with other datasets.

Chapter 2

Web trackers

2.1 Definitions

2.1.1 Trackers

In software engineering, a tracker is a component embedded in a product that regularly sends data to a remote server. The data transferred by a tracker is often not fundamental for the product to run. It is usually information specific to the user's environment and behavior that can be analyzed to profile his personality and habits. Users' profiles are valuable information for product owners. They have multiple uses, from fraud detection to marketing and targeted advertising. Every device that is able to communicate over the internet potentially runs trackers, from computers to smartphones and even some televisions, cars, refrigerators and light bulbs.

Sometimes, the company that collects information about the user is different from the product owner. This is a common scenario on web pages, because they often load resources from third party servers, as discussed in the next section.

It must also be noted that the term tracker often refers to the company that collects the user's information.
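To make this concrete, the following minimal Python sketch (with made-up identifiers and URLs; it is an illustration, not a component of the platform described later) shows how a third party that serves resources on many websites could reconstruct a per-user browsing history from its own access logs, assuming every request carries a stable identifier such as a cookie value and a Referer header.

from collections import defaultdict

# Hypothetical access-log entries as seen by a third-party server that delivers
# resources to many different websites.
# Each entry: (user identifier, referring page, timestamp).
access_log = [
    ("uid-42", "https://news.example/article", "2016-06-01T08:00:00"),
    ("uid-42", "https://shop.example/cart", "2016-06-01T08:05:00"),
    ("uid-77", "https://blog.example/post", "2016-06-01T08:07:00"),
    ("uid-42", "https://travel.example/booking", "2016-06-01T08:12:00"),
]

def build_profiles(log):
    """Group, per identifier, the pages on which that identifier was observed,
    in chronological order: a minimal browsing history per user."""
    profiles = defaultdict(list)
    for uid, page, timestamp in sorted(log, key=lambda entry: entry[2]):
        profiles[uid].append(page)
    return profiles

if __name__ == "__main__":
    for uid, pages in build_profiles(access_log).items():
        print(uid, "visited:", ", ".join(pages))

The richer the logged request metadata (user agent, IP address, cookies), the more reliable such grouping becomes, which is why the tracking methods of Section 2.2 matter.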

2.1.2 Parties

When a user wants to visit a web page, like http://www.example.com/page.html, his computer has to establish a connection to the server that hosts the content of www.example.com. To do so, it asks the DNS (Domain Name System), a global directory of all domain names, for the IP address that is in charge of delivering the content of this specific website. It then sends an HTTP request to that IP address to retrieve the content of the page.html document.
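As a minimal sketch of this lookup-then-fetch sequence (using only the Python standard library and the placeholder URL from the example above), the snippet below resolves the host through the DNS and then issues an HTTP GET request for the document:

import socket
from urllib.parse import urlparse
from urllib.request import urlopen

# Placeholder URL taken from the example above; substitute a real page to run the sketch.
PAGE_URL = "http://www.example.com/page.html"
host = urlparse(PAGE_URL).hostname

# Step 1: ask the DNS for the IP address in charge of delivering the content of this host.
ip_address = socket.gethostbyname(host)
print(f"{host} resolves to {ip_address}")

# Step 2: send an HTTP request to that host to retrieve the content of the page.html document.
response = urlopen(PAGE_URL)
html = response.read().decode("utf-8", errors="replace")
print(f"Received an HTML document of {len(html)} characters")

A bare HTTP client like this only sees the initial document; the approaches discussed in Chapter 3 rely on browser automation precisely because many third party resources are only requested once the page is actually rendered.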

The web server replies with the HTML structure of the page.html document. A web page usually requires additional resources (stylesheets, scripts, images, etc.) to be rendered as intended by the web developer. These referenced resources are loaded through HTML tags (