On Detecting Web-Tracking
Total Page:16
File Type:pdf, Size:1020Kb
INSTITUT FUR¨ INFORMATIK DER LUDWIG–MAXIMILIANS–UNIVERSITAT¨ MUNCHEN¨ Masterthesis On detecting Web-Tracking Thomas Muller¨ INSTITUT FUR¨ INFORMATIK DER LUDWIG–MAXIMILIANS–UNIVERSITAT¨ MUNCHEN¨ Masterthesis On detecting Web-Tracking Thomas Muller¨ Supervising professor: Prof. Dr. Dieter Kranzlmuller¨ Supervisors: Dr. Michael Schiffers Dr. Nils gentschen Felde Date of Submission: 13. July 2015 I assure the single handed composition of this master’s thesis only supported by declared resources. Hiermit versichere ich, dass ich die vorliegende Masterarbeit selbstandig¨ verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe. Munich, 13. July 2015 ............................................. (Signature of the Thesis Candidate) Web-Tracking is an incredible big business focusing on the collection of as many user data as possible and use it - among other scenarios - for hand-crafted advertising. Trackers are constantly seeking to improve their tracking mechanisms in order to be able to gather more user-based data. The most recent developments of this research are fingerprinting techniques: Browser- and Canvas-Fingerprinting. Fingerprinting techniques differ to the de facto standard for implementing Web-Tracking - Cookies - in several aspects. Firstly, it is very hard to notice that a web site is fingerprinting the user. Secondly, even if the user knows that he is being fingerprinted, it is very hard to effectively block it. Lastly, fingerprinting techniques enable the trackers to recognize users, which consequently leads to even more user-based data, because this enables them to track users among different web sites. Using the collected user data facilitates the trackers to create detailed user profiles, which clearly threatens the privacy of users. In order to increase privacy being endangered by Web- Tracking, the first step required is to be able to detect the use of fingerprinting techniques. This is the basic task this thesis sets out to solve. The detection part is realized using proven detection mechanisms from different fields like Data Mining or Knowledge Discovery in Databases. The results of the thesis show that Canvas- Fingerprinting is reliably detectable with each of those classifiers. The more general Browser-Fingerprinting is way harder to detect, but the best classifiers still managed to achieve a very high success rate. Consequently, it is safe to say that the developed system is capable of detecting the use of fingerprinting techniques on a web site and can therefore serve as a first step towards a system which can be used to increase privacy by mitigating fingerprinting techniques. Contents 1. Introduction 1 2. Fundamentals 3 2.1. Web-Tracking in General . 3 2.2. Cross-Domain-Tracking . 4 2.3. Fingerprinting Techniques . 5 2.3.1. Browser-Fingerprinting . 6 2.3.2. Canvas-Fingerprinting . 10 2.4. Detection Mechanisms . 13 2.4.1. Introduction . 13 2.4.2. Bayesian Classifiers . 15 2.4.3. Decision Trees . 17 2.4.4. ExtraTrees . 23 3. Related Work 24 3.1. Fingerprinting . 24 3.1.1. Browser-Fingerprinting . 24 3.1.2. Canvas-Fingerprinting . 27 3.2. Detecting Fingerprinting . 31 3.2.1. Chameleon . 31 3.2.2. CanvasBlocker . 31 3.3. Counter-Measures . 32 3.3.1. Browser-Fingerprinting . 32 3.3.2. Canvas-Fingerprinting . 34 4. Design 36 4.1. Deriving Components . 36 4.2. Architecture . 37 4.3. Sensors . 39 4.3.1. Defining Suitable Features . 39 4.4. Data Collection . 41 4.4.1. Database System . 41 4.4.2. Database Content . 42 4.5. Logic/Evaluation . 43 4.6. Communication . 44 4.7. Discussion . 45 4.7.1. Deployment Variants . 45 4.7.2. Security and Privacy . 46 5. Implementation and Evaluation 51 5.1. Implementation . 51 5.1.1. Sensors . 51 5.1.2. Data Collection . 59 5.1.3. Logic/Evaluation . 60 5.2. Evaluation and Methodology . 63 5.3. Measurements . 65 5.3.1. Na¨ıve Bayesian Classifier . 65 5.3.2. Decision Tree . 66 i 5.3.3. ExtraTrees . 68 5.3.4. Aggregator . 70 6. Discussion 72 6.1. Discussion and Interpretation . 72 6.1.1. Differences in the Training Data . 72 6.1.2. EqualSplit Results . 72 6.1.3. RandomSplit Results . 75 6.2. Room for Improvement . 78 6.2.1. Training Data . 78 6.2.2. Optimizing Classifiers . 78 6.2.3. Implementing New Classifiers . 78 6.2.4. Improve Sensors . 79 6.3. Summary . 79 7. Conclusion and Future Work 80 A. Training data 83 A.1. Training set . 83 A.2. Test set . 85 A.3. Complete training data . 86 B. Content of CD 88 C. Installation 89 C.1. Database setup . 89 C.2. Firevest . 89 C.3. Fireshorts . 89 List of Figures 2.1. Functionality of basic Web-Tracking. 4 2.2. Functionality of Cross-Domain-Tracking. 5 2.3. Exemplary result of the Panopticlick project. 8 2.4. The picture produced by the source code in Listing 2.4. 12 2.5. Example of a WebGL rendering. 13 2.6. The steps that build the basis of data classification. 14 2.7. Visualized Decision Tree after the first split. 21 2.8. Visualized Decision Tree after every split has been done. 21 3.1. Top 1.000.000 Alexa web sites using JavaScript-based font probing . 28 3.2. CanvasBlocker’s notification when user approval is requested before the <canvas> element can be used. 35 4.1. Flowchart describing the basic way of functioning of the proposed system. 37 4.2. Fundamental design of the proposed system. 38 4.3. Database scheme for the data collection layer. 42 4.4. Basic idea of a monolithic system. 45 4.5. Basic idea of a distributed system. 46 4.6. Basic concept of security tokens. 48 5.1. AST representation of JavaScript code. 52 5.2. Database scheme used in the prototype implementation. 60 5.3. Decision Tree visualization for Equal-Split with all features. 67 5.4. Decision Tree visualization for Equal-Split without Canvas-Fingerprinting-related features. 69 6.1. Histogram for ExtraTrees with RandomSplit and all features. 77 6.2. Histogram for ExtraTrees with RandomSplit without Canvas-Fingerprinting-related features. 77 C.1. Firevest settings. ..