Android app tracking
Investigating the feasibility of tracking user behavior on mobile phones by analyzing encrypted network traffic

by

W.K. Meijer

to obtain the degree of Master of Science at the Delft University of Technology (Technische Universiteit Delft), to be defended publicly on 17 December 2019.

Student number: 4224426
Project duration: 19 November 2018 – 17 December 2019
Thesis committee: Dr. C. Doerr, TU Delft, supervisor
Dr. ir. J.C.A. van der Lubbe, TU Delft, chair
Dr. F. A. Oliehoek, TU Delft

An electronic version of this thesis is available at http://repository.tudelft.nl/.

Abstract

The mobile phone has become an important part of people's lives, and which apps someone uses says a lot about that person. Even though the data itself is encrypted, the metadata of network traffic leaks private information about which apps are being used on mobile devices. Apps can be detected in network traffic using their network fingerprint, which describes what a typical connection of the app looks like. In this work, we investigate whether fingerprinting apps is feasible in the real world. We collected automatically generated data from various versions of around 500 apps and real-world data from over 65 unique users. We learn the fingerprints of the apps by training a Random Forest on the collected data, and use this Random Forest to detect app fingerprints in network traffic. We show that it is possible to build a model that can classify a specific subset of apps in network traffic. We also show that it is very hard to build a complete model that can classify the traffic of all possible apps, due to overlapping fingerprints. Updates to apps have a significant effect on the network fingerprint, such that models should be updated every one or two months. We show that by selecting only a subset of apps it is possible to successfully classify network traffic. Various countermeasures against network traffic analysis are investigated. We show that using a VPN is not an effective countermeasure, because an effective classifier can be trained on VPN data. We conclude that fingerprinting in the real world is feasible, but only on specific sets of apps.
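The workflow summarized in the abstract (collect traffic, summarize it into features, train a Random Forest, classify new traffic) can be illustrated with a minimal sketch. The snippet below is not taken from the thesis: it assumes scikit-learn and uses randomly generated placeholder features and labels in place of the real per-burst features and app labels, which are described in Chapters 3 and 4.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: each row summarizes one traffic burst (e.g. packet-size
# and inter-arrival-time statistics); each label identifies an app.
n_samples, n_features, n_apps = 1000, 12, 5
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, n_apps, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Learn the per-app "fingerprints" by fitting a Random Forest on the features.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Classify unseen bursts and report a macro-averaged F1-score over the apps.
pred = clf.predict(X_test)
print("macro F1: %.3f" % f1_score(y_test, pred, average="macro"))

With random labels the score hovers around chance; the sketch only shows the shape of the train-and-classify pipeline, not the feature set or the results reported later in this work.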
Preface

When I started this research at the end of November 2018, I never expected to write so many pages. But to you, the reader, you'll just have to trust me when I say: reading them is worth it. At first, the plan was to finish somewhere at the end of the summer, but I am glad that I took the time to work a few extra months on this thesis. In the past months so much more was added, completing the story, and I am happy to present the work that lies before you.

Although my name is on the front page, this work would not have been possible without others. Thanks to dr. Christian Doerr and ir. Harm Griffioen for their daily guidance, fruitful brainstorm sessions, and technical support. Thanks for all the talks we had, about my thesis and the cyber security world in general. I also want to thank some others who have contributed positively to my day-to-day life at the TU Delft. Thanks to everyone on the 6th floor for studying, drinking coffee, and lunching together. You made working on my thesis every day much more enjoyable. Rico, thank you for having the same pace in your research as I had, and thank you for letting me defend earlier than you. Thanks to Sandra, for always trying to improve our lives: giving us coffee, getting the sun blinds installed, and helping with submitting the correct forms.

And finally, to give credit where credit is due: the icons used in the figures are made by Eucalyp, DinosoftLabs, Freepik, Good Ware, Google, Kiranshastry, prettycons, Smashicons, and srip from www.flaticon.com.

W.K. Meijer
Delft, December 2019

List of Figures

1.1 Overview of attackers 2
2.1 Example decision tree 7
3.1 Network traffic analysis pipeline 9
4.1 Data collection on Android emulator 18
4.2 Overview of how features are extracted 22
4.3 Relation size and amount of classes of classifier 24
4.4 Network traffic collection overview 26
4.5 Network traffic processing overview 27
4.6 Example extrapolation AD test 31
5.1 Timeline collection apps and traces 34
5.2 Distribution of source ports 38
5.3 Possible decision tree IP address as feature 39
5.4 Results version experiment various classifiers 41
5.5 Clustering 10 well performing classes 43
5.6 Clustering 10 poor performing classes 43
5.7 Correlation sample size of classes and F1-score 45
5.8 Comparison feature distributions for various app versions 46
5.9 Histogram correlation feature similarity F1-score 49
6.1 Split of real-world data 54
6.2 Metrics scores for various confidence thresholds 56
6.3 Histogram of number of samples per app 56
6.4 Metrics scores for various confidence thresholds 58
6.5 Correlation F1-scores and number of samples and users 60
6.6 Various examples of feature distributions 64
7.1 VPN experiment setup 68
B.1 Data collection app start screen 90
B.2 Data collection app VPN notification 91
B.3 Data collection app running notification 91

List of Tables

3.1 Features used in related work 11
3.2 Classifiers used in related work 13
3.3 Countermeasures used by Dyer et al. [19] 15
4.1 Distribution of apps selected from Play Store top 10 000 17
4.2 New versions of apps downloaded 18
4.3 Example output of /proc/net/udp 19
4.4 Fraction of bursts without label 21
4.5 List of features 21
4.6 Details of machines used 23
4.7 Preliminary performance of various classifiers 24
4.8 Classifier size on disk 25
4.9 Collected device information 28
5.1 Baseline performance version experiment basic feature set 34
5.2 Results version experiment basic feature set 35
5.3 Results version experiment only new versions basic feature set 35
5.4 Results version experiment only new versions basic feature set and smaller training set 36
5.5 Baseline performance version experiment basic feature set and smaller training sets 37
5.6 Relative difference baseline and result version experiment with only new versions 37
5.7 List of apps with unconventional destination port 37
5.8 Baseline performance version experiment IP address continuous 38
5.9 Results version experiment IP address continuous 39
5.10 Results version experiment IP address categorical 40
5.11 Results performance version experiment IP address categorical per class classifier 40
5.12 Results version experiment Domain Names 40
5.13 Top 10 important features 42
5.14 Difference in performance for various app sets 44
5.15 Classifying selected apps in a larger dataset 45
5.16 F1-scores for com.rovio.ABstellapop over time 46
5.17 Similarity feature distributions for com.rovio.ABstellapop 47
5.18 Feature distribution drift for com.rovio.ABstellapop 48
5.19 Similarity feature distributions for all apps 48
6.1 Android versions from participating devices 51
6.2 Characteristics of crowdsourced workers 52
6.3 Baseline performance real-world dataset 53
6.4 Apps with unusual ports in real-world dataset 53
6.5 Results from training and testing on different parts of real-world dataset 54
6.6 Performance of ten of the most common apps 55
6.7 Baseline performance emulator and real-world datasets 57
6.8 Results training on emulator data, testing on real-world data 57
6.9 Difference in performance between training on emulator data and real-world data 59
6.10 Comparison between feature distributions of the unique users 61
6.11 Metadata from bursts of network traffic 63
7.1 VPN servers details 68
7.2 Baseline performance VPN datasets 69
7.3 Results VPN as a countermeasure, non-encapsulated 70
7.4 Effect change in RTT on metrics 70
7.5 Results VPN as countermeasure, VPN encapsulated 71
7.6 Results VPN as countermeasure, time features only 72
7.7 Results VPN experiment, size features only 72
7.8 Similarities of feature distribution VPN and non-VPN data 73
7.9 Results VPN as countermeasure when training on VPN data 73
7.10 Feature importance VPN classifiers 74
7.11 Results padding schemes as countermeasures 75
7.12 Results padding schemes as countermeasures, train on countermeasure data 76
7.13 Feature distribution similarity for various padding schemes 76
7.14 Feature importance classifiers trained on countermeasure data 77
7.15 Results train on different countermeasure data 77
7.16 Results train on countermeasure data, basic feature set 78
C.1 Classes used in t-SNE figures 93
C.2 List of the 20 most common apps in the collected real-world data 94
C.3 List of closest neighbor to various apps 94
C.4 Baseline performance VPN datasets time features only 95
C.5 Baseline performance VPN datasets size features only 95

Contents

Abstract iii
List of Figures vii
List of Tables ix
1 Introduction 1
1.1 Network traffic analysis