Fakultät für Elektrotechnik und Informationstechnik

On the Feasibility and Impact of Digital Fingerprinting for System Recognition

Dissertation

Thomas Hupperich

01. Februar 2017


zur Erlangung des Grades eines Doktor-Ingenieurs der Fakultät für Elektrotechnik und Informationstechnik an der Ruhr-Universität Bochum

vorgelegt von

Thomas Hupperich aus Wermelskirchen

Bochum, 01. Februar 2017

Gutachter: Prof. Dr. Thorsten Holz, Ruhr-Universität Bochum

Zweitgutachter: Prof. Dr. Felix Freiling, Friedrich-Alexander-Universität Erlangen-Nürnberg

Tag der mündlichen Prüfung: 30. Juni 2017

“The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents. We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant that we should voyage far. The sciences, each straining in its own direction, have hitherto harmed us little; but some day the piecing together of dissociated knowledge will open such terrifying vistas of reality, and of our frightful position therein, that we shall either go mad from the revelation or flee from the deadly light into the peace and safety of a new dark age.”

— H. P. Lovecraft, The Call of Cthulhu

Acknowledgement

First and foremost, I thank my supervisor Prof. Dr. Thorsten Holz for guiding my explorations through the infinite black seas of research and for all the enlightening scientific insights. I am grateful for the freedom to research topics of my interest and for the lively discussions about them. Leaving the island of ignorance and broadening the horizon requires exploration. Exploring something new, however, means to voyage and to give up at least a part of placidness. I express my gratitude to all my colleagues and contributors for the inspiring collaborations and for making my voyage an unforgettable time. Special thanks go to Jannik Pewny for discussing absurd and brilliant ideas, to Henry Hosseini and Nicolai Wilkop for supporting my research countless times, and to Mark Kührer, Katharina Krombholz, Davide Maiorca, Christian Rossow, Sebastian Uellenbeck, and Katharina Kohls for joyful and productive teamwork. Finally, I thank my family and friends for their support and for helping me not to go mad.

Abstract

In recent years, obtaining data about users and their online devices to track their activities and learn about their interests has become standard practice. While browser cookies have been the state of the art for user tracking for a long time, a new technique has evolved and is increasingly applied in practice: digital fingerprinting. In contrast to cookies, fingerprinting a digital system reveals information about the system itself; this technique makes it possible to learn a system’s software or hardware configuration. Still, it remained unclear to what extent fingerprinting can be utilized and what risks and opportunities its use entails. Past research has mostly addressed fingerprinting a system’s browser—so-called browser fingerprinting. Transferring such methods to fingerprinting whole digital systems may be achieved in different ways and has to be fitted to the exact scenario, including the types of systems (e. g., mobile devices), the data available (e. g., browser attributes or hardware measurements), and the overall goal—for instance, recognizing a single device among others. Hence, differentiated approaches and the determination of suitable methods are required to make fingerprinting feasible. This thesis emphasizes two key aspects: First, we explore the feasibility of fingerprinting in various scenarios, e. g., browser fingerprinting in a Web context or system-based hardware fingerprinting, to investigate which goals can be achieved by this technique. Second, we investigate the impact of fingerprinting, including the risks for user privacy and the chances to enhance existing security mechanisms. In this work, we shift from fingerprinting only a browser to fingerprinting complete systems by determining the general feasibility of fingerprinting mobile devices, like smartphones and tablets, as well as the possibilities to elude fingerprinting methods.
Fingerprinting mobile devices is deemed hard because, in contrast to desktop computers or browsers, these devices are highly standardized, and, as a general rule, fingerprinting is less effective in a uniform group of devices. While some methods of browser fingerprinting can be instrumented for fingerprinting mobile devices as well, new approaches can also be developed, since modern devices are more complex systems and, for instance, include hardware sensors. Hence, we also examine the feasibility of fingerprinting system hardware, e. g., a mobile device’s sensors, and investigate whether it is possible to recognize devices just by the hardware imperfections of these sensors. We also study whether differences in digital fingerprints may trigger online marketing policies. Over the past years, media articles have claimed that users of specific computer systems experience price differentiation. Since such behavior may be based on digital system fingerprinting, we shed light on the impact of various fingerprints on online pricing. Finally, we take fingerprinting beyond digital systems and analyze whether writing styles may also be fingerprinted effectively, as an example of the transferability of fingerprinting methods to other research areas. Our findings reveal both the opportunities of digital fingerprinting and its limits. We present various applications for this new technique, investigating its power, risks, and chances.

Kurzfassung

In den letzten Jahren ist es gängige Praxis geworden, Daten über Benutzer und deren Internetgeräte zu sammeln, um ihre Aktivitäten zu verfolgen und Informationen über ihre Interessen zu gewinnen. Während Browser-Cookies lange das Mittel der Wahl für Benutzer-Tracking waren, etabliert sich mehr und mehr eine neue Technik: Digitales Fingerprinting. Im Gegensatz zu Cookies ermöglicht Fingerprinting den direkten Zugriff auf Informationen über ein System. So ermöglicht es diese Technik, Informationen über eine Softwarekomponente, wie z. B. den Browser, in Erfahrung zu bringen. Dennoch blieb bislang unklar, welche weiteren Anwendungsmöglichkeiten für Fingerprinting genutzt werden können und welche Risiken und Chancen der Einsatz von Fingerprinting mit sich bringt. In der Vergangenheit wurde vor allem über Fingerprinting des Browsers eines Systems, sog. Browser-Fingerprinting, geforscht. Die Übertragbarkeit bekannter Methoden aus dem Browser-Fingerprinting auf Fingerprinting ganzer Systeme, sog. System-Fingerprinting, kann auf unterschiedliche Weise erreicht werden. Angewendete Methoden müssen jedoch stets auf das vorliegende Szenario abgestimmt werden und sowohl die Arten von Systemen (z. B. Mobilgeräte) als auch die verfügbaren Daten (bspw. Browserattribute oder Hardwaremessungen) berücksichtigen. Zudem entscheidet das letztliche Ziel, wie zum Beispiel die Erkennung eines einzelnen Gerätes unter anderen, über die Vorgehensweise. Schließlich sind differenzierte Ansätze und die Bestimmung geeigneter Verfahren erforderlich, um Fingerprinting durchführbar zu machen. Die vorliegende Dissertation setzt den Fokus auf die folgenden zwei Schwerpunkte: Zunächst wird die Machbarkeit von Fingerabdrücken in verschiedenen Szenarien untersucht, bspw. Browser-Fingerprinting im Webkontext oder Hardware-basiertes Fingerprinting, und außerdem, welche Ziele mit dieser Technik erreicht werden können.
Zweitens werden die Auswirkungen von Fingerprinting erforscht, einschließlich der Risiken für die Privatsphäre der Nutzer und der Möglichkeiten, bestehende Sicherheitsmechanismen zu verbessern. Diese Arbeit fokussiert sich nicht nur auf das Fingerprinting von Browsern, sondern kompletter Systeme, indem die allgemeine Machbarkeit von Fingerprinting mobiler Geräte wie Smartphones und Tablets untersucht wird, sowie die Möglichkeiten, Fingerprinting zu umgehen. Das Fingerprinting mobiler Geräte gilt als schwierig, da diese hoch standardisiert sind, im Gegensatz zu Desktop-Computern und Browsern, die sich stark personalisieren lassen, z. B. durch Installieren von Erweiterungen oder Anpassen des Erscheinungsbilds. In der Regel ist Fingerprinting weniger wirksam in einer homogenen Gruppe von Geräten. Während manche Methoden des Browser-Fingerprintings auch für das Fingerprinting mobiler Geräte instrumentiert werden können, können auch neue Ansätze entwickelt werden, da moderne Geräte komplexere Systeme sind und zum Beispiel Hardware-Sensoren umfassen. Daher wird auch die Durchführbarkeit von Hardware-basiertem Fingerprinting untersucht, z. B. ob es möglich ist, Geräte nur durch Fertigungsfehler von Sensoren zu erkennen. Es wird ebenfalls geklärt, inwieweit Online-Marketing-Strategien auf Unterschiede von digitalen Fingerprints reagieren. In den vergangenen Jahren meldeten Medien immer wieder, dass die Nutzer bestimmter Computersysteme andere Preise angeboten bekommen als Nutzer anderer Computersysteme. Da ein solches Verhalten auf dem Fingerabdruck des digitalen Systems basieren kann, werden Auswirkungen verschiedener Fingerabdrücke auf die Online-Preisgestaltung aufgezeigt. Schließlich wird der Bezug des Themas Fingerprinting über digitale Systeme hinaus erweitert und erörtert, wie Fingerprinting-Verfahren auch effektiv für andere Forschungsgebiete eingesetzt werden können, etwa durch Wiedererkennung von Gesten zum Lösen von CAPTCHAs oder zur Zuordnung eines Textes zu seinem Autor. Unsere Ergebnisse zeigen sowohl die Machbarkeit des digitalen Fingerprintings als auch seine Grenzen. Wir präsentieren verschiedene Anwendungen für diese neue Technik und untersuchen ihre Auswirkungen, Risiken und Einsatzmöglichkeiten.

CONTENTS

Abstract
Kurzfassung
Table of Contents

1. Introduction
   1.1. Motivation
   1.2. Overview of Fingerprinting
        1.2.1. Human Fingerprinting
        1.2.2. Digital Fingerprinting
        1.2.3. System Identification
        1.2.4. System Recognition
   1.3. Topics and Scientific Contributions
   1.4. List of Publications
   1.5. Outline

2. Fingerprinting Techniques for Mobile Devices
   2.1. Introduction
   2.2. Analysis of Browser Fingerprinting Libraries
        2.2.1. Existing Methods
        2.2.2. Effectiveness for Mobile Devices
   2.3. Fingerprinting of Mobile Devices
        2.3.1. Attribute Selection
        2.3.2. Formalization
   2.4. Evaluation
        2.4.1. Feature Distribution
        2.4.2. Recognition of Mobile Devices
        2.4.3. Evasion Resistance
   2.5. Discussion
   2.6. Related Work
   2.7. Conclusion

3. System Fingerprints as Influence on Online Pricing Policies
   3.1. Introduction
   3.2. Price Differentiation via System Fingerprinting
        3.2.1. Price Differentiation
        3.2.2. System Fingerprinting
   3.3. Searching for Price Differentiation
        3.3.1. Design Goals
        3.3.2. High-level Overview of Workflow
        3.3.3. System Fingerprints
        3.3.4. Scanner
        3.3.5. Scraper
   3.4. Evaluation
        3.4.1. Price Analyses
        3.4.2. Location-based Price Differentiation
        3.4.3. Fingerprint-based Price Differentiation
        3.4.4. Price-influencing Features
   3.5. Discussion
   3.6. Related Work
   3.7. Conclusion

4. Hardware Fingerprinting as Second Authentication Factor
   4.1. Introduction
   4.2. Sensor-based Device Authentication
        4.2.1. Device Registration
        4.2.2. Device Authentication
   4.3. Fingerprinting for Sensors-based Authentication
        4.3.1. Data Set
        4.3.2. Feature Set
        4.3.3. Classifier
        4.3.4. Formalization
   4.4. Evaluation
        4.4.1. Single Sensor Tests
        4.4.2. Multi Sensor Tests
   4.5. Discussion
   4.6. Related Work
   4.7. Conclusion

5. Usability of Motion Fingerprints for Liveliness Tests
   5.1. Introduction
   5.2. Hardware Sensors as User Input for Captchas
        5.2.1. Gesture Design
        5.2.2. Satisfaction of Requirements
   5.3. Usability Study
        5.3.1. Design and Procedure
        5.3.2. Implementation
        5.3.3. Recruitment and Participants
   5.4. Evaluation
        5.4.1. Comparison of Mechanisms
        5.4.2. Gesture Analysis
        5.4.3. Survey Results
        5.4.4. Habituation
        5.4.5. Classification
   5.5. Discussion
   5.6. Related Work
   5.7. Conclusion

6. Impeding Authorship Attribution via Stylometry Obfuscation
   6.1. Introduction
   6.2. Authorship Attribution
        6.2.1. Writeprints
        6.2.2. Stylometry Obfuscation
   6.3. Discovering Obfuscation Limits
        6.3.1. Scenarios
        6.3.2. Data Corpus
        6.3.3. Extended Writeprints
        6.3.4. Machine Learning
        6.3.5. Obfuscators
        6.3.6. Readability
        6.3.7. Experiment Setup
   6.4. Evaluation
        6.4.1. Unsupervised Authorship Attribution
        6.4.2. Supervised Authorship Attribution
        6.4.3. Readability
        6.4.4. Number of Authors and Texts
   6.5. Discussion
   6.6. Related Work
   6.7. Conclusion

7. Conclusion

List of Figures

List of Tables

Bibliography

A. System Fingerprints as Influence on Online Pricing Policies
   A.1. Median Hotel Prices

B. Impeding Authorship Attribution via Stylometry Obfuscation
   B.1. Readability Measures Interpretations
   B.2. Authorship Attribution Precision Matrices

CHAPTER ONE

INTRODUCTION

Today, information about users and their commodity systems is valuable since it can be sold to companies seeking to optimize their marketing strategies. The more that is known about a potential customer, the more specific and personal the advertisements that can be deployed. Additional information can also be derived, like a person’s routines, everyday actions, preferences, and situation. In recent years, researchers and IT professionals have created different methods to obtain such data from users’ computers and mobile devices like smartphones and tablets. While these techniques bring new features, e. g., for more comfortable online browsing, they often cause privacy breaches and may even be abused for applications the system owner did not consent to, like user tracking, creating a movement profile, or retrieving preferences for products and brands. Some methods allow users to prohibit such practices: some fitness tracker apps, for example, work locally, and a user may opt out of sharing health data. In contrast, other methods for collecting personal data are covert, so that a user may not be able to forbid the data collection or even to detect it.

1.1. Motivation

A technique which can be abused for such non-consent data retrieval is fingerprinting. During a fingerprinting procedure, data about a user’s system, software configuration, or hardware setup is obtained and forms a unique identifier which can be used to track and recognize a user all over the web. This might be an enhancement for user experience as it enables website providers to recognize single customers, but it is also a threat to user privacy as personal data may be collected without permission. Furthermore, fingerprinting is not limited to web techniques and may be applied in

various scenarios, instrumenting information about software, hardware, user behavior, motions, and writing style. We aim to show in this thesis how fingerprinting works and how it is capable of recognizing and identifying single user systems among others. Mobile devices are a special focus of fingerprinting as more and more users go online with a hand-held device instead of a PC. Hence, we need to investigate whether the methods, risks, and chances of fingerprinting for classical computer systems also apply to mobile devices. We look at fingerprinting as a malicious threat from a privacy point of view and as a benign enhancement to existing challenges like user authentication. Hardware is especially relevant, as it is not easy to change on mobile devices, which enables using a hardware device as a second user authentication factor. Our goal is to describe the data retrieval mechanisms used by fingerprinting as well as its risks and chances. While fingerprinting may be applied for user tracking, identification, and obtaining private information, we also intend to show the usefulness of this technique.

1.2. Overview of Fingerprinting

The term fingerprinting originates from criminalistics and forensic science. While the focus of this work is digital fingerprinting of systems rather than human fingerprinting, it is necessary to explain the criminalistic context in order to fully understand the parallels between fingerprinting humans and fingerprinting systems—the latter including any modern digital system, such as a personal computer, a mobile device like a smartphone or tablet, a remote server, and even a piece of software or hardware.

1.2.1. Human Fingerprinting

In the late 19th century, systems were developed for the identification of humans with the help of their fingerprints [47]. Although archaeologists discovered historical items and artifacts bearing fingerprints of ancient people, it can only be assumed that these were used on purpose, e. g., for signing, as there is no scientific basis for such assumptions [112, 131]. Systematic fingerprinting first occurred in 1892, when Francis Galton suggested instrumenting fingerprints for the classification and identification of humans. He set out three patterns—loops, whorls, and arches—and suggested that a fingerprint is unique to one individual. In 1999, this method was deemed admissible for crime investigations for the first time: the fingerprints of a criminal suspect were compared to fingerprints found at a crime scene, and as they were found to match, the suspect could be convicted [47]. Until the 21st century, fingerprint identification was a mainstay of criminal investigations. Today, DNA typing has become more important and more trusted, so that it seems to supersede classic fingerprinting [31].


Although every human fingerprint belongs to one of these pattern groups and each pattern has distinguishable focal points, there is no standard or consensus among experts on how many points have to match to claim that two fingerprints are the same [47]. Thus, the number of characteristic artifacts of a human fingerprint is arbitrary. Nevertheless, human fingerprints satisfy the following characteristics [54]:

1. Uniqueness: A fingerprint is unique to a particular individual, and no two fingerprints possess the same set of characteristics. Image quality is crucial for determining characteristics and thus for the comparison of two fingerprints. Yet even high-quality fingerprint images may yield collisions, so that some fingerprints seem generally more suspicious than others—a major point of critique to this day [102].

2. Time-Invariance: Fingerprints do not change over the course of a person’s lifetime. Although little scars and injuries may change details of a fingerprint, it usually remains recognizable. However, greater injuries affect a person’s fingerprint temporarily, until the body has completely restored the skin [47].

3. Classifiability: Fingerprint patterns can be classified, and those classifications can then be used to narrow the range of suspects. For this classification, a decider—usually image recognition software—is needed to determine whether or not two fingerprints match [102].

These goals are biologically achieved by human skin: The epidermis contains several tiers of cells forming friction ridges which can be recorded. A human fingerprint record is a picture of these ridges which are naturally restored after injury and can even be obtained after attempted obliteration with sulfuric acid [54]. However, the risk of collisions remains: Though worries about getting falsely convicted by a computer error are deemed unfounded, there are cases in which persons with criminal histories were reported as having no convictions [31].

1.2.2. Digital Fingerprinting

Similar to human fingerprinting, it is possible to fingerprint digital systems. Instead of patterns and focal points, other criteria are instrumented, depending on the digital system itself, including software like browsers and hardware like sensors. The approach and goals of both are closely related:

• Recognize a human by individual skin patterns.

• Recognize a system by individual software, hardware, or configuration.


While the characteristic attributes of a human fingerprint are given naturally, we have to choose carefully which system attributes should be instrumented. Such system attributes should be characteristic and are called features. A system’s fingerprint consists of such features, and as it is a digital fingerprint of a system, the terms digital fingerprint and system fingerprint refer to the same set of features. The feature set describes which attributes are included in all fingerprints that should be examined; the data set includes all features’ values from all systems. The first step of digital fingerprinting is the selection of attributes from a system. We need to determine those attributes which are characteristic and thus may be helpful to distinguish items. Usually, access to a system is limited, and attributes that would be perfectly distinguishable, and therefore characteristic, may not be available. An example is a phone’s international mobile equipment identity (IMEI): though it could be leveraged to distinguish mobile devices, it is not available via the browser and is therefore unworkable for web-based fingerprinting. Depending on the overall goal, the accessibility of a system’s resources needs to be considered when determining the characteristic attributes designated for the feature set. The choice of these features is crucial for a fingerprinting mechanism. On the one hand, a feature set may not precisely describe all systems if the chosen attributes are similar for all or most of the systems. On the other hand, computational costs increase with the number of features in a feature set. When designing a fingerprinting system, features need to be chosen which vary among the systems to be fingerprinted and are characteristic for one system or a small group of systems.
For instance, if a system were to detect drawings and recognize geometrical shapes, the color of a drawing would be a non-characteristic attribute, in contrast to the number of corners or the number of straight lines: the color exists as an attribute but is not helpful for distinguishing shapes. Consequently, the more information an attribute yields about a subgroup of systems, the more characteristic it likely is. Second, we combine these characteristic attributes for every item into a (weighted) feature vector, i. e., the fingerprint. This combination of a system’s feature values should be as unique as possible in order to serve as an identifier for a particular system. A fingerprint is considered as the vector of features, and this feature vector has to be built for every single system, which enables weighting of features: if a feature is strongly descriptive among all systems, it should get a higher weight, so it is considered more important than a feature which is not unique and is held by many similar systems. Third, a decision engine is required to assess whether a fingerprint belongs to a specific system or not. While for human fingerprinting this decider may be image recognition software, for digital fingerprinting a comprehensive rule mechanism may meet this requirement, e. g., with a rather simple rule: if all features of a present fingerprint are equal to the features of the fingerprint of a specific system, then this present fingerprint belongs to this system. In this case, the two compared vectors are

equal. An advantage of rule-based mechanisms is the possibility to clearly reproduce and confirm any decision. A disadvantage is the lack of flexibility: if a fingerprint has changed in only one feature, it may deceive such a rule engine. Such changes may be caused by even little customization of a system, e. g., installing a new browser plugin. If a list of available plugins is utilized as a feature for digital fingerprinting, it will change whenever a plugin is removed or installed. An alternative to rule-based decisions is the use of machine learning algorithms, which are capable of classifying fingerprints and matching them to a specific system. Depending on the feature set, an algorithm has to be chosen and trained with digital fingerprints from the system set. The machine learning training results in a model which is later used for classification. Determining the most suitable model for a given dataset is still a major issue of machine learning and requires cross-validation comparison between different algorithms [146]. However, utilizing machine learning yields a significant advantage: it is able to adapt to differences within a feature vector and therefore to recognize a system by its fingerprint although the fingerprint might have changed. The weights of features can also be determined by machine learning methods like the information gain algorithm, representing their importance within the data set. In this work, we frequently utilize machine learning for system fingerprinting. For detailed descriptions of machine learning concepts, please refer to Machine Learning by the Information Resources Management Association [69]. Finally, a system fingerprint can be instrumented to recognize its corresponding system. Given a set of systems, each yielding its own fingerprint, we are able to

1. distinguish systems of this set by their fingerprints, and

2. examine whether or not an arbitrary system is one of the systems in this set.
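The information-gain weighting of features mentioned above can be made concrete with a toy computation. The sample fingerprints, feature names, and device labels below are invented for illustration; they are not taken from the thesis.

```python
# Toy information-gain computation: the more a feature reduces
# uncertainty about which system produced a fingerprint, the higher
# the weight it deserves. All data here is hypothetical.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, feature):
    """Entropy reduction obtained by splitting the samples on `feature`."""
    total = entropy(labels)
    remainder = 0.0
    for value in {s[feature] for s in samples}:
        subset = [l for s, l in zip(samples, labels) if s[feature] == value]
        remainder += len(subset) / len(samples) * entropy(subset)
    return total - remainder

samples = [
    {"os": "android", "canvas_hash": "a1"},
    {"os": "android", "canvas_hash": "b2"},
    {"os": "ios",     "canvas_hash": "c3"},
    {"os": "android", "canvas_hash": "a1"},
]
labels = ["dev1", "dev2", "dev3", "dev1"]

# The canvas hash separates the devices perfectly, the OS only
# partially, so the canvas hash earns the higher weight.
assert information_gain(samples, labels, "canvas_hash") > \
       information_gain(samples, labels, "os")
```

In practice such scores are computed over the whole data set and used directly as feature weights, as done by standard machine learning toolkits.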

If an arbitrary system is spotted, its fingerprint can be created the same way as for the set of known systems. This new fingerprint can be compared to the known fingerprints by the decision engine for classification:

• The fingerprint either belongs to one of the systems in the known system set and therefore is a match,

• or it cannot be related to a system in the set, meaning that its original system has not been seen, yet.
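The rule-based decider and a tolerant alternative can be sketched in a few lines of Python. All feature names, values, weights, and the threshold below are invented for illustration and are not taken from the thesis; a trained classifier would play the role of the `recognize` function in the machine learning variant.

```python
# Hypothetical feature set; weights mark how descriptive a feature is
# (the volatile plugin count gets the lowest weight).
FEATURES = ["os", "browser", "n_plugins", "screen_w"]
WEIGHTS = {"os": 2.0, "browser": 2.0, "n_plugins": 1.0, "screen_w": 2.0}

def rule_match(fp_a, fp_b):
    """Rule-based decider: a match only if every feature is equal."""
    return all(fp_a[f] == fp_b[f] for f in FEATURES)

def distance(fp_a, fp_b):
    """Weighted count of differing features (0 = identical fingerprints)."""
    return sum(WEIGHTS[f] for f in FEATURES if fp_a[f] != fp_b[f])

def recognize(observed, known, threshold=1.5):
    """Tolerant decider: nearest known system, or None if unseen."""
    best = min(known, key=lambda name: distance(observed, known[name]))
    return best if distance(observed, known[best]) <= threshold else None

known = {
    "device-1": {"os": "android", "browser": "chrome",
                 "n_plugins": 14, "screen_w": 1080},
    "device-2": {"os": "android", "browser": "firefox",
                 "n_plugins": 3, "screen_w": 720},
}

# A revisiting device that installed one plugin since it was last seen:
observed = {"os": "android", "browser": "chrome",
            "n_plugins": 15, "screen_w": 1080}
assert not any(rule_match(observed, fp) for fp in known.values())
assert recognize(observed, known) == "device-1"

# A genuinely unknown system exceeds the threshold and stays unmatched:
stranger = {"os": "ios", "browser": "safari",
            "n_plugins": 0, "screen_w": 1170}
assert recognize(stranger, known) is None
```

The strict rule fails on the single changed feature, while the weighted-distance decider still recognizes the device and correctly rejects the unseen one.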

Figure 1.1 illustrates this approach. At this point, a system fingerprint may serve the same tasks as a human fingerprint. However, a digital fingerprint does not comply with all three characteristics of a human fingerprint described above. The features of a system fingerprint may not be time-invariant: depending on which features are chosen to build a fingerprint, it is possible that a fingerprint changes over time when a feature changes. For instance, if a smartphone’s fingerprint instruments a list of installed apps, this list may change by uninstalling unused apps or installing new ones. Hence, every time a user installs or uninstalls an application, the system fingerprint changes. For system fingerprinting, a feature set has to be as robust as possible and therefore needs to rely on preferably unchangeable features. Moreover, if fingerprint features are not chosen carefully, it is feasible to undermine the uniqueness property. If the only feature obtained for a system fingerprinting scenario is the operating system (OS), one might be able to determine the share of a specific OS among all data (e. g., Windows or Linux), but a thorough identification or recognition cannot be made, as there will be systems running the same OS. System fingerprinting needs to be based on characteristic attributes to distinguish particular systems from others. The classifiability of fingerprints strongly depends on this uniqueness characteristic: it can only be achieved if fingerprints are unique to individual systems, as a fingerprint which does not describe a system precisely may be matched with a non-corresponding fingerprint of another system. In general, digital fingerprinting can be instrumented for two purposes: system identification and system recognition.

[Figure 1.1: Fingerprinting process—attributes of known systems are selected into digital fingerprints that train a machine learning model; the fingerprint of an arbitrary system is then classified as matching one system in the set or as not in the system set.]
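The uniqueness caveat—an OS-only feature set cannot single out a system—can be illustrated numerically. The sample systems below are invented for this sketch.

```python
# A single coarse feature (the OS) partitions systems into large
# anonymity sets, while a combination of features shrinks the sets
# toward size one. The sample data is hypothetical.
from collections import Counter

systems = [
    {"os": "Windows", "browser": "Chrome",  "timezone": "UTC+1"},
    {"os": "Windows", "browser": "Firefox", "timezone": "UTC+1"},
    {"os": "Windows", "browser": "Chrome",  "timezone": "UTC-5"},
    {"os": "Linux",   "browser": "Firefox", "timezone": "UTC+1"},
    {"os": "Linux",   "browser": "Firefox", "timezone": "UTC+9"},
]

def anonymity_sets(systems, features):
    """Count how many systems share each fingerprint value."""
    return Counter(tuple(s[f] for f in features) for s in systems)

# OS alone yields only two groups (sizes 3 and 2): no system is
# identifiable, only the OS share among the data can be determined.
os_sets = anonymity_sets(systems, ["os"])
assert sorted(os_sets.values()) == [2, 3]

# All three features combined make every fingerprint in this sample
# unique, which is what recognition requires.
full_sets = anonymity_sets(systems, ["os", "browser", "timezone"])
assert all(count == 1 for count in full_sets.values())
```

In real data sets this is measured as the entropy of the fingerprint distribution rather than by raw group sizes, but the intuition is the same.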

1.2.3. System Identification

Fingerprinting for identification is applied when it is important to find out the specifications of systems, e. g., to analyze certain Internet hosts vulnerable to a specific attack such as amplification DDoS attacks. In such an attack, publicly available systems (such as open recursive DNS resolvers) are abused to reflect traffic to a DDoS victim [124]. In particular, hosts are abused which do not only reflect but also amplify the traffic. An attacker uses a system with a spoofed IP address and sends relatively small requests to one or several vulnerable hosts. The response to such a request is rather large, so

that the victim—whose IP has been taken for spoofing—gets overloaded with traffic. Figure 1.2 depicts this kind of attack.

[Figure 1.2: Amplification DDoS attack—the attacker sends small spoofed requests to vulnerable hosts, whose larger responses are reflected to the victim.]

Typically, attackers choose connection-less protocols in which they can send relatively small requests that result in significantly larger responses. Prior work has revealed that at least 14 UDP-based protocols are vulnerable to such abuse [30]. These protocols offer severe amplification rates, as with the monlist feature of NTP, which returns a list of the latest communication partners of a system; this feature is able to amplify traffic by a factor of up to 4,670. TCP-based protocols, like FTP, HTTP, HTTPS, SSH, and Telnet, can also be exploited this way [86]. Scanning 20 million random IP addresses for 13 common TCP-based protocols revealed that there are fewer TCP hosts vulnerable to amplification attacks than UDP hosts, but the amplification factor is still up to 2,500 for several thousand hosts [87]. As such attacks are a realistic threat, it may be helpful to discover which systems are vulnerable hosts. For this purpose, fingerprinting can be used. By obtaining system attributes from the vulnerable system, it is possible to categorize different system groups. Such information can be gained from a returned payload of an amplification host. The application of regular expressions or other matching methods discloses the system’s identity, so that different hardware types or operating systems may be revealed [87]. Thus, the systems vulnerable to this attack type can be grouped and labelled, e. g., as a router or an embedded device, for instance running Linux. Finally, this allows an assessment of risks as it is possible to identify specific systems based on their attributes. System identification is confined to obtaining information about systems and discovering their specifications.
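As a hedged sketch of such payload matching, the snippet below classifies response banners with regular expressions. The patterns and banner strings are invented examples, not signatures taken from the cited measurement studies.

```python
# Identification by payload matching: regular expressions applied to
# the response of an amplification host reveal its device class.
# Signatures and banners below are hypothetical.
import re

SIGNATURES = [
    (re.compile(r"RouterOS|MikroTik", re.I), "router"),
    (re.compile(r"OpenWrt|dd-wrt", re.I),    "embedded Linux device"),
    (re.compile(r"Windows", re.I),           "Windows host"),
]

def identify(payload: str) -> str:
    """Return the first matching device class, or 'unknown'."""
    for pattern, label in SIGNATURES:
        if pattern.search(payload):
            return label
    return "unknown"

assert identify("220 FTP server (MikroTik 6.40) ready") == "router"
assert identify("SSH-2.0-dropbear on OpenWrt 19.07") == "embedded Linux device"
assert identify("random noise") == "unknown"
```

Real measurement pipelines apply many such signatures to millions of responses and aggregate the labels into risk statistics per device class.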


1.2.4. System Recognition

Contrary to fingerprinting for system identification, the recognition of systems requires a more sophisticated approach. The overall goal is to decide whether a specific system is known or has never been seen before. For this purpose, information about systems is gathered as well, so system recognition may include system identification, as it is possible to learn a system’s specifications from its features. However, the main focus of this approach is the classification of systems (see Fig. 1.1). System recognition as a goal of digital fingerprinting, as well as the threats and possibilities of this technique, is in the scope of this thesis and will be elaborated in the following chapters.

1.3. Topics and Scientific Contributions

The topics of this thesis cover different applications of fingerprinting and investigate its chances, risks, and impact. We aim to shed light on usually hidden mechanisms as well as to point out privacy-related risks concerning a system's user. In the following, we provide an overview of the scientific contributions covered by peer-reviewed papers referred to in this thesis. Figure 1.3 depicts the chapters of this thesis and their relations.

[Figure 1.3 groups the thesis topics by fingerprinting target: software (Mobile Devices, Chapter 2; Online Pricing Policies, Chapter 3), hardware (Hardware Authentication, Chapter 4; Liveliness Tests, Chapter 5), and stylometry (Authorship Attribution, Chapter 6).]

Figure 1.3.: Overview of thesis topics

Chapter 2 tackles the problem of transferring known methods for desktop fingerprinting into the mobile domain. In the third chapter, we investigate how system fingerprinting affects online pricing policies. Both chapters describe the impact of fingerprinting: the first shows that effective fingerprinting of mobile devices is a real-world scenario, and the second shows that a system's fingerprint changes asset prices on the Internet. Both parts relate to software fingerprinting, while Chapters 4 and 5 investigate the feasibility of hardware fingerprinting in different contexts. Chapter 4 shows the general feasibility of fingerprinting the hardware of mobile devices by instrumenting sensors. Chapter 5 applies these insights to tackle a common web problem: modern CAPTCHA schemes leverage fingerprinting for liveliness proofs and thus track user behavior, which can be considered privacy-invasive. Therefore, we present a CAPTCHA mechanism relying on sensor input as well as motion recognition without putting user privacy at risk. Chapter 6 leverages another factor for fingerprinting: author stylometry. The link between Chapter 6 and Chapter 5 is privacy, as Chapter 5 implements a method that is not privacy-invasive, unlike future mechanisms probably will be. Chapter 6 also describes a risk emerging from fingerprinting techniques—the risk of being detected as an author—just like Chapter 3 discloses the risk and consequences of secretly using fingerprint data without user consent. In the following, we describe the chapters of this work in more detail.

Chapter 2: Fingerprinting Techniques for Mobile Devices. Knowing the key functionalities of fingerprinting, we show the feasibility of modern techniques in this chapter. While especially browser fingerprinting of traditional computer systems, e. g., PCs and notebooks, has been the focus of researchers worldwide in the past years, it remained unclear to what extent these fingerprinting methods can be applied to mobile devices like smartphones and tablets. Therefore, we intend to clarify if common fingerprinting functions developed for traditional computer systems can be applied to mobile devices as well, and which new attributes may be utilized for this purpose. We also challenge fingerprinting methods with possible evasion scenarios to check if users are able to avoid this practice. First, we examine user tracking libraries instrumenting fingerprinting for recognition of user systems in order to investigate which techniques are deployed in the wild. Second, we apply these techniques to both mobile and desktop systems to give insights into how well existing approaches for desktops work for mobile devices. Then, we implement a fingerprinting system especially dedicated to mobile devices and conduct several experiments to prove the feasibility of mobile device fingerprinting. Finally, we study evasion attacks against this system to assess the possibilities of avoiding fingerprinting.


Chapter 3: System Fingerprints as Influence on Online Pricing Policies. After proving the general feasibility of fingerprinting, we investigate its influence on online pricing policies. In the past years, there have been various articles about web stores and online service providers offering different prices to different customers for the same product [147]. The reason for such price differences is rooted in pricing policies which are accused of leveraging private data about customers to set a product's price. Fingerprinting is used to obtain such data, which raises privacy issues. We aim to clarify whether or not systematic price discrimination exists in the wild and if online prices are adjusted based on a system's fingerprint. We develop and implement a system to reveal online price discrimination based on fingerprinting in web portals and conduct an empirical study to show if such cases reported by news and media can be reproduced and if there is systematic price discrimination based on fingerprinting techniques. Furthermore, we shed light on which characteristic attributes are used in practice and how these can be affected.

Chapter 4: Hardware Fingerprinting as Second Authentication Factor. While a digital fingerprint on the web may be applied for user tracking, we seek applications of fingerprinting to enhance existing security techniques. The ability to recognize a system by its characteristic attributes can be used to enhance user authentication. While software might change during a device's life cycle, its hardware usually stays unmodified. Hence, fingerprinting a system's hardware is a way to ensure that a specific action is performed with a specific device. We intend to clarify whether a system's hardware fingerprint can be used for two-factor authentication. We use hardware fingerprinting to enrich common password authentication schemes by leveraging a device's fingerprint as a second factor. For this purpose, we examine the feasibility of fingerprinting hardware sensors and investigate which sensors are most reliable. Combining data from different sensor types enables a thorough recognition of single devices among others. Using a device's fingerprint as a second authentication factor besides a user's password thus ensures that a specific user logs in with a specific device. While a password represents authentication by knowledge, a device fingerprint can be categorized as authentication by ownership, as a user needs to possess the specific device.
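A minimal sketch of such a two-factor check, assuming a fingerprint is a dictionary of numeric sensor characteristics (the feature names, values, and similarity threshold below are illustrative, not the actual mechanism of Chapter 4):

```python
import hashlib

def similarity(fp_a, fp_b):
    """Fraction of matching sensor features between two fingerprints."""
    keys = fp_a.keys() & fp_b.keys()
    matches = sum(1 for k in keys if abs(fp_a[k] - fp_b[k]) < 0.01)
    return matches / len(keys) if keys else 0.0

def authenticate(password, device_fp, stored_hash, enrolled_fp, threshold=0.8):
    # Factor 1: knowledge (password); factor 2: ownership (device fingerprint).
    pw_ok = hashlib.sha256(password.encode()).hexdigest() == stored_hash
    fp_ok = similarity(device_fp, enrolled_fp) >= threshold
    return pw_ok and fp_ok

enrolled = {"accel_bias_x": 0.031, "gyro_drift": -0.002, "mag_offset": 1.204}
stored = hashlib.sha256(b"correct horse").hexdigest()
print(authenticate("correct horse",
                   {"accel_bias_x": 0.032, "gyro_drift": -0.002, "mag_offset": 1.205},
                   stored, enrolled))  # True
```

A threshold-based comparison rather than an exact match is needed because sensor readings are noisy: the same device never reproduces its calibration values exactly.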


Chapter 5: Usability of Motion Fingerprints for Liveliness Tests. After examining fingerprinting for software as well as hardware, we shift the topic of fingerprinting to other types of data: not systems or users are fingerprinted, but movements and gestures, for the creation of a novel CAPTCHA1 mechanism. Today, modern captcha schemes apply fingerprinting for detecting and tracking systems and users [117]. This may raise privacy issues, as users are typically not aware of being fingerprinted. In this chapter, we introduce a captcha scheme which utilizes fingerprinting not for this purpose but relies on specific gestures to prove a user's liveliness. As hardware sensors provide a huge amount of quality data and are hard to tamper with (see Chapter 4), we instrument them as user input for liveliness tests and analyze this approach from a usability perspective. We aim to inquire if users are willing to engage with such a new method as well as to study the effectiveness of this approach. For this purpose, we conducted a user study and evaluated the robustness and usability of our captcha scheme compared to well-established approaches and another innovative scheme.
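To illustrate the idea of a sensor-based liveliness test, a deliberately simplified sketch: a recorded accelerometer trace is compared against a gesture template by mean squared error (the actual scheme in Chapter 5 uses proper motion recognition; the traces and tolerance here are invented):

```python
def mse(trace, template):
    """Mean squared error between two equally long accelerometer traces."""
    return sum((a - b) ** 2 for a, b in zip(trace, template)) / len(template)

def passes_liveliness(trace, template, tolerance=0.05):
    # A human performing the requested gesture should roughly match the
    # template; a bot replaying flat or random sensor data should not.
    return len(trace) == len(template) and mse(trace, template) < tolerance

template = [0.0, 0.4, 0.9, 0.4, 0.0, -0.4, -0.9, -0.4]   # e.g., a shake gesture
human    = [0.1, 0.5, 0.8, 0.3, 0.0, -0.5, -0.8, -0.3]
bot      = [0.0] * 8

print(passes_liveliness(human, template))  # True
print(passes_liveliness(bot, template))    # False
```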

Chapter 6: Impeding Authorship Attribution via Stylometry Obfuscation. While the other chapters deal with fingerprinting of hardware and software, we examine the topic of fingerprinting in a different domain and present text-based fingerprinting as an approach for authorship attribution. When authors write texts, their works contain specific peculiarities, including the use of punctuation and other stylistics. Every author tends to use particular phrases in specific frequencies. By extracting individual attributes from a text, it is possible to assign a specific text to its original author. These attributes of writing style can be considered a text-based fingerprint of an author. However, there are many cases in which authors need to stay unknown and private, e. g., whistle-blowers, human rights activists, and victims of political persecution. There exist techniques to anonymize a text so that it cannot be assigned to an author anymore. The vast majority of these methods require unobfuscated texts as input for machine learning. However, this assumption cannot be made for realistic scenarios, as it rarely happens that an author publishes texts under a real name and afterward obfuscates one single text. Thinking of real-world scenarios like a group of human rights activists who need to stay anonymous, all published texts need to be obfuscated before publishing. Hence, there are either no original, i. e., unobfuscated, texts available or there has been an information leakage revealing authors and texts. We aim to investigate whether text-based fingerprinting is effective under these circumstances, even after performing obfuscation.
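As a toy sketch of extracting such stylometric features (the feature set below is far smaller and simpler than what real writeprint systems use):

```python
import re
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "however", "thus"]

def stylometric_features(text: str) -> dict:
    """Extract a tiny text-based fingerprint: punctuation and word habits."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    n = max(len(words), 1)
    features = {f"freq:{w}": counts[w] / n for w in FUNCTION_WORDS}
    features["avg_word_len"] = sum(map(len, words)) / n
    features["commas_per_word"] = text.count(",") / n
    features["semicolons_per_word"] = text.count(";") / n
    return features

fp = stylometric_features("However, the style of a text betrays its author; "
                          "thus, anonymity requires obfuscation.")
print(fp["freq:however"] > 0)  # True
```

Attribution then reduces to comparing such feature vectors against those of candidate authors; obfuscation tries to perturb exactly these habits without destroying readability.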

1The acronym will further be written in lowercase for better readability.


1.4. List of Publications

This thesis is based on previous academic publications but also contains novel and unpublished material. The second chapter about fingerprinting techniques for mobile devices is built on joint work together with Davide Maiorca as well as Marc Kührer and Thorsten Holz. It was published at the 31st Annual Computer Security Applications Conference (ACSAC 2015) [63]. Chapter 3 covers an examination of the impact of fingerprinting and is joint work together with Nicolai Wilkop and Thorsten Holz. The conducted study is as yet unpublished but, at the time of writing, in submission at the 17th Privacy Enhancing Technologies Symposium (PETS 2017). The next chapter takes the topic of this thesis from the software to the hardware level by fingerprinting different sensor types of mobile devices. While previous publications only took a device's accelerometer and gyroscope into account [39], as these are accessible via web technology, we also included all other available sensors—like magnetic field or rotation sensors—in our experiments. This work has been published at the 13th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA 2016) [61] and was carried out together with Henry Hosseini and Thorsten Holz. Hardware sensors may also serve as user input for liveliness detection. In Chapter 5 we propose such an approach as an alternative to common captcha methods, which may pose a risk to user privacy as fingerprinting is also leveraged for this purpose. This usability study was joint work together with Katharina Krombholz and Thorsten Holz and has been published at the 9th International Conference on Trust and Trustworthy Computing (TRUST 2016) [62]. Chapter 6 describes a possible way to apply fingerprinting to another source: writing style. We revisit the approach of writeprints, which are text-based fingerprints of an author's style of writing, so-called stylometry.
To hide an author's stylometry and thereby prevent a text from being related to its original author, obfuscators emerged that try to anonymize text style. We re-implement the writeprinting process and examine different state-of-the-art obfuscators regarding their effectiveness in deceiving authorship attribution and preserving a text's readability. This work was accomplished together with Henry Hosseini and will be submitted to the 17th Privacy Enhancing Technologies Symposium (PETS 2017). Other publications emerged in the course of this thesis with a particular focus on fingerprinting or mobile devices. Together with Marc Kührer, Christian Rossow, and Thorsten Holz, we investigated the impact and cause of distributed reflective denial of service (DRDoS) attacks leveraging amplification hosts. These examinations included classifying vulnerable hosts with the help of fingerprinting methods. The results have been published at the 23rd USENIX Security Symposium (USENIX


Security 2014) [86] and, especially utilizing the TCP protocol's handshake, at the 8th USENIX Workshop on Offensive Technologies (WOOT 2014) [87]. As openly accessible DNS resolvers are prone to such attacks, we analyzed the landscape of such hosts based on empirical data and determined device types and software versions by system fingerprinting. This joint work with Marc Kührer, Jonas Bushart, Christian Rossow, and Thorsten Holz has been published at the ACM Internet Measurement Conference (IMC 2015) [85]. In cooperation with Katharina Krombholz and Thorsten Holz, we conducted a usability study on force-sensitive PIN authentication on mobile devices. We utilized the force sensors in a smartphone's display to enable strongly pressed digits in addition to normally pressed digits in a user's personal identification number. This work has been published at the Twelfth Symposium on Usable Privacy and Security (SOUPS 2016) [83] and as an enhanced version in the journal IEEE Internet Computing in 2017 [84]. Together with Sebastian Uellenbeck, Christopher Wolf, and Thorsten Holz, we developed a tactile one-time pad which is transported to the user by the vibration motor of a mobile device to make PIN-based authentication more secure. As it is important for this method that vibration sounds from different devices cannot be distinguished, we applied fingerprinting to the sounds of vibration motors from several devices. This joint work has been published at the International Conference on Financial Cryptography and Data Security (FC 2015) [145].

1.5. Outline

The remainder of this thesis is structured as follows. Chapter 2 describes the basic approach of fingerprinting for recognition and introduces machine learning for this purpose. We investigate state-of-the-art fingerprinting methods as well as characteristic attributes and take these to the mobile domain by applying common techniques to smartphones and tablets. We analyze how well recognition works based on a common feature set and extend it specifically for mobile devices. Consequently, we examine the possibilities of escaping such algorithms and the feasibility of deceiving our newly built fingerprinting system. We show that modern fingerprinting techniques pose a risk to user privacy. We further demonstrate the impact of fingerprinting in the scenario of online price discrimination in Chapter 3. We obtain data from four different accommodation search providers, compare hotel room prices across various countries and fingerprints, and show that user systems with different fingerprints are not treated equally and are at risk of receiving a higher or lower price than other users' systems. In Chapter 4 we take fingerprinting techniques from the software domain to hardware sensors. While in the previous chapters a piece of software like a browser has been

the main target for fingerprinting, we now aim at all sensor types available in modern mobile devices. We show that a device can be recognized among others by its sensor fingerprint only, which measures hardware imperfections. Furthermore, we propose an authentication mechanism based on this fingerprint technology to enhance user security, e. g., for website logins or online banking. Chapter 5 focuses on CAPTCHA mechanisms. Modern CAPTCHA schemes resort to fingerprinting to detect realistic user behavior. As this poses a risk to user privacy, we propose a new CAPTCHA mechanism for liveliness tests leveraging hardware sensors. This privacy-preserving scheme instruments mobile devices' movement sensors and instructs users to perform certain gestures as a challenge. We conduct a thorough usability study and show that such a mechanism could technically be an alternative to fingerprinting CAPTCHAs, whereas the user acceptance of classical methods remains higher. Each of these chapters introduces the different methods used for the specific approach and includes a summary at the end. Finally, we conclude in Chapter 7 with a summary of all topics addressed in this thesis and possible enhancements for future work.

CHAPTER TWO

FINGERPRINTING TECHNIQUES FOR MOBILE DEVICES

Client fingerprinting techniques enhance classical cookie-based user tracking to increase the robustness of tracking techniques. A unique identifier is created based on characteristic attributes of the client device and then used for the deployment of personalized advertisements or similar use cases. Whereas fingerprinting performs well for highly customized devices (especially desktop computers), these methods often lack precision for highly standardized devices like mobile phones. In this chapter, we show that widely used techniques do not yet perform well for mobile devices, but that it is possible to build a fingerprinting system for precise recognition and identification. We evaluate our proposed system in an online study and verify its robustness against misclassification. Fingerprinting of web clients is often seen as an offense to web users' privacy, as it usually takes place without the users' knowledge, awareness, and consent. Thus, we also analyze whether it is possible to evade fingerprinting of mobile devices. We investigate different scenarios in which users can circumvent a fingerprinting system and evade our newly created methods.

2.1. Introduction

Tracking is an essential technique in today's web, and its use cases range from session management and personalization (especially related to advertisements) to fraud detection technologies. The traditional approach to web tracking is based on HTTP cookies, where the web server stores some information on the client side in a persistent way that outlasts the current browsing session. Furthermore, other kinds of (transient or persistent) cookies are possible; Flash content and HTML5 storage are just two of many examples of technical approaches to tracking. Such tracking induces concerns

related to the privacy of web users, especially since tracking usually takes place without the users' knowledge, awareness, and consent. Thus, users are inclined to delete cookies or perform other actions to avoid tracking. To complement these state-based tracking techniques, fingerprinting as a stateless technique has become increasingly important. In recent years, many features and methods have been proposed that facilitate the generation of fingerprints for device identification [42, 74, 114, 119]. The features of a fingerprint yield descriptive information about the system, e. g., specifications or customizations, and give clues about the kind of usage and configuration. These features should be as unique as possible, while weights can be leveraged to express the attributes' importance. Recent studies demonstrate that fingerprinting works well for highly customized devices—especially desktop computers—while lacking precision for highly standardized devices like mobile phones or tablets [42, 119]. This is because fingerprinting relies on characteristic features that can be either customized by a user (e.g., installed fonts) or depend on the actual device (e.g., screen resolution or color depth). Mobile phones lack such customizations, and thus tracking of such devices is still an open problem in practice. Due to the fact that mobile devices (especially Android-based mobile phones) significantly gained market share in the last years, tracking of such devices is ever more relevant. In this chapter, we focus on this problem and examine it in detail. In a first step, we perform a comprehensive analysis of tracking libraries used in the wild. More specifically, we study nine commercial tracking libraries and analyze the fingerprinting techniques used by these commercial vendors. We find a wide variety of potential tracking methods and study the information gain provided by each feature.
This analysis is based on a real-world data set consisting of data collected from more than 15,000 client systems. The main finding is that the currently used features do not perform well for mobile devices, especially because such devices cannot be customized as easily as desktop computers. As such, we confirm the observation [119] that tracking of mobile devices is a hard problem in practice. In the second step, we propose several features that tracking systems could leverage to fingerprint mobile devices. We study four different categories of features (i. e., browser, system, hardware, and behavioral attributes) and discuss how they can be utilized for tracking. We implemented the proposed features, built a prototype of a fingerprinting system, and evaluated its effectiveness with 724 mobile users who took part in our experiments over a duration of four months. As a third step, we study the robustness of our algorithm against evasion attacks, i.e., we study how a user could influence the features by changing device attributes to bypass our tracking system. Based on a discussion of the changeability of features, we evaluate four different evasion scenarios (e.g., using a second browser or a proxy connection). We find that users can evade fingerprinting, but that it is not as easy as one would expect at first glance.


Contribution In summary, we make the following contributions:

• We provide a comprehensive analysis of existing tracking techniques used by (commercial) tracking companies and study the performance of such techniques for mobile devices in a field study.

• We discuss how tracking for mobile devices can be improved. To this extent, we propose a fingerprinting system based on known and new features that result from a systematic study of browser, system, hardware, and behavioral attributes.

• We implemented the proposed system and evaluated the prototype in an online study with 724 participants, of which 459 accessed the experiment more than once over a duration of four months.

• We study evasion techniques against fingerprinting systems, i. e., we analyze how a user can bypass the tracking system by changing (some of) the features of a mobile device. We study the robustness of our proposed approach by evaluating evasion attacks under four different scenarios.

Outline In the following section, we analyze real-world browser fingerprinting libraries which are implemented for the sake of user tracking and compare their effectiveness for desktop computers and mobile devices. Next, we select attributes from these libraries as well as new ones to build a fingerprinting system targeting mobile devices in particular. The evaluation section shows the results of a single-iteration experiment—the complete data set is available from the beginning—and a multi-iteration experiment—the data set is simulated to grow successively. We then discuss the chances of evading such a fingerprinting system and conduct an evasion experiment including four scenarios of trying to escape device recognition. Finally, we give an overview of related work and a summary of this chapter.

2.2. Analysis of Browser Fingerprinting Libraries

Fingerprinting and web tracking libraries have evolved into complex systems over the past 20 years [44, 96]. First, we review the basic approach of existing techniques and then dive into the technical details of state-of-the-art fingerprinting methods and their effectiveness. To investigate which features and fingerprinting techniques are used by (commercial) tracking libraries, we first collected a representative set of commonly used libraries. We analyzed popular websites using the Alexa ranking [64] and obtained a set of commonly used tracking libraries. Such JavaScript libraries leverage different

features to implement device tracking and fingerprinting methods. We collected and analyzed the code of nine tracking libraries, and found that they leverage many different features.

2.2.1. Existing Methods

We extensively analyzed known commercial tracking libraries—namely BlueCava, Device Ident, Inside Graph, Iovation, Threat-Metrix, and many more. The following attributes were found to serve as features for fingerprinting web clients. This analysis complements the work by Nikiforakis et al. [119].

Cookies

Cookies are the most common way to implement device identification by storing a unique identifier (usually a short string) directly on the client system. They provide an easy yet effective way of identifying and tracking systems, but do not use any of the system's characteristic features.

• HTTP cookies may be set and queried at the HTTP protocol level or via scripting languages such as JavaScript (unless the HTTP-Only flag is set). Storing cookies, however, may be disabled by the user in the browser configuration. Furthermore, HTTP cookies are only valid for a limited time span.

• Flash cookies use the Adobe Flash plugin to store unique identifiers at the client using so-called Local Storage Objects (LSOs). Flash cookies are not managed by the browser’s cookie policy and thus harder to remove than HTTP cookies.

• Silverlight and ActiveX cookies store identification objects directly on the client using the Silverlight or ActiveX plugin, respectively.

• PNG cookies consist of images that are placed in the client’s browser cache, whereby the individual RGB pixel values represent the actual unique identifier. Once a PNG cookie is stored, successive requests for that image are answered with HTTP 304 Not Modified responses, indicating that the client should load the cached version of the image, which then can be read and used for tracking.
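The PNG-cookie idea can be sketched in a few lines, encoding an identifier into RGB triples (a real tracker would serve these pixels as an actual cached PNG image; this pure-Python sketch only shows the encoding step):

```python
def id_to_pixels(identifier: str):
    """Pack the bytes of an identifier into RGB triples (zero-padded)."""
    data = identifier.encode()
    data += b"\x00" * (-len(data) % 3)          # pad to a multiple of 3
    return [tuple(data[i:i + 3]) for i in range(0, len(data), 3)]

def pixels_to_id(pixels):
    """Recover the identifier by reading pixel values from the cached image."""
    data = bytes(b for px in pixels for b in px)
    return data.rstrip(b"\x00").decode()

pixels = id_to_pixels("user-4711")
print(pixels_to_id(pixels))  # user-4711
```

On a revisit, the browser loads the image from its cache (after the 304 response), a canvas reads the pixel values back, and the identifier is reconstructed without any cookie being stored.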

HTTP

The HTTP protocol provides several header properties generated at the client side that can give clues about a system.


• The user agent includes the name and version string of the browser. OS version and particular system specifications might also be included.

• The accept language header specifies the language of the browser, while the Accept header field contains a list of supported MIME types for the content. Browser add-ons and plug-ins (e.g., Adobe Flash) might extend this list and can thus be detected by analyzing this header field.

• The ETag value is mainly used for caching; however, it might be misused to assign unique tracking identifiers to individual clients.

• The X-Forwarded-For field is inserted into a client’s request header by HTTP proxies and might reveal the client’s IP address.

• The referer header field includes the origin of the request and is only set when the user reaches the target resource, e.g., via a hyperlink.

Storage

Many browsers also provide functionality to store data directly on the client.

• The local storage is a client-side storage buffer that can be addressed via JavaScript and is restricted by origin policies, i.e., a local storage object may not be shared across distinct origins. Similar to cookies, unique identifiers can be stored on the client host.

• The session storage provides similar storage capabilities, however, the stored objects are destroyed when the browser application is shut down.

• Web SQL and Indexed DB are interfaces for local databases, in which objects can be stored for a longer period of time. The Web SQL interface, however, is no longer maintained, while Indexed DB is not fully supported by all browsers yet.

• The userData extension—available in Internet Explorer only—allows storing larger fragments of data.

• Caching can be implemented via the browser object window.name. It is capable of storing data that persists across domain contexts, thus allowing cross-domain tracking. Similar to the session storage, the data is destroyed when closing the browser.


Browser Object Model

Various browser attributes, accessible via the Browser Object Model (BOM), can be used in the generation of unique fingerprints.

• The set of available fonts correlates with installed applications, themes, and plugins. Additional software may install special fonts.

• A list of installed plugins can be iterated via the navigator.plugins object or by probing for particular image URLs on Chrome [22]. Plugins are commonly installed to modify the browser’s behavior and to add new features, but as the navigator.plugins object is (by default) not sorted, it might allow a more fine-grained breakdown of clients. Its order depends on the date of installation and may vary across different end hosts [119]. However, plugins such as Adobe Flash or ActiveX might also introduce further attributes that can be used for device tracking.

• Generating a Canvas element provides useful system information [114]. The rendering process and the pixel values of the resulting image are influenced by the rendering engine, hardware, and system-specific libraries. Thus, the image might vary on each client system.

• The integer precision may vary between different browsers and versions and thus help to distinguish client systems.

• Additional information about the browser and the system is directly accessible via the BOM. This includes the currently used language, the URL of the root document (Root-URL), the current timezone, CPU and OS versions, as well as display properties like screen width, height, and color depth.
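Conceptually, such attributes are concatenated in a canonical order and hashed into a single identifier. A minimal sketch (the attribute names and values are illustrative, not those of any particular library):

```python
import hashlib

def fingerprint(attributes: dict) -> str:
    """Hash a sorted attribute dict into a stable fingerprint string."""
    canonical = "|".join(f"{k}={v}" for k, v in sorted(attributes.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

bom = {
    "user_agent": "Mozilla/5.0 (Linux; Android 5.0)",
    "language": "de-DE",
    "timezone": "-60",
    "screen": "1080x1920x24",
}
fp1 = fingerprint(bom)
fp2 = fingerprint({**bom, "language": "en-US"})  # one changed attribute
print(fp1 != fp2)  # True: a different configuration yields a new fingerprint
```

Sorting the attributes before hashing makes the identifier independent of collection order, while any change in a single attribute value produces an entirely different hash.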

Server-side

Besides these groups of features, network specifications are also of interest for fingerprinting and are used by at least some of the examined tracking libraries. Note that the following features are gathered at the server side; therefore, we cannot make an assured statement about which library does not implement methods for obtaining them.

• The MAC address is the unique identifier of the underlying networking device.

• The IP address can also be obtained and may reveal the client’s location via GeoIP lookups.

• IP / TCP headers might provide additional feature values for device tracking, e.g., to guess the uptime and operating system of the user’s host.


Note that all HTTP-based properties might also be obtained at the server side. We focus on the effectiveness of these features and possible fingerprinting evasion scenarios in this chapter. In general, we observe that attributes can be easily obtained from the Browser Object Model (BOM) without performing sophisticated calculations. Table 2.1 outlines the extracted features that are used for generating fingerprints and storing unique identifiers. Additionally, we find that all libraries collect information about these characteristics: i) user agent, ii) display properties, iii) timezone setting, and iv) CPU & OS versions.

Table 2.1.: Tracking libraries and the applied fingerprinting techniques

[The table layout could not be recovered from the extraction. Its columns list the nine examined tracking libraries (AFK Media, Analytics Engine, BlueCava, Device Ident, Inside Graph, Iovation, ITT, Max-Mind, Threat-Metrix); its rows mark which fingerprinting techniques each library applies, grouped into cookies (HTTP, Flash, Silverlight/ActiveX, PNG), HTTP headers (mimetypes, language, referrer, XFF), storage (local storage, WebSQL, userData, caching), and BOM attributes (fonts, plugins, canvas, Root-URL, integer precision).]

2.2. Analysis of Browser Fingerprinting Libraries

2.2.2. Effectiveness for Mobile Devices

Intuitively, many of the features implemented in the examined libraries do not work for mobile devices (e. g., Flash or Silverlight cookies). Therefore, we study whether common fingerprinting methods for desktop computers are also applicable to mobile devices. To do so, we analyzed a set of real-world data consisting of values aggregated via common fingerprinting methods by an advertisement service provider. This dataset (collected in June and July 2014) includes features collected from over 15,000 client systems; in total, 211,652 feature values were obtained from desktop computers and mobile devices. For privacy reasons, the data was anonymized and freed from personal identifiers. We divided the data into two subsets by filtering the user-agent string for desktop and mobile device identifiers. The first subset SMobile was extracted by filtering the complete data for mobile device specifiers. The second subset SDesktop consists of data from desktop computers and has the same size as SMobile. Each subset includes over 2,100 representative devices with about 35,000 feature values. These features comprise many HTTP header fields and are aimed primarily at fingerprinting desktop computers. In the following, we refer to them as the desktop feature set. We measured the information gain of the features in each set with respect to the classes using the Kullback-Leibler divergence (KLD) [57] to obtain an information score for every feature. A higher score represents a higher entropy and hence a higher information content. The scores do not provide percentage values about detection, but allow us to compare the average information content of each feature in both subsets. The features ranked according to their information gain are shown in Table 2.2.
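The per-feature scoring can be sketched as follows. This is a minimal illustration of the Kullback-Leibler divergence on empirical value distributions, not the exact scoring pipeline used in the study; the choice of reference distribution and the smoothing constant are assumptions.

```python
from collections import Counter
import math

def value_distribution(values):
    """Empirical probability distribution over observed feature values."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in bits; eps guards against zero probabilities in Q."""
    return sum(pv * math.log2(pv / (q.get(v, 0.0) + eps))
               for v, pv in p.items() if pv > 0.0)
```

A feature whose value distribution diverges strongly between the two subsets (e.g., plugins, which are diverse on desktops but standardized on mobile devices) receives a higher score than one whose distributions are nearly identical.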

Table 2.2.: KLD results for SDesktop and SMobile

        SDesktop                        SMobile
Score   Feature                 Score   Feature
6.784   plugins                 4.551   accept language
6.730   mimetypes               4.533   user agent
5.920   user agent              3.601   language
5.865   accept language         2.119   timezone
5.213   plugin versions         1.843   screen y
4.419   fonts                   1.492   canvas
4.185   language                1.284   screen x
2.952   canvas                  1.106   mimetypes
2.755   screen x                0.939   accept encoding
2.426   screen y                0.605   plugins
2.095   timezone                0.184   accept
1.750   accept encoding         0.072   color depth
1.118   accept                  0.058   plugin versions
0.700   color depth             0.044   fonts


Note that almost every feature provides less information when applied to mobile devices. The fact that the scores in SMobile are generally lower than those in SDesktop is a first hint that features which perform well for fingerprinting desktop computers may not achieve the same precision when applied to mobile devices. The most descriptive features for desktop computers in our dataset are browser plugins and mime types. These attributes are standardized for most mobile devices and usually cannot be changed by the user. Due to this lack of customization, these two features have a low information score for mobile devices. The user agent, however, seems to provide valuable information for both desktop computers and mobile devices. Its score is nevertheless lower for the mobile subset, meaning that a classification by user agent would be less precise for mobile devices than for desktops. The high standardization of mobile devices results in less diversity of attributes like fonts, screen size, and color depth. Also, there are only a few possibilities to customize mobile devices: Standard browsers often do not support plugins natively, and the installation of non-standard browsers is rare.

After determining the descriptive power of features we have seen in the wild, we investigate their utility for device recognition as follows. As ground truth for the classification, we use a device ID, which is a hash value stored in a cookie. To analyze the two subsets of our dataset, we used a C4.5 decision tree model. The evaluation showed that 91.45% of SDesktop were correctly classified using the desktop feature set. In contrast, the model was able to correctly classify only 37.16% of SMobile using the same feature set. This decrease in correct classification had already been foreshadowed by the information gain analysis. The low classification rate for the subset of mobile devices substantiates our claim that features which perform well for fingerprinting desktop computers are not necessarily appropriate for fingerprinting mobile devices as well.

2.3. Fingerprinting of Mobile Devices

2.3.1. Attribute Selection

As shown in the previous section, existing fingerprinting techniques lack precision for mobile devices. We now propose a feature set that is particularly applicable for fingerprinting mobile devices. This feature set consists of properties and attributes that have been aggregated by instrumenting the browser environment using JavaScript. We aim to study the effectiveness of the feature set for mobile devices, even if not all of these features are exclusively available on them. We divide the characteristics of a mobile device into four categories and discuss each in the following.


Browser Attributes

Browser applications already provide various information about the system's environment. We discovered that common mobile web browsers—Android's native browser, Google Chrome, Firefox, IE Mobile, Opera, Opera Mini, and Safari—reveal information about the browser version, the OS, and the underlying rendering engine. Furthermore, Android's native browser, Chrome (the two most frequently used browsers on Android devices [68]), and Safari also provide the device manufacturer and model, and the browser's language. IE Mobile and Opera allow the detection of device manufacturer and model as well. Additionally, we obtain further browser attributes such as the "Do-Not-Track" (DNT) option, the capability of storing cookies, and support for Local Storage and Java. We can also detect whether the browser blocks popups by default and—if newer web technologies are supported—the standard search engine.

Whereas on desktop computers features like supported mime types and installed plugins change with the installation or uninstallation of software, on mobile devices changes to these features are very uncommon and generally imprecise (see Section 2.2.2). We tried to determine whether specific protocol handlers are registered with the browser (e.g., the one for Skype), thus revealing whether specific applications are installed. Although a list of installed apps may provide identifying information [4], this is a noisy procedure and interferes with user operations, as a message will pop up asking for the application to handle the protocol. In any case, the user will be alarmed. As this is not in our interest, such features are out of scope.

To build a ground truth for evaluation, we set device IDs for newly occurring devices that are either contained in the local storage or stored as a cookie. If a device is already flagged by such an ID, it is identified as a re-visitor. We found the HTML5 File API to be too noisy for storing such IDs, as it requires user permission to store this data.
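The ground-truth IDs can be generated as in the following sketch. The exact format is an assumption modeled after the example values shown later in Table 2.3: random characters plus a time-based component to avoid collisions.

```python
import secrets
import time

def new_device_id():
    """Ground-truth identifier stored in a cookie and in Local Storage:
    random URL-safe characters plus a time-based component, so that two
    devices first seen at the same moment still receive distinct IDs."""
    return f"{secrets.token_urlsafe(12)}-{time.time()}"
```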

System Attributes

Due to the high standardization of mobile browsers, devices cannot be expected to be distinguishable by browser information alone. For this reason, we aim to gather more information about the device system itself. However, we still use the browser to obtain this information and are therefore subject to certain restrictions due to sandboxing and limited permissions. Furthermore, we want to employ low-noise fingerprinting, i.e., we do not install any app or perform any activity that raises a user's suspicion. With these restrictions in mind, we are able to obtain the following system-wide information. From the navigator object, the screen width and height and the display's color depth are extracted. Additionally, the OS name and version that are provided by

the navigator are useful for our purposes. Most current versions of common browsers also yield information about the current connection type: We learn whether the device is in a WiFi network or using a mobile connection like 3G or 4G.

Besides the connection type, we gather information about the environment of the mobile device. More specifically, we obtain the device's timezone by calculating the time offset to 13 different time points and building a hash of the differences. We also store the device's IP address and the hostname of the network node, e.g., a WiFi router. To obtain a more general view, the hostname is masked with a wildcard, which can be used as an additional feature. Hostnames often look like ip-xxx-xxx-xxx-xxx.web.provider.com, consisting of the device's IP address and the network provider's (sub)domain. The hostname wildcard in this example would be *.web.provider.com, which allows grouping devices based on the network they are logged into. We use MaxMind GeoIP2 [67] to determine geographical information about the current location of the device.

We also implemented an Apple AirPlay detector. AirPlay receivers listen for local network devices to potentially stream media content. We implemented a function that requests to stream an audio file if a mobile device is connected to WiFi and has already been identified as running iOS. This makes the AirPlay protocol return a list of available devices able to play the file. After receiving this list, we abort and withdraw the streaming request. The list of AirPlay-enabled network devices may provide information about the environment (e.g., whether a user owns an AppleTV).

Additional system-specific attributes like active widgets, enabled/disabled phone encryption, or developer options were not accessible through any web browser. Certainly, it might be possible to check these options when running an app such as an ad tracker or when having unlimited access to the underlying system [88]. However, in our scenario, we are restricted to browser techniques. Additionally, we considered measuring the device's CPU and memory as well as the network-based and GPS-based location. We also developed a JavaScript-based network scanner that determines the device's local IP address next to other network devices such as routers. However, as we do not want to arouse suspicion, we decided to omit such attributes: Scanning the local network or testing CPU and memory would cause high load and performance loss, and determining the GPS-based location usually leads to a popup asking the user for access permission.
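The hostname masking and the timezone hash described above can be sketched as follows. The label-splitting rule and the use of MD5 are assumptions; the original implementation may condense the offsets differently (Table 2.3 shows an integer timezone id).

```python
import hashlib

def hostname_wildcard(hostname):
    """Replace the device-specific first label with a wildcard, keeping
    the provider's (sub)domain as a coarser grouping feature."""
    labels = hostname.split(".")
    if len(labels) < 3:
        return hostname  # too short to generalize safely
    return "*." + ".".join(labels[1:])

def timezone_id(utc_offsets_minutes):
    """Condense the UTC offsets measured at several reference time points
    (13 in the described implementation) into a single hash identifier.
    Sampling offsets at multiple dates also captures DST behavior."""
    payload = ",".join(str(o) for o in utc_offsets_minutes)
    return hashlib.md5(payload.encode()).hexdigest()
```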

Hardware Attributes

The restrictions of browser permissions mentioned above mean that barely any hardware metadata can be accessed. As such, we are not capable of obtaining identifiers like serial numbers of specific hardware elements, e.g., the camera module. Nevertheless, we aggregate the following three attributes in the browser context: the

device's platform, the number of the device's touchpoints, and the availability of a vibration motor. Additionally, we can access a device's gyroscope and accelerometer via JavaScript, which is commonly used in browser-based games. Prior work has shown that these sensors have imperfections that vary among different devices [39]. To determine these imperfections, we implemented a function to gather accelerometer and gyroscope data and used this data as another descriptive feature. Please note that we do not have information about the user's current activities, and the device may be moving while this data is gathered. To avoid distortions, we filter out these movements based on the amplitude of acceleration. Other hardware information, such as the availability of a second SIM card slot or sensor specifications, is not available to the browser.
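The motion filtering can be sketched like this. The threshold value and the way the remaining readings are condensed into a single value are assumptions; the point is that samples dominated by user movement are discarded so that the result reflects the sensor's constant imperfection.

```python
def accelerometer_key(samples, motion_threshold=0.5, g=9.81):
    """Condense raw accelerometer readings (x, y, z in m/s^2) into one
    value. Samples whose magnitude deviates strongly from 1 g indicate a
    moving device and are dropped; the rest are averaged."""
    still = [s for s in samples
             if abs((s[0] ** 2 + s[1] ** 2 + s[2] ** 2) ** 0.5 - g) < motion_threshold]
    if not still:
        return None  # device was in motion for the whole measurement
    return sum(x + y + z for x, y, z in still) / len(still)
```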

Behavioral Attributes

We also implemented three functions to gather more detailed information about a device's user, including behavior.

First, as we aimed to learn about the user's browsing habits, we implemented a timing-attack technique for history stealing [144]. The rendering of visited hyperlinks differs from the rendering of unvisited links in various browsers. This fact can be used to determine whether a user has visited specific websites by measuring the rendering time of specific links using the JavaScript function requestAnimationFrame. In our experiments, we decided to check whether a user visited the websites of Amazon, Ebay, , Google, , and Zalando—each with different top-level domains—to also gain information about the user's localization. We chose these websites because we can expect them to have a large user base, and hence there is a fair chance for a random Internet user to be logged in at one or more of them. As a limitation of this feature, every website is defined as unvisited after a user clears the browser history.

Second, we query popular websites from the user's browser for objects that are only accessible to logged-in users, e.g., a specific image. More precisely, a URL is prepared so that a logged-in user gets redirected to specific content, whereas the website's login screen is shown to a non-logged-in user. This URL is called in the background: If it loads correctly, we can assume that the user is logged in, and otherwise logged out. This method can be applied to several popular websites, although the URLs are built slightly differently for each site. Additionally, we load a publicly accessible image to detect text-based browsing. Hence, if a user disabled image loading completely, we do not classify such a case the same as a user who is not logged in to any of the tested websites.

Third, we implemented a function to measure the user's typing speed. To this end, a text field (e.g., a CAPTCHA) is placed on the website, which can then be monitored for user input. Once the user starts typing, a timer is triggered that stops after the user has not pressed a key for a certain time. The average number

of letters per second is then calculated and used as an attribute for user behavior fingerprinting. As the typing speed can vary even for a single user, it is not meant to be a sole identifier for a person. However, in combination with the other features, the typing speed may improve our classification.
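A possible realization of the typing-speed measurement, assuming keystroke timestamps in seconds and a hypothetical idle timeout:

```python
def typing_speed(timestamps, idle_timeout=2.0):
    """Average letters per second. The measurement stops once the gap
    between two consecutive keystrokes exceeds idle_timeout seconds."""
    if len(timestamps) < 2:
        return 0.0
    end = 1
    while end < len(timestamps) and timestamps[end] - timestamps[end - 1] <= idle_timeout:
        end += 1
    duration = timestamps[end - 1] - timestamps[0]
    return end / duration if duration > 0 else 0.0
```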

2.3.2. Formalization

In summary, our feature set for fingerprinting mobile devices consists of several attributes of the device's browser, system, and hardware, and a small amount of user behavior. Our aim is to develop a system that, based on the features previously described, can perform two operations:

• Recognizing new devices, i.e., devices that have never visited our service before.

• If a device is not new, recognizing and associating it with a device that has already visited our service.

We formalize this problem as an iterative algorithm, where each iteration corresponds to a device connecting to the service. For each iteration $i$ of the algorithm, we define a set of known devices

$K^i = \{k_1, k_2, \ldots, k_n\}, \quad n, i \in \mathbb{N},$

where $n$ is the number of devices already known to the system at the current iteration step, and $i$ is the iteration index. Then, we define a set of feature vector sets

$F^i = \{A^i_1, A^i_2, \ldots, A^i_n\},$

where $A^i_n$ is a generic set containing the feature vectors associated with the accesses made by the generic device $k_n$ at the current iteration. This can be expressed by

$A^i_n = \{f_{n1}, f_{n2}, \ldots, f_{na}\}, \quad a \in \mathbb{N},$

where $a$ is the number of accesses made by the device $k_n$ at the current iteration. The generic feature vector $f_{na}$ is then defined by

$f_{na} = \{m_{na1}, m_{na2}, \ldots, m_{nad}\}, \quad d \in \mathbb{N},$

where $d$ is the number of features described above. In this formulation, each device $k$ is associated with one feature vector set $A$: $K^i \to F^i$. For each device $k_u \to A_u$ that visits our service, we initially suppose $A_u = \{f_u\}$, where $f_u$ is the generic feature vector associated with the input device. Under this condition, we have to find the known feature vector $f_{min} \in A^i_{min}$ that belongs to the known device $k_{min}$ and that is most similar to $f_u$. This vector is given by

$f_{min} = \operatorname*{arg\,min}_{f \in A^i,\; A^i \in F^i} D(f, f_u),$

where $D$ is a dissimilarity function among feature vectors, i.e., a function that measures how much two feature vectors differ from each other.

28 2.3. Fingerprinting of Mobile Devices

More precisely, our approach resorts to the dissimilarity function $D$ in order to extract the closest points to the input feature vector $f_u$ in the feature space. We define $D$ between two feature vectors $f_1$ and $f_2$ as

$D(f_1, f_2) = \frac{\sum_{i=1}^{d} w_i \cdot c_i(f_1, f_2)}{\sum_{i=1}^{d} w_i},$

with $d$ as the number of features, and $w = \{w_1, w_2, \ldots, w_d\}$ as the feature weight vector, calculated by means of the information gain $IG$. Thus, for each feature $m_i$, we calculate its weight as $w_i = IG_i = H(D) - H(D|m_i)$, where $H(D)$ and $H(D|m_i)$ are the entropy values for a specific device (considering all its accesses) before and after observing the feature $m_i$. We also define $c(f_1, f_2) = \{c_1, c_2, \ldots, c_d\}$ as a vector whose generic component $c_i$ is calculated as follows:

$c_i = \begin{cases} 0 & \text{if } f_{1i} = f_{2i} \\ 1 & \text{otherwise} \end{cases}$

As all the features in $f_1$ and $f_2$ are encoded as numbers, they contribute to the distance only if they have different values. The system determines the feature vector $f_{min}$ with the lowest distance $D$ from $f_u$. If $D(f_u, f_{min}) \leq \delta$, the devices described by $f_u$ and $f_{min}$ will be matched; otherwise, the input feature vector $f_u$ will be associated with a new device and added to the system database. With $\delta$ as a dissimilarity threshold, we define:

$\begin{cases} k_u = k_{min} & \text{if } D(f_{min}, f_u) \leq \delta \\ K^i \cap \{k_u\} = \emptyset & \text{if } D(f_{min}, f_u) > \delta \end{cases}$

The first condition means that if the dissimilarity between the feature vectors $f_u$ and $f_{min}$ is lower than the threshold $\delta$, then $k_u$ is already known to the system and corresponds to the device $k_{min} \to A^i_{min}$. This also means that the actual set of devices does not change in the next iteration: $K^{i+1} = K^i$. The corresponding set of feature vectors $A^i_{min}$ must be updated with the latest, recognized access:

$A^{i+1}_{min} = A^i_{min} \cup \{f_u\} \quad \text{and} \quad F^{i+1} = F^i \cup \{A^{i+1}_{min}\}.$

The second step is recognizing the device $k_u$, depending on the results of step one. In particular, the second condition means that if the dissimilarity between $f_u$ and $f_{min}$ is higher than the threshold $\delta$, then $k_u \to A_u$ is defined as unknown. In this case, $k_u = k_{n+1}$, as it is a completely new device that must be added to the list of known devices, and we therefore obtain a new set of devices $K^{i+1} = K^i \cup \{k_{n+1}\}$. Consequently, we define a new set of known feature vectors for the next iteration: $F^{i+1} = F^i \cup \{A_{n+1}\}$. The iteration index $i$ is increased by one so that the system is ready for the next iteration.
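The dissimilarity function and one iteration of the matching algorithm can be sketched as follows; feature vectors are assumed to be already numerically encoded lists, and the weights precomputed from the information gain.

```python
def dissimilarity(f1, f2, weights):
    """D(f1, f2): information-gain-weighted sum of mismatching feature
    values, normalized by the total weight."""
    mismatch = sum(w for a, b, w in zip(f1, f2, weights) if a != b)
    return mismatch / sum(weights)

def recognize(f_u, known, weights, delta):
    """One iteration: match the input vector f_u to the most similar
    known device, or register it as a new device. `known` maps device
    IDs to the list of feature vectors of their recorded accesses."""
    best_id, best_d = None, float("inf")
    for dev_id, accesses in known.items():
        for f in accesses:
            d = dissimilarity(f_u, f, weights)
            if d < best_d:
                best_id, best_d = dev_id, d
    if best_id is not None and best_d <= delta:
        known[best_id].append(f_u)   # A_min is updated with the new access
        return best_id, False
    new_id = len(known)              # K grows by one previously unseen device
    known[new_id] = [f_u]
    return new_id, True
```

The two return values mirror the two conditions above: a matched known device (threshold met) or a newly registered one (threshold exceeded).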


2.4. Evaluation

We implemented the features presented in the previous section and built a fingerprinting system based on the proposed approach. To gather real-world data, we set up an online survey that can be visited by any Internet user to check whether a mobile device can be tracked using our fingerprinting methods. Most features (especially the browser attributes) are gathered via GET and POST requests, while we leverage callbacks for asynchronous features. More precisely, we created a PHP parser to catch the return values of these callbacks. This is necessary for accessing the browsing history, fingerprinting the accelerometer, and measuring the typing speed; gathering this information takes more time than querying navigator objects. The typing speed was measured by including a text field where the user has to type two words shown in a CAPTCHA. In total, 45 features of the previously explained categories browser, system, hardware, and behavior were collected. These features are listed in Table 2.3.

We spread the link to our online service via three different mailing lists, addressing university students, IT security researchers, and persons without any IT expertise. In total, almost 900 users participated in this study, of whom 724 used a mobile device like a smartphone or tablet. Of these mobile users, 459 accessed the test more than once over a duration of four months. These re-visitors, who participated at least twice, are recognized by a cookie ID and a local storage ID that serve as a ground truth for our evaluation. This choice is aimed at correctly evaluating the capability of the system to recognize devices that have visited our service and to detect previously unseen devices. The features used for fingerprinting are of different data types, including integers, floats, hashes, bits, strings, and plain text. An exemplary list of these features, their data types, and a real-world value is presented in Table 2.3.
Please note that the identifier used as ground truth (the ID stored in a cookie and in the local storage) contains random characters as well as a time-based component to avoid collisions of identifiers. There may be an information overlap between single features, e. g., operation system and is Android, which seems redundant but enables swift analyses and hence faster results and insights; during the machine learning process, these redundancies are eliminated. Additionally, we performed an encoding for non-numerical features: A number is assigned to every unique value of each feature so that every feature occurrence can be represented by a numerical value, enabling fast comparisons in further analyses.

The majority of devices—about 64 %—are Android (mostly version 4.4) systems on ARM platforms, followed by iPhones with about 27 % (mostly iOS 7.1). These two systems and architectures are not only the biggest groups in our test but also cover the main market share of mobile devices in the world. We found Windows phones and Blackberry devices to be a minority in our survey.
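The numerical encoding of non-numerical features can be sketched as a simple per-feature codebook; the original implementation may differ in details.

```python
def encode_features(records):
    """Map every distinct value of each feature to a small integer so
    that later comparisons reduce to integer equality checks."""
    codebooks = {}
    encoded = []
    for record in records:
        row = {}
        for feature, value in record.items():
            book = codebooks.setdefault(feature, {})
            # assign the next free code the first time a value occurs
            row[feature] = book.setdefault(value, len(book))
        encoded.append(row)
    return encoded, codebooks
```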


Table 2.3.: Feature data types and example values

Feature               Type       Example
devicefingerprint     string     4812169833755445458
revisitor             bit        1
ismobile              bit        1
cookie id             string     QoSQIymCwjg0augzsD41-1415043670.767
localstorage id       string     rQG4fVJaDBNFtOyKdCL1-1415011415.67
mimetypes             text       [{"n":"video/3gpp2","d":"3GPP2 media","f":[...]}, ...]
mimetype hash         hash       b96eebf2fd3fff0e165d77e75474ffaf
plugins               text       [{"n":"Shockwave Flash","d":"[...]Flash 11.1 r115",...}]
plugins hash          hash       4e23a836cea77cf4af09affff2b64a75
plugins num           int        6
canvas hash           hash       ea907f4cd06cf0f310d7acf62f2ffff6
useragent             text       Mozilla/5.0 (...)
vendor                text       Google Inc.
productsub            text       20030107
is chrome             bit        0
popup blocker         bit        1
navigatorlanguage     text       en-en
filesystem access     bit        1
cookies enabled       bit        1
dnt enabled           bit        0
java enabled          bit        0
loginstatus           bitstring  10111
history               bitstring  1000100000
screen height         int        960
screen width          int        600
display colordepth    int        32
display orientation   text       landscape
platform              text       Linux armv7l
operation system      text       iOS 7.1
is Android            bit        0
is iOS                bit        1
touchpoints           int        5
has vibration         bit        1
airplay ref           bit        1
typingspeed           int        229
accelerometer key     float      938.143359751
connection            text       wifi
hostname              text       ip-xx-xx-xx-xxx.web.provider.com
hostname wildcard     text       *.web.provider.com
timezone id           int        662525310
ipaddress             string     xx.xxx.xx.xxx
country               text       Germany
city                  text       Bochum


2.4.1. Feature Distribution

Device fingerprinting benefits from features with high diversity among devices, whereas features with a small distribution of values are not meaningful for recognizing devices. We now provide insight into the distribution of the features. The number of distinct values of a feature is not necessarily related to its importance. For instance, if only a few devices had the "Do-Not-Track" option enabled, such devices could be grouped well, even though this feature only allows two distinct values. Nevertheless, we find features with many distinct values, such as accelerometer benchmarks or the user agent, to be of high relevance for distinguishing devices.

We define a feature as volatile if, even for the same device, it can take multiple values. This volatility can occur for attributes that change due to the environment, events, or actions. For example, the accelerometer data provides very precise float values of a device's sensors, which results in slightly different values for the same device. Different environments—a user may be in an accelerating bus once and sit still in a room another time—may cause this difference. That is why such values need to be grouped, or in some way condensed, to be a useful feature. Nevertheless, device recognition becomes easier as more data about a device is obtained. The hostname attribute exemplifies this situation: If a user re-visited our experiment with two different connections, e.g., WiFi and mobile network, both network node hostnames would be registered with this device, and recognizing the device when it uses one of these known hostnames becomes easier. Changing the mobile cell is an event that also affects this feature. Furthermore, a device's network hostname is often used as a native identifier by network providers. Hence, we are able to divide the mass of web users into groups based on their network-based location or at least their ISP. Please note that volatile features have to be treated carefully: While some features need to be condensed reasonably (like the accelerometer data), other features provide more information with every change (like the hostname).

The number of IDs stored in cookies and local storage is higher than the number of re-visitors. This can be explained by the deletion of cookies: If a user deleted cookies or local storage after taking our test and then re-visits the website, a new ID will be generated and set. The fact that there are fewer cookies than overall participants indicates that not everybody deleted their cookies and local storage afterward. We found typing speed and browsing history to have different values per user by trend, which makes them very discriminative for a device's user, as they stay the same for the re-visitors of our online test even when cookies and local storage are cleared. Whereas accelerometer benchmarks may vary even for the same device, the browsing history of a person stays the same until it is deleted manually. We expected the login status to behave similarly for the same reason, but it turns out that only a

minority of users use the browser to log in to services that provide an alternative app. Users tend to use specialized apps (e. g., for Facebook or Twitter) instead of their mobile browser for these services.

To have a ground truth, we only take re-visitors into account, because our aim is to show the capability of the selected features to recognize known devices without relying on cookies or other unique identifiers. The recognition of visits from known devices is carried out by resorting to the nearest-neighbor approach. To uniquely identify each device, we computed a hash value from the feature values associated with it. Furthermore, we encoded every occurring value of all naturally non-numerical features to accelerate comparison operations.

2.4.2. Recognition of Mobile Devices

In this section, we describe our implementation of a system for recognizing mobile devices, realizing the formalism and features described in Section 2.3. We also present the dataset and the experiments conducted to assess the system's performance.

The system applies a nearest-neighbor matching approach (essentially a 1-NN) to detect known and unknown devices. This choice is related to the matching nature of the problem, for which this classifier exhibits good performance. We performed an experiment in which our system was designed to detect unknown devices and, at the same time, to match known devices to the correct ones. We ran this experiment under two possible scenarios:

1. Single-Iteration mode: In this scenario, we suppose the website has already been visited by a number of devices. The goal is to recognize whether new devices have visited the system without updating its list of known devices. This is done to verify how many new devices can be correctly detected by the system during a single iteration. At the same time, the system must also be capable of recognizing multiple accesses of the same device.

2. Multi-Iteration mode: In this scenario, the visits of each device are considered one after the other, and they are simulated at different times. After each iteration, the list of known devices is updated by adding the features related to the new visit. This procedure completely reproduces the algorithm described in the previous section.

We believe that these two scenarios are representative of typical real-world situations and give a good overview of the general performance of our system.


Single-Iteration Experiment

In the first experiment, we evaluated the matching properties of our system when a database of known devices that visited the system is built beforehand. For each device, the features related to different visits are stored. The aims of this experiment are the following. First, detecting known devices, i.e., finding in the database the device that correctly corresponds to the input of our system; we refer to this case as a match. Second, detecting mismatches, i.e., successfully performing two operations: a) correctly distinguishing a never-seen device from all the ones included in the database; b) correctly recognizing all the devices in the database that are different from the input of our system; we refer to this case as a reject. The choice of the terms match and reject comes from the similarity of this problem to the ones found in biometrics: In a typical biometric setting, the system should be able to authenticate (match) or refuse (reject) a user trying to access the system. We believe that this terminology is useful in the scenario at hand.

In this first experiment, we split our dataset into three pairs. Each pair is composed of a reference set and a test set, each consisting of 206 accesses. Using multiple pairs of reference and test sets reduces the chance that specific results are obtained because of a lucky or unlucky reference-test split. We then extracted the feature weights with a ten-fold cross-validation calculated on the reference set. Finally, given each reference-test pair, we verified the performance of our system on the corresponding test set. This assessment was repeated under different scenarios: a) all features are used for the detection; b) the most discriminant features (i.e., features with the highest weights) are progressively removed from the feature set. This evaluation is of particular interest when important information such as cookies or the IP address is removed.
Figure 2.1 shows the ROC (Receiver Operating Characteristic) plot that measures the average performance of the system on the three reference-test splits under multiple scenarios. On the y-axis, we report the fraction of correctly matched devices, while the x-axis reports the errors in rejecting devices. Each point of the ROC curve corresponds to a value of the threshold δ; the optimal threshold is given by the point closest to the upper-left corner of the plot. From the obtained plot, we observe that the system performs excellently at detecting new devices and at recognizing already seen ones. Performance does not change much when features like cookies, hostnames, or IP addresses are excluded from the feature set, because the weights are distributed over the features in such a way that no single feature completely dominates the others. For the same reason, when all features are considered (including cookies), the system does not achieve a 100% detection rate with zero false positives. Of course, this figure would change by

34 2.4. Evaluation

1

0.8 e t a R

e 0.6 v i t i s o P

0.4 e u r T 0.2

0 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 False Positive Rate All Features No cookie (1 feat. removed) No local storage and host (3 feat. removed) No IP (4 feat. removed)

Highcharts.com Figure 2.1.: Average ROC performance chart for the single-iteration experiment increasing the weight assigned to cookies. However, this would compromise the general performances when such dominant features are not considered.
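The match/reject decision driven by the threshold δ can be sketched as a nearest-reference comparison. The weighted dissimilarity below (sum of the weights of mismatching features) is a simplified stand-in for the system's actual distance measure; the feature names and weights are illustrative.

```python
def dissimilarity(sample, reference, weights):
    """Weighted mismatch score: sum of weights of features that differ."""
    return sum(w for f, w in weights.items() if sample.get(f) != reference.get(f))

def classify(sample, database, weights, delta):
    """Nearest-reference matching: match if the closest known device
    is within the threshold delta, otherwise reject as never seen."""
    best_dev, best_score = None, float("inf")
    for device_id, refs in database.items():
        for ref in refs:
            score = dissimilarity(sample, ref, weights)
            if score < best_score:
                best_dev, best_score = device_id, score
    if best_score <= delta:
        return ("match", best_dev)
    return ("reject", None)

weights = {"cookie_id": 1.5, "language": 0.8, "user_agent": 1.0}
db = {"dev1": [{"cookie_id": "a", "language": "de", "user_agent": "UA1"}]}
print(classify({"cookie_id": "a", "language": "de", "user_agent": "UA1"}, db, weights, 0.5))
print(classify({"cookie_id": "x", "language": "en", "user_agent": "UA9"}, db, weights, 0.5))
```

Sweeping δ over its range produces exactly the trade-off visualized by the ROC curves: a small δ yields few false matches but more false rejects, and vice versa.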

Multi-Iteration Experiment

The aims of this experiment are the same as those of the previous one, but in this case, we assess the performance of the system by strictly following the algorithm we proposed in Section 2.2. We therefore simulate that all the devices in the database visit our service one after the other, in strictly random order. At the first iteration, the reference set contains just one sample, and it dynamically grows after each visit. In this scenario, the system has no supervised knowledge and progressively adapts itself to recognize new devices. Of course, this means that the system may exhibit more matching errors, especially while the reference set is still small. Figure 2.2 shows the ROC curve for the same scenarios as in the single-iteration experiment. As the reference set dynamically grows after each visit, every score used to compute the ROC has been calculated on a different reference set. From the attained results, we observe that the performance of our system is significantly worse than in the previous experiment. However, this was to be expected, as this experiment starts with only one sample: errors in matching known devices and failures in recognizing new devices accumulate and affect the overall performance. Even under these conditions, however, the system still performs well at the cost of a moderate number of false positives.
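The iterative procedure, a reference database that starts with a single sample and grows with every visit, can be sketched as follows. The weighted dissimilarity is again a simplified stand-in for the system's actual distance measure, and the feature names are illustrative.

```python
def iterative_recognition(visits, weights, delta):
    """Process visits one after another; the reference database starts
    empty and grows with every visit."""
    database = {}   # device label -> list of stored feature vectors
    decisions = []
    next_id = 0
    for sample in visits:
        # find the nearest stored sample over all known devices
        best_dev, best = None, float("inf")
        for dev, refs in database.items():
            for ref in refs:
                score = sum(w for f, w in weights.items() if sample.get(f) != ref.get(f))
                if score < best:
                    best_dev, best = dev, score
        if best <= delta:                 # recognized: attach to known device
            database[best_dev].append(sample)
            decisions.append(("match", best_dev))
        else:                             # never seen: register a new device
            dev = f"dev{next_id}"
            next_id += 1
            database[dev] = [sample]
            decisions.append(("new", dev))
    return decisions

weights = {"cookie_id": 1.5, "language": 0.8}
visits = [
    {"cookie_id": "a", "language": "de"},
    {"cookie_id": "a", "language": "de"},   # same device revisits
    {"cookie_id": "b", "language": "en"},   # a new device
]
print(iterative_recognition(visits, weights, delta=0.5))
```

Early mistakes are sticky in this loop: a wrongly registered or wrongly matched sample stays in the database and influences all later decisions, which is why performance suffers most while the reference set is small.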

Figure 2.2.: Average ROC performance chart for the multi-iteration experiment (true positive rate over false positive rate; curves: all features, no cookie, no local storage and host, no IP)

In this case, we also observe that the performance depends much more visibly on the most discriminant features. In fact, without the IP address or the local storage ID, the system's performance decreases drastically. We speculate that highly discriminant features are important for accurate matching when only few reference samples are available. Figure 2.2 also confirms the trend shown in Figure 2.1: devices can be tracked, to some degree, even without resorting to cookies.

2.4.3. Evasion Resistance

A crucial aspect of our evaluation is the robustness of our system against evasion attacks. In our setting, an evasion attack is an attempt by a user to prevent a device from being recognized by a service the user visits more than once. The actions carried out by the user to evade detection affect elements such as system applications (e.g., browsers) or connection properties, and the feature values related to these elements change accordingly.

Changeability of Features

To shed light on the robustness of the proposed fingerprinting technique, we analyze the degree of changeability of the features we use. Whereas it might not be challenging for an experienced user to change some of these features, others cannot be easily changed. For example, some features change according to the context (e.g., the time zone changes when the user is traveling), but others remain immutable (e.g., hardware attributes).


Among the set of features whose values can be directly changed by the user, we include browser-based features and other features related to the user's behavior. The following list contains all features that the user can change with varying degrees of difficulty:

• popup blocker active
• browsing history
• DNT option enabled
• navigator language
• Java availability
• user agent
• cookies enabled
• display orientation
• autofill forms
• AirPlay availability
• filesystem access
• typing speed
• cookie ID
• local storage ID
• browser plugins
• login status
• accelerometer data

There are several easy-to-switch binary features. For example, the following elements of the set are linked to user options: Java, popup blocking, “Do-Not-Track”, autofill forms, filesystem access, cookies, and local storage. Although the user can block websites from using cookies and local storage, this may limit the functionality of the visited websites. For example, most online shops use cookies and local storage to track the items a user viewed during previous visits. Thus, a user would probably not want to disable these functionalities completely, but might regularly delete cookies and local storage data. Additionally, the user can influence the features related to the login status and the browsing history by regularly logging out from websites that require authentication and by regularly clearing the browser history. An advanced user who is aware that these features are gathered for fingerprinting purposes could easily create fake accounts and alternately log in and out to induce as much randomness as possible; the same applies to the history feature. To alter the navigator's appearance, a user could easily set a different language. A user is also free to use a different browser, which changes the navigator user agent completely and also affects the canvas fingerprinting method. However, this requires installing new software rather than just changing a browser's configuration. On the other hand, if the reference set for a given device contains samples related to visits by different browsers, a browser change will not affect the matching performance.


Some browser-independent attributes can also be influenced by the user: for example, the display orientation can easily be changed, but as the orientation may affect the usability of certain websites, it cannot be changed arbitrarily. The features related to the availability of Apple AirPlay can be influenced by deactivating the streaming service by default and enabling it only when used. However, heavily resorting to this procedure can affect the usability of the system. Typing speed and accelerometer measurements can be tampered with, too. If a user is aware that the typing speed is measured, it is easy to manually add randomness (e.g., by typing slower than usual). To distort accelerometer data, the user could constantly move the device in an unpredictable manner. Beyond the user's direct sphere of influence, the following features are likely to change on certain events:

• IP address
• hostname
• timezone
• country
• city
• connection type

Except for the connection type, these features depend on the location of the device. A mobile device's IP address is most likely volatile, as it changes according to the network provider's rules. The hostname and its corresponding wildcard are based on the IP address. Timezone, country, and city are determined with the help of the device's IP. Consequently, if a user spoofs the device's location, these features will change; using a proxy (e.g., an anonymization service) invalidates all location-based attributes. The connection type can be changed by the user as well, for example by alternately using public WiFi services and mobile data connections. Whereas some features can be directly influenced by the user or change according to the environment, others are (almost) completely immutable for a user in a normal setting:

• platform
• operating system
• vendor
• productsub
• vibration availability
• screen height
• screen width
• display color depth
• no. of touchpoints
• is iOS / is Android

These features depend on the device's hardware or operating system, and changing them would require a much higher effort. Even if a user is able to install another operating system on a device, the platform, vendor, and productsub cannot be changed. It may be possible to feign other vendors or use alternative display resolutions; however, faking the availability of vibration, the number of touchpoints, or display data requires a higher effort. Therefore, we expect these features to be immutable in our scenarios.

Our assumption is that the user first has to know that device attributes are captured at all. Additionally, the user needs to know which specific attributes are gathered and used for fingerprinting. A user who is aware of this information could purposefully and selectively randomize or fake these attributes. A user who only knows that the device gets fingerprinted, but not which specific attributes are used, could still perform a general randomization of common features to deceive standard fingerprinting techniques.

Evasion Attacks

The overall goal of evasion attacks is to prevent the fingerprinting system from recognizing a mobile device by changing specific properties of the device so that its feature values change accordingly. However, this is not an easy task for the user, because the probability of performing a successful attack also depends on the knowledge that the user has of the fingerprinting system. This means that a user who wants to evade the system would need knowledge of several elements:

• all the features that the system uses;

• the system detection algorithm and (in our case) its measure of dissimilarity between devices;

• previous accesses that have been made to the system, and their impact on the feature set. This is crucial because, while some changes may increase the distance to one access, they may at the same time reduce the distance to another access (from the same device) made with different resources;

• the system decision boundary, which depends on the nearest-sample matching rule. This is particularly critical when the same device accesses the website multiple times.

Collecting the knowledge listed in the previous points implies that a user has perfect knowledge of the system [15]. Obtaining such information is difficult for an outsider and might not be feasible most of the time. For this reason, the reported evaluation of the system assumes that the user has limited knowledge of the system. This means that the properties of some devices that accessed the website (e.g., their browser or their proxy settings) are known. The user's goal is to change the parameters of their own device so that they match, as closely as possible, those of another device. The rationale behind this attack is that an access with the modified parameters will confuse the system and stop it from correctly detecting the device (as it should reduce the dissimilarity measure to other devices). It is worth noting that the user does not know the exact impact of these actions on the features. To model changes that a user can concretely realize with relatively low effort, we consider four scenarios. These scenarios can be achieved manually by changing a device's configuration or automatically by using specific applications.

1. Second browser. Users can create variance within the feature set by installing a second browser and alternately using two browsers. This would affect the following features: i) user agent, ii) canvas hash, and iii) plugins.

2. Second browser with different settings. In addition to alternating between two browsers, users can adjust the settings of one browser in contrast to those of the other, e.g., enabling DNT for one browser and disabling it for the second. Hence, several features that are extracted from these settings would change, creating more differences between the two browsers. For example, this could be achieved by deleting cookies and local storage after every usage, by changing the navigator language, by using a popup blocker and the DNT header, or by logging out from websites and clearing the browsing history every time. These actions would change the first scenario's features and would additionally modify the following: (i) local storage ID, (ii) cookie ID, (iii) navigator language, (iv) popup blocker active, (v) DNT option enabled, (vi) login status, and (vii) browsing history.

3. Proxy. Besides changing settings related to the device directly, users can influence features used for fingerprinting by using a proxy connection. This could be done by resorting to manual configurations, or by employing a proxy application. Such a behavior would change a client’s location-related features: (i) IP address, (ii) country, (iii) city, (iv) hostname, and (v) hostname wildcard.

4. Two browsers and a proxy. Combining the actions described above yields the strongest scenario. Using a second browser with differently adjusted settings while at the same time resorting to a proxy connection affects all of the above-listed features.
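Each of the four scenarios can be simulated by overwriting the feature values it affects; drawing the replacement values from another real device keeps the forged values coherent rather than random noise. A minimal sketch with hypothetical feature names:

```python
import random

# Features each scenario changes (names mirror the lists above).
SCENARIOS = {
    "second_browser": ["user_agent", "canvas_hash", "plugins"],
    "proxy": ["ip", "country", "city", "hostname", "hostname_wildcard"],
}
SCENARIOS["browser_settings"] = SCENARIOS["second_browser"] + [
    "local_storage_id", "cookie_id", "language", "popup_blocker",
    "dnt", "login_status", "history",
]
SCENARIOS["browser_and_proxy"] = SCENARIOS["browser_settings"] + SCENARIOS["proxy"]

def simulate_evasion(sample, donors, scenario, rng=random):
    """Copy the scenario's feature values from a randomly chosen donor
    device, leaving all other features of the sample untouched."""
    donor = rng.choice(donors)
    forged = dict(sample)
    for feat in SCENARIOS[scenario]:
        if feat in donor:
            forged[feat] = donor[feat]
    return forged

sample = {"user_agent": "UA1", "canvas_hash": "c1", "plugins": "p1", "ip": "1.2.3.4"}
donors = [{"user_agent": "UA2", "canvas_hash": "c2", "plugins": "p2", "ip": "5.6.7.8"}]
print(simulate_evasion(sample, donors, "second_browser"))
```

Note how the "second browser" scenario leaves the IP address untouched, while the combined scenario overwrites both browser-related and location-related features.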

Evasion Results

For this experiment, we considered the whole dataset as training set. As test set, we changed the features of each sample of the training set based on the scenarios described above. As target values, we randomly selected values that belonged to other devices; this guarantees that the forged values are coherent and not just random. We repeated this experiment ten times for each scenario, always using different target samples. It is worth noting that we simulated the scenarios as accurately as possible. For example, we considered the fact that a device running Android cannot switch its browser to one belonging to iOS: it was, for instance, not possible to change Chrome to Safari on a non-iOS device.

Figure 2.3 shows the average ROC curves for our system under the evasion scenarios mentioned above. Since some of the ROC curves are similar to each other, we also computed the partial Area Under the ROC Curve (pAUC) for false positive rates between 0% and 1%, and between 0% and 10%.

Figure 2.3.: Average ROC performance chart for our system under multiple scenarios of evasion attacks (true positive rate over false positive rate; curves: original, using another browser, change browser settings, using proxy, change proxy and browser)

From these results, we can observe the following facts:

a) Simply changing the browser or using a proxy does not impact the system's performance. Presumably, this is because the user is not aware of the devices that the system has already seen. Changing the browser might even be risky, as the system might be more sensitive to the new browser than to the previous one. Furthermore, the distance function we chose also depends on the number of features that differ between two samples: the more features the user manages to change, the more effective the attack. In this case, changing only the browser or the proxy changes too few features, degrading the effectiveness of the attack.


b) Changing the browser and its settings strongly affects the performance of the system. Although detection does not break down completely, we notice a significant drop of around 60% at zero false positives. We assume that this is due to the increased number of feature changes. This is in line with the action taken by the user: by completely changing the browser settings, the user significantly impairs the fingerprinting capabilities of the system.

c) When the user changes the browser settings and uses a proxy, we observe a complete collapse of the detection rate at zero false positives. To restore useful detection values, the false positive rate has to increase up to 10% for a detection rate of 70%.

The results of this experiment suggest that when users change their browser settings and resort to a proxy, they can completely evade a system that otherwise makes no mistakes at detecting devices it has never seen before. This is a serious concern, as errors at detecting never-seen devices might compromise the functionality of the system in the long term. To conclude, this experiment shows that evading mobile device fingerprinting is possible, but not as easy as one might expect at first glance: the user has to invest significant effort to achieve effective evasion. At the same time, our results show that even a complete evasion is possible when the user resorts to a second browser with modified settings and to a proxy.
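The pAUC values used above to compare similar ROC curves can be computed by trapezoidal integration over the curve, restricted to a false-positive range. A minimal sketch on a hypothetical ROC curve:

```python
def partial_auc(roc_points, fpr_max):
    """Area under the ROC curve restricted to FPR in [0, fpr_max],
    using trapezoidal integration over sorted (fpr, tpr) points."""
    pts = sorted(roc_points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= fpr_max:
            break
        # clip the last segment at fpr_max by linear interpolation
        if x1 > fpr_max:
            y1 = y0 + (y1 - y0) * (fpr_max - x0) / (x1 - x0)
            x1 = fpr_max
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

roc = [(0.0, 0.0), (0.01, 0.6), (0.1, 0.9), (1.0, 1.0)]
print(partial_auc(roc, 0.01))  # pAUC on FPR in [0, 1%]
print(partial_auc(roc, 0.10))  # pAUC on FPR in [0, 10%]
```

Restricting the integration range makes curves distinguishable that look nearly identical when summarized by the full AUC, since the operating points of interest here lie at very low false positive rates.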

2.5. Discussion

While our institution does not conduct formal ethics reviews, e.g., through an Institutional Review Board (IRB), we carefully considered the ethical implications of our experimentation and data collection. We conducted our experiments such that: 1. all participants were informed of the nature of the experiment and of how their data would be used, and had the opportunity to opt out at any time during data collection; 2. all data was stored using non-identifiable information to protect the privacy of participants; and 3. all participants were allowed to view, and received feedback on, the data they provided as an incentive for their participation. We did not collect any personally identifiable information, and all participants remained anonymous.

As we did not collect personal information about the participants of our experiments, we cannot provide information about their age, technophilia, or any demographic characteristics. Although we employed a comprehensive feature set in this work, future technologies might introduce new features that can be used to fingerprint mobile devices. We also point out that our detection approach might suffer from slowdowns if the database were filled with millions of accesses. Although it was not necessary in our case, techniques such as prototype selection can be used to reduce the size of the database without losing accuracy.

2.6. Related Work

Several studies showed that client fingerprinting is a valuable method and that it is used in practice for user tracking, fraud detection, or advertising [42, 74, 114, 119]. Fingerprinting does not yet aim to replace stateful tracking via cookies. However, fingerprinting is harder to detect and allows recognizing users who deleted their browser's cookies [3].

A detailed investigation of how web browser fingerprinting works has been conducted by Nikiforakis et al., showing that user privacy can be compromised by modern fingerprinting techniques [119]. The authors' results are mostly limited to full-weight browsers installed on desktop computers, which are far more customizable than browsers on mobile devices. As mobile browsing constantly increases, we reviewed browser fingerprinting techniques in the context of mobile devices.

A first study on mobile web tracking was performed by Eubank et al. [45]. While stating that mobile tracking is an under-researched area, the authors show differences between desktop and mobile fingerprinting, focusing on cookies and HTTP headers. However, this work is limited to Firefox Mobile on Android, which is not the most common browser on mobile devices. Our study contains data from different browsers as well as different systems and devices, and it is based on a larger feature set.

A study of canvas fingerprinting and evercookies showed that users have little means to oppose these techniques [2]. The proposed mitigation is to ask the user for permission at every read event, which is, as of yet, only implemented in the Tor Browser [19]. Evercookies are hard to defend against on desktop computers, as common browsers do not provide an interface for checking and deleting the local storage or IndexedDB, and Flash storage is not isolated. We found that users cannot browse the local storage or IndexedDB on mobile devices either.
Newer versions of Apple's iOS (7 and later) combine the functions of deleting cookies and the local storage. However, we see that many users do not delete their cookies and data regularly. Even worse, if a user's device can be recognized without local storage data and cookies, the information about whether and how often a user clears cookies and website data may become a new feature describing the user's behavior and awareness.

An approach to defending against fingerprinting is the randomization of characteristic attributes. Nikiforakis and Livshits have shown that this technique can be used to deceive, in particular, the fingerprinting of installed fonts and browser plugins [118]. Firegloves is a proof-of-concept Firefox plugin following this approach [16]. The drawback of randomization is its noisiness: if a feature is randomized on every access, sophisticated fingerprinting techniques could repeatedly perform measurements to determine the randomness and finally obtain the unrandomized features. Also, randomizing the lists of fonts and plugins cannot mitigate the fingerprinting of mobile devices.

Prior work has shown that data measured by hardware sensors like accelerometers can be used to fingerprint mobile devices [17, 39]. However, this technique can only recognize devices equipped with such sensors, and it requires extensive training measurements. In practice, ad trackers and anti-fraud systems gather data for fingerprints in less than a second. For our test, we also used accelerometer data, but it was measured in less time and combined with other features instead of being used as a single feature.

Besides accelerometers, the internal clock of mobile devices has been investigated for fingerprinting [80, 113]. This elaborate technique measures hardware imperfections of a device's internal clock by comparing it to time synchronization services. Although this is a reliable method to fingerprint mobile devices, it requires measurements over an extended period of time, which is not applicable to modern web-based fingerprinting. Additionally, there is no way to access a device's hardware clock via JavaScript, so calculating clock skews is not realizable using web techniques only.

Targeting mobile devices, Azizyan et al. proposed to identify the user's ambiance with the help of environmental features like sound and light [9]. Although this approach may provide information about a user's location, it is not applicable to low-noise fingerprinting through the web: a mobile browser will ask for permission to use hardware such as the microphone, which raises the user's awareness. In contrast, we restricted our work to inconspicuous web techniques that are realizable in practical scenarios.

Baumann et al. suggested that for every browsing session major changes in browser and system configurations should be made to escape fingerprinting [13]. However, changing features like screen resolution, plugins, and user agent may affect a user’s browsing experience.

The feasibility of fingerprinting in the web context has been confirmed by Laperdrix et al. [90]. While the authors found a large number of the fingerprints in their set to be unique, we applied machine learning to recognize devices iteratively and conducted an experiment on the challenge of escaping fingerprinting methods.


2.7. Conclusion

Client fingerprinting is used in practice for several use cases like fraud detection or user tracking. In this chapter, we studied tracking libraries that are established in the field and compared their methods for fingerprinting client systems. We reviewed and investigated these common methods with respect to mobile devices and discovered that attributes that are characteristic for full-weight browsers on desktop computers lose their descriptive power when applied to mobile devices. We then introduced a feature set containing common features as well as features specifically targeting mobile devices, and applied weights based on the information gain of each feature. The evaluation of this feature set is based on real-world data from an online test survey and shows that mobile devices can be fingerprinted well. Finally, we investigated how users can evade mobile device fingerprinting, to allow users to decide whether or not they want to be tracked. We discussed different possibilities for web users to protect their privacy and studied four different evasion scenarios. The results indicate that it is possible, although not easy in practice, to escape fingerprinting mechanisms that rely on the advanced techniques presented in this chapter to track mobile devices.


CHAPTER THREE

SYSTEM FINGERPRINTS AS INFLUENCE ON ONLINE PRICING POLICIES

Price differentiation describes a marketing strategy to determine the price of goods on the basis of a potential customer's attributes like location, financial status, possessions, or behavior. In the past years, several cases of online price differentiation were revealed. For example, different pricing based on a user's location was detected at online office supply chain stores, and there were indications that offers for hotel rooms are priced higher for Apple users compared to Windows users on certain online booking websites. One potential source of distinctive features are system fingerprints, i.e., a technique to identify or recognize users' systems by determining unique attributes such as the source IP address or the system configuration. In this chapter, we shed light on the ecosystem of pricing at online platforms and aim to detect if and how such platform providers make use of price differentiation based on digital system fingerprints. We designed and implemented an automated price scanner capable of disguising itself as an arbitrary system by leveraging real-world system fingerprints, and searched for price differences related to different features (e.g., user location, language setting, or operating system). This system allows us to explore cases of price differentiation and to expose the characteristic features of a system that may influence a product's price.

3.1. Introduction

Pricing policies of (online) business providers are typically not transparent for customers and based on parameters a customer is not aware of. This opens up a number of opportunities for so-called price differentiation and price discrimination. Price differentiation is a pricing policy of providers to demand different prices for the same asset, including special offers or discounts. In contrast, adjusting a product's price based on a customer's personal information (e.g., gender, wealth, home address, or other features) is called price discrimination. In the past, suspicious cases of online price discrimination hit the headlines, like different pricing based on a user's location at Staples [147] or indications that offers for hotel rooms are priced higher for Apple users compared to Windows users at Orbitz [104]. From a technical point of view, an online platform can leverage many kinds of techniques to identify a user, which would be the starting point for price discrimination. Generally speaking, the term fingerprinting refers to the process of obtaining characteristic attributes of a system and determining attribute values that can be leveraged to recognize or identify a single system among others. In the context of online user tracking, this technique complements cookie-based recognition, which has been state of the art for many years [42]. In practice, browser fingerprinting provides more information about a customer than cookie-based methods, such as software attributes (e.g., the used user agent, installed plugins, and supported mime types [3, 45, 74, 99]). Previous research demonstrated that browser-based system fingerprinting performs well for most types of commodity systems such as desktop computers and mobile devices [42, 119]; we also demonstrated the feasibility of fingerprinting for mobile devices in Chapter 2. Our assumption is that such information about a user's system, obtained via browser fingerprinting, is leveraged by online providers for price discrimination, as it leaks information about the system configuration and the user himself.
While flight tickets have been found to be subject to too many influencing factors to allow finding methodical price discrimination [150], there has been no further systematic investigation of price discrimination in online commerce. Hotel booking websites and portals in particular are often criticized for non-transparent pricing and raise suspicion of price differentiation. Unfortunately, the interaction of all the pricing mechanisms involved cannot be determined without detailed insights into the inner workings of such platforms, and thus we need to adopt a black-box strategy to explore abnormalities. In this chapter, we investigate cases in which prices are customized based on a user's system. More specifically, we apply real-world browser fingerprints to simulate different systems and analyze the corresponding price changes. To achieve this goal, we implemented an automated price scanner capable of disguising itself as an arbitrary system by leveraging real-world system fingerprints, and searched for price differences related to

1. user location represented by IP address,

2. specific systems represented by their fingerprints, and

3. single features of fingerprints.

48 3.1. Introduction

These features enable us to expose their impact on asset prices. Generally speaking, we aim at exposing system configuration features that may influence prices, and we perform a repeatable empirical analysis to measure the effects of fingerprint changes. In an empirical study, we examined several accommodation booking websites and a rental car provider platform and shed light on which parameters affect an asset's price. Our results indicate the existence of both location-based and fingerprint-based price differentiation, but do not reveal a systematic discrimination. Furthermore, we also shed light on how changing single attributes in a system fingerprint affects an asset's price. Associating reproducible price changes with specific attribute values allows users to change their system fingerprint and go bargain hunting for hotel rooms.
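The comparison step of such a price scanner can be sketched as follows: repeated price observations are grouped by offer and query time slot, and an offer is flagged when different fingerprints see different prices at the same time. Grouping by time slot controls for ordinary price fluctuation over time. The field names and data are illustrative, and the actual scraping is omitted.

```python
from itertools import groupby

def detect_price_differences(observations):
    """Group price observations by (offer, slot) and flag offers whose
    price differs across fingerprints within the same time slot.
    observations: list of dicts with keys offer, fingerprint, slot, price."""
    flagged = []
    keyfn = lambda o: (o["offer"], o["slot"])
    for (offer, slot), group in groupby(sorted(observations, key=keyfn), key=keyfn):
        prices = {o["fingerprint"]: o["price"] for o in group}
        if len(set(prices.values())) > 1:      # same offer, same time, different price
            flagged.append({"offer": offer, "slot": slot, "prices": prices})
    return flagged

obs = [
    {"offer": "hotel_A", "slot": 1, "fingerprint": "mac_safari", "price": 120.0},
    {"offer": "hotel_A", "slot": 1, "fingerprint": "win_chrome", "price": 110.0},
    {"offer": "hotel_B", "slot": 1, "fingerprint": "mac_safari", "price": 80.0},
    {"offer": "hotel_B", "slot": 1, "fingerprint": "win_chrome", "price": 80.0},
]
print(detect_price_differences(obs))  # only hotel_A shows a difference
```

A single flagged slot is weak evidence; only price differences that reproduce over many slots and repetitions point toward fingerprint-based differentiation rather than noise.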

Contribution

In summary, we make the following contributions:

• We developed and implemented a method to find and analyze price differentiation by automatically testing different system configurations against online providers.

• We conducted an empirical study to explore price differentiation based on user location as well as system configuration.

• We provide insights into which specific system features influence pricing strategies and how a user can potentially affect them.

Outline

In the next section, we explain the phenomenon of price differentiation and revisit the definition of system fingerprinting. Then, we describe how we search for price differentiation at online shopping websites, including our design goals, the components of the utilized fingerprints, as well as the scanning and scraping procedures of our setup. Next, we present our findings regarding price differences, categorized into location-based, fingerprint-based, and feature-based analyses. This sheds light on how prices differ between countries and how they change when using another computer system. Furthermore, we investigate features and values that correlate with price changes. At the end of this chapter, we consider threats to the validity of our experiments, provide an overview of related work, and summarize our findings.

Chapter 3. System Fingerprints as Influence on Online Pricing Policies

3.2. Price Differentiation via System Fingerprinting

Before diving into details, we introduce both price discrimination and system fingerprinting and explain why and how these two concepts are related to each other.

3.2.1. Price Differentiation

As noted earlier, there is a small yet important difference between price discrimination and price differentiation: while price differentiation describes a strategy to determine a product's or service's price based on a potential customer's needs, it does not depend on the customer's characteristics. In contrast, with price discrimination the price is determined on the basis of a potential customer's attributes like location, financial status, possessions, gender, or behavior. According to Varian [149], price discrimination is defined as specific pricing for specific groups and has been a common technique since 1920. Traditionally, price discrimination and differentiation can be subdivided into three different degrees [149]:

• First degree: Involves individualization of prices for all customers.

• Second degree: Prices differ based on additional services. It is possible to distinguish between service-related, quantitative, and price-pack forms.

• Third degree: Involves individual prices for groups of people. They can be individual, location, or time-related.

In most parts of the world, commerce may lawfully set a price for a specific customer, like discounts based on negotiations or special offers. While this is legal business conduct and in most cases handled responsibly, it verges on inappropriate practice when a retailer adjusts an offer's price, e. g., based on a customer's mindset, ethnicity, or residential neighborhood. Online commerce has largely been resistant to price discrimination as a customer typically decides to buy a product for the lowest price possible. Also, traditionally only few customer characteristics were revealed during an online purchase (like the residential area) and there are usually no negotiations (at least for standard products). However, a client's computer system nowadays reveals much more information about its user [3,42,119]. This offers new opportunities for online shop operators to personalize their content for each client separately [77,93]. From their perspective, price discrimination is a way to maximize profits, so they have an incentive to utilize such techniques. To implement such a strategy, fingerprinting can be used to identify user groups which are likely willing to pay a higher price than other user groups.


3.2.2. System Fingerprinting

While fingerprinting techniques can be applied to different kinds of systems including servers, mobile devices, or websites, we focus on client-side systems in this chapter, especially browsers on commodity systems like desktop computers and smartphones. This approach enables web platform providers to fingerprint, and consequently recognize or identify, a user's system and improves classical cookie-based user tracking [119]. In practice, the attributes of a system are examined and analyzed to determine whether they are unique compared to the attributes of other systems. Every system is assigned a fingerprint which describes the system's characteristic attributes (e. g., configuration items like a browser's settings, display size, or the IP address). A provider obtaining fingerprints from various systems is able to compare them and distinguish specific systems. This concept is illustrated in Figure 3.1. As our work is in the context of online shopping, we focus on attributes accessible from the web and hence use browser attributes as features for fingerprinting. Common browsers reveal adequate information to generate this kind of fingerprint [119], and web-based fingerprinting of personal computers and mobile devices is a common technique investigated by several other researchers [42,88,119,161]. Figure 3.2 provides an example of how a JavaScript-based approach to system fingerprinting can look. In particular, it shows the implementation of browser fingerprinting at the hotel booking platform Hrs.com. The code loads after visiting the landing page and builds an HTTP GET request with fingerprint features (screen resolution, UserAgent, platform, language, etc.), resulting in the link shown in the lower half of the figure.


Figure 3.1.: Every system yields its own fingerprint: different features are extracted from a system and stored in a provider’s database


// JavaScript Code:
var clientDataParamString = '?track=ci' + '&saw=' + screen.availWidth
    + '&sah=' + screen.availHeight + '&scd=' + screen.colorDepth
    + '&nua=' + navigator.userAgent + '&np=' + navigator.platform
    + '&nl=' + navigator.language + '&nce=' + navigator.cookieEnabled
    + '&nan=' + navigator.appName + '&cookie=' + cookieTestResultCode
    + '&sess=' + '08C191CD9A170B5B54FCFC8F656D9449.61-4';
var clientDataPixel = document.createElement('img');
clientDataPixel.src = 'bi/null.gif' + clientDataParamString;
document.body.appendChild(clientDataPixel);

// Resulting link:
http://www.hrs.com/web3/bi/null.gif?track=ci&saw=&sah=&scd=&nua=&np=&nl=&nce=&nan=&cookie=&sess=

Figure 3.2.: Exemplary JavaScript code snippet of system fingerprinting and tracking at Hrs.com

A website provider can easily obtain a browser's attributes and settings, which are used to create a system fingerprint. Consequently, if a system re-visits the provider's website, it is possible to recognize this specific system with the help of its fingerprint. Additionally, a fingerprint yields valuable information about the system itself. For instance, it tells a provider the system's browser, screen resolution, supported MIME types, installed plugins, and much more. Such information represents a potential source for price discrimination. The fact that a website provider is able to obtain fingerprint information leads to our assumption that this information might be used to group website visitors and to offer some groups different prices than others. In our understanding, this would represent a case of price discrimination.
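The recognition step described above can be sketched as follows. This is an illustrative reconstruction, not code from any of the examined platforms; the function and attribute names are our own: a provider canonicalizes the collected attributes and hashes them into a stable identifier that re-identifies the system on a later visit.

```python
# Sketch (assumption: illustrative helper, not an actual provider's code):
# derive a stable fingerprint ID from collected browser attributes.
import hashlib
import json

def fingerprint_id(attributes: dict) -> str:
    """Hash the canonicalized attribute set into a stable identifier."""
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

visitor = {
    "userAgent": "Mozilla/5.0 (X11; Linux x86_64) ...",
    "language": "it-IT",
    "availWidth": 720,
    "availHeight": 588,
    "colorDepth": 32,
}

fp = fingerprint_id(visitor)
# A re-visit with identical attributes yields the same ID, so the
# provider can recognize the system even without cookies.
assert fp == fingerprint_id(dict(visitor))
```

Any change to an attribute (e. g., a different language setting) produces a different identifier, which is exactly why single-feature changes can be tested against pricing later on.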

3.3. Searching for Price Differentiation

In the following, we outline the goals, workflow, and functionality of our method for searching the web for potential cases of price discrimination.

3.3.1. Design Goals

Our main goal is to conduct a systematic study as well as an objective analysis to determine whether online price discrimination based either on location information or on system configuration exists. Therefore, we define the following goals that should be accomplished by our implementation to perform systematic, non-offensive scans.


• Fingerprint Variety. We intend to send realistic search requests to the exam- ined websites, which requires the application of real-world system fingerprints. Major fingerprinting libraries found in the wild utilize browser features as characteristic attributes to recognize and classify users’ systems (see Chapter 2). Therefore, fingerprints should be as comprehensive and complete as possible, including user agent information as well as every system feature that could be used by common fingerprinting libraries (see Sec. 3.3.3).

• Simulation of User Behavior. In order to avoid getting classified as a bot or even getting blocked by a provider due to automated crawling of their platforms, we strive for realistic user behavior. Hence, it is necessary to simulate human behavior when posting a search request, starting at the frontpage of a provider’s website, filling out the search input form (e. g., in case of hotels, with the desired travel information like arrival and departure dates) as well as traversing the received results. Because a possible price individualization has to take place before listing the results, fingerprinting is likely applied during this procedure. Through simulating user behavior, we increase the chance to get fingerprinted (see Sec. 3.3.4).

• Robustness. The scan results are external data to us. Hence, we cannot control them or the way they are deployed, e. g., their format or display position. For this reason, we have to ensure the proper handling of exceptions and any kind of unexpected data to avoid crashes. Platform providers constantly work on their websites, adding features like responsive design extras and regrouping container classes. As the scraper components of our system include automated navigation on a website, they need to be robust against such changes (see Sec. 3.3.5).

• Deterministic Behavior. For comparison of different prices and scanning results, our system has to be deterministic, meaning that the same search requests using the same input parameters—including fingerprint, proxy connection, and search data—should lead to the same result. Note that external circumstances (e. g., seasonal vacancies, special offers, or a fully booked hotel) might influence product prices and constitute a limitation of our work (see Sec. 3.5). Although it is not possible to completely eliminate these factors, we intend to minimize their influence on our work by leveraging repeated scans and vacancy filters (see Sec. 3.4).

Besides these design goals, we also follow three additional principles. First, as we aim to include multiple platforms in our study, the implementation needs to be modular. For every scan, platforms, search parameters, fingerprints, etc. can be chosen freely, which also enables us to extend the system with additional scrapers so more websites and product categories may be scanned for fingerprint-based price discrimination in future work. Second, we should not send too many requests to a given website at once: during the scanning procedure, a large number of requests is sent to the providers' servers, which could raise alarms. As we certainly do not want to disturb legitimate services, we apply a time delay to our low-traffic implementation and hence ensure that our scans are tolerable for platform providers and do not interfere with their daily business. Third, we want to be transparent about our work and thus plan to publish the code and data obtained by our scans. Therefore, researchers interested in this field and our study will be able to reproduce our results as well as perform further evaluations on their own.

3.3.2. High-level Overview of Workflow

Our approach can be split into six entities. We have two data sources (system fingerprints and provider websites), three data processors (scanner, scraper, and price analysis), and result data (cases of price discrimination). Figure 3.3 provides a high-level overview of the system's workflow. First, we build system profiles, each including four components: (i) a real-world fingerprint, (ii) a proxy server to be used, (iii) search parameters like the dates of arrival and departure for hotels, and (iv) the providers and websites to be examined. Bundles of such profiles are loaded by the scanner. The scanner's duty is to automatically browse the website of a given provider to specific product result pages. Our scraper implementations thereafter extract the relevant price information from these pages. Finally, we analyze the extracted price information; this analysis of the collected data can point out cases of price discrimination. In the following sections, we describe each of these steps in more detail and provide information about implementation aspects.
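The profile bundles described above can be sketched as a simple data structure. The field names below are our own illustration, not the dissertation's actual implementation; the proxy address is a placeholder from the RFC 5737 documentation range:

```python
# Sketch of a "system profile": fingerprint, proxy, search parameters,
# and the providers to examine (all names/values are illustrative).
profile = {
    "fingerprint": {"userAgent": "Mozilla/5.0 ...", "language": "de-DE"},
    "proxy": "http://203.0.113.10:8080",  # placeholder address
    "search": {"travel_target": "Berlin, Germany",
               "check_in": "2016-05-27", "check_out": "2016-05-28",
               "adults": 1, "single_rooms": 1, "double_rooms": 0},
    "providers": ["Booking.com", "Hotels.com", "Hrs.com", "Orbitz.com"],
}

def scan(profile):
    """Placeholder pipeline step: the real scanner browses each provider
    and returns result pages, which are then handed to the scraper."""
    return [f"{p}::{profile['search']['travel_target']}"
            for p in profile["providers"]]

pages = scan(profile)  # one result-page handle per provider
```

Keeping all four components in one bundle is what makes a scan reproducible: re-running the same profile should, by the deterministic-behavior goal, yield the same prices.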

3.3.3. System Fingerprints

The real-world system fingerprints that we use for our study are derived from two data sources:

1. the study conducted in Chapter 2 providing 385 fingerprints with different user agents, mainly from mobile devices, and

2. a project partner operating a large online gaming platform, who supported our work with over 15,000 fingerprints of desktop systems.

We re-grouped these fingerprints in order to identify those including most common feature values as well as those including least common feature values (see Sec. 3.3.1).
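The re-grouping step can be sketched as follows. This is our own reconstruction of the idea, not the original implementation: count how common each feature value is across the whole corpus and score every fingerprint by the summed frequency of its values, so the most common and least common fingerprints can be picked from the two ends of the ranking.

```python
# Sketch (our reconstruction): rank fingerprints by how common their
# feature values are across the corpus of collected fingerprints.
from collections import Counter

def commonness(fingerprints):
    """Score each fingerprint by the summed frequency of its feature values."""
    freq = Counter((k, v) for fp in fingerprints for k, v in fp.items())
    return [sum(freq[(k, v)] for k, v in fp.items()) for fp in fingerprints]

corpus = [
    {"language": "en-US", "platform": "Win32"},
    {"language": "en-US", "platform": "Win32"},
    {"language": "ru-RU", "platform": "Linux armv7l"},
]
scores = commonness(corpus)
# The two identical, common fingerprints score higher than the rare one.
assert scores[0] == scores[1] > scores[2]
```

Sorting by this score and deduplicating fingerprints with identical feature values mirrors the reduction described above.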



Figure 3.3.: High-level overview of our system’s workflow

This set of most common and uncommon system fingerprints is suitable for our purpose: we need to include those systems in our study which can be found in the wild very often, but we also need to include special systems with an unusual appearance to test how such rare fingerprints may influence a product's price. Also, we reduced the set as many feature values were identical across several fingerprints. Due to this re-grouping and reduction, our set includes a total of 332 real-world fingerprints for scanning web platforms. As noted above, a fingerprint may yield manifold features of a system. However, we include the features listed below, either gathered from the Browser Object Model (BOM) or the HTTP header, as these have proven to be common features used for browser fingerprinting [42,119]. Table 3.1 shows example values for all features:

• AvailHeight determines the screen height available to the browser.

• AvailWidth determines the screen width available to the browser.

• ColorDepth stores the color depth of the display in bits.

• CookieEnabled stores a boolean value indicating whether a website is allowed to set cookies in the system's browser.

• Height holds the height of the display screen the browser is located in.

• Language determines the browser's main language, usually stored as a language tag combining an ISO 639-1 language code with an ISO 3166-1 country code (e. g., it-IT).


Table 3.1.: Leveraged fingerprint features with exemplary values

  Feature Name    Example Value
  availHeight     588
  availWidth      720
  colorDepth      32
  cookieEnabled   True
  height          1080
  language        it-IT
  languages       [ru-RU]
  mimeTypes       [{n:application/x-shockwave-flash,d:Adobe ... ]
  pixelDepth      24
  platform        Linux armv7l
  plugins         [{n:Shockwave Flash,d:11.2 r202, ... ]
  productSub      20030107
  userAgent       Mozilla/5.0 (Linux; Android 4.4.2; Nexus 4 ...
  vendor          Google Inc.
  width           1680

• Languages yields a list of supported languages where the first language matches the main language.

• MimeTypes contains the object MimeTypeArray, which holds a list of all MIME types the browser can work with. Each MimeType is represented by a JSON array in our approach, containing three items: i) description (key: d), ii) suffix (key: f) and iii) type (key: n).

• PixelDepth indicates the bits per pixel of the display screen.

• Platform gives information about a system’s platform.

• Plugins provides the JavaScript object PluginArray containing all installed browser plugins and is in our approach formatted in the same way as MimeTypes, again including a JSON array with items for description, suffix, and type.

• ProductSub represents the build number of the system’s browser.

• UserAgent provides the user-agent string of a browser, containing various information about the browser itself as well as the underlying system it was built for.

• Vendor depends on the type of browser and contains the name of its vendor.


• Width contains the display screen’s width the browser is located in.

Besides all of these device-level features, we also need to consider the network location (i. e., IP address) as it represents an important feature for fingerprinting. We opted to use free proxy servers and rented VPN gateways to enable flexible routing of requests. As a result, we can issue queries from different network locations and observe changes in responses.
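Conceptually, routing identical queries through different network locations only requires swapping the proxy configuration per request. A minimal sketch, with placeholder proxy addresses from the RFC 5737 documentation range; the resulting mapping is the form accepted by, e. g., the `requests` library's `proxies` parameter:

```python
# Sketch: per-country proxy configurations for re-routing identical
# search requests (addresses are placeholders, not real proxies).
PROXIES = {
    "DE": "http://198.51.100.1:3128",
    "US": "http://198.51.100.2:3128",
}

def proxy_config(country: str) -> dict:
    """Build the proxy mapping for one request, keyed by URL scheme.
    This dict would be passed to an HTTP client, e.g.
    requests.get(url, proxies=proxy_config("DE"))."""
    proxy = PROXIES[country]
    return {"http": proxy, "https": proxy}

# Issuing the same query with proxy_config("DE") and proxy_config("US")
# lets us compare the responses per network location.
```

The target website only sees the proxy's address, so the request appears to originate from the corresponding country.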

3.3.4. Scanner

In our system design, the scanner implements a way to automatically scan websites for price information. The scanner deploys different system fingerprints to navigate autonomously through the target provider's websites. In our study, the navigation process can be subdivided into four steps:

1. loading the landing page,

2. filling out the search input,

3. ascertaining the travel destination, and

4. reaching the search results.

Figure 3.4 depicts the components of our scanner implementation.


Figure 3.4.: Scanner components operation chart


Table 3.2.: Search parameter features for hotel booking websites with example values

  Parameter Name          Example Value
  travel target           Berlin, Germany
  check in day            27
  check in month          5
  check in year           2016
  check out day           28
  check out month         5
  check out year          2016
  number of adults        1
  number of single rooms  1
  number of double rooms  0

As discussed above, we use real-world fingerprints to create fake user profiles, and our scanner uses proxies to forge its location such that we can evaluate whether the IP address also plays a role in the whole process. Further input for our core scanner component are the provider websites which we want to analyze and the specific search parameters for these websites. During a run of the scanner, the search parameters remain constant to obtain comparable results. Table 3.2 illustrates the implementation of the search parameters as a dictionary for hotel booking websites. During the time of this work, we used up to four travel targets for our search parameters. We chose the search parameters such that no public holiday or big event overlaps with our chosen travel periods, to counteract price changes caused by external factors. Additionally, we chose dates far in the future to ensure that enough products are available. The core component of the scanner is a customized version of the headless browser PhantomJS that we use for automated browsing on websites. The fingerprint injection (i. e., the manipulation of the browser fingerprint of the PhantomJS instance) is essential for our system design. Based on a set of fingerprint features, the browser instance is altered to imitate the system that is represented by the fingerprint. We combine out-of-the-box methods provided by PhantomJS and JavaScript injection to fake the system fingerprint. To perform this manipulation, we had to extend PhantomJS' WebDriver implementation (GhostDriver) because we are utilizing Selenium [138] to communicate with the PhantomJS instance. The WebDriver is a remote control interface to instruct the behavior of the PhantomJS browser [152]. Selenium includes the WebDriver API and automates driving the browser as a real user would.
Through this extension, we gain access to additional features of PhantomJS that allow us to replace identifying JavaScript objects. We achieve two design goals using PhantomJS controlled via Selenium. First, we simulate authentic user behavior by navigating through the single steps of the website. Second, it leads to deterministic behavior. To sum up, the real-world fingerprints, the proxies, the provider websites, and the search parameters serve as input data for the scanner, which uses Selenium to communicate with the custom PhantomJS browser via its extended GhostDriver implementation. The interaction of these components produces the requests we use for obtaining the relevant result pages which contain the price information. Thus, the scanner components automatically simulate realistic browsing and user input under different parameters (simulation of user behavior, see Sec. 3.3.1).
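The fingerprint-injection idea can be illustrated with a small helper that generates the JavaScript for overriding identifying BOM properties before a page runs its own scripts. This is a hedged sketch of the concept only; the thesis uses an extended GhostDriver, not this helper, and the property selection below is our own:

```python
# Sketch: build a JavaScript snippet that overrides identifying BOM
# properties (navigator.*, screen.*) with values from a fingerprint.
def injection_script(fingerprint: dict) -> str:
    mapping = [("navigator", "platform", "platform"),
               ("navigator", "language", "language"),
               ("screen", "availWidth", "availWidth"),
               ("screen", "availHeight", "availHeight")]
    overrides = []
    for obj, prop, key in mapping:
        if key in fingerprint:
            overrides.append(
                "Object.defineProperty(%s, '%s', {get: function(){return %r;}});"
                % (obj, prop, fingerprint[key]))
    return "\n".join(overrides)

script = injection_script({"platform": "Linux armv7l", "availWidth": 720})
# The resulting snippet would be injected into the page context
# (e.g., via the WebDriver's execute-script facility) before navigation.
```

Because the getters are replaced, any fingerprinting code reading these properties sees the injected values instead of the headless browser's real ones.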

3.3.5. Scraper

In general, the scraper extracts product information from selected websites. The actual scraping is based on a Python implementation which is able to extract information from HTML and XML documents. As input, it receives the source code of the target websites' result pages and extracts the required price information from the HTML code. We locate the individual pieces of information in the document via CSS selectors, which need to be adjusted to the particular markup structure of each target website. Because of this individual code, we need to implement one scraper for each website. Moreover, we encountered many cases where the results are not displayed completely because they are usually loaded on demand, e. g., when scrolling in a list of products. During this study, we found three different ways of presenting the results: (i) a list with pagination, (ii) a list with a full scroll bar, and (iii) a list with a partial scroll bar. Via Selenium, we automated the navigation through the result page parts and extracted the price information of the first 20 parts, as processing even more assets and their prices would exceed a functional limit. In case of a pagination failure, our scraper continues with the next accommodation. While extracting price information from a website, one has to handle different price presentation formats, currencies, and the meaning of the shown prices. Therefore, this data must be converted to a common format we can handle in our further data analysis. The conversion is in principle a price normalization that yields the price in Euro per night for a particular hotel and the total price in Euro for rental cars. The latest available exchange rate is taken for this conversion. Note that we update the exchange rate every time we start a new scan, where one scan corresponds to one execution of the whole workflow of our system.
With this approach, we try to minimize the effect of exchange rate fluctuations on the observed price changes, in particular across revision scans. At the end, the scraper stores all obtained information in a MySQL database. For robustness reasons, all errors are caught so that the scraping algorithm does not stop.
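The normalization step can be sketched as follows. The exchange rates below are illustrative assumptions only; as described above, the actual system fetches the latest rate at the start of each scan:

```python
# Sketch of price normalization to Euro per night
# (RATES_TO_EUR holds made-up example rates, not real market data).
RATES_TO_EUR = {"EUR": 1.0, "USD": 0.9, "GBP": 1.3}

def normalize(amount: float, currency: str, nights: int = 1) -> float:
    """Convert a scraped total price to Euro per night."""
    return round(amount * RATES_TO_EUR[currency] / nights, 2)

# A two-night stay scraped as 200 USD becomes a per-night Euro price:
price_per_night = normalize(200.0, "USD", nights=2)  # -> 90.0
```

For rental cars, the same conversion is applied to the total price (nights=1), so all records in the database share one comparable unit.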


3.4. Evaluation

Based on the implementation of the scanning infrastructure, we performed several empirical tests. While many categories of products and services are offered on the web, we focus on two specific types of business: hotel booking platforms and rental car suppliers.

3.4.1. Price Analyses

We scanned different providers for hotels and rental cars, namely Booking.com, Hotels.com, Hrs.com, Orbitz.com, and Avis.com, and conducted three kinds of analyses: (i) location-based, (ii) fingerprint-based, and (iii) fingerprint-feature-based price differentiation analyses. First, we investigate location-based price differentiation. We consider several countries (including France, Germany, the United States, Russia, Pakistan, and the Netherlands) to determine how realistic it is to obtain a higher or lower price for the same asset when requesting it from a different country. For these countries, we obtained proxy servers or VPN gateways and re-routed our search requests through these servers. The target websites treat these as search requests coming from the corresponding country. Furthermore, we randomly picked six fingerprints of our set to repeat these scans with different system configurations. Note that we focus on hotel providers for this analysis to compare our findings between these providers. Second, we shed light on price differentiation based on system configurations. Regarding countries, this analysis is normalized to France, the United Kingdom, Germany, and the United States, as we aim to highlight the systems' fingerprints instead of different countries, and for these countries we obtained complete result sets for our scans. While for the location-based analyses we do not consider single fingerprints, we do so in this step. We used our set of 332 representative system fingerprints for the following analyses and instrument them to disguise our scanner as these systems. Third, these fingerprints are leveraged to create pairs in which one fingerprint significantly often results in a high price and another fingerprint significantly often yields a low price for the same asset. Intermediate fingerprints are then forged, simulating single feature changes.
By re-scanning the providers' platforms, we gain insights into which specific system attributes affect online pricing policies. Note that we always search for one person and one single night in case of the hotel booking websites; hence, the search parameters described in Sec. 3.3.4 are kept constant for the following analyses. After sending a search request, we scrape the top offer prices per hotel for every provider as our ground data for analysis. Finally, we repeat search requests and confirm that using the same configuration reproduces the same prices, so that we can exclude randomness and consider only reproducible price changes.


3.4.2. Location-based Price Differentiation

We sent search requests for different parameters, e. g., dates of arrival and departure, to all accommodation providers, querying assets in four major cities, namely Los Angeles (USA), London (United Kingdom), Berlin (Germany), and Tokyo (Japan). These cities are popular travel destinations and frequently searched cities at online booking sites. Each scan lasted about one hour in order not to overwhelm a given site with queries. As a result of these scans, we obtained over 455,500 data records, including an accommodation's name, its provider, and the normalized price in Euro. Figure 3.5 shows boxplots for all four providers, with the countries we re-routed the search requests through on the X-axis and the prices in € on the Y-axis. Each box depicts the median and quartiles as well as minimum and maximum prices for the corresponding country. Note that the prices of every country refer to the same set of hotels for all cities, while there may be differences when comparing providers as some of them may not cooperate with specific accommodations. This set is used for all location-based analyses and contains only hotels that were found in all single scans for all configurations. We omitted results with fewer than 1,000 responses per provider to avoid bias and keep the results representative; therefore, the number of countries varies in Figure 3.5.

Orbitz.com At Orbitz.com we see extremely high-valued outliers up to € 723.25, but an almost equal distribution of medians and quartiles for all countries. The first quartile varies only between € 56.25 for Germany and € 55.32 for the USA, which might be caused by currency conversion. All medians are about € 75 (± € 1); only for Germany is the median slightly higher, with a value of € 77.24. Also, the third quartiles are around € 108 (± € 1). In total, we see an (almost) equal price distribution for the first four countries of our set. Hence, we did not examine further countries, as no differences were to be expected after such clear first findings. Furthermore, this shows the soundness of our method, since we observe a clear case of no price variation based on system fingerprints.


Figure 3.5.: Location-based price differentiation by provider (boxplots of prices in € per request country; panels: (a) Orbitz.com, (b) Booking.com, (c) Hotels.com, (d) Hrs.com)


Booking.com We see that prices for accommodations vary mainly between € 50 and € 150, while the maxima reach up to € 464.64. However, these high prices are outliers and describe single cases, presumably of luxury hotels. Interestingly, there are significant differences between specific countries. While the Georgian Republic, Germany, and Russia show a similar range of product prices and also similar median values of € 77.17, € 79, and € 79.03, the range of prices queried from a French proxy is slightly higher, as is its median of € 84.60. Still, the Netherlands and Pakistan seem to get the highest prices, as their medians are € 109 and € 99.28. Generally, low prices could be achieved using a proxy in the US; the bulk of prices lies between quartiles of € 45.01 and € 82.84. Note that for all countries prices vary in similar ranges, indicated by the box sizes, which suggests that prices do not show greater variance for one country, but are generally lower or higher for a specific country.

Hotels.com While medians are almost equal for assets requested from France, Georgia (the Georgian Republic), and Russia, France shows the greatest range of prices. For Georgia, prices tend to lie in a lower range, and in a higher range for Russia. In contrast, we see that Germany tends to get higher prices, with a median of € 106, which is very close to the third quartile (€ 107.90). At the same time, the first quartile equals € 70, so there is a wide range of prices lower than the median. Again, prices for accommodations requested over the US proxy are generally lower. A median of € 64.24 shows that low prices are offered more often. Albeit generally lower, the range of prices is almost the same as for other countries.

Hrs.com Notably, all countries yield the same maximum value of € 292.50, which shows that one outlier achieves the same price regardless of country. France, Georgia, and the USA show almost the same median values of € 79, € 82.95, and € 79.50, although prices for France vary a little less than for the Georgian Republic and the United States. Germany and Russia tend to achieve lower prices, with medians of € 69.69 and € 66.58. However, regarding the price range, Germany is comparable to France, while Russia shows the smallest range.

Summary The results of our price differentiation analysis regarding location are mixed: not all providers seem to leverage price adjustments based on a user's location. As for Orbitz.com, all examined countries were treated the same in our study, giving no hints of systematic price differentiation being performed by this platform. In contrast, for the other accommodation search providers we see a moderate deviation of medians by country for the same assets. The USA got privileged prices at Booking.com and Hotels.com, while the Netherlands and Pakistan were given rather high prices at Booking.com, and Germany at Hotels.com. At Hrs.com, prices vary more for requests from the Georgian Republic, whereas requests from Germany

and Russia likely achieve lower prices. Finally, we confirm the existence of price adjustment based on a user's location (i. e., country), although prices seem to vary in a small range only.

3.4.3. Fingerprint-based Price Differentiation

We scanned the providers mentioned above using our fingerprint set containing 332 system fingerprints. As a result of these scans, with a duration of about 19 hours, we obtained over 4,370,000 data records, including an asset's name, its provider, the used fingerprint, and the normalized price in €. Note that the request country was now set to a fixed parameter, as were the destination and dates of travel. In particular, we tested how much prices vary for every single hotel when the fingerprint of a request changes. We obtained two lists for every product (hotel or car): (i) fingerprint(s) which yield a maximum price for this asset, and (ii) fingerprint(s) which achieve a minimum price for it. This results in almost 50,000 cases showing price differences, which amounts to only about 1.12 % of all scanning results. Table 3.3 shows the number of these cases per provider.

Table 3.3.: Fingerprint-based price changes per provider

  Provider      Cases    Share
  Booking.com   20,868   0.48 %
  Hotels.com     9,174   0.21 %
  Hrs.com        9,786   0.22 %
  Orbitz.com     9,600   0.22 %
  Avis.com         181   <0.01 %
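The derivation of the two per-asset lists (maximum-price and minimum-price fingerprints) can be sketched as follows; the record layout and names are our own illustration of the described step, not the original code:

```python
# Sketch: for each asset, find the fingerprints that produced its
# maximum and its minimum scanned price.
from collections import defaultdict

def extremes(records):
    """records: iterable of (asset, fingerprint_id, price) tuples.
    Returns {asset: (max_price_fingerprints, min_price_fingerprints)}."""
    by_asset = defaultdict(list)
    for asset, fp, price in records:
        by_asset[asset].append((price, fp))
    result = {}
    for asset, entries in by_asset.items():
        hi = max(p for p, _ in entries)
        lo = min(p for p, _ in entries)
        result[asset] = ([fp for p, fp in entries if p == hi],
                         [fp for p, fp in entries if p == lo])
    return result

data = [("Hotel A", "fp1", 80.0), ("Hotel A", "fp2", 95.0),
        ("Hotel A", "fp3", 80.0)]
# For Hotel A, fp2 yields the maximum and fp1/fp3 the minimum price.
```

An asset counts as a price-difference case whenever its maximum and minimum differ, which is how the roughly 50,000 cases above were identified.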

As a first insight, we can see that fingerprint-based pricing is used to different extents. While we found the most suspicious price variation based on fingerprints at Booking.com, the other three hotel booking providers seem to deploy price differentiation with about the same intensity. However, the share of such suspicious cases showing a high price variance is rather small compared to over 4 million scanned prices. We may speculate that these are individual cases, as systematic price differentiation or even price discrimination usually has a greater impact and is not limited to a small share of cases. In addition to these initial findings, we further investigate how changing a system's fingerprint affects prices by performing a statistical significance analysis. For this purpose, we conduct Friedman tests [33] with the following data: in total, nearly 600 hotels and a selection of 130 fingerprints which yield price results for the given hotels were examined, so that for every combination of fingerprint,

hotel, request country, and provider there exists a scanned price. The Friedman test calculates the significance of price changes resulting from the fingerprints by testing equality of medians. It is a non-parametric statistical test similar to the parametric repeated measures analysis of variance. By reducing the number of fingerprints to the intersection of fingerprints which occur in all records of our data, we ensure comparability between the various characteristics, e. g., hotel providers. Before the Friedman tests can be performed, the input data requires further cleaning: for example, hotels with no free rooms must be purged, which keeps the sample size—the number of hotels—identical for each fingerprint, an important property for the statistical analysis. As input, we use a data matrix containing the normalized hotel prices (with Euro as the uniform currency) for each fingerprint per hotel. Due to proxy availability, we scanned Hotels.com from France, Germany, and Romania, and additionally from the United States of America for HRS.com and Orbitz.com. We could not include Booking.com as in the previous tests since the web application changed during our research, making scraping hotel prices impossible. In total, we conducted eleven Friedman tests—one for each combination of provider and country. In almost all cases, the p-value was lower than 0.05, which means a significant difference between at least two fingerprints in the corresponding subset. Only one test (Hotels.com from Romania) produced a p-value greater than 0.05, presumably because the median values are all equal. While we conducted post-hoc tests for all other cases, for this single case we calculated the median of medians immediately, since using post-hoc tests could lead to false positives in such a case. Table 3.4 shows an excerpt of the medians of each fingerprint for all combinations of provider and country. The complete median values used by the Friedman tests can be found in Appendix A.1.
All prices in one column which differ indicate significance, as each value is the median of medians. Note that only intra-column comparisons are valid, as the sample sizes, i. e., the numbers of hotels, vary between 397 and 594 across the columns. Within these results presented in Table 3.4, we see isolated price changes for Hotels.com, regardless of the request country. In fact, only a few fingerprints were found to be disadvantaged: looking at France as request country, only one fingerprint (FP 171) deviates by 6 € and all other fingerprints yield a median price value of 74 €. For Germany, there are two fingerprints (FP 169 and FP 183) which deviate by 5.50 €, and for Romania all fingerprints yield the same median price of 74 €. While these fingerprints caused reproducible and significant price changes, the majority of prices remained the same or showed only little variation for all other fingerprints. Examining the results for HRS, there is more significant variation of prices among the fingerprints. Generally, for every request country there are many different median prices, which means that the provider's website responded with different prices for different fingerprints. However, almost all of these significant price changes are below one Euro, so that we cannot exclude that currency conversions cause them. Only two fingerprints (FP 35 and FP 95) deviated by about 2.70 € and 2.80 €. Again, these


Table 3.4.: Excerpt of median hotel prices per fingerprint, provider and country

          Hotels            HRS                           Orbitz
   FP   Fr    De    Ro    Fr      De      Ro      USA    Fr      De      Ro      USA
    1   74    74    74    70      69.9    70      70.2   62.93   62.93   62.93   62.93
    3   74    74    74    70      69.9    70      70.2   63.24   63.24   64.19   64.19
    5   74    74    74    70.83   70.73   70.83   70.2   63.25   63.25   64.2    64.2
   21   74    74    74    70      69.9    70      70.2   63.24   63.24   64.19   64.19
   23   74    74    74    70      69.9    70      70.2   63.25   63.25   64.2    64.2
   25   74    74    74    70.4    70.3    70.4    70.41  62.93   62.93   62.93   62.93
  ···   ···   ···   ···   ···     ···     ···     ···    ···     ···     ···     ···
  165   74    74    74    70.4    70.24   70.4    70.65  63.24   63.24   64.19   64.19
  167   74    74    74    70.34   70.19   70.4    70.41  63.25   63.25   64.2    64.2
  169   74    79.5  74    70.53   70.3    70.4    70.41  62.93   62.93   63.87   63.87
  171   80    74    74    70      69.9    70      70.2   63.24   63.24   64.19   64.19
  173   74    74    74    70      69.9    70      70.2   63.25   63.25   64.2    64.2
  175   74    74    74    70.53   70.3    70.4    70.41  62.93   62.93   63.87   63.87
  177   74    74    74    70      69.9    70      70.2   63.24   63.24   64.19   64.19
  179   74    74    74    70      69.9    70      70.2   63.25   63.25   64.2    64.2
  181   74    74    74    70.53   70.3    70.4    70.41  62.93   62.93   63.87   63.87
  183   74    79.5  74    70      69.9    70      70.2   63.24   63.24   64.19   64.19
  ···   ···   ···   ···   ···     ···     ···     ···    ···     ···     ···     ···
  295   74    74    74    70      69.9    70      70.2   62.93   62.93   63.87   63.87
  297   74    74    74    70.4    70.24   70.4    70.65  63.24   63.24   64.19   64.19

price differences are significant according to the Friedman test, but as such deviations occur only twice, the existence of a price differentiation system is questionable. These findings also apply to Orbitz, as there are many price variances, too. But again, the differences between the prices are about one Euro or lower, and not a single fingerprint delivered a relevant price difference of several Euro. In fact, the price differences were found to be significant, but the reasons for such differences remain unclear and do not indicate systematic price discrimination.
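The statistical step can be illustrated with a hand-rolled Friedman statistic (rows are hotels, columns are fingerprints; in practice a library routine such as scipy.stats.friedmanchisquare would be used). The matrix below is a toy subset, not thesis data:

```python
# Hedged sketch of one Friedman test per provider/country subset.
# Rows = hotels, columns = fingerprints, cells = normalized prices.

def friedman_q(matrix):
    """Friedman chi-square statistic for an n-hotels x k-fingerprints matrix."""
    n, k = len(matrix), len(matrix[0])
    col_rank_sums = [0.0] * k
    for row in matrix:
        # rank prices within one hotel, averaging ranks on ties
        srt = sorted(range(k), key=lambda j: row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[srt[j + 1]] == row[srt[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1
            for t in range(i, j + 1):
                col_rank_sums[srt[t]] += avg_rank
            i = j + 1
    return (12.0 / (n * k * (k + 1)) * sum(r * r for r in col_rank_sums)
            - 3.0 * n * (k + 1))

# Toy subset: 4 hotels x 3 fingerprints; fingerprint 3 is always priced higher.
prices = [[74.0, 74.0, 79.5],
          [60.0, 60.0, 63.0],
          [88.0, 88.0, 90.0],
          [55.0, 55.0, 58.0]]
q = friedman_q(prices)
# chi-square critical value for k-1 = 2 degrees of freedom at alpha = 0.05
print(q, q > 5.991)  # 6.0 True
```

A significant statistic only says that at least two fingerprints differ; post-hoc pairwise tests (with multiple-testing correction) are then needed to localize which ones.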

3.4.4. Price-influencing Features

In order to investigate the individual cases of price changes due to systems' fingerprints, we dissected those fingerprints that were involved in suspicious price changes found in the previous section. Although these are rare and individual cases, we aim to learn which features are involved in price changes. Hence, we created pairs such that a fingerprint causing a low price is combined with a fingerprint causing a high price. Then, we built intermediate fingerprints for all these pairs, so-called morphprints, fading from one fingerprint to another by successively changing their


attribute values. The morphprints are naturally not real-world fingerprints; they are only intended to compare single feature changes. Combining these morphprints (Mx) with the two original fingerprints (O1, O2) gives a pack of feature changes. Table 3.5 shows an example of such a package. This matched-pairs design enables a precise analysis of which feature values influence an asset's price and how. To find the correct order for feature replacement, we applied the information gain algorithm, instrumenting the Kullback-Leibler divergence [57], to our data set, revealing every feature's importance for distinguishing all data records. It provides an order of how important and descriptive each feature is in relation to our data. We instrument this output to set the order for successive feature value replacement. In total, we created 111 morphprints and re-scanned accommodation websites, resulting in over 14,000 records. These additional scans took about six hours each. To test for reproducibility, every fingerprint and morphprint has been re-scanned twice, and in the following we only take those cases of price changes into account which could be confirmed this way. For instance, if a change of the platform attribute caused a price change of x, we switched this attribute back to the old value and checked whether the price change is −x. Changing the platform attribute again should now confirm this by resulting in a price change of x again. Only such confirmed cases are considered in the following to exclude random price changes, e. g., caused by exceeding a provider's quota, as it is common practice to reserve a definite number of rooms for online providers. First, we examine which features affect an asset's price most often. Second, we shed light on how these features' values influence online pricing.
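The morphprint construction can be sketched as follows: starting from fingerprint O1, feature values are replaced one by one with O2's values, in an order given by each feature's information gain. The concrete ordering and fingerprint contents here are illustrative, not the thesis' actual ranking:

```python
# Sketch of morphprint generation between a fingerprint pair (O1, O2).
# The information-gain ordering is assumed for illustration.

o1 = {"language": "en-US", "platform": "iPad",
      "productSub": None, "vendor": "Apple Inc."}
o2 = {"language": "de-DE", "platform": "Linux armv7l",
      "productSub": "20030107", "vendor": "Google Inc."}

# feature order by descending information gain (illustrative)
ig_order = ["language", "platform", "productSub", "vendor"]

def morphprints(src, dst, order):
    """Yield one intermediate fingerprint per successively replaced feature."""
    current = dict(src)
    steps = []
    for feature in order:
        if current[feature] != dst[feature]:
            current = {**current, feature: dst[feature]}
            steps.append(current)
    return steps

for i, mp in enumerate(morphprints(o1, o2, ig_order), start=1):
    print(f"M{i}", mp["language"], mp["platform"], mp["vendor"])
```

Scanning each morphprint in this chain localizes which single feature replacement flips the price, since adjacent fingerprints differ in exactly one attribute.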

Features

While previous research identified a system's user agent string as the top feature for fingerprinting (see Sec. 3.6), we see that a system's language is the most frequently occurring price-changing feature in our empirical data set. About one third of all found cases in

Table 3.5.: Example for morphprints of pairing (O1, O2)

  FP   language   platform       productSub   ..   vendor
  O1   en-US      iPad           null         ..   Apple Inc.
  M1   de-DE      iPad           null         ..   Apple Inc.
  M2   de-DE      Linux armv7l   null         ..   Apple Inc.
  M3   de-DE      Linux armv7l   20030107     ..   Apple Inc.
  ...
  Mi   de-DE      Linux armv7l   20030107     ..   Apple Inc.
  O2   de-DE      Linux armv7l   20030107     ..   Google Inc.

our study include a language feature—namely httpHeader.acceptLanguage, navigator.languages, or navigator.language. However, we confirm navigator.userAgent to be of importance, occurring in about 8 % of all cases in our data set. Screen resolution, represented by screen.width, screen.height, screen.availWidth, and screen.availHeight, as well as the property navigator.vendor were found to be involved in about 6 % of cases.

Table 3.6.: Features and their share in cases of price changes

  Feature                       Share
  httpHeader.acceptLanguage     14.57 %
  navigator.languages            9.73 %
  navigator.language             9.05 %
  navigator.userAgent            7.95 %
  screen.availHeight             6.90 %
  navigator.vendor               6.77 %
  screen.height                  6.50 %
  navigator.platform             6.31 %
  screen.availWidth              6.17 %
  screen.width                   5.37 %
  screen.colorDepth              4.63 %
  navigator.productSub           4.26 %
  screen.pixelDepth              4.04 %
  navigator.plugins              3.97 %
  navigator.mimeTypes            3.79 %

This indicates that these attributes might only play a minor role in pricing policies. Surprisingly, plugins and mime types are rarely involved in price changes, occurring in less than 4 % of all cases. Usually, these attributes are considered highly personalized and should therefore have a greater effect on price customization. However, from our data, we cannot confirm this intuition. Table 3.6 lists the share of every feature taking part in price changes.
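Computing the feature shares of Table 3.6 boils down to counting how often each fingerprint feature appears over all confirmed price-change cases; the case list here is illustrative:

```python
# Sketch of the feature-share computation behind Table 3.6.
# Each confirmed case lists the features whose values differed; the
# concrete cases below are illustrative, not the study's data.
from collections import Counter

cases = [
    ["httpHeader.acceptLanguage", "navigator.languages"],
    ["navigator.userAgent"],
    ["httpHeader.acceptLanguage"],
    ["screen.availHeight", "screen.availWidth"],
]

counts = Counter(feature for case in cases for feature in case)
total = sum(counts.values())
for feature, n in counts.most_common():
    print(f"{feature}: {100 * n / total:.2f} %")
```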

Feature Values

Given these findings, we now investigate which feature changes result in a price difference. For the following analysis, we only consider reproducible cases in which one single feature changed its value. Due to irregular website responses to our requests, it may happen that more than one feature changed before we scraped these websites, but we eliminated such cases beforehand. Table 3.7 presents the feature changes, their occurrences, and average price changes.


Language

Most often, with an occurrence of about 14 %, we see a change of the system's language (including httpHeader.acceptLanguage as well as navigator.languages and navigator.language) from ru (Russian) to de-de (German). Although this change was found rather often to affect an asset's price, the average price change is only about 1.27 %. Similarly, a language change from en-US (American English) to de (German) could be found in about 11.87 % of all cases and changed asset prices by about 8.88 % on average. In general, a system's language seems to be a price-influencing feature, occurring in about half of all our cases. However, its average influence on prices is rather low.

User Agent

As already indicated by previous experiments (see Table 3.6), the content of a user agent string seems to affect asset prices. Although this feature was involved in fewer cases than language settings, adjusting the user agent may result in high price differences. For instance, changing from Android 4.4.2 with the native Android browser to Windows 7 with Firefox changed hotel room prices by about 17 %. Also, switching from Mac OS X to an iPad (both with Safari) affects the prices by about 15 % on average. Although the user agent string seems to affect asset prices on platforms leveraging fingerprinting, we cannot make a general claim about which user agent or system always achieves low prices. From our data, one might suggest that switching between mobile devices and desktop computers causes the highest price changes, e. g., Android vs. Windows, but the occurrence of such cases is too low to generalize this result. However, in our results, a price difference caused by switching user agents may stem from changing from a mobile to a desktop system or the other way around.

Other Features

Besides language settings and user agent, other features could be related to price changes as well. The navigator.productSub property was switched from 20030107 to None in about 4 % of our cases and achieved an average price change of almost 0.06 %. Also, setting navigator.vendor from Google Inc. to null and changing the screen resolution (screen.availHeight and screen.availWidth) resulted in negligible price changes of about 0.06 %. Still, it is possible that these features affect asset prices in some cases; in the cases we found, however, they play only a subordinate role. This also applies to the navigator.plugins and navigator.mimeTypes properties1. These could not be related to significant price changes as they either occur rarely or have a negligible impact on asset prices.

1 We omitted these features in Table 3.7 due to their length and unreadability.

Table 3.7.: Most influencing features. The column Change represents the average price change in percent. For better readability, we present only the operating system and browser instead of the complete user agent string.

  Feature   Old Value   New Value   Occurrence   Change
  [table body garbled during text extraction; the individual feature changes, their occurrences, and average price changes are discussed in the text of Section 3.4.4]


Summary

Our results show that language settings and user agent strings are the most influential of all features. Changing these features to specific values may increase the chance of getting a lower price for online hotel booking. Adjusting other attributes, like vendor and screen resolution, may also affect online pricing policies, but only in a small range and in specific cases. Although we cannot make a general claim about which feature values should be set to hunt for the best price, our results indicate that features which are closer to the user (like language settings, operating system, and browser) have a greater impact when it comes to fingerprint-based pricing policies. Notwithstanding, our findings—especially regarding single features and their values—refer to individual cases in our data set. Although we have shown the statistical significance of these cases, we cannot claim a systematic third-degree price differentiation or price discrimination. Small price changes of a few cents may be related to currency conversions, and price changes of more than one Euro are rare and cannot be proven to be based on system fingerprinting.

3.5. Discussion

Although we handled both the data collection and analysis phases thoroughly, there are some limitations and threats to validity that we discuss in the following. First, our findings are not universally valid for the whole Internet, given that we examined only a subset of all available accommodation booking platforms and one rental car provider. Hence, our results and conclusions are in general only valid for our data set, and investigating other providers, product categories, countries, or fingerprints may verify or refute them. However, our data and results derive from realistic search requests and their valid responses, including real-world prices. To foster research on this topic, we also plan to publish all data collected during this study. Our analysis regarding location-based price differentiation sheds light on differences in pricing on a per-country basis, based on geolocation information of IP addresses. The same might exist intra-nationally, so that users from the same country but from different regions or cities may not get the same price for the same product. However, such a fine-grained analysis is not within the scope of this work, as we mainly focus on fingerprint-based price differentiation. Regarding our fingerprint-based analyses, the greatest threats to validity are special offers and hidden price boosters or discounts. It may happen that some assets' prices result from special offers or secret deals between the platform provider and an accommodation's owner. In the worst case, a discount is offered during only part of our scan, so that, e. g., fingerprints which are applied early in the scanning order would get a reduced special-offer price compared to all later fingerprints. To remedy this threat, we applied a filter to catch these cases and to ensure that only such cases

of price changes are taken into account which show nonlinear price developments. For instance, if a hotel cost €100 per night for fingerprints 1 to i, but only €80 per night for fingerprints i + 1 to n, it is possible that this price change is caused by a special offer. In contrast, if a hotel cost €100 per night for fingerprints 1 to i, but €140 per night for fingerprints i + 1 to n, we cannot exclude that the price has risen just because of our scanning, as the first fingerprints simulate a high demand for this asset: the price could have been increased as a reaction, matching supply and demand. Such cases are omitted to exclude price changes based on special discounts and provider quotas. However, we cannot guarantee that we catch all potential external influence factors. Another possible source of distortion may be the booking conditions of hotel providers. During the scraping process, we obtain the price offered at first sight per accommodation, regardless of room type and amenities, e. g., breakfast. It is reasonable to assume that this is the best price for an offer, as a lower price attracts more customers than showing a price for a premium suite including amenities. Hence, we assume that a provider's platform would always list this best price for all search requests. In practice, if a hotel offered standard rooms and premium rooms at different prices, and the standard room price is advertised for the first search request, we presume that the prices shown for the other requests of our scan are also the advertised standard room price. For providers of rental cars, this does not apply, as there are only a few car types compared to possible room types. Although there are typically several room types available, it may happen that during a scan, standard rooms become fully booked and only premium suites are offered at a higher price.
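The filter described above can be sketched as a simple step-pattern test over a hotel's price series in fingerprint scan order; the toy logic and thresholds here are an illustration, not the thesis' exact filter criteria:

```python
# Sketch of the special-offer/quota filter: a price series over the scan
# order that changes exactly once and then stays flat (a step) is discarded,
# since a discount starting or a quota being exceeded mid-scan could explain
# it. Only series with nonlinear variation are kept. Toy logic.

def is_step_series(prices):
    """True if the series is constant except for one level change."""
    change_points = sum(1 for a, b in zip(prices, prices[1:]) if a != b)
    return change_points == 1

scan_a = [100, 100, 100, 80, 80, 80]   # looks like a special offer: drop it
scan_b = [100, 98, 100, 100, 97, 100]  # nonlinear variation: keep it
print(is_step_series(scan_a), is_step_series(scan_b))  # True False
```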
Such incidents are also detected by our filter described above and excluded from our data set. Although we normalized the accommodation prices to compensate for changes in currency exchange rates, there may be external factors we cannot consider without insider knowledge. For instance, additional transaction fees for providers may differ based on their money exchanger. Therefore, especially slight price changes may result from external currency exchange factors, as already mentioned in Section 3.4.4. Regarding our analyses of single features increasing or decreasing a price depending on their specific values, we have analyzed the most striking fingerprints and created artificial morphprints. Due to the huge amount of data, a complete analysis of all possible feature changes, considering all possible values in all possible combinations, is not feasible. However, our findings are derived from real-world data, although additional feature values may be seen in the wild, meaning that even more value changes may occur and influence online pricing policies. In this study, we instrumented browser fingerprints as well as proxy connections/VPN gateways to create profiles. While not known to be deployed in practice, it might be possible for a cross-layer fingerprinting mechanism to discover a profile, e. g., if a user agent indicates a Windows machine, but a TTL (Time To Live) value in the IP header reveals a Linux system. Note that our results show clear

price variations based on browser fingerprints, regardless of whether such a complex mechanism was in place or not. For future enhancements, more providers as well as more fingerprints may be taken into account to enlarge the data set and find more comparable insights. Moreover, a longitudinal analysis of the possible price differentiation behavior of several providers may be future work. Also, including different product categories seems promising, as we would then be able to compare our findings to other assets, like online-shopped goods, office supplies, or used and new cars. As the data obtained so far is stored in a database and our software is realized as a modular Python package, we plan to publish both, so that, with other developers' help, we may extend this study further.

3.6. Related Work

Several studies revealed that online price discrimination is a common technique for online shop operators [29, 58, 109, 110, 150]. These studies are closely related to our own work, as we discuss in the following. Hannak et al. recently analyzed several e-business websites which personalize their content. They found that personalization on e-business websites provides advantages for its users, but can also lead to disadvantages, e. g., price customization [58]. Their results give evidence of price steering and discrimination practices on 9 of 16 analyzed websites. Vissers et al. analyzed price discrimination for online airline tickets [150]. Their results, however, demonstrate that it was not possible to find any evidence for systematic price discrimination on such platforms. This might stem from the fact that airlines utilize highly volatile pricing algorithms for their tickets. Another empirical study was performed by Mikians et al.: as one of the first, they empirically demonstrated the existence of price discrimination [109]. With this knowledge, they started another large-scale crowd-sourced study and were able to validate that there are price differences in e-business based on location [110]. A more recent work by Chen et al. takes a closer look at algorithmic pricing on Amazon Marketplace [29]. Our work concentrates on price discrimination on hotel booking and car rental websites. In addition, we make use of system fingerprints and analyze which fingerprinting features are the main attributes causing price changes. Additional work on web personalization tries to increase the quality of web search requests and their personalized site content [77, 93]. Personalization is important for our work because we analyze to which extent system fingerprinting methods are used for personalization. To the best of our knowledge, we are the first to extract the specific fingerprinting attributes which cause price changes.


System fingerprinting of clients is a conventional method used for user tracking and identification [42, 55, 88, 119, 161]. In contrast to client fingerprinting, website fingerprinting is a method to attack anonymity networks such as Tor through a passive observer [122, 155]. In this work, we validate our assumption that client fingerprinting methods also serve price discrimination. The economic fundamentals are extensively discussed by several economists [140, 149]. In the scope of this work, third-degree price differentiation is relevant (see Sec. 3.2.1). Datta et al. have conducted a study finding that user profile information is instrumented for gender discrimination in the context of advertising [37]. Although this indicates the existence of discrimination on the Internet, this study does not include price differentiation. Melicher et al. have shown that users are especially uncomfortable with invisible methods of user tracking, like price discrimination [107]. In contrast, noticeable effects (e. g., advertising) are experienced as tolerable. This underlines the relevance of covert price differentiation based on user behavior or system fingerprints.

3.7. Conclusion

In this chapter, we proposed a system to search for online price differentiation in a systematic way. To this end, we implemented a system capable of disguising itself as different systems based on real-world fingerprints. Utilizing this system, we sent search requests from several locations and systems to four accommodation booking providers and one rental car provider. The returned prices of all found assets were examined for systematic price differentiation behavior. We ensured that only reproducible cases of online pricing were considered, to exclude external factors. Despite recent articles about possible price discrimination based on a user's system, we could not prove the existence of a systematic approach for the examined providers. Getting a lower (or higher) price for an asset based on a digital system fingerprint is probably limited to individual cases. We have shown that in our data such cases are rare or may be caused by currency conversions. Notwithstanding, it is possible that price differentiation based on other attributes and factors is applied in the wild, like regional price discrimination. Furthermore, we investigated single attributes to find the values provoking a reproducible price change. We found a user's language settings and user agent (containing information about the operating system and browser) to be the most promising attributes to manipulate when searching for an asset's best price. In contrast to other attributes like screen resolution, these features represent a user's choice and may, therefore, be more frequently instrumented for fingerprint-based price discrimination. Albeit existent, the price changes based on changed feature values that we found are individual cases and do not follow a systematic approach.

CHAPTER FOUR

HARDWARE FINGERPRINTING AS SECOND AUTHENTICATION FACTOR

While common fingerprinting systems depend on software attributes, sensor-based fingerprinting relies on hardware imperfections and thus opens up new possibilities for device authentication. Recent work focuses on accelerometers as easily accessible sensors of modern mobile devices. However, it has remained unclear whether device recognition via sensor-based fingerprinting is feasible under real-world conditions. In this chapter, we analyze the effectiveness of specialized features for sensor-based device fingerprinting and compare the results to feature-less fingerprinting techniques based on raw measurements. Furthermore, we evaluate other sensor types—like gravity and magnetic field sensors—as well as combinations of different sensors concerning their suitability for the purpose of device authentication. We demonstrate that combinations of different sensors yield precise device fingerprints when evaluating the approach on a real-world data set consisting of empirical measurement results obtained from almost 5,000 devices.

4.1. Introduction

Many providers of modern web services aim to recognize the device a user accesses their services from. An emerging functionality is the detection of whether a user has changed the device, e. g., owns a new smartphone. The main target here is authentication of a user's hardware to detect malicious activity like account theft: if a user logs in from a device never used before, this might be a hint that the login credentials have been stolen and are abused for malicious purposes. If a user logs in from an authenticated device which is known to be the user's device, it is probably a legitimate login. Google+ already implements such a detection: if a group member performs a login from a device never seen before and this login is

deemed suspicious, a security alert is raised, resulting in an email to the group's administrator. Facebook keeps track of its users' devices and aims to associate all systems belonging to a single user. Hence, detecting whether a login is performed from a known or a new device is essential for fraud and account theft detection. Authenticating a device—and consequently binding an action to a specific device—can be an important step towards achieving this security goal. For this purpose, often the browser is fingerprinted at login time. In the course of browser fingerprinting, software attributes like the user agent and installed plugins are leveraged [3, 45, 74, 99]. Previous research found that software-based device fingerprinting performs reasonably well for highly customized commodity systems like desktop computers, mainly since the configurations of these devices vary significantly [42, 119]. In contrast, mobile devices like smartphones and tablets are highly standardized. Still, it is possible to gather characteristic attributes of such systems and even about their users using web technologies only (see Chap. 2). However, device fingerprinting is strongly dependent on software attributes. For device authentication, the fingerprint should be as immutable as possible [153], and thus it should be hardware-based. The entropy of software-based fingerprints of mobile devices may be too small to be used for authentication [143]. Additionally, cookies may be deleted and software can be changed; hence, device authentication should not rely on these factors. A hardware-based fingerprint should stay the same if a user decides to use another browser or even installs a different operating system. A device's sensors seem to be suitable for this purpose and offer essential advantages:

1. Sensors are easily accessible: accelerometer and gyroscope data can be obtained even via JavaScript without special permissions.
2. Sensors yield measurable hardware imperfections which can be leveraged for fingerprinting a device.
3. These imperfections are immune to most software changes.

Due to their manufacturing processes, hardware sensors exhibit imperfections which cause minimal yet measurable deviations between every single sensor [18]. Hence, several sensors provide distinguishable measurements for the same events, making them a suitable source for device fingerprinting. Dey et al. proposed a system called AccelPrint, introducing a thorough feature set and setting new standards for accelerometer fingerprinting [39]. However, mobile devices typically contain several sensors, and an open challenge is to figure out which sensor (or which combination of sensors) yields the best device fingerprint in practice. Furthermore, the performance of such sensor-based device fingerprinting techniques has only been analyzed in lab settings so far. Thus, it remains unclear whether these techniques could actually be applied in practice, e. g., for device authentication. In this chapter, we address these open research gaps and focus on two different aspects of sensor-based fingerprinting: first, we evaluate the features proposed by AccelPrint on a data set containing almost 5,000 devices. This data set includes

more than eight million accelerometer events collected by an app we developed and enables us to review the performance of such an approach under real-world conditions. While the mathematical features introduced by Dey et al. enable device recognition based on accelerometer data under ideal circumstances in a lab, our goal is to shed light on how precise sensor-based fingerprinting can be in the real world and what limitations such an approach has in practice. We also compare the recognition precision of the introduced features and the raw measurement data to determine whether there is a realistic need for these features. Second, we study other sensors available on modern devices (e. g., gyroscope and magnetic field sensors) and assess how device fingerprinting techniques can be improved by leveraging this information. We extend current research by investigating how the seven most common sensor types can be used for fingerprinting devices on a hardware level and empirically verify our proposed approach. Our analysis is based on five different machine learning algorithms and three data preparation processes to perform a comprehensive feasibility study. We evaluate the precision at which a unique device and a device model can be recognized.

Contribution

In summary, we make the following contributions:

• We examine the performance and necessity of the state-of-the-art feature set for accelerometer fingerprinting on a large, real-world data set.

• We investigate how other kinds of sensors available on modern devices can extend hardware-based fingerprinting for the goal of device authentication.

• We show how sensor data from several sensors can be combined to achieve a better device recognition precision.

Outline

In the next section, we describe the process of device registration and authentication leveraging sensor fingerprinting. We show how this technique can be used to enhance common user authentication methods like password authentication. Then, we introduce our data set, define the feature set used for the following experiments, and present the instrumented machine learning classifiers. We conduct experiments to recognize single devices and device models based on their sensors' hardware imperfections and test both every sensor separately and combinations of sensors. We then discuss the results as well as previous research findings related to this topic. Finally, we conclude this chapter with a short summary.


4.2. Sensor-based Device Authentication

In contrast to user authentication, which aims to prove a user's identity, device authentication aims to confirm a specific device (e. g., a unique smartphone). The overall goal is to bind an action performed by a user to the specific system (device) which is used to perform this action. Hence, if a device is authenticated, one can be sure that a specific user action was performed using exactly this device. Use cases include online banking, handling of suspicious logins, and password reset requests. If a user of an online platform has forgotten the password and requests a new one, he usually has to answer a security question. Instead of proving his identity with the knowledge of the answer, he could authenticate his device, which would make this an authentication by possession. This also applies to suspicious logins: large web service providers keep track of the devices which are used to access their services and consequently check if a user performs a login from a known or a never-seen-before system. If a login attempt seems suspicious, a user could authenticate a device to prove his identity. Another use case is online banking: In Europe and in several countries around the world, online banking transactions—no matter if web-based or app-based—need to be confirmed via a transaction number (TAN). In addition to this established method, device authentication could be performed for crucial actions like transactions above a certain amount or voiding a lost credit card. This way, the bank can be sure from which device this action was performed. In a practical attack, an attacker may get hold of an original SIM card or a replacement card of a victim's phone number and abuse it (e. g., for app-based banking, as the phone number is commonly used as an identifier).
Implementing a hardware-based mechanism for device authentication may remedy this weakness: Binding transactions to hardware—in this case, a user's mobile device—enables the detection of such fraud attempts, as the service provider is capable of recognizing that the attacker's action is not performed on the user's device. With hardware-based device authentication, SIM card theft and spoofing may be detected before a crucial action can be carried out by an attacker.

In any case, the hardware of the device to be authenticated needs to be fingerprinted, as relying on software fingerprints may not be robust enough for this purpose. We differentiate between two types of use cases:

1. Web context: The provider operates an online platform and fingerprinting techniques are restricted to Web technologies like HTML5 and JavaScript.

2. App context: The provider has deployed an app for using the service. As such an application may possess more permissions than a browser, it is able to access more of the device's resources (namely its sensors) for fingerprinting.

Although device authentication is not user authentication, it may be used as a second factor for user authentication as it establishes that a specific user owns a

specific device. This can be used as a second factor, e. g., besides knowledge-based authentication like passwords.

4.2.1. Device Registration

In order to use a device's sensor fingerprint for authentication, it needs to be registered first. The provider obtains the fingerprint belonging to a device which is to be registered and stores it in the fingerprint database. During this registration process, the device needs to stay still for a few seconds. In this time, the sensors' manufacturing imperfections are measured, resulting in the device's sensor fingerprint. These specific measuring errors are an inherent factor of the device. In contrast to knowledge and ownership/possession factors of authentication, one could refer to these hardware peculiarities as "biometrics of hardware", thus to be considered authentication by inherence. The registration procedure is crucial and needs to be secured against adversaries. An attacker could try to register a device for a targeted user account as a legitimate user device and consequently authorize banking transactions or perform successful logins or password resets. Therefore, the registration of a new device must only be possible after a successful user authentication, e. g., a login at a provider's website. For example, a user may log in to an online banking account and register a new device, which needs to be confirmed via email. Only when such a second channel is used can the registration process be performed, so that an attacker is not able to register a device without the user's knowledge and confirmation. Hence, the registration should be on-demand only. Additionally, for banking scenarios the device registration could be confirmed by a device-independent TAN method to avoid malicious registrations.

4.2.2. Device Authentication

Once the registration is done, a provider is able to distinguish and recognize devices based on their sensor fingerprints. In practice, this additional authentication could be performed to authorize crucial transactions in online banking (e. g., transactions above a certain threshold) or password resets at online platforms. It could also be used to verify a login attempt which is considered suspicious by common methods, to clarify whether it is a legitimate login or a possible attack. In any of these cases, a user would have to let the phone lie still for a few seconds, e. g., by placing it on a table. Previous work has shown that sensor imperfections can be measured within less than 30 seconds [39]. During this time, the device's sensors are fingerprinted again by measuring their hardware imperfections. The fingerprint can then be checked against previously registered devices by the provider, resulting either in a match, which represents a legitimate user action, or a reject, possibly indicating illicit behavior. Figure 4.1 illustrates this procedure.
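As an illustration of this matching step, the following sketch accepts an authentication attempt only if a freshly measured fingerprint vector lies within a tolerance of a registered one. All names, values, and the distance threshold are hypothetical; the chapter's actual matching relies on the machine learning classifiers introduced in Section 4.3.3.

```python
import math

# Hypothetical illustration: each fingerprint is a feature vector; a device
# authenticates if its fresh measurement lies close enough to its registered
# fingerprint (names, values, and threshold are assumptions).
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def authenticate(fresh, registered, threshold=0.05):
    """Return the DeviceID of the closest registered fingerprint,
    or None (reject) if no fingerprint is within the threshold."""
    best_id, best_dist = None, float("inf")
    for device_id, fp in registered.items():
        d = euclidean(fresh, fp)
        if d < best_dist:
            best_id, best_dist = device_id, d
    return best_id if best_dist <= threshold else None

registered = {"device-a": [0.012, 0.031, 0.007],
              "device-b": [0.044, 0.002, 0.019]}
print(authenticate([0.013, 0.030, 0.008], registered))  # prints "device-a"
print(authenticate([0.5, 0.5, 0.5], registered))        # rejected: prints "None"
```

In a deployed scheme, the threshold would have to be calibrated against the measured sensor noise to balance false rejects against false accepts.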


Figure 4.1.: Sensor-based device authentication for user authentication. The user knows a password (knowledge) and owns a device (possession) which inherently exhibits a sensor fingerprint; during an authentication attempt, the provider matches the password against the user database and the sensor fingerprint against the fingerprint database.

In practice, if a device cannot be recognized unambiguously, instead of failing the authentication immediately there could be a fallback solution: The unique device may not be determined, but at least the device model could be recognized from the sensor data. So, instead of being sure that a user performs an action from a specific device, at least information about the device type and model is available as enrichment for other fingerprinting mechanisms. We included this scenario in our experiments as well. As the sensor data is transferred to the provider during authentication, an attacker could try to replay a specific device fingerprint to perform a successful authentication without the previously registered device. However, obtaining the victim's sensor fingerprint or even mimicking the device's sensors' peculiarities is hard to achieve in practice, given that the sensor imperfections are hard to replicate. An attacker would either have to possess a special mobile device to intercept its sensor readings by manipulating system drivers or have to set up a computer to simulate the targeted mobile device exactly. Consequently, if any other fingerprinting or system check is employed by the provider, this has to be deceived as well, resulting in an increased effort for such a mimicry attack. Furthermore, sensor-based device authentication is an enhancement to other mechanisms and designed as reinforcement for user authentication. An attacker would still need to obtain user credentials or break other user authentication methods to perform an attack successfully.


4.3. Fingerprinting for Sensors-based Authentication

Modern mobile devices contain a variety of hardware sensors like accelerometers, gyroscopes, and sensors for rotation, magnetic fields, and gravity. Accelerometer and gyroscope readings are usually accessible via JavaScript and therefore useful for web-based fingerprinting and tracking. Although other sensors are accessible via native applications and may become available from within a web browser in the future, recent research mostly addresses accelerometers [11, 18, 34, 36, 60]. We investigate the effectiveness of the state-of-the-art features for accelerometer-based device recognition introduced by Dey et al. [39]. First, we compare the recognition precision utilizing these features to the recognition precision when using raw accelerometer data to provide insights on the usefulness of specialized features for device fingerprinting. Second, we extend current research by taking other sensor types into account to determine whether accelerometer-based results can be extrapolated. This includes common sensors of mobile devices as well as combinations of different sensors' data. All these sensors exhibit hardware imperfections due to the manufacturing process, which result in quivering measurement readings even for unmoved devices. These imprecisions usually affect the measurement value only in the thousandths and are expected to be characteristic features of different sensors.

4.3.1. Data Set

The first step of our analysis is the preparation of a comprehensive data set of sensor measurements collected from a diverse set of mobile devices. We developed a sensor benchmarking app designed to collect raw sensor readings of accelerometers and other sensors from mobile devices in two stages: First, the user is instructed to put the mobile device on a flat table and leave it still to gather clean measurements for calibration. Second, the user is asked to turn the device in different directions, so we can collect readings while an actual interaction is happening. During both of these stages, the time window of each measurement is 2 seconds at the highest sampling frequency available, as proposed by Dey et al. [39]. The app is available for Android and Blackberry phones and was distributed via the vendors' app stores. We made sure that users of the app were aware of the fact that they participated in a scientific study and that we collected information about the sensors of their mobile device. We did not store any personally identifiable information. Note that a user of the app is instructed to follow the two phases, but a user might not follow these instructions, and thus the collected data might contain outliers or even wrong measurement readings. Therefore, significant movements have been detected as outliers and filtered out for our analyses. Minor movements may be included and represent real-world settings for device authentication, as a user may have to authenticate a device while on the go.
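A minimal sketch of such a movement filter is shown below; the per-axis criterion and the threshold value are assumptions for illustration, as the chapter does not specify the exact filter.

```python
import statistics

# Sketch of outlier filtering (assumed criterion and threshold): a benchmark
# whose per-axis standard deviation exceeds the threshold indicates significant
# movement and is dropped; still benchmarks only quiver by sensor noise.
def is_still(events, max_std=0.05):
    """events: list of (x, y, z) readings from one benchmark."""
    for axis in range(3):
        values = [e[axis] for e in events]
        if statistics.pstdev(values) > max_std:
            return False
    return True

still = [(0.010, 0.020, 9.810), (0.011, 0.019, 9.812), (0.009, 0.021, 9.809)]
moved = [(0.0, 0.0, 9.8), (1.5, -0.7, 8.2), (-2.0, 1.1, 10.5)]
print(is_still(still))  # True
print(is_still(moved))  # False
```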


We collected 41,610 benchmarks consisting of 58,280,607 raw measurement events in total from 7 different types of sensors and nearly 5,000 devices. Every event yields a value for the x, y, and z axis coordinates as well as a timestamp to specify when the event was measured. A benchmark consists of all events which occurred within a 10-second time slot. Depending on the sensor type and model, there are different numbers of events per benchmark. The precise numbers of events, benchmarks, and devices per sensor are shown in Table 4.1. Although data from other sensors (e. g., proximity sensors) was collected by the app, only a minority of devices possess such sensors: Only a few benchmarks for these sensor types could be obtained, and hence the data might not be substantive enough to make a claim about the recognition precision in general. For this reason, we take only those sensors into account for which representative benchmark data is available. As different devices happen to integrate sensors manufactured by the same vendor, we show in Table 4.2 the number of different sensor models in our data set. From this representative data set, we see that there are many different sensor models for general purpose accelerometers (275), but only a few for gravity sensors or linear acceleration sensors (37 each). There is a unique identifier within the data set for every single device so that we can recognize specific devices as a ground truth. Furthermore, we store an identifier for every device model (e. g., "Google Nexus 5"). These two identifiers enable us to group the sensor measurement data by device model as well as by single devices. Hence, we can determine the effectiveness of fingerprinting features for recognizing device types (e. g., are hardware imperfections of iPhone 6 devices significantly different compared to hardware imperfections of Nexus 5 devices?) and for recognizing single devices (e. g., is it possible to tell one Nokia Lumia 930 apart from another?).
In the following, we use the term ModelID to describe the identifier used to group data by device model and the term DeviceID for the identifier used to group data by single devices. For every sensor type in the data set, we group the data once per DeviceID and once per ModelID. For both groups, we compute the features described in Section 4.3.2 from the raw sensor readings obtained by our app. This builds up four data sets in total:

1. Raw sensor measurements grouped by DeviceID called RDeviceID.

2. Feature set grouped by DeviceID defined as FDeviceID.

3. Raw sensor measurements grouped by ModelID named RModelID.

4. Feature set grouped by ModelID which we define as FModelID.
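Assuming each record carries its raw readings, its computed features, and both identifiers (a hypothetical record layout for illustration), the four data sets can be built by simple grouping:

```python
from collections import defaultdict

# Sketch (assumed record layout): grouping the same records once by DeviceID
# and once by ModelID, for both raw readings and features, yields the four
# data sets R_DeviceID, F_DeviceID, R_ModelID, and F_ModelID.
records = [
    {"device_id": "d1", "model_id": "Nexus 5",  "raw": [0.01, 0.02], "features": [0.015, 0.004]},
    {"device_id": "d2", "model_id": "Nexus 5",  "raw": [0.03, 0.01], "features": [0.020, 0.009]},
    {"device_id": "d3", "model_id": "iPhone 6", "raw": [0.05, 0.04], "features": [0.045, 0.005]},
]

def group(records, key, payload):
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec[key]].append(rec[payload])
    return dict(grouped)

R_DeviceID = group(records, "device_id", "raw")
F_DeviceID = group(records, "device_id", "features")
R_ModelID  = group(records, "model_id", "raw")
F_ModelID  = group(records, "model_id", "features")
print(sorted(R_ModelID))  # ['Nexus 5', 'iPhone 6']
```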


Table 4.1.: Numbers of events, benchmarks and devices per sensor type

Sensor Type          Events      Benchmarks  Devices
Acceleration         8,005,352   7,004       4,179
Magnetic Field       2,855,199   5,230       3,676
Orientation          8,047,497   6,228       4,963
Gyroscope            12,578,437  6,342       4,698
Gravity              9,061,253   5,726       4,374
Linear Acceleration  8,687,132   5,556       4,297
Rotation Vector      9,045,737   5,524       4,401

Table 4.2.: Number of different sensor hardware models by sensor type

Sensor Type          No. of Sensor Models
Acceleration         275
Magnetic Field       179
Orientation          147
Gyroscope            100
Rotation Vector      43
Gravity              37
Linear Acceleration  37

4.3.2. Feature Set

In the second step of our analysis, we extract the state-of-the-art features described below, originally proposed by Dey et al. [39], from the raw data records. We analyze whether such features can be leveraged for other sensors as well and thus briefly introduce the feature set in the following. As a preliminary step, we calculate the Root Sum Square (RSS) of the x, y, and z axes. Then, we extract the time domain features utilizing the NumPy [148] and SciPy [72] libraries. In order to extract the frequency domain features, we have to transfer the raw sensor readings from the time domain into the frequency domain. For this purpose, we interpolated the RSS data. We applied a cubic spline interpolation as it preserves the accuracy of the minimal hardware deviations in our data set. Because they contained fewer than four samples per measurement or lacked sensor readings from all three axes, 183 measurements had to be omitted during the interpolation phase. The root cause may be hardware failure, broken sensors, or faulty drivers. After completing this

task, we utilize the Fast Fourier Transform (FFT) to transform the interpolated measurements into the frequency domain. The frequency domain features are extracted from the transformed data. Finally, we vectorize the data to obtain sensor fingerprints utilizing the following features:

Time Domain Features

Mean is the result of dividing the sum of measurements by the number of samples in a specified time window:

\[ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x(i) \]

Standard Deviation describes how much the measurements deviate from the mean of all measurements in a specified time window. This feature provides the ability to consider noisy signals in our tests [142]:

\[ \sigma = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left(x(i) - \bar{x}\right)^2} \]

Average Deviation provides the mean of the deviations of all samples in a specified time frame. By definition, only the absolute amplitude is considered [142]:

\[ D_{\bar{x}} = \frac{1}{N} \sum_{i=1}^{N} \left|x(i) - \bar{x}\right| \]

Skewness measures the (lack of) symmetry of a distribution in a specified time frame. If the data set is symmetric, it looks similar on the left and right side of the mean. Skewness can be positive or negative if the data set is more distributed to the left or right, respectively. The skewness of symmetric data is near zero [136]:

\[ \gamma = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{x(i) - \bar{x}}{\sigma}\right)^3 \]

Kurtosis states how much the data points are distributed near or far from the mean, i.e., whether a peak of data points exists near the mean or not. Since our formula subtracts three, normally distributed data yields a kurtosis near zero [136]:

\[ \beta = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{x(i) - \bar{x}}{\sigma}\right)^4 - 3 \]

84 4.3. Fingerprinting for Sensors-based Authentication

Root Mean Square (RMS) Amplitude measures the mean of all amplitudes over time. To calculate this feature, all amplitudes are first squared so that both negative and positive values become positive. After calculating the mean of these values, the result is scaled back by taking the square root:

\[ A = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(x(i)\right)^2} \]

The RMS amplitude is normally equal to 70.7 % of the peak amplitude [52].

Lowest Value is the smallest value among the measurements in a specified time window:

\[ L = \min_{i=1 \ldots N} x(i) \]

Highest Value is the greatest value among the measurements in a specified time window:

\[ H = \max_{i=1 \ldots N} x(i) \]
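The time domain features above translate to NumPy roughly as follows (an illustrative sketch; function and variable names are our own):

```python
import numpy as np

# Time domain features of one benchmark's RSS signal, following the
# definitions above (NumPy-based sketch).
def time_domain_features(x):
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    std = x.std(ddof=1)                          # standard deviation (N - 1)
    avg_dev = np.abs(x - mean).mean()            # average deviation
    skew = np.mean(((x - mean) / std) ** 3)      # skewness
    kurt = np.mean(((x - mean) / std) ** 4) - 3  # kurtosis (minus three)
    rms = np.sqrt(np.mean(x ** 2))               # RMS amplitude
    return {"mean": mean, "std": std, "avg_dev": avg_dev,
            "skewness": skew, "kurtosis": kurt, "rms": rms,
            "lowest": x.min(), "highest": x.max()}

feats = time_domain_features([9.809, 9.812, 9.810, 9.813, 9.811])
print(round(feats["mean"], 3))  # 9.811
```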

Frequency Domain Features

Spectral Standard Deviation shows the spread of the frequencies in a spectrum relative to its mean along the frequency axis [59, 127]:

\[ \sigma_s = \sqrt{\frac{\sum_{i=1}^{N} \left(y_f(i)\right)^2 \cdot y_m(i)}{\sum_{i=1}^{N} y_m(i)}} \]

Spectral Centroid can be considered the middle point of the amplitude spectrum [10]:

\[ \zeta_s = \frac{\sum_{i=1}^{N} y_f(i) \, y_m(i)}{\sum_{i=1}^{N} y_m(i)} \]

Spectral Skewness measures the symmetry of the distribution of the spectral magnitude values relative to their mean [76, 95, 137]:

\[ \gamma_s = \frac{\sum_{i=1}^{N} \left(y_m(i) - \zeta_s\right)^3 \cdot y_m(i)}{\sigma_s^3} \]

Spectral Kurtosis determines if the distribution of the spectral magnitude values contains non-Gaussian components [154]:

\[ \beta_s = \frac{\sum_{i=1}^{N} \left(y_m(i) - \zeta_s\right)^4 \cdot y_m(i)}{\sigma_s^4} - 3 \]


Spectral Crest measures the peakiness of a spectrum and is inversely proportional to the flatness feature [162]:

\[ CR_s = \frac{\max_{i=1 \ldots N} y_m(i)}{\zeta_s} \]

Irregularity-K measures the degree of variation of successive peaks in a spectrum. Irregularity-K refers to the definition of Krimphoff et al. [82] where irregularity is the sum of the amplitude minus the mean of the preceding, same and next amplitude:

\[ IK_s = \sum_{i=2}^{N-1} \left| y_m(i) - \frac{y_m(i-1) + y_m(i) + y_m(i+1)}{3} \right| \]

Irregularity-J measures the same as irregularity-K, but refers to the definition of Jensen [70], where irregularity is defined as the sum of the squared differences in amplitude between adjoining partials [160]:

\[ IJ_s = \frac{\sum_{i=1}^{N-1} \left(y_m(i) - y_m(i+1)\right)^2}{\sum_{i=1}^{N-1} \left(y_m(i)\right)^2} \]

Smoothness measures the degree of differences between adjacent amplitudes [105, 123]:

\[ S_s = \sum_{i=2}^{N-1} \left| 20 \log(y_m(i)) - \frac{20 \log(y_m(i-1)) + 20 \log(y_m(i)) + 20 \log(y_m(i+1))}{3} \right| \]

Flatness measures the flatness of a spectrum and is inversely proportional to the spectral crest. The difference between spectral crest and flatness is that spectral crest requires less computational power, while spectral flatness yields more accurate results since non-normalized signals have less influence on the result [43]:

\[ F_s = \frac{\sqrt[N]{\prod_{i=1}^{N} y_m(i)}}{\frac{1}{N} \sum_{i=1}^{N} y_m(i)} \]
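Putting the pipeline together, the following sketch computes the Root Sum Square, resamples onto a uniform time grid, applies the FFT, and derives a few of the frequency domain features above. It is a simplified illustration: np.interp performs linear interpolation, whereas we use cubic splines, and all names and sample values are assumptions.

```python
import numpy as np

# Sketch of the pipeline for one benchmark: Root Sum Square, resampling onto
# a uniform grid, FFT into the frequency domain, then spectral features.
def magnitude_spectrum(t, x, y, z, n=64):
    rss = np.sqrt(x**2 + y**2 + z**2)            # Root Sum Square per event
    tu = np.linspace(t[0], t[-1], n)
    ru = np.interp(tu, t, rss)                   # resample (linear here)
    y_m = np.abs(np.fft.rfft(ru))                # magnitude spectrum
    y_f = np.fft.rfftfreq(n, d=tu[1] - tu[0])    # frequency bins
    return y_f, y_m

def spectral_features(y_f, y_m):
    total = y_m.sum()
    centroid = (y_f * y_m).sum() / total                   # spectral centroid
    spread = np.sqrt(((y_f**2) * y_m).sum() / total)       # spectral std. dev.
    crest = y_m.max() / centroid                           # spectral crest
    irr_j = (np.diff(y_m)**2).sum() / (y_m[:-1]**2).sum()  # irregularity-J
    flatness = np.exp(np.log(y_m).mean()) / y_m.mean()     # geometric/arithmetic mean
    return {"centroid": centroid, "spread": spread, "crest": crest,
            "irregularity_j": irr_j, "flatness": flatness}

t = np.array([0.0, 0.021, 0.039, 0.062, 0.080, 0.101])  # jittered timestamps
x = np.array([0.011, 0.012, 0.010, 0.013, 0.011, 0.012])
y = np.array([0.020, 0.019, 0.021, 0.020, 0.022, 0.019])
z = np.array([9.810, 9.812, 9.809, 9.811, 9.810, 9.813])
y_f, y_m = magnitude_spectrum(t, x, y, z)
feats = spectral_features(y_f, y_m)
print(y_m.shape)  # (33,)
```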

4.3.3. Classifier

In the third step of our analysis, we apply five field-tested classification algorithms to all data sets described in Section 4.3.1. We chose algorithms from different machine learning categories to address the model selection problem. Note that our scenario for device or model recognition does not pose a typical classification problem, as

every device and model which needs to be "classified" has been seen during the training phase. This circumstance is more closely related to matching problems. We evaluate the following five classification and ensemble methods in our experiments:

• k-NN: The k-Nearest-Neighbor classifier is a basic machine learning algorithm. We chose k = 1 as this reflects the fact that we want to achieve a matching.

• SVM: Support Vector Machines belong to the group of large margin classifiers and hence usually provide a decent generalization.

• Bagging Tree: This classifier was originally used by Dey et al. [39]. In order to evaluate its effectiveness on the basis of real-world data, we include it for comparison.

• Random Forest: The Random Forest classifier combines the merits of Bagging Tree and a random selection of features. This method mitigates the tendency to overfit.

• Extra Trees: Extra Trees is an averaging ensemble method known for its high prediction accuracy. The drawback is that it usually grows bigger than Random Forests, especially on large data sets like our sensor measurements.

In every test, we split the existing data into a training set and a test set. The training set is used for cross-validating the classifiers' parameters before creating a model and evaluating it on the test set. In the research of Guyon [56] and Amari et al. [8], the ratio of the test set to the training set is proposed to be inversely proportional to the square root of the number of features if the number of features is greater than one. For the 17 features described above, this means:

\[ \frac{1}{\sqrt{\#features}} = \frac{1}{\sqrt{17}} \approx 0.243 \]

Hence, we use a split of 75 % of the data for training and 25 % for testing. Please note that we did not use the same machine learning classification models for the recognition of device models and the recognition of single devices. We conducted these experiments separately and trained classification models for the specific tasks. To perform a comprehensive analysis, we prepared each data set in three different ways for every experiment and applied the classifiers to the data (i) as-is, (ii) normalized, and (iii) scaled. Finally, we determine the maximum recognition precision of all raw-data classifications and feature-based classifications for each data set to clarify whether the classification of models and devices performs better on raw data or on the introduced features. Every test—from splitting the data set into training set and test set up to

classification—has been performed three times, and the mean of these repetitions is reported to mitigate "lucky strikes", which may occur when a data set is split randomly.
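A minimal stand-in for this evaluation loop — a 75/25 split and a 1-NN matcher (k = 1, as chosen above) on synthetic fingerprints — might look as follows; the data and its clean separability are illustrative assumptions, not our measurement data.

```python
import numpy as np

# Sketch of the evaluation: split fingerprints 75/25, match each test sample
# to its nearest training sample (k = 1), and report recognition precision.
rng = np.random.default_rng(0)

# Three synthetic "devices", each with a characteristic offset plus noise.
X = np.vstack([offset + 0.01 * rng.standard_normal((20, 4))
               for offset in (0.1, 0.5, 0.9)])
y = np.repeat([0, 1, 2], 20)

idx = rng.permutation(len(X))
split = int(0.75 * len(X))                  # 75 % training, 25 % testing
train_idx, test_idx = idx[:split], idx[split:]

def predict_1nn(train_X, train_y, sample):
    dists = np.linalg.norm(train_X - sample, axis=1)
    return train_y[np.argmin(dists)]

hits = sum(predict_1nn(X[train_idx], y[train_idx], X[i]) == y[i]
           for i in test_idx)
print(f"recognition precision: {hits / len(test_idx):.2f}")  # 1.00 here
```

With the well-separated synthetic clusters above, every test sample is matched correctly; on real sensor data, the precision is what our experiments measure.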

4.3.4. Formalization

In the following, we work with the four data sets RModelID, FModelID, RDeviceID, and FDeviceID described in Section 4.3.1. Consequently, R represents all raw sensor measurements and F represents the features calculated on the basis of R. Hence, the feature set is derived as a function from raw sensor events:

\[ F = f(R) \]

The function f includes the steps for feature extraction described in Section 4.3.2, including the calculation of the Root Sum Square, interpolation, and the Fast Fourier Transform. Consequently, F includes all features from the time domain and the frequency domain. Please keep in mind that every data set is split into a training subset and a test subset for the subsequent machine learning procedures. Every data record of these data sets consists of a data vector and a class attribute. The data vector D includes all attributes which are used for recognition by machine learning. For data vectors of the raw sensor measurement data sets, the single values are plain readings of the dimensions x, y, and z provided by the sensors directly: $D_R = (r_1, r_2, \ldots, r_n),\ n \in \mathbb{N}$. Consequently, for data vectors of the examined feature sets, every value represents a feature: $D_F = (f_1, f_2, \ldots, f_n),\ n \in \mathbb{N}$. The class attribute c is derived from the chosen identifier, which is either the ModelID or the DeviceID. In order to calculate the recognition precision, we define a match as a true positive. A match is achieved if a data vector of the test set is related to a data vector with the same class c in the corresponding training set by the machine learning algorithm. A correct reject expresses a true negative, meaning that a non-trained device is not matched accidentally to a trained device. If a device which has been in the training set gets rejected during testing, this is a false negative, while a non-trained device which is matched with a device from the training set poses a false positive.
Finally, we are able to define the recognition precision P for specific data sets and feature sets as

\[ P_{S,M,id} = ML(Set_{Training}, Set_{Test}), \]

where S is a sensor type, $M \in \{R, F\}$, $id \in \{ModelID, DeviceID\}$, ML is the chosen machine learning algorithm, and $Set_{Training}$ and $Set_{Test}$ are subsets corresponding to S and id. For instance, the recognition precision achieved by a Bagging Tree classifier (BT) for the data set of features grouped by DeviceID and based on gravity sensor data is:

\[ P_{Gravity,F,DeviceID} = BT(train_{F,DeviceID}, test_{F,DeviceID}). \]

Given these definitions, we are able to describe every scenario that is feasible using our data set described above.
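The match/reject bookkeeping defined above can be expressed as a short sketch (illustrative data; `tally` is a hypothetical helper):

```python
# Sketch of the match/reject bookkeeping: a match of a trained device is a
# true positive, a correct reject of a non-trained device a true negative,
# a rejected trained device a false negative, and a matched non-trained
# device a false positive.
def tally(results):
    """results: list of (was_trained, was_matched) booleans."""
    tp = sum(t and m for t, m in results)          # trained device matched
    tn = sum(not t and not m for t, m in results)  # non-trained rejected
    fn = sum(t and not m for t, m in results)      # trained rejected
    fp = sum(not t and m for t, m in results)      # non-trained matched
    return tp, tn, fn, fp

results = [(True, True), (True, True), (True, False),
           (False, False), (False, True)]
tp, tn, fn, fp = tally(results)
print(tp, tn, fn, fp)  # 2 1 1 1
```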


4.4. Evaluation

We conducted recognition experiments for every sensor type with each data set utilizing each classifier, seeking the best precision to use for device authentication. Hence, we present only the results of the best performing classifiers for each experiment. More specifically, for each sensor type and each data set we applied the algorithms described in Section 4.3.3, but for comparison we take the maximum recognition rate of all classifiers into account. Furthermore, we repeated every test with the data set three times using it (i) as-is, (ii) scaled, and (iii) normalized to ensure the best preprocessing for every test. Again, we describe only the best results of all preprocessing methods in the following to compare the results of the best performing fingerprinting processes. A comparison of non-best performing classifiers and preprocessing methods would be possible but does not support our aim of finding the best method for hardware-based fingerprinting for the purpose of device authentication. In order to determine the effectiveness of state-of-the-art features over raw data for sensor fingerprinting, we carried out several tests in two phases: First, we ran comparison tests for every single sensor listed in Table 4.1. The goal of these tests is to compare the recognition precision between utilizing the raw sensor data and the extracted features for fingerprints for each sensor. Second, we combined the data from different sensors into multi-sensor tests and applied the described methods to clarify the recognition precision when taking several sensors into account. We determined five combinations to be of interest due to results from previous experiments:

1. Accelerometers includes sensors for measuring acceleration and linear accel- eration. Recognizing this data precisely has been the explicit purpose of the fingerprint features.

2. Accelerometers & Gyroscope extends the combination by gyroscope sensor readings. Usually, if a device embeds accelerometers, a gyroscope is built-in.

3. All Available Sensors includes data from all sensors listed in Table 4.2.

4. No Accelerometers takes only sensors into account that do not measure accel- eration. This is the “inverse scenario” of scenario 1.

5. No Accelerometers & Gyroscope is the “inverse scenario” of scenario 2 and excludes acceleration sensors as well as gyroscope measurements.
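For the combination tests, per-sensor fingerprint vectors can simply be concatenated into one multi-sensor vector, as in this sketch (sensor names and feature values are illustrative):

```python
# Sketch: building a multi-sensor fingerprint by concatenating the per-sensor
# feature vectors of one device (illustrative names and values).
fingerprints = {
    "acceleration":        [0.015, 0.004, 1.2],
    "linear_acceleration": [0.013, 0.003, 0.9],
    "gyroscope":           [0.002, 0.001, 0.4],
}

def combine(fingerprints, sensors):
    vec = []
    for s in sensors:
        vec.extend(fingerprints[s])
    return vec

accelerometers = combine(fingerprints, ["acceleration", "linear_acceleration"])
with_gyro = combine(fingerprints, ["acceleration", "linear_acceleration",
                                   "gyroscope"])
print(len(accelerometers), len(with_gyro))  # 6 9
```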

In the following, we present the results of the single sensor tests as well as the combination tests.


4.4.1. Single Sensor Tests

Our experiments confirm the overall result by Dey et al. [39]: The presented features provide the best precision for recognizing single devices on the basis of general purpose accelerometer data. Nevertheless, the precision in this case is only about 78 %, leaving room for improvement. Our results indicate that for linear acceleration sensors as well as for gyroscope data, the proposed feature set provides a better recognition rate than using the raw sensor readings. However, the corresponding precision rates of about 49 % and 59 % are not suitable for device authentication in practice. Hence, device authentication methods should not rely on accelerometers and gyroscopes only. For all other sensors, the utilization of features leads to a lower precision compared to the raw sensor measurements. The highest recognition precision could be achieved with plain sensor measurements, and for the majority of sensor types raw data leads to a better precision. Thus, device authentication does not need to be based on a mathematical feature set but is feasible using raw sensor data as well. Table 4.3 summarizes the results of all single-sensor recognition experiments. It is tempting to suspect a connection between a high device recognition precision and the number of different sensor hardware models shown in Table 4.2. Nevertheless, we could not show significance to substantiate this assumption: Linear acceleration sensors and rotation sensors both have low model diversity, yet while the former are not suitable for device recognition, with a maximum precision of about 59 %, the latter show an excellent precision for recognizing single devices of almost 100 %. These differences are visualized in Figure 4.2. As shown in Figure 4.3, when it comes to model recognition, the feature-based precision for accelerometers, magnetic field sensors, orientation sensors, and rotation sensors is clearly lower compared to when raw measurement events are used.
Conversely, using raw data for model recognition fails for data from gyroscopes, gravity sensors, and linear accelerometers. The feature set performs better for these sensors, but still, the recognition precision does not exceed 55 % and cannot be considered reliable for model recognition in a practical setting. Thus, recognizing device models by only one sensor type does not require the use of features and can be done on the basis of raw sensor data for four of seven sensor types, while the other three sensor types cannot be used to distinguish between device models at all.

In summary, the state-of-the-art feature set for accelerometer fingerprinting serves its purpose and is a reasonable means of device recognition based on accelerometer and gyroscope data. However, it is not suitable to distinguish devices based on data from other sensor types. Furthermore, using raw measurements of these other sensor types enables an even higher recognition precision. The use of accelerometer-based fingerprinting utilizing mathematical features for device authentication is questionable, as fingerprinting based on other sensors performs significantly better.
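The single-sensor comparison above can be sketched as follows: the same classifier is trained once on raw measurement windows and once on a condensed feature representation, and the resulting recognition precisions are compared. This is a minimal, hedged illustration; all data is synthetic and stands in for the real sensor readings, and the feature list is an illustrative subset, not the study's full set.

```python
# Sketch: raw sensor windows vs. condensed features as classifier input.
# Synthetic per-device bias profiles mimic hardware imperfections.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_devices, windows_per_device, window_len = 5, 40, 64

X_raw, y = [], []
for dev in range(n_devices):
    bias = rng.normal(0, 0.5, size=window_len)  # per-unit sensor bias
    for _ in range(windows_per_device):
        X_raw.append(bias + rng.normal(0, 0.2, size=window_len))
        y.append(dev)
X_raw, y = np.array(X_raw), np.array(y)

# Condensed representation: a few time-domain statistics per window.
X_feat = np.column_stack([X_raw.mean(1), X_raw.std(1),
                          X_raw.min(1), X_raw.max(1)])

def avg_precision(X, y):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)
    clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
    return precision_score(yte, clf.predict(Xte), average="macro")

p_raw, p_feat = avg_precision(X_raw, y), avg_precision(X_feat, y)
print(f"raw: {p_raw:.2f}  features: {p_feat:.2f}")
```

On the real data set, which representation wins depends on the sensor type, as Table 4.3 shows.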


Table 4.3.: Recognition precisions per sensor type, identifier, and data set of single-sensor tests, in percent

Sensor             Identifier  Data  Classifier  Avg. Precision
-----------------  ----------  ----  ----------  --------------
Acceleration       Device      F     ET          78.23
                   Device      R     k-NN        62.68
                   Model       F     ET          69.39
                   Model       R     BT          76.46
Magnetic Field     Device      F     ET          78.01
                   Device      R     RF          96.38
                   Model       F     ET          57.91
                   Model       R     ET          96.42
Orientation        Device      F     ET          75.24
                   Device      R     k-NN        98.20
                   Model       F     ET          58.74
                   Model       R     k-NN        98.10
Gyroscope          Device      F     BT          49.44
                   Device      R     k-NN        41.45
                   Model       F     BT          50.50
                   Model       R     k-NN        45.16
Gravity            Device      F     ET          60.95
                   Device      R     k-NN        82.99
                   Model       F     ET          54.72
                   Model       R     k-NN        10.00
Lin. Acceleration  Device      F     BT          58.92
                   Device      R     k-NN        18.81
                   Model       F     BT          48.35
                   Model       R     k-NN        10.14
Rotation Vector    Device      F     ET          70.72
                   Device      R     k-NN        99.81
                   Model       F     ET          55.57
                   Model       R     k-NN        99.82

R = raw data, F = features; k-NN = k-Nearest Neighbor, BT = Bagging Tree, ET = Extra Trees, RF = Random Forest. Only the best-performing classifier is shown for each combination.


[Figure omitted: bar chart of recognition rates (0 to 100 %) for accelerometer, magnetic field, orientation, gyroscope, gravity, linear acceleration, and rotation vector sensors, comparing raw data and features]

Figure 4.2.: Recognition precisions per sensor for device recognition

[Figure omitted: bar chart of recognition rates (0 to 100 %) for accelerometer, magnetic field, orientation, gyroscope, gravity, linear acceleration, and rotation vector sensors, comparing raw data and features]

Figure 4.3.: Recognition precisions per sensor for model recognition


4.4.2. Multi Sensor Tests

While the use of single sensors does not seem to provide a reliable method for device or model recognition, and thus for device authentication, precision rates generally increase when sensor types are combined. The first combination includes both types of accelerometers. In this case, using the feature set for device recognition performs well and achieves a precision of about 92 %. In every other case we tested, the utilization of features did not exceed the precision attained by the use of raw measurements. Especially when accelerometers are left out (cases four and five), recognition based on raw data is more effective. In total, no precision result is lower than 88.5 %, while the maximum of 99.99 % can be achieved by using raw data of all sensors except accelerometers. Table 4.4 shows the results of all combination tests.

Consequently, using raw measurements of magnetic field, orientation, gravity, rotation, and gyroscope sensors is most effective for fingerprinting mobile devices. These sensors, which are common in modern devices, improve sensor-based fingerprinting significantly and can be used as a basis for reliable device fingerprinting in practice. Figure 4.4 shows the achieved maximum precision of each sensor combination described above. This finding also holds for model recognition: again, for accelerometers the feature set yields the best precision, but in all other combinations, features are not necessary to achieve recognition rates of up to 99.995 %. Figure 4.5 shows the results of the combination tests per model.
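One simple way to realize the sensor combination described above is to concatenate the per-sensor measurement vectors into a single fingerprint before classification. The sketch below follows that idea with a Random Forest; the sensor names follow the text, but all numeric data is synthetic and merely illustrative.

```python
# Sketch: combine several sensor types by concatenating their readings
# into one fingerprint vector per sample, then classify devices.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
sensors = ["magnetic_field", "orientation", "gravity", "rotation", "gyroscope"]
n_devices, samples_per_device, dim = 4, 30, 16

X_parts, y = {s: [] for s in sensors}, []
for dev in range(n_devices):
    profile = {s: rng.normal(0, 0.4, dim) for s in sensors}  # per-unit bias
    for _ in range(samples_per_device):
        for s in sensors:
            X_parts[s].append(profile[s] + rng.normal(0, 0.15, dim))
        y.append(dev)

# Combine: one row holds all sensors' readings side by side.
X = np.hstack([np.array(X_parts[s]) for s in sensors])
scores = cross_val_score(RandomForestClassifier(n_estimators=100,
                                                random_state=0),
                         X, np.array(y), cv=5)
print(f"combined-sensor CV accuracy: {scores.mean():.3f}")
```

With clearly separable per-device profiles, the combined vector is easier to classify than any single sensor's readings, mirroring the trend in Table 4.4.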

[Figure omitted: bar chart of recognition rates (80 to 100 %) for the combinations accelerometers; accelerometers and gyroscope; all sensors; no accelerometers; no accelerometers and no gyroscope, comparing raw data and features]

Figure 4.4.: Recognition precisions per combination for device recognition


Table 4.4.: Recognition precisions per sensor combination, identifier, and data set of combination tests, in percent

Sensors                Identifier  Data  Classifier  Avg. Precision
---------------------  ----------  ----  ----------  --------------
Accelerometers         Device      F     BT          92.48
                       Device      R     BT          88.69
                       Model       F     ET          91.54
                       Model       R     BT          89.65
Accelerometers         Device      F     ET          88.50
  & Gyroscope          Device      R     BT          88.84
                       Model       F     ET          92.40
                       Model       R     RF          95.00
All Available          Device      F     ET          98.60
  Sensors              Device      R     ET          99.98
                       Model       F     ET          98.16
                       Model       R     RF          99.99
No Accelerometers      Device      F     RF          97.25
                       Device      R     ET          99.99
                       Model       F     ET          97.46
                       Model       R     ET          99.98
No Accelerometers      Device      F     RF          94.64
  & No Gyroscope       Device      R     RF          99.98
                       Model       F     RF          96.05
                       Model       R     ET          99.97

R = raw data, F = features; BT = Bagging Tree, ET = Extra Trees, RF = Random Forest. Only the best-performing classifier is shown for each combination.


[Figure omitted: bar chart of recognition rates (80 to 100 %) for the combinations accelerometers; accelerometers and gyroscope; all sensors; no accelerometers; no accelerometers and no gyroscope, comparing raw data and features]

Figure 4.5.: Recognition precisions per combination for model recognition

4.5. Discussion

While we found the feature set to be most precise for device recognition on the basis of accelerometer data, the best recognition rates for devices and models can be achieved by applying sensor combinations without accelerometers to raw measurements. Taking common sensors together, recognition precisions of 99.98 % up to 99.995 % can be achieved without needing to consider complex features. Our experiments indicate that combining the data from different sensors leads to more effective fingerprinting than the application of the proposed features. The feature set is suitable for recognizing single devices by accelerometer data, but not reliable under any other circumstances. Furthermore, given a large quantity of real-world data, the same results can be achieved without these features using the same or comparable machine learning techniques. For other sensor types, using raw sensor data is more effective, especially for the recognition of single devices. However, both data types have disadvantages: on the one hand, calculating features requires computational power but condenses the data; on the other hand, storing all events' raw measurements requires more storage capacity, but no mathematical calculations need to be made. Ultimately, single devices as well as device models can be recognized best when combining the measurement data of several sensors.

For the purpose of device authentication, sensor fingerprinting is a valid method: high recognition rates can be achieved under realistic conditions on a real-world data set. While previous research mostly focuses on accelerometers and gyroscopes, as these are accessible via the web, we found other sensor types' hardware imperfections to be more characteristic, making them even more relevant in this context. As

sensor-based hardware fingerprinting opens up the possibility to distinguish unique devices at very high precision, it does not seem necessary to have a fallback solution like device model recognition at all.

In addition to the adversarial scenarios described in Section 4.2, it may be possible to randomize sensor measurements in order to prevent recognition of a specific device [35]. However, tampering with sensor readings using random data requires a customization of the device's software, such as its browser when sensors are queried by websites, or even the operating system when apps access the sensors for fingerprinting. Furthermore, tampering with sensor readings raises a problem in practice: sensors are used for specific purposes, and adding randomness to their measurements may help to evade fingerprinting of their hardware imperfections, but may also result in unwanted behavior of functionality which relies on these sensors. For instance, if a website accesses a device's accelerometers or gyroscope and their data is randomized or tampered with by the device first, the website's functionality and hence the user experience may be affected. Ultimately, as the goal is to authenticate a device and randomization is only capable of preventing recognition, the more relevant attack would be the imitation of a specific device. For such a mimicry attack, an attacker would need to forge sensor data and replace it with the target device's sensor data. As described in Section 4.2.2, an attacker has to solve several challenges to perform this attack and has little chance of success. Although such an attack is difficult to carry out in practice, we will investigate this scenario in future work.

As more and more sensors are embedded in modern mobile devices, examining further sensor types for the purpose of hardware-based device fingerprinting will be the subject of future work.
The availability of other sensors may lead to even better recognition results. As a future enhancement, our approach may be applied to wearables like smartwatches. Recent research has shown that techniques developed for smartphones are transferable to such devices [94].
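The computation-versus-storage trade-off discussed above can be made concrete with a small sketch: a raw window of N readings is either stored as-is or collapsed into a handful of time- and frequency-domain statistics. The specific features below (mean, standard deviation, RMS, spectral centroid) are only an illustrative subset in the spirit of the Dey et al. [39] feature set, not the exact list used in the evaluation.

```python
# Sketch: condensing a raw sensor window into a few time/frequency features.
import numpy as np

def condense(window):
    """Collapse a raw sensor window into a small feature vector."""
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window))
    centroid = (freqs * spectrum).sum() / spectrum.sum()
    return np.array([
        window.mean(), window.std(),    # time domain
        np.sqrt((window ** 2).mean()),  # RMS (time domain)
        centroid,                       # frequency domain
    ])

rng = np.random.default_rng(2)
raw = rng.normal(0, 1, 512)   # 512 stored raw readings ...
features = condense(raw)      # ... versus 4 stored feature values
print(len(raw), "->", len(features))
```

Storing `features` saves space at the cost of the FFT and statistics computation, while storing `raw` avoids any calculation but needs far more capacity, exactly the trade-off outlined above.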

4.6. Related Work

Dey et al. [39] proposed mathematical features based on accelerometer readings for fingerprinting mobile devices. Their work illustrates the possibility to identify devices by conducting a series of training and test set scenarios on 107 different stand-alone chips, smartphones, and tablets under laboratory conditions. While their work focuses on accelerometers only, we also inspected other sensor types like magnetic field or rotation vector sensors. The usefulness of the feature set could be verified for accelerometers on an extensive real-world dataset of nearly 5,000 devices. We have also shown that fingerprinting mobile devices is more precise when taking

other available sensor data into account. Furthermore, our results indicate that machine learning algorithms can be applied to the raw measurement events and that specific features used for pre-processing the raw measurements do not yield better results.

Several studies focus on real-world accelerometer data for recognizing movement or behavior. For instance, it has been shown that the steps of a walking or running person can be detected clearly with the help of a smartphone's accelerometers [141]. Dargie and Denko studied the behavior of accelerometers during similar movements and placed accelerometers on moving humans and cars [34]. They conclude that the extracted frequency domain features generally remain more robust than time domain features. In our study, we applied sensor readings gathered from both resting and moving devices and included features of the time domain as well as the frequency domain. A study by He utilized machine learning techniques to recognize human activities by accelerometer and gyroscope data [60]. Three feature sets were applied, including 561, 50, and 20 features, to distinguish between six different human activities. While this work aims to detect activities, our experiments do not consider the current movement as an artifact but aim to identify devices (and group devices by model) on the basis of real-world sensor data.

A non-sensor-based method for hardware fingerprinting has been introduced by Moon et al. [113] as well as Kohno et al. [80]. The identification of devices is achieved by measuring clock skews. While the common idea is the recognition of devices by hardware differences, these studies focus on time differences and do not consider any of a device's sensors. Bates et al. explored mobile device model recognition and showed that manufacturer models can be distinguished by USB data with an accuracy of 97 % [12].
Nevertheless, our experiments have shown that an even higher accuracy can be achieved by sensor-based hardware fingerprinting. Cao et al. have shown that instrumenting hardware-level features, including the graphics card, audio stack, and CPU, may enhance classic browser fingerprinting and even enables cross-browser fingerprinting [28]. Hardware as a source for fingerprinting has also been utilized by Li et al., who created signatures from GPU core frequency variations [97]. However, the introduced fingerprinting method cannot be leveraged for authentication directly. Duarte et al. presented an approach for classifying physical activities using a smartphone's hardware sensors [41]. The authors were able to detect indoor and outdoor running, cycling, rowing, and inactivity. While this indicates how precise hardware sensors are, our work, in contrast, focuses on hardware imperfections.


4.7. Conclusion

In this chapter, we performed a detailed assessment of the effectiveness of sensor-based fingerprinting. We compared the benefit of using a well-defined feature set, including attributes from the time domain as well as the frequency domain, to using raw sensor data as input. We utilized five different machine learning techniques together with three data preparation processes and compared the precision at which a single device or a device model can be recognized on the basis of its hardware. To base our work upon real-world conditions, we gathered sensor data from almost 5,000 mobile devices. As a part of our work, we implemented the signal feature extraction process described by Dey et al. [39].

While we found the proposed feature set suitable for accelerometer-based recognition of single devices, we have shown that it lacks precision for other sensor types. For non-accelerometer sensors, the use of raw sensor readings as a basis for hardware fingerprinting results in a higher recognition precision. Furthermore, combining different sensor types leads to an even better precision and a higher robustness. We find that accelerometer measurements combined with other sensor data yield real-world recognition precisions of 99.98 % up to 99.995 %. In general, taking other common sensor types into account for fingerprinting results in a better precision than utilizing the previously proposed feature set.

Given these findings, hardware-based device fingerprinting with sensor data is feasible and a valid method for device authentication. However, device authentication methods should not rely on accelerometers and gyroscopes only but on combinations of different sensor types. For these, the calculation of features means computational effort without improving device recognition. Ultimately, using raw measurements of different sensor types is the most accurate way to instrument sensor-based hardware fingerprinting for device authentication. Implementing such an authentication mechanism may help to handle suspicious login attempts and password resets, and may even remedy SIM spoofing.

Chapter 5. Usability of Motion Fingerprints for Liveliness Tests

Instrumenting sensors for device fingerprinting is feasible and enables the recognition of hardware, as we have shown in the previous chapter. Besides measuring hardware imperfections while the device is kept still, we examine whether motions performed by a user are suitable for fingerprinting as well. We shed light on gesture fingerprinting in the context of CAPTCHAs.

These are challenge-response tests often used to determine whether a website’s visitor is a human or an automated program (so-called bot). Existing and widely used CAPTCHA schemes are based on visual puzzles that are hard to solve on mobile devices with a limited screen. We propose to leverage movement data from hardware sensors to build a CAPTCHA scheme suitable for mobile devices. Our approach is based on human motion information, and the scheme requires users to perform gestures from everyday life (e. g., hammering where a smartphone should be imagined as a hammer and the user has to hit a nail five times). We implemented a prototype of the proposed method and report findings from a comparative usability study with 50 participants. The results suggest that our scheme outperforms other competing schemes on usability metrics such as solving time, accuracy, and error rate. Furthermore, the results of the user study indicate that gestures are a suitable input method to solve CAPTCHAs on (mobile) devices with smaller screens and hardware sensors.


5.1. Introduction

CAPTCHAs1 (Completely Automated Public Turing tests to tell Computers and Humans Apart) are challenge-response tests used to distinguish human users from automated programs masquerading as humans. Due to the increasing abuse of resources on the web (e.g., the automated creation of website accounts that are then used to perform malicious actions), captchas have become an essential part of online forms and the Internet ecosystem. They typically consist of visual puzzles intended to be easy to solve for humans, yet difficult to solve for computers [151]. The same idea is applied to audio puzzles. However, these are in reality often annoying for human users to solve [46]. Furthermore, visual pattern recognition algorithms have gradually improved in recent years, making automated captcha solving feasible. For example, Bursztein et al. [23, 24] highlighted that due to the arms race between captcha designers and OCR algorithms, we must reconsider the design of (reverse) Turing tests from the ground up. As such, there is a continuous arms race to design captcha schemes that are secure against automated attacks but still usable for humans. In the last few years, mobile devices have become a primary medium for accessing online resources. While most web content has already been adjusted to smaller screens and touchscreen interactions, most captcha schemes still suffer from these usability constraints and are perceived as error-prone and time-consuming by their users: several studies demonstrated that captcha usability in the mobile ecosystem is still an unsolved challenge [23–25, 46, 134, 159]. According to Reynaga et al. [133], captchas are primarily evaluated on their security, and limited usability work has been carried out to assess captcha schemes for mobile device usage.
With the emerging proliferation of wearable devices such as smartwatches, it becomes inevitable to rethink user interactions with captchas to successfully tell humans and computers apart, without placing the burden on users that struggle with hard-to-solve visual or audio puzzles. In this chapter, we present Sensor Captchas, a captcha scheme designed for mobile devices. Based on previously published findings, we collected a set of design recommendations to which we tie our design decisions. We propose motion features from hardware sensors as a novel input paradigm for mobile captchas. A user is expected to perform gestures derived from everyday actions which can either be known or imagined easily. Examples are the gesture hammering, where the smartphone should be imagined as a hammer and the user has to hit a nail five times, or drinking, where a user is asked to drink from the smartphone, imagining it is a glass of water. Our approach is solely based on state-of-the-art sensors available in most smartphones and wearables, such as gyroscope and accelerometer, and obviates the need for users to solve complex graphical puzzles on small screens.

1For better readability, we write the acronym in lowercase in the following.


We implemented a prototype of the proposed scheme and present a repeated measures user study to compare our approach to state-of-the-art visual captcha schemes (namely reCAPTCHA and noCAPTCHA2) as well as an innovative mechanism called Emerging Image Captcha [158]. Our findings show that sensor data is a suitable input for captcha challenges, with a high success rate and a low solving time when leveraging gestures. While some gestures are easier to solve than others, the overall success rate shows the feasibility of our approach. Users rated our new captcha mechanism as comparable to established captcha schemes, and we were able to demonstrate a learning effect within the first 15 challenges.

Contribution

In summary, we make the following contributions:

• We designed an extensible captcha scheme using accelerometer and gyroscope data as user input and machine learning classification for challenge validation.

• Based on a prototype implementation of the proposed scheme, we conducted a thorough user study with 50 participants to evaluate the usability of our approach, including a survey for direct user feedback.

• We compared our approach to well-known, established captcha methods (re- CAPTCHA and noCAPTCHA) as well as another innovative scheme (Emerging Images) regarding success rates, solving times, and user experience.

Outline

In the following section, we derive requirements for modern captcha schemes and ensure that our captcha mechanism meets the standards used in recent research. We also present a set of gestures which are used as challenges in the subsequent study. Then, we describe the design and implementation of our usability study before presenting its evaluation. We examine the gestures' effectiveness and compare our captcha scheme to well-established mechanisms as well as another experimental scheme. We appraise our approach regarding user experience, technical correctness, and habituation effects before discussing security considerations and limitations. Finally, we present related research efforts and conclude with a short summary.

2noCAPTCHA is also referred to as new reCAPTCHA [66]


5.2. Hardware Sensors as User Input for Captchas

Modern mobile devices contain a variety of hardware sensors, including accelerometers and gyroscopes, which are accessible via web techniques like JavaScript and HTML5. These sensors are so accurate that it is possible to detect the steps of a walking person [141] and to distinguish between certain user actions [60]. As the main difference to existing captcha schemes, we utilize these hardware sensors as an input channel for solving a challenge. The benefit of this input channel is that a user does not need to type text on a small soft keyboard on a smartphone, but can use a simple movement to prove liveliness. In practice, a website provider aims to distinguish a human user from an automated bot and therefore utilizes a captcha challenge. In our approach, this challenge is represented by a gesture a user has to perform. More precisely, a user gets a gesture description as a challenge and tries to carry out this gesture. During the user's attempt to perform the required action, sensor measurements are recorded, capturing the device's movement. These measurements are treated as an ad-hoc motion fingerprint of the user's performance, as they are characteristic attributes of the movement and are created on demand when requested. If the user-created motion fingerprint matches the challenge gesture, the captcha is considered solved. We explored possible gestures for such challenges, as they need to satisfy several requirements:

• Understandable: Users need to be able to understand the challenges and what they are supposed to do immediately.

• Accurate: The challenge needs to enable a precise differentiation between human users and automated bots.

• Deterministic: The choice whether a human or a bot is currently visiting a website needs to be deterministic.

• Solvable: It must be possible to solve the challenge within a reasonable amount of time.
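The challenge-response flow constrained by these requirements can be sketched as follows. This is a hedged illustration of the protocol only: `classify_motion` is a hypothetical stand-in for the trained server-side classifier, and the toy events dictionary merely simulates recorded sensor data.

```python
# Sketch of the captcha flow: server issues a gesture challenge, the client
# records sensor events while the user performs it, and a classifier decides
# whether the recorded motion fingerprint matches the challenge.
import random

GESTURES = ["hammering", "bodyturn", "fishing", "drinking", "keyhole"]

def issue_challenge(rng=random):
    # Server side: pick a gesture description to present to the user.
    return rng.choice(GESTURES)

def verify(challenge, recorded_events, classify_motion):
    """The captcha passes only if the classifier maps the recorded
    motion fingerprint to the challenged gesture."""
    predicted = classify_motion(recorded_events)
    return predicted == challenge

# Toy stand-in classifier: here the events trivially encode the gesture.
events = {"gesture_hint": "drinking"}
ok = verify("drinking", events, lambda ev: ev["gesture_hint"])
print(ok)  # prints True
```

Note that the decision is deterministic for a given classifier and recording, matching the Deterministic requirement above.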

5.2.1. Gesture Design

In an early stage of our research, we chose very simple gestures like moving a device in a circle clockwise. While these movements were easy for users to understand, it was hardly possible to precisely distinguish between gestures due to too much variance: we did not include any explicit statements about the size and speed of the movement, so users were not able to solve these challenges accurately. Learning from these findings, we chose five gestures for our user study which are derived from everyday actions a user might either know or imagine easily:


• Hammering: The smartphone should be imagined as a hammer and a user has to hit an imagined nail five times.

• Bodyturn: A user is challenged to turn fully around counter-clockwise.

• Fishing: The smartphone should be imagined as a fishing rod which is to be cast.

• Drinking: A user is asked to drink from the smartphone, imagining it is a glass of water.

• Keyhole: The smartphone is an imaginary key which is to be put in a door lock and rotated left and right like unlocking a door.

Note that these gestures can easily be extended, e.g., by randomly choosing the number of times the “hammer” has to hit the imaginary nail or by taking a clockwise bodyturn into account. With such variations, more gestures are possible, so that in practical use not only five movements are available, but a great variety of different challenges can be designed. The gestures can be presented to users in various ways. For our prototype and user study, we described all gestures to perform in short texts. Alternatively, the challenge can be presented by pictures showing a drawing of a human performing the requested movement, or even by an animated image or a short video clip.

When a user performs a gesture, accelerometer and gyroscope readings are recorded and transferred to a web server afterward. On the server side, we use a machine learning classifier to determine whether the recorded sensor data, the user's motion fingerprint, matches the challenged gesture. If the motion fingerprint created by the user can be classified as the demanded gesture, the captcha has been solved successfully. If it is rejected by the classifier or matches a wrong gesture, the captcha has failed.

Using machine learning technology in our captcha scheme is based on the following observation: if a captcha relies on text input, the challenge text is generated first and held by the server. When the user enters the text, this input can be compared to the generated text immediately. In our scenario, there is no challenge data produced in advance against which the user input could be compared, as it is not practical to generate three-dimensional acceleration data and challenge a user to perform exactly this movement with a smartphone. Hence, we need a decider which is capable of distinguishing the characteristics of one movement from another and of ultimately determining whether a given ad-hoc motion fingerprint matches the challenge's gesture.
A machine learning classifier is an appropriate mechanism for this task, as it constitutes a classification problem.
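As a sketch of this decider: assuming ground-truth motion fingerprints are available as feature vectors, a classifier is trained on the labeled gestures, and an ad-hoc fingerprint is accepted only if it is classified as the challenged gesture. The fingerprint vectors below are synthetic placeholders; real fingerprints would be derived from accelerometer and gyroscope recordings.

```python
# Sketch: train a gesture classifier on ground-truth motion fingerprints,
# then accept or reject an ad-hoc fingerprint against a challenge.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
gestures = ["hammering", "bodyturn", "fishing", "drinking", "keyhole"]

# Ground truth: each gesture occupies a characteristic fingerprint region.
X, y = [], []
for i, g in enumerate(gestures):
    center = np.zeros(8)
    center[i] = 3.0
    for _ in range(30):
        X.append(center + rng.normal(0, 0.3, 8))
        y.append(g)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(np.array(X), y)

# Ad-hoc fingerprint from a user attempting the "fishing" gesture.
attempt = np.zeros(8)
attempt[2] = 3.0
attempt += rng.normal(0, 0.3, 8)
solved = clf.predict([attempt])[0] == "fishing"
print("captcha solved:", solved)
```

If the classifier instead predicted a different gesture, or the fingerprint matched none of the trained classes well, the captcha would be counted as failed.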


5.2.2. Satisfaction of Requirements

We ground our captcha scheme in design principles suggested in existing scientific work on captcha usability, such as Reynaga et al. [133], Fidas et al. [46], and Bursztein et al. [23]. In the following, we present a collection of design principles and recommendations from these publications and argue how our design addresses them.

Challenges

• Deploy one task only. Optional features hinder usability on small screens, where captcha solving is already more time-consuming than on desktop computers. Challenges should be designed with a one-task-only focus.

• Leverage complexity. Visual puzzles suffer from an arms race between captcha providers and pattern recognition algorithms that sometimes even perform better than human beings. Although finding a harder problem in computer vision will increase the cognitive load on the user side, captchas need to be challenging and of a complex domain.

• Use cognitive behavior. Everyday-life movements such as the ones used for our challenges are capable of shifting captcha interactions to a domain beyond visual puzzles and touchscreen interaction. As the gestures are found in everyday life, we believe it is an easy task for humans to perform them, yet hard for automated programs to fake.

• Strive for a minimalistic interface. An interface should focus on the essential and be minimalistic. Our captcha challenges can be displayed and solved even from wearables such as smartwatches.

Environment of Use

• Expect common conditions. Features which may fail in commonly expected environmental conditions should be avoided. Our design fulfills this recommendation, although the performance of gestures may be conspicuous.

• Minimize load. For our approach, bandwidth usage is minimized as challenge descriptions are provided verbatim. Also, the data transmitted to the server consists of raw sensor data, as the decision whether the captcha was solved is performed directly on the server side to prevent attacks on the client.

• Rely on default software. For correct operation, a scheme should not rely on technologies that cannot be assumed obligatory. Our implementation is based on JavaScript which is supported by standard mobile browsers.


Engineering

• Ensure compatibility. To reach a majority of users, input mechanisms should be cross-platform compatible and not interfere with normal operations. Our approach is solely based on input from motion sensors, which are state of the art in smartphones and smartwatches.

• Aim for high robustness. Errors must not interfere with normal operations of the browser. Our scheme does not interfere with other operations.

• Support isolation. The captcha challenge should be separated from the rest of the web form. Our captchas may even be shown on a separate page of a form.

• Enable consistency. Orientation and size of the captcha should be kept consis- tent with the rest of the web form. As our challenge description is text-based or image-based, its presentation can easily be adjusted.

Privacy

• Maximize user privacy. In addition to the design principles listed above, we aim to spotlight user privacy. User input should not be replaced by user fingerprinting as noCAPTCHA deploys [117]. Our goal is to propose a scheme that minimizes the impact on user privacy and works without collecting sensitive information about users and their devices.

5.3. Usability Study

We implemented a prototype of the proposed scheme and conducted a comparative evaluation to assess the usability of our new captcha scheme against already existing solutions. In the following, we provide details on both aspects.

5.3.1. Design and Procedure

Our user study is divided into two phases: first, a preliminary study was carried out to determine a suitable time frame for gesture performance, the best parameters for the machine learning classifier, and the ground truth for the main user study; the second phase comprises the main user study itself. Both phases are described in more detail below. Figure 5.1 illustrates the complete user study setup.


[Figure omitted: diagram of the study setup, showing the preliminary study producing ground-truth data that trains and tests the machine learning model, and the main user study covering reCaptcha, noCaptcha, Emerging Images, and Sensor Captchas]

Figure 5.1.: User study setup

Preliminary Study

Sensor Captchas rely on motion input from hardware sensors and machine learning techniques to prove that the user is human. In order to train a model, we conducted a preliminary study: we built a ground-truth data set by instructing 20 participants to perform the movements and gestures described in Section 5.2. We then let them solve the challenges in a controlled environment under the following two conditions:

1. The challenges were not chosen randomly but assigned to the participants. Every user had to perform the same number of challenges. More precisely, every user performed every gesture three times.

2. We observed the users solving the challenges and instructed them if they made mistakes to ensure the correct performance.

The sensor data obtained in this preliminary study is used as ground truth for further processing. As the data collection was performed in a controlled environment and under the supervision of two experimenters, we know that the gestures have been performed correctly. Hence, the resulting ground truth data includes motion fingerprints for all gestures.

To find the best-performing classifier, we conducted cross-validations and classification experiments with algorithms from different families, including support vector machines, k-Nearest Neighbor, and different ensemble methods. Our results suggest that a Random Forest classifier performs best on our ground truth, and thus we used this algorithm to generate a model that was then used in the actual user study.
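This classifier selection can be sketched as follows. Note that this is an illustrative reconstruction, not the thesis code: the feature matrix X and labels y are synthetic stand-ins for the ground-truth motion fingerprints, and the candidate set merely mirrors the algorithm families named above.

```python
# Illustrative sketch of the classifier comparison (not the original thesis code).
# X and y are synthetic stand-ins for the ground-truth motion fingerprints.
import numpy as np
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# five toy gesture classes, 20 samples each, 12 features per sample
X = rng.normal(size=(100, 12)) + np.repeat(np.arange(5), 20)[:, None]
y = np.repeat(np.arange(5), 20)

candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Extra Trees": ExtraTreesClassifier(random_state=0),
    "Bagging Tree": BaggingClassifier(random_state=0),
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

On real motion fingerprints, the mean cross-validation accuracy per candidate is the basis for choosing the final classifier.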

Main User Study

We include three other captcha mechanisms besides Sensor Captchas in our main study: two schemes are well-known and commonly used in practice, while the third is an experimental approach from a recent research paper:

1. reCAPTCHA is a well-proven text-based input mechanism. A user is asked to type words or numbers shown in an often distorted or blurred image.

2. noCAPTCHA is the field-tested successor of reCAPTCHA and challenges the user to select, from nine images, all those showing specific items, e. g., trees. It also employs behavioral analysis.

3. Emerging Images relies on moving image recognition. A user has to type letters which are shown in an animated image series. This method has been proposed by Xu et al. [158].

While reCAPTCHA and noCAPTCHA are established mechanisms already used by Internet users and website providers every day, Emerging Images and Sensor Captchas represent scientific approaches and have not yet been deployed in a real-world environment.

We chose a repeated measures design for our lab study, i.e., every participant had to solve puzzles from every captcha scheme in a controlled environment at our university campus. It was important to us to observe sources of errors in order to improve our design. Each participant was asked to solve a minimum of 15 challenges per scheme. We designed our study to present the challenges in randomized order to reduce any bias or fatigue effects. As all participants were asked to solve captchas of all four types, we were able to gather comprehensive solving data, including the number of correctly solved captchas and failures as well as the amount of time needed to solve each captcha. As our implementation was written in JavaScript, the participants were encouraged to use their own devices to avoid bias and distractions from the study tasks due to unfamiliarity with the device. Even though we had two backup devices with us, all participants used their own devices.

After completing the captcha challenges, the participants filled out a short questionnaire (see Section 5.4.3 for a complete listing of these questions). Also, one experimenter took notes in order to collect qualitative in-situ reactions and comments. This information was collected to understand particular difficulties and misunderstandings about the presented puzzles and the way of solving them. We believe these explorative findings are valuable to improve the usability of our captcha scheme.


5.3.2. Implementation

reCAPTCHA as well as noCAPTCHA are operated by Google Inc. and provide an API which we used to include these methods in our study. The Emerging Images technique was provided and hosted by Gerardo Reynaga, School of Computer Science at Carleton University, Ottawa, Canada, for the duration of our test. We implemented our Sensor Captchas and a survey site from which the participants accessed the different captcha challenges and the questionnaire.

The website was implemented in JavaScript and contained a general information page and a separate page for every captcha method. Each of these pages contained a short description on how to solve this captcha and a start button. After tapping the start button, a form containing the captcha challenge and a submit button was displayed. For every captcha, we measured the solving time as the duration between tapping the start button and tapping the form submit button. Hence, we only measured the time a user required to process the captcha challenge mentally and to input a correct solution. This way, we managed to measure the solving time irrespective of network delays, implementation issues, and other technical factors.

After a captcha challenge was completed, we stored the following information: an identifier every user could choose freely, e. g., a name; the current date; the captcha result, which is either success or failure; the duration a user needed for the solving attempt; and a unique user key which was generated automatically and stored in the browser's local storage as an anonymous identifier.

reCAPTCHA and noCAPTCHA provide an API so that this information could be obtained and stored automatically, except for one limitation: noCAPTCHA does not provide a way to check the result of a single challenge. If a challenge has not been solved correctly, the next challenge is displayed to the user automatically without triggering a JavaScript event.
Hence, it is not possible to record noCAPTCHA failures without interfering with Google's API and violating the way it works, which could have invalidated our results and measurements. As there is no API available for Emerging Images, we manually kept track of successes as well as failures and entered this data by hand. However, the solving durations could be measured as for the other methods using a JavaScript frame.

Regarding Sensor Captchas, we additionally stored the following information: the sensor data which serves as ad-hoc motion fingerprint, including accelerometer and gyroscope readings as arrays (of the dimensions x, y, and z as well as α, β, and γ); the original challenge which was displayed to the user; and the classification result, which leads to a captcha success only if it matches the original challenge. After tapping the submit button on the Sensor Captcha page, sensor events were measured for five seconds, which we set as the time frame to perform the gesture. We designed the gesture movements in such a way that they are practical to perform within this time and tested every gesture beforehand. Our preliminary study showed

that five seconds is a reasonable amount of time to make all necessary movements; still, this parameter can be analyzed and adjusted in future research. After this time, all data was submitted automatically, so that users did not have to tap another submit button. The sensor data was sent to a socket which passed the data to our machine learning classifier, retrieved the classification result, and finally stored all this information in the database. These functionalities were programmed in Python, implementing a Random Forest classifier from scikit-learn [126].
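A minimal sketch of such a back end is shown below. The feature extraction (`extract_features`), the gesture list, and the synthetic training data are our own illustrative assumptions; only the use of scikit-learn's Random Forest mirrors the described implementation.

```python
# Hedged sketch of a sensor-captcha back end (illustrative, not the thesis code).
# `extract_features` and the synthetic data are assumptions; only the use of a
# scikit-learn RandomForestClassifier mirrors the described implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

GESTURES = ["bodyturn", "drinking", "keyhole", "fishing", "hammering"]

def extract_features(accel, gyro):
    """Reduce raw accelerometer (x, y, z) and gyroscope (alpha, beta, gamma)
    time series to a fixed-length vector: mean and std per axis."""
    series = np.concatenate([np.asarray(accel), np.asarray(gyro)], axis=1)
    return np.concatenate([series.mean(axis=0), series.std(axis=0)])

# Synthetic ground truth: 20 recordings per gesture (stand-in for the
# preliminary-study data), each a 50-sample window of 3-axis readings.
rng = np.random.default_rng(0)
X = np.vstack([extract_features(rng.normal(i, 0.1, (50, 3)),
                                rng.normal(-i, 0.1, (50, 3)))
               for i, _ in enumerate(GESTURES) for _ in range(20)])
y = [g for g in GESTURES for _ in range(20)]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Server-side decision: a challenge counts as solved only if the
# classification of the submitted fingerprint matches the displayed gesture.
submitted = extract_features(rng.normal(2, 0.1, (50, 3)),
                             rng.normal(-2, 0.1, (50, 3)))
challenge = "keyhole"
solved = clf.predict([submitted])[0] == challenge
print(solved)
```

The key design point, matching the text, is that both classification and the success/failure decision happen server-side, so a client never sees more than the raw sensor upload and the final verdict.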

5.3.3. Recruitment and Participants

We recruited 50 participants between December 2015 and February 2016 at the university campus and at a major computer security conference. Most participants were students at our university from different branches of study, including information technology, medicine, arts, and science. While the youngest participant was 18 years old and the oldest was 55, the majority were aged between 20 and 35 years, with male and female participants in approximately equal shares. All participants were familiar with the purpose of captchas on websites and reported to have used established methods before. To comply with ethical guidelines of our university, we did not collect any personally identifiable information. We only collected data on age, gender, and whether the participants had a background in information technology. Every session lasted about 20 minutes per participant, and participants were compensated for their time with a voucher for a major online shop.

5.4. Evaluation

In the following, we compare the different captcha schemes regarding successful solving of challenges and the amount of time needed to solve challenges. Concerning Sensor Captchas, we analyze the suitability of gestures as well as the survey included in our user study. Finally, we investigate whether a habituation effect can be observed and shed light on the precision of our machine learning classifier.

5.4.1. Comparison of Mechanisms

To compare the solvability among all considered captcha mechanisms, we measured successes and failures. A success represents the correct solution of a captcha, while a failure accounts for a wrong input. In our study, about 85 % of all reCAPTCHA challenges were successfully solved by the participants. As discussed in Section 5.3.2, it is not possible to capture success and failure cases of noCAPTCHA without interfering. Emerging Images seems to constitute a rather hard challenge, as only about 44 % of all challenges could be solved correctly. In contrast, Sensor Captchas achieve a high success rate: Of all

provided gesture challenges, the participants were able to correctly solve about 92 %, making this a mechanism to be reckoned with. These preliminary results of our study suggest that users were able to solve more Sensor Captchas correctly than challenges of any other type. Note that for Sensor Captchas, a failure may result not only from a wrong user input – namely not performing the challenged gesture – but also from a misclassification by our machine learning algorithm. This factor will be discussed below in Section 5.4.5.

As described in Section 5.3.2, we measured the time users needed to solve every single challenge. Hence, we can analyze how much time is required on average to succeed at every mechanism. Table 5.2 shows the average amount of time per mechanism and captcha result.

Table 5.1.: Success rates (SR) in percent

  Mechanism         SR      Mean    SD
  reCAPTCHA         84.63   86.98   33.56
  Emerging Images   43.96   44.91   49.76
  Sensor Captcha    91.60   48.13   49.97

  SR = success rate, SD = standard deviation

Table 5.2.: Average solving times in seconds

  Mechanism         S       F       Total   Mean    SD
  reCAPTCHA         12.22   26.36   14.39   12.43   18.59
  noCAPTCHA         -       -       26.99   24.18   17.89
  Emerging Images   21.91   24.29   23.24   26.15   29.41
  Sensor Captchas   12.35   8.85    12.05   12.25   7.10

  S = successes, F = failures, SD = standard deviation

We observe that, in general, failures take more time for reCAPTCHA as well as for Emerging Images. The reason for this probably lies in the way the user input is provided: users have to read and decipher letters or numbers first. Depending on the specific challenge, this may be difficult, so that hard challenges are more likely to fail but also take more time. We observed that these cases annoyed many users, as they first need to invest high effort to recognize the challenge's letters or numbers and then still fail. For Sensor Captchas, we can see a lower solving time for failures than for successes, indicating that users may have failed to solve the challenge because they did not read the description text carefully enough.


We found noCAPTCHA to take in general more time than reCAPTCHA, which may be explained by the fact that noCAPTCHA applies browser fingerprinting first and then displays the challenge if the fingerprinting fails to recognize a user. Comparing the total time users took to solve captchas, reCAPTCHA is the fastest mechanism – probably because it is a practical method many users are already familiar with. Nevertheless, reCAPTCHA is directly followed by Sensor Captchas, suggesting that this approach is practicable and showing that users are able to perform the challenges' gestures in a reasonable amount of time. Please note that Sensor Captchas' solving time can be influenced by adjusting the time window for performing a gesture. We chose an interval of five seconds based on our preliminary study; increasing this time would result in higher solving durations, while decreasing it could make it impossible to perform a gesture thoroughly.

Our study has a repeated-measures design, so every participant was exposed to every condition. Therefore, we analyzed our data with repeated measures analyses of variance (ANOVAs). Table 5.1 shows not only the success rates of the captcha mechanisms but also the mean and standard deviation of successes, represented by 1 for success and 0 for failure. We see that the mean of Sensor Captchas resides within the standard deviation of reCAPTCHA and vice versa. Hence, differences between these two schemes are statistically not significant and may represent random errors. In contrast, the correct solving rate of Sensor Captchas is significantly higher than that of the Emerging Images mechanism, meaning that even if the random error is considered, the success rate of Sensor Captchas is superior. Similar trends can be observed regarding the solving times of each mechanism in Table 5.2: there is no statistically significant difference between Sensor Captchas and reCAPTCHA regarding the time a user takes to solve a captcha.
However, the mean solving times of these two mechanisms are significantly lower compared to noCAPTCHA and Emerging Images. We can conclude that Sensor Captchas and reCAPTCHA can be solved faster than noCAPTCHA and Emerging Images, even if the random error is taken into account.
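As a minimal illustration of such a within-subject comparison, the sketch below runs a paired t-test on per-participant solving times. Note the hedges: the study used repeated-measures ANOVAs, for which the paired t-test is only a simpler two-scheme stand-in, and the times are fabricated example values.

```python
# Minimal illustration of a within-subject comparison of solving times.
# This uses a paired t-test as a simpler stand-in for the repeated-measures
# ANOVA applied in the study; the per-participant times are fabricated examples.
from scipy import stats

recaptcha_times = [11.8, 13.0, 12.4, 15.1, 10.9, 12.7]  # seconds, illustrative
sensor_times    = [12.1, 12.6, 12.9, 14.4, 11.3, 12.5]

t_stat, p_value = stats.ttest_rel(recaptcha_times, sensor_times)
# A large p-value (> 0.05) means the difference between the two schemes
# is not statistically significant, matching the observation in the text.
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```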

5.4.2. Gesture Analysis

After comparing Sensor Captchas to other captcha mechanisms regarding success rates and solving times, we aim to analyze the gestures in detail. We conducted experiments to ascertain which gestures can be performed accurately and which movements tend to be confused with other gestures. Table 5.3 shows the solving rates and error rates per gesture. We see that bodyturn and keyhole challenges were in general solved correctly, meaning that the sensor events measured during a user's gesture performance could be matched to the challenged gesture. Bodyturn and keyhole were correctly solved by about 97 % and 96 % in total. For both, the highest mismatch was with the hammering gesture, meaning that if a user input could not be related to the challenge, it was classified as hammering.

Table 5.3.: Solving rates and error rates per gesture in percent

                               Categorized as
  Gesture      bodyturn   drinking   keyhole   fishing   hammering
  bodyturn     97.20      0.00       0.00      0.00      2.79
  drinking     0.00       91.74      0.06      0.91      0.91
  keyhole      0.65       1.30       96.08     0.00      1.96
  fishing      2.22       4.44       0.00      78.89     14.44
  hammering    1.62       0.00       8.13      4.87      85.37

For the drinking movement, still about 92 % of the challenges were solved correctly. The gestures fishing and hammering seem to be prone to errors: of all hammering challenges, about 85 % could be solved correctly, and in case of the fishing gesture only about 79 %. We also see that fishing and hammering are the least precise gestures, as about 14 % of all fishing challenges were classified as hammering and about 5 % of all hammering challenges were mistakenly related to the fishing gesture. This confusion can be explained by the movement itself: for hammering, users had to move their devices up and down along one axis, so this gesture is not very complex. The same applies to fishing, as this movement also involves only one axis; although there are differences, such as the number of accelerations (hammering requires several acceleration moves in order to hit the imaginary nail five times, while the fishing rod is cast only once), this low complexity leads to confusion between these two gestures. For the same reason, the fishing gesture was sometimes classified as drinking, although this happened only in about 4 % of all fishing challenges. In about 8 % of all hammering challenges, the sensor data was related to the keyhole gesture. The reason for this might be that users slightly turned their phones while hammering their devices on an imaginary nail. This resulted in movements in the z dimension, which are an essential part of the keyhole gesture.

The gestures drinking, keyhole, and bodyturn show only negligible errors and mistaken relations to other gestures. In general, only the hammering gesture yields potential for errors and should be excluded or enhanced in further studies. If this is fixed, the fishing gesture may presumably perform better as well because there will be no confusion with the hammering movement anymore.


5.4.3. Survey Results

As part of our study, users had to participate in a survey, rating all four captcha mechanisms regarding nine aspects. We used a ten-level Likert scale for every item and adopted and extended statements from previous research by Reynaga et al. [134] to allow a direct comparison to this work. In detail, we let the users rate the following statements (* represents inverted items):

• Accuracy: It was easy to solve the challenges accurately.

• Understandability: The challenges were easy to understand.

• Memorability: If I did not use this method for several weeks, I would still be able to remember how to solve challenges.

• Pleasant: The captcha method was pleasant to use.

• Solvability*: It was hard to solve captcha challenges.

• Suitability: This method is well suited for smartphones.

• Preference: On a mobile, I would prefer this captcha method to others.

• Input Mechanism*: This method is more prone to input mistakes.

• Habituation: With frequent use, it gets easier to solve the challenges.

[Figure 5.2.: Mean Likert scores and standard deviations from survey (scale: 1 = strongly disagree to 10 = strongly agree); columns correspond to the nine statements listed above]

                     Acc.        Underst.    Memor.      Pleasant    Solv.*      Suitab.     Pref.       Input M.*   Habit.
  reCAPTCHA          8.91 ±1.73  9.64 ±0.59  9.60 ±0.74  8.85 ±1.44  7.49 ±2.38  8.55 ±1.76  7.55 ±2.74  5.53 ±2.92  7.09 ±3.00
  noCAPTCHA          7.87 ±2.21  9.00 ±1.75  9.51 ±0.76  8.28 ±2.30  6.79 ±2.67  8.04 ±2.00  6.45 ±2.99  5.45 ±2.67  7.32 ±2.91
  Sensor Captchas    7.06 ±2.19  8.64 ±1.68  8.79 ±1.63  6.94 ±2.88  6.02 ±2.21  7.83 ±2.51  6.72 ±3.10  4.77 ±2.75  8.94 ±1.55
  Emerging Images    3.81 ±2.45  7.55 ±2.25  8.62 ±2.07  4.43 ±2.79  2.62 ±2.67  5.17 ±2.81  3.17 ±2.27  2.96 ±2.77  5.57 ±2.85

Figure 5.2 reports the mean Likert scale responses from strongly disagree = 1 to strongly agree = 10. In the figure, colors represent the scale, from red for strongly disagree to green for strongly agree. The established captcha mechanisms in our study—namely noCAPTCHA and reCAPTCHA—were in general rated high regarding accuracy, understandability, memorability, pleasant use, and suitability for mobile devices. Many users stated that they were familiar with these methods and therefore could easily solve the given challenges, as the task was immediately clear. For understandability and memorability, we observe a low standard deviation among the ratings. In contrast, a high standard deviation among participant ratings can be seen regarding the preferred captcha mechanism. This item holds a deviation of 2.99 for noCAPTCHA and 2.74 for reCAPTCHA, showing that users are divided on whether they prefer these established methods, which is substantiated by the high standard deviation regarding input mistakes ("input mechanism"): 2.67 for noCAPTCHA and 2.92 for reCAPTCHA. For some users, these captchas seem to work well and are easy to use. However, other users are not comfortable with them and would not prefer these methods on mobile devices.

Although Sensor Captchas hold the highest solving rate, users are not accustomed to this mechanism, which results in a generally lower rating compared to the established captcha methods reCAPTCHA and noCAPTCHA. Sensor Captchas keep up with established mechanisms regarding accuracy, understandability, memorability, suitability, preference, and input mechanism—the differences of these ratings are smaller than one. Significant differences can be seen regarding the ratings "pleasant" and "solvability". The former may be rooted in the fact that the participants were not used to Sensor Captchas and that the gestures require movement of the body or body parts, which users may be uncomfortable with in public environments.
The low "solvability" rating is contradictory to the high solving rates and shows that users find it hard to solve Sensor Captchas although they were able to do so in most cases. The high rating of habituation indicates that participants attribute high learnability to Sensor Captchas; hence long-term studies may improve the perception of solvability as well. We also shed light on habituation aspects in the next section. The items of our questionnaire which were rated with a low value also show high deviations: while "pleasant", "preference", and "input mechanism" show the lowest user ratings, the standard deviations are rather high with 2.88, 3.1, and 2.75. This indicates a broad range of user opinions; while some participants found Sensor Captchas not pleasant and would not prefer this mechanism, other users indeed stated the opposite and would prefer our mechanism to established captcha methods. Furthermore, "habituation" holds the lowest standard deviation of 1.55, which indicates that the majority of users think that continuous use would increase the solvability and ease-of-use of Sensor Captchas.

Emerging Images, as another innovative captcha mechanism, was rated well regarding understandability and memorability, showing that users are familiar with text inputs and easily understand the task of typing letters from a sequence of images. However, participants found it hard to solve these challenges, giving a low rating for accuracy, solvability, and pleasantness of use. This might be the reason why most users would not prefer this method and stated that it is prone to errors ("input mechanism"). In contrast to Sensor Captchas, users are not optimistic whether continuous use of Emerging Images may improve the solvability and handling; though, "habituation" holds the highest standard deviation of 2.85 for Emerging Images, which shows that some users may get familiar with it.

Informal Participant Statements

Participants were free to leave free-text comments so that we could get more detailed feedback on our study and scheme. Many users asked for animations in the description of gestures. As this may improve the understandability, accuracy, and solvability of Sensor Captchas, we will implement this feature in the future. A few users stated that the chosen gestures were not suitable for everyday use. Indeed, for Sensor Captchas to evolve into an established captcha method, the available gestures need to be reassessed. We abstracted gestures from everyday actions because simple movements were prone to errors and misunderstandings (see Section 5.2). Still, casting an imaginary fishing rod may be easy to imagine, but it is not an action users want to perform in public environments. Some users stated that it is hard to solve text-based and image-based captchas—reCAPTCHA and noCAPTCHA—on a smartphone's screen because it may be too small to comfortably display all images or the soft keyboard in addition to the challenge. This supports our original motivation for Sensor Captchas.

5.4.4. Habituation

According to the survey results, many users think that solving Sensor Captchas will become more comfortable and easier with continued use of the scheme. Although the long-term habituation to Sensor Captchas is left for future work, we investigate whether users were able to improve their success rates during our user study. As described in Section 5.3.1, every user tried to solve at least 15 Sensor Captchas. While only about 49 % of all participants were able to solve the very first Sensor Captcha correctly, we notice a clear trend that the more captchas a participant had attempted, the more gestures were performed successfully. The average success rate among all users for the 15th Sensor Captcha is about 84 %, which supports the assumption that users may habituate to this captcha mechanism quickly. To test a possible correlation between the number of solving attempts and the number of successes, we calculate the Pearson correlation coefficient ρ. Taking all user data into account, ρ = 0.7238, which indicates a strong positive linear relationship and shows that in our user study the number of successes increases with the number of challenges.
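The habituation test above can be reproduced in a few lines. The per-attempt success rates below are illustrative placeholders, not the study's measured values; only the method (Pearson's ρ over attempt number vs. success rate) follows the text.

```python
# Computing the Pearson correlation coefficient between attempt number and
# success rate, as in the habituation analysis. The success rates below are
# illustrative placeholders, not the study's measured values.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)

attempts = list(range(1, 16))  # the 1st to the 15th Sensor Captcha
success_rates = [0.49, 0.52, 0.58, 0.57, 0.63, 0.66, 0.65, 0.70,
                 0.73, 0.74, 0.78, 0.77, 0.80, 0.83, 0.84]  # illustrative
rho = pearson(attempts, success_rates)
print(f"rho = {rho:.4f}")  # values near 1 indicate a strong positive linear trend
```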


5.4.5. Classification

There exist two possible factors for captcha failure: not only may humans fail at solving a captcha challenge, but the machine learning classifier may fail at matching correct ad-hoc motion fingerprints with challenge gestures. To shed light on possible false classifications, we calculated precision and recall for different machine learning algorithms. In our scenario, a false positive is represented by the case that sensor data not belonging to a specific gesture is accepted as a correct solution for this gesture; in the extreme case, random sensor data is wrongly classified as a correct solution to a challenge.

[Figure 5.3.: Classification precision and recall — precision-recall curves for the Random Forest, Extra Trees, Bagging Tree, and kNN classifiers (precision from 0.5 to 1.0, recall from 0 to 1).]

Consequently, if a correct sensor data input is mistakenly rejected by the classifier, this constitutes a false negative. Note that in the context of captchas, false positives are worse than false negatives: if users were sporadically not recognized as human, they would at worst have to solve a second captcha. However, if a bot was mistakenly accepted as human, it could circumvent the captcha protection. A correct classification of sensor data to the right gesture is a true positive, while a correct rejection of non-matching data constitutes a true negative. On this basis, we are able to calculate precision and recall for all data obtained in the user study.

Figure 5.3 illustrates precision-recall graphs of the different classifiers under consideration. Given the data set of our user study, including accelerometer and gyroscope data of all performed gestures, the classifiers Random Forest, Extra Trees, and Bagging Tree yield a very high precision in distinguishing the gestures. Only the k-Nearest-Neighbor algorithm (testing k = 1, k = 5, k = 10) was not capable of precisely classifying the gestures. While kNN achieves an AUC of only 78.99 %, Bagging Tree achieved an AUC of 99.17 %, Extra Trees of 99.72 %, and finally Random Forest of 99.89 %. This

confirms our choice to implement a Random Forest classifier in our user study back end. As shown in Figure 5.3, the classifier is capable of determining whether given sensor data satisfies a specific gesture with high precision. Hence, misclassifications are negligible in our study, and we are able to ascribe most captcha failures to user input errors.
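Given true and predicted gesture labels, per-class precision and recall can be computed as sketched below. The label lists are illustrative placeholders, not the study's measurements; only the metric definitions mirror the evaluation above.

```python
# Hedged sketch: macro-averaged precision and recall from true vs. predicted
# gesture labels, mirroring the evaluation in this section. The label lists
# are illustrative placeholders, not the study's data.
from sklearn.metrics import precision_score, recall_score

y_true = ["fishing", "hammering", "keyhole", "fishing", "bodyturn", "hammering"]
y_pred = ["fishing", "hammering", "keyhole", "hammering", "bodyturn", "hammering"]

# Macro-averaged over all gesture classes present in the data:
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy example, the single fishing challenge misclassified as hammering lowers both the recall of the fishing class and the precision of the hammering class, which is exactly the kind of confusion reported for these two gestures.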

5.5. Discussion

The proposed mechanism meets the common requirements of captcha schemes: the main goal of telling computers and humans apart by a challenge as simple as possible is achieved. We also satisfy common design principles for captcha methods as discussed in Section 5.2.2. In this section, we discuss security as well as potential limitations of our approach and ideas for future work.

Although our survey results indicate that users perceive Sensor Captchas as less accurate and solvable than established methods, our approach achieved the highest success rate and took users the least time to solve challenges. It thus might break the arms race in computer vision powered by more and more captcha mechanisms based on visual puzzles. The fact that the decision about success or failure is made server-side raises the bandwidth use in contrast to captcha schemes which work client-side only. However, the size of the transferred sensor data is reasonable, and deciding about a challenge's solution server-side is more secure. On average, the sensor readings of accelerometer and gyroscope take 5 KB in total.

Security Considerations

Basing liveliness determination on hardware sensor data enables new attack vectors aiming at data manipulation. An attacker may record sensor data and provide it as the solution to a given challenge. As our captcha scheme currently supports only five gestures, a replay attack succeeds with a theoretical probability of 0.2, which needs to be reduced by a greater variety of gesture challenges. However, even with such extensions, the entropy of our approach will not exceed the entropy of text-based captchas. A bot could solve Sensor Captcha challenges if correct input data is available for every single gesture and if the automated solver is furthermore able to recognize the presented challenge. As this applies to all common captcha schemes, it also applies to our approach. Still, other schemes may have a larger input space. While an attacker may perform all gestures once and record the corresponding sensor data, the hardness of challenge recognition is essential for most captcha schemes. The security of text-based captchas especially relies on the assumption that the challenge is hard to identify. To harden a scheme against this attack vector, the way of presenting challenges could be chosen randomly to complicate automated detection.


Alternatively, an attacker could try to exploit the machine learning classification by replaying data of a different challenge than the presented one. To test this behavior, we conducted a replay attack experiment: we chose sensor measurements, including accelerometer and gyroscope data, from the user study and attempted to solve a given challenge with them. We repeated this procedure 500 times to simulate such replay attacks under the same conditions as in our user study. Note that we did not use random sensor data but real-world sensor readings we had obtained in our user study before. Leveraging completely random data may also be a possible scenario, but a less sophisticated attack. As a result, in two cases a sensor data replay of an original fishing challenge was misclassified as hammering, leading to a false positive. One replay of a hammering gesture was accepted as a solution to the keyhole challenge. As we already know, hammering tends to be misclassified (see Section 5.4), so diversifying this gesture may harden our system against this type of attack. All the other attacks, making up a share of 99.4 %, were correctly rejected by our machine learning algorithm.

If a user's mobile is treated as an untrusted or maliciously infected device, it may falsify sensor data. This would enable tampering with the user input used for solving the presented challenge. However, if malware is able to change the input, e. g., by manipulating system drivers requiring root access or by tampering with the browser environment, no captcha scheme can guarantee a correctly transferred input. We designed our system in a way that the decision whether or not a captcha is solved successfully is made server-side. If it was made client-side, as in game-based captchas [51], replay attacks might be more feasible, as the attacker would only have to replay the decision instead of determining the challenge and providing previously recorded data for solving.
Still, if an attacker obtains correct solving patterns for our gestures, a replay attack is feasible. Finally, we focused our studies on the general feasibility of sensor-based motion captchas and especially on usability aspects.

Limitations

Our work builds on existing captcha designs and lessons learned from previous studies. As we focused on usability aspects of captchas, we assume that the implementations of the compared captcha schemes are secure, best-case implementations. A limitation of our prototype is that it is a proof-of-concept and was first tested on users in the course of this study. Also, the set of challenges our system provides is not sufficient to be resilient to replay attacks in practice.

For our comparative user study, we recruited participants around the university campus. Hence, our sample is biased towards this particular user group. Also, the participants solved the captcha puzzles in a controlled environment while an experimenter was present. We did not deploy our captcha scheme in the wild and

therefore do not have data on the captcha performance in a real-world setting where users have to deal with environmental constraints. Also, we did not collect any evidence on whether our scheme is applicable in all real-world situations, such as when a user performs a task on the phone while in a meeting. Due to the fact that sensor captchas require the user to move their device, they are potentially not applicable in some situations where a less obtrusive approach would be preferred by most users. We still believe that our results provide valuable insights into how users interact with the different types of captchas. We found that metrics like solving time, memorability, and error rate do not necessarily correspond to the perceived usefulness and user satisfaction.

5.6. Related Work

Captchas are a controversial topic among researchers and practitioners, mainly because they place a considerable burden on users while often being unreliable at distinguishing human users from automated programs. Many approaches have been presented in the scientific literature and by companies such as Google, but most of these schemes are still susceptible to different types of attacks. Bursztein et al. identified major shortcomings of text captchas and proposed design principles for creating secure captchas [24]. They focus on interaction with desktop computers and do not consider usability shortcomings of captcha interactions on mobile devices. Fidas et al. validated visual captchas regarding their solvability based on empirical evidence from an online survey [46]. They found that background patterns are a major obstacle to correctly identifying characters, while providing little to no additional security. Reynaga et al. presented a comparative study of different captcha systems and their performance when accessed via a smartphone [133]. They argue that visual captchas are hard to solve on mobile devices and that usability could be increased by limiting the number of tasks and by presenting simpler and shorter challenges with little or no obfuscation. Furthermore, distractions from the main task should be minimized by presenting unobtrusive captchas that are isolated from the rest of the web form. These factors highlight the need to develop novel captcha schemes that overcome the limitations of visual captchas. Reynaga et al. also conducted a comparative user study of nine captcha schemes and provided a set of ten specific design recommendations based on their findings [134]. Bursztein et al. reported on designing two new captcha schemes at Google and presented findings from a consumer survey [25]. Xu et al.
[158] explored the robustness and usability of moving-image video captchas (emerging captchas) to overcome the shortcomings of simple image captchas and discussed potential attacks. Jiang et al. proposed gesture-based captchas that obviate the need to type letters by additionally using swipe gestures and other touch-screen interactions [71]. However, such complex methods may place a high burden on users. Gao et al. proposed a captcha scheme utilizing emerging images as a game [51]. Such game-based captchas are solved and validated client-side, making them vulnerable to replay attacks. reCAPTCHA and noCAPTCHA by Google Inc. are field-tested, established mechanisms [65]. However, both methods have less apparent downsides: reCAPTCHA is used to digitize street view addresses as well as books and magazines, and noCAPTCHA implements behavioral analysis and browser fingerprinting. Information used for fingerprinting includes, but is not limited to, installed browser plugins, user agent, screen resolution, execution time, timezone, and the number of user actions (clicks, keystrokes, and touches) in the captcha frame. It also tests the behavior of many browser-specific functions as well as CSS rules and checks the rendering of canvas elements [117]. While this information is used for liveliness detection and therefore fits the aim of captchas, it can also be used for accurate user tracking, raising privacy concerns (see Chap. 2).

5.7. Conclusion

In this work, we demonstrated that motion information from hardware sensors available in mobile devices can be used to test liveliness. Due to several limitations such as smaller screens and keyboards, traditional captcha schemes designed for desktop computers are often difficult to solve on smartphones, smartwatches, and other kinds of mobile devices. In order to tackle the challenges implied by these constraints, we designed a novel captcha scheme based on motion fingerprinting and evaluated it against existing approaches found in the Internet ecosystem and the scientific literature. Our results indicate that sensor-based captchas are a suitable alternative when deployed on mobile devices, as they perform well on usability metrics such as user satisfaction, accuracy, error rate, and solving time. As our scheme requires users to perform gestures with a device in their hand, we plan to conduct a longitudinal field study to collect evidence on the feasibility of motion input in the wild (i. e., in situations where users are constrained by environmental conditions and unobtrusive interactions with their device) as well as involving wearables as input devices. For future work, we aim to iteratively improve the design and number of challenges. Although most gestures of our user study were suitable, their movements need to be revised for everyday use and the entropy needs to be increased by new gestures. Additionally, users would benefit from images or animations showing the challenge; participants of our study agreed with Kluever et al. [78] that images and animations presenting a challenge are more enjoyable. Finally, conducting a long-term study with participants using our mechanism regularly may confirm our findings on habituation effects.

CHAPTER SIX

IMPEDING AUTHORSHIP ATTRIBUTION VIA STYLOMETRY OBFUSCATION

Authorship attribution is a technique for relating texts of unknown source to their original authors. This can be achieved by using writeprints, which capture characteristic uses of digits, letters, n-grams, etc. to measure an author's stylometry. For anonymization of documents, stylometry can be obfuscated by changing text elements, e. g., by synonymizing words and phrases. In this chapter, we investigate the effectiveness of both writeprint-based authorship attribution and stylometry obfuscation. We built a corpus gathering product reviews of 30 authors with 1,000 texts per author. Using this data corpus, we show how precisely a given text can be related to its original author under various circumstances, such as author group sizes of three to 30 authors and text set sizes from ten to 1,000 texts per author. We utilize three different obfuscation tools, each following a specialized paradigm, and compare their effectiveness. Additionally, we determine how obfuscation affects a text's readability.

6.1. Introduction

Authorship attribution is a method to relate written texts to an author within a group of authors based upon individual writing style, so-called stylometry. While this method can be used to detect plagiarism, it also poses a risk for authors requiring anonymity, e. g., human rights activists or whistleblowers. More specifically, those who exercise their right of free speech may be identified by their style of writing. As a measurement of stylometry, writeprints have been proposed by Li et al. [98] and utilized for the first time by Abbasi et al. [1]. These techniques can be used for author identification, plagiarism detection, and also de-anonymization. For each author, a vector of features is generated by extracting specific patterns from the

written text and utilized to train machine learning models. These models can be either supervised or unsupervised, depending on whether the number of authors is known or not. A countermeasure against authorship attribution is stylometry obfuscation, which is used to distort an author's individual style of writing [73]. This technique substitutes words and expressions, shuffles text fragments, or even rewrites parts of the original text so that it becomes harder to relate an obfuscated text to the original text's author. These countermeasures are usually grouped into manual, semi-automated, and automated obfuscation [129]. If obfuscation is successfully applied, an obfuscated text is either not assignable to any author within a group or only assignable to another author of this group who is not the original author. The success and precision of authorship attribution depend on how many authors are within the group, how many texts (or more precisely: how many words) each author has written, and on the quality of the stylometry obfuscator. Luyckx identified the scalability regarding the sizes of author group and text set, as well as the selection of writeprint features, as highly problematic [100]. In this chapter, we provide insights into authorship attribution despite stylometry obfuscation with varying author group sizes and various numbers of texts per author. We conducted experiments regarding the precision of stylometry recognition under different circumstances: we assume that either nothing is known about the published texts, which are anonymous, or that there has been an information leak, e. g., revealing an author's identity. We provide precision rates based on the number of authors and the number of texts per author and shed light on this dependency.
Our results also indicate that text obfuscation is ineffective in many cases and may even harm a document’s readability.

Contribution In this chapter, we make the following contributions:

• We implemented extended writeprints as a continuation of the original writeprints and revise the scenario of authorship attribution under realistic conditions.

• We conducted experiments to relate texts to authors without any knowledge about the number of authors and their styles, leveraging unsupervised machine learning.

• We leveraged supervised machine learning to perform authorship attribution with additional information, e. g., the number of authors or leaked texts.

• We provide comprehensive insights on how many authors, texts, and words are required to perform authorship attribution.


Outline In the next section, we describe the concepts of authorship attribution and stylometry obfuscation. Then, we present our approach for discovering the limits of obfuscation of written text and explain our data set as well as how we apply machine learning algorithms to it. Additionally, we introduce measuring methods to assess the readability of texts. The evaluation section provides insights into the effectiveness of stylometry obfuscation and its effect on a text's readability. Next, we discuss threats to the validity of our approach and give an overview of related research, before we summarize this chapter.

6.2. Authorship Attribution

Authorship attribution describes the process of relating a given text document to its author. This can be achieved by measuring text attributes like the distribution of characters or the number of digits. Usually, this technique is applied to an anonymous document to reveal its original author. This way, it is possible to de-anonymize texts by measuring text attributes which represent an author's style of writing (so-called stylometry). While authorship attribution may help to detect plagiarism, the technique has a downside: authors who require anonymity for publishing documents, e. g., human rights activists or opposition members in oppressive countries, may also be identified. To counter authorship attribution, obfuscation mechanisms may be used to alter a text and change the measured attributes, making it harder to assign a given text to its original author.

6.2.1. Writeprints One method to measure text attributes and, thus, an author's stylometry, is writeprints, a text-based fingerprint of the author. The goal of such writeprints is to represent an author's stylometry for a given text. The application of writeprints, in analogy to fingerprints but for written text, is one of the best-known and most robust techniques for authorship attribution. Writeprints were introduced by Li et al. [98] and implemented and extended by Abbasi et al. [1]. Writeprints contain 700 different features which are extracted from documents and describe the similarity or dissimilarity between the authorship styles of documents. Due to the high dimensionality, writeprints utilize the Karhunen-Loeve (KL) transformation, a supervised form of Principal Component Analysis (PCA) [91], to determine which features contain the highest variance so that they become more distinguishable. Writeprints are usually grouped into four different categories [98]:


1. Lexical, which contains character-based and word-based features, e. g., the number of words in a sentence or the occurrence of special characters.

2. Syntactic, which defines the sentence-level style of the author by counting the frequency of function words, punctuation, and Part-of-Speech (POS) tags.

3. Structural, which includes habits of an author such as the number of paragraphs in a text.

4. Content-specific, where word bigrams, trigrams, and bag-of-words are considered to determine the topic of the document.
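To make these categories concrete, the following Python sketch extracts one or two illustrative features per category. The function and feature names are our own simplification and not part of the original writeprints implementation, which comprises about 700 features.

```python
import re
from collections import Counter

def writeprint_features(text):
    """Extract a handful of illustrative writeprint features,
    one or two per category (simplified sketch)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    lowered = [w.lower() for w in words]
    return {
        # Lexical: character- and word-based measures
        "char_count": len(text),
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
        # Syntactic: punctuation and function-word frequency
        "punctuation": sum(c in ".,;:!?" for c in text),
        "function_word_the": lowered.count("the"),
        # Structural: author habits such as paragraph count
        "paragraph_count": len(paragraphs),
        # Content-specific: word bigram counts hint at the document topic
        "word_bigrams": Counter(zip(lowered, lowered[1:])),
    }
```

A full implementation would additionally require a POS tagger for the syntactic category and a much larger lexical feature set.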

6.2.2. Stylometry Obfuscation Since a high accuracy of writeprints for authorship attribution has been demonstrated in many scenarios [1, 5, 20, 116], authors such as whistleblowers and human rights activists increasingly need reliable tools and methods to obfuscate their writing style in order to remain anonymous. Up to now, research addressing this requirement can be broken down into three categories: 1. Manual obfuscation, where an individual is asked to alter the writing style of a written document, either at will or by attempting to imitate a specific writing style. Examples of such work, where crowd-sourcing was applied for this purpose, are [7, 20].

2. Semi-automated (guided) methods such as Anonymouth [106] or Unstyle [121], in which a classifier is trained both on samples of the author who wants to stay anonymous and on samples of other distinct authors. These tools guide the author in changing the writing style, i. e., they identify the features which provide the most variance. However, the author has to apply the suggested changes manually.

3. Fully automated tools, where a text is provided as input to an obfuscating program or so-called text spinner that alters the writing style of a document automatically. Round-trip translation is one such approach and has been studied quite often in the past [26, 75, 130, 157]. More sophisticated alterations are usually performed by synonym replacements using WordNet, changing the structure of sentences, and inserting spelling and grammatical errors. We describe this approach in more detail in Sec. 6.3.5. All of these obfuscation techniques are designed to alter an author's text and, thus, writeprint, to deceive machine learning algorithms and distance measurements. However, considerable effort has also been put into de-anonymization and de-obfuscation of the original author and writing style [38, 115, 125, 156].


6.3. Discovering Obfuscation Limits

While authorship attribution and stylometric obfuscation are topics of current research, we introduce in the following a new scenario with realistic circumstances and present our data corpus as well as our approach of writeprinting texts. Furthermore, we briefly discuss relevant machine learning techniques and three different well-known obfuscators, which we examine regarding their capability of deceiving authorship attribution.

6.3.1. Scenarios The scenario described in Sec. 6.2.2 and often used in recent research [7, 103, 106, 108, 121] is compatible with the use of machine learning algorithms, but does not apply under realistic circumstances. Using unobfuscated, original texts for training a machine learning model and obfuscated texts for classification is predicated on the following assumption: all original texts are available for training and, from a certain point in time on, texts are published in an obfuscated way. Only in this scenario is it sound to perform training on clear texts and classification (i. e., testing) on obfuscated texts. However, this rarely occurs in the wild, as having both original and obfuscated texts available is a strong requirement. Either an adversary who aims to attribute texts to their original authors does not have unobfuscated texts for training, given that these texts should not be published for the sake of anonymity, or there has been an information leak revealing information such as the number of authors or the relation of an obfuscated text to its original author. Hence, we argue that all operations for authorship attribution need to be performed on the obfuscated corpus only. With this prerequisite, two realistic scenarios are possible:

1. Only the obfuscated texts are available

2. Additional information about obfuscated texts is also available

Given obfuscated texts only, it is not possible to apply supervised machine learning and, therefore, not feasible to train a recognition model. In this scenario, only unsupervised machine learning (e. g., clustering) can be applied. However, if additional information is available, such as the identity of one or more authors, supervised machine learning may be applied, but only to the obfuscated texts. We presuppose that no original text is available, as these are kept safe or destroyed after obfuscation. Figure 6.1 depicts a high-level overview of a realistic scenario: a group of authors creates texts (several documents per author) and, to stay anonymous, neither these texts nor the authors' identities are revealed. The authors utilize an obfuscator which changes the given texts, trying to make relating texts to authors harder. Then, the obfuscated texts are published. Note that the obfuscator itself is a publicly available software tool in this case.

Figure 6.1.: High-level overview of our system's workflow (authors secretly create texts, an obfuscator transforms them, and only the obfuscated texts become public)

6.3.2. Data Corpus For our experiments, we obtained real-world texts from a large online retail platform. We scraped publicly available product reviews of the top 30 reviewers and gathered 1,000 texts per author. While the shortest text contains 14 words and the longest one 3,300 words, a review contains 345 words on average, with typical lengths ranging from 150 to 1,550 words per text. To investigate how many texts per author are required for authorship attribution with high precision, we built text subsets for our experiments in steps of ten: first, ten texts from every author are included, then 20 texts, and so on up to 1,000. Also, we divided the group of authors into subgroups by iteratively selecting one author at random, so that we may combine any number of authors with any number of texts per author.
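The subset construction can be sketched as follows; the function name and parameters are illustrative and not taken from our actual tooling:

```python
import random

def build_subsets(corpus, author_steps, text_steps, seed=42):
    """Yield experiment subsets: author subgroups grown by randomly adding
    one author at a time, combined with text subsets grown in fixed steps.
    `corpus` maps author -> list of texts."""
    rng = random.Random(seed)
    authors = list(corpus)
    rng.shuffle(authors)  # iterative random author selection
    for n_authors in author_steps:
        group = authors[:n_authors]  # growing subgroup, earlier picks kept
        for n_texts in text_steps:
            yield {a: corpus[a][:n_texts] for a in group}
```

For the full grid of the study, this would be called with `author_steps=range(3, 31)` and `text_steps=range(10, 1001, 10)`.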

6.3.3. Extended Writeprints Given the technique of writeprints (see Sec. 6.2.1), we implemented an extended version. While the features of writeprints are usually limited, e. g., in the number of letter n-grams, we do not cut off such measures. In fact, we measure a special set of text attributes, namely the numbers of

• Character count
• Letters
• Letters percentage
• Letter bigrams
• Letter trigrams
• Average characters per word
• Word lengths
• Digits
• Digits percentage
• Two-digit numbers
• Three-digit numbers
• Special characters
• Punctuation
• Uppercase letters percentage
• POS tags
• POS bigrams
• POS trigrams

We generally do not limit these features to the top occurrences, as common writeprints implementations do. Hence, we do not only store the top letter n-grams, but all occurring letter n-grams, all digits occurring in the complete text set, and so on. Moreover, we utilize features from all categories described in Sec. 6.2.1. We are confident that the inclusion of all kinds of attributes increases the precision and reliability of authorship attribution. For instance, we also obtain the number of each special character found in all texts and all occurring POS bigrams and trigrams. In contrast, we omitted the following features:

• Word bigrams
• Word trigrams
• Function words
• Common misspellings

Including these features in our experiments would heavily increase the number of features, and limiting them to the top occurrences would contradict our approach of unlimited writeprints. Hence, our implementation is based more on metrics and counts of specific occurrences than on semantic features like word connections or misspellings. We implemented our study setup such that a writeprint is a text-based measurement: every text gets a writeprint which is related to its original author. Consequently, every author has as many writeprints as texts written.
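A minimal sketch of this unbounded counting, assuming plain-text input (POS features, which require a tagger, are omitted here):

```python
from collections import Counter

def extended_writeprint(text):
    """Sketch of our unbounded variant: instead of keeping only the top-k
    letter n-grams, every occurring bigram and trigram is counted."""
    letters = [c.lower() for c in text if c.isalpha()]
    return {
        "letter_bigrams": Counter("".join(p) for p in zip(letters, letters[1:])),
        "letter_trigrams": Counter("".join(t) for t in zip(letters, letters[1:], letters[2:])),
        "digits": Counter(c for c in text if c.isdigit()),
        "special_chars": Counter(c for c in text if not c.isalnum() and not c.isspace()),
    }
```

The per-text counters are later aligned over the whole text set, so that the feature vector dimension is determined by all n-grams occurring anywhere in the corpus.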

6.3.4. Machine Learning Different machine learning techniques are required for the two scenarios described in Sec. 6.3.1 that we briefly discuss in the following.

Unsupervised Machine Learning Clustering, as a representative of unsupervised machine learning, is applied in the first scenario, as the obfuscated texts are the only available input source. We instrument the Mean-Shift clustering algorithm [32] as we aim to predict the original

author (one author is represented by one class), but we neither have labeled data nor knowledge of the number of classes. An advantage of the Mean-Shift algorithm is the automatic prediction of the number of classes. Furthermore, we use the k-Means clustering algorithm [101] to recheck the results of Mean-Shift. As this algorithm does not predict the number of classes automatically, we estimate this number with the help of the "elbow method" [14, 79], a way to determine the number of clusters k by estimation: the parameter is incremented successively and the error of the clustering is calculated. Plotting the graph with k on the X axis and the error on the Y axis, we can determine the point from which the graph no longer decreases strongly but proceeds with only a slight decrease. At this point, the X value represents the best k and, thus, the estimated number of clusters, i. e., classes.
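The elbow heuristic can be sketched as follows; the flattening threshold of 0.1 is an illustrative choice, not a value from our implementation:

```python
def elbow_k(errors):
    """Pick k via the elbow heuristic: given a mapping from candidate k to
    clustering error (e.g., k-means inertia), return the first k after which
    the error no longer decreases strongly."""
    ks = sorted(errors)
    total = (errors[ks[0]] - errors[ks[-1]]) or 1.0  # guard against flat curves
    for prev, cur in zip(ks, ks[1:]):
        if (errors[prev] - errors[cur]) / total < 0.1:  # gradient flattens here
            return prev
    return ks[-1]
```

In practice, the per-k errors would be obtained by running k-Means for each candidate k and recording its inertia.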

Supervised Machine Learning For the second scenario, we leverage supervised machine learning for recognizing author styles. The text set is split into a training subset and a testing subset. The training subset is used to build a model which forms the basis for the classification of the testing subset. As for unsupervised machine learning, one author is represented by one class. For this task, we use the Random Forest classifier as well as the Extra Trees classifier, as we have labeled data in this scenario but fewer than 100,000 samples. We calculate the classification precision as a measure for the quality of authorship attribution. If the precision is high, e. g., near 100 %, the texts can be reliably attributed to their original authors. A low precision rate represents the case that texts are hard to assign to their authors.
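Assuming writeprints are available as numeric feature vectors, this supervised pipeline can be sketched with scikit-learn; the 70/30 split ratio and classifier parameters are illustrative choices:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

def attribution_precision(X, y, seed=42):
    """Train a Random Forest on writeprint vectors X with author labels y
    and return the macro-averaged attribution precision on held-out texts."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_tr, y_tr)
    return precision_score(y_te, clf.predict(X_te), average="macro")
```

Swapping in `ExtraTreesClassifier` requires only changing the classifier line.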

6.3.5. Obfuscators To determine the capability of stylometry obfuscation as a countermeasure against authorship attribution, we leverage three different text obfuscation methods. Each obfuscator follows a different concept to change text elements, namely:

1. Text Transformation: The text style is transformed based on a firm rule base and random text operations, including structural changes.

2. Text Synonymization: In a given text, the most frequent words are replaced with synonyms from WordNet [111].

3. Text Spinning: Single words and phrases are replaced with synonyms while the text structure is kept unchanged.

We tested these methods against our implementation of writeprints regarding authorship attribution and discuss each method in the following.


Text Transformer

An obfuscator implementing text transformation has been created by Mihaylova et al. [108]. It is worth noting that this obfuscator achieved the best obfuscation results at PAN 2016. The obfuscator starts by extracting seven lexical and syntactic stylometric features from the training documents and several public domain books. Then, the average values of these extracted features are considered and appropriate measures are taken to change the writing style, such as:

1. Splitting and merging sentences according to whether the sentence length is above or below the average sentence length, respectively.

2. Removing or replacing stop words.

3. Performing either spelling corrections or inserting spelling mistakes depending on a custom defined spelling score.

4. Changing the punctuation usage: if it is above the average, punctuation marks such as commas, semicolons, and colons are removed; otherwise, a comma or semicolon is inserted before prepositions. Also, exclamation or question marks are repeated.

5. Substitution of the most or least common words with synonyms, hypernyms, or word descriptions gathered from WordNet.

6. Phrase replacements.

7. Substituting uppercase letters of words having more than three symbols with lowercase ones.

8. Noise insertion such as randomly substituting words that have a different spelling in American and British English, as well as inserting random functional words at the beginning of sentences.

9. Replacing short forms with full ones such as he’s with he is.

10. Replacement of numbers with their word representation. However, since this substitution is also applied to date formats, it would be obvious to an expert that the document has been obfuscated.

11. Substitution of symbols and abbreviations with words.
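Two of these rules (9 and 10) can be sketched as follows; the lookup tables are tiny illustrative stand-ins, as the actual rule base of Mihaylova et al. is far more extensive:

```python
import re

# Tiny illustrative tables; the real obfuscator covers many more cases.
CONTRACTIONS = {"he's": "he is", "she's": "she is", "can't": "cannot"}
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five"}

def transform(text):
    # Rule 9: replace short forms with full ones
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Rule 10: replace (single) digits with their word representation
    return re.sub(r"\b([1-5])\b", lambda m: NUMBER_WORDS[m.group(1)], text)
```

Each rule is a simple pattern rewrite; the full obfuscator additionally conditions such rewrites on the average feature values extracted beforehand.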


WordNet Synonymizer Mansoorizadeh et al. implemented an obfuscator based on a trained language model and utilizing WordNet [103]. The focus of this obfuscator is on hiding the writing style by replacing the most frequently used words of an author with the closest possible synonyms while keeping the meaning of the sentences sound. To achieve this, the obfuscator needs a set of training documents in order to determine the maximum likelihood estimate of word frequencies. Also, the obfuscator needs a language model trained with 4-grams on the Brown corpus [50] in order to correctly estimate the probabilities of word sequences. This model is utilized to score the obfuscated sentences: a set of synonym candidates with high similarity scores is generated from WordNet; then, they are scored considering the conditional probability of appearing in the 4-gram sequences of the trained model. The candidate with the highest score in the obfuscated sentence is chosen as replacement for the original word. In the end, the proper form of the synonym is constructed by checking the POS tag of the original word. In each sentence, at most one word is replaced.
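A heavily simplified sketch of this idea, with a hard-coded mini-lexicon standing in for WordNet and without the language-model scoring and POS-based inflection of the original approach:

```python
import re
from collections import Counter

# Tiny stand-in lexicon; the original draws candidates from WordNet and
# ranks them with a 4-gram language model trained on the Brown corpus.
SYNONYMS = {"big": "large", "quick": "fast", "buy": "purchase"}

def synonymize(text):
    """Replace the author's most frequent replaceable words, at most one
    replacement per sentence, mirroring the obfuscator's core rule."""
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w in SYNONYMS)
    out = []
    for s in re.split(r"(?<=[.!?])\s+", text):
        for word, _ in freq.most_common():
            pattern = r"\b" + re.escape(word) + r"\b"
            if re.search(pattern, s, flags=re.IGNORECASE):
                s = re.sub(pattern, SYNONYMS[word], s, count=1, flags=re.IGNORECASE)
                break  # at most one word replaced per sentence
        out.append(s)
    return " ".join(out)
```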

Text Spinner Text spinners or article re-writers are usually applied in so-called black-hat Search Engine Optimization (SEO) [163]. The reason for this application is to give the owners of unethical websites the ability to automatically rewrite and re-publish articles so that their websites are not penalized by search engines for containing previously published, duplicated text [40]. There exist only a few academic works which have analyzed the methodology of text spinners [53, 89, 163]. A state-of-the-art text spinner is Turbospinner [128]. The reason we became interested in text spinners is the similarity between text spinning and text obfuscation, which led us to the assumption that these spinning tools can also be applied as author obfuscators. To the best of our knowledge, this is the first work that considers this possibility. Like most text spinners, Turbospinner offers an API that we utilize for our experiments. There exists no proper documentation of Turbospinner or details about the text alteration process. However, we could observe randomly performed text alterations such as:

1. Synonymization of words and phrases.

2. Replacement of definitions with their abbreviation such as replacing search engine optimization with seo.

3. Change of digits to their written form, such as 4 to four.

4. Insertion of function words.


5. Replacement of words starting with capital letters with lowercase ones like I with i or Youth with youth.

6. Substitution of formal written forms of words with their shorter informal form, such as minutes to mins and thorough to thoru.

However, Turbospinner does not change the structure or punctuation of sentences.

6.3.6. Readability Readability is a metric to determine whether a text is rather easy or hard to read and understand. For our experiments, we apply the Flesch Reading Ease and the Automated Readability Index to every single text and author. The Flesch Reading Ease (FRE) assumes that short texts are easier to understand than long texts [48, 49]. Hence, the number of syllables per word as well as the average length of a sentence are used as parameters. The Automated Readability Index (ARI) by Senter and Smith relies on characters per word instead of syllables per word like other readability gauges [139]. Its result approximately represents the school grade required to understand the given text. Even after obfuscation, a text needs to be readable and understandable for a reader. We calculate both readability measures for both original and obfuscated texts to assess whether texts processed by the introduced obfuscators are still comprehensible. On the one hand, heavy obfuscation may destroy a text so that its sense cannot be extracted anymore. On the other hand, light obfuscation may preserve a text's readability but not change the text enough to deceive authorship attribution.
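Both gauges follow standard formulas and can be sketched as follows; the syllable counter is a rough vowel-group heuristic rather than a dictionary-based count:

```python
import re

def _count_syllables(word):
    # Crude heuristic: count contiguous vowel groups, at least one per word.
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def readability(text):
    """Return (Flesch Reading Ease, Automated Readability Index)
    using the standard formula coefficients."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]
    chars = sum(len(w) for w in words)
    syllables = sum(_count_syllables(w) for w in words)
    wps = len(words) / len(sentences)       # words per sentence
    fre = 206.835 - 1.015 * wps - 84.6 * (syllables / len(words))
    ari = 4.71 * (chars / len(words)) + 0.5 * wps - 21.43
    return fre, ari
```

Higher FRE values mean easier text, while ARI approximates the required school grade, so lower is easier.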

6.3.7. Experiment Setup For the experiments conducted in this chapter, all text sets are processed in the following way. First, extended writeprints are created for every text within the set, so that every author has a writeprint for every own text. The writeprints are split per author into training and testing subsets. Second, a machine learning model is trained, taking each author as one class. Third, all records within the test set are classified and we calculate the precision and recall of this classification. Then, this process is repeated for all subgroups of authors from ten to 30 and for subsets of ten to 1,000 texts per author in any combination. Finally, the readability gauges are calculated for all texts. We perform these steps for the original review texts as well as for the obfuscated reviews, separated by obfuscator, so that each of the four text sets gets measured regarding authorship attribution precision and readability.


6.4. Evaluation

In the following section, we perform authorship attribution experiments according to the scenarios described in Sec. 6.3.1. First, we present results on the basis of an exemplary subset of ten cases (two to eleven authors) to point out the importance of the text set size. Then, the number of authors within a group is considered as a variable as well, generalizing our findings.

6.4.1. Unsupervised Authorship Attribution

First, we examine the scenario of having a set of texts and no further information about them. Hence, the number of authors is unknown and we apply clustering as an unsupervised machine learning technique to perform authorship attribution. As our data set contains the original authors, we are able to check the effectiveness of clustering afterwards. We opted for an iterative setup in which the authors are added one by one to the clustering data. This way we simulate a growing group and cover all possible group sizes of our data set. Our analysis first leveraged the Mean Shift clustering algorithm in order to assess the number of clusters automatically, as this number should represent the number of authors. In every iteration, the texts of one more author, selected randomly, are added to the set. Then, the writeprints of all texts are created with their authors as classes and clustering is applied to these writeprints. For all subsets of authors, Mean Shift clustering was able to determine the number of authors correctly. However, the assignment of authors and texts was ineffective: single clusters did not represent single authors, and writeprints of individual authors were distributed over several clusters. On average, Mean Shift clustering assigned texts to authors with a precision below 40 % when all 1,000 texts per author were considered. Note that this result is for the original texts without any obfuscation applied. To cross-check this finding, we also applied the k-Means clustering algorithm to the original texts. While Mean Shift clustering determines the number of clusters automatically, k-Means requires this parameter to be known. Hence, we instrumented the elbow method (see Sec. 6.3.4), testing several values for k and plotting the resulting graph with the number of clusters on the X axis.
Then, we are able to determine the number of clusters by finding the point on the X axis where the gradient of the plot changes only slightly. Figure 6.2 illustrates this method for a group of eleven authors. This represents a small to mid-sized group and the attempt to de-anonymize one author against ten others. From the identified point onwards, the graph takes a rather flat course.
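The elbow method can be sketched with a compact, stdlib-only k-Means (Lloyd's algorithm), assuming writeprints are plain numeric tuples; this is an illustration, not the clustering implementation used in the experiments:

```python
import random

def kmeans_sse(points, k, iters=20, seed=0):
    # Lloyd's algorithm; returns the sum of squared errors for k clusters.
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[j].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )

def elbow_curve(points, k_max=10):
    # SSE per tested k; the elbow is where the gradient changes only slightly.
    return [kmeans_sse(points, k) for k in range(1, k_max + 1)]
```

In practice, one plots the resulting curve, as in Figure 6.2, and reads off the point where it flattens.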


Figure 6.2.: Elbow method for subset of eleven authors (sum of squared errors per number of tested clusters)

Like Mean Shift before, this method also determined the number of clusters correctly. While iterating over the authors, we see that although the number of clusters, and hence the number of authors, in our data set is assessed correctly, the texts are not precisely related to their original authors. This means that even if the number of clusters matches the number of authors, the clusters neither represent the authors’ texts nor their writing styles. Figure 6.3 presents the clustering precision for the ten groups of two to eleven authors with k set to the number of authors. For a minimal group of two authors, clustering achieved a precision of 76.71 %, meaning that an unknown text by one of these two authors can be related to its original author with this probability. However, the larger the group, the lower the precision gets. For small groups of up to four authors, texts can be correctly related to their original author with a probability of 54.43 %. If the group size exceeds four, we see a clear drop of the clustering precision to about 40 % for groups of five to ten authors. For the tenth case with eleven authors, the precision decreases to 33.18 %. In summary, texts could not be related to their original authors using clustering methods, even when obfuscation is not applied. Since the method of clustering writeprints does not seem to constitute a valid way for authorship attribution and the precision of relating texts to their original authors


constantly decreases with higher group sizes, we omitted testing obfuscated texts. Even the original texts could not be related to their authors; obfuscation would change the clusters, but it cannot impair an attribution that already fails. Without utilizing any additional information, unsupervised methods seem to fail at authorship attribution on our corpus.

Figure 6.3.: Clustering series for original texts (average clustering precision per number of authors)

6.4.2. Supervised Authorship Attribution

Given the insight of the previous section, we now apply supervised machine learning. In the wild, there must be an information leak, e. g., a revealed relation between texts and authors, to make the following experiments possible. We use such leaked information for training a machine learning classifier and test whether it is possible to precisely determine the author of a new text, even though it may be obfuscated. In the following, we use the subset of eleven authors again to stay comparable to the previous section; here, the number of texts per author is a variable parameter. For the full data about authorship attribution in spite of stylometry obfuscation, with author group sizes from three to 30 and various numbers of texts per author, see Sec. 6.4.4.


Supervised machine learning requires training data for building a model that can be utilized for recognition. This training set contains 75 % of the writeprints per author, so that 25 % remain for classification purposes. As this procedure represents a matching problem (a given writeprint should be matched with an author), we chose two ensemble algorithms, namely the Random Forest and the Extra Trees classifiers. Choosing the most suitable machine learning algorithm requires a cross-validation comparison [69]. Hence, we performed cross-validation utilizing these two classification algorithms and found them almost equivalent regarding recognition precision, false positive rate, and false negative rate. However, we decided on the Extra Trees classifier as it performed slightly better than the Random Forest classifier (the differences did not exceed 0.65 % on average), especially for larger groups of authors. We conducted these recognition experiments separately for the original writeprints as well as for all obfuscated writeprints (see Sec. 6.3.5). Thus, there are four sets of writeprints, from:

1. the Original review texts,

2. the text Transformer obfuscator,

3. the synonymization obfuscator using WordNet,

4. the text spinning application Turbospinner.

Figure 6.4 presents the author recognition precision, meaning the probability of correctly relating a text to its original author, per number of texts per author for these four data sets. Note that a review text in our data set consists of 345 words on average. Recognizing the original texts seems very well possible: although the recognition precision is about 64 % for ten texts per author, it increases significantly to 91 % for 30 texts per author. From 50 texts onward, the precision rises up to 97.63 % for the total number of 1,000 reviews per author. Hence, matching a given text (resp. its writeprint) can be considered reliable for 30 or more texts per author in a group of eleven authors. The obfuscators are expected to lower the recognition precision, as obfuscated texts should be harder to relate to their original authors. Looking at the recognition precision for the obfuscators, the Transformer and Turbospinner strongly decrease the chance of matching a text to its original author for ten texts per author. This means that texts obfuscated by these tools are harder to relate if only few texts are given. Turbospinner stays effective as long as the number of texts is below 30, while the Transformer is not able to keep the recognition rate lower than for the original review texts. Despite one local minimum, the WordNet obfuscator cannot lower the recognition precision and makes a relation possible at around the

same probability as for the original texts. Finally, if 200 or more texts per author are given, all obfuscated reviews can be matched to their original authors with a precision of over 90 %. For higher numbers of texts, the precision for all text sets converges at 97 % ± 1 %. In summary, obfuscation techniques seem effective only as long as the authors’ texts do not exceed a certain amount. If few texts are present, obfuscation is able to impede authorship attribution to a certain degree, but if there are enough texts per author, obfuscation loses its effectiveness.

Figure 6.4.: Recognition precision per number of texts (Original, Transformer, Synonymizer, Spinner)
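The Extra Trees classifier chosen in this section differs from a Random Forest mainly by drawing split thresholds at random instead of optimizing them. The toy ensemble of randomized decision stumps below, a deliberately simplified stand-in for scikit-learn's `ExtraTreesClassifier`, illustrates that principle; the class name and all parameters are ours:

```python
import random
from collections import Counter

class RandomStumpForest:
    """Toy Extra-Trees-style ensemble: every 'tree' is a one-level stump that
    splits a randomly chosen feature at a randomly drawn threshold and
    predicts the majority class on each side; stumps vote on the final label."""

    def __init__(self, n_estimators=50, seed=0):
        self.n_estimators = n_estimators
        self.rng = random.Random(seed)
        self.stumps = []

    def fit(self, X, y):
        self.stumps = []
        for _ in range(self.n_estimators):
            f = self.rng.randrange(len(X[0]))
            lo = min(x[f] for x in X)
            hi = max(x[f] for x in X)
            t = self.rng.uniform(lo, hi)  # random threshold: the Extra-Trees idea
            left = [label for x, label in zip(X, y) if x[f] <= t]
            right = [label for x, label in zip(X, y) if x[f] > t]
            majority = (
                Counter(left or y).most_common(1)[0][0],
                Counter(right or y).most_common(1)[0][0],
            )
            self.stumps.append((f, t, majority))
        return self

    def predict(self, X):
        # Index the (left, right) majority tuple with the boolean split result.
        return [
            Counter(m[x[f] > t] for f, t, m in self.stumps).most_common(1)[0][0]
            for x in X
        ]
```

Real Extra Trees grow full trees over many features, but the voting over randomized splits shown here is the core mechanism.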

6.4.3. Readability

Obfuscators change the original texts either by replacing words or by rearranging text parts. An essential property of a text is its readability, as a text is intended to transport information. Depending on the degree of obfuscation, a text’s readability may be jeopardized. Hence, we compare the readability of the texts in our data sets, instrumenting two well-established measurement methods:

1. Flesch-Reading-Ease (FRE) and

2. Automated Readability Index (ARI)

Figure 6.5 presents the calculated readability measures per text set. The Flesch Reading Ease (FRE) of the original texts is 75.17, which is interpreted as “fairly easy to read” and corresponds to texts that can be read at a school level of 7th grade (see B.1). The Automated Readability Index (ARI) is 7.69, which can be interpreted as a sixth-grade text, readable by an eleven- to twelve-year-old (see B.2). Knowing that the texts in our sets are online reviews, this readability level seems appropriate.

Figure 6.5.: Readability per obfuscator (Flesch Reading Ease, left axis; Automated Readability Index, right axis)

The texts altered by the obfuscator using WordNet are at a level comparable to the original texts: with an FRE of 76.73 and an ARI of 7.32, these texts are even slightly easier to read. However, the school level and age required to understand these texts are the same as for the original texts.

Those texts obfuscated by Turbospinner were found to be harder to understand. The FRE is 67.21, which corresponds to the 8th and 9th grade. Also, the ARI is higher than that of the original texts with a value of 9.18, meaning that these obfuscated texts can be read by a child of age 13 to 14. This means that the changes Turbospinner made to the texts decreased their readability significantly.

This applies even more to texts obfuscated by transformation: the FRE of these texts is 53.46, so that a school level of 11th grade is required to read them, and they are categorized at the edge between “fairly difficult to read” and “difficult to read” (see Section B.1). The difference to the other text sets becomes even clearer when looking at the ARI. Its value is 16.6, which is above the normal scale of the Automated Readability Index ending at 14 and represents an age of 26 to 30 required to understand the text. Hence, we consider this text set as hardly readable.
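The grade interpretations used above follow the common FRE banding (cf. Appendix B.1); the band labels below are our paraphrase of that table:

```python
# (lower bound, interpretation, school level) per the usual FRE banding.
FRE_BANDS = [
    (90.0, "very easy to read", "5th grade"),
    (80.0, "easy to read", "6th grade"),
    (70.0, "fairly easy to read", "7th grade"),
    (60.0, "plain English", "8th to 9th grade"),
    (50.0, "fairly difficult to read", "10th to 12th grade"),
    (30.0, "difficult to read", "college"),
    (0.0, "very difficult to read", "college graduate"),
]

def interpret_fre(score):
    for bound, label, grade in FRE_BANDS:
        if score >= bound:
            return label, grade
    # Scores below zero fall into the hardest band.
    return FRE_BANDS[-1][1], FRE_BANDS[-1][2]
```

For instance, the original texts' FRE of 75.17 falls into the "fairly easy to read" band, while the Transformer's 53.46 sits just above the 50.0 boundary between "fairly difficult" and "difficult".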

Comparing the readability of texts between the separated text sets and relating them to their effectiveness (see Sec. 6.4.2) gives a clear tendency: the more an obfuscator decreases the precision of relating a text to its original author, the less readable the text gets.

Figure 6.6.: Recognition precisions of different text sets for a group of three authors (Original, Transformer, Synonymizer, Spinner)

6.4.4. Number of Authors and Texts

While the previous sections showed findings with the number of texts per author as a variable for a fixed author group size, we will also vary the group size in this section. Therefore, we extended our experiments to train and classify all combinations of group sizes between three and 30, as a size of one or two does not apply to our scenario and our data set is limited to 30 authors. First, we investigate the edge cases of author group sizes, beginning with the minimum. Figure 6.6 outlines the author recognition precision, meaning the probability of successfully matching a given text to its original author, per number of texts per author for a group of three authors. In contrast to a group size of eleven authors (see Fig. 6.4), the precision rates are generally higher, no matter if obfuscated or not. Moreover, the general trend of a higher precision when more texts per author are available is clearly confirmed. While the Text Transformer yields the same precision rate as the original reviews for ten texts per author and stays slightly beneath it for other text set sizes, it differs most at 20 texts per author. We can assume that in a small author group the difference between ten and twenty texts per author does not affect the effectiveness of this obfuscator. Nevertheless, it is not able to decrease the probability of authorship attribution at all. This also applies to the obfuscator instrumenting WordNet. If only ten texts are available per author, it seems to even increase the recognition precision. For all other text set sizes, the obfuscation does not affect the precision much. The largest difference occurs at 40 texts per author and is still only around 4 %. Only Turbospinner shows a notable effect on the recognition

Figure 6.7.: Recognition precisions of different text sets for a group of thirty authors (Original, Transformer, Synonymizer, Spinner)

precision. For up to 50 texts per author, the probability of recognizing an author by stylometry drops to 70 % to 86 %, a difference of 10 % to 12 %. Hence, for a small group of authors and for few texts per author, the obfuscation can be deemed effective. However, if more than 50 texts per author are present, the probability of correct authorship attribution again approaches 99 %, although obfuscation has been applied by Turbospinner. Second, we examine the precision of authorship attribution with the maximum group size of our experiment. Figure 6.7 is analogous to Figure 6.6, but shows the results for a group of 30 authors. First of all, the precision for correctly matching a text to its original author by utilizing writeprints is generally lower for this large group (e.g., for ten texts per author it is around 60 % instead of 80 %). Although the distances between the curves are larger, indicating more variance in the classification results, they again all converge at approximately 92 %. Except for the edge case of ten texts per author (which is in total 3,448 words per author on average), where all obfuscators seem to impede authorship attribution, the effectiveness of stylometry obfuscation using the presented tools is questionable. For a more detailed view, we present the experiment results as graphs depicting the range of recognition precision. This means that the edge cases of three and 30 authors are shown as lines and the area between these two lines covers the intermediate group sizes. Figure 6.8 exposes this area for the original review texts. If only ten texts per author are known, the recognition precision within a group of three authors is around 80 %, while for a group of 30 authors it drops to 61 %. Although there are outlying values (see Appendix B.2 for the full data

overview), the general tendency is that writeprints are easier to classify in a small group and more challenging to classify in a large group of authors (see Sec. 6.4.2). Hence, all other group sizes lie in the range between 61 % and 80 %. The examination of the original review texts supports an intuitive hypothesis: attributing a text, represented by its writeprint, to its original author is more difficult in a larger group of authors than in a small one. However, we also see that the more texts per author are available for analysis, the smaller the range gets. This means that the recognition precision within large author groups approximates the precision within small groups if enough text material is given. Still, at 20 texts per author, authorship attribution is successful in 95 % of all cases if there are only three authors, while for a group of thirty authors the precision never reaches such high values. As the distance between the two curves clearly decreases with a higher number of texts, we may deduce that the recognition precision within a large author group approximates that of a small group if the number of texts increases. Hence, compared to a small group of authors, more text material is required to attribute authorship within a larger group. More clearly: increasing the available text material compensates for an increasing group of authors. Figure 6.9 shows the recognition ranges depending on the group size for the text set obfuscated by the Text Transformer. The great distances at ten and 30 texts per author are noticeable and indicate that the obfuscator is effective for the larger group of authors, as it seems to be more difficult to relate authors and texts in such a group after the texts were obfuscated.
This finding can also be observed for the obfuscator using WordNet, at least for ten texts per author, as can be seen in Figure 6.10. Hence, recognizing authors by their writeprints is harder when there are more authors and fewer texts. It is important to note that the range between the two curves is smaller for any text set having more than ten texts per author. So, when only a few texts are available, obfuscation using these tools may be effective for large groups of authors. Again, the smaller the author group and the larger the text set, the higher the precision of authorship attribution. Only with many authors or little text material do the obfuscators have a chance to deceive the machine learning algorithm and interfere with the relation process. For Turbospinner, the differences in recognition precision between three and 30 authors are generally greater than for the other obfuscators. This can be interpreted as the obfuscator being more effective, as the recognition precision is lower in general; the success of authorship attribution despite this mechanism of stylometry obfuscation depends more strongly on the author group size. The distances for the other obfuscators were rather small (with the exception of ten, and 30 for the Transformer, texts per author), meaning that the number of authors affects the recognition precision only a little. A greater range, as in Figure 6.11, indicates a higher

variation of recognition precision and, thus, a higher dependence on the author group size. Finally, we see that the findings for the exemplary group of eleven authors used in the previous sections are confirmed for other group sizes, too. Note that at a text set size of 250, the distance between the two curves decreases notably in all the previous graphs. This can be interpreted to mean that from this amount of texts onward, the higher number of available texts compensates for a higher number of authors in the group. The complete matrices showing the recognition precisions for all combinations of author group size and number of texts per author can be found in Appendix B.2.

Figure 6.8.: Recognition precision from three to 30 authors for original review texts

Figure 6.9.: Recognition precision from three to 30 authors for texts obfuscated by the Text Transformer

Figure 6.10.: Recognition precision from three to 30 authors for texts obfuscated by the WordNet-based Synonymizer

Figure 6.11.: Recognition precision from three to 30 authors for texts obfuscated by Turbospinner

In general, we can deduce that for the unobfuscated texts, more authors or fewer texts result in a lower precision, just as fewer authors or more texts result in a higher precision of authorship attribution. This also applies to the obfuscator using WordNet: although the numbers are slightly different, the same tendencies can be observed. Regarding the Text Transformer and Turbospinner obfuscators, we see a lower precision for fewer authors, indicating that obfuscation works for small groups of authors. However, the precision already increases for mid-sized groups of ten to twelve authors, depending on the number of texts. Regarding the number of

texts per author, these obfuscators show the same tendency as the original texts: generally, the more texts are available per author, the higher the precision gets. This supports our thesis that obfuscators are effective for small groups of authors only. For mid-sized groups of ten to fourteen authors, texts and authors can be related at high precision. This also applies to large groups of up to 30 authors if there is enough text material. A precise classification even within such a large group can be achieved with about 250 texts per author, which means around 70,000 words per author, while for smaller groups ten to 20 texts are sufficient, which equals 3,400 to 6,900 words per author.

6.5. Discussion

Although we considered realistic scenarios and handled our data carefully throughout the processes of obfuscation and authorship attribution, there exist limitations and threats to the validity of our work. First, the presented results are only valid for the data corpus introduced in Sec. 6.3.2. Testing different texts from different contexts (e. g., short messages or blog posts) remains for future work. As the number of available texts and the author group size heavily influence the success of authorship attribution, a scenario with more authors or fewer texts may decrease the precision of relating texts to their authors. However, we examined all possible scenarios with three to 30 authors and ten to 1,000 texts per author, which is about 3,400 to 172,500 words per author. In recent research, the average number of words per author when using writeprints varies between 1,400 and 44,000 [1], 6,500 [20], and 4,500 [6]. This wide range of required words per author is deemed a scalability issue and may not be solved in general, but only in individual scenarios [100]. However, we plan to examine larger groups of authors and differently structured texts in the future. For instance, Twitter posts and blog articles are structured differently, and the effectiveness of writeprints or even obfuscation techniques may vary. In our approach, we left out four features commonly used in writeprint implementations (see Sec. 6.3.3) to keep the number of features manageable. Nevertheless, changing the instrumented features may also affect the results, so a lower or even higher authorship attribution precision might be possible when using these features. All texts of our corpus are taken from one source, as they are reviews on a major online shopping website. Recent studies have shown that authorship attribution with texts from several sources (e. g., domains) may lack effectiveness [120].
Hence, building a corpus from different sources and re-evaluating our findings is a possible future enhancement.


Additionally, utilizing alternative clustering methods like spectral clustering or Gaussian mixture models might provide other results for unsupervised authorship attribution. Revisiting this topic by comparing different clustering approaches might provide more insights into the general feasibility of using unsupervised machine learning mechanisms for authorship attribution. As many more obfuscators exist in the wild, determining which one was used can be deemed impossible from the obfuscated texts alone. Neither machine learning approach, however, requires the used obfuscator to be known. Still, future research should examine more obfuscators, like multi-translators and other techniques, regarding their effectiveness against writeprints. For its effectiveness, it will be essential that an obfuscator implements a mechanism to add randomness to an obfuscated text, so that the same input text always leads to a different output. Otherwise, the leak of one text may reveal an author’s identity and make it possible to relate other texts of this author. For maximum anonymity, an obfuscator would be a one-way function, so that text changes are not invertible, and pseudo-random, so that every single input text is altered uniquely.
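The randomness requirement can be made concrete with a small sketch; the synonym table and the function name `obfuscate` are hypothetical, and a fresh, unseeded RNG per call stands in for the demanded non-determinism:

```python
import random

def obfuscate(text, synonyms, rng=None):
    """Replace words that have known synonyms by a randomly chosen alternative.
    `synonyms` maps a word to its admissible replacements (hypothetical table).
    Without an explicit RNG, a fresh unseeded generator is used, so repeated
    calls on the same input may yield different outputs."""
    rng = rng or random.Random()
    out = []
    for word in text.split():
        options = synonyms.get(word.lower())
        # Swap in a random synonym if one is known; keep the word otherwise.
        out.append(rng.choice(options) if options else word)
    return " ".join(out)
```

A real obfuscator would additionally have to ensure that the replacement is not invertible, which simple synonym substitution against a known table is not.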

6.6. Related Work

In this section, we refer to the academic work which has been published regarding the game of cat and mouse between the attempts of author identification (attribution) and author masking (obfuscation). Stylometry was first utilized to identify the authors of unknown documents by hand [135]. However, we limit this section to the more recent related academic work. This journey started with the work of Rao et al. [132], who considered the possibility of identifying authors who are hiding behind pseudonyms. Their method was based on extracting syntactic and semantic stylometric features of newsgroup postings, frequency counting of these features, and performing Principal Component Analysis (PCA) for dimensionality reduction. The authors of this work also suggested the possibility of applying machine translation (i. e., round translation) in order to disguise the writing style of the original author. However, although there have been many improvements regarding machine translation and the concept of perplexity, the work of Keswani et al. [75] demonstrated that even 16 years later round translation produces text with low readability scores. This endeavor was mainly continued by the Privacy, Security and Automation Laboratory of Drexel University, which provided some of the most influential contributions regarding the application of stylometry to perform author identification and masking in recent years [5,6,20,106]. This work includes the development of JStylo and Anonymouth. These tools utilize different sets of stylometric properties, such as the base-9 feature set and writeprints, in order to re-identify the original author by supervised machine learning models

and help authors disguise their writing style, respectively. However, Le et al. [92] presented an attack on the obfuscation method of Anonymouth by exploiting its highly deterministic obfuscation algorithm. The authors of this attack claim to be able to identify the original author among ten potential authors in a set of size 2 with a probability of approximately 44 %. In the work of Kacmarcik and Gamon [73], the training of the SVM classifiers is performed on the well-known Federalist Papers before they are anonymized by utilizing effective vectors such as disguising the word frequency. This approach was also attacked in the work of Le et al. [92] by reversing the steps of the performed obfuscation. The application of decision trees for author identification has been less popular than the usage of Support Vector Machines (SVM). However, Koppel and Schler [81] have shown in the past that this option is also reasonable. A partially similar work to ours has been performed by Abbasi et al. [1], who successfully re-identified the individuals who gave feedback comments on their purchases on Ebay. The similarity arises from the fact that they tried to present a robust method for cases where there have been intentional stylistic alterations in the author’s text. Also, Brennan and Greenstadt [21] performed research on three different methodologies utilizing stylometry in order to measure whether these obfuscation methods are reliable or not. However, they performed the training of their classifiers on the plaintext, while we perform our training on the already obfuscated texts. The PAN challenges, which have been organized since 2011, consist of scientific events and tasks on digital text forensics. Regarding authorship, there exist different challenges such as author identification, author profiling, and, since 2016, automated author obfuscation as well.
There were three submissions to the author obfuscation task in 2016 [75, 103, 108, 129]. We asked the authors of two works to share their code with us so that we could evaluate our hypothesis with real obfuscating tools. Caliskan et al. have shown the general feasibility of using stylometric analyses for source code attribution [27]. While these authors are able to relate source code files to their original programmers well, we focus on written texts in this work. Nevertheless, taking corpora with other text formats and sources into account is a future challenge for our findings.


6.7. Conclusion

In this chapter, we reviewed authorship attribution scenarios under realistic circumstances and implemented extended writeprints, including a comprehensive set of features. We applied clustering as an unsupervised and ensemble methods as supervised machine learning techniques, instrumenting writeprints for authorship attribution. While clustering did not relate texts to their original authors reliably, attribution instrumenting an Extra Trees classifier is an effective method for this purpose. We leveraged three text obfuscation mechanisms using different approaches: in-place text transformation, model-based text alteration, and text spinning. For our data, obfuscating texts with these mechanisms does not eliminate authorship attribution. Either nothing is known about the corpus, e. g., the number of authors or their original styles, or there is additional information about the authors, e. g., a leaked text. In the first case, only unsupervised machine learning can be used, which results in a non-effective recognition of author stylometry. Hence, if even the original texts cannot be related to their authors, obfuscation is not required. In the second case, applying supervised machine learning to the texts and authors’ writeprints enables such a precise authorship attribution that the examined obfuscation techniques are ineffective, at least beyond a certain amount of texts depending on the number of authors. Also, to be effective, an obfuscator is required to add randomness to the obfuscation process, so that the same input text always leads to a different output. In our experiments, the more heavily a text is obfuscated, the less readable it becomes. Texts with a lower chance of being related to their original authors are likely harder to understand, like the texts obfuscated by the Text Transformer and Turbospinner. In contrast, if an obfuscator preserves readability, it hardly influences authorship attribution, like the WordNet obfuscator.
Hence, when using an obfuscator to anonymize a text, there needs to be a trade-off between obfuscation and readability.
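To make the attribution setting concrete, the following stdlib-only Python sketch extracts a toy writeprint and attributes a text to an author by a nearest-centroid rule. This is an illustration only: the system described in this chapter uses a far richer feature set and an Extra Trees ensemble as classifier, and all feature choices, sample texts, and author names below are invented for the example.

```python
import math


def writeprint(text):
    """Extract a tiny stylometric feature vector (illustrative only)."""
    words = text.split()
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    return [
        text.count(",") / n_words,                     # comma rate
        text.count(".") / n_words,                     # period rate
        sum(c.isupper() for c in text) / n_chars,      # uppercase ratio
        len(set(w.lower() for w in words)) / n_words,  # vocabulary richness
    ]


def attribute(train, text):
    """Attribute `text` to the author whose mean writeprint is closest.

    `train` maps author name -> list of known texts. A nearest-centroid
    rule stands in here for the Extra Trees ensemble used in the chapter.
    """
    def centroid(vectors):
        return [sum(col) / len(col) for col in zip(*vectors)]

    centroids = {a: centroid([writeprint(t) for t in ts])
                 for a, ts in train.items()}
    v = writeprint(text)
    return min(centroids, key=lambda a: math.dist(v, centroids[a]))


train = {
    "alice": ["Well, I think this is fine, truly fine, and rather neat.",
              "Honestly, I believe, with some doubt, that it works."],
    "bob": ["Works. Ship it. Done.",
            "Short words win. Fast code. No fuss."],
}
print(attribute(train, "I think, perhaps, that this is rather lovely."))  # prints alice
```

Even with four crude features, the comma-heavy, hedging style is separable from the terse one; the real writeprints capture lexical, syntactic, and structural properties in the same spirit, just at much higher dimensionality.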

CHAPTER SEVEN

CONCLUSION

In this thesis, we studied the field of digital fingerprinting from various perspectives.

First, we focused on software as a resource and showed that common libraries may fail at specifically recognizing mobile devices. We built a feature set to make fingerprinting of mobile devices possible and also discovered countermeasures to evade modern fingerprinting techniques. While it is feasible to escape such recognition methods, Internet users cannot be sure to avoid system fingerprinting, as it usually does not arouse suspicion and is applied secretly. This is particularly devastating when a user’s fingerprint affects the presented content, like prices in online shops. We conducted an empirical study which indicates the existence of location-based as well as fingerprint-based online price adjustments and shows that individual pricing policies exist in the wild. Prices at online platforms, e. g., for hotel booking, may vary based on users’ locations and systems without their consent or awareness. However, we could not prove a systematic price discrimination based on system fingerprints today.

Second, we examined hardware as a resource for fingerprinting. We proved the feasibility of hardware sensor fingerprinting by comparing a feature set specialized for web-accessible sensors with raw sensor data. Our results support the existence of hardware imperfections which can be used to authenticate a mobile device. We were able to recognize single devices with a precision of up to 99.995 %, strongly indicating that hardware-based authentication, e. g., as a second factor, fortifies the generic user authentication process. Given these hardware features, we designed and implemented a CAPTCHA scheme for liveliness tests. We instrumented a device’s sensors as user input and created a set of gestures which are fingerprinted and classified as specific motions. To test liveliness, a user is required to produce a particular motion fingerprint.
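The idea behind sensor-based device recognition can be illustrated with a small Python sketch. This is not the feature set from this thesis: it merely shows how simple per-axis time-domain statistics over raw accelerometer readings already separate two hypothetical devices whose sensors carry slightly different manufacturing biases.

```python
import statistics


def sensor_features(samples):
    """Per-axis time-domain statistics from raw (x, y, z) accelerometer
    readings; tiny per-device offsets shift these values measurably."""
    features = {}
    for axis, values in zip("xyz", zip(*samples)):
        features[f"{axis}_mean"] = statistics.fmean(values)
        features[f"{axis}_std"] = statistics.pstdev(values)
        features[f"{axis}_rms"] = (sum(v * v for v in values)
                                   / len(values)) ** 0.5
    return features


# Two devices lying flat: identical nominal reading (0, 0, 9.81 m/s^2),
# but distinct small biases -- invented values for illustration.
device_a = [(0.012, -0.034, 9.812)] * 50
device_b = [(0.025, -0.010, 9.805)] * 50
fa, fb = sensor_features(device_a), sensor_features(device_b)
print(fa["x_mean"], fb["x_mean"])  # the mean biases alone separate the devices
```

In practice, many more features (and noisy, motion-contaminated samples) are needed, which is why the thesis trains classifiers on such feature vectors instead of comparing single statistics.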
Unlike reCAPTCHA, our scheme is not privacy-invasive. We demonstrated that our mechanism

can be used as a privacy-preserving replacement for established techniques in terms of usability.

Third, we considered another potential resource of fingerprinting and investigated authorship attribution via stylometry. We implemented extensive writeprints as a method for text-based fingerprinting and compared different methods for anonymizing written texts by obfuscating the author’s style of writing. Our experiments show that such obfuscation usually decreases a text’s readability while it is still possible to recognize the original author’s style.

Altogether, fingerprinting constitutes a reliable technique for recognizing systems based on their attributes. The diversity of customized devices, hardware imperfections, and other attributes results in unique peculiarities which can be leveraged to distinguish between different systems and to recognize a specific one among all others. This technique may be used for privacy-invasive purposes like user tracking, but it also provides a way to enhance existing IT security mechanisms and to improve user-friendly applications.

Due to technological advances, new functionalities will evolve in the future, bringing more possibilities for the customization of digital systems. Some innovative techniques may serve as resources for fingerprinting although they were not specifically developed for this purpose, like canvas elements. Such new resources might enable more complex fingerprinting and eventually a higher precision, as long as newly designed features result in higher variation and higher individualization among systems. New browser plugins or additional sensor types yielding hardware imperfections are only two examples of how future techniques may improve system fingerprinting. Modern smartphones already provide more hardware sensors than the last smartphone generation, e. g., the barometer and Hall effect sensors of Google’s latest Pixel smartphone.
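The link between feature variation and recognition precision can be made concrete with a standard information-theoretic measure. The following Python sketch, which is illustrative and not part of the thesis experiments, computes the Shannon entropy of a fingerprint feature's value distribution across a population of systems: a feature with one value per system contributes the maximum number of identifying bits, while a feature shared by everyone contributes none.

```python
import math
from collections import Counter


def feature_entropy(values):
    """Shannon entropy (in bits) of a fingerprint feature observed across
    a population: roughly how much identifying information it carries."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical population of eight systems.
distinct = ["sensor-model-%d" % i for i in range(8)]  # one value per system
print(feature_entropy(distinct))        # 3.0 bits = log2(8)
print(feature_entropy(["shared"] * 8))  # no identifying power
```

Summing such per-feature contributions (for independent features) is exactly why "piecing together dissociated features" yields fingerprints precise enough to single out one system among many.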
The enrichment of software and hardware may offer more features for fingerprinting in the future. Furthermore, fingerprinting is an adaptable approach, so that additional purposes and applications for this technique might be developed. Advances in IT security, like enhanced authentication mechanisms, are possible, as well as highly precise user tracking and analyses of user behavior and preferences, which may invade user privacy.

Moreover, the technique of fingerprinting is not limited to digital systems. Just as a human is able to recognize particular features of other humans, buildings, shapes, and much more, the methods of fingerprinting may be adapted and developed into larger machine learning models for recognizing or identifying various kinds of items. Stylometry is only one example of fingerprinting entities other than computer systems, but face recognition and identification friend or foe (IFF) may also benefit from advances in fingerprinting. Since fingerprinting is not limited to a piece of software or computer systems but is also applicable to human habits like writing styles, this technique may

concern everybody, depending on the purposes for which fingerprinting is established. Because of its numerous application scenarios, it seems likely that this technique will evolve in the future, but it remains unclear whether or not user consent will be required to apply fingerprinting to personal data, properties, or habits. Whatever the future holds, the piecing together of dissociated features enables precise fingerprinting which is challenging to evade technically, for digital systems as well as for human beings.


LIST OF FIGURES

1.1. Fingerprinting process ...... 6
1.2. Amplification DDoS attack ...... 7
1.3. Overview of thesis topics ...... 8

2.1. Average ROC performance chart for the single-iteration experiment ...... 35
2.2. Average ROC performance chart for the multi-iteration experiment ...... 36
2.3. Average ROC performance chart for our system under multiple scenarios of evasion attacks ...... 41

3.1. Every system yields its own fingerprint: different features are extracted from a system and stored in a provider’s database ...... 51
3.2. Exemplary JavaScript code snippet of system fingerprinting and tracking at Hrs.com ...... 52
3.3. High-level overview of our system’s workflow ...... 55
3.4. Scanner components operation chart ...... 57
3.5. Location-based price differentiation by provider ...... 62

4.1. Sensor-based device authentication for user authentication ...... 80
4.2. Recognition precisions per sensor for device recognition ...... 92
4.3. Recognition precisions per sensor for model recognition ...... 92
4.4. Recognition precisions per combination for device recognition ...... 93
4.5. Recognition precisions per combination for model recognition ...... 95

5.1. User study setup ...... 106
5.2. Mean Likert-scores and standard deviations from survey ...... 113
5.3. Classification precision and recall ...... 116

6.1. High-level overview of our system’s workflow ...... 126
6.2. Elbow method for subset of eleven authors ...... 133
6.3. Clustering series for original texts ...... 134


6.4. Recognition precision per number of texts ...... 136
6.5. Readability per obfuscator ...... 137
6.6. Recognition precisions of different text sets for a group of three authors ...... 138
6.7. Recognition precisions of different text sets for a group of thirty authors ...... 139
6.8. Recognition precision from three to 30 authors for original review texts ...... 141
6.9. Recognition precision from three to 30 authors for texts obfuscated by the Text Transformer ...... 141
6.10. Recognition precision from three to 30 authors for texts obfuscated by the WordNet-based Synonymizer ...... 142
6.11. Recognition precision from three to 30 authors for texts obfuscated by Turbospinner ...... 142

LIST OF TABLES

2.1. Tracking libraries and the applied fingerprinting techniques ...... 22
2.2. KLD results for SDesktop and SMobile ...... 23
2.3. Feature data types and example values ...... 31

3.1. Leveraged fingerprint features with exemplary values ...... 56
3.2. Search parameter features for hotel booking websites with example values ...... 58
3.3. Fingerprint-based price changes per provider ...... 64
3.4. Excerpt of median hotel prices per fingerprint, provider and country ...... 66
3.5. Example for morphprints of pairing (O1, O2) ...... 67
3.6. Features and their share in cases of price changes ...... 68
3.7. Most influencing features ...... 70

4.1. Numbers of events, benchmarks and devices per sensor type ...... 83
4.2. Number of different sensor hardware models by sensor type ...... 83
4.3. Recognition precisions per sensor type, identifier and data set of single-sensor tests in percent ...... 91
4.4. Recognition precisions per sensor combination, identifier and data set of combination tests in percent ...... 94

5.1. Success rates (SR) in percent ...... 110
5.2. Average solving times in seconds ...... 110
5.3. Solving rates and error rates per gesture in percent ...... 112

A.1. Median hotel prices per fingerprint, provider and country ...... 169

B.1. Interpretation of the Flesch-Reading-Ease ...... 173
B.2. Interpretation of the Automated Readability Index ...... 174
B.3. Precision per number of authors and texts for original review texts in percent ...... 175


B.4. Precision per number of authors and texts for reviews obfuscated by text transformation in percent ...... 176
B.5. Precision per number of authors and texts for reviews obfuscated by WordNet-based synonymization in percent ...... 177
B.6. Precision per number of authors and texts for reviews obfuscated by text spinning in percent ...... 178

154 BIBLIOGRAPHY

[1] Ahmed Abbasi, Hsinchun Chen, and Jay F Nunamaker. Stylometric identifica- tion in electronic markets: Scalability and robustness. Journal of Management Information Systems, 25(1):49–78, 2008. [2] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In ACM Special Interest Group on Security, Audit and Control (SIGSAC), 2014. [3] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda G¨urses, Frank Piessens, and Bart Preneel. FPDetective: Dusting the Web for fingerprinters. In ACM Conference on Computer and Communications Security (CCS), 2013. [4] Jagdish Prasad Achara, Gergely Acs, and Claude Castelluccia. On the unicity of smartphone applications. In Workshop on Privacy in the Electronic Society. ACM, 2015. [5] Sadia Afroz, Michael Brennan, and Rachel Greenstadt. Detecting hoaxes, frauds, and deception in writing style online. In IEEE Symposium on Security and Privacy (S&P), 2012. [6] Sadia Afroz, Aylin Caliskan Islam, Ariel Stolerman, Rachel Greenstadt, and Damon McCoy. Doppelg¨angerfinder: Taking stylometry to the underground. In IEEE Symposium on Security and Privacy (S&P), 2014. [7] Mishari Almishari, Ekin Oguz, and Gene Tsudik. Fighting authorship linka- bility with crowdsourcing. In Conference on Online Social Networks. ACM, 2014. [8] S.I. Amari, Noboru Murata, K.R. Muller, Michael Finke, and Howard Hua Yang. Asymptotic statistical theory of overtraining and cross-validation. IEEE Transactions on Neural Networks, 8(5):985–996, 1997.

155 Bibliography

[9] Martin Azizyan, Ionut Constandache, and Romit Roy Choudhury. Surround- sense: Mobile phone localization via ambience fingerprinting. In ACM Annual International Conference on Mobile Computing and Networking (MobiCom), 2009.

[10] R. Bader. Nonlinearities and Synchronization in Musical Acoustics and Music Psychology. Current Research in Systematic Musicology. Springer, 2013.

[11] Gianmarco Baldini, Gary Steri, Franc Dimc, Raimondo Giuliani, and Roman Kamnik. Experimental identification of smartphones using fingerprints of built-in micro-electro mechanical systems (mems). Sensors, 16(6):818, 2016.

[12] Adam Bates, Ryan Leonard, Hanna Pruse, Kevin Butler, and Danial Lowd. Leveraging usb to establish host identity using commodity devices. In Network and Distributed System Security Symposium (NDSS), 2014.

[13] Peter Baumann, Stefan Katzenbeisser, Martin Stopczynski, and Erik Tews. Disguised chromium browser: Robust browser, flash and canvas fingerprinting protection. In ACM on Workshop on Privacy in the Electronic Society, 2016.

[14] J. Bell. Machine Learning: Hands-On for Developers and Technical Profes- sionals. Wiley, 2014.

[15] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndi´c,ˇ Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2013.

[16] K. Boda. Firegloves. http://fingerprint.pet-portal.eu/?menu, 2016.

[17] Hristo Bojinov, Yan Michalevsky, Gabi Nakibly, and Dan Boneh. Mobile device identification via sensor fingerprinting. arXiv preprint arXiv:1408.1416, 2014.

[18] Bojinov, Hristo and Michalevsky, Yan and Nakibly, Gabi and Boneh, Dan. Mobile Device Identification via Sensor Fingerprinting. arXiv preprint arXiv:1408.1416, 2014.

[19] K. Brade. The tor browser. https://gitweb.torproject.org/tor-bro wser.git, 2014.

[20] Michael Brennan, Sadia Afroz, and Rachel Greenstadt. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security (TISSEC), 15(3):12, 2012.

156 Bibliography

[21] Michael Robert Brennan and Rachel Greenstadt. Practical attacks against authorship recognition techniques. In IAAI, 2009.

[22] Browserleaks. Chrome web store detector. http://www.browserleaks.c om/chrome, 2014.

[23] Elie Bursztein, Jonathan Aigrain, Angelika Moscicki, and John C Mitchell. The end is nigh: Generic solving of text-based captchas. In USENIX Workshop on Offensive Technologies (WOOT), 2014.

[24] Elie Bursztein, Matthieu Martin, and John Mitchell. Text-based captcha strengths and weaknesses. In ACM Conference on Computer and Communica- tions Security (CCS), 2011.

[25] Elie Bursztein, Angelique Moscicki, Celine Fabry, Steven Bethard, John C Mitchell, and Dan Jurafsky. Easy does it: More usable captchas. In ACM Annual Conference on Human Factors in Computing Systems, 2014.

[26] Aylin Caliskan and Rachel Greenstadt. Translate once, translate twice, trans- late thrice and attribute: Identifying authors and machine translation tools in translated text. In International Conference on Semantic Computing (ICSC). IEEE, 2012.

[27] Aylin Caliskan-Islam, Richard Harang, Andrew Liu, Arvind Narayanan, Clare Voss, Fabian Yamaguchi, and Rachel Greenstadt. De-anonymizing program- mers via code stylometry. In USENIX Security Symposium, 2015.

[28] Yinzhi Cao, Song Li, and Wijmans Erik. (cross-)browser fingerprinting via os and hardware level features. In Network and Distributed System Security Symposium (NDSS), 2017.

[29] Le Chen, Alan Mislove, and Christo Wilson. An empirical analysis of algorith- mic pricing on amazon marketplace. In World Wide Web Conference (WWW), 2016.

[30] Christian Rossow. Amplification Hell: Revisiting Network Protocols for DDoS Abuse. In Network and Distributed System Security Symposium (NDSS), 2014.

[31] S.A. Cole. Suspect Identities: A History of Fingerprinting and Criminal Identification. Harvard University Press, 2009.

[32] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on pattern analysis and machine intelligence, 24(5):603–619, 2002.

157 Bibliography

[33] W.W. Daniel. Applied nonparametric statistics. The Duxbury advanced series in statistics and decision sciences. PWS-Kent Publ., 1990. [34] Waltenegus Dargie and Mieso K Denko. Analysis of error-agnostic time-and frequency-domain features extracted from measurements of 3-d accelerometer sensors. Systems Journal, 4(1):26–33, 2010. [35] Anupam Das, Nikita Borisov, and Matthew Caesar. Exploring ways to mitigate sensor-based smartphone fingerprinting. CoRR, abs/1503.01874, 2015. [36] Anupam Das, Nikita Borisov, and Matthew Caesar. Tracking mobile web users through motion sensors: Attacks and defenses. In Network and Distributed System Security Symposium (NDSS), 2016. [37] Amit Datta, Michael Carl Tschantz, and Anupam Datta. Automated ex- periments on ad privacy settings. In Workshop on Hot Topics in Privacy Enhancing Technologies (HotPETs), 2015. [38] R´emide Zoeten. Computational stylometry in adversarial settings. 2015. [39] Sanorita Dey, Nirupam Roy, Wenyuan Xu, Romit Roy Choudhury, and Srihari Nelakuditi. Accelprint: Imperfections of accelerometers make smartphones trackable. In Network and Distributed System Security Symposium (NDSS), 2014. [40] Kun Du, Hao Yang, Zhou Li, and Haixin Duan. The ever-changing labyrinth: A large-scale analysis of wildcard dns powered blackhat seo. In USENIX Security Symposium. [41] Francisco Duarte, Andr´eLouren¸co,and Arnaldo Abrantes. Classification of physical activities using a smartphone: Evaluation study using multiple users. Procedia Technology, 17:239–247, 2014. [42] Peter Eckersley. How unique is your web browser? In Privacy Enhancing Technologies Symposium (PETS), 2010. [43] Gunnar Eisenberg. Identifikation und Klassifikation von Musikinstru- mentenkl¨angenin monophoner und polyphoner Musik. Cuvillier, 2008. [44] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In ACM SIGSAC Conference on Computer and Communications Security, 2016. 
[45] Christian Eubank, Marcela Melara, Diego Perez-Botero, and Arvind Narayanan. Shining the floodlights on mobile web tracking – a privacy survey. In Web 2.0 Security & Privacy Conference (W2SP), 2013.

158 Bibliography

[46] Christos Fidas, Artemios Voyiatzis, and Nikolaos Avouris. On the necessity of user-friendly captcha. In ACM Annual Conference on Human Factors in Computing Systems, 2011.

[47] Barry A. J. Fisher, William J. Tilstone, and Catherine Woytowiczm. Intro- duction to Criminalistics: The Foundation of Forensic Science. Elsevier Ltd, Oxford, 2009.

[48] Rudolf Flesch. How to write plain English: A book for lawyers and consumers. Harpercollins, 1979.

[49] Rudolph Flesch. A new readability yardstick. Journal of applied psychology, 32(3):221, 1948.

[50] Nelson Francis and Henry Kucera. Brown corpus manual. Brown University, 15, 1979.

[51] Song Gao, Manar Mohamed, Nitesh Saxena, and Chengcui Zhang. Emerging image game captchas for resisting automated and human-solver relay attacks. In Anual Computer Security Applications Conference (ACSAC). ACM, 2015.

[52] Stanley A. Gelfand. Essentials of Audiology. Thieme, 2011.

[53] Lee Gillam, John Marinuzzi, and Paris Ioannou. Turnitoff–defeating plagiarism detection systems. In Annual Conference of the Subject Centre for Information and Computer Sciences, 2010.

[54] James E. Girard. Criminalistics: Forensic Science, Crime and Terrorism: Forensic Science, Crime, and Terrorism. Jones & Bartlett Publ., 3rd edition, 2014.

[55] G´abor Gy¨orgyGuly´as,Gergely Acs, and Claude Castelluccia. Near-optimal fingerprinting with constraints. In Privacy Enhancing Technologies Symposium (PETS), 2016.

[56] Isabelle Guyon. A scaling law for the validation-set training-set size ratio. AT&T Bell Laboratories, 1997.

[57] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reute- mann, and Ian H. Witten. The weka data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, November 2009.

[58] Aniko Hannak, Gary Soeller, David Lazer, Alan Mislove, and Christo Wilson. Measuring price discrimination and steering on e-commerce web sites. In Internet Measurement Conference (IMC). ACM, 2014.

159 Bibliography

[59] William J. Hardcastle, John Laver, and Fiona E. Gibbon. The Handbook of Phonetic Sciences. Blackwell Handbooks in Linguistics. Wiley, 2012.

[60] Haotian He. Human activity recognition on smartphones using various classi- fiers. 2013.

[61] Thomas Hupperich, Henry Hosseini, and Thorsten Holz. Leveraging sensor fingerprinting for mobile device authentication. In Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA), 2016.

[62] Thomas Hupperich, Katharina Krombholz, and Thorsten Holz. Sensor captchas: On the usability of instrumenting hardware sensors to prove liveliness. In International Conference on Trust and Trustworthy Computing (TRUST). Springer International Publishing, 2016.

[63] Thomas Hupperich, Davide Maiorca, Marc K¨uhrer,Thorsten Holz, and Giorgio Giacinto. On the robustness of mobile device fingerprinting: Can mobile users escape modern web-tracking mechanisms? In Anual Computer Security Applications Conference (ACSAC), 2015.

[64] Alexa Internet Inc. The top 1 million sites on the web. http://www.alexa .com/topsites/, 2015.

[65] Google Inc. Introducing nocaptcha. http://goo.gl/x7N7qt, 2016.

[66] Google Inc. reCAPTCHA – Easy on Humans Hard on Bots. https://www.g oogle.com/recaptcha/intro/index.html, 2016.

[67] MaxMind Inc. GeoIP2. https://www.maxmind.com/en/geoip2-servic es-and-databases, 2014.

[68] Net Applications Inc. Mobile & tablet browser market share. http://www.n etmarketshare.com/browser-market-share.aspx, 2014.

[69] Information Resources Management Association. Machine Learning: Concepts, Methodologies, Tools and Applications. IGI Global, 2011.

[70] Kristoffer Jensen. Timbre models of musical sounds. PhD thesis, Department of Computer Science, University of Copenhagen, 1999.

[71] Nan Jiang and Huseyin Dogan. A gesture-based captcha design supporting mobile devices. In British Human Computer Interaction Conference (HCI). ACM, 2015.

[72] Eric Jones, Travis Oliphant, and Pearu Peterson. Scipy: Open source scientific tools for python. http://scipy.org, 2016.

160 Bibliography

[73] Gary Kacmarcik and Michael Gamon. Obfuscating document stylometry to preserve author anonymity. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 444–451. Association for Computational Linguistics, 2006.

[74] Samy Kamkar. Evercookie – never forget. http://samy.pl/evercookie/, 2010.

[75] Yashwant Keswani, Harsh Trivedi, Parth Mehta, and Prasenjit Majumder. Author masking through translation. 2016.

[76] Anssi Klapuri and Manuel Davy. Signal Processing Methods for Music Tran- scription. Springer, 2007.

[77] Chloe Kliman-Silver, Aniko Hannak, David Lazer, Christo Wilson, and Alan Mislove. Location, location, location: The impact of geolocation on web search personalization. In Internet Measurement Conference (IMC). ACM, 2015.

[78] Kurt Alfred Kluever and Richard Zanibbi. Balancing usability and security in a video captcha. In Symposium on Usable Privacy and Security (SOUPS), 2009.

[79] Trupti M Kodinariya and Prashant R Makwana. Review on determining number of cluster in k-means clustering. International Journal, 1(6):90–95, 2013.

[80] Tadayoshi Kohno, Andre Broido, and Kimberly C Claffy. Remote physical device fingerprinting. IEEE Transactions on Dependable and Secure Computing, 2(2):93–108, 2005.

[81] Moshe Koppel and Jonathan Schler. Exploiting stylistic idiosyncrasies for authorship attribution. In Workshop on Computational Approaches to Style Analysis and Synthesis, 2003.

[82] Jochen Krimphoff, Stephen McAdams, and Suzanne Winsberg. Caract´erisation du timbre des sons complexes. ii. analyses acoustiques et quantification psy- chophysique. Le Journal de Physique IV, 4(C5):C5–625, 1994.

[83] Katharina Krombholz, Thomas Hupperich, and Thorsten Holz. Use the force: Evaluating force-sensitive authentication for mobile devices. In Symposium on Usable Privacy and Security (SOUPS), 2016.

[84] Katharina Krombholz, Thomas Hupperich, and Thorsten Holz. May the force be with you: The future of force-sensitive authentication. Journal of Internet Computing, Special Issue of Usable Security and privacy, 2017.

161 Bibliography

[85] Marc K¨uhrer,Thomas Hupperich, Jonas Bushart, Christian Rossow, and Thorsten Holz. Going wild: Large-scale classification of open dns resolvers. In Internet Measurement Conference (IMC). ACM, 2015.

[86] Marc K¨uhrer,Thomas Hupperich, Christian Rossow, and Thorsten Holz. Exit from hell? reducing the impact of amplification ddos attacks. In USENIX Security Symposium, 2014.

[87] Marc K¨uhrer,Thomas Hupperich, Christian Rossow, and Thorsten Holz. Hell of a handshake: Abusing tcp for reflective amplification ddos attacks. In USENIX Workshop on Offensive Technologies (WOOT), 2014.

[88] Andreas Kurtz, Hugo Gascon, Tobias Becker, Konrad Rieck, and Felix Freiling. Fingerprinting mobile devices using personalized configurations. In Privacy Enhancing Technologies Symposium (PETS), 2016.

[89] Thomas Lancaster and Robert Clarke. Automated essay spinning–an initial investigation. In Annual Conference of the Subject Centre for Information and Computer Sciences, page 25, 2009.

[90] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In IEEE Symposium on Security and Privacy (S&P), 2016.

[91] Robert Layton, Paul Watters, and Richard Dazeley. Automated unsupervised authorship analysis using evidence accumulation clustering. Natural Language Engineering, 19(01):95–120, 2013.

[92] Hoi Le, Reihaneh Safavi-Naini, and Asadullah Galib. Secure obfuscation of authoring style. In IFIP International Conference on Information Security Theory and Practice, pages 88–103. Springer, 2015.

[93] Mathias Lecuyer, Riley Spahn, Yannis Spiliopolous, Augustin Chaintreau, Roxana Geambasu, and Daniel Hsu. Sunlight: Fine-grained targeting detection at scale with statistical confidence. In ACM Conference on Computer and Communications Security (CCS), 2015.

[94] Wei-Han Lee and Ruby Lee. Implicit sensor-based authentication of smartphone users with smartwatch. In Hardware and Architectural Support for Security and Privacy. ACM, 2016.

[95] Alexander Lerch. An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics. Wiley, 2012.

162 Bibliography

[96] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In USENIX Security Symposium, 2016. [97] Fengjun Li, Xin Fu, and Bo Luo. A hardware fingerprint using gpu core frequency variations. In ACM SIGSAC Conference on Computer and Commu- nications Security, 2015. [98] Jiexun Li, Rong Zheng, and Hsinchun Chen. From fingerprint to writeprint. Communications of the ACM, 49(4):76–82, 2006. [99] Bin Liang, Wei You, Liangkun Liu, Wenchang Shi, and Mario Heiderich. Scriptless timing attacks on web browser privacy. In Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014. [100] Kim Luyckx. Scalability issues in authorship attribution. ASP/VUB- PRESS/UPA, 2011. [101] James MacQueen. Some methods for classification and analysis of multivariate observations. In Berkeley symposium on mathematical statistics and probability, 1967. [102] Davide Maltoni, Dario Maio, Anil K. Jain, and Salil Prabhakar. Handbook of Fingerprint Recognition. Springer Science & Business Media, 2nd edition, 2009. [103] Muharram Mansoorizadeh, Taher Rahgooy, Mohammad Aminiyan, and Mahdy Eskandari. Author obfuscation using wordnet and language models-notebook for pan. In CLEF 2016 Evaluation Labs and Workshop, 2016. [104] Dana Mattioli. On orbitz, mac users steered to pricier ho- tels. http://www.wsj.com/articles/SB10001424052702304458604 577488822667325882, 2012. [105] Stephen Mcadams. Perspectives on the contribution of timbre to musical structure. Computer Music Journal, 23(3):85–102, 1999. [106] Andrew WE McDonald, Sadia Afroz, Aylin Caliskan, Ariel Stolerman, and Rachel Greenstadt. Use fewer instances of the letter “i”: Toward writing style anonymization. In Privacy Enhancing Technologies Symposium (PETS), 2012. [107] William Melicher, Mahmood Sharif, Joshua Tan, Lujo Bauer, Mihai Christodor- escu, and Pedro Giovanni Leon. 
(do not) track me sometimes: Users’ contextual preferences for web tracking. In Privacy Enhancing Technologies Symposium (PETS), 2016.

163 Bibliography

[108] Tsvetomila Mihaylova, Georgi Karadjov, Preslav Nakov, Yasen Kiprov, Georgi Georgiev, and Ivan Koychev. Su@pan’2016: Author obfuscation—notebook for pan. In CLEF 2016 Evaluation Labs and Workshop, 2016.

[109] Jakub Mikians, L´aszl´oGyarmati, Vijay Erramilli, and Nikolaos Laoutaris. Detecting price and search discrimination on the internet. In ACM Workshop on Hot Topics in Networks (HotNets), 2012.

[110] Jakub Mikians, L´aszl´oGyarmati, Vijay Erramilli, and Nikolaos Laoutaris. Crowd-assisted search for price discrimination in e-commerce: First results. In ACM Emerging Networking Experiments and Technologies, 2013.

[111] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.

[112] Andre A Moenssens. Fingerprint techniques. Chilton Book Company London, 1971.

[113] Sue B. Moon, Paul Skelly, and Don Towsley. Estimation and removal of clock skew from network delay measurements. In Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), 1999.

[114] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in html5. In Web 2.0 Security & Privacy Conference (W2SP), 2012.

[115] Mihir Nanavati, Nathan Taylor, William Aiello, and Andrew Warfield. Herbert west-deanonymizer. In USENIX Summit on Hot Topics in Security (HotSec), 2011.

[116] Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Emil Stefanov, Eui Chul Richard Shin, and Dawn Song. On the feasibility of internet-scale author identification. In IEEE Symposium on Security and Privacy (S&P), 2012.

[117] Neuroradiology. Inside recaptcha. https://github.com/neuroradiolog y/InsideReCaptcha, 2016.

[118] Nick Nikiforakis, Wouter Joosen, and Benjamin Livshits. Privaricator: Deceiv- ing fingerprinters with little white lies. Technical Report MSR-TR-2014-26, 2014.

[119] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In IEEE Symposium on Security and Privacy (S&P), 2013.

164 Bibliography

[120] Rebekah Overdorf and Rachel Greenstadt. Blogs, twitter feeds, and red- dit comments: Cross-domain authorship attribution. In Privacy Enhancing Technologies Symposium (PETS), 2016.

[121] Alan Page. Unstyle: A tool for circumventing modern techniques of authorship attribution. Technical report, Allegheny College, 2015.

[122] Andriy Panchenko, Fabian Lanze, Andreas Zinnen, Martin Henze, Jan Pen- nekamp, Klaus Wehrle, and Thomas Engel. Website fingerprinting at internet scale. In Network and Distributed System Security Symposium (NDSS), 2016.

[123] Tae Hong Park. Salient feature extraction of musical instrument signals. PhD thesis, Dartmouth College Hanover, New Hampshire, 2000.

[124] Vern Paxson. An analysis of using reflectors for distributed denial-of-service attacks. ACM SIGCOMM Computer Communication Review, 31(3):38–47, 2001.

[125] Mathias Payer, Ling Huang, Neil Zhenqiang Gong, Kevin Borgolte, and Mario Frank. What you submit is who you are: A multimodal approach for deanonymizing scientific publications. IEEE Transactions on Information Forensics and Security, 10(1):200–212, 2015.

[126] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12:2825– 2830, 2011.

[127] Geoffroy Peeters, Bruno L. Giordano, Patrick Susini, Nicolas Misdariis, and Stephen McAdams. The timbre toolbox: Extracting audio descriptors from musical signals. The Journal of the Acoustical Society of America, 130(5):2902– 2916, 2011.

[128] Mauro Perino. Turbospinner. http://turbospinner.com, 2016.

[129] Martin Potthast, Matthias Hagen, and Benno Stein. Author obfuscation: attacking the state of the art in authorship verification. CLEF 2016 Evaluation Labs and Workshop, 2016.

[130] Chris Quirk, Chris Brockett, and William B Dolan. Monolingual machine translation for paraphrase generation. In Conference on Empirical Methods in Natural Language Processing, 2004.


[131] Robert Ramotowski. Lee and Gaensslen’s advances in fingerprint technology. CRC Press, 2012.

[132] Josyula R Rao, Pankaj Rohatgi, et al. Can pseudonymity really guarantee privacy? In USENIX Security Symposium, 2000.

[133] Gerardo Reynaga and Sonia Chiasson. The usability of captchas on smartphones. In Security and Cryptography (SECRYPT), 2013.

[134] Gerardo Reynaga, Sonia Chiasson, and Paul C. van Oorschot. Exploring the usability of captchas on smartphones: Comparisons and recommendations. In NDSS Workshop on Usable Security (USEC), 2015.

[135] Joseph Rudman. The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities, 31(4):351–365, 1997.

[136] S. Sanei and J.A. Chambers. EEG Signal Processing. Wiley, 2013.

[137] S.C. Satapathy, S.K. Udgata, and B.N. Biswal. Proceedings of the International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA). Advances in Intelligent Systems and Computing. Springer, 2013.

[138] SeleniumHQ. Selenium WebDriver. http://www.seleniumhq.org/projects/webdriver/.

[139] RJ Senter and Edgar A Smith. Automated readability index. Technical report, DTIC Document, 1967.

[140] Benjamin Reed Shiller. First degree price discrimination using big data. The Federal Trade Commission, 2014.

[141] Steven Sinofsky. Supporting sensors in Windows 8. http://blogs.msdn.com/b/b8/archive/2012/01/24/supporting-sensors-in-windows-8.aspx, 2016.

[142] Steven W. Smith. Digital signal processing: a practical guide for engineers and scientists. Newnes, 2003.

[143] Jan Spooren, Davy Preuveneers, and Wouter Joosen. Mobile device fingerprinting considered harmful for risk-based authentication. In European Workshop on System Security. ACM, 2015.

[144] Paul Stone. Pixel perfect timing attacks with HTML5. Context Information Security (White Paper), 2013.


[145] Sebastian Uellenbeck, Thomas Hupperich, Christopher Wolf, and Thorsten Holz. Tactile one-time pad: Leakage-resilient authentication for smartphones. In International Conference on Financial Cryptography and Data Security (FC). Springer, 2015.

[146] Fabio Valente and Christian Wellekens. Variational bayesian methods for audio indexing. In International Workshop on Machine Learning for Multimodal Interaction (MLMI). Springer, 2006.

[147] Jennifer Valentino-Devries, Jeremy Singer-Vine, and Ashkan Soltani. Websites vary prices, deals based on users’ information. Wall Street Journal, 10:60–68, 2012.

[148] Stefan Van Der Walt, Chris Colbert, and Gael Varoquaux. The numpy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.

[149] Hal R. Varian. Price discrimination. Handbook of industrial organization, 1:597–654, 1989.

[150] Thomas Vissers, Nick Nikiforakis, Nataliia Bielova, and Wouter Joosen. Crying wolf? on the price discrimination of online airline tickets. In Workshop on Hot Topics in Privacy Enhancing Technologies (HotPETs), 2014.

[151] Luis Von Ahn, Manuel Blum, Nicholas J Hopper, and John Langford. CAPTCHA: Using hard AI problems for security. In International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 2003.

[152] W3C. WebDriver. https://www.w3.org/TR/webdriver/, 2016.

[153] Cliff Wang, Ryan M Gerdes, Yong Guan, and Sneha Kumar Kasera. Digital fingerprinting. Springer, 2016.

[154] J. Wang, G.G. Yen, and M.M. Polycarpou. Advances in neural networks. In International Symposium on Neural Networks (ISNN), 2012.

[155] Tao Wang and Ian Goldberg. On realistically attacking tor with website fingerprinting. In Privacy Enhancing Technologies Symposium (PETS), 2016.

[156] Wavmg Wickramasinghe. De-Anonymization of Anonymous E-mails in digital forensic. PhD thesis, 2016.

[157] Wei Xu, Alan Ritter, William B Dolan, Ralph Grishman, and Colin Cherry. Paraphrasing for style. In International Conference on Computational Linguistics (COLING), 2012.


[158] Yi Xu, Gerardo Reynaga, Sonia Chiasson, J-M Frahm, Fabian Monrose, and Paul Van Oorschot. Security and usability challenges of moving-object captchas: decoding codewords in motion. In USENIX Security Symposium, 2012.

[159] Jeff Yan and Ahmad Salah El Ahmad. Usability of captchas or usability issues in captcha design. In Symposium on Usable Privacy and Security (SOUPS), 2008.

[160] Yi-Hsuan Yang and Homer H. Chen. Music emotion recognition. Multimedia Computing, Communication and Intelligence. CRC Press, 2011.

[161] Ting-Fang Yen, Yinglian Xie, Fang Yu, Roger Peng Yu, and Martin Abadi. Host fingerprinting and tracking on the web: Privacy and security implications. In Network and Distributed System Security Symposium (NDSS), 2012.

[162] Marvin Zelkowitz. Advances in Computers: Improving the Web, volume 78. Academic Press, 2010.

[163] Qing Zhang, David Y Wang, and Geoffrey M Voelker. DSpin: Detecting automatically spun content on the web. In Network and Distributed System Security Symposium (NDSS), 2014.

APPENDIX A

SYSTEM FINGERPRINTS AS INFLUENCE ON ONLINE PRICING POLICIES

A.1. Median Hotel Prices

Table A.1.: Median hotel prices per fingerprint, provider and country

FP   Hotels: Fr De Ro   HRS: Fr De Ro USA   Orbitz: Fr De Ro USA
1    74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
3    74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
5    74 74 74   70.83 70.73 70.83 70.2   63.25 63.25 64.2 64.2
21   74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
23   74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
25   74 74 74   70.4 70.3 70.4 70.41   62.93 62.93 62.93 62.93
27   74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
29   74 74 74   70 69.9 70 70   63.25 63.25 64.2 64.2
31   74 74 74   70.4 70.3 70.4 70.41   62.93 62.93 62.93 62.93
33   74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
35   74 74 74   70 69.9 70 72.9   63.25 63.25 64.2 64.2
37   74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
39   74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
41   74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
43   74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
45   74 74 74   70 69.9 69.6 70.2   63.24 63.24 64.19 64.19
47   74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
49   74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
51   74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
53   74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
55   74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93


(Table A.1 continued)  FP   Hotels: Fr De Ro   HRS: Fr De Ro USA   Orbitz: Fr De Ro USA
57   74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
59   74 74 74   70.83 69.89 70.83 70.25   63.25 63.25 64.2 64.2
61   74 74 74   70.98 70.89 70.98 70.6   62.93 62.93 62.93 62.93
63   74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
65   74 74 74   70 69.9 70 70   63.25 63.25 64.2 64.2
67   74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
69   74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
71   74 74 74   70.83 69.89 70.83 70.76   63.25 63.25 64.2 64.2
73   74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
75   74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
77   74 74 74   70.34 70.19 70.4 70.41   63.25 63.25 64.2 64.2
79   74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
81   74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
83   74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
85   74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
87   74 74 74   70.98 70.89 70.98 71.03   63.24 63.24 64.19 64.19
89   74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
91   74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
93   74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
95   74 74 74   72.81 69.9 70 70.2   63.25 63.25 64.2 64.2
97   74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
99   74 74 74   70.4 70.24 70.4 70.65   63.24 63.24 64.19 64.19
101  74 74 74   70.83 69.89 70.83 70.76   63.25 63.25 64.2 64.2
103  74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
105  74 79.5 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
107  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
109  74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
111  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
115  74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
117  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
119  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
123  74 74 74   70.4 70.24 70.4 70.65   63.24 63.24 64.19 64.19
125  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
127  74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
129  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
131  74 74 74   70.34 70.19 70.4 70.41   63.25 63.25 64.2 64.2
133  74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
135  74 74 74   70.98 70.89 70.98 71.03   63.24 63.24 64.19 64.19
137  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
139  74 74 74   70 69.9 70 70.2   62.93 62.93 62.93 62.93
143  74 74 74   70.34 70.19 70.4 70.2   63.25 63.25 64.2 64.2
145  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
149  74 74 74   70.34 70.19 70.4 70.41   63.25 63.25 64.2 64.2
151  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
153  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
155  74 74 74   70.34 70.19 70.4 70.41   63.25 63.25 64.2 64.2


(Table A.1 continued)  FP   Hotels: Fr De Ro   HRS: Fr De Ro USA   Orbitz: Fr De Ro USA
157  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
159  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
161  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
163  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
165  74 74 74   70.4 70.24 70.4 70.65   63.24 63.24 64.19 64.19
167  74 74 74   70.34 70.19 70.4 70.41   63.25 63.25 64.2 64.2
169  74 79.5 74   70.53 70.3 70.4 70.41   62.93 62.93 63.87 63.87
171  80 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
173  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
175  74 74 74   70.53 70.3 70.4 70.41   62.93 62.93 63.87 63.87
177  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
179  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
181  74 74 74   70.53 70.3 70.4 70.41   62.93 62.93 63.87 63.87
183  74 79.5 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
185  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
189  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
191  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
193  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
195  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
197  74 74 74   70.83 69.89 70.83 70.76   63.25 63.25 64.2 64.2
199  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
201  74 74 74   70.98 70.89 70.98 71.03   63.24 63.24 64.19 64.19
203  74 74 74   70.83 69.89 70.83 70.76   63.25 63.25 64.2 64.2
207  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
209  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
211  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
213  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
215  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
217  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
219  74 74 74   70.98 70.89 70.98 71.03   63.24 63.24 64.19 64.19
225  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
227  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
229  74 74 74   70.53 70.3 70.4 70.41   62.93 62.93 63.87 63.87
231  74 74 74   70.4 70.24 70.4 70.65   63.24 63.24 64.19 64.19
233  74 74 74   70.34 70.19 70.4 70.41   63.25 63.25 64.2 64.2
235  74 74 74   70.53 70.3 70.4 70.41   62.93 62.93 63.87 63.87
237  74 74 74   70.4 70.24 70.4 70.65   63.24 63.24 64.19 64.19
239  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
241  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
243  74 74 74   70.4 70.24 70.4 70.65   63.24 63.24 64.19 64.19
245  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
247  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
249  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
251  74 74 74   70.34 70.53 70.4 70.41   63.25 63.25 64.2 64.2
253  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
255  74 74 74   69.27 69.23 69.33 69.52   63.24 63.24 64.19 64.19


(Table A.1 continued)  FP   Hotels: Fr De Ro   HRS: Fr De Ro USA   Orbitz: Fr De Ro USA
257  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
259  74 74 74   70.53 70.3 70.4 70.41   62.93 62.93 63.87 63.87
261  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
263  74 74 74   70.34 70.19 70.4 70.41   63.25 63.25 64.2 64.2
265  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
267  74 74 74   70.4 70.24 70.4 70.65   63.24 63.24 64.19 64.19
269  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
271  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
273  74 74 74   70 69.9 70 70.2   63.24 63.24 64.19 64.19
275  74 74 74   70.83 70.07 70.83 70.76   63.25 63.25 64.2 64.2
277  74 74 74   70.16 70.89 70.98 70.6   62.93 62.93 63.87 63.87
279  74 74 74   70.98 70.89 70.98 71.03   63.24 63.24 64.19 64.19
281  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
283  74 74 74   70.53 70.3 70.4 70.41   62.93 62.93 63.87 63.87
293  74 74 74   70 69.9 70 70.2   63.25 63.25 64.2 64.2
295  74 74 74   70 69.9 70 70.2   62.93 62.93 63.87 63.87
297  74 74 74   70.4 70.24 70.4 70.65   63.24 63.24 64.19 64.19
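The medians in Table A.1 are aggregates over repeated price queries per fingerprint, provider, and country. A minimal sketch of such an aggregation, assuming raw observations arrive as `(fingerprint, provider, country, price)` tuples (the function name `median_prices` is illustrative, not taken from the thesis):

```python
from collections import defaultdict
from statistics import median

def median_prices(observations):
    """Group raw price observations by (fingerprint, provider, country)
    and reduce each group to its median price."""
    groups = defaultdict(list)
    for fingerprint, provider, country, price in observations:
        groups[(fingerprint, provider, country)].append(price)
    return {key: median(prices) for key, prices in groups.items()}

# Three hypothetical observations for fingerprint 1 on HRS in Germany:
obs = [(1, "HRS", "De", 69.8), (1, "HRS", "De", 69.9), (1, "HRS", "De", 70.0)]
print(median_prices(obs)[(1, "HRS", "De")])  # 69.9
```

Using the median rather than the mean makes a cell robust against single outlier quotes, which matters when a provider occasionally returns a promotional or erroneous price.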

APPENDIX B

IMPEDING AUTHORSHIP ATTRIBUTION VIA STYLOMETRY OBFUSCATION

B.1. Readability Measures Interpretations

Table B.1.: Interpretation of the Flesch-Reading-Ease

Score        School Level        Interpretation
90.0–100.0   5th grade           Very easy to read. Easily understood by an average 11-year-old student.
80.0–90.0    6th grade           Easy to read. Conversational English for consumers.
70.0–80.0    7th grade           Fairly easy to read.
60.0–70.0    8th & 9th grade     Plain English. Easily understood by 13- to 15-year-old students.
50.0–60.0    10th to 12th grade  Fairly difficult to read.
30.0–50.0    college             Difficult to read.
0.0–30.0     college graduate    Very difficult to read. Best understood by university graduates.


Table B.2.: Interpretation of the Automated Readability Index

Score  Age    Grade Level
1      5-6    Kindergarten
2      6-7    First Grade
3      7-8    Second Grade
4      8-9    Third Grade
5      9-10   Fourth Grade
6      10-11  Fifth Grade
7      11-12  Sixth Grade
8      12-13  Seventh Grade
9      13-14  Eighth Grade
10     14-15  Ninth Grade
11     15-16  Tenth Grade
12     16-17  Eleventh Grade
13     17-18  Twelfth Grade
14     18-22  College
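Both scores interpreted above are computed from simple surface statistics of a text. A sketch of the two standard formulas (Flesch Reading Ease: 206.835 − 1.015·(words/sentences) − 84.6·(syllables/words); Automated Readability Index [139]: 4.71·(characters/words) + 0.5·(words/sentences) − 21.43), using a naive vowel-group heuristic for syllable counting:

```python
import re

def _syllables(word):
    # Naive heuristic: count runs of consecutive vowels (y included).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease; higher scores mean easier text (Table B.1)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

def ari(text):
    """Automated Readability Index; the score maps to a grade level (Table B.2).
    Characters are letters and digits only."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    chars = sum(len(re.findall(r"[A-Za-z0-9]", w)) for w in words)
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / sentences) - 21.43
```

For "The quick brown fox jumps over the lazy dog." this sketch yields an ARI of about 1.4 (early grade school) and a Flesch score above 90 (very easy), consistent with the interpretations above.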


B.2. Authorship Attribution Precision Matrices

Table B.3.: Precision per number of authors and texts for original review texts in percent

                         Number of texts per author
Authors  10     20     30     40     50     75     100    250    500    750    1000
3        79.75  94.45  98.17  98.10  95.20  99.24  97.40  97.37  98.62  99.71  99.60
4        88.11  83.42  95.56  95.38  94.75  96.40  96.08  98.18  99.43  99.07  99.55
5        88.56  79.37  92.27  97.00  96.38  96.72  97.94  98.19  99.44  99.36  99.40
6        71.74  80.26  93.18  91.51  92.08  97.14  97.12  97.71  97.80  98.17  98.39
7        34.75  94.22  84.10  89.24  93.31  94.18  96.44  98.02  98.82  98.56  98.12
8        65.48  86.91  86.82  86.24  90.24  95.02  92.84  97.37  97.48  97.89  97.70
9        30.42  73.82  85.75  81.70  87.99  93.15  94.82  96.71  97.67  97.05  97.88
10       70.56  89.60  78.56  88.20  88.84  87.42  92.60  96.04  97.03  97.66  97.66
11       21.92  85.88  81.53  84.30  87.33  85.50  92.26  93.99  96.35  97.24  97.63
12       81.53  81.61  83.68  88.48  94.75  93.03  94.91  95.98  97.33  97.48  97.99
13       72.68  75.49  91.09  86.16  90.02  89.76  90.02  93.65  96.38  95.96  97.18
14       68.61  73.24  89.70  85.02  87.79  89.51  92.85  94.40  96.51  96.91  96.61
15       47.95  75.92  84.97  89.39  85.43  87.48  89.51  92.65  96.32  96.13  96.55
16       51.04  70.45  81.50  88.68  85.62  88.21  90.84  92.04  95.52  96.07  96.50
17       76.25  69.41  80.06  80.29  88.31  84.59  87.65  93.59  94.43  95.84  95.99
18       56.44  70.14  82.05  84.22  84.26  85.21  89.87  92.61  93.17  94.00  95.49
19       63.04  67.63  78.19  82.24  77.27  83.39  86.56  93.51  94.66  94.81  95.78
20       84.43  79.85  88.44  83.39  80.96  87.60  88.80  90.77  94.19  94.84  94.99
21       53.34  80.43  71.13  84.24  75.44  85.56  85.45  90.32  93.93  94.59  94.85
22       54.79  67.15  80.13  86.03  84.72  86.85  87.93  90.49  93.10  94.79  94.65
23       32.45  60.95  87.40  82.00  84.70  85.92  87.44  90.46  93.38  94.74  94.97
24       54.88  77.14  79.44  74.36  81.40  83.80  87.00  90.30  93.65  93.85  94.75
25       61.79  68.80  72.72  83.22  80.47  84.41  86.59  89.09  92.07  93.29  94.35
26       54.31  81.20  68.36  74.78  77.02  83.10  83.97  88.96  93.08  93.61  93.58
27       24.23  62.00  77.67  78.75  80.27  83.06  86.93  90.11  92.23  93.26  93.10
28       60.40  63.57  74.19  78.36  80.79  84.26  84.20  88.03  92.09  92.99  93.61
29       62.69  74.74  76.18  81.53  81.05  85.10  89.36  90.32  92.44  94.24  93.67
30       64.05  68.29  76.94  79.96  81.96  88.29  87.30  89.39  92.03  93.60  93.64


Table B.4.: Precision per number of authors and texts for reviews obfuscated by text transformation in percent

                         Number of texts per author
Authors  10     20     30     40     50     75     100    250    500    750    1000
3        79.37  75.96  97.82  93.94  80.28  97.36  94.40  96.41  97.93  99.18  98.74
4        78.39  70.35  96.38  91.22  92.87  93.37  97.89  98.51  98.53  98.80  98.95
5        93.23  84.36  86.87  90.98  96.04  93.23  95.11  96.50  97.85  98.86  98.73
6        74.52  88.27  92.52  83.65  88.47  86.29  94.42  94.96  96.63  97.05  98.02
7        79.90  77.72  84.33  89.91  87.99  92.60  93.46  95.71  98.04  97.54  98.21
8        44.19  73.36  87.68  81.15  87.91  90.56  91.02  96.25  95.97  96.89  96.46
9        70.26  81.97  89.41  78.75  82.36  88.61  92.41  94.31  96.67  96.77  96.53
10       58.08  77.25  80.77  83.38  86.50  87.24  91.61  92.18  94.65  96.23  95.73
11       63.91  71.72  82.28  89.39  82.41  81.19  85.70  92.74  94.01  96.00  95.82
12       77.93  78.61  84.84  88.28  85.40  90.71  88.86  93.10  95.58  95.00  96.02
13       48.47  78.69  81.99  75.18  86.39  85.73  87.62  91.37  93.03  95.03  95.09
14       84.46  61.14  80.95  78.38  82.91  87.74  85.02  91.27  94.04  94.31  94.43
15       51.03  65.41  72.55  81.56  82.14  84.93  84.81  91.01  92.28  92.86  94.58
16       48.91  66.91  74.28  72.66  82.53  85.68  84.70  90.13  93.42  93.65  94.35
17       55.68  71.92  85.71  79.82  81.85  84.75  88.51  89.07  91.71  92.48  94.05
18       52.59  72.90  75.35  76.76  73.37  78.75  84.66  89.18  90.42  92.36  93.00
19       78.12  75.58  74.99  71.84  82.51  84.04  87.09  88.66  91.67  91.92  93.78
20       82.09  68.16  67.42  84.19  83.58  79.42  85.94  88.52  93.11  92.94  92.91
21       68.12  68.10  68.11  82.33  81.44  77.16  84.47  88.23  92.99  92.84  92.88
22       66.85  68.01  69.33  78.96  76.63  75.55  83.97  88.02  92.33  82.22  92.76
23       62.31  67.77  70.85  77.50  73.87  74.77  83.55  87.93  91.49  91.70  92.68
24       49.83  55.59  65.81  67.43  75.58  80.99  82.84  89.98  91.44  91.86  92.94
25       55.15  69.14  74.35  68.61  77.29  76.85  85.45  89.45  91.50  92.70  93.56
26       54.70  63.42  71.38  68.06  74.91  83.03  82.99  86.91  90.78  91.86  91.80
27       55.32  63.85  78.34  75.95  81.17  76.97  81.72  85.69  89.96  91.27  91.06
28       44.48  63.23  54.05  70.84  76.45  79.37  81.18  87.99  89.68  90.93  90.71
29       56.69  82.71  61.80  75.25  77.69  80.12  83.58  86.68  89.03  90.12  91.23
30       49.08  70.06  68.63  74.23  75.66  76.26  77.82  87.27  89.68  90.73  91.52


Table B.5.: Precision per number of authors and texts for reviews obfuscated by WordNet-based synonymization in percent

                         Number of texts per author
Authors  10     20     30     40     50     75     100    250    500    750    1000
3        85.64  94.37  91.93  83.68  93.22  98.69  97.72  97.73  99.05  99.26  99.52
4        88.61  97.59  91.62  95.55  96.93  96.45  99.02  99.06  99.44  99.36  99.43
5        94.14  93.33  96.04  96.75  97.93  95.15  97.14  98.45  99.52  99.47  99.42
6        80.42  80.39  93.27  95.11  94.58  94.92  95.49  97.12  97.49  98.76  98.37
7        97.25  89.99  93.11  94.72  90.43  90.91  95.27  97.27  98.15  98.07  98.38
8        73.59  62.85  86.64  84.68  94.76  95.41  95.46  95.17  97.08  97.86  97.53
9        59.81  83.29  84.26  82.57  84.64  88.16  93.57  95.43  97.98  97.66  97.88
10       39.70  76.99  85.28  86.84  80.78  90.44  94.12  95.85  96.62  97.06  97.78
11       57.47  81.59  70.77  86.92  86.70  89.59  92.67  95.54  96.65  97.54  97.44
12       87.65  75.83  87.68  90.59  88.84  92.40  91.88  95.46  96.73  98.07  97.44
13       83.10  77.09  82.55  88.50  86.03  92.52  92.35  93.08  96.52  97.12  96.55
14       76.69  88.17  87.88  82.15  85.40  89.15  89.54  93.25  95.87  96.17  96.50
15       53.09  77.72  77.76  83.26  87.04  88.67  88.50  94.73  95.24  96.60  96.09
16       71.65  79.71  88.70  75.18  83.45  88.61  91.25  94.46  95.17  95.74  96.55
17       74.21  79.42  79.68  81.11  83.53  87.64  89.10  92.98  94.59  96.03  95.40
18       50.61  83.13  72.13  76.84  83.22  86.84  88.57  91.81  93.87  95.05  95.92
19       77.55  76.81  80.53  76.26  82.57  80.89  87.89  90.55  95.94  95.15  95.54
20       75.32  74.87  76.01  74.35  82.44  84.66  87.99  91.68  94.61  95.12  95.43
21       72.83  73.57  74.39  74.84  82.67  86.03  88.64  91.89  94.43  95.74  95.32
22       71.61  72.81  68.14  80.02  84.48  88.48  89.95  92.74  94.20  95.80  95.21
23       61.59  76.77  74.24  74.30  82.17  86.08  84.83  92.70  93.55  94.91  94.77
24       75.22  78.21  75.38  72.75  85.92  86.99  89.51  92.55  93.59  94.78  95.53
25       38.18  56.53  80.00  80.08  79.37  86.63  88.70  90.74  92.91  94.47  94.88
26       50.01  68.76  77.19  81.87  79.60  83.77  89.14  89.75  93.41  94.49  95.18
27       53.80  65.99  79.32  77.56  77.59  84.93  85.95  91.58  93.21  93.91  93.85
28       49.72  71.58  84.07  77.40  79.30  83.52  82.76  88.69  92.26  93.86  93.87
29       45.34  69.84  75.11  80.47  79.35  84.82  87.94  91.31  92.58  92.80  92.91
30       41.11  88.23  78.19  74.96  81.85  86.84  86.72  89.48  92.08  93.80  93.94


Table B.6.: Precision per number of authors and texts for reviews obfuscated by text spinning in percent

                         Number of texts per author
Authors  10     20     30     40     50     75     100    250    500    750    1000
3        13.39  84.70  65.62  92.64  96.65  95.24  96.53  98.19  98.23  98.33  98.71
4        69.91  76.31  88.48  85.82  93.80  95.41  96.25  98.07  98.26  98.62  98.80
5        48.69  78.25  76.79  90.20  96.30  93.93  94.50  97.18  99.11  98.68  98.61
6        86.75  89.80  80.56  87.66  92.34  94.05  91.99  96.67  95.55  97.21  97.12
7        58.53  78.31  75.18  88.68  91.08  95.00  91.88  95.80  97.63  98.00  97.57
8        81.07  61.46  88.95  85.27  88.34  94.12  83.82  94.46  94.34  96.28  97.25
9        56.98  57.69  74.38  77.67  72.36  84.43  86.20  91.66  94.46  95.15  95.33
10       72.75  60.73  85.72  75.12  83.99  81.43  86.57  92.59  95.42  95.40  95.54
11       50.49  74.22  70.82  82.36  79.24  84.03  81.43  90.07  92.86  94.52  95.76
12       59.49  81.40  83.33  80.54  73.61  87.08  91.94  92.93  94.09  95.65  95.66
13       75.27  78.22  79.38  84.08  82.47  87.81  86.35  91.96  93.19  94.30  94.01
14       63.71  90.03  69.40  72.84  85.02  82.32  90.08  91.20  92.68  93.58  93.54
15       57.84  84.71  83.73  77.80  77.96  84.61  87.81  88.59  91.75  92.87  93.65
16       48.61  72.07  75.43  81.87  77.60  84.83  86.07  89.25  92.34  92.75  94.05
17       67.39  79.62  81.86  76.64  75.08  85.07  86.06  90.24  92.20  91.66  91.92
18       39.63  76.39  73.23  71.22  82.29  84.64  82.52  89.86  90.33  91.70  91.52
19       69.94  65.68  76.09  81.20  75.83  80.00  75.67  89.75  89.87  91.88  91.84
20       65.47  70.34  66.42  73.70  66.89  82.18  79.29  85.41  91.35  90.22  91.59
21       63.22  67.54  67.67  72.64  70.32  80.43  80.12  85.48  91.04  90.34  91.61
22       58.46  65.61  70.11  71.75  71.44  78.46  80.25  85.68  90.87  90.36  91.63
23       50.16  63.13  73.26  70.99  74.30  77.49  81.97  87.21  90.12  90.33  91.68
24       42.01  58.40  78.75  70.10  68.84  84.27  79.65  87.11  89.69  89.56  91.58
25       59.52  53.83  70.01  75.13  77.66  77.86  77.11  85.82  90.66  90.40  91.45
26       41.50  72.10  66.67  69.86  69.46  74.01  76.33  86.02  88.65  90.37  91.14
27       48.42  58.30  66.86  72.41  68.52  80.71  74.70  85.27  88.63  89.22  89.94
28       44.32  63.57  73.15  68.29  61.76  73.52  77.80  86.65  87.49  88.30  89.11
29       54.13  56.83  69.63  65.85  67.97  82.51  79.27  84.07  84.95  84.73  86.08
30       55.34  56.40  67.60  68.21  74.32  79.40  72.12  84.90  88.11  87.83  89.66
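Each cell in Tables B.3–B.6 is a precision percentage of an authorship classifier for a given number of authors and texts per author. As a minimal sketch of how a single such value can be computed from attribution results (assuming macro averaging over authors; the exact averaging convention used in the thesis is not restated here):

```python
def macro_precision(y_true, y_pred):
    """Macro-averaged precision in percent: for each author, the fraction of
    texts attributed to that author that truly belong to them, averaged over
    all authors."""
    authors = sorted(set(y_true) | set(y_pred))
    per_author = []
    for author in authors:
        true_of_predicted = [t for t, p in zip(y_true, y_pred) if p == author]
        per_author.append(
            sum(t == author for t in true_of_predicted) / len(true_of_predicted)
            if true_of_predicted else 0.0  # author never predicted
        )
    return 100 * sum(per_author) / len(per_author)

# Toy example: 3 authors, 6 test texts, one misattribution.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "C"]
print(round(macro_precision(y_true, y_pred), 2))  # 88.89
```

Repeating this computation over a grid of author counts and texts-per-author settings yields a precision matrix of the shape shown above.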

Thomas Hupperich

Personal Information
Name: Thomas Hupperich

Education
2007–2011   IT Security, Networks & Systems, Ruhr-Universität Bochum, M.Sc.
2004–2007   Business Information Systems, FHDW, Bergisch Gladbach, Dipl.-Wirt.-Inf. (FH)

Professional Experience
10.2011–03.2017   Research Assistant, Ruhr-Universität Bochum, Faculty of Electrical Engineering and Information Technology, Chair for Systems Security
04.2011–09.2011   Research Assistant, Ruhr-Universität Bochum, Faculty of Electrical Engineering and Information Technology, Trusted Computing Research Group
10.2009–03.2011   Student Assistant, Ruhr-Universität Bochum, Institute for Security in E-Business