UPTEC F 18049, Degree project 30 credits (Examensarbete 30 hp), 15 August 2018

Using XGBoost to classify the Beihang Keystroke Dynamics Database

Johanna Blomqvist

Abstract

Keystroke Dynamics enable biometric security systems by collecting and analyzing computer keyboard usage data. There are different approaches to classifying keystroke data, and a method that has been gaining a lot of attention in the machine learning industry lately is the decision tree framework XGBoost. XGBoost has won several Kaggle competitions in the last couple of years, but its capacity in the keystroke dynamics field has not yet been widely explored. Therefore, this thesis has attempted to classify the existing Beihang Keystroke Dynamics Database using XGBoost. To do this, keystroke features such as dwell time and flight time were extracted from the dataset, which contains 47 usernames and passwords. XGBoost was then applied to a binary classification problem, where the model attempts to distinguish keystroke feature sequences of genuine users from those of 'impostors'. In this way, the ratio of inaccurately and accurately labeled password inputs can be analyzed.

The results showed that, after tuning of the hyperparameters, XGBoost yielded Equal Error Rates (EER) at best 0.31 percentage points better than the SVM used in the original study of the database at 11.52%, and a highest AUC of 0.9792. The scores achieved in this thesis are, however, significantly worse than many others in the same field, but so were the results in the original study. The results varied greatly depending on the user tested. These results suggest that XGBoost may be a useful tool, which should be tuned, but that a better dataset should be used to sufficiently benchmark the tool. Also, the quality of the model is greatly affected by variance among the users. For future research purposes, one should make sure that the database used is of good quality. To create a security system utilizing XGBoost, one should be careful about the setting and quality requirements when collecting training data.

Handledare (Supervisors): David Strömberg & Daniel Lindberg
Ämnesgranskare (Subject reader): Michael Ashcroft
Examinator (Examiner): Tomas Nyberg
ISSN: 1401-5757, UPTEC F 18049

Populärvetenskaplig sammanfattning (Popular Science Summary)

Today, the majority of companies and private individuals use computers and databases to protect assets and information. It is therefore more important than ever to have secure systems that correctly verify that the right people gain access to these systems. We are used to using, for example, physical keys and passwords, but so-called biometric solutions are becoming increasingly interesting. They rely on biological markers, such as fingerprints, being unique to each individual. One step further is behavioral biometrics, the idea that we also have unique behaviors, such as writing style and movement patterns. This study has looked at one such area, Keystroke Dynamics, which builds on the fact that we all type with a different rhythm on a keyboard when using a computer. The idea is that, to get into a system, one should not only need access to the correct password, but also need to type it in the way that belongs to the login. To create such a system, machine learning can be used. A model is fed examples of how people type, and the model is then supposed to learn to recognize what distinguishes them. There are many different theories that can be used for this, and this study has used the relatively new and critically acclaimed XGBoost. XGBoost is a tool based on decision trees, where data is categorized by passing through a 'tree' of relevant questions. The dataset used in this project is the open 'Beihang Keystroke Dynamics Database'. The study showed, somewhat disappointingly, that XGBoost was roughly as good as other machine learning models on the same dataset. The conclusion drawn was that this is probably because the dataset was too small. In the future, research should look more at XGBoost and its potential for Keystroke Dynamics, and should focus on creating a large dataset that can be used in all research.

The principle of Keystroke Dynamics was used as early as the Second World War, when telegraph operators began to recognize each other by the rhythm tapped out on the telegraph when Morse code was sent. When computers made their entrance, attempts were made to apply this principle to keyboards, and in 1985 David Umphress and Glen Williams showed that 'keyboard profiles' are unique. Several studies on the subject have followed since then, but a general problem in the field is that there is no recognized dataset that can be used for research (and thus make comparisons easy). One reason for this is that there are many variants of systems. Data collection can be done in a lab, or over the internet in the participants' homes. The text they enter as samples can be long or short. They may be allowed to choose the text themselves, or everyone types the same text. Language can of course also make a difference. Collecting data also takes time, and precisely because of time constraints, this study chose to use an existing dataset. The Beihang Keystroke Dynamics Database, Dataset A, consists of 47 participants who, 4-5 times each, typed a self-chosen password tied to a unique username on keyboards in an internet café. The participants were also given access to other participants' passwords and provided samples of how they type those, to imitate an 'attack'. This dataset was chosen because it was easily accessible, and because it was considered interesting to study free-text databases in a commercial setting, precisely because that reflects reality best.

XGBoost was considered interesting to investigate because it has not yet been used in the Keystroke Dynamics field, and because it has achieved remarkably good results in other contexts and won industry competitions. Once the data is available, it is reduced to a number of so-called features. The features chosen in this study are dwell time (how long a key is held down) and four variants of flight time (how much time passes between two keystrokes). By taking the mean of these times for one password entry, feature sequences of five values are created for each entry, regardless of password length. The users are then divided into two groups, one whose data is used to train the model and one used to test it. This is so that the model is not tested on data it has already seen.

The XGBoost model trains by looking at the feature sequences and trying to set up the right rules for deciding whether an entry belongs to the genuine user or whether it is an attack (someone who has obtained the correct username and password). It does this by looking at the difference between a pair of feature sequences, and tries to determine what distinguishes one user's rhythm from another's, that is, what differences are required for two entries to be categorized as different. During testing, these learned rules are used in a comparison between one entry (which is either an 'attack' or 'genuine') and an entry that we know belongs to the genuine user, and the model decides whether the difference means that the entry is genuine or an attack. The problem thus becomes a so-called binary classification: the model only needs to learn to say 'genuine' or 'impostor' when it receives a pair of feature sequences.

During testing, some statistics were produced. A measure called the 'false acceptance rate' had a mean of 19.75%, and the 'false rejection rate' had a mean of 19.30%. The 'equal error rate' (EER), where the false acceptance and false rejection rates are set equal, landed at 19.75%. These figures are high for a security system, if one compares, for example, with fingerprint scanning, which achieves 0.02%, and with other studies within Keystroke Dynamics. However, it is important to remember that step 1 in a system like this is to have access to the password (not just access to a finger). Even though the results were worse than other studies, they were only slightly worse than the original study on the same dataset, which had an EER of 11.83%. This led to the conclusion that the main problem was not XGBoost, but the dataset itself.

These results thus indicate that what matters most for the security of a biometric system is the dataset. The dataset must be large enough for a machine learning model to train on a sufficient number of varied sequences to learn a sufficiently general set of rules; 47 users proved to be too few. I believe that Keystroke Dynamics can become a good option to use in security contexts, preferably together with other systems, such as passwords or tokens. I also believe that the principle has great potential for use in smartphones, which already have several built-in sensors. XGBoost should definitely continue to be investigated in the future, and research should focus on creating a large dataset that can be used for benchmarking.

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Background
    1.2.1 Biometrics and Keystroke Dynamics
    1.2.2 Classification methods and XGBoost
  1.3 Scope
  1.4 Thesis Outline

2 Theory
  2.1 Authentication systems
    2.1.1 Authentication, Verification, and Identification
    2.1.2 Static vs Continuous
    2.1.3 Systems overview
  2.2 Keystroke Dynamics
    2.2.1 Keystroke Dynamics Data
    2.2.2 Features
  2.3 Supervised Learning
    2.3.1 Data Leakage and Cross-Validation
    2.3.2 Classification
    2.3.3 Training
  2.4 Decision Trees and XGBoost
    2.4.1 Decision Trees
    2.4.2 XGBoost
    2.4.3 XGBoost Hyperparameters
  2.5 Testing
    2.5.1 Feature Importance
    2.5.2 FPR and FNR (false positive/negative rate)
    2.5.3 TPR and TNR (true positive/negative rate)
    2.5.4 ROC (receiver operating characteristic)
    2.5.5 AUC (area under curve)
    2.5.6 EER (equal error rate)

3 Implementation
  3.1 Software
  3.2 Dataset
    3.2.1 Characteristics of the data
    3.2.2 Issues
    3.2.3 Motivation
  3.3 Feature Extraction
  3.4 Training
  3.5 Testing

4 Results
  4.1 Tuning of the hyperparameters
  4.2 Feature importance
  4.3 AUC-scores
  4.4 EER, FAR, and FRR
  4.5 ROC-curves

5 Discussion

6 Conclusions

7 Future Work

8 Acknowledgments

9 Appendix
  9.1 Feature importance
  9.2 AUC Scores


1 Introduction

Being able to provide and verify identity is a vital part of a functional and secure society.[29] People are asked to identify themselves multiple times a day, when doing everything from unlocking personal phones and using public transport, to making bank transactions and gaining access to secure systems at workplaces. Naturally, this leads to both commercial and scientific interest in the field of authentication. There are numerous methods of establishing identity and rights to information. Methods are usually divided into three different areas, which can be used separately or together to complement each other: those based on possession (keys and tokens), knowledge (passwords and codes), and lastly, biometrics. Biometric systems are based on biological traits, which are often described as being more difficult to copy, learn, or obtain for illegitimate users than, for example, tokens or passwords. There are two types of biometrics: physical biometrics, which analyzes physical characteristics, such as scanning of fingerprints and retinas, and behavioral biometrics, which analyzes traits associated with behavior.[2] Systems utilizing physical biometrics are widely in use, even in personal products such as phones. Behavioral biometrics, on the other hand, is still relatively unexplored, but technology is beginning to catch up with theory. One emerging technique is 'keystroke dynamics'.

Keystroke dynamics utilizes keyboard patterns, which are believed to be unique enough from person to person to form a basis for identification. The data collected can be of different types, for example pressure data or which keys are being pressed, but the most common is rhythm. A rhythm profile for an individual is constructed from time-stamp data, building features that represent the pressing and releasing of keys. The simplest features are known as 'digraphs', time relationships between two actions on the keyboard. There are two major digraph categories: 'dwell time' and 'flight time', representing how long a key is pressed and how long it takes to find the next one, respectively.[1] Flight time can be defined in a number of different ways, and a schematic of five common features can be seen in Figure 1.

Figure 1: The five common keystroke features utilized in keystroke dynamics, demonstrated by the key stamps from pressing two consecutive keys 'J' and 'Y'. 'D' is dwell time, which represents how long a key is pressed down. 'F' denotes flight time, with four different interpretations of the time between pressing two keys: 1. between releasing key J and pressing key Y; 2. between releasing key J and releasing key Y; 3. between pressing key J and pressing key Y; and 4. between pressing key J and releasing key Y.[26]


To decide whether an individual trying to access the system is authorized to do so, the system compares the recorded rhythm with previous samples provided by the genuine user.[2] Both Teh et al. and Ali et al. divide the different approaches to this comparison into two categories: statistical and machine learning approaches. Note that they have not used a formal definition of the terms, but rather a natural division of methods. Examples of statistical methods include common measurements such as mean, median, and standard deviation. Probabilistic modeling methods based on the Gaussian distribution, such as Bayesian and Gaussian density functions, as well as cluster methods based on homogeneous clusters, such as K-means, are also categorized as statistical. The most used statistical methods are those based on pattern distance, such as Euclidean and Manhattan distance.[26][2] Examples of machine learning methods are Neural Networks, Decision Trees, Fuzzy Logic, and Support Vector Machines (SVM).[2] A relatively new implementation of decision trees is XGBoost, or 'extreme gradient boosting', presented by Tianqi Chen and Carlos Guestrin in 2014.[6] XGBoost has raised a lot of interest in the general machine learning community and has, for example, performed very well in recent 'Kaggle' competitions.[6]

Security systems based on keystroke dynamics are of interest not only because biometric systems in general are believed to be more secure, but also because they are cost-effective compared to other biometric systems, as the only products needed are a keyboard and software to carry out the authentication. As Ali et al. comment, it is also a noninvasive method with regard to users, as it only requires typing information (as opposed to fingerprints, eye scans, or DNA samples, which require more, and possibly more personal, information from the user).[1] Attempts to classify keystroke dynamics data using decision trees have proved successful in the past; however, no research has emerged where XGBoost is used to classify the data.[1]

Therefore, this study has attempted to classify the Beihang Keystroke Dynamics Database, a commercially accessible database consisting of usernames and passwords created by Li et al. in 2011, with XGBoost. The purpose has been to investigate whether XGBoost is a good approach to keystroke dynamics problems. Accordingly, this study has not collected any data, nor created a working security application, but has classified data from an existing dataset and analyzed the performance.

1.1 Problem Statement

The purpose of this report is to examine whether the XGBoost framework can be used to classify the data provided in the Beihang Keystroke Dynamics Database according to user. Ideally, the classification could be used for keystroke dynamics security systems, and the model should produce an accuracy higher than those of traditional frameworks, such as SVM and Neural Networks, whose results were published along with the database by Li et al. in 2011.

1.2 Background

This section of the report sets the study in context by reviewing previous studies in the area. First, studies about keystroke dynamics in general are reviewed, then those looking at the classification problem, and some comments are made about XGBoost. Lastly, the scientific scope and the outline of the report are presented.

1.2.1 Biometrics and Keystroke Dynamics

The concept of Keystroke Dynamics began in 1899: William Lowe Bryan and Noble Harter wrote an article titled ‘Studies on the Telegraphic Language: The Acquisition of a Hierarchy of Habits’, which was published in The Psychological Review, where they remarked on

the individual and unique typing patterns of telegraphists.[19] The idea resurfaced in World War II, when telegraphists would identify the sender of a message based on punching rhythm.[16] In 1985, David Umphress and Glen Williams showed that keystroke profiles are unique enough to provide effective protection, especially together with other means of identification, such as a password.[28] In 1995, Shepherd authored comprehensive lecture notes where he outlined the possibilities of keyboard data, the core of keystroke dynamics. He demonstrated that keyboards enable keystroke data to be monitored and time-stamped, allowing for logging of press and release events.[24]

Important work was also done in 2000 by Fabian Monrose and Aviel Rubin when they published the paper 'Authentication via Keystroke Dynamics', in which they outlined a framework for the field of keystroke dynamics. They discuss that keystroke dynamics systems can be either static or continuous. Static systems only act during a specific time interval, for example during a log-in phase, so that the user needs a username, a password, and the right typing rhythm in order to pass through. Continuous systems check keystroke patterns throughout the session to authenticate the user and can label the user as an impostor (and take appropriate measures) at any time. Systems which lie in between static and continuous are also possible.[20] This will be more thoroughly discussed in the theory section of this report. In their study, Monrose and Rubin collected typing data from 63 different users over 11 months. They examined how a good system should be designed in order to detect anomalous data, and concluded that individualized thresholds, specific to the users of the system, should be implemented rather than a general one. For classification, KNN was applied to cluster similar feature sets of so-called digraphs (common pairs of letters in a text) to distinguish between users. Due to 'the superior performance of Bayesian-like classifiers', they decided to implement their own distance measure based on the Gaussian distribution (details can be studied in their paper), which achieved an identification accuracy of 92.14%. Monrose and Rubin finish by recommending working with 'structured' text (the text typed is the same for all participants in the study) over free text (the participants are free to choose what text to type, sometimes within certain restrictions). Although typing rhythms have been proven to be individual enough to provide reliable identification profiles, Monrose and Rubin suggest that security systems should use keystroke dynamics in combination with other methods (such as tokens), as 'slight changes in behavior are inevitable'. This conclusion was supported by Unar et al. in their review 'A review of biometric technology along with trends and prospects', published in 2014.

In 2002, Bergadano et al. published a paper called 'User Authentication through Keystroke Dynamics', which presented results from a study where 44 individuals provided samples from typing a structured text 682 characters long (in Italian), as well as a shorter (English) text. The data was analyzed by a distance algorithm that considered the relative distance between trigraphs (similar to digraphs, but with three letters). Bergadano et al.'s method consisted of two steps: a user classification step, where training samples are classified, and a user authentication step, where each user is 'attacked' by samples provided by other authenticated users of the system, as well as an additional 110 other individuals. In their testing phase, an attempt had to be sufficiently close to the behavior of the claimed user in order to be classified as such, not only far away from the other profiles. Bergadano et al. report a false rejection rate (FRR) of 4% and a false acceptance rate (FAR) of less than 0.01%.[Bergadano] (Details of FRR and FAR are explained in the 'Theory' chapter.)

The Beihang Keystroke Dynamics Database was studied by its creators in the paper 'Study on the Beihang Keystroke Dynamics Database', where they applied five different SVMs, a Gaussian method, and a Neural Network classifier. Their EER results ranged from 11.8327% (achieved with one of the SVMs) to 20.7295% (the NN classifier).[18]

1.2.2 Classification methods and XGBoost

Which classification method is best suited for keystroke dynamics depends on several variables, for example the amount of data, the number of users, the type of input (long, short, free, fixed), and what type of authentication one wishes to implement (static, continuous). There have been a number of reviews that aim to give the reader an overview of the field of

keystroke dynamics. Examples are Karnan et al. in 2011, Teh et al. in 2013, and Ali et al. in 2017. This section will start by giving a brief summary of their conclusions regarding classification methods, followed by a brief history of XGBoost. All reviews stress that the main issue of the field is that there is as yet no benchmarking dataset for keystroke dynamics, nor a set evaluation method, which makes comparisons between results difficult.[16][26][2]

Teh et al. reviewed around 70 studies, concluding that among these, statistical approaches were the most commonly used (in 61% of the cases), with machine learning methods second (37%). Out of the machine learning methods, neural networks were by far the most common, followed by what they call 'generic methods', and decision trees in third place. Some of the most successful results achieved with machine learning methods were made with a support vector machine (FAR of 0.76%, Azevedo et al., 2007) and a naive Bayesian method (EER of 1.72%, Balagani et al., 2011). The most successful attempts using decision trees achieved an EER of 2% (short input, Nonaka and Kurihara, 2004) and a FAR of 0.88% (long input, Sheng and Phoba, 2005), mentioned also by Ali et al. (EER, or 'equal error rate', is the error rate at the decision threshold for which FAR = FRR. Details are provided in the 'Theory' chapter.)

Karnan et al. reviewed approximately 30 studies in the field. From their work, one can conclude that the most successful machine learning attempts used neural networks (FAR of 1%, Cho et al., 2000) and potential functions with Bayes decision rule (FAR of 0.7%). The most successful attempt using random forests reached a FAR of 14% (Bartlow and Cukic, 2006).[16] Ali et al. reviewed approximately 80 studies. Out of the machine learning methods, a study made with a random forest decision tree approach produced very promising results with a FAR of 0.03% (Maxion and Killourhy, 2010). However, this was done on data from digit input (instead of text).

As stated, XGBoost is a tool that implements 'gradient boosting', first introduced by Jerome H. Friedman in 1999.[8] The idea is based on the wish to work with simple 'weak learners' (models that are only 'slightly better than guessing' (p. 337)[15]) in order to shorten computation time. Their performance alone is not enough to make good predictions, so something had to be done to optimize them. From this ambition, the idea of 'boosting' was developed. In essence, boosting is similar to the practice of 'bagging' in the sense that it combines the predictions from several of these weak learners to make a stronger one.[6] However, unlike bagging, boosting creates these learners sequentially and typically with adaptive training sets.[3] Mathematical details about XGBoost are provided in the 'Theory' chapter.

1.3 Scope

This study has only aimed to analyze user identification via typing rhythm, and nothing else, to enable a static authentication system. (Other studies aim to analyze further characteristics of the users; for example, Ebb et al. published a study in 2011 in which they try to interpret the emotional states of users.[12]) The data used is from the Beihang Keystroke Dynamics Database A. The machine learning method applied is extreme gradient boosting, through XGBoost. Other methods, such as neural networks and SVM, were decided against due to the sparse dataset (which works badly with neural networks) and because SVMs are already widely researched in this area. This project did not include collecting data or creating an application for implementing a security system.

1.4 Thesis Outline

Chapter 2 presents the theory relevant to the study, explaining the basics of authentication and keystroke dynamics, as well as the theory behind machine learning, focusing on decision trees and XGBoost. Chapter 3 presents the implementation part of the study, detailing the software used and giving a description and motivation of the dataset. It also describes how features were extracted, and how the training and testing of the model were carried out. Chapter 4 presents the results of this implementation, which are discussed in Chapter 5. The report rounds off with conclusions (Chapter 6) and some reflections about possible future work in the field (Chapter 7).


2 Theory

2.1 Authentication systems

This subchapter discusses the basic terminology used in the field of authentication and the usual building blocks of a security system.

2.1.1 Authentication, Verification, and Identification

The terms authentication, verification, and identification are in common language often used interchangeably. Technically, however, they describe different nuances of security. Authentication is a general term; as Ali et al. succinctly put it, 'Authentication, in short, is the process of verifying a person's legitimate right prior to the release of secure resources'.[2]

Verification is one way of establishing this right. It involves a person presenting a proof of identity, and a system verifying that the person is who they claim to be and that they have the right to access.[2] This can be done on different levels, from just making sure that an authorized username provides the right password, to making sure that a photo ID is authentic and matches the individual providing it. Verification is the most common method used in security systems.[2]

Identification differs from verification in that the system gains no knowledge of who the person claims to be. The person simply provides a means of identification (not an identity), and the system needs to establish whether this sample of identification (for example a fingerprint) is authorized and present in a database, and, if so, which identity it belongs to. This method is more time-consuming, but necessary in some fields, for example the forensic sciences.[2]

As Ali et al. report, verification is the over-represented type of authentication in keystroke dynamics, accounting for more than 89% of the research.[2] This study has also implemented a verification method, as a username is provided, and an access attempt is only checked against previously provided samples which the system knows belong to the legitimate user.

2.1.2 Static vs Continuous

Authentication systems can be either static or continuous. Static systems claim the majority of research (83%) and are the simplest version, where a system only checks the identity of a user once, most often as a first step to gain access (for example when logging in).[2] However, such a system does nothing to prevent already accessed information from being navigated by an intruder, if the authorized user leaves the system open for use (voluntarily or not). Continuous systems try to battle this issue by continuously verifying that the user of the system is who they have claimed to be. Keystroke dynamics is in fact very appropriate for this type of authentication, since the user, if gaining access to a computer-based system, will probably continue to use the keyboard. If the user had provided a fingerprint, it would of course be more difficult in practice to continuously verify it.[5] Nonetheless, experiments involving continuous systems are more difficult to set up, and the application prospects may not be worth the effort (how often would a person leave sensitive information unattended?), which might be why research has focused more on static systems.[2] Therefore, this study has implemented a static verification system.

2.1.3 Systems overview

A static biometric verification system is created in two parts: first, the database is set up through a registration phase, and then the verification, the implementation of the system, can commence. A flowchart describing these parts is depicted in Figure 2. In the registration phase, authorized users are asked to provide a number of identification samples, which are registered as genuine samples. The next step is to extract features (easily comparable data) from the samples, which are stored in a database with information on which user they belong to. The features are also used in the next phase, where the training occurs. A model trained to distinguish users from each other and from impostors is created using the registration features, and the system is set up. This model is then used when verifying attempts to access the system. Users provide their means of identification, a data sample, from which the features are extracted. These features are processed by the model and compared with the stored features belonging to the claimed user, finally resulting in an access decision made by the system.

Figure 2: A flowchart describing the setup and running of a general static biometric authentication system. The data samples provided depend on the type of identification used (behavioral, physical), and so do the features extracted. The model can be of a machine learning or statistical type and will learn how to distinguish impostors from genuine users. This is used when running the system, which decides on either granting or denying the user access.

This study has focused on the setup phase, particularly in extracting features and training a machine learning model. The next subchapter will specify what these data samples can consist of, and what features can be extracted when dealing with keystroke dynamics. The subchapter after that will go into how a machine learning model, and particularly XGBoost, is trained. Throughout this report, the word ‘authentication’ will be used, as it is deemed more general and less confusing.

2.2 Keystroke Dynamics

All methods need to structure the incoming data in some way, picking features to focus the analysis on and to store, in order to enable future comparisons. Some models do this as a part of the learning process; for others, it is necessary to do it manually.[30] This subchapter will present different types of data and features that are common when dealing with keystroke dynamics.

2.2.1 Keystroke Dynamics Data

Keystroke dynamics data is normally collected through regular QWERTY keyboards, which have receivers that register time-stamps of the keystrokes. Participants are asked to type the text multiple times. To evaluate the model created in the research, it is common practice to collect samples where participants (either the same ones that provided the genuine samples, or a new pool of participants) have been asked to provide samples using the details of someone else (for example, someone else's username), acting as impostors on the system. These are referred to as 'impostor samples'.[26]

For the practical data collection, researchers can choose to work in a lab setting, where the participants of the study are in a controlled environment. This often entails that the participants use the same (or same type of) keyboard. The experimenters can also choose to restrict what type of keyboard is used, but not where the participants are geographically situated when carrying out the experiment. They can also choose to not restrict the setup at all and let participants choose both setting and keyboard. The first version has the advantage of being controlled, allowing the study more rigor. However, it may not represent a real-life situation very well and may affect the participants' behavior, thus leading to biased and unrepresentative data. The second option removes the lab environment issue but raises the question of whether it is preferable that the participants are familiar with the keyboard or not. Which option leads to the most distorted data: using one's own keyboard, which one rarely does when intruding, or using an unfamiliar device, which is not very representative of, for example, a situation at a workplace? The third option simply gives the researchers no insight into how device familiarity might affect results.[26] Some studies allowed the participants to practice on the provided keyboards before recording any data, to allow for familiarization.[13]

Although time-stamp data is the most common data collected, some studies have tried to research and implement receivers for registering keystroke pressure, expanding the feature base for profiling.[26] Currently, it is also very popular to work with mobile devices, which typically already contain hardware for analyzing movement, such as accelerometers and gyroscopes, which opens up possibilities for multi-featured profiles.[1][26] The text from which the time-stamps are collected can also be of a different nature. The text can either be fixed, meaning the participants all write the same text (as in the widely used GreyC Keystroke Dataset[13]), or free, allowing the participants to freely choose their text. It can further be a short sequence of letters (such as a username, password, or short phrase), or longer, often defined as paragraphs of more than 100 words.[26][4]

2.2.2 Features

When the data has been collected, it is reduced to data that is easy to compare between samples and distinctive enough to be of use in the identification process. This 'reduced data' is normally referred to as 'features'. One or several features can be drawn from one sequence of data, which in this context is the sequence of time-stamps from one typing session of the text. Among the most common are the already discussed dwell and flight times, referred to as 'digraphs', as they represent relationships between two keystrokes ('n-graphs' are also sometimes used: timing relationships between three or more keystrokes).[26] If the text samples are fixed, or long enough, the data can be reduced further by creating sub-sets of digraph samples depending on which keys are being pressed, creating relationships of important keystrokes.[20]
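To make this concrete, the sketch below computes dwell time and the four flight-time variants of Figure 1 from raw press/release time-stamps. It is a minimal illustration in Python; the data format and function names are assumptions for the example, not taken from the thesis code.

```python
# Minimal sketch of digraph feature extraction, assuming each keystroke is
# available as a (key, press_time, release_time) tuple in typing order.
# Function and variable names are illustrative, not from the thesis code.

def digraph_features(keystrokes):
    """Return dwell times and the four flight-time variants of Figure 1."""
    dwell = [release - press for _, press, release in keystrokes]

    f1, f2, f3, f4 = [], [], [], []
    for (_, p1, r1), (_, p2, r2) in zip(keystrokes, keystrokes[1:]):
        f1.append(p2 - r1)  # release of first key to press of next key
        f2.append(r2 - r1)  # release of first key to release of next key
        f3.append(p2 - p1)  # press of first key to press of next key
        f4.append(r2 - p1)  # press of first key to release of next key
    return dwell, f1, f2, f3, f4


# Example: time-stamps (in ms) for typing 'J' then 'Y'
sample = [("J", 0, 95), ("Y", 160, 240)]
print(digraph_features(sample))  # ([95, 80], [65], [145], [160], [240])
```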

2.3 Supervised Learning

Machine learning can be divided into two categories: supervised and unsupervised learning. This study is of the supervised type, since the training data is already labeled (unsupervised learning instead tries to find patterns in unlabeled data). The basis of any supervised learning algorithm is to present it with training data and the correct labels (called observed values). The training occurs when the algorithm attempts to map the training data to an output as close to the observed values as possible. If the process involves training several models, the best performing model can be chosen by testing them all on a subset of the training data (called 'validation') and evaluating the results. The model obtained can then be tested on testing data. This last step produces statistics of the accuracy of the model.[21]

2.3.1 Data Leakage and Cross-Validation

The training and testing data are often derived from the same dataset. A common issue in machine learning is data leakage, where a model produces very good results when tested because some form of information has 'leaked' between the training and testing sets. For example, a model that is tested on samples it has also been trained on, and therefore has seen before, can perform misleadingly well. There are several methods to help avoid this. An easy strategy is to take care when splitting the data into the training set and testing set, basing the decision on some independent parameter.[23] The concept of splitting a set can also be used within the training set, by splitting it into a training and a validation set, so that in each step of the training, the model is evaluated on unseen data. This also prevents data leakage. However, ignoring parts of the data when training can lead to the opposite of overfitting: underfitting. To ensure that this does not occur, cross-validation can be implemented. Cross-validation splits the training set a number of times, into what are called folds, training the model the same number of times and each time letting a new split form the validation set, and the rest the training set.[14] A visualization of a cross-validation with five folds can be seen in Figure 3. The algorithm then evaluates which split scored best, and uses that model in the testing phase on the current test set. The algorithm is then evaluated by calculating the average performance over all those splits.

Figure 3: Illustration of a cross-validation, where the dataset has been split into 5 folds.
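As an illustration of the procedure in Figure 3, the sketch below runs a five-fold cross-validation with scikit-learn and the XGBoost classifier; the data arrays are random placeholders, not the keystroke data used in the study.

```python
# Minimal sketch of 5-fold cross-validation with scikit-learn and XGBoost.
# X_train and y_train are random placeholders for the real training data.
import numpy as np
from sklearn.model_selection import KFold
from xgboost import XGBClassifier

X_train = np.random.rand(100, 5)        # 100 samples, 5 features (dummy data)
y_train = np.random.randint(0, 2, 100)  # binary labels (dummy data)

fold_scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    model = XGBClassifier(n_estimators=50)
    model.fit(X_train[train_idx], y_train[train_idx])
    fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))  # fold accuracy

print("mean validation accuracy:", np.mean(fold_scores))
```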

2.3.2 Classification

Classification is the goal of training the model: it is where incoming samples are categorized by the model. This is in fact done both during training, where the prediction is checked and sent as feedback in the training process, and when actually running the system. There are different types of classification methods, which depend on what type of data the tree structure receives in training. The two main types are binary and multiclass classification. In binary classification, a data sample is to be classified as one of two different classes; in multiclass classification, three or more classes are possible. Multiclass classification problems are very complex but can be transformed into binary classification problems. One way is to train one model per class and, when doing so, let all samples not belonging to that class act as negative samples. This is called 'one vs rest'. Multiclass methods involving many classes are time-consuming and require good hardware.[21] Keystroke dynamics classification is often a multiclass problem, since there are often more than two users authorized to use a system. There are, however, ways to train one, and only one, binary model on several classes. One such method is used in this study. The model is trained to compare two samples of feature sequences, and to decide whether or not they belong to the same class (e.g. user). The data the model receives are pairs of feature sequences, and during training also a label describing whether the feature sequences belong to the same class or not.[21]
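The sketch below shows one way such pairs of feature sequences could be turned into training data; the element-wise difference pairing and all names are illustrative assumptions, not the exact scheme used in the thesis.

```python
# Minimal sketch of turning per-user feature sequences into a pairwise binary
# classification problem: each training sample is the element-wise absolute
# difference between two feature sequences, labeled 1 if both come from the
# same user. The pairing scheme and variable names are illustrative assumptions.
import itertools
import numpy as np

def make_pairs(sequences_by_user):
    """sequences_by_user: dict mapping user id -> list of feature sequences."""
    X, y = [], []
    users = list(sequences_by_user)
    for u, v in itertools.combinations_with_replacement(users, 2):
        for a in sequences_by_user[u]:
            for b in sequences_by_user[v]:
                if u == v and a is b:
                    continue  # skip comparing a sequence with itself
                X.append(np.abs(np.array(a) - np.array(b)))
                y.append(1 if u == v else 0)  # 1 = same user, 0 = impostor pair
    return np.array(X), np.array(y)

# Example with two users and tiny 3-value feature sequences
pairs_X, pairs_y = make_pairs({"alice": [[0.1, 0.2, 0.3], [0.1, 0.25, 0.28]],
                               "bob":   [[0.4, 0.1, 0.5]]})
print(pairs_X.shape, pairs_y)
```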

2.3.3 Training

The training itself occurs mathematically by evaluating a model through a so-called objective function. The objective function keeps track of the error in the machine learning model; thus, the goal is to minimize it. In trees in general, the objective function has a term $l$ that represents training loss. Training loss describes how correct the model is compared to the observed data. XGBoost also implements a regularization term $\Omega$, see Equation 1. Regularization controls overfitting, and the two terms help balance the model so that it is neither too tailored to the training data nor too simple, an important aspect for any machine learning algorithm. Objective functions help the model perform well because they give feedback on what is going on (in this case through the training loss and regularization terms).[6]

$$\mathrm{Obj}(\Theta) = l(\Theta) + \Omega(\Theta) \qquad (1)$$

The loss function l can be calculated in several ways. Two of the most popular are mean square error,

$$l = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \qquad (2)$$

and logistic loss,

$$l = \sum_{i=1}^{n} \left[\, y_i \ln\!\left(1 + e^{-\hat{y}_i}\right) + (1 - y_i)\ln\!\left(1 + e^{\hat{y}_i}\right) \right]. \qquad (3)$$

[6]

In both cases, $y_i$ is the observed value and $\hat{y}_i$ the estimated value. For a binary classification problem using logistic loss, the observed value will be either 0 or 1, and the estimated value will be a probability between 0 and 1 of the chosen 'positive' value (for example, the probability of the real value being 1). The regularization term $\Omega$ can also be calculated in various ways. The most common is to use some form of norm; described in Equation 4 and Equation 5 are what are called the L2 norm and L1 norm, respectively.

$$\Omega(w) = \lambda \lVert w \rVert_2^2 \qquad (4)$$

$$\Omega(w) = \lambda \lVert w \rVert_1 \qquad (5)$$
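As a small numeric illustration of Equations 3-5, the following sketch computes the logistic loss and the two penalty terms for made-up values; the scores and weights are arbitrary examples, not values from the study.

```python
# Minimal numeric sketch of the logistic loss (Equation 3) and the L1/L2
# regularization penalties (Equations 4-5), using NumPy. The raw scores and
# leaf weights below are made-up illustration values.
import numpy as np

y = np.array([1, 0, 1])             # observed labels
y_hat = np.array([2.0, -1.0, 0.5])  # raw model outputs (scores before sigmoid)

logistic_loss = np.sum(y * np.log(1 + np.exp(-y_hat))
                       + (1 - y) * np.log(1 + np.exp(y_hat)))

w = np.array([0.3, -0.2, 0.1])        # example leaf weights
lam = 1.0
l2_penalty = lam * np.sum(w ** 2)     # Equation 4
l1_penalty = lam * np.sum(np.abs(w))  # Equation 5

print(logistic_loss, l2_penalty, l1_penalty)
```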

How regularization controls overfitting can be explained through the 'bias-variance tradeoff', which is a way of analyzing the expected error of the estimation the model produces. The error is made up of three terms: a bias term, a variance term, and lastly an irreducible error. The bias term is a measure of how much the estimated value differs from the true value. The variance term is a measure of how sensitive the model is to variations in the data. The last term, as suggested, cannot be reduced by the model, but the former two can. The tradeoff lies in that the best model is complicated enough to predict the training data well (low bias, at the cost of higher variance), but also general and uncomplicated enough that it predicts new data well (lower variance, at the cost of some bias). If it is too complicated, the model might overfit to the training data, and if it is too uncomplicated, it might underfit.

While training loss tries to push the model to be perfect (increasing complexity), regularization balances this by reducing the complexity of the model. The complexity of the model depends on its parameters. The L1 and L2 regularization types reduce the 'freedom' of the parameters (here, the leaf weights), limiting the range of their values, which reduces complexity and thus controls overfitting.[21] Next, the method used in this study is explained, along with how its objective function is evaluated through boosting.


2.4 Decision Trees and XGBoost

As previously mentioned, XGBoost is a framework which builds on gradient boosting. This subchapter will start by laying out the theory behind decision trees, introducing what sort of mathematical function they are implemented with, before moving on to explain extreme gradient boosting, or XGBoost.

2.4.1 Decision Trees

Decision trees are used for a variety of different classification problems. They classify data by letting it traverse a tree full of dividing questions, finally reaching a 'leaf' (end node) that predicts something final about the data. Trees are popular because they are easy to visualize, scale well, handle anomalies such as missing data and outliers well, and can handle both continuous and discrete data. A disadvantage of decision trees is their sensitivity to variance in the data. One method to counter this is to let more than one tree classify the data, a so-called 'tree ensemble'.[21] This is the model that XGBoost is built upon, and it is more thoroughly discussed in the next subchapter.[6] First, a more detailed and mathematical explanation of decision trees.

Decision trees create the tree structure by partitioning connecting nodes according to parameters which are learned through training, and assign the leaves different values that map onto a score related to the question in some way. The data is sorted through the partitions, which ask questions about the data, and when it arrives at a leaf, the data point is assigned the value that represents the classification the tree has made. If these values are discrete (for example 'yes' or 'no'), the trees are called classification trees. If the values are continuous (for example a probability), they are called regression trees.[21] An example of the structure of a regression tree can be seen in Figure 4.

Figure 4: An example of a regression tree, where the data points are travelers on the Titanic and the classification is the chance of survival an individual would have. The questions asked at the partitions are the individual's gender, age, and the class they were traveling in. The leaves have been assigned different values learned through training. So, for a data point representing a young boy traveling in 3rd class, the model will classify it as having a survival rate of 27%.[22]

Mathematically, a decision tree is represented by an adaptive basis-function model,

$$f(x) = \sum_{m=1}^{M} w_m \phi(x; v_m) \qquad (6)$$

where $M$ represents the number of 'distinctions' we want to make (for the Titanic example in Figure 4, $M$ would be 5), $w_m$ is the 'mean response', or 'leaf weight', for the class $m$, i.e. the answer to the question 'what is the survival rate for anyone classified in leaf $m$' (for example 27%), $\phi$ is the basis function for the data input $x$, and $v_m$ is the 'splitting variable', which denotes the splitting question and its threshold value (in the above case, only 'yes' or 'no').[21] This is the function that a supervised learning method has to analyze in each step in order to optimize classification of the training data, in accordance with subchapter 2.3.3.

In a tree ensemble, there are multiple trees. The model for such an ensemble can be expressed as

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F} \qquad (7)$$

where $\hat{y}_i$ is the prediction for the given data $x_i$, $K$ is the number of trees in one ensemble, and $f_k$ is the function for a given tree, as in Equation 6. $\mathcal{F}$ is the function space of all trees.[6]

2.4.2 XGBoost

XGBoost uses an ensemble of trees, ideally weak learners. On top of that, it applies boosting. Boosting is the process of iterating through a number of steps, adding in each step a new function that is based on the error of the previous estimation. Normal gradient boosting uses the principle of gradient descent to find the minimum of the objective function, adding a new function in each step. XGBoost does all of this, while using a slightly different take on the regularization term and, most importantly, considering second derivatives (which leads to faster computations).[7] Now follows a mathematical description of XGBoost. It starts off with the objective written as (analogously with Equation 1)

$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad f_k \in \mathcal{F} \qquad (8)$$

where the base models $f_k$ can be found through boosting. The boosting starts with the model formulating a prediction

$$\hat{y}_i^{(0)} = 0 \qquad (9)$$

and adds functions (boosts it):

$$\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \qquad (10)$$

$$\hat{y}_i^{(2)} = f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \qquad (11)$$

$$\vdots$$

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \qquad (12)$$

where $\hat{y}_i^{(t)}$ is the model at step $t$.[7] The next step is finding the base models $f$. These are found by optimizing Equation 8, which can be expressed as


$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}, \qquad (13)$$

and if the $f_t$ that minimizes the expression is found, the $f$ in each boosting step is found. If a Taylor expansion of Equation 13 is applied up to the second order, and if the loss function $l$ is defined as the square loss (Equation 2), Equation 13 can be approximated as

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t), \qquad (14)$$

where $g_i$ and $h_i$ are defined as the first- and second-order derivatives of the loss function:

$$g_i = \partial_{\hat{y}^{(t-1)}} \, l\!\left(y_i, \hat{y}^{(t-1)}\right) \qquad (15)$$

and

$$h_i = \partial^2_{\hat{y}^{(t-1)}} \, l\!\left(y_i, \hat{y}^{(t-1)}\right) \qquad (16)$$

[7]. The next part is to expand the regularization term to control the complexity, or the overfitting, of the model. If the set of leaves $I_j$ is defined as

$$I_j = \{\, i \mid q(x_i) = j \,\}, \qquad (17)$$

the regularization can be expanded to

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \qquad (18)$$

where $T$ is the number of leaves in the tree, and $\lambda$ and $\gamma$ are hyperparameters which can be tuned when working with XGBoost. The larger the respective values of $\lambda$ and $\gamma$, the more conservative the algorithm becomes. Equation 14 can then be written as

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2, \qquad (19)$$

which can be reduced to

$$\mathrm{Obj}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2 \right] + \gamma T, \qquad (20)$$

where $G_j$ and $H_j$ are defined as

$$G_j = \sum_{i \in I_j} g_i \qquad (21)$$

and

$$H_j = \sum_{i \in I_j} h_i. \qquad (22)$$


The best minimization of Equation 20, or in other words the optimal leaf weight, is computed by

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad (23)$$

giving an optimal loss value for the new model:

$$\mathrm{Obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T, \qquad (24)$$

which is called the structure score, a measure of how good the tree structure is (the smaller the score, the better).[6] The performance of a tree can thus be measured, but how is the best tree structure found? It is not realistic to create all possible structures and evaluate them. The trick is to grow the tree greedily. This means that as the tree grows, it tries to add a split at each node. The gain in the objective function from making a split is calculated as

$$\mathrm{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma, \qquad (25)$$

where the first three terms, called the 'training loss reduction', represent the score of the new left child, the score of the new right child, and the score if we do not split. The last term is a complexity penalty, or 'regularization', from introducing a new leaf into the structure. Now the way forward is clear: if the total gain is negative (so that the added scores are smaller than $\gamma$), the tree structure loses performance by adding that leaf. If it is positive, the tree reduces its overall structure score and should perform the split. This is a variation of pruning the tree. A model can also implement so-called recursive pruning, where the tree implements all possible splits up to a maximum depth and then recursively prunes away the splits that created negative gain. This is done in order not to overlook a split that looks like a loss at first, but that can lead to beneficial splits later.[6]

This method of boosting a model with an objective that has a regularization term and that considers second derivatives has thus been named XGBoost. Other advantages over other tree boosting tools are that it is completely scalable and, according to its creators, 'it runs more than ten times faster' than other systems.[7]
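To illustrate Equation 25, the sketch below computes the gain for a candidate split from the gradient and hessian sums of the two children; all numbers and hyperparameter values are made-up examples.

```python
# Minimal sketch of the split gain in Equation 25. G_left/H_left and
# G_right/H_right are the sums of first- and second-order gradients
# (Equations 21-22) over the samples falling in the left and right child;
# the values and hyperparameters below are made-up illustration numbers.
def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    def score(G, H):
        return G * G / (H + lam)
    no_split = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - no_split) - gamma

gain = split_gain(G_left=-4.0, H_left=6.0, G_right=3.0, H_right=5.0, lam=1.0, gamma=0.5)
print(gain)  # positive gain -> perform the split; negative -> do not split
```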

2.4.3 XGBoost Hyperparameters

The XGBoost library offers three different types of hyperparameters for tuning: general parameters, parameters for the chosen booster (in this case a tree), and learning parameters.[8] Scikit has a plugin function called 'grid search', which enables tuning by searching through ranges of different parameters using cross-validation to find the optimal values for that particular model. This subchapter will go through the hyperparameters tuned in this study; a sketch of such a grid search is given after the list. General parameters concern message printouts and the number of threads used to perform parallel processing. Learning parameters concern the learning procedure and the evaluation of the objective function. Neither of these parameter types has been tuned in this study. Booster parameters concern the tree structure itself. There are several different parameters, and the following, which are the most common to tune, have been used in this study (default values are those of the Scikit API for XGBoost classification):

Max depth: Max depth sets a limit to how ‘deep’ the tree can be, how many partitions can occur. A higher value increases the complexity of the model, risking overfitting. Range:[0,∞], default:3.


Min child weight: Min child weight sets a threshold on the weight of a tree node for partitioning to continue. A higher value results in a more conservative model. Range:[0,∞], default:1.[8]

Gamma: Gamma, as seen in Equation 18, sets a threshold for the minimal loss reduction to occur in a step in order for the tree to create a further split. A higher value results in a more conservative model. Range:[0,∞], default:0.[8]

Subsample: Subsample sets the share of data selected for training, to prevent overfitting. Range:(0,1], default:1.[8]

Column sample by tree: Column sample by tree sets the share of the features to be used in each tree. Range:(0,1], default:1.[8]

Alpha: Alpha represents the L1 regularization term, as in Equation5. The higher the value, the more conservative the model becomes. Default:0.[8]

Lambda: Lambda represents the L2 regularization term, as in Equation4. The higher the value, the more conservative the model becomes. Default:1.[8]

Eta (learning rate): Eta, also called learning rate, shrinks the weights after each boosting step. The lower the value, the smaller the update steps become, making the model more conservative. Range:[0,1], default:0.1.[8]
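As an illustration of how such a grid search can be set up with Scikit's GridSearchCV and the XGBoost classifier, consider the sketch below; the data and the searched parameter ranges are placeholders, not the values used in this study.

```python
# Minimal sketch of hyperparameter tuning with scikit-learn's GridSearchCV and
# the XGBoost sklearn API. The data and the searched ranges are placeholders.
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X = np.random.rand(200, 5)        # dummy feature sequences
y = np.random.randint(0, 2, 200)  # dummy binary labels

param_grid = {
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3],
    "gamma": [0, 0.1],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```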

2.5 Testing

Testing of the completed model is done to simulate a live scenario, to obtain an estimate of the model's expected performance on new data. The model is then presented with unseen data (a testing set) and given the task of classifying it. The nature of the output varies, depending on the measurement. The most common measurements when dealing with classification are presented below:

2.5.1 Feature Importance

The varying importance that the different features have had when building the model can be visualized through the plot_importance function in XGBoost. The importance is calculated by considering the number of times each feature is used to split the data, weighted by the squared improvement gained from each split.[11]
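A minimal sketch of producing such a plot with XGBoost's built-in helper is shown below; the model and data are placeholders.

```python
# Minimal sketch of plotting feature importance with XGBoost's built-in helper.
# The model and data are placeholders; the fitted model would be the real one.
import numpy as np
import matplotlib.pyplot as plt
from xgboost import XGBClassifier, plot_importance

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

model = XGBClassifier(n_estimators=50).fit(X, y)
plot_importance(model)  # bar chart of how often each feature was used in splits
plt.show()
```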

2.5.2 FPR and FNR (false positive/negative rate):

When doing a binary classification (are these sequences from the same user or not?), there are two types of errors the model can make: false positives (answering 'yes' when the answer is 'no') and false negatives (the opposite). The rates of occurrence of these errors are calculated as

$$\mathrm{FPR} = \frac{FP}{FP + TN} = \frac{FP}{N_-} \qquad (26)$$

and


$$\mathrm{FNR} = \frac{FN}{TP + FN} = \frac{FN}{N_+} \qquad (27)$$

where $FP$, $TN$, $FN$, and $TP$ are the number of false positives, true negatives, false negatives, and true positives, respectively.[21] FPR and FNR are sometimes, in models designed for authorization, called 'false acceptance rate' and 'false rejection rate', or FAR and FRR, respectively. These terms will hereafter be used in this report when discussing these particular measurements. It can be argued that false acceptances are more dangerous for a security system, while a high FRR would probably lead to irritation and low user value.

2.5.3 TPR and TNR (true positive/negative rate):

Analogously to FAR and FRR, there are two types of correct classifications the model can give: true positive and true negative answers. They are calculated as

$$\mathrm{TPR} = \frac{TP}{TP + FN} = \frac{TP}{N_+} \qquad (28)$$

and

$$\mathrm{TNR} = \frac{TN}{FP + TN} = \frac{TN}{N_-}. \qquad (29)$$

These measurements are also known as 'true acceptance rate' and 'true rejection rate'.[21]

2.5.4 ROC (receiver operating characteristic):

All four characteristics above are calculated at a certain threshold (what percentage of certainty in an answer is required to label a sample as 'positive' or 'negative'). To get a complete view of the model, one needs to run it at several thresholds. Plotting TAR against FAR at different thresholds results in a so-called ROC curve. In Figure 5, an example ROC curve is depicted. If the threshold is set to 1 (classifying all samples as negative), the model reaches a point at the bottom left. If the threshold is set to 0, the model ends up in the top right corner (classifying everything as positive). If the model acts on pure chance, TAR will equal FAR, and it will end up on a point on the diagonal. If the model has been trained so well that its TAR = 1 and FAR = 0 (no false positives), which is desirable in security systems, the model will end up at a point in the upper left corner. Therefore, a good result is when the ROC curve is as close to the upper left corner as possible.[21]


Figure 5: A standard ROC curve, with example results from three different models. The blue line, representing FPR = TPR (or FAR = TAR), is ‘worthless’ since it is as good as randomly guessing the answer. The model represented by the yellow line is better at classifying the data than the purple, since it is closer to the upper left corner (where TPR = 1 and FPR = 0). Figure credit:[25]

2.5.5 AUC (area under curve):

The ROC curve, while giving a visually pleasing way of comparing models, is difficult to interpret quantitatively. Therefore, the 'AUC', or area under the ROC curve, is often calculated and presented as a measure of the quality of the model. As the ideal ROC curve hugs the y-axis up to TAR = 1 and then extends horizontally, the maximum AUC score is 1.[21]

2.5.6 EER (equal error rate):

Another quantitative measure of the ROC curve is the equal error rate, or the value of FAR when FRR = FAR. Visually, this can be found by drawing a FAR = 1 − TAR line through the plot and seeing where it intersects the ROC curve. The ideal value of EER is 0, occurring when the curve reaches the top left corner.[21]
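As a concrete illustration, the minimal sketch below computes the ROC curve, AUC, and an approximate EER from labels and model scores using scikit-learn; the example arrays are arbitrary, and the EER is approximated as the point where FAR and FRR are closest, which is one common approximation rather than the only one.

# Minimal sketch: ROC curve, AUC, and an approximate EER from model scores.
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                     # example labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3])   # example scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # FAR and TAR at every threshold
auc_score = auc(fpr, tpr)                          # area under the ROC curve

frr = 1 - tpr                                      # FRR at every threshold
i = np.argmin(np.abs(fpr - frr))                   # threshold where FAR is closest to FRR
eer = (fpr[i] + frr[i]) / 2
print(auc_score, eer)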


3 Implementation

3.1 Software

This study has built an XGBoost model using Python 3.6.4. Java or R can also be used. Apart from some programs handling plots and other visualization (for example matplotlib, Pandas, Scikit-learn, plotly), and other extensions to make programming easier, the computer needs to be equipped with the following to run XGBoost:

• GNU Compiler Collection (gcc)
• XGBoost in some form (for example a clone of the git repository[9])

Detailed environment instructions can be found on the official XGBoost website.[8] Training on keystroke dynamics is generally not that power consuming, since the data consists of pure text (as opposed to images or other large files). Hence, a normal personal computer should be enough to run these programs.
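As a quick sanity check of such an environment, the sketch below simply imports the libraries mentioned above and prints their versions; no particular versions beyond Python 3.6 are assumed.

# Minimal environment sanity check for the tools mentioned above.
import xgboost
import pandas
import sklearn
import matplotlib

print("xgboost", xgboost.__version__)
print("pandas", pandas.__version__)
print("scikit-learn", sklearn.__version__)
print("matplotlib", matplotlib.__version__)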

3.2 Dataset

The dataset used in this study was made by Li et al. in 2011 and presented in the report 'Study on The BeiHang Keystroke Dynamics Database'. The database contains keystroke data from a total of 117 users collected in two different settings, in a cybercafe and online, resulting in two databases: Database A and Database B. In this study, only Database A (49 users) has been used. The users provided usernames, and both training and testing sets of data from typing out a personally chosen password.[18] This subchapter describes the specific characteristics and the downsides of the data included, as well as the motivation for choosing this particular dataset for the present study, including comments on other frequently used datasets.

3.2.1 Characteristics of the data

For each of the participants, the database contains a username and chosen password. For each username, there are 4-5 registration samples provided by the genuine user. There are also a varying number of attempt samples, using the correct username and password, each provided by either the genuine user or another participant posing as an 'impostor'. The registration samples were used for training, and the attempt samples were used for testing the model. The data for every password entry consists of two timestamps per key: press and release. This enables calculation of dwell time and different flight times. Li et al. utilized this to create a feature vector consisting of four parts. One describes dwell time, and the other three describe three different types of flight time, as depicted in Figure 1.[18]

3.2.2 Issues

Every dataset has its issues, and the most critical is inaccurate data. In the process of creating the dataset, Li et al. filtered out those data points created by obvious misuse of the system.[18] Two additional samples were disregarded due to errors in the data, leaving 47 users. They provided in total 208 registration samples and 1164 attempt samples. Other 'inaccurate data' that are more difficult to detect include training samples that indicate behavior far from the norm of the user. In research settings this might be rarer than in real-life situations, where mood, illness, day, and time will always affect typing rhythm in an unpredictable way.


3.2.3 Motivation

The decision to utilize the BeiHang database in this study was based on several factors, mainly its size, the features of the data (previously discussed), its collection method, and its accessibility. The size of the dataset with respect to users was deemed to be average, as observed by both Teh et al. and Ali et al. in their respective reviews.[26][1] Other popular datasets include the GreyC Keystroke Dataset, which in 2009 had collected over 7500 samples from 133 users. However, in this database, the users all typed the same password ('greyc laboratory').[13] As discussed, fixed text for all users has its advantages, but it was decided that examining a free-text database would be more realistic and academically useful. For the same reason, the CMU Keystroke Dynamics Benchmark Dataset, with 51 users typing a static password, was eliminated.[17] In addition, both the GreyC and CMU databases have been thoroughly studied compared to the BeiHang database (131 and 346 citations, respectively, compared to 36, according to Google Scholar).

A potential additional advantage of this dataset is the data collection. The samples were collected from a commercial system in a 'free' environment, meaning not a controlled laboratory. Li et al. argue that this would be 'more comprehensive and more faithful to human behavior'.[18] Of course, the users were aware of their participation in a study, so it cannot be said to be completely identical to the everyday use of a keyboard. There is also an advantage to letting users choose their own password, as one is comfortable with one's own password, resulting in a typing rhythm more similar to normal use than when copying unknown text, as Teh et al. conclude.[26] The data in the BeiHang database might not be classified as completely 'free text', as there are limitations to passwords (mainly length), but they are decidedly more free than structured text. This text setup is different from the majority of research conducted, in which passwords are not chosen by users and the collection is carried out in laboratories.[13] The BeiHang database was also easily accessible, with the data precisely labeled in txt-files.

Although Giot et al. argue in a review from 2015 that the most reliable approach is for researchers to 'collect the datasets [themselves] that fits the need of their studies', it was deemed unachievable for this study to build a new dataset of good quality, given the time frame. As several studies and reviews state, there is a strong need for a good-quality benchmarking dataset for Keystroke Dynamics.[1] There seemed to be no need to expand the sea of semi-good datasets, and the focus has instead been on expanding the research on an existing database.

3.3 Feature Extraction

After acquiring the dataset, features are extracted. As the passwords were all different from each other, direct comparison of key-to-key timestamp data was impossible. The features selected instead are the means of the features pictured in Figure 1: dwell time and four different types of flight time, for each sample. An example of such a data sequence can be seen in Table 1, and the resulting features below it; a small code sketch reproducing this worked example follows after the feature list. The usernames are also included in both the training and testing sets. Labels indicating whether the data was collected from a genuine user or an impostor were also included in the data used in testing. Registration data is used for training the model, and attempt data for testing it. Tables 2 and 3 depict the data frames used for training and testing, respectively.

Table 1: Password, time stamps and resulting extracted features for an example password ‘psw’.

key          p            s            w
keystroke    down   up    down   up    down   up
timestamp    0      1     3      4     5      6

• dwell time average = 1
• flight time type 1 average = 1.5
• flight time type 2 average = 2.5
• flight time type 3 average = 2.5
• flight time type 4 average = 3.5
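To make the flight-time definitions concrete, the sketch below reproduces the averages above from the raw down/up timestamps. The exact pairing of timestamps used for each flight-time type is inferred from this worked example and is an assumption, not a definition taken from Li et al.

# Minimal sketch: extracting mean dwell time and four flight-time features
# from per-key (down, up) timestamps, reproducing the 'psw' example above.
# The flight-time definitions below are inferred from the worked example.
def extract_features(keys):
    """keys: list of (down, up) timestamps, one pair per key press."""
    dwell = [up - down for down, up in keys]
    f1 = [keys[i + 1][0] - keys[i][1] for i in range(len(keys) - 1)]  # next down - up
    f2 = [keys[i + 1][1] - keys[i][1] for i in range(len(keys) - 1)]  # next up - up
    f3 = [keys[i + 1][0] - keys[i][0] for i in range(len(keys) - 1)]  # next down - down
    f4 = [keys[i + 1][1] - keys[i][0] for i in range(len(keys) - 1)]  # next up - down
    mean = lambda xs: sum(xs) / len(xs)
    return [mean(dwell), mean(f1), mean(f2), mean(f3), mean(f4)]

print(extract_features([(0, 1), (3, 4), (5, 6)]))  # [1.0, 1.5, 2.5, 2.5, 3.5]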

Table 2: Excerpt from the registration data containing feature sequences from two different samples provided by two different genuine users. The unit is in ms. ‘d’ stands for ‘dwell time’ and ‘f’ for ‘flight time’ (of which there are four different types).

d mean      f1 mean     f2 mean     f3 mean     f4 mean     username
162036.0    389064.0    550526.0    550225.0    711687      12345
138766.0    872.4       147306.0    137660.0    284094      304567

Table 3: Excerpt from the attempts data containing feature sequences from two different samples provided, both claiming to be user ‘12345’ and providing the correct password, but one sample is genuine (‘0’), and one an impostor (‘1’). The unit is in ms. ‘d’ stands for ‘dwell time’ and ‘f’ for ‘flight time’ (of which there are four different types).

d mean      f1 mean     f2 mean     f3 mean     f4 mean     username    impostor
136353.0    247693.0    386233.0    386832.0    525371.0    12345       0
101175.0    57725.2     160851.0    161111.0    264237.0    12345       1

3.4 Training

When the features have been extracted, the next step is to actually train the model. The blog post 'Building Supervised Models for User Verification Part 1 of the Tutorial' by Maciek Dziubinski and Bartosz Topolski was the inspiration for the structure of the training and testing algorithms in this study.[10] The first step is to split the usernames into usernames whose data will only be used for training and validation, and usernames whose data will only be used for testing (the testing dataset is covered in the next subchapter). This is done to prevent data leakage, and it can be repeated multiple times in order to analyze potential differences between different divisions. The data used in training is, as previously mentioned, all from the registration database (where all samples come from genuine users), as seen in Table 2. The next step is to split the training set into folds for cross-validation. Within the folds, the binary classification part is prepared: all feature sequences are paired with one another, labeled according to whether they belong to different or the same user, and the feature difference is calculated (simply the difference between the sequences per feature). This is visualized in Figure 6. This, in turn, creates a new feature vector, referred to as a 'difference feature', together with a label. The model can now be trained to recognize whether two feature sequences belong to the same or different users. The difference features calculated from two feature sequences from the same user, and from two different users, can be seen in Tables 5 and 6, respectively. The same pairing is done for all data in the validation set; a small code sketch of this pairing step follows directly below.
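The sketch below illustrates the pairing step, assuming each sample is a feature sequence of the form produced in Table 2; the function name is illustrative, and whether the study used signed or absolute element-wise differences is an assumption here.

# Minimal sketch: pairing feature sequences into labeled 'difference features'
# (label 1 if both sequences come from the same user, 0 otherwise).
from itertools import combinations

def build_difference_features(samples):
    """samples: list of (feature_vector, username) tuples from the registration data."""
    diffs, labels = [], []
    for (feat_a, user_a), (feat_b, user_b) in combinations(samples, 2):
        # Signed element-wise difference (signed vs. absolute is an assumption).
        diffs.append([a - b for a, b in zip(feat_a, feat_b)])
        labels.append(1 if user_a == user_b else 0)
    return diffs, labels

samples = [
    ([162036.0, 389064.0, 550526.0, 550225.0, 711687.0], "12345"),
    ([155167.0, 228783.0, 385450.0, 384850.0, 541517.0], "12345"),
    ([138766.0, 872.4, 147306.0, 137660.0, 284094.0], "304567"),
]
diffs, labels = build_difference_features(samples)
print(labels)  # [1, 0, 0]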


Figure 6: Graphically showing that during training, in both training and validating, each sample (feature sequence) is paired up with all other samples in the set.

The model is trained on the difference features. To build the best model possible for that particular training and validation set, a 'grid search' is performed over the hyperparameters discussed in the 'Theory' chapter, which are also shown in Table 4 together with the respective range of values. The hyperparameters are searched through in six different steps, also indicated in Table 4, each new stage using the values found in the previous one. The model is evaluated throughout this process by making predictions on the validation data and finally comparing them to the validation labels. This results in an AUC score, which is used to decide which fold in the cross-validation produced the best-performing model. An overview of this algorithm is described in Algorithm 1, and a minimal code sketch of one such tuning stage is given after Table 4.

Table 4: The hyperparameters tuned in 6 steps for each training and validation set during training, and the range of each value tested.

Step #   Parameter           Range (start, stop, step)
1        max_depth           (0, 11, 1)
         min_child_weight    (1, 11, 1)
2        gamma               (0, 1, 0.1)
3        subsample           (0.6, 1.1, 0.1)
         colsample_bytree    (0.6, 1.1, 0.1)
4        reg_alpha           (1e-5, 100, *10)
         reg_lambda          (1e-5, 100, *10)
5        learning_rate       (0, 1, 0.02)
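The sketch below shows one stage of such a staged grid search (step 1 in Table 4) using scikit-learn's GridSearchCV and the XGBClassifier wrapper; the synthetic data and the shortened parameter ranges are for illustration only, and the remaining steps would repeat the same pattern with the other parameters.

# Minimal sketch of one stage of the staged grid search (step 1 in Table 4):
# tuning max_depth and min_child_weight by cross-validated AUC.
# Synthetic data stands in for the difference features and their labels.
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X = np.random.RandomState(0).normal(size=(40, 5))  # 5 difference features
y = np.array([0, 1] * 20)                          # 1 = same user, 0 = different

param_grid = {
    "max_depth": [1, 3, 5, 7, 9],         # full range in Table 4 is (0, 11, 1)
    "min_child_weight": [1, 3, 5, 7, 9],  # full range in Table 4 is (1, 11, 1)
}
search = GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", cv=4)
search.fit(X, y)
print(search.best_params_)  # carried over into the next tuning step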


Algorithm 1 Training
1: Start with the extracted registration data, as in Table 2
2: Split the usernames into approximately 80% used for training and validation, and 20% used for testing
3: Divide the data associated with training and validation into 4 folds of training and validation sets to perform cross-validation
4: Create an XGBoost classifier with sklearn.pipeline
5: for all folds:
6:     Calculate the difference between all possible pairs of training sequences
7:     Calculate the difference between all possible pairs of validation sequences
8:     Fit the XGBoost classifier using grid search to the training feature differences and their labels
9:     Make predictions on the validation feature differences and compare to the labels
10:    Calculate the AUC score for the present fold from this prediction
11: Evaluate which fold created the best model based on AUC score

Table 5: An example of two feature sequences from the same user that are paired up during training and the resulting difference feature including the new feature ‘label’. That the ‘label’ is ‘1’ indicates that the sequences are from the same user.

                     d mean     f1 mean    f2 mean    f3 mean    f4 mean    username   label
Feature Sequences    162036.0   389064.0   550526.0   550225.0   711687.0   12345
                     155167.0   228783.0   385450.0   384850.0   541517.0   12345
Difference Feature   6869       160281     165076     165375     170170                1

Table 6: An example of two feature sequences from two different users that are paired up during training and the resulting difference feature, including the new feature ‘label’. That the ‘label’ is ‘0’ indicates that the sequences are from different users.

                     d mean     f1 mean    f2 mean    f3 mean    f4 mean    username   label
Feature Sequences    155167.0   228783.0   385450.0   384850.0   541517.0   12345
                     138766.0   872.4      147306.0   137660.0   284094.0   304567
Difference Feature   16401      227910.6   238144     247190     257423                0

3.5 Testing

The aim of testing a system meant for security is to simulate a real-life situation. Therefore, the attempt dataset is used for testing, which includes samples from both the genuine user and impostors acting under the same username. The registration data of the usernames included in the testing set has not, as previously mentioned, been included in the training phase. This is important: the model will try to classify samples from users on which it has not trained. The testing is based on the same binary classification principle as the training, but with some adjustments. One attempt sample is compared with the registration samples from the claimed username, as illustrated in Figure 7. This is true to reality, as a database would hold registration samples of the authorized users, and the system would look at an identification attempt and try to decide whether it matches those samples. The comparison is done, analogously with training, by taking the difference between each pair of feature sequences. The model then makes a prediction for each of those pairs and calculates the mean prediction for that attempt sample. This prediction is then evaluated against the label indicating whether the attempt was made by the genuine user or an impostor, in order to generate statistics. A concise overview of this is presented in Algorithm 2, followed by a small code sketch of the scoring step.

Algorithm 2 Testing
1: Start with the extracted attempt data, as in Table 3
2: Use the attempt data associated with the 20% of usernames from the split in step 2 of Algorithm 1
3: Pair each attempt sample with the registration data associated with the same claimed username, as in Figure 7
4: Calculate the feature difference of each pair
5: Use the XGBoost model developed in training to make a mean prediction of whether the sample comes from the genuine user or an impostor
6: Compare the predictions with the sample's impostor label to get statistics on accuracy
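The sketch below illustrates steps 3 to 5 of Algorithm 2: one attempt sample is compared against the claimed user's registration samples and the pairwise predictions are averaged. The function and variable names are illustrative, and the toy fitted model only stands in for the one produced by Algorithm 1.

# Minimal sketch of the scoring step in Algorithm 2.
import numpy as np
from xgboost import XGBClassifier

def score_attempt(model, attempt_features, registration_samples):
    """Mean probability that the attempt comes from the same user as the registrations."""
    diffs = np.array([
        [a - r for a, r in zip(attempt_features, reg)]
        for reg in registration_samples
    ])
    return model.predict_proba(diffs)[:, 1].mean()  # column 1 = 'same user' (label 1)

# Toy fitted model standing in for the one produced by Algorithm 1.
rng = np.random.RandomState(0)
X, y = rng.normal(size=(40, 5)), np.array([0, 1] * 20)
model = XGBClassifier().fit(X, y)

registrations = [[1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 2.1, 2.9, 4.2, 5.1]]
print(score_attempt(model, [1.05, 2.0, 3.1, 4.1, 5.0], registrations))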

Figure 7: Graphically showing that in testing, each attempt sample in the testing set is paired up with all the registration samples from the same username.


4 Results

In this chapter, the results from the implementation are presented. Five different models were created using five different splits of the attempt data (the remaining users in each split being used as training data). The usernames included in each split are presented in Table 7.

Table 7: The usernames selected for the test set in each split. The training set in each split is the test set's complement.

Split 1        Split 2        Split 3        Split 4        Split 5
uikjuy         uytreg         lokiu          hh773510555    saiwaiba
zhangyue       xiaotian       wowowo1        oikjuy         uyhjgt
zhangyue1      w6363286       ftgjkhuji      12345          tredfgr
wwwh           gg773510555    456456         tiankong112    uythgf
583980746      huan           juyhg          x335399102     fenghuo
wenheng1       zhangyue2      xiaoxiong      304567         huan2
335399102      ghkhh          tiankong1      xiaowei        474716407
xiaoji         kk773510555    qianhai18      5864564        wukun88
fghjk          jkghjghi       zhuzhu         guanhongxu     fenglinhuoshan
tiankong11     kuijhu

4.1 Tuning of the hyperparameters

The tuning of the hyperparameters using grid search, in the order shown in Table 4 in Chapter 3, resulted in the values shown in Table 8.

Table 8: The values of the hyperparameters tuned during training, before and after tuning, for each test split.

                     Before tuning   After tuning
Parameter            (default)       Split 1   Split 2   Split 3   Split 4   Split 5
max_depth            3               3         3         1         1         1
min_child_weight     1               3         9         9         1         1
gamma                0               0         0.7       0         0         0
subsample            1               1         0.7       0.7       0.8       0.7
colsample_bytree     1               0.6       0.8       0.8       1         0.8
reg_alpha            0               1e-5      0.1       1         0.01      0.01
reg_lambda           1               1         1         1         1         1
learning_rate        0.1             0.1       0.08      0.58      0.18      0.12

4.2 Feature importance

Two of the feature importance graphs generated are shown in Figure 8. They are from split 1, before and after tuning. Due to the similarity between splits, only these two are depicted here. The feature importance graphs for the remaining four splits can be viewed in Figure 14 in the Appendix.


(a) Feature importance graph of split 1 before tuning of hyperparameters. (b) Feature importance graph of split 1 after tuning of hyperparameters.

Figure 8: Feature importance graph of split 1, before and after tuning. The importance of each feature is plotted as a function of how many times each feature has been split on. ‘d’ represents dwell time, and ‘f1,2,3,4’ the four different flight time features.

4.3 AUC-scores

The mean AUC scores for each split, as well as the overall mean, before and after tuning, are presented in Table 9. The AUC scores per user are presented in Figures 15 and 16 in the Appendix. It can also be seen that the AUC scores displayed in Table 9 differ from those visible in the ROC curves in Figures 9 to 13. This is because the AUC displayed in the table is the mean of all individual AUC values in each set, while the AUC displayed in the ROC curves is the actual area under the ROC curve for the whole set.

Table 9: The mean AUC score for each split and the overall mean, before and after tuning the hyperparameters.

                 Mean AUC score
Split            Before tuning   After tuning
1                0.9792          0.9860
2                0.9261          0.9258
3                0.8422          0.8402
4                0.9367          0.9463
5                0.9262          0.9404
Overall mean     0.92208         0.92792

4.4 EER, FAR, and FRR

The EER, FAR, and FRR scores before and after tuning are shown in Tables 10 and 11.


Table 10: The EER and FAR scores of each split, and the mean, before and after tuning the hyperparameters.

                 EER (optimal FAR)
Split            Before tuning   After tuning
1                0.1198          0.1152
2                0.2044          0.1956
3                0.2973          0.2432
4                0.25            0.25
5                0.2270          0.1837
Mean             0.2197          0.19754

Table 11: The FRR scores of each split, and the mean, before and after tuning the parameters.

                 FRR
Split            Before tuning   After tuning
1                0.125           0.125
2                0.1944          0.1944
3                0.2727          0.2424
4                0.2692          0.2308
5                0.2414          0.1724
Overall mean     0.2205          0.1930

4.5 ROC-curves

The receiver operating characteristic curves before and after the tuning of the parameters are shown in Figures 9 through 13.

(a) Split 1 before tuning (b) Split 1 after tuning

Figure 9: Receiver operating characteristic (ROC) curves for split 1, before and after tuning. The blue lines are the ROC, and the red dotted lines represent where TPR = FPR.


(a) Split 2 before tuning (b) Split 2 after tuning

Figure 10: Receiver operating characteristic (ROC) curves for split 2, before and after tuning. The blue lines are the ROC, and the red dotted lines represent where TPR = FPR.

(a) Split 3 before tuning (b) Split 3 after tuning

Figure 11: Receiver operating characteristic (ROC) curves for split 3, before and after tuning. The blue lines are the ROC, and the red dotted lines represent where TPR = FPR.

(a) Split 4 before tuning (b) Split 4 after tuning

Figure 12: Receiver operating characteristic (ROC) curves for split 4, before and after tuning. The blue lines are the ROC, and the red dotted lines represent where TPR = FPR.


(a) Split 5 before tuning (b) Split 5 after tuning

Figure 13: Receiver operating characteristic (ROC) curves for split 5, before and after tuning. The blue lines are the ROC, and the red dotted lines represent where TPR = FPR.


5 Discussion

Figure 8, mirrored by Figure 14 in the Appendix, shows that dwell time was the most important feature for the classification, both before and after tuning, followed by the flight time 1 feature. It is interesting that the time spent on each key is more individual than the time spent searching for the next one. This may also be because people tend to dwell on the keys while searching for the next key. The relative importance of the features did not change at all with tuning of the hyperparameters, but the F score evened out somewhat. The tuned models thus used the features in a more balanced way, which is an expected result of tuning.

Table 8 shows that the tuning of the hyperparameters led to changes in the value of almost all parameters, and across a large range between the different splits. This indicates significant differences between the splits. The learning rate, for example, varied from 0.08 in split 2 to 0.58 in split 3. Lambda, however, did not vary at all, and its optimal value remained at the default 1 across the splits.

The ROC curves in Figures 9 through 13, as well as Table 9, show little to no difference in the AUC scores before and after tuning. Table 9 shows a mean AUC score of 0.92 before tuning and 0.93 after tuning, which are both good scores. Split 1 performs best with an after-tuning score of 0.9860. As seen in Figures 15 and 16 in the Appendix, the AUC varies greatly between the users. 25 users show AUC scores of 1.00 (24 before the tuning), while one user in particular, username 'wowowo1', shows especially bad results with an AUC of 0.36 (0.27 before the tuning). This does not mean, however, that 'wowowo1' is to blame; the bad results might be due to the small size of the dataset. The user's input might have a good-quality pattern, just not one present in the training set data for that split, making the model unable to correctly classify the attempt data for 'wowowo1'. Split 1, which has the highest average, has a very homogeneous score, ranging from 0.95 to 1.0. Split 3, which performed worst in terms of AUC, has the most widespread range of 0.36 to 1.0. These outliers, on which the model predicts poorly, have a significant impact on the quality of the model.

Unlike AUC, both FAR and FRR did improve after tuning, which is promising. Tables 10 and 11 show that the best FAR (a value that by definition is equal to the EER) reached was 11.52%, and the average 19.75%. The FRR, although more irritating than critical for a security system, was also quite high at 12.5% at its best, and 19.30% on average. These are significantly higher values than the majority of the studies reviewed in the 'Background' subchapter, of which several reported FAR values lower than 1%, such as 0.03% (Maxion and Killourhy 2010)[1] and 0.76% (Azevedo et al. 2007).[26] The average is 7.9 percentage points higher than the best result of 11.8327% achieved using an SVM method in the original study on the Beihang Database made by Li et al., although the best split in this study (EER = 11.52%) actually performs better. When considering averages, however, the results from this study are not conclusive enough to claim that XGBoost performed better than the SVM method. It is also important to bring up that it is not straightforward to compare the original study with this one. Most importantly, Li et al. also augmented their data by using means, meaning that their model had more data to train on. This was planned in this study, but was cut due to time constraints.
Secondly, the original study is not clear on what type of classification was implemented. In this study, the decision to implement a binary type of classification may have impacted the results, and perhaps other types, such as multiclass classification, would have produced better values.

If comparing the FAR results with those of other security measures, such as fingerprints (which have a tested FAR of 0.01% [27]), it is important to remember that in these types of keystroke dynamics systems, FAR is not a measure of 'what is the risk that an impostor would get in', since the impostor first has to obtain the password (in fingerprint scans, one only has to have a finger). It is actually a measure of 'what is the risk that an impostor who has obtained the password would get in', which of course requires deeper analysis when comparing system accuracy.

All the measurements show a clear difference between the splits, which indicates that the size of the training sets and the quality (consistency) of the user input data were the most important factors behind the somewhat disappointing results. The smaller and more varied the datasets, the less likely it is to obtain a good generalized model. If the dataset had been bigger, the results would probably have been better.

6 Conclusions

This study aimed to investigate whether or not the framework XGBoost is a suitable choice for classifying the Beihang Keystroke Database. It has found that XGBoost produces similar results to other methods tried on the database, but it cannot be said from this study to be an extraordinarily good tool for use in Keystroke Dynamics. It might be that XGBoost works better on big-data problems, making it unsuitable for smaller security systems. However, based on research and the general hype around XGBoost, I conclude that it should definitely be investigated more in the future for work in the Keystroke Dynamics field, as this study had other issues that impacted the results.

Although the results in this study are in many cases significantly worse than those of other studies in the field, it is difficult to draw conclusions other than that the dataset matters, when the studies have been made on different datasets. The conclusion is that the dataset plays a significantly larger role in the outcome than the model used. The main downside to Database A of the Beihang Keystroke Dynamics Database was probably its size, with its 47 users (37-38 users to train on in each split), compared to, for example, the GreyC dataset with its 133 users. The smaller the amount of training data, the larger the risk of the model overfitting and not producing a general enough model, and of course, the more harmful a bad-quality sample is to the model. This is what might have occurred in this case. So the Beihang Keystroke Dynamics Database did not turn out as good as I initially thought, with its size not being sufficient and the samples not consistent enough. I do, however, believe that research should look more into free-text keystroke databases collected in relaxed environments, as I believe they represent more lifelike scenarios.

Better results might have been achieved if the hyperparameters had been tuned even further, and if the ROC curve had been utilized more. In terms of XGBoost, tuning of the hyperparameters did yield better results for AUC, EER, FAR, and FRR. This study thus suggests that tuning of the hyperparameters should be performed when applying XGBoost to a problem.

If trying to implement Keystroke Dynamics as a security measure, using machine learning as a classification method, I suggest, based on this study and prior research, either using a general model provided by an external operator, based on a large dataset, or training one's own model on site under careful monitoring of the data provided to assure quality. Then one can also train on all authorized users, turning what would otherwise be data leakage into an advantage. If one does not have enough users to warrant training, methods such as clustering and outlier detection may be more useful. To make the training data more consistent, the system could further be designed so that it discovers data anomalies already in the registration phase, forcing individuals to retype if they stray from their usual rhythm. If they cannot produce good enough data, maybe another method is required in their particular case.

Due to the generally high FAR across the field, however, I suggest that Keystroke Dynamics systems fare best in critical situations, and not for everyday use. Due to the quite high FAR, I also agree with previous claims that it is best used in tandem with other means of authentication, such as tokens and passwords.


7 Future Work

For future work within Keystroke Dynamics and XGBoost, I primarily suggest working with larger datasets; 50 or so users providing 4-5 training samples each proved to be too few. It may also be a good idea to focus on anomaly detection among the data samples, which did not make it into this study. When working with XGBoost, tuning of the hyperparameters should be used. Also omitted from this study was consideration of the trade-off between FAR and FRR levels, which can be used to tune the thresholds and produce a lower FAR. In addition, statistical evaluation of the results would be a good idea going forward. The research community should concentrate on producing a benchmark suite of datasets to be used in studies, preferably including different approaches to free and fixed text, long and short texts, and collection method.

8 Acknowledgments

I want to thank my subject reader Michael Ashcroft for insightful comments during the course of my project. I also want to thank my supervisors David Strömberg and Daniel Lindberg at Omegapoint AB for initiating the project, and for their guidance and support. Everyone else at Omegapoint, especially my fellow thesis workers, have also been a great source of support and brainstorming this spring. Last, but by no means least, I want to thank my family and friends for their never tiring support throughout my thesis, and indeed my entire education.


References

[1] Ali, Md Liakat et al. "Authentication and Identification Methods Used in Keystroke Biometric Systems". In: 2015 IEEE 17th Int. Conf. HPCC, 2015 IEEE 7th Int. Symp. CSS, 2015 IEEE 12th ICESS. IEEE, Aug. 2015, pp. 1424–1429.
[2] Ali, Md Liakat et al. "Keystroke Biometric Systems for User Authentication". In: J. Signal Process. Syst. 86.2-3 (Mar. 2017), pp. 175–190.
[3] Bauer, Eric et al. "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants". In: Mach. Learn. 36 (1999), pp. 105–139.
[4] Bergadano, Francesco, Gunetti, Daniele, and Picardi, Claudia. "User authentication through keystroke dynamics". In: ACM Trans. Inf. Syst. Secur. 5.4 (Nov. 2002), pp. 367–397.
[5] Bours, Patrick and Mondal, Soumik. "Continuous Authentication with Keystroke Dynamics". In: Recent Adv. User Authentication Using Keystroke Dyn. Biometrics. Ed. by Y. Zhong and Y. Deng. 2nd ed. 2015. Chap. 3, pp. 41–58.
[6] Chen, Tianqi. Introduction to Boosted Trees. 2014. url: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf (visited on 02/02/2018).
[7] Chen, Tianqi and Guestrin, Carlos. "XGBoost: A Scalable Tree Boosting System". In: CoRR abs/1603.0 (2016).
[8] DMLC, Distributed Machine Learning Community. XGBoost Documents. 2016. url: http://xgboost.readthedocs.io/en/latest/ (visited on 04/26/2018).
[9] DMLC, Distributed Machine Learning Community. xgboost. 2018. (Visited on 04/26/2018).
[10] Dziubiński, Maciek and Topolski, Bartosz. Building Supervised Models for User Verification — Part 1 of the Tutorial. 2018. url: https://blog.daftcode.pl/building-supervised-models-for-user-verification-part-1-of-the-tutorial-7496d5d394b9 (visited on 05/25/2018).
[11] Elith, J. et al. "A Working Guide to Boosted Regression Trees". In: J. Anim. Ecol. 77.4 (2008), pp. 802–813.
[12] Epp, Clayton, Lippold, Michael, and Mandryk, Regan L. "Identifying emotional states using keystroke dynamics". In: Proc. 2011 Annu. Conf. Hum. Factors Comput. Syst. - CHI '11. Vancouver: ACM Press, 2011, p. 715.
[13] Giot, Romain, El-Abed, Mohamad, and Rosenberger, Christophe. "GREYC keystroke: A benchmark for keystroke dynamics biometric systems". In: 2009 IEEE 3rd Int. Conf. Biometrics Theory, Appl. Syst. Washington, DC: IEEE, Sept. 2009, pp. 1–6.
[14] Gupta, Prashant. Cross-Validation in Machine Learning. 2017. url: https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f (visited on 05/05/2018).
[15] Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning. Springer Series in Statistics. 2nd ed. New York: Springer-Verlag New York, 2009.
[16] Karnan, M., Akila, M., and Krishnaraj, N. "Biometric personal authentication using keystroke dynamics: A review". In: Appl. Soft Comput. J. 11.2 (2011), pp. 1565–1573.
[17] Killourhy, Kevin S. and Maxion, Roy A. "Comparing anomaly-detection algorithms for keystroke dynamics". In: 2009 IEEE/IFIP Int. Conf. Dependable Syst. Networks. Lisbon: IEEE, June 2009, pp. 125–134.
[18] Li, Yilin et al. "Study on the BeiHang Keystroke Dynamics Database". In: 2011 Int. Jt. Conf. Biometrics. 2011, pp. 1–5.
[19] Bryan, William Lowe and Harter, Noble. "Studies on the telegraphic language: The acquisition of a hierarchy of habits." In: Psychol. Rev. 6.4 (1899), pp. 345–375.
[20] Monrose, Fabian and Rubin, Aviel D. "Keystroke dynamics as a biometric for authentication". In: Futur. Gener. Comput. Syst. 16.4 (2000), pp. 351–359.
[21] Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. Cambridge, Massachusetts: The MIT Press, 2012. isbn: 9780262018029.
[22] Ng, Annalyn. Decision Trees Tutorial – Algobeans. 2016. url: https://algobeans.com/2016/07/27/decision-trees-tutorial/ (visited on 03/23/2018).
[23] Ribeiro, Marco Tulio, Singh, Sameer, and Guestrin, Carlos. "Why Should I Trust You? Explaining the Predictions of Any Classifier". In: KDD '16 Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. San Francisco, California, USA: ACM, 2016, pp. 1135–1144.
[24] Shepherd, S. J. "Continuous Authentication By Analysis of Keyboard Typing Characteristics". In: Eur. Conv. Secur. Detect. IEE, 1995, pp. 16–18.
[25] Tape, Thomas G. The Area Under an ROC Curve. 2018. url: http://gim.unmc.edu/dxtests/Default.htm (visited on 05/07/2018).
[26] Teh, Pin Shen, Teoh, Andrew Beng Jin, and Yue, Shigang. "A survey of keystroke dynamics biometrics." In: ScientificWorldJournal 2013 (2013), p. 408280.
[27] Thakkar, Danny. Top Five Biometrics: Face, Fingerprint, Iris, Palm and Voice. 2017. url: https://www.bayometric.com/biometrics-face-finger-iris-palm-voice/ (visited on 06/19/2018).
[28] Umphress, David and Williams, Glen. "Identity verification through keyboard characteristics". In: Int. J. Man-Machine Stud. 23 (1985), pp. 263–273.
[29] Unar, J. A., Seng, Woo Chaw, and Abbasi, Almas. "A review of biometric technology along with trends and prospects". In: Pattern Recognit. 47.8 (Aug. 2014), pp. 2673–2688. doi: 10.1016/J.PATCOG.2014.01.016.
[30] Zeiler, Matthew D. and Fergus, Rob. "Visualizing and Understanding Convolutional Networks". In: ECCV 2014, LNCS 8689 (2014), pp. 818–833. url: https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf.


9 Appendix

9.1 Feature importance

(a) Feature importance graph of split 2 before tuning of hyperparameters. (b) Feature importance graph of split 2 after tuning of hyperparameters.

(c) Feature importance graph of split 3 before tuning of hyperparameters. (d) Feature importance graph of split 3 after tuning of hyperparameters.

(e) Feature importance graph of split 4 after tuning of hyperparameters. (f) Feature importance graph of split 4 before tuning of hyperparameters.


(g) Feature importance graph of split 5 after tuning of hyperparameters. (h) Feature importance graph of split 5 before tuning of hyperparameters.

Figure 14: Feature importance graph of each split, before and after tuning. The importance of each feature is plotted as a function of how many times each feature has been split on. ‘d’ represents dwell time, and ‘f1,2,3,4’ the four different flight time features.


9.2 AUC Scores

Figure 15: The AUC scores for each username before tuning


Figure 16: The AUC scores for each username after tuning
