Continuous Typist Verification Using Machine Learning
Total Page:16
File Type:pdf, Size:1020Kb
Department of Computer Science Hamilton, New Zealand Continuous Typist Verification using Machine Learning Kathryn Hempstalk This thesis is submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy at The University of Waikato. July 2009 c 2009 Kathryn Hempstalk ! Abstract A keyboard is a simple input device. Its function is to send keystroke informa- tion to the computer (or other device) to which it is attached. Normally this information is employed solely to produce text, but it can also be utilized as part of an authentication system. Typist verification exploits a typist’s pat- terns to check whether they are who they say they are, even after standard authentication schemes have confirmed their identity. This thesis investigates whether typists behave in a sufficiently unique yet consistent manner to enable an effective level of verification based on their typing patterns. Typist verification depends on more than the typist’s behaviour. The qual- ity of the patterns and the algorithms used to compare them also determine how accurately verification is performed. This thesis sheds light on all tech- nical aspects of the problem, including data collection, feature identification and extraction, and sample classification. A dataset has been collected that is comparable in size, timing accuracy and content to others in the field, with one important exception: it is derived from real emails, rather than samples collected in an artificial setting. This dataset is used to gain insight into what features distinguish typists from one another. The features and dataset are used to train learning algorithms that make judgements on the origin of previously unseen typing samples. These algorithms use “one-class classification”; they make predictions for a particular user having been trained on only that user’s patterns. This thesis examines many one-class classification algorithms, including ones designed specifically for typist verification. New algorithms and features are proposed to increase speed and accuracy. The best method proposed per- forms at the state of the art in terms of classification accuracy, while de- creasing the time taken for a prediction from minutes to seconds, and—more importantly—without requiring any negative data from other users. Also, it is general: it applies not only to typist verification, but to any other one-class classification problem. Overall, this thesis concludes that typist verification can be considered a useful biometric technique. i Acknowledgments This thesis could not have been completed without the assistance of a number of people, too many to mention them all individually. I am grateful to anyone who has helped me, supported me, and given me the strength necessary to complete such a substantial piece of work as this thesis. However, there are several groups who I would like to personally thank for their considerable input into my work. First, I would like to thank the Tertiary Education Commission and the University of Waikato for the generous scholarships they have provided me with. This thesis would not have been possible without their financial support. Second, I would like to express my gratitude to the authors of “Learning to Identify a Typist” and “Keystroke Analysis of Free Text” who kindly provided me with their datasets. Your work catalyzed my interest in typist verification, and the datasets provided me with great insights - which are reflected in this thesis. Third, I would like to thank my supervisors, Eibe and Ian, who have pro- vided me with the guidance, encouragement and academic aid necessary to complete this thesis. Ian, your confidence in my work and optimism was con- stantly reassuring - I feel truly honoured to have you as a supervisor. Equally, Eibe your advice was always welcome, and your assistance with technical as- pects of this thesis was much appreciated. Without either of you I am certain I would not have been able to complete this thesis. Last but not least, I would like to acknowledge the support provided by my family, friends, lab co-inhabitants, Team xG, pets and partner. Mum, Dad and Paul - you have always been there for me and words cannot express how grateful I am of that. To all my friends and colleagues, thank you. To those of you still working on your PhDs - hang in there! To Lily and Oscar, for always brightening my day and showing that cat-hugs really are the best kind of hugs. And to Mark - your patience, faith in me and care of me will forever be cherished. iii Contents Abstract i Acknowledgments iii 1 Introduction 1 1.1 Verification Systems . 2 1.2 Thesis Statement . 3 1.3 Motivation . 6 1.4 Contributions . 8 1.5 Thesis Structure . 9 2 Background 13 2.1 Biometrics . 15 2.1.1 Biometric System Processes . 16 2.1.2 Evaluating Biometric Measures . 17 2.2 Typing as a Biometric . 19 2.2.1 Terminology . 20 2.3 Password Hardening . 22 2.4 Static Typist Verification . 28 2.5 Continuous Typist Verification . 29 3 Replicating the State of the Art 35 3.1 Statistical Methodology . 35 3.2 Learning To Identify A Typist . 37 3.2.1 Algorithm . 38 3.2.2 Experimental Setup . 39 3.2.3 Results . 41 3.3 Keystroke Analysis of Free Text . 42 3.3.1 Algorithm . 42 3.3.2 Experimental Setup . 44 3.3.3 Results . 45 3.4 Comparison of Techniques . 45 3.4.1 Comparison Using the ROC Curve . 46 v 3.4.2 Comparison Using Datasets . 46 3.5 Summary . 49 4 Data Collection 51 4.1 Collecting Data . 52 4.1.1 Recording with SquirrelMail . 53 4.1.2 Technical Issues . 55 4.1.3 Ethical Issues . 57 4.2 Final Dataset . 58 4.3 Re-evaluation of Existing Algorithms . 60 5 Preliminary Experiments with New Techniques 63 5.1 Methodology . 64 5.2 PPM-Based Classifier . 65 5.2.1 Algorithm . 66 5.2.2 Results . 67 5.3 Spread Classifier . 68 5.3.1 Algorithm . 68 5.3.2 Results . 70 5.4 Context Classifier . 71 5.4.1 Algorithm . 71 5.4.2 Results . 72 5.5 Individual Digraph Classifier . 73 5.5.1 Algorithm . 74 5.5.2 Results . 75 5.6 Performance Comparison . 76 5.7 Summary . 77 6 Typist Behaviour 79 6.1 The Digraph . 80 6.2 From Digraphs to Finger Movements . 86 6.3 Key Usage . 88 6.4 Pausing . 89 6.5 Ordering of Events . 91 6.6 Speed and Error Rate . 93 6.7 Summary . 94 7 One-Class Classification 97 7.1 One-Class Classification versus Multi-Class Classification . 99 vi 7.1.1 Comparing Classifiers Fairly . 100 7.1.2 Evaluation Method . 102 7.1.3 Results . 103 7.1.4 Summary . 106 7.2 Methods of One-Class Classification . 107 7.3 Combining Density and Class Probability Estimation . 109 7.3.1 Combining Classifiers Using Bayes’ Rule . 110 7.3.2 Combined Classifier Performance . 112 7.3.3 Generating Different Proportions of Artificial Data . 114 7.3.4 Using Real Data . 115 7.3.5 Replacing Components . 116 7.4 Summary . 117 8 Typist Verification as One-Class Classification 119 8.1 Typing and One-Class Classification . 120 8.1.1 Methodology . 120 8.1.2 Features . 121 8.1.3 Results . 123 8.2 Adding Mouse Patterns . 125 8.2.1 Features . 126 8.2.2 Results . 128 8.3 Summary . 129 9 Conclusions 131 9.1 Collecting Data . 132 9.2 Identifying Channels of Information . 133 9.3 Requirements for System Training . 134 9.4 Revisiting the Hypothesis . 135 9.5 Future Work . 135 References 137 A Experiment Details 145 A.1 Purpose of Experiment . 145 A.1.1 Questionnaire . 145 A.1.2 Email Recording . 145 A.2 Mail Analysis . 150 A.3 Network Access . 154 vii A.4 Data Collection . 154 A.5 Data Archiving/Destruction . 155 A.6 Confidentiality . 155 A.7 Example Email Recording . 156 A.8 How to read an email recording . 157 B Instructions For Participants 159 B.1 Requirements . 159 B.2 Links . 159 B.3 CS Webmail . 159 B.4 Mail Analysis . 160 B.5 Recording Content . 166 B.6 Final Review . 166 C Questionnaire 167 D List of Top 54 Predictive Digraphs 171 viii List of Figures 2.1 Alphabetic layout (from [65]). 14 2.2 The original QWERTY keyboard layout (from [64]). 14 2.3 The enrollment process for biometrics. 17 2.4 The prediction process for biometrics. 18 3.1 The main workbench window . 36 3.2 Partial ROC curves for KAOFT and LTIAT . 47 6.1 Distribution of times for Backspace Backspace . ..