Modelling Human Behaviour Based on Similarity Measurements Between Event Sequences

by Hunter Orr

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science

Queen's University
Kingston, Ontario, Canada
May 2021

Copyright © Hunter Orr, 2021

Abstract

From a set of sequences, individuals' behavioural patterns can be identified. The metadata available for these sequences of events can be processed into a weighted format to improve the meaningfulness of the sequence comparisons. This process of identifying users' behavioural patterns is useful in a number of areas, such as cybersecurity. This work examines the properties a cybersecurity dataset might contain and demonstrates the effectiveness of the approach on a dataset with those properties.

Building on an existing sequence comparison method, the Damerau-Levenshtein distance, this work develops a pipeline of steps that can be used to transform the metadata and integrate this weighted format into the sequence comparison calculation. In this pipeline, one of the most significant transformations applied to the metadata is based on previous work by Brand. This transformation reduces the impact of high-popularity pairwise relationships. The pipeline is shown to incorporate the metadata information into the resulting distance values, producing meaningful changes that demonstrate the benefit of these extra steps.

Acknowledgments

I would like to thank Dr. David Skillicorn for his guidance and thoughtful discussions throughout this entire work. The discussions we had were invaluable and continue to encourage me to improve my thinking process. I appreciate the enthusiasm and encouragement you have provided throughout this research.

I would like to acknowledge Queen's University for the educational and professional opportunities it has provided. The experiences I have gained will serve me well in the future.
Additionally, this thesis would not have been possible without funding from the NSERC CREATE program.

Contents

Abstract
Acknowledgments
Contents
List of Tables
List of Figures
Chapter 1: Introduction
    1.1 The problem
    1.2 Significance
    1.3 Previous work
    1.4 This work
    1.5 Organization of Thesis
Chapter 2: Background
    2.1 Behavioural Modelling
    2.2 Surveys of Similarity Metrics
    2.3 Sequence Comparison
        2.3.1 Edit Distance
        2.3.2 Jaccard Similarity
        2.3.3 Weighted methods
        2.3.4 Discrete Cosine Transform (DCT)
    2.4 Tools
        2.4.1 CountVectorizer
        2.4.2 FastText
    2.5 Correlation Matrices
    2.6 Covariance Matrices
    2.7 Brand Method
    2.8 Clustering
    2.9 Visualization Techniques
Chapter 3: Experiment
    3.1 Research Objectives
    3.2 Tag-by-Tag
        3.2.1 Restaurant ID by Tag Frequency Matrix
        3.2.2 Problems in the Frequency Matrix
        3.2.3 Tag-by-Tag Correlation Matrix
        3.2.4 Tag-by-Tag Brand Matrix
    3.3 Comment-by-Comment
    3.4 Restaurant-by-Restaurant
        3.4.1 Weighted Sequence Comparison
    3.5 Restaurant-by-Restaurant (With Metadata)
    3.6 Restaurant-by-Restaurant (With Tag and Comment)
    3.7 User Clustering
    3.8 Summary
Chapter 4: Results
    4.1 Introduction
        4.1.1 Validation
    4.2 Tag-by-Tag
    4.3 Comment-by-Comment
    4.4 Restaurant-by-Restaurant Correlation Matrix
        4.4.1 Restaurant-by-Restaurant - From Comments
    4.5 Restaurant-by-Restaurant (With Tag)
    4.6 Restaurant-by-Restaurant (With Comments)
    4.7 Restaurant-by-Restaurant (With Tag and Comment)
    4.8 Clustering the distance matrices
    4.9 Summary
Chapter 5: Conclusion
Bibliography
Appendix A: Nearest Neighbour Results
Appendix B: Cluster Contents
    B.1 Cluster 0
    B.2 Cluster 37
    B.3 Cluster 45

List of Tables

3.1 An example of the tags data file
3.2 An example of the comments data file
4.1 cookies Nearest Neighbour
4.2 Nearest Neighbours First 10 Most Frequent Tags
4.3 Cluster 1 'Good' Examples
4.4 Cluster 1 Inappropriate Examples
4.5 Cluster 10 Examples
4.6 Cluster 30 Examples
4.7 DBSCAN results with varying parameters

List of Figures

3.1 Brand Connections Diagram
4.1 Tag Frequency Plot Before Truncation
4.2 Tag Frequency Plot After Upper and Lower Truncation
4.3 Tag-by-Tag Correlation Matrix - Before Truncation
4.4 Tag-by-Tag Correlation Matrix - After Truncation
4.5 Tag-by-Tag Brand Matrix
4.6 Comment Vectors visualized using SVD
4.7 Dendrogram - 50 Clusters
4.8 Comment-by-Comment Correlation Matrix
4.9 Comment-by-Comment Brand Matrix
4.10 Restaurant-by-Restaurant (From Tag) Correlation Matrix
4.11 Restaurant-by-Restaurant (From Tag) Brand Matrix
4.12 Nonweighted Damerau-Levenshtein Distance Similarity Matrix
4.13 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (From Tag) Correlation Matrix
4.14 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (From Tag) Brand Matrix
4.15 Restaurant-by-Restaurant (From Comments) Correlation Matrix
4.16 Restaurant-by-Restaurant (From Comments) Brand Matrix
4.17 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (From Comments) Correlation Matrix
4.18 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (From Comments) Brand Matrix
4.19 Restaurant-by-Restaurant (With Tag) Correlation Matrix
4.20 Restaurant-by-Restaurant (With Tag) Brand Matrix
4.21 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Tag) Correlation Matrix
4.22 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Tag) Brand Matrix
4.23 Restaurant-by-Restaurant (With Comments) Correlation Matrix
4.24 Restaurant-by-Restaurant (With Comments) Brand Matrix
4.25 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Comments) Correlation Matrix
4.26 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Comments) Brand Matrix
4.27 Restaurant-by-Restaurant (With Tag and Comments) Correlation Matrix
4.28 Restaurant-by-Restaurant (With Tag and Comments) Brand Matrix
4.29 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Tag and Comments) Correlation Matrix
4.30 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Tag and Comments) Brand Matrix

Chapter 1

Introduction

1.1 The problem

Every day humans make decisions about what to wear, what to eat, and what to do. These decisions are not based entirely on whim or will, but on a large amount of data that is being considered. They are choices shaped by the behavioural patterns of each individual, and together they form a sequence of events. These sequences are compared to each other in order to identify patterns, or similarities, and to find clusters of individuals who are similar in their behaviour. Typical decisions may include analyzing groups of people in proximity, identifying choices between different destination locations, or choosing an optimal outcome in a hypothetical situation. This concept of similarity comparison, which humans can successfully perform even as children, is the main element of this research.

An event sequence is a list of events that occur in a particular order. This list of events may share a common theme, such as the list of events to prepare a cup of tea: [boil water, add tea bag to cup, pour boiled water into the cup, add sugar (if required), add milk (if required), stir]. Event sequences are characterized by having a sequential order and may have a branching structure based on optional steps. Another component of these sequences is the extra data known about the situation.
Extra data, or metadata, is the additional information that can be collected in relation to the principal data. In this behavioural work, the metadata is used to inform the sequence comparison computation about underlying similarities between behaviours. This is key to the prime function of this research, which is to develop informed similarity matrices that relate behaviours to one another. Through this metadata, similar behaviours will have similar metadata relations, and the behaviours with the most metadata similarities will be the most similar. This data is generally only tangentially related to the topic area or field of the sequence itself. This can be demonstrated simply by considering the tea preparation sequence above. An important piece of metadata will inform the observer that replacing milk with cream is a minor change, whereas replacing milk with juice is a drastic one. In an example directly related to this work, where the main topic is comparing sequences of restaurant choices, the metadata is information about the similarities and differences of these restaurants. Such information might be the price range or the type of food served. In this work, the metadata will be the tags and comments associated with the restaurants.

To understand why this metadata is required, let us look at a simplified example of human intuition versus a standard algorithm's perception. For this example, consider three event sequences which represent the restaurants three individuals visited:

• Event sequence A consists of [PizzaHut, McDonalds]
• Event sequence B consists of [PizzaHut, Kelseys]
• Event sequence C consists of [PizzaHut, Harveys]

The commonality of PizzaHut is obvious. However, to identify similarities in the second restaurant choice, an analysis of the metadata associated with the restaurants needs to be conducted.
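
The three sequences above can be used to sketch what a standard algorithm sees. The Python example below is a minimal illustration, not the implementation used in this thesis: it assumes the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, and the names dl_distance, unit_cost, toy_cost and TOY_COST are hypothetical. The value 0.2 in TOY_COST is a made-up placeholder for a metadata-derived weight and is not a result from this work.

```python
from typing import Callable, Sequence


def unit_cost(a: str, b: str) -> float:
    """Unweighted substitution: any two different events cost 1."""
    return 0.0 if a == b else 1.0


def dl_distance(
    s: Sequence[str],
    t: Sequence[str],
    sub_cost: Callable[[str, str], float] = unit_cost,
) -> float:
    """Restricted Damerau-Levenshtein distance between two event sequences.

    Assumption: insertions, deletions and adjacent transpositions cost 1;
    only the substitution cost is pluggable.
    """
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitution
            )
            # Swap of two adjacent events.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[n][m]


A = ["PizzaHut", "McDonalds"]
B = ["PizzaHut", "Kelseys"]
C = ["PizzaHut", "Harveys"]

# Without metadata, every pair differs by exactly one substitution.
unweighted = {
    "AB": dl_distance(A, B),
    "AC": dl_distance(A, C),
    "BC": dl_distance(B, C),
}

# Hypothetical placeholder weight: pretend metadata marks this pair as similar.
TOY_COST = {frozenset(("McDonalds", "Harveys")): 0.2}


def toy_cost(a: str, b: str) -> float:
    """Illustrative metadata-informed substitution cost (made-up values)."""
    if a == b:
        return 0.0
    return TOY_COST.get(frozenset((a, b)), 1.0)


weighted = {
    "AB": dl_distance(A, B, toy_cost),
    "AC": dl_distance(A, C, toy_cost),
    "BC": dl_distance(B, C, toy_cost),
}

print(unweighted)
print(weighted)
```

With unit costs the three pairwise distances are identical, so the algorithm cannot tell which second choices are alike. Once the substitution cost depends on some notion of restaurant similarity, the distances separate; supplying that notion is the role the metadata plays.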