Modelling Human Behaviour Based on Similarity Measurements Between Event Sequences

by

Hunter Orr

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science

Queen’s University
Kingston, Ontario, Canada
May 2021

Copyright © Hunter Orr, 2021

Abstract

From a set of sequences, individuals’ behavioural patterns can be identified. Using these sequences of events, the available metadata can be processed into a weighted format to improve the meaningfulness of the sequence comparisons. This process of identifying users’ behavioural patterns is important in a number of areas such as cybersecurity. This work examines the properties a cybersecurity dataset might contain and demonstrates the effectiveness of the approach on a dataset with those properties. Building on an existing sequence comparison method, Damerau–Levenshtein distance, this work develops a pipeline of steps that can be used to transform the metadata and integrate this weighted format into the sequence comparison calculation. In this pipeline, one of the most significant transformations applied to the metadata is based on previous work by Brand. This transformation reduces the impact of high-popularity pairwise relationships. The pipeline is shown to incorporate the metadata information into the resulting distance values, producing meaningful changes which demonstrate the benefit of these extra steps.

Acknowledgments

I would like to thank Dr. David Skillicorn for his guidance and thoughtful discussions throughout this entire work. The discussions we had were invaluable and continue to encourage me to improve my thinking process. I appreciate the enthusiasm and encouragement you have provided throughout this research. I would like to acknowledge Queen’s University for the educational and professional opportunities it has provided. The experiences I have gained will serve me well in the future. Additionally, this thesis would not have been possible without funding from the NSERC CREATE program.

Contents

Abstract

Acknowledgments

Contents

List of Tables

List of Figures

Chapter 1: Introduction
    1.1 The problem
    1.2 Significance
    1.3 Previous work
    1.4 This work
    1.5 Organization of Thesis

Chapter 2: Background
    2.1 Behavioural Modelling
    2.2 Surveys of Similarity Metrics
    2.3 Sequence Comparison
        2.3.1 Edit Distance
        2.3.2 Jaccard Similarity
        2.3.3 Weighted methods
        2.3.4 Discrete Cosine Transform (DCT)
    2.4 Tools
        2.4.1 CountVectorizer
        2.4.2 FastText
    2.5 Correlation Matrices
    2.6 Covariance Matrices
    2.7 Brand Method
    2.8 Clustering

    2.9 Visualization Techniques

Chapter 3: Experiment
    3.1 Research Objectives
    3.2 Tag-by-Tag
        3.2.1 Restaurant ID by Tag Frequency
        3.2.2 Problems in the Frequency Matrix
        3.2.3 Tag-by-Tag Correlation Matrix
        3.2.4 Tag-by-Tag Brand Matrix
    3.3 Comment-by-Comment
    3.4 Restaurant-by-Restaurant
        3.4.1 Weighted Sequence Comparison
    3.5 Restaurant-by-Restaurant (With Metadata)
    3.6 Restaurant-by-Restaurant (With Tag and Comment)
    3.7 User Clustering
    3.8 Summary

Chapter 4: Results
    4.1 Introduction
        4.1.1 Validation
    4.2 Tag-by-Tag
    4.3 Comment-by-Comment
    4.4 Restaurant-by-Restaurant Correlation Matrix
        4.4.1 Restaurant-by-Restaurant - From Comments
    4.5 Restaurant-by-Restaurant (With Tag)
    4.6 Restaurant-by-Restaurant (With Comments)
    4.7 Restaurant-by-Restaurant (With Tag and Comment)
    4.8 Clustering the distance matrices
    4.9 Summary

Chapter 5: Conclusion

Bibliography

Appendix A: Nearest Neighbour Results

Appendix B: Cluster Contents
    B.1 Cluster 0
    B.2 Cluster 37
    B.3 Cluster 45

List of Tables

3.1 An example of the tags data file
3.2 An example of the comments data file

4.1 cookies Nearest Neighbour
4.2 Nearest Neighbours First 10 Most Frequent Tags
4.3 Cluster 1 ‘Good’ Examples
4.4 Cluster 1 Inappropriate Examples
4.5 Cluster 10 Examples
4.6 Cluster 30 Examples
4.7 DBSCAN results with varying parameters

List of Figures

3.1 Brand Connections Diagram

4.1 Tag Frequency Plot Before Truncation
4.2 Tag Frequency Plot After Upper and Lower Truncation
4.3 Tag-by-Tag Correlation Matrix - Before Truncation
4.4 Tag-by-Tag Correlation Matrix - After Truncation
4.5 Tag-by-Tag Brand Matrix
4.6 Comment Vectors visualized using SVD
4.7 Dendrogram - 50 Clusters
4.8 Comment-by-Comment Correlation Matrix
4.9 Comment-by-Comment Brand Matrix
4.10 Restaurant-by-Restaurant (From Tag) Correlation Matrix
4.11 Restaurant-by-Restaurant (From Tag) Brand Matrix
4.12 Nonweighted Damerau–Levenshtein Distance Similarity Matrix
4.13 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (From Tag) Correlation Matrix
4.14 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (From Tag) Brand Matrix
4.15 Restaurant-by-Restaurant (From Comments) Correlation Matrix

4.16 Restaurant-by-Restaurant (From Comments) Brand Matrix
4.17 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (From Comments) Correlation Matrix
4.18 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (From Comments) Brand Matrix
4.19 Restaurant-by-Restaurant (With Tag) Correlation Matrix
4.20 Restaurant-by-Restaurant (With Tag) Brand Matrix
4.21 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Tag) Correlation Matrix
4.22 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Tag) Brand Matrix
4.23 Restaurant-by-Restaurant (With Comments) Correlation Matrix
4.24 Restaurant-by-Restaurant (With Comments) Brand Matrix
4.25 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Comments) Correlation Matrix
4.26 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Comments) Brand Matrix
4.27 Restaurant-by-Restaurant (With Tag and Comments) Correlation Matrix
4.28 Restaurant-by-Restaurant (With Tag and Comments) Brand Matrix
4.29 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Tag and Comments) Correlation Matrix
4.30 Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Tag and Comments) Brand Matrix


Chapter 1

Introduction

1.1 The problem

Every day humans make decisions about what to wear, what to eat, and what to do. These decisions are not based entirely on whim or will but rather on a large amount of data that is being considered; they are choices grounded in the behavioural patterns of each individual, and they form a sequence of events. These sequences can be compared to each other in order to identify patterns, or similarities, and to find clusters of individuals who are similar in their behaviour. Typical decisions may include analyzing groups of people in proximity, identifying choices between different destination locations, or choosing an optimal outcome in a hypothetical situation. This concept of similarity comparison, which humans can successfully perform even as children, is the main element of this research. Event sequences are a list of events that occur in a particular order. This list of events may share a common theme, such as the list of events to prepare a cup of tea: [boil water, add tea bag to cup, pour boiled water into the cup, add sugar (if required), add milk (if required), stir]. Event sequences are characterized by having a sequential order and may have a branching structure based on optional steps.

Another component of these sequences is the extra data known about the situation. Extra data, or metadata, is the additional information that can be collected related to the principal data. In this behavioural work, the metadata is used to inform the sequence comparison computation about underlying similarities between behaviours. This is key to the prime function of this research, which is to develop informed similarity matrices that relate behaviours together. Through this metadata, similar behaviours will have similar metadata relations, and those behaviours with the most metadata similarities will be the most similar. This data is generally only tangentially related to the topic area or field that is within the sequence. An example can be demonstrated by considering the sequence of preparing tea above: an important piece of metadata will inform the observer that replacing milk with cream is a minor change compared to a drastic change such as replacing milk with juice. In an example directly related to this work, where the main topic is comparing sequences of restaurant choices, the metadata is information about the similarities and differences of these restaurants. Such information might be the price range or the type of food served. In this work, the metadata will be the tags and comments associated with the restaurants. To understand why this metadata is required, let us look at a simplified example of human intuition versus a standard algorithm’s perception. For this example, consider three event sequences which represent the restaurants three individuals visited:

• Event sequence A consists of [PizzaHut, McDonalds]

• Event sequence B consists of [PizzaHut, Kelseys]

• Event sequence C consists of [PizzaHut, Harveys]

The commonality of PizzaHut is obvious. However, to identify similarities in the second restaurant choice, an analysis of the metadata associated with the restaurants needs to be conducted. From a standard algorithm’s perspective, using the edit distance metric, all three event sequences have the same edit cost and therefore are perceived as the same distance apart, equally similar: they share one element, PizzaHut, and one element is different. However, this conclusion fails to grasp the added complexity that a human actor would see when comparing these sequences. A human would likely suggest event sequences A and C are more similar due to additional knowledge from the human’s experience. The edit distance algorithm is missing elements that would allow a more sophisticated similarity comparison to be made.

1.2 Significance

Many societal transactions confer an advantage on those who can predict an individual’s actions. Marketing, for example, is a major field designed to predict the purchasing patterns of consumers. Through the identification and analysis of consumer patterns, an organization can adapt its marketing campaigns to solicit greater interest and influence each consumer’s behaviour. Marketing is an obvious example where similarity comparisons, based on behavioural models, can lead to identifying consumer groupings. These groupings can allow for precise targeting of ads to a group that you, as the marketer, are more likely to sell to. Understanding which abstract concept is the key link in a particular group is a greater challenge. Despite this, it would be broadly beneficial in a variety of areas to be able to develop an accurate model that identifies the group an individual belongs to by correlating their behaviour to that of a larger sample of other users.

An analysis could then use these similarities to identify commonalities among the clustered group of individuals. This analysis has a variety of applications in various fields including marketing transactions (e.g., product recommendation tooling found in various e-commerce websites), content service providers (e.g., Netflix and YouTube), and security areas (e.g., anomalous behaviour detection). Similarity values can be obtained by comparing the sequence of events that makes up each user’s individual behavioural pattern to the pool of other users’ event sequences. In addition to these sequences, metadata can be used to aid in producing comparisons with a higher accuracy. Thus, by analyzing a large group of users, a specific individual’s choices could be matched with other similar users by assessing other individuals within the data and determining the similarity value between these users based on their behavioural sequences.

1.3 Previous work

There is a significant body of research in the area of similarity metrics. Some of these techniques include mathematically based approaches such as the Manhattan distance and the Euclidean distance calculations. Other approaches include text-based analysis, which attempts to resolve the challenges of translating unconstrained text into useful data elements. Gomaa and Fahmy completed a survey of text-based approaches and identified three categories: character-based similarity, corpus-based similarity, and knowledge-based similarity[11]. Character-based similarity includes methods such as Longest Common Substring and N-gram, and is focused on identifying the similarity of text strings based on characters or character subsets being compared to other similar groupings.

Corpus-based similarity methods include methods such as Normalized Google Distance, which determines similarities from the underlying data in the corpus, a larger document or information set. Knowledge-based similarity approaches include methods such as WordNet, which measures the degrees of similarity between words based on a broader understanding of the semantics as derived by the networks used. For handling sequences specifically, sequence comparison methods such as Edit Distance and the Jaccard Similarity Index are commonly used in the field as a baseline. Further research has derived weighted versions of these methods to produce results with greater accuracy by utilizing the metadata. In this work, the weights will be the metadata byproducts which are used to increase utility in the final sequence comparison. Cosine Similarity is a common similarity measurement defined by the angle between vectors when their magnitudes have been normalized to one. This data is compiled into a correlation matrix. A correlation matrix is a matrix where each element at (i, j) represents the correlation value between the two elements defined by the row value i and the column value j. For example, in a restaurant-by-restaurant correlation matrix, which stores the similarity values between a list of restaurants, the values i and j would each represent specific restaurants. Within the results of the cosine similarity correlation matrix, a significant problem with this type of data can be identified: popularity is an overwhelming factor that skews the similarity connection values. Another method, identified by Brand, is applied to this matrix to deal with this popularity factor[6]. This method uses the Moore-Penrose inverse of the Laplacian representation of the correlation matrix to produce a new correlation matrix which discounts the effects of the popular elements.

This process can be thought of as a geometric embedding on a hypersphere where the angular distances can be measured without the overwhelming influence of popularity. A key aspect of this research is the principle that a user can be more accurately grouped with other similar users by leveraging metadata to improve the element-by-element interactions of the sequence comparison.

1.4 This work

In this work, the similarity between behavioural models in the form of sequences will be analyzed, and the methods for determining these similarity values are improved upon. The aim of this work is to implement the multiple techniques identified above to produce results with greater accuracy by including the metadata in the sequence comparison. In the data selected, each user has a sequence of restaurants which tracks the order in which that individual visited restaurants. The restaurants are the elements of the sequence which represent behavioural choices. The sequence represents overall behavioural information about the user which can be used to compare different individuals and identify commonalities. In order to have a more meaningful characterization of the relationship between two individuals, a distance measure will be used to quantify the difference. This distance measure can be calculated by examining two sequences and identifying how many restaurants are similar between the two sequences. However, individual restaurants, when compared to one another, can also have similarities. This is the information that will be extracted from the metadata so that this restaurant-to-restaurant similarity can also be considered when comparing sequences of restaurants.


1.5 Organization of Thesis

Chapter 2 will discuss the previous work done in related areas. In Chapter 3, the experiment that was conducted will be explained step by step; the methods used for data preparation, data processing, and data visualization will be discussed. In Chapter 4, the results of this experiment will be discussed along with their implications. This fourth chapter is structured similarly to Chapter 3 for ease of comparison. Chapter 5 concludes the document and discusses the limitations of this research and future work.

Chapter 2

Background

The first section discusses how behavioural patterns were analyzed to produce groupings of similar elements. Further work examines sequences of behaviours to identify these groupings. In the comparison of these groupings, a range of distance metrics were examined, and from these distance metrics several sequence comparison techniques are identified. Finally, some of the standard tools and techniques used in this area are identified. This culmination of research represents the state of this subject area and the research that has been conducted in the field. However, it should not be considered a comprehensive review of all literature for this broad research area. Similarity is simply a measure of the commonalities or differences between two objects. This measurement is performed countless times a day by many people and permeates our lives. When attempting to replicate or analyze human interactions, the behaviour that is displayed by humans is key to identifying the commonalities in these interactions. Humans are complex creatures and exhibit complex actions, or behaviours. These behaviours on their own are likely not an indication of a long-term pattern that would be useful in identifying other people who exhibit similar behaviour. Therefore, it is necessary to examine a larger range of these behaviours to gain a more informed model of what a person is likely to do.

Thus, the field of behavioural similarity attempts to use information about the interactions a person, or persons, have with a system to compare different people or groups of people with greater accuracy. By examining the sequences of these behaviours, one can model the behavioural patterns that are present throughout a behavioural sequence. Using this data, similar behavioural patterns can be grouped together and group identities can be discovered.

2.1 Behavioural Modelling

A paper by Carter et al. establishes the foundation of behavioural analysis for the purposes of grouping individuals[7]. The work discusses analyzing the behavioural patterns of entrepreneurs who were still in the process of starting a business and had not yet transitioned to business ownership. This work analyzes what types of behaviours individuals engage in when they are trying to establish a new business. This information was then used to develop behaviour sequences to separate entrepreneurs who are still trying to launch a business from those who have since quit. By analyzing the patterns of behaviours of those entrepreneurs who have quit versus those who are still trying or have succeeded, important distinctions between groups of people can be established. Taylor also demonstrates similar concepts of behavioural modelling with his work on measuring the interrelationships of sequences of behaviour[18]. This work focuses on the occurrence of a behaviour, how often this behaviour co-occurs with another behaviour, and the ordering of the sequence. A challenge identified by this pattern when examining co-occurring behaviour is the size of the window. Different problems require different sized windows, and the window size can disrupt meaningful connections that would otherwise be desirable.

This work sets the groundwork on methods that exist for examining the proximity of sequences of behaviours. The paper discusses its own method for calculating this proximity and the associated connections one can make from this information. Barbic et al. use similar concepts of behavioural patterns to segment motion capture footage[4]. In the paper, the authors discuss using various techniques to segment motion capture data, where the data represents different behaviours which can be identified by the relations between various segments of data. They evaluate three methods (Principal Component Analysis (PCA), Probabilistic PCA, and Gaussian Mixture Models) to identify relations in the data and segment the motion data into distinct behaviours. Other examples of behavioural modelling include medical assist devices; modelling users based on their interactions with vehicles within the first 20 seconds; modelling behaviour in the form of DNA to leverage existing methods of comparison; smart home user identification and experience improvement; and identifying users’ skill levels based on their behavioural sequences, with applications in education. Forbes et al. discuss the dangers of falling, which increase as a person ages. They propose using machine learning on behavioural sequences extracted from home sensors or wearables[10]. In this paper, the ideas behind modelling the daily human behaviour of the individuals being studied are described. They use activities of daily living (ADLs) to measure the morbidity of the individuals. Additionally, once a pattern has been established, anomalous behaviour can be detected where a person might miss a step or deviate from their ADLs in a significant way.

These behaviours are linked to activity recognition by using a number of techniques such as Hidden Markov Models. This paper discusses the improvement that can be gained in fall prediction by using behavioural modelling. Kar et al. propose using behavioural patterns to identify the users of a vehicle based on a sequence of their actions. The method proposed can attribute a sequence to a particular user with over 90 percent accuracy within 20 seconds of the first door opening[13]. Pre-trip events such as door opening, door closing, ignition starting, and seat belt fastening are all used to construct these pre-trip sequences. Based on the consistency of these events unfolding in a common pattern for the same user, the user can be identified. The model gathers its data from the car’s systems. A vector of time differentials of specific events, such as door opening and closing, is used. Another matrix is created which stores the variables obtained from the vehicle bus, giving information on a number of features such as the accelerator pedal, brake pedal, turn signal position, and more. This data trains an SVM with 5-fold validation. In real time the program can use the variables discussed above and identify the current driver based on the model trained on previous data samples from this user. Additionally, unsupervised methods were tried, including k-means clustering and hierarchical clustering. This research found that the SVM performed slightly better than the unsupervised methods: the SVM reached an accuracy of 89 percent, while the unsupervised methods only achieved an accuracy of 84 percent in the same 20 seconds on a trace of the same length.

A number of use cases for identification were suggested, including personalized audio, temperature, and dashboard settings. Another feature of this identification is the anomalous detection of an unknown individual using a vehicle. Cresci et al. propose a novel method to represent behavioural sequences[9]. This method involves the use of Digital DNA, defined by the authors by analogy with regular DNA, which is composed of four unique parts and thus has four corresponding symbols representing each atomic part; the DNA is then a sequence of these symbols. Similarly, Digital DNA is a method for storing behavioural information in this format. The number of characters and the meaning of each character are set depending on the task at hand. Cresci et al. state that the advantage of this method is that it leverages the existing tool set in the field of bioinformatics to compare these sequences with one another. Cresci et al.’s work is demonstrated on a Twitter spambot network. The new model demonstrates how a Twitter timeline can be composed as a Digital DNA sequence. This timeline can then be compared to other timelines using the Longest Common Substring (LCS). This algorithm aids in determining which Digital DNA is most similar to another by identifying those sequences that share long chains of similar events. Their work concludes that this method provides greater accuracy than the other methods discussed, with an accuracy of 0.929. Koschmider explores the idea of behavioural modelling through the clustering of event traces[15]. The objective of this paper was to leverage smart home technologies, and the data they provide, to improve the experience of the users. Koschmider notes that existing mining algorithms do not produce useful models. Instead, Koschmider presents a new behavioural modelling technique. This new technique deconstructs the log files from the smart home sensors and constructs behavioural sequences based on the events found within the logs.

These sequences are compared with one another through morphing and assignment algorithms which base their approach on Levenshtein distance. The event sequences are then clustered based on the connections found by these morphing and assignment algorithms. Koschmider concluded that their technique was superior with regard to identifying behavioural deviations. Unlike other related work examined, Koschmider’s research was able to identify missing activities or exogenous factors. This was attributed to the precise clustering of the behaviours. As an example that can be targeted at a consumer market, Loh and Sheng developed an expert-versus-novice identification system based on behavioural indicators that could be extracted from a game and placed in a sequence[16]. Loh and Sheng acknowledge the difficulty that education systems have with including informative games in their curriculum. This problem has multiple parts, including the barriers to creating a fun game that students want to play, along with the requirement to measure how well a student is performing with regard to the knowledge the game is supposed to be imparting. Loh and Sheng tackle the success measurement problem by using string similarity methods. The authors gather the data from the game and construct sequences which can be compared to one another, or to an expert. Five string similarity methods were tested: Levenshtein Similarity, Lee Similarity, Jaccard Coefficient, Dice Coefficient, and Jaro-Winkler Distance. The method best at distinguishing between experts and novices was the Jaccard Coefficient, which provided the best results in all the testing that was conducted. Thus, this research suggested that a better market alternative could be made to fit the needs of both a fun and interactive game, along with a methodology for ensuring the educational requirements are being met.

The field of behavioural modelling and similarity is dependent on human behaviour, and since human behaviour is relevant in a large number of fields, behavioural modelling can be applied to any of those fields as well. With this baseline, looking at the different ways behavioural modelling is being used and a brief description of some of the methods used to conduct this type of work, the focus of this research can be discussed.

2.2 Surveys of Similarity Metrics

With an understanding of behavioural modelling and the sequences of behaviour that can be extracted from these data sources, the ways in which these behaviours and sequences of behaviours are compared need to be considered. In this next section, a variety of metrics will be examined through a number of surveys which provide an overview of the metrics that exist. Gomaa and Fahmy, in 2013, conducted a survey of text-based similarity approaches[11]. They identified three types of text similarity: string-based similarity, corpus-based similarity, and knowledge-based similarity. String-based similarity shares many key characteristics with the similarity metrics I will use in my research. Other work by Cha et al. assessed binary similarity as it relates to distance metrics[8]. Their paper identified seventy-six distance measurements that were compiled in a hierarchical cluster to determine the amount of overlap that the various methods share with their counterparts. By grouping the most correlated metrics, the relationship between the groupings can be assessed through simple matching or probabilistic matching. Cha et al.’s work is significant for its identification of a multitude of distance metrics.


2.3 Sequence Comparison

With a foundation in some of the metrics used to evaluate the difference between two behavioural elements, this research turns to sequences of behaviours and the information these sequences contain. Thus, given sequences of behaviour related to each user, methods for measuring the similarity between sequences of behaviour are needed.

2.3.1 Edit Distance

The Edit distance is a metric that counts the number of changes required for one sequence to match another sequence. This distance metric has several variants, briefly described below:

• Hamming Distance - Only Substitutions

• Jaro Distance - Only Transpositions

• Longest Common Subsequence (LCS) - Insertions and Deletions

• Levenshtein Distance - Insertions, Deletions, and Substitutions

• Damerau–Levenshtein Distance - Insertions, Deletions, Substitutions, and Transpositions
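To make the richest variant above concrete, the following is a minimal sketch (not the exact code used in this work) of the restricted Damerau–Levenshtein distance over event sequences, applied to the restaurant sequences from Section 1.1:

    def damerau_levenshtein(a, b):
        # d[i][j]: distance between the first i events of a and the first j of b
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[-1][-1]

    A = ["PizzaHut", "McDonalds"]
    B = ["PizzaHut", "Kelseys"]
    C = ["PizzaHut", "Harveys"]
    print(damerau_levenshtein(A, B), damerau_levenshtein(A, C))  # 1 1

Because every substitution costs a flat 1, sequences B and C are the same distance from A; this is exactly the limitation that the weighted methods below address.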

2.3.2 Jaccard Similarity

The Jaccard similarity, also known as the Jaccard index, is a metric that counts the number of intersecting elements and divides this value by the total number of elements[3]. Jaccard similarity can be expressed by the following equation:


d(x, y) = \frac{|X \cap Y|}{|X \cup Y|} \qquad (2.1)

In the equation above, d(x, y) represents the distance measurement, X represents one set of elements, and Y represents the other set. An example use case comes from Wikipedia: the Jaccard index is computed with the number of users who have edited both of two pages as the numerator, and the number of users who have edited either page as the denominator. The Wikipedia similarity team produced a document which identifies how Wikipedia uses the Jaccard similarity coefficient to reduce entity pairs[19]. In this document, Jaccard is compared with the Dice and cosine similarity coefficients to determine which method is most suitable for ranking page importance. Based on these results it is clear cosine is the ideal choice for ranking.
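As a small illustration of Equation 2.1, here is a sketch of the Jaccard computation on the Wikipedia-style example; the editor sets are invented for illustration:

    def jaccard(x, y):
        # |intersection| / |union| of two sets of elements
        x, y = set(x), set(y)
        return len(x & y) / len(x | y)

    page_a_editors = {"u1", "u2", "u3", "u4"}
    page_b_editors = {"u3", "u4", "u5"}
    print(jaccard(page_a_editors, page_b_editors))  # 2 shared / 5 total = 0.4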

2.3.3 Weighted methods

The sequence comparison techniques discussed above are not sophisticated enough to handle the more complex information that is contained within the elements of these sequences. The nuance and context that these elements are a part of are not fully considered in the standard sequence comparison techniques. However, these techniques can be modified to be weighted. Weighting these comparisons simply means that a portion or all of the calculation value will be dependent on a weighting matrix which influences the result. Edit Distance, for example, could be replaced with Weighted Edit Distance, and the substitution cost, originally 1 or 0, could be replaced with a value representing how similar the two objects being substituted are, for example a float between 0 and 1.

A paper by Ioffe discusses multiple ways to potentially improve methods such as Jaccard similarity with weighted additions to their equations[12]. The weighted Jaccard similarity is:

J(S, T) = \frac{\sum_k \min(S_k, T_k)}{\sum_k \max(S_k, T_k)} \qquad (2.2)

where S and T are the sequences between which the user wants to find the similarity value.
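Below is a sketch of the weighted-substitution idea described above. The similarity lookup sim is a hypothetical stand-in for the restaurant-by-restaurant weighting matrices developed later in this work; this is illustrative, not the exact implementation used here:

    def weighted_edit_distance(a, b, sim):
        # substitution costs 1 - similarity instead of a flat 1
        d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = float(i)
        for j in range(len(b) + 1):
            d[0][j] = float(j)
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    sub = 0.0
                else:
                    sub = 1.0 - sim.get(frozenset((a[i - 1], b[j - 1])), 0.0)
                d[i][j] = min(d[i - 1][j] + 1.0, d[i][j - 1] + 1.0,
                              d[i - 1][j - 1] + sub)
        return d[-1][-1]

    # hypothetical weight: two fast-food burger restaurants are 0.8 similar
    sim = {frozenset(("McDonalds", "Harveys")): 0.8}
    print(weighted_edit_distance(["PizzaHut", "McDonalds"],
                                 ["PizzaHut", "Harveys"], sim))  # 0.2
    print(weighted_edit_distance(["PizzaHut", "McDonalds"],
                                 ["PizzaHut", "Kelseys"], sim))  # 1.0

With the weight in place, sequences A and C from Chapter 1 become closer than A and B, matching the human intuition described there.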

2.3.4 Discrete Cosine Transform (DCT)

The Discrete Fourier Transform (DFT) was briefly considered but set aside in favour of the Discrete Cosine Transform (DCT). Both methods transform a series of points into a sum of basis functions: the DFT uses both sine and cosine functions for its representation, while the DCT only uses cosine functions. The mapping done by the DFT, which results in the frequency domain, can make comparisons more efficient in the lower-dimensional space. Agrawal, Faloutsos and Swami discuss the use of the Discrete Fourier Transform to improve the efficiency of similarity processing[1]. As stated above, the Discrete Cosine Transform approximates a series of points by a series of cosine functions. This method has been identified as useful in pattern recognition[2]. It is believed that the distance between any two sequences can be represented by the first few coefficients, similar to the DFT[1]. One version of the DCT (the multidimensional DCT-II) is shown below:

X_{k_1, k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{n_1, n_2} \cos\left[\frac{\pi}{N_1}\left(n_1 + \frac{1}{2}\right) k_1\right] \cos\left[\frac{\pi}{N_2}\left(n_2 + \frac{1}{2}\right) k_2\right] \qquad (2.3)
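As a brief sketch of the idea, the first few DCT-II coefficients of two series can be compared directly; the series below are synthetic, and SciPy's dct function is assumed to be available:

    import numpy as np
    from scipy.fft import dct

    x = np.sin(np.linspace(0, 4 * np.pi, 64))
    y = np.sin(np.linspace(0, 4 * np.pi, 64) + 0.1)

    k = 8  # keep only the first few coefficients
    dx = dct(x, type=2, norm="ortho")[:k]
    dy = dct(y, type=2, norm="ortho")[:k]
    print(np.linalg.norm(dx - dy))  # distance in the truncated DCT space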

2.4 Tools

Sequence comparison is only the top-level step. As discussed, these sophisticated comparison methods require weighting matrices to inform them about the underlying relationships between the elements of the sequences. Producing these weighting matrices requires a number of processing steps. The tools below are common data analytic methods that will be used within this work.

2.4.1 CountVectorizer

The first component of constructing this weighting matrix is converting the raw data into a usable form. CountVectorizer is a tool found within the scikit-learn feature extraction library1. This tool converts a collection of documents into a corresponding document-word array: an array that records the number of times a particular word (column) appeared in a provided document (row). CountVectorizer takes in a number of “documents” to be analyzed for the frequency assessment. As output, the tool passes back a series of rows, with the total vocabulary of words as columns. Each index represents the frequency of a word in each document.
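A small usage sketch on toy documents (the documents themselves are invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["brunch happy hour irish pub",
            "french soho zagat-rated",
            "brunch french byob"]
    vec = CountVectorizer()
    freq = vec.fit_transform(docs)       # sparse document-by-word matrix
    print(vec.get_feature_names_out())   # the vocabulary (column labels)
    print(freq.toarray())                # per-document word frequencies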

2.4.2 FastText

FastText is an open source software product that allows for the training of text embeddings. FastText is trained on string data to learn the textual patterns. FastText uses n-grams to deconstruct a sentence and compose the new embedded value.

1 CountVectorizer documentation page: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

FastText automatically prunes its vocabulary during training when approaching the maximum vocabulary size. Additionally, FastText detects high-frequency words and discards them as well. The model used in this research is trained using the Continuous Bag of Words (CBOW) model. From these patterns, a 300-dimensional sentence vector embedding can be generated for each string. The value of 300 was chosen as the length of the word embedding because it is standard for word embeddings. Although each word is represented as a 300-dimensional vector, it is helpful to make the representation of a set of words a vector space by normalizing using z-scores.
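A sketch of this usage with the fasttext Python package follows; comments.txt is a hypothetical path to a file with one comment string per line, and the z-scoring mirrors the normalization described above:

    import numpy as np
    import fasttext

    # train a CBOW model and embed two example strings
    model = fasttext.train_unsupervised("comments.txt", model="cbow", dim=300)
    vecs = np.array([model.get_sentence_vector(s)
                     for s in ["Great pizza", "Amazing mac and cheese"]])

    # z-score each dimension so the sentence vectors are comparable
    vecs = (vecs - vecs.mean(axis=0)) / vecs.std(axis=0)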

2.5 Correlation Matrices

Once a frequency matrix is constructed, the non-normalized cosine similarity (or dot product) can be calculated to construct a correlation matrix. This process converts the initial document-by-word frequency matrix, where each index represents the frequency with which a word occurred in the corresponding document, into a document-by-document correlation matrix. In a correlation matrix, the value at each index is a representation of how similar those two documents are. These correlation matrices are symmetric.
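As a minimal sketch, with a toy frequency matrix F of three documents by four words:

    import numpy as np

    F = np.array([[2, 1, 0, 1],
                  [1, 0, 3, 0],
                  [0, 1, 1, 2]])   # toy document-by-word frequencies
    C = F @ F.T                    # document-by-document correlation matrix
    print(C)                       # symmetric; C[i, j] compares documents i and j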

2.6 Covariance Matrices

A covariance matrix is a symmetric matrix where each value is the covariance between the corresponding row and column elements. A covariance matrix is calculated by using the dot product of normalized rows. The diagonal of the covariance matrix contains the variances. The covariance of each relation corresponds to the relationship between the values of the two elements. If two elements have a positive covariance, then their values tend to move in the same direction. The opposite is true for a negative covariance.

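A small numpy sketch, taking "normalized" here to mean mean-centred rows, and checking against the built-in np.cov:

    import numpy as np

    X = np.array([[1.0, 2.0, 3.0, 4.0],
                  [2.0, 1.0, 4.0, 3.0]])    # 2 variables, 4 observations each
    Xc = X - X.mean(axis=1, keepdims=True)  # centre each row
    cov = (Xc @ Xc.T) / (X.shape[1] - 1)    # dot product of centred rows
    print(np.allclose(cov, np.cov(X)))      # True; diagonal holds the variances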

2.7 Brand Method

This section describes a method introduced by Brand, which will be referred to as the Brand method[6]. This method is used after the correlation matrix is created to reduce the overwhelming impact of popular elements. Additionally, to distinguish the resulting correlation matrix of this method from the one produced by the cosine similarity, the matrix resulting from this method will be referred to as a Brand matrix. Because the correlation matrix is a symmetric matrix, the diagonal can be set to 0 and an efficient calculation that does not require embedding can be accomplished. The Brand method utilizes the Moore-Penrose pseudoinverse of the Laplacian of the correlation matrix. The effect of this method can be thought of as a projection of the commute times onto a hypersphere, which reduces the impact of popular elements. The pseudoinverse calculates the approximate inverse of a matrix when one cannot be obtained otherwise, because the correlation matrix is not of full rank. This method was identified in other similarity papers which demonstrated its ability to improve connections[17]. In that paper, the authors use the Brand method to alter the values of elements of the dataset based on their global popularity; the main issue being dealt with there is high-popularity words that co-occur with other high-popularity words. In Brand’s paper, multiple metrics are tested against his proposed method. The method predicted to be the most useful, commute distance, is significantly surpassed by the proposed cosine correlation method.

Q = (\mathbf{1}^T W \mathbf{1}) \cdot (\operatorname{diag}(W\mathbf{1}) - W)^{+} \qquad (2.4)

In this equation, Q is the symmetric Brand matrix and W is the symmetric correlation matrix with a zeroed diagonal. The superscript + denotes the pseudoinverse.
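A sketch of Equation 2.4 in numpy, assuming W already has a zeroed diagonal and non-negative entries, as described in Chapter 3:

    import numpy as np

    def brand_matrix(W):
        ones = np.ones(W.shape[0])
        volume = ones @ W @ ones                   # the 1^T W 1 scaling term
        laplacian = np.diag(W @ ones) - W          # diag(W1) - W
        return volume * np.linalg.pinv(laplacian)  # Moore-Penrose pseudoinverse

    W = np.array([[0.0, 2.0, 1.0],
                  [2.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])
    Q = brand_matrix(W)  # symmetric Brand matrix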

2.8 Clustering

As a step of the methods described later, the word embeddings produced by the FastText method above need to be clustered. To solve this, a number of clustering techniques were identified and key ones were tested. K-means clustering is a method which divides the data into k clusters. It works by randomly selecting cluster centroids and identifying the elements closest to these centroids using Euclidean distance. Through an iterative approach the centroids are adjusted until stabilization is achieved, resulting in a locally optimized clustering. However, after examining other methods such as hierarchical clustering and plotting the dendrogram of the clusters, it appeared that the hierarchical method was more in line with the outcomes targeted by my problem. The hierarchical clustering algorithm divides the dataset into multiple cluster groups. There are two types of hierarchical clustering: agglomerative and divisive. The difference between these two types lies in how the clusters are formed. Divisive clustering starts with all the elements in one cluster and divides the cluster to form the final cluster hierarchy. Agglomerative clustering is the opposite: it starts with every element in its own cluster and joins the clusters until the hierarchy is formed. For this work the agglomerative algorithm was selected.

Additionally, a density-based clustering solution was also required, and the DBSCAN method was selected. The DBSCAN technique works by computing the distance measurements between elements, based on the set parameters. The first parameter is epsilon, which represents the threshold for a point to be within range of another, and the second is min samples, which represents the minimum number of points within the threshold for a group to be considered a cluster. Any points outside these identified clusters are outliers and are labelled as their own cluster.
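A sketch of both clustering calls on toy 2-D points, using scikit-learn's implementations; the parameter values are illustrative only:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, DBSCAN

    points = np.array([[0.0, 0.0], [0.1, 0.2],
                       [5.0, 5.0], [5.1, 4.9],
                       [9.0, 0.0]])

    agg = AgglomerativeClustering(n_clusters=2).fit(points)
    db = DBSCAN(eps=0.5, min_samples=2).fit(points)
    print(agg.labels_)  # every point assigned to one of two clusters
    print(db.labels_)   # the isolated point [9, 0] is labelled -1 (outlier)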

2.9 Visualization Techniques

Throughout the steps of processing the data, visualization methods can be used to identify patterns and easily compare results. To this end, a number of visualization techniques are used throughout this work; these will be discussed here. In the selection of clustering techniques, Singular Value Decomposition (SVD) was considered. This method can be used to reduce the dimensionality of a matrix. SVD deconstructs the original matrix into three matrices:

T = U \cdot \Sigma \cdot V^T \qquad (2.5)

In this equation, T is an m × n matrix which represents the input to the SVD process. The first factor, U, is an m × m unitary matrix. The second factor, Σ, is an m × n diagonal matrix with non-negative numbers on the diagonal. The final factor, V^T, is an n × n unitary matrix. The diagonal of Σ is in descending order, so the most significant factors can be chosen and the three factors truncated to this value. The three factors, now truncated to a smaller dimensionality, can be reconstructed into one matrix of size m × k, where k is the number of dimensions selected.

Thus, if R is the rank of T, then the dimensionality of the matrix can be reduced by replacing R with R′, where R′ is the number of dimensions for the final output. The components of the matrix can be recombined to produce T′ for the new rank value R′, thus producing a reduced-dimensionality matrix[14]. The reduced dimensionality is useful for this process because the U matrix can represent the T matrix in its three most significant dimensions. These three dimensions can be plotted on a scatter plot and the icons colour-coded according to the cluster groupings. This provides a visual representation of the complete dataset and of the dataset as it is clustered.
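A sketch of the truncation described above, using numpy's SVD on a toy matrix:

    import numpy as np

    T = np.random.rand(100, 20)               # toy input matrix
    U, S, Vt = np.linalg.svd(T, full_matrices=False)

    k = 3                                     # dimensions kept for plotting
    U_k = U[:, :k] * S[:k]                    # each row: a point in 3-D space
    T_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]  # rank-k reconstruction of T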

Plots, such as a frequency plot, are used to identify cutoff points for frequency thresholding of the tag data. Additionally, histograms can be used to display key features of the distribution of the data at various stages. A dendrogram is used to identify a cutoff that determines how many clusters the model should contain. A dendrogram represents the formation of the various clusters in hierarchical clustering as a tree structure. This allows the differences in clustering between elements in the data to be seen and examined for valuable information. Additionally, an outline of the cluster contents can be displayed in table format to demonstrate that the clusters have some common, identifiable key component. These tables can be used to demonstrate the differences at different stages of the data processing. In addition to these tables, another visualization technique used to compare the differences between the various stages of the processing method is the colour-map matrix. Due to the large amount of numeric data associated with these matrices, displaying them as they are is unhelpful. To solve this problem, the values are converted to visual components, colours, which allow for easier interpretation of the large matrices. Due to the size of the numeric results that the pipeline produces in the experiment’s methods, a logarithmic scale is used for colour distribution to better represent the difference between individual elements on the plot.

Chapter 3

Experiment

3.1 Research Objectives

The main target of this research is cybersecurity applications, specifically the identification of anomalous activity in a computer system. Using behavioural modelling and similarity, the similarity between users in a network can be found. One can use the logs of a network, the sequence of commands a user gives, to construct behavioural sequences. These sequences can be used to identify any outliers in the network. In order to build a model to address this problem, a surrogate dataset, with similar properties to those we expect can be gathered from networks, is used to test the validity of the approach. The dataset we expect can be pulled from a network includes the logs of each user’s interactions on that network. These logs can be transformed into event sequences that represent the behavioural patterns of individuals on the network. The surrogate data has the properties we predict will exist in the network data. Some of these properties are: specific commands that share similar properties, highly disproportional frequency of command usage, and unique individualized patterns.

In the log files, one can expect to find a number of commands with a variety of parameters depending on the task at hand. The specifics of these parameters are not always important, and as a result these commands can be identified as much more similar than an initial examination would suggest. The surrogate dataset proposed here is also based on user-to-user similarity. The sequences are constructed from restaurant selections. Similar to this command specificity problem, restaurants can belong to larger groups that are more impactful on the overall similarity than the specifics; an example of this is the abstract group of pizza restaurants. Highly disproportional frequency of command usage means that a user may use a common command such as “cat” or “grep” many times, while a more specific command such as “strace” may not be in the purview of a user and thus occurs rather infrequently. This range of high-frequency to low-frequency elements also occurs in the surrogate dataset: restaurant tags such as “coffee” are quite common, while terms such as “spicy” occur less frequently. Finally, individual stylization occurs when individuals work in their own way. Using the surrogate dataset in this work, each sequence of events is a list of restaurants that a user visits. The improvement provided by this research is the use of metadata obtained from the restaurant tags and comments data to better understand the underlying similarity between restaurants. These similarities are applied to allow for improved sequence comparisons between users. For this research, the dataset was obtained from a social networking application called Foursquare. The Foursquare application allows users to track physical locations they have visited and to provide reviews based on their experiences at these locations. The Foursquare dataset contained data geocentred on New York City, with records from October 24, 2011 to February 20, 2012.

The dataset comprises three separate data files. The first is the checkins dataset, which contains a list of the users and which restaurants they visited. From this dataset a series of event sequences can be created to represent each individual’s behaviour. The other two data files contain information relating restaurants to specific tags or comments. This extra information is what is used to calculate the similarity of the restaurants, which is then used to improve the final sequence comparison. The tags data file is in the form of restaurant id by list of tags. This data is already in the form required for processing and thus can be read in with minimal modification; these modifications include removing potentially empty rows. The comment data file requires more preprocessing, including removing the extra rows, converting the text data into the corresponding word vectors, and removing duplicate restaurant id rows. In order to convert the text data into word vectors, the FastText library is used. Using the CBOW parameter, the dictionary containing the 300-dimensional word vectors is created. This model is used at processing time to convert each string of text into a 300-dimensional vector. The getSentenceVector function is used to produce these 300-dimensional vectors from the strings. This function works by considering the sentence as a sequence of word tokens with an end of sentence (EOS) token at the end of the sequence. The getWordVector function is used to get the values of each of the word vectors in the sequence. Each vector is divided by its L2 norm, and then those with positive norm values are averaged to produce the resulting sentence vector1. Finally, similar to the tags dataset, the rows are compacted by removing rows containing repeated restaurant ids, leaving one row with the restaurant id and a list of word vectors.

1 How are sentence vectors calculated?: https://github.com/facebookresearch/fastText/issues/323

The first two sections below describe a method for transforming the metadata obtained from the datasets into a form which will be used in the later sections. A template pipeline will be described which outlines a framework to achieve the results described above. This pipeline is outlined here starting with the tag data. A process is constructed which transforms this data into a frequency matrix. This is followed by transformation into a correlation matrix and finally into a Brand matrix. The resulting matrix, a symmetric matrix which represents the relationship between two specific restaurants, serves as the weighting matrix for a sequence comparison method. The next two sections will outline similar but modified pipelines which allow for the inclusion of the first transformations done in this section, the correlation matrices which hold tag-by-tag and comment-by-comment relations, in a useful restaurant-by-restaurant format. Therefore, in this chapter the methodology to construct various restaurant-by-restaurant matrices will be described. The pipeline will be modified to handle the various challenges that arise due to the type of raw data provided.

3.2 Tag-by-Tag

3.2.1 Restaurant ID by Tag Frequency Matrix

The tag dataset is in the form: restaurant id by list of tags. The first column contains the restaurant ids, a numeric field that represents the unique identifier for each particular restaurant. The second column is the list of tags, a list of comma-separated strings. Each tag represents a feature of the particular restaurant. An example of some of the entries in the tags data file can be seen in Table 3.1. I use CountVectorizer to count the occurrences of each tag in relation to each restaurant, producing a document-by-frequency matrix2.

Table 3.1: An example of the tags data file

Rest ID | List of tags
26      | “brunch, happy hour, irish, pub”
36      | “french, soho, zagat-rated, zagats”
39      | “byob”

In this matrix, each row represents a restaurant id and each column represents a specific tag. The resulting matrix indexes represent how often a particular tag appears with a particular restaurant. From this restaurant id by tag frequency matrix, several problems can arise that impact the accuracy of the final similarity results. The main problems are extremely common tags and rarely used tags, which skew the results.

3.2.2 Problems in the Frequency Matrix

The first problem, extremely common tags, can have the effect of giving greater significance to those tags because they appear more frequently. This propagates to the restaurants as well: common tags are popular and thus inflate the restaurants’ similarity values. This also has the effect of making the other tag relationships more distant than they should be. An example of this popularity problem can be seen by considering the adjacency graph: adding a new highly connected node would influence the paths between a large portion of the existing nodes. Thus, extremely frequent tags need to be handled to remove this problem. Two solutions were used. The first is thresholding: before computing the correlation matrix, in the step after calculating the restaurant-by-tag frequency matrix, the most frequent tag columns are removed. The second solution is the Brand transform.

2 The CountVectorizer package is from the scikit-learn feature extraction library and the documentation can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

The second problem, the rarely used tag, can also have a significant impact on the final similarity result, as a large number of tags do not occur frequently enough to provide significant data. In particular, a tag can occur only once for a single restaurant. Thus, these tag columns can also be removed from the restaurant-by-tag frequency matrix. To perform this removal step for both the most frequent and least frequent tags, the tag frequency columns are sorted in decreasing order and a histogram is plotted. These frequencies demonstrate the significant frequency difference between the most frequent tags, the least frequent tags, and the tags that fall between them. To identify the lower cutoff point between the least frequent tags and the middle section, the graph is examined for where its slope becomes near zero. This cutoff indicates that tags below this point are too infrequent to provide useful results. The upper cutoff point between the most frequent tags and the middle section can be identified by examining the graph for a knee in the slope. This cutoff indicates that the tags above this point have a frequency too high to be useful without distorting the resulting data. From these two cutoff points, the restaurant-by-tag frequency matrix, with the tag frequency columns sorted in decreasing frequency order, can be truncated at the two points to remove those tags with problematic frequencies. The remaining tags will have a frequency that is neither too high nor too low.
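A sketch of this truncation step on toy counts follows; the two cutoff indices are hypothetical, since in this work they are read off the frequency plot:

    import numpy as np

    rng = np.random.default_rng(0)
    freq = rng.poisson(3, size=(500, 200))  # toy restaurant-by-tag counts
    totals = freq.sum(axis=0)
    order = np.argsort(totals)[::-1]        # columns, most frequent first

    upper, lower = 10, 150                  # knee and near-zero-slope cutoffs
    kept = order[upper:lower]               # tags between the two cutoffs
    freq_truncated = freq[:, kept]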

3.2.3 Tag-by-Tag Correlation Matrix

The purpose of this step is to compute the tag-by-tag correlation matrix from the restaurant-by-tag frequency matrix. A correlation matrix is a symmetric matrix where each value at position (i, j) is the similarity value between tag i and tag j. The following equation is used:


C = F^T \cdot F \qquad (3.1)

where F is the restaurant-by-tag frequency matrix and C is the tag-by-tag correlation matrix. F is of the form m × n, and thus when the transpose is multiplied by the original matrix, the resulting correlation matrix C will be of the form n × n. The diagonal of C is zeroed, since it represents the correlation between a particular tag and itself, which is not useful for our purposes. The symmetric matrix can represent the adjacency matrix of a weighted graph, and intermediate steps require the zeroed diagonal; this removes the self-loops in the undirected graph. At this point the tag-by-tag correlation matrix can be transformed into a further useful form. One of the forms that appeared to have potential was the precision matrix. The precision matrix was found in several places in the literature and appeared to be a useful method for this type of work. The precision matrix is defined as the inverse of the covariance matrix. This matrix was constructed using the numpy linalg library in python3. Despite the potential, this approach did not show any useful results.

3 Numpy linalg library - https://numpy.org/doc/stable/reference/routines.linalg.html

3.2.4 Tag-by-Tag Brand Matrix

In this step of the pipeline, the aim is to apply the Brand transformation, Equation 2.4 from Chapter 2, to the tag-by-tag correlation matrix to reduce the impact of highly popular components. Before this computation, any negative value is set to zero and the rows with only zero values are removed. Once this preprocessing is complete, the Brand transformation can be applied to the tag-by-tag correlation matrix to create the tag-by-tag Brand matrix. The resulting matrix is better able to identify patterns between tags than the previous correlation matrix, due to its handling of highly popular tags. This Brand matrix marks the end of this pipeline. These two matrices represent the similarity value between each tag pair.
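The preprocessing described above can be sketched as follows; the helper names are illustrative, the matching columns are dropped alongside the all-zero rows as an assumption to keep the matrix square, and the Brand transformation itself is left as a placeholder since it is defined by Equation 2.4 in Chapter 2:

```python
import numpy as np

def prepare_for_brand(C):
    """Preprocess the correlation matrix before the Brand
    transformation: set negative values to zero, then drop rows
    (and the matching columns) that contain only zeros."""
    C = np.clip(C, 0, None)          # negative values become zero
    keep = C.sum(axis=1) > 0         # rows that still carry information
    return C[np.ix_(keep, keep)]

# The Brand transformation itself (Equation 2.4, Chapter 2) is then
# applied to the prepared matrix; it is not reproduced here:
# B = brand_transform(prepare_for_brand(C))
```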

Visualizing the difference - Correlation Matrix vs Brand Matrix

To demonstrate the improvement of the Brand matrix over the correlation matrix, a nearest neighbour search is conducted on particular tags. Tags were selected by ranking them from highest global frequency to lowest; the first ten tags, along with every 10th tag after that, were selected for display. The nearest 10 tags to each selected tag were computed.
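A minimal sketch of this nearest neighbour search; the function and argument names are assumptions:

```python
import numpy as np

def nearest_tags(S, tag_names, tag, k=10):
    """Return the k tags most similar to `tag` under the similarity
    matrix S (either the correlation or the Brand matrix).
    `tag_names` maps matrix indices to tag strings."""
    i = tag_names.index(tag)
    # Sort indices by descending similarity, excluding the tag itself.
    order = [j for j in np.argsort(-S[i]) if j != i]
    return [tag_names[j] for j in order[:k]]
```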

3.3 Comment-by-Comment

The starting data for this pipeline is significantly different from the previous one. This data file contains three columns: rest id, user id and comment. The comments are free-form strings entered individually by users and thus vary widely in style, which makes commonalities between comments difficult to find.

An example of the raw data can be seen in Table 3.2. Feeding this data through the CountVectorizer tool directly results in a matrix that is too large and sparse to be feasible for this pipeline. Thus, steps need to be taken to mitigate these properties.

Table 3.2: An example of the comments data file

Rest ID | User ID | Comment
59010 | 20 | "Go for the grattin dauphinois!"
37602 | 20 | "When here you have to order the Mac & Cheese. Amazing"
68863 | 20 | "People watching + Cheesecake + Sex & The City"

One of the initial steps in this pipeline is to convert the text strings into numeric word vectors. However, the correlation matrices computed from these vectors were sparse. To deal with this challenge, the text embeddings are clustered and the cluster ids are used instead of the word embeddings. Using these steps, this modified pipeline will produce comment-by-comment correlation matrices. The first step is to replace each comment string with a 300 dimensional vector using the FastText tool4 [5]. The FastText model is trained on the comment strings to build this unsupervised model. Once these 300 dimensional vectors are created, they are placed into a matrix where each row corresponds to a comment string. The matrix is z-scored, which allows the vectors to be more accurately compared to one another. At this point, the data is in the form of restaurant id by word vector. However, there are multiple rows with the same restaurant id. This is resolved by collapsing the matrix into the form restaurant id by list of word vectors, where for each restaurant id every corresponding row's word vector is collected and combined into a list.
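A minimal sketch of this embedding step, assuming `comments` is the list of comment strings and that the comments have been written one per line to a hypothetical file comments.txt for training:

```python
import fasttext
import numpy as np

# Train an unsupervised FastText model on the raw comments and embed
# each comment as a 300-dimensional sentence vector.
model = fasttext.train_unsupervised("comments.txt", dim=300)
vectors = np.array([model.get_sentence_vector(c) for c in comments])

# z-score each dimension so the vectors can be compared fairly.
vectors = (vectors - vectors.mean(axis=0)) / vectors.std(axis=0)
```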

4. The FastText toolkit and documentation can be found at https://fasttext.cc/docs/en/references.html

This list form can then be processed by a pipeline similar to the one described above, using the same logic as was used for the tags. However, the result is an extremely sparse matrix, too sparse to be useful. A better approach is to cluster the comments into denser groups. The initial clustering technique used was k-means clustering, which produced reasonable results. However, after producing a dendrogram from the data, Figure 4.7, it became clear that hierarchical clustering produced more useful representations of the data. Thus, the clustering method was switched to hierarchical clustering. Based on the dendrogram, the number of clusters was chosen to be 50. Using these clusters, the comment vectors are replaced with their corresponding cluster ids. In addition to the dendrogram, other methods can be used to demonstrate that the hierarchical clustering provides better clusters than the k-means clustering, for example Singular Value Decomposition (SVD). This step is significant because, with the reduced dimensionality, the data can be plotted for visualization. The data is plotted twice, once coloured by the k-means clusters and once by the hierarchical clusters, allowing the two clusterings to be assessed visually. Additionally, to verify this clustering process, the clusters are examined to confirm that they share common themes and that meaningless clusters are not created. The matrix can then be reformed into restaurant-id-by-list-of-cluster-ids: the multiple rows for each restaurant id are collapsed so that there is only one row per restaurant id, holding the list of cluster ids associated with that restaurant, identical to the way the unclustered comment vectors were condensed into lists above.
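A sketch of the clustering step using scipy; Ward linkage is an assumption, as the text specifies only hierarchical clustering with 50 clusters chosen from the dendrogram:

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
import matplotlib.pyplot as plt

# Agglomerative clustering of the z-scored comment vectors.
Z = linkage(vectors, method="ward")

# Inspect a truncated dendrogram to choose the number of clusters.
dendrogram(Z, truncate_mode="lastp", p=50)
plt.show()

# Replace each comment vector with its cluster id.
cluster_ids = fcluster(Z, t=50, criterion="maxclust")
```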

This data can now be processed through the pipeline, similar to the tag-by-tag pipeline discussed above, to compute a correlation matrix. This will produce a comment-cluster-by-comment-cluster correlation matrix and a comment-cluster-by-comment-cluster Brand matrix. These can be expanded to include every comment using the results from its corresponding cluster. Since these comment clusters represent the original comments, these matrices are considered the comment-by-comment correlation matrices.

3.4 Restaurant-by-Restaurant

In this section, the previous pipelines that use the tags and comments data will be modified to produce restaurant-by-restaurant correlation matrices. As in the previous pipelines, the template explains how the restaurant-id-by-list-of-tags or restaurant-id-by-comment-string data is processed into a frequency matrix, then into a correlation matrix, before finally ending up with a Brand matrix. However, in this version of the pipeline, the correlation matrix calculation is modified so that the resulting correlation matrices are of the type restaurant-by-restaurant. Similar to the previous pipelines, this template performs the same steps to arrive at a restaurant-by-restaurant correlation matrix, with a few key differences. First, the equation used to calculate the correlation matrix from the frequency matrix is altered to produce a restaurant-by-restaurant correlation matrix from the restaurant-id-by-tag frequency matrix or restaurant-id-by-comment frequency matrix.

C = F · F^T   (3.2)

Using the tags dataset produces a restaurant-by-restaurant correlation matrix

where the correlation values are based on the tags associated with each particular restaurant. The comments dataset produces a different matrix, since the correlations in that case are derived from the comments data. Thus, simply following the pipelines as described in the previous two sections, with the modification to the correlation equation described above, will produce the following matrices:

1. Restaurant-by-restaurant (from tag data) correlation matrix

2. Restaurant-by-restaurant (from tag data) Brand matrix

3. Restaurant-by-restaurant (from comments data) correlation matrix

4. Restaurant-by-restaurant (from comments data) Brand matrix

Unlike the previous two pipelines, this pipeline produces four restaurant-by-restaurant correlation matrices which can be used for weighting purposes, a step towards the end goal.

3.4.1 Weighted Sequence Comparison

To demonstrate that the correlation matrices produced by the previous pipeline steps provide more meaningful results when used in sequence comparisons, traditional sequence comparison methods will be compared to weighted versions of those methods. Damerau-Levenshtein distance is a version of edit distance which computes the number of edits required to transform one sequence into another. In this method, an edit can be an insertion, a deletion, a substitution or a transposition. In the standard

Damerau-Levenshtein distance, each of these edits adds 1 to the cost. However, in the weighted version of this method, the substitution takes into account the similarity between the two restaurants being compared and adjusts the cost accordingly. Therefore, if a restaurant is more similar then the cost will be lower, and if the restaurant is less similar then the cost will be higher. Finally, the algorithm is adjusted to have a weighted transposition cost. For different purposes this value can be changed, from a static 0, to indicate that the ordering of the sequences does not matter, to the full cost of 1, to indicate that every transposition is the same. In this work, the weighted transposition cost signifies that elements which are more similar to one another can occur in a looser order than in the original case. For example, consider the case where a sequence A contains two similar but not identically placed elements [Harveys, Pizza Pizza, Pizza Hut] and sequence B contains the same but swapped elements [Harveys, Pizza Hut, Pizza Pizza]. Given a third sequence, C, [Harveys, Pizza Pizza, Steakhouse], the distance calculations between sequence A and the other two sequences illustrate why a non-one transposition cost can be beneficial. Consider the case where there is no weighting. The distance value between sequence A and B would be 1 (it takes one transposition to convert sequence A into sequence B). The distance value between A and C would also be 1, since it takes one substitution to convert sequence A into sequence C. Using the standard algorithm, these two sequences are an equal distance apart from the original sequence A. However, it can be recognized that the transposition of two common elements is much less of a difference than the substitution of an unrelated new element. Thus, in the weighted system the distance value between A and B would be a value between 0 and 1 which represents how similar Pizza Pizza and Pizza Hut are.

In this scale, 0 represents identical elements and 1 represents that there is no information relating the two elements; the closer this value is to 0, the more similar they are. This value will be less than the distance cost of 1 for the substitution and thus better represents the meaningfulness of the relationship between these sequences. The data from the checkins file is used to compute each user's restaurant sequence. This file is assumed to be ordered chronologically, so each user's event sequence is created by merging that user's occurrences into a list of restaurants. Once the data is in this form, the comparison process is straightforward. A sample of 500 users is selected. Each of these users is compared to each other user using the original Damerau-Levenshtein distance and using the weighted version. Multiple weighting matrices are compared, including the restaurant-by-restaurant correlation matrix and the restaurant-by-restaurant Brand matrix. These correlation matrices are not initially in the proper form to be useful for weighting. In a weighting matrix, dissimilar elements should have a larger number, representing a larger edit, while a smaller number indicates that the elements are quite similar. The magnitudes of the matrices are currently the opposite, with a larger number representing a more similar element. Thus, the matrices are transformed in the following way. First, the values are normalized to between 0 and 1. The values are then reversed by subtracting 1, so the range becomes −1 to 0, and their absolute values are taken. The originally most similar elements, the largest values which after normalization were closest to 1, are now closest to 0; the smallest values, originally closest to 0, are after the subtraction closest to −1 and thus after the absolute value closest to 1. The matrix requires one more step before it can be used in the weighted algorithm: the diagonal, which

represents an element compared to itself, needs to be zeroed again to indicate that an element is identical to itself. Thus, no edit cost is added to replace a value with a copy of itself. A final point concerns the weighting matrices' interaction with the Damerau-Levenshtein distance calculation. Due to the way the data was collected, there is a large discrepancy between the restaurant ids contained within the tags dataset and those within the comments dataset. The checkins data likely contains restaurant ids from both of these sets, so each individual correlation matrix will not cover the entire restaurant id pool. For the case where a restaurant id is not found in the correlation matrix, a default value of 1 is used: the substitution is still less costly than a full deletion and insertion, but since there is no information on that restaurant it cannot be treated as more similar to the other restaurants, and thus must cost at least 1.
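The weighting-matrix conversion and the weighted Damerau-Levenshtein calculation can be sketched as follows. This is a minimal illustration rather than the exact implementation used in this work: the helper names are assumptions, `index_of` is a hypothetical mapping from restaurant id to matrix index, and ids absent from the weighting matrix fall back to the default cost of 1.

```python
import numpy as np

def similarity_to_weights(S):
    """Convert a similarity matrix into substitution weights:
    normalize to [0, 1], invert so that similar pairs cost little,
    and re-zero the diagonal so self-substitution is free."""
    W = S / S.max()
    W = np.abs(W - 1.0)
    np.fill_diagonal(W, 0.0)
    return W

def make_cost(W, index_of):
    """Build a cost lookup; ids missing from the matrix cost 1."""
    def cost(x, y):
        i, j = index_of.get(x), index_of.get(y)
        if i is None or j is None:
            return 1.0  # no similarity information for this restaurant
        return W[i, j]
    return cost

def weighted_dl(a, b, cost):
    """Damerau-Levenshtein distance (optimal string alignment) with
    weighted substitution and transposition costs in [0, 1]."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = float(i)
    for j in range(m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else cost(a[i - 1], b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1.0,       # deletion
                          d[i][j - 1] + 1.0,       # insertion
                          d[i - 1][j - 1] + sub)   # weighted substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # Weighted transposition: similar adjacent elements may
                # swap order at a cost below 1.
                t = cost(a[i - 1], a[i - 2])
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + t)
    return d[n][m]
```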

3.5 Restaurant-by-Restaurant (With Metadata)

This is the first calculation that utilizes the previously constructed tag-by-tag correlation matrices and comment-by-comment correlation matrices. As previously mentioned, these matrices cannot be used directly as weighting matrices for sequence comparison. Instead, these matrices will be used in the calculation of a more informed restaurant-by-restaurant correlation matrix. This result includes both the information from the original dataset and the additional connections identified by the included correlation matrices. To construct the most accurate restaurant-by-restaurant correlation matrices possible, I include the tag-by-tag Brand matrix or comment-by-comment Brand matrix into the restaurant-by-restaurant calculation.

Figure 3.1: Brand Connections Diagram

C = F · C_metadata · F^T   (3.3)

In this equation, F represents the restaurant-id-by-tag frequency matrix (or restaurant-id-by-comment frequency matrix) and C_metadata represents the tag-by-tag, or comment-by-comment, Brand matrix produced in the previous pipelines. C is the resulting symmetric matrix comparing restaurants to restaurants, but which incorporates the similarity values gathered between tags, or comments. The rationale behind this equation is that relationships between restaurants, in the form of distances, can be improved, shortening the distance, by including another path between tags. This type of modification allows for more subtle connections between restaurants. The added value of this process can be seen in Figure 3.1; the shorter the path, the more similar the two elements are to one another. In the figure, it becomes clear that the pathways created by implementing the comment-by-comment Brand matrix in the restaurant-by-restaurant correlation calculation are beneficial to the overall similarity connections. For example, a pathway between user 2 and user 3 is created. The connection result can benefit from including multiple Brand matrices. This would account for the potential of multiple reduced

paths between two restaurants. Thus, the equation above could have the matrix

C_metadata replaced with any number of C_metadata matrices. From the modified equation above, the matrices that can be produced are listed below, with a code sketch of the calculation following the list:

1. Restaurant-by-restaurant (including tag data) correlation matrix

2. Restaurant-by-restaurant (including tag data) Brand matrix

3. Restaurant-by-restaurant (including comments data) correlation matrix

4. Restaurant-by-restaurant (including comments data) Brand matrix
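A minimal sketch of the calculation in Equation 3.3; the function name is an assumption, and the diagonal is zeroed as in the other pipelines:

```python
import numpy as np

def restaurant_correlation_with_metadata(F, C_meta):
    """Equation 3.3: fold a metadata Brand matrix (tag-by-tag or
    comment-by-comment) into the restaurant-by-restaurant
    correlation.  F is the restaurant-by-tag (or restaurant-by-
    comment) frequency matrix."""
    C = F @ C_meta @ F.T
    np.fill_diagonal(C, 0)
    return C
```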

3.6 Restaurant-by-Restaurant (With Tag and Comment)

The aim of this section is to combine the information from both the tags and comments metadata to produce a more informative restaurant-by-restaurant correlation matrix. This calculation is represented by the following equation, where C is the resulting correlation matrix, F_RestTag is the restaurant-by-tag frequency matrix, F_RestComment is the restaurant-by-comment frequency matrix, C_TagTag is the tag-by-tag Brand matrix, and C_CommentComment is the comment-by-comment Brand matrix. The resulting correlation matrix C now includes the metadata from both the tag and comment data.

C = F_RestTag · C_TagTag · F_RestTag^T · F_RestComment · C_CommentComment · F_RestComment^T   (3.4)

The resulting matrices are:

1. Restaurant-by-restaurant (including tag and comments data) correlation matrix

2. Restaurant-by-restaurant (including tag and comments data) Brand matrix

3.7 User Clustering

The distance matrices resulting from the weighted sequence comparisons in the steps above represent the distances between users' sequences. These matrices can be considered adjacency matrices, with the distance values giving the distance between users. Thus, to understand the structure of the data, the DBSCAN clustering technique is applied. First, the data is scaled to between 0 and 1. Then the parameter values, epsilon and min samples, are varied to determine the structure of the graph under different conditions. The resulting structure and its analysis are composed into a table.
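A sketch of this clustering step using scikit-learn's DBSCAN with precomputed distances; the parameter grids are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# D is a user-by-user weighted edit distance matrix, scaled to [0, 1].
D = D / D.max()

# Sweep the parameters to probe the structure under different conditions.
for eps in (0.05, 0.1, 0.2, 0.4):
    for min_samples in (3, 5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="precomputed").fit_predict(D)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        print(eps, min_samples, n_clusters, n_noise)
```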

3.8 Summary

In this chapter, the methods used to transform the raw data into useful correlation matrices are described. A basic pipeline is defined which takes the initial data file and processes it into the desired result, a correlation matrix. Variations of this template are also discussed. Initially, the preprocessing steps are modified to allow for the use of the comments dataset. In the next variation, the template's correlation calculation is modified to produce a restaurant-by-restaurant correlation matrix. Next, the previous two sections' results, the tag-by-tag correlation matrix and the comment-by-comment correlation matrix, are used to combine these extra pieces of information into a restaurant-by-restaurant correlation matrix. Finally, the process is extended to allow for both the tag and comments data to be used in the formation of a combined restaurant-by-restaurant correlation matrix. These correlation matrices at the various stages can be used as input for the

sequence comparison methods to demonstrate the application of these results. Finally, the resulting distance matrices are clustered using DBSCAN to identify groupings of users. Visualization techniques are used to display the results throughout the pipeline, including frequency plots, histograms, colour matrices, dendrograms, and nearest neighbour tables. In Chapter 4, the results produced from this pipeline are examined and discussed.

Chapter 4

Results

4.1 Introduction

In the previous chapter, the methodology of this research and its various iterations were discussed. In this chapter, the results produced at significant stages of the pipelines described previously will be presented. The goal of this research is to demonstrate that the techniques described in Chapter 3, which incorporate metadata into correlation matrices used as weighting matrices, improve the effectiveness of sequence comparisons. Multiple types of figures, mentioned in previous sections, will be used to display the resulting data matrices. The most common visualization technique is the colour-map matrix, used to produce heat map versions of the correlation matrices. This format allows for easier visualization and comparison between steps. Other techniques include: frequency plots, which display the frequency of multiple elements; histograms, which display the distribution of the values within a data object; a dendrogram, which visualizes the hierarchical clustering component; and tables containing the results of a nearest neighbour

search to demonstrate various capabilities of the proposed methodology. All of these visualization methods will be used at various points in this chapter. The structure of the rest of this chapter follows the outline presented in the previous chapter. First, the tag-by-tag correlation matrices and then the comment-by-comment correlation matrices will be displayed. Next, the results of the pipeline which produces restaurant-by-restaurant correlation matrices will be shown. In this pipeline, and the pipelines that follow, the matrices created can be used as weighting matrices for sequence comparison; they will be used as weighting matrices in the weighted Damerau-Levenshtein calculation and the results will be shown. Next, the pipelines which produce the restaurant-by-restaurant (including tag data) correlation matrices and the restaurant-by-restaurant (including comments data) correlation matrices will be presented, followed by the results of the weighted distance calculation. Finally, the restaurant-by-restaurant (including tag and comments data) correlation matrices will be presented, followed by the results of their weighted distance calculation.

4.1.1 Validation

Before the results of this chapter are discussed, the framework within which this work is analyzed needs to be set. One of the key difficulties of this area of research is identifying "correct" results when there is no ground truth to use as a base. The interpretation of an individual, or a group of individuals, is not sufficient to determine the correct answer when discussing similarity. Additionally, a large portion of this work relies on the impact the Brand method has in reducing the

popularity problems that arise from high popularity pairwise values. This claim is based on the original work by Brand, which demonstrated that the transformation improves similarity correlations by reducing the impact of popular pairwise relationships. Finally, once all the transformations have taken place, the similarity metrics will show that similarity mass clumps into key components.

4.2 Tag-by-Tag

This section deals with the first form of the template pipeline, which produces the tag-by-tag correlation matrices. The first part of the pipeline is the conversion of the raw data into a document-by-term frequency matrix: the restaurant-id-by-list-of-tags data is converted into the restaurant-id-by-tag frequency matrix. After this frequency matrix is created, preprocessing is applied before any correlation matrix can be produced. One of these preprocessing steps is sorting the columns, each representing a tag, by frequency in descending order. At this point, the frequency of each column can be tabulated. This data is used to create the first frequency plot, Figure 4.1. This plot shows the popularity of individual tags in descending order before any truncation. Based on what is known about this data, it is expected that there will be a large difference in popularity between the most popular elements and the least popular elements. From this plot, it is clear that there is a large difference between the most popular tags and the least popular tags, as predicted. As discussed in the previous chapter, the popularity of an element can influence the results of the correlation calculation.

Figure 4.1: Tag Frequency Plot Before Truncation

As a result, the methodology in Chapter 3 describes a process to select cutoff values which are used to truncate the frequency matrix. The identified cutoff points, 15 and 180, can now be used to truncate the data to remove the high popularity and low popularity tags. When calculating towards a correlation matrix, both the upper and lower cutoff points are used to truncate the columns of the frequency matrix. When calculating towards a Brand correlation matrix, only the lower cutoff point must be used to truncate the columns of the frequency matrix. In this work, the popularity problem is reduced by including the upper truncation. In both cases, after the truncation each row is checked to ensure it still contains non-zero values, and any rows without values, representing restaurants without any mid-frequency tags, are removed. The next frequency plot, Figure 4.2, demonstrates the effect of truncation when compared to the previous frequency plot. The sharp drop of high popularity tags and the long tail of the low popularity tags have been removed. It is at this point that the frequency matrix is ready to be transformed into the correlation matrix.

Figure 4.2: Tag Frequency Plot After Upper and Lower Truncation

The significance of the thresholding is demonstrated in the next two figures. The first, Figure 4.3, displays the correlation matrix produced if no truncation is used. As can be seen in this figure, there are a large number of tags with no similarity, so a large section of the matrix has a value of zero. To understand the structure of this colour-map, recall the preprocessing step in which the tags were sorted in descending frequency order: the tags in the top left hand corner are the tags with the highest frequency values. This is useful for comparison with the next figure, the tag-by-tag correlation matrix after truncation. This next matrix, Figure 4.4, is the correlation matrix produced after truncation. As can be seen, the number of tags has dropped significantly but the matrix has also become more dense. The large runs of empty values have been removed by the truncation preprocessing step. While the tags with the highest frequency are positioned closest to tag id 0, there are several tags with high correlation among the tags with lower frequency values.

Figure 4.3: Tag-by-Tag Correlation Matrix - Before Truncation

These orange-red values are easy to pick out among the less dense portion of the colour-map; however, they exist throughout the matrix. Following these steps, the pipeline continues as described in the template. The Brand transform is applied and Figure 4.5 is produced. As mentioned before, despite the top left of the colour-map matrix having a higher density of values due to the sorting on frequency, the rest of the matrix contains significant correlation values. The smaller correlation values have been reduced and removed by the transformation process and subsequent pruning.

Figure 4.4: Tag-by-Tag Correlation Matrix - After Truncation

The remaining correlation values indicate the strength of the correlation between the tags. As noted at the beginning of this section, these tag-by-tag correlation matrices cannot be directly used to improve sequence comparisons. Thus, another technique is used to evaluate the difference between the correlation and Brand tag-by-tag matrices: nearest neighbour search. For each tag, the ten nearest neighbours are selected. The results of the nearest neighbour search on the tag-by-tag correlation matrix are compared to the results of the search on the tag-by-tag Brand matrix.

Figure 4.5: Tag-by-Tag Brand Matrix

To illustrate this point, an example of the nearest neighbour search on the tag matrices produces results such as those in Table 4.1.

Table 4.1: cookies Nearest Neighbour

Correlation results: cupcake, cupcakes, cake, bakery, desserts, dessert, pastries, cafe, breakfast
Brand results: bakery, donuts, chocolate, cupcakes, cake, cupcake, cakes, pastry, baked

The results from this search demonstrate the difference between the correlation

matrix results and the results produced by the Brand transform. The correlation results show that the top 10 tags most correlated with cookies include the tag 'breakfast'. Breakfast is not something one would naturally associate with cookies; however, breakfast is a highly popular tag and likely relates to a number of other adjacent tags. In the Brand result, breakfast is not present. A number of other changes occur, including shifts in the importance of various tags and the addition of more specifically relevant tags such as 'chocolate' and 'donuts'. Chocolate is added due to its obvious connection with cookies, and donuts for its relevance to cookies through the stores that sell both. To illustrate this point further, the nearest neighbour results for the 10 most frequently used tags are displayed in Table 4.2. By examining these tables, we can see that the process conducted thus far is performing as intended. Looking at the first sub-table of Table 4.2, which relays information about the 'pizza' tag, the correlation results indicate a series of tags which one could commonly understand to be related to 'pizza'; 'pasta' is a common example. Some of the other tags also make sense given the context of the data, which was collected in New York. Tags such as 'park', 'brooklyn' and 'district' can be seen to represent key locations that pizza is related to. However, examining the Brand results, the reordering and change in tags suggests a deeper understanding of the relationships between these tags. For example, new tags such as 'rated', 'italian' and 'bravo' can easily be understood to be closely related to a tag like 'pizza'. Additionally, the reordering of tags such as 'brooklyn' (gaining in similarity) and 'trendy' (reducing in similarity) also makes

sense given the relationship Brooklyn, New York has with pizza and the more common, less descriptive nature of the 'trendy' tag. These types of changes are reflected in all the nearest neighbour searches. By understanding the context of the tags and examining their relationship to the targeted tag, the Brand results can be seen to provide a more nuanced set of tags, more closely tied to the target than the common examples seen in the correlation results. The rest of the nearest neighbour results, on every 10th tag, can be found in Appendix A.

4.3 Comment-by-Comment

The second form of the pipeline utilizes a similar methodology, but uses the comments dataset instead of the tag dataset used in the last section. As described in the previous chapter, the comments present a challenge even after being transformed into 300 dimensional sentence vectors. To illustrate the difficulty of this problem, the z-scored comment vectors are reduced to 3 dimensions using SVD and visualized in Figure 4.6. Each point is coloured according to its cluster label from the hierarchical clustering. Once the dataset is read into the pipeline, a number of preprocessing steps occur, including replacing the comment strings with their corresponding sentence vectors. These vectors are clustered, and as part of this clustering step a dendrogram is produced, Figure 4.7. The distance values are included at the joining of two clusters for reference between cluster joins. Additionally, the bottom of the tree diagram, from the last cluster join to where the leaves reach the x-axis, has various thicknesses; these represent previous cluster joins that are not fully visualized in this diagram. Finally, the numbers along the x-axis are the number of elements inside each cluster.
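A sketch of the projection behind Figure 4.6, reusing the `vectors` and `cluster_ids` names assumed in the earlier sketches:

```python
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# Project the z-scored 300-dimensional comment vectors into three
# dimensions and colour each point by its hierarchical cluster label.
coords = TruncatedSVD(n_components=3).fit_transform(vectors)

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2],
           c=cluster_ids, s=4, cmap="tab20")
plt.show()
```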

Table 4.2: Nearest Neighbours First 10 Most Frequent Tags

Pizza
Correlation results: pasta, douchebags, park, people, magazine, brooklyn, trendy, district, douchebag
Brand results: brooklyn, rated, italian, bbq, new, wsj, bravo, douchebag, trendy

Sandwiches
Correlation results: salads, salad, sandwich, bread, soup, cafe, pastries, tea, organic
Brand results: free, tea, french, lunch, dinner, salads, dessert, cheese, cafe

Lunch
Correlation results: salads, sandwiches, french, beer, breakfast, cocktails, dinner, tea, bread
Brand results: breakfast, sandwiches, dinner, cheese, french, burgers, hot, the, brooklyn

Chicken
Correlation results: diner, dog, karaoke, shop, run, ground, area, play, ski
Brand results: fried, vegetarian, diner, sushi, vegan, shop, gallery, dog, run

Beer
Correlation results: great, cocktails, friendly, the, burger, late, dining, night, on
Brand results: the, cocktails, burgers, outdoor, great, cheese, and, friendly, dinner

Italian
Correlation results: pasta, trendy, desserts, ny, manhattan, to, fresh, eat, district
Brand results: trendy, new, steak, pizza, outdoor, spot, york, people, hot

Rated
Correlation results: chef, top, bbq, douchebag, bravo, socialite, brooklyn, wsj, magazine
Brand results: bravo, pizza, brooklyn, douchebag, bbq, wsj, chef, socialite, top

Breakfast
Correlation results: cafe, sandwiches, salads, dinner, salad, bread, pastries, organic, soup
Brand results: lunch, sandwiches, dinner, burgers, and, cafe, free, cheese, tea

Burgers
Correlation results: burger, the, french, good, shakes, cocktails, salad, milkshakes, wings
Brand results: the, and, fries, late, cheese, beer, dinner, burger, lunch

Vegetarian
Correlation results: diner, alley, bowling, shop, play, record, ground, run, ski
Brand results: vegan, chicken, sushi, diner, fried, shop, gallery, dog, run

Figure 4.6: Comment Vectors visualized using SVD

Figure 4.7: Dendrogram - 50 Clusters

Based on the dendrograms produced, it was decided that the number of comment

clusters should be 50. Additionally, this diagram provided evidence for switching from k-means clustering to hierarchical clustering, which is demonstrated in the clusters produced: examining a cluster reveals a clear theme that its contents align with. The hierarchical clustering is applied to the sentence vectors using the chosen parameters. The results of this clustering are used in the next step of the pipeline, where the sentence vectors are replaced with cluster ids. Before continuing the pipeline, the cluster results are checked by examining the contents of the clusters for similarity. It is possible that the clustering would make a connection that a human would not; however, this data did not appear to have that characteristic, and the reasoning behind the clustering can be seen by examining the contents. A few of the comments that were placed in the first cluster are shown in Table 4.3.

Table 4.3: Cluster 1 'Good' Examples

“Bagels are really good” “Try the pollo alla grilga. It was damn good.” “The cheeseburger with Swiss was pretty damn good... the bun was the best part!” “The arugula and beet salad was delish” “Their cheesecake was amazing”

In this small portion of results from the first cluster, it is evident that many of the comments had a positive spin focusing on the word 'good', as seen in the first three examples in the table. In addition, other positive words such as 'delish', 'amazing', or 'excellent' were used in many of the comments in this cluster, as demonstrated by the comments in the fourth and fifth rows. However, despite the majority of comments following this pattern, there were a

few outliers. These are displayed in Table 4.4.

Table 4.4: Cluster 1 Inappropriate Examples

“awful. this place gives Japanese food a really bad name.” “Tasty but WAY overpriced. $8 for a frozen yogurt? Really? REALLY?” “Overpriced.”

In this table, the three rows demonstrate a few of the outliers that appear within this first cluster. One way these outliers could be further reduced would be by using smaller clusters. Other clusters also display distinguishable patterns. Cluster 10, Table 4.5, focuses on recommendations; common phrasing includes "have the" and "try the", and the twitter handle '@Foodspotting' was commonly linked with this theme. Cluster 30, Table 4.6, focuses on the restaurant's service and wait staff. Although this cluster does not clearly differentiate between positive and negative comments, the comments tend to focus on the negative aspects of service.

Table 4.5: Cluster 10 Examples

"have the Chili Cheese Fries!!"
"Try the Spicy Bloody Mary - And the adult drinks are good too... Uncle Deno and Greg (via @Foodspotting)"
"Try the Swamp Water - Tasty it is, classy it's not. (via @Foodspotting)"

Table 4.6: Cluster 30 Examples

“nice spot. super-duper attentive wait staff” “Shitty service! Just left” “TERRIBLE service. Terrible.” “Staff nice, food great, but the pace of food service is unbearably slow”

From the contents, it is clear that the clustering is keying on a number of features of the text data. At this point, having some validation

that the clustering steps are doing what was expected, we can continue the pipeline. The data is moved to the next stage to produce the restaurant-id-by-comment-cluster-id frequency matrix from the restaurant-id-by-list-of-comment-cluster-ids data. This matrix can be used to produce the correlation matrix using the same formula as in the previous tag-by-tag correlation matrix calculation. The resulting comment-by-comment correlation matrix is shown in Figure 4.8.

Figure 4.8: Comment-by-Comment Correlation Matrix

Unlike the previous colour-map matrices, this colour-map matrix only has 37 rows and columns. These 37 rows and columns are the cluster ids

that remain after preprocessing eliminates empty and extremely low correlation rows. Some examples of low correlation rows can still be seen in this figure, in blue. In particular, clusters 24 and 26 appear to have low correlation values in relation to the other clusters. These low correlation values across entire rows suggest that the comments in these clusters are dissimilar to all the other clusters. However, the relationship of these two clusters to each other is closer to orange, which means the two clusters' contents are similar to each other. A large portion of this figure demonstrates a medium amount of correlation, but some high correlation values can be seen. For example, rows 3, 4 and 5 appear to have a medium amount of correlation with almost all the other clusters. Two key elements make these rows significant. First, despite their medium amount of correlation, in these rows the correlation with clusters 24 and 26 is decidedly low. Second, multiple correlation values in these rows are higher than the medium amount of correlation represented by the green colour; this orange-green colour represents a value between a medium and a high amount of correlation. This suggests that these comment clusters are similar to a large portion of the other clusters. After this correlation matrix, the colour-map matrix in Figure 4.9 displays the Brand matrix produced by the following step of the pipeline. This colour-map matrix contains a large number of zero-valued correlation values. This is a result of the large number of low to average correlation values in the previous correlation matrix being eliminated by the Brand transformation and the subsequent pruning. The remaining values present interesting information about the relationships between the various clusters. For example, clusters 33 and 37 appear to have a high correlation

with other clusters. However, other clusters such as cluster 1 have no correlation values with the other clusters.

Figure 4.9: Comment-by-Comment Brand Matrix

4.4 Restaurant-by-Restaurant Correlation Matrix

In this form of the pipeline, which serves as the template for the next set of pipelines, the aim is to produce a restaurant-by-restaurant correlation matrix which can be used for weighting the sequence comparisons. This pipeline follows a similar path to the tag-by-tag pipeline. However, when

the step to compute the correlation matrix is reached, an altered formula is used. Using this altered formula, the correlation matrix is produced and displayed in Figure 4.10. Similar to the other pipelines, the diagonal is zeroed, along with the other post-processing techniques described in the methodology of the previous chapter.

Figure 4.10: Restaurant-by-Restaurant (From Tag) Correlation Matrix

Based on this colour-map matrix, there is a small number of significantly correlated values, coloured between yellow and red. Additionally, there is a large number of rows with consistently low correlation with almost every restaurant. The middle of the colour index, around green, also has a number of values; these suggest a weak correlation with many other restaurants. Finally,

there is a large amount of white space, indicating correlation values of zero and suggesting that many restaurants share no correlation. As mentioned in the methodology and used previously, before the correlation matrix can be transformed using the Brand transformation, the negative values have to be removed. The transformed matrix is shown in Figure 4.11.

Figure 4.11: Restaurant-by-Restaurant (From Tag) Brand Matrix

This colour-map matrix clearly contains many more strongly correlated values. There are almost no weakly correlated values, as these have either been reduced to 0 or gained correlation value from the transformation. When compared to the previous

correlation matrix, one can notice the similar structure that is presented. The restaurants with many values, forming lines around restaurant id 500 and before 1000, can also be seen in the Brand transformed correlation matrix. These correlation values, and as a result the pattern they form, are either reinforced, becoming larger, or removed, erasing part of the pattern, where the restaurants had low correlation values. Neither of these matrices is the end goal, but both can be used towards it. The goal of this research is to incorporate weighting matrices, whose values are based on correlation information obtained from the metadata, into sequence comparisons. Thus, to better understand the significance of using the Brand transform, the two correlation matrices above can be independently used in the sequence comparison step; the results of this step indicate the effectiveness of either matrix. Additionally, the baseline, the unweighted Damerau-Levenshtein distance, is also used to produce a similarity matrix and serves as a base against which each weighted result is compared. The visual displays show the difference between each version's weighted edit distance results and the baseline results produced by the original edit distance. Thus, at this point the two correlation matrices can be processed into the different form needed for use in the weighting algorithm, as described previously. Once the preprocessing is complete, the weighted Damerau-Levenshtein distance can be computed between each user sequence and the other users' sequences. The resulting edit distance values are placed in a similarity matrix where indices i and j each represent a specific user; the value stored at index (i, j) is the edit distance between the two user sequences.
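A sketch of this pairwise comparison, reusing the `weighted_dl` and `cost` helpers from the earlier sketch; `sequences` stands for the 500 sampled user sequences:

```python
import numpy as np

# Compare every pair of sampled user sequences with the weighted
# Damerau-Levenshtein distance, filling a symmetric distance matrix.
n = len(sequences)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = weighted_dl(sequences[i], sequences[j], cost)
```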

To demonstrate the gain from using the methodology described in this research, Figure 4.12 shows the non-weighted edit distance when run on the user sequences.

Figure 4.12: Non-weighted Damerau-Levenshtein Distance Similarity Matrix

Now the difference between the weighted edit distance and the original edit distance can be displayed. These values represent the changes that can be achieved through the use of weighting matrices. These changes are considered improvements due to the improved meaningfulness of the results, as previously shown in the nearest neighbour tables and in the following results. Figure 4.13 displays the difference between the baseline and the weighted results from the restaurant-by-restaurant (From Tag) correlation matrix. This figure displays numerous differences between the original method and the weighted method.

Figure 4.13: Difference in Weighted Edit Distance using Restaurant-by-Restaurant (From Tag) Correlation Matrix

A key part of this figure is understanding that the amount of change is large. The values near the colour red are approximately 1 unit of edit distance less than under the original edit distance method. Considering that a sequence can range anywhere from 1 to 208 elements, these results demonstrate significant changes as compared to the original results. Figure 4.14 displays a similar comparison, but using the restaurant-by-restaurant (From Tag) Brand matrix as the weighting matrix. It is clear in this version that a number of areas were reduced in value as a result of the Brand

transformation. As described previously, the Brand transform handles popularity issues that arise from common tags.

Figure 4.14: Difference in Weighted Edit Distance using Restaurant-by-Restaurant (From Tag) Brand Matrix

Examples

We can focus on a few of these results to examine how the weighting matrix changes the edit cost. As discussed previously, as a result of the data used, the sequences are composed of restaurant identifiers instead of actual restaurant names. In this first example, the two sequences are:

• Sequence 1: [630, 630, 9526]

• Sequence 19: [18170, 11193, 110941]

Using the standard Damerau-Levenshtein distance, the edit distance is 3, since each element needs to be replaced to transform one sequence into the other. However, the weighted Damerau-Levenshtein distance is able to leverage the correlation found between restaurants '630' and '18170'. Instead of replacing '630' with '18170' for a cost of 1, as per the standard method, the weighted method can reduce the cost of this substitution, due to the perceived similarity, to 0.9696. Thus, the standard method returns a value of 3 while the weighted method returns a value of 2.9696. By tracing the reduced cost back to the original restaurant tags, it can be seen that the first restaurant, '630', shares a similar tag with the second restaurant, '18170': the tag 'vegetarian' is related to the tag 'vegan'. It is shown in Appendix A that these two tags are correlated with each other due to their perceived relationship. In this second example, the two sequences are:

• Sequence 4: [19393, 13016, 4506, 11790, 19182]

• Sequence 12: [11734, 1394, 11734, 1394, 23258]

Using the standard Damerau-Levenshtein distance, the edit distance is 5, since each element must be replaced. Using the weighted version, the algorithm detects that '13016' can be substituted with '1394' for a reduced cost of 0.9648. Thus, the final weighted cost is 4.9648. By tracing the reduced cost back to the original restaurant tags, it can be seen that the first restaurant, '1394', has a tag ('pizza') that is weakly related to the

restaurant '13016', which has the tag 'italian'. It was shown earlier in this section that these two tags are correlated with each other due to their perceived relationship. In this third example, the two sequences are:

• Sequence 30: [2593, 14197, 1404, 1211, 9209, 349, 12833, 1602, 23209, 2593, 14197, 1404, 4340, 1211, 9209, 16675, 12833, 22148, 7723, 21345, 68935, 1231]

• Sequence 31: [14924, 2302016, 19346, 1086, 19336, 549, 102278, 14924, 35966, 2302016, 19346, 16816, 16858, 7557, 11733, 51967, 7792, 1782104, 3205, 9953, 12910]

Using the standard Damerau-Levenshtein distance, the edit distance is 22, the maximum distance required to replace every value to transform one sequence into the other. The weighted edit distance cost is 21.8688 as a result of the following substitutions:

• (1602 ↔ 14924) Distance cost: 0.9728

• (1404 ↔ 16816) Distance cost: 0.9519

• (21345 ↔ 9953) Distance cost: 0.9923

• (1231 ↔ 12910) Distance cost: 0.9518

The above restaurants are correlated based on their related tags as well. For example, restaurants '1602' and '14924' have a relationship based on the tags 'italian' and 'ny'. It is interesting to note that in the Brand results other versions of New York are also correlated to the 'italian' tag. The other restaurants have similar underlying relationships, such as the third substitution, which weakly correlates 'italian' with 'cash' and 'only'. In this fourth example, the two sequences are:

• Sequence 201: [42775, 54085, 85776, 7723, 49886, 143883, 25353, 104003, 79919, 15743, 42775, 9772, 64680, 7723, 49886, 11664, 143883, 25353, 6900, 79919, 42, 174401, 173414, 13463, 24321, 5639, 34897]

• Sequence 394: [47569, 12833, 186, 12875, 11846, 24321, 14854, 45896, 20, 4797, 17830, 12833, 186, 12875, 11846, 24321, 14854, 27031, 9727, 11717, 14487, 6894, 773, 12621, 733, 6790]

Using the standard Damerau-Levenshtein distance, the edit distance is 27, the maximum distance required to replace every value to transform one sequence into the other. The weighted edit distance cost is 26.4400 as a result of the following substitutions:

• (7723 ↔ 12875) Distance cost: 0.9672

• (79919 ↔ 20) Distance cost: 0.6981

• (7723 ↔ 12875) Distance cost: 0.9672

• (79919 ↔ 11717) Distance cost: 0.8377

• (174401 ↔ 6894) Distance cost: 0.9800

• (173414 ↔ 773) Distance cost: 0.9899

There are a number of underlying connections demonstrated in these substitutions. The strongest correlation, between restaurants '79919' and '20', is based on both restaurants sharing the tag 'hot'. While this connection appears strong, the underlying data for these weighted edit distance calculations is a correlation matrix. The number of records for each of the two restaurants makes it difficult to fully trust

this interaction without considering the impact of the longer sequences and the number of high popularity tags. Other connections are present in this example, such as between '79919' and '11717'. These restaurants share the connection between the tag 'hot' and the other four tags of '11717': 'lunch', 'cookies', 'cakes', and 'good mark'. These multiple connections are why the similarity value between these two restaurants is stronger than many of the other connections. The resulting differences between the weighted edit distance and the non-weighted edit distance are quite small. This is to be expected, as the number of overlapping restaurants between any two sequences is small in comparison to the size of the sequences. Sequence lengths range up to 208 elements, and the most change that can occur from any one interaction is to reduce the non-weighted edit cost of a substitution from 1 to near 0. Thus, even with the higher likelihood of similar restaurants occurring in long sequences, the amount of change in the edit cost is expected to be small. The longer sequences are expected to have more connections, and as a result a larger change than shorter sequences, due to having more opportunities for the use of a weighted value.

4.4.1 Restaurant-by-Restaurant - From Comments

Using a similar technique to the previous pipelines, but with the comments dataset frequency matrix and the altered formula, produces the restaurant-by-restaurant correlation matrix in Figure 4.15. Returning to the restaurant-by-restaurant correlation matrices results in a larger number of rows and columns, and the large number of elements can cloud the results from an overview perspective. From the overview, it is clear that there are a number of zero, or near zero, correlation values.

Figure 4.15: Restaurant-by-Restaurant (From Comments) Correlation Matrix

As well, there are a number of low and medium correlation values. Peering further into the colour-map matrix, one can make out the high correlation values present. In Figure 4.16, the restaurant-by-restaurant Brand matrix, the medium and high correlation values are preserved through the Brand transformation and subsequent pruning. Similar to the previous section, the weighted edit distance results of these two matrices are compared to the baseline. In Figure 4.17, the restaurant-by-restaurant (From Comments) correlation matrix is used as a weighting matrix. These results are much more sparse than the tags version.

Figure 4.16: Restaurant-by-Restaurant (From Comments) Brand Matrix

This sparsity is likely due to two major factors. First, the comments dataset is much more sparse than the tags dataset, and while measures were taken to reduce this sparsity, the problem is not completely solved. Secondly, the comments dataset and tags dataset only share a small portion of overlapping restaurant ids. Thus, we can surmise that a large portion of the restaurant ids present in the comments dataset are unused in the sequence sample displayed here. Figure 4.18 displays an interesting result: the Brand transform has allowed more edit distance costs to differ and, similarly to the last section's

Brand weighting matrix, has reduced the value of many of the differences.

Figure 4.17: Difference in Weighted Edit Distance using Restaurant-by-Restaurant (From Comments) Correlation Matrix

In the previous few sections, the basic pipeline and a few variations have been used to produce various correlation matrices. The restaurant-by-restaurant correlation matrices could be used to aid the sequence comparison goal of this research. However, these matrices do not fully incorporate the metadata connections that this research is striving to demonstrate. Therefore, in the following sections, the tag-by-tag correlation matrices and the comment-by-comment correlation matrices, which were unused previously, will be incorporated into the correlation calculation. This will produce restaurant-by-restaurant correlation matrices which are further amplified by containing the metadata-to-metadata connections.

Figure 4.18: Difference in Weighted Edit Distance using Restaurant-by-Restaurant (From Comments) Brand Matrix

4.5 Restaurant-by-Restaurant (With Tag)

The purpose of this pipeline is to produce a restaurant-by-restaurant correlation matrix which includes the tag-by-tag Brand matrix. This pipeline is similar to the template pipeline and follows similar steps; however, at the correlation calculation the tag-by-tag Brand matrix is included as an additional matrix. This

matrix provides additional information to the final restaurant-by-restaurant correlation matrix, which can be seen in Figure 4.19.

Figure 4.19: Restaurant-by-Restaurant (With Tag) Correlation Matrix

The results of this improved correlation matrix can be seen by comparing this colour-map matrix to Figure 4.10. This correlation matrix is denser than the previous matrix, likely due to the increased connectivity that can be achieved, as demonstrated in the previous chapter. The Brand transform can be applied next, as per the pipeline description, to arrive at the restaurant-by-restaurant (With Tag) Brand matrix shown in Figure 4.20.

This figure is also significantly denser than the previous restaurant-by-restaurant Brand matrix. In addition to the increased connectivity that was evident in the correlation matrix, the Brand transform shows that more connections reach high correlation values here than in the other correlation matrices.

Figure 4.20: Restaurant-by-Restaurant (With Tag) Brand Matrix

The weighted Damerau-Levenshtein algorithm can then be applied to sequences of restaurants using these correlation matrices. As expected when including additional tag connections, the amount of change between the baseline and weighted versions has increased, as demonstrated in Figure 4.21. The Brand version in Figure 4.22 has reduced the values of many of the results.

There still exist patches of large differences between the original edit distance and these weighted versions.

Figure 4.21: Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Tag) Correlation Matrix

4.6 Restaurant-by-Restaurant (With Comments)

This section repeats the process demonstrated above, but instead of including the tag data in the calculation, the comments data is used. The colour-map matrices demonstrate that the inclusion of a metadata correlation matrix improves connectivity. An interesting result is found in the restaurant-by-restaurant (With Comments) Brand matrix, Figure 4.24.

Figure 4.22: Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Tag) Brand Matrix

A significant portion of the zero values have shifted to a medium amount of correlation. Despite the overview perspective, there is still a significant amount of high correlation. Similarly to the previous section, these results are applied to the weighted Damerau-Levenshtein algorithm. The following matrices are produced by using these correlation matrices as weighting matrices and applying the algorithm to the user sequences. Figures 4.25 and 4.26 represent the weighted edit distance when using the restaurant-by-restaurant (With Comments) correlation and Brand matrices respectively.

Figure 4.23: Restaurant-by-Restaurant (With Comments) Correlation Matrix

As mentioned in the previous examples, the Brand version reduces the values of many of the differences. In this example, the values were already quite low, and with this reduction many of the sequence edit distance costs that once varied are now identical. The remaining values still represent significant improvements on the original edit distance calculation.

4.7 Restaurant-by-Restaurant (With Tag and Comment)

The purpose of this final pipeline is to include both the tag and comment datasets and their corresponding correlation matrices in a single restaurant-by-restaurant correlation matrix.

Figure 4.24: Restaurant-by-Restaurant (With Comments) Brand Matrix

However, as can be seen in Figures 4.27 and 4.28, the datasets both contained restaurant ids from the same pool but did not share enough of these ids to produce a large correlation matrix. Only 40 of the over 2000 restaurant ids each dataset contained were overlapping. The results of this matrix are similar to the general pattern of the matrices above. A number of restaurants show medium amounts of correlation with one another. A few restaurants deviate from this pattern, such as restaurant 11, which has no correlation values. Additionally, restaurant 19 has a high correlation value with almost every other restaurant. Other restaurants can be seen to have varying degrees of near-high correlation with almost every restaurant id, or near-low correlation with every restaurant id.

Figure 4.25: Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Comments) Correlation Matrix

However, the asymmetry of this matrix, while possible, does not seem consistent with the nature of the data being worked with. This matrix is transformed through the Brand transformation to produce the restaurant-by-restaurant (With Tag and Comment) Brand matrix. This transformation, as in previous uses, prunes a large number of low correlation values. Unfortunately, as discussed above, the datasets contained only a small number of overlapping restaurant ids, due to the way the data was collected.

Figure 4.26: Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Comments) Brand Matrix

As a result, this pipeline is predicted to be less informative than it could be on another dataset. Figures 4.29 and 4.30 show the results when the previous correlation matrices are applied to the weighted Damerau-Levenshtein calculation.

4.8 Clustering the distance matrices

In this section, the previously produced distance matrices from the weighted edit distance calculation are collected. These matrices hold distance values that connect users.

Figure 4.27: Restaurant-by-Restaurant (With Tag and Comments) Correlation Matrix

Due to the nature of this data, the matrices can be considered adjacency matrices between users, which can then be used to cluster the users into groups. Using the DBSCAN clustering technique, the results demonstrated that there is a core commonality between these users. Despite varying parameters, the users generally clustered into only one cluster. The min_samples parameter had to remain at 2 for any other, smaller clusters to appear; if the parameter was raised above 2, then regardless of the epsilon value there would always be at most one cluster. In all the testing conducted, when there was more than just the main cluster and the outlier cluster, the remaining clusters had small element counts (almost always 2).

Figure 4.28: Restaurant-by-Restaurant (With Tag and Comments) Brand Matrix

The results of various tests can be seen in Table 4.7. The tests demonstrated in this table use a min_samples value of 2. The median distance value, excluding zero entries, was approximately 0.0705; epsilon was set to this median multiplied by a factor ranging from 1/10 to 1. The weighting matrix that was used was the restaurant-by-restaurant (with tag) correlation matrix. However, testing showed that the differences between the various distance matrices were too small to produce significantly different results by changing the input file.
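A minimal sketch of this experiment follows (the random matrix is a stand-in for one of the weighted edit distance matrices; the real run used 500 user sequences):

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    n = 500
    # Toy symmetric "distance" matrix standing in for a weighted edit
    # distance matrix between user sequences.
    D = rng.random((n, n))
    D = (D + D.T) / 2
    np.fill_diagonal(D, 0.0)

    median = np.median(D[D > 0])  # median distance, excluding zero entries
    for k in range(1, 11):        # factors from 1/10 of the median up to the median
        labels = DBSCAN(eps=k * median / 10, min_samples=2,
                        metric="precomputed").fit_predict(D)
        n_clusters = labels.max() + 1           # cluster labels run 0..k-1
        n_outliers = int((labels == -1).sum())  # DBSCAN marks outliers as -1
        print(f"{k}/10 of median: {n_clusters} clusters, {n_outliers} outliers")

With min_samples set to 2, a single pair of mutually close users is enough to form a cluster, which is consistent with the small satellite clusters of size 2 observed above.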

Figure 4.29: Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Tag and Comments) Correlation Matrix

Based on the results of the table and various testing of parameters, the conclusion that can be reached is that there are enough pairwise similarities between users that every user is reachable from any other. This means that there are no outlying behaviours: no groups of users whose tastes differ significantly from the rest. This suggests that the usefulness of these results lies in the immediate neighbours of any user. The differences in behaviour can be examined by reducing the epsilon value to shrink the core component to a smaller size.

Figure 4.30: Difference in Weighted Edit Distance using Restaurant-by-Restaurant (With Tag and Comments) Brand Matrix

The common pattern of behaviour these users share can be separated into more components by examining the data at each layer. The main-cluster component at 1/10th of the median appears to consist of short sequences, mainly of length 1. As the epsilon value increases, the included sequences increase in length.

Table 4.7: DBSCAN results with varying parameters

ε                    Number of users    Number of users       Total number
                     in main cluster    in outlier cluster    of clusters
1/10th of median     39                 459                   3
2/10th of median     79                 417                   4
3/10th of median     173                327                   2
4/10th of median     205                289                   5
5/10th of median     260                238                   3
6/10th of median     283                217                   2
7/10th of median     302                198                   2
8/10th of median     333                167                   2
9/10th of median     349                151                   3
Median               366                134                   2

4.9 Summary

In this chapter, the pipelines from the previous chapter were implemented and their results were recorded. The first pipeline uses the techniques established in the previous chapter to construct a tag-by-tag correlation matrix and Brand matrix. The second pipeline performs a similar process for the comments dataset, with special provisions added to handle the unique nature of the comments data. The template pipeline demonstrated the standard operating steps that serve as the baseline for the various variations that are conducted. This pipeline produced a few restaurant-by-restaurant correlation matrices from the tags dataset, and concluded by demonstrating the effectiveness of these matrices as weighting matrices in the sequence comparison calculations. The next two pipelines utilize the previously unused metadata-by-metadata correlation matrices to create improved restaurant-by-restaurant correlation matrices. These improved correlation matrices are infused with the connections established by the metadata-by-metadata correlation matrices, resulting in improved connectivity between restaurants in the dataset.

Finally, both the tag and comment correlation matrices were used to create the final restaurant-by-restaurant correlation matrices. These matrices represent the restaurant-by-restaurant correlation with the information provided by both the tag dataset and the comments dataset. However, due to the way the data was collected, this method provides a limited view of the overall restaurant-by-restaurant relationships and, as a result, contributes little to the weighted sequence comparison algorithm.

In the section on clustering the distance matrices, the final distance matrices produced from the weighted edit distance calculation were treated as adjacency matrices between the users they represent. A clustering method was applied, which demonstrated a core commonality between the New Yorkers represented in this dataset.

Chapter 5

Conclusion

Behaviour is not arbitrary; it is a choice based on knowledge and understanding of the elements available for selection. When documented over a period of time, these behaviours form a sequence. Previous work has described methods for comparing these sequences to better understand the underlying connections between them. However, the problem remains that the elements within each sequence also have similarity relationships of their own, which influence the difference between sequences. In this work, I examined the process of utilizing metadata previously unused in this comparison and demonstrated how it could improve sequence comparison techniques through weighting. A key application of this work is in the area of cybersecurity: the behavioural patterns found within computing systems can be analyzed to identify outliers potentially representing malicious actors. In this work, the FourSquare dataset was chosen to simulate the properties of such a system.

Before the weighted sequence comparison can be used, the weighting matrices must first be produced. In order to produce these matrices, several pipelines were designed.

These pipelines are similar in nature but vary in small aspects to produce different outputs. All the pipelines begin by processing the raw data and creating the first matrix of each pipeline, the frequency matrix. For the versions that use the tag dataset, this step is simple: a small amount of preprocessing to eliminate invalid rows, after which the CountVectorizer tool is used to produce the document-word frequency matrix. The versions of the pipelines that use the comment dataset are more complicated. The comments are in string format, and thus the first step is transforming them into a numeric format. FastText is used to produce 300-dimensional sentence vectors for each of the comments. These vectors are too sparse for the following steps, so hierarchical clustering is used to reduce the sparsity. With the comments replaced by their corresponding cluster ids, the comment dataset can also be processed through CountVectorizer to produce its corresponding document-word frequency matrix.

At this point the tag and comment versions of the pipeline become similar. The non-normalized cosine similarity can now be computed between each frequency matrix and its transpose to produce correlation matrices. Depending on the orientation of the frequency matrix, either a metadata-by-metadata correlation matrix or a restaurant-by-restaurant correlation matrix will be produced. Thresholding can be applied to this matrix, and then the Brand transform turns these correlation matrices into Brand matrices. The metadata-by-metadata versions of the correlation and Brand matrices are not directly usable for weighting the sequence comparisons; instead, they are used further along in another pipeline. The restaurant-by-restaurant correlation and Brand matrices are usable in the weighting process. Some simple processing to reduce the range is all that is required to make these matrices ready for use as weighting matrices. At this stage, four weighting matrices have been produced.
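A condensed sketch of the comment-side pipeline just described follows (the pretrained model path, the cluster count, and the toy comments are illustrative assumptions, not the thesis's settings):

    import numpy as np
    import fasttext
    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.feature_extraction.text import CountVectorizer

    # Comments grouped per restaurant (toy example).
    restaurant_comments = {
        "r1": ["great pizza", "good slices"],
        "r2": ["slow service", "rude staff"],
    }

    # 1. FastText 300-dimensional sentence vectors for every comment
    #    (assumes a pretrained model file is available locally).
    model = fasttext.load_model("cc.en.300.bin")
    texts = [c for cs in restaurant_comments.values() for c in cs]
    vectors = np.array([model.get_sentence_vector(t) for t in texts])

    # 2. Hierarchical clustering replaces each comment with a cluster id,
    #    reducing sparsity before the frequency matrix is built.
    ids = fcluster(linkage(vectors, method="ward"), t=2, criterion="maxclust")

    # 3. One "document" per restaurant whose words are its comments' cluster
    #    ids, run through CountVectorizer to get the document-word frequency matrix.
    it = iter(ids)
    docs = [" ".join(f"c{next(it)}" for _ in cs)
            for cs in restaurant_comments.values()]
    F = CountVectorizer().fit_transform(docs).toarray().astype(float)

    # 4. Non-normalized cosine similarity between F and its transpose; this
    #    orientation gives a restaurant-by-restaurant correlation matrix
    #    (F.T @ F would give the metadata-by-metadata version). Thresholding
    #    and the Brand transform would follow, per the pipeline.
    C = F @ F.T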

The pipelines do not need to stop here. In order to create a more meaningful representation of the relationship between elements, the metadata-by-metadata correlation and Brand matrices can be used in the correlation calculation of the restaurant-by-restaurant matrices. The addition of this matrix allows further connections to be made between restaurants. From this step, four additional weighting matrices can be produced. The ultimate purpose of the pipeline is to create restaurant-by-restaurant correlation and Brand matrices that include metadata from both the tag and comments datasets. Similar to the process before, the metadata matrices are inserted into the correlation calculation, producing another two matrices. At this stage, ten potential weighting matrices have been produced.

The Damerau-Levenshtein distance calculation has been modified to adjust the distance value if there exists a record that correlates the two restaurants. Using the weighting matrices produced above, a series of calculations can be done to determine the difference between the original distance calculation and the various modified versions. The differences represent the contributions of the metadata to the difference between restaurants within these sequences. These new distance measurements provide a more meaningful demonstration of the differences between the various sequences.
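A minimal sketch of this modification follows, assuming the adjustment takes the form of discounting the substitution cost by the recorded correlation (the thesis's exact cost function may differ); it uses the optimal-string-alignment variant of Damerau-Levenshtein:

    def weighted_dl(a, b, w):
        # a, b: sequences of restaurant ids; w: dict mapping id pairs to a
        # correlation-derived weight in [0, 1] (absent pairs get weight 0).
        n, m = len(a), len(b)
        d = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = float(i)
        for j in range(m + 1):
            d[0][j] = float(j)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if a[i - 1] == b[j - 1]:
                    sub = 0.0
                else:
                    # correlated ids substitute more cheaply than unrelated ones
                    sub = 1.0 - w.get((a[i - 1], b[j - 1]),
                                      w.get((b[j - 1], a[i - 1]), 0.0))
                d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                              d[i][j - 1] + 1.0,      # insertion
                              d[i - 1][j - 1] + sub)  # (weighted) substitution
                if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                        and a[i - 2] == b[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)  # transposition
        return d[n][m]

    # Toy weighting: restaurants 7 and 9 are recorded as correlated (0.8).
    w = {(7, 9): 0.8}
    print(weighted_dl([7, 3], [9, 3], w))   # 0.2 instead of 1.0
    print(weighted_dl([7, 3], [5, 3], {}))  # 1.0, no correlation record

With this discount, substituting two correlated restaurants costs less than a full edit, which is exactly the behaviour the difference matrices in the previous chapter visualize.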

There are a number of limitations that were present throughout this work. First, the way the sequence data was collected was unclear. This work assumes the sequence file is in chronological order, and the sequences were constructed on that basis. The two metadata files were also not as related as predicted: the number of overlapping restaurants was small, and as a result the last stage of the pipeline, the attempt to include both the tag and comments datasets in a single restaurant-by-restaurant correlation matrix, was unsuccessful. Additionally, several of the techniques used in the pipelines have specific conditions that must hold in order to work correctly. As a result, initial preprocessing included removing any rows with fewer than 2 tags, and at the end of producing the correlation matrices the negative values were set to zero.

Lastly, the computational cost of this program is prohibitive for instantaneous or quick results. The dataset contained 2060 user sequences ranging from 1 element to 208 elements each. Despite the final sequence comparison matrices being symmetrical and only needing half the computations, using 20 threads which consumed approximately 153 GB of memory, the computation of the weighted sequence comparison matrix took a significant amount of time. The worst case experienced took approximately 3 weeks to fully compute the full sequence list, while other results could be computed within two days. This discrepancy is attributed to the additional computation required by the weighted method (the former time) compared to the non-weighted method (the latter time). Despite these prohibitive computation times, strategic planning can be used to maximize the use of this work. Once a matrix is computed, new sequences can be added with minimal additional computation, as sketched below. However, if the underlying weighting matrices change, then the computation may need to be rerun in full. In this work, only 500 sequences were used as a result of the time required to compute the various distance matrices using the weighting matrices produced in the various pipelines.
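A sketch of that incremental update (the stand-in distance function is a placeholder for the weighted Damerau-Levenshtein call with the chosen weighting matrix):

    import numpy as np

    def add_sequence(D, seqs, new_seq, dist):
        # Extend an n-by-n distance matrix with one new sequence: only n new
        # distance computations instead of recomputing the whole matrix.
        row = np.array([dist(new_seq, s) for s in seqs])
        n = len(seqs)
        D_new = np.zeros((n + 1, n + 1))
        D_new[:n, :n] = D
        D_new[n, :n] = row
        D_new[:n, n] = row
        return D_new, seqs + [new_seq]

    # Toy usage with a length-difference stand-in distance.
    seqs = [[1, 2, 3], [4, 5]]
    D = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
    D, seqs = add_sequence(D, seqs, [6], lambda a, b: abs(len(a) - len(b)))
    print(D)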

Bibliography

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. Proc. 4th Int. Conf. on Foundations of Data Organization and Algorithms, 730:69–84, 1993.

[2] N. Ahmed, T. Natarajan, and K. R. Rao. Discrete cosine transform. IEEE Trans. on Computers, 23:88–93, 1974.

[3] J. Bank and B. Cole. Calculating the Jaccard similarity coefficient with map reduce for entity pairs in Wikipedia. 2008.

[4] Jernej Barbic, Alla Safonova, Jia-Yu Pan, Christos Faloutsos, Jessica Hodgins, and Nancy Pollard. Segmenting motion capture data into distinct behaviors. pages 185–194, 2004.

[5] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

[6] M. Brand. A random walks perspective on maximizing satisfaction and profit. In SIAM Conference on Optimization, May 2005.

[7] Nancy M. Carter, William B. Gartner, and Paul D. Reynolds. Exploring start-up event sequences. Journal of Business Venturing, 11(3):151–166, 1996.

[8] S. Choi, S. Cha, and C. C. Tappert. A survey of binary similarity and distance measures. J. Systemics, Cybernetics and Informatics, pages 43–48, 2010.

[9] S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, and M. Tesconi. DNA-inspired online behavioral modeling and its application to spambot detection. IEEE Intelligent Systems, 31(5):58–64, 2016.

[10] Glenn Forbes, Stewart Massie, and Susan Craw. Fall prediction using behavioural modelling from sensor data in smart homes. Artificial Intelligence Review, 53:1071–1091, 2020.

[11] W. H. Gomaa and A. A. Fahmy. A survey of text similarity approaches. International Journal of Computer Applications, pages 13–18, 2013.

[12] S. Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In Data Mining (ICDM), 2010 IEEE 10th International Conference, pages 246–255, 2010.

[13] Gorkem Kar, Shubham Jain, Marco Gruteser, Jinzhu Chen, Fan Bai, and Ramesh Govindan. Predriveid: Pre-trip driver identification from in-vehicle data. Number 2, pages 1–12, 2017.

[14] A. Kelil and S. Wang. SCS: A new similarity measure for categorical sequences. ICDM '08: Proceedings of IEEE International Conference on Data Mining, pages 498–505, 2008.

[15] Agnes Koschmider. Clustering event traces by behavioral similarity. In Advances in Conceptual Modeling, pages 36–42, 2017.

[16] Christian Sebastian Loh and Yanyan Sheng. Measuring the (dis-)similarity between expert and novice behaviors as serious games analytics. Education and Information Technologies, 20:5–19, 2015.

[17] D.B. Skillicorn, R. Billingsley, P. Peppas, P. Gardenfors, H. Prade, and M.-A. Williams. Case-based similarity for social robots. 2017.

[18] Paul J. Taylor. Proximity coefficients as a measure of interrelationships in sequences of behavior. Behavior Research Methods, 38:42–50, 2006.

[19] V. Thada and V. Jaglan. Comparison of Jaccard, Dice, cosine similarity coefficient to find best fitness value for web retrieved documents using genetic algorithm. International Journal of Innovations in Engineering and Technology, 2013.

Appendix A

Nearest Neighbour Results

vegetarian(10th)

Correlation results: diner, alley, bowling, shop, play, record, ground, run, ski, gallery

Brand results: vegan, chicken, sushi, diner, fried, shop, gallery, dog, run, breakfast

cocktails(20th)

Correlation results: great, good, open, dining, service, patio, drinks, trendy, on, special

Brand results: outdoor, great, the, beer, dining, cheese, burger, seating, best, french

dinner(30th)

Correlation results: salad, soup, delivery, salads, good, sandwich, bread, healthy, special, drinks

Brand results: cheese, free, french, sandwiches, burgers, lunch, hours, late, the, beer

socialite(40th)

Correlation results: douchebag, chef, douchebags, top, brooklyn, girl, gossip, bravo, trendy, magazine

Brand results: bravo, chef, douchebag, top, rated, gossip, girl, wsj, bon, pizza

cupcake(50th)

Correlation results: cookies, cake, dessert, desserts, pastries, bread, cupcakes, trendy, cafe, chocolate

Brand results: dessert, cupcakes, tea, hot, bakery, free, cream, cake, sandwiches, ice

music(60th)

Correlation results: spanish, cuban, on, specials, sangria, patio, square, dining, special, great

Brand results: friendly, great, on, hour, happy, dining, patio, hours, good, cocktails

dog(70th)

Correlation results: play, ground, record, ski, karaoke, area, bowling, alley, run, shop

Brand results: shop, gallery, diner, fried, run, karaoke, chicken, ski, area, play

asian(80th)

Correlation results: sake, curry, rice, noodles, drinks, all, ramen, long, pork, byob

Brand results: wings, pork, korean, sake, rice, chinese, cheap, curry, fish, thai

wsj(90th)

Correlation results: magazine, box, mag, appetit, bbq, bon, district, trendy, financial, foodie

Brand results: chef, bbq, rated, pizza, bravo, top, magazine, bon, new, douchebag

bread(100th)

Correlation results: pastries, salads, organic, desserts, juice, smoothies, stumptown, healthy, bagels, green

Brand results: salads, cookies, free, organic, wifi, pastries, chocolate, desserts, espresso, tea

play(110th)

Correlation results: record, bowling, ski, alley, ground, area, karaoke, run, dog, gallery

Brand results: dog, run, karaoke, gallery, shop, area, ski, ground, record, diner

noodles(120th)

Correlation results: dumplings, spicy, curry, long, ramen, awesome, byob, pancakes, wheelchair, indian

Brand results: boys, date, vietnamese, byob, hipsters, falafel, baked, lobster, cuisine, mediterranean

open(130th)

Correlation results: special, patio, specials, soda, sports, garden, service, good mark, hipsters, menu

Brand results: good, special, service, drinks, specials, frozen, cheap, bistro, chelsea, soda

lobster(140th)

Correlation results: oysters, soho, fish, steakhouse, mediterranean, indian, awesome, long, grilled, grill

Brand results: No more results

sake(150th)

Correlation results: ramen, all, snacks, hotel, date, mediterranean, byob, long, soho, grill

Brand results: chinese, boys, date, vietnamese, byob, hipsters, falafel, baked, lobster, cuisine

Appendix B

Cluster Contents

B.1 Cluster 0

“Try the pollo alla grilga. It was damn good.”
“Excellent wine list. And the \”
“Prepare for the meat sweats”
“Beer combinations are pretty good”
“It’s a good restaurant”
“Cheap & cheerful. Excellent Italian comfort food.”
“Service was pleasant but slow. Steak was delicious.”
“Cute, cozy cafe... Good, reasonably priced snacks. $5 organic breakfast egg roll... A real winner!!”
“Good pizza! Just had a food baby.”
“Pretty good. Service friendly, though she did say she was one of the ’good ones.’ great music.”
“While the pumpkin waffle isn’t very sweet, it’s quite good.”
“Great place hear good music and catch good food.”

“Our waitress was so hot that it was kind of distracting. In a good way.”
“Oysters were great, as was the Manhattan Clam Chowder but the service during lunch was slow - waiter was movin’ though. Just very busy.”
“Monkfish is excellent”
“Wonderful but pricey curried pork”
“The cheeseburger with Swiss was pretty damn good... the bun was the best part!”
“It’s ’ok’ Cuban food. Portions are small for price.”
“Tea press was great. Bacon was phenomenal. Apple & cheddar omelette was disappointing. Apples were too mushy. Tasted like apple sauce and cheese. :( The service was spectacular though.”
“Mosquitos bad headbanger music nuff said. Som good slices tho.”
“Burger is amazing. Avoid the catfish. Clam chowder was great too.”
“Nice Chelsea brunch locale; good beer list.”
“The General Tso shrimp was great. Plump, well prepared prawns in a delicious, light sauce. The half crispy chicken we had was also delicious.”
“Excellent value dinners and house wine.”
“great fish, good food”
“Really great brunches & reasonable prices.”
“great place to watch Philly games”
“Raspberry Lemon trifle was transporting. Cake in cup nowhere near as good.”
“Hot waiters here! Friendly, too.”
“The arugula and beet salad was delish”
“So tasty. Slow service tonight, but we weren’t rushed so objt was fantastic.”
“Coffee + flan so good here”

“I preferred French Laundry over Per Se. Great food but service was lacking last time I was here.”
“Big portions, prices are VERY reasonable. Good drinks as well.”
“Excellent Texas Red Chili, great ribs too. A bit pricey though.”
“The chips were great with blue chz- extra chips was $3 more. Salmon fri special was delish. Bourbon dessert was light and perfect even on full stomach. Dirty martini kettel1 $14 and was like 3 shots.”
“Spicy tuna tartar was delish. Bourbon praline profiterole was amazing.”
“The pumpkin and raisin muffin was pretty scrumptious.”
“The tempeh Reuben was great!”
“Okay garlic rolls, but really good pizza.”
“Cash Only. Otherwise excellent.”
“The Napoleon was spectacular.”
“Generous portions”
“Grilled pizza stayed crispy and flavorful. Pork chop was great. Penne was nothing special. Overall fantastic ingredients.”
“Omelette’s are pretty good and decently priced.”
“Have a thick skin. The food was excellent, but the service was horrible and it was ridiculously loud.”
“Great appetizers. Lamb shank was alright, while salmon was excellent. Try their oysters!”
“Really neat atmosphere inside. Reasonably priced food. Get the fried pickles to share (or not!)”
“Super expensive but good.”

“The Salt Cod looks really good”
“Papusa’s are good - Tamales not so much.”
“Cheap. Generous. Fun.”
“Ridiculously good sandwiches.”
“Frittata was delicious.”
“Cheap drinks, good food, fun atmosphere. <3”
“The goat cheese and basil scramble was awesome. 5 of us got 5 different things and all sampled. The cheese cornbread was awesome too. All was amazingly delish.”
“Wild boar was delicious”
“Seasonal Pumkin Beer was good but peak organic IPA was better :)”
“Chicken salad... not recommended. Chk soup, is good.”
“Not worth the money....unless you’re really hungry and need really bad faux-Japanese food...”
“awful. this place gives Japanese food a really bad name.”
“really really good calamari!”
“Portions are generous fyi. Im about to explode.”
“Everything is damn good. Try the monk dumplings yummmmm.”
“Disappointing. Tuna was was salty, salad was prepackaged arugula and served in a plastic takeout container. Feels too cheap for the price they charge.”
“The pasta with mushrooms was fabulous. And the skirt steak was perfect too.”
“Very disappointed waited 45 minutes for my food. Catfish was nasty. Totally dissatisfied.”

“Service was poor. Waitress kept us waiting even though restaurant was pretty empty. Asparagus was disappointing, croquettes were good but order only brings two. Overall it was forgettable.”
“Healthy. Cheap. Portions not great, but fast and convenient so who cares about the portions?”
“fries made the clam bake - cobb salad was pretty good too”
“Strong margaritas good food”
“Sapporo draft here was damn smooth.”
“Convenient, easy, good pizza place. Wish it was open later!”
“24h, good food - also called \”
“Really good food!Service was awesome!!”
“Everything is very clean and fresh. The wait staff seemed very sweet, too. I had the Classic sanwich, which was a bit of an acquired taste for me. Everything else sounded terrific though!”
“Coffee was weak, but the food looked tantalizing...”
“good Indian food!”
“Bagels are really good”
“good coffee really bad espresso, be warned.”
“The hummus was excellent.”
“The roast organic chicken was good. Not superb, as I had expected, but good.”
“cheap but good turkish. excellent hummus.”
“Rustic. Not fancy but good. Quick and friendly”
“Really great edamame, Cobb salad was awesome too!”
“Excellent brunch. Good variety, reasonable prices, nice vibe.”

“surprisingly good lunch in midtown”
“Not cheap, but generous portions of *awesome* food. Worth making the trip from Brooklyn.”
“Awesome live music but weak Margs. Decent food, reasonable prices.”
“Good food but the pomodoro pizza was disappointing.”
“Shitty service but good pizza”
“great portuguese food, great ambiance. The desserts are awesome as well!.”
“Their cheesecake was amazing”
“They are closed for vacation until April 25 (well deserved too)!”
“This. Pizza. Is. My Favorite! So good.”
“Ridiculously good fries here.”
“The food was good but the service was terrible. Definitely won’t be returning.”
“Goooood Sangria. Bistek Cubano was delicious. So was the Pollo Beso..”
“sandwiches are pretty delicious.”
“Every bit as good as millennium in San Francisco”
“They have a sitar player! (And good, inexpensive food.)”
“Try the salmon tartar. Pretty good!”
“Burrito bowl for dinner...always good eats!!!”
“Yep barista was friendly as fuck. Latte was quality too. Recommend.”
“Raspberry Tart was really good.”
“Good food, staff has barely contained rage.”
“The lasagna was perfect! Juicy, tomatoey and herbtastic!!”
“Overpriced, but lots of options and delicious, so go for it!”
“The classic Banh Mi was damn good das fo sho!”

“Avoid, avoid avoid. The food was poor and totally over priced, the service was shockingly poor. The only good thing was the singing which is fun, until the whack the volume up too loud...”
“Recently closed for health code violations.”
“Friendly staff, good food, if the place had wifi it’d be perfect.”
“The Pumpkin Curry was good...could have used a bit more pumpkin flavor but overall tasty!”
“Tasty but WAY overpriced. $8 for a frozen yogurt? Really? REALLY?”
“Excellent, reasonable italian classics, brick oven. Wine not so cheap.”
“Worth EVERY penny spent - excellent food, excellent service!”
“Try a triple grande soy no whip pumpkin spice latte; what the hell”
“Overpriced.”
“Rad diner, good food, 24 hours a day.”
“Pretty good pizza for really cheap.”
“Nothing particularly special but pretty damn good, good deal.”
“The Mediterranean vege burger with dill dressing was quite tasty! Affordable & quick, too.”
“Reasonable prices, great food, cool upstaires seating area.”
“Borderline shady seating scene upstairs, but decent.”
“Rockefeller Center- yummy but overpriced.”
“Mea, mushroom risotto ... Very good!”
“Very cool place”
“Wild Mushroom Appetizer was super tasty. I also loved the Grilled Chicken Bordeaux with butternut squash raviolli, sauce was delicious.”

“easy places and really nice”
“Was pretty nice, food was delectable, best part though was the tour of Ramsay’s immaculate kitchen.”
“Grilled salmon was great. But the cod and bass were just so-so.”
“Tried a crazy vegetarian soup, sooo good! Vegie burger was freshly made”
“Pretty solid breakfast. Nothing fancy, but good.”
“Apple Muffin is very good!”
“Not great, but good.”
“great food, but damn, why sooooo expensive”
“Really, really good french fries!”
“Staff is really friendly here.”
“The 48hr organic chicken was really amazing!”
“Service was great for me! Loved the chicken wrap.”
“I’m a food bag.”
“Food was good, loved the nachos. Waiter was miserable, seemed like he did not want to be there.”
“The last time I went the service was crappy but overall the food is great”
“Love this place, so romantic, and really good food too”
“Chicken Caesar is really very good.”
“The chicken roll was too heavy. Bleh. Should’ve gone with a slice instead.”
“Is pretty good”
“Pricey but yummy!”
“All Xmas latte flavors (caramel brûlée, gingerbread, pumpkin spice) are sooooo good!!!”

“Babaganoush is babababangin here. The owner was really nice too.”
“Expensive (especially considering the healthier, more appropriate portions), but super noms.”
“Try the Tex-Mex wrap. Very good.”
“Steak sandwich - baguette was nice and fresh, steak was a little dry. Fries were cold.”
“A really good Panino Indiavolato!”
“Ribs were excellent, but portions small for \”
“very generous quantity”
“Excellent food!”
“It’s no Fuddruckers. Not really impressed...”
“Hainanese chicken is very solid.”
“Really cool bartender!”
“good craft beer selection”
“I had a nice time, food was good.”
“Everything we had was good. Fries were extremely salty though.”
“The food was great, but the service takes too long.”
“Try the Pumpkin Spice Latte. It’s pretty good.”
“Food was great and service is usually good.”
“Museum prices but undeniably excellent fare.”
“Falafel’s pretty good.”
“Chicken soup greasy but good!”
“Hot sandwiches are decent but overpriced. Generally overpriced, bad quality food”
“Food was great”

“Food was alright mgmt was a jerk”
“Combo Green – so good!”
“Lousy service and unfriendly staff. Double espresso was dull. Grilled cheese focaccia was decent.”
“Really cool place”
“Super bon.Very good”
“The Super Meat Quesadillas worth dying for!”
“Really good Swiss pastries”
“Coffee surprisingly good for a machine!!”
“Nice bar, good music, no cover!”
“Guac was average, food was good, service left a lot to be desired”
“Service was sub-par at best. Food was ok.”
“Red wine here - really bad”
“RM8 for small plate of hokkien noodle. Is that pricy or reasonable?”
“Branches a really cool here )))”
“I was a bit disappointed in the food. The Lamb osso bucco was alright. Prosciutto wrapped shrimp was good.”
“cafe food suck!”
“pricey but good selection for beer snobs”
“Good food but $11 for a vegetarian lunch buffet? Not worth it IMHO”
“Have a Manhattan. Is good.”

B.2 Cluster 37

In this cluster, the text “Say thank you & vote for \” appears 29 times; that is the whole cluster.

B.3 Cluster 45

The general trend of this cluster was around “pork”. However, as mentioned before, outliers exist in some clusters.

“pulled pork sandwich is the bizness!!!”
“Pork steamed soup dumplings!”
“Fried pickles.”
“Pork & Chive dumplings!”
“Pulled pork sandwich FTW”
“Pork chops!!”
“The pork buns are insanely good: a massive slab of pork belly in a light fluffy bun.”
“The pulled pork sando here is amazing.”
“Five-spice glazed pork belly ... get the pork belly in YOUR belly!!”
“It’s ALL about the pork belly, baby!”
“FREE WIFI!!!!”
“Steamed Bean buns”
“pork buns are amazing.”
“CHIPOTLE FRIDAYS!”
“anything pork belly is incredible.”
“The steamed buns are fabulous. The pork is so tender.”

“Special pork buns!!”
“The pork buns are everything you could hope for. Nom!”
“Crowded !!!!!”
“Steamed pork buns are a must!”
“Their BBQ pulled pork sandwich is particularly spectacular.”
“Pulled pork sandwich (Porky’s revenge) is excellent”