Comparing Unsupervised Clustering Algorithms to Locate Uncommon User Behavior in Public Travel Data
A comparison between the K-Means and Gaussian Mixture Model algorithms

MAIN FIELD OF STUDY: Computer Science
AUTHORS: Adam Håkansson, Anton Andrésen
SUPERVISOR: Beril Sirmacek
JÖNKÖPING, July 2020

This thesis was conducted at Jönköping school of engineering in Jönköping within [see field of study on the previous page]. The authors themselves are responsible for opinions, conclusions and results.

Examiner: Maria Riveiro
Supervisor: Beril Sirmacek
Scope: 15 hp (bachelor’s degree)
Date: 2020-06-01

Mail address: Box 1026, 551 11 Jönköping
Visiting address: Gjuterigatan 5
Phone: 036-10 10 00 (vx)

Abstract

Clustering machine learning algorithms have existed for a long time, and many variations of them are available to implement. Each has its advantages and disadvantages, which makes it challenging to select one for a particular problem and application. This study compares two algorithms, K-Means and the Gaussian Mixture Model, for outlier detection within public travel data from the travel planning mobile application MobiTime¹ [1]. The purpose of the study was to compare the two algorithms and identify differences between their outlier detection results. The comparisons were mainly made on the number of outliers located by each model, with respect to the outlier threshold and the number of clusters. The study found that the algorithms differ considerably in their ability to detect outliers. These differences depend heavily on the type of data used, but one major difference was that K-Means was more restrictive than the Gaussian Mixture Model when classifying data points as outliers. The results of this study could help people determine which algorithm to implement for their specific application and use case.
Keywords

Machine learning, clustering, K-Means, Gaussian Mixture Model, expectation-maximization, data analysis, public transport, silhouette analysis, outliers, outlier detection, data, algorithms, experiment.

¹ https://infospread.se/mobitime/en

Acknowledgments

We would like to thank Beril Sirmacek and Maria Riveiro for their help and guidance during the process of writing this thesis. We would also like to thank Infospread Euro AB, as well as Vladimir Klicik and Gabriella Stenlund, for being a helping hand throughout the whole study and for providing the data and the hardware necessary to make this study possible.

Contents

1 Introduction
1.1 Background
1.2 Partnering with Infospread Euro AB
1.3 Problem description
1.3.1 Related work
1.4 Aim & research objectives
1.5 Purpose
1.6 Scope and delimitations
2 Theories
2.1 Machine learning
2.2 Clustering
2.3 Outlier detection
2.4 K-Means algorithm
2.5 Gaussian Mixture Model
3 Method and implementation
3.1 Experiment design
3.2 Exploration/preparation of data
3.2.1 Overview
3.2.2 The data - exploration and preparation
3.2.3 Region
3.2.4 Privacy concerns and GDPR
3.3 Constructing both the models
3.3.1 Using silhouette analysis to determine K (number of clusters)
3.3.2 Model construction
3.4 Pick appropriate thresholds for identifying outliers
3.5 Compare results of the models
3.5.1 Experiment 1
3.5.2 Experiment 2
3.6 Artificial outlier experiment
4 Interpretation and analysis of the results
4.1 Experiment 1 interpretation and analysis
4.2 Experiment 2 interpretation and analysis
4.3 Artificial outlier experiment interpretation and analysis
5 Discussions and conclusions
5.1 Implications
5.2 Limitations
5.3 Reliability issues
5.4 Conclusion
5.5 Future work
References
1 Introduction

Commuting is a normal day-to-day task for most people, and as such people usually follow basic patterns, such as the station at which they board or leave the bus, the type of ticket they purchase, and the bus they prefer to take. This study shows how to locate outliers in these patterns by modeling normal user behavior in public transport and then locating outliers within that model. The research focuses on implementing a distance-based algorithm (K-Means) and a statistical algorithm (Gaussian Mixture Model), modeling user behavior with both, and then comparing the results that the two models produce.

1.1 Background

Data analysis, both exploratory and confirmatory, can be done in many ways these days. Clustering is a widely used technique for performing data analysis on datasets, and it can be applied in many different areas. As given in [2]:

"Even though there is an increasing interest in the use of clustering methods in pattern recognition, image processing and information retrieval, clustering has a rich history in other disciplines such as biology, psychiatry, psychology, archaeology, geology, geography, and marketing."

A clear definition of what a cluster is, is given in [3]:

"While analyzing data, a widely used task is to find groups of dataset objects that share similar characteristics. In doing so, users gain insight into their data, understand it, and even reduce its high dimensionality nature. These conceptual groups are commonly referred to as clusters."

Clustering algorithms produce these clusters to organize and separate data into different groups, as described above. This is how clustering algorithms help perform data analysis and, in this case, outlier detection. Anomaly detection, also known as outlier detection [4], is a way to find irregularities in data of any kind, and it can be used with all types of data to find errors or problems in a dataset [5].
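
The distinction drawn in the introduction between a distance-based and a statistical algorithm can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the data are hypothetical one-dimensional trip start times (not taken from the MobiTime dataset), the function names are invented for illustration, and both outlier thresholds are arbitrary assumptions. K-Means flags a point when it lies too far from its nearest cluster center, while the Gaussian Mixture Model flags a point when its estimated probability density is too low.

```python
import math

# Hypothetical toy data: trip start times in hours (not from the MobiTime dataset).
# Two commute peaks (around 08:00 and 17:00) plus two unusual trips.
trips = [7.5, 7.8, 8.0, 8.1, 8.3, 8.6,
         16.5, 16.8, 17.0, 17.2, 17.4, 17.9,
         3.0, 12.0]

def kmeans_1d(xs, centers, iters=50):
    """Lloyd's algorithm on 1-D data; returns the final cluster centers."""
    centers = list(centers)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def kmeans_outliers(xs, centers, max_dist):
    """Flag points farther than max_dist from their nearest center."""
    return [x for x in xs if min(abs(x - c) for c in centers) > max_dist]

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_1d(xs, means, iters=100, min_var=1e-3):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    k = len(means)
    means, weights, variances = list(means), [1.0 / k] * k, [1.0] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300  # guard against division by zero
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)  # avoid a collapsing component
            weights[j] = nj / len(xs)
    return weights, means, variances

def gmm_outliers(xs, weights, means, variances, min_density):
    """Flag points whose mixture density falls below min_density."""
    def density(x):
        return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
    return [x for x in xs if density(x) < min_density]

# Both thresholds are arbitrary values chosen for this toy example.
centers = kmeans_1d(trips, centers=[8.0, 17.0])
km_out = kmeans_outliers(trips, centers, max_dist=2.0)

weights, means, variances = gmm_1d(trips, means=[8.0, 17.0])
gmm_out = gmm_outliers(trips, weights, means, variances, min_density=0.03)

print("K-Means outliers:", km_out)
print("GMM outliers:", gmm_out)
```

Varying `max_dist` and `min_density` changes how many points each model flags, which is the kind of threshold-dependent comparison the abstract describes. The thresholds are on different scales (a distance versus a density), so they cannot be compared directly between the two models.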
As mentioned before, the K-Means and the GMM (Gaussian Mixture Model)