University of Amsterdam
Final year project

A Comparison of Clustering Algorithms

Rick Vergunst
supervised by Dr. Dick Heinhuis
revised by Laura Wennekes
July 13, 2017

Abstract

This paper compares different clustering techniques. Each of the chosen algorithms represents a different clustering technique. The algorithms were applied to six data sets whose classifications were known. The results were evaluated on the basis of several metrics in order to give a comprehensive overview. Overall, the two best techniques in this test were Gaussian Mixture and Mini Batch K-Means, with the latter outperforming the former on scalability. For small data sets with few features, Birch proved most successful. Furthermore, if the number of desired clusters is unknown, DBSCAN shows the best scalability while maintaining a decent performance.

Contents

1 Introduction
2 Theoretical Framework
  2.1 Clustering
  2.2 Cluster analysis techniques
  2.3 Cluster analysis comparisons
3 Methodology
  3.1 Data sets
    3.1.1 Banknote authentication
    3.1.2 Htru2
    3.1.3 Skin tone
    3.1.4 Spam detection
    3.1.5 Driving motion
    3.1.6 Forest recognition
    3.1.7 Data preparation
  3.2 Algorithms
    3.2.1 Gaussian Mixture
    3.2.2 Mini Batch K-Means
    3.2.3 Ward hierarchical clustering
    3.2.4 DBSCAN
    3.2.5 Birch
    3.2.6 Affinity Propagation
    3.2.7 Mean Shift
    3.2.8 Spectral Clustering
  3.3 Evaluation metrics
    3.3.1 Time and Memory
    3.3.2 Adjusted Rand Index
    3.3.3 Adjusted Mutual Information
    3.3.4 Homogeneity, completeness and V-measure
    3.3.5 Fowlkes-Mallows score
    3.3.6 Calinski-Harabaz Index
    3.3.7 Silhouette Coefficient
    3.3.8 Implementation
4 Results
  4.1 Time and Memory
  4.2 Known classification metrics
  4.3 Unknown class labels
5 Conclusion
6 Discussion
  6.1 Algorithm performance
  6.2 Metric performance
  6.3 Data sets
  6.4 Further research

1 Introduction

Within the fields of data mining and machine learning, many different methods for analyzing data exist. These methods range from anomaly detection to classification. In data mining the goal is to find patterns within data regardless of the nature of this data. These methods are often a combination of statistics, artificial intelligence, machine learning and databases (Fayyad et al., 1996). With these patterns one can try to give meaning to data, which in turn can aid decision making, for example (Berry and Linoff, 1997).

Most available methods can easily be compared and measured. Because these methods have fixed outcomes, their results can be either right or wrong. However, this does not hold true for clustering. Clustering is the counterpart of classification, in which data is divided into a number of groups (Jain et al., 1999). The difference between clustering and classification lies in whether or not the groups are pre-defined. An example of classification is the outcome of a football match. There are three possible groups, a win, a draw or a loss, and every data point has to be assigned to one of these groups. Within clustering this is not possible: the number of groups is unknown and thus has to be found during the process of clustering.

Clustering can be used in a broad range of scientific fields. This is recognized by both Jain et al. and Kaufman and Rousseeuw (Jain et al., 1999)(Kaufman and Rousseeuw, 2009). Applications of clustering are found not only within the different data mining fields, but also in fields such as economics, sociology, psychology and even environmental studies (Breiger et al., 1975)(Nugent and Meila, 2010)(Focardi et al., 2001)(Tan et al., 2013).
As a result, any research done to improve relevant algorithms will yield progress in numerous fields.

Within data mining, clustering holds a unique position. In order to find patterns in unstructured data, or to determine whether the data holds any value, one must resort to clustering. Because there is no alternative to this technique, it is vital that the knowledge surrounding it is kept up to date, so that the best possible results can be obtained. As stated by Mayer-Schönberger and Cukier, big data will become increasingly important in our society in the coming years (Mayer-Schönberger and Cukier, 2013). This is mostly due to the huge value the data can hold; extracting that value from big data can be a major advantage for any business or institution (Katal et al., 2013)(Roski et al., 2014)(McAfee et al., 2012)(Tien, 2013). To understand these large amounts of data, they have to be analyzed efficiently, and clustering proves useful for this (Shirkhorshidi et al., 2014). Clustering can be the first step in finding new value in data, especially in large amounts of unlabeled and unstructured data. By clustering this data and making sense of it, businesses, for example, can adjust their strategy and behaviour based on the clusters that are relevant to them.

Performance becomes an important factor in this new field as well, because the sheer amount of data being processed is increasing. Because of this influx of data, trial and error is becoming more expensive and choosing the right algorithm is becoming more delicate. Some initial attempts to evaluate performance have already been made, but further research is necessary to get a better overview (Shirkhorshidi et al., 2014)(Aggarwal and Reddy, 2013). Furthermore, it is also important to know which algorithms work well for big data.
In particular, determining how well algorithms scale, and finding those that are suitable for big data or need only little adjustment, is vital since big data is entering every field of work (Manovich, 2011).

Within clustering, there are several different algorithms to choose from, each with its own features and merits. However, the results of these algorithms differ significantly and can give a different outlook on the data. The choice of algorithm therefore strongly determines the results and should not be taken lightly. At the same time, comparing methods is subject to interpretation and there is no definite true or false. This also means that the choice of algorithm can be seen as more of an art than a rational decision. Some attempts to compare algorithms have been made, as will be discussed in the theoretical framework. This paper will therefore attempt to give an overview of the different available methods and how they compare to each other. By comparing them, it will be determined which algorithm is best suited for a certain case and, specifically, how well the algorithms scale with the number of features and the size of the data sets. In order to do this, the paper will try to answer the following question:

How do the different methods of clustering compare to each other and is there a go-to algorithm?

Before answering this question, previous work done within this field is discussed. In addition, a theoretical framework and a problem statement are given in the section labelled 'Theoretical Framework'. After that, the 'Methodology' section discusses the different elements used and describes how the coding and clustering were carried out. The 'Results' section shows all results in tables to provide a comprehensive overview of all findings. Each table is followed by a short discussion of its results. Based on these results, the 'Conclusion' section answers the research question stated above.
Finally, all other findings and proposals for future research are discussed in the 'Discussion' section.

2 Theoretical Framework

This section gives an overview of what clustering is and of the depth of the topic. First, to get acquainted with clustering, the generic step process is analyzed. Second, different clustering techniques are looked at. A wide range of techniques is available and each has its own merits and drawbacks. To aid understanding of the different algorithms that are available within these techniques, an example is given. Lastly, some papers that have carried out a similar comparison of clustering algorithms are considered. These served as inspiration, and certain procedures that give a good outcome can be reproduced. Furthermore, this section gives an idea of how far the scientific community has come in comparing algorithms and further demonstrates the relevance of the problem at hand.

2.1 Clustering

Cluster analysis is an important form of analysis and very relevant today. As stated by Kaufman and Rousseeuw (Kaufman and Rousseeuw, 2009), from childhood until adolescence individuals continually distinguish concepts from one another and attempt to form groups. Examples are dogs and cats, or male and female, but also more complex concepts in science. Classification, whether based on structured or unstructured data, is relevant and present in numerous fields of work today. This is further supported by the research of Jain et al. (Jain et al., 1999). Their study takes a comprehensive look at data clustering and concludes that clustering can be an elegant way of recognizing patterns and collecting results.
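
The idea of grouping unlabeled data and then checking the groups against a known classification can be illustrated with a small sketch. The code below is not the implementation used in this paper; it is a minimal example under stated assumptions. The function names kmeans_1d and adjusted_rand_index, the starting centers and the toy data are all hypothetical, and a plain one-dimensional k-means stands in for the algorithms compared later. The Adjusted Rand Index is computed from its standard pair-counting definition.

```python
from collections import Counter
from math import comb


def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means on 1-D data (illustrative stand-in, not the paper's code)."""
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: index of the closest center for each value.
        labels = [min(range(len(centers)), key=lambda c: abs(v - centers[c]))
                  for v in values]
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def adjusted_rand_index(true_labels, found_labels):
    """Chance-corrected agreement between two labelings (1.0 = same partition)."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of points that share a group in both labelings, in each one alone.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, found_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_found = sum(comb(c, 2) for c in Counter(found_labels).values())
    expected = sum_true * sum_found / comb(n, 2)
    maximum = (sum_true + sum_found) / 2
    if maximum == expected:  # degenerate case, e.g. a single group in both
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical toy data: two well-separated groups with known class labels.
values = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
known = ["a", "a", "a", "a", "b", "b", "b", "b"]

found = kmeans_1d(values, centers=[0.0, 6.0])  # starting centers are an assumption
score = adjusted_rand_index(known, found)
print(found, score)
```

Note that the cluster labels themselves (0 and 1) carry no meaning; the score only depends on which points end up grouped together, which is why a clustering can be compared with a classification that uses different label names.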