University of Amsterdam

Final year project

A Comparison of Clustering Algorithms

Rick Vergunst

supervised by Dr. Dick Heinhuis

revised by Laura Wennekes

July 13, 2017

Abstract

Within this paper an attempt was made to compare different clustering techniques. The algorithms that were chosen each represent a different clustering technique. These algorithms tried to cluster six different data sets of which the classifications were known. The results were evaluated on the basis of different metrics in order to guarantee an inclusive overview. It can be concluded that overall the two best performing techniques in this test were Gaussian Mixture and Mini Batch K-Means, with the latter outperforming the former on scalability. For small data sets with few features, Birch proved most successful. Another conclusion that can be drawn based on this paper is that if the number of wanted clusters is unknown, DBSCAN shows the best scalability while maintaining a decent performance.

Contents

1 Introduction

2 Theoretical Framework
  2.1 Clustering
  2.2 Cluster analysis techniques
  2.3 Cluster analysis comparisons

3 Methodology
  3.1 Data sets
    3.1.1 Banknote authentication
    3.1.2 Htru2
    3.1.3 Skin tone
    3.1.4 Spam detection
    3.1.5 Driving motion
    3.1.6 Forest recognition
    3.1.7 Data preparation
  3.2 Algorithms
    3.2.1 Gaussian Mixture
    3.2.2 Mini Batch K-Means
    3.2.3 Ward
    3.2.4 DBSCAN
    3.2.5 Birch
    3.2.6 Affinity Propagation
    3.2.7 Mean Shift
    3.2.8 Spectral Clustering
  3.3 Evaluation metrics
    3.3.1 Time and Memory
    3.3.2 Adjusted Rand Index
    3.3.3 Adjusted Mutual Information
    3.3.4 Homogeneity, completeness and V-measure
    3.3.5 Fowlkes-Mallows score
    3.3.6 Calinski-Harabaz Index
    3.3.7 Silhouette Coefficient
    3.3.8 Implementation

4 Results
  4.1 Time and Memory
  4.2 Known classification metrics
  4.3 Unknown class labels

5 Conclusion

6 Discussion
  6.1 Algorithm performance
  6.2 Metric performance
  6.3 Data sets
  6.4 Further research

1 Introduction

Within the fields of data mining and machine learning a lot of different methods to analyze data exist. These methods range from clustering to classification. In data mining the goal is to find patterns within data, regardless of the nature of this data. These methods are often a combination of statistics, artificial intelligence, machine learning and databases (Fayyad et al., 1996). With these patterns one can try to give meaning to data, which in turn can aid anybody in making decisions, for example (Berry and Linoff, 1997). Most available methods can easily be compared and measured: because these methods have set results, those results can either be right or wrong. However, this does not hold true for the clustering method. Clustering is the counterpart of the classification method, where data is clustered into a certain number of groups (Jain et al., 1999). The difference between clustering and classification is based on whether or not the groups are pre-defined. An example of classification is the outcome of a football match. There are three possible groups, either a win, a draw or a loss, and every data point has to be assigned to one of these groups. Within clustering this is not possible. The number of groups is unknown and should thus be found during the process of clustering.

Clustering can be used in a broad range of fields within the scientific world. This is recognized by both Jain et al. and Kaufman and Rousseeuw (Jain et al., 1999)(Kaufman and Rousseeuw, 2009). Not only do we see applications of clustering within the different data mining fields, but also in fields such as economy, sociology, psychology and even environmental studies (Breiger et al., 1975)(Nugent and Meila, 2010)(Focardi et al., 2001)(Tan et al., 2013). As a result, any research done to improve relevant algorithms will accordingly yield progress in numerous fields. Within data mining, the role of clustering holds an especially unique status. In order to find patterns in unstructured data, or to determine whether the data holds any value, one must resort to the clustering method. Because there is no alternative to this specific technique, it is vital that all available information surrounding it is kept updated. This way, the best possible results are ensured at all times.

As stated by Mayer-Schönberger and Cukier, big data will acquire increased importance in our society in the coming years (Mayer-Schönberger and Cukier, 2013). This is mostly due to the huge value the data can hold, and extracting that value from big data can be a huge advantage for any business or institution (Katal et al., 2013)(Roski et al., 2014)(McAfee et al., 2012)(Tien, 2013). To understand these large amounts of data it has to be efficiently analyzed. Implementing the clustering method proves to be useful in this venture (Shirkhorshidi et al., 2014). Clustering can be the first step in finding new value in data, especially in a large amount of unlabeled and unstructured data. By making sense of this data and clustering it, businesses for example can adjust their strategy based on these clusters and change their behaviour based on the cluster that is relevant to them. Performance becomes an important factor in this new field as well, because the sheer amount of data being worked with is increasing. Because of the influx in data, the trial-and-error method is becoming more expensive and choosing the right algorithm is even more delicate.
Some initial attempts to evaluate performance have already been made, but further research is necessary to get a better overview (Shirkhorshidi et al., 2014) (Aggarwal and Reddy, 2013). Furthermore, it is also important to know which algorithms work well for big data. Especially determining the level of scalability of algorithms, and finding those that prove useful for big data or only need little adjustment, is vital since big data is invading every field of work (Manovich, 2011). Within the clustering method, there are several different algorithms to choose from and each of these has its own features and merits. However, the results from these algorithms differ significantly and can give a different outlook on the data. The choice of algorithm therefore strongly determines the results and should not be taken lightly. However, comparing methods is subject to interpretation and there is no definite true or false. This also means that the choice of the algorithm can be seen as more of an art than a rational decision. Some attempts in this direction have been made, as will be discussed in the theoretical framework. This paper will therefore attempt to give an overview of the different available methods and how they compare to each other. By comparing them it will be determined which algorithm is best suited for a certain case and specifically how well the algorithms scale to features and the size of data sets. In order to do this the paper will try to answer the following question:

How do the different methods of clustering compare to each other and is there a go-to algorithm?

Before answering this question, a discussion of previous work done within this field will be presented. Besides this, a theoretical framework and a problem statement are given in the section labelled 'Theoretical Framework'. After that, the 'Methodology' section discusses the different elements used and describes the way of coding and clustering as executed. The 'Results' section shows all results in tables to provide an inclusive overview of all findings. Following each table a short discussion of these results is presented. Based on these results, the 'Conclusion' section deduces the answer to the research question stated before. Finally, all other findings and proposals for future research will be discussed in the 'Discussion' section.

2 Theoretical Framework

In this section an attempt is made to create an overview of what clustering is and the depth it has. First, to get acquainted with clustering, the generic four-step clustering process is analyzed. Second, different clustering techniques are looked at. A wide range of techniques is available and each has its own merits and drawbacks. In order to fully understand the different algorithms that are available within these techniques, an example is given for each. Lastly, some papers that have attempted and executed a similar comparison of clustering are considered. In doing so, some inspiration was gained and certain procedures that give a good outcome can be reproduced. Furthermore, this section gives an idea of how far the scientific world has come with comparing algorithms and further proves the relevance of the problem at hand.

2.1 Clustering

Cluster analysis is an important form of analysis and very relevant today. As stated by Kaufman and Rousseeuw (Kaufman and Rousseeuw, 2009), from childhood until adolescence individuals constantly divide concepts amongst each other and attempt to form groups. This can be dogs and cats, male and female, but also more complex concepts in science. Classification, either based on structured or unstructured data, is relevant and present in numerous fields of work today. This is further proven by the research of Jain et al. (Jain et al., 1999). This study looks inclusively at data clustering and concluded that clustering can be a beautiful way of recognizing patterns and collecting results. This boils down to choosing the right algorithm, but is also dependent on the data that is used. As Jain et al. recognize, clustering should be used on either image retrieval, object recognition, document retrieval or data mining in order to get the expected results. It is therefore important for the researcher to make some careful design choices in order to get the results he or she desires.

Jain et al. used different steps in order to execute a cluster analysis (Jain et al., 1999). As stated in the research, a cluster analysis consists of the following four steps:

1. Pattern representation
2. Similarity computation
3. Grouping process
4. Cluster representation

The first step is pattern representation and is often skipped by researchers as it is the most complex step in the process of clustering. In this step the researcher must determine the important and relevant features of the problem at hand. In the case of small data sets, this can be done by relying on previous experiences, but with larger data sets the researcher has to run algorithms to determine these features. This process can be expensive and time consuming, hence this step is often skipped. The next step in the process is similarity computation, where the researcher takes the patterns found by the algorithm and compares them based on either explicit or implicit knowledge. In this step the researcher attempts to differentiate between the patterns. The third step is the grouping process, which roughly consists of two methods. One of them is 'hierarchical schemes', which is more versatile and therefore gives better results in more complex cases. The other is 'partitional schemes', which is less expensive and better suited for bigger problems if the expensiveness of the solution matters. The fourth and final step in the process is cluster representation, where the researcher presents a solution in an understandable way. Often graphs are used for this step as they make the clusters easy to represent.

2.2 Cluster analysis techniques

The chosen cluster analysis technique determines the results and whether or not the question can be answered. It is therefore important to choose the right technique. Aggarwal and Reddy and Fahad et al. recently created an overview of the techniques available today (Aggarwal and Reddy, 2013) (Fahad et al., 2014). Table 1 below presents all these techniques and an example of an algorithm within each technique. The first of many techniques is clustering based on probability models, a technique unlike most other techniques (Bock, 1996). Within this technique classification is performed based on the underlying probabilities

of the chosen features. Based on these probabilities each data point is classified and clusters are created. This technique is extensive as the researcher can mix and match models until the best suited model is found (Topchy et al., 2004)(Biernacki et al., 2000). An example of a probability model is the expectation-maximization algorithm (EM), as presented in Table 1. The EM algorithm is an iterative technique and consists of two steps within each iteration. The first step estimates the log-likelihood based on the current values of the parameters. A probability distribution will be made on the function that comes out of this step. The second step maximizes these probabilities by adjusting the function and recalculating the parameters. These recalculated parameters can then be used for another iteration, which can be repeated as many times as desired (Do and Batzoglou, 2008)(Moon, 1996).

The next two techniques described by Aggarwal and Reddy are partitional and hierarchical techniques, which are presented in the second and third row of Table 1. Partitional algorithms were long the usual go-to algorithms, as described by Wilson et al. (Wilson et al., 2002). These algorithms take the input and features and create different partitions of these parameters. Afterwards the partitions are compared based on a function specific to the partitional algorithm at hand. The functions that this type of algorithm tries to minimize are generally objective in nature (Jin and Han, 2010). These partitions consist of K clusters, which can be pre-determined or chosen by the algorithm. The requirements that must be met are that each point is contained in only one group and that each cluster has at least one data point. The most well known algorithm in this technique, which has already existed for more than 50 years, is the k-means algorithm (Jain, 2010). The k-means algorithm takes as input the wanted number of clusters and then creates P partitions out of N data points. Every partition is then given k cluster centers based on the given input. Afterwards, every data point is considered and added to the cluster center that is closest to the data point, usually based on the Euclidean distance. This is done for every partition before the partitions are compared. The objective is to minimize the sum of squares of the distances of every data point to the center of its cluster. The partition that has the lowest sum is then chosen as the solution for this instance.

Hierarchical algorithms repeat a process multiple times. This process can either create more (divisive) or fewer (agglomerative) clusters based on the question at hand. This is done by calculating the distance between the existing clusters and then adjusting the clusters based on this calculation. Distances often used are the Manhattan and Euclidean distances for numeric values and the Hamming distance for non-numeric values. The process starts either with one cluster or with a cluster on each data point. Next, the distances are calculated between the clusters or the optimal division is based on the largest distance possible. This process is then repeated until the full hierarchy has been built. Hierarchical algorithms result in a dendrogram, which is a graph of the different layers of clusters that are created. An obvious difference between hierarchical and partitional techniques is that the partitional approach aims for a specific number of k clusters, whereas the hierarchical method keeps creating either more or fewer clusters and the researcher can choose a specific step in the layering.
The next technique is density based clustering, which is presented in the fourth row of Table 1. The difference between density based clustering and other techniques is that outliers do not have a severe influence on the formed clusters. The method of density based clustering is based on the density of the data objects, with clusters formed by contiguous high density regions separated by low density areas (Kriegel et al., 2011). The idea of density based clustering is built on human differentiation of concepts and regions within imagery. When shapes interrupt or flow through each other, the classic algorithms cannot follow this dynamic and try to divide it into separate clusters. However, density based clustering can recognize these regions by analyzing their density and transforming them into separate clusters. This means that the number of clusters found at the end is not pre-determined. One of the most well known examples of density based clustering is DBSCAN (Ester et al., 1996). DBSCAN divides all the data points into three categories. The first category is 'core points', which consists of points that are located within a certain minimal reach of other points. This reachability is determined via a function chosen for the instance at hand. The next category of points is 'density reachable points', which consists of points that are not located within a certain reach of other points but which are connected through a core point. The third category is the 'outlier points', which consists of points that are not in reach of other points and are therefore disconnected from any density. Based on these characteristics, the classification of points is performed. By this diversification a certain number of clusters is obtained and the outliers are filtered out, as they do not belong to any cluster. The last technique is grid based clustering, as presented in the fifth row of Table 1. This technique is somewhat similar to density based clustering as it also looks at density. However, there is a significant difference in the approach, as described by Grabusts and Borisov (Grabusts and Borisov, 2002). With a grid

based clustering approach the researcher first divides the data set into a number of cells that together form a grid. Afterwards the density of each cell is calculated based on the chosen algorithm. These cells are then sorted based on the found densities of each cell. This constructed hierarchy is then analyzed and the algorithm tries to find cluster centers. Afterwards, the cells are divided between the cluster centers to create a number of clusters. STING is an example of an algorithm within the grid based approach. In STING, the data points are first divided into layers which are again divided into a grid of regions containing data points. The first layer is then chosen and for each cell in this grid a probability is calculated which determines whether the region is relevant or not. Every cell is then labeled either relevant or not. Next, the relevant regions are taken through the process again, which is repeated until it reaches the bottom layer. A few relevant regions are filtered out, which then either answer the query or not. If they do not, the data points in these regions are further processed with the above steps. If the query is answered, the regions are returned as a result and the data is clustered.

Table 1: Clustering techniques

Clustering technique       Example
Probability models         Expectation-Maximization algorithm
Partitional algorithms     K-means algorithm
Hierarchical algorithms    Euclidean distance
Density based clustering   DBSCAN
Grid based clustering      STING

2.3 Cluster analysis comparisons

Small scale trials to compare cluster techniques have been conducted. Steinbach, Karypis and Kumar, for example, compared bisected K-Means with hierarchical agglomerative techniques, which are both quite traditional in nature (Steinbach et al., 2000). For this comparison they used two measures, namely F-measure and entropy. Entropy is widely used amongst different scientific fields. With the entropy function, the 'goodness' of a cluster is determined: in other words, to what extent the cluster accurately respects the question at hand, as explained by Sripada and Rao (Sripada and Rao, 2011). The other measure, F-measure, is usually employed as a combination of recall and precision (Hripcsak and Rothschild, 2005) but can be expanded to measure the effectiveness of the clustering in a hierarchical technique.

The difficulty with comparing clusters is the subjective nature of clusters and their dependence on an individual's decision making surrounding the ambition underlying the clustering. Therefore Rand determined certain objective criteria to aid in this process (Rand, 1971). By looking at objective measures the conclusions and comparison will have a more scientific basis. The first criterion was to what extent the algorithm determined the inherent structure of the data. If a technique can determine the structure of a data set, it will understand the data better, which usually leads to better clustering. The second criterion is to what extent resampling affects the clustering. If a clustering technique shows different results when resampling is done, the results of one clustering cycle will be unreliable and therefore weaken any claims made based on that clustering. The last criterion Rand stated considered the handling of new data. If new data is added and the clustering differs vastly from the previous one, the clustering is sensitive, which again makes it unreliable. This, just as with the second criterion, will weaken any claims made and must therefore be taken into serious account.

Another interesting comparison has been made by Abbas et al. (Abbas et al., 2008). In their paper the authors present an objective overview of several algorithms. The chosen algorithms were K-means, HC, SOM and EM. This selection was made to capture some of the diversity within the cluster analysis field. These algorithms were compared based on four factors, namely: the size of the data set, the number of clusters, the type of data set and the type of software. These factors were then given different variables and compared to one another to get an overview of the algorithms' performances. For the size of the data set, the researchers differentiated between small and huge data sets to look at the scalability of the algorithm. The data sets used consisted respectively of 36,000 and 4,000 data points. For the number of clusters, the researchers tried different amounts of clusters. Because of their nature, the algorithms were able to create a certain

amount of clusters, which made it easy to compare and achieve. The cluster amounts used were 8, 16, 32 and 64. As their type of data set, the researchers chose an ideal data set for each algorithm, based on the type and characteristics of the algorithm, and also a random data set. With this comparison the researchers evaluated the performance of the algorithms in new situations and in situations where the algorithm should shine. The amount of clusters used in this endeavour was always 32. For the last factor the researchers used different packages to run the algorithms, namely the LNKnet Package and the Cluster and TreeView Package, which proved to make no difference.

A new approach was considered in the research of Fahad et al. (Fahad et al., 2014). The application of clustering techniques within big data was specifically looked at. This approach is thought of as promising for the future (Manyika et al., 2011). In this research five different algorithms were used that span multiple different techniques as described in the previous paragraph. These five algorithms were Fuzzy-Cmeans, Birch, Denclue, Optimal Grid and EM. Fahad et al. chose to use eight different data sets with different characteristics. The differences between these sets were predominantly the amount of instances within the data set, the attributes these instances had and the amount of classes present in the data set. The foremost aim of this research was to test the different algorithms thoroughly. In the evaluation different metrics were employed to capture more aspects of the algorithms. First, compactness and separation were used to determine how 'good' the clusters were. Furthermore, the Davies-Bouldin Index was applied to measure the ratio of within-cluster to between-cluster distances. Another index that was used was the Dunn Validity Index, measuring both the compactness and the separation between individual clusters. The second to last index was the correctness, which uses the correct classification to determine the accuracy of the clustering. The last used index was the Adjusted Rand Index, where instances that are in the same cluster and instances that are part of different clusters are compared. Lastly, the quality of the clustering was determined through a formula, the Normalized Mutual Information. As shown in this research there are numerous measures available to achieve a complete overview of the data and algorithms that are used in the clustering.

Looking at the previous paragraphs in this section, several things can be deduced. The different steps in clustering are all vital in carrying out a good clustering process. These steps are fixed for every type of clustering, which creates a consistent context and a solid foundation in each situation. A wide range of techniques is available within clustering. By choosing algorithms that correspond to each of the techniques and differentiating between them, one can create a reasonable and complete overview of what is possible. These techniques differ so vastly in nature that different results are expected accordingly. All of these techniques have advantages and disadvantages, which means that testing the full range of techniques should be attempted. From the comparisons done up to this date quite a broad range of usable metrics can be deduced. Each allows a different aspect of the algorithm to be tested. If a subjective metric is used, elements that could be looked at are density and definition. Objective metrics often require the classification to be known, which is generally not the case within clustering. Objective measures are still possible with clustering by choosing certain data sets that are meant for classification. Important for a good comparison is to create as many variables as possible within the data sets. This can be either in size or in the amount of features that a data set has. By changing these variables one can create different contexts in order to mimic a 'real life' scenario as accurately as possible.

3 Methodology

The goal of this paper is to create an exhaustive overview of the available clustering methods and attempt to compare them based on measures. It is therefore important to test as many variables as possible while remaining concise and clear. This section outlines in detail the different steps taken in the process of obtaining the results and motivates certain choices. Furthermore, several pieces of code will be given to provide a thorough understanding and improve reproducibility. The present research follows the steps as discussed by Jain et al. in order to get as close to a real clustering process as possible (Jain et al., 1999). This section first outlines the different data sets and discusses their relevance. It is important to consider different types of data sets because their nature can vastly differentiate the results of a certain algorithm, which was already partially proven by Fahad et al. (Fahad et al., 2014). Then, the algorithms that will be used to cluster the different data sets are presented. These algorithms cover a broad range of different techniques within clustering as discussed in the theoretical framework. Lastly, the evaluation and the way it was controlled during the process will be discussed. These types of evaluation are motivated by the theoretical framework and work previously done. However, it was also attempted to provide additional motivation where needed.

3.1 Data sets

Which data set is chosen is a vital element in the process, as it determines the results and the contexts that are tested. In the present research, an effort was made to select data sets that outline different scenarios that can occur in clustering. It was attempted to remove most external influence on the results by creating as many contexts as possible. The chosen data sets can be characterized by certain variables in order to create an overview. These variables are either the data set's size or the amount of features that exist and will be used in the data set. In determining a data set's size, the sets are partitioned into three sizes ranging from small to large. The small data sets consist of sets up to 10,000 data points, the medium data sets range between 10,000 and 100,000 data points and the large data sets have more than 100,000 data points and can go up to millions of data points. By using these metrics, the size and the extent to which it influences an algorithm's performance can easily be looked at. This metric is becoming more and more important as big data is becoming relevant and gathering large amounts of data has never been easier.

As presented before, the second metric used relates to a data set's features. The features determine what a data point consists of and the algorithm will cluster based on these features. Features can be anything and their impact will be determined by the algorithm. In this metric, two possible outcomes exist: either a few features, which ranges up to ten features, or more than ten features per data point. The performance is affected by these features and it is important to look at their influence, as the amount can drastically change and impact the process.

In choosing data sets, one can either select simulated data or find sets of real-life data. In the present research, the latter option has been chosen as this mimics a real-life clustering situation best. Furthermore, the data sets originate from different kinds of origins and types to get a greater span of subjects. This can be biological data or image recognition, for example. The data sets are retrieved from the UCI Machine Learning Repository, which is a database with different types of data sets taken from real life situations. The data sets are typified with different variables: a preferred task for the data, the size, the features and a brief description of the data. For the comparison in the present research, data sets with a classification task have been chosen. This is perfect for a comparison as the correct class is given along with the data. This, in conjunction with determining the amount of clusters one wants to create, opens up the use of a lot of indices that require both the predicted and actual labels, thus increasing the grounds and depth of the comparison. Six data sets have been chosen from the UCI repository, each with differing variables. These six data sets are discussed and explained in the following sections.

3.1.1 Banknote authentication

The first data set consists of images taken from either genuine or forged banknotes. This divides the data set into two distinct classes and therefore asks for two clusters. The images taken are 400 by 400 pixels and have been turned into four features that describe the image in numbers. Colours did not play a part in the process, as some of the images were gray-scale, which would otherwise influence the process.

For the feature extraction of the images a Wavelet transform tool was used. This is a widely used method for describing and compressing images. The idea is to turn the image into wavelets that are described by certain features. By looking at these features and the differences between the different wavelets, an image can be reconstructed with only a few values, thus describing the differences between the images in a rather efficient way. The features extracted from the banknote images are skewness, variance, kurtosis and entropy. Skewness determines how much the wavelet is skewed to one side, variance is how much the wavelet differs from the mean, kurtosis is how high the highest peaks of the wavelet are and entropy determines how busy the wavelet, and thus the image, is. The data set has a total of 1,372 data points and can thus be considered a relatively small data set.

3.1.2 Htru2

The second data set used for the comparison is called Htru2. It is a data set containing data about pulsars. Pulsars are neutron stars that emit pulses of electromagnetic waves. Interesting to note is that every pulsar has its own unique wave and can thus be identified. The waves that are generated can be picked up by large radio telescopes, which makes detection easy on the one hand, but hard on the other, as other radio signals are also picked up. It is therefore important to be able to identify which waves are truly from these pulsars and which are not. This problem can be solved by classifying these signals. This data set in particular has eight features in total to separate the real pulsars from the 'fake' signals. The first four features are statistics of the pulse profile of the wave and the second set of four features describe the DM-SNR curve that is obtained through the radio signal. The set contains a total of 17,898 samples of radio signals, of which 1,639 are real pulsar waves and 16,259 are 'fake' signals. The data set is considered a medium data set in this particular research.

3.1.3 Skin tone

The third data set is about skin segmentation and the pursuit of differentiating real skin tones from non-skin tones. The data set has three features, where each feature is a colour value. The data points are divided into Blue, Green and Red values which together create a colour. The data comes from both the FERET and PAL databases and the real skin colours are taken from a diverse group of people differing in age, race and gender. The data set contains a total of 245,057 data points, of which 50,859 data points are real skin colours and the other 194,198 are non-skin data points.

3.1.4 Spam detection

The fourth data set is about spam detection, which is as relevant as ever with the increase of internet traffic. The data set contains information based on the text inside emails, with features created from that information. Through these features, the specific algorithm has to determine whether an email is spam or not. There are 57 features for each data point. The first 48 features are specific words that are common in emails. The values are the percentage of occurrences of the word relative to the total amount of words in the email. The next 6 features are specific characters, whose occurrence is analyzed in the same fashion as the words. The last three features look at the occurrence of capital letters, considering respectively the average length of capital runs, the longest capital run and the total number of capital letters inside the email. The set contains a total of 4,601 data points, of which 1,813 points are real spam, which come from postmasters and individuals who filtered emails as spam, and 2,788 are normal mail, which come from conversations between colleagues and similar sources.

3.1.5 Driving motion

The fifth data set contains data about drive diagnosis, not based on sensors but rather on features derived from the motor. These motors have certain defective components and can therefore be classified into 11 different types of motors. The motors are observed under twelve different operating conditions, which vary in speed but also in load moments and forces. These conditions were then measured and a total of 48 features were extracted. For each measurement the phases are considered and certain statistical attributes like mean, skewness

and kurtosis were recorded and stored in the features. The set contains a total of 58,509 data points which are, as mentioned before, divided into eleven types of motors.

3.1.6 Forest recognition

The sixth and final set that was selected for this study is a cover type set, where different forests are divided into clusters based on pictures taken of these forests. This is done via observations based on 30 by 30 meter cells, which are given features. The forests are divided into seven different types of vegetation. The features are very diverse in order to make a clear distinction between the possible forest types. Features can be the elevation in meters and the slope of hills, but also the shading of the cell and the type of wilderness and soil that the cell has. In total, every data point has 53 features to make the distinction, with very diverse data types ranging from colour index to binary classification features. Important to note is that the cells contain as little human intervention as possible, to have cells that are as 'true' as they can be to the cover type. The data set contains a total of 518,012 data points and is the largest set used in this particular experiment.

3.1.7 Data preparation

Every data set that has been used was prepared so that it could be used along with the other modules in a rather efficient and easy way. To do this the pandas module was used (McKinney, 2011). Pandas allows the user to easily load data inside a DataFrame and adjust the values based on what is needed. This DataFrame can then be used along with the scikit-learn module to apply the techniques and evaluate the clustering. Below the data preparation is outlined, as applied to every data set retrieved.

The first step in the process is to retrieve the file and transform it into a DataFrame. In the present research, this was done via the from_csv function, in which the separator is set to a tab. The next step is to assign names to the columns based on the values. The last column was named Target, corresponding to the correct labels. Out of this column a new DataFrame is created and the Target column is dropped from the original DataFrame in the final line. Lastly, the Target DataFrame is adjusted so that ones become zeroes and twos become ones, as the labeling in the scikit-learn module starts at zero, which prevents further problems with comparing.
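The original code listing is not reproduced in this text. A minimal sketch of the steps just described, assuming a hypothetical tab-separated file and placeholder column names, could look as follows:

    import pandas as pd

    # Load the tab-separated file into a DataFrame; the thesis used the older
    # from_csv helper, read_csv with sep='\t' is the equivalent call.
    df = pd.read_csv('dataset.txt', sep='\t', header=None)

    # Assign column names based on the values; the last column holds the known labels.
    df.columns = ['feature_1', 'feature_2', 'feature_3', 'Target']

    # Split the labels into their own DataFrame and drop them from the feature frame.
    target = df[['Target']].copy()
    df = df.drop('Target', axis=1)

    # Remap the labels so they start at zero (e.g. 1 -> 0, 2 -> 1), matching the
    # labeling convention of the scikit-learn module.
    target['Target'] = target['Target'].replace({1: 0, 2: 1})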

3.2 Algorithms

Picking the right algorithms is important in order to test every aspect of clustering. It is important to choose as many algorithms as possible that cover the different techniques discussed in the theoretical framework. Besides that, the chosen algorithms must not be too specific in their usage, as the attempt is to make a general overview. Lastly, it is important to keep the implementation in mind, which can also be an important factor in choosing an algorithm. In the present research, the scikit-learn module is used for the implementation (Pedregosa et al., 2011). This is a widely used module for statistics and machine learning within Python. The module contains all types of statistical implementations, including a wide range of cluster techniques. Furthermore, the module is very well documented and also includes evaluation indices, which helps tremendously in evaluating the found results and comparing the algorithms.

3.2.1 Gaussian Mixture

The first algorithm used for the data sets in this study is the Gaussian Mixture model. As can be deduced from its name, this algorithm is of the probability model type, which was presented in the first row of Table 1. The model assumes that all data points come from a mixture of Gaussian distributions whose parameters are unknown. This model in particular uses expectation-maximization (EM) as the algorithm to create the model and fit it on the training data. As said before, EM uses a two step method where the first step generates expected values, which are then maximized and used in the next iteration. The goal is to maximize the likelihood of the parameters in the model, which can be iterated over many times. Important to note is that the algorithm can converge to a local optimum, while remaining a relatively fast algorithm. After the model is created based on the given parameters and the outcome of the EM algorithm, which can be altered as well, the model can be used on test data to cluster the data. What makes the method unique compared to a plain EM is that it allows parameters to be given that shape the clustering process. It allows for softer or fuzzier clustering, which means that data points can belong to multiple clusters, or rather that the points are not hard assigned to a cluster but given a score expressing the probability of belonging to a certain cluster.
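As an illustration, a minimal sketch of how such a model can be fitted with scikit-learn is given below; the variable X stands for the prepared feature matrix of section 3.1.7 and the parameter values are illustrative assumptions, not the exact settings used in the thesis.

    from sklearn.mixture import GaussianMixture

    # n_components is set to the number of known classes of the data set at hand
    # (e.g. 2 for the banknote set); the other parameters keep their defaults.
    gmm = GaussianMixture(n_components=2, max_iter=100)
    gmm.fit(X)
    labels = gmm.predict(X)        # hard cluster assignment per data point
    soft = gmm.predict_proba(X)    # 'fuzzy' view: per-cluster membership probabilities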

3.2.2 Mini Batch K-Means

The second algorithm that was used is a variant of the K-means algorithm, namely mini batch k-means. This decision was made because the mini batch variant has improved computation time while not losing much accuracy (Sculley, 2010). As with the standard variant of K-means, the goal is to reduce the inertia or within-cluster sum of squares. The procedure consists of two steps, the first being the assignment step:

$$S_i^{(t)} = \{\, x_p : \| x_p - m_i^{(t)} \|^2 \le \| x_p - m_j^{(t)} \|^2 \ \forall j,\ 1 \le j \le k \,\}$$

and the update step:

$$m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$$

At the assignment step every data point x is assigned to one of the given k clusters, whose centers are initially placed randomly. This assignment is based on the distance, often the squared Euclidean distance between the point x and the cluster centre m. Every point is assigned to exactly one set S of the partition. Afterwards the update step is performed where, based on the new assignments, the new means of the clusters are calculated to create new cluster centers, which are then fed back into the first step until convergence. The algorithm differs from the original K-means in that the input data is divided into randomly sampled subsets. These subsets, the mini batches, are then used individually to decrease the computation time. The first step in the process is to draw a certain amount of samples to form a mini batch, which are then assigned to the nearest centroid. The second step in the process is to update the centroids, which is done per mini batch. Every centroid is updated based on the streaming average of the specific batch used in the iteration. This decreases the movement of the centroids and therefore improves the computation time. The algorithm runs until it has reached convergence, which can be a local optimum, or until the given number of iterations is reached.
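A minimal scikit-learn sketch of this algorithm is shown below; X again denotes the prepared feature matrix, and the batch size and cluster count are illustrative values rather than the thesis settings.

    from sklearn.cluster import MiniBatchKMeans

    # n_clusters matches the number of known classes; batch_size controls how many
    # samples form each mini batch drawn in the assignment step.
    mbk = MiniBatchKMeans(n_clusters=2, batch_size=100, max_iter=100)
    labels = mbk.fit_predict(X)
    print(mbk.inertia_)  # the within-cluster sum of squares that the algorithm minimizes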

3.2.3 Ward hierarchical clustering

The Ward method is applicable in hierarchical clustering, which means that one either starts with one cluster per sample or with a single cluster and keeps iterating until the desired amount of clusters is found or until no further steps are possible. The steps are taken based on a certain objective function which the algorithm tries to minimize or maximize. The Ward method is such an objective function and is typified as the minimum variance method. The goal of this function is to determine the variance within the clusters and then try to minimize this variance. The variance can be calculated using numerous distance metrics. Effectively it determines the error sum of squares and tries to minimize it. At each step, the method calculates what happens to the function if one either merges or divides clusters, and it takes the next step that minimizes the function at that particular point.

In the present study, the Euclidean distance is used, which is a widely applied distance metric. It is often employed as it is computationally lightweight while yielding relatively strong results. In this research, the squared Euclidean distance is used for calculating the minimum variance. This function can be denoted as follows:

$$d_{ij} = d(\{X_i\}, \{X_j\}) = \| X_i - X_j \|^2$$

In this method the distance between two points is calculated and kept in the matrix d. The distance metric can be anything but is often the Euclidean distance. Afterwards the matrix d is considered and the data will be divided into more or fewer clusters, depending on the type of clustering. This division is based on the calculated distances, and after the new assignment the steps are repeated until a certain amount of clusters is reached or no more steps are possible.
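In scikit-learn, Ward linkage is available through the agglomerative clustering class; a minimal sketch, with X as the prepared feature matrix and an assumed cluster count, is given below.

    from sklearn.cluster import AgglomerativeClustering

    # Ward linkage merges the pair of clusters that gives the smallest increase in
    # within-cluster variance; n_clusters stops the merging at the desired level.
    ward = AgglomerativeClustering(n_clusters=2, linkage='ward')
    labels = ward.fit_predict(X)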

3.2.4 DBSCAN

Another technique is the DBSCAN method, which falls into the density based clustering algorithms. It stands for density-based spatial clustering of applications with noise and looks at the density of the data points. The algorithm takes two parameters: the minimum amount of points needed to define a cluster and eps, the maximum distance at which two points are still considered to be 'in the same neighbourhood'. These parameters are used to evaluate each point, after which it is determined into which category the point falls. This can be either a core point, a reachable point or an outlier. A point is considered a core point if at least the minimum amount of points is reachable from it within the given distance. Every cluster has to have at least one core point and is formed by every point, either a core point or reachable point, that is reachable from this core point based on the given distance metric. As with a lot of other algorithms, the Euclidean distance is used in this situation to calculate the distance. The algorithm first fully explores a neighbourhood or cluster before moving on to a non-defined point and starting a new cluster, thus creating a number of clusters that is not known in advance. Important to note is that the algorithm often revisits points it has seen before, which increases its running time.
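A minimal scikit-learn sketch follows; the eps and min_samples values are placeholders, since the thesis does not state which values were used, and X is the prepared feature matrix.

    from sklearn.cluster import DBSCAN

    # eps is the neighbourhood radius, min_samples the minimum amount of points
    # required to form a core point; both strongly influence the resulting clusters.
    db = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
    labels = db.fit_predict(X)

    # Points labelled -1 are the outliers that belong to no cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)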

3.2.5 Birch

Birch is a unique method because it uses a tree-like data structure to cluster the data. For the clustering the algorithm goes through four phases to achieve its results. The first phase consists of creating a CF-tree out of the available data points. CF in this context means Clustering Feature, which is a triplet consisting of the number of data points, the linear sum of those data points and the squared sum of those data points. These features are then organized based on the branching factor B and the threshold T. In the tree structure every node has at most B entries, where each entry is formed by a CF and a child node. These child nodes consist of CFs themselves, constrained by a certain amount, thus resulting in a tree consisting of CFs. In the second phase the algorithm looks at the created tree and tries to create a smaller tree by cutting outliers and grouping CFs that are very similar. The third phase consists of applying an agglomerative clustering algorithm to cluster the leaves or child nodes of the CFs. By compressing the clusters here the algorithm achieves the amount of clusters given as a parameter. Afterwards phase four can be applied, which considers the found clusters and looks for errors by considering the 'seeds' of the clusters and redistributing the data. An important note within this algorithm is that every subcluster holds information that improves the memory usage. This information consists of the number of samples, the linear sum, the squared sum and the centroids of the samples within the subcluster, and the squared norm of the centroids. By using these metrics, the calculation of the radius of the subcluster is improved, as only these specific attributes have to be held in memory while calculating.
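A minimal scikit-learn sketch is given below; the branching factor and threshold correspond to B and T in the description, with illustrative values, and X is the prepared feature matrix.

    from sklearn.cluster import Birch

    # branching_factor limits the number of CF entries per node; threshold limits the
    # radius of a subcluster; n_clusters drives the final agglomerative phase.
    birch = Birch(branching_factor=50, threshold=0.5, n_clusters=2)
    labels = birch.fit_predict(X)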

3.2.6 Affinity Propagation

The next algorithm is affinity propagation, which is unique in its calculation. The algorithm sends messages between points to establish the connection between those points and, based on this, the algorithm clusters

points together. It chooses a function to determine the similarity between points and based on those similarities assigns each point. An example of such a function is the negative squared distance between points. It then applies two steps per message to establish the connection. First, it uses the following formula:

$$r(i, k) \leftarrow s(i, k) - \max_{k' \ne k} \{\, a(i, k') + s(i, k') \,\}$$

In this formula r is the responsibility matrix and a is the availability matrix. r(i, k) expresses how well suited point k is to serve as the exemplar, or clustering point, of point i, while the a matrix expresses how appropriate it would be for point i to choose point k as its exemplar. s in this formula is the chosen similarity as discussed before, which could be the negative squared distance. The second step is formed by the following formulas:

$$a(i, k) \leftarrow \min\left( 0,\ r(k, k) + \sum_{i' \notin \{i, k\}} \max(0,\, r(i', k)) \right)$$

and

$$a(k, k) \leftarrow \sum_{i' \ne k} \max(0,\, r(i', k))$$

In this step the algorithm essentially recomputes the availability matrix based on the new responsibility matrix found in step one. Both steps are then repeated until no more changes occur over a number of iterations, or until a pre-determined amount of iterations is reached.
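A minimal scikit-learn sketch is shown below; note that affinity propagation derives the number of clusters from the message passing itself, so no cluster count is passed. X is the prepared feature matrix and the parameter values are illustrative.

    from sklearn.cluster import AffinityPropagation

    # damping smooths the responsibility/availability updates between iterations;
    # by default the similarity is the negative squared Euclidean distance.
    ap = AffinityPropagation(damping=0.5, max_iter=200)
    labels = ap.fit_predict(X)
    exemplars = ap.cluster_centers_indices_  # indices of the points chosen as exemplars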

3.2.7 Mean Shift

Mean shift finds its core meaning in its name already, as it is based around shifting the centroids or 'means' of the clusters. The algorithm uses the following formula:

$$m(x) = \frac{\sum_{x_i \in N(x)} K(x_i - x)\, x_i}{\sum_{x_i \in N(x)} K(x_i - x)}$$

In this formula K is some kind of kernel function that determines the weight of a point, which can be a Gaussian weight for example. N(x) in this context is the neighbourhood, that is, all the points for which K does not equal 0. The formula produces m(x), the weighted mean of the points in the region. Afterwards x is set to the newly found m(x) and the process is repeated until convergence. The interesting thing about this algorithm is that it solely looks in a region around the point it considers as a clustering point, which makes it rather efficient as it does not use the whole data set for each iteration.
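A minimal scikit-learn sketch follows; the kernel bandwidth, which defines the neighbourhood N(x), is estimated from the data here, and X is the prepared feature matrix. These are assumptions for illustration, not the thesis configuration.

    from sklearn.cluster import MeanShift, estimate_bandwidth

    # Estimate a reasonable bandwidth from a sample of the data, then shift each
    # candidate centroid towards the weighted mean of its neighbourhood.
    bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)
    ms = MeanShift(bandwidth=bandwidth)
    labels = ms.fit_predict(X)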

3.2.8 Spectral Clustering

The last algorithm used is spectral clustering (Ng et al., 2002). The first step in this process is to create an affinity matrix which establishes the similarity between points. This matrix is created by a certain technique, usually a radial basis function applied to, for example, the Euclidean distance. After this matrix has been created, the algorithm uses k-means to create the clusters and builds a graph of the data points. Afterwards this graph is regarded and a normalized cut problem is applied. This means that the algorithm tries to cut an edge inside the graph such that the weight of that edge is outweighed by the remaining edges within the graph. Based on this the algorithm tries to refine the clustering even further and eventually produces the final clustering.
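A minimal scikit-learn sketch is given below; the RBF affinity and k-means label assignment mirror the description above, X is the prepared feature matrix and the cluster count is an assumed example value.

    from sklearn.cluster import SpectralClustering

    # The affinity matrix is built with an RBF kernel; the final labels are obtained
    # by running k-means on the spectral embedding of that matrix.
    sc = SpectralClustering(n_clusters=2, affinity='rbf', assign_labels='kmeans')
    labels = sc.fit_predict(X)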

3.3 Evaluation metrics

After applying the algorithms to the data sets, some metrics have to be used to determine how 'good' the clustering was. Important here is to involve as many metrics as possible, with a broad range of evaluation factors, to get a complete image of the clustering. Similar to the algorithms, the scikit-learn module was used to apply the metrics to the found labels, and because the present study had access to both the data sets and the known correct classification, a broad range of the available metrics inside the module could be used. Choosing these specific metrics was based on both the theoretical framework and providing validation within the metrics. The theoretical framework shows that more metrics give a better overview in general, specifically because they allow for the evaluation of the process from more angles. Most such metrics require that the classification of a data set is known, which holds true in this research as well. Because the classifications were indeed known, a lot of metrics could be used, which increased the angles of evaluation. Based on this, it was attempted to provide as many angles as possible in this paper. Something else that must be taken into account is the density and definition of clusters. Next to these guidelines it was also attempted to build validation into the metrics where possible. This means that some of the metrics test roughly the same thing. By doing this it was attempted to eliminate randomness as much as possible, and it made for a better comparison between the metrics as well. To further validate the comparisons between metrics, some metrics are also used that directly influence each other and give more information on how the clustering proceeded and what its strong points were. Lastly, several general metrics such as memory and time were added to test general performance, as not only the resulting clustering matters but also the cost of the process.

3.3.1 Time and Memory

The first two metrics are not necessarily clustering specific but can be important factors in choosing which algorithm to use. The first one is time, which is becoming more relevant with the increase of big data applications and the increased availability of data. For the tracking of time the standard Python module time was employed. By starting a timer before fitting the data and ending the timer after the fitting, every algorithm can be timed in the same way. The whole process of starting the algorithm and assigning the labels was timed, as this is a constant factor for every algorithm. Besides time, it is also important to look at the memory usage of an algorithm. As the data increases, an algorithm needs more memory to perform the clustering, and by tracking the memory usage the scalability of the algorithm can be predicted. This can be essential in choosing the right algorithm, as one must consider both the available memory and how much the algorithm needs. For the tracking of memory, a memory profiler, specifically a line-by-line profiler, was installed and used. Through this module, the memory usage of each line can be tracked and it is thus accurately known how 'costly' the algorithm is. Below is an example of memory tracking, where the increment shows how many MiB the fitting and creation of the algorithm adds.

Figure 1: Memory usage example
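Figure 1 shows profiler output rather than code; since the timing code itself is not reproduced in the text, a small sketch of the described approach, combining the standard time module with the memory_profiler package's line-by-line profiler, might look as follows. The function name and model argument are illustrative.

    import time
    from memory_profiler import profile


    @profile  # reports per-line memory usage, including the increment caused by the fit
    def cluster_and_time(model, X):
        start = time.time()
        labels = model.fit_predict(X)  # starting the algorithm and assigning labels are timed together
        elapsed = time.time() - start
        return labels, elapsed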

3.3.2 Adjusted Rand Index

The first metric of the module is the Adjusted Rand Index, which is a slight adaptation of the normal Rand Index. The Rand Index itself requires both the known and the found classifications and uses the following formula:

$$RI = \frac{a + b}{C_2^{n_{samples}}}$$

In this formula $C_2^{n_{samples}}$ equals the total number of possible pairs in the data set. a equals the number of pairs that are placed in the same set in both the correct classification C and the found clustering K, and b equals the number of pairs that are placed in different sets in both C and K. Using this formula, the values range between 0 and 1, where 1 is a perfect score and 0 is no match at all. The Adjusted Rand Index is corrected for chance and transforms into the following formula:

$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$

The expected Rand Index is subtracted from both the computed Rand Index and the maximum Rand Index, and the two results are divided. By doing this, the values can range from -1 to 1, where -1 indicates bad labeling, 0 random labeling and 1 perfect labeling. This way more can be concluded from the found values and a better comparison is achieved. Furthermore, the formula does not assume any structure for the data and is thus very suitable for comparing clusterings of data sets whose structures differ from each other.
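As a small illustration of the pair-counting nature of this score, the scikit-learn implementation gives the same value regardless of how the cluster ids are permuted; the toy labels below are purely illustrative.

    from sklearn.metrics import adjusted_rand_score

    # Swapping the cluster ids does not change the score: both calls return 1.0,
    # because the index only compares which pairs of points end up together.
    print(adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 1]))
    print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))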

3.3.3 Adjusted Mutual Information

The second metric is in a way similar to the previous one, in the sense that it also adjusts for chance and has the same range of values with the same meaning. The Adjusted Mutual Information is, as its name already suggests, an adjustment of the Mutual Information formula. This formula takes into account the entropy and a contingency table with both the known and found clustering labels. Based on this table it calculates the entropy of both sets of labels and uses that to calculate the performance of the clustering, which is bounded by both entropies. The Adjusted Mutual Information normalizes this range to 0 and 1 and adjusts for chance in the same way as the Adjusted Rand Index, as can be seen from the formula:

$$AMI = \frac{MI - E[MI]}{\max(H(U), H(V)) - E[MI]}$$

Here MI is the found Mutual Information and the maximum is determined by the two entropies H(U) and H(V). The expected index is subtracted from both numerator and denominator to obtain the said range. By doing this, the values of different data sets and labelings can be compared, as the score is both normalized and adjusted for chance. The similarity between the first two metrics is also seen in their advantages, as this metric likewise has no problem with different clustering structures.

3.3.4 Homogeneity, completeness and V-measure

The following three metrics influence each other and together say something about the execution of the clustering. Homogeneity determines whether each cluster contains only members of a single class, which would indicate a perfect clustering in that respect. Completeness looks at the opposite: whether all members of the same class are assigned to the same cluster. The V-measure takes both values and calculates their harmonic mean. By looking at the values, one can determine where the clustering went wrong and evaluate where improvements in the algorithm or clustering can be made. All three metrics are bound by 0 and 1.
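The interplay of the three scores can be illustrated with scikit-learn on a toy labeling (the values below are illustrative, not taken from the thesis): splitting one class over two clusters keeps homogeneity perfect but lowers completeness, and the V-measure sits between the two.

    from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

    y_true = [0, 0, 1, 1]   # known classes
    y_pred = [0, 0, 1, 2]   # clustering that splits class 1 over two clusters

    print(homogeneity_score(y_true, y_pred))   # 1.0: every cluster holds a single class
    print(completeness_score(y_true, y_pred))  # < 1.0: class 1 is spread over two clusters
    print(v_measure_score(y_true, y_pred))     # harmonic mean of the two scores above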

3.3.5 Fowlkes-Mallows score

The Fowlkes-Mallows score is a widely adopted metric used throughout statistics when the correct classifications are known. The metric combines precision and recall into a geometric mean to determine how well the clustering was handled. Precision is formed by dividing the number of true positives by the number of all points labelled as positive (true positives plus false positives). Recall is formed by dividing the number of true positives by the number of all points that actually are positive (true positives plus false negatives). This turns into the following formula:

FMI = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}

Within this formula TP equals the number of True Positives, FP the number of False Positives and FN the number of False Negatives. As with the first two metrics, no assumption is made about the structure of the clustering, and the score is bounded by 0 and 1 just like the AMI.
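A short sketch of the pair counting behind this formula is given below (toy labels assumed); the manually computed value agrees with scikit-learn's fowlkes_mallows_score.

    from itertools import combinations
    from math import sqrt
    from sklearn.metrics import fowlkes_mallows_score

    truth = [0, 0, 0, 1, 1, 1]
    pred  = [0, 0, 1, 1, 1, 1]

    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_true = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_true and same_pred:
            tp += 1  # pair together in both the classification and the clustering
        elif same_pred:
            fp += 1  # pair wrongly placed together by the clustering
        elif same_true:
            fn += 1  # pair wrongly split apart by the clustering

    print(tp / sqrt((tp + fp) * (tp + fn)))    # FMI from the formula above, roughly 0.62
    print(fowlkes_mallows_score(truth, pred))  # should give the same value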

3.3.6 Calinski-Harabaz Index The first metric that does not need the known labels is the Calinski-Harabaz Index. It looks at the between-cluster dispersion and the within-cluster dispersion to determine how well the clustering was executed. Important to note is that a higher score is better and there is no limit on how high the scores can be. This metric looks at the density of the clusters and penalizes clusterings that overlap. It is calculated with the following formula:

s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N - k}{k - 1}

In this formula N is the number of samples and k the number of clusters. Tr(B_k) is the trace of the between-cluster dispersion matrix and captures how well the clusters are separated, while Tr(W_k) is the trace of the within-cluster dispersion matrix and captures how dense the clusters themselves are.
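As a brief sketch (the generated blobs and Mini Batch K-Means are arbitrary choices here, not the thesis setup; recent scikit-learn releases spell the function calinski_harabasz_score, while older ones expose it as calinski_harabaz_score):

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import calinski_harabasz_score

    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
    labels = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0).fit_predict(X)

    # Higher is better and the score is unbounded: dense, well separated clusters
    # increase Tr(B_k) relative to Tr(W_k).
    print(calinski_harabasz_score(X, labels))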

3.3.7 Silhouette Coefficient The second and last metric that does not need the known labels is the Silhouette Coefficient. It is a rather simple metric that looks at how well the clusters are defined. This is done by calculating, for each point, the mean distance to all other points in the same cluster and the mean distance to all points in the nearest other cluster. This turns into the following formula:

s = \frac{b - a}{\max(a, b)}

In this formula b is the mean distance to the nearest other cluster and a is the mean distance to the points inside the point's own cluster. A major drawback of this metric is that it is very memory heavy, as it has to calculate the distances between a large number of point pairs, which can also make it rather slow.
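One way this memory cost might be kept in check, sketched below, is scikit-learn's sample_size argument, which computes the coefficient on a random subset of the points instead of on all pairwise distances (the subset size used here is an arbitrary choice):

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=20000, centers=4, random_state=0)
    labels = MiniBatchKMeans(n_clusters=4, n_init=3, random_state=0).fit_predict(X)

    # Evaluating on a 2,000 point sample avoids the full pairwise distance matrix,
    # trading a little accuracy for far less memory and time.
    print(silhouette_score(X, labels, sample_size=2000, random_state=0))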

3.3.8 Implementation For the implementation a function was written to calculate all the metrics at once, which can be seen in Figure 2.

Figure 2: Metrics function

In this function, x holds the correct labels and y the predicted labels as found by the clustering process. z is the created model, which is used to evaluate the last two metrics. First, both x and y are ordered to align the labels, which allows the metrics to perform a correct evaluation.
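Figure 2 itself is an image and is not reproduced here; the sketch below only indicates what such a function might look like with scikit-learn. The function name is an assumption, and the raw feature matrix stands in for the fitted model z used in the original code.

    from sklearn import metrics

    def evaluate_clustering(x, y, data):
        """x: known labels, y: labels found by the clustering, data: feature matrix."""
        h, c, v = metrics.homogeneity_completeness_v_measure(x, y)
        return {
            "ARI": metrics.adjusted_rand_score(x, y),
            "AMI": metrics.adjusted_mutual_info_score(x, y),
            "homogeneity": h,
            "completeness": c,
            "v_measure": v,
            "FMI": metrics.fowlkes_mallows_score(x, y),
            # the last two metrics do not need the known labels, only the data and clustering
            "calinski_harabasz": metrics.calinski_harabasz_score(data, y),
            "silhouette": metrics.silhouette_score(data, y, sample_size=5000, random_state=0),
        }

Returning a dictionary like this makes it straightforward to write out one row per algorithm and data set, which is how the tables in the next section are organised.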

4 Results

This section contains the clustering results and the metric scores for the data sets as executed in the present research. The results are presented in table form and are discussed in the text below every table. Each table contains a row per algorithm and a column per data set, and the tables are ordered by metric, which allows one to easily compare the algorithms and their performance within a metric. Some values within the tables could not be obtained due to an error: Mem denotes a memory error and nan a fault within the algorithm itself caused by a computational error.

4.1 Time and Memory The measured time (Table 2) and memory usage (Table 3) of the different algorithms on the different data sets are given below. Note that time is reported in seconds and memory in megabytes. Furthermore, some algorithms were not able to complete the clustering of specific data sets, which greatly reduces their efficiency and scalability. To assess efficiency, both time and memory are taken into consideration; a sketch of how such measurements might be collected is given below.
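The sketch below illustrates one way the time and peak memory of a clustering run might be recorded; the memory_profiler package and the generated data are assumptions about the setup rather than the thesis code.

    import time
    from memory_profiler import memory_usage
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=100000, centers=7, random_state=0)

    def run_clustering():
        return MiniBatchKMeans(n_clusters=7, n_init=3, random_state=0).fit_predict(X)

    start = time.time()
    mem_trace = memory_usage((run_clustering, (), {}))  # memory samples (MiB) taken while the call runs
    elapsed = time.time() - start

    print("time: %.4f s" % elapsed)
    print("extra memory: %.3f MiB" % (max(mem_trace) - min(mem_trace)))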

Table 2: Time (seconds)
                        Bank      Htru2      Skin         Spam      Drive       Cov
Gaussian Mixture        0.0080    0.1190     1.3130       0.2110    14.1000     60.8650
Mini Batch K-Means      0.1450    0.0320     0.2010       0.0310    0.2690      0.7430
Ward Hierarchical       0.1810    13.0090    Mem          1.0790    861.7290    Mem
DBSCAN                  0.0420    0.2950     5.1620       0.2960    34.2610     28.2820
BIRCH                   0.0670    15.1570    Mem          0.9470    2527.1780   Mem
Affinity Propagation    1.2950    842.1470   Mem          57.6290   Mem         Mem
Mean Shift              3.5330    164.3260   19036.2449   29.9190   6406.4210   Mem
Spectral Clustering     3.1439    872.0910   Mem          Mem       Mem         Mem

First, the time is considered. Analyzing Table 2, the performance can roughly be split into two groups. The first group, consisting of Affinity Propagation, Mean Shift and Spectral Clustering, took far more time than the second group. For the smallest data sets, for example, those three generally needed more than one second, and up to close to a minute, to complete the clustering. Looking further at the larger data sets, for both many and few features, similar patterns exist: the three algorithms in group 1 significantly underperform compared to the other five and once again took considerably more time to complete, or were not able to complete the clustering at all. As expected, this behaviour continues with the largest data sets, where of these three only Mean Shift was able to complete the clustering. This, however, took more than 5 hours, easily the longest any algorithm ran by quite a margin. The worst performing algorithm was undoubtedly Spectral Clustering: it was not even able to complete the smallest data set with many features and it showed the worst scalability. Affinity Propagation was the next worst, as it generally took more time to cluster or was not able to cluster a data set at all, which leaves Mean Shift as the third worst.

Group 2 comprises the five remaining algorithms: Gaussian Mixture, Mini Batch K-Means, Ward, DBSCAN and Birch. For the smallest data sets these five show similar results in time, as all of them take roughly a second or less to complete the clustering. Only marginal differences of hundredths of a second can be observed, which are negligible. If a 'worst' performer must be selected from group 2, it is Ward, as it took the longest of the five on the smallest data sets with both few and many features. Comparing few to many features, the latter takes more time to cluster, which is to be expected due to the larger computational strain. As for the fastest, Gaussian Mixture is the fastest on the small data set with few features, be it by a small margin as discussed, while Mini Batch K-Means takes the lead for many features. Going up one size, interesting patterns emerge. The computational time increased for each algorithm as expected, but the margins of these increases differ vastly. Both Ward and Birch increase significantly more, with Birch even taking 42 minutes where a size smaller only took a second. Ward also experiences quite an increase, especially for many features, which shows that both handled scalability significantly worse than the others. Looking at the other three algorithms, one can also see differences in the ranking. Mini Batch K-Means is able to hold its speed and keeps its running time easily beneath one second. The other two algorithms, Gaussian Mixture and DBSCAN, take quite some more time to complete the many features data set, and both also see an increase in running time on the few features data set. Looking at the largest sets, Mini Batch K-Means keeps its times under one second, while the other two show the same or an increase in computation time. Interestingly, DBSCAN actually decreased slightly in time on the many features set, while increasing quite a bit on the few features set. Overall, Mini Batch K-Means keeps its computational time easily beneath one second, so it can be concluded that Mini Batch K-Means is the most time efficient algorithm of all.

Table 3: Memory (MB)
                        Bank     Htru2     Skin      Spam      Drive     Cov
Gaussian Mixture        1.852    3.598     4.070     2.820     5.090     5.094
Mini Batch K-Means      0.641    1.531     8.938     1.633     4.281     19.156
Ward Hierarchical       1.625    2.660     Mem       1.105     8.883     Mem
DBSCAN                  0.438    3.559     16.652    1.172     0.824     1.688
BIRCH                   3.895    33.078    Mem       21.621    369.863   Mem
Affinity Propagation    2.410    Mem       Mem       9.281     Mem       Mem
Mean Shift              2.566    19.312    42.789    21.051    Mem       Mem
Spectral Clustering     4.367    Mem       Mem       Mem       Mem       Mem

The second efficiency metric, memory, is presented in Table 3. The same algorithms that scored poorly on time score poorly on memory as well, although Birch has to be added to this group of three. For the smallest data sets, these four algorithms took significantly more memory than the other algorithms. Birch shows an interesting shift from scoring rather well on time to second worst on memory for both small data sets, only leaving Spectral Clustering behind it. Spectral Clustering performs very badly on efficiency in general, as it is the worst on both time and memory consumption. Similar to the developments in Table 2, we can detect at least a tripling of memory consumption from few features to many features for these four algorithms. Going up one size, the same increase occurs for few features, with Birch even taking up 10 times more memory. Roughly speaking, all four scale rather badly and show quite large memory use, even for the smallest data sets, and can be regarded as memory inefficient or demanding. Spectral Clustering is easily the worst, followed by Birch, Mean Shift and Affinity Propagation.

The remaining four algorithms scored considerably better on memory usage and should thus be regarded as the more memory efficient algorithms. For the smallest data sets there are actually differences compared with time. DBSCAN performed best of all on the small data set with few features, followed by Mini Batch K-Means, while the other two, Gaussian Mixture and Ward, took more memory. For many features, however, Ward uses the least, followed by DBSCAN, Mini Batch and lastly Gaussian Mixture by a margin. Compared with its timing results, Gaussian Mixture seems to use more memory than the rest while being as fast as the others; this could possibly be explained by the fact that a model has to be created, which can use more memory, although this is speculation. As with the other algorithms we also see an increase in memory usage between few and many features, except for Ward, which took less memory to complete the clustering. This suggests that the number of features is not that important for its memory usage and that it can efficiently take in more features. Considering the medium sized data sets we see different rankings. For few features, Mini Batch seems the most efficient, followed by Ward and then by both Gaussian Mixture and DBSCAN. DBSCAN seems to scale badly with size in memory usage here, while Ward increased the least in this step of scale. Looking at many features we see differences as well. Most notably, DBSCAN took less memory than on the smallest data set with many features and used the least memory of all the algorithms, which is rather strange. Mini Batch again took quite some more memory to cluster compared with few features, just like on the smaller data sets, which suggests that the memory scalability of Mini Batch K-Means with respect to features is relatively bad. Gaussian Mixture underwent roughly the same increase across sizes and features, so it seems to have less trouble with more features. This leaves us with the largest data sets, which show a different trend once again. Gaussian Mixture seems to pull ahead now, as its memory usage almost does not increase any more. The algorithm shows great scalability considering both the features and the size and seems superior for few features, since Mini Batch took twice as much and DBSCAN took four times as much memory to complete the clustering. For the largest data set with many features, however, DBSCAN scored the best, while Gaussian Mixture stayed stable and Mini Batch increased by quite a margin again, similar to the trend for few features. The trend of Mini Batch increasing significantly with features also holds true in this case.

For overall efficiency, a top three is easily appointed: Gaussian Mixture, Mini Batch K-Means and DBSCAN. These algorithms easily outperform the others when all scenarios are taken into account, and all three show exceptional scalability with both features and size. Mini Batch seems to be the best considering speed, as it always kept its time beneath one second, which is extremely fast compared with the other two. As the size increased, however, it used more memory, which could prove to be a problem for lower end computers. Gaussian Mixture is rather stable in its increase and rather efficient, while usually not being the absolute best. It needed a bit more time to cluster, but used less memory than Mini Batch K-Means and generally seems good for few features; it maintained low time usage and a rather memory efficient clustering process. DBSCAN is outperformed by both on the few features data sets; with more features, however, it strangely seems to increase in efficiency and easily outperforms the others on memory usage. The other algorithms are rather inefficient. Birch was able to keep up in time, but used quite a lot of memory in comparison, while for Ward it is the other way around. The remaining three algorithms scored inefficiently on both metrics and can thus be regarded as the worst, with Spectral Clustering the worst by quite a margin, as it could not even cluster the smallest data set with many features.

4.2 Known classification metrics This section presents the metrics that rely on knowing the actual classification. As in the previous tables, some results are not available where the algorithms were not able to perform the clustering. The values in the tables correspond to the explanations given in their respective subsections of the methodology section. Some values are written in scientific notation, with an e followed by an exponent; these values are extremely close to zero and can be regarded as such. Furthermore, Homogeneity, Completeness and V-Measure will and should be regarded as one metric, as they relate to and heavily influence each other. Something to keep in mind is that DBSCAN, Birch, Affinity Propagation and Mean Shift do not take a number of clusters as input, which means they can be expected to be outperformed by the other algorithms.

Table 4: Adjusted Rand Index
                        Bank     Htru2     Skin      Spam      Drive    Cov
Gaussian Mixture        0.8459   0.0475    0.0538    0.7239    0.3439   0.0360
Mini Batch K-Means      0.7519   -0.0769   0.5977    -0.0300   0.4368   0.0306
Ward Hierarchical       0.7572   -0.0714   Mem       0.1881    0.0969   Mem
DBSCAN                  0.3372   0.0       0.0005    -0.0323   0.0      1.4509e-16
BIRCH                   0.9424   -0.076    Mem       0.0641    0.1883   Mem
Affinity Propagation    0.0614   0.0003    Mem       0.0752    Mem      Mem
Mean Shift              0.0      -0.0266   -0.0679   -0.044    0.0252   Mem
Spectral Clustering     0.0113   -0.0610   Mem       Mem       Mem      Mem

The first metric to look at is the Adjusted Rand Index (Table 4), which compares the similarity of the clustering to the true classification, using chance normalization to strengthen the claims. Looking at the first data set, the small data set with few features, we see the same split as with time in Table 2. Affinity Propagation, Mean Shift and Spectral Clustering all achieved a score close to zero, which translates to a random labeling and is rather bad. The other algorithms performed rather well on this data set for this metric. Birch performed the best with roughly 0.9, a near perfect score. Right behind Birch comes Gaussian Mixture with a score of 0.85, followed by both Mini Batch and Ward with roughly the same score of 0.75, which is rather high as well and shows great similarity between the actual classification and the clustering. DBSCAN scored roughly 0.34, which is actually quite decent considering it picks the number of clusters by itself.

The small data set with many features shows some interesting results, as only Gaussian Mixture was able to get a strong result, followed by Ward with only 0.18. This shows that the clustering process had trouble scaling with features. The rest of the algorithms scored close to zero, which amounts to roughly a random classification. The medium data sets show different results for many and few features. The Htru2 data set shows bad scores for all the algorithms, with all approaching 0. It shows that the clustering algorithms all had trouble with this particular data set. This could be a problem with the data set or the data type, as the results are unexpected, and could be something to look at at a later time. The many features data set Drive, however, shows different results, with Mini Batch K-Means pulling ahead of Gaussian Mixture by a margin, while both score beneath 0.5. The other algorithms all scored close to zero once again. For the largest data sets only a few algorithms were able to complete the clustering. All but one also scored roughly zero, which seems to show that as the size increases the clustering performance decreases. However, Mini Batch K-Means scored roughly 0.6 on the few features large data set, which is a very strong score. Because of this, and considering the rest, Mini Batch K-Means seems to scale better with size, as it kept its metric score as the size grew, while Gaussian Mixture loses performance as the size increases. The performance of the other algorithms, however, is bad, with at most a decent score on the first data set with few features, or no decent scores at all.

Table 5: Adjusted Mutual Information
                        Bank         Htru2        Skin   Spam     Drive         Cov
Gaussian Mixture        0.5401       0.0034       nan    0.4652   0.3464        nan
Mini Batch K-Means      0.4759       0.0091       nan    0.0422   0.5155        nan
Ward Hierarchical       0.4928       0.0065       Mem    0.1277   0.2841        Mem
DBSCAN                  0.3378       2.4980e-16   nan    0.0296   -1.8319e-15   nan
BIRCH                   0.9021       0.0213       Mem    0.0733   0.3593        Mem
Affinity Propagation    0.1893       0.0473       Mem    0.1052   Mem           Mem
Mean Shift              5.6561e-16   0.0014       Mem    0.0691   0.3475        Mem
Spectral Clustering     0.0162       0.0042       Mem    Mem      Mem           Mem

The next metric that requires the known classifications is the Adjusted Mutual Information (Table 5), which looks at the agreement between both labelings and is normalized for chance, similar to the previous metric. As with the previous metric, the bottom three algorithms scored rather badly on the first data set with few features: Mean Shift and Spectral Clustering scored close to zero and Affinity Propagation lags behind the remaining five algorithms. In this case Birch pulls ahead by quite a margin, even more than before, once again with a score of 0.9; Birch seems to do very well on the smallest data set with few features. It is followed by Gaussian Mixture, then Mini Batch, Ward and lastly DBSCAN. As with the previous metric, the ranking on the smallest data set with few features stays roughly the same, with Gaussian Mixture decreasing in score the most. The small data set with many features, however, shows that Gaussian Mixture performed the best once again, followed by Ward in this instance. Interestingly, Birch ranks much lower here, as it seems to struggle with more features and the increasing size of the data set. Affinity Propagation actually scored third in this particular test, for the first time thus far. Overall, the scores are around 0.1 or closing in on zero, which suggests randomness. For the medium data sets we see the same developments as before with the Htru2 data set, where all the algorithms close in on zero. This strengthens the need to further inspect why this data set produces these scores for almost all the algorithms, even though they performed perfectly fine before. Looking at the medium data set with many features, one can see that Mini Batch K-Means starts to outperform the rest of the algorithms, as was the case with other metrics as well. It is followed by both Mean Shift and Gaussian Mixture, which is unique for Mean Shift as it usually does not belong to the top. Ward's scalability seems rather poor in this test as features and size increase. DBSCAN performed rather horribly on this data set as it scored close to 0. The evaluation of the larger data sets proved to be problematic, as the metric returned errors while attempting it. This is something to consider in further endeavours; it shows the influence of metrics and the possible steps that can still be made in the evaluation metrics.

Table 6: Fowlkes-Mallows
                        Bank     Htru2    Skin     Spam     Drive    Cov
Gaussian Mixture        0.9248   0.7463   0.0202   0.8654   0.4994   nan
Mini Batch K-Means      0.8805   0.7771   nan      0.6567   0.5489   -0.9408
Ward Hierarchical       0.8793   0.8048   Mem      0.7118   0.3519   Mem
DBSCAN                  0.6656   0.9130   nan      0.6703   nan      -0.9669
BIRCH                   0.9715   0.7867   Mem      0.7095   0.4020   Mem
Affinity Propagation    0.2491   0.0909   Mem      0.2800   Mem      Mem
Mean Shift              0.7112   0.8930   1.1863   0.6169   nan      Mem
Spectral Clustering     0.7013   0.8316   Mem      Mem      Mem      Mem

Table 6 depicts the Fowlkes-Mallows metric, which uses the true positives, false positives and false negatives to test the clustering performance. Looking at the smallest data sets, Affinity Propagation clearly stands out as the worst, as it scored 0.25 and 0.28 respectively while the second worst still scored well within the 0.6 range. The best scores are achieved by Birch for few features and Gaussian Mixture for many features. This is in line with previously found results, especially for Birch, which usually falls off with more features or increased size. For the first data set, Gaussian Mixture comes in second, closely following Birch, while both Mini Batch K-Means and Ward follow Gaussian Mixture with a score of roughly 0.88. Mean Shift and Spectral Clustering scored quite similarly around 0.7, and lastly DBSCAN with 0.66. Interestingly, DBSCAN once again improved with more features, while the others decreased in performance. Furthermore, Mini Batch K-Means loses some ground on Ward, and Mean Shift also loses performance and becomes the second worst, as with the other metrics.

The medium data sets show the same kind of results for Affinity Propagation, as it underperforms by quite a margin compared with the others. Interesting as well is that DBSCAN scored the highest on the few features medium data set, especially because the algorithm chooses the number of clusters by itself. DBSCAN is followed by Mean Shift and Spectral Clustering, which is odd as those were by far the worst on other metrics. One has to keep in mind that this data set also behaved strangely on other metrics. It could be that those algorithms have specific properties that work well on this data set, but that would not explain the horrible results on the other metrics. Looking at the many features data set, Mini Batch K-Means scored the highest, followed by Gaussian Mixture, Birch and Ward. Once again Mini Batch K-Means scales rather well as the features or the size increase, especially the features, where it slowly starts to outperform the other algorithms, be it by a small margin. The largest data sets often resulted in errors, as the metric was not able to complete the evaluation. The runs that did finish came close to zero or even approached -1, which should not be possible. Mean Shift scored 1.1863, which is probably an error as the values range from 0 to 1, just like the negative scores. The metric thus seems to behave strangely and irregularly here, and the results for the largest data sets should probably not be considered.

Table 7: Homogeneity
                        Bank         Htru2        Skin   Spam     Drive         Cov
Gaussian Mixture        0.7862       0.0110       nan    0.6938   0.5037        nan
Mini Batch K-Means      0.6927       0.0297       nan    0.0629   0.6336        nan
Ward Hierarchical       0.7173       0.0213       Mem    0.1905   0.2843        Mem
DBSCAN                  0.4918       8.1590e-16   nan    0.0442   -7.6395e-16   nan
BIRCH                   0.9059       0.0272       Mem    0.0734   0.3595        Mem
Affinity Propagation    0.9845       0.1544       Mem    1.0000   Mem           Mem
Mean Shift              5.6562e-16   0.0046       Mem    0.1032   0.1449        Mem
Spectral Clustering     0.0236       0.0136       Mem    Mem      Mem           Mem

Table 8: Completeness
                        Bank     Htru2    Skin   Spam     Drive    Cov
Gaussian Mixture        0.8004   0.0062   nan    0.6731   0.8438   nan
Mini Batch K-Means      0.7176   0.0222   nan    0.1516   0.8306   nan
Ward Hierarchical       0.7111   0.0187   Mem    0.3434   0.7958   Mem
DBSCAN                  0.1875   1.0      nan    0.0797   1.0      nan
BIRCH                   0.9022   0.0214   Mem    0.2458   0.7922   Mem
Affinity Propagation    0.1924   0.0099   Mem    0.1340   Mem      Mem
Mean Shift              1.0      0.0143   Mem    0.1187   0.5304   Mem
Spectral Clustering     0.1674   0.0147   Mem    Mem      Mem      Mem

Table 9: V-Measure
                        Bank         Htru2        Skin   Spam     Drive         Cov
Gaussian Mixture        0.7932       0.0079       nan    0.6833   0.6309        nan
Mini Batch K-Means      0.7049       0.0254       nan    0.0889   0.7189        nan
Ward Hierarchical       0.7141       0.0199       Mem    0.2451   0.4189        Mem
DBSCAN                  0.2715       1.6318e-15   nan    0.0569   -1.5279e-15   nan
BIRCH                   0.9042       0.0240       Mem    0.1131   0.4946        Mem
Affinity Propagation    0.3219       0.0185       Mem    0.2364   Mem           Mem
Mean Shift              1.1312e-15   0.0070       Mem    0.1103   0.2276        Mem
Spectral Clustering     0.0413       0.0141       Mem    Mem      Mem           Mem

The last three metrics of this section are grouped together as they relate to and influence each other. The V-Measure is the harmonic mean of Homogeneity and Completeness and is thus directly influenced by both. Starting off with the smallest data set with few features, Birch once again has the highest V-Measure with a score of 0.9. In line with other results, Gaussian Mixture comes in second and Mini Batch and Ward score rather similarly. Affinity Propagation in this case scores higher than DBSCAN, which is somewhat unique, while both Mean Shift and Spectral Clustering score badly. Looking at Homogeneity and Completeness more specifically, the scores are overall rather similar to the V-Measure, and Homogeneity and Completeness score similarly. The ones that stand out are Affinity Propagation, which scored almost one for Homogeneity, and Mean Shift, which scored one for Completeness. This could occur more often and could point to particular strengths of those algorithms.

As with other metrics, the performance of Birch decreases when the number of features increases. Gaussian Mixture performs the best on this data set by quite a margin, followed by Affinity Propagation and Ward. Affinity Propagation shows fairly consistent performance on the smaller data sets, which means it can sometimes score relatively high. Mini Batch K-Means scores poorly in this instance, as it in fact ranked second lowest, next to DBSCAN. The latter algorithm seems to have trouble with this metric, contrary to other metrics where it performed better as the features increased. For this other small data set, Homogeneity and Completeness are rather consistent and stable for most algorithms; Affinity Propagation, however, scored a 1 for Homogeneity, which gives reason to believe it prioritizes Homogeneity over Completeness. The Htru2 data set shows horrible results all around, with all algorithms nearing a score of zero. The medium size data set with many features, however, shows some interesting results that further strengthen earlier patterns. Once again Mini Batch K-Means seems to improve with size and features and ranked first, followed by Gaussian Mixture, Birch and Ward. Mean Shift also scored quite decently, while the other algorithms scored 0. Looking at the specifics, the algorithms were mostly able to score on Completeness but had trouble with Homogeneity. Specifically, that means that the members of a class are assigned to the same cluster, but the clusters often consist of members of several classes. This could be data set specific, but it could also be due to the number of clusters that is needed, which is eleven.

4.3 Unknown class labels This section holds the last two metrics, which do not require the classification of the data to be known beforehand. As with the previous metrics, the range of each metric is described in the methodology section. Interesting here is that the Silhouette Coefficient often did not run because of memory issues or similar problems. Furthermore, the Calinski-Harabaz Index produced very high numbers on large data sets, which is probably due to the size of the data sets; this will be reviewed in the next section.

Table 10: Calinski-Harabaz
                        Bank       Htru2      Skin         Spam       Drive     Cov
Gaussian Mixture        245.8383   12.2769    4165.9462    247.6694   7.6669    nan
Mini Batch K-Means      204.4953   172.4058   31690.0862   32.8869    19.5767   35702.1044
Ward Hierarchical       216.3235   162.0155   Mem          21.5496    8.3079    Mem
DBSCAN                  4.6616     nan        33.4895      1.0934     nan       nan
BIRCH                   274.3639   172.4314   Mem          2.6876     8.4301    Mem
Affinity Propagation    9.1562     5.1342     Mem          1.8494     Mem       Mem
Mean Shift              nan        32.6767    575.6246     2.4354     1.2088    Mem
Spectral Clustering     2.9856     173.8983   Mem          Mem        Mem       Mem

The first of the two metrics in this section is the Calinski-Harabaz Index (Table 10), which determines how well the clusters are defined. Within the smallest data set with few features we once again see the same ranking, with Birch at the top, followed by Gaussian Mixture, which in turn is followed by both Mini Batch K-Means and Ward. The other algorithms trail this group by quite a large margin, which means they were not able to define their clusters well in this case. Interestingly, when the small data set with many features is considered, most performances seem to get disrupted. Only Gaussian Mixture was able to keep up its performance, and as expected Birch's performance decreased immensely. Both Ward and Mini Batch K-Means also lost quite a lot, but can still distinguish themselves from the worst, be it by a small margin. Gaussian Mixture seems to define its clusters rather well based on these results, far better than the others when considering both few and many features.

Moving on to the medium sized data sets, we can distinguish some interesting results for the medium data set Htru2, where Spectral Clustering scored the highest for the first time, followed very closely by Mini Batch K-Means and Birch. This seems very odd, given Spectral Clustering's rather atrocious results on the other metrics and the previous behaviour of the metrics on this data set. Birch also keeps some performance up on this data set, showing it can define its clusters well with few features. In line with other results, Mini Batch K-Means keeps up its results rather well on larger data sets. The last unique observation is that Gaussian Mixture seemed to have a hard time defining this data set's clusters, as it ranked second worst in this test. Considering the medium sized set with many features, Mini Batch K-Means comes out ahead once again, showing scalability with size and features as with other metrics. Mini Batch is followed by Gaussian Mixture, Ward and Birch. Interestingly, Birch is able to keep defining its clusters even though it has trouble scoring high on other metrics for larger data sets or data sets with more features. Mean Shift seems to have trouble defining its clusters, as it scored the lowest on this set. Of the largest data sets, only the Skin data set produced some valuable results. Once again Mini Batch K-Means takes the lead, followed by Gaussian Mixture, Mean Shift and DBSCAN. The scores here are much higher because this specific metric does not normalize. The margin between Mini Batch K-Means and the second best, Gaussian Mixture, has become larger, reinforcing the first place of Mini Batch K-Means in this situation.

The second metric of this section, and the final metric overall, is the Silhouette Coefficient (Table 11), which just like the previous metric evaluates how well defined the clusters are. Important to state is that a score of 1 indicates highly dense clustering, while -1 stands for incorrect clustering; scores around 0 indicate overlapping clusters, but this does not necessarily mean they are of bad quality. Looking at the first data set, once again a pattern emerges that we have seen before. Birch leads the pack by a small margin, followed by Gaussian Mixture, while Ward and Mini Batch K-Means do not lie far apart in scores. Next come Spectral Clustering, Affinity Propagation and finally DBSCAN.
As can be seen, the patterns follow the previously found results and no real deviations are found.

Table 11: Silhouette Coefficient
                        Bank      Htru2     Skin   Spam      Drive   Cov
Gaussian Mixture        0.1639    -0.0153   Mem    0.1174    Mem     Mem
Mini Batch K-Means      0.1482    0.0374    Mem    -0.3957   Mem     Mem
Ward Hierarchical       0.1377    0.0637    Mem    0.1716    Mem     Mem
DBSCAN                  -0.3284   nan       Mem    -0.6931   nan     nan
BIRCH                   0.1719    0.0465    Mem    0.0771    Mem     Mem
Affinity Propagation    -0.1733   -0.3326   Mem    -0.7674   Mem     Mem
Mean Shift              nan       0.0913    Mem    -0.7561   nan     Mem
Spectral Clustering     0.0005    0.1029    Mem    Mem       Mem     Mem

Looking at the small data set with many features, however, some notable findings are detected. Mini Batch K-Means seems to have trouble defining its clusters here. Furthermore, Ward Hierarchical scored the highest of all the algorithms, followed by Gaussian Mixture. As expected, the evaluation of Birch came out worse than before, as it seemingly has trouble with more features. The rest of the algorithms also see a decrease in their scores, as Mean Shift and Affinity Propagation come close to -1 and DBSCAN also lost some score and closes in on -1. For the medium sized data sets only the Htru2 data set gave results, as the other returned errors when run. Interesting to see here is that Spectral Clustering once again scored the highest on the definition of its clusters, just as with the previous metric, followed quite closely by Mean Shift. The rest of the algorithms scored around 0, or even negative in the case of Affinity Propagation. This is the first time Mean Shift and Spectral Clustering lead the rankings, be it by a small margin, which means that no hard claims can be made based on these results, as they performed so poorly on (almost) every other metric.

5 Conclusion

Throughout this research an attempt was made to create an overview of the different clustering techniques available today and how they compare to each other. With such an overview one can pick a technique for the clustering process and get relatively good results on the first try. This is becoming more and more important as the cost of performing a clustering increases due to the sheer size of the data, but also due to the type of data; finding the optimal technique quickly is thus becoming ever more important. By looking at clustering from different angles an overview was made and interesting results were found, which are discussed in this section.

Of the eight algorithms used, three immediately fall off as the worst. These are Spectral Clustering, Affinity Propagation and Mean Shift. Spectral Clustering can easily be seen as the worst, as it was not able to complete any of the many features data sets and the ones it did complete took a lot of time and memory. Next up is Mean Shift, which was able to complete most of the data sets but took so long that it is not feasible, as there are far better options. The scores it achieved were also relatively bad, which makes it a bad choice in general. The last algorithm of this group is Affinity Propagation, which had some reasonable scores on smaller data sets, but was usually outperformed by others, especially its counterpart DBSCAN. It also had trouble with more features and took quite some time and memory to complete the clustering, which makes it a bad choice as well. However, this does not necessarily mean that the mentioned algorithms are bad, as they could do well in specific cases; the general cases used here simply do not show great performance.

This leaves a group of five algorithms that outperformed the previous three. DBSCAN is the only one of the five that does not take a number of clusters as parameter but rather chooses the number of clusters by itself. It was also able to cluster all the data sets rather quickly and efficiently, which makes it an interesting choice; however, its performance was usually subpar compared to the others. Birch is a unique algorithm in this case, as it showed the best scores on a small data set with few features, but scored abysmally on the others and scaled rather badly performance-wise. This makes it a great option to get good results on small data sets with few features, but not in any other case, and definitely not in a big data case. Ward Hierarchical scored relatively well on the metrics and on efficiency, but lagged behind the other two, namely Mini Batch K-Means and Gaussian Mixture. It seems like a worse version of those two and it was not able to perform clustering on the largest data sets, which shows bad scalability. It can thus be regarded as a decent option, but better choices can be made.

This leaves two algorithms, Gaussian Mixture and Mini Batch K-Means, which performed the best of all the algorithms. Of the two, Gaussian Mixture seems to outperform the rest on the small and some of the medium data sets, with the exception of Birch on the smallest data set with few features. Mini Batch K-Means, however, pulled ahead of the rest as the size and the number of features increased and was able to keep up its performance, while the scores of the others plummeted. Efficiency wise the two are opposites, with both pros and cons.
Gaussian Mixture slowly ramped up in time, but was very memory efficient and its memory usage increased only slightly even as the data sets grew rather quickly. Mini Batch K-Means, on the other hand, had lightning-fast times, as it always kept the time below one second, but showed larger memory usage as the size and the number of features increased, which leaves us with a choice between memory and time. In the end, considering all variables, the best scalability comes from Mini Batch K-Means when talking about big data cases, as it was able to keep its results rather consistent throughout the whole process. Gaussian Mixture proved to be a great pick if one has a data set between 10,000 and 100,000 points, a small data set with many features, or if memory could be an issue during the clustering process. Birch surprisingly scored the best on small data sets with few features and could be the go-to algorithm for data sets with these attributes. However, for all these algorithms one needs to determine the number of clusters, for example through the 'elbow' method, which can be costly if the data set is huge. An alternative is DBSCAN, which determines this by itself and is rather efficient in both time and memory. Yet the metrics showed that its results are not that great, which should always be considered when choosing a clustering algorithm.

6 Discussion

Throughout the research some interesting things were found that are not necessarily part of the goal of the research, and some of these are discussed in this section. Furthermore, the limitations of the research are reviewed and some options for future research are considered.

6.1 Algorithm performance In this day and age, where data is growing every day and big data is becoming more prevalent, it is important to be able to deal with large amounts of data. However, looking at the performance of the algorithms, only three of the eight were able to complete every data set, while the others were not able to cluster the largest, or even the medium sized, data sets. The largest data set in this case was roughly 500,000 points, which is not big considering the raw amount of data that is generated every day. It seems that the algorithms used here are often not built to scale well with size, which is evident when looking at Ward, which has trouble with larger data sets despite being a generic or 'classic' algorithm. Furthermore, regarding the metric scores, one can often see that the scores drop for most algorithms, which further strengthens the view that the scalability of the clustering algorithms is rather poor. Being able to keep performance high is important, as this improves the clustering and thus provides the best possible results.

6.2 Metric performance Next to the algorithms, the metrics too showed some interesting behaviour throughout the research. Some of the metrics were not able to perform the evaluation on the largest data sets: they returned either a memory error or a calculation error made somewhere in the process. Because of this, the results are not complete, which weakens any claims made. In addition, some evaluations were invalid due to the aforementioned calculation error, which could be an implementation error as well. However, we did see some patterns emerge throughout the research, where metrics that roughly test the same thing scored the algorithms similarly. A last thing to mention is that the Silhouette Coefficient took a lot of time and memory to complete, so its evaluation could fail or take a long time, sometimes without results; the remaining metrics were very fast and showed quick and clear results. All in all, some improvements can be made on the metrics as well, both on their scalability and, for some, on their performance. Furthermore, the availability of metrics that do not require the known classification is rather low, as there are only two available, and in a real clustering process one does not know the correct classification, thus eliminating all but those two metrics.

6.3 Data sets For the most part the data sets acted as intended: they provided different angles on data and different types of data by covering most of the tasks that are possible through clustering, as discussed in the introduction. However, to create an even bigger picture, one could try to get data sets for every type of task and thus test every possible variable. This could prove to be difficult, as getting your hands on specific data sets that also provide different variables and have certain features is tough in our experience. Getting the right data sets is worthwhile, as they each give a different take on the clustering algorithms and could possibly yield some interesting results, but one has to consider the time it takes. Within this research specifically, the second data set with few features, Htru2, showed some odd results. On most of the metrics we saw horrible scores that were out of line with how the rest of the data sets behaved on those metrics. Next to this, we also saw that the worst algorithm scored the highest on one of the metrics that did behave properly, which is odd but possible. It could be interesting to look into why every metric behaved strangely on this particular data set. Purely speculating, it could be the type of data or the task that is assigned to this data set. By explaining such problems one can advance the field and maybe find a solution, as other data sets could have the same problems.

6.4 Further research Next to some possibilities mentioned here before, some other research is possible as well. The choice in algorithms was based on availability and general coverage of different main styles of algorithms. However,

there are more variations and different algorithms available that could prove to be useful. In further research these variations could be tested to truly search for the best options for big data, or for other general and relevant cases that can be distinguished. By creating a more complete overview, the claims that can be made become stronger, because they are based on more scientific research, which should always be regarded as something to strive for. Next to the algorithms, some diversification is also possible within the metrics. There are more metrics available, and by testing those the picture becomes more complete, further strengthening the claims while simultaneously providing more angles on the subject. Another thing to look at is the implementation of the algorithms and metrics. In this research Python was used for everything from data preparation to clustering. However, there are other languages available with their own options, pros and cons. An example is R, which is tailored to statistics in general and could possibly give some unique angles on the subject. By doing this one can also compare the languages and how they perform in the clustering process, thus informing the choice of language in further research, as one language can prove to be better in a certain context.
