Comparison of Data Mining Methods on Different Applications: Clustering and Classification Methods
Inf. Sci. Lett. 4, No. 2, 61-66 (2015)
Information Sciences Letters - An International Journal
http://dx.doi.org/10.12785/isl/040202

Richard Merrell* and David Diaz
Department of Computer Engineering, Yerevan University, Yerevan, Armenia.
Received: 12 Jan. 2015, Revised: 21 Apr. 2015, Accepted: 24 Apr. 2015, Published online: 1 May 2015

Abstract: Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. In this review we study different types of clustering methods.

Keywords: Clustering, Data Mining, Ensemble, Big data.

1 Introduction

According to Vladimir Estivill-Castro, the notion of a "cluster" cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms [1,2,3,4]. There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these cluster models different algorithms can again be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties. Understanding these "cluster models" is key to understanding the differences between the various algorithms. Typical cluster models include:

Connectivity models: for example, hierarchical clustering builds models based on distance connectivity.
Centroid models: for example, the k-means algorithm represents each cluster by a single mean vector.
Distribution models: clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm.
Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.
Subspace models: in biclustering (also known as co-clustering or two-mode clustering), clusters are modeled with both cluster members and relevant attributes.
Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.
Graph-based models: a clique, i.e., a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques.

A "clustering" is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other.

2 Clustering Methods

Clustering algorithms can be categorized based on their cluster model, as listed above. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms. Not all of them provide models for their clusters and can thus not easily be categorized. An overview of algorithms explained in Wikipedia can be found in the list of statistics algorithms. There is no objectively "correct" clustering algorithm; as has been noted, "clustering is in the eye of the beholder" [4,5,6,7,8,9]. The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. It should be noted that an algorithm designed for one kind of model has no chance on a data set that contains a radically different kind of model [4,5,7]. For example, k-means cannot find non-convex clusters [4,10,11,12,13].
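To make this point concrete, the following minimal sketch (not from the original paper, and assuming scikit-learn is available) contrasts a centroid-based and a density-based algorithm on a synthetic two-moons data set whose clusters are non-convex; the data set and parameter values are illustrative assumptions only.

```python
# Minimal sketch (assumes scikit-learn): k-means vs. DBSCAN on non-convex clusters.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: the true clusters are non-convex.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# Centroid model: k-means partitions the space into convex (Voronoi) cells.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density model: DBSCAN grows clusters from dense regions, so cluster shape is not constrained.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Agreement with the generating labels (1.0 = perfect recovery).
print("k-means ARI:", adjusted_rand_score(y_true, kmeans_labels))
print("DBSCAN  ARI:", adjusted_rand_score(y_true, dbscan_labels))
```

On data of this kind the density-based model typically recovers the two moons, while k-means cuts them along a straight boundary, which is one experimental way of seeing that the cluster model must match the data.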
2.1 Connectivity based clustering

Connectivity based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. These algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using a dendrogram; this explains where the common name "hierarchical clustering" comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters do not mix.

These methods will not produce a unique partitioning of the data set, but a hierarchy from which the user still needs to choose appropriate clusters. They are not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge (known as the "chaining phenomenon", in particular with single-linkage clustering). In the general case, the complexity is O(n³), which makes these methods too slow for large data sets. For some special cases, optimal efficient methods (of complexity O(n²)) are known: SLINK [5] for single-linkage and CLINK [6] for complete-linkage clustering. In the data mining community these methods are recognized as a theoretical foundation of cluster analysis, but are often considered obsolete. They did, however, provide inspiration for many later methods such as density based clustering (Figure 1).

Fig. 1: Single-linkage on density-based clusters. 20 clusters extracted, most of which contain single elements, since linkage clustering does not have a notion of "noise". Many people have used this method in different applications [15,16,17,18,19,20,21,22,23,24,25].
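As a small illustration of the agglomerative idea (not part of the paper, and assuming NumPy and SciPy are available), the sketch below builds a single-linkage hierarchy and then cuts it at a distance threshold; the toy data and the threshold value are arbitrary assumptions for demonstration.

```python
# Minimal sketch (assumes NumPy and SciPy): single-linkage hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])

# Build the merge hierarchy; each row of Z records one merge and its distance.
Z = linkage(X, method="single", metric="euclidean")

# Cutting the hierarchy at a distance threshold yields a flat clustering;
# different thresholds give different numbers of clusters.
labels = fcluster(Z, t=1.0, criterion="distance")
print("clusters found:", len(set(labels)))

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree;
# its y-axis is the distance at which clusters merge.
```

The design choice here mirrors the text: the algorithm never commits to one partitioning, and the user recovers one only by choosing where to cut the dendrogram.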
2.2 Centroid-based clustering

In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center such that the squared distances from the cluster centers are minimized [29,30,31,32,33]. The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well-known approximative method is Lloyd's algorithm [7,24,25,26,27,28], often actually referred to as the "k-means algorithm". It does, however, only find a local optimum, and is commonly run multiple times with different random initializations. Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++), or allowing a fuzzy cluster assignment (fuzzy c-means).

Most k-means-type algorithms require the number of clusters, k, to be specified in advance, which is considered one of the biggest drawbacks of these algorithms. Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut borders between clusters (which is not surprising, as the algorithm optimizes cluster centers, not cluster borders). K-means has a number of interesting theoretical properties. On the one hand, it partitions the data space into a structure known as a Voronoi diagram. On the other hand, it is conceptually close to nearest neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model based classification, and Lloyd's algorithm as a variation of the expectation-maximization algorithm for this model, discussed below (Figure 2).

Fig. 2: k-means separates data into Voronoi cells, which assumes equal-sized clusters.

2.3 Distribution-based clustering

The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects belonging most likely to the same distribution. A convenient property of this approach is that it closely resembles the way artificial data sets are generated: by sampling random objects from a distribution. While the theoretical foundation of these methods is excellent, they suffer from one key problem known as overfitting, unless constraints are put on the model complexity. A more complex model will usually be able to explain the data better, which makes choosing the appropriate model complexity inherently difficult. One prominent method is known as Gaussian mixture models (using the expectation-maximization algorithm). Here, the data set is usually modelled with a fixed (to avoid overfitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to fit better to the data set. This will converge to a local optimum, so multiple runs may produce different results.

Fig. 3: On Gaussian-distributed data, EM works well, since it uses Gaussians for modelling clusters.

In density-based clustering such as DBSCAN, a cluster consists of the density-connected objects (which can form a cluster of an arbitrary shape, in contrast to many other methods) plus all objects that are within these objects' range. Another interesting property of DBSCAN is that its complexity is fairly low - it requires a linear number of range queries on the database - and that it will discover essentially the same results (it is deterministic for core and noise points, but not for border points) in each run; therefore there is no need to run it multiple times. OPTICS [11,50,51] is a generalization of DBSCAN that removes the need to choose an appropriate value for the range parameter.
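Returning to the distribution-based approach of Section 2.3, the following hedged sketch (not from the paper, assuming NumPy and scikit-learn are available) fits a Gaussian mixture with the EM algorithm; the synthetic data, number of components, and covariance type are illustrative assumptions.

```python
# Minimal sketch (assumes NumPy and scikit-learn): Gaussian mixture clustering via EM.
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from two Gaussian distributions, mirroring how
# distribution-based clustering assumes the data were generated.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(200, 2)),
               rng.normal([4, 4], 1.0, size=(200, 2))])

# Fix the number of components to limit model complexity; EM converges to a
# local optimum, so n_init > 1 restarts from several random initializations.
gmm = GaussianMixture(n_components=2, covariance_type="full", n_init=5,
                      random_state=0).fit(X)

labels = gmm.predict(X)        # hard assignment: most likely component
probs = gmm.predict_proba(X)   # soft assignment: per-component probabilities
print("component means:\n", gmm.means_)
print("first point membership probabilities:", probs[0])
```

The soft assignments are what distinguish this model from k-means: each object receives a probability of belonging to each component rather than a single nearest-centroid label.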