EasyChair Preprint № 2523

Compare Spectral and Hierarchical Clustering using different parameters

Anand Khandare and Roshankumar R. Maurya

EasyChair preprints are intended for rapid dissemination of research results and are integrated with the rest of EasyChair.

February 1, 2020

Compare Spectral and Hierarchical Clustering using different parameters

Dr. Anand Khandare, Associate Professor, Department of Computer Engineering, Thakur College of Engineering and Technology, Mumbai, India

Roshankumar R. Maurya, Department of Computer Engineering, Thakur College of Engineering and Technology, Mumbai, India

Abstract-- Clustering is an unsupervised learning technique that groups a set of objects into clusters so that objects in the same cluster are as similar as possible, while objects in different clusters are as dissimilar as possible. Cluster analysis thus organizes a collection of patterns into clusters based on similarity. This report presents a comparison between two clustering techniques: Spectral Clustering and Hierarchical Clustering. On the dataset used for this report, the most accurate configurations were found to be Spectral Clustering (using 'lobpcg' as the eigensolver and considering 25 neighbours when constructing the affinity matrix) and Hierarchical Clustering (computing the linkage with the cosine metric and using 'average' as the linkage criterion).

Keywords-- Clustering, Hierarchical Clustering, Spectral Clustering.

I. INTRODUCTION

Clustering is one of the most interesting topics in data mining; it aims at finding intrinsic structures in data and identifying meaningful subgroups for further analysis. It is a common technique for statistical data analysis, used in many fields including machine learning, data mining, pattern recognition, image analysis, and bioinformatics. A cluster can thus also be defined as "the methodology of organizing objects into groups whose members are similar in some way."

Figure 1: Clustering

In this paper, we compare two clustering techniques, Spectral Clustering and Hierarchical Clustering, on a dataset of 2645 samples. For each algorithm we vary its parameters and compare time complexity and error rate to determine which configuration performs best: for Spectral Clustering, using 'lobpcg' as the eigensolver and considering 25 neighbours when constructing the affinity matrix; for Hierarchical Clustering, computing the linkage with the cosine metric and using 'average' as the linkage criterion.
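As a concrete illustration of this setup, the sketch below (our own illustration, not code from the paper) configures both algorithms with the parameters named above using scikit-learn; the synthetic make_blobs data and the choice of four clusters are stand-ins for the actual 2645-sample dataset.

```python
# Minimal sketch (not the authors' code): the two configurations named
# above, applied to a synthetic stand-in for the 2645-sample dataset.
from sklearn.cluster import AgglomerativeClustering, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=2645, centers=4, random_state=0)

# Spectral Clustering: 'lobpcg' eigensolver, 25-neighbour affinity graph.
spectral = SpectralClustering(
    n_clusters=4,
    eigen_solver="lobpcg",
    affinity="nearest_neighbors",
    n_neighbors=25,
    random_state=0,
)
labels_spectral = spectral.fit_predict(X)

# Hierarchical clustering: cosine distance with 'average' linkage.
# (In scikit-learn < 1.2 the 'metric' keyword is called 'affinity'.)
hierarchical = AgglomerativeClustering(
    n_clusters=4, metric="cosine", linkage="average"
)
labels_hierarchical = hierarchical.fit_predict(X)

# Score each partition against the known generating labels.
print("spectral ARI:    ", adjusted_rand_score(y_true, labels_spectral))
print("hierarchical ARI:", adjusted_rand_score(y_true, labels_hierarchical))
```

Varying eigen_solver, n_neighbors, metric, and linkage in a harness like this is one way the parameter comparison described in this paper could be reproduced.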
II. BACKGROUND AND MOTIVATION

A. Background

Clustering is one of the most challenging mining techniques in the knowledge discovery process. Managing huge amounts of data is a difficult task, since the goal is to find a suitable partition in an unsupervised way (i.e. without any prior knowledge), maximizing intra-cluster similarity and minimizing inter-cluster similarity, which in turn maintains high cluster cohesiveness. Clustering groups data instances into subsets in such a manner that similar instances are grouped together, while different instances belong to different groups. The instances are thereby organized into an efficient representation that characterizes the population being sampled. The output of cluster analysis is thus the set of groups, or clusters, that forms the partition structure of the data set. In short, clustering is the technique of organizing data into meaningful groups for statistical analysis. Data mining and knowledge discovery have penetrated a wide variety of machine learning systems.

B. Motivation

As the number of digital documents has grown dramatically with the Internet, managing information search and retrieval has become a practically important problem. Methods that organize large amounts of unstructured text documents into a smaller number of meaningful clusters would be very helpful for applications such as indexing, filtering, automated metadata generation, population of hierarchical catalogues of web resources and, in general, any application requiring document organization.

There is also a large audience interested in reading specific news, so news articles need to be clustered out of the many available articles: a large number of articles is added each day, and many articles correspond to the same news item but come from different sources. By clustering the articles, we can reduce the search domain for recommendations, since most users are interested in the news of only a few clusters. This improves time efficiency to a great extent and also helps identify the same news reported by different sources.

The main motivation is to compare different types of unsupervised algorithms, to study their behaviour, advantages, and disadvantages, and to learn how to choose an unsupervised learning algorithm based on the type of dataset. This paper applies two common clustering algorithms (Spectral and Hierarchical Clustering) to compare and analyse their behaviour on different types of datasets (structured, unstructured, etc.), and varies the parameters of each algorithm to observe error rate, correctness, etc. By comparing the algorithms we obtain their advantages and disadvantages and learn which types of dataset suit each algorithm best.

III. SPECTRAL CLUSTERING

Spectral clustering is a technique with roots in graph theory, where the approach is used to identify communities of nodes in a graph based on the edges connecting them. The method is flexible and allows us to cluster non-graph data as well. Spectral clustering uses information from the eigenvalues (spectrum) of special matrices built from the graph or the data set. We'll see how to construct these matrices, interpret their spectrum, and use the eigenvectors to assign our data to clusters.

Figure 2: Spectral Clustering for Four Clusters

1) Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG): LOBPCG efficiently solves the eigenvalue problems for the graph Laplacians that appear in spectral clustering. For static graph partitioning, 10-20 iterations of LOBPCG without preconditioning result in roughly a 10x error reduction, enough to achieve 100% correctness for all Challenge datasets with known truth partitions. LOBPCG does not require storing the matrix of the eigenvalue problem in memory; it only needs the result of multiplying that matrix by a given vector. This matrix-free characteristic makes the method particularly useful for eigenvalue problems of very large size, and results in good parallel scalability to large matrix sizes on multi-threaded computational platforms with many parallel processors. LOBPCG is a block method, in which several eigenvectors are computed simultaneously, as in the classical subspace power method. Blocking is beneficial when the eigenvectors to be computed correspond to clustered eigenvalues, a typical scenario in multi-way spectral partitioning, where a cluster of the smallest eigenvalues is often separated by a gap from the rest of the spectrum. Blocking also allows taking advantage of high-level BLAS3-like libraries for matrix-matrix operations, which are typically included in CPU-optimized computational kernels.
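To make the LOBPCG discussion concrete, here is a small sketch of our own (the data set, sizes, and 25-neighbour graph are illustrative assumptions) that forms the normalized Laplacian of a nearest-neighbour affinity graph and computes its smallest eigenpairs with SciPy's lobpcg solver:

```python
# Illustrative sketch (ours): smallest eigenpairs of a graph Laplacian
# via LOBPCG; data, sizes, and neighbour count are assumptions.
import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import lobpcg
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Sparse symmetric affinity matrix from a 25-nearest-neighbour graph.
A = kneighbors_graph(X, n_neighbors=25, include_self=False)
A = 0.5 * (A + A.T)

# Normalized graph Laplacian of the affinity graph.
L = csgraph.laplacian(A, normed=True)

# LOBPCG iterates on a block of vectors and only ever needs
# matrix-vector products with L, never L itself in dense form.
rng = np.random.default_rng(0)
block = rng.standard_normal((L.shape[0], 4))
eigenvalues, eigenvectors = lobpcg(L, block, largest=False, maxiter=500)

print("smallest eigenvalues:", np.sort(eigenvalues))

# Sanity check of the eigenpair definition Ax = λx for one pair:
v, lam = eigenvectors[:, 0], eigenvalues[0]
print("residual ||L v - λ v|| =", np.linalg.norm(L @ v - lam * v))

# The rows of `eigenvectors` embed the points; running k-means on this
# embedding completes the spectral clustering pipeline.
```

Note that lobpcg only ever applies L to blocks of vectors; this is the matrix-free property that gives the method its scalability on large sparse problems.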
Eigenvectors and Eigenvalues

Critical to this discussion is the concept of eigenvalues and eigenvectors. For a matrix A, if there exists a nonzero vector x and a scalar λ such that Ax = λx, then x is said to be an eigenvector of A with corresponding eigenvalue λ. We can think of the matrix A as a function that maps vectors to new vectors. Most vectors end up somewhere completely different when A is applied to them, but an eigenvector changes only in magnitude: if you drew a line through the origin and the eigenvector, then after the mapping the eigenvector would still land on that line. The amount by which the vector is scaled along the line depends on λ. Eigenvectors are an important part of linear algebra because they help describe the dynamics of systems represented by matrices. There are numerous applications that make use of eigenvectors, and we'll use them directly here to perform spectral clustering.

Basic Spectral Algorithm

1. Create a similarity graph between our N objects to cluster.

Advantages

1. Does not make strong assumptions on the statistics of the clusters. Techniques like K-Means assume that the points assigned to a cluster are spherical about the cluster centre; this is a strong assumption and may not always hold. In such cases, spectral clustering helps create more accurate clusters.
2. Easy to implement and gives good clustering results. Thanks to the dimension reduction, it can correctly cluster observations that actually belong to the same cluster but are farther apart than observations in other clusters.
3. Reasonably fast for sparse data sets of several thousand elements.

Disadvantages

1. Use of K-Means clustering in the final step implies that the clusters