Fast Approximate Spectral Clustering
Donghui Yan, University of California, Berkeley, CA 94720, [email protected]
Ling Huang, Intel Research Lab, Berkeley, CA 94704, [email protected]
Michael I. Jordan, University of California, Berkeley, CA 94720, [email protected]

ABSTRACT

Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of O(n^3) in general, with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and a significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectrally cluster data sets with a million observations within several minutes.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Clustering; I.2.6 [Artificial Intelligence]: Learning

General Terms

Algorithms, Experimentation, Performance

Keywords

Unsupervised Learning, Spectral Clustering, Data Quantization

1. INTRODUCTION

Clustering is a problem of primary importance in data mining, statistical machine learning and scientific discovery. An enormous variety of methods have been developed over the past several decades to solve clustering problems [15, 19]. A relatively recent area of focus has been spectral clustering, a class of methods based on eigendecompositions of affinity, dissimilarity or kernel matrices [20, 28, 31]. Whereas many clustering methods are strongly tied to Euclidean geometry, making explicit or implicit assumptions that clusters form convex regions in Euclidean space, spectral methods are more flexible, capturing a wider range of geometries. They often yield superior empirical performance when compared to competing algorithms such as k-means, and they have been successfully deployed in numerous applications in areas such as computer vision, bioinformatics, and robotics. Moreover, there is a substantial theoretical literature supporting spectral clustering [20, 34].

Despite these virtues, spectral clustering is not widely viewed as a competitor to classical algorithms such as hierarchical clustering and k-means for large-scale data mining problems. The reason is easy to state: given a data set consisting of n data points, spectral clustering algorithms form an n × n affinity matrix and compute eigenvectors of this matrix, an operation that has a computational complexity of O(n^3) in general. For applications with n on the order of thousands, spectral clustering methods begin to become infeasible, and problems with n in the millions are entirely out of reach.

In this paper we focus on developing fast approximate algorithms for spectral clustering. Our approach is not fundamentally new. As in many other situations in data mining in which a computational bottleneck is involved, we aim to find an effective preprocessor that reduces the size of the data structure that is input to that bottleneck (see, e.g., [25, 27]). There are many options that can be considered for this preprocessing step. One option is to perform various forms of subsampling of the data, selecting data points at random or according to some form of stratification procedure. Another option is to replace the original data set with a small number of points (i.e., "representatives") that aim to capture relevant structure. Another approach that is specifically available in the spectral clustering setting is to exploit the literature on low-rank matrix approximations. Indeed, this last approach has been the approach most commonly pursued in the literature; in particular, several researchers have proposed using the Nyström method for this purpose [10, 35, 12]. While it is useful to define such preprocessors, simply possessing a knob that can adjust computational complexity does not constitute a solution to the problem of fast spectral clustering. What is needed is an explicit connection between the amount of data reduction that is achieved by a preprocessor and the subsequent effect on the clustering. Indeed, the motivation for using spectral methods is that they can provide a high-quality clustering, and if that high-quality clustering is destroyed by a preprocessor then we should consider other preprocessors (or abandon spectral clustering entirely). In particular, it is not satisfactory to simply reduce the rank of an affinity matrix so that an eigendecomposition can be performed in a desired time frame, unless we have some understanding of the effect of this rank reduction on the clustering.

In this paper we propose a general framework for fast spectral clustering and conduct an end-to-end theoretical analysis for our method. In the spirit of rate-distortion theory, our analysis yields a relationship between an appropriately defined notion of distortion at the input and some notion of clustering accuracy at the output. This analysis allows us to argue that the goal of a preprocessor should be to minimize distortion; by minimizing distortion we minimize the effect of data reduction on spectral clustering.

To obtain a practical spectral clustering methodology, we thus make use of preprocessors that minimize distortion. In the current paper we provide two examples of such preprocessors. The first is classical k-means, used in this context as a local data reduction step. The second is the Random Projection tree (RP tree) of [8]. In either case, the overall approximate spectral clustering algorithm takes the following form: 1) coarsen the affinity graph by using the preprocessor to collapse neighboring data points into a set of local "representative points," 2) run a spectral clustering algorithm on the set of representative points, and 3) assign cluster memberships to the original data points based on those of the representative points.
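As a concrete illustration of this three-step recipe, the sketch below (our own, not the authors' implementation) uses scikit-learn's KMeans as the distortion-minimizing preprocessor, as KASP does, and off-the-shelf spectral clustering on the representatives; the helper name kasp_like and all parameter choices are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def kasp_like(X, n_clusters, n_representatives, random_state=0):
    """Approximate spectral clustering of the rows of X (illustrative sketch).

    1) Coarsen: collapse the n points into a small set of representatives.
    2) Cluster: run spectral clustering on the representatives only.
    3) Lift: each original point inherits the label of its representative.
    """
    # Step 1: distortion-minimizing preprocessor (local k-means).
    km = KMeans(n_clusters=n_representatives, random_state=random_state).fit(X)
    representatives = km.cluster_centers_   # the "representative points"
    membership = km.labels_                 # which representative each x_i collapsed into

    # Step 2: spectral clustering on the representatives only (k << n).
    sc = SpectralClustering(n_clusters=n_clusters, affinity="rbf",
                            random_state=random_state)
    rep_labels = sc.fit_predict(representatives)

    # Step 3: assign each original point the cluster of its representative.
    return rep_labels[membership]

if __name__ == "__main__":
    X = np.random.randn(10000, 5)
    labels = kasp_like(X, n_clusters=3, n_representatives=200)
    print(labels.shape)  # (10000,) -- only 200 points entered the eigendecomposition
```

The point of the design is that the cubic-cost eigendecomposition is applied only to the small set of representatives, while every original point still receives a cluster label.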
Our theoretical analysis is a perturbation analysis, similar in spirit to those of [20] and [28] but different in detail given our focus on practical error bounds. It is also worth noting that this analysis has applications beyond the design of fast approximations to spectral clustering. In particular, as discussed by [18], our perturbation analysis can be used for developing distributed versions of spectral clustering and for analyzing robustness to noise.

The remainder of the paper is organized as follows. We begin with a brief overview of spectral clustering in Section 2, and summarize the related work in Section 3. In Section 4 we describe our framework for fast approximate spectral clustering and discuss two implementations of this framework: "KASP," which is based on k-means, and "RASP," which is based on RP trees. We evaluate our algorithms in Section 5, by comparing both KASP and RASP with the Nyström approximation and with k-means. We present our theoretical analysis in Section 6. In particular, in that section we provide a bound for the mis-clustering rate that depends linearly on the amount of perturbation to the original data.

Algorithm 1 SpectralClustering(x_1, ..., x_n)
Input: n data points {x_i}_{i=1}^n, x_i ∈ R^d
Output: Bipartition S and S̄ of the input data
1. Compute the affinity matrix A with elements a_{ij} = exp(−‖x_i − x_j‖^2 / (2σ^2)), i, j = 1, ..., n
2. Compute the diagonal degree matrix D with elements d_i = Σ_{j=1}^n a_{ij}
3. Compute the normalized Laplacian matrix L = D^{−1/2}(D − A)D^{−1/2}
4. Find the second eigenvector v_2 of L
5. Obtain the two partitions using v_2:
6. S = {i : v_{2i} > 0}, S̄ = {i : v_{2i} ≤ 0}
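For concreteness, the following is a direct NumPy transcription of the steps of Algorithm 1. It is a minimal sketch of ours (dense matrices, no numerical safeguards); the dense eigendecomposition in step 4 is exactly the O(n^3) bottleneck discussed in the Introduction.

```python
import numpy as np

def spectral_bipartition(X, sigma):
    """Spectral bipartitioning following the steps of Algorithm 1.

    X     : (n, d) array of data points
    sigma : bandwidth of the Gaussian affinity
    Returns a boolean mask: True for points in S, False for points in S-bar.
    """
    # 1. Affinity matrix A with a_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))

    # 2. Diagonal degree matrix D with d_i = sum_j a_ij.
    d = A.sum(axis=1)

    # 3. Normalized Laplacian L = D^{-1/2} (D - A) D^{-1/2}.
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ (np.diag(d) - A) @ D_inv_sqrt

    # 4. Second eigenvector of L (eigh returns eigenvalues in ascending order).
    eigvals, eigvecs = np.linalg.eigh(L)
    v2 = eigvecs[:, 1]

    # 5-6. Partition by the sign of v2.
    return v2 > 0
```

In the approximate framework above, a routine of this kind is run only on the representative points produced by the preprocessor, rather than on all n original points.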
2. SPECTRAL CLUSTERING

Consider a partition of the vertex set V into m subsets V_1, ..., V_m, let W(S, T) = Σ_{i∈S} Σ_{j∈T} a_{ij} denote the total affinity between two subsets S and T of V, and consider the following optimization criterion:

Ncut = Σ_{j=1}^{m} [W(V_j, V) − W(V_j, V_j)] / W(V_j, V).   (1)

In this equation, the numerator in the jth term is equal to the sum of the affinities on edges leaving the subset V_j and the denominator is equal to the total degree of the subset V_j. Minimizing the sum of such terms thus aims at finding a partition in which edges with large affinities tend to stay within the individual subsets V_j and in which the sizes of the V_j are balanced.

The optimization problem in (1) is intractable, and spectral clustering is based on a standard relaxation procedure that transforms the problem into a tractable eigenvector problem. In particular, the relaxation for Ncut is based on rewriting (1) as a normalized quadratic form involving indicator vectors. These indicator vectors are then replaced with real-valued vectors, resulting in a generalized eigenvector problem that can be summarized conveniently in terms of the (normalized) graph Laplacian L of A, defined as follows:

L = D^{−1/2}(D − A)D^{−1/2} = I − D^{−1/2} A D^{−1/2} = I − L_0.   (2)
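The rewriting of (1) as a normalized quadratic form can be checked numerically in the two-cluster case. The snippet below is our own illustration, not taken from the paper: it assumes the standard scaled indicator vector g (entries ±sqrt(vol(S̄)/vol(S)) depending on side) commonly used in the Ncut relaxation, and verifies that g^T(D − A)g / (g^T D g) reproduces the value of criterion (1); the relaxation then lets g range over arbitrary real vectors, which leads to the eigenproblem for L.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small Gaussian-affinity matrix on 12 random points, and an arbitrary bipartition.
X = rng.normal(size=(12, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
A = np.exp(-sq / 2.0)          # sigma = 1
d = A.sum(axis=1)              # degrees d_i
D = np.diag(d)

V = np.arange(12)
in_S = V < 6                   # S = {0,...,5}, S-bar = {6,...,11}
S, Sb = V[in_S], V[~in_S]

def W(I, J):
    """Total affinity between index sets I and J."""
    return A[np.ix_(I, J)].sum()

# Criterion (1) with m = 2 subsets.
ncut = (W(S, V) - W(S, S)) / W(S, V) + (W(Sb, V) - W(Sb, Sb)) / W(Sb, V)

# The same value as a normalized quadratic form in a scaled indicator vector g.
volS, volSb = d[in_S].sum(), d[~in_S].sum()
g = np.where(in_S, np.sqrt(volSb / volS), -np.sqrt(volS / volSb))
quad = (g @ (D - A) @ g) / (g @ D @ g)

assert np.isclose(ncut, quad)  # the two expressions agree before relaxation
print(ncut, quad)
```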