Spectral Clustering via the Power Method - Provably

Christos Boutsidis [email protected] Yahoo, 229 West 43rd Street, New York, NY, USA.

Alex Gittens [email protected] International Computer Science Institute, Berkeley, CA, USA.

Prabhanjan Kambadur [email protected] Bloomberg L.P., 731 Lexington Avenue, New York, 10022, USA.

Abstract

Spectral clustering is one of the most important algorithms in data mining and machine intelligence; however, its computational complexity limits its application to truly large scale data analysis. The computational bottleneck in spectral clustering is computing a few of the top eigenvectors of the (normalized) Laplacian matrix corresponding to the graph representing the data to be clustered. One way to speed up the computation of these eigenvectors is to use the "power method" from the numerical linear algebra literature. Although the power method has been empirically used to speed up spectral clustering, the theory behind this approach, to the best of our knowledge, remains unexplored. This paper provides the first such rigorous theoretical justification, arguing that a small number of power iterations suffices to obtain near-optimal partitionings using the approximate eigenvectors. Specifically, we prove that solving the k-means clustering problem on the approximate eigenvectors obtained via the power method gives an additive-error approximation to solving the k-means problem on the optimal eigenvectors.

Figure 1. 2-D data amenable to spectral clustering.

1. Introduction

Consider clustering the points in Figure 1. The data in this space are non-separable and there is no apparent clustering metric which can be used to recover this clustering structure. In particular, the two clusters have the same centers (centroids); hence, distance-based clustering methods such as k-means (Ostrovsky et al., 2006) will fail. Motivated by such shortcomings of traditional clustering approaches, researchers have produced a body of more flexible and data-adaptive clustering approaches, now known under the umbrella of spectral clustering. The crux of these approaches is to model the points to be clustered as vertices of a graph, where weights on edges connecting the vertices are assigned according to some similarity measure between the points. Next, a new, hopefully separable, representation of the points is formed by using the eigenvectors of the (normalized) Laplacian matrix associated with this similarity graph. This new, typically low-dimensional, representation is often called the "spectral embedding" of the points. We refer the reader to (Fiedler, 1973; Von Luxburg, 2007; Shi & Malik, 2000b) for the foundations of spectral clustering and to (Belkin & Niyogi, 2001; Ng et al., 2002; Liu & Zhang, 2004; Zelnik-Manor & Perona, 2004; Smyth & White, 2005) for applications in data mining and machine learning. We explain spectral clustering and the baseline algorithm in detail in Section 2.1.

The computational bottleneck in spectral clustering is the computation of the eigenvectors of the Laplacian matrix. Motivated by the need for faster algorithms to compute these eigenvectors, several techniques have been developed in order to speed up this computation (Spielman & Teng, 2009; Yan et al., 2009; Fowlkes et al., 2004; Pavan & Pelillo, 2005; Bezdek et al., 2006; Wang et al., 2009; Nyström, 1930; Baker, 1977). Perhaps the most popular of the above mentioned techniques is the "power method" (Lin & Cohen, 2010). The convergence of the power method is theoretically well understood when it comes to measuring the principal angle between the space spanned by the true eigenvectors and the space spanned by the approximate ones (see Theorem 8.2.4 in (Golub & Van Loan, 2012)). We refer readers to (Woodruff, 2014) for a rigorous theoretical analysis of the use of the power method for the low-rank matrix approximation problem. However, these results do not imply that the approximate eigenvectors of the power method are useful for spectral clustering.

Contributions. In this paper, we argue that the eigenvectors computed via the power method are useful for spectral clustering, and that the loss in clustering accuracy is small. We prove that solving the k-means problem on the approximate eigenvectors obtained via the power method gives an additive-error approximation to solving the k-means problem on the optimal eigenvectors (see Lemma 5 and Theorem 6).

2. Background

2.1. Spectral Clustering

We first review one mathematical formulation of spectral clustering. Let x_1, x_2, ..., x_n ∈ R^d be n points in d dimensions. The goal of clustering is to partition these points into k disjoint sets, for some given k. To this end, define a weighted undirected graph G(V, E) with |V| = n nodes and |E| edges: each node in G corresponds to an x_i; the weight of each edge encodes the similarity between its end points. Let W ∈ R^{n×n} be the similarity matrix, with W_ij = e^{-||x_i - x_j||^2 / σ} for i ≠ j and W_ii = 0, that gives the similarity between x_i and x_j. Here, σ is a tuning parameter. Given this setup, spectral clustering for k = 2 corresponds to the following graph partitioning problem:

Definition 1 (The Spectral Clustering Problem for k = 2 (Shi & Malik, 2000b)). Let x_1, x_2, ..., x_n ∈ R^d and k = 2 be given. Construct graph G(V, E) as described in the text above. Find subgraphs A and B of G that minimize the following:

    Ncut(A, B) = cut(A, B) · ( 1/assoc(A, V) + 1/assoc(B, V) ),

where cut(A, B) = Σ_{x_i ∈ A, x_j ∈ B} W_ij, assoc(A, V) = Σ_{x_i ∈ A, x_j ∈ V} W_ij, and assoc(B, V) = Σ_{x_i ∈ B, x_j ∈ V} W_ij.

This definition generalizes to any k > 2 in a straightforward manner (we omit the details). Minimizing Ncut(A, B) in a weighted undirected graph is an NP-Complete problem (see the appendix of (Shi & Malik, 2000b) for a proof). Motivated by this hardness result, Shi and Malik (Shi & Malik, 2000b) suggested a relaxation to this problem that is tractable in polynomial time using the Singular Value Decomposition (SVD). First, (Shi & Malik, 2000b) shows that for any G, A, B and any partition vector y ∈ R^n with +1 in the entries corresponding to A and −1 in the entries corresponding to B, the following identity holds:

    4 · Ncut(A, B) = y^T (D − W) y / (y^T D y).

Here, D ∈ R^{n×n} is the diagonal matrix of node degrees: D_ii = Σ_j W_ij. Hence, the spectral clustering problem in Definition 1 can be restated as finding such an optimum partition vector y, which, as we mentioned above, is an intractable problem. The real relaxation for spectral clustering asks instead for a real-valued vector y ∈ R^n:

Definition 2 (The real relaxation for the spectral clustering problem for k = 2 (Shi & Malik, 2000b)). Given graph G with n nodes, adjacency matrix W, and degrees matrix D, find y ∈ R^n such that:

    y = argmin_{y ∈ R^n, y^T D 1_n = 0}  y^T (D − W) y / (y^T D y).

Once such a y is found, one can partition the graph into two subgraphs by looking at the signs of the elements in y. When k > 2, one can compute k eigenvectors and then apply k-means clustering on the rows of a matrix, denoted as Y, containing those eigenvectors in its columns.

Motivated by these observations, Ng et al. (Ng et al., 2002) (see also (Weiss, 1999)) suggested the following algorithm for spectral clustering [1] (inputs to the algorithm are the points x_1, ..., x_n ∈ R^d and the number of clusters k).

1. Construct the similarity matrix W ∈ R^{n×n} as W_ij = e^{-||x_i - x_j||^2 / σ} (for i ≠ j); W_ii = 0; σ is given.
2. Construct D ∈ R^{n×n} as the diagonal matrix of node degrees: D_ii = Σ_j W_ij.
3. Construct W̃ = D^{-1/2} W D^{-1/2} ∈ R^{n×n}. [2]
4. Find the largest k eigenvectors of W̃ and assign them as columns to a matrix Y ∈ R^{n×k}. [3]
5. Apply k-means clustering on the rows of Y, and use this clustering to cluster the original points accordingly.

[1] Precisely, Ng et al. suggested an additional normalization step on Y before applying k-means, i.e., normalize Y to unit row norms, but we ignore this step for simplicity.
[2] Here, L = D − W is the Laplacian matrix of G and L̃ = I_n − W̃ is the so-called normalized Laplacian matrix.
[3] The top k eigenvectors of D^{-1/2} W D^{-1/2} correspond to the bottom k eigenvectors of I_n − D^{-1/2} W D^{-1/2}.
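For concreteness, a minimal MATLAB sketch of Steps 1-5 is given below. It is only an illustration of the baseline (fixed σ, full eigendecomposition, no row normalization); the function and variable names are ours and this is not the code used in our experiments.

    % Baseline spectral clustering (Steps 1-5 above).
    % X is n x d (one data point per row), k is the number of clusters, sigma > 0.
    function labels = baseline_spectral_clustering(X, k, sigma)
        n  = size(X, 1);
        D2 = pdist2(X, X).^2;                      % squared Euclidean distances
        W  = exp(-D2 / sigma);                     % Step 1: heat-kernel similarities
        W(1:n+1:end) = 0;                          %         W_ii = 0
        d  = sum(W, 2);                            % Step 2: node degrees
        Dh = diag(1 ./ sqrt(d));
        Wt = Dh * W * Dh;                          % Step 3: W~ = D^{-1/2} W D^{-1/2}
        Wt = (Wt + Wt') / 2;                       %         symmetrize against round-off
        [V, E]   = eig(Wt);                        % Step 4: top-k eigenvectors of W~
        [~, idx] = sort(diag(E), 'descend');
        Y = V(:, idx(1:k));
        labels = kmeans(Y, k, 'Replicates', 10);   % Step 5: k-means on the rows of Y
    end

The eigendecomposition in Step 4 is the expensive step that the power method discussed next approximates.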

This algorithm serves as our baseline for an "exact spectral clustering algorithm". One way to speed up this baseline algorithm is to use the power method (Lin & Cohen, 2010) in Step 4 to quickly approximate the eigenvectors in Y; that is,

• Power method: Initialize S ∈ R^{n×k} with i.i.d. random Gaussian variables. Let Ỹ ∈ R^{n×k} contain the left singular vectors of the matrix

    B = (W̃ W̃^T)^p W̃ S = W̃^{2p+1} S,

for some integer p ≥ 0. Now, use Ỹ instead of Y in Step 5 above. [4] (A MATLAB sketch of this modification appears at the end of this section.)

[4] This can be implemented in O(n^2 kp + k^2 n) time, as we need O(n^2 kp) time to implement all the matrix-matrix multiplications (right-to-left) and another O(k^2 n) time to find Ỹ. As we discuss below, selecting p ≈ O(ln(kn)) and assuming that the multiplicative spectral gap between the kth and the (k+1)th eigenvalue of W̃ is large suffices to get very accurate clusterings. This leads to an O(kn^2 ln(kn)) runtime for this step.

The use of the power method to speed up eigenvector computation is not new. The power method is a classical technique in the numerical linear algebra literature (see Section 8.2.4 in (Golub & Van Loan, 2012)). Existing theoretical analysis provides sharp bounds for the error ||YY^T − ỸỸ^T||_2 (see Theorem 8.2.4 in (Golub & Van Loan, 2012); this theorem assumes that S has orthonormal columns and is not perpendicular to Y). (Halko et al., 2011; Woodruff, 2014) also used the power method with random Gaussian initialization and applied it to the low-rank matrix approximation problem. The approximation bounds proved in those papers are for the term ||W̃ − ỸỸ^T W̃||_2. To the best of our knowledge, none of these results indicates that the approximate eigenvectors are useful for spectral clustering purposes.

2.2. Connection to k-means

The previous algorithm indicates that spectral clustering turns out to be a k-means clustering problem on the rows of Y, the matrix containing the bottom eigenvectors of the normalized Laplacian matrix. The main result of our paper is to prove that solving the k-means problem on Ỹ, and using this to cluster the rows of Y, gives a clustering which is as good as the clustering obtained by solving the k-means problem on Y. To this end, we need some background on k-means clustering; we present a linear algebraic view below.

For i = 1, ..., n, let y_i ∈ R^k be the i-th row of Y viewed as a column vector. Hence,

    Y = [y_1, y_2, ..., y_n]^T ∈ R^{n×k}.

Let k be the number of clusters. One can define a partition of the rows of Y by a cluster indicator matrix X ∈ R^{n×k}. Each column j = 1, ..., k of X represents a cluster. Each row i = 1, ..., n indicates the cluster membership of y_i. So, X_ij = 1/√s_j if and only if the data point y_i is in the jth cluster; here s_j = ||X^(j)||_0, where X^(j) is the jth column of X and ||X^(j)||_0 denotes the number of non-zero elements of X^(j). We formally define the k-means problem as follows:

Definition 3 (The k-means clustering problem). Given Y ∈ R^{n×k} (representing n data points as rows, described with respect to k features as columns) and a positive integer k denoting the number of clusters, find the indicator matrix X_opt ∈ R^{n×k} which satisfies

    X_opt = argmin_{X ∈ X}  ||Y − X X^T Y||_F^2.

Here, X denotes the set of all n × k indicator matrices X. Also, we will denote

    ||Y − X_opt X_opt^T Y||_F^2 := F_opt.

This definition is equivalent to the more traditional definition involving sums of squared distances of points from cluster centers (Boutsidis & Magdon-Ismail, 2013; Cohen et al., 2014) (we omit the details). Next, we formalize the notion of a "k-means approximation algorithm".

Definition 4 (k-means approximation algorithm). An algorithm is called a "γ-approximation" for the k-means clustering problem (γ ≥ 1) if it takes as inputs the dataset Y ∈ R^{n×k} and the number of clusters k, and returns an indicator matrix X_γ ∈ R^{n×k} such that, with probability at least 1 − δ_γ,

    ||Y − X_γ X_γ^T Y||_F^2 ≤ γ · min_{X ∈ X} ||Y − X X^T Y||_F^2 = γ · F_opt.

An example of such an approximation algorithm is in (Kumar et al., 2004), with γ = 1 + ε (0 < ε < 1) and δ_γ a constant in (0, 1); the corresponding running time is O(nk · 2^{(k/ε)^{O(1)}}). A trivial algorithm with γ = 1 but running time Ω(n^k) is to try all possible k-clusterings of the rows of Y and keep the best.
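Before moving to the main result, here is a minimal MATLAB sketch of the power-method modification of Step 4 referred to above; Wt stands for W̃ and the function name is ours. It is a sketch under the stated assumptions (symmetric W̃, Gaussian initialization), not a verbatim excerpt of our experimental code.

    % Approximate the top-k eigenvectors of the symmetric matrix Wt (n x n)
    % via p power iterations: B = Wt^(2p+1) * S, Ytilde = left singular vectors of B.
    function Ytilde = power_method_embedding(Wt, k, p)
        n = size(Wt, 1);
        S = randn(n, k);                  % i.i.d. Gaussian starting matrix
        B = Wt * S;                       % multiply right-to-left
        for t = 1:p
            B = Wt * (Wt * B);            % after the loop, B = Wt^(2p+1) * S
        end
        [Ytilde, ~, ~] = svd(B, 'econ');  % n x k left singular vectors of B
    end

Step 5 is unchanged: run k-means on the rows of Ytilde instead of Y.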

3. Main result

Next, we argue that applying a k-means approximation algorithm on Y and on Ỹ gives approximately the same clustering results for a sufficiently large number of power iterations p. Hence, the eigenvectors found by the power method do not sacrifice the accuracy of the exact spectral clustering algorithm. This is formally shown in Theorem 6. However, the key notion of approximation between the exact and the approximate eigenvectors is captured in Lemma 5:

Lemma 5. [See Section 4.1 for the proof.] For any ε, δ > 0, let

    p ≥ (1/2) · ln(4 · n · ε^{-1} · δ^{-1} · √k) / ln(γ_k),

where

    γ_k = σ_k(W̃) / σ_{k+1}(W̃)

is the multiplicative eigen-gap between the k-th and the (k+1)-th singular value of W̃. Then, with probability at least 1 − e^{−2n} − 2.35δ:

    ||Y Y^T − Ỹ Ỹ^T||_F^2 ≤ ε^2.

In words, for a sufficiently large value of p, the difference between the orthogonal projection operators onto span(Y) and span(Ỹ) is bounded, in Frobenius norm, by an arbitrarily small ε. Here and throughout the paper we reserve σ_1(W̃) to denote the largest singular value of W̃:

    σ_1(W̃) ≥ σ_2(W̃) ≥ ... ≥ σ_n(W̃) ≥ 0;

and similarly for L̃:

    λ_1(L̃) ≥ λ_2(L̃) ≥ ... ≥ λ_n(L̃).

Next, we present our main theorem (see Section 4.2 for the proof).

Theorem 6. Construct Ỹ via the power method with

    p ≥ (1/2) · ln(4 · n · ε^{-1} · δ^{-1} · √k) / ln(γ_k),

where

    γ_k = σ_k(W̃) / σ_{k+1}(W̃) = (1 − σ_{n−k+1}(L̃)) / (1 − σ_{n−k}(L̃)).

Here, L̃ = I_n − W̃ is the normalized Laplacian matrix. Consider running on the rows of Ỹ a γ-approximation k-means algorithm with failure probability δ_γ. Let the outcome be a clustering indicator matrix X_γ̃ ∈ R^{n×k}. Also, let X_opt be the optimal clustering indicator matrix for Y. Then, with probability at least 1 − e^{−2n} − 2.35δ − δ_γ,

    ||Y − X_γ̃ X_γ̃^T Y||_F^2 ≤ (1 + 4ε) · γ · ||Y − X_opt X_opt^T Y||_F^2 + 4ε^2.
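As a quick illustration of the prescription in Lemma 5 and Theorem 6, the MATLAB snippet below computes the smallest integer p satisfying the stated lower bound from n, k, ε, δ and the two singular values of W̃ that define γ_k. It is a convenience sketch of our own and assumes γ_k > 1 (the case discussed in Section 3.1).

    % Smallest integer p with p >= (1/2) * ln(4*n*sqrt(k)/(epsilon*delta)) / ln(gamma_k),
    % where gamma_k = sigma_k(Wt) / sigma_{k+1}(Wt) > 1.
    function p = required_power_iterations(n, k, epsilon, delta, sigma_k, sigma_k1)
        gamma_k = sigma_k / sigma_k1;
        p = ceil(0.5 * log(4 * n * sqrt(k) / (epsilon * delta)) / log(gamma_k));
    end

For example, with n = 1000, k = 2, ε = δ = 0.1, and γ_k = 2, this returns p = 10.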

3.1. Discussion

Several remarks are necessary regarding our main theorem. First of all, notice that the notion of approximation in clustering quality is with respect to the objective value of k-means; indeed, this is often the case in approximation algorithms for k-means clustering (Cohen et al., 2014). We acknowledge that it would have been better to obtain results on how well X_γ̃ approximates X_opt directly, for example, via bounding the error ||X_γ̃ − X_opt||_F; however, such results are notoriously difficult to obtain since this is a combinatorial objective.

Next, let us take a more careful look at the effect of the parameter γ_k. It suffices to discuss this effect for two cases: 1) γ_k = 1; and 2) γ_k > 1. To this end, we need to use a relation between the eigenvalues of W̃ and L̃ = D^{-1/2} L D^{-1/2} (see the proof of the theorem for an explanation): for all i = 1, 2, ..., n,

    σ_i(W̃) = 1 − σ_{n−i+1}(L̃).

3.1.1. γ_k = 1

First, we argue that the case γ_k = 1 is not interesting from a spectral clustering perspective and hence we can safely assume that it will not occur in practical scenarios. γ_k is the multiplicative eigen-gap between the kth and the (k+1)th eigenvalue of W̃:

    γ_k = σ_k(W̃) / σ_{k+1}(W̃) = (1 − λ_{n−k+1}(L̃)) / (1 − λ_{n−k}(L̃)).

The folklore belief in spectral clustering (see the end of Section 4.3 in (Lee et al., 2012)) says that a good k to select in order to cluster the data via spectral clustering is one for which

    λ_{n−k+1}(L̃) ≪ λ_{n−k}(L̃).

In this case, the spectral gap is sufficiently large and γ_k will not be close to one; hence, a small number of power iterations p will be enough to obtain accurate results. To summarize, if γ_k = 1, it does not make sense to perform spectral clustering with k clusters, and the user should look for a k' > k such that the gap in the spectrum, γ_{k'}, is not approaching one and, at the very least, is strictly larger than 1.

This gap assumption is not surprising from a linear algebraic perspective either. It is well known that, in order for the power iteration to succeed in finding the eigenvectors of any symmetric matrix, the multiplicative eigen-gap in the spectrum should be sufficiently large, as otherwise the power method will not be able to distinguish the kth eigenvector from the (k+1)th eigenvector. For a more detailed discussion of this we refer the reader to Theorem 8.2.4 in (Golub & Van Loan, 2012).

3.1.2. γ_k > 1

We remark that the graph we construct to pursue spectral clustering is a complete graph, hence it is connected; a basic fact in spectral graph theory (see, for example, (Lee et al., 2012)) says that the smallest eigenvalue of the Laplacian matrix L of the graph equals zero (λ_n(L) = 0) if and only if the graph is connected. An extension of this fact says that the number of connected components in the graph equals the multiplicity of the eigenvalue zero in the Laplacian matrix L. When the graph has k connected components, we have

    σ_n(L) = σ_{n−1}(L) = ... = σ_{n−k+1}(L) = 0,

and, correspondingly,

    σ_n(L̃) = σ_{n−1}(L̃) = ... = σ_{n−k+1}(L̃) = 0.

What happens, however, when the k smallest eigenvalues of the normalized Laplacian are not zero but close to zero? Cheeger's inequality and its extensions in (Lee et al., 2012) indicate that as those eigenvalues approach zero, the graph approaches a graph with k disconnected components; hence clustering such graphs should be "easy", that is, the number of power iterations p should be small. We formally argue about this statement below. First, we state a version of Cheeger's inequality that explains the situation for the k = 2 case (we omit the details of the higher-order extensions in (Lee et al., 2012)); the facts below can be found in Section 1.2 of (Bandeira et al., 2013). Recall also the graph partitioning problem in Definition 1. Let the minimum value of Ncut(A, B) be attained by a partition into two sets A_opt and B_opt; then

    (1/2) · λ_{n−1}(L̃) ≤ Ncut(A_opt, B_opt) ≤ 2 · √(2 · λ_{n−1}(L̃)).

Hence, if λ_{n−1}(L̃) → 0, then Ncut(A_opt, B_opt) → 0, which makes the clustering problem "easy", hence amenable to a small number of power iterations.

Now, we formally derive the relation of p as a function of σ_{n−k+1}(L̃). Towards this end, fix all values (n, ε, δ, σ_{n−k}(L̃)) to constants, e.g., n = 10^3, ε = 10^{-3}, δ = 10^{-2}, and σ_{n−k}(L̃) = 1/2. Also, let x = σ_{n−k+1}(L̃). Then,

    p := f(x) = (1/2) · ln(4 · 10^9) / ln(2 − 2x).

Simple calculus arguments show that, as 0 ≤ x < 1/2 increases (this range for x is required to make sure that γ_k > 1), f(x) also increases, which confirms the expectation that the closer the eigenvalue x := σ_{n−k+1}(L̃) is to 0, the fewer power iterations are needed. We plot f(x) in Figure 2. The number of power iterations is always small since the dependence of p on σ_{n−k+1}(L̃) is logarithmic.

Figure 2. Number of power iterations p vs the eigenvalue x := σ_{n−k+1}(L̃) of the normalized Laplacian.
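The curve in Figure 2 is easy to reproduce; the following MATLAB lines simply evaluate f(x) on a grid, with the constants fixed as above (the plotting choices are ours).

    % f(x) = (1/2) * ln(4e9) / ln(2 - 2x), for x = sigma_{n-k+1}(Ltilde) in [0, 1/2).
    x = linspace(0, 0.45, 100);
    f = 0.5 * log(4e9) ./ log(2 - 2*x);
    plot(x, f);
    xlabel('x'); ylabel('p = f(x)');

The value at x = 0 is about 16, and the curve grows slowly until x gets close to 1/2, where ln(2 − 2x) approaches zero and f(x) blows up.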

3.1.3. Dependence on ε

Finally, we remark that the approximation bound in the theorem indicates that the loss in clustering accuracy can be made arbitrarily small, since the dependence on ε^{-1} is logarithmic with respect to the number of power iterations. In particular, when ε ≤ ||Y − X_opt X_opt^T Y||_F^2, we have a relative error bound:

    ||Y − X_γ̃ X_γ̃^T Y||_F^2 ≤ ((1 + 4ε) · γ + 4ε) · ||Y − X_opt X_opt^T Y||_F^2 ≤ (1 + 8ε) · γ · ||Y − X_opt X_opt^T Y||_F^2,

where the last relation uses 1 ≤ γ. One might wonder whether, to achieve this relative error performance, ε → 0 is required, in which case it appears that p is very large. However, this is not true, since the event that ε → 0 is necessary occurs only when ||Y − X_opt X_opt^T Y||_F^2 → 0, which happens when Y ≈ X_opt. In this case, from the discussion in the previous section, we have σ_{n−k+1}(L̃) → 0, because Y is "close" to an indicator matrix if and only if the graph is "close" to having k disconnected components, which itself happens if and only if σ_{n−k+1}(L̃) → 0. It is easy now to calculate that, when this happens, p → 0 because, from standard calculus arguments, we can derive:

    lim_{x→0}  [ (1/2) · ln(4 · n · x^{-1} · δ^{-1} · √k) ] / [ ln( (1 − x) / (1 − σ_{n−k}(L̃)) ) ] = 0.

4. Proofs

4.1. Proof of Lemma 5

4.1.1. Preliminaries

We first introduce the notation that we use throughout the paper. A, B, ... are matrices; a, b, ... are column vectors. I_n is the n × n identity matrix; 0_{m×n} is the m × n matrix of zeros; 1_n is the n × 1 vector of ones. The Frobenius and the spectral matrix norms are ||A||_F^2 = Σ_{i,j} A_{ij}^2 and ||A||_2 = max_{||x||_2 = 1} ||Ax||_2, respectively. The thin (compact) SVD of A ∈ R^{m×n} of rank ρ is

    A = [U_k  U_{ρ−k}] · [Σ_k  0; 0  Σ_{ρ−k}] · [V_k  V_{ρ−k}]^T,    (1)

with singular values σ_1(A) ≥ ... ≥ σ_k(A) ≥ σ_{k+1}(A) ≥ ... ≥ σ_ρ(A) > 0. The matrices U_k ∈ R^{m×k} and U_{ρ−k} ∈ R^{m×(ρ−k)} contain the left singular vectors of A; similarly, the matrices V_k ∈ R^{n×k} and V_{ρ−k} ∈ R^{n×(ρ−k)} contain the right singular vectors. Σ_k ∈ R^{k×k} and Σ_{ρ−k} ∈ R^{(ρ−k)×(ρ−k)} contain the singular values of A. We write U_A = [U_k U_{ρ−k}] ∈ R^{m×ρ}, Σ_A = diag(Σ_k, Σ_{ρ−k}) ∈ R^{ρ×ρ}, and V_A = [V_k V_{ρ−k}] ∈ R^{n×ρ}. Also, A† = V_A Σ_A^{-1} U_A^T denotes the pseudo-inverse of A. For a symmetric positive semidefinite (SPSD) matrix A = B B^T, λ_i(A) = σ_i^2(B) denotes the i-th eigenvalue of A.

4.1.2. Intermediate Lemmas

To prove Lemma 5, we need the following simple fact.

Lemma 7. For any Y ∈ R^{n×k} and Ỹ ∈ R^{n×k} with Y^T Y = Ỹ^T Ỹ = I_k:

    ||Y Y^T − Ỹ Ỹ^T||_F^2 ≤ 2k · ||Y Y^T − Ỹ Ỹ^T||_2^2.

Proof. This is because ||X||_F^2 ≤ rank(X) · ||X||_2^2 for any X, and the fact that rank(Y Y^T − Ỹ Ỹ^T) ≤ 2k. To justify the latter, notice that Y Y^T − Ỹ Ỹ^T can be written as the product of two matrices, each with rank at most 2k:

    Y Y^T − Ỹ Ỹ^T = [Y  Ỹ] · [Y^T ; −Ỹ^T].

We also need the following result, which appeared as Lemma 7 in (Boutsidis & Magdon-Ismail, 2014).

Lemma 8. For any matrix A ∈ R^{m×n} with rank at least k, let p ≥ 0 be an integer and draw S ∈ R^{n×k}, a matrix of i.i.d. standard Gaussian random variables. Fix δ ∈ (0, 1) and ε ∈ (0, 1). Let Ω_1 = (A A^T)^p A S and Ω_2 = A_k. If

    p ≥ ln(4 n ε^{-1} δ^{-1}) · ln^{-1}( σ_k(A) / σ_{k+1}(A) ),

then with probability at least 1 − e^{−2n} − 2.35δ,

    ||Ω_1 Ω_1† − Ω_2 Ω_2†||_2^2 ≤ ε^2.

4.1.3. Concluding the Proof of Lemma 5

Applying Lemma 8 with A = W̃ and

    p ≥ ln(4 n ε^{-1} δ^{-1}) · ln^{-1}( σ_k(W̃) / σ_{k+1}(W̃) )

gives

    ||Y Y^T − Ỹ Ỹ^T||_2^2 ≤ ε^2.

Rescaling ε' = ε / √k, i.e., choosing p as

    p ≥ ln(4 n ε^{-1} δ^{-1} √k) · ln^{-1}( σ_k(W̃) / σ_{k+1}(W̃) ),

gives

    ||Y Y^T − Ỹ Ỹ^T||_2^2 ≤ ε^2 / k.

Combining this bound with Lemma 7 we obtain:

    ||Y Y^T − Ỹ Ỹ^T||_F^2 ≤ 2k · ||Y Y^T − Ỹ Ỹ^T||_2^2 ≤ 2k · (ε^2 / k) = ε^2,

as advertised.
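Lemma 7 is also easy to sanity-check numerically; the short MATLAB sketch below draws two random n × k orthonormal bases and compares the two sides of the inequality. It is a spot check for the reader, not part of the proof.

    % Numeric spot check of Lemma 7: ||YY' - Yt*Yt'||_F^2 <= 2k * ||YY' - Yt*Yt'||_2^2.
    n = 200; k = 5;
    [Y,  ~] = qr(randn(n, k), 0);     % random n x k matrix with orthonormal columns
    [Yt, ~] = qr(randn(n, k), 0);     % another one
    Delta = Y*Y' - Yt*Yt';
    lhs = norm(Delta, 'fro')^2;
    rhs = 2*k * norm(Delta)^2;        % norm(Delta) is the spectral norm
    fprintf('lhs = %.4f <= rhs = %.4f\n', lhs, rhs);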

4.2. Proof of Theorem 6

Let

    Y Y^T = Ỹ Ỹ^T + E,

where E is an n × n matrix with ||E||_F ≤ ε (this follows after taking square roots on both sides of the inequality of Lemma 5). Next, we manipulate the term ||Y − X_γ̃ X_γ̃^T Y||_F as follows:

    ||Y − X_γ̃ X_γ̃^T Y||_F
      = ||Y Y^T − X_γ̃ X_γ̃^T Y Y^T||_F
      = ||Ỹ Ỹ^T + E − X_γ̃ X_γ̃^T (Ỹ Ỹ^T + E)||_F
      ≤ ||Ỹ Ỹ^T − X_γ̃ X_γ̃^T Ỹ Ỹ^T||_F + ||(I_n − X_γ̃ X_γ̃^T) E||_F
      ≤ ||Ỹ Ỹ^T − X_γ̃ X_γ̃^T Ỹ Ỹ^T||_F + ||E||_F
      = ||Ỹ − X_γ̃ X_γ̃^T Ỹ||_F + ||E||_F
      ≤ √γ · min_{X ∈ X} ||Ỹ − X X^T Ỹ||_F + ||E||_F
      ≤ √γ · ||Ỹ − X_opt X_opt^T Ỹ||_F + ||E||_F
      = √γ · ||Ỹ Ỹ^T − X_opt X_opt^T Ỹ Ỹ^T||_F + ||E||_F
      = √γ · ||Y Y^T − E − X_opt X_opt^T (Y Y^T − E)||_F + ||E||_F
      ≤ √γ · ||Y Y^T − X_opt X_opt^T Y Y^T||_F + 2 · √γ · ||E||_F
      = √γ · ||Y − X_opt X_opt^T Y||_F + 2 · √γ · ||E||_F
      = √γ · √F_opt + 2 · √γ · ||E||_F
      ≤ √γ · √F_opt + 2 · √γ · ε.

In the above, we used the triangle inequality for the Frobenius norm; the fact that (I_n − X_γ̃ X_γ̃^T) and (I_n − X_opt X_opt^T) are projection matrices [5] (combined with the fact that for any projection matrix P and any matrix Z, ||P Z||_F ≤ ||Z||_F); the fact that for any matrix Q with orthonormal columns and any matrix X, ||X Q^T||_F = ||X||_F; the fact that 1 ≤ √γ; and the definition of a γ-approximation k-means algorithm.

[5] A matrix P is called a projection matrix if it is square and P^2 = P.

Overall, we have proved:

    ||Y − X_γ̃ X_γ̃^T Y||_F ≤ √γ · √F_opt + 2 · √γ · ε.

Taking squares on both sides of the previous inequality gives:

    ||Y − X_γ̃ X_γ̃^T Y||_F^2 ≤ γ · F_opt + 4 · ε · γ · √F_opt + 4 · γ · ε^2 = γ · (F_opt + 4 · ε · √F_opt + 4 · ε^2).

Finally, using √F_opt ≤ F_opt shows the claim in the theorem. This bound fails with probability at most e^{−2n} + 2.35δ + δ_γ, which simply follows by taking the union bound over the failure probabilities of Lemma 5 and the γ-approximation k-means algorithm.

Connection to the normalized Laplacian. Towards this end, we need to use a relation between the eigenvalues of W̃ and the eigenvalues of L̃ = D^{-1/2} L D^{-1/2}. Recall that the Laplacian matrix of the graph is L = D − W and the normalized Laplacian matrix is L̃ = I_n − W̃. From the last relation, it is easy to see that an eigenvalue of L̃ equals one minus some eigenvalue of W̃; the ordering of the eigenvalues, however, is different. Specifically, for i = 1, 2, ..., n, the relation is

    λ_i(W̃) = 1 − λ_{n−i+1}(L̃),

where the orderings are

    λ_1(W̃) ≥ λ_2(W̃) ≥ ... ≥ λ_n(W̃)   and   λ_1(L̃) ≥ λ_2(L̃) ≥ ... ≥ λ_n(L̃).

From this relation:

    γ_k = σ_k(W̃) / σ_{k+1}(W̃) = (1 − λ_{n−k+1}(L̃)) / (1 − λ_{n−k}(L̃)).
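The relation λ_i(W̃) = 1 − λ_{n−i+1}(L̃), and hence the expression for γ_k, can be verified numerically on any small example; the MATLAB sketch below does so for a random symmetric similarity matrix and is included only as a sanity check.

    % Spot check of lambda_i(Wt) = 1 - lambda_{n-i+1}(Lt), with Wt = D^{-1/2} W D^{-1/2}
    % and Lt = I_n - Wt, for a random symmetric similarity matrix W.
    n = 50;
    W = rand(n); W = (W + W') / 2; W(1:n+1:end) = 0;
    d = sum(W, 2); Dh = diag(1 ./ sqrt(d));
    Wt = Dh * W * Dh; Wt = (Wt + Wt') / 2;
    Lt = eye(n) - Wt;
    lw = sort(eig(Wt), 'descend');     % lambda_1(Wt) >= ... >= lambda_n(Wt)
    ll = sort(eig(Lt), 'descend');     % lambda_1(Lt) >= ... >= lambda_n(Lt)
    max(abs(lw - (1 - flipud(ll))))    % should be on the order of machine precision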

5. Experiments

To conduct our experiments, we developed high-quality MATLAB versions of the spectral clustering algorithms. In the remainder of this section, we refer to the clustering algorithm in (Shi & Malik, 2000a) as the "exact algorithm". We refer to the modified version that uses the power method as the "approximate algorithm". To measure clustering quality, we used normalized mutual information (Manning et al., 2008):

    NMI(Ω; C) = I(Ω; C) / ( (1/2) · (H(Ω) + H(C)) ).

Here, Ω = {ω_1, ω_2, ..., ω_k} is the set of discovered clusters and C = {c_1, c_2, ..., c_k} is the set of true class labels. Also,

    I(Ω; C) = Σ_k Σ_j P(ω_k ∩ c_j) · log( P(ω_k ∩ c_j) / (P(ω_k) P(c_j)) )

is the mutual information between Ω and C, and

    H(Ω) = − Σ_k P(ω_k) log P(ω_k)   and   H(C) = − Σ_k P(c_k) log P(c_k)

are the entropies of Ω and C, where the probabilities P are estimated using maximum likelihood. NMI is a value in [0, 1], where values closer to 1 represent better clusterings.
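For completeness, the NMI computation above can be written in a few lines of MATLAB; this is our own minimal implementation (probabilities estimated by counts, labels assumed to be positive integers), not necessarily the code used to produce the tables below.

    % Normalized mutual information between predicted labels omega and true labels c,
    % both given as n x 1 vectors of positive integer labels.
    function nmi = nmi_score(omega, c)
        n = numel(omega);
        Pjoint = accumarray([omega(:), c(:)], 1) / n;   % joint distribution P(w, c)
        Pw = sum(Pjoint, 2); Pc = sum(Pjoint, 1);       % marginals
        R  = Pjoint ./ (Pw * Pc);                       % P(w,c) / (P(w) P(c))
        I  = sum(Pjoint(R > 0) .* log(R(R > 0)));       % mutual information
        Hw = -sum(Pw(Pw > 0) .* log(Pw(Pw > 0)));       % entropy of the clustering
        Hc = -sum(Pc(Pc > 0) .* log(Pc(Pc > 0)));       % entropy of the true labels
        nmi = I / (0.5 * (Hw + Hc));
    end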

The exact algorithm uses MATLAB's svds function to compute the top-k singular vectors. The approximate algorithm exploits the tall-thin structure of B and computes Ỹ using MATLAB's svd function [6]. The approximate algorithm uses MATLAB's normrnd function to generate the random Gaussian matrix S. We used MATLAB's kmeans function with the options 'EmptyAction', 'singleton', 'MaxIter', 100, 'Replicates', 10. All our experiments were run using MATLAB 8.1.0.604 (R2013a) on a 1.4 GHz Intel Core i5 dual-core processor running OS X 10.9.5 with 8GB 1600 MHz DDR3 RAM. Finally, all reported running times are for computing Y for the exact algorithm and Ỹ for the approximate algorithm (given W̃).

[6] [U,S,V] = svd(B'*B); tildeY = B*V*S^(-.5);

5.1. Spectral Clustering Accuracy

We ran our experiments on four multi-class datasets from the libSVMTools webpage (Table 1).

                   Exact                Approximate (p = 2)    Approximate (best under exact time)
    Name       NMI     time (secs)    NMI     time (secs)    NMI     time (secs)    p
    SatImage   0.5905  1.270          0.5713  0.310          0.6007  0.690          6
    Segment    0.7007  1.185          0.2240  0.126          0.5305  0.530          10
    Vehicle    0.1655  0.024          0.2191  0.009          0.2449  0.022          6
    Vowel      0.4304  0.016          0.3829  0.003          0.4307  0.005          3

Table 2. Spectral clustering results for the exact and approximate algorithms on the datasets from Table 1. For the approximate algorithm, we report two sets of numbers: (1) the NMI achieved at p = 2 along with the running time, and (2) the best NMI achieved while taking less time than the exact algorithm. We see that the approximate algorithm performs at least as well as the exact algorithm for all but the Segment dataset.

    Name       n      d    #nnz     #classes
    SatImage   4435   36   158048   6
    Segment    2310   19   41469    7
    Vehicle    846    18   14927    4
    Vowel      528    10   5280     11

Table 1. The libSVM multi-class classification datasets (Chang & Lin, 2011) used for our spectral clustering experiments.

To compute W, we use the heat kernel W_ij = e^{-||x_i − x_j||^2 / σ_ij}, where x_i ∈ R^d and x_j ∈ R^d are the data points and σ_ij is a tuning parameter; σ_ij is determined using the self-tuning method described in (Zelnik-Manor & Perona, 2004). That is, for each data point i, σ_i is computed as the Euclidean distance of the ℓth furthest neighbor from i; then σ_ij is set to σ_i σ_j for every (i, j). In our experiments, we report the results for ℓ = 7 (a MATLAB sketch of this construction appears at the end of this section). To determine the quality of the clustering obtained by the approximate algorithm, and to determine the effect of p (the number of power iterations) on the clustering quality and running time, we varied p from 0 to 10.

In columns 2 and 3 of Table 2, we see the results for the exact algorithm, which serve as our baseline for the quality and performance of the approximate algorithm. In columns 4 and 5, we see the NMI with 2 power iterations (p = 2) along with the running time. Immediately, we see that even at p = 2, the Vehicle dataset achieves better accuracy than the exact algorithm while resulting in a 2.5x speedup. The only outlier is the Segment dataset, which achieves poor NMI at p = 2. In columns 6-8, we report the best NMI that was achieved by the approximate algorithm while staying under the time taken by the exact algorithm; we also report the value of p and the time taken (in seconds) for this result. We see that even when constrained to run in less time than the exact algorithm, the approximate algorithm bests the NMI achieved by the exact algorithm. For example, at p = 6, the approximate algorithm reports NMI = 0.2449 for Vehicle, as opposed to NMI = 0.1655 achieved by the exact algorithm. We can see that in many cases, 2 subspace iterations suffice to get good quality clustering (the Segment dataset is the exception).

Figure 3. Increase in running time (to compute Ỹ) normalized by the running time when p = 0. The baseline times (at p = 0) are SatImage (0.0648 secs), Segment (0.0261 secs), Vehicle (0.0027 secs), and Vowel (0.0012 secs).

Figure 3 depicts the relation between p and the running time (to compute Ỹ) of the approximate algorithm. All times are normalized by the time taken when p = 0, to enable reporting numbers for all datasets on the same scale. As expected, as p increases, we see a linear increase.
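The self-tuning construction of W used above can be sketched in MATLAB as follows (ℓ = 7 in our experiments). This is an illustrative snippet of our own, interpreting the ℓth neighbor as the ℓth in increasing order of distance, as in the self-tuning method of (Zelnik-Manor & Perona, 2004); it is not necessarily the exact experimental code.

    % Self-tuning similarity matrix: sigma_i = distance from x_i to its ell-th neighbor,
    % W_ij = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j)), W_ii = 0.
    function W = self_tuning_similarity(X, ell)
        n = size(X, 1);
        D = pdist2(X, X);                 % pairwise Euclidean distances
        Dsorted = sort(D, 2);             % each row sorted in increasing order
        sigma = Dsorted(:, ell + 1);      % skip column 1, the zero distance to self
        W = exp(-(D.^2) ./ (sigma * sigma'));
        W(1:n+1:end) = 0;
    end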

Acknowledgments

We would like to thank Yiannis Koutis and James Lee for useful discussions.

References

Baker, C.T.H. The Numerical Treatment of Integral Equations, volume 13. Clarendon Press, 1977.

Bandeira, Afonso S., Singer, Amit, and Spielman, Daniel A. A Cheeger inequality for the graph connection Laplacian. SIAM Journal on Matrix Analysis and Applications, 34(4):1611-1630, 2013.

Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, 2001.

Bezdek, J.C., Hathaway, R.J., Huband, J.M., Leckie, C., and Kotagiri, R. Approximate clustering in very large relational data. International Journal of Intelligent Systems, 21(8):817-841, 2006.

Boutsidis, Christos and Magdon-Ismail, Malik. Deterministic feature selection for k-means clustering. IEEE Transactions on Information Theory, 59(9):6099-6110, 2013.

Boutsidis, Christos and Magdon-Ismail, Malik. Faster SVD-truncated regularized least-squares. In ISIT, 2014.

Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cohen, Michael, Elder, Sam, Musco, Cameron, Musco, Christopher, and Persu, Madalina. Dimensionality reduction for k-means clustering and low rank approximation. arXiv preprint arXiv:1410.6801, 2014.

Fiedler, Miroslav. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2):298-305, 1973.

Fowlkes, C., Belongie, S., Chung, F., and Malik, J. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214-225, 2004.

Golub, Gene H. and Van Loan, Charles F. Matrix Computations, volume 3. JHU Press, 2012.

Halko, Nathan, Martinsson, Per-Gunnar, and Tropp, Joel A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

Kumar, Amit, Sabharwal, Yogish, and Sen, Sandeep. A simple linear time (1+ε)-approximation algorithm for geometric k-means clustering in any dimensions. In FOCS, 2004.

Lee, James R., Oveis Gharan, Shayan, and Trevisan, Luca. Multi-way spectral partitioning and higher-order Cheeger inequalities. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing, pp. 1117-1130. ACM, 2012.

Lin, Frank and Cohen, William W. Power iteration clustering. In ICML, 2010.

Liu, Rong and Zhang, Hao. Segmentation of 3D meshes through spectral clustering. In Proceedings of the 12th Pacific Conference on Computer Graphics and Applications (PG 2004), pp. 298-305. IEEE, 2004.

Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. Introduction to Information Retrieval. Cambridge University Press, 2008.

Ng, Andrew Y., Jordan, Michael I., Weiss, Yair, et al. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849-856, 2002.

Nyström, E. Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica, 54(1):185-204, 1930.

Ostrovsky, Rafail, Rabani, Yuval, Schulman, Leonard J., and Swamy, Chaitanya. The effectiveness of Lloyd-type methods for the k-means problem. In FOCS, 2006.

Pavan, M. and Pelillo, M. Efficient out-of-sample extension of dominant-set clusters. In NIPS, 2005.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000a.

Shi, Jianbo and Malik, Jitendra. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000b.

Smyth, S. and White, S. A spectral clustering approach to finding communities in graphs. In SDM, 2005.

Spielman, Daniel and Teng, Shang-Hua. Nearly-linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. arXiv preprint, http://arxiv.org/pdf/cs/0607105v4.pdf, 2009.

Von Luxburg, Ulrike. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

Wang, L., Leckie, C., Ramamohanarao, K., and Bezdek, J. Approximate spectral clustering. In Advances in Knowledge Discovery and Data Mining, pp. 134-146, 2009.

Weiss, Yair. Segmentation using eigenvectors: a unifying view. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pp. 975-982. IEEE, 1999.

Woodruff, David P. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.

Yan, D., Huang, L., and Jordan, M.I. Fast approximate spectral clustering. In KDD, 2009.

Zelnik-Manor, Lihi and Perona, Pietro. Self-tuning spectral clustering. In NIPS, 2004.