Spectral Clustering via the Power Method - Provably

Christos Boutsidis [email protected] Yahoo, 229 West 43rd Street, New York, NY, USA.

Alex Gittens [email protected] International Computer Science Institute, Berkeley, CA, USA.

Prabhanjan Kambadur [email protected] Bloomberg L.P., 731 Lexington Avenue, New York, 10022, USA.

Abstract

Spectral clustering is one of the most important algorithms in data mining and machine intelligence; however, its computational complexity limits its application to truly large scale data analysis. The computational bottleneck in spectral clustering is computing a few of the top eigenvectors of the (normalized) Laplacian matrix corresponding to the graph representing the data to be clustered. One way to speed up the computation of these eigenvectors is to use the "power method" from the numerical linear algebra literature. Although the power method has been empirically used to speed up spectral clustering, the theory behind this approach, to the best of our knowledge, remains unexplored. This paper provides the first such rigorous theoretical justification, arguing that a small number of power iterations suffices to obtain near-optimal partitionings using the approximate eigenvectors. Specifically, we prove that solving the k-means clustering problem on the approximate eigenvectors obtained via the power method gives an additive-error approximation to solving the k-means problem on the optimal eigenvectors.

Figure 1. 2-D data amenable to spectral clustering.

1. Introduction

Consider clustering the points in Figure 1. The data in this space are non-separable and there is no apparent clustering metric which can be used to recover this clustering structure. In particular, the two clusters have the same centers (centroids); hence, distance-based clustering methods such as k-means (Ostrovsky et al., 2006) will fail. Motivated by such shortcomings of traditional clustering approaches, researchers have produced a body of more flexible and data-adaptive clustering approaches, now known under the umbrella of spectral clustering. The crux of these approaches is to model the points to be clustered as vertices of a graph, where weights on edges connecting the vertices are assigned according to some similarity measure between the points. Next, a new, hopefully separable, representation of the points is formed by using the eigenvectors of the (normalized) Laplacian matrix associated with this similarity graph. This new, typically low-dimensional, representation is often called the "spectral embedding" of the points. We refer the reader to (Fiedler, 1973; Von Luxburg, 2007; Shi & Malik, 2000b) for the foundations of spectral clustering and to (Belkin & Niyogi, 2001; Ng et al., 2002; Liu & Zhang, 2004; Zelnik-Manor & Perona, 2004; Smyth & White, 2005) for applications in data mining and machine learning. We explain spectral clustering and the baseline algorithm in detail in Section 2.1.

The computational bottleneck in spectral clustering is the computation of the eigenvectors of the Laplacian matrix. Motivated by the need for faster algorithms to compute these eigenvectors, several techniques have been developed in order to speed up this computation (Spielman & Teng, 2009; Yan et al., 2009; Fowlkes et al., 2004; Pavan & Pelillo, 2005; Bezdek et al., 2006; Wang et al., 2009; Nyström, 1930; Baker, 1977). Perhaps the most popular of the above mentioned techniques is the "power method" (Lin & Cohen, 2010). The convergence of the power method is theoretically well understood when it comes to measuring the principal angle between the space spanned by the true eigenvectors and the space spanned by the approximate ones (see Theorem 8.2.4 in (Golub & Van Loan, 2012)). We refer readers to (Woodruff, 2014) for a rigorous theoretical analysis of the use of the power method for the low-rank matrix approximation problem. However, these results do not imply that the approximate eigenvectors of the power method are useful for spectral clustering.

Contributions. In this paper, we argue that the eigenvectors computed via the power method are useful for spectral clustering, and that the loss in clustering accuracy is small. We prove that solving the k-means problem on the approximate eigenvectors obtained via the power method gives an additive-error approximation to solving the k-means problem on the optimal eigenvectors (see Lemma 5 and Theorem 6).

2. Background

2.1. Spectral Clustering

We first review one mathematical formulation of spectral clustering. Let x_1, x_2, ..., x_n ∈ R^d be n points in d dimensions. The goal of clustering is to partition these points into k disjoint sets, for some given k. To this end, define a weighted undirected graph G(V, E) with |V| = n nodes and |E| edges: each node in G corresponds to an x_i; the weight of each edge encodes the similarity between its end points. Let W ∈ R^{n×n} be the similarity matrix, with W_ij = e^{-||x_i - x_j||^2 / σ} for i ≠ j and W_ii = 0, that gives the similarity between x_i and x_j. Here, σ is a tuning parameter. Given this setup, spectral clustering for k = 2 corresponds to the following graph partitioning problem:

Definition 1 (The Spectral Clustering Problem for k = 2 (Shi & Malik, 2000b)). Let x_1, x_2, ..., x_n ∈ R^d and k = 2 be given. Construct graph G(V, E) as described in the text above. Find subgraphs A and B of G that minimize the following:

    Ncut(A, B) = cut(A, B) · ( 1/assoc(A, V) + 1/assoc(B, V) ),

where cut(A, B) = Σ_{x_i ∈ A, x_j ∈ B} W_ij, assoc(A, V) = Σ_{x_i ∈ A, x_j ∈ V} W_ij, and assoc(B, V) = Σ_{x_i ∈ B, x_j ∈ V} W_ij.

This definition generalizes to any k > 2 in a straightforward manner (we omit the details). Minimizing Ncut(A, B) in a weighted undirected graph is an NP-Complete problem (see the appendix of (Shi & Malik, 2000b) for a proof). Motivated by this hardness result, Shi and Malik (Shi & Malik, 2000b) suggested a relaxation to this problem that is tractable in polynomial time using the Singular Value Decomposition (SVD). First, (Shi & Malik, 2000b) shows that for any G, A, B and any partition vector y ∈ R^n with +1 in the entries corresponding to A and −1 in the entries corresponding to B, the following identity holds:

    4 · Ncut(A, B) = y^T (D − W) y / (y^T D y).

Here, D ∈ R^{n×n} is the diagonal matrix of node degrees: D_ii = Σ_j W_ij. Hence, the spectral clustering problem in Definition 1 can be restated as finding such an optimum partition vector y, which, as we mentioned above, is an intractable problem. The real relaxation for spectral clustering asks instead for a real-valued vector y ∈ R^n:

Definition 2 (The real relaxation for the spectral clustering problem for k = 2 (Shi & Malik, 2000b)). Given graph G with n nodes, adjacency matrix W, and degrees matrix D, find y ∈ R^n such that:

    y = argmin_{y ∈ R^n, y^T D 1_n = 0}  y^T (D − W) y / (y^T D y).

Once such a y is found, one can partition the graph into two subgraphs by looking at the signs of the elements in y. When k > 2, one can compute k eigenvectors and then apply k-means clustering on the rows of a matrix, denoted as Y, containing those eigenvectors in its columns.

Motivated by these observations, Ng et al. (Ng et al., 2002) (see also (Weiss, 1999)) suggested the following algorithm for spectral clustering [1] (inputs to the algorithm are the points x_1, ..., x_n ∈ R^d and the number of clusters k).

1. Construct the similarity matrix W ∈ R^{n×n} as W_ij = e^{-||x_i - x_j||^2 / σ} (for i ≠ j); W_ii = 0; σ is given.
2. Construct D ∈ R^{n×n} as the diagonal matrix of node degrees: D_ii = Σ_j W_ij.
3. Construct W̃ = D^{-1/2} W D^{-1/2} ∈ R^{n×n}. [2]
4. Find the largest k eigenvectors of W̃ and assign them as columns to a matrix Y ∈ R^{n×k}. [3]
5. Apply k-means clustering on the rows of Y, and use this clustering to cluster the original points accordingly.

[1] Precisely, Ng et al. suggested an additional normalization step on Y before applying k-means, i.e., normalize Y to unit row norms, but we ignore this step for simplicity.
[2] Here, L = D − W is the Laplacian matrix of G and L̃ = I_n − W̃ is the so-called normalized Laplacian matrix.
[3] The top k eigenvectors of D^{-1/2} W D^{-1/2} correspond to the bottom k eigenvectors of I_n − D^{-1/2} W D^{-1/2}.
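For concreteness, a minimal MATLAB sketch of Steps 1-5 is given below. It is only an illustration of the baseline (fixed σ, full eigendecomposition, no row normalization); the function and variable names are ours and this is not the code used in our experiments.

    % Baseline spectral clustering (Steps 1-5 above).
    % X is n x d (one data point per row), k is the number of clusters, sigma > 0.
    function labels = baseline_spectral_clustering(X, k, sigma)
        n  = size(X, 1);
        D2 = pdist2(X, X).^2;                      % squared Euclidean distances
        W  = exp(-D2 / sigma);                     % Step 1: heat-kernel similarities
        W(1:n+1:end) = 0;                          %         W_ii = 0
        d  = sum(W, 2);                            % Step 2: node degrees
        Dh = diag(1 ./ sqrt(d));
        Wt = Dh * W * Dh;                          % Step 3: W~ = D^{-1/2} W D^{-1/2}
        Wt = (Wt + Wt') / 2;                       %         symmetrize against round-off
        [V, E]   = eig(Wt);                        % Step 4: top-k eigenvectors of W~
        [~, idx] = sort(diag(E), 'descend');
        Y = V(:, idx(1:k));
        labels = kmeans(Y, k, 'Replicates', 10);   % Step 5: k-means on the rows of Y
    end

The eigendecomposition in Step 4 is the expensive step that the power method discussed next approximates.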

This algorithm serves as our baseline for an "exact spectral clustering algorithm". One way to speed up this baseline algorithm is to use the power method (Lin & Cohen, 2010) in Step 4 to quickly approximate the eigenvectors in Y; that is,

• Power method: Initialize S ∈ R^{n×k} with i.i.d. random Gaussian variables. Let Ỹ ∈ R^{n×k} contain the left singular vectors of the matrix

    B = (W̃ W̃^T)^p W̃ S = W̃^{2p+1} S,

for some integer p ≥ 0. Now, use Ỹ instead of Y in Step 5 above. [4] (A MATLAB sketch of this modification appears at the end of this section.)

[4] This can be implemented in O(n^2 kp + k^2 n) time, as we need O(n^2 kp) time to implement all the matrix-matrix multiplications (right-to-left) and another O(k^2 n) time to find Ỹ. As we discuss below, selecting p ≈ O(ln(kn)) and assuming that the multiplicative spectral gap between the kth and the (k+1)th eigenvalue of W̃ is large suffices to get very accurate clusterings. This leads to an O(kn^2 ln(kn)) runtime for this step.

The use of the power method to speed up eigenvector computation is not new. The power method is a classical technique in the numerical linear algebra literature (see Section 8.2.4 in (Golub & Van Loan, 2012)). Existing theoretical analysis provides sharp bounds for the error ||YY^T − ỸỸ^T||_2 (see Theorem 8.2.4 in (Golub & Van Loan, 2012); this theorem assumes that S has orthonormal columns and is not perpendicular to Y). (Halko et al., 2011; Woodruff, 2014) also used the power method with random Gaussian initialization and applied it to the low-rank matrix approximation problem. The approximation bounds proved in those papers are for the term ||W̃ − ỸỸ^T W̃||_2. To the best of our knowledge, none of these results indicates that the approximate eigenvectors are useful for spectral clustering purposes.

2.2. Connection to k-means

The previous algorithm indicates that spectral clustering turns out to be a k-means clustering problem on the rows of Y, the matrix containing the bottom eigenvectors of the normalized Laplacian matrix. The main result of our paper is to prove that solving the k-means problem on Ỹ, and using this to cluster the rows of Y, gives a clustering which is as good as the clustering obtained by solving the k-means problem on Y. To this end, we need some background on k-means clustering; we present a linear algebraic view below.

For i = 1, ..., n, let y_i ∈ R^k be the i-th row of Y viewed as a column vector. Hence,

    Y = [y_1, y_2, ..., y_n]^T ∈ R^{n×k}.

Let k be the number of clusters. One can define a partition of the rows of Y by a cluster indicator matrix X ∈ R^{n×k}. Each column j = 1, ..., k of X represents a cluster. Each row i = 1, ..., n indicates the cluster membership of y_i. So, X_ij = 1/√s_j if and only if the data point y_i is in the jth cluster; here s_j = ||X^(j)||_0, where X^(j) is the jth column of X and ||X^(j)||_0 denotes the number of non-zero elements of X^(j). We formally define the k-means problem as follows:

Definition 3 (The k-means clustering problem). Given Y ∈ R^{n×k} (representing n data points as rows, described with respect to k features as columns) and a positive integer k denoting the number of clusters, find the indicator matrix X_opt ∈ R^{n×k} which satisfies

    X_opt = argmin_{X ∈ X}  ||Y − X X^T Y||_F^2.

Here, X denotes the set of all n × k indicator matrices X. Also, we will denote

    ||Y − X_opt X_opt^T Y||_F^2 := F_opt.

This definition is equivalent to the more traditional definition involving sums of squared distances of points from cluster centers (Boutsidis & Magdon-Ismail, 2013; Cohen et al., 2014) (we omit the details). Next, we formalize the notion of a "k-means approximation algorithm".

Definition 4 (k-means approximation algorithm). An algorithm is called a "γ-approximation" for the k-means clustering problem (γ ≥ 1) if it takes as inputs the dataset Y ∈ R^{n×k} and the number of clusters k, and returns an indicator matrix X_γ ∈ R^{n×k} such that, with probability at least 1 − δ_γ,

    ||Y − X_γ X_γ^T Y||_F^2 ≤ γ · min_{X ∈ X} ||Y − X X^T Y||_F^2 = γ · F_opt.

An example of such an approximation algorithm is in (Kumar et al., 2004), with γ = 1 + ε (0 < ε < 1) and δ_γ a constant in (0, 1); the corresponding running time is O(nk · 2^{(k/ε)^{O(1)}}). A trivial algorithm with γ = 1 but running time Ω(n^k) is to try all possible k-clusterings of the rows of Y and keep the best.
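Before moving to the main result, here is a minimal MATLAB sketch of the power-method modification of Step 4 referred to above; Wt stands for W̃ and the function name is ours. It is a sketch under the stated assumptions (symmetric W̃, Gaussian initialization), not a verbatim excerpt of our experimental code.

    % Approximate the top-k eigenvectors of the symmetric matrix Wt (n x n)
    % via p power iterations: B = Wt^(2p+1) * S, Ytilde = left singular vectors of B.
    function Ytilde = power_method_embedding(Wt, k, p)
        n = size(Wt, 1);
        S = randn(n, k);                  % i.i.d. Gaussian starting matrix
        B = Wt * S;                       % multiply right-to-left
        for t = 1:p
            B = Wt * (Wt * B);            % after the loop, B = Wt^(2p+1) * S
        end
        [Ytilde, ~, ~] = svd(B, 'econ');  % n x k left singular vectors of B
    end

Step 5 is unchanged: run k-means on the rows of Ytilde instead of Y.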

3. Main result

Next, we argue that applying a k-means approximation algorithm on Y and on Ỹ gives approximately the same clustering results for a sufficiently large number of power iterations p. Hence, the eigenvectors found by the power method do not sacrifice the accuracy of the exact spectral clustering algorithm. This is formally shown in Theorem 6. However, the key notion of approximation between the exact and the approximate eigenvectors is captured in Lemma 5:

Lemma 5. [See Section 4.1 for the proof.] For any ε, δ > 0, let

    p ≥ (1/2) · ln(4 · n · ε^{-1} · δ^{-1} · √k) / ln(γ_k),

where

    γ_k = σ_k(W̃) / σ_{k+1}(W̃)

is the multiplicative eigen-gap between the k-th and the (k+1)-th singular value of W̃. Then, with probability at least 1 − e^{−2n} − 2.35δ:

    ||Y Y^T − Ỹ Ỹ^T||_F^2 ≤ ε^2.

In words, for a sufficiently large value of p, the difference between the orthogonal projection operators onto span(Y) and span(Ỹ) is bounded, in Frobenius norm, by an arbitrarily small ε. Here and throughout the paper we reserve σ_1(W̃) to denote the largest singular value of W̃:

    σ_1(W̃) ≥ σ_2(W̃) ≥ ... ≥ σ_n(W̃) ≥ 0;

and similarly for L̃:

    λ_1(L̃) ≥ λ_2(L̃) ≥ ... ≥ λ_n(L̃).

Next, we present our main theorem (see Section 4.2 for the proof).

Theorem 6. Construct Ỹ via the power method with

    p ≥ (1/2) · ln(4 · n · ε^{-1} · δ^{-1} · √k) / ln(γ_k),

where

    γ_k = σ_k(W̃) / σ_{k+1}(W̃) = (1 − σ_{n−k+1}(L̃)) / (1 − σ_{n−k}(L̃)).

Here, L̃ = I_n − W̃ is the normalized Laplacian matrix. Consider running on the rows of Ỹ a γ-approximation k-means algorithm with failure probability δ_γ. Let the outcome be a clustering indicator matrix X_γ̃ ∈ R^{n×k}. Also, let X_opt be the optimal clustering indicator matrix for Y. Then, with probability at least 1 − e^{−2n} − 2.35δ − δ_γ,

    ||Y − X_γ̃ X_γ̃^T Y||_F^2 ≤ (1 + 4ε) · γ · ||Y − X_opt X_opt^T Y||_F^2 + 4ε^2.
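As a quick illustration of the prescription in Lemma 5 and Theorem 6, the MATLAB snippet below computes the smallest integer p satisfying the stated lower bound from n, k, ε, δ and the two singular values of W̃ that define γ_k. It is a convenience sketch of our own and assumes γ_k > 1 (the case discussed in Section 3.1).

    % Smallest integer p with p >= (1/2) * ln(4*n*sqrt(k)/(epsilon*delta)) / ln(gamma_k),
    % where gamma_k = sigma_k(Wt) / sigma_{k+1}(Wt) > 1.
    function p = required_power_iterations(n, k, epsilon, delta, sigma_k, sigma_k1)
        gamma_k = sigma_k / sigma_k1;
        p = ceil(0.5 * log(4 * n * sqrt(k) / (epsilon * delta)) / log(gamma_k));
    end

For example, with n = 1000, k = 2, ε = δ = 0.1, and γ_k = 2, this returns p = 10.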

3.1. Discussion

Several remarks are necessary regarding our main theorem. First of all, notice that the notion of approximation in clustering quality is with respect to the objective value of k-means; indeed, this is often the case in approximation algorithms for k-means clustering (Cohen et al., 2014). We acknowledge that it would have been better to obtain results on how well X_γ̃ approximates X_opt directly, for example, via bounding the error ||X_γ̃ − X_opt||_F; however, such results are notoriously difficult to obtain since this is a combinatorial objective.

Next, let us take a more careful look at the effect of the parameter γ_k. It suffices to discuss this effect for two cases: 1) γ_k = 1; and 2) γ_k > 1. To this end, we need to use a relation between the eigenvalues of W̃ and L̃ = D^{-1/2} L D^{-1/2} (see the proof of the theorem for an explanation): for all i = 1, 2, ..., n,

    σ_i(W̃) = 1 − σ_{n−i+1}(L̃).

3.1.1. γ_k = 1

First, we argue that the case γ_k = 1 is not interesting from a spectral clustering perspective and hence we can safely assume that it will not occur in practical scenarios. γ_k is the multiplicative eigen-gap between the kth and the (k+1)th eigenvalue of W̃:

    γ_k = σ_k(W̃) / σ_{k+1}(W̃) = (1 − λ_{n−k+1}(L̃)) / (1 − λ_{n−k}(L̃)).

The folklore belief in spectral clustering (see the end of Section 4.3 in (Lee et al., 2012)) says that a good k to select in order to cluster the data via spectral clustering is one for which

    λ_{n−k+1}(L̃) ≪ λ_{n−k}(L̃).

In this case, the spectral gap is sufficiently large and γ_k will not be close to one; hence, a small number of power iterations p will be enough to obtain accurate results. To summarize, if γ_k = 1, it does not make sense to perform spectral clustering with k clusters, and the user should look for a k' > k such that the gap in the spectrum, γ_{k'}, is not approaching one and, at the very least, is strictly larger than 1.

This gap assumption is not surprising from a linear algebraic perspective either. It is well known that, in order for the power iteration to succeed in finding the eigenvectors of any symmetric matrix, the multiplicative eigen-gap in the spectrum should be sufficiently large, as otherwise the power method will not be able to distinguish the kth eigenvector from the (k+1)th eigenvector. For a more detailed discussion of this we refer the reader to Theorem 8.2.4 in (Golub & Van Loan, 2012).

3.1.2. γ_k > 1

We remark that the graph we construct to pursue spectral clustering is a complete graph, hence it is connected; a basic fact in spectral graph theory (see, for example, (Lee et al., 2012)) says that the smallest eigenvalue of the Laplacian matrix L of the graph equals zero (λ_n(L) = 0) if and only if the graph is connected. An extension of this fact says that the number of connected components in the graph equals the multiplicity of the eigenvalue zero in the Laplacian matrix L. When the graph has k connected components, we have

    σ_n(L) = σ_{n−1}(L) = ... = σ_{n−k+1}(L) = 0,

and, correspondingly,

    σ_n(L̃) = σ_{n−1}(L̃) = ... = σ_{n−k+1}(L̃) = 0.

What happens, however, when the k smallest eigenvalues of the normalized Laplacian are not zero but close to zero? Cheeger's inequality and its extensions in (Lee et al., 2012) indicate that as those eigenvalues approach zero, the graph approaches a graph with k disconnected components; hence clustering such graphs should be "easy", that is, the number of power iterations p should be small. We formally argue about this statement below. First, we state a version of Cheeger's inequality that explains the situation for the k = 2 case (we omit the details of the higher-order extensions in (Lee et al., 2012)); the facts below can be found in Section 1.2 of (Bandeira et al., 2013). Recall also the graph partitioning problem in Definition 1. Let the minimum value of Ncut(A, B) be attained by a partition into two sets A_opt and B_opt; then

    (1/2) · λ_{n−1}(L̃) ≤ Ncut(A_opt, B_opt) ≤ 2 · √(2 · λ_{n−1}(L̃)).

Hence, if λ_{n−1}(L̃) → 0, then Ncut(A_opt, B_opt) → 0, which makes the clustering problem "easy", hence amenable to a small number of power iterations.

Now, we formally derive the relation of p as a function of σ_{n−k+1}(L̃). Towards this end, fix all values (n, ε, δ, σ_{n−k}(L̃)) to constants, e.g., n = 10^3, ε = 10^{-3}, δ = 10^{-2}, and σ_{n−k}(L̃) = 1/2. Also, let x = σ_{n−k+1}(L̃). Then,

    p := f(x) = (1/2) · ln(4 · 10^9) / ln(2 − 2x).

Simple calculus arguments show that, as 0 ≤ x < 1/2 increases (this range for x is required to make sure that γ_k > 1), f(x) also increases, which confirms the expectation that the closer the eigenvalue x := σ_{n−k+1}(L̃) is to 0, the fewer power iterations are needed. We plot f(x) in Figure 2. The number of power iterations is always small since the dependence of p on σ_{n−k+1}(L̃) is logarithmic.

Figure 2. Number of power iterations p vs the eigenvalue x := σ_{n−k+1}(L̃) of the normalized Laplacian.
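The curve in Figure 2 is easy to reproduce; the following MATLAB lines simply evaluate f(x) on a grid, with the constants fixed as above (the plotting choices are ours).

    % f(x) = (1/2) * ln(4e9) / ln(2 - 2x), for x = sigma_{n-k+1}(Ltilde) in [0, 1/2).
    x = linspace(0, 0.45, 100);
    f = 0.5 * log(4e9) ./ log(2 - 2*x);
    plot(x, f);
    xlabel('x'); ylabel('p = f(x)');

The value at x = 0 is about 16, and the curve grows slowly until x gets close to 1/2, where ln(2 − 2x) approaches zero and f(x) blows up.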

3.1.3. Dependence on ε

Finally, we remark that the approximation bound in the theorem indicates that the loss in clustering accuracy can be made arbitrarily small, since the dependence on ε^{-1} is logarithmic with respect to the number of power iterations. In particular, when ε ≤ ||Y − X_opt X_opt^T Y||_F^2, we have a relative error bound:

    ||Y − X_γ̃ X_γ̃^T Y||_F^2 ≤ ((1 + 4ε) · γ + 4ε) · ||Y − X_opt X_opt^T Y||_F^2 ≤ (1 + 8ε) · γ · ||Y − X_opt X_opt^T Y||_F^2,

where the last relation uses 1 ≤ γ. One might wonder whether, to achieve this relative error performance, ε → 0 is required, in which case it appears that p is very large. However, this is not true, since the event that ε → 0 is necessary occurs only when ||Y − X_opt X_opt^T Y||_F^2 → 0, which happens when Y ≈ X_opt. In this case, from the discussion in the previous section, we have σ_{n−k+1}(L̃) → 0, because Y is "close" to an indicator matrix if and only if the graph is "close" to having k disconnected components, which itself happens if and only if σ_{n−k+1}(L̃) → 0. It is easy now to calculate that, when this happens, p → 0 because, from standard calculus arguments, we can derive:

    lim_{x→0}  [ (1/2) · ln(4 · n · x^{-1} · δ^{-1} · √k) ] / [ ln( (1 − x) / (1 − σ_{n−k}(L̃)) ) ] = 0.

4. Proofs

4.1. Proof of Lemma 5

4.1.1. Preliminaries

We first introduce the notation that we use throughout the paper. A, B, ... are matrices; a, b, ... are column vectors. I_n is the n × n identity matrix; 0_{m×n} is the m × n matrix of zeros; 1_n is the n × 1 vector of ones. The Frobenius and the spectral matrix norms are ||A||_F^2 = Σ_{i,j} A_{ij}^2 and ||A||_2 = max_{||x||_2 = 1} ||Ax||_2, respectively. The thin (compact) SVD of A ∈ R^{m×n} of rank ρ is

    A = [U_k  U_{ρ−k}] · [Σ_k  0; 0  Σ_{ρ−k}] · [V_k  V_{ρ−k}]^T,    (1)

with singular values σ_1(A) ≥ ... ≥ σ_k(A) ≥ σ_{k+1}(A) ≥ ... ≥ σ_ρ(A) > 0. The matrices U_k ∈ R^{m×k} and U_{ρ−k} ∈ R^{m×(ρ−k)} contain the left singular vectors of A; similarly, the matrices V_k ∈ R^{n×k} and V_{ρ−k} ∈ R^{n×(ρ−k)} contain the right singular vectors. Σ_k ∈ R^{k×k} and Σ_{ρ−k} ∈ R^{(ρ−k)×(ρ−k)} contain the singular values of A. We write U_A = [U_k U_{ρ−k}] ∈ R^{m×ρ}, Σ_A = diag(Σ_k, Σ_{ρ−k}) ∈ R^{ρ×ρ}, and V_A = [V_k V_{ρ−k}] ∈ R^{n×ρ}. Also, A† = V_A Σ_A^{-1} U_A^T denotes the pseudo-inverse of A. For a symmetric positive semidefinite (SPSD) matrix A = B B^T, λ_i(A) = σ_i^2(B) denotes the i-th eigenvalue of A.

4.1.2. Intermediate Lemmas

To prove Lemma 5, we need the following simple fact.

Lemma 7. For any Y ∈ R^{n×k} and Ỹ ∈ R^{n×k} with Y^T Y = Ỹ^T Ỹ = I_k:

    ||Y Y^T − Ỹ Ỹ^T||_F^2 ≤ 2k · ||Y Y^T − Ỹ Ỹ^T||_2^2.

Proof. This is because ||X||_F^2 ≤ rank(X) · ||X||_2^2 for any X, and the fact that rank(Y Y^T − Ỹ Ỹ^T) ≤ 2k. To justify the latter, notice that Y Y^T − Ỹ Ỹ^T can be written as the product of two matrices, each with rank at most 2k:

    Y Y^T − Ỹ Ỹ^T = [Y  Ỹ] · [Y^T ; −Ỹ^T].

We also need the following result, which appeared as Lemma 7 in (Boutsidis & Magdon-Ismail, 2014).

Lemma 8. For any matrix A ∈ R^{m×n} with rank at least k, let p ≥ 0 be an integer and draw S ∈ R^{n×k}, a matrix of i.i.d. standard Gaussian random variables. Fix δ ∈ (0, 1) and ε ∈ (0, 1). Let Ω_1 = (A A^T)^p A S and Ω_2 = A_k. If

    p ≥ ln(4 n ε^{-1} δ^{-1}) · ln^{-1}( σ_k(A) / σ_{k+1}(A) ),

then with probability at least 1 − e^{−2n} − 2.35δ,

    ||Ω_1 Ω_1† − Ω_2 Ω_2†||_2^2 ≤ ε^2.

4.1.3. Concluding the Proof of Lemma 5

Applying Lemma 8 with A = W̃ and

    p ≥ ln(4 n ε^{-1} δ^{-1}) · ln^{-1}( σ_k(W̃) / σ_{k+1}(W̃) )

gives

    ||Y Y^T − Ỹ Ỹ^T||_2^2 ≤ ε^2.

Rescaling ε' = ε / √k, i.e., choosing p as

    p ≥ ln(4 n ε^{-1} δ^{-1} √k) · ln^{-1}( σ_k(W̃) / σ_{k+1}(W̃) ),

gives

    ||Y Y^T − Ỹ Ỹ^T||_2^2 ≤ ε^2 / k.

Combining this bound with Lemma 7 we obtain:

    ||Y Y^T − Ỹ Ỹ^T||_F^2 ≤ 2k · ||Y Y^T − Ỹ Ỹ^T||_2^2 ≤ 2k · (ε^2 / k) = ε^2,

as advertised.
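Lemma 7 is also easy to sanity-check numerically; the short MATLAB sketch below draws two random n × k orthonormal bases and compares the two sides of the inequality. It is a spot check for the reader, not part of the proof.

    % Numeric spot check of Lemma 7: ||YY' - Yt*Yt'||_F^2 <= 2k * ||YY' - Yt*Yt'||_2^2.
    n = 200; k = 5;
    [Y,  ~] = qr(randn(n, k), 0);     % random n x k matrix with orthonormal columns
    [Yt, ~] = qr(randn(n, k), 0);     % another one
    Delta = Y*Y' - Yt*Yt';
    lhs = norm(Delta, 'fro')^2;
    rhs = 2*k * norm(Delta)^2;        % norm(Delta) is the spectral norm
    fprintf('lhs = %.4f <= rhs = %.4f\n', lhs, rhs);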

4.2. Proof of Theorem 6

Let

    Y Y^T = Ỹ Ỹ^T + E,

where E is an n × n matrix with ||E||_F ≤ ε (this follows after taking square roots on both sides of the inequality of Lemma 5). Next, we manipulate the term ||Y − X_γ̃ X_γ̃^T Y||_F as follows:

    ||Y − X_γ̃ X_γ̃^T Y||_F
      = ||Y Y^T − X_γ̃ X_γ̃^T Y Y^T||_F
      = ||Ỹ Ỹ^T + E − X_γ̃ X_γ̃^T (Ỹ Ỹ^T + E)||_F
      ≤ ||Ỹ Ỹ^T − X_γ̃ X_γ̃^T Ỹ Ỹ^T||_F + ||(I_n − X_γ̃ X_γ̃^T) E||_F
      ≤ ||Ỹ Ỹ^T − X_γ̃ X_γ̃^T Ỹ Ỹ^T||_F + ||E||_F
      = ||Ỹ − X_γ̃ X_γ̃^T Ỹ||_F + ||E||_F
      ≤ √γ · min_{X ∈ X} ||Ỹ − X X^T Ỹ||_F + ||E||_F
      ≤ √γ · ||Ỹ − X_opt X_opt^T Ỹ||_F + ||E||_F
      = √γ · ||Ỹ Ỹ^T − X_opt X_opt^T Ỹ Ỹ^T||_F + ||E||_F
      = √γ · ||Y Y^T − E − X_opt X_opt^T (Y Y^T − E)||_F + ||E||_F
      ≤ √γ · ||Y Y^T − X_opt X_opt^T Y Y^T||_F + 2 · √γ · ||E||_F
      = √γ · ||Y − X_opt X_opt^T Y||_F + 2 · √γ · ||E||_F
      = √γ · √F_opt + 2 · √γ · ||E||_F
      ≤ √γ · √F_opt + 2 · √γ · ε.

In the above, we used the triangle inequality for the Frobenius norm; the fact that (I_n − X_γ̃ X_γ̃^T) and (I_n − X_opt X_opt^T) are projection matrices [5] (combined with the fact that for any projection matrix P and any matrix Z, ||P Z||_F ≤ ||Z||_F); the fact that for any matrix Q with orthonormal columns and any matrix X, ||X Q^T||_F = ||X||_F; the fact that 1 ≤ √γ; and the definition of a γ-approximation k-means algorithm.

[5] A matrix P is called a projection matrix if it is square and P^2 = P.

Overall, we have proved:

    ||Y − X_γ̃ X_γ̃^T Y||_F ≤ √γ · √F_opt + 2 · √γ · ε.

Taking squares on both sides of the previous inequality gives:

    ||Y − X_γ̃ X_γ̃^T Y||_F^2 ≤ γ · F_opt + 4 · ε · γ · √F_opt + 4 · γ · ε^2 = γ · (F_opt + 4 · ε · √F_opt + 4 · ε^2).

Finally, using √F_opt ≤ F_opt shows the claim in the theorem. This bound fails with probability at most e^{−2n} + 2.35δ + δ_γ, which simply follows by taking the union bound over the failure probabilities of Lemma 5 and the γ-approximation k-means algorithm.

Connection to the normalized Laplacian. Towards this end, we need to use a relation between the eigenvalues of W̃ and the eigenvalues of L̃ = D^{-1/2} L D^{-1/2}. Recall that the Laplacian matrix of the graph is L = D − W and the normalized Laplacian matrix is L̃ = I_n − W̃. From the last relation, it is easy to see that an eigenvalue of L̃ equals one minus some eigenvalue of W̃; the ordering of the eigenvalues, however, is different. Specifically, for i = 1, 2, ..., n, the relation is

    λ_i(W̃) = 1 − λ_{n−i+1}(L̃),

where the orderings are

    λ_1(W̃) ≥ λ_2(W̃) ≥ ... ≥ λ_n(W̃)   and   λ_1(L̃) ≥ λ_2(L̃) ≥ ... ≥ λ_n(L̃).

From this relation:

    γ_k = σ_k(W̃) / σ_{k+1}(W̃) = (1 − λ_{n−k+1}(L̃)) / (1 − λ_{n−k}(L̃)).
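The relation λ_i(W̃) = 1 − λ_{n−i+1}(L̃), and hence the expression for γ_k, can be verified numerically on any small example; the MATLAB sketch below does so for a random symmetric similarity matrix and is included only as a sanity check.

    % Spot check of lambda_i(Wt) = 1 - lambda_{n-i+1}(Lt), with Wt = D^{-1/2} W D^{-1/2}
    % and Lt = I_n - Wt, for a random symmetric similarity matrix W.
    n = 50;
    W = rand(n); W = (W + W') / 2; W(1:n+1:end) = 0;
    d = sum(W, 2); Dh = diag(1 ./ sqrt(d));
    Wt = Dh * W * Dh; Wt = (Wt + Wt') / 2;
    Lt = eye(n) - Wt;
    lw = sort(eig(Wt), 'descend');     % lambda_1(Wt) >= ... >= lambda_n(Wt)
    ll = sort(eig(Lt), 'descend');     % lambda_1(Lt) >= ... >= lambda_n(Lt)
    max(abs(lw - (1 - flipud(ll))))    % should be on the order of machine precision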

5. Experiments

To conduct our experiments, we developed high-quality MATLAB versions of the spectral clustering algorithms. In the remainder of this section, we refer to the clustering algorithm in (Shi & Malik, 2000a) as the "exact algorithm". We refer to the modified version that uses the power method as the "approximate algorithm". To measure clustering quality, we used normalized mutual information (Manning et al., 2008):

    NMI(Ω; C) = I(Ω; C) / ( (1/2) · (H(Ω) + H(C)) ).

Here, Ω = {ω_1, ω_2, ..., ω_k} is the set of discovered clusters and C = {c_1, c_2, ..., c_k} is the set of true class labels. Also,

    I(Ω; C) = Σ_k Σ_j P(ω_k ∩ c_j) · log( P(ω_k ∩ c_j) / (P(ω_k) P(c_j)) )

is the mutual information between Ω and C, and

    H(Ω) = − Σ_k P(ω_k) log P(ω_k)   and   H(C) = − Σ_k P(c_k) log P(c_k)

are the entropies of Ω and C, where the probabilities P are estimated using maximum likelihood. NMI is a value in [0, 1], where values closer to 1 represent better clusterings.
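For completeness, the NMI computation above can be written in a few lines of MATLAB; this is our own minimal implementation (probabilities estimated by counts, labels assumed to be positive integers), not necessarily the code used to produce the tables below.

    % Normalized mutual information between predicted labels omega and true labels c,
    % both given as n x 1 vectors of positive integer labels.
    function nmi = nmi_score(omega, c)
        n = numel(omega);
        Pjoint = accumarray([omega(:), c(:)], 1) / n;   % joint distribution P(w, c)
        Pw = sum(Pjoint, 2); Pc = sum(Pjoint, 1);       % marginals
        R  = Pjoint ./ (Pw * Pc);                       % P(w,c) / (P(w) P(c))
        I  = sum(Pjoint(R > 0) .* log(R(R > 0)));       % mutual information
        Hw = -sum(Pw(Pw > 0) .* log(Pw(Pw > 0)));       % entropy of the clustering
        Hc = -sum(Pc(Pc > 0) .* log(Pc(Pc > 0)));       % entropy of the true labels
        nmi = I / (0.5 * (Hw + Hc));
    end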

The exact algorithm uses MATLAB's svds function to compute the top-k singular vectors. The approximate algorithm exploits the tall-thin structure of B and computes Ỹ using MATLAB's svd function [6]. The approximate algorithm uses MATLAB's normrnd function to generate the random Gaussian matrix S. We used MATLAB's kmeans function with the options 'EmptyAction', 'singleton', 'MaxIter', 100, 'Replicates', 10. All our experiments were run using MATLAB 8.1.0.604 (R2013a) on a 1.4 GHz Intel Core i5 dual-core processor running OS X 10.9.5 with 8GB 1600 MHz DDR3 RAM. Finally, all reported running times are for computing Y for the exact algorithm and Ỹ for the approximate algorithm (given W̃).

[6] [U,S,V] = svd(B'*B); tildeY = B*V*S^(-.5);

5.1. Spectral Clustering Accuracy

We ran our experiments on four multi-class datasets from the libSVMTools webpage (Table 1).

                   Exact                Approximate (p = 2)    Approximate (best under exact time)
    Name       NMI     time (secs)    NMI     time (secs)    NMI     time (secs)    p
    SatImage   0.5905  1.270          0.5713  0.310          0.6007  0.690          6
    Segment    0.7007  1.185          0.2240  0.126          0.5305  0.530          10
    Vehicle    0.1655  0.024          0.2191  0.009          0.2449  0.022          6
    Vowel      0.4304  0.016          0.3829  0.003          0.4307  0.005          3

Table 2. Spectral clustering results for the exact and approximate algorithms on the datasets from Table 1. For the approximate algorithm, we report two sets of numbers: (1) the NMI achieved at p = 2 along with the running time, and (2) the best NMI achieved while taking less time than the exact algorithm. We see that the approximate algorithm performs at least as well as the exact algorithm for all but the Segment dataset.

    Name       n      d    #nnz     #classes
    SatImage   4435   36   158048   6
    Segment    2310   19   41469    7
    Vehicle    846    18   14927    4
    Vowel      528    10   5280     11

Table 1. The libSVM multi-class classification datasets (Chang & Lin, 2011) used for our spectral clustering experiments.

To compute W, we use the heat kernel W_ij = e^{-||x_i − x_j||^2 / σ_ij}, where x_i ∈ R^d and x_j ∈ R^d are the data points and σ_ij is a tuning parameter; σ_ij is determined using the self-tuning method described in (Zelnik-Manor & Perona, 2004). That is, for each data point i, σ_i is computed as the Euclidean distance of the ℓth furthest neighbor from i; then σ_ij is set to σ_i σ_j for every (i, j). In our experiments, we report the results for ℓ = 7 (a MATLAB sketch of this construction appears at the end of this section). To determine the quality of the clustering obtained by the approximate algorithm, and to determine the effect of p (the number of power iterations) on the clustering quality and running time, we varied p from 0 to 10.

In columns 2 and 3 of Table 2, we see the results for the exact algorithm, which serve as our baseline for the quality and performance of the approximate algorithm. In columns 4 and 5, we see the NMI with 2 power iterations (p = 2) along with the running time. Immediately, we see that even at p = 2, the Vehicle dataset achieves better accuracy than the exact algorithm while resulting in a 2.5x speedup. The only outlier is the Segment dataset, which achieves poor NMI at p = 2. In columns 6-8, we report the best NMI that was achieved by the approximate algorithm while staying under the time taken by the exact algorithm; we also report the value of p and the time taken (in seconds) for this result. We see that even when constrained to run in less time than the exact algorithm, the approximate algorithm bests the NMI achieved by the exact algorithm. For example, at p = 6, the approximate algorithm reports NMI = 0.2449 for Vehicle, as opposed to NMI = 0.1655 achieved by the exact algorithm. We can see that in many cases, 2 subspace iterations suffice to get good quality clustering (the Segment dataset is the exception).

Figure 3. Increase in running time (to compute Ỹ) normalized by the running time when p = 0. The baseline times (at p = 0) are SatImage (0.0648 secs), Segment (0.0261 secs), Vehicle (0.0027 secs), and Vowel (0.0012 secs).

Figure 3 depicts the relation between p and the running time (to compute Ỹ) of the approximate algorithm. All times are normalized by the time taken when p = 0, to enable reporting numbers for all datasets on the same scale. As expected, as p increases, we see a linear increase.
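The self-tuning construction of W used above can be sketched in MATLAB as follows (ℓ = 7 in our experiments). This is an illustrative snippet of our own, interpreting the ℓth neighbor as the ℓth in increasing order of distance, as in the self-tuning method of (Zelnik-Manor & Perona, 2004); it is not necessarily the exact experimental code.

    % Self-tuning similarity matrix: sigma_i = distance from x_i to its ell-th neighbor,
    % W_ij = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j)), W_ii = 0.
    function W = self_tuning_similarity(X, ell)
        n = size(X, 1);
        D = pdist2(X, X);                 % pairwise Euclidean distances
        Dsorted = sort(D, 2);             % each row sorted in increasing order
        sigma = Dsorted(:, ell + 1);      % skip column 1, the zero distance to self
        W = exp(-(D.^2) ./ (sigma * sigma'));
        W(1:n+1:end) = 0;
    end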

Acknowledgments

We would like to thank Yiannis Koutis and James Lee for useful discussions.

References

Baker, C.T.H. The Numerical Treatment of Integral Equations, volume 13. Clarendon Press, 1977.

Bandeira, Afonso S., Singer, Amit, and Spielman, Daniel A. A Cheeger inequality for the graph connection Laplacian. SIAM Journal on Matrix Analysis and Applications, 34(4):1611-1630, 2013.

Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, 2001.

Bezdek, J.C., Hathaway, R.J., Huband, J.M., Leckie, C., and Kotagiri, R. Approximate clustering in very large relational data. International Journal of Intelligent Systems, 21(8):817-841, 2006.

Boutsidis, Christos and Magdon-Ismail, Malik. Deterministic feature selection for k-means clustering. IEEE Transactions on Information Theory, 59(9):6099-6110, 2013.

Boutsidis, Christos and Magdon-Ismail, Malik. Faster SVD-truncated regularized least-squares. In ISIT, 2014.

Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cohen, Michael, Elder, Sam, Musco, Cameron, Musco, Christopher, and Persu, Madalina. Dimensionality reduction for k-means clustering and low rank approximation. arXiv preprint arXiv:1410.6801, 2014.

Fiedler, Miroslav. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2):298-305, 1973.

Fowlkes, C., Belongie, S., Chung, F., and Malik, J. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214-225, 2004.

Golub, Gene H. and Van Loan, Charles F. Matrix Computations, volume 3. JHU Press, 2012.

Halko, Nathan, Martinsson, Per-Gunnar, and Tropp, Joel A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

Kumar, Amit, Sabharwal, Yogish, and Sen, Sandeep. A simple linear time (1+ε)-approximation algorithm for geometric k-means clustering in any dimensions. In FOCS, 2004.

Lee, James R., Oveis Gharan, Shayan, and Trevisan, Luca. Multi-way spectral partitioning and higher-order Cheeger inequalities. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing, pp. 1117-1130. ACM, 2012.

Lin, Frank and Cohen, William W. Power iteration clustering. In ICML, 2010.

Liu, Rong and Zhang, Hao. Segmentation of 3D meshes through spectral clustering. In Proceedings of the 12th Pacific Conference on Computer Graphics and Applications (PG 2004), pp. 298-305. IEEE, 2004.

Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. Introduction to Information Retrieval. Cambridge University Press, 2008.

Ng, Andrew Y., Jordan, Michael I., Weiss, Yair, et al. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849-856, 2002.

Nyström, E. Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica, 54(1):185-204, 1930.

Ostrovsky, Rafail, Rabani, Yuval, Schulman, Leonard J., and Swamy, Chaitanya. The effectiveness of Lloyd-type methods for the k-means problem. In FOCS, 2006.

Pavan, M. and Pelillo, M. Efficient out-of-sample extension of dominant-set clusters. In NIPS, 2005.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000a.

Shi, Jianbo and Malik, Jitendra. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000b.

Smyth, S. and White, S. A spectral clustering approach to finding communities in graphs. In SDM, 2005.

Spielman, Daniel and Teng, Shang-Hua. Nearly-linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. arXiv preprint, http://arxiv.org/pdf/cs/0607105v4.pdf, 2009.

Von Luxburg, Ulrike. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

Wang, L., Leckie, C., Ramamohanarao, K., and Bezdek, J. Approximate spectral clustering. In Advances in Knowledge Discovery and Data Mining, pp. 134-146, 2009.

Weiss, Yair. Segmentation using eigenvectors: a unifying view. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pp. 975-982. IEEE, 1999.

Woodruff, David P. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.

Yan, D., Huang, L., and Jordan, M.I. Fast approximate spectral clustering. In KDD, 2009.

Zelnik-Manor, Lihi and Perona, Pietro. Self-tuning spectral clustering. In NIPS, 2004.