Power Iteration Clustering

Power Iteration Clustering Frank Lin [email protected] William W. Cohen [email protected] Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213 USA Abstract malized similarity matrix. This embedding turns out to be very e®ective for clustering, and in comparison We present a simple and scalable graph clus- to spectral clustering, the cost (in space and time) of tering method called power iteration cluster- explicitly calculating eigenvectors is replaced by that ing (PIC). PIC ¯nds a very low-dimensional of a small number of matrix-vector multiplications. embedding of a dataset using truncated power iteration on a normalized pair-wise We test PIC on a number of di®erent types of datasets similarity matrix of the data. This em- and obtain comparable or better clusters than existing bedding turns out to be an e®ective cluster spectral methods. However, the highlights of PIC are indicator, consistently outperforming widely its simplicity and scalability | we demonstrate that used spectral methods such as NCut on real a basic implementation of this method is able to par- datasets. PIC is very fast on large datasets, tition a network dataset of 100 million edges within running over 1,000 times faster than an NCut a few seconds on a single machine, without sampling, implementation based on the state-of-the-art grouping, or other preprocessing of the data. IRAM eigenvector computation technique. This work is presented as follows: we begin by describ- ing power iteration and how its convergence property indicates cluster membership and how we can use it 1. Introduction to cluster data (Section 2). Then we show experimen- We present a simple and scalable clustering method tal results of PIC on a number of real and synthetic called power iteration clustering (hereafter PIC). In datasets and compare them to those of spectral cluster- essence, it ¯nds a very low-dimensional data embed- ing, both in cluster quality (Section 3) and scalability ding using truncated power iteration on a normalized (Section 4). Next, we survey related work (Section 5), pair-wise similarity matrix of the data points, and this di®erentiating PIC from clustering methods that em- embedding turns out to be an e®ective cluster indica- ploy matrix powering and from methods that modi¯es tor. the \traditional" spectral clustering to improve on ac- curacy or scalability. Finally, we conclude with why In presenting PIC, we make connections to and make we believe this simple and scalable clustering method comparisons with spectral clustering, a well-known, el- is very practical | easily implemented, parallelized, egant and e®ective clustering method. PIC and spec- and well-suited to very large datasets. tral clustering methods are similar in that both embed data points in a low-dimensional subspace derived 2. Power Iteration Clustering from the similarity matrix, and this embedding pro- vides clustering results directly or through a k-means 2.1. Notation and Background algorithm. They are di®erent in what this embedding is and how it is derived. In spectral clustering the Given a dataset X = fx1; x2; :::; xng, a similarity func- embedding is formed by the bottom eigenvectors of tion s(xi; xj) is a function where s(xi; xj) = s(xj; xi) the Laplacian of an similarity matrix. In PIC the em- and s ¸ 0 if i 6= j, and following previous work bedding is an approximation to a eigenvalue-weighted (Shi & Malik, 2000), s = 0 if i = j. An a±nity matrix n£n linear combination of all the eigenvectors of an nor- A 2 R is de¯ned by Aij = s(xi; xj). The de- gree matrixPD associated with A is a diagonal matrix th Appearing in Proceedings of the 27 International Confer- with dii = j Aij: A normalized a±nity matrix W is ence on Machine Learning, Haifa, Israel, 2010. Copyright de¯ned as D¡1A. Below we will view W interchange- 2010 by the author(s)/owner(s). ably as a matrix, and an undirected graph with nodes Power Iteration Clustering (a) 3Circles PIC result (b) t = 50, scale = 0:01708 (c) t = 400, scale = 0:01066 (d) t = 1000, scale = 0:00786 Figure 1. Clustering result and the embedding provided by vt for the 3Circles dataset. In (b) through (d), the value of each component of vt is plotted against its index. Plots (b) through (d) are re-scaled so the largest value is always at the very top and the minimum value at the very bottom, and scale is the maximum value minus the minimum value. X and the edge from xi to xj weighted by s(xi; xj). PI during the convergence process are extremely interesting. This is best illustrated by example. Figure 1(a) W is closely related to the normalized random-walk 2 shows a simple dataset|i.e., each x³i is a point´ in R Laplacian matrix L of Meil¸aand Shi (2001), de¯ned ¡jjx ¡x jj2 ¡1 i j as L = I ¡D A. L has a number of useful properties: space, with s(xi; xj) de¯ned as exp 2σ2 . Fig- most importantly to this paper, the second-smallest ures 1(b) to 1(d) shows vt at various values of t, each t eigenvector of L (the eigenvector with the second- illustrated by plotting v (i) for each xi. For purposes smallest eigenvalue) de¯nes a partition of the graph of visualization, the instances x in the \bulls-eye" are W that approximately maximizes the Normalized Cut ordered ¯rst, followed by instances in the central ring, criteria. More generally, the k smallest eigenvectors then by those in the outer ring. We have also re-scaled de¯ne a subspace where the clusters are often well- the plots to span the same vertical distance|the scale separated. Thus the second-smallest, third-smallest, is shown below each plot; as we can see the di®erences ..., kth smallest eigenvectors of L are often well-suited between the distinct values of the vt's become smaller for clustering the graph W into k components. as t increases. Qualitatively, PI ¯rst converges locally within a cluster: by t = 400 the points from each clus- Note that the k smallest eigenvectors of L are also the ter have approximately the same value in vt, leading to k largest eigenvectors of W . One simple method for three disjoint line segments in the visualization. Then, computing the largest eigenvector of a matrix is power after local convergence, the line segments draw closer iteration (PI), also called the power method. PI is an together more slowly. iterative method, which starts with an arbitrary vector v0 6= 0 and repeatedly performs the update 2.3. Further Analysis of PI's Convergence vt+1 = cW vt Let us assume that W has eigenvectors e1;:::; en where c is a normalizing constant to keep vt from get- with eigenvalues ¸1; : : : ; ¸n, where ¸1 = 1 and e1 t is constant. Given W , we de¯ne the spectral repre- ting too large (here c = 1=jjW v jj1). Unfortunately, PI does not seem to be particularly useful in this setting. sentation of a value a 2 f1; : : : ; ng to be the vector While the k smallest eigenvectors of L (equivalently, sa = he1(a);:::; ek(a)i, and de¯ne the spectral distance between a and b as the largest eigenvectors of W ) are in general interest- v u ing (Meil¸a& Shi, 2001), the very smallest eigenvector uXk t 2 of L (the largest of W ) is not|in fact, it is a constant spec(a; b) ´ jjsa ¡ sbjj2 = (ei(a) ¡ ei(b)) vector: since the sum of each row of W is 1, a constant i=2 vector transformed by W will never change in direc- tion or magnitude, and is hence a constant eigenvector Usually in spectral clustering it is assumed that the eigenvalues ¸2; : : : ; ¸k are larger than the remaining of W with eigenvalue ¸1 = 1. ones. We de¯ne W to have an (®; ¯)-eigengap between the kth and (k + 1)th eigenvector if ¸ =¸ ¸ ® and 2.2. Power Iteration Convergence k 2 ¸k+1=¸2 · ¯. We will also say that W is γe-bounded The central observation of this paper is that, while if 8i; a; b 2 f1; : : : ; ng, jei(a) ¡ ei(b)j · γe; note that t running PI to convergence on W does not lead to an every W is γe-bounded for some γe. Letting v be interesting result, the intermediate vectors obtained by the result of of the tth iteration of PI, we de¯ne the Power Iteration Clustering (t; v0)-distance between a and b as Note that the size of the radius is of no importance in clustering, since most clustering methods (e.g., k- t 0 t t pic (v ; a; b) ´ jv (a) ¡ v (b)j means) are based on the relative distance between points, not the absolute distance. Furthermore, if the For brevity, we will usually drop v0 from our notation c 's are not too large or too small, the distorting fac- (e.g., writing pict(a; b)). Our goal is to relate pict(a; b) i tors are dominated by the factors of (¸ =¸ )t, which and spec(a; b). Let us ¯rst de¯ne i 2 implies that the importance of the dimension associ- Xk ated with the i-th eigenvector is downweighted by (a t t signal (a; b) ´ [ei(a) ¡ ei(b)]ci¸i power of) its eigenvalue; in Section 3.2 we will show i=2 that experimentally, this weighting scheme often im- Xn proves performance for spectral methods.

Load more