
K-means Clustering via Principal Component Analysis

Chris Ding  [email protected]
Xiaofeng He  [email protected]
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

Abstract

Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering method for unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. Equivalently, we show that the subspace spanned by the cluster centroids is given by spectral expansion of the data covariance matrix truncated at K−1 terms. These results indicate that unsupervised dimension reduction is closely related to unsupervised learning. On dimension reduction, the result provides new insights into the observed effectiveness of PCA-based data reductions, beyond the conventional noise-reduction explanation. Mapping data points into a higher dimensional space via kernels, we show that the solution for Kernel K-means is given by Kernel PCA. On learning, our results suggest effective techniques for K-means clustering. DNA gene expression and Internet newsgroups are analyzed to illustrate the results. Experiments indicate that the newly derived lower bounds for the K-means objective are within 0.5-1.5% of the optimal values.

1. Introduction

Data analysis methods are essential for analyzing the ever-growing massive quantity of high dimensional data. On one end, cluster analysis (Duda et al., 2000; Hastie et al., 2001; Jain & Dubes, 1988) attempts to pass through data quickly to gain first order knowledge by partitioning data points into disjoint groups such that points belonging to the same cluster are similar while data points belonging to different clusters are dissimilar. One of the most popular and efficient clustering methods is the K-means method (Hartigan & Wang, 1979; Lloyd, 1957; MacQueen, 1967), which uses prototypes (centroids) to represent clusters by optimizing the squared error function. (A detailed account of K-means and related ISODATA methods is given in (Jain & Dubes, 1988); see also (Wallace, 1989).)

On the other end, high dimensional data are often transformed into lower dimensional data via principal component analysis (PCA) (Jolliffe, 2002) (or singular value decomposition), where coherent patterns can be detected more clearly. Such unsupervised dimension reduction is used in very broad areas such as meteorology, image processing, genomic analysis, and information retrieval. It is also common that PCA is used to project data to a lower dimensional subspace and K-means is then applied in the subspace (Zha et al., 2002). In other cases, data are embedded in a low-dimensional space such as the eigenspace of the graph Laplacian, and K-means is then applied (Ng et al., 2001).

The main basis of PCA-based dimension reduction is that PCA picks up the dimensions with the largest variances. Mathematically, this is equivalent to finding the best low rank approximation (in the L2 norm) of the data via the singular value decomposition (SVD) (Eckart & Young, 1936). However, this noise reduction property alone is inadequate to explain the effectiveness of PCA.

In this paper, we explore the connection between these two widely used methods. We prove that principal components are actually the continuous solution of the cluster membership indicators in the K-means clustering method, i.e., the PCA dimension reduction automatically performs data clustering according to the K-means objective function. This provides an important justification of PCA-based data reduction.

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.
Our results also provide effective ways to solve the K-means clustering problem. The K-means method uses K prototypes, the centroids of clusters, to characterize the data. They are determined by minimizing the sum of squared errors,

    J_K = Σ_{k=1}^{K} Σ_{i∈C_k} (x_i − m_k)^2,

where (x_1, ..., x_n) = X is the data matrix, m_k = Σ_{i∈C_k} x_i / n_k is the centroid of cluster C_k, and n_k is the number of points in C_k. The standard iterative solution to K-means suffers from a well-known problem: as the iteration proceeds, the solutions are trapped in local minima due to the greedy nature of the update algorithm (Bradley & Fayyad, 1998; Grim et al., 1998; Moore, 1998).

Some notation on PCA. X represents the original data matrix; Y = (y_1, ..., y_n), y_i = x_i − x̄, represents the centered data matrix, where x̄ = Σ_i x_i / n. The covariance matrix (ignoring the factor 1/n) is Σ_i (x_i − x̄)(x_i − x̄)^T = Y Y^T. Principal directions u_k and principal components v_k are eigenvectors satisfying

    Y Y^T u_k = λ_k u_k,   Y^T Y v_k = λ_k v_k,   v_k = Y^T u_k / λ_k^{1/2}.   (1)

These are the defining equations for the SVD of Y: Y = Σ_k λ_k^{1/2} u_k v_k^T (Golub & Van Loan, 1996). Elements of v_k are the projected values of the data points on the principal direction u_k.

Substituting Eq.(4) into Eq.(3), we see that J_D is always positive. We summarize these results in

Theorem 2.1. For K = 2, minimization of the K-means cluster objective function J_K is equivalent to maximization of the distance objective J_D, which is always positive.

Remarks. (1) In J_D, the first term represents the average between-cluster distances, which are maximized; this forces the resulting clusters to be as separated as possible. (2) The second and third terms represent the average within-cluster distances, which will be minimized; this forces the resulting clusters to be as compact or tight as possible. This is also evident from Eq.(2). (3) The factor n_1 n_2 encourages cluster balance. Since J_D > 0, max(J_D) implies maximization of n_1 n_2, which leads to n_1 = n_2 = n/2.

These remarks give some insight into K-means clustering. However, the primary importance of Theorem 2.1 is that J_D leads to a solution via the principal component.

Theorem 2.2. For K-means clustering where K = 2, the continuous solution of the cluster indicator vector is the principal component v_1, i.e., clusters C_1, C_2 are given by

    C_1 = { i | v_1(i) ≤ 0 },   C_2 = { i | v_1(i) > 0 }.   (5)

The optimal value of the K-means objective satisfies the bounds

    nȳ^2 − λ_1 < J_{K=2} < nȳ^2,

where nȳ^2 = Σ_i ||y_i||^2.
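As a concrete illustration of Theorem 2.2, the following NumPy sketch clusters data into K = 2 groups by thresholding the first principal component at zero and checks the bounds nȳ² − λ₁ < J < nȳ². It is a minimal sketch, not the authors' code; the synthetic two-Gaussian data set is an assumption made purely for illustration.

```python
import numpy as np

def kmeans_objective(X, labels):
    """Sum of squared errors J_K = sum_k sum_{i in C_k} ||x_i - m_k||^2."""
    J = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        J += ((Xk - Xk.mean(axis=0)) ** 2).sum()
    return J

# Synthetic data: two Gaussian clusters in 2-D (illustrative assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(-2, 0), scale=0.7, size=(60, 2)),
               rng.normal(loc=(+2, 0), scale=0.7, size=(60, 2))])

# Center the data; Y has one column per point, as in the paper's notation.
Y = (X - X.mean(axis=0)).T                      # shape (d, n)

# First principal component v1 = top right singular vector of Y.
_, s, Vt = np.linalg.svd(Y, full_matrices=False)
v1 = Vt[0]                                      # one value v1(i) per data point

# Theorem 2.2: split by the sign of v1 (Eq. 5).
labels_pca = (v1 > 0).astype(int)
J_split = kmeans_objective(X, labels_pca)

# Bounds: n*ybar^2 - lambda_1 < J_{K=2} < n*ybar^2.
total_ss = (Y ** 2).sum()                       # n * ybar^2 = sum_i ||y_i||^2
lam1 = s[0] ** 2                                # top eigenvalue of Y Y^T
print(f"{total_ss - lam1:.2f} < {J_split:.2f} < {total_ss:.2f}")
```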

Figure 1. (A) Two clusters in 2D space. (B) Principal component v_1(i), showing the value of each element i.

3. K-way Clustering

Above we focus on the K = 2 case using a single indicator vector. Here we generalize to K > 2, using K−1 indicator vectors. …

    t_n = (√(n_1/n), ..., √(n_K/n))^T.   (13)

Therefore we always have

    q_K = √(n_1/n) h_1 + ... + √(n_K/n) h_K = e/√n.

This linear transformation is always possible (see later). For example, when K = 2 we have

    T = (  √(n_2/n)   √(n_1/n)
          −√(n_1/n)   √(n_2/n) ),   (14)

and q_1 = √(n_2/n) h_1 − √(n_1/n) h_2, which is precisely the indicator vector of Eq.(7). This approach to K-way clustering is the generalization of the K = 2 clustering in §2.

The mutual orthogonality of the h_k, h_k^T h_ℓ = δ_kℓ (δ_kℓ = 1 if k = ℓ; 0 otherwise), implies the mutual orthogonality of the q_k,

    q_k^T q_ℓ = Σ_p Σ_s t_pk t_sℓ h_p^T h_s = (T^T T)_kℓ = δ_kℓ.

Let Q_{K−1} = (q_1, ..., q_{K−1}). The above orthogonality relations can be represented as

    Q_{K−1}^T Q_{K−1} = I_{K−1},   (15)

    q_k^T e = 0,  for k = 1, ..., K−1.   (16)

Now the K-means objective can be written as

    J_K = Tr(X^T X) − e^T X^T X e / n − Tr(Q_{K−1}^T X^T X Q_{K−1}).   (17)

Note that J_K does not distinguish the original data {x_i} and the centered data {y_i}. Repeating the above derivation on {y_i}, we have

    J_K = Tr(Y^T Y) − Tr(Q_{K−1}^T Y^T Y Q_{K−1}),   (18)

noting that Y e = 0 because the rows of Y are centered. The first term is constant, so optimization of J_K becomes

    max_{Q_{K−1}} Tr(Q_{K−1}^T Y^T Y Q_{K−1})   (19)

subject to the constraints Eqs.(15,16), with the additional constraint that the q_k are linear transformations of the h_k as in Eq.(12). If we relax (ignore) the last constraint, i.e., let the h_k take continuous values while still keeping the constraints Eqs.(15,16), the maximization problem can be solved in closed form, with the following results:

Theorem 3.1. When optimizing the K-means objective function, the continuous solutions for the transformed discrete cluster membership indicator vectors Q_{K−1} are the K−1 principal components: Q_{K−1} = (v_1, ..., v_{K−1}). J_K satisfies the upper and lower bounds

    nȳ^2 − Σ_{k=1}^{K−1} λ_k < J_K < nȳ^2.   (20)

Theorem 3.2. (Fan) Let A be a symmetric matrix with eigenvalues ζ_1 ≥ ... ≥ ζ_n and corresponding eigenvectors (v_1, ..., v_n). The maximization of Tr(Q^T A Q) subject to the constraints Q^T Q = I_K has the solution Q = (v_1, ..., v_K) R, where R is an arbitrary K × K orthonormal matrix, and max Tr(Q^T A Q) = ζ_1 + ... + ζ_K.

Eq.(11) was first noted in (Gordon & Henderson, 1977), in slightly different form, as a referee comment, and was promptly dismissed. It was independently rediscovered in (Zha et al., 2002), where a spectral relaxation technique is applied [to Eq.(11) instead of Eq.(18)], leading to the K principal eigenvectors of X^T X as the continuous solution. The present approach has three advantages:

(a) Direct relaxation on h_k in Eq.(11) is not as desirable as relaxation on q_k of Eq.(18). This is because the q_k satisfy the sum-to-zero property of the usual PCA components while the h_k do not. Entries of the discrete indicator vectors q_k have both positive and negative values, and are thus closer to the continuous solution. On the other hand, entries of the discrete indicator vectors h_k have only one sign, while all eigenvectors (except v_1) of X^T X have both positive and negative entries. In other words, the continuous solutions of h_k will differ significantly from their discrete form, while the continuous solutions of q_k will be much closer to their discrete form.

(b) The present approach is consistent with both the K > 2 case and the K = 2 case presented in §2 using a single indicator vector. The relaxation of Eq.(11) for K = 2 would require two eigenvectors; that is not consistent with the single indicator vector approach in §2.

(c) Relaxation in Eq.(11) uses the original data, X^T X, while the present approach uses the centered matrix Y^T Y. Using Y^T Y makes the orthogonality conditions Eqs.(15,16) consistent, since e is an eigenvector of Y^T Y. Also, Y^T Y is closely related to the covariance matrix Y Y^T, a central theme in PCA.
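The following NumPy sketch illustrates Theorem 3.1 and the bounds of Eq.(20) numerically: the eigenvalues λ_k are read off from the centered Gram matrix, and the lower bound nȳ² − Σ_{k=1}^{K−1} λ_k is compared against the K-means objective of an actual clustering. The three-cluster synthetic data and the use of scikit-learn's KMeans are assumptions made for illustration, not part of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
K = 3
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.8, size=(80, 2)) for c in centers])  # n x d, rows are points

Y = X - X.mean(axis=0)                 # centered data (rows are points here)
gram = Y @ Y.T                         # the paper's Y^T Y (n x n Gram matrix)
lam = np.sort(np.linalg.eigvalsh(gram))[::-1]   # principal eigenvalues lambda_k

total_ss = np.trace(gram)              # n * ybar^2 = sum_i ||y_i||^2
lower = total_ss - lam[:K - 1].sum()   # Eq.(20): n*ybar^2 - sum_{k<K} lambda_k

km = KMeans(n_clusters=K, n_init=20, random_state=0).fit(X)
J_K = km.inertia_                      # K-means sum of squared errors

print(f"lower bound {lower:.2f} < J_K {J_K:.2f} < total {total_ss:.2f}")
```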

Cluster Centroid Subspace Identification

Suppose we have found K clusters with centroids m_k. The between-cluster scatter matrix is S_b = …

Write the squared distance between two centered points as

    ||y_i − y_j||_d^2 = ||y_i^∥ − y_j^∥||_r^2 + ||y_i^⊥ − y_j^⊥||_s^2,

where y_i^∥ is the component in the r-dimensional cluster subspace and y_i^⊥ is the component in the s-dimensional irrelevant subspace (d = r + s). We wish to show that

    ||y_i^∥ − y_j^∥||_r / ||y_i − y_j||_d ≈ { 1,    if i ∈ C_k, j ∈ C_ℓ ≠ C_k;
                                              r/d,  if i, j ∈ C_k }.   (21)

If y_i, y_j are in different clusters, y_i − y_j runs from one cluster to another, or, it runs from one centroid to another. Thus it is nearly inside the cluster subspace. This proves the first equality in Eq.(21). If y_i, y_j are in the same cluster, we assume the data has a Gaussian distribution. With probability r/d, y_i − y_j points to a direction in the cluster subspace, which is retained after the PCA projection. With probability s/d, y_i − y_j points to a direction outside the cluster subspace, which collapses to zero, ||y_i^⊥ − y_j^⊥||^2 ≈ 0. This proves the second equality in Eq.(21).

Eq.(21) shows that in the cluster subspace, between-cluster distances remain constant, while within-cluster distances shrink: clusters become relatively more compact. The lower the cluster subspace dimension r is, the more compact the clusters become, and the more effective K-means clustering in the subspace is. When projecting to a subspace, the subspace representations could differ by an orthogonal transformation T.

… φ(·) and can be ignored. The clustering problem becomes the maximization of the objective function

    J_K^W = Σ_k (1/n_k) Σ_{i,j∈C_k} w_ij = Tr(H^T W H) = Tr(Q^T W Q),   (23)

where W = (w_ij) is the kernel matrix: w_ij = φ(x_i)^T φ(x_j).

The advantage of Kernel K-means is that it can describe data distributions more complicated than Gaussian distributions. The disadvantage of Kernel K-means is that it no longer has cluster centroids, because there are only pairwise kernels or similarities; thus the fast order(n) local refinement no longer applies.

PCA has been applied to the kernel matrix in (Schölkopf et al., 1998). Some advantages have been shown due to the nonlinear transformation. The equivalence between K-means clustering and PCA shown above can be extended to this setting. Note that in general a kernel matrix may not be centered, whereas the linear kernel in PCA is centered since the data are centered. We center the kernel by W ← PWP, P = I − ee^T/n. Because of Eq.(16), Q^T W Q = Q^T PWP Q, thus Eq.(23) remains unchanged. The centered kernel has the property that all eigenvectors q_k satisfy q_k^T e = 0. We assume all data and kernels are centered. Now, repeating the previous analysis, we can show that the solution to Kernel K-means is given by the Kernel PCA components:

Theorem 3.5. The continuous solutions for the discrete cluster membership indicator vectors in Kernel K-means clustering are the K−1 Kernel PCA components, and J_K^W(opt) has the following upper bound

    J_K^W(opt) < Σ_{k=1}^{K−1} ζ_k,   (24)

where ζ_k are the principal eigenvalues of the centered Kernel PCA matrix W.

Recovering K clusters

Once the K−1 principal components q_k are computed, how do we recover the non-negative cluster indicators h_k, and therefore the clusters themselves? Clearly, since Σ_i q_k(i) = 0, each principal component has many negative elements, so the q_k differ substantially from the non-negative cluster indicators h_k. Thus the key is to compute the orthonormal transformation T in Eq.(12).

A K×K orthonormal transformation is equivalent to a rotation in K-dimensional space; there are K^2 elements with K(K+1)/2 constraints (Goldstein, 1980); the remaining K(K−1)/2 degrees of freedom are most conveniently represented by Euler angles. For K = 2, a single rotation angle φ specifies the transformation; for K = 3, three Euler angles (φ, θ, ζ) determine the rotation, etc. In the K-means problem we require that the last column of T have the special form of Eq.(13); therefore, the true number of degrees of freedom is F_K = K(K−1)/2 − 1. For K = 2, F_K = 0 and the solution is fixed; it is given in Eq.(14). For K = 3, F_K = 2 and we need to search through a 2-D space to find the best solution, i.e., to find the T matrix that transforms the q_k into non-negative indicator vectors h_k.

Using Euler angles to specify the orthogonal rotation in a high dimensional space (K > 3) with the special constraint is often complicated. This problem can be solved via the following representation. Given arbitrary K(K−1)/2 positive numbers α_ij that sum to one, Σ_{1≤i<j≤K} α_ij = 1 with α_ij = α_ji, the number of degrees of freedom is K(K−1)/2 − 1, the same as in our problem. We form the following K×K matrix:

    Γ = Ω^{−1/2} Γ̄ Ω^{−1/2},   Ω = diag(n_1, ..., n_K),

where

    Γ̄_ij = −α_ij, i ≠ j;   Γ̄_ii = Σ_{j | j≠i} α_ij.   (25)

It can be shown that 2 Σ_ij x_i Γ̄_ij x_j = Σ_ij α_ij (x_i − x_j)^2 for any x = (x_1, ..., x_K)^T. Thus the symmetric matrix Γ is semi-positive definite. Γ has K real eigenvectors with non-negative eigenvalues: (z_1, ..., z_K) = Z, where Γ z_k = λ_k z_k. Clearly, t_n of Eq.(13) is an eigenvector of Γ with eigenvalue λ = 0. The other K−1 eigenvectors are mutually orthonormal, Z^T Z = I. Under general conditions, Z is non-singular and Z^{−1} = Z^T. Thus Z is the desired orthonormal transformation T in Eq.(12). Summarizing these results, we have

Theorem 3.3. The linear transformation T of Eq.(12) is formed by the K eigenvectors of Γ specified by Eq.(25).

This result indicates that K-means clustering is reduced to an optimization problem with K(K−1)/2 − 1 parameters.
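A small NumPy sketch of the parameterization behind the theorem above, as reconstructed here: given cluster sizes n_k and any symmetric positive weights α_ij summing to one, build Γ and check that t_n of Eq.(13) is a null eigenvector and that the eigenvector matrix Z is orthonormal. The specific cluster sizes and weights are arbitrary choices for illustration.

```python
import numpy as np

def build_gamma(sizes, alpha):
    """Gamma = Omega^{-1/2} Gbar Omega^{-1/2}, with Gbar the weighted Laplacian
    of the alpha weights and Omega = diag(n_1, ..., n_K)."""
    gbar = -alpha.copy()
    np.fill_diagonal(gbar, 0.0)
    np.fill_diagonal(gbar, alpha.sum(axis=1))      # Gbar_ii = sum_{j != i} alpha_ij
    d = 1.0 / np.sqrt(np.asarray(sizes, dtype=float))
    return d[:, None] * gbar * d[None, :]

K = 4
sizes = np.array([30, 20, 25, 15])
n = sizes.sum()

# Arbitrary symmetric weights alpha_ij > 0 with sum_{i<j} alpha_ij = 1.
rng = np.random.default_rng(2)
alpha = np.triu(rng.random((K, K)), 1)
alpha = alpha + alpha.T
alpha /= alpha.sum() / 2.0

gamma = build_gamma(sizes, alpha)
t_n = np.sqrt(sizes / n)                           # Eq.(13)

eigval, Z = np.linalg.eigh(gamma)
print("Gamma t_n ~ 0:      ", np.allclose(gamma @ t_n, 0.0))
print("eigenvalues >= 0:   ", np.all(eigval > -1e-10))
print("Z orthonormal:      ", np.allclose(Z.T @ Z, np.eye(K)))
```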
Connectivity Analysis

The above analysis gives the structure of the transformation T. But before knowing the clustering results, we cannot compute T and thus cannot compute the h_k either. Thus we need a method that bypasses T.

In Theorem 3.3, the cluster subspace spanned by the cluster centroids is given by the first K−1 principal directions, i.e., the low-dimensional spectral representation of the covariance matrix Y Y^T. We examine the low-dimensional spectral representation of the kernel (Gram) matrix, Y^T Y = Σ_{k=1}^{K−1} λ_k v_k v_k^T. For subspace projection, the coefficient λ_k is unimportant. Adding a constant matrix ee^T/n, we have

    C = ee^T/n + Σ_{k=1}^{K−1} v_k v_k^T.

Following the proof of Theorem 3.3, C can be written as

    C = Σ_{k=1}^{K} q_k q_k^T = Σ_{k=1}^{K} h_k h_k^T.

Σ_k h_k h_k^T has a clear diagonal block structure, which leads naturally to a connectivity interpretation: if c_ij > 0 then x_i, x_j are in the same cluster; we say they are connected. We further associate a probability with the connectivity between i and j, p_ij = c_ij / (c_ii c_jj)^{1/2}, which is 1 or 0 depending on whether x_i, x_j are connected or not. The diagonal block structure is the characteristic structure of clusters.

If the data has clear cluster structure, we expect C to have a similar diagonal block structure, plus some noise, due to the fact that the principal components are approximations of the discrete-valued indicators. For example, C could contain negative elements. Thus we set c_ij = 0 if c_ij < 0. Also, elements in C with small positive values indicate weak, possibly spurious, connectivities, which should be suppressed. We set

    c_ij = 0  if  p_ij < β,   (26)

where 0 < β < 1, and we chose β = 0.5.

Once C is computed, the block structure can be detected using spectral ordering (Ding & He, 2004). By computing the cluster crossing, the cluster overlap along the specified ordering, a 1-D curve exhibits the cluster structure. Clusters can be identified using linearized cluster assignment (Ding & He, 2004).
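The connectivity analysis can be sketched in a few lines of NumPy: form C from the leading K−1 principal components of the centered Gram matrix, normalize to p_ij, and threshold at β = 0.5 as in Eq.(26). The synthetic three-cluster data are an assumption for illustration; the spectral ordering and linearized cluster assignment steps of (Ding & He, 2004) are not reproduced here.

```python
import numpy as np

def connectivity_matrix(X, K, beta=0.5):
    """C = ee^T/n + sum_{k<K} v_k v_k^T, cleaned up as in Eq.(26)."""
    n = X.shape[0]
    Y = X - X.mean(axis=0)
    gram = Y @ Y.T                            # Gram matrix of centered data
    w, V = np.linalg.eigh(gram)               # ascending eigenvalues
    Vk = V[:, ::-1][:, :K - 1]                # top K-1 principal components v_k
    C = np.full((n, n), 1.0 / n) + Vk @ Vk.T
    C[C < 0] = 0.0                            # drop negative entries
    d = np.sqrt(np.clip(np.diag(C), 1e-12, None))
    P = C / np.outer(d, d)                    # p_ij = c_ij / (c_ii c_jj)^{1/2}
    C[P < beta] = 0.0                         # suppress weak connectivities
    return C, P

rng = np.random.default_rng(3)
centers = np.array([[0, 0], [6, 0], [0, 6]], dtype=float)
X = np.vstack([rng.normal(c, 0.8, size=(40, 2)) for c in centers])

C, P = connectivity_matrix(X, K=3)
print("mean p_ij within first block vs. across blocks:")
print(P[:40, :40].mean(), P[:40, 40:80].mean(), P[:40, 80:].mean())
```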

4. Experiments

Gene expressions

Figure 2. Gene expression profiles of human lymphoma (Alizadeh et al., 2000) in the first two principal components (v_1, v_2). Classes: C1: Diffuse Large B Cell Lymphoma (46); C2: germinal center B (2) (not used); C3: lymph node/tonsil (2) (not used); C4: Activated Blood B (10); C5: resting/activated T (6); C6: transformed cell lines (6); C7: Follicular lymphoma (9); C8: resting blood B (4) (not used); C9: Chronic lymphocytic leukaemia (11).

4029 gene expressions of 96 tissue samples on human lymphoma were obtained by Alizadeh et al. (Alizadeh et al., 2000). Using biological and clinical expertise, Alizadeh et al. classify the tissue samples into 9 classes, as shown in Figure 2. Because of the large number of classes and the highly uneven number of samples in each class (46, 2, 2, 10, 6, 6, 9, 4, 11), it is a relatively difficult clustering problem. To reduce dimension, 200 out of the 4029 genes are selected based on the F-statistic for this study. We focus on the 6 largest classes, with at least 6 tissue samples per class, to adequately represent each class; classes C2, C3, and C8 are ignored because the number of samples in these classes is too small (8 tissue samples total). Using PCA, we plot the samples in the first two principal components in Fig.2.

Following Theorem 3.1, the cluster structure is embedded in the first K−1 = 5 principal components. In this 5-dimensional eigenspace we perform K-means clustering. The clustering results are given in the following confusion matrix

    B = ( 36   ·   ·   ·   ·   ·
           2  10   ·   ·   ·   1
           1   ·   9   ·   ·   ·
           ·   ·   ·  11   ·   ·
           ·   ·   ·   ·   6   ·
           7   ·   ·   ·   ·   5 ),

where b_kℓ = the number of samples clustered into class k but actually belonging to class ℓ (by human expertise). The clustering accuracy is Q = Σ_k b_kk / N = 0.875, quite reasonable for this difficult problem.

To provide an understanding of this result, we perform the PCA connectivity analysis. The cluster connectivity matrix P is shown in Fig.3. Clearly, the five smaller classes have strong within-cluster connectivity; the largest class C1 has substantial connectivity to other classes (those in the off-diagonal elements of P). This explains why, in the clustering results (first column of B), C1 is split into several clusters. Also, one sample in C5 has large connectivity to C4 and is thus clustered into C4 (last column of B).

Figure 3. The connectivity matrix for lymphoma. The 6 classes are ordered as C1, C4, C7, C9, C6, C5.

Internet Newsgroups

We apply K-means clustering to Internet newsgroup articles. A 20-newsgroup dataset is from www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html. The word-document matrix is first constructed. 1000 words are selected according to the mutual information between words and documents in an unsupervised manner. Standard tf.idf term weighting is used. Each document is normalized to 1.

We focus on two sets of 2-newsgroup combinations and two sets of 5-newsgroup combinations. These four newsgroup combinations are listed below:

A2: NG1: alt.atheism; NG2: comp.graphics
B2: NG18: talk.politics.mideast; NG19: talk.politics.misc
A5: NG2: comp.graphics; NG9: rec.motorcycles; NG10: rec.sport.baseball; NG15: sci.space; NG18: talk.politics.mideast
B5: NG2: comp.graphics; NG3: comp.os.ms-windows; NG8: rec.autos; NG13: sci.electronics; NG19: talk.politics.misc

In A2 and A5, clusters overlap at a medium level. In B2 and B5, clusters overlap substantially.

To accumulate sufficient statistics, for each newsgroup combination we generate 10 datasets, each a random sample of documents from the newsgroups. The details are as follows. For A2 and B2, each cluster has 100 documents randomly sampled from each newsgroup. For A5 and B5, we let cluster sizes vary to resemble more realistic datasets. For the balanced case, we sample 100 documents from each newsgroup. For the unbalanced case, we select 200, 140, 120, 100, 60 documents from the different newsgroups. In this way, we generated a total of 60 datasets on which we perform cluster analysis.

We first assess the lower bounds derived in this paper. For each dataset, we did 20 runs of K-means clustering, each starting from a different random start (randomly selecting data points as initial cluster centroids). We pick the clustering result with the lowest K-means objective function value as the final clustering result. For each dataset, we also compute the principal eigenvalues of the kernel matrices X^T X and Y^T Y from the uncentered and centered data matrices (see §1).

Table 1 gives the K-means objective function values and the computed lower bounds. Rows starting with Km are the optimal J_K values for each data sample. Rows with P2 and P5 are lower bounds computed from Eq.(20). Rows with L2a and L2b are the lower bounds of the earlier work (Zha et al., 2002); L2a is for the original data and L2b is for centered data. The last column is the averaged percentage difference between the bound and the optimal value.

For datasets A2 and B2, the newly derived lower bounds (rows starting with P2) are consistently closer to the optimal K-means values than the previously derived bounds (rows starting with L2a or L2b). Across all 60 random samples, the newly derived lower bound (rows starting with P2 or P5) consistently gives a close lower bound on the K-means values (rows starting with Km). For the K = 2 cases, the lower bound is within about 0.6% of the optimal K-means values. As the number of clusters increases, the lower bound becomes less tight, but is still within 1.4% of the optimal values.

PCA-reduction and K-means

Next, we apply K-means clustering in the PCA subspace. Here we reduce the data from the original 1000 dimensions to 40, 20, 10, 6, 5 dimensions respectively. The clustering accuracy on 10 random samples of each newsgroup combination and size composition is averaged and the results are listed in Table 2. To see the subtle difference between centering the data or not, at 10, 6, and 5 dimensions the results for the original uncentered data are listed at left and the results for the centered data at right.

Table 2. Clustering accuracy as the PCA dimension is reduced from the original 1000.

Dim    A5-B        A5-U        B5-B        B5-U
5      0.81/0.91   0.88/0.86   0.59/0.70   0.64/0.62
6      0.91/0.90   0.87/0.86   0.67/0.72   0.64/0.62
10     0.90/0.90   0.89/0.88   0.74/0.75   0.67/0.71
20     0.89        0.90        0.74        0.72
40     0.86        0.91        0.63        0.68
1000   0.75        0.77        0.56        0.57

Two observations. (1) From Table 2, it is clear that as the dimension is reduced, the results systematically and significantly improve. For example, for dataset A5-balanced, the cluster accuracy improves from 75% at 1000 dimensions to 91% at 5 dimensions. (2) For a very small number of dimensions, PCA based on the centered data seems to lead to better results. All these are consistent with the previous theoretical analysis.

Discussions

The traditional data reduction perspective derives PCA as the best set of bilinear approximations (SVD of Y). The new results show that principal components are continuous (relaxed) solutions of the cluster membership indicators in K-means clustering (Theorems 2.2 and 3.1). These two views (derivations) of PCA are in fact consistent, since data clustering is also a form of data reduction. Standard data reduction (SVD) happens in Euclidean space, while clustering is a data reduction to classification space (data points in the same cluster are considered as belonging to the same class, while points in different clusters are considered as belonging to different classes). This is best explained by vector quantization, widely used in signal processing (Gersho & Gray, 1992), where the high dimensional space of signal feature vectors is divided into Voronoi cells via the K-means algorithm. Signal feature vectors are approximated by the cluster centroids, the code-vectors. That PCA plays crucial roles in both types of data reduction provides a unifying theme in this direction.

Table 1. K-means objective function values and theoretical bounds for 6 datasets.

Datasets: A2
Km   189.31  189.06  189.40  189.40  189.91  189.93  188.62  189.52  188.90  188.19  —
P2   188.30  188.14  188.57  188.56  189.10  188.89  187.85  188.54  187.91  187.25  0.48%
L2a  187.37  187.19  187.71  187.68  188.27  187.99  186.98  187.53  187.29  186.37  0.94%
L2b  185.09  184.88  185.63  185.33  186.25  185.44  185.00  185.56  184.75  184.02  2.13%

Datasets: B2
Km   185.20  187.68  187.31  186.47  187.08  186.12  187.12  187.36  185.51  185.50  —
P2   184.44  186.69  186.05  184.81  186.17  185.29  186.13  185.62  184.73  184.19  0.60%
L2a  183.22  185.51  184.97  183.67  185.02  184.19  184.88  184.50  183.55  183.08  1.22%
L2b  180.04  182.97  182.36  180.71  182.46  181.17  182.38  181.77  180.42  179.90  2.74%

Datasets: A5 Balanced
Km   459.68  462.18  461.32  463.50  461.71  462.70  460.11  463.24  463.83  463.54  —
P5   452.71  456.70  454.58  457.61  456.19  456.78  453.19  458.00  457.59  458.10  1.31%

Datasets: A5 Unbalanced
Km   575.21  575.89  576.56  578.29  576.10  579.12  579.77  574.57  576.28  573.41  —
P5   568.63  568.90  570.10  571.88  569.51  572.26  573.18  567.98  569.32  566.79  1.16%

Datasets: B5 Balanced
Km   464.86  464.00  466.21  463.15  463.58  464.70  464.45  465.57  466.04  463.91  —
P5   458.77  456.87  459.38  458.19  456.28  458.23  458.37  458.38  459.77  458.84  1.36%

Datasets: B5 Unbalanced
Km   580.14  581.11  580.76  582.32  578.62  581.22  582.63  578.93  578.27  578.30  —
P5   572.44  572.97  574.60  575.28  571.45  574.04  575.18  571.76  571.16  571.13  1.25%
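For readers who want to reproduce the flavor of the newsgroup experiment, the sketch below uses scikit-learn's bundled 20-newsgroups loader rather than the CMU copy cited above, tf-idf with a 1000-term vocabulary instead of the paper's mutual-information word selection, and a Hungarian matching for accuracy. It is an approximation of the protocol described above under those stated assumptions, not the authors' exact setup.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def clustering_accuracy(y_true, y_pred):
    """Best one-to-one matching between clusters and classes (Hungarian)."""
    K = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((K, K))
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    rows, cols = linear_sum_assignment(-count)
    return count[rows, cols].sum() / len(y_true)

# A5-style combination: five newsgroups, as in dataset A5.
cats = ["comp.graphics", "rec.motorcycles", "rec.sport.baseball",
        "sci.space", "talk.politics.mideast"]
data = fetch_20newsgroups(subset="train", categories=cats,
                          remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(max_features=1000, stop_words="english").fit_transform(data.data)
X = X.toarray()
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)  # unit-length documents

K = len(cats)
for dim in (None, 5):                      # full vocabulary vs. PCA-reduced
    Z = X if dim is None else PCA(n_components=dim).fit_transform(X)
    labels = KMeans(n_clusters=K, n_init=20, random_state=0).fit_predict(Z)
    acc = clustering_accuracy(np.asarray(data.target), labels)
    print(f"dim={dim or X.shape[1]}: accuracy={acc:.2f}")
```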

Acknowledgement

This work is supported by the U.S. Department of Energy, Office of Science, Office of Laboratory Policy and Infrastructure, through an LBNL LDRD, under contract DE-AC03-76SF00098.

References

Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwald, A., Boldrick, J., Sabet, H., Tran, T., Yu, X., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.

Bradley, P., & Fayyad, U. (1998). Refining initial points for K-means clustering. Proc. 15th International Conf. on Machine Learning.

Ding, C., & He, X. (2004). Linearized cluster assignment via spectral ordering. Proc. Int'l Conf. Machine Learning (ICML 2004).

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification, 2nd ed. Wiley.

Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1, 183–187.

Fan, K. (1949). On a theorem of Weyl concerning eigenvalues of linear transformations. Proc. Natl. Acad. Sci. USA, 35, 652–655.

Gersho, A., & Gray, R. (1992). Vector quantization and signal compression. Kluwer.

Goldstein, H. (1980). Classical mechanics, 2nd edition. Addison-Wesley.

Golub, G., & Van Loan, C. (1996). Matrix computations, 3rd edition. Johns Hopkins, Baltimore.

Gordon, A., & Henderson, J. (1977). An algorithm for Euclidean sum of squares classification. Biometrics, 355–362.

Grim, J., Novovicova, J., Pudil, P., Somol, P., & Ferri, F. (1998). Initialization of normal mixtures of densities. Proc. Int'l Conf. Pattern Recognition (ICPR 1998).

Hartigan, J., & Wang, M. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100–108.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). Elements of statistical learning. Springer Verlag.

Jain, A., & Dubes, R. (1988). Algorithms for clustering data. Prentice Hall.

Jolliffe, I. (2002). Principal component analysis, 2nd edition. Springer.

Lloyd, S. (1957). Least squares quantization in PCM. Bell Telephone Laboratories Paper, Murray Hill.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symposium, 281–297.

Moore, A. (1998). Very fast EM-based mixture model clustering using multiresolution kd-trees. Proc. Neural Info. Processing Systems (NIPS 1998).

Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Proc. Neural Info. Processing Systems (NIPS 2001).

Schölkopf, B., Smola, A., & Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.

Wallace, R. (1989). Finding natural clusters through entropy minimization. Ph.D. thesis, Carnegie Mellon University, CS Dept.

Zha, H., Ding, C., Gu, M., He, X., & Simon, H. (2002). Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems 14 (NIPS'01), 1057–1064.

Zhang, R., & Rudnicky, A. (2002). A large scale clustering scheme for kernel K-means. 16th Int'l Conf. Pattern Recognition (ICPR'02).