K-means Clustering via Principal Component Analysis
Chris Ding ([email protected])
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

Xiaofeng He ([email protected])
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.

Abstract

Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering method for performing unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. New lower bounds for the K-means objective function are derived, namely the total variance minus the eigenvalues of the data covariance matrix. These results indicate that unsupervised dimension reduction is closely related to unsupervised learning. Several implications are discussed. On dimension reduction, the result provides new insight into the observed effectiveness of PCA-based data reduction, beyond the conventional noise-reduction explanation that PCA, via singular value decomposition, provides the best low-dimensional linear approximation of the data. On learning, the result suggests effective techniques for K-means data clustering. DNA gene expression and Internet newsgroup data are analyzed to illustrate our results. Experiments indicate that the new bounds are within 0.5-1.5% of the optimal values.

1. Introduction

Data analysis methods are essential for analyzing the ever-growing massive quantity of high-dimensional data. On one end, cluster analysis (Duda et al., 2000; Hastie et al., 2001; Jain & Dubes, 1988) attempts to pass through data quickly to gain first-order knowledge by partitioning data points into disjoint groups such that data points belonging to the same cluster are similar while data points belonging to different clusters are dissimilar. One of the most popular and efficient clustering methods is the K-means method (Hartigan & Wang, 1979; Lloyd, 1957; MacQueen, 1967), which uses prototypes (centroids) to represent clusters by optimizing the squared error function. (A detailed account of K-means and related ISODATA methods is given in (Jain & Dubes, 1988); see also (Wallace, 1989).)

On the other end, high-dimensional data are often transformed into lower-dimensional data via principal component analysis (PCA) (Jolliffe, 2002) (or singular value decomposition), where coherent patterns can be detected more clearly. Such unsupervised dimension reduction is used in very broad areas such as meteorology, image processing, genomic analysis, and information retrieval. It is also common that PCA is used to project data to a lower-dimensional subspace and K-means is then applied in the subspace (Zha et al., 2001). In other cases, data are embedded in a low-dimensional space such as the eigenspace of the graph Laplacian, and K-means is then applied (Ng et al., 2001).

The main basis of PCA-based dimension reduction is that PCA picks up the dimensions with the largest variances. Mathematically, this is equivalent to finding the best low-rank approximation (in the L2 norm) of the data via the singular value decomposition (SVD) (Eckart & Young, 1936). However, this noise-reduction property alone is inadequate to explain the effectiveness of PCA.
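As a concrete reference for the project-then-cluster pipeline described above, the following minimal sketch (an illustration, not part of the paper) assumes data points stored as rows, uses scikit-learn's KMeans as a stand-in for any standard K-means implementation, and takes the array X, the subspace dimension p, and the cluster count K as arbitrary placeholder choices: it centers the data, projects onto the top principal directions via the SVD, and then clusters in the reduced subspace.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))   # placeholder data: n = 300 points, 50 dimensions (points as rows)
p, K = 5, 3                      # illustrative subspace dimension and number of clusters

# PCA via the SVD of the centered data: rows of Vt are the principal directions u_k.
Y = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Y, full_matrices=False)
Z = Y @ Vt[:p].T                 # coordinates of the points in the top-p principal subspace

# K-means applied in the reduced subspace.
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(Z)
print(np.bincount(labels))       # cluster sizes
```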
In this paper, we explore the connection between these two widely used methods. We prove that principal components are actually the continuous solution of the cluster membership indicators in the K-means clustering method, i.e., the PCA dimension reduction automatically performs data clustering according to the K-means objective function. This provides an important justification of PCA-based data reduction.

Our results also provide effective ways to solve the K-means clustering problem. The K-means method uses K prototypes, the centroids of clusters, to characterize the data. They are determined by minimizing the sum of squared errors,

$$ J_K = \sum_{k=1}^{K} \sum_{i \in C_k} (x_i - m_k)^2 $$

where $(x_1, \cdots, x_n) = X$ is the data matrix, $m_k = \sum_{i \in C_k} x_i / n_k$ is the centroid of cluster $C_k$, and $n_k$ is the number of points in $C_k$. The standard iterative solution to K-means suffers from a well-known problem: as the iteration proceeds, the solutions are trapped in local minima due to the greedy nature of the update algorithm (Bradley & Fayyad, 1998; Grim et al., 1998; Moore, 1998).

Some notation on PCA. $X$ represents the original data matrix; $Y = (y_1, \cdots, y_n)$, $y_i = x_i - \bar{x}$, represents the centered data matrix, where $\bar{x} = \sum_i x_i / n$. The covariance matrix (ignoring the factor $1/n$) is $\sum_i (x_i - \bar{x})(x_i - \bar{x})^T = YY^T$. Principal directions $u_k$ and principal components $v_k$ are eigenvectors satisfying:

$$ YY^T u_k = \lambda_k u_k, \quad Y^T Y v_k = \lambda_k v_k, \quad v_k = Y^T u_k / \lambda_k^{1/2}. \qquad (1) $$

These are the defining equations for the SVD of $Y$: $Y = \sum_k \lambda_k^{1/2} u_k v_k^T$ (Golub & Van Loan, 1996). Elements of $v_k$ are the projected values of the data points on the principal direction $u_k$.

2. 2-way clustering

Consider the K = 2 case first. Let

$$ d(C_k, C_\ell) \equiv \sum_{i \in C_k} \sum_{j \in C_\ell} (x_i - x_j)^2 $$

be the sum of squared distances between two clusters $C_k, C_\ell$. After some algebra we obtain

$$ J_K = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,j \in C_k} (x_i - x_j)^2 = n\bar{y}^2 - \frac{1}{2} J_D, \qquad (2) $$

and

$$ J_D = \frac{n_1 n_2}{n} \left[ \frac{2\, d(C_1, C_2)}{n_1 n_2} - \frac{d(C_1, C_1)}{n_1^2} - \frac{d(C_2, C_2)}{n_2^2} \right] \qquad (3) $$

where $\bar{y}^2 = \sum_i y_i^T y_i / n$ is a constant. Thus $\min(J_K)$ is equivalent to $\max(J_D)$. Furthermore, we can show

$$ \frac{d(C_1, C_2)}{n_1 n_2} = \frac{d(C_1, C_1)}{n_1^2} + \frac{d(C_2, C_2)}{n_2^2} + (m_1 - m_2)^2. \qquad (4) $$

Substituting Eq.(4) into Eq.(3), we see that $J_D$ is always positive. We summarize these results in

Theorem 2.1. For K = 2, minimization of the K-means cluster objective function $J_K$ is equivalent to maximization of the distance objective $J_D$, which is always positive.

Remarks. (1) In $J_D$, the first term represents the average between-cluster distances, which are maximized; this forces the resulting clusters to be as separated as possible. (2) The second and third terms represent the average within-cluster distances, which will be minimized; this forces the resulting clusters to be as compact or tight as possible. This is also evident from Eq.(2). (3) The factor $n_1 n_2$ encourages cluster balance. Since $J_D > 0$, $\max(J_D)$ implies maximization of $n_1 n_2$, which leads to $n_1 = n_2 = n/2$.

These remarks give some insight into K-means clustering. However, the primary importance of Theorem 2.1 is that $J_D$ leads to a solution via the principal component.

Theorem 2.2. For K-means clustering where K = 2, the continuous solution of the cluster indicator vector is the principal component $v_1$, i.e., clusters $C_1, C_2$ are given by

$$ C_1 = \{ i \mid v_1(i) \le 0 \}, \quad C_2 = \{ i \mid v_1(i) > 0 \}. \qquad (5) $$

The optimal value of the K-means objective satisfies the bounds

$$ n\bar{y}^2 - \lambda_1 < J_{K=2} < n\bar{y}^2. \qquad (6) $$
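Before turning to the proof, a small numerical check of Theorem 2.2 and of the bounds in Eq.(6) can be sketched as follows. It assumes data points stored as rows (so that $v_1$ appears as the leading left singular vector of the centered data) and uses synthetic two-blob data as a stand-in for a real data set; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: two loose Gaussian blobs, one point per row (the paper stores points as columns of X).
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 10)),
               rng.normal(+2.0, 1.0, size=(100, 10))])

Y = X - X.mean(axis=0)                         # centered data, y_i = x_i - x_bar
U, S, Vt = np.linalg.svd(Y, full_matrices=False)
v1 = U[:, 0]                                   # principal component v_1: one entry per data point
lam1 = S[0] ** 2                               # lambda_1, leading eigenvalue of Y Y^T (Eq. 1)

# Theorem 2.2: split the points by the sign of v_1.
C1, C2 = v1 <= 0, v1 > 0

def kmeans_objective(X, clusters):
    # J_K = sum over clusters of squared distances to the cluster centroid.
    return sum(((X[c] - X[c].mean(axis=0)) ** 2).sum() for c in clusters if c.any())

JK2 = kmeans_objective(X, [C1, C2])
ny2 = (Y ** 2).sum()                           # n * ybar^2, the total (unnormalized) variance

# Eq.(6): n*ybar^2 - lambda_1 < J_{K=2} < n*ybar^2; the PCA sign split also lies inside these bounds,
# since its objective upper-bounds the optimum and J_D > 0 for any two-way partition.
print(f"{ny2 - lam1:.2f} < {JK2:.2f} < {ny2:.2f}")
```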
Proof. Consider the squared distance matrix $D = (d_{ij})$, where $d_{ij} = \|x_i - x_j\|^2$. Let the cluster indicator vector be

$$ q(i) = \begin{cases} \sqrt{n_2 / (n n_1)} & \text{if } i \in C_1 \\ -\sqrt{n_1 / (n n_2)} & \text{if } i \in C_2 \end{cases} \qquad (7) $$

This indicator vector satisfies the sum-to-zero and normalization conditions: $\sum_i q(i) = 0$, $\sum_i q^2(i) = 1$. One can easily see that $q^T D q = -J_D$. If we relax the restriction that $q$ must take one of the two discrete values, and let $q$ take any values in $[-1, 1]$, the solution of the minimization of $J(q) = q^T D q / q^T q$ is given by the eigenvector corresponding to the lowest (largest negative) eigenvalue of the equation $Dz = \lambda z$.

A better relaxation of the discrete-valued indicator $q$ into a continuous solution is to use the centered distance matrix $\hat D$, i.e., to subtract the column and row means of $D$. Let $\hat D = (\hat d_{ij})$, where

$$ \hat d_{ij} = d_{ij} - d_{i\cdot}/n - d_{\cdot j}/n + d_{\cdot\cdot}/n^2 \qquad (8) $$

where $d_{i\cdot} = \sum_j d_{ij}$, $d_{\cdot j} = \sum_i d_{ij}$, $d_{\cdot\cdot} = \sum_{ij} d_{ij}$. Now we have $q^T D q = q^T \hat D q = -J_D$, since the 2nd, 3rd and 4th terms in Eq.(8) contribute zero in $q^T \hat D q$. Therefore the desired cluster indicator vector is the eigenvector corresponding to the lowest (largest negative) eigenvalue of

$$ \hat D z = \lambda z. $$

By construction, this centered distance matrix $\hat D$ has the nice property that each row (and column) sums to zero, $\sum_i \hat d_{ij} = 0, \ \forall j$. Thus $e = (1, \cdots, 1)^T$ is an eigenvector of $\hat D$ with eigenvalue $\lambda = 0$. Since all other eigenvectors of $\hat D$ are orthogonal to $e$, i.e., $z^T e = 0$, they have the sum-to-zero property, $\sum_i z(i) = 0$, a definitive property of the initial indicator vector $q$. In contrast, eigenvectors of $Dz = \lambda z$ do not have this property.

With some algebra, $d_{i\cdot} = n x_i^2 + n \overline{x^2} - 2 n x_i^T \bar{x}$, ...

Regularized relaxation

This general approach was first proposed in (Zha et al., 2001). Here we present a much expanded and consistent relaxation scheme and a connectivity analysis. First, with the help of Eq.(2), $J_K$ can be written as

$$ J_K = \sum_i x_i^T x_i - \sum_k \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j. \qquad (9) $$

The first term is a constant. The second term is the sum of the $K$ diagonal block elements of the $X^T X$ matrix, representing within-cluster (inner-product) similarities.

The solution of the clustering is represented by $K$ non-negative indicator vectors: $H_K = (h_1, \cdots, h_K)$, where ...
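Although the passage above breaks off before the indicator vectors $H_K$ are fully specified, the identity in Eq.(9) itself is easy to verify numerically. The sketch below is only an illustration: an arbitrary random partition stands in for an actual clustering, and the variable names are not from the paper; it compares the centroid form of $J_K$ with the Gram-matrix form of Eq.(9).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 6))              # rows are data points x_i
K = 3
labels = rng.integers(0, K, size=len(X))  # an arbitrary partition into K clusters, just to test the identity

# Centroid form: J_K = sum_k sum_{i in C_k} ||x_i - m_k||^2.
JK = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum() for k in range(K))

# Gram-matrix form of Eq.(9): sum_i x_i^T x_i - sum_k (1/n_k) sum_{i,j in C_k} x_i^T x_j.
G = X @ X.T                               # inner products x_i^T x_j
rhs = np.trace(G) - sum(G[np.ix_(labels == k, labels == k)].sum() / (labels == k).sum()
                        for k in range(K))

print(JK, rhs)                            # identical up to floating-point round-off
```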