
K-means Clustering via Principal Component Analysis

Chris Ding  [email protected]
Xiaofeng He  [email protected]
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

Abstract

Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering method for unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. Equivalently, we show that the subspace spanned by the cluster centroids is given by spectral expansion of the data covariance matrix truncated at K−1 terms. These results indicate that unsupervised dimension reduction is closely related to unsupervised learning. On dimension reduction, the result provides new insights into the observed effectiveness of PCA-based data reductions, beyond the conventional noise-reduction explanation. Mapping data points into a higher dimensional space via kernels, we show that the solution for Kernel K-means is given by Kernel PCA. On learning, our results suggest effective techniques for K-means clustering. DNA gene expression and Internet newsgroups are analyzed to illustrate the results. Experiments indicate that the newly derived lower bounds for the K-means objective are within 0.5-1.5% of the optimal values.

1. Introduction

Data analysis methods are essential for analyzing the ever-growing massive quantity of high dimensional data. On one end, cluster analysis (Duda et al., 2000; Hastie et al., 2001; Jain & Dubes, 1988) attempts to pass through data quickly to gain first order knowledge by partitioning data points into disjoint groups such that points belonging to the same cluster are similar while data points belonging to different clusters are dissimilar. One of the most popular and efficient clustering methods is the K-means method (Hartigan & Wang, 1979; Lloyd, 1957; MacQueen, 1967), which uses prototypes (centroids) to represent clusters by optimizing the squared error function. (A detailed account of K-means and related ISODATA methods is given in (Jain & Dubes, 1988); see also (Wallace, 1989).)

On the other end, high dimensional data are often transformed into lower dimensional data via principal component analysis (PCA) (Jolliffe, 2002) (or singular value decomposition), where coherent patterns can be detected more clearly. Such unsupervised dimension reduction is used in very broad areas such as meteorology, image processing, genomic analysis, and information retrieval. It is also common that PCA is used to project data to a lower dimensional subspace and K-means is then applied in the subspace (Zha et al., 2002). In other cases, data are embedded in a low-dimensional space such as the eigenspace of the graph Laplacian, and K-means is then applied (Ng et al., 2001).

The main basis of PCA-based dimension reduction is that PCA picks up the dimensions with the largest variances. Mathematically, this is equivalent to finding the best low rank approximation (in the L2 norm) of the data via the singular value decomposition (SVD) (Eckart & Young, 1936). However, this noise reduction property alone is inadequate to explain the effectiveness of PCA.

In this paper, we explore the connection between these two widely used methods. We prove that principal components are actually the continuous solution of the cluster membership indicators in the K-means clustering method, i.e., the PCA dimension reduction automatically performs data clustering according to the K-means objective function. This provides an important justification of PCA-based data reduction.

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.
Our results also provide effective ways to solve the K-means clustering problem. The K-means method uses K prototypes, the centroids of clusters, to characterize the data. They are determined by minimizing the sum of squared errors,

    J_K = Σ_{k=1}^{K} Σ_{i∈C_k} (x_i − m_k)^2,

where (x_1, ..., x_n) = X is the data matrix, m_k = Σ_{i∈C_k} x_i / n_k is the centroid of cluster C_k, and n_k is the number of points in C_k. The standard iterative solution to K-means suffers from a well-known problem: as the iteration proceeds, the solutions are trapped in local minima due to the greedy nature of the update algorithm (Bradley & Fayyad, 1998; Grim et al., 1998; Moore, 1998).

Some notation on PCA. X represents the original data matrix; Y = (y_1, ..., y_n), y_i = x_i − x̄, represents the centered data matrix, where x̄ = Σ_i x_i / n. The covariance matrix (ignoring the factor 1/n) is Σ_i (x_i − x̄)(x_i − x̄)^T = Y Y^T. Principal directions u_k and principal components v_k are eigenvectors satisfying

    Y Y^T u_k = λ_k u_k,   Y^T Y v_k = λ_k v_k,   v_k = Y^T u_k / λ_k^{1/2}.   (1)

These are the defining equations for the SVD of Y: Y = Σ_k λ_k^{1/2} u_k v_k^T (Golub & Van Loan, 1996). Elements of v_k are the projected values of the data points on the principal direction u_k.

Substituting Eq.(4) into Eq.(3), we see that J_D is always positive. We summarize these results in

Theorem 2.1. For K = 2, minimization of the K-means cluster objective function J_K is equivalent to maximization of the distance objective J_D, which is always positive.

Remarks. (1) In J_D, the first term represents the average between-cluster distances, which are maximized; this forces the resulting clusters to be as separated as possible. (2) The second and third terms represent the average within-cluster distances, which will be minimized; this forces the resulting clusters to be as compact or tight as possible. This is also evident from Eq.(2). (3) The factor n_1 n_2 encourages cluster balance. Since J_D > 0, max(J_D) implies maximization of n_1 n_2, which leads to n_1 = n_2 = n/2.

These remarks give some insight into K-means clustering. However, the primary importance of Theorem 2.1 is that J_D leads to a solution via the principal component.

Theorem 2.2. For K-means clustering where K = 2, the continuous solution of the cluster indicator vector is the principal component v_1, i.e., clusters C_1, C_2 are given by

    C_1 = { i | v_1(i) ≤ 0 },   C_2 = { i | v_1(i) > 0 }.   (5)

The optimal value of the K-means objective satisfies the bounds

    nȳ^2 − λ_1 < J_{K=2} < nȳ^2,

where nȳ^2 = Σ_i ||y_i||^2.
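As a concrete illustration of Theorem 2.2, the following NumPy sketch clusters data into K = 2 groups by thresholding the first principal component at zero and checks the bounds nȳ² − λ₁ < J < nȳ². It is a minimal sketch, not the authors' code; the synthetic two-Gaussian data set is an assumption made purely for illustration.

```python
import numpy as np

def kmeans_objective(X, labels):
    """Sum of squared errors J_K = sum_k sum_{i in C_k} ||x_i - m_k||^2."""
    J = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        J += ((Xk - Xk.mean(axis=0)) ** 2).sum()
    return J

# Synthetic data: two Gaussian clusters in 2-D (illustrative assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(-2, 0), scale=0.7, size=(60, 2)),
               rng.normal(loc=(+2, 0), scale=0.7, size=(60, 2))])

# Center the data; Y has one column per point, as in the paper's notation.
Y = (X - X.mean(axis=0)).T                      # shape (d, n)

# First principal component v1 = top right singular vector of Y.
_, s, Vt = np.linalg.svd(Y, full_matrices=False)
v1 = Vt[0]                                      # one value v1(i) per data point

# Theorem 2.2: split by the sign of v1 (Eq. 5).
labels_pca = (v1 > 0).astype(int)
J_split = kmeans_objective(X, labels_pca)

# Bounds: n*ybar^2 - lambda_1 < J_{K=2} < n*ybar^2.
total_ss = (Y ** 2).sum()                       # n * ybar^2 = sum_i ||y_i||^2
lam1 = s[0] ** 2                                # top eigenvalue of Y Y^T
print(f"{total_ss - lam1:.2f} < {J_split:.2f} < {total_ss:.2f}")
```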

Figure 1. (A) Two clusters in 2D space. (B) Principal component v_1(i), showing the value of each element i.

3. K-way Clustering

Above we focus on the K = 2 case using a single indicator vector. Here we generalize to K > 2, using K−1 indicator vectors. …

    t_n = (√(n_1/n), ..., √(n_K/n))^T.   (13)

Therefore we always have

    q_K = √(n_1/n) h_1 + ... + √(n_K/n) h_K = e/√n.

This linear transformation is always possible (see later). For example, when K = 2 we have

    T = (  √(n_2/n)   √(n_1/n)
          −√(n_1/n)   √(n_2/n) ),   (14)

and q_1 = √(n_2/n) h_1 − √(n_1/n) h_2, which is precisely the indicator vector of Eq.(7). This approach to K-way clustering is the generalization of the K = 2 clustering in §2.

The mutual orthogonality of the h_k, h_k^T h_ℓ = δ_kℓ (δ_kℓ = 1 if k = ℓ; 0 otherwise), implies the mutual orthogonality of the q_k,

    q_k^T q_ℓ = Σ_p Σ_s t_pk t_sℓ h_p^T h_s = (T^T T)_kℓ = δ_kℓ.

Let Q_{K−1} = (q_1, ..., q_{K−1}). The above orthogonality relations can be represented as

    Q_{K−1}^T Q_{K−1} = I_{K−1},   (15)

    q_k^T e = 0,  for k = 1, ..., K−1.   (16)

Now the K-means objective can be written as

    J_K = Tr(X^T X) − e^T X^T X e / n − Tr(Q_{K−1}^T X^T X Q_{K−1}).   (17)

Note that J_K does not distinguish the original data {x_i} and the centered data {y_i}. Repeating the above derivation on {y_i}, we have

    J_K = Tr(Y^T Y) − Tr(Q_{K−1}^T Y^T Y Q_{K−1}),   (18)

noting that Y e = 0 because the rows of Y are centered. The first term is constant, so optimization of J_K becomes

    max_{Q_{K−1}} Tr(Q_{K−1}^T Y^T Y Q_{K−1})   (19)

subject to the constraints Eqs.(15,16), with the additional constraint that the q_k are linear transformations of the h_k as in Eq.(12). If we relax (ignore) the last constraint, i.e., let the h_k take continuous values while still keeping the constraints Eqs.(15,16), the maximization problem can be solved in closed form, with the following results:

Theorem 3.1. When optimizing the K-means objective function, the continuous solutions for the transformed discrete cluster membership indicator vectors Q_{K−1} are the K−1 principal components: Q_{K−1} = (v_1, ..., v_{K−1}). J_K satisfies the upper and lower bounds

    nȳ^2 − Σ_{k=1}^{K−1} λ_k < J_K < nȳ^2.   (20)

Theorem 3.2. (Fan) Let A be a symmetric matrix with eigenvalues ζ_1 ≥ ... ≥ ζ_n and corresponding eigenvectors (v_1, ..., v_n). The maximization of Tr(Q^T A Q) subject to the constraints Q^T Q = I_K has the solution Q = (v_1, ..., v_K) R, where R is an arbitrary K × K orthonormal matrix, and max Tr(Q^T A Q) = ζ_1 + ... + ζ_K.

Eq.(11) was first noted in (Gordon & Henderson, 1977), in slightly different form, as a referee comment, and was promptly dismissed. It was independently rediscovered in (Zha et al., 2002), where a spectral relaxation technique is applied [to Eq.(11) instead of Eq.(18)], leading to the K principal eigenvectors of X^T X as the continuous solution. The present approach has three advantages:

(a) Direct relaxation on h_k in Eq.(11) is not as desirable as relaxation on q_k of Eq.(18). This is because the q_k satisfy the sum-to-zero property of the usual PCA components while the h_k do not. Entries of the discrete indicator vectors q_k have both positive and negative values, and are thus closer to the continuous solution. On the other hand, entries of the discrete indicator vectors h_k have only one sign, while all eigenvectors (except v_1) of X^T X have both positive and negative entries. In other words, the continuous solutions of h_k will differ significantly from their discrete form, while the continuous solutions of q_k will be much closer to their discrete form.

(b) The present approach is consistent with both the K > 2 case and the K = 2 case presented in §2 using a single indicator vector. The relaxation of Eq.(11) for K = 2 would require two eigenvectors; that is not consistent with the single indicator vector approach in §2.

(c) Relaxation in Eq.(11) uses the original data, X^T X, while the present approach uses the centered matrix Y^T Y. Using Y^T Y makes the orthogonality conditions Eqs.(15,16) consistent, since e is an eigenvector of Y^T Y. Also, Y^T Y is closely related to the covariance matrix Y Y^T, a central theme in PCA.
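The following NumPy sketch illustrates Theorem 3.1 and the bounds of Eq.(20) numerically: the eigenvalues λ_k are read off from the centered Gram matrix, and the lower bound nȳ² − Σ_{k=1}^{K−1} λ_k is compared against the K-means objective of an actual clustering. The three-cluster synthetic data and the use of scikit-learn's KMeans are assumptions made for illustration, not part of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
K = 3
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.8, size=(80, 2)) for c in centers])  # n x d, rows are points

Y = X - X.mean(axis=0)                 # centered data (rows are points here)
gram = Y @ Y.T                         # the paper's Y^T Y (n x n Gram matrix)
lam = np.sort(np.linalg.eigvalsh(gram))[::-1]   # principal eigenvalues lambda_k

total_ss = np.trace(gram)              # n * ybar^2 = sum_i ||y_i||^2
lower = total_ss - lam[:K - 1].sum()   # Eq.(20): n*ybar^2 - sum_{k<K} lambda_k

km = KMeans(n_clusters=K, n_init=20, random_state=0).fit(X)
J_K = km.inertia_                      # K-means sum of squared errors

print(f"lower bound {lower:.2f} < J_K {J_K:.2f} < total {total_ss:.2f}")
```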

Cluster Centroid Subspace Identification

Suppose we have found K clusters with centroids m_k. The between-cluster scatter matrix is S_b = …

Write the squared distance between two centered points as

    ||y_i − y_j||_d^2 = ||y_i^∥ − y_j^∥||_r^2 + ||y_i^⊥ − y_j^⊥||_s^2,

where y_i^∥ is the component in the r-dimensional cluster subspace and y_i^⊥ is the component in the s-dimensional irrelevant subspace (d = r + s). We wish to show that

    ||y_i^∥ − y_j^∥||_r / ||y_i − y_j||_d ≈ { 1,    if i ∈ C_k, j ∈ C_ℓ ≠ C_k;
                                              r/d,  if i, j ∈ C_k }.   (21)

If y_i, y_j are in different clusters, y_i − y_j runs from one cluster to another, or, it runs from one centroid to another. Thus it is nearly inside the cluster subspace. This proves the first equality in Eq.(21). If y_i, y_j are in the same cluster, we assume the data has a Gaussian distribution. With probability r/d, y_i − y_j points to a direction in the cluster subspace, which is retained after the PCA projection. With probability s/d, y_i − y_j points to a direction outside the cluster subspace, which collapses to zero, ||y_i^⊥ − y_j^⊥||^2 ≈ 0. This proves the second equality in Eq.(21).

Eq.(21) shows that in the cluster subspace, between-cluster distances remain constant, while within-cluster distances shrink: clusters become relatively more compact. The lower the cluster subspace dimension r is, the more compact the clusters become, and the more effective K-means clustering in the subspace is. When projecting to a subspace, the subspace representations could differ by an orthogonal transformation T.

… φ(·) and can be ignored. The clustering problem becomes the maximization of the objective function

    J_K^W = Σ_k (1/n_k) Σ_{i,j∈C_k} w_ij = Tr(H^T W H) = Tr(Q^T W Q),   (23)

where W = (w_ij) is the kernel matrix: w_ij = φ(x_i)^T φ(x_j).

The advantage of Kernel K-means is that it can describe data distributions more complicated than Gaussian distributions. The disadvantage of Kernel K-means is that it no longer has cluster centroids, because there are only pairwise kernels or similarities; thus the fast order(n) local refinement no longer applies.

PCA has been applied to the kernel matrix in (Schölkopf et al., 1998). Some advantages have been shown due to the nonlinear transformation. The equivalence between K-means clustering and PCA shown above can be extended to this setting. Note that in general a kernel matrix may not be centered, whereas the linear kernel in PCA is centered since the data are centered. We center the kernel by W ← PWP, P = I − ee^T/n. Because of Eq.(16), Q^T W Q = Q^T PWP Q, thus Eq.(23) remains unchanged. The centered kernel has the property that all eigenvectors q_k satisfy q_k^T e = 0. We assume all data and kernels are centered. Now, repeating the previous analysis, we can show that the solution to Kernel K-means is given by the Kernel PCA components:

Theorem 3.5. The continuous solutions for the discrete cluster membership indicator vectors in Kernel K-means clustering are the K−1 Kernel PCA components, and J_K^W(opt) has the following upper bound

    J_K^W(opt) < Σ_{k=1}^{K−1} ζ_k,   (24)

where ζ_k are the principal eigenvalues of the centered Kernel PCA matrix W.

Recovering K clusters

Once the K−1 principal components q_k are computed, how do we recover the non-negative cluster indicators h_k, and therefore the clusters themselves? Clearly, since Σ_i q_k(i) = 0, each principal component has many negative elements, so the q_k differ substantially from the non-negative cluster indicators h_k. Thus the key is to compute the orthonormal transformation T in Eq.(12).

A K×K orthonormal transformation is equivalent to a rotation in K-dimensional space; there are K^2 elements with K(K+1)/2 constraints (Goldstein, 1980); the remaining K(K−1)/2 degrees of freedom are most conveniently represented by Euler angles. For K = 2, a single rotation angle φ specifies the transformation; for K = 3, three Euler angles (φ, θ, ζ) determine the rotation, etc. In the K-means problem we require that the last column of T have the special form of Eq.(13); therefore, the true number of degrees of freedom is F_K = K(K−1)/2 − 1. For K = 2, F_K = 0 and the solution is fixed; it is given in Eq.(14). For K = 3, F_K = 2 and we need to search through a 2-D space to find the best solution, i.e., to find the T matrix that transforms the q_k into non-negative indicator vectors h_k.

Using Euler angles to specify the orthogonal rotation in a high dimensional space (K > 3) with the special constraint is often complicated. This problem can be solved via the following representation. Given arbitrary K(K−1)/2 positive numbers α_ij that sum to one, Σ_{1≤i<j≤K} α_ij = 1 with α_ij = α_ji, the number of degrees of freedom is K(K−1)/2 − 1, the same as in our problem. We form the following K×K matrix:

    Γ = Ω^{−1/2} Γ̄ Ω^{−1/2},   Ω = diag(n_1, ..., n_K),

where

    Γ̄_ij = −α_ij, i ≠ j;   Γ̄_ii = Σ_{j | j≠i} α_ij.   (25)

It can be shown that 2 Σ_ij x_i Γ̄_ij x_j = Σ_ij α_ij (x_i − x_j)^2 for any x = (x_1, ..., x_K)^T. Thus the symmetric matrix Γ is semi-positive definite. Γ has K real eigenvectors with non-negative eigenvalues: (z_1, ..., z_K) = Z, where Γ z_k = λ_k z_k. Clearly, t_n of Eq.(13) is an eigenvector of Γ with eigenvalue λ = 0. The other K−1 eigenvectors are mutually orthonormal, Z^T Z = I. Under general conditions, Z is non-singular and Z^{−1} = Z^T. Thus Z is the desired orthonormal transformation T in Eq.(12). Summarizing these results, we have

Theorem 3.3. The linear transformation T of Eq.(12) is formed by the K eigenvectors of Γ specified by Eq.(25).

This result indicates that K-means clustering is reduced to an optimization problem with K(K−1)/2 − 1 parameters.
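A small NumPy sketch of the parameterization behind the theorem above, as reconstructed here: given cluster sizes n_k and any symmetric positive weights α_ij summing to one, build Γ and check that t_n of Eq.(13) is a null eigenvector and that the eigenvector matrix Z is orthonormal. The specific cluster sizes and weights are arbitrary choices for illustration.

```python
import numpy as np

def build_gamma(sizes, alpha):
    """Gamma = Omega^{-1/2} Gbar Omega^{-1/2}, with Gbar the weighted Laplacian
    of the alpha weights and Omega = diag(n_1, ..., n_K)."""
    gbar = -alpha.copy()
    np.fill_diagonal(gbar, 0.0)
    np.fill_diagonal(gbar, alpha.sum(axis=1))      # Gbar_ii = sum_{j != i} alpha_ij
    d = 1.0 / np.sqrt(np.asarray(sizes, dtype=float))
    return d[:, None] * gbar * d[None, :]

K = 4
sizes = np.array([30, 20, 25, 15])
n = sizes.sum()

# Arbitrary symmetric weights alpha_ij > 0 with sum_{i<j} alpha_ij = 1.
rng = np.random.default_rng(2)
alpha = np.triu(rng.random((K, K)), 1)
alpha = alpha + alpha.T
alpha /= alpha.sum() / 2.0

gamma = build_gamma(sizes, alpha)
t_n = np.sqrt(sizes / n)                           # Eq.(13)

eigval, Z = np.linalg.eigh(gamma)
print("Gamma t_n ~ 0:      ", np.allclose(gamma @ t_n, 0.0))
print("eigenvalues >= 0:   ", np.all(eigval > -1e-10))
print("Z orthonormal:      ", np.allclose(Z.T @ Z, np.eye(K)))
```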
Connectivity Analysis

The above analysis gives the structure of the transformation T. But before knowing the clustering results, we cannot compute T and thus cannot compute the h_k either. Thus we need a method that bypasses T.

In Theorem 3.3, the cluster subspace spanned by the cluster centroids is given by the first K−1 principal directions, i.e., the low-dimensional spectral representation of the covariance matrix Y Y^T. We examine the low-dimensional spectral representation of the kernel (Gram) matrix, Y^T Y = Σ_{k=1}^{K−1} λ_k v_k v_k^T. For subspace projection, the coefficient λ_k is unimportant. Adding a constant matrix ee^T/n, we have

    C = ee^T/n + Σ_{k=1}^{K−1} v_k v_k^T.

Following the proof of Theorem 3.3, C can be written as

    C = Σ_{k=1}^{K} q_k q_k^T = Σ_{k=1}^{K} h_k h_k^T.

Σ_k h_k h_k^T has a clear diagonal block structure, which leads naturally to a connectivity interpretation: if c_ij > 0 then x_i, x_j are in the same cluster; we say they are connected. We further associate a probability with the connectivity between i and j, p_ij = c_ij / (c_ii c_jj)^{1/2}, which is 1 or 0 depending on whether x_i, x_j are connected or not. The diagonal block structure is the characteristic structure of clusters.

If the data has clear cluster structure, we expect C to have a similar diagonal block structure, plus some noise, due to the fact that the principal components are approximations of the discrete-valued indicators. For example, C could contain negative elements. Thus we set c_ij = 0 if c_ij < 0. Also, elements in C with small positive values indicate weak, possibly spurious, connectivities, which should be suppressed. We set

    c_ij = 0  if  p_ij < β,   (26)

where 0 < β < 1, and we chose β = 0.5.

Once C is computed, the block structure can be detected using spectral ordering (Ding & He, 2004). By computing the cluster crossing, the cluster overlap along the specified ordering, a 1-D curve exhibits the cluster structure. Clusters can be identified using linearized cluster assignment (Ding & He, 2004).
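The connectivity analysis can be sketched in a few lines of NumPy: form C from the leading K−1 principal components of the centered Gram matrix, normalize to p_ij, and threshold at β = 0.5 as in Eq.(26). The synthetic three-cluster data are an assumption for illustration; the spectral ordering and linearized cluster assignment steps of (Ding & He, 2004) are not reproduced here.

```python
import numpy as np

def connectivity_matrix(X, K, beta=0.5):
    """C = ee^T/n + sum_{k<K} v_k v_k^T, cleaned up as in Eq.(26)."""
    n = X.shape[0]
    Y = X - X.mean(axis=0)
    gram = Y @ Y.T                            # Gram matrix of centered data
    w, V = np.linalg.eigh(gram)               # ascending eigenvalues
    Vk = V[:, ::-1][:, :K - 1]                # top K-1 principal components v_k
    C = np.full((n, n), 1.0 / n) + Vk @ Vk.T
    C[C < 0] = 0.0                            # drop negative entries
    d = np.sqrt(np.clip(np.diag(C), 1e-12, None))
    P = C / np.outer(d, d)                    # p_ij = c_ij / (c_ii c_jj)^{1/2}
    C[P < beta] = 0.0                         # suppress weak connectivities
    return C, P

rng = np.random.default_rng(3)
centers = np.array([[0, 0], [6, 0], [0, 6]], dtype=float)
X = np.vstack([rng.normal(c, 0.8, size=(40, 2)) for c in centers])

C, P = connectivity_matrix(X, K=3)
print("mean p_ij within first block vs. across blocks:")
print(P[:40, :40].mean(), P[:40, 40:80].mean(), P[:40, 80:].mean())
```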

4. Experiments

Gene expressions

Figure 2. Gene expression profiles of human lymphoma (Alizadeh et al., 2000) in the first two principal components (v_1, v_2). Classes: C1: Diffuse Large B Cell Lymphoma (46); C2: germinal center B (2) (not used); C3: lymph node/tonsil (2) (not used); C4: Activated Blood B (10); C5: resting/activated T (6); C6: transformed cell lines (6); C7: Follicular lymphoma (9); C8: resting blood B (4) (not used); C9: Chronic lymphocytic leukaemia (11).

4029 gene expressions of 96 tissue samples on human lymphoma were obtained by Alizadeh et al. (Alizadeh et al., 2000). Using biological and clinical expertise, Alizadeh et al. classify the tissue samples into 9 classes, as shown in Figure 2. Because of the large number of classes and the highly uneven number of samples in each class (46, 2, 2, 10, 6, 6, 9, 4, 11), it is a relatively difficult clustering problem. To reduce dimension, 200 out of the 4029 genes are selected based on the F-statistic for this study. We focus on the 6 largest classes, with at least 6 tissue samples per class, to adequately represent each class; classes C2, C3, and C8 are ignored because the number of samples in these classes is too small (8 tissue samples total). Using PCA, we plot the samples in the first two principal components in Fig.2.

Following Theorem 3.1, the cluster structure is embedded in the first K−1 = 5 principal components. In this 5-dimensional eigenspace we perform K-means clustering. The clustering results are given in the following confusion matrix

    B = ( 36   ·   ·   ·   ·   ·
           2  10   ·   ·   ·   1
           1   ·   9   ·   ·   ·
           ·   ·   ·  11   ·   ·
           ·   ·   ·   ·   6   ·
           7   ·   ·   ·   ·   5 ),

where b_kℓ = the number of samples clustered into class k but actually belonging to class ℓ (by human expertise). The clustering accuracy is Q = Σ_k b_kk / N = 0.875, quite reasonable for this difficult problem.

To provide an understanding of this result, we perform the PCA connectivity analysis. The cluster connectivity matrix P is shown in Fig.3. Clearly, the five smaller classes have strong within-cluster connectivity; the largest class C1 has substantial connectivity to other classes (those in the off-diagonal elements of P). This explains why, in the clustering results (first column of B), C1 is split into several clusters. Also, one sample in C5 has large connectivity to C4 and is thus clustered into C4 (last column of B).

Figure 3. The connectivity matrix for lymphoma. The 6 classes are ordered as C1, C4, C7, C9, C6, C5.

Internet Newsgroups

We apply K-means clustering to Internet newsgroup articles. A 20-newsgroup dataset is from www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html. The word-document matrix is first constructed. 1000 words are selected according to the mutual information between words and documents in an unsupervised manner. Standard tf.idf term weighting is used. Each document is normalized to 1.

We focus on two sets of 2-newsgroup combinations and two sets of 5-newsgroup combinations. These four newsgroup combinations are listed below:

A2: NG1: alt.atheism; NG2: comp.graphics
B2: NG18: talk.politics.mideast; NG19: talk.politics.misc
A5: NG2: comp.graphics; NG9: rec.motorcycles; NG10: rec.sport.baseball; NG15: sci.space; NG18: talk.politics.mideast
B5: NG2: comp.graphics; NG3: comp.os.ms-windows; NG8: rec.autos; NG13: sci.electronics; NG19: talk.politics.misc

In A2 and A5, clusters overlap at a medium level. In B2 and B5, clusters overlap substantially.

To accumulate sufficient statistics, for each newsgroup combination we generate 10 datasets, each a random sample of documents from the newsgroups. The details are as follows. For A2 and B2, each cluster has 100 documents randomly sampled from each newsgroup. For A5 and B5, we let cluster sizes vary to resemble more realistic datasets. For the balanced case, we sample 100 documents from each newsgroup. For the unbalanced case, we select 200, 140, 120, 100, 60 documents from the different newsgroups. In this way, we generated a total of 60 datasets on which we perform cluster analysis.

We first assess the lower bounds derived in this paper. For each dataset, we did 20 runs of K-means clustering, each starting from a different random start (randomly selecting data points as initial cluster centroids). We pick the clustering result with the lowest K-means objective function value as the final clustering result. For each dataset, we also compute the principal eigenvalues of the kernel matrices X^T X and Y^T Y from the uncentered and centered data matrices (see §1).

Table 1 gives the K-means objective function values and the computed lower bounds. Rows starting with Km are the optimal J_K values for each data sample. Rows with P2 and P5 are lower bounds computed from Eq.(20). Rows with L2a and L2b are the lower bounds of the earlier work (Zha et al., 2002); L2a is for the original data and L2b is for centered data. The last column is the averaged percentage difference between the bound and the optimal value.

For datasets A2 and B2, the newly derived lower bounds (rows starting with P2) are consistently closer to the optimal K-means values than the previously derived bounds (rows starting with L2a or L2b). Across all 60 random samples, the newly derived lower bound (rows starting with P2 or P5) consistently gives a close lower bound on the K-means values (rows starting with Km). For the K = 2 cases, the lower bound is within about 0.6% of the optimal K-means values. As the number of clusters increases, the lower bound becomes less tight, but is still within 1.4% of the optimal values.

PCA-reduction and K-means

Next, we apply K-means clustering in the PCA subspace. Here we reduce the data from the original 1000 dimensions to 40, 20, 10, 6, 5 dimensions respectively. The clustering accuracy on 10 random samples of each newsgroup combination and size composition is averaged and the results are listed in Table 2. To see the subtle difference between centering the data or not, at 10, 6, and 5 dimensions the results for the original uncentered data are listed at left and the results for the centered data at right.

Table 2. Clustering accuracy as the PCA dimension is reduced from the original 1000.

Dim    A5-B        A5-U        B5-B        B5-U
5      0.81/0.91   0.88/0.86   0.59/0.70   0.64/0.62
6      0.91/0.90   0.87/0.86   0.67/0.72   0.64/0.62
10     0.90/0.90   0.89/0.88   0.74/0.75   0.67/0.71
20     0.89        0.90        0.74        0.72
40     0.86        0.91        0.63        0.68
1000   0.75        0.77        0.56        0.57

Two observations. (1) From Table 2, it is clear that as the dimension is reduced, the results systematically and significantly improve. For example, for dataset A5-balanced, the cluster accuracy improves from 75% at 1000 dimensions to 91% at 5 dimensions. (2) For a very small number of dimensions, PCA based on the centered data seems to lead to better results. All these are consistent with the previous theoretical analysis.

Discussions

The traditional data reduction perspective derives PCA as the best set of bilinear approximations (SVD of Y). The new results show that principal components are continuous (relaxed) solutions of the cluster membership indicators in K-means clustering (Theorems 2.2 and 3.1). These two views (derivations) of PCA are in fact consistent, since data clustering is also a form of data reduction. Standard data reduction (SVD) happens in Euclidean space, while clustering is a data reduction to classification space (data points in the same cluster are considered as belonging to the same class, while points in different clusters are considered as belonging to different classes). This is best explained by vector quantization, widely used in signal processing (Gersho & Gray, 1992), where the high dimensional space of signal feature vectors is divided into Voronoi cells via the K-means algorithm. Signal feature vectors are approximated by the cluster centroids, the code-vectors. That PCA plays crucial roles in both types of data reduction provides a unifying theme in this direction.

Table 1. K-means objective function values and theoretical bounds for 6 datasets.

Datasets: A2
Km   189.31  189.06  189.40  189.40  189.91  189.93  188.62  189.52  188.90  188.19  —
P2   188.30  188.14  188.57  188.56  189.10  188.89  187.85  188.54  187.91  187.25  0.48%
L2a  187.37  187.19  187.71  187.68  188.27  187.99  186.98  187.53  187.29  186.37  0.94%
L2b  185.09  184.88  185.63  185.33  186.25  185.44  185.00  185.56  184.75  184.02  2.13%

Datasets: B2
Km   185.20  187.68  187.31  186.47  187.08  186.12  187.12  187.36  185.51  185.50  —
P2   184.44  186.69  186.05  184.81  186.17  185.29  186.13  185.62  184.73  184.19  0.60%
L2a  183.22  185.51  184.97  183.67  185.02  184.19  184.88  184.50  183.55  183.08  1.22%
L2b  180.04  182.97  182.36  180.71  182.46  181.17  182.38  181.77  180.42  179.90  2.74%

Datasets: A5 Balanced
Km   459.68  462.18  461.32  463.50  461.71  462.70  460.11  463.24  463.83  463.54  —
P5   452.71  456.70  454.58  457.61  456.19  456.78  453.19  458.00  457.59  458.10  1.31%

Datasets: A5 Unbalanced
Km   575.21  575.89  576.56  578.29  576.10  579.12  579.77  574.57  576.28  573.41  —
P5   568.63  568.90  570.10  571.88  569.51  572.26  573.18  567.98  569.32  566.79  1.16%

Datasets: B5 Balanced
Km   464.86  464.00  466.21  463.15  463.58  464.70  464.45  465.57  466.04  463.91  —
P5   458.77  456.87  459.38  458.19  456.28  458.23  458.37  458.38  459.77  458.84  1.36%

Datasets: B5 Unbalanced
Km   580.14  581.11  580.76  582.32  578.62  581.22  582.63  578.93  578.27  578.30  —
P5   572.44  572.97  574.60  575.28  571.45  574.04  575.18  571.76  571.16  571.13  1.25%
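For readers who want to reproduce the flavor of the newsgroup experiment, the sketch below uses scikit-learn's bundled 20-newsgroups loader rather than the CMU copy cited above, tf-idf with a 1000-term vocabulary instead of the paper's mutual-information word selection, and a Hungarian matching for accuracy. It is an approximation of the protocol described above under those stated assumptions, not the authors' exact setup.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def clustering_accuracy(y_true, y_pred):
    """Best one-to-one matching between clusters and classes (Hungarian)."""
    K = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((K, K))
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    rows, cols = linear_sum_assignment(-count)
    return count[rows, cols].sum() / len(y_true)

# A5-style combination: five newsgroups, as in dataset A5.
cats = ["comp.graphics", "rec.motorcycles", "rec.sport.baseball",
        "sci.space", "talk.politics.mideast"]
data = fetch_20newsgroups(subset="train", categories=cats,
                          remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(max_features=1000, stop_words="english").fit_transform(data.data)
X = X.toarray()
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)  # unit-length documents

K = len(cats)
for dim in (None, 5):                      # full vocabulary vs. PCA-reduced
    Z = X if dim is None else PCA(n_components=dim).fit_transform(X)
    labels = KMeans(n_clusters=K, n_init=20, random_state=0).fit_predict(Z)
    acc = clustering_accuracy(np.asarray(data.target), labels)
    print(f"dim={dim or X.shape[1]}: accuracy={acc:.2f}")
```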

Acknowledgement

This work is supported by the U.S. Department of Energy, Office of Science, Office of Laboratory Policy and Infrastructure, through an LBNL LDRD, under contract DE-AC03-76SF00098.

References

Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwald, A., Boldrick, J., Sabet, H., Tran, T., Yu, X., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.

Bradley, P., & Fayyad, U. (1998). Refining initial points for K-means clustering. Proc. 15th International Conf. on Machine Learning.

Ding, C., & He, X. (2004). Linearized cluster assignment via spectral ordering. Proc. Int'l Conf. Machine Learning (ICML 2004).

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification, 2nd ed. Wiley.

Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1, 183–187.

Fan, K. (1949). On a theorem of Weyl concerning eigenvalues of linear transformations. Proc. Natl. Acad. Sci. USA, 35, 652–655.

Gersho, A., & Gray, R. (1992). Vector quantization and signal compression. Kluwer.

Goldstein, H. (1980). Classical mechanics, 2nd edition. Addison-Wesley.

Golub, G., & Van Loan, C. (1996). Matrix computations, 3rd edition. Johns Hopkins, Baltimore.

Gordon, A., & Henderson, J. (1977). An algorithm for Euclidean sum of squares classification. Biometrics, 355–362.

Grim, J., Novovicova, J., Pudil, P., Somol, P., & Ferri, F. (1998). Initialization of normal mixtures of densities. Proc. Int'l Conf. Pattern Recognition (ICPR 1998).

Hartigan, J., & Wang, M. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100–108.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). Elements of statistical learning. Springer Verlag.

Jain, A., & Dubes, R. (1988). Algorithms for clustering data. Prentice Hall.

Jolliffe, I. (2002). Principal component analysis, 2nd edition. Springer.

Lloyd, S. (1957). Least squares quantization in PCM. Bell Telephone Laboratories Paper, Murray Hill.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symposium, 281–297.

Moore, A. (1998). Very fast EM-based mixture model clustering using multiresolution kd-trees. Proc. Neural Info. Processing Systems (NIPS 1998).

Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Proc. Neural Info. Processing Systems (NIPS 2001).

Schölkopf, B., Smola, A., & Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.

Wallace, R. (1989). Finding natural clusters through entropy minimization. Ph.D. thesis, Carnegie Mellon University, CS Dept.

Zha, H., Ding, C., Gu, M., He, X., & Simon, H. (2002). Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems 14 (NIPS'01), 1057–1064.

Zhang, R., & Rudnicky, A. (2002). A large scale clustering scheme for kernel K-means. 16th Int'l Conf. Pattern Recognition (ICPR'02).