
K-means Clustering via Principal Component Analysis

Chris Ding, [email protected]
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

Xiaofeng He, [email protected]
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

Abstract

Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering for performing unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. New lower bounds for the K-means objective function are derived, which is the total variance minus the eigenvalues of the data covariance matrix. These results indicate that unsupervised dimension reduction is closely related to unsupervised learning. Several implications are discussed. On dimension reduction, the result provides new insights into the observed effectiveness of PCA-based data reductions, beyond the conventional noise-reduction explanation that PCA, via singular value decomposition, provides the best low-dimensional linear approximation of the data. On learning, the result suggests effective techniques for K-means data clustering. DNA gene expression and Internet newsgroups are analyzed to illustrate our results. Empirical results indicate that the new bounds are within 0.5-1.5% of the optimal values.

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.

1. Introduction

Data analysis methods are essential for analyzing the ever-growing massive quantity of high dimensional data. On one end, cluster analysis (Duda et al., 2000; Hastie et al., 2001; Jain & Dubes, 1988) attempts to pass through data quickly to gain first order knowledge by partitioning data points into disjoint groups such that data points belonging to the same cluster are similar while data points belonging to different clusters are dissimilar. One of the most popular and efficient clustering methods is the K-means method (Hartigan & Wang, 1979; Lloyd, 1957; MacQueen, 1967), which uses prototypes (centroids) to represent clusters by optimizing the squared error function. (A detailed account of K-means and related ISODATA methods is given in (Jain & Dubes, 1988); see also (Wallace, 1989).)

On the other end, high dimensional data are often transformed into lower dimensional data via principal component analysis (PCA) (Jolliffe, 2002) (or singular value decomposition), where coherent patterns can be detected more clearly. Such unsupervised dimension reduction is used in very broad areas such as meteorology, image processing, genomic analysis, and information retrieval. It is also common that PCA is used to project data to a lower dimensional subspace and K-means is then applied in the subspace (Zha et al., 2001). In other cases, data are embedded in a low-dimensional space such as the eigenspace of the graph Laplacian, and K-means is then applied (Ng et al., 2001).

The main basis of PCA-based dimension reduction is that PCA picks up the dimensions with the largest variances. Mathematically, this is equivalent to finding the best low rank approximation (in the L2 norm) of the data via the singular value decomposition (SVD) (Eckart & Young, 1936). However, this noise reduction property alone is inadequate to explain the effectiveness of PCA.

In this paper, we explore the connection between these two widely used methods. We prove that principal components are actually the continuous solution of the cluster membership indicators in the K-means clustering method, i.e., the PCA dimension reduction automatically performs data clustering according to the K-means objective function. This provides an important justification of PCA-based data reduction.

Our results also provide effective ways to solve the K-means clustering problem. The K-means method uses K prototypes, the centroids of clusters, to characterize the data. They are determined by minimizing the sum of squared errors,

    J_K = \sum_{k=1}^{K} \sum_{i \in C_k} (x_i - m_k)^2,

where $(x_1, \cdots, x_n) = X$ is the data matrix, $m_k = \sum_{i \in C_k} x_i / n_k$ is the centroid of cluster $C_k$, and $n_k$ is the number of points in $C_k$. The standard iterative solution to K-means suffers from a well-known problem: as the iteration proceeds, the solutions get trapped in local minima due to the greedy nature of the update algorithm (Bradley & Fayyad, 1998; Grim et al., 1998; Moore, 1998).

Some notations on PCA. $X$ represents the original data matrix; $Y = (y_1, \cdots, y_n)$, $y_i = x_i - \bar{x}$, represents the centered data matrix, where $\bar{x} = \sum_i x_i / n$. The covariance matrix (ignoring the factor $1/n$) is $\sum_i (x_i - \bar{x})(x_i - \bar{x})^T = Y Y^T$. Principal directions $u_k$ and principal components $v_k$ are eigenvectors satisfying:

    Y Y^T u_k = \lambda_k u_k, \quad Y^T Y v_k = \lambda_k v_k, \quad v_k = Y^T u_k / \lambda_k^{1/2}.   (1)

These are the defining equations for the SVD of $Y$: $Y = \sum_k \lambda_k^{1/2} u_k v_k^T$ (Golub & Van Loan, 1996). Elements of $v_k$ are the projected values of the data points on the principal direction $u_k$.
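
As a quick numerical illustration (ours, not part of the original derivation), the relations in Eq.(1) can be checked directly from an SVD of the centered data; the toy data and variable names below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 3, 10                           # dimension, number of points
    X = rng.normal(size=(d, n))            # data matrix X = (x_1, ..., x_n), points as columns
    Y = X - X.mean(axis=1, keepdims=True)  # centered data, y_i = x_i - xbar

    # Principal directions u_k and principal components v_k from the SVD of Y
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    lam = s ** 2                           # eigenvalues of Y Y^T (equivalently of Y^T Y)

    u1, v1 = U[:, 0], Vt[0, :]
    assert np.allclose(Y @ Y.T @ u1, lam[0] * u1)        # Y Y^T u_k = lambda_k u_k
    assert np.allclose(Y.T @ Y @ v1, lam[0] * v1)        # Y^T Y v_k = lambda_k v_k
    assert np.allclose(v1, Y.T @ u1 / np.sqrt(lam[0]))   # v_k = Y^T u_k / sqrt(lambda_k)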

2. 2-way clustering

Consider the K = 2 case first. Let

    d(C_k, C_\ell) \equiv \sum_{i \in C_k} \sum_{j \in C_\ell} (x_i - x_j)^2

be the sum of squared distances between two clusters $C_k$, $C_\ell$. After some algebra we obtain

    J_K = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,j \in C_k} (x_i - x_j)^2 = n\overline{y^2} - \frac{1}{2} J_D,   (2)

and

    J_D = \frac{n_1 n_2}{n} \left[ \frac{2\, d(C_1, C_2)}{n_1 n_2} - \frac{d(C_1, C_1)}{n_1^2} - \frac{d(C_2, C_2)}{n_2^2} \right],   (3)

where $\overline{y^2} = \sum_i y_i^T y_i / n$ is a constant. Thus min($J_K$) is equivalent to max($J_D$). Furthermore, we can show that

    \frac{d(C_1, C_2)}{n_1 n_2} = \frac{d(C_1, C_1)}{2 n_1^2} + \frac{d(C_2, C_2)}{2 n_2^2} + (m_1 - m_2)^2.   (4)

Substituting Eq.(4) into Eq.(3), we see that $J_D$ is always positive. We summarize these results in

Theorem 2.1. For K = 2, minimization of the K-means cluster objective function $J_K$ is equivalent to maximization of the distance objective $J_D$, which is always positive.

Remarks. (1) In $J_D$, the first term represents the average between-cluster distance, which is maximized; this forces the resulting clusters to be as separated as possible. (2) The second and third terms represent the average within-cluster distances, which are minimized; this forces the resulting clusters to be as compact or tight as possible. This is also evident from Eq.(2). (3) The factor $n_1 n_2$ encourages cluster balance. Since $J_D > 0$, max($J_D$) implies maximization of $n_1 n_2$, which leads to $n_1 = n_2 = n/2$.

These remarks give some insights into K-means clustering. However, the primary importance of Theorem 2.1 is that $J_D$ leads to a solution via the principal component.

Theorem 2.2. For K-means clustering where K = 2, the continuous solution of the cluster indicator vector is the principal component $v_1$, i.e., clusters $C_1$, $C_2$ are given by

    C_1 = \{ i \mid v_1(i) \le 0 \}, \quad C_2 = \{ i \mid v_1(i) > 0 \}.   (5)

The optimal value of the K-means objective satisfies the bounds

    n\overline{y^2} - \lambda_1 < J_{K=2} < n\overline{y^2}.   (6)

Proof. Consider the squared distance matrix $D = (d_{ij})$, where $d_{ij} = \|x_i - x_j\|^2$. Let the cluster indicator vector be

    q(i) = \begin{cases} \sqrt{n_2 / (n n_1)} & \text{if } i \in C_1 \\ -\sqrt{n_1 / (n n_2)} & \text{if } i \in C_2 \end{cases}   (7)

This indicator vector satisfies the sum-to-zero and normalization conditions: $\sum_i q(i) = 0$, $\sum_i q^2(i) = 1$. One can easily see that $q^T D q = -J_D$. If we relax the restriction that $q$ must take one of the two discrete values, and let $q$ take any values in $[-1, 1]$, the solution of the minimization of $J(q) = q^T D q / q^T q$ is given by the eigenvector corresponding to the lowest (largest negative) eigenvalue of the equation $D z = \lambda z$.

A better relaxation of the discrete-valued indicator $q$ into a continuous solution is to use the centered distance matrix $D$, i.e., to subtract the column and row means of $D$. Let $\hat{D} = (\hat{d}_{ij})$, where

    \hat{d}_{ij} = d_{ij} - d_{i.}/n - d_{.j}/n + d_{..}/n^2,   (8)

where $d_{i.} = \sum_j d_{ij}$, $d_{.j} = \sum_i d_{ij}$, $d_{..} = \sum_{ij} d_{ij}$. Now we have $q^T \hat{D} q = q^T D q = -J_D$, since the 2nd, 3rd and 4th terms in Eq.(8) contribute zero in $q^T \hat{D} q$. Therefore the desired cluster indicator vector is the eigenvector corresponding to the lowest (largest negative) eigenvalue of $\hat{D} z = \lambda z$.

By construction, this centered distance matrix $\hat{D}$ has the nice property that each row (and column) sums to zero, $\sum_i \hat{d}_{ij} = 0, \forall j$. Thus $e = (1, \cdots, 1)^T$ is an eigenvector of $\hat{D}$ with eigenvalue $\lambda = 0$. Since all other eigenvectors of $\hat{D}$ are orthogonal to $e$, i.e., $z^T e = 0$, they have the sum-to-zero property, $\sum_i z(i) = 0$, a definitive property of the initial indicator vector $q$. In contrast, eigenvectors of $D z = \lambda z$ do not have this property.

With some algebra, $d_{i.} = n x_i^2 + n \overline{x^2} - 2 n x_i^T \bar{x}$ and $d_{..} = 2 n^2 \overline{y^2}$. Substituting into Eq.(8), we obtain

    \hat{d}_{ij} = -2 (x_i - \bar{x})^T (x_j - \bar{x}), \quad \text{or} \quad \hat{D} = -2 Y^T Y.

Therefore, the continuous solution for the cluster indicator vector is the eigenvector corresponding to the largest (positive) eigenvalue of the Gram matrix $Y^T Y$, which, by definition, is precisely the principal component $v_1$. Clearly, $J_D < 2\lambda_1$, where $\lambda_1$ is the principal eigenvalue of the covariance matrix. Through Eq.(2), we obtain the bounds on $J_K$. ∎
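
A small numerical check of Theorem 2.2 and the bound in Eq.(6) is easy to set up; this sketch is ours, and the synthetic two-cluster data and helper names are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    X = np.hstack([rng.normal(0.0, 1.0, size=(2, n // 2)),
                   rng.normal(4.0, 1.0, size=(2, n // 2))])   # columns are points
    Y = X - X.mean(axis=1, keepdims=True)

    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    v1, lam1 = Vt[0], s[0] ** 2            # principal component and its eigenvalue

    C1, C2 = np.where(v1 <= 0)[0], np.where(v1 > 0)[0]        # Eq.(5)

    def kmeans_objective(X, clusters):
        # J_K = sum_k sum_{i in C_k} ||x_i - m_k||^2
        return sum(((X[:, c] - X[:, c].mean(axis=1, keepdims=True)) ** 2).sum()
                   for c in clusters)

    ny2 = (Y ** 2).sum()                   # n * ybar^2, the total variance
    jk = kmeans_objective(X, [C1, C2])
    print(ny2 - lam1, jk, ny2)             # expect  ny2 - lam1 < jk < ny2
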
Figure 1 illustrates how the principal component can determine the cluster memberships in K-means clustering. Once $C_1$, $C_2$ are determined via the principal component according to Eq.(5), we can compute the current cluster means $m_k$ and iterate K-means until convergence. This will bring the cluster solution to the local optimum. We will call this PCA-guided K-means clustering.

Figure 1. (A) Two clusters in 2D space. (B) Principal component $v_1(i)$, showing the value of each element $i$.
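
A minimal sketch of the PCA-guided K-means idea for K = 2 follows; it is our illustration of the procedure described above, not the authors' code, and empty clusters are not handled.

    import numpy as np

    def pca_guided_kmeans_2way(X, n_iter=100):
        """X has points as columns. Returns labels in {0, 1} and the two centroids."""
        Y = X - X.mean(axis=1, keepdims=True)
        v1 = np.linalg.svd(Y, full_matrices=False)[2][0]   # principal component
        labels = (v1 > 0).astype(int)                      # initial split, Eq.(5)
        for _ in range(n_iter):
            m = np.stack([X[:, labels == k].mean(axis=1) for k in (0, 1)], axis=1)
            new = np.argmin(((X[:, :, None] - m[:, None, :]) ** 2).sum(axis=0), axis=1)
            if np.array_equal(new, labels):
                break                                      # reached a local optimum
            labels = new
        return labels, m

Seeding the iterations with the principal-component split typically lands the algorithm in a good local optimum, which is the point of the construction above.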

3. K-way Clustering

Above we focus on the K = 2 case using a single indicator vector. Here we generalize to K > 2, using K - 1 indicator vectors.

Regularized relaxation

This general approach was first proposed in (Zha et al., 2001). Here we present a much expanded and consistent relaxation scheme and a connectivity analysis. First, with the help of Eq.(2), $J_K$ can be written as

    J_K = \sum_i x_i^T x_i - \sum_k \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j.   (9)

The first term is a constant. The second term is the sum of the K diagonal block elements of the $X^T X$ matrix representing within-cluster (inner-product) similarities.

The solution of the clustering is represented by K non-negative indicator vectors: $H_K = (h_1, \cdots, h_K)$, where

    h_k = (0, \cdots, 0, \overbrace{1, \cdots, 1}^{n_k}, 0, \cdots, 0)^T / n_k^{1/2}.   (10)

(Without loss of generality, we index the data such that data points within each cluster are adjacent.) With this, Eq.(9) becomes

    J_K = \mathrm{Tr}(X^T X) - \mathrm{Tr}(H_K^T X^T X H_K),   (11)

where $\mathrm{Tr}(H_K^T X^T X H_K) = h_1^T X^T X h_1 + \cdots + h_K^T X^T X h_K$.
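
The identity in Eq.(11) can be verified numerically with the unsigned indicators of Eq.(10); the toy data and labels below are ours and arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    d, n, K = 4, 12, 3
    X = rng.normal(size=(d, n))                 # points as columns
    labels = np.repeat(np.arange(K), n // K)    # points within each cluster are adjacent

    # H_K = (h_1, ..., h_K), h_k = (0,..,0,1,..,1,0,..,0)^T / sqrt(n_k)   (Eq. 10)
    H = np.zeros((n, K))
    for k in range(K):
        idx = np.where(labels == k)[0]
        H[idx, k] = 1.0 / np.sqrt(len(idx))

    G = X.T @ X
    jk_direct = sum(((X[:, labels == k]
                      - X[:, labels == k].mean(axis=1, keepdims=True)) ** 2).sum()
                    for k in range(K))
    jk_trace = np.trace(G) - np.trace(H.T @ G @ H)           # Eq.(11)
    assert np.isclose(jk_direct, jk_trace)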

There are redundancies in $H_K$. For example, $\sum_{k=1}^{K} n_k^{1/2} h_k = e$. Thus one of the $h_k$'s is a linear combination of the others. We remove this redundancy by (a) performing a linear transformation T into $q_k$'s:

    Q_K = (q_1, \cdots, q_K) = H_K T, \quad \text{or} \quad q_\ell = \sum_k h_k t_{k\ell},   (12)

where $T = (t_{ij})$ is a $K \times K$ orthonormal matrix, $T^T T = I$, and (b) requiring that the last column of T is

    t_K = (\sqrt{n_1/n}, \cdots, \sqrt{n_K/n})^T.   (13)

Therefore we always have

    q_K = \sqrt{n_1/n}\, h_1 + \cdots + \sqrt{n_K/n}\, h_K = e / \sqrt{n}.

This linear transformation is always possible (see later). For example, when K = 2 we have

    T = \begin{pmatrix} \sqrt{n_2/n} & \sqrt{n_1/n} \\ -\sqrt{n_1/n} & \sqrt{n_2/n} \end{pmatrix},   (14)

and $q_1 = \sqrt{n_2/n}\, h_1 - \sqrt{n_1/n}\, h_2$, which is precisely the indicator vector of Eq.(7). This approach for K-way clustering is the generalization of the K = 2 clustering in §2.
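
For K = 2, the rotation of Eq.(14) can be checked directly: it sends $(h_1, h_2)$ to the indicator of Eq.(7) and to $e/\sqrt{n}$. This sketch is ours and the cluster sizes are arbitrary.

    import numpy as np

    n1, n2 = 3, 5
    n = n1 + n2
    h1 = np.r_[np.ones(n1), np.zeros(n2)] / np.sqrt(n1)       # Eq.(10)
    h2 = np.r_[np.zeros(n1), np.ones(n2)] / np.sqrt(n2)
    H = np.column_stack([h1, h2])

    T = np.array([[ np.sqrt(n2 / n), np.sqrt(n1 / n)],
                  [-np.sqrt(n1 / n), np.sqrt(n2 / n)]])        # Eq.(14)
    Q = H @ T

    q1_expected = np.r_[np.full(n1,  np.sqrt(n2 / (n * n1))),
                        np.full(n2, -np.sqrt(n1 / (n * n2)))]  # Eq.(7)
    assert np.allclose(T.T @ T, np.eye(2))                     # T is orthonormal
    assert np.allclose(Q[:, 0], q1_expected)
    assert np.allclose(Q[:, 1], np.ones(n) / np.sqrt(n))       # q_K = e / sqrt(n)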

The mutual orthogonality of the $h_k$, $h_k^T h_\ell = \delta_{k\ell}$ ($\delta_{k\ell} = 1$ if $k = \ell$; 0 otherwise), implies the mutual orthogonality of the $q_k$:

    q_k^T q_\ell = \sum_{p,s} t_{pk} t_{s\ell}\, h_p^T h_s = \sum_p t_{pk} t_{p\ell} = (T^T T)_{k\ell} = \delta_{k\ell}.

Let $Q_{K-1} = (q_1, \cdots, q_{K-1})$. The above orthogonality relations can be represented as

    Q_{K-1}^T Q_{K-1} = I_{K-1},   (15)

    q_k^T e = 0, \quad \text{for } k = 1, \cdots, K-1.   (16)

Now the K-means objective can be written as

    J_K = \mathrm{Tr}(X^T X) - e^T X^T X e / n - \mathrm{Tr}(Q_{K-1}^T X^T X Q_{K-1}).   (17)

Note that $J_K$ does not distinguish the original data $\{x_i\}$ and the centered data $\{y_i\}$. Repeating the above derivation on $\{y_i\}$, we have

    J_K = \mathrm{Tr}(Y^T Y) - \mathrm{Tr}(Q_{K-1}^T Y^T Y Q_{K-1}),   (18)

noting that $Y e = 0$ because the rows of Y are centered. The first term is constant. Optimization of $J_K$ becomes

    \max_{Q_{K-1}} \mathrm{Tr}(Q_{K-1}^T Y^T Y Q_{K-1}),   (19)

subject to the constraints of Eqs.(15,16), with the additional constraint that the $q_k$ are linear transformations of the $h_k$ as in Eq.(12). If we relax (ignore) the last constraint, i.e., let the $h_k$ take continuous values, while still keeping the constraints of Eqs.(15,16), the maximization problem can be solved in closed form, with the following results:

Theorem 3.1. When optimizing the K-means objective function, the continuous solutions for the transformed discrete cluster membership indicator vectors $Q_{K-1}$ are the K - 1 principal components: $Q_{K-1} = (v_1, \cdots, v_{K-1})$. $J_K$ satisfies the upper and lower bounds

    n\overline{y^2} - \sum_{k=1}^{K-1} \lambda_k < J_K < n\overline{y^2},   (20)

where the $\lambda_k$ are the principal eigenvalues of the covariance matrix defined in Eq.(1).

Theorem 3.1 is proved using the following well-known result.

Theorem 3.2. (Fan) Let A be a symmetric matrix with eigenvalues $\zeta_1 \ge \cdots \ge \zeta_n$ and corresponding eigenvectors $(v_1, \cdots, v_n)$. The maximization of $\mathrm{Tr}(Q^T A Q)$ subject to the constraints $Q^T Q = I_K$ has the solution $Q = (v_1, \cdots, v_K) R$, where R is an arbitrary $K \times K$ orthonormal matrix, and $\max \mathrm{Tr}(Q^T A Q) = \zeta_1 + \cdots + \zeta_K$.
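
Theorem 3.1's lower bound is easy to probe numerically; the sketch below is ours and uses scikit-learn's KMeans on toy Gaussian clusters, with points stored as rows rather than as columns as in the text.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    K, d, per = 3, 5, 60
    centers = rng.normal(scale=4.0, size=(K, d))
    pts = np.vstack([c + rng.normal(size=(per, d)) for c in centers])  # rows are points

    Y = pts - pts.mean(axis=0)                    # centered data
    lam = np.linalg.eigvalsh(Y.T @ Y)[::-1]       # eigenvalues, largest first
    ny2 = (Y ** 2).sum()                          # total variance n * ybar^2

    jk = KMeans(n_clusters=K, n_init=20, random_state=0).fit(pts).inertia_
    lower = ny2 - lam[:K - 1].sum()               # lower bound of Eq.(20)
    print(lower, jk, ny2)                         # expect  lower < jk < ny2

This is the same comparison reported for real text data in Table 1 below (rows P2/P5 versus rows Km).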

Eq.(11) was first noted in (Gordon & Henderson, 1977), in slightly different form, as a referee comment, and promptly dismissed. It was rediscovered in (Zha et al., 2001), where a spectral relaxation technique is applied [to Eq.(11) instead of Eq.(18)], leading to the K principal eigenvectors of $X^T X$ as the continuous solution. The present approach has three advantages:

(a) Direct relaxation on the $h_k$ in Eq.(11) is not as desirable as relaxation on the $q_k$ of Eq.(18). This is because the $q_k$ satisfy the sum-to-zero property of the usual PCA components while the $h_k$ do not. Entries of the discrete indicator vectors $q_k$ have both positive and negative values, and are thus closer to the continuous solution. On the other hand, entries of the discrete indicator vectors $h_k$ have only one sign, while all eigenvectors of $X^T X$ (except $v_1$) have both positive and negative entries. In other words, the continuous solutions of the $h_k$ will differ significantly from their discrete form, while the continuous solutions of the $q_k$ will be much closer to their discrete form.

(b) The present approach is consistent with both the K > 2 case and the K = 2 case presented in §2 using a single indicator vector. The relaxation of Eq.(11) for K = 2 would require two eigenvectors; that is not consistent with the single indicator vector approach of §2.

(c) Relaxation in Eq.(11) uses the original data, $X^T X$, while the present approach uses the centered matrix $Y^T Y$. Using $Y^T Y$ makes the orthogonality conditions Eqs.(15,16) consistent, since $e$ is an eigenvector of $Y^T Y$. Also, $Y^T Y$ is closely related to the covariance matrix $Y Y^T$, a central theme in PCA.

Recovering K clusters

Once the K - 1 principal components $q_k$ are computed, how do we recover the non-negative cluster indicators $h_k$?
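
One simple recovery route, which is also how the experiments below proceed, is to run K-means on the data projected onto the leading principal directions. A minimal sketch (ours), using scikit-learn:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    def cluster_via_pca(pts, K):
        """pts: (n_samples, n_features). K-means in the (K-1)-dimensional PCA subspace."""
        V = PCA(n_components=K - 1).fit_transform(pts)   # projected coordinates
        return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(V)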

DNA gene expression

Figure 2. Gene expression profiles of human lymphoma (Alizadeh et al., 2000) in the first two principal components.

Figure 3. The connectivity matrix for lymphoma. The 6 classes are ordered as C1, C4, C7, C9, C6, C5.

Following Theorem 3.1, the cluster structure is embedded in the first K - 1 = 5 principal components. In this 5-dimensional eigenspace we perform K-means clustering. The clustering results are given in the following confusion matrix:

    B = ( 36   ·   ·   ·   ·   ·
           2  10   ·   ·   ·   1
           1   ·   9   ·   ·   ·
           ·   ·   ·  11   ·   ·
           ·   ·   ·   ·   6   ·
           ·   ·   ·   ·   7   5 )

where $b_{k\ell}$ is the number of samples clustered into class k but actually belonging to class $\ell$ (by human expertise). The clustering accuracy is $Q = \sum_k b_{kk} / N = 0.875$, quite reasonable for this difficult problem. To provide an understanding of this result, we perform the PCA connectivity analysis. The cluster connectivity matrix P is shown in Fig.3. Clearly, the five smaller classes have strong within-cluster connectivity; the largest class C1 has substantial connectivity to the other classes (those in the off-diagonal elements of P). This explains why, in the clustering results (first column of B), C1 is split into several clusters. Also, one tissue sample in C5 has large connectivity to C4 and is thus clustered into C4 (last column of B).
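
For reference, the entries $b_{k\ell}$ and the accuracy $Q$ can be computed from predicted and true labels as below; the paper matches clusters to classes by inspection, whereas this sketch (ours) uses an optimal assignment.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def confusion_and_accuracy(pred, true, K):
        """pred, true: integer label arrays of length N with values in 0..K-1."""
        B = np.zeros((K, K), dtype=int)
        for p, t in zip(pred, true):
            B[p, t] += 1                        # b_kl: clustered into k, belongs to l
        rows, cols = linear_sum_assignment(-B)  # align clusters with classes
        Q = B[rows, cols].sum() / len(true)
        return B, Q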

Internet Newsgroups

We apply K-means clustering to Internet newsgroup articles. A 20-newsgroup dataset is from www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html. A word-document matrix is first constructed. 1000 words are selected according to the mutual information between words and documents in an unsupervised manner. Standard tf.idf term weighting is used. Each document is normalized to 1.

We focus on two sets of 2-newsgroup combinations and two sets of 5-newsgroup combinations. These four newsgroup combinations are listed below:

    A2: NG1:  alt.atheism                 B2: NG18: talk.politics.mideast
        NG2:  comp.graphics                   NG19: talk.politics.misc

    A5: NG2:  comp.graphics               B5: NG2:  comp.graphics
        NG9:  rec.motorcycles                 NG3:  comp.os.ms-windows
        NG10: rec.sport.baseball              NG8:  rec.autos
        NG15: sci.space                       NG13: sci.electronics
        NG18: talk.politics.mideast           NG19: talk.politics.misc

In A2 and A5, clusters overlap at a medium level. In B2 and B5, clusters overlap substantially.

To accumulate sufficient statistics, for each newsgroup combination we generate 10 datasets, each a random sample of documents from the newsgroups. The details are as follows. For A2 and B2, each cluster has 100 documents randomly sampled from each newsgroup. For A5 and B5, we let cluster sizes vary to resemble more realistic datasets. For the balanced case, we sample 100 documents from each newsgroup. For the unbalanced case, we select 200, 140, 120, 100, 60 documents from the different newsgroups. In this way, we generate a total of 60 datasets on which we perform cluster analysis.

We first assess the lower bounds derived in this paper. For each dataset, we did 20 runs of K-means clustering, each starting from a different random start (randomly selecting data points as initial cluster centroids). We pick the clustering result with the lowest K-means objective function value as the final clustering result. For each dataset, we also compute the principal eigenvalues of the matrices $X^T X$ and $Y^T Y$ from the uncentered and centered data matrices (see §1).

Table 1. K-means objective function values and theoretical bounds for 6 datasets.

    Datasets: A2
    Km   189.31 189.06 189.40 189.40 189.91 189.93 188.62 189.52 188.90 188.19   —
    P2   188.30 188.14 188.57 188.56 189.10 188.89 187.85 188.54 187.91 187.25   0.48%
    L2a  187.37 187.19 187.71 187.68 188.27 187.99 186.98 187.53 187.29 186.37   0.94%
    L2b  185.09 184.88 185.63 185.33 186.25 185.44 185.00 185.56 184.75 184.02   2.13%
    Datasets: B2
    Km   185.20 187.68 187.31 186.47 187.08 186.12 187.12 187.36 185.51 185.50   —
    P2   184.44 186.69 186.05 184.81 186.17 185.29 186.13 185.62 184.73 184.19   0.60%
    L2a  183.22 185.51 184.97 183.67 185.02 184.19 184.88 184.50 183.55 183.08   1.22%
    L2b  180.04 182.97 182.36 180.71 182.46 181.17 182.38 181.77 180.42 179.90   2.74%
    Datasets: A5 Balanced
    Km   459.68 462.18 461.32 463.50 461.71 462.70 460.11 463.24 463.83 463.54   —
    P5   452.71 456.70 454.58 457.61 456.19 456.78 453.19 458.00 457.59 458.10   1.31%
    Datasets: A5 Unbalanced
    Km   575.21 575.89 576.56 578.29 576.10 579.12 579.77 574.57 576.28 573.41   —
    P5   568.63 568.90 570.10 571.88 569.51 572.26 573.18 567.98 569.32 566.79   1.16%
    Datasets: B5 Balanced
    Km   464.86 464.00 466.21 463.15 463.58 464.70 464.45 465.57 466.04 463.91   —
    P5   458.77 456.87 459.38 458.19 456.28 458.23 458.37 458.38 459.77 458.84   1.36%
    Datasets: B5 Unbalanced
    Km   580.14 581.11 580.76 582.32 578.62 581.22 582.63 578.93 578.27 578.30   —
    P5   572.44 572.97 574.60 575.28 571.45 574.04 575.18 571.76 571.16 571.13   1.25%

Table 1 gives the K-means objective function values and the computed lower bounds. Rows starting with Km are the optimal $J_K$ values for each data sample. Rows with P2 and P5 are lower bounds computed from Eq.(20). Rows with L2a and L2b are the lower bounds of the earlier work (Zha et al., 2001); L2a is for the original data and L2b is for the centered data. The last column is the average percentage difference between the bound and the optimal value.

For datasets A2 and B2, the newly derived lower bounds (rows starting with P2) are consistently closer to the optimal K-means values than the previously derived bounds (rows starting with L2a or L2b). Across all 60 random samples, the newly derived lower bounds (rows starting with P2 or P5) are consistently close to the optimal K-means values (rows starting with Km). For the K = 2 cases, the lower bound is within about 0.6% of the optimal K-means values. As the number of clusters increases, the lower bound becomes less tight, but is still within 1.4% of the optimal values.

PCA-reduction and K-means

Next, we apply K-means clustering in the PCA subspace. Here we reduce the data from the original 1000 dimensions to 40, 20, 10, 6, 5 dimensions, respectively. The clustering accuracies on the 10 random samples of each newsgroup combination and size composition are averaged and the results are listed in Table 2. To see the subtle difference between centering the data or not at 10, 6, 5 dimensions, results for the original uncentered data are listed at left and results for the centered data are listed at right.

Table 2. Clustering accuracy as the PCA dimension is reduced from the original 1000.

    Dim    A5-B        A5-U        B5-B        B5-U
    5      0.81/0.91   0.88/0.86   0.59/0.70   0.64/0.62
    6      0.91/0.90   0.87/0.86   0.67/0.72   0.64/0.62
    10     0.90/0.90   0.89/0.88   0.74/0.75   0.67/0.71
    20     0.89        0.90        0.74        0.72
    40     0.86        0.91        0.63        0.68
    1000   0.75        0.77        0.56        0.57

Two observations. (1) From Table 2, it is clear that as the dimension is reduced, the results systematically and significantly improve. For example, for the A5-balanced datasets, the cluster accuracy improves from 75% at 1000 dimensions to 91% at 5 dimensions. (2) For a very small number of dimensions, PCA based on the centered data seems to lead to better results.

These observations indicate that PCA dimension reduction is particularly beneficial for K-means clustering. From Theorem 3.1, the eigenspace $V_K$ contains the relaxed solutions of the transformed indicators $Q_K$, i.e., K-means clustering in the eigenspace $V_K$ is approximately equivalent to clustering in the transformed indicator space $Q_K$. Because K-means clustering is invariant w.r.t. the orthogonal transformation T (see Eq.12), K-means clustering in the $Q_K$ space is equivalent to K-means clustering in the unsigned indicator space $H_K$. In $H_K$ space, all data points belonging to a cluster collapse into a single point, i.e., clusters are well separated. Hence clustering in $H_K$ space is particularly effective; our results provide a theoretical basis for the use of PCA dimension reduction for K-means clustering.
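
A sketch of the Table 2 protocol (ours): project the documents onto a few leading directions, with and without centering, run K-means, and score the result. The tf.idf matrix `docs` (documents as rows) and the label vector are assumed to be given; the accuracy helper mirrors the confusion-matrix sketch above.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.cluster import KMeans

    def accuracy(pred, true, K):
        B = np.zeros((K, K), dtype=int)
        np.add.at(B, (pred, true), 1)           # confusion counts b_kl
        r, c = linear_sum_assignment(-B)        # best cluster-to-class matching
        return B[r, c].sum() / len(true)

    def kmeans_in_subspace(docs, labels, K, dims=(5, 6, 10, 20, 40), center=True):
        M = docs - docs.mean(axis=0) if center else docs
        Vt = np.linalg.svd(M, full_matrices=False)[2]          # right singular vectors
        for p in dims:
            proj = M @ Vt[:p].T                                # p-dimensional projection
            pred = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(proj)
            print(p, "centered" if center else "uncentered", accuracy(pred, labels, K))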

Discussions

The traditional data reduction perspective derives PCA as the best set of bilinear approximations (the SVD of Y). The new results show that principal components are the continuous (relaxed) solutions of the cluster membership indicators in K-means clustering (Theorems 2.2 and 3.1). These two views (derivations) of PCA are in fact consistent, since data clustering is also a form of data reduction. Standard data reduction (SVD) happens in Euclidean space, while clustering is a data reduction to classification space (data points in the same cluster are considered as belonging to the same class, while points in different clusters are considered as belonging to different classes). This is best explained by vector quantization, widely used in signal processing (Gersho & Gray, 1992), where the high dimensional space of signal feature vectors is divided into Voronoi cells via the K-means algorithm. Signal feature vectors are approximated by the cluster centroids, the code-vectors. That PCA plays crucial roles in both types of data reduction provides a unifying theme in this direction.

Acknowledgement

We thank a referee for pointing out the reference (Gordon & Henderson, 1977). This work is supported by the U.S. Department of Energy, Office of Science, Office of Laboratory Policy and Infrastructure, through an LBNL LDRD, under contract DE-AC03-76SF00098.

References

Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwald, A., Boldrick, J., Sabet, H., Tran, T., Yu, X., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.

Bradley, P., & Fayyad, U. (1998). Refining initial points for k-means clustering. Proc. 15th International Conf. on Machine Learning.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification, 2nd ed. Wiley.

Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1, 183–187.

Fan, K. (1949). On a theorem of Weyl concerning eigenvalues of linear transformations. Proc. Natl. Acad. Sci. USA, 35, 652–655.

Gersho, A., & Gray, R. (1992). Vector quantization and signal compression. Kluwer.

Goldstein, H. (1980). Classical mechanics. Addison-Wesley. 2nd edition.

Golub, G., & Van Loan, C. (1996). Matrix computations, 3rd edition. Johns Hopkins, Baltimore.

Gordon, A., & Henderson, J. (1977). An algorithm for Euclidean sum of squares classification. Biometrics, 355–362.

Grim, J., Novovicova, J., Pudil, P., Somol, P., & Ferri, F. (1998). Initializing normal mixtures of densities. Proc. Int'l Conf. Pattern Recognition (ICPR 1998).

Hartigan, J., & Wang, M. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100–108.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). Elements of statistical learning. Springer Verlag.

Jain, A., & Dubes, R. (1988). Algorithms for clustering data. Prentice Hall.

Jolliffe, I. (2002). Principal component analysis. Springer. 2nd edition.

Lloyd, S. (1957). Least squares quantization in PCM. Bell Telephone Laboratories Paper, Murray Hill.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symposium, 281–297.

Moore, A. (1998). Very fast EM-based mixture model clustering using multiresolution kd-trees. Proc. Neural Info. Processing Systems (NIPS 1998).

Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Proc. Neural Info. Processing Systems (NIPS 2001).

Wallace, R. (1989). Finding natural clusters through entropy minimization. Ph.D. thesis, Carnegie-Mellon University, CS Dept.

Zha, H., Ding, C., Gu, M., He, X., & Simon, H. (2001). Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems 14 (NIPS'01), 1057–1064.