
Clustering by Left-Stochastic Matrix Factorization

Raman Arora    [email protected]
Maya R. Gupta  [email protected]
Amol Kapila    [email protected]
Maryam Fazel   [email protected]

University of Washington, Seattle, WA 98103, USA

Abstract

We propose clustering samples given their pairwise similarities by factorizing the similarity matrix into the product of a cluster probability matrix and its transpose. We propose a rotation-based algorithm to compute this left-stochastic decomposition (LSD). Theoretical results link the LSD clustering method to a soft kernel k-means clustering, give conditions for when the factorization and clustering are unique, and provide error bounds. Experimental results on simulated and real similarity datasets show that the proposed method reliably provides accurate clusterings.

1. Introduction

We propose a new non-negative matrix factorization (NNMF) model that produces a probabilistic clustering of samples given a matrix of similarities between the samples. The interpretation of non-negative factors of matrices as describing different parts of data was first given by (Paatero & Tapper, 1994) and (Lee & Seung, 1999). In this paper, we investigate a constrained NNMF problem, where the factors can be interpreted as encoding the probability of each data point belonging to different clusters of the data.

The paper is organized as follows. First, we discuss related work in clustering by matrix factorization in Section 1.1. Then we introduce the proposed left-stochastic decomposition (LSD) clustering formulation in Section 1.2. We provide a theoretical foundation for LSD in Section 2. We exploit the geometric structure present in the problem to provide a rotation-based algorithm in Section 3. Experimental results are presented in Section 4.

1.1. Related Work in Matrix Factorization

Some clustering objective functions can be written as matrix factorization objectives. Let n feature vectors be stacked into a feature-vector matrix X ∈ R^{d×n}. Consider the model X ≈ FG^T, where F ∈ R^{d×k} can be interpreted as a matrix with k cluster prototypes as its columns, and G ∈ R^{n×k} is all zeros except for one (appropriately scaled) positive entry per row that indicates the nearest cluster prototype. The k-means clustering objective follows this model with squared error, and can be expressed as (Ding et al., 2005):

    arg min_{F, G^T G = I, G ≥ 0} ||X − FG^T||_F^2,    (1)

where ||·||_F is the Frobenius norm, and the inequality G ≥ 0 is component-wise. This follows because the combined constraints G ≥ 0 and G^T G = I force each row of G to have only one positive element. It is straightforward to show that the minimizer of (1) occurs at F* = XG, so that (1) is equivalent to (Ding et al., 2005; Li & Ding, 2006):

    arg min_{G^T G = I, G ≥ 0} ||X^T X − GG^T||_F^2.    (2)

Replacing X^T X in (2) with a kernel matrix K results in the same minimizer as the kernel k-means objective (Ding et al., 2005; Li & Ding, 2006):

    arg min_{G^T G = I, G ≥ 0} ||K − GG^T||_F^2.    (3)

It has been shown that normalized-cut spectral clustering (Shi & Malik, 2000) also attempts to minimize an objective equivalent to (3), as kernel k-means does, but for a kernel that is a normalized version of the input kernel (Ding et al., 2005). Similarly, probabilistic latent semantic indexing (PLSI) (Hofmann, 1999) can be formulated using the matrix factorization model X ≈ FG^T, where the approximation is in terms of relative entropy (Li & Ding, 2006).

Ding et al. (Ding et al., 2010; 2005) explored computationally simpler variants of the kernel k-means objective by removing the problematic constraint in (3) that G^T G = I, with the hope that the solutions would still have one dominant entry per row of G, which they take as the indicator of cluster membership. One such variant, termed convex NMF (Ding et al., 2010), restricts the columns of F (the cluster centroids) to be convex combinations of the columns of X. Another variant, cluster NMF (Ding et al., 2010), solves arg min_{G ≥ 0} ||X − XGG^T||_F^2.
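To make the structure of the factor G in objective (3) concrete, the short numpy sketch below (illustrative code, not from the paper; the random kernel and variable names are ours) builds the scaled cluster-indicator matrix G for a given label vector, verifies that G^T G = I, and evaluates the objective (3). With G ≥ 0 and G^T G = I, each row of G has exactly one positive entry.

    import numpy as np

    def indicator_factor(labels, k):
        """Scaled cluster-indicator matrix G (n x k): one positive entry per row,
        scaled so that G^T G = I, as required by objective (3)."""
        n = len(labels)
        G = np.zeros((n, k))
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            G[idx, c] = 1.0 / np.sqrt(len(idx))
        return G

    # Hypothetical example: a linear kernel from two well-separated blobs,
    # and the value of objective (3) at the true hard assignment.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
    K = X @ X.T                                  # kernel matrix K = X X^T
    labels = np.repeat([0, 1], 20)
    G = indicator_factor(labels, 2)
    print(np.allclose(G.T @ G, np.eye(2)))       # True: G^T G = I
    print(np.linalg.norm(K - G @ G.T, "fro"))    # value of objective (3) at this G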

1.2. LSD Clustering

We propose explicitly factorizing a matrix K ∈ R^{n×n} to produce the probability that each of the n samples belongs to each of k clusters. We require only that K be symmetric and that its top k eigenvalues be positive; we refer to such matrices in this paper as similarity matrices. Note that a similarity matrix need not be PSD; it can be an indefinite kernel matrix (Chen et al., 2009). If the input does not satisfy these requirements, it must be modified before use with this method. For example, if the input is a graph adjacency matrix A with zero diagonal, one could replace the diagonal by each node's degree, or replace A by A + αI for some constant α and the identity matrix I, or apply some other spectral modification. See (Chen et al., 2009) for further discussion of such modifications.

Given a similarity matrix K, we estimate the cluster probability matrix to be any P* that solves:

    arg min_{P ≥ 0, P^T 1_k = 1_n} ||cK − P^T P||_F,    (4)

where c ∈ R is a scaling factor that depends only on K and is defined in Proposition 2, and 1_k is a k × 1 vector of all ones. Note that a solution P* to the above optimization problem is an approximate left-stochastic decomposition of the matrix cK, which inspires the following terminology:

LSD Clustering: A solution P* to (4) is a cluster probability matrix, from which we form the LSD clustering by assigning the ith sample to cluster j* if P*_{j*i} > P*_{ji} for all j ≠ j*.

LSDable: We call K LSDable if the minimum of (4) is zero.

1.3. Related Kernel Definitions

Some kernel definitions are related to our idealization that a given similarity matrix is usefully modeled as cK ≈ P^T P. Cristianini et al. (Cristianini et al., 2001) defined the ideal kernel K* to be K*_{ij} = 1 if and only if X_i and X_j are from the same class. The Fisher kernel (Jaakkola & Haussler, 1998) is defined as K_{ij} = U_i^T I^{-1} U_j, where I is the Fisher information matrix and each U_i is a function of the Fisher score, computed from a generative model and data X_i and X_j.

2. Theoretical Results

To provide intuition for our algorithmic approach and the LSD clustering method, we present some theoretical results; all proofs are in the supplemental material.

The proposed LSD clustering objective (4) is the same as the objective (3), except that we constrain P to be left-stochastic, factorize a scaled version of K, and do not constrain P^T P = I. The LSD clustering can be interpreted as a soft kernel k-means clustering, as follows:

Proposition 1 (Soft Kernel K-means) Suppose K is LSDable, and let K_{ij} = φ(X_i)^T φ(X_j) for some mapping φ : R^{d×n} → R^{d_1×n} and some feature-vector matrix X ∈ R^{d×n}. Then the minimizer P* of the LSD objective (4) also minimizes the following soft kernel k-means objective:

    arg min_{F ∈ R^{d_1×k}, P ≥ 0, P^T 1_k = 1_n} ||φ(X) − FP||_F^2.    (5)

Next, Theorem 1 states that there will be multiple left-stochastic factors P, which are related by rotations about the normal to the probability simplex (which includes permuting the rows, that is, changing the cluster labels):

Theorem 1 (Factors of K Related by Rotation) Suppose K is LSDable such that K = P^T P. Then,
(a) for any matrix M ∈ R^{k×n} s.t. K = M^T M, there is a unique orthogonal k × k matrix R s.t. M = RP;
(b) for any matrix P̂ ∈ R^{k×n} s.t. P̂ ≥ 0, P̂^T 1_k = 1_n and K = P̂^T P̂, there is a unique orthogonal k × k matrix R_u s.t. P̂ = R_u P and R_u u = u, where u = (1/k)[1, ..., 1]^T is normal to the plane containing the probability simplex;
(c) a sufficient condition for P to be unique (up to a permutation of rows) is that there exist an m × m sub-matrix in K which is a permutation of the identity matrix I_m, where m ≥ ⌊k/2⌋ + 1.
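To ground these definitions and Theorem 1's rotation ambiguity in a toy numerical example, the sketch below (illustrative code, not from the paper) builds an LSDable K from a left-stochastic P, applies the LSD clustering rule (a column-wise arg max), and checks that a row-permuted factor yields the same K and the same clustering up to relabeling.

    import numpy as np

    rng = np.random.default_rng(1)
    k, n = 3, 10

    # Build a left-stochastic P (columns sum to 1); K = P^T P is then LSDable.
    P = rng.random((k, n))
    P /= P.sum(axis=0, keepdims=True)       # enforce P >= 0 and P^T 1_k = 1_n
    K = P.T @ P

    # LSD clustering rule: assign sample i to the cluster with largest probability.
    labels = P.argmax(axis=0)

    # Theorem 1: other left-stochastic factors of K are related to P by an
    # orthogonal map fixing u; e.g., permuting the rows (relabeling the clusters).
    perm = np.array([2, 0, 1])
    P_perm = P[perm]
    print(np.allclose(K, P_perm.T @ P_perm))               # True: same K, different factor
    print(np.all(perm[P_perm.argmax(axis=0)] == labels))   # True: same clustering up to relabeling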

While there may be many LSDs of a given K, they may all result in the same LSD clustering. The following theorem states that for k = 2, the LSD clustering is unique, and for k = 3, it gives tight conditions for uniqueness of the LSD clustering. The key idea of this theorem for k = 3 is illustrated in Fig. 1.

Theorem 2 (Uniqueness of an LSD Clustering) For k = 2, the LSD clustering is unique (up to a permutation of the labels). For k = 3, let α_c (α_ac) be the maximum angle of clockwise (anti-clockwise) rotation about the normal to the simplex that does not rotate any of the columns of P off the simplex; let β_c (β_ac) be the minimum angle of clockwise (anti-clockwise) rotation that changes the clustering. Then, if α_c < β_c or if α_ac < β_ac, the LSD clustering is unique, up to a permutation of labels.

Figure 1. Illustration of the conditions for uniqueness of the LSD clustering. The columns of P* are shown as points on the probability simplex for k = 3 and n = 15. The points corresponding to the three clusters are marked by 'o's, '+'s, and 'x's. The Y-shaped thick gray lines show the cluster decision regions. Rotating the columns of P* about the center of the simplex u results in a different LSD solution P̂ = R_u P such that K = P̂^T P̂. One can rotate clockwise by at most angle β_c (and counter-clockwise β_ac) before any point crosses a cluster decision boundary. Note that rotating P* by more than α_c degrees clockwise (or α_ac degrees anti-clockwise) pushes the points out of the probability simplex (no longer a legitimate LSD), but rotating by 120 degrees is a legitimate LSD that corresponds to a re-labeling of the clusters. As specified in Theorem 2, a sufficient condition for the LSD clustering to be unique (up to re-labeling) is if α_c < β_c and α_ac < β_ac.

Our next result uses perturbation theory (Stewart, 1998) to give an error bound on LSD estimation for a perturbed LSDable matrix.

Theorem 3 (Perturbation Error Bound) Suppose K is LSDable such that K = P^T P. Let K̃ = K + W, where W is a perturbation matrix with bounded Frobenius norm, ||W||_F ≤ ε. Then ||K − P̂^T P̂||_F ≤ 2ε, where P̂ minimizes (4) for K̃. Furthermore, if ||W||_F is o(λ̃_k), where λ̃_k is the kth largest eigenvalue of K̃, then

    ||P − R P̂||_F ≤ ε ( 1 + (√k / |λ̃_k|) ( C_1 ||K^{1/2}||_F + ε ) ),    (6)

where R is an orthogonal matrix and C_1 is a constant.

The error bound in (6) involves three terms: the first term captures the perturbation of the eigenvalues and scales linearly with ε; the second term involves ||K^{1/2}||_F = √(tr(K)), due to the coupling between the eigenvectors of the original matrix K and the perturbed eigenvectors as well as the perturbed eigenvalues; and the third term, proportional to ε^2, is due to the perturbed eigenvectors and perturbed eigenvalues. As expected, ||P − R P̂||_F → 0 as ε → 0, relating the LSD to the true factor by a rotation, which is consistent with Theorem 1.

Proposition 2 (Scaling Factor) Given a similarity matrix K, the scaling factor c* that solves

    arg min_{c ∈ R_+}  min_{P ∈ [0,1]^{k×n}, P^T 1_k = 1_n}  ||cK − P^T P||_F^2

is given by c* = ||(MM^T)^{-1} M 1_n||_2^2 / k, where K = M^T M is any decomposition of K.
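As a sketch of Proposition 2 (illustrative code, not from the paper): given any factorization K = M^T M, the scaling factor can be computed directly; for an LSDable K built from a left-stochastic P, the formula recovers c* = 1.

    import numpy as np

    def lsd_scaling_factor(K, k):
        """Scaling factor c* of Proposition 2, computed from an eigen-factorization
        K = M^T M restricted to the top k eigenpairs (assumed positive)."""
        evals, evecs = np.linalg.eigh(K)
        idx = np.argsort(evals)[::-1][:k]                          # top k eigenvalues
        M = np.sqrt(np.maximum(evals[idx], 0.0))[:, None] * evecs[:, idx].T  # M = Lambda^{1/2} V^T
        m = np.linalg.solve(M @ M.T, M @ np.ones(K.shape[0]))      # m = (M M^T)^{-1} M 1_n
        return np.dot(m, m) / k                                    # c* = ||m||_2^2 / k

    # Example on an LSDable K (P left-stochastic), where c* should be 1.
    rng = np.random.default_rng(2)
    P = rng.random((3, 12)); P /= P.sum(axis=0, keepdims=True)
    K = P.T @ P
    print(lsd_scaling_factor(K, 3))   # approximately 1.0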

3. Computing the LSD of K

The LSD problem stated in (4) is a nonconvex optimization in P, and is a completely positive (CP) factorization problem with an additional left-stochasticity constraint. CP problems are a subset of non-negative matrix factorization (NNMF) problems. Algorithms to solve NNMF can be applied here, such as Seidenberg's exponential-time quantifier elimination approach (Cohen & Rothblum, 1993); a multiplicative weights update (Lee & Seung, 2000); a greedy method using rank-one downdates (Biggs et al., 2008); or gradient descent (Paatero, 1999; Hoyer, 2004; Berry et al., 2007). The LSD problem could also be re-formulated and an alternating minimization approach used, as in (Paatero & Tapper, 1994; Paatero, 1997). As is typical with nonconvex problems, no known polynomial-time method provides theoretical guarantees or conditions under which a correct solution is found.

We propose a rotation-based algorithm to solve (4) that exploits the geometry of this problem. We believe this is a new approach that may also be applicable to more general CP and NNMF problems. First, to build the reader's intuition, we present an overview of the algorithm solving (4). Full algorithmic details are deferred to Section 3.2.

3.1. Algorithm Overview

We overview the algorithm steps for an LSDable K. Full algorithm details, including projections needed if K is not LSDable, are given in Section 3.2. See Fig. 2 for a pictorial description of these steps for k = 3.

Step 1: Scale the similarity matrix K by c ∈ R (given in Proposition 2) to produce K' = cK.

Step 2: Factor K' = M^T M, for some M ∈ R^{k×n}.

Step 3: Rotate M to form Q ∈ R^{k×n}, whose column vectors all lie in the hyperplane that contains the probability simplex (see Fig. 2(a)).

Step 4: Rotate Q about the vector u = (1/k)[1, ..., 1]^T to form the left-stochastic solution P*, whose column vectors all lie in the probability simplex (see Fig. 2(b)).

3.2. Full LSD Algorithm

Here we give more explicit details for each of the algorithm steps given in Sec. 3.1.

Step 1: Scale the similarity matrix K' = cK by the c given in Proposition 2.

Step 2a: Let Λ_{1:k} ∈ R^{k×k} be the diagonal matrix of the top k eigenvalues of K' and V_{1:k} ∈ R^{n×k} be the corresponding eigenvectors.

Step 2b: Set M = Λ_{1:k}^{1/2} V_{1:k}^T.

Step 2c: Compute the vector m = (MM^T)^{-1} M 1_n, which is normal to the least-squares hyperplane fit to the n columns of M, obtained by solving m = arg min_{m̂ ∈ R^k} ||M^T m̂ − 1_n||_2^2.

Step 2d: Project the columns of M onto the hyperplane normal to m that is 1/√k units away from the origin; i.e., for the jth column compute

    M̃_j = M_j + ( 1/√k − m^T M_j / ||m|| ) m/||m||.

Let M̃ = [M̃_1, ..., M̃_n].

Step 3a: Compute a k × k rotation matrix R such that R m/||m|| = u/||u||, where u = (1/k)[1, 1, ..., 1]^T, as follows. Define v = u/||u|| − ( u^T m / (||u|| ||m||) ) m/||m||. Extend { m/||m||, v/||v|| } to a basis U for R^k using Gram-Schmidt orthogonalization. Define a k × k Givens rotation matrix R_G such that (R_G)_{11} = (R_G)_{22} = u^T m / (||u|| ||m||), (R_G)_{21} = −(R_G)_{12} = u^T v / (||u|| ||v||), and for all i, j > 2, set (R_G)_{ij} = 1 if i = j and zero otherwise. Then R = U R_G U^T is a k × k rotation matrix such that R m/||m|| = u/||u||.

Step 3b: Set Q = R M̃.

Step 4a: Find a rotation R_u* about u such that the projection ⌊R_u* Q⌋ of R_u* Q onto the probability simplex minimizes the factorization error to K':

    R_u* = arg min_{R : R u = u} ||K' − ⌊RQ⌋^T ⌊RQ⌋||_F,    (7)

such that R is a rotation matrix.

Note that for k = 2, no rotation is needed, and P* = ⌊Q⌋. For k = 3 and k = 4, minimizing the objective in (7) reduces to a simple optimization over a set of parameters, as discussed in Sec. 3.3. For general k > 4, any global optimization method could be used; we use a gradient-based approach detailed in Section 3.4. Projection onto the probability simplex can be found using the algorithm given in (Michelot, 1986).

Step 4b: Output P* = ⌊R_u* Q⌋.

3.3. Further Details: Step 4a for k = 3 and k = 4

For k ≥ 3, the rotation that leaves u = (1/k)[1, ..., 1]^T fixed can be described as the composition of three rotations: (a) a rotation R_ue that takes u to e = (1/√k)[0, 0, ..., 0, 1]^T, followed by (b) a rotation R_e about e, followed by (c) the rotation R_ue^T that takes e back to u. In other words, the rotation matrix can be decomposed as R_u = R_ue^T R_e R_ue, where R_ue is a fixed matrix and the optimization is over the rotation R_e. The rotation matrix R_e has the following structure:

    R_e = [ R_{k−1}   0
            0^T       1 ],    (8)

where R_{k−1} is a (k−1) × (k−1) rotation matrix. This reduction in the dimensionality of the parameter space is due to the constraint R u = u. Thus for k = 3, the optimization is over planar rotation matrices, which can be parameterized by the angle of rotation θ ∈ [0, 2π), thereby reducing the problem to a 1-D optimization problem.

For k = 4, the rotation matrix R_e is a 3 × 3 matrix that can be parameterized by ZYZ Euler angles α, β, γ, i.e., angles of rotation about the Z, Y and Z axes, respectively: R_e(α, β, γ) = R_Z(γ) R_Y(β) R_Z(α), where α, γ ∈ [0, 2π) and β ∈ [0, π]. Therefore, the optimization in (7) reduces to a search over three parameters.
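To make Steps 1-4 concrete before moving on, here is a compact sketch of the full pipeline for k = 3 (illustrative code, not the authors' implementation). It uses a standard sort-based Euclidean projection onto the simplex as a stand-in for Michelot's algorithm, builds the Step-3 rotation with a generic plane-rotation construction rather than the Givens-based one described above, and performs Step 4a as a grid search over the angle θ of Sec. 3.3; all function names are hypothetical.

    import numpy as np

    def project_to_simplex(y):
        """Euclidean projection of a vector onto the probability simplex
        (sort-based method; a stand-in for Michelot, 1986)."""
        u = np.sort(y)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(y) + 1) > 0)[0][-1]
        tau = (1.0 - css[rho]) / (rho + 1)
        return np.maximum(y + tau, 0.0)

    def rotation_taking(a, b):
        """Rotation matrix R (det = +1) with R a = b, for unit vectors a, b.
        (A plane-rotation construction, equivalent in effect to Step 3a.)"""
        c = float(a @ b)
        w = b - c * a
        s = np.linalg.norm(w)
        if s < 1e-12:
            return np.eye(len(a))
        v = w / s
        return (np.eye(len(a))
                + (c - 1.0) * (np.outer(a, a) + np.outer(v, v))
                + s * (np.outer(v, a) - np.outer(a, v)))

    def top_k_factor(K, k):
        """Steps 2a-2b: M = Lambda_{1:k}^{1/2} V_{1:k}^T from the top k eigenpairs."""
        evals, evecs = np.linalg.eigh(K)
        idx = np.argsort(evals)[::-1][:k]
        return np.sqrt(np.maximum(evals[idx], 0.0))[:, None] * evecs[:, idx].T

    def lsd_k3(K, n_angles=720):
        """Sketch of the rotation-based LSD algorithm (Secs. 3.1-3.3) for k = 3."""
        k, n = 3, K.shape[0]
        # Step 1: scale K by c* from Proposition 2.
        M0 = top_k_factor(K, k)
        m0 = np.linalg.solve(M0 @ M0.T, M0 @ np.ones(n))
        Kp = ((m0 @ m0) / k) * K
        # Step 2: factor K' = M^T M and fit the hyperplane normal m.
        M = top_k_factor(Kp, k)
        m = np.linalg.solve(M @ M.T, M @ np.ones(n))
        mhat = m / np.linalg.norm(m)
        # Step 2d: project columns of M onto the hyperplane 1/sqrt(k) from the origin.
        Mt = M + np.outer(mhat, 1.0 / np.sqrt(k) - mhat @ M)
        # Step 3: rotate the hyperplane normal onto the simplex normal u.
        uhat = np.ones(k) / np.sqrt(k)
        Q = rotation_taking(mhat, uhat) @ Mt
        # Step 4a for k = 3: 1-D grid search over rotations about u (Sec. 3.3).
        Rue = rotation_taking(uhat, np.array([0.0, 0.0, 1.0]))
        best_err, best_P = np.inf, None
        for theta in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
            ct, st = np.cos(theta), np.sin(theta)
            Re = np.array([[ct, -st, 0.0], [st, ct, 0.0], [0.0, 0.0, 1.0]])  # Eq. (8)
            R = Rue.T @ Re @ Rue                                   # rotation with R u = u
            P = np.apply_along_axis(project_to_simplex, 0, R @ Q)  # floor(R Q)
            err = np.linalg.norm(Kp - P.T @ P, "fro")              # objective (7)
            if err < best_err:
                best_err, best_P = err, P
        return best_P, best_err

    # Usage on an LSDable toy K: the error is small and arg max gives the clustering.
    rng = np.random.default_rng(3)
    P_true = rng.random((3, 30)); P_true /= P_true.sum(axis=0, keepdims=True)
    P_est, err = lsd_k3(P_true.T @ P_true)
    print(err)                       # small (limited by the angular grid)
    print(P_est.argmax(axis=0))      # LSD clustering labels

For an LSDable K, the returned error should be near zero (limited only by the angular grid resolution), and best_P.argmax(axis=0) gives the LSD clustering labels, consistent with the assignment rule in Sec. 1.2.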

Figure 2. The proposed rotation-based LSD algorithm for an example where k = 3. (a) Rotate M to the simplex hyperplane; (b) rotate to fit the simplex. Fig. 2a shows the columns of M from Step 2, where K' = M^T M. The columns of M correspond to points in R^k, shown here as green circles in the negative orthant. If K is LSDable, the green circles would lie on a hyperplane. We scale each column of M so that the least-squares fit of a hyperplane to the columns of M is 1/√k units away from the origin (i.e., the distance of the probability simplex from the origin). We then project the columns of M onto this hyperplane, mapping the green circles to the blue circles. The normal to this best-fit hyperplane is first rotated to the vector u = (1/k)[1, ..., 1]^T, which is normal to the probability simplex, mapping the blue circles to the red circles, which are the columns of Q in Step 3. Then, as shown in Fig. 2b, we rotate the columns of Q about the normal u to best fit the points inside the probability simplex (some projection onto the simplex may be needed), mapping the red circles to black crosses. The black crosses are the columns of the solution P*.

3.4. Further Details: Step 4a for k > 4

For higher k, we used a gradient descent for Step 4a based on (Arora, 2009), as follows. Let J(R) denote the objective function in (7). Initialize the search with R_u = I and R_u* = I, set J* = J(R_u), and choose a small step size η = 0.25. Randomly choose a column (R_u Q)_j that lies outside the simplex, and repeat the following steps for a fixed number of passes (or until all columns of R_u Q lie inside the simplex):

1. Find the projection ⌊(R_u Q)_j⌋ of (R_u Q)_j onto the simplex (Michelot, 1986). Rotate (R_u Q)_j and ⌊(R_u Q)_j⌋ about the origin to lie in the plane normal to e, and then take the first k − 1 components of the rotated vectors: x = (R_ue (R_u Q)_j)_{1:k−1}, y = (R_ue ⌊(R_u Q)_j⌋)_{1:k−1}.

2. Compute the gradient of the quadratic loss function L(R_{k−1}) = ||R_{k−1} x/||x|| − y/||y||||_2^2 with respect to the rotation matrix R_{k−1}, evaluated at R_{k−1} = I:

    ∇L = 2 ( x/||x|| − y/||y|| ) ( x/||x|| )^T.

3. Compute the rotation matrix R_{k−1} that takes a small step (of size η) along the geodesic connecting x/||x|| and y/||y||: R_{k−1} = exp( −η (∇L − ∇L^T) ). Note that since the gradient is a rank-one matrix, the matrix exponential reduces to a simple quadratic form (Arora, 2009).

4. Form R_e using R_{k−1} as in (8), and update the rotation matrix R_u = (R_ue^T R_e R_ue) R_u.

5. If J(R_u) < J*, update R_u* = R_u and J* = J(R_u*).
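The geodesic update in step 3 above can be written in a few lines; the sketch below (illustrative, not the authors' code) uses scipy's matrix exponential rather than the rank-one simplification mentioned in (Arora, 2009).

    import numpy as np
    from scipy.linalg import expm

    def geodesic_step(x, y, eta=0.25):
        """One geodesic gradient step for L(R) = ||R x/||x|| - y/||y||||^2,
        evaluated at R = I (step 3 of Sec. 3.4); returns a (k-1)x(k-1) rotation."""
        xh = x / np.linalg.norm(x)
        yh = y / np.linalg.norm(y)
        grad = 2.0 * np.outer(xh - yh, xh)    # gradient of L at R = I
        A = grad - grad.T                     # skew-symmetric part
        return expm(-eta * A)                 # exponential of a skew matrix is a rotation

    # Tiny usage example with hypothetical 2-D inputs (k = 3, so k - 1 = 2).
    x = np.array([1.0, 0.2])
    y = np.array([0.6, 0.8])
    R = geodesic_step(x, y)
    print(R @ R.T)             # close to the identity: R is orthogonal
    print(np.linalg.det(R))    # close to +1: a proper rotation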

4. Experimental Results

We compare clustering algorithms that take a pairwise-similarity matrix K as input. We present two Gaussian simulations, two simulations where the clusters are defined by manifolds, and real similarity datasets from (Chen et al., 2009).

Every time the k-means algorithm is used, it is run with 20 random initializations (which are each given the Matlab default of 100 maximum iterations), and the result that minimizes the within-cluster scatter is used. K-means is implemented with Matlab's kmeans routine, except for the kernel k-means, which we implemented. The convex NMF routine used NMFlib (Grindlay). For normalized spectral clustering, we used the Ng-Jordan-Weiss version (Ng et al., 2002).

4.1. Gaussian Cluster Simulations

We ran two Gaussian cluster simulations, for k = 2 and k = 3 clusters. The first simulation generated samples iid from k = 2 Gaussians N([0 0]^T, [.5σ^2 0; 0 4σ^2]) and N([5 0]^T, [4σ^2 0; 0 .5σ^2]). Then the RBF kernel matrix with bandwidth one was computed. For each value of σ^2, we averaged results for 1000 randomly generated datasets, each of which consisted of 100 samples from each cluster.

The second simulation used the same two Gaussian clusters with σ^2 = 3, plus a third cluster generated iid from N([2 4]^T, [3 0; 0 3]). We averaged results for 1000 randomly generated datasets, each of which consisted of 50 samples from each cluster.

Figure 3. Runtimes (top) and results (bottom) for the two-cluster Gaussian simulation.

Figure 4. Results for the three-cluster Gaussian simulation.

Runtimes are shown for increasing n for the two-cluster simulation in Fig. 3. The spectral clusterings, convex NMF, and LSD clustering all have comparable runtimes, and are much faster than the linkages or kernel k-means. Theoretically, we hypothesize that the LSD clustering computational complexity is roughly the same as spectral clustering, as both will be dominated by the need to compute k eigenvectors.

Results for the two-cluster simulation are shown in Fig. 3. LSD shows similar performance to kernel k-means and convex NMF, with LSD performing slightly better for most values of σ^2. The three-cluster simulation, shown in Fig. 4, shows similar performance.

4.2. Manifold Simulations

We simulate k = 2 clusters lying on two one-dimensional manifolds in two-dimensional space, as shown in Fig. 5. For each sample, the first feature x[1] is drawn iid from a standard normal, and the second feature is x[1]^2 if the point is from the first cluster, or arctan(x[1]) − b for some b ∈ R if the point is from the second cluster. Fifty samples are randomly generated for each cluster, and results were averaged over 250 randomly generated datasets for each value of the cluster separation b. We clustered based on two kernels. The first kernel was the RBF kernel with bandwidth 1. The second kernel was a nearest-neighbor kernel K = A + A^T, where A(x_i, x_j) = 1 if x_i is one of the five nearest neighbors of x_j (includes itself as a neighbor) and A(x_i, x_j) = 0 otherwise.
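For concreteness, here is a small sketch of the two kernels used in the manifold simulation (illustrative code; the exact RBF convention and parameter names are our assumptions, not specified in the paper).

    import numpy as np

    def rbf_kernel(X, bandwidth=1.0):
        """RBF kernel K_ij = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)).
        (One common convention; the paper does not spell out its exact form.)"""
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth**2))

    def nn_kernel(X, num_neighbors=5):
        """Nearest-neighbor kernel K = A + A^T, where A_ij = 1 if x_i is one of the
        num_neighbors nearest neighbors of x_j (each point counts itself)."""
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        A = np.zeros_like(d2)
        for j in range(X.shape[0]):
            nearest = np.argsort(d2[:, j])[:num_neighbors]   # includes j itself
            A[nearest, j] = 1.0
        return A + A.T

    # One realization of the manifold data (b = 2), following Sec. 4.2.
    rng = np.random.default_rng(4)
    x1 = rng.standard_normal(50); x2 = rng.standard_normal(50)
    X = np.vstack([np.column_stack([x1, x1**2]),
                   np.column_stack([x2, np.arctan(x2) - 2.0])])
    K_rbf, K_nn = rbf_kernel(X), nn_kernel(X)
    print(K_rbf.shape, K_nn.shape)    # (100, 100) (100, 100)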

Results are shown in Fig. 6. Given the RBF kernel, LSD clustering and convex NMF outperform the other methods, followed by kernel k-means and complete linkage. Given the nearest-neighbor kernel, LSD clustering, single-linkage, and average-linkage perform best, followed by the spectral clusterings, with convex NMF and kernel k-means doing roughly 10% worse than LSD.

Figure 5. One realization of the n = 100 samples in the manifold simulation with separation b = 2: the curves y = x^2 and y = arctan(x) − 2.

Figure 6. Results for the manifold simulation: RBF kernel (top) and nearest-neighbor kernel (bottom).

4.3. Real Similarity Data

We also compared the clustering methods on ten real similarity datasets used in (Chen et al., 2009); these datasets are provided as a pairwise similarity matrix K, and the similarity definitions include human judgement, Smith-Waterman distance, and purchasing trends; see (Chen et al., 2009) for more details. These datasets have class labels for all samples, which were used as the ground truth to compute the errors shown in Table 1. While no one clustering method clearly outshines all of the others in Table 1, the proposed LSD clustering is ranked first for four of the ten datasets, and only for the Internet Ads dataset does it rank poorly (7th of 8).

5. Discussion

In this paper, we proposed a left-stochastic decomposition of a similarity matrix to produce cluster probability estimates. Compared to seven other clustering methods given the same similarity matrix as input, the LSD clustering reliably performed well across the different simulations and real benchmark datasets.

To compute the LSD, we proposed a rotation-based algorithm for the matrix factorization that may be an effective approach to solving other NNMF and CP problems. Even though the algorithm uses only the top k eigenvectors, the perturbation error bound of Theorem 3 still holds because the assumptions on the perturbation matrix ensure that there is a correspondence between the top k eigenvectors of the original and perturbed matrices.

One notable advantage of the LSD clustering algorithm is that it is deterministic for k = 2; that is, the clustering only requires computing the top two eigenvectors, a deterministic rotation to the positive orthant, and projecting to the probability simplex. This is in contrast to methods like kernel k-means and spectral clustering, which are usually implemented with multiple random initializations of k-means. As a consequence, LSD clustering can provide a deterministic top-down binary clustering tree, a topic for future research. Overall, the runtime for LSD clustering can be expected to be similar to spectral clustering.

For k = 3, we presented theoretical conditions for the uniqueness of an LSD clustering, which we believe may be satisfied in practice. We hypothesize that analogous conditions exist for higher k.

Table 1. Clustering Error Rate on Similarity Datasets

Columns (left to right): Amazon Binary, Aural Sonar, Internet Ads, Patrol, Protein, Voting, Yeast pfam 7-12, Yeast SW 5-7, Yeast SW 5-12, Yeast SW 7-12; number of clusters k = 2, 2, 2, 8, 4, 2, 2, 2, 2, 2, respectively.

LSD              .39  .14  .35  .44  .37  .10  .36  .28  .09  .10
Kernel K-Means   .33  .18  .29  .75  .28  .10  .49  .39  .10  .17
Convex NMF       .32  .11  .48  .79  .55  .10  .38  .31  .10  .10
Unnorm. Spec     .39  .49  .16  .51  .62  .09  .48  .50  .11  .50
Norm. Spec       .39  .13  .16  .79  .11  .10  .38  .31  .14  .10
Sing. Link       .39  .49  .16  .79  .66  .38  .50  .50  .50  .50
Comp. Link       .39  .47  .16  .71  .40  .04  .34  .39  .18  .33
Avg. Link        .39  .41  .16  .57  .60  .05  .47  .40  .10  .15

References

Arora, R. On learning rotations. NIPS, 2009.

Berry, M. W., Browne, M., Langville, A. N., Pauca, V. P., and Plemmons, R. J. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155-173, 2007.

Biggs, M., Ghodsi, A., and Vavasis, S. Nonnegative matrix factorization via rank-one downdate. ICML, 2008.

Chen, Y., Garcia, E. K., Gupta, M. R., Cazzanti, L., and Rahimi, A. Similarity-based classification: Concepts and algorithms. JMLR, 2009.

Cohen, J. E. and Rothblum, U. G. Nonnegative ranks, decompositions, and factorizations of nonnegative matrices. Linear Algebra and Its Applications, 190:149-168, 1993.

Cristianini, N., Shawe-Taylor, J., Elisseeff, A., and Kandola, J. On kernel target alignment. NIPS, 2001.

Ding, C., He, X., and Simon, H. D. On the equivalence of nonnegative matrix factorization and spectral clustering. SIAM Conf. Data Mining, 2005.

Ding, C., Li, T., and Jordan, M. I. Convex and semi-nonnegative matrix factorizations. IEEE Trans. PAMI, 32, 2010.

Grindlay, G. NMFlib. www.ee.columbia.edu/~grindlay.

Hofmann, T. Probabilistic latent semantic analysis. UAI, 1999.

Hoyer, P. O. Non-negative matrix factorization with sparseness constraints. JMLR, 5:1457-1469, 2004.

Jaakkola, T. and Haussler, D. Exploiting generative models in discriminative classifiers. NIPS, 1998.

Lee, D. D. and Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.

Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. NIPS, 2000.

Li, T. and Ding, C. The relationships among various nonnegative matrix factorization methods for clustering. In Proc. Intl. Conf. Data Mining, 2006.

Michelot, C. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. J. Opt. Theory Appl., 50:195-200, 1986.

Ng, A., Jordan, M., and Weiss, Y. On spectral clustering: Analysis and an algorithm. NIPS, 2002.

Paatero, P. Least-squares formulation of robust non-negative factor analysis. Chemometrics and Intell. Lab. Sys., 37:23-35, 1997.

Paatero, P. The multilinear engine: A table-driven, least squares program for solving multilinear problems, including the n-way parallel factor analysis model. J. Comp. Graph. Stat., 8(4):854-888, 1999.

Paatero, P. and Tapper, U. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111-126, 1994.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8), 2000.

Stewart, G. W. Matrix Algorithms, Volume I: Basic Decompositions. SIAM, 1998.