Clustering by Left-Stochastic Matrix Factorization
Raman Arora [email protected]
Maya R. Gupta [email protected]
Amol Kapila [email protected]
Maryam Fazel [email protected]
University of Washington, Seattle, WA 98103, USA

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

Abstract

We propose clustering samples given their pairwise similarities by factorizing the similarity matrix into the product of a cluster probability matrix and its transpose. We propose a rotation-based algorithm to compute this left-stochastic decomposition (LSD). Theoretical results link the LSD clustering method to a soft kernel k-means clustering, give conditions for when the factorization and clustering are unique, and provide error bounds. Experimental results on simulated and real similarity datasets show that the proposed method reliably provides accurate clusterings.

1. Introduction

We propose a new non-negative matrix factorization (NNMF) model that yields a probabilistic clustering of samples given a matrix of pairwise similarities between the samples. The interpretation of non-negative factors of matrices as describing different parts of data was first given by (Paatero & Tapper, 1994) and (Lee & Seung, 1999). We discuss more related work in clustering by matrix factorization in Section 1.1. Then we introduce the proposed left-stochastic decomposition (LSD) clustering formulation in Section 1.2. We provide a theoretical foundation for LSD in Section 2. We exploit the geometric structure present in the problem to provide a rotation-based algorithm in Section 3. Experimental results are presented in Section 4.

1.1. Related Work in Matrix Factorization

Some clustering objective functions can be written as matrix factorization objectives. Let $n$ feature vectors be gathered into a feature-vector matrix $X \in \mathbb{R}^{d \times n}$. Consider the model $X \approx FG^T$, where $F \in \mathbb{R}^{d \times k}$ can be interpreted as a matrix with $k$ cluster prototypes as its columns, and $G \in \mathbb{R}^{n \times k}$ is all zeros except for one (appropriately scaled) positive entry per row that indicates the nearest cluster prototype. The k-means clustering objective follows this model with squared error, and can be expressed as (Ding et al., 2005):

$$\arg\min_{F,\; G^T G = I,\; G \ge 0} \|X - FG^T\|_F^2, \qquad (1)$$

where $\|\cdot\|_F$ is the Frobenius norm, and the inequality $G \ge 0$ is component-wise. This follows because the combined constraints $G \ge 0$ and $G^T G = I$ force each row of $G$ to have only one positive element. It is straightforward to show that the minimizer of (1) occurs at $F^* = XG$, so that (1) is equivalent to (Ding et al., 2005; Li & Ding, 2006):

$$\arg\min_{G^T G = I,\; G \ge 0} \|X^T X - GG^T\|_F^2. \qquad (2)$$

Replacing $X^T X$ in (2) with a kernel matrix $K$ results in the same minimizer as the kernel k-means objective (Ding et al., 2005; Li & Ding, 2006):

$$\arg\min_{G^T G = I,\; G \ge 0} \|K - GG^T\|_F^2. \qquad (3)$$
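As a concrete illustration of the factorization view in (1)-(3) (a numerical sketch, not code from the paper; the helper name and data are illustrative and NumPy is assumed), the following builds the scaled cluster-indicator matrix $G$ for a fixed hard assignment and checks that, with the minimizing $F^* = XG$, the factorization error $\|X - F^*G^T\|_F^2$ equals the within-cluster sum of squared errors for that assignment.

```python
import numpy as np

def indicator_matrix(labels, k):
    """Scaled cluster-indicator matrix G (n x k): one positive entry per row,
    equal to 1/sqrt(cluster size), so that G^T G = I and G >= 0."""
    G = np.zeros((len(labels), k))
    for j in range(k):
        members = np.flatnonzero(labels == j)
        G[members, j] = 1.0 / np.sqrt(len(members))
    return G

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 12))                  # d x n feature-vector matrix
labels = np.repeat([0, 1, 2], 4)              # a fixed hard assignment into k = 3 clusters
G = indicator_matrix(labels, k=3)

F = X @ G                                     # minimizer F* = XG for this fixed G
factorization_error = np.linalg.norm(X - F @ G.T, 'fro') ** 2

# Within-cluster sum of squared errors for the same assignment.
means = np.stack([X[:, labels == j].mean(axis=1) for j in range(3)], axis=1)
wcss = sum(np.linalg.norm(X[:, i] - means[:, labels[i]]) ** 2 for i in range(X.shape[1]))

print(np.isclose(factorization_error, wcss))  # True: the two objectives coincide
```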
It has been shown that normalized-cut spectral clustering (Shi & Malik, 2000) also attempts to minimize an objective equivalent to that of kernel k-means, but for a kernel that is a normalized version of the input kernel (Ding et al., 2005). Similarly, probabilistic latent semantic indexing (PLSI) (Hofmann, 1999) can be formulated using the matrix factorization model $X \approx FG^T$, where the approximation is in terms of relative entropy (Li & Ding, 2006).

Ding et al. (2010) explored computationally simpler variants of the kernel k-means objective by removing the problematic constraint in (3) that $G^T G = I$, with the hope that the solutions would still have one dominant entry per row of $G$, which they take as the indicator of cluster membership. One such variant, termed convex NMF, restricts the columns of $F$ (the cluster centroids) to be convex combinations of the columns of $X$. Another variant, cluster NMF, solves $\arg\min_{G \ge 0} \|X - XGG^T\|_F^2$.

1.2. LSD Clustering

We propose explicitly factorizing a matrix $K \in \mathbb{R}^{n \times n}$ to produce probabilities that each of the $n$ samples belongs to each of $k$ clusters. We require only that $K$ be symmetric and that its top $k$ eigenvalues be positive; we refer to such matrices in this paper as similarity matrices. Note that a similarity matrix need not be PSD: it can be an indefinite kernel matrix (Chen et al., 2009). If the input does not satisfy these requirements, it must be modified before use with this method. For example, if the input is a graph adjacency matrix $A$ with zero diagonal, one could replace the diagonal by each node's degree, or replace $A$ by $A + \alpha I$ for some constant $\alpha$ and the identity matrix $I$, or perform some other spectral modification. See (Chen et al., 2009) for further discussion of such modifications.

Given a similarity matrix $K$, we estimate a scaling factor $c^*$ and a cluster probability matrix $P^*$ that solve

$$\arg\min_{c \in \mathbb{R}_+}\; \min_{P \ge 0,\; P^T \mathbf{1}_k = \mathbf{1}_n} \|cK - P^T P\|_F^2, \qquad (4)$$

where $\mathbf{1}_k$ is a $k \times 1$ vector of all ones. Note that a solution $P^*$ to the above optimization problem is an approximate left-stochastic decomposition of the matrix $c^* K$, which inspires the following terminology:

LSD Clustering: A solution $P^*$ to (4) is a cluster probability matrix, from which we form an LSD clustering by assigning the $i$th sample to cluster $j^*$ if $P^*_{j^* i} \ge P^*_{j i}$ for all $j \ne j^*$.

LSDable: We call $K$ LSDable if the minimum of (4) is zero.

We show in Proposition 2 that the minimizing scaling factor $c^*$ is given in closed form and does not depend on a particular solution $P^*$ for an LSDable $K$. Therefore, without loss of generality we will assume that $c^* = 1$ for an LSDable $K$.
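As a small worked example of the objective in (4) and of the hard-assignment rule above (a sketch with illustrative function names, assuming NumPy; not code from the paper), the following builds an LSDable $K$ exactly as $P^T P$ from a left-stochastic $P$, evaluates the LSD objective, and reads off the clustering by taking the largest entry in each column of $P$.

```python
import numpy as np

def lsd_objective(K, P, c):
    """Value of the LSD objective ||c K - P^T P||_F^2 for a candidate scaling c
    and a candidate k x n cluster-probability matrix P (columns sum to one)."""
    return np.linalg.norm(c * K - P.T @ P, 'fro') ** 2

def lsd_labels(P):
    """Hard LSD clustering: assign sample i to the cluster j* maximizing P_{j* i}."""
    return P.argmax(axis=0)

rng = np.random.default_rng(0)
P_true = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=10).T   # k x n, columns on the simplex
K = P_true.T @ P_true                                      # an LSDable similarity matrix

print(lsd_objective(K, P_true, c=1.0))                     # 0.0, so K is LSDable
print(lsd_labels(P_true))                                  # one cluster index per sample
```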
1.3. Related Kernel Definitions

Some kernel definitions are related to our idealization that a given similarity matrix is usefully modeled as $cK \approx P^T P$. Cristianini et al. (2001) defined the ideal kernel $K^*$ to be $K^*_{ij} = 1$ if and only if $X_i$ and $X_j$ are from the same class. The Fisher kernel (Jaakkola & Haussler, 1998) is defined as $K_{ij} = U_i^T I^{-1} U_j$, where $I$ is the Fisher information and each $U_i$ is a function of the Fisher score, computed from a generative model and data $X_i$ and $X_j$.

2. Theoretical Results

To provide intuition for the LSD clustering method and our rotation-based algorithmic approach, we present some theoretical results.

The proposed LSD clustering objective (4) is the same as the objective (3), except that we constrain $P$ to be left-stochastic, factorize a scaled version of $K$, and do not constrain $P^T P = I$. The LSD clustering can be interpreted as a soft kernel k-means clustering, as follows:

Proposition 1 (Soft Kernel K-means) Suppose $K$ is LSDable, and let $K_{ij} = \phi(X_i)^T \phi(X_j)$ for some mapping $\phi: \mathbb{R}^d \to \mathbb{R}^{d_1}$ and some feature-vector matrix $X = [X_1, \ldots, X_n] \in \mathbb{R}^{d \times n}$. Then the minimizer $P^*$ of the LSD objective (4) also minimizes the following soft kernel k-means objective:

$$\arg\min_{F \in \mathbb{R}^{d_1 \times k},\; P \ge 0,\; P^T \mathbf{1}_k = \mathbf{1}_n} \|\Phi(X) - FP\|_F^2, \qquad (5)$$

where $\Phi(X) = [\phi(X_1), \phi(X_2), \ldots, \phi(X_n)] \in \mathbb{R}^{d_1 \times n}$.

Next, Theorem 1 states that there will be multiple left-stochastic factors $P$, which are related by rotations about the normal to the probability simplex (which includes permuting the rows, that is, changing the cluster labels):

Theorem 1 (Factors of K Related by Rotation) Suppose $K$ is LSDable such that $K = P^T P$. Then,
(a) for any matrix $M \in \mathbb{R}^{k \times n}$ s.t. $K = M^T M$, there is a unique orthogonal $k \times k$ matrix $R$ s.t. $M = RP$;
(b) for any matrix $\hat{P} \in \mathbb{R}^{k \times n}$ s.t. $\hat{P} \ge 0$, $\hat{P}^T \mathbf{1}_k = \mathbf{1}_n$ and $K = \hat{P}^T \hat{P}$, there is a unique orthogonal $k \times k$ matrix $R_u$ s.t. $\hat{P} = R_u P$ and $R_u \tilde{u} = \tilde{u}$, where $\tilde{u} = \frac{1}{k}[1, \ldots, 1]^T$ is normal to the plane containing the probability simplex;
(c) a sufficient condition for $P$ to be unique (up to a permutation of rows) is that there exists an $m \times m$ sub-matrix in $K$ which is a permutation of the identity matrix $I_m$, where $m \ge \lfloor k/2 \rfloor + 1$.

While there may be many LSDs of a given $K$, they may all result in the same LSD clustering. The following theorem states that for $k = 2$ the LSD clustering is unique, and for $k = 3$ it gives tight conditions for uniqueness of the LSD clustering. The key idea of this theorem for $k = 3$ is illustrated in Fig. 1.

Theorem 2 (Uniqueness of an LSD Clustering) For $k = 2$, the LSD clustering is unique (up to a permutation of the labels). For $k = 3$, let $\alpha_c$ ($\alpha_{ac}$) be the maximum angle of clockwise (anti-clockwise) rotation about the normal to the simplex that does not rotate any of the columns of $P$ off the simplex; let $\beta_c$ ($\beta_{ac}$) be the minimum angle of clockwise (anti-clockwise) rotation that changes the clustering.

Suppose $K$ is LSDable with $K = P^T P$, and let $\tilde{K} = K + W$ for a symmetric perturbation $W$ with bounded norm, $\|W\|_F \le \epsilon$. Then $\|K - \hat{P}^T \hat{P}\|_F \le 2\epsilon$, where $\hat{P}$ minimizes (4) for $\tilde{K}$. Furthermore, if $\|W\|_F$ is $o(\tilde{\lambda}_k)$, where $\tilde{\lambda}_k$ is the $k$th largest eigenvalue of $\tilde{K}$, then

$$\|P - R\hat{P}\|_F \le \left(1 + \sqrt{k}\, C_1 \left(\|K^{1/2}\|_F + \frac{1}{|\tilde{\lambda}_k|}\right)\right)\epsilon, \qquad (6)$$

where $R$ is an orthogonal matrix and $C_1$ is a constant.

The error bound in (6) involves three terms: the first term captures the perturbation of the eigenvalues and scales linearly with $\epsilon$; the second term involves $\|K^{1/2}\|_F = \sqrt{\operatorname{tr}(K)}$ due to the coupling between the eigenvectors of the original matrix $K$ and the perturbation
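To make Theorem 1(b) and the geometric picture behind Theorem 2 concrete (a numerical sketch under our own construction, not code from the paper; NumPy assumed, $k = 3$), the following rotates a left-stochastic $P$ about the simplex normal and checks that the factorization $P^T P$ and the column sums are preserved, while non-negativity survives only as long as the rotation keeps every column on the simplex.

```python
import numpy as np

def rotation_about_simplex_normal(theta):
    """3 x 3 rotation by angle theta about the axis (1,1,1)/sqrt(3) (Rodrigues' formula).
    It is orthogonal and fixes the simplex normal u = (1/3)[1,1,1]^T."""
    a = np.ones(3) / np.sqrt(3)
    A = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])
    return np.cos(theta) * np.eye(3) + np.sin(theta) * A + (1 - np.cos(theta)) * np.outer(a, a)

rng = np.random.default_rng(1)
P = rng.dirichlet(alpha=[5.0, 5.0, 5.0], size=8).T   # k x n, columns interior to the simplex
K = P.T @ P                                          # an LSDable similarity matrix

R_u = rotation_about_simplex_normal(theta=0.1)       # a small rotation about the normal
P_hat = R_u @ P

print(np.allclose(P_hat.T @ P_hat, K))               # True: P_hat is another factor of K
print(np.allclose(P_hat.sum(axis=0), 1.0))           # True: columns still sum to one
print(bool((P_hat >= 0).all()))                      # True for small theta; a large enough
                                                     # rotation pushes columns off the simplex
```

In terms of Theorem 2, $\alpha_c$ and $\alpha_{ac}$ bound how far such a rotation can go before some column leaves the simplex, and $\beta_c$ and $\beta_{ac}$ bound how far it can go before the argmax clustering changes.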