
Clustering by Left-Stochastic Matrix Factorization

Raman Arora    [email protected]
Maya R. Gupta  [email protected]
Amol Kapila    [email protected]
Maryam Fazel   [email protected]

University of Washington, Seattle, WA 98103, USA

Abstract

We propose clustering samples given their pairwise similarities by factorizing the similarity matrix into the product of a cluster probability matrix and its transpose. We propose a rotation-based algorithm to compute this left-stochastic decomposition (LSD). Theoretical results link the LSD clustering method to a soft kernel k-means clustering, give conditions for when the factorization and clustering are unique, and provide error bounds. Experimental results on simulated and real similarity datasets show that the proposed method reliably provides accurate clusterings.

1. Introduction

We propose a new non-negative matrix factorization (NNMF) model that produces a probabilistic clustering of samples given a matrix of similarities between the samples. The interpretation of non-negative factors of matrices as describing different parts of data was first given by (Paatero & Tapper, 1994) and (Lee & Seung, 1999). In this paper, we investigate a constrained NNMF problem, where the factors can be interpreted as encoding the probability of each data point belonging to different clusters of the data.

The paper is organized as follows. First, we discuss related work in clustering by matrix factorization in Section 1.1. Then we introduce the proposed left-stochastic decomposition (LSD) clustering formulation in Section 1.2. We provide a theoretical foundation for LSD in Section 2. We exploit the geometric structure present in the problem to provide a rotation-based algorithm in Section 3. Experimental results are presented in Section 4.

1.1. Related Work in Matrix Factorization

Some clustering objective functions can be written as matrix factorization objectives. Let n feature vectors be stacked into a feature-vector matrix X ∈ R^{d×n}. Consider the model X ≈ FG^T, where F ∈ R^{d×k} can be interpreted as a matrix with k cluster prototypes as its columns, and G ∈ R^{n×k} is all zeros except for one (appropriately scaled) positive entry per row that indicates the nearest cluster prototype. The k-means clustering objective follows this model with squared error, and can be expressed as (Ding et al., 2005):

    arg min_{F, G^T G = I, G ≥ 0} ||X − FG^T||_F^2,    (1)

where ||·||_F is the Frobenius norm, and the inequality G ≥ 0 is component-wise. This follows because the combined constraints G ≥ 0 and G^T G = I force each row of G to have only one positive element. It is straightforward to show that the minimizer of (1) occurs at F* = XG, so that (1) is equivalent to (Ding et al., 2005; Li & Ding, 2006):

    arg min_{G^T G = I, G ≥ 0} ||X^T X − GG^T||_F^2.    (2)

Replacing X^T X in (2) with a kernel matrix K results in the same minimizer as the kernel k-means objective (Ding et al., 2005; Li & Ding, 2006):

    arg min_{G^T G = I, G ≥ 0} ||K − GG^T||_F^2.    (3)

It has been shown that normalized-cut spectral clustering (Shi & Malik, 2000) also attempts to minimize an objective equivalent to (3), as kernel k-means does, but for a kernel that is a normalized version of the input kernel (Ding et al., 2005). Similarly, probabilistic latent semantic indexing (PLSI) (Hofmann, 1999) can be formulated using the matrix factorization model X ≈ FG^T, where the approximation is in terms of relative entropy (Li & Ding, 2006).

Ding et al. (Ding et al., 2010; 2005) explored computationally simpler variants of the kernel k-means objective by removing the problematic constraint in (3) that G^T G = I, with the hope that the solutions would still have one dominant entry per row of G, which they take as the indicator of cluster membership. One such variant, termed convex NMF (Ding et al., 2010), restricts the columns of F (the cluster centroids) to be convex combinations of the columns of X. Another variant, cluster NMF (Ding et al., 2010), solves arg min_{G ≥ 0} ||X − XGG^T||_F^2.
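To make the structure of the factor G in objective (3) concrete, the short numpy sketch below (illustrative code, not from the paper; the random kernel and variable names are ours) builds the scaled cluster-indicator matrix G for a given label vector, verifies that G^T G = I, and evaluates the objective (3). With G ≥ 0 and G^T G = I, each row of G has exactly one positive entry.

    import numpy as np

    def indicator_factor(labels, k):
        """Scaled cluster-indicator matrix G (n x k): one positive entry per row,
        scaled so that G^T G = I, as required by objective (3)."""
        n = len(labels)
        G = np.zeros((n, k))
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            G[idx, c] = 1.0 / np.sqrt(len(idx))
        return G

    # Hypothetical example: a linear kernel from two well-separated blobs,
    # and the value of objective (3) at the true hard assignment.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
    K = X @ X.T                                  # kernel matrix K = X X^T
    labels = np.repeat([0, 1], 20)
    G = indicator_factor(labels, 2)
    print(np.allclose(G.T @ G, np.eye(2)))       # True: G^T G = I
    print(np.linalg.norm(K - G @ G.T, "fro"))    # value of objective (3) at this G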

1.2. LSD Clustering

We propose explicitly factorizing a matrix K ∈ R^{n×n} to produce the probability that each of the n samples belongs to each of k clusters. We require only that K be symmetric and that its top k eigenvalues be positive; we refer to such matrices in this paper as similarity matrices. Note that a similarity matrix need not be PSD; it can be an indefinite kernel matrix (Chen et al., 2009). If the input does not satisfy these requirements, it must be modified before use with this method. For example, if the input is a graph adjacency matrix A with zero diagonal, one could replace the diagonal by each node's degree, or replace A by A + αI for some constant α and the identity matrix I, or apply some other spectral modification. See (Chen et al., 2009) for further discussion of such modifications.

Given a similarity matrix K, we estimate the cluster probability matrix to be any P* that solves:

    arg min_{P ≥ 0, P^T 1_k = 1_n} ||cK − P^T P||_F,    (4)

where c ∈ R is a scaling factor that depends only on K and is defined in Proposition 2, and 1_k is a k × 1 vector of all ones. Note that a solution P* to the above optimization problem is an approximate left-stochastic decomposition of the matrix cK, which inspires the following terminology:

LSD Clustering: A solution P* to (4) is a cluster probability matrix, from which we form the LSD clustering by assigning the ith sample to cluster j* if P*_{j*i} > P*_{ji} for all j ≠ j*.

LSDable: We call K LSDable if the minimum of (4) is zero.

1.3. Related Kernel Definitions

Some kernel definitions are related to our idealization that a given similarity matrix is usefully modeled as cK ≈ P^T P. Cristianini et al. (Cristianini et al., 2001) defined the ideal kernel K* to be K*_{ij} = 1 if and only if X_i and X_j are from the same class. The Fisher kernel (Jaakkola & Haussler, 1998) is defined as K_{ij} = U_i^T I^{-1} U_j, where I is the Fisher information matrix and each U_i is a function of the Fisher score, computed from a generative model and data X_i and X_j.

2. Theoretical Results

To provide intuition for our algorithmic approach and the LSD clustering method, we present some theoretical results; all proofs are in the supplemental material.

The proposed LSD clustering objective (4) is the same as the objective (3), except that we constrain P to be left-stochastic, factorize a scaled version of K, and do not constrain P^T P = I. The LSD clustering can be interpreted as a soft kernel k-means clustering, as follows:

Proposition 1 (Soft Kernel K-means) Suppose K is LSDable, and let K_{ij} = φ(X_i)^T φ(X_j) for some mapping φ : R^{d×n} → R^{d_1×n} and some feature-vector matrix X ∈ R^{d×n}. Then the minimizer P* of the LSD objective (4) also minimizes the following soft kernel k-means objective:

    arg min_{F ∈ R^{d_1×k}, P ≥ 0, P^T 1_k = 1_n} ||φ(X) − FP||_F^2.    (5)

Next, Theorem 1 states that there will be multiple left-stochastic factors P, which are related by rotations about the normal to the probability simplex (which includes permuting the rows, that is, changing the cluster labels):

Theorem 1 (Factors of K Related by Rotation) Suppose K is LSDable such that K = P^T P. Then,
(a) for any matrix M ∈ R^{k×n} s.t. K = M^T M, there is a unique orthogonal k × k matrix R s.t. M = RP;
(b) for any matrix P̂ ∈ R^{k×n} s.t. P̂ ≥ 0, P̂^T 1_k = 1_n and K = P̂^T P̂, there is a unique orthogonal k × k matrix R_u s.t. P̂ = R_u P and R_u u = u, where u = (1/k)[1, ..., 1]^T is normal to the plane containing the probability simplex;
(c) a sufficient condition for P to be unique (up to a permutation of rows) is that there exist an m × m sub-matrix in K which is a permutation of the identity matrix I_m, where m ≥ ⌊k/2⌋ + 1.
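To ground these definitions and Theorem 1's rotation ambiguity in a toy numerical example, the sketch below (illustrative code, not from the paper) builds an LSDable K from a left-stochastic P, applies the LSD clustering rule (a column-wise arg max), and checks that a row-permuted factor yields the same K and the same clustering up to relabeling.

    import numpy as np

    rng = np.random.default_rng(1)
    k, n = 3, 10

    # Build a left-stochastic P (columns sum to 1); K = P^T P is then LSDable.
    P = rng.random((k, n))
    P /= P.sum(axis=0, keepdims=True)       # enforce P >= 0 and P^T 1_k = 1_n
    K = P.T @ P

    # LSD clustering rule: assign sample i to the cluster with largest probability.
    labels = P.argmax(axis=0)

    # Theorem 1: other left-stochastic factors of K are related to P by an
    # orthogonal map fixing u; e.g., permuting the rows (relabeling the clusters).
    perm = np.array([2, 0, 1])
    P_perm = P[perm]
    print(np.allclose(K, P_perm.T @ P_perm))               # True: same K, different factor
    print(np.all(perm[P_perm.argmax(axis=0)] == labels))   # True: same clustering up to relabeling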

While there may be many LSDs of a given K, they may all result in the same LSD clustering. The following theorem states that for k = 2, the LSD clustering is unique, and for k = 3, it gives tight conditions for uniqueness of the LSD clustering. The key idea of this theorem for k = 3 is illustrated in Fig. 1.

Theorem 2 (Uniqueness of an LSD Clustering) For k = 2, the LSD clustering is unique (up to a permutation of the labels). For k = 3, let α_c (α_ac) be the maximum angle of clockwise (anti-clockwise) rotation about the normal to the simplex that does not rotate any of the columns of P off the simplex; let β_c (β_ac) be the minimum angle of clockwise (anti-clockwise) rotation that changes the clustering. Then, if α_c < β_c or if α_ac < β_ac, the LSD clustering is unique, up to a permutation of labels.

Figure 1. Illustration of the conditions for uniqueness of the LSD clustering. The columns of P* are shown as points on the probability simplex for k = 3 and n = 15. The points corresponding to the three clusters are marked by 'o's, '+'s, and 'x's. The Y-shaped thick gray lines show the cluster decision regions. Rotating the columns of P* about the center of the simplex u results in a different LSD solution P̂ = R_u P such that K = P̂^T P̂. One can rotate clockwise by at most angle β_c (and counter-clockwise β_ac) before any point crosses a cluster decision boundary. Note that rotating P* by more than α_c degrees clockwise (or α_ac degrees anti-clockwise) pushes the points out of the probability simplex (no longer a legitimate LSD), but rotating by 120 degrees is a legitimate LSD that corresponds to a re-labeling of the clusters. As specified in Theorem 2, a sufficient condition for the LSD clustering to be unique (up to re-labeling) is if α_c < β_c and α_ac < β_ac.

Our next result uses perturbation theory (Stewart, 1998) to give an error bound on LSD estimation for a perturbed LSDable matrix.

Theorem 3 (Perturbation Error Bound) Suppose K is LSDable such that K = P^T P. Let K̃ = K + W, where W is a perturbation matrix with bounded Frobenius norm, ||W||_F ≤ ε. Then ||K − P̂^T P̂||_F ≤ 2ε, where P̂ minimizes (4) for K̃. Furthermore, if ||W||_F is o(λ̃_k), where λ̃_k is the kth largest eigenvalue of K̃, then

    ||P − R P̂||_F ≤ ε ( 1 + (√k / |λ̃_k|) ( C_1 ||K^{1/2}||_F + ε ) ),    (6)

where R is an orthogonal matrix and C_1 is a constant.

The error bound in (6) involves three terms: the first term captures the perturbation of the eigenvalues and scales linearly with ε; the second term involves ||K^{1/2}||_F = √(tr(K)), due to the coupling between the eigenvectors of the original matrix K and the perturbed eigenvectors as well as the perturbed eigenvalues; and the third term, proportional to ε^2, is due to the perturbed eigenvectors and perturbed eigenvalues. As expected, ||P − R P̂||_F → 0 as ε → 0, relating the LSD to the true factor by a rotation, which is consistent with Theorem 1.

Proposition 2 (Scaling Factor) Given a similarity matrix K, the scaling factor c* that solves

    arg min_{c ∈ R_+}  min_{P ∈ [0,1]^{k×n}, P^T 1_k = 1_n}  ||cK − P^T P||_F^2

is given by c* = ||(MM^T)^{-1} M 1_n||_2^2 / k, where K = M^T M is any decomposition of K.
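As a sketch of Proposition 2 (illustrative code, not from the paper): given any factorization K = M^T M, the scaling factor can be computed directly; for an LSDable K built from a left-stochastic P, the formula recovers c* = 1.

    import numpy as np

    def lsd_scaling_factor(K, k):
        """Scaling factor c* of Proposition 2, computed from an eigen-factorization
        K = M^T M restricted to the top k eigenpairs (assumed positive)."""
        evals, evecs = np.linalg.eigh(K)
        idx = np.argsort(evals)[::-1][:k]                          # top k eigenvalues
        M = np.sqrt(np.maximum(evals[idx], 0.0))[:, None] * evecs[:, idx].T  # M = Lambda^{1/2} V^T
        m = np.linalg.solve(M @ M.T, M @ np.ones(K.shape[0]))      # m = (M M^T)^{-1} M 1_n
        return np.dot(m, m) / k                                    # c* = ||m||_2^2 / k

    # Example on an LSDable K (P left-stochastic), where c* should be 1.
    rng = np.random.default_rng(2)
    P = rng.random((3, 12)); P /= P.sum(axis=0, keepdims=True)
    K = P.T @ P
    print(lsd_scaling_factor(K, 3))   # approximately 1.0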

3. Computing the LSD of K

The LSD problem stated in (4) is a nonconvex optimization in P, and is a completely positive (CP) factorization problem with an additional left-stochasticity constraint. CP problems are a subset of non-negative matrix factorization (NNMF) problems. Algorithms to solve NNMF can be applied here, such as Seidenberg's exponential-time quantifier elimination approach (Cohen & Rothblum, 1993); a multiplicative weights update (Lee & Seung, 2000); a greedy method using rank-one downdates (Biggs et al., 2008); or gradient descent (Paatero, 1999; Hoyer, 2004; Berry et al., 2007). The LSD problem could also be re-formulated and an alternating minimization approach used, as in (Paatero & Tapper, 1994; Paatero, 1997). As is typical with nonconvex problems, no known polynomial-time method provides theoretical guarantees or conditions under which a correct solution is found.

We propose a rotation-based algorithm to solve (4) that exploits the geometry of this problem. We believe this is a new approach that may also be applicable to more general CP and NNMF problems. First, to build the reader's intuition, we present an overview of the algorithm solving (4). Full algorithmic details are deferred to Section 3.2.

3.1. Algorithm Overview

We overview the algorithm steps for an LSDable K. Full algorithm details, including projections needed if K is not LSDable, are given in Section 3.2. See Fig. 2 for a pictorial description of these steps for k = 3.

Step 1: Scale the similarity matrix K by c ∈ R (given in Proposition 2) to produce K' = cK.

Step 2: Factor K' = M^T M, for some M ∈ R^{k×n}.

Step 3: Rotate M to form Q ∈ R^{k×n}, whose column vectors all lie in the hyperplane that contains the probability simplex (see Fig. 2(a)).

Step 4: Rotate Q about the vector u = (1/k)[1, ..., 1]^T to form the left-stochastic solution P*, whose column vectors all lie in the probability simplex (see Fig. 2(b)).

3.2. Full LSD Algorithm

Here we give more explicit details for each of the algorithm steps given in Sec. 3.1.

Step 1: Scale the similarity matrix K' = cK by the c given in Proposition 2.

Step 2a: Let Λ_{1:k} ∈ R^{k×k} be the diagonal matrix of the top k eigenvalues of K' and V_{1:k} ∈ R^{n×k} be the corresponding eigenvectors.

Step 2b: Set M = Λ_{1:k}^{1/2} V_{1:k}^T.

Step 2c: Compute the vector m = (MM^T)^{-1} M 1_n, which is normal to the least-squares hyperplane fit to the n columns of M, obtained by solving m = arg min_{m̂ ∈ R^k} ||M^T m̂ − 1_n||_2^2.

Step 2d: Project the columns of M onto the hyperplane normal to m that is 1/√k units away from the origin; i.e., for the jth column compute

    M̃_j = M_j + ( 1/√k − m^T M_j / ||m|| ) m/||m||.

Let M̃ = [M̃_1, ..., M̃_n].

Step 3a: Compute a k × k rotation matrix R such that R m/||m|| = u/||u||, where u = (1/k)[1, 1, ..., 1]^T, as follows. Define v = u/||u|| − ( u^T m / (||u|| ||m||) ) m/||m||. Extend { m/||m||, v/||v|| } to a basis U for R^k using Gram-Schmidt orthogonalization. Define a k × k Givens rotation matrix R_G such that (R_G)_{11} = (R_G)_{22} = u^T m / (||u|| ||m||), (R_G)_{21} = −(R_G)_{12} = u^T v / (||u|| ||v||), and for all i, j > 2, set (R_G)_{ij} = 1 if i = j and zero otherwise. Then R = U R_G U^T is a k × k rotation matrix such that R m/||m|| = u/||u||.

Step 3b: Set Q = R M̃.

Step 4a: Find a rotation R_u* about u such that the projection ⌊R_u* Q⌋ of R_u* Q onto the probability simplex minimizes the factorization error to K':

    R_u* = arg min_{R : R u = u} ||K' − ⌊RQ⌋^T ⌊RQ⌋||_F,    (7)

such that R is a rotation matrix.

Note that for k = 2, no rotation is needed, and P* = ⌊Q⌋. For k = 3 and k = 4, minimizing the objective in (7) reduces to a simple optimization over a set of parameters, as discussed in Sec. 3.3. For general k > 4, any global optimization method could be used; we use a gradient-based approach detailed in Section 3.4. Projection onto the probability simplex can be found using the algorithm given in (Michelot, 1986).

Step 4b: Output P* = ⌊R_u* Q⌋.

3.3. Further Details: Step 4a for k = 3 and k = 4

For k ≥ 3, the rotation that leaves u = (1/k)[1, ..., 1]^T fixed can be described as the composition of three rotations: (a) a rotation R_ue that takes u to e = (1/√k)[0, 0, ..., 0, 1]^T, followed by (b) a rotation R_e about e, followed by (c) the rotation R_ue^T that takes e back to u. In other words, the rotation matrix can be decomposed as R_u = R_ue^T R_e R_ue, where R_ue is a fixed matrix and the optimization is over the rotation R_e. The rotation matrix R_e has the following structure:

    R_e = [ R_{k−1}   0
            0^T       1 ],    (8)

where R_{k−1} is a (k−1) × (k−1) rotation matrix. This reduction in the dimensionality of the parameter space is due to the constraint R u = u. Thus for k = 3, the optimization is over planar rotation matrices, which can be parameterized by the angle of rotation θ ∈ [0, 2π), thereby reducing the problem to a 1-D optimization problem.

For k = 4, the rotation matrix R_e is a 3 × 3 matrix that can be parameterized by ZYZ Euler angles α, β, γ, i.e., angles of rotation about the Z, Y and Z axes, respectively: R_e(α, β, γ) = R_Z(γ) R_Y(β) R_Z(α), where α, γ ∈ [0, 2π) and β ∈ [0, π]. Therefore, the optimization in (7) reduces to a search over three parameters.
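To make Steps 1-4 concrete before moving on, here is a compact sketch of the full pipeline for k = 3 (illustrative code, not the authors' implementation). It uses a standard sort-based Euclidean projection onto the simplex as a stand-in for Michelot's algorithm, builds the Step-3 rotation with a generic plane-rotation construction rather than the Givens-based one described above, and performs Step 4a as a grid search over the angle θ of Sec. 3.3; all function names are hypothetical.

    import numpy as np

    def project_to_simplex(y):
        """Euclidean projection of a vector onto the probability simplex
        (sort-based method; a stand-in for Michelot, 1986)."""
        u = np.sort(y)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(y) + 1) > 0)[0][-1]
        tau = (1.0 - css[rho]) / (rho + 1)
        return np.maximum(y + tau, 0.0)

    def rotation_taking(a, b):
        """Rotation matrix R (det = +1) with R a = b, for unit vectors a, b.
        (A plane-rotation construction, equivalent in effect to Step 3a.)"""
        c = float(a @ b)
        w = b - c * a
        s = np.linalg.norm(w)
        if s < 1e-12:
            return np.eye(len(a))
        v = w / s
        return (np.eye(len(a))
                + (c - 1.0) * (np.outer(a, a) + np.outer(v, v))
                + s * (np.outer(v, a) - np.outer(a, v)))

    def top_k_factor(K, k):
        """Steps 2a-2b: M = Lambda_{1:k}^{1/2} V_{1:k}^T from the top k eigenpairs."""
        evals, evecs = np.linalg.eigh(K)
        idx = np.argsort(evals)[::-1][:k]
        return np.sqrt(np.maximum(evals[idx], 0.0))[:, None] * evecs[:, idx].T

    def lsd_k3(K, n_angles=720):
        """Sketch of the rotation-based LSD algorithm (Secs. 3.1-3.3) for k = 3."""
        k, n = 3, K.shape[0]
        # Step 1: scale K by c* from Proposition 2.
        M0 = top_k_factor(K, k)
        m0 = np.linalg.solve(M0 @ M0.T, M0 @ np.ones(n))
        Kp = ((m0 @ m0) / k) * K
        # Step 2: factor K' = M^T M and fit the hyperplane normal m.
        M = top_k_factor(Kp, k)
        m = np.linalg.solve(M @ M.T, M @ np.ones(n))
        mhat = m / np.linalg.norm(m)
        # Step 2d: project columns of M onto the hyperplane 1/sqrt(k) from the origin.
        Mt = M + np.outer(mhat, 1.0 / np.sqrt(k) - mhat @ M)
        # Step 3: rotate the hyperplane normal onto the simplex normal u.
        uhat = np.ones(k) / np.sqrt(k)
        Q = rotation_taking(mhat, uhat) @ Mt
        # Step 4a for k = 3: 1-D grid search over rotations about u (Sec. 3.3).
        Rue = rotation_taking(uhat, np.array([0.0, 0.0, 1.0]))
        best_err, best_P = np.inf, None
        for theta in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
            ct, st = np.cos(theta), np.sin(theta)
            Re = np.array([[ct, -st, 0.0], [st, ct, 0.0], [0.0, 0.0, 1.0]])  # Eq. (8)
            R = Rue.T @ Re @ Rue                                   # rotation with R u = u
            P = np.apply_along_axis(project_to_simplex, 0, R @ Q)  # floor(R Q)
            err = np.linalg.norm(Kp - P.T @ P, "fro")              # objective (7)
            if err < best_err:
                best_err, best_P = err, P
        return best_P, best_err

    # Usage on an LSDable toy K: the error is small and arg max gives the clustering.
    rng = np.random.default_rng(3)
    P_true = rng.random((3, 30)); P_true /= P_true.sum(axis=0, keepdims=True)
    P_est, err = lsd_k3(P_true.T @ P_true)
    print(err)                       # small (limited by the angular grid)
    print(P_est.argmax(axis=0))      # LSD clustering labels

For an LSDable K, the returned error should be near zero (limited only by the angular grid resolution), and best_P.argmax(axis=0) gives the LSD clustering labels, consistent with the assignment rule in Sec. 1.2.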

Figure 2. The proposed rotation-based LSD algorithm for an example where k = 3. (a) Rotate M to the simplex hyperplane; (b) rotate to fit the simplex. Fig. 2a shows the columns of M from Step 2, where K' = M^T M. The columns of M correspond to points in R^k, shown here as green circles in the negative orthant. If K is LSDable, the green circles would lie on a hyperplane. We scale each column of M so that the least-squares fit of a hyperplane to the columns of M is 1/√k units away from the origin (i.e., the distance of the probability simplex from the origin). We then project the columns of M onto this hyperplane, mapping the green circles to the blue circles. The normal to this best-fit hyperplane is first rotated to the vector u = (1/k)[1, ..., 1]^T, which is normal to the probability simplex, mapping the blue circles to the red circles, which are the columns of Q in Step 3. Then, as shown in Fig. 2b, we rotate the columns of Q about the normal u to best fit the points inside the probability simplex (some projection onto the simplex may be needed), mapping the red circles to black crosses. The black crosses are the columns of the solution P*.

3.4. Further Details: Step 4a for k > 4

For higher k, we used a gradient descent for Step 4a based on (Arora, 2009), as follows. Let J(R) denote the objective function in (7). Initialize the search with R_u = I and R_u* = I, set J* = J(R_u), and choose a small step size η = 0.25. Randomly choose a column (R_u Q)_j that lies outside the simplex, and repeat the following steps for a fixed number of passes (or until all columns of R_u Q lie inside the simplex):

1. Find the projection ⌊(R_u Q)_j⌋ of (R_u Q)_j onto the simplex (Michelot, 1986). Rotate (R_u Q)_j and ⌊(R_u Q)_j⌋ about the origin to lie in the plane normal to e, and then take the first k − 1 components of the rotated vectors: x = (R_ue (R_u Q)_j)_{1:k−1}, y = (R_ue ⌊(R_u Q)_j⌋)_{1:k−1}.

2. Compute the gradient of the quadratic loss function L(R_{k−1}) = ||R_{k−1} x/||x|| − y/||y||||_2^2 with respect to the rotation matrix R_{k−1}, evaluated at R_{k−1} = I:

    ∇L = 2 ( x/||x|| − y/||y|| ) ( x/||x|| )^T.

3. Compute the rotation matrix R_{k−1} that takes a small step (of size η) along the geodesic connecting x/||x|| and y/||y||: R_{k−1} = exp( −η (∇L − ∇L^T) ). Note that since the gradient is a rank-one matrix, the matrix exponential reduces to a simple quadratic form (Arora, 2009).

4. Form R_e using R_{k−1} as in (8), and update the rotation matrix R_u = (R_ue^T R_e R_ue) R_u.

5. If J(R_u) < J*, update R_u* = R_u and J* = J(R_u*).
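The geodesic update in step 3 above can be written in a few lines; the sketch below (illustrative, not the authors' code) uses scipy's matrix exponential rather than the rank-one simplification mentioned in (Arora, 2009).

    import numpy as np
    from scipy.linalg import expm

    def geodesic_step(x, y, eta=0.25):
        """One geodesic gradient step for L(R) = ||R x/||x|| - y/||y||||^2,
        evaluated at R = I (step 3 of Sec. 3.4); returns a (k-1)x(k-1) rotation."""
        xh = x / np.linalg.norm(x)
        yh = y / np.linalg.norm(y)
        grad = 2.0 * np.outer(xh - yh, xh)    # gradient of L at R = I
        A = grad - grad.T                     # skew-symmetric part
        return expm(-eta * A)                 # exponential of a skew matrix is a rotation

    # Tiny usage example with hypothetical 2-D inputs (k = 3, so k - 1 = 2).
    x = np.array([1.0, 0.2])
    y = np.array([0.6, 0.8])
    R = geodesic_step(x, y)
    print(R @ R.T)             # close to the identity: R is orthogonal
    print(np.linalg.det(R))    # close to +1: a proper rotation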

4. Experimental Results

We compare clustering algorithms that take a pairwise-similarity matrix K as input. We present two Gaussian simulations, two simulations where the clusters are defined by manifolds, and real similarity datasets from (Chen et al., 2009).

Every time the k-means algorithm is used, it is run with 20 random initializations (which are each given the Matlab default of 100 maximum iterations), and the result that minimizes the within-cluster scatter is used. K-means is implemented with Matlab's kmeans routine, except for the kernel k-means, which we implemented. The convex NMF routine used NMFlib (Grindlay). For normalized spectral clustering, we used the Ng-Jordan-Weiss version (Ng et al., 2002).

4.1. Gaussian Cluster Simulations

We ran two Gaussian cluster simulations, for k = 2 and k = 3 clusters. The first simulation generated samples iid from k = 2 Gaussians N([0 0]^T, [.5σ^2 0; 0 4σ^2]) and N([5 0]^T, [4σ^2 0; 0 .5σ^2]). Then the RBF kernel matrix with bandwidth one was computed. For each value of σ^2, we averaged results for 1000 randomly generated datasets, each of which consisted of 100 samples from each cluster.

The second simulation used the same two Gaussian clusters with σ^2 = 3, plus a third cluster generated iid from N([2 4]^T, [3 0; 0 3]). We averaged results for 1000 randomly generated datasets, each of which consisted of 50 samples from each cluster.

Figure 3. Runtimes (top) and results (bottom) for the two-cluster Gaussian simulation.

Figure 4. Results for the three-cluster Gaussian simulation.

Runtimes are shown for increasing n for the two-cluster simulation in Fig. 3. The spectral clusterings, convex NMF, and LSD clustering all have comparable runtimes, and are much faster than the linkages or kernel k-means. Theoretically, we hypothesize that the LSD clustering computational complexity is roughly the same as spectral clustering, as both will be dominated by the need to compute k eigenvectors.

Results for the two-cluster simulation are shown in Fig. 3. LSD shows similar performance to kernel k-means and convex NMF, with LSD performing slightly better for most values of σ^2. The three-cluster simulation, shown in Fig. 4, shows similar performance.

4.2. Manifold Simulations

We simulate k = 2 clusters lying on two one-dimensional manifolds in two-dimensional space, as shown in Fig. 5. For each sample, the first feature x[1] is drawn iid from a standard normal, and the second feature is x[1]^2 if the point is from the first cluster, or arctan(x[1]) − b for some b ∈ R if the point is from the second cluster. Fifty samples are randomly generated for each cluster, and results were averaged over 250 randomly generated datasets for each value of the cluster separation b. We clustered based on two kernels. The first kernel was the RBF kernel with bandwidth 1. The second kernel was a nearest-neighbor kernel K = A + A^T, where A(x_i, x_j) = 1 if x_i is one of the five nearest neighbors of x_j (includes itself as a neighbor) and A(x_i, x_j) = 0 otherwise.
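For concreteness, here is a small sketch of the two kernels used in the manifold simulation (illustrative code; the exact RBF convention and parameter names are our assumptions, not specified in the paper).

    import numpy as np

    def rbf_kernel(X, bandwidth=1.0):
        """RBF kernel K_ij = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)).
        (One common convention; the paper does not spell out its exact form.)"""
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth**2))

    def nn_kernel(X, num_neighbors=5):
        """Nearest-neighbor kernel K = A + A^T, where A_ij = 1 if x_i is one of the
        num_neighbors nearest neighbors of x_j (each point counts itself)."""
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        A = np.zeros_like(d2)
        for j in range(X.shape[0]):
            nearest = np.argsort(d2[:, j])[:num_neighbors]   # includes j itself
            A[nearest, j] = 1.0
        return A + A.T

    # One realization of the manifold data (b = 2), following Sec. 4.2.
    rng = np.random.default_rng(4)
    x1 = rng.standard_normal(50); x2 = rng.standard_normal(50)
    X = np.vstack([np.column_stack([x1, x1**2]),
                   np.column_stack([x2, np.arctan(x2) - 2.0])])
    K_rbf, K_nn = rbf_kernel(X), nn_kernel(X)
    print(K_rbf.shape, K_nn.shape)    # (100, 100) (100, 100)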

Results are shown in Fig. 6. Given the RBF kernel, LSD clustering and convex NMF outperform the other methods, followed by kernel k-means and complete linkage. Given the nearest-neighbor kernel, LSD clustering, single-linkage, and average-linkage perform best, followed by the spectral clusterings, with convex NMF and kernel k-means doing roughly 10% worse than LSD.

Figure 5. One realization of the n = 100 samples in the manifold simulation with separation b = 2: the curves y = x^2 and y = arctan(x) − 2.

Figure 6. Results for the manifold simulation: RBF kernel (top) and nearest-neighbor kernel (bottom).

4.3. Real Similarity Data

We also compared the clustering methods on ten real similarity datasets used in (Chen et al., 2009); these datasets are provided as a pairwise similarity matrix K, and the similarity definitions include human judgement, Smith-Waterman distance, and purchasing trends; see (Chen et al., 2009) for more details. These datasets have class labels for all samples, which were used as the ground truth to compute the errors shown in Table 1. While no one clustering method clearly outshines all of the others in Table 1, the proposed LSD clustering is ranked first for four of the ten datasets, and only for the Internet Ads dataset does it rank poorly (7th of 8).

5. Discussion

In this paper, we proposed a left-stochastic decomposition of a similarity matrix to produce cluster probability estimates. Compared to seven other clustering methods given the same similarity matrix as input, the LSD clustering reliably performed well across the different simulations and real benchmark datasets.

To compute the LSD, we proposed a rotation-based algorithm for the matrix factorization that may be an effective approach to solving other NNMF and CP problems. Even though the algorithm uses only the top k eigenvectors, the perturbation error bound of Theorem 3 still holds because the assumptions on the perturbation matrix ensure that there is a correspondence between the top k eigenvectors of the original and perturbed matrices.

One notable advantage of the LSD clustering algorithm is that it is deterministic for k = 2; that is, the clustering only requires computing the top two eigenvectors, a deterministic rotation to the positive orthant, and projecting to the probability simplex. This is in contrast to methods like kernel k-means and spectral clustering, which are usually implemented with multiple random initializations of k-means. As a consequence, LSD clustering can provide a deterministic top-down binary clustering tree, a topic for future research. Overall, the runtime for LSD clustering can be expected to be similar to spectral clustering.

For k = 3, we presented theoretical conditions for the uniqueness of an LSD clustering, which we believe may be satisfied in practice. We hypothesize that analogous conditions exist for higher k.

Table 1. Clustering Error Rate on Similarity Datasets

Columns (left to right): Amazon Binary, Aural Sonar, Internet Ads, Patrol, Protein, Voting, Yeast pfam 7-12, Yeast SW 5-7, Yeast SW 5-12, Yeast SW 7-12; number of clusters k = 2, 2, 2, 8, 4, 2, 2, 2, 2, 2, respectively.

LSD              .39  .14  .35  .44  .37  .10  .36  .28  .09  .10
Kernel K-Means   .33  .18  .29  .75  .28  .10  .49  .39  .10  .17
Convex NMF       .32  .11  .48  .79  .55  .10  .38  .31  .10  .10
Unnorm. Spec     .39  .49  .16  .51  .62  .09  .48  .50  .11  .50
Norm. Spec       .39  .13  .16  .79  .11  .10  .38  .31  .14  .10
Sing. Link       .39  .49  .16  .79  .66  .38  .50  .50  .50  .50
Comp. Link       .39  .47  .16  .71  .40  .04  .34  .39  .18  .33
Avg. Link        .39  .41  .16  .57  .60  .05  .47  .40  .10  .15

References

Arora, R. On learning rotations. NIPS, 2009.

Berry, M. W., Browne, M., Langville, A. N., Pauca, V. P., and Plemmons, R. J. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155-173, 2007.

Biggs, M., Ghodsi, A., and Vavasis, S. Nonnegative matrix factorization via rank-one downdate. ICML, 2008.

Chen, Y., Garcia, E. K., Gupta, M. R., Cazzanti, L., and Rahimi, A. Similarity-based classification: Concepts and algorithms. JMLR, 2009.

Cohen, J. E. and Rothblum, U. G. Nonnegative ranks, decompositions, and factorizations of nonnegative matrices. Linear Algebra and Its Applications, 190:149-168, 1993.

Cristianini, N., Shawe-Taylor, J., Elisseeff, A., and Kandola, J. On kernel target alignment. NIPS, 2001.

Ding, C., He, X., and Simon, H. D. On the equivalence of nonnegative matrix factorization and spectral clustering. SIAM Conf. Data Mining, 2005.

Ding, C., Li, T., and Jordan, M. I. Convex and semi-nonnegative matrix factorizations. IEEE Trans. PAMI, 32, 2010.

Grindlay, G. NMFlib. www.ee.columbia.edu/~grindlay.

Hofmann, T. Probabilistic latent semantic analysis. UAI, 1999.

Hoyer, P. O. Non-negative matrix factorization with sparseness constraints. JMLR, 5:1457-1469, 2004.

Jaakkola, T. and Haussler, D. Exploiting generative models in discriminative classifiers. NIPS, 1998.

Lee, D. D. and Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.

Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. NIPS, 2000.

Li, T. and Ding, C. The relationships among various nonnegative matrix factorization methods for clustering. In Proc. Intl. Conf. Data Mining, 2006.

Michelot, C. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. J. Opt. Theory Appl., 50:195-200, 1986.

Ng, A., Jordan, M., and Weiss, Y. On spectral clustering: Analysis and an algorithm. NIPS, 2002.

Paatero, P. Least-squares formulation of robust non-negative factor analysis. Chemometrics and Intell. Lab. Sys., 37:23-35, 1997.

Paatero, P. The multilinear engine: A table-driven, least squares program for solving multilinear problems, including the n-way parallel factor analysis model. J. Comp. Graph. Stat., 8(4):854-888, 1999.

Paatero, P. and Tapper, U. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111-126, 1994.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8), 2000.

Stewart, G. W. Matrix Algorithms, Volume I: Basic Decompositions. SIAM, 1998.