
Clustering by Left-Stochastic Matrix Factorization

Raman Arora [email protected]
Maya R. Gupta [email protected]
Amol Kapila [email protected]
Maryam Fazel [email protected]
University of Washington, Seattle, WA 98103, USA

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

Abstract

We propose clustering samples given their pairwise similarities by factorizing the similarity matrix into the product of a cluster probability matrix and its transpose. We propose a rotation-based algorithm to compute this left-stochastic decomposition (LSD). Theoretical results link the LSD clustering method to a soft kernel k-means clustering, give conditions for when the factorization and clustering are unique, and provide error bounds. Experimental results on simulated and real similarity datasets show that the proposed method reliably provides accurate clusterings.

1. Introduction

We propose a new non-negative matrix factorization (NNMF) model that produces a probabilistic clustering of samples given a matrix of similarities between the samples. The interpretation of non-negative factors of matrices as describing different parts of data was first given by (Paatero & Tapper, 1994) and (Lee & Seung, 1999). In this paper, we investigate a constrained NNMF problem, where the factors can be interpreted as encoding the probability of each data point belonging to different clusters of the data.

The paper is organized as follows. First, we discuss related work in clustering by matrix factorization in Section 1.1. Then we introduce the proposed left-stochastic decomposition (LSD) clustering formulation in Section 1.2. We provide a theoretical foundation for LSD in Section 2. We exploit the geometric structure present in the problem to provide a rotation-based algorithm in Section 3. Experimental results are presented in Section 4.

1.1. Related Work in Matrix Factorization

Some clustering objective functions can be written as matrix factorization objectives. Let n feature vectors be stacked into a feature-vector matrix X ∈ R^{d×n}. Consider the model X ≈ FG^T, where F ∈ R^{d×k} can be interpreted as a matrix with k cluster prototypes as its columns, and G ∈ R^{n×k} is all zeros except for one (appropriately scaled) positive entry per row that indicates the nearest cluster prototype. The k-means clustering objective follows this model with squared error, and can be expressed as (Ding et al., 2005):

    arg min_{F, G^T G = I, G ≥ 0}  ‖X − FG^T‖_F^2,    (1)

where ‖·‖_F is the Frobenius norm, and the inequality G ≥ 0 is component-wise. This follows because the combined constraints G ≥ 0 and G^T G = I force each row of G to have only one positive element. It is straightforward to show that the minimizer of (1) occurs at F* = XG, so that (1) is equivalent to (Ding et al., 2005; Li & Ding, 2006):

    arg min_{G^T G = I, G ≥ 0}  ‖X^T X − GG^T‖_F^2.    (2)

Replacing X^T X in (2) with a kernel matrix K results in the same minimizer as the kernel k-means objective (Ding et al., 2005; Li & Ding, 2006):

    arg min_{G^T G = I, G ≥ 0}  ‖K − GG^T‖_F^2.    (3)

It has been shown that normalized-cut spectral clustering (Shi & Malik, 2000) also attempts to minimize an objective equivalent to (3) as kernel k-means, but for a kernel that is a normalized version of the input kernel (Ding et al., 2005). Similarly, probabilistic latent semantic indexing (PLSI) (Hofmann, 1999) can be formulated using the matrix factorization model X ≈ FG^T, where the approximation is in terms of relative entropy (Li & Ding, 2006).
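To make the shared factorization view in (1)–(3) concrete, the sketch below (not from the paper; a minimal numpy illustration on made-up data, with a hypothetical helper indicator_matrix) builds the scaled cluster-indicator matrix G for a hard clustering, checks the constraint G^T G = I, and evaluates the kernel k-means objective (3) for the linear kernel K = X^T X.

    import numpy as np

    def indicator_matrix(labels, k):
        """Scaled cluster-indicator matrix G (n x k): one positive entry per row,
        with column j scaled by 1/sqrt(n_j) so that G^T G = I."""
        n = len(labels)
        G = np.zeros((n, k))
        for j in range(k):
            members = np.flatnonzero(labels == j)
            G[members, j] = 1.0 / np.sqrt(len(members))
        return G

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 12))                       # d x n feature-vector matrix
    labels = np.repeat([0, 1, 2], 4)                   # a hard assignment of n = 12 samples to k = 3 clusters
    G = indicator_matrix(labels, 3)

    K = X.T @ X                                        # Gram (linear-kernel) matrix
    print(np.allclose(G.T @ G, np.eye(3)))             # True: the orthogonality constraint in (1)-(3)
    print(np.linalg.norm(K - G @ G.T, "fro") ** 2)     # kernel k-means objective value from (3)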
Ding et al. (Ding et al., 2010; 2005) explored computationally simpler variants of the kernel k-means objective by removing the problematic constraint in (3) that G^T G = I, with the hope that the solutions would still have one dominant entry per row of G, which they take as the indicator of cluster membership. One such variant, termed convex NMF (Ding et al., 2010), restricts the columns of F (the cluster centroids) to be convex combinations of the columns of X. Another variant, cluster NMF (Ding et al., 2010), solves arg min_{G ≥ 0} ‖X − XGG^T‖_F^2.

1.2. LSD Clustering

We propose explicitly factorizing a matrix K ∈ R^{n×n} to produce probabilities that each of the n samples belongs to each of k clusters. We require only that K be symmetric and that its top k eigenvalues be positive; we refer to such matrices in this paper as similarity matrices. Note that a similarity matrix need not be PSD; it can be an indefinite kernel matrix (Chen et al., 2009). If the input does not satisfy these requirements, it must be modified before use with this method. For example, if the input is a graph adjacency matrix A with zero diagonal, one could replace the diagonal by each node's degree, replace A by A + αI for some constant α and the identity matrix I, or apply some other spectral modification. See (Chen et al., 2009) for further discussion of such modifications.

Given a similarity matrix K, we estimate the cluster probability matrix to be any P* that solves:

    arg min_{P ≥ 0, P^T 1_k = 1_n}  ‖cK − P^T P‖_F,    (4)

where c ∈ R is a scaling factor that depends only on K and is defined in Proposition 2, and 1_k is a k × 1 vector of all ones. Note that a solution P* to the above optimization problem is an approximate left-stochastic decomposition of the matrix cK, which inspires the following terminology:

LSD Clustering: A solution P* to (4) is a cluster probability matrix, from which we form the LSD clustering by assigning the ith sample to cluster j* if P*_{j*i} > P*_{ji} for all j ≠ j*.

LSDable: We call K LSDable if the minimum of (4) is zero.
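The LSD clustering rule above is a column-wise argmax over the cluster probability matrix. The sketch below (not from the paper; a minimal numpy illustration with hypothetical helper names) evaluates the LSD objective (4) for a given scaling c and forms the hard clustering from a left-stochastic P whose columns each sum to one.

    import numpy as np

    def lsd_objective(K, P, c=1.0):
        """Frobenius-norm LSD objective ||cK - P^T P||_F from (4)."""
        return np.linalg.norm(c * K - P.T @ P, "fro")

    def lsd_clustering(P):
        """Assign the ith sample to the cluster with the largest entry in column i of P."""
        return np.argmax(P, axis=0)

    # A small left-stochastic P (k = 3 clusters, n = 4 samples): columns sum to one.
    P = np.array([[0.7, 0.2, 0.1, 0.4],
                  [0.2, 0.7, 0.1, 0.3],
                  [0.1, 0.1, 0.8, 0.3]])
    K = P.T @ P                       # an LSDable similarity matrix (c = 1 by construction)

    print(lsd_clustering(P))          # hard cluster label for each of the n samples
    print(lsd_objective(K, P))        # 0.0, since this K is LSDable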
1.3. Related Kernel Definitions

Some kernel definitions are related to our idealization that a given similarity matrix is usefully modeled as cK ≈ P^T P. Cristianini et al. (Cristianini et al., 2001) defined the ideal kernel K* to be K*_{ij} = 1 if and only if X_i and X_j are from the same class. The Fisher kernel (Jaakkola & Haussler, 1998) is defined as K_{ij} = U_i^T I^{-1} U_j, where I is the Fisher information and each U_i is a function of the Fisher score, computed from a generative model and data X_i and X_j.

2. Theoretical Results

To provide intuition for our algorithmic approach and the LSD clustering method, we present some theoretical results; all proofs are in the supplemental material.

The proposed LSD clustering objective (4) is the same as the objective (3), except that we constrain P to be left-stochastic, factorize a scaled version of K, and do not constrain P^T P = I. The LSD clustering can be interpreted as a soft kernel k-means clustering, as follows:

Proposition 1 (Soft Kernel K-means) Suppose K is LSDable, and let K_{ij} = φ(X_i)^T φ(X_j) for some mapping φ : R^{d×n} → R^{d_1×n} and some feature-vector matrix X ∈ R^{d×n}. Then the minimizer P* of the LSD objective (4) also minimizes the following soft kernel k-means objective:

    arg min_{F ∈ R^{d_1×k}, P ≥ 0, P^T 1_k = 1_n}  ‖φ(X) − FP‖_F^2.    (5)

Next, Theorem 1 states that there will be multiple left-stochastic factors P, which are related by rotations about the normal to the probability simplex (this includes permuting the rows, that is, changing the cluster labels):

Theorem 1 (Factors of K Related by Rotation) Suppose K is LSDable such that K = P^T P. Then:
(a) for any matrix M ∈ R^{k×n} s.t. K = M^T M, there is a unique orthogonal k × k matrix R s.t. M = RP;
(b) for any matrix P̂ ∈ R^{k×n} s.t. P̂ ≥ 0, P̂^T 1_k = 1_n and K = P̂^T P̂, there is a unique orthogonal k × k matrix R_u s.t. P̂ = R_u P and R_u ũ = ũ, where ũ = (1/k)[1, ..., 1]^T is normal to the plane containing the probability simplex;
(c) a sufficient condition for P to be unique (up to a permutation of rows) is that there exist an m × m sub-matrix in K which is a permutation of the identity matrix I_m, where m ≥ ⌊k/2⌋ + 1.

While there may be many LSDs of a given K, they may all result in the same LSD clustering. The following theorem states that for k = 2 the LSD clustering is unique, and for k = 3 it gives tight conditions for uniqueness of the LSD clustering. The key idea of this theorem for k = 3 is illustrated in Fig. 1.

Theorem 2 (Uniqueness of an LSD Clustering) For k = 2, the LSD clustering is unique (up to a permutation of the labels). For k = 3, let α_c (α_ac) be the maximum angle of clockwise (anti-clockwise) rotation about the normal to the simplex that does not rotate any of the columns of P off the simplex, and let β_c (β_ac) be the minimum angle of clockwise (anti-clockwise) rotation that changes the clustering.

The LSD factorization is also stable under perturbation of the similarity matrix: if an LSDable K is perturbed to K̃ = K + W with ‖W‖_F ≤ ε, then ‖K − P̂^T P̂‖_F ≤ 2ε, where P̂ minimizes (4) for K̃. Furthermore, if ‖W‖_F is o(λ̃_k), where λ̃_k is the kth largest eigenvalue of K̃, then

    ‖P − RP̂‖_F ≤ (1 + √k / |λ̃_k|) C_1 ‖K^{1/2}‖_F ε + ε,    (6)

where R is an orthogonal matrix and C_1 is a constant.

The error bound in (6) involves three terms: the first term captures the perturbation of the eigenvalues and scales linearly with ε; the second term involves ‖K^{1/2}‖_F = √(tr(K)) due to the coupling between
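The rotation ambiguity described by Theorem 1 is easy to check numerically. The sketch below (not from the paper; a minimal numpy check on made-up data) builds an LSDable K from a random left-stochastic P with k = 3, rotates P by a small angle about the normal to the probability simplex using Rodrigues' formula, and confirms that the rotated factor reproduces the same K, still has columns summing to one, and stays nonnegative for a small enough angle.

    import numpy as np

    k, n = 3, 6
    rng = np.random.default_rng(1)
    P = rng.random((k, n)) + 0.5
    P /= P.sum(axis=0)                   # left-stochastic: each column sums to one
    K = P.T @ P                          # an LSDable similarity matrix

    # Rotation about the unit normal u of the probability simplex (Rodrigues' formula).
    u = np.ones(k) / np.sqrt(k)
    theta = 0.05                         # small rotation angle (radians)
    U = np.array([[0.0, -u[2], u[1]],
                  [u[2], 0.0, -u[0]],
                  [-u[1], u[0], 0.0]])   # cross-product matrix of u
    R = np.cos(theta) * np.eye(k) + np.sin(theta) * U + (1 - np.cos(theta)) * np.outer(u, u)

    P_rot = R @ P
    print(np.allclose(P_rot.T @ P_rot, K))    # True: the rotated factor gives the same K (Theorem 1)
    print(np.allclose(P_rot.sum(axis=0), 1))  # True: columns still sum to one, since R fixes u
    print(P_rot.min() >= 0)                   # True here: entries stay nonnegative for this small angle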