Clustering by Low-Rank Doubly Stochastic Matrix Decomposition

Zhirong Yang ([email protected]), Department of Information and Computer Science, Aalto University, 00076, Finland
Erkki Oja ([email protected]), Department of Information and Computer Science, Aalto University, 00076, Finland

Abstract

Clustering analysis by nonnegative low-rank approximations has achieved remarkable progress in the past decade. However, most approximation approaches in this direction are still restricted to matrix factorization. We propose a new low-rank learning method to improve the clustering performance, which is beyond matrix factorization. The approximation is based on a two-step bipartite random walk through virtual cluster nodes, where the approximation is formed by only cluster assigning probabilities. Minimizing the approximation error measured by Kullback-Leibler divergence is equivalent to maximizing the likelihood of a discriminative model, which endows our method with a solid probabilistic interpretation. The optimization is implemented by a relaxed Majorization-Minimization algorithm that is advantageous in finding good local minima. Furthermore, we point out that the regularized algorithm with Dirichlet prior only serves as initialization. Experimental results show that the new method has strong performance in clustering purity for various datasets, especially for large-scale manifold data.

1. Introduction

Cluster analysis assigns a set of objects into groups so that the objects in the same cluster are more similar to each other than to those in other clusters. Optimization of most clustering objectives is NP-hard and relaxation to "soft" clustering is often required. A nonnegativity constraint, together with various low-rank matrix approximation objectives, has widely been used for the relaxation purpose in the past decade.

The most popular nonnegative low-rank approximation method is Nonnegative Matrix Factorization (NMF). It finds a matrix that approximates the similarities and can be factorized into several nonnegative low-rank matrices. NMF was originally applied to vectorial data, where Ding et al. (2010) have shown that NMF is equivalent to the classical k-means method. Later NMF was applied to the (weighted) graph given by the pairwise similarities. For example, Ding et al. (2008) presented Nonnegative Spectral Cuts by using a multiplicative algorithm; Arora et al. (2011) proposed Left Stochastic Decomposition, which approximates a similarity matrix by a left-stochastic matrix under Euclidean distance. Another stream in the same direction is topic modeling. Hofmann (1999) gave a generative model in Probabilistic Latent Semantic Indexing (PLSI) for counting data, which is essentially equivalent to NMF using Kullback-Leibler (KL) divergence and tri-factorizations. Bayesian treatment of PLSI by using a Dirichlet prior was later introduced by Blei et al. (2001). Symmetric PLSI with the same Bayesian treatment is called the Interaction Component Model (ICM) (Sinkkonen et al., 2008).

Despite remarkable progress, the above relaxation approaches are still not fully satisfactory in all of the following requirements that affect the clustering performance of nonnegative low-rank approximation: (1) an approximation error measure that takes sparse similarities into account, (2) the decomposition form of the approximating matrix, whose decomposing matrices should contain just enough parameters for clustering but not more, and (3) normalization of the approximating matrix, which ensures relatively balanced clusters and an equal contribution from each data sample. Lacking one or more of these dimensions can severely affect clustering performance.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

In this paper we present a new nonnegative low-rank approximation method for clustering which satisfies all of the above three requirements. First, because datasets often lie in curved manifolds such that only similarities in a small neighborhood are reliable, we adopt the KL-divergence to handle the resulting sparsity. Second, different from PLSI, we enforce an equal contribution of every data sample and then directly construct the decomposition over the probabilities from samples to clusters. Third, these probabilities form the only decomposing matrix to be learned in our approach and directly give the answer for probabilistic clustering. Furthermore, our decomposition leads to a doubly stochastic approximating matrix, which was shown to be desirable for balanced graph cuts (Zass & Shashua, 2006). We name our new method DCD because it is based on Data-Cluster-Data random walks.

In order to solve the DCD learning objective, we propose a novel relaxed Majorization-Minimization algorithm to handle the new matrix decomposition type. Our relaxation strategy works robustly in finding satisfactory local optimizers under the stochasticity constraint. Furthermore, we argue that complexity control such as Bayesian priors only provides initialization for the new algorithm. This eliminates the problem of hyperparameter selection in the prior.

Empirical comparison with NMF and other graph-based clustering approaches demonstrates that our method can achieve the best or nearly the best clustering purity in all tasks. For some datasets, the new method significantly improves the state-of-the-art.

After this introductory part, we present the new method in Section 2, including its learning objective, probabilistic model, optimization and initialization techniques. In Section 3, we point out the connections and differences between our method and other recent related work. Experimental settings and results are given in Section 4. Finally, we conclude the paper and discuss some future work in Section 5.

2. Clustering by DCD

Suppose the similarities between n data samples are precomputed and given in a nonnegative symmetric matrix A. This matrix can be seen as the (weighted) affinity of an undirected similarity graph where each node corresponds to a data sample (data node). A clustering analysis algorithm takes such input and divides the data nodes into r disjoint subsets. In probabilistic clustering analysis, we want to find P(k|i), the probability of assigning the ith sample to the kth cluster, where i = 1, ..., n and k = 1, ..., r. In the following, i, j and v stand for data sample (node) indices while k and l stand for cluster indices.

2.1. Learning objective

Some of our work was inspired by the AnchorGraph (Liu et al., 2010), which was used for large-scale approximative graph construction based on a two-step random walk between data nodes through a set of anchor nodes. Note that AnchorGraph is not a clustering method.

If we augment the input similarity graph by r cluster nodes, the cluster assigning probabilities can be seen as single-step random walk probabilities from data nodes to the augmented cluster nodes. Without preference to any particular samples, we impose the uniform prior P(i) = 1/n over the data nodes. By this prior, the reversed random walk probabilities can be calculated by the Bayes formula

P(i|k) = P(k|i)P(i) / Σ_v P(k|v)P(v) = P(k|i) / Σ_v P(k|v).   (1)

Consider next the probability of two-step random walks from the ith data node to the jth data node via all cluster nodes (DCD random walk):

P(i|j) = Σ_k P(i|k)P(k|j) = Σ_k P(k|i)P(k|j) / Σ_v P(k|v).   (2)

This probability defines another similarity between two data nodes, Â_ij = P(i|j), with respect to the cluster nodes. Note that this matrix has rank at most r. The learning target is now to find a good approximation between the input similarities and the DCD random walk probabilities:

A ≈ Â.   (3)

AnchorGraph does not provide any error measure for the above approximation.
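As a concrete illustration (ours, not part of the paper's materials), the DCD similarities of Eq. (2) can be computed from the assignment matrix W in a few lines of NumPy; the function name and the sanity checks below are our own:

```python
import numpy as np

def dcd_similarity(W):
    """Two-step Data-Cluster-Data random-walk probabilities, Eq. (2).

    W : (n, r) array with nonnegative rows summing to one, W[i, k] = P(k|i).
    Returns the (n, n) matrix A_hat with A_hat[i, j] = P(i|j).
    """
    s = W.sum(axis=0)          # s_k = sum_v P(k|v), the column sums
    return (W / s) @ W.T       # A_hat_ij = sum_k P(k|i) P(k|j) / s_k

# Quick check of the properties claimed in the text: A_hat is symmetric,
# has rank at most r, and (as the paper notes after Eq. (6)) is doubly stochastic.
rng = np.random.default_rng(0)
W = rng.random((6, 3))
W /= W.sum(axis=1, keepdims=True)
A_hat = dcd_similarity(W)
assert np.allclose(A_hat, A_hat.T)
assert np.allclose(A_hat.sum(axis=0), 1.0)
```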
A conventional choice in NMF is the squared Euclidean distance, which employs the underlying assumption that the noise is additive and Gaussian.

In real-world clustering tasks for multivariate datasets, data points often lie in a curved manifold. Consequently, similarities based on Euclidean distances are reliable only in a small neighborhood. Such locality causes high sparsity in the input similarity matrix. Sparsity is also common in real-world network data. Because of this sparsity, the Euclidean distance is improper for the approximation in Eq. (3), because additive Gaussian noise would lead to a dense observed graph. In contrast, the (generalized) Kullback-Leibler divergence is more suitable for the approximation. The underlying Poisson noise characterizes the rare occurrences that are present in our sparse input. We can now formulate our learning objective as the following optimization problem:

min_{W ≥ 0}  D_KL(A || Â) = Σ_ij ( A_ij log(A_ij / Â_ij) − A_ij + Â_ij )   (4)

s.t.  Σ_k W_ik = 1,  i = 1, ..., n,   (5)

where we write W_ik = P(k|i) for convenience and thus

Â_ij = Σ_k W_ik W_jk / Σ_v W_vk.   (6)

Note that Â is symmetric, as it is easy to verify that P(i|j) = P(j|i). Therefore, Â is also doubly stochastic because it is left-stochastic by the probability definition.
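A matching sketch (again ours) of the objective in Eq. (4); zero entries of A contribute nothing to the first term, so the sum is taken over the nonzero entries only:

```python
import numpy as np

def dcd_kl_error(A, W, eps=1e-15):
    """Generalized KL divergence D_KL(A || A_hat) of Eq. (4).

    A : (n, n) nonnegative symmetric similarity matrix (dense here for clarity).
    W : (n, r) row-stochastic matrix with W[i, k] = P(k|i).
    """
    A_hat = (W / W.sum(axis=0)) @ W.T
    nz = A > 0                                     # 0 * log 0 is treated as 0
    kl = np.sum(A[nz] * np.log(A[nz] / (A_hat[nz] + eps)))
    return kl - A.sum() + A_hat.sum()
```

For the sparse graphs the paper targets, A would in practice be stored in a sparse format and the first sum restricted to its stored entries; the dense version above only illustrates the formula.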
2.2. Probabilistic model

The optimization objective has an analogous statistical model to PLSI. Dropping the constant terms from D_KL(A||Â), the objective is equivalent to maximizing

Σ_ij A_ij log Â_ij.   (7)

This can be identified as the log-likelihood of the following generative model if the A_ij are integers: for t = 1, ..., T, add one to entry (i, j) ~ Multinomial((1/n)Â, 1), whose likelihood is given by

p(A) = Π_{t=1}^{T} (1/n) Â_{i_t j_t} = Π_{ij} ( (1/n) Â_ij )^{A_ij},

where (i_t, j_t) denotes the entry drawn at step t and T = Σ_ij A_ij.

The above model simply uses a uniform prior on the rows of W. It does not prevent us from using informative priors or complexity control. A natural choice for probabilities is the Dirichlet distribution (α > 0)

p(W_i1, ..., W_ir | α) = Γ(rα) / [Γ(α)]^r  Π_{k=1}^{r} W_ik^{α−1},   (8)

which is also the conjugate prior of the multinomial distribution. The Dirichlet prior reduces to the uniform prior when α = 1.

Although it is possible to construct a multi-level graphical model similar to the Dirichlet process topic model, we emphasize that the smallest approximation error (or perplexity) is our final goal. The Dirichlet prior is used only in order to ease the optimization. Therefore we do not employ more complex generative models; see Section 2.4 for more discussion.

2.3. Optimization

The optimization problem with the Dirichlet prior on W is equivalent to minimizing

J(W) = − Σ_ij A_ij log Â_ij − (α − 1) Σ_ik log W_ik.   (9)
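For reference, a small sketch (ours) of the regularized objective J(W) in Eq. (9). When W satisfies the constraint of Eq. (5) and α = 1, the Dirichlet term vanishes and J differs from the KL error of Eq. (4) only by a constant that does not depend on W, so either quantity can be used to monitor the optimization described next:

```python
import numpy as np

def dcd_objective(A, W, alpha=1.0, eps=1e-15):
    """J(W) = -sum_ij A_ij log A_hat_ij - (alpha - 1) sum_ik log W_ik, Eq. (9)."""
    A_hat = (W / W.sum(axis=0)) @ W.T
    loglik = np.sum(A * np.log(A_hat + eps))
    log_prior = (alpha - 1.0) * np.sum(np.log(W + eps))
    return -loglik - log_prior
```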

There are two ways to handle the constraint Eq. (5). First, one can develop a multiplicative algorithm by the procedure proposed by Yang & Oja (2011), neglecting the stochasticity constraint, and then normalize the rows of W after each update. However, the optimization in this way easily gets stuck in poor local minima in practice.

Here we instead employ a relaxing strategy to handle the constraint. We first introduce Lagrangian multipliers for the constraints:

L(W, λ) = J(W) + Σ_i λ_i ( Σ_k W_ik − 1 ).   (10)

Unlike traditional constrained optimization that solves the fixed-point equations, we employ a heuristic to find the multipliers λ. Denote by ∇ = ∇+ − ∇− the gradient of J with respect to W, where ∇+ and ∇− are the positive and (unsigned) negative parts, respectively. This suggests a fixed-point update rule for W:

W'_ik = W_ik (∇−_ik − λ_i) / ∇+_ik.   (11)

Imposing Σ_k W'_ik = 1, we obtain

λ_i = (b_i − 1) / a_i,   (12)

where a_i and b_i are given in Algorithm 1. Next we show that the augmented objective Eq. (10) decreases after each iteration with the above λ.

Algorithm 1 Relaxed MM Algorithm for DCD
  Input: similarity matrix A, number of clusters r, nonnegative initial guess of W.
  repeat
    Z_ij = A_ij ( Σ_k W_ik W_jk / Σ_v W_vk )^{-1}
    s_k = Σ_v W_vk
    ∇−_ik = 2 (ZW)_ik s_k^{-1} + α W_ik^{-1}
    ∇+_ik = (W^T Z W)_kk s_k^{-2} + W_ik^{-1}
    a_i = Σ_l W_il / ∇+_il,   b_i = Σ_l W_il ∇−_il / ∇+_il
    W_ik ← W_ik (∇−_ik a_i + 1) / (∇+_ik a_i + b_i)
  until W is unchanged
  Output: cluster assigning probabilities W.

Theorem 1. Denote by W^new the updated matrix after each iteration. It holds that L(W^new, λ) ≤ L(W, λ) with λ_i = (b_i − 1)/a_i.

Proof. The algorithm construction mainly follows the Majorization-Minimization procedure (see e.g. Yang & Oja, 2011). We use W and W̃ to distinguish the current estimate and the variable, respectively.

(Majorization) Let

φ_ijk = ( W_ik W_jk / Σ_v W_vk ) / ( Σ_l W_il W_jl / Σ_v W_vl ).

Then

L(W̃, λ) ≤ − Σ_ijk A_ij φ_ijk [ log W̃_ik + log W̃_jk − log Σ_v W̃_vk ] − (α − 1) Σ_ik log W̃_ik + Σ_ik λ_i W̃_ik + C_1
         ≤ − Σ_ijk A_ij φ_ijk [ log W̃_ik + log W̃_jk − Σ_v W̃_vk / Σ_v W_vk ] − (α − 1) Σ_ik log W̃_ik + Σ_ik λ_i W̃_ik + C_2
         ≤ − Σ_ijk A_ij φ_ijk [ log W̃_ik + log W̃_jk − Σ_v W̃_vk / Σ_v W_vk ] − (α − 1) Σ_ik log W̃_ik + Σ_ik λ_i W̃_ik
             + Σ_ik ( 1/a_i + 1/W_ik ) W_ik ( W̃_ik/W_ik − log(W̃_ik/W_ik) − 1 ) + C_2
         ≡ G(W̃, W),

where C_1 and C_2 are constants irrelevant to the variable W̃. The first two inequalities follow the CCCP majorization (Yang & Oja, 2011), using the convexity and concavity of −log(·) and log(·), respectively. The third inequality is the "moving term" technique used in multiplicative updates (Yang & Oja, 2010): it adds the same quantity 1/a_i + 1/W_ik to both the numerator and the denominator of the resulting update in order to guarantee that the updated matrix entries remain positive; the added term is nonnegative and vanishes at W̃ = W, so the upper bound remains valid. All the above upper bounds are tight at W̃ = W, i.e. G(W, W) = L(W, λ).

(Minimization)

∂G/∂W̃_ik = ∇+_ik − ∇−_ik (W_ik / W̃_ik) + λ_i + ( 1/a_i + 1/W_ik ) ( 1 − W_ik / W̃_ik )
          = − (W_ik / W̃_ik) ( ∇−_ik + 1/a_i ) + ( ∇+_ik + b_i / a_i ).

Setting the gradient to zero gives

W^new_ik = W_ik ( ∇−_ik + 1/a_i ) / ( ∇+_ik + b_i/a_i ).   (13)

Multiplying both the numerator and the denominator by a_i gives the last update rule in Algorithm 1. Therefore L(W^new, λ) ≤ G(W^new, W) ≤ G(W, W) = L(W, λ).
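The following NumPy sketch of Algorithm 1 is our own illustration, not the authors' code; it assumes a dense similarity matrix, a fixed maximum number of iterations, and a simple change-based stopping rule in place of "until W is unchanged":

```python
import numpy as np

def dcd_relaxed_mm(A, r, alpha=1.0, max_iter=10000, tol=1e-9, W0=None, seed=0):
    """Relaxed Majorization-Minimization update of Algorithm 1 (sketch).

    A : (n, n) nonnegative symmetric similarity matrix.
    r : number of clusters.
    Returns W with W[i, k] approximately equal to P(k|i).
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    W = rng.random((n, r)) if W0 is None else W0.astype(float).copy()
    W /= W.sum(axis=1, keepdims=True)               # start near the simplex

    for _ in range(max_iter):
        s = W.sum(axis=0)                           # s_k = sum_v W_vk
        A_hat = (W / s) @ W.T                       # Eq. (6)
        Z = A / np.maximum(A_hat, 1e-15)            # Z_ij = A_ij / A_hat_ij
        ZW = Z @ W
        grad_neg = 2.0 * ZW / s + alpha / W                         # unsigned negative part
        grad_pos = ((W.T @ Z) * W.T).sum(axis=1) / s**2 + 1.0 / W   # positive part
        a = (W / grad_pos).sum(axis=1)                              # a_i
        b = (W * grad_neg / grad_pos).sum(axis=1)                   # b_i
        W_new = W * (grad_neg * a[:, None] + 1.0) / (grad_pos * a[:, None] + b[:, None])
        if np.max(np.abs(W_new - W)) < tol:
            return W_new
        W = W_new
    return W
```

Rows of W are not renormalized after each update; as discussed below, the relaxed iteration itself drives them toward the probability simplex.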

Algorithm 1 jointly minimizes the approximation error and drives the rows of W towards the probability simplex. The Lagrangian multipliers are selected automatically by the algorithm, without extra human tuning labor. The quantities b_i are the row sums of the unconstrained multiplicative learning result, while the quantities a_i balance between the gradient learning force and the attraction of the probability simplex. Besides its convenience, we find that this relaxation strategy works more robustly than brute-force normalization after each iteration.

2.4. Initialization

The optimization problems of many clustering analysis methods, including ours, are non-convex. Usually finding the global optimum is very expensive or even NP-hard. When local optimizers are used, the optimization trajectory can easily get stuck in poor local optima if the algorithm starts from an arbitrary random guess. Proper initialization is thus needed to achieve satisfactory performance.

The cost of the initialization should be much cheaper than the main algorithm. There are two popular choices: k-means and Normalized Cut (Ncut). The first one can only be applied to vectorial data and could be slow for large-scale high-dimensional data. Here we employ the second initialization method. While the original Ncut is NP-hard, the relaxed Ncut problem can be efficiently solved via spectral methods (Shi & Malik, 2000). Furthermore, it is particularly suitable for sparse graph input, which is our focus in this paper.

Besides Ncut, we emphasize that the minimal approximation error is our sole learning objective, and all regularized versions, e.g. with different Dirichlet priors, only serve as initialization. This is because clustering analysis, unlike supervised learning problems, does not need to provide inference for unseen data. That is, complexity control such as Bayesian priors is not meant for better generalization performance, but for a better-shaped objective space that facilitates optimization. In this sense, we can use the results of diverse regularized versions, or even of other clustering algorithms, as starting guesses, and pick the one with the smallest approximation error among multiple runs.

In implementation, we first convert an initialization clustering result to an n × r binary indicator matrix, and then add a small positive perturbation to all entries. Next, the perturbed matrix is fed to our optimization algorithm (with α = 1 in Algorithm 1).
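A possible implementation of this initialization scheme, loosely following Sections 2.4 and 4.1 and reusing the dcd_relaxed_mm and dcd_kl_error sketches above (ours; the Ncut labels are assumed to come from any external spectral clustering routine, the 0.2 perturbation and the 10,000-iteration budget follow Section 4.1, and the 2,000-iteration warm-up is an arbitrary choice of ours):

```python
import numpy as np

def indicator_from_labels(labels, r):
    """n x r binary cluster indicator matrix built from hard labels 0..r-1."""
    H = np.zeros((len(labels), r))
    H[np.arange(len(labels)), labels] = 1.0
    return H

def dcd_init_and_run(A, r, ncut_labels, priors=(1.0, 1.2, 2.0, 5.0), eps=0.2):
    """Run DCD from several Dirichlet-regularized starting points and keep
    the solution with the smallest approximation error, Eq. (4)."""
    W0 = indicator_from_labels(ncut_labels, r) + eps       # perturbed Ncut indicator
    W0 /= W0.sum(axis=1, keepdims=True)
    best_W, best_err = None, np.inf
    for alpha in priors:
        # Regularized run used only to shape a starting point ...
        W_start = dcd_relaxed_mm(A, r, alpha=alpha, max_iter=2000, W0=W0)
        # ... followed by the plain (alpha = 1) objective of Eq. (4).
        W = dcd_relaxed_mm(A, r, alpha=1.0, max_iter=10000, W0=W_start)
        err = dcd_kl_error(A, W)
        if err < best_err:
            best_W, best_err = W, err
    return best_W
```

Hard cluster assignments are then obtained by taking the argmax of each row of the returned W.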
3. Related Work

Our method intersects with several other machine learning approaches. Here we discuss some of these directions, pinpointing the connections and our new contributions.

3.1. Topic Model

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. An early topic model was PLSI (Hofmann, 1999), which maximizes the following log-likelihood for a symmetric input A:

Σ_ij A_ij log Σ_k P(k) P(i|k) P(j|k).   (14)

One can see that PLSI has a similar form to Eq. (7). Both objectives can be equivalently expressed as nonnegative low-rank approximation using the KL-divergence. The major difference is the decomposition form of the approximating matrix. There are two ways to model the hierarchy between the latent variables and the observed ones. Topic models use the purely generative way, while our method employs the discriminative way. PLSI gives the clustering results indirectly: one should apply the Bayes formula to evaluate P(k|i) from P(i|k) and P(k). There are n × r − 1 free parameters to be learned in the latter two quantities. In contrast, our method directly learns the cluster assigning probabilities P(k|i), which contain only n × (r − 1) free parameters. This difference can be large when there are only a few clusters (e.g. r = 2 or r = 3).

It is known that the performance of PLSI can be improved by using Bayesian non-parametric modeling. The Bayesian treatment of the symmetric version of PLSI leads to the Interaction Component Model (Sinkkonen et al., 2008). It associates Dirichlet priors with the PLSI factorizing matrices and then makes use of the conjugacy between the Dirichlet and multinomial distributions to derive collapsed Gibbs sampling or variational optimization methods.

An open problem of Bayesian methods is how to determine the hyperparameters that control the priors. Asuncion et al. (2009) found that wrongly chosen parameters can lead to only mediocre or even poor performance. The automatic hyperparameter updating method proposed by Minka (2000) does not necessarily lead to good solutions in terms of perplexity (Asuncion et al., 2009) or clustering purity in our experiments (see Section 4). Hofmann (1999) and Asuncion et al. (2009) suggested selecting the hyperparameters using the smallest approximation error on some held-out matrix entries, which is however more costly and might weaken or even break the cluster structure.

By contrast, there is no such prior hyperparameter selection problem in our method. The algorithms using various priors only play their role in the initialization. Among the runs with different starting points, we simply select the one with the smallest approximation error.
3.2. Nonnegative Matrix Factorization

Nonnegative Matrix Factorization is one of the earliest methods for relaxing clustering problems by nonnegative low-rank approximation (see e.g. Xu et al., 2003). The research on NMF also opened the door for multiplicative majorization-minimization algorithms for optimization over nonnegative matrices. In the original NMF, an input nonnegative matrix X is approximated by a product of two low-rank matrices W and H. Later, researchers found that more constraints or normalizations should be imposed on the factorizing matrices to achieve the desired performance.

Orthogonality is a popular choice (see e.g. Ding et al., 2006) for obtaining highly sparse factorizing matrices, especially the cluster indicator matrix. However, the orthogonality constraint seems exclusive of other constraints or priors. In practice, orthogonality favors the Euclidean distance as the approximation error measure for the sake of simple update rules, which is against our requirement for sparse graph input.

Stochasticity seems more natural for relaxing hard clustering to probabilities. Recently, Arora et al. (2011) proposed a symmetric NMF using left-stochastic factorizing matrices, called LSD. Their method also directly learns the cluster assigning probabilities. However, LSD is restricted to the Euclidean distance.

Our method has two major differences from LSD. First, we use the Kullback-Leibler divergence, which is more suitable for sparse graph input or curved manifold data. This also enables us to make use of the Dirichlet-multinomial conjugacy. Second, our decomposition has a good interpretation in terms of a random walk. Furthermore, imbalanced clustering is implicitly penalized because of the denominator in Eq. (6).

3.3. AnchorGraph

DCD uses the same matrix decomposition form as AnchorGraph. However, there are several major differences between the two methods. First of all, AnchorGraph is not made for clustering, but for constructing the graph input. AnchorGraph has no learning objective that captures the global structure of the data, such as clusters. Each row of the decomposing matrix in AnchorGraph is learned individually and only encodes local information; there is no learning over the decomposing matrix as a whole. Furthermore, anchors are either selected from the data samples or pre-learned by e.g. k-means. By contrast, the cluster nodes in our formulation are virtual. They are not vectors and need no physical storage.

4. Experiments

4.1. Compared methods

We have compared our method with eight other clustering algorithms that can take a symmetric nonnegative similarity matrix as input. The compared algorithms range from classical to state-of-the-art methods with various principles: graph cuts, including Normalized Cut (Ncut) (Shi & Malik, 2000), Nonnegative Spectral Cut (NSC) (Ding et al., 2008), and 1-Spectral ratio Cheeger cut (1-Spec) (Hein & Bühler, 2010); nonnegative matrix factorization, including Projective NMF (PNMF) (Yang & Oja, 2010), Symmetric 3-Factor Orthogonal NMF (ONMF) (Ding et al., 2006), and Left-Stochastic Decomposition (LSD) (Arora et al., 2011); and topic models, including Probabilistic Latent Semantic Indexing (PLSI) (Hofmann, 1999) and the Interaction Component Model (ICM) (Sinkkonen et al., 2008).

The detailed settings of the compared methods are as follows. We implemented NSC, PNMF, ONMF, LSD, PLSI, and DCD using multiplicative updates. For these methods, we ran their update rules for 10,000 iterations to ensure that all algorithms have sufficiently converged. We used the default setting for 1-Spec. ICM uses collapsed Gibbs sampling, where each round of the sampling sweeps the graph once. We ran the ICM sampling for 100,000 rounds to ensure that the MCMC burn-in has converged (this took about one day for the largest dataset). The hyperparameters in ICM are automatically adjusted by using Minka's method (Minka, 2000).

Despite mediocre results, Ncut runs very fast and gives fairly stable outputs. We thus used it for initialization. After obtaining the Ncut cluster indicator matrix, we add 0.2 to all entries and feed it as the starting point to the other methods, which is a common initialization setting for NMF methods. The other three initialization points for our method are provided by Ncut followed by DCD using three different Dirichlet priors (α = 1.2, α = 2, and α = 5). The clustering result of our method is reported for the run with the smallest approximation error, see Eq. (4).

4.2. Datasets

The performance of the clustering methods was evaluated using real-world datasets. In particular, we focus on data that lie in a curved manifold. We thus selected 15 such datasets which are publicly available from a variety of domains. The data sources are given in the supplemental document. The statistics of the selected datasets are summarized in Table 1.

Table 1. Statistics of the selected datasets.

Dataset     #samples  #classes
Amazon      96        2
Iris        150       3
Votes       435       2
ORL         400       40
PIE         1166      53
YaleB       1292      38
Coil20      1440      20
Isolet      1559      26
Mfeat       2000      10
Webkb4      4196      4
7sectors    4556      7
USPS        9298      10
PenDigits   10992     10
LetReco     20000     26
MNIST       70000     10

Table 2. Clustering purities for the compared methods on various data sets.

Dataset     Ncut   PNMF   NSC    ONMF   PLSI   LSD    1-Spec  ICM    DCD

Amazon      0.63   0.76   0.63   0.63   0.63   0.68   0.63    0.63   0.78
Iris        0.90   0.93   0.90   0.33   0.91   0.97   0.91    0.97   0.97
Votes       0.72   0.72   0.72   0.73   0.73   0.72   0.72    0.73   0.73
ORL         0.81   0.82   0.82   0.03   0.83   0.81   0.80    0.20   0.83
PIE         0.67   0.66   0.68   0.02   0.68   0.69   0.64    0.12   0.68
YaleB       0.45   0.43   0.46   0.03   0.51   0.45   0.39    0.10   0.51
Coil20      0.81   0.71   0.82   0.05   0.82   0.78   0.75    0.63   0.81
Isolet      0.57   0.55   0.56   0.04   0.58   0.57   0.57    0.36   0.58
Mfeat       0.75   0.77   0.79   0.10   0.77   0.78   0.80    0.69   0.78
Webkb4      0.54   0.41   0.54   0.40   0.59   0.62   0.40    0.49   0.62
7sectors    0.25   0.29   0.25   0.24   0.37   0.35   0.25    0.38   0.41
USPS        0.74   0.75   0.74   0.77   0.73   0.79   0.74    0.60   0.81
PenDigits   0.80   0.78   0.80   0.10   0.80   0.86   0.80    0.52   0.89
LetReco     0.24   0.25   0.23   0.04   0.28   0.29   0.18    0.21   0.32
MNIST       0.77   0.74   0.79   0.11   0.79   0.76   0.88    0.95   0.97

In brief, Amazon contains book similarities according to amazon.com buying records; Votes are voting records of the US congress by two different parties; ORL, PIE and YaleB are face images collected under different conditions; Coil20 are small toy images; Isolet and LetReco are English letter recognition datasets; Webkb4 and 7sectors are text document collections; Mfeat, USPS, PenDigits and MNIST are handwritten digit images.

We preprocessed the above datasets to produce the similarity graph input, except Amazon which is already in sparse graph format. We extracted scattering features (Mallat, 2012) for the image data, except Isolet and Mfeat which have their own feature representations. We used Tf-Idf features for the text documents. After feature extraction, we constructed a K-Nearest-Neighbor (KNN) graph for each dataset. We set K = 5 for the six smallest datasets (except Amazon) and K = 10 for the other datasets. We then symmetrized and binarized the KNN graph B to obtain the input similarities A (i.e. A_ij = 1 if B_ij = 1 or B_ji = 1, and A_ij = 0 otherwise).

4.3. Results

The clustering performance of the compared methods is evaluated by the clustering purity

purity = (1/n) Σ_{k=1}^{r} max_{1≤l≤r} n_k^l,   (15)

where n_k^l is the number of data samples in cluster k that belong to ground-truth class l. A larger purity in general corresponds to a better clustering result. The clustering purities for the compared methods are given in Table 2.

Our method has strong performance in terms of clustering purity. DCD wins on 12 out of the 15 selected datasets. Even for the other three datasets, DCD is the first or second runner-up, with purities tied with or very close to the winner.

The new method is particularly advantageous for large datasets. Note that the datasets in Table 2 are ordered by their sizes. We can see that there are some other winners or joint winners for smaller datasets, for example, LSD for the PIE faces or 1-Spec for the Mfeat digits. PLSI performs quite similarly to DCD for these small clustering tasks. However, DCD demonstrates a clear win over the other methods for the five largest datasets.

DCD has remarkable performance for the largest dataset, MNIST. In this case, clustering as unsupervised learning by our method has achieved a classification accuracy (i.e. purity) very close to many modern supervised approaches¹, whereas we only need ten labeled samples to remove the cluster-class permutation ambiguity.

¹ See http://yann.lecun.com/exdb/mnist/
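For completeness, a short sketch (ours) of the purity measure of Eq. (15), assuming hard assignments obtained as the argmax of each row of W:

```python
import numpy as np

def clustering_purity(pred_labels, true_labels):
    """Purity, Eq. (15): each cluster is credited with its most frequent class."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    total = 0
    for k in np.unique(pred_labels):
        _, counts = np.unique(true_labels[pred_labels == k], return_counts=True)
        total += counts.max()
    return total / len(true_labels)

# e.g. purity = clustering_purity(W.argmax(axis=1), ground_truth_classes)
```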
5. Conclusions

We have presented a new clustering method based on nonnegative low-rank approximation with three major original contributions: (1) a novel decomposition approach for the approximating matrix, derived from a two-step random walk; (2) a relaxed majorization-minimization algorithm for finding better approximating matrices; and (3) a strategy that uses regularization with the Dirichlet prior as initialization. Experimental results showed that our method works robustly on the selected datasets and can improve the clustering purity for large manifold datasets.

There are some other dimensions that affect clustering performance. Our practice indicates that initialization can play an important role, because most current algorithms are only local optimizers. Using the Dirichlet prior is only one way to smooth the objective function space. One could use other priors or regularization techniques to achieve better initializations.

Another dimension is the input graph. We have focused on the grouping procedure given that the similarities are precomputed. One should notice that better features or a better similarity measure can significantly improve clustering purity. Though we did not use AnchorGraph, for the sake of including topic models in our comparison, it could be more beneficial to construct both the approximated and the approximating matrices by the same principle. This also suggests that clustering analysis could be performed in a deeper way using hierarchical pre-training. Detailed implementations should be investigated in the future.

Acknowledgment

This work was financially supported by the Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN, grant no. 251170; Zhirong Yang additionally by decision number 140398).

References

Arora, R., Gupta, M., Kapila, A., and Fazel, M. Clustering by left-stochastic matrix factorization. In International Conference on Machine Learning (ICML), pp. 761–768, 2011.

Asuncion, A., Welling, M., Smyth, P., and Teh, Y.-W. On smoothing and inference for topic models. In Conference on Uncertainty in Artificial Intelligence (UAI), pp. 27–34, 2009.

Blei, D., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2001.

Ding, C., Li, T., Peng, W., and Park, H. Orthogonal nonnegative matrix t-factorizations for clustering. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 126–135, 2006.

Ding, C., Li, T., and Jordan, M. I. Nonnegative matrix factorization for combinatorial optimization: Spectral clustering, graph matching, and clique finding. In International Conference on Data Mining (ICDM), pp. 183–192, 2008.

Ding, C., Li, T., and Jordan, M. I. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):45–55, 2010.

Hein, M. and Bühler, T. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In Advances in Neural Information Processing Systems (NIPS), pp. 847–855, 2010.

Hofmann, T. Probabilistic latent semantic indexing. In International Conference on Research and Development in Information Retrieval (SIGIR), pp. 50–57, 1999.

Liu, W., He, J., and Chang, S.-F. Large graph construction for scalable semi-supervised learning. In International Conference on Machine Learning (ICML), pp. 679–686, 2010.

Mallat, S. Group invariant scattering. Communications on Pure and Applied Mathematics, 2012.

Minka, T. Estimating a Dirichlet distribution, 2000.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

Sinkkonen, J., Aukia, J., and Kaski, S. Component models for large networks. ArXiv e-prints, 2008.

Xu, W., Liu, X., and Gong, Y. Document clustering based on non-negative matrix factorization. In International Conference on Research and Development in Information Retrieval (SIGIR), pp. 267–273, 2003.

Yang, Z. and Oja, E. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks, 21(5):734–749, 2010.

Yang, Z. and Oja, E. Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization. IEEE Transactions on Neural Networks, 22(12):1878–1891, 2011.

Zass, R. and Shashua, A. Doubly stochastic normalization for spectral clustering. In Advances in Neural Information Processing Systems (NIPS), pp. 1569–1576, 2006.