Clustering by Low-Rank Doubly Stochastic Matrix Decomposition

Zhirong Yang ([email protected]), Department of Information and Computer Science, Aalto University, 00076, Finland
Erkki Oja ([email protected]), Department of Information and Computer Science, Aalto University, 00076, Finland

Abstract

Clustering analysis by nonnegative low-rank approximations has achieved remarkable progress in the past decade. However, most approximation approaches in this direction are still restricted to matrix factorization. We propose a new low-rank learning method to improve the clustering performance, which is beyond matrix factorization. The approximation is based on a two-step bipartite random walk through virtual cluster nodes, where the approximation is formed by only cluster assigning probabilities. Minimizing the approximation error measured by Kullback-Leibler divergence is equivalent to maximizing the likelihood of a discriminative model, which endows our method with a solid probabilistic interpretation. The optimization is implemented by a relaxed Majorization-Minimization algorithm that is advantageous in finding good local minima. Furthermore, we point out that the regularized algorithm with Dirichlet prior only serves as initialization. Experimental results show that the new method has strong performance in clustering purity for various datasets, especially for large-scale manifold data.

1. Introduction

Cluster analysis assigns a set of objects into groups so that the objects in the same cluster are more similar to each other than to those in other clusters. Optimization of most clustering objectives is NP-hard and relaxation to "soft" clustering is often required. A nonnegativity constraint, together with various low-rank matrix approximation objectives, has widely been used for the relaxation purpose in the past decade.

The most popular nonnegative low-rank approximation method is Nonnegative Matrix Factorization (NMF). It finds a matrix that approximates the similarities and can be factorized into several nonnegative low-rank matrices. NMF was originally applied to vectorial data, where Ding et al. (2010) have shown that NMF is equivalent to the classical k-means method. Later NMF was applied to the (weighted) graph given by the pairwise similarities. For example, Ding et al. (2008) presented Nonnegative Spectral Cuts by using a multiplicative algorithm; Arora et al. (2011) proposed Left Stochastic Decomposition, which approximates a similarity matrix by a left-stochastic matrix under Euclidean distance. Another stream in the same direction is topic modeling. Hofmann (1999) gave a generative model in Probabilistic Latent Semantic Indexing (PLSI) for counting data, which is essentially equivalent to NMF using Kullback-Leibler (KL) divergence and tri-factorizations. Bayesian treatment of PLSI by using a Dirichlet prior was later introduced by Blei et al. (2001). Symmetric PLSI with the same Bayesian treatment is called the Interaction Component Model (ICM) (Sinkkonen et al., 2008).

Despite remarkable progress, the above relaxation approaches are still not fully satisfactory in all of the following requirements that affect the clustering performance of nonnegative low-rank approximation: (1) an approximation error measure that takes sparse similarities into account, (2) the decomposition form of the approximating matrix, whose decomposing matrices should contain just enough parameters for clustering but not more, and (3) normalization of the approximating matrix, which ensures relatively balanced clusters and an equal contribution from each data sample. Lacking one or more of these dimensions can severely affect clustering performance.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

In this paper we present a new nonnegative low-rank approximation method for clustering which satisfies all of the above three requirements. First, because datasets often lie in curved manifolds such that only similarities in a small neighborhood are reliable, we adopt the KL-divergence to handle the resulting sparsity. Second, different from PLSI, we enforce an equal contribution of every data sample and then directly construct the decomposition over the probabilities from samples to clusters. Third, these probabilities form the only decomposing matrix to be learned in our approach and directly give the answer for probabilistic clustering. Furthermore, our decomposition leads to a doubly stochastic approximating matrix, which was shown to be desirable for balanced graph cuts (Zass & Shashua, 2006). We name our new method DCD because it is based on Data-Cluster-Data random walks.

In order to solve the DCD learning objective, we propose a novel relaxed Majorization-Minimization algorithm to handle the new matrix decomposition type. Our relaxation strategy works robustly in finding satisfactory local optimizers under the stochasticity constraint. Furthermore, we argue that complexity control such as Bayesian priors only provides initialization for the new algorithm. This eliminates the problem of hyperparameter selection in the prior.

Empirical comparison with NMF and other graph-based clustering approaches demonstrates that our method can achieve the best or nearly the best clustering purity in all tasks. For some datasets, the new method significantly improves the state-of-the-art.

After this introductory part, we present the new method in Section 2, including its learning objective, probabilistic model, optimization and initialization techniques. In Section 3, we point out the connections and differences between our method and other recent related work. Experimental settings and results are given in Section 4. Finally, we conclude the paper and discuss some future work in Section 5.

2. Clustering by DCD

Suppose the similarities between n data samples are precomputed and given in a nonnegative symmetric matrix A. This matrix can be seen as the (weighted) affinity of an undirected similarity graph where each node corresponds to a data sample (data node). A clustering analysis algorithm takes such input and divides the data nodes into r disjoint subsets. In probabilistic clustering analysis, we want to find P(k|i), the probability of assigning the ith sample to the kth cluster, where i = 1, ..., n and k = 1, ..., r. In the following, i, j and v stand for data sample (node) indices while k and l stand for cluster indices.

2.1. Learning objective

Some of our work was inspired by the AnchorGraph (Liu et al., 2010), which was used for large-scale approximative graph construction based on a two-step random walk between data nodes through a set of anchor nodes. Note that AnchorGraph is not a clustering method.

If we augment the input similarity graph by r cluster nodes, the cluster assigning probabilities can be seen as single-step random walk probabilities from data nodes to the augmented cluster nodes. Without preference to any particular samples, we impose the uniform prior P(i) = 1/n over the data nodes. By this prior, the reversed random walk probabilities can be calculated by the Bayes formula

P(i|k) = P(k|i)P(i) / Σ_v P(k|v)P(v) = P(k|i) / Σ_v P(k|v).   (1)

Consider next the probability of two-step random walks from the ith data node to the jth data node via all cluster nodes (DCD random walk):

P(i|j) = Σ_k P(i|k)P(k|j) = Σ_k P(k|i)P(k|j) / Σ_v P(k|v).   (2)

This probability defines another similarity between two data nodes, Â_ij = P(i|j), with respect to the cluster nodes. Note that this matrix has rank at most r. The learning target is now to find a good approximation between the input similarities and the DCD random walk probabilities:

A ≈ Â.   (3)

AnchorGraph does not provide any error measure for the above approximation.
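As a concrete illustration (ours, not part of the paper's materials), the DCD similarities of Eq. (2) can be computed from the assignment matrix W in a few lines of NumPy; the function name and the sanity checks below are our own:

```python
import numpy as np

def dcd_similarity(W):
    """Two-step Data-Cluster-Data random-walk probabilities, Eq. (2).

    W : (n, r) array with nonnegative rows summing to one, W[i, k] = P(k|i).
    Returns the (n, n) matrix A_hat with A_hat[i, j] = P(i|j).
    """
    s = W.sum(axis=0)          # s_k = sum_v P(k|v), the column sums
    return (W / s) @ W.T       # A_hat_ij = sum_k P(k|i) P(k|j) / s_k

# Quick check of the properties claimed in the text: A_hat is symmetric,
# has rank at most r, and (as the paper notes after Eq. (6)) is doubly stochastic.
rng = np.random.default_rng(0)
W = rng.random((6, 3))
W /= W.sum(axis=1, keepdims=True)
A_hat = dcd_similarity(W)
assert np.allclose(A_hat, A_hat.T)
assert np.allclose(A_hat.sum(axis=0), 1.0)
```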
A conventional choice in NMF is the squared Euclidean distance, which employs the underlying assumption that the noise is additive and Gaussian.

In real-world clustering tasks for multivariate datasets, data points often lie in a curved manifold. Consequently, similarities based on Euclidean distances are reliable only in a small neighborhood. Such locality causes high sparsity in the input similarity matrix. Sparsity is also common in real-world network data. Because of this sparsity, the Euclidean distance is improper for the approximation in Eq. (3), because additive Gaussian noise would lead to a dense observed graph. In contrast, the (generalized) Kullback-Leibler divergence is more suitable for the approximation. The underlying Poisson noise characterizes the rare occurrences that are present in our sparse input. We can now formulate our learning objective as the following optimization problem:

min_{W ≥ 0}  D_KL(A || Â) = Σ_ij ( A_ij log(A_ij / Â_ij) − A_ij + Â_ij )   (4)

s.t.  Σ_k W_ik = 1,  i = 1, ..., n,   (5)

where we write W_ik = P(k|i) for convenience and thus

Â_ij = Σ_k W_ik W_jk / Σ_v W_vk.   (6)

Note that Â is symmetric, as it is easy to verify that P(i|j) = P(j|i). Therefore, Â is also doubly stochastic because it is left-stochastic by the probability definition.
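A matching sketch (again ours) of the objective in Eq. (4); zero entries of A contribute nothing to the first term, so the sum is taken over the nonzero entries only:

```python
import numpy as np

def dcd_kl_error(A, W, eps=1e-15):
    """Generalized KL divergence D_KL(A || A_hat) of Eq. (4).

    A : (n, n) nonnegative symmetric similarity matrix (dense here for clarity).
    W : (n, r) row-stochastic matrix with W[i, k] = P(k|i).
    """
    A_hat = (W / W.sum(axis=0)) @ W.T
    nz = A > 0                                     # 0 * log 0 is treated as 0
    kl = np.sum(A[nz] * np.log(A[nz] / (A_hat[nz] + eps)))
    return kl - A.sum() + A_hat.sum()
```

For the sparse graphs the paper targets, A would in practice be stored in a sparse format and the first sum restricted to its stored entries; the dense version above only illustrates the formula.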
2.2. Probabilistic model

The optimization objective has an analogous statistical model to PLSI. Dropping the constant terms from D_KL(A||Â), the objective is equivalent to maximizing

Σ_ij A_ij log Â_ij.   (7)

This can be identified as the log-likelihood of the following generative model if the A_ij are integers: for t = 1, ..., T, add one to entry (i, j) ~ Multinomial((1/n)Â, 1), whose likelihood is given by

p(A) = Π_{t=1}^{T} (1/n) Â_{i_t j_t} = Π_{ij} ( (1/n) Â_ij )^{A_ij},

where (i_t, j_t) denotes the entry drawn at step t and T = Σ_ij A_ij.

The above model simply uses a uniform prior on the rows of W. It does not prevent us from using informative priors or complexity control. A natural choice for probabilities is the Dirichlet distribution (α > 0)

p(W_i1, ..., W_ir | α) = Γ(rα) / [Γ(α)]^r  Π_{k=1}^{r} W_ik^{α−1},   (8)

which is also the conjugate prior of the multinomial distribution. The Dirichlet prior reduces to the uniform prior when α = 1.

Although it is possible to construct a multi-level graphical model similar to the Dirichlet process topic model, we emphasize that the smallest approximation error (or perplexity) is our final goal. The Dirichlet prior is used only in order to ease the optimization. Therefore we do not employ more complex generative models; see Section 2.4 for more discussion.

2.3. Optimization

The optimization problem with the Dirichlet prior on W is equivalent to minimizing

J(W) = − Σ_ij A_ij log Â_ij − (α − 1) Σ_ik log W_ik.   (9)
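For reference, a small sketch (ours) of the regularized objective J(W) in Eq. (9). When W satisfies the constraint of Eq. (5) and α = 1, the Dirichlet term vanishes and J differs from the KL error of Eq. (4) only by a constant that does not depend on W, so either quantity can be used to monitor the optimization described next:

```python
import numpy as np

def dcd_objective(A, W, alpha=1.0, eps=1e-15):
    """J(W) = -sum_ij A_ij log A_hat_ij - (alpha - 1) sum_ik log W_ik, Eq. (9)."""
    A_hat = (W / W.sum(axis=0)) @ W.T
    loglik = np.sum(A * np.log(A_hat + eps))
    log_prior = (alpha - 1.0) * np.sum(np.log(W + eps))
    return -loglik - log_prior
```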

There are two ways to handle the constraint Eq. (5). First, one can develop a multiplicative algorithm by the procedure proposed by Yang & Oja (2011), neglecting the stochasticity constraint, and then normalize the rows of W after each update. However, the optimization in this way easily gets stuck in poor local minima in practice.

Here we instead employ a relaxing strategy to handle the constraint. We first introduce Lagrangian multipliers for the constraints:

L(W, λ) = J(W) + Σ_i λ_i ( Σ_k W_ik − 1 ).   (10)

Unlike traditional constrained optimization that solves the fixed-point equations, we employ a heuristic to find the multipliers λ. Denote by ∇ = ∇+ − ∇− the gradient of J with respect to W, where ∇+ and ∇− are the positive and (unsigned) negative parts, respectively. This suggests a fixed-point update rule for W:

W'_ik = W_ik (∇−_ik − λ_i) / ∇+_ik.   (11)

Imposing Σ_k W'_ik = 1, we obtain

λ_i = (b_i − 1) / a_i,   (12)

where a_i and b_i are given in Algorithm 1. Next we show that the augmented objective Eq. (10) decreases after each iteration with the above λ.

Algorithm 1 Relaxed MM Algorithm for DCD
  Input: similarity matrix A, number of clusters r, nonnegative initial guess of W.
  repeat
    Z_ij = A_ij ( Σ_k W_ik W_jk / Σ_v W_vk )^{-1}
    s_k = Σ_v W_vk
    ∇−_ik = 2 (ZW)_ik s_k^{-1} + α W_ik^{-1}
    ∇+_ik = (W^T Z W)_kk s_k^{-2} + W_ik^{-1}
    a_i = Σ_l W_il / ∇+_il,   b_i = Σ_l W_il ∇−_il / ∇+_il
    W_ik ← W_ik (∇−_ik a_i + 1) / (∇+_ik a_i + b_i)
  until W is unchanged
  Output: cluster assigning probabilities W.

Theorem 1. Denote by W^new the updated matrix after each iteration. It holds that L(W^new, λ) ≤ L(W, λ) with λ_i = (b_i − 1)/a_i.

Proof. The algorithm construction mainly follows the Majorization-Minimization procedure (see e.g. Yang & Oja, 2011). We use W and W̃ to distinguish the current estimate and the variable, respectively.

(Majorization) Let

φ_ijk = ( W_ik W_jk / Σ_v W_vk ) / ( Σ_l W_il W_jl / Σ_v W_vl ).

Then

L(W̃, λ) ≤ − Σ_ijk A_ij φ_ijk [ log W̃_ik + log W̃_jk − log Σ_v W̃_vk ] − (α − 1) Σ_ik log W̃_ik + Σ_ik λ_i W̃_ik + C_1
         ≤ − Σ_ijk A_ij φ_ijk [ log W̃_ik + log W̃_jk − Σ_v W̃_vk / Σ_v W_vk ] − (α − 1) Σ_ik log W̃_ik + Σ_ik λ_i W̃_ik + C_2
         ≤ − Σ_ijk A_ij φ_ijk [ log W̃_ik + log W̃_jk − Σ_v W̃_vk / Σ_v W_vk ] − (α − 1) Σ_ik log W̃_ik + Σ_ik λ_i W̃_ik
             + Σ_ik ( 1/a_i + 1/W_ik ) W_ik ( W̃_ik/W_ik − log(W̃_ik/W_ik) − 1 ) + C_2
         ≡ G(W̃, W),

where C_1 and C_2 are constants irrelevant to the variable W̃. The first two inequalities follow the CCCP majorization (Yang & Oja, 2011), using the convexity and concavity of −log(·) and log(·), respectively. The third inequality is the "moving term" technique used in multiplicative updates (Yang & Oja, 2010): it adds the same quantity 1/a_i + 1/W_ik to both the numerator and the denominator of the resulting update in order to guarantee that the updated matrix entries remain positive; the added term is nonnegative and vanishes at W̃ = W, so the upper bound remains valid. All the above upper bounds are tight at W̃ = W, i.e. G(W, W) = L(W, λ).

(Minimization)

∂G/∂W̃_ik = ∇+_ik − ∇−_ik (W_ik / W̃_ik) + λ_i + ( 1/a_i + 1/W_ik ) ( 1 − W_ik / W̃_ik )
          = − (W_ik / W̃_ik) ( ∇−_ik + 1/a_i ) + ( ∇+_ik + b_i / a_i ).

Setting the gradient to zero gives

W^new_ik = W_ik ( ∇−_ik + 1/a_i ) / ( ∇+_ik + b_i/a_i ).   (13)

Multiplying both the numerator and the denominator by a_i gives the last update rule in Algorithm 1. Therefore L(W^new, λ) ≤ G(W^new, W) ≤ G(W, W) = L(W, λ).
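The following NumPy sketch of Algorithm 1 is our own illustration, not the authors' code; it assumes a dense similarity matrix, a fixed maximum number of iterations, and a simple change-based stopping rule in place of "until W is unchanged":

```python
import numpy as np

def dcd_relaxed_mm(A, r, alpha=1.0, max_iter=10000, tol=1e-9, W0=None, seed=0):
    """Relaxed Majorization-Minimization update of Algorithm 1 (sketch).

    A : (n, n) nonnegative symmetric similarity matrix.
    r : number of clusters.
    Returns W with W[i, k] approximately equal to P(k|i).
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    W = rng.random((n, r)) if W0 is None else W0.astype(float).copy()
    W /= W.sum(axis=1, keepdims=True)               # start near the simplex

    for _ in range(max_iter):
        s = W.sum(axis=0)                           # s_k = sum_v W_vk
        A_hat = (W / s) @ W.T                       # Eq. (6)
        Z = A / np.maximum(A_hat, 1e-15)            # Z_ij = A_ij / A_hat_ij
        ZW = Z @ W
        grad_neg = 2.0 * ZW / s + alpha / W                         # unsigned negative part
        grad_pos = ((W.T @ Z) * W.T).sum(axis=1) / s**2 + 1.0 / W   # positive part
        a = (W / grad_pos).sum(axis=1)                              # a_i
        b = (W * grad_neg / grad_pos).sum(axis=1)                   # b_i
        W_new = W * (grad_neg * a[:, None] + 1.0) / (grad_pos * a[:, None] + b[:, None])
        if np.max(np.abs(W_new - W)) < tol:
            return W_new
        W = W_new
    return W
```

Rows of W are not renormalized after each update; as discussed below, the relaxed iteration itself drives them toward the probability simplex.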

Algorithm 1 jointly minimizes the approximation error and drives the rows of W towards the probability simplex. The Lagrangian multipliers are selected automatically by the algorithm, without extra human tuning labor. The quantities b_i are the row sums of the unconstrained multiplicative learning result, while the quantities a_i balance between the gradient learning force and the attraction of the probability simplex. Besides its convenience, we find that this relaxation strategy works more robustly than brute-force normalization after each iteration.

2.4. Initialization

The optimization problems of many clustering analysis methods, including ours, are non-convex. Usually finding the global optimum is very expensive or even NP-hard. When local optimizers are used, the optimization trajectory can easily get stuck in poor local optima if the algorithm starts from an arbitrary random guess. Proper initialization is thus needed to achieve satisfactory performance.

The cost of the initialization should be much cheaper than the main algorithm. There are two popular choices: k-means and Normalized Cut (Ncut). The first one can only be applied to vectorial data and could be slow for large-scale high-dimensional data. Here we employ the second initialization method. While the original Ncut is NP-hard, the relaxed Ncut problem can be efficiently solved via spectral methods (Shi & Malik, 2000). Furthermore, it is particularly suitable for sparse graph input, which is our focus in this paper.

Besides Ncut, we emphasize that the minimal approximation error is our sole learning objective, and all regularized versions, e.g. with different Dirichlet priors, only serve as initialization. This is because clustering analysis, unlike supervised learning problems, does not need to provide inference for unseen data. That is, complexity control such as Bayesian priors is not meant for better generalization performance, but for a better-shaped objective space that facilitates optimization. In this sense, we can use the results of diverse regularized versions, or even of other clustering algorithms, as starting guesses, and pick the one with the smallest approximation error among multiple runs.

In implementation, we first convert an initialization clustering result to an n × r binary indicator matrix, and then add a small positive perturbation to all entries. Next, the perturbed matrix is fed to our optimization algorithm (with α = 1 in Algorithm 1).
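A possible implementation of this initialization scheme, loosely following Sections 2.4 and 4.1 and reusing the dcd_relaxed_mm and dcd_kl_error sketches above (ours; the Ncut labels are assumed to come from any external spectral clustering routine, the 0.2 perturbation and the 10,000-iteration budget follow Section 4.1, and the 2,000-iteration warm-up is an arbitrary choice of ours):

```python
import numpy as np

def indicator_from_labels(labels, r):
    """n x r binary cluster indicator matrix built from hard labels 0..r-1."""
    H = np.zeros((len(labels), r))
    H[np.arange(len(labels)), labels] = 1.0
    return H

def dcd_init_and_run(A, r, ncut_labels, priors=(1.0, 1.2, 2.0, 5.0), eps=0.2):
    """Run DCD from several Dirichlet-regularized starting points and keep
    the solution with the smallest approximation error, Eq. (4)."""
    W0 = indicator_from_labels(ncut_labels, r) + eps       # perturbed Ncut indicator
    W0 /= W0.sum(axis=1, keepdims=True)
    best_W, best_err = None, np.inf
    for alpha in priors:
        # Regularized run used only to shape a starting point ...
        W_start = dcd_relaxed_mm(A, r, alpha=alpha, max_iter=2000, W0=W0)
        # ... followed by the plain (alpha = 1) objective of Eq. (4).
        W = dcd_relaxed_mm(A, r, alpha=1.0, max_iter=10000, W0=W_start)
        err = dcd_kl_error(A, W)
        if err < best_err:
            best_W, best_err = W, err
    return best_W
```

Hard cluster assignments are then obtained by taking the argmax of each row of the returned W.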
3. Related Work

Our method intersects with several other machine learning approaches. Here we discuss some of these directions, pinpointing the connections and our new contributions.

3.1. Topic Model

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. An early topic model was PLSI (Hofmann, 1999), which maximizes the following log-likelihood for a symmetric input A:

Σ_ij A_ij log Σ_k P(k) P(i|k) P(j|k).   (14)

One can see that PLSI has a similar form to Eq. (7). Both objectives can be equivalently expressed as nonnegative low-rank approximation using the KL-divergence. The major difference is the decomposition form of the approximating matrix. There are two ways to model the hierarchy between the latent variables and the observed ones. Topic models use the purely generative way, while our method employs the discriminative way. PLSI gives the clustering results indirectly: one should apply the Bayes formula to evaluate P(k|i) from P(i|k) and P(k). There are n × r − 1 free parameters to be learned in the latter two quantities. In contrast, our method directly learns the cluster assigning probabilities P(k|i), which contain only n × (r − 1) free parameters. This difference can be large when there are only a few clusters (e.g. r = 2 or r = 3).

It is known that the performance of PLSI can be improved by using Bayesian non-parametric modeling. The Bayesian treatment of the symmetric version of PLSI leads to the Interaction Component Model (Sinkkonen et al., 2008). It associates Dirichlet priors with the PLSI factorizing matrices and then makes use of the conjugacy between the Dirichlet and multinomial distributions to derive collapsed Gibbs sampling or variational optimization methods.

An open problem of Bayesian methods is how to determine the hyperparameters that control the priors. Asuncion et al. (2009) found that wrongly chosen parameters can lead to only mediocre or even poor performance. The automatic hyperparameter updating method proposed by Minka (2000) does not necessarily lead to good solutions in terms of perplexity (Asuncion et al., 2009) or clustering purity in our experiments (see Section 4). Hofmann (1999) and Asuncion et al. (2009) suggested selecting the hyperparameters using the smallest approximation error on some held-out matrix entries, which is however more costly and might weaken or even break the cluster structure.

By contrast, there is no such prior hyperparameter selection problem in our method. The algorithms using various priors only play their role in the initialization. Among the runs with different starting points, we simply select the one with the smallest approximation error.
3.2. Nonnegative Matrix Factorization

Nonnegative Matrix Factorization is one of the earliest methods for relaxing clustering problems by nonnegative low-rank approximation (see e.g. Xu et al., 2003). The research on NMF also opened the door for multiplicative majorization-minimization algorithms for optimization over nonnegative matrices. In the original NMF, an input nonnegative matrix X is approximated by a product of two low-rank matrices W and H. Later, researchers found that more constraints or normalizations should be imposed on the factorizing matrices to achieve the desired performance.

Orthogonality is a popular choice (see e.g. Ding et al., 2006) for obtaining highly sparse factorizing matrices, especially the cluster indicator matrix. However, the orthogonality constraint seems exclusive of other constraints or priors. In practice, orthogonality favors the Euclidean distance as the approximation error measure for the sake of simple update rules, which is against our requirement for sparse graph input.

Stochasticity seems more natural for relaxing hard clustering to probabilities. Recently, Arora et al. (2011) proposed a symmetric NMF using left-stochastic factorizing matrices, called LSD. Their method also directly learns the cluster assigning probabilities. However, LSD is restricted to the Euclidean distance.

Our method has two major differences from LSD. First, we use the Kullback-Leibler divergence, which is more suitable for sparse graph input or curved manifold data. This also enables us to make use of the Dirichlet-multinomial conjugacy. Second, our decomposition has a good interpretation in terms of a random walk. Furthermore, imbalanced clustering is implicitly penalized because of the denominator in Eq. (6).

3.3. AnchorGraph

DCD uses the same matrix decomposition form as AnchorGraph. However, there are several major differences between the two methods. First of all, AnchorGraph is not made for clustering, but for constructing the graph input. AnchorGraph has no learning objective that captures the global structure of the data, such as clusters. Each row of the decomposing matrix in AnchorGraph is learned individually and only encodes local information; there is no learning over the decomposing matrix as a whole. Furthermore, anchors are either selected from the data samples or pre-learned by e.g. k-means. By contrast, the cluster nodes in our formulation are virtual. They are not vectors and need no physical storage.

4. Experiments

4.1. Compared methods

We have compared our method with eight other clustering algorithms that can take a symmetric nonnegative similarity matrix as input. The compared algorithms range from classical to state-of-the-art methods with various principles: graph cuts, including Normalized Cut (Ncut) (Shi & Malik, 2000), Nonnegative Spectral Cut (NSC) (Ding et al., 2008), and 1-Spectral ratio Cheeger cut (1-Spec) (Hein & Bühler, 2010); nonnegative matrix factorization, including Projective NMF (PNMF) (Yang & Oja, 2010), Symmetric 3-Factor Orthogonal NMF (ONMF) (Ding et al., 2006), and Left-Stochastic Decomposition (LSD) (Arora et al., 2011); and topic models, including Probabilistic Latent Semantic Indexing (PLSI) (Hofmann, 1999) and the Interaction Component Model (ICM) (Sinkkonen et al., 2008).

The detailed settings of the compared methods are as follows. We implemented NSC, PNMF, ONMF, LSD, PLSI, and DCD using multiplicative updates. For these methods, we ran their update rules for 10,000 iterations to ensure that all algorithms have sufficiently converged. We used the default setting for 1-Spec. ICM uses collapsed Gibbs sampling, where each round of the sampling sweeps the graph once. We ran the ICM sampling for 100,000 rounds to ensure that the MCMC burn-in has converged (this took about one day for the largest dataset). The hyperparameters in ICM are automatically adjusted by using Minka's method (Minka, 2000).

Despite mediocre results, Ncut runs very fast and gives fairly stable outputs. We thus used it for initialization. After obtaining the Ncut cluster indicator matrix, we add 0.2 to all entries and feed it as the starting point to the other methods, which is a common initialization setting for NMF methods. The other three initialization points for our method are provided by Ncut followed by DCD using three different Dirichlet priors (α = 1.2, α = 2, and α = 5). The clustering result of our method is reported for the run with the smallest approximation error, see Eq. (4).

4.2. Datasets

The performance of the clustering methods was evaluated using real-world datasets. In particular, we focus on data that lie in a curved manifold. We thus selected 15 such datasets which are publicly available from a variety of domains. The data sources are given in the supplemental document. The statistics of the selected datasets are summarized in Table 1.

Table 1. Statistics of the selected datasets.

Dataset     #samples  #classes
Amazon      96        2
Iris        150       3
Votes       435       2
ORL         400       40
PIE         1166      53
YaleB       1292      38
Coil20      1440      20
Isolet      1559      26
Mfeat       2000      10
Webkb4      4196      4
7sectors    4556      7
USPS        9298      10
PenDigits   10992     10
LetReco     20000     26
MNIST       70000     10

Table 2. Clustering purities for the compared methods on various data sets.

Dataset     Ncut   PNMF   NSC    ONMF   PLSI   LSD    1-Spec  ICM    DCD

Amazon      0.63   0.76   0.63   0.63   0.63   0.68   0.63    0.63   0.78
Iris        0.90   0.93   0.90   0.33   0.91   0.97   0.91    0.97   0.97
Votes       0.72   0.72   0.72   0.73   0.73   0.72   0.72    0.73   0.73
ORL         0.81   0.82   0.82   0.03   0.83   0.81   0.80    0.20   0.83
PIE         0.67   0.66   0.68   0.02   0.68   0.69   0.64    0.12   0.68
YaleB       0.45   0.43   0.46   0.03   0.51   0.45   0.39    0.10   0.51
Coil20      0.81   0.71   0.82   0.05   0.82   0.78   0.75    0.63   0.81
Isolet      0.57   0.55   0.56   0.04   0.58   0.57   0.57    0.36   0.58
Mfeat       0.75   0.77   0.79   0.10   0.77   0.78   0.80    0.69   0.78
Webkb4      0.54   0.41   0.54   0.40   0.59   0.62   0.40    0.49   0.62
7sectors    0.25   0.29   0.25   0.24   0.37   0.35   0.25    0.38   0.41
USPS        0.74   0.75   0.74   0.77   0.73   0.79   0.74    0.60   0.81
PenDigits   0.80   0.78   0.80   0.10   0.80   0.86   0.80    0.52   0.89
LetReco     0.24   0.25   0.23   0.04   0.28   0.29   0.18    0.21   0.32
MNIST       0.77   0.74   0.79   0.11   0.79   0.76   0.88    0.95   0.97

In brief, Amazon contains book similarities according to amazon.com buying records; Votes are voting records of the US congress by two different parties; ORL, PIE and YaleB are face images collected under different conditions; Coil20 are small toy images; Isolet and LetReco are English letter recognition datasets; Webkb4 and 7sectors are text document collections; Mfeat, USPS, PenDigits and MNIST are handwritten digit images.

We preprocessed the above datasets to produce the similarity graph input, except Amazon which is already in sparse graph format. We extracted scattering features (Mallat, 2012) for the image data, except Isolet and Mfeat which have their own feature representations. We used Tf-Idf features for the text documents. After feature extraction, we constructed a K-Nearest-Neighbor (KNN) graph for each dataset. We set K = 5 for the six smallest datasets (except Amazon) and K = 10 for the other datasets. We then symmetrized and binarized the KNN graph B to obtain the input similarities A (i.e. A_ij = 1 if B_ij = 1 or B_ji = 1, and A_ij = 0 otherwise).

4.3. Results

The clustering performance of the compared methods is evaluated by the clustering purity

purity = (1/n) Σ_{k=1}^{r} max_{1≤l≤r} n_k^l,   (15)

where n_k^l is the number of data samples in cluster k that belong to ground-truth class l. A larger purity in general corresponds to a better clustering result. The clustering purities for the compared methods are given in Table 2.

Our method has strong performance in terms of clustering purity. DCD wins on 12 out of the 15 selected datasets. Even for the other three datasets, DCD is the first or second runner-up, with purities tied with or very close to the winner.

The new method is particularly advantageous for large datasets. Note that the datasets in Table 2 are ordered by their sizes. We can see that there are some other winners or joint winners for smaller datasets, for example, LSD for the PIE faces or 1-Spec for the Mfeat digits. PLSI performs quite similarly to DCD for these small clustering tasks. However, DCD demonstrates a clear win over the other methods for the five largest datasets.

DCD has remarkable performance for the largest dataset, MNIST. In this case, clustering as unsupervised learning by our method has achieved a classification accuracy (i.e. purity) very close to many modern supervised approaches¹, whereas we only need ten labeled samples to remove the cluster-class permutation ambiguity.

¹ See http://yann.lecun.com/exdb/mnist/
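For completeness, a short sketch (ours) of the purity measure of Eq. (15), assuming hard assignments obtained as the argmax of each row of W:

```python
import numpy as np

def clustering_purity(pred_labels, true_labels):
    """Purity, Eq. (15): each cluster is credited with its most frequent class."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    total = 0
    for k in np.unique(pred_labels):
        _, counts = np.unique(true_labels[pred_labels == k], return_counts=True)
        total += counts.max()
    return total / len(true_labels)

# e.g. purity = clustering_purity(W.argmax(axis=1), ground_truth_classes)
```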
5. Conclusions

We have presented a new clustering method based on nonnegative low-rank approximation with three major original contributions: (1) a novel decomposition approach for the approximating matrix, derived from a two-step random walk; (2) a relaxed majorization-minimization algorithm for finding better approximating matrices; and (3) a strategy that uses regularization with the Dirichlet prior as initialization. Experimental results showed that our method works robustly on the selected datasets and can improve the clustering purity for large manifold datasets.

There are some other dimensions that affect clustering performance. Our practice indicates that initialization can play an important role, because most current algorithms are only local optimizers. Using the Dirichlet prior is only one way to smooth the objective function space. One could use other priors or regularization techniques to achieve better initializations.

Another dimension is the input graph. We have focused on the grouping procedure given that the similarities are precomputed. One should notice that better features or a better similarity measure can significantly improve clustering purity. Though we did not use AnchorGraph, for the sake of including topic models in our comparison, it could be more beneficial to construct both the approximated and the approximating matrices by the same principle. This also suggests that clustering analysis could be performed in a deeper way using hierarchical pre-training. Detailed implementations should be investigated in the future.

Acknowledgment

This work was financially supported by the Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN, grant no. 251170; Zhirong Yang additionally by decision number 140398).

References

Arora, R., Gupta, M., Kapila, A., and Fazel, M. Clustering by left-stochastic matrix factorization. In International Conference on Machine Learning (ICML), pp. 761–768, 2011.

Asuncion, A., Welling, M., Smyth, P., and Teh, Y.-W. On smoothing and inference for topic models. In Conference on Uncertainty in Artificial Intelligence (UAI), pp. 27–34, 2009.

Blei, D., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2001.

Ding, C., Li, T., Peng, W., and Park, H. Orthogonal nonnegative matrix t-factorizations for clustering. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 126–135, 2006.

Ding, C., Li, T., and Jordan, M. I. Nonnegative matrix factorization for combinatorial optimization: Spectral clustering, graph matching, and clique finding. In International Conference on Data Mining (ICDM), pp. 183–192, 2008.

Ding, C., Li, T., and Jordan, M. I. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):45–55, 2010.

Hein, M. and Bühler, T. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In Advances in Neural Information Processing Systems (NIPS), pp. 847–855, 2010.

Hofmann, T. Probabilistic latent semantic indexing. In International Conference on Research and Development in Information Retrieval (SIGIR), pp. 50–57, 1999.

Liu, W., He, J., and Chang, S.-F. Large graph construction for scalable semi-supervised learning. In International Conference on Machine Learning (ICML), pp. 679–686, 2010.

Mallat, S. Group invariant scattering. Communications on Pure and Applied Mathematics, 2012.

Minka, T. Estimating a Dirichlet distribution, 2000.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

Sinkkonen, J., Aukia, J., and Kaski, S. Component models for large networks. ArXiv e-prints, 2008.

Xu, W., Liu, X., and Gong, Y. Document clustering based on non-negative matrix factorization. In International Conference on Research and Development in Information Retrieval (SIGIR), pp. 267–273, 2003.

Yang, Z. and Oja, E. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks, 21(5):734–749, 2010.

Yang, Z. and Oja, E. Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization. IEEE Transactions on Neural Networks, 22(12):1878–1891, 2011.

Zass, R. and Shashua, A. Doubly stochastic normalization for spectral clustering. In Advances in Neural Information Processing Systems (NIPS), pp. 1569–1576, 2006.