Locality Preserving Nonnegative Matrix Factorization

Deng Cai†   Xiaofei He†   Xuanhui Wang‡   Hujun Bao†   Jiawei Han‡
†State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, China
{dengcai, xiaofeihe, bao}@cad.zju.edu.cn
‡Department of Computer Science, University of Illinois at Urbana-Champaign
{xwang20, hanj}@cs.uiuc.edu

Abstract

Matrix factorization techniques have been frequently applied in information processing tasks. Among them, Non-negative Matrix Factorization (NMF) has received considerable attention due to its psychological and physiological interpretation of naturally occurring data whose representation may be parts-based in the human brain. On the other hand, from a geometric perspective, the data are usually sampled from a low dimensional manifold embedded in a high dimensional ambient space. One hopes then to find a compact representation which uncovers the hidden topics and simultaneously respects the intrinsic geometric structure. In this paper, we propose a novel algorithm, called Locality Preserving Non-negative Matrix Factorization (LPNMF), for this purpose. For two data points, we use the KL-divergence to evaluate their similarity on the hidden topics. The optimal maps are obtained such that the feature values on the hidden topics are restricted to be non-negative and vary smoothly along the geodesics of the data manifold. Our empirical study shows encouraging results for the proposed algorithm in comparison to state-of-the-art algorithms on two large high-dimensional databases.

1 Introduction

Data representation has been a fundamental problem in many areas of information processing. A good representation can significantly facilitate learning from examples in terms of learnability and computational complexity [Duda et al., 2000]. On the one hand, each data point may be associated with some hidden topics. For example, a face image can be thought of as a combination of nose, mouth, eyes, etc. On the other hand, from a geometrical perspective, the data points may be sampled from a probability distribution supported on a low dimensional submanifold embedded in the high dimensional space. One hopes then to find a compact representation which respects both the hidden topics as well as the geometric structure.

In order to discover the hidden topics, matrix factorization techniques have been frequently applied [Deerwester et al., 1990; Liu et al., 2008]. For example, the canonical algorithm Latent Semantic Indexing (LSI, [Deerwester et al., 1990]) applies Singular Value Decomposition (SVD) to decompose the original data matrix X into a product of three matrices, that is, X = USV^T. U and V are orthogonal matrices and S is a diagonal matrix. The quantities S_{ii} are called the singular values of X, and the column vectors of U and V are called the left and right singular vectors, respectively. By removing those singular vectors corresponding to sufficiently small singular values, we obtain a natural low-rank approximation to the original matrix, and each dimension corresponds to a hidden topic. Besides SVD, other popular matrix factorization techniques include LU-decomposition, QR-decomposition, and Cholesky decomposition.
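As a concrete illustration of this truncated SVD, the following NumPy sketch (our own example, not part of the original paper; the matrix sizes and the rank k = 2 are arbitrary) keeps only the largest singular values to form the low-rank approximation:

import numpy as np

# Hypothetical term-document matrix X (m terms x n documents).
X = np.random.rand(100, 40)

# Full SVD: X = U S V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values; each retained dimension
# plays the role of a hidden topic in LSI.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]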
Recently, Non-negative Matrix Factorization (NMF, [Lee and Seung, 1999]) has been proposed and has achieved great success due to its theoretical interpretation and practical performance. Previous studies have shown that there is psychological and physiological evidence for parts-based representation in the human brain [Logothetis and Sheinberg, 1996; Palmer, 1977; Wachsmuth et al., 1994]. The non-negative constraints in NMF lead to a parts-based representation because they allow only additive, not subtractive, combinations. NMF has been shown to be superior to SVD in face recognition [Li et al., 2001] and document clustering [Xu et al., 2003]. The major disadvantage of NMF is that it fails to consider the intrinsic geometric structure in the data.

In this paper, we aim to discover the hidden topics and the intrinsic geometric structure simultaneously. We propose a novel algorithm called Locality Preserving Non-negative Matrix Factorization (LPNMF) for this purpose. For two data points, we use the KL-divergence to evaluate their similarity on the hidden topics. A nearest neighbor graph is constructed to model the local manifold structure. If two points are sufficiently close on the manifold, then we expect that they have similar representations on the hidden topics. Thus, the optimal maps are obtained such that the feature values on the hidden topics are restricted to be non-negative and vary smoothly along the geodesics of the data manifold. We also propose an efficient method to solve the optimization problem. It is important to note that this work is fundamentally based on our previous work GNMF [Cai et al., 2008]. The major difference is that GNMF evaluates the relationship between two matrices using the Frobenius norm, while in this work we use the divergence, which has a better probabilistic interpretation.

2 A Brief Review of NMF

Non-negative Matrix Factorization (NMF) [Lee and Seung, 1999] is a matrix factorization algorithm that focuses on the analysis of data matrices whose elements are nonnegative. Given a data matrix X = [x_{ij}] = [x_1, ..., x_n] ∈ R^{m×n}, each column of X is a sample vector. NMF aims to find two non-negative matrices U = [u_{ik}] ∈ R^{m×t} and V = [v_{jk}] ∈ R^{n×t} which minimize the following objective function:

    O = \sum_{i,j} \left( x_{ij} \log \frac{x_{ij}}{y_{ij}} - x_{ij} + y_{ij} \right)    (1)

where Y = [y_{ij}] = UV^T. The above objective function is lower bounded by zero, and vanishes if and only if X = Y. It is usually referred to as the "divergence" of X from Y instead of the "distance" between X and Y because it is not symmetric in X and Y. It reduces to the Kullback-Leibler divergence, or relative entropy, when \sum_{ij} x_{ij} = \sum_{ij} y_{ij} = 1, so that X and Y can be regarded as normalized probability distributions.¹

¹ One can use other cost functions (e.g., the Frobenius norm) to measure how well UV^T approximates X. Please refer to [Lee and Seung, 2001; Cai et al., 2008] for more details.

Although the objective function O in Eq. (1) is convex in U only or V only, it is not convex in both variables together. Therefore it is unrealistic to expect an algorithm to find the global minimum of O. Lee & Seung [Lee and Seung, 2001] presented an iterative update algorithm as follows:

    u_{ik} ← u_{ik} \frac{\sum_j ( x_{ij} v_{jk} / \sum_k u_{ik} v_{jk} )}{\sum_j v_{jk}},
    v_{jk} ← v_{jk} \frac{\sum_i ( x_{ij} u_{ik} / \sum_k u_{ik} v_{jk} )}{\sum_i u_{ik}}    (2)

It is proved that the above update steps will find a local minimum of the objective function O [Lee and Seung, 2001].

In reality, we have t ≪ m and t ≪ n. Thus, NMF essentially tries to find a compressed approximation of the original data matrix, X ≈ UV^T. We can view this approximation column by column as

    x_j ≈ \sum_{k=1}^{t} u_k v_{jk}    (3)

where u_k is the k-th column vector of U. Thus, each data vector x_j is approximated by a linear combination of the columns of U, weighted by the components of V. Therefore U can be regarded as containing a basis that is optimized for the linear approximation of the data in X. Let z_j^T denote the j-th row of V, z_j = [v_{j1}, ..., v_{jt}]^T. Then z_j can be regarded as the new representation of the data point x_j in the basis U. Since relatively few basis vectors are used to represent many data vectors, a good approximation can only be achieved if the basis vectors discover structure that is latent in the data [Lee and Seung, 2001].

The non-negative constraints on U and V only allow additive combinations among different basis vectors. This is the most significant difference between NMF and other matrix factorization methods, e.g., SVD. Unlike SVD, no subtractions can occur in NMF. For this reason, it is believed that NMF can learn a parts-based representation [Lee and Seung, 1999]. The advantages of this parts-based representation have been observed in many real world problems such as face analysis [Li et al., 2001], document clustering [Xu et al., 2003] and DNA gene expression analysis [Brunet et al., 2004].
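For concreteness, one pass of the updates in Eq. (2) can be written in a few lines of NumPy (a sketch of ours, not code from the paper; the constant eps is our own guard against division by zero):

import numpy as np

def nmf_divergence_update(X, U, V, eps=1e-10):
    # One pass of Lee & Seung's divergence updates (Eq. 2).
    # X: (m, n) nonnegative data; U: (m, t); V: (n, t); Y = U V^T approximates X.
    Y = U @ V.T + eps
    U *= ((X / Y) @ V) / (V.sum(axis=0) + eps)      # u_ik update
    Y = U @ V.T + eps
    V *= ((X / Y).T @ U) / (U.sum(axis=0) + eps)    # v_jk update
    return U, V

Iterating the two updates monotonically decreases the divergence in Eq. (1) [Lee and Seung, 2001].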
3 Locality Preserving Non-negative Matrix Factorization

Recall that NMF tries to find a basis that is optimized for the linear approximation of the data. One might hope that knowledge of the geometric structure of the data can be exploited for better discovery of this basis. A natural assumption here could be that if two data points x_j, x_s are close in the intrinsic geometry of the data distribution, then z_j and z_s, the representations of these two points in the new basis, are also close to each other. This assumption is usually referred to as the manifold assumption [Belkin and Niyogi, 2001; He and Niyogi, 2003], which plays an essential role in developing various kinds of algorithms, including dimensionality reduction algorithms [Belkin and Niyogi, 2001] and semi-supervised learning algorithms [Belkin et al., 2006].

Recent studies on spectral graph theory [Chung, 1997] and manifold learning theory [Belkin and Niyogi, 2001] have demonstrated that the local geometric structure can be effectively modeled through a nearest neighbor graph on a scatter of data points. Consider a graph with n vertices where each vertex corresponds to a document in the corpus. Define the edge weight matrix W as follows:

    W_{js} = 1, if x_j ∈ N_p(x_s) or x_s ∈ N_p(x_j);  W_{js} = 0, otherwise,    (4)

where N_p(x_s) denotes the set of p nearest neighbors of x_s.

Again, we can use the divergence between the low dimensional representations of two samples in the new basis to measure their "distance":

    D(z_j || z_s) = \sum_{k=1}^{t} \left( v_{jk} \log \frac{v_{jk}}{v_{sk}} - v_{jk} + v_{sk} \right),    (5)

since we have z_j = [v_{j1}, ..., v_{jt}]^T. Thus, the following term can be used to measure how smoothly the low dimensional representation varies along the geodesics in the intrinsic geometry of the data:

    R = \frac{1}{2} \sum_{j,s=1}^{n} \left( D(z_j || z_s) + D(z_s || z_j) \right) W_{js}
      = \frac{1}{2} \sum_{j,s=1}^{n} \sum_{k=1}^{t} \left( v_{jk} \log \frac{v_{jk}}{v_{sk}} + v_{sk} \log \frac{v_{sk}}{v_{jk}} \right) W_{js}.    (6)

By minimizing R, we get a conditional probability distribution which is sufficiently smooth on the data manifold. An intuitive explanation of minimizing R is that if two data points x_j and x_s are close (i.e., W_{js} is big), then z_j and z_s are also close to each other. Thus, we name the new algorithm Locality Preserving Non-negative Matrix Factorization (LPNMF).
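The graph of Eq. (4) and the smoothness term R of Eqs. (5)-(6) could be computed along the following lines (again a sketch of ours; the paper does not specify the distance used for the neighbor search, so Euclidean distance and the smoothing constant eps are our assumptions):

import numpy as np

def knn_weight_matrix(X, p):
    # X: (m, n) data matrix, one sample per column; returns the 0/1 matrix of Eq. (4).
    # Brute-force distances are fine for small n; use a KD-tree for large corpora.
    n = X.shape[1]
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    W = np.zeros((n, n))
    for j in range(n):
        nbrs = np.argsort(dist[j])[1:p + 1]   # p nearest neighbors, excluding j itself
        W[j, nbrs] = 1
    return np.maximum(W, W.T)                 # x_j in N_p(x_s) OR x_s in N_p(x_j)

def smoothness(V, W, eps=1e-10):
    # R in Eq. (6): symmetrized divergence between rows of V, weighted by W.
    ratio = np.log((V[:, None, :] + eps) / (V[None, :, :] + eps))                  # log(v_jk / v_sk)
    D = np.sum(V[:, None, :] * ratio - V[:, None, :] + V[None, :, :], axis=2)      # D(z_j || z_s), Eq. (5)
    return 0.5 * np.sum((D + D.T) * W)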

Given a data matrix X = [x_{ij}] ∈ R^{m×n}, our LPNMF aims to find two non-negative matrices U = [u_{ik}] ∈ R^{m×t} and V = [v_{jk}] ∈ R^{n×t} which minimize the following objective function:

    O = \sum_{i=1}^{m} \sum_{j=1}^{n} \left( x_{ij} \log \frac{x_{ij}}{y_{ij}} - x_{ij} + y_{ij} \right) + \lambda R    (7)

where Y = [y_{ij}] = UV^T and \lambda ≥ 0 is the regularization parameter.

3.1 Multiplicative Update Rules

The objective function O of LPNMF in Eq. (7) is not convex in both U and V together. Therefore it is unrealistic to expect an algorithm to find the global minimum of O. Similar to NMF, we also have two update rules which can achieve a local minimum:

    u_{ik} ← u_{ik} \frac{\sum_j ( x_{ij} v_{jk} / \sum_k u_{ik} v_{jk} )}{\sum_j v_{jk}}    (8)

    v_k ← \left( \sum_i u_{ik} I + \lambda L \right)^{-1} \left[ v_{1k} \frac{\sum_i x_{i1} u_{ik}}{\sum_k u_{ik} v_{1k}}, \; \ldots, \; v_{nk} \frac{\sum_i x_{in} u_{ik}}{\sum_k u_{ik} v_{nk}} \right]^T    (9)

where v_k is the k-th column of V and I is an n × n identity matrix. The matrix L is the graph Laplacian [Chung, 1997] of W. It is defined as D − W, where D is a diagonal matrix whose entries are the column (or row, since W is symmetric) sums of W, D_{jj} = \sum_s W_{sj}.

When \lambda = 0, it is easy to check that the update rules in Eqs. (8) and (9) reduce to the update rules of the original NMF. When \lambda > 0, we have the following theorem:

Theorem 1. The objective function O in Eq. (7) is nonincreasing under the update rules in Eqs. (8) and (9). The objective function is invariant under these updates if and only if U and V are at a stationary point.

Theorem 1 guarantees that the update rules of U and V in Eqs. (8) and (9) converge and the final solution will be a local optimum. Please see the Appendix for a detailed proof.
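A minimal sketch of one LPNMF iteration, written by us directly from Eqs. (8) and (9) (the function name and the constant eps are our own choices; this is not the authors' implementation):

import numpy as np

def lpnmf_update(X, U, V, L, lam, eps=1e-10):
    # X: (m, n) nonnegative data, U: (m, t), V: (n, t), L: (n, n) graph Laplacian, lam: lambda.
    # Update U exactly as in divergence NMF (Eq. 8).
    Y = U @ V.T + eps                        # current approximation, y_ij = sum_k u_ik v_jk
    U *= ((X / Y) @ V) / (V.sum(axis=0) + eps)
    # Update V column by column (Eq. 9): solve (sum_i u_ik I + lam L) v_k = b_k,
    # where b_jk = v_jk * sum_i x_ij u_ik / sum_k u_ik v_jk is the usual NMF numerator.
    Y = U @ V.T + eps
    B = V * ((X / Y).T @ U)                  # (n, t) matrix of numerators
    n = V.shape[0]
    for k in range(V.shape[1]):
        A = U[:, k].sum() * np.eye(n) + lam * L
        V[:, k] = np.linalg.solve(A, B[:, k])
    return U, V

The V update solves one n × n linear system per latent dimension, with the graph Laplacian L = D − W coupling the documents.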
4 Experimental Results

Previous studies show that NMF is very powerful for document clustering [Xu et al., 2003]. It can achieve similar or better performance than most of the state-of-the-art clustering algorithms, including the popular spectral clustering methods [Shi and Malik, 2000]. Assume that a document corpus is comprised of k clusters, each of which corresponds to a coherent topic. To accurately cluster the given document corpus, it is ideal to project the documents into a k-dimensional semantic space in which each axis corresponds to a particular topic. In this semantic space, each document can be represented as a linear combination of the k topics. Because it is more natural to consider each document as an additive rather than subtractive mixture of the underlying topics, the combination coefficients should all take non-negative values. These values can be used to decide the cluster membership. This is the main motivation for applying NMF to document clustering. In this section, we evaluate our LPNMF algorithm on the document clustering problem.

The largest 30 categories in TDT2² and Reuters³ are used in our experiments. The TDT2 corpus consists of data collected during the first half of 1998 and taken from 6 sources, including 2 newswires (APW, NYT), 2 radio programs (VOA, PRI) and 2 television programs (CNN, ABC). The Reuters data set was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. In this experiment, those documents appearing in two or more categories were removed. Finally we have 9,394 documents in the TDT2 corpus and 8,076 documents in the Reuters corpus. In both data sets, the stop words are removed and each document is represented as a tf-idf vector.

² NIST Topic Detection and Tracking corpus: http://www.nist.gov/speech/tests/tdt/tdt98/index.htm
³ Reuters-21578 corpus: http://www.daviddlewis.com/resources/testcollections/reuters21578/

4.1 Evaluation Metric

The clustering result is evaluated by comparing the obtained label of each document with that provided by the document corpus. The normalized mutual information metric (NMI) is used to measure the performance [Xu et al., 2003; Cai et al., 2005].

Let C denote the set of clusters obtained from the ground truth and C' the set obtained from our algorithm. Their mutual information metric MI(C, C') is defined as follows:

    MI(C, C') = \sum_{c_i ∈ C, c'_j ∈ C'} p(c_i, c'_j) \cdot \log \frac{p(c_i, c'_j)}{p(c_i) \cdot p(c'_j)}

where p(c_i) and p(c'_j) are the probabilities that a document arbitrarily selected from the corpus belongs to the clusters c_i and c'_j, respectively, and p(c_i, c'_j) is the joint probability that the arbitrarily selected document belongs to the cluster c_i as well as c'_j at the same time. In our experiments, we use the normalized mutual information NMI as follows:

    NMI(C, C') = \frac{MI(C, C')}{\max(H(C), H(C'))}

where H(C) and H(C') are the entropies of C and C', respectively. It is easy to check that NMI(C, C') ranges from 0 to 1. NMI = 1 if the two sets of clusters are identical, and NMI = 0 if the two sets are independent.
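The metric can be computed directly from these definitions; the sketch below is ours, not the authors' evaluation code, and uses the max-entropy normalization given above (off-the-shelf library implementations exist but may normalize MI differently).

import numpy as np

def nmi(labels_true, labels_pred):
    # NMI(C, C') = MI(C, C') / max(H(C), H(C')), written from the definitions in Sec. 4.1.
    a = np.asarray(labels_true)
    b = np.asarray(labels_pred)
    mi = 0.0
    for ci in np.unique(a):
        p_i = np.mean(a == ci)
        for cj in np.unique(b):
            p_ij = np.mean((a == ci) & (b == cj))   # joint probability p(c_i, c'_j)
            if p_ij > 0:
                mi += p_ij * np.log(p_ij / (p_i * np.mean(b == cj)))
    h_a = -sum(np.mean(a == c) * np.log(np.mean(a == c)) for c in np.unique(a))
    h_b = -sum(np.mean(b == c) * np.log(np.mean(b == c)) for c in np.unique(b))
    return mi / max(h_a, h_b)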

Table 1: Clustering performance on TDT2 (% NMI)

k     kmeans   AA     NC     NMF    LPNMF
2     82.7     74.7   97.7   95.7   98.3
3     83.4     82.2   95.1   89.0   95.1
4     80.7     76.7   89.6   86.2   94.4
5     78.1     75.9   92.1   83.6   93.9
6     81.6     79.5   92.3   91.9   92.1
7     79.2     79.7   87.7   84.8   88.6
8     78.1     74.2   83.1   81.7   86.6
9     79.8     75.6   87.7   86.6   90.9
10    73.1     69.2   76.7   81.2   83.4
Avg   79.6     76.4   89.1   86.7   91.5

Table 2: Clustering performance on Reuters (% NMI)

k     kmeans   AA     NC     NMF    LPNMF
2     44.9     39.2   56.3   60.2   64.0
3     37.5     39.2   51.2   50.7   55.4
4     50.4     43.9   57.8   61.1   62.9
5     46.9     42.1   50.6   51.3   53.1
6     55.1     45.1   59.1   59.3   59.9
7     54.0     44.9   54.0   56.3   57.7
8     41.8     32.4   39.7   40.4   44.7
9     43.9     36.2   45.1   45.9   48.1
10    54.8     44.7   51.5   55.4   55.5
Avg   47.7     40.9   51.7   53.4   55.7

[Figure 1 appears here: two panels plotting NMI (%) on TDT2 for LPNMF, kmeans, NC, and NMF; the left panel varies the regularization parameter λ over 10^0 to 10^4, and the right panel varies the number of nearest neighbors p over 3 to 19.]

Figure 1: The performance of LPNMF vs. the parameters λ and p. LPNMF is very stable with respect to both λ and p: it achieves consistently good performance with λ varying from 100 to 1000 and p varying from 5 to 17.

4.2 Performance Evaluations and Comparisons

To demonstrate how the document clustering performance can be improved by our method, we compared LPNMF with four other popular document clustering algorithms:

• The canonical k-means clustering method (kmeans in short).
• Two representative spectral clustering methods: Average Association (AA in short) [Zha et al., 2001] and Normalized Cut (NC in short) [Shi and Malik, 2000]. Interestingly, Zha et al. [Zha et al., 2001] have shown that AA is equivalent to SVD followed by k-means clustering if the inner product is used to measure the document similarity.
• Nonnegative Matrix Factorization based clustering (NMF in short).

There are two parameters in our LPNMF approach: the number of nearest neighbors p and the regularization parameter λ. Throughout our experiments, we empirically set the number of nearest neighbors p to 5 and the value of the regularization parameter λ to 100.

Tables 1 and 2 show the evaluation results on the TDT2 and the Reuters corpus, respectively. The evaluations were conducted with the cluster number ranging from two to ten. For each given cluster number k, 20 test runs were conducted on different randomly chosen clusters and the average performance is reported in the tables. These experiments reveal a number of interesting points:

• The non-negative matrix factorization based methods, both NMF and LPNMF, outperform the AA method (SVD+kmeans), which suggests the superiority of NMF over other matrix factorization methods, e.g., SVD, in discovering the hidden topic structure.
• Our LPNMF approach achieves significantly better performance than the ordinary NMF. This shows that, by considering the intrinsic geometrical structure of the data, LPNMF can learn a better compact representation in the sense of semantic structure.
• The improvement of LPNMF over the other methods is more significant on the TDT2 corpus than on the Reuters corpus. One possible reason is that the document clusters in TDT2 are generally more compact and focused than the clusters in Reuters. Thus, the nearest neighbor graph constructed over TDT2 can better capture the geometrical structure of the document space.

4.3 Parameters Selection

Our LPNMF model has two essential parameters: the number of nearest neighbors p and the regularization parameter λ. Figure 1 shows how the performance of LPNMF varies with the parameters λ and p (we only show the results on the TDT2 corpus due to the space limitation; the curves are similar on the Reuters corpus). As we can see, LPNMF is very stable with respect to both λ and p. It achieves consistently good performance with λ varying from 100 to 1000 and p varying from 5 to 17.

5 Conclusion

We have presented a novel method for matrix factorization, called Locality Preserving Non-negative Matrix Factorization (LPNMF). LPNMF models the data space as a submanifold embedded in the ambient space and performs non-negative matrix factorization on this manifold. As a result, LPNMF can have more discriminating power than

the ordinary NMF approach, which only considers the Euclidean structure of the data. Experimental results on document clustering show that LPNMF provides a better representation in the sense of semantic structure.

Acknowledgments

This work was supported in part by the National Key Basic Research Foundation of China under Grant 2009CB320801, the Program for Changjiang Scholars and Innovative Research Team in University (IRT0652, PCSIRT), the National Science Foundation of China under Grant 60875044, the U.S. National Science Foundation grants IIS-08-42769 and BDI-05-15813, and the Air Force Office of Scientific Research MURI award FA9550-08-1-0265. Any opinions, findings, and conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of the funding agencies.

Appendix (Proof of Theorem 1)

The objective function O of LPNMF in Eq. (7) is certainly bounded from below by zero. To prove Theorem 1, we need to show that O is nonincreasing under the update steps in Eqs. (8) and (9). Since the second term of O is only related to V, we have exactly the same update formula for U in LPNMF as in the original NMF. Thus, we can use the convergence proof of NMF to show that O is nonincreasing under the update step in Eq. (8). Please see [Lee and Seung, 2001] for details.

Now we only need to prove that O is nonincreasing under the update step in Eq. (9). We will follow a procedure similar to that described in [Lee and Seung, 2001]. Our proof will make use of an auxiliary function similar to that used in the Expectation-Maximization algorithm [Dempster et al., 1977]. We begin with the definition of the auxiliary function.

Definition. G(V, V') is an auxiliary function for F(V) if the conditions G(V, V') ≥ F(V) and G(V, V) = F(V) are satisfied.

The auxiliary function is very useful because of the following lemma.

Lemma 2. If G is an auxiliary function of F, then F is nonincreasing under the update

    V^{(q+1)} = \arg\min_V G(V, V^{(q)})    (10)

Proof. F(V^{(q+1)}) ≤ G(V^{(q+1)}, V^{(q)}) ≤ G(V^{(q)}, V^{(q)}) = F(V^{(q)}).

Now we will show that the update step for V in Eq. (9) is exactly the update in Eq. (10) with a proper auxiliary function. We rewrite the objective function O of LPNMF in Eq. (7) as follows:

    O = \sum_{i,j} \left( x_{ij} \log \frac{x_{ij}}{\sum_k u_{ik} v_{jk}} - x_{ij} + \sum_k u_{ik} v_{jk} \right) + \frac{\lambda}{2} \sum_{j,s,k} \left( v_{jk} \log \frac{v_{jk}}{v_{sk}} + v_{sk} \log \frac{v_{sk}}{v_{jk}} \right) W_{js}    (11)

Lemma 3. The function

    G(V, V^{(q)}) = \sum_{i,j} \left( x_{ij} \log x_{ij} - x_{ij} + \sum_k u_{ik} v_{jk} \right)
        - \sum_{i,j,k} x_{ij} \frac{u_{ik} v_{jk}^{(q)}}{\sum_k u_{ik} v_{jk}^{(q)}} \left( \log u_{ik} v_{jk} - \log \frac{u_{ik} v_{jk}^{(q)}}{\sum_k u_{ik} v_{jk}^{(q)}} \right)
        + \frac{\lambda}{2} \sum_{j,s,k} \left( v_{jk} \log \frac{v_{jk}}{v_{sk}} + v_{sk} \log \frac{v_{sk}}{v_{jk}} \right) W_{js}

is an auxiliary function for

    F(V) = \sum_{i,j} \left( x_{ij} \log \frac{x_{ij}}{\sum_k u_{ik} v_{jk}} - x_{ij} + \sum_k u_{ik} v_{jk} \right) + \frac{\lambda}{2} \sum_{j,s,k} \left( v_{jk} \log \frac{v_{jk}}{v_{sk}} + v_{sk} \log \frac{v_{sk}}{v_{jk}} \right) W_{js}

Proof. It is straightforward to verify that G(V, V) = F(V). To show that G(V, V^{(q)}) ≥ F(V), we use the convexity of the log function to derive the inequality

    - \log \sum_k u_{ik} v_{jk} ≤ - \sum_k \alpha_k \log \frac{u_{ik} v_{jk}}{\alpha_k}

which holds for all nonnegative \alpha_k that sum to unity. Setting

    \alpha_k = \frac{u_{ik} v_{jk}^{(q)}}{\sum_k u_{ik} v_{jk}^{(q)}},

we obtain

    - \log \sum_k u_{ik} v_{jk} ≤ - \sum_k \frac{u_{ik} v_{jk}^{(q)}}{\sum_k u_{ik} v_{jk}^{(q)}} \left( \log u_{ik} v_{jk} - \log \frac{u_{ik} v_{jk}^{(q)}}{\sum_k u_{ik} v_{jk}^{(q)}} \right).

From this inequality it follows that G(V, V^{(q)}) ≥ F(V).

Theorem 1 then follows from the application of Lemma 2.

Proof of Theorem 1. The minimum of G(V, V^{(q)}) with respect to V is determined by setting the gradient to zero:

    \sum_i u_{ik} - \sum_i x_{ij} \frac{u_{ik} v_{jk}^{(q)}}{\sum_k u_{ik} v_{jk}^{(q)}} \frac{1}{v_{jk}} + \frac{\lambda}{2} \sum_s \left( \log \frac{v_{jk}}{v_{sk}} + 1 - \frac{v_{sk}}{v_{jk}} \right) W_{js} = 0,    (12)
    1 ≤ j ≤ n, 1 ≤ k ≤ t

Because of the log term, it is really hard to solve the above system of equations. Let us recall the motivation of the regularization term. We hope that if two data points x_j and x_s are close (i.e., W_{js} is big), then z_j will be close to z_s and v_{jk}/v_{sk} will approximately be 1. Thus, we can use the following approximation:

    \log(x) ≈ 1 - \frac{1}{x},  x → 1.

The above approximation is based on the first order expansion of the Taylor series of the log function. With this approximation, the equations in Eq. (12) can be written as

    \sum_i u_{ik} - \sum_i x_{ij} \frac{u_{ik} v_{jk}^{(q)}}{\sum_k u_{ik} v_{jk}^{(q)}} \frac{1}{v_{jk}} + \frac{\lambda}{v_{jk}} \sum_s ( v_{jk} - v_{sk} ) W_{js} = 0,    (13)
    1 ≤ j ≤ n, 1 ≤ k ≤ t

Let D denote a diagonal matrix whose entries are the column (or row, since W is symmetric) sums of W, D_{jj} = \sum_s W_{js}. Define L = D − W. Let v_k denote the k-th column of V, v_k = [v_{1k}, ..., v_{nk}]^T. It is easy to verify that \sum_s (v_{jk} - v_{sk}) W_{js} equals the j-th element of the vector Lv_k. The system of equations in Eq. (13) can then be rewritten as

    \sum_i u_{ik} I v_k + \lambda L v_k = \left[ v_{1k}^{(q)} \frac{\sum_i x_{i1} u_{ik}}{\sum_k u_{ik} v_{1k}^{(q)}}, \; \ldots, \; v_{nk}^{(q)} \frac{\sum_i x_{in} u_{ik}}{\sum_k u_{ik} v_{nk}^{(q)}} \right]^T,  1 ≤ k ≤ t.

Thus, the update rule of Eq. (10) takes the form

    v_k^{(q+1)} = \left( \sum_i u_{ik} I + \lambda L \right)^{-1} \left[ v_{1k}^{(q)} \frac{\sum_i x_{i1} u_{ik}}{\sum_k u_{ik} v_{1k}^{(q)}}, \; \ldots, \; v_{nk}^{(q)} \frac{\sum_i x_{in} u_{ik}}{\sum_k u_{ik} v_{nk}^{(q)}} \right]^T,  1 ≤ k ≤ t.

Since G is an auxiliary function, F is nonincreasing under this update.

References

[Belkin and Niyogi, 2001] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS 14, 2001.

[Belkin et al., 2006] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from examples. Journal of Machine Learning Research, 7:2399-2434, 2006.

[Brunet et al., 2004] Jean-Philippe Brunet, Pablo Tamayo, Todd R. Golub, and Jill P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences, 101(12):4164-4169, 2004.

[Cai et al., 2005] Deng Cai, Xiaofei He, and Jiawei Han. Document clustering using locality preserving indexing. IEEE Transactions on Knowledge and Data Engineering, 17(12):1624-1637, December 2005.

[Cai et al., 2008] Deng Cai, Xiaofei He, Xiaoyun Wu, and Jiawei Han. Non-negative matrix factorization on manifold. In Proc. Int. Conf. on Data Mining (ICDM'08), 2008.

[Chung, 1997] Fan R. K. Chung. Spectral Graph Theory, volume 92 of Regional Conference Series in Mathematics. AMS, 1997.

[Deerwester et al., 1990] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407, 1990.

[Dempster et al., 1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.

[Duda et al., 2000] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, Hoboken, NJ, 2nd edition, 2000.

[He and Niyogi, 2003] Xiaofei He and Partha Niyogi. Locality preserving projections. In NIPS 16, 2003.

[Lee and Seung, 1999] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.

[Lee and Seung, 2001] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In NIPS 13, 2001.

[Li et al., 2001] Stan Z. Li, XinWen Hou, HongJiang Zhang, and QianSheng Cheng. Learning spatially localized, parts-based representation. In CVPR'01, 2001.

[Liu et al., 2008] Wei Liu, Dacheng Tao, and Jianzhuang Li. Transductive component analysis. In Proc. Int. Conf. on Data Mining (ICDM'08), 2008.

[Logothetis and Sheinberg, 1996] N. K. Logothetis and D. L. Sheinberg. Visual object recognition. Annual Review of Neuroscience, 19:577-621, 1996.

[Palmer, 1977] S. E. Palmer. Hierarchical structure in perceptual representation. Cognitive Psychology, 9:441-474, 1977.

[Shi and Malik, 2000] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.

[Wachsmuth et al., 1994] E. Wachsmuth, M. W. Oram, and D. I. Perrett. Recognition of objects and their component parts: Responses of single units in the temporal cortex of the macaque. Cerebral Cortex, 4:509-522, 1994.

[Xu et al., 2003] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. In SIGIR'03, 2003.

[Zha et al., 2001] H. Zha, C. Ding, M. Gu, X. He, and H. Simon. Spectral relaxation for k-means clustering. In NIPS 14, 2001.
