Co-Separable Nonnegative Factorization

Junjun Pan Michael K. Ng∗

Abstract Nonnegative matrix factorization (NMF) is a popular model in the field of pattern recog- nition. It aims to find a low rank approximation for nonnegative data M by a product of two nonnegative matrices W and H. In general, NMF is NP-hard to solve while it can be solved efficiently under separability assumption, which requires the columns of factor matrix are equal to columns of the input matrix. In this paper, we generalize separability assumption based on 3-factor NMF M = P1SP2, and require that S is a sub-matrix of the input matrix. We refer to this NMF as a Co-Separable NMF (CoS-NMF). We discuss some mathematics properties of CoS-NMF, and present the relationships with other related matrix factorizations such as CUR decomposition, generalized separable NMF(GS-NMF), and bi-orthogonal tri- factorization (BiOR-NM3F). An optimization model for CoS-NMF is proposed and alternated fast gradient method is employed to solve the model. Numerical experiments on synthetic datasets, document datasets and facial databases are conducted to verify the effectiveness of our CoS-NMF model. Compared to state-of-the-art methods, CoS-NMF model performs very well in co-clustering task, and preserves a good approximation to the input data matrix as well. Keywords. nonnegative matrix factorization, separability, algorithms.

1 Introduction

Matrix methods lie at the root of most methods of machine learning and data analysis. Among all matrix methods, nonnegative matrix factorization (NMF) is an important one. It can auto- matically extracts sparse and meaningful features from a set of nonnegative data vectors and has m×n become a popular tool in data mining society. Given a nonnegative matrix M ∈ R+ and an m×r r×n arXiv:2109.00749v1 [cs.LG] 2 Sep 2021 integer factorization rank r, NMF is the problem of computing W ∈ R+ and H ∈ R+ such that M ≈ WH. Note that r is usually much smaller than min{m, n}, NMF is well-known as a pow- erful technique for dimension reduction, and is able to give easily interpretable factors due to the nonnegativity constraints. It has been applied successfully in many areas, like image processing, text data mining, hyperspectral unmixing, see for example the recent survey and books [5, 9, 14] and the references therein. In general, NMF is NP-hard and its solution is not unique, see [9,34] and the reference therein. To resolve these two disadvantages, some assumptions like separability are introduced as a way to solve NMF problem efficiently and to guarantee the uniqueness of solution. NMF with separability m×r assumption is referred to as separable NMF problem, aims to find nonnegative matrices W ∈ R+ r×n and H ∈ R+ such that M = WH,W = M(:, K).

∗Department of Mathematics, The University of Hong Kong. Emails: [email protected], [email protected]. M. Ng’s research is supported in part by HKRGC GRF 12300218, 12300519, 17201020 and 17300021.

1 The constraint W = M(:, K) implies that each column of W is equal to a column of M. If a matrix M is r-separable, then there exist some Π ∈ {0, 1}n×n and a nonnegative 0 r×(n−r) matrix H ∈ R+ such that

 I H0  MΠ = MΠ r , 0n−r,r 0n−r,n−r where Ir is the r-by-r and 0r,p is the matrix of all zeros of dimension r by p. Equivalently,  I H0  M = M Π r ΠT . (1) 0n−r,r 0n−r,n−r | {z } n×n X∈R This equivalent definition of separability was proposed and discussed in [7, 8, 17, 32] and will be very useful in this paper. Separable NMF is an important method which corresponding to self-dictionary learning in data science [18]. The separability makes sense in many practical applications. For instance, in document classification, given a word-document data matrix, each entry M(i, j) of M represents the importance of word i in document j. Separability of M indicates that, for each topic, there exist at least one document only discuss that topic, which is referred to as ”pure” document. These ”pure” documents can be regarded as key features that form feature matrix W = M(:, K) to represent its original data matrix M. Considering feature matrix W , i.e., ”word × key documents” matrix, it is reasonable to assume that , for each pure document, there are at least one word used only in that document. For example, in a pure document that only discusses biology, the words like ”transaminase”, ”amino acid”, can only show up in that biology document, but not in documents related to politics, philosophy or art. Based on the above consideration, we generalize the separability assumption as follows. m×n Definition 1. A matrix M ∈ R+ is co-(r1, r2) separable if there exists an index set K1 of m×r1 cardinality r1 and an index set K2 of cardinality r2, and nonnegative matrices P1 ∈ R+ and r2×n P2 ∈ R+ such that M = P1M(K1, K2)P2 (2) where P1(K1, :) = Ir1 and P2(:, K2) = Ir2 . M(K1, K2) is referred to as the core of matrix M.

For simplicity, we call a matrix CoS-matrix if it has decomposition (2). The co-(r1, r2)- separability is a natural extension of r-separability. A matrix M is r-separable matrix, is also a co-(m, r)-separable. Every m × n nonnegative matrix M is co-(m, n)-separable. Note-worthily, compared to r-separability, co-(r1, r2)-separability provides a more compact basic matrix (i.e., M(K1, K2)) to represent data matrix.

1.1 Related Problems As a method that selects columns and rows to represent the input nonnegative matrix , CoS-NMF model is related to generalized separable NMF (GS-NMF) model [31]. Precisely, GS-NMF aims to find row set K1 and column set K2 to represent M in the form of M = M(:, K2)P2 + P1M(K1, :), where P2(:, K2) = Ir2 and P1(K1, :) = Ir1 . The motivation of GS-NMF is different from CoS-NMF model, take document classification as an example, GS-NMF assumes that there exists either a ”pure” document or an anchor word, while CoS-NMF assume that there are at least an anchor word exist in a ”pure” document. The different motivations lead to different representative form. We can see that GS-NMF is more relaxed, while CoS-NMF has a more compact form.

2 CoS-NMF also has a very close connection with CUR decomposition, that is, given a matrix M, identify a row subset K1 and column subset K2 from M such that kM − M(:, K2)UM(K1, :)k is minimized. For CUR model, the factor matrix U is computed to minimize the approximation † † † error [25], i.e., U = M(:, K2) MM(K1, :) . When U is required to be U = M(K1, K2) , the variant model is then called pseudo-skeleton approximation where A† denotes a Moore-Penrose generalized inverse of matrix A. Note that these models do not consider nonnegativity, the analysis is different from CoS-NMF. For example, CUR can pick any subset of r linearly independent rows and columns to obtain exact decompositions of any rank-r matrix, but it is not true for CoS-NMF. For more information on CUR decomposition and pseudo-skeleton approximation, we refer the interest reader to [4,20,26,36] and the references therein. In Section 3, we will discuss the connection and difference between CoS-NMF and CUR in details. Our model is also related to tri-symNMF model proposed in [2] for learning topic models, i.e., given a word-document matrix M, it aims to find the word-topic matrix W and topic-topic S such that A = MM T ≈ WSW T , where A is the word co-occurrence matrix. Gillis in [14] showed that tri-symNMF model, can be represented in the form of A = W diag(z)−1A(K, K)diag(z)−1W T , r where W (K, :) = diag(z) for some z ∈ R+. Any separable NMF algorithms like SPA, can be hired to solve this model. One can solve a minimum-volume tri-symNMF instead since the separability assumption in tri-symNMF can be relaxed to sufficiently scattered condition (SSC), see [9, 10] for more details. We note that if diag(z) = Ir, tri-symNMF model is a special case of CoS-NMF, T provided that P1 = P2 in (2). In [6,35], Ding and et al. proposed a nonnegative matrix tri-factorization for co-clustering, i.e., m×n m×r1 r1×r2 r2×n given a matrix M ∈ R+ , it aims to find G1 ∈ R+ , S ∈ R+ and G2 ∈ R+ such that M ≈ G1SG2. It provides a good framework to simultaneously cluster the rows and columns of M. Here G1 gives row clusters and G2 gives column clusters. When orthogonality constrain is T T added to G1 and G2, i.e., G1 G1 = I, G2G2 = I, the model is called bi-orthogonal tri-factorization (BiOR-NM3F) and related to hard co-clustering. In Section 3, we will show the connection between BiOR-NM3F and CoS-NMF.

1.2 The Outline In this paper, we consider CoS-NMF problem which generates separability condition to co-separability on NMF problem. In Section 2, some equivalent characterizations of CoS-matrix are first provided that lead to an ideal model to tackle CoS-NMF problem. We present some properties and discuss the uniqueness of CoS-NMF problem. We show that a minimal co-(r1, r2)-separable is unique up to scaling and permutations, while its selection of row set K1 and column set K2 is not unique. In Section 3, we discuss the relationship between CoS-NMF problem and three other related problems. First, we give the intersection form of CoS-matrix and GS-matrix. Then, we present the relation with bi-orthogonal tri factorization (BiOR-NM3F) and prove that any matrix M admits BiOR-NM3F form is minimal co-(r1, r2)-separable matrix. At last, we show the connection with CUR decomposition, that is, any CoS-matrix admits an exact CUR decomposition, while CUR- matrix is a CoS-matrix only under some conditions. In Section 4, based on the properties of CoS-NMF, we propose a convex optimization model. An alternating fast gradient method generalized from the method presented in [18] is proposed to tackle CoS-NMF problem. In Section 5, numerical experiments are conducted on synthetic , document and facial data sets. We show that CoS-NMF algorithm performs well in co-clustering applications, as well as preserves a good approximation to its original data matrix.

3 2 Properties of Co-Separable Matrices

In the following, we first present three equivalent characterizations of CoS matrices. m×n Property 1 (Equivalent Characterization 1). A matrix M ∈ R+ is co-(r1, r2)-separable if and only if it can be written as  SSH  M = Π Π , r WSWSH c (3)

n×n m×m for some permutations matrices Πc ∈ {0, 1} and Πr ∈ {0, 1} , and for some nonnegative r1×r2 (m−r1)×r1 r2×(n−r2) matrices S ∈ R+ ,W ∈ R+ , and H ∈ R+ .

Proof. The permutation Πr is chosen such that it moves the rows of M corresponding to K1 in the first r1 positions, and the permutation Πc is chosen such that it moves the columns of M corresponding to K2 in the first r2 positions. After the permutations, M(K1, K2) is the r1 by r2 T T block in top left of Πr MΠc . Since M = P1M(K1, K2)P2 for nonnegative matrices P1 and P2, we  I  M(K , K ) = S P = Π r1 P = [I ,H]Π let 1 2 , 1 r W , 2 r2 c, the results follow.

We know that a matrix M is r-separable if and only if it can be written in the form of (1). In the following, we present a similar characterization for CoS matrices. m×n Property 2 (Equivalent Characterization 2). A matrix M ∈ R+ is co-(r1, r2)-separable if and only if it can be written as M = XMY, (4) where     Ir1 0r1,m−r1 T T Ir2 H X = Πr Πr ,Y = Πc Πc, (5) W 0m−r1,m−r1 0n−r2,r2 0n−r2,n−r2

n×n m×m (m−r1)×r1 for some permutations matrices Πc ∈ {0, 1} and Πr ∈ {0, 1} , and for some W ∈ R+ r2×(n−r2) and H ∈ R+ .

Proof. From Property 1, the matrix M is co-(r1, r2)-separable if and only if there exist some n×n m×m permutation matrices Πc ∈ {0, 1} and Πr ∈ {0, 1} such that

 SSH  M = Π Π r WSWSH c

 SSH  W ∈ (m−r1)×r1 H ∈ r2×(n−r2) Mˆ = ΠT MΠT = for some R+ and R+ . Let r c WSWSH , we have that  I 0   I H  Mˆ = r1 r1,m−r1 Mˆ r2 . W 0m−r1,m−r1 0n−r2,r2 0n−r2,n−r2 ˆ Since M = ΠrMΠc, the result follows. Intuitively, co-separable should be intrinsically related to separable decomposition. In the following, we show a characterization that is represented by separability of matrix M and its transpose M T .

4 m×n Property 3 (Equivalent Characterization 3). A matrix M ∈ R+ is co-(r1, r2)-separable if and T only if M is r2-separable and M is r1-separable, i.e., it can be written as M = XM,M = MY (6) where     Ir1 0r1,m−r1 T T Ir2 H X = Πr Πr ,Y = Πc Πc, W 0m−r1,m−r1 0n−r2,r2 0n−r2,n−r2

n×n m×m (m−r1)×r1 for some permutations matrices Πc ∈ {0, 1} and Πr ∈ {0, 1} , and for some W ∈ R+ r2×(n−r2) and H ∈ R+ .

Proof. From Property 1, after permutations Πc and Πr, the last n − r2 columns are convex com- T T binations of the first r2 columns of Πr MΠc , which implies that M is r2-separable; also the last T T T m − r1 rows are convex combinations of the first r1 rows of Πr MΠc , which implies that M is r1-separable, i.e., (6) established. Letting M = {1, ··· , m}, and N = {1, ··· , n}, if M is r2 separable, there exists a column set K2 r2×n such that M = M(:, K2)Y0, where Y0 ∈ R+ , Y0(:, K2) = Ir2 and Y0(:, N − K2) = H. Similarly, T m×r1 M is r1 separable, there exists a row set K1 such that M = X0M(K1, :), where X0 ∈ R+ ,

X0(K1, :) = Ir1 and X0(M − K1, :) = W . Letting S = M(K1, K2), we have,

M(M − K1, K2) = X0(M − K1, K1)M(K1, K2) = W S,

M(K1, N − K2) = M(K1, K2)Y0(K2, N − K2) = SH,

M(M − K1, N − K2) = X0(M − K1, K1)M(K1, N − K2) = WSH.

Hence, after some permutations, M has the form of (3), i.e., M is co-(r1, r2)-separable matrix. T Remark 1. For a co-(r1, r2)-separable matrix M, M(:, K2) is r1-separable, and M(K1, :) is r2- separable.

In the following property, we will show that under some conditions, a co-(r1, r2)-separable matrix can be further decomposed into a more compressive form.  SSH  Property 4. A co-(r , r )-separable matrix M = Π Π ∈ m×n can be reduced 1 2 r WSWSH c R+ r1×r2 to co-(ˆr1, rˆ2)-separable matrix if the core S ∈ R+ is co-(ˆr1, rˆ2)-separable, where max{rˆ1, rˆ2} ≤ min{r1, r2}.   Irˆ1  Proof. S is co-(ˆr1, rˆ2)-separable, without loss of generality, let S = S0 Irˆ2 H0 , with W0 rˆ1×rˆ2 (r1−rˆ1)×rˆ1 rˆ2×(r2−rˆ2) T S0 ∈ R , W0 ∈ R , H0 ∈ R , letting W = [W1,W2], H = [H1,H2] , where (m−r1)×rˆ1 (m−r1)×(r1−rˆ1) rˆ2×(n−r2) (r2−rˆ2)×(n−r2) W1 ∈ R , W2 ∈ R , H1 ∈ R , H2 ∈ R , then,  I  M = Π r1 S I H  Π r W r2 c   Irˆ1 0     Irˆ1  Irˆ2 0 H1 = Πr  0 Ir1−rˆ1  S0 Irˆ2 H0 Πc W0 0 Ir2−rˆ2 H2 W1 W2   Irˆ 1  = Πr  W0  S0 Irˆ2 H0 H1 + H0H2 Πc. W1 + W2W0

Hence, M is reduced to co-(ˆr1, rˆ2)-separable. The results follow.

5 Although a co-(r1, r2)-separable matrix can be further compressed from Property 4, we need to remark the following fact to the authors.

Remark 2. A co-(r1, r2)-separable matrix M is not always a co-(r, r)-separable, where r = min{r1, r2}.

r×(r+1) r×r Example 1. Given M = [m1, m2, ··· , mr, mr+1] ∈ R+ , and [m1, m2, ··· , mr] ∈ R+ has full column rank r and mr+1 = α1m1 + ··· + αr−1mr−1 − αrmr, where αi ≥ 0, i = 1, ··· , r. We r×r could not find a column set W ∈ R ⊂ M, such that M = W [Ir, h] and h ≥ 0, i.e., M is not co-(r, r)-separable, but co-(r, r + 1)-separable.

From Property 4, given a co-(r1, r2)-separable matrix M, it becomes important to find the minimal value for r1 and r2 since this compresses the data matrix the most. In following, we define minimal co-(r1, r2)-separable matrices.

Definition 2. A matrix M is a minimal co-(r1, r2)-separable if M is co-(r1, r2)-separable and M 0 0 0 0 is not co-(r1, r2)-separable for any r1 < r1 and r2 < r2.

In general, from Remark 2, r1 is not necessary equal to r2 for a minimal co-(r1, r2)-separable matrix M. However, there are some exceptions. The simplest cases are for rank-one and rank-two matrices. Property 5. Any nonnegative rank one matrix M is minimal co-(1, 1)-separable matrix.

Proof. It follows directly from the fact that all the rows (resp. columns) are multiple of one another.

Property 6. Any nonnegative rank two matrix M is minimal co-(2, 2)-separable matrix.

Proof. We know that any two dimensional cone can be always spanned by its two extreme rays, and rank(M) = rank(M T ), it means both M and M T are 2-separable. From Property 3, M is co-(2, 2)-separable. For a general case, from Property 3, finding minimal CoS factorization is equivalent to finding X and Y that satisfy (6) and such that the number of non-zero rows of X and non-zero columns of Y is minimized.

∗ ∗ Property 7 (Idealized Model). Let M be minimal co-(r1, r2)-separable, and let (X ,Y ) be an optimal solution of

min kXkcol,0 + kY krow,0, s.t. M = XM,M = MY. m×m n×n (7) X∈R+ ,Y ∈R+ kXkcol,0 equal to the number of nonzero columns of X and kY krow,0 equals to the number of ∗ nonzero rows of Y . Let also K1 correspond to the indices of the non-zero columns of X and K2 ∗ to the indices of the non-zero rows of Y , then |K1| + |K2| = r1 + r2.

Proof. On one hand, M is minimal co-(r1, r2)-separable matrix, from Property 3, there exist X and Y such that M = XM and M = MY , where the number of nonzero column of X and nonzero ∗ ∗ row of Y is equal to r1 + r2. By the optimality of (X ,Y ), we have |K1| + |K2| ≤ r1 + r2. On the other hand, without loss of generality, let the optimal solution (X∗,Y ∗) and matrix M be  X 0   Y Y   M M  X∗ = 1 ,Y ∗ = 1 2 ,M = 11 12 . X2 0 0 0 M21 M22

6 At least one principal submatrix of X of order |K1| is nonsingular, without loss of generality, let |K1|×|K1| |K2|×|K2| X1 ∈ R be full rank submatrix of X. Similarly, let Y1 ∈ R be full rank submatrix of Y . From M = XM and M = MY , we have  M M   X M X M   M M   M Y M Y  11 12 = 1 11 1 12 , 11 12 = 11 1 11 2 , M21 M22 X2M11 X2M12 M21 M22 M21Y1 M21Y2 hence, M12 = M11Y2, M21 = X2M11, M22 = X2M12 = X2(M11Y2). We have,     M11 M11Y2 I|K1|  M = = M11 I|K2| Y2 , X2M11 X2M11Y2 X2 from Definition 1, we know that M is co-(|K1|, |K2|)-separable matrix. Since M is a minimal- (r1, r2)-separable, from Definition 2, |K1| + |K2| ≥ r1 + r2. Therefore, the result follows.

The following property shows that co-(r1, r2)-separability is invariant to scaling.

Property 8. [Scaling] The matrix M is co-(r1, r2)-separable if and only if D1MD2 is (r1, r2)- separable for any diagonal matrices D1 and D2 whose diagonal elements are positive.

Proof. Let M be co-(r1, r2) separable with M = P1M(K1, K2)P2 with |K1| = r1 and |K2| = r2. Multiplying on both sides by D1 and D2, we have

D1MD2 = D1P1M(K1, K2)P2D2 −1  −1 = D1P1D1 (K1, K1) D1(K1, K1)M(K1, K2)D2(K2, K2) D2 (K2, K2)P2D2. ˜ ˜ −1 ˜ −1 Denoting M = D1MD2, P1 = D1P1D1 (K1, K1), P2 = D2 (K2, K2)P2D2, note that

M˜ (K1, K2) = D1(K1, :)MD2(:, K2)

= D1(K1, :)P1M(K1, K2)P2D2(:, K2)

= D1(K1, K1)M(K1, K2)D2(K2, K2), therefore, we have M˜ = P˜1M˜ (K1, K2)P˜2. ˜ ˜ ˜ Moreover, P1(K1, :) = Ir1 and P2(:, K2) = Ir2 ; hence M is co-(r1, r2)-separable. The proof of the ˜ other direction is the same because M = D1MD2 is the diagonal scaling of M using the inverses of D1 and D2.

2.1 Uniqueness of CoS-NMF Similar to separable NMF, CoS-NMF admits a unique solution up to permutation and scaling. m×n Property 9 (Uniqueness). Let M ∈ R+ be a minimal co-(r1, r2)-separable matrix in the form of (2) and rank(M) = r1 = r2, then (P1, S, P2) is unique up to permutation and scaling, i.e., r1×r1 r2×r2 there exists permutation matrices Π1 ∈ {0, 1} , Π2 ∈ {0, 1} , and diagonal scaling matrix r1×r1 r2×r2 D1 ∈ R and D2 ∈ R with positive elements such that ˜ −1 ˜ T T ˜ −1 P1 = P1Π1D1 , S = D1Π1 SΠ2 D2, P2 = D2 Π2P2.

Proof. M is co-(r1, r2)-separable matrix, i.e., M = P1SP2, where P1(K1, :) = Ir1 and P2(:, K2) =

Ir2 are separable matrices. From Lemma 4.36 and Theorem 4.37 in [14], we have

T T T cone(M) = cone(P1S), cone(M ) = cone(P2 S ).

7 When r1 = r2 = r, then cone(M) has extreme rays which are columns of P1S. If there is another ˜ ˜ ˜ ˜ ˜ solution (P1, S, P2), again cone(M) has extreme rays which are columns of P1S, then columns of P1S are coincide up to scaling and permutation. Hence, P2 is unique given P1S. Similarly, T T T cone(M ) has extreme rays which are columns of P2 S , i.e., the rows of SP2. Hence, P1 is unique, up to scaling and permutation. Since P1 and P2 are unique up to scaling and permutation, S is unique. the results follow. Though from Property 9, a minimal co-(r, r)-separable matrix is unique up to scaling and permutation, its selection of row set K1 and column set K2 is not unique. Here is an example. Example 2.   SD2 SSH2 M =  D1SD2 D1SD1SH2  , W1SD2 W1SW1SH2 r×r where S ∈ R , D1 and D2 are with positive elements. We have  I   M =  D1  S D2 IH2 : M(K1, K2) = S, (K1, K2) = (1 : r, r + 1 : 2r) W1 and  −1  D1 −1 −1  M =  I  D1SD2 ID2 D2 H2 : M(K1, K2) = D1SD2, (K1, K2) = (r+1 : 2r, 1 : r). −1 W1D1

However, it is still possible to guarantee the uniqueness of the selection of (K1, K2). In the following, we propose some conditions for selection uniqueness.

Property 10. Let M be minimal co-(r1, r2)-separable matrix, and r1 = r2 = r, there are no proportional columns and rows, then it admits a unique co-separable decomposition of size (r, r). Proof. The property is from Property 9 directly.

3 The Relationships with Other Matrix Factorizations

m×n A matrix M ∈ R is (l1, l2)-generalized-separable matrix (GS-matrix [31]) if

M = M(:, L2)Z2 + Z1M(L1, :), (8)

where Z2(:, L2) = Il2 and Z1(L1, :) = Il1 . We remark that M(L1, L2) = 0 for a (l1, l2)-GS-matrix 1 . In the following property, we present the form of the intersection of co-(r1, r2)-separable matrix and (l1, l2)-GS-matrix.

Property 11 (Relationship between GS-matrix). If matrix M has a unique minimal (l1, l2)- generalized-separable decomposition, and also admits co-(r1, r2)-separable decomposition, then min{r1, r2} ≥ (l1 + l2), and M can be written as   Q0W0 Q0W0H1 + Q0W1H0 + Q1H0 Q0W0U1 + Q0W0H1U0 + Q0W1H0U0 + Q1H0U0 M = Πr  W0 W0H1 + W1H0 W0U1 + W0H1U0 + W1H0U0  Πc

0l1,l2 H0 H0U0 (9)

1 It is (l2, l1)-GS-matrix in [31], in order to keep the consistency of the subscript in this paper, we swapped row and column subscripts.

8 (r1−l1)×l2 (r1−l1)×l1 l1×(r2−l2) l2×(r2−l2) (r2−l2)×(n−r2) where W0 ∈ R+ , W1 ∈ R+ , H0 ∈ R+ , H1 ∈ R+ , U0 ∈ R+ , l2×(n−r2) (m−r1)×(r1−l1) (m−r1)×l1 U1 ∈ R+ , Q0 ∈ R+ , Q1 ∈ R+ .

Proof. Matrix M is co-(r1, r2)-separable, from Property 3, M is r1-separable, i.e., M is GS-(r1, 0)- separable. Since M is minimal (l1, l2)-GS-separable, it implies that (l1 + l2) ≤ r1. Similarly, (l1 + l2) ≤ r2. Thus we have min{r1, r2} ≥ (l1 + l2). M (l , l ) L = {l(2), ··· , l(2)} L = {l(1), ··· , l(1)} is minimal 1 2 -GS-matrix, from (8), let 2 1 l2 and 1 1 l1 be the column and row set respectively, and M(L1, L2) = 0. M is co-(r1, r2)-separable matrix, from (2) (2) (1) (1) (2), let K2 = {k1 , ··· , kr2 } and K1 = {k1 , ··· , kr1 } be the column and row set respectively in (2). From Property 3, we have

M = M(:, K2)P2,M = P1M(K1, :) (10)

where P1(K1, :) = Ir1 and P2(:, K2) = Ir2 . Case 1. If L2 ∩ K2 = ∅, then from (10) and (8),

M(:, L ) = M(:, k(2))P (1, L ) + ··· + M(:, k(2))P (r , L ) 6= 0 2 1 2 2 r2 2 2 2 and

M(L , L ) = M(L , k(2))P (1, L ) + ··· + M(L , k(2))P (r , L ) = 0. 1 2 1 1 2 2 1 r2 2 2 2

(2) Therefore, we have M(L1, kj )P2(j, L2) = 0 for all j ∈ {1, ··· , r2}. (2) If M(L1, kj ) 6= 0 for all j ∈ {1, ··· , r2}, then P2(j, L2) = 0, which implies M(:, L2) = 0, contradicts to M(:, L2) 6= 0. (2) ¯ . If M(L1, kj ) = 0 and P2(J , L2) 6= 0 for some j ∈ J , let J = {1, ··· , r2} − J be the ¯ complement of J , we have P2(J, L2) = 0, then M(:, L2) = M(:, K2(J ))P2(J , L2). Hence, there ˜ exist Z2 such that ˜ ˜ M = M(:, K2(J ))Z2 + Z1M(L1, :), Z2(K2(J ), :) = I|J |.

It contradicts to that M has unique minimal (l1, l2)-generalized-separable decomposition. Hence, from the above analysis, we get L2 ∩ K2 6= ∅. s s Case 2. If L2 ∩ K2 6= ∅, and L2 6⊂ K2, let L2 ⊂ L2 and L2 6⊂ K2, from (10) and (8),

M(:, Ls) = M(:, k(2))P (1, Ls) + ··· + M(:, k(2))P (r , Ls) 6= 0 2 1 2 2 r2 2 2 2 and

M(L , Ls) = M(L , k(2))P (1, Ls) + ··· + M(L , k(2))P (r , Ls) = 0, 1 2 1 1 2 2 1 r2 2 2 2

(2) s we have M(L1, kj )P2(j, L2) = 0 for all j ∈ {1, ··· , r2}. Similar to the proof in Case 1, we can prove that L2 6⊂ K2 is not established. Therefore L2 ⊆ K2 is the only possibility. Similarly, we can deduce that L1 ⊆ K1. It then leads the following discussion. Discussion: When L1 ⊆ K1 and L2 ⊆ K2, M is (l1, l2) GS-separable, we have   M11 M12 | M13 . ˆ M = Πr  M21 M22 | M23  Πc = ΠrMΠc,

0l1,l2 M32 | M33

9 ˆ T T (m−r1)×l2 (m−r1)×(r2−l2) (m−r1)×(n−r2) where M = Πr MΠc , M11 ∈ R , M12 ∈ R , M13 ∈ R , M21 ∈ (r1−l1)×(l2) (r1−l1)×(r2−l2) (r1−l1)×(n−r2) l1×(r2−l2) l1×(n−r2) R , M22 ∈ R , M23 ∈ R , M32 ∈ R , M33 ∈ R . For simplicity, we will discuss Mˆ first. ˆ ˆ Since M is co-(r1, r2)-separable, and M(m−r1 +1 : m, 1 : r2) is the core. From (10), there exists  I 0 U  ˆ l2 1 (r2−l2)×(n−r2) l2×(n−r2) ˆ ˆ ˆ P2 = , U0 ∈ R , U1 ∈ R , such that M = M(:, 1 : r2)P2, 0 Ir2−l2 U0 that is,   M11 M12 | M11U1 + M12U0 ˆ M =  M21 M22 | M21U1 + M22U0 

0l1,l2 M32 | M32U0   Q0 Q1 ˆ (m−r1)×(r1−l1) (m−r1)×l1 Furthermore, there exists P1 =  Ir1−l1 0 , Q0 ∈ R , Q1 ∈ R , such

0 Il1 ˆ ˆ ˆ that M = P1M(m − r1 + 1 : m, :), that is, M11 = Q0M21, M12 = Q0M22 + Q1M32, M13 = Q0M21U1 + Q0M22U0 + Q1M32U0, M23 = M21U1 + M22U0. ˆ (r1−l1)×l1 l2×(r2−l2) Because M is (l1, l2) GS-matrix, we deduce that there exist W1 ∈ R , H1 ∈ R such that M22 = M21H1 + W1M32, let M21 = W0, M32 = H0, we have (9), the result follows.

Remark 3. A (r1, 0)- (or (0, r2)-) GS-matrix M is also co-(r1, n)- (or co-(m, r2)-) separable. m×n In the following, we will show an example that a (l1, l2)-GS-matrix M ∈ R+ is not a co- m×n (r1, r2)-separable matrix; and an example that co-(r1, r2)-separable matrix M ∈ R+ is not a (l1, l2)-GS-matrix, where r1 < m, r2 < n and (l1 + l2) < min{m, n}.  1 2 8 7   2     2 2 8 7  1 0 2 1  1  Example 3. M =   = M(:, 1 : 2) +   M(4, :) is (1, 2)-  2 1 7 5  0 1 1 2  1  0 0 2 1 1  I  GS-matrix but not a co-(3, 3)-separable matrix. Let M = r1 S I H  where entries of W r2

r1×r2 (m−r1)×r1 r2×(n−r2) S ∈ R+ , W ∈ R+ , H ∈ R+ are positive. M is co-(r1, r2)-separable matrix but not a (l1, l2)-GS matrix with l1 + l2 < min{m, n}. The following property presents the relationship between CoS-matrix and a matrix which admits bi-orthogonal tri-factorization (BiOR-NM3F) [6, 35], i.e., given a matrix M ∈ Rm×n, T T M = G1SG2,G1 G1 = Ir1 ,G2G2 = Ir2 , (11)

m×r1 r1×r2 r2×n where G1 ∈ R+ , S ∈ R+ and G2 ∈ R+ . We call this type matrix M a BiOR-NM3F matirx.

Property 12 (Relationship between BiOR-NM3F). A BiOR-NM3F matrix M is co-(r1, r2)- separable matrix.

Proof. From BiOR-NM3F setting, G1 is nonnegative , G1 hence has only one positive entry in each row. Similarly, G2 has only one positive entry in each column. Let αi = maxj G1(j, i) and βi = maxj G2(i, j), pi be the number of nonzero entry of i-th column of G1, qi be the number of nonzero entry of i-th row of G2, and  (1)  D1 (α1) ··· 0  .  Λ1 =  .  ,  0 . 0  (1) 0 ··· Dr1 (αr1 )

10  (2)  D1 (β1) ··· 0  .  Λ2 =  .  ,  0 . 0  (2) 0 ··· Dr1 (βr2 )

(1) pi×pi (2) qi×qi where Di (αi) ∈ R is diagonal matrix with αi in the diagonal, Di (βi) ∈ R is diagonal matrix with βi in the diagonal. Hence,

−1 −1 −1 −1 −1 −1 M = Λ1Λ1 G1SG2Λ2 Λ2 =⇒ Λ1 MΛ2 = Λ1 G1SG2Λ2 . ˜ −1 −1 ˜ −1 ˜ −1 ˜ Let M = Λ1 MΛ2 , P1 = Λ1 G1, P2 = G2Λ2 , we have P1 contains an identity matrix Ir1 and ˜ ˜ P2 contains an identity matrix Ir2 . From co-separable definition 2, M is a co-(r1, r2)-separable matrix. By Property 8, M is co-(r1, r2)-separable. As we mentioned in Section 1 that Co-separable NMF is also related to the CUR decomposition which identifies a row subset K1 and column subset K2 from M such that kM −M(:, K2)UM(K1, :)k is minimized, where |K1| = r1 and |K2| = r2. For simplicity, we call a matrix the (r1, r2)-CUR matrix that admits an exact CUR decomposition. Different from Co-separable NMF, nonnegativity constraints are not considered in CUR decomposition, which leads to a different model analysis. In the following property, we will show a connection between both models.

Property 13 (Relationship between CUR). (i) Any minimal co-(r1, r2)-separable matrix M ad- mits an exact CUR decomposition. (ii) If a nonnegative matrix M admits an exact CUR decompo- sition, i.e., M = M(:, K2)UM(K1, :), and U ≥ 0, then M is minimal co-(r1, r2)-separable matrix with core M(K1, K2).

Proof. M is minimal co-(r1, r2)-separable, from property 1, we have  SSH  M = Π Π r WSWSH c  S  = Π Π (ΠT S+ΠT )Π SSH  Π r WS c c r r c T + T = M(:, K2)(Πc S Πr )M(K1, :). where S+ is referred to as moore-penrose inverse of S. M admits an exact CUR decomposition. ¯ ¯ If M admits an exact CUR decomposition, i.e., M = M(:, K2)UM(K1, :), let K1 and K2 be complement of K1 and K2 respectively, then we have,

M(K1, K2) = M(K1, K2)UM(K1, K2),

M(K1, K¯2) = M(K1, K2)UM(K1, K¯2);

M(K¯1, K2) = M(K¯1, K2)UM(K1, K2),

M(K¯1, K¯2) = M(K¯1, K2)UM(K1, K¯2). i.e.,

M(K1, K2) = S, M(K1, K¯2) = SH;

M(K¯1, K2) = W S, M(K¯1, K¯2) = WSH, ¯ ¯ where W = M(K1, K2)U, H = UM(K1, K2). Hence the results follow.

We remark that a nonnegative (r1, r2)-CUR matrix M is not always a minimal co-(r1, r2)- separable matrix. In the following, we show an example.

11 Example 4.

 1 1 2 2   0 1 1 2  M =    0 0 1 3  1 2 2 3  1 1 2   1 −1 −1   1 1 2 2  0 1 1 =   0 1 −1 0 1 1 2  0 0 1        0 0 1 0 0 1 1 1 2 2 = M(:, 1 : 3)UM(1 : 3, :).

Hence, M admits (3, 3)-CUR decomposition, but not a co-(3, 3)-separable matrix.

4 Optimization Models and Algorithms

From Property 7, the model for minimal co-(r1, r2)-separable factorization is given in (7). However, in real world, due to the presence of noise, the model (7) can be modified to

min kXkcol,0 + kY krow,0, s.t.kM − XMk ≤ , kM − MY k ≤ . m×m n×n X∈R+ ,Y ∈R+ where  denotes the noise level. The norm of kM − MXk and kM − YMk can be chosen according to the noise level. In this paper, we consider the Frobenius norm.

4.1 Convex Optimization Model and Algorithms We observe that optimization problem (12) can be divided into the following two sub problems and solved independently, that is,

min kXkcol,0 s.t. kM − XMk ≤ , (12) m×m X∈R+

min kY krow,0, s.t. kM − MY k ≤ . (13) n×n Y ∈R+

These two sub problems can actually be reduced to the same problem. In the following we only discuss the problem (13) for simplicity. Note that problem (13) is quite challenging to solve, but has been well discussed in reference [18]. Therefore, we will briefly review a fast gradient method presented in this reference. According to the reference [18], problem (13) can be relaxed to the following convex optimization model:

min (Y ) s.t. kM − MY k ≤ , 0 ≤ Y (t, l) ≤ Y (t, t) ≤ 1, 1 ≤ t, l ≤ n. n×n (14) Y ∈R+

To avoid column normalization of input matrix M, the optimization problem in (14) can be further generalized to the following model for nonscaled matrix.

min trace(Y ) s.t. kM − MY k ≤ , Y ∈Ω (15) where the set Ω is defined as

n×n Ω = {Y ∈ R+ |Y ≤ 1, ωtY (t, l) ≤ ωlY (t, t), 1 ≤ t, l ≤ n}.

12 where ωt = kM(:, t)k1 for all t. Note that the problem (15) is smooth and convex. One may consider interior-point methods such as SDPT3 [33] for solving the problem. Since this problem contains m2 variables and many constraints, it would be very expensive to use second order methods. In fact, the main aim is to identify the important columns of M which correspond to the largest entries in the diagonal entries Y . We consider to employ Nesterov’s optimal first-order method [28, 29] which attains the best possible convergence rate of O(1/k2). To do so, the following penalized version is considered:

1 2 min F (Y ) = kM − MY kF + λ trace(Y ), (16) Y ∈Ω 2 where λ > 0 is a penalty parameter which balances the importance between the approximation 1 2 error 2 kM − MY kF and the trace of Y . The fast gradient method for solving (16) is presented in Algorithm 1. We remark in Algorithm 1, the column set K is identified by a post processing procedure, which is presented as follows: • For synthetic data sets, simply pick the r largest entries of the diagonals of Y . • For real data sets, the strategy used in [31] can be adopted, i.e., applying SPA on Y T to sort the columns of M. Then choose r columns from M according to their sort order.

Algorithm 1 Separable-NMF with a Fast Gradient Method (FGM-SNMF) [18] m×n Input: M ∈ R+ , number r of columns to extract, and maximum number of iterations maxiter. Output: Matrix Y solving (16), and a set K of column indices. 1: % Initialization 2 2: α0 ← 0.05; L ← σmax(M); Initialize Y and λ; 3: for k = 1 : maxiter do 4: % Keep previous iterates in memory 5: Yp ← Y ; 6: % Gradient computation T T 7: ∇Y F (Y ) ← M MY − M M + λIn; 8: % Gradient step and projection 1 9: Yn ← PΩ(Y − L ∇Y F (X,Y )); 10: % Acceleration / Momentum step 11: Y ← Yn + βk(Yn − Yp); αk−1(1−αk−1) 2 2 where βk = 2 such that αk ≥ 0 and αk = (1 − αk)αk−1. αk−1+αk 12: end for 13: K ← post-process(Y, r).

Therefore, given a matrix M, in order to get the row set K1 and column set K2, one could directly use Algorithm 1 on M T and M respectively. However, it is not practical for real applica- tions. For example, in document classification, the input document-term matrix M could be very sparse. If we identify the important rows K1 and columns K2 independently, from Definition 1, the core S = M(K1, K2) may contain some zero columns or rows, which will lead the loss of some important information. Based on this concern, we propose an alternating fast gradient method for solving CoS-NMF problem.

13 4.1.1 Alternating Fast Gradient Method for CoS-NMF To prevent to obtain the degenerate of the core matrix S, we will utilize the results from Remark T 1 that M(:, K2) is r1-separable, and M(K1, :) is r2-separable. More precisely, we consider the following non-scaled convex optimization model derived from (15).

min trace(X) s.t. MY = XMY , (17) X∈Ω1

min trace(Y ) s.t. MX = MX Y, (18) Y ∈Ω2 where MY = M(:, K2), MX = M(K1, :), and m×m Ω1 = {X ∈ R+ |X ≤ 1, ωˆiX(i, j) ≤ ωˆjX(j, j), ∀i, j}, n×n Ω2 = {Y ∈ R+ |Y ≤ 1, ωtY (t, l) ≤ ωlY (t, t), ∀t, l}, with ωˆt = kMY (i, :)k1 for all i; ωt = kMX (:, t)k1 for all t. Note that the row set K1 is identified by applying post processing on X of (17), and the column set K2 is identified by using post processing on Y of (18), thus we will solve (17) and (18) alternately by using fast gradient method on the following penalized version.

1 2 min kMY − XMY kF + λtrace(X); X∈Ω1 2

1 2 min kMX − MX Y kF + λtrace(Y ) Y ∈Ω2 2 This alternating fast gradient method is presented in Algorithm 2, which is referred to as CoS-FGM.

Algorithm 2 Alternating Fast Gradient Method for CoS-NMF m×n Input: M ∈ R+ , number r1 of columns and r2 of rows to extract, maximum number of iterations maxiter, stopping criterion δ. Output: A set K1 of column indices and a set K2 of row indices. 1: MX = M, MY = M; 2: for k = 1 : maxiter do 3: % Keep previous iterates in memory 4: MXP ← MX ; MYP ← MY ; 5: % Update MX T 6: K1 = F GM − SNMF (MY , r1); MX = M(K1, :); 7: % Update MY 8: K2 = F GM − SNMF (MX , r2); MY = M(:, K2); 9: % Stopping criterion 10: e = kMXP − MX kF + kMYP − MY kF ; 11: if e ≤ δ then 12: break 13: end if 14: k = k + 1; 15: end for

4.2 Factor Matrices P1 and P2

After both the row set K1 and column set K2 are identified by alternating fast gradient method, the core matrix S is then determined by S = M(K1, K2). From Definition 1, the remaining problem

14 is computing factor matrices P1 and P2 by solving the following optimization problem: given m×n r1×r2 m×r1 r2×n M ∈ R+ , S ∈ R+ , find P1 ∈ R+ and P1 ∈ R+ , such that

2 (P1,P2) = arg min kM − USV kF (19) m×r1 r2×n U∈R+ ,V ∈R+

We note that this optimization problem is a variant of standard NMF problem. When one of the factors, U or V is fixed, it will be reduced to a convex nonnegative least squares problem (NNLS). We simply use the coordinate descent implemented in [15]. The detailed procedure is shown in Algorithm 3.

Algorithm 3 Compute P1 and P2 m×n r1×r2 Input: M ∈ R+ , core matrix S ∈ R+ , maximum number of iterations maxiter, stopping criterion δ. Output: P1 and P2. 1: M1 = M(K1, :); M2 = M(:, K2). 2: % Generate initial P1 and P2 2 3: P1 = arg min kM − UM1kF ; m×r1 U∈R+ 2 P2 = arg min kM − M2V kF ; r2×n V ∈R+ 4: for k = 1 : maxiter do 5: % Keep previous iterates in memory ˆ ˆ 6: P1 ← P1; P2 ← P2; 2 7: W = P1S; solve P2 from min kM − WV kF ; r2×n V ∈R+ 2 8: H = SP2; solve P1 from min kM − UHkF ; m×r1 U∈R+ 9: % Stopping criterion ˆ ˆ 10: e = kP1 − P1kF + kP2 − P2kF ; 11: if e ≤ δ then 12: break 13: end if 14: k = k + 1; 15: end for

5 Numerical Experiments

In this section, we show the performances of the proposed CoS-NMF model on synthetic datasets (Section 5.1), document datasets (Section 5.2) and facial database (Section 5.3). All experiments were run on Intel(R) Core(TM) i5-5200 CPU @2.20GHZ with 8GB of RAM using Matlab. We compared our model with the several state-of-the-art methods. The compared algorithms are briefly summarized as follows, 1. SPA (Successive projection algorithm [1,12,19]) is a state-of-the-art separable NMF method. It selects the column with the largest l2 norm and projects all columns of M on the orthogonal complement of the extracted column at each step. Note that SPA can only identify a subset of the columns of the input matrix M, therefore we consider the following three variants.

15 + T • SPA : We apply SPA on M to identify r1 important rows of M, and then on M to identify r2 important columns of M. T • SPAR: We apply SPA on M to identify r1 rows.

• SPAC: We apply SPA on M to identify r2 columns.

2. GSPA (Generalized SPA, [31]) is a state of the art generalized separable NMF algorithm. It is a fast heuristic algorithm driven from SPA, applied on M to identify r columns and rows of M. 3. GS-FGM (Generalized separable fast gradient method, [31]) is an other state of the art generalized separable NMF algorithm which is based on fast gradient method on separable NMF. 4. BiOR-NM3F (Bi-orthogonal tri-factorization [6]) is an algorithm for co-clustering, aims to solve problem (11). 5. A-HALS algorithm is a state-of-the-art NMF algorithm, namely the accelerated hierarchical alternating least squares algorithm [15].

6. MV-NMF is a state-of-the-art minimum-volume NMF algorithm [11] which uses a fast gra- dient method to solve the sub problems in W and H from [24]. For simplicity, the methods that select important columns and rows from the input matrix, i.e., SPA, GSPA, GS-FGM and the proposed CoS-FGM, are referred to as column-row selected methods. Remark 4. We have also tested other separable NMF algorithms: successive nonnegative projection algorithm (SNPA [13]) and XRAY [23]. They showed similar results as SPA. We also applied fast gradient method (FGM [18]) to identify r1 rows and r2 columns of M, and found that its results are not as good as CoS-FGM and SPA. Hence for simplicity, we do not present their results here.

The stopping criterion of CoS-FGM and GS-FGM : We will use δ = 10−6 for synthetic data sets and δ = 10−2 for the real data sets (document data sets and facial database). The maximum iteration is set to be 1000. For BiOR-NM3F and both NMF algorithms, we use the default parameters and perform 1000 iterations.

5.1 Synthetic data sets In this section, we compare the algorithms on synthetic data set generated fully randomly. We identify the subsets K1 and K2 by using the r1 largest diagonal entries of X and r2 largest diagonal entries of Y , respectively. Given the subsets (K1, K2) computed by an algorithm, in order to show the effect of these algorithms, we will report the following two quality measures: 1. The accuracy is defined as the proportion of correctly identified row and column indices:

∗ ∗ |K1 ∩ K1| + |K2 ∩ K2| accuracy = ∗ ∗ , (20) |K1| + |K2|

∗ ∗ ∗ where K1 and K2 are the true row and column indices used to generate M . Note that BiOR-NM3F and both NMF algorithms do not identify columns and rows from the input matrix, hence, the accuracy cannot be computed.

16 2. Since the models to represent data matrix are different. To be fair, the relative approximation for CoS-FGM and SPA+ is defined as

minP ≥0,P ≥0 kM − P1M(K1, K2)P2kF 1 − 1 2 . (21) kMkF For GSPA and GS-FGM, the relative approximation is defined as

minP ≥0,P ≥0 kM − P1M(K1, :) − M(:, K2)P2kF 1 − 1 2 . (22) kMkF

˜ 1 − min kM−MkF M˜ For the rest methods, we compute the relative approximation as kMkF , where is the approximation of M obtained by these methods.

5.1.1 Fully randomly generated data

We generate noisy co-(10, 3)-separable matrices M ∈ R100×100 as follows:     SSH   Π max 0,D D +N Π . r  r WSWSH c  c   | {z } M s Here we consider the following settings. • The entries of the matrices S ∈ R10×3, W ∈ R90×10 and H ∈ R3×97 are generated uniformly at random in the interval [0,1] by the rand function of MATLAB. • The diagonal matrices Dr and Dc are computed by the algorithm in [21, 30] that alternatively scales the columns and rows of the input matrix, such that M s is scaled. • The noise matrix N ∈ R100×100 is generated at random normal distribution by the randn function s s of MATLAB. We normalize N such that ||N||F = ||M ||F , where M is the noiseless scaled co- (10, 3)-separable matrix, and  is a parameter that relates to the noise level. • Πr and Πc are permutation matrices generated randomly. In this experiment, GS-FGM is run with the parameter λ˜ = 0.25. We note that M s is co-(10, 3) separable matrix, its nonnegative rank is not larger than 3, fairly, the value of factorization rank for both NMF algorithms is hence set to be r = 3. For GSPA and GS-FGM, we set (r1, r2) = (10, 3) to test the accuracy of identified row and columns, even though it is not fair to compare to their relative approximation since the factorization rank of their generalized separable representation is r1 + r2 = 13. We use 20 noise levels  logarithmically spaced in [10−7, 10−1] (in MATLAB, logspace(-7,-1,20)). For each noise level, we generate 25 such matrices and report the average quality measures in per- cent on Figs.1-2. We have the following observations: • In terms of accuracy, CoS-FGM has an accuracy of nearly 100% for all  ≤ 0.0026. For low noise levels  ≤ 7.85 × 10−6, CoS-FGM performs the best, but SPA+ is not able to recover column and row indices. When the noise becomes larger, the accuracy of SPA+ increases. The accuracies of both of CoS-FGM and SPA+ decrease when the noise is larger than 0.0026. As expected, we find that the performances of GS-FGM and GSPA are worse than CoS-NMF in most noise levels. • In terms of relative approximation, both NMF methods performs similarly as CoS-FGM. For for all  ≤ 0.0026, they can almost 100% approximate the input matrix. This is not surprising since NMF factorizes the input matrix with no other constraints than nonnegativity. It is actually nice to find that the solutions from CoS-FGM can generate the same approximation with NMF even though CoS-NMF is much more constrained. The reason is that the input data satisfies the CoS-NMF assumptions.

17 Remark 5. We have tested BiOR-NM3F and found that its relative approximations for all noise level  are smaller than 20%, hence we do not present its results in Fig 2.

Figure 1: Average accuracy (20) in percentage for the different algorithms on the fully randomly generated CoS-NMF matrices.

Figure 2: Average relative approximation (21) - (22) in percentage on the fully randomly generated CoS-NMF matrices.

5.2 Document Data Sets In this section, we test these methods on document data sets including TDT30 data set [3], and the 14 data sets from [37]. Note that for Newsgroups 20, which is a very large data set, we only consider the first 10 classes and refer to the corresponding data set as NG10. Here we do not scale the input matrix since these document data sets are very sparse. We will consider the clustering abilities of the methods on both words and document terms, however, for all the document datasets, only the clustering ground truth on document terms are provided. Hence to determine the clustering ground truth on words, we will first pre-process the datasets. m×n Preprocessing Given a document-word data matrix M0 ∈ R , we select the top 1000 m×1000 words from M0 to construct the new document-word data matrix M ∈ R . We then label

18 the word according to the label of its associated document in which the appearing probability of this word is the largest. In this way, we get the clustering ground truth on words. For these new document-word data sets, though the number of words is 1000, the number of documents are still very large. For some methods like GS-FGM that requires O(mn2 + nm2) computation operations, it is impractical to apply these methods directly on large data matrix. Hence, we will employ a similar strategy in [18, 31], use the hierarchical clustering [16] that runs in O(mnlog2C) (C is the number of the clusters to generate), to preselect a subset of columns and rows from the input matrix. Precisely, for all these data sets except tr11 and tr23, we extract 500×500 500 documents and 500 words, and consider a submatrix matrix Ms ∈ R . For tr11 and tr23 data sets, since the number of documents is relatively small (414 for tr11, 204 for tr23), we keep all the documents and extract 500 words. Here we take into account the importance of each selected column and row by identifying the number of data points attached to it (this is given by the hierarchical clustering). We scale it using the square root of the number of points belonging to its cluster. We apply the column-row selected methods (CoS-FGM, SPA, GSPA, GS-FGM) on the sub- sampled matrix to identify a subset of r1 rows and r2 columns. From these subsets, we can then identify their corresponding columns and rows in the original data matrix. For all the methods, we will report the clustering accuracy used in [22, 27] that quantifies the level of correspondence between the clusters and the ground truth, defined as

r ∗ kQΠ − Q kF Acc = 1 − max ∈ [0, 1], (23) Π∈[1,2,··· ,r] rn where [1, 2, ··· , r] is the set of permutations of {1, 2, ··· , r} and QΠ is the clustering matrix Q whose columns are rearranged according to the permutation Π, Q∗ is the clustering ground truth. We will also report the approximation quality measure defined in Section 5.1. Given document-word data matrix, the clustering matrix Q is computed in the following way.

• For those methods (CoS-FGM, SPA, BiOR-NM3F) that compute M ≈ P1SP2, Q1 and Q2 represent the clustering matrix of document and word respectively, which can be obtained by using hard clustering on P1 and P2, i.e., Qi,j = 1 if j = argmaxt{P (i, t)}, and Qi,j = 0 for else.

T • For GSPA and GS-FGM that compute M ≈ W1H1 + W2H2 = [W1,W2][H1,H2] , Q1 and Q2 can be obtained by using hard clustering on its factor matrices [W1,W2] and [H1,H2] respectively.

• For the rest methods (SPAR, SPAC, A-HALS and MV-NMF) that compute M ≈ WH, Q1 and Q2 can be obtained by using hard clustering on its factor matrices W and H respectively. In this experiment, for GS-FGM, we try 10 different values of λ from [10−3, 10] with 10 log- spaced values (in MATLAB, logspace(-3,1,10)), and keep the solution with the highest approx- imation quality. Since the cluster number of words is equal to that of documents, we then set r1 = r2 = r for all document datasets. The results are presented in Table 1, 2 and 3. Here we have the following observations. (i) In terms of approximation ability, among all the column-row selected methods, CoS-FGM gets the highest in 10 out of the 15 datasets, and has the highest average approximation. We note that the average approximation of CoS-FGM is only a bit less than that of NMF methods. This is actually very promising because the NMF is much less constrained compared to CoS-NMF model. (ii) In terms of the clustering ability, CoS-FGM has the highest average accuracies in both document and word clustering among all the methods.

19 (iii) The last line of Table 1 reports the average computational time in seconds for these al- gorithms. CoS-FGM is slower but the computational time is reasonable since it needs to run fast gradient method iteratively. Among all the methods, BiOR-NM3F takes more than 4000 seconds, is the slowest, and all SPA variants are the fastest, take not more than 0.01 seconds.

Dataset r CoS-FGM SPA+ SPAC SPAR r GSPA r GS-FGM BiOR-NM3F A-HALS MV-NMF NG10 10 93.83 93.85 93.48 93.62 (9,1) 93.76 (9,1) 93.76 0.01 94.31 94.24 TDT30 30 24.64 22.03 20.91 17.96 (7,23) 20.98 (10,20) 21.20 3.94 26.03 25.93 classic 4 6.66 6.66 2.89 2.56 (1,3) 2.63 (3,1) 2.93 1.36 7.06 5.96 reviews 5 17.54 14.73 15.01 10.64 (2,3) 15.07 (2,3) 15.07 1.49 19.31 19.24 sports 7 16.37 16.15 13.53 10.62 (0,7) 13.53 (0,7) 13.53 5.71 17.51 17.40 ohscal 10 15.51 14.50 13.62 11.41 (0,10) 13.62 (0,10) 13.62 5.58 15.84 15.85 k1b 6 13.45 12.05 10.20 7.79 (2,4) 9.83 (2,4) 9.83 3.09 13.80 13.65 la12 6 11.68 11.33 7.35 5.50 (3,3) 5.82 (1,5) 6.81 3.51 11.85 11.56 hitech 6 11.31 11.08 8.02 6.48 (3,3) 8.62 (3,3) 9.27 3.93 12.47 12.35 la1 6 11.85 10.45 6.88 6.53 (0,6) 6.88 (1,5) 7.64 3.61 12.00 11.66 la2 6 12.05 11.39 8.50 7.02 (1,5) 8.47 (1,5) 8.47 3.16 12.18 11.99 tr41 10 56.76 59.15 58.32 60.23 (8,2) 60.43 (9,1) 53.94 15.02 61.23 60.78 tr45 10 73.26 73.15 71.25 75.15 (10,0) 75.15 (10,0) 75.15 29.38 78.22 78.18 tr11 9 73.98 75.01 65.44 76.35 (6,3) 76.10 (6,3) 76.10 13.92 78.46 78.41 tr23 6 71.68 70.99 67.44 71.74 (2,4) 66.61 (4,2) 71.47 8.77 73.37 73.33 average – 34.04 33.50 30.86 30.91 – 31.83 – 31.92 6.83 35.58 35.37 time – 36.27s 0.01s 0.004s 0.005s – 0.05s – 0.41s 4174.8 28.32s 170.27s

Table 1: The relative approximation quality (21) - (22) in percentage for the document data sets.The last line reports the average computational time in seconds for the different algorithms.

Dataset r CoS-FGM SPA+ SPAC SPAR r GSPA r GS-FGM BiOR-NM3F A-HALS MV-NMF NG10 10 61.59 61.48 59.04 60.42 (9,1) 60.41 (9,1) 60.42 59.21 61.03 60.36 TDT30 30 83.40 79.70 79.42 81.33 (7,23) 79.06 (10,20) 80.26 76.74 80.49 80.99 classic 4 51.74 52.22 41.49 44.57 (1,3) 42.23 (3,1) 46.02 43.55 52.49 51.15 reviews 5 56.21 60.49 49.58 60.40 (2,3) 54.63 (2,3) 54.71 50.00 56.50 58.55 sports 7 63.49 62.09 55.44 60.96 (0,7) 55.49 (0,7) 55.36 54.57 59.38 58.62 ohscal 10 64.63 63.82 62.06 62.67 (0,10) 62.01 (0,10) 62.18 60.32 62.46 61.90 k1b 6 69.45 62.01 54.36 57.55 (2,4) 55.15 (2,4) 55.15 59.84 63.56 64.67 la12 6 60.28 56.69 51.46 55.78 (3,3) 53.98 (1,5) 53.82 52.30 55.38 54.60 hitech 6 57.98 56.02 55.40 55.14 (3,3) 57.70 (3,3) 57.96 53.35 58.24 58.60 la1 6 55.94 59.79 56.30 52.73 (0,6) 55.96 (1,5) 56.98 52.82 59.23 56.45 la2 6 55.40 51.93 52.50 54.58 (1,5) 53.11 (1,5) 53.09 53.19 54.55 53.69 tr41 10 62.88 67.81 65.12 65.95 (8,2) 65.75 (9,1) 66.93 67.88 66.56 64.52 tr45 10 63.96 63.45 64.70 64.21 (10,0) 64.21 (10,0) 64.21 63.33 64.13 64.86 tr11 9 62.43 63.59 66.67 62.07 (6,3) 66.91 (6,3) 66.91 61.44 67.24 67.32 tr23 6 55.72 54.45 52.17 56.28 (2,4) 55.17 (4,2) 55.17 54.27 55.35 54.27 average – 61.67 61.04 57.71 59.64 – 58.78 – 59.28 57.52 61.11 60.70

Table 2: The accuracy of documents clustering (23) in percentage for the document data sets.

In particular, we present the key words in Table 4 and show the interpretation of the core matrix from CoS-FGM method on TDT30 dataset in Fig. 3. Since the ground truths of these 30 selected words and 30 important documents have been given, we cluster the documents and words based on their common topic. For example, there are three documents, sharing a same key word in the first group (topic); in 4th group (topic), three keys words appear in three important documents. The results in Fig. 3 verify the assumption of the core matrix of CoS-NMF, i.e., for a topic, there are at least one ”pure” document contains at least one anchor word. It is interesting to find that these pure documents and key words are only selected from 18 topics, while the TDT30 has 30 topics. We assume that the reason is these 18 topics are more important than the others. To verify our assumption, we observe that the number of the words

20 Dataset r CoS-FGM SPA+ SPAC SPAR r GSPA r GS-FGM BiOR-NM3F A-HALS MV-NMF NG10 10 64.76 63.48 65.10 61.06 (9,1) 61.17 (9,1) 61.14 59.90 62.72 61.89 TDT30 30 79.47 77.73 77.95 77.74 (7,23) 77.46 (10,20) 77.98 76.78 79.33 79.44 classic 4 68.46 60.63 49.35 40.00 (1,3) 51.52 (3,1) 42.60 40.52 48.77 52.62 reviews 5 60.66 50.81 49.72 48.85 (2,3) 50.85 (2,3) 50.85 46.31 56.83 55.68 sports 7 60.65 56.91 56.64 53.93 (0,7) 56.64 (0,7) 56.64 51.98 57.00 56.71 ohscal 10 67.97 62.50 63.56 59.16 (0,10) 63.56 (0,10) 63.56 58.82 65.19 66.15 k1b 6 60.76 56.22 54.36 49.01 (2,4) 52.99 (2,4) 53.10 48.42 57.89 60.88 la12 6 61.75 60.97 51.04 51.76 (3,3) 52.60 (1,5) 54.43 49.08 62.99 60.63 hitech 6 53.81 51.94 48.78 49.37 (3,3) 51.11 (3,3) 50.87 48.68 52.32 50.30 la1 6 58.93 55.24 50.07 51.01 (0,6) 50.07 (1,5) 49.93 49.17 62.27 60.13 la2 6 58.81 58.25 52.88 49.77 (1,5) 52.64 (1,5) 52.64 48.65 57.65 56.87 tr41 10 63.72 61.79 61.95 61.56 (8,2) 61.45 (9,1) 61.40 61.27 62.16 62.99 tr45 10 64.93 65.62 64.33 67.72 (10,0) 67.66 (10,0) 67.84 64.76 65.36 64.42 tr11 9 64.07 61.18 64.32 60.87 (6,3) 61.47 (6,3) 61.53 58.37 64.50 62.73 tr23 6 51.90 57.85 58.65 59.05 (2,4) 62.14 (4,2) 53.88 53.67 53.96 54.83 average – 62.71 60.07 57.91 56.06 – 58.22 – 57.23 54.43 60.60 60.42

Table 3: The accuracy of words clustering (23) in percentage for the document data sets. belongs to these selected topics is 742, accounts for 74.2% of total 1000 words, and the number of the documents belongs to these topics is 8333, accounts for 88.71% of 9394 documents.

Figure 3: Interpretation of core matrix from CoS-FGM method on TDT30 dataset.

5.3 Facial Database In this section, we apply the algorithms on facial database: ORL Database of Faces which contains 400 facial images taken at the Olivetti Research Laboratory in Cambridge between April 1992 and April 1994. There are 40 distinct subjects, each subject has ten different images and each image is size of 112 x 92. Here, we resize each image to the size of 23 × 19, normalize the pixel value to [0, 1] and form data vectors of dimension 437. The "pixel × image" matrix is size of 437 × 400. We used the same way as in document datasets to tune the best parameter λ for GS-FGM method. Since the ground truth of facial image clusters are given, i.e., the same subjects are regarded as the same cluster, we can use the same strategy for document dataset to compute the cluster accuracy of their facial images. Note that we do not have clustering ground truth for pixels, + hence, we can choose the number r1 of pixels arbitrarily. Here, we test CoS-FGM and SPA by letting r1 = 40 and r1 = 80 respectively. r2 is referred to as the cluster number of facial images,

21 label words label words label words 1 ’index’ 2 ’tripp’, ’allegations’ 3 ’death’ 4 ’church’, ’pope’,’cuba’ 5 ’downhill’ 6 ’saudi’, ’cohen’ 7 ’super’, ’denver’ 8 ’police’ 9 ’vote’, ’hindu’, ’election’ 10 ’tax’ 11 ’viagra’ 12 ’school’, ’voice’, ’children’ 13 ’tests’ 14 ’netanyahu’ 15 ’students’, ’habibie’, ’suharto’ 16 ’kaczynski’ 17 ’kaczynski’ 18 ’bulls’, ’jordan’

Table 4: The key words from the important documents on TDT30 of CoS-FGM

i.e., r2 = 40. For the rest algorithms, the factorization rank is set to be the cluster number of facial images, i.e., r = 40. In Table 5, we report relative approximation quality (21) - (22) and the cluster accuracy (23) for images of these methods. Here we have the following observations. (i) In terms of approximation quality, among all the in the column-row selected methods, CoS- NMF has the highest approximation for the case of r = (80, 40). BioR-NM3F has the lowest approximation due to the constraints of BioR-NM3F model itself. (ii) In terms of clustering accuracy for subjects, CoS-NMF has the highest accuracy for both cases of r = (40, 40) and r = (80, 40). We also notice that both generalized separable NMF methods (GSPA and GS-FGM) do not perform well in this clustering task. (iii) Note that when the number r2 of selected pixels is increasing, both approximation quality and clustering accuracy increase for CoS-FGM and SPA+.

Dataset r CoS-FGM SPA+ r CoS-FGM SPA+ r GSPA r GS-FGM r SPAC BioR-NM3F A-HALS MV-NMF Appro (40,40) 82.22 81.66 (80,40) 83.08 82.45 (21,19) 81.51 (10,30) 83.02 40 82.90 1.36 89.45 87.95 Acc – 84.67 82.64 – 84.83 83.27 – 80.28 – 80.83 – 83.01 83.04 81.77 83.99 time – 10.63s 0.01s – 20.31s 0.02s – 0.30s – 0.34s – 0.01s 72.32s 3.42s 71.46s

Table 5: The relative approximation quality and clustering accuracy in percentage for ORL database. The last line reports the computational time in seconds for the different algorithms.

In Figs. 4-5, we present the selected 40 important images and 80 key pixels in core matrix from CoS-FGM for the case of r = (80, 40). We remark that the pixels in Fig.4 are selected from the facial images, to test how good these pixels are, we consider the following strategy: Let the index set of the selected pixels be K1, the facial image matrix with the selected pixels is Mr = M(K1, :), where M is the ORL facial matrix. We compute correlation coefficient matrices 437×400 80×400 of M ∈ R and Mr ∈ R respectively, denoted as corr(M) and corr(Mr). We present their correlation coefficient results in Fig. 6. It shows that in both figures, the correlation values of facial images belong to same cluster are higher than the others. The relative error denoted as kcorr(M) − corr(Mr)kF is 0.1663. These results are quite encouraging. kcorr(M)kF 6 Conclusion

Figure 4: The key pixels to generate the core matrix from CoS-FGM.

6 Conclusion

In this paper, we have generalized the separability condition to co-separability for the NMF problem: instead of selecting only columns of the input matrix to approximate it, we select both columns and rows to form a sub-matrix that represents the input matrix. We refer to this problem as co-separable NMF (CoS-NMF). We studied some mathematical properties of matrices that can be decomposed using CoS-NMF. In particular, we discussed the relationships between CoS-NMF and other related matrix factorization models: CUR decomposition, generalized separable NMF and bi-orthogonal tri-factorization. We then proposed a convex optimization model to tackle CoS-NMF, and developed an alternating fast gradient method to solve the model. We compared the algorithms on synthetic data sets, document data sets and the ORL facial image database. It is shown that the CoS-NMF model performs very well in the co-clustering task compared to the state-of-the-art methods. Some interesting interpretations of the CoS-NMF model for applications on document and image data sets are given and verified. Further work includes deepening our understanding of CoS matrices, which would allow us to design more efficient algorithms that provably recover optimal decompositions in the presence of noise.

References

[1] M. C. U. Araújo, T. C. B. Saldanha, R. K. H. Galvao, T. Yoneyama, H. C. Chame, and V. Visani. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemometrics and Intelligent Laboratory Systems, 57(2):65–73, 2001.

[2] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu. A practical algorithm for topic modeling with provable guarantees. In International Conference on Machine Learning, pages 280–288. PMLR, 2013.

[3] D. Cai, Q. Mei, J. Han, and C. Zhai. Modeling hidden topics on document manifold. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 911–920. ACM, 2008.

[4] H. Cai, K. Hamm, L. Huang, and D. Needell. Robust CUR decomposition: Theory and imaging applications. arXiv preprint arXiv:2101.05231, 2021.

[5] A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley & Sons, 2009.

[6] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 126–135, 2006.

[7] E. Elhamifar, G. Sapiro, and R. Vidal. See all by looking at a few: Sparse modeling for finding representative objects. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1600–1607. IEEE, 2012.

Figure 5: The important facial images to generate the core matrix from CoS-FGM.

[8] E. Esser, M. Moller, S. Osher, G. Sapiro, and J. Xin. A convex model for nonnegative matrix factorization and dimensionality reduction on physical space. IEEE Transactions on Image Processing, 21(7):3239–3252, 2012.

[9] X. Fu, K. Huang, N. D. Sidiropoulos, and W.-K. Ma. Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications. IEEE Signal Processing Magazine, 36(2):59–80, 2019.

[10] X. Fu, K. Huang, N. D. Sidiropoulos, Q. Shi, and M. Hong. Anchor-free correlated topic modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(5):1056–1071, 2018.

[11] X. Fu, K. Huang, B. Yang, W.-K. Ma, and N. D. Sidiropoulos. Robust volume minimization-based matrix factorization for remote sensing and document clustering. IEEE Transactions on Signal Processing, 64(23):6254–6268, 2016.

[12] X. Fu, W.-K. Ma, T.-H. Chan, and J. M. Bioucas-Dias. Self-dictionary sparse regression for hyperspectral unmixing: Greedy pursuit and pure pixel search are related. IEEE Journal of Selected Topics in Signal Processing, 9(6):1128–1141, 2015.

[13] N. Gillis. Successive nonnegative projection algorithm for robust nonnegative blind source separation. SIAM Journal on Imaging Sciences, 7(2):1420–1450, 2014.

[14] N. Gillis. Nonnegative Matrix Factorization. SIAM, 2020.

[15] N. Gillis and F. Glineur. Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Computation, 24(4):1085–1105, 2012.

Figure 6: From left to right: ground truth, clustering result on correlation matrix formed by M(:, K2), clustering result on correlation matrix formed by M(K1, K2).

[16] N. Gillis, D. Kuang, and H. Park. Hierarchical clustering of hyperspectral images using rank-two nonnegative matrix factorization. IEEE Transactions on Geoscience and Remote Sensing, 53(4):2066–2078, 2015.

[17] N. Gillis and R. Luce. Robust near-separable nonnegative matrix factorization using linear optimization. The Journal of Machine Learning Research, 15(1):1249–1280, 2014.

[18] N. Gillis and R. Luce. A fast gradient method for nonnegative sparse regression with self dictionary. IEEE Transactions on Image Processing, 27(1):24–37, 2018.

[19] N. Gillis and S. A. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):698–714, 2014.

[20] S. A. Goreinov, E. E. Tyrtyshnikov, and N. L. Zamarashkin. A theory of pseudoskeleton approximations. Linear Algebra and its Applications, 261(1-3):1–21, 1997.

[21] P. A. Knight. The Sinkhorn–Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008.

[22] D. Kuang, S. Yun, and H. Park. SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering. Journal of Global Optimization, 62(3):545–574, 2015.

[23] A. Kumar, V. Sindhwani, and P. Kambadur. Fast conical hull algorithms for near-separable non-negative matrix factorization. In International Conference on Machine Learning, pages 231–239, 2013.

[24] V. Leplat, A. M. Ang, and N. Gillis. Minimum-volume rank-deficient nonnegative matrix factorizations. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3402–3406. IEEE, 2019.

[25] M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.

[26] A. Mikhalev and I. V. Oseledets. Rectangular maximum-volume submatrices and their applications. Linear Algebra and its Applications, 538:187–211, 2018.

[27] F. Moutier, A. Vandaele, and N. Gillis. Off-diagonal symmetric nonnegative matrix factorization. Numerical Algorithms, pages 1–25, 2021.

[28] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.

[29] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2004.

[30] R. A. Olshen and B. Rajaratnam. Successive normalization of rectangular arrays. Annals of Statistics, 38(3):1638, 2010.

[31] J. Pan and N. Gillis. Generalized separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[32] B. Recht, C. Re, J. Tropp, and V. Bittorf. Factoring nonnegative matrices with linear programs. In Advances in Neural Information Processing Systems, pages 1214–1222, 2012.

[33] K.-C. Toh, M. Todd, and R. Tütüncü. SDPT3 – a MATLAB software package for semidefinite programming, version 1.3. Optimization Methods and Software, 11(1-4):545–581, 1999.

[34] S. A. Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377, 2010.

[35] H. Wang, F. Nie, H. Huang, and C. Ding. Nonnegative matrix tri-factorization based high-order co-clustering and its fast implementation. In 2011 IEEE 11th International Conference on Data Mining, pages 774–783. IEEE, 2011.

[36] S. Wang and Z. Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. The Journal of Machine Learning Research, 14(1):2729–2769, 2013.

[37] S. Zhong and J. Ghosh. Generative model-based document clustering: a comparative study. Knowledge and Information Systems, 8(3):374–384, 2005.

A Appendix

A.1 Further Results for the ORL facial database

For the ORL facial database, we notice that the facial images in Fig. 5 are from 26 distinct subjects, and we assume that the other 14 subjects can be represented by these 40 selected facial images, denoted by M(:, K2), where K2 is the index set of the selected facial images. The selected set of facial images M(:, K2) can then be regarded as a feature set. To verify our assumption, we reconstruct the facial images of the 14 unselected subjects from M(:, K2) and show the results in Fig. 7. It is interesting to see that the reconstructed facial images are similar to the 140 unselected facial images, especially in their hair colors; see the last two rows of facial images for example. The relative reconstruction error is 0.1847.
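The text does not spell out how the unselected images are reconstructed from M(:, K2); a natural reading, sketched below in Python/NumPy under that assumption, is a nonnegative least-squares fit of each unselected image onto the dictionary of selected images (the function and variable names are hypothetical):

    import numpy as np
    from scipy.optimize import nnls

    def reconstruct_from_selected(M, K2, unselected):
        # fit each unselected image by a nonnegative combination of the selected images M(:, K2)
        W = M[:, K2]                                   # 437 x 40 dictionary of selected images
        R = np.zeros((M.shape[0], len(unselected)))
        for t, j in enumerate(unselected):
            h, _ = nnls(W, M[:, j])                    # nonnegative coefficients for image j
            R[:, t] = W @ h                            # reconstructed image
        return R

    # relative reconstruction error over the 140 unselected images (the reported value is 0.1847):
    # err = np.linalg.norm(M[:, unselected] - R) / np.linalg.norm(M[:, unselected])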

Figure 7: Left: the 140 unselected facial images. Right: the reconstruction results. The relative reconstruction error is 0.1847.

A.2 Further Results for the TDT30 document dataset

In the following, we show the clustering results of words for the TDT30 document dataset.

label  words
1      ’case’, ’lawyers’, ’court’, ’defense’, ’law’, ’lawyer’, ’legal’, ’judge’, ’evidence’, ’department’, ’try’, ’position’, ’justice’, ’tried’, ’trial’, ’order’, ’hour’, ’kaczynski’, ’provide’, ’begin’, ’stand’, ’suggested’, ’claims’, ’ruling’, ’criminal’, ’hearing’, ’apparently’, ’request’, ’believed’, ’understand’, ’attempt’, ’client’, ’professor’, ’prosecutor’, ’appeal’, ’ordered’, ’rejected’, ’decide’
2      ’game’, ’left’, ’jordan’, ’night’, ’point’, ’lead’, ’free’, ’line’, ’hit’, ’center’, ’gave’, ’minutes’, ’season’, ’wanted’, ’fourth’, ’turned’, ’series’, ’helped’, ’shot’, ’michael’, ’association’, ’chicago’, ’account’, ’bulls’, ’knew’, ’knows’, ’giving’
3      ’lewinsky’, ’told’, ’starr’, ’office’, ’jury’, ’monica’, ’grand’, ’relationship’, ’job’, ’according’, ’hours’, ’call’, ’testimony’, ’tell’, ’prosecutors’, ’starrs’, ’attorney’, ’tripp’, ’sources’, ’friend’, ’brought’, ’lewinskys’, ’lie’, ’investigators’, ’ginsburg’, ’offered’, ’ken’, ’worked’, ’immunity’, ’comment’, ’truth’, ’testify’, ’agents’, ’alleged’, ’mother’, ’conversations’, ’whitewater’, ’witness’, ’linda’, ’witnesses’
4      ’nuclear’, ’india’, ’pakistan’, ’tests’, ’statement’, ’test’, ’arms’, ’indias’, ’indian’, ’bomb’, ’response’, ’sign’, ’ban’, ’missile’, ’testing’, ’range’, ’declared’, ’conducted’
5      ’south’, ’banks’, ’korea’, ’term’, ’loans’, ’debt’, ’korean’, ’due’, ’banking’
6      ’york’, ’city’, ’police’, ’killed’, ’street’, ’authorities’, ’town’, ’simply’, ’rules’, ’italian’, ’miles’, ’mayor’, ’streets’, ’officer’, ’plane’
7      ’team’, ’national’, ’americans’, ’led’, ’russia’, ’final’, ’head’, ’play’, ’hard’, ’able’, ’face’, ’north’, ’russian’, ’teams’, ’hockey’, ’looking’, ’martin’, ’players’, ’victory’, ’canada’, ’round’, ’allowed’, ’leading’, ’tour’, ’ice’, ’room’, ’period’, ’goal’, ’played’, ’forward’, ’coach’, ’maybe’, ’league’, ’canadian’, ’playing’, ’damage’, ’loss’
8      ’sexual’, ’federal’, ’women’, ’number’, ’problem’, ’found’, ’pay’, ’side’, ’sex’, ’drug’, ’viagra’, ’accused’, ’makes’, ’research’, ’effect’, ’longer’, ’ask’, ’care’, ’condition’, ’mckinney’, ’lives’, ’cover’
9      ’home’, ’students’, ’school’, ’children’, ’university’, ’thought’, ’saw’, ’fire’, ’heard’, ’student’, ’started’, ’stay’
10     ’government’, ’economic’, ’minister’, ’foreign’, ’economy’, ’japan’, ’financial’, ’prime’, ’saying’, ’help’, ’japanese’, ’give’, ’major’, ’policy’, ’past’, ’problems’, ’important’, ’system’, ’hong’, ’yen’, ’central’, ’exchange’, ’deputy’, ’finance’, ’governments’, ’bad’, ’announced’, ’kong’, ’tax’, ’cut’, ’tokyo’, ’budget’, ’newspaper’, ’confidence’, ’turn’, ’worlds’, ’means’, ’key’, ’development’, ’role’, ’domestic’, ’ministry’, ’bring’, ’japans’, ’measures’, ’urged’, ’especially’, ’concern’, ’yesterday’, ’social’, ’needs’, ’ministers’, ’forced’, ’spending’, ’rest’, ’package’, ’asias’, ’huge’, ’fear’, ’similar’, ’concerned’, ’present’, ’credit’, ’steps’, ’sector’, ’november’
11     ’jones’, ’mrs’, ’working’, ’involved’, ’denied’, ’paula’, ’arkansas’, ’lawsuit’, ’provided’

12     ’companies’, ’expected’, ’including’, ’group’, ’business’, ’big’, ’company’, ’times’, ’workers’, ’making’, ’added’, ’find’, ’seen’, ’small’, ’local’, ’america’, ’real’, ’investment’, ’known’, ’director’, ’largest’, ’union’, ’sales’, ’result’, ’recently’, ’technology’, ’offer’, ’europe’, ’share’, ’jobs’, ’cent’, ’car’, ’thai’, ’parts’, ’beginning’, ’sell’, ’labor’, ’services’, ’single’, ’products’, ’build’, ’heavy’, ’success’, ’costs’, ’firm’, ’noted’
13     ’billion’, ’million’, ’months’, ’prices’, ’program’, ’trade’, ’oil’, ’began’, ’dlrs’, ’reported’, ’food’, ’current’, ’european’, ’increase’, ’march’, ’demand’, ’dollars’, ’aid’, ’total’, ’islamic’, ’buy’, ’exports’, ’cost’, ’gas’, ’imposed’, ’algeria’, ’production’, ’raised’, ’poor’, ’december’, ’caused’, ’daily’, ’algerian’, ’organization’, ’export’, ’amount’
14     ’put’, ’capital’, ’death’, ’young’, ’scheduled’, ’woman’, ’texas’, ’person’, ’cases’, ’civil’, ’penalty’, ’florida’, ’died’, ’changed’
15     ’end’, ’half’, ’conference’, ’super’, ’chance’, ’running’, ’bowl’, ’denver’, ’green’, ’pass’
16     ’political’, ’country’, ’indonesia’, ’leader’, ’leaders’, ’suharto’, ’power’, ’held’, ’indonesian’, ’jakarta’, ’countrys’, ’family’, ’forces’, ’nation’, ’future’, ’soon’, ’calls’, ’opposition’, ’parliament’, ’member’, ’army’, ’step’, ’reform’, ’indonesias’, ’vice’, ’reforms’, ’leave’, ’rule’, ’building’, ’friends’, ’cabinet’, ’calling’, ’rupiah’, ’armed’, ’planned’, ’democracy’, ’hundreds’, ’powerful’, ’immediately’, ’quoted’, ’habibie’, ’suhartos’, ’hands’, ’decades’
17     ’spkr’, ’news’, ’today’, ’voice’, ’correspondent’, ’look’, ’voa’, ’peter’, ’announcer’, ’abc’, ’jennings’, ’camera’, ’happen’, ’tomorrow’, ’jim’, ’phonetic’, ’mark’, ’sam’, ’evening’, ’voas’, ’goes’, ’tonight’
18     –
19     ’bank’, ’meeting’, ’israel’, ’talks’, ’peace’, ’israeli’, ’plan’, ’albright’, ’east’, ’process’, ’netanyahu’, ’pressure’, ’control’, ’move’, ’middle’, ’palestinian’, ’meet’, ’agreed’, ’west’, ’met’, ’london’, ’effort’, ’difficult’, ’palestinians’, ’hand’, ’area’, ’proposal’, ’hold’, ’arafat’, ’failed’, ’negotiations’, ’authority’, ’accept’, ’land’, ’idea’, ’progress’, ’sides’, ’agree’, ’meetings’, ’madeleine’, ’areas’, ’break’, ’radio’, ’summit’, ’pro’, ’blair’
20     ’tobacco’, ’bill’, ’money’, ’industry’, ’committee’, ’campaign’, ’senate’, ’health’, ’anti’, ’smoking’, ’legislation’, ’republicans’, ’programs’, ’settlement’, ’reached’, ’received’, ’debate’, ’raise’, ’tough’, ’documents’, ’proposed’, ’cigarette’, ’attorneys’, ’fight’, ’june’, ’related’, ’interests’, ’sen’, ’democrats’, ’cigarettes’, ’age’
21     ’visit’, ’john’, ’open’, ’mass’, ’cuba’, ’pope’, ’hope’, ’change’, ’trip’, ’cuban’, ’history’, ’community’, ’paul’, ’castro’, ’arrived’, ’thousands’, ’message’, ’church’, ’words’, ’released’, ’hopes’, ’opportunity’, ’freedom’, ’society’
22     ’united’, ’states’, ’american’, ’officials’, ’military’, ’secretary’, ’saddam’, ’support’, ’war’, ’gulf’, ’force’, ’action’, ’hussein’, ’region’, ’attack’, ’believe’, ’clear’, ’air’, ’situation’, ’strike’, ’diplomatic’, ’continue’, ’fact’, ’efforts’, ’allow’, ’arab’, ’likely’, ’based’, ’british’, ’kuwait’, ’plans’, ’cohen’, ’threat’, ’troops’, ’britain’, ’chemical’, ’mission’, ’solution’, ’relations’, ’stop’, ’needed’, ’strikes’, ’william’, ’ready’, ’view’, ’let’, ’certainly’, ’spoke’, ’french’, ’speech’, ’warned’, ’comes’, ’include’, ’sense’, ’ground’, ’pentagon’, ’missiles’, ’act’, ’prepared’, ’potential’, ’wont’, ’bombing’, ’western’, ’aircraft’, ’clearly’, ’expressed’, ’willing’, ’allies’, ’resolutions’, ’ability’, ’attacks’, ’suspected’, ’persian’, ’possibility’, ’avoid’, ’continues’, ’diplomacy’, ’prevent’, ’saudi’, ’base’, ’significant’, ’send’, ’carried’, ’seek’, ’carry’

23     ’crisis’, ’international’, ’asian’, ’asia’, ’countries’, ’currency’, ’imf’, ’thailand’, ’monetary’, ’malaysia’, ’global’, ’board’, ’regional’, ’singapore’, ’economies’, ’southeast’, ’currencies’, ’philippines’, ’worst’, ’stability’
24     ’percent’, ’week’, ’market’, ’friday’, ’high’, ’month’, ’stock’, ’points’, ’report’, ’chief’, ’recent’, ’weeks’, ’fund’, ’markets’, ’dollar’, ’growth’, ’earlier’, ’interest’, ’early’, ’strong’, ’close’, ’investors’, ’late’, ’despite’, ’morning’, ’low’, ’price’, ’rates’, ’level’, ’stocks’, ’analysts’, ’lost’, ’quarter’, ’nearly’, ’rate’, ’fell’, ’coming’, ’large’, ’lower’, ’index’, ’higher’, ’return’, ’fall’, ’funds’, ’trading’, ’remain’, ’closed’, ’average’, ’rose’, ’main’, ’earnings’, ’impact’, ’continued’, ’inflation’, ’january’, ’remains’, ’ended’, ’growing’, ’expect’, ’risk’, ’rise’, ’wall’, ’showed’, ’treasury’, ’corporate’, ’biggest’, ’previous’, ’shares’, ’performance’, ’concerns’, ’dow’, ’hurt’, ’bond’, ’april’, ’turmoil’, ’rally’, ’drop’, ’industrial’, ’profits’, ’annual’, ’securities’, ’rising’, ’analyst’, ’dropped’, ’february’, ’signs’, ’session’
25     ’china’, ’rights’, ’chinese’, ’human’, ’live’, ’beijing’, ’chinas’
26     ’says’, ’going’, ’reporter’, ’reports’, ’headline’, ’hes’, ’show’, ’cnn’, ’kind’, ’wants’, ’media’, ’feel’, ’mean’, ’king’, ’actually’, ’happened’, ’stories’, ’weve’, ’thank’, ’believes’, ’sort’
27     ’president’, ’clinton’, ’house’, ’white’, ’washington’, ’public’, ’called’, ’part’, ’official’, ’asked’, ’story’, ’clintons’, ’trying’, ’administration’, ’investigation’, ’issue’, ’independent’, ’need’, ’presidents’, ’question’, ’counsel’, ’decision’, ’reporters’, ’information’, ’spokesman’, ’press’, ’senior’, ’executive’, ’issues’, ’questions’, ’television’, ’material’, ’matter’, ’talk’, ’private’, ’service’, ’follows’, ’privilege’, ’david’, ’interview’, ’intern’, ’optional’, ’chairman’, ’charges’, ’allegations’, ’affair’, ’republican’, ’speaking’, ’details’, ’discuss’, ’terms’, ’kenneth’, ’secret’, ’reason’, ’talking’, ’personal’, ’aides’, ’democratic’, ’staff’, ’decided’, ’quickly’, ’scandal’, ’weekend’, ’seeking’, ’attention’, ’james’, ’mike’, ’refused’, ’true’, ’answer’, ’spent’, ’robert’, ’appeared’, ’claim’, ’issued’, ’wrong’, ’consider’, ’ways’, ’letter’, ’strategy’, ’george’, ’sought’, ’investigating’, ’protect’, ’comments’, ’focus’, ’opinion’, ’declined’, ’inquiry’, ’word’, ’wife’, ’affairs’, ’accusations’
28     ’days’, ’olympic’, ’ago’, ’top’, ’games’, ’won’, ’place’, ’set’, ’nagano’, ’olympics’, ’later’, ’took’, ’gold’, ’course’, ’win’, ’run’, ’medal’, ’short’, ’taking’, ’race’, ’winter’, ’record’, ’sports’, ’start’, ’ahead’, ’event’, ’womens’, ’cup’, ’ski’, ’events’, ’slalom’, ’competition’, ’figure’, ’moment’, ’training’, ’italy’, ’mens’, ’seconds’, ’downhill’, ’finished’, ’opening’, ’conditions’, ’athletes’, ’considered’, ’skating’, ’site’, ’champion’, ’bit’, ’germany’, ’giant’, ’felt’, ’finally’, ’snow’, ’speed’, ’finish’, ’winning’, ’cross’, ’minute’, ’sport’, ’hill’
29     ’iraq’, ’weapons’, ’security’, ’iraqi’, ’nations’, ’general’, ’council’, ’inspectors’, ’baghdad’, ’agreement’, ’sanctions’, ’annan’, ’deal’, ’sites’, ’presidential’, ’iraqs’, ’inspections’, ’special’, ’butler’, ’access’, ’full’, ’latest’, ’biological’, ’france’, ’destruction’, ’richard’, ’experts’, ’kofi’, ’iraqis’, ’agency’, ’inspection’, ’commission’, ’diplomats’, ’resolution’, ’ambassador’, ’signed’, ’inspector’, ’cooperation’, ’warning’, ’palaces’, ’richardson’, ’standoff’
30     ’congress’, ’party’, ’members’, ’election’, ’groups’, ’front’, ’violence’, ’vote’, ’form’, ’hindu’, ’elections’, ’majority’, ’parties’, ’politics’, ’muslim’, ’results’, ’leadership’, ’coalition’, ’delhi’
