
Nonnegative Matrix Factorization for Signal and Data Analytics: [Identifiability, Algorithms, and Applications]

Xiao Fu, Kejun Huang, Nicholas D. Sidiropoulos, and Wing-Kin Ma

X. Fu is with the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331, USA. Email: [email protected]. K. Huang is with the Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32605, USA. Email: kejun.huang@ufl.edu. N. D. Sidiropoulos is with the Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, USA. Email: [email protected]. W.-K. Ma is with the Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong. Email: [email protected]. This work was supported in part by the National Science Foundation under Project NSF ECCS-1608961 and Project NSF IIS-1447788.

arXiv:1803.01257v4 [eess.SP] 16 Nov 2018

I. INTRODUCTION

Nonnegative matrix factorization (NMF) aims at factoring a data matrix into low-rank latent factor matrices with nonnegativity constraints. Specifically, given a data matrix X ∈ R^{M×N} and a target rank R, NMF seeks a factorization model

X ≈ WH^⊤,  W ∈ R^{M×R},  H ∈ R^{N×R},   (1)

to 'explain' the data matrix X, where W ≥ 0, H ≥ 0, and R ≤ min{M, N}. At first glance, NMF is nothing but an alternative factorization model to existing ones such as the singular value decomposition (SVD) [1] or independent component analysis (ICA) [2] that have different constraints (resp. orthogonality and statistical independence) on the latent factors. However, the extraordinary effectiveness of NMF in analyzing real-life nonnegative data has sparked a substantial amount of research in many fields. The linear algebra community has shown interest in nonnegative matrices and nonnegative matrix factorization (known as nonnegative rank factorization) for more than thirty years [3]. In the 1990s, researchers in analytical chemistry and remote sensing (earth science) had already noticed the effectiveness of NMF—which was first referred to as 'positive matrix factorization' [4], [5]. In 1999, Lee and Seung's seminal paper published in Nature [6] sparked a tremendous amount of research in computer science and signal processing. Many types of real-life data like images, text, and audio spectra can be represented as nonnegative matrices—and factoring them into nonnegative latent factors yields intriguing results. Today, NMF is well-recognized as a workhorse for signal and data analytics.

Interpretability and Model Identifiability  Although NMF had already been tried in many application domains before the 2000s, it was the 'interpretability' argument made in the Nature article [6] that really brought NMF to center stage. Lee and Seung underscored this point in [6] with a thought-provoking experiment in which NMF was applied to a matrix X whose columns are vectorized human face images. Interestingly, the columns of the resulting W are clear parts of human faces, e.g., nose, ears, and eyes (see details in Sec. II-A). Interpretable NMF results can be traced back to earlier years. As mentioned, back in the 1990s, in remote sensing and analytical chemistry, researchers started to notice that when factoring remotely captured hyperspectral images and laboratory-measured spectral samples of chemical compounds using techniques that are now recognized as NMF algorithms, the columns of W are spectra of materials on the ground [4] and constituent elements of the chemical samples, respectively [5]. NMF also exhibits strong interpretability in machine learning applications. In text mining, NMF returns prominent topics contained in a document corpus [7], [8]. In community detection, NMF discovers groups of people who have similar activities from a social graph [9], [10].

The interpretability of NMF is intimately related to its model uniqueness, or, latent factor identifiability. That is, unlike some other factorization techniques (e.g., the SVD), NMF is able to identify the ground-truth generative factors W and H up to certain trivial ambiguities—which therefore leads to strong interpretability. In [5], which appeared in 1994, Paatero et al. conjectured that the effectiveness of NMF might be related to the uniqueness of the factorization model under certain realistic conditions. This conjecture was later confirmed in theory by a number of works from different fields [9], [11]–[16]. The connection between identifiability and interpretability is intuitively pleasing—if the data really follows a generative model, then identifying the ground-truth generative model becomes essential for explaining the data.

The understanding of NMF identifiability also opens many doors for a vast array of applications. For example, in signal processing, NMF and related techniques have been applied to speech and audio separation [17], [18], cognitive radio [19], and medical imaging [13]; in statistical learning, recent works have shown that NMF can identify topic models coming from the latent semantic analysis (LSA) model and the latent Dirichlet allocation (LDA) model [7], the mixed membership stochastic blockmodel (MMSB) [10], and the hidden Markov model (HMM) [20], [21]—with provable guarantees.

Our Goal  In this feature article, we will review the recent developments in theory, methods, and applications of NMF. We will use model identifiability of NMF as a thread to connect these different aspects of NMF, and show why understanding identifiability of NMF is critical for engineering applications. We will introduce different identification criteria for NMF, reveal their insights, and discuss their pros and cons—both in theory and in practice. We will also introduce the associated algorithms and applications, and discuss in depth the connection between the nature of the problems considered, NMF identification criteria, and the associated algorithms.

We should mention that there are a number of existing tutorials on NMF, such as those in [22]–[27]. Many of these tutorials focus on computational aspects, and do not cover the identifiability issues thoroughly (an exception is [23], where an NMF subclass called separable NMF was reviewed). In addition, many developments and discoveries on NMF identifiability happened after the appearances of [22]–[27], and there is no tutorial summarizing these new results. This article aims at filling these gaps and introducing the latest pertinent research outcomes.

Notation  We mostly follow the established conventions in signal processing. For example, we use boldface X ∈ R^{M×N} to denote a matrix of size M × N; X(m, n) denotes the element in the mth row and nth column; X(:, ℓ) and x_ℓ can both denote the ℓth column of X; "^{-1}", "^†", and "^⊤" denote matrix inversion, pseudo-inverse, and transpose, respectively; vec(Y) reshapes the matrix Y as a vector by concatenating its columns; 1 and 0 denote all-one and all-zero vectors of proper length; X ≥ 0 means that every element in X is nonnegative; "X ⪰ 0" and "X ≻ 0" denote that X is positive semidefinite and positive definite, respectively; nnz(X) counts the number of nonzero elements in X; and R^n_+ denotes the nonnegative orthant of the n-dimensional Euclidean space.

II. WHERE DOES NMF ARISE?

Let us first take a look at several interesting and important applications of NMF. These applications will shed light on the importance of the NMF identifiability issues.

A. Early Pioneers: Analytical Chemistry, Remote Sensing, and Image Processing

The concept of positive matrix factorization was mentioned in 1994 by Paatero et al. [5] for a class of spectral unmixing problems in analytical chemistry. Specifically, the paper aimed at analyzing key chemical components that are present in a collection of observed chemical samples measured at different wavelengths. This problem is very similar to the hyperspectral unmixing (HU) problem in hyperspectral imaging [26], which is a timely application in remote sensing. We give an illustration of the HU data model in Fig. 1. There, x_ℓ is a spectral vector (or, a spectral pixel) that is measured via capturing electromagnetic (EM) patterns carried by the ground-reflected sunlight using a remote sensor at M wavelengths. The linear mixture model (LMM) employed in hyperspectral imaging, i.e., x_ℓ ≈ WH(ℓ, :)^⊤ ∈ R^M, assumes that every pixel is a convex combination (i.e., a weighted sum with weights summing up to one) of spectral signatures of different pure materials on the ground. Under the LMM, HU aims at factoring X to recover the 'endmembers' (spectral signatures) W and the 'abundances' H. In fact, in remote sensing and earth science, the use of NMF or related ideas has a very long history. Some key concepts in geoscience that are heavily related to modern NMF can be traced back to the 1960s [28], although the word 'NMF' was not spelled out.

Fig. 1. Classic application of NMF—hyperspectral unmixing in remote sensing. Top: the LMM for spectral unmixing in remote sensing; every slab on the left represents a band of the hyperspectral image; each vector w_i describes the spectral signature of a pure material, or endmember; the maps on the right-hand side are called the abundance maps of the endmembers, which are re-arranged columns of H. Bottom: the corresponding NMF model after re-arranging the pixels.

Perhaps the most well-known (and pioneering) application of NMF is representation learning in image recognition [6] (see Fig. 2 and the next insert). There, x_ℓ is a vectorized image of a human face, and X is a collection of such vectors. Learning a nonnegative H from the factorization model X ≈ WH^⊤ can be understood as learning nonnegative 'embeddings' of the faces in a low-dimensional space (spanned by the basis W); i.e., the learned H(ℓ, :) can be used as a low-dimensional representation of face ℓ. It was in this particular paper that Lee and Seung reported a very thought-provoking empirical result: the columns of W are interpretable—every column corresponds to a part of a human face. Correspondingly, the embeddings are meaningful—the embedding of a person's face H(ℓ, :) contains weights of the learned parts, and H(ℓ, :) is typically sparse, meaning that every face can be synthesized from few of the available parts. Such meaningful embeddings are often found effective in subsequent processing such as clustering and classification since they reflect the 'true nature' of the faces. In fact, the ability of generating sparse and part-based embeddings is considered the major driving force behind the popularity of NMF in machine learning. See more discussions on this particular point in [23], [24].

NMF for Representation Learning.  An example of NMF for face representation learning is illustrated in Fig. 2. Here, we use a dataset containing 13,232 images (see the LFW dataset at http://vis-www.cs.umass.edu/lfw); each image has a size of 101 × 101 pixels. Hence, the X matrix has a size of 10,201 × 13,232. We set R = 49 and run the Lee-Seung multiplicative update NMF algorithm [29] with 5,000 iterations. Some of the face images and the learned w_r's are presented in Fig. 2. One can see that the learned basis W contains parts of human faces as its columns.
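The multiplicative-update algorithm mentioned in the insert is the method of [29]; it is not spelled out in this article. The following is a minimal NumPy sketch of those standard updates for the model X ≈ WH^⊤, under the notation used here (the function name, random initialization, and the small constant eps are our own choices, not from the article).

import numpy as np

def mu_nmf(X, R, n_iter=5000, eps=1e-12, seed=0):
    """Lee-Seung multiplicative updates for X ≈ W @ H.T with W >= 0, H >= 0."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, R)) + eps
    H = rng.random((N, R)) + eps
    for _ in range(n_iter):
        # each update keeps the factor nonnegative and does not increase ||X - W H^T||_F^2
        W *= (X @ H) / (W @ (H.T @ H) + eps)
        H *= (X.T @ W) / (H @ (W.T @ W) + eps)
    return W, H

For the face example above, X would be the 10,201 × 13,232 pixel-by-image matrix with R = 49, and the columns of the returned W play the role of the learned facial parts.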

Fig. 2. Top two rows: face images (x_ℓ's). Bottom two rows: learned representation basis (w_r's) for human faces. Here we have R = 49. (Panel titles in the figure: "Toy Demonstration of NMF" and "NMF-Extracted Features"; dataset: 13,232 face images of size 101 × 101, so M = 10,201 and N = 13,232; NMF settings: R = 49, Lee-Seung multiplicative update with 5,000 iterations.)

B. Topic Modeling

In machine learning, NMF has been an important tool for learning topic models [7], [8], [30], [31]. The task of topic mining is to discover prominent topics (represented by sets of words) from a large set of documents; see Fig. 3. In topic mining using the celebrated 'bag-of-words' model, X is a word-document matrix that summarizes the documents over a dictionary. Specifically, X(m, n) represents the mth word feature of the nth document, say, the Term-Frequency (TF) or the Term-Frequency Inverse-Document-Frequency (TF-IDF) [32]. Take the TF representation as an example. There, we have X(m, n) ≈ Pr(m, n), where Pr(m, n) denotes the joint probability of seeing word m in document n. One simple way to connect NMF with topic modeling is to consider the probabilistic latent semantic indexing (pLSI) approach (see a detailed introduction in [33]). Suppose that given a topic r, the probabilities of seeing a word m and a document n are conditionally independent. We have

Pr(m, n) = Σ_{r=1}^R Pr(m|r) Pr(r) Pr(n|r),

where Pr(m|r) and Pr(n|r) denote the probabilities of seeing word m and document n conditioned on topic r. This is equivalent to the following factorization model

X = CΣD^⊤,   (2)

where C(m, r) = Pr(m|r), Σ is a diagonal matrix with Σ(r, r) = Pr(r), and D(n, r) = Pr(n|r). Note that under this model, C(:, r) is a conditional probability mass function (PMF) that describes the probabilities of seeing the words in the dictionary given topic r—which naturally represents topic r. If one denotes H = DΣ and W = C, the model X ≈ WH^⊤ is obtained. Note that W and H are naturally nonnegative since they are PMFs [32]. This nonnegative representation of x_ℓ is illustrated in Fig. 3 (right), which can be understood as saying that a document is a weighted combination of several topics, in layman's terms.

The topic mining problem can also be formulated using second-order statistics. Specifically, consider the correlation matrix of all documents:

P = E[x_ℓ x_ℓ^⊤] = CEC^⊤,   (3)

where E = E[H(ℓ, :)^⊤ H(ℓ, :)] denotes the correlation matrix of the topics. The reason for considering the alternative formulation in (3) is that the model (2) is a noisy approximation, especially when the length of each document is short (e.g., tweets), so that the number of words is not enough to estimate Pr(m, n) reliably. But if we have a large number of documents, the second-order statistic P can be estimated reliably. The model (3) is an NMF model wherein W = CE and H = C, if one ignores the underlying structure of W. Nevertheless, we will see that exploiting the symmetric structure of P may lead to better performance or simpler algorithms.

C. Hidden Markov Model Identification

HMMs are widely used in machine learning for modeling time series, e.g., speech and language data. The graphical model of an HMM is shown in Fig. 4, where one can see a sequence of observable data {Y_t}_{t=0}^T emitted from an underlying unobservable Markov chain {X_t}_{t=0}^T. The HMM estimation problem aims at identifying the transition probability Pr(X_{t+1}|X_t) and the emission probability Pr(Y_t|X_t)—assuming that they are time-invariant.

Consider a case where one can accurately estimate the co-occurrence probability between two consecutive emissions, i.e., Pr(Y_t, Y_{t+1}). According to the graphical model shown in Fig. 4, it is easy to see that given the values of X_t and X_{t+1}, Y_t and Y_{t+1} are conditionally independent. Therefore, we have the following model [20], [21]:

Pr(Y_t, Y_{t+1}) = Σ_{k,j=1}^R Pr(Y_t | X_t = x_k) Pr(Y_{t+1} | X_{t+1} = x_j) Pr(X_t = x_k, X_{t+1} = x_j).   (4)

Let us define Ω ∈ R^{M×M}, M ∈ R^{M×R}, and Θ ∈ R^{R×R} such that Ω(m, ℓ) = Pr(Y_t = y_m, Y_{t+1} = y_ℓ), M(m, r) = Pr(Y_t = y_m | X_t = x_r), and Θ(k, j) = Pr(X_t = x_k, X_{t+1} = x_j). Here, M denotes the number of observable states and R the number of hidden states. Then, (4) can be written compactly as

Ω = MΘM^⊤.   (5)

The above model is very similar to the second-order-statistics topic model in (3): both M and Θ are constituted by PMFs, and are thus nonnegative. Similar to the topic modeling example, if one lets X = Ω, W = MΘ, and H = M, then the classic NMF model in (1) is recovered.
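To make (5) concrete, here is a small NumPy sketch (a toy simulation of our own, not taken from the article): it builds Ω = MΘM^⊤ from a randomly drawn transition and emission model and checks that empirical co-occurrence counts of consecutive emissions converge to it. The same symmetric construction mirrors the word co-occurrence statistic P in (3).

import numpy as np

rng = np.random.default_rng(0)
R, M_obs = 3, 5                                   # numbers of hidden and observable states (toy sizes)
T = rng.dirichlet(np.ones(R), size=R)             # transition matrix: T[k, j] = Pr(X_{t+1}=j | X_t=k)
Em = rng.dirichlet(np.ones(M_obs), size=R).T      # emission matrix: Em[m, r] = Pr(Y_t=m | X_t=r)

# stationary distribution of the chain (Perron eigenvector of T^T; sign fixed by normalization)
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

Theta = pi[:, None] * T                           # Theta[k, j] = Pr(X_t=k, X_{t+1}=j)
Omega = Em @ Theta @ Em.T                         # co-occurrence model, Eq. (5)

# empirical check: simulate a long chain and count consecutive emission pairs
n_steps = 100_000
counts = np.zeros((M_obs, M_obs))
x = rng.choice(R, p=pi)
y_prev = rng.choice(M_obs, p=Em[:, x])
for _ in range(n_steps):
    x = rng.choice(R, p=T[x])
    y = rng.choice(M_obs, p=Em[:, x])
    counts[y_prev, y] += 1
    y_prev = y
print(np.abs(counts / n_steps - Omega).max())     # small: only sampling error remains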
D. Community Detection

Recent work in [10], [34] has related NMF to the celebrated mixed membership stochastic blockmodel (MMSB) [35] that is widely adopted for overlapped community detection—which is a problem of assigning multiple community memberships to nodes (e.g., people) in a network; see Fig. 5. Overlapped community detection is considered a hard problem since many network datasets are of very large scale and are often recorded in an oversimplified way (e.g., using unweighted undirected adjacency matrices).

Fig. 3. Topic mining and nonnegative representation x_ℓ = WH(ℓ, :)^⊤ for ℓ = 1, ..., N. (Panels in the figure: a word dictionary (Dow Jones, Wall St., Congress, White House, Obama, WWII, Churchill, ...); example topics such as "Topic 1: Clinton, White House, Scandal, Lewinsky, grand jury, ...", "Topic 2: Utah, Chicago, NBA, Jordan, jazz, basketball, final, ...", "Topic 3: NASA, Columbia, shuttle, space, experiments, ..."; and a document represented as x_ℓ ≈ 0.2 × w_1 + 0.65 × w_2 + 0.15 × w_3, i.e., 0.2 × economy + 0.65 × politics + 0.15 × history.)

Fig. 4. The graphical model of an HMM (hidden chain ..., X_{t−1}, X_t, X_{t+1}, ... with emissions Y_{t−1}, Y_t, Y_{t+1}).

Fig. 5. The problem of overlapped community detection: assigning multiple memberships to different nodes in a network according to their latent attributes. (The figure depicts three overlapping groups labeled community 1, 2, and 3.)

The MMSB considers the generative model of a symmetric A ∈ {0, 1}^{N×N}. Specifically, the A(i, j)'s represent the connections between entities in a network—i.e., A(i, j) = 1 means that nodes i and j are connected, while A(i, j) = 0 means otherwise. Assume that every node has a membership indicator vector θ_i = [θ_{1,i}, ..., θ_{R,i}] ∈ R^{1×R} and that there are R underlying communities. The value of θ_{r,i} indicates the probability that node i is associated with community r—and thus θ_i can naturally represent multiple memberships of node i. Apparently, we have θ_i ≥ 0 and Σ_{r=1}^R θ_{r,i} = 1. Define a matrix B ∈ R^{R×R}, where B(m, n) ∈ [0, 1] represents the probability that communities m and n are connected. Under the assumption that nodes are connected with each other through their latent community identities, the node-node link-generating process is as follows:

θ_i ∼ Dirichlet(α),  P = ΘBΘ^⊤,   (6a)
A(i, j) = A(j, i) ∼ Bernoulli(P(i, j)),   (6b)

where Θ ∈ R^{N×R} collects all the θ_i's as its rows, and α ∈ R^R is the parameter vector that characterizes the Dirichlet distribution [36]. In the above, θ_i and A(i, j) are assumed to be sampled from a Dirichlet distribution and a Bernoulli distribution (with parameter P(i, j)) [36], respectively. Being sampled from the Dirichlet distribution, θ_i resides in the probability simplex, and thus θ_{r,i} naturally reflects how much node i is associated with community r—e.g., when R = 3, θ_i = [0.5, 0.2, 0.3]^⊤ means that 50%, 20%, and 30% of node i's participation is associated with communities 1, 2, and 3, respectively. Using a Bernoulli distribution also makes a lot of sense, since plenty of network data is very coarsely recorded—using "0" and "1" to represent absence/presence of a connection, without regard to connection strength. Both the model parameters Θ (node-membership matrix) and B (community-community connection matrix) are of great interest in network analytics. The model P = ΘBΘ^⊤ is an NMF model as in the HMM case. However, P is not available in practice. The interesting observation in [10], [34] is that range(Θ) can be accurately approximated. Specifically, denote Û as the matrix consisting of the first R principal eigenvectors of A. Under the MMSB model and some regularity conditions, one can show that Û^⊤ = ZΘ^⊤ + N holds for a certain nonsingular Z ∈ R^{R×R} [10] and a noise term N with ‖N‖_F bounded. Then, by letting H = Θ and W = Z, one can recover the following noisy matrix factorization model

X = WH^⊤ + N,   (7)

with H ≥ 0. We should note that the model in (7) is not strictly NMF, since W could contain negative elements. However, (7) is usually regarded as a simple extension of NMF, since many identifiability results and algorithms for the classic NMF model can be extended to the model in (7) as well [7], [15], [17], [37]–[39].

There are many more applications of NMF which we cannot cover in detail. For example, gene expression learning has been widely recognized as a good domain of application for NMF [40]; NMF can also help in the context of recommender systems [41]; and NMF has proven very effective in handling a variety of blind source separation (BSS) problems [17]–[19]—see the supplementary materials of this article.
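As a sanity check on the subspace relation behind (7), the following NumPy sketch (a toy simulation of our own; the sizes and the choice of B are arbitrary) draws an MMSB graph and verifies that the span of the top-R eigenvectors of A is close to range(Θ).

import numpy as np

rng = np.random.default_rng(0)
N, R = 800, 3
Theta = rng.dirichlet(0.3 * np.ones(R), size=N)   # node memberships (rows on the simplex)
B = 0.05 * np.ones((R, R)) + 0.85 * np.eye(R)     # community-community connection probabilities
P = Theta @ B @ Theta.T

A = (rng.random((N, N)) < P).astype(float)        # Bernoulli links, cf. (6b)
A = np.triu(A, 1)
A = A + A.T                                       # symmetric adjacency matrix, no self-loops

# top-R principal eigenvectors of A
evals, evecs = np.linalg.eigh(A)
U = evecs[:, np.argsort(np.abs(evals))[-R:]]

# range(U) should approximately equal range(Theta): project Theta onto span(U)
resid = Theta - U @ (U.T @ Theta)
print(np.linalg.norm(resid) / np.linalg.norm(Theta))  # noticeably smaller than 1 when the subspaces align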
Given ∼ this diverse palette of applications, a natural (and critical) M R where Θ R × collects all the θi’s as its rows, and question to ask is what are the appropriate identification ∈ 5 criteria and computational tools that we should apply for each As an example, consider the most popular NMF identifica- of them? The answer to this question is strongly dependent tion criterion on related identifiability issues, as we will see.

2 minimize X WH> ; (10) III.NMF AND MODEL IDENTIFIABILITY W 0, H 0, − F ≥ ≥

All the introduced problems in the previous section have one thing in common: We are concerned with whether we see [6], [9], [11], [27]. Here, the identification criterion is the can identify the true latent factors H and W from the data optimization problem in (10), and (W?, H?) is an optimal X. For example, in topic modeling, identifying H = C solution to this problem. To establish identifiability of this yields the PMFs that represent the topics. To better understand NMF criterion, one will need to analyze under what conditions the theoretical limitations of such parameter identification (9) holds. Many NMF approaches can be understood as an problems under NMF models, in this section, we will formally optimization criterion, e.g., those in [14]–[17], [42], which define identifiability of NMF and introduce some pertinent directly output estimates of W\ and H\. There are, however, concepts. methods that employ an estimation/optimization criterion for identifying a different set of variables, and then the latter is A. Problem Statement used to produce W\ and H\ under some conditions [30], [37], [39], [43]—which will become clear later. The notion of model uniqueness or model identifiability is not unfamiliar to the signal processing community, since it is one of the basic desiderata in parameter estimation. For Should I care about identifiability? Simply speaking, NMF, identifiability refers to the ability to identify the data when one works with an engineering problem that re- generating factors W and H that give rise to X = WH>. quires parameter estimation, identifiability is essential for A matrix factorization model without constraints on the latent guaranteeing sensible results. For example, in the topic factors is nonunique, since we have mining model (3), W\’s columns are topics that generate a document. If a chosen NMF algorithm is unable to X = WH> = WQ(HQ−>)> (8) produce identifiable results, one could have W\Q as a R R solution of the parameter identification process, which for any invertible Q R × . Therefore, for whichever ∈ is meaningless in interpreting the data since they are (W , H) that fits the model X = WH , the alternative > still mixtures of the ground-truth topics. An example of solution (WQ, HQ ) fits it equally well. Intuitively, adding −> text mining is shown in Table I, where the top-20 key constraints such as nonnegativity on the latent factors may words of 5 mined topics from 1,683 real news articles help us achieve identifiability, since constraints can reduce the are shown. The left is from a classic method (i.e., LDA ‘volume’ of the solution space and remove a lot of possibilities [33]) which has no known identifiability guarantees, and for Q. The study of NMF identifiability, roughly speaking, the right is from a method (i.e., Anchor-Free [8], [44]) addresses under what identification criteria and conditions that is built to guarantee identifiability under certain mild most Q’s can be removed. When studying identifiability, we conditions. One can see that the Anchor-Free method often ignore cases where Q only results in column reordering outputs topics that are clean and easily distinguishable— and scaling /counter-scaling of W and H, which is considered the five groups of words clearly correspond to “White unavoidable, but also inconsequential in most applications. 
House Scandal in 1998”, “GM Strike in Michigan”, Bearing this in mind, let us formally define identifiability of “NASA & Columbia Space Shuttle”, “NBA Final in nonnegative matrix factorization: 1997”, and “School Shooting in Arkansas”, respectively. Definition 1 (Identifiability) Consider a data matrix gener- In the meantime, the topics mined by the classic method (i.e., LDA [33]) are noticeably mixed—particularly, words ated from the model X = W\H>\ , where W\ and H\ are the ground-truth factors. Suppose that W\ and H\ satisfy a related to ‘White House Scandal’ are spread across several certain condition. Let (W?, H?) be an optimal solution to topics. This illustrates the pivotal role of identifiability in an identification criterion or an output from a procedure. If, data mining problems. under the aforementioned condition of W\ and H\’s, it holds that We would like to remark that in linear algebra, identifiability W = W ΠD, H = H ΠD 1, ? \ ? \ − (9) of factorization models (e.g., tensor factorization) is usually where Π is a and D is a full-rank understood as a property of a given model that is independent diagonal matrix, then we say that the matrix factorization of identification criteria. However, here we examine identifia- model is identifiable under that condition1. bility from a signal processing point of view, i.e., associated with a given criterion. This may seem unusual, but, as it turns 1 Ideally, it would be much more appealing if the conditions that enable out, the choice of estimation/identification criterion is crucial NMF identifiability are defined on X, rather than on W\ and H\, since then for NMF. In other words, NMF can be identifiable under a the conditions could be easier for checking. However, this is in general very hard, since the latent factors play essential roles for establishing identifiability, suitable identification criterion, but not under another, even as we will see. under the same generative model—as will be seen shortly. 6
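A short numerical illustration of the ambiguity in (8) (a toy example of our own): any invertible Q produces an alternative factor pair that fits X exactly, but generally destroys nonnegativity.

import numpy as np

rng = np.random.default_rng(0)
M, N, R = 8, 12, 3
W = rng.random((M, R))
H = rng.random((N, R))
X = W @ H.T

Q = rng.standard_normal((R, R))          # a generic Q is invertible with probability one
W2 = W @ Q
H2 = H @ np.linalg.inv(Q).T
print(np.allclose(X, W2 @ H2.T))         # True: both pairs fit X perfectly, cf. (8)
print((W2 < 0).any() or (H2 < 0).any())  # typically True: the alternative pair violates nonnegativity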

TABLE I
Mined topics by AnchorFree and an anchor-word-based competitor from real-world documents. Table reproduced (permission will be sought) from [8], [44], © IEEE.

LDA [33] | AnchorFree [8]
lewinsky spkr school clinton jordan | lewinsky gm shuttle bulls jonesboro
starr news gm president time | monica motors space jazz arkansas
president people workers house game | starr plants columbia nba school
white monica arkansas white bulls | grand flint astronauts chicago shooting
house story people clintons night | white workers nasa game boys
ms voice children public team | jury michigan crew utah teacher
grand reporter students allegations left | house auto experiments finals students
lawyers washington strike presidents i'm | clinton plant rats jordan westside
jury dont jonesboro political chicago | counsel strikes mission malone middle
jones media flint bill jazz | intern gms nervous michael 11year
independent hes boys affair malone | independent strike brain series fire
investigation thats day intern help | president union aboard championship girls
counsel time united american sunday | investigation idled system karl mitchell
lawyer believe union bennett home | affair assembly weightlessness pippen shootings
relationship theres shooting personal friends | lewinskys production earth basketball suspects
office talk motors accusations little | relationship north mice win funerals
sexual york plant presidential michael | sexual shut animals night children
clinton sex middle woman series | ken talks fish sixth killed
former press plants truth minutes | former autoworkers neurological games 13year
monica peterjennings' outside former lead | starrs walkouts seven title johnson

B. Preliminaries

1) Geometric Interpretations: NMF has an intuitively pleasing underlying geometry. The model X = WH^⊤ can be viewed from different geometric perspectives. In this article, we focus on the geometry of the columns of X; the geometry of the rows of X follows similarly, by transposition. It is readily seen that

x_ℓ ∈ cone{W},  ℓ = 1, ..., N,

where cone{W} = {x ∈ R^M | x = Wθ, θ ≥ 0} denotes the conic hull of the columns of W. This geometry holds because of the nonnegativity of H. There is another very useful—and powerful—geometric interpretation of the NMF model, which requires an additional assumption. Suppose that H, in addition to being nonnegative, also satisfies a 'row sum-to-one' assumption, i.e.,

H1 = 1,  H ≥ 0.   (11)

This assumption is called the row-stochastic assumption, since every row of H is constrained to lie in the probability simplex. The assumption naturally holds in many applications such as hyperspectral unmixing [26] and image embedding [6], where H(ℓ, r) means the proportion of w_r in x_ℓ. In applications where (11) does not arise from the physical model, it can be 'enforced' via pre-processing, i.e., normalizing the columns of X using the ℓ_1-norms of the corresponding columns [13], [37], if W ≥ 0 holds and noise is absent; see the next insert.

Column Normalization to Enforce Row-Stochasticity [37].  Consider X = WH^⊤ where W ≥ 0 and H ≥ 0. We wish to have a model where the right factor is row-stochastic. To this end, consider

X̄(:, ℓ) = X(:, ℓ) / ‖X(:, ℓ)‖_1.

The above gives an NMF model as follows:

X̄(:, ℓ) = Σ_{r=1}^R [W(:, r)/‖W(:, r)‖_1] · [‖W(:, r)‖_1 H(ℓ, r) / ‖Σ_{r'=1}^R W(:, r')H(ℓ, r')‖_1] = Σ_{r=1}^R W̄(:, r)H̄(ℓ, r) = W̄H̄(ℓ, :)^⊤,

where W̄(:, r) = W(:, r)/‖W(:, r)‖_1 and H̄(ℓ, r) = ‖W(:, r)‖_1 H(ℓ, r)/‖X(:, ℓ)‖_1. Note that identifying W̄ recovers W up to a column scaling ambiguity (which is intrinsic anyway), and thus the above normalized model is useful in estimating the original latent factors. To show that H̄(ℓ, :)1 = 1, note that the following holds when W ≥ 0:

‖Σ_{r=1}^R W(:, r)H(ℓ, r)‖_1 = Σ_{r=1}^R H(ℓ, r)‖W(:, r)‖_1.

The above does not hold if W has negative elements.

When H is row-stochastic, we have

x_ℓ ∈ conv{W},  ℓ = 1, ..., N,

where conv{W} = {x ∈ R^M | x = Wθ, θ ≥ 0, θ^⊤1 = 1} denotes the convex hull of the columns of W. When rank(W) = R, it can be shown that the columns of W are also vertices of this convex hull (note that the converse is not necessarily true). The convex hull is also called a simplex if the columns of W are affinely independent, i.e., if {w_r − w_R}_{r=1,...,R−1} is a linearly independent set. The two geometric interpretations of NMF are shown in Fig. 6.

Fig. 6. Two versions of the NMF geometry when R = 3. Left: the original version. Right: the normalized version where H is row-stochastic. The dots are x_ℓ's.

From the geometries, it is also obvious that NMF is equivalent to recovering a data-enclosing conic hull or simplex—which can be very ill-posed, as many solutions may exist. To be more precise, let us take the latter case as an example: Assume W ≥ 0. Then x_ℓ ∈ R^M_+ holds for all ℓ. Any W̃ = [w̃_1, ..., w̃_R] that satisfies conv{X} ⊆ conv{W̃} ⊆ R^M_+ also satisfies the data model X = W̃H̃^⊤ = WH^⊤ for some H̃, where both H̃ and H satisfy (11), W̃ ≠ W, and H̃ ≠ H—see Fig. 7.

2) Key Characterizations: Let us take another look at Fig. 7. Assume that W ≥ 0 and H satisfies the row-stochasticity condition in (11). Intuitively, if the data points x_ℓ are 'well spread' in R^M_+ such that conv{X} ≈ R^M_+, then it would be hard to find a W̃ such that conv{X} ⊆ conv{W̃} ⊆ R^M_+ holds—and a unique factorization model is then possibly guaranteed. This suggests that, to understand the identifiability of different NMF approaches, it is important to characterize the 'distribution' of vectors within the nonnegative orthant.
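A short NumPy check of the normalization trick in the insert above (a toy example of our own): after ℓ_1-normalizing the columns of X, the induced right factor is row-stochastic.

import numpy as np

rng = np.random.default_rng(0)
M, N, R = 6, 20, 3
W = rng.random((M, R))
H = rng.dirichlet(np.ones(R), size=N) * rng.uniform(0.5, 2.0, size=(N, 1))  # rows do not sum to one
X = W @ H.T

colX = X.sum(axis=0)                  # l1 norms of the columns of X (X >= 0)
colW = W.sum(axis=0)                  # l1 norms of the columns of W
Xbar = X / colX
Wbar = W / colW
Hbar = H * colW / colX[:, None]
print(np.allclose(Xbar, Wbar @ Hbar.T))    # True: the normalized model holds
print(np.allclose(Hbar.sum(axis=1), 1.0))  # True: Hbar is row-stochastic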

Fig. 7. Illustration of the ill-posedness of NMF. The dashed lines represent conv{W̃}'s that enclose all the x_ℓ's. That is, there are many solutions that satisfy the data model X = W̃H̃^⊤ with H̃ ≥ 0 and H̃1 = 1.

Fig. 8. Illustration of the separability (top) and sufficiently scattered (bottom) conditions, where M = R = 3. The left column is illustrated in the H-domain; the right-hand side shows the corresponding views in the data domain. The left and right columns are visualized by assuming that the viewer is facing cone{e_1, e_2, e_3} (towards the origin) and cone{W}, respectively.

Fig. 9. The inner circle corresponds to C, the shadowed polytope corresponds to cone{H^⊤}, the outer circle corresponds to {x ∈ R^R | 1^⊤x ≥ ‖x‖_2}, and the red dots correspond to H(ℓ, :)'s. (a) H satisfies the sufficiently scattered condition; (b) H does not satisfy the sufficiently scattered condition; (c) C ⊆ cone{H^⊤} but the regularity condition is violated; (d) H satisfies the separability condition. Figure reproduced (permission will be sought) from [17], © IEEE.

In the literature, such characterizations are usually described in terms of the geometric properties of H (or W, by role symmetry). The scattering of the H(ℓ, :)'s is directly translated to that of the x_ℓ's since they are linked by a full column rank matrix W. The following is arguably the most frequently used characterization:

Definition 2 (Separability)  A nonnegative matrix H ∈ R^{N×R} is said to satisfy the separability condition if

cone{H^⊤} = cone{e_1, ..., e_R} = R^R_+.   (12)

Note that in Definition 2, the way we state separability is somewhat different from that in the literature [9], [11], [12], [37], [39]; our intention with using Definition 2 is to provide geometric interpretation.² In the literature, H is said to satisfy the separability condition if for every r = 1, ..., R, there exists a row index ℓ_r such that

H(ℓ_r, :) = α_r e_r^⊤,

where α_r > 0 is a scalar and e_r is the rth coordinate vector in R^R. When (11) is satisfied, α_r = 1 for r = 1, ..., R—which is exactly what we summarized in Definition 2.

²Another remark is that, in the literature, oftentimes separability is defined for X; i.e., when H satisfies Definition 2, it is said that X is separable. In this tutorial, we do not use this notion of separability. Instead, we unify the geometric characterizations of the NMF model using the structure of the latent factors as in Definitions 2-3.

The separability condition has been used to come up with effective NMF algorithms in many fields—but it is considered a relatively restrictive condition. Essentially, it means that the extreme rays of the nonnegative orthant (i.e., α_r e_r for α_r > 0) are 'touched' by some rows of H and thus the conic hull of H^⊤ equals the entire nonnegative orthant; see Fig. 8 (top). Correspondingly, in the data domain (the right column of Fig. 8), one can see that there exists x_{ℓ_r} = α_r w_r for each r = 1, ..., R, i.e., the x_ℓ's are actually 'touching the corners' of cone{W}—we have cone{X} = cone{W} under separability. Another condition that can effectively model the spread of the nonnegative vectors but is much more relaxed [9], [15], [17] (see also [16], [45]) is as follows:

Definition 3 (Sufficiently Scattered)  A nonnegative matrix H ∈ R^{N×R} is said to be sufficiently scattered if the following two conditions are satisfied:
1) the rows of H are spread enough so that

C ⊆ cone{H^⊤},   (13)

where C is a second-order cone defined as C = {x ∈ R^R | x^⊤1 ≥ √(R−1) ‖x‖_2};
2) cone{H^⊤} ⊆ cone{Q} does not hold for any orthonormal Q except the permutation matrices.

To understand the sufficiently scattered condition, one key is to understand the second-order cone C. This second-order cone is tangent to every facet of the nonnegative orthant (see Figs. 9–10). Hence, if H satisfies the sufficiently scattered condition, it means that the rows of H are spread enough in the nonnegative orthant (at least every facet of the nonnegative orthant has to be 'touched' by some rows of H; see Fig. 9 (a)). The second condition has a couple of equivalent forms [9], [17] and seems a bit technical at first glance. Geometrically, it means that cone{H^⊤} needs to be slightly 'larger' than C, not just tangentially containing it; see [14], [17] and the illustration in Fig. 9 (c) for a case that violates this regularity condition. However, the rows of H need not be so 'extremely scattered' that their conic hull 'covers' the entire nonnegative orthant—they only need to be 'well scattered' enough to contain the second-order cone C, which is much more relaxed than the separability condition. Fig. 9 gives more illustrations of different scenarios for cone{H^⊤}. A key difference in the ambient data geometries resulting from Definitions 2-3 is that under the sufficiently scattered condition, there need not be data points touching the w_r's; see Fig. 8 (bottom right).

Before we move to the next section, one important remark is that we will use W♮ and H♮ in the sequel to denote the true generating factors that give rise to the data X = W♮H♮^⊤, to distinguish them from the optimization variables W and H in the NMF criteria. We will assume the ground-truth generating factors W♮ and H♮ to be full rank throughout this article, since most NMF methods work under this condition (although there exist interesting exceptions; see [47]).
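Definition 2 is easy to test numerically: it suffices to look for one (scaled) unit row per coordinate. The sketch below is our own helper (function name and tolerance are arbitrary choices); Definition 3, by contrast, involves a cone-containment condition and admits no such simple row-by-row check.

import numpy as np

def satisfies_separability(H, tol=1e-8):
    """Check Definition 2: for every r there is a row of H proportional to e_r."""
    N, R = H.shape
    hit = np.zeros(R, dtype=bool)
    for ell in range(N):
        row = H[ell]
        r = np.argmax(row)
        # the row is (numerically) a positive multiple of the r-th unit vector
        if row[r] > tol and np.all(np.delete(row, r) <= tol):
            hit[r] = True
    return bool(hit.all())

rng = np.random.default_rng(0)
H = rng.dirichlet(np.ones(3), size=30)
print(satisfies_separability(H))   # False with overwhelming probability
H[:3] = np.eye(3)                  # plant the 'anchor' rows
print(satisfies_separability(H))   # True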

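Definition 2 is easy to test numerically once H is given: one only needs to check whether, for every r, some row of H is a positive scaling of e_r. Below is a minimal Python/NumPy sketch under that reading; the helper name, tolerance, and toy data are illustrative assumptions. Checking Definition 3 is much harder and is revisited in the insert of Sec. V.

import numpy as np

def is_separable(H, tol=1e-9):
    # Definition 2: for every r = 1,...,R there is a row of H that is a
    # positive scaling of the r-th coordinate vector e_r.
    H = np.asarray(H, dtype=float)
    N, R = H.shape
    hit = np.zeros(R, dtype=bool)
    for ell in range(N):
        support = np.flatnonzero(H[ell] > tol)
        if support.size == 1:          # row ell equals alpha_r * e_r, alpha_r > 0
            hit[support[0]] = True
    return bool(hit.all())

# Toy example: a row-stochastic H that satisfies separability.
rng = np.random.default_rng(0)
H = rng.random((50, 3))
H[:3] = np.eye(3)                      # plant the 'pure' rows e_1, e_2, e_3
H /= H.sum(axis=1, keepdims=True)      # make the rows sum to one, cf. (11)
print(is_separable(H))                 # True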
To understand the sufficiently scattered condition, one key is to understand the second-order cone C. This second-order cone is special and of interest since it is tangent to every facet of the nonnegative orthant (see Figs. 9-10). Hence, if H satisfies the sufficiently scattered condition, the rows of H are spread enough in the nonnegative orthant: at least every facet of the nonnegative orthant has to be 'touched' by some rows of H; see Fig. 9 (a). The second condition has a couple of equivalent forms [9], [17] and seems a bit technical at first glance. Geometrically, it means that cone{H^T} needs to be slightly 'larger' than C, not just tangentially containing it; see [14], [17] and the illustration in Fig. 9 (c) for a case that violates this regularity condition. On the other hand, the rows of H need not be so 'extremely scattered' that their conic hull 'covers' the entire nonnegative orthant; they only need to be 'well scattered' enough to contain the second-order cone C, which is much more relaxed than the separability condition. Fig. 9 gives more illustrations of different scenarios for cone{H^T}. A key difference between the ambient data geometries resulting from Definitions 2-3 is that, under the sufficiently scattered condition, there need not be any data points touching the w_r's; see Fig. 8 (bottom right).

Fig. 10. A closer look at the second-order cone C for R = 3 from another angle. A second-order cone [46] is sometimes called the ice cream cone due to its shape in the three-dimensional space.

Before we move to the next section, one important remark is that we will use W# and H# in the sequel to denote the true generating factors that give rise to the data X = W#H#^T, to distinguish them from the optimization variables W and H in the NMF criteria. We will assume the ground-truth generating factors W# and H# to be full-rank throughout this article, since most NMF methods work under this condition (although there exist interesting exceptions; see [47]).

IV. SEPARABILITY-BASED METHODS FOR NMF

In this section, we will introduce some representative approaches that explicitly make use of the separability condition (cf. Definition 2) and have identifiability guarantees. From a terminology point of view, the separability condition was first coined by Donoho et al. in 2004 [11]. However, this condition was used for NMF in the context of hyperspectral unmixing well before 2004. In HU, researchers noticed that remotely sensed hyperspectral images often have many pixels that contain only a single material; i.e., if noise is absent, there exist indices l_r for r = 1, ..., R such that x_{l_r} = W#(:, r) (or, equivalently, H#(l_r, r) = 1 and H#(l_r, j) = 0 for j != r). The existence of such 'pure pixels' has been exploited to come up with quite effective hyperspectral unmixing algorithms [26], [28], [34], [42], [43], [48].

To understand the reason why separability helps, we assume that H#1 = 1 and H# >= 0, i.e., H# is row-stochastic. Under separability, there is an index set Lambda = {l_1, l_2, ..., l_R} that collects the indices satisfying H#(l_r, :) = e_r^T for r = 1, ..., R. As we have seen in Fig. 8 (top), the vertices of the convex hull are then 'touched' by some data samples (i.e., columns of X). In other words, we have X(:, Lambda) = W# in the noiseless case. Instead of finding a simplex that encloses all the data columns, the problem now boils down to finding data columns that span a convex hull (or conic hull) enclosing all the other data columns. If Lambda can be successfully identified, one can simply retrieve the latent factors via W = X(:, Lambda) and H^T = W†X (or H = arg min_{H >= 0} ||X - WH^T||_F^2).

A. Convex Formulations

Basic Insight. Under separability, the critical step towards identifying W# is to design an efficient algorithm to 'pick out' the vertices of the convex hull of the data (or its extreme rays, if H#1 = 1 is not assumed) from x_1, ..., x_N. It turns out that this can be done via a variety of convex formulations. The idea is based on a simple fact: the vertices (resp. extreme rays) of a convex hull (resp. conic hull) cannot be represented by any other vectors that reside inside the convex hull (resp. conic hull) using convex (resp. nonnegative) combinations. For example, if noise is absent, all x_l's are distinct, Eq. (11) (row-stochasticity of H#) is satisfied, and l belongs to Lambda, then we always have

min_{theta >= 0, 1^T theta = 1} ||x_l - X_{-l} theta||_q > 0,   (14)

where X_{-l} = [x_1, ..., x_{l-1}, x_{l+1}, ..., x_N] (i.e., X with the column x_l removed), q >= 1, and ||.||_q denotes the (convex) q-norm. When l is not in Lambda, we always have min_{theta >= 0, 1^T theta = 1} ||x_l - X_{-l} theta||_q = 0, since x_l lies in conv{X(:, Lambda)}.

A relatively naive implementation of separability-based NMF is to repeatedly solve Problem (14) for each l in {1, ..., N} and compare the objective values, which is the basic idea in [31], [49]. It can be shown that, even if there is noise, (14) can be used to estimate W#, and there are guarantees on the noise robustness of the estimate [31]. These are favorable properties. However, this approach needs to solve N convex programs, each with N - 1 decision variables, which can be quite inefficient when N is large.

Self-dictionary Convex Formulations. To avoid solving a large number of convex programs, a class of convex optimization-based NMF methods was proposed [30], [39], [50]-[52]. The main idea of these methods is the same as that in (14), i.e., exploiting the self-expressiveness of X by X(:, Lambda) under the separability assumption. By assuming row-stochasticity of H#, the formulations in this line of work can be cast in the framework of block-sparse optimization:

minimize_C  ||C||_{row-0}   (15a)
subject to  X = XC   (15b)
            C >= 0, 1^T C = 1^T,   (15c)

where ||C||_{row-0} counts the number of nonzero rows of C. To understand the formulation, consider, without loss of generality, a case where X = [W#, X'], i.e., Lambda = {1, ..., R}. Then, one can see that

C* = [H#^T; 0]

is an optimal solution. Intuitively, the reasons are that 1) XC* = W#H#^T and 2) none of the nonzero rows of C* can be removed, since X(:, Lambda) can only be represented by X(:, Lambda)H#(Lambda, :)^T. One can easily identify Lambda by inspecting the indices of the nonzero rows of C*.

The formulation in (15) is nonconvex due to the combinatorial nature of the 'row zero-norm'. Nevertheless, it can be handled by relaxing ||.||_{row-0} to a convex function, e.g., ||Y||_{q,1} = sum_{m=1}^{M} ||Y(m,:)||_q, where Y is M x N and q >= 1 [19]. In addition, it was proven in [51] that when there is noise, i.e., x_l = W#H#(l,:)^T + v_l, modified versions such as

minimize_{C >= 0, 1^T C = 1^T}  ||C||_{inf,1}   (16a)
subject to  ||x_l - X c_l||_q <= eps,  for all l,   (16b)

where q = 1 [30], [53] or q = 2 [39], [51], c_l is the lth column of C, and eps is a pre-specified parameter, can guarantee correct identification of Lambda, provided that the noise is not too significant and some additional assumptions are satisfied (e.g., there are no repeated columns in the noiseless data WH^T).

As we have seen, the self-dictionary sparse optimization methods are appealing in terms of vertex identifiability and noise robustness. They can also facilitate the derivation of distributed algorithms, if the X(m,:)'s are collected by different agents [19]. The downside, however, is complexity: all these self-dictionary formulations lift the number of decision variables to N^2, which is not affordable when N is large. In [30], a linear program-based algorithm with judiciously designed updates was employed to handle a self-dictionary criterion, and in [52] a fast first-order method was proposed. However, the orders of the computational and memory complexities remain high. One workaround is to use simple methods, such as the greedy methods introduced in the next subsection, to identify a small subset of the x_l's that are candidates for the vertices, and then apply the self-dictionary methods to refine the results [54].

There are other convex formulations under separability. For example, a method has recently been proposed to identify the vertices of conv{W#} by finding a minimum-volume ellipsoid that is centered at the origin and encloses the data columns x_1, ..., x_N and their 'mirrors' -x_1, ..., -x_N [55], [56]. This method has a small number of primal optimization variables (i.e., R^2 primal variables) and thus could potentially lead to more economical algorithms relative to the self-dictionary formulations. The downside is that the method requires accurate knowledge of R, which the self-dictionary methods do not need; this is an additional advantage of the self-dictionary methods [39]. See more details of ellipsoid volume minimization-based NMF in the supplementary materials.
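Before moving on to the greedy methods, the following is a schematic rendering of a program in the spirit of (16) (with q = 2) using the CVXPY modeling package. The function name, the tolerance eps, and the final ranking of row energies are illustrative assumptions and are not taken from [30], [39], [51]; the sketch is only meant to show how the row-sparsity surrogate and the self-dictionary constraints fit together.

import numpy as np
import cvxpy as cp

def self_dictionary_vertices(X, eps=1e-2, R=None):
    # Row-sparse self-dictionary program in the spirit of (16), q = 2:
    # minimize the sum of row infinity-norms of C, subject to C >= 0,
    # columns of C summing to one, and ||x_l - X c_l||_2 <= eps for all l.
    M, N = X.shape
    C = cp.Variable((N, N), nonneg=True)
    row_norms = [cp.norm(C[i, :], 'inf') for i in range(N)]
    constraints = [cp.sum(C, axis=0) == np.ones(N)]
    constraints += [cp.norm(X[:, l] - X @ C[:, l], 2) <= eps for l in range(N)]
    cp.Problem(cp.Minimize(cp.sum(cp.hstack(row_norms))), constraints).solve()
    energy = np.linalg.norm(C.value, axis=1)   # rows with large energy ~ Lambda
    return np.argsort(-energy)[: (R if R is not None else N)]

The program has N^2 variables, so this direct form is only practical for small N, consistent with the complexity caveat above.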
B. Greedy Algorithms

Another very important class of methods that can provably identify Lambda is greedy algorithms, i.e., algorithms that identify l_1, ..., l_R one by one. Here we describe a particularly simple algorithm in this context. Assuming that X = W#H#^T and that H# satisfies the row-stochasticity condition in (11), the algorithm discovers the first vertex of the target convex hull by

l_hat_1 = arg max_{l in {1, ..., N}} ||x_l||_q,   (17)

where 1 < q <= infinity. The proof is simple: given that H# satisfies (11), we have

||x_l||_q = ||W# H#(l,:)^T||_q <= sum_{r=1}^{R} H#(l, r) ||W#(:, r)||_q <= max_{r in {1, ..., R}} ||W#(:, r)||_q,

where the maximum is attained if and only if H#(l, :) is a unit vector. Note that the first inequality is the triangle inequality and the second uses the fact that H# satisfies (11). After finding the first column index that belongs to Lambda, the algorithm projects all the data vectors x_l onto the orthogonal complement of X(:, l_hat_1), i.e., forming

x_l := P^perp_{X(:, l_hat_1)} x_l,  for l = 1, ..., N.   (18)

If one then repeats (17), the already found vertex will not appear again and the next vertex will be identified. Repeating (17)-(18) R times identifies all the vertices.

The described algorithm has many names, e.g., successive simplex volume maximization (SVMAX) [42] and self-dictionary simultaneous orthogonal matching pursuit (SD-SOMP) [39], since it was re-discovered from different perspectives (see the supplementary materials for a more detailed discussion). The algorithm also has many closely related variants, such as the vertex component analysis (VCA) algorithm [43] and FastAnchor [7], which share the same spirit. The most commonly used name for this algorithm in signal processing and machine learning is the successive projection algorithm (SPA), which first appeared in [48]. Bearing a structure similar to the Gram-Schmidt procedure, SPA has very lightweight updates: both the q-norm comparison and the orthogonal projection can be carried out easily. It was shown by Gillis and Vavasis in [37] that the algorithm is robust to noise under most convex norms (except q = 1), and that using q = 2 exhibits the best noise robustness, which is consistent with what has been most frequently used in many fields [7], [17], [42], [48]. The downside of SPA is that it is prone to error propagation, as with all other greedy algorithms.

A remark is that most separability-based methods work without assuming W# to be nonnegative, as long as H# is row-stochastic. This allows us to apply these algorithms to problems like community detection under the model in (7). However, when H# does not satisfy (11), row-stochasticity has to be enforced through column normalization of X, which does not work without the assumption W# >= 0 [37]; also see the previous insert on column normalization.
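For concreteness, here is a minimal NumPy sketch of the SPA iterations (17)-(18) with q = 2, run on synthetic separable data; the function name and the data-generation setup are illustrative assumptions rather than code from the cited references.

import numpy as np

def spa(X, R, q=2):
    # Successive projection algorithm, following (17)-(18): greedily pick the
    # column with the largest q-norm, then project all columns onto the
    # orthogonal complement of the columns picked so far.
    Y = np.array(X, dtype=float)        # working copy that gets projected
    Lambda = []
    for _ in range(R):
        ell = int(np.argmax(np.linalg.norm(Y, ord=q, axis=0)))   # step (17)
        Lambda.append(ell)
        u = Y[:, ell] / np.linalg.norm(Y[:, ell])
        Y = Y - np.outer(u, u @ Y)      # step (18): orthogonal projection
    return Lambda

# Usage on synthetic separable data: X = W H^T with row-stochastic H.
rng = np.random.default_rng(1)
M, N, R = 20, 200, 4
W = rng.random((M, R))
H = rng.dirichlet(np.ones(R), size=N)
H[:R] = np.eye(R)                       # pure pixels / anchor rows
X = W @ H.T
print(sorted(spa(X, R)))                # recovers indices 0,...,R-1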

V. SEPARABILITY-FREE METHODS FOR NMF

Separability-based NMF approaches are usually associated with elegant formulations and tractable algorithms. They are also well understood: their performance under noisy scenarios has been studied and characterized. On the other hand, the success of these algorithms hinges on the separability condition, which makes them 'fragile' in terms of model robustness. Separability usually has plausible physical interpretations in applications. As discussed before, in hyperspectral imaging, separability means that there are spectral pixels that only contain a single material, i.e., pure pixels. In text mining, separability means that every topic has some characteristic words which other topics do not use. These are legitimate assumptions to some extent. However, there is also a (sometimes considerable) risk that separability does not hold in practice. If separability is (grossly) violated, can we still guarantee identifiability of W# and H#? We will review a series of important recent developments that address this question.

A. Plain NMF Revisited

The arguably most popular NMF criterion is the fitting-based formulation defined in (10), i.e.,

minimize_{W >= 0, H >= 0}  ||X - WH^T||_F^2.

For convenience, we will call (10) the plain NMF criterion in the sequel. To study the identifiability of the criterion in (10), it suffices to study the identifiability of the following feasibility problem:

find  W, H   (19a)
subject to  X = WH^T   (19b)
            W >= 0, H >= 0,   (19c)

since (10) and (19) share the same optimal solutions if noise is absent. In [11], Donoho et al. derived the first result on the identifiability of (10) and (19). In 2008, Laurberg et al. [12] came up with a rather similar sufficient condition. In a nutshell, the sufficient conditions in [11] and [12] can be summarized as follows: 1) the H# matrix satisfies the separability condition; and 2) there is a certain zero pattern in W#. For example, the condition in [11] requires that every column of W# has a zero element, and that the zero elements from different columns appear in different rows. Geometrically, this assumes that there are W#(m,:)'s 'touching' every facet of R_+^R; see a detailed summary in [9]. Huang et al. [9] derived another interesting sufficient condition for NMF identifiability in 2014:

If both W# and H# are sufficiently scattered (cf. Definition 3), then any optimal solution (W*, H*) to (10) (and to the feasibility problem in (19)) must satisfy W* = W# Pi D and H* = H# Pi D^{-1}, where Pi is a permutation matrix and D is a full-rank diagonal matrix.

The insights behind the three sufficient conditions in [9], [11], [12] can be understood in a unified way by looking at Problem (19). Specifically, let us assume that there exists a Q such that X = W#Q (H#Q^{-T})^T holds. Intuitively, Q^{-T} is an operator that can rotate, enlarge, or shrink the red color-shaded region in Fig. 11 (left); note that this shaded region represents conv{H#^T} here. If H# is separable or sufficiently scattered and H#Q^{-T} is constrained to be nonnegative, one can see that Q^{-T} cannot rotate or enlarge the area, since this would result in negative elements of H#Q^{-T}. In fact, it cannot shrink the area either, since then Q would enlarge a corresponding area determined by cone{W#^T} (cf. Fig. 11 (right)). If W# is sufficiently scattered (Huang et al.'s condition) or if there are rows of W# touching the boundary of the nonnegative orthant (Donoho et al.'s and Laurberg et al.'s conditions), then W#Q will be infeasible. Hence, Q can only be a permutation and column-scaling matrix.

Fig. 11. Intuition behind the NMF identifiability conditions in [9], [11], [12].

Although Huang et al.'s identifiability condition does not, strictly speaking, subsume the conditions in [11] and [12], it presents a significant departure from the separability condition upon which [11] and [12] rely. The region covered by C shrinks rapidly (at a geometric rate) as R grows, and thus the sufficiently scattered condition can be expected to be satisfied in many cases; see [21] and the next insert. The sufficiently scattered condition per se is also a very important discovery: it has not only helped establish identifiability of the plain NMF criterion, but has also been generalized and connected to many other important NMF criteria, which will be introduced next.

Is the sufficiently scattered condition easy to satisfy?
One possibly annoying fact is that the sufficiently scattered condition is NP-hard to check [9]. One way to empirically check the condition is as follows. Consider the optimization problem

maximize_x  ||x||_2^2   (20)
subject to  Hx >= 0, x^T 1 = 1.

It can be shown that H is sufficiently scattered if and only if the maximum of the above is attained only at x in {e_1, ..., e_R}: the constraint set represents the dual cone of cone{H^T} restricted to the hyperplane x^T 1 = 1 (the dual cone of a cone K is defined as K* = {y | y^T x >= 0 for all x in K}), and if H is sufficiently scattered this set is contained in the unit ball ||x||_2 <= 1 and touches its boundary only at the canonical vectors [9]. Problem (20) is a nonconvex problem, but it can be approximated by successive linearization of the objective.

We have conducted an experiment using randomly generated nonnegative H with N = 500 under various ranks and density levels. The elements of H are drawn from the i.i.d. exponential distribution (with parameter 1), and then a certain number of entries, chosen uniformly at random across the matrix, are set to zero according to a pre-specified density level. Fig. 12 shows the probabilities that the generated H is not sufficiently scattered (in the grayscale bar, black means that none of the generated H's satisfy the sufficiently scattered condition, and white means the opposite). One can see that for every fixed nnz(H) (recall that nnz(H) counts the nonzero elements of H), there is a very sharp transition, and R is directly related to where the transition happens. In fact, as empirically observed in [9], for such random H, if every column has R - 1 zero elements, then the sufficiently scattered condition is satisfied with very high probability, which is a fairly mild condition.

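A minimal SciPy sketch of the empirical check just described is given below: the objective of (20) is successively linearized, and each linearized step is a linear program over {x : Hx >= 0, 1^T x = 1}. The helper name, the multi-start strategy, and the tolerances are illustrative assumptions; since the procedure only finds local maxima, a return value of True means 'no violation found', not a proof.

import numpy as np
from scipy.optimize import linprog

def appears_sufficiently_scattered(H, n_starts=50, n_iter=30, tol=1e-6, seed=0):
    # Successive linearization of (20): repeatedly maximize x_t^T x over
    # {x : H x >= 0, 1^T x = 1}, which is a linear program.
    H = np.asarray(H, dtype=float)
    N, R = H.shape
    rng = np.random.default_rng(seed)
    A_ub, b_ub = -H, np.zeros(N)                 # H x >= 0  is  -H x <= 0
    A_eq, b_eq = np.ones((1, R)), np.array([1.0])
    for _ in range(n_starts):
        x = rng.dirichlet(np.ones(R))            # feasible start since H >= 0
        for _ in range(n_iter):
            res = linprog(-x, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                          bounds=[(None, None)] * R, method="highs")
            if res.status == 3:                  # unbounded: condition violated
                return False
            if not res.success:
                break
            x = res.x
        val = float(x @ x)
        if val > 1.0 + tol:
            return False                         # feasible point outside the unit ball
        if np.isclose(val, 1.0, atol=1e-4) and np.max(x) < 1.0 - 1e-4:
            return False                         # norm-one point that is not canonical
    return True                                  # no violation found (heuristic)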

Fig. 12. Transition plot for H in R^{500 x R}; in the grayscale bar, "1" means that H does not satisfy the sufficiently scattered condition; the zero elements appear uniformly at random at different locations of the matrix.

Fig. 13. Intuition of simplex volume minimization-based NMF. Shrinking a data-enclosing simplex to have minimum volume recovers the ground truth conv{W#} if the data are sufficiently spread in conv{W#}.

B. Simplex Volume Minimization (VolMin)

An important issue noticed in the NMF literature is that a necessary condition for plain NMF identifiability is that both W# and H# contain some zero elements [9]. In some applications, this seemingly mild condition can be rather damaging for model identification. For example, in hyperspectral unmixing, the columns of W# model the spectral signatures of different materials; many spectral signatures do not contain any zero elements, and dense W#'s frequently arise. In such cases, can we still guarantee identifiability of W# and H#? The answer is affirmative, and it turns out that the main idea behind it was proposed almost 30 years ago [4].

To handle the matrix factorization problem without assuming that W# contains any zeros, let us again assume row-stochasticity of H#, i.e., H#1 = 1, H# >= 0. This way, x_l lies in conv{W#} for l = 1, ..., N, as mentioned before. In 1994, Craig proposed a geometrically intriguing way to recover W#(:, 1), ..., W#(:, R) [4]. The so-called Craig's belief is as follows: if the x_l's are sufficiently spread in conv{W#}, then finding the minimum-volume enclosing simplex of {x_1, ..., x_N} identifies W# (cf. Fig. 13). Craig's paper provided an elegant geometric viewpoint for tackling NMF, and one can see that there are very few assumptions on W# in Craig's belief. Craig did not provide a mathematical programming form for the stated problem, i.e., finding the minimum-volume data-enclosing simplex. In 2015, Fu et al. [17] formulated this volume-minimization (VolMin) problem as follows:

minimize_{W, H}  det(W^T W)   (21a)
subject to  X = WH^T   (21b)
            H >= 0, 1^T H^T = 1^T,   (21c)

where det(W^T W) is a surrogate of the volume of conv{W} and the constraints mean that x_l lies in conv{W} for every l in {1, ..., N} (also see other similar formulations in [57]-[60]).

Craig's belief was a conjecture, supported by extensive empirical evidence over the years, but it lacked a theoretical justification. In 2015, two parallel works [16], [17] pushed forward the understanding of VolMin identifiability substantially. Fu et al. showed in [17] the following:

If rank(W#) = rank(H#) = R and H# satisfies the sufficiently scattered condition (Definition 3) and (11), then any optimal solution of Problem (21) is W* = W# Pi, H* = H# Pi^T, where Pi is a permutation matrix.

In retrospect, it is not surprising that the identifiability of VolMin relates to the scattering of the rows of H#: the scattering of H# translates to the spread of x_1, ..., x_N in the convex hull spanned by W#(:, 1), ..., W#(:, R), and this is what enables the success of volume minimization. Remarkably, VolMin sets W# free: it merely requires rank(W#) = R, while allowing W# to have negative elements and/or be completely dense. As we have discussed, this can potentially benefit many applications for which plain NMF cannot offer identifiability. The price to pay is that the optimization problem can be even more cumbersome to handle than plain NMF, due to the determinant term used for measuring volume.

C. Towards More Relaxed Conditions

Simplex volume minimization has significant advantages over plain NMF and separability-based NMF: as we have seen, its known sufficient identifiability condition is much more relaxed than those of plain NMF and separability-based NMF. But there is a caveat: the identifiability is established under the assumption that H# is row-stochastic. In some applications this is not very restrictive. For example, in hyperspectral imaging, H#(l, :) corresponds to the so-called abundance vector whose elements represent the proportions of the materials contained in pixel l, and it is thus naturally nonnegative and sum-to-one. On the other hand, this assumption is not without loss of generality: one cannot assume H#1 = 1 for all factorization models X = W#H#^T. As mentioned, if W# and H# are nonnegative, the sum-to-one condition on the rows of H# can be enforced by data column normalization [37]. However, when W# contains negative elements, normalization does not work (cf. the insert in Sec. III).

Very recently, Fu et al. [15] proposed a simple fix to the above issue. Specifically, the following matrix factorization criterion is proposed:

minimize_{W, H}  det(W^T W)   (22a)
subject to  X = WH^T   (22b)
            H >= 0, 1^T H = rho 1^T,   (22c)

where rho > 0 is any positive real number. The above criterion looks very much like the VolMin criterion in (21), but with the constraint on H changed: the sum-to-one constraint is now on the columns of H (if one lets rho = 1). This is without loss of generality, since column scaling of W# and counter column scaling of H# are intrinsic degrees of freedom in every matrix factorization model. In other words, no column normalization is needed to assume 1^T H = rho 1^T: H >= 0 and 1^T H = rho 1^T simply mean that ||H(:, r)||_1 = rho for all r, i.e., we fix the column scaling of the right latent factor. It is proven in [15] that:

If rank(W#) = rank(H#) = R and H# is sufficiently scattered (Definition 3), then any optimal solution of Problem (22) is W* = W# Pi D, H* = H# Pi^T D^{-1}, where Pi and D are permutation and full-rank diagonal scaling matrices as before.

This generalizes the result of VolMin identifiability in a significant way: the matrix factorization model X = W#H#^T is identifiable given any full column rank W# and a sufficiently scattered nonnegative H#, which is by far the mildest identifiability condition for nonnegative matrix factorization.

Insights Behind Criterion (22). The criterion is inspired by the intuition that we demonstrated in Fig. 11. There may exist a nontrivial Q such that

X = (W#Q)(H#Q^{-T})^T,  H#Q^{-T} >= 0,  1^T H#Q^{-T} = rho 1^T.   (23)

Such a Q must be 'shrinking' the red region in Fig. 11 (left), which makes the rows of H = H#Q^{-T} closer to each other (in Euclidean distance) inside the nonnegative orthant; otherwise H is infeasible if H# is sufficiently scattered. To prevent this from happening, we can maximize det(HH^T), which, intuitively speaking, maximizes the area of the red region covered by cone{H^T} in Fig. 11 (left). Note that maximizing det(HH^T) is equivalent to minimizing det(W^T W) under the equality constraint X = WH^T, which leads to the criterion in (22). In fact, one can rigorously show that |det(Q)| >= rho for any Q satisfying (23) if H# is sufficiently scattered and its column sums are rho, and that equality holds if and only if Q is a permutation matrix [8], [15]. Consequently, det(W^T W) = |det(Q)|^2 det(W#^T W#) >= rho^2 det(W#^T W#), where the lower bound is attained if and only if Q is a permutation matrix. This is the key for showing the soundness of the criterion in (22).

How important is choosing the right factorization model? Here, we present a toy example to demonstrate how the identifiability of the factorization tool affects performance under different data models. We generate X = W#H#^T with different types of latent factors. Following the simulation setup in [15], three cases of W# are generated: in case 1 the elements of W# follow the uniform distribution between 0 and 1 with density level s = nnz(W#)/(MR) = 0.65; case 2 uses the uniform distribution between 0 and 1 with s = 1; and case 3 uses the i.i.d. standard Gaussian distribution. In all three cases, the elements of H# follow the uniform distribution between 0 and 1 with density level s = 0.65. The first two cases have both W# and H# nonnegative, while W# has many negative values in the third case. We put under test separability-based NMF (SPA [37]), plain NMF (HALS [61]), volume minimization (VolMin) [58], and the determinant-based criterion in (22) (ALP [15]), under R = 3, 5, 10, 15. The mean squared error (MSE) of the estimated H (as defined in [14], [15]) obtained by the different approaches is shown in Table II.

Several important observations that reflect the importance of identifiability are in order. First, for the separability-based algorithm, when R is small and the separability condition is relatively easy to satisfy by a moderately sparse H#, it works very well for cases 1 and 2. It does not work well for case 3, since the adopted algorithm, SPA, needs the row-stochasticity assumption on H#, which cannot be enforced when W# contains negative elements. Second, plain NMF works well for case 1, since both W# and H# are very likely sufficiently scattered, but it fails for the second and third cases since the W#'s are dense there, which violates a necessary condition for plain NMF identifiability. Third, VolMin exhibits high estimation accuracy for the first two cases, since its identifiability holds for any full column-rank W#, but it fails in the third case because it cannot enforce the row-stochasticity of H# there (note that we have applied the column normalization trick before applying VolMin and SPA, which does not work if W# is not nonnegative). Fourth, the determinant-based method successfully identifies H# in all three cases, which echoes our comment that this criterion needs by far the mildest condition for identifiability. Table II illustrates the pivotal role of identifiability in NMF problems.

We should finish with a remark that it is not always recommended to use the approaches that offer the strongest identifiability guarantees (e.g., VolMin and the criterion in (22)), since these pose harder optimization problems in practice and normally require more runtime; see also Table II. This will be explained in the next section.

VI. ALGORITHMS

Identifiability, as a theoretical subject, has significant implications for how well an NMF criterion or procedure will perform in real-world applications. On the other hand, identifiability is not equivalent to solvability or tractability. Upon a closer look at the NMF methods described in the previous sections, one will realize that many separability-based methods lead to polynomial-time algorithms, whereas separability-free methods such as determinant minimization require us to solve nonconvex optimization problems. In fact, both the plain NMF and the determinant minimization problems are known to fall in the NP-hard problem class [62], [63]. Hence, how to effectively handle an NMF criterion is often an art, which must also take into consideration computational and memory complexities, regularization, and (local) convergence guarantees, all of which contribute to good performance in practice.

A. Fitting-Based NMF

The computational aspects of plain NMF are very well studied. Let us denote

f(W, H) = ||X - WH^T||_F^2 + r_1(W) + r_2(H),

where r_1(.) and r_2(.) are regularizers that encode prior knowledge, e.g., sparsity. In some cases, ||X - WH^T||_F^2 is replaced by L(X || WH^T), where L(.||.) denotes a non-Euclidean distance measure suitable for certain types of data [18] (e.g., the Kullback-Leibler (KL) divergence for count data). The general problem of interest is

minimize_{W >= 0, H >= 0}  f(W, H).   (24)

This naturally leads to the following block coordinate descent (BCD) scheme:

W^{(t+1)} <- arg min_{W >= 0} f(W, H^{(t)}),   (25a)
H^{(t+1)} <- arg min_{H >= 0} f(W^{(t+1)}, H),   (25b)

where t is the iteration index. Note that Problems (25a)-(25b) are both convex under a variety of distance measures (e.g., the Euclidean distance, the KL divergence, and the beta-divergence) and convex regularizers (e.g., ||.||_1 and ||.||_F^2). Therefore, the subproblems can be solved by a large variety of off-the-shelf convex optimization algorithms; many such BCD algorithms were summarized in [22], [27]. One of the recently developed algorithms, namely the AO-ADMM algorithm [64], employs ADMM for solving the subproblems (25a)-(25b). The salient feature of ADMM is that it can easily handle a large variety of regularizers and constraints simultaneously, via judiciously introducing slack variables and leveraging easily solvable subproblems.

Note that the subproblems need not be solved exactly. The popular multiplicative update (MU) and a recently developed algorithm [65] both update W and H by solving local approximations of Problems (25a) and (25b). There is an interesting trade-off between using exact and inexact solutions of (25a) and (25b). Generally speaking, exact BCD uses fewer iterations to obtain a good solution, but the per-iteration complexity is higher relative to the inexact variants. Inexact updating generally uses many more iterations, but some interesting and pragmatic tricks can substantially reduce the number of iterations, e.g., the combination of Nesterov's extrapolation and alternating gradient projection in [65]. We refer the readers to [22], [23], [27], [65] for detailed surveys of plain NMF algorithms.

TABLE II
THE ESTIMATED MSE OF H (AFTER FIXING PERMUTATION AND SCALING AMBIGUITIES) AND AVERAGE RUNTIME GIVEN BY DIFFERENT APPROACHES UNDER DIFFERENT DATA MODELS. COLUMNS CORRESPOND TO R = 3, 5, 10, 15.

Separability-Based [37]:
  Case 1:  1.11E-32   0.0008   0.1037   0.2475
  Case 2:  1.08E-32   0.0002   0.0424   0.0881
  Case 3:  0.4714     0.7242   0.7280   0.7979
  runtime (sec.): 0.0006   0.0005   0.0008   0.0010
Plain NMF [9], [61]:
  Case 1:  1.31E-05   4.89E-05  0.0005   0.0014
  Case 2:  0.0077     0.0149    0.0411   0.1322
  Case 3:  0.3489     0.2787    0.2628   0.3260
  runtime (sec.): 0.0550   0.0872   0.2915   0.5622
VolMin [17], [58]:
  Case 1:  7.64E-06   7.33E-08  2.89E-08  1.47E-04
  Case 2:  2.78E-05   7.78E-10  1.17E-08  5.76E-04
  Case 3:  0.6566     1.0035    1.2103    1.2541
  runtime (sec.): 0.0128   0.0124   0.0224   0.0341
Determinant-Based Criterion (22) [15]:
  Case 1:  1.76E-17   1.07E-17  1.91E-17  2.43E-17
  Case 2:  2.02E-17   1.54E-17  1.96E-17  2.02E-17
  Case 3:  2.08E-17   1.90E-17  1.84E-17  1.86E-17
  runtime (sec.): 2.6780   4.3743   9.2779   13.9156
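As a concrete, if deliberately naive, instance of the exact BCD scheme in (25) for the plain Euclidean loss with no regularizers, the sketch below solves each subproblem exactly by column-wise nonnegative least squares. The helper name and iteration count are illustrative; practical algorithms (MU, HALS, AO-ADMM) are far more efficient.

import numpy as np
from scipy.optimize import nnls

def nmf_bcd(X, R, n_iter=100, seed=0):
    # Two-block coordinate descent for min ||X - W H^T||_F^2 s.t. W, H >= 0,
    # solving (25a)/(25b) exactly via column-wise nonnegative least squares.
    M, N = X.shape
    rng = np.random.default_rng(seed)
    W = rng.random((M, R))
    H = rng.random((N, R))
    for _ in range(n_iter):
        # (25a): for each row m of X, solve min_{w>=0} ||X(m,:)^T - H w||_2
        W = np.stack([nnls(H, X[m, :])[0] for m in range(M)])
        # (25b): for each column n of X, solve min_{h>=0} ||X(:,n) - W h||_2
        H = np.stack([nnls(W, X[:, n])[0] for n in range(N)])
    return W, H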

B. Optimizing Determinant-Related Criteria

Next, we turn our attention to the determinant minimization problems in (21) and (22). Determinant minimization is quite a hard nonconvex problem, and the remote sensing community has spent much effort in this direction (to implement Craig's belief) [57]-[59], [66].

Successive Convex Approximation (SCA). The work in [57]-[59] considered a case where W# is a square matrix and H# satisfies (11). The methods can be summarized in a unified way: consider a square and full-rank W. It has a unique inverse Z = W^{-1}. Hence, the problem in (21) can be recast as

maximize_Z  |det(Z)|   (26a)
subject to  1^T ZX = 1^T, ZX >= 0,   (26b)

since det(W^T W) = det(W)^2 when M = R, and minimizing |det(W)| is equivalent to maximizing |det(Z)|. Problem (26) is still nonconvex, but now the constraint set is convex. From this point, the methods in [8], [15], [59] and the methods in [57], [58] take two different routes to handle the above (and its variants).

The first route is successive convex approximation [57], [58]. First, it is noted that the objective can be changed to minimizing -log|det(Z)|, which does not change the optimal solution; the log function keeps the objective 'safely' inside the differentiable domain, as it penalizes nearly singular Z very heavily. This way, the gradient of the cost function can be used for optimization. Then, by right-multiplying X† = X^T(XX^T)^{-1} to both sides of the equality constraint of Problem (26), we have 1^T Z = a, where a = 1^T X†. The resulting optimization problem may be re-expressed as

minimize_Z  -log|det(Z)|   (27a)
subject to  1^T Z = a, ZX >= 0.   (27b)

The cost function can be approximated locally around Z = Z^{(t)} by the quadratic function

f_hat(Z; Z^{(t)}) = f(Z^{(t)}) + ⟨grad f(Z^{(t)}), Z - Z^{(t)}⟩ + (gamma/2) ||Z - Z^{(t)}||_F^2,

where f(Z) denotes the objective in (27) and gamma is a parameter associated with the step size. Given f_hat(Z; Z^{(t)}), the subproblem in iteration t is

Z^{(t+1)} <- arg min_{1^T Z = a, ZX >= 0} f_hat(Z; Z^{(t)}),   (28)

which is a convex quadratic program. Using a general-purpose second-order optimization algorithm (e.g., Newton's method) to solve Problem (28) needs O(R^6 sqrt(RN)) flops [57], since there are R^2 optimization variables and RN inequality constraints. However, there are special structures to be exploited, and a customized ADMM algorithm brings the per-iteration complexity down to O(R^2 N) flops [44], [58].

Alternating Linear Programming (ALP). The second route is alternating linear programming [8], [59]. The major idea is to optimize w.r.t. the jth row of Z while fixing all the other rows of Z. Note that det(Z) = z_j^T r_j, where z_j^T denotes the jth row of Z, r_j = [r_j(1), ..., r_j(R)]^T with r_j(k) = (-1)^{j+k} det(Z_{j,k}) by the cofactor expansion of the determinant, and Z_{j,k} (of size (R-1) x (R-1)) is the submatrix of Z formed by removing the jth row and kth column of Z. Hence, by considering maximize |det(Z)| subject to the constraints in (26), the resulting update is

z_j^{(t+1)} <- arg max_{z_j}  |r_j^T z_j|   (29a)
subject to  1^T z_j = b_j,  z_j^T X >= 0,   (29b)

where b_j is determined by the equality constraint in (26) given the other rows {z_m^{(t)}}, m != j. The objective is still nonconvex, but the problem can be solved by comparing the solutions of two linear programs whose objective functions are r_j^T z_j and -r_j^T z_j, respectively. The idea of alternating linear programming can also be applied to the modified criterion in (22) [8], [15]. In terms of complexity, the algorithm costs O(R^2 N) flops per iteration, even if an interior-point algorithm is used to solve (29), due to the small scale of the linear programs; see more discussion in [8] and an acceleration strategy in [44].

Regularized Fitting. Another way to tackle the determinant minimization problems is to consider a regularized variant:

minimize_{W, H}  ||X - WH^T||_F^2 + lambda g(W)   (30a)
subject to  W in 𝒲, H in ℋ,   (30b)

where g(.) is an optimization surrogate of det(W^T W) and 𝒲 and ℋ are constraint sets for W and H, respectively. The formulation in (30) can easily incorporate prior information on W and H (via adding different constraints), and its use of the fitting term ||X - WH^T||_F^2 in the objective makes it exhibit some form of noise robustness. Some commonly used constraints are 𝒲 = {W | W >= 0} and ℋ = {H | H1 = 1, H >= 0}, where ℋ is naturally inherited from the VolMin criterion and 𝒲 takes into consideration prior knowledge of W. The problem in (30) is essentially a regularized version of plain NMF, and it can also be handled via two-block coordinate descent w.r.t. W and H. However, the subproblems can be much harder than those in plain NMF, depending on the regularization term g(.) that is employed. For example, in [14], [60],

g(W) = det(W^T W)

was used. Consequently, the W-subproblem is no longer convex. A commonly used trick to handle such a nonconvex W-subproblem is projected gradient descent combined with backtracking line search (to ensure decrease of the cost function) [14], [66]. To circumvent line search, which can be time consuming, it was proposed in [14] to use a surrogate for det(W^T W), namely,

g(W) = log det(W^T W + eps I),   (31)

for some pre-specified small constant eps > 0. The function log det(W^T W + eps I) still promotes a small determinant of W^T W since log(.) is monotonic, and the term eps I prevents g(W) from going to minus infinity. Such a regularization term allows one to design a block successive upper-bound minimization (BSUM) algorithm [67] to handle the regularized fitting problem, using simple updates of W and H without involving any line search procedure [14].
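For illustration, here is one projected-gradient step for the W-subproblem of (30) under the log-det surrogate (31) and 𝒲 = {W : W >= 0}, written in NumPy. The fixed step size and the function name are illustrative assumptions; [14] sidesteps step-size tuning altogether through a BSUM surrogate.

import numpy as np

def w_subproblem_pg_step(X, W, H, lam=0.1, eps=1e-8, step=1e-3):
    # One projected-gradient step for
    #   f(W) = ||X - W H^T||_F^2 + lam * logdet(W^T W + eps I),  W >= 0.
    R = W.shape[1]
    # Gradient of the fitting term: 2 (W H^T H - X H).
    grad_fit = 2.0 * (W @ (H.T @ H) - X @ H)
    # Gradient of lam * logdet(W^T W + eps I): 2 lam W (W^T W + eps I)^{-1}.
    grad_reg = 2.0 * lam * W @ np.linalg.inv(W.T @ W + eps * np.eye(R))
    W_new = W - step * (grad_fit + grad_reg)
    return np.maximum(W_new, 0.0)       # projection onto {W : W >= 0}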

There are various ways of solving the W- and H-subproblems in the BSUM framework, e.g., ADMM [64], [68] and (accelerated) proximal gradient [14], [65]. All of these first-order optimization algorithms cost O(R^2(M + N)) flops per iteration.

We should mention that many other approximations of the simplex volume or of the determinant of W^T W exist [66], [69], [70]. For example, in [69], Berman et al. employ an even simpler regularization,

g(W) = tr(F W^T W),

where F = RI - 11^T. Essentially, this regularization approximates the volume of conv{w_1, ..., w_R} by the sum of squared distances between all pairs of w_i and w_j (since g(W) is proportional to sum_{i=1}^{R} sum_{j=i+1}^{R} ||w_i - w_j||_2^2), which intuitively relates to the volume of conv{W}. This makes the W-subproblem convex and easy to handle. Nevertheless, this approximation of the simplex volume is considered rough, and using the determinant or log-determinant usually yields better results according to empirical evidence [14].
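A small NumPy check of the pairwise-distance surrogate just described (reading the trace as tr(F W^T W), which is the dimensionally consistent form): for F = RI - 11^T, the trace expression coincides exactly with the sum of squared pairwise distances between the columns of W.

import numpy as np

rng = np.random.default_rng(0)
M, R = 6, 4
W = rng.random((M, R))                      # columns w_1, ..., w_R

F = R * np.eye(R) - np.ones((R, R))
g_trace = np.trace(F @ (W.T @ W))           # tr(F W^T W)

g_pairs = sum(np.sum((W[:, i] - W[:, j]) ** 2)
              for i in range(R) for j in range(i + 1, R))

print(np.isclose(g_trace, g_pairs))         # True: the two expressions match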
VII. MORE DISCUSSIONS ON APPLICATIONS

In this section, with everything that we have learned, we take another look at the applications and discuss how to choose appropriate factorization tools for them.

A. Case Study: Topic Modeling

Recall that X is a word-document matrix, X(m, n) is the term frequency of the mth word in document n, W# is the word-topic matrix, and H# is the document-topic matrix. It makes sense to assume that each topic has a few characteristic words which other topics do not use. In the literature, such characteristic words are called anchor words, and their existence is the same as the separability condition holding for W#. Note that if the anchor-word assumption holds for the model X = C# Sigma D#^T (where W# = C# and H# = D# Sigma#), it also holds for the second-order model P = C# E# C#^T, since the P matrix carries the word-topic matrix with it. Therefore, many separability-based NMF algorithms have been applied to P or X for topic mining [7], [30], [38]. In some cases, e.g., when the dictionary size is small, the topics may not have anchor words captured in the dictionary. Under such scenarios, it is 'safer' to apply the computationally more challenging yet identifiability-wise more robust methods, such as the determinant-based method in (22) and its variants [8], [44].

Community Detection. Note that the discussion here for anchor word-based topic modeling also holds for the community detection problem in (7). The only difference is that the anchor-word assumption is replaced by the 'pure-node' assumption, where a pure node is a node that has a single community membership; i.e., there exists theta_{l_r} = e_r^T for r = 1, ..., R [10], [34]. Under the pure-node assumption, H in (7) satisfies the separability condition, and thus the algorithms introduced in Sec. IV can be directly applied. Also note that in the community detection model (7), W is not necessarily nonnegative. Nevertheless, the separability-based and determinant-based algorithms are not affected in terms of identifiability, as we have learned from the previous sections.

Another point worth mentioning is that, to estimate C# from the second-order model P = C# E# C#^T, it is desirable to work directly on the symmetric tri-factorization model, instead of treating P as an asymmetric factorization model (i.e., by letting W# = C# E# and H# = C#), since the former is more natural. It was proved in the literature [8], [44] that the criterion

minimize_{C, E}  |det(E)|   (32a)
subject to  X = CEC^T   (32b)
            1^T C = 1^T, C >= 0,   (32c)

can identify C# and E# up to trivial ambiguities, if C# is sufficiently scattered. Note that the column-stochasticity constraint on C in (32) is added because the topics are modeled as PMFs. In terms of computation, such a tri-factorization model is seemingly harder to handle than the bi-factorization models. Nevertheless, some good heuristic algorithms exist [8], [21], [44]; see the supplementary materials on tri-factorization and symmetric NMF for more information.

Table III shows the performance of two anchor-based algorithms (SPA [37] and XRAY [38]) and an anchor-free algorithm [8], [44], when applied to 319 real documents. One can see that SPA only identifies a single topic (all the columns of the estimated W correspond to a sexual misconduct case in the military). XRAY identifies two topics, i.e., the sexual misconduct story and the Cavalese cable car disaster, which happened in Italy in 1998. The AnchorFree algorithm [8], [44], which employs the criterion in (32), identifies three different topics. Note that the topic clearly related to New York City's traffic and taxi drivers is never discovered by the separability-based methods. This example shows the power of the sufficiently scattered condition-based algorithms in topic mining.

B. Case Study: HMM Identification

In the HMM model, M(i, r) = Pr[Y_t = y_i | X_t = x_r] is the emission probability of Y_t = y_i given the latent state X_t = x_r. In this case, the separability condition translates to the requirement that, for every latent state, there is an observable state that is emitted by this latent state with an overwhelming probability. This is not impossible, but it heavily hinges on the application at hand. For more general cases where the separability condition is hard to argue, using a determinant-based method (such as that in (32)) is always 'safer' in terms of identifiability.

TABLE III
PROMINENT TOPICS FOUND FROM 319 DOCUMENTS IN THE TDT2 DATASET (DICTIONARY SIZE = 4,864) BY DIFFERENT ALGORITHMS; EACH ALGORITHM RETURNS THREE TOPIC COLUMNS.

        SPA [37]                         |  XRAY [38]                        |  AnchorFree [8]
mckinney  mckinney  mckinney             |  mckinney  mckinney  cable        |  mckinney   cable     mayor
sergeant  sergeant  sergeant             |  sergeant  sergeant  plane        |  sergeant   marine    giuliani
gene      gene      gene                 |  gene      gene      italian      |  gene       italy     city
martial   army      martial              |  martial   martial   italy        |  martial    italian   drivers
sexual    major     armys                |  major     sexual    car          |  sexual     plane     york
major     sexual    sexual               |  sexual    misconduct marine      |  misconduct accident  mayors
court     rank      misconduct           |  court     armys     accident     |  armys      car       giulianis
misconduct obstruction major             |  army      enlisted  jet          |  major      jet       yorkers
mckinneys martial   court                |  misconduct major    flying       |  enlisted   crew      citys
armys     misconduct enlisted            |  accusers  court     ski          |  court      ski       rudolph
enlisted  justice   counts               |  armys     mckinneys crew         |  army       flying    street
army      enlisted  top                  |  enlisted  mckinneys army         |  20         mckinneys cab
accusers  reprimanded army               |  enlisted  soldier   aviano       |  soldier    pilot     taxi
women     sentence  19                   |  women     charges   pilot        |  women      aviano    pedestrians
six       armys     former               |  six       women     gondola      |  charges    gondola   civility
charges   conviction soldier             |  former    former    military     |  former     military  police
soldier   court     mckinneys            |  charges   six       base         |  six        base      cabbies
former    charges   jurors               |  soldier   19        severed      |  accusers   severed   manhattan
testified jury      women                |  sexually  testified altitude     |  jury       resort    traffic
sexually  mckinneys charges              |  harassing jury      killed       |  counts     altitude  vendors

0 0 106 108 106 108 A practical estimation criterion for HMM identification is sample size sample size as follows: Fig. 14. The total variation difference (defined as 1/2R M\ Mc 1 and k − k 1/2R Θ\ Θb 1, respectively) between the ground truth and algorithm- k − k minimize KL Ω MΘM> + λ det(Θ) (33) estimated transition probability (left) and emission probability (right). The M,Θ || | | result is averaged over 10 random problem instances. In this example, we subject to 1>M = 1>, M  0, 1>Θ1 = 1, Θ 0, have M = 100 observable states and R = 20 latent states. Both the transition ≥ ≥ and the emission probability matrices are generated randomly following an exponential distribution, and 50% elements of the emission matrix are set to where the constraints are due to the fact the columns of be zeros—so that the sufficiently scattered condition holds for M. M and the matrix Θ are both respectable PMFs, and KL Ω MΘM> measures the KL-divergence between Ω VIII.TAKE-HOME POINTS,OPEN QUESTIONS, || and MΘM . The formulation in (33) can be understood CONCLUDING REMARKS >  as a reformulation of (32) by lifting constraint (32b) to the In this feature article, we have reviewed many recent objective function under the KL-divergence based distance developments in identifiable nonnegative matrix factorization, measure. Using the regularized fitting criterion instead of a including models, identifiability theory and methods, and hard constraint Ω = MΘM> is because there is noise in timely applications in signal processing and machine learning. Ω—since it is usually estimated via sample averaging. The Some take-home points are summarized as follows: data fitting part employs the KL-divergence as a distance Different applications may have different characteristics, and • measure since Ω is an estimate of a joint PMF, and MΘM> thus choosing a right NMF tool for the application at hand is is supposed to be a parameterization of the true PMF—and essential. For example, plain NMF is considered not suitable KL divergence is the most natural measure for the distance for hyperspectral unmixing, since the W\ matrices there are between PMFs. The term det(Θ) is critical, since it in general dense, which violates the necessary condition for | | offers identifiability support. Optimizing the Problem (33) is plain NMF identifiability. highly nontrivial, since it is a symmetric model under the KL Separability-based methods have many attractive features, • divergence, but can be approached by carefully approximating such as identifiability, solvability, provable noise robustness, the objective using local optimization surrogates; see [21] for and the existence of lightweight greedy algorithms. In appli- details. cations where separability is likely to hold (e.g., community In the literature, the HMM identification problem was detection in the presence of ‘pure nodes’ and hyperspectral formulated as follows [20]: unmixing with pure pixels), it is a highly recommended approach to try first. 2 minimize Ω MΘM> (34) Sufficiently scattered condition-based approaches are very F • M,Θ − powerful and need a minimum amount of model assumptions. T subject to 1 >M = 1 , M 0, 1>Θ1 = 1, Θ 0. These approaches ensure identifiability under mild conditions ≥ ≥ and often outperform separability-based algorithms in appli- The key difference between (33) and (34) is that the latter cations like topic mining and HMM identification, where the does not have identifiability guarantees of the latent param- separability condition is likely violated. 
VIII. TAKE-HOME POINTS, OPEN QUESTIONS, CONCLUDING REMARKS

In this feature article, we have reviewed many recent developments in identifiable nonnegative matrix factorization, including models, identifiability theory and methods, and timely applications in signal processing and machine learning. Some take-home points are summarized as follows:

• Different applications may have different characteristics, and thus choosing the right NMF tool for the application at hand is essential. For example, plain NMF is considered unsuitable for hyperspectral unmixing, since the W♮ matrices there are in general dense, which violates the necessary condition for plain NMF identifiability.

• Separability-based methods have many attractive features, such as identifiability, solvability, provable noise robustness, and the existence of lightweight greedy algorithms. In applications where separability is likely to hold (e.g., community detection in the presence of 'pure nodes' and hyperspectral unmixing with pure pixels), they are a highly recommended approach to try first.

• Sufficiently scattered condition-based approaches are very powerful and need a minimum amount of model assumptions. These approaches ensure identifiability under mild conditions and often outperform separability-based algorithms in applications like topic mining and HMM identification, where the separability condition is likely violated. The caveat is that the optimization problems arising in such approaches are usually hard, and judicious design that is robust to noise and outliers is often needed.

There are also some very important open questions in the field:

• The plain NMF, VolMin, and Problem (22) are NP-hard problems. But in practice we often observe very accurate solutions (up to machine accuracy), at least when noise is absent. The conjecture is that, with some additional assumptions, the problems can be shown to be solvable with high probability—while the current understanding of this aspect is still limited. If solvability can be established under some conditions of practical interest, then, combined with identifiability, NMF's power as a learning tool will be lifted to another level.

• Noise robustness of the NMF models is not entirely well understood so far. This is particularly true for separability-free cases. The Cramér-Rao bound evaluation in [27] may help identify effective algorithms, but this approach is still a heuristic. Worst-case analysis is desired, since it helps understand the limitations of the adopted NMF methods for any problem instance.

• Necessary conditions for NMF identifiability are not very well understood, but necessary conditions are often useful in saving practitioners' effort when trying NMF on some hopeless cases. It is conjectured that the sufficiently scattered condition is also necessary for VolMin, which is consistent with numerical evidence. But the proof is elusive.

REFERENCES

[1] G. H. Golub and C. F. V. Loan, Matrix Computations. The Johns Hopkins University Press, 1996.
[2] P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[3] J.-C. Chen, "The nonnegative rank factorizations of nonnegative matrices," Linear Algebra and its Applications, vol. 62, pp. 207–217, 1984.
[4] M. D. Craig, "Minimum-volume transforms for remotely sensed data," IEEE Trans. Geosci. Remote Sens., vol. 32, no. 3, pp. 542–552, 1994.
[5] P. Paatero and U. Tapper, "Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
[6] D. Lee and H. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[7] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu, "A practical algorithm for topic modeling with provable guarantees," in International Conference on Machine Learning (ICML), 2013.
[8] K. Huang, X. Fu, and N. D. Sidiropoulos, "Anchor-free correlated topic modeling: Identifiability and algorithm," in Advances in Neural Information Processing Systems, 2016.
[9] K. Huang, N. Sidiropoulos, and A. Swami, "Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition," IEEE Trans. Signal Process., vol. 62, no. 1, pp. 211–224, 2014.
[10] X. Mao, P. Sarkar, and D. Chakrabarti, "On mixed memberships and symmetric nonnegative matrix factorizations," in International Conference on Machine Learning, 2017, pp. 2324–2333.
[11] D. Donoho and V. Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?" in NIPS, vol. 16, 2003.
[12] H. Laurberg, M. G. Christensen, M. D. Plumbley, L. K. Hansen, and S. Jensen, "Theorems on positive data: On the uniqueness of NMF," Computational Intelligence and Neuroscience, vol. 2008, 2008.
[13] T.-H. Chan, W.-K. Ma, C.-Y. Chi, and Y. Wang, "A convex analysis framework for blind separation of non-negative sources," IEEE Trans. Signal Process., vol. 56, no. 10, pp. 5120–5134, Oct. 2008.
[14] X. Fu, K. Huang, B. Yang, W.-K. Ma, and N. Sidiropoulos, "Robust volume-minimization based matrix factorization for remote sensing and document clustering," IEEE Trans. Signal Process., vol. 64, no. 23, pp. 6254–6268, 2016.
[15] X. Fu, K. Huang, and N. D. Sidiropoulos, "On identifiability of nonnegative matrix factorization," IEEE Signal Process. Lett., vol. 25, no. 3, pp. 328–332, 2018.
[16] C.-H. Lin, W.-K. Ma, W.-C. Li, C.-Y. Chi, and A. Ambikapathi, "Identifiability of the simplex volume minimization criterion for blind hyperspectral unmixing: The no-pure-pixel case," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 10, pp. 5530–5546, Oct. 2015.
[17] X. Fu, W.-K. Ma, K. Huang, and N. D. Sidiropoulos, "Blind separation of quasi-stationary sources: Exploiting convex geometry in covariance domain," IEEE Trans. Signal Process., vol. 63, no. 9, pp. 2306–2320, May 2015.
[18] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[19] X. Fu, W.-K. Ma, and N. Sidiropoulos, "Power spectra separation via structured matrix factorization," IEEE Trans. Signal Process., vol. 64, no. 17, pp. 4592–4605, 2016.
[20] B. Lakshminarayanan and R. Raich, "Non-negative matrix factorization for parameter estimation in hidden Markov models," in Proc. IEEE MLSP 2010, 2010, pp. 89–94.
[21] K. Huang, X. Fu, and N. D. Sidiropoulos, "Learning hidden Markov models using pairwise co-occurrences with applications to topic modeling," in Proc. ICML 2018, 2018.
[22] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie, "Nonnegative matrix and tensor factorizations: An algorithmic perspective," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 54–65, 2014.
[23] N. Gillis, "The why and how of nonnegative matrix factorization," Regularization, Optimization, Kernels, and Support Vector Machines, vol. 12, p. 257, 2014.
[24] ——, "Introduction to nonnegative matrix factorization," arXiv preprint arXiv:1703.00663, 2017.
[25] Y.-X. Wang and Y.-J. Zhang, "Nonnegative matrix factorization: A comprehensive review," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1336–1353, 2013.
[26] W.-K. Ma, J. Bioucas-Dias, T.-H. Chan, N. Gillis, P. Gader, A. Plaza, A. Ambikapathi, and C.-Y. Chi, "A signal processing perspective on hyperspectral unmixing," IEEE Signal Process. Mag., vol. 31, no. 1, pp. 67–81, Jan. 2014.
[27] K. Huang and N. Sidiropoulos, "Putting nonnegative matrix factorization to the test: A tutorial derivation of pertinent Cramér-Rao bounds and performance benchmarking," IEEE Signal Process. Mag., vol. 31, no. 3, pp. 76–86, 2014.
[28] J. Imbrie and T. H. Van Andel, "Vector analysis of heavy-mineral data," Geological Society of America Bulletin, vol. 75, no. 11, pp. 1131–1156, 1964.
[29] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2001, pp. 556–562.
[30] B. Recht, C. Re, J. Tropp, and V. Bittorf, "Factoring nonnegative matrices with linear programs," in Advances in Neural Information Processing Systems, 2012, pp. 1214–1222.
[31] S. Arora, R. Ge, and A. Moitra, "Learning topic models—going beyond SVD," in IEEE Symposium on Foundations of Computer Science (FOCS), 2012, pp. 1–10.
[32] D. Cai, X. Wang, and X. He, "Probabilistic dyadic data analysis with local and global consistency," in Proceedings of the 26th Annual International Conference on Machine Learning (ICML'09), 2009, pp. 105–112.
[33] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[34] M. Panov, K. Slavnov, and R. Ushakov, "Consistent estimation of mixed memberships with successive projections," in International Workshop on Complex Networks and their Applications. Springer, 2017, pp. 53–64.
[35] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, "Mixed membership stochastic blockmodels," Journal of Machine Learning Research, vol. 9, pp. 1981–2014, 2008.
[36] C. Forbes, M. Evans, N. Hastings, and B. Peacock, Statistical Distributions. John Wiley & Sons, 2011.
[37] N. Gillis and S. Vavasis, "Fast and robust recursive algorithms for separable nonnegative matrix factorization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 4, pp. 698–714, Apr. 2014.
[38] A. Kumar, V. Sindhwani, and P. Kambadur, "Fast conical hull algorithms for near-separable non-negative matrix factorization," pp. 231–239, 2013.
[39] X. Fu, W.-K. Ma, T.-H. Chan, and J. M. Bioucas-Dias, "Self-dictionary sparse regression for hyperspectral unmixing: Greedy pursuit and pure pixel search are related," IEEE J. Sel. Topics Signal Process., vol. 9, no. 6, pp. 1128–1141, Sep. 2015.
[40] J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov, "Metagenes and molecular pattern discovery using matrix factorization," Proceedings of the National Academy of Sciences, vol. 101, no. 12, pp. 4164–4169, 2004.
[41] S. Zhang, W. Wang, J. Ford, and F. Makedon, "Learning from incomplete ratings using non-negative matrix factorization," in Proceedings of the 2006 SIAM International Conference on Data Mining. SIAM, 2006, pp. 549–553.
[42] T.-H. Chan, W.-K. Ma, A. Ambikapathi, and C.-Y. Chi, "A simplex volume maximization framework for hyperspectral endmember extraction," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 11, pp. 4177–4193, Nov. 2011.
[43] J. Nascimento and J. Bioucas-Dias, "Vertex component analysis: A fast algorithm to unmix hyperspectral data," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 4, pp. 898–910, 2005.
[44] X. Fu, K. Huang, N. D. Sidiropoulos, Q. Shi, and M. Hong, "Anchor-free correlated topic modeling," IEEE Trans. Patt. Anal. Machine Intell., to appear, DOI: 10.1109/TPAMI.2018.2827377, 2018.
[45] C.-H. Lin, R. Wu, W.-K. Ma, C.-Y. Chi, and Y. Wang, "Maximum volume inscribed ellipsoid: A new simplex-structured matrix factorization framework via facet enumeration and convex optimization," arXiv preprint arXiv:1708.02883, 2017.
[46] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[47] N. Gillis, "Successive nonnegative projection algorithm for robust nonnegative blind source separation," SIAM Journal on Imaging Sciences, vol. 7, no. 2, pp. 1420–1450, 2014.
[48] U. Araújo, B. Saldanha, R. Galvão, T. Yoneyama, H. Chame, and V. Visani, "The successive projections algorithm for variable selection in spectroscopic multicomponent analysis," Chemometrics and Intelligent Laboratory Systems, vol. 57, no. 2, pp. 65–73, 2001.
[49] W. Naanaa and J.-M. Nuzillard, "Blind source separation of positive and partially correlated data," Signal Processing, vol. 85, no. 9, pp. 1711–1722, 2005.
[50] E. Esser, M. Moller, S. Osher, G. Sapiro, and J. Xin, "A convex model for nonnegative matrix factorization and dimensionality reduction on physical space," IEEE Trans. Image Process., vol. 21, no. 7, pp. 3239–3252, July 2012.
[51] X. Fu and W.-K. Ma, "Robustness analysis of structured matrix factorization via self-dictionary mixed-norm optimization," IEEE Signal Process. Lett., vol. 23, no. 1, pp. 60–64, 2016.
[52] N. Gillis and R. Luce, "A fast gradient method for nonnegative sparse regression with self-dictionary," IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 24–37, 2018.
[53] N. Gillis, "Robustness analysis of Hottopixx, a linear programming model for factoring nonnegative matrices," SIAM Journal on Matrix Analysis and Applications, vol. 34, no. 3, pp. 1189–1212, 2013.
[54] R. Ammanouil, A. Ferrari, C. Richard, and D. Mary, "Blind and fully constrained unmixing of hyperspectral images," IEEE Trans. Image Process., vol. 23, no. 12, pp. 5510–5518, Dec. 2014.
[55] T. Mizutani, "Ellipsoidal rounding for nonnegative matrix factorization under noisy separability," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1011–1039, 2014.
[56] N. Gillis and S. A. Vavasis, "Semidefinite programming based preconditioning for more robust near-separable nonnegative matrix factorization," SIAM Journal on Optimization, vol. 25, no. 1, pp. 677–698, 2015.
[57] J. Li and J. M. Bioucas-Dias, "Minimum volume simplex analysis: A fast algorithm to unmix hyperspectral data," in Proc. IEEE IGARSS 2008, vol. 3, 2008, pp. III-250.
[58] J. M. Bioucas-Dias, "A variable splitting augmented Lagrangian approach to linear spectral unmixing," in Proc. IEEE WHISPERS'09, 2009, pp. 1–4.
[59] T.-H. Chan, C.-Y. Chi, Y.-M. Huang, and W.-K. Ma, "A convex analysis-based minimum-volume enclosing simplex algorithm for hyperspectral unmixing," IEEE Trans. Signal Process., vol. 57, no. 11, pp. 4418–4432, Nov. 2009.
[60] G. Zhou, S. Xie, Z. Yang, J.-M. Yang, and Z. He, "Minimum-volume-constrained nonnegative matrix factorization: Enhanced ability of learning parts," IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1626–1637, 2011.
[61] A. Cichocki and A.-H. Phan, "Fast local algorithms for large scale nonnegative matrix and tensor factorizations," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 92, no. 3, pp. 708–721, 2009.
[62] S. A. Vavasis, "On the complexity of nonnegative matrix factorization," SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–1377, 2009.
[63] A. Packer, "NP-hardness of largest contained and smallest containing simplices for V- and H-polytopes," Discrete and Computational Geometry, vol. 28, no. 3, pp. 349–377, 2002.
[64] K. Huang, N. D. Sidiropoulos, and A. P. Liavas, "A flexible and efficient algorithmic framework for constrained matrix and tensor factorization," IEEE Transactions on Signal Processing, vol. 64, no. 19, pp. 5052–5065, 2016.
[65] Y. Xu and W. Yin, "A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion," SIAM Journal on Imaging Sciences, vol. 6, no. 3, pp. 1758–1789, 2013.
[66] L. Miao and H. Qi, "Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 3, pp. 765–777, 2007.
[67] M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," SIAM Journal on Optimization, vol. 23, no. 2, pp. 1126–1153, 2013.
[68] X. Fu, W.-K. Ma, K. Huang, and N. D. Sidiropoulos, "Robust volume minimization-based matrix factorization via alternating optimization," in ICASSP, 2016, pp. 2534–2538.
[69] M. Berman, H. Kiiveri, R. Lagerstrom, A. Ernst, R. Dunne, and J. Huntington, "ICE: A statistical approach to identifying endmembers in hyperspectral images," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 10, pp. 2085–2095, 2004.
[70] J. Li, J. Bioucas-Dias, and A. Plaza, "Collaborative nonnegative matrix factorization for remotely sensed hyperspectral unmixing," in Proc. IGARSS 2012, 2012, pp. 3078–3081.
[71] K. Rahbar and J. Reilly, "A frequency domain method for blind source separation of convolutive audio mixtures," IEEE Speech Audio Process., vol. 13, no. 5, pp. 832–844, Sep. 2005.
[72] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, "Tensor decomposition for signal processing and machine learning," IEEE Transactions on Signal Processing, vol. 65, no. 13, pp. 3551–3582.
[73] A. Yeredor, "Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation," IEEE Trans. Signal Process., vol. 50, no. 7, pp. 1545–1553, Jul. 2002.
[74] D. Nion, K. N. Mokios, N. D. Sidiropoulos, and A. Potamianos, "Batch and adaptive PARAFAC-based blind separation of convolutive speech mixtures," IEEE Audio, Speech, Language Process., vol. 18, no. 6, pp. 1193–1207, Aug. 2010.
[75] A. Ziehe, P. Laskov, G. Nolte, and K.-R. Müller, "A fast algorithm for joint diagonalization with non-orthogonal transformations and its application to blind source separation," Journal of Machine Learning Research, pp. 777–800, 2004.
[76] A. Vandaele, N. Gillis, Q. Lei, K. Zhong, and I. Dhillon, "Efficient and non-convex coordinate descent for symmetric nonnegative matrix factorization," IEEE Trans. Signal Process., vol. 64, no. 21, pp. 5571–5584, 2016.
[77] Q. Shi, H. Sun, S. Lu, M. Hong, and M. Razaviyayn, "Inexact block coordinate descent methods for symmetric nonnegative matrix factorization," IEEE Trans. Signal Process., vol. 65, no. 22, pp. 5995–6008, 2017.

APPENDIX A
ELLIPSOID VOLUME MINIMIZATION FOR SEPARABILITY-BASED NMF

In [55], Mizutani introduced ellipsoid-volume minimization-based NMF, whose intuition is shown in Fig. 15 using a case where M = R = 2. This method finds a minimum-volume ellipsoid that is centered at the origin and encloses the data columns x₁, ..., x_N and their 'mirrors' −x₁, ..., −x_N. Then, under the separability condition, the minimum-volume enclosing ellipsoid 'touches' the data cloud at ±W♮(:, r) for r = 1, ..., R. The identification criterion is formulated as follows:

  minimize_L   −log det(L)                                                      (35a)
  subject to   L ⪰ 0,                                                           (35b)
               x_ℓ^⊤ L x_ℓ ≤ 1,  ∀ℓ.                                            (35c)

Here, x^⊤Lx ≤ 1 represents an ellipsoid in the R-dimensional space, the constraint (35c) means that all the x_ℓ's are enclosed in the ellipsoid parametrized by L, and −log det(L) measures the volume of the ellipsoid. Note that Problem (35) is very well studied in convex geometry, and thus many off-the-shelf algorithms can be directly applied for NMF.

Fig. 15. Geometric illustration of the ER method when M = R = 2. Note that H♮1 = 1 is assumed and that the solid circle represents the minimum-volume ellipsoid that is sought.

By assuming that H♮ is row-stochastic and M = R, it was shown that the active points satisfying x_ℓ^⊤ L⋆ x_ℓ = 1 are exactly ±x_{ℓ_r} = ±W♮(:, r) for r = 1, ..., R, where L⋆ is the optimal solution to Problem (35)—which is consistent with our observation in Fig. 15. This algorithm, referred to as the ellipsoid rounding (ER) algorithm, is also robust to noise. One possible downside of ER is that it only works with square W♮, which means that when M > R a dimensionality reduction procedure has to be employed, resulting in a two-stage approach that could be prone to error propagation. Working with the reduced-dimension W♮ also loses the opportunity to incorporate prior information on the original W♮, e.g., sparsity and nonnegativity, which are crucial for performance enhancement in noisy cases. Dimensionality reduction also requires accurate knowledge of R, which the self-dictionary methods do not need—in fact, the self-dictionary methods can even help estimate R in noisy cases [39]. The upshot of ER is that it only has R² optimization variables, which is considerably smaller than the N² variables of the self-dictionary formulations in many real-world applications (e.g., in topic modeling R is the number of topics, while N is usually the number of documents or the number of words in a dictionary), which may potentially lead to more memory-economical algorithms.
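Since Problem (35) is a standard log-det convex program, it can be prototyped directly with an off-the-shelf modeling tool. The sketch below is our own illustration using CVXPY (an assumption, not a recommendation from [55]); note that the mirrored points −x_ℓ need no extra constraints, because x^⊤Lx ≤ 1 is symmetric in x.

```python
import numpy as np
import cvxpy as cp

def ellipsoid_rounding(X, tol=1e-6):
    """Minimum-volume origin-centered ellipsoid enclosing the columns of X (Problem (35)).
    Returns the ellipsoid parameter L and the indices of the 'active' columns, which
    reveal +/- W(:, r) under the separability assumption."""
    R, N = X.shape                                   # data assumed already reduced to R dimensions
    L = cp.Variable((R, R), PSD=True)                # L >= 0 (positive semidefinite), Eq. (35b)
    constraints = [cp.quad_form(X[:, n], L) <= 1 for n in range(N)]   # Eq. (35c)
    prob = cp.Problem(cp.Minimize(-cp.log_det(L)), constraints)       # Eq. (35a)
    prob.solve()
    Lstar = L.value
    vals = np.einsum('in,ij,jn->n', X, Lstar, X)     # x_n^T L* x_n for every column
    active = np.where(vals > 1 - tol)[0]             # columns touching the boundary
    return Lstar, active
```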

APPENDIX B
SPA, GREEDY PURSUIT, AND SIMPLEX VOLUME MAXIMIZATION

Besides the purely algebraic derivation of SPA, SPA can be derived in many other ways. Consider the self-dictionary sparse optimization problem:

  minimize_C   ‖C‖_row-0
  subject to   X = XC,  C ≥ 0,  1^⊤C = 1^⊤.

One way to handle this problem is to apply the so-called simultaneous orthogonal matching pursuit (SOMP) algorithm from compressive sensing; i.e., one can find an 'active atom' in the dictionary X via

  ℓ̂₁ = arg max_{ℓ=1,...,N} ‖X^⊤x_ℓ‖_q,                                          (36)

since an active atom (x_{ℓ_r} for ℓ_r ∈ Λ) in the dictionary is likely to exhibit large correlation with the data X. By setting q = ∞, one can see that

  max_{ℓ=1,...,N} ‖X^⊤x_ℓ‖_∞ = max_{ℓ∈{1,...,N}} max_{n∈{1,...,N}} |x_n^⊤x_ℓ| = max_{ℓ∈{1,...,N}} ‖x_ℓ‖²₂,

where the last equality is by the Cauchy–Schwarz inequality, and the above is exactly the first step of SPA. In fact, SPA and SOMP are completely equivalent when q = ∞ in (36) [39]. Another way to derive SPA is to consider the so-called simplex volume maximization criterion; i.e., the vertices of conv{X} can be found by finding a maximum-volume simplex that is inside the data cloud, if separability is satisfied. The intuition behind the formulation is shown in Fig. 16; it is sometimes referred to as Winter's belief [42] in the remote sensing literature. The optimization problem can be expressed as

  maximize_W   det(W^⊤W)
  subject to   W = XC,  C ≥ 0,  1^⊤C = 1^⊤,

where the objective is a measure of the volume of the simplex spanned by the columns of W and the constraints mean that W ∈ conv{X}. The above optimization problem can be solved by a simple greedy algorithm—which turns out to be exactly SPA; see the derivations in [7], [26], [42].
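For concreteness, here is a minimal NumPy sketch of the SPA greedy recursion described above (our own rendering; the function name and normalization choices are not from the references): each iteration picks the column of largest norm—cf. Eq. (36) with q = ∞—and then projects the data onto the orthogonal complement of the selected column.

```python
import numpy as np

def spa(X, R):
    """Successive projection algorithm on a real-valued data matrix X (M x N).
    Returns the indices of R columns estimated to be the vertices of conv{X}."""
    Xr = np.array(X, dtype=float)
    idx = []
    for _ in range(R):
        norms = np.linalg.norm(Xr, axis=0)
        k = int(np.argmax(norms))          # greedy step: column with the largest residual norm
        idx.append(k)
        u = Xr[:, k] / np.linalg.norm(Xr[:, k])
        Xr = Xr - np.outer(u, u @ Xr)      # project all columns onto the orthogonal complement
    return idx
```

Under the separability and row-stochasticity assumptions discussed above, X[:, spa(X, R)] recovers W♮ up to column permutation in the noiseless case.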
APPENDIX C
MORE APPLICATIONS: BLIND SEPARATION OF NON-STATIONARY AND STATIONARY SOURCES

Blind source separation (BSS) is a classic signal processing problem. A simple instantaneous mixture case of BSS admits the following signal model:

  y[t] = As[t],  t = 0, 1, 2, ...,                                              (37)

where y_m[t] (the mth element of the vector y[t] ∈ C^M) denotes the mth received signal (by sensor m) at time t, s_r[t] (the rth element of the vector s[t] ∈ C^R) denotes the rth source at time t, and A ∈ C^{M×R} is the mixing matrix. Here we assume that the signals and the mixing system are both complex-valued, since BSS is commonly applied in the complex domain (e.g., the frequency domain in convolutive speech mixture separation [71] and communication signal separation [19]). With the observations y[t] for t = 1, ..., T, the goal of BSS is to estimate A and s[t] for t = 1, ..., T. BSS finds many applications in speech and audio processing, array processing, and communications.

1) Speech Separation: To recast the BSS problem as an NMF problem, consider a case where s_r[t] is zero-mean and piecewise stationary (or quasi-stationary—meaning that the source is stationary within a short time interval but the statistics vary from interval to interval), which is a commonly used model for speech sources. Consequently, one can approximate the following local correlation matrix (e.g., via sample averaging):

  R_ℓ = E{y[t]y[t]^H} = AD_ℓA^H,  t ∈ interval ℓ,

where D_ℓ = E{s[t]s[t]^H} is the local covariance of the sources (assuming zero-mean sources) within interval ℓ. Note that D_ℓ is diagonal if the sources are zero-mean and uncorrelated, and D_ℓ(r, r) is the local average power of source r within interval ℓ—which is nonnegative. Then, by forming a matrix X = [x₁, ..., x_N] where x_ℓ = vec(R_ℓ), we have

  X = WH^⊤,

where W = A^* ⊙ A and H(ℓ, :)^⊤ = diag(D_ℓ), in which diag(Z) means taking the diagonal of Z and making it a column vector (as in MATLAB) and ⊙ is the Khatri–Rao product [72]. Here, one can see that H is a nonnegative factor, and the factorization model above is considered NMF in a broader sense, as discussed before. Factoring X into W and H identifies the mixing system A³, which is usually a critical stepping stone towards source separation.

³This is because W(:, r) = A(:, r)^* ⊗ A(:, r), where ⊗ denotes the Kronecker product [72], and thus A(:, r) can be estimated from W(:, r) via an eigendecomposition [17].
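The Khatri–Rao structure above is easy to verify numerically. The following is a small self-contained NumPy sketch (our own illustration; the variable names are not from the references). Note that the column ordering of the Khatri–Rao factor depends on the vectorization convention: row-major flattening is used here, so each column is A(:, r) ⊗ A(:, r)^*, which matches A(:, r)^* ⊗ A(:, r) when a column-major vec is used instead.

```python
import numpy as np

rng = np.random.default_rng(0)
M, R, N = 6, 3, 40                                     # sensors, sources, local intervals

A = rng.standard_normal((M, R)) + 1j * rng.standard_normal((M, R))   # mixing matrix
H = rng.exponential(size=(N, R))                       # local source powers (nonnegative)

# Local covariance of interval l:  R_l = A D_l A^H  with  D_l = diag(H[l, :]).
# Each column of X is the (row-major) vectorization of R_l.
X = np.stack([((A * H[l]) @ A.conj().T).ravel() for l in range(N)], axis=1)   # M^2 x N

# Khatri-Rao factor consistent with the row-major vectorization used above.
W = np.stack([np.kron(A[:, r], A[:, r].conj()) for r in range(R)], axis=1)    # M^2 x R

print(np.allclose(X, W @ H.T))   # True: X = W H^T, with H the nonnegative factor
```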

Although many sophisticated algorithms exist for source separation (e.g., independent component analysis (ICA) [2], joint diagonalization [73], and tensor decomposition [74]), using NMF techniques (especially separability-based NMF) can further boost the performance—in particular, the runtime performance. In speech separation, separability approximately holds for H♮ if there exist short time durations where only one speaker is speaking. This makes sense since there are a lot of pauses and stops during utterances; see Fig. 17. SPA and its variants can be applied to identify W♮ (and thus A). Since SPA is a Gram–Schmidt-like algorithm, it helps unmix the speech signal in nearly real time (cf. the 'ProSPA' algorithm in Fig. 18, which is based on SPA and judicious preprocessing). When there are many speakers talking simultaneously, using volume minimization initialized by ProSPA gives more promising results [17], since now the separability condition is likely violated.

Fig. 17. The local dominance phenomenon in speech separation, which gives rise to the separability condition if the problem is recast as an NMF problem as described. The figure shows two sources s₁[t] and s₂[t] along the time axis, with a time duration dominated by source 1 and a time duration dominated by source 2.

Fig. 18. The MSE (measured at SNR = 10 dB) of the estimated A and the average runtime performance in the speech separation problem, respectively. Here, 5 sources with a 6-second duration are used; 6 sensors are employed. 'ProSPA' [17] is an NMF-based approach under the local correlation formulation. All the baselines (FFDIAG, UWEDGE, clustering-based, FastICA, TALS, ACDC, DIEM, PHAM) are competitive source separation methods. ProSPA exhibits 50 times faster execution time compared to the competitor that has similar accuracy (FFDIAG) [75]. More details can be found in [17].

2) Power Spectra Separation: In some wireless communication systems such as cognitive radio, it is desired to establish spectral awareness—i.e., to build up knowledge about the spectral (un)occupancy so that transmission opportunities can be identified.

Consider the received signals at the receivers of a cognitive radio system, which can be represented as in (37). Note that now the sources are transmitted signals from some primary or licensed users, and are usually assumed to be zero-mean and wide-sense stationary. The goal is to identify the power spectrum of each individual source, so that interference avoidance strategies can be designed for the cognitive transmitters.
In such cases, we can use the following idea to construct an NMF model.


We first compute the auto-correlation of y_m[t], which is

  r[τ] = E[y^*[t] ⊛ y[t − τ]] = (A^* ⊛ A)q[τ],  ∀τ,

if the sources are spatially uncorrelated, where ⊛ is the Hadamard product, q[τ] ∈ C^R, and q_r[τ] = E[s_r^*[t]s_r[t − τ]] is the autocorrelation of the rth source. Taking the Fourier transform of each element in r[τ], we have

  z(ω) = (A^* ⊛ A)p(ω),  ∀ω,

where p(ω) ∈ R^R contains the power spectral densities (PSDs) of the R source signals at frequency ω. By putting the above in a matrix X whose columns correspond to different (discretized) frequencies, we have a model X = WH^⊤, where W = (A^* ⊛ A) ≥ 0 and H(:, r) ≥ 0 is the PSD of the rth source sampled at N frequencies.

Some experiment results are shown in Fig. 19, where different NMF approaches are applied to separate the power spectra of two sources. The received signals were measured in the communications laboratory at the University of Minnesota. Fig. 19 shows the spectra separation results using the HALS algorithm (plain NMF fitting) [61] and SPA, respectively. One can see that plain NMF does not work, since W here is dense and violates the necessary condition for plain NMF identifiability (to be precise, W♮ results from the Hadamard product of the wireless channel matrix and its conjugate, i.e., W♮ = A^* ⊛ A, which is nonnegative; however, it is in general dense almost surely, since A is well approximated by a matrix whose elements are drawn from some continuous distribution). On the other hand, SPA does not need to assume any zero patterns in the wireless channel, and the considered case obviously satisfies the separability condition (i.e., the power spectra of the two transmitters both have exclusive frequency bins that the other does not use).

Fig. 19. Different outcomes for a spectra separation problem in cognitive radio. Two primary transmitters and two cognitive receivers are used in this experiment. The data was measured in the Communications Laboratory at the University of Minnesota with USRP software-defined radios; see [19]. Left: mixture of the received power spectra observed at receiver 1 (contributions of source 1 and source 2; normalized PSD vs. index of frequency bin). Right: the unmixed spectra by different NMF approaches (NMF-HALS and SPA). Figure reproduced (permission will be sought) from [19], © IEEE.
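The density argument above can be made concrete with a tiny NumPy check (our own illustration, not code from [19]): the Hadamard factor W = A^* ⊛ A equals |A|² entrywise, so it has no zero entries for a generic channel, whereas the separability needed by SPA lives entirely in H through exclusively occupied frequency bins.

```python
import numpy as np

rng = np.random.default_rng(1)
M, R, N = 4, 2, 1000                          # receivers, sources, frequency bins

A = rng.standard_normal((M, R)) + 1j * rng.standard_normal((M, R))   # generic wireless channel
W = (np.conj(A) * A).real                     # Hadamard product A^* (*) A = |A|^2, entrywise
print((W > 0).all())                          # True: W is dense, so plain NMF is not identifiable here

H = rng.exponential(size=(N, R))              # PSDs of the R sources over N frequency bins
H[:50, 1] = 0.0                               # bins 0-49 occupied exclusively by source 0
H[50:100, 0] = 0.0                            # bins 50-99 occupied exclusively by source 1
X = W @ H.T                                   # the (broader-sense) NMF model X = W H^T
```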

APPENDIX D
TRI-FACTORIZATION MODEL AND IDENTIFIABILITY

The results that were developed for the identifiability of VolMin and determinant minimization can be generalized to cover certain tri-factorization models, which find important applications in topic modeling, community detection, and HMM identification, as we have seen. Consider the following signal model:

  X = C♮E♮C♮^⊤.

One way to handle the factorization problem is to first treat C♮E♮ = W♮ and H♮ = C♮ and directly apply any of the previously introduced methods. However, this would be suboptimal in noisy cases, since the structure of the data matrix is not yet fully exploited. In [8], [44], the authors proposed the following identification criterion:

  minimize_{C,E}   |det(E)|                                                     (38a)
  subject to       X = CEC^⊤,                                                   (38b)
                   1^⊤C = 1^⊤,  C ≥ 0.                                          (38c)

It was also shown in [8], [44] that solving (38) leads to E⋆ = ΠE♮Π^⊤ and C⋆ = C♮Π if C♮ satisfies the sufficiently scattered condition and C♮ and E♮ have full rank, where Π is a permutation matrix.

The above can be proven utilizing the result that we have for the determinant-based bi-factorization criterion, by simply looking at the identifiability of C♮ in the column or row subspace of X. However, the practical implication is significant. For example, the identifiability of (38) allows us to use the following implementation with identifiability support:

  minimize_{C,E}   ‖X − CEC^⊤‖²_F + λ|det(E)|                                   (39a)
  subject to       1^⊤C = 1^⊤,  C ≥ 0.                                          (39b)

Problem (39) accounts for noise, and it directly works with the model X = CEC^⊤ rather than treating CE = W and ignoring the symmetry. This way, one can also easily incorporate prior information on E♮ and C♮ (e.g., E♮ is nonnegative and positive semidefinite in topic modeling) to enhance performance in noisy cases. In addition, E is usually a small matrix of size R × R, and dealing with det(E) can be relatively easier than dealing with det(W^⊤W), where W = CE, in algorithm design.
APPENDIX E
SUBSPACE METHODS FOR SYMMETRIC NMF

As we have seen, in some applications such as topic modeling and HMM identification, symmetric NMF naturally arises. The associated models, e.g., X = W♮W♮^⊤ and X = C♮E♮C♮^⊤, pose very hard optimization problems, though. Unlike the asymmetric case, where the subproblems w.r.t. each of the block variables are convex or can be effectively approximated by convex problems, now the optimization criteria, such as

  minimize_{W ≥ 0}   ‖X − WW^⊤‖²_F                                              (40)

or Problem (39), are fourth-order optimization problems. In recent years, much effort has been invested in handling Problem (40), e.g., using coordinate descent [76] and judiciously designed BSUM [77]. These are viable approaches but are considered 'heavy' in implementation, since they essentially optimize the problem w.r.t. each element (or row) of W cyclically. For Problem (39), the problem is even harder, since det(E) is involved and more variables need to be optimized.

An interesting way to handle symmetric factorization has a strong flavor of array processing, in particular, the subspace methods. Using the idea of subspace methods, the problems of identifying both the models X = W♮W♮^⊤ and X = C♮E♮C♮^⊤ can be recast as problems that are easier to deal with—without losing identifiability. To see this, let us consider the tri-factorization model in (38), i.e., X = C♮E♮C♮^⊤. The idea is simple: in many applications X = BB^⊤ always holds since X ⪰ 0 (e.g., in second-order statistics-based topic mining), where B ∈ R^{M×R} is a square root of X. Then, it is easy to see that B^⊤ = G♮C♮^⊤ for a certain invertible G♮ if rank(X) = rank(C♮) = rank(E♮) = R, since range(B) = range(X) = range(C♮) under such circumstances. Therefore, the algorithms for determinant minimization can be directly applied for identifying G♮ and C♮—this is the idea in [8], [44]. For the model X = W♮W♮^⊤, the same arguments still hold—i.e., B = W♮G♮ holds for a certain invertible G♮. Therefore, the determinant minimization-based methods can also be applied to identify W♮, if W♮ is sufficiently scattered. Nevertheless, since under the model X = W♮W♮^⊤ we have G♮G♮^⊤ = I, an even simpler reformulation can be employed:

  minimize_{GG^⊤ = I, W ≥ 0}   ‖B − WG‖²_F,

which can be handled by two-block coordinate descent, and each subproblem admits a very simple solution [9]; to be more specific, the updates are as follows:

  G^(t+1) ← UV^⊤,  where UΣV^⊤ = svd((W^(t))^⊤B),
  W^(t+1) ← max{B(G^(t+1))^⊤, 0},

which are both easy to implement. The viability of the described algorithms for X = C♮E♮C♮^⊤ and X = W♮W♮^⊤ is a result of the fact that the identifiability of C♮ and W♮ in the two models has nothing to do with the structure of the respective G♮'s, if C♮ and W♮ are sufficiently scattered. In other words, for complicated models, if one can identify range(C♮) or range(W♮) in a reliable way, the identifiability of the associated nonnegative factors is not lost—this insight can substantially simplify the identification procedure in some applications.
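The two-block scheme above takes only a few lines of NumPy. The sketch below is our own rendering of the updates from [9] under illustrative choices (B taken as a symmetric eigendecomposition square root, a fixed iteration count, random nonnegative initialization); it alternates the Procrustes step for G and the thresholding step for W.

```python
import numpy as np

def symmetric_nmf_subspace(X, R, n_iter=200, seed=0):
    """Fit X ~ W W^T with W >= 0 via the subspace reformulation min ||B - W G||_F,
    where G is orthogonal and B is an M x R square root of the PSD matrix X."""
    rng = np.random.default_rng(seed)
    vals, vecs = np.linalg.eigh(X)
    idx = np.argsort(vals)[::-1][:R]
    B = vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))   # square root from top-R eigenpairs
    W = np.abs(rng.standard_normal((X.shape[0], R)))          # nonnegative initialization
    for _ in range(n_iter):
        U, _, Vt = np.linalg.svd(W.T @ B, full_matrices=False)
        G = U @ Vt                                            # Procrustes step: orthogonal G
        W = np.maximum(B @ G.T, 0.0)                          # projection step: nonnegative W
    return W

# Quick check on synthetic data with a sparse (hence likely sufficiently scattered) ground truth.
rng = np.random.default_rng(1)
W0 = rng.exponential(size=(30, 4)) * (rng.random((30, 4)) > 0.5)
W_hat = symmetric_nmf_subspace(W0 @ W0.T, R=4)
rel_err = np.linalg.norm(W0 @ W0.T - W_hat @ W_hat.T) / np.linalg.norm(W0 @ W0.T)
print(rel_err)   # relative fit residual; small when the iterates converge to a good factorization
```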