Electrical and Computer Engineering Publications Electrical and Computer Engineering

6-15-2017

A Nonconvex Splitting Method for Symmetric Nonnegative Matrix Factorization: Convergence Analysis and Optimality

Songtao Lu Iowa State University, [email protected]

Mingyi Hong Iowa State University

Zhengdao Wang Iowa State University, [email protected]



Abstract Symmetric nonnegative matrix factorization (SymNMF) has important applications in data analytics problems such as document clustering, community detection, and image segmentation. In this paper, we propose a novel nonconvex variable splitting method for solving SymNMF. The proposed algorithm is guaranteed to converge to the set of Karush-Kuhn-Tucker (KKT) points of the nonconvex SymNMF problem. Furthermore, it achieves a global sublinear convergence rate. We also show that the algorithm can be efficiently implemented in parallel. Further, sufficient conditions are provided that guarantee the global and local optimality of the obtained solutions. Extensive numerical results performed on both synthetic and real data sets suggest that the proposed algorithm converges quickly to a local minimum solution.

Keywords Symmetric nonnegative matrix factorization, Karush-Kuhn-Tucker points, variable splitting, global and local optimality, clustering

Disciplines Electrical and Computer Engineering | Industrial Technology | Signal Processing

Comments This is a manuscript of an article published as Lu, Songtao, Mingyi Hong, and Zhengdao Wang. "A nonconvex splitting method for symmetric nonnegative matrix factorization: Convergence analysis and optimality." IEEE Transactions on Signal Processing (2017). DOI: 10.1109/TSP.2017.2679687. Posted with permission.

arXiv:1703.08267v1 [math.OC] 24 Mar 2017

A Nonconvex Splitting Method for Symmetric Nonnegative Matrix Factorization: Convergence Analysis and Optimality

Songtao Lu, Student Member, IEEE, Mingyi Hong, Member, IEEE, and Zhengdao Wang, Fellow, IEEE

Manuscript received May 15, 2016; revised October 6, 2016, January 6, 2017, and February 16, 2017; accepted February 20, 2017. The associate editor coordinating the review of this manuscript and approving it for publication was Marco Moretti. Part of this paper was presented at the 42nd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, March 5-9, 2017. This work was supported in part by NSF under Grants No. 1523374 and No. 1526078, and by AFOSR under Grant No. 15RT0767.

Songtao Lu and Zhengdao Wang are with the Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA (emails: {songtao, zhengdao}@iastate.edu). Mingyi Hong is with the Department of Industrial and Manufacturing Systems Engineering, Iowa State University, Ames, IA 50011, USA (email: [email protected]).

I. INTRODUCTION

Nonnegative matrix factorization (NMF) refers to factoring a given matrix into the product of two matrices whose entries are all nonnegative. It has long been recognized as an important matrix decomposition problem [1], [2]. The requirement that the two factors are component-wise nonnegative makes NMF distinct from traditional methods such as the principal component analysis (PCA) and linear discriminant analysis (LDA), leading to many interesting applications in imaging, signal processing and machine learning [3]-[7]; see [8] for a recent survey. When further requiring that the two factors are identical after transposition, the NMF becomes the so-called symmetric NMF (SymNMF). In the case where the given nonnegative symmetric matrix cannot be factorized exactly, an approximate solution with a suitably defined approximation error is desired. Mathematically, SymNMF approximates a given (usually symmetric) nonnegative matrix Z ∈ R^{N×N} by a low rank matrix XX^T, where the factor matrix X ∈ R^{N×K} is component-wise nonnegative, typically with K ≪ N. Let ||·||_F denote the Frobenius norm. The problem can be formulated as a nonconvex optimization problem [9]-[11]:

  min_{X≥0} f(X) = (1/2)||XX^T − Z||_F^2.   (1)

Recently, SymNMF has found many applications in document clustering, image segmentation, community detection, bioinformatics and pattern clustering [9], [11], [12]. An important class of clustering methods is the spectral clustering, e.g., [13], [14], which is based on the eigenvalue decomposition of some transformed graph Laplacian matrix. In [15], it has been shown that spectral clustering and SymNMF are two different ways of relaxing the kernel K-means clustering, where the former relaxes the nonnegativity constraint while the latter relaxes certain orthogonality constraint. SymNMF also has the advantage of often yielding more meaningful and interpretable results [11].
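As a quick illustration of the objective in (1) and of the relative error ||XX^T − Z||_F^2 / ||Z||_F^2 used later in the numerical experiments, the following is a minimal NumPy sketch; the function name symnmf_objective is ours, not from the paper.

```python
import numpy as np

def symnmf_objective(Z, X):
    """Return f(X) from (1) and the relative error used in the experiments."""
    R = X @ X.T - Z                       # residual of the symmetric factorization
    f = 0.5 * np.linalg.norm(R, 'fro') ** 2
    rel = np.linalg.norm(R, 'fro') ** 2 / np.linalg.norm(Z, 'fro') ** 2
    return f, rel
```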
A. Related Work

Due to the importance of the NMF problem, many algorithms have been proposed in the literature for finding its high-quality solutions. Well-known algorithms include the multiplicative update [6], [16], the projected gradient method [18], the alternating nonnegative least squares (ANLS) with the active set method [17], as well as a few recent methods such as the bilinear generalized approximate message passing [19] and methods based on the block coordinate descent [20]. These methods often possess strong convergence guarantees (to Karush-Kuhn-Tucker (KKT) points of the NMF problem) and most of them lead to satisfactory performance in practice; see [8] and the references therein for a detailed comparison and comments on different algorithms for NMF. Unfortunately, the aforementioned methods lack effective mechanisms to enforce the symmetry between the resulting factors, therefore they are not directly applicable to SymNMF. Recently, there have been works focusing on customized algorithms for SymNMF, which we review below.

A simple strategy is to ignore the equality constraint X = Y. To this end, first rewrite SymNMF equivalently as

  min_{X≥0, Y≥0} (1/2)||XY^T − Z||_F^2, s.t. X = Y,   (2)

and then alternatingly perform the following two steps: 1) solving Y with X being fixed (a nonnegative least squares problem); 2) solving X with Y being fixed (a least squares problem). Such ANLS algorithm has been proposed in [11] for dealing with SymNMF. Unfortunately, despite the fact that an optimal solution can be obtained in each subproblem, there is no guarantee that the Y-iterate will converge to the X-iterate. The algorithm in [11] adds a regularized term for the difference between the two factors to the objective function and explicitly enforces that the two matrices are equal at the output. Such an extra step enforces symmetry, but unfortunately also leads to the loss of global convergence guarantees. A related ANLS-based method has been introduced in [10]; however the algorithm is based on the assumption that there exists an exact symmetric factorization (i.e., there exists X ≥ 0 such that XX^T = Z). Without such assumption, the algorithm may not converge to the set of KKT points¹ of problem (1). A multiplicative update for SymNMF has been proposed in [9], but the algorithm lacks convergence guarantees (to KKT points of problem (1)) [21], and has a much slower convergence speed than the one proposed in [10]. In [11], [22], algorithms based on the projected gradient descent (PGD) and the projected Newton (PNewton) have been proposed, both of which directly solve the original formulation (1). Again there has been no global convergence analysis since the objective function is a nonconvex fourth-order polynomial. More recently, the work [23] applies the nonconvex coordinate descent (CD) algorithm for SymNMF. Due to the fact that the minimizer of the fourth order polynomial is not unique in each coordinate updating, the CD-based method may not converge to stationary points.

¹Let d(a, s) denote the distance between two points a and s. We say that a sequence {a_i} converges to a set S if the distance between a_i and S, defined as inf_{s∈S} d(a_i, s), converges to zero, as i → ∞.

Another popular method for NMF is based on the alternating direction method of multipliers (ADMM), which is a flexible tool for large scale convex optimization [24]. For example, using ADMM for both NMF and matrix completion, high quality results have been obtained in [25] for gray-scale and hyperspectral image recovery. Furthermore, ADMM has been applied to generalized versions of NMF where the objective function is the general beta-divergence [26]. A hybrid alternating optimization and ADMM method was proposed for NMF, as well as tensor factorization, under a variety of constraints and loss measures in [27]. However, despite the promising numerical results, none of the works discussed above has rigorous theoretical justification for SymNMF. Recently, the work [28] has applied the ADMM for NMF and provided one of the first analyses for using ADMM to solve nonconvex matrix-factorization type problems. However, it is important to note that the algorithm in [28] does not apply to the SymNMF case, because our problem is more restrictive in that symmetric factors are desired, while in NMF symmetry is not enforced. Technically, imposing symmetry poses much difficulty in the analysis (we will comment on this point shortly). In fact, the convergence of ADMM for SymNMF is still open in the literature.

An important research question for NMF and SymNMF is whether it is possible to design algorithms that lead to globally optimal solutions. At first sight such problem appears very challenging since finding the exact NMF is NP-hard [29], and checking whether a positive semidefinite matrix can be decomposed exactly by SymNMF is also NP-hard [30]. However, some promising recent findings suggest that when the structure of the underlying factors is appropriately utilized, it is possible to obtain rather strong results. For example, in [31], the authors have shown that for the low rank factorized stochastic optimization problem where the two low rank matrices are symmetric, a modified stochastic gradient descent algorithm is capable of converging to a global optimum with constant probability from a random starting point. Related works also include [32]-[34]. However, when the factors are required to be nonnegative and symmetric, it is no longer clear whether the existing analysis can still be used to show convergence to global/local optimal points. For the nonnegative principal component problem (i.e., finding the leading nonnegative eigenvector) under the spiked model, reference [35] shows that a certain approximate message passing algorithm is able to find the global optimal solution asymptotically. Unfortunately, this analysis does not generalize to an arbitrary symmetric observation matrix for the case K > 1. To our best knowledge, a characterization of global and local optimal solutions for SymNMF is still lacking.

B. Contributions

In this paper, we first propose a novel algorithm for SymNMF, which utilizes nonconvex splitting and is capable of converging to the set of KKT points with a provable global convergence rate. The main idea is to relax the symmetry requirement at the beginning and gradually enforce it as the algorithm proceeds. Second, we provide a number of easy-to-check sufficient conditions guaranteeing the local or global optimality of the obtained solutions. Numerical results on both synthetic and real data show that the proposed algorithm achieves fast and stable convergence (often to local minimum solutions) with low computational complexity.

More specifically, the main contributions of this paper are:

1) We design a novel nonconvex splitting SymNMF (NS-SymNMF) algorithm, which converges to the set of KKT points of SymNMF with a global sublinear rate. To our best knowledge, it is the first SymNMF solver that possesses global convergence rate guarantees.

2) We provide a set of easily checkable sufficient conditions (which only involve finding the smallest eigenvalue of a certain matrix) that characterize the global and local optimality of the solutions. By utilizing such conditions, we demonstrate numerically that with high probability, our proposed algorithm converges not only to the set of KKT points but to a local optimal solution as well.

Notation: Bold upper case letters without subscripts (e.g., X, Y) denote matrices and bold lower case letters without subscripts (e.g., x, y) represent vectors. The notation Z_{i,j} denotes the (i, j)-th entry of matrix Z. Vector X_i denotes the ith row of matrix X and X'_m denotes the mth column of the matrix.

II. THE PROPOSED ALGORITHM

The proposed algorithm leverages the reformulation (2). Our main idea is to gradually tighten the difficult equality constraint X = Y as the algorithm proceeds so that when convergence is approached, such equality is eventually satisfied. To this end, let us construct the augmented Lagrangian for (2), given by

  L(X, Y; Λ) = (1/2)||XY^T − Z||_F^2 + <Y − X, Λ> + (ρ/2)||Y − X||_F^2,   (3)

where Λ ∈ R^{N×K} is a matrix of dual variables, <·,·> denotes the inner product operator, and ρ > 0 is a penalty parameter whose value will be determined later.

It may be tempting to directly apply the well-known ADMM method to the augmented Lagrangian (3), which alternatingly minimizes the primal variables X and Y, followed by a dual ascent step Λ ← Λ + ρ(Y − X). Unfortunately, the classical results for ADMM presented in [24], [36], [37] only work for convex problems, hence they do not apply to our nonconvex problem (2) (note this is a linearly constrained nonconvex problem where the nonconvexity arises in the objective function). Recent results such as [38]-[41] that analyze ADMM for nonconvex problems do not apply either, because in these works the basic requirements are: 1) the objective function is separable over the block variables; 2) the smooth part of the augmented Lagrangian has Lipschitz continuous gradient with respect to all variable blocks. Unfortunately neither of these conditions is satisfied in our problem.

Next we begin presenting the proposed algorithm. We start by considering the following reformulation of problem (1):

  min_{X, Y} (1/2)||XY^T − Z||_F^2, s.t. Y ≥ 0, X = Y, ||Y_i||_2^2 ≤ τ, ∀i,   (4)

where τ > 0 is some given constant.

Let Ω* denote the dual matrix for the constraint X ≥ 0 in the Lagrangian of problem (1). The KKT conditions of problem (1) are given by [42, eq. (5.49)]

  2(X*(X*)^T − (Z^T + Z)/2)X* − Ω* = 0,   (5a)
  Ω* ≥ 0,   (5b)
  X* ≥ 0,   (5c)
  X* ∘ Ω* = 0,   (5d)

where ∘ denotes the Hadamard product. For a point X*, if we can find some Ω* such that (X*, Ω*) satisfies conditions (5a)-(5d), then we term X* a KKT point of problem (1). A stationary point for problem (1) is a point X* that satisfies the following optimality condition [43, Proposition 2.1.2]:

  <(X*(X*)^T − (Z^T + Z)/2)X*, X − X*> ≥ 0, ∀X ≥ 0.   (6)

It can be checked that when τ in (4) is sufficiently large (larger than a threshold dependent on Z), then problem (4) is equivalent to problem (1), in the sense that the KKT points of the two problems are identical. Also, for the SymNMF problem there is a one-to-one correspondence between the KKT points and the stationary points, although in general such one-to-one correspondence may not hold. To be more precise, we have:

Lemma 1. For problem (1), a point X* is a KKT point, which means there exists some Ω* such that (X*, Ω*) satisfies (5a)-(5d), if and only if X* is a stationary point, which means it satisfies (6).

Proof: See Section VII-A.

Lemma 2. Suppose τ > θ_k, ∀k, where

  θ_k ≜ (Z_{k,k} + sqrt(Z_{k,k}^2 + (1/2)∑_{i=1}^N (Z_{i,k} + Z_{k,i})^2)) / 2,   (7)

then the KKT points of problem (1) and the KKT points of problem (4) have a one-to-one correspondence.

Proof: See Section VII-B.

We remark that the previous work [23] has made the observation that solving SymNMF with the additional constraints ||X_i||_2^2 ≤ ||Z||_F, ∀i, will not result in any loss of the global optimality. Lemma 2 provides a stronger result, that all KKT points of SymNMF are preserved within a smaller bounded feasible set Y ≜ {Y | Y_i ≥ 0, ||Y_i||_2^2 ≤ τ, ∀i} (note that τ ≪ ||Z||_F in general).

The proposed NS-SymNMF algorithm alternates between the primal updates of variables X and Y, and the dual update for Λ. Below we present its detailed steps (superscript t is used to denote the iteration number):

  Y^{(t+1)} = arg min_{Y≥0, ||Y_i||_2^2≤τ, ∀i} (1/2)||X^{(t)}Y^T − Z||_F^2 + (ρ/2)||Y − X^{(t)} + Λ^{(t)}/ρ||_F^2 + (β^{(t)}/2)||Y − Y^{(t)}||_F^2,   (8)
  X^{(t+1)} = arg min_X (1/2)||X(Y^{(t+1)})^T − Z||_F^2 + (ρ/2)||X − Λ^{(t)}/ρ − Y^{(t+1)}||_F^2,   (9)
  Λ^{(t+1)} = Λ^{(t)} + ρ(Y^{(t+1)} − X^{(t+1)}),   (10)
  β^{(t+1)} = (6/ρ)||X^{(t+1)}(Y^{(t+1)})^T − Z||_F^2.   (11)

We remark that this algorithm is very close in form to the standard ADMM method applied to problem (4) (which lacks convergence guarantees). The key difference is the use of the proximal term ||Y − Y^{(t)}||_F^2 multiplied by an iteration dependent penalty parameter β^{(t)} ≥ 0, whose value is proportional to the size of the objective value. Intuitively, if the algorithm converges to a solution with a small objective value, then the parameter β^{(t)} vanishes in the limit. Introducing such a proximal term is one of the main novelties of the algorithm, and it is crucial in guaranteeing the convergence of NS-SymNMF.
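To make the order of the updates concrete, the following is a minimal NumPy sketch of one NS-SymNMF iteration implementing (8)-(11). It is an illustration under stated assumptions, not the authors' code: the helper names (project_rows, ns_symnmf_step) are ours, the Y-subproblem (8) is solved approximately with the projected gradient option detailed later in Section IV, and the X-subproblem (9) uses the closed form (24).

```python
import numpy as np

def project_rows(Y, tau):
    """Row-wise projection onto {y : y >= 0, ||y||_2^2 <= tau}, cf. (28)-(29)."""
    Yp = np.maximum(Y, 0.0)
    norms = np.linalg.norm(Yp, axis=1, keepdims=True)
    return Yp * np.minimum(1.0, np.sqrt(tau) / np.maximum(norms, 1e-12))

def ns_symnmf_step(Z, X, Y, Lam, beta, rho, tau, gp_iters=40):
    """One iteration of the updates (8)-(11) for dense arrays (a sketch)."""
    K = X.shape[1]
    # Y-update (8): projected gradient on the strongly convex subproblem.
    A_Y = X.T @ X + (rho + beta) * np.eye(K)          # cf. (27)
    B_Y = Z.T @ X + rho * X - Lam + beta * Y          # linear term, cf. (26)
    alpha = 1.0 / np.linalg.eigvalsh(A_Y)[-1]         # constant step 1/lambda_max(A_Y)
    Y_new = Y.copy()
    for _ in range(gp_iters):
        Y_new = project_rows(Y_new - alpha * (Y_new @ A_Y - B_Y), tau)
    # X-update (9): unconstrained least squares solved in closed form as in (24).
    A_X = Y_new.T @ Y_new + rho * np.eye(K)
    Z_X = Z @ Y_new + Lam + rho * Y_new
    X_new = np.linalg.solve(A_X.T, Z_X.T).T
    # Dual ascent (10) and proximal weight update (11).
    Lam_new = Lam + rho * (Y_new - X_new)
    beta_new = 6.0 / rho * np.linalg.norm(X_new @ Y_new.T - Z, 'fro') ** 2
    return X_new, Y_new, Lam_new, beta_new
```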

III. CONVERGENCE ANALYSIS

In this section we provide convergence analysis of NS-SymNMF for a general SymNMF problem. We do not require Z to be symmetric, positive-semidefinite, or to have positive entries. We assume K can be any integer in [1, N].

A. Convergence and Convergence Rate

Below we present our first main result, which asserts that when the penalty parameter ρ is sufficiently large, the NS-SymNMF algorithm converges globally to the set of KKT points of problem (1).

Theorem 1. Suppose the following is satisfied:

  ρ > 6Nτ.   (12)

Then the following statements are true for NS-SymNMF:
1) The equality constraint is satisfied in the limit, i.e., lim_{t→∞} ||X^{(t)} − Y^{(t)}||_F^2 → 0.
2) The sequence {X^{(t)}, Y^{(t)}, Λ^{(t)}} generated by the algorithm is bounded, and every limit point of the sequence is a KKT point of problem (1).

An equivalent statement on the convergence is that the sequence {X^{(t)}, Y^{(t)}, Λ^{(t)}} converges to the set of KKT points of problem (1); cf. footnote 1.

Proof: See Section VII-C.

Our second result characterizes the convergence rate of the algorithm. To this end, we construct a function that measures the optimality of the iterates {X^{(t)}, Y^{(t)}, Λ^{(t)}}. Define the proximal gradient of the augmented Lagrangian function as

  ∇̃L(X, Y, Λ) ≜ [ Y − proj_Y[Y − ∇_Y L(X, Y, Λ)] ; ∇_X L(X, Y, Λ) ],

where

  proj_Y(W) ≜ arg min_{Y≥0, ||Y_i||_2^2≤τ, ∀i} ||W − Y||_F^2,   (13)

i.e., it is the projection operator that projects a given matrix W onto the feasible set of Y. Here we propose to use the following quantity to measure the progress of the algorithm:

  P(X^{(t)}, Y^{(t)}, Λ^{(t)}) ≜ ||∇̃L(X^{(t)}, Y^{(t)}, Λ^{(t)})||_F^2 + ||X^{(t)} − Y^{(t)}||_F^2.   (14)

It can be verified that if lim_{t→∞} P(X^{(t)}, Y^{(t)}, Λ^{(t)}) = 0, then a KKT point of problem (1) is obtained. Below we show that the function P(X^{(t)}, Y^{(t)}, Λ^{(t)}) goes to zero in a sublinear manner.

Theorem 2. For a given small constant ε, let T(ε) denote the iteration index satisfying the following inequality:

  T(ε) ≜ min{ t | P(X^{(t)}, Y^{(t)}, Λ^{(t)}) ≤ ε, t ≥ 0 }.   (15)

Then there exists some constant C > 0 such that

  ε ≤ C L(X^{(1)}, Y^{(1)}, Λ^{(1)}) / T(ε).   (16)

Proof: See Section VII-D.

The result indicates that it takes O(1/ε) iterations for P(X^{(t)}, Y^{(t)}, Λ^{(t)}) to be less than ε. It follows that NS-SymNMF converges sublinearly.

B. Sufficient Global and Local Optimality Conditions

Since problem (1) is not convex, the KKT points obtained by NS-SymNMF could be different from the global optimal solutions. Therefore it is important to characterize the conditions under which these two different types of solutions coincide. Below we provide an easily checkable sufficient condition to ensure that a KKT point X* is also a globally optimal solution for problem (1).

Theorem 3. Suppose that X* is a KKT point of problem (1). Then, X* is also a global optimal point if the following is satisfied:

  S ≜ (Z^T + Z)/2 − X*(X*)^T ⪯ 0.   (17)

Proof: See Section VII-E.

It is important to note that condition (17) is only a sufficient condition and hence may be difficult to satisfy in practice. In this section we provide a milder condition which ensures that a KKT point is locally optimal. This type of result is also very useful in practice since it can help identify spurious saddle points such as the point X* = 0 in the case where Z^T + Z is not negative semidefinite.

We have the following characterization of the local optimal solutions of the SymNMF problem.

Theorem 4. Suppose that X* is a KKT point of problem (1). Define a matrix T ∈ R^{KN×KN} whose (m, n)th block is a matrix of size N × N given by

  T_{m,n} ≜ ((X'*_m)^T X'*_n − δ||X'*_n||_2^2) I + X'*_n (X'*_m)^T + δ_{m,n} S,   (18)

where S is defined in (17), δ_{m,n} is the Kronecker delta function, and X'*_m denotes the mth column of X*. If there exists some δ > 0 such that T ≻ 0, then X* is a strict local minimum solution of problem (1), meaning that there exists some ε > 0 small enough such that for all X ≥ 0 satisfying ||X − X*||_F ≤ ε, we have

  f(X) ≥ f(X*) + (γ/2)||X − X*||_F^2.   (19)

Here the constant γ is given by

  γ = −(2K^2/δ + K(K − 2))ε^2 + 2λ_min(T) > 0,   (20)

where λ_min(T) > 0 is the smallest eigenvalue of T.

Proof: See Section VII-F.

In the special case of K = 1, the sufficient condition set forth in Theorem 4 can be significantly simplified.

Corollary 1. Suppose that x* is the KKT point of problem (1) when K = 1. If there exists some δ > 0 such that

  T_1 ≜ (1 − δ)||x*||_2^2 I + 2x*(x*)^T − (Z^T + Z)/2 ≻ 0,   (21)

then x* is a strict local minimum point of problem (1).

Proof: See Section VII-G.

We comment that the condition given in Theorem 4 is much milder than that in Theorem 3. Further, such condition is also very easy to check as it only involves finding the smallest eigenvalue of a KN × KN matrix for a given δ.² In our numerical results (to be presented shortly), we set a series of consecutive δ when performing the test. We have observed that the solutions generated by NS-SymNMF satisfy the condition provided in Theorem 4 with high probability.

²To find such smallest eigenvalue, we can find the largest eigenvalue of ηI − T, using algorithms such as the power method [14], where η is sufficiently large based on τ and ||Z||_F.
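To illustrate how the local optimality condition of Theorem 4 can be checked in practice, the sketch below assembles T from (18) for a candidate δ and returns the smallest eigenvalue of its symmetric part (which governs the quadratic form used in the proof); the function name and the δ sweep are ours, not from the paper.

```python
import numpy as np

def local_opt_certificate(Z, X_star, delta):
    """Build T from (18) and return the smallest eigenvalue of its symmetric part."""
    N, K = X_star.shape
    S = 0.5 * (Z + Z.T) - X_star @ X_star.T            # cf. (17)
    T = np.zeros((K * N, K * N))
    for m in range(K):
        xm = X_star[:, m]
        for n in range(K):
            xn = X_star[:, n]
            block = (xm @ xn - delta * (xn @ xn)) * np.eye(N) + np.outer(xn, xm)
            if m == n:
                block += S
            T[m * N:(m + 1) * N, n * N:(n + 1) * N] = block
    return np.linalg.eigvalsh(0.5 * (T + T.T))[0]

# Sweep delta downward from 1 in steps of 0.01, as described in the experiments:
# found = next((d for d in np.arange(1.0, 0.0, -0.01)
#               if local_opt_certificate(Z, X_star, d) > 0), None)
```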

IV. IMPLEMENTATION

In this section we discuss the implementation of the proposed algorithm.

A. The X-Subproblem

The subproblem for updating X^{(t+1)} in (9) is equivalent to the following problem:

  min_X ||Z_X^{(t+1)} − X A_X^{(t+1)}||_F^2,   (22)

where

  Z_X^{(t+1)} ≜ ZY^{(t+1)} + Λ^{(t)} + ρY^{(t+1)},   (23)
  A_X^{(t+1)} ≜ (Y^{(t+1)})^T Y^{(t+1)} + ρI ≻ 0

are two fixed matrices. Clearly problem (22) is just a least squares problem and can be solved in closed form. The solution is given by

  X^{(t+1)} = Z_X^{(t+1)} (A_X^{(t+1)})^{-1}.   (24)

We remark that A_X^{(t+1)} is a K × K matrix, where K is usually small (e.g., the number of clusters for graph clustering applications). As a result, X^{(t+1)} in (24) can be obtained by solving a small system of linear equations and is hence computationally cheap.

B. The Y-Subproblem

The Y-subproblem (8) can be decomposed into N separable constrained least squares problems, each of which can be solved independently, and hence can be implemented in parallel. We may use the conventional gradient projection (GP) for solving each subproblem, using iterations

  Y_i^{(r+1)} = proj_Y(Y_i^{(r)} − α(A_Y^{(t)} Y_i^{(r)} − Z_{Y,i}^{(t)})),   (25)

where

  Z_Y^{(t)} ≜ (X^{(t)})^T Z + ρ(X^{(t)})^T − (Λ^{(t)})^T + β^{(t)}(Y^{(t)})^T,   (26)
  A_Y^{(t)} ≜ (X^{(t)})^T X^{(t)} + (ρ + β^{(t)})I ≻ 0,   (27)

Z_{Y,i} denotes the ith column of matrix Z_Y, α is the step size, which is chosen either as a constant 1/λ_max(A_Y^{(t)}) or by using some line search procedure [43], and r denotes the iteration of the inner loop. For a given vector w, proj_Y(w) denotes the projection of it onto the feasible set of Y_i, which can be evaluated in closed form [44, pp. 80] as follows:

  w^+ = proj_+(w) ≜ max{w, 0_{K×1}},   (28)
  Y_i = proj_{||w^+||_2^2 ≤ τ}(w^+) ≜ √τ w^+ / max{√τ, ||w^+||_2}.   (29)

Other algorithms such as accelerated versions of the gradient projection [45] can also be used to solve the Y-subproblem. It is also worth noting that when Z is sparse, the complexity of computing ZY^{(t+1)} in (23) and (X^{(t)})^T Z in (26) is only proportional to the number of nonzero entries of Z.

V. NUMERICAL RESULTS

In this section, we compare the proposed algorithm with a few existing SymNMF solvers on both synthetic and real data sets. We run each algorithm with 20 random initializations (except for SNMF, which does not require external initialization). The entries of the initialized X (or Y) follow an i.i.d. uniform distribution in the range [0, τ]. All algorithms are started with the same initial point each time, and all tests are performed using Matlab on a computer with an Intel Core i5-5300U CPU running at 2.30GHz with 8GB RAM. Since the compared algorithms have different computational complexity, we use the objective values versus CPU time for fair comparison. We next describe the different SymNMF solvers that are compared in our work.

Algorithms Comparison. In our numerical simulations, we compare the following algorithms.

a) Projected Gradient Descent (PGD) and Projected Newton method (PNewton) [11], [22]: The PGD and PNewton methods directly use the gradient of the objective function. The key difference between them is that PGD adopts the identity matrix as a scaling matrix while PNewton exploits a reduced Hessian for accelerating the convergence rate. The PGD algorithm converges slowly if the step size is not well selected, while the PNewton algorithm has high per-iteration complexity compared with ANLS and NS-SymNMF, due to the requirement of computing the Hessian. Note that, to the best of our knowledge, neither PGD nor PNewton possesses convergence or rate of convergence guarantees.

b) Alternating Nonnegative Least Squares (ANLS) [11]: The ANLS method is a very competitive SymNMF solver, which can be implemented in parallel easily. ANLS reformulates SymNMF as

  min_{X,Y≥0} g(X, Y) = ||XY^T − Z||_F^2 + ν||X − Y||_F^2,

where ν > 0 is the regularization parameter. One of the shortcomings is that there is no theoretical guarantee that the ANLS method can converge to the set of KKT points of problem (1) or even produce two symmetric factors, although a penalty term for the difference between the factors (X and Y) is included in the objective.

c) Symmetric Nonnegative Matrix Factorization (SNMF) [10]: The SNMF algorithm transforms the original problem to another one under the assumption that Z can be exactly decomposed by XX^T. Although SNMF often converges quickly in practice, there has been no theoretical analysis under the general case where Z cannot be exactly decomposed.

d) Coordinate Descent (CD) [23]: The CD method updates each entry of X in a cyclic way. For updating each entry, we only need to find the roots of a fourth-order univariate function. However, CD may not converge to the set of KKT points of SymNMF.

[Figure 1 plots the relative objective value (log scale); legend: NS-SymNMF, PGD [22], PNewton [22], ANLS [11], SNMF [10], CD [23].]

(a) N = 500, K = 60. (b) N = 500, K = 60, and Z is a full rank matrix.

Fig. 1. Data Set I: the convergence behaviors of different SymNMF solvers; each point in the figures is an average of 20 independent MC trials.

[Figure 2 plots the relative objective value (log scale); legend: NS-SymNMF, PGD [22], PNewton [22], ANLS [11], SNMF [10], CD [23].]

(a) Objective Value (b) Optimality Gap

Fig. 2. Data Set II: the convergence behaviors of different SymNMF solvers; each point in the figures is an average of 20 independent MC trials; N = 2000, K = 4.

Instead, there is an additional condition given in [23] for checking whether the generated sequence converges to a unique limit point. A heuristic method for checking the condition is additionally provided, which requires, e.g., plotting the norm between the different iterates.

e) The Proposed NS-SymNMF: The update rule of NS-SymNMF is similar to that of ANLS. The difference between them is that NS-SymNMF uses one additional block for dual variables and ANLS adds a penalty term. The dual update involved in NS-SymNMF benefits the convergence of the algorithm to KKT points of SymNMF.

We remark that in the implementation of NS-SymNMF we let τ = max_k θ_k (cf. (7)) and the maximum number of iterations of GP be 40. Also, we gradually increase the value of ρ from an initial value to meet condition (12) for accelerating the convergence rate [46]. Here, the choice of ρ follows ρ^{(t+1)} = min{ρ^{(t)}/(1 − ε/ρ^{(t)}), 6.1Nτ}, where ε = 10^{-3} as suggested in [47]. We choose ρ^{(1)} = τ̄ for the case that Z can be exactly decomposed and √N τ̄ for the rest of the cases, where τ̄ is the mean of θ_k, ∀k. A similar strategy is also applied for updating β^{(t)}. We choose β^{(t)} = 6ξ^{(t)}||X^{(t)}(Y^{(t)})^T − Z||_F^2/ρ^{(t)}, where ξ^{(t+1)} = min{ξ^{(t)}/(1 − ε/ξ^{(t)}), 1} and ξ^{(1)} = 0.01, and only update β^{(t)} once every 100 iterations to save CPU time.

To update Y, we implement the block pivoting method [17], since such method is faster than the GP method for solving the nonnegative least squares problem. If ||Y_i^{(t+1)}||_2^2 ≤ τ is not satisfied, then we switch to GP on Y_i^{(t)}. We also remark that we set the step size of PGD to 10^{-5} for all tested cases, and use the Matlab codes of PNewton and ANLS from http://math.ucla.edu/~dakuang/.
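The parameter schedules quoted above can be written compactly as follows; this is a small sketch with names of our choosing, where obj_sq stands for ||X^{(t)}(Y^{(t)})^T − Z||_F^2 and, as described, the β update would only be applied every 100 iterations.

```python
def penalty_schedules(rho, xi, obj_sq, N, tau, eps=1e-3):
    """beta^(t) = 6*xi^(t)*obj_sq/rho^(t); rho and xi are pushed toward 6.1*N*tau and 1."""
    beta = 6.0 * xi * obj_sq / rho
    rho_next = min(rho / (1.0 - eps / rho), 6.1 * N * tau)
    xi_next = min(xi / (1.0 - eps / xi), 1.0)
    return beta, rho_next, xi_next
```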

Performance on Synthetic Data. First we describe the two synthetic data sets that we have used in the first part of the numerical results.

Data set I (Random symmetric matrices): We randomly generate two types of symmetric matrices, one of low rank and the other of full rank. For the low rank matrix, we first generate a matrix M with dimension N × K, whose entries follow an i.i.d. Gaussian distribution with zero mean and unit variance. We use M_{i,j} to denote the (i, j)th entry of M. Then we generate a new matrix M̃ whose (i, j)th entry is |M_{i,j}|. Finally, we obtain a positive symmetric Z = M̃M̃^T as the given matrix to be decomposed. For the full rank matrix, we first randomly generate a N × N matrix P, whose entries follow an i.i.d. uniform distribution in the interval [0, 1]. Then we compute Z = (P + P^T)/2.

Data set II (Adjacency matrices): One important application of SymNMF is graph partitioning, where the adjacency matrix of a graph is factorized. We randomly generate a graph as follows. First, set the number of nodes to N and the number of clusters to 4, and the numbers of nodes within each cluster to 300, 500, 800, 400. Second, we randomly generate data points whose relative distance will be used to construct the adjacency matrix. Specifically, data points {x_i ∈ R}, i = 1, ..., N, are generated in one dimension. Within one cluster, data points follow an i.i.d. Gaussian distribution. The means of the random variables in these 4 clusters are 2, 3, 6, 8, respectively, and the variance is 0.5 for all distributions. Construct the similarity matrix A ∈ R^{N×N}, whose (i, j)th entry is A_{i,j} = exp(−(x_i − x_j)^2/(2σ^2)), where σ^2 = 0.5.

The convergence behaviors of different SymNMF solvers for the synthetic data sets are shown in Figure 1 and Figure 2. The results are averaged over 20 Monte Carlo (MC) trials with independently generated data. In Figure 1(a), the generated Z can be exactly decomposed by SymNMF. It can be observed that NS-SymNMF and SNMF converge to the global optimal solution quickly, and SNMF is the fastest one among all compared algorithms. However, the case where the matrix can be exactly factorized is not common in most practical applications. Hence, we also consider the case where matrix Z cannot be factorized exactly by a N × K matrix. The results are shown in Figure 1(b) and we use the relative objective value for comparison, i.e., ||XX^T − Z||_F^2/||Z||_F^2. We can observe that NS-SymNMF and CD can achieve a lower objective value than other methods. It is worth noting that there is a gap between SNMF and the others, since the assumption of SNMF is not satisfied in this case.

We also implement the algorithms on the adjacency matrices (data set II), where the results are shown in Figure 2. The NS-SymNMF and SNMF algorithms converge very fast, but it can be observed that there is still a gap between SNMF and NS-SymNMF as shown in Figure 2(a). We further show the convergence rates with respect to the optimality gap versus CPU time in Figure 2(b). The optimality gap (14) measures the closeness between the generated sequence and the true stationary point. To get rid of the effect of the dimension of Z, we use ||X − proj_+[X − ∇_X f(X)]||_∞ as the optimality gap. It is interesting to see the "swamp" effect [48], where the objective value generated by the CD algorithm remains almost constant during the time period from around 25s to 75s although actually the corresponding iterates do not converge, and then the objective value starts decreasing again.

Checking Global/Local Optimality. After the NS-SymNMF algorithm has converged, the local/global optimality can be checked according to Theorem 3 and Theorem 4. To find an appropriate δ satisfying the condition λ_min(T) > 0, we initialize δ as 1 and decrease it by 0.01 each time and check the minimum eigenvalue of T. Here, we use data set II with the fixed ratio of the number of nodes within each cluster (i.e., 3 : 5 : 8 : 4) and test on different total numbers of nodes. The simulation results are shown in Table I with 100 MC trials, where the average values of λ_min(T) and δ are given. Further, the percentage of being able to find a valid δ > 0 that ensures λ_min(T) > 0 is listed as the last column. We note that there always existed a δ such that T is positive definite in all cases that we tested. This indicates that (with high probability) the proposed algorithm converges to a locally optimal solution. In Figure 3, we provide the values of δ that make the corresponding λ_min(T) > 0 at each realization.

We also remark that in practice we stop the algorithm in finite steps, so only an approximate KKT point will be obtained, and the degree of such approximation can be measured by the optimality gap defined in (14).

Fig. 3. Checking local optimality condition, where N = 500.

TABLE I: LOCAL OPTIMALITY
N | λ_min(T) | δ | Local Optimality (true)
50 | 2.71 × 10^-4 | 0.42 | 100%
100 | 4.16 × 10^-4 | 0.37 | 100%
500 | 1.8 × 10^-2 | 0.91 | 100%

Performance on Real Data. We also implement the algorithm on a few real data sets in clustering applications, which will be described in the next paragraphs.

1) Dense Similarity Matrix: we generate the dense similarity matrices based on two real data sets: Reuters-21578 and TDT2 [49]. We use the 10th subset of the processed Reuters-21578 data set, which includes N = 4,633 documents divided into K = 25 classes. The number of features is 18,933. The Topic Detection and Tracking 2 (TDT2) corpus includes two newswires (APW and NYT), two radio programs (VOA and PRI) and two television programs (CNN and ABC). We use the 10th subset of the processed TDT2 data set with K = 25 classes, which includes N = 8,939 documents, each of which has 36,771 features. We comment that the 10th TDT2 subset is the largest among all the TDT2 and Reuters subsets. Any other subset can be used equally well. The similarity matrix is constructed by the Gaussian function, where the difference between two documents is measured over all features using the Euclidean distance [49].

The means and standard deviations of the objective values of the final solutions are shown in Table II. Convergence results of the algorithms are shown in Figure 4.
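For completeness, the following is a small NumPy sketch of the Data set II construction described above (1-D Gaussian cluster points and the Gaussian similarity A_{i,j} = exp(−(x_i − x_j)^2/(2σ^2))); the function name and the random seed handling are ours, not from the paper.

```python
import numpy as np

def make_adjacency_dataset(sizes=(300, 500, 800, 400), means=(2, 3, 6, 8),
                           var=0.5, sigma2=0.5, seed=0):
    """Generate 1-D clustered points and the Gaussian similarity matrix A."""
    rng = np.random.default_rng(seed)
    x = np.concatenate([rng.normal(m, np.sqrt(var), size)
                        for m, size in zip(means, sizes)])
    diff = x[:, None] - x[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma2))
```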

(a) Mean of the objective values: Reuters data set (b) Mean of the objective values: TDT2 data set

Fig. 4. The convergence behaviors of different SymNMF solvers for the dense similarity matrix; each point in the figures is an average of 20 independent MC trials based on random initializations.

(a) Mean of the objective values: email-Enron data set (b) Mean of the objective values: loc-Brightkite data set

Fig. 5. The convergence behaviors of different SymNMF solvers for the sparse similarity matrix; each point in the figures is an average of 20 independent MC trials based on random initializations.

TABLE II: MEAN AND STANDARD DEVIATION OF ||XX^T − Z||_F^2 / ||Z||_F^2 OF THE FINAL SOLUTION OF EACH ALGORITHM BASED ON RANDOM INITIALIZATIONS

Dense Data Sets | N | K | NS-SymNMF | PGD [22] | PNewton [22] | ANLS [11] | SNMF [10] | CD [23]
Reuters [49] | 4,633 | 25 | 2.65e-3±3.31e-10 | 1.14e-2±1.18e-5 | 2.98e-3±3.71e-6 | 1.16e-2±1.61e-5 | 9.32e-3 | 2.66e-3±2.04e-8
TDT2 [49] | 8,939 | 25 | 1.01e-2±5.35e-9 | 1.74e-2±7.34e-6 | - | 2.25e-2±1.25e-6 | 3.29e-2 | 1.01e-2±1.21e-6

For the Reuters and TDT2 data sets, before SNMF completes the eigenvalue decomposition for the first iteration, CD and NS-SymNMF have already obtained low objective values. Also, since calculating the Hessian in PNewton is time consuming, the result of PNewton is out of range in Figure 4(b).

2) Sparse Similarity Matrix: we also generate multiple convergence curves for each algorithm with random initializations based on some sparse real data sets.

Email-Enron network data set [50]: the Enron email corpus includes around half a million emails. We use the relationships between two email addresses to construct the similarity matrix for decomposing. If an address i sent at least one email to address j, then we take A_{i,j} = A_{j,i} = 1. Otherwise, we set A_{i,j} = A_{j,i} = 0.

Brightkite data set [51]: Brightkite was a location-based social networking website. Users were able to share their current locations by checking in. The friendships of the users were maintained by Brightkite. The way of constructing the similarity matrix is the same as for the Enron email data set.

The means and standard deviations of the objective values of the final solutions are shown in Table III. From the simulation results shown in Figure 5, it can be observed that the NS-SymNMF algorithm converges faster than CD, while SNMF and ANLS converge to some points where the relative objective values are higher than the one obtained by NS-SymNMF.

VI. CONCLUSIONS

In this paper, we propose a nonconvex splitting algorithm for solving the SymNMF problem. We show that the proposed algorithm converges to a KKT point in a sublinear manner.

TABLE III: MEAN AND STANDARD DEVIATION OF ||XX^T − Z||_F^2 / ||Z||_F^2 OF THE FINAL SOLUTION OF EACH ALGORITHM BASED ON RANDOM INITIALIZATIONS

Sparse Data Sets | N | K | #nonzero | NS-SymNMF | ANLS [11] | SNMF [10] | CD [23]
email-Enron [50] | 36,692 | 50 | 367,662 | 8.05e-1±4.66e-4 | 9.18e-1±6.20e-3 | 9.69e-1 | 8.13e-1±1.47e-3
loc-Brightkite [51] | 58,228 | 50 | 428,156 | 8.75e-1±9.52e-4 | 9.33e-1±1.93e-3 | 9.43e-1 | 8.84e-1±1.49e-3

Further, we provide sufficient conditions to identify global or Proof: It is sufficient to show that when τ is large enough, local optimal solutions of the SymNMF problem. Numerical there can be no KKT point whose column has size τ, leading ∗ 2 experiments show that the proposed method can converge to the fact that the constraint Xk τ is always inactive. quickly to local optimal solutions. We check the optimality conditionk k of≤ the SymNMF problem ∗ 2 In the future, we plan to extend the proposed methods at X = τk, where τk > 0 is a constant. We can rewrite k kk in a way such that the algorithms can converge to the the objective function as local or even global optimal solutions of SymNMF without N N 1 T 2 requiring checking conditions. Also, it is possible to apply f(X)= (XiX Zi,j ) 2 j − the nonconvex splitting method to more general matrix i=1X,i6=k j=1X,j6=k factorization problems, such as the quadratic nonnegative N T 2 matrix factorization problem [52]. + (XiX Zi,k) k − i=1X,i6=k VII. APPENDIX N T 2 T 2 + (XkX Zk,j ) + (XkX Zk,k) . A. Proof of Lemma 1 j − k −  j=1X,j6=k Sufficiency: the stationary points satisfy Note, Xi, Xj , Xk denote rows of matrix X. We take the gradient of f(X) with respective to Xk: X∗(X∗)T (ZT + Z)/2 X∗, X X∗ 0, X 0.  − −  ≥ ∀ ≥ X N  ∂f( ) T (30) = Xi,m(XiXk Zi,k) ∗ ∗ T T ∗ ∂Xk,m − Let Ω , (X (X ) (Z + Z)/2)X /2. We have Ω, X i=1X,i6=k − h − X∗ 0, X 0. By setting X appropriately as 0 X N ∗i ≥ ∀ ≥ ≤ ∗ ≤ T T X , we have Ωi,j 0, (i, j) where = i, j X = Xj,m(XkX Zk,j )+2Xk,m(XkX Zk,k) ≥ ∈ S S { | i,j 6 j − k − 0 . Also, by setting X appropriately as X X∗, we have j=1X,j6=k } ≥ Ωi,j 0, (i, j) / . Combining the two cases, we conclude N ≥ ∈ S T that Ω 0. = Xi,m(XiX (Zi,k + Zk,i)) ≥ k − From (30), we know that Ω, X Ω, X∗ . Since Ω 0 i=1X,i6=k h i ≥ h i ≥ T and X 0, we have Ω, X 0, X, meaning that +2Xk,m(XkXk Zk,k) (34) Ω, X∗ ≥ 0. Combining withh Xi∗ ≥ 0 and∀ Ω 0, we have − h ∗i ≤ ≥∗ ≥ where Xi,m denotes the mth entry of the ith row of X. Ω, X 0, which results in Ω, X =0. X∗ ∗ ∂f( k ) h i≥ h i Assume that X is a KKT point. We have ( X )(Xk In summary, we have k ∂ k ∗ T 2 − Xk) 0, Xk , where = Xk Xk 0, Xk T ≥ ∀ ∈ X X { | ≥ k k ≤ ∗ ∗ T Z + Z ∗ τk , which implies 2 X (X ) X Ω =0, (31a) }  − 2  − ∗ ∂f(Xk) ∗ Ω 0, (31b) (Xk,m Xk,m) 0 ≥ ∂Xk,m − ≥ X∗ 0, (31c) ≥ K ∗ ∗ ∗ 2 X , Ω =0, (31d) 0 Xk,m X = vτk (X ) m. (35) h i ≤ ≤ k,m u − k,n ∀ u n=1X,n6=m which are the KKT conditions of the SymNMF problem. t ∗ 2 Necessity: If the point is a KKT point of SymNMF, we have Since Xk = τk, there exists an index m such that ∗ k k ∗ X > 0. Consider a feasible point 0 Xk,m < X , T + k,m ≤ k,m ∗ ∗ ∗ T Z Z ∗ , ∗ Ω =2 X (X ) X . (32) where m m m Xk,m =0 . Thanks to (35), we have − 2 ∈ S { | 6 }  X∗ ∗ ∗ ∂f( k,m) ∗ Combining with X , Ω =0, we know that 0, 0 Xk,m < Xk,m m m. (36) h i ∂Xk,m ≤ ≤ ∀ ∈ S Ω∗, X X∗ 0, X 0, (33) ∗ h − i≥ ∀ ≥ Plugging (34) into (36) and multiplying Xk,m on both sides which is the condition of stationary points. of (36), we can obtain N Z + Z X∗ X∗ X∗(X∗)T i,k k,i k,m i,m i k − 2 B. Proof of Lemma 2 i=1X,i6=k 

We prove that if τ is large enough, then the KKT conditions ∗ ∗ ∗ T + X (X (X ) Zk,k) 0 m m. (37) of (1) and (4) are the same. k,m k k −  ≤ ∀ ∈ S 10

∗ For the case m / m, we know that X =0. Summing Plugging (45) into (44), we have ∈ S k,m up (37) m, and noting that m 1 we can get ∀ |S |≥ Λ(t+1) Λ(t) N − Z + Z (t+1) (t) (t+1) (t) (t+1) T (t+1) , ∗ ∗ T ∗ ∗ T i,k k,i =Z(Y Y ) (X X )(Y ) Y p Xi (Xk) Xi (Xk) − − − − 2 (t) (t) T (t+1) (t) i=1X,i6=k  X (Y ) (Y Y ) , − − Mi,k X(t)(Y(t+1) Y(t))Y(t+1) | ∗ ∗ T ∗ {z∗ T } − − +X (X ) (X (X ) Zk,k) 0. (38) t t T t t k k k k − ≤ =(Z X( )(Y( )) )(Y( +1) Y( )) − − In (38), i,k is a quadratic function with respective to (X(t+1) X(t))(Y(t+1))T Y(t+1) M , ∗ ∗ T − − Ci,k, where Ci,k Xi (Xk) , so the minimum of i,k (t) (t+1) (t) T (t+1) 2 M X (Y Y ) Y . (46) is 1/4((Zi,k + Zk,i)/2) . Consequently, the minimum of − − N− N 2 i=1,i6=k i,k is 1/4 i=1,i6=k((Zi,k + Zk,i)/2) . Using triangle inequality, we arrive at M − ∗ 2 PIn addition, since we haveP X = τk, the lower bound of k (t+1) (t) (t+1) (t) (t+1) T (t+1) , N k k 2 Λ Λ F X X F (Y ) Y F p is pL 1/4 i=1,i6=k((Zi,k + Zk,i)/2) + τk(τk Zk,k) k − k ≤k − k k k − − (t) (t) T (t+1) (t) which is a quadraticP function in terms of τk. Therefore, if + X (Y ) Z F Y Y F k − k k − k + X(t)(Y(t+1) Y(t))T Y(t+1) . (47) 1 N 2 F F Zk,k + 2 i=1(Zi,k + Zk,i) k − k k k τk >θk , q , (39) 2 P 2 Since Yi τ, we know that Y F √Nτ. Squaring both sidesk k of (47),≤ we obtain k k ≤ then p pL > 0, which contradicts the optimality condition ≥ t t t t (37). It can be concluded that whenever τk is large enough, Λ( +1) Λ( ) 2 3N 2τ 2 X( +1) X( ) 2 k − kF ≤ k − kF at any KKT point no column will have size equal to τk. (t) (t) T 2 (t+1) (t) 2 +3 X (Y ) Z F Y Y F Furthermore, it can be easily checked that τ > maxk θk is k − k k − k (t) (t+1) (t) T 2 a sufficient condition. The proof is complete. +3Nτ X (Y Y ) F . (48) k − k The claim is proved. C. Convergence Proof of the Proposed Algorithm In the second step, we bound the successive difference of the augmented Lagrangian. In this section, we prove Theorem 1. The analysis consists of a series of lemmas. Lemma 4. Consider using the update rules (8)–(10). If Lemma 3. Consider using the update rules (8) – (10) to solve 6 ρ> 6Nτ and β(t) > X(t)(Y(t))T Z 2 ρ, (49) problem (1). Then we have ρk − kF − (t+1) (t) 2 2 2 (t+1) (t) 2 we have Λ Λ F 3N τ X X F k − k ≤ k − k t t t t t t +3 X(t)(Y(t))T Z 2 Y(t+1) Y(t) 2 (X( +1), Y( +1), Λ( +1)) (X( ), Y( ), Λ( )) k − kF k − kF L − L (t) (t+1) (t) T 2 (t+1) (t) 2 (t) (t+1) (t) T 2 +3Nτ X (Y Y ) . (40) c1 X X F c2 X (Y Y ) F k − kF ≤− k − k − k − k c Y(t+1) Y(t) 2 Proof: The optimality condition of the X subproblem (9) 3 F − k − k (50) is given by where c1,c2,c3 > 0 are some positive constants. (X(t+1)(Y(t+1))T Z)Y(t+1) − Proof: Let + ρ(X(t+1) Y(t+1) + Λ(t)/ρ)=0. (41) − 1 Substituting (10) into (41), we have (X(t), Y, Λ(t)) , X(t)YT Z 2 L 2k − kF (t) (t+1) (t+1) (t+1) T (t+1) b ρ β Λ = (X (Y ) Z)Y . (42) + X(t) Y + Λ(t)/ρ 2 + Y Y(t) 2 , (51) − − 2k − kF 2 k − kF Subtracting the same equation in iteration t, we have the which is an upper bound of (X(t), Y, Λ(t)), and successive difference of the dual matrix (44), shown at the L top of the next page. , (X(t), Y(t+1), Λ(t)) (X(t), Y(t), Λ(t)), Note that the following is true A L − L , (X(t+1), Y(t+1), Λ(t)) (X(t), Y(t+1), Λ(t)), B L − L 1 t t t t t t = X(t)(Y(t+1) Y(t))T (Y(t+1) Y(t)) , (X( +1), Y( +1), Λ( +1)) (X( +1), Y( +1), Λ( )), Q 2 − − C L − L , (t) (t+1) (t) (t) (t) (t) +2X(t)(Y(t))T (Y(t+1) Y(t)) (X , Y , Λ ) (X , Y , Λ ). 
− A L − L 1 T  b b + X(t)(Y(t+1) Y(t)) (Y(t+1) + Y(t)) We have the following descent estimate 2 − =X(t)(Y(t))T (Y(t+1) Y(t)) (X(t+1), Y(t+1), Λ(t+1)) (X(t), Y(t), Λ(t)) − L − L + X(t)(Y(t+1) Y(t))T Y(t+1). (45) = + + + + . (52) − A B C ≤ A B C b 11

Λ(t+1) Λ(t) = X(t+1)(Y(t+1))T Y(t+1) X(t)(Y(t))T Y(t) Z(Y(t+1) Y(t)) (43) − − h − − − i = (X(t+1) X(t))(Y(t+1))T Y(t+1) + X(t) (Y(t+1))T Y(t+1) (Y(t))T Y(t) + Z(Y(t+1) Y(t)) − − − − h  i =Z(Y(t+1) Y(t)) (X(t+1) X(t))(Y(t+1))T Y(t+1) − − − 1 X(t) (Y(t+1) + Y(t))T (Y(t+1) Y(t)) + (Y(t+1) Y(t))T (Y(t+1) + Y(t)) . (44) − 2 − −  ,Q | {z }

Next we bound the quantities in (52) which are equivalent to

1 1 (t) (t) T 2 2 (t) (t+1) T 2 (t) (t) T 2 (t) 6 X (Y ) Z F ρ = X (Y ) Z F X (Y ) Z F ρ> 6Nτ and β > k − k − , (57) A 2k − k − 2k − k ρ ρ b + X(t) Y(t+1) + Λ(t)/ρ 2 2k − kF then (X(t+1), Y(t+1), Λ(t+1)) (X(t), Y(t), Λ(t)) < 0. L − L ρ β(t) Then, it is concluded that (X(t+1), Y(t+1), Λ(t+1)) is X(t) Y(t) + Λ(t)/ρ 2 + Y(t+1) Y(t) 2 L − 2k − kF 2 k − kF decreasing. (t+1) (t+1) (t+1) (a) t t T t t t In the next step we prove that X Y Λ is = (X( )(Y( +1)) Z)X( ), Y( +1) Y( ) ( , , ) h − − i lower bounded. L 1 X(t)(Y(t+1) Y(t))T 2 − 2k − kF Lemma 5. Consider using the update rules (8) (9) (10). If + ρ X(t) Y(t+1) + Λ(t)/ρ, Y(t+1) Y(t) ρ Nτ is satisfied, we have h − − i ≥ ρ β(t) (t+1) (t+1) (t+1) Y(t+1) Y(t) 2 + Y(t+1) Y(t) 2 (X , Y , Λ ) 0. (58) − 2k − kF 2 k − kF L ≥ (b) 1 ρ Proof: At iteration t +1, the augmented Lagrangian can X(t)(Y(t+1) Y(t))T 2 Y(t+1) Y(t) 2 ≤ − 2k − kF − 2k − kF be lower bounded as (t) β (t+1) (t+1) (t+1) Y(t+1) Y(t) 2 (X , Y , Λ ) F L − 2 k − k 1 = X(t+1)(Y(t+1))T Z 2 + X(t+1) Y(t+1), Λ(t+1) where (a) is due to the fact that Taylor expansion for quadratic 2k − kF h − i ρ problems is exact, and (b) is due to the optimality condition + X(t+1) Y(t+1) 2 for problem (8). Similarly, we have 2k − kF (a) 1 = X(t+1)(Y(t+1))T Z 2 1 (t+1) (t) (t+1) T 2 F (X X )(Y ) F 2k − k B≤− 2k − k + X(t+1) Y(t+1), (X(t+1)(Y(t+1))T Z)Y(t+1) ρ (t+1) (t) 2 h − − − i X X F , (53) ρ (t+1) (t+1) 2 − 2k − k + X Y F X(t+1) Y(t+1) Λ(t+1) Λ(t) 2k − k = , (b) C h − − i 1 (t+1) (t+1) 2 (a) 1 (ρ Nτ) X Y (59) = Λ(t+1) Λ(t) 2 (54) ≥ 2 − k − kF ρk − kF where (a) is due to (42), and (b) is true because where (a) is from (10). Substituting the result of Lemma 3 into (54), we can obtain 0 (X(t+1) Y(t+1))(Y(t+1))T (X(t+1)(Y(t+1))T Z) 2 ≤k − − − kF (t+1) (t+1) (t+1) T 2 (t+1) (t+1) (t+1) (t) (t) (t) = (X Y )(Y ) (X , Y , Λ ) (X , Y , Λ ) k − kF L − L (t+1) T (t+1) (t+1) (t+1) (t+1) T 2 2 2 (Y ) (X Y ), X (Y ) Z ρ 3N τ (t+1) (t) 2 − h − − i X X F (t+1) (t+1) T 2 ≤− 2 − ρ  k − k + X (Y ) Z) F , k − k 1 3Nτ (t) (t+1) (t) T 2 2 X (Y Y ) and Y F Nτ. F k k ≤ − 2 − ρ  k − k From (59), we know that if ρ Nτ, we have (t) (t) (t) T 2 (t+1) (t+1) (t+1) ≥ ρ β 3 X (Y ) Z F (t+1) (t) 2 (X , Y , Λ ) 0. + k − k Y Y F L ≥ −  2 2 − ρ  k − k These lemmas lead to the main convergence claim.

1 (t+1) (t) (t+1) T 2 Proof: Combing (50) and (58), we have (X X )(Y ) F . (55) − 2k − k (t+1) (t) 2 lim X X F =0, (60) ρ 3N 2τ 2 1 3Nτ t→∞ k − k Therefore, from (55) if 2 ρ > 0, 2 ρ > 0, and (t) (t+1) (t) T 2 − − lim X (Y Y ) F =0, t→∞ k − k (t) (t) (t) T 2 ρ + β 3 X (Y ) Z F (t) (t) T 2 (t+1) (t) 2 k − k > 0, (56) lim X (Y ) Z F Y Y F =0. 2 − ρ t→∞ k − k k − k 12

By Lemma 3, we have Then, we have

(t) T (t) T (t) T (t) (t) T (t+1) (t) 2 (Y ) projY (Y ) ((X ) (X (Y ) Z) lim Λ Λ F =0, (61) − − − t→∞ k − k 

(t) (t) (t) T (t) (t) 2 ρ(X Y + Λ /ρ) ) which implies limt→∞ X Y F =0. Combining with − − F k − k (t+1) (t) 2  (60), we can further know that limt→∞ Y Y =0. F t T t T t T (t) k − k = (Y( )) (Y( +1)) + (Y ( +1)) The boundedness assumption of X then follows from the − boundedness of Y(t). Using the expression of Λ(t) in (42), (t) T (t) T (t) (t) T t proj (Y ) ((X ) (X (Y ) Z) one can show that Λ( ) is also bounded. − Y − − { }  The optimality condition of (8) is given by ρ(X(t) Y(t) + Λ(t)/ρ)T ) − −  F (a) (t) T (t) (t+1) T (t) (t+1) (t) T (t) (t+1) (X ) (X (Y ) Z) ρ(X Y +Λ /ρ) Y Y F − − − ≤k − k (t) Y(t+1) Y(t) T Y Y(t+1) T + β ( ) , ( ) 0, (t+1) T (t) T (t) (t+1) T − − 2 ≥ + projY (Y ) ((X ) (X (Y ) Z) Y 0 and Yi τ i. (62) − − ∀ ≥ k k2 ≤ ∀  ρ(X(t) Y(t+1) + Λ(t)/ρ)T + β(t)(Y(t+1) Y(t))T ) − − − Substituting (42) into (62), using (60), and taking limit over proj (Y(t))T ((X(t))T (X(t)(Y(t))T Z)  (t) (t) (t) − Y − − any converging subsequence of X , Y , Λ , we have  { } t t t T ρ(X( ) Y( ) + Λ( )/ρ) ) T T T T − − (X∗) (X∗(Y∗) Z) + ((X∗(Y∗) Z)Y∗)  F h − − (b) ∗ ∗ T ∗ T (t) (t+1) (t) ρ(X Y ) , (Y Y ) 0, (2 + ρ + β ) Y Y F − − − i≥ ≤ k − k 2 (t) T (t) (t+1) (t) T Y 0 and Yi 2 τ i. (63) + (X ) X (Y Y ) F ∀ ≥ k k ≤ ∀ k − k (c) (t) (t+1) (t) The optimality condition of (9) is given by (2 + ρ + β ) Y Y F ≤ k − k (t) (t+1) (t) T + Nγ X (Y Y ) F (68) (X(t+1)(Y(t+1))T Z)(Y(t+1)) p k − k − where projY denotes the projection of Y to the feasible +ρ(X(t+1) Y(t+1) + Λ(t)/ρ)=0. (64) − space; in (a) we used triangle inequality; (b) is due to the nonexpansiveness of the projection operator; and (c) is due to Taking limit of (64) over the same subsequence, we have the boundedness of X F . Similarly, we cank boundk the size of the gradient of the (X∗(Y∗)T Z)Y∗ + ρ(X∗ Y∗ + Λ∗/ρ)=0. (65) augmented Lagrangian with respect to X by the following − − series of inequalities ∗ ∗ Using the fact X = Y , we have (t) (t) (t) (t) (t) T (t) X (X , Y , Λ ) F = (X (Y ) Z)Y k∇ L k k − (t) (t) (t) ZT Z + ρ(X Y + Λ /ρ) F ∗ ∗ T + ∗ ∗ − k X (X ) X , X X 0, (a)  − 2 −  ≥ = (X(t)(Y(t))T Z)Y(t) + ρ(X(t) Y(t) + Λ(t)/ρ)  2 − − X 0, Xi τ i, (66) (t+1) (t+1) T (t+1) ∀ ≥ k k2 ≤ ∀ ((X (Y ) Z)Y ∗ ∗ T ∗ ∗ − − X X Z X Λ t t t ( ( ) ) + =0, (67) + ρ(X( +1) Y( +1) + Λ( )/ρ)) − − F

which are the KKT conditions of problem (1). (X(t)(Y(t))T Z)Y(t) ≤k − (t+1) (t+1) T (t+1) ((X (Y ) Z)Y ) F − − k (t+1) (t) (t+1) (t) + ρ Y Y F + ρ X X F (69) D. Convergence Rate Proof of the Proposed Algorithm k − k k − k (b) (t+1) (t) (t+1) (t) = Λ Λ F + ρ Y Y F Proof: Based on Theorem 1, X(t) 2 is bounded. There k − k k − k k kF (t+1) (t) must exist a finite γ > 0 such that X(t) 2 Nγ, t, where + ρ X X F (70) k kF ≤ ∀ k − k γ is only dependent on τ, N and Z F . k k where (a) is from the optimality condition of the From the optimality condition of Y in (8), we have X-subproblem (41); (b) is true due to (43) and (42). Squaring both sides of (70) and applying Lemma 3, we have (t+1) T (t+1) T (t) (t) (t) 2 (Y ) = proj (Y ) X (X , Y , Λ ) Y  k∇ L kF 2 2 2 (t+1) (t) 2 (t) T (t) (t+1) T (t) 3(3N τ + ρ ) X X ((X ) (X (Y ) Z) ρ(X ≤ k − kF − − − (t) (t) T 2 2 (t+1) (t) 2 + 3(3 X (Y ) Z F + ρ ) Y Y F Y(t+1) + Λ(t)/ρ)T + β(t)(Y(t+1) Y(t))T ) . k − k k − k − −  +9Nτ X(t)(Y(t+1) Y(t))T 2 . (71) k − kF 13

Due to the boundedness of X(t) and Y(t), we must have E. Sufficient Condition of Global Optimality (t) (t) T that for some δ > 0, X (Y ) Z F δ. Therefore, combiningk (68) and− (71),k there≤ must exists a Proof: Let Ω be the Lagrange multipliers matrix. The finite positive number σ1 such that Lagrangian of problem (1) is given by (t) (t) (t) 2 (X , Y , Λ ) F σ1 (72) k∇L k ≤ F 1 (X, Ω)= Tr ((XXT Z)T (XXT Z)) X, Ω . (82) where e L 2 − − − h i , X(t+1) X(t) 2 + Y(t+1) Y(t) 2 F k − kF k − kF + X(t)(Y(t+1) Y(t))T 2 (73) Let (X∗, Ω∗) be a KKT point of problem (1). To show k − kF global optimality of (X∗, Ω∗), it is sufficient to prove the In particular, we have σ , max 3(3N 2τ 2 + ρ2), 3(2 + ρ + 1 following saddle point condition [42, pp. 238] β(t))2 + 3(3δ2 + ρ2), 3γ +9Nτ {and β(t) 6δ2/ρ. According to Lemma 3, we have} ≤ ∗ ∗ ∗ ∗ 1 (X , Ω) (X , Ω ) (X, Ω ), Ω 0, X. (83) X(t+1) Y(t+1) 2 = Λ(t+1) Λ(t) 2 σ (74) L ≤ L ≤ L ∀ ≥ ∀ k − kF ρ2 k − kF ≤ 2F 2 2 2 2 2 2 where some constant σ2 , max 3N τ /ρ , 3δ /ρ , 3Nτ/ρ . To show the left hand side of (83), we have the following Also, we have { } (t) (t) X∗ Ω∗ X∗ Ω X∗ Ω∗ X∗ Ω X Y F ( , ) ( , )= , ( , ) k − k L − L −h i− −h i (t) (t+1) (t+1) (t+1) (t+1) (t) (a) (b) = X X + X Y + Y Y F ∗ ∗ ∗ k − − − k = X , Ω Ω = X , Ω 0. (84) (t) (t+1) (t+1) (t+1) h − i h i ≥ X X F + X Y F ≤k − k k − k (t+1) (t) + Y Y F , (75) where (a) is due to (5d), and (b) is due to Ω 0 and (5c). k − k ≥ which yields Next we show the right hand side of (83) (t) (t) 2 X Y F σ3 (76) k − k ≤ F ∗ ∗ ∗ , 2 2 2 2 2 2 (X, Ω ) (X , Ω ) for σ3 max 9N τ /ρ +3, 9δ /ρ +3, 9Nτ/ρ . L − L { } 1 The inequalities (72) and (76) imply that = Tr[(XXT X∗(X∗)T )(XXT X∗(X∗)T )] (t) (t) (t) 2 (t) (t) 2 2 − − (X , Y , Λ ) F + X Y F (σ1 + σ3) . k∇L k k − k ≤ (77)F ,M e | ∗ ∗ T T{z T ∗ ∗ T } According to Lemma 4, there exists a constant σ4 , + Tr[(X (X ) Z )(XX X (X ) )] ∗ ∗ − − min c1,c2,c3 such that X X , Ω { } − h − i (t) (t) (t) (t+1) (t+1) (t+1) (a) T X Y Λ X Y Λ T Z + Z ( , , ) ( , , ) σ4 . X X∗, X∗(X∗) (X + X∗) (85) L − L ≥ F(78) ≥ h −  − 2  i Combining (77) and (78), we have X X∗, Ω∗ − h − i T (t) (t) (t) 2 (t) (t) 2 (b) ∗ ∗ ∗ T Z + Z ∗ (X , Y , Λ F + X Y F = X X , X (X ) (X X ) k∇L k k − k ≤ h −  − 2  − i σ1 + σ3 (t) (t) (t) (t+1) (t+1) (t+1) e ( (X , Y , Λ ) (X , Y , Λ )). ZT + Z σ4 L −L =Tr (X X∗)T X∗(X∗)T (X X∗) (86) (79) −  − 2  −   Summing both sides of (79) over t =1,...,r, we have ,S r | {z } (X(t), Y(t), Λ(t)) 2 + X(t) Y(t) 2 k∇L kF k − kF where (a) is due to 0 and the fact that Xt=1 M≥ e σ1 + σ3 (1) (1) (1) (t+1) (t+1) (t+1) ( (X , Y , Λ ) (X , Y , Λ )) 1 ≤ σ4 L − L XXT X∗(X∗)T = (X + X∗)(X X∗)T (a) − 2 − σ1 + σ3 (1) (1) (1)  T (X , Y , Λ ) (80) +(X X∗)(X + X∗) ; (87) ≤ σ4 L −  where (a) is due to Lemma 5. (t) (t) (t) (b) is true because of (5a). Clearly, if we have S 0, then According to the definition of T (ǫ) and (X , Y , Λ ),  the above inequality becomes P the following inequality must be true σ + σ T (ǫ)ǫ 1 3 (X(1), Y(1), Λ(1)). (81) ≤ σ L (X, Ω∗) (X∗, Ω∗) 0. 4 L − L ≥ Dividing both sides by T (ǫ), and by setting C , (σ1 + σ3)/σ4, the desired result is obtained. This completes the proof. 14

F. Sufficient Condition of Local Optimality

Proof: We first simplify the term $M$ in (85) as follows:
$$
\begin{aligned}
&\frac{1}{2}\mathrm{Tr}\big[(XX^T - X^*(X^*)^T)^T (XX^T - X^*(X^*)^T)\big] \\
&\quad\overset{(a)}{=} \frac{1}{2}\mathrm{Tr}\Big[\big((X - X^*)X^T + X^*(X - X^*)^T\big)^T\big((X - X^*)X^T + X^*(X - X^*)^T\big)\Big] \\
&\quad\overset{(b)}{=} \frac{1}{2}\mathrm{Tr}\Big[\big(\widehat Y(\widehat Y + X^*)^T + X^*\widehat Y^T\big)^T\big(\widehat Y(\widehat Y + X^*)^T + X^*\widehat Y^T\big)\Big] \\
&\quad\overset{(c)}{=} \frac{1}{2}\mathrm{Tr}\Big[\big(U + \widehat Y(X^*)^T + X^*\widehat Y^T\big)^T\big(U + \widehat Y(X^*)^T + X^*\widehat Y^T\big)\Big] \\
&\quad= \frac{1}{2}\mathrm{Tr}\big[UU^T + 4UX^*\widehat Y^T + 2\widehat Y(X^*)^T X^*\widehat Y^T\big] + \mathrm{Tr}\big[X^*\widehat Y^T X^*\widehat Y^T\big] \\
&\quad= \frac{1}{2}\mathrm{Tr}\left[\widehat Y\,\big[\widehat Y^T\ \ I\big]\begin{bmatrix} I & 4X^* \\ 0 & 2(X^*)^T X^* \end{bmatrix}\big[\widehat Y^T\ \ I\big]^T\widehat Y^T\right] + \mathrm{Tr}\big[X^*\widehat Y^T X^*\widehat Y^T\big] \quad (88)
\end{aligned}
$$
where (a) is due to the fact that
$$ XX^T - X^*(X^*)^T = (X - X^*)X^T + X^*(X - X^*)^T; \quad (89) $$
in (b) we defined $\widehat Y \triangleq X - X^*$, which represents the difference between $X$ and $X^*$; and in (c) we defined $U \triangleq \widehat Y\widehat Y^T = U^T$.
Combining (86) and (88), we have
$$
\begin{aligned}
\mathcal{L}(X, \Omega^*) - \mathcal{L}(X^*, \Omega^*) &= \mathrm{Tr}\Big[\widehat Y\,\tfrac{1}{2}\big(\widehat Y^T\widehat Y + 2\widehat Y^T X^* + (X^*)^T X^*\big)\widehat Y^T\Big] + \mathrm{Tr}\big[X^*\widehat Y^T X^*\widehat Y^T\big] + \mathrm{Tr}\Big[\widehat Y^T\Big(X^*(X^*)^T - \frac{Z^T + Z}{2}\Big)\widehat Y\Big] \\
&= \sum_{m=1}^{K}\sum_{n=1}^{K}(\widehat Y'_m)^T \mathcal{K}_{m,n}\widehat Y'_n + \sum_{m=1}^{K}\sum_{n=1}^{K}(\widehat Y'_m)^T \widetilde{\mathcal{K}}_{m,n}\widehat Y'_n + \sum_{m=1}^{K}(\widehat Y'_m)^T \widetilde S\,\widehat Y'_m \\
&= \mathrm{vec}(\widehat Y)^T\, T\, \mathrm{vec}(\widehat Y)
\end{aligned}
$$
where
$$
T \triangleq \begin{bmatrix} \mathcal{K}_{1,1} I + \widetilde{\mathcal{K}}_{1,1} + \widetilde S & \cdots & \mathcal{K}_{1,K} I + \widetilde{\mathcal{K}}_{1,K} \\ \vdots & \ddots & \vdots \\ \mathcal{K}_{K,1} I + \widetilde{\mathcal{K}}_{K,1} & \cdots & \mathcal{K}_{K,K} I + \widetilde{\mathcal{K}}_{K,K} + \widetilde S \end{bmatrix},
$$
$$ \mathcal{K}_{m,n} \triangleq (\widehat Y'_m)^T\widehat Y'_n + 2(\widehat Y'_m)^T X'^*_n + (X'^*_m)^T X'^*_n, \quad (90) $$
$\widetilde{\mathcal{K}}_{m,n} \triangleq X'^*_n(X'^*_m)^T$, $\widetilde S \triangleq X^*(X^*)^T - (Z^T + Z)/2$, the subscript $(m,n)$ denotes the $(m,n)$th block of a matrix, and $X'^*_m$ ($\widehat Y'_n$) denotes the $m$th (respectively $n$th) column of the matrix $X^*$ (respectively $\widehat Y$).
For the $(m,n)$th block, we have
$$
\begin{aligned}
&\frac{1}{2}(\widehat Y'_m)^T\Big[\big((\widehat Y'_m)^T\widehat Y'_n + 2(\widehat Y'_m)^T X'^*_n + (X'^*_m)^T X'^*_n\big) I + X'^*_n(X'^*_m)^T + \delta_{m,n}\widetilde S\Big]\widehat Y'_n \\
&\quad\overset{(a)}{\ge} \frac{1}{2}(\widehat Y'_m)^T\Big[\Big(-\frac{1}{4}\|\widehat Y'_m\|_2^2\|\widehat Y'_n\|_2^2 - \frac{1}{\delta}\|\widehat Y'_m\|_2^2 - \delta\|X'^*_n\|_2^2 + (X'^*_m)^T X'^*_n\Big) I + X'^*_n(X'^*_m)^T + \delta_{m,n}\widetilde S\Big]\widehat Y'_n \\
&\quad= (\widehat Y'_m)^T\Big[-\Big(\frac{1}{4} + \frac{1}{\delta}\Big)\|\widehat Y'_m\|_2^2 - \frac{1}{4}\|\widehat Y'_n\|_2^2\Big]\widehat Y'_n + (\widehat Y'_m)^T\Big[\big((X'^*_m)^T X'^*_n - \delta\|X'^*_n\|_2^2\big) I + X'^*_n(X'^*_m)^T + \delta_{m,n}\widetilde S\Big]\widehat Y'_n \\
&\quad\overset{(b)}{\ge} \|\widehat Y'_m\|\|\widehat Y'_n\|\Big[-\Big(\frac{1}{4} + \frac{1}{\delta}\Big)\|\widehat Y'_m\|_2^2 - \frac{1}{4}\|\widehat Y'_n\|_2^2\Big] + \frac{1}{2}(\widehat Y'_m)^T T_{m,n}\widehat Y'_n
\end{aligned}
$$
where
$$ T_{m,n} \triangleq \big((X'^*_m)^T X'^*_n - \delta\|X'^*_n\|_2^2\big) I + X'^*_n(X'^*_m)^T + \delta_{m,n}\widetilde S, $$
$\delta_{m,n}$ is the Kronecker delta function, and $T_{m,n}$ is the $(m,n)$th block of the matrix $T$; in (a) we use the triangle inequality, where $\delta > 0$ is any positive number; in (b) we use the Cauchy–Schwarz inequality.
If there exists $\delta$ such that $T$ is positive definite, then $X^*$ is a strict local minimum point of problem (1). That is, there exist some $\gamma, \epsilon > 0$ such that
$$ \mathcal{L}(X, \Omega^*) - \mathcal{L}(X^*, \Omega^*) \ge \frac{\gamma}{2}\|X - X^*\|_F^2, \quad \forall X \ \text{such that} \ \|X'_m - X'^*_m\|_2^2 \le \epsilon, \quad (91) $$
where $\gamma$ is given by
$$ \gamma = -\Big(\frac{2K}{\delta} + K(K - 2)\Big)\epsilon^2 + 2\lambda_{\min}(T) \quad (92) $$
and $\lambda_{\min}(T)$ is the smallest eigenvalue of the matrix $T$. Clearly $\gamma$ can be made positive for sufficiently small $\epsilon$.
According to the definition of the Lagrangian (82), we have
$$ \mathcal{L}(X, \Omega^*) = f(X) - \langle X, \Omega^*\rangle. \quad (93) $$
Combining with (91) and the KKT conditions (5b)–(5d), we can obtain
$$ f(X) \ge \mathcal{L}(X, \Omega^*) \ge f(X^*) + \frac{\gamma}{2}\|X - X^*\|_2^2, \quad \forall X \ge 0 \ \text{such that} \ \|X - X^*\| \le \epsilon. \quad (94) $$
Therefore $X^*$ is a strict local minimum point of problem (1).
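The certificate $T \succ 0$ guarantees the quadratic growth in (94). A complementary, purely numerical sanity check is to sample nonnegative points in a small neighborhood of $X^*$ and verify that none attains a lower objective value; this probes the conclusion of (94) rather than verifying the matrix condition itself. A minimal sketch using NumPy follows; the objective matches problem (1), while the sampling scheme, trial count, and tolerance are illustrative choices.

```python
import numpy as np

def f(X, Z):
    """Objective of problem (1): 0.5 * ||X X^T - Z||_F^2."""
    R = X @ X.T - Z
    return 0.5 * np.sum(R * R)

def local_min_sanity_check(X_star, Z, eps=1e-3, trials=200, seed=0):
    """Sample nonnegative points within Frobenius distance eps of X* and check
    that none of them improves on f(X*), consistent with the growth bound (94)."""
    rng = np.random.default_rng(seed)
    f_star = f(X_star, Z)
    for _ in range(trials):
        D = rng.standard_normal(X_star.shape)
        D *= eps * rng.uniform() / np.linalg.norm(D)   # scale so that ||D||_F <= eps
        X = np.maximum(X_star + D, 0.0)                # project back onto X >= 0
        if f(X, Z) < f_star - 1e-12:
            return False                               # found a strictly better feasible point
    return True
```

A returned value of True is of course not a proof; it only reports that no descent direction was found among the sampled perturbations.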

G. Sufficient Local Optimality Condition When $K = 1$

Proof: The term $M$ is as follows:
$$ M = \frac{1}{2}\mathrm{Tr}\left[\widehat Y\,\big[\widehat Y^T\ \ I\big]\begin{bmatrix} I & 4X^* \\ 0 & 2(X^*)^T X^* \end{bmatrix}\big[\widehat Y^T\ \ I\big]^T\widehat Y^T\right] + \mathrm{Tr}\big[X^*\widehat Y^T X^*\widehat Y^T\big]. \quad (95) $$
When $K = 1$, (95) becomes
$$ \frac{1}{2}\,\widehat y^T\widehat y\,\big[\widehat y^T\ \ 1\big]\begin{bmatrix} I & 4x^* \\ 0 & 2(x^*)^T x^* \end{bmatrix}\big[\widehat y^T\ \ 1\big]^T + \mathrm{Tr}\big[x^*\widehat y^T x^*\widehat y^T\big] = \frac{1}{2}\,\widehat y^T\widehat y\big(\widehat y^T\widehat y + 4\widehat y^T x^* + 2(x^*)^T x^*\big) + \widehat y^T x^*(x^*)^T\widehat y, \quad (96) $$
where $x^*$ and $\widehat y$ denote the (single) columns of the matrices $X^*$ and $\widehat Y$, respectively.
Combining with (86), we have
$$
\begin{aligned}
\mathcal{L}(x, \Omega^*) - \mathcal{L}(x^*, \Omega^*) &= \frac{1}{2}\widehat y^T\big(\widehat y^T\widehat y + 2\widehat y^T x^* + (x^*)^T x^*\big)\widehat y + \widehat y^T\Big(2x^*(x^*)^T - \frac{Z^T + Z}{2}\Big)\widehat y \\
&\overset{(a)}{\ge} \frac{1}{2}\widehat y^T\Big(\widehat y^T\widehat y - \frac{1}{\delta}\|\widehat y\|_2^2 - \delta\|x^*\|_2^2 + (x^*)^T x^*\Big)\widehat y + \widehat y^T\Big(2x^*(x^*)^T - \frac{Z^T + Z}{2}\Big)\widehat y \\
&= \frac{1}{2}\|\widehat y\|_2^4 - \frac{1}{\delta}\|\widehat y\|_2^4 + \widehat y^T\underbrace{\Big((1 - \delta)\|x^*\|_2^2\, I + 2x^*(x^*)^T - \frac{Z^T + Z}{2}\Big)}_{\triangleq\, T_1}\widehat y
\end{aligned}
$$
where in (a) we have used the triangle inequality and $\delta > 0$ is any positive number.
If there exists $\delta > 0$ which ensures that $T_1 \succ 0$, then there exist some $\gamma, \epsilon > 0$ such that the following is true:
$$ \mathcal{L}(x, \Omega^*) - \mathcal{L}(x^*, \Omega^*) \ge \frac{\gamma}{2}\|x - x^*\|_2^2, \quad \forall x \ \text{such that} \ \|x - x^*\| \le \epsilon, \quad (97) $$
where the constant $\gamma$ is given by
$$ \gamma = \Big(1 - \frac{2}{\delta}\Big)\epsilon^2 + 2\lambda_{\min}(T_1) \quad (98) $$
and $\lambda_{\min}(T_1)$ denotes the smallest eigenvalue of $T_1$. Clearly $\gamma$ can be made positive by setting $\epsilon$ sufficiently small.
According to the definition of the Lagrangian, we have
$$ \mathcal{L}(x, \Omega^*) = f(x) - \langle x, \Omega^*\rangle. \quad (99) $$
Therefore, combining with (97) and the KKT conditions, we can obtain
$$ f(x) \ge \mathcal{L}(x, \Omega^*) \ge f(x^*) + \frac{\gamma}{2}\|x - x^*\|_2^2, \quad \forall x \ge 0 \ \text{such that} \ \|x - x^*\| \le \epsilon. \quad (100) $$
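For $K = 1$ the certificate is especially cheap to evaluate: one forms $T_1 = (1 - \delta)\|x^*\|_2^2 I + 2x^*(x^*)^T - (Z^T + Z)/2$ and searches for a $\delta > 0$ that makes its smallest eigenvalue positive. The sketch below, using NumPy, scans a small grid of $\delta$ values; the grid and the function name are illustrative choices.

```python
import numpy as np

def check_local_opt_K1(x_star, Z, deltas=(0.1, 0.5, 1.0, 2.0)):
    """Look for a delta > 0 such that
    T1 = (1 - delta) * ||x*||_2^2 * I + 2 * x* x*^T - (Z^T + Z)/2
    is positive definite, which certifies x* as a strict local minimum when K = 1."""
    x = np.asarray(x_star, dtype=float).reshape(-1, 1)
    Zs = (Z + Z.T) / 2.0
    n = x.shape[0]
    for delta in deltas:
        T1 = (1.0 - delta) * float(x.T @ x) * np.eye(n) + 2.0 * (x @ x.T) - Zs
        lam_min = np.linalg.eigvalsh(T1)[0]   # smallest eigenvalue of the symmetric T1
        if lam_min > 0.0:
            return True, delta, lam_min
    return False, None, None
```

Any $\delta$ returned by the search can then be combined with (98) to obtain a concrete growth constant $\gamma$ once $\epsilon$ is taken sufficiently small.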
REFERENCES

[1] S. L. Campbell and G. D. Poole, "Computing nonnegative rank factorizations," Linear Algebra and its Applications, vol. 35, pp. 175–182, Feb. 1981.
[2] P. Paatero and U. Tapper, "Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, no. 2, pp. 111–126, June 1994.
[3] N. Gillis and S. A. Vavasis, "Fast and robust recursive algorithms for separable nonnegative matrix factorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 698–714, Apr. 2014.
[4] Y.-X. Wang and Y.-J. Zhang, "Nonnegative matrix factorization: A comprehensive review," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1336–1353, June 2013.
[5] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.
[6] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. of Neural Information Processing Systems (NIPS), pp. 556–562, 2001.
[7] B. Yang, X. Fu, and N. D. Sidiropoulos, "Joint factor analysis and latent clustering," in Proc. of IEEE Int. Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 173–176, Dec. 2015.
[8] N. Gillis, "The why and how of nonnegative matrix factorization," in Regularization, Optimization, Kernels, and Support Vector Machines, Chapman & Hall/CRC, Machine Learning and Pattern Recognition Series, 2014.
[9] Z. He, S. Xie, R. Zdunek, G. Zhou, and A. Cichocki, "Symmetric nonnegative matrix factorization: Algorithms and applications to probabilistic clustering," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 2117–2131, Dec. 2011.
[10] K. Huang, N. Sidiropoulos, and A. Swami, "Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition," IEEE Transactions on Signal Processing, vol. 62, no. 1, pp. 211–224, Jan. 2014.
[11] D. Kuang, S. Yun, and H. Park, "SymNMF: Nonnegative low-rank approximation of a similarity matrix for graph clustering," Journal of Global Optimization, vol. 62, no. 3, pp. 545–574, Jul. 2015.
[12] F. Wang, T. Li, X. Wang, S. Zhu, and C. Ding, "Community discovery using nonnegative matrix factorization," Data Mining and Knowledge Discovery, vol. 22, no. 3, pp. 493–521, May 2011.
[13] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[14] S. Lu and Z. Wang, "Accelerated algorithms for eigen-value decomposition with application to spectral clustering," in Proc. of Asilomar Conf. Signals, Systems and Computers, pp. 355–359, Nov. 2015.
[15] C. H. Ding, X. He, and H. D. Simon, "On the equivalence of nonnegative matrix factorization and spectral clustering," in Proc. of SIAM Int. Conf. Data Mining, vol. 5, pp. 606–610, 2005.
[16] C.-J. Lin, "Projected gradient methods for nonnegative matrix factorization," Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.
[17] J. Kim and H. Park, "Fast nonnegative matrix factorization: An active-set-like method and comparisons," SIAM Journal on Scientific Computing, vol. 33, no. 6, pp. 3261–3281, 2011.
[18] J. Parker, P. Schniter, and V. Cevher, "Bilinear generalized approximate message passing – Part I: Derivation," IEEE Transactions on Signal Processing, vol. 62, no. 22, pp. 5839–5853, Nov. 2014.
[19] J. Parker, P. Schniter, and V. Cevher, "Bilinear generalized approximate message passing – Part II: Applications," IEEE Transactions on Signal Processing, vol. 62, no. 22, pp. 5854–5867, Nov. 2014.
[20] J. Kim, Y. He, and H. Park, "Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework," Journal of Global Optimization, vol. 58, no. 2, pp. 285–319, Mar. 2013.
[21] C.-J. Lin, "On the convergence of multiplicative update algorithms for nonnegative matrix factorization," IEEE Transactions on Neural Networks, vol. 18, no. 6, pp. 1589–1596, Nov. 2007.
[22] D. Kuang, C. Ding, and H. Park, "Symmetric nonnegative matrix factorization for graph clustering," in Proc. of SIAM Int. Conf. Data Mining, pp. 106–117, 2012.
[23] A. Vandaele, N. Gillis, Q. Lei, K. Zhong, and I. Dhillon, "Efficient and non-convex coordinate descent for symmetric nonnegative matrix factorization," IEEE Transactions on Signal Processing, vol. 64, no. 21, pp. 5571–5584, Nov. 2016.

[24] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[25] Y. Xu, W. Yin, Z. Wen, and Y. Zhang, "An alternating direction algorithm for matrix completion with non-negative factors," Frontiers of Mathematics in China, vol. 7, no. 2, pp. 365–384, June 2012.
[26] D. Sun and C. Fevotte, "Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence," in Proc. of IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), pp. 6201–6205, May 2014.
[27] K. Huang, N. D. Sidiropoulos, and A. P. Liavas, "A flexible and efficient algorithmic framework for constrained matrix and tensor factorization," IEEE Transactions on Signal Processing, vol. 64, no. 19, pp. 5052–5065, June 2016.
[28] D. Hajinezhad, T. H. Chang, X. Wang, Q. Shi, and M. Hong, "Nonnegative matrix factorization using ADMM: Algorithm and convergence analysis," in Proc. of IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), pp. 4742–4746, Mar. 2016.
[29] S. A. Vavasis, "On the complexity of nonnegative matrix factorization," SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–1377, 2009.
[30] P. J. C. Dickinson and L. Gijben, "On the computational complexity of membership problems for the completely positive cone and its dual," Computational Optimization and Applications, vol. 57, pp. 403–415, Mar. 2014.
[31] C. D. Sa, C. Re, and K. Olukotun, "Global convergence of stochastic gradient descent for some non-convex matrix problems," in Proc. of Int. Conf. Machine Learning (ICML), pp. 2332–2341, 2015.
[32] N. Gillis, "Sparse and unique nonnegative matrix factorization through data preprocessing," Journal of Machine Learning Research, vol. 13, pp. 3349–3386, 2012.
[33] R. Sun and Z.-Q. Luo, "Guaranteed matrix completion via non-convex factorization," IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 6535–6579, Nov. 2016.
[34] T. Zhao, Z. Wang, and H. Liu, "A nonconvex optimization framework for low rank matrix estimation," in Proc. of Neural Information Processing Systems (NIPS), pp. 559–567, 2015.
[35] A. Montanari and E. Richard, "Non-negative principal component analysis: Message passing algorithms and sharp asymptotics," IEEE Transactions on Information Theory, vol. 62, no. 3, pp. 1458–1484, Mar. 2016.
[36] D. P. Bertsekas, P. Hosein, and P. Tseng, "Relaxation methods for network flow problems with convex arc costs," SIAM Journal on Control and Optimization, vol. 25, no. 5, pp. 1219–1243, Sept. 1987.
[37] J. Eckstein and D. P. Bertsekas, "On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators," Mathematical Programming, vol. 55, no. 1, pp. 293–318, 1992.
[38] M. Hong, Z.-Q. Luo, and M. Razaviyayn, "Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems," SIAM Journal on Optimization, vol. 26, no. 1, pp. 337–364, 2016.
[39] G. Li and T.-K. Pong, "Global convergence of splitting methods for nonconvex composite optimization," SIAM Journal on Optimization, vol. 25, no. 4, pp. 2434–2460, 2015.
[40] B. P. W. Ames and M. Hong, "Alternating direction method of multipliers for penalized zero-variance discriminant analysis," Computational Optimization and Applications, vol. 64, no. 3, pp. 725–754, 2016.
[41] Y. Wang, W. Yin, and J. Zeng, "Global convergence of ADMM in nonconvex nonsmooth optimization," UCLA CAM Report 15-61, 2015.
[42] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[43] D. P. Bertsekas, Nonlinear Programming, 2nd ed., Athena Scientific, Belmont, MA, 1999.
[44] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer Science & Business Media, 2007.
[45] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
[46] M. Razaviyayn, M. Hong, Z.-Q. Luo, and J.-S. Pang, "Parallel successive convex approximation for nonsmooth nonconvex optimization," in Proc. of Neural Information Processing Systems (NIPS), pp. 1440–1448, 2014.
[47] G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J. S. Pang, "Decomposition by partial linearization: Parallel optimization of multi-agent systems," IEEE Transactions on Signal Processing, vol. 62, no. 3, pp. 641–656, Feb. 2014.
[48] C. Navasca, L. De Lathauwer, and S. Kindermann, "Swamp reducing technique for tensor decomposition," in Proc. of the 16th European Signal Processing Conference, pp. 1–5, 2008.
[49] D. Cai, X. He, and J. Han, "Locally consistent concept factorization for document clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, pp. 902–913, 2011.
[50] J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney, "Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters," Internet Mathematics, vol. 6, no. 1, pp. 29–123, 2009.
[51] E. Cho, S. A. Myers, and J. Leskovec, "Friendship and mobility: User movement in location-based social networks," in Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011.
[52] Z. Yang and E. Oja, "Quadratic nonnegative matrix factorization," Pattern Recognition, vol. 45, no. 4, pp. 1500–1510, 2012.