arXiv:2012.05943v1 [math.OC] 10 Dec 2020

Greedy coordinate descent method on non-negative quadratic programming

Chenyu Wu and Yangyang Xu
Department of Mathematical Sciences
Rensselaer Polytechnic Institute
Troy, NY 12180, USA
[email protected], [email protected]

Abstract—The coordinate descent (CD) method has recently become popular for solving very large-scale problems, partly due to its simple update, low memory requirement, and fast convergence. In this paper, we explore the greedy CD on solving non-negative quadratic programming (NQP). The greedy CD generally has much more expensive per-update complexity than its cyclic and randomized counterparts. However, on the NQP, we show that these three CDs can have almost the same per-update cost, while the greedy CD can converge significantly faster. We also apply the proposed greedy CD as a subroutine to solve the linearly constrained NQP and the non-negative matrix factorization. Promising numerical results on both problems are observed on instances with synthetic data and also image data.

Index Terms—greedy coordinate descent, quadratic programming, nonnegative matrix factorization

This work is partly supported by the NSF grant DMS-1719549.

I. INTRODUCTION

The coordinate descent (CD) method is one of the most classic iterative methods for solving optimization problems. It dates back to the 1950s [6] and is closely related to the Gauss-Seidel and Jacobi methods for solving a linear system. Compared to a full-update method, such as the gradient descent method and Newton's method, the coordinate update is simpler and also cheaper, and the CD method has a lower memory requirement. Partly due to this reason, CD (and its variant, the block coordinate descent) has recently become particularly popular for solving very large-scale problems under both convex and nonconvex settings (e.g., [3], [4], [7], [17], [22], [25], [29], [33], [35], [38]–[40]). Roughly speaking, at each iteration, the CD method picks one (or possibly many) out of many coordinate variables by a certain rule and then updates it (or them in a certain way) to decrease the objective value.

Early works on CD (e.g., [6], [16], [32]) chose the coordinates cyclicly, or greedily such that the decrease of the objective value or the change of the updated coordinate is maximized [2]. Since the pioneering work [17] that introduces the randomized selection of coordinates, many recent works focus on randomized CD methods (e.g., [14], [15], [22], [26], [36]). Theoretically, the randomized CD can have faster convergence than the cyclic CD [30]. Computationally, the greedy CD is generally more expensive than its cyclic and randomized counterparts, namely, the latter two can be coordinate-friendly (CF) [21] while the greedy one may fail to. However, for some specially-structured problems, the greedy CD is also CF and can be faster than both the randomized and cyclic CD in both theory and practice (e.g., [12], [18], [19], [23]).
In this paper, we focus on the greedy CD for solving the non-negative quadratic programming (NQP). Similar to the cyclic and randomized CD methods, we show that the greedy CD for solving the NQP is also CF, by maintaining the full gradient of the objective and renewing it with O(n) flops after each coordinate update. Numerically, we demonstrate that the greedy CD can be significantly faster than its cyclic and randomized counterparts. We then apply the greedy CD as a subroutine in the framework of the augmented Lagrangian method (ALM) for the linearly constrained NQP and in the framework of the alternating minimization for the nonnegative matrix factorization (NMF) [11], [20]. On both problems, with synthetic and/or real-world data, we observe promising numerical performance of the proposed methods.

Notation. [n] represents the set {1, . . . , n}. For a vector x, x_i denotes its i-th variable, and x_{>i} (resp. x_{<i}) denotes the subvector of x with indices greater (resp. less) than i. For a matrix P, p_i denotes its i-th column and P_ii its i-th diagonal entry. ∇_i f denotes the partial derivative of f about the i-th variable.

II. GREEDY COORDINATE DESCENT

In this section, we first describe the greedy CD for a general coordinate-constrained optimization problem and then show the details on how to apply it to the NQP. Consider the problem

    min_x f(x), s.t. x_i ∈ X_i, ∀ i ∈ [n],

where f is a differentiable function and, for each i ∈ [n], X_i is a closed convex set. At iteration k, the greedy CD performs the update

    x_{i_k}^{k+1} ∈ arg min_{x_{i_k} ∈ X_{i_k}} f(x_{<i_k}^k, x_{i_k}, x_{>i_k}^k),   x_i^{k+1} = x_i^k, ∀ i ≠ i_k,        (1)

where the coordinate i_k is chosen by

    i_k ∈ arg min_{i ∈ [n]} min_{x_i ∈ X_i} f(x_{<i}^k, x_i, x_{>i}^k).        (2)

Notice that here we follow [2] and greedily choose i_k based on the objective value.

In the literature, there are a few other ways to greedily choose i_k, based on the magnitude of partial derivatives or on the change of the coordinate update. We refer the readers to the review paper [29].

In general, to choose i_k by (2), we need to solve n one-dimensional minimization problems, and thus the per-update complexity of the greedy CD can be as high as n times that of a cyclic or randomized CD. Similar to [12], which considers the LASSO, we show that the greedy CD on the NQP can have a similar per-update cost as the cyclic and randomized CD.

A. Greedy CD on the NQP

Now we derive the details on how to apply the aforementioned greedy CD to the following NQP:

    min_{x ≥ 0} F(x) = (1/2) x^⊤ P x + d^⊤ x.        (3)

Here, P ∈ R^{n×n} is a given symmetric positive semidefinite (PSD) matrix, and d ∈ R^n is given. To have well-defined coordinate updates, we assume P_ii > 0, ∀ i ∈ [n]. The work [31] has studied a greedy block CD on unconstrained QPs, and it requires the matrix P to be positive definite. Hence, our setting is more general, and more importantly, our algorithm can be used as a subroutine to solve a larger class of problems such as the linear equality-constrained NQP and the NMF, discussed in sections III and IV, respectively. We emphasize that our discussion on the greedy CD may be extended to other applications with separable non-smooth regularizers.

Suppose that the value of the k-th iterate is x^k. We define G_i^{(k)}(x_i) = F(x_{<i}^k, x_i, x_{>i}^k). Since F is a quadratic function, by the Taylor expansion, it holds that

    G_i^{(k)}(x_i) = F(x^k) + ∇_i F(x^k)(x_i − x_i^k) + (P_ii/2)(x_i − x_i^k)^2.        (4)

Let x̂_i^k be the minimizer of G_i^{(k)}(x_i) over x_i ≥ 0. Then

    x̂_i^k = max(0, x_i^k − ∇_i F(x^k)/P_ii), ∀ i ∈ [n],        (5)

and thus the best coordinate by the rule in (2) is

    i_k = arg min_{i ∈ [n]} { ∇_i F(x^k)(x̂_i^k − x_i^k) + (P_ii/2)(x̂_i^k − x_i^k)^2 }.        (6)
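To make (5) and (6) concrete, the following minimal NumPy sketch (our illustration, not the authors' code) evaluates the closed-form coordinate minimizers and the greedy selection rule at a given iterate, computing the full gradient from scratch; the next paragraph explains why this naive evaluation is expensive and how to avoid it.

```python
import numpy as np

def greedy_coordinate(P, d, x):
    # Evaluate (5) and (6) at the iterate x for F(x) = 0.5*x'Px + d'x.
    g = P @ x + d                      # full gradient computed from scratch: O(n^2) flops
    Pdiag = np.diag(P)                 # assumed positive, as in the paper
    x_hat = np.maximum(0.0, x - g / Pdiag)                       # eq. (5)
    decrease = g * (x_hat - x) + 0.5 * Pdiag * (x_hat - x) ** 2  # objective change in (6)
    i_k = int(np.argmin(decrease))
    return i_k, x_hat[i_k]
```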

Notice that the most expensive part in (5) lies in computing ∇_i F(x^k), which takes O(n) flops. Hence, to obtain the best i_k and thus a new iterate x^{k+1}, it costs O(n^2). Therefore, performing n coordinate updates will cost O(n^3), which is an order of magnitude larger than the per-update cost O(n^2) of the gradient descent method. However, this naive implementation does not exploit the coordinate update, i.e., any two consecutive iterates differ at most at one coordinate. Using this fact, we show below that the greedy CD method on solving (3) can be CF [21] by maintaining a full gradient, i.e., n coordinate updates cost similarly as a gradient descent update.

Let g^k = ∇F(x^k). Provided that g^k is stored in the memory, we only need O(1) flops to have x̂_i^k defined in (5), and thus to obtain the best i_k, it costs O(n). Furthermore, notice that ∇F(x) = Px + d. Hence,

    ∇F(x^{k+1}) = ∇F(x^k) + P(x^{k+1} − x^k) = g^k + (x̂_{i_k}^k − x_{i_k}^k) p_{i_k}.

Therefore, to renew g, it takes an additional O(n) flops. This way, we need O(n) flops to update the iterate and maintain the full gradient from (x^k, g^k) to (x^{k+1}, g^{k+1}), and thus completing n greedy coordinate updates costs O(n^2), which is similar to the cost of a gradient descent update.

B. Stopping Condition

A point x* ≥ 0 is an optimal solution to (3) if and only if 0 ∈ ∇F(x*) + N_+(x*), where N_+(x) = {g ∈ R^n : g_i ≤ 0, g_i x_i = 0, ∀ i ∈ [n]} denotes the normal cone of the non-negative orthant at x ≥ 0. Therefore, x* should satisfy the following conditions for each i ∈ [n]: ∇_i F(x*) = 0 if x_i* > 0, and ∇_i F(x*) ≥ 0 if x_i* = 0. Let I_0^k = {i ∈ [n] : x_i^k = 0}, I_+^k = {i ∈ [n] : x_i^k > 0}, and

    δ_k = ( Σ_{i ∈ I_0^k} [min(0, ∇_i F(x^k))]^2 + Σ_{i ∈ I_+^k} [∇_i F(x^k)]^2 )^{1/2}.        (7)

Then if δ_k is smaller than a pre-specified error tolerance, we can stop the algorithm.

C. Pseudocode of the greedy CD

Summarizing the above discussions, we have the pseudocode for solving the NQP (3) in Algorithm 1.

Algorithm 1: Greedy CD for (3): GCD(P, d, ε, x^0)
1: Input: a PSD matrix P ∈ R^{n×n}, d ∈ R^n, an initial point x^0 ≥ 0, and an error tolerance ε > 0.
2: Overhead: let k = 0, g^0 = ∇F(x^0), and set δ_0 by (7).
3: while δ_k > ε do
4:   Compute x̂_i^k for all i ∈ [n] by (5);
5:   Find i_k ∈ [n] by the rule in (6);
6:   Let x_i^{k+1} = x_i^k for i ≠ i_k and x_i^{k+1} = x̂_i^k for i = i_k;
7:   Update g^{k+1} = g^k + (x_{i_k}^{k+1} − x_{i_k}^k) p_{i_k};
8:   Increase k ← k + 1 and compute δ_k by (7).
9: Output x^k.
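The pseudocode translates almost line by line into code. Below is a small Python/NumPy sketch of Algorithm 1 (our illustrative implementation, not the authors' code; the helper name gcd and the iteration cap are our choices). It maintains the full gradient g^k, renews it with O(n) flops per coordinate update, and stops by the residual δ_k in (7). The same routine is reused in the sketches of Algorithms 2 and 3 below.

```python
import numpy as np

def gcd(P, d, eps, x0, max_iter=1000000):
    # Greedy CD for min_{x >= 0} F(x) = 0.5*x'Px + d'x  (sketch of Algorithm 1).
    x = np.asarray(x0, dtype=float).copy()
    g = P @ x + d                     # g^0 = grad F(x^0); the only O(n^2) step
    Pdiag = np.diag(P)                # assumed positive
    for _ in range(max_iter):
        # residual delta_k from (7): full gradient on positive coords, negative part on zero coords
        delta = np.linalg.norm(np.where(x > 0, g, np.minimum(0.0, g)))
        if delta <= eps:
            break
        x_hat = np.maximum(0.0, x - g / Pdiag)                       # eq. (5), O(n)
        decrease = g * (x_hat - x) + 0.5 * Pdiag * (x_hat - x) ** 2  # eq. (6), O(n)
        i = int(np.argmin(decrease))
        step = x_hat[i] - x[i]
        x[i] = x_hat[i]
        g += step * P[:, i]           # renew the gradient with O(n) flops
    return x
```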
D. Convergence result

The greedy CD that chooses coordinates based on the objective decrease has been analyzed in [2]. We apply the results there to obtain the convergence of Algorithm 1.

Theorem 1. Let {x^k} be the sequence from Algorithm 1. Suppose that the lower level set L_0 = {x ≥ 0 : F(x) ≤ F(x^0)} is compact. Then F(x^k) → F* as k → ∞, where F* is the optimal objective value of (3).

Proof. Since F(x^k) is decreasing with respect to k and L_0 is compact, the sequence {x^k} has a finite cluster point x̄, and F(x^k) → F(x̄) as k → ∞. Now notice that the minimizer of G_i^{(k)} in (4) is unique for each i since P_ii > 0. Hence, it follows from [2, Theorem 3.1] that x̄ must be a stationary point of (3) and thus an optimal solution because P is PSD. Therefore, F(x̄) = F*, and this completes the proof.

Remark 1. It is not difficult to show that F(x^{k+1}) − F(x^k) ≤ −(P_{i_k i_k}/2) ∥x^{k+1} − x^k∥^2. In addition, by [34, Theorem 18], the quadratic function F satisfies the so-called global error bound. Hence, it is possible to show a globally linear convergence result of Algorithm 1. Due to the page limitation, we do not extend the detailed discussion here, but instead we will explore it in an extended version of the paper.

E. Comparison to the cyclic and randomized CD methods

We compare the greedy CD to its cyclic and randomized counterparts and also to the accelerated projected gradient method FISTA [1]. Two random NQP instances were generated. For the first one, we set n = 5,000 and generated a symmetric PSD matrix P and the vector d by the normal distribution. For the second one, we set n = 1,000, P = 0.1I + 0.9E and d = −10e, where I denotes the identity matrix, and E and e are the all-ones matrix and vector. The P in the second instance was used to construct a difficult unconstrained QP for the cyclic CD in [10], [30]. Figure 1 plots the objective error produced by the three different CD methods and FISTA. From the plots, we clearly see that the greedy CD performs significantly better than the other two CDs and FISTA on both instances.

[Figure 1: two semilog plots of the objective distance to optimality versus the number of epochs for the greedy CD, randomized CD, cyclic CD, and FISTA.]
Fig. 1. Comparison of three different CDs and FISTA on two NQP instances. Left: dimension n = 5,000 and P generated by normal distribution; Right: n = 1,000 and P = 0.1I + 0.9E.
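For concreteness, the two test instances can be generated roughly as follows (a sketch under our own assumptions: the paper does not report random seeds or the exact recipe for the random PSD matrix; gcd is the Algorithm 1 sketch from section II).

```python
import numpy as np

rng = np.random.default_rng(0)   # seed is arbitrary; the paper does not specify one

# Instance 1: n = 5,000 with a symmetric PSD P and d drawn from the normal distribution.
n1 = 5000
B = rng.standard_normal((n1, n1))
P1 = B @ B.T / n1                # one way to form a random symmetric PSD matrix
d1 = rng.standard_normal(n1)

# Instance 2: n = 1,000, P = 0.1*I + 0.9*E, d = -10*e (hard case for cyclic CD [10], [30]).
n2 = 1000
P2 = 0.1 * np.eye(n2) + 0.9 * np.ones((n2, n2))
d2 = -10.0 * np.ones(n2)

x_star = gcd(P2, d2, 1e-6, np.zeros(n2))   # solve the second instance with the Algorithm 1 sketch
```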

III. LINEAR EQUALITY-CONSTRAINED NON-NEGATIVE QUADRATIC PROGRAMMING

In this section, we apply Algorithm 1 as a subroutine in the inexact ALM framework to solve the NQP with linear constraints. More specifically, we consider the problem

    min_{x ≥ 0} F(x) = (1/2) x^⊤ Q x + c^⊤ x, s.t. Ax = b,        (8)

where Q ∈ R^{n×n} is a PSD matrix, and A ∈ R^{m×n}. The ALM [24], [27] is perhaps the most popular method for solving nonlinear functional constrained problems. Applied to (8), it iteratively performs the updates:

    x^{k+1} = arg min_{x ≥ 0} L_{β_k}(x, y^k),        (9a)
    y^{k+1} = y^k + β_k (Ax^{k+1} − b).        (9b)

Here, y ∈ R^m is the Lagrange multiplier, and

    L_β(x, y) = F(x) + y^⊤(Ax − b) + (β/2) ∥Ax − b∥^2        (10)

is the AL function with a penalty parameter β > 0.

A. Inexact ALM with greedy CD

In the ALM update, the y-update is easy. However, the x-update subproblem (9a) in general requires an iterative solver. On solving (8), we can rewrite the AL function as

    L_β(x, y) = (1/2) x^⊤(Q + βA^⊤A)x + (c + A^⊤y − βA^⊤b)^⊤ x,

which is a quadratic function. Hence, (9a) is an NQP, and we propose to apply the greedy CD derived previously to solve it. The pseudocode of the proposed method is shown in Algorithm 2.

Algorithm 2: Inexact ALM with greedy CD for (8)
1: Input: a PSD matrix Q ∈ R^{n×n}, c ∈ R^n, A ∈ R^{m×n}, b ∈ R^m, and ε > 0.
2: Initialization: x^0 ∈ R^n, y^0 = 0, and β_0 > 0; set k = 0.
3: while a stopping condition is not satisfied do
4:   Let P = Q + β_k A^⊤A and d = c + A^⊤ y^k − β_k A^⊤ b.
5:   Choose ε_k ≤ ε and let x^{k+1} = GCD(P, d, ε_k, x^k).
6:   Obtain y^{k+1} by (9b).
7:   Choose β_{k+1} ≥ β_k, and increase k ← k + 1.

Notice that each x^{k+1} satisfies dist(0, ∇_x L_{β_k}(x^{k+1}, y^k) + N_+(x^{k+1})) ≤ ε_k. Hence, by the update of y^{k+1}, it holds that

    dist(0, ∇F(x^{k+1}) + A^⊤ y^{k+1} + N_+(x^{k+1})) ≤ ε_k,

namely, the dual residual is always no larger than ε_k. Since ε_k ≤ ε, the output (x^{k+1}, y^{k+1}) violates the KKT conditions by at most ε in terms of both primal and dual feasibility, if we stop the algorithm once ∥Ax^{k+1} − b∥ ≤ ε.
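As an illustration, here is a compact Python sketch of Algorithm 2 that reuses the gcd routine sketched in section II (the penalty schedule, the starting β_0, and the fixed ε_k = ε are our illustrative choices, not prescribed by the paper).

```python
import numpy as np

def inexact_alm_gcd(Q, c, A, b, eps, x0, beta0=1.0, growth=2.0, max_outer=100):
    # Inexact ALM for min_{x >= 0} 0.5*x'Qx + c'x  s.t. Ax = b  (sketch of Algorithm 2).
    x = np.asarray(x0, dtype=float).copy()
    y = np.zeros(A.shape[0])
    beta = beta0
    for _ in range(max_outer):
        P = Q + beta * (A.T @ A)               # the AL function in x is an NQP with this P
        d = c + A.T @ y - beta * (A.T @ b)     # ... and this d
        x = gcd(P, d, eps, x)                  # greedy CD subroutine with eps_k <= eps
        y = y + beta * (A @ x - b)             # multiplier update (9b)
        if np.linalg.norm(A @ x - b) <= eps:   # stop once primal feasibility reaches eps
            break
        beta = growth * beta                   # any choice with beta_{k+1} >= beta_k
    return x, y
```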

B. Convergence result

We can apply the results in [13], [27], [37] to obtain the convergence of Algorithm 2 based on the actual iterate.

Theorem 2. Suppose that for each i ∈ [n], Q_ii > 0 or a_i ≠ 0. Let {x^k}_{k≥0} be the sequence from Algorithm 2. If ε_k → 0 and β_k → ∞, then |F(x^k) − F*| → 0 and ∥Ax^k − b∥ → 0, where F* denotes the optimal objective value of (8).

IV. NON-NEGATIVE MATRIX FACTORIZATION

In this section, we consider the non-negative matrix factorization (NMF). It aims to factorize a given non-negative matrix M ∈ R_+^{m×n} into two low-rank non-negative matrices X and Y such that M ≈ XY^⊤. The factor matrix X plays the role of a basis while Y contains the coefficients. Due to the non-negativity, NMF can be used to learn local features of an object [11] and has better interpretability than the principal component analysis (PCA). Measuring the approximation error by the Frobenius norm, one can model the NMF as

    min_{X, Y} (1/2) ∥XY^⊤ − M∥_F^2, s.t. X ∈ R_+^{m×r}, Y ∈ R_+^{n×r},        (11)

where M ∈ R_+^{m×n} is given, and r is a user-specified rank.

A. Alternating minimization with greedy CD

The objective of (11) is non-convex jointly with respect to X and Y. However, it is convex with respect to one of X and Y while the other is fixed. For this reason, one natural way to solve (11) is the alternating minimization (AltMin), which iteratively performs the updates

    X^{k+1} = arg min_{X ≥ 0} (1/2) ∥X(Y^k)^⊤ − M∥_F^2,        (12a)
    Y^{k+1} = arg min_{Y ≥ 0} (1/2) ∥X^{k+1} Y^⊤ − M∥_F^2.        (12b)

Both subproblems are in the form of min_{Z ∈ R^{r×p}} (1/2) ∥AZ − B∥_F^2, which is equivalent to solving p independent NQPs min_{z_i ∈ R_+^r} (1/2) ∥Az_i − b_i∥^2, i ∈ [p]. Hence, we can apply the greedy CD derived in section II to solve the X-subproblem and Y-subproblem in (12), by breaking them respectively into m and n independent NQPs. The pseudocode is shown in Algorithm 3. In Lines 4 and 8, we rescale the two factor matrices such that they have balanced norms. This way, neither of them will blow up or diminish, and the resulting NQP subproblems are relatively well-conditioned. Numerically, we observe that the rescaling technique can significantly speed up the convergence.

Notice that it is straightforward to extend our method to the non-negative tensor decomposition [28]. Due to the page limitation, we do not give the details here but leave it to an extended version of this paper.

Algorithm 3: AltMin with greedy CD for (11)
1: Input: M ∈ R_+^{m×n}, and rank r.
2: Initialization: X^0 ∈ R_+^{m×r} and Y^0 ∈ R_+^{n×r}.
3: for k = 0, 1, . . . do
4:   Rescale X^k and Y^k to ∥x_i^k∥ = ∥y_i^k∥, ∀ i ∈ [r].
5:   Let P = (Y^k)^⊤ Y^k and D = −(Y^k)^⊤ M^⊤.
6:   Choose ε_k > 0.
7:   Compute x_{i:}^{k+1} = GCD(P, d_i, ε_k, x_{i:}^k), for i ∈ [m].
8:   Rescale X^{k+1} and Y^k to ∥x_i^{k+1}∥ = ∥y_i^k∥, ∀ i ∈ [r].
9:   Let P = (X^{k+1})^⊤ X^{k+1} and D = −(X^{k+1})^⊤ M.
10:  Compute y_{i:}^{k+1} = GCD(P, d_i, ε_k, y_{i:}^k), for i ∈ [n].
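A minimal Python sketch of Algorithm 3, again reusing the gcd routine from section II (the random initialization, the fixed tolerance ε_k, and the particular rescaling factor are our illustrative choices; any rescaling that balances the column norms while preserving the product X Y^⊤ would do).

```python
import numpy as np

def rescale(X, Y):
    # Balance column norms so that ||x_i|| = ||y_i|| while keeping X @ Y.T unchanged.
    nx = np.linalg.norm(X, axis=0) + 1e-15
    ny = np.linalg.norm(Y, axis=0) + 1e-15
    s = np.sqrt(ny / nx)
    return X * s, Y / s

def altmin_gcd(M, r, eps_k=1e-3, outer_iters=50, seed=0):
    # AltMin with greedy CD for min_{X,Y >= 0} 0.5*||X Y' - M||_F^2  (sketch of Algorithm 3).
    m, n = M.shape
    rng = np.random.default_rng(seed)
    X, Y = rng.random((m, r)), rng.random((n, r))   # positive init; factor columns stay nonzero
    for _ in range(outer_iters):
        X, Y = rescale(X, Y)
        P, D = Y.T @ Y, -(Y.T @ M.T)        # X-subproblem = m independent r-dimensional NQPs
        for i in range(m):
            X[i, :] = gcd(P, D[:, i], eps_k, X[i, :])
        X, Y = rescale(X, Y)
        P, D = X.T @ X, -(X.T @ M)          # Y-subproblem = n independent r-dimensional NQPs
        for i in range(n):
            Y[i, :] = gcd(P, D[:, i], eps_k, Y[i, :])
    return X, Y
```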

B. Convergence result

The convergence of the AltMin has been well-studied; see [5] for example. Although we rescale the two factor matrices and inexactly solve each subproblem, it is not difficult to adapt the existing analysis and obtain the convergence of Algorithm 3 as follows.

Theorem 3. Let {(X^k, Y^k)}_{k≥0} be the sequence generated from Algorithm 3 with ε_k → 0. Then any finite limit point of the sequence is a stationary point of (11).

V. NUMERICAL EXPERIMENTS

In this section, we test Algorithm 2 on Gaussian randomly-generated instances of (8) and compare it to the MATLAB built-in solver quadprog. Also, we test Algorithm 3 on three instances of the NMF (11), two with synthetic data and another with face image data. For the tests on (8), we generated three different-sized instances. In Algorithm 2, we set ε_k = 10^{-3}, ∀ k, for the subroutine GCD, and the algorithm was stopped once ∥Ax^k − b∥ ≤ ε with ε = 10^{-2} or 10^{-3}. For the tests on (11) with synthetic data, we obtain M = LR^⊤, where L ∈ R_+^{m×r} and R ∈ R_+^{n×r} were respectively generated by MATLAB's code max(0,randn(m,r)) and max(0,randn(n,r)). One dataset was generated with m = n = 1,000, r = 50 and the other with m = n = 5,000, r = 100. For the other test on (11), we used the CBCL face image data [8], which consists of 6,977 images of size 19 × 19. We picked the first 2,000 images and vectorized each image into a vector. This way, we formed a non-negative M ∈ R^{361×2000}, and we set r = 30. In Algorithm 3, we set ε_k = 10^{-3}, ∀ k, for the subroutine GCD on the tests with synthetic data and ε_k = 0.1, ∀ k, on the face image data. Different tolerances were adopted here because the image data does not admit an exact factorization.
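In Python, the synthetic NMF data described above can be generated as follows (a direct translation of the quoted MATLAB expressions; the seed and the final call to the earlier altmin_gcd sketch are our additions).

```python
import numpy as np

rng = np.random.default_rng(0)             # seed is arbitrary
m, n, r = 1000, 1000, 50                   # first synthetic instance; the second uses m = n = 5000, r = 100
L = np.maximum(0.0, rng.standard_normal((m, r)))   # analogue of max(0, randn(m, r))
R = np.maximum(0.0, rng.standard_normal((n, r)))   # analogue of max(0, randn(n, r))
M = L @ R.T                                # synthetic non-negative matrix admitting an exact NMF

X, Y = altmin_gcd(M, r, eps_k=1e-3)        # factor it with the Algorithm 3 sketch
```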

TABLE I
Results by Algorithm 2 on three different-sized instances of (8) and the speed comparison with the MATLAB function quadprog. Here, tol. is the ε in Algorithm 2; obj. relerr is computed by |F(x) − F*|/|F*|; res. relerr is ∥Ax − b∥/∥b∥; time1 is the running time (sec.) by the proposed method; time2 is by quadprog.

problem size           tol.     obj. relerr   res. relerr   time1     time2
m = 200, n = 1000      0.01     2.758e-05     5.192e-04     0.42      0.32
                       0.001    1.118e-06     2.811e-05     0.76
m = 1000, n = 5000     0.01     3.714e-06     1.138e-04     16.68     24.92
                       0.001    2.257e-07     8.074e-06     30.85
m = 2000, n = 10000    0.01     4.361e-07     2.021e-05     71.90     217.41
                       0.001    3.795e-07     1.419e-05     133.07

The results for the tests on (8) are shown in Table I. From the results, we see that our method can yield medium-accurate solutions. For the middle-sized instance, our method can be faster than MATLAB's solver when ε = 10^{-2}, and for the large-sized instance, our method is faster under both settings of ε = 10^{-2} and 10^{-3}. This implies that for solving large-scale linearly-constrained NQP, the proposed method can outperform MATLAB's solver if a medium-accurate solution is required.

The results for the tests on (11) are plotted in Figure 2, where we compared Algorithm 3 with the BlockPivot method in [9] and the accelerated AltPG method in [38]. The BlockPivot is also an AltMin, but different from our greedy CD, it uses an active-set-type method as a subroutine to exactly solve each subproblem. The accelerated AltPG performs block proximal gradient updates to X and Y alternatingly, and it uses the extrapolation technique for acceleration. From the results, we see that the proposed method performs significantly better than BlockPivot, which seems to be trapped at a local solution for the two synthetic cases. Compared to the accelerated AltPG, the proposed method can be faster in the beginning to obtain a medium accuracy. In addition, the proposed method converges faster with the rescaling than without the rescaling on the instances with large-sized synthetic data and the face image data. That could be because the subproblems may be badly-conditioned if the rescaling is not applied.

[Figure 2: three plots of the objective versus the running time (sec.) for the proposed method with rescale, the proposed method without rescale, BlockPivot, and Accelerated AltPG.]
Fig. 2. Comparison of the proposed method (Algorithm 3) with and without the rescaling to the BlockPivot method in [9] and the accelerated AltPG in [38]. Left: synthetic data with m = n = 1,000, r = 50; Middle: synthetic data with m = n = 5,000, r = 100; Right: CBCL face image data set.

REFERENCES
[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[2] B. Chen, S. He, Z. Li, and S. Zhang. Maximum block improvement and polynomial optimization. SIAM Journal on Optimization, 22(1):87–107, 2012.
[3] C. D. Dang and G. Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM Journal on Optimization, 25(2):856–881, 2015.
[4] X. Gao, Y.-Y. Xu, and S.-Z. Zhang. Randomized primal–dual proximal block coordinate updates. Journal of the Operations Research Society of China, 7(2):205–250, 2019.
[5] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss–Seidel method under convex constraints. Operations Research Letters, 26(3):127–136, 2000.
[6] C. Hildreth. A quadratic programming procedure. Naval Research Logistics Quarterly, 4(1):79–85, 1957.
[7] M. Hong, X. Wang, M. Razaviyayn, and Z.-Q. Luo. Iteration complexity analysis of block coordinate descent methods. Mathematical Programming, 163(1-2):85–114, 2017.
[8] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5(Nov):1457–1469, 2004.
[9] J. Kim and H. Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In 2008 Eighth IEEE International Conference on Data Mining, pages 353–362. IEEE, 2008.
[10] C.-P. Lee and S. J. Wright. Random permutations fix a worst case for cyclic coordinate descent. IMA Journal of Numerical Analysis, 39(3):1246–1275, 2018.
[11] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.
[12] Y. Li and S. Osher. Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm. Inverse Problems and Imaging, 3(3):487–503, 2009.
[13] Z. Li and Y. Xu. Augmented Lagrangian based first-order methods for convex and nonconvex programs: nonergodic convergence and iteration complexity. arXiv preprint arXiv:2003.08880, 2020.
[14] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages 3059–3067, 2014.
[15] J. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.
[16] Z.-Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.
[17] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[18] J. Nutini. Greed is good: greedy optimization methods for large-scale structured problems. PhD thesis, University of British Columbia, 2018.
[19] J. Nutini, M. Schmidt, I. Laradji, M. Friedlander, and H. Koepke. Coordinate descent converges faster with the Gauss–Southwell rule than random selection. In International Conference on Machine Learning, pages 1632–1641, 2015.
[20] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.
[21] Z. Peng, T. Wu, Y. Xu, M. Yan, and W. Yin. Coordinate-friendly structures, algorithms and applications. Annals of Mathematical Sciences and Applications, 1(1):57–119, 2016.
[22] Z. Peng, Y. Xu, M. Yan, and W. Yin. ARock: an algorithmic framework for asynchronous parallel coordinate updates. SIAM Journal on Scientific Computing, 38(5):A2851–A2879, 2016.
[23] Z. Peng, M. Yan, and W. Yin. Parallel and distributed sparse optimization. In 2013 Asilomar Conference on Signals, Systems and Computers, pages 659–646. IEEE, 2013.
[24] M. J. Powell. A method for nonlinear constraints in minimization problems. In Optimization, R. Fletcher, Ed., Academic Press, New York, NY, 1969.
[25] M. Razaviyayn, M. Hong, and Z.-Q. Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
[26] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.
[27] R. T. Rockafellar. The multiplier method of Hestenes and Powell applied to convex programming. Journal of Optimization Theory and Applications, 12(6):555–562, 1973.
[28] A. Shashua and T. Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd International Conference on Machine Learning, pages 792–799. ACM, 2005.
[29] H.-J. M. Shi, S. Tu, Y. Xu, and W. Yin. A primer on coordinate descent algorithms. arXiv preprint arXiv:1610.00040, 2016.
[30] R. Sun and Y. Ye. Worst-case complexity of cyclic coordinate descent: O(n^2) gap with randomized version. Mathematical Programming, pages 1–34, 2019.
[31] G. Thoppe, V. S. Borkar, and D. Garg. Greedy block coordinate descent (GBCD) method for high dimensional quadratic programs. arXiv preprint arXiv:1404.6635, 2014.
[32] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
[33] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1-2):387–423, 2009.
[34] P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. The Journal of Machine Learning Research, 15(1):1523–1548, 2014.
[35] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
[36] Y. Xu. Asynchronous parallel primal–dual block coordinate update methods for affinely constrained convex programs. Computational Optimization and Applications, 72(1):87–113, 2019.
[37] Y. Xu. Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming. Mathematical Programming, Series A, pages 1–46, 2019.
[38] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.
[39] Y. Xu and W. Yin. Block stochastic gradient iteration for convex and nonconvex optimization. SIAM Journal on Optimization, 25(3):1686–1716, 2015.
[40] Y. Xu and W. Yin. A globally convergent algorithm for nonconvex optimization based on block coordinate update. Journal of Scientific Computing, 72(2):700–734, 2017.