Accelerating Greedy Coordinate Descent Methods

Haihao Lu 1 Robert M. Freund 2 Vahab Mirrokni 3

1 Department of Mathematics and Operations Research Center, MIT  2 MIT Sloan School of Management  3 Google Research. Correspondence to: Haihao Lu. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We introduce and study two algorithms to accelerate greedy coordinate descent in theory and in practice: Accelerated Semi-Greedy Coordinate Descent (ASCD) and Accelerated Greedy Coordinate Descent (AGCD). On the theory side, our main results are for ASCD: we show that ASCD achieves O(1/k²) convergence, and that it also achieves accelerated linear convergence for strongly convex functions. On the empirical side, while both AGCD and ASCD outperform Accelerated Randomized Coordinate Descent on most instances in our numerical experiments, we note that AGCD significantly outperforms the other two methods in our experiments, in spite of a lack of theoretical guarantees for this method. To complement this empirical finding for AGCD, we present an explanation why standard proof techniques for acceleration cannot work for AGCD, and we introduce a technical condition under which AGCD is guaranteed to have accelerated convergence. Finally, we confirm that this technical condition holds in our numerical experiments.

1. Introduction

Coordinate descent methods have received much-deserved attention recently due to their capability for solving large-scale optimization problems (with sparsity) that arise in machine learning applications and elsewhere. With inexpensive updates at each iteration, coordinate descent algorithms obtain faster running times than similar full gradient descent algorithms in order to reach the same near-optimality tolerance; indeed some of these algorithms now comprise the state-of-the-art in machine learning algorithms for loss minimization.

Most recent research on coordinate descent has focused on versions of randomized coordinate descent, which can essentially recover the same results (in expectation) as full gradient descent, including obtaining "accelerated" (i.e., O(1/k²)) convergence guarantees. On the other hand, in some important applications greedy coordinate methods demonstrate superior numerical performance while also delivering much sparser solutions. For example, greedy coordinate descent is one of the fastest algorithms for the graphical LASSO as implemented in DP-GLASSO (Mazumder & Hastie, 2012), and sequential minimal optimization (SMO), a variant of greedy coordinate descent, is widely known as the best solver for kernel SVM (Joachims, 1999) (Platt, 1999) and is implemented in LIBSVM and SVMLight.

In general, for smooth convex optimization the standard first-order methods (including greedy coordinate descent) converge at a rate of O(1/k). In 1983, Nesterov (Nesterov, 1983) proposed an algorithm that achieved a rate of O(1/k²), which can be shown to be the optimal rate achievable by any first-order method (Nemirovsky & Yudin, 1983). This method (and other similar methods) is now referred to as Accelerated Gradient Descent (AGD).

However, there has not been much work on accelerating the standard Greedy Coordinate Descent (GCD), due to the inherent difficulty in demonstrating O(1/k²) computational guarantees (we discuss this difficulty further in Section 4.1). The only work that might be close, as far as the authors are aware, is (Song et al., 2017), which updates the z-sequence using the full gradient and thus should not be considered a coordinate descent method in the standard sense.
In this paper, we study ways to accelerate greedy coordinate descent in theory and in practice. We introduce and study two algorithms: Accelerated Semi-Greedy Coordinate Descent (ASCD) and Accelerated Greedy Coordinate Descent (AGCD). While ASCD takes greedy steps in the x-updates and randomized steps in the z-updates, AGCD is a straightforward extension of GCD that only takes greedy steps. On the theory side, our main results are for ASCD: we show that ASCD achieves O(1/k²) convergence, and it also achieves accelerated linear convergence when the objective function is furthermore strongly convex. However, a direct extension of the convergence proofs for ARCD does not work for ASCD, due to the different coordinates used to update the x-sequence and the z-sequence. Thus, we present a new proof technique, which shows that a greedy coordinate step yields a better objective function value than a full gradient step under a modified smoothness condition.

On the empirical side, we first note that in most of our experiments ASCD outperforms Accelerated Randomized Coordinate Descent (ARCD) in terms of running time. On the other hand, we note that AGCD significantly outperforms the other accelerated coordinate descent methods in all instances, in spite of a lack of theoretical guarantees for this method. To complement the empirical study of AGCD, we present a Lyapunov energy function argument that points to an explanation for why a direct extension of the proof for AGCD does not work. This argument inspires us to introduce a technical condition under which AGCD is guaranteed to converge at an accelerated rate. Interestingly, we confirm that the technical condition holds in a variety of instances in our empirical study, which in turn justifies our empirical observation that AGCD works so well in practice.

1.1. Related Work

Coordinate Descent. Coordinate descent methods have a long history in optimization, and convergence of these methods was extensively studied in the optimization community in the 1980s-90s; see (Bertsekas & Tsitsiklis, 1989), (Luo & Tseng, 1992), and (Luo & Tseng, 1993). There are roughly three types of coordinate descent methods, depending on how the coordinate is chosen: randomized coordinate descent (RCD), cyclic coordinate descent (CCD), and greedy coordinate descent (GCD). RCD has received much attention since the seminal paper of Nesterov (Nesterov, 2012). In RCD, the coordinate is chosen randomly from a certain fixed distribution; (Richtarik & Takac, 2014) provides an excellent review of theoretical results for RCD. CCD chooses the coordinate in a cyclic order; see (Beck & Tetruashvili, 2013) for basic convergence results. More recent results show that CCD is inferior to RCD in the worst case (Sun & Ye, 2016), while it is better than RCD in certain situations (Gurbuzbalaban et al., 2017). In GCD, we select the coordinate yielding the largest reduction in the objective function value. GCD usually delivers better function values at each iteration in practice, though this comes at the expense of having to compute the full gradient in order to select the gradient coordinate with the largest magnitude. The recent work (Nutini et al., 2015) shows that GCD has faster convergence than RCD in theory, and also provides several applications in machine learning where the full gradient can be computed cheaply. A parallel GCD method is proposed in (You et al., 2016), and numerical results show its advantage in practice.

Accelerated Randomized Coordinate Descent. Since Nesterov's paper on RCD (Nesterov, 2012) there has been significant focus on accelerated versions of RCD. (Nesterov, 2012) developed the first accelerated randomized coordinate descent method. (Lu & Xiao, 2015) present a sharper convergence analysis of Nesterov's method using a randomized estimate sequence framework. (Fercoq & Richtarik, 2015) proposed the APPROX (Accelerated, Parallel and PROXimal) coordinate descent method and obtained an accelerated sublinear convergence rate, and (Lee & Sidford, 2013) developed an efficient implementation of ARCD.
1.2. Accelerated Coordinate Descent Framework

Our optimization problem of interest is:

    P :  f^* := minimum_x f(x) ,   (1)

where f(·) : R^n → R is a differentiable convex function.

Definition 1.1. f(·) is coordinate-wise L-smooth for the vector of parameters L := (L_1, L_2, ..., L_n) if ∇f(·) is coordinate-wise Lipschitz continuous for the corresponding coefficients of L, i.e., for all x ∈ R^n and h ∈ R it holds that:

    |∇_i f(x + h e_i) − ∇_i f(x)| ≤ L_i |h| ,   i = 1, ..., n ,   (2)

where ∇_i f(·) denotes the i-th coordinate of ∇f(·) and e_i is the i-th unit coordinate vector, for i = 1, ..., n.

We presume throughout that L_i > 0 for i = 1, ..., n. Let L denote the diagonal matrix whose diagonal coefficients correspond to the respective coefficients of L. Let ⟨·, ·⟩ denote the standard coordinate inner product in R^n, namely ⟨x, y⟩ = Σ_{i=1}^{n} x_i y_i, and let ‖·‖_p denote the ℓ_p norm for 1 ≤ p ≤ ∞. Let ⟨x, y⟩_L := Σ_{i=1}^{n} L_i x_i y_i = ⟨x, Ly⟩ = ⟨Lx, y⟩ denote the L-inner product. Define the norm ‖x‖_L := √⟨x, Lx⟩. Letting L^{−1} denote the inverse of L, we will also use the norm ‖·‖_{L^{−1}} defined by ‖v‖_{L^{−1}} := √⟨v, L^{−1}v⟩ = √(Σ_{i=1}^{n} L_i^{−1} v_i²).
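To make the weighted norms concrete, here is a minimal NumPy sketch (ours, not part of the paper; the gradient oracle `grad` and the vector `L` of coordinate-wise smoothness constants are assumed to be supplied by the user) that evaluates ‖·‖_L and ‖·‖_{L⁻¹} and performs a finite-difference spot-check of the coordinate-wise Lipschitz bound (2):

```python
import numpy as np

def norm_L(x, L):
    """||x||_L = sqrt(sum_i L_i x_i^2), with L the vector of smoothness parameters."""
    return np.sqrt(np.sum(L * x**2))

def norm_L_inv(v, L):
    """||v||_{L^{-1}} = sqrt(sum_i v_i^2 / L_i)."""
    return np.sqrt(np.sum(v**2 / L))

def spot_check_coordinate_smoothness(grad, x, L, h=1e-3, tol=1e-8):
    """Finite-difference check of |grad_i f(x + h e_i) - grad_i f(x)| <= L_i |h| at the point x."""
    g = grad(x)
    flags = []
    for i in range(x.shape[0]):
        e_i = np.zeros_like(x)
        e_i[i] = h
        flags.append(abs(grad(x + e_i)[i] - g[i]) <= L[i] * abs(h) + tol)
    return np.array(flags)
```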

Algorithm 1 presents a generic framework for accelerated coordinate descent methods that is flexible enough to encompass deterministic as well as randomized methods. One specific case is the standard Accelerated Randomized Coordinate Descent (ARCD). In this paper we propose and study two other cases. The first is Accelerated Greedy Coordinate Descent (AGCD), which is a straightforward extension of greedy coordinate descent to the acceleration framework and which, surprisingly, has not been previously studied (that we are aware of). The second is a new algorithm which we call Accelerated Semi-Greedy Coordinate Descent (ASCD), which takes greedy steps in the x-updates and randomized steps in the z-update.

Algorithm 1  Accelerated Coordinate Descent Framework without Strong Convexity
    Input: objective function f with coordinate-wise smoothness parameter L, initial point z^0 = x^0, and parameter sequence {θ_k} defined as follows: θ_0 = 1, and θ_k is defined recursively by the relationship (1 − θ_k)/θ_k² = 1/θ_{k−1}² for k = 1, 2, ....
    for k = 0, 1, 2, ... do
        Define y^k := (1 − θ_k) x^k + θ_k z^k
        Choose coordinate j¹_k (by some rule)
        Compute x^{k+1} := y^k − (1/L_{j¹_k}) ∇_{j¹_k} f(y^k) e_{j¹_k}
        Choose coordinate j²_k (by some rule)
        Compute z^{k+1} := z^k − (1/(n L_{j²_k} θ_k)) ∇_{j²_k} f(y^k) e_{j²_k}
    end for

In the framework of Algorithm 1 we choose a coordinate j¹_k of the gradient ∇f(y^k) to perform the update of the x-sequence, and we choose a (possibly different) coordinate j²_k of the gradient ∇f(y^k) to perform the update of the z-sequence. Herein we will study three different rules for choosing the coordinates j¹_k and j²_k, which then define three different specific algorithms:

• ARCD (Accelerated Randomized Coordinate Descent): use the rule

    j²_k = j¹_k :∼ U[1, ..., n]   (3)

• AGCD (Accelerated Greedy Coordinate Descent): use the rule

    j²_k = j¹_k := arg max_i (1/√L_i) |∇_i f(y^k)|   (4)

• ASCD (Accelerated Semi-Greedy Coordinate Descent): use the rule

    j¹_k := arg max_i (1/√L_i) |∇_i f(y^k)| ,   j²_k :∼ U[1, ..., n] .   (5)
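The framework and the three rules can be written compactly in code. The sketch below is our own illustration rather than the authors' implementation; `grad` is a gradient oracle, `L` is the vector of coordinate-wise smoothness constants, and ties in the greedy arg max are broken arbitrarily:

```python
import numpy as np

def accel_cd(grad, L, x0, rule="ASCD", iters=1000, seed=0):
    """Algorithm 1 sketch: accelerated coordinate descent without strong convexity.
    rule selects how the coordinates j1_k (x-update) and j2_k (z-update) are chosen."""
    rng = np.random.default_rng(seed)
    n = x0.shape[0]
    x, z = x0.astype(float).copy(), x0.astype(float).copy()
    theta = 1.0                                   # theta_0 = 1
    for k in range(iters):
        y = (1 - theta) * x + theta * z
        g = grad(y)
        greedy = int(np.argmax(np.abs(g) / np.sqrt(L)))   # arg max_i |grad_i f(y)| / sqrt(L_i)
        rand = int(rng.integers(n))
        if rule == "ARCD":                        # rule (3): one random coordinate for both updates
            j1 = j2 = rand
        elif rule == "AGCD":                      # rule (4): one greedy coordinate for both updates
            j1 = j2 = greedy
        elif rule == "ASCD":                      # rule (5): greedy x-update, random z-update
            j1, j2 = greedy, rand
        else:
            raise ValueError("rule must be 'ARCD', 'AGCD' or 'ASCD'")
        x = y.copy()
        x[j1] -= g[j1] / L[j1]                    # x^{k+1} = y^k - (1/L_{j1}) grad_{j1} f(y^k) e_{j1}
        z = z.copy()
        z[j2] -= g[j2] / (n * L[j2] * theta)      # z^{k+1} = z^k - (1/(n L_{j2} theta_k)) grad_{j2} f(y^k) e_{j2}
        theta = theta * (np.sqrt(theta**2 + 4) - theta) / 2   # next theta solves (1 - t)/t^2 = 1/theta_k^2
    return x
```

The same three-way switch between the rules is reused in the strongly convex sketch given in Section 3 below.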

In ARCD a random coordinate j¹_k is chosen at each iteration k, and this coordinate is used to update both the x-sequence and the z-sequence. ARCD is well studied, and is known to have the following convergence guarantee in expectation (see (Fercoq & Richtarik, 2015) for details):

    E[ f(x^k) − f(x^*) ] ≤ (2n²/(k+1)²) ‖x^0 − x^*‖²_L ,   (6)

where the expectation is over the random variables used to define the first k iterations.

In AGCD we choose the coordinate j¹_k in a "greedy" fashion, i.e., corresponding to the maximal (weighted) absolute value coordinate of the gradient ∇f(y^k). This greedy coordinate is used to update both the x-sequence and the z-sequence. As far as we know, AGCD has not appeared in the first-order method literature. One reason for this is that while AGCD is the natural accelerated version of greedy coordinate descent, the standard proof methodologies for establishing acceleration guarantees (i.e., O(1/k²) convergence) fail for AGCD. Nevertheless, we show in Section 5 that AGCD is extremely effective in numerical experiments on synthetic linear regression problems as well as on practical logistic regression problems, and dominates other coordinate descent methods in terms of numerical performance. Furthermore, we observe that AGCD attains O(1/k²) convergence (or better) on these problems in practice. Thus AGCD is worthy of significant further study, both computationally and theoretically. Indeed, in Section 4 we will present a technical condition that implies O(1/k²) convergence when satisfied, and we will argue that this condition ought to be satisfied in many settings.

ASCD, which we consider to be the new theoretical contribution of this paper, combines the salient features of AGCD and ARCD. In ASCD we choose the greedy coordinate of the gradient to perform the x-update, while we choose a random coordinate to perform the z-update. In this way we achieve the practical advantage of greedy x-updates, while still guaranteeing O(1/k²) convergence in expectation by virtue of choosing a random coordinate for the z-update; see Theorem 2.1. And under strong convexity, ASCD achieves accelerated linear convergence, as will be shown in Section 3.

The paper is organized as follows. In Section 2 we present the O(1/k²) convergence guarantee for ASCD. In Section 3 we present an extension of the accelerated coordinate descent framework to the case of strongly convex functions, and we present the associated linear convergence guarantee for ASCD under strong convexity. In Section 4 we study AGCD; we present a Lyapunov energy function argument that points to why the standard analysis of accelerated gradient descent methods fails in the analysis of AGCD, and in Section 4.2 we present a technical condition under which AGCD achieves O(1/k²) convergence. In Section 5 we present results of our numerical experiments using AGCD and ASCD on synthetic linear regression problems as well as practical logistic regression problems.

2. Accelerated Semi-Greedy Coordinate Descent (ASCD)

In this section we present our computational guarantee for the Accelerated Semi-Greedy Coordinate Descent (ASCD) method in the non-strongly convex case, which is Algorithm 1 with rule (5). At each iteration k the ASCD method chooses the greedy coordinate j¹_k to do the x-update, and chooses a randomized coordinate j²_k ∼ U[1, ..., n] to do the z-update. Unlike ARCD, where the same randomized coordinate is used in both the x-update and the z-update, in ASCD j¹_k is chosen in a deterministic greedy way, and j¹_k and j²_k are likely to be different.

At each iteration k of ASCD the random variable j²_k is introduced, and therefore x^k depends on the realization of the random variable

    ξ_k := { j²_0, ..., j²_{k−1} } .

For convenience we also define ξ_0 := ∅.

The following theorem presents our computational guarantee for ASCD for the non-strongly convex case:

Theorem 2.1. Consider the Accelerated Semi-Greedy Coordinate Descent method (Algorithm 1 with rule (5)). If f(·) is coordinate-wise L-smooth, it holds for all k ≥ 1 that:

    E_{ξ_k}[ f(x^k) − f(x^*) ] ≤ (2n²/(k+1)²) ‖x^0 − x^*‖²_L .   (7)

In the interest of conveying some intuition on proofs of accelerated methods in general, we will present the proof of Theorem 2.1 after first stating some intermediary results along with some explanatory comments. The proofs of these results can be found in the supplementary materials.

We start with the Three-Point Property (Tseng, 2008). Given a differentiable convex function h(·), the Bregman distance for h(·) is D_h(y, x) := h(y) − h(x) − ⟨∇h(x), y − x⟩. The Three-Point Property can be stated as follows:

Lemma 2.1 (Three-Point Property (Tseng, 2008)). Let φ(·) be a convex function, and let D_h(·, ·) be the Bregman distance for h(·). For a given vector z, let

    z⁺ := arg min_{x ∈ R^n} { φ(x) + D_h(x, z) } .

Then for all x ∈ R^n,

    φ(x) + D_h(x, z) ≥ φ(z⁺) + D_h(z⁺, z) + D_h(x, z⁺) ,

with equality holding in the case when φ(·) is a linear function and h(·) is a quadratic function.

It follows from integration and the coordinate Lipschitz condition (2) that for all x ∈ R^n and h ∈ R:

    f(x + h e_i) ≤ f(x) + h · ∇_i f(x) + (L_i/2) h² .   (8)

At each iteration k = 0, 1, ... of ASCD, notice that x^{k+1} is one step of greedy coordinate descent from y^k in the norm ‖·‖_L. Now define s^{k+1} := y^k − (1/n) L^{−1} ∇f(y^k), which is a full steepest-descent step from y^k in the norm ‖·‖_{nL}. We first establish that the greedy coordinate descent step yields a good objective function value as compared to the quadratic model that yields s^{k+1}.

Lemma 2.2.

    f(x^{k+1}) ≤ f(y^k) + ⟨∇f(y^k), s^{k+1} − y^k⟩ + (n/2) ‖s^{k+1} − y^k‖²_L .

Utilizing the interpretation of s^{k+1} as a gradient descent step from y^k but with a larger smoothness descriptor (nL as opposed to L), we can invoke the standard proof for accelerated gradient descent derived in (Tseng, 2008), for example. We define t^{k+1} := z^k − (1/(nθ_k)) L^{−1} ∇f(y^k), or equivalently we can define t^{k+1} by:

    t^{k+1} = arg min_z { ⟨∇f(y^k), z − z^k⟩ + (nθ_k/2) ‖z − z^k‖²_L }   (9)

(which corresponds to z^{k+1} in (Tseng, 2008) for standard accelerated gradient descent). Then we have:

Lemma 2.3.

    f(x^{k+1}) ≤ (1 − θ_k) f(x^k) + θ_k f(x^*) + (nθ_k²/2) ‖x^* − z^k‖²_L − (nθ_k²/2) ‖x^* − t^{k+1}‖²_L .   (10)

Notice that t^{k+1} is an all-coordinate update of z^k, and computing t^{k+1} can be very expensive. Instead we will use z^{k+1} to replace t^{k+1} in (10), by using the equality in the next lemma.

Lemma 2.4.

    (n/2) ‖x^* − z^k‖²_L − (n/2) ‖x^* − t^{k+1}‖²_L = (n²/2) ‖x^* − z^k‖²_L − (n²/2) E_{j²_k}[ ‖x^* − z^{k+1}‖²_L ] .   (11)

We now have all the ingredients needed to prove Theorem 2.1.

Proof of Theorem 2.1. Substituting (11) into (10), we obtain:

    f(x^{k+1}) ≤ (1 − θ_k) f(x^k) + θ_k f(x^*) + (n²θ_k²/2) ‖x^* − z^k‖²_L − (n²θ_k²/2) E_{j²_k}[ ‖x^* − z^{k+1}‖²_L ] .

Rearranging and substituting (1 − θ_{k+1})/θ_{k+1}² = 1/θ_k² yields:

    ((1 − θ_{k+1})/θ_{k+1}²) ( f(x^{k+1}) − f(x^*) ) + E_{j²_k}[ (n²/2) ‖x^* − z^{k+1}‖²_L ]
        ≤ ((1 − θ_k)/θ_k²) ( f(x^k) − f(x^*) ) + (n²/2) ‖x^* − z^k‖²_L .

Taking the expectation over the random variables j²_1, j²_2, ..., j²_k, it follows that:

    E_{ξ_{k+1}}[ ((1 − θ_{k+1})/θ_{k+1}²) ( f(x^{k+1}) − f(x^*) ) + (n²/2) ‖x^* − z^{k+1}‖²_L ]
        ≤ E_{ξ_k}[ ((1 − θ_k)/θ_k²) ( f(x^k) − f(x^*) ) + (n²/2) ‖x^* − z^k‖²_L ] .

Applying the above inequality in a telescoping manner for k = 1, 2, ... yields:

    E_{ξ_k}[ ((1 − θ_k)/θ_k²) ( f(x^k) − f(x^*) ) ]
        ≤ E_{ξ_k}[ ((1 − θ_k)/θ_k²) ( f(x^k) − f(x^*) ) + (n²/2) ‖x^* − z^k‖²_L ]
        ⋮
        ≤ E_{ξ_0}[ ((1 − θ_0)/θ_0²) ( f(x^0) − f(x^*) ) + (n²/2) ‖x^* − z^0‖²_L ]
        = (n²/2) ‖x^* − x^0‖²_L .

Note from an induction argument that θ_i ≤ 2/(i+2) for all i = 0, 1, ..., whereby the above inequality rearranges to:

    E_{ξ_k}[ f(x^k) − f(x^*) ] ≤ (θ_k²/(1 − θ_k)) (n²/2) ‖x^* − x^0‖²_L ≤ (2n²/(k+1)²) ‖x^* − x^0‖²_L .

3. Accelerated Coordinate Descent Framework under Strong Convexity

We begin with the definition of strong convexity as developed in (Lu & Xiao, 2015):

Definition 3.1. f(·) is µ-strongly convex with respect to ‖·‖_L if for all x, y ∈ R^n it holds that:

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2) ‖y − x‖²_L .

Note that µ can be viewed as an extension of the condition number of f(·) in the traditional sense, since µ is defined relative to the coordinate smoothness coefficients through ‖·‖_L; see (Lu & Xiao, 2015). Algorithm 2 presents the generic framework for accelerated coordinate descent methods when f(·) is µ-strongly convex for known µ.

Algorithm 2  Accelerated Coordinate Descent Framework (µ-strongly convex case)
    Input: objective function f(·) with known smoothness parameter L and strong convexity parameter µ, initial point z^0 = x^0, and parameters a = √µ/(n + √µ) and b = µa/n².
    for k = 0, 1, 2, ... do
        Define y^k = (1 − a) x^k + a z^k
        Choose j¹_k (by some rule)
        Compute x^{k+1} = y^k − (1/L_{j¹_k}) ∇_{j¹_k} f(y^k) e_{j¹_k}
        Compute u^k = (a²/(a² + b)) z^k + (b/(a² + b)) y^k
        Choose j²_k (by some rule)
        Compute z^{k+1} = u^k − (a/(a² + b)) (1/(n L_{j²_k})) ∇_{j²_k} f(y^k) e_{j²_k}
    end for

Just as in the non-strongly convex case, we extend the three algorithms ARCD, AGCD, and ASCD to the strongly convex case by using the rules (3), (4), and (5) in Algorithm 2. The following theorem presents our computational guarantee for ASCD for the strongly convex case:

Theorem 3.1. Consider the Accelerated Semi-Greedy Coordinate Descent method for the strongly convex case (Algorithm 2 with rule (5)). If f(·) is coordinate-wise L-smooth and µ-strongly convex with respect to ‖·‖_L, it holds for all k ≥ 1 that:

    E_{ξ_k}[ f(x^k) − f^* + (n²/2)(a² + b) ‖z^k − x^*‖²_L ]
        ≤ (1 − √µ/(n + √µ))^k ( f(x^0) − f^* + (n²/2)(a² + b) ‖x^0 − x^*‖²_L ) .   (12)

We provide a concise proof of Theorem 3.1 in the supplementary materials.
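A matching sketch of Algorithm 2, again ours and only illustrative (`grad`, `L`, and the strong convexity parameter `mu` are assumed inputs, and the coordinate rules are the same three as before):

```python
import numpy as np

def accel_cd_strongly_convex(grad, L, mu, x0, rule="ASCD", iters=1000, seed=0):
    """Algorithm 2 sketch: accelerated coordinate descent, mu-strongly convex case."""
    rng = np.random.default_rng(seed)
    n = x0.shape[0]
    a = np.sqrt(mu) / (n + np.sqrt(mu))           # parameters a and b from Algorithm 2
    b = mu * a / n**2
    x, z = x0.astype(float).copy(), x0.astype(float).copy()
    for k in range(iters):
        y = (1 - a) * x + a * z
        g = grad(y)
        greedy = int(np.argmax(np.abs(g) / np.sqrt(L)))
        rand = int(rng.integers(n))
        j1, j2 = {"ARCD": (rand, rand), "AGCD": (greedy, greedy), "ASCD": (greedy, rand)}[rule]
        x = y.copy()
        x[j1] -= g[j1] / L[j1]                    # x-update, same as in Algorithm 1
        u = (a**2 / (a**2 + b)) * z + (b / (a**2 + b)) * y
        z = u                                     # u^k is a fresh array, safe to modify in place
        z[j2] -= (a / (a**2 + b)) * g[j2] / (n * L[j2])
    return x
```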
4. Accelerated Greedy Coordinate Descent

In this section we discuss accelerated greedy coordinate descent (AGCD), which is Algorithm 1 with rule (4). In the interest of clarity we limit our discussion to the non-strongly convex case. We present a Lyapunov function argument which shows why the standard type of proof of accelerated gradient methods fails for AGCD, and we propose a technical condition under which AGCD is guaranteed to have an O(1/k²) accelerated convergence rate. Although there are no guarantees that the technical condition will hold for a given function f(·), we provide intuition as to why the technical condition ought to hold in most cases.

4.1. Why AGCD fails (in theory)

The mainstream research community's interest in Nesterov's accelerated method (Nesterov, 1983) started around 15 years ago, and yet even today most researchers struggle to find basic intuition as to what is really going on in accelerated methods. Indeed, Nesterov's estimation sequence proof technique seems to work out arithmetically but with little fundamental intuition. There are many recent works trying to explain this acceleration phenomenon (Su et al., 2016) (Wilson et al., 2016) (Hu & Lessard, 2017) (Lin et al., 2015) (Frostig et al., 2015) (Allen-Zhu & Orecchia, 2014) (Bubeck et al., 2015). A line of recent work has attempted to give a physical explanation of acceleration techniques by studying the continuous-time interpretation of accelerated gradient descent via dynamical systems; see (Su et al., 2016), (Wilson et al., 2016), and (Hu & Lessard, 2017). In particular, (Su et al., 2016) introduced the continuous-time dynamical system model for accelerated gradient descent, and presented a convergence analysis using a Lyapunov energy function in the continuous-time setting. (Wilson et al., 2016) studied discretizations of the continuous-time dynamical system, and also showed that Nesterov's estimation sequence analysis is equivalent to the Lyapunov energy function analysis of the dynamical system in the discrete-time setting. And (Hu & Lessard, 2017) presented an energy dissipation argument from control theory for understanding accelerated gradient descent.

In the discrete-time setting, one can construct a Lyapunov energy function of the form (Wilson et al., 2016):

    E_k = A_k ( f(x^k) − f^* ) + (1/2) ‖x^* − z^k‖²_L ,   (13)

where A_k is a parameter sequence with A_k ∼ O(k²), and one shows that E_k is nonincreasing in k, yielding:

    f(x^k) − f^* ≤ E_k/A_k ≤ E_0/A_k ∼ O(1/k²) .

The earlier proof techniques for accelerated methods such as (Nesterov, 1983) and (Tseng, 2008), as well as the recent proof techniques for accelerated randomized coordinate descent (such as (Nesterov, 2012), (Lu & Xiao, 2015), and (Fercoq & Richtarik, 2015)), can all be rewritten in the above form (up to expectation), each with slightly different parameter sequences {A_k}.

Now let us return to accelerated greedy coordinate descent. Let us assume for simplicity that L_1 = ··· = L_n (as we can always rescale to achieve this condition). Then the greedy coordinate j¹_k is chosen as the coordinate of the gradient with the largest magnitude, which corresponds to the coordinate yielding the greatest guaranteed decrease in the objective function value. However, in the proof of acceleration using the Lyapunov energy function, one needs to prove a decrease in E_k (13) instead of a decrease in the objective function value f(x^k). The coordinate j¹_k is not necessarily the greedy coordinate for decreasing the energy function E_k, due to the presence of the second term ‖x^* − z^k‖²_L in (13). This explains why the greedy coordinate can fail to decrease E_k, at least in theory. And because x^* is not known when running AGCD, there does not seem to be any way to find the greedy descent coordinate for the energy function E_k.

That is why in ASCD we use the greedy coordinate to perform the x-update (which corresponds to the fastest coordinate-wise decrease of the first term of the energy function), while we choose a random coordinate to perform the z-update (which corresponds to the second term of the energy function), thereby mitigating the above problem in the case of ASCD.
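The failure mode described above can be probed numerically. Given recorded iterates {x^k}, {z^k} from a run and a high-accuracy surrogate for x^* (e.g., the output of a much longer run), one can evaluate the energy (13) and check whether it is monotone. The sketch below is ours; it uses A_k = (1 − θ_k)/(n²θ_k²), which is one choice of weighting consistent with the proof of Theorem 2.1 (any A_k ∼ O(k²) plays the same role):

```python
import numpy as np

def lyapunov_energies(f, xs, zs, x_star, L):
    """E_k = A_k (f(x^k) - f(x^*)) + 0.5 ||x^* - z^k||_L^2 along recorded iterates xs, zs."""
    n = x_star.shape[0]
    f_star = f(x_star)
    theta, energies = 1.0, []
    for x_k, z_k in zip(xs, zs):
        A_k = (1 - theta) / (n**2 * theta**2)
        energies.append(A_k * (f(x_k) - f_star) + 0.5 * np.sum(L * (x_star - z_k)**2))
        theta = theta * (np.sqrt(theta**2 + 4) - theta) / 2
    return np.array(energies)

# If np.all(np.diff(energies) <= 1e-12) fails for an AGCD run, that run illustrates,
# for this particular choice of A_k, the non-monotonicity of E_k discussed above.
```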

4.2. How to make AGCD work (in theory)

Here we propose the following technical condition under which the proof of acceleration of AGCD can be made to work.

Technical Condition 4.1. There exists a positive constant γ and an iteration number K such that for all k ≥ K it holds that:

    Σ_{i=0}^{k} (1/θ_i) ⟨∇f(y^i), z^i − x^*⟩ ≤ Σ_{i=0}^{k} (nγ/θ_i) ∇_{j_i} f(y^i) ( z^i_{j_i} − x^*_{j_i} ) ,   (14)

where j_i = arg max_j (1/√L_j) |∇_j f(y^i)| is the greedy coordinate at iteration i.

One can show that this condition is sufficient to prove an accelerated convergence rate O(1/k²) for AGCD. Therefore let us take a close look at Technical Condition 4.1. The condition considers the weighted sum (with weights 1/θ_i ∼ O(i²)) of the inner product of ∇f(y^k) and z^k − x^*, and it states that the inner-product component corresponding to the greedy coordinate (the right side above) is larger than the average of all coordinates of the inner product, by a factor of γ. In the case of ARCD and ASCD, it is easy to show that Technical Condition 4.1 holds automatically in expectation, with γ = 1.

Here is an informal explanation of why Technical Condition 4.1 ought to hold for most convex functions and most iterations of AGCD. When k is sufficiently large, the three sequences {x^k}, {y^k} and {z^k} ought to all converge to x^* (which is always observed in practice though not justified by theory), whereby z^k is close to y^k. Thus we can instead consider the inner product ⟨∇f(y^k), y^k − x^*⟩ in (14). Notice that for any coordinate j it holds that |y^k_j − x^*_j| ≥ (1/L_j) |∇_j f(y^k)|, and therefore |∇_j f(y^k) · (y^k_j − x^*_j)| ≥ (1/L_j) |∇_j f(y^k)|². Now the greedy coordinate is chosen by j_i := arg max_j (1/L_j) |∇_j f(y^k)|², and therefore it is reasonably likely that in most cases the greedy coordinate will yield a better product than the average of the components of the inner product.

The above is not a rigorous argument, and we can likely design some worst-case functions for which Technical Condition 4.1 fails. But the above argument provides some intuition as to why the condition ought to hold in most cases, thereby yielding the observed improvement of AGCD as compared with ARCD that we will shortly present in Section 5, where we also observe that Technical Condition 4.1 holds empirically on all of our problem instances.

With a slight change in the proof of Theorem 2.1, we can show the following result:

Theorem 4.1. Consider the Accelerated Greedy Coordinate Descent method (Algorithm 1 with rule (4)). If f(·) is coordinate-wise L-smooth and satisfies Technical Condition 4.1 with constant γ ≤ 1 and iteration number K, then it holds for all k ≥ K that:

    f(x^k) − f(x^*) ≤ (2n²γ/(k+1)²) ‖x^* − x^0‖²_L .   (15)

We note that if γ < 1 (which we always observe in practice), then AGCD will have a better convergence guarantee than ARCD.

Remark 4.1. The arguments in Section 4.1 and Section 4.2 also work for the strongly convex case, albeit with suitable minor modifications.

[Figure 1 appears here: plots of the optimality gap versus run-time; columns correspond to κ = 10², 10³, 10⁴, ∞; rows correspond to Algorithm Framework 1 (non-strongly convex) and Algorithm Framework 2 (strongly convex).]

Figure 1. Plots showing the optimality gap versus run-time (in seconds) for synthetic linear regression problems solved by ASCD, ARCD and AGCD.

5. Numerical Experiments

5.1. Linear Regression

We consider solving synthetic instances of the linear regression model with least-squares objective function:

    f^* := min_{β ∈ R^p} f(β) := ‖y − Xβ‖²_2

using ASCD, ARCD and AGCD, where the mechanism for generating the data (y, X) and the algorithm implementation details are described in the supplementary materials.

Figure 1 shows the optimality gap versus time (in seconds) for solving different instances of linear regression with different condition numbers of the matrix XᵀX using ASCD, ARCD and AGCD. In each plot, the vertical axis is the objective value optimality gap f(β^k) − f^* in log scale, and the horizontal axis is the running time in seconds. Each column corresponds to an instance with the prescribed condition number κ of XᵀX, where κ = ∞ means that the minimum eigenvalue of XᵀX is 0. The first row of plots is for Algorithm Framework 1, which is ignorant of any strong convexity information. The second row of plots is for Algorithm Framework 2, which uses given strong convexity information. Because the linear regression optimization problem is quadratic, it is straightforward to compute κ as well as the true parameter µ for the instances where κ > 0. The last column of the figure corresponds to κ = ∞, wherein we set µ using the smallest positive eigenvalue of XᵀX, which can be shown to work in theory since all relevant problem computations are invariant in the nullspace of X.

Here we see in Figure 1 that AGCD and ASCD consistently have superior performance over ARCD for both the non-strongly convex case and the strongly convex case, with ASCD performing almost as well as AGCD in most instances.

We remark that the behavior of any convex function near the optimal solution is similar to that of the quadratic function defined by the Hessian at the optimum, and therefore the above numerical experiments show promise that AGCD and ASCD are likely to outperform ARCD asymptotically for any twice-differentiable convex function.
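For a quick reproduction in the spirit of this experiment (using our own toy data generator; the paper's data-generating mechanism is described only in its supplementary materials), note that for f(β) = ‖y − Xβ‖²₂ the coordinate-wise smoothness constants can be taken as L_i = 2‖X_{:,i}‖²₂, so the `accel_cd` sketch from Section 1.2 applies directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, p = 200, 100
X = rng.standard_normal((n_samples, p))
y = X @ rng.standard_normal(p)

f = lambda beta: np.sum((y - X @ beta)**2)
grad = lambda beta: 2 * X.T @ (X @ beta - y)
L = 2 * np.sum(X**2, axis=0)          # L_i = 2 ||X_{:,i}||^2 for the least-squares objective

for rule in ("ARCD", "ASCD", "AGCD"):
    beta_hat = accel_cd(grad, L, np.zeros(p), rule=rule, iters=5000)
    print(rule, f(beta_hat))          # compare the attained objective values
```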

[Figure 2 appears here: plots of the optimality gap versus run-time; columns correspond to assigned values µ̄ = 10⁻³, 10⁻⁵, 10⁻⁷, 0; rows correspond to the datasets w1a and a1a.]

Figure 2. Plots showing the optimality gap versus run-time (in seconds) for the logistic regression instances w1a and a1a in LIBSVM, solved by ASCD, ARCD and AGCD.

5.2. Logistic Regression

Here we consider solving instances of the logistic regression loss minimization problem:

    f^* := min_{β ∈ R^p} f(β) := (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i βᵀ x_i)) ,

using ASCD, ARCD and AGCD, where (x_i, y_i) is the feature-response pair for the i-th data point and y_i ∈ {−1, 1}. Although the loss function f(β) is not in general strongly convex, it is essentially locally strongly convex around the optimum, but with unknown local strong convexity parameter µ̄. And although we do not know the local strong convexity parameter µ̄, we can still run the strongly convex algorithm (Algorithm Framework 2) by assigning a value of µ̄ that is hopefully close to the actual value. Using this strategy, we solved a large number of logistic regression instances from LIBSVM (Chang & Lin, 2011).

Figure 2 shows the optimality gap versus time (in seconds) for solving two of these instances, namely w1a and a1a; we present similar comparisons for more datasets in the supplementary materials. In each plot, the vertical axis is the objective value optimality gap f(β^k) − f^* in log scale, and the horizontal axis is the running time in seconds. Each column corresponds to a different assigned value of the local strong convexity parameter µ̄. The right-most column in the figure uses the assignment µ̄ = 0, in which case the algorithms are implemented as in the non-strongly convex case (Algorithm Framework 1). The implementation details are described in the supplementary materials, which also contain plots of the optimality gaps versus the number of iterations.

Here we see in Figure 2 that AGCD always has superior performance as compared to both ASCD and ARCD. In the relevant range of optimality gaps (≥ 10⁻⁹), ASCD typically outperforms ARCD for smaller values of the assigned strong convexity parameter µ̄. However, the performance of ASCD and ARCD is essentially the same when no strong convexity is presumed.

Last of all, we attempt to estimate the parameter γ that arises in Technical Condition 4.1 for AGCD on several of the datasets in LIBSVM. Although for small k the ratio between Σ_{i=0}^{k} (1/θ_i) ⟨∇f(y^i), z^i − x^*⟩ and Σ_{i=0}^{k} (n/θ_i) ∇_{j_i} f(y^i)(z^i_{j_i} − x^*_{j_i}) can fluctuate widely, this ratio stabilizes after a number of iterations in all of our numerical tests. From Technical Condition 4.1, we know that γ is an upper bound on this ratio for all k ≥ K for some large enough value of K. Table 1 presents the observed values of γ for all k ≥ K̄ := 5,000. Recalling from Theorem 4.1 that the γ value represents how much better AGCD can perform compared with ARCD in terms of computational guarantees, we see from Table 1 that AGCD should outperform ARCD for these representative instances – and indeed this is what we observe in practice in our tests.

Table 1. Largest observed values of γ for five different datasets in LIBSVM for k ≥ K̄ := 5000.

    Dataset:  w1a    a1a    heart   madelon   rcv1
    γ:        0.25   0.17   0.413   0.24      0.016
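To set up such an instance for the framework sketches above (again our own illustration; the LIBSVM file loading is not shown), the coordinate-wise smoothness constants of the logistic loss can be bounded using the fact that the logistic curvature is at most 1/4, giving L_i ≤ ‖X_{:,i}‖²/(4n):

```python
import numpy as np

def logistic_setup(X, y):
    """Objective, gradient, and coordinate-wise smoothness bounds for
    f(beta) = (1/n) sum_i log(1 + exp(-y_i x_i^T beta)), with labels y_i in {-1, +1}."""
    n = X.shape[0]

    def f(beta):
        return np.mean(np.log1p(np.exp(-y * (X @ beta))))

    def grad(beta):
        s = 1.0 / (1.0 + np.exp(y * (X @ beta)))   # sigma(-y_i x_i^T beta)
        return -(X.T @ (y * s)) / n

    L = np.sum(X**2, axis=0) / (4.0 * n)           # upper bound on the coordinate Lipschitz constants
    return f, grad, L

# With (X, y) loaded from a LIBSVM file, one could then call, e.g.,
#   f, grad, L = logistic_setup(X, y)
#   accel_cd(grad, L, np.zeros(X.shape[1]), rule="AGCD")
# or the strongly convex sketch with an assigned value of mu_bar.
```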

Acknowledgment

The authors thank Martin Jaggi for pointing out the related work (Locatello et al., 2018), whose connection to this paper is discussed in the supplementary materials. The authors are also grateful to the referees for their comprehensive efforts and their suggestions to improve the readability of the paper.

References

Allen-Zhu, Zeyuan and Orecchia, Lorenzo. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.

Beck, Amir and Tetruashvili, Luba. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.

Bertsekas, Dimitri and Tsitsiklis, John. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.

Bubeck, Sébastien, Lee, Yin Tat, and Singh, Mohit. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.

Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Fercoq, Olivier and Richtarik, Peter. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.

Frostig, Roy, Ge, Rong, Kakade, Sham, and Sidford, Aaron. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, 2015.

Gurbuzbalaban, Mert, Ozdaglar, Asuman, Parrilo, Pablo A, and Vanli, Nuri. When cyclic coordinate descent outperforms randomized coordinate descent. In Advances in Neural Information Processing Systems, pp. 7002–7010, 2017.

Hu, Bin and Lessard, Laurent. Dissipativity theory for Nesterov's accelerated method. arXiv preprint arXiv:1706.04381, 2017.

Joachims, Thorsten. Making large-scale support vector machine learning practical. In Advances in Kernel Methods, pp. 169–184. MIT Press, Cambridge, MA, USA, 1999.

Lee, Yin Tat and Sidford, Aaron. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, FOCS '13, pp. 147–156, Washington, DC, USA, 2013. IEEE Computer Society.

Lin, Hongzhou, Mairal, Julien, and Harchaoui, Zaid. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.

Locatello, Francesco, Raj, Anant, Reddy, Sai Praneeth, Rätsch, Gunnar, Schölkopf, Bernhard, Stich, Sebastian U, and Jaggi, Martin. Revisiting first-order convex optimization over linear spaces. arXiv preprint arXiv:1803.09539, 2018.

Lu, Zhaosong and Xiao, Lin. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1-2):615–642, 2015.

Luo, Zhi-Quan and Tseng, Paul. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.

Luo, Zhi-Quan and Tseng, Paul. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46(1):157–178, 1993.

Mazumder, Rahul and Hastie, Trevor. The graphical lasso: new insights and alternatives. Electronic Journal of Statistics, 6:2125, 2012.

Nemirovsky, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.

Nesterov, Yu. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Nesterov, Yurii. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

Nutini, Julie, Schmidt, Mark, Laradji, Issam, Friedlander, Michael, and Koepke, Hoyt. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In International Conference on Machine Learning, pp. 1632–1641, 2015.

Platt, John C. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods, pp. 185–208. MIT Press, Cambridge, MA, USA, 1999.

Richtarik, Peter and Takac, Martin. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

Song, Chaobing, Cui, Shaobo, Jiang, Yong, and Xia, Shu-Tao. Accelerated stochastic greedy coordinate descent by soft thresholding projection onto simplex. In Advances in Neural Information Processing Systems, pp. 4841–4850, 2017.

Su, Weijie, Boyd, Stephen, and Candes, Emmanuel J. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17(153):1–43, 2016.

Sun, Ruoyu and Ye, Yinyu. Worst-case complexity of cyclic coordinate descent: O(n²) gap with randomized version. arXiv preprint arXiv:1604.07130, 2016.

Tseng, P. On accelerated proximal gradient methods for convex-concave optimization. Technical report, May 21, 2008.

Wilson, Ashia C, Recht, Benjamin, and Jordan, Michael I. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.

You, Yang, Lian, Xiangru, Liu, Ji, Yu, Hsiang-Fu, Dhillon, Inderjit S, Demmel, James, and Hsieh, Cho-Jui. Asynchronous parallel greedy coordinate descent. In Advances in Neural Information Processing Systems, pp. 4682–4690, 2016.