Accelerating Greedy Coordinate Descent Methods
Haihao Lu 1 Robert M. Freund 2 Vahab Mirrokni 3
¹Department of Mathematics and Operations Research Center, MIT ²Sloan School of Management, MIT ³Google Research. Correspondence to: Haihao Lu

Abstract

We introduce and study two algorithms to accelerate greedy coordinate descent in theory and in practice: Accelerated Semi-Greedy Coordinate Descent (ASCD) and Accelerated Greedy Coordinate Descent (AGCD). On the theory side, our main results are for ASCD: we show that ASCD achieves O(1/k²) convergence, and it also achieves accelerated linear convergence for strongly convex functions. On the empirical side, while both AGCD and ASCD outperform Accelerated Randomized Coordinate Descent on most instances in our numerical experiments, we note that AGCD significantly outperforms the other two methods in our experiments, in spite of a lack of theoretical guarantees for this method. To complement this empirical finding for AGCD, we present an explanation why standard proof techniques for acceleration cannot work for AGCD, and we introduce a technical condition under which AGCD is guaranteed to have accelerated convergence. Finally, we confirm that this technical condition holds in our numerical experiments.

1. Introduction

Coordinate descent methods have received much-deserved attention recently due to their capability for solving large-scale optimization problems (with sparsity) that arise in machine learning applications and elsewhere. With inexpensive updates at each iteration, coordinate descent algorithms obtain faster running times than similar full gradient descent algorithms in order to reach the same near-optimality tolerance; indeed, some of these algorithms now comprise the state-of-the-art in machine learning algorithms for loss minimization.

Most recent research on coordinate descent has focused on versions of randomized coordinate descent, which can essentially recover the same results (in expectation) as full gradient descent, including obtaining "accelerated" (i.e., O(1/k²)) convergence guarantees. On the other hand, in some important machine learning applications, greedy coordinate methods demonstrate superior numerical performance while also delivering much sparser solutions. For example, greedy coordinate descent is one of the fastest algorithms for the graphical LASSO, as implemented in DP-GLASSO (Mazumder & Hastie, 2012). And sequential minimal optimization (SMO) (a variant of greedy coordinate descent) is widely known as the best solver for kernel SVM (Joachims, 1999) (Platt, 1999) and is implemented in LIBSVM and SVMLight.

In general, for smooth convex optimization the standard first-order methods converge at a rate of O(1/k) (including greedy coordinate descent). In 1983, Nesterov (Nesterov, 1983) proposed an algorithm that achieves a rate of O(1/k²), which can be shown to be the optimal rate achievable by any first-order method (Nemirovsky & Yudin, 1983). This method (and other similar methods) is now referred to as Accelerated Gradient Descent (AGD).

However, there has not been much work on accelerating the standard Greedy Coordinate Descent (GCD), due to the inherent difficulty in demonstrating O(1/k²) computational guarantees (we discuss this difficulty further in Section 4.1). The only work that might be close, as far as the authors are aware, is (Song et al., 2017), which updates the z-sequence using the full gradient and thus should not be considered a coordinate descent method in the standard sense.

In this paper, we study ways to accelerate greedy coordinate descent in theory and in practice. We introduce and study two algorithms: Accelerated Semi-Greedy Coordinate Descent (ASCD) and Accelerated Greedy Coordinate Descent (AGCD). While ASCD takes greedy steps in the x-updates and randomized steps in the z-updates, AGCD is a straightforward extension of GCD that only takes greedy steps. On the theory side, our main results are for ASCD: we show that ASCD achieves O(1/k²) convergence, and it also achieves accelerated linear convergence when the objective function is furthermore strongly convex. However, we do not have comparable theoretical guarantees for AGCD.
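Greedy coordinate descent itself is simple to state. The following is a minimal sketch on a small quadratic instance (the instance, identifiers, and test function are our own illustration for concreteness, not code from the paper; the greedy rule weights the partial derivatives by 1/√L_i, matching the weighted greedy rule used later in rule (4)):

```python
import numpy as np

def greedy_coordinate_descent(grad, L, x0, iters=200):
    """Plain (unaccelerated) GCD: at each iteration, take a step of size
    1/L_j on the single coordinate j with the largest weighted partial
    derivative |grad_i(x)| / sqrt(L_i)."""
    x = x0.astype(float).copy()
    for _ in range(iters):
        g = grad(x)
        j = int(np.argmax(np.abs(g) / np.sqrt(L)))  # greedy coordinate
        x[j] -= g[j] / L[j]                         # coordinate step
    return x

# Illustrative instance: f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b,
# and the coordinate-wise smoothness parameters are L_i = A_ii.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.diag(A).copy()
x = greedy_coordinate_descent(lambda v: A @ v - b, L, np.zeros(2))
```

On smooth convex problems a scheme of this kind attains the standard O(1/k) rate discussed above; the accelerated variants studied in this paper wrap such coordinate steps in a three-sequence (x, y, z) template.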
Algorithm 1 Accelerated Coordinate Descent Framework without Strong Convexity
Input: Objective function f with coordinate-wise smoothness parameter L, initial point z^0 = x^0, parameter sequence {θ_k} defined as follows: θ_0 = 1, and define θ_k recursively by the relationship (1 − θ_k)/θ_k² = 1/θ_{k−1}² for k = 1, 2, ....
for k = 0, 1, 2, ... do
  Define y^k := (1 − θ_k) x^k + θ_k z^k
  Choose coordinate j_k^1 (by some rule)
  Compute x^{k+1} := y^k − (1/L_{j_k^1}) ∇_{j_k^1} f(y^k) e_{j_k^1}
  Choose coordinate j_k^2 (by some rule)
  Compute z^{k+1} := z^k − (1/(n L_{j_k^2} θ_k)) ∇_{j_k^2} f(y^k) e_{j_k^2}
end for

In the framework of Algorithm 1 we choose a coordinate j_k^1 of the gradient ∇f(y^k) to perform the update of the x-sequence, and we choose a (possibly different) coordinate j_k^2 of the gradient ∇f(y^k) to perform the update of the z-sequence. Herein we will study three different rules for choosing the coordinates j_k^1 and j_k^2, which then define three different specific algorithms:

• ARCD (Accelerated Randomized Coordinate Descent): use the rule

  j_k^1 = j_k^2 ∼ U[1, ..., n]    (3)

• AGCD (Accelerated Greedy Coordinate Descent): use the rule

  j_k^1 = j_k^2 := arg max_i (1/√L_i) |∇_i f(y^k)|    (4)

• ASCD (Accelerated Semi-Greedy Coordinate Descent): use the rule

  j_k^1 := arg max_i (1/√L_i) |∇_i f(y^k)|,  j_k^2 ∼ U[1, ..., n]    (5)

In ARCD a random coordinate j_k^1 is chosen at each iteration k, and this coordinate is used to update both the x-sequence and the z-sequence. ARCD is well studied, and is known to have the following convergence guarantee in expectation (see (Fercoq & Richtarik, 2015) for details):

E[f(x^k)] − f(x^*) ≤ (2n²/(k+1)²) ‖x^0 − x^*‖²_L,    (6)

where the expectation is on the random variables used to define the first k iterations.

In AGCD we choose the coordinate j_k^1 in a "greedy" fashion, i.e., corresponding to the maximal (weighted) absolute value coordinate of the gradient ∇f(y^k). This greedy coordinate is used to update both the x-sequence and the z-sequence. As far as we know, AGCD has not appeared in the first-order method literature. One reason for this is that while AGCD is the natural accelerated version of greedy coordinate descent, the standard proof methodologies for establishing acceleration guarantees (i.e., O(1/k²) convergence) fail for AGCD. Nevertheless, we show in Section 5 that AGCD is extremely effective in numerical experiments on synthetic linear regression problems as well as on practical logistic regression problems, and dominates other coordinate descent methods in terms of numerical performance. Furthermore, we observe that AGCD attains O(1/k²) convergence (or better) on these problems in practice. Thus AGCD is worthy of significant further study, both computationally as well as theoretically. Indeed, in Section 4 we will present a technical condition that implies O(1/k²) convergence when satisfied, and we will argue that this condition ought to be satisfied in many settings.

ASCD, which we consider to be the new theoretical contribution of this paper, combines the salient features of AGCD and ARCD. In ASCD we choose the greedy coordinate of the gradient to perform the x-update, while we choose a random coordinate to perform the z-update. In this way we achieve the practical advantage of greedy x-updates, while still guaranteeing O(1/k²) convergence in expectation by virtue of choosing the random coordinate used in the z-update; see Theorem 2.1. And under strong convexity, ASCD achieves accelerated linear convergence, as will be shown in Section 3.

The paper is organized as follows. In Section 2 we present the O(1/k²) convergence guarantee for ASCD. In Section 3 we present an extension of the accelerated coordinate descent framework to the case of strongly convex functions, and we present the associated linear convergence guarantee for ASCD under strong convexity. In Section 4 we study AGCD; we present a Lyapunov energy function argument that points to why the standard analysis of accelerated gradient descent methods fails in the analysis of AGCD. In Section 4.2 we present a technical condition under which AGCD achieves O(1/k²) convergence. In Section 5 we present results of our numerical experiments using AGCD and ASCD on synthetic linear regression problems as well as practical logistic regression problems.

2. Accelerated Semi-Greedy Coordinate Descent (ASCD)

In this section we present our computational guarantee for the Accelerated Semi-Greedy Coordinate Descent (ASCD) method in the non-strongly convex case, which is Algorithm 1 with rule (5). At each iteration k the ASCD method chooses the greedy coordinate j_k^1 to do the x-update, and chooses a randomized coordinate j_k^2 ∼ U[1, ..., n] to do the z-update. Unlike ARCD, where the same randomized coordinate is used in both the x-update and the z-update, in ASCD j_k^1 is chosen in a deterministic greedy way, and j_k^1 and j_k^2 are likely to be different.

Lemma 2.2. With s^{k+1} := y^k − (1/n) L^{−1} ∇f(y^k), it holds that:

f(x^{k+1}) ≤ f(y^k) + ⟨∇f(y^k), s^{k+1} − y^k⟩ + (n/2) ‖s^{k+1} − y^k‖²_L.
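To make the framework concrete, here is a sketch of Algorithm 1 with the three coordinate rules (3)–(5) (our own illustration: all identifiers are ours, the θ_k recursion is solved in closed form, and the sketch assumes direct access to the full gradient for simplicity — a practical implementation would maintain the gradient incrementally):

```python
import numpy as np

def accelerated_cd(grad, L, x0, rule, iters=500, seed=0):
    """Sketch of Algorithm 1: accelerated coordinate descent framework.

    rule selects the coordinates (j1, j2) for the x- and z-updates:
    'ARCD' = rule (3), 'AGCD' = rule (4), 'ASCD' = rule (5)."""
    rng = np.random.default_rng(seed)
    n = len(x0)
    x = x0.astype(float).copy()
    z = x.copy()
    theta = 1.0                                   # theta_0 = 1
    sqrtL = np.sqrt(L)
    for _ in range(iters):
        y = (1.0 - theta) * x + theta * z
        g = grad(y)
        greedy = int(np.argmax(np.abs(g) / sqrtL))
        rand = int(rng.integers(n))
        if rule == 'ARCD':        # (3): one random coordinate for both updates
            j1 = j2 = rand
        elif rule == 'AGCD':      # (4): the greedy coordinate for both updates
            j1 = j2 = greedy
        else:                     # 'ASCD' (5): greedy x-update, random z-update
            j1, j2 = greedy, rand
        x = y.copy()
        x[j1] -= g[j1] / L[j1]                    # x-update, stepsize 1/L_{j1}
        z = z.copy()
        z[j2] -= g[j2] / (n * L[j2] * theta)      # z-update, stepsize 1/(n L_{j2} theta)
        # theta_{k+1} solves (1 - theta_{k+1}) / theta_{k+1}^2 = 1 / theta_k^2
        theta = theta * (np.sqrt(theta**2 + 4.0) - theta) / 2.0
    return x

# Illustrative instance: f(x) = 0.5 x^T A x - b^T x, with L_i = A_ii.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.diag(A).copy()
x_ascd = accelerated_cd(lambda v: A @ v - b, L, np.zeros(2), rule='ASCD')
```

The only difference between the three methods is the pair (j1, j2), which makes the framework convenient for side-by-side comparisons such as those in Section 5.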
At each iteration k of ASCD the random variable j_k^2 is introduced, and therefore x^k depends on the realization of the random variables

ξ_k := {j_0^2, ..., j_{k−1}^2}.

For convenience we also define ξ_0 := ∅.

The following theorem presents our computational guarantee for ASCD in the non-strongly convex case:

Theorem 2.1. Consider the Accelerated Semi-Greedy Coordinate Descent method (Algorithm 1 with rule (5)). If f(·) is coordinate-wise L-smooth, it holds for all k ≥ 1 that:

E_{ξ_k}[f(x^k)] − f(x^*) ≤ (2n²/(k+1)²) ‖x^* − x^0‖²_L.    (7)

In the interest of conveying some intuition on proofs of accelerated methods in general, we will present the proof of Theorem 2.1 after first stating some intermediary results along with some explanatory comments. The proofs of these results can be found in the supplementary materials.

We start with the Three-Point Property (Tseng, 2008). Given a differentiable convex function h(·), the Bregman distance for h(·) is D_h(y, x) := h(y) − h(x) − ⟨∇h(x), y − x⟩. The Three-Point Property can be stated as follows:

Lemma 2.1. (Three-Point Property (Tseng, 2008)) Let φ(·) be a convex function, and let D_h(·, ·) be the Bregman distance for h(·). For a given vector z, let

z⁺ := arg min_{x ∈ R^n} {φ(x) + D_h(x, z)}.

Then for all x ∈ R^n,

φ(x) + D_h(x, z) ≥ φ(z⁺) + D_h(z⁺, z) + D_h(x, z⁺),

with equality holding in the case when φ(·) is a linear function and h(·) is a quadratic function.

It follows from integration and the coordinate Lipschitz condition (2) that for all x ∈ R^n and h ∈ R:

f(x + h e_i) ≤ f(x) + h ∇_i f(x) + (L_i/2) h².    (8)

At each iteration k = 0, 1, ... of ASCD, notice that x^{k+1} is one step of greedy coordinate descent from y^k in the norm ‖·‖_L. Now define s^{k+1} := y^k − (1/n) L^{−1} ∇f(y^k), which is a full steepest-descent step from y^k in the norm ‖·‖_{nL}. We first establish (Lemma 2.2) that the greedy coordinate descent step yields a good objective function value as compared to the quadratic model that yields s^{k+1}.

Utilizing the interpretation of s^{k+1} as a gradient descent step from y^k but with a larger smoothness descriptor (nL as opposed to L), we can invoke the standard proof for accelerated gradient descent derived, for example, in (Tseng, 2008). We define t^{k+1} := z^k − (1/(nθ_k)) L^{−1} ∇f(y^k), or equivalently we can define t^{k+1} by:

t^{k+1} = arg min_z { ⟨∇f(y^k), z − z^k⟩ + (nθ_k/2) ‖z − z^k‖²_L }    (9)

(which corresponds to z^{k+1} in (Tseng, 2008) for standard accelerated gradient descent). Then we have:

Lemma 2.3.

f(x^{k+1}) ≤ (1 − θ_k) f(x^k) + θ_k f(x^*) + (nθ_k²/2) ‖x^* − z^k‖²_L − (nθ_k²/2) ‖x^* − t^{k+1}‖²_L.    (10)

Notice that t^{k+1} is an all-coordinate update of z^k, and computing t^{k+1} can be very expensive. Instead we will use z^{k+1} to replace t^{k+1} in (10) by using the equality in the next lemma.

Lemma 2.4.

(n/2) ‖x^* − z^k‖²_L − (n/2) ‖x^* − t^{k+1}‖²_L = (n²/2) ‖x^* − z^k‖²_L − (n²/2) E_{j_k^2} ‖x^* − z^{k+1}‖²_L.    (11)

We now have all the ingredients needed to prove Theorem 2.1.

Proof of Theorem 2.1. Substituting (11) into (10), we obtain:

f(x^{k+1}) ≤ (1 − θ_k) f(x^k) + θ_k f(x^*) + (n²θ_k²/2) ‖x^* − z^k‖²_L − (n²θ_k²/2) E_{j_k^2} ‖x^* − z^{k+1}‖²_L.

Rearranging and substituting (1 − θ_{k+1})/θ_{k+1}² = 1/θ_k² yields:

(1 − θ_{k+1})/θ_{k+1}² (f(x^{k+1}) − f(x^*)) + (n²/2) E_{j_k^2} ‖x^* − z^{k+1}‖²_L
≤ (1 − θ_k)/θ_k² (f(x^k) − f(x^*)) + (n²/2) ‖x^* − z^k‖²_L.

Taking the expectation over the random variables j_0^2, j_1^2, ..., j_k^2, it follows that:

E_{ξ_{k+1}}[ (1 − θ_{k+1})/θ_{k+1}² (f(x^{k+1}) − f(x^*)) + (n²/2) ‖x^* − z^{k+1}‖²_L ]
≤ E_{ξ_k}[ (1 − θ_k)/θ_k² (f(x^k) − f(x^*)) + (n²/2) ‖x^* − z^k‖²_L ].
Applying the above inequality in a telescoping manner for k = 1, 2, ..., yields:

E_{ξ_k}[ (1 − θ_k)/θ_k² (f(x^k) − f(x^*)) ]
≤ E_{ξ_k}[ (1 − θ_k)/θ_k² (f(x^k) − f(x^*)) + (n²/2) ‖x^* − z^k‖²_L ]
≤ ···
≤ E_{ξ_0}[ (1 − θ_0)/θ_0² (f(x^0) − f(x^*)) + (n²/2) ‖x^* − z^0‖²_L ]
= (n²/2) ‖x^* − x^0‖²_L.

Note from an induction argument that θ_i ≤ 2/(i+2) for all i = 0, 1, ..., whereby the above inequality rearranges to:

E_{ξ_k}[f(x^k)] − f(x^*) ≤ (θ_k²/(1 − θ_k)) (n²/2) ‖x^* − x^0‖²_L ≤ (2n²/(k+1)²) ‖x^* − x^0‖²_L.

3. Accelerated Coordinate Descent Framework under Strong Convexity

We begin with the definition of strong convexity as developed in (Lu & Xiao, 2015):

Definition 3.1. f(·) is µ-strongly convex with respect to ‖·‖_L if for all x, y ∈ R^n it holds that:

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2) ‖y − x‖²_L.

Note that µ can be viewed as an extension of the condition number of f(·) in the traditional sense, since µ is defined relative to the coordinate smoothness coefficients through ‖·‖_L; see (Lu & Xiao, 2015). Algorithm 2 presents the generic framework for accelerated coordinate descent methods when f(·) is µ-strongly convex for known µ.

Algorithm 2 Accelerated Coordinate Descent Framework (µ-strongly convex case)
Input: Objective function f(·) with known coordinate-wise smoothness parameter L and strong convexity parameter µ, initial point z^0 = x^0, parameters a = √µ/(n + √µ) and b = µa/n².
for k = 0, 1, 2, ... do
  Define y^k = (1 − a) x^k + a z^k
  Choose coordinate j_k^1 (by some rule)
  Compute x^{k+1} = y^k − (1/L_{j_k^1}) ∇_{j_k^1} f(y^k) e_{j_k^1}
  Compute u^k = (a²/(a² + b)) z^k + (b/(a² + b)) y^k
  Choose coordinate j_k^2 (by some rule)
  Compute z^{k+1} = u^k − (a/(a² + b)) (1/(n L_{j_k^2})) ∇_{j_k^2} f(y^k) e_{j_k^2}
end for

Just as in the non-strongly convex case, we extend the three algorithms ARCD, AGCD, and ASCD to the strongly convex case by using the rules (3), (4), and (5) in Algorithm 2. The following theorem presents our computational guarantee for ASCD in the strongly convex case:

Theorem 3.1. Consider the Accelerated Semi-Greedy Coordinate Descent method for the strongly convex case (Algorithm 2 with rule (5)). If f(·) is coordinate-wise L-smooth and µ-strongly convex with respect to ‖·‖_L, it holds for all k ≥ 1 that:

E_{ξ_k}[ f(x^k) − f^* + (n²/2)(a² + b) ‖z^k − x^*‖²_L ]
≤ (1 − √µ/(n + √µ))^k [ f(x^0) − f^* + (n²/2)(a² + b) ‖x^0 − x^*‖²_L ].    (12)

We provide a concise proof of Theorem 3.1 in the supplementary materials.

4. Accelerated Greedy Coordinate Descent

In this section we discuss accelerated greedy coordinate descent (AGCD), which is Algorithm 1 with rule (4). In the interest of clarity we limit our discussion to the non-strongly convex case. We present a Lyapunov function argument which shows why the standard type of proof of accelerated gradient methods fails for AGCD, and we propose a technical condition under which AGCD is guaranteed to have an O(1/k²) accelerated convergence rate. Although there are no guarantees that the technical condition will hold for a given function f(·), we provide intuition as to why the technical condition ought to hold in most cases.

4.1. Why AGCD fails (in theory)

The mainstream research community's interest in Nesterov's accelerated method (Nesterov, 1983) started around 15 years ago; and yet even today most researchers struggle to find basic intuition as to what is really going on in accelerated methods. Indeed, Nesterov's estimation sequence proof technique seems to work out arithmetically but with little fundamental intuition. There are many recent works trying to explain this acceleration phenomenon (Su et al., 2016) (Wilson et al., 2016) (Hu & Lessard, 2017) (Lin et al., 2015) (Frostig et al., 2015) (Allen-Zhu & Orecchia, 2014) (Bubeck et al., 2015). A line of recent work has attempted to give a physical explanation of acceleration techniques by studying the continuous-time interpretation of accelerated gradient descent via dynamical systems; see (Su et al., 2016), (Wilson et al., 2016), and (Hu & Lessard, 2017). In particular, (Su et al., 2016) introduced the continuous-time dynamical system model for accelerated gradient descent, and presented a convergence analysis using a Lyapunov energy function in the continuous-time setting. (Wilson et al., 2016) studied discretizations of the continuous-time dynamical system, and also showed that Nesterov's estimation sequence analysis is equivalent to the Lyapunov energy function analysis of the dynamical system in the discrete-time setting. And (Hu & Lessard, 2017) presented an energy dissipation argument from control theory for understanding accelerated gradient descent.

In the discrete-time setting, one can construct a Lyapunov energy function of the form (Wilson et al., 2016):

E_k = A_k (f(x^k) − f^*) + (1/2) ‖x^* − z^k‖²_L,    (13)

where A_k is a parameter sequence with A_k ∼ O(k²), and one shows that E_k is nonincreasing in k, yielding:

f(x^k) − f^* ≤ E_k/A_k ≤ E_0/A_k ∼ O(1/k²).

The earlier proof techniques for accelerated methods such as (Nesterov, 1983) and (Tseng, 2008), as well as the recent proof techniques for accelerated randomized coordinate descent (such as (Nesterov, 2012), (Lu & Xiao, 2015), and (Fercoq & Richtarik, 2015)), can all be rewritten in the above form (up to expectation), each with slightly different parameter sequences {A_k}.

Now let us return to accelerated greedy coordinate descent. Let us assume for simplicity that L_1 = ··· = L_n (as we can always rescale to achieve this condition). Then the greedy coordinate j_k^1 is chosen as the coordinate of the gradient with the largest magnitude, which corresponds to the coordinate yielding the greatest guaranteed decrease in the objective function value. However, in the proof of acceleration using the Lyapunov energy function, one needs to prove a decrease in E_k (13) instead of a decrease in the objective function value f(x^k). The coordinate j_k^1 is not necessarily the greedy coordinate for decreasing the energy function E_k, due to the presence of the second term ‖x^* − z^k‖²_L in (13). This explains why the greedy coordinate can fail to decrease E_k, at least in theory. And because x^* is not known when running AGCD, there does not seem to be any way to find the greedy descent coordinate for the energy function E_k.

That is why in ASCD we use the greedy coordinate to perform the x-update (which corresponds to the fastest coordinate-wise decrease for the first term in the energy function), while we choose a random coordinate to perform the z-update (which corresponds to the second term in the energy function), thereby mitigating the above problem in the case of ASCD.

4.2. How to make AGCD work (in theory)

Here we propose the following technical condition under which the proof of acceleration of AGCD can be made to work.

Technical Condition 4.1. There exist a positive constant γ and an iteration number K such that for all k ≥ K it holds that:

Σ_{i=0}^{k} (1/θ_i) ⟨∇f(y^i), z^i − x^*⟩ ≤ Σ_{i=0}^{k} (nγ/θ_i) ∇_{j_i} f(y^i) (z_{j_i}^i − x_{j_i}^*),    (14)

where j_i = arg max_j (1/√L_j) |∇_j f(y^i)| is the greedy coordinate at iteration i.

One can show that this condition is sufficient to prove an accelerated convergence rate O(1/k²) for AGCD. Therefore let us take a close look at Technical Condition 4.1. The condition considers the weighted sum (with weights 1/θ_i ∼ O(i²)) of the inner products of ∇f(y^i) and z^i − x^*, and it states that the contribution of the greedy coordinate (the right side above) is larger than the average of all the coordinate components of the inner product, by a factor of γ. In the case of ARCD and ASCD, it is easy to show that Technical Condition 4.1 holds automatically in expectation, with γ = 1.

Here is an informal explanation of why Technical Condition 4.1 ought to hold for most convex functions and most iterations of AGCD. When k is sufficiently large, the three sequences {x^k}, {y^k} and {z^k} ought to all converge to x^* (which is always observed in practice, though not justified by theory), whereby z^k is close to y^k. Thus we can instead consider the inner product ⟨∇f(y^k), y^k − x^*⟩ in (14). Notice that for any coordinate j it holds that |y_j^k − x_j^*| ≥ (1/L_j) |∇_j f(y^k)|, and therefore |∇_j f(y^k) (y_j^k − x_j^*)| ≥ (1/L_j) |∇_j f(y^k)|². Now the greedy coordinate is chosen by j_i := arg max_j (1/L_j) |∇_j f(y^k)|², and therefore it is reasonably likely that in most cases the greedy coordinate will yield a better product than the average of the components of the inner product.

The above is not a rigorous argument, and we can likely design some worst-case functions for which Technical Condition 4.1 fails. But the above argument provides some intuition as to why the condition ought to hold in most cases, thereby yielding the observed improvement of AGCD as compared with ARCD that we will shortly present in Section 5, where we also observe that Technical Condition 4.1 holds empirically on all of our problem instances.

With a slight change in the proof of Theorem 2.1, we can show the following result:

Theorem 4.1.
Consider the Accelerated Greedy Coordinate
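Technical Condition 4.1 can be monitored numerically along a run of AGCD, which is how one can confirm it empirically as in Section 5. Below is a sketch of such a check on a small quadratic with known minimizer x^* (the instance and all identifiers are our own illustration; when the greedy sum is positive, gamma_hat is the smallest γ for which (14) holds at the final iteration):

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T A x - b^T x with known minimizer x*.
A = np.array([[3.0, 1.0, 0.5],
              [1.0, 2.0, 0.3],
              [0.5, 0.3, 1.5]])
b = np.array([1.0, -1.0, 0.5])
L = np.diag(A).copy()
x_star = np.linalg.solve(A, b)
n = len(b)

x = np.zeros(n)
z = x.copy()
theta = 1.0
lhs = 0.0          # sum_i (1/theta_i) <grad f(y^i), z^i - x*>
rhs_greedy = 0.0   # sum_i (1/theta_i) grad_{j_i} f(y^i) (z^i_{j_i} - x*_{j_i})
for k in range(200):
    y = (1.0 - theta) * x + theta * z
    g = A @ y - b
    j = int(np.argmax(np.abs(g) / np.sqrt(L)))   # greedy coordinate j_i
    lhs += (g @ (z - x_star)) / theta
    rhs_greedy += g[j] * (z[j] - x_star[j]) / theta
    # AGCD (rule (4)): the greedy coordinate drives both updates
    x = y.copy()
    x[j] -= g[j] / L[j]
    z = z.copy()
    z[j] -= g[j] / (n * L[j] * theta)
    # theta_{k+1} solves (1 - theta_{k+1}) / theta_{k+1}^2 = 1 / theta_k^2
    theta = theta * (np.sqrt(theta**2 + 4.0) - theta) / 2.0

gamma_hat = lhs / (n * rhs_greedy)  # condition (14): lhs <= n * gamma * rhs_greedy
```

Tracking gamma_hat over k (rather than only at the end) shows whether a single γ works for all k ≥ K, which is the actual requirement of the condition.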
[Table column headers: κ = 10², κ = 10³, κ = 10⁴, κ = ∞]
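These headers list the condition numbers κ of the synthetic instances in the Section 5 experiments, with κ = ∞ denoting a degenerate (non-strongly convex) instance. One plausible way to generate such a synthetic quadratic instance, sketched under our own assumptions (this is not necessarily the paper's exact experimental protocol):

```python
import numpy as np

def synthetic_quadratic(n, kappa, seed=0):
    """Random quadratic f(x) = 0.5 x^T A x - b^T x whose Hessian A is
    symmetric PSD with condition number kappa; kappa = np.inf produces a
    rank-deficient A (the 'kappa = infinity' regime)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal basis
    if np.isinf(kappa):
        eigs = np.linspace(0.0, 1.0, n)               # smallest eigenvalue is 0
    else:
        eigs = np.linspace(1.0 / kappa, 1.0, n)       # lambda_max/lambda_min = kappa
    A = Q @ np.diag(eigs) @ Q.T
    b = A @ rng.standard_normal(n)  # b in range(A): f stays bounded below even if kappa = inf
    return A, b

A, b = synthetic_quadratic(50, kappa=1e3)
```

Fixing the eigenvalue spectrum and randomizing only the basis makes κ an exact, controllable experimental knob across instances.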