Accelerating Greedy Coordinate Descent Methods
Haihao Lu 1 Robert M. Freund 2 Vahab Mirrokni 3
¹Department of Mathematics and Operations Research Center, MIT ²Sloan School of Management, MIT ³Google Research. Correspondence to: Haihao Lu

Abstract

We introduce and study two algorithms to accelerate greedy coordinate descent in theory and in practice: Accelerated Semi-Greedy Coordinate Descent (ASCD) and Accelerated Greedy Coordinate Descent (AGCD). On the theory side, our main results are for ASCD: we show that ASCD achieves O(1/k²) convergence, and it also achieves accelerated linear convergence for strongly convex functions. On the empirical side, while both AGCD and ASCD outperform Accelerated Randomized Coordinate Descent on most instances in our numerical experiments, we note that AGCD significantly outperforms the other two methods in our experiments, in spite of a lack of theoretical guarantees for this method. To complement this empirical finding for AGCD, we present an explanation why standard proof techniques for acceleration cannot work for AGCD, and we introduce a technical condition under which AGCD is guaranteed to have accelerated convergence. Finally, we confirm that this technical condition holds in our numerical experiments.

1. Introduction

Coordinate descent methods have received much-deserved attention recently due to their capability for solving large-scale optimization problems (with sparsity) that arise in machine learning applications and elsewhere. With inexpensive updates at each iteration, coordinate descent algorithms obtain faster running times than similar full gradient descent algorithms in order to reach the same near-optimality tolerance; indeed, some of these algorithms now comprise the state-of-the-art in machine learning algorithms for loss minimization.

Most recent research on coordinate descent has focused on versions of randomized coordinate descent, which can essentially recover the same results (in expectation) as full gradient descent, including obtaining "accelerated" (i.e., O(1/k²)) convergence guarantees. On the other hand, in some important machine learning applications, greedy coordinate methods demonstrate superior numerical performance while also delivering much sparser solutions. For example, greedy coordinate descent is one of the fastest algorithms for the graphical LASSO, as implemented in DP-GLASSO (Mazumder & Hastie, 2012). And sequential minimal optimization (SMO) (a variant of greedy coordinate descent) is widely known as the best solver for kernel SVM (Joachims, 1999) (Platt, 1999) and is implemented in LIBSVM and SVMLight.

In general, for smooth convex optimization the standard first-order methods converge at a rate of O(1/k) (including greedy coordinate descent). In 1983, Nesterov (Nesterov, 1983) proposed an algorithm that achieves a rate of O(1/k²), which can be shown to be the optimal rate achievable by any first-order method (Nemirovsky & Yudin, 1983). This method (and other similar methods) is now referred to as Accelerated Gradient Descent (AGD).

However, there has not been much work on accelerating the standard Greedy Coordinate Descent (GCD), due to the inherent difficulty in demonstrating O(1/k²) computational guarantees (we discuss this difficulty further in Section 4.1). The only work that might be close, as far as the authors are aware, is (Song et al., 2017), which updates the z-sequence using the full gradient and thus should not be considered a coordinate descent method in the standard sense.

In this paper, we study ways to accelerate greedy coordinate descent in theory and in practice. We introduce and study two algorithms: Accelerated Semi-Greedy Coordinate Descent (ASCD) and Accelerated Greedy Coordinate Descent (AGCD). While ASCD takes greedy steps in the x-updates and randomized steps in the z-updates, AGCD is a straightforward extension of GCD that only takes greedy steps. On the theory side, our main results are for ASCD: we show that ASCD achieves O(1/k²) convergence, and it also achieves accelerated linear convergence when the objective function is furthermore strongly convex. However, we do not have comparable theoretical guarantees for AGCD.
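Greedy coordinate descent itself is simple to state. The following is a minimal sketch on a small quadratic instance (the instance, identifiers, and test function are our own illustration for concreteness, not code from the paper; the greedy rule weights the partial derivatives by 1/√L_i, matching the weighted greedy rule used later in rule (4)):

```python
import numpy as np

def greedy_coordinate_descent(grad, L, x0, iters=200):
    """Plain (unaccelerated) GCD: at each iteration, take a step of size
    1/L_j on the single coordinate j with the largest weighted partial
    derivative |grad_i(x)| / sqrt(L_i)."""
    x = x0.astype(float).copy()
    for _ in range(iters):
        g = grad(x)
        j = int(np.argmax(np.abs(g) / np.sqrt(L)))  # greedy coordinate
        x[j] -= g[j] / L[j]                         # coordinate step
    return x

# Illustrative instance: f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b,
# and the coordinate-wise smoothness parameters are L_i = A_ii.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.diag(A).copy()
x = greedy_coordinate_descent(lambda v: A @ v - b, L, np.zeros(2))
```

On smooth convex problems a scheme of this kind attains the standard O(1/k) rate discussed above; the accelerated variants studied in this paper wrap such coordinate steps in a three-sequence (x, y, z) template.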
Algorithm 1 Accelerated Coordinate Descent Framework without Strong Convexity
Input: Objective function f with coordinate-wise smoothness parameter L, initial point z^0 = x^0, parameter sequence {θ_k} defined as follows: θ_0 = 1, and define θ_k recursively by the relationship (1 − θ_k)/θ_k² = 1/θ_{k−1}² for k = 1, 2, ....
for k = 0, 1, 2, ... do
  Define y^k := (1 − θ_k) x^k + θ_k z^k
  Choose coordinate j_k^1 (by some rule)
  Compute x^{k+1} := y^k − (1/L_{j_k^1}) ∇_{j_k^1} f(y^k) e_{j_k^1}
  Choose coordinate j_k^2 (by some rule)
  Compute z^{k+1} := z^k − (1/(n L_{j_k^2} θ_k)) ∇_{j_k^2} f(y^k) e_{j_k^2}
end for

In the framework of Algorithm 1 we choose a coordinate j_k^1 of the gradient ∇f(y^k) to perform the update of the x-sequence, and we choose a (possibly different) coordinate j_k^2 of the gradient ∇f(y^k) to perform the update of the z-sequence. Herein we will study three different rules for choosing the coordinates j_k^1 and j_k^2, which then define three different specific algorithms:

• ARCD (Accelerated Randomized Coordinate Descent): use the rule

  j_k^1 = j_k^2 ∼ U[1, ..., n]    (3)

• AGCD (Accelerated Greedy Coordinate Descent): use the rule

  j_k^1 = j_k^2 := arg max_i (1/√L_i) |∇_i f(y^k)|    (4)

• ASCD (Accelerated Semi-Greedy Coordinate Descent): use the rule

  j_k^1 := arg max_i (1/√L_i) |∇_i f(y^k)|,  j_k^2 ∼ U[1, ..., n]    (5)

In ARCD a random coordinate j_k^1 is chosen at each iteration k, and this coordinate is used to update both the x-sequence and the z-sequence. ARCD is well studied, and is known to have the following convergence guarantee in expectation (see (Fercoq & Richtarik, 2015) for details):

E[f(x^k)] − f(x^*) ≤ (2n²/(k+1)²) ‖x^0 − x^*‖²_L,    (6)

where the expectation is on the random variables used to define the first k iterations.

In AGCD we choose the coordinate j_k^1 in a "greedy" fashion, i.e., corresponding to the maximal (weighted) absolute value coordinate of the gradient ∇f(y^k). This greedy coordinate is used to update both the x-sequence and the z-sequence. As far as we know, AGCD has not appeared in the first-order method literature. One reason for this is that while AGCD is the natural accelerated version of greedy coordinate descent, the standard proof methodologies for establishing acceleration guarantees (i.e., O(1/k²) convergence) fail for AGCD. Nevertheless, we show in Section 5 that AGCD is extremely effective in numerical experiments on synthetic linear regression problems as well as on practical logistic regression problems, and dominates other coordinate descent methods in terms of numerical performance. Furthermore, we observe that AGCD attains O(1/k²) convergence (or better) on these problems in practice. Thus AGCD is worthy of significant further study, both computationally as well as theoretically. Indeed, in Section 4 we will present a technical condition that implies O(1/k²) convergence when satisfied, and we will argue that this condition ought to be satisfied in many settings.

ASCD, which we consider to be the new theoretical contribution of this paper, combines the salient features of AGCD and ARCD. In ASCD we choose the greedy coordinate of the gradient to perform the x-update, while we choose a random coordinate to perform the z-update. In this way we achieve the practical advantage of greedy x-updates, while still guaranteeing O(1/k²) convergence in expectation by virtue of choosing the random coordinate used in the z-update; see Theorem 2.1. And under strong convexity, ASCD achieves accelerated linear convergence, as will be shown in Section 3.

The paper is organized as follows. In Section 2 we present the O(1/k²) convergence guarantee for ASCD. In Section 3 we present an extension of the accelerated coordinate descent framework to the case of strongly convex functions, and we present the associated linear convergence guarantee for ASCD under strong convexity. In Section 4 we study AGCD; we present a Lyapunov energy function argument that points to why the standard analysis of accelerated gradient descent methods fails in the analysis of AGCD. In Section 4.2 we present a technical condition under which AGCD achieves O(1/k²) convergence. In Section 5 we present results of our numerical experiments using AGCD and ASCD on synthetic linear regression problems as well as practical logistic regression problems.

2. Accelerated Semi-Greedy Coordinate Descent (ASCD)

In this section we present our computational guarantee for the Accelerated Semi-Greedy Coordinate Descent (ASCD) method in the non-strongly convex case, which is Algorithm 1 with rule (5). At each iteration k the ASCD method chooses the greedy coordinate j_k^1 to do the x-update, and chooses a randomized coordinate j_k^2 ∼ U[1, ..., n] to do the z-update. Unlike ARCD, where the same randomized coordinate is used in both the x-update and the z-update, in ASCD j_k^1 is chosen in a deterministic greedy way, and j_k^1 and j_k^2 are likely to be different.

Lemma 2.2. With s^{k+1} := y^k − (1/n) L^{−1} ∇f(y^k), it holds that:

f(x^{k+1}) ≤ f(y^k) + ⟨∇f(y^k), s^{k+1} − y^k⟩ + (n/2) ‖s^{k+1} − y^k‖²_L.
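To make the framework concrete, here is a sketch of Algorithm 1 with the three coordinate rules (3)–(5) (our own illustration: all identifiers are ours, the θ_k recursion is solved in closed form, and the sketch assumes direct access to the full gradient for simplicity — a practical implementation would maintain the gradient incrementally):

```python
import numpy as np

def accelerated_cd(grad, L, x0, rule, iters=500, seed=0):
    """Sketch of Algorithm 1: accelerated coordinate descent framework.

    rule selects the coordinates (j1, j2) for the x- and z-updates:
    'ARCD' = rule (3), 'AGCD' = rule (4), 'ASCD' = rule (5)."""
    rng = np.random.default_rng(seed)
    n = len(x0)
    x = x0.astype(float).copy()
    z = x.copy()
    theta = 1.0                                   # theta_0 = 1
    sqrtL = np.sqrt(L)
    for _ in range(iters):
        y = (1.0 - theta) * x + theta * z
        g = grad(y)
        greedy = int(np.argmax(np.abs(g) / sqrtL))
        rand = int(rng.integers(n))
        if rule == 'ARCD':        # (3): one random coordinate for both updates
            j1 = j2 = rand
        elif rule == 'AGCD':      # (4): the greedy coordinate for both updates
            j1 = j2 = greedy
        else:                     # 'ASCD' (5): greedy x-update, random z-update
            j1, j2 = greedy, rand
        x = y.copy()
        x[j1] -= g[j1] / L[j1]                    # x-update, stepsize 1/L_{j1}
        z = z.copy()
        z[j2] -= g[j2] / (n * L[j2] * theta)      # z-update, stepsize 1/(n L_{j2} theta)
        # theta_{k+1} solves (1 - theta_{k+1}) / theta_{k+1}^2 = 1 / theta_k^2
        theta = theta * (np.sqrt(theta**2 + 4.0) - theta) / 2.0
    return x

# Illustrative instance: f(x) = 0.5 x^T A x - b^T x, with L_i = A_ii.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.diag(A).copy()
x_ascd = accelerated_cd(lambda v: A @ v - b, L, np.zeros(2), rule='ASCD')
```

The only difference between the three methods is the pair (j1, j2), which makes the framework convenient for side-by-side comparisons such as those in Section 5.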
At each iteration k of ASCD the random variable j_k^2 is introduced, and therefore x^k depends on the realization of the random variables

ξ_k := {j_0^2, ..., j_{k−1}^2}.

For convenience we also define ξ_0 := ∅.

The following theorem presents our computational guarantee for ASCD in the non-strongly convex case:

Theorem 2.1. Consider the Accelerated Semi-Greedy Coordinate Descent method (Algorithm 1 with rule (5)). If f(·) is coordinate-wise L-smooth, it holds for all k ≥ 1 that:

E_{ξ_k}[f(x^k)] − f(x^*) ≤ (2n²/(k+1)²) ‖x^* − x^0‖²_L.    (7)

In the interest of conveying some intuition on proofs of accelerated methods in general, we will present the proof of Theorem 2.1 after first stating some intermediary results along with some explanatory comments. The proofs of these results can be found in the supplementary materials.

We start with the Three-Point Property (Tseng, 2008). Given a differentiable convex function h(·), the Bregman distance for h(·) is D_h(y, x) := h(y) − h(x) − ⟨∇h(x), y − x⟩. The Three-Point Property can be stated as follows:

Lemma 2.1. (Three-Point Property (Tseng, 2008)) Let φ(·) be a convex function, and let D_h(·, ·) be the Bregman distance for h(·). For a given vector z, let

z⁺ := arg min_{x ∈ R^n} {φ(x) + D_h(x, z)}.

Then for all x ∈ R^n,

φ(x) + D_h(x, z) ≥ φ(z⁺) + D_h(z⁺, z) + D_h(x, z⁺),

with equality holding in the case when φ(·) is a linear function and h(·) is a quadratic function.

It follows from integration and the coordinate Lipschitz condition (2) that for all x ∈ R^n and h ∈ R:

f(x + h e_i) ≤ f(x) + h ∇_i f(x) + (L_i/2) h².    (8)

At each iteration k = 0, 1, ... of ASCD, notice that x^{k+1} is one step of greedy coordinate descent from y^k in the norm ‖·‖_L. Now define s^{k+1} := y^k − (1/n) L^{−1} ∇f(y^k), which is a full steepest-descent step from y^k in the norm ‖·‖_{nL}. We first establish (Lemma 2.2) that the greedy coordinate descent step yields a good objective function value as compared to the quadratic model that yields s^{k+1}.

Utilizing the interpretation of s^{k+1} as a gradient descent step from y^k but with a larger smoothness descriptor (nL as opposed to L), we can invoke the standard proof for accelerated gradient descent derived, for example, in (Tseng, 2008). We define t^{k+1} := z^k − (1/(nθ_k)) L^{−1} ∇f(y^k), or equivalently we can define t^{k+1} by:

t^{k+1} = arg min_z { ⟨∇f(y^k), z − z^k⟩ + (nθ_k/2) ‖z − z^k‖²_L }    (9)

(which corresponds to z^{k+1} in (Tseng, 2008) for standard accelerated gradient descent). Then we have:

Lemma 2.3.

f(x^{k+1}) ≤ (1 − θ_k) f(x^k) + θ_k f(x^*) + (nθ_k²/2) ‖x^* − z^k‖²_L − (nθ_k²/2) ‖x^* − t^{k+1}‖²_L.    (10)

Notice that t^{k+1} is an all-coordinate update of z^k, and computing t^{k+1} can be very expensive. Instead we will use z^{k+1} to replace t^{k+1} in (10) by using the equality in the next lemma.

Lemma 2.4.

(n/2) ‖x^* − z^k‖²_L − (n/2) ‖x^* − t^{k+1}‖²_L = (n²/2) ‖x^* − z^k‖²_L − (n²/2) E_{j_k^2} ‖x^* − z^{k+1}‖²_L.    (11)

We now have all the ingredients needed to prove Theorem 2.1.

Proof of Theorem 2.1. Substituting (11) into (10), we obtain:

f(x^{k+1}) ≤ (1 − θ_k) f(x^k) + θ_k f(x^*) + (n²θ_k²/2) ‖x^* − z^k‖²_L − (n²θ_k²/2) E_{j_k^2} ‖x^* − z^{k+1}‖²_L.

Rearranging and substituting (1 − θ_{k+1})/θ_{k+1}² = 1/θ_k² yields:

(1 − θ_{k+1})/θ_{k+1}² (f(x^{k+1}) − f(x^*)) + (n²/2) E_{j_k^2} ‖x^* − z^{k+1}‖²_L
≤ (1 − θ_k)/θ_k² (f(x^k) − f(x^*)) + (n²/2) ‖x^* − z^k‖²_L.

Taking the expectation over the random variables j_0^2, j_1^2, ..., j_k^2, it follows that:

E_{ξ_{k+1}}[ (1 − θ_{k+1})/θ_{k+1}² (f(x^{k+1}) − f(x^*)) + (n²/2) ‖x^* − z^{k+1}‖²_L ]
≤ E_{ξ_k}[ (1 − θ_k)/θ_k² (f(x^k) − f(x^*)) + (n²/2) ‖x^* − z^k‖²_L ].
Applying the above inequality in a telescoping manner for k = 1, 2, ..., yields:

E_{ξ_k}[ (1 − θ_k)/θ_k² (f(x^k) − f(x^*)) ]
≤ E_{ξ_k}[ (1 − θ_k)/θ_k² (f(x^k) − f(x^*)) + (n²/2) ‖x^* − z^k‖²_L ]
≤ ···
≤ E_{ξ_0}[ (1 − θ_0)/θ_0² (f(x^0) − f(x^*)) + (n²/2) ‖x^* − z^0‖²_L ]
= (n²/2) ‖x^* − x^0‖²_L.

Note from an induction argument that θ_i ≤ 2/(i+2) for all i = 0, 1, ..., whereby the above inequality rearranges to:

E_{ξ_k}[f(x^k)] − f(x^*) ≤ (θ_k²/(1 − θ_k)) (n²/2) ‖x^* − x^0‖²_L ≤ (2n²/(k+1)²) ‖x^* − x^0‖²_L.

3. Accelerated Coordinate Descent Framework under Strong Convexity

We begin with the definition of strong convexity as developed in (Lu & Xiao, 2015):

Definition 3.1. f(·) is µ-strongly convex with respect to ‖·‖_L if for all x, y ∈ R^n it holds that:

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2) ‖y − x‖²_L.

Note that µ can be viewed as an extension of the condition number of f(·) in the traditional sense, since µ is defined relative to the coordinate smoothness coefficients through ‖·‖_L; see (Lu & Xiao, 2015). Algorithm 2 presents the generic framework for accelerated coordinate descent methods when f(·) is µ-strongly convex for known µ.

Algorithm 2 Accelerated Coordinate Descent Framework (µ-strongly convex case)
Input: Objective function f(·) with known coordinate-wise smoothness parameter L and strong convexity parameter µ, initial point z^0 = x^0, parameters a = √µ/(n + √µ) and b = µa/n².
for k = 0, 1, 2, ... do
  Define y^k = (1 − a) x^k + a z^k
  Choose coordinate j_k^1 (by some rule)
  Compute x^{k+1} = y^k − (1/L_{j_k^1}) ∇_{j_k^1} f(y^k) e_{j_k^1}
  Compute u^k = (a²/(a² + b)) z^k + (b/(a² + b)) y^k
  Choose coordinate j_k^2 (by some rule)
  Compute z^{k+1} = u^k − (a/(a² + b)) (1/(n L_{j_k^2})) ∇_{j_k^2} f(y^k) e_{j_k^2}
end for

Just as in the non-strongly convex case, we extend the three algorithms ARCD, AGCD, and ASCD to the strongly convex case by using the rules (3), (4), and (5) in Algorithm 2. The following theorem presents our computational guarantee for ASCD in the strongly convex case:

Theorem 3.1. Consider the Accelerated Semi-Greedy Coordinate Descent method for the strongly convex case (Algorithm 2 with rule (5)). If f(·) is coordinate-wise L-smooth and µ-strongly convex with respect to ‖·‖_L, it holds for all k ≥ 1 that:

E_{ξ_k}[ f(x^k) − f^* + (n²/2)(a² + b) ‖z^k − x^*‖²_L ]
≤ (1 − √µ/(n + √µ))^k [ f(x^0) − f^* + (n²/2)(a² + b) ‖x^0 − x^*‖²_L ].    (12)

We provide a concise proof of Theorem 3.1 in the supplementary materials.

4. Accelerated Greedy Coordinate Descent

In this section we discuss accelerated greedy coordinate descent (AGCD), which is Algorithm 1 with rule (4). In the interest of clarity we limit our discussion to the non-strongly convex case. We present a Lyapunov function argument which shows why the standard type of proof of accelerated gradient methods fails for AGCD, and we propose a technical condition under which AGCD is guaranteed to have an O(1/k²) accelerated convergence rate. Although there are no guarantees that the technical condition will hold for a given function f(·), we provide intuition as to why the technical condition ought to hold in most cases.

4.1. Why AGCD fails (in theory)

The mainstream research community's interest in Nesterov's accelerated method (Nesterov, 1983) started around 15 years ago; and yet even today most researchers struggle to find basic intuition as to what is really going on in accelerated methods. Indeed, Nesterov's estimation sequence proof technique seems to work out arithmetically but with little fundamental intuition. There are many recent works trying to explain this acceleration phenomenon (Su et al., 2016) (Wilson et al., 2016) (Hu & Lessard, 2017) (Lin et al., 2015) (Frostig et al., 2015) (Allen-Zhu & Orecchia, 2014) (Bubeck et al., 2015). A line of recent work has attempted to give a physical explanation of acceleration techniques by studying the continuous-time interpretation of accelerated gradient descent via dynamical systems; see (Su et al., 2016), (Wilson et al., 2016), and (Hu & Lessard, 2017). In particular, (Su et al., 2016) introduced the continuous-time dynamical system model for accelerated gradient descent, and presented a convergence analysis using a Lyapunov energy function in the continuous-time setting. (Wilson et al., 2016) studied discretizations of the continuous-time dynamical system, and also showed that Nesterov's estimation sequence analysis is equivalent to the Lyapunov energy function analysis of the dynamical system in the discrete-time setting. And (Hu & Lessard, 2017) presented an energy dissipation argument from control theory for understanding accelerated gradient descent.

In the discrete-time setting, one can construct a Lyapunov energy function of the form (Wilson et al., 2016):

E_k = A_k (f(x^k) − f^*) + (1/2) ‖x^* − z^k‖²_L,    (13)

where A_k is a parameter sequence with A_k ∼ O(k²), and one shows that E_k is nonincreasing in k, yielding:

f(x^k) − f^* ≤ E_k/A_k ≤ E_0/A_k ∼ O(1/k²).

The earlier proof techniques for accelerated methods such as (Nesterov, 1983) and (Tseng, 2008), as well as the recent proof techniques for accelerated randomized coordinate descent (such as (Nesterov, 2012), (Lu & Xiao, 2015), and (Fercoq & Richtarik, 2015)), can all be rewritten in the above form (up to expectation), each with slightly different parameter sequences {A_k}.

Now let us return to accelerated greedy coordinate descent. Let us assume for simplicity that L_1 = ··· = L_n (as we can always rescale to achieve this condition). Then the greedy coordinate j_k^1 is chosen as the coordinate of the gradient with the largest magnitude, which corresponds to the coordinate yielding the greatest guaranteed decrease in the objective function value. However, in the proof of acceleration using the Lyapunov energy function, one needs to prove a decrease in E_k (13) instead of a decrease in the objective function value f(x^k). The coordinate j_k^1 is not necessarily the greedy coordinate for decreasing the energy function E_k, due to the presence of the second term ‖x^* − z^k‖²_L in (13). This explains why the greedy coordinate can fail to decrease E_k, at least in theory. And because x^* is not known when running AGCD, there does not seem to be any way to find the greedy descent coordinate for the energy function E_k.

That is why in ASCD we use the greedy coordinate to perform the x-update (which corresponds to the fastest coordinate-wise decrease for the first term in the energy function), while we choose a random coordinate to perform the z-update (which corresponds to the second term in the energy function), thereby mitigating the above problem in the case of ASCD.

4.2. How to make AGCD work (in theory)

Here we propose the following technical condition under which the proof of acceleration of AGCD can be made to work.

Technical Condition 4.1. There exist a positive constant γ and an iteration number K such that for all k ≥ K it holds that:

Σ_{i=0}^{k} (1/θ_i) ⟨∇f(y^i), z^i − x^*⟩ ≤ Σ_{i=0}^{k} (nγ/θ_i) ∇_{j_i} f(y^i) (z_{j_i}^i − x_{j_i}^*),    (14)

where j_i = arg max_j (1/√L_j) |∇_j f(y^i)| is the greedy coordinate at iteration i.

One can show that this condition is sufficient to prove an accelerated convergence rate O(1/k²) for AGCD. Therefore let us take a close look at Technical Condition 4.1. The condition considers the weighted sum (with weights 1/θ_i ∼ O(i²)) of the inner products of ∇f(y^i) and z^i − x^*, and it states that the contribution of the greedy coordinate (the right side above) is larger than the average of all the coordinate components of the inner product, by a factor of γ. In the case of ARCD and ASCD, it is easy to show that Technical Condition 4.1 holds automatically in expectation, with γ = 1.

Here is an informal explanation of why Technical Condition 4.1 ought to hold for most convex functions and most iterations of AGCD. When k is sufficiently large, the three sequences {x^k}, {y^k} and {z^k} ought to all converge to x^* (which is always observed in practice, though not justified by theory), whereby z^k is close to y^k. Thus we can instead consider the inner product ⟨∇f(y^k), y^k − x^*⟩ in (14). Notice that for any coordinate j it holds that |y_j^k − x_j^*| ≥ (1/L_j) |∇_j f(y^k)|, and therefore |∇_j f(y^k) (y_j^k − x_j^*)| ≥ (1/L_j) |∇_j f(y^k)|². Now the greedy coordinate is chosen by j_i := arg max_j (1/L_j) |∇_j f(y^k)|², and therefore it is reasonably likely that in most cases the greedy coordinate will yield a better product than the average of the components of the inner product.

The above is not a rigorous argument, and we can likely design some worst-case functions for which Technical Condition 4.1 fails. But the above argument provides some intuition as to why the condition ought to hold in most cases, thereby yielding the observed improvement of AGCD as compared with ARCD that we will shortly present in Section 5, where we also observe that Technical Condition 4.1 holds empirically on all of our problem instances.

With a slight change in the proof of Theorem 2.1, we can show the following result:

Theorem 4.1.
Consider the Accelerated Greedy Coordinate
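Technical Condition 4.1 can be monitored numerically along a run of AGCD, which is how one can confirm it empirically as in Section 5. Below is a sketch of such a check on a small quadratic with known minimizer x^* (the instance and all identifiers are our own illustration; when the greedy sum is positive, gamma_hat is the smallest γ for which (14) holds at the final iteration):

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T A x - b^T x with known minimizer x*.
A = np.array([[3.0, 1.0, 0.5],
              [1.0, 2.0, 0.3],
              [0.5, 0.3, 1.5]])
b = np.array([1.0, -1.0, 0.5])
L = np.diag(A).copy()
x_star = np.linalg.solve(A, b)
n = len(b)

x = np.zeros(n)
z = x.copy()
theta = 1.0
lhs = 0.0          # sum_i (1/theta_i) <grad f(y^i), z^i - x*>
rhs_greedy = 0.0   # sum_i (1/theta_i) grad_{j_i} f(y^i) (z^i_{j_i} - x*_{j_i})
for k in range(200):
    y = (1.0 - theta) * x + theta * z
    g = A @ y - b
    j = int(np.argmax(np.abs(g) / np.sqrt(L)))   # greedy coordinate j_i
    lhs += (g @ (z - x_star)) / theta
    rhs_greedy += g[j] * (z[j] - x_star[j]) / theta
    # AGCD (rule (4)): the greedy coordinate drives both updates
    x = y.copy()
    x[j] -= g[j] / L[j]
    z = z.copy()
    z[j] -= g[j] / (n * L[j] * theta)
    # theta_{k+1} solves (1 - theta_{k+1}) / theta_{k+1}^2 = 1 / theta_k^2
    theta = theta * (np.sqrt(theta**2 + 4.0) - theta) / 2.0

gamma_hat = lhs / (n * rhs_greedy)  # condition (14): lhs <= n * gamma * rhs_greedy
```

Tracking gamma_hat over k (rather than only at the end) shows whether a single γ works for all k ≥ K, which is the actual requirement of the condition.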
[Table column headers: κ = 10², κ = 10³, κ = 10⁴, κ = ∞]
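These headers list the condition numbers κ of the synthetic instances in the Section 5 experiments, with κ = ∞ denoting a degenerate (non-strongly convex) instance. One plausible way to generate such a synthetic quadratic instance, sketched under our own assumptions (this is not necessarily the paper's exact experimental protocol):

```python
import numpy as np

def synthetic_quadratic(n, kappa, seed=0):
    """Random quadratic f(x) = 0.5 x^T A x - b^T x whose Hessian A is
    symmetric PSD with condition number kappa; kappa = np.inf produces a
    rank-deficient A (the 'kappa = infinity' regime)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal basis
    if np.isinf(kappa):
        eigs = np.linspace(0.0, 1.0, n)               # smallest eigenvalue is 0
    else:
        eigs = np.linspace(1.0 / kappa, 1.0, n)       # lambda_max/lambda_min = kappa
    A = Q @ np.diag(eigs) @ Q.T
    b = A @ rng.standard_normal(n)  # b in range(A): f stays bounded below even if kappa = inf
    return A, b

A, b = synthetic_quadratic(50, kappa=1e3)
```

Fixing the eigenvalue spectrum and randomizing only the basis makes κ an exact, controllable experimental knob across instances.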