
Approximate Steepest Coordinate Descent

Sebastian U. Stich 1 Anant Raj 2 Martin Jaggi 1

1 EPFL   2 Max Planck Institute for Intelligent Systems. Correspondence to: Sebastian U. Stich.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

We propose a new selection rule for the coordinate selection in coordinate descent methods for huge-scale optimization. The efficiency of this novel scheme is provably better than the efficiency of uniformly random selection, and can reach the efficiency of steepest coordinate descent (SCD), enabling an acceleration of a factor of up to n, the number of coordinates. In many practical applications, our scheme can be implemented at no extra cost, with computational efficiency very close to the faster uniform selection. Numerical experiments with Lasso and Ridge regression show promising improvements, in line with our theoretical guarantees.

1. Introduction

Coordinate descent (CD) methods have attracted substantial interest in the optimization community in the last few years (Nesterov, 2012; Richtárik & Takáč, 2016). Due to their computational efficiency, scalability, as well as their ease of implementation, these methods are the state-of-the-art for a wide selection of machine learning and signal processing applications (Fu, 1998; Hsieh et al., 2008; Wright, 2015). This is also theoretically well justified: the complexity estimates for CD methods are in general better than the estimates for methods that compute the full gradient in one batch pass (Nesterov, 2012; Nesterov & Stich, 2017).

In many CD methods, the active coordinate is picked at random, according to a probability distribution. For smooth functions it is theoretically well understood how the sampling procedure is related to the efficiency of the scheme and which distributions give the best complexity estimates (Nesterov, 2012; Zhao & Zhang, 2015; Allen-Zhu et al., 2016; Qu & Richtárik, 2016; Nesterov & Stich, 2017). For nonsmooth and composite functions — which appear in many machine learning applications — the picture is less clear. For instance, in (Shalev-Shwartz & Zhang, 2013; Friedman et al., 2007; 2010; Shalev-Shwartz & Tewari, 2011) uniform sampling (UCD) is used, whereas other papers propose adaptive sampling strategies that change over time (Papa et al., 2015; Csiba et al., 2015; Osokin et al., 2016; Perekrestenko et al., 2017).

A very simple deterministic strategy is to move along the direction corresponding to the component of the gradient with the maximal absolute value (steepest coordinate descent, SCD) (Boyd & Vandenberghe, 2004; Tseng & Yun, 2009). For smooth functions this strategy always yields better progress than UCD, and the speedup can reach a factor of the dimension (Nutini et al., 2015). However, SCD requires the computation of the whole gradient vector in each iteration, which is prohibitive (except for special applications, cf. Dhillon et al. (2011); Shrivastava & Li (2014)).

In this paper we propose approximate steepest coordinate descent (ASCD), a novel scheme which combines the best parts of the aforementioned strategies: (i) ASCD maintains an approximation of the full gradient in each iteration and selects the active coordinate among the components of this vector that have large absolute values — similar to SCD; and (ii) in many situations the gradient approximation can be updated cheaply at no extra cost — similar to UCD. We show that regardless of the errors in the gradient approximation (even if they are infinite), ASCD always performs better than UCD.

Similar to the methods proposed in (Tseng & Yun, 2009), we also present variants of ASCD for composite problems. We confirm our theoretical findings by numerical experiments for Lasso and Ridge regression on a synthetic dataset as well as on the RCV1 (binary) dataset.

Structure of the Paper and Contributions. In Sec. 2 we review the existing theory for SCD and (i) extend it to the setting of smooth functions. We present (ii) a novel lower bound, showing that the complexity estimates for SCD and UCD can be equal in general. We (iii) introduce ASCD and the safe selection rules for both smooth (Sec. 3) and composite functions (Sec. 5). We prove that (iv) ASCD always performs better than UCD (Sec. 3) and (v) it can reach the performance of SCD (Sec. 6). In Sec. 4 we discuss important applications where the gradient estimate can be maintained efficiently. Our theory is supported by numerical evidence in Sec. 7, which reveals that (vi) ASCD performs extremely well on real data.

Notation. Define [x]_i := ⟨x, e_i⟩ with e_i the standard unit vectors in R^n. We abbreviate ∇_i f := [∇f]_i. A convex function f : R^n → R with coordinate-wise L_i-Lipschitz continuous gradient¹ for constants L_i > 0, i ∈ [n] := {1, …, n}, satisfies by the standard reasoning

    f(x + \eta e_i) \le f(x) + \eta \nabla_i f(x) + \tfrac{L_i}{2} \eta^2    (1)

for all x ∈ R^n and η ∈ R. A function is coordinate-wise L-smooth if L_i ≤ L for i = 1, …, n. For an optimization problem min_{x ∈ R^n} f(x) define X* := arg min_{x ∈ R^n} f(x) and denote by x* ∈ R^n an arbitrary element x* ∈ X*.

¹ |∇_i f(x + ηe_i) − ∇_i f(x)| ≤ L_i |η|, ∀x ∈ R^n, η ∈ R.

2. Steepest Coordinate Descent

In this section we present SCD and discuss its theoretical properties. The functions of interest are composite convex functions F : R^n → R of the form

    F(x) := f(x) + \Psi(x)    (2)

where f is coordinate-wise L-smooth and Ψ is convex and separable, that is, Ψ(x) = Σ_{i=1}^{n} Ψ_i([x]_i). In the first part of this section we focus on smooth problems, i.e. we assume that Ψ ≡ 0.

Coordinate descent methods with constant step size generate a sequence {x_t}_{t≥0} of iterates that satisfy the relation

    x_{t+1} = x_t - \tfrac{1}{L} \nabla_{i_t} f(x_t)\, e_{i_t} .    (3)

In UCD the active coordinate i_t is chosen uniformly at random from the set [n], i_t ∈_{u.a.r.} [n]. SCD chooses the coordinate according to the Gauss-Southwell (GS) rule:

    i_t = \arg\max_{i \in [n]} |\nabla_i f(x_t)| .    (4)

2.1. Convergence analysis

With the quadratic upper bound (1) one can easily get a lower bound on the one step progress

    \mathbb{E}[f(x_t) - f(x_{t+1}) \mid x_t] \ge \mathbb{E}_{i_t}\bigl[ \tfrac{1}{2L} |\nabla_{i_t} f(x_t)|^2 \bigr] .    (5)

For UCD and SCD the expression on the right hand side evaluates to

    \tau_{UCD}(x_t) := \tfrac{1}{2nL} \|\nabla f(x_t)\|_2^2 , \qquad \tau_{SCD}(x_t) := \tfrac{1}{2L} \|\nabla f(x_t)\|_\infty^2 .    (6)

With Cauchy-Schwarz we find

    \tfrac{1}{n} \tau_{SCD}(x_t) \le \tau_{UCD}(x_t) \le \tau_{SCD}(x_t) .    (7)

Hence, the lower bound on the one step progress of SCD is always at least as large as the lower bound on the one step progress of UCD. Moreover, the one step progress could even be larger by a factor of n. However, it is very difficult to formally prove that this linear speed-up holds for more than one iteration, as the expressions in (7) depend on the (a priori unknown) sequence of iterates {x_t}_{t≥0}.

Strongly Convex Objectives. Nutini et al. (2015) present an elegant solution of this problem for µ_2-strongly convex functions². They propose to measure the strong convexity of the objective function in the 1-norm instead of the 2-norm. This gives rise to the lower bound

    \tau_{SCD}(x_t) \ge \tfrac{\mu_1}{L} \bigl( f(x_t) - f(x^\star) \bigr) ,    (8)

where µ_1 denotes the strong convexity parameter. By this, they get a uniform upper bound on the convergence that does not directly depend on local properties of the function, like for instance τ_SCD(x_t), but just on µ_1. It always holds µ_1 ≤ µ_2, and for functions where both quantities are equal, SCD enjoys a linear speedup over UCD.

² A function is µ_p-strongly convex in the p-norm, p ≥ 1, if f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ_p/2) ‖y − x‖_p², ∀x, y ∈ R^n.

Smooth Objectives. When the objective function f is just smooth (but not necessarily strongly convex), the analysis mentioned above is not applicable. We here extend the analysis from (Nutini et al., 2015) to smooth functions.

Theorem 2.1. Let f : R^n → R be convex and coordinate-wise L-smooth. Then for the sequence {x_t}_{t≥0} generated by SCD it holds:

    f(x_t) - f(x^\star) \le \frac{2 L R_1^2}{t} ,    (9)

for R_1 := max_{x*∈X*} max_{x∈R^n} { ‖x − x*‖_1 : f(x) ≤ f(x_0) }.

Proof. In the proof we first derive a lower bound on the one step progress (Lemma A.1), similar to the analysis in (Nesterov, 2012). The lower bound for the one step progress of SCD can in each iteration differ up to a factor of n from the analogous bound derived for UCD (similar as in (7)). All details are given in Section A.1 in the appendix.

Note that R_1 is essentially the diameter of the level set at f(x_0) measured in the 1-norm. In the complexity estimate of UCD, R_1² in (9) is replaced by nR_2², where R_2 is the diameter of the level set at f(x_0) measured in the 2-norm (cf. Nesterov (2012); Wright (2015)). As in (7) we observe with Cauchy-Schwarz

    \tfrac{1}{n} R_1^2 \le R_2^2 \le R_1^2 ,    (10)

i.e. SCD can accelerate up to a factor of n over UCD.
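To make the two selection rules concrete, the following minimal Python sketch (toy quadratic and all names are our own, not code from the paper) implements the SCD update (3)-(4) and evaluates the one-step progress bounds (6).

    import numpy as np

    def one_step_progress_bounds(grad, L, n):
        """Lower bounds (6) on the expected one-step progress for UCD and SCD."""
        tau_ucd = np.linalg.norm(grad, 2) ** 2 / (2 * n * L)
        tau_scd = np.linalg.norm(grad, np.inf) ** 2 / (2 * L)
        return tau_ucd, tau_scd

    def scd_step(x, grad_fn, L):
        """One steepest-coordinate-descent step (3)-(4) with step size 1/L."""
        g = grad_fn(x)
        i = int(np.argmax(np.abs(g)))          # Gauss-Southwell rule (4)
        x_new = x.copy()
        x_new[i] -= g[i] / L
        return x_new

    # hypothetical toy quadratic f(x) = 0.5 x^T Q x
    rng = np.random.default_rng(0)
    n = 50
    A = rng.standard_normal((n, n))
    Q = A.T @ A / n
    L = float(np.max(np.diag(Q)))              # coordinate-wise smoothness constants L_i = Q_ii
    grad_fn = lambda x: Q @ x

    x = rng.standard_normal(n)
    for _ in range(100):
        x = scd_step(x, grad_fn, L)
    print(one_step_progress_bounds(grad_fn(x), L, n))

By (7), the two printed bounds can differ by at most a factor of n.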

2.2. Lower bounds

In the previous section we provided complexity estimates for the methods SCD and UCD and showed that SCD can converge up to a factor of the dimension n faster than UCD. In this section we show that this analysis is tight. In Theorem 2.2 below we give a function q : R^n → R for which the one step progress τ_SCD(x_t) ≈ τ_UCD(x_t) up to a constant factor, for all iterates {x_t}_{t≥0} generated by SCD.

By a simple technique we can also construct functions for which the speedup is exactly equal to an arbitrary factor λ ∈ [1, n]. For instance, we can consider functions with a (separable) low dimensional structure. Fix integers s ≤ n such that n/s ≈ λ and define the function f : R^n → R as

    f(x) := q(\pi_s(x))    (11)

where π_s denotes the projection to R^s (being the first s out of n coordinates) and q : R^s → R is the function from Theorem 2.2. Then

    \tau_{SCD}(x_t) \approx \lambda \cdot \tau_{UCD}(x_t) ,    (12)

for all iterates {x_t}_{t≥0} generated by SCD.

Theorem 2.2. Consider the function q(x) = ½⟨Qx, x⟩ for Q := I_n − (99/100n) J_n, where J_n = 1_n 1_n^T, n > 2. Then there exists x_0 ∈ R^n such that for the sequence {x_t}_{t≥0} generated by SCD it holds

    \|\nabla q(x_t)\|_\infty^2 \le \tfrac{4}{n} \|\nabla q(x_t)\|_2^2 .    (13)

Proof. In the appendix we discuss a family of functions defined by matrices Q := (α − 1)(1/n) J_n + I_n and define corresponding parameters 0 < c_α < 1 such that for x_0 defined as [x_0]_i = c_α^{i−1} for i = 1, …, n, SCD cycles through the coordinates; that is, the sequence {x_t}_{t≥0} generated by SCD satisfies

    [x_t]_{1 + (t-1 \bmod n)} = c_\alpha^n \cdot [x_{t-1}]_{1 + (t-1 \bmod n)} .    (14)

We verify that for this sequence property (13) holds.

2.3. Composite Functions

The generalization of the GS rule (4) to composite problems (2) with nontrivial Ψ is not straightforward. The 'steepest' direction is not always meaningful in this setting; consider for instance a constrained problem where this rule could yield no progress at all when stuck at the boundary. Nutini et al. (2015) discuss several generalizations of the Gauss-Southwell rule for composite functions. The GS-s rule is defined to choose the coordinate with the most negative directional derivative (Wu & Lange, 2008). This rule is identical to (4) but requires the calculation of subgradients of Ψ_i. However, the length of a step could be arbitrarily small. In contrast, the GS-r rule was defined to pick the coordinate direction that yields the longest step (Tseng & Yun, 2009). The rule that enjoys the best theoretical properties (cf. Nutini et al. (2015)) is the GS-q rule, which is defined so as to maximize the progress assuming a quadratic upper bound on f (Tseng & Yun, 2009). Consider the coordinate-wise models

    V_i(x, y, s) := s y + \tfrac{L}{2} y^2 + \Psi_i([x]_i + y) ,    (15)

for i ∈ [n]. The GS-q rule is formally defined as

    i_{GS-q} = \arg\min_{i \in [n]} \min_{y \in \mathbb{R}} V_i(x, y, \nabla_i f(x)) .    (16)

2.4. The Complexity of the GS rule

So far we only studied the iteration complexity of SCD, but we have disregarded the fact that the computation of the GS rule (4) can be as expensive as the computation of the whole gradient. The application of coordinate descent methods is only justified if the complexity to compute one directional derivative is approximately n times cheaper than the computation of the full gradient vector (cf. Nesterov (2012)). By Theorem 2.2 this reasoning also applies to SCD. A class of functions with this property is given by functions F : R^n → R,

    F(x) := f(Ax) + \sum_{i=1}^{n} \Psi_i([x]_i)    (17)

where A is a d × n matrix, and where f : R^d → R and Ψ_i : R → R are convex and simple, that is, the time complexity T for computing their gradients is linear: T(∇_y f(y), ∇_x Ψ(x)) = O(d + n). This class of functions includes least squares, logistic regression, Lasso, and SVMs (when solved in dual form).

Assuming the matrix is dense, the complexity to compute the full gradient of F is T(∇_x F(x)) = O(dn). If the value w = Ax is already computed, one directional derivative can be computed in time T(∇_i F(x)) = O(d). The recursive update of w after one step needs the addition of one column of matrix A with some factors and can be done in time O(d). However, we note that recursively updating the full gradient vector takes time O(dn) and consequently the computation of the GS rule cannot be done efficiently.

Nutini et al. (2015) consider sparse matrices, for which the computation of the Gauss-Southwell rule becomes tractable. In this paper, we propose an alternative approach. Instead of updating the exact gradient vector, we keep track of an approximation of the gradient vector and recursively update this approximation in time O(n log n). With these updates, the use of coordinate descent is still justified in case d = Ω(n).
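The complexity argument above can be made concrete with a short sketch (Python, our own names, assuming the least-squares instance of (17)): the vector w = Ax is computed once and then maintained, so that a single directional derivative costs O(d) and the cache is refreshed in O(d) after every coordinate step.

    import numpy as np

    def directional_derivative(A, w, b, i):
        """grad_i of f(Ax) = 0.5*||Ax - b||^2, given the cached value w = Ax.
        Costs O(d) instead of the O(dn) needed for the full gradient."""
        return A[:, i] @ (w - b)

    def coordinate_step(A, x, w, i, gamma):
        """Update x along coordinate i and refresh the cache w = Ax in O(d)."""
        x[i] += gamma
        w += gamma * A[:, i]
        return x, w

    # hypothetical data
    rng = np.random.default_rng(1)
    d, n = 200, 1000
    A = rng.standard_normal((d, n))
    b = rng.standard_normal(d)
    x = np.zeros(n)
    w = A @ x                                   # computed once, then maintained

    L = (A ** 2).sum(axis=0)                    # coordinate-wise smoothness L_i = ||a_i||^2
    for t in range(1000):
        i = int(rng.integers(n))                # UCD; the GS rule (4) would need all n derivatives
        g_i = directional_derivative(A, w, b, i)
        x, w = coordinate_step(A, x, w, i, -g_i / L[i])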

3. Algorithm

Is it possible to get the significantly improved convergence speed of SCD when one is only willing to pay the computational cost of the much simpler UCD? In this section we give a formal definition of our proposed approximate SCD method, which we denote ASCD.

The core idea of the algorithm is the following: while performing coordinate updates, ideally we would like to efficiently track the evolution of all elements of the gradient, not only the one coordinate which is updated in the current step. The formal definition of the method is given in Algorithm 1 for smooth objective functions. In each iteration, only one coordinate is modified according to some arbitrary update rule M. The coordinate update rule M provides two things: first the new iterate x_{t+1}, and secondly also an estimate g̃ of the i_t-th entry of the gradient at the new iterate³. Formally,

    (x_{t+1}, \tilde{g}, r) := M(x_t, \nabla_{i_t} f(x_t))    (18)

such that the quality of the new gradient estimate g̃ satisfies

    |\nabla_{i_t} f(x_{t+1}) - \tilde{g}| \le r .    (19)

The non-active coordinates are updated with the help of gradient oracles with accuracy δ ≥ 0 (see next subsection for details). The scenario of exact updates of all gradient entries is obtained for accuracy parameters δ = r = 0, and in this case ASCD is identical to SCD.

³ For instance, for updates by exact coordinate optimization (line-search), we have g̃ = r = 0.

Algorithm 1 Approximate SCD (ASCD)
    Input: f, x_0, T, δ-gradient oracle g, method M
    Initialize [g̃_0]_i = 0, [r_0]_i = ∞ for i ∈ [n].
    for t = 0 to T do
        For i ∈ [n] define                                      {compute u.- and l.-bounds}
            [u_t]_i := max{ |[g̃_t]_i − [r_t]_i| , |[g̃_t]_i + [r_t]_i| }
            [ℓ_t]_i := min_{y∈R} { |y| : [g̃_t]_i − [r_t]_i ≤ y ≤ [g̃_t]_i + [r_t]_i }
        av(I) := (1/|I|) Σ_{i∈I} [ℓ_t]_i²                        {compute active set}
        I_t := arg min_I { |I| : I ⊆ [n], [u_t]_i² < av(I) ∀ i ∉ I }
        Pick i_t ∈_{u.a.r.} arg max_{i∈I_t} [ℓ_t]_i              {active coordinate}
        (x_{t+1}, [g̃_{t+1}]_{i_t}, [r_{t+1}]_{i_t}) := M(x_t, ∇_{i_t} f(x_t))
        γ_t := [x_{t+1}]_{i_t} − [x_t]_{i_t}                     {update ∇f(x_{t+1}) estimate}
        Update [g̃_{t+1}]_j := [g̃_t]_j + γ_t g_{i_t j}(x_t),  j ≠ i_t
        Update [r_{t+1}]_j := [r_t]_j + γ_t δ_{i_t j},  j ≠ i_t
    end for

3.1. Safe bounds for gradient evolution

ASCD maintains lower and upper bounds for the absolute values of each component of the gradient ([ℓ]_i ≤ |∇_i f(x)| ≤ [u]_i). These bounds allow to identify the coordinates on which the absolute values of the gradient are small (and hence cannot be the steepest one). More precisely, the algorithm maintains a set I_t of active coordinates (similar in spirit to active set methods, see e.g. Kim & Park (2008); Wen et al. (2012)). A coordinate j is excluded from I_t if the estimated progress in this direction (cf. (5)) is lower than the average of the estimated progress along the coordinate directions in I_t, i.e. if [u_t]_j² < (1/|I_t|) Σ_{i∈I_t} [ℓ_t]_i². The active set I_t can be computed in O(n log n) time by sorting. All other operations take linear O(n) time.

Gradient Oracle. The selection mechanism in ASCD crucially relies on the following definition of a δ-gradient oracle. While the update M delivers the estimated active entry of the new gradient, the additional gradient oracle is used to update all other coordinates j ≠ i_t of the gradient, as in the last two lines of Algorithm 1.

Definition 3.1 (δ-gradient oracle). For a function f : R^n → R and indices i, j ∈ [n], an (i, j)-gradient oracle with error δ_ij ≥ 0 is a function g_ij : R^n → R satisfying, ∀x ∈ R^n, ∀γ ∈ R:

    |\nabla_j f(x + \gamma e_i) - \nabla_j f(x) - \gamma\, g_{ij}(x)| \le |\gamma|\, \delta_{ij} .    (20)

We denote by a δ-gradient oracle a family {g_ij}_{i,j∈[n]} of δ_ij-gradient oracles.

We discuss the availability of good gradient oracles for many problem classes in more detail in Section 4. For example, for least squares problems and general linear models, a δ-gradient oracle is for instance given by a scalar product estimator as in (24) below. Note that ASCD can also handle very bad estimates, as long as the property (20) is satisfied (possibly even with accuracy δ_ij = ∞).

Initialization. In ASCD the initial estimate g̃_0 of the gradient is just arbitrarily set to 0, with uncertainty r_0 = ∞. Hence in the worst case it takes Θ(n log n) iterations until each coordinate gets picked at least once (cf. Dawkins (1991)) and until the corresponding gradient estimates are set to a realistic value. If better estimates of the initial gradient are known, they can be used for the initialization, as long as a strong error bound as in (19) is known as well. For instance the initialization can be done with ∇f(x_0) if one is willing to compute this vector in one batch pass.
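The following sketch illustrates one possible O(n log n) realization of the active-set step of Algorithm 1: candidate sets are restricted to prefixes of the coordinates sorted by [ℓ]_i², and the smallest prefix whose excluded coordinates all satisfy [u]_i² < av(I) is returned. This is our own reading of the definition, not the authors' implementation.

    import numpy as np

    def active_set(g_tilde, r):
        """One way to compute I_t of Algorithm 1 in O(n log n).

        g_tilde : current gradient estimate; r : entry-wise error radii, so that
        low[i] <= |grad_i f(x)| <= u[i].  Candidate sets are restricted to
        prefixes of the coordinates sorted by low**2 (a reading of the paper)."""
        u = np.abs(g_tilde) + r
        low = np.maximum(np.abs(g_tilde) - r, 0.0)      # lower bound on |grad_i f(x)|
        order = np.argsort(-low ** 2)                   # sort by l_i^2, descending
        l2_sorted = low[order] ** 2
        u2_sorted = u[order] ** 2
        prefix = np.cumsum(l2_sorted)                   # sum of l^2 over the top-k set
        # max of u^2 over the excluded coordinates, for every prefix length k
        suffix_max = np.append(np.maximum.accumulate(u2_sorted[::-1])[::-1][1:], -np.inf)
        for k in range(1, len(g_tilde) + 1):
            if suffix_max[k - 1] < prefix[k - 1] / k:   # [u]_i^2 < av(I) for all i not in I
                return order[:k]
        return order                                    # fall back to the full set

    # the active coordinate is then drawn uniformly from arg max of low over the active set

Note that this procedure never returns a set that excludes the coordinate with the largest lower bound, so the safety property of Lemma 3.1 below is preserved.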

Convergence Rate Guarantee. We present our first main result showing that the performance of ASCD is provably between UCD and SCD. First observe that if in Algorithm 1 the gradient oracle is always exact, i.e. δ_ij ≡ 0, and if g̃_0 is initialized with ∇f(x_0), then in each iteration |∇_{i_t} f(x_t)| = ‖∇f(x_t)‖_∞ and ASCD is identical to SCD.

Lemma 3.1. Let i_max := arg max_{i∈[n]} |∇_i f(x_t)|. Then i_max ∈ I_t, for I_t as in Algorithm 1.

Proof. This is immediate from the definitions of I_t and the upper and lower bounds. Suppose i_max ∉ I_t; then there exists j ≠ i_max such that [ℓ_t]_j > [u_t]_{i_max}, and consequently |∇_j f(x_t)| > |∇_{i_max} f(x_t)|, a contradiction.

Theorem 3.2. Let f : R^n → R be convex and coordinate-wise L-smooth, let τ_UCD, τ_SCD, τ_ASCD denote the expected one step progress (6) of UCD, SCD and ASCD, respectively, and suppose all methods use the same step-size rule M. Then

    \tau_{UCD}(x) \le \tau_{ASCD}(x) \le \tau_{SCD}(x) \qquad \forall x \in \mathbb{R}^n .    (21)

Proof. By (5) we get τ_ASCD(x) = (1/(2L|I|)) Σ_{i∈I} |∇_i f(x)|², where I denotes the corresponding index set of ASCD when at iterate x. Note that for j ∉ I it must hold that

    |\nabla_j f(x)|^2 \le [u]_j^2 < \tfrac{1}{|I|} \sum_{i \in I} [\ell]_i^2 \le \tfrac{1}{|I|} \sum_{i \in I} |\nabla_i f(x)|^2

by definition of I.

Observe that the above theorem holds for all gradient oracles and coordinate update variants, as long as they are used with corresponding quality parameters r (as in (19)) and δ_ij (as in (20)) as part of the algorithm.

Heuristic variants. Below we also propose three heuristic variants of ASCD. For all these variants the active set I_t can be computed in O(n), but the statement of Theorem 3.2 does not apply. These variants only differ from ASCD in the choice of the active set in Algorithm 1:

    u-ASCD:  I_t := arg max_{i∈[n]} [u_t]_i
    ℓ-ASCD:  I_t := arg max_{i∈[n]} [ℓ_t]_i
    a-ASCD:  I_t := { i ∈ [n] : [u_t]_i ≥ max_{i∈[n]} [ℓ_t]_i }

4. Approximate Gradient Update

In this section we argue that for a large class of objective functions of interest in machine learning, the change in the gradient along every coordinate direction can be estimated efficiently.

Lemma 4.1. Consider F : R^n → R as in (17) with twice-differentiable f : R^d → R. Then for two iterates x_t, x_{t+1} ∈ R^n of a coordinate descent algorithm, i.e. x_{t+1} = x_t + γ_t e_{i_t}, there exists an x̃ ∈ R^n on the line segment between x_t and x_{t+1}, x̃ ∈ [x_t, x_{t+1}], with

    \nabla_i F(x_{t+1}) - \nabla_i F(x_t) = \gamma_t \langle a_i, \nabla^2 f(A\tilde{x})\, a_{i_t} \rangle \qquad \forall i \ne i_t ,    (22)

where a_i denotes the i-th column of the matrix A.

Proof. For coordinates i ≠ i_t the gradient (or subgradient set) of Ψ_i([x_t]_i) does not change. Hence it suffices to calculate the change ∇f(x_{t+1}) − ∇f(x_t). This is detailed in the appendix.

Least-Squares with Arbitrary Regularizers. The least squares problem is defined as problem (17) with f(Ax) = ½‖Ax − b‖_2² for a b ∈ R^d. This function is twice differentiable with ∇²f(Ax) = I_d. Hence (22) reduces to

    \nabla_i F(x_{t+1}) - \nabla_i F(x_t) = \gamma_t \langle a_i, a_{i_t} \rangle \qquad \forall i \ne i_t .    (23)

This formulation gives rise to various gradient oracles (20) for the least squares problem. For i ≠ i_t we easily verify that the condition (20) is satisfied for:

    1. g¹_{ij} := ⟨a_i, a_j⟩;  δ_ij = 0,
    2. g²_{ij} := max{ −‖a_i‖‖a_j‖, min{ S(i, j), ‖a_i‖‖a_j‖ } };  δ_ij = ε‖a_i‖‖a_j‖, where S : [n]×[n] → R denotes a function with the property

        |S(i, j) - \langle a_i, a_j \rangle| \le \varepsilon \|a_i\| \|a_j\| , \qquad \forall i, j \in [n] ,    (24)

    3. g³_{ij} := 0;  δ_ij = ‖a_i‖‖a_j‖,
    4. g⁴_{ij} ∈_{u.a.r.} [ −‖a_i‖‖a_j‖, ‖a_i‖‖a_j‖ ];  δ_ij = ‖a_i‖‖a_j‖.

Oracle g¹ can be used in the rare cases where the dot product matrix is accessible to the optimization algorithm without any extra cost. In this case the updates will all be exact. If this matrix is not available, then the computation of each scalar product takes time O(d). Hence, they cannot be recomputed on the fly, as argued in Section 2.4. In contrast, the oracles g³ and g⁴ are extremely cheap to compute, but the error bounds are worse. In the numerical experiments in Section 7 we demonstrate that these oracles perform surprisingly well.

The oracle g² can for instance be realized by low-dimensional embeddings, such as given by the Johnson-Lindenstrauss lemma (cf. Achlioptas (2003); Matoušek (2008)). By embedding each vector in a lower-dimensional space of dimension O(ε^{-2} log n) and computing the scalar products of the embeddings in time O(log n), relation (24) is satisfied.

Updating the gradient of the active coordinate. So far we only discussed the update of the passive coordinates. For the active coordinate the best strategy depends on the update rule M from (18). If exact coordinate minimization (line search) is used, then 0 ∈ ∇_{i_t} F(x_{t+1}). For other update rules we can update the gradient ∇_{i_t} f(x_{t+1}) with the same gradient oracles as for the other coordinates; however, we also need to take into account the change of the (sub)gradient of Ψ_i([x_t]_i). If Ψ_i is simple, like for instance in ridge or lasso, the subgradients at the new point can be computed efficiently.

Bounded variation. In many applications the Hessian ∇²f(Ax̃) is not as simple as in the case of the square loss. If we assume that the Hessian of f is bounded, i.e. ∇²f(Ax) ⪯ M · I_d for a constant M ≥ 0, ∀x ∈ R^n, then it is easy to see that the following holds:

    -M \|a_i\| \|a_{i_t}\| \le \langle a_i, \nabla^2 f(A\tilde{x})\, a_{i_t} \rangle \le M \|a_i\| \|a_{i_t}\| .
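As an illustration of the least-squares oracles above, the following hedged sketch (Python, our own naming, not reference code) implements the cheap oracles g³ and g⁴ and the corresponding update of the gradient estimate and error radii, as in the last two lines of Algorithm 1.

    import numpy as np

    class NormProductOracle:
        """Gradient oracles for least squares, where the true change of a passive
        coordinate after a step of size gamma along coordinate j is gamma*<a_i, a_j> (23).

        'g3' returns 0 with error delta_ij = ||a_i|| ||a_j||; 'g4' returns a uniform
        sample from [-||a_i|| ||a_j||, ||a_i|| ||a_j||] with the same error bound."""
        def __init__(self, A, kind="g4", seed=0):
            self.norms = np.linalg.norm(A, axis=0)          # computed once, O(dn)
            self.kind = kind
            self.rng = np.random.default_rng(seed)

        def __call__(self, i, j):
            delta = self.norms[i] * self.norms[j]
            if self.kind == "g3":
                return 0.0, delta
            return self.rng.uniform(-delta, delta), delta   # g4

    def update_estimates(g_tilde, r, gamma, j_active, oracle):
        """Refresh estimate and error radius of the passive coordinates after a
        step of size gamma along coordinate j_active (last two lines of Algorithm 1)."""
        for i in range(len(g_tilde)):
            if i == j_active:
                continue
            g_ij, delta_ij = oracle(i, j_active)
            g_tilde[i] += gamma * g_ij
            r[i] += abs(gamma) * delta_ij
        return g_tilde, r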

Using this relation, we can define gradient oracles for more general functions, by taking the additional approximation factor M into account. The quality can be improved if we have access to local bounds on ∇²f(Ax).

Heuristic variants. By design, ASCD is robust to high errors in the gradient estimations — the steepest descent direction is always contained in the active set. However, instead of using only the very crude oracle g⁴ to approximate all scalar products, it might be advantageous to compute some scalar products with higher precision. We propose to use a caching technique to compute the scalar products with high precision for all vectors in the active set (and storing a matrix of size O(|I_t| × n)). This presumably works well if the active set does not change much over time.

5. Extension to Composite Functions

The key ingredients of ASCD are the coordinate-wise upper and lower bounds on the gradient and the definition of the active set I_t, which ensures that the steepest descent direction is always kept and that only provably bad directions are removed from the active set. These ideas can also be generalized to the setting of composite functions (2). We already discussed some popular GS-∗ update rules in Section 2.3.

Implementing ASCD for the GS-s rule is straightforward, and we comment on the GS-r rule in the appendix in Sec. D.2. Here we exemplarily detail the modification for the GS-q rule (16), which turns out to be the most involved (the same reasoning also applies to the GSL-q rule from (Nutini et al., 2015)). In Algorithm 2 we show the construction — based just on approximations of the gradient of the smooth part f — of the active set I. For this we compute lower and upper bounds v, w on min_{y∈R} V_i(x, y, ∇_i f(x)), such that

    [v]_i \le \min_{y \in \mathbb{R}} V_i(x, y, \nabla_i f(x)) \le [w]_i \qquad \forall i \in [n] .    (25)

Algorithm 2 Adaptation of ASCD for the GS-q rule
    Input: Gradient estimate g̃, error bounds r.
    For i ∈ [n] define:                                              {compute u.- and l.-bounds}
        [u]_i := [g̃]_i + [r]_i,   [ℓ]_i := [g̃]_i − [r]_i
        [u*]_i := arg min_{y∈R} V_i(x, y, [u]_i)                     {minimize the model}
        [ℓ*]_i := arg min_{y∈R} V_i(x, y, [ℓ]_i)
                                                {compute u.- and l.-bounds on min_{y∈R} V_i(x, y, ∇_i f(x))}
        [ω_u]_i := V_i(x, [u*]_i, [u]_i) + max{0, [u*]_i ([ℓ]_i − [u]_i)}
        [ω_ℓ]_i := V_i(x, [ℓ*]_i, [ℓ]_i) + max{0, [ℓ*]_i ([u]_i − [ℓ]_i)}
        [v]_i := min{ V_i(x, [u*]_i, [u]_i), V_i(x, [ℓ*]_i, [ℓ]_i) }
        [w]_i := min{ [ω_u]_i, [ω_ℓ]_i, Ψ_i([x]_i) }
    av(I) := (1/|I|) Σ_{i∈I} [w]_i                                   {compute active set}
    I_t := arg min_I { |I| : I ⊆ [n], [v]_i > av(I) ∀ i ∉ I }

The selection of the active coordinate is then based on these bounds. Similarly as in Lemma 3.1 and Theorem 3.2, this set has the property i_{GS-q} ∈ I, and directions are only discarded in such a way that the efficiency of ASCD-q cannot drop below the efficiency of UCD. The proof can be found in the appendix in Section D.1.

6. Analysis of Competitive Ratio

In Section 3 we derived in Thm. 3.2 that the one step progress of ASCD is between the bounds on the one step progress of UCD and SCD. However, we know that the efficiency of the latter two methods can differ much, up to a factor of n. In this section we will argue that in certain cases where SCD performs much better than UCD, ASCD will accelerate as well. To measure this effect, we could for instance consider the ratio

    \varrho_t := \frac{\bigl| \{ i \in I_t : |\nabla_i f(x_t)| \ge \tfrac{1}{2} \|\nabla f(x_t)\|_\infty \} \bigr|}{|I_t|} .    (26)

For general functions this expression is a bit cumbersome to study; therefore we restrict our discussion to the class of objective functions (11) as introduced in Sec. 2.2. Of course not all real-world objective functions will fall into this class; however, this problem class is still very interesting in our study, as we will see in the following, because it will highlight the ability (or disability) of the algorithms to eventually identify the right set of 'active' coordinates.

For the functions with the structure (11) (and q as in Thm. 2.2), the active set falls into the first s coordinates. Hence it is reasonable to approximate ϱ_t by the competitive ratio

    \rho_t := \frac{|I_t \cap [s]|}{|I_t|} .    (27)

It is also reasonable to assume that in the limit (t → ∞) a constant fraction of [s] will be contained in the active set I_t (it might not hold [s] ⊆ I_t for all t, as for instance with exact line search the directional derivative vanishes just after the update). In the following theorem we calculate ρ_t for (t → ∞); the proof is given in the appendix.

Theorem 6.1. Let f : R^n → R be of the form (11). For indices i ∉ [s] define K_i := {t : i ∉ I_t, i ∈ I_{t−1}}. For j ∈ K_i define T_j^i := min{t − j : i ∈ I_t, t > j}, i.e. the number of iterations outside the active set, T_∞^i := lim_{t→∞} E_{j∈K_i, j>t}[T_j^i], and the average T_∞ := E_{i∉[s]}[T_∞^i]. If there exists a constant c > 0 such that lim_{t→∞} |[s] ∩ I_t| = cs, then (with the notation ρ_∞ := lim_{t→∞} E[ρ_t])

    \rho_\infty \ge \frac{2cs}{cs + n - s - T_\infty + \sqrt{\theta}} ,    (28)

where θ := n² + (c − 1)²s² + 2n((c − 1)s − T_∞) + 2(1 + c)sT_∞ + T_∞². Especially, ρ_∞ ≥ 1 − (n − s)/T_∞.
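For a concrete instance of the model (15), take Ψ_i(v) = λ|v| (Lasso): the inner minimization in the GS-q rule (16) then has a closed form via soft-thresholding. The Python sketch below is our own concretization of this special case, not code from the paper.

    import numpy as np

    def soft_threshold(z, tau):
        return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    def model_min(x_i, s, L, lam):
        """Minimize the coordinate-wise model (15) with an l1 penalty,
           V_i(x, y, s) = s*y + (L/2)*y**2 + lam*|x_i + y|, over y (closed form)."""
        z_star = soft_threshold(x_i - s / L, lam / L)       # minimizer in z = x_i + y
        y_star = z_star - x_i
        value = s * y_star + 0.5 * L * y_star ** 2 + lam * abs(z_star)
        return y_star, value

    def gs_q_rule(x, grad, L, lam):
        """GS-q rule (16): pick the coordinate whose model minimum is smallest."""
        values = [model_min(x[i], grad[i], L, lam)[1] for i in range(len(x))]
        return int(np.argmin(values))

In Algorithm 2 the same closed form is evaluated at the bounds [u]_i and [ℓ]_i instead of the (unknown) exact derivative ∇_i f(x).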

[Figure 1: three panels for T_∞ = n/2, n, 4n; curves: measured ρ_t, exact ρ_∞, lower bound; x-axis: epochs.]

Figure 1. Competitive ratio ρ_t (blue) in comparison with ρ_∞ (28) (red) and the lower bound ρ_∞ ≥ 1 − (n − s)/T_∞ (black). Simulation for parameters n = 100, s = 10, c = 1 and T_∞ ∈ {50, 100, 400}.

In Figure 1 we compare the lower bound (28) of the competitive ratio in the limit (t → ∞) with actual measurements of ρ_t for a simulated example with parameters n = 100, s = 10, c = 1 and various T_∞ ∈ {50, 100, 400}. We initialized the active set I_0 = [s], but we see that the equilibrium is reached quickly.

6.1. Estimates of the competitive ratio

Based on Thm. 6.1, we can now estimate the competitive ratio in various scenarios. On the class (11) it holds c ≈ 1, as we argued before. Hence the competitive ratio (28) just depends on T_∞. This quantity measures how many iterations a coordinate j ∉ [s] stays on average outside of the active set I_t. From the lower bound we see that the competitive ratio ρ_t approaches a constant for (t → ∞) if T_∞ = Θ(n), for instance ρ_∞ ≥ 0.8 if T_∞ ≥ 5n.

As an approximation to T_∞, we estimate the quantities T^j_{t_0} defined in Thm. 6.1. T^j_{t_0} denotes the number of iterations it takes until coordinate j enters the active set again, assuming it left the active set at iteration t_0 − 1. We estimate T^j_{t_0} ≥ T̂, where T̂ denotes the maximum number of iterations such that

    \sum_{t=t_0}^{t_0 + \hat{T}} \gamma_t \delta_{i_t j} \le \frac{1}{s} \sum_{k=1}^{s} \bigl| \nabla_k f\bigl(x_{t_0 + \hat{T}}\bigr) \bigr| \qquad \forall j \notin [s] .    (29)

For smooth functions the steps satisfy γ_t = Θ(|∇_{i_t} f(x_t)|), and if we additionally assume that the errors of the gradient oracle are uniformly bounded, δ_ij ≤ δ, the sum in (29) simplifies to δ Σ_{t=t_0}^{t_0+T̂} |∇_{i_t} f(x_t)|.

For smooth, but not strongly convex, functions q, the norm of the gradient changes very slowly, with a rate independent of s or n, and we get T̂ = Θ(1/δ). Hence, the competitive ratio is constant for δ = Θ(1/n).

For strongly convex functions q, the norm of the gradient decreases linearly, say ‖∇f(x_t)‖_2² ∝ e^{−κt} for κ ≈ 1/s, i.e. it decreases by half after every Θ(s) iterations. Therefore, to guarantee T̂ = Θ(n) it needs to hold δ = e^{−Θ(n/s)}. This result seems to indicate that the use of ASCD is only justified if s is large, for instance s ≥ n/4. Otherwise the convergence on q is too fast, and the gradient approximations are too weak. However, notice that we assumed δ to be a uniform bound on all errors. If the errors have large discrepancy, the estimates become much better (this holds for instance on datasets where the norms of the data vectors differ much, or when caching techniques as mentioned in Sec. 4 are employed).

7. Empirical Observations

In this section we evaluate the empirical performance of ASCD on synthetic and real datasets. We consider the following regularized general linear models:

    \min_{x \in \mathbb{R}^n} \tfrac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_2^2 ,    (30)
    \min_{x \in \mathbb{R}^n} \tfrac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_1 ,    (31)

that is, l2-regularized least squares (30) as well as l1-regularized linear regression (Lasso) in (31), respectively.

Datasets. The data matrices A ∈ R^{d×n} in problems (30) and (31) were chosen as follows for our experiments. For the synthetic data, we follow the same generation procedure as described in (Nutini et al., 2015), which generates very sparse data matrices. For completeness, full details of the data generation process are also provided in the appendix in Sec. E. For the synthetic data we choose n = 5000 for problem (31) and n = 1000 for problem (30). The dimension d = 1000 is fixed for both cases.

For real datasets, we perform the experimental evaluation on RCV1 (binary, training), which consists of 20,242 samples, each of dimension 47,236 (Lewis et al., 2004). We use the un-normalized version with all non-zero values set to 1 (bag-of-words features).

Gradient oracles and implementation details. On the RCV1 dataset, we approximate the scalar products with the oracle g⁴ that was introduced in Sec. 4. This oracle is extremely cheap to compute, as the norms ‖a_i‖ of the columns of A only need to be computed once. On the synthetic data, we simulate the oracle g² for various precision values ε. For this, we sample a value uniformly at random from the allowed error interval (24). Figs. 2d and 3d show the convergence for different accuracies.

For the l1-regularized problems, we used ASCD with the GS-s rule (the experiments in (Nutini et al., 2015) revealed almost identical performance of the different GS-∗ rules). We compare the performance of UCD, SCD and ASCD. We also implement the heuristic version a-ASCD that was introduced in Sec. 3. All algorithm variants use the same step size rule (i.e. the method M in Algorithm 1). We use exact line search for the experiment in Fig. 3c; for all others we used a fixed step size rule (the convergence is slower for all algorithms, but the different effects of the selection of the active coordinate are more distinctly visible).
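The following sketch shows how we read the simulation of the oracle g² at accuracy ε: the exact scalar product is perturbed by a uniform sample from the error interval allowed by (24). Names and interfaces are our own assumptions.

    import numpy as np

    def simulated_g2(A, eps, seed=0):
        """Simulate the oracle g2 at accuracy eps: return S(i, j) with
        |S(i, j) - <a_i, a_j>| <= eps * ||a_i|| ||a_j||, by sampling the error
        uniformly from the interval allowed by (24)."""
        norms = np.linalg.norm(A, axis=0)
        rng = np.random.default_rng(seed)

        def oracle(i, j):
            slack = eps * norms[i] * norms[j]
            value = A[:, i] @ A[:, j] + rng.uniform(-slack, slack)
            return value, slack          # estimate and its error bound delta_ij

        return oracle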

Figure 2. Experimental results on synthetically generated datasets. (a) Convergence for l2; (b) Convergence for l1; (c) True vs. no initialization for l2; (d) Error variation (ASCD).

Figure 3. Experimental results on the RCV1-binary dataset. (a) Convergence for l2; (b) Convergence for l1; (c) Line search for l1; (d) Error variation (ASCD).

ASCD is either initialized with the true gradient (Figs. 2a, 2b, 2d, 3c, 3d) or arbitrarily (with error bounds δ = ∞) in Figs. 3a and 3b (Fig. 2c compares both initializations). Fig. 2 shows results on the synthetic data, Fig. 3 on the RCV1 dataset. All plots also show the size of the active set I_t. The plots 3c and 3d are generated on a subspace of RCV1, with 10000 and 5000 randomly chosen columns, respectively.

Here are the highlights of our experimental study:

1. No initialization needed. We observe (see e.g. Figs. 2c, 3a, 3b) that initialization with the true gradient values is not needed at the beginning of the optimization process (the cost of the initialization being as expensive as one epoch of ASCD). Instead, the algorithm performs strongly in terms of learning the active set on its own, and the set converges very fast, after just one epoch.

2. High error tolerance. The gradient oracle g⁴ gives very crude approximations; however, the convergence of ASCD is excellent on RCV1 (Fig. 3). Here the size of the true active set is very small (in the order of 0.1% on RCV1) and ASCD is able to identify this set. Fig. 3d shows that almost nothing can be gained from more precise (and more expensive) oracles.

3. Heuristic a-ASCD performs well. The convergence behavior of ASCD follows the theory. The heuristic version a-ASCD (which computes the active set slightly faster, but for which Thm. 3.2 does not hold) performs identically to ASCD in practice (cf. Figs. 2, 3), and is sometimes slightly better. This is explained by the active set used in ASCD typically being larger than the active set of a-ASCD (Figs. 2a, 2b, 3a, 3b).

8. Concluding Remarks

We proposed ASCD, a novel selection mechanism for the active coordinate in CD methods. Our scheme enjoys three favorable properties: (i) its performance can reach the performance of steepest CD — both in theory and practice, (ii) the performance is never worse than uniform CD, (iii) in many important applications, the scheme can be implemented at no extra cost per iteration.

ASCD calculates the active set in a safe manner, and picks the active coordinate uniformly at random from this smaller set. It seems possible that an adaptive sampling strategy on the active set could boost the performance even further. Here we only study CD methods where a single coordinate gets updated in each iteration. ASCD can immediately also be generalized to block-coordinate descent methods. However, the exact implementation in a distributed setting can be challenging.

Finally, it is an interesting direction to extend ASCD also to the stochastic setting (not only heuristically, but with the same strong guarantees as derived in this paper).

References

Achlioptas, Dimitris. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.

Allen-Zhu, Z., Qu, Z., Richtárik, P., and Yuan, Y. Even faster accelerated coordinate descent using non-uniform sampling. 2016.

Boyd, Stephen P. and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004.

Csiba, Dominik, Qu, Zheng, and Richtárik, Peter. Stochastic dual coordinate ascent with adaptive probabilities. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning, 2015.

Dawkins, Brian. Siobhan's problem: The coupon collector revisited. The American Statistician, 45(1):76–82, 1991.

Dhillon, Inderjit S., Ravikumar, Pradeep, and Tewari, Ambuj. Nearest neighbor based greedy coordinate descent. In Advances in Neural Information Processing Systems (NIPS), 2011.

Friedman, Jerome, Hastie, Trevor, Höfling, Holger, and Tibshirani, Robert. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

Fu, Wenjiang J. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

Hsieh, Cho-Jui, Chang, Kai-Wei, Lin, Chih-Jen, Keerthi, S. Sathiya, and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In the 25th International Conference on Machine Learning, pp. 408–415, New York, USA, 2008.

Kim, Hyunsoo and Park, Haesun. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications, 30(2):713–730, 2008.

Lee, Daniel D. and Seung, H. Sebastian. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

Lewis, David D., Yang, Yiming, Rose, Tony G., and Li, Fan. RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5:361–397, 2004.

Matoušek, Jiří. On variants of the Johnson-Lindenstrauss lemma. Random Structures & Algorithms, 33(2):142–156, 2008.

Nesterov, Yu. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Nesterov, Yurii and Stich, Sebastian U. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110–123, 2017.

Nutini, Julie, Schmidt, Mark W., Laradji, Issam H., Friedlander, Michael P., and Koepke, Hoyt A. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In ICML, pp. 1632–1641, 2015.

Osokin, Anton, Alayrac, Jean-Baptiste, Lukasewitz, Isabella, Dokania, Puneet K., and Lacoste-Julien, Simon. Minding the gaps for block Frank-Wolfe optimization of structured SVMs. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pp. 593–602. PMLR, 2016.

Papa, Guillaume, Bianchi, Pascal, and Clémençon, Stéphan. Adaptive sampling for incremental optimization using stochastic gradient descent. ALT 2015 - 26th International Conference on Algorithmic Learning Theory, pp. 317–331, 2015.

Perekrestenko, Dmytro, Cevher, Volkan, and Jaggi, Martin. Faster coordinate descent via adaptive importance sampling. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 869–877, Fort Lauderdale, FL, USA, 2017. PMLR.

Qu, Zheng and Richtárik, Peter. Coordinate descent with arbitrary sampling I: algorithms and complexity. Optimization Methods and Software, 31(5):829–857, 2016.

Richtárik, Peter and Takáč, Martin. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1):433–484, 2016.

Shalev-Shwartz, Shai and Tewari, Ambuj. Stochastic methods for l1-regularized loss minimization. JMLR, 12:1865–1892, 2011.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567–599, 2013.

Shrivastava, Anshumali and Li, Ping. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS 2014 - Advances in Neural Information Processing Systems 27, pp. 2321–2329, 2014.

Tseng, Paul and Yun, Sangwoon. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1):387–423, 2009.

Wen, Zaiwen, Yin, Wotao, Zhang, Hongchao, and Goldfarb, Donald. On the convergence of an active-set method for l1 minimization. Optimization Methods and Software, 27(6):1127–1146, 2012.

Wright, Stephen J. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.

Wu, Tong Tong and Lange, Kenneth. Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat., 2(1):224–244, 2008.

Zhao, Peilin and Zhang, Tong. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of PMLR, pp. 1–9, Lille, France, 2015. PMLR.

Appendix

A. On Steepest Coordinate Descent

A.1. Convergence on Smooth Functions

Lemma A.1 (Lower bound on the one step progress on smooth functions). Let f : R^n → R be convex and coordinate-wise L-smooth. For a sequence of iterates {x_t}_{t≥0} define the progress measure

    \Delta(x_t) := \frac{1}{\mathbb{E}[f(x_{t+1}) - f(x^\star) \mid x_t]} - \frac{1}{f(x_t) - f(x^\star)} .    (32)

For sequences {x_t}_{t≥0} generated by SCD it holds

    \Delta_{SCD}(x_t) \ge \frac{1}{2L \|x_t - x^\star\|_1^2} , \qquad t \ge 0 ,    (33)

and for sequences generated by UCD

    \Delta_{UCD}(x_t) \ge \frac{1}{2nL \|x_t - x^\star\|_2^2} , \qquad t \ge 0 .    (34)

It is important to note that the lower bounds presented in Equations (33) and (34) are quite tight, and equality is almost achievable under special conditions. When comparing the per-step progress of these two methods, we find — similarly as in (7) — the relation

    \frac{1}{n} \Delta_{SCD}(x_t) \le \Delta_{UCD}(x_t) \le \Delta_{SCD}(x_t) ,    (35)

that is, SCD can boost the performance over random coordinate descent by up to a factor of n. This also holds for a sequence of consecutive updates, as shown in Theorem 2.1.

Proof of Lemma A.1. Define f* := f(x*). From the smoothness assumption (1), we get

    f(x_{t+1}) \overset{(5)}{\le} f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|_\infty^2
    \;\Rightarrow\; f(x_{t+1}) - f^\star \le f(x_t) - f^\star - \frac{1}{2L} \|\nabla f(x_t)\|_\infty^2 .    (36)

Now from the property of a convex function and Hölder's inequality:

    f(x_t) - f^\star \le \langle \nabla f(x_t), x_t - x^\star \rangle \le \|\nabla f(x_t)\|_\infty \|x_t - x^\star\|_1 .    (37)

Hence,

    (f(x_t) - f^\star)^2 \le \|\nabla f(x_t)\|_\infty^2 \|x_t - x^\star\|_1^2
    \;\Rightarrow\; \|\nabla f(x_t)\|_\infty^2 \ge \frac{(f(x_t) - f^\star)^2}{\|x_t - x^\star\|_1^2} .    (38)

From Equations (36) and (38),

    \frac{1}{f(x_{t+1}) - f^\star} - \frac{1}{f(x_t) - f^\star} \ge \frac{1}{2L \|x_t - x^\star\|_1^2} ,    (39)

which concludes the proof.

We remark that the one step progress for UCD can be written as (Nesterov, 2012; Wright, 2015):

    \frac{1}{\mathbb{E}[f(x_{t+1}) \mid x_t] - f^\star} - \frac{1}{f(x_t) - f^\star} \ge \frac{1}{2Ln \|x_t - x^\star\|_2^2} .    (40)

Proof of Theorem 2.1. From Lemma A.1,

    \frac{1}{f(x_{t+1}) - f^\star} - \frac{1}{f(x_t) - f^\star} \ge \frac{1}{2L \|x_t - x^\star\|_1^2} .

Now, summing up the above inequality from iteration 0 to t − 1, we get:

    \frac{1}{f(x_t) - f^\star} - \frac{1}{f(x_0) - f^\star} \ge \frac{1}{2L} \sum_{i=0}^{t-1} \frac{1}{\|x_i - x^\star\|_1^2}
    \;\Rightarrow\; \frac{1}{f(x_t) - f^\star} \ge \frac{1}{2L} \sum_{i=0}^{t-1} \frac{1}{R_1^2}
    \;\Rightarrow\; \frac{1}{f(x_t) - f^\star} \ge \frac{t}{2L R_1^2}
    \;\Rightarrow\; f(x_t) - f^\star \le \frac{2L R_1^2}{t} ,

where the second step uses that f(x_i) ≤ f(x_0) implies ‖x_i − x*‖_1 ≤ R_1, which concludes the proof.

A.2. Lower bounds

In this section we provide the proof of Theorem 2.2. Our result is slightly more general; we will prove the following (and Theorem 2.2 follows by the choice α = 0.01 < 1/3).

Theorem A.2. Consider the function q(x) = ½⟨Qx, x⟩ for Q := (α − 1)(1/n) J_n + I_n, where J_n = 1_n 1_n^T and 0 < α < 1/2, n > 2. Then there exists x_0 ∈ R^n such that for the sequence {x_t}_{t≥0} generated by SCD it holds

    \|\nabla q(x_t)\|_\infty^2 \le \frac{3 + 3\alpha}{n} \|\nabla q(x_t)\|_2^2 .    (41)

In the proof below we will construct a special x_0 ∈ R^n that has the claimed property. However, we would like to remark that this is not very crucial. We observe that for functions as in Theorem A.2 and almost any initial iterate (x_0 not aligned with the coordinate axes), the sequence {x_t}_{t≥0} of iterates generated by SCD suffers from the same issue, i.e. relation (41) holds for the iteration counter t sufficiently large. We do not prove this formally, but demonstrate this behavior in Figure 4. We see that the steady state is almost reached after 2n iterations.

Proof of Theorem A.2. Define the parameter cα by the equation

    \Bigl(1 + \frac{\alpha - 1}{n}\Bigr) c_\alpha^{n-1} = \frac{1 - \alpha}{n} S_{n-1}(c_\alpha) ,    (42)

or, equivalently,

    c_\alpha^{n-1} = \frac{1 - \alpha}{n} S_n(c_\alpha) ,    (43)

where S_n(c_α) := Σ_{i=0}^{n−1} c_α^i; and define x_0 by [x_0]_i = c_α^{i−1} for i = 1, …, n. In Lemma A.3 below we show that c_α ≥ 1 − 3α/n.

We now show that SCD cycles through the coordinates, i.e. the sequence {xt}t≥0 generated by SCD satisfies

    [x_t]_{1 + (t-1 \bmod n)} = c_\alpha^n \cdot [x_{t-1}]_{1 + (t-1 \bmod n)} .    (44)

[Figure 4: left panel shows the ratio ω(∇f(x_t)); right panel shows the normalized entries of ∇f(x_t); x-axis: iterations.]

Figure 4. SCD on the function from Theorem A.2 in dimension n = 20 with x_0 = 1_n (i.e. not the worst starting point constructed in the proof of Theorem A.2). On the right, the (normalized and sorted) components of ∇f(x_t).

Observe ∇f(x0) = Qx0. Hence the GS rule picks i1 = 1 in the first iteration. The iterate is updated as follows:

    [x_1]_1 \overset{(3)}{=} [x_0]_1 - \frac{[Qx_0]_1}{Q_{11}}    (45)
    = 1 - \frac{(\alpha-1)\frac{1}{n} S_n(c_\alpha) + 1}{(\alpha-1)\frac{1}{n} + 1}    (46)
    = \frac{(\alpha-1)\frac{1}{n} \bigl(1 - S_n(c_\alpha)\bigr)}{(\alpha-1)\frac{1}{n} + 1}    (47)
    = \frac{(\alpha-1)\frac{1}{n} \bigl(c_\alpha^n - c_\alpha S_n(c_\alpha)\bigr)}{(\alpha-1)\frac{1}{n} + 1}    (48)
    \overset{(42)}{=} \frac{(\alpha-1)\frac{1}{n} c_\alpha^n + c_\alpha^n}{(\alpha-1)\frac{1}{n} + 1} = c_\alpha^n .    (49)

The relation (44) can now easily be checked by the same reasoning and induction.

It remains to verify that for this sequence property (41) holds. This is done in Lemma A.4. Note that ∇f(x_0) = Qx_0 = g, where g is defined as in the lemma, and that all gradients ∇f(x_t) are, up to scaling and reordering of the coordinates, equivalent to the vector g.

Lemma A.3. Let 0 < α < 1/2 and let 0 < c_α < 1 be defined by equation (42), where S_n(c_α) = Σ_{i=0}^{n−1} c_α^i. Then c_α ≥ 1 − 3α/n for α ∈ [0, 1/2].

Proof. Using the summation formula for geometric series, S_n(c_α) = (1 − c_α^n)/(1 − c_α), we derive

    \alpha \overset{(42)}{=} 1 - \frac{n\, c_\alpha^{n-1}}{S_n(c_\alpha)} = \underbrace{1 - \frac{n (1 - c_\alpha)\, c_\alpha^{n-1}}{1 - c_\alpha^n}}_{=: \Psi(c_\alpha)} .    (50)

With Taylor expansion we observe that

    \Psi\Bigl(1 - \frac{3\alpha}{n}\Bigr) \ge \alpha , \qquad \Psi\Bigl(1 - \frac{2\alpha}{n}\Bigr) \le \alpha ,    (51)

where the first inequality only holds for n > 2 and α ∈ [0, 1/2]. Hence any solution to (50) must satisfy c_α ≥ 1 − 3α/n.

Lemma A.4. Let c_α be as in (42). Let g ∈ R^n be defined as

    [g]_i = \frac{(\alpha-1)\frac{1}{n} S_n(c_\alpha) + c_\alpha^{i-1}}{1 + \frac{\alpha-1}{n}} .    (52)

Then

    \frac{\|g\|_\infty^2}{\frac{1}{n} \|g\|_2^2} \le 3 + 3\alpha .    (53)

Proof. Observe

    [g]_i = \frac{(\alpha-1)\frac{1}{n} S_{n-1}(c_\alpha) + (\alpha-1)\frac{1}{n} c_\alpha^{n-1} + c_\alpha^{i-1}}{1 + \frac{\alpha-1}{n}}    (54)
    \overset{(42)}{=} \frac{c_\alpha^{i-1} - c_\alpha^{n-1}}{1 + \frac{\alpha-1}{n}} .    (55)

Thus [g]_1 > [g]_2 > ⋯ > [g]_n ≥ 0, the maximum in (53) is attained at i = 1, and

    \omega(g) := \frac{[g]_1^2}{\frac{1}{n} \sum_{i=1}^{n} [g]_i^2} = \frac{n\, (1 - c_\alpha^{n-1})^2}{\sum_{i=0}^{n-1} \bigl( c_\alpha^{i} - c_\alpha^{n-1} \bigr)^2} .    (56)

For c_α ≥ 1 − 3α/n and α ≤ 1/2, this latter expression can be estimated as

    \omega(g) \le 3 + 3\alpha ,    (57)

and especially ω(g) ≤ 4 for α ≤ 1/3.
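The construction of Theorem A.2 can also be checked numerically. The Python sketch below (our own, using x_0 = 1_n as in Figure 4) runs SCD with exact coordinate minimization on q and tracks the ratio n‖∇q(x_t)‖_∞²/‖∇q(x_t)‖_2², which should settle below roughly 3 + 3α after the transient.

    import numpy as np

    def lower_bound_demo(n=20, alpha=0.01, iters=200):
        """Run SCD on q(x) = 0.5 x^T Q x with Q = (alpha - 1)/n * J + I (Theorem A.2)
        and track the ratio n * ||grad||_inf^2 / ||grad||_2^2."""
        Q = (alpha - 1.0) / n * np.ones((n, n)) + np.eye(n)
        x = np.ones(n)
        ratios = []
        for _ in range(iters):
            g = Q @ x
            i = int(np.argmax(np.abs(g)))          # Gauss-Southwell rule
            x[i] -= g[i] / Q[i, i]                 # exact coordinate minimization
            g = Q @ x
            ratios.append(n * np.max(np.abs(g)) ** 2 / (g @ g))
        return ratios

    print(max(lower_bound_demo()[100:]))           # stays close to (but below) 3 + 3*alpha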

B. Approximate Gradient Update

In this section we will prove Lemma 4.1. Consider first the following simpler case, where we assume f is given as in least squares, i.e. f(x) := ½‖Ax − b‖².

In the t-th iteration, we choose coordinate i_t to optimize over, and the update from x_t to x_{t+1} can be written as x_{t+1} = x_t + γ_t e_{i_t}. Now for any coordinate i other than i_t, it is fairly easy to compute the change in the gradient of the other coordinates. We already observed that [x_t]_j does not change, hence the subgradient sets of Ψ_j([x_t]_j) and Ψ_j([x_{t+1}]_j) are equal. For the change in ∇f, consider the analysis below:

    \nabla_i F(x_{t+1}) - \nabla_i F(x_t) = a_i^\top (A x_{t+1} - b) - a_i^\top (A x_t - b)    (58)
    = a_i^\top A (x_{t+1} - x_t)    (59)
    = a_i^\top A (x_t + \gamma_t e_{i_t} - x_t)    (60)
    = \gamma_t\, a_i^\top A e_{i_t} = \gamma_t\, a_i^\top a_{i_t} .    (61)

Equation (60) comes from the update of xt to xt+1. By the same reasoning, we can now derive the general proof.
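The identity (61) is easy to verify numerically; the following self-contained check (Python, hypothetical data, our own code) confirms that the gradient of the least-squares objective changes by γ_t⟨a_i, a_{i_t}⟩ in every coordinate after a single coordinate step.

    import numpy as np

    # Numerical check of (61): after a step x_{t+1} = x_t + gamma * e_{i_t}, the
    # directional derivatives of f(x) = 0.5*||Ax - b||^2 change by gamma * <a_i, a_{i_t}>.
    rng = np.random.default_rng(2)
    d, n = 30, 10
    A = rng.standard_normal((d, n))
    b = rng.standard_normal(d)
    x = rng.standard_normal(n)

    grad = lambda x: A.T @ (A @ x - b)
    i_t, gamma = 3, 0.7                        # hypothetical active coordinate and step
    x_new = x.copy()
    x_new[i_t] += gamma

    change = grad(x_new) - grad(x)
    predicted = gamma * (A.T @ A[:, i_t])
    assert np.allclose(change, predicted)      # holds for every coordinate, not only i != i_t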

Proof of Lemma 4.1. Consider a composite function F as given in Lemma 4.1. By the same reasoning as above, the two subgradient sets of Ψ_j([x_t]_j) and Ψ_j([x_{t+1}]_j) are identical for every passive coordinate j ≠ i_t. The gradient of F can be written as

    \nabla_i F(x_t) = a_i^\top \nabla f(A x_t) .

For any arbitrary passive coordinate j 6= it the change of the gradient can be computed as follows: > > ∇jF (xt+1) − ∇jF (xt) = aj ∇f(Axt+1) − aj ∇f(Axt) > = aj (∇f(Axt+1) − ∇f(Axt)) (62) >   = aj ∇f A(xt + γteit ) − ∇f Axt ∗ > 2 = A ∇ f(Ax˜)aj, xt+1 − xt 2 = γt∇ f(Ax˜)aj,A(xt+1 − xt) > 2 = γtaj ∇ f(Ax˜)ait (63)

Here x̃ is a point on the line segment between x_t and x_{t+1} (which differ only in coordinate i_t), whose existence in step * is guaranteed by the mean value theorem.

C. Algorithm and Stability

Proof of Theorem 6.1. As we are interested in studying the expected competitive ratio E[ρ_t] for t → ∞, we can assume mixing and consider only the steady state.

Define α_t ∈ [0, 1] such that α_t(n − s) = |{i ∈ I_t : i > s}|, i.e. α_t(n − s) denotes the number of indices in I_t which do not belong to the set [s].

Denote α_∞ := lim_{t→∞} α_t. By equilibrium considerations, the probability that an index i ∉ [s] gets picked (and removed from the active set), i.e. 1 − ρ_∞, must be equal to the probability that an index j ∉ [s] enters the active set. Hence

    \frac{(1 - \alpha_\infty)(n - s)}{T_\infty} = 1 - \rho_\infty = \frac{\alpha_\infty (n - s)}{\alpha_\infty (n - s) + cs} .    (64)

We deduce the quadratic relation α_∞ T_∞ = (1 − α_∞)(α_∞(n − s) + cs) with solution

    \alpha_\infty = \frac{n - (1+c)s - T_\infty + \sqrt{n^2 + (c-1)^2 s^2 + 2n((c-1)s - T_\infty) + 2(1+c)s T_\infty + T_\infty^2}}{2(n - s)} .    (65)

Denote θ := n² + (c − 1)²s² + 2n((c − 1)s − T_∞) + 2(1 + c)sT_∞ + T_∞². Hence,

    \rho_\infty \overset{(64)}{=} \frac{cs}{\alpha_\infty (n - s) + cs} \overset{(65)}{=} \frac{2cs}{cs + n - s - T_\infty + \sqrt{\theta}} .    (66)

We now verify the provided lower bound on ρ∞:

    \rho_\infty \overset{(64)}{=} 1 - \frac{(1 - \alpha_\infty)(n - s)}{T_\infty} \ge 1 - \frac{n - s}{T_\infty} .    (67)

This bound is sharp for large values of T_∞ (T_∞ > 2n, say), but trivial for T_∞ ≤ n − s.
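For reference, the closed-form expressions (65)-(67) can be evaluated directly; the short sketch below reproduces the values behind Figure 1 (n = 100, s = 10, c = 1). It is an illustration of the formulas, not experiment code from the paper.

    import numpy as np

    def rho_infinity(n, s, c, T):
        """Steady-state competitive ratio from (65)-(66) and its lower bound (67)."""
        theta = n**2 + (c - 1)**2 * s**2 + 2*n*((c - 1)*s - T) + 2*(1 + c)*s*T + T**2
        rho = 2*c*s / (c*s + n - s - T + np.sqrt(theta))
        return rho, 1 - (n - s) / T

    # the parameters of Figure 1 (n = 100, s = 10, c = 1)
    for T in (50, 100, 400):
        print(T, rho_infinity(100, 10, 1.0, T))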

D. GS rule for Composite Functions

D.1. GS-q rule

In this section we show how ASCD can be implemented for the GS-q rule. Define the coordinate-wise model

    V_i(x, y, s) := s y + \frac{L}{2} y^2 + \Psi_i\bigl([x]_i + y\bigr) .    (68)

The GS-q rule is defined as (cf. Nutini et al. (2015))

    i_{GS-q} = \arg\min_{i \in [n]} \min_{y \in \mathbb{R}} V_i(x, y, \nabla_i f(x)) .    (69)

First we show that the vectors v and w defined in Algorithm 2 give valid lower and upper bounds on the value of min_{y∈R} V_i(x, y, ∇_i f(x)). We start with the lower bound v:

Suppose we have lower and upper bounds ℓ ≤ ∇_i f(x) ≤ u on one component of the gradient. Define α ∈ [0, 1] such that ∇_i f(x) = (1 − α)ℓ + αu. Note that

    (1 - \alpha) V_i(x, y, \ell) + \alpha V_i(x, y, u) = V_i(x, y, \nabla_i f(x)) .    (70)

Hence,

    \min\Bigl\{ \min_{y} V_i(x, y, u),\ \min_{y} V_i(x, y, \ell) \Bigr\} \le \min_{y} V_i(x, y, \nabla_i f(x)) .    (71)

The derivation of the upper bound w is a bit more cumbersome. Define ℓ* := arg min_{y∈R} V_i(x, y, ℓ), u* := arg min_{y∈R} V_i(x, y, u) and observe:

    V_i(x, u^\star, \nabla_i f(x)) = V_i(x, u^\star, u) - (u - \nabla_i f(x)) u^\star \le V_i(x, u^\star, u) - u u^\star + \max\{ u u^\star, \ell u^\star \} =: \omega_u    (72)
    V_i(x, \ell^\star, \nabla_i f(x)) = V_i(x, \ell^\star, \ell) - (\ell - \nabla_i f(x)) \ell^\star \le V_i(x, \ell^\star, \ell) - \ell \ell^\star + \max\{ u \ell^\star, \ell \ell^\star \} =: \omega_\ell    (73)

    V_i(x, 0, \nabla_i f(x)) = \Psi_i([x]_i) .    (74)

Hence min_y V_i(x, y, ∇_i f(x)) ≤ min{ω_ℓ, ω_u, Ψ_i([x]_i)}. Note that

    \omega_u = V_i(x, u^\star, u) + \max\{0, (\ell - u) u^\star\}    (75)
    \omega_\ell = V_i(x, \ell^\star, \ell) + \max\{0, (u - \ell) \ell^\star\}    (76)

which coincides with the formulas in Algorithm 2. It remains to show that the computation of the active set is safe, i.e. that the progress achieved by ASCD as defined in Algorithm 2 is always better than the progress achieved by UCD. Let I be defined as in Algorithm 2. Then

    \frac{1}{|I|} \sum_{i \in I} \min_{y \in \mathbb{R}} V_i(x, y, \nabla_i f(x)) \le \frac{1}{n} \sum_{i \in [n]} \min_{y \in \mathbb{R}} V_i(x, y, \nabla_i f(x))    (77)
    = \mathbb{E}_{i \in_{u.a.r.} [n]} \Bigl[ \min_{y \in \mathbb{R}} V_i(x, y, \nabla_i f(x)) \Bigr] .    (78)

Using this observation, and the same lines of reasoning as given in (Lee & Seung, 1999, Section H.3), it follows immediately that the one step progress of ASCD is at least as good as the one of UCD.

D.2. GS-r rule

With the notation [y*]_i := arg min_{y∈R} V_i(x, y, ∇_i f(x)), the GS-r rule is defined as (cf. Lee & Seung (1999))

    i_{GS-r} = \arg\max_{i \in [n]} \bigl| [y^\star]_i \bigr| .    (79)

In order to implement ASCD for the GS-r rule, we therefore need to maintain lower and upper bounds on the values |[y*]_i|. Suppose we have lower and upper bounds ℓ ≤ ∇_i f(x) ≤ u on one component of the gradient. Define ℓ* := arg min_{y∈R} V_i(x, y, ℓ) and u* := arg min_{y∈R} V_i(x, y, u); then y* is contained in the line segment between ℓ* and u*. Hence, as in Algorithm 1, the lower and upper bounds can be defined as

    [u_t]_i := \max_{y \in \mathbb{R}} \{\, |y| \;:\; \ell^\star \le y \le u^\star \,\}    (80)
    [\ell_t]_i := \min_{y \in \mathbb{R}} \{\, |y| \;:\; \ell^\star \le y \le u^\star \,\}    (81)

However, note that in (Nutini et al., 2015) it is established that the GS-r rule can be worse than UCD in general. Hence we cannot expect that ASCD for the GS-r rule is better than UCD in general. However, by the choice of the active set, the index chosen by the GS-r rule is always contained in the active set, and ASCD approaches GS-r for small errors.

E. Experimental Details

We generate a matrix A ∈ R^{m×n} with entries from the standard normal N(0, 1) distribution. m is kept fixed at 1000, but n is chosen as 1000 for the l2-regularized least squares regression and 5000 for the l1-regularized counterpart. We add 1 to each entry (to induce a dependency between columns), multiply each column by ten times a sample from N(0, 1) (to induce different Lipschitz constants across the coordinates), and only keep each entry of A non-zero with probability 10 log(n)/n. This is exactly the same procedure as discussed in (Nutini et al., 2015).
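A minimal sketch of this generation procedure (Python; our reading of the description above, parameter names are ours):

    import numpy as np

    def synthetic_matrix(m=1000, n=1000, seed=0):
        """Sparse synthetic design matrix following the procedure of Sec. E
        (after Nutini et al. (2015))."""
        rng = np.random.default_rng(seed)
        A = rng.standard_normal((m, n))
        A += 1.0                                          # induce dependency between columns
        A *= 10.0 * rng.standard_normal(n)                # column scaling: varying Lipschitz constants
        keep = rng.random((m, n)) < 10.0 * np.log(n) / n  # keep each entry with prob. 10*log(n)/n
        return A * keep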