Acceleration Methods

Alexandre d'Aspremont (CNRS & Ecole Normale Supérieure, Paris; [email protected])
Damien Scieur (Samsung SAIT AI Lab & MILA, Montreal; [email protected])
Adrien Taylor (INRIA & Ecole Normale Supérieure, Paris; [email protected])

arXiv:2101.09545v2 [math.OC] 11 Mar 2021

Contents

1 Introduction
2 Chebyshev Acceleration
  2.1 Introduction
  2.2 Optimal Methods and Minimax Polynomials
  2.3 The Chebyshev Method
3 Nonlinear Acceleration
  3.1 Introduction
  3.2 Quadratic Case
  3.3 The Non-Quadratic Case: Regularized Nonlinear Acceleration
  3.4 Constrained/Proximal Nonlinear Acceleration
  3.5 Heuristics for Speeding-up Nonlinear Acceleration
  3.6 Related Work
4 Nesterov Acceleration
  4.1 Introduction
  4.2 Gradient Method and Potential Functions
  4.3 Optimized Gradient Method
  4.4 Nesterov's Acceleration
  4.5 Acceleration under Strong Convexity
  4.6 Recent Variants of Accelerated Methods
  4.7 Practical Extensions
  4.8 Notes & References
5 Proximal Acceleration and Catalyst
  5.1 Introduction
  5.2 Proximal Point Algorithm and Acceleration
  5.3 Güler and Monteiro-Svaiter Acceleration
  5.4 Exploiting Strong Convexity
  5.5 Application: Catalyst Acceleration
  5.6 Notes & References
6 Restart Schemes
  6.1 Introduction
  6.2 Hölderian Error Bounds
  6.3 Optimal Restart Schemes
  6.4 Robustness & Adaptivity
  6.5 Extensions
  6.6 Calculus Rules
  6.7 Example: Compressed Sensing
  6.8 Restarting Other First-Order Methods
  6.9 Notes & References
Appendices
A Useful Inequalities
  A.1 Smoothness and Strong Convexity with Euclidean Norm
  A.2 Smooth Convex Functions for General Norms over Restricted Sets
B Variations around Nesterov Acceleration
  B.1 Relations between Acceleration Methods
  B.2 Conjugate Gradient Method
  B.3 Proximal Accelerated Gradient Without Monotone Backtracking
Acknowledgements
References

ABSTRACT

This monograph covers some recent advances on a range of acceleration techniques frequently used in convex optimization. We first use quadratic optimization problems to introduce two key families of methods, momentum and nested optimization schemes, which coincide in the quadratic case to form the Chebyshev method, whose complexity is analyzed using Chebyshev polynomials. We discuss momentum methods in detail, starting with the seminal work of Nesterov (1983), and structure convergence proofs using a few master templates, such as that of optimized gradient methods, which have the key benefit of showing how momentum methods maximize convergence rates.
We further cover proximal acceleration techniques, at the heart of the Catalyst and Accelerated Hybrid Proximal Extragradient frameworks, using similar algorithmic patterns. Common acceleration techniques directly rely on the knowledge of some regularity parameters of the problem at hand, and we conclude by discussing restart schemes, a set of simple techniques to reach nearly optimal convergence rates while adapting to unobserved regularity parameters.

1 Introduction

Optimization methods are a core component of the modern numerical toolkit. In many cases, iterative algorithms for solving convex optimization problems have reached a level of efficiency and reliability comparable to that of advanced linear algebra routines. This is largely true on medium-scale problems, where interior point methods reign supreme, and less so on large-scale problems, where the complexity of first-order methods is not as well understood and efficiency remains a concern.

The situation has markedly improved in recent years, driven in particular by the emergence of a number of applications in statistics, machine learning and signal processing. Building on Nesterov's path-breaking algorithm from the 1980s, several accelerated methods and numerical schemes have been developed that both improve the efficiency of optimization algorithms and refine their complexity bounds. Our objective in this monograph is to cover these recent developments using a few master templates.

The methods described in this manuscript can be arranged in roughly two categories. The first, stemming from the work of Nesterov (1983), produces variants of the gradient method with accelerated worst-case convergence rates that are provably optimal under classical regularity assumptions. The second uses outer iteration (a.k.a. nested) schemes to speed up convergence. In this second setting, accelerated schemes run both an inner and an outer loop, with inner iterations being solved by classical optimization methods.

Direct acceleration techniques. Ever since the original algorithm by Nesterov (1983), the acceleration phenomenon has remained somewhat of a mystery. While accelerated gradient methods can be seen as iteratively building a model for the function and using it to guide gradient computations, the argument is purely algebraic and simply exploits regularity assumptions very effectively. This approach of collecting inequalities induced by regularity assumptions and cleverly chaining them to prove convergence was also used in, e.g., (Beck and Teboulle, 2009) to produce an optimal proximal gradient method. There too, however, the proof yielded very little evidence on why the method should actually be faster.

Fortunately, we are now better equipped to push the proof mechanisms much further. Recent advances in the programmatic design of optimization algorithms allow us to perform the design and analysis of algorithms following a more principled approach. In particular, the performance estimation approach, pioneered by Drori and Teboulle (2014), can be used to design optimal methods from scratch, picking algorithmic parameters to optimize worst-case performance guarantees (Drori and Teboulle, 2014; Kim and Fessler, 2016). Primal-dual optimality conditions on the design problem then provide a blueprint for the accelerated algorithm structure and for its convergence proof. Using this framework, acceleration is no longer a mystery: it is the main objective in the algorithm's design. We recover the usual "soup of regularity inequalities" template of classical convergence proofs, but the optimality conditions of the design problem explicitly produce a method that optimizes the convergence rate. In this monograph, we cover accelerated first-order methods using this systematic template and describe a number of convergence proofs for classical variants of the accelerated gradient method of Nesterov (1983), such as those of Nesterov (1983), Nesterov (2003), and Beck and Teboulle (2009).
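To make the momentum mechanism concrete before diving into the proofs, here is a minimal Python sketch of a classical variant of Nesterov's accelerated gradient method for an L-smooth convex function. The gradient oracle grad_f, the starting point x0, the smoothness constant L and the iteration budget are illustrative assumptions; the step-size and momentum rules shown are one standard choice among the several variants analyzed later.

```python
import numpy as np

def accelerated_gradient(grad_f, x0, L, n_iters=100):
    """Sketch of an accelerated gradient method with constant step size 1/L.

    grad_f  : callable returning the gradient at a point (assumed given)
    x0      : starting point, as a NumPy array (assumed given)
    L       : smoothness constant of the objective (assumed known)
    n_iters : illustrative iteration budget
    """
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    t = 1.0
    for _ in range(n_iters):
        x_next = y - grad_f(y) / L                        # gradient step at the extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # momentum (extrapolation) step
        x, t = x_next, t_next
    return x
```

With these choices the momentum coefficient (t_k - 1)/t_{k+1} grows roughly like (k - 1)/(k + 2), which is what yields the O(1/k^2) worst-case rate on function values for smooth convex objectives, compared with O(1/k) for plain gradient descent.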
Nested acceleration schemes. The second category of acceleration techniques we cover in this monograph is formed by outer iteration schemes, which use classical optimization algorithms as a black box in their inner loops, and where acceleration is produced by an argument on the outer loop. We describe three types of acceleration results in this vein.

The first scheme is based on nonlinear acceleration techniques. Using arguments dating back to (Aitken, 1927; Wynn, 1956; Anderson and Nash, 1987), these techniques form a weighted average of past iterates to extrapolate a better candidate solution than the last iterate. We begin by describing the Chebyshev method for solving quadratic problems, which interestingly qualifies both as a gradient method and as an outer iteration scheme, and takes its name from the fact that Chebyshev polynomial coefficients are used to approximately minimize the gradient at the extrapolated solution. This argument can be extended to non-quadratic optimization problems, provided the extrapolation procedure is regularized.

The second scheme, due to Güler (1992), Monteiro and Svaiter (2013), and Lin et al. (2015), runs a conceptual accelerated proximal point algorithm, using classical iterative methods to approximate the proximal point in an inner loop. In particular, this framework produces accelerated gradient methods (in the same sense as Nesterov's acceleration) when the approximate proximal points are computed using linearly converging gradient-based optimization methods, taking advantage of the fact that the inner problems are always strongly convex.

Finally, we describe restart schemes. These take advantage of regularity properties called Hölderian error bounds, which extend strong convexity properties near the optimum and hold almost generically, to improve the convergence rates of most first-order methods. The parameters of Hölderian error bounds are usually unknown, but restart schemes are robust: they adapt to these constants, and their empirical performance is excellent on problems with reasonable precision targets.
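As a rough illustration of how restarting works, the sketch below wraps an accelerated method (such as the accelerated_gradient sketch above) in a fixed-schedule restart loop: each outer iteration warm-starts at the previous output but resets the momentum sequence. The function name, the inner/outer budgets and the fixed schedule are all illustrative assumptions; the restart schemes covered in this monograph use schedules that are provably near-optimal and adaptive to the unknown Hölderian error bound parameters.

```python
def restart(inner_method, x0, inner_iters=50, n_restarts=10):
    """Minimal fixed-schedule restart scheme (illustrative parameters).

    inner_method : callable (x_start, n_iters) -> new iterate, e.g. the
                   accelerated_gradient sketch above with grad_f and L
                   fixed through a closure (hypothetical setup)
    """
    x = x0
    for _ in range(n_restarts):
        # Warm-start at the previous output; the momentum sequence inside
        # inner_method is rebuilt from scratch at every restart.
        x = inner_method(x, inner_iters)
    return x

# Hypothetical usage, reusing the earlier sketch:
#   inner = lambda x, k: accelerated_gradient(grad_f, x, L, n_iters=k)
#   x_hat = restart(inner, x0)
```

On strongly convex problems, for instance, periodically discarding the momentum in this way is what allows an accelerated method to recover a fast linear rate without knowing the strong convexity constant.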