arXiv:2101.09545v2 [math.OC] 11 Mar 2021 asn ATA a IA Montreal MILA, & Lab AI SAIT Samsung NI cl oml uéiue Paris Supérieure, Normale Ecole & INRIA NS&EoeNraeSprer,Paris Supérieure, Normale Ecole & CNRS ceeainMethods Acceleration lxnr d’Aspremont Alexandre [email protected] [email protected] [email protected] ainScieur Damien dinTaylor Adrien Contents

1 Introduction 2

2 Chebyshev Acceleration 5 2.1 Introduction ...... 5 2.2 OptimalMethodsandMinimaxPolynomials ...... 7 2.3 TheChebyshevMethod ...... 9

3 Nonlinear Acceleration 15 3.1 Introduction ...... 15 3.2 QuadraticCase ...... 16 3.3 The Non-Quadratic Case: Regularized Nonlinear Acceleration...... 20 3.4 Constrained/Proximal Nonlinear Acceleration ...... 26 3.5 Heuristics for Speeding-up Nonlinear Acceleration ...... 29 3.6 RelatedWork ...... 31

4 Nesterov Acceleration 32 4.1 Introduction ...... 33 4.2 GradientMethodandPotential Functions ...... 35 4.3 OptimizedGradientMethod ...... 38 4.4 Nesterov’sAcceleration ...... 45 4.5 AccelerationunderStrongConvexity ...... 51 4.6 Recent Variants of Accelerated Methods ...... 59 4.7 PracticalExtensions ...... 65 4.8 Notes&References ...... 81

5 Proximal Acceleration and Catalyst 85 5.1 Introduction ...... 85 5.2 Proximal Point Algorithm and Acceleration ...... 86 5.3 Güler and Monteiro-Svaiter Acceleration ...... 90 5.4 ExploitingStrongConvexity...... 93 5.5 Application:CatalystAcceleration ...... 97 5.6 Notes&References ...... 104

6 Restart Schemes 107 6.1 Introduction ...... 107 6.2 HölderianErrorBounds ...... 110 6.3 OptimalRestartSchemes ...... 112 6.4 Robustness&Adaptivity ...... 113 6.5 Extensions ...... 114 6.6 CalculusRules ...... 116 6.7 Example:CompressedSensing ...... 117 6.8 RestartingOtherFirst-OrderMethods ...... 118 6.9 Notes&References ...... 120

Appendices 121

A Useful Inequalities 122 A.1 Smoothness and Strong Convexity with Euclidean Norm ...... 122 A.2 Smooth Convex Functions for General Norms over RestrictedSets ...... 127

B Variations around Nesterov Acceleration 128 B.1 Relationsbetween Acceleration Methods ...... 128 B.2 ConjugateGradientMethod...... 133 B.3 Proximal Accelerated Gradient Without Monotone Backtracking ...... 137

Acknowledgements 144

References 145 Acceleration Methods Alexandre d’Aspremont1, Damien Scieur2 and Adrien Taylor3 1CNRS & Ecole Normale Supérieure, Paris; [email protected] 2Samsung SAIT AI Lab & MILA, Montreal; [email protected] 3INRIA & Ecole Normale Supérieure, Paris; [email protected]

ABSTRACT This monograph covers some recent advances on a range of acceleration tech- niques frequently used in convex optimization. We first use quadratic op- timization problems to introduce two key families of methods, momentum and nested optimization schemes, which coincide in the quadratic case to form the Chebyshev method whose complexity is analyzed using Chebyshev polynomials. We discuss momentum methods in detail, starting with the seminal work of Nesterov (1983) and structure convergence proofs using a few master tem- plates, such as that of optimized gradient methods which have the key benefit of showing how momentum methods maximize convergence rates. We further cover proximal acceleration techniques, at the heart of the Cata- lyst and Accelerated Hybrid Proximal Extragradient frameworks, using similar algorithmic patterns. Common acceleration techniques directly rely on the knowledge of some reg- ularity parameters of the problem at hand, and we conclude by discussing restart schemes, a set of simple techniques to reach nearly optimal conver- gence rates while adapting to unobserved regularity parameters. 1

Introduction

Optimization methods are a core component of the modern numerical toolkit. In many cases, iterative algorithms for solving convex optimization problems have reached a level of efficiency and reliability comparable to that of advanced linear algebra routines. This is largely true on medium scale problems where interior point methods reign supreme, less so on large-scale problems where the complexity of first-order methods is not as well understood, and efficiency still remains a concern. The situation has markedly improved in recent years, driven in particular by the emergence of a number of applications in statistics, machine learning and signal process- ing. Building on Nesterov’s path-breaking algorithm from the 80’s, several accelerated methods and numerical schemes have been developed that both improve the efficiency of optimization algorithms and refine their complexity bounds. Our objective in this monograph is to cover these recent developments using a few master templates. The methods described in this manuscript can be arranged in roughly two categories. The first, stemming from the work of Nesterov (1983), produces variants of the gradient method with accelerated worst-case convergence rates which are provably optimal under classical regularity assumptions. The second uses outer iteration (aka nested) schemes to speed up convergence. In this second setting, accelerated schemes run both an inner and an outer loop, with inner iterations being solved by classical optimization methods.

Direct acceleration techniques. Ever since the original algorithm by Nesterov (1983), the acceleration phenomenon has remained somewhat of a mystery. While accelerated gradient methods can be seen as iteratively building a model for the function, using it to guide gradient computations, the argument is purely algebraic and simply effectively exploits regularity assumptions. This approach of collecting inequalities induced by reg- ularity assumptions and cleverly chaining them to prove convergence was also used in

2 3

e.g. (Beck and Teboulle, 2009) to produce an optimal proximal gradient method. There too however, the proof yielded very little evidence on why the method must actually be faster. Fortunately, we are now better equipped to push the proof mechanisms much further. Recent advances in programmatic design of optimization algorithms allow us to perform the design and analysis of algorithms following a more principled approach. In particular, the performance estimation approach, pioneered by Drori and Teboulle (2014), can be used to design optimal methods from scratch, picking algorithmic parameters to optimize worst-case performance guarantees (Drori and Teboulle, 2014; Kim and Fessler, 2016). Primal dual optimality conditions on the design problem then provide a blueprint for the accelerated algorithm structure and for its convergence proof. Using this framework, acceleration is not a mystery anymore, it is the main objective in the algorithm’s design. We recover the usual “soup of regularity inequalities” template of classical convergence proofs, but the optimality conditions of the design problem explicitly produce a method that optimizes the convergence rate. In this monograph, we cover accelerated first-order methods using this systematic template and describe a number of convergence proofs for classical variants of the accelerated gradient method of (Nesterov, 1983), such as those of Nesterov (1983), Nesterov (2003), and Beck and Teboulle (2009).

Nested acceleration schemes. The second category of acceleration techniques we cover in this monograph is formed by outer iteration schemes, which use classical optimization algorithms as a black-box in their inner loops, and where acceleration is produced by an argument on the outer loop. We describe three types of acceleration results in this vein. The first scheme is based on nonlinear acceleration techniques. Using arguments dating back to (Aitken, 1927; Wynn, 1956; Anderson and Nash, 1987), these techniques use a weighted average of iterates to extrapolate a better candidate solution than the last iterate. We begin by describing the Chebyshev method for solving quadratic problems, which interestingly qualifies both as a gradient method and as an outer iteration scheme, and takes its name from the fact that Chebyshev polynomial coefficients are used to approximately minimize the gradient at the extrapolated solution. This argument can be extended to non quadratic optimization problems provided the extrapolation procedure is regularized. The second scheme, due to (Güler, 1992; Monteiro and Svaiter, 2013; Lin et al., 2015) solves a conceptual accelerated proximal point algorithm, and uses classical iter- ative methods to approximate the proximal point in an inner loop. In particular, this framework produces accelerated gradient methods (in the same sense as Nesterov’s accel- eration) when the approximate proximal points are computed using linearly converging gradient-based optimization methods, taking advantage of the fact that the inner prob- lems are always strongly convex. Finally, we describe restart schemes. These take advantage of regularity properties called Hölderian error bounds, which extend strong convexity properties near the opti- 4 Introduction

mum and hold almost generically, to improve convergence rates of most first-order meth- ods. The parameters of Hölderian error bounds are usually unknown, but the restart schemes are robust which means that they are adaptive to these constants and that their empirical performance is excellent on problems with reasonable precision targets.

Content & organization. We present a few convergence acceleration techniques that are particularly relevant in the context of (first-order) convex optimization. This sum- mary presents our own points of view on the topic, and focuses on techniques that received a lot of attention since the early 2000’s, although some of the underlying ideas date back to long before. We do not pretend to be exhaustive, and are aware that very valuable references might not appear below. Chapters can be read nearly independently. However, we believe the insights of some chapters largely benefit the others. In particular, Chebyshev acceleration (Chapter 2)& Nonlinear acceleration (Chapter 3) are clear complementary readings. Similarly, Cheby- shev acceleration (Chapter 2) & Nesterov acceleration (Chapter 4), Nesterov accelera- tion (Chapter 4) & proximal acceleration (Chapter 5), as well as Nesterov acceleration (Chapter 4) & restarts (Chapter 6) certainly belong together.

Prerequisites and complementary readings. This monograph is not meant to be a general-purpose manuscript on convex optimization, for which we refer the reader to the now classical references (Boyd and Vandenberghe, 2004; Bonnans et al., 2006; Nocedal and Wright, 2006). Other directly related references are provided in the text. We assume the reader to have a working knowledge of base linear algebra and convex analysis (such as on subdifferentials, and subgradients), as we do not spend much time on the corresponding technical details. Classical references on the latter include (Rockafellar, 1970; Rockafellar and Wets, 2009; Hiriart-Urruty and Lemaréchal, 2013).

Notes on this version.

Last compiled on March 12, 2021.

This version is still preliminary and under revision. Any feedback is welcome. In particu- lar, the current version certainly does not contain all appropriate references. Please feel free to suggest any that you believe is relevant. 2

Chebyshev Acceleration

While “Chebyshev polynomials are everywhere dense in numerical analysis”, we would like to argue here that Chebyshev polynomials also provide one of the most direct and intuitive explanations for acceleration arguments in first-order methods. In quadratic optimization, one can form a linear combination of fixed step gradient descent iterates that minimizes the gradient norm optimally and uniformly over all quadratic problems satisfying given regularity conditions. Given a fixed number of iterations, these linear combination coefficients emerge from a Chebyshev minimization problem, whose solution can also be computed iteratively, thus yielding an algorithm, called the Chebyshev method (Nemirovsky and Polyak, 1984) which traces its roots at least to (Flanders and Shortley, 1950) who credit Tuckey and Grosch, is detailed in what follows and asymptotically matches the heavy-ball method.

2.1 Introduction

In what follows, we show basic acceleration results on quadratic minimization problems. In this case, the optimum points can be written as the solutions of a linear system and we see that the basic gradient method can be seen as a simple iterative solver for this linear system. We then detail acceleration methods on these sequences using a classical argument involving Chebyshev polynomials. Analyzing this simple scenario is useful in two ways. First, recursive formulations of the Chebyshev argument yield a basic algorithmic template to design accelerated meth- ods, providing some intuition on their structure, such as the presence of a momentum term. Second, the arguments are robust to perturbations of the quadratic function f, hence apply asymptotically to more generic convex functions, thus enabling acceleration in a much wider range of applications. This is covered in the next chapter.

5 6 Chebyshev Acceleration

For now, consider the following unconstrained quadratic minimization problem 1 minimize f(x) , xT Hx bT x (2.1) 2 − in the variable x Rd, where H S (the set of symmetric matrices of size d d) is ∈ ∈ d × the Hessian of f. We further assume that f is both smooth and strongly convex, i.e., that there are µ,L > 0 such that µI H LI (this can be readily extended to the   case where µ is the smallest nonzero eigenvalue, or where µ = 0). Suppose we solve this problem using the fixed step gradient method detailed as Algorithm 1.

Algorithm 1 Gradient Method

Input: A differentiable convex function f, an initial point x0, a step size γ > 0. 1: for k = 1,...,kmax do 2: xk := xk 1 γ f(xk 1) − − ∇ − 3: end for Output: An approximate solution kmax.

On problem (2.1), the iterations become

x = (I γH)x + γb, k+1 − k 1 and calling x⋆ = H− b the optimum of problem (2.1), this is again

x x = (I γH)(x x ). (2.2) k+1 − ⋆ − k − ⋆ This means that gradient descent iterates x x are computed from x x using the k − ⋆ 0 − ⋆ matrix polynomial P Grad(H) = (I γH)k. (2.3) k − Suppose we set the step size γ such that

I γH < 1, k − k2 where stands for the operator ℓ norm. Then, (2.2) allows us to control convergence, k·k 2 with x x I γH k x x , for k 0. (2.4) k k − ⋆k2 ≤ k − k2 k 0 − ⋆k2 ≥ Because the matrix H is symmetric hence diagonalizable in an orthogonal basis, given γ > 0, we get

I γH 2 max I γH 2 k − k ≤ µI H LI k − k   max 1 γh ≤ µ h L | − | ≤ ≤ max γh 1, 1 γh ≤ µ h L { − − } ≤ ≤ max γL 1, 1 γµ . ≤ { − − } 2.2. Optimal Methods and Minimax Polynomials 7

1 1 γλ 1 0.8 | − | 1 γλ 3 | − | | 5 ) 1 γλ

λ 0.6 PSfrag replacements ( | − | grad k 0.4 P |

0.2

0 µ L λ [µ, L] ∈ Grad Figure 2.1: We plot |Pk (λ)| (for the optimal γ in (2.6)) for k ∈ {1, 2 ..., 5}, µ = 1, L = 10. The Grad polynomial is normalized such that P (0) = 1. The rate is equal to the largest value of |Pk (λ)| on the interval, achieved at the boundaries where λ = µ or L.

To get the best possible worst-case rate, we now minimize this quantity in γ > 0, solving L µ min max γL 1, 1 γµ = − , (2.5) γ>0 { − − } L + µ with the optimal step size obtained when both terms in the max are equal, at 2 γ = . (2.6) L + µ Calling κ = L/µ 1 the condition number of function f, the bound in (2.4) finally ≥ becomes κ 1 k x x − x x , for k 0. (2.7) k ⋆ 2 κ + 1 0 ⋆ 2 k − k ≤   k − k ≥ which characterizes the convergence of the gradient method in the smooth, strongly convex case.

2.2 Optimal Methods and Minimax Polynomials

In equation (2.4) above, we saw that the convergence rate of the gradient method on quadratic functions can be controlled by the spectral norm of a matrix polynomial and Grad Figure 2.1 plots the polynomial Pk for various degrees k. We can push this reasoning a bit further to produce methods with accelerated rates of convergence.

2.2.1 First-Order Methods and Matrix Polynomials The bounds derived above for gradient descent can be extended to a broader class of first-order methods for quadratic optimization. We consider first-order algorithms where 8 Chebyshev Acceleration

each iterate belongs to the span of previous gradients, i.e.

x x + span f(x ), f(x ), ..., f(x ) , k+1 ∈ 0 {∇ 0 ∇ 1 ∇ k } and show that the iterates are computed using a matrix polynomial as in (2.3) above.

Proposition 2.1. Assume the iterates xk are generated by a first-order algorithm with fixed step size such that

x x + span f(x ), f(x ), ..., f(x ) , (2.8) k+1 ∈ 0 {∇ 0 ∇ 1 ∇ k } where f is the gradient of the quadratic objective function in (2.1). Then, the error at ∇ iterate xk can be written x x = P (H)(x x ) (2.9) k − ⋆ k 0 − ⋆ where Pk is a polynomial of degree at most k and Pk(0) = 1.

Proof. Indeed, since f(x) is the gradient of a quadratic function, it reads ∇ f(x)= Hx b = H(x x ) ∇ − − ⋆

for any x⋆ satisfying Hx⋆ = b, where H is symmetric. We have

x x = 1 (x x ) 0 − ⋆ · 0 − ⋆ = P (H)(x x ) 0 0 − ⋆ x x = (x x )+ α(1)H(x x ) 1 − ⋆ 0 − ⋆ 0 0 − ⋆ = p (H)(x x ) 1 0 − ⋆ Assuming recursively that (2.9) holds for all indices i k, our assumption on iterates ≤ in (2.8) means

k x x = x x + α(k+1) f(x ) k+1 − ⋆ 0 − ⋆ i ∇ i Xi=0 k = x x + α(k+1)HP (H)(x x ) 0 − ⋆ i i 0 − ⋆ Xi=0 k (k+1) = I H αi Pi(H) (x0 x⋆) − ! − Xi=0 writing P (x) = 1 x k α(k+1)P (x), we have k+1 − i=0 i i P x x = P (H)(x x ) k+1 − ⋆ k+1 0 − ⋆ with P (0) = 1 and deg(P ) k + 1. k+1 k+1 ≤ 2.3. The Chebyshev Method 9

Note that the proof structure does not change if the step size is allowed to vary with x , H,..., so that α(k) α(k)(x , H,...) in the proof of Proposition 2.1, e.g. when using k i ≡ i k line search. This means that the above result readily applies in this case. This last proposition helps us design optimal algorithms, given a class of problem M matrices. Indeed, because there is a one-to-one link between polynomials and first-order methods, we can use tools from approximation theory to find optimal polynomials and design optimal methods. Given a matrix class , this means minimizing the worst-case M rate of convergence over H , solving ∈ M

Pk∗ = argmin max P (H) 2 P k,P (0)=1 H k k ∈Pk≤ ∈M where P is the set of polynomial of degree at most k. This polynomial P will be an ∈ Pk k∗ optimal polynomial for , and will yield a (worst-case) optimal algorithm for the class M . In the simple case where is the set of positive definite matrices with bounded M M spectrum, namely = H S : 0 µI H LI , M { ∈ d ≺   } finding the optimal polynomial over this class simplifies to

Pk∗ = argmin max P (λ) (2.10) P , λ [µ, L] | | k ∈ P (0)=1∈P Polynomials solving (2.10) are derived from Chebyshev polynomials of the first kind in ap- proximation theory and can be formed explicitly to produce an optimal algorithm called the Chebyshev method. The next section describes and analyses the rate of convergence of this method.

2.3 The Chebyshev Method

In the previous section, we have mentioned that the optimal polynomial for the class

= H S : 0 µI H LI , M { ∈ n ≺   } is related to Chebyshev polynomials of the first kind. We now explicitly introduce these polynomials, then analyse the rate of convergence of the associated algorithm, namely the Chebyshev method. A more complete treatment of these polynomials is available in e.g. Mason and Handscomb (2002).

2.3.1 Shifted Chebyshev Polynomials The Chebyshev polynomials of the first kind are defined recursively as follows

(x) = 1, T0 (x)= x, (2.11) T1 k(x) = 2x k 1(x) k 2(x), for k 2. T T − −T − ≥ 10 Chebyshev Acceleration

There exists a compact explicit solution for Chebyshev polynomial involving trigonomet- ric functions, cos(k acos(x)) x [ 1, 1], ∈ − k(x)= cosh(k acosh(x)) x> 1, (2.12) T  ( 1)kcosh(k acosh( x)) x< 1. − − Originally, the Chebyshev polynomial were defined for the interval [ 1, 1]. However, we  − seek minimal polynomials on the interval [µ,L]. Using a simple linear mapping from [µ, L] to [ 1, 1], − 2x (L + µ) x t[µ,L](x)= − , → L µ − we obtain shifted Chebyshev polynomials as [µ,L] k t (x) C[µ,L](x)= T . (2.13) k t[µ,L](0) Tk   [µ,L] where we have enforced the normalization constraint Ck (0) = 1. Golub and Varga (1961a) characterize the ℓ optimality in the interval [µ,L] using an equi-oscillation ∞ argument (see (Süli and Mayers, 2003) for example), i.e., they show that the solution [µ,L] of (2.10) is Pk∗ = Ck . Using this mapping, we obtain, after simplification, the following recursion [µ,L] C0 (x) = 1, 2 C[µ,L](x) = 1 x, (2.14) 1 − L + µ [µ,L] 2δk [µ,L] Ck (x) = (L + µ 2x) Ck 1 (x) L µ − − − 2δk(L + µ) [µ,L] + 1 Ck 2 (x), for k 2, − L µ − ≥  −  where δ0 = 1 and − 1 δ = , for k 1. k L+µ ≥ 2 L µ δk 1 − − − [µ,L] The sequence δk ensures Ck (0) = 1. This recursion is simple but may present numer- ical stability issues (see (Gutknecht and Röllin, 2002, Algorithm 1) for a more stable [µ,L] [µ,L] [µ,L] implementation). We plot C1 (x), C3 (x) and C5 (x) in Figure 2.2 as illustration.

2.3.2 Chebyshev and Polyak’s Heavy-Ball Algorithms The shifted Chebyshev polynomials above solve (2.10) hence allow us to bound the con- vergence rate of first-order methods using (2.9). We now present the resulting algorithm, called Chebyshev semi- (Golub and Varga, 1961b). [µ,L] Suppose we define iterates using Ck (x) as follows, x x = C[µ,L](H)(x x ). k − ⋆ k 0 − ⋆ 2.3. The Chebyshev Method 11

1 [µ,L] C1 0.8 | | C[µ,L] | 3 | | [µ,L] ) C λ 0.6 5

PSfrag replacements (

] | | µ,L [ k 0.4 C |

0.2

0 µ L λ [µ, L] ∈ [µ,L] [µ,L] [µ,L] Figure 2.2: We plot the absolute value of C1 (x), C3 (x) and C5 (x) for λ ∈ [µ, L] where µ = 1 and L = 10. The polynomial is normalized such that P (0) = 1. The maximum value of the image of [µ, L] by C[µ,L]k decreases very fast as k grows, implying a better rate of convergence.

the recursion in (2.14) yields

2δk xk x⋆ = ((L + µ)I 2H) (xk 1 x⋆) − L µ − − − − 2δk(L + µ) + 1 (xk 2 x⋆). − L µ − −  −  Since the gradient of the function reads f(x )= H(x x ), ∇ k k − ⋆ we can simplify away x⋆ to get the following recursion

2δk 2δk(L + µ) xk = ((L + µ)xk 1 2 f(xk 1)) + 1 xk 2. L µ − − ∇ − − L µ − −  −  which describes iterates of the Chebyshev method. We summarize it as Algorithm 2. By construction, the Chebyshev method is a worst-case optimal first-order method for minimizing quadratics whose spectrum lies in [µ,L]. Surprisingly, its iteration structure is simple and intuitive: it involves a gradient descent step with variable stepsize 4δ /(L µ), k − combined with a variable momentum term. Asymptotically in k, the Chebyshev method converges to Polyak’s Heavy-ball method, is defined by 4 (√L √µ)2 xk = xk 1 f(xk 1)+ − (xk 1 xk 2). − − (√L + √µ)2 ∇ − (√L + √µ)2 − − −

To see this, it suffices to compute the limit of δk, written as δ , solving ∞ 1 δ = ∞ (L+µ) 2 L µ δ − − ∞ 12 Chebyshev Acceleration

Algorithm 2 Chebyshev method (see (Gutknecht and Röllin, 2002, Algorithm 1) for a more stable implementation)

Input: x0, strong convexity constant µ, smoothness constant L. 1: Set δ = 1, x = x 2 f(x ). 0 − 1 0 − L+µ ∇ 0 2: for k = 1,...,kmax do 1 3: Set δk = (L+µ) , 2 δ − L−µ − k 1 4δk 2δk(L+µ) 4: xk := xk 1 L µ f(xk 1)+ 1 L µ (xk 2 xk 1). − − − ∇ − − − − − − 5: end for  

to get √L √µ δ = − . (2.15) ∞ √L + √µ We obtain Polyak’s heavy-ball method by replacing δk by δ in Algorithm 2. ∞ 2.3.3 Rate of convergence The rate of convergence of the Chebyshev method is bounded as in (2.5), with [µ,L] [µ,L] xk x⋆ 2 Ck (H)(x0 x⋆) 2 x0 x⋆ 2 max Ck (x) . (2.16) k − k ≤ k − k ≤ k − k x [µ,L] | | ∈ The maximum value is given by evaluating the polynomial at one of the extremities of the interval (Mason and Handscomb, 2002, Chap.2), i.e, [µ,L] [µ,L] max Ck (x) = Ck (L). x [µ,L] | | ∈ Using successively (2.13) and (2.12), 1 1 C[µ,L](L) = = k [µ,L] L+µ | | k t (0) cosh k acosh L µ |T | − We get the convergence rate of the Chebyshev  method, as stated in the following theorem. Theorem 2.1. Suppose we are solving the quadratic minimization problem (2.1) with H S such that µI H LI. Among all first-order methods satisfying (2.8), the ∈ n   Chebyshev algorithm is optimal, and its rate of convergence is given by L 2 µ + 1 x x x x where ξ = . (2.17) k ⋆ 2 k k 0 ⋆ 2 q k − k ≤ ξ + ξ− k − k L 1 µ − Proof. The optimality of Chebyshev’s method follows directlyq from (2.10), whose solution is given by the shifted Chebyshev polynomial (2.13). It remains to bound (2.16), i.e. [µ,L] xk x⋆ 2 x0 x⋆ 2 max Ck (x) k − k ≤ k − k x [µ,L] | | ∈ 1 = x x . k 0 − ⋆k2 L+µ cosh k acosh L µ −    2.3. The Chebyshev Method 13

First, we evaluate the acosh term, for which

L + µ L + µ L + µ 2 acosh = ln + 1 , L µ L µ s L µ −   −  −  −   L  µ + 1 = ln (ξ) , where ξ = . q L 1 µ − q After plugging this result in the cosh, 1 2 2 = k ln(ξ) k ln(ξ) = k k , cosh (k ln (ξ)) e + e− ξ + ξ− which gives the desired result.

This rate of convergence may be hard to compare with gradient descent due to a more k complex expression. However, neglecting the denominator term ξ− , we can simplify it as follows k √κ 1 xk x⋆ 2 2 − x0 x⋆ 2. k − k ≤ √κ + 1 ! k − k This matches the rate of Polyak’s Heavy-Ball method, and is better than that of gradient descent in (2.5), which reads

κ 1 k x x 2 − x x . k ⋆ 2 κ + 1 0 ⋆ 2 k − k ≤   k − k We summarize this in the following corollary, comparing the number of iterations required to reach a target accuracy ǫ.

Corollary 2.2. For fixed µ and L, to ensure

x x ǫ, k k − ⋆k2 ≤ we need

L x0 x⋆ 2 • k 2 log k − k iterations of Gradient descent, ≥ µ ǫ   L x0 x⋆ 2 • k 2 log 2 k − k iterations of the Polyak Heavy-Ball method (i.e., Algo- ≥ µ ǫ rithmq2 after replacing δk by δ , see (2.15)). ∞ Proof. For gradient descent, a sufficient condition on the number of iteration to reach an ǫ accuracy reads

κ 1 k x x − x x ǫ. k ⋆ 2 κ + 1 0 ⋆ 2 k − k ≤   k − k ≤ 14 Chebyshev Acceleration

Taking logs on both sides, assuming x x >ǫ, we get k 0 − ⋆k x0 x⋆ 2 log k −ǫ k k ≥  κ 1  log κ−+1   1 x 1 However, log 1+−x > 2x . This means the condition above can be strengthened by con- sidering   x x k 2κ log k 0 − ⋆k2 ≥  ǫ  This gives the desired result for gradient descent. With the same approach, we also get the result for the Chebyshev algorithm.

This corollary shows in particular that the Chebyshev method can be √κ faster than gradient descent. This means a factor 100 speedup in problems with a (reasonable) condition number of 104, which is very significant. Furthermore, when the dimension of the ambient space is large, and without further assumptions on the spectrum of H, these worst-case guarantees on Chebyshev’s method are essentially unimprovable, see for example (Nemirovsky and Yudin, 1983a; Nemirovsky, 1992; Nesterov, 2003). The methods presented in this chapter are worst-case optimal. this means they are optimal w.r.t. any choice of H, as long as µI H LI. Recent work analyses instead   optimal methods w.r.t. random matrices H, see for instance (Pedregosa and Scieur, 2020; Lacotte and Pilanci, 2020). Also, Scieur and Pedregosa (2020) shows that, as for the Chebyshev algorithm, all these optimal methods in the average-case converge asymptotically (in the number of iterations) to Polyak’s Heavy-ball scheme. However, to achieve this optimal rate of convergence, the Chebyshev method requires knowledge of the constants µ and L. While L can be estimated easily (since L corresponds to the largest eigenvalue of H), the constant µ is usually more difficult to bound. We will now introduce another nonlinear acceleration technique, known as Anderson acceleration, which achieves the same convergence rate as Chebyshev’s method, but runs without assuming knowledge of µ. 3

Nonlinear Acceleration

In what follows, we see that the main argument used in the Chebyshev method can be adapted beyond quadratic problems. This follows a pattern that is known in numerical analysis as nonlinear acceleration and seeks to accelerate convergence of sequences by extrapolation using nonlinear averages. These methods are known under various names, starting with Aitken’s ∆2 (Aitken, 1927), Wynn’s epsilon algorithm (Wynn, 1956) or Anderson Acceleration (Anderson, 1965) (a survey of these techniques can be found in (Sidi et al., 1986)). These results can be transposed to the optimization setting where a regularization argument produces convergence bounds. In this chapter, we use as the Euclidean norm for vectors, and as the ℓ operator k · k 2 norm for matrices.

3.1 Introduction

We focus again on a generic unconstrained convex minimization problem written as

minimize f(x) (3.1)

in the variable x Rd, where f is twice continuously differentiable in a neighborhood ∈ of its minimizers x⋆. In what follows, some of the ideas behind Chebyshev acceleration, developed on quadratic models in Chapter 2, are adapted to arbitrary convex minimiza- tion problems. Quite naturally, this stems from a local quadratic approximation of the objective function f written as

f(x)= f + (x x )T H(x x )/2+ o( x x 2), (3.2) ⋆ − ⋆ − ⋆ k − ⋆k where H = 2f(x ) S (the set of symmetric d d matrices) is the Hessian matrix ∇ ⋆ ∈ d × of f at x , satisfying µI H LI. Problem (3.2) can be written in the format of ⋆  

15 16 Nonlinear Acceleration

problem (2.1) in Chapter 2, with

f(x)= xT Hx/2+ bT x + c, where b = Hx , c = f + (x )T Hx /2, − ⋆ ⋆ ⋆ ⋆ but the form in (3.2) simplifies the notations below. We could apply Chebyshev’s method directly to this approximate quadratic model in (3.2), but we can get better weights by adaptively forming the coefficients of the polynomial Pk(H) in Proposition 2.1 to improve convergence speed. The Chebyshev acceleration arguments of Chapter 2 are then used in this case as a wrapper (or outer loop) around existing first-order methods, and form better estimates of the optimum using linear combinations of the sequence of gradients produced by an underlying first- order algorithm. This technique is called nonlinear acceleration and is known under other various names such as Anderson or minimum polynomial acceleration (see §3.6 for more references). More concretely, assume we run a first-order method of the form

x x + span f(x ), f(x ), ..., f(x ) , (3.3) k+1 ∈ 0 {∇ 0 ∇ 1 ∇ k } producing a sequence of gradients

f(x ), f(x ),..., f(x ). ∇ 0 ∇ 1 ∇ k Nonlinear acceleration approximates a minimizer x⋆ using a linear combination of the iterates x by minimizing the norm of the gradient, since f(x ) = 0, solving for i ∇ ⋆ k 2 c⋆ = argmin f cixi . (3.4) 1T c=1 ∇ ! i=0 X While we linearly combine the iterates x ’s, the coefficients c depend on f and the x , i i ∇ i hence the term nonlinear acceleration.

3.2 Quadratic Case

Solving (3.4) is of course just as hard as solving (3.1) in general. However, when f is a quadratic function f(x) = (x x )T H(x x )/2+ f , (3.5) − ⋆ − ⋆ ⋆ the gradient in (3.4) is a linear function of x and reads

k k f cixi = H cixi x⋆ . ∇ ! − ! Xi=0 Xi=0 When the coefficients ci sum to one, the gradient of the linear combination becomes a linear combination of gradients, satisfying

k k k k f cixi = H cixi x⋆ = ciH (xi x⋆)= ci f(xi). (3.6) ∇ ! − ! − ∇ Xi=0 Xi=0 Xi=0 Xi=0 3.2. Quadratic Case 17

This means that, in the quadratic case, subproblem (3.4) becomes a simple quadratic program with one equality constraint, involving known quantities (the gradients at each iterate of the underlying first-order method), written as

k 2 c⋆ = argmin ci f (xi) , (3.7) T 1 ∇ c =1 i=0 X

in the variable c Rk+1. In matrix form, the problem reads ∈ 2 c⋆ = argmin Gc , (3.8) cT 1=1 k k

in the variable c Rk+1, where G = [ f(x ),..., f(x )] is the matrix formed by con- ∈ ∇ 0 ∇ k catenating all gradients produced by the underlying first-order method. This quadratic subproblem requires solving a small linear system of (k + 1) equations with an explicit solution computed as

z T 1 c⋆ = k , where z = (G G)− 1. i=0 zi P 3.2.1 Mixing Iterates and Gradients

The previous section shows that it is possible to efficiently minimize the gradient by combining previous iterates xi. However, even if we use the information contained in f(x ) to form the coefficients c , the extrapolation y only belongs to the span ∇ k i k+1 of gradients up to f(xk 1), thus implicitly discarding the information contained in ∇ − f(x ). ∇ k We can incorporate f(x ) by mixing the points x and their gradient as follows, ∇ k i k k y = c x h c f(x ), k+1 i i − k+1 i∇ i Xi=0 Xi=0 where hk+1 is the mixing parameter, an can be seen as the “step size” of the nonlin- ear acceleration algorithm. The mixing corresponds to a simple gradient step at the k extrapolated point i=0 cixi. In fact, using equation (3.6), we obtain

P k k yk+1 = cixi hk+1 f cixi . − ∇ ! Xi=0 Xi=0 which gives a simple strategy slightly improving the rate of convergence of nonlinear acceleration using f(x ). In practice, the mixing parameter h is not necessary outside ∇ k of certain contexts (see §“online acceleration” in Section 3.5.1).

3.2.2 Nonlinear Acceleration Algorithm

All together, we obtain the nonlinear acceleration method described as Algorithm 3. 18 Nonlinear Acceleration

Algorithm 3 Nonlinear Acceleration

Input: Sequence of pairs (xi, f(xi)) i=0...k ; step size h { ∇ } T 1: Form the matrix G = [ f(x0), ..., f(xk)], and compute G G. ∇ T ∇1 z 2: Solve the linear system z = (G G)− 1, and compute c = zT 1 . Output: Return the extrapolation k c (x h f(x )). i=0 i i − ∇ i P

The complexity of this algorithm is O(dk2 + k3) where k is the number of iterations and d the ambient dimension. The first term comes from the matrix- in step 1, the second term comes from inverting a (k + 1) (k + 1) matrix in step 2. The × length k of the sequence is typically much smaller than d, thus the complexity is roughly O(dk2). If an extrapolation step is computed each time a new gradient is inserted, the per iteration complexity can be reduced up to O(dk), and computing the matrix GT G and the coefficients c⋆ using low-rank updates.

3.2.3 Convergence Bounds

The accuracy of nonlinear acceleration can be quantified when the sequence of points xi is generated by a first-order method, under mild assumptions. The following ensures that the matrix GT G is invertible. Assumption 1. At iteration k of 3.3, the dimension of the space

x + span f(x ),...,f(x ) 0 {∇ 0 k }

is k + 1. Under this assumption, the next theorem shows that Algorithm 1 is at least as efficient as Chebyshev acceleration. We first show that the solution of (3.7) combined with the latest gradient, allows obtaining guarantees on the norm of the gradient at the extrapolated point.

Theorem 3.1. Consider the quadratic function f described in (3.5). Let xi be gener- ated by a first-order method (3.3), satisfying Assumption 1. Let yk+1 be the output of Algorithm 3 on the sequence (x , f(x )) , then { i ∇ i }i=0...k

k f(yk+1) I hH min ci f(xi) , (3.9) k∇ k ≤ k − k cT 1=1 ∇ i=0 X

where I hH max 1 hµ , 1 hL . k − k≤ {| − | | − |} 3.2. Quadratic Case 19

Proof. We have the following inequalities

k f(yk+1) = f ci(xi h f(xi)) k∇ k ∇ − ∇ ! i=0 X k

= H ci(I hH)(xi x⋆) − − ! i=0 X k

I hH ciH(xi x⋆) ≤ k − k − ! i=0 X k

= I hH ci f(xi) . k − k ∇ i=0 X and µI H LI yields the desired result.  

This result shows that, once the last step h is fixed, the rate of convergence is es- sentially controlled by the optimal value of a minimization problem. This value can be bounded using a Chebyshev argument, and the next proposition thus connects the rate of convergence of Nonlinear Acceleration with the maximum of Chebyshev polynomials. Unlike Chebyshev acceleration, whose coefficients are predetermined and minimize the worst-case rate of convergence, the nonlinear acceleration technique adaptively looks for the best coefficients, and is thus at least as fast as Chebyshev.

Proposition 3.1. Consider the quadratic function f described in (3.5). Let xi be gener- ated by a first-order method (3.3), satisfying Assumption 1. Let yk+1 be the output of Algorithm 3 on the sequence (x , f(x )) . In such case, Nonlinear Acceleration { i ∇ i }i=0...k finds the best polynomial, satisfying

k min ci f(xi) = min P (H) f(x0) T c 1=1 ∇ P k,P (0)=1 k ∇ k i=0 ∈P X where is the set of polynomials of degree at most k. This implies Pk k L 2 µ + 1 min ci f(xi) f(x0) , ξ = . T k k q c 1=1 ∇ ≤ ξ + ξ− k∇ k L i=0 µ 1 X − q Proof. We start with the definition of first-order method, yielding iterates xi such that xi x0 + span f(x0), ..., f(xi 1) . ∈ {∇ ∇ − } From Section 2.3, Proposition 2.1, if xi+1 follows the previous recursion (for a quadratic function), then xi reads

x = x + P (H)(x x ), P , P (0) = 1. i ⋆ i 0 − ⋆ i ∈ Pi i 20 Nonlinear Acceleration

The same idea holds for its gradient,

f(x ) = H(x x )= HP (H)(x x )= P (H)(H(x x )) ∇ i i − ⋆ i 0 − ⋆ i 0 − ⋆ = P (H)( f(x )). (3.10) i ∇ 0 We inject this expression in the objective of Algorithm 3.7,

n 2 n 2 min ci f(xi) = min ciPi(H) f(x0) . (3.11) cT 1=1 ∇ cT 1=1 ! ∇ i=0 i=0 X X

By Assumption 1, the dimension of the span at iteration k is k + 1. This means all gradi- ents are linearly independent. Therefore, because of Equation (3.10), the polynomials Pi are also linearly independent, and hence Pi i=0...k is a basis for the space k. Finally, T { } P because Pi(0) = 1 and c 1 = 1, we can rephrase the objective (3.11) as

k min ciPi(H) f(x0) = min P (H) f(x0) . T c 1=1 ! ∇ P k:P (0)=1 k ∇ k i=0 ∈P X

A convergence rate is thus given by k f(yk+1) I hH ci f(xi) k∇ k ≤ k − k ∇ i=0 X = I hH min P (H) f(x0) . k − k P k:P (0)=1 k ∇ k ∈P Because the shifted Chebyshev polynomial from Theorem 2.1 is a feasible solution of this last minimization problem, we can use it to bound the last term.

In short, nonlinear acceleration adaptively looks for the best combination of previous iterates given the information stored in previous gradients, as opposed to Chebyshev acceleration that uses the same worst-case optimal polynomial in all cases. Moreover, Algorithm 3 does not require knowledge of smoothness and strong convexity parameters. Note however that the iteration complexity of nonlinear acceleration grows with the number of iterates, so it is typically only used on a limited number of steps.

3.3 The Non-Quadratic Case: Regularized Nonlinear Acceleration

Nonlinear acceleration suffers some serious drawbacks when used outside the restricted setting of quadratic functions. In fact, vanilla nonlinear acceleration is highly numerically unstable. The origin of this problem comes from the conditioning of the matrix GT G, used to compute the coefficients ci’s. To illustrate this statement, suppose we are given a noisy sequence of gradients in Algorithm 3,

Nonlinear Acceleration( (x , f(x )+ e ); . . . ; (x , f(x )+ e ) , h), { 0 ∇ 0 0 k ∇ k k } 3.3. The Non-Quadratic Case: Regularized Nonlinear Acceleration 21 where e ǫ. Scieur et al. (2016, Proposition 3.1) show that the (relative) distance k ik ≤ betweenc ˜ (the coefficients computed using the noisy sequence above) and its noise-free version c is bounded by

c c˜ T 1 k − k O E ((G + E) (G + E))− . (3.12) c ≤ k kk k k k   where E , [e0,...,ek] is the noise matrix. In this bound, the perturbation impacts the solution proportionally to its norm and to the conditioning of G + E. Unfortunately, even for small perturbations, the condition number of G + E and the norm of the vector c are usually huge. In fact, Scieur et al. (2016) show that G has a Krylov matrix structure, which is notoriously poorly conditioned (Tyrtyshnikov, 1994). This means even very small perturbation E in (3.12) can have a significant impact on performance. We briefly illustrate the link between G and Krylov matrices. Consider for simplicity the gradient descent algorithm with step size γ on a quadratic function. Equation (2.3) from Section 2.3 shows that the iterates follow the recursion

x x = (1 γH)k(x x ) f(x ) = (1 γH)k f(x ). k+1 − ⋆ − 0 − ⋆ ⇔ ∇ k+1 − ∇ 0 Replacing gradients in the matrix G by these expressions give

G = f(x ), (I γH) f(x ), (I γH)2 f(x ), ... , ∇ 0 − ∇ 0 − ∇ 0 h i which shows that G is in fact a Krylov matrix (a Krylov matrix K associated to a matrix A and vector v is defined as M = [v, Av, A2v,...]). Figure 3.1 shows the norm of c and the condition number of GT G when nonlinear acceleration is used to accelerate gradient descent, fast gradient descent (Nesterov, 2013), and itself (in the online setting, algorithm (3.23)). We see from 3.1a that after only 3 iterations, the system is considered as singular (i.e., the conditioning is larger than 1016). To stabilize the method, it is common to regularize the linear system. The resulting algorithm is often referred to as regularized nonlinear acceleration (RNA) (Scieur et al., 2016). The following subsection is devoted to the theoretical properties of this method.

3.3.1 Regularized Nonlinear Acceleration Regularized nonlinear acceleration consists in using Algorithm (3) with a regularization term. In short, the base operation used by RNA solves for 2 Gc 2 argmin k k2 + λ c cref (3.13) T G k − k c 1=1 k k in the variable c Rk, where c is the reference vector of coefficients, and that regular- ∈ ref ization forces c to be close to cref. Ideally, cref should sum to one. For instance, setting 22 Nonlinear Acceleration

1010

1020

105 1010

100 100 0 2 4 6 8 10 2 4 6 8 10

(a) (b)

Figure 3.1: Illustration of the sensitivity of nonlinear acceleration, when applying nonlinear acceleration on gradient descent, Nesterov algorithm (see later in section 4) and using its online variant (introduced later, see equation 3.23) to minimize a random quadratic function. Figure 3.1a the condition number of the matrix GT G, that grows exponentially with its size (the plateau at the end it caused by numerical errors). Figure 3.1b, the norm of the vector of coefficients c, which grows quickly over time.

this vector 1/k makes c closer to averaging. Another possibility is cref = [0k 1, 1], which − puts more weight on the last iterate. The division by G 2 is for scaling purpose only, k k which makes λ unit-free.

Algorithm 4 Regularized Nonlinear Acceleration Input: Sequence of pairs (x , f(x )) , step size h, regularization term λ, refer- { i ∇ i }i=0...k ence vector cref. T 1: Form G = [ f(x ), ..., f(x )], compute = G G . ∇ 0 ∇ k G GT G 1 k k 2: Solve the linear system w = λ( + λI)− cref. G 1 3: Solve the linear system z = ( + λI)− 1. G (1 wT 1) 4: Compute the coefficients c = w + z −zT 1 Output: Return the extrapolation k c (x h f(x )). i=0 i i − ∇ i P The resulting method is provided in Algorithm 4, and is slightly more complicated than its unregularized counterpart. In the special case where cref = 0, the procedure simplifies to Algorithm 5. Because of the regularization term, the output c of Algorithm 4 is less sensitive to noise. The next section introduces the concept of perturbed quadratic gradient, and illustrates that regularization is crucial in some applications.

3.3.2 Perturbed Quadratic Gradient and Non-Quadratic Problems

This section introduces the model of perturbed quadratic gradients. We first focus on deterministic non-quadratic function f(x), with gradient f(x). ∇ 3.3. The Non-Quadratic Case: Regularized Nonlinear Acceleration 23

Algorithm 5 Regularized Nonlinear Acceleration (Simplified)

Input: See Algorithm 4. Ensure that cref = 1/k. T 1: Form G = [ f(x ), ..., f(x )], compute = G G . ∇ 0 ∇ k G GT G 1 k k 2: Solve the linear system z = ( + λI)− 1/k. G z 3: Compute the coefficients c = zT 1 . Output: Return the extrapolation k c (x h f(x )). i=0 i i − ∇ i P

As before, the iterates xi come from a first-order algorithm, i.e.

x x + span f(x ), f(x ), ..., f(x ) . k+1 ∈ 0 {∇ 0 ∇ 1 ∇ k } However, f(x) is no longer the gradient of a quadratic function. Instead, the gradient ∇ contains some error term e(x), and reads as in (3.2)

f(x)= H(x x )+ e(x), (3.14) ∇ − ⋆ for some matrix H such that µI H LI. We detail two scenarios under which this   decomposition holds locally.

Example 1: Twice-differentiable functions.

Consider the problem min f(x), x in the variable x Rd, for f a twice-differentiable function. For any x around its optimal ∈ point x⋆, the function satisfies

f(x)= f + f(x )(x x ) + (x x )T 2f(x )(x x )/2+ O( x x 3). ⋆ ∇ ⋆ − ⋆ − ⋆ ∇ ⋆ − ⋆ k − ⋆k =0

This means that f|(x){z can} be approximated by the quadratic function (3.14), with H = 2f(x ). Using the same technique, its gradient reads ∇ ⋆ f(x)= f(x ) + 2f(x )(x x )+ e(x)= H(x x )+ e(x), ∇ ∇ ⋆ ∇ ⋆ − ⋆ − ⋆ =0 where e(x) is the first-order| {z } Taylor remainder of the gradient. Thus, minimizing a non- quadratic function is equivalent to minimizing a perturbed quadratic one, whose error on the gradient is of second order and satisfies

e(x )= f(x ) H(x x ) e(x ) = O( x x 2) . k ∇ k − k − ⋆ ⇒ k k k k k − ⋆k   where H = 2f(x ). ∇ ⋆ 24 Nonlinear Acceleration

Example 2: Stochastic iterations. In the case where the gradient is corrupted by stochastic noise, the iterate xk+1 reads

x x + span ˜ f(x ), ˜ f(x ), ..., ˜ f(x ) . k+1 ∈ 0 {∇ 0 ∇ 1 ∇ k } where ˜ f(x ) is a stochastic estimate of the true gradient f(x ) that satisfies ∇ i ∇ i E[ ˜ f(x )] = f(x ), E[ ˜ f(x ) f(x ) 2] σ2. ∇ i ∇ i k∇ i −∇ i k ≤ Here the gradient f(x) is a perturbed version of the gradient of a quadratic function, ∇ with ˜ f(x )= H(x x )+ e , ∇ i i − ⋆ i where ei is now the sum of a stochastic noise and a Taylor remainder.

3.3.3 Convergence bound Using a perturbation argument, it is possible to derive a convergence guarantee for regularized nonlinear acceleration. We state here a simplified version of Scieur et al. (2018, Theorem 3.2), describing how regularization balances acceleration and stability in the algorithm. For simplicity, we state the next theorem specifically for Algorithm 5, but one can extend the results for any cref. We will discuss convergence rates in more details in what follows.

Theorem 3.2. Let xextr be the output of Algorithm 5 used on the sequence

(x , f(x )),..., (x , f(x )) , { 0 ∇ 0 k ∇ k } where f(x) follows (3.14) with e(x ) ǫ. Then, ∇ k i k≤

extr  1  H(x x⋆) I hH C H(x0 x⋆) + O 1+ ǫ k − k ≤ k − k  k − k r λ !  Acceleration     Stability  | {z }  where C is a constant, corresponding to the maximum value| in the{z interval} [µ,L] of the regularized Chebyshev polynomial,

C = argmin max P 2(x)+ λ P 2, (3.15) P ,P (0)=1 x [µ,L] kGkk k ∈Pk ∈ in which P is the norm of the coefficients of the polynomial. k k This theorem shows that regularization, as expected, stabilizes the algorithm while slowing down the convergence rate. The regularized Chebyshev polynomial interpolates [µ,L] the classical shifted Chebyshev polynomial Ck from (2.13), and the polynomial whose coefficients are defined by 1/k, i.e., the polynomial that averages the iterates xk. By construction, its maximum value is always higher than that of the Chebyshev polynomial, 3.3. The Non-Quadratic Case: Regularized Nonlinear Acceleration 25

but the norm of its coefficients is smaller. Unfortunately, there is for now no known explicit expression of the regularized Chebyshev polynomial. However, its value can be computed numerically, using sum-of-squares and its value can be upper bounded using proxy Chebyshev polynomials (Barré et al., 2020b). Notice that, for stochastic noise, Theorem (3.2) still holds but in expectation (Scieur et al., 2017a). This interpolation between Chebyshev coefficients and simple averaging of iterates is also natural in the context of noisy iterations. When the noise is negligible, a small regularization parameter combines the iterates xi using classical Chebyshev weights. In cases where the noise is more significant, a larger regularization makes the coefficients c closer to the average 1/k which reduces the noise (i.e., the Stability term), but increases the value of the Acceleration term, and therefore reduces the overall acceleration effect.

3.3.4 Asymptotically Optimal Rate of Convergence

We quickly discuss the behavior of the RNA algorithm when the initialisation x0 is getting closer to the solution x⋆. In particular, the next proposition shows that if the perturbation magnitude ǫ decreases faster than H(x x ) , the parameter λ can be k 0 − ⋆ k adjusted to ensure an asymptotically optimal rate of convergence, i.e. a rate comparable to that of the Chebyshev method on quadratics (Nemirovsky and Yudin, 1983c). The intuition is fairly simple: as x0 gets closer to x⋆, the Taylor expansion of the function f gets closer to a quadratic. In this setting, the rate of convergence of RNA converges to the one of Theorem 3.1.

Proposition 3.2. Let xextr be the output of Algorithm 5 used on the sequence

(x , f(x )),..., (x , f(x )) , { 0 ∇ 0 k ∇ k } where f(x) follows (3.14) with e(x ) ǫ. Assume the error magnitude ǫ satisfies ∇ k i k≤ ǫ O( x x α), α> 1. ≤ k 0 − ⋆k Then, if λ = O( x x s), where 0

Proof. We start with the result from Theorem 3.2, and we divide both sides by H(x x ) , k 0 − ⋆ k 1 ǫ H(xextr x ) / H(x x ) I hH C H(x x ) + O 1+ . k − ⋆ k k 0 − ⋆ k ≤ k − k k 0 − ⋆ k λ H(x x ) r k 0 − ⋆ k !! Let λ = H(x x ) s and ǫ = H(x x ) α. k 0 − ⋆ k k 0 − ⋆ k 26 Nonlinear Acceleration

Therefore, H(xextr x ) k − ⋆ k H(x x ) k 0 − ⋆ k 2(α 1) 2(α 1) H(x0 x⋆) − I hH C H(x0 x⋆) + O H(x0 x⋆) − + k − k . ≤ k − k  k − k sk − k H(x x ) s  k 0 − ⋆ k    When x x , we have 0 → ⋆ 2(α 1) H(x x ) − 0 (since α> 1), k 0 − ⋆ k → 2(α 1) s H(x x ) − − 0 (since s< 2(α 1)). k 0 − ⋆ k → − Finally, in (3.15), the regularization parameter λ = O( H(x x ) 2) 0. There- kGk k 0 − ⋆ k → fore, the regularized Chebyshev in (3.15) converges to the regular (shifted) Chebyshev polynomial.

In short, the previous theorem shows that asymptotically optimal rate of convergence is achieved as soon as λ decreases faster than x x , but slower than ǫ2. This k 0 − ⋆k is possible, for instance, when accelerating twice differentiable functions with gradient descent. Indeed, in this setting, the error decreases as O( x x 2), and therefore k 0 − ⋆k α = 2 > 1 (see §3.3.2). Note that this may not always be possible, in particular if ǫ x x . This ≥ k 0 − ⋆k happens, for instance, when accelerating fixed step stochastic gradient descent. In this case ǫ = O(1) and asymptotic acceleration is not possible. However, this is not a surprise, as the fixed step version of SGD itself does not converge to the optimum.

3.4 Constrained/Proximal Nonlinear Acceleration

So far, we discussed the acceleration of optimization algorithms solving unconstrained problems. Unfortunately, extending these results to the constrained case is not straight- forward. For instance, when minimizing a function f over a restricted domain Rd, there C ∈ is no known mechanism to ensure that the output of Algorithm 3 (or its regularized variant) remains inside . Moreover, projecting the extrapolated point onto the set C C does not guarantee an improvement in accuracy. Even worse, our definition of first-order methods (3.3) does not handle algorithms pos- sibly involving projections: after a projection step, the resulting iterate might be outside the span of previous gradients. Therefore, the whole intuition of nonlinear acceleration developed in previous sections vanishes, as well as its convergence guarantees. Recently, Mai and Johansson (2020) adapted the Anderson Acceleration method to handle a large class of constrained and non-smooth problems, using Clarke’s generalized Jacobian (Clarke, 1990) and semi-smoothness (Mifflin, 1977; Qi and Sun, 1993). Before 3.4. Constrained/Proximal Nonlinear Acceleration 27

providing the technical details, the next section introduces proximal gradient descent, as well as some notations.

3.4.1 Composite Optimization and Proximal Gradient Descent

Consider the composite optimization problem

min f(x)+ φ(x), (3.16) x Rd ∈ where f(x) is a smooth strongly convex function and φ(x) is convex, potentially non- smooth, with tractable proximal operator

, 1 2 proxhφ(x) argmin hφ(y)+ x y . (3.17) y 2  k − k  The composite optimization problem (3.16) can be easily solved using proximal gradient descent, whose recursion reads

x = prox (x h f(x )). (3.18) k+1 hφ k − ∇ k We skip most of the details on proximal algorithms and refer the reader to e.g. (Parikh and Boyd, 2014) for an excellent survey. The proximal step corresponds to an implicit update. That is, using the first-order optimality condition on (3.17), 1 0 ∂ φ(y)+ x y 2 y x h∂φ(y), 2h ∈  k − k  ⇔ ∈ − where ∂ stands for the sub-gradient. Injecting this property in the proximal gradient step (3.18) gives the recursion

x = x h ( f(x )+ ∂φ(x )) . (3.19) k+1 k − ∇ k k+1 The term implicit comes from the fact that the gradient of the “proximal” function is evaluated at xk+1. Many constrained optimization problems have a composite structure as in (3.16). For example, the constrained optimization problem

min f(x), (3.20) x ∈C where Rd is a convex set. We can form an equivalent problem of the form (3.16), by C ∈ letting φ(x) be the indicator function of the set, i.e.,

+ if x , φ(x)= ∞ 6∈ C 0 otherwise.  This way, the constrained optimization problem (3.20) is cast into the composite model (3.16). Moreover, the proximal operator (3.17) corresponds to the orthogonal projection 28 Nonlinear Acceleration

onto the set . Note that it does not mean that the proximal operation is easy to compute, C as it might largely depend on the set. Another typical example is the ℓ regularization, where φ(x) = x . In this case, 1 k k1 the proximal operator is called soft thresholding, with

0 if x [ h, h], ∈ − proxh 1 (x)= x h if x > h, k·k  − x + h if x < h.  This operator is commonly used in the LASSO problem, for exploiting the sparsity- inducing properties of ℓ1 regularization.

3.4.2 Nonlinear Acceleration on Proximal Gradient

In the previous sections, we considered first-order methods (3.3), where xk+1 belongs to the span of previous gradients f(x ). Unfortunately, proximal gradient descent (3.18) ∇ i does not satisfy this requirement due to the proximal operator. Indeed, after expanding (3.19), we obtain

x x + span f(x )+ ∂φ(x ),..., f(x )+ ∂φ(x ) , k+1 ∈ 0 {∇ 0 1 ∇ k k+1 } which does not satisfy Definition (3.3). We can however consider another sequence of points zk, satisfying

z x + span f(x )+ ∂φ(x ), ..., f(x )+ ∂φ(x ) , (3.21) k+1 ∈ 0 {∇ 0 0 ∇ k k } such that

xk+1 = proxhφ(zk+1). In the case of proximal gradient descent, this can be achieved by considering the sequence

z = x h f(x ), k+1 k − ∇ k x = prox (z ).  k+1 hφ k+1 The modified sequence is simply an intermediate iterate between the proximal operator and the gradient step, whose iterates zk follow

z = prox (z ) h f(prox (z )). k+1 hφ k − ∇ hφ k This simple trick allows the derivation of convergence bounds for proximal nonlinear acceleration, with little changes to the algorithm. In particular, Mai and Johansson (2020) show that using Algorithm (6) in the presence of a proximal operator does not change the convergence analysis, assuming that the function φ is twice epi-differentiable, and that φ is twice-differentiable at the solution x⋆. We refer the reader to Rockafellar and Wets (1998, Section 13) for a comprehensive treatment of epi-differentiability. 3.5. Heuristics for Speeding-up Nonlinear Acceleration 29

Algorithm 6 Online Regularized Nonlinear Acceleration with Proximal operator

Input: Step size h, regularization term λ, reference vector cref. 1: for k = 0 . . . maxIter do

{// Perform one proximal step} 2: Compute g = x h f(z ), and g = r z k k − ∇ k k k − k {// Compute the coefficients of Nonlinear Acceleration} T 3: Form G = [r ,...,r ], compute = G G . 0 k G GT G k 1 k 4: Solve the linear system w = λ( + λI)− cref. G 1 5: Solve the linear system z = ( + λI)− 1. G (1 wT 1) 6: Compute the coefficients c = w + z −zT 1 .

{// Compute the extrapolation} k 7: zk+1 = i=0 cigi. 8: xk+1 =P proxhφ(zk+1). 9: end for Output: Return xk+1.

3.5 Heuristics for Speeding-up Nonlinear Acceleration

In what follows, we discuss a few strategies to improve the performance of nonlinear acceleration. Among the proposed heuristics, the online variant combined with limited memory is generally the most popular and efficient in practice.

3.5.1 Offline and Online Acceleration For now, we mostly described nonlinear acceleration in Algorithm (3) as a refinement (post processing) step, but the scheme can either run in parallel, without interfering with the original optimization process (i.e. offline), or accelerate its own iterates (i.e. online nonlinear acceleration, or Anderson Acceleration) as in Algorithm 6. We discuss this in more details below.

Offline Acceleration When run in offline mode, the algorithm keeps track of the iterates xk and yk independently, as follows,

x = FirstOrderMethod(x0, f(x0),..., f(x ) , ...) k+1 {∇ ∇ k } (3.22) y = NonlinAcc( (x , f(x )); . . . ; (x , f(x )) , h).  k+1 { 0 ∇ 0 k ∇ k } In this setting, nonlinear acceleration is used as a post-processing step on the iterates after the first-order method outputs its final iterate xk, thus the xk are independent of the yk. Since the strategy does not interfere with the original algorithm, this ensures that 30 Nonlinear Acceleration

the convergence rate of the original scheme holds. The parallelization of this technique is straightforward, but its numerical performance is typically worse than that of online nonlinear acceleration.

Online Acceleration Since nonlinear acceleration in Algorithm 3 creates a new point $y_{k+1}$ that belongs to the span of previous gradients, we can in fact consider it as a first-order method. That means we can use nonlinear acceleration recursively, with
\[
x_{k+1} = \mathrm{NonlinAcc}(\{(x_0, \nabla f(x_0)), \ldots, (x_k, \nabla f(x_k))\}, h), \tag{3.23}
\]
under the condition $h \neq 0$. Otherwise, there is no contribution of $\nabla f(x_k)$ in $x_{k+1}$, which breaks Assumption 1. Online nonlinear acceleration usually exhibits better performance compared to the offline version.

3.5.2 Limited-memory acceleration

Despite the asymptotically optimal rate of convergence of nonlinear acceleration, its computational cost per iteration grows with the length of $\{\nabla f(x_0), \ldots, \nabla f(x_k)\}$. As it turns out, in practice the most relevant information is mostly carried by the last iterates, so limited-memory variants are often harmless. More precisely, this means running the extrapolation using only the last $m$ iterates

\[
\{(x_{k-m+1}, \nabla f(x_{k-m+1})), \ldots, (x_k, \nabla f(x_k))\}.
\]

3.5.3 Descent Conditions

The nonlinear acceleration algorithm computes an extrapolated point $\sum_{i=0}^k c_i x_i$ and an extrapolated descent direction $\sum_{i=0}^k c_i \nabla f(x_i)$. Therefore, we can check whether (1) the extrapolated point is better than other iterates $x_i$, and (2) the extrapolated direction is a descent direction. More formally, this means checking
\[
\begin{cases}
f\left(\sum_{i=0}^k c_i x_i\right) < \min_{i \in \{0 \ldots k\}} f(x_i),\\[4pt]
\left\langle \sum_{i=0}^k c_i \nabla f(x_i);\; \nabla f\left(\sum_{i=0}^k c_i x_i\right) \right\rangle > 0,
\end{cases}
\]
and then keeping the extrapolated point if these conditions are met. Otherwise, we discard it and perform a step of the base algorithm without extrapolation. When optimizing quadratics without numerical errors, this descent condition should never discard any acceleration step. However, that may not be the case beyond quadratics, or in the presence of large enough numerical errors.
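A minimal sketch of this safeguard, under the assumption that function and gradient evaluations are available as callables, could look as follows:

```python
import numpy as np

def accept_extrapolation(f, grad_f, xs, c):
    """Check the two descent conditions for the extrapolated point."""
    x_e = sum(ci * xi for ci, xi in zip(c, xs))          # sum_i c_i x_i
    d = sum(ci * grad_f(xi) for ci, xi in zip(c, xs))    # sum_i c_i grad f(x_i)
    improves = f(x_e) < min(f(xi) for xi in xs)          # condition (1)
    is_descent = float(np.dot(d, grad_f(x_e))) > 0.0     # condition (2)
    return improves and is_descent
```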

3.5.4 Line search

Nonlinear acceleration requires picking a step size $h$, whose best theoretical value is $\frac{2}{\mu+L}$. In practice, it is possible to perform a line search on the parameter $h$ by solving
\[
\min_h \; f\left(\sum_{i=0}^k c_i x_i - h \sum_{i=0}^k c_i \nabla f(x_i)\right)
\]

approximately.
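In code, this one-dimensional search can be delegated to a standard scalar minimizer; a sketch using SciPy (the search interval is an arbitrary choice of ours, not prescribed by the method):

```python
from scipy.optimize import minimize_scalar

def line_search_step(f, x_e, d, h_max):
    """Approximately minimize h -> f(x_e - h * d) over [0, h_max],
    where x_e and d are the extrapolated point and direction."""
    res = minimize_scalar(lambda h: f(x_e - h * d),
                          bounds=(0.0, h_max), method="bounded")
    return res.x
```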

3.6 Related Work

Nonlinear acceleration techniques have been extensively studied during the past decades, and excellent reviews can be found in (Smith et al., 1987; Jbilou and Sadok, 1991; Brezinski and Zaglia, 1991; Jbilou and Sadok, 1995; Jbilou and Sadok, 2000; Brezinski, 2001; Brezinski and Redivo–Zaglia, 2019). The first use of acceleration techniques for fixed-point iterations can be traced back to (Gekeler, 1972; Brezinski, 1971; Brezinski, 1970). Many independent works have led to methods similar to those described here. The most classical, and probably closest, one is Anderson acceleration (Anderson, 1965), which corresponds exactly to the online mode of nonlinear acceleration (without regularization). Despite being an old algorithm, there has been renewed interest in its convergence analysis (Walker and Ni, 2011; Toth and Kelley, 2015), thanks to its good empirical performance and strong connection with quasi-Newton methods (Fang and Saad, 2009). Other versions of nonlinear acceleration use different arguments but behave similarly. For instance, minimal polynomial extrapolation (MPE), which uses the properties of the minimal polynomial of a matrix (Cabay and Jackson, 1976), reduced rank extrapolation (RRE), and the Mešina method (Mešina, 1977; Eddy, 1979) are also variants of Anderson acceleration. The properties and equivalences of these approaches were extensively studied during the past decades (Sidi, 1988; Ford and Sidi, 1988; Sidi, 1991; Jbilou and Sadok, 1991; Sidi and Shapira, 1998; Sidi, 2008; Sidi, 2017a; Sidi, 2017b; Brezinski et al., 2018; Brezinski et al., 2020). Unfortunately, these methods hardly extend to nonlinear functions, especially due to conditioning problems (Sidi, 1986; Sidi and Bridger, 1988; Scieur et al., 2016). Recent works nonetheless prove the convergence of such methods, provided we can ensure a good conditioning of the linear system (Sidi, 2019). There are also other classes of nonlinear acceleration algorithms, based on existing algorithms for accelerating the convergence of scalar sequences (Brezinski, 1975). For instance, the topological epsilon vector algorithm (TEA) extends the idea of the scalar $\varepsilon$-algorithm of Wynn (1956) to vectors.

4 Nesterov Acceleration

This chapter presents a systematic interpretation of gradient method acceleration stemming from Nesterov's original work (Nesterov, 1983). The early part of this chapter is devoted to the gradient method and to the "optimized gradient method", due to Drori and Teboulle (2014) and Kim and Fessler (2016), since the motivations and ideas underlying this method are natural, and very similar to those behind the introduction of Chebyshev methods for optimizing quadratic functions (see Chapter 2). Furthermore, the optimized gradient method has a relatively simple format and proof, and can be used as an inspiration for developing many variants with wider ranges of applications, including Nesterov's early accelerated gradient methods (Nesterov, 1983; Nesterov, 2013) and FISTA (Beck and Teboulle, 2009). Some parts of this chapter are more technical; however, we believe all ideas can be reasonably well understood even when skipping the algebraic proofs.

We start with the base theory and interpretation of acceleration in a simple setting: smooth unconstrained convex minimization in a Euclidean space, as all subsequent developments follow from the exact same template, namely a conic combination of regularity inequalities, with additional ingredients added on a one-by-one basis. A second part is then devoted to methods that take advantage of strong convexity, using the same ideas and algorithmic structures. For later reference, we provide a few different (equivalent) templates for the algorithms, as, when going into more advanced settings, they naturally do not generalize the same way. We then recap and discuss a few of the practical extensions for handling constrained problems, nonsmooth regularization terms, unknown problem parameters/line searches, and non-Euclidean geometries. Notebooks for simpler reproduction of the proofs of this chapter are provided in Section 4.8.


4.1 Introduction

In the first part of this chapter, we consider smooth unconstrained convex minimization problems. This type of problem is a direct extension of unconstrained convex quadratic minimization where the quadratic function has eigenvalues bounded above by some constant. More precisely, consider the simple unconstrained differentiable convex minimization problem

\[
f_\star = \min_{x \in \mathbb{R}^d} f(x), \tag{4.1}
\]
where $f$ is convex with an $L$-Lipschitz gradient (we call such functions convex and $L$-smooth, see definition below), and we assume throughout that there exists a minimizer $x_\star$. The goal of the methods presented below is to find a candidate solution $x$ satisfying $f(x) - f_\star \le \epsilon$ for some $\epsilon > 0$. Depending on the target application, other quality measures, such as guarantees on $\|\nabla f(x)\|$ or $\|x - x_\star\|$, might be preferred, and we refer to Section 4.8, §"Changing the performance measure", for discussions on this topic. We start with gradient descent, and then show that its iteration complexity can be significantly improved, using an acceleration technique due to Nesterov (1983). After presenting the theory for the smooth convex case, we will see how it extends to the smooth strongly convex case, a direct extension of the unconstrained convex quadratic minimization problem where the quadratic function has eigenvalues bounded above and below, by some constants $L$ and $\mu$.

Definition 4.1. Let $0 \le \mu < L \le +\infty$. A continuously differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and $\mu$-strongly convex (denoted $f \in \mathcal{F}_{\mu,L}$) if and only if

• ($L$-smoothness) for all $x, y \in \mathbb{R}^d$, it holds that
\[
f(x) \le f(y) + \langle \nabla f(y); x - y \rangle + \frac{L}{2}\|x - y\|^2, \tag{4.2}
\]

• ($\mu$-strong convexity) for all $x, y \in \mathbb{R}^d$, it holds that
\[
f(x) \ge f(y) + \langle \nabla f(y); x - y \rangle + \frac{\mu}{2}\|x - y\|^2; \tag{4.3}
\]

furthermore, we denote by $q = \frac{\mu}{L}$ the inverse condition number (so $q = \frac{1}{\kappa}$ with $\kappa$ being the usual condition number, used e.g. in Chapter 2) of functions in the class $\mathcal{F}_{\mu,L}$.

Notations. We naturally use the notation $\mathcal{F}_{0,L}$ for the set of smooth convex functions. By extension, we use $\mathcal{F}_{0,\infty}$ for the set of (possibly non-differentiable) proper, closed, and convex functions (i.e., convex functions whose epigraphs are non-empty closed convex sets). Finally, we denote by $\partial f(x)$ the subdifferential of $f$ at $x \in \mathbb{R}^d$, and by $g_f(x) \in \partial f(x)$ a particular subgradient of $f$ at $x$.

Smooth strongly convex functions. Figure 4.1 provides an illustration of the global quadratic upper approximation (with curvature L) on f(.) due to smoothness, and of the global quadratic lower approximation (with curvature µ) on f(.) due to strong convexity.

Figure 4.1: Let $f$ (blue) be a differentiable function. (Right) Smoothness: $f$ is $L$-smooth if and only if it is upper bounded by $f(y) + \langle \nabla f(y); \cdot - y \rangle + \frac{L}{2}\|\cdot - y\|^2$ (dashed, brown) for all $y$. (Left) Strong convexity: $f$ is $\mu$-strongly convex if and only if it is lower bounded by $f(y) + \langle \nabla f(y); \cdot - y \rangle + \frac{\mu}{2}\|\cdot - y\|^2$ (dashed, brown) for all $y$.

A number of inequalities can be written to characterize functions in $\mathcal{F}_{\mu,L}$; see for example (Nesterov, 2003, Theorem 2.1.5). When analyzing methods for minimizing functions in that class, it is crucial to have the right inequalities at our disposal, as worst-case analyses essentially boil down to appropriately combining such inequalities. We provide the most important ones along with their interpretations and proofs in Appendix A. In this chapter, we only use three of them. First, we use the quadratic upper and lower bounds arising in our definition of smooth strongly convex functions, that is, (4.2) and (4.3). For some analyses, however, we need an additional inequality, provided by the following theorem. This inequality, although apparently a bit rough, is often referred to as an interpolation (or extension) inequality (details in Appendix A). It can be shown that worst-case analyses of all first-order methods for minimizing smooth strongly convex functions can be performed using only this inequality, instantiated at the iterates of the method at hand, as well as at an optimizer; this follows from the interpolation/extension property and is established in (Taylor et al., 2017c). Furthermore, its proof is relatively simple, as it only consists in requiring all quadratic lower bounds from (4.3) to be below all quadratic upper bounds from (4.2) (details in Section A.1).

Theorem 4.1. Let $f$ be a continuously differentiable function. $f$ is $L$-smooth and $\mu$-strongly convex (possibly with $\mu = 0$) if and only if for all $x, y \in \mathbb{R}^d$ it holds that
\[
f(x) \ge f(y) + \langle \nabla f(y); x - y \rangle + \frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|^2 + \frac{\mu L}{2(L-\mu)}\left\|x - y - \tfrac{1}{L}(\nabla f(x) - \nabla f(y))\right\|^2. \tag{4.4}
\]
However, as discussed later in Section 4.3.3, this inequality has some flaws when it

comes to extending methods to more general settings than (4.1) (for example in the presence of constraints). Therefore, we still use (4.2) and (4.3) whenever possible. Before moving to the next section, let us mention that both smoothness and strong convexity are strong assumptions. More generic assumptions are discussed in Chapter 6 to get improved rates under weaker assumptions.

4.2 Gradient Method and Potential Functions

In this section, we analyze gradient descent using the concept of potential functions. The resulting proofs are technically simple, although they might not seem to provide any direct intuition about the method at hand. We use the same ideas for analyzing a few improvements over gradient descent, before providing some interpretations underlying this mechanism.

4.2.1 Gradient Descent

The simplest and probably most natural method for minimizing differentiable functions is gradient descent. It is often attributed to Cauchy (1847) and consists in iterating

\[
x_{k+1} = x_k - \gamma_k \nabla f(x_k),
\]
where $\gamma_k$ is some step size. There are many different techniques for picking $\gamma_k$, and the simplest one is to set $\gamma_k = 1/L$, assuming $L$ is known; otherwise, line-search techniques are typically used, see Section 4.7. Our objective now is to assess the number of iterations required by gradient descent to obtain an approximate minimizer $x_k$ of $f$, satisfying $f(x_k) - f_\star \le \epsilon$.

4.2.2 A simple proof mechanism: potential functions

Potential (or Lyapunov) functions are classical tools for proving convergence rates in the first-order literature, and a nice recent review of this topic is given by Bansal and Gupta (2019). For gradient descent, the idea consists in recursively using a simple inequality (proof below)
\[
(k+1)(f(x_{k+1}) - f_\star) + \frac{L}{2}\|x_{k+1} - x_\star\|^2 \le k(f(x_k) - f_\star) + \frac{L}{2}\|x_k - x_\star\|^2,
\]
which is valid for all $f \in \mathcal{F}_{0,L}$ and all $x_k \in \mathbb{R}^d$, when $x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$. In this context, we refer to
\[
\phi_k \stackrel{\Delta}{=} k(f(x_k) - f_\star) + \frac{L}{2}\|x_k - x_\star\|^2
\]
as a potential, and use $\phi_{k+1} \le \phi_k$ as the building block for the worst-case analysis. Once such a potential inequality $\phi_{k+1} \le \phi_k$ is established, a convergence rate can easily be deduced through a recursive argument which yields

\[
N(f(x_N) - f_\star) \le \phi_N \le \phi_{N-1} \le \ldots \le \phi_0 = \frac{L}{2}\|x_0 - x_\star\|^2, \tag{4.5}
\]

hence $f(x_N) - f_\star \le \frac{L}{2N}\|x_0 - x_\star\|^2$. We also conclude that the worst-case accuracy of gradient descent is $O(N^{-1})$, or equivalently that its iteration complexity is $O(\epsilon^{-1})$. Therefore, the main inequality to be proved for this worst-case analysis to work is the potential inequality $\phi_{k+1} \le \phi_k$. In other words, the analysis of $N$ iterations of gradient descent is reduced to the analysis of a single iteration, using an appropriate potential. This kind of approach was already used for example by Nesterov (1983), and many different variants of this potential function can be used to prove convergence of gradient descent and related methods, in similar ways.
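The potential argument is easy to check numerically; the following sketch builds a random least-squares instance (our choice of test problem) and verifies that $\phi_{k+1} \le \phi_k$ along the iterations of gradient descent, as formalized in Theorem 4.2 below:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
L = np.linalg.eigvalsh(A.T @ A).max()              # smoothness constant of f
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)     # minimizer of f
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

x = np.zeros(10)
phi = 0.5 * L * np.linalg.norm(x - x_star) ** 2    # phi_0 (with A_0 = 0)
for k in range(50):
    x = x - grad(x) / L                             # gradient step, gamma = 1/L
    phi_next = (k + 1) * (f(x) - f(x_star)) \
        + 0.5 * L * np.linalg.norm(x - x_star) ** 2
    assert phi_next <= phi + 1e-9                   # potential inequality holds
    phi = phi_next
```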

Theorem 4.2. Let $f$ be an $L$-smooth convex function, $x_\star \in \operatorname{argmin}_x f(x)$, and $k \in \mathbb{N}$. For any $A_k \ge 0$ and $x_k \in \mathbb{R}^d$ it holds that
\[
A_{k+1}(f(x_{k+1}) - f_\star) + \frac{L}{2}\|x_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f_\star) + \frac{L}{2}\|x_k - x_\star\|^2,
\]
with $x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$ and $A_{k+1} = 1 + A_k$.

Proof. The proof consists in performing a weighted sum of the following inequalities:

• convexity of $f$ between $x_k$ and $x_\star$, with weight $\lambda_1 = A_{k+1} - A_k$,
\[
0 \ge f(x_k) - f_\star + \langle \nabla f(x_k); x_\star - x_k \rangle,
\]

• smoothness of $f$ between $x_k$ and $x_{k+1}$, with weight $\lambda_2 = A_{k+1}$,
\[
0 \ge f(x_{k+1}) - f(x_k) - \langle \nabla f(x_k); x_{k+1} - x_k \rangle - \frac{L}{2}\|x_k - x_{k+1}\|^2.
\]

This last inequality is often referred to as the descent lemma, as substituting $x_{k+1}$ allows obtaining $f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2$.

This weighted sum forms a valid inequality:
\[
0 \ge \lambda_1\left[f(x_k) - f_\star + \langle \nabla f(x_k); x_\star - x_k \rangle\right] + \lambda_2\left[f(x_{k+1}) - f(x_k) - \langle \nabla f(x_k); x_{k+1} - x_k \rangle - \frac{L}{2}\|x_k - x_{k+1}\|^2\right].
\]
Using $x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$, this inequality can be rewritten (completing the squares, or simply expanding both expressions and verifying that they match on a term-by-term basis) as follows
\[
\begin{aligned}
0 \ge{}& (A_k + 1)(f(x_{k+1}) - f_\star) + \frac{L}{2}\|x_{k+1} - x_\star\|^2 - A_k(f(x_k) - f_\star) - \frac{L}{2}\|x_k - x_\star\|^2\\
&+ \frac{A_{k+1} - 1}{2L}\|\nabla f(x_k)\|^2 - (A_{k+1} - A_k - 1)\langle \nabla f(x_k); x_k - x_\star \rangle,
\end{aligned}
\]
which can be reorganized and simplified to
\[
\begin{aligned}
(A_k + 1)(f(x_{k+1}) - f_\star) + \frac{L}{2}\|x_{k+1} - x_\star\|^2 \le{}& A_k(f(x_k) - f_\star) + \frac{L}{2}\|x_k - x_\star\|^2 - \frac{A_{k+1} - 1}{2L}\|\nabla f(x_k)\|^2\\
&+ (A_{k+1} - A_k - 1)\langle \nabla f(x_k); x_k - x_\star \rangle\\
\le{}& A_k(f(x_k) - f_\star) + \frac{L}{2}\|x_k - x_\star\|^2,
\end{aligned}
\]
where the last inequality follows from picking $A_{k+1} = A_k + 1$ and neglecting the last residual term $-\frac{A_k}{2L}\|\nabla f(x_k)\|^2$ on the right-hand side (which is nonpositive).

A convergence rate for gradient descent can directly be obtained as a consequence of Theorem 4.2, following the reasoning of (4.5), and the rate corresponds to $f(x_N) - f_\star = O(A_N^{-1}) = O(N^{-1})$. We detail this in the next corollary.

Corollary 4.3. Let $f$ be an $L$-smooth convex function, and $x_\star \in \operatorname{argmin}_x f(x)$. For any $N \in \mathbb{N}$, the iterates of gradient descent with step size $\gamma_0 = \gamma_1 = \ldots = \gamma_N = \frac{1}{L}$ satisfy
\[
f(x_N) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{2N}.
\]
Proof. Following the reasoning of (4.5), we recursively use Theorem 4.2 starting with $A_0 = 0$. That is, we define
\[
\phi_k \stackrel{\Delta}{=} A_k(f(x_k) - f_\star) + \frac{L}{2}\|x_k - x_\star\|^2,
\]
and recursively use the inequality $\phi_{k+1} \le \phi_k$ from Theorem 4.2, with $A_{k+1} = A_k + 1$ and $A_0 = 0$; hence $A_k = k$. We obtain
\[
A_N(f(x_N) - f_\star) \le \phi_N \le \ldots \le \phi_0 = \frac{L}{2}\|x_0 - x_\star\|^2,
\]
resulting in the desired statement

\[
f(x_N) - f_\star \le \frac{L}{2A_N}\|x_0 - x_\star\|^2 = \frac{L}{2N}\|x_0 - x_\star\|^2.
\]

4.2.3 How Conservative is this Worst-case Guarantee?

Before moving to other methods, let us show that this worst-case $O(N^{-1})$ rate of gradient descent can actually be attained on very simple problems, motivating the search for alternate methods with better guarantees. This rate is observed on, e.g., all functions that are nearly linear over large regions. One such common function is the Huber loss

\[
f(x) = \begin{cases}
a_\tau |x| - b_\tau & \text{if } |x| \ge \tau,\\
\frac{L}{2}x^2 & \text{otherwise,}
\end{cases}
\]

(with $a_\tau = L\tau$ and $b_\tau = \frac{L}{2}\tau^2$ to ensure continuity and differentiability of this function). Indeed, on this function, as long as the iterates of gradient descent satisfy $|x_k| \ge \tau$, they behave as if the function were linear, and the gradient on this part is constant. It is therefore relatively easy to explicitly compute all iterates. In particular, picking $\tau = \frac{|x_0|}{2N+1}$, we get that $f(x_N) - f_\star = \frac{L\|x_0 - x_\star\|^2}{2(2N+1)}$, reaching the $O(N^{-1})$ worst-case bound (this example is due to Drori and Teboulle (2014, Theorem 3.2)). Therefore, it appears that the worst-case bound from Corollary 4.3 for gradient descent can only be improved in terms of the constants, but the rate itself is the best possible one for this simple method; see for example (Drori and Teboulle, 2014; Drori, 2014) for the corresponding tight expressions. In the next section, we show that a similar reasoning based on potential functions produces methods with improved iteration complexity $O(N^{-2})$, compared to the $O(N^{-1})$ of simple gradient descent.
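Before moving on, the Huber example above can be replayed numerically; the one-dimensional sketch below (with our own variable names) checks that after $N$ gradient steps the gap matches $\frac{L\|x_0 - x_\star\|^2}{2(2N+1)}$ exactly:

```python
import numpy as np

L, N, x0 = 1.0, 100, 1.0
tau = abs(x0) / (2 * N + 1)                  # threshold from the construction

def huber_grad(x):
    # gradient of the Huber loss: L * tau * sign(x) if |x| >= tau, L * x inside
    return L * tau * np.sign(x) if abs(x) >= tau else L * x

x = x0
for _ in range(N):
    x = x - huber_grad(x) / L                # gradient descent, step size 1/L
gap = L * tau * abs(x) - L * tau**2 / 2 if abs(x) >= tau else L * x**2 / 2
print(gap, L * x0**2 / (2 * (2 * N + 1)))    # the two printed values coincide
```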

4.3 Optimized Gradient Method

Given that the complexity bound for gradient descent cannot be improved, it is reasonable to look for alternate, hopefully better, methods. In this section, we show that accelerated methods can be designed by optimizing their worst-case performances. To do so, we start with a reasonably broad family of candidate first-order methods described by
\[
\begin{aligned}
y_1 &= y_0 - h_{1,0}\nabla f(y_0),\\
y_2 &= y_1 - h_{2,0}\nabla f(y_0) - h_{2,1}\nabla f(y_1),\\
y_3 &= y_2 - h_{3,0}\nabla f(y_0) - h_{3,1}\nabla f(y_1) - h_{3,2}\nabla f(y_2),\\
&\;\;\vdots\\
y_N &= y_{N-1} - \sum_{i=0}^{N-1} h_{N,i}\nabla f(y_i).
\end{aligned} \tag{4.6}
\]
Of course, methods in this form are impractical, as they require keeping track of all previous gradients. Neglecting this potential problem for now, one possibility for choosing the step sizes $\{h_{i,j}\}$ is to solve a minimax problem
\[
\min_{\{h_{i,j}\}} \max_{f \in \mathcal{F}_{0,L}} \left\{ \frac{f(y_N) - f_\star}{\|y_0 - x_\star\|^2} \,:\, y_N \text{ obtained from (4.6) and } x_0 \right\}. \tag{4.7}
\]
In other words, we are looking for the best possible worst-case ratio among methods of the form (4.6). Of course, different target notions of accuracy could be considered instead of $(f(y_N) - f_\star)/\|y_0 - x_\star\|^2$, but we stick with this notion for now. It turns out that (4.7) has a clean solution, obtained by Kim and Fessler (2016), based on clever reformulations and relaxations of (4.7) developed by Drori and Teboulle (2014) (some details are provided in Section 4.8). Furthermore, this method has "factorized" forms which do not require keeping track of previous gradients. The optimized gradient

method (OGM) is parameterized by a sequence $\{\theta_{k,N}\}_k$ that is constructed recursively starting from $\theta_{-1,N} = 0$ (or equivalently $\theta_{0,N} = 1$), using
\[
\theta_{k+1,N} = \begin{cases}
\dfrac{1 + \sqrt{4\theta_{k,N}^2 + 1}}{2} & \text{if } k \le N-2,\\[8pt]
\dfrac{1 + \sqrt{8\theta_{k,N}^2 + 1}}{2} & \text{if } k = N-1.
\end{cases} \tag{4.8}
\]
Let us also mention that optimized gradient methods can be stated in various equivalent formats, which we provide in Algorithm 7 and Algorithm 8 (a rigorous equivalence statement is provided in Appendix B.1.1). While the shape of Algorithm 8 is more common to accelerated methods, the equivalent formulation provided in Algorithm 7 allows for slightly more direct proofs.

Algorithm 7 Optimized gradient method (OGM), form I

Input: An $L$-smooth convex function $f$, initial point $x_0$, budget $N$.
1: Initialize $z_0 = y_0 = x_0$ and $\theta_{-1,N} = 0$.
2: for $k = 0, \ldots, N-1$ do
3:   $\theta_{k,N} = \frac{1 + \sqrt{4\theta_{k-1,N}^2 + 1}}{2}$
4:   $y_k = \left(1 - \frac{1}{\theta_{k,N}}\right)x_k + \frac{1}{\theta_{k,N}}z_k$
5:   $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$
6:   $z_{k+1} = x_0 - \frac{2}{L}\sum_{i=0}^k \theta_{i,N}\nabla f(y_i)$
7: end for
Output: Approx. solution $y_N = \left(1 - \frac{1}{\theta_{N,N}}\right)x_N + \frac{1}{\theta_{N,N}}z_N$ with $\theta_{N,N} = \frac{1 + \sqrt{8\theta_{N-1,N}^2 + 1}}{2}$

Algorithm 8 Optimized gradient method (OGM), form II

Input: An $L$-smooth convex function $f$, initial point $x_0$, budget $N$.
1: Initialize $z_0 = y_0 = x_0$ and $\theta_{0,N} = 1$.
2: for $k = 0, \ldots, N-1$ do
3:   $\theta_{k+1,N} = \begin{cases} \frac{1 + \sqrt{4\theta_{k,N}^2 + 1}}{2} & \text{if } k \le N-2,\\ \frac{1 + \sqrt{8\theta_{k,N}^2 + 1}}{2} & \text{if } k = N-1. \end{cases}$
4:   $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$
5:   $y_{k+1} = x_{k+1} + \frac{\theta_{k,N} - 1}{\theta_{k+1,N}}(x_{k+1} - x_k) + \frac{\theta_{k,N}}{\theta_{k+1,N}}(x_{k+1} - y_k)$
6: end for
Output: Approx. solution $y_N$
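For concreteness, a direct NumPy transcription of Algorithm 8 could read as follows (a sketch; `grad_f` is assumed to be a callable returning $\nabla f$):

```python
import numpy as np

def ogm(grad_f, L, x0, N):
    """Optimized gradient method, form II (Algorithm 8); returns y_N."""
    theta = 1.0                              # theta_{0,N}
    x, y = x0.copy(), x0.copy()
    for k in range(N):
        if k <= N - 2:
            theta_next = (1 + np.sqrt(4 * theta**2 + 1)) / 2
        else:                                # k = N - 1: the final, larger step
            theta_next = (1 + np.sqrt(8 * theta**2 + 1)) / 2
        x_next = y - grad_f(y) / L
        y = (x_next + (theta - 1) / theta_next * (x_next - x)
             + theta / theta_next * (x_next - y))
        x, theta = x_next, theta_next
    return y
```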

Direct approaches to (4.7) are rather technical; see details in (Drori and Teboulle, 2014; Kim and Fessler, 2016). However, showing that the OGM is indeed optimal on the class of smooth convex functions can be done indirectly, by providing an upper bound on its worst-case complexity guarantees, and by showing that no first-order method can have a better worst-case guarantee on this class of problems. We detail a fully explicit worst-case guarantee for OGM in the next section, which consists in showing that

\[
\phi_k = 2\theta_{k-1,N}^2\left(f(y_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(y_{k-1})\|^2\right) + \frac{L}{2}\|z_k - x_\star\|^2 \tag{4.9}
\]
is a potential function for the optimized gradient method (Theorem 4.4 below), when $k < N$. For $k = N$, we need a minor adjustment (Lemma 4.5 below), to obtain a bound on $f(y_N) - f_\star$ and not in terms of $f(y_N) - f_\star - \frac{1}{2L}\|\nabla f(y_N)\|^2$, which appears in the potential. As for gradient descent, the proof therefore relies on potential functions. Following the recursive argument from (4.5), the convergence guarantee is driven by the convergence speed of $\theta_{k,N}^{-2}$ towards $0$. Let us note that when $k < N-1$,
\[
\theta_{k+1,N} = \frac{1 + \sqrt{4\theta_{k,N}^2 + 1}}{2} \ge \frac{1 + 2\theta_{k,N}}{2} = \theta_{k,N} + \frac{1}{2}, \tag{4.10}
\]
therefore $\theta_{k,N} \ge \frac{k}{2} + 1$. We also directly obtain
\[
\theta_{N,N} = \frac{1 + \sqrt{8\theta_{N-1,N}^2 + 1}}{2} \ge \frac{1 + \sqrt{2(N+1)^2 + 1}}{2} \ge \frac{N+1}{\sqrt{2}}, \tag{4.11}
\]
and hence $\theta_{N,N}^{-2} = O(N^{-2})$. Before providing the proof, let us mention that it heavily relies on inequality (4.4), with $\mu = 0$. This inequality turns out to be key for formulating (4.7) in a tractable way.

4.3.1 A Potential for the Optimized Gradient Method

The main point now is to prove that (4.9) is indeed a potential for the optimized gradient method. We emphasize again that our main motivation for doing so is to show that OGM provides a good template algorithm for acceleration (i.e., a method involving two or three sequences), and that the corresponding potential functions can also be used as a template for the analysis of more advanced methods. Note that while the potential structure does not seem immediately intuitive, it was actually found using computer-assisted proof design techniques; see Section 4.8, §"On Obtaining Proofs from this Chapter", for further references. In particular, the following theorem is due to Taylor and Bach (2019, Theorem 11).

Theorem 4.4. Let $f$ be an $L$-smooth convex function, $x_\star \in \operatorname{argmin}_x f(x)$, and $N \in \mathbb{N}$. For any $k \in \mathbb{N}$ with $0 \le k \le N-1$ and any $y_{k-1}, z_k \in \mathbb{R}^d$ it holds that
\[
2\theta_{k,N}^2\left(f(y_k) - f_\star - \frac{1}{2L}\|\nabla f(y_k)\|^2\right) + \frac{L}{2}\|z_{k+1} - x_\star\|^2 \le 2\theta_{k-1,N}^2\left(f(y_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(y_{k-1})\|^2\right) + \frac{L}{2}\|z_k - x_\star\|^2,
\]
when $y_k$ and $z_{k+1}$ are obtained from Algorithm 7.

Proof. Let us recall that the algorithm can be written as
\[
\begin{aligned}
y_k &= \left(1 - \frac{1}{\theta_{k,N}}\right)\left(y_{k-1} - \frac{1}{L}\nabla f(y_{k-1})\right) + \frac{1}{\theta_{k,N}}z_k,\\
z_{k+1} &= z_k - \frac{2\theta_{k,N}}{L}\nabla f(y_k).
\end{aligned}
\]
The proof then consists in performing a weighted sum of the following inequalities.

• Smoothness and convexity of $f$ between $y_{k-1}$ and $y_k$, with weight $\lambda_1 = 2\theta_{k-1,N}^2$,
\[
0 \ge f(y_k) - f(y_{k-1}) + \langle \nabla f(y_k); y_{k-1} - y_k \rangle + \frac{1}{2L}\|\nabla f(y_k) - \nabla f(y_{k-1})\|^2,
\]

• smoothness and convexity of $f$ between $x_\star$ and $y_k$, with weight $\lambda_2 = 2\theta_{k,N}$,
\[
0 \ge f(y_k) - f_\star + \langle \nabla f(y_k); x_\star - y_k \rangle + \frac{1}{2L}\|\nabla f(y_k)\|^2.
\]
The weights being nonnegative, the weighted sum produces a valid inequality of the form

\[
0 \ge \lambda_1\left[f(y_k) - f(y_{k-1}) + \langle \nabla f(y_k); y_{k-1} - y_k \rangle + \frac{1}{2L}\|\nabla f(y_k) - \nabla f(y_{k-1})\|^2\right] + \lambda_2\left[f(y_k) - f_\star + \langle \nabla f(y_k); x_\star - y_k \rangle + \frac{1}{2L}\|\nabla f(y_k)\|^2\right], \tag{4.12}
\]
which can be reformulated as (again either by completing the squares, or simply by expanding both expressions and verifying that they match on a term-by-term basis)

\[
\begin{aligned}
0 \ge{}& \lambda_1\left[f(y_k) - f(y_{k-1}) + \langle \nabla f(y_k); y_{k-1} - y_k \rangle + \frac{1}{2L}\|\nabla f(y_k) - \nabla f(y_{k-1})\|^2\right]\\
&+ \lambda_2\left[f(y_k) - f_\star + \langle \nabla f(y_k); x_\star - y_k \rangle + \frac{1}{2L}\|\nabla f(y_k)\|^2\right]\\
={}& 2\theta_{k,N}^2\left(f(y_k) - f_\star - \frac{1}{2L}\|\nabla f(y_k)\|^2\right) + \frac{L}{2}\|z_{k+1} - x_\star\|^2\\
&- 2\theta_{k-1,N}^2\left(f(y_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(y_{k-1})\|^2\right) - \frac{L}{2}\|z_k - x_\star\|^2\\
&+ \frac{2}{\theta_{k,N}}\left(\theta_{k-1,N}^2 + \theta_{k,N} - \theta_{k,N}^2\right)\left\langle \nabla f(y_k); y_{k-1} - \tfrac{1}{L}\nabla f(y_{k-1}) - z_k \right\rangle\\
&+ 2\left(\theta_{k-1,N}^2 + \theta_{k,N} - \theta_{k,N}^2\right)\left(f(y_k) - f_\star + \frac{1}{2L}\|\nabla f(y_k)\|^2\right),
\end{aligned}
\]
and the desired conclusion follows from picking $\theta_{k,N} \ge \theta_{k-1,N}$ satisfying
\[
\theta_{k-1,N}^2 + \theta_{k,N} - \theta_{k,N}^2 = 0,
\]
hence the choice (4.8), reaching
\[
2\theta_{k,N}^2\left(f(y_k) - f_\star - \frac{1}{2L}\|\nabla f(y_k)\|^2\right) + \frac{L}{2}\|z_{k+1} - x_\star\|^2 \le 2\theta_{k-1,N}^2\left(f(y_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(y_{k-1})\|^2\right) + \frac{L}{2}\|z_k - x_\star\|^2.
\]

A last technical fix is required now. In order to show that the optimized gradient method is an optimal solution to (4.7), we need an upper bound on function values, and not on function values minus a squared gradient norm. This discrepancy is handled by the following technical lemma.

Lemma 4.5. Let $f$ be an $L$-smooth convex function, $x_\star \in \operatorname{argmin}_x f(x)$, and $N \in \mathbb{N}$. For any $y_{N-1}, z_N \in \mathbb{R}^d$ it holds that
\[
\theta_{N,N}^2(f(y_N) - f_\star) + \frac{L}{2}\left\|z_N - \frac{\theta_{N,N}}{L}\nabla f(y_N) - x_\star\right\|^2 \le 2\theta_{N-1,N}^2\left(f(y_{N-1}) - f_\star - \frac{1}{2L}\|\nabla f(y_{N-1})\|^2\right) + \frac{L}{2}\|z_N - x_\star\|^2,
\]
where $y_N$ is obtained from Algorithm 7.

Proof. The proof consists in performing a weighted sum of the following inequalities.

• Smoothness and convexity of $f$ between $y_{N-1}$ and $y_N$, with weight $\lambda_1 = 2\theta_{N-1,N}^2$,
\[
0 \ge f(y_N) - f(y_{N-1}) + \langle \nabla f(y_N); y_{N-1} - y_N \rangle + \frac{1}{2L}\|\nabla f(y_N) - \nabla f(y_{N-1})\|^2,
\]

• smoothness and convexity of $f$ between $x_\star$ and $y_N$, with weight $\lambda_2 = \theta_{N,N}$,
\[
0 \ge f(y_N) - f_\star + \langle \nabla f(y_N); x_\star - y_N \rangle + \frac{1}{2L}\|\nabla f(y_N)\|^2.
\]

The weights being nonnegative, the weighted sum produces a valid inequality of the form

\[
0 \ge \lambda_1\left[f(y_N) - f(y_{N-1}) + \langle \nabla f(y_N); y_{N-1} - y_N \rangle + \frac{1}{2L}\|\nabla f(y_N) - \nabla f(y_{N-1})\|^2\right] + \lambda_2\left[f(y_N) - f_\star + \langle \nabla f(y_N); x_\star - y_N \rangle + \frac{1}{2L}\|\nabla f(y_N)\|^2\right],
\]
which can be reformulated as
\[
\begin{aligned}
0 \ge{}& \theta_{N,N}^2(f(y_N) - f_\star) + \frac{L}{2}\left\|z_N - \frac{\theta_{N,N}}{L}\nabla f(y_N) - x_\star\right\|^2\\
&- 2\theta_{N-1,N}^2\left(f(y_{N-1}) - f_\star - \frac{1}{2L}\|\nabla f(y_{N-1})\|^2\right) - \frac{L}{2}\|z_N - x_\star\|^2\\
&+ \frac{1}{\theta_{N,N}}\left(2\theta_{N-1,N}^2 - \theta_{N,N}^2 + \theta_{N,N}\right)\left\langle \nabla f(y_N); y_{N-1} - \tfrac{1}{L}\nabla f(y_{N-1}) - z_N \right\rangle\\
&+ \left(2\theta_{N-1,N}^2 - \theta_{N,N}^2 + \theta_{N,N}\right)\left(f(y_N) - f_\star + \frac{1}{2L}\|\nabla f(y_N)\|^2\right).
\end{aligned}
\]
The conclusion follows from picking $\theta_{N,N} \ge \theta_{N-1,N}$ such that
\[
2\theta_{N-1,N}^2 - \theta_{N,N}^2 + \theta_{N,N} = 0,
\]

reaching the desired
\[
\theta_{N,N}^2(f(y_N) - f_\star) + \frac{L}{2}\left\|z_N - \frac{\theta_{N,N}}{L}\nabla f(y_N) - x_\star\right\|^2 \le 2\theta_{N-1,N}^2\left(f(y_{N-1}) - f_\star - \frac{1}{2L}\|\nabla f(y_{N-1})\|^2\right) + \frac{L}{2}\|z_N - x_\star\|^2.
\]

By combining Theorem 4.4 and the technical Lemma 4.5, we get the final worst-case convergence bound of OGM on function values, detailed in the corollary below.

Corollary 4.6. Let $f$ be an $L$-smooth convex function, and $x_\star \in \operatorname{argmin}_x f(x)$. For any $N \in \mathbb{N}$ and $x_0 \in \mathbb{R}^d$, the output of the optimized gradient method (OGM, Algorithm 7 or Algorithm 8) satisfies

\[
f(y_N) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{2\theta_{N,N}^2} \le \frac{L\|x_0 - x_\star\|^2}{(N+1)^2}.
\]
Proof. Defining, for $k \in \{1, \ldots, N\}$,
\[
\phi_k = 2\theta_{k-1,N}^2\left(f(y_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(y_{k-1})\|^2\right) + \frac{L}{2}\|z_k - x_\star\|^2,
\]
and
\[
\phi_{N+1} = \theta_{N,N}^2(f(y_N) - f_\star) + \frac{L}{2}\left\|z_N - \frac{\theta_{N,N}}{L}\nabla f(y_N) - x_\star\right\|^2,
\]
we reach the desired statement
\[
\theta_{N,N}^2(f(y_N) - f_\star) \le \phi_{N+1} \le \phi_N \le \ldots \le \phi_0 = \frac{L}{2}\|x_0 - x_\star\|^2,
\]
using Theorem 4.4 and technical Lemma 4.5. We obtain the last bound using $\theta_{N,N} \ge (N+1)/\sqrt{2}$, see (4.11).

In the following section, we will mostly use potential functions relying directly on the function value f(xk) instead of that of f(yk), for practical reasons discussed below. Note that using the descent lemma directly on the potential function allows obtaining a bound on f(xk) for OGM, as well. This result can be found in (Kim and Fessler, 2017, Theorem 3.1).

4.3.2 Optimality of Optimized Gradient Methods

A nice, commonly used guide for designing optimal methods consists in constructing problems that are difficult for all methods within a certain class. This strategy results in lower complexity bounds, and is often deployed via the concept of minimax risk (of a class of problems and a class of methods), see e.g. Guzmán and Nemirovsky (2015), which

corresponds to the worst-case performance of the best method within the prescribed class. In this section, we briefly discuss such results in the context of smooth convex minimization, on the particular class of black-box first-order methods. The term black-box is used to emphasize that the method has no prior knowledge of $f$ (beyond the class of functions to which $f$ belongs; so methods are allowed to use $L$), and can only obtain information on $f$ by evaluating its gradient/function value through an oracle. Of particular interest to us, Drori (2017) established that the worst-case performance achieved by the optimized gradient method (provided by Corollary 4.6) cannot be improved in general, by any black-box first-order method, on the class of smooth convex functions.

Theorem 4.7 (Drori, 2017, Theorem 3). Let $L > 0$, $d, N \in \mathbb{N}$ and $d \ge N+1$. For any black-box first-order method that performs at most $N$ calls to the first-order oracle $(f(\cdot), \nabla f(\cdot))$, there exists a function $f \in \mathcal{F}_{0,L}(\mathbb{R}^d)$ such that
\[
f(x_N) - f(x_\star) \ge \frac{L\|x_0 - x_\star\|^2}{2\theta_{N,N}^2},
\]
where $x_\star \in \operatorname{argmin}_x f(x)$, and $x_N$ is the output of the method under consideration.

While Drori's approach to obtaining this lower bound is rather technical, there are simpler approaches showing that the rate $O(N^{-2})$ (that is, neglecting the tight constants) cannot be beaten in general in smooth convex minimization. For one such example, we refer to (Nesterov, 2003, Theorem 2.1.6). In a very related line of work, Nemirovsky (1991) established similar exact bounds in the context of solving linear systems of equations, and for minimizing convex quadratic functions. For convex quadratic problems whose Hessians have eigenvalues bounded between $0$ and $L$, these lower bounds are attained by the Chebyshev method (see Chapter 2) and by conjugate gradient methods (Nemirovsky, 1991; Nemirovsky, 1992). Perhaps surprisingly, the conjugate gradient method also achieves the lower complexity bound of smooth convex minimization, provided by Theorem 4.7. Furthermore, the proof is essentially the same as that of OGM, and follows from the same potential (see Appendix B.2).

4.3.3 Optimized Gradient Method: Summary

Before going further, let us quickly summarize what we have learned from optimized gradient methods. First of all, the optimized gradient method can be seen as a counterpart of the Chebyshev method for minimizing quadratics, applied to smooth convex minimization. It is an optimal method in the sense that it has the smallest possible worst-case ratio $\frac{f(y_N) - f_\star}{\|y_0 - x_\star\|^2}$ over the class $f \in \mathcal{F}_{0,L}$, among all black-box first-order methods, given a fixed computational budget of $N$ gradient evaluations. Furthermore, although this method has a few drawbacks (we mention a few below), it can be seen as a template for designing other accelerated methods using the same algorithmic and proof structures.

We extensively use variants of this template below. In other words, most variants of accelerated gradient methods usually rely on the same two (or three) sequence structure, and on similar potential functions. Those variants usually rely on slight variations in the choices of the parameters used throughout the iterative process, typically involving less aggressive step size strategies (i.e., smaller values for $\{h_{i,j}\}$ in (4.6)). Second, OGM is not a very practical method as such: it is fine-tuned for unconstrained smooth convex minimization, and does not readily extend to other situations, for example involving constraints, for which (4.4) does not hold in general; see discussions in (Drori, 2018) and Remark A.1. On the other hand, we will see in what follows that it is relatively easy to design other methods, following the same template and achieving the same $O(N^{-2})$ rate, while fixing the issues of OGM listed above. Those methods use slightly less aggressive step size strategies, at the cost of being slightly suboptimal for (4.7), i.e., with slightly worse worst-case guarantees. In this vein, we start by discussing the original accelerated gradient method due to Nesterov (1983).

4.4 Nesterov’s Acceleration

Motivated by the format of the optimized gradient method, we detail a potential-based proof for Nesterov’s method. We then quickly review the concept of estimate sequences, and show that they provide an interpretation of potential functions as increasingly good models of the function to be minimized. We then extend these results to strongly convex minimization.

4.4.1 Nesterov’s Method, from Potential Functions

In this section, we follow the same algorithmic template as that provided by the optimized gradient method. In this spirit, we start by discussing the first accelerated method in its simplest form (Algorithm 9), as well as its potential function, due to Nesterov (1983) (although our presentation is slightly different). Our goal is to derive the simplest algebraic proof for this scheme. We follow the algorithmic template of the optimized gradient method (which is further motivated in Section 4.6.1), and once a potential is chosen, the proofs are quite straightforward, as simple combinations of inequalities and basic algebra. Our choice of potential function is not immediately obvious but allows for simple extensions afterwards. Other choices are possible, for example by incorporating $f(y_k)$ as in OGM, or additional terms such as $\|\nabla f(x_k)\|^2$. We pick a potential function similar to that used for gradient descent and that of the optimized gradient method, written

\[
\phi_k = A_k(f(x_k) - f_\star) + \frac{L}{2}\|z_k - x_\star\|^2,
\]
where one iteration of the algorithm has the following form, reminiscent of the OGM:
\[
\begin{aligned}
y_k &= x_k + \tau_k(z_k - x_k)\\
x_{k+1} &= y_k - \alpha_k \nabla f(y_k)\\
z_{k+1} &= z_k - \gamma_k \nabla f(y_k).
\end{aligned} \tag{4.13}
\]
Our goal is to pick algorithmic parameters $\{(\tau_k, \alpha_k, \gamma_k)\}_k$ greedily making $A_{k+1}$ as large as possible as a function of $A_k$, as the convergence rate of the method is controlled by the inverse of the growth rate of $A_k$, i.e., $f(x_N) - f_\star = O(A_N^{-1})$. In practice, we can pick $A_{k+1} = A_k + \frac{1}{2}(1 + \sqrt{4A_k + 1})$, by choosing $\tau_k = 1 - A_k/A_{k+1}$, $\alpha_k = \frac{1}{L}$, and $\gamma_k = (A_{k+1} - A_k)/L$ (see Algorithm 9), and the proof is then quite compact.

Algorithm 9 Nesterov's method, form I

Input: An $L$-smooth convex function $f$, initial point $x_0$.
1: Initialize $z_0 = x_0$ and $A_0 = 0$.
2: for $k = 0, \ldots$ do
3:   $a_k = \frac{1}{2}(1 + \sqrt{4A_k + 1})$
4:   $A_{k+1} = A_k + a_k$
5:   $y_k = x_k + \left(1 - \frac{A_k}{A_{k+1}}\right)(z_k - x_k)$
6:   $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$
7:   $z_{k+1} = z_k - \frac{A_{k+1} - A_k}{L}\nabla f(y_k)$
8: end for
Output: An approximate solution $x_{k+1}$.

Before moving to the proof of the potential inequality, let us show that $A_k^{-1} = O(k^{-2})$. Indeed, we have

\[
A_k = A_{k-1} + \frac{1 + \sqrt{4A_{k-1} + 1}}{2} \ge A_{k-1} + \frac{1}{2} + \sqrt{A_{k-1}} \ge \left(\sqrt{A_{k-1}} + \frac{1}{2}\right)^2 \ge \frac{k^2}{4}, \tag{4.14}
\]
where the last inequality follows from a recursive application of the previous one along with $A_0 = 0$.

Theorem 4.8. Let $f$ be an $L$-smooth convex function, $x_\star \in \operatorname{argmin}_x f(x)$, and $k \in \mathbb{N}$. For any $x_k, z_k \in \mathbb{R}^d$ and $A_k \ge 0$, the iterates of Algorithm 9 satisfy
\[
A_{k+1}(f(x_{k+1}) - f_\star) + \frac{L}{2}\|z_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f_\star) + \frac{L}{2}\|z_k - x_\star\|^2,
\]

with $A_{k+1} = A_k + \frac{1 + \sqrt{4A_k + 1}}{2}$.

Proof. The proof consists in a weighted sum of the following inequalities.

• Convexity of $f$ between $x_\star$ and $y_k$, with weight $\lambda_1 = A_{k+1} - A_k$,
\[
f_\star \ge f(y_k) + \langle \nabla f(y_k); x_\star - y_k \rangle,
\]

• convexity of $f$ between $x_k$ and $y_k$, with weight $\lambda_2 = A_k$,
\[
f(x_k) \ge f(y_k) + \langle \nabla f(y_k); x_k - y_k \rangle,
\]

• smoothness of $f$ between $y_k$ and $x_{k+1}$ (a.k.a. descent lemma), with weight $\lambda_3 = A_{k+1}$,
\[
f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k \rangle + \frac{L}{2}\|x_{k+1} - y_k\|^2 \ge f(x_{k+1}).
\]
We therefore arrive at the following valid inequality
\[
\begin{aligned}
0 \ge{}& \lambda_1[f(y_k) - f_\star + \langle \nabla f(y_k); x_\star - y_k \rangle]\\
&+ \lambda_2[f(y_k) - f(x_k) + \langle \nabla f(y_k); x_k - y_k \rangle]\\
&+ \lambda_3\left[f(x_{k+1}) - f(y_k) - \langle \nabla f(y_k); x_{k+1} - y_k \rangle - \frac{L}{2}\|x_{k+1} - y_k\|^2\right].
\end{aligned}
\]

For the sake of simplicity, we do not substitute $A_{k+1}$ by its expression until the last stage of the reformulation. Substituting $y_k$, $x_{k+1}$, and $z_{k+1}$ by their expressions in (4.13), along with $\tau_k = 1 - A_k/A_{k+1}$, $\alpha_k = \frac{1}{L}$, and $\gamma_k = \frac{A_{k+1} - A_k}{L}$, basic algebra shows that the previous inequality can be reorganized as
\[
\begin{aligned}
0 \ge{}& A_{k+1}(f(x_{k+1}) - f_\star) + \frac{L}{2}\|z_{k+1} - x_\star\|^2 - A_k(f(x_k) - f_\star) - \frac{L}{2}\|z_k - x_\star\|^2\\
&+ \frac{A_{k+1} - (A_k - A_{k+1})^2}{2L}\|\nabla f(y_k)\|^2.
\end{aligned}
\]
The claim follows from picking $A_{k+1} \ge A_k$ such that $A_{k+1} - (A_k - A_{k+1})^2 = 0$, reaching
\[
A_{k+1}(f(x_{k+1}) - f_\star) + \frac{L}{2}\|z_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f_\star) + \frac{L}{2}\|z_k - x_\star\|^2.
\]

The final worst-case guarantee is obtained using the same chaining argument as in (4.5), combined with a lower bound on $A_N$.

Corollary 4.9. Let $f$ be an $L$-smooth convex function, and $x_\star \in \operatorname{argmin}_x f(x)$. For any $N \in \mathbb{N}$, the iterates of Algorithm 9 satisfy
\[
f(x_N) - f_\star \le \frac{2L\|x_0 - x_\star\|^2}{N^2}.
\]

Proof. Following the argument of (4.5), we recursively use Theorem 4.8 with $A_0 = 0$:
\[
A_N(f(x_N) - f_\star) \le \phi_N \le \ldots \le \phi_0 = \frac{L}{2}\|x_0 - x_\star\|^2,
\]
which yields
\[
f(x_N) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{2A_N} \le \frac{2L\|x_0 - x_\star\|^2}{N^2},
\]
where we used $A_N \ge N^2/4$ from (4.14) to reach the last inequality.

Before moving on, let us insist on the fact that this $O(N^{-2})$ rate matches that of lower bounds (see e.g. Theorem 4.7) up to absolute constants. Finally, note that Nesterov's method is often written in a slightly different format, similar to that of Algorithm 8. This alternate formulation omits the third sequence $z_k$ and is provided by Algorithm 10. It is preferred in many references on the topic due to its simplicity. A third equivalent variant is provided by Algorithm 11; this variant turns out to be useful when generalizing the method beyond Euclidean spaces. The equivalence statements between Algorithm 9, Algorithm 10, and Algorithm 11 are relatively simple, and are provided in Appendix B.1.2. Many references typically favor one of those formulations, and we want to point out that they are often equivalent. Although the expressions of those different formats in terms of the same external sequence $\{A_k\}_k$ do not always correspond to their simplest forms (particularly in the strongly convex case, which follows), we stick with it in order not to introduce too many variations around the same theme.

Algorithm 10 Nesterov’s method, form II

Input: An $L$-smooth convex function $f$, initial point $x_0$.
1: Initialize $y_0 = x_0$ and $A_0 = 0$.
2: for $k = 0, \ldots$ do
3:   $a_k = \frac{1}{2}(1 + \sqrt{4A_k + 1})$
4:   $A_{k+1} = A_k + a_k$
5:   $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$
6:   $y_{k+1} = x_{k+1} + \frac{a_k - 1}{a_{k+1}}(x_{k+1} - x_k)$
7: end for
Output: An approximate solution $x_{k+1}$.
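In code, form II is particularly compact; here is a NumPy sketch of Algorithm 10 (with `grad_f` a callable, our naming):

```python
import numpy as np

def nesterov(grad_f, L, x0, N):
    """Nesterov's method, form II (Algorithm 10)."""
    A, a = 0.0, 1.0                          # A_0 = 0, hence a_0 = 1
    x, y = x0.copy(), x0.copy()
    for _ in range(N):
        x_next = y - grad_f(y) / L           # gradient step from y_k
        a_next = (1 + np.sqrt(4 * (A + a) + 1)) / 2   # a_{k+1} from A_{k+1}
        y = x_next + (a - 1) / a_next * (x_next - x)  # momentum step
        x, A, a = x_next, A + a, a_next
    return x
```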

Algorithm 11 Nesterov’s method, form III

Input: An $L$-smooth convex function $f$, initial point $x_0$.
1: Initialize $z_0 = x_0$ and $A_0 = 0$.
2: for $k = 0, \ldots$ do
3:   $a_k = \frac{1}{2}(1 + \sqrt{4A_k + 1})$
4:   $A_{k+1} = A_k + a_k$
5:   $y_k = x_k + \left(1 - \frac{A_k}{A_{k+1}}\right)(z_k - x_k)$
6:   $z_{k+1} = z_k - \frac{A_{k+1} - A_k}{L}\nabla f(y_k)$
7:   $x_{k+1} = \frac{A_k}{A_{k+1}}x_k + \left(1 - \frac{A_k}{A_{k+1}}\right)z_{k+1}$
8: end for
Output: An approximate solution $x_{k+1}$.

4.4.2 Estimate Sequence Interpretation

We now relate the potential function approach to estimate sequences, which maintain a model of the function along iterations. This technique was originally developed in (Nesterov, 2003, Section 2.2), and was then used in numerous works to obtain accelerated first-order methods in various settings (see discussions in Section 4.8). We present a slightly modified version, related to those of (Baes, 2009; Wilson et al., 2016), which simplifies comparisons with previous material.

Estimate Sequences As we see below, the base idea underlying estimate sequences is closely related to that of potential functions, but has explicit interpretations in terms of models of the objective function $f$. More precisely, a sequence of pairs $\{(A_k, \varphi_k(x))\}_k$, with $A_k \ge 0$ and $\varphi_k : \mathbb{R}^d \to \mathbb{R}$, is called an estimate sequence of a function $f$ if

(i) for all $k \ge 0$ and $x \in \mathbb{R}^d$ we have
\[
\varphi_k(x) - f(x) \le A_k^{-1}(\varphi_0(x) - f(x)), \tag{4.15}
\]

(ii) $A_k \to \infty$ as $k \to \infty$.

If, in addition, an estimate sequence satisfies (iii) for all $k \ge 0$ there exists some $x_k$ such that $f(x_k) \le \varphi_k(x_\star)$, then we can guarantee that $f(x_k) - f_\star = O(A_k^{-1})$.

The purpose of estimate sequences is to start from an initial model $\varphi_0(x)$ satisfying $\varphi_0(x) \ge f_\star$ for all $x \in \mathbb{R}^d$, and to design a sequence of convex models $\varphi_k$ that are increasingly good approximations of $f$, in the sense of (4.15). We now further comment on conditions (i) and (iii), assuming for simplicity that $\{A_k\}$ is monotonically increasing.

• Regarding (i), for all $x \in \mathbb{R}^d$, we have to design $\varphi_k$ to be either (a) a lower bound on the function (i.e., $\varphi_k(x) - f(x) \le 0$ for that $x$), or (b) an increasingly good upper approximation of $f(x)$, when $0 \le \varphi_k(x) - f(x) \le A_k^{-1}(\varphi_0(x) - f(x))$ for that $x$. That is, we require that the error $|f(x) - \varphi_k(x)|$ incurred when approximating $f(x)$ by $\varphi_k(x)$ gets smaller for all $x$ for which $\varphi_k(x)$ is an upper bound on $f(x)$. To develop such models and the corresponding methods, three sequences of points are commonly used: (a) minimizers of our models $\varphi_k$, which correspond to the iterates $z_k$ of the corresponding method, (b) a sequence $y_k$ of points whose first-order information is used to update the model of the function, and (c) the iterates $x_k$, corresponding to the best possible $f(x_k)$ we can form (which often does not correspond to the minimum of the model $\varphi_k$, which is not necessarily an upper bound on the function).

• Regarding (iii), this condition ensures that the models $\varphi_k$ remain upper bounds on the optimal value $f_\star$. That is, $f_\star \le \varphi_k(x_\star)$ (since $f_\star \le f(x_k)$), and hence $\varphi_k(x_\star) - f_\star \ge 0$. From the previous bullet point, this ensures that the modeling error

of $f_\star$ goes to $0$ asymptotically as $k$ increases. More formally, conditions (ii) and (iii) allow constructing proofs similar to potential functions, and obtaining convergence rates. That is, under (iii), we get that

\[
f(x_k) - f_\star \le \varphi_k(x_\star) - f(x_\star) \le A_k^{-1}(\varphi_0(x_\star) - f(x_\star)), \tag{4.16}
\]
and therefore $f(x_k) - f_\star = O(A_k^{-1})$, so the convergence rate is dictated by that of $A_k^{-1}$, which goes to $0$ by (ii). Now, the game consists in picking appropriate sequences $\{(A_k, \varphi_k)\}$ corresponding to simple algorithms. We thus translate our potential function results in terms of estimate sequences.

Potential Functions as Estimate Sequences One can observe that potential functions and estimate sequences are closely related. First, in both cases, convergence speed is dictated by that of a scalar sequence $A_k^{-1}$. In fact, there is one subtle but important difference between the two approaches: whereas $\varphi_k(x)$ should be an increasingly good approximation of $f$ for all $x$ in the context of estimate sequences, potential functions require a model to be an increasingly good approximation of $f_\star$ only, which is less restrictive. Hence, estimate sequences provide stronger statements, but may not handle situations where the analysis actually requires having a weaker model holding only on $f_\star$, and not on $f(x)$ for all $x$. Let us make this discussion more concrete on three examples, namely gradient descent, Nesterov's method, and the optimized gradient method.

• Gradient descent: the potential inequality from Theorem 4.2 actually holds for all $x$, and not only $x_\star$, as the proof does not use the optimality of $x_\star$; that is,
\[
(A_k + 1)(f(x_{k+1}) - f(x)) + \frac{L}{2}\|x_{k+1} - x\|^2 \le A_k(f(x_k) - f(x)) + \frac{L}{2}\|x_k - x\|^2,
\]
for all $x \in \mathbb{R}^d$. Therefore the pair $\{(A_k, \varphi_k(x))\}_k$ with
\[
\varphi_k(x) = f(x_k) + \frac{L}{2A_k}\|x_k - x\|^2
\]

and $A_k = A_0 + k$ (with $A_0 > 0$) is an estimate sequence for gradient descent.

• Nesterov’s first method: the potential inequality from Theorem 4.8 also holds for all x Rd, not only x , as the proof does not use optimality of x , that is ∈ ⋆ ⋆ L A (f(x ) f(x)) + z x 2 k+1 k+1 − 2 k k+1 − k L A (f(x ) f(x)) + z x 2, ≤ k k − 2 k k − k 4.5. Acceleration under Strong Convexity 51

for all $x \in \mathbb{R}^d$. Hence the pair $\{(A_k, \varphi_k(x))\}_k$ with
\[
\varphi_k(x) = f(x_k) + \frac{L}{2A_k}\|z_k - x\|^2
\]

and $A_k = A_{k-1} + \frac{1 + \sqrt{4A_{k-1} + 1}}{2}$ (with $A_0 > 0$) is an estimate sequence for Nesterov's method.

• Optimized gradient method: the potential inequality from Theorem 4.4 actually uses the fact that $x_\star$ is an optimal point. Indeed, the proof relies on
\[
f(x) \ge f(y_k) + \langle \nabla f(y_k); x - y_k \rangle + \frac{1}{2L}\|\nabla f(y_k)\|^2,
\]
which is a valid lower bound on $f(x)$ only when $\nabla f(x) = 0$ (see Equation (4.4)). This does not mean there is no estimate sequence-type model of the function as the algorithm proceeds, but the potential does not directly correspond to one. Alternatively, one can interpret

\[
\varphi_k(x) = f(y_k) - \frac{1}{2L}\|\nabla f(y_k)\|^2 + \frac{L}{4\theta_{k,N}^2}\|z_{k+1} - x\|^2
\]

as an increasingly good model of $f_\star$. A similar conclusion holds for the conjugate gradient method (CG) from Appendix B.2. We are not aware of any estimate sequence allowing to prove that CG reaches the lower bound from Theorem 4.7.

Those discussions can be carried over to the strongly convex setting, which we now address.

4.5 Acceleration under Strong Convexity

Before designing faster methods exploiting strong convexity, let us briefly describe the benefits and limitations of this additional assumption. Roughly speaking, strong convexity guarantees that gradients get larger further away from the optimal solution. One way of looking at it is as follows: a function $f$ is $L$-smooth and $\mu$-strongly convex if and only if
\[
f(x) = \tilde{f}(x) + \frac{\mu}{2}\|x - x_\star\|^2,
\]
where $x_\star$ is an optimal point for both $f$ and $\tilde{f}$, and where $\tilde{f}$ is convex and $(L-\mu)$-smooth. Therefore, one iteration of gradient descent can be described as follows:
\[
\begin{aligned}
x_{k+1} - x_\star &= x_k - x_\star - \gamma\nabla f(x_k)\\
&= x_k - x_\star - \gamma(\nabla \tilde{f}(x_k) + \mu(x_k - x_\star))\\
&= (1 - \gamma\mu)(x_k - x_\star) - \gamma\nabla \tilde{f}(x_k),
\end{aligned}
\]
and we see that for small enough step sizes $\gamma$, there is an additional contraction effect due to the factor $(1 - \gamma\mu)$, as compared to the effect gradient descent has on smooth

convex functions such as $\tilde{f}$. In what follows, we adapt our proofs to develop accelerated methods in the strongly convex case. On the other hand, smooth strongly convex functions are sandwiched between two quadratic functions, so these assumptions are much more restrictive than smoothness alone.

4.5.1 Gradient Descent & Strong Convexity

As in the smooth convex case, the smooth strongly convex case can be studied through potential functions. There are many ways to prove convergence rates in this setting, but let us only consider one that allows recovering the $\mu = 0$ case as its limit, in order to have well-defined results even in degenerate cases. The proof below is essentially the same as that for the smooth convex case in Theorem 4.2, and the same inequalities are used, with strong convexity instead of convexity. The potential is only slightly modified, allowing $A_k$ to have a geometric growth rate, as follows:
\[
\phi_k = A_k(f(x_k) - f_\star) + \frac{L + \mu A_k}{2}\|x_k - x_\star\|^2.
\]
For notational convenience, we use $q = \frac{\mu}{L}$ to denote the inverse condition ratio below. This quantity plays a key role in the geometric convergence of first-order methods in the presence of strong convexity.

Theorem 4.10. Let $f$ be an $L$-smooth, $\mu$-strongly convex (possibly with $\mu = 0$) function, $x_\star \in \operatorname{argmin}_x f(x)$, and $k \in \mathbb{N}$. For any $A_k \ge 0$ and any $x_k$ it holds that
\[
A_{k+1}(f(x_{k+1}) - f_\star) + \frac{L + \mu A_{k+1}}{2}\|x_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f_\star) + \frac{L + \mu A_k}{2}\|x_k - x_\star\|^2,
\]
with $x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$, $A_{k+1} = (1 + A_k)/(1 - q)$, and $q = \frac{\mu}{L}$.

Proof. The proof consists in performing a weighted sum of the following inequalities:

• $\mu$-strong convexity of $f$ between $x_k$ and $x_\star$, with weight $\lambda_1 = A_{k+1} - A_k$,
\[
0 \ge f(x_k) - f_\star + \langle \nabla f(x_k); x_\star - x_k \rangle + \frac{\mu}{2}\|x_\star - x_k\|^2,
\]

• smoothness of $f$ between $x_k$ and $x_{k+1}$, with weight $\lambda_2 = A_{k+1}$,
\[
0 \ge f(x_{k+1}) - f(x_k) - \langle \nabla f(x_k); x_{k+1} - x_k \rangle - \frac{L}{2}\|x_k - x_{k+1}\|^2.
\]
This weighted sum yields a valid inequality
\[
\begin{aligned}
0 \ge{}& \lambda_1\left[f(x_k) - f_\star + \langle \nabla f(x_k); x_\star - x_k \rangle + \frac{\mu}{2}\|x_\star - x_k\|^2\right]\\
&+ \lambda_2\left[f(x_{k+1}) - f(x_k) - \langle \nabla f(x_k); x_{k+1} - x_k \rangle - \frac{L}{2}\|x_k - x_{k+1}\|^2\right].
\end{aligned}
\]

Using $x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$, this inequality can be rewritten exactly as
\[
\begin{aligned}
A_{k+1}(f(x_{k+1}) - f_\star) + \frac{L + \mu A_{k+1}}{2}\|x_{k+1} - x_\star\|^2
\le{}& A_k(f(x_k) - f_\star) + \frac{L + \mu A_k}{2}\|x_k - x_\star\|^2\\
&- \frac{(1-q)A_{k+1} - 1 - A_k}{2L}\|\nabla f(x_k)\|^2\\
&+ ((1-q)A_{k+1} - A_k - 1)\langle \nabla f(x_k); x_k - x_\star \rangle.
\end{aligned}
\]
The desired inequality follows from $A_{k+1} = (1 + A_k)/(1 - q)$ and the sign of $A_k$, making one of the last two terms nonpositive and the other one equal to zero, reaching
\[
A_{k+1}(f(x_{k+1}) - f_\star) + \frac{L + \mu A_{k+1}}{2}\|x_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f_\star) + \frac{L + \mu A_k}{2}\|x_k - x_\star\|^2.
\]

From this theorem, we observe that adding strong convexity to the problem allows $A_k$ to grow at a geometric rate given by $(1-q)^{-1}$ (where we denote again by $q = \frac{\mu}{L}$ the inverse condition number). The corresponding iteration complexity of gradient descent for smooth strongly convex minimization is therefore $O\left(\frac{L}{\mu}\log\frac{1}{\epsilon}\right)$ to find an approximate solution $f(x_k) - f_\star \le \epsilon$. This rate is essentially tight, as can be verified on quadratic functions (see e.g. Chapter 2), and follows from the following corollary.

Corollary 4.11. Let $f$ be an $L$-smooth, $\mu$-strongly convex function, and $x_\star \in \operatorname{argmin}_x f(x)$. For any $N \in \mathbb{N}$, the iterates of gradient descent with step size $\gamma_0 = \gamma_1 = \ldots = \gamma_N = \frac{1}{L}$ satisfy
\[
f(x_N) - f_\star \le \frac{\mu\|x_0 - x_\star\|^2}{2((1-q)^{-N} - 1)},
\]
with $q = \frac{\mu}{L}$ the inverse condition number.

Proof. Following the reasoning of (4.5), we recursively use Theorem 4.10 starting with $A_0 = 0$, that is,
\[
A_N(f(x_N) - f_\star) \le \phi_N \le \ldots \le \phi_0 = \frac{L}{2}\|x_0 - x_\star\|^2,
\]

and notice that the recurrence equation $A_{k+1} = (A_k + 1)/(1 - q)$ has the solution $A_k = ((1-q)^{-k} - 1)/q$. The final bound is obtained using again $f(x_N) - f_\star \le \frac{L\|x_0 - x_\star\|^2}{2A_N}$.

Note that, as $\mu \to 0$, the result of Corollary 4.11 tends to that of Corollary 4.3.

Remark 4.1 (Lower bounds). As in the smooth convex case, one can derive lower complexity bounds for smooth strongly convex optimization. Using the same lower bounds arising from smooth strongly convex quadratic minimization (for which Chebyshev's

methods have optimal iteration complexities), one can conclude that no black-box first-order method can behave better than $f(x_k) - f_\star = O(\rho^k)$ with $\rho = \frac{(1-\sqrt{q})^2}{(1+\sqrt{q})^2}$. We refer the reader to Nesterov (2003) and Nemirovsky (1992) for more details. For smooth strongly convex problems beyond quadratics, this lower bound can be improved to $f(x_k) - f_\star = O((1-\sqrt{q})^{2k})$, as provided in (Drori and Taylor, 2021, Corollary 4). In this context, we will see that Nesterov's acceleration satisfies $f(x_k) - f_\star = O((1-\sqrt{q})^k)$, i.e., it has an $O\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\right)$ iteration complexity, reaching the lower complexity bound up to a constant factor. As for the optimized gradient method provided in Section 4.3, an optimal method for the smooth strongly convex case is detailed in Section 4.6.1, and can be shown to match the corresponding worst-case lower complexity bound.

4.5.2 Acceleration for Smooth Strongly Convex Objectives

To adapt the proofs of convergence of accelerated methods to the strongly convex case, we need to make a small adjustment in the shape of the previous accelerated method:
\[
\begin{aligned}
y_k &= x_k + \tau_k(z_k - x_k)\\
x_{k+1} &= y_k - \alpha_k \nabla f(y_k)\\
z_{k+1} &= \left(1 - \tfrac{\mu}{L}\delta_k\right)z_k + \tfrac{\mu}{L}\delta_k y_k - \gamma_k \nabla f(y_k).
\end{aligned} \tag{4.17}
\]
As discussed below, this scheme can be seen as an optimized gradient method for smooth strongly convex minimization (whose details are provided in Section 4.6.1), similar to the smooth convex case. Following this scheme, Nesterov's method for strongly convex problems is presented in Algorithm 12. As in the smooth convex case, we detail several of its convenient reformulations in Algorithm 25 and Algorithm 26. The corresponding equivalences are established in Appendix B.1.3.

Algorithm 12 Nesterov’s method, form I

Input: An $L$-smooth, $\mu$-strongly (possibly with $\mu = 0$) convex function $f$, initial $x_0$.
1: Initialize $z_0 = x_0$ and $A_0 = 0$; set $q = \mu/L$ (the inverse condition ratio).
2: for $k = 0, \ldots$ do
3:   $A_{k+1} = \frac{2A_k + 1 + \sqrt{4A_k + 4qA_k^2 + 1}}{2(1-q)}$
4:   Set $\tau_k = \frac{(A_{k+1} - A_k)(1 + qA_k)}{A_{k+1} + 2qA_kA_{k+1} - qA_k^2}$ and $\delta_k = \frac{A_{k+1} - A_k}{1 + qA_{k+1}}$
5:   $y_k = x_k + \tau_k(z_k - x_k)$
6:   $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$
7:   $z_{k+1} = (1 - q\delta_k)z_k + q\delta_k y_k - \frac{\delta_k}{L}\nabla f(y_k)$
8: end for
Output: An approximate solution $x_{k+1}$.
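A NumPy sketch of Algorithm 12 (our transcription, with `grad_f` a callable) is given below; note that it reduces to the smooth convex scheme when $\mu = 0$:

```python
import numpy as np

def nesterov_strongly_convex(grad_f, L, mu, x0, N):
    """Nesterov's method, form I (Algorithm 12), with q = mu / L."""
    q = mu / L
    A = 0.0
    x, z = x0.copy(), x0.copy()
    for _ in range(N):
        A_next = (2 * A + 1 + np.sqrt(4 * A + 4 * q * A**2 + 1)) / (2 * (1 - q))
        tau = ((A_next - A) * (1 + q * A)
               / (A_next + 2 * q * A * A_next - q * A**2))
        delta = (A_next - A) / (1 + q * A_next)
        y = x + tau * (z - x)
        g = grad_f(y)
        x = y - g / L
        z = (1 - q * delta) * z + q * delta * y - (delta / L) * g
        A = A_next
    return x
```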

Regarding the potential, we make the same adjustment as for gradient descent, arriving at the following theorem.

Theorem 4.12. Let $f$ be an $L$-smooth, $\mu$-strongly (possibly with $\mu = 0$) convex function, $x_\star \in \operatorname{argmin}_x f(x)$, and $k \in \mathbb{N}$. For all $x_k, z_k \in \mathbb{R}^d$ and $A_k \ge 0$, the iterates of Algorithm 12 satisfy
\[
A_{k+1}(f(x_{k+1}) - f_\star) + \frac{L + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f_\star) + \frac{L + \mu A_k}{2}\|z_k - x_\star\|^2,
\]

with $A_{k+1} = \frac{2A_k + 1 + \sqrt{4A_k + 4qA_k^2 + 1}}{2(1-q)}$ and $q = \frac{\mu}{L}$.

Proof. The proof consists in a weighted sum of the following inequalities:

• strong convexity between $x_\star$ and $y_k$, with weight $\lambda_1 = A_{k+1} - A_k$,
\[
f_\star \ge f(y_k) + \langle \nabla f(y_k); x_\star - y_k \rangle + \frac{\mu}{2}\|x_\star - y_k\|^2,
\]

• convexity between xk and yk with weight λ2 = Ak

\[
f(x_k) \ge f(y_k) + \langle \nabla f(y_k); x_k - y_k \rangle,
\]

• smoothness between $y_k$ and $x_{k+1}$ (descent lemma), with weight $\lambda_3 = A_{k+1}$,
\[
f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k \rangle + \frac{L}{2}\|x_{k+1} - y_k\|^2 \ge f(x_{k+1}).
\]

We therefore arrive at the following valid inequality
\[
\begin{aligned}
0 \ge{}& \lambda_1\left[f(y_k) - f_\star + \langle \nabla f(y_k); x_\star - y_k \rangle + \frac{\mu}{2}\|x_\star - y_k\|^2\right]\\
&+ \lambda_2[f(y_k) - f(x_k) + \langle \nabla f(y_k); x_k - y_k \rangle]\\
&+ \lambda_3\left[f(x_{k+1}) - f(y_k) - \langle \nabla f(y_k); x_{k+1} - y_k \rangle - \frac{L}{2}\|x_{k+1} - y_k\|^2\right].
\end{aligned}
\]

For the sake of simplicity, we do not substitute $A_{k+1}$ by its expression until the last stage of the reformulation. Substituting $x_{k+1}$, $z_{k+1}$ by their expressions in (4.17), along with $\tau_k = \frac{(A_{k+1} - A_k)(1 + qA_k)}{A_{k+1} + 2qA_kA_{k+1} - qA_k^2}$, $\alpha_k = \frac{1}{L}$, $\delta_k = \frac{A_{k+1} - A_k}{1 + qA_{k+1}}$, and $\gamma_k = \frac{\delta_k}{L}$, basic algebra shows that the previous inequality can be reorganized as
\[
\begin{aligned}
&A_{k+1}(f(x_{k+1}) - f_\star) + \frac{L + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2\\
&\le A_k(f(x_k) - f_\star) + \frac{L + \mu A_k}{2}\|z_k - x_\star\|^2\\
&\quad+ \frac{(A_k - A_{k+1})^2 - A_{k+1} - qA_{k+1}^2}{1 + qA_{k+1}}\,\frac{1}{2L}\|\nabla f(y_k)\|^2\\
&\quad- \frac{(A_{k+1} - A_k)(1 + qA_k)(1 + qA_{k+1})}{(A_{k+1} + 2qA_kA_{k+1} - qA_k^2)^2}\,A_k^2\,\frac{\mu}{2}\|x_k - z_k\|^2.
\end{aligned}
\]

The desired statement follows from picking $A_{k+1} \ge A_k \ge 0$ such that
\[
(A_k - A_{k+1})^2 - A_{k+1} - qA_{k+1}^2 = 0,
\]
yielding
\[
A_{k+1}(f(x_{k+1}) - f_\star) + \frac{L + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f_\star) + \frac{L + \mu A_k}{2}\|z_k - x_\star\|^2.
\]

The final worst-case guarantee is obtained using the same reasoning as before, together with a simple bound on $A_{k+1}$:

\[
A_{k+1} = \frac{2A_k + 1 + \sqrt{4A_k + 4qA_k^2 + 1}}{2(1-q)} \ge \frac{2A_k + 2A_k\sqrt{q}}{2(1-q)} \ge \frac{A_k}{1 - \sqrt{q}}, \tag{4.18}
\]
which means $f(x_k) - f_\star = O((1-\sqrt{q})^k)$ when $\mu > 0$, or alternatively an $O\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\right)$ iteration complexity for obtaining an approximate solution $f(x_N) - f_\star \le \epsilon$. The following corollary summarizes what we obtained for Nesterov's method.

Corollary 4.13. Let $f$ be an $L$-smooth, $\mu$-strongly (possibly with $\mu = 0$) convex function, and $x_\star \in \operatorname{argmin}_x f(x)$. For all $N \in \mathbb{N}$, $N \ge 1$, the iterates of Algorithm 12 satisfy
\[
f(x_N) - f_\star \le \min\left\{\frac{2}{N^2},\, (1-\sqrt{q})^N\right\} L\|x_0 - x_\star\|^2,
\]
with $q = \frac{\mu}{L}$.

Proof. Following the argument of (4.5), we recursively use Theorem 4.12 with $A_0 = 0$, together with the bounds on $A_N$ for the smooth convex case (4.14) (notice that $A_{k+1}$ is an increasing function of $\mu$, and hence the bound for the smooth case remains valid in the smooth strongly convex one) and for the smooth strongly convex one (4.18) with $A_1 = \frac{1}{1-q} = \frac{1}{(1-\sqrt{q})(1+\sqrt{q})} \ge \frac{1}{2}(1-\sqrt{q})^{-1}$, reaching $A_N \ge \frac{1}{2}(1-\sqrt{q})^{-N}$.

Remark 4.2. Before moving to the next section, let us mention that another direct consequence of the potential inequality above (Theorem 4.12) is that $z_k$ may also serve as an approximate solution to $x_\star$ when $\mu > 0$. Indeed, using the inequality
\[
\frac{L + \mu A_N}{2}\|z_N - x_\star\|^2 \le \phi_N \le \ldots \le \phi_0 = \frac{L}{2}\|x_0 - x_\star\|^2,
\]
it follows that
\[
\|z_N - x_\star\|^2 \le \frac{1}{1 + qA_N}\|x_0 - x_\star\|^2 \le \frac{2(1-\sqrt{q})^N}{2(1-\sqrt{q})^N + q}\|x_0 - x_\star\|^2,
\]

and hence that $\|z_N - x_\star\|^2 = O((1-\sqrt{q})^N)$ as well, and therefore also that
\[ f(z_N) - f_\star \leq \frac{L}{2}\|z_N - x_\star\|^2 = O((1-\sqrt{q})^N). \]
In addition, $y_N$ being a convex combination of $x_N$ and $z_N$, the same conclusion holds for $\|y_N - x_\star\|^2$ and $f(y_N) - f_\star$. Similar observations also apply to other variants of accelerated methods when $\mu > 0$.

4.5.3 A Simplified Method with Constant Momentum

Some important simplifications are often made to the method in the strongly convex case where $\mu > 0$. Several approaches produce the same method, known as the "constant momentum" version of Nesterov's accelerated gradient. We derive it by observing that the limit case of Algorithm 12, as $k \to \infty$, can be characterized explicitly. In particular, when $k \to \infty$, it is clear that $A_k \to \infty$ as well. We can then take the limits of all parameters as $A_k \to \infty$ to obtain the "limit algorithm". This is similar in spirit to the result showing that Polyak's heavy-ball method is the asymptotic version of Chebyshev's method, discussed in Section 2.3.2. First, the convergence rate is obtained as

\[ \lim_{A_k \to \infty} \frac{A_{k+1}}{A_k} = (1 - \sqrt{q})^{-1}. \]
By taking the limits of all algorithmic parameters, that is,
\[ \lim_{A_k \to \infty} \tau_k = \frac{\sqrt{q}}{1 + \sqrt{q}}, \qquad \lim_{A_k \to \infty} \delta_k = \frac{1}{\sqrt{q}}, \]
we obtain Algorithm 13 and its equivalent, probably most well-known, second form, provided by Algorithm 14.

Algorithm 13 Nesterov’s method, form I, constant momentum

Input: An $L$-smooth $\mu$-strongly convex function $f$, initial point $x_0$.
1: Initialize $z_0 = x_0$ and $A_0 > 0$, set $q = \mu/L$ (the inverse condition ratio).
2: for $k = 0, \ldots$ do
3: $A_{k+1} = \frac{A_k}{1 - \sqrt{q}}$ {Only for the proof/relation to previous methods.}
4: $y_k = x_k + \frac{\sqrt{q}}{1 + \sqrt{q}}(z_k - x_k)$
5: $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$
6: $z_{k+1} = (1 - \sqrt{q})z_k + \sqrt{q}\left( y_k - \frac{1}{\mu}\nabla f(y_k) \right)$
7: end for
Output: Approximate solutions $(y_k, x_{k+1}, z_{k+1})$.
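To make the structure of Algorithm 13 concrete, here is a minimal NumPy sketch (not part of the original text); the oracle name grad_f, the iteration budget, and the test quadratic are illustrative assumptions.

```python
import numpy as np

def nesterov_constant_momentum(grad_f, x0, L, mu, n_iter=200):
    """Minimal sketch of Algorithm 13 (form I, constant momentum).

    grad_f : assumed gradient oracle of an L-smooth, mu-strongly convex f
             (mu > 0 is required for the 1/mu step on z).
    """
    q = mu / L                                  # inverse condition ratio
    sq = np.sqrt(q)
    x = np.array(x0, dtype=float)
    z = x.copy()
    for _ in range(n_iter):
        y = x + (sq / (1 + sq)) * (z - x)       # step 4
        g = grad_f(y)
        x = y - g / L                           # step 5 (small step, 1/L)
        z = (1 - sq) * z + sq * (y - g / mu)    # step 6 (large step, 1/mu)
    return x

# Illustrative use on f(x) = 0.5 x' diag(1, 10) x, so mu = 1 and L = 10.
A = np.diag([1.0, 10.0])
x_end = nesterov_constant_momentum(lambda v: A @ v, [5.0, 5.0], L=10.0, mu=1.0)
print(np.linalg.norm(x_end))                    # should be close to 0
```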

From a worst-case analysis perspective, those simplifications correspond to using a Lyapunov function obtained by dividing that of Theorem 4.12 by Ak and then taking the limit, written

\[ \rho^{-1}\left( f(x_{k+1}) - f_\star + \frac{\mu}{2}\|z_{k+1} - x_\star\|^2 \right) \leq f(x_k) - f_\star + \frac{\mu}{2}\|z_k - x_\star\|^2, \]

Algorithm 14 Nesterov’s method, form II, constant momentum

Input: An $L$-smooth $\mu$-strongly convex function $f$, initial point $x_0$.
1: Initialize $y_0 = x_0$ and $A_0 > 0$, set $q = \mu/L$ (the inverse condition ratio).
2: for $k = 0, \ldots$ do
3: $A_{k+1} = \frac{A_k}{1 - \sqrt{q}}$ {Only for the proof/relation to previous methods.}
4: $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$
5: $y_{k+1} = x_{k+1} + \frac{1 - \sqrt{q}}{1 + \sqrt{q}}(x_{k+1} - x_k)$
6: end for
Output: Approximate solutions $(y_k, x_{k+1})$.

with $\rho = (1 - \sqrt{q})$.

Theorem 4.14. Let $f$ be an $L$-smooth $\mu$-strongly convex function, $x_\star = \mathrm{argmin}_x f(x)$, and $k \in \mathbb{N}$. For any $x_k, z_k \in \mathbb{R}^d$, the iterates of Algorithm 13 (or equivalently those of Algorithm 14) satisfy

\[ \rho^{-1}\left( f(x_{k+1}) - f_\star + \frac{\mu}{2}\|z_{k+1} - x_\star\|^2 \right) \leq f(x_k) - f_\star + \frac{\mu}{2}\|z_k - x_\star\|^2. \]
Proof. Let $A_k > 0$ and $A_{k+1} = A_k/(1 - \sqrt{q})$. The proof is essentially the same as that of Theorem 4.12, dividing all weights by $A_k$, which leads to a slight variation in the reformulation of the weighted sum, and the following valid inequality
\[ \begin{aligned} 0 \geq{}& \lambda_1\left[ f(y_k) - f_\star + \langle \nabla f(y_k); x_\star - y_k\rangle + \tfrac{\mu}{2}\|x_\star - y_k\|^2 \right] \\ &+ \lambda_2\left[ f(y_k) - f(x_k) + \langle \nabla f(y_k); x_k - y_k\rangle \right] \\ &+ \lambda_3\left[ f(x_{k+1}) - f(y_k) - \langle \nabla f(y_k); x_{k+1} - y_k\rangle - \tfrac{L}{2}\|x_{k+1} - y_k\|^2 \right], \end{aligned} \]
with weights

\[ \lambda_1 = \frac{A_{k+1} - A_k}{A_k} = \frac{\sqrt{q}}{1 - \sqrt{q}}, \quad \lambda_2 = \frac{A_k}{A_k} = 1, \quad \text{and} \quad \lambda_3 = \frac{A_{k+1}}{A_k} = (1 - \sqrt{q})^{-1}. \]
Substituting
\[ \begin{aligned} y_k &= x_k + \frac{\sqrt{q}}{1 + \sqrt{q}}(z_k - x_k), \\ x_{k+1} &= y_k - \frac{1}{L}\nabla f(y_k), \\ z_{k+1} &= (1 - \sqrt{q})z_k + \sqrt{q}\left( y_k - \frac{1}{\mu}\nabla f(y_k) \right), \end{aligned} \]
we arrive at the following valid inequality

\[ \begin{aligned} \rho^{-1}&\left( f(x_{k+1}) - f_\star + \frac{\mu}{2}\|z_{k+1} - x_\star\|^2 \right) \\ &\leq f(x_k) - f_\star + \frac{\mu}{2}\|z_k - x_\star\|^2 - \frac{\sqrt{q}}{(1 + \sqrt{q})^2}\frac{\mu}{2}\|x_k - z_k\|^2, \end{aligned} \]
where we reach the desired statement from the last term being nonpositive:

\[ \rho^{-1}\left( f(x_{k+1}) - f_\star + \frac{\mu}{2}\|z_{k+1} - x_\star\|^2 \right) \leq f(x_k) - f_\star + \frac{\mu}{2}\|z_k - x_\star\|^2. \]

Corollary 4.15. Let $f$ be an $L$-smooth $\mu$-strongly convex function, and $x_\star \in \mathrm{argmin}_x f(x)$. For all $N \in \mathbb{N}$, $N \geq 1$, the iterates of Algorithm 13 (or equivalently of Algorithm 14) satisfy
\[ f(x_N) - f_\star \leq (1 - \sqrt{q})^N \left( f(x_0) - f_\star + \frac{\mu}{2}\|x_0 - x_\star\|^2 \right), \]
with $q = \frac{\mu}{L}$.

Proof. It directly follows from Theorem 4.14 with

\[ \phi_k = \rho^{-k}\left( f(x_k) - f_\star + \frac{\mu}{2}\|z_k - x_\star\|^2 \right), \]
$\rho = 1 - \sqrt{q}$, $z_0 = x_0$, and $\rho^{-N}(f(x_N) - f_\star) \leq \phi_N$.

Remark 4.3. In view of Section 4.4.2, one can also find estimate sequence interpretations for Algorithms 12 and 13 from their respective potential functions.

Remark 4.4. A few works on accelerated methods focus on understanding this particular instance of Nesterov’s method. Our analysis here is largely inspired by that of Nesterov (2003), but such potentials can be obtained in different ways, see for example (Wilson et al., 2016; Shi et al., 2018; Bansal and Gupta, 2019).

4.6 Recent Variants of Accelerated Methods

In this section, we first push the potential function reasoning to its limit. We present the information-theoretic exact method (Taylor and Drori, 2021), which generalizes the optimized gradient method to the strongly convex case. Similar to Nesterov's method with constant momentum, the information-theoretic exact method has a limit case known as the triple momentum method (Van Scoy et al., 2017). We then discuss a more geometric variant, known as geometric descent (Bubeck et al., 2015), or quadratic averaging (Drusvyatskiy et al., 2018).

4.6.1 An Optimal Gradient Method for Strongly Convex Minimization

It turns out that there also exist optimal gradient methods for smooth strongly convex minimization, similar to the optimized gradient method for smooth convex minimization. Such methods can be obtained by solving a minimax problem similar to (4.7) with different objectives.

The following scheme is optimal for the criterion $\frac{\|z_N - x_\star\|^2}{\|x_0 - x_\star\|^2}$, reaching the exact worst-case lower complexity bound for this criterion, as discussed below. In addition, this method reduces to OGM (see Section 4.3) when $\mu = 0$, using the correspondence $A_{k+1} = 4\theta_{k,N}^2$ (for $k < N$). Therefore, this method is doubly optimal, i.e., optimal for two criteria, in the sense that it also achieves the lower complexity bound for $\frac{f(y_N) - f_\star}{\|x_0 - x_\star\|^2}$ when $\mu = 0$, using the last iteration adjustment from Lemma 4.5.

The following analysis is reminiscent of Nesterov's method in Algorithm 12, but also of the optimized gradient method and its proof (see Theorem 4.4). That is, the known potential function for ITEM relies on inequality (4.4), not only for its proof but also simply for showing that it is nonnegative, which follows from instantiating (4.4) at $y = x_\star$. The following analyses can be found almost verbatim in (Taylor and Drori, 2021). The main proof of this section is particularly algebraic, but can very reasonably be skipped, as it follows from ideas similar to previous developments.

Algorithm 15 Information-theoretic exact method (ITEM)

Input: An $L$-smooth $\mu$-strongly (possibly with $\mu = 0$) convex function $f$, initial $x_0$.
1: Initialize $z_0 = x_0$, $A_0 = 0$, and set $q = \mu/L$ (the inverse condition ratio).
2: for $k = 0, \ldots$ do
3: $A_{k+1} = \frac{(1+q)A_k + 2\left(1 + \sqrt{(1+A_k)(1+qA_k)}\right)}{(1-q)^2}$
4: Set $\tau_k = 1 - \frac{A_k}{(1-q)A_{k+1}}$ and $\delta_k = \frac{1}{2}\,\frac{(1-q)^2A_{k+1} - (1+q)A_k}{1 + q + qA_k}$
5: $y_k = x_k + \tau_k(z_k - x_k)$
6: $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$
7: $z_{k+1} = (1 - q\delta_k)z_k + q\delta_k y_k - \frac{\delta_k}{L}\nabla f(y_k)$
8: end for
Output: Approximate solutions $(y_k, x_{k+1}, z_{k+1})$.
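To fix ideas, the following minimal NumPy sketch implements the ITEM recursion above; the gradient oracle grad_f and the iteration budget are assumptions of the sketch, and setting mu = 0 recovers the smooth convex regime.

```python
import numpy as np

def item(grad_f, x0, L, mu, n_iter=200):
    """Minimal sketch of Algorithm 15 (ITEM); mu = 0 is allowed."""
    q = mu / L
    x = np.array(x0, dtype=float)
    z = x.copy()
    A = 0.0
    for _ in range(n_iter):
        # step 3: update of the sequence A_k
        A_next = ((1 + q) * A
                  + 2 * (1 + np.sqrt((1 + A) * (1 + q * A)))) / (1 - q) ** 2
        # step 4: coefficients tau_k and delta_k
        tau = 1 - A / ((1 - q) * A_next)
        delta = 0.5 * ((1 - q) ** 2 * A_next - (1 + q) * A) / (1 + q + q * A)
        y = x + tau * (z - x)                                       # step 5
        g = grad_f(y)
        x = y - g / L                                               # step 6
        z = (1 - q * delta) * z + q * delta * y - (delta / L) * g   # step 7
        A = A_next
    return y, x, z
```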

Theorem 4.16. Let $f$ be an $L$-smooth $\mu$-strongly (possibly with $\mu = 0$) convex function, $x_\star \in \mathrm{argmin}_x f(x)$, and $k \in \mathbb{N}$. For all $y_{k-1}, z_k \in \mathbb{R}^d$ and $A_k \geq 0$, the iterates of Algorithm 15 satisfy
\[ \phi_{k+1} \leq \phi_k, \]
with

\[ \begin{aligned} \phi_k ={}& A_k\left( f(y_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(y_{k-1})\|^2 - \frac{\mu}{2(1-q)}\left\| y_{k-1} - \frac{1}{L}\nabla f(y_{k-1}) - x_\star \right\|^2 \right) \\ &+ \frac{L + \mu A_k}{1 - q}\|z_k - x_\star\|^2, \end{aligned} \]

and $A_{k+1} = \frac{(1+q)A_k + 2\left(1 + \sqrt{(1+A_k)(1+qA_k)}\right)}{(1-q)^2}$.

Proof. Perform a weighted sum of the following two inequalities, both instances of (4.4).

• Smoothness and strong convexity between $y_{k-1}$ and $y_k$ with weight $\lambda_1 = A_k$
\[ \begin{aligned} f(y_{k-1}) \geq{}& f(y_k) + \langle \nabla f(y_k); y_{k-1} - y_k\rangle + \frac{1}{2L}\|\nabla f(y_k) - \nabla f(y_{k-1})\|^2 \\ &+ \frac{\mu}{2(1-q)}\left\| y_k - y_{k-1} - \frac{1}{L}(\nabla f(y_k) - \nabla f(y_{k-1})) \right\|^2, \end{aligned} \]
• smoothness and strong convexity of $f$ between $x_\star$ and $y_k$ with weight $\lambda_2 = A_{k+1} - A_k$
\[ f_\star \geq f(y_k) + \langle \nabla f(y_k); x_\star - y_k\rangle + \frac{1}{2L}\|\nabla f(y_k)\|^2 + \frac{\mu}{2(1-q)}\left\| y_k - x_\star - \frac{1}{L}\nabla f(y_k) \right\|^2. \]
Summing up and reorganizing those two inequalities (without substituting $A_{k+1}$ by its expression, for simplicity), we arrive at the following valid inequality

\[ \begin{aligned} 0 \geq{}& \lambda_1\bigg[ f(y_k) - f(y_{k-1}) + \langle \nabla f(y_k); y_{k-1} - y_k\rangle + \frac{1}{2L}\|\nabla f(y_k) - \nabla f(y_{k-1})\|^2 \\ &\qquad + \frac{\mu}{2(1-q)}\left\| y_k - y_{k-1} - \frac{1}{L}(\nabla f(y_k) - \nabla f(y_{k-1})) \right\|^2 \bigg] \\ &+ \lambda_2\bigg[ f(y_k) - f_\star + \langle \nabla f(y_k); x_\star - y_k\rangle + \frac{1}{2L}\|\nabla f(y_k)\|^2 \\ &\qquad + \frac{\mu}{2(1-q)}\left\| y_k - x_\star - \frac{1}{L}\nabla f(y_k) \right\|^2 \bigg]. \end{aligned} \]
Substituting the expressions of $y_k$ and $z_{k+1}$ with

\[ \begin{aligned} y_k &= (1 - \tau_k)\left( y_{k-1} - \frac{1}{L}\nabla f(y_{k-1}) \right) + \tau_k z_k, \\ z_{k+1} &= (1 - q\delta_k)z_k + q\delta_k y_k - \frac{\delta_k}{L}\nabla f(y_k) \end{aligned} \]
(noting that this substitution is valid even for $k = 0$, as $A_0 = 0$ in this case, and hence $\tau_0 = 1$ and $y_0 = z_0$), the previous inequality can be reformulated exactly as
\[ \phi_{k+1} \leq \phi_k - \frac{LK_1}{1-q}P(A_{k+1})\|z_k - x_\star\|^2 + \frac{K_2}{4L(1-q)}P(A_{k+1})\left\| (1-q)A_{k+1}\nabla f(y_k) - \mu A_k(x_k - x_\star) + K_3\mu(z_k - x_\star) \right\|^2, \]
with three parameters $K_1$, $K_2$, and $K_3$ (well defined given that $0 \leq \mu < L$),

as well as
\[ P(A_{k+1}) = (A_k - (1-q)A_{k+1})^2 - 4A_{k+1}(1 + qA_k). \]
For obtaining the desired inequality, we pick $A_{k+1}$ such that $A_{k+1} \geq A_k$ and $P(A_{k+1}) = 0$, reaching the claim $\phi_{k+1} \leq \phi_k$ as well as the choice for $A_{k+1}$.

The final bound for this method is obtained after the usual growth analysis of the sequence $A_k$, as follows. When $\mu = 0$, we have
\[ A_{k+1} = 2 + A_k + 2\sqrt{1 + A_k} \geq 2 + A_k + 2\sqrt{A_k} \geq \left( 1 + \sqrt{A_k} \right)^2, \]
reaching $\sqrt{A_{k+1}} \geq 1 + \sqrt{A_k}$, and hence $\sqrt{A_k} \geq k$ and $A_k \geq k^2$. When $\mu > 0$, we can use an alternate bound
\[ A_{k+1} = \frac{(1+q)A_k + 2\left(1 + \sqrt{(1+A_k)(1+qA_k)}\right)}{(1-q)^2} \geq \frac{(1+q)A_k + 2\sqrt{q}A_k}{(1-q)^2} = \frac{A_k}{(1-\sqrt{q})^2}, \]
therefore reaching similar bounds as before. In this case, we only emphasize convergence results for $\|z_N - x_\star\|$, as it corresponds to the lower complexity bound for smooth strongly convex minimization (provided below).

Corollary 4.17. Let $f \in \mathcal{F}_{\mu,L}$ and denote $q = \mu/L$. For any $x_0 = z_0 \in \mathbb{R}^d$ and $N \in \mathbb{N}$ with $N \geq 1$, the iterates of Algorithm 15 satisfy
\[ \|z_N - x_\star\|^2 \leq \frac{1}{1 + qA_N}\|z_0 - x_\star\|^2 \leq \frac{(1-\sqrt{q})^{2N}}{(1-\sqrt{q})^{2N} + q}\|z_0 - x_\star\|^2. \]
Proof. From Theorem 4.16, we get

\[ \phi_N \leq \phi_{N-1} \leq \ldots \leq \phi_0 = \frac{L}{1-q}\|z_0 - x_\star\|^2. \]
From (4.4), we have that $\phi_N \geq \frac{L + A_N\mu}{1-q}\|z_N - x_\star\|^2$. It remains to use the bounds on $A_N$. That is, using $A_1 = \frac{4}{(1-q)^2} = \frac{4}{(1+\sqrt{q})^2(1-\sqrt{q})^2} \geq (1-\sqrt{q})^{-2}$, we have that $A_N \geq (1-\sqrt{q})^{-2N}$, which concludes the proof.

Before concluding, let us mention that the algorithm is non-improvable for minimizing large-scale smooth strongly convex functions, in the following sense.

Theorem 4.18. (Drori and Taylor, 2021, Corollary 4) Let $0 \leq \mu < L$, $N \in \mathbb{N}$, and $d \geq N + 2$. For any black-box first-order method generating iterates satisfying $z_N \in z_0 + \mathrm{span}\{\nabla f(z_0), \ldots, \nabla f(z_{N-1})\}$, there exists $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^d)$ with minimizer $x_\star$ such that
\[ \|z_N - x_\star\|^2 \geq \frac{1}{1 + qA_N}\|z_0 - x_\star\|^2, \]
with $A_N$ as in Algorithm 15; that is, the guarantee of Corollary 4.17 cannot be improved.

Remark 4.5. Just as for the optimized gradient method from Section 4.3, this method might serve as a template for designing other accelerated schemes. However, it has the same caveats as the optimized gradient method, also similar to those of the triple mo- mentum method, presented in the next section. As emphasized in Section 4.3.3, it is unclear how to generalize it to broader classes of problems, e.g., involving constraints.

4.6.2 The Triple Momentum Method

The triple momentum method (TMM) is due to Van Scoy et al. (2017) and is reminiscent of Nesterov's method with constant momentum, provided in Algorithm 13. It corresponds to the asymptotic behavior of the information-theoretic exact method. Indeed, considering Algorithm 15, one can explicitly compute

\[ \lim_{A_k \to \infty} \frac{A_{k+1}}{A_k} = (1 - \sqrt{q})^{-2}, \]
as well as
\[ \lim_{A_k \to \infty} \tau_k = 1 - \frac{1 - \sqrt{q}}{1 + \sqrt{q}}, \qquad \lim_{A_k \to \infty} \delta_k = \frac{1}{\sqrt{q}}, \]
leading to Algorithm 16.

Algorithm 16 Triple momentum method (TMM)

Input: An $L$-smooth $\mu$-strongly convex function $f$, initial point $x_0$.
1: Initialize $y_{-1} = z_0 = x_0$, and set $q = \mu/L$ (the inverse condition ratio).
2: for $k = 0, \ldots$ do
3: $y_k = \frac{1 - \sqrt{q}}{1 + \sqrt{q}}\left( y_{k-1} - \frac{1}{L}\nabla f(y_{k-1}) \right) + \left( 1 - \frac{1 - \sqrt{q}}{1 + \sqrt{q}} \right) z_k$
4: $z_{k+1} = \sqrt{q}\left( y_k - \frac{1}{\mu}\nabla f(y_k) \right) + (1 - \sqrt{q})z_k$
5: end for
Output: An approximate solution $y_k$ (or $z_{k+1}$).
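A minimal NumPy sketch of Algorithm 16 follows; the oracle name grad_f and the iteration budget are assumptions of the sketch, and $\mu > 0$ is required.

```python
import numpy as np

def triple_momentum(grad_f, x0, L, mu, n_iter=200):
    """Minimal sketch of Algorithm 16 (TMM); requires mu > 0."""
    q = mu / L
    sq = np.sqrt(q)
    beta = (1 - sq) / (1 + sq)
    y_prev = np.array(x0, dtype=float)   # y_{-1} = z_0 = x_0
    z = y_prev.copy()
    g_prev = grad_f(y_prev)              # gradient at y_{-1}
    for _ in range(n_iter):
        y = beta * (y_prev - g_prev / L) + (1 - beta) * z   # step 3
        g = grad_f(y)
        z = sq * (y - g / mu) + (1 - sq) * z                # step 4
        y_prev, g_prev = y, g
    return y_prev
```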

As for the information-theoretic exact method, the known potential function for the triple momentum method relies on inequality (4.4), not only for its proof but also simply for showing that it is nonnegative, which follows from instantiating (4.4) at y = x⋆.

Theorem 4.19. Let $f$ be an $L$-smooth $\mu$-strongly convex function, $x_\star = \mathrm{argmin}_x f(x)$, and $k \in \mathbb{N}$. For any $y_{k-1}, z_k \in \mathbb{R}^d$, the iterates of Algorithm 16 satisfy
\[ \begin{aligned} f(y_k) &- f_\star - \frac{1}{2L}\|\nabla f(y_k)\|^2 - \frac{\mu}{2(1-q)}\left\| y_k - x_\star - \frac{1}{L}\nabla f(y_k) \right\|^2 + \frac{\mu}{1-q}\|z_{k+1} - x_\star\|^2 \\ \leq{}& \rho^2\bigg( f(y_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(y_{k-1})\|^2 - \frac{\mu}{2(1-q)}\left\| y_{k-1} - x_\star - \frac{1}{L}\nabla f(y_{k-1}) \right\|^2 \\ &\qquad + \frac{\mu}{1-q}\|z_k - x_\star\|^2 \bigg), \end{aligned} \]
with $\rho = 1 - \sqrt{q}$.

Proof. For simplicity, let us consider Algorithm 16 in the following form, parameterized by $\rho$:
\[ \begin{aligned} y_k &= \frac{\rho}{2 - \rho}\left( y_{k-1} - \frac{1}{L}\nabla f(y_{k-1}) \right) + \left( 1 - \frac{\rho}{2 - \rho} \right) z_k, \\ z_{k+1} &= (1 - \rho)\left( y_k - \frac{1}{\mu}\nabla f(y_k) \right) + \rho z_k. \end{aligned} \]
We combine the following two inequalities:

• smoothness and strong convexity between $y_{k-1}$ and $y_k$ with weight $\lambda_1 = \rho^2$

\[ \begin{aligned} f(y_{k-1}) \geq{}& f(y_k) + \langle \nabla f(y_k); y_{k-1} - y_k\rangle + \frac{1}{2L}\|\nabla f(y_k) - \nabla f(y_{k-1})\|^2 \\ &+ \frac{\mu}{2(1-q)}\left\| y_k - y_{k-1} - \frac{1}{L}(\nabla f(y_k) - \nabla f(y_{k-1})) \right\|^2, \end{aligned} \]
• smoothness and strong convexity between $x_\star$ and $y_k$ with weight $\lambda_2 = 1 - \rho^2$
\[ f_\star \geq f(y_k) + \langle \nabla f(y_k); x_\star - y_k\rangle + \frac{1}{2L}\|\nabla f(y_k)\|^2 + \frac{\mu}{2(1-q)}\left\| y_k - x_\star - \frac{1}{L}\nabla f(y_k) \right\|^2. \]
After some algebra, the weighted sum can be reformulated exactly as (it is simpler not to use the expression of $\rho$ to verify this)
\[ \begin{aligned} f(y_k) &- f_\star - \frac{1}{2L}\|\nabla f(y_k)\|^2 - \frac{\mu}{2(1-q)}\left\| y_k - x_\star - \frac{1}{L}\nabla f(y_k) \right\|^2 + \frac{\mu}{1-q}\|z_{k+1} - x_\star\|^2 \\ \leq{}& \rho^2\bigg( f(y_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(y_{k-1})\|^2 - \frac{\mu}{2(1-q)}\left\| y_{k-1} - x_\star - \frac{1}{L}\nabla f(y_{k-1}) \right\|^2 \\ &\qquad + \frac{\mu}{1-q}\|z_k - x_\star\|^2 \bigg) \\ &- \frac{(q - (\rho - 1)^2)\,\rho}{(\rho - 2)(1-q)}\left\langle \nabla f(y_k);\ \rho(y_{k-1} - x_\star) - 2(\rho - 1)(z_k - x_\star) - \frac{\rho}{L}\nabla f(y_{k-1}) \right\rangle \\ &- \frac{q - (\rho - 1)^2}{(1-q)\,\mu}\|\nabla f(y_k)\|^2. \end{aligned} \]
Using the expression $\rho = 1 - \sqrt{q}$, the last two terms cancel (as $(\rho - 1)^2 = q$), and we arrive at the desired
\[ \begin{aligned} f(y_k) &- f_\star - \frac{1}{2L}\|\nabla f(y_k)\|^2 - \frac{\mu}{2(1-q)}\left\| y_k - x_\star - \frac{1}{L}\nabla f(y_k) \right\|^2 + \frac{\mu}{1-q}\|z_{k+1} - x_\star\|^2 \\ \leq{}& \rho^2\bigg( f(y_{k-1}) - f_\star - \frac{1}{2L}\|\nabla f(y_{k-1})\|^2 - \frac{\mu}{2(1-q)}\left\| y_{k-1} - x_\star - \frac{1}{L}\nabla f(y_{k-1}) \right\|^2 \\ &\qquad + \frac{\mu}{1-q}\|z_k - x_\star\|^2 \bigg). \end{aligned} \]

Remark 4.6. This method was proposed in (Van Scoy et al., 2017). It was further studied in Cyrus et al. (2018) from a robust control perspective, and by Zhou et al. (2020) as an accelerated method for a different objective. The triple momentum method can also be obtained as a time-independent optimized gradient method (Lessard and Seiler, 2020). Of course, all drawbacks that were valid for OGM and ITEM also apply to the triple momentum method, so the same questions related to generalizations of this scheme remain open. Furthermore, this method is defined only for µ> 0, like Nesterov’s method with constant momentum.

4.6.3 Quadratic Averaging and Geometric Descent

Accelerated methods tailored for the strongly convex setting, such as Algorithms 13 and 16, make use of two kinds of step sizes for updating the different sequences. First, they perform small gradient steps with step size $1/L$; such steps correspond to minimizing quadratic upper bounds (4.2). Second, they use so-called large steps of size $1/\mu$, which correspond to minimizing quadratic lower bounds (4.3). This idea can be exploited for developing more geometric accelerated methods. In particular, Drusvyatskiy et al. (2018) propose a method based on quadratic averaging, similar in shape to Algorithm 13, but where the last sequence $z_k$ is explicitly computed as the minimum of a quadratic lower bound on the function (the coefficients arising in the computation of $z_k$ are then picked dynamically to maximize this lower bound). One of those lower bounds is that of the previous iteration, whose minimum is achieved at $z_k$, and the other one is $f(y_k) + \langle \nabla f(y_k), x - y_k\rangle + \frac{\mu}{2}\|x - y_k\|^2$, whose minimum is $y_k - \frac{1}{\mu}\nabla f(y_k)$. Because of the specific shapes of those lower bounds, their convex combination has a minimum which is the convex combination of their respective minima, with the same weights, motivating the update rule $z_k = \lambda\left( y_k - \frac{1}{\mu}\nabla f(y_k) \right) + (1 - \lambda)z_{k-1}$, with dynamically chosen $\lambda$ (maximizing the minimum value of the new under-approximation).

Alternatively, the sequence $z_k$ can be interpreted in terms of a localization method, tracking $x_\star$ using intersections of balls. In this case, the sequence $z_k$ corresponds to centers of the balls containing $x_\star$. This alternative is referred to as geometric descent (Bubeck et al., 2015), and the $\lambda$ are picked to minimize the radius of the ball, centered at $z_k = \lambda\left( y_k - \frac{1}{\mu}\nabla f(y_k) \right) + (1 - \lambda)z_{k-1}$, while making sure it contains $x_\star$. Geometric descent and quadratic averaging produce the same sequences of iterates, as shown in (Drusvyatskiy et al., 2018, Theorem 4.5). Geometric descent is detailed in (Bubeck et al., 2015) and (Bubeck, 2015, Section 3.6.3). It was extended to handle constraints (Chen et al., 2017), and studied using the same Lyapunov function as that used in Theorem 4.14 (Karimi and Vavasis, 2017).

4.7 Practical Extensions

Our goal now is to detail a few of the many extensions of Nesterov's accelerated methods. We see below that additional elements can be incorporated into the accelerated methods

above while keeping exactly the same proof structures. That is, we perform weighted sums involving the exact same inequalities for the smooth (possibly strongly) convex functions, and only use a few additional inequalities for taking the new elements into account. We also intend to provide intuitions, along with bibliographical pointers for going further. The following scenarios are particularly important in practice.

Constraints. In the presence of constraints, or nonsmooth functions, a common approach is to resort to (proximal) splitting techniques. This idea is not recent, see e.g. (Douglas and Rachford, 1956; Glowinski and Marroco, 1975; Lions and Mercier, 1979), but is still very relevant in signal processing, computer vision, and statistical learning (Peyré, 2011; Parikh and Boyd, 2014; Chambolle and Pock, 2016; Fessler, 2020).

Adaptation. Problem constants, such as smoothness or strong convexity parameters, are generally unknown. Furthermore, their local values are often much more advantageous than their typically conservative global counterparts. In practice, smoothness constants are estimated on the fly using backtracking line-search strategies, see e.g. (Goldstein, 1962; Armijo, 1966; Nesterov, 1983). Strong convexity, or the more general Hölderian error bounds (see Chapter 6), are on the other hand more difficult to estimate, and restart schemes are often used to adapt to these additional regularity properties, see e.g. (Nesterov, 2013; Becker et al., 2011; O'Donoghue and Candes, 2015; Roulet and d'Aspremont, 2017). This technique is the workhorse of Chapter 6.

Non-Euclidean settings. Although we only briefly mention this topic, taking into account the geometry of the problem at hand is in general key to obtaining good empirical performance. In particular, optimization problems are often more naturally formulated in a non-Euclidean space, with non-Euclidean norms producing a better implicit model for the function. A popular method for this setting is commonly known as mirror descent (Nemirovsky and Yudin, 1983a)—see also (Ben-Tal and Nemirovsky, 2001; Juditsky and Nemirovsky, 2011a)—which we do not intend to detail at length here. Good surveys are provided by Beck and Teboulle (2003) and Bubeck (2015). However, we do detail an accelerated method in this setting in Section 4.7.3.

4.7.1 Handling Nonsmooth Terms/Constraints

In this section, we consider the problem of minimizing a sum of two convex functions

\[ F_\star = \min_{x \in \mathbb{R}^d} \{ F(x) \equiv f(x) + h(x) \}, \tag{4.19} \]
where $f$ is $L$-smooth and (possibly $\mu$-strongly) convex, and where $h$ is convex, closed, and proper (CCP), which we denote by $f \in \mathcal{F}_{\mu,L}$ and $h \in \mathcal{F}_{0,\infty}$. Those are technical conditions to ensure that the proximal operator, defined hereafter, is well defined everywhere on $\mathbb{R}^d$. We refer to the nice introduction by Ryu and Boyd (2016) and the references

therein for further details. In addition, we assume a proximal operator of $h$ to be readily available, so that
\[ \mathrm{prox}_{\gamma h}(x) = \mathrm{argmin}_{y \in \mathbb{R}^d} \left\{ \gamma h(y) + \tfrac{1}{2}\|x - y\|^2 \right\} \tag{4.20} \]
can be evaluated efficiently (Chapter 5 deals with some cases where this operator is approximated; see also the discussions in Section 4.8). The proximal operator can be seen as an implicit (sub)gradient step on $h$, as dictated by the optimality conditions of the proximal operation:

\[ x_+ = \mathrm{prox}_{\gamma h}(x) \ \Leftrightarrow\ x_+ = x - \gamma g_h(x_+) \ \text{with} \ g_h(x_+) \in \partial h(x_+). \]
In particular, when $h(x)$ is the indicator function of a closed convex set $Q$, the proximal operation corresponds to the orthogonal projection onto $Q$. There are a few commonly used functions for which the proximal operation has an analytical solution, such as $h(x) = \|x\|_1$; see for instance the list provided by Chierchia et al. (2020). In the proofs below, we incorporate $h$ using inequalities characterizing convexity, that is

\[ h(x) \geq h(y) + \langle g_h(y); x - y\rangle, \]
with $g_h(y) \in \partial h(y)$ some subgradient of $h$ at $y$.

In this setting, classical methods for solving (4.19) consist in using a forward-backward splitting strategy (in other words, forward, or gradient, steps on $f$, and backward, proximal, or implicit, steps on $h$), introduced by Passty (1979). This topic is addressed in many references, and we refer to (Parikh and Boyd, 2014; Ryu and Boyd, 2016) and the references therein for further details. In the context of accelerated methods, forward-backward splittings were introduced by Nesterov (2003) and Nesterov (2013) through the concept of gradient mapping; see also Tseng (2008) and Beck and Teboulle (2009). Problem (4.19) is also sometimes referred to as the composite convex optimization setting (Nesterov, 2013). Depending on the assumptions made on $f$ and $h$, there are of course alternate ways of solving this problem—for example, when the proximal operator is available for both, one could use the Douglas-Rachford splitting (Douglas and Rachford, 1956; Lions and Mercier, 1979)—but this goes beyond the purpose of this chapter, and we refer to (Ryu and Boyd, 2016; Condat et al., 2019) and the references therein for further discussions on those topics.
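As a concrete illustration of the closed-form cases just mentioned, the following minimal Python sketch (the helper names are ours, not from the original text) implements the proximal operator of $\gamma\|\cdot\|_1$, whose closed form is the soft-thresholding operator, together with a single forward-backward step built on it.

```python
import numpy as np

def prox_l1(x, gamma):
    """prox_{gamma*||.||_1}(x): coordinate-wise soft-thresholding,
    the closed-form solution of argmin_y gamma*||y||_1 + 0.5*||x - y||^2."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def forward_backward_step(x, grad_f, L, prox_h):
    """One forward-backward step on F = f + h: a forward (gradient) step
    on f with step size 1/L, then a backward (proximal) step on h."""
    return prox_h(x - grad_f(x) / L, 1.0 / L)

# Illustrative check of the soft-thresholding formula.
print(prox_l1(np.array([3.0, -0.2, 0.5]), 1.0))   # [2. -0.  0.]
```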

4.7.2 Adaptation to Unknown Regularity Parameters

In the previous sections, we assumed $f$ to be $L$-smooth and possibly $\mu$-strongly convex, and all previous algorithms explicitly used the values of both $L$ and $\mu$. This is not a desirable feature. First, it means that we need to be able to estimate valid values for $L$ and $\mu$. Second, it means that we are not adaptive to potentially better (local) values of those parameters. That is, we will not benefit from problems being simpler than specified, i.e., where the smallest valid $L$ is much smaller than our estimate, and/or the largest valid $\mu$ is much larger than our estimate. We also want to benefit from the typically better local properties of the function at hand, along

the path taken by the method, rather than global ones. The difference between local and global regularity properties is often significant, meaning that adaptive methods often converge much faster in practice.

We discuss below how adaptation is implemented for the smoothness constant, using line-search techniques. However, it is still an open issue whether strong convexity parameters can be efficiently estimated while keeping reasonable worst-case guarantees, without resorting to restart schemes (i.e., outer iterations); see Chapter 6.

To handle unknown parameters, the key is to look at the inequalities used in the proofs of the desired method. It turns out that smoothness is usually only used in inequalities between pairs of iterates, which means that these inequalities can be tested during the iterations. For our guarantees to hold, we therefore do not need the function to be $L$-smooth everywhere, but only need some inequality to hold for the value of $L$ we are using (smoothness of the function ensures that such an $L$ exists). Strong convexity, however, is typically only used in inequalities involving the optimal point (see, for example, the proof of Theorem 4.12), which we do not know a priori. Therefore, those inequalities cannot be tested as the algorithm proceeds, which complicates the estimation of strong convexity while running the algorithm. Adaptation to strong convexity is therefore typically done via the use of restarts. Those approaches are common, not new (Goldstein, 1962; Armijo, 1966), and were already used by Nesterov (1983). They were adapted to the forward-backward setting in Nesterov (2013) and Beck and Teboulle (2009), and further exploited to improve performance in various settings, see e.g. (Scheinberg et al., 2014). The topic is further discussed in the next section, as well as in the notes and references provided in Section 4.8.

An Accelerated Forward-Backward Method with Backtracking

As discussed above, the smoothness constant $L$ is used very sparsely in the proofs, both for gradient descent (see Theorem 4.2 and Theorem 4.10) and accelerated variants (Theorem 4.8, Theorem 4.12, and Theorem 4.14). Essentially, it is used only in three places: (i) to compute $A_{k+1}$ (actually only when $\mu > 0$), (ii) to compute $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$, and (iii) in the inequality
\[ f(x_{k+1}) \leq f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k\rangle + \frac{L}{2}\|x_{k+1} - y_k\|^2 \tag{4.21} \]
(a.k.a. the descent lemma, as substituting the gradient step $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$ allows writing it as $f(x_{k+1}) \leq f(y_k) - \frac{1}{2L}\|\nabla f(y_k)\|^2$). In addition, beyond $L$, (4.21) only contains information which we observe; hence we can simply check whether this inequality holds for a given estimate of $L$. When it does not, we increase the current approximation of $L$ and recompute, using this new estimate, (i) $A_{k+1}$ (necessary only if $\mu > 0$) and the corresponding $y_k$, and (ii) $x_{k+1}$ using the new step size. We then check whether (4.21) is satisfied, again. If it is, we can proceed (the potential inequality is verified); otherwise we keep increasing our approximation of $L$ until the descent condition (4.21) is satisfied. Finally, for guaranteeing that we only perform

a finite number of "wasted" gradient steps while estimating $L$, we need an appropriate rule for increasing it. It is common to simply multiply the current approximation by some constant $\alpha > 1$, which guarantees that we waste at most $\lceil \log_\alpha \frac{L}{L_0} \rceil$ gradient steps in the process, where $L$ is the true smoothness constant and $L_0$ is our starting estimate. As we see below, both backtracking and nonsmooth terms require proofs very similar to those presented above, and potential-based analyses are still well suited here. We present two particular extensions of Nesterov's first method that handle a nonsmooth term and feature a backtracking procedure on the smoothness parameter. The first one, FISTA, is particularly popular, while the second one fixes a potential issue arising in the original FISTA.

FISTA. The fast iterative shrinkage-thresholding algorithm (FISTA), due to Beck and Teboulle (2009), is a natural extension of Nesterov (1983) in its first form (see Algorithm 12), handling an additional nonsmooth term. In this section, we present a strongly convex variant of FISTA, provided by Algorithm 17. The proof contains the same ingredients as in the original work, and can easily be compared to previous material.

Algorithm 17 Strongly convex FISTA, form I
Input: An $L$-smooth $\mu$-strongly (possibly with $\mu = 0$) convex function $f$, a convex function $h$ with proximal operator available, an initial point $x_0$, and an initial estimate $L_0 > \mu$.
1: Initialize $z_0 = x_0$, $A_0 = 0$, and some $\alpha > 1$.
2: for $k = 0, \ldots$ do
3: $L_{k+1} = L_k$ {Alternate strategies are provided in Remark 4.7.}
4: loop
5: $q_{k+1} = \mu/L_{k+1}$
6: $A_{k+1} = \frac{2A_k + 1 + \sqrt{4A_k + 4q_{k+1}A_k^2 + 1}}{2(1 - q_{k+1})}$
7: set $\tau_k = \frac{(A_{k+1} - A_k)(1 + q_{k+1}A_k)}{A_{k+1} + 2q_{k+1}A_kA_{k+1} - q_{k+1}A_k^2}$ and $\delta_k = \frac{A_{k+1} - A_k}{1 + q_{k+1}A_{k+1}}$
8: $y_k = x_k + \tau_k(z_k - x_k)$
9: $x_{k+1} = \mathrm{prox}_{h/L_{k+1}}\left( y_k - \frac{1}{L_{k+1}}\nabla f(y_k) \right)$
10: $z_{k+1} = (1 - q_{k+1}\delta_k)z_k + q_{k+1}\delta_k y_k + \delta_k(x_{k+1} - y_k)$
11: if (4.21) holds then
12: break {Iterates accepted; $k$ will be incremented.}
13: else
14: $L_{k+1} = \alpha L_{k+1}$ {Iterates not accepted; recompute new $L_{k+1}$.}
15: end if
16: end loop
17: end for
Output: An approximate solution $x_{k+1}$.
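The following minimal NumPy sketch mirrors Algorithm 17; the oracles f, grad_f, and prox_h(x, gamma) (computing $\mathrm{prox}_{\gamma h}(x)$) are assumed inputs, and the loop structure simply follows the pseudo-code above.

```python
import numpy as np

def fista_strongly_convex(f, grad_f, prox_h, x0, L0, mu, alpha=2.0, n_iter=100):
    """Minimal sketch of Algorithm 17 (strongly convex FISTA, backtracked)."""
    x = np.array(x0, dtype=float)
    z = x.copy()
    A, L = 0.0, L0
    for _ in range(n_iter):
        while True:                                  # backtracking loop
            q = mu / L
            A_next = (2 * A + 1
                      + np.sqrt(4 * A + 4 * q * A ** 2 + 1)) / (2 * (1 - q))
            tau = ((A_next - A) * (1 + q * A)
                   / (A_next + 2 * q * A * A_next - q * A ** 2))
            delta = (A_next - A) / (1 + q * A_next)
            y = x + tau * (z - x)
            g = grad_f(y)
            x_next = prox_h(y - g / L, 1.0 / L)
            dx = x_next - y
            if f(x_next) <= f(y) + g @ dx + (L / 2) * (dx @ dx):  # test (4.21)
                break                                # iterates accepted
            L = alpha * L                            # estimate too small; retry
        z = (1 - q * delta) * z + q * delta * y + delta * (x_next - y)
        x, A = x_next, A_next
    return x
```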

The proof follows exactly the same lines as those of Theorem 4.12 (Nesterov's method

for strongly convex functions), but accounts for the nonsmooth function $h$ (observe that the potential is stated in terms of $F$ and not $f$). Two additional inequalities, involving convexity of $h$ between two different pairs of points, allow taking this nonsmooth term into account. In this case, $f$ is assumed to be smooth and convex over $\mathbb{R}^d$ (i.e., it has full domain, $\mathrm{dom}\, f = \mathbb{R}^d$), and we are therefore allowed to evaluate gradients of $f$ outside of the domain of $h$.

Theorem 4.20. Let $f \in \mathcal{F}_{\mu,L}$ (with full domain; $\mathrm{dom}\, f = \mathbb{R}^d$), $h$ be a closed, proper, and convex function, $x_\star \in \mathrm{argmin}_x \{f(x) + h(x)\}$, and $k \in \mathbb{N}$. For any $x_k, z_k \in \mathbb{R}^d$ and $A_k \geq 0$, the iterates of Algorithm 17 satisfying (4.21) also satisfy
\[ A_{k+1}(F(x_{k+1}) - F_\star) + \frac{L_{k+1} + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \leq A_k(F(x_k) - F_\star) + \frac{L_{k+1} + \mu A_k}{2}\|z_k - x_\star\|^2, \]
with $A_{k+1} = \frac{2A_k + 1 + \sqrt{4A_k + 4q_{k+1}A_k^2 + 1}}{2(1 - q_{k+1})}$ and $q_{k+1} = \mu/L_{k+1}$.

Proof. The proof consists in a weighted sum of the following inequalities:
• strong convexity of $f$ between $x_\star$ and $y_k$ with weight $\lambda_1 = A_{k+1} - A_k$
\[ f(x_\star) \geq f(y_k) + \langle \nabla f(y_k); x_\star - y_k\rangle + \frac{\mu}{2}\|x_\star - y_k\|^2, \]

• convexity of $f$ between $x_k$ and $y_k$ with weight $\lambda_2 = A_k$
\[ f(x_k) \geq f(y_k) + \langle \nabla f(y_k); x_k - y_k\rangle, \]

• smoothness of $f$ between $y_k$ and $x_{k+1}$ (descent lemma) with weight $\lambda_3 = A_{k+1}$
\[ f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k\rangle + \frac{L_{k+1}}{2}\|x_{k+1} - y_k\|^2 \geq f(x_{k+1}), \]
• convexity of $h$ between $x_\star$ and $x_{k+1}$ with weight $\lambda_4 = A_{k+1} - A_k$
\[ h(x_\star) \geq h(x_{k+1}) + \langle g_h(x_{k+1}); x_\star - x_{k+1}\rangle, \]
with $g_h(x_{k+1}) \in \partial h(x_{k+1})$ and $x_{k+1} = y_k - \frac{1}{L_{k+1}}(\nabla f(y_k) + g_h(x_{k+1}))$,
• convexity of $h$ between $x_k$ and $x_{k+1}$ with weight $\lambda_5 = A_k$
\[ h(x_k) \geq h(x_{k+1}) + \langle g_h(x_{k+1}); x_k - x_{k+1}\rangle. \]
We get the following inequality
\[ \begin{aligned} 0 \geq{}& \lambda_1\left[ f(y_k) - f(x_\star) + \langle \nabla f(y_k); x_\star - y_k\rangle + \tfrac{\mu}{2}\|x_\star - y_k\|^2 \right] \\ &+ \lambda_2\left[ f(y_k) - f(x_k) + \langle \nabla f(y_k); x_k - y_k\rangle \right] \\ &+ \lambda_3\left[ f(x_{k+1}) - \left( f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k\rangle + \tfrac{L_{k+1}}{2}\|x_{k+1} - y_k\|^2 \right) \right] \\ &+ \lambda_4\left[ h(x_{k+1}) - h(x_\star) + \langle g_h(x_{k+1}); x_\star - x_{k+1}\rangle \right] \\ &+ \lambda_5\left[ h(x_{k+1}) - h(x_k) + \langle g_h(x_{k+1}); x_k - x_{k+1}\rangle \right]. \end{aligned} \]

Substituting $y_k$, $x_{k+1}$, and $z_{k+1}$ using
\[ \begin{aligned} y_k &= x_k + \tau_k(z_k - x_k), \\ x_{k+1} &= y_k - \frac{1}{L_{k+1}}(\nabla f(y_k) + g_h(x_{k+1})), \\ z_{k+1} &= (1 - q_{k+1}\delta_k)z_k + q_{k+1}\delta_k y_k + \delta_k(x_{k+1} - y_k), \end{aligned} \]
the previous weighted sum can be reformulated exactly as
\[ \begin{aligned} A_{k+1}&(f(x_{k+1}) + h(x_{k+1}) - f(x_\star) - h(x_\star)) + \frac{L_{k+1} + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \\ \leq{}& A_k(f(x_k) + h(x_k) - f(x_\star) - h(x_\star)) + \frac{L_{k+1} + \mu A_k}{2}\|z_k - x_\star\|^2 \\ &+ \frac{(A_k - A_{k+1})^2 - A_{k+1} - q_{k+1}A_{k+1}^2}{1 + q_{k+1}A_{k+1}}\, \frac{1}{2L_{k+1}}\|\nabla f(y_k) + g_h(x_{k+1})\|^2 \\ &- \frac{A_k^2(A_{k+1} - A_k)(1 + q_{k+1}A_k)(1 + q_{k+1}A_{k+1})}{(A_{k+1} + 2q_{k+1}A_kA_{k+1} - q_{k+1}A_k^2)^2}\, \frac{\mu}{2}\|x_k - z_k\|^2. \end{aligned} \]
Using $0 \leq q_{k+1} \leq 1$ and picking $A_{k+1}$ such that $A_{k+1} \geq A_k$ and
\[ (A_k - A_{k+1})^2 - A_{k+1} - q_{k+1}A_{k+1}^2 = 0 \]
yields the desired result
\[ \begin{aligned} A_{k+1}&(f(x_{k+1}) + h(x_{k+1}) - f(x_\star) - h(x_\star)) + \frac{L_{k+1} + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \\ &\leq A_k(f(x_k) + h(x_k) - f(x_\star) - h(x_\star)) + \frac{L_{k+1} + \mu A_k}{2}\|z_k - x_\star\|^2. \end{aligned} \]

Finally, we obtain a complexity guarantee by adapting the potential argument (4.5), and by noting that $A_{k+1}$ is a decreasing function of $L_{k+1}$ (whose maximal value is $\alpha L$, assuming $L_0 < L$; otherwise its maximal value is $L_0$). The growth rate of $A_k$ in the smooth convex setting remains unchanged, see (4.14); however, its geometric growth rate is slightly degraded to
\[ A_{k+1} \geq \left( 1 - \sqrt{\frac{\mu}{\alpha L}} \right)^{-1} A_k, \]
which remains better than the $(1 - \frac{\mu}{\alpha L})$ rate of gradient descent with backtracking (assuming in both cases again that $L_0 < L$).

Corollary 4.21. Let $f \in \mathcal{F}_{\mu,L}$ (with full domain; $\mathrm{dom}\, f = \mathbb{R}^d$), $h$ be a closed, proper, and convex function, and $x_\star \in \mathrm{argmin}_x \{f(x) + h(x)\}$. For all $N \in \mathbb{N}$, $N \geq 1$, the iterates of Algorithm 17 satisfy
\[ F(x_N) - F_\star \leq \min\left\{ \frac{2}{N^2},\ \left( 1 - \sqrt{\frac{\mu}{\ell}} \right)^N \right\} \ell\,\|x_0 - x_\star\|^2, \]
with $\ell = \max\{\alpha L, L_0\}$.

Proof. We assume $L > L_0$, as otherwise $f \in \mathcal{F}_{\mu,L_0}$ and the proof directly follows from the case without backtracking. Define
\[ \phi_k = A_k(F(x_k) - F_\star) + \frac{L_k + \mu A_k}{2}\|z_k - x_\star\|^2. \]
Since $L_{k+1}/L_k \geq 1$, we have (using Theorem 4.20)
\[ \phi_{k+1} \leq A_k(F(x_k) - F_\star) + \frac{L_{k+1} + \mu A_k}{2}\|z_k - x_\star\|^2 \leq \frac{L_{k+1}}{L_k}\phi_k. \]
The chained potential argument (4.5) can now be adapted to get

\[ A_N(F(x_N) - F_\star) \leq \phi_N \leq \frac{L_N}{L_{N-1}}\phi_{N-1} \leq \frac{L_N}{L_{N-2}}\phi_{N-2} \leq \ldots \leq \frac{L_N}{L_0}\phi_0, \]
where we used Theorem 4.20 and the fact that the output of the algorithm satisfies (4.21). Using $A_0 = 0$, we reach
\[ F(x_N) - F_\star \leq \frac{L_N\|x_0 - x_\star\|^2}{2A_N}. \]

Using our previous bounds on $A_N$ (noting that $A_{k+1}$ is a decreasing function of $L_{k+1}$), e.g., from Corollary 4.13, along with the fact that the estimated smoothness constant cannot exceed $\alpha$ times the true constant ($L_N < \alpha L$ in the worst case, except if $L_0$ was already larger than the true $L$, in which case $L_N = L_0$), we get $L_N \leq \ell = \max\{\alpha L, L_0\}$, yielding the desired result.

Remark 4.7. There are two common variations around the backtracking strategy presented in this section. One can for example reset $L_{k+1} \leftarrow L_0$ (at line 3 of Algorithm 17) at each iteration, potentially using a total of $N\lceil \log_\alpha \frac{L}{L_0} \rceil$ additional gradient evaluations over all iterations. Another possibility is to pick some additional constant $0 < \beta < 1$ and to initialize $L_{k+1} \leftarrow \beta L_k$ (at line 3 of Algorithm 17). In the case $\beta = 1/\alpha$, this strategy potentially costs one additional gradient evaluation per iteration due to the backtracking strategy, potentially using a total of $N + \lceil \log_\alpha \frac{L}{L_0} \rceil$ additional gradient evaluations over all iterations. Incorporating such non-monotonic estimations of $L$ can be done at low additional technical cost. The corresponding methods and their analyses are essentially the same as those of this chapter; they are provided in Appendix B.3.1 and B.3.2 (see Algorithm 28 and Algorithm 29).

Remark 4.8. Variations around strongly convex extensions of FISTA, involving backtracking line searches, can be found in, e.g., (Chambolle and Pock, 2016; Calatroni and Chambolle, 2019; Florea and Vorobyov, 2018; Florea and Vorobyov, 2020), together with practical improvements. The method presented in this section was designed for easy comparison with previous material.

Another Accelerated Gradient Method for Composite Convex Optimization

FISTA potentially evaluates gradients outside of the domain of $h$, and therefore implicitly assumes that $f$ is defined even outside this region. In many situations this is not an issue, for example when $f$ is quadratic. However, it can be problematic in some cases. In this section, we assume instead that $f$ is continuously differentiable and satisfies the smoothness condition (4.22) only for all $x, y \in \mathrm{dom}\, h$.

Definition 4.2. Let $0 \leq \mu < L$ and let $C \subseteq \mathbb{R}^d$ be convex. We denote by $\mathcal{F}_{\mu,L}(C)$ the set of continuously differentiable, $\mu$-strongly convex functions satisfying the smoothness condition (4.22) for all $x, y \in C$.

By extension, $\mathcal{F}_{\mu,\infty}(C)$ denotes the set of proper, closed, and $\mu$-strongly convex functions whose domain contains $C$, and $\mathcal{F}_{0,\infty}$ denotes the set of proper, closed, and convex functions.

There exist different ways of handling this situation. The method of this section relies on using the proximal operator on the sequence $z_k$, and on formulating Nesterov's method in its form III (see Algorithm 11), where, assuming the initial point is feasible ($F(x_0) < \infty$), both the $x_k$'s and $y_k$'s are obtained from convex combinations of feasible points, and hence are feasible.

A wide variety of accelerated methods exists, and most variants settle FISTA's problem by using two proximal operations per iteration (on both sequences $x_k$ and $z_k$). The variant of this section performs only one projection per iteration, while also fixing the infeasibility issue of $y_k$ in FISTA. Variations on this theme can be found in a number of references, see for example (Auslender and Teboulle, 2006, "Improved interior gradient algorithm"), (Tseng, 2008, Algorithm 1), or more recently (Gasnikov and Nesterov, 2018, "Method of similar triangles").

Theorem 4.22. Let $h \in \mathcal{F}_{0,\infty}$, $f \in \mathcal{F}_{\mu,L}(\mathrm{dom}\, h)$, $x_\star \in \mathrm{argmin}_x \{F(x) \equiv f(x) + h(x)\}$, and $k \in \mathbb{N}$. For any $x_k, z_k \in \mathbb{R}^d$ and $A_k \geq 0$, the iterates of Algorithm 18 satisfying (4.21) also satisfy
\[ A_{k+1}(F(x_{k+1}) - F_\star) + \frac{L_{k+1} + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \leq A_k(F(x_k) - F_\star) + \frac{L_{k+1} + \mu A_k}{2}\|z_k - x_\star\|^2, \]

Algorithm 18 A proximal accelerated gradient method

Input: $h \in \mathcal{F}_{0,\infty}$ with proximal operator available, $f \in \mathcal{F}_{\mu,L}(\mathrm{dom}\, h)$, an initial point $x_0 \in \mathrm{dom}\, h$, and an initial estimate $L_0 > \mu$.
1: Initialize $z_0 = x_0$, $A_0 = 0$, and some $\alpha > 1$.
2: for $k = 0, \ldots$ do
3: $L_{k+1} = L_k$ {Alternate strategies are provided in Remark 4.7.}
4: loop
5: $q_{k+1} = \mu/L_{k+1}$
6: $A_{k+1} = \frac{2A_k + 1 + \sqrt{4A_k + 4q_{k+1}A_k^2 + 1}}{2(1 - q_{k+1})}$
7: set $\tau_k = \frac{(A_{k+1} - A_k)(1 + q_{k+1}A_k)}{A_{k+1} + 2q_{k+1}A_kA_{k+1} - q_{k+1}A_k^2}$ and $\delta_k = \frac{A_{k+1} - A_k}{1 + q_{k+1}A_{k+1}}$
8: $y_k = x_k + \tau_k(z_k - x_k)$
9: $z_{k+1} = \mathrm{prox}_{\delta_k h/L_{k+1}}\left( (1 - q_{k+1}\delta_k)z_k + q_{k+1}\delta_k y_k - \frac{\delta_k}{L_{k+1}}\nabla f(y_k) \right)$
10: $x_{k+1} = \frac{A_k}{A_{k+1}}x_k + \left( 1 - \frac{A_k}{A_{k+1}} \right)z_{k+1}$
11: if (4.21) holds then
12: break {Iterates accepted; $k$ will be incremented.}
13: else
14: $L_{k+1} = \alpha L_{k+1}$ {Iterates not accepted; recompute new $L_{k+1}$.}
15: end if
16: end loop
17: end for
Output: An approximate solution $x_{k+1}$.
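Compared to the FISTA sketch earlier, only the $z$- and $x$-updates change; a minimal sketch with the same assumed oracles (f, grad_f, prox_h) follows.

```python
import numpy as np

def prox_accelerated_gradient(f, grad_f, prox_h, x0, L0, mu,
                              alpha=2.0, n_iter=100):
    """Minimal sketch of Algorithm 18: one proximal step per iteration,
    applied to the z-sequence, so all iterates remain in dom h."""
    x = np.array(x0, dtype=float)
    z = x.copy()
    A, L = 0.0, L0
    for _ in range(n_iter):
        while True:                                  # backtracking loop
            q = mu / L
            A_next = (2 * A + 1
                      + np.sqrt(4 * A + 4 * q * A ** 2 + 1)) / (2 * (1 - q))
            tau = ((A_next - A) * (1 + q * A)
                   / (A_next + 2 * q * A * A_next - q * A ** 2))
            delta = (A_next - A) / (1 + q * A_next)
            y = x + tau * (z - x)
            g = grad_f(y)
            z_next = prox_h((1 - q * delta) * z + q * delta * y
                            - (delta / L) * g, delta / L)          # step 9
            x_next = (A / A_next) * x + (1 - A / A_next) * z_next  # step 10
            dx = x_next - y
            if f(x_next) <= f(y) + g @ dx + (L / 2) * (dx @ dx):   # test (4.21)
                break
            L = alpha * L
        x, z, A = x_next, z_next, A_next
    return x
```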

with $A_{k+1} = \frac{2A_k + 1 + \sqrt{4A_k + 4q_{k+1}A_k^2 + 1}}{2(1 - q_{k+1})}$ and $q_{k+1} = \mu/L_{k+1}$.

Proof. First, the $\{z_k\}$ are in $\mathrm{dom}\, h$ by construction—each is the output of the proximal/projection step. Furthermore, we have $0 \leq \frac{A_k}{A_{k+1}} \leq 1$ given that $A_{k+1} \geq A_k \geq 0$. A direct consequence is that, since $z_0 = x_0 \in \mathrm{dom}\, h$, all subsequent $\{y_k\}$ and $\{x_k\}$ are also in $\mathrm{dom}\, h$ (as they are obtained from convex combinations of feasible points). The rest of the proof consists in a weighted sum of the following inequalities (which are valid due to feasibility of the iterates).

• Strong convexity of $f$ between $x_\star$ and $y_k$ with weight $\lambda_1 = A_{k+1} - A_k$
\[ f(x_\star) \geq f(y_k) + \langle \nabla f(y_k); x_\star - y_k\rangle + \frac{\mu}{2}\|x_\star - y_k\|^2, \]

• convexity of $f$ between $x_k$ and $y_k$ with weight $\lambda_2 = A_k$
\[ f(x_k) \geq f(y_k) + \langle \nabla f(y_k); x_k - y_k\rangle, \]

• smoothness of $f$ between $y_k$ and $x_{k+1}$ (descent lemma) with weight $\lambda_3 = A_{k+1}$
\[ f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k\rangle + \frac{L_{k+1}}{2}\|x_{k+1} - y_k\|^2 \geq f(x_{k+1}), \]

• convexity of $h$ between $x_\star$ and $z_{k+1}$ with weight $\lambda_4 = A_{k+1} - A_k$
\[ h(x_\star) \geq h(z_{k+1}) + \langle g_h(z_{k+1}); x_\star - z_{k+1}\rangle, \]
with $g_h(z_{k+1}) \in \partial h(z_{k+1})$ and $z_{k+1} = (1 - q_{k+1}\delta_k)z_k + q_{k+1}\delta_k y_k - \frac{\delta_k}{L_{k+1}}(\nabla f(y_k) + g_h(z_{k+1}))$,

• convexity of $h$ between $x_k$ and $x_{k+1}$ with weight $\lambda_5 = A_k$
\[ h(x_k) \geq h(x_{k+1}) + \langle g_h(x_{k+1}); x_k - x_{k+1}\rangle, \]
with $g_h(x_{k+1}) \in \partial h(x_{k+1})$,
• convexity of $h$ between $z_{k+1}$ and $x_{k+1}$ with weight $\lambda_6 = A_{k+1} - A_k$
\[ h(z_{k+1}) \geq h(x_{k+1}) + \langle g_h(x_{k+1}); z_{k+1} - x_{k+1}\rangle. \]
We get the following inequality
\[ \begin{aligned} 0 \geq{}& \lambda_1\left[ f(y_k) - f(x_\star) + \langle \nabla f(y_k); x_\star - y_k\rangle + \tfrac{\mu}{2}\|x_\star - y_k\|^2 \right] \\ &+ \lambda_2\left[ f(y_k) - f(x_k) + \langle \nabla f(y_k); x_k - y_k\rangle \right] \\ &+ \lambda_3\left[ f(x_{k+1}) - \left( f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k\rangle + \tfrac{L_{k+1}}{2}\|x_{k+1} - y_k\|^2 \right) \right] \\ &+ \lambda_4\left[ h(z_{k+1}) - h(x_\star) + \langle g_h(z_{k+1}); x_\star - z_{k+1}\rangle \right] \\ &+ \lambda_5\left[ h(x_{k+1}) - h(x_k) + \langle g_h(x_{k+1}); x_k - x_{k+1}\rangle \right] \\ &+ \lambda_6\left[ h(x_{k+1}) - h(z_{k+1}) + \langle g_h(x_{k+1}); z_{k+1} - x_{k+1}\rangle \right]. \end{aligned} \]
Substituting $y_k$, $z_{k+1}$, and $x_{k+1}$ using
\[ \begin{aligned} y_k &= x_k + \tau_k(z_k - x_k), \\ z_{k+1} &= (1 - q_{k+1}\delta_k)z_k + q_{k+1}\delta_k y_k - \frac{\delta_k}{L_{k+1}}(\nabla f(y_k) + g_h(z_{k+1})), \\ x_{k+1} &= \frac{A_k}{A_{k+1}}x_k + \left( 1 - \frac{A_k}{A_{k+1}} \right)z_{k+1}, \end{aligned} \]
we reformulate the previous inequality as
\[ \begin{aligned} A_{k+1}&(f(x_{k+1}) + h(x_{k+1}) - f(x_\star) - h(x_\star)) + \frac{L_{k+1} + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \\ \leq{}& A_k(f(x_k) + h(x_k) - f(x_\star) - h(x_\star)) + \frac{L_{k+1} + \mu A_k}{2}\|z_k - x_\star\|^2 \\ &+ \frac{(A_k - A_{k+1})^2\left( (A_k - A_{k+1})^2 - A_{k+1} - q_{k+1}A_{k+1}^2 \right)}{A_{k+1}^2(1 + q_{k+1}A_{k+1})}\, \frac{1}{2L_{k+1}}\|\nabla f(y_k) + g_h(z_{k+1})\|^2 \\ &- \frac{A_k^2(A_{k+1} - A_k)(1 + q_{k+1}A_k)(1 + q_{k+1}A_{k+1})}{(A_{k+1} + 2q_{k+1}A_kA_{k+1} - q_{k+1}A_k^2)^2}\, \frac{\mu}{2}\|x_k - z_k\|^2. \end{aligned} \]
Picking $A_{k+1} \geq A_k$ such that
\[ (A_k - A_{k+1})^2 - A_{k+1} - q_{k+1}A_{k+1}^2 = 0 \]
yields the desired result
\[ \begin{aligned} A_{k+1}&(f(x_{k+1}) + h(x_{k+1}) - f(x_\star) - h(x_\star)) + \frac{L_{k+1} + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \\ &\leq A_k(f(x_k) + h(x_k) - f(x_\star) - h(x_\star)) + \frac{L_{k+1} + \mu A_k}{2}\|z_k - x_\star\|^2. \end{aligned} \]

We have the following corollary.

Corollary 4.23. Let $h \in \mathcal{F}_{0,\infty}$, $f \in \mathcal{F}_{\mu,L}(\mathrm{dom}\, h)$, and $x_\star \in \mathrm{argmin}_x \{F(x) \equiv f(x) + h(x)\}$. For any $N \in \mathbb{N}$, $N \geq 1$, and $x_0 \in \mathrm{dom}\, h$, the output of Algorithm 18 satisfies
\[ F(x_N) - F_\star \leq \min\left\{ \frac{2}{N^2},\ \left( 1 - \sqrt{\frac{\mu}{\ell}} \right)^N \right\} \ell\,\|x_0 - x_\star\|^2, \]
with $\ell = \max\{\alpha L, L_0\}$.

Proof. The proof follows the same arguments as those of Corollary 4.21, using the potential from Theorem 4.22 and the fact that the output of the algorithm satisfies (4.21).

Remark 4.9. In this chapter, we introduced backtracking techniques by "observing" which inequalities were used in the proofs. In particular, because smoothness was used only through the descent lemma, in which the only unknown value is $L$, one can simply check this inequality at runtime. Another possible exploitation of this observation ("what inequalities do we need in the proof") is to identify minimal assumptions on the class of functions under which it is possible to prove accelerated rates; this topic is explored by, e.g., Necoara et al. (2019) and Hinder et al. (2020). More generally, the same question holds for the ability to prove convergence rates of simpler methods, such as gradient descent (Bolte et al., 2017; Necoara et al., 2019).

4.7.3 Beyond Euclidean Geometries using Mirror Maps

In this section, we put ourselves in a slightly different scenario, often referred to as the non-Euclidean setting or the mirror descent setup. Let us consider the convex minimization problem
\[ F_\star = \min_{x \in \mathbb{R}^d} \{ F(x) \equiv f(x) + h(x) \}, \tag{4.24} \]
with $h, f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ closed, proper, and convex. Furthermore, we assume $f$ to be differentiable and to have Lipschitz gradients with respect to some (possibly non-Euclidean)

norm $\|.\|$. That is, denoting by $\|s\|_* = \sup_x \{ \langle s; x\rangle : \|x\| \leq 1 \}$ the corresponding dual norm, we require
\[ \|\nabla f(x) - \nabla f(y)\|_* \leq L\|x - y\| \]
for all $x, y \in \mathrm{dom}\, h$. In this setting, inequality (4.22) also holds (see Appendix A.2), and we perhaps abusively also denote $f \in \mathcal{F}_{0,L}(\mathrm{dom}\, h)$.

To solve (4.24), we define a few additional ingredients. First, we pick a 1-strongly convex, closed, and proper function $w : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ such that $\mathrm{dom}\, h \subseteq \mathrm{dom}\, w$ (recall that by assumption on $h$, $\mathrm{dom}\, h \neq \emptyset$, and therefore $\mathrm{dom}\, w \neq \emptyset$). Those assumptions ensure the proximal operations below are well defined, and $w$ is commonly referred to as the distance generating function. Finally, pick $g_w(y) \in \partial w(y)$ and define a Bregman divergence generated by $w$ as
\[ D_w(x; y) = w(x) - w(y) - \langle g_w(y); x - y\rangle, \tag{4.25} \]
which we use below as a notion of distance for generalizing the previous proximal operator (4.20). Note that the Bregman divergence $D_w(\cdot;\cdot)$ generated by any subgradient of $w$ at $z_k$ is considered valid here. Now, the base ingredient we use for solving (4.24) is the Bregman proximal gradient step, with step size $\frac{a_k}{L}$:

\[ z_{k+1} = \mathrm{argmin}_y \left\{ \frac{a_k}{L}\left( h(y) + \langle \nabla f(y_k); y - y_k\rangle \right) + D_w(y; z_k) \right\}, \tag{4.26} \]
which corresponds to the usual Euclidean proximal gradient step when $w(x) = \frac{1}{2}\|x\|_2^2$. Under the previous assumptions, (4.26) is well defined, and we can explicitly write
\[ \partial w(z_{k+1}) \ni g_w(z_k) - \frac{a_k}{L}\left( \nabla f(y_k) + g_h(z_{k+1}) \right), \]
with some $g_w(z_{k+1}) \in \partial w(z_{k+1})$, $g_w(z_k) \in \partial w(z_k)$, and $g_h(z_{k+1}) \in \partial h(z_{k+1})$.

Under this construction, one can then rely on (4.26) for solving (4.24) using Algorithm 19. Note that when $w$ is differentiable (which is usually the case, but for which requiring $w$ to be closed, proper, and convex requires some additional discussion), we often refer to $\nabla w$ as a mirror map, which is a bijective mapping due to strong convexity and differentiability of $w$. In this case, iterations can be described as
\[ \nabla w(z_{k+1}) = \nabla w(z_k) - \frac{a_k}{L}\left( \nabla f(y_k) + g_h(z_{k+1}) \right). \]
Theorem 4.24 below provides a convergence guarantee for Algorithm 19 using a potential argument.

Theorem 4.24. Let $h \in \mathcal{F}_{0,\infty}$, $f \in \mathcal{F}_{0,L}(\mathrm{dom}\, h)$, $w \in \mathcal{F}_{1,\infty}$ with $\mathrm{dom}\, h \subseteq \mathrm{dom}\, w$, let $x_\star \in \mathrm{argmin}_x \{F(x) \equiv f(x) + h(x)\}$, and $k \in \mathbb{N}$. For any $x_k, z_k \in \mathrm{dom}\, h$ such that $\partial w(z_k) \neq \emptyset$ and $A_k \geq 0$, the iterates of Algorithm 19 satisfy
\[ A_{k+1}(F(x_{k+1}) - F_\star) + L D_w(x_\star; z_{k+1}) \leq A_k(F(x_k) - F_\star) + L D_w(x_\star; z_k), \]

with $A_{k+1} = A_k + \frac{1 + \sqrt{4A_k + 1}}{2}$ and $D_w(\cdot;\cdot)$ a Bregman divergence (4.25) with respect to $w$. Furthermore, $\partial w(z_{k+1}) \neq \emptyset$.

Proof. First, the $\{z_k\}$ are all feasible, i.e., $z_k \in \mathrm{dom}\, h$, by construction. Indeed, $z_0$ is feasible by assumption, and the following iterates $z_k$ ($k > 0$) are obtained after proximal steps, hence $\partial h(z_k) \neq \emptyset$ and therefore $z_k \in \mathrm{dom}\, h$.

Algorithm 19 Proximal accelerated Bregman gradient method

Input: $h \in \mathcal{F}_{0,\infty}$, $f \in \mathcal{F}_{0,L}(\mathrm{dom}\, h)$, $w \in \mathcal{F}_{1,\infty}$ with $\mathrm{dom}\, h \subseteq \mathrm{dom}\, w$, and $x_0 \in \mathrm{dom}\, h$ (such that $\partial w(z_0) \neq \emptyset$).
1: Initialize $z_0 = x_0$ and $A_0 = 0$.
2: for $k = 0, \ldots$ do
3: $a_k = \frac{1 + \sqrt{4A_k + 1}}{2}$
4: $A_{k+1} = A_k + a_k$
5: $y_k = \frac{A_k}{A_{k+1}}x_k + \left( 1 - \frac{A_k}{A_{k+1}} \right)z_k$
6: $z_{k+1} = \mathrm{argmin}_y \left\{ \frac{a_k}{L}\left( h(y) + \langle \nabla f(y_k); y - y_k\rangle \right) + D_w(y; z_k) \right\}$
7: $x_{k+1} = \frac{A_k}{A_{k+1}}x_k + \left( 1 - \frac{A_k}{A_{k+1}} \right)z_{k+1}$
8: end for
Output: Approximate solution $x_{k+1}$.
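A minimal sketch of Algorithm 19 follows, where the Bregman step is abstracted into an assumed oracle bregman_prox(g, z, s) returning $\mathrm{argmin}_y \{ s(h(y) + \langle g; y\rangle) + D_w(y; z) \}$, which coincides with step 6 up to constants in $y$.

```python
import numpy as np

def accelerated_bregman(grad_f, bregman_prox, x0, L, n_iter=100):
    """Minimal sketch of Algorithm 19 (proximal accelerated Bregman method)."""
    x = np.array(x0, dtype=float)
    z = x.copy()
    A = 0.0
    for _ in range(n_iter):
        a = (1 + np.sqrt(4 * A + 1)) / 2               # step 3
        A_next = A + a                                  # step 4
        y = (A / A_next) * x + (1 - A / A_next) * z     # step 5
        z = bregman_prox(grad_f(y), z, a / L)           # step 6
        x = (A / A_next) * x + (1 - A / A_next) * z     # step 7
        A = A_next
    return x
```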

Secondly, it is direct to verify that $0 \leq \frac{A_k}{A_{k+1}} \leq 1$ given that $A_{k+1} \geq A_k \geq 0$. As a consequence, all subsequent $\{y_k\}$ and $\{x_k\}$ are obtained as convex combinations. It follows from $z_0 = x_0 \in \mathrm{dom}\, h$ that the sequences $\{y_k\}$ and $\{x_k\}$ are also in $\mathrm{dom}\, h$, which is convex. The rest of the proof consists in a weighted sum of the following inequalities:

• convexity of $f$ between $x_\star$ and $y_k$ with weight $\lambda_1 = A_{k+1} - A_k$
\[ f(x_\star) \geq f(y_k) + \langle \nabla f(y_k); x_\star - y_k\rangle, \]

• convexity of $f$ between $x_k$ and $y_k$ with weight $\lambda_2 = A_k$

\[ f(x_k) \geq f(y_k) + \langle \nabla f(y_k); x_k - y_k\rangle, \]

• smoothness of $f$ between $y_k$ and $x_{k+1}$ (descent lemma) with weight $\lambda_3 = A_{k+1}$
\[ f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k\rangle + \frac{L}{2}\|x_{k+1} - y_k\|^2 \geq f(x_{k+1}), \]
• convexity of $h$ between $x_\star$ and $z_{k+1}$ with weight $\lambda_4 = A_{k+1} - A_k$
\[ h(x_\star) \geq h(z_{k+1}) + \langle g_h(z_{k+1}); x_\star - z_{k+1}\rangle, \]
with $g_h(z_{k+1}) \in \partial h(z_{k+1})$,

• convexity of $h$ between $x_k$ and $x_{k+1}$ with weight $\lambda_5 = A_k$

\[ h(x_k) \geq h(x_{k+1}) + \langle g_h(x_{k+1}); x_k - x_{k+1}\rangle, \]
with $g_h(x_{k+1}) \in \partial h(x_{k+1})$,
• convexity of $h$ between $z_{k+1}$ and $x_{k+1}$ with weight $\lambda_6 = A_{k+1} - A_k$
\[ h(z_{k+1}) \geq h(x_{k+1}) + \langle g_h(x_{k+1}); z_{k+1} - x_{k+1}\rangle, \]

• 1-strong convexity of $w$ between $z_{k+1}$ and $z_k$ with weight $\lambda_7 = L$
\[ w(z_{k+1}) \geq w(z_k) + \langle g_w(z_k); z_{k+1} - z_k\rangle + \frac{1}{2}\|z_k - z_{k+1}\|^2, \]
with $g_w(z_k) \in \partial w(z_k)$.
We then get the following inequality
\[ \begin{aligned} 0 \geq{}& \lambda_1\left[ f(y_k) - f_\star + \langle \nabla f(y_k); x_\star - y_k\rangle \right] \\ &+ \lambda_2\left[ f(y_k) - f(x_k) + \langle \nabla f(y_k); x_k - y_k\rangle \right] \\ &+ \lambda_3\left[ f(x_{k+1}) - \left( f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k\rangle + \tfrac{L}{2}\|x_{k+1} - y_k\|^2 \right) \right] \\ &+ \lambda_4\left[ h(z_{k+1}) - h(x_\star) + \langle g_h(z_{k+1}); x_\star - z_{k+1}\rangle \right] \\ &+ \lambda_5\left[ h(x_{k+1}) - h(x_k) + \langle g_h(x_{k+1}); x_k - x_{k+1}\rangle \right] \\ &+ \lambda_6\left[ h(x_{k+1}) - h(z_{k+1}) + \langle g_h(x_{k+1}); z_{k+1} - x_{k+1}\rangle \right] \\ &+ \lambda_7\left[ w(z_k) - w(z_{k+1}) + \langle g_w(z_k); z_{k+1} - z_k\rangle + \tfrac{1}{2}\|z_k - z_{k+1}\|^2 \right]. \end{aligned} \]
Now, by substituting (using $g_w(z_{k+1}) \in \partial w(z_{k+1})$)
\[ \begin{aligned} y_k &= \frac{A_k}{A_{k+1}}x_k + \left( 1 - \frac{A_k}{A_{k+1}} \right)z_k, \\ g_w(z_{k+1}) &= g_w(z_k) - \frac{A_{k+1} - A_k}{L}\left( \nabla f(y_k) + g_h(z_{k+1}) \right), \\ x_{k+1} &= \frac{A_k}{A_{k+1}}x_k + \left( 1 - \frac{A_k}{A_{k+1}} \right)z_{k+1}, \end{aligned} \]
we obtain exactly $\|x_{k+1} - y_k\|^2 = \frac{(A_{k+1} - A_k)^2}{A_{k+1}^2}\|z_k - z_{k+1}\|^2$, and the weighted sum can be reformulated exactly as
\[ \begin{aligned} A_{k+1}&(F(x_{k+1}) - F_\star) + L D_w(x_\star; z_{k+1}) \\ &\leq A_k(F(x_k) - F_\star) + L D_w(x_\star; z_k) + \frac{(A_k - A_{k+1})^2 - A_{k+1}}{A_{k+1}}\, \frac{L}{2}\|z_k - z_{k+1}\|^2, \end{aligned} \]
and we obtain the desired inequality from picking $A_{k+1}$ satisfying $A_{k+1} \geq A_k$ and
\[ (A_k - A_{k+1})^2 - A_{k+1} = 0. \]

Let us conclude this section by providing the final corollary describing the worst-case performance of the method.

Corollary 4.25. Let $h \in \mathcal{F}_{0,\infty}$, $f \in \mathcal{F}_{0,L}(\mathrm{dom}\, h)$, $w \in \mathcal{F}_{1,\infty}$ with $\mathrm{dom}\, h \subseteq \mathrm{dom}\, w$, and let $x_\star \in \mathrm{argmin}_x \{F(x) \equiv f(x) + h(x)\}$. For any $x_0 \in \mathrm{dom}\, h$ such that $\partial w(x_0) \neq \emptyset$, and any $N \in \mathbb{N}$, the iterates of Algorithm 19 satisfy
\[ F(x_N) - F_\star \leq \frac{L D_w(x_\star; z_0)}{A_N} \leq \frac{4L D_w(x_\star; z_0)}{N^2}. \]

Proof. The claim directly follows from the previous arguments, using the potential $\phi_k = A_k(F(x_k) - F_\star) + L D_w(x_\star; z_k)$, along with $A_N \geq \frac{N^2}{4}$ from (4.14).

Remark 4.10. Bregman first-order methods are often split into two different families: mirror descent and dual averaging (which we did not explicitly mention here), for which we refer the reader to the discussions in (Bubeck, 2015, Chapter 4). The method presented in this section is essentially a case of (Tseng, 2008, Algorithm 1), and corresponds to (Auslender and Teboulle, 2006, "Improved interior gradient algorithm") when the norm is Euclidean. It is also similar to (Tseng, 2008, Algorithm 3) and to (Gasnikov and Nesterov, 2018, "Method of similar triangles"), which are "dual-averaging" versions of the same algorithm: they are essentially equivalent in the Euclidean setup without constraints. It enjoys a number of variants, see e.g. the discussions in (Tseng, 2008), possibly involving two projections per iteration, as in (Nesterov, 2005, Section 3) and (Lan et al., 2011, Section 3). The method can also naturally be embedded with a backtracking procedure, exactly as in the previous sections. Finally, such methods can be adapted to strong convexity, either on $f$, as in (Gasnikov and Nesterov, 2018), or on $h$, as in (Diakonikolas and Guzmán, 2021).

Remark 4.11. Beyond the Euclidean setting where $w(x) = \frac{1}{2}\|x\|_2^2$, one of the most classical examples of the impact of the non-Euclidean setup optimizes a simple function over the simplex. In this case, writing $x^{(i)}$ the $i$th component of some $x \in \mathbb{R}^d$, we consider the situation where $h$ is the indicator function of the simplex
\[ h(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^d x^{(i)} = 1,\ x^{(i)} \geq 0\ (i = 1, \ldots, d), \\ +\infty & \text{otherwise,} \end{cases} \]
and $w$ is the entropy (which is closed, proper, and convex, and 1-strongly convex over the simplex for $\|.\| = \|.\|_1$; this is known as Pinsker's inequality), that is, define some $w_i : \mathbb{R} \to \mathbb{R}$
\[ w_i(x) = \begin{cases} x\log x & \text{if } x > 0, \\ 0 & \text{if } x = 0, \\ +\infty & \text{otherwise,} \end{cases} \]
and set $w(x) = \sum_{i=1}^d w_i(x^{(i)})$. In this case, the expression for the Bregman proximal gradient step in Algorithm 19 can be computed exactly, assuming $z_k^{(i)} \neq 0$ (for all $i = 1, \ldots, d$):
\[ z_{k+1}^{(i)} = \frac{z_k^{(i)}\exp\left( -\frac{a_k}{L}[\nabla f(y_k)]^{(i)} \right)}{\sum_{i=1}^d z_k^{(i)}\exp\left( -\frac{a_k}{L}[\nabla f(y_k)]^{(i)} \right)}. \]
Hence we also have that $z_{k+1}^{(i)} \neq 0$ as soon as $z_0^{(i)} \neq 0$; a common technique is to instantiate $x_0^{(i)} = \frac{1}{d}$.

In this setup, a non-Euclidean geometry often shows a significant practical advantage when optimizing large-scale functions, by improving the dependence from $d$ to $\log(d)$

in the final complexity bound. Indeed, we have $D_w(x_\star, x_0) \leq \ln d$ here, compared to $D_{\frac{1}{2}\|\cdot\|_2^2}(x_\star, x_0) \leq 1$ in the Euclidean case, so the dependence in $d$ is seemingly better in the Euclidean case. However, the choice of norms has a very significant impact here. When the gradient has a small Lipschitz constant measured with respect to $\|.\| = \|.\|_1$, that is

\[ \|\nabla f(x) - \nabla f(y)\|_\infty \leq L_1\|x - y\|_1, \]
the Lipschitz constant might be up to $d$ times smaller than that computed using the Euclidean norm $\|.\| = \|.\|_2$ (using norm equivalences), i.e.,
\[ \|\nabla f(x) - \nabla f(y)\|_2 \leq L_2\|x - y\|_2, \]
with $L_1 = O(L_2/d)$. The final complexity bound using a Euclidean geometry then reads

\[ F(x_N) - F_\star \leq \frac{2Ld}{N^2}, \]
compared to
\[ F(x_N) - F_\star \leq \frac{4L\ln d}{N^2} \]
for the same $L > 0$, if we were using the geometry induced by the entropy. The impact of the choice of norm is discussed extensively in, e.g., (Juditsky et al., 2009, Example 2.1) and (d'Aspremont et al., 2018). Another, related, example is that of optimizing over the spectrahedron; see for example the nice introduction by Bubeck (2015, Chapter 4). This setup is also largely motivated in (Nesterov, 2005). We refer to (Allen-Zhu and Orecchia, 2014; Diakonikolas and Guzmán, 2021) and the references therein for further discussions on this topic.
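For the entropy/simplex setup of Remark 4.11, the Bregman oracle from the sketch after Algorithm 19 takes the closed form derived above (a normalized multiplicative update); the helper name below is ours.

```python
import numpy as np

def entropy_simplex_prox(g, z, s):
    """Bregman step for h = indicator of the simplex and w = entropy:
    returns the minimizer over the simplex of s*<g; y> + sum_i y_i*log(y_i/z_i),
    i.e., the normalized multiplicative update of Remark 4.11
    (here g = grad f(y_k) and s = a_k/L)."""
    u = z * np.exp(-s * g)
    return u / u.sum()

# Illustrative use together with the earlier sketch of Algorithm 19:
# x = accelerated_bregman(grad_f, entropy_simplex_prox,
#                         x0=np.full(d, 1.0 / d), L=L1, n_iter=100)
```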

4.8 Notes & References

Estimate Sequences, Potential Functions, and Differential Equations. Potential functions were used in the original paper by Nesterov (1983) to develop accelerated methods. Nesterov (2003) developed estimate sequences as an alternate, more constructive approach to get optimal first-order methods. Since then, both approaches were used in many references on this topic in various settings, and Tseng (2008) provides a nice unified view on accelerated methods. Estimate sequences were extensively studied by, e.g., Nesterov (2013), Baes (2009), Devolder (2011), and Kulunchakov and Mairal (2019b). Nesterov's methods were also interpreted and studied in continuous-time settings, through ordinary differential equations, in the smooth convex case (Su et al., 2016; Krichene et al., 2015; Attouch and Peypouquet, 2016), and in the strongly convex one (Shi et al., 2018), where constant momentum schemes can be deployed. In this case, Nesterov's method can be seen as an appropriate integration scheme of the gradient flow (Scieur et al., 2017b). Another related approach is that of the approximate duality gap (Diakonikolas and Orecchia, 2019b), a constructive approach to estimate sequences/potential functions with a continuous-time counterpart.

Beyond Euclidean Geometries. Mirror descent dates back to the work of Nemirovsky and Yudin (1983a). It was further developed and used in many subsequent works (Ben-Tal and Nemirovsky, 2001; Nesterov, 2005; Nesterov, 2009; Xiao, 2010; Juditsky and Nesterov, 2014; Diakonikolas and Guzmán, 2021). Nice pedagogical surveys can be found in (Beck and Teboulle, 2003; Juditsky and Nemirovsky, 2011a; Juditsky and Nemirovsky, 2011b; Bubeck, 2015). Beyond the setting described in this chapter, mirror descent was also studied in the relative smoothness setting, introduced by Bauschke et al. (2016)—see also (Teboulle, 2018)—and extended to a notion of relative strong convexity by Lu et al. (2018). However, acceleration remains an open issue there, and it is generally unclear which additional assumptions allow accelerated rates. It is however clear that additional assumptions are required, as emphasized by the lower bound provided by Dragomir et al. (2019) in a generic setting.

Lower Complexity Bounds. Lower complexity bounds were studied in a variety of settings to establish limits on worst-case performance of black-box methods. The classical reference on this topic is the book of Nemirovsky and Yudin (1983c). Of particular interest to us, Nemirovsky (1991) and Nemirovsky (1992) establish optimality of Chebyshev and conjugate gradient methods for convex quadratic minimization. Lower bounds for black-box first-order methods in the context of smooth convex and smooth strongly convex optimization can be found in (Nesterov, 2003). The final lower bound for black-box smooth convex minimization was obtained by Drori (2017) and shows optimality of the optimized gradient method, as well as that of conjugate gradients, as discussed earlier in this chapter. Lower bounds for $\ell_p$ norms in the mirror descent setup are constructed in Guzmán and Nemirovsky (2015), whereas a lower bound for mirror descent in the relative smoothness setup is provided by Dragomir et al. (2019).

Changing the Performance Measure. Obtaining (practical) accelerated methods for other types of convergence criteria, such as gradient norms, is still not a fully settled issue. Those criteria are important in other contexts, including dual methods, and to draw links between methods intrinsically designed for solving convex problems and those used in nonconvex settings, for which finding stationary points is the goal. There are a few tricks allowing to pass from a guarantee in one context to another one. For example, a regularization trick is proposed in (Nesterov, 2012b), yielding approximate solutions with small gradient norm. Beyond that, in the context of smooth convex minimization, recent progress was made by Kim and Fessler (2020), who designed an optimized method for minimizing the gradient norm after a given number of iterations; this method was analyzed through potential functions in (Diakonikolas and Wang, 2021). Corresponding lower bounds, based on quadratic minimization, for a variety of performance measures can be found in (Nemirovsky, 1992).

Adaptation and Backtracking Line Searches. The idea of using backtracking line searches is classical, and attributed to, e.g., Goldstein (1962) and Armijo (1966)—see e.g. the discussions in (Nocedal and Wright, 2006; Bonnans et al., 2006). It was already incorporated in the original work of Nesterov (1983), estimating the smoothness constant within an accelerated gradient method. Since then, many works on the topic heavily rely on this technique, often adapted to obtain better practical performance, see for example (Scheinberg et al., 2014; Chambolle and Pock, 2016; Florea and Vorobyov, 2018; Calatroni and Chambolle, 2019). A more recent adaptive step size strategy (without line search) can be found in (Malitsky and Mishchenko, 2020).

Inexactness, Stochasticity, and Randomness. The ability to use approximate first-order information, be it stochastic or deterministic, is key for tackling certain problems for which computing exact gradients is expensive. Deterministic (or adversarial) error models are studied in, e.g., (d'Aspremont, 2008; Schmidt et al., 2011; Devolder et al., 2014; Devolder, 2013; Devolder et al., 2013) through different noise models. Such approaches can also be deployed when the projection/proximal operation is computed approximately (Schmidt et al., 2011; Villa et al., 2013) (see also Chapter 5 and the references therein). Similarly, stochastic approximations and incremental gradient methods are key in many statistical learning problems, where samples are accessed one at a time and for which it is not desirable to optimize beyond data accuracy (Bottou and Bousquet, 2007). For this reason, the old idea of stochastic approximations (Robbins and Monro, 1951) is still widely used and an active area of research. Their "optimal" variants were developed much later (Lan, 2008) with the rise of machine learning applications; in this context, it is not possible to asymptotically accelerate convergence rates, but only to accelerate the transient phase toward a purely stochastic regime; see also (Hu et al., 2009; Xiao, 2010; Devolder, 2011; Lan, 2012; Dvurechensky and Gasnikov, 2016; Aybat et al., 2019; Gorbunov et al., 2020)—in particular, we note that "stochastic" estimate sequences were developed in (Devolder, 2011; Kulunchakov and Mairal, 2019b). The case of the stochastic noise arising from sampling an objective function that is a finite sum of smooth components attracted a lot of attention in the 2010's, starting with (Schmidt et al., 2017; Johnson and Zhang, 2013; Shalev-Shwartz and Zhang, 2013; Defazio et al., 2014a; Defazio et al., 2014b; Mairal, 2015), and was then extended to feature acceleration techniques (Shalev-Shwartz and Zhang, 2014; Allen-Zhu, 2017; Zhou et al., 2018; Zhou et al., 2019). Acceleration techniques also apply in the context of randomized block coordinate descent, see for example Nesterov (2012a), Lee and Sidford (2013), Fercoq and Richtárik (2015), and Nesterov and Stich (2017).

Higher-order Methods. Acceleration mechanisms were also proposed in the context of higher-order methods. This line of work started with the cubic regularized Newton method introduced in (Nesterov and Polyak, 2006) and its acceleration using estimate sequence mechanisms (Nesterov, 2008); see also (Baes, 2009; Wilson et al., 2016) and (Monteiro and Svaiter, 2013) (which we also discuss in the next chapter). Optimal higher-order methods were presented by Gasnikov et al. (2019). Before the work of Nesterov (2019), it was not clear that the intermediate subproblems arising in the context of higher-order methods were tractable; the fact that tractability is not an issue has attracted a lot of attention towards these methods.

Optimized Methods. Optimized gradient methods were pioneered by Drori and Teboulle (2014), and discovered by Kim and Fessler (2016). Since then, optimized methods have been studied in various settings, incorporating constraints/proximal terms (Kim and Fessler, 2018b; Taylor et al., 2017a), optimizing gradient norms (Kim and Fessler, 2018c; Kim and Fessler, 2020; Diakonikolas and Wang, 2021) (as an alternative to (Nesterov, 2012b)), adapting to unknown problem parameters using exact line searches (Drori and Taylor, 2020) or restarts (Kim and Fessler, 2018a), and in the strongly convex case (Van Scoy et al., 2017; Cyrus et al., 2018; Park et al., 2021; Taylor and Drori, 2021). Such methods also appeared in the context of fixed-point iterations (Lieder, 2020) and proximal methods (Kim, 2019; Barré et al., 2020a).

On Obtaining Proofs from this Chapter. The worst-case performance of first-order methods can often be computed numerically, as shown in (Drori and Teboulle, 2014; Drori, 2014; Drori and Teboulle, 2016) through the introduction of performance estimation problems. The performance estimation approach was shown to provide tight certificates, from which one can recover both worst-case certificates and matching worst-case problem instances, in (Taylor et al., 2017c; Taylor et al., 2017a). A consequence is that worst-case guarantees for first-order methods, such as those detailed in this chapter, can always be obtained as weighted sums of appropriate inequalities characterizing the problem at hand; see for instance (De Klerk et al., 2017; Dragomir et al., 2019). A similar approach framed in control-theoretic terms, originally tailored to obtain geometric convergence rates, was developed by Lessard et al. (2016) and can also be used to form potential functions (Hu and Lessard, 2017), as well as optimized methods such as the triple momentum method (Van Scoy et al., 2017; Cyrus et al., 2018; Lessard and Seiler, 2020). Proofs of this section were obtained using the performance estimation approach tailored for potential functions (Taylor and Bach, 2019), together with the performance estimation toolbox (Taylor et al., 2017b) (in particular, the potential function for the optimized gradient method can be found in (Taylor and Bach, 2019, Theorem 11)). These techniques can be used for either validating or rediscovering the proofs of this chapter numerically, through semidefinite programming. For reproducibility purposes, we provide the corresponding codes, as well as notebooks for symbolically verifying the algebraic reformulations of this section, at

https://github.com/AdrienTaylor/AccelerationMonograph.

5 Proximal Acceleration and Catalyst

In this chapter we present simple methods based on approximate proximal operations that produce accelerated gradient-based methods. This idea is exploited for example in the Catalyst (Lin et al., 2015; Lin et al., 2018) and Accelerated Hybrid Proximal Extragradient frameworks (Monteiro and Svaiter, 2013). In essence, the idea is to develop (conceptual) accelerated proximal point algorithms, and to use classical iterative methods to approximate the proximal point. In particular, these frameworks produce accelerated gradient methods (in the same sense as Nesterov’s acceleration) when the approximate proximal points are computed using linearly converging gradient-based optimization methods.

5.1 Introduction

We review acceleration from the perspective of proximal point algorithms (PPA). The key concept here, the proximal operation, dates back to the sixties, with the works of Moreau (1962; 1965). Its introduction in optimization is attributed to Martinet (1970; 1972) and was primarily motivated by its link with augmented Lagrangian techniques. In contrast with previous sections, where information about the functions to be minimized was obtained through their gradients, the following pages deal with the case where information is gathered through a proximal operator, or an approximation of that operator. The proximal point algorithm, and its use for developing optimization schemes, are nicely surveyed in (Parikh and Boyd, 2014). We aim to go in a slightly different direction here and describe its use in an outer loop to obtain improved convergence guarantees, in the spirit of the Accelerated Hybrid Proximal Extragradient method (Monteiro and Svaiter, 2013) and of Catalyst acceleration (Lin et al., 2015; Lin et al., 2018).


In this chapter, we focus on the problem of solving

\[ f_\star = \min_{x \in \mathbb{R}^d} f(x), \qquad (5.1) \]

where f is closed, proper, and convex (it has a non-empty, closed, and convex epigraph), which we denote by $f \in \mathcal{F}_{0,\infty}$, in line with Definition 4.1 from Chapter 4. We denote by $\partial f(x)$ the subdifferential of f at $x \in \mathbb{R}^d$, and by $g_f(x) \in \partial f(x)$ some element of the subdifferential at x, irrespective of f being continuously differentiable or not. We aim to find an $\epsilon$-approximate solution x such that $f(x) - f_\star \le \epsilon$.

It is possible to develop optimized proximal methods in the spirit of the optimized gradient methods presented in Chapter 4. That is, given a computational budget (in the proximal setting, a number of iterations and a sequence of step sizes), one can choose the algorithmic parameters to optimize worst-case performance. The proximal equivalent of the optimized gradient method is Güler’s second method (Güler, 1992, Section 6) (see discussions in Section 5.6). We do not spend time on this method here, and directly aim for methods designed from simple potential functions, in the same spirit as for Nesterov’s accelerated gradient methods from Chapter 4.

5.2 Proximal Point Algorithm and Acceleration

Whereas the base method for minimizing a function using its gradient is gradient descent,

\[ x_{k+1} = x_k - \lambda g_f(x_k), \]

the base method for minimizing a function using its proximal oracle is the proximal point algorithm,

\[ x_{k+1} = \mathrm{prox}_{\lambda f}(x_k), \qquad (5.2) \]

where the proximal operator is given by

\[ \mathrm{prox}_{\lambda f}(x) \equiv \operatorname*{argmin}_y \left\{ \Phi(y;x) \equiv \lambda f(y) + \tfrac{1}{2}\|y - x\|^2 \right\}. \]

The proximal point algorithm has a number of intuitive interpretations, with two of them particularly convenient for our purposes.

• Optimality conditions of the proximal subproblem reveal that a proximal step corresponds to an implicit (sub)gradient method

\[ x_{k+1} = x_k - \lambda g_f(x_{k+1}), \]

where $g_f(x_{k+1}) \in \partial f(x_{k+1})$.

• Using the proximal point algorithm is equivalent to applying gradient descent to the Moreau envelope of f, where the Moreau envelope, denoted $F_\lambda$, is given by

\[ F_\lambda(x) = \min_y \left\{ f(y) + \frac{1}{2\lambda}\|y - x\|^2 \right\}. \]

The Moreau envelope has the same set of optimal solutions as that of f, while enjoying nice additional regularity properties (it is 1/λ-smooth and convex). More precisely, its gradient is given by

\[ \nabla F_\lambda(x) = \left( x - \mathrm{prox}_{\lambda f}(x) \right)/\lambda \]

(see (Lemaréchal and Sagastizábal, 1997) for more details). This allows writing

\[ x_{k+1} = \mathrm{prox}_{\lambda f}(x_k) = x_k - \lambda \nabla F_\lambda(x_k), \]

hence casting proximal minimization methods (as well as their inexact and accelerated variants) applied to f as classical gradient methods (and their inexact and accelerated variants) applied to $F_\lambda$. In general, proximal operations are expensive, sometimes nearly as expensive as minimizing the function itself. However, there are many cases, especially in the context of composite optimization problems, where one can isolate parts of the objective for which proximal operators have analytical solutions; see e.g. (Chierchia et al., 2020) for a list of such examples. In the following sections, we start by analyzing such proximal point methods, then show at the end of this chapter how proximal methods can be used in outer loops, where proximal subproblems are solved approximately using a classical iterative method (in inner loops). In particular, we describe how this combination produces accelerated numerical schemes.
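To make these objects concrete, the following sketch (ours, with illustrative helper names; not code from the monograph) runs the proximal point algorithm on $f(x) = \|x\|_1$, whose proximal operator has the well-known closed form of soft-thresholding, and checks the Moreau envelope gradient identity above.

import numpy as np

def prox_l1(x, lam):
    # Soft-thresholding: closed-form proximal operator of lam * ||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_grad(x, lam):
    # Gradient of the Moreau envelope: (x - prox_{lam f}(x)) / lam.
    return (x - prox_l1(x, lam)) / lam

# Proximal point algorithm on f(x) = ||x||_1 (unique minimizer: x = 0).
lam, x = 0.5, np.array([3.0, -2.0, 0.2])
for k in range(10):
    x = prox_l1(x, lam)               # x_{k+1} = prox_{lam f}(x_k)
print(x)                              # reaches 0 exactly after a few steps

# A prox step is a gradient step of size lam on the Moreau envelope.
y = np.array([1.0, -0.3, 0.7])
assert np.allclose(prox_l1(y, lam), y - lam * moreau_grad(y, lam))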

5.2.1 Convergence analysis

Given the links between proximal operations and gradient methods, it is probably not surprising that proximal point methods for convex optimization can be analyzed using potential functions similar to those used for gradient methods. However, there is a huge difference between gradient and proximal steps: the latter can be made arbitrarily “powerful” by taking large step sizes. In other words, a single proximal operation can produce an arbitrarily good approximate solution, by picking an arbitrarily large step size. This contrasts with gradient descent, for which large step sizes make the method diverge. This fact is made clearer later by Corollary 5.2. However, this nice property of proximal operators comes at a cost: we may not be able to compute the proximal step efficiently.

As emphasized by the next theorem, proximal point methods for solving (5.1) can be analyzed using potentials similar to those of gradient-based methods. We use

\[ \phi_k = A_k(f(x_k) - f(x_\star)) + \tfrac{1}{2}\|x_k - x_\star\|^2, \]

and show that $\phi_{k+1} \le \phi_k$. As before, this type of reasoning can be used recursively:

\[ A_N(f(x_N) - f(x_\star)) \le \phi_N \le \phi_{N-1} \le \ldots \le \phi_0 = A_0(f(x_0) - f(x_\star)) + \tfrac{1}{2}\|x_0 - x_\star\|^2, \]

reaching bounds of the form $f(x_N) - f_\star \le \frac{1}{2A_N}\|x_0 - x_\star\|^2 = O(A_N^{-1})$, assuming $A_0 = 0$. Therefore, the convergence rates are dictated by the growth rate of the scalar sequence $\{A_k\}$, so the proofs are designed to increase $A_k$ as fast as possible.

Theorem 5.1. Let $f \in \mathcal{F}_{0,\infty}$. For any $k \in \mathbb{N}$, $A_k, \lambda_k \ge 0$, and any $x_k$, it holds that

\[ A_{k+1}(f(x_{k+1}) - f(x_\star)) + \tfrac{1}{2}\|x_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f(x_\star)) + \tfrac{1}{2}\|x_k - x_\star\|^2, \]

with $x_{k+1} = \mathrm{prox}_{\lambda_k f}(x_k)$ and $A_{k+1} = A_k + \lambda_k$.

Proof. We perform a weighted sum of the following valid inequalities originating from our assumptions.

• convexity between $x_{k+1}$ and $x_\star$, with weight $\lambda_k$:

\[ f(x_\star) \ge f(x_{k+1}) + \langle g_f(x_{k+1}); x_\star - x_{k+1} \rangle, \]

with some $g_f(x_{k+1}) \in \partial f(x_{k+1})$,

• convexity between $x_{k+1}$ and $x_k$, with weight $A_k$:

\[ f(x_k) \ge f(x_{k+1}) + \langle g_f(x_{k+1}); x_k - x_{k+1} \rangle, \]

with the same $g_f(x_{k+1}) \in \partial f(x_{k+1})$ as before.

By performing a weighted sum of those two inequalities, with their respective weights, we obtain the following valid inequality:

\[ 0 \ge \lambda_k \left[ f(x_{k+1}) - f(x_\star) + \langle g_f(x_{k+1}); x_\star - x_{k+1} \rangle \right] + A_k \left[ f(x_{k+1}) - f(x_k) + \langle g_f(x_{k+1}); x_k - x_{k+1} \rangle \right]. \]

By matching the expressions term by term and substituting $x_{k+1} = x_k - \lambda_k g_f(x_{k+1})$, one can easily check that the previous inequality can be rewritten exactly as

\[ (A_k + \lambda_k)(f(x_{k+1}) - f(x_\star)) + \tfrac{1}{2}\|x_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f(x_\star)) + \tfrac{1}{2}\|x_k - x_\star\|^2 - \lambda_k \frac{2A_k + \lambda_k}{2}\|g_f(x_{k+1})\|^2. \]

By omitting the last term of the right-hand side (which is nonpositive), we reach the desired statement.

The first proof of the following worst-case guarantee is due to Güler (1991), and directly follows from the previous potential.

Corollary 5.2. Let $f \in \mathcal{F}_{0,\infty}$, $\{\lambda_i\}_{i\ge0}$ be a sequence of nonnegative step sizes, and $\{x_i\}_{i\ge0}$ be the sequence of iterates from the corresponding proximal point algorithm (5.2). For all $k \in \mathbb{N}$, $k \ge 1$, it holds that

\[ f(x_k) - f_\star \le \frac{\|x_0 - x_\star\|^2}{2\sum_{i=0}^{k-1}\lambda_i}. \]

Proof. This directly follows from the potential and the choice $A_0 = 0$, resulting in $A_k = \sum_{i=0}^{k-1}\lambda_i$, along with

\[ f(x_k) - f(x_\star) \le \frac{1}{2A_k}\|x_0 - x_\star\|^2, \]

and the claim follows.
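As a quick numerical illustration (ours, not from the text), the following Python sketch runs the proximal point algorithm on a quadratic, where each proximal step amounts to a linear solve, and checks the bound of Corollary 5.2 along the iterations.

import numpy as np

rng = np.random.default_rng(0)
d = 20
M = rng.standard_normal((d, d))
Q = M @ M.T + np.eye(d)            # f(x) = 0.5 x^T Q x, minimized at x_star = 0
f = lambda x: 0.5 * x @ Q @ x      # f_star = 0

x0 = rng.standard_normal(d)
x, lam, lam_sum = x0.copy(), 1.0, 0.0
for k in range(1, 51):
    x = np.linalg.solve(np.eye(d) + lam * Q, x)   # exact prox: (I + lam Q) y = x
    lam_sum += lam
    assert f(x) <= x0 @ x0 / (2 * lam_sum) + 1e-12   # bound of Corollary 5.2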

Note again that we can make this bound arbitrarily good by simply increasing the values of the $\lambda_k$’s. There is no contradiction here, because the proximal oracle is massively stronger than the usual gradient step, as previously discussed. However, solving even a single proximal step is usually (nearly) as hard as solving the original optimization problem, so the proximal point method, as detailed here, is a purely conceptual algorithm. Note that the choice of a constant step size $\lambda_k = \lambda$ results in an $f(x_N) - f_\star = O(N^{-1})$ convergence rate, reminiscent of gradient descent. It turns out that, as for gradient-based optimization of smooth convex functions, it is possible to improve this result to $O(N^{-2})$ by using information from previous iterations. This idea was proposed by Güler (1992). In the case of a constant step size $\lambda_k = \lambda$, one possible way to obtain this improvement is to apply Nesterov’s method (or any other accelerated variant) to the Moreau envelope of f. For varying step sizes, the corresponding bound has the form

\[ f(x_k) - f_\star \le \frac{2\|x_0 - x_\star\|^2}{\left(\sum_{i=0}^{k-1}\sqrt{\lambda_i}\right)^2}. \]

In addition, Güler’s acceleration is actually robust to computation errors (as described in the next sections), while allowing varying step size strategies. Those two key properties allow using Güler’s acceleration to design improved numerical optimization schemes:

1. Approximate proximal steps can be used, for example by approximately solving the proximal subproblems via iterative methods.

2. Step sizes $\lambda_i$ can be increased from iteration to iteration, allowing one to reach arbitrarily fast convergence rates (assuming of course that the proximal subproblems can be solved efficiently).

It is important to note that classical lower bounds for gradient-type methods do not apply here, as we use the much stronger, and more expensive, proximal oracle. It is

therefore not a surprise that such techniques (i.e., increasing the step sizes $\lambda_k$ from iteration to iteration) might beat the $O(N^{-2})$ bound obtained through Nesterov’s acceleration. Such increasing step size rules can, for example, be used when solving the proximal subproblems via Newton’s method, as proposed by Monteiro and Svaiter (2013).

5.3 Güler and Monteiro-Svaiter Acceleration

In this section, we describe an accelerated version of the proximal point algorithm, possibly involving inexact proximal evaluations. The method detailed below is a simplified version of that of Monteiro and Svaiter (2013), sufficient for our purposes, for which we provide a simple convergence proof. The method essentially boils down to that of Güler (1992) when exact proximal evaluations are used (note that the inexact analysis of Güler (1992) has a few gaps). Before proceeding, let us mention that there exist quite a few natural notions of inexactness for proximal operations. In this chapter, we focus on approximately satisfying the first-order optimality conditions of the proximal subproblem

\[ x_{k+1} = \operatorname*{argmin}_x \left\{ \Phi(x; y_k) \equiv f(x) + \frac{1}{2\lambda_k}\|x - y_k\|^2 \right\}. \]

In other words, the optimality condition of the proximal subproblem is

\[ 0 = \lambda_k g_f(x_{k+1}) + x_{k+1} - y_k, \]

for some $g_f(x_{k+1}) \in \partial f(x_{k+1})$. In the following lines, we tolerate, instead, an error $e_k$:

\[ e_k = \lambda_k g_f(x_{k+1}) + x_{k+1} - y_k, \]

and require $\|e_k\|$ to be small enough to guarantee convergence (even starting at an optimal point does not imply staying at it, without proper assumptions on $e_k$). One possibility is to require $\|e_k\|$ to be small with respect to the distance between the starting point $y_k$ and the approximate solution $x_{k+1}$ to the proximal subproblem. Formally, we use the following definition of an approximate solution with relative inaccuracy $0 \le \delta \le 1$:

\[ x_{k+1} \approx_\delta \mathrm{prox}_{\lambda_k f}(y_k) \iff \begin{cases} \|e_k\| \le \delta\|x_{k+1} - y_k\|, \\ \text{with } e_k \stackrel{\Delta}{=} x_{k+1} - y_k + \lambda_k g_f(x_{k+1}) \\ \text{for some } g_f(x_{k+1}) \in \partial f(x_{k+1}). \end{cases} \qquad (5.3) \]

Intuitively, this notion tolerates relatively large errors when the solution of the proximal subproblem is far from $y_k$ (meaning that $y_k$ is also far away from a minimum of f), while imposing relatively small errors when getting closer to a solution. On the other hand, if $y_k$ is an optimal point of f, then so is $x_{k+1}$, as shown by the following proposition.

Proposition 5.1. Let $y_k \in \operatorname*{argmin}_x f(x)$. For any $\delta \in [0,1]$ and any $x_{k+1} \approx_\delta \mathrm{prox}_{\lambda_k f}(y_k)$, it holds that $x_{k+1} \in \operatorname*{argmin}_x f(x)$.

Proof. We only consider the case $\delta = 1$, as without loss of generality $x_{k+1} \approx_\delta \mathrm{prox}_{\lambda_k f}(y_k) \Rightarrow x_{k+1} \approx_1 \mathrm{prox}_{\lambda_k f}(y_k)$ for any $\delta \in [0,1]$. By definition of $x_{k+1}$, we have

\[ \|x_{k+1} - y_k + \lambda_k g_f(x_{k+1})\|^2 \le \|x_{k+1} - y_k\|^2 \;\Leftrightarrow\; 2\lambda_k\langle g_f(x_{k+1}); x_{k+1} - y_k \rangle \le -\lambda_k^2\|g_f(x_{k+1})\|^2, \qquad (5.4) \]

for some $g_f(x_{k+1}) \in \partial f(x_{k+1})$ (the second inequality follows from basic algebraic manipulations of the first one). In addition, optimality of $y_k$ implies

\[ \langle g_f(x_{k+1}); x_{k+1} - y_k \rangle = \langle g_f(x_{k+1}) - g_f(y_k); x_{k+1} - y_k \rangle \ge 0, \]

with $g_f(y_k) = 0 \in \partial f(y_k)$, and where the inequality follows from convexity of f (see e.g., Section A). Therefore, condition (5.4) can be satisfied only when $g_f(x_{k+1}) = 0$, meaning that $x_{k+1}$ is a minimizer of f.

Assuming now that it is possible to find an approximate solution to the proximal subproblem, one can use Algorithm 20, originating from (Monteiro and Svaiter, 2013), to minimize the convex function f. The parameters $A_k, a_k$ in the algorithm were optimized for $\delta = 1$ for simplicity, and can be slightly improved by exploiting the cases $0 \le \delta < 1$.

Algorithm 20 An inexact accelerated proximal point method (Monteiro and Svaiter, 2013)

Input: A convex function f, an initial point $x_0$.
1: Initialize $z_0 = x_0$ and $A_0 = 0$.
2: for k = 0, ... do
3:    Pick $a_k = \frac{\lambda_k + \sqrt{\lambda_k^2 + 4A_k\lambda_k}}{2}$.
4:    $A_{k+1} = A_k + a_k$.
5:    $y_k = \frac{A_k}{A_k + a_k}x_k + \frac{a_k}{A_k + a_k}z_k$.
6:    $x_{k+1} \approx_\delta \mathrm{prox}_{\lambda_k f}(y_k)$ (see Eq. (5.3), for some $\delta \in [0,1]$).
7:    $z_{k+1} = z_k - a_k g_f(x_{k+1})$.
8: end for
Output: An approximate solution $x_{k+1}$.
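The following sketch (ours; assuming an exact proximal oracle, i.e. taking $\delta = 0$ in line 6) implements Algorithm 20 on a quadratic, and checks numerically that the potential of Theorem 5.3 below is nonincreasing along iterations.

import numpy as np

rng = np.random.default_rng(1)
d = 20
M = rng.standard_normal((d, d))
Q = M @ M.T + np.eye(d)                  # f(x) = 0.5 x^T Q x, x_star = 0
f = lambda x: 0.5 * x @ Q @ x            # f_star = 0

x = z = rng.standard_normal(d)
A, lam = 0.0, 1.0
phi = A * f(x) + 0.5 * z @ z             # potential of Theorem 5.3
for k in range(50):
    a = (lam + np.sqrt(lam**2 + 4 * A * lam)) / 2       # line 3
    A = A + a                                           # line 4
    y = ((A - a) * x + a * z) / A                       # line 5
    x = np.linalg.solve(np.eye(d) + lam * Q, y)         # line 6, exact prox
    g = (y - x) / lam                    # g_f(x_{k+1}) from prox optimality
    z = z - a * g                                       # line 7
    phi_next = A * f(x) + 0.5 * z @ z
    assert phi_next <= phi + 1e-9        # potential is nonincreasing
    phi = phi_next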

Perhaps surprisingly, the method can be analyzed with the same potential as the proximal point algorithm, despite the presence of computation errors.

Theorem 5.3. Let $f \in \mathcal{F}_{0,\infty}$. For any $k \in \mathbb{N}$, $A_k, \lambda_k \ge 0$, and any $x_k, z_k \in \mathbb{R}^d$, it holds that

\[ A_{k+1}(f(x_{k+1}) - f(x_\star)) + \tfrac{1}{2}\|z_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f(x_\star)) + \tfrac{1}{2}\|z_k - x_\star\|^2, \]

where $x_{k+1}$ and $z_{k+1}$ are generated by one iteration of Algorithm 20, and $A_{k+1} = A_k + a_k$.

Proof. We perform a weighted sum of the following valid inequalities, originating from our assumptions.

• Convexity between $x_{k+1}$ and $x_\star$, with weight $A_{k+1} - A_k$:

\[ f(x_\star) \ge f(x_{k+1}) + \langle g_f(x_{k+1}); x_\star - x_{k+1} \rangle, \]

for some $g_f(x_{k+1}) \in \partial f(x_{k+1})$, which we also use below,

• convexity between $x_{k+1}$ and $x_k$, with weight $A_k$:

\[ f(x_k) \ge f(x_{k+1}) + \langle g_f(x_{k+1}); x_k - x_{k+1} \rangle, \]

• error magnitude, with weight $A_{k+1}/(2\lambda_k)$:

\[ \|e_k\|^2 \le \|x_{k+1} - y_k\|^2, \]

which is valid for all $\delta \in [0,1]$ in (5.3).

By performing the weighted sum of those three inequalities, we obtain the following valid inequality:

\[ 0 \ge (A_{k+1} - A_k)\left[ f(x_{k+1}) - f(x_\star) + \langle g_f(x_{k+1}); x_\star - x_{k+1} \rangle \right] + A_k\left[ f(x_{k+1}) - f(x_k) + \langle g_f(x_{k+1}); x_k - x_{k+1} \rangle \right] + \frac{A_{k+1}}{2\lambda_k}\left[ \|e_k\|^2 - \|x_{k+1} - y_k\|^2 \right]. \]

After substituting $A_{k+1} = A_k + a_k$, $x_{k+1} = \frac{A_k}{A_k + a_k}x_k + \frac{a_k}{A_k + a_k}z_k - \lambda_k g_f(x_{k+1}) + e_k$, and $z_{k+1} = z_k - a_k g_f(x_{k+1})$, one can easily check that the previous inequality can be rewritten as

\[ (A_k + a_k)(f(x_{k+1}) - f(x_\star)) + \tfrac{1}{2}\|z_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f(x_\star)) + \tfrac{1}{2}\|z_k - x_\star\|^2 - \frac{\lambda_k(a_k + A_k) - a_k^2}{2}\|g_f(x_{k+1})\|^2, \]

either by comparing the expressions on a term-by-term basis, or by using an appropriate “complete the squares” strategy. We obtain the desired statement by enforcing $\lambda_k(a_k + A_k) - a_k^2 \ge 0$, which allows us to neglect the last term of the right-hand side (which is then nonpositive). Finally, since we already assumed $a_k \ge 0$, requiring $\lambda_k(a_k + A_k) - a_k^2 \ge 0$ corresponds to

\[ 0 \le a_k \le \frac{\lambda_k + \sqrt{\lambda_k^2 + 4A_k\lambda_k}}{2}, \]

which yields the desired result.

The convergence speed then follows from the same reasoning as for the proximal point method.

Corollary 5.4. Let $f \in \mathcal{F}_{0,\infty}$, $\{\lambda_i\}_{i\ge0}$ be a sequence of nonnegative step sizes, and $\{x_i\}_{i\ge0}$ be the corresponding sequence of iterates from Algorithm 20. For all $k \in \mathbb{N}$, $k \ge 1$, it holds that

\[ f(x_k) - f_\star \le \frac{2\|x_0 - x_\star\|^2}{\left(\sum_{i=0}^{k-1}\sqrt{\lambda_i}\right)^2}. \]

Proof. Using the potential from Theorem 5.3 with $A_0 = 0$, we directly obtain

\[ f(x_k) - f_\star \le \frac{\|x_0 - x_\star\|^2}{2A_k}. \]

The desired result then follows from

\[ A_{k+1} = A_k + a_k = A_k + \frac{\lambda_k + \sqrt{\lambda_k^2 + 4A_k\lambda_k}}{2} \ge A_k + \frac{\lambda_k}{2} + \sqrt{A_k\lambda_k} \ge \left( \sqrt{A_k} + \frac{\sqrt{\lambda_k}}{2} \right)^2, \]

hence $\sqrt{A_k} \ge \sqrt{A_{k-1}} + \tfrac{1}{2}\sqrt{\lambda_{k-1}}$ and $A_k \ge \tfrac{1}{4}\left(\sum_{i=0}^{k-1}\sqrt{\lambda_i}\right)^2$.
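This growth rate of $A_k$ is also easy to check numerically; a small sketch (ours):

import numpy as np

rng = np.random.default_rng(2)
lams = rng.uniform(0.1, 2.0, size=100)   # arbitrary nonnegative step sizes
A = 0.0
for k, lam in enumerate(lams, start=1):
    A += (lam + np.sqrt(lam**2 + 4 * A * lam)) / 2       # A_{k+1} = A_k + a_k
    assert A >= 0.25 * np.sum(np.sqrt(lams[:k])) ** 2 - 1e-9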

5.4 Exploiting Strong Convexity

In this section, we provide refined convergence results for the case where the function to be minimized is µ-strongly convex (all results from previous sections are recovered by taking µ = 0). The algebra is slightly more technical, but the message and the techniques are the same. While the proofs of the previous section can simply be seen as particular cases of the proofs presented below, we detail both versions separately to make the argument more accessible.

Proximal point algorithm under strong convexity

Let us begin by refining the results on the proximal point algorithm. The same modification to the potential function is used for incorporating acceleration in the sequel.

Theorem 5.5. Let f be a closed, proper, and µ-strongly convex function. For any $k \in \mathbb{N}$, $A_k, \lambda_k \ge 0$, any $x_k$, and $A_{k+1} = A_k(1 + \lambda_k\mu) + \lambda_k$, it holds that

\[ A_{k+1}(f(x_{k+1}) - f(x_\star)) + \frac{1 + \mu A_{k+1}}{2}\|x_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f(x_\star)) + \frac{1 + \mu A_k}{2}\|x_k - x_\star\|^2, \]

with $x_{k+1} = \mathrm{prox}_{\lambda_k f}(x_k)$.

Proof. We perform a weighted sum of the following valid inequalities, originating from our assumptions.

• Strong convexity between $x_{k+1}$ and $x_\star$, with weight $A_{k+1} - A_k$:

\[ f(x_\star) \ge f(x_{k+1}) + \langle g_f(x_{k+1}); x_\star - x_{k+1} \rangle + \frac{\mu}{2}\|x_\star - x_{k+1}\|^2, \]

with some $g_f(x_{k+1}) \in \partial f(x_{k+1})$, which we further use below,

• convexity between $x_{k+1}$ and $x_k$, with weight $A_k$:

\[ f(x_k) \ge f(x_{k+1}) + \langle g_f(x_{k+1}); x_k - x_{k+1} \rangle. \]

By performing the weighted sum of those two inequalities, we obtain the following valid inequality:

\[ 0 \ge (A_{k+1} - A_k)\left[ f(x_{k+1}) - f(x_\star) + \langle g_f(x_{k+1}); x_\star - x_{k+1} \rangle + \frac{\mu}{2}\|x_\star - x_{k+1}\|^2 \right] + A_k\left[ f(x_{k+1}) - f(x_k) + \langle g_f(x_{k+1}); x_k - x_{k+1} \rangle \right]. \]

By matching the expressions term by term, after substituting $x_{k+1} = x_k - \lambda_k g_f(x_{k+1})$ and $A_{k+1} = A_k(1 + \lambda_k\mu) + \lambda_k$, one can check that the previous inequality can be rewritten as

\[ A_{k+1}(f(x_{k+1}) - f(x_\star)) + \frac{1 + \mu A_{k+1}}{2}\|x_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f(x_\star)) + \frac{1 + \mu A_k}{2}\|x_k - x_\star\|^2 - \lambda_k\frac{A_k(2 + \lambda_k\mu) + \lambda_k}{2}\|g_f(x_{k+1})\|^2. \]

Neglecting the last term of the right-hand side (which is nonpositive), we reach the desired statement.

In order to obtain the convergence rate guaranteed by the previous potential, we again have to characterize the growth rate of $A_k$, observing that

\[ A_{k+1} \ge A_k(1 + \lambda_k\mu) = \frac{A_k}{1 - \frac{\lambda_k\mu}{1+\lambda_k\mu}}. \]

We conclude in the corollary below that accuracy $\epsilon$ is therefore achieved in $O\left(\frac{1+\lambda\mu}{\lambda\mu}\log\frac{1}{\epsilon}\right)$ iterations of the proximal point algorithm when the step size $\lambda_k = \lambda$ is kept constant, which contrasts with the $O\left(\frac{1}{\lambda\epsilon}\right)$ of the non-strongly convex case.

Corollary 5.6. Let f be a closed, proper, and µ-strongly convex function with $\mu \ge 0$, $\{\lambda_i\}_{i\ge0}$ be a sequence of nonnegative step sizes, and $\{x_i\}_{i\ge0}$ be the corresponding sequence of iterates from the proximal point algorithm. For all $k \in \mathbb{N}$, $k \ge 1$, it holds that

\[ f(x_k) - f_\star \le \frac{\mu\|x_0 - x_\star\|^2}{2\left(\prod_{i=0}^{k-1}(1 + \lambda_i\mu) - 1\right)}. \]

Proof. One can note that the recurrence for $A_k$ provided in Theorem 5.5 has a simple solution $A_k = \left(\prod_{i=0}^{k-1}(1 + \lambda_i\mu) - 1\right)/\mu$. Together with

\[ f(x_k) - f_\star \le \frac{\|x_0 - x_\star\|^2}{2A_k}, \]

as provided by Theorem 5.5 with A0 = 0, we reach the desired statement.
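A quick numerical check of Corollary 5.6 (ours, not from the text): the proximal point algorithm on a µ-strongly convex quadratic indeed satisfies the geometric bound.

import numpy as np

rng = np.random.default_rng(3)
d, mu = 10, 1.0
M = rng.standard_normal((d, d))
Q = M @ M.T + mu * np.eye(d)       # f(x) = 0.5 x^T Q x is mu-strongly convex
f = lambda x: 0.5 * x @ Q @ x      # x_star = 0, f_star = 0

x0 = rng.standard_normal(d)
x, lam, prod = x0.copy(), 0.5, 1.0
for k in range(1, 31):
    x = np.linalg.solve(np.eye(d) + lam * Q, x)   # exact prox step
    prod *= 1 + lam * mu
    assert f(x) <= mu * (x0 @ x0) / (2 * (prod - 1)) + 1e-12   # Corollary 5.6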

As a particular case, one can note that we recover the previous corollary when µ = 0, as $A_k \to \sum_{i=0}^{k-1}\lambda_i$ when µ goes to zero.

Proximal acceleration and inexactness under strong convexity

To accelerate convergence while exploiting strong convexity, we upgrade Algorithm 20 into Algorithm 21, whose analysis follows along the same lines as before. In this case, the algorithm was optimized for $\delta = \sqrt{1+\lambda_k\mu}$ for simplicity; it can be slightly improved by exploiting the cases $0 \le \delta < \sqrt{1+\lambda_k\mu}$. This method is a simplified version of the A-HPE method of (Barré et al., 2020a).

Algorithm 21 An inexact accelerated proximal point method

Input: A (µ-strongly) convex function f, an initial point $x_0$.
1: Initialize $z_0 = x_0$ and $A_0 = 0$.
2: for k = 0, ... do
3:    Pick $A_{k+1} = A_k + \frac{\lambda_k + 2A_k\lambda_k\mu + \sqrt{4A_k^2\lambda_k\mu(\lambda_k\mu+1) + 4A_k\lambda_k(\lambda_k\mu+1) + \lambda_k^2}}{2}$.
4:    $y_k = x_k + \frac{(A_{k+1}-A_k)(A_k\mu+1)}{A_{k+1} + 2\mu A_k A_{k+1} - \mu A_k^2}(z_k - x_k)$.
5:    $x_{k+1} \approx_\delta \mathrm{prox}_{\lambda_k f}(y_k)$ (see Eq. (5.3), for some $\delta \in [0, \sqrt{1+\lambda_k\mu}]$).
6:    $z_{k+1} = z_k + \mu\frac{A_{k+1}-A_k}{1+\mu A_{k+1}}(x_{k+1} - z_k) - \frac{A_{k+1}-A_k}{1+\mu A_{k+1}} g_f(x_{k+1})$.
7: end for
Output: An approximate solution $x_{k+1}$.
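The recursion of line 3 can be checked against the geometric growth rate (5.5) established below; a small sketch (ours):

import numpy as np

mu, lam, A = 0.1, 1.0, 0.0
rate = 1 - np.sqrt(lam * mu / (1 + lam * mu))    # contraction factor in (5.5)
for k in range(50):
    disc = 4 * A**2 * lam * mu * (lam * mu + 1) \
         + 4 * A * lam * (lam * mu + 1) + lam**2
    A_next = A + (lam + 2 * A * lam * mu + np.sqrt(disc)) / 2   # line 3
    if A > 0:
        assert A_next >= A / rate - 1e-9         # growth rate (5.5)
    A = A_next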

Theorem 5.7. Let f be a closed, proper, and µ-strongly convex function. For any $k \in \mathbb{N}$ and $A_k, \lambda_k \ge 0$, the iterates of Algorithm 21 satisfy

\[ A_{k+1}(f(x_{k+1}) - f(x_\star)) + \frac{1 + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \le A_k(f(x_k) - f(x_\star)) + \frac{1 + \mu A_k}{2}\|z_k - x_\star\|^2. \]

Proof. We perform a weighted sum of the following valid inequalities, originating from our assumptions.

• Strong convexity between $x_{k+1}$ and $x_\star$, with weight $A_{k+1} - A_k$:

\[ f(x_\star) \ge f(x_{k+1}) + \langle g_f(x_{k+1}); x_\star - x_{k+1} \rangle + \frac{\mu}{2}\|x_\star - x_{k+1}\|^2, \]

with some $g_f(x_{k+1}) \in \partial f(x_{k+1})$, where this particular subgradient is used repeatedly below,

• strong convexity between $x_{k+1}$ and $x_k$, with weight $A_k$:

\[ f(x_k) \ge f(x_{k+1}) + \langle g_f(x_{k+1}); x_k - x_{k+1} \rangle + \frac{\mu}{2}\|x_k - x_{k+1}\|^2, \]

• error magnitude, with weight $\frac{A_{k+1} + 2\mu A_k A_{k+1} - \mu A_k^2}{2\lambda_k(1 + \mu A_{k+1})}$:

\[ \|e_k\|^2 \le (1 + \lambda_k\mu)\|x_{k+1} - y_k\|^2, \]

which is valid for all $\delta \in [0, \sqrt{1+\lambda_k\mu}]$ in (5.3).

By performing a weighted sum of those three inequalities, with their respective weights, we obtain the following valid inequality:

\[
\begin{aligned}
0 \ge {} & (A_{k+1} - A_k)\left[ f(x_{k+1}) - f(x_\star) + \langle g_f(x_{k+1}); x_\star - x_{k+1} \rangle + \frac{\mu}{2}\|x_\star - x_{k+1}\|^2 \right] \\
& + A_k\left[ f(x_{k+1}) - f(x_k) + \langle g_f(x_{k+1}); x_k - x_{k+1} \rangle + \frac{\mu}{2}\|x_k - x_{k+1}\|^2 \right] \\
& + \frac{A_{k+1} + 2\mu A_k A_{k+1} - \mu A_k^2}{2\lambda_k(1 + \mu A_{k+1})}\left[ \|e_k\|^2 - (1 + \lambda_k\mu)\|x_{k+1} - y_k\|^2 \right].
\end{aligned}
\]

By matching the expressions term by term and by substituting the expressions of $y_k$, $x_{k+1} = y_k - \lambda_k g_f(x_{k+1}) + e_k$, and $z_{k+1}$, one can check that the previous inequality can be rewritten as (we advise not substituting $A_{k+1}$ at this stage)

\[
\begin{aligned}
A_{k+1}&(f(x_{k+1}) - f(x_\star)) + \frac{1 + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \\
&\le A_k(f(x_k) - f(x_\star)) + \frac{1 + \mu A_k}{2}\|z_k - x_\star\|^2 \\
&\quad - \frac{(A_{k+1} + 2\mu A_k A_{k+1} - \mu A_k^2)\lambda_k - (A_{k+1} - A_k)^2}{1 + \mu A_{k+1}} \cdot \frac{1}{2}\|g_f(x_{k+1})\|^2 \\
&\quad - \frac{A_k(A_{k+1} - A_k)(1 + \mu A_k)}{A_{k+1} + 2\mu A_k A_{k+1} - \mu A_k^2} \cdot \frac{\mu}{2}\|x_k - z_k\|^2.
\end{aligned}
\]

The conclusion follows from $A_{k+1} \ge A_k$, allowing us to discard the last term (which is then nonpositive). Nonnegativity of the first residual term can be enforced by choosing $A_{k+1}$ such that

\[ (A_{k+1} + 2\mu A_k A_{k+1} - \mu A_k^2)\lambda_k - (A_{k+1} - A_k)^2 \ge 0, \]

and the desired result is, in particular, achieved by picking the largest root of this second-order polynomial in $A_{k+1}$, so that $A_{k+1} \ge A_k$.

In contrast with the previous proximal point algorithm, this accelerated version requires

\[ O\left( \sqrt{\frac{1+\lambda\mu}{\lambda\mu}} \log\frac{1}{\epsilon} \right) \]

inexact proximal iterations to reach $f(x_k) - f(x_\star) \le \epsilon$ when using a constant step size $\lambda_k = \lambda$. This follows from an inequality characterizing the growth rate of the sequence $A_k$, written

\[ A_{k+1} \ge A_k(1 + \lambda_k\mu) + A_k\sqrt{\lambda_k\mu(1 + \lambda_k\mu)} = \frac{A_k}{1 - \sqrt{\frac{\lambda_k\mu}{1+\lambda_k\mu}}}. \qquad (5.5) \]

Corollary 5.8. Let $f \in \mathcal{F}_{\mu,\infty}$ with $\mu \ge 0$, $\{\lambda_i\}_{i\ge0}$ be a sequence of nonnegative step sizes, and $\{x_i\}_{i\ge0}$ be the corresponding sequence of iterates from Algorithm 21. For all $k \in \mathbb{N}$, $k \ge 1$, it holds that

\[ f(x_k) - f_\star \le \prod_{i=1}^{k-1}\left( 1 - \sqrt{\tfrac{\lambda_i\mu}{1+\lambda_i\mu}} \right) \frac{\|x_0 - x_\star\|^2}{2\lambda_0}. \]

Proof. The proof follows from the same arguments as before, that is,

\[ f(x_k) - f_\star \le \frac{\|x_0 - x_\star\|^2}{2A_k}, \]

using

\[ A_k \ge \frac{\lambda_0}{\prod_{i=1}^{k-1}\left( 1 - \sqrt{\tfrac{\lambda_i\mu}{1+\lambda_i\mu}} \right)}, \]

where we used $A_0 = 0$, resulting in $A_1 = \lambda_0$, and proceeding with (5.5).

Before going to the next section, let us note that combining Corollary 5.4 with Corollary 5.8 shows

\[ f(x_k) - f_\star \le \min\left\{ \frac{\prod_{i=1}^{k-1}\left( 1 - \sqrt{\tfrac{\lambda_i\mu}{1+\lambda_i\mu}} \right)}{2\lambda_0},\; \frac{2}{\left(\sum_{i=0}^{k-1}\sqrt{\lambda_i}\right)^2} \right\} \|x_0 - x_\star\|^2. \]

There exist many notions of inexactness for solving the proximal subproblems, giving rise to different types of guarantees, together with slightly different methods. In particular, we required the approximate solution to have a small gradient. Other notions include satisfying an inaccuracy criterion involving function values (Güler, 1992; Schmidt et al., 2011; Villa et al., 2013). Different choices might be favored, depending on the target application or on the target algorithm for solving the inner problems. A fairly general framework was developed by Monteiro and Svaiter (2013) (where the error is controlled via a primal-dual gap on the proximal subproblem).

In what follows, we illustrate how to use proximal methods as meta-algorithms to improve the convergence of simple gradient-based first-order methods. This idea can be extended by embedding essentially any algorithm that can solve the proximal subproblem.

5.5.1 Catalyst acceleration

A popular application of the inexact accelerated proximal point method is “Catalyst” acceleration (Lin et al., 2015). For readability purposes, we do not present the general Catalyst framework, but rather a simple instance of it. Stochastic versions of this acceleration procedure were also developed, which we briefly summarize in Section 5.5.4. The idea is again to use a base first-order method to approximate the proximal subproblem up to the required accuracy. For now, let us assume we want to minimize an L-smooth convex function f, i.e., $\min_x f(x)$. The corresponding proximal subproblem has the form

\[ \mathrm{prox}_{\lambda f}(y) = \operatorname*{argmin}_x \left\{ f(x) + \frac{1}{2\lambda}\|x - y\|^2 \right\}, \qquad (5.6) \]

and is therefore the minimization of an $(L + 1/\lambda)$-smooth and $1/\lambda$-strongly convex function. To solve such a problem, one can use a first-order method to approximate its solution.

Preliminaries

In what follows, we consider using a method $\mathcal{M}$ to solve the proximal subproblem (5.6). We assume this method is guaranteed to converge linearly on any smooth strongly convex problem with minimizer $w_\star$, satisfying

\[ \|w_k - w_\star\| \le C_{\mathcal{M}}(1 - \tau_{\mathcal{M}})^k\|w_0 - w_\star\| \qquad (5.7) \]

for some constant $C_{\mathcal{M}} \ge 0$ and rate of convergence $0 < \tau_{\mathcal{M}} \le 1$. Note that we consider linear convergence in terms of $\|w_k - w_\star\|$ for convenience; other notions can be used instead, such as convergence in function values. We distinguish the iterates $x_k$, $y_k$, and $z_k$ of the inexact accelerated proximal point algorithm (Algorithm 20) from the iterates $w_0, \ldots, w_k$ of $\mathcal{M}$, used to approximate the proximal step (line 6) of Algorithm 20. One can then apply Algorithm 20 to minimize f, while (approximately) solving the proximal subproblems with $\mathcal{M}$. We first define three iteration counters:

1. $N_{\mathrm{outer}}$, the number of iterations of the inexact accelerated proximal point method (Algorithm 20 or 21), which serves as the “outer loop” of the overall acceleration scheme.

2. $N_{\mathrm{inner}}(k)$, the number of iterations of the method $\mathcal{M}$ used for approximately solving the inner proximal subproblem at iteration k of the outer-loop inexact proximal point method.

3. $N_{\mathrm{total}}$, the total number of iterations of the method $\mathcal{M}$:

\[ N_{\mathrm{total}} = N_{\mathrm{useless}} + \sum_{k=0}^{N_{\mathrm{outer}}-1} N_{\mathrm{inner}}(k), \]

where $N_{\mathrm{useless}}$ is the number of iterations of $\mathcal{M}$ that did not allow finishing one additional outer iteration, so $N_{\mathrm{useless}} < N_{\mathrm{inner}}(N_{\mathrm{outer}})$ (i.e., the number of useless iterations of $\mathcal{M}$ is smaller than the number of such iterations that would have led to an additional outer iteration).

Overall complexity

As we detail in the sequel, assuming $\mathcal{M}$ indeed satisfies (5.7), the overall complexity of the combination of methods is guaranteed to be

\[ f(x_{N_{\mathrm{outer}}}) - f_\star = O(N_{\mathrm{total}}^{-2}), \]

where $x_{N_{\mathrm{outer}}}$ is the iterate produced after $N_{\mathrm{outer}}$ iterations of the inexact accelerated proximal point method, or equivalently after a total of $N_{\mathrm{total}}$ iterations of the method $\mathcal{M}$. More precisely, it is guaranteed to satisfy

\[ f(x_{N_{\mathrm{outer}}}) - f(x_\star) \le \frac{2\|x_0 - x_\star\|^2}{\lambda N_{\mathrm{outer}}^2} \le \frac{2\|x_0 - x_\star\|^2}{\lambda \lfloor B_{\mathcal{M},\lambda}^{-1} N_{\mathrm{total}} \rfloor^2}, \]

where we used $N_{\mathrm{inner}}(k) \le B_{\mathcal{M},\lambda}$ for all $k \ge 0$, and hence $N_{\mathrm{outer}} \ge \lfloor N_{\mathrm{total}}/B_{\mathcal{M},\lambda} \rfloor$, where the constant $B_{\mathcal{M},\lambda}$ depends solely on the choice of λ and on the properties of $\mathcal{M}$. It represents the computational burden of approximately solving one proximal subproblem with $\mathcal{M}$, and satisfies

\[ B_{\mathcal{M},\lambda} \le \frac{\log(C_{\mathcal{M}}(\lambda L + 2))}{\tau_{\mathcal{M}}} + 1. \]

Let us provide a few simple examples based on gradient methods for smooth strongly convex minimization. For all these methods, the embedding within the inexact proximal framework yields an

\[ N_{\mathrm{total}} = O\left( B_{\mathcal{M},\lambda} \sqrt{\frac{L\|x_0 - x_\star\|^2}{\epsilon}} \right) \qquad (5.8) \]

iteration complexity in terms of the total number of calls to $\mathcal{M}$ to find a point satisfying $f(x_{N_{\mathrm{outer}}}) - f(x_\star) \le \epsilon$. We can make this bound a bit more explicit depending on the choice of $\mathcal{M}$.

• Suppose we solve the proximal subproblem with $\mathcal{M}$ a regular gradient method using step size $1/(L + 1/\lambda)$. The method is known to converge linearly with $C_{\mathcal{M}} = 1$ and $\tau_{\mathcal{M}} = \frac{1}{1+\lambda L}$ (inverse condition ratio of the proximal subproblem), and produces the accelerated rate in (5.8). Note that directly applying the gradient method to the problem of minimizing f yields a much worse iteration complexity $O\left(\frac{L\|x_0 - x_\star\|^2}{\epsilon}\right)$.

• Let $\mathcal{M}$ be a gradient method with exact line search. It is guaranteed to converge linearly with $C_{\mathcal{M}} = \lambda L + 1$ (condition ratio of the proximal subproblem) and $\tau_{\mathcal{M}} = \frac{2}{2+\lambda L}$. The iteration complexity of applying this steepest-descent scheme directly to f is similar to that of vanilla gradient descent. One can also choose λ so as not to have a too-large $B_{\mathcal{M},\lambda}$, for example $\lambda = 1/L$.

• Let $\mathcal{M}$ be an accelerated gradient method specifically tailored for smooth strongly convex optimization, for example Nesterov’s method with constant momentum (see Algorithm 14). It is guaranteed to converge linearly with $C_{\mathcal{M}} = \lambda L + 1$ and $\tau_{\mathcal{M}} = \sqrt{\frac{1}{1+\lambda L}}$. Although this method has no working guarantee on the original minimization problem if f is not strongly convex, it can still be used for minimizing f through the inexact proximal point framework, as the proximal subproblems are strongly convex.

As a conclusion, inexact accelerated proximal schemes produce accelerated rates for vanilla optimization methods that converge linearly in the smooth strongly convex case. This idea of embedding a simple first-order method within an inexact accelerated scheme can be applied to a large panel of settings, including strongly convex (see below) and stochastic settings. However, one should note that the practical tuning of the corresponding numerical schemes (and particularly of the step size parameters) critically affects overall performance, as discussed in e.g. (Lin et al., 2018), and makes effective implementation somewhat tricky. The analysis of nonconvex settings is beyond the scope of this chapter, but examples of such results can be found in e.g. (Paquette et al., 2018).
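To make the mechanism concrete, here is a minimal Catalyst-style sketch (ours, with hypothetical helper names, not the authors’ implementation): Algorithm 20 serves as the outer loop, with vanilla gradient descent as the inner method M, stopped on the relative inexactness criterion (5.3) with δ = 1. It assumes access to the gradient oracle of an L-smooth convex f.

import numpy as np

def catalyst_sketch(grad, L, x0, lam=1.0, n_outer=30, max_inner=10000):
    # Algorithm 20 (outer loop) with gradient descent as inner method M.
    x, z, A = x0.copy(), x0.copy(), 0.0
    step = 1.0 / (L + 1.0 / lam)       # the subproblem is (L + 1/lam)-smooth
    for _ in range(n_outer):
        a = (lam + np.sqrt(lam**2 + 4 * A * lam)) / 2
        A += a
        y = ((A - a) * x + a * z) / A
        w = y.copy()                   # warm start at y_k
        for _ in range(max_inner):     # inner loop on Phi_k
            gf = grad(w)
            e = w - y + lam * gf       # e_k from (5.3)
            if np.linalg.norm(e) <= np.linalg.norm(w - y):
                break                  # criterion (5.3) holds with delta = 1
            w = w - step * (gf + (w - y) / lam)   # gradient step on Phi_k
        x = w
        z = z - a * grad(x)            # z_{k+1} = z_k - a_k g_f(x_{k+1})
    return x

# Example: least squares f(x) = 0.5 ||Ax - b||^2 (convex, not strongly convex).
rng = np.random.default_rng(4)
A_mat = rng.standard_normal((30, 50))
b = rng.standard_normal(30)
L = np.linalg.norm(A_mat, 2) ** 2
grad = lambda x: A_mat.T @ (A_mat @ x - b)
x_hat = catalyst_sketch(grad, L, np.zeros(50))
print(0.5 * np.linalg.norm(A_mat @ x_hat - b) ** 2)   # approaches f_star = 0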

5.5.2 Detailed complexity analysis

Recall that function value accuracies, e.g. in Corollary 5.4, are expressed in terms of outer-loop iterations. Therefore, in order to complete the analysis, we need to answer the following question: given a total budget $N_{\mathrm{total}}$ of inner iterations of the method $\mathcal{M}$, how many iterations $N_{\mathrm{outer}}$ of Algorithm 20 will we perform in the ideal strategy (in other words, what is $B_{\mathcal{M},\lambda}$)? To answer this question, we start by analyzing the computational cost of solving a single proximal subproblem with $\mathcal{M}$.

Computational cost of inner problems

Let

\[ \Phi_k(x) = f(x) + \frac{1}{2\lambda}\|x - y_k\|^2 \]

be the objective of the proximal subproblem we aim to solve at iteration k (line 6 of Algorithm 20), centered at $y_k$. By construction, $\Phi_k(x)$ is $(L + 1/\lambda)$-smooth and $1/\lambda$-strongly convex. Also denote by $w_0 = y_k$ our (warm-started) initial iterate, and by $w_0, w_1, \ldots, w_{N_{\mathrm{inner}}(k)}$ the iterates of $\mathcal{M}$ for solving $\min_x \Phi_k(x)$.

We need to compute an upper bound on the number of iterations $N_{\mathrm{inner}}(k)$ required to satisfy the error criterion (5.3):

\[ \|e_k\| = \lambda\|\Phi_k'(w_{N_{\mathrm{inner}}(k)})\| \le \|w_{N_{\mathrm{inner}}(k)} - w_0\|, \qquad (5.9) \]

where we denote by $N_{\mathrm{inner}}(k) = \inf\{ i : \|\Phi_k'(w_i)\| \le \tfrac{1}{\lambda}\|w_i - w_0\| \}$ the index of the first inner iteration such that (5.9) is satisfied, which is precisely the quantity we want to upper bound. We start with the following observations.

• By $(L + 1/\lambda)$-smoothness of $\Phi_k$,

\[ \|\Phi_k'(w_i)\| \le (L + 1/\lambda)\|w_i - w_\star(\Phi_k)\|, \qquad (5.10) \]

where w⋆(Φk) is the minimizer of Φk.

• A triangle inequality on w w (Φ ) implies k 0 − ⋆ k k w w (Φ ) w w (Φ ) w w . (5.11) k 0 − ⋆ k k − k i − ⋆ k k ≤ k 0 − ik

Hence (5.9) is satisfied as soon as the right-hand side of (5.10) is smaller than the left-hand side of (5.11) divided by λ; hence, for any i for which we can prove

\[ (L + 1/\lambda)\|w_i - w_\star(\Phi_k)\| \le \tfrac{1}{\lambda}\left( \|w_0 - w_\star(\Phi_k)\| - \|w_\star(\Phi_k) - w_i\| \right), \]

we get $N_{\mathrm{inner}}(k) \le i$. Rephrasing this inequality leads to

\[ \|w_i - w_\star(\Phi_k)\| \le \frac{1}{\lambda L + 2}\|w_0 - w_\star(\Phi_k)\|. \]

Therefore, by the assumption on $\mathcal{M}$, (5.9) is guaranteed to hold as soon as

\[ C_{\mathcal{M}}(1 - \tau_{\mathcal{M}})^i \le \frac{1}{\lambda L + 2}, \]

hence for any i satisfying

\[ i \ge \frac{\log(C_{\mathcal{M}}(\lambda L + 2))}{\log(1/(1 - \tau_{\mathcal{M}}))}. \]

We conclude that (5.9) is satisfied before this number of iterations is reached, hence

\[ N_{\mathrm{inner}}(k) \le \left\lceil \frac{\log(C_{\mathcal{M}}(\lambda L + 2))}{\log(1/(1 - \tau_{\mathcal{M}}))} \right\rceil \le \frac{\log(C_{\mathcal{M}}(\lambda L + 2))}{\log(1/(1 - \tau_{\mathcal{M}}))} + 1. \]

Given that the right-hand side does not depend on k, we use the notation

\[ B_{\mathcal{M},\lambda} \stackrel{\Delta}{=} \frac{\log(C_{\mathcal{M}}(\lambda L + 2))}{\log(1/(1 - \tau_{\mathcal{M}}))} + 1 \]

as our upper bound on the iteration cost of solving one proximal subproblem via $\mathcal{M}$.

Global complexity bound

We showed that the number of iterations in the inner loop is bounded above by a constant depending on the choice of the regularization parameter and on the method $\mathcal{M}$; in other words, $N_{\mathrm{inner}}(k) \le B_{\mathcal{M},\lambda}$. Writing $N_{\mathrm{total}}$ for the total number of calls to the gradient of f, $N_{\mathrm{outer}}$ for the number of iterations performed by Algorithm 20, and $N_{\mathrm{useless}}$ for the number of iterations of $\mathcal{M}$ that did not result in an additional outer iteration, we conclude that

\[ N_{\mathrm{total}} = N_{\mathrm{useless}} + \sum_{k=0}^{N_{\mathrm{outer}}-1} N_{\mathrm{inner}}(k) \le N_{\mathrm{outer}} B_{\mathcal{M},\lambda}, \]

hence $N_{\mathrm{outer}} \ge \lfloor B_{\mathcal{M},\lambda}^{-1} N_{\mathrm{total}} \rfloor$, as $N_{\mathrm{useless}} < B_{\mathcal{M},\lambda}$ (the number of useless iterations is smaller than the number of iterations that would have led to an additional outer iteration). The conclusion follows from Corollary 5.4:

\[ f(x_{N_{\mathrm{outer}}}) - f(x_\star) \le \frac{2\|x_0 - x_\star\|^2}{\lambda N_{\mathrm{outer}}^2} \le \frac{2\|x_0 - x_\star\|^2}{\lambda \lfloor B_{\mathcal{M},\lambda}^{-1} N_{\mathrm{total}} \rfloor^2}. \]

In other words, given a target accuracy $\epsilon$, the iteration complexity written in terms of the total number of approximate proximal minimizations in Algorithm 20 is $O\left(\sqrt{\frac{\|x_0 - x_\star\|^2}{\lambda\epsilon}}\right)$; the total iteration complexity for solving the problem using $\mathcal{M}$ in the inner loops is simply the same bound multiplied by the cost of solving a single proximal subproblem, namely $O\left(B_{\mathcal{M},\lambda}\sqrt{\frac{\|x_0 - x_\star\|^2}{\lambda\epsilon}}\right)$.

The previous analysis holds in the convex (but not necessarily strongly convex) case; the iteration complexity of solving the inner problems remains valid in the strongly convex case, and the expression for $B_{\mathcal{M},\lambda}$ can only be slightly improved, by taking into account the better strong convexity parameter $\mu + 1/\lambda$ and the acceptable error magnitude, larger by a factor $\sqrt{1+\lambda\mu}$, in Algorithm 21. Therefore, the total number of iterations of Algorithm 21 embedded with $\mathcal{M}$ remains bounded in a similar fashion, the overall error decreases as $\left(1 - \sqrt{\frac{\lambda\mu}{1+\lambda\mu}}\right)^{\lfloor N_{\mathrm{total}}/B_{\mathcal{M},\lambda} \rfloor}$, and the iteration complexity is therefore of order

\[ O\left( B_{\mathcal{M},\lambda}\sqrt{\frac{1+\lambda\mu}{\lambda\mu}}\log\frac{1}{\epsilon} \right). \qquad (5.12) \]

It is then natural to choose the value of λ by optimizing the overall iteration complexity of Algorithm 21 combined with $\mathcal{M}$. One way to proceed is by optimizing

\[ \frac{1}{\tau_{\mathcal{M}}}\sqrt{\frac{1+\lambda\mu}{\lambda\mu}}, \]

essentially neglecting the factor $\log(C_{\mathcal{M}}(\lambda L + 2))$ in the complexity estimate (5.12). Here are a few examples:

• Gradient method with suboptimal tuning (e.g. when using backtracking or line-search techniques): $\tau_{\mathcal{M}} = \frac{\mu\lambda+1}{L\lambda+1}$. Optimizing the ratio leads to the choice $\lambda = \frac{1}{L-2\mu}$, and the ratio is equal to $2\sqrt{L/\mu - 1}$. Assuming $C_{\mathcal{M}} = 1$ (which is the case for the standard step size $1/L$), the overall iteration complexity is then $O\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\right)$, where we neglected the factor $\log\left(2\frac{1-\mu/L}{1-2\mu/L}\right) \approx \log 2$ when L/µ is large enough.

• Gradient method with optimal tuning: $\tau_{\mathcal{M}} = \frac{2(\mu\lambda+1)}{L\lambda+\mu\lambda+2}$. The resulting choice is $\lambda = \frac{2}{L-3\mu}$, and the ratio is $\sqrt{2}\sqrt{L/\mu - 1}$, arriving at the same $O\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\right)$.

5.5.4 Catalyst for randomized/stochastic methods

Similar results hold for stochastic methods as well, assuming convergence of $\mathcal{M}$ in expectation instead of (5.7), for example in the form $\mathbb{E}\|w_k - w_\star\| \le C_{\mathcal{M}}(1-\tau_{\mathcal{M}})^k\|w_0 - w_\star\|$. Overall, the idea remains the same:

1. Use the inexact accelerated proximal point method (Algorithm 20 or 21) as if $\mathcal{M}$ were deterministic.

2. Use the stochastic method $\mathcal{M}$ to obtain points satisfying the accuracy requirement.

Dealing with the computational burden of solving the inner problem is a bit more technical, but the overall analysis remains similar. One can bound the expected number of iterations $\mathbb{E}[N_{\mathrm{inner}}(k)]$ for solving the inner problem by some constant $B^{(\mathrm{stoch})}_{\mathcal{M},\lambda}$ of the form (details below)

\[ B^{(\mathrm{stoch})}_{\mathcal{M},\lambda} \stackrel{\Delta}{=} \frac{\log(C_{\mathcal{M}}(\lambda L + 2))}{\log(1/(1 - \tau_{\mathcal{M}}))} + 2, \]

which is simply $B^{(\mathrm{stoch})}_{\mathcal{M},\lambda} = B_{\mathcal{M},\lambda} + 1$. A simple argument to obtain this bound uses Markov’s inequality, as follows:

\[
\begin{aligned}
\mathbb{P}(N_{\mathrm{inner}}(k) > i) &\le \mathbb{P}\left( \|w_i - w_\star(\Phi_k)\| > \tfrac{1}{\lambda L+2}\|w_0 - w_\star(\Phi_k)\| \right) \\
&\le \frac{\mathbb{E}\left[\|w_i - w_\star(\Phi_k)\|\right]}{\tfrac{1}{\lambda L+2}\|w_0 - w_\star(\Phi_k)\|} \qquad \text{(Markov)} \\
&\le \frac{C(1-\tau)^i\|w_0 - w_\star(\Phi_k)\|}{\tfrac{1}{\lambda L+2}\|w_0 - w_\star(\Phi_k)\|} = (\lambda L + 2)\,C(1-\tau)^i.
\end{aligned}
\]

We then use a refined version of this bound, $\mathbb{P}(N_{\mathrm{inner}}(k) > i) \le \min\{1, (\lambda L + 2)C(1-\tau)^i\}$, and, in order to bound $\mathbb{E}[N_{\mathrm{inner}}(k)]$, we proceed as

\[ \mathbb{E}[N_{\mathrm{inner}}(k)] = \sum_{t=1}^{\infty} \mathbb{P}(N_{\mathrm{inner}}(k) \ge t) \le \int_0^{N_0} 1\,\mathrm{d}t + C(\lambda L + 2)\int_{N_0}^{\infty}(1-\tau)^t\,\mathrm{d}t, \]

where $N_0$ is such that $1 = C(\lambda L + 2)(1-\tau)^{N_0}$; a direct computation yields $\mathbb{E}[N_{\mathrm{inner}}(k)] \le B^{(\mathrm{stoch})}_{\mathcal{M},\lambda} = N_0 + 1$.

The overall expected iteration complexity is that of the inexact accelerated proximal point method multiplied by the expected computational burden $B^{(\mathrm{stoch})}_{\mathcal{M},\lambda}$ of solving the proximal subproblems. That is, the expected iteration complexity becomes

\[ O\left( B^{(\mathrm{stoch})}_{\mathcal{M},\lambda}\sqrt{\frac{\|x_0 - x_\star\|^2}{\lambda\epsilon}} \right) \]

in the smooth convex setting, and

\[ O\left( B^{(\mathrm{stoch})}_{\mathcal{M},\lambda}\sqrt{\frac{1+\lambda\mu}{\lambda\mu}}\log\frac{1}{\epsilon} \right) \]

in the smooth strongly convex setting. The main argument of this section, namely the use of Markov’s inequality, was adapted from Lin et al. (2018, Appendix B.4) (merged with the arguments of the deterministic case above). Stochastic versions of Catalyst were also studied in (Kulunchakov and Mairal, 2019a).

5.6 Notes & References

In the optimization literature, the proximal operation is an essential algorithmic primitive at the heart of many practical optimization methods. Proximal point algorithms are also largely motivated by the fact that they offer a nice framework for obtaining “meta” (or high-level) algorithms. Among others, they naturally appear in augmented Lagrangian and splitting-based numerical schemes. We refer the reader to the very nice surveys (Parikh and Boyd, 2014; Ryu and Boyd, 2016) for more details.

Proximal point algorithms, accelerated and inexact variants. Proximal point algorithms have a long history, dating back to the works of Moreau (1962; 1965), and were brought to the optimization community by Martinet (1970; 1972). Early interest in proximal methods was motivated by their connections to augmented Lagrangian techniques (Rockafellar, 1973; Rockafellar, 1976; Iusem, 1999); see also the nice tutorial by Eckstein and Silva (2013). Among the many other successes and uses of proximal operations, one can cite the many splitting techniques (Lions and Mercier, 1979; Eckstein, 1989), for which nice surveys already exist (Boyd et al., 2011; Eckstein and Yao, 2012; Condat et al., 2019). In this context, inexact proximal operations were already introduced by Rockafellar (1976) and combined with acceleration much later by Güler (1992) (although not with a perfectly rigorous proof, which was corrected later in (Salzo and Villa, 2012)).

Hybrid proximal extragradient (HPE) framework. Whereas “Catalyst” acceleration is based on the idea of solving the proximal subproblem via a first-order method, the (related) hybrid proximal extragradient framework was also used together with a Newton scheme in (Monteiro and Svaiter, 2013). Furthermore, the accelerated hybrid proximal extragradient framework allows using an increasing sequence of step sizes, leading to faster rates than those obtained via vanilla first-order methods (i.e., with an increasing sequence $\{\lambda_i\}_i$, $\left(\sum_{i=1}^N \sqrt{\lambda_i}\right)^2$ might grow much faster than $N^2$). The HPE framework was initiated by Solodov and Svaiter (1999; 1999; 2000; 2001), before being embedded with acceleration techniques.

Catalyst. The variant presented in this chapter was chosen for simplicity of exposition, and is largely inspired by recent works on the topic (Lin et al., 2018; Ivanova et al., 2019), along with (Monteiro and Svaiter, 2013). Efficient implementations of Catalyst can be found in the Cyanure package by Mairal (2019). In particular, the most efficient practical implementations of Catalyst appear to rely on an absolute inaccuracy criterion for the inexact proximal operation, instead of relative (or multiplicative) ones such as that used in this chapter. In practice, the most convenient and efficient variants appear to be those using a constant number of inner-loop iterations for approximately solving the proximal subproblems. In this chapter, we chose the relative error model as we believe it allows a slightly simpler exposition, while relying on essentially the same techniques. Catalyst was originally proposed by Lin et al. (2015) as a generic tool for obtaining accelerated methods. Among others, it allowed accelerating stochastic methods such as SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014a), MISO (Mairal, 2015), and Finito (Defazio et al., 2014b), before direct acceleration techniques became known for them (Allen-Zhu, 2017; Zhou et al., 2018; Zhou et al., 2019).

Higher-order proximal subproblems. Higher-order proximal subproblems of the form

\[ \min_x \left\{ f(x) + \frac{1}{\lambda(p+1)}\|x - x_k\|^{p+1} \right\} \qquad (5.13) \]

were used by (Nesterov, 2020a; Nesterov, 2020b) as a new primitive for designing optimization schemes. Those subproblems can also be solved approximately (via pth-order tensor methods (Nesterov, 2019)) while keeping good convergence guarantees.

Optimized proximal point methods. It is possible to develop optimized proximal methods in the spirit of optimized gradient methods. That is, given a computational budget (in the proximal setting, a number of iterations and a sequence of step sizes $\{\lambda_i\}_{1 \le i \le N}$), one can choose algorithmic parameters to optimize the worst-case performance of a method of the form

\[ x_{k+1} = x_0 - \sum_{i=1}^{k}\beta_i g_f(x_i) - \lambda_{k+1} g_f(x_{k+1}) \]

with respect to the $\beta_i$’s. The proximal equivalent of the optimized gradient method is Güler’s second method (Güler, 1992, Section 6), which was obtained as an optimized proximal point method in (Barré et al., 2020a). Alternatively, Güler’s second method (Güler, 1992, Section 6) can be obtained by applying the optimized gradient method (without its last iteration trick) to the Moreau envelope of the nonsmooth convex function f. More precisely, denoting by $\tilde{f}$ the Moreau envelope of f, one can apply the optimized gradient method without the last iteration trick to $\tilde{f}$, as $f(x_k) - f_\star = \tilde{f}(x_k) - \tilde{f}_\star - \frac{1}{2L}\|g_{\tilde{f}}(x_k)\|^2$, which precisely corresponds to the first term of the potential of the optimized gradient method (see Equation (4.9)). In the more general setting of monotone inclusions, one can obtain alternate optimized proximal point methods for different criteria, as in (Kim, 2019; Lieder, 2020).

Proofs of this Chapter. The proofs of the potential inequalities of this chapter were obtained through the performance estimation methodology, introduced by Drori and Teboulle (2014) and specialized for studying inexact proximal operations by Barré et al. (2020a). More details can be found in Section 4.8, §“On Obtaining Proofs from this Chapter”. In particular, for reproducibility purposes, we provide codes for symbolically verifying the algebraic reformulations of this section at

https://github.com/AdrienTaylor/AccelerationMonograph

together with those of Chapter 4.

6 Restart Schemes

6.1 Introduction

First-order methods typically exhibit sublinear convergence, whose rate varies with gradient smoothness. Polynomial upper complexity bounds are typically convex functions of the number of iterations, so these methods converge faster at the beginning, while convergence tails off as iterations progress. This suggests that periodically restarting first-order methods, i.e. simply running more “early” iterations, could accelerate convergence. We illustrate this in Figure 6.1.


Figure 6.1: Left: Sublinear convergence plot without restart. Right: Sublinear convergence plot with restart.

Beyond this graphical argument, all accelerated methods have memory and look back at least one step to compute the next iterate. They iteratively form a model of the function around the optimum, and restarting allows this model to be periodically refreshed,


discarding outdated information as the algorithm converges towards the optimum. While the benefits of restarts are immediately apparent in this plot, restart schemes raise several important questions. How many iterations should we run between restarts? What is the best complexity bound we can hope for using this scheme? What regularity properties of the problem drive the performance of restart schemes? Fortunately, all these questions have an explicit answer here, stemming from a simple and intuitive argument. We will see that restart schemes are also adaptive to unknown regularity constants, and often reach near-optimal convergence rates without observing these parameters. We begin by illustrating this on the problem of minimizing a strongly convex function using the fixed-step gradient method.

6.1.1 The Strongly Convex Case

We illustrate the main argument of this chapter on the minimization of strongly convex functions using fixed-step gradient descent. Suppose we seek to solve the following minimization problem:

\[ \text{minimize } f(x) \qquad (6.1) \]

in the variable $x \in \mathbb{R}^d$. Suppose that the gradient of f is Lipschitz continuous with constant L with respect to the Euclidean norm, i.e.

\[ \|\nabla f(y) - \nabla f(x)\|_2 \le L\|y - x\|_2, \quad \text{for all } x, y \in \mathbb{R}^d. \qquad (6.2) \]

We can use the fixed-step gradient method to solve problem (6.1), as in Algorithm 22 below.

Algorithm 22 Gradient Method

Input: A smooth convex function f, an initial point $x_0$.
1: for i = 0, ..., k−1 do
2:    $x_{i+1} := x_i - \frac{1}{L}\nabla f(x_i)$
3: end for
Output: An approximate solution $x_k$.

The smoothness assumption in (6.2) ensures the following complexity bound:

\[ f(x_k) - f_\star \le \frac{2L\|x_0 - x_\star\|_2^2}{k + 4} \qquad (6.3) \]

after k iterations (see Chapter 4 for a complete discussion). Assume now that f is also strongly convex with parameter µ with respect to the Euclidean norm. Strong convexity means that f satisfies

\[ \frac{\mu}{2}\|x - x_\star\|_2^2 \le f(x) - f_\star, \qquad (6.4) \]

where $x_\star$ is an optimal solution to problem (6.1) and $f_\star$ the corresponding optimal objective value. Let us write $\mathcal{A}(x_0, k)$ the output of k iterations of Algorithm 22 started

Algorithm 23 Restart Scheme

Input: A smooth convex function f, an initial point $x_0$, and an inner optimization algorithm $\mathcal{A}(x, k)$.
1: for i = 0, ..., N−1 do
2:    Obtain $x_{i+1}$ by running $k_i$ iterations of the gradient method, starting at $x_i$, i.e.

$x_{i+1} := \mathcal{A}(x_i, k_i)$
3: end for
Output: $x_N$

at $x_0$, and suppose that we periodically restart the gradient method according to the scheme in Algorithm 23. Combining the strong convexity bound in (6.4) with the complexity bound in (6.3) yields

\[ f(x_{i+1}) - f_\star \le \frac{2L\|x_i - x_\star\|_2^2}{k + 4} \le \frac{4L}{\mu(k+4)}(f(x_i) - f_\star) \qquad (6.5) \]

after an iteration of the restart scheme in Algorithm 23, in which we run k (inner) iterations of the gradient method in Algorithm 22. This means that if we set

\[ k_i = k = \frac{8L}{\mu}, \]

then

\[ f(x_N) - f_\star \le \left(\frac{1}{2}\right)^N (f(x_0) - f_\star) \]

after N iterations of the restart scheme in Algorithm 23. Therefore, running a total of T = Nk gradient steps, we can rewrite the complexity bound in terms of this total number of gradient oracle calls (or inner iterations) as

\[ f(x_T) - f_\star \le \left(\frac{1}{2}\right)^{\frac{\mu}{8L}T}(f(x_0) - f_\star), \qquad (6.6) \]

which proves linear convergence in the strongly convex case. Of course, the basic gradient method with fixed step size in Algorithm 22 has no memory, so restarting it has no impact on iterates or numerical performance. Invoking the restart scheme in Algorithm 23 simply allows us to produce a better complexity bound in the strongly convex case. Without information on the strong convexity parameter, the classical bound yields sublinear convergence, while the restart method converges linearly. Crucially, the argument in (6.5) can be significantly generalized to improve the convergence rate of several types of first-order methods. In fact, as we will see below, a local bound on the growth rate of the function akin to strong convexity holds almost generically, albeit with a different exponent than in (6.4).
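The restart argument is easy to reproduce numerically; the following sketch (ours) runs Algorithm 23 with Algorithm 22 as the inner method on a strongly convex quadratic and checks the bound (6.6).

import numpy as np

rng = np.random.default_rng(5)
d = 30
vals = np.linspace(1.0, 100.0, d)        # spectrum: mu = 1, L = 100
Q = np.diag(vals)
mu, L = vals[0], vals[-1]
f = lambda x: 0.5 * x @ Q @ x            # x_star = 0, f_star = 0

def gradient_method(x, k):
    # Algorithm 22: k fixed-step gradient iterations.
    for _ in range(k):
        x = x - (1.0 / L) * (Q @ x)
    return x

x0 = rng.standard_normal(d)
k = int(np.ceil(8 * L / mu))             # inner iterations between restarts
x, N = x0.copy(), 5
for i in range(N):                       # Algorithm 23
    x = gradient_method(x, k)
T = N * k                                # total gradient oracle calls
assert f(x) <= 0.5 ** (mu * T / (8 * L)) * f(x0) + 1e-12   # bound (6.6)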

6.2 Hölderian Error Bounds

We now recall several results related to subanalytic functions and Hölderian error bounds of the form

\[ \frac{\mu}{r} d(x, X_\star)^r \le f(x) - f_\star, \quad \text{for all } x \in K, \qquad \text{(HEB)} \]

for some $\mu, r > 0$, where $d(x, X_\star)$ is the distance from x to the optimal set. We refer the reader to e.g. (Bolte et al., 2007) for a more complete discussion. These results produce bounds akin to local versions of strong convexity, with various exponents, and are known to hold under very generic conditions. They thus better explain the good empirical performance of restart schemes. In general of course, these values are neither observed nor known a priori, but as detailed below, restart schemes are essentially adaptive to these parameters and reach optimal convergence rates without any prior information on µ and r.

6.2.1 Hölderian Error Bound and Smoothness

Let f be a smooth convex function on $\mathbb{R}^d$. Smoothness ensures

\[ f(x) \le f_\star + \frac{L}{2}\|x - y\|^2, \]

for any $x \in \mathbb{R}^d$ and $y \in X_\star$. Setting y to be the projection of x onto $X_\star$, this yields the following upper bound on suboptimality:

\[ f(x) - f_\star \le \frac{L}{2} d(x, X_\star)^2. \qquad (6.7) \]

Now, assume that f satisfies the Hölderian error bound (HEB) on a set K with parameters (r, µ). Combining (6.7) and (HEB) leads to

\[ \frac{2\mu}{rL} \le d(x, X_\star)^{2-r}, \]

for every $x \in K$. This means that $2 \le r$, by taking x close enough to $X_\star$. We will also allow the gradient smoothness exponent 2 to vary in later results, where we assume the gradient to be Hölder smooth, but we first detail the smooth case for simplicity. In what follows, we use the following notation:

\[ \kappa \stackrel{\Delta}{=} L/\mu^{2/r} \quad \text{and} \quad \tau \stackrel{\Delta}{=} 1 - \frac{2}{r}, \qquad (6.8) \]

defining generalized condition numbers for the function f. Note that if r = 2, then κ matches the classical condition number of the function.

6.2.2 Subanalytic Functions

Subanalytic functions form a very broad class of functions for which we can show Hölderian error bounds as in (HEB), akin to strong convexity. We recall some key definitions, and refer the reader to e.g. (Bolte et al., 2007) for a more complete discussion.

Definition 6.1 (Subanalyticity). (i) A subset $A \subset \mathbb{R}^d$ is called semianalytic if each point of $\mathbb{R}^d$ admits a neighborhood V for which $A \cap V$ assumes the following form:

\[ \bigcup_{i=1}^{p}\bigcap_{j=1}^{q}\{ x \in V : f_{ij}(x) = 0,\; g_{ij}(x) > 0 \}, \]

where $f_{ij}, g_{ij} : V \to \mathbb{R}$ are real analytic for $1 \le i \le p$, $1 \le j \le q$.

(ii) A subset $A \subset \mathbb{R}^d$ is called subanalytic if each point of $\mathbb{R}^d$ admits a neighborhood V such that

\[ A \cap V = \{ x \in \mathbb{R}^d : (x, y) \in B \}, \]

where B is a bounded semianalytic subset of $\mathbb{R}^d \times \mathbb{R}^m$.

(iii) A function $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is called subanalytic if its graph is a subanalytic subset of $\mathbb{R}^d \times \mathbb{R}$.

The class of subanalytic functions is of course very large, but the definition above suffers from one key shortcoming: the image and preimage of a subanalytic set are not in general subanalytic. To remedy this stability issue, we can define a notion of global subanalyticity. We first define the function $\beta_d$ with

\[ \beta_d(x) \stackrel{\Delta}{=} \left( \frac{x_1}{1 + x_1^2}, \ldots, \frac{x_d}{1 + x_d^2} \right), \]

and we have the following definition.

Definition 6.2 (Global subanalyticity). (i) A subset A of R^d is called globally subanalytic if its image under β_d is a subanalytic subset of R^d.
(ii) A function f : R^d → R ∪ {+∞} is called globally subanalytic if its graph is a globally subanalytic subset of R^d × R.

We now recall the Łojasiewicz factorization lemma, which will give us local growth bounds on the graph of the function around its minimum.

Theorem 6.1 (Łojasiewicz factorization lemma). Let K ⊂ R^d be a compact set and g, h : K → R two continuous globally subanalytic functions. If

h^{−1}(0) ⊂ g^{−1}(0),

then

(μ/r) |g(x)|^r ≤ |h(x)|,   for all x ∈ K,   (6.9)

for some μ, r > 0.

In an optimization context on a compact set K ⊂ R^d, we can set h(x) = f(x) − f_⋆ and g(x) = d(x, X_⋆), the Euclidean distance from x to the set X_⋆, where X_⋆ is the set of optimal solutions. In this case, we have h^{−1}(0) ⊂ g^{−1}(0), and since g is globally subanalytic whenever X_⋆ is, if f is continuous and globally subanalytic, Theorem 6.1 shows the following Hölderian error bound,

(μ/r) d(x, X_⋆)^r ≤ f(x) − f_⋆,   for all x ∈ K,   (6.10)

for some μ, r > 0. Here, Theorem 6.1 produces a bound on the growth rate of the function around the optimum, generalizing the strong convexity bound in (6.4). We illustrate this in Figure 6.2. Overall, since continuity and subanalyticity are very weak conditions, Theorem 6.1 shows that the Hölderian error bound in (HEB) holds almost generically.


Figure 6.2: Left and center: The functions |x| and x2 satisfy a growth condition around zero. Right: The function exp(−1/x2) does not.

6.3 Optimal Restart Schemes

We now discuss how to exploit the Hölderian error bounds detailed above using restart schemes. Suppose again that we seek to solve the following unconstrained minimization problem, written

minimize f(x)   (6.11)

in the variable x ∈ R^d, where the gradient of f is Lipschitz continuous with constant L with respect to the Euclidean norm. The optimal method in (4.9), detailed as Algorithm 9, produces a point x_k satisfying

f(x_k) − f_⋆ ≤ (4L/k²) ‖x_0 − x_⋆‖₂²   (6.12)

after k iterations. Assuming the function f satisfies the Hölderian error bound (HEB), we can use a chaining argument similar to that in (6.5) to show improved convergence rates. While a constant number of inner iterations (between restarts) was optimal in the strongly convex case, the optimal restart scheme for r > 2 will involve a geometrically increasing number of inner iterations (Nemirovsky and Nesterov, 1985; Roulet and d'Aspremont, 2017).

Theorem 6.2 (Restart Complexity). Let f be a smooth convex function satisfying (6.2) with parameter L and (HEB) with parameters (r, μ) on a set K. Assume that we are given x_0 ∈ R^d such that {x | f(x) ≤ f(x_0)} ⊂ K. Run the restart scheme in Algorithm 23 from x_0 with iteration schedule k_i = C⋆_{κ,τ} e^{τi}, for i = 1, …, R, where

C⋆_{κ,τ} ≜ e^{1−τ} (cκ)^{1/2} (f(x_0) − f_⋆)^{−τ/2},   (6.13)

with κ and τ defined in (6.8) and c = 4e^{2/e} here. The precision reached at the last point x̂ is bounded by,

f(x̂) − f_⋆ ≤ (f(x_0) − f_⋆) / ( τ e^{−1} (f(x_0) − f_⋆)^{τ/2} (cκ)^{−1/2} N + 1 )^{2/τ} = O(N^{−2/τ}),   (6.14)

when τ > 0, where N = Σ_{i=1}^R k_i is the total number of inner iterations.

In the strongly convex case, i.e. when τ = 0, the bound above actually becomes

f(x̂) − f_⋆ ≤ exp( −2e^{−1} (cκ)^{−1/2} N ) (f(x_0) − f_⋆) = O( exp(−κ^{−1/2} N) ),

and we recover the classical linear convergence bound for Algorithm 12 in the strongly convex case. When 0 < τ < 1 on the other hand, bound (6.14) shows a faster convergence rate than accelerated gradient methods on non-strongly convex functions (i.e. when r > 2). The closer r is to 2, the tighter our bounds are, which yields a better model for the function and faster convergence. This matches the lower bounds for optimizing smooth and sharp functions (Nemirovsky and Nesterov, 1985) up to constant factors. Also, setting k_i = C⋆_{κ,τ} e^{τi} yields continuous bounds on precision, i.e. when τ → 0, bound (6.14) converges to the linear bound, which also shows that for τ near zero, constant restart schemes are almost optimal.
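As an illustration, a minimal Python sketch of the scheduled restart scheme of Theorem 6.2 follows; `inner(x, k)` stands for k iterations of the accelerated method (Algorithm 9) started at x, and the constant C plays the role of C⋆_{κ,τ} in (6.13), which requires knowing (r, μ) and is therefore an input here.

```python
import numpy as np

def scheduled_restart(inner, x0, C, tau, R):
    # Restart schedule k_i = ceil(C * exp(tau * i)), i = 1, ..., R:
    # constant when tau = 0, geometrically increasing when tau > 0.
    x = x0
    for i in range(1, R + 1):
        k_i = int(np.ceil(C * np.exp(tau * i)))
        x = inner(x, k_i)
    return x
```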

6.4 Robustness & Adaptivity

The previous restart schedules depend on the sharpness parameters (r, μ) in (HEB). In general of course, these values are neither observed nor known a priori. Making the restart scheme above adaptive is thus crucial to practical performance. Fortunately, we will see that a simple logarithmic grid search on these parameters is enough to guarantee nearly optimal performance. In other words, as shown in (Roulet and d'Aspremont, 2017), the complexity bound in (6.14) is somewhat robust to a misspecification of the inner iteration schedule k_i.

6.4.1 Grid Search Complexity

We can run several restart schemes in Algorithm 23, each with a given number of inner iterations N, to perform a log-scale grid search on the values of τ and κ in (6.8).

We see below that running (log₂ N)² restart schemes suffices to achieve nearly optimal performance. We define these schemes as follows.

S_{p,0} : Restart Algorithm 9 with k_i = C_p,
S_{p,q} : Restart Algorithm 9 with k_i = C_p e^{τ_q i},   (6.15)

where C_p = 2^p and τ_q = 2^{−q}. We stop these schemes when the total number of inner algorithm iterations has exceeded N, i.e. at the smallest R such that Σ_{i=1}^R k_i ≥ N. The size of the grid search in C_p is naturally bounded as we cannot restart the algorithm after more than N total inner iterations, so p ∈ [1, …, ⌊log₂ N⌋]. Also, when τ is smaller than 1/N, a constant schedule performs as well as the optimal geometrically increasing schedule, which crucially means we can also choose q ∈ [0, …, ⌈log₂ N⌉] and limits the cost of the grid search to (log₂ N)². We have the following complexity bounds; a sketch of this grid search is given below, after Theorem 6.3.

Theorem 6.3 (Adaptive Restart Complexity). Let f be a smooth convex function satisfying (6.2) with parameter L and (HEB) with parameters (r, μ) on a set K. Assume that we are given x_0 ∈ R^d such that {x | f(x) ≤ f(x_0)} ⊂ K, and write N a given number of iterations. Run the schemes S_{p,q} defined in (6.15) for p ∈ [1, …, ⌊log₂ N⌋] and q ∈ [0, …, ⌈log₂ N⌉], stopping each time after N total inner algorithm iterations, i.e. for R such that Σ_{i=1}^R k_i ≥ N. Assume N is large enough, so N ≥ 2C⋆_{κ,τ}, and, if τ > 0, that C⋆_{κ,τ} > 1.

(i) If τ = 0, there exists p ∈ [1, …, ⌊log₂ N⌋] such that scheme S_{p,0} achieves a precision given by

f(x̂) − f_⋆ ≤ exp( −e^{−1} (cκ)^{−1/2} N ) (f(x_0) − f_⋆).

(ii) If τ > 0, there exist p ∈ [1, …, ⌊log₂ N⌋] and q ∈ [1, …, ⌈log₂ N⌉] such that scheme S_{p,q} achieves a precision given by

f(x̂) − f_⋆ ≤ (f(x_0) − f_⋆) / ( τ e^{−1} (cκ)^{−1/2} (f(x_0) − f_⋆)^{τ/2} (N − 1)/4 + 1 )^{2/τ}.

Overall, running the logarithmic grid search has a complexity (log₂ N)² times higher than running N iterations using the optimal scheme where we know the parameters in (HEB), while the convergence rate is roughly slowed down by a factor of four.
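The grid search of Theorem 6.3 can be sketched as follows in Python; `inner(x, k)` again stands for k iterations of Algorithm 9, and `f` is only used to select the best candidate, both being assumptions of this sketch.

```python
import numpy as np

def run_scheme(inner, x0, Cp, tau, N):
    # One restart scheme, stopped once N total inner iterations are used.
    x, total, i = x0, 0, 1
    while total < N:
        k_i = int(np.ceil(Cp * np.exp(tau * i)))
        x = inner(x, k_i)
        total += k_i
        i += 1
    return x

def grid_search_restart(inner, f, x0, N):
    # Schemes S_{p,q} from (6.15): C_p = 2^p and tau_q = 2^{-q};
    # tau = 0 recovers the constant schedule S_{p,0}.
    candidates = []
    for p in range(1, int(np.floor(np.log2(N))) + 1):
        for tau in [0.0] + [2.0 ** (-q) for q in range(int(np.ceil(np.log2(N))) + 1)]:
            candidates.append(run_scheme(inner, x0, 2 ** p, tau, N))
    return min(candidates, key=f)
```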

6.5 Extensions

We now discuss several extensions of the results above.

Hölder Smooth Gradient

The results above can be somewhat directly extended to more general notions of regularity. In particular, assume that there exist s ∈ [1, 2] and L > 0 such that, on a set J ⊂ R^d,

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖^{s−1},   for all x, y ∈ J,   (6.16)

so that the gradient is Hölder smooth. Without further assumptions on f, the optimal rate of convergence for this class of functions is bounded as O(1/N^ρ), where N is the total number of iterations and

ρ = 3s/2 − 1,   (6.17)

which gives ρ = 2 for smooth functions and ρ = 1/2 for non-smooth functions. The universal fast gradient method (Nesterov, 2015) achieves this rate, requiring both a target accuracy ε and a starting point x_0 as inputs. It outputs a point x ≜ U(x_0, ε, t), such that

f(x) − f_⋆ ≤ ε/2 + ( c L^{2/s} d(x_0, X_⋆)² ε ) / ( ε^{2/s} t^{2ρ/s} ),   (6.18)

after t iterations, where c is a constant (c = 2^{(4s−2)/s}). We can extend the definition of κ and τ in (6.8) to this setting, with

κ ≜ L^{2/s}/μ^{2/r}   and   τ ≜ 1 − s/r   (6.19)

here. We see that τ acts as an analytical condition number measuring the tightness of the upper and lower bound models on the function. The key difference with the smooth case described above is that here we need to schedule both the target accuracy ε_i used by the algorithm and the number of iterations k_i made at the i-th run of the algorithm. Our scheme is described in Algorithm 24.

Algorithm 24 Universal scheduled restarts for convex minimization
Input: x_0 ∈ R^d, ε_0 ≥ f(x_0) − f_⋆, γ ≥ 0, a sequence k_i, and an inner algorithm U(x, ε, k).
for i = 1, …, R do
    ε_i := e^{−γ} ε_{i−1},   x_i := U(x_{i−1}, ε_i, k_i)
end for
Output: An iterate x_R.

We will choose a sequence k_i that ensures

f(x_i) − f_⋆ ≤ ε_i,

for the geometrically decreasing sequence ε_i. Grid search on the restart scheme still works in this case, but requires the knowledge of both s and τ.

Theorem 6.4. Let f be a convex function satisfying (6.16) with parameters (s, L) on a set J and (HEB) with parameters (r, μ) on a set K. Given x_0 ∈ R^d, assume that {x | f(x) ≤ f(x_0)} ⊂ J ∩ K. Run the restart scheme in Algorithm 24 from x_0 for a given ε_0 ≥ f(x_0) − f_⋆ with

γ = ρ,   k_i = C⋆_{κ,τ,ρ} e^{τi},   where   C⋆_{κ,τ,ρ} ≜ e^{1−τ} (cκ)^{s/(2ρ)} ε_0^{−τ/ρ},

where ρ is defined in (6.17), κ and τ are defined in (6.19) and c = 8e^{2/e} here. The precision reached at the last point x_R is given by,

f(x_R) − f_⋆ ≤ exp( −ρ e^{−1} (cκ)^{−s/(2ρ)} N ) ε_0 = O( exp(−κ^{−s/(2ρ)} N) ),

when τ = 0, while

f(x_R) − f_⋆ ≤ ε_0 / ( τ e^{−1} (cκ)^{−s/(2ρ)} ε_0^{τ/ρ} N + 1 )^{ρ/τ} = O( κ^{s/(2τ)} N^{−ρ/τ} ),

when τ > 0, where N = Σ_{i=1}^R k_i is the total number of iterations.

Relative Smoothness

We can also extend the inequality defining condition (HEB), replacing the distance to the optimal set by a more general Bregman divergence. Suppose h(x) is a 1-strongly convex function with respect to the Euclidean norm; the Bregman divergence D_h(x, y) is defined as

D_h(x, y) ≜ h(x) − h(y) − ⟨∇h(y); (x − y)⟩   (6.20)

and we will say that a function f is L-smooth with respect to h on R^d if

f(y) ≤ f(x) + ⟨∇f(x); (y − x)⟩ + L D_h(y, x),   for all x, y ∈ R^d.   (6.21)

We can then extend the Hölderian error bound to the Bregman setting as follows. In an optimization context on a compact set K ⊂ R^d, we can set h(x) = f(x) − f_⋆ and g(x) = D(x, X_⋆) = inf_{y ∈ X_⋆} D_h(x, y), where X_⋆ is the set of optimal solutions. In this case, we have h^{−1}(0) ⊂ g^{−1}(0), and since g is globally subanalytic whenever X_⋆ is, if f is continuous and globally subanalytic, Theorem 6.1 again shows

(μ/r) D(x, X_⋆)^r ≤ f(x) − f_⋆,   for all x ∈ K,   (HEB-B)

for some μ, r > 0. This allows us to use the restart scheme complexity results above to accelerate proximal gradient methods.
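A minimal Python sketch of Algorithm 24 follows; the inner algorithm `U(x, eps, k)`, standing for the universal fast gradient method of (Nesterov, 2015) run for k iterations with target accuracy eps, is assumed given.

```python
import numpy as np

def universal_scheduled_restart(U, x0, eps0, gamma, ks):
    # Algorithm 24: decrease the target accuracy geometrically,
    # eps_i = e^{-gamma} eps_{i-1}, and restart the inner method.
    x, eps = x0, eps0
    for k_i in ks:
        eps = np.exp(-gamma) * eps
        x = U(x, eps, k_i)
    return x
```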

6.6 Calculus Rules

In general, the exponent r and the factor μ in the bounds (HEB) and (6.26) are not observed and hard to estimate. Despite this, the robustness result in Theorem 6.3 means that searching for the best restart scheme only introduces a log factor in the overall algorithm complexity. There are however a number of scenarios where we can produce much more precise estimates of r and μ, hence get refined a priori complexity bounds and reduce the cost of the grid search in (6.15). In particular, (Li and Pong, 2018) shows “calculus rules” for the exponent for a number of elementary operations, using a related type of error bound known as the Kurdyka-Łojasiewicz inequality. The results focus on the Kurdyka-Łojasiewicz exponent α, defined as follows.

Definition 6.3. A proper closed convex function f has Kurdyka-Łojasiewicz (KL) exponent α iff for any point x̄ ∈ dom ∂f there is a neighborhood V of x̄, a ν > 0 and a constant c > 0 such that

D(∂f(x), 0) ≥ c (f(x) − f(x̄))^α   (6.22)

whenever x ∈ V and f(x̄) ≤ f(x) ≤ f(x̄) + ν.

In particular, (Bolte et al., 2007, Th. 3.3) shows that (HEB) implies (6.22) with exponent α = 1 − 1/r. Very briefly, the following calculus rules apply to the exponent α.

• If f(x) = mini fi(x) and each fi has KL exponent αi then f has KL exponent α = maxi αi (Li and Pong, 2018, Cor. 3.1).

• Let f(x) = g ∘ F(x), where g is a proper closed function and F is a continuously differentiable mapping. Suppose in addition that g is a KL function with exponent α and the Jacobian J_F(x) is a surjective mapping at some x̄ ∈ dom ∂f. Then f has the KL property at x̄ with exponent α (Li and Pong, 2018, Th. 3.2).

• If f(x) = Σ_i f_i(x_i) and each f_i is continuous and has KL exponent α_i, then f has KL exponent α = max_i α_i (Li and Pong, 2018, Cor. 3.3).

• Let f be a proper closed convex function with a KL exponent α ∈ [0, 2/3]. Suppose further that f is continuous on dom ∂f. Fix λ > 0 and consider

F_λ(x) = inf_y { f(y) + (1/(2λ)) ‖x − y‖² },

then F_λ has KL exponent α̃ = max{ 1/2, α/(2 − 2α) } (Li and Pong, 2018, Th. 3.4).

Note that another related notion of error bound, where the primal gap is replaced by the norm of the prox step, was studied in e.g. (Pang, 1987; Luo and Tseng, 1992; Tseng, 2010; Zhou and So, 2017).
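As a simple illustration of these exponents (our example, not taken from the references above), consider a least squares objective f(x) = (1/2)‖Ax − b‖². Since ∇f(x_⋆) = 0 for any x_⋆ ∈ X_⋆, we have exactly f(x) − f_⋆ = (1/2)(x − x_⋆)ᵀAᵀA(x − x_⋆), so (HEB) holds with r = 2 and μ = λ_min⁺(AᵀA), the smallest nonzero eigenvalue of AᵀA (a Hoffman-type bound). The KL exponent is then α = 1 − 1/r = 1/2 and, in the notations of (6.8), τ = 1 − 2/r = 0, so the restart schemes above converge linearly on such problems.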

6.7 Example: Compressed Sensing

In some applications such as compressed sensing, under some classical assumptions on the problem data, the exponent r is equal to one and the constant μ can be directly computed from quantities controlling recovery performance. In these problems, a single parameter thus controls both signal recovery and computational performance. Consider for instance a sparse recovery problem using the ℓ₁ norm. Given a matrix A ∈ R^{n×p} and observations b = Ax_⋆ on a signal x_⋆ ∈ R^p, recovery is performed by solving the ℓ₁ minimization program

minimize ‖x‖₁
subject to Ax = b   (ℓ₁ recovery)

in the variable x ∈ R^p. A number of conditions on A have been derived to guarantee that (ℓ₁ recovery) recovers the true signal whenever it is sparse enough. Among these, the null space property (see Cohen et al., 2009 and references therein) is defined as follows.

Definition 6.4 (Null Space Property). The matrix A satisfies the Null Space Property (NSP) on support S ⊂ {1, …, p} with constant α ≥ 1 if for any z ∈ Null(A) \ {0},

α ‖z_S‖₁ < ‖z_{S^c}‖₁.   (NSP)

The matrix A satisfies the Null Space Property at order s with constant α ≥ 1 if it satisfies it on every support S of cardinality at most s.

The null space property is a necessary and sufficient condition for the convex program (ℓ₁ recovery) to recover all signals up to some sparsity threshold. We have the following proposition, directly linking the null space property and the Hölderian error bound (HEB).

Proposition 6.1. Given a coding matrix A ∈ R^{n×p} satisfying (NSP) at order s with constant α ≥ 1, if the original signal x_⋆ is s-sparse, then for any x ∈ R^p satisfying Ax = b, x ≠ x_⋆, we have

‖x‖₁ − ‖x_⋆‖₁ > ((α − 1)/(α + 1)) ‖x − x_⋆‖₁.   (6.23)

This implies signal recovery, i.e. optimality of x_⋆ for (ℓ₁ recovery), and the Hölderian error bound (HEB) with μ = (α − 1)/(α + 1).
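As a small numerical aside (a sketch under the assumptions of Proposition 6.1), the restart parameters follow directly from the NSP constant α:

```python
def restart_parameters_from_nsp(alpha):
    # Proposition 6.1: sharpness r = 1 with mu = (alpha - 1)/(alpha + 1).
    # The l1 objective is nonsmooth, so s = 1 in (6.16); from (6.17)
    # and (6.19), rho = 1/2 and tau = 1 - s/r = 0, i.e. scheduled
    # restarts of the universal method converge linearly here.
    mu = (alpha - 1.0) / (alpha + 1.0)
    r, s = 1.0, 1.0
    rho = 3.0 * s / 2.0 - 1.0
    tau = 1.0 - s / r
    return mu, r, rho, tau
```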

6.8 Restarting Other First-Order Methods

The restart argument can be readily extended to other optimization methods, provided their complexity bound directly depends on some measure of distance to optimality. This is the case for instance for the Frank-Wolfe method, as detailed in (Kerdreux et al., 2018). Suppose that we seek to solve the following constrained optimization problem

minimize f(x)
subject to x ∈ C   (6.24)

Distance to optimality is now measured in terms of the strong Wolfe gap, defined as follows.

Definition 6.5 (Strong Wolfe Gap). Let f be a smooth convex function, C a polytope, and let x ∈ C be arbitrary. Then the strong Wolfe gap w(x) over C is defined as

w(x) ≜ min_{S ∈ S_x} max_{y ∈ S, z ∈ C} ⟨∇f(x); (y − z)⟩   (6.25)

where x ∈ Co(S) and

S_x = { S ⊂ Ext(C), finite, x proper combination of elements of S },

is the set of proper supports of x.

The inequality playing the role of the Hölderian error bound in (HEB) for the strong Wolfe gap is then written as follows.

Definition 6.6 (Strong Wolfe primal bound). Let K be a compact neighborhood of X_⋆ in C, where X_⋆ is the set of solutions of the constrained optimization problem (6.24). A function f satisfies an r-strong Wolfe primal bound on K if and only if there exist r ≥ 1 and μ > 0 such that, for all x ∈ K,

f(x) − f_⋆ ≤ μ w(x)^r,   (6.26)

with f_⋆ its optimal value.

Notice that this inequality is an upper bound on the primal gap f(x) − f_⋆, while the Hölderian error bound in (HEB) provides a lower bound. This is because the strong Wolfe gap can be understood as a gradient norm, so that (6.26) is a Łojasiewicz inequality as in (Bolte et al., 2007), instead of a direct consequence of the Łojasiewicz factorization lemma as in (HEB) above. Regularity of f is measured using the away curvature, as in (Lacoste-Julien and Jaggi, 2015), with

C_f^A ≜ sup_{x,s,v ∈ C, η ∈ [0,1], y = x + η(s − v)} (2/η²) ( f(y) − f(x) − η ⟨∇f(x); s − v⟩ ),   (6.27)

and allows us to bound the performance of the Fractional Away-Step Frank-Wolfe Algorithm in (Kerdreux et al., 2018), as follows.

Theorem 6.5. Let f be a smooth convex function with away curvature C_f^A. Assume the strong Wolfe primal bound in (6.26) holds for some 1 ≤ r ≤ 2. Let γ > 0 and assume x_0 ∈ C is such that e^{−γ} w(x_0, S_0)/2 ≤ C_f^A. With γ_k = γ, the output of the Fractional Away-Step Frank-Wolfe Algorithm satisfies

f(x_T) − f_⋆ ≤ w_0 / (1 + T̃ C_γ^r)^{1/(2−r)}   when 1 ≤ r < 2,
f(x_T) − f_⋆ ≤ w_0 exp( −γ T̃ / (8 e^{2γ} C_f^A μ) )   when r = 2,   (6.28)

after T steps, with w_0 = w(x_0, S_0), T̃ ≜ T − |S_0| − |S_T|, and

C_γ^r ≜ ( e^{γ(2−r)} − 1 ) / ( 8 e^{2γ} C_f^A μ w(x_0, S_0)^{r−2} ).   (6.29)

This result is similar to that of Theorem 6.4, and shows that restart yields linear complexity bounds when the exponent in the strong Wolfe primal bound in (6.26) matches that in the curvature (i.e. r = 2), and improved rates when this exponent r satisfies 1 ≤ r < 2. Crucially here, the method is fully adaptive to the error bound parameters, so no prior knowledge of these parameters is required to get the accelerated rates in Theorem 6.5, and no log scale grid search is required.

6.9 Notes & References

The optimal complexity bounds and exponential restart schemes detailed here can be traced back to (Nemirovsky and Nesterov, 1985). Restart schemes were extensively benchmarked in the numerical toolbox TFOCS by (Becker et al., 2011), with a particular focus on compressed sensing applications. The robustness result showing that log scale grid search produces near optimal complexity bounds is due to (Roulet and d'Aspremont, 2017). Restart schemes based on the gradient norm as a termination criterion also reach nearly optimal complexity bounds and adapt to strong convexity or HEB parameters, as discussed in (Nesterov, 2013; Ito and Fukuda, 2021).

Hölderian error bounds can be traced back to the work of Łojasiewicz (1963) for analytic functions. This was extended to much broader classes of functions by (Kurdyka, 1998; Bolte et al., 2007). Several examples of problems in signal processing where this condition holds can be found in e.g. (Zhou et al., 2015; Zhou and So, 2017). Calculus rules for the exponent are discussed in detail in e.g. (Li and Pong, 2018).

Restart also helps in a stochastic setting, with (Davis et al., 2019) showing recently that stochastic algorithms with geometric step decay converge linearly on functions satisfying Hölderian error bounds, thus validating a classical empirical acceleration trick, which restarts every few epochs after adjusting the step size (aka learning rate in machine learning terminology).

Appendices

A Useful Inequalities

In this section, we prove basic inequalities involving smooth strongly convex functions. Most inequalities are not used in our developments. Nevertheless, we believe they are useful for gaining intuition about those classes of functions, as well as for comparisons with the literature. Also note that those inequalities are somewhat standard (see e.g. (Nesterov, 2003, Theorem 2.1.5)).

A.1 Smoothness and Strong Convexity with Euclidean Norm

In this section, we consider an Euclidean setting, where ‖x‖² = ⟨x; x⟩, and ⟨·; ·⟩ : R^d × R^d → R is a dot product.

The following theorem summarizes known inequalities characterizing the class of smooth convex functions. Note that those characterizations of f ∈ F_{0,L} are all equivalent assuming f ∈ F_{0,∞}, as convexity is not implied by some points below. In particular, (i), (ii), (v), (vi), or (vii) alone do not encode convexity of f, whereas (iii) or (iv) encode both smoothness and convexity.

Theorem A.1. Let f : R^d → R be a differentiable convex function. The following statements are equivalent for inclusion in F_{0,L}.

(i) ∇f satisfies a Lipschitz condition; for all x, y ∈ R^d

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖.

(ii) f is upper bounded by quadratic functions; for all x, y ∈ R^d

f(x) ≤ f(y) + ⟨∇f(y); x − y⟩ + (L/2) ‖x − y‖².


(iii) f satisfies, for all x, y ∈ R^d

f(x) ≥ f(y) + ⟨∇f(y); x − y⟩ + (1/(2L)) ‖∇f(x) − ∇f(y)‖².

(iv) ∇f is cocoercive; for all x, y ∈ R^d

⟨∇f(x) − ∇f(y); x − y⟩ ≥ (1/L) ‖∇f(x) − ∇f(y)‖².

(v) ∇f satisfies, for all x, y ∈ R^d

⟨∇f(x) − ∇f(y); x − y⟩ ≤ L ‖x − y‖².

(vi) (L/2)‖x‖² − f(x) is convex.

(vii) f satisfies, for all λ ∈ [0, 1] and all x, y ∈ R^d

f(λx + (1 − λ)y) ≥ λf(x) + (1 − λ)f(y) − λ(1 − λ)(L/2) ‖x − y‖².

Proof. Let us start with (i) ⇒ (ii). We use a first-order expansion

f(y) = f(x) + ∫₀¹ ⟨∇f(x + τ(y − x)); y − x⟩ dτ.

The quadratic upper bound then follows from algebraic manipulations, and from upper bounding the integral term

f(y) = f(x) + ⟨∇f(x); y − x⟩ + ∫₀¹ ⟨∇f(x + τ(y − x)) − ∇f(x); y − x⟩ dτ
     ≤ f(x) + ⟨∇f(x); y − x⟩ + ∫₀¹ ‖∇f(x + τ(y − x)) − ∇f(x)‖ ‖y − x‖ dτ
     ≤ f(x) + ⟨∇f(x); y − x⟩ + L ‖x − y‖² ∫₀¹ τ dτ
     = f(x) + ⟨∇f(x); y − x⟩ + (L/2) ‖x − y‖².

We proceed with (ii) ⇒ (iii). The idea is to require the quadratic upper bound to be everywhere above the linear lower bound arising from convexity of f, that is, for all x, y, z ∈ R^d

f(y) + ⟨∇f(y); z − y⟩ ≤ f(z) ≤ f(x) + ⟨∇f(x); z − x⟩ + (L/2) ‖x − z‖².

In other words, we must have for all z ∈ R^d

f(y) + ⟨∇f(y); z − y⟩ ≤ f(x) + ⟨∇f(x); z − x⟩ + (L/2) ‖x − z‖²
⇔ f(y) − f(x) + ⟨∇f(y); z − y⟩ − ⟨∇f(x); z − x⟩ − (L/2) ‖x − z‖² ≤ 0
⇔ f(y) − f(x) + max_{z ∈ R^d} { ⟨∇f(y); z − y⟩ − ⟨∇f(x); z − x⟩ − (L/2) ‖x − z‖² } ≤ 0
⇔ f(y) − f(x) + ⟨∇f(y); x − y⟩ + (1/(2L)) ‖∇f(x) − ∇f(y)‖² ≤ 0,

where the last line follows from the explicit maximization on z, that is, we pick z = x − (1/L)(∇f(x) − ∇f(y)), reaching the desired result after basic algebraic manipulations.

We continue with (iii) ⇒ (iv), which simply follows from adding

f(x) ≥ f(y) + ⟨∇f(y); x − y⟩ + (1/(2L)) ‖∇f(x) − ∇f(y)‖²
f(y) ≥ f(x) + ⟨∇f(x); y − x⟩ + (1/(2L)) ‖∇f(x) − ∇f(y)‖².

For obtaining (iv) ⇒ (i), one can use the Cauchy-Schwarz inequality

(1/L) ‖∇f(x) − ∇f(y)‖² ≤ ⟨∇f(x) − ∇f(y); x − y⟩ ≤ ‖∇f(x) − ∇f(y)‖ ‖x − y‖,

which allows concluding ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖, reaching the final statement.

For obtaining (ii) ⇒ (v), we simply add

f(x) ≤ f(y) + ⟨∇f(y); x − y⟩ + (L/2) ‖x − y‖²
f(y) ≤ f(x) + ⟨∇f(x); y − x⟩ + (L/2) ‖x − y‖²,

and reorganize the resulting inequality. For obtaining (v) ⇒ (ii), we use a first-order expansion, again

f(y) = f(x) + ∫₀¹ ⟨∇f(x + τ(y − x)); y − x⟩ dτ.

The quadratic upper bound then follows from algebraic manipulations, and from upper bounding the integral term (we use the intermediate variable z_τ = x + τ(y − x) for convenience)

f(y) = f(x) + ⟨∇f(x); y − x⟩ + ∫₀¹ ⟨∇f(x + τ(y − x)) − ∇f(x); y − x⟩ dτ
     = f(x) + ⟨∇f(x); y − x⟩ + ∫₀¹ (1/τ) ⟨∇f(z_τ) − ∇f(x); z_τ − x⟩ dτ
     ≤ f(x) + ⟨∇f(x); y − x⟩ + ∫₀¹ (L/τ) ‖z_τ − x‖² dτ
     = f(x) + ⟨∇f(x); y − x⟩ + L ‖x − y‖² ∫₀¹ τ dτ
     = f(x) + ⟨∇f(x); y − x⟩ + (L/2) ‖x − y‖².

For the equivalence (vi) ⇔ (ii), simply define h(x) = (L/2)‖x‖² − f(x) (and hence ∇h(x) = Lx − ∇f(x)) and observe that for all x, y ∈ R^d

h(x) ≥ h(y) + ⟨∇h(y); x − y⟩ ⇔ f(x) ≤ f(y) + ⟨∇f(y); x − y⟩ + (L/2) ‖x − y‖²,

which follows from basic algebraic manipulations.

Finally, the equivalence (vi) ⇔ (vii) follows from the same h(x) = (L/2)‖x‖² − f(x) (and hence ∇h(x) = Lx − ∇f(x)) and the observation that for all x, y ∈ R^d and λ ∈ [0, 1] we have

h(λx + (1 − λ)y) ≤ λh(x) + (1 − λ)h(y)
⇔ f(λx + (1 − λ)y) ≥ λf(x) + (1 − λ)f(y) − λ(1 − λ)(L/2) ‖x − y‖²,

which follows from basic algebraic manipulations.
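The equivalences above are easy to sanity-check numerically; the short Python script below (ours) verifies (ii), (iii) and (iv) at random points for a convex quadratic whose largest eigenvalue equals L.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 5, 2.0
M = rng.standard_normal((d, d))
Q = M @ M.T                               # positive semidefinite
Q *= L / np.linalg.eigvalsh(Q).max()      # largest eigenvalue = L

f = lambda x: 0.5 * x @ Q @ x             # convex and L-smooth
g = lambda x: Q @ x                       # its gradient

for _ in range(100):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    dg2 = np.sum((g(x) - g(y)) ** 2)
    # (ii) quadratic upper bound
    assert f(x) <= f(y) + g(y) @ (x - y) + L / 2 * np.sum((x - y) ** 2) + 1e-9
    # (iii) lower bound involving gradient differences
    assert f(x) >= f(y) + g(y) @ (x - y) + dg2 / (2 * L) - 1e-9
    # (iv) cocoercivity of the gradient
    assert (g(x) - g(y)) @ (x - y) >= dg2 / L - 1e-9
```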

For obtaining the corresponding inequalities in the strongly convex case, one can rely on Fenchel conjugation between smoothness and strong convexity; see for example (Rockafellar and Wets, 1998, Proposition 12.6). The following inequalities are stated without proofs, and can be obtained either as direct consequences of the definitions, or from Fenchel conjugation along with the statements of Theorem A.1.

Theorem A.2. Let f : R^d → R be a closed, proper, and convex function. The following statements are equivalent for inclusion in F_{μ,L}.

(i) ∇f satisfies a Lipschitz and an inverse Lipschitz condition; for all x, y ∈ R^d

μ ‖x − y‖ ≤ ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖.

(ii) f is lower and upper bounded by quadratic functions; for all x, y ∈ R^d

f(y) + ⟨∇f(y); x − y⟩ + (μ/2) ‖x − y‖² ≤ f(x) ≤ f(y) + ⟨∇f(y); x − y⟩ + (L/2) ‖x − y‖².

(iii) f satisfies, for all x, y ∈ R^d

f(y) + ⟨∇f(y); x − y⟩ + (1/(2L)) ‖∇f(x) − ∇f(y)‖² ≤ f(x) ≤ f(y) + ⟨∇f(y); x − y⟩ + (1/(2μ)) ‖∇f(x) − ∇f(y)‖².

(iv) ∇f satisfies, for all x, y ∈ R^d

(1/L) ‖∇f(x) − ∇f(y)‖² ≤ ⟨∇f(x) − ∇f(y); x − y⟩ ≤ (1/μ) ‖∇f(x) − ∇f(y)‖².

(v) ∇f satisfies, for all x, y ∈ R^d

μ ‖x − y‖² ≤ ⟨∇f(x) − ∇f(y); x − y⟩ ≤ L ‖x − y‖².

(vi) f satisfies, for all λ ∈ [0, 1] and all x, y ∈ R^d

λf(x) + (1 − λ)f(y) − λ(1 − λ)(L/2) ‖x − y‖² ≤ f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) − λ(1 − λ)(μ/2) ‖x − y‖².

(vii) f(x) − (μ/2)‖x‖² and (L/2)‖x‖² − f(x) are convex and (L − μ)-smooth.

Finally, let us mention the existence of an inequality allowing to encode both smoothness and strong convexity together. This inequality is also known as an interpolation inequality (Taylor et al., 2017c) and turns out to be particularly useful for proving worst-case guarantees.

Theorem A.3. Let f : R^d → R be a differentiable function. f is L-smooth and μ-strongly convex if and only if

f(x) ≥ f(y) + ⟨∇f(y); x − y⟩ + (1/(2L)) ‖∇f(x) − ∇f(y)‖²
     + (μ/(2(1 − μ/L))) ‖x − y − (1/L)(∇f(x) − ∇f(y))‖².   (A.1)

Proof. (f ∈ F_{μ,L} ⇒ (A.1)) The idea is to require the quadratic upper bound from smoothness to be everywhere above the quadratic lower bound arising from strong convexity, that is, for all x, y, z ∈ R^d

f(y) + ⟨∇f(y); z − y⟩ + (μ/2) ‖z − y‖² ≤ f(z) ≤ f(x) + ⟨∇f(x); z − x⟩ + (L/2) ‖x − z‖².

In other words, we must have for all z ∈ R^d

f(y) + ⟨∇f(y); z − y⟩ + (μ/2) ‖z − y‖² ≤ f(x) + ⟨∇f(x); z − x⟩ + (L/2) ‖x − z‖²
⇔ f(y) − f(x) + ⟨∇f(y); z − y⟩ + (μ/2) ‖z − y‖² − ⟨∇f(x); z − x⟩ − (L/2) ‖x − z‖² ≤ 0
⇔ f(y) − f(x) + max_{z ∈ R^d} { ⟨∇f(y); z − y⟩ + (μ/2) ‖z − y‖² − ⟨∇f(x); z − x⟩ − (L/2) ‖x − z‖² } ≤ 0;

explicit maximization over z, that is, picking z = (Lx − μy)/(L − μ) − (1/(L − μ))(∇f(x) − ∇f(y)), allows reaching the desired inequality after basic algebraic manipulations.

((A.1) ⇒ f ∈ F_{μ,L}) f ∈ F_{0,L} is direct by observing that (A.1) is stronger than Theorem A.1(iii); f ∈ F_{μ,L} is then direct by reformulating (A.1) as

f(x) ≥ f(y) + ⟨∇f(y); x − y⟩ + (μ/2) ‖x − y‖²
     + (1/(2L(1 − μ/L))) ‖∇f(x) − ∇f(y) − μ(x − y)‖²,

which is stronger than f(x) ≥ f(y) + ⟨∇f(y); x − y⟩ + (μ/2) ‖x − y‖².

Remark A.1. It is crucial to recall that some inequalities above are only valid when dom f = R^d—in particular, those of Theorem A.1(iii & iv), Theorem A.2(iii & iv), and Theorem A.3. We refer to (Drori, 2018) for illustrations that some inequalities are not valid when restricted to some dom f ≠ R^d. Most standard inequalities, however, do hold even in the case of restricted domains, as established in e.g. (Nesterov, 2003). Some other inequalities, such as those of Theorem A.1(iv) and Theorem A.2(iv), do hold under the additional assumption of twice continuous differentiability (De Klerk et al., 2020).

A.2 Smooth Convex Functions for General Norms over Restricted Sets

In this section, we show that requiring a Lipschitz condition on ∇f, on a convex set C ⊆ R^d, implies a quadratic upper bound on f. That is, requiring that for all x, y ∈ C

‖∇f(x) − ∇f(y)‖_* ≤ L ‖x − y‖,

where ‖·‖ is some norm and ‖·‖_* is the corresponding dual norm, implies a quadratic upper bound: for all x, y ∈ C

f(x) ≤ f(y) + ⟨∇f(y); x − y⟩ + (L/2) ‖x − y‖².

Theorem A.4. Let f : R^d → R ∪ {+∞} be continuously differentiable on some open convex set C ⊆ R^d, and satisfy a Lipschitz condition

‖∇f(x) − ∇f(y)‖_* ≤ L ‖x − y‖,

for all x, y ∈ C. Then, it holds that

f(x) ≤ f(y) + ⟨∇f(y); x − y⟩ + (L/2) ‖x − y‖²,

for all x, y ∈ C.

Proof. The desired result is obtained from a first-order expansion

f(y) = f(x) + ∫₀¹ ⟨∇f(x + τ(y − x)); y − x⟩ dτ.

The quadratic upper bound then follows from algebraic manipulations, and from upper bounding the integral term

f(y) = f(x) + ⟨∇f(x); y − x⟩ + ∫₀¹ ⟨∇f(x + τ(y − x)) − ∇f(x); y − x⟩ dτ
     ≤ f(x) + ⟨∇f(x); y − x⟩ + ∫₀¹ ‖∇f(x + τ(y − x)) − ∇f(x)‖_* ‖y − x‖ dτ
     ≤ f(x) + ⟨∇f(x); y − x⟩ + L ‖x − y‖² ∫₀¹ τ dτ
     = f(x) + ⟨∇f(x); y − x⟩ + (L/2) ‖x − y‖².

B Variations around Nesterov Acceleration

B.1 Relations between Acceleration Methods

B.1.1 Optimized Gradient Method, Forms I & II

In this short section, we show that Algorithm 7 and Algorithm 8 generate the same sequence {y_k}. A direct consequence of this statement is that the sequences {x_k} also match, as they are generated, in both cases, from simple gradient steps on {y_k}. For doing so, we show that Algorithm 8 is a reformulation of Algorithm 7.

Proposition B.1. The sequence {y_k}_k generated by Algorithm 7 is equal to that of Algorithm 8.

Proof. Let us first observe that the sequences are initiated the same way in both formulations of OGM. Furthermore, considering one iteration of OGM in form I, we have:

y_k = (1 − 1/θ_{k,N}) x_k + (1/θ_{k,N}) z_k.

Clearly, we therefore have z_k = θ_{k,N} y_k + (1 − θ_{k,N}) x_k. At the next iteration, we have

y_{k+1} = (1 − 1/θ_{k+1,N}) x_{k+1} + (1/θ_{k+1,N}) ( z_k − (2θ_{k,N}/L) ∇f(y_k) )
        = (1 − 1/θ_{k+1,N}) x_{k+1} + (1/θ_{k+1,N}) ( θ_{k,N} y_k + (1 − θ_{k,N}) x_k − (2θ_{k,N}/L) ∇f(y_k) ),

where we substituted z_k by its equivalent expression from the previous iteration. Now, by


noting that −(1/L)∇f(y_k) = x_{k+1} − y_k, we reach

y_{k+1} = ((θ_{k+1,N} − 1)/θ_{k+1,N}) x_{k+1} + (1/θ_{k+1,N}) ( (1 − θ_{k,N}) x_k + 2θ_{k,N} x_{k+1} − θ_{k,N} y_k )
        = x_{k+1} + ((θ_{k,N} − 1)/θ_{k+1,N}) (x_{k+1} − x_k) + (θ_{k,N}/θ_{k+1,N}) (x_{k+1} − y_k),

where we reorganized the terms to reach the same format as in Algorithm 8.
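The equivalence can also be verified numerically; the following Python sketch (ours) runs both forms on a random convex quadratic and checks that the {y_k} sequences coincide, with the θ_{k,N} recursion taken from (4.8).

```python
import numpy as np

def thetas(N):
    # theta_{0,N} = 1; square-root recursion from (4.8), with the
    # modified final step at k = N - 1
    th = [1.0]
    for k in range(1, N + 1):
        c = 8.0 if k == N else 4.0
        th.append((1 + np.sqrt(c * th[-1] ** 2 + 1)) / 2)
    return th

rng = np.random.default_rng(1)
d, N, L = 4, 10, 1.0
M = rng.standard_normal((d, d))
Q = M @ M.T
Q *= L / np.linalg.eigvalsh(Q).max()
g = lambda y: Q @ y                      # gradient of f(y) = y'Qy/2

th = thetas(N)
x0 = rng.standard_normal(d)

# Form I: explicit (x_k, z_k) sequences
x, z, y, ys1 = x0, x0, x0, []
for k in range(N):
    xn = y - g(y) / L
    z = z - 2 * th[k] * g(y) / L
    y = (1 - 1 / th[k + 1]) * xn + z / th[k + 1]
    x = xn
    ys1.append(y)

# Form II: momentum form from Proposition B.1
x, y, ys2 = x0, x0, []
for k in range(N):
    xn = y - g(y) / L
    y = xn + (th[k] - 1) / th[k + 1] * (xn - x) + th[k] / th[k + 1] * (xn - y)
    x = xn
    ys2.append(y)

assert max(np.linalg.norm(a - b) for a, b in zip(ys1, ys2)) < 1e-10
```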

B.1.2 Nesterov's Accelerated Gradient Method, Forms I, II & III

Proposition B.2. The two sequences {x_k}_k and {y_k}_k generated by Algorithm 9 are equal to those of Algorithm 10.

Proof. In order to prove the result, we use the identities A_{k+1} = a_k², as well as A_k = Σ_{i=0}^{k−1} a_i, and a_{k+1}² = a_k² + a_{k+1}.

Given that the sequences {x_k}_k are obtained from gradient steps on y_k in both formulations, it is sufficient to prove that the sequences {y_k} match. The equivalence is clear for k = 0, as both methods generate y_1 = x_0 − (1/L)∇f(x_0). For k ≥ 0, from Algorithm 9, one can write iteration k as

y_k = (A_k/A_{k+1}) x_k + (1 − A_k/A_{k+1}) z_k,

and hence

z_k = (A_{k+1}/(A_{k+1} − A_k)) y_k + (1 − A_{k+1}/(A_{k+1} − A_k)) x_k
    = a_k y_k + (1 − a_k) x_k.

Substituting this expression in that of iteration k + 1, we reach

y_{k+1} = (A_{k+1}/A_{k+2}) x_{k+1} + ((A_{k+2} − A_{k+1})/A_{k+2}) ( z_k − ((A_{k+1} − A_k)/L) ∇f(y_k) )
        = (a_k²/a_{k+1}²) x_{k+1} + (1/a_{k+1}) ( a_k y_k + (1 − a_k) x_k − (a_k/L) ∇f(y_k) )
        = (a_k²/a_{k+1}²) x_{k+1} + (1/a_{k+1}) ( a_k x_{k+1} + (1 − a_k) x_k )
        = x_{k+1} + ((a_k − 1)/a_{k+1}) (x_{k+1} − x_k),

where we substituted the expression of z_k, and used the previous identities to reach the desired statement.

The same relationship holds with Algorithm 11, as provided by the next proposition.

Proposition B.3. The three sequences {z_k}_k, {x_k}_k and {y_k}_k generated by Algorithm 9 are equal to those of Algorithm 11.

Proof. Clearly, we have x_0 = z_0 = y_0 in both methods. Let us assume that the sequences match up to iteration k, that is, up to y_{k−1}, x_k and z_k. Clearly, both y_k and z_{k+1} are computed in the exact same way in both methods. It remains to compare the update rules for x_{k+1}; in Algorithm 11, we have

x_{k+1} = (A_k/A_{k+1}) x_k + (1 − A_k/A_{k+1}) z_{k+1}
        = y_k − (1 − A_k/A_{k+1}) ((A_{k+1} − A_k)/L) ∇f(y_k),

where we used the update rule for z_{k+1}. Further simplifications, along with the identity (A_{k+1} − A_k)² = A_{k+1}, allow arriving at

x_{k+1} = y_k − ((A_{k+1} − A_k)²/(L A_{k+1})) ∇f(y_k)
        = y_k − (1/L) ∇f(y_k),

which is clearly the same update rule as that of Algorithm 9, and hence all sequences match and the desired statement is proved.

B.1.3 Nesterov's Accelerated Gradient Method (Strongly Convex Case), Forms I, II, & III

In this short section, we provide alternate, equivalent, formulations for Algorithm 12.

Algorithm 25 Nesterov’s method, form II

Input: An L-smooth μ-strongly convex function f, initial point x_0.
1: Initialize z_0 = x_0, set q = μ/L, A_0 = 0, and A_1 = (1 − q)^{−1}.
2: for k = 0, … do
3:   A_{k+2} = ( 2A_{k+1} + 1 + √(4A_{k+1} + 4qA_{k+1}² + 1) ) / (2(1 − q))
4:   x_{k+1} = y_k − (1/L) ∇f(y_k)
5:   y_{k+1} = x_{k+1} + β_k (x_{k+1} − x_k)
6:   with β_k = (A_{k+2} − A_{k+1})(A_{k+1}(1 − q) − A_k − 1) / ( A_{k+2}(2qA_{k+1} + 1) − qA_{k+1}² )
7: end for
Output: An approximate solution x_N.

Proposition B.4. The two sequences {x_k}_k and {y_k}_k generated by Algorithm 12 are equal to those of Algorithm 25.

Proof. Without loss of generality, we can consider that a third sequence z_k is present in Algorithm 25 (although it is not computed). Clearly, we have x_0 = z_0 = y_0 in both methods. Let us assume that the sequences match up to iteration k, that is, up to y_k, x_k and z_k. Clearly, x_{k+1} is computed in the exact same way in both methods, as a gradient step from y_k, and it remains to compare the update rules for y_{k+1}. In Algorithm 12, we have

y_{k+1} = x_k + ( τ_k − τ_{k+1}(τ_k − 1)(1 − qδ_k) ) (z_k − x_k) − ((δ_k − 1)τ_{k+1} + 1)/L ∇f(y_k),

whereas in Algorithm 25, we have

y_{k+1} = x_k + (β_k + 1)τ_k (z_k − x_k) − (1 + β_k)/L ∇f(y_k).

By remarking that β_k = τ_{k+1}(δ_k − 1), the coefficients in front of ∇f(y_k) match in both expressions, and it remains to check that

(β_k + 1)τ_k − ( τ_k − τ_{k+1}(τ_k − 1)(1 − qδ_k) )

is identically 0 to reach the desired statement. Substituting β_k = τ_{k+1}(δ_k − 1), this expression reduces to

τ_{k+1} ( δ_k(τ_k(1 − q) + q) − 1 ),

and we have to verify that (δ_k(τ_k(1 − q) + q) − 1) is zero. Substituting and reworking this expression, using those of τ_k and δ_k, we arrive at

τ_k ( (A_{k+1} − A_k)² − A_{k+1} − qA_{k+1}² ) / ( (A_{k+1} − A_k)(1 + qA_{k+1}) ) = 0,

as we recognize (A_{k+1} − A_k)² − A_{k+1} − qA_{k+1}² = 0 (which is the expression we used for picking A_{k+1}).

Algorithm 26 Nesterov’s method, form III

Input: An L-smooth μ-strongly convex function f, initial point x_0.
1: Initialize z_0 = x_0 and A_0 = 0, set q = μ/L.
2: for k = 0, … do
3:   A_{k+1} = ( 2A_k + 1 + √(4A_k + 4qA_k² + 1) ) / (2(1 − q))
4:   set τ_k = (A_{k+1} − A_k)(1 + qA_k) / ( A_{k+1} + 2qA_kA_{k+1} − qA_k² ) and δ_k = (A_{k+1} − A_k) / (1 + qA_{k+1})
5:   y_k = x_k + τ_k (z_k − x_k)
6:   z_{k+1} = (1 − qδ_k) z_k + qδ_k y_k − (δ_k/L) ∇f(y_k)
7:   x_{k+1} = (A_k/A_{k+1}) x_k + (1 − A_k/A_{k+1}) z_{k+1}
8: end for
Output: An approximate solution x_N.

Proposition B.5. The three sequences {z_k}_k, {x_k}_k and {y_k}_k generated by Algorithm 12 are equal to those of Algorithm 26.

Proof. Clearly, we have x_0 = z_0 = y_0 in both methods. Let us assume that the sequences match up to iteration k, that is, up to y_{k−1}, x_k and z_k. Clearly, y_k and z_{k+1} are computed in the exact same way in both methods, so we only have to verify that the update rules for x_{k+1} match. In other words, we have to verify

(A_k/A_{k+1}) x_k + (1 − A_k/A_{k+1}) z_{k+1} = y_k − (1/L) ∇f(y_k),

which, using the update rules for z_{k+1} and y_k, amounts to verifying

− ( (A_{k+1} − A_k)² − A_{k+1} − qA_{k+1}² ) / ( L A_{k+1}(1 + qA_{k+1}) ) ∇f(y_k) = 0,

which is true, as we recognize (A_{k+1} − A_k)² − A_{k+1} − qA_{k+1}² = 0 (which is the expression used for choosing A_{k+1}).

B.2 Conjugate Gradient Method

Historically, Nesterov's accelerated gradient method (Nesterov, 1983) was preceded by a few other methods with optimal worst-case convergence rates O(N^{−2}) for smooth convex minimization. However, those alternate schemes required the possibility of optimizing exactly over a few dimensions—plane searches were used in (Nemirovsky and Yudin, 1983c; Nemirovsky and Yudin, 1983b) and line searches were used in (Nemirovsky, 1982); unfortunately they are not to be found in English, and we refer to (Narkiss and Zibulevsky, 2005) for related discussions.

In this vein, accelerated methods can be obtained through their links with conjugate gradients (Algorithm 27), as by-products of the worst-case analysis. In this section, we illustrate that the connection between OGM and conjugate gradients is absolutely perfect: the exact same proof (achieving the lower bound) is valid for both. The conjugate

Algorithm 27 Conjugate gradient method

Input: An L-smooth convex function f, initial point y_0, budget N.
1: for k = 0, …, N − 1 do
2:   y_{k+1} = argmin_x { f(x) : x ∈ y_0 + span{∇f(y_0), ∇f(y_1), …, ∇f(y_k)} }
3: end for
Output: An approximate solution y_N.

gradient method (CG) for solving quadratic optimization problems is known to have an efficient form not requiring to perform those span searches (which are in general too expensive to be of any practical interest); see for example (Nocedal and Wright, 2006). Beyond quadratics, it is in general not possible to reformulate the method in an efficient way. However, it is possible to find other methods for which the exact same worst-case analysis applies, and it turns out that OGM is one of them. Similarly, by slightly weakening the analysis of CG, one can find other methods, such as Nesterov's accelerated gradient.

More precisely, let us recall the previous definition for the sequence {θ_{k,N}}_k defined in (4.8):

θ_{k+1,N} = (1 + √(4θ_{k,N}² + 1))/2   if k ≤ N − 2,
θ_{k+1,N} = (1 + √(8θ_{k,N}² + 1))/2   if k = N − 1.

As a result of the worst-case analysis presented below, all methods satisfying

⟨∇f(y_i); y_i − [ (1 − 1/θ_{i,N}) (y_{i−1} − (1/L)∇f(y_{i−1})) + (1/θ_{i,N}) ( y_0 − (2/L) Σ_{j=0}^{i−1} θ_{j,N} ∇f(y_j) ) ]⟩ ≤ 0   (B.1)

achieve the optimal worst-case complexity of smooth convex minimization, provided by Theorem 4.7. On the one hand, CG ensures this inequality to hold thanks to span

searches (i.e., orthogonality of successive search directions), that is

⟨∇f(y_i); y_i − y_{i−1} + (1/θ_{i,N})(y_{i−1} − y_0)⟩ = 0
⟨∇f(y_i); ∇f(y_0)⟩ = 0
⋮

⟨∇f(y_i); ∇f(y_{i−1})⟩ = 0.

On the other hand, OGM enforces this inequality by using

y_i = (1 − 1/θ_{i,N}) ( y_{i−1} − (1/L)∇f(y_{i−1}) ) + (1/θ_{i,N}) ( y_0 − (2/L) Σ_{j=0}^{i−1} θ_{j,N} ∇f(y_j) ).

Optimized and Conjugate Gradient Methods: Worst-case Analyses

The worst-case analysis below relies on the exact same potentials as for the optimized gradient method; see Theorem 4.4 and Lemma 4.5.

Theorem B.1. Let f be an L-smooth convex function, N ∈ N, and some x_⋆ ∈ argmin_x f(x). The iterates of the conjugate gradient method (CG, Algorithm 27), and of all methods whose iterates are compliant with (B.1), satisfy

f(y_N) − f(x_⋆) ≤ L ‖y_0 − x_⋆‖² / (2θ_{N,N}²),

for all y_0 ∈ R^d.

Proof. The result is obtained from the exact same potential as that of OGM, obtained from more inequalities. That is, first perform a weighted sum of the following inequalities.

• Smoothness and convexity of f between y_{k−1} and y_k, with weight λ_1 = 2θ_{k−1,N}²:

0 ≥ f(y_k) − f(y_{k−1}) + ⟨∇f(y_k); y_{k−1} − y_k⟩ + (1/(2L)) ‖∇f(y_k) − ∇f(y_{k−1})‖²,

• smoothness and convexity of f between x_⋆ and y_k, with weight λ_2 = 2θ_{k,N}:

0 ≥ f(y_k) − f(x_⋆) + ⟨∇f(y_k); x_⋆ − y_k⟩ + (1/(2L)) ‖∇f(y_k)‖²,

• search procedure for obtaining y_k, with weight λ_3 = 2θ_{k,N}²:

0 ≥ ⟨∇f(y_k); y_k − [ (1 − 1/θ_{k,N}) (y_{k−1} − (1/L)∇f(y_{k−1})) + (1/θ_{k,N}) z_k ]⟩,

where we used z_k := y_0 − (2/L) Σ_{j=0}^{k−1} θ_{j,N} ∇f(y_j).

The weighted sum is a valid inequality

0 ≥ λ_1 [ f(y_k) − f(y_{k−1}) + ⟨∇f(y_k); y_{k−1} − y_k⟩ + (1/(2L)) ‖∇f(y_k) − ∇f(y_{k−1})‖² ]
  + λ_2 [ f(y_k) − f(x_⋆) + ⟨∇f(y_k); x_⋆ − y_k⟩ + (1/(2L)) ‖∇f(y_k)‖² ]

  + λ_3 ⟨∇f(y_k); y_k − [ (1 − 1/θ_{k,N}) (y_{k−1} − (1/L)∇f(y_{k−1})) + (1/θ_{k,N}) z_k ]⟩.

Substituting z_{k+1}, the previous inequality can be reformulated exactly as

0 ≥ 2θ_{k,N}² ( f(y_k) − f_⋆ − (1/(2L)) ‖∇f(y_k)‖² ) + (L/2) ‖z_{k+1} − x_⋆‖²
  − 2θ_{k−1,N}² ( f(y_{k−1}) − f_⋆ − (1/(2L)) ‖∇f(y_{k−1})‖² ) − (L/2) ‖z_k − x_⋆‖²
  + 2 ( θ_{k−1,N}² − θ_{k,N}² + θ_{k,N} ) ( f(y_k) − f_⋆ + (1/(2L)) ‖∇f(y_k)‖² )
  + 2 ( θ_{k−1,N}² − θ_{k,N}² + θ_{k,N} ) ⟨∇f(y_k); y_{k−1} − (1/L)∇f(y_{k−1}) − y_k⟩.

We reach the desired inequality by picking θ_{k,N} ≥ θ_{k−1,N} satisfying

θ_{k−1,N}² − θ_{k,N}² + θ_{k,N} = 0,

reaching the same potential as in Theorem 4.4. For obtaining the technical lemma allowing to bound the final f(y_N) − f_⋆, we follow the same steps with the following inequalities.

• Smoothness and convexity of f between y_{N−1} and y_N, with weight λ_1 = 2θ_{N−1,N}²:

0 ≥ f(y_N) − f(y_{N−1}) + ⟨∇f(y_N); y_{N−1} − y_N⟩ + (1/(2L)) ‖∇f(y_N) − ∇f(y_{N−1})‖²,

• smoothness and convexity of f between x_⋆ and y_N, with weight λ_2 = θ_{N,N}:

0 ≥ f(y_N) − f(x_⋆) + ⟨∇f(y_N); x_⋆ − y_N⟩ + (1/(2L)) ‖∇f(y_N)‖²,

• search procedure for obtaining y_N, with weight λ_3 = θ_{N,N}²:

0 ≥ ⟨∇f(y_N); y_N − [ (1 − 1/θ_{N,N}) (y_{N−1} − (1/L)∇f(y_{N−1})) + (1/θ_{N,N}) z_N ]⟩.

The weighted sum can then be reformulated exactly as

0 ≥ θ_{N,N}² (f(y_N) − f_⋆) + (L/2) ‖z_N − (θ_{N,N}/L) ∇f(y_N) − x_⋆‖²
  − 2θ_{N−1,N}² ( f(y_{N−1}) − f_⋆ − (1/(2L)) ‖∇f(y_{N−1})‖² ) − (L/2) ‖z_N − x_⋆‖²
  + ( 2θ_{N−1,N}² − θ_{N,N}² + θ_{N,N} ) ( f(y_N) − f_⋆ + (1/(2L)) ‖∇f(y_N)‖² )
  + ( 2θ_{N−1,N}² − θ_{N,N}² + θ_{N,N} ) ⟨∇f(y_N); y_{N−1} − (1/L)∇f(y_{N−1}) − y_N⟩,

reaching the desired inequality as in Lemma 4.5 by picking θ_{N,N} ≥ θ_{N−1,N} satisfying

2θ_{N−1,N}² − θ_{N,N}² + θ_{N,N} = 0.

Hence, the potential argument from Corollary 4.6 applies as such and we reach the desired conclusion. In other words, one can define, for all k ∈ {0, …, N},

φ_k = 2θ_{k−1,N}² ( f(y_{k−1}) − f_⋆ − (1/(2L)) ‖∇f(y_{k−1})‖² ) + (L/2) ‖z_k − x_⋆‖²,

and

φ_{N+1} = θ_{N,N}² (f(y_N) − f_⋆) + (L/2) ‖z_N − (θ_{N,N}/L) ∇f(y_N) − x_⋆‖²,

and reach the desired statement by chaining the inequalities

θ_{N,N}² (f(y_N) − f_⋆) ≤ φ_{N+1} ≤ φ_N ≤ … ≤ φ_0 = (L/2) ‖y_0 − x_⋆‖².
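For quadratics, the span searches of Algorithm 27 reduce to the classical linear conjugate gradient recursion, so the bound of Theorem B.1 can be checked directly; the Python sketch below (ours) does so on a random positive definite quadratic.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, L = 20, 8, 1.0
M = rng.standard_normal((d, d))
Q = M @ M.T + 0.1 * np.eye(d)            # positive definite
Q *= L / np.linalg.eigvalsh(Q).max()     # largest eigenvalue = L
b = rng.standard_normal(d)

f = lambda y: 0.5 * y @ Q @ y - b @ y    # L-smooth and convex
x_star = np.linalg.solve(Q, b)
y0 = rng.standard_normal(d)

# Standard linear CG: equivalent to the span searches of Algorithm 27
# when f is quadratic
y = y0.copy()
r = b - Q @ y
p = r.copy()
for _ in range(N):
    Qp = Q @ p
    a = (r @ r) / (p @ Qp)
    y = y + a * p
    r_new = r - a * Qp
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new

# theta_{N,N} from (4.8)
th = 1.0
for k in range(1, N + 1):
    th = (1 + np.sqrt((8.0 if k == N else 4.0) * th ** 2 + 1)) / 2

bound = L * np.sum((y0 - x_star) ** 2) / (2 * th ** 2)
assert f(y) - f(x_star) <= bound + 1e-9
```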

Remark B.1. It is possible to further exploit the conjugate gradient method for designing practical accelerated methods in different settings, such as that of Nesterov (1983). Such points of view were exploited in, among others, (Narkiss and Zibulevsky, 2005; Karimi and Vavasis, 2016; Karimi and Vavasis, 2017; Diakonikolas and Orecchia, 2019a). The link between CG and OGM presented in this section is due to Drori and Taylor (2020), though under a different presentation not involving potential functions.

B.3 Proximal Accelerated Gradient Without Monotone Backtracking

B.3.1 FISTA without Monotone Backtracking

In this section, we show how to incorporate backtracking strategies possibly not satisfying L_{k+1} ≥ L_k, which is important in practice. The developments are essentially the same, and one possible trick is to incorporate all the knowledge about L_k in A_k. That is, we use a rescaled shape for the potential function

φ_k = B_k (f(x_k) − f_⋆) + ((1 + μB_k)/2) ‖z_k − x_⋆‖²,

where, without backtracking strategy, B_k = A_k/L. This somewhat cosmetic change allows φ_k to depend on L_k solely via B_k, and applies to both backtracking methods presented in this chapter. The idea used for obtaining both methods below is that one can perform the same computations as in Algorithm 12, replacing A_k by L_{k+1}B_k and A_{k+1} by L_{k+1}B_{k+1} at iteration k. So, as in previous versions, only the current approximate Lipschitz constant L_{k+1} is used at iteration k; previous approximations were only used for computing B_k.

Algorithm 28 Strongly convex FISTA (general initialization of L_{k+1})
Input: An L-smooth (possibly μ-strongly) convex function f, a convex function h with proximal operator available, an initial point x_0, and an initial estimate L_0 > μ.
1: Initialize z_0 = x_0, B_0 = 0, and some α > 1.
2: for k = 0, … do
3:   Pick L_{k+1} ∈ (μ, ∞). {Reasonable choices include L_{k+1} ∈ [L_0, L_k].}
4:   loop
5:     set q_{k+1} = μ/L_{k+1},
6:     B_{k+1} = ( 2L_{k+1}B_k + 1 + √(4L_{k+1}B_k + 4μL_{k+1}B_k² + 1) ) / (2(L_{k+1} − μ))
7:     set τ_k = (B_{k+1} − B_k)(1 + μB_k) / ( B_{k+1} + 2μB_kB_{k+1} − μB_k² ) and δ_k = L_{k+1}(B_{k+1} − B_k) / (1 + μB_{k+1})
8:     y_k = x_k + τ_k (z_k − x_k)
9:     x_{k+1} = prox_{h/L_{k+1}} ( y_k − (1/L_{k+1}) ∇f(y_k) )
10:    z_{k+1} = (1 − q_{k+1}δ_k) z_k + q_{k+1}δ_k y_k + δ_k (x_{k+1} − y_k)
11:    if (4.21) holds then
12:      break {Iterates accepted; k will be incremented.}
13:    else
14:      L_{k+1} = αL_{k+1} {Iterates not accepted; recompute new L_{k+1}.}
15:    end if
16:   end loop
17: end for
Output: An approximate solution x_{k+1}.

The proof follows exactly the same lines as those of FISTA (Algorithm 4.20). In this

case, f is assumed to be smooth and convex over Rd (i.e., it has full domain, dom f = Rd), and we are therefore allowed to evaluate gradients of f outside of the domain of h.

Theorem B.2. Let f ∈ F_{μ,L} (with full domain; dom f = R^d), h be a closed, proper and convex function, x_⋆ ∈ argmin_x {f(x) + h(x)}, and k ∈ N. For any x_k, z_k ∈ R^d and B_k ≥ 0, the iterates of Algorithm 28 satisfying (4.21) also satisfy

B_{k+1} (F(x_{k+1}) − F_⋆) + ((1 + μB_{k+1})/2) ‖z_{k+1} − x_⋆‖²
≤ B_k (F(x_k) − F_⋆) + ((1 + μB_k)/2) ‖z_k − x_⋆‖²,

with B_{k+1} = ( 2L_{k+1}B_k + 1 + √(4L_{k+1}B_k + 4μL_{k+1}B_k² + 1) ) / (2(L_{k+1} − μ)).

Proof. The proof consists in a weighted sum of the following inequalities:

• strong convexity of f between x_⋆ and y_k, with weight λ_1 = B_{k+1} − B_k:

f_⋆ ≥ f(y_k) + ⟨∇f(y_k); x_⋆ − y_k⟩ + (μ/2) ‖x_⋆ − y_k‖²,

• strong convexity of f between xk and yk with weight λ2 = Bk

f(x_k) ≥ f(y_k) + ⟨∇f(y_k); x_k − y_k⟩,

• smoothness of f between y_k and x_{k+1} (descent lemma), with weight λ_3 = B_{k+1}:

f(y_k) + ⟨∇f(y_k); x_{k+1} − y_k⟩ + (L_{k+1}/2) ‖x_{k+1} − y_k‖² ≥ f(x_{k+1}),

• convexity of h between x_⋆ and x_{k+1}, with weight λ_4 = B_{k+1} − B_k:

h(x_⋆) ≥ h(x_{k+1}) + ⟨g_h(x_{k+1}); x_⋆ − x_{k+1}⟩,

with g_h(x_{k+1}) ∈ ∂h(x_{k+1}) and x_{k+1} = y_k − (1/L_{k+1}) (∇f(y_k) + g_h(x_{k+1})),

• convexity of h between xk and xk+1 with weight λ5 = Bk

h(x_k) ≥ h(x_{k+1}) + ⟨g_h(x_{k+1}); x_k − x_{k+1}⟩.

We get the following inequality

0 ≥ λ_1 [ f(y_k) − f_⋆ + ⟨∇f(y_k); x_⋆ − y_k⟩ + (μ/2) ‖x_⋆ − y_k‖² ]
  + λ_2 [ f(y_k) − f(x_k) + ⟨∇f(y_k); x_k − y_k⟩ ]
  + λ_3 [ f(x_{k+1}) − ( f(y_k) + ⟨∇f(y_k); x_{k+1} − y_k⟩ + (L_{k+1}/2) ‖x_{k+1} − y_k‖² ) ]
  + λ_4 [ h(x_{k+1}) − h(x_⋆) + ⟨g_h(x_{k+1}); x_⋆ − x_{k+1}⟩ ]
  + λ_5 [ h(x_{k+1}) − h(x_k) + ⟨g_h(x_{k+1}); x_k − x_{k+1}⟩ ].

Substituting y_k, x_{k+1}, and z_{k+1} using

y_k = x_k + τ_k (z_k − x_k)
x_{k+1} = y_k − (1/L_{k+1}) (∇f(y_k) + g_h(x_{k+1}))
z_{k+1} = (1 − q_{k+1}δ_k) z_k + q_{k+1}δ_k y_k + δ_k (x_{k+1} − y_k)

yields, after some basic but tedious algebra,

B_{k+1} (f(x_{k+1}) + h(x_{k+1}) − f(x_⋆) − h(x_⋆)) + ((1 + μB_{k+1})/2) ‖z_{k+1} − x_⋆‖²
≤ B_k (f(x_k) + h(x_k) − f(x_⋆) − h(x_⋆)) + ((1 + μB_k)/2) ‖z_k − x_⋆‖²
+ ( L_{k+1}(B_k − B_{k+1})² − B_{k+1} − μB_{k+1}² ) / (1 + μB_{k+1}) · (1/(2L_{k+1})) ‖∇f(y_k) + g_h(x_{k+1})‖²
− ( B_k(B_{k+1} − B_k)(1 + μB_k)(1 + μB_{k+1}) ) / ( B_{k+1} + 2μB_kB_{k+1} − μB_k² )² · (μ/2) ‖x_k − z_k‖².

Picking B_{k+1} such that B_{k+1} ≥ B_k and

L_{k+1}(B_k − B_{k+1})² − B_{k+1} − μB_{k+1}² = 0,

yields the desired result

B_{k+1} (f(x_{k+1}) + h(x_{k+1}) − f(x_⋆) − h(x_⋆)) + ((1 + μB_{k+1})/2) ‖z_{k+1} − x_⋆‖²
≤ B_k (f(x_k) + h(x_k) − f(x_⋆) − h(x_⋆)) + ((1 + μB_k)/2) ‖z_k − x_⋆‖².

Finally, we obtain a complexity guarantee by adapting the potential argument (4.5), and by noting that Bk+1 is a decreasing function of Lk+1 (whose maximal value is αL, assuming L0 < L; otherwise its maximal value is L0). The growth rate of Bk in the smooth convex setting remains unchanged, see (4.14), as we have

B_{k+1} ≥ ( 1/(2√L_{k+1}) + √B_k )²,

and hence √B_{k+1} ≥ 1/(2√L_{k+1}) + √B_k, and therefore √B_k ≥ k/(2√ℓ), with ℓ = max{L_0, αL} and L_{k+1} ≤ ℓ. As for the geometric rate, we obtain similarly

B_{k+1} ≥ B_k (1 + √(μ/L_{k+1})) / (1 − μ/L_{k+1}) = B_k / (1 − √(μ/L_{k+1})),

and therefore B_{k+1} ≥ (1 − √(μ/ℓ))^{−1} B_k.

Corollary B.3. Let f ∈ F_{μ,L}(R^d), h be a closed, proper and convex function, and x_⋆ ∈ argmin_x {F(x) ≡ f(x) + h(x)}. For any N ∈ N, N ≥ 1, and x_0 ∈ R^d, the output of Algorithm 28 satisfies

F(x_N) − F_⋆ ≤ min{ 2/N², (1 − √(μ/ℓ))^N } ℓ ‖x_0 − x_⋆‖²,

with ℓ = max{αL, L_0}.

Proof. We assume that L > L_0, as otherwise f ∈ F_{μ,L_0} and the proof directly follows from the case without backtracking. The chained potential argument (4.5) can be used as before. Using B_0 = 0, we reach

F(x_N) − F_⋆ ≤ ‖x_0 − x_⋆‖² / (2B_N).

Using our previous bounds on B_N yields the desired result, using

B_1 = 1/(L_1 − μ) ≥ ℓ^{−1}/(1 − μ/ℓ) = ℓ^{−1} / ( (1 − √(μ/ℓ))(1 + √(μ/ℓ)) ) ≥ (ℓ^{−1}/2) (1 − √(μ/ℓ))^{−1},

and hence B_N ≥ (2ℓ)^{−1} (1 − √(μ/ℓ))^{−N}, as well as B_k ≥ k²/(4ℓ).
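A compact Python sketch of Algorithm 28 follows; it assumes that (4.21) is the usual descent-lemma check at (y_k, x_{k+1}) (an assumption, since (4.21) is stated in Chapter 4), and that `prox_h(v, lam)`, computing prox_{λh}(v), is given.

```python
import numpy as np

def fista_backtracking(f, grad_f, prox_h, x0, L0, mu=0.0, alpha=2.0, iters=100):
    # Strongly convex FISTA with non-monotone backtracking: L_{k+1}
    # may decrease between iterations and is only increased (by the
    # factor alpha) until the descent condition holds.
    x, z, B, Lk = x0, x0, 0.0, L0
    for _ in range(iters):
        Lk1 = max(Lk / alpha, L0)        # optimistic non-monotone guess, kept > mu
        while True:
            q = mu / Lk1
            Bn = (2 * Lk1 * B + 1 + np.sqrt(4 * Lk1 * B + 4 * mu * Lk1 * B ** 2 + 1)) / (2 * (Lk1 - mu))
            tau = (Bn - B) * (1 + mu * B) / (Bn + 2 * mu * B * Bn - mu * B ** 2)
            delta = Lk1 * (Bn - B) / (1 + mu * Bn)
            y = x + tau * (z - x)
            gy = grad_f(y)
            xn = prox_h(y - gy / Lk1, 1.0 / Lk1)
            # descent-lemma check, standing in for (4.21)
            if f(xn) <= f(y) + gy @ (xn - y) + Lk1 / 2 * np.sum((xn - y) ** 2):
                break
            Lk1 *= alpha
        z = (1 - q * delta) * z + q * delta * y + delta * (xn - y)
        x, B, Lk = xn, Bn, Lk1
    return x
```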

B.3.2 Another Proximal Accelerated Gradient Without Monotone Backtracking

Just as for FISTA, we can perform the same cosmetic change to Algorithm 18 for incorporating non-monotone estimates of the Lipschitz constant. The proof is therefore essentially that of Algorithm 18.

Theorem B.4. Let h ∈ F_{0,∞}, f ∈ F_{μ,L}(dom h), x_⋆ ∈ argmin_x {F(x) ≡ f(x) + h(x)}, and k ∈ N. For any x_k, z_k ∈ R^d and B_k ≥ 0, the iterates of Algorithm 29 satisfying (4.21) also satisfy

B_{k+1} (F(x_{k+1}) − F_⋆) + ((1 + μB_{k+1})/2) ‖z_{k+1} − x_⋆‖²
≤ B_k (F(x_k) − F_⋆) + ((1 + μB_k)/2) ‖z_k − x_⋆‖²,

with B_{k+1} = ( 2L_{k+1}B_k + 1 + √(4L_{k+1}B_k + 4μL_{k+1}B_k² + 1) ) / (2(L_{k+1} − μ)).

Proof. First, the {z_k} are in dom h by construction—each is the output of the proximal/projection step. Furthermore, we have 0 ≤ B_k/B_{k+1} ≤ 1, given that B_{k+1} ≥ B_k ≥ 0. A direct consequence is that, since z_0 = x_0 ∈ dom h, all subsequent {y_k} and {x_k} are also in dom h (as they are obtained from convex combinations of feasible points). The rest of the proof consists in a weighted sum of the following inequalities (which are valid due to feasibility of the iterates):

Algorithm 29 A proximal accelerated gradient (general initialization of Lk+1)

Input: h ∈ F_{0,∞} with proximal operator available, f ∈ F_{μ,L}(dom h), an initial point x_0 ∈ dom h, and an initial estimate L_0 > μ.
1: Initialize z_0 = x_0, B_0 = 0, and some α > 1.
2: for k = 0, … do
3:   Pick L_{k+1} ∈ (μ, ∞). {Reasonable choices include L_{k+1} ∈ [L_0, L_k].}
4:   loop
5:     set q_{k+1} = μ/L_{k+1},
6:     B_{k+1} = ( 2L_{k+1}B_k + 1 + √(4L_{k+1}B_k + 4μL_{k+1}B_k² + 1) ) / (2(L_{k+1} − μ))
7:     set τ_k = (B_{k+1} − B_k)(1 + μB_k) / ( B_{k+1} + 2μB_kB_{k+1} − μB_k² ) and δ_k = L_{k+1}(B_{k+1} − B_k) / (1 + μB_{k+1})
8:     y_k = x_k + τ_k (z_k − x_k)
9:     z_{k+1} = prox_{δ_k h/L_{k+1}} ( (1 − q_{k+1}δ_k) z_k + q_{k+1}δ_k y_k − (δ_k/L_{k+1}) ∇f(y_k) )
10:    x_{k+1} = (B_k/B_{k+1}) x_k + (1 − B_k/B_{k+1}) z_{k+1}
11:    if (4.21) holds then
12:      break {Iterates accepted; k will be incremented.}
13:    else
14:      L_{k+1} = αL_{k+1} {Iterates not accepted; recompute new L_{k+1}.}
15:    end if
16:   end loop
17: end for
Output: An approximate solution x_{k+1}.

• strong convexity of f between x_⋆ and y_k, with weight λ_1 = B_{k+1} − B_k:

f(x_⋆) ≥ f(y_k) + ⟨∇f(y_k); x_⋆ − y_k⟩ + (μ/2) ‖x_⋆ − y_k‖²,

• convexity of f between x_k and y_k, with weight λ_2 = B_k:

f(x_k) ≥ f(y_k) + ⟨∇f(y_k); x_k − y_k⟩,

• smoothness of f between y_k and x_{k+1} (descent lemma), with weight λ_3 = B_{k+1}:

f(y_k) + ⟨∇f(y_k); x_{k+1} − y_k⟩ + (L_{k+1}/2) ‖x_{k+1} − y_k‖² ≥ f(x_{k+1}),

• convexity of h between x_⋆ and z_{k+1}, with weight λ_4 = B_{k+1} − B_k:

h(x_⋆) ≥ h(z_{k+1}) + ⟨g_h(z_{k+1}); x_⋆ − z_{k+1}⟩,

with g_h(z_{k+1}) ∈ ∂h(z_{k+1}) and z_{k+1} = (1 − q_{k+1}δ_k) z_k + q_{k+1}δ_k y_k − (δ_k/L_{k+1}) (∇f(y_k) + g_h(z_{k+1})),

• convexity of h between x_k and x_{k+1}, with weight λ_5 = B_k:

h(x_k) ≥ h(x_{k+1}) + ⟨g_h(x_{k+1}); x_k − x_{k+1}⟩,

with g_h(x_{k+1}) ∈ ∂h(x_{k+1}),

• convexity of h between z_{k+1} and x_{k+1}, with weight λ_6 = B_{k+1} − B_k:

h(z_{k+1}) ≥ h(x_{k+1}) + ⟨g_h(x_{k+1}); z_{k+1} − x_{k+1}⟩.

We get the following inequality

0 ≥ λ_1 [ f(y_k) − f_⋆ + ⟨∇f(y_k); x_⋆ − y_k⟩ + (μ/2) ‖x_⋆ − y_k‖² ]
  + λ_2 [ f(y_k) − f(x_k) + ⟨∇f(y_k); x_k − y_k⟩ ]
  + λ_3 [ f(x_{k+1}) − ( f(y_k) + ⟨∇f(y_k); x_{k+1} − y_k⟩ + (L_{k+1}/2) ‖x_{k+1} − y_k‖² ) ]
  + λ_4 [ h(z_{k+1}) − h(x_⋆) + ⟨g_h(z_{k+1}); x_⋆ − z_{k+1}⟩ ]
  + λ_5 [ h(x_{k+1}) − h(x_k) + ⟨g_h(x_{k+1}); x_k − x_{k+1}⟩ ]
  + λ_6 [ h(x_{k+1}) − h(z_{k+1}) + ⟨g_h(x_{k+1}); z_{k+1} − x_{k+1}⟩ ].

Substituting y_k, z_{k+1}, and x_{k+1} using

y_k = x_k + τ_k (z_k − x_k)
z_{k+1} = (1 − q_{k+1}δ_k) z_k + q_{k+1}δ_k y_k − (δ_k/L_{k+1}) (∇f(y_k) + g_h(z_{k+1}))
x_{k+1} = (B_k/B_{k+1}) x_k + (1 − B_k/B_{k+1}) z_{k+1},

some algebra allows obtaining the following reformulation

B_{k+1} (f(x_{k+1}) + h(x_{k+1}) − f(x_⋆) − h(x_⋆)) + ((1 + μB_{k+1})/2) ‖z_{k+1} − x_⋆‖²
≤ B_k (f(x_k) + h(x_k) − f(x_⋆) − h(x_⋆)) + ((1 + μB_k)/2) ‖z_k − x_⋆‖²
+ ( (B_k − B_{k+1})² ( L_{k+1}(B_k − B_{k+1})² − B_{k+1} − μB_{k+1}² ) ) / ( B_{k+1}²(1 + μB_{k+1}) ) · (1/2) ‖∇f(y_k) + g_h(z_{k+1})‖²
− ( B_k(B_{k+1} − B_k)(1 + μB_k)(1 + μB_{k+1}) ) / ( B_{k+1} + 2μB_kB_{k+1} − μB_k² )² · (μ/2) ‖x_k − z_k‖²,

and the desired inequality follows from picking B_{k+1} ≥ B_k such that

L_{k+1}(B_k − B_{k+1})² − B_{k+1} − μB_{k+1}² = 0,

yielding

B_{k+1} (f(x_{k+1}) + h(x_{k+1}) − f(x_⋆) − h(x_⋆)) + ((1 + μB_{k+1})/2) ‖z_{k+1} − x_⋆‖²
≤ B_k (f(x_k) + h(x_k) − f(x_⋆) − h(x_⋆)) + ((1 + μB_k)/2) ‖z_k − x_⋆‖².

The last corollary follows from exactly the same arguments as those used in Corollary B.3, and provides the final bound for Algorithm 29.

Corollary B.5. Let h ∈ F_{0,∞}, f ∈ F_{μ,L}(dom h) and x_⋆ ∈ argmin_x {F(x) ≡ f(x) + h(x)}. For any N ∈ N, N ≥ 1, and x_0 ∈ R^d, the output of Algorithm 29 satisfies

F(x_N) − F_⋆ ≤ min{ 2/N², (1 − √(μ/ℓ))^N } ℓ ‖x_0 − x_⋆‖²,

with ℓ = max{αL, L_0}.

Proof. The proof follows the same arguments as those of Corollary B.3, using the potential from Theorem B.4, and using the fact that the output of the algorithm satisfies (4.21).

Acknowledgements

The authors would like to warmly thank Raphaël Berthier, Mathieu Barré, Fabian Pedregosa and Baptiste Goujaud for comments on early versions of this manuscript, for spotting a few typos, and for discussions and developments related to Chapter 2, Chapter 4 and Chapter 5. We are also greatly indebted to Lénaïc Chizat, Laurent Condat, Jelena Diakonikolas, Alexander Gasnikov, Cristóbal Guzmán, Julien Mairal, and Irène Waldspurger, for spotting a few typos and inconsistencies in the first version of the manuscript. We further want to thank Francis Bach, Sébastien Bubeck, Aymeric Dieuleveut, Radu-Alexandru Dragomir, Yoel Drori, Hadrien Hendrikx, Reza Babanezhad, Claude Brezinski, Pavel Dvurechensky, Georges Lan, Michela Redivo-Zaglia, Simon Lacoste-Julien, Vincent Roulet, and Ernest Ryu for fruitful discussions and pointers, which largely simplified the writing and revision process of this manuscript.

AA is at the département d'informatique de l'ENS, École normale supérieure, UMR CNRS 8548, PSL Research University, 75005 Paris, France, and INRIA. AA would like to acknowledge support from the ML and Optimisation joint research initiative with the fonds AXA pour la recherche and Kamet Ventures, a Google focused award, as well as funding by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d'avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). AT is at INRIA and the département d'informatique de l'ENS, École normale supérieure, CNRS, PSL Research University, 75005 Paris, France. AT acknowledges support from the European Research Council (ERC grant SEQUOIA 724063).

References

Aitken, A. C. 1927. “On Bernoulli's Numerical Solution of Algebraic Equations”. Proceedings of the Royal Society of Edinburgh. 46: 289–305.
Allen-Zhu, Z. 2017. “Katyusha: The first direct acceleration of stochastic gradient methods”. The Journal of Machine Learning Research (JMLR). 18(1): 8194–8244.
Allen-Zhu, Z. and L. Orecchia. 2014. “Linear coupling: An ultimate unification of gradient and mirror descent”. arXiv preprint arXiv:1407.1537.
Anderson, D. G. 1965. “Iterative procedures for nonlinear integral equations”. Journal of the ACM (JACM). 12(4): 547–560.
Anderson, E. J. and P. Nash. 1987. Linear programming in infinite-dimensional spaces. Chichester: Wiley.
Armijo, L. 1966. “Minimization of functions having Lipschitz continuous first partial derivatives”. Pacific Journal of Mathematics. 16(1): 1–3.
Attouch, H. and J. Peypouquet. 2016. “The rate of convergence of Nesterov's accelerated forward-backward method is actually faster than 1/k²”. SIAM Journal on Optimization. 26(3): 1824–1834.
Auslender, A. and M. Teboulle. 2006. “Interior gradient and proximal methods for convex and conic optimization”. SIAM Journal on Optimization. 16(3): 697–725.
Aybat, N. S., A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. 2019. “A Universally Optimal Multistage Accelerated Stochastic Gradient Method”.
Baes, M. 2009. Estimate sequence methods: extensions and approximations. url: http://www.optimization-online.org/DB_FILE/2009/08/2372.pdf.
Bansal, N. and A. Gupta. 2019. “Potential-function proofs for gradient methods”. Theory of Computing. 15(1): 1–32.
Barré, M., A. Taylor, and F. Bach. 2020a. “Principled Analyses and Design of First-Order Methods with Inexact Proximal Operators”. arXiv preprint arXiv:2006.06041.
Barré, M., A. Taylor, and A. d'Aspremont. 2020b. “Convergence of constrained Anderson acceleration”. arXiv preprint arXiv:2010.15482.


Bauschke, H. H., J. Bolte, and M. Teboulle. 2016. “A descent Lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications”. Mathematics of Operations Research. 42(2): 330–348.
Beck, A. and M. Teboulle. 2003. “Mirror descent and nonlinear projected subgradient methods for convex optimization”. Operations Research Letters. 31(3): 167–175.
Beck, A. and M. Teboulle. 2009. “A fast iterative shrinkage-thresholding algorithm for linear inverse problems”. SIAM Journal on Imaging Sciences. 2(1): 183–202.
Becker, S. R., E. J. Candès, and M. C. Grant. 2011. “Templates for convex cone problems with applications to sparse signal recovery”. Mathematical Programming Computation. 3(3): 165–218.
Ben-Tal, A. and A. S. Nemirovsky. 2001. Lectures on modern convex optimization: analysis, algorithms, and engineering applications. MPS-SIAM series on optimization. SIAM.
Bolte, J., A. Daniilidis, and A. Lewis. 2007. “The Lojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems”. SIAM Journal on Optimization. 17(4): 1205–1223.
Bolte, J., T. P. Nguyen, J. Peypouquet, and B. W. Suter. 2017. “From error bounds to the complexity of first-order descent methods for convex functions”. Mathematical Programming. 165(2): 471–507.
Bonnans, J.-F., J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. 2006. Numerical optimization: theoretical and practical aspects. Springer Science & Business Media.
Bottou, L. and O. Bousquet. 2007. “The tradeoffs of large scale learning”. Advances in Neural Information Processing Systems (NIPS). 20: 161–168.
Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein. 2011. “Distributed optimization and statistical learning via the alternating direction method of multipliers”. Foundations and Trends in Machine Learning. 3(1): 1–122.
Boyd, S. and L. Vandenberghe. 2004. Convex optimization. Cambridge University Press.
Brezinski, C. 1970. “Application de l’ε-algorithme à la résolution des systèmes non linéaires”. CR Acad. Sci. Paris. 271(A): 1174–1177.
Brezinski, C. 1971. “Sur un algorithme de résolution des systèmes non linéaires”. CR Acad. Sci. Paris. 272(A): 145–148.
Brezinski, C. 1975. “Généralisations de la transformation de Shanks, de la table de Padé et de l’ε-algorithme”. Calcolo. 12(4): 317–360.
Brezinski, C. 2001. “Convergence acceleration during the 20th century”. Numerical Analysis: Historical Developments in the 20th Century: 113.
Brezinski, C., S. Cipolla, M. Redivo-Zaglia, and Y. Saad. 2020. “Shanks and Anderson-type acceleration techniques for systems of nonlinear equations”. preprint arXiv:2007.05716.
Brezinski, C. and M. Redivo-Zaglia. 2019. “The genesis and early developments of Aitken’s process, Shanks’ transformation, the ε-algorithm, and related fixed point methods”. Numerical Algorithms. 80(1): 11–133.
Brezinski, C., M. Redivo-Zaglia, and Y. Saad. 2018. “Shanks sequence transformations and Anderson acceleration”. SIAM Review. 60(3): 646–669.

Brezinski, C. and M. R. Zaglia. 1991. Extrapolation methods: theory and practice. Elsevier.
Bubeck, S. 2015. “Convex Optimization: Algorithms and Complexity”. Foundations and Trends in Machine Learning. 8(3-4): 231–357.
Bubeck, S., Y. T. Lee, and M. Singh. 2015. “A geometric alternative to Nesterov’s accelerated gradient descent”. preprint arXiv:1506.08187.
Cabay, S. and L. Jackson. 1976. “A polynomial extrapolation method for finding limits and antilimits of vector sequences”. SIAM Journal on Numerical Analysis. 13(5): 734–752.
Calatroni, L. and A. Chambolle. 2019. “Backtracking strategies for accelerated descent methods with smooth composite objectives”. SIAM Journal on Optimization. 29(3): 1772–1798.
Cauchy, A. 1847. “Méthode générale pour la résolution des systèmes d’équations simultanées”. Comptes Rendus de l’Académie des Sciences de Paris. 25(1847): 536–538.
Chambolle, A. and T. Pock. 2016. “An introduction to continuous optimization for imaging”. Acta Numerica. 25: 161–319.
Chen, S., S. Ma, and W. Liu. 2017. “Geometric descent method for convex composite minimization”. In: Advances in Neural Information Processing Systems (NIPS). 636–644.
Chierchia, G., E. Chouzenoux, P. L. Combettes, and J.-C. Pesquet. 2020. The Proximity Operator Repository. User’s guide. url: http://proximity-operator.net/download/guide.pdf.
Clarke, F. H. 1990. Optimization and nonsmooth analysis. Vol. 5. SIAM.
Cohen, A., W. Dahmen, and R. DeVore. 2009. “Compressed sensing and best k-term approximation”. Journal of the AMS. 22(1): 211–231.
Condat, L., D. Kitahara, A. Contreras, and A. Hirabayashi. 2019. “Proximal splitting algorithms: A tour of recent advances, with new twists”. preprint arXiv:1912.00137.
Cyrus, S., B. Hu, B. Van Scoy, and L. Lessard. 2018. “A robust accelerated optimization algorithm for strongly convex functions”. In: Proceedings of the 2018 American Control Conference (ACC). IEEE. 1376–1381.
d’Aspremont, A. 2008. “Smooth Optimization with Approximate Gradient”. SIAM Journal on Optimization. 19(3): 1171–1183.
d’Aspremont, A., C. Guzman, and M. Jaggi. 2018. “Optimal affine-invariant smooth minimization algorithms”. SIAM Journal on Optimization. 28(3): 2384–2405.
Davis, D., D. Drusvyatskiy, and V. Charisopoulos. 2019. “Stochastic algorithms with geometric step decay converge linearly on sharp functions”. arXiv preprint arXiv:1907.09547.
De Klerk, E., F. Glineur, and A. B. Taylor. 2017. “On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions”. Optimization Letters. 11(7): 1185–1199.
De Klerk, E., F. Glineur, and A. B. Taylor. 2020. “Worst-case convergence analysis of inexact gradient and Newton methods through semidefinite programming performance estimation”. SIAM Journal on Optimization. 30(3): 2053–2082.

Defazio, A., F. Bach, and S. Lacoste-Julien. 2014a. “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”. In: Advances in Neural Information Processing Systems (NIPS). 1646–1654.
Defazio, A. J., T. S. Caetano, and J. Domke. 2014b. “Finito: A faster, permutable incremental gradient method for big data problems”. In: Proceedings of the 31st International Conference on Machine Learning (ICML). 1125–1133.
Devolder, O. 2011. “Stochastic first order methods in smooth convex optimization”. Tech. rep. CORE discussion paper.
Devolder, O. 2013. “Exactness, inexactness and stochasticity in first-order methods for large-scale convex optimization”. PhD thesis.
Devolder, O., F. Glineur, and Y. Nesterov. 2013. “Intermediate gradient methods for smooth convex problems with inexact oracle”.
Devolder, O., F. Glineur, and Y. Nesterov. 2014. “First-order methods of smooth convex optimization with inexact oracle”. Mathematical Programming. 146(1-2): 37–75.
Diakonikolas, J. and C. Guzmán. 2021. “Complementary Composite Minimization, Small Gradients in General Norms, and Applications to Regression Problems”. preprint arXiv:2101.11041.
Diakonikolas, J. and L. Orecchia. 2019a. “Conjugate gradients and accelerated methods unified: The approximate duality gap view”. preprint arXiv:1907.00289.
Diakonikolas, J. and L. Orecchia. 2019b. “The approximate duality gap technique: A unified theory of first-order methods”. SIAM Journal on Optimization. 29(1): 660–689.
Diakonikolas, J. and P. Wang. 2021. “Potential Function-based Framework for Making the Gradients Small in Convex and Min-Max Optimization”. preprint arXiv:2101.12101.
Douglas, J. and H. H. Rachford. 1956. “On the numerical solution of heat conduction problems in two and three space variables”. Transactions of the American Mathematical Society. 82(2): 421–439.
Dragomir, R.-A., A. Taylor, A. d’Aspremont, and J. Bolte. 2019. “Optimal complexity and certification of Bregman first-order methods”. preprint arXiv:1911.08510.
Drori, Y. 2014. “Contributions to the Complexity Analysis of Optimization Algorithms”. PhD thesis. Tel-Aviv University.
Drori, Y. 2017. “The exact information-based complexity of smooth convex minimization”. Journal of Complexity. 39: 1–16.
Drori, Y. 2018. “On the properties of convex functions over open sets”. preprint arXiv:1812.02419.
Drori, Y. and A. B. Taylor. 2020. “Efficient first-order methods for convex minimization: a constructive approach”. Mathematical Programming. 184(1): 183–220.
Drori, Y. and A. Taylor. 2021. “On the oracle complexity of smooth strongly convex minimization”. preprint.
Drori, Y. and M. Teboulle. 2014. “Performance of first-order methods for smooth convex minimization: a novel approach”. Mathematical Programming. 145(1-2): 451–482.
Drori, Y. and M. Teboulle. 2016. “An optimal variant of Kelley’s cutting-plane method”. Mathematical Programming. 160(1-2): 321–351.

Drusvyatskiy, D., M. Fazel, and S. Roy. 2018. “An optimal first order method based on optimal quadratic averaging”. SIAM Journal on Optimization. 28(1): 251–271.
Dvurechensky, P. and A. Gasnikov. 2016. “Stochastic intermediate gradient method for convex problems with stochastic inexact oracle”. Journal of Optimization Theory and Applications. 171(1): 121–145.
Eckstein, J. 1989. “Splitting methods for monotone operators with applications to parallel optimization”. PhD thesis. Massachusetts Institute of Technology.
Eckstein, J. and P. J. Silva. 2013. “A practical relative error criterion for augmented Lagrangians”. Mathematical Programming. 141(1-2): 319–348.
Eckstein, J. and W. Yao. 2012. “Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results”. RUTCOR Research Reports. 32(3).
Eddy, R. 1979. “Extrapolating to the limit of a vector sequence”. In: Information linkage between applied mathematics and industry. Elsevier. 387–396.
Fang, H.-r. and Y. Saad. 2009. “Two classes of multisecant methods for nonlinear acceleration”. Numerical Linear Algebra with Applications. 16(3): 197–221.
Fercoq, O. and P. Richtárik. 2015. “Accelerated, parallel, and proximal coordinate descent”. SIAM Journal on Optimization. 25(4): 1997–2023.
Fessler, J. A. 2020. “Optimization methods for magnetic resonance image reconstruction: Key models and optimization algorithms”. IEEE Signal Processing Magazine. 37(1): 33–40. Complete version: http://arxiv.org/abs/1903.03510.
Flanders, D. A. and G. Shortley. 1950. “Numerical determination of fundamental modes”. Journal of Applied Physics. 21(12): 1326–1332.
Florea, M. I. and S. A. Vorobyov. 2018. “An accelerated composite gradient method for large-scale composite objective problems”. IEEE Transactions on Signal Processing. 67(2): 444–459.
Florea, M. I. and S. A. Vorobyov. 2020. “A generalized accelerated composite gradient method: Uniting Nesterov’s fast gradient method and FISTA”. IEEE Transactions on Signal Processing.
Ford, W. F. and A. Sidi. 1988. “Recursive algorithms for vector extrapolation methods”. Applied Numerical Mathematics. 4(6): 477–489.
Gasnikov, A., P. Dvurechensky, E. Gorbunov, E. Vorontsova, D. Selikhanovych, and C. Uribe. 2019. “Optimal Tensor Methods in Smooth Convex and Uniformly Convex Optimization”. In: Proceedings of the 32nd Conference on Learning Theory (COLT). Vol. 99. 1–18.
Gasnikov, A. V. and Y. Nesterov. 2018. “Universal method for stochastic composite optimization problems”. Computational Mathematics and Mathematical Physics. 58(1): 48–64.
Gekeler, E. 1972. “On the solution of systems of equations by the epsilon algorithm of Wynn”. Mathematics of Computation. 26(118): 427–436.

Glowinski, R. and A. Marroco. 1975. “Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires”. ESAIM: Mathematical Modelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique. 9(R2): 41–76.
Goldstein, A. 1962. “Cauchy’s method of minimization”. Numerische Mathematik. 4(1): 146–150.
Golub, G. H. and R. S. Varga. 1961a. “Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods”. Numerische Mathematik. 3(1): 147–156.
Golub, G. H. and R. S. Varga. 1961b. “Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods”. Numerische Mathematik. 3(1): 157–168.
Gorbunov, E., M. Danilova, and A. Gasnikov. 2020. “Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping”. In: Advances in Neural Information Processing Systems (NeurIPS).
Güler, O. 1991. “On the convergence of the proximal point algorithm for convex minimization”. SIAM Journal on Control and Optimization. 29(2): 403–419.
Güler, O. 1992. “New proximal point algorithms for convex minimization”. SIAM Journal on Optimization. 2(4): 649–664.
Gutknecht, M. H. and S. Röllin. 2002. “The Chebyshev iteration revisited”. Parallel Computing. 28(2): 263–283.
Guzmán, C. and A. S. Nemirovsky. 2015. “On lower complexity bounds for large-scale smooth convex optimization”. Journal of Complexity. 31(1): 1–14.
Hinder, O., A. Sidford, and N. Sohoni. 2020. “Near-Optimal Methods for Minimizing Star-Convex Functions and Beyond”. In: Proceedings of the 33rd Conference on Learning Theory (COLT). 1894–1938.
Hiriart-Urruty, J.-B. and C. Lemaréchal. 2013. Convex analysis and minimization algorithms I: Fundamentals. Vol. 305. Springer Science & Business Media.
Hu, B. and L. Lessard. 2017. “Dissipativity theory for Nesterov’s accelerated method”. In: Proceedings of the 34th International Conference on Machine Learning (ICML). 1549–1557.
Hu, C., W. Pan, and J. Kwok. 2009. “Accelerated gradient methods for stochastic optimization and online learning”. Advances in Neural Information Processing Systems (NIPS). 22: 781–789.
Ito, M. and M. Fukuda. 2021. “Nearly optimal first-order methods for convex optimization under gradient norm measure: An adaptive regularization approach”. Journal of Optimization Theory and Applications: 1–35.
Iusem, A. N. 1999. “Augmented Lagrangian methods and proximal point methods for convex optimization”. Investigación Operativa. 8(11-49): 7.
Ivanova, A., D. Grishchenko, A. Gasnikov, and E. Shulgin. 2019. “Adaptive Catalyst for smooth convex optimization”. preprint arXiv:1911.11271.

Jbilou, K. and H. Sadok. 1991. “Some results about vector extrapolation methods and related fixed-point iterations”. Journal of Computational and Applied Mathematics. 36(3): 385–398.
Jbilou, K. and H. Sadok. 1995. “Analysis of some vector extrapolation methods for solving systems of linear equations”. Numerische Mathematik. 70(1): 73–89.
Jbilou, K. and H. Sadok. 2000. “Vector extrapolation methods. Applications and numerical comparison”. Journal of Computational and Applied Mathematics. 122(1-2): 149–165.
Johnson, R. and T. Zhang. 2013. “Accelerating stochastic gradient descent using predictive variance reduction”. In: Advances in Neural Information Processing Systems (NIPS). 315–323.
Juditsky, A., G. Lan, A. S. Nemirovsky, and A. Shapiro. 2009. “Stochastic Approximation Approach to Stochastic Programming”. SIAM Journal on Optimization. 19(4): 1574–1609.
Juditsky, A. and A. S. Nemirovsky. 2011a. “First order methods for nonsmooth convex large-scale optimization, I: general purpose methods”. Optimization for Machine Learning. 30(9): 121–148.
Juditsky, A. and A. S. Nemirovsky. 2011b. “First order methods for nonsmooth convex large-scale optimization, II: utilizing problems structure”. Optimization for Machine Learning. 30(9): 149–183.
Juditsky, A. and Y. Nesterov. 2014. “Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization”. Stochastic Systems. 4(1): 44–80.
Karimi, S. and S. A. Vavasis. 2016. “A unified convergence bound for conjugate gradient and accelerated gradient”. preprint arXiv:1605.00320.
Karimi, S. and S. Vavasis. 2017. “A single potential governing convergence of conjugate gradient, accelerated gradient and geometric descent”. preprint arXiv:1712.09498.
Kerdreux, T., A. d’Aspremont, and S. Pokutta. 2018. “Restarting Frank-Wolfe”. arXiv preprint arXiv:1810.02429.
Kim, D. 2019. “Accelerated proximal point method for maximally monotone operators”. preprint arXiv:1905.05149.
Kim, D. and J. A. Fessler. 2016. “Optimized first-order methods for smooth convex minimization”. Mathematical Programming. 159(1-2): 81–107.
Kim, D. and J. A. Fessler. 2017. “On the convergence analysis of the optimized gradient method”. Journal of Optimization Theory and Applications. 172(1): 187–205.
Kim, D. and J. A. Fessler. 2018a. “Adaptive restart of the optimized gradient method for convex optimization”. Journal of Optimization Theory and Applications. 178(1): 240–263.
Kim, D. and J. A. Fessler. 2018b. “Another look at the fast iterative shrinkage/thresholding algorithm (FISTA)”. SIAM Journal on Optimization. 28(1): 223–250.
Kim, D. and J. A. Fessler. 2018c. “Generalizing the optimized gradient method for smooth convex minimization”. SIAM Journal on Optimization. 28(2): 1920–1950.

Kim, D. and J. A. Fessler. 2020. “Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions”. Journal of Optimization Theory and Applications.
Krichene, W., A. Bayen, and P. L. Bartlett. 2015. “Accelerated mirror descent in continuous and discrete time”. Advances in Neural Information Processing Systems (NIPS). 28: 2845–2853.
Kulunchakov, A. and J. Mairal. 2019a. “A Generic Acceleration Framework for Stochastic Composite Optimization”. In: Advances in Neural Information Processing Systems (NeurIPS).
Kulunchakov, A. and J. Mairal. 2019b. “Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise”. preprint arXiv:1901.08788.
Kurdyka, K. 1998. “On gradients of functions definable in o-minimal structures”. In: Annales de l’institut Fourier. Vol. 48. No. 3. 769–783.
Lacoste-Julien, S. and M. Jaggi. 2015. “On the global linear convergence of Frank-Wolfe optimization variants”. Advances in Neural Information Processing Systems (NIPS). 28: 496–504.
Lacotte, J. and M. Pilanci. 2020. “Optimal randomized first-order methods for least-squares problems”. In: International Conference on Machine Learning. PMLR. 5587–5597.
Lan, G. 2008. “Efficient Methods for stochastic composite optimization”. Tech. rep. School of Industrial and Systems Engineering, Georgia Institute of Technology. url: http://www.optimization-online.org/DB_HTML/2008/08/2061.html.
Lan, G. 2012. “An optimal method for stochastic composite optimization”. Mathematical Programming. 133(1-2): 365–397.
Lan, G., Z. Lu, and R. D. Monteiro. 2011. “Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming”. Mathematical Programming. 126(1): 1–29.
Lee, Y. T. and A. Sidford. 2013. “Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems”. In: 54th Symposium on Foundations of Computer Science. IEEE. 147–156.
Lemaréchal, C. and C. Sagastizábal. 1997. “Practical Aspects of the Moreau–Yosida Regularization: Theoretical Preliminaries”. SIAM Journal on Optimization. 7(2): 367–385.
Lessard, L., B. Recht, and A. Packard. 2016. “Analysis and design of optimization algorithms via integral quadratic constraints”. SIAM Journal on Optimization. 26(1): 57–95.
Lessard, L. and P. Seiler. 2020. “Direct synthesis of iterative algorithms with bounds on achievable worst-case convergence rate”. In: Proceedings of the 2020 American Control Conference (ACC). IEEE. 119–125.

Li, G. and T. K. Pong. 2018. “Calculus of the exponent of Kurdyka–Łojasiewicz inequality and its applications to linear convergence of first-order methods”. Foundations of Computational Mathematics. 18(5): 1199–1232.
Lieder, F. 2020. “On the convergence rate of the Halpern-iteration”. Optimization Letters: 1–14.
Lin, H., J. Mairal, and Z. Harchaoui. 2015. “A universal catalyst for first-order optimization”. In: Advances in Neural Information Processing Systems (NIPS). 3384–3392.
Lin, H., J. Mairal, and Z. Harchaoui. 2018. “Catalyst acceleration for first-order convex optimization: from theory to practice”. The Journal of Machine Learning Research (JMLR). 18(1): 7854–7907.
Lions, P.-L. and B. Mercier. 1979. “Splitting algorithms for the sum of two nonlinear operators”. SIAM Journal on Numerical Analysis. 16(6): 964–979.
Lojasiewicz, S. 1963. “Une propriété topologique des sous-ensembles analytiques réels”. Les équations aux dérivées partielles: 87–89.
Lu, H., R. M. Freund, and Y. Nesterov. 2018. “Relatively smooth convex optimization by first-order methods, and applications”. SIAM Journal on Optimization. 28(1): 333–354.
Luo, Z.-Q. and P. Tseng. 1992. “On the linear convergence of descent methods for convex essentially smooth minimization”. SIAM Journal on Control and Optimization. 30(2): 408–425.
Mai, V. and M. Johansson. 2020. “Anderson acceleration of proximal gradient methods”. In: Proceedings of the 37th International Conference on Machine Learning (ICML). PMLR. 6620–6629.
Mairal, J. 2015. “Incremental majorization-minimization optimization with application to large-scale machine learning”. SIAM Journal on Optimization. 25(2): 829–855.
Mairal, J. 2019. “Cyanure: An Open-Source Toolbox for Empirical Risk Minimization for Python, C++, and soon more”. preprint arXiv:1912.08165.
Malitsky, Y. and K. Mishchenko. 2020. “Adaptive Gradient Descent without Descent”. In: Proceedings of the 37th International Conference on Machine Learning (ICML). PMLR. 6702–6712.
Martinet, B. 1970. “Régularisation d’inéquations variationnelles par approximations successives”. Revue Française d’Informatique et de Recherche Opérationnelle. 4: 154–158.
Martinet, B. 1972. “Détermination approchée d’un point fixe d’une application pseudo-contractante. Cas de l’application prox.” Comptes rendus hebdomadaires des séances de l’Académie des sciences de Paris. 274: 163–165.
Mason, J. C. and D. C. Handscomb. 2002. Chebyshev polynomials. CRC Press.
Mešina, M. 1977. “Convergence acceleration for the iterative solution of the equations X = AX + f”. Computer Methods in Applied Mechanics and Engineering. 10(2): 165–173.
Mifflin, R. 1977. “Semismooth and semiconvex functions in constrained optimization”. SIAM Journal on Control and Optimization. 15(6): 959–972.

Monteiro, R. D. and B. F. Svaiter. 2013. “An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods”. SIAM Journal on Optimization. 23(2): 1092–1125.
Moreau, J.-J. 1962. “Fonctions convexes duales et points proximaux dans un espace hilbertien”. Comptes rendus hebdomadaires des séances de l’Académie des sciences de Paris. 255: 2897–2899.
Moreau, J.-J. 1965. “Proximité et dualité dans un espace hilbertien”. Bulletin de la Société mathématique de France. 93: 273–299.
Narkiss, G. and M. Zibulevsky. 2005. Sequential subspace optimization method for large-scale unconstrained problems. Technion-IIT, Department of Electrical Engineering.
Necoara, I., Y. Nesterov, and F. Glineur. 2019. “Linear convergence of first order methods for non-strongly convex optimization”. Mathematical Programming. 175(1-2): 69–107.
Nemirovsky, A. and D. Yudin. 1983a. Problem complexity and method efficiency in optimization.
Nemirovsky, A. S. 1982. “Orth-method for smooth convex optimization”. Izvestia AN SSSR, Transl.: Eng. Cybern. Soviet J. Comput. Syst. Sci. 2: 937–947.
Nemirovsky, A. S. 1991. “On optimality of Krylov’s information when solving linear operator equations”. Journal of Complexity. 7(2): 121–130.
Nemirovsky, A. S. 1992. “Information-based complexity of linear operator equations”. Journal of Complexity. 8(2): 153–175.
Nemirovsky, A. S. and Y. Nesterov. 1985. “Optimal methods of smooth convex minimization”. USSR Computational Mathematics and Mathematical Physics. 25(2): 21–30.
Nemirovsky, A. S. and B. T. Polyak. 1984. “Iterative methods for solving linear ill-posed problems under precise information.” ENG. CYBER. (4): 50–56.
Nemirovsky, A. S. and D. B. Yudin. 1983b. “Information-based complexity of mathematical programming (in Russian)”. Izvestia AN SSSR, Ser. Tekhnicheskaya Kibernetika (the journal is translated to English as Engineering Cybernetics. Soviet J. Computer & Systems Sci.) 1.
Nemirovsky, A. S. and D. B. Yudin. 1983c. “Problem Complexity and Method Efficiency in Optimization.” Wiley-Interscience, New York.
Nesterov, Y. 1983. “A method of solving a convex programming problem with convergence rate O(1/k²)”. Soviet Mathematics Doklady. 27(2): 372–376.
Nesterov, Y. 2003. Introductory Lectures on Convex Optimization. Springer.
Nesterov, Y. 2005. “Smooth minimization of non-smooth functions”. Mathematical Programming. 103(1): 127–152.
Nesterov, Y. 2008. “Accelerating the cubic regularization of Newton’s method on convex problems”. Mathematical Programming. 112(1): 159–181.
Nesterov, Y. 2009. “Primal-dual subgradient methods for convex problems”. Mathematical Programming, Series B. 120(1): 221–259.
Nesterov, Y. 2012a. “Efficiency of coordinate descent methods on huge-scale optimization problems”. SIAM Journal on Optimization. 22(2): 341–362.

Nesterov, Y. 2012b. “How to make the gradients small”. Optima. Mathematical Optimization Society Newsletter. (88): 10–11.
Nesterov, Y. 2013. “Gradient methods for minimizing composite functions”. Mathematical Programming. 140(1): 125–161.
Nesterov, Y. 2015. “Universal gradient methods for convex optimization problems”. Mathematical Programming. 152(1-2): 381–404.
Nesterov, Y. 2019. “Implementable tensor methods in unconstrained convex optimization”. Mathematical Programming: 1–27.
Nesterov, Y. 2020a. “Inexact accelerated high-order proximal-point methods”. Tech. rep. CORE discussion paper.
Nesterov, Y. 2020b. “Inexact high-order proximal-point methods with auxiliary search procedure”. Tech. rep. CORE discussion paper.
Nesterov, Y. and B. T. Polyak. 2006. “Cubic regularization of Newton method and its global performance”. Mathematical Programming. 108(1): 177–205.
Nesterov, Y. and S. U. Stich. 2017. “Efficiency of the accelerated coordinate descent method on structured optimization problems”. SIAM Journal on Optimization. 27(1): 110–123.
Nocedal, J. and S. Wright. 2006. Numerical optimization. Springer Science & Business Media.
O’Donoghue, B. and E. Candes. 2015. “Adaptive restart for accelerated gradient schemes”. Foundations of Computational Mathematics. 15(3): 715–732.
Pang, J.-S. 1987. “A posteriori error bounds for the linearly-constrained variational inequality problem”. Mathematics of Operations Research. 12(3): 474–484.
Paquette, C., H. Lin, D. Drusvyatskiy, J. Mairal, and Z. Harchaoui. 2018. “Catalyst for Gradient-based Nonconvex Optimization”. In: Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS). Vol. 84.
Parikh, N. and S. Boyd. 2014. “Proximal algorithms”. Foundations and Trends in Optimization. 1(3): 127–239.
Park, C., J. Park, and E. K. Ryu. 2021. “Factor-√2 Acceleration of Accelerated Gradient Methods”. preprint arXiv:2102.07366.
Passty, G. B. 1979. “Ergodic convergence to a zero of the sum of monotone operators in Hilbert space”. Journal of Mathematical Analysis and Applications. 72(2): 383–390.
Pedregosa, F. and D. Scieur. 2020. “Acceleration through spectral density estimation”. In: International Conference on Machine Learning. PMLR. 7553–7562.
Peyré, G. 2011. “The numerical tours of signal processing-advanced computational signal and image processing”. IEEE Computing in Science and Engineering. 13(4): 94–97.
Qi, L. and J. Sun. 1993. “A nonsmooth version of Newton’s method”. Mathematical Programming. 58(1-3): 353–367.
Robbins, H. and S. Monro. 1951. “A stochastic approximation method”. The Annals of Mathematical Statistics: 400–407.
Rockafellar, R. T. 1973. “A dual approach to solving nonlinear programming problems by unconstrained optimization”. Mathematical Programming. 5(1): 354–373.

Rockafellar, R. T. 1970. Convex Analysis. Princeton: Princeton University Press.
Rockafellar, R. T. 1976. “Augmented Lagrangians and applications of the proximal point algorithm in convex programming”. Mathematics of Operations Research. 1(2): 97–116.
Rockafellar, R. T. and R. J.-B. Wets. 1998. Variational Analysis. Berlin: Springer-Verlag.
Rockafellar, R. T. and R. J.-B. Wets. 2009. Variational analysis. Vol. 317. Springer Science & Business Media.
Roulet, V. and A. d’Aspremont. 2017. “Sharpness, restart and acceleration”. In: Advances in Neural Information Processing Systems (NIPS). 1119–1129.
Ryu, E. K. and S. Boyd. 2016. “Primer on monotone operator methods”. Appl. Comput. Math. 15(1): 3–43.
Salzo, S. and S. Villa. 2012. “Inexact and accelerated proximal point algorithms”. Journal of Convex Analysis. 19(4): 1167–1192.
Scheinberg, K., D. Goldfarb, and X. Bai. 2014. “Fast first-order methods for composite convex optimization with backtracking”. Foundations of Computational Mathematics. 14(3): 389–417.
Schmidt, M., N. Le Roux, and F. Bach. 2011. “Convergence rates of inexact proximal-gradient methods for convex optimization”. In: Advances in Neural Information Processing Systems (NIPS). 1458–1466.
Schmidt, M., N. Le Roux, and F. Bach. 2017. “Minimizing finite sums with the stochastic average gradient”. Mathematical Programming. 162(1-2): 83–112.
Scieur, D., F. Bach, and A. d’Aspremont. 2017a. “Nonlinear acceleration of stochastic algorithms”. In: Advances in Neural Information Processing Systems (NIPS). 3982–3991.
Scieur, D., A. d’Aspremont, and F. Bach. 2016. “Regularized nonlinear acceleration”. In: Advances in Neural Information Processing Systems (NIPS). 712–720.
Scieur, D., E. Oyallon, A. d’Aspremont, and F. Bach. 2018. “Online Regularized Nonlinear Acceleration”. preprint arXiv:1805.09639.
Scieur, D. and F. Pedregosa. 2020. “Universal Asymptotic Optimality of Polyak Momentum”. In: International Conference on Machine Learning. PMLR. 8565–8572.
Scieur, D., V. Roulet, F. Bach, and A. d’Aspremont. 2017b. “Integration methods and optimization algorithms”. In: Advances in Neural Information Processing Systems (NIPS). 1109–1118.
Shalev-Shwartz, S. and T. Zhang. 2013. “Stochastic dual coordinate ascent methods for regularized loss minimization”. The Journal of Machine Learning Research (JMLR). 14(Feb): 567–599.
Shalev-Shwartz, S. and T. Zhang. 2014. “Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization”. In: Proceedings of the 31st International Conference on Machine Learning (ICML). 64–72.
Shi, B., S. S. Du, M. I. Jordan, and W. J. Su. 2018. “Understanding the acceleration phenomenon via high-resolution differential equations”. preprint arXiv:1810.08907.
Sidi, A. 1986. “Convergence and stability properties of minimal polynomial and reduced rank extrapolation algorithms”. SIAM Journal on Numerical Analysis. 23(1): 197–209.

Sidi, A. 1988. “Extrapolation vs. projection methods for linear systems of equations”. Journal of Computational and Applied Mathematics. 22(1): 71–88.
Sidi, A. 1991. “Efficient implementation of minimal polynomial and reduced rank extrapolation methods”. Journal of Computational and Applied Mathematics. 36(3): 305–337.
Sidi, A. 2008. “Vector extrapolation methods with applications to solution of large systems of equations and to PageRank computations”. Computers & Mathematics with Applications. 56(1): 1–24.
Sidi, A. 2017a. “Minimal polynomial and reduced rank extrapolation methods are related”. Advances in Computational Mathematics. 43(1): 151–170.
Sidi, A. 2017b. Vector extrapolation methods with applications. SIAM.
Sidi, A. 2019. “A convergence study for reduced rank extrapolation on nonlinear systems”. Numerical Algorithms: 1–26.
Sidi, A. and J. Bridger. 1988. “Convergence and stability analyses for some vector extrapolation methods in the presence of defective iteration matrices”. Journal of Computational and Applied Mathematics. 22(1): 35–61.
Sidi, A., W. F. Ford, and D. A. Smith. 1986. “Acceleration of convergence of vector sequences”. SIAM Journal on Numerical Analysis. 23(1): 178–196.
Sidi, A. and Y. Shapira. 1998. “Upper bounds for convergence rates of acceleration methods with initial iterations”. Numerical Algorithms. 18(2): 113–132.
Smith, D. A., W. F. Ford, and A. Sidi. 1987. “Extrapolation methods for vector sequences”. SIAM Review. 29(2): 199–233.
Solodov, M. V. and B. F. Svaiter. 1999a. “A hybrid approximate extragradient–proximal point algorithm using the enlargement of a maximal monotone operator”. Set-Valued Analysis. 7(4): 323–345.
Solodov, M. V. and B. F. Svaiter. 1999b. “A hybrid projection-proximal point algorithm”. Journal of Convex Analysis. 6(1): 59–70.
Solodov, M. V. and B. F. Svaiter. 2000. “Error bounds for proximal point subproblems and associated inexact proximal point algorithms”. Mathematical Programming. 88(2): 371–389.
Solodov, M. V. and B. F. Svaiter. 2001. “A unified framework for some inexact proximal point algorithms”. Numerical Functional Analysis and Optimization. 22(7-8): 1013–1035.
Su, W., S. Boyd, and E. J. Candes. 2016. “A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights”. The Journal of Machine Learning Research (JMLR). 17(1): 5312–5354.
Süli, E. and D. F. Mayers. 2003. An introduction to numerical analysis. Cambridge University Press.
Taylor, A. and F. Bach. 2019. “Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions”. In: Proceedings of the 32nd Conference on Learning Theory (COLT). Vol. 99. 2934–2992.

Taylor, A. and Y. Drori. 2021. “An optimal gradient method for smooth (possibly strongly) convex minimization”. preprint.
Taylor, A. B., J. M. Hendrickx, and F. Glineur. 2017a. “Exact worst-case performance of first-order methods for composite convex optimization”. SIAM Journal on Optimization. 27(3): 1283–1313.
Taylor, A. B., J. M. Hendrickx, and F. Glineur. 2017b. “Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods”. In: Proceedings of the 56th Conference on Decision and Control (CDC). IEEE. 1278–1283.
Taylor, A. B., J. M. Hendrickx, and F. Glineur. 2017c. “Smooth strongly convex interpolation and exact worst-case performance of first-order methods”. Mathematical Programming. 161(1-2): 307–345.
Teboulle, M. 2018. “A simplified view of first order methods for optimization”. Mathematical Programming. 170(1): 67–96.
Toth, A. and C. Kelley. 2015. “Convergence analysis for Anderson acceleration”. SIAM Journal on Numerical Analysis. 53(2): 805–819.
Tseng, P. 2008. “On accelerated proximal gradient methods for convex-concave optimization”. url: http://www.mit.edu/~dimitrib/PTseng/papers.html.
Tseng, P. 2010. “Approximation accuracy, gradient methods, and error bound for structured convex optimization”. Mathematical Programming. 125(2): 263–295.
Tyrtyshnikov, E. E. 1994. “How bad are Hankel matrices?” Numerische Mathematik. 67(2): 261–269.
Van Scoy, B., R. A. Freeman, and K. M. Lynch. 2017. “The fastest known globally convergent first-order method for minimizing strongly convex functions”. IEEE Control Systems Letters. 2(1): 49–54.
Villa, S., S. Salzo, L. Baldassarre, and A. Verri. 2013. “Accelerated and inexact forward-backward algorithms”. SIAM Journal on Optimization. 23(3): 1607–1633.
Walker, H. F. and P. Ni. 2011. “Anderson acceleration for fixed-point iterations”. SIAM Journal on Numerical Analysis. 49(4): 1715–1735.
Wilson, A. C., B. Recht, and M. I. Jordan. 2016. “A Lyapunov analysis of momentum methods in optimization”. preprint arXiv:1611.02635.
Wynn, P. 1956. “On a device for computing the e_m(S_n) transformation”. Mathematical Tables and Other Aids to Computation. 10(54): 91–96.
Xiao, L. 2010. “Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization”. The Journal of Machine Learning Research (JMLR). 11: 2543–2596.
Zhou, K., Q. Ding, F. Shang, J. Cheng, D. Li, and Z.-Q. Luo. 2019. “Direct acceleration of SAGA using sampled negative momentum”. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS). 1602–1610.
Zhou, K., F. Shang, and J. Cheng. 2018. “A Simple Stochastic Variance Reduced Algorithm with Fast Convergence Rates”. In: Proceedings of the 35th International Conference on Machine Learning (ICML). 5980–5989.

Zhou, K., A. M.-C. So, and J. Cheng. 2020. “Boosting First-order Methods by Shifting Objective: New Schemes with Faster Worst Case Rates”. In: Advances in Neural Information Processing Systems (NeurIPS).
Zhou, Z. and A. M.-C. So. 2017. “A unified approach to error bounds for structured convex optimization problems”. Mathematical Programming: 1–40.
Zhou, Z., Q. Zhang, and A. M.-C. So. 2015. “ℓ1,p-Norm Regularization: Error Bounds and Convergence Rate Analysis of First-Order Methods”. In: Proceedings of the 32nd International Conference on Machine Learning (ICML). 1501–1510.