Part 2. Gradient and Subgradient Methods for Unconstrained Convex Optimization

Math 126 Winter 18

Date of current version: January 29, 2018

Abstract This note studies (sub)gradient methods for unconstrained convex optimization. Many parts of this note are based on [1, Chapter 4], [2, Chapters 3, 5, 8, 10], [5, Chapter 9], [14, Chapters 2, 3] and the corresponding lecture notes made available online by their authors. Please email me if you find any typos or errors.

1 Unconstrained Convex Optimization (see [5, Chapter 9])

We consider the following unconstrained problem:
$$\min_{x \in \mathbb{R}^d} f(x), \qquad (1.1)$$
where we assume that

– $f : \mathbb{R}^d \to \mathbb{R}$ is convex and continuously differentiable,
– the problem is solvable, i.e., there exists an optimal point $x_*$.

Note that it is equivalent to solve the optimality condition:
$$\text{find } x \in \mathbb{R}^d \text{ s.t. } \nabla f(x) = 0. \qquad (1.2)$$
Iterative algorithms generate a sequence of points $x_0, x_1, \ldots \in \operatorname{dom} f$ with $f(x_k) \to p_*$, where $p_*$ is the optimal value. The iterative algorithm is terminated, for example, when $\|\nabla f(x_k)\| \le \epsilon$, where $\epsilon > 0$ is some specified tolerance.

2 Descent Methods (see [1, Chapter 4] [5, Chapter 9])

2.1 General Descent Methods

Definition 2.1 Let $f : \mathbb{R}^d \to \mathbb{R}$ be a continuously differentiable function over $\mathbb{R}^d$. A vector $0 \neq d \in \mathbb{R}^d$ is called a descent direction of $f$ at $x$ if it satisfies
$$\nabla f(x)^\top d < 0, \qquad (2.1)$$
i.e., it makes an acute angle with the negative gradient $-\nabla f(x)$.

Lemma 2.1 Let $f$ be a continuously differentiable function over $\mathbb{R}^d$, and let $x \in \mathbb{R}^d$. Suppose that $d$ is a descent direction of $f$ at $x$. Then there exists $\epsilon > 0$ such that
$$f(x + sd) < f(x)$$
for any $s \in (0, \epsilon]$.

Proof Since the directional derivative $f'(x; d) = \nabla f(x)^\top d$ is negative, there exists $\epsilon > 0$ such that
$$\frac{f(x + sd) - f(x)}{s} < 0$$
for any $s \in (0, \epsilon]$.

The outline of general descent methods is as follows.

Algorithm 1 General descent method
1: Input: x0 ∈ dom f.
2: for k ≥ 0 do
3:   Determine a descent direction dk.
4:   Choose a step size sk > 0 satisfying f(xk + sk dk) < f(xk).
5:   xk+1 = xk + sk dk.
6: end for

One can choose a step size $s_k$ at each iteration by one of the following approaches:

– Exact line search: $s_k = \operatorname{argmin}_{s \ge 0} f(x_k + s d_k)$.
– Backtracking line search: starting from an initial $s > 0$, repeat $s \leftarrow \beta s$ until the following sufficient decrease condition is satisfied:
$$f(x + sd) < f(x) + \alpha s \nabla f(x)^\top d, \qquad (2.2)$$
where $\alpha \in (0, 1)$ and $\beta \in (0, 1)$ are given parameters.
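The backtracking rule above can be sketched in a few lines of Python (a toy sketch; the function names, the test quadratic, and the parameter values are illustrative, not from the notes):

```python
# Backtracking line search with the sufficient decrease (Armijo) condition:
# shrink s by beta until f(x + s d) < f(x) + alpha * s * grad_f(x)^T d.

def backtracking(f, grad_fx, x, d, s0=1.0, alpha=0.3, beta=0.5):
    s = s0
    fx = f(x)
    slope = sum(g * di for g, di in zip(grad_fx, d))  # grad f(x)^T d < 0
    while f([xi + s * di for xi, di in zip(x, d)]) >= fx + alpha * s * slope:
        s *= beta
    return s

# Example: f(x) = x1^2 + 10 x2^2 with descent direction d = -grad f(x).
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad = lambda x: [2 * x[0], 20 * x[1]]
x = [1.0, 1.0]
g = grad(x)
s = backtracking(f, g, x, [-gi for gi in g])
```

Since the initial step s = 1 greatly overshoots on this ill-conditioned quadratic, several halvings are needed before the condition holds.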

Example 2.1 Exact line search for quadratic functions. Let $f(x) = \frac{1}{2} x^\top Q x + p^\top x$, where $Q \in \mathbb{S}^d_{++}$ and $p \in \mathbb{R}^d$. Let $d$ be a descent direction of $f$ at $x \in \mathbb{R}^d$. We will derive an explicit formula for the step size generated by exact line search:
$$\hat{s}(x, d) = \operatorname{argmin}_{s \ge 0} f(x + sd), \qquad (2.3)$$
where we have
$$f(x + sd) = \frac{1}{2}(x + sd)^\top Q (x + sd) + p^\top (x + sd) = \frac{1}{2}(d^\top Q d)s^2 + (d^\top Q x + d^\top p)s + \frac{1}{2} x^\top Q x + p^\top x.$$

The optimality condition of (2.3) is
$$\frac{d}{ds} f(x + sd) = (d^\top Q d)s + (d^\top Q x + d^\top p) = 0,$$
and using $\nabla f(x) = Qx + p$ we have
$$\hat{s}(x, d) = -\frac{d^\top \nabla f(x)}{d^\top Q d}.$$

Note that the exact line search is not easy to compute in general.
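Still, for a quadratic $f$ the closed-form step above is cheap to evaluate; a minimal Python sketch (the matrix, point, and names are made up for illustration):

```python
# Exact line search step for f(x) = (1/2) x^T Q x + p^T x:
#   s_hat = - d^T grad_f(x) / (d^T Q d).

def matvec(Q, v):
    return [sum(qij * vj for qij, vj in zip(row, v)) for row in Q]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def exact_step(Q, p, x, d):
    g = [gi + pi for gi, pi in zip(matvec(Q, x), p)]  # grad f(x) = Qx + p
    return -dot(d, g) / dot(d, matvec(Q, d))

# Example: Q = diag(1, 4), p = 0, x = (1, 1), steepest descent d = -grad.
Q = [[1.0, 0.0], [0.0, 4.0]]
p = [0.0, 0.0]
x = [1.0, 1.0]
d = [-1.0, -4.0]               # -grad f(x) = -(Qx + p)
s = exact_step(Q, p, x, d)     # -d^T g / d^T Q d = 17 / 65
```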

2.2 Gradient Descent Methods

Gradient descent methods take the negative gradient as a descent direction:

$$d_k = -\nabla f(x_k), \qquad (2.4)$$
since $\nabla f(x_k)^\top d_k = -\|\nabla f(x_k)\|_2^2 < 0$ as long as $\nabla f(x_k) \neq 0$. The outline of gradient descent methods is as follows.

Algorithm 2 Gradient descent method
1: Input: x0 ∈ dom f.
2: for k ≥ 0 do
3:   dk = −∇f(xk).
4:   Choose a step size sk > 0 satisfying f(xk + sk dk) < f(xk).
5:   xk+1 = xk + sk dk.
6: end for

2.3 Steepest Descent Methods

Definition 2.2 Let $\|\cdot\|$ be any norm on $\mathbb{R}^d$. We define a normalized steepest descent direction (with respect to the norm $\|\cdot\|$) as
$$d_{\mathrm{nsd}} = \operatorname{argmin}_v \left\{ \nabla f(x)^\top v : \|v\| = 1 \right\}. \qquad (2.5)$$
This is a unit-norm step with the most negative directional derivative $\nabla f(x)^\top v$. (Recall that $v$ is a descent direction if $\nabla f(x)^\top v < 0$.) In other words, a normalized steepest descent direction is the direction in the unit ball of $\|\cdot\|$ that extends farthest in the direction $-\nabla f(x)$.

Definition 2.3 An (unnormalized) steepest descent step is defined as
$$d_{\mathrm{sd}} = \|\nabla f(x)\|_* \, d_{\mathrm{nsd}}, \qquad (2.6)$$
where $\|z\|_* = \max\{\langle z, y \rangle : \|y\| \le 1\}$ denotes the dual norm (e.g., $\|\cdot\|_2$ is the dual norm of $\|\cdot\|_2$, and $\|\cdot\|_1$ is the dual norm of $\|\cdot\|_\infty$). This satisfies $\nabla f(x)^\top d_{\mathrm{sd}} = -\|\nabla f(x)\|_*^2$.

Example 2.2 Examples of steepest descent methods.

– Euclidean norm (ℓ2-norm): $d_{\mathrm{sd}} = -\nabla f(x)$.
  – The resulting algorithm is a gradient descent method.
– Quadratic $P$-norm ($\|z\|_P = \|P^{1/2} z\|_2$ where $P \in \mathbb{S}^d_{++}$): $d_{\mathrm{sd}} = -P^{-1} \nabla f(x)$.
  – The resulting algorithm is a preconditioned gradient descent method with a preconditioner $P$ (or $P^{-1}$). This is equivalent to a gradient descent method with the change of coordinates $\bar{x} = P^{1/2} x$.
  – A good choice of $P$ (e.g., $P \approx \nabla^2 f(x_*)$) makes the condition number of the problem after the change of coordinates $\bar{x} = P^{1/2} x$ small, which likely makes the problem easier to solve.
– ℓ1-norm: $d_{\mathrm{sd}} = -\frac{\partial f(x)}{\partial x_i} e_i$ for the index $i$ satisfying $\|\nabla f(x)\|_\infty = |(\nabla f(x))_i|$, where $e_i$ is the $i$th standard basis vector.
  – The resulting algorithm is an instance of coordinate-descent methods that update only one component of $x$ at each iteration.
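As a toy illustration of the quadratic-norm case, here is one preconditioned gradient step in Python (the diagonal quadratic and the choice of $P$ as the diagonal of the Hessian are illustrative assumptions, not from the notes):

```python
# Preconditioned gradient step d = -P^{-1} grad f(x) on an ill-conditioned
# quadratic f(x) = 0.5*(x1^2 + 100 x2^2).  With P equal to the (diagonal)
# Hessian, a unit step lands exactly at the minimizer x* = 0.

def grad(x):
    return [x[0], 100.0 * x[1]]

P_inv = [1.0, 1.0 / 100.0]   # inverse of the diagonal preconditioner P

x = [1.0, 1.0]
g = grad(x)
d = [-pi * gi for pi, gi in zip(P_inv, g)]   # = [-1, -1]
x_new = [xi + di for xi, di in zip(x, d)]    # step size s = 1
```

A plain gradient step with the same unit step size would diverge here (the second coordinate would jump to $-99$), which is exactly the conditioning problem the preconditioner removes.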

Remark 2.1 At the end of this term (in Part 6), we will study Newton's descent direction:
$$d_{\mathrm{nt}} = -[\nabla^2 f(x)]^{-1} \nabla f(x), \qquad (2.7)$$
which is a steepest descent direction at $x$ in the local Hessian norm $\|\cdot\|_{\nabla^2 f(x)}$ when $\nabla^2 f(x) \in \mathbb{S}^d_{++}$.

3 Convergence Analysis of the Gradient Method (See [1, Chapter 4] [2, Chapter 5] [14, Chapter 2])

3.1 Lipschitz Continuity of the Gradient

Definition 3.1 A function $f$ is said to be $L$-smooth if it is continuously differentiable and its gradient $\nabla f$ is Lipschitz continuous over $\mathbb{R}^d$, meaning that there exists $L > 0$ for which
$$\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2 \quad \text{for any } x, y \in \mathbb{R}^d. \qquad (3.1)$$

We denote the class of functions with Lipschitz gradient with constant $L$ by $\mathcal{C}_L^{1,1}(\mathbb{R}^d)$ or $\mathcal{C}_L^{1,1}$.¹ One can generalize the choice of the norm as
$$\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|, \qquad (3.2)$$
where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$, but we will not consider this generalization in this class.

Lemma 3.1 If $f$ is twice continuously differentiable, then $f$ is $L$-smooth if and only if
$$-LI \preceq \nabla^2 f(x) \preceq LI \quad \text{(or equivalently, } \|\nabla^2 f(x)\|_2 \le L\text{)}. \qquad (3.3)$$

Example 3.1 Smooth functions.

– Quadratic functions: Let $Q \in \mathbb{S}^d$ and $p \in \mathbb{R}^d$. Then the function $f(x) = \frac{1}{2} x^\top Q x + p^\top x$ is a $\mathcal{C}_L^{1,1}$ function with $L = \|Q\|_2$, since
$$\|\nabla f(x) - \nabla f(y)\|_2 = \|Qx + p - (Qy + p)\|_2 = \|Q(x - y)\|_2 \le \|Q\|_2 \cdot \|x - y\|_2.$$
– The function $f(x) = \sqrt{1 + \|x\|_2^2}$: We have $\nabla f(x) = \frac{x}{\sqrt{\|x\|_2^2 + 1}}$ and
$$\nabla^2 f(x) = \frac{1}{\sqrt{\|x\|_2^2 + 1}} I - \frac{1}{(\|x\|_2^2 + 1)^{3/2}} x x^\top \preceq \frac{1}{\sqrt{\|x\|_2^2 + 1}} I \preceq I.$$
Therefore, $f$ is a $\mathcal{C}_L^{1,1}$ function with $L = 1$.

An important result for $\mathcal{C}_L^{1,1}$ functions is that they can be bounded above by a certain quadratic function, which is fundamental in convergence proofs of gradient-based methods.

Lemma 3.2 (Descent lemma) Let $f \in \mathcal{C}_L^{1,1}(\mathcal{D})$ for some $L > 0$ and a given $\mathcal{D} \subseteq \mathbb{R}^d$. Then for any $x, y \in \mathcal{D}$,
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|x - y\|_2^2. \qquad (3.4)$$

Proof By the fundamental theorem of calculus, we have

$$f(y) - f(x) = \int_0^1 \langle \nabla f(x + t(y - x)), y - x \rangle \, dt.$$
Therefore,
$$f(y) - f(x) = \langle \nabla f(x), y - x \rangle + \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle \, dt.$$

¹ We denote by $\mathcal{C}_L^{i,j}(\mathcal{D})$ for $\mathcal{D} \subseteq \mathbb{R}^d$ the class of functions that are $i$ times continuously differentiable on $\mathcal{D}$ and whose $j$th derivative is Lipschitz continuous on $\mathcal{D}$.

Thus,

$$|f(y) - f(x) - \langle \nabla f(x), y - x \rangle| = \left| \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle \, dt \right|$$
$$\le \int_0^1 |\langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle| \, dt$$
$$\le \int_0^1 \|\nabla f(x + t(y - x)) - \nabla f(x)\|_2 \cdot \|y - x\|_2 \, dt$$
$$\le \int_0^1 tL \|y - x\|_2^2 \, dt = \frac{L}{2} \|y - x\|_2^2,$$
where the second inequality uses the generalized Cauchy–Schwarz inequality and the third inequality uses the $L$-smoothness.

Note that the proof of this descent lemma actually shows both upper and lower bounds on the function:

$$f(x) + \langle \nabla f(x), y - x \rangle - \frac{L}{2} \|x - y\|_2^2 \le f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|x - y\|_2^2. \qquad (3.5)$$
When $f$ is convex, we have several equivalent characterizations of the $L$-smoothness property of $f$, including Lemma 3.2.

Lemma 3.3 Let $f$ be convex and continuously differentiable, and let $L > 0$. Then the following claims are equivalent.
– $f$ is $L$-smooth.
– $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|x - y\|_2^2$ for all $x, y \in \mathbb{R}^d$.
– $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2L} \|\nabla f(x) - \nabla f(y)\|_2^2$ for all $x, y \in \mathbb{R}^d$.
– $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|_2^2$ for all $x, y \in \mathbb{R}^d$.
– $f(\lambda x + (1 - \lambda)y) \ge \lambda f(x) + (1 - \lambda) f(y) - \frac{L}{2} \lambda (1 - \lambda) \|x - y\|_2^2$ for any $x, y \in \mathbb{R}^d$ and $\lambda \in [0, 1]$.
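The upper bound of Lemma 3.2 can be checked numerically; the sketch below samples the $1$-smooth function $f(x) = \sqrt{1 + x^2}$ from Example 3.1 (with $d = 1$) at a few made-up points:

```python
# Numerical sanity check of the descent lemma (3.4):
#   f(y) <= f(x) + f'(x)(y - x) + (L/2)(y - x)^2,  with L = 1 here.
import math

f = lambda x: math.sqrt(1.0 + x * x)
df = lambda x: x / math.sqrt(1.0 + x * x)
L = 1.0

ok = all(
    f(y) <= f(x) + df(x) * (y - x) + 0.5 * L * (y - x) ** 2 + 1e-12
    for x in [-2.0, -0.5, 0.0, 1.0, 3.0]
    for y in [-3.0, -1.0, 0.0, 0.5, 2.0]
)
```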

3.2 Convergence Rate Analysis of the Gradient Method

3.2.1 For Smooth (Non-Convex) Problems

The following lemma shows that at each iteration the decrease in the function value is at least a constant times the squared norm of the gradient.

Lemma 3.4 (Sufficient decrease lemma) Suppose that $f \in \mathcal{C}_L^{1,1}(\mathbb{R}^d)$. Then for any $x \in \mathbb{R}^d$ and $s > 0$,
$$f(x) - f(x - s \nabla f(x)) \ge s \left(1 - \frac{Ls}{2}\right) \|\nabla f(x)\|_2^2. \qquad (3.6)$$
Proof By the descent lemma we have

$$f(x - s \nabla f(x)) \le f(x) - s \|\nabla f(x)\|_2^2 + \frac{Ls^2}{2} \|\nabla f(x)\|_2^2 = f(x) - s \left(1 - \frac{Ls}{2}\right) \|\nabla f(x)\|_2^2.$$

Lemma 3.5 (Sufficient decrease of the gradient method) Suppose that $f \in \mathcal{C}_L^{1,1}(\mathbb{R}^d)$. Let $\{x_k\}_{k \ge 0}$ be the sequence generated by the gradient method with one of the following step size strategies:
– exact line search,
– backtracking line search with initial $s \in \mathbb{R}_{++}$, $\alpha \in (0, 1)$, and $\beta \in (0, 1)$,
– constant step size $s \in \left(0, \frac{2}{L}\right)$.

Then

$$f(x_k) - f(x_{k+1}) \ge M \|\nabla f(x_k)\|_2^2, \qquad (3.7)$$
where
$$M = \begin{cases} \frac{1}{2L}, & \text{if exact line search is used}, \\ \alpha \min\left\{s, \frac{2(1 - \alpha)\beta}{L}\right\}, & \text{if backtracking line search is used}, \\ s \left(1 - \frac{sL}{2}\right), & \text{if a constant step size } s \text{ is used}. \end{cases} \qquad (3.8)$$

Theorem 3.1 Let $f \in \mathcal{C}_L^{1,1}$, and let $\{x_k\}_{k \ge 0}$ be the sequence generated by the gradient method with one of the three step size strategies in Lemma 3.5. Assume that $f$ is bounded below over $\mathbb{R}^d$.

– The sequence $\{f(x_k)\}_{k \ge 0}$ is nonincreasing. In addition, for any $k \ge 0$, $f(x_{k+1}) < f(x_k)$ unless $\nabla f(x_k) = 0$.
– $\nabla f(x_k) \to 0$ as $k \to \infty$.
– $\min_{k = 0, \ldots, N} \|\nabla f(x_k)\|_2 \le \sqrt{\frac{f(x_0) - f(x_*)}{M(N+1)}}$, with $M$ given in (3.8).

3.2.2 For Smooth Convex Problems

For an $L$-smooth and convex function, the gradient method has a sublinear $O(1/k)$ rate of convergence of the generated sequence of function values to the optimal value. We denote the class of convex functions with Lipschitz gradient with constant $L$ by $\mathcal{F}_L^{1,1}(\mathbb{R}^d)$ or $\mathcal{F}_L^{1,1}$.

Theorem 3.2 Let $f \in \mathcal{F}_L^{1,1}$, and let $\{x_k\}_{k \ge 0}$ be the sequence generated by the gradient method with a constant step size $s = \frac{1}{L}$. Then, for $k \ge 1$,
$$f(x_k) - f(x_*) \le \frac{L \|x_0 - x_*\|_2^2}{2k}. \qquad (3.9)$$

Remark 3.1 Theorem 3.2 can be extended to the gradient method with another backtracking scheme (that is slightly different from what we learned in the class). For details, please see [2, Chapter 10, page 283]. You will learn this backtracking version in homework 3.
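A quick numerical check of Theorem 3.2 on a made-up diagonal quadratic (a sketch; the problem data are illustrative):

```python
# Gradient descent with constant step 1/L on f(x) = 0.5*(x1^2 + 10 x2^2),
# whose minimizer is x* = 0 with f(x*) = 0, checked against bound (3.9).

def gd(grad, x0, step, iters):
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

L = 10.0
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: [x[0], 10.0 * x[1]]
x0 = [1.0, 1.0]

k = 50
xk = gd(grad, x0, 1.0 / L, k)
bound = L * (x0[0] ** 2 + x0[1] ** 2) / (2.0 * k)   # L||x0 - x*||^2 / (2k)
```

On this easy problem the actual gap is far smaller than the worst-case bound, as expected.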

Remark 3.2 The exact worst-case convergence rate of the gradient method with constant step size $\frac{1}{L}$ after $N$ total iterations is [7]:
$$f(x_N) - f(x_*) \le \frac{L \|x_0 - x_*\|_2^2}{4N + 2}. \qquad (3.10)$$
Equality holds for the Huber function:
$$\phi_N(x) = \begin{cases} \frac{LR}{2N+1} \|x\|_2 - \frac{LR^2}{2(2N+1)}, & \|x\|_2 \ge \frac{R}{2N+1}, \\ \frac{L}{2} \|x\|_2^2, & \|x\|_2 < \frac{R}{2N+1}, \end{cases} \qquad (3.11)$$

where $R = \|x_0 - x_*\|_2$.

3.2.3 For Smooth and Strongly Convex Problems

Definition 3.2 A function $f$ is called $\mu$-strongly convex for a given $\mu > 0$ if and only if the function $f(\cdot) - \frac{\mu}{2} \|\cdot\|_2^2$ is convex.

Lemma 3.6 If $f$ is twice continuously differentiable, then $f$ is $\mu$-strongly convex if and only if
$$\nabla^2 f(x) \succeq \mu I. \qquad (3.12)$$

Example 3.2 Strong convexity of the quadratic function. Let $Q \in \mathbb{S}^d$ and $p \in \mathbb{R}^d$. Then the function $f(x) = \frac{1}{2} x^\top Q x + p^\top x$ is $\mu$-strongly convex with $\mu > 0$ if and only if the function $\frac{1}{2} x^\top (Q - \mu I) x + p^\top x$ is convex, which is equivalent to $Q - \mu I \succeq 0$ and $\lambda_{\min}(Q) \ge \mu$.

Lemma 3.7 Let $f$ be continuously differentiable, and let $\mu > 0$. Then the following claims are equivalent.
– $f$ is $\mu$-strongly convex.
– $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|x - y\|_2^2$ for all $x, y \in \mathbb{R}^d$.
– $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2\mu} \|\nabla f(x) - \nabla f(y)\|_2^2$ for all $x, y \in \mathbb{R}^d$.
– $\langle \nabla f(x) - \nabla f(y), x - y \rangle \le \frac{1}{\mu} \|\nabla f(x) - \nabla f(y)\|_2^2$ for all $x, y \in \mathbb{R}^d$.
– $f(\lambda x + (1 - \lambda)y) \le \lambda f(x) + (1 - \lambda) f(y) - \frac{\mu}{2} \lambda (1 - \lambda) \|x - y\|_2^2$ for any $x, y \in \mathbb{R}^d$ and $\lambda \in [0, 1]$.

We denote the class of functions that are $L$-smooth and $\mu$-strongly convex by $\mathcal{S}_{\mu,L}^{1,1}(\mathbb{R}^d)$ or $\mathcal{S}_{\mu,L}^{1,1}$. For such functions (with $\mu > 0$), the gradient method has a linear rate of convergence, i.e., a rate of the form $O(q^k)$ for some $q \in (0, 1)$.

Theorem 3.3 Let $f \in \mathcal{S}_{\mu,L}^{1,1}$ ($\mu > 0$), and let $\{x_k\}_{k \ge 0}$ be the sequence generated by the gradient method with a constant step size $s = \frac{1}{L}$. Then, for $k \ge 0$,
$$\|x_{k+1} - x_*\|_2^2 \le \left(1 - \frac{\mu}{L}\right) \|x_k - x_*\|_2^2, \qquad (3.13)$$
$$\|x_k - x_*\|_2^2 \le \left(1 - \frac{\mu}{L}\right)^k \|x_0 - x_*\|_2^2,$$
$$f(x_k) - f(x_*) \le \frac{L}{2} \left(1 - \frac{\mu}{L}\right)^k \|x_0 - x_*\|_2^2.$$

Remark 3.3 Theorem 3.3 can be extended to the gradient method with another backtracking scheme (that is a slight modification of what we learned in this class) [2, Chapter 10, page 283].
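The linear rate of Theorem 3.3 can likewise be checked on a toy strongly convex quadratic (illustrative data):

```python
# Gradient descent with step 1/L on f(x) = 0.5*(mu*x1^2 + L*x2^2),
# checking the iterated contraction (3.13): ||xk - x*||^2 <= (1-mu/L)^k ||x0 - x*||^2.

mu, L = 1.0, 10.0
grad = lambda x: [mu * x[0], L * x[1]]

x = [1.0, 1.0]
dist0_sq = sum(xi * xi for xi in x)          # ||x0 - x*||^2 with x* = 0
k = 20
for _ in range(k):
    g = grad(x)
    x = [xi - gi / L for xi, gi in zip(x, g)]

dist_sq = sum(xi * xi for xi in x)
bound = (1.0 - mu / L) ** k * dist0_sq
```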

4 Lower Complexity Bounds of Gradient Methods for Smooth Convex Functions (See [14, Chapter 2])

This section studies the lower complexity bounds of gradient methods (or first-order methods) of the general form:
$$x_k \in x_0 + \operatorname{span}\{\nabla f(x_0), \ldots, \nabla f(x_{k-1})\}, \quad k = 1, 2, \ldots, \qquad (4.1)$$
which includes the gradient methods with the three different step sizes learned in class, as well as the conjugate gradient method.

In [14], the lower complexity bounds of algorithms of the form (4.1) for each of $\mathcal{F}_L^{1,1}$ and $\mathcal{S}_{\mu,L}^{1,1}$ are found by specifying the “worst functions in the world” (that is, in $\mathcal{F}_L^{1,1}(\mathbb{R}^d)$) that are difficult for all iterative algorithms of the form (4.1), under the large-dimensional condition “$k < d$”.

4.1 For Smooth Convex Problem

Nesterov's worst function in the world (i.e., in $\mathcal{F}_L^{1,1}(\mathbb{R}^d)$) is
$$f_k(x) = \frac{L}{4} \left( \frac{1}{2} x^\top Q_k x - e_1^\top x \right), \qquad (4.2)$$
where
$$Q_k = \begin{bmatrix} T_k & 0_{k, d-k} \\ 0_{d-k, k} & 0_{d-k, d-k} \end{bmatrix}, \quad T_k = \begin{bmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{bmatrix} \in \mathbb{R}^{k \times k}, \qquad e_1 = (1, 0, \ldots, 0)^\top. \qquad (4.3)$$
Since $0 \preceq Q_k \preceq 4I$, the function $f_k$ satisfies
$$0 \preceq \nabla^2 f_k(x) = \frac{L}{4} Q_k \preceq LI, \qquad (4.4)$$
and we have $f_k(x) \in \mathcal{F}_L^{1,1}(\mathbb{R}^d)$ for $1 \le k \le d$.

Using the optimality condition $\nabla f_k(x) = \frac{L}{4}(Q_k x - e_1) = 0$, we have the following unique solution:
$$\hat{x}_k = \operatorname{argmin}_{x \in \mathbb{R}^d} f_k(x) = \left(1 - \frac{1}{k+1}, \ldots, 1 - \frac{k}{k+1}, 0, \ldots, 0\right)^\top \qquad (4.5)$$
with the optimal value:
$$\hat{f}_k = \min_{x \in \mathbb{R}^d} f_k(x) = \frac{L}{8} \left(-1 + \frac{1}{k+1}\right). \qquad (4.6)$$
Using $\sum_{i=1}^k i^2 = \frac{k(k+1)(2k+1)}{6} \le \frac{(k+1)^3}{3}$, we have the following bound on the solution $\hat{x}_k$:
$$\|\hat{x}_k\|_2^2 = \sum_{i=1}^k \left(1 - \frac{i}{k+1}\right)^2 \le \frac{k+1}{3}. \qquad (4.7)$$


Denote $\mathbb{R}^{k,d} = \{x \in \mathbb{R}^d : x_i = 0, \; k+1 \le i \le d\} \subseteq \mathbb{R}^d$, in which only the first $k$ components of a point can differ from zero. We can easily see that for all $x \in \mathbb{R}^{k,d}$ we have
$$f_p(x) = f_k(x), \quad p = k, \ldots, d. \qquad (4.8)$$

We next show that first-order algorithms (4.1) starting from $x_0 = 0$ satisfy $\operatorname{span}\{\nabla f(x_0)\} = \operatorname{span}\{e_1\}$, implying that $x_1 \in \operatorname{span}\{e_1\}$ and, in turn, $x_k \in \operatorname{span}\{e_1, \ldots, e_k\}$.

Lemma 4.1 Fix $p$ such that $1 \le p \le d$, and let $x_0 = 0$. Then for any sequence $\{x_k\}_{k=0}^p$ satisfying the condition
$$x_k \in \mathcal{L}_k = \operatorname{span}\{\nabla f_p(x_0), \ldots, \nabla f_p(x_{k-1})\}, \quad k = 1, \ldots, p, \qquad (4.9)$$
we have $\mathcal{L}_k \subseteq \mathbb{R}^{k,d}$ for $k = 1, \ldots, p$.

Proof Since $x_0 = 0$, we have $\nabla f_p(x_0) = -\frac{L}{4} e_1 \in \mathbb{R}^{1,d}$, and thus $\mathcal{L}_1 \subseteq \mathbb{R}^{1,d}$. Let $\mathcal{L}_k \subseteq \mathbb{R}^{k,d}$ for some $k < p$. Since $Q_p$ is tridiagonal, for any $x \in \mathbb{R}^{k,d}$ we have $\nabla f_p(x) \in \mathbb{R}^{k+1,d}$. Therefore $\mathcal{L}_{k+1} \subseteq \mathbb{R}^{k+1,d}$, and the proof is completed by induction.

Corollary 4.1 Fix $p$ such that $1 \le p \le d$. For any sequence $\{x_k\}_{k=0}^p$ such that $x_0 = 0$ and $x_k \in \mathcal{L}_k$, we have
$$f_p(x_k) \ge \hat{f}_k, \quad k = 1, \ldots, p. \qquad (4.10)$$
Proof Indeed, $x_k \in \mathcal{L}_k \subseteq \mathbb{R}^{k,d}$ and therefore $f_p(x_k) = f_k(x_k) \ge \hat{f}_k$, using (4.8).

Theorem 4.1 For any $k$ such that $1 \le k \le \frac{1}{2}(d-1)$ and any $x_0$, there exists a (quadratic) function $f$ in $\mathcal{F}_L^{1,1}$ that satisfies
$$\frac{3L \|x_0 - x_*\|_2^2}{32(k+1)^2} \le f(x_k) - f(x_*) \qquad (4.11)$$
for any first-order method of the form (4.1).

Proof The methods of this type are invariant under a simultaneous shift of all objects in the space of variables, so we can assume that $x_0 = 0$. Fix $k$ and consider $f(x) = f_{2k+1}(x)$, for which $x_* = \hat{x}_{2k+1}$ and $f(x_*) = \hat{f}_{2k+1}$. Using Corollary 4.1, we have
$$f(x_k) \equiv f_{2k+1}(x_k) = f_k(x_k) \ge \hat{f}_k.$$
Hence, since $x_0 = 0$, using (4.6) and (4.7), we have
$$\frac{f(x_k) - f(x_*)}{\|x_0 - x_*\|_2^2} \ge \frac{\hat{f}_k - \hat{f}_{2k+1}}{\|\hat{x}_{2k+1}\|_2^2} \ge \frac{\frac{L}{8}\left(\frac{1}{k+1} - \frac{1}{2k+2}\right)}{\frac{2k+2}{3}} = \frac{3L}{32(k+1)^2}.$$
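The worst-case construction can be verified numerically; the sketch below builds $Q_k$, checks the closed-form minimizer (4.5) against the optimality condition $Q_k \hat{x}_k = e_1$, and checks the optimal value (4.6). The sizes and the choice $L = 4$ are made up for illustration:

```python
# Nesterov's worst-case quadratic f_k(x) = (L/4)(0.5 x^T Q_k x - e_1^T x).

def make_Qk(k, d):
    Q = [[0.0] * d for _ in range(d)]
    for i in range(k):
        Q[i][i] = 2.0
        if i + 1 < k:
            Q[i][i + 1] = Q[i + 1][i] = -1.0
    return Q

L, k, d = 4.0, 3, 6
Q = make_Qk(k, d)
xhat = [1.0 - (i + 1) / (k + 1) if i < k else 0.0 for i in range(d)]  # (4.5)

# residual of the optimality condition Q_k xhat = e_1 (first k rows)
res = [sum(Q[i][j] * xhat[j] for j in range(d)) - (1.0 if i == 0 else 0.0)
       for i in range(k)]

fhat = (L / 4.0) * (0.5 * sum(xhat[i] * sum(Q[i][j] * xhat[j] for j in range(d))
                              for i in range(d)) - xhat[0])
target = (L / 8.0) * (-1.0 + 1.0 / (k + 1))                           # (4.6)
```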

4.2 For Smooth and Strongly Convex Problem

Nesterov's worst function in the world (i.e., in $\mathcal{S}_{\mu,L}^{1,1}(\mathbb{R}^\infty)$) is
$$f(x) = \frac{L - \mu}{4} \left( \frac{1}{2} x^\top Q x - e_1^\top x \right) + \frac{\mu}{2} \|x\|_2^2, \qquad (4.12)$$
where
$$Q = \begin{bmatrix} 2 & -1 & 0 & 0 & \cdots \\ -1 & 2 & -1 & 0 & \cdots \\ 0 & -1 & 2 & \ddots & \\ 0 & 0 & \ddots & \ddots & \\ \vdots & \vdots & & & \end{bmatrix}, \qquad e_1 = (1, 0, \ldots)^\top. \qquad (4.13)$$
This satisfies
$$\mu I \preceq \nabla^2 f(x) = \frac{L - \mu}{4} Q + \mu I \preceq LI \qquad (4.14)$$
since $0 \preceq Q \preceq 4I$, so $f(x) \in \mathcal{S}_{\mu,L}^{1,1}(\mathbb{R}^\infty)$.

Theorem 4.2 For any $x_0 \in \mathbb{R}^\infty$ and any $\mu$ and $L$ such that $0 < \mu < L$, the function (4.12) satisfies
$$f(x_k) - f(x_*) \ge \frac{\mu}{2} \left( \frac{1 - \sqrt{q}}{1 + \sqrt{q}} \right)^{2k} \|x_0 - x_*\|_2^2, \quad \text{where } q = \frac{\mu}{L},$$
for any first-order method of the form (4.1).

5 Accelerated Gradient Methods (See [14, Chapter 2])

5.1 For Convex Quadratic Problem

For any quadratic function $f$ in $\mathcal{F}_L^{1,1}$, the conjugate gradient method (and also Chebyshev's method) satisfies [11,12]:
$$f(x_k) - f(x_*) \le \frac{L \|x_0 - x_*\|_2^2}{2(2k+1)^2}. \qquad (5.1)$$
Note that this $O(1/k^2)$ rate (5.1) does not apply to smooth convex (non-quadratic) functions in general, calling for new methods with rate $O(1/k^2)$ for smooth convex functions; we will learn these below.

Remark 5.1 For any $k$ such that $1 \le k \le \frac{1}{2}(d-3)$ and any $x_0$, there exists a quadratic function $f$ in $\mathcal{F}_L^{1,1}$ that satisfies [12]:
$$\frac{L \|x_0 - x_*\|_2^2}{2(2k+1)^2} \le f(x_k) - f(x_*) \qquad (5.2)$$
for any first-order method of the form (4.1), which improves upon Theorem 4.1. This exactly matches the upper bound (5.1), implying that the conjugate gradient method has the optimal worst-case convergence rate for decreasing large-dimensional convex quadratic functions.

Remark 5.2 A gradient method with a rate $O(1/k)$ for decreasing $f(x_k) - f(x_*)$ requires $O(1/\epsilon)$ iterations to reach a tolerance $f(x_k) - f(x_*) \le \epsilon$, while accelerated gradient methods with a rate $O(1/k^2)$ require only $O(1/\sqrt{\epsilon})$ iterations. For example, for $\epsilon = 10^{-4}$, a gradient method requires about $10^4$ iterations, while accelerated gradient methods require only about $10^2$ iterations.

5.2 For Strongly Convex Quadratic Problem

For any quadratic function $f$ in $\mathcal{S}_{\mu,L}^{1,1}$, the conjugate gradient method satisfies [11]
$$f(x_k) - f(x_*) \le 4 \left( \frac{1 - \sqrt{q}}{1 + \sqrt{q}} \right)^{2k} (f(x_0) - f(x_*)), \qquad (5.3)$$
where $q = \frac{\mu}{L}$, which is similar to the rate in Theorem 4.2, and improves upon the bound of the gradient method in Theorem 3.3. Another well-known accelerated method for minimizing strongly convex quadratic functions is the heavy-ball method [16] (or equivalently Chebyshev's method).

Algorithm 3 Heavy-ball method [16] 1,1 d 1: Input: f ∈Sµ,L is quadratic, µ> 0, x0 = x−1 ∈ Ê . 2: for k ≥ 0 do 3: xk+1 = xk − αk∇f(xk)+ βk(xk − xk−1)

An optimal choice of parameters for minimizing a strongly convex quadratic function is
$$\alpha_k = \frac{4}{(\sqrt{L} + \sqrt{\mu})^2}, \qquad \beta_k = \left( \frac{1 - \sqrt{q}}{1 + \sqrt{q}} \right)^2, \qquad (5.4)$$
and the resulting heavy-ball method satisfies
$$\left\| \begin{pmatrix} x_{k+1} - x_* \\ x_k - x_* \end{pmatrix} \right\|_2^2 \le \left( \frac{1 - \sqrt{q}}{1 + \sqrt{q}} \right)^2 \left\| \begin{pmatrix} x_k - x_* \\ x_{k-1} - x_* \end{pmatrix} \right\|_2^2 \qquad (5.5)$$
for any strongly convex quadratic function.
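A toy comparison of the heavy-ball iteration (Algorithm 3) with the parameters (5.4) against plain gradient descent on an ill-conditioned quadratic (the problem data are illustrative):

```python
# Heavy-ball vs. gradient descent on f(x) = 0.5*(mu*x1^2 + L*x2^2), x* = 0.
import math

mu, L = 1.0, 100.0
grad = lambda x: [mu * x[0], L * x[1]]

q = mu / L
alpha = 4.0 / (math.sqrt(L) + math.sqrt(mu)) ** 2
beta = ((1.0 - math.sqrt(q)) / (1.0 + math.sqrt(q))) ** 2

x_prev = x = [1.0, 1.0]
for _ in range(50):
    g = grad(x)
    x, x_prev = [xi - alpha * gi + beta * (xi - pi)
                 for xi, gi, pi in zip(x, g, x_prev)], x
hb_dist = math.sqrt(sum(xi * xi for xi in x))

x = [1.0, 1.0]                      # plain GD with step 1/L
for _ in range(50):
    g = grad(x)
    x = [xi - gi / L for xi, gi in zip(x, g)]
gd_dist = math.sqrt(sum(xi * xi for xi in x))
```

With $L/\mu = 100$, the heavy-ball iterate is orders of magnitude closer to $x_*$ after the same number of iterations.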

Remark 5.3 A heavy-ball method is not guaranteed to converge to the optimal point for strongly convex but non-quadratic problems [10].

Remark 5.4 The conjugate gradient method has nonlinear versions for non-quadratic problems, but they are not as effective for such problems as the algorithms introduced below.

5.3 For Smooth Convex Problem

The most widely used accelerated gradient method for smooth convex problems is the following Nesterov's fast gradient method (also known as Nesterov's accelerated gradient method).

Algorithm 4 Nesterov’s Fast Gradient Method [13,14] 1,1 d 1: Input: f ∈FL , x0 = y0 ∈ Ê . 2: for k ≥ 0 do 1 3: xk+1 = yk − L ∇f(yk) 4: yk+1 = xk+1 + βk(xk+1 − xk)

Two widely used choices of $\beta_k$ are as follows.
– Choice 1: $\beta_k = \frac{t_k - 1}{t_{k+1}}$ with $t_k = \begin{cases} 1, & k = 0, \\ \frac{1 + \sqrt{1 + 4t_{k-1}^2}}{2}, & k = 1, 2, \ldots \end{cases}$
– Choice 2: $\beta_k = \frac{t_k - 1}{t_{k+1}} = \frac{k}{k+3}$ with $t_k = \frac{k+2}{2}$.

Theorem 5.1 Let $f \in \mathcal{F}_L^{1,1}$, and let $\{x_k\}_{k \ge 0}$ be the sequence generated by Nesterov's fast gradient method (using either choice of the parameters above). Then, for $k \ge 0$,
$$f(x_k) - f(x_*) \le \frac{L \|x_0 - x_*\|_2^2}{2 t_{k-1}^2} \le \frac{2L \|x_0 - x_*\|_2^2}{(k+1)^2}. \qquad (5.6)$$

Remark 5.5 There exists a backtracking version of Nesterov's fast gradient method, which you will be asked to implement in homework 3.
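Algorithm 4 with Choice 2 can be sketched and checked against the bound (5.6) as follows (toy quadratic; the problem data are illustrative):

```python
# Nesterov's fast gradient method with beta_k = k/(k+3) on
# f(x) = 0.5*(x1^2 + 10 x2^2), minimizer x* = 0, L = 10.

L = 10.0
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: [x[0], 10.0 * x[1]]

x0 = [1.0, 1.0]
x = list(x0)
y = list(x0)                       # y0 = x0
N = 50
for k in range(N):
    x_new = [yi - gi / L for yi, gi in zip(y, grad(y))]   # gradient step
    beta = k / (k + 3)                                    # Choice 2
    y = [xn + beta * (xn - xi) for xn, xi in zip(x_new, x)]
    x = x_new

bound = 2.0 * L * sum(v * v for v in x0) / (N + 1) ** 2   # (5.6)
```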

Algorithm 5 Another Equivalent Formulation of Nesterov's Fast Gradient Method [15]
1: Input: f ∈ F_L^{1,1}, y0 ∈ ℝ^d.
2: for k ≥ 0 do
3:   x_{k+1} = argmin_x { f(y_k) + ⟨∇f(y_k), x − y_k⟩ + (L/2)‖x − y_k‖²₂ } = y_k − (1/L)∇f(y_k)
4:   z_{k+1} = argmin_z { Σ_{i=0}^{k} t_i [f(y_i) + ⟨∇f(y_i), z − y_i⟩] + (L/2)‖z − y_0‖²₂ } = y_0 − (1/L) Σ_{i=0}^{k} t_i ∇f(y_i)
5:   y_{k+1} = (1 − 1/t_{k+1}) x_{k+1} + (1/t_{k+1}) z_{k+1}

Remark 5.6 The trajectory of the iterates of Nesterov's fast gradient method can be modeled by a second-order ODE [17]:
$$x''(t) + \frac{3}{t} x'(t) + \nabla f(x(t)) = 0, \qquad (5.7)$$
whereas the trajectory of a gradient method can be modeled by a first-order ODE:
$$x'(t) + \nabla f(x(t)) = 0, \qquad (5.8)$$
and the trajectory of a heavy-ball method can be modeled by another second-order ODE:
$$x''(t) + a x'(t) + b \nabla f(x(t)) = 0. \qquad (5.9)$$

There is still a gap between the lower complexity bounds (4.11), (5.2) and the upper bound (5.6) of Nesterov's fast gradient method. Even though we cannot improve the worst-case rate $O(1/k^2)$, one might be able to develop an algorithm with a smaller worst-case bound.

Algorithm 6 Optimized Gradient Method [9]
1: Input: f ∈ F_L^{1,1}, x0 = y0 ∈ ℝ^d, θ0 = 1.
2: for k = 0, 1, ..., N − 1 do
3:   x_{k+1} = y_k − (1/L)∇f(y_k)
4:   θ_{k+1} = (1 + √(1 + 4θ_k²))/2 for k = 0, ..., N − 2, and θ_{k+1} = (1 + √(1 + 8θ_k²))/2 for k = N − 1
5:   y_{k+1} = x_{k+1} + ((θ_k − 1)/θ_{k+1})(x_{k+1} − x_k) + (θ_k/θ_{k+1})(x_{k+1} − y_k)

Algorithm 7 Another Equivalent Formulation of Optimized Gradient Method [9]
1: Input: f ∈ F_L^{1,1}, y0 ∈ ℝ^d, θ0 = 1.
2: for k = 0, 1, ..., N − 1 do
3:   x_{k+1} = y_k − (1/L)∇f(y_k)
4:   z_{k+1} = y_0 − (1/L) Σ_{i=0}^{k} 2θ_i ∇f(y_i)
5:   θ_{k+1} = (1 + √(1 + 4θ_k²))/2 for k = 0, ..., N − 2, and θ_{k+1} = (1 + √(1 + 8θ_k²))/2 for k = N − 1
6:   y_{k+1} = (1 − 1/θ_{k+1}) x_{k+1} + (1/θ_{k+1}) z_{k+1}

Theorem 5.2 Let $f \in \mathcal{F}_L^{1,1}$, and let $y_N$ be the last ($N$th) iterate generated by the optimized gradient method. Then,
$$f(y_N) - f(x_*) \le \frac{L \|x_0 - x_*\|_2^2}{2\theta_N^2} \le \frac{L \|x_0 - x_*\|_2^2}{(N+1)^2}. \qquad (5.10)$$

Remark 5.7 For any $N < d$, the lower complexity bound for first-order methods of the form (4.1) exactly matches the upper bound (5.10) of the optimized gradient method [6], implying that the optimized gradient method is optimal for decreasing large-dimensional smooth convex functions.
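A sketch of Algorithm 6 on the same kind of toy quadratic, checked against the bound (5.10) (the problem data are illustrative):

```python
# Optimized gradient method (OGM) on f(x) = 0.5*(x1^2 + 10 x2^2), x* = 0.
import math

L = 10.0
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: [x[0], 10.0 * x[1]]

x0 = [1.0, 1.0]
x = list(x0)
y = list(x0)
theta = 1.0
N = 30
for k in range(N):
    x_new = [yi - gi / L for yi, gi in zip(y, grad(y))]
    if k < N - 1:
        theta_new = (1.0 + math.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
    else:                          # last iteration uses the 8*theta^2 rule
        theta_new = (1.0 + math.sqrt(1.0 + 8.0 * theta ** 2)) / 2.0
    y = [xn + (theta - 1.0) / theta_new * (xn - xi)
         + theta / theta_new * (xn - yi)
         for xn, xi, yi in zip(x_new, x, y)]
    x = x_new
    theta = theta_new

bound = L * sum(v * v for v in x0) / (2.0 * theta ** 2)   # (5.10), theta = theta_N
```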

5.4 For Smooth and Strongly Convex Problem

For a smooth and strongly convex function $f$, Nesterov's fast gradient method with the parameter
$$\beta_k = \frac{1 - \sqrt{q}}{1 + \sqrt{q}} \qquad (5.12)$$
satisfies
$$f(x_k) - f(x_*) \le 2(1 - \sqrt{q})^k (f(x_0) - f(x_*)). \qquad (5.13)$$

Remark 5.8 Other choices of βk can be found in [14].

6 Subgradients (see [2, Chapter 3] [14, Chapter 3])

6.1 Definition of Subgradient and Subdifferential

Recall that a differentiable function f is convex if and only if dom f is convex and

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle \qquad (6.1)$$
holds for all $x, y \in \operatorname{dom} f$.

Definition 6.1 Let $f : \mathbb{R}^d \to (-\infty, \infty]$ be a proper function² and let $x \in \operatorname{dom} f$. A vector $g \in \mathbb{R}^d$ is called a subgradient of $f$ at $x$ if
$$f(y) \ge f(x) + \langle g, y - x \rangle \quad \text{for all } y \in \mathbb{R}^d. \qquad (6.2)$$

Definition 6.2 The set of all subgradients of $f$ at $x$ is called the subdifferential of $f$ at $x$ and is denoted by $\partial f(x)$:
$$\partial f(x) = \{ g \in \mathbb{R}^d : f(y) \ge f(x) + \langle g, y - x \rangle \text{ for all } y \in \mathbb{R}^d \}. \qquad (6.3)$$

Theorem 6.1 Let $f : \mathbb{R}^d \to (-\infty, \infty]$ be a proper convex function, and let $x \in \operatorname{int}(\operatorname{dom} f)$. If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$. Conversely, if $f$ has a unique subgradient at $x$, then it is differentiable at $x$ and $\partial f(x) = \{\nabla f(x)\}$.

Example 6.1 Subdifferential of absolute value.

– $f(x) = |x|$: $\quad \partial f(x) = \begin{cases} \{\operatorname{sign}(x)\}, & x \neq 0, \\ [-1, 1], & x = 0. \end{cases}$

Example 6.2 Subdifferential of ℓ2-norm.

– $f(x) = \|x\|_2$: this is differentiable away from $0$, and at $0$ we have $\|x\|_2 \ge 0 + \langle g, x - 0 \rangle$, which leads to
$$\partial f(x) = \begin{cases} \left\{ \frac{x}{\|x\|_2} \right\}, & x \neq 0, \\ \{ g : \|g\|_2 \le 1 \}, & x = 0. \end{cases}$$

6.2 Properties of the Subdifferential Set

Theorem 6.2 Let $f : \mathbb{R}^d \to (-\infty, \infty]$ be a proper function. Then the set $\partial f(x)$ is closed and convex for any $x \in \mathbb{R}^d$.

Proof For any $x \in \mathbb{R}^d$, the subdifferential set can be represented as
$$\partial f(x) = \bigcap_{y \in \mathbb{R}^d} \mathcal{H}_y,$$
where $\mathcal{H}_y = \{ g \in \mathbb{R}^d : f(y) \ge f(x) + \langle g, y - x \rangle \}$. Since the sets $\mathcal{H}_y$ are half-spaces and, in particular, closed and convex, it follows that $\partial f(x)$ is closed and convex.

Definition 6.3 A proper function $f : \mathbb{R}^d \to (-\infty, \infty]$ is called subdifferentiable at $x \in \operatorname{dom} f$ if $\partial f(x) \neq \emptyset$.

Lemma 6.1 Let $f : \mathbb{R}^d \to (-\infty, \infty]$ and assume that $\operatorname{dom} f$ is convex. Suppose that for any $x \in \operatorname{dom} f$, the set $\partial f(x)$ is nonempty. Then $f$ is convex.

² A function $f : \mathbb{R}^d \to (-\infty, \infty]$ is called proper if it does not attain the value $-\infty$ and there exists at least one $x \in \mathbb{R}^d$ such that $f(x) < \infty$, meaning that $\operatorname{dom} f$ is nonempty.

Proof Let $x, y \in \operatorname{dom} f$ and $\theta \in [0, 1]$. Define $z = (1 - \theta)x + \theta y$. By the convexity of $\operatorname{dom} f$, $z \in \operatorname{dom} f$, and hence there exists $g \in \partial f(z)$, which implies the following two inequalities:
$$f(y) \ge f(z) + \langle g, y - z \rangle = f(z) + (1 - \theta) \langle g, y - x \rangle,$$
$$f(x) \ge f(z) + \langle g, x - z \rangle = f(z) - \theta \langle g, y - x \rangle.$$
Multiplying the first inequality by $\theta$, the second by $1 - \theta$, and summing them yields
$$f((1 - \theta)x + \theta y) = f(z) \le (1 - \theta) f(x) + \theta f(y).$$
Since this holds for any $x, y \in \operatorname{dom} f$ with $\operatorname{dom} f$ convex, it follows that $f$ is convex.

This means that if a function is subdifferentiable at every point of its convex domain, then it is convex. However, the reverse direction does not hold in general (see [2, Example 3.12]).

Theorem 6.3 Let $f : \mathbb{R}^d \to (-\infty, \infty]$ be a proper convex function, and assume that $\tilde{x} \in \operatorname{int}(\operatorname{dom} f)$. Then $\partial f(\tilde{x})$ is nonempty and bounded.

This implies that if a function $f : \mathbb{R}^d \to \mathbb{R}$ is convex, then $f$ is subdifferentiable over $\mathbb{R}^d$.

6.3 Computing Subgradients

– Multiplication by a positive scalar: Let $\alpha > 0$. Then for any $x \in \operatorname{dom} f$, $\partial(\alpha f)(x) = \alpha \, \partial f(x)$.
– Summation:
  – For any $x \in \operatorname{dom} f_1 \cap \operatorname{dom} f_2$, $\partial f_1(x) + \partial f_2(x) \subseteq \partial(f_1 + f_2)(x)$.
  – For any $x \in \operatorname{int}(\operatorname{dom} f_1) \cap \operatorname{int}(\operatorname{dom} f_2)$, $\partial(f_1 + f_2)(x) = \partial f_1(x) + \partial f_2(x)$.
– Affine transformation: Let $h(x) = f(Ax + b)$.
  – For any $x \in \operatorname{dom} h$, $A^\top \partial f(Ax + b) \subseteq \partial h(x)$.
  – If $x \in \operatorname{int}(\operatorname{dom} h)$ and $Ax + b \in \operatorname{int}(\operatorname{dom} f)$, then $\partial h(x) = A^\top \partial f(Ax + b)$.
– Composition: Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function and $g : \mathbb{R} \to \mathbb{R}$ be a nondecreasing convex function. Let $x \in \mathbb{R}^d$ and suppose that $g$ is differentiable at the point $f(x)$. Let $h(x) = g(f(x))$; then
$$\partial h(x) = g'(f(x)) \, \partial f(x). \qquad (6.4)$$
– Pointwise maximum: Let $f_1, f_2, \ldots, f_m : \mathbb{R}^d \to (-\infty, \infty]$ be proper convex functions, and define $f(x) = \max\{f_1(x), f_2(x), \ldots, f_m(x)\}$. Let $x \in \cap_{i=1}^m \operatorname{int}(\operatorname{dom} f_i)$. Then
$$\partial f(x) = \operatorname{conv}\left( \cup_{i \in \{j : f_j(x) = f(x)\}} \partial f_i(x) \right), \qquad (6.5)$$
where $\operatorname{conv}(\mathcal{C})$ is the convex hull of a set $\mathcal{C}$.

Example 6.3 Subdifferential of the ℓ1-norm. Let $f(x) = \|x\|_1 = \sum_{i=1}^d |x_i|$. Then
$$\partial f(x) = \sum_{i=1}^d \partial f_i(x) = \sum_{i \in \{j : x_j \neq 0\}} \operatorname{sign}(x_i) e_i + \sum_{i \in \{j : x_j = 0\}} [-e_i, e_i]$$
$$= \{ g = (g_1, \ldots, g_d)^\top : g_i = \operatorname{sign}(x_i) \text{ if } x_i \neq 0, \text{ and } g_i \in [-1, 1] \text{ otherwise} \}.$$

Example 6.4 Subdifferential of $\|Ax + b\|_1$. Let $f(x) = g(Ax + b)$ and $g(y) = \|y\|_1$. Then, with $I(x) = \{i : a_i^\top x + b_i = 0\}$ and $\tilde{I}(x) = \{i : a_i^\top x + b_i \neq 0\}$, we have
$$\partial g(Ax + b) = \sum_{i \in \tilde{I}(x)} \operatorname{sign}(a_i^\top x + b_i) e_i + \sum_{i \in I(x)} [-e_i, e_i]$$
and
$$\partial f(x) = A^\top \partial g(Ax + b) = \sum_{i \in \tilde{I}(x)} \operatorname{sign}(a_i^\top x + b_i) a_i + \sum_{i \in I(x)} [-a_i, a_i].$$

Example 6.5 Subdifferential of the max function. Let $f(x) = \max\{x_1, \ldots, x_d\} = \max\{f_1(x), \ldots, f_d(x)\}$, where $f_i(x) = x_i$. Using $\partial f_i(x) = \{e_i\}$, we have
$$\partial f(x) = \operatorname{conv}\left( \cup_{i \in I(x)} \{e_i\} \right) = \left\{ \sum_{i \in I(x)} \theta_i e_i : \sum_{i \in I(x)} \theta_i = 1, \; \theta_i \ge 0, \; i \in I(x) \right\}, \qquad (6.6)$$
where $I(x) = \{i : f(x) = x_i\}$.

Example 6.6 Subdifferential of a piecewise linear function. Let $f(x) = \max_i \{a_i^\top x + b_i\}$. Using $\partial f_i(x) = \{a_i\}$, we have
$$\partial f(x) = \operatorname{conv}\left( \cup_{i \in I(x)} \{a_i\} \right) = \left\{ \sum_{i \in I(x)} \theta_i a_i : \sum_{i \in I(x)} \theta_i = 1, \; \theta_i \ge 0, \; i \in I(x) \right\}, \qquad (6.7)$$
where $I(x) = \{i : f(x) = a_i^\top x + b_i\}$.

6.4 Optimality Condition

Theorem 6.4 Let $f : \mathbb{R}^d \to (-\infty, \infty]$ be a proper convex function. Then $x_* \in \operatorname{argmin}_{x \in \mathbb{R}^d} f(x)$ if and only if
$$0 \in \partial f(x_*). \qquad (6.8)$$
Proof A point $x_*$ is optimal if and only if
$$f(x) \ge f(x_*) + \langle 0, x - x_* \rangle \quad \text{for any } x \in \operatorname{dom} f, \qquad (6.9)$$
and this is equivalent to $0 \in \partial f(x_*)$.

Example 6.7 Minimizing piecewise linear functions. Consider
$$\min_{x \in \mathbb{R}^d} \left\{ f(x) \equiv \max_{i = 1, 2, \ldots, m} (a_i^\top x + b_i) \right\}, \qquad (6.10)$$
for which the optimality condition is
$$0 \in \partial f(x) = \left\{ \sum_{i \in I(x)} \theta_i a_i : \sum_{i \in I(x)} \theta_i = 1, \; \theta_i \ge 0, \; i \in I(x) \right\},$$
where $I(x) = \{i : f(x) = a_i^\top x + b_i\}$. This means that $x_*$ is optimal if and only if there exists $(\theta_1, \ldots, \theta_m)$ such that
$$0 = \sum_{i=1}^m \theta_i a_i, \quad \sum_{i=1}^m \theta_i = 1, \quad \theta_i \begin{cases} \ge 0, & i \in I(x_*), \\ = 0, & \text{otherwise}. \end{cases}$$

Example 6.8 Soft-thresholding operator. Consider the problem (an instance of the Lasso problem):
$$f(x) = \frac{1}{2} \|y - x\|_2^2 + \lambda \|x\|_1. \qquad (6.11)$$
The subdifferential of $f$ satisfies
$$x - y + \lambda s \in \partial f(x), \qquad (6.12)$$
where $s_i = \operatorname{sign}(x_i)$ if $x_i \neq 0$ and $s_i \in [-1, 1]$ if $x_i = 0$. The solution $x_* = y - \lambda s$ with $s \in \partial(\|x_*\|_1)$ can be found by using the soft-thresholding operator $S_\lambda$:
$$[x_*]_i = [S_\lambda(y)]_i = \begin{cases} y_i - \lambda, & y_i > \lambda, \\ 0, & -\lambda \le y_i \le \lambda, \\ y_i + \lambda, & y_i < -\lambda. \end{cases} \qquad (6.13)$$

7 Subgradient Methods for Nonsmooth Convex Optimization (see [2, Chapter 3,8] [14, Chapter 3])

7.1 Subgradient Methods

If $f$ is not differentiable, a natural generalization of the gradient descent method to the nonsmooth case is
$$x_{k+1} = x_k - s_k g_k, \quad g_k \in \partial f(x_k) \qquad (7.1)$$
$$= \operatorname{argmin}_x \left\{ f(x_k) + \langle g_k, x - x_k \rangle + \frac{1}{2 s_k} \|x - x_k\|_2^2 \right\},$$
where $x_{k+1}$ is constructed by minimizing a linearization of the cost function plus a quadratic proximity term. Here, the direction of the negative subgradient is not necessarily a descent direction, so $s_k$ cannot be chosen to guarantee a descent update.

Example 7.1 Non-descent subgradient direction. Consider the function $f(x_1, x_2) = |x_1| + 2|x_2|$. Then $(1, 2) \in \partial f(1, 0) = \{(1, g) : |g| \le 2\}$. However, the negative subgradient $-(1, 2)$ is not a descent direction.

One can choose a step size sk by one of the following approaches:

– Constant step size: $s_k = s$.
– Constant step length: $s_k = \frac{\gamma}{\|g_k\|_2}$ (so that $\|x_{k+1} - x_k\|_2 = \gamma$).
– Diminishing step size: choose $s_k$ such that $\sum_{k=0}^\infty s_k^2 < \infty$ and $\sum_{k=0}^\infty s_k = \infty$, i.e., $s_k$ is square summable but not summable.

7.2 Convergence Analysis of the Subgradient Method

7.2.1 Lipschitz Continuity and Boundedness of Subgradients

We assume that there exists a constant M > 0 for which

$$\|g\|_2 \le M \qquad (7.2)$$
for all $g \in \partial f(x)$. This implies that $f$ is Lipschitz continuous with constant $M > 0$:
$$|f(x) - f(y)| \le M \|x - y\|_2 \quad \text{for all } x, y, \qquad (7.3)$$
because for $g_x \in \partial f(x)$ and $g_y \in \partial f(y)$,
$$f(x) - f(y) \le \langle g_x, x - y \rangle \le \|g_x\|_2 \|x - y\|_2 \le M \|x - y\|_2,$$
$$f(y) - f(x) \le \langle g_y, y - x \rangle \le \|g_y\|_2 \|x - y\|_2 \le M \|x - y\|_2.$$

7.2.2 Convergence Rate Analysis of the Subgradient Method

Theorem 7.1 Let $f$ be proper, closed and convex, and assume that there exists a constant $M > 0$ for which $\|g\|_2 \le M$ for all $g \in \partial f(x)$. Let $\{x_k\}_{k \ge 0}$ be the sequence generated by the subgradient method (7.1). Then,
$$\min_{0 \le i \le k} f(x_i) - f(x_*) \le \frac{\|x_0 - x_*\|_2^2 + \sum_{i=0}^k s_i^2 \|g_i\|_2^2}{2 \sum_{i=0}^k s_i} \le \frac{\|x_0 - x_*\|_2^2 + M^2 \sum_{i=0}^k s_i^2}{2 \sum_{i=0}^k s_i}. \qquad (7.4)$$
Proof Using the definition of subgradient, we have

$$\|x_{i+1} - x_*\|_2^2 = \|x_i - s_i g_i - x_*\|_2^2 = \|x_i - x_*\|_2^2 - 2 s_i \langle g_i, x_i - x_* \rangle + s_i^2 \|g_i\|_2^2 \le \|x_i - x_*\|_2^2 - 2 s_i (f(x_i) - f(x_*)) + s_i^2 \|g_i\|_2^2$$

for $i = 0, \ldots, k$. A telescoping sum yields

$$2 \left( \sum_{i=0}^k s_i \right) \min_{0 \le i \le k} \left( f(x_i) - f(x_*) \right) \le 2 \sum_{i=0}^k s_i (f(x_i) - f(x_*)) \le \|x_0 - x_*\|_2^2 - \|x_{k+1} - x_*\|_2^2 + \sum_{i=0}^k s_i^2 \|g_i\|_2^2 \le \|x_0 - x_*\|_2^2 + \sum_{i=0}^k s_i^2 \|g_i\|_2^2.$$

Specific choices of $s_k$ yield the following rates of convergence:

– Constant step size $s_k = s$, $k = 0, 1, \ldots$:
$$\min_{0 \le i \le k} f(x_i) - f(x_*) \le \frac{\|x_0 - x_*\|_2^2}{2(k+1)s} + \frac{M^2 s}{2}. \qquad (7.5)$$
– Constant step length $s_k = \frac{\gamma}{\|g_k\|_2}$, $k = 0, 1, \ldots$:
$$\min_{0 \le i \le k} f(x_i) - f(x_*) \le \frac{M \|x_0 - x_*\|_2^2}{2(k+1)\gamma} + \frac{M\gamma}{2}. \qquad (7.6)$$
– Dynamic step $s_k = \frac{\|x_0 - x_*\|_2}{\|g_k\|_2 \sqrt{N+1}}$, $k = 0, \ldots, N$:
$$\min_{0 \le i \le N} f(x_i) - f(x_*) \le \frac{M \|x_0 - x_*\|_2}{\sqrt{N+1}}. \qquad (7.7)$$
– Diminishing step $s_k = \frac{r}{\|g_k\|_2 \sqrt{k+1}}$, $k = 0, 1, \ldots$:
$$\min_{0 \le i \le k} f(x_i) - f(x_*) \le M \frac{\|x_0 - x_*\|_2^2 + r^2 \log(k+1)}{2r\sqrt{k+1}}. \qquad (7.8)$$
– Polyak's step $s_k = \frac{f(x_k) - f(x_*)}{\|g_k\|_2^2}$, $k = 0, 1, \ldots$:
$$\min_{0 \le i \le k} f(x_i) - f(x_*) \le \frac{M \|x_0 - x_*\|_2}{\sqrt{k+1}}. \qquad (7.9)$$

The subgradient method has the worst-case convergence rate $O(1/\sqrt{k})$; this requires $O(1/\epsilon^2)$ iterations to achieve the tolerance $f(x_k) - f(x_*) \le \epsilon$.

Example 7.2 Piecewise linear minimization. Let $f(x) = \max_i (a_i^\top x + b_i)$. At the $k$th iteration, find an index $j(k)$ such that $a_{j(k)}^\top x_k + b_{j(k)} = \max_i (a_i^\top x_k + b_i)$ and choose the subgradient $a_{j(k)} \in \partial f(x_k)$. The resulting subgradient step is $x_{k+1} = x_k - s_k a_{j(k)}$.
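The piecewise-linear subgradient step can be sketched as follows (the data $a_i$, $b_i$ are made up so that $f(x) = \|x\|_\infty$ with optimal value $0$):

```python
# Subgradient method for f(x) = max_i (a_i^T x + b_i) with a diminishing
# step; at each iteration the active piece supplies the subgradient a_j.

A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]   # made-up data
b = [0.0, 0.0, 0.0, 0.0]                                 # f(x) = ||x||_inf

def f(x):
    return max(sum(ai * xi for ai, xi in zip(a, x)) + bi
               for a, bi in zip(A, b))

x = [2.0, -1.5]
best = f(x)
for k in range(200):
    vals = [sum(ai * xi for ai, xi in zip(a, x)) + bi for a, bi in zip(A, b)]
    j = max(range(len(vals)), key=vals.__getitem__)       # active piece
    s = 1.0 / (k + 1)                                     # diminishing step
    x = [xi - s * aj for xi, aj in zip(x, A[j])]
    best = min(best, f(x))
```

Note that the iterates are not monotone, so one tracks the best value seen so far, matching the $\min_{0 \le i \le k}$ form of the bounds above.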

7.3 Lower Complexity Bounds of Subgradient Methods for Nonsmooth Convex Functions

Theorem 7.2 For any k, 0 ≤ k ≤ d − 1, there exists a convex and Lipschitz continuous function f with a constant M that satisfies

    M ||x_0 − x_*||_2 / ( 2(1 + √(k+1)) ) ≤ f(x_k) − f(x_*)    (7.10)

for any subgradient method of the form

    x_k ∈ x_0 + span{g_0,..., g_{k−1}}    (7.11)

with g_k ∈ ∂f(x_k).

Proof Nesterov specified a function

    f_k(x) = γ max_{1≤i≤k} x_i + (µ/2) ||x||^2

that is Lipschitz continuous on the ball B[0, R] with constant M = µR + γ. The proof in [14, Theorem 3.2.1] uses f(x) = f_{k+1}(x) with

    γ = M √(k+1) / (1 + √(k+1)),    µ = M / ( (1 + √(k+1)) R ),

where ||x_0 − x_*||_2 ≤ R.

The worst-case convergence rate O(1/√k) of subgradient methods therefore cannot be improved. One way to improve the rate is to use a smoothing technique (not studied in this course), such as Problem 3 in Homework 3, which minimizes a smooth approximation of the nonsmooth ℓ1-norm problem instead of the nonsmooth ℓ1-norm problem itself. (If you are interested, see e.g. [3,4] for the project.)

Remark 7.1 For any N,

    M ||x_0 − x_*||_2 / √(N+1) ≤ f(x_N) − f(x_*)    (7.12)

for any subgradient method of the form (7.11). This exactly matches the bound of the subgradient method with step size s_k = ||x_0 − x_*||_2 / (||g_k||_2 √(N+1)) and with Polyak’s step.

References

1. Beck, A.: Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MATLAB. Soc. Indust. Appl. Math., Philadelphia (2014). URL http://epubs.siam.org/doi/abs/10.1137/1.9781611973655
2. Beck, A.: First-Order Methods in Optimization. Soc. Indust. Appl. Math., Philadelphia (2017). URL http://epubs.siam.org/doi/book/10.1137/1.9781611974997
3. Beck, A., Teboulle, M.: Smoothing and first order methods: A unified framework. SIAM J. Optim. 22(2), 557–80 (2012). DOI 10.1137/100818327
4. Becker, S.R., Candès, E.J., Grant, M.C.: Templates for convex cone problems with applications to sparse signal recovery. Math. Prog. Comp. 3(3) (2011). URL http://mpc.zib.de/index.php/MPC/article/view/58
5. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge, UK (2004). URL http://www.stanford.edu/~boyd/cvxbook.html
6. Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complexity 39, 1–16 (2017). DOI 10.1016/j.jco.2016.11.001
7. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: A novel approach. Mathematical Programming 145(1-2), 451–82 (2014). DOI 10.1007/s10107-013-0653-0
8. Drori, Y., Teboulle, M.: An optimal variant of Kelley’s cutting-plane method. Mathematical Programming 160(1), 321–51 (2016). DOI 10.1007/s10107-016-0985-7
9. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Mathematical Programming 159(1), 81–107 (2016). DOI 10.1007/s10107-015-0949-3
10. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016). DOI 10.1137/15M1009597
11. Nemirovski, A.: Optimization II: Numerical methods for nonlinear continuous optimization (1999). URL http://www2.isye.gatech.edu/~nemirovs/Lect_OptII.pdf. Lecture notes
12. Nemirovsky, A.S.: Information-based complexity of linear operator equations. J. of Complexity 8(2), 153–75 (1992). DOI 10.1016/0885-064X(92)90013-2
13. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Dokl. Akad. Nauk. USSR 269(3), 543–7 (1983)
14. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer (2004). DOI 10.1007/978-1-4419-8853-9
15. Nesterov, Y.: Smooth minimization of non-smooth functions. Mathematical Programming 103(1), 127–52 (2005). DOI 10.1007/s10107-004-0552-5
16. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5), 1–17 (1964). DOI 10.1016/0041-5553(64)90137-5
17. Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learning Res. 17(153), 1–43 (2016). URL http://jmlr.org/papers/v17/15-084.html
