Part 2. Gradient and Subgradient Methods for Unconstrained Convex Optimization
Math 126, Winter 18. Date of current version: January 29, 2018

Abstract This note studies (sub)gradient methods for unconstrained convex optimization. Many parts of this note are based on the chapters [1, Chapter 4], [2, Chapters 3, 5, 8, 10], [5, Chapter 9], [14, Chapters 2, 3] and their corresponding lecture notes available online by the authors. Please email me if you find any typos or errors.

Donghwan Kim, Dartmouth College. E-mail: [email protected]

1 Unconstrained Convex Optimization (see [5, Chapter 9])

We consider the following unconstrained problem:

    min_{x ∈ ℝ^d} f(x)    (1.1)

where we assume that
– f : ℝ^d → ℝ is convex and continuously differentiable,
– the problem is solvable, i.e., there exists an optimal point x*.

Note that solving (1.1) is equivalent to solving the optimality condition:

    Find x ∈ ℝ^d s.t. ∇f(x) = 0.    (1.2)

Iterative algorithms generate a sequence of points x_0, x_1, ... ∈ dom f with f(x_k) → p*, where p* is the optimal value. The iterative algorithm is terminated, for example, when ||∇f(x_k)|| ≤ ε, where ε > 0 is some specified tolerance.

2 Descent Methods (see [1, Chapter 4] [5, Chapter 9])

2.1 General Descent Methods

Definition 2.1 Let f : ℝ^d → ℝ be a continuously differentiable function over ℝ^d. A vector 0 ≠ d ∈ ℝ^d is called a descent direction of f at x_k if it satisfies

    ∇f(x_k)⊤d < 0,    (2.1)

i.e., it makes an acute angle with the negative gradient −∇f(x_k).

Lemma 2.1 Let f be a continuously differentiable function over ℝ^d, and let x ∈ ℝ^d. Suppose that d is a descent direction of f at x. Then there exists ε > 0 such that

    f(x + sd) < f(x)    (2.2)

for any s ∈ (0, ε].

Proof Since ∇f(x)⊤d < 0, it follows from the definition of the directional derivative that

    lim_{s→0+} (f(x + sd) − f(x)) / s = ∇f(x)⊤d < 0.

Therefore, there exists ε > 0 such that

    (f(x + sd) − f(x)) / s < 0

for any s ∈ (0, ε], which implies f(x + sd) < f(x). □

The outline of general descent methods is as follows.
Algorithm 1 General descent methods
1: Input: x_0 ∈ dom f.
2: for k ≥ 0 do
3:   Determine a descent direction d_k.
4:   Choose a step size s_k > 0 satisfying f(x_k + s_k d_k) < f(x_k).
5:   x_{k+1} = x_k + s_k d_k.
6:   If a stopping criterion is satisfied, then stop.

One can choose a step size s_k at each iteration by one of the following approaches:
– Exact line search: s_k = argmin_{s ≥ 0} f(x_k + s d_k).
– Backtracking line search: starting from an initial s > 0, repeat s ← βs until the following sufficient decrease condition is satisfied:

    f(x + sd) < f(x) + αs ∇f(x)⊤d

with parameters α ∈ (0, 1) and β ∈ (0, 1).
– Constant step size: s_k = s (to be further studied later in this note).

Example 2.1 Exact line search for quadratic functions. Let f(x) = (1/2) x⊤Qx + p⊤x, where Q ∈ 𝕊^d_{++} and p ∈ ℝ^d. Let d ∈ ℝ^d be a descent direction of f at x ∈ ℝ^d. We will derive an explicit formula for the step size generated by exact line search:

    ŝ(x, d) = argmin_{s ≥ 0} f(x + sd),    (2.3)

where we have

    f(x + sd) = (1/2)(x + sd)⊤Q(x + sd) + p⊤(x + sd) = (1/2)(d⊤Qd)s² + (d⊤Qx + d⊤p)s + (1/2) x⊤Qx + p⊤x.

The optimality condition of (2.3) is

    (d/ds) f(x + sd) = (d⊤Qd)s + (d⊤Qx + d⊤p) = 0,

and using ∇f(x) = Qx + p we have

    ŝ(x, d) = − d⊤∇f(x) / (d⊤Qd).

Note that the exact line search is not easy to compute in general.

2.2 Gradient Descent Methods

Gradient descent methods take the negative gradient as a descent direction:

    d_k = −∇f(x_k),    (2.4)

since ∇f(x_k)⊤d_k = −||∇f(x_k)||² < 0 as long as ∇f(x_k) ≠ 0. The outline of gradient descent methods is as follows.

Algorithm 2 Gradient descent methods
1: Input: x_0 ∈ dom f.
2: for k ≥ 0 do
3:   d_k = −∇f(x_k).
4:   Choose a step size s_k > 0 satisfying f(x_k + s_k d_k) < f(x_k).
5:   x_{k+1} = x_k + s_k d_k.
6:   If a stopping criterion is satisfied, then stop.

2.3 Steepest Descent Methods

Definition 2.2 Let ||·|| be any norm on ℝ^d. We define a normalized steepest descent direction (with respect to the norm ||·||) as

    d_nsd = argmin_v { ∇f(x)⊤v : ||v|| = 1 }.    (2.5)
This is a unit-norm step with the most negative directional derivative ∇f(x)⊤v. (Recall that v is a descent direction if ∇f(x)⊤v < 0.) In other words, a normalized steepest descent direction is the direction in the unit ball of ||·|| that extends farthest in the direction −∇f(x).

Definition 2.3 An (unnormalized) steepest descent step is defined as

    d_sd = ||∇f(x)||_* d_nsd,    (2.6)

where ||z||_* = max{ ⟨z, y⟩ : ||y|| ≤ 1 } denotes the dual norm (e.g., ||·||_2 is the dual norm of ||·||_2, and ||·||_1 is the dual norm of ||·||_∞). This satisfies ∇f(x)⊤d_sd = −||∇f(x)||²_*.

Example 2.2 Examples of steepest descent methods.
– Euclidean norm (ℓ_2-norm): d_sd = −∇f(x).
  – The resulting algorithm is a gradient descent method.
– Quadratic P-norm (||z||_P = ||P^{1/2} z||_2 where P ∈ 𝕊^d_{++}): d_sd = −P^{−1} ∇f(x).
  – The resulting algorithm is a preconditioned gradient descent method with a preconditioner P (or P^{−1}). This is equivalent to a gradient descent method after the change of coordinates x̄ = P^{1/2} x.
  – A good choice of P (e.g., P ≈ ∇²f(x*)) makes the condition number of the problem after the change of coordinates x̄ = P^{1/2} x small, which likely makes the problem easier to solve.
– ℓ_1-norm: d_sd = −(∂f(x)/∂x_i) e_i for the index i satisfying ||∇f(x)||_∞ = |(∇f(x))_i|, where e_i is the ith standard basis vector.
  – The resulting algorithm is an instance of coordinate-descent methods that update only one component of x at each iteration.

Remark 2.1 At the end of this term (in Part 6), we will study Newton's descent direction:

    d_nt = −[∇²f(x)]^{−1} ∇f(x),    (2.7)

which is a steepest descent direction at x in the local Hessian norm ||·||_{∇²f(x)} when ∇²f(x) ∈ 𝕊^d_{++}.
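To make Section 2 concrete, the following sketch implements Algorithm 2 with the exact line search of Example 2.1 on a quadratic. It is a minimal illustration, not part of the note; the particular Q, p, starting point, and tolerance are assumptions chosen for the example.

```python
import numpy as np

# Sketch of Algorithm 2 (gradient descent) for f(x) = 0.5 x^T Q x + p^T x,
# using the exact line search of Example 2.1: s = -d^T grad f(x) / (d^T Q d).
# Q, p, x0, and tol are illustrative choices.

def gradient_descent_quadratic(Q, p, x0, tol=1e-8, max_iter=1000):
    x = x0.astype(float)
    for _ in range(max_iter):
        g = Q @ x + p                    # grad f(x) = Qx + p
        if np.linalg.norm(g) <= tol:     # stopping criterion ||grad f(x_k)|| <= eps
            break
        d = -g                           # descent direction d_k = -grad f(x_k)  (2.4)
        s = -(d @ g) / (d @ (Q @ d))     # exact line-search step (Example 2.1)
        x = x + s * d                    # x_{k+1} = x_k + s_k d_k
    return x

Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
p = np.array([-1.0, 1.0])
x_star = gradient_descent_quadratic(Q, p, np.zeros(2))
# By (1.2), the limit point solves grad f(x) = Qx + p = 0.
print(np.allclose(Q @ x_star, -p))
```

Note that for d = −∇f(x) the step simplifies to s = ||g||² / (g⊤Qg) > 0, so each iteration indeed decreases f, as Lemma 2.1 guarantees.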
3 Convergence Analysis of the Gradient Method (see [1, Chapter 4] [2, Chapter 5] [14, Chapter 2])

3.1 Lipschitz Continuity of the Gradient

Definition 3.1 A function f is said to be L-smooth if it is continuously differentiable and its gradient ∇f is Lipschitz continuous over ℝ^d, meaning that there exists L > 0 for which

    ||∇f(x) − ∇f(y)||_2 ≤ L ||x − y||_2 for any x, y ∈ ℝ^d.    (3.1)

We denote the class of functions with Lipschitz continuous gradient with constant L by C_L^{1,1}(ℝ^d) or C_L^{1,1}.¹ One can generalize the choice of the norm as

    ||∇f(x) − ∇f(y)||_* ≤ L ||x − y||,    (3.2)

where ||·||_* is the dual norm of ||·||, but we will not consider this generalization in this class.

Lemma 3.1 If f is twice continuously differentiable, then f is L-smooth if and only if

    −LI ⪯ ∇²f(x) ⪯ LI (or equivalently ||∇²f(x)||_2 ≤ L).    (3.3)

Example 3.1 Smooth functions.
– Quadratic functions: Let Q ∈ 𝕊^d and p ∈ ℝ^d. Then the function f(x) = (1/2) x⊤Qx + p⊤x is a C_L^{1,1} function with L = ||Q||_2, since

    ||∇f(x) − ∇f(y)||_2 = ||Qx + p − (Qy + p)||_2 = ||Q(x − y)||_2 ≤ ||Q||_2 · ||x − y||_2.

– The convex function f(x) = √(||x||²_2 + 1): We have ∇f(x) = x / √(||x||²_2 + 1) and

    ∇²f(x) = (1 / √(||x||²_2 + 1)) I − (1 / (||x||²_2 + 1)^{3/2}) xx⊤ ⪯ (1 / √(||x||²_2 + 1)) I ⪯ I.

Therefore, f is a C_L^{1,1} function with L = 1.

An important result for C_L^{1,1} functions is that they can be bounded above by a certain quadratic function, which is fundamental in convergence proofs of gradient-based methods.

Lemma 3.2 (Descent lemma) Let f ∈ C_L^{1,1}(D) for some L > 0 and a given convex set D ⊆ ℝ^d. Then for any x, y ∈ D,

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2) ||x − y||²_2.    (3.4)

Proof By the fundamental theorem of calculus, we have

    f(y) − f(x) = ∫₀¹ ⟨∇f(x + t(y − x)), y − x⟩ dt.

Therefore,

    f(y) − f(x) = ⟨∇f(x), y − x⟩ + ∫₀¹ ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt.
¹ We denote by C_L^{i,j}(D) for D ⊆ ℝ^d the class of functions that are i-times continuously differentiable on D and whose jth derivative is Lipschitz continuous on D.

Thus,

    |f(y) − f(x) − ⟨∇f(x), y − x⟩| = |∫₀¹ ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt|
        ≤ ∫₀¹ |⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩| dt
        ≤ ∫₀¹ ||∇f(x + t(y − x)) − ∇f(x)||_2 · ||y − x||_2 dt
        ≤ ∫₀¹ tL ||y − x||²_2 dt
        = (L/2) ||y − x||²_2,

where the second inequality uses the generalized Cauchy–Schwarz inequality and the third inequality uses the L-smoothness. □
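The descent lemma can also be checked numerically. The sketch below evaluates the quadratic upper bound (3.4) for the function f(x) = √(||x||²_2 + 1) of Example 3.1, which is L-smooth with L = 1; the randomly sampled test points are an illustrative assumption, not part of the note.

```python
import numpy as np

# Numerical sanity check of the descent lemma (3.4) for
# f(x) = sqrt(||x||_2^2 + 1), which Example 3.1 shows is 1-smooth.

def f(x):
    return np.sqrt(np.dot(x, x) + 1.0)

def grad_f(x):
    # grad f(x) = x / sqrt(||x||_2^2 + 1), as computed in Example 3.1
    return x / np.sqrt(np.dot(x, x) + 1.0)

L = 1.0
rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    x = rng.normal(size=5)
    y = rng.normal(size=5)
    # quadratic upper bound from (3.4): f(x) + <grad f(x), y-x> + (L/2)||y-x||^2
    quad_bound = f(x) + grad_f(x) @ (y - x) + (L / 2) * np.dot(y - x, y - x)
    ok = ok and (f(y) <= quad_bound + 1e-12)
print(ok)
```

Every sampled pair should satisfy the bound, since Lemma 3.2 applies with D = ℝ⁵ and L = 1 here.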