Part 2. Gradient and Subgradient Methods for Unconstrained Convex Optimization

Math 126, Winter 18. Date of current version: January 29, 2018.
Donghwan Kim, Dartmouth College. E-mail: [email protected]

Abstract. This note studies (sub)gradient methods for unconstrained convex optimization. Many parts of this note are based on [1, Chapter 4], [2, Chapters 3, 5, 8, 10], [5, Chapter 9], [14, Chapters 2, 3] and the corresponding lecture notes made available online by their authors. Please email me if you find any typos or errors.

1 Unconstrained Convex Optimization (see [5, Chapter 9])

We consider the following unconstrained problem:

    min_{x ∈ ℝ^d} f(x),    (1.1)

where we assume that
– f : ℝ^d → ℝ is convex and continuously differentiable,
– the problem is solvable, i.e., there exists an optimal point x_*.

Note that solving (1.1) is equivalent to solving the optimality condition:

    find x ∈ ℝ^d such that ∇f(x) = 0.    (1.2)

Iterative algorithms generate a sequence of points x_0, x_1, ... ∈ dom f with f(x_k) → p_*, where p_* is the optimal value. The iteration is terminated, for example, when ||∇f(x_k)|| ≤ ε, where ε > 0 is some specified tolerance.

2 Descent Methods (see [1, Chapter 4] [5, Chapter 9])

2.1 General Descent Methods

Definition 2.1 Let f : ℝ^d → ℝ be a continuously differentiable function over ℝ^d. A vector 0 ≠ d ∈ ℝ^d is called a descent direction of f at x if it satisfies

    ∇f(x)^⊤ d < 0,    (2.1)

i.e., it makes an acute angle with the negative gradient −∇f(x).

Lemma 2.1 Let f be a continuously differentiable function over ℝ^d, and let x ∈ ℝ^d. Suppose that d is a descent direction of f at x. Then there exists ε > 0 such that

    f(x + sd) < f(x)    (2.2)

for any s ∈ (0, ε].

Proof Since ∇f(x)^⊤ d < 0, it follows from the definition of the directional derivative that

    lim_{s → 0+} (f(x + sd) − f(x)) / s = ∇f(x)^⊤ d < 0.

Therefore, there exists ε > 0 such that

    (f(x + sd) − f(x)) / s < 0

for any s ∈ (0, ε].

The outline of general descent methods is as follows.

Algorithm 1 General descent methods
1: Input: x_0 ∈ dom f.
2: for k ≥ 0 do
3:   Determine a descent direction d_k.
4:   Choose a step size s_k > 0 satisfying f(x_k + s_k d_k) < f(x_k).
5:   x_{k+1} = x_k + s_k d_k.
6:   If a stopping criterion is satisfied, then stop.

One can choose a step size s_k at each iteration by one of the following approaches:
– Exact line search: s_k = argmin_{s ≥ 0} f(x_k + s d_k).
– Backtracking line search: starting from an initial s > 0, repeat s ← βs until the following sufficient decrease condition is satisfied:

    f(x + sd) < f(x) + α s ∇f(x)^⊤ d,

  with parameters α ∈ (0, 1) and β ∈ (0, 1).
– Constant step size: s_k = s (to be further studied later in this note).

Example 2.1 Exact line search for quadratic functions. Let f(x) = (1/2) x^⊤ Q x + p^⊤ x, where Q ∈ 𝕊^d_{++} and p ∈ ℝ^d. Let d ∈ ℝ^d be a descent direction of f at x ∈ ℝ^d. We will derive an explicit formula for the step size generated by exact line search:

    ŝ(x, d) = argmin_{s ≥ 0} f(x + sd),    (2.3)

where we have

    f(x + sd) = (1/2)(x + sd)^⊤ Q (x + sd) + p^⊤ (x + sd)
              = (1/2)(d^⊤ Q d) s^2 + (d^⊤ Q x + d^⊤ p) s + (1/2) x^⊤ Q x + p^⊤ x.

The optimality condition of (2.3) is

    d/ds f(x + sd) = (d^⊤ Q d) s + (d^⊤ Q x + d^⊤ p) = 0,

and using ∇f(x) = Qx + p we have

    ŝ(x, d) = − (d^⊤ ∇f(x)) / (d^⊤ Q d).

Note that exact line search is not easy to compute in general.

2.2 Gradient Descent Methods

Gradient descent methods take the negative gradient as a descent direction:

    d_k = −∇f(x_k),    (2.4)

since ∇f(x_k)^⊤ d_k = −||∇f(x_k)||_2^2 < 0 as long as ∇f(x_k) ≠ 0. The outline of gradient descent methods is as follows (a minimal numerical sketch is given after Algorithm 2).

Algorithm 2 Gradient descent methods
1: Input: x_0 ∈ dom f.
2: for k ≥ 0 do
3:   d_k = −∇f(x_k).
4:   Choose a step size s_k > 0 satisfying f(x_k + s_k d_k) < f(x_k).
5:   x_{k+1} = x_k + s_k d_k.
6:   If a stopping criterion is satisfied, then stop.
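The following minimal NumPy sketch puts these pieces together: Algorithm 2 with the backtracking line search described above, plus the exact step of Example 2.1 for the quadratic case. The function names (gradient_descent, backtracking), the parameter values, and the test problem (Q, p) are illustrative choices, not part of the note.

import numpy as np

def backtracking(f, g, x, d, s0=1.0, alpha=0.3, beta=0.5):
    # Shrink s until the sufficient decrease condition
    # f(x + s d) < f(x) + alpha * s * grad_f(x)^T d holds.
    s = s0
    while f(x + s * d) >= f(x) + alpha * s * (g @ d):
        s *= beta
    return s

def gradient_descent(f, grad_f, x0, tol=1e-6, max_iter=1000):
    # Algorithm 2: d_k = -grad_f(x_k), backtracking step size,
    # stop when ||grad_f(x_k)||_2 <= tol.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:
            break
        d = -g                                  # descent direction (2.4)
        s = backtracking(f, g, x, d)
        x = x + s * d
    return x

# Quadratic test problem f(x) = 0.5 x^T Q x + p^T x with Q positive definite.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
p = np.array([-1.0, 0.5])
f = lambda x: 0.5 * (x @ Q @ x) + p @ x
grad_f = lambda x: Q @ x + p

x_hat = gradient_descent(f, grad_f, np.array([5.0, -3.0]))
print(np.linalg.norm(grad_f(x_hat)))            # ~0: optimality condition (1.2)

# Exact line search step of Example 2.1 at x0 along d0 = -grad_f(x0):
x0 = np.array([5.0, -3.0])
d0 = -grad_f(x0)
s_exact = -(d0 @ grad_f(x0)) / (d0 @ Q @ d0)

On this quadratic, the exact step s_exact could equally be used inside the loop in place of backtracking; for general f, only the backtracking (or constant) step is cheap to compute.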
2.3 Steepest Descent Methods

Definition 2.2 Let ||·|| be any norm on ℝ^d. We define a normalized steepest descent direction (with respect to the norm ||·||) as

    d_nsd = argmin_v { ∇f(x)^⊤ v : ||v|| = 1 }.    (2.5)

This is a unit-norm step with most negative directional derivative ∇f(x)^⊤ v. (Recall that v is a descent direction if ∇f(x)^⊤ v < 0.) In other words, a normalized steepest descent direction is the direction in the unit ball of ||·|| that extends farthest in the direction −∇f(x).

Definition 2.3 An (unnormalized) steepest descent step is defined as

    d_sd = ||∇f(x)||_* d_nsd,    (2.6)

where ||z||_* = max{⟨z, y⟩ : ||y|| ≤ 1} denotes the dual norm (e.g., ||·||_2 is the dual norm of ||·||_2, and ||·||_1 is the dual norm of ||·||_∞). This satisfies ∇f(x)^⊤ d_sd = −||∇f(x)||_*^2.

Example 2.2 Examples of steepest descent methods.
– Euclidean norm (ℓ2-norm): d_sd = −∇f(x).
  – The resulting algorithm is a gradient descent method.
– Quadratic P-norm (||z||_P = ||P^{1/2} z||_2 where P ∈ 𝕊^d_{++}): d_sd = −P^{−1} ∇f(x).
  – The resulting algorithm is a preconditioned gradient descent method with a preconditioner P (or P^{−1}). This is equivalent to a gradient descent method with the change of coordinates x̄ = P^{1/2} x.
  – A good choice of P (e.g., P ≈ ∇²f(x_*)) makes the condition number of the problem after the change of coordinates x̄ = P^{1/2} x small, which likely makes the problem easier to solve.
– ℓ1-norm: d_sd = −(∂f(x)/∂x_i) e_i for the index i satisfying ||∇f(x)||_∞ = |(∇f(x))_i|, where e_i is the ith standard basis vector.
  – The resulting algorithm is an instance of coordinate-descent methods that update only one component of x at each iteration.

Remark 2.1 At the end of this term (in Part 6), we will study Newton's descent direction:

    d_nt = −[∇²f(x)]^{−1} ∇f(x),    (2.7)

which is a steepest descent direction at x in the local Hessian norm ||·||_{∇²f(x)} when ∇²f(x) ∈ 𝕊^d_{++}.

3 Convergence Analysis of the Gradient Method (see [1, Chapter 4] [2, Chapter 5] [14, Chapter 2])

3.1 Lipschitz Continuity of the Gradient

Definition 3.1 A function f is said to be L-smooth if it is continuously differentiable and its gradient ∇f is Lipschitz continuous over ℝ^d, meaning that there exists L > 0 for which

    ||∇f(x) − ∇f(y)||_2 ≤ L ||x − y||_2 for any x, y ∈ ℝ^d.    (3.1)

We denote the class of functions with Lipschitz gradient with constant L by C_L^{1,1}(ℝ^d) or C_L^{1,1}. (More generally, we denote by C_L^{i,j}(D), for D ⊆ ℝ^d, the class of functions that are i times continuously differentiable on D and whose jth derivative is Lipschitz continuous on D.) One can generalize the choice of the norm as

    ||∇f(x) − ∇f(y)||_* ≤ L ||x − y||,    (3.2)

where ||·||_* is the dual norm of ||·||, but we will not consider this generalization in this class.

Lemma 3.1 If f is twice continuously differentiable, then f is L-smooth if and only if

    −LI ⪯ ∇²f(x) ⪯ LI for all x ∈ ℝ^d (or equivalently ||∇²f(x)||_2 ≤ L).    (3.3)

Example 3.1 Smooth functions.
– Quadratic functions: Let Q ∈ 𝕊^d and p ∈ ℝ^d. Then the function f(x) = (1/2) x^⊤ Q x + p^⊤ x is a C_L^{1,1} function with L = ||Q||_2, since

    ||∇f(x) − ∇f(y)||_2 = ||Qx + p − (Qy + p)||_2 = ||Q(x − y)||_2 ≤ ||Q||_2 · ||x − y||_2.

– A convex function f(x) = √(1 + ||x||_2^2): We have ∇f(x) = x / √(||x||_2^2 + 1) and

    ∇²f(x) = (1/√(||x||_2^2 + 1)) I − (1/(||x||_2^2 + 1)^{3/2}) x x^⊤ ⪯ (1/√(||x||_2^2 + 1)) I ⪯ I.

  Therefore, f is a C_L^{1,1} function with L = 1.
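As a sanity check on Example 3.1, the short sketch below (an illustration, not part of the note) samples random pairs (x, y) and verifies the Lipschitz bound (3.1) with L = ||Q||_2 for the quadratic and L = 1 for f(x) = √(1 + ||x||_2^2); the random seed, dimension, and variable names are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
dim = 4
A = rng.standard_normal((dim, dim))
Q = A + A.T                                     # symmetric (not necessarily PD)
p = rng.standard_normal(dim)
L_quad = np.linalg.norm(Q, 2)                   # spectral norm ||Q||_2

grad_quad = lambda x: Q @ x + p                 # gradient of 0.5 x^T Q x + p^T x
grad_sqrt = lambda x: x / np.sqrt(x @ x + 1.0)  # gradient of sqrt(1 + ||x||^2)

for _ in range(1000):
    x, y = rng.standard_normal(dim), rng.standard_normal(dim)
    dist = np.linalg.norm(x - y)
    assert np.linalg.norm(grad_quad(x) - grad_quad(y)) <= L_quad * dist + 1e-9
    assert np.linalg.norm(grad_sqrt(x) - grad_sqrt(y)) <= 1.0 * dist + 1e-9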
An important result for C_L^{1,1} functions is that they can be bounded above by a certain quadratic function, which is fundamental in convergence proofs of gradient-based methods.

Lemma 3.2 (Descent lemma) Let f ∈ C_L^{1,1}(D) for some L > 0 and a given convex set D ⊆ ℝ^d. Then for any x, y ∈ D,

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2) ||x − y||_2^2.    (3.4)

Proof By the fundamental theorem of calculus, we have

    f(y) − f(x) = ∫_0^1 ⟨∇f(x + t(y − x)), y − x⟩ dt.

Therefore,

    f(y) − f(x) = ⟨∇f(x), y − x⟩ + ∫_0^1 ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt.

Thus,

    |f(y) − f(x) − ⟨∇f(x), y − x⟩| = |∫_0^1 ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt|
                                   ≤ ∫_0^1 |⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩| dt
                                   ≤ ∫_0^1 ||∇f(x + t(y − x)) − ∇f(x)||_2 · ||y − x||_2 dt
                                   ≤ ∫_0^1 t L ||y − x||_2^2 dt
                                   = (L/2) ||y − x||_2^2,

where the second inequality uses the generalized Cauchy–Schwarz inequality and the third inequality uses the L-smoothness of f.
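As a quick numerical illustration of Lemma 3.2 (again an illustrative sketch, not part of the note), the following checks the quadratic upper bound (3.4) on random pairs for f(x) = √(1 + ||x||_2^2), which by Example 3.1 is C_L^{1,1} with L = 1.

import numpy as np

f = lambda x: np.sqrt(1.0 + x @ x)              # C_L^{1,1} with L = 1
grad_f = lambda x: x / np.sqrt(1.0 + x @ x)
L = 1.0

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    # Quadratic upper bound of the descent lemma (3.4) at x, evaluated at y.
    upper = f(x) + grad_f(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2)
    assert f(y) <= upper + 1e-12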
