
Line-Search Methods for Smooth Unconstrained Optimization

Daniel P. Robinson, Department of Applied Mathematics and Statistics, Johns Hopkins University

September 17, 2020


Outline

1 Generic Linesearch Framework

2 Computing a descent direction pk (search direction)
   Steepest descent direction
   Modified Newton direction
   Quasi-Newton directions for medium-scale problems
   Limited-memory quasi-Newton directions for large-scale problems
   Linear CG method for large-scale problems

3 Choosing the step length αk (linesearch)
   Backtracking-Armijo linesearch
   Strong Wolfe conditions

4 Complete Algorithms
   Steepest descent backtracking-Armijo linesearch method
   Modified Newton backtracking-Armijo linesearch method
   Modified Newton linesearch method based on the Wolfe conditions
   Quasi-Newton linesearch method based on the Wolfe conditions

The problem

\[ \min_{x \in \mathbb{R}^n} \; f(x) \]
where the objective function $f : \mathbb{R}^n \to \mathbb{R}$

assume that f ∈ C1 (sometimes C2) and is Lipschitz continuous
in practice this assumption may be violated, but the algorithms we develop may still work
in practice it is very rare to be able to provide an explicit minimizer

we consider iterative methods: given starting guess x0, generate a sequence {xk} for k = 1, 2, ...

AIM: ensure that (a subsequence) has some favorable limiting properties:
  satisfies first-order necessary conditions
  satisfies second-order necessary conditions

Notation: fk = f(xk), gk = g(xk) = ∇f(xk), Hk = H(xk) = ∇²f(xk)


The basic idea

consider descent methods, i.e., f(xk+1) < f(xk)

calculate a search direction pk from xk and ensure that this direction is a descent direction, i.e.,
\[ g_k^T p_k < 0 \quad \text{if } g_k \neq 0, \]

so that, for small steps along pk, the objective function f will be reduced

calculate a suitable steplength αk > 0 so that

f(xk + αkpk) < fk

computation of αk is the linesearch

the computation of αk may itself require an iterative procedure
the generic update for linesearch methods is given by

xk+1 = xk + αkpk

Issue 1: steps might be too long

Figure: The objective function f(x) = x² and the iterates $x_{k+1} = x_k + \alpha_k p_k$ generated by the descent directions $p_k = (-1)^{k+1}$ and steps $\alpha_k = 2 + 3/2^{k+1}$ from $x_0 = 2$.

decrease in f is not proportional to the size of the directional derivative!

Issue 2: steps might be too short

Figure: The objective function f(x) = x² and the iterates $x_{k+1} = x_k + \alpha_k p_k$ generated by the descent directions $p_k = -1$ and steps $\alpha_k = 1/2^{k+1}$ from $x_0 = 2$.

decrease in f is not proportional to the size of the directional derivative!

What is a practical linesearch?

in the “early” days exact linesearches were used, i.e., pick αk to minimize

f(xk + αpk)

an exact linesearch requires univariate minimization
cheap if f is simple, e.g., a quadratic function
generally very expensive and not cost effective
an exact linesearch may not be much better than an approximate linesearch

modern methods use inexact linesearches

ensure steps are neither too long nor too short
make sure that the decrease in f is proportional to the directional derivative
try to pick an “appropriate” initial stepsize for fast convergence (related to how the search direction pk is computed)

the descent direction (search direction) is also important

“bad” directions may not converge at all
more typically, “bad” directions may converge very slowly


Definition 2.1 (Steepest descent direction)
For a differentiable function f, the search direction
\[ p_k \overset{\text{def}}{=} -\nabla f(x_k) \equiv -g_k \]
is called the steepest-descent direction.

$p_k$ is a descent direction provided $g_k \neq 0$, since $g_k^T p_k = -g_k^T g_k = -\|g_k\|_2^2 < 0$

$p_k$ solves the problem
\[ \min_{p \in \mathbb{R}^n} \; m_k^L(x_k + p) \quad \text{subject to} \quad \|p\|_2 = \|g_k\|_2 \]
where $m_k^L(x_k + p) \overset{\text{def}}{=} f_k + g_k^T p$ and $m_k^L(x_k + p) \approx f(x_k + p)$

Any method that uses the steepest-descent direction is a method of steepest descent.

Observation: the steepest descent direction is also the unique solution to
\[ \min_{p \in \mathbb{R}^n} \; f_k + g_k^T p + \tfrac{1}{2} p^T p \]
approximates the second-derivative (Hessian) matrix by the identity matrix I
  how often is this a good idea? is it a surprise that convergence is typically very slow?

Idea: choose positive definite Bk and compute search direction as the unique minimizer

\[ p_k = \operatorname*{argmin}_{p \in \mathbb{R}^n} \; m_k^Q(p) \overset{\text{def}}{=} f_k + g_k^T p + \tfrac{1}{2} p^T B_k p \]

$p_k$ satisfies $B_k p_k = -g_k$
why must $B_k$ be positive definite?
  $B_k \succ 0$ implies $m_k^Q$ is strictly convex, which implies a unique solution
  if $g_k \neq 0$, then $p_k \neq 0$ and is a descent direction, since $p_k^T g_k = -p_k^T B_k p_k < 0$

pick “intelligent” Bk that has “useful” curvature information

If $H_k \succ 0$ and we choose $B_k = H_k$, then $p_k$ is the Newton direction. Awesome!


Question: How do we choose the positive-definite matrix Bk?
Ideally, Bk is chosen such that

kBk − Hkk is “small”

Bk = Hk when Hk is “sufficiently” positive definite.

Comments:

for the remainder of this section, we omit the suffix k and write $H = H_k$, $B = B_k$, and $g = g_k$.
for a symmetric matrix $A \in \mathbb{R}^{n \times n}$, we use the matrix norm
\[ \|A\|_2 = \max_{1 \le j \le n} |\lambda_j| \]
with $\{\lambda_j\}$ the eigenvalues of A.
the spectral decomposition of H is given by $H = V \Lambda V^T$, where
\[ V = \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix} \quad \text{and} \quad \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) \]
with $H v_j = \lambda_j v_j$ and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$.

H is positive definite if and only if λj > 0 for all j. computing the spectral decomposition is, generally, very expensive!

Method 1: small scale problems (n < 1000)

Algorithm 1 Compute modified Newton matrix B from H

1: input H
2: Choose β > 1, the desired bound on the condition number of B.
3: Compute the spectral decomposition H = VΛV^T.
4: if H = 0 then
5:   Set ε = 1
6: else
7:   Set ε = ‖H‖₂/β > 0
8: end if
9: Compute Λ̄ = diag(λ̄_1, λ̄_2, ..., λ̄_n) with λ̄_j = λ_j if λ_j ≥ ε, and λ̄_j = ε otherwise
10: return B = VΛ̄V^T ≻ 0, which satisfies cond(B) ≤ β

replaces the eigenvalues of H that are “not positive enough” with ε
Λ̄ = Λ + D, where D = diag( max(0, ε − λ_1), max(0, ε − λ_2), ..., max(0, ε − λ_n) ) ⪰ 0
B = H + E, where E = V D V^T ⪰ 0

Question: What are the properties of the resulting search direction?
\[ p = -B^{-1} g = -V \bar{\Lambda}^{-1} V^T g = -\sum_{j=1}^{n} \frac{v_j^T g}{\bar{\lambda}_j} v_j \]
Taking norms and using the orthogonality of V gives
\[ \|p\|_2^2 = \|B^{-1} g\|_2^2 = \sum_{j=1}^{n} \frac{(v_j^T g)^2}{\bar{\lambda}_j^2} = \sum_{\lambda_j \ge \varepsilon} \frac{(v_j^T g)^2}{\lambda_j^2} + \sum_{\lambda_j < \varepsilon} \frac{(v_j^T g)^2}{\varepsilon^2} \]
Thus, we conclude that
\[ \|p\|_2 = O\!\left(\frac{1}{\varepsilon}\right) \quad \text{provided } v_j^T g \neq 0 \text{ for at least one } \lambda_j < \varepsilon \]
the step p becomes unbounded as ε → 0 provided $v_j^T g \neq 0$ for at least one $\lambda_j < \varepsilon$
any indefinite matrix will generally satisfy this property
the next method that we discuss is better!

Method 2: small scale problems (n < 1000)

Algorithm 2 Compute modified Newton matrix B from H

1: input H
2: Choose β > 1, the desired bound on the condition number of B.
3: Compute the spectral decomposition H = VΛV^T.
4: if H = 0 then
5:   Set ε = 1
6: else
7:   Set ε = ‖H‖₂/β > 0
8: end if
9: Compute Λ̄ = diag(λ̄_1, λ̄_2, ..., λ̄_n) with λ̄_j = λ_j if λ_j ≥ ε; λ̄_j = −λ_j if λ_j ≤ −ε; λ̄_j = ε otherwise

10: return B = VΛ̄V^T ≻ 0, which satisfies cond(B) ≤ β

replaces small eigenvalues λ_i of H with ε
replaces “sufficiently negative” eigenvalues λ_i of H with −λ_i
Λ̄ = Λ + D, where D = diag( max(0, −2λ_1, ε − λ_1), ..., max(0, −2λ_n, ε − λ_n) ) ⪰ 0
B = H + E, where E = V D V^T ⪰ 0
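As a quick illustration, here is a minimal NumPy sketch of Method 2 (Algorithm 2). The function name modified_newton_matrix and the default value of β are illustrative choices, not part of the slides; the test data anticipate Example 2.2 below.

```python
# A sketch of Algorithm 2 (Method 2), assuming NumPy; the names and the
# default beta are illustrative, not from the slides.
import numpy as np

def modified_newton_matrix(H, beta=1e8):
    """Return B = V diag(lambda_bar) V^T with cond(B) <= beta (Algorithm 2)."""
    lam, V = np.linalg.eigh(H)              # spectral decomposition H = V diag(lam) V^T
    if np.all(lam == 0.0):                  # H = 0: fall back to eps = 1
        eps = 1.0
    else:
        eps = np.max(np.abs(lam)) / beta    # eps = ||H||_2 / beta > 0
    lam_bar = np.where(lam >= eps, lam,     # keep sufficiently positive eigenvalues,
              np.where(lam <= -eps, -lam,   # flip sufficiently negative ones,
                       eps))                # and push the rest up to eps
    return V @ np.diag(lam_bar) @ V.T

# Data of Example 2.2: the -2 eigenvalue of H is flipped to +2.
H = np.array([[1.0, 0.0], [0.0, -2.0]])
print(np.linalg.eigvalsh(modified_newton_matrix(H)))   # approximately [1., 2.]
```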

Suppose that B = H + E is computed from Algorithm 2 so that

E = B − H = V(Λ¯ − Λ)VT Notes

and since V is orthogonal that

T kEk2 = kV(Λ¯ − Λ)V k2 = kΛ¯ − Λk2 = max |λ¯j − λj| j

By definition
\[ \bar{\lambda}_j - \lambda_j = \begin{cases} 0 & \text{if } \lambda_j \ge \varepsilon \\ \varepsilon - \lambda_j & \text{if } -\varepsilon < \lambda_j < \varepsilon \\ -2\lambda_j & \text{if } \lambda_j \le -\varepsilon \end{cases} \]
which implies that $\|E\|_2 = \max_{1 \le j \le n} \{\, 0,\; \varepsilon - \lambda_j,\; -2\lambda_j \,\}$.

However, since λ1 ≥ λ2 ≥ · · · ≥ λn, we know that

kEk2 = max { 0, ε − λn, −2λn }

if λn ≥ ε, i.e., H is sufficiently positive definite, then E = 0 and B = H

if $\lambda_n < \varepsilon$, then B ≠ H and it can be shown that $\|E\|_2 \le 2 \max(\varepsilon, |\lambda_n|)$
regardless, B is well-conditioned by construction

Example 2.2
Consider
\[ g = \begin{pmatrix} 2 \\ 4 \end{pmatrix} \quad \text{and} \quad H = \begin{pmatrix} 1 & 0 \\ 0 & -2 \end{pmatrix} \]
The Newton direction is
\[ p^N = -H^{-1} g = \begin{pmatrix} -2 \\ 2 \end{pmatrix} \quad \text{so that} \quad g^T p^N = 4 > 0 \;\; \text{(ascent direction)} \]
and $p^N$ is a saddle point of the quadratic model $g^T p + \tfrac{1}{2} p^T H p$ since H is indefinite.
Algorithm 1 (Method 1) generates
\[ B = \begin{pmatrix} 1 & 0 \\ 0 & \varepsilon \end{pmatrix} \]

so that
\[ p = -B^{-1} g = \begin{pmatrix} -2 \\ -4/\varepsilon \end{pmatrix} \quad \text{and} \quad g^T p = -4 - \frac{16}{\varepsilon} < 0 \;\; \text{(descent direction)} \]
Algorithm 2 (Method 2) generates
\[ B = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} \quad \text{so that} \quad p = -B^{-1} g = \begin{pmatrix} -2 \\ -2 \end{pmatrix} \quad \text{and} \quad g^T p = -12 < 0 \;\; \text{(descent direction)} \]

Figure: The Newton direction $p^N$ compared with the modified-Newton direction p produced by Method 1 and by Method 2.

p Method 2 Question: What is the geometric interpretation? Answer: Let the spectral decomposition of H be Notes H = VΛVT and assume that λj 6= 0 for all j. Partition the eigenvector and eigenvalue matrices as ! Λ+ } λj>0  Λ = and V = V+ V− Λ− } λj<0 |{z} |{z} λj>0 λj<0    T   Λ+ V+ So that H = V+ V− T and the Newton direction is Λ− V−  −1   T  N −1 T  Λ+ V+ p = −VΛ V g = − V+ V− −1 T g Λ− V−

 −1   T   Λ+ V+g = − V+ V− −1 T Λ− V−g

−1 T −1 T = −V+Λ+ V+g − V−Λ− V−g

−1 T −1 T def N N = − V+Λ+ V+g − V−Λ− V−g = p+ + p− | {z } | {z } pN pN + − where N −1 T N −1 T p+ = −V+Λ V+g and p− = −V−Λ V−g + − 21 / 106

Notes

Result
Given $m_k^Q(p) = g^T p + \tfrac{1}{2} p^T H p$ with H nonsingular, the Newton direction
\[ p^N = p_+^N + p_-^N = -V_+\Lambda_+^{-1}V_+^T g - V_-\Lambda_-^{-1}V_-^T g \]
is such that
$p_+^N$ minimizes $m_k^Q(p)$ on the subspace spanned by $V_+$.
$p_-^N$ maximizes $m_k^Q(p)$ on the subspace spanned by $V_-$.

22 / 106 Proof: def Q Notes The function ψ(y) = mk (V+y) may be written as

T 1 T T ψ(y) = g V+y + 2 y V+HV+y and has a stationary point y∗ given by

∗ T −1 T y = −(V+HV+) V+g

Since 2 T T T ∇ ψ(y) = V+HV+ = V+VΛV V+ = Λ+ 0 we know that y∗ is, in fact, the unique minimizer of ψ(y) and

∗ −1 T N V+y = −V+Λ+ V+g ≡ p+ For the second part

∗ T −1 T −1 T y = −(V−HV−) V−g = −Λ− V−g so that ∗ −1 T N V−y = −V−Λ− V−g ≡ p−.

N Q Since Λ− ≺ 0, we know that p− maximizes mk (p) on the space spanned by the columns of V−.

23 / 106

Some observations

N p+ is a descent direction for f:

T N T −1 T T −1 g p+ = −g V+Λ+ V+g = −y Λ+ y ≤ 0

def T where y = V+g. The inequality is strict if y 6= 0, i.e., g is not orthogonal to the

column space of V+

N p− is an ascent direction for f:

T N T −1 T T −1 g p− = −g V−Λ− V−g = −y Λ− y ≥ 0

def T where y = V−g. The inequality is strict if y 6= 0, i.e., g is not orthogonal to the

column space of V−

N N N p = p+ + p− may or may not be a descent direction! This is the geometry of the Newton direction. What about the modified Newton direction?

Question: what is the geometry of the modified-Newton step?
Answer: Partition the matrices in the spectral decomposition $H = V\Lambda V^T$ such that
\[ V = \begin{pmatrix} V_+ & V_\varepsilon & V_- \end{pmatrix} \quad \text{and} \quad \Lambda = \begin{pmatrix} \Lambda_+ & & \\ & \Lambda_\varepsilon & \\ & & \Lambda_- \end{pmatrix}, \]
where

\[ \Lambda_+ = \{\lambda_i : \lambda_i \ge \varepsilon\}, \qquad \Lambda_\varepsilon = \{\lambda_i : |\lambda_i| < \varepsilon\}, \qquad \Lambda_- = \{\lambda_i : \lambda_i \le -\varepsilon\} \]

Theorem 2.3 (Properties of the modified-Newton direction)

Suppose that the positive-definite matrix $B_k$ is computed from Algorithm 2 with input $H_k$ and value ε > 0. If $g_k \neq 0$ and $p_k$ is the unique solution to $B_k p = -g_k$, then $p_k$ may be written as $p_k = p_+ + p_\varepsilon + p_-$, where
(i) $p_+$ is a direction of positive curvature that minimizes $m_k^Q(p)$ in the space spanned by the columns of $V_+$;
(ii) $-(p_-)$ is a direction of negative curvature that maximizes $m_k^Q(p)$ in the space spanned by the columns of $V_-$; and
(iii) $p_\varepsilon$ is a multiple of the steepest-descent direction in the space spanned by the columns of $V_\varepsilon$, with $\|p_\varepsilon\|_2 = O(1/\varepsilon)$.

Comments:
implies that singular Hessians $H_k$ may easily cause numerical difficulties
the direction $p_-$ is the negative of $p_-^N$ associated with the Newton step

Notes

Main goals of quasi-Newton methods

Maintain positive-definite “approximation” Bk to Hk

Instead of computing Bk+1 from scratch, e.g., from a spectral decomposition of Hk+1, want to compute Bk+1 as an update to Bk using information at xk+1

inject “real” curvature information from H(xk+1) into Bk+1

Should be cheap to form and compute with Bk be applicable for medium-scale problems (n ≈ 1000 − 10, 000)

Hope that fast convergence (superlinear) of the iterates {xk} computed using Bk can be recovered

The optimization problem solved in the (k + 1)st iterate will be
\[ \min_{p \in \mathbb{R}^n} \; m_{k+1}^Q(p) \overset{\text{def}}{=} f_{k+1} + g_{k+1}^T p + \tfrac{1}{2} p^T B_{k+1} p \]

How do we choose Bk+1?

How about to satisfy the following:

the model should agree with f at xk+1

the gradient of the model should agree with the gradient of f at xk+1

the gradient of the model should agree with the gradient of f at xk (previous point)

These are equivalent to
1. $m_{k+1}^Q(0) = f_{k+1}$   ✓ already satisfied
2. $\nabla m_{k+1}^Q(0) = g_{k+1}$   ✓ already satisfied
3. $\nabla m_{k+1}^Q(-\alpha_k p_k) = g_k$, i.e.,
\[ B_{k+1}(-\alpha_k p_k) + g_{k+1} = g_k \iff B_{k+1}\underbrace{(x_{k+1} - x_k)}_{s_k} = \underbrace{g_{k+1} - g_k}_{y_k} \]

Thus, we aim to cheaply compute a symmetric positive-definite matrix Bk+1 that satisfies the so-called secant equation

\[ B_{k+1} s_k = y_k \quad \text{where} \quad s_k \overset{\text{def}}{=} x_{k+1} - x_k \quad \text{and} \quad y_k \overset{\text{def}}{=} g_{k+1} - g_k \]

28 / 106

Observation: If the matrix $B_{k+1}$ is positive definite, then the curvature condition
\[ s_k^T y_k > 0 \tag{1} \]
is satisfied.

using the fact that $B_{k+1} s_k = y_k$, we see that
\[ s_k^T y_k = s_k^T B_{k+1} s_k \]
which must be positive if $B_{k+1} \succ 0$
the previous relation explains why (1) is called a curvature condition

(1) will hold for any two points xk and xk+1 if f is strictly convex

(1) does not always hold for any two points xk and xk+1 if f is nonconvex

for nonconvex functions, we need to be careful how we choose αk that defines xk+1 = xk + αkpk (see Wolfe conditions for linesearch strategies)

Note: for the remainder, I will drop the subscript k from all terms.

The problem
Given a symmetric positive-definite matrix B, and vectors s and y, find a symmetric positive-definite matrix B̄ that satisfies the secant equation B̄s = y

Want a cheap update, so let us first consider a rank-one update of the form B¯ = B + uvT for some vectors u and v that we seek. Derivation: Case 1: Bs = y Choose u = v = 0 since Bs¯ = (B + uvT)s = Bs + uvTs = y Case 2: Bs 6= y If u = 0 or v = 0 then Bs¯ = (B + uvT)s = Bs + uvTs = Bs 6= y So search for u 6= 0 and v 6= 0. Symmetry of B and desired symmetry of B¯ imply that B¯ = B¯T ⇐⇒ B + uvT = BT + vuT ⇐⇒ uvT = vuT But uvT = vuT, u 6= 0, and v 6= 0 imply that v = αu for some α 6= 0 (exercise). Thus, we should search for an α 6= 0 and u 6= 0 such that the secant condition is satisfied by ¯ T B = B + αuu 30 / 106

From previous slide, we hope to find an α > 0 and u 6= 0 so that T B¯ = B + αuu Notes satisfies want y = Bs¯ = Bs + αuuTs = Bs + α(uTs)u which implies α(uTs)u = y − Bs 6= 0. This implies that u = β(y − Bs) for some β 6= 0. Plugging back in, we are searching for B¯ = B + αβ2(y − Bs)(y − Bs)T = B + γ(y − Bs)(y − Bs)T for some γ 6= 0 that satisfies the secant equation

want y = Bs¯ = Bs + γ(y − Bs)Ts(y − Bs) which implies y − Bs = γ(y − Bs)Ts(y − Bs) which means we can choose 1 γ = provided (y − Bs)Ts 6= 0. (y − Bs)Ts

31 / 106 Definition 2.4 (SR1 quasi-Newton formula) Notes If (y − Bs)Ts 6= 0 for some symmetric matrix B and vectors s and y, then the symmetric rank-1 (SR1) quasi-Newton formula for updating B is

\[ \bar{B} = B + \frac{(y - Bs)(y - Bs)^T}{(y - Bs)^T s} \]
and satisfies
  B̄ is symmetric
  B̄s = y (secant condition)

In practice, we choose some κ ∈ (0, 1) and then use
\[ \bar{B} = \begin{cases} B + \dfrac{(y - Bs)(y - Bs)^T}{(y - Bs)^T s} & \text{if } |s^T(y - Bs)| > \kappa \|s\|_2 \|y - Bs\|_2 \\ B & \text{otherwise} \end{cases} \]
to avoid numerical error.
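For concreteness, a small NumPy sketch of this safeguarded SR1 update follows; the name sr1_update and the value of κ are illustrative. Running it on the data of Example 2.5 below reproduces the indefinite B̄.

```python
# A sketch of the safeguarded SR1 update above; kappa is the user-chosen
# constant in (0, 1) from the slide (its value here is illustrative).
import numpy as np

def sr1_update(B, s, y, kappa=1e-8):
    r = y - B @ s
    if abs(s @ r) > kappa * np.linalg.norm(s) * np.linalg.norm(r):
        return B + np.outer(r, r) / (r @ s)
    return B                                  # skip the update to avoid numerical error

# Example 2.5 data: the update destroys positive definiteness.
B = np.array([[2.0, 1.0], [1.0, 1.0]])
s = np.array([-1.0, -1.0])
y = np.array([-3.0, 2.0])
B_bar = sr1_update(B, s, y)
print(B_bar)                                  # [[2, 1], [1, -3]]
print(np.linalg.eigvalsh(B_bar))              # approx [-3.19, 2.19]: indefinite
```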

Question: Is B¯ positive definite if B is positive definite? Answer: Sometimes....but sometimes not!

32 / 106

Example 2.5 (SR1 update is not necessarily positive definite)
Consider the SR1 update with
\[ B = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \qquad s = \begin{pmatrix} -1 \\ -1 \end{pmatrix}, \qquad y = \begin{pmatrix} -3 \\ 2 \end{pmatrix} \]
The matrix B is positive definite with eigenvalues ≈ {0.38, 2.6}. Moreover,
\[ y^T s = \begin{pmatrix} -3 & 2 \end{pmatrix} \begin{pmatrix} -1 \\ -1 \end{pmatrix} = 1 > 0 \]
so that the necessary condition for B̄ to be positive definite holds. The SR1 update gives
\[ y - Bs = \begin{pmatrix} -3 \\ 2 \end{pmatrix} - \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} -1 \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 4 \end{pmatrix}, \qquad (y - Bs)^T s = \begin{pmatrix} 0 & 4 \end{pmatrix} \begin{pmatrix} -1 \\ -1 \end{pmatrix} = -4 \]
\[ \bar{B} = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} + \frac{1}{(-4)} \begin{pmatrix} 0 \\ 4 \end{pmatrix} \begin{pmatrix} 0 & 4 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & -3 \end{pmatrix} \]
The matrix B̄ is indefinite with eigenvalues ≈ {2.19, −3.19}.

Theorem 2.6 (convergence of SR1 update)
Suppose that
  f is C² on ℝⁿ
  $\lim_{k \to \infty} x_k = x^*$ for some vector x*
  ∇²f is bounded and Lipschitz continuous in a neighborhood of x*
  the steps $s_k$ are uniformly linearly independent
  $|s_k^T (y_k - B_k s_k)| > \kappa \|s_k\|_2 \|y_k - B_k s_k\|_2$ for some κ ∈ (0, 1)
Then, the matrices generated by the SR1 formula satisfy
\[ \lim_{k \to \infty} B_k = \nabla^2 f(x^*) \]

Comment: the SR1 updates may converge to an indefinite matrix

34 / 106

Notes

Some comments if B is symmetric, then the SR1 update B¯ (when it exists) is symmetric if B is positive definite, then the SR1 update B¯ (when it exists) is not necessarily positive definite linesearch methods require positive-definite matrices SR1 updating is not appropriate (in general) for linesearch methods SR1 updating is an attractive option for optimization methods that effectively utilize indefinite matrices (e.g., trust-region methods) to derive an updating strategy that produces positive-definite matrices, we have to look beyond rank-one updates. Would a rank-two update work?

35 / 106 Notes For hereditary positive definiteness we must consider updates of at least rank two.

Result U is a symmetric matrix of rank two if and only if

U = βuuT + δwwT, for some nonzero β and δ, with u and w linearly independent.

If B¯ = B + U for some rank-two matrix U and Bs¯ = y for some vectors s and y, then

Bs¯ = (B + U)s = y, and it follows that y − Bs = Us = β(uTs)u + δ(wTs)w

36 / 106

Let us make the “obvious” choices of

u = Bs and w = y Notes which gives the rank-two update U = βBs(Bs)T + δyyT and we now search for β and δ so that Bs¯ = y is satisfied. Substituting, we have

\[ \text{want } y = \bar{B}s = (B + \beta Bs(Bs)^T + \delta y y^T)s = Bs + \beta (s^T B s) Bs + \delta (y^T s) y \]
Making the “simple” choice of
\[ \beta = -\frac{1}{s^T B s} \quad \text{and} \quad \delta = \frac{1}{y^T s} \]
gives
\[ \bar{B}s = Bs - \frac{1}{s^T B s}(s^T B s) Bs + \frac{1}{y^T s}(y^T s) y = Bs - Bs + y = y \]
as desired. This gives the Broyden, Fletcher, Goldfarb, Shanno (BFGS) quasi-Newton update formula.

Broyden, Fletcher, Goldfarb, Shanno (BFGS) update
Given a symmetric positive-definite matrix B and vectors s and y that satisfy $y^T s > 0$, the quasi-Newton BFGS update formula is given by
\[ \bar{B} = B - \frac{Bs(Bs)^T}{s^T B s} + \frac{y y^T}{y^T s} \tag{2} \]

Result
If B is positive definite and $y^T s > 0$, then
\[ \bar{B} = B - \frac{Bs(Bs)^T}{s^T B s} + \frac{y y^T}{y^T s} \]
is positive definite and satisfies $\bar{B} s = y$.

Proof: Homework assignment. (hint: see the next slide)
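A minimal NumPy sketch of the update (2) is given below; it also checks numerically, on random data with yᵀs > 0, the two properties claimed in the Result (secant condition and positive definiteness). The function name and the test data are illustrative.

```python
# A sketch of the BFGS update (2); assumes y^T s > 0 as required.
import numpy as np

def bfgs_update(B, s, y):
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Quick numerical check of the Result on random data (illustrative only).
rng = np.random.default_rng(0)
B = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)
assert y @ s > 0                              # curvature condition holds for this data
B_bar = bfgs_update(B, s, y)
print(np.allclose(B_bar @ s, y))              # True: secant condition
print(np.all(np.linalg.eigvalsh(B_bar) > 0))  # True: positive definite
```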

38 / 106

A second derivation of the BFGS formula Notes Let M = B−1 for some given symmetric positive definite matrix B. Solve

T minimize kM¯ − MkW subject to M¯ = M¯ , My¯ = s ¯ n×n M∈R for a “closest” symmetric matrix M¯, where

n n 1/2 1/2 X X 2 kAkW = kW AW kF and kCkF = cij i=1 j=1 for any weighting matrix W that satisfies Wy = s. If we choose

Z 1 W = G−1 and G = [∇2f(y + τ s)]−1 dτ 0 then the solution is  syT   ysT  ssT M¯ = I − M I − + . yTs yTs yTs Computing the inverse B¯ = M¯−1 using the Sherman-Morrison-Woodbury formula gives 1 1 B¯ = B − Bs(Bs)T + yyT (BFGS formula again) sTBs yTs so that B¯ 0 if and only if M¯ 0. 39 / 106 Recall: given the positive-definite matrix B and vectors s and y satisfying sTy > 0, the BFGS formula (2) may be written as Notes B¯ = B − aaT + bbT where Bs y a = and b = (sTBs)1/2 (yTs)1/2

Limited-memory BFGS update

Suppose at iteration k we have m previous pairs $(s_i, y_i)$ for i = k − m, ..., k − 1. Then the limited-memory BFGS (L-BFGS) update with memory m and positive-definite seed matrix $B_k^0$ may be written in the form
\[ B_k = B_k^0 + \sum_{i=k-m}^{k-1} \left[ b_i b_i^T - a_i a_i^T \right] \]
for some vectors $a_i$ and $b_i$ that may be recovered via an unrolling formula (next slide).

the positive-definite seed matrix $B_k^0$ is often chosen to be diagonal
essentially performs m BFGS updates to the seed matrix $B_k^0$
a different seed matrix may be used for each iteration k
  added flexibility allows “better” seed matrices to be used as iterations proceed

41 / 106

Algorithm 3 Unrolling the L-BFGS formula

1: input m ≥ 1, $\{(s_i, y_i)\}$ for i = k − m, ..., k − 1, and $B_k^0$
2: for i = k − m, ..., k − 1 do
3:   $b_i \leftarrow y_i / (y_i^T s_i)^{1/2}$
4:   $a_i \leftarrow B_k^0 s_i + \sum_{j=k-m}^{i-1} \left[ (b_j^T s_i) b_j - (a_j^T s_i) a_j \right]$
5:   $a_i \leftarrow a_i / (s_i^T a_i)^{1/2}$
6: end for

typically 3 ≤ m ≤ 30
$b_i$ and $b_j^T s_i$ should be saved and reused from previous iterations (exceptions are $b_{k-1}$ and $b_j^T s_{k-1}$ because they depend on the most recent data obtained)
with the previous savings in computation and assuming $B_k^0 = I$, the total cost to get the $a_i$, $b_i$ vectors is approximately $\tfrac{3}{2} m^2 n$
the cost of a matrix-vector product with $B_k$ requires $4mn$ multiplications
other so-called compact representations are slightly more efficient
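The following NumPy sketch of the unrolling above may help; it assumes the stored pairs satisfy the curvature condition $y_i^T s_i > 0$ so that the square roots are well defined, and it does not include the caching of $b_j^T s_i$ mentioned above. The function names are illustrative.

```python
# A rough sketch of Algorithm 3 (unrolling) and of a matrix-vector product
# with the resulting B_k; assumes y_i^T s_i > 0 for every stored pair.
import numpy as np

def lbfgs_unroll(B0, S, Y):
    """Return vectors (a_i, b_i) with B_k = B0 + sum_i (b_i b_i^T - a_i a_i^T).

    S, Y are lists of the m stored pairs (s_i, y_i), oldest first;
    B0 is the (often diagonal) seed matrix, given here as a dense array."""
    A, Bvecs = [], []
    for s, y in zip(S, Y):
        b = y / np.sqrt(y @ s)
        a = B0 @ s \
            + sum((bj @ s) * bj for bj in Bvecs) \
            - sum((aj @ s) * aj for aj in A)
        a = a / np.sqrt(s @ a)
        A.append(a)
        Bvecs.append(b)
    return A, Bvecs

def lbfgs_matvec(B0, A, Bvecs, v):
    """Compute B_k v without ever forming B_k (about 4mn multiplications)."""
    return B0 @ v + sum((b @ v) * b for b in Bvecs) - sum((a @ v) * a for a in A)
```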

42 / 106 The problem of interest Notes Given a symmetric positive-definite matrix A, solve the linear system

\[ Ax = b \]
which is equivalent to finding the unique minimizer of
\[ \min_{x \in \mathbb{R}^n} \; q(x) \overset{\text{def}}{=} \tfrac{1}{2} x^T A x - b^T x \tag{3} \]
because $\nabla q(x) = Ax - b$ and $\nabla^2 q(x) = A \succ 0$ implies that
\[ \nabla q(x) = Ax - b = 0 \iff Ax = b \]
We seek a method that cheaply computes approximate solutions to the system Ax = b, or equivalently, approximately minimizes (3).
We do not want to compute a factorization of A, because it is assumed to be too expensive.
We are interested in very large-scale problems.
We allow ourselves to compute matrix-vector products, i.e., Av for any vector v.
Why do we care about this? We will see soon!
Notation: $r(x) \overset{\text{def}}{=} Ax - b$ and $r_k \overset{\text{def}}{=} A x_k - b$.

44 / 106

\[ q(x) = \tfrac{1}{2} x^T A x - b^T x \]

Algorithm 4 Coordinate-descent algorithm for minimizing q

1: input initial guess $x_0 \in \mathbb{R}^n$ and a symmetric positive-definite matrix $A \in \mathbb{R}^{n \times n}$
2: loop
3:   for i = 0, 1, ..., n − 1 do
4:     Compute $\alpha_i$ as the solution to $\min_{\alpha \in \mathbb{R}} q(x_i + \alpha e_{i+1})$  (4)
5:     $x_{i+1} \leftarrow x_i + \alpha_i e_{i+1}$
6:   end for
7:   $x_0 \leftarrow x_n$
8: end loop

$e_j$ represents the jth coordinate vector, i.e., $e_j = (0 \;\; 0 \;\; \ldots \;\; 1 \;\; \ldots \;\; 0 \;\; 0)^T \in \mathbb{R}^n$ with the 1 in the jth position
Is it easy to solve (4)? Yes! (exercise)
Does this work?

Figure (left): Coordinate descent does not converge in n steps. Looks like it will be very slow! Axes are not aligned with the coordinate directions $e_i$.
Figure (right): Coordinate descent converges in n steps. Very good! Axes are aligned with the coordinate directions $e_i$.

Fact: the axes are determined by the eigenvectors of A.
Observation: coordinate descent converges in no more than n steps if the eigenvectors align with the coordinate directions, i.e., V = I where V is the eigenvector matrix of A.
Implication: Using the spectral decomposition of A, A = VΛV^T = IΛI^T = Λ (a diagonal matrix).
Conclusion: Coordinate descent converges in ≤ n steps when A is diagonal.

First thought: This is only good for diagonal matrices. That sucks! Second thought: Can we use a transformation of variables so that the transformed Notes Hessian is diagonal? Let S be a square nonsingular matrix and define the transformed variables y = S−1x ⇐⇒ Sy = x.

Consider the transformed quadratic function qb defined as def T T T T T T q(y ) = q(Sy ) ≡ 1 y S AS y − ( S b ) y = 1 y Ayb − b y b 2 | {z } |{z} 2 Ab b ∗ ∗ ∗ ∗ so that a minimizer x of q satisfies x = Sy where y is a minimizer of qb. T If Ab = S AS is diagonal, then applying the coordinate descent Algorithm 4 to qb would converge to y∗ in ≤ n steps. Ab is diagonal if and only if T si Asj = 0 for all i 6= j,

where  S = s0 s1 ... sn−1 Line 5 of Algorithm 4 (in the transformed variables) is

yi+1 = yi + αiei+1. Multiplying by S and using Sy = x leads to

xi+1 = Syi+1 = S(yi + αiei+1) = Syi + αiSei+1 = xi + αisi

47 / 106 Notes

Definition 2.7 (A-conjugate directions)

A set of nonzero vectors $\{s_0, s_1, \ldots, s_l\}$ is said to be conjugate with respect to the symmetric positive-definite matrix A (A-conjugate for short) if $s_i^T A s_j = 0$ for all i ≠ j.

We want to find A-conjugate vectors. The eigenvectors of A are A-conjugate since the spectral decomposition gives

A = VΛVT ⇐⇒ VTAV = Λ (diagonal matrix of eigenvalues)

Computing the spectral decomposition is too expensive for large scale problems. We want to find A-conjugate vectors cheaply! Let us look at a general algorithm.

48 / 106

Algorithm 5 General A-conjugate direction algorithm Notes

1: input symmetric positive-definite matrix $A \in \mathbb{R}^{n \times n}$, and vectors $x_0$ and $b \in \mathbb{R}^n$.
2: Set $r_0 \leftarrow A x_0 - b$ and $k \leftarrow 0$.
3: Choose direction $s_0$.
4: while $r_k \neq 0$ do
5:   Compute $\alpha_k \leftarrow \operatorname*{argmin}_{\alpha \in \mathbb{R}} q(x_k + \alpha s_k)$
6:   Set $x_{k+1} \leftarrow x_k + \alpha_k s_k$.
7:   Set $r_{k+1} \leftarrow A x_{k+1} - b$.
8:   Pick a new A-conjugate direction $s_{k+1}$.
9:   Set $k \leftarrow k + 1$.
10: end while

How do we compute αk? How do we compute the A-conjugate directions efficiently?

Theorem 2.8 (general A-conjugate algorithm converges) n For any x0 ∈ R , the general A-conjugate direction Algorithm 5 computes the solution x∗ of the system Ax = b in at most n steps.

Proof: Theorem 5.1 in Nocedal and Wright.

Exercise: show that the $\alpha_k$ that solves the minimization problem on line 5 of Algorithm 5 satisfies
\[ \alpha_k = -\frac{s_k^T r_k}{s_k^T A s_k}. \]

Before discussing how we compute the A-conjugate directions {si}, let us look at some properties of them if we assume we already have them.

Theorem 2.9 (Expanding subspace minimization)

n Let x0 ∈ R be any starting point and suppose that {xk} is the sequence of iterates generated by the A-conjugate direction Algorithm 5 for a set of A-conjugate directions {sk}. Then, T rk si = 0 for all i = 0,... k − 1 1 T T and xk is the unique minimizer of q(x) = 2 x Ax − b x over the set

{x : x = x0 + span(s0, s1,..., sk−1)}

Proof: See Theorem 5.2 in Nocedal and Wright.

50 / 106

Question: If $\{s_i\}_{i=0}^{k-1}$ are A-conjugate, how do we compute $s_k$ to be A-conjugate?
Answer: Theorem 2.9 shows that $r_k$ is orthogonal to the previously explored space. To keep it simple and hopefully cheap, let us try
\[ s_k = -r_k + \beta s_{k-1} \quad \text{for some } \beta. \tag{5} \]
If this is going to work, then the A-conjugacy condition and (5) require that
\[ 0 = s_{k-1}^T A s_k = s_{k-1}^T A(-r_k + \beta s_{k-1}) = -s_{k-1}^T A r_k + \beta\, s_{k-1}^T A s_{k-1} \]
and solving for β gives
\[ \beta = \frac{s_{k-1}^T A r_k}{s_{k-1}^T A s_{k-1}}. \]

With this choice of β, the vector sk is A-conjugate!.....but, how do we get started? Choosing s0 = −r0, i.e., a steepest descent direction, makes sense.

This gives us a complete algorithm; the linear CG algorithm!

51 / 106 Notes

Algorithm 6 Preliminary linear CG algorithm

n×n n 1: input symmetric positive-definite matrix A ∈ R , and vectors x0 and b ∈ R . 2: Set r0 ← Ax0 − b, s0 ← −r0, and k ← 0. 3: while rk 6= 0 do T T 4: Set αk ← −( sk rk)/( sk Ask). 5: Set xk+1 ← xk + αksk. 6: Set rk+1 ← Axk+1 − b. T T 7: Set βk+1 ← ( sk Ark+1)/( sk Ask). 8: Set sk+1 ← −rk+1 + βk+1sk. 9: Set k ← k + 1. 10: end while

This preliminary version requires two matrix-vector products per iteration. By using some more “tricks” and algebra, we can reduce this to a single matrix-vector multiplication per iteration.

52 / 106

Algorithm 7 Linear CG algorithm

1: input symmetric positive-definite matrix $A \in \mathbb{R}^{n \times n}$, and vectors $x_0$ and $b \in \mathbb{R}^n$.
2: Set $r_0 \leftarrow A x_0 - b$, $s_0 \leftarrow -r_0$, and $k \leftarrow 0$.
3: while $r_k \neq 0$ do
4:   Set $\alpha_k \leftarrow (r_k^T r_k)/(s_k^T A s_k)$.
5:   Set $x_{k+1} \leftarrow x_k + \alpha_k s_k$.
6:   Set $r_{k+1} \leftarrow r_k + \alpha_k A s_k$.
7:   Set $\beta_{k+1} \leftarrow (r_{k+1}^T r_{k+1})/(r_k^T r_k)$.
8:   Set $s_{k+1} \leftarrow -r_{k+1} + \beta_{k+1} s_k$.
9:   Set $k \leftarrow k + 1$.
10: end while

Only requires the single matrix-vector multiplication $A s_k$.
More practical to use a stopping condition like $\|r_k\| \le 10^{-8} \max(1, \|r_0\|_2)$.
Required computation:
  2 inner products
  3 vector sums
  1 matrix-vector multiplication
Ideal for large (sparse) matrices A.
Converges quickly if cond(A) is small or the eigenvalues are “clustered”
  convergence can be accelerated by using preconditioning
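A direct NumPy transcription of Algorithm 7 is sketched below, with the relative stopping rule suggested above used in place of the exact test $r_k = 0$; the names and the random test problem are illustrative.

```python
# A sketch of the linear CG Algorithm 7 with a relative residual stopping test.
import numpy as np

def linear_cg(A, b, x0, tol=1e-8, max_iter=None):
    x = x0.copy()
    r = A @ x - b
    s = -r
    stop = tol * max(1.0, np.linalg.norm(r))
    for _ in range(max_iter or len(b)):
        if np.linalg.norm(r) <= stop:
            break
        As = A @ s                            # the single matrix-vector product
        alpha = (r @ r) / (s @ As)
        x = x + alpha * s
        r_new = r + alpha * As
        beta = (r_new @ r_new) / (r @ r)
        s = -r_new + beta * s
        r = r_new
    return x

# Illustrative test: a random well-conditioned SPD system.
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = linear_cg(A, b, np.zeros(50))
print(np.linalg.norm(A @ x - b))              # small residual
```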

We want to use CG to compute an approximate solution p to
\[ \min_{p \in \mathbb{R}^n} \; f + g^T p + \tfrac{1}{2} p^T H p \]
H may be indefinite (even singular) and we must ensure that p is a descent direction

Algorithm 8 Newton CG algorithm for computing descent directions

1: input symmetric matrix $H \in \mathbb{R}^{n \times n}$ and vector g
2: Set $p_0 = 0$, $r_0 \leftarrow g$, $s_0 \leftarrow -g$, and $k \leftarrow 0$.
3: while $r_k \neq 0$ do
4:   if $s_k^T H s_k > 0$ then
5:     Set $\alpha_k \leftarrow (r_k^T r_k)/(s_k^T H s_k)$.
6:   else
7:     if k = 0, then return $p_k = -g$ else return $p_k$ end if
8:   end if
9:   Set $p_{k+1} \leftarrow p_k + \alpha_k s_k$.
10:  Set $r_{k+1} \leftarrow r_k + \alpha_k H s_k$.
11:  Set $\beta_{k+1} \leftarrow (r_{k+1}^T r_{k+1})/(r_k^T r_k)$.
12:  Set $s_{k+1} \leftarrow -r_{k+1} + \beta_{k+1} s_k$.
13:  Set $k \leftarrow k + 1$.
14: end while
15: return $p_k$

Made the replacements x ← p, A ← H, and b ← −g in Algorithm 7. 54 / 106
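A sketch of Algorithm 8 in NumPy follows; the exact test $r_k = 0$ is replaced by a small tolerance, and the iteration cap defaults to n, both of which are practical choices rather than part of the slide.

```python
# A sketch of the Newton-CG Algorithm 8: CG on H p = -g, abandoned as soon
# as a direction of nonpositive curvature is detected.
import numpy as np

def newton_cg(H, g, tol=1e-8, max_iter=None):
    n = len(g)
    p = np.zeros(n)
    r = g.copy()
    s = -g.copy()
    for _ in range(max_iter or n):
        if np.linalg.norm(r) <= tol:
            break
        Hs = H @ s
        if s @ Hs <= 0:                       # nonpositive curvature (lines 6-8)
            return -g if np.allclose(p, 0) else p
        alpha = (r @ r) / (s @ Hs)
        p = p + alpha * s
        r_new = r + alpha * Hs
        beta = (r_new @ r_new) / (r @ r)
        s = -r_new + beta * s
        r = r_new
    return p
```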

Theorem 2.10 (Newton CG algorithm computes descent directions.) Notes If g 6= 0 and Algorithm 8 terminates on iterate K, then every iterate pk of the algorithm T satisfies pk g < 0 for k ≤ K.

Proof: Case 1: Algorithm 8 terminates on iteration K = 0. T T 2 It is immediate that g p0 = −g g = −kgk2 < 0 since g 6= 0. Case 2: Algorithm 8 terminates on iteration K ≥ 1. First observe from p0 = 0, s0 = −g, r0 = g, g 6= 0, and lines 10 and 9 of Algorithm 8 that rTr kgk4 T = T( + α ) = α T = 0 0 T = − 2 < . g p1 g p0 0s0 0g s0 T g s0 T 0 (6) s0 Hs0 g Hg By unrolling the “p” update, using A-conjugacy of {si}, and definition of rk, we get k−1 T T T T X T sk rk = sk (Hpk + g) = sk g + sk H αjsj = sk g. j=0 T 2 The previous line, lines 9 and 5 of Algorithm 8, and the fact that sk rk = −krkk2 gives T 4 T T T T T T rk rk T T krkk2 T g pk+1 = g pk + αkg sk = g pk + αksk rk = g pk + T sk rk = g pk − T < g pk. sk Hsk sk Hsk Combining the previous inequality and (6) results in kgk4 T < T < ··· < T < − 2 < . g pk+1 g pk g p1 T 0 g Hg 55 / 106 Algorithm 9 Backtracking Notes

1: input xk, pk

2: Choose $\alpha_{\text{init}} > 0$ and $\tau \in (0, 1)$
3: Set $\alpha^{(0)} = \alpha_{\text{init}}$ and $l = 0$
4: until $f(x_k + \alpha^{(l)} p_k)$ “<” $f(x_k)$
5:   Set $\alpha^{(l+1)} = \tau \alpha^{(l)}$
6:   Set $l \leftarrow l + 1$
7: end until
8: return $\alpha_k = \alpha^{(l)}$

typical choices (not always)

I αinit = 1 I τ = 1/2 this prevents the step from getting too small does not prevent the step from being too long decrease in f may not be proportional to the directional derivative need to tighten the requirement that

(l)  f xk + α pk “<” f(xk)

58 / 106

The Armijo condition
Given a point $x_k$ and search direction $p_k$, we say that $\alpha_k$ satisfies the Armijo condition if
\[ f(x_k + \alpha_k p_k) \le f(x_k) + \eta \alpha_k g_k^T p_k \tag{7} \]
for some η ∈ (0, 1), e.g., η = 0.1 or even η = 0.0001

Purpose: ensure the decrease in f is substantial relative to the directional derivative.

Figure: The Armijo condition as a function of α, showing the curve $f(x_k + \alpha p_k)$, the tangent line $f(x_k) + \alpha g_k^T p_k$, and the sufficient-decrease line with the relaxed slope.

Algorithm 10 A backtracking-Armijo linesearch

1: input xk, pk

2: Choose $\alpha_{\text{init}} > 0$, $\eta \in (0, 1)$, and $\tau \in (0, 1)$
3: Set $\alpha^{(0)} = \alpha_{\text{init}}$ and $\ell = 0$
4: until $f(x_k + \alpha^{(\ell)} p_k) \le f(x_k) + \eta \alpha^{(\ell)} g_k^T p_k$
5:   Set $\alpha^{(\ell+1)} = \tau \alpha^{(\ell)}$
6:   Set $\ell \leftarrow \ell + 1$
7: end until
8: return $\alpha_k = \alpha^{(\ell)}$
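A short Python sketch of Algorithm 10 is given below; the default parameter values and the cap on the number of backtracks are illustrative safeguards, not part of the algorithm as stated.

```python
# A sketch of the backtracking-Armijo linesearch (Algorithm 10);
# f is the objective and g_k the gradient at x_k.
import numpy as np

def backtracking_armijo(f, x_k, g_k, p_k, alpha_init=1.0, eta=1e-4, tau=0.5,
                        max_backtracks=50):
    f_k = f(x_k)
    slope = eta * (g_k @ p_k)                 # eta * g_k^T p_k < 0 for a descent direction
    alpha = alpha_init
    for _ in range(max_backtracks):           # practical cap on the until-loop
        if f(x_k + alpha * p_k) <= f_k + alpha * slope:
            break
        alpha *= tau                          # shrink the step
    return alpha
```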

Does this always terminate? Does it give us what we want?

60 / 106

Theorem 3.1 (Satisfying the Armijo condition)
Suppose that
  f ∈ C¹ and g(x) is Lipschitz continuous with Lipschitz constant γ(x)
  p is a descent direction at x
Then for any given η ∈ (0, 1), the Armijo condition
\[ f(x + \alpha p) \le f(x) + \eta \alpha\, g(x)^T p \]
is satisfied for all $\alpha \in [0, \alpha_{\max}(x)]$, where
\[ \alpha_{\max}(x) := \frac{2(\eta - 1) g(x)^T p}{\gamma(x)\|p\|_2^2} > 0 \]
Proof: A Taylor approximation and $\alpha \le \frac{2(\eta - 1) g(x)^T p}{\gamma(x)\|p\|_2^2}$ imply that
\[ f(x + \alpha p) \le f(x) + \alpha g(x)^T p + \tfrac{1}{2}\gamma(x)\alpha^2\|p\|_2^2 \le f(x) + \alpha g(x)^T p + \alpha(\eta - 1) g(x)^T p = f(x) + \alpha\eta\, g(x)^T p \]

Theorem 3.2 (Bound on αk when using an Armijo backtracking linesearch)
Suppose that
  f ∈ C¹ and g(x) is Lipschitz continuous with Lipschitz constant $\gamma_k$ at $x_k$

pk is a descent direction at xk

Then for any chosen η ∈ (0, 1) the stepsize $\alpha_k$ generated by the backtracking-Armijo linesearch terminates with
\[ \alpha_k \ge \min\left\{ \alpha_{\text{init}},\; \frac{2\tau(\eta - 1) g_k^T p_k}{\gamma_k \|p_k\|_2^2} \right\} \equiv \min\{\alpha_{\text{init}},\; \tau\, \alpha_{\max}(x_k)\} \]
where τ ∈ (0, 1) is the backtracking parameter.

Proof: (l) Theorem 3.1 implies that the backtracking will terminate as soon as α ≤ αmax(xk). Consider the following two cases:

αinit satisfies the Armijo condition: it is then clear that αk = αinit.

αinit does not satisfy the Armijo condition: thus, there must be a last linesearch iterate (the l-th) for which

(l) (l+1) (l) α > αmax(xk) =⇒ αk = α = τ α > τ αmax(xk)

Combining the two cases gives the required result.


Algorithm 11 Generic backtracking-Armijo linesearch method

1: input initial guess x0 2: Set k = 0 3: loop 4: Find a descent direction pk at xk. 5: Compute the stepsize αk using the backtracking-Armijo linesearch Algorithm 10. 6: Set xk+1 = xk + αkpk 7: Set k ← k + 1 8: end loop 9: return approximate solution xk

should include a sensible relative stopping tolerance such as

\[ \|\nabla f(x_k)\| \le 10^{-8} \max(1, \|\nabla f(x_0)\|) \]

should also terminate if a maximum number of allowed iterations is reached should also terminate if a maximum time limit is reached

63 / 106 Notes Theorem 3.3 (Global convergence of generic backtracking-Armijo linesearch)

Suppose that f ∈ C1 n g is Lipschitz continuous on R with Lipschitz constant γ Then, the iterates generated by the generic backtracking-Armijo linesearch method Algorithm 11 must satisfy one of the following conditions: 1 finite termination, i.e., there exists a positive integer k such that

gk = 0 for some k ≥ 0

2 objective is unbounded below, i.e.,

lim fk = −∞ k→∞

3 an angle condition between gk and pk exists, i.e.,

\[ \lim_{k \to \infty} \min\left\{ |p_k^T g_k|,\; |p_k^T g_k|^2 / \|p_k\|_2^2 \right\} = 0 \]

64 / 106

Proof: Notes Suppose that outcomes 1 and 2 do not happen, i.e., gk 6= 0 for all k ≥ 0 and

lim fk > −∞. (8) k→∞ The Armijo condition implies that

T fk+1 − fk ≤ ηαkpk gk for all k. Summing over the first j iterations yields

j X T fj+1 − f0 ≤ η αkpk gk (9) k=0 Taking limits of the previous inequality as j → ∞ and using (8) implies that the LHS is bounded below and, therefore, the RHS is also bounded below. Since the sum is composed of all negative terms, we deduce that the following sum

∞ X T αk|pk gk| (10) k=0 is bounded.

65 / 106 n T o Notes 2τ (η−1)gk pk From Theorem 3.2, αk ≥ min αinit, 2 . Thus, γkpkk2

n T o P∞ T P∞ 2τ (η−1)gk pk T k=0 αk|pk gk| ≥ k=0 min αinit, 2 |pk gk| γkpkk2

n T 2 o P∞ T 2τ (1−η)|pk gk| = k=0 min αinit|pk gk|, 2 γkpkk2

n T 2 o P∞ 2τ (1−η) T |pk gk| ≥ k=0 min{αinit, γ } min |pk gk|, 2 kpkk2

Using the fact that the series (10) converges, we obtain that

 T T 2 2 lim min |pk gk|, |pk gk| /kpkk2 = 0 k→∞

66 / 106

Some observations
the Armijo condition prevents steps that are too “long”
typically, the Armijo condition is satisfied for very “large” intervals
the Armijo condition can be satisfied by steps that are not even close to a minimizer of φ(α) := f(x_k + αp_k)

Figure: The Armijo condition: φ(α) = f(x_k + αp_k), the line l(α) = f_k + c_1 α g_k^T p_k, and the acceptable intervals of α.

Wolfe conditions

Given the current iterate $x_k$, search direction $p_k$, and constants $0 < c_1 < c_2 < 1$, we say that the step length $\alpha_k$ satisfies the Wolfe conditions if
\[ f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k \tag{11a} \]
\[ \nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f(x_k)^T p_k \tag{11b} \]

condition (11a) is the Armijo condition (ensures “sufficient” decrease) condition (11b) is a directional derivative condition (prevents too short of steps)

Figure: The Wolfe conditions: φ(α) = f(x_k + αp_k), the line l(α) = f_k + c_1 α g_k^T p_k, the desired slope, and the acceptable intervals of α.


Algorithm 12 Generic Wolfe linesearch method

1: input initial guess x0 2: Set k = 0 3: loop 4: Find a descent direction pk at xk. 5: Compute a stepsize αk that satisfies the Wolfe conditions (11). 6: Set xk+1 = xk + αkpk 7: Set k ← k + 1 8: end loop 9: return approximate solution xk

should include a sensible relative stopping tolerance such as

\[ \|\nabla f(x_k)\| \le 10^{-8} \max(1, \|\nabla f(x_0)\|) \]

should also terminate if a maximum number of allowed iterations is reached should also terminate if a maximum time limit is reached

Theorem 3.4 (Convergence of generic Wolfe linesearch (Zoutendijk))
Assume that
  $f : \mathbb{R}^n \to \mathbb{R}$ is bounded below and $C^1$ on $\mathbb{R}^n$
  ∇f(x) is Lipschitz continuous on $\mathbb{R}^n$ with constant L
It follows that the iterates generated from the generic Wolfe linesearch method satisfy
\[ \sum_{k \ge 0} \cos^2(\theta_k) \|g_k\|_2^2 < \infty \tag{12} \]
where
\[ \cos(\theta_k) \overset{\text{def}}{=} \frac{-g_k^T p_k}{\|g_k\|_2 \|p_k\|_2} \tag{13} \]

Comments

definition (13) measures the angle between the gradient gk and search direction pk I cos(θ) ≈ 0 implies gk and pk are nearly orthogonal I cos(θ) ≈ 1 implies gk and pk are nearly parallel

condition (12) ensures that the gradient converges to zero provided gk and pk stay away from being arbitrarily close to orthogonal Proof: See Theorem 3.2 in Nocedal and Wright.


Some comments The Armijo linesearch

I only requires function evaluations I may easily accept steps that are far from a univariate minimizer of φ(α) = f(xk + αpk) The Wolfe linesearch

requires function and gradient evaluations
typically produces steps that are closer to the univariate minimizer of φ
may still allow the acceptance of steps that are far from a univariate minimizer of φ
“ideal” when search directions are computed from linear systems based on the BFGS quasi-Newton updating formula
Note: we have not yet said how to compute steps that satisfy the Wolfe conditions. We have not even shown that such steps exist! (coming soon)

Strong Wolfe conditions

Given the current iterate $x_k$, search direction $p_k$, and constants $0 < c_1 < c_2 < 1$, we say that the step length $\alpha_k$ satisfies the strong Wolfe conditions if
\[ f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k \tag{14a} \]
\[ |\nabla f(x_k + \alpha_k p_k)^T p_k| \le c_2 |\nabla f(x_k)^T p_k| \tag{14b} \]

Figure: The strong Wolfe conditions: φ(α) = f(x_k + αp_k), the line l(α) = f_k + c_1 α g_k^T p_k, the desired slope, and the acceptable intervals of α.

74 / 106

Theorem 3.5
Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is in $C^1$ and $p_k$ is a descent direction at $x_k$, and assume that f is bounded below along the ray $\{x_k + \alpha p_k : \alpha > 0\}$. It follows that if $0 < c_1 < c_2 < 1$, then there exist nontrivial intervals of step lengths that satisfy the strong Wolfe conditions (14).

Proof: The function φ(α) := f(xk + αpk) is bounded below over α > 0 by assumption and T must, therefore, intersect the graph of l(α) := f(xk) + αc1∇f(xk) pk at least once since c1 ∈ (0, 1); let αmin > 0 be the smallest such intersecting value so that T f(xk + αminpk) = f(xk) + αminc1∇f(xk) pk. (15)

It is clear that (11a)/(14a) hold for all α ∈ (0, αmin]. The Mean Value Theorem now implies T f(xk + αmin pk) − f(xk) = αmin∇f(xk + αMpk) pk for some αM ∈ (0, αmin). T Combining the last equality with (15) and the facts that c1 < c2 and ∇f(xk) pk < 0 yield T T T ∇f(xk + αMpk) pk = c1∇f(xk) pk > c2∇f(xk) pk. (16)

Thus, αM satisfies (11a)/(14a) and (11b) with a strict inequality so we may use the smoothness assumption to deduce that there is an interval around αM for which (11) and (14a) hold. Finally, since the left-hand-side of (16) is negative, we also know that (14b) holds, which completes the proof. 75 / 106 Algorithm 13 A bisection type (weak) Wolfe linesearch Notes

1: input $x_k$, $p_k$, $0 < c_1 < c_2 < 1$
2: Choose α = 0, t = 1, and β = ∞
3: while ( $f(x_k + t p_k) > f(x_k) + c_1 \cdot t \cdot (g_k^T p_k)$ OR $\nabla f(x_k + t p_k)^T p_k < c_2 (g_k^T p_k)$ ) do
4:   if $f(x_k + t p_k) > f(x_k) + c_1 \cdot t \cdot (g_k^T p_k)$ then
5:     Set β := t
6:     Set t := (α + β)/2
7:   else
8:     α := t
9:     if β = ∞ then
10:      t := 2α
11:    else
12:      t := (α + β)/2
13:    end if
14:  end if
15: end while
16: return t

See Section 3.5 in Nocedal-Wright for more discussions; in particular, a procedure for the strong Wolfe conditions.
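A Python transcription of Algorithm 13 is sketched below; the parameter values and the iteration cap are illustrative additions.

```python
# A sketch of the bisection-type weak Wolfe linesearch (Algorithm 13);
# grad returns the gradient of f.
import numpy as np

def wolfe_bisection(f, grad, x_k, p_k, c1=1e-4, c2=0.9, max_iter=100):
    f_k = f(x_k)
    slope = grad(x_k) @ p_k                   # g_k^T p_k < 0 for a descent direction
    alpha, t, beta = 0.0, 1.0, np.inf
    for _ in range(max_iter):                 # practical cap on the while-loop
        if f(x_k + t * p_k) > f_k + c1 * t * slope:
            beta = t                          # Armijo (11a) fails: step too long
            t = (alpha + beta) / 2
        elif grad(x_k + t * p_k) @ p_k < c2 * slope:
            alpha = t                         # curvature (11b) fails: step too short
            t = 2 * alpha if beta == np.inf else (alpha + beta) / 2
        else:
            break                             # both Wolfe conditions hold
    return t
```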


Some comments The Armijo linesearch Notes

I only requires function evaluations I computationally cheapest to implement I may easily accept steps that are far from a univariate minimizer of φ(α) = f(xk + αpk) The Wolfe linesearch

I requires function and gradient evaluations I typically produces steps that are closer to the univariate minimizer of φ I may still allow the acceptance of steps that are far from a univariate minimizer of φ I computationally more expensive to implement compared to Armijo The strong Wolfe linesearch

I requires function and gradient evaluations I typically produces steps that are closer to the univariate minimizer of φ I only approximate stationary points of the univariate function φ satisfy these conditions. I computationally most expensive to implement compared to Armijo

Why use (strong) Wolfe conditions? sometimes more accurate lead to fewer outer (major) iterations of the linesearch method. beneficial in the context of quasi-Newton methods for maintaining positive-definite matrix approximations.

77 / 106 Theorem 3.6 (Wolfe conditions guarantee the BFGS update is well-defined) Notes Suppose that Bk 0 and that pk is a descent direction for f at xk. If a (strong) Wolfe linesearch is used to compute αk such that xk+1 = xk + αkpk, then

\[ y_k^T s_k > 0 \quad \text{where} \quad s_k = x_{k+1} - x_k = \alpha_k p_k \quad \text{and} \quad y_k = g_{k+1} - g_k. \]

Proof: It follows from (11b) that
\[ \nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f(x_k)^T p_k. \]
Multiplying both sides by $\alpha_k$, using $x_{k+1} = x_k + \alpha_k p_k$, and the notation $g_k = \nabla f(x_k)$, yields
\[ g_{k+1}^T s_k \ge c_2 g_k^T s_k. \]
Subtracting $g_k^T s_k$ from both sides gives
\[ g_{k+1}^T s_k - g_k^T s_k \ge c_2 g_k^T s_k - g_k^T s_k \]
so that
\[ y_k^T s_k \ge (c_2 - 1) g_k^T s_k = \alpha_k (c_2 - 1) g_k^T p_k > 0 \]
since $c_2 \in (0, 1)$ and $p_k$ is a descent direction for f at $x_k$.


Algorithm 14 Steepest descent backtracking-Armijo linesearch method

1: Input: initial guess $x_0$
2: Set k = 0
3: until $\|\nabla f(x_k)\| \le 10^{-8} \max(1, \|\nabla f(x_0)\|)$
4:   Set $p_k = -\nabla f(x_k)$ as the search direction.
5:   Compute the stepsize $\alpha_k$ using the backtracking-Armijo linesearch Algorithm 10.
6:   Set $x_{k+1} = x_k + \alpha_k p_k$
7:   Set $k \leftarrow k + 1$
8: end until
9: return approximate solution $x_k$

should also terminate if a maximum number of allowed iterations is reached should also terminate if a maximum time limit is reached
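A compact Python sketch of Algorithm 14 follows, with a simple inline backtracking-Armijo loop; the parameter values, the iteration caps, and the test function (the one plotted in the contour figure below, with a hand-computed gradient) are illustrative.

```python
# A sketch of the steepest-descent backtracking-Armijo method (Algorithm 14).
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-8, eta=1e-4, tau=0.5, max_iter=10000):
    x = x0.copy()
    g0_norm = np.linalg.norm(grad(x0))
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol * max(1.0, g0_norm):
            break                              # relative stopping test from the slide
        p = -g                                 # steepest-descent direction
        alpha, f_x = 1.0, f(x)
        for _ in range(60):                    # backtracking-Armijo (Algorithm 10)
            if f(x + alpha * p) <= f_x + eta * alpha * (g @ p):
                break
            alpha *= tau
        x = x + alpha * p
    return x

# Test function from the contour figure: f(x, y) = 10(y - x^2)^2 + (x - 1)^2.
f = lambda z: 10 * (z[1] - z[0]**2)**2 + (z[0] - 1)**2
grad = lambda z: np.array([-40 * z[0] * (z[1] - z[0]**2) + 2 * (z[0] - 1),
                           20 * (z[1] - z[0]**2)])
print(steepest_descent(f, grad, np.array([-1.0, 0.5])))   # slowly approaches (1, 1)
```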

81 / 106 Theorem 4.1 (Global convergence for steepest descent) 1 n Notes If f ∈ C and g is Lipschitz continuous on R , then the iterates generated by the steepest-descent backtracking-Armijo linesearch Algorithm 14 satisfies one of the following conditions: 1 finite termination, i.e., there exists some finite integer k ≥ 0 such that

gk = 0

2 objective is unbounded below, i.e.,

lim fk = −∞ k→∞

3 convergence of gradients, i.e., $\lim_{k \to \infty} g_k = 0$

Proof: Since pk = −gk, if outcome 1 above does not happen, then we may conclude that

kgkk = kpkk= 6 0 for all k ≥ 0 Additionally, if outcome 2 above does not occur, then Theorem 3.3 ensures that

n  T T 2 2o 0 = lim min |pk gk|, |pk gk| /kpkk2 = lim {kgkk2 min (kgkk2, 1)} k→∞ k→∞ and thus limk→∞ gk = 0. 82 / 106


Some general comments on the steepest descent backtracking-Armijo algorithm
archetypal globally convergent method
many other methods resort to steepest descent when things “go wrong”
not scale invariant
convergence is usually slow! (linear local convergence, possibly with constant close to 1; for global convergence only a sublinear rate of $O((1/\epsilon)^2)$ iterations to get to an “$\epsilon$-stationary” point, i.e., $\|g_k\| \le \epsilon$)
in practice, often does not converge at all
slow convergence results because the computation of $p_k$ does not use any curvature information
we call any method that uses the steepest descent direction a “method of steepest descent”
each iteration is relatively cheap, but you may need many of them!

83 / 106 Rate of convergence Notes Theorem 4.2 (Global convergence rate for steepest-descent)

Suppose $f \in C^1$ and ∇f is Lipschitz continuous on $\mathbb{R}^n$. Further assume that f(x) is bounded from below on $\mathbb{R}^n$. Let $\{x_k\}$ be the iterates generated by the steepest-descent backtracking-Armijo linesearch Algorithm 14. Then there exists a constant M such that for all T ≥ 1
\[ \min_{k=0,\ldots,T} \|\nabla f(x_k)\| \le \frac{M}{\sqrt{T + 1}}. \]
Consequently, for any $\epsilon > 0$, within $\lceil (M/\epsilon)^2 \rceil$ steps we will see an iterate where the gradient has norm at most $\epsilon$. In other words, we reach an “$\epsilon$-stationary” point in $O((1/\epsilon)^2)$ steps.

REMARK: Same thing holds for constant step size for an appropriately chosen constant. See HW. Proof: Recall (9) in the proof of Theorem 3.3:

T X f0 − fk+1 0 α |gTp | ≤ ≤ M since f is bounded from below. k k k η k=0

84 / 106

Rate of convergence Notes Thus, T 2 X 2τ (1 − η)|g pk| X 0 k + α | T | ≤ . 2 init gk pk M γkpkk k:τ αmax<αinit k:τ αmax≥αinit Thus, T 2 0 X |g pk| X M k + | T | ≤ , 2 gk pk (17) kpkk C k:τ αmax<αinit k:τ αmax≥αinit   2τ (1−η) where C = min γ , αinit .

Setting pk = −gk in the inequality above, we obtain that

0 2 X 2 M (T + 1) min kgkk ≤ kgkk ≤ . k=0,...,T C k=0,...,T

Therefore, M min kgkk ≤ √ , k=0,...,T T + 1 for some constant M.

Steepest descent example

Figure: Contours for the objective function f(x, y) = 10(y − x²)² + (x − 1)², and the iterates generated by the steepest-descent backtracking-Armijo linesearch Algorithm 14.


Rate of convergence Notes

Theorem 4.3 (Local convergence rate for steepest-descent)
Suppose the following all hold:
  f is twice continuously differentiable and ∇²f is Lipschitz continuous on $\mathbb{R}^n$ with Lipschitz constant M.
  x* is a local minimum of f with positive definite Hessian ∇²f(x*) with eigenvalues lower bounded by ℓ and upper bounded by L.
  The initial starting iterate $x_0$ is close enough to x*; more precisely, $r_0 := \|x_0 - x^*\| < \bar{r} := \frac{2\ell}{M}$.
Then steepest descent with constant step size $\alpha = \frac{2}{\ell + L}$ converges as follows:
\[ \|x_k - x^*\| \le \frac{\bar{r}\, r_0}{\bar{r} - r_0} \left( 1 - \frac{2\ell}{L + 3\ell} \right)^{\!k}. \]
In other words, in $O(\log(1/\epsilon))$ iterations, we are within a distance of $\epsilon$ from x*.

Proof: See Theorem 1.2.4 in Nesterov’s book.

See also Theorem 3.4 in Nocedal-Wright for a similar result with exact line-search.


Algorithm 15 Modified-Newton backtracking-Armijo linesearch method

1: Input: initial guess $x_0$
2: Set k = 0
3: until $\|\nabla f(x_k)\| \le 10^{-8} \max(1, \|\nabla f(x_0)\|)$
4:   Compute positive-definite matrix $B_k$ from $H_k$ using Algorithm 2.
5:   Compute the search direction $p_k$ as the solution to $B_k p = -g_k$.
6:   Compute the stepsize $\alpha_k$ using the backtracking-Armijo linesearch Algorithm 10.
7:   Set $x_{k+1} = x_k + \alpha_k p_k$
8:   Set $k \leftarrow k + 1$
9: end until
10: return approximate solution $x_k$

should also terminate if a maximum number of allowed iterations is reached should also terminate if a maximum time limit is reached

89 / 106

Theorem 4.4 (Global convergence of modified-Newton backtracking-Armijo)
Suppose that
  f ∈ C¹
  g is Lipschitz continuous on ℝⁿ

the eigenvalues of Bk are uniformly bounded away from zero and infinity Then, the iterates generated by the modified-Newton backtracking-Armijo Algorithm 15 satisfy one of the following: 1 finite termination, i.e., there exists a finite k such that

gk = 0

2 objective is unbounded below, i.e.,

lim fk = −∞ k→∞

3 convergence of gradients, i.e., lim gk = 0 k→∞

Proof:

Let λmin(Bk) and λmax(Bk) be the smallest and largest eigenvalues of Bk. By assumption, there are bounds 0 < λmin ≤ λmax such that sTB s λ ≤ λ (B ) ≤ k ≤ λ (B ) ≤ λ for any nonzero s. min min k ksk2 max k max 90 / 106 From the previous slide, it follows that sTB−1s λ−1 ≤ λ−1 (B ) = λ (B−1) ≤ k ≤ λ (B−1) = λ−1 (B ) ≤ λ−1 Notes max max k min k ksk2 max k min k min (18) for any nonzero vector s. Using Bkpk = −gk and (18) we see that T T −1 −1 2 | pk gk| = |gk Bk gk| ≥ λmaxkgkk2 (19) In addition, we may use (18) to see that 2 T −2 −2 2 kpkk2 = gk Bk gk ≤ λminkgkk2, which implies that −1 kpkk2 ≤ λminkgkk2. It now follows from the previous inequality and (19) that T |pk gk| λmin ≥ kgkk2. (20) kpkk2 λmax Combining the previous inequality and (19) yields 2  2   T T 2 2 kgkk2 λmin min |pk gk|, |pk gk| /kpkk2 ≥ min 1, . (21) λmax λmax T T 2 2 From Theorem 3.3 we know that limk→∞ min |pk gk|, |pk gk| /kpkk2 = 0 and combined with (21), we conclude that

lim gk = 0. k→∞

91 / 106

Rate of convergence Notes

Theorem 4.5 (Global convergence rate for modified-Newton) 1 n Suppose f ∈ C and ∇f is Lipschitz continuous on R . Further assume that f(x) is n bounded from below on R . Let {xk} be the iterates generated by the modified-Newton backtracking-Armijo linesearch Algorithm 15 such that Bk has eigenvalues uniformly bounded away from 0 and infinity. Then there exists a constant M such that for all T ≥ 1 M min k∇f(xk)k ≤ √ . k=0,...,T T + 1

M 2 Consequently, for any  > 0, within d  e steps, we will see an iterate where the gradient has norm at most . In other words, we reach an “-stationary” point in 1 2 O((  ) ) steps

Proof: Almost identical to steepest descent proof (Theorem 4.2); use (19) and (20) in (17).

REMARK: Same thing holds for constant step size for an appropriately chosen constant.

92 / 106 Rate of convergence Notes Theorem 4.6 (Local quadratic convergence for modified-Newton) Suppose that f ∈ C2 n H is Lipschitz continuous on R with constant γ Suppose that iterates are generated from the modified-Newton backtracking-Armijo Algorithm 15 such that

αinit = 1 1 η ∈ (0, 2 ) Bk = Hk whenever Hk is positive definite ∗ ∗ If the sequence of iterates {xk} has a limit point x satisfying H(x ) is positive definite, then it follows that

(i) αk = 1 for all sufficiently large k, ∗ (ii) the entire sequence {xk} converges to x , and (iii) the rate of convergence is q-quadratic, i.e, there is a constant κ ≥ 0 and a natural number k0 such that for all k ≥ k0

∗ ∗ 2 kxk+1 − x k2 ≤ κkxk − x k2

93 / 106

Proof: Notes By assumption we know that there is some subsequence K such that

∗ ∗ lim xk = x and H(x ) 0 k∈K

Continuity implies that Hk is positive definite for all k ∈ K sufficiently large so that

T 1 ∗ 2 pk Hkpk ≥ 2 λmin(H(x ))kpkk2 for all k0 ≤ k ∈ K (22) ∗ ∗ for some k0, where λmin(H(x )) is the smallest eigenvalue of H(x ). This fact and since Hkpk = −gk for k0 ≤ k ∈ K implies that

T T T 1 ∗ 2 |pk gk| = −pk gk = pk Hkpk ≥ 2 λmin(H(x ))kpkk2 for all k0 ≤ k ∈ K. (23) This also implies T |pk gk| 1 ∗ ≥ 2 λmin(H(x ))kpkk2. (24) kpkk2 We now deduce from Theorem 3.3, (23), and (24) that

lim pk = 0. k∈K→∞

94 / 106 Mean Value Theorem implies that there exists zk between xk and xk + pk such that Notes T 1 T f(xk + pk) = fk + pk gk + 2 pk H(zk)pk.

Lipschitz continuity of H(x) and the fact that Hkpk = −gk implies

1 T 1 T T f(xk + pk) − fk − 2 pk gk = 2 (pk gk + pk H(zk)pk) 1 T = 2 pk (H(zk) − Hk)pk 1 2 1 3 ≤ 2 kH(zk) − H(xk)k2kpkk2 ≤ 2 γkpkk2. (25)

Since η ∈ (0, 1/2) and pk → 0, we may pick k sufficiently large so that

∗ 2γkpkk2 ≤ λmin(H(x ))(1 − 2η). (26)

Combining (25), (26), and (23) shows that

1 T 1 3 f(xk + pk) − fk ≤ 2 pk gk + 2 γkpkk2 1 T 1 ∗ 2 ≤ 2 pk gk + 4 λmin(H(x ))(1 − 2η)kpkk2 1 T 1 T ≤ 2 pk gk − 2 (1 − 2η)pk gk T = ηpk gk so that the unit step xk + pk satisfies the Armijo condition for all k ∈ K sufficiently large.

95 / 106

Note that (22) implies that

−1 ∗ kHk k2 ≤ 2/λmin(H(x )) for all k0 ≤ k ∈ K. (27) Notes The iteration gives

∗ ∗ −1 ∗ −1 ∗ xk+1 − x = xk − x − Hk gk = xk − x − Hk (gk − g(x )) −1 ∗ ∗ = Hk [ g(x ) − gk − Hk(x − xk)] . But a Taylor Approximation provides the estimate γ kg(x∗) − g − H (x∗ − x )k ≤ kx − x∗k2 k k k 2 2 k 2 which implies

∗ γ −1 ∗ 2 γ ∗ 2 kxk+1 − x k2 ≤ kHk k2kxk − x k2 ≤ ∗ kxk − x k2 (28) 2 λmin(H(x )) Thus, we know that for k ∈ K sufficiently large, the unit step will be accepted and

∗ 1 ∗ kxk+1 − x k2 ≤ 2 kxk − x k2 which proves parts (i) and (ii) since the entire argument may be repeated. Finally, part (iii) then follows from part (ii) and (28) with the choice

∗ κ = γ/λmin(H(x )) which completes the proof. 96 / 106 1.5 Notes

1

0.5

0

−0.5 −1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1

Figure: Contours for the objective function f(x, y) = 10(y − x2)2 + (x − 1)2, and the iterates generated by the modified-Newton backtracking-Armijo Algorithm 15. 97 / 106


Algorithm 16 Modified-Newton Wolfe linesearch method

1: Input: initial guess $x_0$
2: Set k = 0
3: until $\|\nabla f(x_k)\| \le 10^{-8} \max(1, \|\nabla f(x_0)\|)$
4:   Compute positive-definite matrix $B_k$ from $H_k$ using Algorithm 2.
5:   Compute a search direction $p_k$ as the solution to $B_k p = -g_k$.
6:   Compute a stepsize $\alpha_k$ that satisfies the Wolfe conditions (11a) and (11b).
7:   Set $x_{k+1} \leftarrow x_k + \alpha_k p_k$
8:   Set $k \leftarrow k + 1$
9: end until
10: return approximate solution $x_k$

should also terminate if a maximum number of allowed iterations is reached should also terminate if a maximum time limit is reached


Algorithm 17 Quasi-Newton Wolfe linesearch method

1: Input: initial guess $x_0$
2: Set k = 0
3: until $\|\nabla f(x_k)\| \le 10^{-8} \max(1, \|\nabla f(x_0)\|)$
4:   Compute positive-definite matrix $B_k$ from $B_{k-1}$ using the BFGS rule, or using Algorithm 3 with a seed matrix $B_k^0$.
5:   Compute a search direction $p_k$ as the solution to $B_k p = -g_k$.
6:   Compute a stepsize $\alpha_k$ that satisfies the Wolfe conditions (11a) and (11b).
7:   Set $x_{k+1} \leftarrow x_k + \alpha_k p_k$
8:   Set $k \leftarrow k + 1$
9: end until
10: return approximate solution $x_k$

should also terminate if a maximum number of allowed iterations is reached should also terminate if a maximum time limit is reached
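A self-contained Python sketch of Algorithm 17 is given below, combining the BFGS update (2) with the bisection weak-Wolfe linesearch of Algorithm 13; the seed $B_0 = I$, the parameter values, and the test function are illustrative choices.

```python
# A sketch of the quasi-Newton Wolfe method (Algorithm 17) with the BFGS
# update (2) and the bisection weak-Wolfe linesearch (Algorithm 13).
import numpy as np

def wolfe_step(f, grad, x, p, c1=1e-4, c2=0.9, max_iter=50):
    f0, slope = f(x), grad(x) @ p
    lo, t, hi = 0.0, 1.0, np.inf
    for _ in range(max_iter):
        if f(x + t * p) > f0 + c1 * t * slope:
            hi = t; t = (lo + hi) / 2          # Armijo fails: shrink
        elif grad(x + t * p) @ p < c2 * slope:
            lo = t; t = 2 * lo if hi == np.inf else (lo + hi) / 2   # too short: grow
        else:
            break                              # Wolfe conditions (11) hold
    return t

def bfgs_wolfe(f, grad, x0, tol=1e-8, max_iter=500):
    x = x0.copy()
    B = np.eye(len(x0))                        # seed B_0 = I (a common choice)
    g, g0_norm = grad(x), np.linalg.norm(grad(x0))
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol * max(1.0, g0_norm):
            break
        p = np.linalg.solve(B, -g)             # solve B_k p = -g_k
        alpha = wolfe_step(f, grad, x, p)
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 0:                          # guaranteed by Theorem 3.6; guard anyway
            Bs = B @ s
            B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
        x, g = x_new, g_new
    return x

f = lambda z: 10 * (z[1] - z[0]**2)**2 + (z[0] - 1)**2
grad = lambda z: np.array([-40 * z[0] * (z[1] - z[0]**2) + 2 * (z[0] - 1),
                           20 * (z[1] - z[0]**2)])
print(bfgs_wolfe(f, grad, np.array([-1.0, 0.5])))   # converges to (1, 1)
```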


Theorem 4.7 (Global convergence of modified-Newton/quasi-Newton Wolfe)

Suppose that f ∈ C1 n g is Lipschitz continuous on R

the condition number of Bk is uniformly bounded by β Then, the iterates generated by the modified/quasi-Newton Wolfe Algorithm 17 or 16 satisfy one of the following: 1 finite termination, i.e., there exists a finite k such that

gk = 0

2 objective is unbounded below, i.e.,

lim fk = −∞ k→∞

3 convergence of gradients, i.e., lim gk = 0 k→∞

Proof: Homework assignment.

102 / 106 Rate of convergence Notes

Theorem 4.8 (Global convergence rate for modified-Newton/quasi-Newton Wolfe)

Suppose $f \in C^1$ and ∇f is Lipschitz continuous on $\mathbb{R}^n$. Further assume that f(x) is bounded from below on $\mathbb{R}^n$. Let $\{x_k\}$ be the iterates generated by the modified/quasi-Newton Wolfe linesearch Algorithms 16 or 17, such that $B_k$ has uniformly bounded condition number. Then there exists a constant M such that for all T ≥ 1
\[ \min_{k=0,\ldots,T} \|\nabla f(x_k)\| \le \frac{M}{\sqrt{T + 1}}. \]
Consequently, for any $\epsilon > 0$, within $\lceil (M/\epsilon)^2 \rceil$ steps we will see an iterate where the gradient has norm at most $\epsilon$. In other words, we reach an “$\epsilon$-stationary” point in $O((1/\epsilon)^2)$ steps.

103 / 106

Rate of convergence Notes

Theorem 4.9 (Local superlinear convergence for quasi-Newton) Suppose that f ∈ C2 n H is Lipschitz continuous on R

Suppose that iterates are generated from the quasi-Newton Wolfe Algorithm 17 such that $0 < c_1 < c_2 \le \tfrac{1}{2}$.
If the sequence of iterates $\{x_k\}$ has a limit point x* at which H(x*) is positive definite and g(x*) = 0, then it follows that

(i) the step length αk = 1 is valid for all sufficiently large k, ∗ (ii) the entire sequence {xk} converges to x superlinearly.

See also Theorems 3.6 and 3.7 from Nocedal-Wright.

104 / 106 A Summary Notes The problem

\[ \min_{x \in \mathbb{R}^n} \; f(x) \]
where the objective function $f : \mathbb{R}^n \to \mathbb{R}$

Solve instead quadratic approximation at current iterate xk, after choosing positive definite Bk, to get a search direction that is a descent direction:

\[ p_k = \operatorname*{argmin}_{p \in \mathbb{R}^n} \; m_k^Q(p) \overset{\text{def}}{=} f_k + g_k^T p + \tfrac{1}{2} p^T B_k p \]

Bk = identity matrix: Steepest Descent Direction.

Bk obtained from Hessian Hk by adjusting eigenvalues: Modified-Newton.

Bk obtained by updating at every iteration using current and previous gradient information: Quasi-Newton.

Decide on a step size strategy: Armijo backtracking linesearch, Wolfe linesearch, constant step size (HW assignment), $O(1/\sqrt{k})$ in iteration k, ... many others.

A Summary

Steepest Descent:
  Requires only function and gradient evaluations
  No overhead of solving a linear system in each iteration
  Global convergence to an ε-stationary point in $O((1/\epsilon)^2)$ steps
  Linear local convergence: converges to within ε distance of a local minimum in $O(\log(1/\epsilon))$ steps

Modified/Quasi Newton:
  Could need Hessian evaluations as well (with spectral decompositions)
  Solves a linear system $Bp = -g$ in each iteration
  Global convergence to an ε-stationary point in $O((1/\epsilon)^2)$ steps
  Superlinear (quadratic) local convergence: converges to within ε distance of a local minimum in $o(\log(1/\epsilon))$ ($O(\log\log(1/\epsilon))$) steps
