Line-Search Methods for Smooth Unconstrained Optimization

Daniel P. Robinson
Department of Applied Mathematics and Statistics
Johns Hopkins University
September 17, 2020

Outline

1 Generic Linesearch Framework
2 Computing a descent direction $p_k$ (search direction)
   - Steepest descent direction
   - Modified Newton direction
   - Quasi-Newton directions for medium scale problems
   - Limited-memory quasi-Newton directions for large-scale problems
   - Linear CG method for large-scale problems
3 Choosing the step length $\alpha_k$ (linesearch)
   - Backtracking-Armijo linesearch
   - Wolfe conditions
   - Strong Wolfe conditions
4 Complete Algorithms
   - Steepest descent backtracking Armijo linesearch method
   - Modified Newton backtracking-Armijo linesearch method
   - Modified Newton linesearch method based on the Wolfe conditions
   - Quasi-Newton linesearch method based on the Wolfe conditions

The problem

$$\min_{x \in \mathbb{R}^n} f(x)$$

where the objective function $f : \mathbb{R}^n \to \mathbb{R}$

- assume that $f \in C^1$ (sometimes $C^2$) and is Lipschitz continuous
- in practice this assumption may be violated, but the algorithms we develop may still work
- in practice it is very rare to be able to provide an explicit minimizer
- we consider iterative methods: given starting guess $x_0$, generate sequence $\{x_k\}$ for $k = 1, 2, \dots$
- AIM: ensure that (a subsequence) has some favorable limiting properties
   - satisfies first-order necessary conditions
   - satisfies second-order necessary conditions

Notation: $f_k = f(x_k)$, $g_k = g(x_k)$, $H_k = H(x_k)$

The basic idea

- consider descent methods, i.e., $f(x_{k+1}) < f(x_k)$
- calculate a search direction $p_k$ from $x_k$
- ensure that this direction is a descent direction, i.e.,
  $$g_k^T p_k < 0 \quad \text{if } g_k \ne 0$$
  so that, for small steps along $p_k$, the objective function $f$ will be reduced
- calculate a suitable steplength $\alpha_k > 0$ so that
  $$f(x_k + \alpha_k p_k) < f_k$$
- computation of $\alpha_k$ is the linesearch
- the computation of $\alpha_k$ may itself require an iterative procedure
- generic update for linesearch methods is given by
  $$x_{k+1} = x_k + \alpha_k p_k$$

Issue 1: steps might be too long

[Figure: The objective function $f(x) = x^2$ and the iterates $x_{k+1} = x_k + \alpha_k p_k$ generated by the descent directions $p_k = (-1)^{k+1}$ and steps $\alpha_k = 2 + 3/2^{k+1}$ from $x_0 = 2$.]

- decrease in $f$ is not proportional to the size of the directional derivative!

Issue 2: steps might be too short

[Figure: The objective function $f(x) = x^2$ and the iterates $x_{k+1} = x_k + \alpha_k p_k$ generated by the descent directions $p_k = -1$ and steps $\alpha_k = 1/2^{k+1}$ from $x_0 = 2$.]

- decrease in $f$ is not proportional to the size of the directional derivative!

What is a practical linesearch?
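The two failure modes pictured under Issue 1 and Issue 2 can be reproduced numerically. The following is a minimal Python sketch using the directions and step lengths given in the figure captions; the helper name `iterate` and the choice of 20 iterations are illustrative assumptions, not part of the slides.

```python
# Reproduce the two step-length failure modes on f(x) = x^2 from x0 = 2.
# The helper name `iterate` and the iteration count are illustrative assumptions.

def f(x):
    return x * x

def iterate(x0, direction, step, num_iters):
    """Apply x_{k+1} = x_k + alpha_k * p_k and return the list of iterates."""
    xs = [x0]
    for k in range(num_iters):
        xs.append(xs[-1] + step(k) * direction(k))
    return xs

# Issue 1 (steps too long): p_k = (-1)^(k+1), alpha_k = 2 + 3 / 2^(k+1)
too_long = iterate(2.0,
                   lambda k: (-1.0) ** (k + 1),
                   lambda k: 2.0 + 3.0 / 2 ** (k + 1),
                   20)

# Issue 2 (steps too short): p_k = -1, alpha_k = 1 / 2^(k+1)
too_short = iterate(2.0,
                    lambda k: -1.0,
                    lambda k: 1.0 / 2 ** (k + 1),
                    20)

for name, xs in (("too long", too_long), ("too short", too_short)):
    fs = [f(x) for x in xs]
    decreasing = all(b < a for a, b in zip(fs, fs[1:]))
    print(f"{name}: f decreases at every step = {decreasing}, "
          f"final |x| = {abs(xs[-1]):.6f}")
```

In both runs $f$ decreases at every iteration, yet the iterates settle near $|x| = 1$, where $f = 1$, rather than approaching the minimizer $x = 0$. Monotone decrease alone is therefore not enough, which is why the linesearch must control the step length.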
- in the "early" days exact linesearches were used, i.e., pick $\alpha_k$ to minimize $f(x_k + \alpha p_k)$
   - an exact linesearch requires univariate minimization
   - cheap if $f$ is simple, e.g., a quadratic function
   - generally very expensive and not cost effective
   - exact linesearch may not be much better than an approximate linesearch
- modern methods use inexact linesearches
   - ensure steps are neither too long nor too short
   - make sure that the decrease in $f$ is proportional to the directional derivative
   - try to pick "appropriate" initial stepsize for fast convergence (related to how the search direction $p_k$ is computed)
- the descent direction (search direction) is also important
   - "bad" directions may not converge at all
   - more typically, "bad" directions may converge very slowly

Definition 2.1 (Steepest descent direction)
For a differentiable function $f$, the search direction
$$p_k \overset{\text{def}}{=} -\nabla f(x_k) \equiv -g_k$$
is called the steepest-descent direction.

- $p_k$ is a descent direction provided $g_k \ne 0$
   - $g_k^T p_k = -g_k^T g_k = -\|g_k\|_2^2 < 0$
- $p_k$ solves the problem
  $$\min_{p \in \mathbb{R}^n} \; m_k^L(x_k + p) \quad \text{subject to} \quad \|p\|_2 = \|g_k\|_2$$
   - $m_k^L(x_k + p) \overset{\text{def}}{=} f_k + g_k^T p$
   - $m_k^L(x_k + p) \approx f(x_k + p)$

Any method that uses the steepest-descent direction is a method of steepest descent.

Observation: the steepest descent direction is also the unique solution to
$$\min_{p \in \mathbb{R}^n} \; f_k + g_k^T p + \tfrac{1}{2} p^T p$$

- approximates the second-derivative Hessian by the identity matrix
   - how often is this a good idea?
- is it a surprise that convergence is typically very slow?

Idea: choose positive definite $B_k$ and compute the search direction as the unique minimizer
$$p_k = \operatorname*{argmin}_{p \in \mathbb{R}^n} \; m_k^Q(p) \overset{\text{def}}{=} f_k + g_k^T p + \tfrac{1}{2} p^T B_k p$$

- $p_k$ satisfies $B_k p_k = -g_k$
- why must $B_k$ be positive definite?
   - $B_k \succ 0 \implies m_k^Q$ is strictly convex $\implies$ unique solution
   - if $g_k \ne 0$, then $p_k \ne 0$ and $p_k^T g_k = -p_k^T B_k p_k < 0 \implies p_k$ is a descent direction
- pick "intelligent" $B_k$ that has "useful" curvature information
- IF $H_k \succ 0$ and we choose $B_k = H_k$, then $p_k$ is the Newton direction. Awesome!

Question: How do we choose the positive-definite matrix $B_k$?

Ideally, $B_k$ is chosen such that

- $\|B_k - H_k\|$ is "small"
- $B_k = H_k$ when $H_k$ is "sufficiently" positive definite.

Comments:

- for the remainder of this section, we omit the suffix $k$ and write $H = H_k$, $B = B_k$, and $g = g_k$.
- for a symmetric matrix $A \in \mathbb{R}^{n \times n}$, we use the matrix norm
  $$\|A\|_2 = \max_{1 \le j \le n} |\lambda_j|$$
  with $\{\lambda_j\}$ the eigenvalues of $A$.
- the spectral decomposition of $H$ is given by $H = V \Lambda V^T$, where
  $$V = \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix} \quad \text{and} \quad \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$$
  with $H v_j = \lambda_j v_j$ and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$.
- $H$ is positive definite if and only if $\lambda_j > 0$ for all $j$.
- computing the spectral decomposition is, generally, very expensive!

Method 1: small scale problems (n < 1000)

Algorithm 1 Compute modified Newton matrix B from H
 1: input $H$
 2: Choose $\beta > 1$, the desired bound on the condition number of $B$.
 3: Compute the spectral decomposition $H = V \Lambda V^T$.
 4: if $H = 0$ then
 5:    Set $\varepsilon = 1$
 6: else
 7:    Set $\varepsilon = \|H\|_2 / \beta > 0$
 8: end if
 9: Compute
    $$\bar\Lambda = \operatorname{diag}(\bar\lambda_1, \bar\lambda_2, \dots, \bar\lambda_n) \quad \text{with} \quad \bar\lambda_j = \begin{cases} \lambda_j & \text{if } \lambda_j \ge \varepsilon \\ \varepsilon & \text{otherwise} \end{cases}$$
10: return $B = V \bar\Lambda V^T \succ 0$, which satisfies $\operatorname{cond}(B) \le \beta$

- replaces the eigenvalues of $H$ that are "not positive enough" with $\varepsilon$
- $\bar\Lambda = \Lambda + D$, where
  $$D = \operatorname{diag}\bigl(\max(0, \varepsilon - \lambda_1), \max(0, \varepsilon - \lambda_2), \dots, \max(0, \varepsilon - \lambda_n)\bigr) \succeq 0$$
- $B = H + E$, where $E = V D V^T \succeq 0$

Question: What are the properties of the resulting search direction?
$$p = -B^{-1} g = -V \bar\Lambda^{-1} V^T g = -\sum_{j=1}^{n} \frac{v_j^T g}{\bar\lambda_j} v_j$$
Taking norms and using the orthogonality of $V$ gives
$$\|p\|_2^2 = \|B^{-1} g\|_2^2 = \sum_{j=1}^{n} \frac{(v_j^T g)^2}{\bar\lambda_j^2} = \sum_{\lambda_j \ge \varepsilon} \frac{(v_j^T g)^2}{\lambda_j^2} + \sum_{\lambda_j < \varepsilon} \frac{(v_j^T g)^2}{\varepsilon^2}$$
Thus, we conclude that
$$\|p\|_2 = O\!\left(\frac{1}{\varepsilon}\right) \quad \text{provided } v_j^T g \ne 0 \text{ for at least one } \lambda_j < \varepsilon$$

- the step $p$ goes unbounded as $\varepsilon \to 0$ provided $v_j^T g \ne 0$ for at least one $\lambda_j < \varepsilon$
- any indefinite matrix will generally satisfy this property
- the next method that we discuss is better!

Method 2: small scale problems (n < 1000)

Algorithm 2 Compute modified Newton matrix B from H
 1: input $H$
 2: Choose $\beta > 1$, the desired bound on the condition number of $B$.
 3: Compute the spectral decomposition $H = V \Lambda V^T$.
 4: if $H = 0$ then
 5:    Set $\varepsilon = 1$
 6: else
 7:    Set $\varepsilon = \|H\|_2 / \beta > 0$
 8: end if
 9: Compute
    $$\bar\Lambda = \operatorname{diag}(\bar\lambda_1, \bar\lambda_2, \dots, \bar\lambda_n) \quad \text{with} \quad \bar\lambda_j = \begin{cases} \lambda_j & \text{if } \lambda_j \ge \varepsilon \\ -\lambda_j & \text{if } \lambda_j \le -\varepsilon \\ \varepsilon & \text{otherwise} \end{cases}$$
10: return $B = V \bar\Lambda V^T \succ 0$, which satisfies $\operatorname{cond}(B) \le \beta$

- replace small eigenvalues $\lambda_i$ of $H$ with $\varepsilon$
- replace "sufficiently negative" eigenvalues $\lambda_i$ of $H$ with $-\lambda_i$
- $\bar\Lambda = \Lambda + D$, where
  $$D = \operatorname{diag}\bigl(\max(0, -2\lambda_1, \varepsilon - \lambda_1), \dots, \max(0, -2\lambda_n, \varepsilon - \lambda_n)\bigr) \succeq 0$$
- $B = H + E$, where $E = V D V^T \succeq 0$

Suppose that $B = H + E$ is computed from Algorithm 2 so that
$$E = B - H = V(\bar\Lambda - \Lambda)V^T$$
and, since $V$ is orthogonal, that
$$\|E\|_2 = \|V(\bar\Lambda - \Lambda)V^T\|_2 = \|\bar\Lambda - \Lambda\|_2 = \max_j |\bar\lambda_j - \lambda_j|$$
By definition
$$\bar\lambda_j - \lambda_j = \begin{cases} 0 & \text{if } \lambda_j \ge \varepsilon \\ \varepsilon - \lambda_j & \text{if } -\varepsilon < \lambda_j < \varepsilon \\ -2\lambda_j & \text{if } \lambda_j \le -\varepsilon \end{cases}$$
which implies that
$$\|E\|_2 = \max_{1 \le j \le n} \{\, 0, \; \varepsilon - \lambda_j, \; -2\lambda_j \,\}.$$
However, since $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, we know that
$$\|E\|_2 = \max \{\, 0, \; \varepsilon - \lambda_n, \; -2\lambda_n \,\}$$

- if $\lambda_n \ge \varepsilon$, i.e., $H$ is sufficiently positive definite, then $E = 0$ and $B = H$
- if $\lambda_n < \varepsilon$, then $B \ne H$ and it can be shown that $\|E\|_2 \le 2 \max(\varepsilon, |\lambda_n|)$
- regardless, $B$ is well-conditioned by construction

Example 2.2
Consider
$$g = \begin{pmatrix} 2 \\ 4 \end{pmatrix} \quad \text{and} \quad H = \begin{pmatrix} 1 & 0 \\ 0 & -2 \end{pmatrix}$$
The Newton direction is
$$p^N = -H^{-1} g = \begin{pmatrix} -2 \\ 2 \end{pmatrix} \quad \text{so that} \quad g^T p^N = 4 > 0 \quad \text{(ascent direction)}$$
and $p^N$ is a saddle point of the quadratic model $g^T p + \tfrac{1}{2} p^T H p$ since $H$ is indefinite.

Algorithm 1 (Method 1) generates
$$B = \begin{pmatrix} 1 & 0 \\ 0 & \varepsilon \end{pmatrix}$$
so that
$$p = -B^{-1} g = \begin{pmatrix} -2 \\ -4/\varepsilon \end{pmatrix} \quad \text{and} \quad g^T p = -4 - \frac{16}{\varepsilon} < 0 \quad \text{(descent direction)}$$

Algorithm 2 (Method 2) generates
$$B = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$$
so that
$$p = -B^{-1} g = \begin{pmatrix} -2 \\ -2 \end{pmatrix} \quad \text{and} \quad g^T p = -12 < 0 \quad \text{(descent direction)}$$

[Figure: the Newton direction $p^N$ and the directions $p$ produced by Method 1 and Method 2.]

Question: What is the geometric interpretation?

Answer: Let the spectral decomposition of $H$ be $H = V \Lambda V^T$ and assume that $\lambda_j \ne 0$ for all $j$.
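Algorithms 1 and 2 and the numbers in Example 2.2 can be checked with a short NumPy sketch. The function name `modified_newton_matrix`, the flag `flip_negative`, and the value $\beta = 10$ are assumptions made for illustration and do not come from the slides. Note that `numpy.linalg.eigh` returns eigenvalues in ascending order, whereas the slides order them descending; this does not affect $B$.

```python
import numpy as np

def modified_newton_matrix(H, beta=10.0, flip_negative=False):
    """Build a positive-definite B from the spectral decomposition of symmetric H.

    flip_negative=False: Algorithm 1 (eigenvalues below eps are replaced by eps).
    flip_negative=True:  Algorithm 2 (eigenvalues <= -eps are replaced by -lambda,
                         the remaining eigenvalues below eps by eps).
    Returns the pair (B, eps).
    """
    H = np.asarray(H, dtype=float)
    lam, V = np.linalg.eigh(H)              # H = V diag(lam) V^T
    norm_H = np.max(np.abs(lam))            # ||H||_2 for a symmetric matrix
    eps = 1.0 if norm_H == 0.0 else norm_H / beta
    lam_bar = np.where(lam >= eps, lam, eps)
    if flip_negative:
        lam_bar = np.where(lam <= -eps, -lam, lam_bar)
    B = (V * lam_bar) @ V.T                 # V diag(lam_bar) V^T
    return B, eps

# Data from Example 2.2
g = np.array([2.0, 4.0])
H = np.diag([1.0, -2.0])

p_newton = -np.linalg.solve(H, g)
B1, eps = modified_newton_matrix(H, beta=10.0)
B2, _ = modified_newton_matrix(H, beta=10.0, flip_negative=True)
p1 = -np.linalg.solve(B1, g)
p2 = -np.linalg.solve(B2, g)

print("Newton:   p =", p_newton, " g^T p =", g @ p_newton)
print("Method 1: p =", p1, " g^T p =", g @ p1, " eps =", eps)
print("Method 2: p =", p2, " g^T p =", g @ p2)
```

With these assumed settings $\varepsilon = \|H\|_2/\beta = 0.2$, so Method 1 gives a step whose second component is $-4/\varepsilon = -20$, which illustrates how the step grows as $\varepsilon$ shrinks, while Method 2 gives a step that does not depend on $\varepsilon$ for this example.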
