Line-Search Methods for Smooth Unconstrained Optimization
Daniel P. Robinson
Department of Applied Mathematics and Statistics
Johns Hopkins University
September 17, 2020

Outline
1. Generic Linesearch Framework
2. Computing a descent direction p_k (search direction)
   - Steepest descent direction
   - Modified Newton direction
   - Quasi-Newton directions for medium-scale problems
   - Limited-memory quasi-Newton directions for large-scale problems
   - Linear CG method for large-scale problems
3. Choosing the step length α_k (linesearch)
   - Backtracking-Armijo linesearch
   - Wolfe conditions
   - Strong Wolfe conditions
4. Complete Algorithms
   - Steepest descent backtracking-Armijo linesearch method
   - Modified Newton backtracking-Armijo linesearch method
   - Modified Newton linesearch method based on the Wolfe conditions
   - Quasi-Newton linesearch method based on the Wolfe conditions

The problem

    minimize_{x ∈ R^n}  f(x),

where the objective function f : R^n → R.

- assume that f ∈ C^1 (sometimes C^2) and is Lipschitz continuous
  - in practice this assumption may be violated, but the algorithms we develop may still work
- in practice it is very rare to be able to provide an explicit minimizer
- we consider iterative methods: given a starting guess x_0, generate a sequence {x_k} for k = 1, 2, ...
- AIM: ensure that (a subsequence) has some favorable limiting properties:
  - satisfies first-order necessary conditions
  - satisfies second-order necessary conditions

Notation: f_k = f(x_k), g_k = g(x_k), H_k = H(x_k), where g = ∇f and H = ∇²f.

The basic idea
- consider descent methods, i.e., f(x_{k+1}) < f(x_k)
- calculate a search direction p_k from x_k
- ensure that this direction is a descent direction, i.e.,
      g_k^T p_k < 0  if  g_k ≠ 0,
  so that, for small steps along p_k, the objective function f will be reduced
- calculate a suitable steplength α_k > 0 so that
      f(x_k + α_k p_k) < f_k
  - the computation of α_k is the linesearch
  - the computation of α_k may itself require an iterative procedure
- the generic update for linesearch methods is given by
      x_{k+1} = x_k + α_k p_k

Issue 1: steps might be too long

[Figure: The objective function f(x) = x^2 and the iterates x_{k+1} = x_k + α_k p_k generated by the descent directions p_k = (−1)^{k+1} and steps α_k = 2 + 3/2^{k+1} from x_0 = 2.]

- decrease in f is not proportional to the size of the directional derivative!

Issue 2: steps might be too short

[Figure: The objective function f(x) = x^2 and the iterates x_{k+1} = x_k + α_k p_k generated by the descent directions p_k = −1 and steps α_k = 1/2^{k+1} from x_0 = 2.]

- decrease in f is not proportional to the size of the directional derivative!
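The generic framework above translates almost line-for-line into code. The following Python sketch is illustrative only (the slides give no implementation): it applies the update x_{k+1} = x_k + α_k p_k with a crude steplength loop that merely enforces f(x_k + α_k p_k) < f_k; the function names, tolerances, and the halving rule are assumptions, and the Armijo/Wolfe conditions listed in the outline refine this steplength test.

```python
import numpy as np

def generic_linesearch_method(f, grad, x0, direction, max_iter=100, tol=1e-8):
    """Generic linesearch framework: x_{k+1} = x_k + alpha_k * p_k.

    `direction(x, g)` must return a descent direction p_k, i.e. g_k^T p_k < 0.
    The steplength loop below only enforces the simple decrease
    f(x_k + alpha_k p_k) < f_k; practical codes use Armijo/Wolfe tests instead.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:        # stop near a first-order stationary point
            break
        p = direction(x, g)                 # assumed to satisfy g @ p < 0
        alpha, fx = 1.0, f(x)
        while f(x + alpha * p) >= fx and alpha > 1e-16:
            alpha *= 0.5                    # crude backtracking: shrink until f decreases
        x = x + alpha * p
    return x

# The example behind the figures: f(x) = x^2 with the steepest-descent choice p_k = -g_k
f = lambda x: float(x[0] ** 2)
grad = lambda x: np.array([2.0 * x[0]])
print(generic_linesearch_method(f, grad, x0=[2.0], direction=lambda x, g: -g))
```

On the f(x) = x^2 example from the figures this reaches the minimizer x = 0 after a single halving of the unit step; the two "Issue" figures show what can go wrong when the steplength rule is less careful.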
What is a practical linesearch?
- in the "early" days exact linesearches were used, i.e., pick α_k to minimize f(x_k + α p_k)
  - an exact linesearch requires univariate minimization
  - cheap if f is simple, e.g., a quadratic function
  - generally very expensive and not cost effective
  - an exact linesearch may not be much better than an approximate linesearch
- modern methods use inexact linesearches
  - ensure steps are neither too long nor too short
  - make sure that the decrease in f is proportional to the directional derivative
  - try to pick an "appropriate" initial stepsize for fast convergence
    - related to how the search direction p_k is computed
- the descent direction (search direction) is also important
  - "bad" directions may not converge at all
  - more typically, "bad" directions may converge very slowly

Definition 2.1 (Steepest descent direction)
For a differentiable function f, the search direction

    p_k := −∇f(x_k) ≡ −g_k

is called the steepest-descent direction.

- p_k is a descent direction provided g_k ≠ 0:
      g_k^T p_k = −g_k^T g_k = −‖g_k‖_2^2 < 0
- p_k solves the problem
      minimize_{p ∈ R^n}  m_k^L(x_k + p)   subject to   ‖p‖_2 = ‖g_k‖_2,
  where
  - m_k^L(x_k + p) := f_k + g_k^T p
  - m_k^L(x_k + p) ≈ f(x_k + p)
- any method that uses the steepest-descent direction is a method of steepest descent

Observation: the steepest descent direction is also the unique solution to

    minimize_{p ∈ R^n}  f_k + g_k^T p + (1/2) p^T p

- it approximates the second-derivative Hessian by the identity matrix
  - how often is this a good idea?
- is it a surprise that convergence is typically very slow?

Idea: choose a positive-definite B_k and compute the search direction as the unique minimizer

    p_k = argmin_{p ∈ R^n}  m_k^Q(p) := f_k + g_k^T p + (1/2) p^T B_k p

- p_k satisfies B_k p_k = −g_k
- why must B_k be positive definite?
  - B_k ≻ 0  ⟹  m_k^Q is strictly convex  ⟹  unique solution
  - if g_k ≠ 0, then p_k ≠ 0 and
        p_k^T g_k = −p_k^T B_k p_k < 0  ⟹  p_k is a descent direction
- pick an "intelligent" B_k that has "useful" curvature information
- if H_k ≻ 0 and we choose B_k = H_k, then p_k is the Newton direction. Awesome!

Question: How do we choose the positive-definite matrix B_k?

Ideally, B_k is chosen such that
- ‖B_k − H_k‖ is "small"
- B_k = H_k when H_k is "sufficiently" positive definite

Comments:
- for the remainder of this section, we omit the suffix k and write H = H_k, B = B_k, and g = g_k
- for a symmetric matrix A ∈ R^{n×n}, we use the matrix norm
      ‖A‖_2 = max_{1≤j≤n} |λ_j|,
  with {λ_j} the eigenvalues of A
- the spectral decomposition of H is given by H = VΛV^T, where
      V = (v_1  v_2  ···  v_n)   and   Λ = diag(λ_1, λ_2, ..., λ_n),
  with H v_j = λ_j v_j and λ_1 ≥ λ_2 ≥ ··· ≥ λ_n
- H is positive definite if and only if λ_j > 0 for all j
- computing the spectral decomposition is, generally, very expensive!

Method 1: small-scale problems (n < 1000)

Algorithm 1  Compute modified Newton matrix B from H
 1: input H
 2: Choose β > 1, the desired bound on the condition number of B.
 3: Compute the spectral decomposition H = VΛV^T.
 4: if H = 0 then
 5:     Set ε = 1
 6: else
 7:     Set ε = ‖H‖_2 / β > 0
 8: end if
 9: Compute Λ̄ = diag(λ̄_1, λ̄_2, ..., λ̄_n) with λ̄_j = λ_j if λ_j ≥ ε, and λ̄_j = ε otherwise.
10: return B = VΛ̄V^T ≻ 0, which satisfies cond(B) ≤ β

- replaces the eigenvalues of H that are "not positive enough" with ε
- Λ̄ = Λ + D, where
      D = diag( max(0, ε − λ_1), max(0, ε − λ_2), ..., max(0, ε − λ_n) ) ⪰ 0
- B = H + E, where E = VDV^T ⪰ 0

Question: What are the properties of the resulting search direction?

    p = −B^{-1} g = −VΛ̄^{-1}V^T g = −Σ_{j=1}^{n} (v_j^T g / λ̄_j) v_j

Taking norms and using the orthogonality of V gives

    ‖p‖_2^2 = ‖B^{-1} g‖_2^2 = Σ_{j=1}^{n} (v_j^T g)^2 / λ̄_j^2
            = Σ_{λ_j ≥ ε} (v_j^T g)^2 / λ_j^2  +  Σ_{λ_j < ε} (v_j^T g)^2 / ε^2

Thus, we conclude that

    ‖p‖_2 = O(1/ε)   provided v_j^T g ≠ 0 for at least one λ_j < ε

- the step p becomes unbounded as ε → 0 provided v_j^T g ≠ 0 for at least one λ_j < ε
- any indefinite matrix will generally satisfy this property
- the next method that we discuss is better!
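Algorithm 1 is easy to prototype with a dense eigensolver. The sketch below is an assumed implementation (NumPy's `eigh` for the spectral decomposition, the helper name, the test data, and the default β are my choices, not from the slides); it also illustrates the ‖p‖_2 = O(1/ε) behaviour just derived.

```python
import numpy as np

def modified_newton_B_method1(H, beta=1e8):
    """Algorithm 1: replace eigenvalues of H smaller than eps by eps, so that
    B = V diag(lambda_bar) V^T is positive definite with cond(B) <= beta."""
    H = np.asarray(H, dtype=float)
    lam, V = np.linalg.eigh(H)                      # spectral decomposition H = V Lam V^T
    eps = 1.0 if np.allclose(H, 0.0) else np.linalg.norm(H, 2) / beta
    lam_bar = np.where(lam >= eps, lam, eps)        # eigenvalues "not positive enough" -> eps
    return V @ np.diag(lam_bar) @ V.T

# An indefinite test matrix (assumed data): the modified Newton step grows like 1/eps
H = np.array([[2.0, 1.0], [1.0, -3.0]])
g = np.array([1.0, 1.0])
for beta in (1e2, 1e6):
    B = modified_newton_B_method1(H, beta)
    p = np.linalg.solve(B, -g)                      # solve B p = -g
    print(beta, g @ p, np.linalg.norm(p))           # g^T p < 0, but ||p|| = O(1/eps) = O(beta)
```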
Method 2: small-scale problems (n < 1000)

Algorithm 2  Compute modified Newton matrix B from H
 1: input H
 2: Choose β > 1, the desired bound on the condition number of B.
 3: Compute the spectral decomposition H = VΛV^T.
 4: if H = 0 then
 5:     Set ε = 1
 6: else
 7:     Set ε = ‖H‖_2 / β > 0
 8: end if
 9: Compute Λ̄ = diag(λ̄_1, λ̄_2, ..., λ̄_n) with λ̄_j = λ_j if λ_j ≥ ε, λ̄_j = −λ_j if λ_j ≤ −ε, and λ̄_j = ε otherwise.
10: return B = VΛ̄V^T ≻ 0, which satisfies cond(B) ≤ β

- replaces small eigenvalues λ_j of H with ε
- replaces "sufficiently negative" eigenvalues λ_j of H with −λ_j
- Λ̄ = Λ + D, where
      D = diag( max(0, −2λ_1, ε − λ_1), ..., max(0, −2λ_n, ε − λ_n) ) ⪰ 0
- B = H + E, where E = VDV^T ⪰ 0

Suppose that B = H + E is computed from Algorithm 2, so that

    E = B − H = V(Λ̄ − Λ)V^T,

and, since V is orthogonal,

    ‖E‖_2 = ‖V(Λ̄ − Λ)V^T‖_2 = ‖Λ̄ − Λ‖_2 = max_j |λ̄_j − λ_j|.

By definition,

    λ̄_j − λ_j = 0 if λ_j ≥ ε,   ε − λ_j if −ε < λ_j < ε,   −2λ_j if λ_j ≤ −ε,

which implies that

    ‖E‖_2 = max_{1≤j≤n} { 0, ε − λ_j, −2λ_j }.

However, since λ_1 ≥ λ_2 ≥ ··· ≥ λ_n, we know that

    ‖E‖_2 = max { 0, ε − λ_n, −2λ_n }.

- if λ_n ≥ ε, i.e., H is sufficiently positive definite, then E = 0 and B = H
- if λ_n < ε, then B ≠ H and it can be shown that ‖E‖_2 ≤ 2 max(ε, |λ_n|)
- regardless, B is well-conditioned by construction

Example 2.2
Consider

    g = (2, 4)^T   and   H = [ 1  0 ;  0  −2 ].

The Newton direction is

    p^N = −H^{-1} g = (−2, 2)^T,   so that   g^T p^N = 4 > 0   (ascent direction),

and p^N is a saddle point of the quadratic model g^T p + (1/2) p^T H p since H is indefinite.

Algorithm 1 (Method 1) generates

    B = [ 1  0 ;  0  ε ],

so that

    p = −B^{-1} g = (−2, −4/ε)^T   and   g^T p = −4 − 16/ε < 0   (descent direction).

Algorithm 2 (Method 2) generates

    B = [ 1  0 ;  0  2 ],

so that

    p = −B^{-1} g = (−2, −2)^T   and   g^T p = −12 < 0   (descent direction).

(A numerical check of this example is sketched at the end of the section.)

Question: What is the geometric interpretation?
Answer: Let the spectral decomposition of H be H = VΛV^T and assume that λ_j ≠ 0 for all j.
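To close, here is a companion sketch of Algorithm 2 in the same assumed NumPy style (the helper name and default β are mine, not from the slides), used to reproduce the numbers in Example 2.2 and to check the bound ‖E‖_2 ≤ 2 max(ε, |λ_n|).

```python
import numpy as np

def modified_newton_B_method2(H, beta=1e8):
    """Algorithm 2: keep lambda_j >= eps, flip lambda_j <= -eps to -lambda_j,
    and replace the remaining small eigenvalues by eps."""
    H = np.asarray(H, dtype=float)
    lam, V = np.linalg.eigh(H)
    eps = 1.0 if np.allclose(H, 0.0) else np.linalg.norm(H, 2) / beta
    lam_bar = np.where(lam >= eps, lam, np.where(lam <= -eps, -lam, eps))
    return V @ np.diag(lam_bar) @ V.T

# Example 2.2: g = (2, 4)^T, H = diag(1, -2)
g = np.array([2.0, 4.0])
H = np.diag([1.0, -2.0])

p_newton = np.linalg.solve(H, -g)        # (-2, 2): g^T p = 4 > 0, an ascent direction
B = modified_newton_B_method2(H)         # equals diag(1, 2) here
p = np.linalg.solve(B, -g)               # (-2, -2): g^T p = -12 < 0, a descent direction
print(g @ p_newton, g @ p)

E = B - H                                # modification E = V D V^T
print(np.linalg.norm(E, 2))              # 4.0, attaining the bound 2*max(eps, |lam_n|) = 4
```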