
Line-Search Methods for Smooth Unconstrained Optimization

Daniel P. Robinson, Department of Applied Mathematics and Statistics, Johns Hopkins University

September 17, 2020


Outline

1 Generic Linesearch Framework

2 Computing a descent direction pk (search direction)
   Steepest descent direction
   Modified Newton direction
   Quasi-Newton directions for medium-scale problems
   Limited-memory quasi-Newton directions for large-scale problems
   Linear CG method for large-scale problems

3 Choosing the step length αk (linesearch)
   Backtracking-Armijo linesearch
   Strong Wolfe conditions

4 Complete Algorithms
   Steepest descent backtracking-Armijo linesearch method
   Modified Newton backtracking-Armijo linesearch method
   Modified Newton linesearch method based on the Wolfe conditions
   Quasi-Newton linesearch method based on the Wolfe conditions

The problem

\[ \min_{x \in \mathbb{R}^n} \; f(x) \]
where the objective function $f : \mathbb{R}^n \to \mathbb{R}$

assume that f ∈ C1 (sometimes C2) and is Lipschitz continuous
in practice this assumption may be violated, but the algorithms we develop may still work
in practice it is very rare to be able to provide an explicit minimizer

we consider iterative methods: given starting guess x0, generate a sequence {xk} for k = 1, 2, ...

AIM: ensure that (a subsequence) has some favorable limiting properties:
  satisfies first-order necessary conditions
  satisfies second-order necessary conditions

Notation: fk = f(xk), gk = g(xk) = ∇f(xk), Hk = H(xk) = ∇²f(xk)


The basic idea

consider descent methods, i.e., f(xk+1) < f(xk)

calculate a search direction pk from xk and ensure that this direction is a descent direction, i.e.,
\[ g_k^T p_k < 0 \quad \text{if } g_k \neq 0, \]

so that, for small steps along pk, the objective function f will be reduced

calculate a suitable steplength αk > 0 so that

f(xk + αkpk) < fk

computation of αk is the linesearch

the computation of αk may itself require an iterative procedure
the generic update for linesearch methods is given by

xk+1 = xk + αkpk

Issue 1: steps might be too long

Figure: The objective function f(x) = x² and the iterates $x_{k+1} = x_k + \alpha_k p_k$ generated by the descent directions $p_k = (-1)^{k+1}$ and steps $\alpha_k = 2 + 3/2^{k+1}$ from $x_0 = 2$.

decrease in f is not proportional to the size of the directional derivative!

Issue 2: steps might be too short

Figure: The objective function f(x) = x² and the iterates $x_{k+1} = x_k + \alpha_k p_k$ generated by the descent directions $p_k = -1$ and steps $\alpha_k = 1/2^{k+1}$ from $x_0 = 2$.

decrease in f is not proportional to the size of the directional derivative!

What is a practical linesearch?

in the “early” days exact linesearches were used, i.e., pick αk to minimize

f(xk + αpk)

an exact linesearch requires univariate minimization
cheap if f is simple, e.g., a quadratic function
generally very expensive and not cost effective
an exact linesearch may not be much better than an approximate linesearch

modern methods use inexact linesearches

ensure steps are neither too long nor too short
make sure that the decrease in f is proportional to the directional derivative
try to pick an “appropriate” initial stepsize for fast convergence (related to how the search direction pk is computed)

the descent direction (search direction) is also important

“bad” directions may not converge at all
more typically, “bad” directions may converge very slowly


Definition 2.1 (Steepest descent direction)
For a differentiable function f, the search direction
\[ p_k \overset{\text{def}}{=} -\nabla f(x_k) \equiv -g_k \]
is called the steepest-descent direction.

$p_k$ is a descent direction provided $g_k \neq 0$, since $g_k^T p_k = -g_k^T g_k = -\|g_k\|_2^2 < 0$

$p_k$ solves the problem
\[ \min_{p \in \mathbb{R}^n} \; m_k^L(x_k + p) \quad \text{subject to} \quad \|p\|_2 = \|g_k\|_2 \]
where $m_k^L(x_k + p) \overset{\text{def}}{=} f_k + g_k^T p$ and $m_k^L(x_k + p) \approx f(x_k + p)$

Any method that uses the steepest-descent direction is a method of steepest descent.

Observation: the steepest descent direction is also the unique solution to
\[ \min_{p \in \mathbb{R}^n} \; f_k + g_k^T p + \tfrac{1}{2} p^T p \]
approximates the second-derivative (Hessian) matrix by the identity matrix I
  how often is this a good idea? is it a surprise that convergence is typically very slow?

Idea: choose positive definite Bk and compute search direction as the unique minimizer

\[ p_k = \operatorname*{argmin}_{p \in \mathbb{R}^n} \; m_k^Q(p) \overset{\text{def}}{=} f_k + g_k^T p + \tfrac{1}{2} p^T B_k p \]

$p_k$ satisfies $B_k p_k = -g_k$
why must $B_k$ be positive definite?
  $B_k \succ 0$ implies $m_k^Q$ is strictly convex, which implies a unique solution
  if $g_k \neq 0$, then $p_k \neq 0$ and is a descent direction, since $p_k^T g_k = -p_k^T B_k p_k < 0$

pick “intelligent” Bk that has “useful” curvature information

If $H_k \succ 0$ and we choose $B_k = H_k$, then $p_k$ is the Newton direction. Awesome!


Question: How do we choose the positive-definite matrix Bk?
Ideally, Bk is chosen such that

kBk − Hkk is “small”

Bk = Hk when Hk is “sufficiently” positive definite.

Comments:

for the remainder of this section, we omit the suffix k and write $H = H_k$, $B = B_k$, and $g = g_k$.
for a symmetric matrix $A \in \mathbb{R}^{n \times n}$, we use the matrix norm
\[ \|A\|_2 = \max_{1 \le j \le n} |\lambda_j| \]
with $\{\lambda_j\}$ the eigenvalues of A.
the spectral decomposition of H is given by $H = V \Lambda V^T$, where
\[ V = \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix} \quad \text{and} \quad \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) \]
with $H v_j = \lambda_j v_j$ and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$.

H is positive definite if and only if λj > 0 for all j. computing the spectral decomposition is, generally, very expensive!

Method 1: small scale problems (n < 1000)

Algorithm 1 Compute modified Newton matrix B from H

1: input H
2: Choose β > 1, the desired bound on the condition number of B.
3: Compute the spectral decomposition H = VΛV^T.
4: if H = 0 then
5:   Set ε = 1
6: else
7:   Set ε = ‖H‖₂/β > 0
8: end if
9: Compute Λ̄ = diag(λ̄_1, λ̄_2, ..., λ̄_n) with λ̄_j = λ_j if λ_j ≥ ε, and λ̄_j = ε otherwise
10: return B = VΛ̄V^T ≻ 0, which satisfies cond(B) ≤ β

replaces the eigenvalues of H that are “not positive enough” with ε
Λ̄ = Λ + D, where D = diag( max(0, ε − λ_1), max(0, ε − λ_2), ..., max(0, ε − λ_n) ) ⪰ 0
B = H + E, where E = V D V^T ⪰ 0

Question: What are the properties of the resulting search direction?
\[ p = -B^{-1} g = -V \bar{\Lambda}^{-1} V^T g = -\sum_{j=1}^{n} \frac{v_j^T g}{\bar{\lambda}_j} v_j \]
Taking norms and using the orthogonality of V gives
\[ \|p\|_2^2 = \|B^{-1} g\|_2^2 = \sum_{j=1}^{n} \frac{(v_j^T g)^2}{\bar{\lambda}_j^2} = \sum_{\lambda_j \ge \varepsilon} \frac{(v_j^T g)^2}{\lambda_j^2} + \sum_{\lambda_j < \varepsilon} \frac{(v_j^T g)^2}{\varepsilon^2} \]
Thus, we conclude that
\[ \|p\|_2 = O\!\left(\frac{1}{\varepsilon}\right) \quad \text{provided } v_j^T g \neq 0 \text{ for at least one } \lambda_j < \varepsilon \]
the step p becomes unbounded as ε → 0 provided $v_j^T g \neq 0$ for at least one $\lambda_j < \varepsilon$
any indefinite matrix will generally satisfy this property
the next method that we discuss is better!

Method 2: small scale problems (n < 1000)

Algorithm 2 Compute modified Newton matrix B from H

1: input H
2: Choose β > 1, the desired bound on the condition number of B.
3: Compute the spectral decomposition H = VΛV^T.
4: if H = 0 then
5:   Set ε = 1
6: else
7:   Set ε = ‖H‖₂/β > 0
8: end if
9: Compute Λ̄ = diag(λ̄_1, λ̄_2, ..., λ̄_n) with λ̄_j = λ_j if λ_j ≥ ε; λ̄_j = −λ_j if λ_j ≤ −ε; λ̄_j = ε otherwise

10: return B = VΛ̄V^T ≻ 0, which satisfies cond(B) ≤ β

replaces small eigenvalues λ_i of H with ε
replaces “sufficiently negative” eigenvalues λ_i of H with −λ_i
Λ̄ = Λ + D, where D = diag( max(0, −2λ_1, ε − λ_1), ..., max(0, −2λ_n, ε − λ_n) ) ⪰ 0
B = H + E, where E = V D V^T ⪰ 0
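As a quick illustration, here is a minimal NumPy sketch of Method 2 (Algorithm 2). The function name modified_newton_matrix and the default value of β are illustrative choices, not part of the slides; the test data anticipate Example 2.2 below.

```python
# A sketch of Algorithm 2 (Method 2), assuming NumPy; the names and the
# default beta are illustrative, not from the slides.
import numpy as np

def modified_newton_matrix(H, beta=1e8):
    """Return B = V diag(lambda_bar) V^T with cond(B) <= beta (Algorithm 2)."""
    lam, V = np.linalg.eigh(H)              # spectral decomposition H = V diag(lam) V^T
    if np.all(lam == 0.0):                  # H = 0: fall back to eps = 1
        eps = 1.0
    else:
        eps = np.max(np.abs(lam)) / beta    # eps = ||H||_2 / beta > 0
    lam_bar = np.where(lam >= eps, lam,     # keep sufficiently positive eigenvalues,
              np.where(lam <= -eps, -lam,   # flip sufficiently negative ones,
                       eps))                # and push the rest up to eps
    return V @ np.diag(lam_bar) @ V.T

# Data of Example 2.2: the -2 eigenvalue of H is flipped to +2.
H = np.array([[1.0, 0.0], [0.0, -2.0]])
print(np.linalg.eigvalsh(modified_newton_matrix(H)))   # approximately [1., 2.]
```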

Suppose that B = H + E is computed from Algorithm 2 so that

E = B − H = V(Λ¯ − Λ)VT Notes

and since V is orthogonal that

T kEk2 = kV(Λ¯ − Λ)V k2 = kΛ¯ − Λk2 = max |λ¯j − λj| j

By definition
\[ \bar{\lambda}_j - \lambda_j = \begin{cases} 0 & \text{if } \lambda_j \ge \varepsilon \\ \varepsilon - \lambda_j & \text{if } -\varepsilon < \lambda_j < \varepsilon \\ -2\lambda_j & \text{if } \lambda_j \le -\varepsilon \end{cases} \]
which implies that $\|E\|_2 = \max_{1 \le j \le n} \{\, 0,\; \varepsilon - \lambda_j,\; -2\lambda_j \,\}$.

However, since λ1 ≥ λ2 ≥ · · · ≥ λn, we know that

kEk2 = max { 0, ε − λn, −2λn }

if λn ≥ ε, i.e., H is sufficiently positive definite, then E = 0 and B = H

if $\lambda_n < \varepsilon$, then B ≠ H and it can be shown that $\|E\|_2 \le 2 \max(\varepsilon, |\lambda_n|)$
regardless, B is well-conditioned by construction

Example 2.2
Consider
\[ g = \begin{pmatrix} 2 \\ 4 \end{pmatrix} \quad \text{and} \quad H = \begin{pmatrix} 1 & 0 \\ 0 & -2 \end{pmatrix} \]
The Newton direction is
\[ p^N = -H^{-1} g = \begin{pmatrix} -2 \\ 2 \end{pmatrix} \quad \text{so that} \quad g^T p^N = 4 > 0 \;\; \text{(ascent direction)} \]
and $p^N$ is a saddle point of the quadratic model $g^T p + \tfrac{1}{2} p^T H p$ since H is indefinite.
Algorithm 1 (Method 1) generates
\[ B = \begin{pmatrix} 1 & 0 \\ 0 & \varepsilon \end{pmatrix} \]

so that
\[ p = -B^{-1} g = \begin{pmatrix} -2 \\ -4/\varepsilon \end{pmatrix} \quad \text{and} \quad g^T p = -4 - \frac{16}{\varepsilon} < 0 \;\; \text{(descent direction)} \]
Algorithm 2 (Method 2) generates
\[ B = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} \quad \text{so that} \quad p = -B^{-1} g = \begin{pmatrix} -2 \\ -2 \end{pmatrix} \quad \text{and} \quad g^T p = -12 < 0 \;\; \text{(descent direction)} \]

Figure: The Newton direction $p^N$ compared with the modified-Newton direction p produced by Method 1 and by Method 2.

p Method 2 Question: What is the geometric interpretation? Answer: Let the spectral decomposition of H be Notes H = VΛVT and assume that λj 6= 0 for all j. Partition the eigenvector and eigenvalue matrices as ! Λ+ } λj>0  Λ = and V = V+ V− Λ− } λj<0 |{z} |{z} λj>0 λj<0    T   Λ+ V+ So that H = V+ V− T and the Newton direction is Λ− V−  −1   T  N −1 T  Λ+ V+ p = −VΛ V g = − V+ V− −1 T g Λ− V−

 −1   T   Λ+ V+g = − V+ V− −1 T Λ− V−g

−1 T −1 T = −V+Λ+ V+g − V−Λ− V−g

−1 T −1 T def N N = − V+Λ+ V+g − V−Λ− V−g = p+ + p− | {z } | {z } pN pN + − where N −1 T N −1 T p+ = −V+Λ V+g and p− = −V−Λ V−g + − 21 / 106

Notes

Result
Given $m_k^Q(p) = g^T p + \tfrac{1}{2} p^T H p$ with H nonsingular, the Newton direction
\[ p^N = p_+^N + p_-^N = -V_+\Lambda_+^{-1}V_+^T g - V_-\Lambda_-^{-1}V_-^T g \]
is such that
$p_+^N$ minimizes $m_k^Q(p)$ on the subspace spanned by $V_+$.
$p_-^N$ maximizes $m_k^Q(p)$ on the subspace spanned by $V_-$.

22 / 106 Proof: def Q Notes The function ψ(y) = mk (V+y) may be written as

T 1 T T ψ(y) = g V+y + 2 y V+HV+y and has a stationary point y∗ given by

∗ T −1 T y = −(V+HV+) V+g

Since 2 T T T ∇ ψ(y) = V+HV+ = V+VΛV V+ = Λ+ 0 we know that y∗ is, in fact, the unique minimizer of ψ(y) and

∗ −1 T N V+y = −V+Λ+ V+g ≡ p+ For the second part

∗ T −1 T −1 T y = −(V−HV−) V−g = −Λ− V−g so that ∗ −1 T N V−y = −V−Λ− V−g ≡ p−.

N Q Since Λ− ≺ 0, we know that p− maximizes mk (p) on the space spanned by the columns of V−.

23 / 106

Some observations

N p+ is a descent direction for f:

T N T −1 T T −1 g p+ = −g V+Λ+ V+g = −y Λ+ y ≤ 0

def T where y = V+g. The inequality is strict if y 6= 0, i.e., g is not orthogonal to the

column space of V+

N p− is an ascent direction for f:

T N T −1 T T −1 g p− = −g V−Λ− V−g = −y Λ− y ≥ 0

def T where y = V−g. The inequality is strict if y 6= 0, i.e., g is not orthogonal to the

column space of V−

N N N p = p+ + p− may or may not be a descent direction! This is the geometry of the Newton direction. What about the modified Newton direction?

Question: what is the geometry of the modified-Newton step?
Answer: Partition the matrices in the spectral decomposition $H = V\Lambda V^T$ such that
\[ V = \begin{pmatrix} V_+ & V_\varepsilon & V_- \end{pmatrix} \quad \text{and} \quad \Lambda = \begin{pmatrix} \Lambda_+ & & \\ & \Lambda_\varepsilon & \\ & & \Lambda_- \end{pmatrix}, \]
where

\[ \Lambda_+ = \{\lambda_i : \lambda_i \ge \varepsilon\}, \qquad \Lambda_\varepsilon = \{\lambda_i : |\lambda_i| < \varepsilon\}, \qquad \Lambda_- = \{\lambda_i : \lambda_i \le -\varepsilon\} \]

Theorem 2.3 (Properties of the modified-Newton direction)

Suppose that the positive-definite matrix $B_k$ is computed from Algorithm 2 with input $H_k$ and value ε > 0. If $g_k \neq 0$ and $p_k$ is the unique solution to $B_k p = -g_k$, then $p_k$ may be written as $p_k = p_+ + p_\varepsilon + p_-$, where
(i) $p_+$ is a direction of positive curvature that minimizes $m_k^Q(p)$ in the space spanned by the columns of $V_+$;
(ii) $-(p_-)$ is a direction of negative curvature that maximizes $m_k^Q(p)$ in the space spanned by the columns of $V_-$; and
(iii) $p_\varepsilon$ is a multiple of the steepest-descent direction in the space spanned by the columns of $V_\varepsilon$, with $\|p_\varepsilon\|_2 = O(1/\varepsilon)$.

Comments:
implies that singular Hessians $H_k$ may easily cause numerical difficulties
the direction $p_-$ is the negative of $p_-^N$ associated with the Newton step

Notes

Main goals of quasi-Newton methods

Maintain positive-definite “approximation” Bk to Hk

Instead of computing Bk+1 from scratch, e.g., from a spectral decomposition of Hk+1, want to compute Bk+1 as an update to Bk using information at xk+1

inject “real” curvature information from H(xk+1) into Bk+1

Should be cheap to form and compute with Bk be applicable for medium-scale problems (n ≈ 1000 − 10, 000)

Hope that fast convergence (superlinear) of the iterates {xk} computed using Bk can be recovered

The optimization problem solved in the (k + 1)st iterate will be
\[ \min_{p \in \mathbb{R}^n} \; m_{k+1}^Q(p) \overset{\text{def}}{=} f_{k+1} + g_{k+1}^T p + \tfrac{1}{2} p^T B_{k+1} p \]

How do we choose Bk+1?

How about to satisfy the following:

the model should agree with f at xk+1

the gradient of the model should agree with the gradient of f at xk+1

the gradient of the model should agree with the gradient of f at xk (previous point)

These are equivalent to
1. $m_{k+1}^Q(0) = f_{k+1}$   ✓ already satisfied
2. $\nabla m_{k+1}^Q(0) = g_{k+1}$   ✓ already satisfied
3. $\nabla m_{k+1}^Q(-\alpha_k p_k) = g_k$, i.e.,
\[ B_{k+1}(-\alpha_k p_k) + g_{k+1} = g_k \iff B_{k+1}\underbrace{(x_{k+1} - x_k)}_{s_k} = \underbrace{g_{k+1} - g_k}_{y_k} \]

Thus, we aim to cheaply compute a symmetric positive-definite matrix Bk+1 that satisfies the so-called secant equation

\[ B_{k+1} s_k = y_k \quad \text{where} \quad s_k \overset{\text{def}}{=} x_{k+1} - x_k \quad \text{and} \quad y_k \overset{\text{def}}{=} g_{k+1} - g_k \]

28 / 106

Observation: If the matrix $B_{k+1}$ is positive definite, then the curvature condition
\[ s_k^T y_k > 0 \tag{1} \]
is satisfied.

using the fact that $B_{k+1} s_k = y_k$, we see that
\[ s_k^T y_k = s_k^T B_{k+1} s_k \]
which must be positive if $B_{k+1} \succ 0$
the previous relation explains why (1) is called a curvature condition

(1) will hold for any two points xk and xk+1 if f is strictly convex

(1) does not always hold for any two points xk and xk+1 if f is nonconvex

for nonconvex functions, we need to be careful how we choose αk that defines xk+1 = xk + αkpk (see Wolfe conditions for linesearch strategies)

Note: for the remainder, I will drop the subscript k from all terms.

The problem
Given a symmetric positive-definite matrix B, and vectors s and y, find a symmetric positive-definite matrix B̄ that satisfies the secant equation B̄s = y

Want a cheap update, so let us first consider a rank-one update of the form B¯ = B + uvT for some vectors u and v that we seek. Derivation: Case 1: Bs = y Choose u = v = 0 since Bs¯ = (B + uvT)s = Bs + uvTs = y Case 2: Bs 6= y If u = 0 or v = 0 then Bs¯ = (B + uvT)s = Bs + uvTs = Bs 6= y So search for u 6= 0 and v 6= 0. Symmetry of B and desired symmetry of B¯ imply that B¯ = B¯T ⇐⇒ B + uvT = BT + vuT ⇐⇒ uvT = vuT But uvT = vuT, u 6= 0, and v 6= 0 imply that v = αu for some α 6= 0 (exercise). Thus, we should search for an α 6= 0 and u 6= 0 such that the secant condition is satisfied by ¯ T B = B + αuu 30 / 106

From previous slide, we hope to find an α > 0 and u 6= 0 so that T B¯ = B + αuu Notes satisfies want y = Bs¯ = Bs + αuuTs = Bs + α(uTs)u which implies α(uTs)u = y − Bs 6= 0. This implies that u = β(y − Bs) for some β 6= 0. Plugging back in, we are searching for B¯ = B + αβ2(y − Bs)(y − Bs)T = B + γ(y − Bs)(y − Bs)T for some γ 6= 0 that satisfies the secant equation

want y = Bs¯ = Bs + γ(y − Bs)Ts(y − Bs) which implies y − Bs = γ(y − Bs)Ts(y − Bs) which means we can choose 1 γ = provided (y − Bs)Ts 6= 0. (y − Bs)Ts

31 / 106 Definition 2.4 (SR1 quasi-Newton formula) Notes If (y − Bs)Ts 6= 0 for some symmetric matrix B and vectors s and y, then the symmetric rank-1 (SR1) quasi-Newton formula for updating B is

\[ \bar{B} = B + \frac{(y - Bs)(y - Bs)^T}{(y - Bs)^T s} \]
and satisfies
  B̄ is symmetric
  B̄s = y (secant condition)

In practice, we choose some κ ∈ (0, 1) and then use
\[ \bar{B} = \begin{cases} B + \dfrac{(y - Bs)(y - Bs)^T}{(y - Bs)^T s} & \text{if } |s^T(y - Bs)| > \kappa \|s\|_2 \|y - Bs\|_2 \\ B & \text{otherwise} \end{cases} \]
to avoid numerical error.
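For concreteness, a small NumPy sketch of this safeguarded SR1 update follows; the name sr1_update and the value of κ are illustrative. Running it on the data of Example 2.5 below reproduces the indefinite B̄.

```python
# A sketch of the safeguarded SR1 update above; kappa is the user-chosen
# constant in (0, 1) from the slide (its value here is illustrative).
import numpy as np

def sr1_update(B, s, y, kappa=1e-8):
    r = y - B @ s
    if abs(s @ r) > kappa * np.linalg.norm(s) * np.linalg.norm(r):
        return B + np.outer(r, r) / (r @ s)
    return B                                  # skip the update to avoid numerical error

# Example 2.5 data: the update destroys positive definiteness.
B = np.array([[2.0, 1.0], [1.0, 1.0]])
s = np.array([-1.0, -1.0])
y = np.array([-3.0, 2.0])
B_bar = sr1_update(B, s, y)
print(B_bar)                                  # [[2, 1], [1, -3]]
print(np.linalg.eigvalsh(B_bar))              # approx [-3.19, 2.19]: indefinite
```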

Question: Is B¯ positive definite if B is positive definite? Answer: Sometimes....but sometimes not!

32 / 106

Example 2.5 (SR1 update is not necessarily positive definite)
Consider the SR1 update with
\[ B = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \qquad s = \begin{pmatrix} -1 \\ -1 \end{pmatrix}, \qquad y = \begin{pmatrix} -3 \\ 2 \end{pmatrix} \]
The matrix B is positive definite with eigenvalues ≈ {0.38, 2.6}. Moreover,
\[ y^T s = \begin{pmatrix} -3 & 2 \end{pmatrix} \begin{pmatrix} -1 \\ -1 \end{pmatrix} = 1 > 0 \]
so that the necessary condition for B̄ to be positive definite holds. The SR1 update gives
\[ y - Bs = \begin{pmatrix} -3 \\ 2 \end{pmatrix} - \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} -1 \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 4 \end{pmatrix}, \qquad (y - Bs)^T s = \begin{pmatrix} 0 & 4 \end{pmatrix} \begin{pmatrix} -1 \\ -1 \end{pmatrix} = -4 \]
\[ \bar{B} = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} + \frac{1}{(-4)} \begin{pmatrix} 0 \\ 4 \end{pmatrix} \begin{pmatrix} 0 & 4 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & -3 \end{pmatrix} \]
The matrix B̄ is indefinite with eigenvalues ≈ {2.19, −3.19}.

Theorem 2.6 (convergence of SR1 update)
Suppose that
  f is C² on ℝⁿ
  $\lim_{k \to \infty} x_k = x^*$ for some vector x*
  ∇²f is bounded and Lipschitz continuous in a neighborhood of x*
  the steps $s_k$ are uniformly linearly independent
  $|s_k^T (y_k - B_k s_k)| > \kappa \|s_k\|_2 \|y_k - B_k s_k\|_2$ for some κ ∈ (0, 1)
Then, the matrices generated by the SR1 formula satisfy
\[ \lim_{k \to \infty} B_k = \nabla^2 f(x^*) \]

Comment: the SR1 updates may converge to an indefinite matrix

34 / 106

Notes

Some comments if B is symmetric, then the SR1 update B¯ (when it exists) is symmetric if B is positive definite, then the SR1 update B¯ (when it exists) is not necessarily positive definite linesearch methods require positive-definite matrices SR1 updating is not appropriate (in general) for linesearch methods SR1 updating is an attractive option for optimization methods that effectively utilize indefinite matrices (e.g., trust-region methods) to derive an updating strategy that produces positive-definite matrices, we have to look beyond rank-one updates. Would a rank-two update work?

35 / 106 Notes For hereditary positive definiteness we must consider updates of at least rank two.

Result U is a symmetric matrix of rank two if and only if

U = βuuT + δwwT, for some nonzero β and δ, with u and w linearly independent.

If B¯ = B + U for some rank-two matrix U and Bs¯ = y for some vectors s and y, then

Bs¯ = (B + U)s = y, and it follows that y − Bs = Us = β(uTs)u + δ(wTs)w

36 / 106

Let us make the “obvious” choices of

u = Bs and w = y Notes which gives the rank-two update U = βBs(Bs)T + δyyT and we now search for β and δ so that Bs¯ = y is satisfied. Substituting, we have

\[ \text{want } y = \bar{B}s = (B + \beta Bs(Bs)^T + \delta y y^T)s = Bs + \beta (s^T B s) Bs + \delta (y^T s) y \]
Making the “simple” choice of
\[ \beta = -\frac{1}{s^T B s} \quad \text{and} \quad \delta = \frac{1}{y^T s} \]
gives
\[ \bar{B}s = Bs - \frac{1}{s^T B s}(s^T B s) Bs + \frac{1}{y^T s}(y^T s) y = Bs - Bs + y = y \]
as desired. This gives the Broyden, Fletcher, Goldfarb, Shanno (BFGS) quasi-Newton update formula.

Broyden, Fletcher, Goldfarb, Shanno (BFGS) update
Given a symmetric positive-definite matrix B and vectors s and y that satisfy $y^T s > 0$, the quasi-Newton BFGS update formula is given by
\[ \bar{B} = B - \frac{Bs(Bs)^T}{s^T B s} + \frac{y y^T}{y^T s} \tag{2} \]

Result
If B is positive definite and $y^T s > 0$, then
\[ \bar{B} = B - \frac{Bs(Bs)^T}{s^T B s} + \frac{y y^T}{y^T s} \]
is positive definite and satisfies $\bar{B} s = y$.

Proof: Homework assignment. (hint: see the next slide)
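A minimal NumPy sketch of the update (2) is given below; it also checks numerically, on random data with yᵀs > 0, the two properties claimed in the Result (secant condition and positive definiteness). The function name and the test data are illustrative.

```python
# A sketch of the BFGS update (2); assumes y^T s > 0 as required.
import numpy as np

def bfgs_update(B, s, y):
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Quick numerical check of the Result on random data (illustrative only).
rng = np.random.default_rng(0)
B = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)
assert y @ s > 0                              # curvature condition holds for this data
B_bar = bfgs_update(B, s, y)
print(np.allclose(B_bar @ s, y))              # True: secant condition
print(np.all(np.linalg.eigvalsh(B_bar) > 0))  # True: positive definite
```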

38 / 106

A second derivation of the BFGS formula Notes Let M = B−1 for some given symmetric positive definite matrix B. Solve

T minimize kM¯ − MkW subject to M¯ = M¯ , My¯ = s ¯ n×n M∈R for a “closest” symmetric matrix M¯, where

n n 1/2 1/2 X X 2 kAkW = kW AW kF and kCkF = cij i=1 j=1 for any weighting matrix W that satisfies Wy = s. If we choose

Z 1 W = G−1 and G = [∇2f(y + τ s)]−1 dτ 0 then the solution is  syT   ysT  ssT M¯ = I − M I − + . yTs yTs yTs Computing the inverse B¯ = M¯−1 using the Sherman-Morrison-Woodbury formula gives 1 1 B¯ = B − Bs(Bs)T + yyT (BFGS formula again) sTBs yTs so that B¯ 0 if and only if M¯ 0. 39 / 106 Recall: given the positive-definite matrix B and vectors s and y satisfying sTy > 0, the BFGS formula (2) may be written as Notes B¯ = B − aaT + bbT where Bs y a = and b = (sTBs)1/2 (yTs)1/2

Limited-memory BFGS update

Suppose at iteration k we have m previous pairs $(s_i, y_i)$ for i = k − m, ..., k − 1. Then the limited-memory BFGS (L-BFGS) update with memory m and positive-definite seed matrix $B_k^0$ may be written in the form
\[ B_k = B_k^0 + \sum_{i=k-m}^{k-1} \left[ b_i b_i^T - a_i a_i^T \right] \]
for some vectors $a_i$ and $b_i$ that may be recovered via an unrolling formula (next slide).

the positive-definite seed matrix $B_k^0$ is often chosen to be diagonal
essentially performs m BFGS updates to the seed matrix $B_k^0$
a different seed matrix may be used for each iteration k
  added flexibility allows “better” seed matrices to be used as iterations proceed

41 / 106

Algorithm 3 Unrolling the L-BFGS formula

1: input m ≥ 1, $\{(s_i, y_i)\}$ for i = k − m, ..., k − 1, and $B_k^0$
2: for i = k − m, ..., k − 1 do
3:   $b_i \leftarrow y_i / (y_i^T s_i)^{1/2}$
4:   $a_i \leftarrow B_k^0 s_i + \sum_{j=k-m}^{i-1} \left[ (b_j^T s_i) b_j - (a_j^T s_i) a_j \right]$
5:   $a_i \leftarrow a_i / (s_i^T a_i)^{1/2}$
6: end for

typically 3 ≤ m ≤ 30
$b_i$ and $b_j^T s_i$ should be saved and reused from previous iterations (exceptions are $b_{k-1}$ and $b_j^T s_{k-1}$ because they depend on the most recent data obtained)
with the previous savings in computation and assuming $B_k^0 = I$, the total cost to get the $a_i$, $b_i$ vectors is approximately $\tfrac{3}{2} m^2 n$
the cost of a matrix-vector product with $B_k$ requires $4mn$ multiplications
other so-called compact representations are slightly more efficient
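The following NumPy sketch of the unrolling above may help; it assumes the stored pairs satisfy the curvature condition $y_i^T s_i > 0$ so that the square roots are well defined, and it does not include the caching of $b_j^T s_i$ mentioned above. The function names are illustrative.

```python
# A rough sketch of Algorithm 3 (unrolling) and of a matrix-vector product
# with the resulting B_k; assumes y_i^T s_i > 0 for every stored pair.
import numpy as np

def lbfgs_unroll(B0, S, Y):
    """Return vectors (a_i, b_i) with B_k = B0 + sum_i (b_i b_i^T - a_i a_i^T).

    S, Y are lists of the m stored pairs (s_i, y_i), oldest first;
    B0 is the (often diagonal) seed matrix, given here as a dense array."""
    A, Bvecs = [], []
    for s, y in zip(S, Y):
        b = y / np.sqrt(y @ s)
        a = B0 @ s \
            + sum((bj @ s) * bj for bj in Bvecs) \
            - sum((aj @ s) * aj for aj in A)
        a = a / np.sqrt(s @ a)
        A.append(a)
        Bvecs.append(b)
    return A, Bvecs

def lbfgs_matvec(B0, A, Bvecs, v):
    """Compute B_k v without ever forming B_k (about 4mn multiplications)."""
    return B0 @ v + sum((b @ v) * b for b in Bvecs) - sum((a @ v) * a for a in A)
```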

42 / 106 The problem of interest Notes Given a symmetric positive-definite matrix A, solve the linear system

\[ Ax = b \]
which is equivalent to finding the unique minimizer of
\[ \min_{x \in \mathbb{R}^n} \; q(x) \overset{\text{def}}{=} \tfrac{1}{2} x^T A x - b^T x \tag{3} \]
because $\nabla q(x) = Ax - b$ and $\nabla^2 q(x) = A \succ 0$ implies that
\[ \nabla q(x) = Ax - b = 0 \iff Ax = b \]
We seek a method that cheaply computes approximate solutions to the system Ax = b, or equivalently, approximately minimizes (3).
We do not want to compute a factorization of A, because it is assumed to be too expensive.
We are interested in very large-scale problems.
We allow ourselves to compute matrix-vector products, i.e., Av for any vector v.
Why do we care about this? We will see soon!
Notation: $r(x) \overset{\text{def}}{=} Ax - b$ and $r_k \overset{\text{def}}{=} A x_k - b$.

44 / 106

\[ q(x) = \tfrac{1}{2} x^T A x - b^T x \]

Algorithm 4 Coordinate-descent algorithm for minimizing q

1: input initial guess $x_0 \in \mathbb{R}^n$ and a symmetric positive-definite matrix $A \in \mathbb{R}^{n \times n}$
2: loop
3:   for i = 0, 1, ..., n − 1 do
4:     Compute $\alpha_i$ as the solution to $\min_{\alpha \in \mathbb{R}} q(x_i + \alpha e_{i+1})$  (4)
5:     $x_{i+1} \leftarrow x_i + \alpha_i e_{i+1}$
6:   end for
7:   $x_0 \leftarrow x_n$
8: end loop

$e_j$ represents the jth coordinate vector, i.e., $e_j = (0 \;\; 0 \;\; \ldots \;\; 1 \;\; \ldots \;\; 0 \;\; 0)^T \in \mathbb{R}^n$ with the 1 in the jth position
Is it easy to solve (4)? Yes! (exercise)
Does this work?

Figure (left): Coordinate descent does not converge in n steps. Looks like it will be very slow! Axes are not aligned with the coordinate directions $e_i$.
Figure (right): Coordinate descent converges in n steps. Very good! Axes are aligned with the coordinate directions $e_i$.

Fact: the axes are determined by the eigenvectors of A.
Observation: coordinate descent converges in no more than n steps if the eigenvectors align with the coordinate directions, i.e., V = I where V is the eigenvector matrix of A.
Implication: Using the spectral decomposition of A, A = VΛV^T = IΛI^T = Λ (a diagonal matrix).
Conclusion: Coordinate descent converges in ≤ n steps when A is diagonal.

First thought: This is only good for diagonal matrices. That sucks! Second thought: Can we use a transformation of variables so that the transformed Notes Hessian is diagonal? Let S be a square nonsingular matrix and define the transformed variables y = S−1x ⇐⇒ Sy = x.

Consider the transformed quadratic function qb defined as def T T T T T T q(y ) = q(Sy ) ≡ 1 y S AS y − ( S b ) y = 1 y Ayb − b y b 2 | {z } |{z} 2 Ab b ∗ ∗ ∗ ∗ so that a minimizer x of q satisfies x = Sy where y is a minimizer of qb. T If Ab = S AS is diagonal, then applying the coordinate descent Algorithm 4 to qb would converge to y∗ in ≤ n steps. Ab is diagonal if and only if T si Asj = 0 for all i 6= j,

where  S = s0 s1 ... sn−1 Line 5 of Algorithm 4 (in the transformed variables) is

yi+1 = yi + αiei+1. Multiplying by S and using Sy = x leads to

xi+1 = Syi+1 = S(yi + αiei+1) = Syi + αiSei+1 = xi + αisi

47 / 106 Notes

Definition 2.7 (A-conjugate directions)

A set of nonzero vectors $\{s_0, s_1, \ldots, s_l\}$ is said to be conjugate with respect to the symmetric positive-definite matrix A (A-conjugate for short) if $s_i^T A s_j = 0$ for all i ≠ j.

We want to find A-conjugate vectors. The eigenvectors of A are A-conjugate since the spectral decomposition gives

A = VΛVT ⇐⇒ VTAV = Λ (diagonal matrix of eigenvalues)

Computing the spectral decomposition is too expensive for large scale problems. We want to find A-conjugate vectors cheaply! Let us look at a general algorithm.

48 / 106

Algorithm 5 General A-conjugate direction algorithm Notes

1: input symmetric positive-definite matrix $A \in \mathbb{R}^{n \times n}$, and vectors $x_0$ and $b \in \mathbb{R}^n$.
2: Set $r_0 \leftarrow A x_0 - b$ and $k \leftarrow 0$.
3: Choose direction $s_0$.
4: while $r_k \neq 0$ do
5:   Compute $\alpha_k \leftarrow \operatorname*{argmin}_{\alpha \in \mathbb{R}} q(x_k + \alpha s_k)$
6:   Set $x_{k+1} \leftarrow x_k + \alpha_k s_k$.
7:   Set $r_{k+1} \leftarrow A x_{k+1} - b$.
8:   Pick a new A-conjugate direction $s_{k+1}$.
9:   Set $k \leftarrow k + 1$.
10: end while

How do we compute αk? How do we compute the A-conjugate directions efficiently?

Theorem 2.8 (general A-conjugate algorithm converges) n For any x0 ∈ R , the general A-conjugate direction Algorithm 5 computes the solution x∗ of the system Ax = b in at most n steps.

Proof: Theorem 5.1 in Nocedal and Wright.

Exercise: show that the $\alpha_k$ that solves the minimization problem on line 5 of Algorithm 5 satisfies
\[ \alpha_k = -\frac{s_k^T r_k}{s_k^T A s_k}. \]

Before discussing how we compute the A-conjugate directions {si}, let us look at some properties of them if we assume we already have them.

Theorem 2.9 (Expanding subspace minimization)

n Let x0 ∈ R be any starting point and suppose that {xk} is the sequence of iterates generated by the A-conjugate direction Algorithm 5 for a set of A-conjugate directions {sk}. Then, T rk si = 0 for all i = 0,... k − 1 1 T T and xk is the unique minimizer of q(x) = 2 x Ax − b x over the set

{x : x = x0 + span(s0, s1,..., sk−1)}

Proof: See Theorem 5.2 in Nocedal and Wright.

50 / 106

Question: If $\{s_i\}_{i=0}^{k-1}$ are A-conjugate, how do we compute $s_k$ to be A-conjugate?
Answer: Theorem 2.9 shows that $r_k$ is orthogonal to the previously explored space. To keep it simple and hopefully cheap, let us try
\[ s_k = -r_k + \beta s_{k-1} \quad \text{for some } \beta. \tag{5} \]
If this is going to work, then the A-conjugacy condition and (5) require that
\[ 0 = s_{k-1}^T A s_k = s_{k-1}^T A(-r_k + \beta s_{k-1}) = -s_{k-1}^T A r_k + \beta\, s_{k-1}^T A s_{k-1} \]
and solving for β gives
\[ \beta = \frac{s_{k-1}^T A r_k}{s_{k-1}^T A s_{k-1}}. \]

With this choice of β, the vector sk is A-conjugate!.....but, how do we get started? Choosing s0 = −r0, i.e., a steepest descent direction, makes sense.

This gives us a complete algorithm; the linear CG algorithm!

51 / 106 Notes

Algorithm 6 Preliminary linear CG algorithm

n×n n 1: input symmetric positive-definite matrix A ∈ R , and vectors x0 and b ∈ R . 2: Set r0 ← Ax0 − b, s0 ← −r0, and k ← 0. 3: while rk 6= 0 do T T 4: Set αk ← −( sk rk)/( sk Ask). 5: Set xk+1 ← xk + αksk. 6: Set rk+1 ← Axk+1 − b. T T 7: Set βk+1 ← ( sk Ark+1)/( sk Ask). 8: Set sk+1 ← −rk+1 + βk+1sk. 9: Set k ← k + 1. 10: end while

This preliminary version requires two matrix-vector products per iteration. By using some more “tricks” and algebra, we can reduce this to a single matrix-vector multiplication per iteration.

52 / 106

Algorithm 7 Linear CG algorithm

1: input symmetric positive-definite matrix $A \in \mathbb{R}^{n \times n}$, and vectors $x_0$ and $b \in \mathbb{R}^n$.
2: Set $r_0 \leftarrow A x_0 - b$, $s_0 \leftarrow -r_0$, and $k \leftarrow 0$.
3: while $r_k \neq 0$ do
4:   Set $\alpha_k \leftarrow (r_k^T r_k)/(s_k^T A s_k)$.
5:   Set $x_{k+1} \leftarrow x_k + \alpha_k s_k$.
6:   Set $r_{k+1} \leftarrow r_k + \alpha_k A s_k$.
7:   Set $\beta_{k+1} \leftarrow (r_{k+1}^T r_{k+1})/(r_k^T r_k)$.
8:   Set $s_{k+1} \leftarrow -r_{k+1} + \beta_{k+1} s_k$.
9:   Set $k \leftarrow k + 1$.
10: end while

Only requires the single matrix-vector multiplication $A s_k$.
More practical to use a stopping condition like $\|r_k\| \le 10^{-8} \max(1, \|r_0\|_2)$.
Required computation:
  2 inner products
  3 vector sums
  1 matrix-vector multiplication
Ideal for large (sparse) matrices A.
Converges quickly if cond(A) is small or the eigenvalues are “clustered”
  convergence can be accelerated by using preconditioning
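A direct NumPy transcription of Algorithm 7 is sketched below, with the relative stopping rule suggested above used in place of the exact test $r_k = 0$; the names and the random test problem are illustrative.

```python
# A sketch of the linear CG Algorithm 7 with a relative residual stopping test.
import numpy as np

def linear_cg(A, b, x0, tol=1e-8, max_iter=None):
    x = x0.copy()
    r = A @ x - b
    s = -r
    stop = tol * max(1.0, np.linalg.norm(r))
    for _ in range(max_iter or len(b)):
        if np.linalg.norm(r) <= stop:
            break
        As = A @ s                            # the single matrix-vector product
        alpha = (r @ r) / (s @ As)
        x = x + alpha * s
        r_new = r + alpha * As
        beta = (r_new @ r_new) / (r @ r)
        s = -r_new + beta * s
        r = r_new
    return x

# Illustrative test: a random well-conditioned SPD system.
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = linear_cg(A, b, np.zeros(50))
print(np.linalg.norm(A @ x - b))              # small residual
```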

We want to use CG to compute an approximate solution p to
\[ \min_{p \in \mathbb{R}^n} \; f + g^T p + \tfrac{1}{2} p^T H p \]
H may be indefinite (even singular) and we must ensure that p is a descent direction

Algorithm 8 Newton CG algorithm for computing descent directions

1: input symmetric matrix $H \in \mathbb{R}^{n \times n}$ and vector g
2: Set $p_0 = 0$, $r_0 \leftarrow g$, $s_0 \leftarrow -g$, and $k \leftarrow 0$.
3: while $r_k \neq 0$ do
4:   if $s_k^T H s_k > 0$ then
5:     Set $\alpha_k \leftarrow (r_k^T r_k)/(s_k^T H s_k)$.
6:   else
7:     if k = 0, then return $p_k = -g$ else return $p_k$ end if
8:   end if
9:   Set $p_{k+1} \leftarrow p_k + \alpha_k s_k$.
10:  Set $r_{k+1} \leftarrow r_k + \alpha_k H s_k$.
11:  Set $\beta_{k+1} \leftarrow (r_{k+1}^T r_{k+1})/(r_k^T r_k)$.
12:  Set $s_{k+1} \leftarrow -r_{k+1} + \beta_{k+1} s_k$.
13:  Set $k \leftarrow k + 1$.
14: end while
15: return $p_k$

Made the replacements x ← p, A ← H, and b ← −g in Algorithm 7. 54 / 106
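A sketch of Algorithm 8 in NumPy follows; the exact test $r_k = 0$ is replaced by a small tolerance, and the iteration cap defaults to n, both of which are practical choices rather than part of the slide.

```python
# A sketch of the Newton-CG Algorithm 8: CG on H p = -g, abandoned as soon
# as a direction of nonpositive curvature is detected.
import numpy as np

def newton_cg(H, g, tol=1e-8, max_iter=None):
    n = len(g)
    p = np.zeros(n)
    r = g.copy()
    s = -g.copy()
    for _ in range(max_iter or n):
        if np.linalg.norm(r) <= tol:
            break
        Hs = H @ s
        if s @ Hs <= 0:                       # nonpositive curvature (lines 6-8)
            return -g if np.allclose(p, 0) else p
        alpha = (r @ r) / (s @ Hs)
        p = p + alpha * s
        r_new = r + alpha * Hs
        beta = (r_new @ r_new) / (r @ r)
        s = -r_new + beta * s
        r = r_new
    return p
```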

Theorem 2.10 (Newton CG algorithm computes descent directions.) Notes If g 6= 0 and Algorithm 8 terminates on iterate K, then every iterate pk of the algorithm T satisfies pk g < 0 for k ≤ K.

Proof: Case 1: Algorithm 8 terminates on iteration K = 0. T T 2 It is immediate that g p0 = −g g = −kgk2 < 0 since g 6= 0. Case 2: Algorithm 8 terminates on iteration K ≥ 1. First observe from p0 = 0, s0 = −g, r0 = g, g 6= 0, and lines 10 and 9 of Algorithm 8 that rTr kgk4 T = T( + α ) = α T = 0 0 T = − 2 < . g p1 g p0 0s0 0g s0 T g s0 T 0 (6) s0 Hs0 g Hg By unrolling the “p” update, using A-conjugacy of {si}, and definition of rk, we get k−1 T T T T X T sk rk = sk (Hpk + g) = sk g + sk H αjsj = sk g. j=0 T 2 The previous line, lines 9 and 5 of Algorithm 8, and the fact that sk rk = −krkk2 gives T 4 T T T T T T rk rk T T krkk2 T g pk+1 = g pk + αkg sk = g pk + αksk rk = g pk + T sk rk = g pk − T < g pk. sk Hsk sk Hsk Combining the previous inequality and (6) results in kgk4 T < T < ··· < T < − 2 < . g pk+1 g pk g p1 T 0 g Hg 55 / 106 Algorithm 9 Backtracking Notes

1: input xk, pk

2: Choose $\alpha_{\text{init}} > 0$ and $\tau \in (0, 1)$
3: Set $\alpha^{(0)} = \alpha_{\text{init}}$ and $l = 0$
4: until $f(x_k + \alpha^{(l)} p_k)$ “<” $f(x_k)$
5:   Set $\alpha^{(l+1)} = \tau \alpha^{(l)}$
6:   Set $l \leftarrow l + 1$
7: end until
8: return $\alpha_k = \alpha^{(l)}$

typical choices (not always)

I αinit = 1 I τ = 1/2 this prevents the step from getting too small does not prevent the step from being too long decrease in f may not be proportional to the directional derivative need to tighten the requirement that

(l)  f xk + α pk “<” f(xk)

58 / 106

The Armijo condition
Given a point $x_k$ and search direction $p_k$, we say that $\alpha_k$ satisfies the Armijo condition if
\[ f(x_k + \alpha_k p_k) \le f(x_k) + \eta \alpha_k g_k^T p_k \tag{7} \]
for some η ∈ (0, 1), e.g., η = 0.1 or even η = 0.0001

Purpose: ensure the decrease in f is substantial relative to the directional derivative.

Figure: The Armijo condition as a function of α, showing the curve $f(x_k + \alpha p_k)$, the tangent line $f(x_k) + \alpha g_k^T p_k$, and the sufficient-decrease line with the relaxed slope.

Algorithm 10 A backtracking-Armijo linesearch

1: input xk, pk

2: Choose $\alpha_{\text{init}} > 0$, $\eta \in (0, 1)$, and $\tau \in (0, 1)$
3: Set $\alpha^{(0)} = \alpha_{\text{init}}$ and $\ell = 0$
4: until $f(x_k + \alpha^{(\ell)} p_k) \le f(x_k) + \eta \alpha^{(\ell)} g_k^T p_k$
5:   Set $\alpha^{(\ell+1)} = \tau \alpha^{(\ell)}$
6:   Set $\ell \leftarrow \ell + 1$
7: end until
8: return $\alpha_k = \alpha^{(\ell)}$
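A short Python sketch of Algorithm 10 is given below; the default parameter values and the cap on the number of backtracks are illustrative safeguards, not part of the algorithm as stated.

```python
# A sketch of the backtracking-Armijo linesearch (Algorithm 10);
# f is the objective and g_k the gradient at x_k.
import numpy as np

def backtracking_armijo(f, x_k, g_k, p_k, alpha_init=1.0, eta=1e-4, tau=0.5,
                        max_backtracks=50):
    f_k = f(x_k)
    slope = eta * (g_k @ p_k)                 # eta * g_k^T p_k < 0 for a descent direction
    alpha = alpha_init
    for _ in range(max_backtracks):           # practical cap on the until-loop
        if f(x_k + alpha * p_k) <= f_k + alpha * slope:
            break
        alpha *= tau                          # shrink the step
    return alpha
```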

Does this always terminate? Does it give us what we want?

60 / 106

Theorem 3.1 (Satisfying the Armijo condition)
Suppose that
  f ∈ C¹ and g(x) is Lipschitz continuous with Lipschitz constant γ(x)
  p is a descent direction at x
Then for any given η ∈ (0, 1), the Armijo condition
\[ f(x + \alpha p) \le f(x) + \eta \alpha\, g(x)^T p \]
is satisfied for all $\alpha \in [0, \alpha_{\max}(x)]$, where
\[ \alpha_{\max}(x) := \frac{2(\eta - 1) g(x)^T p}{\gamma(x)\|p\|_2^2} > 0 \]
Proof: A Taylor approximation and $\alpha \le \frac{2(\eta - 1) g(x)^T p}{\gamma(x)\|p\|_2^2}$ imply that
\[ f(x + \alpha p) \le f(x) + \alpha g(x)^T p + \tfrac{1}{2}\gamma(x)\alpha^2\|p\|_2^2 \le f(x) + \alpha g(x)^T p + \alpha(\eta - 1) g(x)^T p = f(x) + \alpha\eta\, g(x)^T p \]

Theorem 3.2 (Bound on αk when using an Armijo backtracking linesearch)
Suppose that
  f ∈ C¹ and g(x) is Lipschitz continuous with Lipschitz constant $\gamma_k$ at $x_k$

pk is a descent direction at xk

Then for any chosen η ∈ (0, 1) the stepsize $\alpha_k$ generated by the backtracking-Armijo linesearch terminates with
\[ \alpha_k \ge \min\left\{ \alpha_{\text{init}},\; \frac{2\tau(\eta - 1) g_k^T p_k}{\gamma_k \|p_k\|_2^2} \right\} \equiv \min\{\alpha_{\text{init}},\; \tau\, \alpha_{\max}(x_k)\} \]
where τ ∈ (0, 1) is the backtracking parameter.

Proof: (l) Theorem 3.1 implies that the backtracking will terminate as soon as α ≤ αmax(xk). Consider the following two cases:

αinit satisfies the Armijo condition: it is then clear that αk = αinit.

αinit does not satisfy the Armijo condition: thus, there must be a last linesearch iterate (the l-th) for which

(l) (l+1) (l) α > αmax(xk) =⇒ αk = α = τ α > τ αmax(xk)

Combining the two cases gives the required result.


Algorithm 11 Generic backtracking-Armijo linesearch method

1: input initial guess x0 2: Set k = 0 3: loop 4: Find a descent direction pk at xk. 5: Compute the stepsize αk using the backtracking-Armijo linesearch Algorithm 10. 6: Set xk+1 = xk + αkpk 7: Set k ← k + 1 8: end loop 9: return approximate solution xk

should include a sensible relative stopping tolerance such as

\[ \|\nabla f(x_k)\| \le 10^{-8} \max(1, \|\nabla f(x_0)\|) \]

should also terminate if a maximum number of allowed iterations is reached should also terminate if a maximum time limit is reached

63 / 106 Notes Theorem 3.3 (Global convergence of generic backtracking-Armijo linesearch)

Suppose that f ∈ C1 n g is Lipschitz continuous on R with Lipschitz constant γ Then, the iterates generated by the generic backtracking-Armijo linesearch method Algorithm 11 must satisfy one of the following conditions: 1 finite termination, i.e., there exists a positive integer k such that

gk = 0 for some k ≥ 0

2 objective is unbounded below, i.e.,

lim fk = −∞ k→∞

3 an angle condition between gk and pk exists, i.e.,

\[ \lim_{k \to \infty} \min\left\{ |p_k^T g_k|,\; |p_k^T g_k|^2 / \|p_k\|_2^2 \right\} = 0 \]

64 / 106

Proof: Notes Suppose that outcomes 1 and 2 do not happen, i.e., gk 6= 0 for all k ≥ 0 and

lim fk > −∞. (8) k→∞ The Armijo condition implies that

T fk+1 − fk ≤ ηαkpk gk for all k. Summing over the first j iterations yields

j X T fj+1 − f0 ≤ η αkpk gk (9) k=0 Taking limits of the previous inequality as j → ∞ and using (8) implies that the LHS is bounded below and, therefore, the RHS is also bounded below. Since the sum is composed of all negative terms, we deduce that the following sum

∞ X T αk|pk gk| (10) k=0 is bounded.

65 / 106 n T o Notes 2τ (η−1)gk pk From Theorem 3.2, αk ≥ min αinit, 2 . Thus, γkpkk2

n T o P∞ T P∞ 2τ (η−1)gk pk T k=0 αk|pk gk| ≥ k=0 min αinit, 2 |pk gk| γkpkk2

n T 2 o P∞ T 2τ (1−η)|pk gk| = k=0 min αinit|pk gk|, 2 γkpkk2

n T 2 o P∞ 2τ (1−η) T |pk gk| ≥ k=0 min{αinit, γ } min |pk gk|, 2 kpkk2

Using the fact that the series (10) converges, we obtain that

 T T 2 2 lim min |pk gk|, |pk gk| /kpkk2 = 0 k→∞

66 / 106

Some observations
the Armijo condition prevents steps that are too “long”
typically, the Armijo condition is satisfied for very “large” intervals
the Armijo condition can be satisfied by steps that are not even close to a minimizer of φ(α) := f(x_k + αp_k)

Figure: The Armijo condition: φ(α) = f(x_k + αp_k), the line l(α) = f_k + c_1 α g_k^T p_k, and the acceptable intervals of α.

Wolfe conditions

Given the current iterate $x_k$, search direction $p_k$, and constants $0 < c_1 < c_2 < 1$, we say that the step length $\alpha_k$ satisfies the Wolfe conditions if
\[ f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k \tag{11a} \]
\[ \nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f(x_k)^T p_k \tag{11b} \]

condition (11a) is the Armijo condition (ensures “sufficient” decrease) condition (11b) is a directional derivative condition (prevents too short of steps)

Figure: The Wolfe conditions: φ(α) = f(x_k + αp_k), the line l(α) = f_k + c_1 α g_k^T p_k, the desired slope, and the acceptable intervals of α.


Algorithm 12 Generic Wolfe linesearch method

1: input initial guess x0 2: Set k = 0 3: loop 4: Find a descent direction pk at xk. 5: Compute a stepsize αk that satisfies the Wolfe conditions (11). 6: Set xk+1 = xk + αkpk 7: Set k ← k + 1 8: end loop 9: return approximate solution xk

should include a sensible relative stopping tolerance such as

\[ \|\nabla f(x_k)\| \le 10^{-8} \max(1, \|\nabla f(x_0)\|) \]

should also terminate if a maximum number of allowed iterations is reached should also terminate if a maximum time limit is reached

Theorem 3.4 (Convergence of generic Wolfe linesearch (Zoutendijk))
Assume that
  $f : \mathbb{R}^n \to \mathbb{R}$ is bounded below and $C^1$ on $\mathbb{R}^n$
  ∇f(x) is Lipschitz continuous on $\mathbb{R}^n$ with constant L
It follows that the iterates generated from the generic Wolfe linesearch method satisfy
\[ \sum_{k \ge 0} \cos^2(\theta_k) \|g_k\|_2^2 < \infty \tag{12} \]
where
\[ \cos(\theta_k) \overset{\text{def}}{=} \frac{-g_k^T p_k}{\|g_k\|_2 \|p_k\|_2} \tag{13} \]

Comments

definition (13) measures the angle between the gradient gk and search direction pk I cos(θ) ≈ 0 implies gk and pk are nearly orthogonal I cos(θ) ≈ 1 implies gk and pk are nearly parallel

condition (12) ensures that the gradient converges to zero provided gk and pk stay away from being arbitrarily close to orthogonal Proof: See Theorem 3.2 in Nocedal and Wright.


Some comments The Armijo linesearch

I only requires function evaluations I may easily accept steps that are far from a univariate minimizer of φ(α) = f(xk + αpk) The Wolfe linesearch

requires function and gradient evaluations
typically produces steps that are closer to the univariate minimizer of φ
may still allow the acceptance of steps that are far from a univariate minimizer of φ
“ideal” when search directions are computed from linear systems based on the BFGS quasi-Newton updating formula
Note: we have not yet said how to compute steps that satisfy the Wolfe conditions. We have not even shown that such steps exist! (coming soon)

Strong Wolfe conditions

Given the current iterate $x_k$, search direction $p_k$, and constants $0 < c_1 < c_2 < 1$, we say that the step length $\alpha_k$ satisfies the strong Wolfe conditions if
\[ f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k \tag{14a} \]
\[ |\nabla f(x_k + \alpha_k p_k)^T p_k| \le c_2 |\nabla f(x_k)^T p_k| \tag{14b} \]

Figure: The strong Wolfe conditions: φ(α) = f(x_k + αp_k), the line l(α) = f_k + c_1 α g_k^T p_k, the desired slope, and the acceptable intervals of α.

74 / 106

Theorem 3.5
Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is in $C^1$ and $p_k$ is a descent direction at $x_k$, and assume that f is bounded below along the ray $\{x_k + \alpha p_k : \alpha > 0\}$. It follows that if $0 < c_1 < c_2 < 1$, then there exist nontrivial intervals of step lengths that satisfy the strong Wolfe conditions (14).

Proof: The function φ(α) := f(xk + αpk) is bounded below over α > 0 by assumption and T must, therefore, intersect the graph of l(α) := f(xk) + αc1∇f(xk) pk at least once since c1 ∈ (0, 1); let αmin > 0 be the smallest such intersecting value so that T f(xk + αminpk) = f(xk) + αminc1∇f(xk) pk. (15)

It is clear that (11a)/(14a) hold for all α ∈ (0, αmin]. The Mean Value Theorem now implies T f(xk + αmin pk) − f(xk) = αmin∇f(xk + αMpk) pk for some αM ∈ (0, αmin). T Combining the last equality with (15) and the facts that c1 < c2 and ∇f(xk) pk < 0 yield T T T ∇f(xk + αMpk) pk = c1∇f(xk) pk > c2∇f(xk) pk. (16)

Thus, αM satisfies (11a)/(14a) and (11b) with a strict inequality so we may use the smoothness assumption to deduce that there is an interval around αM for which (11) and (14a) hold. Finally, since the left-hand-side of (16) is negative, we also know that (14b) holds, which completes the proof. 75 / 106 Algorithm 13 A bisection type (weak) Wolfe linesearch Notes

1: input $x_k$, $p_k$, $0 < c_1 < c_2 < 1$
2: Choose α = 0, t = 1, and β = ∞
3: while ( $f(x_k + t p_k) > f(x_k) + c_1 \cdot t \cdot (g_k^T p_k)$ OR $\nabla f(x_k + t p_k)^T p_k < c_2 (g_k^T p_k)$ ) do
4:   if $f(x_k + t p_k) > f(x_k) + c_1 \cdot t \cdot (g_k^T p_k)$ then
5:     Set β := t
6:     Set t := (α + β)/2
7:   else
8:     α := t
9:     if β = ∞ then
10:      t := 2α
11:    else
12:      t := (α + β)/2
13:    end if
14:  end if
15: end while
16: return t

See Section 3.5 in Nocedal-Wright for more discussions; in particular, a procedure for the strong Wolfe conditions.
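A Python transcription of Algorithm 13 is sketched below; the parameter values and the iteration cap are illustrative additions.

```python
# A sketch of the bisection-type weak Wolfe linesearch (Algorithm 13);
# grad returns the gradient of f.
import numpy as np

def wolfe_bisection(f, grad, x_k, p_k, c1=1e-4, c2=0.9, max_iter=100):
    f_k = f(x_k)
    slope = grad(x_k) @ p_k                   # g_k^T p_k < 0 for a descent direction
    alpha, t, beta = 0.0, 1.0, np.inf
    for _ in range(max_iter):                 # practical cap on the while-loop
        if f(x_k + t * p_k) > f_k + c1 * t * slope:
            beta = t                          # Armijo (11a) fails: step too long
            t = (alpha + beta) / 2
        elif grad(x_k + t * p_k) @ p_k < c2 * slope:
            alpha = t                         # curvature (11b) fails: step too short
            t = 2 * alpha if beta == np.inf else (alpha + beta) / 2
        else:
            break                             # both Wolfe conditions hold
    return t
```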


Some comments The Armijo linesearch Notes

I only requires function evaluations I computationally cheapest to implement I may easily accept steps that are far from a univariate minimizer of φ(α) = f(xk + αpk) The Wolfe linesearch

I requires function and gradient evaluations I typically produces steps that are closer to the univariate minimizer of φ I may still allow the acceptance of steps that are far from a univariate minimizer of φ I computationally more expensive to implement compared to Armijo The strong Wolfe linesearch

I requires function and gradient evaluations I typically produces steps that are closer to the univariate minimizer of φ I only approximate stationary points of the univariate function φ satisfy these conditions. I computationally most expensive to implement compared to Armijo

Why use (strong) Wolfe conditions? sometimes more accurate lead to fewer outer (major) iterations of the linesearch method. beneficial in the context of quasi-Newton methods for maintaining positive-definite matrix approximations.

77 / 106 Theorem 3.6 (Wolfe conditions guarantee the BFGS update is well-defined) Notes Suppose that Bk 0 and that pk is a descent direction for f at xk. If a (strong) Wolfe linesearch is used to compute αk such that xk+1 = xk + αkpk, then

\[ y_k^T s_k > 0 \quad \text{where} \quad s_k = x_{k+1} - x_k = \alpha_k p_k \quad \text{and} \quad y_k = g_{k+1} - g_k. \]

Proof: It follows from (11b) that
\[ \nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f(x_k)^T p_k. \]
Multiplying both sides by $\alpha_k$, using $x_{k+1} = x_k + \alpha_k p_k$, and the notation $g_k = \nabla f(x_k)$, yields
\[ g_{k+1}^T s_k \ge c_2 g_k^T s_k. \]
Subtracting $g_k^T s_k$ from both sides gives
\[ g_{k+1}^T s_k - g_k^T s_k \ge c_2 g_k^T s_k - g_k^T s_k \]
so that
\[ y_k^T s_k \ge (c_2 - 1) g_k^T s_k = \alpha_k (c_2 - 1) g_k^T p_k > 0 \]
since $c_2 \in (0, 1)$ and $p_k$ is a descent direction for f at $x_k$.


Algorithm 14 Steepest descent backtracking-Armijo linesearch method

1: Input: initial guess $x_0$
2: Set k = 0
3: until $\|\nabla f(x_k)\| \le 10^{-8} \max(1, \|\nabla f(x_0)\|)$
4:   Set $p_k = -\nabla f(x_k)$ as the search direction.
5:   Compute the stepsize $\alpha_k$ using the backtracking-Armijo linesearch Algorithm 10.
6:   Set $x_{k+1} = x_k + \alpha_k p_k$
7:   Set $k \leftarrow k + 1$
8: end until
9: return approximate solution $x_k$

should also terminate if a maximum number of allowed iterations is reached should also terminate if a maximum time limit is reached
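A compact Python sketch of Algorithm 14 follows, with a simple inline backtracking-Armijo loop; the parameter values, the iteration caps, and the test function (the one plotted in the contour figure below, with a hand-computed gradient) are illustrative.

```python
# A sketch of the steepest-descent backtracking-Armijo method (Algorithm 14).
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-8, eta=1e-4, tau=0.5, max_iter=10000):
    x = x0.copy()
    g0_norm = np.linalg.norm(grad(x0))
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol * max(1.0, g0_norm):
            break                              # relative stopping test from the slide
        p = -g                                 # steepest-descent direction
        alpha, f_x = 1.0, f(x)
        for _ in range(60):                    # backtracking-Armijo (Algorithm 10)
            if f(x + alpha * p) <= f_x + eta * alpha * (g @ p):
                break
            alpha *= tau
        x = x + alpha * p
    return x

# Test function from the contour figure: f(x, y) = 10(y - x^2)^2 + (x - 1)^2.
f = lambda z: 10 * (z[1] - z[0]**2)**2 + (z[0] - 1)**2
grad = lambda z: np.array([-40 * z[0] * (z[1] - z[0]**2) + 2 * (z[0] - 1),
                           20 * (z[1] - z[0]**2)])
print(steepest_descent(f, grad, np.array([-1.0, 0.5])))   # slowly approaches (1, 1)
```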

81 / 106 Theorem 4.1 (Global convergence for steepest descent) 1 n Notes If f ∈ C and g is Lipschitz continuous on R , then the iterates generated by the steepest-descent backtracking-Armijo linesearch Algorithm 14 satisfies one of the following conditions: 1 finite termination, i.e., there exists some finite integer k ≥ 0 such that

gk = 0

2 objective is unbounded below, i.e.,

lim fk = −∞ k→∞

3 convergence of gradients, i.e., $\lim_{k \to \infty} g_k = 0$

Proof: Since pk = −gk, if outcome 1 above does not happen, then we may conclude that

kgkk = kpkk= 6 0 for all k ≥ 0 Additionally, if outcome 2 above does not occur, then Theorem 3.3 ensures that

n  T T 2 2o 0 = lim min |pk gk|, |pk gk| /kpkk2 = lim {kgkk2 min (kgkk2, 1)} k→∞ k→∞ and thus limk→∞ gk = 0. 82 / 106


Some general comments on the steepest descent backtracking-Armijo algorithm
archetypal globally convergent method
many other methods resort to steepest descent when things “go wrong”
not scale invariant
convergence is usually slow! (linear local convergence, possibly with constant close to 1; for global convergence only a sublinear rate of $O((1/\epsilon)^2)$ iterations to get to an “$\epsilon$-stationary” point, i.e., $\|g_k\| \le \epsilon$)
in practice, often does not converge at all
slow convergence results because the computation of $p_k$ does not use any curvature information
we call any method that uses the steepest descent direction a “method of steepest descent”
each iteration is relatively cheap, but you may need many of them!

83 / 106 Rate of convergence Notes Theorem 4.2 (Global convergence rate for steepest-descent)

Suppose $f \in C^1$ and ∇f is Lipschitz continuous on $\mathbb{R}^n$. Further assume that f(x) is bounded from below on $\mathbb{R}^n$. Let $\{x_k\}$ be the iterates generated by the steepest-descent backtracking-Armijo linesearch Algorithm 14. Then there exists a constant M such that for all T ≥ 1
\[ \min_{k=0,\ldots,T} \|\nabla f(x_k)\| \le \frac{M}{\sqrt{T + 1}}. \]
Consequently, for any $\epsilon > 0$, within $\lceil (M/\epsilon)^2 \rceil$ steps we will see an iterate where the gradient has norm at most $\epsilon$. In other words, we reach an “$\epsilon$-stationary” point in $O((1/\epsilon)^2)$ steps.

REMARK: Same thing holds for constant step size for an appropriately chosen constant. See HW. Proof: Recall (9) in the proof of Theorem 3.3:

T X f0 − fk+1 0 α |gTp | ≤ ≤ M since f is bounded from below. k k k η k=0

84 / 106

Rate of convergence Notes Thus, T 2 X 2τ (1 − η)|g pk| X 0 k + α | T | ≤ . 2 init gk pk M γkpkk k:τ αmax<αinit k:τ αmax≥αinit Thus, T 2 0 X |g pk| X M k + | T | ≤ , 2 gk pk (17) kpkk C k:τ αmax<αinit k:τ αmax≥αinit   2τ (1−η) where C = min γ , αinit .

Setting pk = −gk in the inequality above, we obtain that

0 2 X 2 M (T + 1) min kgkk ≤ kgkk ≤ . k=0,...,T C k=0,...,T

Therefore, M min kgkk ≤ √ , k=0,...,T T + 1 for some constant M.

Steepest descent example

Figure: Contours for the objective function f(x, y) = 10(y − x²)² + (x − 1)², and the iterates generated by the steepest-descent backtracking-Armijo linesearch Algorithm 14.


Rate of convergence Notes

Theorem 4.3 (Local convergence rate for steepest-descent)
Suppose the following all hold:
  f is twice continuously differentiable and ∇²f is Lipschitz continuous on $\mathbb{R}^n$ with Lipschitz constant M.
  x* is a local minimum of f with positive definite Hessian ∇²f(x*) with eigenvalues lower bounded by ℓ and upper bounded by L.
  The initial starting iterate $x_0$ is close enough to x*; more precisely, $r_0 := \|x_0 - x^*\| < \bar{r} := \frac{2\ell}{M}$.
Then steepest descent with constant step size $\alpha = \frac{2}{\ell + L}$ converges as follows:
\[ \|x_k - x^*\| \le \frac{\bar{r}\, r_0}{\bar{r} - r_0} \left( 1 - \frac{2\ell}{L + 3\ell} \right)^{\!k}. \]
In other words, in $O(\log(1/\epsilon))$ iterations, we are within a distance of $\epsilon$ from x*.

Proof: See Theorem 1.2.4 in Nesterov’s book.

See also Theorem 3.4 in Nocedal-Wright for a similar result with exact line-search.


Algorithm 15 Modified-Newton backtracking-Armijo linesearch method

1: Input: initial guess $x_0$
2: Set k = 0
3: until $\|\nabla f(x_k)\| \le 10^{-8} \max(1, \|\nabla f(x_0)\|)$
4:   Compute positive-definite matrix $B_k$ from $H_k$ using Algorithm 2.
5:   Compute the search direction $p_k$ as the solution to $B_k p = -g_k$.
6:   Compute the stepsize $\alpha_k$ using the backtracking-Armijo linesearch Algorithm 10.
7:   Set $x_{k+1} = x_k + \alpha_k p_k$
8:   Set $k \leftarrow k + 1$
9: end until
10: return approximate solution $x_k$

should also terminate if a maximum number of allowed iterations is reached should also terminate if a maximum time limit is reached

89 / 106

Theorem 4.4 (Global convergence of modified-Newton backtracking-Armijo)
Suppose that
  f ∈ C¹
  g is Lipschitz continuous on ℝⁿ

the eigenvalues of Bk are uniformly bounded away from zero and infinity Then, the iterates generated by the modified-Newton backtracking-Armijo Algorithm 15 satisfy one of the following: 1 finite termination, i.e., there exists a finite k such that

gk = 0

2 objective is unbounded below, i.e.,

lim fk = −∞ k→∞

3 convergence of gradients, i.e., lim gk = 0 k→∞

Proof:

Let λmin(Bk) and λmax(Bk) be the smallest and largest eigenvalues of Bk. By assumption, there are bounds 0 < λmin ≤ λmax such that sTB s λ ≤ λ (B ) ≤ k ≤ λ (B ) ≤ λ for any nonzero s. min min k ksk2 max k max 90 / 106 From the previous slide, it follows that sTB−1s λ−1 ≤ λ−1 (B ) = λ (B−1) ≤ k ≤ λ (B−1) = λ−1 (B ) ≤ λ−1 Notes max max k min k ksk2 max k min k min (18) for any nonzero vector s. Using Bkpk = −gk and (18) we see that T T −1 −1 2 | pk gk| = |gk Bk gk| ≥ λmaxkgkk2 (19) In addition, we may use (18) to see that 2 T −2 −2 2 kpkk2 = gk Bk gk ≤ λminkgkk2, which implies that −1 kpkk2 ≤ λminkgkk2. It now follows from the previous inequality and (19) that T |pk gk| λmin ≥ kgkk2. (20) kpkk2 λmax Combining the previous inequality and (19) yields 2  2   T T 2 2 kgkk2 λmin min |pk gk|, |pk gk| /kpkk2 ≥ min 1, . (21) λmax λmax T T 2 2 From Theorem 3.3 we know that limk→∞ min |pk gk|, |pk gk| /kpkk2 = 0 and combined with (21), we conclude that

lim gk = 0. k→∞

91 / 106

Rate of convergence Notes

Theorem 4.5 (Global convergence rate for modified-Newton) 1 n Suppose f ∈ C and ∇f is Lipschitz continuous on R . Further assume that f(x) is n bounded from below on R . Let {xk} be the iterates generated by the modified-Newton backtracking-Armijo linesearch Algorithm 15 such that Bk has eigenvalues uniformly bounded away from 0 and infinity. Then there exists a constant M such that for all T ≥ 1 M min k∇f(xk)k ≤ √ . k=0,...,T T + 1

M 2 Consequently, for any  > 0, within d  e steps, we will see an iterate where the gradient has norm at most . In other words, we reach an “-stationary” point in 1 2 O((  ) ) steps

Proof: Almost identical to steepest descent proof (Theorem 4.2); use (19) and (20) in (17).

REMARK: Same thing holds for constant step size for an appropriately chosen constant.

92 / 106 Rate of convergence Notes Theorem 4.6 (Local quadratic convergence for modified-Newton) Suppose that f ∈ C2 n H is Lipschitz continuous on R with constant γ Suppose that iterates are generated from the modified-Newton backtracking-Armijo Algorithm 15 such that

αinit = 1 1 η ∈ (0, 2 ) Bk = Hk whenever Hk is positive definite ∗ ∗ If the sequence of iterates {xk} has a limit point x satisfying H(x ) is positive definite, then it follows that

(i) αk = 1 for all sufficiently large k, ∗ (ii) the entire sequence {xk} converges to x , and (iii) the rate of convergence is q-quadratic, i.e, there is a constant κ ≥ 0 and a natural number k0 such that for all k ≥ k0

∗ ∗ 2 kxk+1 − x k2 ≤ κkxk − x k2

93 / 106

Proof: Notes By assumption we know that there is some subsequence K such that

∗ ∗ lim xk = x and H(x ) 0 k∈K

Continuity implies that Hk is positive definite for all k ∈ K sufficiently large so that

T 1 ∗ 2 pk Hkpk ≥ 2 λmin(H(x ))kpkk2 for all k0 ≤ k ∈ K (22) ∗ ∗ for some k0, where λmin(H(x )) is the smallest eigenvalue of H(x ). This fact and since Hkpk = −gk for k0 ≤ k ∈ K implies that

T T T 1 ∗ 2 |pk gk| = −pk gk = pk Hkpk ≥ 2 λmin(H(x ))kpkk2 for all k0 ≤ k ∈ K. (23) This also implies T |pk gk| 1 ∗ ≥ 2 λmin(H(x ))kpkk2. (24) kpkk2 We now deduce from Theorem 3.3, (23), and (24) that

lim pk = 0. k∈K→∞

94 / 106 Mean Value Theorem implies that there exists zk between xk and xk + pk such that Notes T 1 T f(xk + pk) = fk + pk gk + 2 pk H(zk)pk.

Lipschitz continuity of H(x) and the fact that Hkpk = −gk implies

1 T 1 T T f(xk + pk) − fk − 2 pk gk = 2 (pk gk + pk H(zk)pk) 1 T = 2 pk (H(zk) − Hk)pk 1 2 1 3 ≤ 2 kH(zk) − H(xk)k2kpkk2 ≤ 2 γkpkk2. (25)

Since η ∈ (0, 1/2) and pk → 0, we may pick k sufficiently large so that

∗ 2γkpkk2 ≤ λmin(H(x ))(1 − 2η). (26)

Combining (25), (26), and (23) shows that

1 T 1 3 f(xk + pk) − fk ≤ 2 pk gk + 2 γkpkk2 1 T 1 ∗ 2 ≤ 2 pk gk + 4 λmin(H(x ))(1 − 2η)kpkk2 1 T 1 T ≤ 2 pk gk − 2 (1 − 2η)pk gk T = ηpk gk so that the unit step xk + pk satisfies the Armijo condition for all k ∈ K sufficiently large.

95 / 106

Note that (22) implies that

−1 ∗ kHk k2 ≤ 2/λmin(H(x )) for all k0 ≤ k ∈ K. (27) Notes The iteration gives

∗ ∗ −1 ∗ −1 ∗ xk+1 − x = xk − x − Hk gk = xk − x − Hk (gk − g(x )) −1 ∗ ∗ = Hk [ g(x ) − gk − Hk(x − xk)] . But a Taylor Approximation provides the estimate γ kg(x∗) − g − H (x∗ − x )k ≤ kx − x∗k2 k k k 2 2 k 2 which implies

∗ γ −1 ∗ 2 γ ∗ 2 kxk+1 − x k2 ≤ kHk k2kxk − x k2 ≤ ∗ kxk − x k2 (28) 2 λmin(H(x )) Thus, we know that for k ∈ K sufficiently large, the unit step will be accepted and

∗ 1 ∗ kxk+1 − x k2 ≤ 2 kxk − x k2 which proves parts (i) and (ii) since the entire argument may be repeated. Finally, part (iii) then follows from part (ii) and (28) with the choice

∗ κ = γ/λmin(H(x )) which completes the proof. 96 / 106 1.5 Notes

1

0.5

0

−0.5 −1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1

Figure: Contours for the objective function f(x, y) = 10(y − x2)2 + (x − 1)2, and the iterates generated by the modified-Newton backtracking-Armijo Algorithm 15. 97 / 106


Algorithm 16 Modified-Newton Wolfe linesearch method

1: Input: initial guess $x_0$
2: Set k = 0
3: until $\|\nabla f(x_k)\| \le 10^{-8} \max(1, \|\nabla f(x_0)\|)$
4:   Compute positive-definite matrix $B_k$ from $H_k$ using Algorithm 2.
5:   Compute a search direction $p_k$ as the solution to $B_k p = -g_k$.
6:   Compute a stepsize $\alpha_k$ that satisfies the Wolfe conditions (11a) and (11b).
7:   Set $x_{k+1} \leftarrow x_k + \alpha_k p_k$
8:   Set $k \leftarrow k + 1$
9: end until
10: return approximate solution $x_k$

should also terminate if a maximum number of allowed iterations is reached should also terminate if a maximum time limit is reached


Algorithm 17 Quasi-Newton Wolfe linesearch method

1: Input: initial guess $x_0$
2: Set k = 0
3: until $\|\nabla f(x_k)\| \le 10^{-8} \max(1, \|\nabla f(x_0)\|)$
4:   Compute positive-definite matrix $B_k$ from $B_{k-1}$ using the BFGS rule, or using Algorithm 3 with a seed matrix $B_k^0$.
5:   Compute a search direction $p_k$ as the solution to $B_k p = -g_k$.
6:   Compute a stepsize $\alpha_k$ that satisfies the Wolfe conditions (11a) and (11b).
7:   Set $x_{k+1} \leftarrow x_k + \alpha_k p_k$
8:   Set $k \leftarrow k + 1$
9: end until
10: return approximate solution $x_k$

should also terminate if a maximum number of allowed iterations is reached should also terminate if a maximum time limit is reached
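A self-contained Python sketch of Algorithm 17 is given below, combining the BFGS update (2) with the bisection weak-Wolfe linesearch of Algorithm 13; the seed $B_0 = I$, the parameter values, and the test function are illustrative choices.

```python
# A sketch of the quasi-Newton Wolfe method (Algorithm 17) with the BFGS
# update (2) and the bisection weak-Wolfe linesearch (Algorithm 13).
import numpy as np

def wolfe_step(f, grad, x, p, c1=1e-4, c2=0.9, max_iter=50):
    f0, slope = f(x), grad(x) @ p
    lo, t, hi = 0.0, 1.0, np.inf
    for _ in range(max_iter):
        if f(x + t * p) > f0 + c1 * t * slope:
            hi = t; t = (lo + hi) / 2          # Armijo fails: shrink
        elif grad(x + t * p) @ p < c2 * slope:
            lo = t; t = 2 * lo if hi == np.inf else (lo + hi) / 2   # too short: grow
        else:
            break                              # Wolfe conditions (11) hold
    return t

def bfgs_wolfe(f, grad, x0, tol=1e-8, max_iter=500):
    x = x0.copy()
    B = np.eye(len(x0))                        # seed B_0 = I (a common choice)
    g, g0_norm = grad(x), np.linalg.norm(grad(x0))
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol * max(1.0, g0_norm):
            break
        p = np.linalg.solve(B, -g)             # solve B_k p = -g_k
        alpha = wolfe_step(f, grad, x, p)
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 0:                          # guaranteed by Theorem 3.6; guard anyway
            Bs = B @ s
            B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
        x, g = x_new, g_new
    return x

f = lambda z: 10 * (z[1] - z[0]**2)**2 + (z[0] - 1)**2
grad = lambda z: np.array([-40 * z[0] * (z[1] - z[0]**2) + 2 * (z[0] - 1),
                           20 * (z[1] - z[0]**2)])
print(bfgs_wolfe(f, grad, np.array([-1.0, 0.5])))   # converges to (1, 1)
```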


Theorem 4.7 (Global convergence of modified-Newton/quasi-Newton Wolfe)

Suppose that f ∈ C1 n g is Lipschitz continuous on R

the condition number of Bk is uniformly bounded by β Then, the iterates generated by the modified/quasi-Newton Wolfe Algorithm 17 or 16 satisfy one of the following: 1 finite termination, i.e., there exists a finite k such that

gk = 0

2 objective is unbounded below, i.e.,

lim fk = −∞ k→∞

3 convergence of gradients, i.e., lim gk = 0 k→∞

Proof: Homework assignment.

102 / 106 Rate of convergence Notes

Theorem 4.8 (Global convergence rate for modified-Newton/quasi-Newton Wolfe)

Suppose $f \in C^1$ and ∇f is Lipschitz continuous on $\mathbb{R}^n$. Further assume that f(x) is bounded from below on $\mathbb{R}^n$. Let $\{x_k\}$ be the iterates generated by the modified/quasi-Newton Wolfe linesearch Algorithms 16 or 17, such that $B_k$ has uniformly bounded condition number. Then there exists a constant M such that for all T ≥ 1
\[ \min_{k=0,\ldots,T} \|\nabla f(x_k)\| \le \frac{M}{\sqrt{T + 1}}. \]
Consequently, for any $\epsilon > 0$, within $\lceil (M/\epsilon)^2 \rceil$ steps we will see an iterate where the gradient has norm at most $\epsilon$. In other words, we reach an “$\epsilon$-stationary” point in $O((1/\epsilon)^2)$ steps.

103 / 106

Rate of convergence Notes

Theorem 4.9 (Local superlinear convergence for quasi-Newton) Suppose that f ∈ C2 n H is Lipschitz continuous on R

Suppose that iterates are generated from the quasi-Newton Wolfe Algorithm 17 such that $0 < c_1 < c_2 \le \tfrac{1}{2}$.
If the sequence of iterates $\{x_k\}$ has a limit point x* at which H(x*) is positive definite and g(x*) = 0, then it follows that

(i) the step length αk = 1 is valid for all sufficiently large k, ∗ (ii) the entire sequence {xk} converges to x superlinearly.

See also Theorems 3.6 and 3.7 from Nocedal-Wright.

104 / 106 A Summary Notes The problem

\[ \min_{x \in \mathbb{R}^n} \; f(x) \]
where the objective function $f : \mathbb{R}^n \to \mathbb{R}$

Solve instead quadratic approximation at current iterate xk, after choosing positive definite Bk, to get a search direction that is a descent direction:

\[ p_k = \operatorname*{argmin}_{p \in \mathbb{R}^n} \; m_k^Q(p) \overset{\text{def}}{=} f_k + g_k^T p + \tfrac{1}{2} p^T B_k p \]

Bk = identity matrix: Steepest Descent Direction.

Bk obtained from Hessian Hk by adjusting eigenvalues: Modified-Newton.

Bk obtained by updating at every iteration using current and previous gradient information: Quasi-Newton.

Decide on a step size strategy: Armijo backtracking linesearch, Wolfe linesearch, constant step size (HW assignment), $O(1/\sqrt{k})$ in iteration k, ... many others.

A Summary

Steepest Descent:
  Requires only function and gradient evaluations
  No overhead of solving a linear system in each iteration
  Global convergence to an ε-stationary point in $O((1/\epsilon)^2)$ steps
  Linear local convergence: converges to within ε distance of a local minimum in $O(\log(1/\epsilon))$ steps

Modified/Quasi Newton:
  Could need Hessian evaluations as well (with spectral decompositions)
  Solves a linear system $Bp = -g$ in each iteration
  Global convergence to an ε-stationary point in $O((1/\epsilon)^2)$ steps
  Superlinear (quadratic) local convergence: converges to within ε distance of a local minimum in $o(\log(1/\epsilon))$ ($O(\log\log(1/\epsilon))$) steps
