Line-Search Methods for Smooth Unconstrained Optimization
Daniel P. Robinson, Department of Applied Mathematics and Statistics, Johns Hopkins University
September 17, 2020
Outline
1. Generic linesearch framework
2. Computing a descent direction $p_k$ (search direction)
   - Steepest descent direction
   - Modified Newton direction
   - Quasi-Newton directions for medium-scale problems
   - Limited-memory quasi-Newton directions for large-scale problems
   - Linear CG method for large-scale problems
3. Choosing the step length $\alpha_k$ (linesearch)
   - Backtracking-Armijo linesearch
   - Wolfe conditions
   - Strong Wolfe conditions
4. Complete algorithms
   - Steepest descent backtracking-Armijo linesearch method
   - Modified Newton backtracking-Armijo linesearch method
   - Modified Newton linesearch method based on the Wolfe conditions
   - Quasi-Newton linesearch method based on the Wolfe conditions
The problem
$$\min_{x\in\mathbb{R}^n} f(x)$$
where the objective function $f : \mathbb{R}^n \to \mathbb{R}$.
- assume that $f \in C^1$ (sometimes $C^2$) and is Lipschitz continuous
- in practice this assumption may be violated, but the algorithms we develop may still work
- in practice it is very rare to be able to provide an explicit minimizer
- we consider iterative methods: given a starting guess $x_0$, generate a sequence $\{x_k\}$ for $k = 1, 2, \ldots$
- AIM: ensure that (a subsequence) has some favorable limiting properties:
  - satisfies first-order necessary conditions
  - satisfies second-order necessary conditions
Notation: $f_k = f(x_k)$, $g_k = g(x_k) = \nabla f(x_k)$, and $H_k = H(x_k) = \nabla^2 f(x_k)$.
The basic idea
- consider descent methods, i.e., $f(x_{k+1}) < f(x_k)$
- calculate a search direction $p_k$ from $x_k$
- ensure that this direction is a descent direction, i.e.,
  $$g_k^T p_k < 0 \quad\text{if } g_k \neq 0$$
  so that, for small steps along $p_k$, the objective function $f$ will be reduced
- calculate a suitable steplength $\alpha_k > 0$ so that
  $$f(x_k + \alpha_k p_k) < f_k$$
- the computation of $\alpha_k$ is the linesearch, and it may itself require an iterative procedure
- the generic update for linesearch methods is given by
  $$x_{k+1} = x_k + \alpha_k p_k$$
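To fix ideas, here is a minimal Python/NumPy sketch of this generic framework. It is not from the slides: the helper names `descent_direction` and `linesearch` are placeholders for the concrete choices developed in the rest of these notes.

```python
import numpy as np

def generic_linesearch_method(f, grad, x0, descent_direction, linesearch,
                              tol=1e-8, max_iter=1000):
    """Generic linesearch framework: x_{k+1} = x_k + alpha_k * p_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:       # approximate first-order point
            break
        p = descent_direction(x, g)        # must satisfy g @ p < 0 (descent)
        alpha = linesearch(f, grad, x, p)  # must give f(x + alpha*p) < f(x)
        x = x + alpha * p
    return x
```

For example, passing `descent_direction=lambda x, g: -g` recovers steepest descent.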
Issue 1: steps might be too long
Figure: the objective function $f(x) = x^2$ and the iterates $x_{k+1} = x_k + \alpha_k p_k$ generated by the descent directions $p_k = (-1)^{k+1}$ and steps $\alpha_k = 2 + 3/2^{k+1}$ from $x_0 = 2$; the labeled points $(x_k, f(x_k))$ oscillate across the minimizer.
decrease in $f$ is not proportional to the size of the directional derivative!
Issue 2: steps might be too short
Figure: the objective function $f(x) = x^2$ and the iterates $x_{k+1} = x_k + \alpha_k p_k$ generated by the descent directions $p_k = -1$ and steps $\alpha_k = 1/2^{k+1}$ from $x_0 = 2$; the iterates stall well short of the minimizer.
decrease in $f$ is not proportional to the size of the directional derivative!

What is a practical linesearch?
- in the “early” days exact linesearches were used, i.e., pick $\alpha_k$ to minimize $f(x_k + \alpha p_k)$
  - an exact linesearch requires univariate minimization
  - cheap if $f$ is simple, e.g., a quadratic function
  - generally very expensive and not cost effective
  - an exact linesearch may not be much better than an approximate linesearch
- modern methods use inexact linesearches
  - ensure steps are neither too long nor too short
  - make sure that the decrease in $f$ is proportional to the directional derivative
  - try to pick an “appropriate” initial stepsize for fast convergence (related to how the search direction $p_k$ is computed)
- the descent direction (search direction) is also important
  - “bad” directions may not converge at all
  - more typically, “bad” directions may converge very slowly
Definition 2.1 (Steepest descent direction)
For a differentiable function $f$, the search direction
$$p_k \overset{\text{def}}{=} -\nabla f(x_k) \equiv -g_k$$
is called the steepest-descent direction.
- $p_k$ is a descent direction provided $g_k \neq 0$, since $g_k^T p_k = -g_k^T g_k = -\|g_k\|_2^2 < 0$
- $p_k$ solves the problem
  $$\min_{p\in\mathbb{R}^n} m_k^L(x_k + p) \quad\text{subject to}\quad \|p\|_2 = \|g_k\|_2$$
  where $m_k^L(x_k + p) \overset{\text{def}}{=} f_k + g_k^T p$ and $m_k^L(x_k + p) \approx f(x_k + p)$
Any method that uses the steepest-descent direction is a method of steepest descent.
Observation: the steepest-descent direction is also the unique solution to
$$\min_{p\in\mathbb{R}^n} f_k + g_k^T p + \tfrac{1}{2}p^T p$$
- approximates the second-derivative (Hessian) matrix by the identity matrix $I$
- how often is this a good idea? is it a surprise that convergence is typically very slow?
Idea: choose a positive-definite $B_k$ and compute the search direction as the unique minimizer
$$p_k = \operatorname*{argmin}_{p\in\mathbb{R}^n} m_k^Q(p) \overset{\text{def}}{=} f_k + g_k^T p + \tfrac{1}{2}p^T B_k p$$
- $p_k$ satisfies $B_k p_k = -g_k$
- why must $B_k$ be positive definite?
  - $B_k \succ 0 \implies m_k^Q$ is strictly convex $\implies$ unique solution
  - if $g_k \neq 0$, then $p_k \neq 0$ is a descent direction, since $p_k^T g_k = -p_k^T B_k p_k < 0$
- pick an “intelligent” $B_k$ that has “useful” curvature information
- if $H_k \succ 0$ and we choose $B_k = H_k$, then $p_k$ is the Newton direction. Awesome!
Question: How do we choose the positive-definite matrix $B_k$?
Ideally, $B_k$ is chosen such that
- $\|B_k - H_k\|$ is “small”
- $B_k = H_k$ when $H_k$ is “sufficiently” positive definite
Comments:
- for the remainder of this section, we omit the suffix $k$ and write $H = H_k$, $B = B_k$, and $g = g_k$
- for a symmetric matrix $A \in \mathbb{R}^{n\times n}$, we use the matrix norm
  $$\|A\|_2 = \max_{1\le j\le n} |\lambda_j|$$
  with $\{\lambda_j\}$ the eigenvalues of $A$
- the spectral decomposition of $H$ is given by $H = V\Lambda V^T$, where $V = \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix}$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$, with $Hv_j = \lambda_j v_j$ and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$
- $H$ is positive definite if and only if $\lambda_j > 0$ for all $j$
- computing the spectral decomposition is, generally, very expensive!
Method 1: small-scale problems (n < 1000)

Algorithm 1: Compute modified Newton matrix B from H
1: input $H$
2: Choose $\beta > 1$, the desired bound on the condition number of $B$.
3: Compute the spectral decomposition $H = V\Lambda V^T$.
4: if $H = 0$ then
5: Set $\varepsilon = 1$
6: else
7: Set $\varepsilon = \|H\|_2/\beta > 0$
8: end if
9: Compute
$$\bar\Lambda = \mathrm{diag}(\bar\lambda_1, \bar\lambda_2, \ldots, \bar\lambda_n) \quad\text{with}\quad \bar\lambda_j = \begin{cases} \lambda_j & \text{if } \lambda_j \ge \varepsilon \\ \varepsilon & \text{otherwise} \end{cases}$$
10: return $B = V\bar\Lambda V^T \succ 0$, which satisfies $\mathrm{cond}(B) \le \beta$
- replaces the eigenvalues of $H$ that are “not positive enough” with $\varepsilon$
- $\bar\Lambda = \Lambda + D$, where $D = \mathrm{diag}\big(\max(0,\varepsilon-\lambda_1),\, \max(0,\varepsilon-\lambda_2),\, \ldots,\, \max(0,\varepsilon-\lambda_n)\big) \succeq 0$
- $B = H + E$, where $E = VDV^T \succeq 0$
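A minimal NumPy sketch of Algorithm 1 (the function name and default `beta` are ours); `numpy.linalg.eigh` computes the spectral decomposition of a symmetric matrix:

```python
import numpy as np

def modified_newton_method1(H, beta=1e8):
    """Algorithm 1: lift eigenvalues of H that fall below eps up to eps.

    Returns B = V diag(lambda_bar) V^T, positive definite with cond(B) <= beta.
    """
    lam, V = np.linalg.eigh(H)                # H = V diag(lam) V^T
    normH = np.max(np.abs(lam))               # ||H||_2 = max |lambda_j|
    eps = 1.0 if normH == 0.0 else normH / beta
    lam_bar = np.where(lam >= eps, lam, eps)  # replace "not positive enough" eigenvalues
    return (V * lam_bar) @ V.T                # V diag(lam_bar) V^T
```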
Question: What are the properties of the resulting search direction?
$$p = -B^{-1}g = -V\bar\Lambda^{-1}V^T g = -\sum_{j=1}^n \frac{v_j^T g}{\bar\lambda_j}\, v_j$$
Taking norms and using the orthogonality of $V$ gives
$$\|p\|_2^2 = \|B^{-1}g\|_2^2 = \sum_{j=1}^n \frac{(v_j^T g)^2}{\bar\lambda_j^2} = \sum_{\lambda_j \ge \varepsilon} \frac{(v_j^T g)^2}{\lambda_j^2} + \sum_{\lambda_j < \varepsilon} \frac{(v_j^T g)^2}{\varepsilon^2}$$
Thus, we conclude that
$$\|p\|_2 = O\!\left(\frac{1}{\varepsilon}\right) \quad\text{provided } v_j^T g \neq 0 \text{ for at least one } \lambda_j < \varepsilon$$
- the step $p$ becomes unbounded as $\varepsilon \to 0$ provided $v_j^T g \neq 0$ for at least one $\lambda_j < \varepsilon$
- any indefinite matrix will generally satisfy this property
- the next method that we discuss is better!
Method 2: small-scale problems (n < 1000)

Algorithm 2: Compute modified Newton matrix B from H
1: input $H$
2: Choose $\beta > 1$, the desired bound on the condition number of $B$.
3: Compute the spectral decomposition $H = V\Lambda V^T$.
4: if $H = 0$ then
5: Set $\varepsilon = 1$
6: else
7: Set $\varepsilon = \|H\|_2/\beta > 0$
8: end if
9: Compute
$$\bar\Lambda = \mathrm{diag}(\bar\lambda_1, \bar\lambda_2, \ldots, \bar\lambda_n) \quad\text{with}\quad \bar\lambda_j = \begin{cases} \lambda_j & \text{if } \lambda_j \ge \varepsilon \\ -\lambda_j & \text{if } \lambda_j \le -\varepsilon \\ \varepsilon & \text{otherwise} \end{cases}$$
10: return $B = V\bar\Lambda V^T \succ 0$, which satisfies $\mathrm{cond}(B) \le \beta$
- replace small eigenvalues $\lambda_j$ of $H$ with $\varepsilon$
- replace “sufficiently negative” eigenvalues $\lambda_j$ of $H$ with $-\lambda_j$
- $\bar\Lambda = \Lambda + D$, where $D = \mathrm{diag}\big(\max(0, -2\lambda_1, \varepsilon-\lambda_1),\, \ldots,\, \max(0, -2\lambda_n, \varepsilon-\lambda_n)\big) \succeq 0$
- $B = H + E$, where $E = VDV^T \succeq 0$
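The corresponding sketch of Algorithm 2 differs from the Method 1 sketch only in the eigenvalue mapping (same assumptions as above):

```python
import numpy as np

def modified_newton_method2(H, beta=1e8):
    """Algorithm 2: flip eigenvalues <= -eps, lift those in (-eps, eps) to eps."""
    lam, V = np.linalg.eigh(H)
    normH = np.max(np.abs(lam))
    eps = 1.0 if normH == 0.0 else normH / beta
    lam_bar = np.where(lam >= eps, lam,
                       np.where(lam <= -eps, -lam, eps))
    return (V * lam_bar) @ V.T
```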
Suppose that $B = H + E$ is computed from Algorithm 2, so that
$$E = B - H = V(\bar\Lambda - \Lambda)V^T$$
and, since $V$ is orthogonal,
$$\|E\|_2 = \|V(\bar\Lambda - \Lambda)V^T\|_2 = \|\bar\Lambda - \Lambda\|_2 = \max_j |\bar\lambda_j - \lambda_j|$$
By definition,
$$\bar\lambda_j - \lambda_j = \begin{cases} 0 & \text{if } \lambda_j \ge \varepsilon \\ \varepsilon - \lambda_j & \text{if } -\varepsilon < \lambda_j < \varepsilon \\ -2\lambda_j & \text{if } \lambda_j \le -\varepsilon \end{cases}$$
which implies that $\|E\|_2 = \max_{1\le j\le n}\{\, 0,\ \varepsilon - \lambda_j,\ -2\lambda_j \,\}$.
However, since $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, we know that
$$\|E\|_2 = \max\{\, 0,\ \varepsilon - \lambda_n,\ -2\lambda_n \,\}$$
- if $\lambda_n \ge \varepsilon$, i.e., $H$ is sufficiently positive definite, then $E = 0$ and $B = H$
- if $\lambda_n < \varepsilon$, then $B \neq H$ and it can be shown that $\|E\|_2 \le 2\max(\varepsilon, |\lambda_n|)$
- regardless, $B$ is well conditioned by construction
Example 2.2
Consider
$$g = \begin{pmatrix} 2 \\ 4 \end{pmatrix} \quad\text{and}\quad H = \begin{pmatrix} 1 & 0 \\ 0 & -2 \end{pmatrix}$$
The Newton direction is
$$p^N = -H^{-1}g = \begin{pmatrix} -2 \\ 2 \end{pmatrix} \quad\text{so that}\quad g^T p^N = 4 > 0 \quad\text{(ascent direction)}$$
and $p^N$ is a saddle point of the quadratic model $g^T p + \tfrac{1}{2}p^T Hp$ since $H$ is indefinite.
Algorithm 1 (Method 1) generates
$$B = \begin{pmatrix} 1 & 0 \\ 0 & \varepsilon \end{pmatrix}$$
so that
$$p = -B^{-1}g = \begin{pmatrix} -2 \\ -4/\varepsilon \end{pmatrix} \quad\text{and}\quad g^T p = -4 - \frac{16}{\varepsilon} < 0 \quad\text{(descent direction)}$$
Algorithm 2 (Method 2) generates
$$B = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$$
so that
$$p = -B^{-1}g = \begin{pmatrix} -2 \\ -2 \end{pmatrix} \quad\text{and}\quad g^T p = -12 < 0 \quad\text{(descent direction)}$$
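A quick numerical check of Example 2.2 using the two sketches above (here $\beta = 4$, so $\varepsilon = \|H\|_2/\beta = 0.5$; any sufficiently small $\varepsilon$ gives the same qualitative picture):

```python
import numpy as np

g = np.array([2.0, 4.0])
H = np.array([[1.0, 0.0],
              [0.0, -2.0]])

pN = -np.linalg.solve(H, g)
print(pN, g @ pN)        # [-2.  2.]   4.0  -> ascent direction

B1 = modified_newton_method1(H, beta=4.0)   # eps = ||H||_2 / beta = 0.5
p1 = -np.linalg.solve(B1, g)
print(p1, g @ p1)        # [-2. -8.] -36.0 = -4 - 16/eps -> descent

B2 = modified_newton_method2(H, beta=4.0)
p2 = -np.linalg.solve(B2, g)
print(p2, g @ p2)        # [-2. -2.] -12.0 -> descent
```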
Figure: geometric comparison of the Newton direction $p^N$ with the modified Newton direction $p$ produced by Method 1 and by Method 2.
Question: What is the geometric interpretation?

Answer: Let the spectral decomposition of $H$ be $H = V\Lambda V^T$ and assume that $\lambda_j \neq 0$ for all $j$. Partition the eigenvector and eigenvalue matrices as
$$\Lambda = \begin{pmatrix} \Lambda_+ & \\ & \Lambda_- \end{pmatrix} \quad\text{and}\quad V = \begin{pmatrix} V_+ & V_- \end{pmatrix}$$
where $\Lambda_+$ holds the positive and $\Lambda_-$ the negative eigenvalues, so that
$$H = \begin{pmatrix} V_+ & V_- \end{pmatrix} \begin{pmatrix} \Lambda_+ & \\ & \Lambda_- \end{pmatrix} \begin{pmatrix} V_+^T \\ V_-^T \end{pmatrix}$$
and the Newton direction is
$$p^N = -V\Lambda^{-1}V^T g = -\begin{pmatrix} V_+ & V_- \end{pmatrix} \begin{pmatrix} \Lambda_+^{-1} & \\ & \Lambda_-^{-1} \end{pmatrix} \begin{pmatrix} V_+^T g \\ V_-^T g \end{pmatrix} = \underbrace{-V_+\Lambda_+^{-1}V_+^T g}_{p_+^N} \underbrace{- V_-\Lambda_-^{-1}V_-^T g}_{p_-^N} \overset{\text{def}}{=} p_+^N + p_-^N$$
where
$$p_+^N = -V_+\Lambda_+^{-1}V_+^T g \quad\text{and}\quad p_-^N = -V_-\Lambda_-^{-1}V_-^T g$$
Result
Given $m_k^Q(p) = g^T p + \tfrac{1}{2}p^T Hp$ with $H$ nonsingular, the Newton direction
$$p^N = p_+^N + p_-^N = -V_+\Lambda_+^{-1}V_+^T g - V_-\Lambda_-^{-1}V_-^T g$$
is such that
- $p_+^N$ minimizes $m_k^Q(p)$ on the subspace spanned by $V_+$
- $p_-^N$ maximizes $m_k^Q(p)$ on the subspace spanned by $V_-$
Proof: The function $\psi(y) \overset{\text{def}}{=} m_k^Q(V_+ y)$ may be written as
$$\psi(y) = g^T V_+ y + \tfrac{1}{2}y^T V_+^T HV_+ y$$
and has a stationary point $y^*$ given by
$$y^* = -(V_+^T HV_+)^{-1}V_+^T g$$
Since
$$\nabla^2\psi(y) = V_+^T HV_+ = V_+^T V\Lambda V^T V_+ = \Lambda_+ \succ 0$$
we know that $y^*$ is, in fact, the unique minimizer of $\psi(y)$ and
$$V_+ y^* = -V_+\Lambda_+^{-1}V_+^T g \equiv p_+^N$$
For the second part,
$$y^* = -(V_-^T HV_-)^{-1}V_-^T g = -\Lambda_-^{-1}V_-^T g \quad\text{so that}\quad V_- y^* = -V_-\Lambda_-^{-1}V_-^T g \equiv p_-^N.$$
Since $\Lambda_- \prec 0$, we know that $p_-^N$ maximizes $m_k^Q(p)$ on the space spanned by the columns of $V_-$.
Some observations
- $p_+^N$ is a descent direction for $f$:
  $$g^T p_+^N = -g^T V_+\Lambda_+^{-1}V_+^T g = -y^T\Lambda_+^{-1}y \le 0 \quad\text{where } y \overset{\text{def}}{=} V_+^T g.$$
  The inequality is strict if $y \neq 0$, i.e., $g$ is not orthogonal to the column space of $V_+$.
- $p_-^N$ is an ascent direction for $f$:
  $$g^T p_-^N = -g^T V_-\Lambda_-^{-1}V_-^T g = -y^T\Lambda_-^{-1}y \ge 0 \quad\text{where } y \overset{\text{def}}{=} V_-^T g.$$
  The inequality is strict if $y \neq 0$, i.e., $g$ is not orthogonal to the column space of $V_-$.
- $p^N = p_+^N + p_-^N$ may or may not be a descent direction!

This is the geometry of the Newton direction. What about the modified Newton direction?
Question: what is the geometry of the modified-Newton step?

Answer: Partition the matrices in the spectral decomposition $H = V\Lambda V^T$ such that
$$V = \begin{pmatrix} V_+ & V_\varepsilon & V_- \end{pmatrix} \quad\text{and}\quad \Lambda = \begin{pmatrix} \Lambda_+ & & \\ & \Lambda_\varepsilon & \\ & & \Lambda_- \end{pmatrix}$$
where
$$\Lambda_+ = \{\lambda_i : \lambda_i \ge \varepsilon\}, \quad \Lambda_\varepsilon = \{\lambda_i : |\lambda_i| < \varepsilon\}, \quad \Lambda_- = \{\lambda_i : \lambda_i \le -\varepsilon\}$$
Theorem 2.3 (Properties of the modified-Newton direction)
Suppose that the positive-definite matrix $B_k$ is computed from Algorithm 2 with input $H_k$ and value $\varepsilon > 0$. If $g_k \neq 0$ and $p_k$ is the unique solution to $B_k p = -g_k$, then $p_k$ may be written as $p_k = p_+ + p_\varepsilon + p_-$, where
(i) $p_+$ is a direction of positive curvature that minimizes $m_k^Q(p)$ in the space spanned by the columns of $V_+$;
(ii) $-(p_-)$ is a direction of negative curvature that maximizes $m_k^Q(p)$ in the space spanned by the columns of $V_-$; and
(iii) $p_\varepsilon$ is a multiple of the steepest-descent direction in the space spanned by the columns of $V_\varepsilon$, with $\|p_\varepsilon\|_2 = O(1/\varepsilon)$.

Comments:
- (iii) implies that singular Hessians $H_k$ may easily cause numerical difficulties
- the direction $p_-$ is the negative of the $p_-^N$ associated with the Newton step
Main goals of quasi-Newton methods
- maintain a positive-definite “approximation” $B_k$ to $H_k$
- instead of computing $B_{k+1}$ from scratch, e.g., from a spectral decomposition of $H_{k+1}$, compute $B_{k+1}$ as an update to $B_k$ using information at $x_{k+1}$
- inject “real” curvature information from $H(x_{k+1})$ into $B_{k+1}$
- $B_k$ should be cheap to form and compute with
- should be applicable for medium-scale problems ($n \approx 1000$ to $10{,}000$)
- hope that fast (superlinear) convergence of the iterates $\{x_k\}$ computed using $B_k$ can be recovered
The optimization problem solved at the $(k+1)$st iterate will be
$$\min_{p\in\mathbb{R}^n} m_{k+1}^Q(p) \overset{\text{def}}{=} f_{k+1} + g_{k+1}^T p + \tfrac{1}{2}p^T B_{k+1}p$$
How do we choose $B_{k+1}$? How about to satisfy the following:
1. the model should agree with $f$ at $x_{k+1}$
2. the gradient of the model should agree with the gradient of $f$ at $x_{k+1}$
3. the gradient of the model should agree with the gradient of $f$ at $x_k$ (the previous point)

These are equivalent to
1. $m_{k+1}^Q(0) = f_{k+1}$ ✓ (already satisfied)
2. $\nabla m_{k+1}^Q(0) = g_{k+1}$ ✓ (already satisfied)
3. $\nabla m_{k+1}^Q(-\alpha_k p_k) = g_k$, i.e.,
$$B_{k+1}(-\alpha_k p_k) + g_{k+1} = g_k \iff B_{k+1}\underbrace{(x_{k+1} - x_k)}_{s_k} = \underbrace{g_{k+1} - g_k}_{y_k}$$
Thus, we aim to cheaply compute a symmetric positive-definite matrix $B_{k+1}$ that satisfies the so-called secant equation
$$B_{k+1}s_k = y_k \quad\text{where}\quad s_k \overset{\text{def}}{=} x_{k+1} - x_k \quad\text{and}\quad y_k \overset{\text{def}}{=} g_{k+1} - g_k$$
Observation: If the matrix $B_{k+1}$ is positive definite, then the curvature condition
$$s_k^T y_k > 0 \tag{1}$$
is satisfied.
- using the fact that $B_{k+1}s_k = y_k$, we see that $s_k^T y_k = s_k^T B_{k+1}s_k$, which must be positive if $B_{k+1} \succ 0$
- the previous relation explains why (1) is called a curvature condition
- (1) will hold for any two points $x_k$ and $x_{k+1}$ if $f$ is strictly convex
- (1) does not always hold for any two points $x_k$ and $x_{k+1}$ if $f$ is nonconvex
- for nonconvex functions, we need to be careful how we choose the $\alpha_k$ that defines $x_{k+1} = x_k + \alpha_k p_k$ (see the Wolfe conditions for linesearch strategies)
Note: for the remainder, I will drop the subscript k from all terms.
The problem
Given a symmetric positive-definite matrix $B$ and vectors $s$ and $y$, find a symmetric positive-definite matrix $\bar B$ that satisfies the secant equation $\bar B s = y$.

We want a cheap update, so let us first consider a rank-one update of the form
$$\bar B = B + uv^T$$
for some vectors $u$ and $v$ that we seek.

Derivation:
Case 1: $Bs = y$. Choose $u = v = 0$, since $\bar B s = (B + uv^T)s = Bs + u(v^T s) = y$.
Case 2: $Bs \neq y$. If $u = 0$ or $v = 0$, then $\bar B s = (B + uv^T)s = Bs \neq y$, so we search for $u \neq 0$ and $v \neq 0$. Symmetry of $B$ and the desired symmetry of $\bar B$ imply that
$$\bar B = \bar B^T \iff B + uv^T = B^T + vu^T \iff uv^T = vu^T$$
But $uv^T = vu^T$ with $u \neq 0$ and $v \neq 0$ implies that $v = \alpha u$ for some $\alpha \neq 0$ (exercise). Thus, we should search for an $\alpha \neq 0$ and $u \neq 0$ such that the secant condition is satisfied by
$$\bar B = B + \alpha uu^T$$
From the previous slide, we hope to find an $\alpha \neq 0$ and $u \neq 0$ so that $\bar B = B + \alpha uu^T$ satisfies
$$y = \bar B s = Bs + \alpha uu^T s = Bs + \alpha(u^T s)u$$
which implies $\alpha(u^T s)u = y - Bs \neq 0$. This implies that $u = \beta(y - Bs)$ for some $\beta \neq 0$. Plugging back in, we are searching for
$$\bar B = B + \alpha\beta^2(y - Bs)(y - Bs)^T = B + \gamma(y - Bs)(y - Bs)^T$$
for some $\gamma \neq 0$ that satisfies the secant equation
$$y = \bar B s = Bs + \gamma\big[(y - Bs)^T s\big](y - Bs)$$
which implies
$$y - Bs = \gamma\big[(y - Bs)^T s\big](y - Bs)$$
which means we can choose
$$\gamma = \frac{1}{(y - Bs)^T s} \quad\text{provided } (y - Bs)^T s \neq 0.$$

Definition 2.4 (SR1 quasi-Newton formula)
If $(y - Bs)^T s \neq 0$ for some symmetric matrix $B$ and vectors $s$ and $y$, then the symmetric rank-1 (SR1) quasi-Newton formula for updating $B$ is
$$\bar B = B + \frac{(y - Bs)(y - Bs)^T}{(y - Bs)^T s}$$
and satisfies
- $\bar B$ is symmetric
- $\bar B s = y$ (secant condition)
In practice, we choose some $\kappa \in (0,1)$ and then use
$$\bar B = \begin{cases} B + \dfrac{(y - Bs)(y - Bs)^T}{(y - Bs)^T s} & \text{if } |s^T(y - Bs)| > \kappa\|s\|_2\|y - Bs\|_2 \\ B & \text{otherwise} \end{cases}$$
to avoid numerical error.
Question: Is B¯ positive definite if B is positive definite? Answer: Sometimes....but sometimes not!
Example 2.5 (SR1 update is not necessarily positive definite)
Consider the SR1 update with
$$B = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \quad s = \begin{pmatrix} -1 \\ -1 \end{pmatrix}, \quad\text{and}\quad y = \begin{pmatrix} -3 \\ 2 \end{pmatrix}$$
The matrix $B$ is positive definite with eigenvalues $\approx \{0.38, 2.6\}$. Moreover,
$$y^T s = \begin{pmatrix} -3 & 2 \end{pmatrix}\begin{pmatrix} -1 \\ -1 \end{pmatrix} = 1 > 0$$
so that the necessary condition for $\bar B$ to be positive definite holds. The SR1 update gives
$$y - Bs = \begin{pmatrix} -3 \\ 2 \end{pmatrix} - \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} -1 \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 4 \end{pmatrix}$$
$$(y - Bs)^T s = \begin{pmatrix} 0 & 4 \end{pmatrix}\begin{pmatrix} -1 \\ -1 \end{pmatrix} = -4$$
$$\bar B = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} + \frac{1}{-4}\begin{pmatrix} 0 \\ 4 \end{pmatrix}\begin{pmatrix} 0 & 4 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & -3 \end{pmatrix}$$
The matrix $\bar B$ is indefinite with eigenvalues $\approx \{2.19, -3.19\}$.
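A sketch of the safeguarded SR1 update (function name ours), followed by a numerical check of Example 2.5:

```python
import numpy as np

def sr1_update(B, s, y, kappa=1e-8):
    """SR1 update with the skipping rule used to avoid numerical error."""
    r = y - B @ s
    denom = r @ s
    if abs(denom) > kappa * np.linalg.norm(s) * np.linalg.norm(r):
        return B + np.outer(r, r) / denom
    return B                               # skip the update, keep B

# Check Example 2.5:
B = np.array([[2.0, 1.0],
              [1.0, 1.0]])
s = np.array([-1.0, -1.0])
y = np.array([-3.0, 2.0])
B_bar = sr1_update(B, s, y)
print(B_bar)                          # [[ 2.  1.]  [ 1. -3.]]
print(np.linalg.eigvalsh(B_bar))      # ~[-3.19  2.19]: indefinite despite y @ s > 0
print(np.allclose(B_bar @ s, y))      # True: the secant condition holds
```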
Theorem 2.6 (convergence of SR1 update)
Suppose that
- $f$ is $C^2$ on $\mathbb{R}^n$
- $\lim_{k\to\infty} x_k = x^*$ for some vector $x^*$
- $\nabla^2 f$ is bounded and Lipschitz continuous in a neighborhood of $x^*$
- the steps $s_k$ are uniformly linearly independent
- $|s_k^T(y_k - B_k s_k)| > \kappa\|s_k\|_2\|y_k - B_k s_k\|_2$ for some $\kappa \in (0,1)$

Then the matrices generated by the SR1 formula satisfy
$$\lim_{k\to\infty} B_k = \nabla^2 f(x^*)$$
Comment: the SR1 updates may converge to an indefinite matrix
Some comments
- if $B$ is symmetric, then the SR1 update $\bar B$ (when it exists) is symmetric
- if $B$ is positive definite, then the SR1 update $\bar B$ (when it exists) is not necessarily positive definite
- linesearch methods require positive-definite matrices, so SR1 updating is not appropriate (in general) for linesearch methods
- SR1 updating is an attractive option for optimization methods that effectively utilize indefinite matrices (e.g., trust-region methods)
- to derive an updating strategy that produces positive-definite matrices, we have to look beyond rank-one updates. Would a rank-two update work?
For hereditary positive definiteness we must consider updates of at least rank two.
Result
$U$ is a symmetric matrix of rank two if and only if
$$U = \beta uu^T + \delta ww^T$$
for some nonzero $\beta$ and $\delta$, with $u$ and $w$ linearly independent.
If $\bar B = B + U$ for some rank-two matrix $U$ and $\bar B s = y$ for some vectors $s$ and $y$, then
$$\bar B s = (B + U)s = y$$
and it follows that
$$y - Bs = Us = \beta(u^T s)u + \delta(w^T s)w$$
Let us make the “obvious” choices of
$$u = Bs \quad\text{and}\quad w = y$$
which gives the rank-two update $U = \beta Bs(Bs)^T + \delta yy^T$, and we now search for $\beta$ and $\delta$ so that $\bar B s = y$ is satisfied. Substituting, we have
$$y = \bar B s = \big(B + \beta Bs(Bs)^T + \delta yy^T\big)s = Bs + \beta(s^T Bs)Bs + \delta(y^T s)y$$
Making the “simple” choice of
$$\beta = -\frac{1}{s^T Bs} \quad\text{and}\quad \delta = \frac{1}{y^T s}$$
gives
$$\bar B s = Bs - \frac{s^T Bs}{s^T Bs}Bs + \frac{y^T s}{y^T s}y = Bs - Bs + y = y$$
as desired. This gives the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton update formula.

Broyden-Fletcher-Goldfarb-Shanno (BFGS) update
Given a symmetric positive-definite matrix $B$ and vectors $s$ and $y$ that satisfy $y^T s > 0$, the quasi-Newton BFGS update formula is given by
$$\bar B = B - \frac{1}{s^T Bs}Bs(Bs)^T + \frac{1}{y^T s}yy^T \tag{2}$$
Result
If $B$ is positive definite and $y^T s > 0$, then
$$\bar B = B - \frac{1}{s^T Bs}Bs(Bs)^T + \frac{1}{y^T s}yy^T$$
is positive definite and satisfies $\bar B s = y$.
Proof: Homework assignment. (hint: see the next slide)
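For concreteness, a minimal NumPy sketch of the update (2) (function name ours); it assumes the curvature condition `y @ s > 0`, which the Wolfe conditions guarantee:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update (2): requires y @ s > 0 to preserve positive definiteness."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```

A quick check that `bfgs_update(B, s, y) @ s` reproduces `y` verifies the secant condition.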
A second derivation of the BFGS formula
Let $M = B^{-1}$ for some given symmetric positive-definite matrix $B$. Solve
$$\min_{\bar M\in\mathbb{R}^{n\times n}} \|\bar M - M\|_W \quad\text{subject to}\quad \bar M = \bar M^T, \quad \bar M y = s$$
for a “closest” symmetric matrix $\bar M$, where
$$\|A\|_W = \|W^{1/2}AW^{1/2}\|_F \quad\text{and}\quad \|C\|_F = \Big(\sum_{i=1}^n\sum_{j=1}^n c_{ij}^2\Big)^{1/2}$$
for any weighting matrix $W$ that satisfies $Wy = s$. If we choose
$$W = G^{-1} \quad\text{with}\quad G = \int_0^1 \nabla^2 f(x + \tau s)\, d\tau$$
(the average Hessian along the step, which satisfies $Gs = y$ and hence $G^{-1}y = s$), then the solution is
$$\bar M = \left(I - \frac{sy^T}{y^T s}\right)M\left(I - \frac{ys^T}{y^T s}\right) + \frac{ss^T}{y^T s}.$$
Computing the inverse $\bar B = \bar M^{-1}$ using the Sherman-Morrison-Woodbury formula gives
$$\bar B = B - \frac{1}{s^T Bs}Bs(Bs)^T + \frac{1}{y^T s}yy^T \quad\text{(the BFGS formula again)}$$
so that $\bar B \succ 0$ if and only if $\bar M \succ 0$.

Recall: given the positive-definite matrix $B$ and vectors $s$ and $y$ satisfying $s^T y > 0$, the BFGS formula (2) may be written as
$$\bar B = B - aa^T + bb^T \quad\text{where}\quad a = \frac{Bs}{(s^T Bs)^{1/2}} \quad\text{and}\quad b = \frac{y}{(y^T s)^{1/2}}$$
Limited-memory BFGS update
Suppose at iteration $k$ we have the $m$ previous pairs $(s_i, y_i)$ for $i = k-m, \ldots, k-1$. Then the limited-memory BFGS (L-BFGS) update with memory $m$ and positive-definite seed matrix $B_k^0$ may be written in the form
$$B_k = B_k^0 + \sum_{i=k-m}^{k-1}\big[\, b_i b_i^T - a_i a_i^T \,\big]$$
for some vectors $a_i$ and $b_i$ that may be recovered via an unrolling formula (next slide).
- the positive-definite seed matrix $B_k^0$ is often chosen to be diagonal
- essentially performs $m$ BFGS updates to the seed matrix $B_k^0$
- a different seed matrix may be used for each iteration $k$
  - the added flexibility allows “better” seed matrices to be used as the iterations proceed
Algorithm 3: Unrolling the L-BFGS formula
1: input $m \ge 1$, $\{(s_i, y_i)\}$ for $i = k-m, \ldots, k-1$, and $B_k^0 \succ 0$
2: for $i = k-m, \ldots, k-1$ do
3: $b_i \leftarrow y_i/(y_i^T s_i)^{1/2}$
4: $a_i \leftarrow B_k^0 s_i + \sum_{j=k-m}^{i-1}\big[(b_j^T s_i)b_j - (a_j^T s_i)a_j\big]$
5: $a_i \leftarrow a_i/(s_i^T a_i)^{1/2}$
6: end for
- typically $3 \le m \le 30$
- the $b_i$ and $b_j^T s_i$ should be saved and reused from previous iterations (the exceptions are $b_{k-1}$ and $b_j^T s_{k-1}$, because they depend on the most recently obtained data)
- with the previous savings in computation, and assuming $B_k^0 = I$, the total cost to compute the $a_i$, $b_i$ vectors is approximately $\tfrac{3}{2}m^2 n$
- a matrix-vector product with $B_k$ requires $4mn$ multiplications
- other so-called compact representations are slightly more efficient
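A sketch of Algorithm 3 and the resulting matrix-vector product (function names ours; pairs stored oldest first, without the save-and-reuse economies noted above):

```python
import numpy as np

def lbfgs_unroll(pairs, B0_diag):
    """Algorithm 3: recover the vectors a_i, b_i from the m stored pairs.

    `pairs` is a list of (s_i, y_i), oldest first; `B0_diag` holds the diagonal
    of the seed matrix B_k^0.  Assumes y_i @ s_i > 0 for every pair.
    """
    a_list, b_list = [], []
    for s, y in pairs:
        b = y / np.sqrt(y @ s)
        a = B0_diag * s
        for a_j, b_j in zip(a_list, b_list):   # the unrolled inner sum
            a += (b_j @ s) * b_j - (a_j @ s) * a_j
        a_list.append(a / np.sqrt(s @ a))
        b_list.append(b)
    return a_list, b_list

def lbfgs_matvec(v, a_list, b_list, B0_diag):
    """B_k v = B_k^0 v + sum_i [ (b_i @ v) b_i - (a_i @ v) a_i ]; cost ~ 4mn."""
    out = B0_diag * v
    for a, b in zip(a_list, b_list):
        out += (b @ v) * b - (a @ v) * a
    return out
```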
The problem of interest
Given a symmetric positive-definite matrix $A$, solve the linear system
$$Ax = b$$
which is equivalent to finding the unique minimizer of
$$\min_{x\in\mathbb{R}^n} q(x) \overset{\text{def}}{=} \tfrac{1}{2}x^T Ax - b^T x \tag{3}$$
because $\nabla q(x) = Ax - b$ and $\nabla^2 q(x) = A \succ 0$ imply that
$$\nabla q(x) = Ax - b = 0 \iff Ax = b$$
- we seek a method that cheaply computes approximate solutions to the system $Ax = b$ or, equivalently, approximately minimizes (3)
- we do not want to compute a factorization of $A$, because it is assumed to be too expensive; we are interested in very large-scale problems
- we allow ourselves to compute matrix-vector products, i.e., $Av$ for any vector $v$
- why do we care about this? We will see soon!

Notation: $r(x) \overset{\text{def}}{=} Ax - b$ and $r_k \overset{\text{def}}{=} Ax_k - b$.
$$q(x) = \tfrac{1}{2}x^T Ax - b^T x$$

Algorithm 4: Coordinate descent algorithm for minimizing q
1: input initial guess $x_0 \in \mathbb{R}^n$ and a symmetric positive-definite matrix $A \in \mathbb{R}^{n\times n}$
2: loop
3: for $i = 0, 1, \ldots, n-1$ do
4: Compute $\alpha_i$ as the solution to
$$\min_{\alpha\in\mathbb{R}} q(x_i + \alpha e_{i+1}) \tag{4}$$
5: $x_{i+1} \leftarrow x_i + \alpha_i e_{i+1}$
6: end for
7: $x_0 \leftarrow x_n$
8: end loop
- $e_j$ represents the $j$th coordinate vector, i.e., $e_j = (0, \ldots, 0, 1, 0, \ldots, 0)^T \in \mathbb{R}^n$, with the 1 in the $j$th position
- Is it easy to solve (4)? Yes! (exercise; a sketch follows below)
- Does this work? (see the figures that follow)
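A sketch of one inner pass of Algorithm 4; setting the derivative of $q(x + \alpha e_{i+1})$ to zero gives the closed-form step $\alpha_i = -r_i/A_{ii}$ used below (function name ours):

```python
import numpy as np

def coordinate_descent_sweep(A, b, x):
    """One inner pass of Algorithm 4 (lines 3-6).

    Minimizing q along e_{i+1} gives alpha_i = -r_i / A_ii with r = Ax - b,
    which is the closed-form answer to the exercise above.
    """
    x = x.copy()
    for i in range(len(x)):
        r_i = A[i] @ x - b[i]       # i-th component of the current residual
        x[i] -= r_i / A[i, i]       # exact minimization along the coordinate
    return x
```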
Figure: two contour plots. Left: coordinate descent does not converge in $n$ steps and looks like it will be very slow; the axes of the level sets are not aligned with the coordinate directions $e_i$. Right: coordinate descent converges in $n$ steps (very good!); the axes are aligned with the coordinate directions $e_i$.

- Fact: the axes of the level sets are determined by the eigenvectors of $A$.
- Observation: coordinate descent converges in no more than $n$ steps if the eigenvectors align with the coordinate directions, i.e., $V = I$, where $V$ is the eigenvector matrix of $A$.
- Implication: using the spectral decomposition of $A$, $A = V\Lambda V^T = I\Lambda I^T = \Lambda$ (a diagonal matrix).
- Conclusion: coordinate descent converges in $\le n$ steps when $A$ is diagonal.
First thought: this is only good for diagonal matrices. That sucks!
Second thought: can we use a transformation of variables so that the transformed Hessian is diagonal?
Let $S$ be a square nonsingular matrix and define the transformed variables
$$y = S^{-1}x \iff Sy = x.$$
Consider the transformed quadratic function $\hat q$ defined as
$$\hat q(y) \overset{\text{def}}{=} q(Sy) \equiv \tfrac{1}{2}y^T\underbrace{S^T AS}_{\hat A}y - \underbrace{(S^T b)^T}_{\hat b^T}y = \tfrac{1}{2}y^T\hat A y - \hat b^T y$$
so that a minimizer $x^*$ of $q$ satisfies $x^* = Sy^*$, where $y^*$ is a minimizer of $\hat q$.
If $\hat A = S^T AS$ is diagonal, then applying the coordinate descent Algorithm 4 to $\hat q$ would converge to $y^*$ in $\le n$ steps. $\hat A$ is diagonal if and only if
$$s_i^T As_j = 0 \quad\text{for all } i \neq j,$$
where $S = \begin{pmatrix} s_0 & s_1 & \cdots & s_{n-1} \end{pmatrix}$. Line 5 of Algorithm 4 (in the transformed variables) is
$$y_{i+1} = y_i + \alpha_i e_{i+1}.$$
Multiplying by $S$ and using $Sy = x$ leads to
$$x_{i+1} = Sy_{i+1} = S(y_i + \alpha_i e_{i+1}) = Sy_i + \alpha_i Se_{i+1} = x_i + \alpha_i s_i$$
Definition 2.7 (A-conjugate directions)
A set of nonzero vectors $\{s_0, s_1, \ldots, s_l\}$ is said to be conjugate with respect to the symmetric positive-definite matrix $A$ (A-conjugate for short) if
$$s_i^T As_j = 0 \quad\text{for all } i \neq j.$$
We want to find A-conjugate vectors. The eigenvectors of $A$ are A-conjugate, since the spectral decomposition gives
$$A = V\Lambda V^T \iff V^T AV = \Lambda \quad\text{(diagonal matrix of eigenvalues)}$$
Computing the spectral decomposition is too expensive for large-scale problems; we want to find A-conjugate vectors cheaply! Let us look at a general algorithm.
Algorithm 5: General A-conjugate direction algorithm
1: input symmetric positive-definite matrix $A \in \mathbb{R}^{n\times n}$, and vectors $x_0$ and $b \in \mathbb{R}^n$
2: Set $r_0 \leftarrow Ax_0 - b$ and $k \leftarrow 0$.
3: Choose direction $s_0$.
4: while $r_k \neq 0$ do
5: Compute $\alpha_k \leftarrow \operatorname*{argmin}_{\alpha\in\mathbb{R}} q(x_k + \alpha s_k)$
6: Set $x_{k+1} \leftarrow x_k + \alpha_k s_k$.
7: Set $r_{k+1} \leftarrow Ax_{k+1} - b$.
8: Pick a new A-conjugate direction $s_{k+1}$.
9: Set $k \leftarrow k + 1$.
10: end while
How do we compute αk? How do we compute the A-conjugate directions efficiently?
Theorem 2.8 (general A-conjugate algorithm converges)
For any $x_0 \in \mathbb{R}^n$, the general A-conjugate direction Algorithm 5 computes the solution $x^*$ of the system $Ax = b$ in at most $n$ steps.

Proof: Theorem 5.1 in Nocedal and Wright.
Exercise: show that the $\alpha_k$ that solves the minimization problem on line 5 of Algorithm 5 satisfies
$$\alpha_k = -\frac{s_k^T r_k}{s_k^T As_k}.$$
Before discussing how we compute the A-conjugate directions $\{s_i\}$, let us look at some of their properties, assuming we already have them.
Theorem 2.9 (Expanding subspace minimization)
Let $x_0 \in \mathbb{R}^n$ be any starting point and suppose that $\{x_k\}$ is the sequence of iterates generated by the A-conjugate direction Algorithm 5 for a set of A-conjugate directions $\{s_k\}$. Then
$$r_k^T s_i = 0 \quad\text{for all } i = 0, \ldots, k-1$$
and $x_k$ is the unique minimizer of $q(x) = \tfrac{1}{2}x^T Ax - b^T x$ over the set
$$\{x : x = x_0 + \mathrm{span}(s_0, s_1, \ldots, s_{k-1})\}$$
Proof: See Theorem 5.2 in Nocedal and Wright.
Question: If $\{s_i\}_{i=0}^{k-1}$ are A-conjugate, how do we compute $s_k$ to be A-conjugate?

Answer: Theorem 2.9 shows that $r_k$ is orthogonal to the previously explored space. To keep it simple, and hopefully cheap, let us try
$$s_k = -r_k + \beta s_{k-1} \quad\text{for some } \beta. \tag{5}$$
If this is going to work, then the A-conjugacy condition and (5) require that
$$0 = s_{k-1}^T As_k = s_{k-1}^T A(-r_k + \beta s_{k-1}) = -s_{k-1}^T Ar_k + \beta s_{k-1}^T As_{k-1}$$
and solving for $\beta$ gives
$$\beta = \frac{s_{k-1}^T Ar_k}{s_{k-1}^T As_{k-1}}.$$
With this choice of $\beta$, the vector $s_k$ is A-conjugate! ...but how do we get started? Choosing $s_0 = -r_0$, i.e., a steepest-descent direction, makes sense.
This gives us a complete algorithm: the linear CG algorithm!
Algorithm 6: Preliminary linear CG algorithm
1: input symmetric positive-definite matrix $A \in \mathbb{R}^{n\times n}$, and vectors $x_0$ and $b \in \mathbb{R}^n$
2: Set $r_0 \leftarrow Ax_0 - b$, $s_0 \leftarrow -r_0$, and $k \leftarrow 0$.
3: while $r_k \neq 0$ do
4: Set $\alpha_k \leftarrow -(s_k^T r_k)/(s_k^T As_k)$.
5: Set $x_{k+1} \leftarrow x_k + \alpha_k s_k$.
6: Set $r_{k+1} \leftarrow Ax_{k+1} - b$.
7: Set $\beta_{k+1} \leftarrow (s_k^T Ar_{k+1})/(s_k^T As_k)$.
8: Set $s_{k+1} \leftarrow -r_{k+1} + \beta_{k+1}s_k$.
9: Set $k \leftarrow k + 1$.
10: end while
This preliminary version requires two matrix-vector products per iteration. By using some more “tricks” and algebra, we can reduce this to a single matrix-vector multiplication per iteration.
Algorithm 7: Linear CG algorithm
1: input symmetric positive-definite matrix $A \in \mathbb{R}^{n\times n}$, and vectors $x_0$ and $b \in \mathbb{R}^n$
2: Set $r_0 \leftarrow Ax_0 - b$, $s_0 \leftarrow -r_0$, and $k \leftarrow 0$.
3: while $r_k \neq 0$ do
4: Set $\alpha_k \leftarrow (r_k^T r_k)/(s_k^T As_k)$.
5: Set $x_{k+1} \leftarrow x_k + \alpha_k s_k$.
6: Set $r_{k+1} \leftarrow r_k + \alpha_k As_k$.
7: Set $\beta_{k+1} \leftarrow (r_{k+1}^T r_{k+1})/(r_k^T r_k)$.
8: Set $s_{k+1} \leftarrow -r_{k+1} + \beta_{k+1}s_k$.
9: Set $k \leftarrow k + 1$.
10: end while
- only requires the single matrix-vector multiplication $As_k$
- more practical to use a stopping condition like $\|r_k\| \le 10^{-8}\max(1, \|r_0\|_2)$
- required computation per iteration: 2 inner products, 3 vector sums, and 1 matrix-vector multiplication
- ideal for large (sparse) matrices $A$
- converges quickly if $\mathrm{cond}(A)$ is small or the eigenvalues are “clustered”
  - convergence can be accelerated by using preconditioning
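A direct NumPy transcription of Algorithm 7 with the practical stopping rule above (function name ours):

```python
import numpy as np

def linear_cg(A, b, x0, tol=1e-8, max_iter=None):
    """Algorithm 7: linear conjugate gradients for Ax = b, A symmetric PD."""
    x = np.asarray(x0, dtype=float)
    r = A @ x - b
    s = -r
    stop = tol * max(1.0, np.linalg.norm(r))
    for _ in range(max_iter or len(b)):
        if np.linalg.norm(r) <= stop:
            break
        As = A @ s                          # the single matrix-vector product
        alpha = (r @ r) / (s @ As)
        x = x + alpha * s
        r_new = r + alpha * As
        beta = (r_new @ r_new) / (r @ r)
        s = -r_new + beta * s
        r = r_new
    return x
```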
We want to use CG to compute an approximate solution $p$ to
$$\min_{p\in\mathbb{R}^n} f + g^T p + \tfrac{1}{2}p^T Hp$$
where $H$ may be indefinite (even singular), and we must ensure that $p$ is a descent direction.
Algorithm 8: Newton CG algorithm for computing descent directions
1: input symmetric matrix $H \in \mathbb{R}^{n\times n}$ and vector $g$
2: Set $p_0 = 0$, $r_0 \leftarrow g$, $s_0 \leftarrow -g$, and $k \leftarrow 0$.
3: while $r_k \neq 0$ do
4: if $s_k^T Hs_k > 0$ then
5: Set $\alpha_k \leftarrow (r_k^T r_k)/(s_k^T Hs_k)$.
6: else
7: if $k = 0$, then return $p_k = -g$ else return $p_k$ end if
8: end if
9: Set $p_{k+1} \leftarrow p_k + \alpha_k s_k$.
10: Set $r_{k+1} \leftarrow r_k + \alpha_k Hs_k$.
11: Set $\beta_{k+1} \leftarrow (r_{k+1}^T r_{k+1})/(r_k^T r_k)$.
12: Set $s_{k+1} \leftarrow -r_{k+1} + \beta_{k+1}s_k$.
13: Set $k \leftarrow k + 1$.
14: end while
15: return $p_k$
We made the replacements $x \leftarrow p$, $A \leftarrow H$, and $b \leftarrow -g$ in Algorithm 7.
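A sketch of Algorithm 8 under the same conventions (function name ours):

```python
import numpy as np

def newton_cg_direction(H, g, tol=1e-8, max_iter=None):
    """Algorithm 8: CG on H p = -g that exits on nonpositive curvature."""
    p = np.zeros_like(g, dtype=float)
    r = np.array(g, dtype=float)
    s = -r.copy()
    for k in range(max_iter or len(g)):
        if np.linalg.norm(r) <= tol:
            break
        Hs = H @ s
        if s @ Hs <= 0:                             # nonpositive curvature
            return -np.asarray(g, dtype=float) if k == 0 else p  # line 7
        alpha = (r @ r) / (s @ Hs)
        p = p + alpha * s
        r_new = r + alpha * Hs
        beta = (r_new @ r_new) / (r @ r)
        s = -r_new + beta * s
        r = r_new
    return p
```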
Theorem 2.10 (Newton CG algorithm computes descent directions)
If $g \neq 0$ and Algorithm 8 terminates on iterate $K$, then every iterate $p_k$ of the algorithm satisfies $p_k^T g < 0$ for $k \le K$.

Proof:
Case 1: Algorithm 8 terminates on iteration $K = 0$. It is immediate that $g^T p_0 = -g^T g = -\|g\|_2^2 < 0$ since $g \neq 0$.

Case 2: Algorithm 8 terminates on iteration $K \ge 1$. First observe from $p_0 = 0$, $s_0 = -g$, $r_0 = g$, $g \neq 0$, and lines 9 and 10 of Algorithm 8 that
$$g^T p_1 = g^T(p_0 + \alpha_0 s_0) = \alpha_0 g^T s_0 = \frac{r_0^T r_0}{s_0^T Hs_0}\,g^T s_0 = -\frac{\|g\|_2^4}{g^T Hg} < 0. \tag{6}$$
By unrolling the “p” update, using the A-conjugacy of $\{s_i\}$, and the definition of $r_k$, we get
$$s_k^T r_k = s_k^T(Hp_k + g) = s_k^T g + s_k^T H\sum_{j=0}^{k-1}\alpha_j s_j = s_k^T g.$$
The previous line, lines 9 and 5 of Algorithm 8, and the fact that $s_k^T r_k = -\|r_k\|_2^2$ give
$$g^T p_{k+1} = g^T p_k + \alpha_k g^T s_k = g^T p_k + \alpha_k s_k^T r_k = g^T p_k + \frac{r_k^T r_k}{s_k^T Hs_k}\,s_k^T r_k = g^T p_k - \frac{\|r_k\|_2^4}{s_k^T Hs_k} < g^T p_k.$$
Combining the previous inequality and (6) results in
$$g^T p_{k+1} < g^T p_k < \cdots < g^T p_1 = -\frac{\|g\|_2^4}{g^T Hg} < 0.$$

Algorithm 9: Backtracking
1: input $x_k$, $p_k$
2: Choose $\alpha_{\text{init}} > 0$ and $\tau \in (0,1)$
3: Set $\alpha^{(0)} = \alpha_{\text{init}}$ and $l = 0$
4: until $f(x_k + \alpha^{(l)} p_k)$ “$<$” $f(x_k)$
5: Set $\alpha^{(l+1)} = \tau\,\alpha^{(l)}$
6: Set $l \leftarrow l + 1$
7: end until
8: return $\alpha_k = \alpha^{(l)}$
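A minimal Python sketch of Algorithm 9 (function name ours); the comments that follow apply to it verbatim:

```python
def backtracking(f, xk, pk, fk, alpha_init=1.0, tau=0.5, max_iter=60):
    """Algorithm 9: shrink alpha until the simple decrease test passes.

    The bare "<" test is deliberately weak; the slides that follow tighten it
    to the Armijo condition f(x + alpha*p) <= f(x) + c1*alpha*(g @ p).
    """
    alpha = alpha_init
    for _ in range(max_iter):
        if f(xk + alpha * pk) < fk:     # the "<" test from line 4
            return alpha
        alpha *= tau                    # line 5: alpha^(l+1) = tau * alpha^(l)
    return alpha
```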
- typical choices (not always): $\alpha_{\text{init}} = 1$ and $\tau = 1/2$
- this prevents the step from getting too small
- it does not prevent the step from being too long
- the decrease in $f$ may not be proportional to the directional derivative
- we need to tighten the requirement that