L. Vandenberghe, ECE236C (Spring 2020)

15. Quasi-Newton methods

• variable metric methods
• quasi-Newton methods
• BFGS update
• limited-memory quasi-Newton methods

Newton method for unconstrained minimization

minimize f(x)    (f convex, twice continuously differentiable)

Newton method

x_{k+1} = x_k - t_k \nabla^2 f(x_k)^{-1} \nabla f(x_k)

• advantages: fast convergence, robustness, affine invariance
• disadvantages: requires second derivatives and the solution of a linear equation; can be too expensive for large-scale applications
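
As a concrete illustration of the iteration above, here is a minimal sketch of a damped Newton step with a backtracking line search; the test function, the Armijo constant, and the stopping tolerance are illustrative choices, not part of the lecture.

```python
# Minimal sketch of the damped Newton iteration x+ = x - t * (Hessian)^{-1} * gradient,
# shown on a small smooth convex test function (names and constants are illustrative).
import numpy as np

def f(x):        # f(x) = log(exp(x1) + exp(-x1)) + x2^2
    return np.logaddexp(x[0], -x[0]) + x[1] ** 2

def grad(x):
    return np.array([np.tanh(x[0]), 2.0 * x[1]])

def hess(x):
    return np.diag([1.0 / np.cosh(x[0]) ** 2, 2.0])

x = np.array([2.0, 1.0])
for k in range(20):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    dx = -np.linalg.solve(hess(x), g)                  # Newton direction
    t = 1.0
    while f(x + t * dx) > f(x) + 0.25 * t * g @ dx:    # backtracking line search
        t *= 0.5
    x = x + t * dx
print(x)    # approaches the minimizer (0, 0)
```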

Variable metric methods

x_{k+1} = x_k - t_k H_k^{-1} \nabla f(x_k)

the positive definite matrix H_k is an approximation of the Hessian at x_k, chosen to:

• avoid calculation of second derivatives
• simplify computation of the search direction

‘Variable metric’ interpretation (236B, lecture 10, page 11)

\Delta x = -H^{-1} \nabla f(x)

is the steepest descent direction at x for the quadratic norm

\|z\|_H = (z^T H z)^{1/2}

Quasi-Newton methods

given: starting point x_0 \in dom f, H_0 \succ 0

for k = 0, 1, ...

1. compute quasi-Newton direction \Delta x_k = -H_k^{-1} \nabla f(x_k)
2. determine step size t_k (e.g., by line search)
3. compute x_{k+1} = x_k + t_k \Delta x_k
4. compute H_{k+1}

• different update rules exist for H_{k+1} in step 4
• can also propagate H_k^{-1} or a factorization of H_k to simplify the calculation of \Delta x_k (a sketch of the full loop follows below)
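
A minimal sketch of this generic loop, with the update rule of step 4 supplied as a function; the trivial update below (keeping H_k^{-1} = I, i.e., gradient descent) and the small quadratic test problem are placeholder assumptions, and concrete updates follow on the next pages.

```python
# Sketch of steps 1-4 above; the update rule for H_{k+1}^{-1} is passed in as a function.
import numpy as np

def quasi_newton(f, grad, x0, update_H_inv, iters=200, tol=1e-8):
    x = x0.copy()
    H_inv = np.eye(len(x0))                                    # H_0 = I
    for k in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = -H_inv @ g                                        # step 1: quasi-Newton direction
        t = 1.0
        while f(x + t * dx) > f(x) + 1e-4 * t * g @ dx:        # step 2: backtracking line search
            t *= 0.5
        x_new = x + t * dx                                     # step 3: take the step
        H_inv = update_H_inv(H_inv, x_new - x, grad(x_new) - g)   # step 4: update rule
        x = x_new
    return x

identity_update = lambda H_inv, s, y: H_inv                    # placeholder: gradient descent

Q = np.diag([1.0, 4.0])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
print(quasi_newton(f, grad, np.array([1.0, 1.0]), identity_update))   # near (0, 0)
```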

Broyden–Fletcher–Goldfarb–Shanno (BFGS) update

BFGS update

H_{k+1} = H_k + \frac{y y^T}{y^T s} - \frac{H_k s s^T H_k}{s^T H_k s}

where s = x_{k+1} - x_k and y = \nabla f(x_{k+1}) - \nabla f(x_k)

Inverse update

H_{k+1}^{-1} = \left(I - \frac{s y^T}{y^T s}\right) H_k^{-1} \left(I - \frac{y s^T}{y^T s}\right) + \frac{s s^T}{y^T s}

• note that y^T s > 0 for strictly convex f; see page 1.8
• cost of the update or inverse update is O(n^2) operations
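
The two formulas above can be coded directly; the sketch below, with illustrative function names and random test data, also checks that the update and the inverse update agree.

```python
# Direct implementation of the BFGS update and its inverse form.  The matrix products
# are written out for clarity; expanding the rank-two terms gives the O(n^2) version.
import numpy as np

def bfgs_update(H, s, y):
    """H_{k+1} = H_k + y y^T / (y^T s) - H_k s s^T H_k / (s^T H_k s)."""
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

def bfgs_inverse_update(H_inv, s, y):
    """H_{k+1}^{-1} = (I - s y^T/(y^T s)) H_k^{-1} (I - y s^T/(y^T s)) + s s^T/(y^T s)."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H_inv @ V.T + rho * np.outer(s, s)

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n)); H = B @ B.T + np.eye(n)       # random H_k > 0
s = rng.standard_normal(n); y = rng.standard_normal(n)
if y @ s < 0:
    y = -y                                                     # ensure y^T s > 0
H1 = bfgs_update(H, s, y)
print(np.allclose(np.linalg.inv(H1), bfgs_inverse_update(np.linalg.inv(H), s, y)))   # True
```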

Positive definiteness

• if y^T s > 0, the BFGS update preserves positive definiteness of H_k
• this ensures that \Delta x = -H_k^{-1} \nabla f(x_k) is a descent direction

Proof: from the inverse update formula,

v^T H_{k+1}^{-1} v = \left(v - \frac{s^T v}{s^T y}\, y\right)^T H_k^{-1} \left(v - \frac{s^T v}{s^T y}\, y\right) + \frac{(s^T v)^2}{y^T s}

• if H_k \succ 0, both terms are nonnegative for all v
• the second term is zero only if s^T v = 0; then the first term is zero only if v = 0
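
A quick numerical spot-check of this claim on random data with y^T s > 0; the sizes and the random seed are arbitrary.

```python
# Spot-check: with y^T s > 0, the inverse update maps a positive definite H_k^{-1}
# to a positive definite H_{k+1}^{-1}.
import numpy as np

rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n)); H_inv = B @ B.T + np.eye(n)   # H_k^{-1} > 0
s = rng.standard_normal(n); y = rng.standard_normal(n)
if y @ s < 0:
    y = -y                                                     # enforce y^T s > 0

V = np.eye(n) - np.outer(s, y) / (y @ s)
H_inv_new = V @ H_inv @ V.T + np.outer(s, s) / (y @ s)
print(np.linalg.eigvalsh(H_inv_new).min() > 0)                 # True
```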

Secant condition

the BFGS update satisfies the secant condition

H_{k+1} s = y

where s = x_{k+1} - x_k and y = \nabla f(x_{k+1}) - \nabla f(x_k)

Interpretation: we define a quadratic approximation of f around x_{k+1}

\tilde f(x) = f(x_{k+1}) + \nabla f(x_{k+1})^T (x - x_{k+1}) + \frac{1}{2} (x - x_{k+1})^T H_{k+1} (x - x_{k+1})

• by construction, \nabla \tilde f(x_{k+1}) = \nabla f(x_{k+1})
• the secant condition implies that also \nabla \tilde f(x_k) = \nabla f(x_k):

\nabla \tilde f(x_k) = \nabla f(x_{k+1}) + H_{k+1}(x_k - x_{k+1}) = \nabla f(x_k)
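
A similar spot-check that the BFGS update of a random positive definite H_k indeed satisfies H_{k+1} s = y; again the data and seed are arbitrary.

```python
# Spot-check of the secant condition H_{k+1} s = y for the BFGS update.
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n)); H = A @ A.T + np.eye(n)
s = rng.standard_normal(n); y = rng.standard_normal(n)
if y @ s < 0:
    y = -y

Hs = H @ s
H_new = H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)
print(np.allclose(H_new @ s, y))    # True
```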

Secant method

for f : R \to R, BFGS with unit step size gives the secant method

x_{k+1} = x_k - \frac{f'(x_k)}{H_k}, \qquad H_k = \frac{f'(x_k) - f'(x_{k-1})}{x_k - x_{k-1}}

[figure: f'(x) and its secant approximation \tilde f'(x), with the points x_{k-1}, x_k, x_{k+1}]
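
A small sketch of this scalar iteration; the choice f'(x) = x^3 - 2 (so the iterates approach 2^{1/3}) and the starting points are illustrative.

```python
# Scalar secant iteration applied to f'(x) = x^3 - 2, i.e., minimizing f(x) = x^4/4 - 2x.
def fprime(x):
    return x ** 3 - 2.0

x_prev, x = 0.5, 1.5            # two starting points
for k in range(20):
    if x == x_prev or abs(fprime(x)) < 1e-14:
        break
    Hk = (fprime(x) - fprime(x_prev)) / (x - x_prev)   # secant slope
    x_prev, x = x, x - fprime(x) / Hk                  # x_{k+1} = x_k - f'(x_k) / H_k
print(x, 2 ** (1.0 / 3.0))      # both about 1.259921
```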

Convergence

Global result: if f is strongly convex, BFGS with backtracking line search (EE236B, lecture 10-6) converges from any x_0, H_0 \succ 0

Local convergence: if f is strongly convex and \nabla^2 f(x) is Lipschitz continuous, local convergence is superlinear: for sufficiently large k,

\|x_{k+1} - x^\star\|_2 \le c_k \|x_k - x^\star\|_2 \qquad \text{where } c_k \to 0

(cf. the quadratic local convergence of Newton's method)

Example

minimize \; c^T x - \sum_{i=1}^{m} \log(b_i - a_i^T x)

with n = 100, m = 500

[figure: f(x_k) - f^\star versus iteration k, from 10^3 down to 10^{-12}; Newton converges in roughly 10 iterations, BFGS in roughly 150]

• cost per Newton iteration: O(n^3) plus the cost of computing \nabla^2 f(x)
• cost per BFGS iteration: O(n^2)
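
A hedged re-creation of this experiment on a smaller random instance (n = 50, m = 200 rather than n = 100, m = 500), using the inverse BFGS update and a backtracking line search that rejects points outside dom f; the data generation and all parameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 50, 200
A = rng.standard_normal((m, n))
b = rng.uniform(1.0, 2.0, m)              # b > 0, so x = 0 is feasible
c = rng.standard_normal(n)

def f(x):
    r = b - A @ x
    return np.inf if np.any(r <= 0) else c @ x - np.log(r).sum()

def grad(x):
    return c + A.T @ (1.0 / (b - A @ x))

x = np.zeros(n)
H_inv = np.eye(n)
g = grad(x)
for k in range(500):
    if np.linalg.norm(g) < 1e-8:
        break
    dx = -H_inv @ g
    t = 1.0
    while f(x + t * dx) > f(x) + 1e-4 * t * g @ dx:   # infeasible points give f = inf
        t *= 0.5
    s = t * dx
    x = x + s
    g_new = grad(x)
    y = g_new - g                                     # y^T s > 0 since f is strictly convex
    rho = 1.0 / (y @ s)
    V = np.eye(n) - rho * np.outer(s, y)
    H_inv = V @ H_inv @ V.T + rho * np.outer(s, s)    # inverse BFGS update
    g = g_new
print(k, np.linalg.norm(g))               # iteration count and final gradient norm
```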

Square root BFGS update

to improve numerical stability, propagate H_k in factored form H_k = L_k L_k^T

• if H_k = L_k L_k^T, then H_{k+1} = L_{k+1} L_{k+1}^T with

L_{k+1} = L_k \left(I + \frac{(\alpha \tilde y - \tilde s)\, \tilde s^T}{\tilde s^T \tilde s}\right)

where

\tilde y = L_k^{-1} y, \qquad \tilde s = L_k^T s, \qquad \alpha = \left(\frac{\tilde s^T \tilde s}{y^T s}\right)^{1/2}

• if L_k is triangular, the cost of reducing L_{k+1} to triangular form is O(n^2)
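
A sketch checking the factored update numerically: starting from H_k = L_k L_k^T, the product L_{k+1} L_{k+1}^T should reproduce the ordinary BFGS update of H_k; random test data, sizes, and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
B = rng.standard_normal((n, n))
L = np.linalg.cholesky(B @ B.T + np.eye(n))            # H_k = L L^T
s = rng.standard_normal(n); y = rng.standard_normal(n)
if y @ s < 0:
    y = -y                                             # y^T s > 0

y_t = np.linalg.solve(L, y)                            # y~ = L^{-1} y
s_t = L.T @ s                                          # s~ = L^T s
alpha = np.sqrt(s_t @ s_t / (y @ s))
L_new = L @ (np.eye(n) + np.outer(alpha * y_t - s_t, s_t) / (s_t @ s_t))
# note: L_new is no longer triangular; the slide points out it can be
# reduced back to triangular form in O(n^2) operations

H = L @ L.T
Hs = H @ s
H_bfgs = H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)
print(np.allclose(L_new @ L_new.T, H_bfgs))            # True
```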

Optimality of BFGS update

X = H_{k+1} solves the problem

minimize    tr(H_k^{-1} X) - \log\det(H_k^{-1} X) - n
subject to  Xs = y

• the cost function is nonnegative, and equal to zero only if X = H_k
• it is also known as the relative entropy between the densities N(0, X) and N(0, H_k)
• the BFGS update is a least-change secant update
• the optimality result follows from the KKT conditions: X = H_{k+1} satisfies

X^{-1} = H_k^{-1} - \frac{1}{2}(s \nu^T + \nu s^T), \qquad Xs = y, \qquad X \succ 0

with

\nu = \frac{1}{y^T s} \left( 2 H_k^{-1} y - \left(1 + \frac{y^T H_k^{-1} y}{y^T s}\right) s \right)
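
A numerical spot-check of both claims: the multiplier \nu above satisfies the stationarity condition, and feasible perturbations of X = H_{k+1} (constructed so that Xs = y still holds) do not decrease the objective; the perturbation construction and scaling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
B = rng.standard_normal((n, n)); H = B @ B.T + np.eye(n)        # H_k
s = rng.standard_normal(n); y = rng.standard_normal(n)
if y @ s < 0:
    y = -y

Hs = H @ s
X = H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)   # BFGS update H_{k+1}

def obj(Z):
    M = np.linalg.solve(H, Z)                                    # H_k^{-1} Z
    return np.trace(M) - np.log(np.linalg.det(M)) - n

# stationarity: X^{-1} = H_k^{-1} - (1/2)(s nu^T + nu s^T) with nu as above
Hinv_y = np.linalg.solve(H, y)
nu = (2.0 * Hinv_y - (1.0 + y @ Hinv_y / (y @ s)) * s) / (y @ s)
print(np.allclose(np.linalg.inv(X), np.linalg.inv(H) - 0.5 * (np.outer(s, nu) + np.outer(nu, s))))

# feasible perturbations X + t P E P (with P s = 0) keep Xs = y; the objective only grows
P = np.eye(n) - np.outer(s, s) / (s @ s)
for _ in range(3):
    E = rng.standard_normal((n, n)); E = E + E.T
    t = 0.1 * np.linalg.eigvalsh(X).min() / np.linalg.norm(E, 2)   # keeps X + tPEP > 0
    print(obj(X + t * P @ E @ P) >= obj(X))                        # True
```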

Davidon–Fletcher–Powell (DFP) update

switch H_k and X in the objective on the previous page:

minimize    tr(H_k X^{-1}) - \log\det(H_k X^{-1}) - n
subject to  Xs = y

• minimizes the relative entropy between N(0, H_k) and N(0, X)
• the problem is convex in X^{-1} (with the constraint written as s = X^{-1} y)
• the solution is the 'dual' of the BFGS formula:

H_{k+1} = \left(I - \frac{y s^T}{s^T y}\right) H_k \left(I - \frac{s y^T}{s^T y}\right) + \frac{y y^T}{s^T y}

(known as the DFP update)

the DFP update predates the BFGS update, but is less often used
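
A small sketch of the DFP formula, with a check that it also satisfies the secant condition H_{k+1} s = y; the random test data and seed are arbitrary.

```python
import numpy as np

def dfp_update(H, s, y):
    """H_{k+1} = (I - y s^T/(s^T y)) H_k (I - s y^T/(s^T y)) + y y^T/(s^T y)."""
    V = np.eye(len(s)) - np.outer(y, s) / (s @ y)
    return V @ H @ V.T + np.outer(y, y) / (s @ y)

rng = np.random.default_rng(6)
n = 5
B = rng.standard_normal((n, n)); H = B @ B.T + np.eye(n)
s = rng.standard_normal(n); y = rng.standard_normal(n)
if y @ s < 0:
    y = -y

print(np.allclose(dfp_update(H, s, y) @ s, y))    # True: secant condition holds
```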

Limited-memory quasi-Newton methods

the main disadvantage of quasi-Newton methods is the need to store H_k, H_k^{-1}, or L_k

Limited-memory BFGS (L-BFGS): do not store H_k^{-1} explicitly

• instead we store up to m (e.g., m = 30) values of

s_j = x_{j+1} - x_j, \qquad y_j = \nabla f(x_{j+1}) - \nabla f(x_j)

• we evaluate \Delta x_k = -H_k^{-1} \nabla f(x_k) recursively (see the sketch below), using

H_{j+1}^{-1} = \left(I - \frac{s_j y_j^T}{y_j^T s_j}\right) H_j^{-1} \left(I - \frac{y_j s_j^T}{y_j^T s_j}\right) + \frac{s_j s_j^T}{y_j^T s_j}

for j = k-1, \ldots, k-m, assuming, for example, H_{k-m} = I

• an alternative is to restart after m iterations
• cost per iteration is O(nm), storage is O(nm)
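
In practice the recursion above is applied without forming any matrices; the sketch below uses the standard two-loop recursion (Nocedal and Wright, chapter 7) and checks it against the explicit H_{j+1}^{-1} recursion; the sizes, data, and value of m are illustrative.

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Return -H_k^{-1} g from stored pairs (oldest to newest), with H_{k-m} = I."""
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):        # j = k-1, ..., k-m
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q = q - a * y
    r = q                                                       # apply H_{k-m} = I
    for s, y, a in zip(s_list, y_list, reversed(alphas)):       # j = k-m, ..., k-1
        beta = (y @ r) / (y @ s)
        r = r + (a - beta) * s
    return -r

rng = np.random.default_rng(7)
n, m = 8, 3
s_list = [rng.standard_normal(n) for _ in range(m)]
y_list = [s + 0.1 * rng.standard_normal(n) for s in s_list]     # so that y_j^T s_j > 0
g = rng.standard_normal(n)

H_inv = np.eye(n)
for s, y in zip(s_list, y_list):                                # explicit recursion above
    V = np.eye(n) - np.outer(s, y) / (y @ s)
    H_inv = V @ H_inv @ V.T + np.outer(s, s) / (y @ s)
print(np.allclose(lbfgs_direction(g, s_list, y_list), -H_inv @ g))   # True
```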

Interpretation of CG as restarted BFGS method

first two iterations of BFGS (using the BFGS update above) if H_0 = I:

x_1 = x_0 - t_0 \nabla f(x_0), \qquad x_2 = x_1 - t_1 H_1^{-1} \nabla f(x_1)

where H_1 is computed from s = x_1 - x_0 and y = \nabla f(x_1) - \nabla f(x_0) via

H_1^{-1} = I + \left(1 + \frac{y^T y}{s^T y}\right) \frac{s s^T}{y^T s} - \frac{y s^T + s y^T}{y^T s}

• if t_0 is determined by exact line search, then \nabla f(x_1)^T s = 0
• the quasi-Newton step in the second iteration then simplifies to

-H_1^{-1} \nabla f(x_1) = -\nabla f(x_1) + \frac{y^T \nabla f(x_1)}{y^T s}\, s

• this is the Hestenes–Stiefel conjugate gradient update
• nonlinear CG can be interpreted as L-BFGS with m = 1
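
A numerical spot-check of this derivation on a random quadratic: after an exact first line search with H_0 = I, the BFGS step coincides with the Hestenes–Stiefel direction; the problem data and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 10
B = rng.standard_normal((n, n)); Q = B @ B.T + np.eye(n)    # f(x) = (1/2) x^T Q x
grad = lambda x: Q @ x

x0 = rng.standard_normal(n)
g0 = grad(x0)
t0 = (g0 @ g0) / (g0 @ Q @ g0)             # exact line search along -g0
x1 = x0 - t0 * g0
g1, s = grad(x1), x1 - x0
y = g1 - g0

H1_inv = (np.eye(n)                                          # inverse BFGS update of H_0 = I
          + (1.0 + y @ y / (s @ y)) * np.outer(s, s) / (y @ s)
          - (np.outer(y, s) + np.outer(s, y)) / (y @ s))

d_bfgs = -H1_inv @ g1
d_hs = -g1 + (y @ g1) / (y @ s) * s                          # Hestenes-Stiefel direction
print(np.allclose(d_bfgs, d_hs), abs(g1 @ s) < 1e-8)         # True True
```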

References

• J. E. Dennis, Jr. and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1996), chapter 9.
• C. T. Kelley, Iterative Methods for Optimization (1999), chapter 4.
• J. Nocedal and S. J. Wright, Numerical Optimization (2006), chapter 6 and section 7.2.
