
Lecture 2

C7B Optimization Hilary 2011 A. Zisserman

• Optimization for general functions

• Downhill simplex (amoeba)

• Newton’s method

• Quasi-Newton methods

• Least-Squares and Gauss-Newton methods

Review – unconstrained optimization

• Line minimization

• Minimizing quadratic functions
  • line search
  • problem with steepest descent

Assumptions:
• single minimum
• positive definite Hessian

Optimization for General Functions

$$f(x, y) = e^x \left(4x^2 + 2y^2 + 4xy + 2x + 1\right)$$

[Contour plot of $f(x, y)$]

Apply the methods developed for quadratic functions, using a quadratic Taylor series expansion.

Rosenbrock’s function

$$f(x, y) = 100\,(y - x^2)^2 + (1 - x)^2$$

[Contour plot of the Rosenbrock function; the minimum is at $(1, 1)$]

Steepest descent

• The 1D line minimization must be performed using one of the earlier methods (usually cubic polynomial interpolation)

[Contour plots: steepest descent on the Rosenbrock function, full view and zoomed detail]

• The zig-zag behaviour is clear in the zoomed view (100 iterations)
• The algorithm crawls down the valley
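A minimal Python sketch of steepest descent with an explicit line search on the Rosenbrock function. The start point, the use of numpy/scipy, and Brent's method for the 1D minimization (rather than cubic interpolation) are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rosen(v):
    x, y = v
    return 100 * (y - x**2)**2 + (1 - x)**2

def grad(v):
    x, y = v
    return np.array([-400 * x * (y - x**2) - 2 * (1 - x), 200 * (y - x**2)])

v = np.array([-1.0, 1.0])                      # assumed start point
for it in range(100):
    g = grad(v)
    if np.linalg.norm(g) < 1e-3:
        break
    d = -g / np.linalg.norm(g)                 # steepest-descent direction
    # explicit 1D line minimization along d (Brent's method here)
    alpha = minimize_scalar(lambda a: rosen(v + a * d)).x
    v = v + alpha * d
print(it, v)   # progress is slow: the path zig-zags down the valley
```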

Conjugate gradient

• Again, an explicit line minimization must be used at every step

[Contour plot: conjugate gradient descent on the Rosenbrock function; gradient < 1e-3 after 101 iterations]

• The algorithm converges in 101 iterations
• Far superior to steepest descent
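For comparison, nonlinear conjugate gradient is available off the shelf; a minimal sketch using SciPy (its built-in `rosen`/`rosen_der` are the same Rosenbrock function and gradient, though the iteration count will not exactly match the lecture's implementation):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Nonlinear conjugate gradient on the Rosenbrock function, with the analytic gradient
res = minimize(rosen, x0=np.array([-1.0, 1.0]), method='CG',
               jac=rosen_der, options={'gtol': 1e-3})
print(res.nit, res.x)   # iteration count and minimizer, approximately (1, 1)
```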

The downhill simplex (amoeba) algorithm

• Due to Nelder and Mead (1965)
• A direct method: only uses function evaluations (no derivatives)
• A simplex is the simplest polytope in N dimensions, with N+1 vertices, e.g.
  • 2D: triangle
  • 3D: tetrahedron
• Basic idea: move the simplex by reflections, expansions or contractions

[Figure: simplex moves – start, reflect, expand, contract]

One iteration of the simplex algorithm

• Reorder the points so that $f(x_{N+1}) > f(x_N) > \dots > f(x_1)$ (i.e. $x_{N+1}$ is the worst point).

• Generate a trial point $x_r$ by reflection
$$x_r = \bar{x} + \alpha(\bar{x} - x_{N+1})$$
where $\bar{x} = \left(\sum_i x_i\right)/(N+1)$ is the centroid and $\alpha > 0$. Compute $f(x_r)$; there are then 3 possibilities:

1. $f(x_1) < f(x_r) < f(x_N)$: then $x_r$ replaces the worst point $x_{N+1}$.

2. $f(x_r) < f(x_1)$: then $x_r$ is a new best point, so generate a further point by expansion
$$x_e = x_r + \beta(x_r - \bar{x})$$
where $\beta > 0$. If $f(x_e) < f(x_r)$ then $x_e$ replaces $x_{N+1}$, otherwise $x_r$ does.

3. $f(x_r) > f(x_N)$: then assume the polytope is too large and generate a new point by contraction
$$x_c = \bar{x} + \gamma(x_{N+1} - \bar{x})$$
where $\gamma$ ($0 < \gamma < 1$) is the contraction coefficient. If $f(x_c) < f(x_{N+1})$ then $x_c$ replaces $x_{N+1}$, otherwise the whole simplex is shrunk towards $x_1$.

Standard values are $\alpha = 1$, $\beta = 1$, $\gamma = 0.5$.
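A minimal Python sketch of one such iteration, following the steps above. The numpy usage, the start simplex, and the shrink fallback when the contraction also fails are assumptions beyond the slide.

```python
import numpy as np

def simplex_iteration(verts, f, alpha=1.0, beta=1.0, gamma=0.5):
    """One reflect / expand / contract step on an (N+1) x N array of vertices."""
    # Reorder so that f(verts[0]) <= ... <= f(verts[-1]); verts[-1] is the worst point
    verts = verts[np.argsort([f(v) for v in verts])]
    fv = np.array([f(v) for v in verts])
    xbar = verts.mean(axis=0)                 # centroid of all N+1 vertices
    xr = xbar + alpha * (xbar - verts[-1])    # reflection of the worst point
    fr = f(xr)
    if fv[0] <= fr <= fv[-2]:                 # case 1: neither best nor worst
        verts[-1] = xr
    elif fr < fv[0]:                          # case 2: new best point, try expansion
        xe = xr + beta * (xr - xbar)
        verts[-1] = xe if f(xe) < fr else xr
    else:                                     # case 3: reflection failed, contract
        xc = xbar + gamma * (verts[-1] - xbar)
        if f(xc) < fv[-1]:
            verts[-1] = xc
        else:                                 # contraction also failed: shrink towards best
            verts[1:] = verts[0] + 0.5 * (verts[1:] - verts[0])
    return verts

rosen = lambda v: 100 * (v[1] - v[0]**2)**2 + (1 - v[0])**2
verts = np.array([[-1.0, 1.0], [-0.8, 1.2], [-1.2, 0.8]])   # assumed start simplex
for _ in range(200):
    verts = simplex_iteration(verts, rosen)
print(min(verts, key=rosen))                  # best vertex so far
```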

Example

[Contour plot: path of the best vertex of the downhill simplex (Matlab fminsearch, 200 iterations) on the Rosenbrock function, with zoomed detail]

Example 2: contraction about a minimum

[Figure: sequences of simplexes showing reflections, and then contractions about the minimum]

Summary
• no derivatives required
• deals well with noise in the cost function
• is able to crawl out of some local minima (though, of course, can still get stuck)

Newton’s method

Expand $f(x)$ by its Taylor series about the point $x_n$:
$$f(x_n + \delta x) \approx f(x_n) + g_n^\top \delta x + \tfrac{1}{2}\, \delta x^\top H_n\, \delta x$$

where the gradient is the vector
$$g_n = \nabla f(x_n) = \left[\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_N}\right]^\top$$
and the Hessian is the symmetric matrix
$$H_n = H(x_n) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_1 \partial x_N} & \cdots & \frac{\partial^2 f}{\partial x_N^2} \end{bmatrix}$$

For a minimum we require that $\nabla f(x) = 0$, and so
$$\nabla f(x) = g_n + H_n \delta x = 0$$
with solution $\delta x = -H_n^{-1} g_n$. This gives the iterative update
$$x_{n+1} = x_n - H_n^{-1} g_n$$
Assume that $H_n$ is positive definite (all eigenvalues greater than zero).

$$x_{n+1} = x_n - H_n^{-1} g_n$$

• If $f(x)$ is quadratic, then the solution is found in one step.
• The method has quadratic convergence (as in the 1D case).
• The solution $\delta x = -H_n^{-1} g_n$ is guaranteed to be a downhill direction.
• Rather than jump straight to the minimum, it is better to perform a line minimization, which ensures global convergence:

$$x_{n+1} = x_n - \alpha_n H_n^{-1} g_n$$
• If $H = I$ then this reduces to steepest descent.
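A minimal Python sketch of this update on the Rosenbrock function, using the analytic gradient and Hessian derived later in this lecture. The start point, the simple backtracking line search (rather than an exact 1D minimization), and the steepest-descent fallback are assumptions for illustration.

```python
import numpy as np

def rosen(v):
    x, y = v
    return 100 * (y - x**2)**2 + (1 - x)**2

def grad(v):
    x, y = v
    return np.array([-400 * x * (y - x**2) - 2 * (1 - x), 200 * (y - x**2)])

def hess(v):
    x, y = v
    return np.array([[1200 * x**2 - 400 * y + 2, -400 * x],
                     [-400 * x, 200.0]])

v = np.array([-1.0, 1.0])                      # assumed start point
for it in range(50):
    g, H = grad(v), hess(v)
    if np.linalg.norm(g) < 1e-3:
        break
    d = -np.linalg.solve(H, g)                 # Newton direction  delta_x = -H^{-1} g
    if g @ d >= 0:                             # H not positive definite here:
        d = -g                                 # fall back to steepest descent
    alpha = 1.0                                # simple backtracking line search
    while rosen(v + alpha * d) > rosen(v) and alpha > 1e-8:
        alpha *= 0.5
    v = v + alpha * d
print(it, v)                                   # converges to approximately (1, 1)
```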

Newton’s method - example

[Contour plots: Newton method with line search on the Rosenbrock function, full view and zoom; gradient < 1e-3 after 15 iterations; ellipses show successive quadratic approximations]

• The algorithm converges in only 15 iterations, compared to 101 for conjugate gradients and 200 for the simplex
• However, the method requires computing the Hessian at each iteration – this is not always feasible

Quasi-Newton methods

• If the problem size is large and the Hessian matrix is dense, then it may be infeasible/inconvenient to compute it directly.

• Quasi-Newton methods avoid this problem by keeping a “rolling estimate” of $H(x)$, updated at each iteration using new gradient information.

• Common schemes are due to Broyden, Fletcher, Goldfarb and Shanno (BFGS), and also Davidon, Fletcher and Powell (DFP).

e.g. in 1D

First derivatives:
$$f'\!\left(x_0 + \tfrac{h}{2}\right) = \frac{f_1 - f_0}{h} \quad\text{and}\quad f'\!\left(x_0 - \tfrac{h}{2}\right) = \frac{f_0 - f_{-1}}{h}$$

Second derivative:
$$f''(x_0) = \frac{\frac{f_1 - f_0}{h} - \frac{f_0 - f_{-1}}{h}}{h} = \frac{f_1 - 2f_0 + f_{-1}}{h^2}$$

[Figure: $f(x)$ sampled at $x_{-1}$, $x_0$, $x_{+1}$, spaced $h$ apart]

For $H_{n+1}$, build an approximation from $H_n$, $g_n$, $g_{n+1}$, $x_n$, $x_{n+1}$.
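A quick numerical check of the second-derivative formula; the test function $f(x) = e^x \sin x$ and step size are assumptions for illustration.

```python
import math

f = lambda x: math.exp(x) * math.sin(x)
x0, h = 0.7, 1e-4
fd = (f(x0 + h) - 2 * f(x0) + f(x0 - h)) / h**2   # (f1 - 2 f0 + f_-1) / h^2
exact = 2 * math.exp(x0) * math.cos(x0)            # f''(x) = 2 e^x cos x
print(fd, exact)                                   # agree to several significant figures
```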

Quasi-Newton: BFGS

• Set $H_0 = I$.

• Update according to

$$H_{n+1} = H_n + \frac{q_n q_n^\top}{q_n^\top s_n} - \frac{(H_n s_n)(H_n s_n)^\top}{s_n^\top H_n s_n}$$
where
$$s_n = x_{n+1} - x_n, \qquad q_n = g_{n+1} - g_n$$

• The matrix itself is not stored, but rather represented compactly by a few stored vectors.

• The estimate $H_{n+1}$ is used to form a local quadratic approximation as before.
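A minimal numpy sketch of the update formula above. The full matrix is stored here for clarity; as noted, practical implementations keep only a few vectors (and more commonly update the inverse Hessian estimate).

```python
import numpy as np

def bfgs_update(H, x_new, x_old, g_new, g_old):
    """BFGS update of the Hessian estimate H_n -> H_{n+1} (direct form of the
    formula above; start from H_0 = I and apply after every step)."""
    s = x_new - x_old           # s_n = x_{n+1} - x_n
    q = g_new - g_old           # q_n = g_{n+1} - g_n
    Hs = H @ s
    return H + np.outer(q, q) / (q @ s) - np.outer(Hs, Hs) / (s @ Hs)
```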

Example

[Contour plot: BFGS quasi-Newton path on the Rosenbrock function]

• The method converges in 25 iterations, compared to 15 for the full-Newton method

Non-linear least squares

• It is very common in applications for a cost function $f(x)$ to be the sum of a large number of squared residuals

$$f(x) = \sum_{i=1}^{M} r_i^2$$

• If each residual depends non-linearly on the parameters $x$, then the minimization of $f(x)$ is a non-linear least squares problem.

• Examples in computer vision are maximum likelihood estimators of image relations (such as homographies and fundamental matrices) from point correspondences.

Example: aligning a 3D model to an image

Cost function

Transformation parameters:
• 3D rotation matrix $R$
• translation 3-vector $T = (T_x, T_y, T_z)^\top$

Image generation:
• rotate and translate the 3D model by $R$ and $T$
• project to generate the image $\hat{I}_{R,T}(x, y)$

[Figure: camera and 3D model geometry]

Non-linear least squares

$$f(x) = \sum_{i=1}^{M} r_i^2$$
The $M \times N$ Jacobian of the vector of residuals $r$ is defined as (assume $M > N$)
$$J(x) = \begin{bmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial r_M}{\partial x_1} & \cdots & \frac{\partial r_M}{\partial x_N} \end{bmatrix}$$
Consider
$$\frac{\partial}{\partial x_k} \sum_i r_i^2 = 2 \sum_i r_i \frac{\partial r_i}{\partial x_k}$$

Hence

$$\nabla f(x) = 2 J^\top r$$

For the Hessian we require
$$\frac{\partial^2}{\partial x_l \partial x_k} \sum_i r_i^2 = 2 \frac{\partial}{\partial x_l} \sum_i r_i \frac{\partial r_i}{\partial x_k} = 2 \sum_i \frac{\partial r_i}{\partial x_k} \frac{\partial r_i}{\partial x_l} + 2 \sum_i r_i \frac{\partial^2 r_i}{\partial x_k \partial x_l}$$
Hence
$$H(x) = 2 J^\top J + 2 \sum_{i=1}^{M} r_i R_i$$
where $R_i$ is the Hessian of the residual $r_i$.

• Note that the second-order term in the Hessian $H(x)$ is multiplied by the residuals $r_i$.

• In most problems, the residuals will typically be small.

• Also, at the minimum, the residuals will typically be distributed with mean = 0.

• For these reasons, the second-order term is often ignored, giving the Gauss-Newton approximation to the Hessian:

$$H(x) = 2 J^\top J$$

• Hence, explicit computation of the full Hessian can again be avoided.

Example – Gauss-Newton

The minimization of the Rosenbrock function

$$f(x, y) = 100(y - x^2)^2 + (1 - x)^2$$
can be written as a least-squares problem with residual vector

$$r = \begin{bmatrix} 10(y - x^2) \\ 1 - x \end{bmatrix}$$

$$J(x) = \begin{bmatrix} \frac{\partial r_1}{\partial x} & \frac{\partial r_1}{\partial y} \\ \frac{\partial r_2}{\partial x} & \frac{\partial r_2}{\partial y} \end{bmatrix} = \begin{bmatrix} -20x & 10 \\ -1 & 0 \end{bmatrix}$$
The true Hessian is
$$H(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 1200x^2 - 400y + 2 & -400x \\ -400x & 200 \end{bmatrix}$$
The Gauss-Newton approximation of the Hessian is

$$2 J^\top J = 2 \begin{bmatrix} -20x & -1 \\ 10 & 0 \end{bmatrix} \begin{bmatrix} -20x & 10 \\ -1 & 0 \end{bmatrix} = \begin{bmatrix} 800x^2 + 2 & -400x \\ -400x & 200 \end{bmatrix}$$
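A quick numpy check that the true Hessian above decomposes as $2J^\top J + 2\sum_i r_i R_i$; the test point is an assumption for illustration.

```python
import numpy as np

# Check H_true = 2 J^T J + 2 * sum_i r_i R_i at an arbitrary test point
x, y = -1.0, 2.0
r  = np.array([10 * (y - x**2), 1 - x])
J  = np.array([[-20 * x, 10.0], [-1.0, 0.0]])
R1 = np.array([[-20.0, 0.0], [0.0, 0.0]])     # Hessian of r_1 = 10(y - x^2)
R2 = np.zeros((2, 2))                          # Hessian of r_2 = 1 - x
H_true = np.array([[1200 * x**2 - 400 * y + 2, -400 * x],
                   [-400 * x, 200.0]])
print(np.allclose(2 * J.T @ J + 2 * (r[0] * R1 + r[1] * R2), H_true))   # True
```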

$$x_{n+1} = x_n - \alpha_n H_n^{-1} g_n \quad\text{with}\quad H_n = 2 J_n^\top J_n$$
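A minimal Python sketch of these Gauss-Newton iterations for the Rosenbrock residuals. The start point and the simple backtracking line search (rather than an exact 1D minimization) are assumptions for illustration.

```python
import numpy as np

def residuals(v):
    x, y = v
    return np.array([10 * (y - x**2), 1 - x])

def jacobian(v):
    x, y = v
    return np.array([[-20 * x, 10.0],
                     [-1.0, 0.0]])

v = np.array([-1.0, 1.0])                    # assumed start point
for it in range(100):
    r, J = residuals(v), jacobian(v)
    g = 2 * J.T @ r                          # gradient  2 J^T r
    if np.linalg.norm(g) < 1e-3:
        break
    H = 2 * J.T @ J                          # Gauss-Newton Hessian
    d = -np.linalg.solve(H, g)
    alpha = 1.0                              # simple backtracking line search
    while np.sum(residuals(v + alpha * d)**2) > np.sum(r**2) and alpha > 1e-8:
        alpha *= 0.5
    v = v + alpha * d
print(it, v)                                 # minimizer near (1, 1)
```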

[Contour plots: Gauss-Newton method with line search on the Rosenbrock function, full view and zoom; gradient < 1e-3 after 14 iterations]

• Minimization with the Gauss-Newton approximation and line search takes only 14 iterations

Comparison

[Contour plots: Newton method with line search (gradient < 1e-3 after 15 iterations) and Gauss-Newton method with line search (gradient < 1e-3 after 14 iterations)]

Newton:
• requires computing the Hessian
• exact solution if quadratic

Gauss-Newton:
• approximates the Hessian by the Jacobian product
• requires only first derivatives

[Comparison figures: gradient descent vs conjugate gradient; Newton, Gauss-Newton and conjugate gradient; Newton and Gauss-Newton (no line search). NB: axes do not have equal scales.]