
C7B Optimization

8 Lectures, Hilary Term 2011, 2 Tutorial Sheets. Prof. A. Zisserman

Overview

• Lectures 1 & 2 (AZ): Introduction – unconstrained univariate and multivariate optimization, simplex (amoeba) algorithm, Newton-like methods.

• Lectures 3-6 (BK): Classical optimization. Constrained optimization - equality constraints, Lagrange multipliers, inequality constraints. Nonlinear programming - search methods, approximation methods, axial iteration, pattern search, descent methods, quasi-Newton methods. Least squares and singular values.

• Lectures 7 & 8 (AZ): Levenberg-Marquardt algorithm, discrete optimization, dynamic programming, convexity, robust cost functions, graph-cuts.

Textbooks

• Practical Optimization, Philip E. Gill, Walter Murray, and Margaret H. Wright, Academic Press, 1981.

Covers unconstrained and constrained optimization. Very clear and comprehensive.

• Numerical Recipes in C (or C++): The Art of Scientific Computing, William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling, CUP 1992/2002. Good chapter on optimization. Available online at http://www.nrbook.com/a/bookcpdf.php

Background reading and web resources

• Convex Optimization, Stephen Boyd and Lieven Vandenberghe, CUP 2004. Available online at http://www.stanford.edu/~boyd/cvxbook/

• Further reading, web resources, and the lecture notes are on http://www.robots.ox.ac.uk/~az/lectures/opt

Introduction: Problem specification

Suppose we have a cost function (or objective function)

f(x): R^n → R

Our aim is to find the value of the parameters x that minimizes this function,

x* = argmin_x f(x)

subject to the following constraints:

• equality: c_i(x) = 0,  i = 1, ..., m_e

• inequality: c_i(x) ≥ 0,  i = m_e + 1, ..., m

We will start by focussing on unconstrained problems.

Recall: One dimensional functions

• A differentiable function has a stationary point when the derivative is zero: df/dx = 0.

• The second derivative gives the type of stationary point.

Unconstrained optimization of a function of one variable f(x):

min_x f(x)

[Figure: a 1D function with a local minimum and the global minimum]

• down-hill search (descent) algorithms can find local minima
• which of the minima is found depends on the starting point
• such minima often occur in real applications

Example: template matching in 2D images

Cost function: for each data point D_j, find the closest model point.

Model point: M_i = (x_i, y_i)^T

Transformation parameters: rotation angle θ and translation t = (t_x, t_y)^T

[Figures: correct correspondences vs. closest-point correspondences, and the performance of the fit]

Unconstrained univariate optimization

For the moment, assume we can start close to the global minimum. (We will return to this issue later when covering convexity.)

min_x f(x)

We will look at three basic methods to determine the minimum:

1. gradient descent
2. polynomial interpolation
3. Newton's method

These introduce the ideas that will be applied in the multivariate case.

A typical 1D function

As an example, consider the function

f(x) = 0.1 + 0.1 x + x^2 / (0.1 + x^2)

(assume we do not know the actual function expression from now on)

1. Gradient descent

Given a starting location x_0, examine df/dx and move in the downhill direction to generate a new estimate, x_1 = x_0 + δx, where

δx = −α df/dx

How to determine the step size δx?
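As a minimal sketch (not from the notes), fixed-step gradient descent on the example function above; the step-size factor α, the finite-difference derivative estimate, and the stopping tolerance are illustrative choices:

```python
# Minimal sketch of fixed-step 1D gradient descent (illustrative parameters).

def f(x):
    # the example function from the notes
    return 0.1 + 0.1 * x + x ** 2 / (0.1 + x ** 2)

def dfdx(x, h=1e-6):
    # central finite difference: we pretend not to know f's expression
    return (f(x + h) - f(x - h)) / (2.0 * h)

def gradient_descent_1d(x0, alpha=0.05, tol=1e-6, max_iter=10000):
    x = x0
    for _ in range(max_iter):
        g = dfdx(x)
        if abs(g) < tol:         # (approximately) stationary: stop
            break
        x = x - alpha * g        # move downhill: delta_x = -alpha * df/dx
    return x

print(gradient_descent_1d(0.8))  # converges to the local minimum near x = 0
```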

Polynomial interpolation

• Bracket the minimum.
• Fit a quadratic or cubic polynomial which interpolates f(x) at some points in the interval.
• Jump to the (easily obtained) minimum of the polynomial.
• Throw away the worst point and repeat the process.

[Figure: quadratic interpolation using 3 points, 2 iterations, on the example function over [-1, 1]]

Other methods to interpolate a quadratic? e.g. 2 points and one gradient.
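A rough sketch of this bracketing-plus-interpolation loop (my own illustration; the update rules and guards are simplified relative to, e.g., Brent's method in Numerical Recipes):

```python
# Sketch of line minimization by repeated quadratic (parabolic) interpolation.
# The bracket (a, b, c) must satisfy a < b < c with f(b) < f(a) and f(b) < f(c).

def parabola_vertex(a, b, c, fa, fb, fc):
    # Minimum of the parabola interpolating (a,fa), (b,fb), (c,fc).
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return b - 0.5 * num / den

def parabolic_minimize(f, a, b, c, iters=20):
    fa, fb, fc = f(a), f(b), f(c)
    for _ in range(iters):
        u = parabola_vertex(a, b, c, fa, fb, fc)
        if not (a < u < c) or u == b:
            # degenerate fit: fall back to bisecting the larger sub-interval
            u = 0.5 * (a + b) if (b - a) > (c - b) else 0.5 * (b + c)
        fu = f(u)
        # keep a bracketing triple, throwing away the worst outer point
        if u < b:
            a, b, c, fa, fb, fc = (a, u, b, fa, fu, fb) if fu < fb else (u, b, c, fu, fb, fc)
        else:
            a, b, c, fa, fb, fc = (b, u, c, fb, fu, fc) if fu < fb else (a, b, u, fa, fb, fu)
    return b, fb

# Example on the 1D function from the notes, bracketed on [-1, 1]:
f = lambda x: 0.1 + 0.1 * x + x ** 2 / (0.1 + x ** 2)
print(parabolic_minimize(f, -1.0, 0.0, 1.0))
```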

Newton’s method

Fit a quadratic approximation to f(x) using both gradient and curvature information at x.

• Expand f(x) locally using a Taylor series:

f(x + δx) = f(x) + δx f'(x) + (δx^2 / 2) f''(x) + h.o.t.

• Find the δx which minimizes this local quadratic approximation:

δx = −f'(x) / f''(x)

• Update x:

x_{n+1} = x_n − f'(x_n) / f''(x_n)

[Figure: Newton iterations on the example function, with the successive local quadratic approximations shown in detail]

• avoids the need to bracket the root
• quadratic convergence (decimal accuracy doubles at every iteration)
• global convergence of Newton's method is poor
• often fails if the starting point is too far from the minimum
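A minimal sketch of the 1D Newton iteration on the example function (not from the notes); derivatives are estimated by finite differences, and the undamped update illustrates both the fast local convergence and the failure from a poor start:

```python
# Sketch of 1D Newton's method, x_{n+1} = x_n - f'(x_n) / f''(x_n),
# with finite-difference derivative estimates and no step-length safeguard.

def f(x):
    return 0.1 + 0.1 * x + x ** 2 / (0.1 + x ** 2)   # the example function

def d1(x, h=1e-5):   # first derivative (central difference)
    return (f(x + h) - f(x - h)) / (2.0 * h)

def d2(x, h=1e-5):   # second derivative (central difference)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

def newton_1d(x0, iters=10):
    x = x0
    for _ in range(iters):
        curvature = d2(x)
        if abs(curvature) < 1e-12:   # flat curvature: the Newton step is undefined
            break
        x = x - d1(x) / curvature
    return x

print(newton_1d(0.1))   # starts near the minimum: converges in a few iterations
print(newton_1d(0.8))   # starts in a negative-curvature region: runs away from the minimum
```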

[Figure: a starting point far from the minimum for which the Newton iterations fail to converge]

• in practice, Newton's method must be used with a globalization strategy which reduces the step length until a function decrease is assured

Extension to N dimensions

How big can N be?

• problem sizes can vary from a handful of parameters to many thousands
• we will consider examples for N = 2, so that cost function surfaces can be visualized

Start by looking at the types of stationary points for functions in nD

Stationary Points for Multidimensional functions

A function f : R^n → R has a stationary point when the gradient

∇f = (∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_n)^T = 0

Taylor expansion in 2D

A function may be approximated locally by its Taylor expansion about a point x_0:

f(x_0 + x) ≈ f(x_0) + (∂f/∂x, ∂f/∂y) (x, y)^T
             + (1/2) (x, y) [ ∂^2f/∂x^2   ∂^2f/∂x∂y
                              ∂^2f/∂x∂y   ∂^2f/∂y^2 ] (x, y)^T + h.o.t.

The expansion to second order is a quadratic function

f(x) = a + g^T x + (1/2) x^T H x
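As a quick numerical illustration (my own sketch, with an arbitrary test function and finite-difference derivative estimates), the second-order expansion can be checked against the true function value for a small displacement:

```python
import numpy as np

# Numerical check of the second-order Taylor expansion in 2D (illustrative only).
# The test function, step sizes and displacement are arbitrary choices.

def f(v):
    x, y = v
    return np.sin(x) * y + 0.5 * x ** 2 + y ** 2

def grad(v, h=1e-5):
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2)
        e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2.0 * h)       # central differences
    return g

def hessian(v, h=1e-4):
    H = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            ei = np.zeros(2); ei[i] = h
            ej = np.zeros(2); ej[j] = h
            H[i, j] = (f(v + ei + ej) - f(v + ei - ej)
                       - f(v - ei + ej) + f(v - ei - ej)) / (4.0 * h ** 2)
    return H

x0 = np.array([0.3, -0.2])
d = np.array([0.01, 0.02])                              # a small displacement
quadratic_model = f(x0) + grad(x0) @ d + 0.5 * d @ hessian(x0) @ d
print(f(x0 + d), quadratic_model)                       # agree to O(|d|^3)
```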

Taylor expansion in nD

A function may be approximated locally by its expansion about a point x_0:

f(x_0 + x) ≈ f(x_0) + ∇f^T x + (1/2) x^T H x + h.o.t.

where the gradient ∇f(x) of f(x) is the vector

∇f(x) = (∂f/∂x_1, ..., ∂f/∂x_N)^T

and the Hessian H(x) of f(x) is the symmetric matrix

H(x) = [ ∂^2f/∂x_1^2     ...   ∂^2f/∂x_1∂x_N
         ...                   ...
         ∂^2f/∂x_1∂x_N   ...   ∂^2f/∂x_N^2  ]

The expansion to second order is a quadratic function

f(x) = a + g^T x + (1/2) x^T H x

Properties of Quadratic functions

Taylor expansion:

f(x_0 + x) = f(x_0) + g^T x + (1/2) x^T H x

Expand about a stationary point x_0 = x* in direction p:

f(x* + αp) = f(x*) + α g^T p + (1/2) α^2 p^T H p
           = f(x*) + (1/2) α^2 p^T H p

since at a stationary point g = ∇f|_{x*} = 0.

At a stationary point the behaviour is determined by H

H is a symmetric matrix, and so has orthogonal eigenvectors:

H u_i = λ_i u_i,   choose ||u_i|| = 1

f(x* + α u_i) = f(x*) + (1/2) α^2 u_i^T H u_i
              = f(x*) + (1/2) α^2 λ_i

As |α| increases, f(x* + α u_i) increases, decreases or is unchanging according to whether λ_i is positive, negative or zero.

Suppose x is 2D and choose ||p|| = 1, then p·u_1 = cos θ and p = cos θ u_1 + sin θ u_2.

f(x* + αp) = f(x*) + (1/2) α^2 (cos θ u_1 + sin θ u_2)^T H (cos θ u_1 + sin θ u_2)
           = f(x*) + (1/2) α^2 (cos^2 θ λ_1 + sin^2 θ λ_2)

If λ_1 and λ_2 are both positive (> 0) then x* is a minimum.
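A small sketch (not from the notes) that classifies a stationary point from the eigenvalues of H; the three example Hessians anticipate the cases on the following slides:

```python
import numpy as np

# Sketch: classify a stationary point from the eigenvalues of the (symmetric) Hessian.

def classify(H, tol=1e-10):
    eig = np.linalg.eigvalsh(H)        # real eigenvalues, since H is symmetric
    if np.all(eig > tol):
        return "minimum"
    if np.all(eig < -tol):
        return "maximum"
    if np.any(np.abs(eig) <= tol):
        return "degenerate (a zero eigenvalue)"
    return "saddle"

print(classify(np.array([[6.0, 4.0], [4.0, 6.0]])))    # both eigenvalues positive
print(classify(np.array([[6.0, 0.0], [0.0, -4.0]])))   # eigenvalues of different signs
print(classify(np.array([[6.0, 0.0], [0.0, 0.0]])))    # one eigenvalue zero
```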

Examples of Quadratic functions

Case 1: both eigenvalues positive

f(x) = a + g^T x + (1/2) x^T H x

with

a = 0,   g = (−50, −50)^T,   H = [6 4; 4 6]

[Contour plot of the quadratic: a unique minimum]

Case 2: eigenvalues have different signs

f(x) = a + g^T x + (1/2) x^T H x

with

a = 0,   g = (−30, 20)^T,   H = [6 0; 0 −4]

[Contour plot of the quadratic: a saddle]

saddle: a stationary point, but not a minimum

Case 3: one eigenvalue zero

f(x) = a + g^T x + (1/2) x^T H x

with

a = 0,   g = (0, 0)^T,   H = [6 0; 0 0]

[Contour plot of the quadratic: a parabolic cylinder]

Optimization for Quadratic functions

Assume that H is positive definite (all eigenvalues greater than zero)

f(x) = a + g^T x + (1/2) x^T H x

∇f(x) = g + H x

There is a unique minimum at

x* = −H^{-1} g

If N is large, it is not feasible to perform this inversion directly.
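A sketch (mine, not the notes' code): for moderate N the minimum is found by solving the linear system H x* = −g rather than by forming H^{-1} explicitly, here with the Case 1 example values from above:

```python
import numpy as np

# Unique minimum of f(x) = a + g^T x + 0.5 x^T H x when H is positive definite.
# Solving H x = -g is preferable to forming H^{-1}; values are the Case 1 example.

H = np.array([[6.0, 4.0], [4.0, 6.0]])
g = np.array([-50.0, -50.0])

x_star = np.linalg.solve(H, -g)        # x* = -H^{-1} g, without an explicit inverse
print(x_star)                          # approximately [5, 5] for these values
```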

Optimization in N dimensions

Key idea:

• reduce optimization to a 1D line search

[Contour plot: a 1D line search across the cost surface]

An Optimization Algorithm

Start at x_0, then repeat:

1. compute a search direction p_k

2. compute a step length α_k, such that f(x_k + α_k p_k) < f(x_k)

3. update x_{k+1} ← x_k + α_k p_k

4. check for convergence (termination criteria), e.g. ∇f = 0

Reduces optimization in N dimensions to a series of (1D) line minimizations
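A skeleton of this loop (an illustrative sketch, not the notes' algorithm verbatim): steepest descent is used as a placeholder for step 1, and a simple backtracking rule with a sufficient-decrease test stands in for the line minimization in step 2:

```python
import numpy as np

def minimize(f, grad, x0, tol=1e-4, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # 4. convergence test: gradient ~ 0
            break
        p = -g                             # 1. search direction (steepest descent here)
        alpha = 1.0                        # 2. step length: backtrack until f decreases enough
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
            if alpha < 1e-12:
                return x                   #    no decrease found: give up
        x = x + alpha * p                  # 3. update
    return x

# Example: the Case 1 quadratic from above
H = np.array([[6.0, 4.0], [4.0, 6.0]])
g0 = np.array([-50.0, -50.0])
f = lambda x: g0 @ x + 0.5 * x @ H @ x
grad = lambda x: g0 + H @ x
print(minimize(f, grad, [0.0, 14.0]))      # heads to the minimum near [5, 5]
```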

Steepest descent

The basic principle is to minimize the N-dimensional function by a series of 1D line minimizations:

x_{n+1} = x_n + α_n p_n

The steepest descent method chooses p_n to be parallel to the negative gradient:

p_n = −∇f(x_n)

The step size α_n is chosen to minimize f(x_n + α_n p_n). For quadratic forms there is a closed-form solution:

α_n = (p_n^T p_n) / (p_n^T H p_n)   [exercise]

Example:  a = 0,   g = (−50, −50)^T,   H = [6 4; 4 6]

[Contour plot: steepest descent iterations from x_0 = [0, 14]]

• The gradient is everywhere perpendicular to the contour lines.
• After each line minimization the new gradient is always orthogonal to the previous step direction (true of any line minimization).

• Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
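A short sketch (my own) of steepest descent with the closed-form step length on the example quadratic, started from x_0 = [0, 14] as in the figure; printing the iterates shows the zig-zag approach toward the minimum (at about (5, 5) for these example values):

```python
import numpy as np

# Steepest descent on the example quadratic with the exact (closed-form) step
# length alpha_n = (p_n^T p_n) / (p_n^T H p_n); the iterates zig-zag down the valley.

H = np.array([[6.0, 4.0], [4.0, 6.0]])
g0 = np.array([-50.0, -50.0])
grad = lambda x: g0 + H @ x

x = np.array([0.0, 14.0])                  # starting point used in the figure
for n in range(10):
    p = -grad(x)                           # steepest descent direction
    if np.linalg.norm(p) < 1e-10:          # gradient (essentially) zero: stop
        break
    alpha = (p @ p) / (p @ H @ p)          # exact minimizer along the line
    x = x + alpha * p
    print(n, x)                            # successive steps are mutually orthogonal
```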

Conjugate gradients

The method of conjugate gradients chooses successive descent directions p_n such that it is guaranteed to reach the minimum in a finite number of steps.

• Each p_n is chosen to be conjugate to all previous search directions with respect to the Hessian H:

p_n^T H p_j = 0,   0 ≤ j < n

• The resulting search directions are mutually linearly independent.

• Remarkably, p_n can be chosen using only knowledge of p_{n−1}, ∇f(x_{n−1}) and ∇f(x_n) (see Numerical Recipes):

p_n = −∇f_n + (∇f_n^T ∇f_n / ∇f_{n−1}^T ∇f_{n−1}) p_{n−1}

• An N-dimensional quadratic form can be minimized in at most N conjugate descent steps.

[Contour plots: conjugate gradient paths from 3 different starting points; the minimum is reached in exactly 2 steps in each case]
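A sketch of conjugate gradient descent on the same example quadratic (my own illustration of the formula above, using exact line minimization); the gradient is driven to (numerically) zero after N = 2 steps:

```python
import numpy as np

# Conjugate gradient descent on the example quadratic, with exact line
# minimization and the beta coefficient from the formula above.

H = np.array([[6.0, 4.0], [4.0, 6.0]])
g0 = np.array([-50.0, -50.0])
grad = lambda x: g0 + H @ x

x = np.array([0.0, 14.0])
g = grad(x)
p = -g                                      # first direction: steepest descent
for n in range(2):                          # N = 2, so at most 2 steps are needed
    alpha = -(g @ p) / (p @ H @ p)          # exact line minimization (quadratic f)
    x = x + alpha * p
    g_new = grad(x)
    beta = (g_new @ g_new) / (g @ g)        # uses only the current and previous gradients
    p = -g_new + beta * p                   # next direction, conjugate w.r.t. H
    g = g_new
    print(n, x, np.linalg.norm(g))          # gradient is (numerically) zero after step 2
```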