
C7B Optimization

8 Lectures, Hilary Term 2011, 2 Tutorial Sheets. Prof. A. Zisserman

Overview

• Lectures 1 & 2 (AZ): Introduction – unconstrained univariate and multivariate optimization, simplex (amoeba) algorithm, Newton-like methods.

• Lectures 3-6 (BK): Classical optimization. Constrained optimization - equality constraints, Lagrange multipliers, inequality constraints. Nonlinear programming - search methods, approximation methods, axial iteration, pattern search, descent methods, quasi-Newton methods. Least squares and singular values.

• Lectures 7 & 8 (AZ): Levenberg-Marquardt algorithm, discrete optimization, dynamic programming, convexity, robust cost functions, graph-cuts.

Textbooks

• Practical Optimization, Philip E. Gill, Walter Murray, and Margaret H. Wright, Academic Press, 1981.

Covers unconstrained and constrained optimization. Very clear and comprehensive.

• Numerical Recipes in C (or C++): The Art of Scientific Computing, William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling, CUP 1992/2002. Good chapter on optimization. Available online at http://www.nrbook.com/a/bookcpdf.php

Background reading and web resources

• Convex Optimization, Stephen Boyd and Lieven Vandenberghe, CUP 2004. Available online at http://www.stanford.edu/~boyd/cvxbook/

• Further reading, web resources, and the lecture notes are on http://www.robots.ox.ac.uk/~az/lectures/opt

Introduction: Problem specification

Suppose we have a cost function (or objective function)

f(x): R^n → R

Our aim is to find the value of the parameters x that minimizes this function,

x* = argmin_x f(x)

subject to the following constraints:

• equality: c_i(x) = 0,  i = 1, ..., m_e

• inequality: c_i(x) ≥ 0,  i = m_e + 1, ..., m

We will start by focussing on unconstrained problems.

Recall: One dimensional functions

• A differentiable function has a stationary point when the derivative is zero: df/dx = 0.

• The second derivative gives the type of stationary point.

Unconstrained optimization of a function of one variable f(x):

min_x f(x)

[Figure: a 1D function with a local minimum and the global minimum]

• down-hill search (descent) algorithms can find local minima
• which of the minima is found depends on the starting point
• such minima often occur in real applications

Example: template matching in 2D images

Cost function: for each data point D_j, find the closest model point.

Model point: M_i = (x_i, y_i)^T

Transformation parameters: rotation angle θ and translation t = (t_x, t_y)^T

[Figures: correct correspondences vs. closest-point correspondences, and the performance of the fit]

Unconstrained univariate optimization

For the moment, assume we can start close to the global minimum. (We will return to this issue later when covering convexity.)

min_x f(x)

We will look at three basic methods to determine the minimum:

1. gradient descent
2. polynomial interpolation
3. Newton's method

These introduce the ideas that will be applied in the multivariate case.

A typical 1D function

As an example, consider the function

f(x) = 0.1 + 0.1 x + x^2 / (0.1 + x^2)

(assume we do not know the actual function expression from now on)

1. Gradient descent

Given a starting location x_0, examine df/dx and move in the downhill direction to generate a new estimate, x_1 = x_0 + δx, where

δx = −α df/dx

How to determine the step size δx?
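As a minimal sketch (not from the notes), fixed-step gradient descent on the example function above; the step-size factor α, the finite-difference derivative estimate, and the stopping tolerance are illustrative choices:

```python
# Minimal sketch of fixed-step 1D gradient descent (illustrative parameters).

def f(x):
    # the example function from the notes
    return 0.1 + 0.1 * x + x ** 2 / (0.1 + x ** 2)

def dfdx(x, h=1e-6):
    # central finite difference: we pretend not to know f's expression
    return (f(x + h) - f(x - h)) / (2.0 * h)

def gradient_descent_1d(x0, alpha=0.05, tol=1e-6, max_iter=10000):
    x = x0
    for _ in range(max_iter):
        g = dfdx(x)
        if abs(g) < tol:         # (approximately) stationary: stop
            break
        x = x - alpha * g        # move downhill: delta_x = -alpha * df/dx
    return x

print(gradient_descent_1d(0.8))  # converges to the local minimum near x = 0
```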

Polynomial interpolation

• Bracket the minimum.
• Fit a quadratic or cubic polynomial which interpolates f(x) at some points in the interval.
• Jump to the (easily obtained) minimum of the polynomial.
• Throw away the worst point and repeat the process.

[Figure: quadratic interpolation using 3 points, 2 iterations, on the example function over [-1, 1]]

Other methods to interpolate a quadratic? e.g. 2 points and one gradient.
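A rough sketch of this bracketing-plus-interpolation loop (my own illustration; the update rules and guards are simplified relative to, e.g., Brent's method in Numerical Recipes):

```python
# Sketch of line minimization by repeated quadratic (parabolic) interpolation.
# The bracket (a, b, c) must satisfy a < b < c with f(b) < f(a) and f(b) < f(c).

def parabola_vertex(a, b, c, fa, fb, fc):
    # Minimum of the parabola interpolating (a,fa), (b,fb), (c,fc).
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return b - 0.5 * num / den

def parabolic_minimize(f, a, b, c, iters=20):
    fa, fb, fc = f(a), f(b), f(c)
    for _ in range(iters):
        u = parabola_vertex(a, b, c, fa, fb, fc)
        if not (a < u < c) or u == b:
            # degenerate fit: fall back to bisecting the larger sub-interval
            u = 0.5 * (a + b) if (b - a) > (c - b) else 0.5 * (b + c)
        fu = f(u)
        # keep a bracketing triple, throwing away the worst outer point
        if u < b:
            a, b, c, fa, fb, fc = (a, u, b, fa, fu, fb) if fu < fb else (u, b, c, fu, fb, fc)
        else:
            a, b, c, fa, fb, fc = (b, u, c, fb, fu, fc) if fu < fb else (a, b, u, fa, fb, fu)
    return b, fb

# Example on the 1D function from the notes, bracketed on [-1, 1]:
f = lambda x: 0.1 + 0.1 * x + x ** 2 / (0.1 + x ** 2)
print(parabolic_minimize(f, -1.0, 0.0, 1.0))
```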

Newton’s method

Fit a quadratic approximation to f(x) using both gradient and curvature information at x.

• Expand f(x) locally using a Taylor series:

f(x + δx) = f(x) + δx f'(x) + (δx^2 / 2) f''(x) + h.o.t.

• Find the δx which minimizes this local quadratic approximation:

δx = −f'(x) / f''(x)

• Update x:

x_{n+1} = x_n − f'(x_n) / f''(x_n)

[Figure: Newton iterations on the example function, with the successive local quadratic approximations shown in detail]

• avoids the need to bracket the root
• quadratic convergence (decimal accuracy doubles at every iteration)
• global convergence of Newton's method is poor
• often fails if the starting point is too far from the minimum
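A minimal sketch of the 1D Newton iteration on the example function (not from the notes); derivatives are estimated by finite differences, and the undamped update illustrates both the fast local convergence and the failure from a poor start:

```python
# Sketch of 1D Newton's method, x_{n+1} = x_n - f'(x_n) / f''(x_n),
# with finite-difference derivative estimates and no step-length safeguard.

def f(x):
    return 0.1 + 0.1 * x + x ** 2 / (0.1 + x ** 2)   # the example function

def d1(x, h=1e-5):   # first derivative (central difference)
    return (f(x + h) - f(x - h)) / (2.0 * h)

def d2(x, h=1e-5):   # second derivative (central difference)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

def newton_1d(x0, iters=10):
    x = x0
    for _ in range(iters):
        curvature = d2(x)
        if abs(curvature) < 1e-12:   # flat curvature: the Newton step is undefined
            break
        x = x - d1(x) / curvature
    return x

print(newton_1d(0.1))   # starts near the minimum: converges in a few iterations
print(newton_1d(0.8))   # starts in a negative-curvature region: runs away from the minimum
```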

[Figure: a starting point far from the minimum for which the Newton iterations fail to converge]

• in practice, Newton's method must be used with a globalization strategy which reduces the step length until a function decrease is assured

Extension to N dimensions

How big can N be?

• problem sizes can vary from a handful of parameters to many thousands
• we will consider examples for N = 2, so that cost function surfaces can be visualized

Start by looking at the types of stationary points for functions in nD

Stationary Points for Multidimensional functions

A function f : R^n → R has a stationary point when the gradient

∇f = (∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_n)^T = 0

Taylor expansion in 2D

A function may be approximated locally by its Taylor expansion about a point x_0:

f(x_0 + x) ≈ f(x_0) + (∂f/∂x, ∂f/∂y) (x, y)^T
             + (1/2) (x, y) [ ∂^2f/∂x^2   ∂^2f/∂x∂y
                              ∂^2f/∂x∂y   ∂^2f/∂y^2 ] (x, y)^T + h.o.t.

The expansion to second order is a quadratic function

f(x) = a + g^T x + (1/2) x^T H x
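As a quick numerical illustration (my own sketch, with an arbitrary test function and finite-difference derivative estimates), the second-order expansion can be checked against the true function value for a small displacement:

```python
import numpy as np

# Numerical check of the second-order Taylor expansion in 2D (illustrative only).
# The test function, step sizes and displacement are arbitrary choices.

def f(v):
    x, y = v
    return np.sin(x) * y + 0.5 * x ** 2 + y ** 2

def grad(v, h=1e-5):
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2)
        e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2.0 * h)       # central differences
    return g

def hessian(v, h=1e-4):
    H = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            ei = np.zeros(2); ei[i] = h
            ej = np.zeros(2); ej[j] = h
            H[i, j] = (f(v + ei + ej) - f(v + ei - ej)
                       - f(v - ei + ej) + f(v - ei - ej)) / (4.0 * h ** 2)
    return H

x0 = np.array([0.3, -0.2])
d = np.array([0.01, 0.02])                              # a small displacement
quadratic_model = f(x0) + grad(x0) @ d + 0.5 * d @ hessian(x0) @ d
print(f(x0 + d), quadratic_model)                       # agree to O(|d|^3)
```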

Taylor expansion in nD

A function may be approximated locally by its expansion about a point x_0:

f(x_0 + x) ≈ f(x_0) + ∇f^T x + (1/2) x^T H x + h.o.t.

where the gradient ∇f(x) of f(x) is the vector

∇f(x) = (∂f/∂x_1, ..., ∂f/∂x_N)^T

and the Hessian H(x) of f(x) is the symmetric matrix

H(x) = [ ∂^2f/∂x_1^2     ...   ∂^2f/∂x_1∂x_N
         ...                   ...
         ∂^2f/∂x_1∂x_N   ...   ∂^2f/∂x_N^2  ]

The expansion to second order is a quadratic function

f(x) = a + g^T x + (1/2) x^T H x

Properties of Quadratic functions

Taylor expansion:

f(x_0 + x) = f(x_0) + g^T x + (1/2) x^T H x

Expand about a stationary point x_0 = x* in direction p:

f(x* + αp) = f(x*) + α g^T p + (1/2) α^2 p^T H p
           = f(x*) + (1/2) α^2 p^T H p

since at a stationary point g = ∇f|_{x*} = 0.

At a stationary point the behaviour is determined by H

H is a symmetric matrix, and so has orthogonal eigenvectors:

H u_i = λ_i u_i,   choose ||u_i|| = 1

f(x* + α u_i) = f(x*) + (1/2) α^2 u_i^T H u_i
              = f(x*) + (1/2) α^2 λ_i

As |α| increases, f(x* + α u_i) increases, decreases or is unchanging according to whether λ_i is positive, negative or zero.

Suppose x is 2D and choose ||p|| = 1, then p·u_1 = cos θ and p = cos θ u_1 + sin θ u_2.

f(x* + αp) = f(x*) + (1/2) α^2 (cos θ u_1 + sin θ u_2)^T H (cos θ u_1 + sin θ u_2)
           = f(x*) + (1/2) α^2 (cos^2 θ λ_1 + sin^2 θ λ_2)

If λ_1 and λ_2 are both positive (> 0) then x* is a minimum.
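A small sketch (not from the notes) that classifies a stationary point from the eigenvalues of H; the three example Hessians anticipate the cases on the following slides:

```python
import numpy as np

# Sketch: classify a stationary point from the eigenvalues of the (symmetric) Hessian.

def classify(H, tol=1e-10):
    eig = np.linalg.eigvalsh(H)        # real eigenvalues, since H is symmetric
    if np.all(eig > tol):
        return "minimum"
    if np.all(eig < -tol):
        return "maximum"
    if np.any(np.abs(eig) <= tol):
        return "degenerate (a zero eigenvalue)"
    return "saddle"

print(classify(np.array([[6.0, 4.0], [4.0, 6.0]])))    # both eigenvalues positive
print(classify(np.array([[6.0, 0.0], [0.0, -4.0]])))   # eigenvalues of different signs
print(classify(np.array([[6.0, 0.0], [0.0, 0.0]])))    # one eigenvalue zero
```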

Examples of Quadratic functions

Case 1: both eigenvalues positive

f(x) = a + g^T x + (1/2) x^T H x

with

a = 0,   g = (−50, −50)^T,   H = [6 4; 4 6]

[Contour plot of the quadratic: a unique minimum]

Case 2: eigenvalues have different signs

f(x) = a + g^T x + (1/2) x^T H x

with

a = 0,   g = (−30, 20)^T,   H = [6 0; 0 −4]

[Contour plot of the quadratic: a saddle]

saddle: a stationary point, but not a minimum

Case 3: one eigenvalue zero

f(x) = a + g^T x + (1/2) x^T H x

with

a = 0,   g = (0, 0)^T,   H = [6 0; 0 0]

[Contour plot of the quadratic: a parabolic cylinder]

Optimization for Quadratic functions

Assume that H is positive definite (all eigenvalues greater than zero)

f(x) = a + g^T x + (1/2) x^T H x

∇f(x) = g + H x

There is a unique minimum at

x* = −H^{-1} g

If N is large, it is not feasible to perform this inversion directly.
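A sketch (mine, not the notes' code): for moderate N the minimum is found by solving the linear system H x* = −g rather than by forming H^{-1} explicitly, here with the Case 1 example values from above:

```python
import numpy as np

# Unique minimum of f(x) = a + g^T x + 0.5 x^T H x when H is positive definite.
# Solving H x = -g is preferable to forming H^{-1}; values are the Case 1 example.

H = np.array([[6.0, 4.0], [4.0, 6.0]])
g = np.array([-50.0, -50.0])

x_star = np.linalg.solve(H, -g)        # x* = -H^{-1} g, without an explicit inverse
print(x_star)                          # approximately [5, 5] for these values
```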

Optimization in N dimensions

Key idea:

• reduce optimization to a 1D line search

[Contour plot: a 1D line search across the cost surface]

An Optimization Algorithm

Start at x_0, then repeat:

1. compute a search direction p_k

2. compute a step length α_k, such that f(x_k + α_k p_k) < f(x_k)

3. update x_{k+1} ← x_k + α_k p_k

4. check for convergence (termination criteria), e.g. ∇f = 0

Reduces optimization in N dimensions to a series of (1D) line minimizations
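A skeleton of this loop (an illustrative sketch, not the notes' algorithm verbatim): steepest descent is used as a placeholder for step 1, and a simple backtracking rule with a sufficient-decrease test stands in for the line minimization in step 2:

```python
import numpy as np

def minimize(f, grad, x0, tol=1e-4, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # 4. convergence test: gradient ~ 0
            break
        p = -g                             # 1. search direction (steepest descent here)
        alpha = 1.0                        # 2. step length: backtrack until f decreases enough
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
            if alpha < 1e-12:
                return x                   #    no decrease found: give up
        x = x + alpha * p                  # 3. update
    return x

# Example: the Case 1 quadratic from above
H = np.array([[6.0, 4.0], [4.0, 6.0]])
g0 = np.array([-50.0, -50.0])
f = lambda x: g0 @ x + 0.5 * x @ H @ x
grad = lambda x: g0 + H @ x
print(minimize(f, grad, [0.0, 14.0]))      # heads to the minimum near [5, 5]
```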

Steepest descent

The basic principle is to minimize the N-dimensional function by a series of 1D line minimizations:

x_{n+1} = x_n + α_n p_n

The steepest descent method chooses p_n to be parallel to the negative gradient:

p_n = −∇f(x_n)

The step size α_n is chosen to minimize f(x_n + α_n p_n). For quadratic forms there is a closed-form solution:

α_n = (p_n^T p_n) / (p_n^T H p_n)   [exercise]

Example:  a = 0,   g = (−50, −50)^T,   H = [6 4; 4 6]

[Contour plot: steepest descent iterations from x_0 = [0, 14]]

• The gradient is everywhere perpendicular to the contour lines.
• After each line minimization the new gradient is always orthogonal to the previous step direction (true of any line minimization).

• Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
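A short sketch (my own) of steepest descent with the closed-form step length on the example quadratic, started from x_0 = [0, 14] as in the figure; printing the iterates shows the zig-zag approach toward the minimum (at about (5, 5) for these example values):

```python
import numpy as np

# Steepest descent on the example quadratic with the exact (closed-form) step
# length alpha_n = (p_n^T p_n) / (p_n^T H p_n); the iterates zig-zag down the valley.

H = np.array([[6.0, 4.0], [4.0, 6.0]])
g0 = np.array([-50.0, -50.0])
grad = lambda x: g0 + H @ x

x = np.array([0.0, 14.0])                  # starting point used in the figure
for n in range(10):
    p = -grad(x)                           # steepest descent direction
    if np.linalg.norm(p) < 1e-10:          # gradient (essentially) zero: stop
        break
    alpha = (p @ p) / (p @ H @ p)          # exact minimizer along the line
    x = x + alpha * p
    print(n, x)                            # successive steps are mutually orthogonal
```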

Conjugate gradients

The method of conjugate gradients chooses successive descent directions p_n such that it is guaranteed to reach the minimum in a finite number of steps.

• Each p_n is chosen to be conjugate to all previous search directions with respect to the Hessian H:

p_n^T H p_j = 0,   0 ≤ j < n

• The resulting search directions are mutually linearly independent.

• Remarkably, p_n can be chosen using only knowledge of p_{n−1}, ∇f(x_{n−1}) and ∇f(x_n) (see Numerical Recipes):

p_n = −∇f_n + (∇f_n^T ∇f_n / ∇f_{n−1}^T ∇f_{n−1}) p_{n−1}

• An N-dimensional quadratic form can be minimized in at most N conjugate descent steps.

[Contour plots: conjugate gradient paths from 3 different starting points; the minimum is reached in exactly 2 steps in each case]
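A sketch of conjugate gradient descent on the same example quadratic (my own illustration of the formula above, using exact line minimization); the gradient is driven to (numerically) zero after N = 2 steps:

```python
import numpy as np

# Conjugate gradient descent on the example quadratic, with exact line
# minimization and the beta coefficient from the formula above.

H = np.array([[6.0, 4.0], [4.0, 6.0]])
g0 = np.array([-50.0, -50.0])
grad = lambda x: g0 + H @ x

x = np.array([0.0, 14.0])
g = grad(x)
p = -g                                      # first direction: steepest descent
for n in range(2):                          # N = 2, so at most 2 steps are needed
    alpha = -(g @ p) / (p @ H @ p)          # exact line minimization (quadratic f)
    x = x + alpha * p
    g_new = grad(x)
    beta = (g_new @ g_new) / (g @ g)        # uses only the current and previous gradients
    p = -g_new + beta * p                   # next direction, conjugate w.r.t. H
    g = g_new
    print(n, x, np.linalg.norm(g))          # gradient is (numerically) zero after step 2
```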