Lecture 2 "Downhill Simplex and Newton Methods"

Lecture 2 "Downhill Simplex and Newton Methods"

C7B Optimization, Hilary 2011, A. Zisserman

• Optimization for general functions
• Downhill simplex (amoeba) algorithm
• Newton's method
• Line search
• Quasi-Newton methods
• Least-squares and Gauss-Newton methods

Review – unconstrained optimization

• Line minimization
• Minimizing quadratic functions
• Line search
• The problem with steepest descent

Assumptions:
• single minimum
• positive definite Hessian

Optimization for General Functions

  f(x, y) = e^x (4x² + 2y² + 4xy + 2x + 1)

[Figure: surface plot of f(x, y).]

Apply the methods developed using the quadratic Taylor series expansion.

Rosenbrock's function

  f(x, y) = 100(y − x²)² + (1 − x)²

[Figure: contour plot of the Rosenbrock function.]

The minimum is at [1, 1].

Steepest descent

• The 1D line minimization must be performed using one of the earlier methods (usually cubic polynomial interpolation).

[Figure: steepest descent path on the Rosenbrock function, with a zoomed view showing the zig-zag steps.]

• The zig-zag behaviour is clear in the zoomed view (100 iterations).
• The algorithm crawls down the valley.

Conjugate gradients

• Again, an explicit line minimization must be used at every step.

[Figure: conjugate gradient descent path on the Rosenbrock function; gradient < 1e-3 after 101 iterations.]

• The algorithm converges in 101 iterations.
• Far superior to steepest descent.

The downhill simplex (amoeba) algorithm

• Due to Nelder and Mead (1965).
• A direct method: it uses only function evaluations (no derivatives).
• A simplex is the polytope in N dimensions with N+1 vertices, e.g.
  • 2D: triangle
  • 3D: tetrahedron
• Basic idea: move the simplex by reflections, expansions or contractions.

[Figure: a starting simplex and the reflect, expand and contract moves.]

One iteration of the simplex algorithm (a code sketch follows this list)

• Reorder the points so that f(x_{N+1}) > f(x_N) > ⋯ > f(x_1) (i.e. x_{N+1} is the worst point).
• Generate a trial point x_r by reflection

    x_r = x̄ + α(x̄ − x_{N+1})

  where x̄ = (Σ_i x_i)/(N+1) is the centroid and α > 0.
• Compute f(x_r); there are then 3 possibilities:
  1. f(x_1) < f(x_r) < f(x_N) (i.e. x_r is neither the new best nor the new worst point): replace x_{N+1} by x_r.
  2. f(x_r) < f(x_1) (i.e. x_r is the new best point): assume the direction of reflection is good and generate a new point by expansion

       x_e = x_r + β(x_r − x̄)

     where β > 0. If f(x_e) < f(x_r) then replace x_{N+1} by x_e; otherwise the expansion has failed, so replace x_{N+1} by x_r.
  3. f(x_r) > f(x_N): assume the polytope is too large and generate a new point by contraction

       x_c = x̄ + γ(x_{N+1} − x̄)

     where γ (0 < γ < 1) is the contraction coefficient. If f(x_c) < f(x_{N+1}) then the contraction has succeeded, so replace x_{N+1} by x_c; otherwise contract again.

Standard values are α = 1, β = 1, γ = 0.5.

Example

[Figure: path of the best vertex on the Rosenbrock function (Matlab fminsearch, 200 iterations), with a zoomed detail.]

Example 2: contraction about a minimum

[Figure: the simplex first moves towards the minimum by reflections, then contracts about it.]

Summary

• no derivatives required
• deals well with noise in the cost function
• is able to crawl out of some local minima (though, of course, can still get stuck)
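To make the iteration concrete, here is a minimal NumPy sketch of the reflect/expand/contract step described above, applied to the Rosenbrock function. The names `simplex_step` and `rosenbrock` are illustrative, the centroid is taken over all N+1 vertices as in the formula above, and a standard shrink-towards-the-best-vertex move stands in for "contract again"; a production implementation such as Matlab's fminsearch adds further safeguards and termination tests.

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return 100.0 * (y - x**2)**2 + (1.0 - x)**2

def simplex_step(points, f, alpha=1.0, beta=1.0, gamma=0.5):
    """One reflect/expand/contract move on the (N+1) x N array of vertices."""
    fvals = np.array([f(p) for p in points])
    order = np.argsort(fvals)                    # points[0] best, points[-1] worst
    points, fvals = points[order], fvals[order]
    centroid = points.mean(axis=0)               # centroid of all N+1 vertices

    xr = centroid + alpha * (centroid - points[-1])     # reflection
    fr = f(xr)
    if fvals[0] < fr < fvals[-2]:                # neither new best nor new worst
        points[-1] = xr
    elif fr < fvals[0]:                          # new best point: try to expand further
        xe = xr + beta * (xr - centroid)
        points[-1] = xe if f(xe) < fr else xr
    else:                                        # reflection failed: contract
        xc = centroid + gamma * (points[-1] - centroid)
        if f(xc) < fvals[-1]:
            points[-1] = xc
        else:                                    # "contract again": shrink towards the best vertex
            points = points[0] + 0.5 * (points - points[0])
    return points

# Start from a small simplex near (-1, 1) and iterate.
simplex = np.array([[-1.0, 1.0], [-0.9, 1.0], [-1.0, 1.1]])
for _ in range(200):
    simplex = simplex_step(simplex, rosenbrock)
best = min(simplex, key=rosenbrock)
print(best, rosenbrock(best))                    # best vertex found and its value
```

Because only the worst vertex is ever replaced (and the shrink keeps the best vertex fixed), the best function value never increases from one iteration to the next.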
Newton's method

Expand f(x) by its Taylor series about the point x_n:

  f(x_n + δx) ≈ f(x_n) + g_n^T δx + (1/2) δx^T H_n δx

where the gradient is the vector

  g_n = ∇f(x_n) = [ ∂f/∂x_1, …, ∂f/∂x_N ]^T

and the Hessian is the symmetric matrix

  H_n = H(x_n) = [ ∂²f/∂x_1²      …  ∂²f/∂x_1∂x_N ]
                 [      ⋮         ⋱       ⋮        ]
                 [ ∂²f/∂x_1∂x_N   …  ∂²f/∂x_N²     ]

For a minimum we require that ∇f(x) = 0, and so

  ∇f(x) = g_n + H_n δx = 0

with solution δx = −H_n^{-1} g_n. This gives the iterative update

  x_{n+1} = x_n − H_n^{-1} g_n

Assume that H_n is positive definite (all eigenvalues greater than zero). Then:

• If f(x) is quadratic, the solution is found in one step.
• The method has quadratic convergence (as in the 1D case).
• The solution δx = −H_n^{-1} g_n is guaranteed to be a downhill direction.
• Rather than jump straight to the minimum, it is better to perform a line minimization, which ensures global convergence:

    x_{n+1} = x_n − α_n H_n^{-1} g_n

• If H = I then this reduces to steepest descent.

Newton's method – example

[Figure: Newton's method with line search on the Rosenbrock function (full view and zoomed); the ellipses show successive quadratic approximations; gradient < 1e-3 after 15 iterations.]

• The algorithm converges in only 15 iterations, compared to 101 for conjugate gradients and 200 for the simplex.
• However, the method requires computing the Hessian matrix at each iteration – this is not always feasible.

Quasi-Newton methods

• If the problem size is large and the Hessian matrix is dense, then it may be infeasible/inconvenient to compute it directly.
• Quasi-Newton methods avoid this problem by keeping a "rolling estimate" of H(x), updated at each iteration using new gradient information.
• Common schemes are due to Broyden, Fletcher, Goldfarb and Shanno (BFGS), and also Davidon, Fletcher and Powell (DFP).

For intuition, consider finite differences in 1D. The first derivatives are

  f'(x_0 + h/2) = (f_1 − f_0)/h   and   f'(x_0 − h/2) = (f_0 − f_{−1})/h

and the second derivative is

  f''(x_0) = [ (f_1 − f_0)/h − (f_0 − f_{−1})/h ] / h = (f_1 − 2f_0 + f_{−1})/h²

[Figure: f(x) sampled at x_{−1}, x_0, x_{+1}, spaced h apart.]

Similarly, for H_{n+1} build an approximation from H_n, g_n, g_{n+1}, x_n, x_{n+1}.

Quasi-Newton: BFGS

• Set H_0 = I.
• Update according to

    H_{n+1} = H_n + (q_n q_n^T)/(q_n^T s_n) − (H_n s_n)(H_n s_n)^T/(s_n^T H_n s_n)

  where
    s_n = x_{n+1} − x_n
    q_n = g_{n+1} − g_n

• The matrix itself is not stored, but rather represented compactly by a few stored vectors.
• The estimate H_{n+1} is used to form a local quadratic approximation, as before.

Example

[Figure: BFGS path on the Rosenbrock function.]

• The method converges in 25 iterations, compared to 15 for the full Newton method.
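To make the update rule concrete, here is a minimal NumPy sketch of BFGS on the Rosenbrock function with a crude backtracking (Armijo) line search. The names `bfgs` and `rosenbrock_grad` are illustrative, and the two safeguards (skipping the update when q_n^T s_n ≤ 0, and falling back to steepest descent if the computed direction is uphill) are standard additions rather than part of the update formula above.

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return 100.0 * (y - x**2)**2 + (1.0 - x)**2

def rosenbrock_grad(p):
    x, y = p
    return np.array([-400.0 * x * (y - x**2) - 2.0 * (1.0 - x),
                      200.0 * (y - x**2)])

def bfgs(f, grad, x, iters=200, tol=1e-3):
    H = np.eye(len(x))                        # H_0 = I
    g = grad(x)
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(H, g)            # Newton-like direction from the estimate
        if g @ d >= 0.0:                      # safeguard: fall back to steepest descent
            d = -g
        a = 1.0                               # crude backtracking line search (Armijo test)
        while f(x + a * d) > f(x) + 1e-4 * a * (g @ d) and a > 1e-12:
            a *= 0.5
        x_new = x + a * d
        g_new = grad(x_new)
        s = x_new - x                         # s_n = x_{n+1} - x_n
        q = g_new - g                         # q_n = g_{n+1} - g_n
        if q @ s > 1e-12:                     # skip the update if it would break positive definiteness
            Hs = H @ s
            H = H + np.outer(q, q) / (q @ s) - np.outer(Hs, Hs) / (s @ Hs)
        x, g = x_new, g_new
    return x

print(bfgs(rosenbrock, rosenbrock_grad, np.array([-1.0, 1.0])))   # should approach [1, 1]
```

In practice the inverse Hessian estimate (or a factorisation) is updated directly, so that no linear system has to be solved at each iteration.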
Non-linear least squares

• It is very common in applications for a cost function f(x) to be the sum of a large number of squared residuals

    f(x) = Σ_{i=1}^{M} r_i²

• If each residual depends non-linearly on the parameters x, then the minimization of f(x) is a non-linear least squares problem.
• Examples in computer vision are maximum likelihood estimators of image relations (such as homographies and fundamental matrices) from point correspondences.

Example: aligning a 3D model to an image

Transformation parameters:
• 3D rotation matrix R
• translation 3-vector T = (T_x, T_y, T_z)^T

Image generation:
• rotate and translate the 3D model by R and T
• project to generate the image Î_{R,T}(x, y)

Cost function: a sum of squared residuals comparing the generated image Î_{R,T}(x, y) with the measured image.

[Figure: a 3D model projected into the image under rotation R and translation T.]

Non-linear least squares (continued)

  f(x) = Σ_{i=1}^{M} r_i²

The M × N Jacobian of the vector of residuals r is defined as (assume M > N)

  J(x) = [ ∂r_1/∂x_1  …  ∂r_1/∂x_N ]
         [     ⋮      ⋱      ⋮      ]
         [ ∂r_M/∂x_1  …  ∂r_M/∂x_N ]

Consider

  ∂/∂x_k Σ_i r_i² = 2 Σ_i r_i ∂r_i/∂x_k

Hence

  ∇f(x) = 2 J^T r

For the Hessian we require

  ∂²/∂x_l∂x_k Σ_i r_i² = 2 ∂/∂x_l Σ_i r_i ∂r_i/∂x_k
                       = 2 Σ_i (∂r_i/∂x_k)(∂r_i/∂x_l) + 2 Σ_i r_i ∂²r_i/∂x_k∂x_l

Hence

  H(x) = 2 J^T J + 2 Σ_{i=1}^{M} r_i R_i

where R_i is the Hessian of the individual residual r_i.

• Note that the second-order term in the Hessian H(x) is multiplied by the residuals r_i.
• In most problems, the residuals will typically be small.
• Also, at the minimum, the residuals will typically be distributed with mean 0.
• For these reasons, the second-order term is often ignored, giving the Gauss-Newton approximation to the Hessian:

    H(x) = 2 J^T J

• Hence, explicit computation of the full Hessian can again be avoided.

Example – Gauss-Newton

The minimization of the Rosenbrock function

  f(x, y) = 100(y − x²)² + (1 − x)²

can be written as a least-squares problem with residual vector

  r = [ 10(y − x²) ]
      [   1 − x    ]

  J(x) = [ ∂r_1/∂x  ∂r_1/∂y ] = [ −20x  10 ]
         [ ∂r_2/∂x  ∂r_2/∂y ]   [  −1    0 ]

The true Hessian is

  H(x) = [ ∂²f/∂x²   ∂²f/∂x∂y ] = [ 1200x² − 400y + 2   −400x ]
         [ ∂²f/∂x∂y  ∂²f/∂y²  ]   [       −400x           200 ]

The Gauss-Newton approximation of the Hessian is

  2 J^T J = 2 [ −20x  −1 ] [ −20x  10 ] = [ 800x² + 2   −400x ]
              [  10    0 ] [  −1    0 ]   [   −400x       200 ]

and the update is

  x_{n+1} = x_n − α_n H_n^{-1} g_n   with   H_n = 2 J_n^T J_n

[Figure: Gauss-Newton with line search on the Rosenbrock function (full view and zoomed); gradient < 1e-3 after 14 iterations.]

• Minimization with the Gauss-Newton approximation and line search takes only 14 iterations.

Comparison

[Figure: Newton (15 iterations) and Gauss-Newton (14 iterations) paths on the Rosenbrock function.]

Newton:
• requires computing the Hessian
• exact solution if f is quadratic

Gauss-Newton:
• approximates the Hessian by a product of Jacobians
• requires only first derivatives

[Figure: paths of gradient descent, conjugate gradient, Newton and Gauss-Newton (NB: the axes do not have equal scales).]
[Figure: the same four methods with no line search.]
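The Gauss-Newton iteration can be written down directly from this residual vector and Jacobian. Below is a minimal NumPy sketch (the function names are illustrative); a crude decrease-only backtracking step stands in for the line search:

```python
import numpy as np

def residuals(p):
    """Residual vector r(x, y) for the Rosenbrock problem."""
    x, y = p
    return np.array([10.0 * (y - x**2), 1.0 - x])

def jacobian(p):
    """Jacobian J(x, y) of the residuals."""
    x, y = p
    return np.array([[-20.0 * x, 10.0],
                     [-1.0,       0.0]])

def cost(p):
    r = residuals(p)
    return r @ r                              # f = sum of squared residuals (the Rosenbrock function)

def gauss_newton(p, iters=100, tol=1e-3):
    for _ in range(iters):
        r, J = residuals(p), jacobian(p)
        g = 2.0 * J.T @ r                     # gradient: grad f = 2 J^T r
        if np.linalg.norm(g) < tol:
            break
        H = 2.0 * J.T @ J                     # Gauss-Newton approximation to the Hessian
        d = -np.linalg.solve(H, g)            # Newton-like step using the approximation
        a = 1.0                               # crude backtracking in place of a proper line search
        while cost(p + a * d) > cost(p) and a > 1e-12:
            a *= 0.5
        p = p + a * d
    return p

print(gauss_newton(np.array([-1.0, 1.0])))    # should end up close to the minimum at [1, 1]
```

Since the factors of 2 cancel, the step solves J^T J δx = −J^T r, the usual normal-equations form of the Gauss-Newton step.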
