Lecture 2
C7B Optimization Hilary 2011 A. Zisserman
• Optimization for general functions
• Downhill simplex (amoeba) algorithm
• Newton’s method
• Line search
• Quasi-Newton methods
• Least-Squares and Gauss-Newton methods
Review – unconstrained optimization
• Line minimization
• Minimizing Quadratic functions
  • line search
  • problem with steepest descent

Assumptions:
• single minimum
• positive definite Hessian

Optimization for General Functions
f(x, y) = e^x (4x^2 + 2y^2 + 4xy + 2x + 1)

[Figure: plot of f(x, y) over roughly -4 ≤ x ≤ 1, -1 ≤ y ≤ 5]
Apply methods developed using quadratic Taylor series expansion
Rosenbrock’s function

f(x, y) = 100(y - x^2)^2 + (1 - x)^2

[Figure: Rosenbrock function over -2 ≤ x ≤ 2, -1 ≤ y ≤ 3]

Minimum is at [1, 1]
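The two example functions are easy to code up directly. Below is a minimal NumPy sketch (the function names are my own); the analytic Rosenbrock gradient is worked out by hand here because the gradient-based sketches later in the lecture need it.

```python
import numpy as np

def f_exp(p):
    # f(x, y) = exp(x) * (4x^2 + 2y^2 + 4xy + 2x + 1)
    x, y = p
    return np.exp(x) * (4*x**2 + 2*y**2 + 4*x*y + 2*x + 1)

def rosenbrock(p):
    # f(x, y) = 100 (y - x^2)^2 + (1 - x)^2, minimum at [1, 1]
    x, y = p
    return 100.0*(y - x**2)**2 + (1.0 - x)**2

def rosenbrock_grad(p):
    # analytic gradient of the Rosenbrock function (derived by hand)
    x, y = p
    return np.array([-400.0*x*(y - x**2) - 2.0*(1.0 - x),
                      200.0*(y - x**2)])

print(rosenbrock([1.0, 1.0]), rosenbrock_grad([1.0, 1.0]))  # 0.0 and [0, 0] at the minimum
```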
Steepest descent

• The 1D line minimization must be performed using one of the earlier methods (usually cubic polynomial interpolation)
[Figure: steepest descent on the Rosenbrock function, full view and zoomed detail]

• The zig-zag behaviour is clear in the zoomed view (100 iterations)
• The algorithm crawls down the valley
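A minimal sketch of steepest descent with a line search on the Rosenbrock function. This is not the lecture's implementation: the exact 1D minimization is delegated to SciPy's Brent minimizer rather than the cubic-interpolation search mentioned above, and the starting point, tolerance and iteration limit are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rosenbrock(p):
    return 100.0*(p[1] - p[0]**2)**2 + (1.0 - p[0])**2

def rosenbrock_grad(p):
    return np.array([-400.0*p[0]*(p[1] - p[0]**2) - 2.0*(1.0 - p[0]),
                      200.0*(p[1] - p[0]**2)])

def steepest_descent(f, grad, x0, tol=1e-3, max_iter=10000):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:       # stop once the gradient is small
            return x, k
        d = -g                            # steepest-descent direction
        # 1D line minimization along d (Brent's method stands in for
        # the cubic-interpolation search described in the notes)
        alpha = minimize_scalar(lambda a: f(x + a*d)).x
        x = x + alpha*d
    return x, max_iter

x_min, iters = steepest_descent(rosenbrock, rosenbrock_grad, [-1.0, 1.0])
print(x_min, iters)   # progress is slow: the iterates zig-zag down the valley
```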
Conjugate Gradients
• Again, an explicit line minimization must be used at every step
[Figure: conjugate gradient descent on the Rosenbrock function; gradient < 1e-3 after 101 iterations]
• The algorithm converges in 101 iterations
• Far superior to steepest descent
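For completeness, a sketch of a non-linear conjugate gradient iteration with an explicit line minimization at every step. The notes do not specify which update formula was used; this sketch uses the Polak-Ribière choice (an assumption on my part) and reuses the same SciPy 1D minimizer for the line search.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rosenbrock(p):
    return 100.0*(p[1] - p[0]**2)**2 + (1.0 - p[0])**2

def rosenbrock_grad(p):
    return np.array([-400.0*p[0]*(p[1] - p[0]**2) - 2.0*(1.0 - p[0]),
                      200.0*(p[1] - p[0]**2)])

def nonlinear_cg(f, grad, x0, tol=1e-3, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                                        # first direction: steepest descent
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            return x, k
        alpha = minimize_scalar(lambda a: f(x + a*d)).x   # explicit line minimization
        x = x + alpha*d
        g_new = grad(x)
        # Polak-Ribiere coefficient (reset to steepest descent if it goes negative)
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        d = -g_new + beta*d
        g = g_new
    return x, max_iter

print(nonlinear_cg(rosenbrock, rosenbrock_grad, [-1.0, 1.0]))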
The downhill simplex (amoeba) algorithm

• Due to Nelder and Mead (1965)
• A direct method: only uses function evaluations (no derivatives)
• A simplex is the polytope in N dimensions with N+1 vertices, e.g.
  • 2D: triangle
  • 3D: tetrahedron
• Basic idea: move by reflections, expansions or contractions
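In practice the method is normally called from a library rather than re-implemented: MATLAB's fminsearch (used for the example later in this section) and SciPy's Nelder-Mead option both implement this simplex scheme. A minimal SciPy usage sketch on the Rosenbrock function (starting point and tolerances are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(p):
    return 100.0*(p[1] - p[0]**2)**2 + (1.0 - p[0])**2

# Nelder-Mead needs only function evaluations: no gradients are supplied
res = minimize(rosenbrock, x0=np.array([-1.0, 1.0]), method='Nelder-Mead',
               options={'xatol': 1e-4, 'fatol': 1e-4})
print(res.x, res.nit)   # best vertex (near [1, 1]) and the number of iterations
```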
[Figure: simplex moves (start, reflect, expand, contract)]

One iteration of the simplex algorithm
• Reorder the points so that f(x_{N+1}) > ... > f(x_2) > f(x_1) (i.e. x_{N+1} is the worst point).
• Generate a trial point x_r by reflection

  x_r = \bar{x} + \alpha(\bar{x} - x_{N+1})

  where \bar{x} = \left( \sum_i x_i \right)/(N+1) is the centroid and \alpha > 0. Compute f(x_r), and there are then 3 possibilities:

  1. f(x_1) < f(x_r) < f(x_N): replace the worst point x_{N+1} by x_r.
  2. f(x_r) < f(x_1) (a new best point): assume the direction of reflection is good and generate a new point by expansion

     x_e = x_r + \beta(x_r - \bar{x})

     where \beta > 0. If f(x_e) < f(x_r) replace x_{N+1} by x_e, otherwise replace it by x_r.
  3. f(x_r) > f(x_N): assume the polytope is too large and generate a new point by contraction

     x_c = \bar{x} + \gamma(x_{N+1} - \bar{x})

     where \gamma (0 < \gamma < 1) is the contraction coefficient. If f(x_c) improves on the worst point, replace x_{N+1} by x_c.

  Standard values are \alpha = 1, \beta = 1, \gamma = 0.5.

Example

[Figure: path of the best vertex for downhill simplex (Matlab fminsearch) with 200 iterations, plus a zoomed detail]

Example 2: contraction about a minimum

[Figure: sequence of simplex reflections followed by contractions about the minimum]

Summary
• no derivatives required
• deals well with noise in the cost function
• is able to crawl out of some local minima (though, of course, can still get stuck)

Newton’s method

Expand f(x) by its Taylor series about the point x_n:

  f(x_n + \delta x) \approx f(x_n) + g_n^\top \delta x + \frac{1}{2} \delta x^\top H_n \delta x

where the gradient is the vector

  g_n = \nabla f(x_n) = \left[ \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_N} \right]^\top

and the Hessian is the symmetric matrix

  H_n = H(x_n) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_1 \partial x_N} & \cdots & \frac{\partial^2 f}{\partial x_N^2} \end{bmatrix}

For a minimum we require that \nabla f(x) = 0, and so

  \nabla f(x) = g_n + H_n \delta x = 0

with solution \delta x = -H_n^{-1} g_n. This gives the iterative update

  x_{n+1} = x_n - H_n^{-1} g_n

Assume that H is positive definite (all eigenvalues greater than zero).

  x_{n+1} = x_n - H_n^{-1} g_n

• If f(x) is quadratic, then the solution is found in one step.
• The method has quadratic convergence (as in the 1D case).
• The solution \delta x = -H_n^{-1} g_n is guaranteed to be a downhill direction.
• Rather than jump straight to the minimum, it is better to perform a line minimization, which ensures global convergence:

  x_{n+1} = x_n - \alpha_n H_n^{-1} g_n

• If H = I then this reduces to steepest descent.

Newton’s method - example

[Figure: Newton method with line search on the Rosenbrock function; gradient < 1e-3 after 15 iterations; the ellipses show successive quadratic approximations]

• The algorithm converges in only 15 iterations, compared to 101 for conjugate gradients and 200 for simplex
• However, the method requires computing the Hessian matrix at each iteration – this is not always feasible

Quasi-Newton methods

• If the problem size is large and the Hessian matrix is dense, it may be infeasible/inconvenient to compute it directly.
• Quasi-Newton methods avoid this problem by keeping a “rolling estimate” of H(x), updated at each iteration using new gradient information.
• Common schemes are due to Broyden, Fletcher, Goldfarb and Shanno (BFGS), and also Davidon, Fletcher and Powell (DFP).

e.g. in 1D, the first derivatives are approximated by

  f'(x_0 + h/2) = \frac{f_1 - f_0}{h} \quad \text{and} \quad f'(x_0 - h/2) = \frac{f_0 - f_{-1}}{h}

and the second derivative by

  f''(x_0) = \frac{\frac{f_1 - f_0}{h} - \frac{f_0 - f_{-1}}{h}}{h} = \frac{f_1 - 2 f_0 + f_{-1}}{h^2}

For H_{n+1}, build an approximation from H_n, g_n, g_{n+1}, x_n, x_{n+1}.

Quasi-Newton: BFGS

• Set H_0 = I.
• Update according to

  H_{n+1} = H_n + \frac{q_n q_n^\top}{q_n^\top s_n} - \frac{(H_n s_n)(H_n s_n)^\top}{s_n^\top H_n s_n}

  where

  s_n = x_{n+1} - x_n
  q_n = g_{n+1} - g_n

• The matrix itself is not stored, but rather represented compactly by a few stored vectors.
• The estimate H_{n+1} is used to form a local quadratic approximation as before.
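A minimal sketch of the BFGS scheme as written above: it keeps an estimate H of the Hessian itself and solves H d = -g for the search direction at each iteration, with the same SciPy line search as before. A practical implementation would instead update the inverse (or a factorization) and guard against a non-positive q_n^T s_n; those details are omitted here.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rosenbrock(p):
    return 100.0*(p[1] - p[0]**2)**2 + (1.0 - p[0])**2

def rosenbrock_grad(p):
    return np.array([-400.0*p[0]*(p[1] - p[0]**2) - 2.0*(1.0 - p[0]),
                      200.0*(p[1] - p[0]**2)])

def bfgs(f, grad, x0, tol=1e-3, max_iter=500):
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))                       # H_0 = I
    g = grad(x)
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            return x, k
        d = np.linalg.solve(H, -g)           # quasi-Newton direction: H d = -g
        alpha = minimize_scalar(lambda a: f(x + a*d)).x   # line search
        x_new = x + alpha*d
        g_new = grad(x_new)
        s = x_new - x                        # s_n = x_{n+1} - x_n
        q = g_new - g                        # q_n = g_{n+1} - g_n
        Hs = H @ s
        # BFGS update of the Hessian estimate, as in the formula above
        H = H + np.outer(q, q)/(q @ s) - np.outer(Hs, Hs)/(s @ Hs)
        x, g = x_new, g_new
    return x, max_iter

print(bfgs(rosenbrock, rosenbrock_grad, [-1.0, 1.0]))
```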
Example

[Figure: BFGS on the Rosenbrock function]

• The method converges in 25 iterations, compared to 15 for the full Newton method

Non-linear least squares

• It is very common in applications for a cost function f(x) to be the sum of a large number of squared residuals

  f(x) = \sum_{i=1}^{M} r_i^2

• If each residual depends non-linearly on the parameters x, then the minimization of f(x) is a non-linear least squares problem.
• Examples in computer vision are maximum likelihood estimators of image relations (such as homographies and fundamental matrices) from point correspondences.

Example: aligning a 3D model to an image

Transformation parameters:
• 3D rotation matrix R
• translation 3-vector T = (T_x, T_y, T_z)^\top

Image generation:
• rotate and translate the 3D model by R and T
• project to generate the image \hat{I}_{R,T}(x, y)

[Figure: alignment set-up and cost function f]

Non-linear least squares

  f(x) = \sum_{i=1}^{M} r_i^2

The M × N Jacobian of the vector of residuals r is defined as (assume M > N)

  J(x) = \begin{bmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial r_M}{\partial x_1} & \cdots & \frac{\partial r_M}{\partial x_N} \end{bmatrix}

Consider

  \frac{\partial}{\partial x_k} \sum_i r_i^2 = 2 \sum_i r_i \frac{\partial r_i}{\partial x_k}

Hence

  \nabla f(x) = 2 J^\top r

For the Hessian we require

  \frac{\partial^2}{\partial x_l \partial x_k} \sum_i r_i^2 = 2 \frac{\partial}{\partial x_l} \sum_i r_i \frac{\partial r_i}{\partial x_k} = 2 \sum_i \frac{\partial r_i}{\partial x_k} \frac{\partial r_i}{\partial x_l} + 2 \sum_i r_i \frac{\partial^2 r_i}{\partial x_k \partial x_l}

Hence

  H(x) = 2 J^\top J + 2 \sum_{i=1}^{M} r_i R_i

where R_i is the matrix of second derivatives of the residual r_i.

• Note that the second-order term in the Hessian H(x) is multiplied by the residuals r_i.
• In most problems, the residuals will typically be small.
• Also, at the minimum, the residuals will typically be distributed with mean = 0.
• For these reasons, the second-order term is often ignored, giving the Gauss-Newton approximation to the Hessian:

  H(x) = 2 J^\top J

• Hence, explicit computation of the full Hessian can again be avoided.

Example – Gauss-Newton

The minimization of the Rosenbrock function

  f(x, y) = 100(y - x^2)^2 + (1 - x)^2

can be written as a least-squares problem with residual vector

  r = \begin{bmatrix} 10(y - x^2) \\ (1 - x) \end{bmatrix}

  J(x) = \begin{bmatrix} \frac{\partial r_1}{\partial x} & \frac{\partial r_1}{\partial y} \\ \frac{\partial r_2}{\partial x} & \frac{\partial r_2}{\partial y} \end{bmatrix} = \begin{bmatrix} -20x & 10 \\ -1 & 0 \end{bmatrix}

The true Hessian is

  H(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 1200x^2 - 400y + 2 & -400x \\ -400x & 200 \end{bmatrix}

The Gauss-Newton approximation of the Hessian is

  2 J^\top J = 2 \begin{bmatrix} -20x & -1 \\ 10 & 0 \end{bmatrix} \begin{bmatrix} -20x & 10 \\ -1 & 0 \end{bmatrix} = \begin{bmatrix} 800x^2 + 2 & -400x \\ -400x & 200 \end{bmatrix}

The update is

  x_{n+1} = x_n - \alpha_n H_n^{-1} g_n \quad \text{with} \quad H_n(x) = 2 J_n^\top J_n

[Figure: Gauss-Newton method with line search on the Rosenbrock function; gradient < 1e-3 after 14 iterations]

• Minimization with the Gauss-Newton approximation with line search takes only 14 iterations

Comparison

[Figure: Newton with line search (gradient < 1e-3 after 15 iterations) vs Gauss-Newton with line search (gradient < 1e-3 after 14 iterations)]

Newton:
• requires computing Hessian
• exact solution if quadratic

Gauss-Newton:
• approximates Hessian by Jacobian product
• requires only derivatives

[Figure: trajectories of gradient descent, conjugate gradient, Newton and Gauss-Newton; NB axes do not have equal scales]

[Figure: the same four methods with no line search]
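Finally, a sketch of the Gauss-Newton iteration with line search for the Rosenbrock least-squares formulation above, using the residual vector and Jacobian from the slides; the gradient 2 J^T r and the approximate Hessian 2 J^T J are formed explicitly, and the starting point, tolerance and iteration limit are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def residuals(p):
    # r = [10 (y - x^2), (1 - x)]: then f = r^T r is the Rosenbrock function
    x, y = p
    return np.array([10.0*(y - x**2), 1.0 - x])

def jacobian(p):
    # Jacobian of r with respect to (x, y), as given on the slide
    x, _ = p
    return np.array([[-20.0*x, 10.0],
                     [-1.0,     0.0]])

def f(p):
    r = residuals(p)
    return r @ r

def gauss_newton(x0, tol=1e-3, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        r, J = residuals(x), jacobian(x)
        g = 2.0 * J.T @ r                    # gradient: 2 J^T r
        if np.linalg.norm(g) < tol:
            return x, k
        H = 2.0 * J.T @ J                    # Gauss-Newton Hessian approximation
        d = np.linalg.solve(H, -g)           # Gauss-Newton step direction
        alpha = minimize_scalar(lambda a: f(x + a*d)).x   # line search
        x = x + alpha*d
    return x, max_iter

print(gauss_newton([-1.0, 1.0]))   # ends up near the minimum at [1, 1]
```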