1 Gradient-Based Optimization
1.1 General Algorithm for Smooth Functions

All algorithms for unconstrained gradient-based optimization can be described as follows. We start with iteration number $k = 0$ and a starting point, $x_k$.

1. Test for convergence. If the conditions for convergence are satisfied, then we can stop and $x_k$ is the solution.
2. Compute a search direction. Compute the vector $p_k$ that defines the direction in $n$-space along which we will search.
3. Compute the step length. Find a positive scalar, $\alpha_k$, such that $f(x_k + \alpha_k p_k) < f(x_k)$.
4. Update the design variables. Set

$$x_{k+1} = x_k + \underbrace{\alpha_k p_k}_{\Delta x_k} \qquad (1)$$

set $k = k + 1$, and go back to 1.

There are two subproblems in this type of algorithm for each major iteration: computing the search direction $p_k$ and finding the step size (controlled by $\alpha_k$). The difference between the various types of gradient-based algorithms is the method used to compute the search direction.

AA222: Introduction to MDO

1.2 Optimality Conditions

Consider a function $f(x)$ where $x$ is the $n$-vector $x = [x_1, x_2, \ldots, x_n]^T$. The gradient vector of this function is given by the partial derivatives with respect to each of the independent variables,

$$\nabla f(x) \equiv g(x) \equiv \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix} \qquad (2)$$

In the multivariate case, the gradient vector is perpendicular to the hyperplane tangent to the contour surfaces of constant $f$.

Higher derivatives of multivariable functions are defined as in the single-variable case, but note that the number of gradient components increases by a factor of $n$ for each differentiation.
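The four-step loop above can be sketched in code. This is an illustrative sketch rather than part of the notes: the helper names (`minimize`, `backtracking`) and the Armijo backtracking rule for step 3 are choices of this example; any search-direction rule can be plugged in (here, steepest descent on a simple quadratic):

```python
import numpy as np

def backtracking(f, g, x, p, alpha=1.0, rho=0.5, c=1e-4):
    """Step 3: shrink alpha until a sufficient decrease (Armijo) holds."""
    while f(x + alpha * p) > f(x) + c * alpha * (g @ p):
        alpha *= rho
    return alpha

def minimize(f, grad, x0, direction, tol=1e-6, max_iter=500):
    """Generic gradient-based loop matching steps 1-4 above."""
    x = x0.astype(float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:        # 1. test for convergence
            return x, k
        p = direction(g)                    # 2. compute search direction
        alpha = backtracking(f, g, x, p)    # 3. compute step length
        x = x + alpha * p                   # 4. update design variables
    return x, max_iter

# Usage: steepest descent direction p = -g on f(x) = 0.5 x^T x
f = lambda x: 0.5 * (x @ x)
grad = lambda x: x
x_star, iters = minimize(f, grad, np.array([3.0, -4.0]), direction=lambda g: -g)
```

Swapping the `direction` argument is all that distinguishes the gradient-based methods discussed in the rest of this section.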
While the gradient of a function of $n$ variables is an $n$-vector, the "second derivative" of an $n$-variable function is defined by $n^2$ partial derivatives (the derivatives of the $n$ first partial derivatives with respect to the $n$ variables):

$$\frac{\partial^2 f}{\partial x_i \partial x_j}, \; i \neq j, \quad \text{and} \quad \frac{\partial^2 f}{\partial x_i^2}, \; i = j. \qquad (3)$$

If the partial derivatives $\partial f/\partial x_i$, $\partial f/\partial x_j$, and $\partial^2 f/\partial x_i \partial x_j$ are continuous and $f$ is single-valued, then $\partial^2 f/\partial x_j \partial x_i$ exists and $\partial^2 f/\partial x_i \partial x_j = \partial^2 f/\partial x_j \partial x_i$. Therefore the second-order partial derivatives can be represented by a square symmetric matrix called the Hessian matrix,

$$\nabla^2 f(x) \equiv H(x) \equiv \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \qquad (4)$$

which contains $n(n+1)/2$ independent elements. If $f$ is quadratic, the Hessian of $f$ is constant, and the function can be expressed as

$$f(x) = \frac{1}{2} x^T H x + g^T x + \alpha. \qquad (5)$$

As in the single-variable case, the optimality conditions can be derived from the Taylor-series expansion of $f$ about $x^*$:

$$f(x^* + \varepsilon p) = f(x^*) + \varepsilon p^T g(x^*) + \frac{1}{2} \varepsilon^2 p^T H(x^* + \varepsilon \theta p)\, p, \qquad (6)$$

where $0 \le \theta \le 1$, $\varepsilon$ is a scalar, and $p$ is an $n$-vector. For $x^*$ to be a local minimum, for any vector $p$ there must be a finite $\varepsilon$ such that $f(x^* + \varepsilon p) \ge f(x^*)$, i.e., there is a neighborhood in which this condition holds. If this condition is satisfied, then $f(x^* + \varepsilon p) - f(x^*) \ge 0$ and the first- and second-order terms in the Taylor-series expansion must be greater than or equal to zero.

As in the single-variable case, and for the same reason, we start by considering the first-order term. Since $p$ is an arbitrary vector and $\varepsilon$ can be positive or negative, every component of the gradient vector $g(x^*)$ must be zero.

Now we have to consider the second-order term, $\varepsilon^2 p^T H(x^* + \varepsilon \theta p)\, p$. For this term to be non-negative, $H(x^* + \varepsilon \theta p)$ has to be positive semi-definite, and by continuity, the Hessian at the optimum, $H(x^*)$, must also be positive semi-definite.
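To make Eqs. (4) and (5) concrete, a finite-difference check confirms numerically that the Hessian of a quadratic is symmetric and constant (independent of the evaluation point). The helper name `hessian_fd`, the step size $h$, and the example matrix are choices of this sketch, not from the notes:

```python
import numpy as np

def hessian_fd(f, x, h=1e-4):
    """Approximate the n^2 second partials of Eq. (4) by central differences."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h**2)
    return H

# Quadratic as in Eq. (5): f(x) = 0.5 x^T H x + g^T x + alpha
H_true = np.array([[3.0, 1.0],
                   [1.0, 2.0]])
g = np.array([1.0, -1.0])
f = lambda x: 0.5 * (x @ H_true @ x) + g @ x + 4.0

H1 = hessian_fd(f, np.array([0.0, 0.0]))   # Hessian at one point...
H2 = hessian_fd(f, np.array([5.0, -2.0]))  # ...equals the Hessian at another
```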
Necessary conditions (for a local minimum):

$$\|g(x^*)\| = 0 \quad \text{and} \quad H(x^*) \text{ is positive semi-definite.} \qquad (7)$$

Sufficient conditions (for a strong local minimum):

$$\|g(x^*)\| = 0 \quad \text{and} \quad H(x^*) \text{ is positive definite.} \qquad (8)$$

Some definitions from linear algebra that might be helpful:

- The matrix $H \in \mathbb{R}^{n \times n}$ is positive definite if $p^T H p > 0$ for all nonzero vectors $p \in \mathbb{R}^n$. (If $H = H^T$, then all the eigenvalues of $H$ are strictly positive.)
- The matrix $H \in \mathbb{R}^{n \times n}$ is positive semi-definite if $p^T H p \ge 0$ for all vectors $p \in \mathbb{R}^n$. (If $H = H^T$, then the eigenvalues of $H$ are positive or zero.)
- The matrix $H \in \mathbb{R}^{n \times n}$ is indefinite if there exist $p, q \in \mathbb{R}^n$ such that $p^T H p > 0$ and $q^T H q < 0$. (If $H = H^T$, then $H$ has eigenvalues of mixed sign.)
- The matrix $H \in \mathbb{R}^{n \times n}$ is negative definite if $p^T H p < 0$ for all nonzero vectors $p \in \mathbb{R}^n$. (If $H = H^T$, then all the eigenvalues of $H$ are strictly negative.)

2-D Critical Point Examples (figure)

Example 1.1: Critical Points of a Function

Consider the function

$$f(x) = 1.5 x_1^2 + x_2^2 - 2 x_1 x_2 + 2 x_1^3 + 0.5 x_1^4$$

Find all stationary points of $f$ and classify them. Solving $\nabla f(x) = 0$ yields three solutions:

- $(0, 0)$: local minimum
- $\frac{1}{2}\left(-3 - \sqrt{7},\; -3 - \sqrt{7}\right)$: global minimum
- $\frac{1}{2}\left(-3 + \sqrt{7},\; -3 + \sqrt{7}\right)$: saddle point

To establish the type of each point, we have to determine whether the Hessian there is positive definite and compare the values of the function at the points.

Figure 1: Critical points of $f(x) = 1.5 x_1^2 + x_2^2 - 2 x_1 x_2 + 2 x_1^3 + 0.5 x_1^4$

1.3 Steepest Descent Method

The steepest descent method uses the negative of the gradient vector at each point as the search direction for each iteration. As mentioned previously, the gradient vector is orthogonal to the plane tangent to the isosurfaces of the function. The gradient vector at a point, $g(x_k)$, is the direction of maximum rate of change (maximum increase) of the function at that point. This rate of change is given by the norm, $\|g(x_k)\|$.
Steepest descent algorithm:

1. Select a starting point $x_0$ and convergence parameters $\varepsilon_g$, $\varepsilon_a$, and $\varepsilon_r$.
2. Compute $g(x_k) \equiv \nabla f(x_k)$. If $\|g(x_k)\| \le \varepsilon_g$, then stop. Otherwise, compute the normalized search direction $p_k = -g(x_k)/\|g(x_k)\|$.
3. Perform a line search to find the step length $\alpha_k$ in the direction $p_k$.
4. Update the current point, $x_{k+1} = x_k + \alpha_k p_k$.
5. Evaluate $f(x_{k+1})$. If the condition $|f(x_{k+1}) - f(x_k)| \le \varepsilon_a + \varepsilon_r |f(x_k)|$ is satisfied for two successive iterations, then stop. Otherwise, set $k = k + 1$ and return to step 2.

Here, $|f(x_{k+1}) - f(x_k)| \le \varepsilon_a + \varepsilon_r |f(x_k)|$ is a check for successive reductions of $f$. $\varepsilon_a$ is the absolute tolerance on the change in function value (usually small, $\approx 10^{-6}$), and $\varepsilon_r$ is the relative tolerance (usually set to 0.01).

If we use an exact line search, the steepest descent direction at each iteration is orthogonal to the previous one, i.e.,

$$\frac{\mathrm{d} f(x_{k+1})}{\mathrm{d} \alpha} = 0 \;\Rightarrow\; \frac{\partial f(x_{k+1})}{\partial x_{k+1}} \frac{\partial x_{k+1}}{\partial \alpha} = 0 \;\Rightarrow\; \nabla^T f(x_{k+1})\, p_k = 0 \qquad (9)$$

$$\Rightarrow\; -g^T(x_{k+1})\, g(x_k) = 0 \qquad (10)$$

Therefore the method "zigzags" in the design space and is rather inefficient. Although a substantial decrease may be observed in the first few iterations, the method is usually very slow after that. In particular, while the algorithm is guaranteed to converge, it may take an infinite number of iterations. The rate of convergence is linear.

For steepest descent and other gradient methods that do not produce well-scaled search directions, we need to use other information to guess a step length. One strategy is to assume that the first-order change in $x_k$ will be the same as the one obtained in the previous step,
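The zigzag behavior is easy to reproduce on a quadratic $f(x) = \frac{1}{2} x^T A x$, where the exact line-search step along $-g_k$ has the closed form $\alpha_k = g_k^T g_k / (g_k^T A g_k)$ (a standard result, not derived in these notes). A minimal sketch with a badly scaled $A$:

```python
import numpy as np

# Steepest descent with exact line search on f(x) = 0.5 x^T A x.
A = np.array([[10.0, 0.0],
              [ 0.0, 1.0]])            # badly scaled -> pronounced zigzag
x = np.array([1.0, 10.0])
path = [x.copy()]
for k in range(200):
    g = A @ x                          # gradient of the quadratic
    if np.linalg.norm(g) <= 1e-8:
        break
    alpha = (g @ g) / (g @ A @ g)      # exact minimizer along p = -g
    x = x - alpha * g
    path.append(x.copy())
```

Per Eq. (10), successive gradients along `path` are orthogonal, and many iterations are needed even for this two-variable quadratic: linear convergence, with a rate governed by the scaling of $A$.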
i.e., that $\bar{\alpha}_k\, g_k^T p_k = \alpha_{k-1}\, g_{k-1}^T p_{k-1}$, and therefore:

$$\bar{\alpha}_k = \alpha_{k-1} \frac{g_{k-1}^T p_{k-1}}{g_k^T p_k}. \qquad (11)$$

Example 1.2: Steepest Descent Applied to $f(x_1, x_2) = 1 - e^{-(10 x_1^2 + x_2^2)}$

Figure 2: Solution path of the steepest descent method

1.4 Conjugate Gradient Method

A small modification to the steepest descent method takes into account the history of the gradients to move more directly towards the optimum. Suppose we want to minimize a convex quadratic function

$$\phi(x) = \frac{1}{2} x^T A x - b^T x \qquad (12)$$

where $A$ is an $n \times n$ matrix that is symmetric and positive definite. Differentiating this with respect to $x$ we obtain

$$\nabla \phi(x) = A x - b \equiv r(x). \qquad (13)$$

Minimizing the quadratic is thus equivalent to solving the linear system

$$A x = b. \qquad (14)$$

The conjugate gradient method is an iterative method for solving linear systems of equations such as this one. A set of nonzero vectors $\{p_0, p_1, \ldots, p_{n-1}\}$ is conjugate with respect to $A$ if

$$p_i^T A p_j = 0, \quad \text{for all } i \neq j. \qquad (15)$$

Suppose that we start from a point $x_0$ and a set of directions $\{p_0, p_1, \ldots, p_{n-1}\}$ to generate a sequence $\{x_k\}$ where

$$x_{k+1} = x_k + \alpha_k p_k \qquad (16)$$

where $\alpha_k$ is the minimizer of $\phi$ along $x_k + \alpha p_k$, given by

$$\alpha_k = -\frac{r_k^T p_k}{p_k^T A p_k}. \qquad (17)$$

We will see that for any $x_0$ the sequence $\{x_k\}$ generated by the conjugate direction algorithm converges to the solution of the linear system in at most $n$ steps.
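A minimal conjugate gradient sketch for Eq. (14) follows. The notes so far only state Eqs. (15)–(17); the rule used here for generating each new conjugate direction (the Fletcher–Reeves update $\beta_k = r_{k+1}^T r_{k+1} / r_k^T r_k$) and the function name are this example's choices:

```python
import numpy as np

def conjugate_gradient(A, b, x0):
    """Solve Ax = b for symmetric positive-definite A in at most n steps."""
    x = x0.astype(float)
    r = A @ x - b                          # residual r(x) = Ax - b, Eq. (13)
    p = -r                                 # first direction: steepest descent
    for k in range(len(b)):
        if np.linalg.norm(r) <= 1e-12:
            return x, k
        alpha = -(r @ p) / (p @ A @ p)     # Eq. (17)
        x = x + alpha * p                  # Eq. (16)
        r_new = A @ x - b
        beta = (r_new @ r_new) / (r @ r)   # Fletcher-Reeves update
        p = -r_new + beta * p              # next A-conjugate direction
        r = r_new
    return x, len(b)

# Usage on a random 4x4 symmetric positive-definite system
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)
b = rng.standard_normal(4)
x, iters = conjugate_gradient(A, b, np.zeros(4))
```

In exact arithmetic the directions produced this way satisfy Eq. (15), so the method terminates in at most $n = 4$ steps here, as the next part of the notes shows.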