AA222: MDO 133 Monday 30th April, 2012 at 16:25
Chapter 6
Gradient-Free Optimization
6.1 Introduction
When applying optimization to practical problems, we often encounter one or more of the following challenges:

• non-differentiable functions and/or constraints
• disconnected and/or non-convex feasible space
• discrete feasible space
• mixed variables (discrete, continuous, permutation)
• large dimensionality
• multiple local minima (multi-modality)
• multiple objectives

Gradient-based optimizers are efficient at finding local minima for high-dimensional, nonlinearly-constrained, convex problems; however, most gradient-based optimizers have trouble dealing with noisy and discontinuous functions, and they are not designed to handle multi-modal problems or discrete and mixed discrete-continuous design variables. Consider, for example, the Griewank function:

f(x) = Σ_{i=1}^{n} x_i^2/4000 − Π_{i=1}^{n} cos(x_i/√i) + 1,   −600 ≤ x_i ≤ 600.   (6.1)
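Eq. (6.1) translates directly into code. The following is a minimal sketch (the function name `griewank` and the test points are ours, not part of the original text):

```python
import math

def griewank(x):
    """Griewank function, Eq. (6.1): a quadratic bowl overlaid with a
    cosine ripple that creates a huge number of shallow local minima."""
    sum_term = sum(xi ** 2 for xi in x) / 4000.0
    prod_term = math.prod(math.cos(xi / math.sqrt(i + 1))
                          for i, xi in enumerate(x))
    return sum_term - prod_term + 1.0

# The global minimum is f(0) = 0; almost any other point sits in a
# local minimum created by the cosine product.
print(griewank([0.0, 0.0]))      # 0.0
print(griewank([100.0, 100.0]))  # dominated by the quadratic term
```

Plotting this function over the whole domain hides the ripple entirely, which is exactly the effect shown in Figure 6.2.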
Figure 6.1: Graphs illustrating the various types of functions that are problematic for gradient-based optimization algorithms
Figure 6.2: The Griewank function looks deceptively smooth when plotted in a large domain (left), but when you zoom in, you can see that the design space has multiple local minima (center) although the function is still smooth (right)
How could we find the best solution for this example?
• Multiple restarts of a gradient-based (local) optimizer from different starting points
• Systematically search the design space
• Use gradient-free optimizers
Many gradient-free methods mimic mechanisms observed in nature or use heuristics. Unlike gradient-based methods in a convex search space, gradient-free methods are not necessarily guaranteed to find the true global optimum, but they are able to find many good solutions (the mathematician’s answer vs. the engineer’s answer). The key strength of gradient-free methods is their ability to solve problems that are difficult to solve using gradient-based methods. Furthermore, many of them are designed as global optimizers and are thus able to find multiple local optima while searching for the global optimum. Various gradient-free methods have been developed. We are going to look at some of the most commonly used algorithms:
• Nelder–Mead Simplex (Nonlinear Simplex)
• Divided Rectangles Method
• Genetic Algorithms
• Particle Swarm Optimization
6.2 Nelder–Mead Simplex
The simplex method of Nelder and Mead performs a search in n-dimensional space using heuristic ideas. It is also known as the nonlinear simplex and is not to be confused with the linear simplex, with which it has nothing in common. Its main strengths are that it requires no derivatives and that it does not require the objective function to be smooth. The weakness of this method is that it is not very efficient, particularly for problems with more than about 10 design variables; above this number of variables, convergence becomes increasingly difficult.
Figure 6.3: Starting simplex for n = 2, x0 = [0, 0]^T and c = 1. The two other vertices are [a, b] and [b, a].
A simplex is a structure in n-dimensional space formed by n + 1 points that are not in the same plane. Hence a line segment is a 1-dimensional simplex, a triangle is a 2-dimensional simplex, and a tetrahedron forms a simplex in 3-dimensional space. The simplex is also called a hypertetrahedron. The Nelder–Mead algorithm starts with a simplex (n + 1 sets of design variables x) and then modifies the simplex at each iteration using four simple operations. The sequence of operations to be performed is chosen based on the relative values of the objective function at each of the points.

The first step of the simplex algorithm is to find the n + 1 points of the simplex given an initial guess x0. This can be done by simply adding a step to each component of x0 to generate n new points. However, generating a simplex with equal-length edges is preferable. Suppose the length of all sides is required to be c and that the initial guess, x0, is the (n + 1)th point. The remaining points of the simplex, i = 1, . . . , n, can be computed by adding a vector to x0 whose components are all b except for the ith component, which is set to a, where
b = c/(n√2) (√(n + 1) − 1),   (6.2)

a = b + c/√2.   (6.3)
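The construction of Eqs. (6.2)–(6.3) can be sketched as follows (the function name and list-based layout are ours):

```python
import math

def initial_simplex(x0, c):
    """Build the n+1 vertices of an equilateral simplex with edge
    length c, per Eqs. (6.2)-(6.3). x0 is the (n+1)-th vertex; vertex i
    adds b to every component of x0 except the i-th, which gets a."""
    n = len(x0)
    b = c / (n * math.sqrt(2)) * (math.sqrt(n + 1) - 1)
    a = b + c / math.sqrt(2)
    vertices = [list(x0)]
    for i in range(n):
        vertices.append([x0[j] + (a if j == i else b) for j in range(n)])
    return vertices

verts = initial_simplex([0.0, 0.0], 1.0)
# Every pair of the three vertices is separated by exactly c = 1,
# reproducing the equilateral triangle of Figure 6.3.
```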
After generating the initial simplex, we have to evaluate the objective function at each of its vertices in order to identify three key points: the points associated with the highest (“worst”), the second highest (“lousy”), and the lowest (“best”) values of the objective. The Nelder–Mead algorithm performs four main operations on the simplex: reflection, expansion, contraction (inside or outside), and shrinking. Each of these operations generates a new point, and the sequence of operations performed in one iteration depends on the value of the objective at the new point relative to the other key points. Let xw, xl and xb denote the worst, the second worst (or “lousy”), and best points, respectively, among all n + 1 points of the simplex. We first
Figure 6.4: Starting simplex for n = 3, x0 = [0, 0, 0]^T and c = 1. The three other vertices are [a, b, b], [b, a, b] and [b, b, a].
calculate the average of the n points, excluding the worst,
x_a = (1/n) Σ_{i=1, i≠w}^{n+1} x_i.   (6.4)
After computing xa, we know that the line from xw to xa is a descent direction. A new point is found on this line by reflection, given by
xr = xa + α (xa − xw) ,   (6.5)

where the reflection parameter is usually set to α = 1.
If the value of the function at this reflected point is better than the best point, then the reflection has been especially successful and we step further in the same direction by performing an expansion,
xe = xr + γ (xr − xa) ,   (6.6)

where the expansion parameter γ is usually set to 1. If, however, the reflected point is worse than the worst point, we assume that a better point exists between xw and xa and perform an inside contraction,
xc = xa − β (xa − xw) , (6.7)
where the contraction factor is usually set to β = 0.5. If the reflected point is not worse than the worst but is still worse than the lousy point, then an outside contraction is performed, that is,
xo = xa + β (xa − xw) . (6.8)
After the expansion, we accept the new point if the value of the objective function is better than the best point. Otherwise, we just accept the previous reflected point.
Figure 6.5: Operations performed on the simplex in Nelder–Mead’s algorithm for n = 2.
If reflection, expansion, and contraction all fail, we resort to a shrinking operation. This operation retains the best point and shrinks the simplex; that is, for all points of the simplex except the best one, we compute a new position
xi = xb + ρ (xi − xb) , (6.9)
where the scaling parameter is usually set to ρ = 0.5.
The simplex operations are shown in Figure 6.5 for the n = 2 case. A flow chart of the algorithm is shown in Figure 6.6 and the algorithm itself in Algorithm 12.
Figure 6.6: Flow chart for the Nelder–Mead algorithm. The best point is that with the lowest value of the objective function, the worst point is that with the highest value, and the lousy point has the second highest value.
Algorithm 12 Nelder–Mead Algorithm
Input: Initial guess, x0
Output: Optimum, x*

k ← 0
Create a simplex with edge length c
repeat
    Identify the highest (xw: worst), second highest (xl: lousy) and lowest (xb: best) value points, with function values fw, fl, and fb, respectively
    Evaluate xa, the average of the points in the simplex excluding xw
    Perform reflection to obtain xr, evaluate fr
    if fr < fb then
        Perform expansion to obtain xe, evaluate fe
        if fe < fb then
            xw ← xe, fw ← fe (accept expansion)
        else
            xw ← xr, fw ← fr (accept reflection)
        end if
    else if fr ≤ fl then
        xw ← xr, fw ← fr (accept reflected point)
    else
        if fr > fw then
            Perform an inside contraction and evaluate fc
            if fc < fw then
                xw ← xc (accept contraction)
            else
                Shrink the simplex
            end if
        else
            Perform an outside contraction and evaluate fc
            if fc ≤ fr then
                xw ← xc (accept contraction)
            else
                Shrink the simplex
            end if
        end if
    end if
    k ← k + 1
until (fw − fb) < (ε1 + ε2 |fb|)
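Algorithm 12 can be transcribed into a short, self-contained program. The sketch below follows the text's parameter names (γ for expansion, β for contraction, ρ for shrink, with the values the text recommends); the reflection parameter α = 1 is a conventional choice, and the tolerances and iteration cap are illustrative, not prescribed:

```python
import math

def nelder_mead(f, x0, c=1.0, alpha=1.0, gamma=1.0, beta=0.5, rho=0.5,
                eps1=1e-8, eps2=1e-8, max_iter=500):
    """Sketch of Algorithm 12 (Nelder-Mead) for a list-valued x0."""
    n = len(x0)
    # Equilateral starting simplex, Eqs. (6.2)-(6.3)
    b = c / (n * math.sqrt(2)) * (math.sqrt(n + 1) - 1)
    a = b + c / math.sqrt(2)
    simplex = [list(x0)] + [
        [x0[j] + (a if j == i else b) for j in range(n)] for i in range(n)]
    fvals = [f(x) for x in simplex]

    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda i: fvals[i])
        bi, li, wi = order[0], order[-2], order[-1]  # best, lousy, worst
        fb, fl, fw = fvals[bi], fvals[li], fvals[wi]
        if fw - fb < eps1 + eps2 * abs(fb):
            break
        # Average of all points except the worst, Eq. (6.4)
        xa = [sum(simplex[i][j] for i in range(n + 1) if i != wi) / n
              for j in range(n)]
        xw = simplex[wi]
        xr = [xa[j] + alpha * (xa[j] - xw[j]) for j in range(n)]  # reflect
        fr = f(xr)
        if fr < fb:
            xe = [xr[j] + gamma * (xr[j] - xa[j]) for j in range(n)]  # expand
            fe = f(xe)
            simplex[wi], fvals[wi] = (xe, fe) if fe < fb else (xr, fr)
        elif fr <= fl:
            simplex[wi], fvals[wi] = xr, fr  # accept reflection
        else:
            if fr > fw:  # inside contraction, Eq. (6.7)
                xc = [xa[j] - beta * (xa[j] - xw[j]) for j in range(n)]
                fc = f(xc)
                accept = fc < fw
            else:        # outside contraction, Eq. (6.8)
                xc = [xa[j] + beta * (xa[j] - xw[j]) for j in range(n)]
                fc = f(xc)
                accept = fc <= fr
            if accept:
                simplex[wi], fvals[wi] = xc, fc
            else:        # shrink toward the best point, Eq. (6.9)
                for i in range(n + 1):
                    if i != bi:
                        simplex[i] = [simplex[bi][j]
                                      + rho * (simplex[i][j] - simplex[bi][j])
                                      for j in range(n)]
                        fvals[i] = f(simplex[i])
    best = min(range(n + 1), key=lambda i: fvals[i])
    return simplex[best], fvals[best]

# Illustrative run on a simple quadratic with minimum at (3, -1)
x_star, f_star = nelder_mead(
    lambda x: (x[0] - 3) ** 2 + (x[1] + 1) ** 2, [0.0, 0.0])
```

Note that no gradients appear anywhere: every decision is made by comparing function values at the vertices.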
The convergence criterion can also be that the size of the simplex,

s = Σ_{i=1}^{n} |x_i − x_{n+1}|,   (6.10)

must be less than a certain tolerance. Another measure of convergence that can be used is the standard deviation,

σ = √( Σ_{i=1}^{n+1} (f_i − f̄)² / (n + 1) ),
Figure 6.7: Sequence of simplices that minimize the Rosenbrock function
where f̄ is the mean of the n + 1 function values.

Variations of the simplex algorithm described above have been tried, with interesting results. For example, note that if fe < fb but fr is even better, that is, fr < fe, the algorithm as described still accepts the expanded point xe. It is now standard practice to accept the best of fr and fe.

Example 6.28. Minimization of the Rosenbrock Function Using Nelder–Mead
Figure 6.7 shows the sequence of simplices that results when minimizing the Rosenbrock function. The initial simplex on the upper left is equilateral. The first iteration is an inside contraction, followed by a reflection, another inside contraction, and then an expansion. The simplices then reduce dramatically in size and follow the Rosenbrock valley, slowly converging to the minimum.
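The two convergence measures above, Eq. (6.10) and the standard deviation σ, can be computed as follows (the helper names are ours; |·| in Eq. (6.10) is read as the Euclidean distance between vertices):

```python
import math

def simplex_size(simplex):
    """Eq. (6.10): sum of distances from each vertex to the (n+1)-th one."""
    x_last = simplex[-1]
    return sum(math.dist(x, x_last) for x in simplex[:-1])

def fvalue_std(fvals):
    """Standard deviation of the n+1 vertex function values."""
    fbar = sum(fvals) / len(fvals)
    return math.sqrt(sum((fi - fbar) ** 2 for fi in fvals) / len(fvals))

# A right triangle with two unit legs meeting at the last vertex:
s = simplex_size([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # 1 + 1 = 2.0
sigma = fvalue_std([1.0, 2.0, 3.0])
```

As the simplex collapses onto a minimum, both s and σ go to zero, so either can serve as a stopping test.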
6.3 DIvided RECTangles (DIRECT) Method
As we saw before, one of the strategies for global optimization is to perform a systematic search of the design space. The DIRECT method uses a hyperdimensional adaptive meshing scheme to search the entire design space for the optimum. The overall idea behind DIRECT is as follows.
1. The algorithm begins by scaling the design box to an n-dimensional unit hypercube and evaluating the objective function at the center point of the hypercube
2. The method then divides the potentially optimal hyper-rectangles by sampling the longest coordinate directions of the hyper-rectangle and trisecting based on the directions with the smallest function value until the global minimum is found

Figure 6.8: Example of DIRECT Iterations
3. Sampling the maximum-length directions prevents boxes from becoming overly skewed, and trisecting in the direction of the best function value ensures that the largest rectangles contain the best function values. This strategy increases the attractiveness of searching near points with good function values
4. Iterating the above procedure allows the algorithm to identify and zoom into the most promising regions of the design space
To identify the potentially optimal rectangles, we do the following. Assuming that the unit hypercube with center ci is divided into m hyper-rectangles, a hyper-rectangle j is said to be potentially optimal if there exists a rate-of-change constant K̄ > 0 such that
f(c_j) − K̄ d_j ≤ f(c_i) − K̄ d_i   for all i = 1, . . . , m,
f(c_j) − K̄ d_j ≤ f_min − ε|f_min|,   (6.11)

where d_i is the distance between c_i and the vertices of its hyper-rectangle, f_min is the best current value of the objective function, and ε is a positive parameter used so that f(c_j) has the potential to improve on the current best solution by a non-trivial amount.

To visualize these equations, we plot f versus d for a given group of points in Figure 6.9. The line connecting the points with the lowest f for a given d (or the greatest d for a given f) represents the points with the most potential. The first equation forces the selection of the rectangles on this line, and the second equation insists that the attainable function value improve on the current best function value by an amount that is not insignificant. This prevents the algorithm from becoming
Figure 6.9: Identification of potentially optimal rectangles
too local, wasting precious function evaluations in search of small function improvements. The parameter ε balances the search between local and global. A typical value is ε = 10^−4, and its range is usually 10^−7 ≤ ε ≤ 10^−2.
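The selection test of Eq. (6.11) can be sketched with a simple O(m²) feasibility check on the admissible range of K̄ (the function name and sample data are ours; production DIRECT codes typically use a convex-hull computation instead of this pairwise check):

```python
import math

def potentially_optimal(d, f, eps=1e-4):
    """Return indices j of potentially optimal rectangles per Eq. (6.11).
    d[i] is the center-to-vertex distance of rectangle i, f[i] the
    objective value at its center c_i."""
    m = len(d)
    fmin = min(f)
    result = []
    for j in range(m):
        lo, hi = 0.0, math.inf  # admissible interval for K-bar
        feasible = True
        for i in range(m):
            if i == j:
                continue
            if d[i] < d[j]:
                # first condition gives a lower bound on K-bar
                lo = max(lo, (f[j] - f[i]) / (d[j] - d[i]))
            elif d[i] > d[j]:
                # ... and an upper bound against larger rectangles
                hi = min(hi, (f[i] - f[j]) / (d[i] - d[j]))
            elif f[i] < f[j]:
                feasible = False  # same size but strictly better center
                break
        # second condition: beat fmin by a non-trivial amount
        lo = max(lo, (f[j] - fmin + eps * abs(fmin)) / d[j])
        if feasible and lo <= hi:
            result.append(j)
    return result

# Four hypothetical rectangles; indices 1 and 3 lie on the lower-right
# hull of the (d, f) scatter plot, as in Figure 6.9.
print(potentially_optimal([0.1, 0.3, 0.3, 0.5], [1.0, 0.5, 2.0, 1.5]))
```

Rectangles selected this way are exactly those on the lower-right boundary of the (d, f) cloud in Figure 6.9: large rectangles with good center values.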
Algorithm 13 DIRECT Algorithm
Input: Initial guess, x0
Output: Optimum, x*

k ← 0
Normalize the search space to be the unit hypercube. Let c1 be the center point of this hypercube and evaluate f(c1).
repeat
    Identify the set S of potentially optimal rectangles/cubes, that is, all those rectangles defining the bottom of the convex hull of a scatter plot of rectangle diameter versus f(ci) for all rectangle centers ci
    for all rectangles r ∈ S do
        Identify the set I of dimensions with the maximum side length
        Set δ equal to one third of this maximum side length
        for all i ∈ I do
            Evaluate the function at the points cr ± δei, where cr is the center of rectangle r and ei is the ith unit vector
        end for
        Divide rectangle r into thirds along the dimensions in I, starting with the dimension with the lowest value of f(cr ± δei) and continuing to the dimension with the highest f(cr ± δei)
    end for
    k ← k + 1
until Converged
Figure 6.10: Minimization of the Rosenbrock function using DIRECT
Example 6.29. Minimization of the Rosenbrock Function Using DIvided RECTangles Method
6.4 Genetic Algorithms
Genetic algorithms for optimization were inspired by the process of natural evolution of organisms. This type of algorithm was first developed by John Holland in the mid-1960s. Holland was motivated by a desire to better understand the evolution of life by simulating it in a computer, and by the possibility of using this process in optimization. Genetic algorithms are based on three essential components:
• Survival of the fittest (Selection)
• Reproduction processes where genetic traits are propagated (Crossover)
• Variation (Mutation)
We will use the term “genetic algorithms” generically to refer to a large family of optimization approaches that use the three components above. Depending on the approach, they have different names, for example: genetic algorithms, evolutionary computation, genetic programming, evolutionary programming, evolutionary strategies.
We will start by posing the unconstrained optimization problem with design variable bounds,
minimize f(x)
subject to xl ≤ x ≤ xu
where xl and xu are the vectors of lower and upper bounds on x, respectively. In the context of genetic algorithms, we will call each design variable vector x a population member. The value of the objective function, f(x), is termed the fitness. Genetic algorithms are radically different from the gradient-based methods we have covered so far. Instead of looking at one point at a time and stepping to a new point at each iteration, a whole population of solutions is iterated toward the optimum at the same time. Using a population lets us explore multiple “buckets” (local minima) simultaneously, increasing the likelihood of finding the global optimum. The main advantages of genetic algorithms are:
• The basic algorithm uses a coding of the parameter set, not the parameters themselves; consequently, the algorithm can handle mixed continuous, integer and discrete design variables.
• The population can cover a large range of the design space and is less likely than gradient-based methods to “get stuck” in local minima.
• As with other gradient-free methods like the Nelder–Mead algorithm, it can handle noisy objective functions.
• The implementation is straightforward and easily parallelized.
• Can be used for multiobjective optimization.
There is “no free lunch”, of course, and these methods have some disadvantages. The main one is that genetic algorithms are expensive when compared to gradient-based methods, especially for problems with a large number of design variables. However, there are cases where it is difficult to make gradient-based methods work or where they do not work at all. In some of these problems genetic algorithms work very well with little effort. Although genetic algorithms are much better than completely random methods, they are still “brute force” methods that require a large number of function evaluations.
Single-Objective Optimization The procedure of a genetic algorithm can be described as follows:
1. Initialize a Population Each member of the population represents a design point, x and has a value of the objective (fitness), and information about its constraint violations associated with it.
2. Determine Mating Pool Each population member is paired for reproduction by using one of the following methods:
• Random selection
• Based on fitness: the better members reproduce more often than the others.
3. Generate Offspring To generate offspring we need a scheme for the crossover operation. There are various schemes that one can use. When the design variables are continuous, for example, one offspring can be found by interpolating between the two parents and the other one can be extrapolated in the direction of the fitter parent.
4. Mutation Add some randomness in the offspring’s variables to maintain diversity.
5. Compute Offspring’s Fitness Evaluate the value of the objective function and constraint violations for each offspring.
6. Tournament Again, there are different schemes that can be used in this step. One method involves replacing the worst parent from each “family” with the best offspring.
7. Identify the Best Member Convergence is difficult to determine because the best solution so far may be maintained for many generations. As a rule of thumb, if the best solution among the current population has not changed (much) for about 10 generations, it can be taken to be the optimum for the problem.
8. Return to Step 2 Note that since GAs are probabilistic methods (due to the random initial population and mutation), it is crucial to run the problem multiple times when studying its characteristics.
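The eight steps above can be sketched as a minimal real-coded GA. All specific choices below (arithmetic crossover between parents, Gaussian mutation, a parent–offspring tournament, the population size, and every parameter value) are illustrative, not prescribed by the text:

```python
import random

def genetic_algorithm(f, bounds, pop_size=30, generations=60,
                      mutation_rate=0.1, seed=0):
    """Minimal real-coded GA sketch of steps 1-8 (minimization)."""
    rng = random.Random(seed)  # seeded so a single run is reproducible
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    n = len(bounds)

    def clip(x):
        return [min(max(xi, l), h) for xi, l, h in zip(x, lo, hi)]

    # 1. Initialize a population of random design points and their fitness
    pop = [[rng.uniform(l, h) for l, h in zip(lo, hi)]
           for _ in range(pop_size)]
    fit = [f(x) for x in pop]

    for _ in range(generations):
        # 2. Mating pool: one random parent, one biased toward the
        #    fitter half (ranking recomputed each generation)
        order = sorted(range(pop_size), key=lambda i: fit[i])
        for _ in range(pop_size // 2):
            p1 = rng.choice(order)
            p2 = rng.choice(order[:pop_size // 2])
            # 3. Offspring by interpolating between the two parents
            w = rng.random()
            child = [w * pop[p1][j] + (1 - w) * pop[p2][j]
                     for j in range(n)]
            # 4. Mutation: occasional Gaussian perturbation for diversity
            if rng.random() < mutation_rate:
                j = rng.randrange(n)
                child[j] += rng.gauss(0.0, 0.1 * (hi[j] - lo[j]))
            child = clip(child)
            # 5.-6. Evaluate the offspring; tournament against the
            #       worse parent of the "family"
            fc = f(child)
            worse = p1 if fit[p1] > fit[p2] else p2
            if fc < fit[worse]:
                pop[worse], fit[worse] = child, fc
    # 7. Identify the best member found
    best = min(range(pop_size), key=lambda i: fit[i])
    return pop[best], fit[best]

# Illustrative run on a quadratic with minimum at (1, 2)
x_best, f_best = genetic_algorithm(
    lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2, [(-5, 5), (-5, 5)])
```

Per step 8, in practice one would rerun this with several seeds and compare the best members, since any single run is a draw from a random process.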
Multi-Objective Optimization What if we want to investigate the trade-off between two (or more) conflicting objectives? In the design of a supersonic aircraft, for example, we might want to simultaneously minimize aerodynamic drag and sonic boom, and we do not know what the trade-off is. How much would the drag increase for a given reduction in sonic boom? In this situation there is no single “best design”. There is a set (a population) of designs that are the best possible for each combination of the two objectives. In other words, for these optimal solutions, the only way to improve one objective is to worsen the other.
This is a perfect application for genetic algorithms. We already have to evaluate a whole population, so we can use this to our advantage. This approach is much better than using composite weighted functions in conjunction with gradient-based methods, since there is no need to solve the problem for various weights.
The concept of dominance is the key to the use of GAs in multi-objective optimization. As an example, assume we have a population of 3 members, A, B and C, and that we want to minimize two objective functions, f1 and f2. Comparing members A and B, we can see that A has a higher (worse) f1 than B, but a lower (better) f2. Hence we cannot determine whether A is better than B or vice versa. On the other hand, B is clearly a fitter member than C, since both of B’s objectives are lower. We say that B dominates C. Comparing A and C, once again we are unable to say that one is better than the other. In summary:
Member   f1   f2
A        10   12
B         8   13
C         9   14
• A is not dominated by either B or C
• B is not dominated by either A or C
• C is dominated by B but not by A
The rank of a member is the number of members that dominate it plus one. In this case the ranks of the three members are:
rank(A) = 1,   rank(B) = 1,   rank(C) = 2.
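The dominance definition and the rank formula can be checked directly against the table above (the function name is ours):

```python
def dominance_ranks(points):
    """Rank of each member: the number of members that dominate it,
    plus one. For minimization, p dominates q when p is no worse in
    every objective and strictly better in at least one."""
    def dominates(p, q):
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))
    m = len(points)
    return [1 + sum(dominates(points[k], points[i])
                    for k in range(m) if k != i)
            for i in range(m)]

# The three-member example from the text, objectives (f1, f2):
ranks = dominance_ranks([(10, 12), (8, 13), (9, 14)])  # A, B, C
# ranks -> [1, 1, 2]: A and B form the rank-one (Pareto) set, C is rank two
```

The rank-one members returned this way are exactly the Pareto set discussed next.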
In multi-objective optimization, the rank is crucial in determining which population members are the fittest. A solution of rank one is said to be Pareto optimal, and the set of rank-one points for a given generation is called the Pareto set. As the number of generations increases and the fitness of the population improves, the size of the Pareto set grows. In the case above, the Pareto set includes A and B. The graphical representation of a Pareto set is called a Pareto front. The procedure of a two-objective genetic algorithm is similar to the single-objective one, with the following modifications.
1. Initialize a Population Compute all objectives and rank the population according to the definition of dominance
2. Determine Mating Pool
3. Generate Offspring
4. Mutation
5. Compute Offspring’s Fitness Evaluate their objectives and rank the offspring population.
6. Tournament This is now based on rank rather than objective function value.
7. Rank New Population Display rank one population, that is the Pareto set.
8. Repeat From Step 2 We iterate until the Pareto front is no longer “moving”.
One of the problems with this method is that there is no mechanism “pushing” the Pareto front to a better one. If one new member dominates every other member, the rank of the other members grows by one and the process continues. Some work has been invested in applying gradients to the front, with mixed results. Example 6.30. Pareto Front in Aircraft Design [1]
6.4.1 Coding and Decoding of Variables In a few variants of genetic algorithms, the design variables are stored as a vector of real numbers, just as one would do for the other optimization methods we have covered in this course. However, genetic algorithms more commonly represent each variable as a binary number of, say, m bits. If we want to represent a real-valued variable, we have to divide the feasible interval of xi into 2^m − 1 intervals. Then each possibility for xi can be represented by a combination of m bits. For m = 5, for example, the number of intervals would be 31 and a possible representation for xi would be 10101, which can be decoded as