Randomized Derivative-Free Optimization
Sebastian U. Stich ([email protected])
Institute of Theoretical Computer Science / Swiss Institute of Bioinformatics
MADALGO & CTIC Summer School 2011

Problem Statement

    arg min_{x ∈ R^n} f(x)

• access only to zeroth-order information
• objective function given as a black box
• multi-modal functions (global minimizer)
• robustness against (random) noise
• provable (fast) convergence?
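To make the setting concrete, a minimal Python sketch of a zeroth-order (black-box) oracle: the optimizer may only query function values, possibly corrupted by random noise. The class name, test function, and noise level are illustrative assumptions, not part of the poster.

    import numpy as np

    class BlackBoxOracle:
        """Zeroth-order access to an objective: only (possibly noisy)
        function values are returned, and every query is counted."""
        def __init__(self, f, noise_std=0.0, seed=0):
            self.f, self.noise_std = f, noise_std
            self.rng = np.random.default_rng(seed)
            self.evaluations = 0

        def __call__(self, x):
            self.evaluations += 1
            value = self.f(np.asarray(x, dtype=float))
            return value + self.noise_std * self.rng.standard_normal()

    # usage: a noisy sphere function in R^10
    # oracle = BlackBoxOracle(lambda x: float(x @ x), noise_std=0.01)
    # value = oracle(np.ones(10)); print(oracle.evaluations)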

Standard Approaches

Pure Random Search [1a]
• sample points uniformly in space (a minimal sketch follows at the end of this section)
• exponential running time
• Pure Adaptive Search (PAS) [1b]: sample a sequence of points, each new point drawn uniformly among the points with better function value → not possible to do directly
• PAS is analogous to a randomized version of the method of centers

DIRECT [1c]
• iteratively evaluate the function on a grid
• in each iteration, locally refine the grid cells with the potential to improve the function value
• exponential running time

Nelder-Mead [1d]
• evaluate the function at the vertices of a simplex
• propose a new search point: reflection of the worst point through the centroid x̄
• can converge to non-stationary points
• adaptation: simplex derivatives [1e]

Trust Region [1f]
• approximate the objective function with models (often quadratic) on regions
• refine/expand the regions on which the approximation is bad/good
• convergence results exist, but only with low rates
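As a baseline, a minimal Python sketch of Pure Random Search over a box; the box bounds and sample budget are illustrative assumptions.

    import numpy as np

    def pure_random_search(f, lower, upper, n_samples=10_000, seed=0):
        """Sample points uniformly in the box [lower, upper] and keep the best.
        The number of samples needed to hit a fixed target level set grows
        exponentially with the dimension."""
        rng = np.random.default_rng(seed)
        lower, upper = np.asarray(lower, float), np.asarray(upper, float)
        best_x, best_f = None, np.inf
        for _ in range(n_samples):
            x = rng.uniform(lower, upper)      # uniform sample in the box
            fx = f(x)                          # one zeroth-order oracle call
            if fx < best_f:
                best_x, best_f = x, fx
        return best_x, best_f

    # usage: minimize a quadratic in R^5
    # x, fx = pure_random_search(lambda x: float(x @ x), [-5]*5, [5]*5)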
Evolution Strategies Explained

(1 + 1)-ES
• described already in 1968 [2a]
• basis of more complex methods like CMA-ES
• convergence of the (1+1)-ES is linear on quadratic functions [2b]
• theoretical convergence results are only known for very simple functions

Generic (µ + λ)-Evolution Strategy (ES)
1: for k = 0 to N do
2:   X_k ← (λ new points, based on M_{k-1})
3:   EvaluateFunction(X_k)
4:   M_k ← select the µ best points among X_k ∪ M_{k-1}
5: end for
6: return BestPoint(∪_{k=0}^{N} M_k)

For the (1+1)-ES (with step size σ_k):
    x_{k+1} ∼ N(m_k, σ_k · I_n),   m_k = Best(x_k, m_{k-1})

Most ES work by only considering rank information of the current iterates (invariant under monotone transformations of f).

Step-size adaptation of ES [2c]
• success probability close to 1/2 → increase σ; small success probability → decrease σ
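A minimal Python sketch of the (1+1)-ES with a success-rule step-size adaptation in the spirit described above; the target success rate and damping constant are illustrative assumptions, not the exact rule analyzed in [2b].

    import numpy as np

    def one_plus_one_es(f, x0, sigma0=1.0, n_iters=2000, p_target=0.2, seed=0):
        """(1+1)-ES: one offspring per iteration, sampled from N(m, sigma^2 I),
        accepted only if it improves f; step size adapted by a success rule."""
        rng = np.random.default_rng(seed)
        m = np.asarray(x0, dtype=float)
        sigma, fm = float(sigma0), f(m)
        for _ in range(n_iters):
            x = m + sigma * rng.standard_normal(m.size)   # offspring ~ N(m, sigma^2 I_n)
            fx = f(x)
            success = fx < fm
            if success:
                m, fm = x, fx                              # elitist selection: keep the better point
            # success rule: grow sigma after a success, shrink it otherwise,
            # driving the empirical success rate towards p_target
            sigma *= np.exp(0.2 * ((1.0 if success else 0.0) - p_target))
        return m, fm

    # usage: m, fm = one_plus_one_es(lambda x: float(x @ x), np.ones(10))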
The Right Metric

For a quadratic function
    f(x) = f(x*) + 1/2 (x - x*)^T H (x - x*),   H = ∇²f(x*),
• gradient direction: -∇f(x)
• Newton direction: -H^{-1} ∇f(x)
• if H ≈ I_n, first-order information is sufficient
• otherwise an estimate of H^{-1} is necessary
• invariant under quadratic transformations
• estimating H from zeroth-order information alone is difficult

Estimate (rank-1 update):
    H_{k+1}^{-1} ≈ (1 - λ_k) H_k^{-1} + λ_k y_k y_k^T,
where y_k is some vector (for example y_k ∝ m_{k+1} - m_k) and λ_k a weight.
• many update schemes are reasonable and used
(Picture: [7j])
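A minimal sketch of such a rank-1 update of an inverse-Hessian estimate and of its use to reshape a random search direction; the weight lam and the choice of y are illustrative assumptions.

    import numpy as np

    def rank_one_metric_update(H_inv, y, lam=0.1):
        """Rank-1 update of an inverse-Hessian estimate:
            H_inv_new = (1 - lam) * H_inv + lam * y y^T
        `y` could be, e.g., proportional to the mean shift m_{k+1} - m_k;
        the weight `lam` is an illustrative constant."""
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        return (1.0 - lam) * H_inv + lam * (y @ y.T)

    # usage sketch: reshape a random direction u ~ N(0, I) by the metric,
    # d = B u with B B^T = H_inv, so that d ~ N(0, H_inv)
    # (mean_shift is a hypothetical vector from the current run):
    # H_inv = np.eye(5)
    # H_inv = rank_one_metric_update(H_inv, y=mean_shift)
    # B = np.linalg.cholesky(H_inv)
    # d = B @ np.random.default_rng(0).standard_normal(5)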

Other Methods

Hit and Run [7a]
• draw a random direction and pick the next iterate uniformly on the segment through the current point that is contained in the convex body (a sketch of one step follows below)
• originally proposed as a sampling method to generate uniform samples from a convex body [7b]
• fast mixing times (in special settings, e.g. for log-concave functions) proven by Lovász [7c] and Dyer [7d]
• introduced as an optimization algorithm by Bertsimas and Vempala [7a]

Random Cutting Plane [7e]
• based on the cutting plane method [7f]
• the RCP algorithm was first described by Levin [7g]
• natural extension of the 1D bisection method
• not practically usable: for a generic convex set, computing the center of gravity is more difficult than solving the original optimization problem
• Dabbene et al. [7e] approximate the centroid by hit-and-run sampling
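A minimal Python sketch of one hit-and-run step, assuming the convex body is available only through a membership oracle; the chord through the current point is bracketed numerically here, and t_max is an illustrative constant that must exceed the diameter of the body.

    import numpy as np

    def hit_and_run_step(x, membership, rng, t_max=1e3):
        """One hit-and-run step: pick a uniform random direction, locate the
        feasible chord through x by bisection on the membership oracle, and
        return a uniform point on that chord."""
        u = rng.standard_normal(x.size)
        u /= np.linalg.norm(u)                      # uniformly random direction

        def boundary(sign):
            # bisection for the largest step t with x + sign*t*u still inside
            lo, hi = 0.0, t_max
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                if membership(x + sign * mid * u):
                    lo = mid
                else:
                    hi = mid
            return sign * lo

        t = rng.uniform(boundary(-1.0), boundary(+1.0))  # uniform point on the chord
        return x + t * u

    # usage: approximate uniform samples from the unit ball in R^3
    # rng = np.random.default_rng(0)
    # x = np.zeros(3)
    # for _ in range(100):
    #     x = hit_and_run_step(x, lambda p: float(np.linalg.norm(p)) <= 1.0, rng)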

Random Conic Pursuit [7h]
• a random direction (rank-1 matrix) is chosen at random, and optimization over the resulting 2-dimensional subspace yields the next iterate
• convergence on unconstrained SDPs
• several adaptation heuristics are proposed in the paper (adaptation of the sampling distribution, boundaries), but without theoretical results

Simulated Annealing [7k, 7l]
• inspiration and name from annealing in metallurgy
• new iterates are proposed by sampling random points in a weighted neighborhood; the acceptance rate changes according to a cooling schedule
• adaptation of the Metropolis-Hastings algorithm
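A minimal Python sketch of simulated annealing with a Gaussian proposal and Metropolis acceptance; the geometric cooling schedule, proposal width, and temperature range are illustrative choices, not the tuned schedules from [7k, 7l].

    import numpy as np

    def simulated_annealing(f, x0, sigma=0.5, t0=1.0, t_end=1e-3, n_iters=5000, seed=0):
        """Simulated annealing: propose a random point near x, always accept
        improvements, and accept worse points with probability
        exp(-(fy - fx) / temp), where temp follows a cooling schedule."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        fx = f(x)
        best_x, best_f = x, fx
        cooling = (t_end / t0) ** (1.0 / n_iters)      # geometric cooling factor
        temp = t0
        for _ in range(n_iters):
            y = x + sigma * rng.standard_normal(x.size)  # random point near x
            fy = f(y)
            # Metropolis acceptance rule
            if fy < fx or rng.random() < np.exp(-(fy - fx) / temp):
                x, fx = y, fy
                if fx < best_f:
                    best_x, best_f = x, fx
            temp *= cooling
        return best_x, best_f

    # usage: best_x, best_f = simulated_annealing(lambda x: float(x @ x), np.ones(5))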

Random Gradient [3a]
Iteratively update, with u ∼ N(0, I_n):
    x_{k+1} = x_k - h_k g_µ(x_k),   h_k = h_k(L_1) step sizes,
where g_µ is a directional derivative / finite difference in the random direction u:
    g_0(x) = f'(x, u) · u
    g_µ(x) = ((f(x + µu) - f(x)) / µ) · u
• first results by Polyak [3b], completely analyzed by Nesterov [3a] using Gaussian smoothing [3c]
• random oracle for the derivative
• convergence for convex functions with L_1-Lipschitz continuous gradients
• accelerated methods reach the (provable) existing lower complexity bounds [3d] up to a factor O(n)

Random Pursuit [4a]
Iteratively update, with u ∼ N(0, I_n):
    x_{k+1} = x_k + h_k · u,   h_k ≈ arg min_h f(x_k + h · u),
i.e. an (approximate) line search in a random direction.
• described by [4b]; first analysis by Karmanov [4c]; the approximate line search case analyzed independently in [4a]
• convergence for smooth convex functions with bounded level sets and Lipschitz continuous gradients
• reaches the complexity bound of the gradient method up to a factor O(n)
• first approaches towards a variable metric by Leventhal and Lewis [4d]

[Figure: convergence comparison of Random Gradient, Gradient Method, Random Pursuit, and Accelerated RP.]
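Minimal Python sketches of one step of each of the two schemes above; the finite-difference parameter, the fixed step size, and the grid standing in for the line-search oracle are illustrative assumptions, and the function names are not from the cited papers.

    import numpy as np

    def random_gradient_step(f, x, mu, h, rng):
        """One random-gradient step: finite-difference estimate
        g_mu(x) = ((f(x + mu*u) - f(x)) / mu) * u  along  u ~ N(0, I_n),
        followed by x <- x - h * g_mu(x)."""
        u = rng.standard_normal(x.size)
        g = (f(x + mu * u) - f(x)) / mu * u
        return x - h * g

    def random_pursuit_step(f, x, rng, h_grid=np.linspace(-2.0, 2.0, 201)):
        """One random-pursuit step: approximate line search along a random
        direction u ~ N(0, I_n); the grid stands in for the line-search oracle."""
        u = rng.standard_normal(x.size)
        values = [f(x + h * u) for h in h_grid]
        return x + h_grid[int(np.argmin(values))] * u

    # usage sketch on f(x) = ||x||^2 in R^10
    # rng = np.random.default_rng(0)
    # x = np.ones(10)
    # for _ in range(500):
    #     x = random_pursuit_step(lambda z: float(z @ z), x, rng)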

CMA-ES [5a]
(µ, λ)-Evolution Strategy; iteratively update m_k (mean), σ_k (step size), C_k (covariance) and propose new candidates
    y^{(·)} ∼ N(m_k, σ_k · C_k)
• step-size adaptation by path length control
• covariance adaptation by rank-µ updates
• variable metric (invariant under quadratic transformations)
• natural-gradient descent in parameter space [5b]

Gaussian Adaptation [6a, 6b]
(1 + 1)-Evolution Strategy; iteratively update m_k (mean), σ_k (step size), C_k (covariance) and propose new candidates
    y^{(·)} ∼ N(m_k, σ_k · C_k)
• covariance adaptation by rank-1 updates
• described first by Kjellström, turned into an effective optimizer by Müller [6c]
• very similar to the (1+1)-CMA-ES without evolution path
• approximation of level sets by ellipsoids (entropy maximization of the search distribution under the constraint of a fixed hitting probability)
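A minimal Python sketch in the spirit of Gaussian Adaptation / a (1+1) strategy with covariance adaptation: sample from the adapted Gaussian, accept improvements, adapt σ towards a fixed hitting probability, and update C by a rank-1 term. All constants are illustrative assumptions, not the tuned values from [6a-6c].

    import numpy as np

    def gaussian_adaptation(f, x0, sigma0=1.0, p_hit=0.3, n_iters=3000, seed=0):
        """(1+1) strategy with step-size and rank-1 covariance adaptation
        (a simplified sketch, not the full Gaussian Adaptation algorithm)."""
        rng = np.random.default_rng(seed)
        m = np.asarray(x0, dtype=float)
        n = m.size
        fm, sigma, C = f(m), float(sigma0), np.eye(n)
        c_cov, c_sig = 1.0 / (n + 2), 1.0 / np.sqrt(n + 1)   # illustrative rates
        for _ in range(n_iters):
            B = np.linalg.cholesky(C)
            step = sigma * (B @ rng.standard_normal(n))       # y - m ~ N(0, sigma^2 C)
            y = m + step
            fy = f(y)
            accepted = fy < fm
            if accepted:
                z = step / sigma
                C = (1.0 - c_cov) * C + c_cov * np.outer(z, z)  # rank-1 covariance update
                m, fm = y, fy
            # push the empirical acceptance rate towards the target p_hit
            sigma *= np.exp(c_sig * ((1.0 if accepted else 0.0) - p_hit))
        return m, fm

    # usage: m, fm = gaussian_adaptation(lambda x: float(x @ x), np.ones(10))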
Summary

For smooth functions, standard methods (gradient, Newton) converge to a local minimizer but are not robust against noise. Evolution strategies have proven effective in this setting. Highly developed methods (like CMA-ES) are at the moment (among) the best performing algorithms.
• convergence proofs for random variants of standard methods are relatively easy for convex functions
• convergence proofs for general ES are still missing
• main issues: self-adaptation of strategy parameters (step size, Hessian estimation, lots of heuristics...)

Open Problems and Goals

Theoretical results to support the experimental evidence:
• rigorous convergence results to compare with classical methods
• convergence/divergence results for noisy and multi-modal functions

Especially interesting would be to analyze the particular heuristics used by many algorithms to self-estimate strategy parameters:
• step-size adaptation heuristics
• Hessian estimation (e.g. by rank-µ updates)
• is limited-memory Hessian estimation possible?

Acknowledgments

My research is supported by the CG Learning project, which itself is supported by the FET (Future and Emerging Technologies) unit of the European Commission (EC) within the 7th Framework Programme of the EC under contract No. 255827.

References

[1a] Samuel H. Brooks. A discussion of random methods for seeking maxima. Operations Research, 6(2):244–251, 1958.
[1b] Zelda B. Zabinsky and Robert L. Smith. Pure adaptive search in global optimization. Mathematical Programming, 53:323–338, 1992.
[1c] Jones, Perttunen, and Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79:157–181, 1993.
[1d] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.
[1e] Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente. Introduction to Derivative-Free Optimization. 2009.
[1f] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust-Region Methods. 2000.
[2a] M. Schumer and K. Steiglitz. Adaptive step size random search. IEEE Transactions on Automatic Control, 13(3):270–276, 1968.
[2b] Jens Jägersküpper. Rigorous runtime analysis of the (1+1) ES: 1/5-rule and ellipsoidal fitness landscapes. Foundations of Genetic Algorithms, 3469:356–361, 2005.
[2c] A. Auger and N. Hansen. Tutorial: Evolution strategies and covariance matrix adaptation. Available online.
[3a] Yu. Nesterov. Random gradient-free minimization of convex functions. Technical report, ECORE, 2011.
[3b] B. Polyak. Introduction to Optimization. Optimization Software, Inc., Publications Division, New York, 1987.
[3c] A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons, New York, 1983.
[3d] Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, Boston, 2004.
[4a] B. Gärtner, Ch. L. Müller, and S. U. Stich. Optimization of convex functions with random pursuit. In preparation.
[4b] F. J. Solis and R. J-B. Wets. Minimization by random search techniques. Mathematics of Operations Research, 6:19–30, 1981.
[4c] V. G. Karmanov. On convergence of a random search method (in Russian). Theory of Probability and its Applications, 19(4):788–794, 1974.
[4d] D. Leventhal and A. S. Lewis. Randomized Hessian estimation and directional search. Optimization, 60(3):329–345, 2011.
[5a] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
[5b] L. Arnold, A. Auger, N. Hansen, and Y. Ollivier. Information-geometric optimization algorithms: A unifying picture via invariance principles.
[6a] G. Kjellström. Network optimization by random variation of component values. Ericsson Technics, 25(3):133–151, 1969.
[6b] G. Kjellström and L. Taxen. Stochastic optimization in system design. IEEE Transactions on Circuits and Systems, 28(7), 1981.
[6c] Ch. L. Müller and I. F. Sbalzarini. Gaussian adaptation revisited - an entropic view on covariance matrix adaptation. In EvoApplications (1):432–441, 2010.
[7a] D. Bertsimas and S. Vempala. Solving convex programs by random walks. Journal of the ACM, 51:540–556, July 2004.
[7b] R. L. Smith. Efficient Monte-Carlo procedures for generating points uniformly distributed over bounded regions. Operations Research, 32:1296–1308, 1984.
[7c] László Lovász. Hit-and-run mixes fast. Mathematical Programming, 86:443–461, 1999.
[7d] M. Dyer, A. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM, 38:1–17, January 1991.
[7e] F. Dabbene, P. S. Shcherbakov, and B. T. Polyak. A randomized cutting plane method with probabilistic geometric convergence. SIAM Journal on Optimization, 20(6):3185–3207, 2010.
[7f] J. E. Kelley, Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.
[7g] A. Levin. On an algorithm for the minimization of convex functions. Soviet Mathematics Doklady, 6:286–290, 1965.
[7h] A. Kleiner, A. Rahimi, and M. Jordan. Random conic pursuit for semidefinite programming. NIPS 23:1135–1143, 2010.
[7j] E. P. Xing et al. Distance metric learning with application to clustering. In NIPS'02:505–512, 2002.
[7k] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[7l] V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45:41–51, 1985.

Pictures from the respectively cited papers unless indicated otherwise.