Randomized Derivative-Free Optimization
Sebastian U. Stich ([email protected])
Institute of Theoretical Computer Science / Swiss Institute of Bioinformatics
MADALGO & CTIC Summer School 2011

Problem Statement

    arg min_{x ∈ R^n} f(x)

• access only to zeroth-order information
• objective function given as a black box
• multi-modal functions (global minimizer)
• robustness against (random) noise
• provable (fast) convergence?
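To make the setting concrete, a minimal Python sketch of a zeroth-order (black-box) oracle: the optimizer may only query function values, possibly corrupted by random noise. The class name, test function, and noise level are illustrative assumptions, not part of the poster.

    import numpy as np

    class BlackBoxOracle:
        """Zeroth-order access to an objective: only (possibly noisy)
        function values are returned, and every query is counted."""
        def __init__(self, f, noise_std=0.0, seed=0):
            self.f, self.noise_std = f, noise_std
            self.rng = np.random.default_rng(seed)
            self.evaluations = 0

        def __call__(self, x):
            self.evaluations += 1
            value = self.f(np.asarray(x, dtype=float))
            return value + self.noise_std * self.rng.standard_normal()

    # usage: a noisy sphere function in R^10
    # oracle = BlackBoxOracle(lambda x: float(x @ x), noise_std=0.01)
    # value = oracle(np.ones(10)); print(oracle.evaluations)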

Standard Approaches

Pure Random Search [1a]
• sample points uniformly in space (a minimal sketch follows at the end of this section)
• exponential running time
• Pure Adaptive Search (PAS) [1b]: sample a sequence of points, each new point drawn uniformly among the points with better function value → not possible to do directly
• PAS is analogous to a randomized version of the method of centers

DIRECT [1c]
• iteratively evaluate the function on a grid
• in each iteration, locally refine the grid cells with the potential to improve the function value
• exponential running time

Nelder-Mead [1d]
• evaluate the function at the vertices of a simplex
• propose a new search point: reflection of the worst point through the centroid x̄
• can converge to non-stationary points
• adaptation: simplex derivatives [1e]

Trust Region [1f]
• approximate the objective function with models (often quadratic) on regions
• refine/expand the regions on which the approximation is bad/good
• convergence results exist, but only with low rates
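As a baseline, a minimal Python sketch of Pure Random Search over a box; the box bounds and sample budget are illustrative assumptions.

    import numpy as np

    def pure_random_search(f, lower, upper, n_samples=10_000, seed=0):
        """Sample points uniformly in the box [lower, upper] and keep the best.
        The number of samples needed to hit a fixed target level set grows
        exponentially with the dimension."""
        rng = np.random.default_rng(seed)
        lower, upper = np.asarray(lower, float), np.asarray(upper, float)
        best_x, best_f = None, np.inf
        for _ in range(n_samples):
            x = rng.uniform(lower, upper)      # uniform sample in the box
            fx = f(x)                          # one zeroth-order oracle call
            if fx < best_f:
                best_x, best_f = x, fx
        return best_x, best_f

    # usage: minimize a quadratic in R^5
    # x, fx = pure_random_search(lambda x: float(x @ x), [-5]*5, [5]*5)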
Evolution Strategies Explained

(1 + 1)-ES
• described already in 1968 [2a]
• basis of more complex methods like CMA-ES
• convergence of the (1+1)-ES is linear on quadratic functions [2b]
• theoretical convergence results are only known for very simple functions

Generic (µ + λ)-Evolution Strategy (ES)
1: for k = 0 to N do
2:   X_k ← (λ new points, based on M_{k-1})
3:   EvaluateFunction(X_k)
4:   M_k ← select the µ best points among X_k ∪ M_{k-1}
5: end for
6: return BestPoint(∪_{k=0}^{N} M_k)

For the (1+1)-ES (with step size σ_k):
    x_{k+1} ∼ N(m_k, σ_k · I_n),   m_k = Best(x_k, m_{k-1})

Most ES work by only considering rank information of the current iterates (invariant under monotone transformations of f).

Step-size adaptation of ES [2c]
• success probability close to 1/2 → increase σ; small success probability → decrease σ
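A minimal Python sketch of the (1+1)-ES with a success-rule step-size adaptation in the spirit described above; the target success rate and damping constant are illustrative assumptions, not the exact rule analyzed in [2b].

    import numpy as np

    def one_plus_one_es(f, x0, sigma0=1.0, n_iters=2000, p_target=0.2, seed=0):
        """(1+1)-ES: one offspring per iteration, sampled from N(m, sigma^2 I),
        accepted only if it improves f; step size adapted by a success rule."""
        rng = np.random.default_rng(seed)
        m = np.asarray(x0, dtype=float)
        sigma, fm = float(sigma0), f(m)
        for _ in range(n_iters):
            x = m + sigma * rng.standard_normal(m.size)   # offspring ~ N(m, sigma^2 I_n)
            fx = f(x)
            success = fx < fm
            if success:
                m, fm = x, fx                              # elitist selection: keep the better point
            # success rule: grow sigma after a success, shrink it otherwise,
            # driving the empirical success rate towards p_target
            sigma *= np.exp(0.2 * ((1.0 if success else 0.0) - p_target))
        return m, fm

    # usage: m, fm = one_plus_one_es(lambda x: float(x @ x), np.ones(10))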
The Right Metric

For a quadratic function
    f(x) = f(x*) + 1/2 (x - x*)^T H (x - x*),   H = ∇²f(x*),
• gradient direction: -∇f(x)
• Newton direction: -H^{-1} ∇f(x)
• if H ≈ I_n, first-order information is sufficient
• otherwise an estimate of H^{-1} is necessary
• invariant under quadratic transformations
• estimating H from zeroth-order information alone is difficult

Estimate (rank-1 update):
    H_{k+1}^{-1} ≈ (1 - λ_k) H_k^{-1} + λ_k y_k y_k^T,
where y_k is some vector (for example y_k ∝ m_{k+1} - m_k) and λ_k a weight.
• many update schemes are reasonable and used
(Picture: [7j])
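A minimal sketch of such a rank-1 update of an inverse-Hessian estimate and of its use to reshape a random search direction; the weight lam and the choice of y are illustrative assumptions.

    import numpy as np

    def rank_one_metric_update(H_inv, y, lam=0.1):
        """Rank-1 update of an inverse-Hessian estimate:
            H_inv_new = (1 - lam) * H_inv + lam * y y^T
        `y` could be, e.g., proportional to the mean shift m_{k+1} - m_k;
        the weight `lam` is an illustrative constant."""
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        return (1.0 - lam) * H_inv + lam * (y @ y.T)

    # usage sketch: reshape a random direction u ~ N(0, I) by the metric,
    # d = B u with B B^T = H_inv, so that d ~ N(0, H_inv)
    # (mean_shift is a hypothetical vector from the current run):
    # H_inv = np.eye(5)
    # H_inv = rank_one_metric_update(H_inv, y=mean_shift)
    # B = np.linalg.cholesky(H_inv)
    # d = B @ np.random.default_rng(0).standard_normal(5)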

Other Methods

Hit and Run [7a]
• draw a random direction and pick the next iterate uniformly on the segment through the current point that is contained in the convex body (a sketch of one step follows below)
• originally proposed as a sampling method to generate uniform samples from a convex body [7b]
• fast mixing times (in special settings, e.g. for log-concave functions) proven by Lovász [7c] and Dyer [7d]
• introduced as an optimization algorithm by Bertsimas and Vempala [7a]

Random Cutting Plane [7e]
• based on the cutting plane method [7f]
• the RCP algorithm was first described by Levin [7g]
• natural extension of the 1D bisection method
• not practically usable: for a generic convex set, computing the center of gravity is more difficult than solving the original optimization problem
• Dabbene et al. [7e] approximate the centroid by hit-and-run sampling
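A minimal Python sketch of one hit-and-run step, assuming the convex body is available only through a membership oracle; the chord through the current point is bracketed numerically here, and t_max is an illustrative constant that must exceed the diameter of the body.

    import numpy as np

    def hit_and_run_step(x, membership, rng, t_max=1e3):
        """One hit-and-run step: pick a uniform random direction, locate the
        feasible chord through x by bisection on the membership oracle, and
        return a uniform point on that chord."""
        u = rng.standard_normal(x.size)
        u /= np.linalg.norm(u)                      # uniformly random direction

        def boundary(sign):
            # bisection for the largest step t with x + sign*t*u still inside
            lo, hi = 0.0, t_max
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                if membership(x + sign * mid * u):
                    lo = mid
                else:
                    hi = mid
            return sign * lo

        t = rng.uniform(boundary(-1.0), boundary(+1.0))  # uniform point on the chord
        return x + t * u

    # usage: approximate uniform samples from the unit ball in R^3
    # rng = np.random.default_rng(0)
    # x = np.zeros(3)
    # for _ in range(100):
    #     x = hit_and_run_step(x, lambda p: float(np.linalg.norm(p)) <= 1.0, rng)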

Random Conic Pursuit [7h]
• a random direction (rank-1 matrix) is chosen at random, and optimization over the resulting 2-dimensional subspace yields the next iterate
• convergence on unconstrained SDPs
• several adaptation heuristics are proposed in the paper (adaptation of the sampling distribution, boundaries), but without theoretical results

Simulated Annealing [7k, 7l]
• inspiration and name from annealing in metallurgy
• new iterates are proposed by sampling random points in a weighted neighborhood; the acceptance rate changes according to a cooling schedule
• adaptation of the Metropolis-Hastings algorithm
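A minimal Python sketch of simulated annealing with a Gaussian proposal and Metropolis acceptance; the geometric cooling schedule, proposal width, and temperature range are illustrative choices, not the tuned schedules from [7k, 7l].

    import numpy as np

    def simulated_annealing(f, x0, sigma=0.5, t0=1.0, t_end=1e-3, n_iters=5000, seed=0):
        """Simulated annealing: propose a random point near x, always accept
        improvements, and accept worse points with probability
        exp(-(fy - fx) / temp), where temp follows a cooling schedule."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        fx = f(x)
        best_x, best_f = x, fx
        cooling = (t_end / t0) ** (1.0 / n_iters)      # geometric cooling factor
        temp = t0
        for _ in range(n_iters):
            y = x + sigma * rng.standard_normal(x.size)  # random point near x
            fy = f(y)
            # Metropolis acceptance rule
            if fy < fx or rng.random() < np.exp(-(fy - fx) / temp):
                x, fx = y, fy
                if fx < best_f:
                    best_x, best_f = x, fx
            temp *= cooling
        return best_x, best_f

    # usage: best_x, best_f = simulated_annealing(lambda x: float(x @ x), np.ones(5))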

Random Gradient [3a]
Iteratively update, with u ∼ N(0, I_n):
    x_{k+1} = x_k - h_k g_µ(x_k),   h_k = h_k(L_1) step sizes,
where g_µ is a directional derivative / finite difference in the random direction u:
    g_0(x) = f'(x, u) · u
    g_µ(x) = ((f(x + µu) - f(x)) / µ) · u
• first results by Polyak [3b], completely analyzed by Nesterov [3a] using Gaussian smoothing [3c]
• random oracle for the derivative
• convergence for convex functions with L_1-Lipschitz continuous gradients
• accelerated methods reach the (provable) existing lower complexity bounds [3d] up to a factor O(n)

Random Pursuit [4a]
Iteratively update, with u ∼ N(0, I_n):
    x_{k+1} = x_k + h_k · u,   h_k ≈ arg min_h f(x_k + h · u),
i.e. an (approximate) line search in a random direction.
• described by [4b]; first analysis by Karmanov [4c]; the approximate line search case analyzed independently in [4a]
• convergence for smooth convex functions with bounded level sets and Lipschitz continuous gradients
• reaches the complexity bound of the gradient method up to a factor O(n)
• first approaches towards a variable metric by Leventhal and Lewis [4d]

[Figure: convergence comparison of Random Gradient, Gradient Method, Random Pursuit, and Accelerated RP.]
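Minimal Python sketches of one step of each of the two schemes above; the finite-difference parameter, the fixed step size, and the grid standing in for the line-search oracle are illustrative assumptions, and the function names are not from the cited papers.

    import numpy as np

    def random_gradient_step(f, x, mu, h, rng):
        """One random-gradient step: finite-difference estimate
        g_mu(x) = ((f(x + mu*u) - f(x)) / mu) * u  along  u ~ N(0, I_n),
        followed by x <- x - h * g_mu(x)."""
        u = rng.standard_normal(x.size)
        g = (f(x + mu * u) - f(x)) / mu * u
        return x - h * g

    def random_pursuit_step(f, x, rng, h_grid=np.linspace(-2.0, 2.0, 201)):
        """One random-pursuit step: approximate line search along a random
        direction u ~ N(0, I_n); the grid stands in for the line-search oracle."""
        u = rng.standard_normal(x.size)
        values = [f(x + h * u) for h in h_grid]
        return x + h_grid[int(np.argmin(values))] * u

    # usage sketch on f(x) = ||x||^2 in R^10
    # rng = np.random.default_rng(0)
    # x = np.ones(10)
    # for _ in range(500):
    #     x = random_pursuit_step(lambda z: float(z @ z), x, rng)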

CMA-ES [5a]
(µ, λ)-Evolution Strategy; iteratively update m_k (mean), σ_k (step size), C_k (covariance) and propose new candidates
    y^{(·)} ∼ N(m_k, σ_k · C_k)
• step-size adaptation by path length control
• covariance adaptation by rank-µ updates
• variable metric (invariant under quadratic transformations)
• natural-gradient descent in parameter space [5b]

Gaussian Adaptation [6a, 6b]
(1 + 1)-Evolution Strategy; iteratively update m_k (mean), σ_k (step size), C_k (covariance) and propose new candidates
    y^{(·)} ∼ N(m_k, σ_k · C_k)
• covariance adaptation by rank-1 updates
• described first by Kjellström, turned into an effective optimizer by Müller [6c]
• very similar to the (1+1)-CMA-ES without evolution path
• approximation of level sets by ellipsoids (entropy maximization of the search distribution under the constraint of a fixed hitting probability)
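A minimal Python sketch in the spirit of Gaussian Adaptation / a (1+1) strategy with covariance adaptation: sample from the adapted Gaussian, accept improvements, adapt σ towards a fixed hitting probability, and update C by a rank-1 term. All constants are illustrative assumptions, not the tuned values from [6a-6c].

    import numpy as np

    def gaussian_adaptation(f, x0, sigma0=1.0, p_hit=0.3, n_iters=3000, seed=0):
        """(1+1) strategy with step-size and rank-1 covariance adaptation
        (a simplified sketch, not the full Gaussian Adaptation algorithm)."""
        rng = np.random.default_rng(seed)
        m = np.asarray(x0, dtype=float)
        n = m.size
        fm, sigma, C = f(m), float(sigma0), np.eye(n)
        c_cov, c_sig = 1.0 / (n + 2), 1.0 / np.sqrt(n + 1)   # illustrative rates
        for _ in range(n_iters):
            B = np.linalg.cholesky(C)
            step = sigma * (B @ rng.standard_normal(n))       # y - m ~ N(0, sigma^2 C)
            y = m + step
            fy = f(y)
            accepted = fy < fm
            if accepted:
                z = step / sigma
                C = (1.0 - c_cov) * C + c_cov * np.outer(z, z)  # rank-1 covariance update
                m, fm = y, fy
            # push the empirical acceptance rate towards the target p_hit
            sigma *= np.exp(c_sig * ((1.0 if accepted else 0.0) - p_hit))
        return m, fm

    # usage: m, fm = gaussian_adaptation(lambda x: float(x @ x), np.ones(10))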
Summary

For smooth functions, standard methods (gradient, Newton) converge to a local minimizer but are not robust against noise. Evolution strategies have proven effective in this setting. Highly developed methods (like CMA-ES) are at the moment (among) the best performing algorithms.
• convergence proofs for random variants of standard methods are relatively easy for convex functions
• convergence proofs for general ES are still missing
• main issues: self-adaptation of strategy parameters (step size, Hessian estimation, lots of heuristics...)

Open Problems and Goals

Theoretical results to support the experimental evidence:
• rigorous convergence results to compare with classical methods
• convergence/divergence results for noisy and multi-modal functions

Especially interesting would be to analyze the particular heuristics used by many algorithms to self-estimate strategy parameters:
• step-size adaptation heuristics
• Hessian estimation (e.g. by rank-µ updates)
• is limited-memory Hessian estimation possible?

Acknowledgments

My research is supported by the CG Learning project, which itself is supported by the FET (Future and Emerging Technologies) unit of the European Commission (EC) within the 7th Framework Programme of the EC under contract No. 255827.

References

[1a] Samuel H. Brooks. A discussion of random methods for seeking maxima. Operations Research, 6(2):244–251, 1958.
[1b] Zelda B. Zabinsky and Robert L. Smith. Pure adaptive search in global optimization. Mathematical Programming, 53:323–338, 1992.
[1c] Jones, Perttunen, and Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79:157–181, 1993.
[1d] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.
[1e] Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente. Introduction to Derivative-Free Optimization. 2009.
[1f] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust-Region Methods. 2000.
[2a] M. Schumer and K. Steiglitz. Adaptive step size random search. IEEE Transactions on Automatic Control, 13(3):270–276, 1968.
[2b] Jens Jägersküpper. Rigorous runtime analysis of the (1+1) ES: 1/5-rule and ellipsoidal fitness landscapes. Foundations of Genetic Algorithms, 3469:356–361, 2005.
[2c] A. Auger and N. Hansen. Tutorial: Evolution strategies and covariance matrix adaptation. Available online.
[3a] Yu. Nesterov. Random gradient-free minimization of convex functions. Technical report, ECORE, 2011.
[3b] B. Polyak. Introduction to Optimization. Optimization Software, Inc., Publications Division, New York, 1987.
[3c] A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons, New York, 1983.
[3d] Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, Boston, 2004.
[4a] B. Gärtner, Ch. L. Müller, and S. U. Stich. Optimization of convex functions with random pursuit. In preparation.
[4b] F. J. Solis and R. J-B. Wets. Minimization by random search techniques. Mathematics of Operations Research, 6:19–30, 1981.
[4c] V. G. Karmanov. On convergence of a random search method (in Russian). Theory of Probability and its Applications, 19(4):788–794, 1974.
[4d] D. Leventhal and A. S. Lewis. Randomized Hessian estimation and directional search. Optimization, 60(3):329–345, 2011.
[5a] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
[5b] L. Arnold, A. Auger, N. Hansen, and Y. Ollivier. Information-geometric optimization algorithms: A unifying picture via invariance principles.
[6a] G. Kjellström. Network optimization by random variation of component values. Ericsson Technics, 25(3):133–151, 1969.
[6b] G. Kjellström and L. Taxen. Stochastic optimization in system design. IEEE Transactions on Circuits and Systems, 28(7), 1981.
[6c] Ch. L. Müller and I. F. Sbalzarini. Gaussian adaptation revisited - an entropic view on covariance matrix adaptation. In EvoApplications (1):432–441, 2010.
[7a] D. Bertsimas and S. Vempala. Solving convex programs by random walks. Journal of the ACM, 51:540–556, July 2004.
[7b] R. L. Smith. Efficient Monte-Carlo procedures for generating points uniformly distributed over bounded regions. Operations Research, 32:1296–1308, 1984.
[7c] László Lovász. Hit-and-run mixes fast. Mathematical Programming, 86:443–461, 1999.
[7d] M. Dyer, A. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM, 38:1–17, January 1991.
[7e] F. Dabbene, P. S. Shcherbakov, and B. T. Polyak. A randomized cutting plane method with probabilistic geometric convergence. SIAM Journal on Optimization, 20(6):3185–3207, 2010.
[7f] J. E. Kelley, Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.
[7g] A. Levin. On an algorithm for the minimization of convex functions. Soviet Mathematics Doklady, 6:286–290, 1965.
[7h] A. Kleiner, A. Rahimi, and M. Jordan. Random conic pursuit for semidefinite programming. NIPS 23:1135–1143, 2010.
[7j] E. P. Xing et al. Distance metric learning with application to clustering. In NIPS'02:505–512, 2002.
[7k] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[7l] V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45:41–51, 1985.

Pictures from the respectively cited papers unless indicated otherwise.