Tutorial CMA-ES — Evolution Strategies and Covariance Matrix Adaptation
Anne Auger & Nikolaus Hansen
INRIA Research Centre Saclay – Île-de-France, Project team TAO, University Paris-Sud, LRI (UMR 8623), Bât. 660, 91405 Orsay Cedex, France
Copyright is held by the author/owner(s). GECCO'13, July 6, 2013, Amsterdam, Netherlands.
To get the slides: search for "Nikolaus Hansen"; under Publications, click the last bullet: Talks, seminars, tutorials. Don't hesitate to ask questions.
Anne Auger & Nikolaus Hansen. CMA-ES, July 2013.

Content
1 Problem Statement
  Black Box Optimization and Its Difficulties
  Non-Separable Problems
  Ill-Conditioned Problems
2 Evolution Strategies (ES)
  A Search Template
  The Normal Distribution
  Invariance
3 Step-Size Control
  Why Step-Size Control
  One-Fifth Success Rule
  Path Length Control (CSA)
4 Covariance Matrix Adaptation (CMA)
  Covariance Matrix Rank-One Update
  Cumulation—the Evolution Path
  Covariance Matrix Rank-µ Update
5 CMA-ES Summary
6 Theoretical Foundations
7 Comparing Experiments
8 Summary and Final Remarks
Problem Statement: Continuous Domain Search/Optimization
Task: minimize an objective function (fitness function, loss function) in continuous domain,

    f : X ⊆ R^n → R,   x ↦ f(x)

Black Box scenario (direct search scenario): the algorithm feeds in x and only observes f(x).
- gradients are not available or not useful
- problem domain specific knowledge is used only within the black box, e.g. within an appropriate encoding

Search costs: number of function evaluations
Problem Statement: Continuous Domain Search/Optimization
Goal
- fast convergence to the global optimum
  ... or to a robust solution x
- a solution x with small function value f(x), with least search cost

These are two conflicting objectives.
Typical Examples
- shape optimization (e.g. using CFD)
  curve fitting, airfoils
- model calibration
  biological, physical
- parameter calibration
  controller, plants, images
Problems
- exhaustive search is infeasible
- naive random search takes too long
- deterministic search is not successful / takes too long

Approach: stochastic search, Evolutionary Algorithms
Objective Function Properties
We assume f : X ⊆ R^n → R to be non-linear, non-separable and to have at least moderate dimensionality, say n not much smaller than 10. Additionally, f can be
- non-convex
- multimodal
  there are possibly many local optima
- non-smooth
  derivatives do not exist
- discontinuous, with plateaus
- ill-conditioned
- noisy
- ...

Goal: cope with any of these function properties
they are related to real-world problems
What Makes a Function Difficult to Solve? Why stochastic search?
- non-linear, non-quadratic, non-convex
  on linear and quadratic functions much better search policies are available
- ruggedness
  non-smooth, discontinuous, multimodal, and/or noisy function
- dimensionality (size of search space)
  (considerably) larger than three
- non-separability
  dependencies between the objective variables
- ill-conditioning

[Figures: a rugged 1-D function; the squeezed level sets of an ill-conditioned quadratic with gradient direction and Newton direction]

Ruggedness: non-smooth, discontinuous, multimodal, and/or noisy
[Figure: fitness along a 1-D cut from a 5-D example of a rugged function, (easily) solvable with evolution strategies]
Curse of Dimensionality
The term Curse of dimensionality (Richard Bellman) refers to problems caused by the rapid increase in volume associated with adding extra dimensions to a (mathematical) space.

Example: Consider placing 100 points onto a real interval, say [0, 1]. Now consider the 10-dimensional space [0, 1]^10. To get similar coverage, in terms of distance between adjacent points, would require 100^10 = 10^20 points. The 100 points now appear as isolated points in a vast empty space.

Remark: distance measures break down in higher dimensionalities (the central limit theorem kicks in).

Consequence: a search policy that is valuable in small dimensions might be useless in moderate or large dimensional search spaces. Example: exhaustive search.
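The arithmetic in the example can be checked directly:

```python
# The curse-of-dimensionality arithmetic from the example above:
# 100 points cover [0, 1] with adjacent spacing ~0.01; matching that
# spacing in [0, 1]**10 requires 100 points along each of the 10 axes.
points_1d = 100
dimension = 10
points_needed = points_1d ** dimension
print(points_needed)  # 100**10 == 10**20
```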
Separable Problems

Definition (Separable Problem): A function f is separable if

    arg min_{(x_1,...,x_n)} f(x_1, ..., x_n) = ( arg min_{x_1} f(x_1, ...), ..., arg min_{x_n} f(..., x_n) )

⇒ it follows that f can be optimized in a sequence of n independent 1-D optimization processes
Example: additively decomposable functions

    f(x_1, ..., x_n) = Σ_{i=1}^{n} f_i(x_i)

[Figure: level sets of the (separable) Rastrigin function]
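The consequence of separability can be sketched as follows; the decomposable test function and the simple 1-D grid search below are our own illustrative choices, not from the slides.

```python
import numpy as np

# A separable, additively decomposable function can be minimized by n
# independent 1-D searches. Each coordinate of f(x) = sum_i (x_i - i)**2
# is minimized here by a plain 1-D grid search.
def f_i(x, i):
    return (x - i) ** 2  # 1-D component, minimum at x = i

n = 3
grid = np.linspace(-5.0, 5.0, 1001)
solution = np.array([grid[np.argmin(f_i(grid, i))] for i in range(n)])
print(solution)  # each coordinate found independently, near [0, 1, 2]
```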
Non-Separable Problems: Building a non-separable problem from a separable one (1,2)
Rotating the coordinate system:
- f : x ↦ f(x) is separable
- f : x ↦ f(Rx) is, in general, non-separable
  where R is a rotation matrix
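The construction can be sketched as follows; the separable base function and the 45-degree rotation are our own example choices.

```python
import numpy as np

# Turn a separable function into a non-separable one by evaluating it
# on Rx, where R is a rotation matrix (here: rotation of the plane by
# 45 degrees).
def f_sep(x):
    # separable: weighted sum of squared coordinates
    return np.sum(x ** 2 * np.array([1.0, 100.0]))

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def f_rot(x):
    # the same function in a rotated coordinate system: non-separable
    return f_sep(R @ x)

x = np.array([1.0, 0.0])
print(f_sep(x), f_rot(x))  # 1.0 vs 50.5: the rotation mixes coordinates
```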
[Figure: level sets of a separable function (left) and of its non-separable counterpart after rotation by R (right)]
1 Hansen, Ostermeier, Gawelczyk (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. Sixth ICGA, pp. 57-64, Morgan Kaufmann
2 Salomon (1996). "Reevaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions. A survey of some theoretical and practical aspects of genetic algorithms." BioSystems, 39(3):263-278

Ill-Conditioned Problems: Curvature of level sets
Consider the convex-quadratic function

    f(x) = (1/2) (x − x*)^T H (x − x*) = (1/2) Σ_i h_{i,i} (x_i − x*_i)^2 + (1/2) Σ_{i≠j} h_{i,j} (x_i − x*_i)(x_j − x*_j)

where H is the Hessian matrix of f, symmetric and positive definite.
gradient direction: −f'(x)^T
Newton direction: −H^{−1} f'(x)^T
Ill-conditioning means squeezed level sets (high curvature). The condition number equals nine here. Condition numbers up to 10^10 are not unusual in real-world problems.
If H ≈ I (small condition number of H), first order information (e.g. the gradient) is sufficient. Otherwise second order information (estimation of H^{−1}) is necessary.
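A small numerical sketch, with H chosen to have condition number nine as in the figure: for a convex quadratic f(x) = (1/2) x^T H x (optimum at 0), the Newton direction points straight at the optimum while the gradient direction does not.

```python
import numpy as np

# Convex quadratic f(x) = 0.5 x^T H x, gradient f'(x)^T = H x.
H = np.array([[9.0, 0.0],
              [0.0, 1.0]])  # condition number 9, as on the slide
x = np.array([1.0, 1.0])

grad_dir = -H @ x                          # gradient direction: [-9, -1]
newton_dir = -np.linalg.solve(H, H @ x)    # Newton direction: -x = [-1, -1]

print(np.linalg.cond(H))   # 9.0
print(grad_dir, newton_dir)
```

The Newton direction rescales the gradient by H^{−1}, i.e. it corrects for the squeezed level sets; with H ≈ I the two directions would coincide.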
What Makes a Function Difficult to Solve? ... and what can be done
The Problem          Possible Approaches

Dimensionality       exploiting the problem structure
                     separability, locality/neighborhood, encoding

Ill-conditioning     second order approach
                     changes the neighborhood metric

Ruggedness           non-local policy, large sampling width (step-size)
                     as large as possible while preserving a reasonable convergence speed
                     population-based method, stochastic, non-elitistic
                     recombination operator serves as repair mechanism
                     restarts
                     ...
Metaphors
Evolutionary Computation           Optimization/Nonlinear Programming

individual, offspring, parent ←→  candidate solution,
                                  decision variables,
                                  design variables,
                                  object variables
population                    ←→  set of candidate solutions
fitness function              ←→  objective function,
                                  loss function,
                                  cost function,
                                  error function
generation                    ←→  iteration

... methods: ESs
Evolution Strategies (ES)
A Search Template: Stochastic Search
A black box search template to minimize f : R^n → R:

Initialize distribution parameters θ, set population size λ ∈ N
While not terminate
    1. Sample distribution P(x | θ) → x_1, ..., x_λ ∈ R^n
    2. Evaluate x_1, ..., x_λ on f
    3. Update parameters θ ← F_θ(θ, x_1, ..., x_λ, f(x_1), ..., f(x_λ))
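The three steps of the template can be sketched as follows. The choice P = N(m, σ²I), the greedy mean update, and the fixed step-size decay are our own illustrative simplifications (this is not yet CMA-ES).

```python
import numpy as np

# Minimal instance of the black-box search template: theta = (mean, sigma),
# P(x | theta) an isotropic normal distribution.
def stochastic_search(f, mean, sigma, lam=10, iterations=100):
    rng = np.random.default_rng(0)
    for _ in range(iterations):
        # 1. sample lambda candidates from P(x | theta)
        X = mean + sigma * rng.standard_normal((lam, len(mean)))
        # 2. evaluate them on f
        fitness = np.array([f(x) for x in X])
        # 3. update theta: here simply move the mean to the best candidate
        mean = X[np.argmin(fitness)]
        sigma *= 0.95  # crude fixed step-size decay (no adaptation)
    return mean

best = stochastic_search(lambda x: np.sum(x ** 2), np.ones(3), 1.0)
print(np.sum(best ** 2))  # a small value near the optimum at 0
```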
Everything depends on the definition of P and F_θ
    deterministic algorithms are covered as well
In many Evolutionary Algorithms the distribution P is implicitly defined via operators on a population, in particular, selection, recombination and mutation.
This is a natural template for (incremental) Estimation of Distribution Algorithms.

The CMA-ES
Input: m ∈ R^n, σ ∈ R_+, λ
Initialize: C = I, p_c = 0, p_σ = 0
Set: c_c ≈ 4/n, c_σ ≈ 4/n, c_1 ≈ 2/n², c_µ ≈ µ_w/n², c_1 + c_µ ≤ 1, d_σ ≈ 1 + sqrt(µ_w/n),
and w_{i=1,...,λ} such that µ_w = 1 / Σ_{i=1}^{µ} w_i² ≈ 0.3 λ

While not terminate
    x_i = m + σ y_i,  y_i ∼ N_i(0, C),  for i = 1, ..., λ                              (sampling)
    m ← Σ_{i=1}^{µ} w_i x_{i:λ} = m + σ y_w,  where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}      (update mean)
    p_c ← (1 − c_c) p_c + 1_{{||p_σ|| < 1.5 sqrt(n)}} sqrt(1 − (1 − c_c)²) sqrt(µ_w) y_w   (cumulation for C)
    p_σ ← (1 − c_σ) p_σ + sqrt(1 − (1 − c_σ)²) sqrt(µ_w) C^{−1/2} y_w                  (cumulation for σ)
    C ← (1 − c_1 − c_µ) C + c_1 p_c p_c^T + c_µ Σ_{i=1}^{µ} w_i y_{i:λ} y_{i:λ}^T      (update C)
    σ ← σ × exp( (c_σ/d_σ) (||p_σ|| / E||N(0, I)|| − 1) )                              (update of σ)

Not covered on this slide: termination, restarts, useful output, boundaries and encoding.
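The loop above can be sketched compactly in NumPy. This is a minimal sketch under the slide's simplified parameter settings (c_c = c_σ = 4/n, etc.), not a reference implementation; the population size formula, the sphere test function, and the evaluation budget are our own choices.

```python
import numpy as np

def cma_es(f, m, sigma, max_evals=6000, seed=1):
    rng = np.random.default_rng(seed)
    n = len(m)
    lam = 4 + int(3 * np.log(n))               # population size (assumption)
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                               # positive weights, sum to one
    mu_w = 1.0 / np.sum(w ** 2)
    c_c, c_s = 4.0 / n, 4.0 / n                # cumulation constants (slide)
    c_1 = 2.0 / n ** 2                         # rank-one learning rate
    c_mu = min(mu_w / n ** 2, 1.0 - c_1)       # rank-mu learning rate
    d_s = 1.0 + np.sqrt(mu_w / n)              # step-size damping
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n))     # ~ E||N(0, I)||
    C, p_c, p_s = np.eye(n), np.zeros(n), np.zeros(n)
    evals = 0
    while evals < max_evals:
        D2, B = np.linalg.eigh(C)              # C = B diag(D2) B^T
        D = np.sqrt(np.maximum(D2, 1e-20))
        Y = rng.standard_normal((lam, n)) @ np.diag(D) @ B.T  # y_i ~ N(0, C)
        X = m + sigma * Y                      # sampling
        fit = np.array([f(x) for x in X]); evals += lam
        idx = np.argsort(fit)[:mu]             # select the mu best
        y_w = w @ Y[idx]
        m = m + sigma * y_w                    # update mean
        # cumulation for sigma (conjugate evolution path)
        C_inv_half = B @ np.diag(1.0 / D) @ B.T
        p_s = ((1 - c_s) * p_s
               + np.sqrt(1 - (1 - c_s) ** 2) * np.sqrt(mu_w) * (C_inv_half @ y_w))
        # cumulation for C, stalled while ||p_s|| is large
        h = 1.0 if np.linalg.norm(p_s) < 1.5 * np.sqrt(n) else 0.0
        p_c = ((1 - c_c) * p_c
               + h * np.sqrt(1 - (1 - c_c) ** 2) * np.sqrt(mu_w) * y_w)
        # rank-one plus rank-mu update of C
        C = ((1 - c_1 - c_mu) * C + c_1 * np.outer(p_c, p_c)
             + c_mu * (Y[idx].T * w) @ Y[idx])
        # step-size update
        sigma *= np.exp((c_s / d_s) * (np.linalg.norm(p_s) / chi_n - 1))
    return m

m_final = cma_es(lambda x: np.sum(x ** 2), np.full(5, 3.0), 1.0)
print(np.sum(m_final ** 2))  # should be very close to 0
```

Note that sqrt(1 − (1 − c)²) equals the more common sqrt(c(2 − c)), so the path updates match the slide term by term.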
Evolution Strategies
New search points are sampled normally distributed
    x_i ∼ m + σ N_i(0, C)   for i = 1, ..., λ

as perturbations of m, where x_i, m ∈ R^n, σ ∈ R_+, C ∈ R^{n×n}, and where
- the mean vector m ∈ R^n represents the favorite solution
- the so-called step-size σ ∈ R_+ controls the step length
- the covariance matrix C ∈ R^{n×n} determines the shape of the distribution ellipsoid

Here, all new points are sampled with the same parameters.
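The sampling step can be sketched via a Cholesky factorization of C: a sample from N(0, C) is obtained as A z with z ∼ N(0, I) and A A^T = C. The concrete values of m, σ, and C below are our own example.

```python
import numpy as np

rng = np.random.default_rng(0)
m = np.array([1.0, 2.0])
sigma = 0.5
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])            # symmetric positive definite

A = np.linalg.cholesky(C)             # A @ A.T == C
Z = rng.standard_normal((100_000, 2)) # z ~ N(0, I)
X = m + sigma * Z @ A.T               # rows are samples x_i ~ N(m, sigma**2 C)

# the empirical covariance approaches sigma**2 * C
print(X.mean(axis=0))
print(np.cov(X, rowvar=False))
```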
The question remains how to update m, C, and σ.
Why Normal Distributions?
1. widely observed in nature, for example as phenotypic traits
2. only stable distribution with finite variance
   stable means that the sum of independent normal variates is again normal:

       N(x, A) + N(y, B) ∼ N(x + y, A + B)

   helpful in design and analysis of algorithms; related to the central limit theorem
3. most convenient way to generate isotropic search points
   the isotropic distribution does not favor any direction, it is rotationally invariant
4. maximum entropy distribution with finite variance
   the least possible assumptions on f in the distribution shape
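The stability property can be checked empirically; the means and covariance matrices below are our own example choices.

```python
import numpy as np

# Empirical check of N(x, A) + N(y, B) ~ N(x + y, A + B).
rng = np.random.default_rng(0)
x, y = np.array([1.0, 0.0]), np.array([0.0, 2.0])
A = np.array([[1.0, 0.3], [0.3, 1.0]])
B = np.array([[2.0, 0.0], [0.0, 0.5]])

S = (rng.multivariate_normal(x, A, size=200_000)
     + rng.multivariate_normal(y, B, size=200_000))

print(S.mean(axis=0))           # ~ x + y = [1, 2]
print(np.cov(S, rowvar=False))  # ~ A + B = [[3, 0.3], [0.3, 1.5]]
```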
Normal Distribution
[Figure: probability density of the 1-D standard normal distribution]

[Figure: probability density of a 2-D normal distribution]
The Multi-Variate (n-Dimensional) Normal Distribution
Any multi-variate normal distribution N (m, C) is uniquely determined by its mean n value m ∈ R and its symmetric positive definite n × n covariance matrix C.
The mean value m
- determines the displacement (translation)
- is the value with the largest density (modal value)
- the distribution is symmetric about the distribution mean

[Figure: density of a 2-D normal distribution]
The covariance matrix C determines the shape. Geometrical interpretation: any covariance matrix can be uniquely identified with the iso-density ellipsoid {x ∈ R^n | (x − m)^T C^{−1} (x − m) = 1}.
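The ellipsoid can be made concrete: its principal axes lie along the eigenvectors of C, with half-axis lengths equal to the square roots of the eigenvalues. A sketch with a covariance matrix of our own choosing:

```python
import numpy as np

# The ellipsoid {x : (x - m)^T C^{-1} (x - m) = 1} (m = 0 here) has
# principal axes along the eigenvectors of C with half-axis lengths
# sqrt(eigenvalue).
C = np.array([[3.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eigh(C)   # C = eigvecs @ diag(eigvals) @ eigvecs.T
half_axes = np.sqrt(eigvals)           # sqrt(2) and 2

# a point at the tip of a principal half-axis lies on the ellipsoid
v = half_axes[1] * eigvecs[:, 1]
print(v @ np.linalg.inv(C) @ v)        # 1.0, i.e. v satisfies the equation
```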
Lines of Equal Density