Tutorial CMA-ES — Evolution Strategies and Covariance Matrix Adaptation

Anne Auger & Nikolaus Hansen

INRIA Research Centre Saclay – Île-de-France, Project team TAO, University Paris-Sud, LRI (UMR 8623), Bât. 660, 91405 Orsay Cedex, France

Copyright is held by the author/owner(s). GECCO'13, July 6, 2013, Amsterdam, Netherlands. To get the slides: google "Nikolaus Hansen", then under Publications click the last bullet (Talks, seminars, tutorials). Don't hesitate to ask questions.

Content

1 Problem Statement: Black Box Optimization and Its Difficulties; Non-Separable Problems; Ill-Conditioned Problems
2 Evolution Strategies (ES): A Search Template; The Normal Distribution; Invariance
3 Step-Size Control: Why Step-Size Control; One-Fifth Success Rule; Path Length Control (CSA)
4 Covariance Matrix Adaptation (CMA): Covariance Matrix Rank-One Update; Cumulation—the Evolution Path; Covariance Matrix Rank-µ Update
5 CMA-ES Summary
6 Theoretical Foundations
7 Comparing Experiments
8 Summary and Final Remarks

Problem Statement: Continuous Domain Search/Optimization

Task: minimize an objective function (fitness function, loss function) in continuous domain:
  f : X ⊆ R^n → R,  x ↦ f(x)

Black Box scenario (direct search scenario): the algorithm passes x to the black box and receives f(x) back.
Gradients are not available or not useful; problem-domain-specific knowledge is used only within the black box, e.g. within an appropriate encoding.
Search costs: number of function evaluations.

Problem Statement: Continuous Domain Search/Optimization (continued)

Goal: fast convergence to the global optimum, or to a robust solution x; a solution x with small function value f(x) at the least search cost. These are two conflicting objectives.

Typical examples:
- shape optimization (e.g. using CFD): curve fitting, airfoils
- model calibration: biological, physical
- parameter calibration: controller, plants, images

Problems:
- exhaustive search is infeasible
- naive random search takes too long
- deterministic search is not successful / takes too long

Approach: stochastic search, Evolutionary Algorithms

Objective Function Properties

We assume f : X ⊂ R^n → R to be non-linear, non-separable and to have at least moderate dimensionality, say n ≥ 10. Additionally, f can be
- non-convex
- multimodal (there are possibly many local optima)
- non-smooth (derivatives do not exist)
- discontinuous, with plateaus
- ill-conditioned
- noisy
- ...

Goal: cope with any of these function properties; they are related to real-world problems.

What Makes a Function Difficult to Solve? Why stochastic search?

- non-linear, non-quadratic, non-convex: on linear and quadratic functions much better search policies are available
- ruggedness: non-smooth, discontinuous, multimodal, and/or noisy function
- dimensionality (size of search space): (considerably) larger than three
- non-separability: dependencies between the objective variables
- ill-conditioning: gradient direction versus Newton direction

[Figures: a non-convex 1-D function, a rugged multimodal 1-D function, and level sets of an ill-conditioned quadratic with gradient and Newton directions.]

Ruggedness: non-smooth, discontinuous, multimodal, and/or noisy

[Figure: fitness along a 1-D cut from a 5-D example: rugged and multimodal, yet (easily) solvable with evolution strategies.]

Curse of Dimensionality

The term Curse of dimensionality (Richard Bellman) refers to problems caused by the rapid increase in volume associated with adding extra dimensions to a (mathematical) space. Example: consider placing 100 points onto a real interval, say [0, 1]. Now consider the 10-dimensional space [0, 1]^10. To get similar coverage in terms of distance between adjacent points would require 100^10 = 10^20 points. The 100 points now appear as isolated points in a vast empty space.
Remark: distance measures break down in higher dimensionalities (the central limit theorem kicks in).
Consequence: a search policy that is valuable in small dimensions might be useless in moderate or large dimensional search spaces. Example: exhaustive search.

Separable Problems

Definition (Separable Problem). A function f is separable if
  argmin_{(x_1,...,x_n)} f(x_1, ..., x_n) = ( argmin_{x_1} f(x_1, ...), ..., argmin_{x_n} f(..., x_n) )
⇒ it follows that f can be optimized in a sequence of n independent 1-D optimization processes.

Example: additively decomposable functions
  f(x_1, ..., x_n) = Σ_{i=1}^n f_i(x_i)
e.g. the Rastrigin function
[Figure: level sets of the (separable) Rastrigin function.]

Non-Separable Problems: Building a non-separable problem from a separable one (1,2)

Rotating the coordinate system:
  f : x ↦ f(x) is separable
  f : x ↦ f(Rx) is non-separable
where R is a rotation matrix.
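To make the rotation construction concrete, here is a minimal Python/NumPy sketch (our own illustration, not from the slides); the helper names and the choice of the Rastrigin function are ours:

```python
import numpy as np

def random_rotation(n, rng):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # fix column signs

def rastrigin(x):
    # Separable benchmark function.
    x = np.asarray(x)
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

rng = np.random.default_rng(1)
R = random_rotation(10, rng)
f_rotated = lambda x: rastrigin(R @ x)  # non-separable variant of the same landscape
```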

[Figure: level sets of a separable function before and after rotation of the coordinate system.]

1 Hansen, Ostermeier, Gawelczyk (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. Sixth ICGA, pp. 57-64, Morgan Kaufmann
2 Salomon (1996). "Reevaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions; A survey of some theoretical and practical aspects of genetic algorithms." BioSystems, 39(3):263-278

Ill-Conditioned Problems: Curvature of level sets

Consider the convex-quadratic function
  f(x) = (1/2) (x − x*)^T H (x − x*) = (1/2) Σ_i h_{i,i} (x_i − x_i*)² + (1/2) Σ_{i≠j} h_{i,j} (x_i − x_i*)(x_j − x_j*)
where H is the Hessian matrix of f, symmetric and positive definite.

gradient direction: −f'(x)^T
Newton direction: −H^{-1} f'(x)^T

Ill-conditioning means squeezed level sets (high curvature). The condition number equals nine here. Condition numbers up to 10^10 are not unusual in real-world problems.

If H ≈ I (small condition number of H), first order information (e.g. the gradient) is sufficient. Otherwise second order information (estimation of H^{-1}) is necessary.
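As a small illustration (our own sketch, not from the slides; the 2-D Hessian below is an arbitrary choice), the gradient and the Newton direction differ markedly once H is ill-conditioned:

```python
import numpy as np

H = np.array([[10.0, 0.0],
              [0.0,  1.0]])            # condition number 10
x, x_star = np.array([1.0, 1.0]), np.zeros(2)

def f(x):
    # convex quadratic f(x) = 1/2 (x - x*)^T H (x - x*)
    d = x - x_star
    return 0.5 * d @ H @ d

grad = H @ (x - x_star)                 # f'(x)^T
gradient_dir = -grad                    # steepest descent, distorted by the conditioning
newton_dir = -np.linalg.solve(H, grad)  # -H^{-1} f'(x)^T, points straight to x*
print(np.linalg.cond(H), gradient_dir, newton_dir)
```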

What Makes a Function Difficult to Solve? . . . and what can be done

The Problem → Possible Approaches

- Dimensionality → exploiting the problem structure: separability, locality/neighborhood, encoding
- Ill-conditioning → second order approach: changes the neighborhood metric
- Ruggedness → non-local policy, large sampling width (step-size), as large as possible while preserving a reasonable convergence speed; population-based method, stochastic, non-elitistic; recombination operator serves as repair mechanism; restarts; ...

Metaphors

Evolutionary Computation ←→ Optimization/Nonlinear Programming

individual, offspring, parent ←→ candidate solution, design variables, decision variables, object variables
population ←→ set of candidate solutions
fitness function ←→ objective function, loss function, cost function, error function
generation ←→ iteration

Evolution Strategies (ES)

1 Problem Statement: Black Box Optimization and Its Difficulties; Non-Separable Problems; Ill-Conditioned Problems
2 Evolution Strategies (ES): A Search Template; The Normal Distribution; Invariance
3 Step-Size Control: Why Step-Size Control; One-Fifth Success Rule; Path Length Control (CSA)
4 Covariance Matrix Adaptation (CMA): Covariance Matrix Rank-One Update; Cumulation—the Evolution Path; Covariance Matrix Rank-µ Update
5 CMA-ES Summary
6 Theoretical Foundations
7 Comparing Experiments
8 Summary and Final Remarks

A Search Template: Stochastic Search

A black box search template to minimize f : R^n → R

Initialize distribution parameters θ, set population size λ ∈ N.
While not terminate:
  1. Sample distribution P(x | θ) → x_1, ..., x_λ ∈ R^n
  2. Evaluate x_1, ..., x_λ on f
  3. Update parameters θ ← F_θ(θ, x_1, ..., x_λ, f(x_1), ..., f(x_λ))
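The template translates directly into code. Below is a minimal Python/NumPy sketch (our own illustration, not from the slides), instantiated with an isotropic Gaussian whose mean is the only parameter θ that gets updated:

```python
import numpy as np

def black_box_search(f, theta0, sigma, lam, iterations, seed=0):
    """Minimal instance of the black-box search template:
    sample from P(x | theta), evaluate, update theta."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)   # here: the distribution mean
    for _ in range(iterations):
        X = theta + sigma * rng.standard_normal((lam, len(theta)))  # sample
        fX = np.array([f(x) for x in X])                            # evaluate
        theta = X[np.argmin(fX)]              # update: move mean to the best sample
    return theta

sphere = lambda x: float(np.sum(x**2))
print(black_box_search(sphere, np.ones(5), sigma=0.3, lam=20, iterations=200))
```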

Everything depends on the definition of P and F_θ; deterministic algorithms are covered as well. In many Evolutionary Algorithms the distribution P is implicitly defined via operators on a population, in particular selection, recombination and mutation. The template is also natural for (incremental) Estimation of Distribution Algorithms.

The CMA-ES

Input: m ∈ R^n, σ ∈ R_+, λ
Initialize: C = I, p_c = 0, p_σ = 0
Set: c_c ≈ 4/n, c_σ ≈ 4/n, c_1 ≈ 2/n², c_µ ≈ µ_w/n², c_1 + c_µ ≤ 1, d_σ ≈ 1 + sqrt(µ_w/n), and w_{i=1...λ} such that µ_w = 1 / Σ_{i=1}^µ w_i² ≈ 0.3 λ
While not terminate:
  x_i = m + σ y_i,  y_i ∼ N_i(0, C),  for i = 1, ..., λ    (sampling)
  m ← Σ_{i=1}^µ w_i x_{i:λ} = m + σ y_w  where y_w = Σ_{i=1}^µ w_i y_{i:λ}    (update mean)
  p_c ← (1 − c_c) p_c + 1_{‖p_σ‖<1.5√n} sqrt(1 − (1 − c_c)²) sqrt(µ_w) y_w    (cumulation for C)
  p_σ ← (1 − c_σ) p_σ + sqrt(1 − (1 − c_σ)²) sqrt(µ_w) C^{-1/2} y_w    (cumulation for σ)
  C ← (1 − c_1 − c_µ) C + c_1 p_c p_c^T + c_µ Σ_{i=1}^µ w_i y_{i:λ} y_{i:λ}^T    (update C)
  σ ← σ × exp( (c_σ/d_σ) ( ‖p_σ‖ / E‖N(0, I)‖ − 1 ) )    (update of σ)
Not covered on this slide: termination, restarts, useful output, boundaries and encoding.

Evolution Strategies

New search points are sampled normally distributed:
  x_i ∼ m + σ N_i(0, C)  for i = 1, ..., λ
as perturbations of m, where x_i, m ∈ R^n, σ ∈ R_+, C ∈ R^{n×n}, and where
- the mean vector m ∈ R^n represents the favorite solution
- the so-called step-size σ ∈ R_+ controls the step length
- the covariance matrix C ∈ R^{n×n} determines the shape of the distribution ellipsoid
Here, all new points are sampled with the same parameters.

The question remains how to update m, C, and σ.

Why Normal Distributions?

1. widely observed in nature, for example as phenotypic traits
2. only stable distribution with finite variance; stable means that the sum of normal variates is again normal:

N (x, A) + N (y, B) ∼ N (x + y, A + B)

   helpful in design and analysis of algorithms, related to the central limit theorem
3. most convenient way to generate isotropic search points; the isotropic distribution does not favor any direction, it is rotational invariant
4. maximum entropy distribution with finite variance, i.e. the least possible assumptions on f in the distribution shape

Normal Distribution

Standard Normal Distribution
[Figure: probability density of the 1-D standard normal distribution.]

2-D Normal Distribution
[Figure: probability density of a 2-D normal distribution.]

The Multi-Variate (n-Dimensional) Normal Distribution

Any multi-variate normal distribution N (m, C) is uniquely determined by its mean n value m ∈ R and its symmetric positive definite n × n covariance matrix C.

The mean value m
- determines the displacement (translation)
- is the value with the largest density (modal value)
- the distribution is symmetric about the distribution mean

The covariance matrix C
- determines the shape
- geometrical interpretation: any covariance matrix can be uniquely identified with the iso-density ellipsoid {x ∈ R^n | (x − m)^T C^{-1} (x − m) = 1}

. . . any covariance matrix can be uniquely identified with the iso-density ellipsoid {x ∈ R^n | (x − m)^T C^{-1} (x − m) = 1}

Lines of Equal Density

N(m, σ² I) ∼ m + σ N(0, I): one degree of freedom σ; components are independent standard normally distributed
N(m, D²) ∼ m + D N(0, I): n degrees of freedom; components are independent, scaled
N(m, C) ∼ m + C^{1/2} N(0, I): (n² + n)/2 degrees of freedom; components are correlated
where I is the identity matrix (isotropic case) and D is a diagonal matrix (reasonable for separable problems), and A × N(0, I) ∼ N(0, A A^T) holds for all A.
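The three sampling rules translate directly into code via the identity A N(0, I) ∼ N(0, A A^T); a minimal NumPy sketch (our own illustration, with arbitrarily chosen m, D and C):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 3, np.zeros(3), 0.5
D = np.diag([0.5, 1.0, 2.0])                      # diagonal scaling (separable case)
C = np.array([[4.0, 1.5, 0.0],
              [1.5, 2.0, 0.3],
              [0.0, 0.3, 1.0]])                   # full covariance (correlated case)

z = rng.standard_normal(n)                        # z ~ N(0, I)
x_iso = m + sigma * z                             # ~ N(m, sigma^2 I)
x_axis = m + D @ z                                # ~ N(m, D^2)

# general case: C^(1/2) from the eigendecomposition C = B diag(d) B^T
d, B = np.linalg.eigh(C)
C_sqrt = B @ np.diag(np.sqrt(d)) @ B.T
x_full = m + C_sqrt @ z                           # ~ N(m, C)
```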

Effect of Dimensionality

[Figure: probability density of a 2-D normal distribution.]

‖N(0, I) − N(0, I)‖ / √2 ∼ ‖N(0, I)‖ −→ N( sqrt(n − 1/2), 1/2 ), with modal value sqrt(n − 1);
yet: maximum entropy distribution

Evolution Strategies: Terminology

Let µ: # of parents, λ: # of offspring.
Plus (elitist) and comma (non-elitist) selection:
- (µ + λ)-ES: selection in {parents} ∪ {offspring}
- (µ, λ)-ES: selection in {offspring}

(1 + 1)-ES
Sample one offspring from parent m:

x = m + σ N(0, C)
If x is better than m, select
  m ← x

The (µ/µ, λ)-ES: Non-elitist selection and intermediate (weighted) recombination

Given the i-th solution point x_i = m + σ N_i(0, C) = m + σ y_i (with y_i := N_i(0, C)),
let x_{i:λ} be the i-th ranked solution point, such that f(x_{1:λ}) ≤ ··· ≤ f(x_{λ:λ}).
The new mean reads
  m ← Σ_{i=1}^µ w_i x_{i:λ} = m + σ Σ_{i=1}^µ w_i y_{i:λ}  =: m + σ y_w
where w_1 ≥ ··· ≥ w_µ > 0, Σ_{i=1}^µ w_i = 1, and 1 / Σ_{i=1}^µ w_i² =: µ_w ≈ λ/4.

The best µ points are selected from the new solutions (non-elitistic) and weighted intermediate recombination is applied.
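A minimal sketch of one (µ/µ_w, λ) iteration (our own illustration, not from the slides), with constant σ, C = I, and one common choice of positive, decreasing weights:

```python
import numpy as np

def one_generation(f, m, sigma, lam, mu, rng):
    """One (mu/mu_w, lambda)-ES step with C = I and fixed sigma."""
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))   # decreasing positive weights
    w /= w.sum()                                          # sum to one
    Y = rng.standard_normal((lam, len(m)))                # y_i ~ N(0, I)
    X = m + sigma * Y                                     # candidate solutions
    order = np.argsort([f(x) for x in X])                 # ranking: f(x_{1:lam}) <= ...
    y_w = w @ Y[order[:mu]]                               # weighted recombination of best mu
    return m + sigma * y_w

rng = np.random.default_rng(3)
m = np.full(10, 1.0)
for _ in range(300):
    m = one_generation(lambda x: np.sum(x**2), m, 0.1, lam=10, mu=5, rng=rng)
print(np.sum(m**2))
```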

Invariance Under Monotonically Increasing Functions

Rank-based algorithms: the update of all parameters uses only the ranks

f(x_{1:λ}) ≤ f(x_{2:λ}) ≤ ... ≤ f(x_{λ:λ})

g(f(x_{1:λ})) ≤ g(f(x_{2:λ})) ≤ ... ≤ g(f(x_{λ:λ}))  for every strictly monotonically increasing g

g preserves ranks.³

3 Whitley 1989. The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best, ICGA
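A quick numerical check (our own illustration): the ranking, and hence the update of a rank-based algorithm, is unchanged under a strictly increasing transformation g of f.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 5))                  # 10 candidate solutions in 5-D
f = lambda x: float(np.sum(x**2))
g = lambda v: np.exp(3 * v) - 7                   # some strictly increasing transformation

ranks_f = np.argsort([f(x) for x in X])
ranks_gf = np.argsort([g(f(x)) for x in X])
assert np.array_equal(ranks_f, ranks_gf)          # identical ranking => identical behavior
```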

Basic Invariance in Search Space

Translation invariance is true for most optimization algorithms:

f(x) ↔ f(x − a)

Identical behavior on f and f_a:
  f : x ↦ f(x),        x^{(t=0)} = x_0
  f_a : x ↦ f(x − a),  x^{(t=0)} = x_0 + a
No difference can be observed with respect to the argument of f.

Rotational Invariance in Search Space

Invariance to orthogonal (rigid) transformations R, where R R^T = I; e.g. true for simple evolution strategies; recombination operators might jeopardize rotational invariance.

  f(x) ↔ f(Rx)

Identical behavior on f and f_R:⁴ ⁵
  f : x ↦ f(x),      x^{(t=0)} = x_0
  f_R : x ↦ f(Rx),   x^{(t=0)} = R^{-1}(x_0)
No difference can be observed with respect to the argument of f.

4 Salomon 1996. "Reevaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions; A survey of some theoretical and practical aspects of genetic algorithms." BioSystems, 39(3):263-278
5 Hansen 2000. Invariance, Self-Adaptation and Correlated Mutations in Evolution Strategies. Parallel Problem Solving from Nature PPSN VI

Invariance

The grand aim of all science is to cover the greatest number of empirical facts by logical deduction from the smallest number of hypotheses or axioms. — Albert Einstein

Empirical performance results, from benchmark functions or from solved real-world problems, are only useful if they generalize to other problems.

Invariance is a strong non-empirical statement about generalization: it generalizes (identical) performance from a single function to a whole class of functions. Consequently, invariance is important for the evaluation of search algorithms.

Step-Size Control

1 Problem Statement: Black Box Optimization and Its Difficulties; Non-Separable Problems; Ill-Conditioned Problems
2 Evolution Strategies (ES): A Search Template; The Normal Distribution; Invariance
3 Step-Size Control: Why Step-Size Control; One-Fifth Success Rule; Path Length Control (CSA)
4 Covariance Matrix Adaptation (CMA): Covariance Matrix Rank-One Update; Cumulation—the Evolution Path; Covariance Matrix Rank-µ Update
5 CMA-ES Summary
6 Theoretical Foundations
7 Comparing Experiments
8 Summary and Final Remarks

Evolution Strategies: Recalling

New search points are sampled normally distributed

x_i ∼ m + σ N_i(0, C)  for i = 1, ..., λ
as perturbations of m, where x_i, m ∈ R^n, σ ∈ R_+, C ∈ R^{n×n}, and where
- the mean vector m ∈ R^n represents the favorite solution, updated as m ← Σ_{i=1}^µ w_i x_{i:λ}
- the so-called step-size σ ∈ R_+ controls the step length
- the covariance matrix C ∈ R^{n×n} determines the shape of the distribution ellipsoid
The remaining question is how to update σ and C.

Why Step-Size Control?

[Figure: function value versus function evaluations for a (1+1)-ES on f(x) = Σ_{i=1}^n x_i², n = 10, x_0 ∈ [−2.2, 0.8]^n, comparing random search, constant step-size (too small and too large), and the optimal, scale-invariant step-size.]

(5/5_w, 10)-ES, 11 runs on f(x) = Σ_{i=1}^n x_i² for n = 10 and x_0 ∈ [−0.2, 0.8]^n:
[Figure: distance to the optimum ‖m − x*‖ versus function evaluations, with optimal step-size σ.]

(5/5_w, 10)-ES, 2×11 runs on the same function:
[Figure: optimal versus adaptive step-size σ, with too small initial σ.]

(5/5_w, 10)-ES on the same function:
[Figure: runs with optimal step-size, with step-size control, and the respective step-size.]
Comparing the number of f-evaluations to reach ‖m‖ = 10^{-5}: (1100 − 100)/650 ≈ 1.5.

(5/5_w, 10)-ES on the same function:
[Figure: runs with optimal step-size, with step-size control, and the respective step-size.]
Comparing the optimal versus the default damping parameter d_σ: 1700/1100 ≈ 1.5.

[Figure, left: function value versus function evaluations for constant σ, random search, and optimal (scale-invariant) versus adaptive step-size. Right: normalized progress versus normalized step-size; the evolution window refers to the step-size interval where reasonable performance is observed.]

Methods for Step-Size Control

1/5-th success rule^{a,b}, often applied with "+"-selection:
  increase the step-size if more than 20% of the new solutions are successful, decrease otherwise.

σ-self-adaptation^c, applied with ","-selection:
  mutation is applied to the step-size, and the better step-size, according to the objective function value, is selected;
  simplified "global" self-adaptation.

path length control^d (Cumulative Step-size Adaptation, CSA)^e:
  self-adaptation derandomized and non-localized.

a Rechenberg 1973, Evolutionsstrategie, Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog
b Schumer and Steiglitz 1968. Adaptive step size random search. IEEE TAC
c Schwefel 1981, Numerical Optimization of Computer Models, Wiley
d Hansen & Ostermeier 2001, Completely Derandomized Self-Adaptation in Evolution Strategies, Evol. Comput. 9(2)
e Ostermeier et al 1994, Step-size adaptation based on non-local use of selection information, PPSN IV

One-fifth success rule

[Figures: schematic illustration of the one-fifth success rule, showing sampling situations in which σ should be increased or decreased, with success probabilities p_s ≈ 1/2 and p_s ≈ 1/5 ("too small").]

One-fifth success rule

p_s: # of successful offspring / # offspring (per generation)

  σ ← σ × exp( (1/3) × (p_s − p_target)/(1 − p_target) )

Increase σ if p_s > p_target, decrease σ if p_s < p_target.

(1 + 1)-ES with p_target = 1/5:
  IF offspring better than parent: p_s = 1, σ ← σ × exp(1/3)
  ELSE: p_s = 0, σ ← σ / exp(1/3)^{1/4}
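A minimal (1+1)-ES with this rule (our own sketch; the sphere function and the evaluation budget are illustrative choices):

```python
import numpy as np

def one_plus_one_es(f, x0, sigma, evaluations, seed=0):
    """(1+1)-ES with the 1/5-th success rule."""
    rng = np.random.default_rng(seed)
    x, fx = np.asarray(x0, dtype=float), f(x0)
    for _ in range(evaluations):
        y = x + sigma * rng.standard_normal(len(x))   # offspring = parent + sigma * N(0, I)
        fy = f(y)
        if fy <= fx:                                   # success: accept and enlarge sigma
            x, fx = y, fy
            sigma *= np.exp(1.0 / 3.0)
        else:                                          # failure: shrink sigma
            sigma /= np.exp(1.0 / 3.0) ** 0.25
    return x, fx, sigma

x, fx, sigma = one_plus_one_es(lambda x: np.sum(x**2), np.ones(10), 0.5, 3000)
print(fx, sigma)
```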

Path Length Control (CSA): The Concept of Cumulative Step-Size Adaptation

  x_i = m + σ y_i,   m ← m + σ y_w

Measure the length of the evolution path, the pathway of the mean vector m in the generation sequence.

[Figure: two example evolution paths; a short path (single steps cancel each other out) indicates that σ should be decreased, a long path (steps point in similar directions) that σ should be increased.]

Loosely speaking, steps are
- perpendicular under random selection (in expectation)
- perpendicular in the desired situation (to be most efficient)

Path Length Control (CSA): The Equations

Initialize m ∈ R^n, σ ∈ R_+, evolution path p_σ = 0,
set c_σ ≈ 4/n, d_σ ≈ 1.

  m ← m + σ y_w  where y_w = Σ_{i=1}^µ w_i y_{i:λ}    (update mean)

  p_σ ← (1 − c_σ) p_σ + sqrt(1 − (1 − c_σ)²) sqrt(µ_w) y_w
  (the factor sqrt(1 − (1 − c_σ)²) accounts for 1 − c_σ, the factor sqrt(µ_w) for the weights w_i)

  σ ← σ × exp( (c_σ/d_σ) ( ‖p_σ‖ / E‖N(0, I)‖ − 1 ) )    (update step-size)
  (the exponential factor is > 1 ⟺ ‖p_σ‖ is greater than its expectation)
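The two CSA updates in code (a minimal sketch under the parameter choices above; `chi_n` is the usual approximation of E‖N(0, I)‖):

```python
import numpy as np

def csa_update(p_sigma, sigma, y_w, mu_w, n, c_sigma=None, d_sigma=1.0):
    """One cumulative step-size adaptation (CSA) step: update path p_sigma, then sigma."""
    if c_sigma is None:
        c_sigma = 4.0 / n                                     # c_sigma ~ 4/n as above
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n**2))  # ~ E||N(0, I)||
    p_sigma = (1 - c_sigma) * p_sigma \
              + np.sqrt(1 - (1 - c_sigma)**2) * np.sqrt(mu_w) * y_w
    sigma *= np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))
    return p_sigma, sigma
```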

(5/5, 10)-CSA-ES, default parameters, on f(x) = Σ_{i=1}^n x_i² with x_0 ∈ [−0.2, 0.8]^n for n = 30:
[Figure: distance to the optimum and step-size versus function evaluations, with optimal step-size, with step-size control, and the respective step-size.]

Covariance Matrix Adaptation (CMA)

1 Problem Statement

2 Evolution Strategies (ES)

3 Step-Size Control

4 Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-One Update Cumulation—the Evolution Path Covariance Matrix Rank-µ Update

5 CMA-ES Summary

6 Theoretical Foundations

7 Comparing Experiments

8 Summary and Final Remarks

Evolution Strategies: Recalling

New search points are sampled normally distributed:
  x_i ∼ m + σ N_i(0, C)  for i = 1, ..., λ
as perturbations of m, where x_i, m ∈ R^n, σ ∈ R_+, C ∈ R^{n×n}, and where
- the mean vector m ∈ R^n represents the favorite solution
- the so-called step-size σ ∈ R_+ controls the step length
- the covariance matrix C ∈ R^{n×n} determines the shape of the distribution ellipsoid
The remaining question is how to update C.

Covariance Matrix Adaptation: Rank-One Update

  m ← m + σ y_w,  y_w = Σ_{i=1}^µ w_i y_{i:λ},  y_i ∼ N_i(0, C)

New distribution: C ← 0.8 × C + 0.2 × y_w y_w^T
The ruling principle: the adaptation increases the likelihood of successful steps, y_w, to appear again.
Another viewpoint: the adaptation follows a natural gradient approximation of the expected fitness.

Initialize m ∈ R^n, and C = I, set σ = 1, learning rate c_cov ≈ 2/n²
While not terminate:
  x_i = m + σ y_i,  y_i ∼ N_i(0, C)
  m ← m + σ y_w  where y_w = Σ_{i=1}^µ w_i y_{i:λ}
  C ← (1 − c_cov) C + c_cov µ_w y_w y_w^T    (rank-one)   where µ_w = 1 / Σ_{i=1}^µ w_i² ≥ 1
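In code, the rank-one update is a single outer product (a minimal sketch, not the reference implementation; the numbers below are illustrative):

```python
import numpy as np

def rank_one_update(C, y_w, mu_w, c_cov):
    """C <- (1 - c_cov) C + c_cov * mu_w * y_w y_w^T"""
    return (1 - c_cov) * C + c_cov * mu_w * np.outer(y_w, y_w)

n = 5
C = np.eye(n)
y_w = np.array([1.0, 0.5, 0.0, -0.2, 0.1])       # an assumed selected weighted step
C = rank_one_update(C, y_w, mu_w=3.0, c_cov=2.0 / n**2)
```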

The rank-one update has been found independently in several domains.^{6,7,8,9}

6 Kjellström & Taxén 1981. Stochastic optimization in system design, IEEE TCS
7 Hansen & Ostermeier 1996. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation, ICEC
8 Ljung 1999. System Identification: Theory for the User
9 Haario et al 2001. An adaptive Metropolis algorithm, JSTOR

C ← (1 − c_cov) C + c_cov µ_w y_w y_w^T

The covariance matrix adaptation
- learns all pairwise dependencies between variables (off-diagonal entries in the covariance matrix reflect the dependencies)
- conducts a principal component analysis (PCA) of steps y_w, sequentially in time and space (eigenvectors of the covariance matrix C are the principal components / the principal axes of the mutation ellipsoid)
- learns a new rotated problem representation (components are independent only in the new representation)
- learns a new (Mahalanobis) metric: a variable metric method
- approximates the inverse Hessian on quadratic functions: a transformation into the sphere function
- for µ = 1: conducts a natural gradient ascent on the distribution N, entirely independent of the given coordinate system


1 Problem Statement

2 Evolution Strategies (ES)

3 Step-Size Control

4 Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-One Update Cumulation—the Evolution Path Covariance Matrix Rank-µ Update

5 CMA-ES Summary

6 Theoretical Foundations

7 Comparing Experiments

8 Summary and Final Remarks

Cumulation: The Evolution Path

Evolution Path: conceptually, the evolution path is the search path the strategy takes over a number of generation steps. It can be expressed as a sum of consecutive steps of the mean m.

An exponentially weighted sum of steps y_w is used:
  p_c ∝ Σ_{i=0}^{g} (1 − c_c)^{g−i} y_w^{(i)}    (exponentially fading weights)

The recursive construction of the evolution path (cumulation):
  p_c ← (1 − c_c) p_c + sqrt(1 − (1 − c_c)²) sqrt(µ_w) y_w
  (decay factor; normalization factor; input y_w = (m − m_old)/σ)

where µ_w = 1 / Σ w_i² and c_c ≪ 1. History information is accumulated in the evolution path.

"Cumulation" is a widely used technique and is also known as
- exponential smoothing in time series and forecasting
- exponentially weighted moving average
- iterate averaging in stochastic approximation
- momentum in the back-propagation algorithm for ANNs
- ...
"Cumulation" conducts a low-pass filtering, but there is more to it. . .

Cumulation: Utilizing the Evolution Path

The sign information (signifying correlation between steps) is (re-)introduced by using the evolution path. √ p 2 pc ← (1 − cc) pc + 1 − (1 − cc) µw yw | {z } | {z } decay factor normalization factor T C ← (1 − ccov)C + ccov pc pc | {z } rank-one

1 where µw = P 2 , ccov  cc  1 such that 1/cc is the “backward time horizon”. wi

Anne Auger & Nikolaus Hansen CMA-ES July, 2013... resulting 53 in / 83 Covariance Matrix Adaptation (CMA) Cumulation—the Evolution Path

Using an evolution path for the rank-one update of the covariance matrix reduces the number of function evaluations needed to adapt to a straight ridge from about O(n²) to O(n).^a

a Hansen & Auger 2013. Principled design of continuous stochastic search: From theory to practice.

[Figure: number of f-evaluations divided by dimension on the cigar function f(x) = x_1² + 10^6 Σ_{i=2}^n x_i², versus dimension, for c_c = 1 (no cumulation), c_c = 1/√n and c_c = 1/n.]

The overall model complexity is n², but important parts of the model can be learned in time of order n.

Covariance Matrix Rank-µ Update

x_i = m + σ y_i,  y_i ∼ N_i(0, C),  m ← m + σ y_w,  y_w = Σ_{i=1}^µ w_i y_{i:λ}

The rank-µ update extends the update rule for large population sizes λ, using µ > 1 vectors to update C at each generation step. The weighted empirical covariance matrix
  C_µ = Σ_{i=1}^µ w_i y_{i:λ} y_{i:λ}^T
computes a weighted mean of the outer products of the best µ steps and has rank min(µ, n) with probability one (with µ = λ the weights can be negative^{10}). The rank-µ update then reads
  C ← (1 − c_cov) C + c_cov C_µ
where c_cov ≈ µ_w/n² and c_cov ≤ 1.

10 Jastrebski and Arnold (2006). Improving evolution strategies through active covariance matrix adaptation. CEC.
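A minimal sketch of the rank-µ update (our own illustration): `Y_sel` holds the best µ steps y_{i:λ}, one per row, already ranked.

```python
import numpy as np

def rank_mu_update(C, Y_sel, w, c_cov):
    """C <- (1 - c_cov) C + c_cov * sum_i w_i y_{i:lambda} y_{i:lambda}^T"""
    C_mu = (w[:, None] * Y_sel).T @ Y_sel    # weighted empirical covariance of the steps
    return (1 - c_cov) * C + c_cov * C_mu
```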

Covariance Matrix Rank-µ Update: illustration

  x_i = m + σ y_i,  y_i ∼ N(0, C),   C_µ = (1/µ) Σ y_{i:λ} y_{i:λ}^T,   m_new ← m + (1/µ) Σ y_{i:λ},   C ← (1 − 1) × C + 1 × C_µ

[Figure, from left to right: sampling of λ = 150 solutions where C = I and σ = 1; calculating C where µ = 50, w_1 = ··· = w_µ = 1/µ and c_cov = 1; the new distribution.]

Rank-µ CMA versus Estimation of Multivariate Normal Algorithm EMNA_global^{11}

rank-µ CMA conducts a PCA of steps:
  x_i = m_old + y_i,  y_i ∼ N(0, C),   C ← (1/µ) Σ (x_{i:λ} − m_old)(x_{i:λ} − m_old)^T,   m_new = m_old + (1/µ) Σ y_{i:λ}

EMNA_global conducts a PCA of points:
  x_i = m_old + y_i,  y_i ∼ N(0, C),   C ← (1/µ) Σ (x_{i:λ} − m_new)(x_{i:λ} − m_new)^T,   m_new = m_old + (1/µ) Σ y_{i:λ}

[Figure, from left to right: sampling of λ = 150 solutions (dots); calculating C from µ = 50 solutions; the new distribution.]

mnew is the minimizer for the variances when calculating C

11 Hansen, N. (2006). The CMA Evolution Strategy: A Comparing Review. In J.A. Lozano, P. Larrañaga, I. Inza and E. Bengoetxea (Eds.). Towards a new evolutionary computation. Advances in estimation of distribution algorithms. pp. 75-102

The rank-µ update increases the possible learning rate in large populations, roughly from 2/n² to µ_w/n², and can reduce the number of necessary generations roughly from O(n²) to O(n),^{12} given µ_w ∝ λ ∝ n. Therefore the rank-µ update is the primary mechanism whenever a large population size is used, say λ ≥ 3n + 10.

The rank-one update uses the evolution path and reduces the number of necessary function evaluations to learn straight ridges from O(n²) to O(n).

Rank-one update and rank-µ update can be combined.

12 Hansen, Müller, and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1-18

CMA-ES Summary: Summary of Equations, The Covariance Matrix Adaptation Evolution Strategy

Input: m ∈ R^n, σ ∈ R_+, λ
Initialize: C = I, p_c = 0, p_σ = 0
Set: c_c ≈ 4/n, c_σ ≈ 4/n, c_1 ≈ 2/n², c_µ ≈ µ_w/n², c_1 + c_µ ≤ 1, d_σ ≈ 1 + sqrt(µ_w/n), and w_{i=1...λ} such that µ_w = 1 / Σ_{i=1}^µ w_i² ≈ 0.3 λ
While not terminate:
  x_i = m + σ y_i,  y_i ∼ N_i(0, C),  for i = 1, ..., λ    (sampling)
  m ← Σ_{i=1}^µ w_i x_{i:λ} = m + σ y_w  where y_w = Σ_{i=1}^µ w_i y_{i:λ}    (update mean)
  p_c ← (1 − c_c) p_c + 1_{‖p_σ‖<1.5√n} sqrt(1 − (1 − c_c)²) sqrt(µ_w) y_w    (cumulation for C)
  p_σ ← (1 − c_σ) p_σ + sqrt(1 − (1 − c_σ)²) sqrt(µ_w) C^{-1/2} y_w    (cumulation for σ)
  C ← (1 − c_1 − c_µ) C + c_1 p_c p_c^T + c_µ Σ_{i=1}^µ w_i y_{i:λ} y_{i:λ}^T    (update C)
  σ ← σ × exp( (c_σ/d_σ) ( ‖p_σ‖ / E‖N(0, I)‖ − 1 ) )    (update of σ)
Not covered on this slide: termination, restarts, useful output, boundaries and encoding.

Source Code Snippet
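The original slide shows a source-code listing that did not survive text extraction. As a stand-in, here is a compact, self-contained Python/NumPy sketch of the summary equations above; it is our own simplified reimplementation (fixed iteration budget, no restart or boundary handling), not the snippet from the slide:

```python
import numpy as np

def cmaes(f, m, sigma, iterations, seed=0):
    """Bare-bones CMA-ES following the summary of equations (a sketch, not reference code)."""
    rng = np.random.default_rng(seed)
    n = len(m)
    lam = 4 + int(3 * np.log(n))                   # common default population size
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    mu_w = 1.0 / np.sum(w**2)

    # strategy parameters, roughly c_c ~ 4/n, c_sigma ~ 4/n, c_1 ~ 2/n^2, c_mu ~ mu_w/n^2
    c_c = 4.0 / (n + 4)
    c_sigma = 4.0 / (n + 4)
    c_1 = 2.0 / n**2
    c_mu = min(mu_w / n**2, 1 - c_1)
    d_sigma = 1 + np.sqrt(mu_w / n)
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n**2))   # ~ E||N(0,I)||

    C = np.eye(n)
    p_c = np.zeros(n)
    p_sigma = np.zeros(n)

    for _ in range(iterations):
        # sampling: x_i = m + sigma * y_i with y_i ~ N(0, C)
        d, B = np.linalg.eigh(C)
        d = np.sqrt(np.maximum(d, 1e-20))
        Z = rng.standard_normal((lam, n))
        Y = Z @ np.diag(d) @ B.T
        X = m + sigma * Y
        order = np.argsort([f(x) for x in X])
        Y_sel = Y[order[:mu]]

        # update mean
        y_w = w @ Y_sel
        m = m + sigma * y_w

        # cumulation for sigma (uses C^{-1/2} y_w) and for C (with stall indicator h_sigma)
        C_inv_sqrt = B @ np.diag(1.0 / d) @ B.T
        p_sigma = (1 - c_sigma) * p_sigma \
                  + np.sqrt(1 - (1 - c_sigma)**2) * np.sqrt(mu_w) * (C_inv_sqrt @ y_w)
        h_sigma = np.linalg.norm(p_sigma) < 1.5 * np.sqrt(n)
        p_c = (1 - c_c) * p_c \
              + h_sigma * np.sqrt(1 - (1 - c_c)**2) * np.sqrt(mu_w) * y_w

        # update C: rank-one plus rank-mu
        C = (1 - c_1 - c_mu) * C \
            + c_1 * np.outer(p_c, p_c) \
            + c_mu * (w[:, None] * Y_sel).T @ Y_sel

        # update sigma
        sigma *= np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))

    return m

if __name__ == "__main__":
    rosen = lambda x: sum(100 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)
    print(cmaes(rosen, np.zeros(5), 0.5, iterations=600))
```

For serious use, the reference implementations linked at the end of the tutorial are preferable.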


Strategy Internal Parameters

Related to selection and recombination:
- λ, offspring number, number of new solutions sampled, population size
- µ, parent number, number of solutions involved in updates of m, C, and σ
- w_{i=1,...,µ}, recombination weights
µ and w_i should be chosen such that the variance effective selection mass µ_w ≈ λ/4, where µ_w := 1 / Σ_{i=1}^µ w_i².

Related to C-update:
- c_c, decay rate for the evolution path
- c_1, learning rate for the rank-one update of C
- c_µ, learning rate for the rank-µ update of C

Related to σ-update:
- c_σ, decay rate of the evolution path
- d_σ, damping for σ-change

Parameters were identified in carefully chosen experimental set-ups. They do not in the first place depend on the objective function and are not meant to be in the user's choice. Only(?) the population size λ (and the initial σ) might reasonably be varied in a wide range, depending on the objective function. Useful: restarts with increasing population size (IPOP).


The Experimentum Crucis (0): What did we want to achieve?

reduce any convex-quadratic function
  f(x) = x^T H x
e.g. f(x) = Σ_{i=1}^n 10^{6 (i−1)/(n−1)} x_i²
to the sphere model
  f(x) = x^T x
without use of derivatives

lines of equal density align with lines of equal fitness
  C ∝ H^{-1}
in a stochastic sense

The Experimentum Crucis (1): f convex quadratic, separable

[Figure: a single CMA-ES run, showing function value, σ and axis ratio; object variables (9-D); principal axes lengths of C; and standard deviations in coordinates divided by σ, each plotted against function evaluations.]

f(x) = Σ_{i=1}^n 10^{α (i−1)/(n−1)} x_i²,  α = 6

The Experimentum Crucis (2): f convex quadratic, as before but non-separable (rotated)

[Figure: the same plots for the rotated function; the principal axes lengths of C adapt such that C ∝ H^{-1} for all g, H.]

f(x) = g(x^T H x),  g : R → R strictly increasing

Theoretical Foundations

1 Problem Statement

2 Evolution Strategies (ES)

3 Step-Size Control

4 Covariance Matrix Adaptation (CMA)

5 CMA-ES Summary

6 Theoretical Foundations

7 Comparing Experiments

8 Summary and Final Remarks

Natural Gradient Descent

Consider argmin_θ E(f(x) | θ) under the sampling distribution x ∼ p(· | θ).
We could improve E(f(x) | θ) by following the gradient ∇_θ E(f(x) | θ):
  θ ← θ − η ∇_θ E(f(x) | θ),  η > 0
but ∇_θ depends on the parameterization of the distribution. Therefore consider the natural gradient of the expected transformed fitness
  ∇̃_θ E(w ∘ P_f(f(x)) | θ) = F_θ^{-1} ∇_θ E(w ∘ P_f(f(x)) | θ) = E( w ∘ P_f(f(x)) · F_θ^{-1} ∇_θ ln p(x | θ) )
using the Fisher information matrix F_θ = ( E[ ∂² log p(x|θ) / ∂θ_i ∂θ_j ] )_{ij} of the density p.
The natural gradient is invariant under re-parameterization of the distribution.
A Monte-Carlo approximation reads
  ∇̃_θ Ê(w(f(x)) | θ) = Σ_{i=1}^λ ŵ_i F_θ^{-1} ∇_θ ln p(x_{i:λ} | θ),  ŵ_i = ŵ(f(x_{i:λ}) | θ)

CMA-ES = Natural Evolution Strategy + Cumulation

Natural gradient descent using the MC approximation and the normal distribution.

Rewriting the update of the distribution mean:
  m_new ← Σ_{i=1}^µ w_i x_{i:λ} = m + Σ_{i=1}^µ w_i (x_{i:λ} − m)
where Σ_{i=1}^µ w_i (x_{i:λ} − m) is the natural gradient for the mean, ∂̃/∂̃m Ê(w ∘ P_f(f(x)) | m, C).

Rewriting the update of the covariance matrix:^{13}
  C_new ← C + c_1 (p_c p_c^T − C)    (rank-one)
              + (c_µ/σ²) Σ_{i=1}^µ w_i ( (x_{i:λ} − m)(x_{i:λ} − m)^T − σ² C )    (rank-µ)
where the rank-µ sum is the natural gradient for the covariance matrix, ∂̃/∂̃C Ê(w ∘ P_f(f(x)) | m, C).

13 Akimoto et al. (2010): Bidirectional Relation between CMA Evolution Strategies and Natural Evolution Strategies, PPSN XI

Maximum Likelihood Update

The new distribution mean m maximizes the log-likelihood
  m_new = argmax_m Σ_{i=1}^µ w_i log p_N(x_{i:λ} | m)
independently of the given covariance matrix.

The rank-µ update matrix C_µ maximizes the log-likelihood
  C_µ = argmax_C Σ_{i=1}^µ w_i log p_N( (x_{i:λ} − m_old)/σ | m_old, C )

where
  log p_N(x | m, C) = −(1/2) log det(2πC) − (1/2) (x − m)^T C^{-1} (x − m)
and p_N is the density of the multi-variate normal distribution.

Variable Metric

On the function class
  f(x) = g( (1/2) (x − x*)^T H (x − x*) )
the covariance matrix approximates the inverse Hessian up to a constant factor, that is:
  C ∝ H^{-1}  (approximately)
In effect, ellipsoidal level-sets are transformed into spherical level-sets.
g : R → R is strictly increasing.

On Convergence

Evolution Strategies converge with probability one on, e.g., g((1/2) x^T H x), like
  ‖m_k − x*‖ ∝ e^{−ck},  c ≤ 0.25/n
[Figure: log-linear convergence of the distance to the optimum and of the step-size versus function evaluations.]

Monte Carlo pure random search converges like
  ‖m_k − x*‖ ∝ k^{−c} = e^{−c log k},  c = 1/n

Comparing Experiments

1 Problem Statement

2 Evolution Strategies (ES)

3 Step-Size Control

4 Covariance Matrix Adaptation (CMA)

5 CMA-ES Summary

6 Theoretical Foundations

7 Comparing Experiments

8 Summary and Final Remarks

Comparison to BFGS, NEWUOA, PSO and DE: f convex quadratic, separable, with varying condition number α

Ellipsoid, dimension 20, 21 trials, tolerance 1e−09, eval max 1e+07
[Figure: SP1 versus condition number for BFGS (Broyden et al 1970), NEWUOA (Powell 2004), DE (Storn & Price 1996), PSO (Kennedy & Eberhart 1995) and CMA-ES (Hansen & Ostermeier 2001).]

f(x) = g(x^T H x) with H diagonal; g identity (for BFGS and NEWUOA), g any order-preserving, i.e. strictly increasing, function (for all others).
SP1 = average number of objective function evaluations^{14} to reach the target function value of g^{-1}(10^{-9}).

14 Auger et al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

Comparison to BFGS, NEWUOA, PSO and DE: f convex quadratic, non-separable (rotated), with varying condition number α

Rotated Ellipsoid, dimension 20, 21 trials, tolerance 1e−09, eval max 1e+07
[Figure: SP1 versus condition number for the same five algorithms.]

f(x) = g(x^T H x) with H full; g identity (for BFGS and NEWUOA), g any order-preserving, i.e. strictly increasing, function (for all others).
SP1 = average number of objective function evaluations^{15} to reach the target function value of g^{-1}(10^{-9}).

15 Auger et al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

Comparison to BFGS, NEWUOA, PSO and DE: f non-convex, non-separable (rotated), with varying condition number α

Sqrt of sqrt of rotated ellipsoid, dimension 20, 21 trials, tolerance 1e−09, eval max 1e+07
[Figure: SP1 versus condition number for the same five algorithms.]

f(x) = g(x^T H x) with H full; g : x ↦ x^{1/4} (for BFGS and NEWUOA), g any order-preserving, i.e. strictly increasing, function (for all others).
SP1 = average number of objective function evaluations^{16} to reach the target function value of g^{-1}(10^{-9}).

16 Auger et al. (2009): Experimental comparisons of derivative free optimization algorithms, SEA

Comparison during BBOB at GECCO 2009: 24 functions and 31 algorithms in 20-D
[Figure: proportion of functions solved versus running length divided by dimension, for best 2009, BIPOP-CMA-ES, AMaLGaM IDEA, iAMaLGaM IDEA, VNS (Garcia), MA-LS-Chain, DASA, G3-PCX, NEWUOA, (1+1)-CMA-ES, Cauchy EDA, (1+1)-ES, BFGS, PSO_Bounds, GLOBAL, ALPS-GA, full NEWUOA, NELDER (Han), NELDER (Doe), EDA-PSO, POEMS, PSO, MCS, Rosenbrock, LSstep, LSfminbnd, simple GA, DEPSO, DIRECT, BayEDAcG and Monte Carlo.]

Comparison during BBOB at GECCO 2010: 24 functions and 20+ algorithms in 20-D
[Figure: the same plot for 2010, including best 2009, BIPOP-CMA-ES, CMA+DE-MOS, IPOP-aCMA-ES, IPOP-CMA-ES, adaptive DE variants, several (1,λ)- and (1+1)-CMA-ES variants, NEWUOA, NBC-CMA, Cauchy EDA, GLOBAL, oPOEMS, Artificial Bee Colony, Basic RCGA, SPSA and Monte Carlo.]

Comparison during BBOB at GECCO 2009: 30 noisy functions and 20 algorithms in 20-D
[Figure: the same plot on the noisy testbed, including best 2009, BIPOP-CMA-ES, AMaLGaM IDEA, iAMaLGaM IDEA, VNS (Garcia), MA-LS-Chain, ALPS-GA, BayEDAcG, full NEWUOA, (1+1)-ES, DASA, GLOBAL, (1+1)-CMA-ES, EDA-PSO, PSO, PSO_Bounds, DEPSO, MCS, SNOBFIT, BFGS and Monte Carlo.]

Comparison during BBOB at GECCO 2010: 30 noisy functions and 10+ algorithms in 20-D
[Figure: the same plot for 2010, including best 2009, IPOP-aCMA-ES, BIPOP-CMA-ES, IPOP-CMA-ES, CMA+DE-MOS, CMA-EGS (IPOP,r1), Basic RCGA, several (1,λ)-CMA-ES variants, GLOBAL, avg NEWUOA, NEWUOA, SPSA and Monte Carlo.]

Summary and Final Remarks

1 Problem Statement

2 Evolution Strategies (ES)

3 Step-Size Control

4 Covariance Matrix Adaptation (CMA)

5 CMA-ES Summary

6 Theoretical Foundations

7 Comparing Experiments

8 Summary and Final Remarks

The Continuous Search Problem

Difficulties of a non-linear optimization problem are
- dimensionality and non-separability: demand to exploit the problem structure, e.g. neighborhood (caveat: design of benchmark functions)
- ill-conditioning: demands to acquire a second order model
- ruggedness: demands a non-local (stochastic? population-based?) approach

Main Characteristics of (CMA) Evolution Strategies

1. Multivariate normal distribution to generate new search points: follows the maximum entropy principle
2. Rank-based selection: implies invariance, i.e. the same performance on g(f(x)) for any strictly increasing g; more invariance properties are featured
3. Step-size control: facilitates fast (log-linear) convergence and possibly linear scaling with the dimension; in CMA-ES based on an evolution path (a non-local trajectory)
4. Covariance matrix adaptation (CMA): increases the likelihood of previously successful steps and can improve performance by orders of magnitude; the update follows the natural gradient
   C ∝ H^{-1} ⟺ adapts a variable metric ⟺ new (rotated) problem representation
   ⟹ f : x ↦ g(x^T H x) reduces to x ↦ x^T x

Limitations of CMA Evolution Strategies

internal CPU-time: 10^{-8} n² seconds per function evaluation on a 2 GHz PC; tweaks are available
(1 000 000 f-evaluations in 100-D take 100 seconds internal CPU-time)

better methods are presumably available in case of
- partly separable problems
- specific problems, for example with cheap gradients: specific methods
- small dimension (n ≪ 10): for example Nelder-Mead
- small running times (number of f-evaluations < 100n): model-based methods


Thank You

Source code for CMA-ES in C, Java, Matlab, Octave, Python, Scilab is available at http://www.lri.fr/~hansen/cmaes_inmatlab.html
