CMA-ES and Advanced Adaptation Mechanisms Youhei Akimoto, Nikolaus Hansen

To cite this version:

Youhei Akimoto, Nikolaus Hansen. CMA-ES and Advanced Adaptation Mechanisms. GECCO '18 Companion: Proceedings of the Genetic and Evolutionary Computation Conference Companion, Jul 2018, Kyoto, Japan. hal-01959479

HAL Id: hal-01959479 https://hal.inria.fr/hal-01959479 Submitted on 18 Dec 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

CMA-ES and Advanced Adaptation Mechanisms

Youhei Akimoto1 & Nikolaus Hansen2 1. University of Tsukuba, Japan 2. Inria, Research Centre Saclay, France

[email protected] [email protected]

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

GECCO '18 Companion, July 15–19, 2018, Kyoto, Japan © 2018 Copyright is held by the owner/author(s). ACM ISBN 978-1-4503-5764-7/18/07. https://doi.org/10.1145/3205651.3207854

Tutorial: Evolution Strategies and CMA-ES (Covariance Matrix Adaptation)

Anne Auger & Nikolaus Hansen

Inria, Project team TAO, Research Centre Saclay – Île-de-France, University Paris-Sud, LRI (UMR 8623), Bat. 660, 91405 Orsay Cedex, France. We are happy to answer questions at any time. https://www.lri.fr/~hansen/gecco2014-CMA-ES-tutorial.pdf

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s). GECCO '14, Jul 12-16 2014, Vancouver, BC, Canada ACM 978-1-4503-2881-4/14/07. http://dx.doi.org/10.1145/2598394.2605347

Topics 1. What makes the problem difficult to solve?

2. How does the CMA-ES work?

• Normal Distribution, Rank-Based Recombination • Step-Size Adaptation • Covariance Matrix Adaptation

3. What can/should the users do for the CMA-ES to work effectively on their problem?

• Choice of problem formulation and encoding (not covered) • Choice of initial solution and initial step-size • Restarts, Increasing Population Size • Restricted Covariance Matrix

Problem Statement Black Box Optimization and Its Difficulties
Problem Statement: Continuous Domain Search/Optimization
Task: minimize an objective function (fitness function, loss function) in continuous domain
f : X ⊆ ℝⁿ → ℝ, x ↦ f(x)
Black Box scenario (direct search scenario): x → [black box] → f(x)

• gradients are not available or not useful
• problem domain specific knowledge is used only within the black box, e.g. within an appropriate encoding
Search costs: number of function evaluations

Problem Statement Black Box Optimization and Its Difficulties
Problem Statement: Continuous Domain Search/Optimization
Goal
• fast convergence to the global optimum . . . or to a robust solution x
• solution x with small function value f(x) with least search cost
there are two conflicting objectives

Typical Examples

• shape optimization (e.g. using CFD): curve fitting, airfoils
• model calibration: biological, physical
• parameter calibration: controller, plants, images

Problems
• exhaustive search is infeasible
• naive random search takes too long
• deterministic search is not successful / takes too long
Approach: stochastic search, Evolutionary Algorithms

Problem Statement Black Box Optimization and Its Difficulties
What Makes a Function Difficult to Solve? Why stochastic search?

non-linear, non-quadratic, non-convex
  on linear and quadratic functions much better search policies are available
ruggedness
  non-smooth, discontinuous, multimodal, and/or noisy function
  [figure: cut through a rugged 1-D fitness landscape]
dimensionality (size of search space)
  (considerably) larger than three
non-separability
  dependencies between the objective variables
ill-conditioning
  [figure: squeezed level sets with gradient direction vs. Newton direction]
non-smooth level sets

Problem Statement Black Box Optimization and Its Difficulties
Ruggedness: non-smooth, discontinuous, multimodal, and/or noisy

[figure: fitness over a 1-D cut from a 5-D example; (easily) solvable with evolution strategies]

Problem Statement Non-Separable Problems
Separable Problems
Definition (Separable Problem): A function f is separable if
arg min_{(x_1,...,x_n)} f(x_1,...,x_n) = ( arg min_{x_1} f(x_1,...), ..., arg min_{x_n} f(...,x_n) )
⇒ it follows that f can be optimized in a sequence of n independent 1-D optimization processes

Example: Additively decomposable functions
f(x_1,...,x_n) = Σ_{i=1}^{n} f_i(x_i)
[figure: level sets of the (separable) Rastrigin function]

Problem Statement Non-Separable Problems
Non-Separable Problems
Building a non-separable problem from a separable one (1,2)
Rotating the coordinate system
f : x ↦ f(x) separable
f : x ↦ f(Rx) non-separable
R rotation matrix
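A minimal sketch (not from the tutorial) of this construction: a separable benchmark is made non-separable by composing it with an orthogonal matrix R; the function name and rotation seed are illustrative choices.

import numpy as np

def rastrigin(x):                       # separable: a sum of 1-D terms
    return 10 * len(x) + float(np.sum(x**2 - 10 * np.cos(2 * np.pi * x)))

def random_rotation(n, seed=0):
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthogonal matrix
    return q

R = random_rotation(5)
rotated_rastrigin = lambda x: rastrigin(R @ np.asarray(x))   # non-separable f(Rx)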

[figure: level sets of the separable function (left) and of the rotated, non-separable function (right)]

1 Hansen, Ostermeier, Gawelczyk (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. Sixth ICGA, pp. 57-64, Morgan Kaufmann
2 Salomon (1996). "Reevaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions; A survey of some theoretical and practical aspects of genetic algorithms." BioSystems, 39(3):263-278

Problem Statement Ill-Conditioned Problems
Ill-Conditioned Problems
Curvature of level sets
Consider the convex-quadratic function
f(x) = ½ (x − x*)ᵀ H (x − x*) = ½ Σ_i h_{i,i} (x_i − x_i*)² + ½ Σ_{i≠j} h_{i,j} (x_i − x_i*)(x_j − x_j*)
H is the Hessian matrix of f and symmetric positive definite

gradient direction: −∇f(x)ᵀ, Newton direction: −H⁻¹∇f(x)ᵀ

Ill-conditioning means squeezed level sets (high curvature). The condition number equals nine here. Condition numbers up to 10¹⁰ are not unusual in real-world problems.

If H ≈ I (small condition number of H), first-order information (e.g. the gradient) is sufficient. Otherwise second-order information (estimation of H⁻¹) is necessary.

Non-smooth level sets (sharp ridges)
Similar difficulty, but worse than ill-conditioning

[figure: level sets of the 1-norm, a scaled 1-norm, and the 1/2-norm]

Problem Statement Ill-Conditioned Problems
What Makes a Function Difficult to Solve? . . . and what can be done

The Problem → Possible Approaches
Dimensionality → exploiting the problem structure: separability, locality/neighborhood, encoding
Ill-conditioning → second-order approach: changes the neighborhood metric
Ruggedness → non-local policy, large sampling width (step-size): as large as possible while preserving a reasonable convergence speed; population-based method, stochastic, non-elitistic: recombination operator serves as repair mechanism; restarts ...metaphors

Topics 1. What makes the problem difficult to solve?

2. How does the CMA-ES work?

• Normal Distribution, Rank-Based Recombination • Step-Size Adaptation • Covariance Matrix Adaptation

3. What can/should the users do for the CMA-ES to work effectively on their problem?

• Choice of problem formulation and encoding (not covered) • Choice of initial solution and initial step-size • Restarts, Increasing Population Size • Restricted Covariance Matrix 14 Evolution Strategies (ES) A Search Template Stochastic Search

A black box search template to minimize f : ℝⁿ → ℝ
Initialize distribution parameters θ, set population size λ ∈ ℕ
While not terminate
1. Sample distribution P(x | θ) → x_1, ..., x_λ ∈ ℝⁿ
2. Evaluate x_1, ..., x_λ on f
3. Update parameters θ ← F_θ(θ, x_1, ..., x_λ, f(x_1), ..., f(x_λ))

Everything depends on the definition of P and F_θ
deterministic algorithms are covered as well

In many Evolutionary Algorithms the distribution P is implicitly defined via operators on a population, in particular, selection, recombination and mutation
Natural template for (incremental) Estimation of Distribution Algorithms
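A minimal sketch (not from the slides) of this black box search template, instantiated with a deliberately trivial choice of P and F_θ: an isotropic Gaussian whose mean jumps to the best sampled point. All names below are illustrative.

import numpy as np

def stochastic_search(f, theta, lam=20, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        # 1. sample lambda candidates from P(x | theta)
        x = theta["m"] + theta["sigma"] * rng.standard_normal((lam, len(theta["m"])))
        # 2. evaluate them on f
        f_vals = [f(xi) for xi in x]
        # 3. update theta (here: move the mean to the best candidate)
        theta["m"] = x[int(np.argmin(f_vals))]
    return theta

theta = stochastic_search(lambda x: float(np.sum(x**2)),
                          {"m": np.full(5, 3.0), "sigma": 0.3})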

Evolution Strategies (ES) A Search Template
The CMA-ES

Input: m ∈ ℝⁿ; σ ∈ ℝ₊; λ ∈ ℕ, λ ≥ 2, usually λ ≥ 5, default λ = 4 + ⌊3 ln n⌋
Initialize: C = I, p_c = 0, p_σ = 0
Set: c_c ≈ 4/n, c_σ ≈ 4/n, c_1 ≈ 2/n², c_µ ≈ µ_w/n², c_1 + c_µ ≤ 1, d_σ ≈ 1 + √(µ_w/n), and w_{i=1...λ} decreasing in i such that µ_w := 1 / Σ_{i=1}^{µ} w_i² ≈ 0.3 λ

While not terminate
  x_i = m + σ y_i, where y_i ~ N_i(0, C), for i = 1, ..., λ   (sampling)
  m ← Σ_{i=1}^{µ} w_i x_{i:λ} = m + σ y_w, where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}   (update mean)
  p_c ← (1 − c_c) p_c + 1_{‖p_σ‖ < 1.5√n} √(1 − (1 − c_c)²) √µ_w y_w   (cumulation for C)
  p_σ ← (1 − c_σ) p_σ + √(1 − (1 − c_σ)²) √µ_w C^{−1/2} y_w   (path for σ)
  C ← (1 − c_1 − c_µ) C + c_1 p_c p_cᵀ + c_µ Σ_{i=1}^{µ} w_i y_{i:λ} y_{i:λ}ᵀ   (update C)
  σ ← σ × exp( (c_σ/d_σ) (‖p_σ‖ / E‖N(0, I)‖ − 1) )   (update of σ)

Not covered on this slide: termination, restarts, useful output, search boundaries and encoding; corrections for positive definiteness guaranty, p_c variance loss, c_σ and d_σ for large λ
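A minimal NumPy sketch (not the authors' code) of one run of the loop above, with the stall indicator and all practical corrections omitted and only approximate default parameters; function and variable names are illustrative.

import numpy as np

def cma_es(f, m, sigma, n_iter=500, seed=1):
    n = len(m)
    lam = 4 + int(3 * np.log(n))                     # population size
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                     # positive, decreasing weights
    mu_w = 1.0 / np.sum(w**2)
    c_sigma, d_sigma = 4.0 / n, 1.0 + np.sqrt(mu_w / n)
    c_c, c_1 = 4.0 / n, 2.0 / n**2
    c_mu = min(mu_w / n**2, 1.0 - c_1)
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n))           # approximation of E||N(0,I)||
    C, p_sigma, p_c = np.eye(n), np.zeros(n), np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        y = rng.multivariate_normal(np.zeros(n), C, size=lam)   # sampling
        x = m + sigma * y
        idx = np.argsort([f(xi) for xi in x])[:mu]              # rank-based selection
        y_w = w @ y[idx]
        m = m + sigma * y_w                                     # update mean
        D, B = np.linalg.eigh(C)                                # C^{-1/2} via eigendecomposition
        D = np.maximum(D, 1e-20)
        C_inv_sqrt = B @ np.diag(D**-0.5) @ B.T
        p_sigma = (1 - c_sigma) * p_sigma + np.sqrt(1 - (1 - c_sigma)**2) \
                  * np.sqrt(mu_w) * (C_inv_sqrt @ y_w)          # path for sigma
        p_c = (1 - c_c) * p_c + np.sqrt(1 - (1 - c_c)**2) * np.sqrt(mu_w) * y_w
        C = (1 - c_1 - c_mu) * C + c_1 * np.outer(p_c, p_c) \
            + c_mu * np.einsum('i,ij,ik->jk', w, y[idx], y[idx])  # rank-one + rank-mu
        sigma *= np.exp(c_sigma / d_sigma * (np.linalg.norm(p_sigma) / chi_n - 1))
    return m

For example, cma_es(lambda x: float(np.sum(x**2)), m=np.ones(10), sigma=0.5) drives the mean toward the optimum of the sphere function.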

Evolution Strategies (ES) A Search Template
Evolution Strategies

New search points are sampled normally distributed
x_i ~ m + σ N_i(0, C) for i = 1, ..., λ
as perturbations of m, where x_i, m ∈ ℝⁿ, σ ∈ ℝ₊, C ∈ ℝⁿˣⁿ
where
the mean vector m ∈ ℝⁿ represents the favorite solution
the so-called step-size σ ∈ ℝ₊ controls the step length
the covariance matrix C ∈ ℝⁿˣⁿ determines the shape of the distribution ellipsoid
here, all new points are sampled with the same parameters

The question remains how to update m, C, and σ.

24 Anne Auger & Nikolaus Hansen CMA-ES July, 2014 17 / 81 Evolution Strategies (ES) The Normal Distribution Why Normal Distributions?

1 widely observed in nature, for example as phenotypic traits
2 only stable distribution with finite variance
  stable means that the sum of normal variates is again normal:
  N(x, A) + N(y, B) ~ N(x + y, A + B)
  helpful in design and analysis of algorithms, related to the central limit theorem
3 most convenient way to generate isotropic search points
  the isotropic distribution does not favor any direction, rotational invariant
4 maximum entropy distribution with finite variance
  the least possible assumptions on f in the distribution shape

25 Anne Auger & Nikolaus Hansen CMA-ES July, 2014 18 / 81 Evolution Strategies (ES) The Normal Distribution Normal Distribution

Standard Normal Distribution
[figure: probability density of the 1-D standard normal distribution]

2-D Normal Distribution
[figure: probability density of a 2-D normal distribution]

26 Anne Auger & Nikolaus Hansen CMA-ES July, 2014 19 / 81 Evolution Strategies (ES) The Normal Distribution The Multi-Variate (n-Dimensional) Normal Distribution

Any multi-variate normal distribution N(m, C) is uniquely determined by its mean value m ∈ ℝⁿ and its symmetric positive definite n × n covariance matrix C.

The mean value m
  determines the displacement (translation)
  is the value with the largest density (modal value)
  the distribution is symmetric about the distribution mean
  [figure: density of a 2-D normal distribution]

The covariance matrix C determines the shape
geometrical interpretation: any covariance matrix can be uniquely identified with the iso-density ellipsoid {x ∈ ℝⁿ | (x − m)ᵀ C⁻¹ (x − m) = 1}

27 Anne Auger & Nikolaus Hansen CMA-ES July, 2014 20 / 81 Evolution Strategies (ES) The Normal Distribution

. . . any covariance matrix can be uniquely identified with the iso-density ellipsoid {x ∈ ℝⁿ | (x − m)ᵀ C⁻¹ (x − m) = 1}

Lines of Equal Density
N(m, σ²I) ~ m + σ N(0, I): one degree of freedom (σ), components are independent standard normally distributed
N(m, σ²D²) ~ m + σ D N(0, I): n degrees of freedom, components are independent, scaled
N(m, C) ~ m + C^{1/2} N(0, I): (n² + n)/2 degrees of freedom, components are correlated

where I is the identity matrix (isotropic case) and D is a diagonal matrix (reasonable for separable problems) and A × N(0, I) ~ N(0, A Aᵀ) holds for all A.
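A minimal sketch (not from the slides) of the three sampling cases above, using the relation A × N(0, I) ~ N(0, A Aᵀ); the specific matrices D and C are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 3, np.zeros(3), 0.5
D = np.diag([0.5, 1.0, 2.0])                       # axis-parallel scaling
C = np.array([[4.0, 1.5, 0.0],
              [1.5, 2.0, 0.3],
              [0.0, 0.3, 1.0]])                    # full covariance matrix

z = rng.standard_normal(n)
x_iso  = m + sigma * z                             # sample of N(m, sigma^2 I)
x_diag = m + sigma * (D @ z)                       # sample of N(m, sigma^2 D^2)
eigvals, B = np.linalg.eigh(C)                     # C = B diag(d_i^2) B^T
C_sqrt = B @ np.diag(np.sqrt(eigvals)) @ B.T
x_full = m + C_sqrt @ z                            # sample of N(m, C)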

28 Anne Auger & Nikolaus Hansen CMA-ES July, 2014 21 / 81 Evolution Strategies (ES) The Normal Distribution Multivariate Normal Distribution and Eigenvalues

For any positive definite symmetric C,
C = d_1² b_1 b_1ᵀ + ... + d_N² b_N b_Nᵀ
d_i: square root of the i-th eigenvalue of C
b_i: eigenvector of C, corresponding to d_i

The multivariate normal distribution N(m, C) satisfies
N(m, C) ~ m + N(0, d_1²) b_1 + ... + N(0, d_N²) b_N
[figure: iso-density ellipse with principal axes d_1·b_1 and d_2·b_2]

Evolution Strategies (ES) The Normal Distribution
The (µ/µ, λ)-ES
Non-elitist selection and intermediate (weighted) recombination
Given the i-th solution point x_i = m + σ N_i(0, C) = m + σ y_i
Let x_{i:λ} be the i-th ranked solution point, such that f(x_{1:λ}) ≤ ... ≤ f(x_{λ:λ}).
The new mean reads
m ← Σ_{i=1}^{µ} w_i x_{i:λ} = m + σ Σ_{i=1}^{µ} w_i y_{i:λ} =: m + σ y_w
where

w_1 ≥ ... ≥ w_µ > 0, Σ_{i=1}^{µ} w_i = 1, 1 / Σ_{i=1}^{µ} w_i² =: µ_w ≈ λ/4

The best µ points are selected from the new solutions (non-elitistic) and weighted intermediate recombination is applied.

Evolution Strategies (ES) Invariance
Invariance Under Monotonically Increasing Functions
Rank-based algorithms
Update of all parameters uses only the ranks
f(x_{1:λ}) ≤ f(x_{2:λ}) ≤ ... ≤ f(x_{λ:λ})
⇔ g(f(x_{1:λ})) ≤ g(f(x_{2:λ})) ≤ ... ≤ g(f(x_{λ:λ})) for all strictly monotonically increasing g
g preserves ranks³
³ Whitley 1989. The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best, ICGA
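A minimal sketch (not from the slides) of this invariance: any strictly increasing g leaves the ranking, and therefore the behavior of a rank-based algorithm, unchanged. The example values and the choice of g are arbitrary.

import numpy as np

f_values = np.array([3.2, -1.0, 0.7, 12.5, 0.0])
g = lambda v: np.exp(v) + 5.0            # strictly monotonically increasing

ranks_f = np.argsort(f_values)           # ranking based on f
ranks_g = np.argsort(g(f_values))        # ranking based on g(f)
assert np.array_equal(ranks_f, ranks_g)  # identical ranks, hence identical behavior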

Evolution Strategies (ES) Invariance
Basic Invariance in Search Space
translation invariance is true for most optimization algorithms

f(x) ↔ f(x − a)

Identical behavior on f and f_a:
f : x ↦ f(x), x^(t=0) = x_0
f_a : x ↦ f(x − a), x^(t=0) = x_0 + a
No difference can be observed w.r.t. the argument of f

Evolution Strategies (ES) Summary
Summary
On the 20-D Sphere Function f(x) = Σ_{i=1}^{N} [x]_i²: an ES without adaptation can't approach the optimum ⇒ adaptation required

Step-Size Control Evolution Strategies
Recalling
New search points are sampled normally distributed

x_i ~ m + σ N_i(0, C) for i = 1, ..., λ
as perturbations of m, where x_i, m ∈ ℝⁿ, σ ∈ ℝ₊, C ∈ ℝⁿˣⁿ
where
the mean vector m ∈ ℝⁿ represents the favorite solution, and m ← Σ_{i=1}^{µ} w_i x_{i:λ}
the so-called step-size σ ∈ ℝ₊ controls the step length
the covariance matrix C ∈ ℝⁿˣⁿ determines the shape of the distribution ellipsoid

The remaining question is how to update σ and C.

Step-Size Control Why Step-Size Control
Methods for Step-Size Control
1/5-th success rule^{a,b}, often applied with "+"-selection
  increase the step-size if more than 20% of the new solutions are successful, decrease otherwise
σ-self-adaptation^{c}, applied with ","-selection
  mutation is applied to the step-size and the better, according to the objective function value, is selected

simplified "global" self-adaptation

path length control^{d} (Cumulative Step-size Adaptation, CSA)^{e}
  self-adaptation derandomized and non-localized

a Rechenberg 1973, Evolutionsstrategie, Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog
b Schumer and Steiglitz 1968. Adaptive step size random search. IEEE TAC
c Schwefel 1981, Numerical Optimization of Computer Models, Wiley
d Hansen & Ostermeier 2001, Completely Derandomized Self-Adaptation in Evolution Strategies, Evol. Comput. 9(2)
e Ostermeier et al 1994, Step-size adaptation based on non-local use of selection information, PPSN IV

Step-Size Control Path Length Control (CSA)
Path Length Control (CSA)
The Concept of Cumulative Step-Size Adaptation
x_i = m + σ y_i
m ← m + σ y_w

Measure the length of the evolution path, the pathway of the mean vector m in the generation sequence.

[figure: two evolution paths; a short, zig-zagging path → decrease σ; a long, aligned path → increase σ]

loosely speaking, steps are
  perpendicular under random selection (in expectation)
  perpendicular in the desired situation (to be most efficient)

Step-Size Control Path Length Control (CSA)
Path Length Control (CSA): The Equations

Initialize m ∈ ℝⁿ, σ ∈ ℝ₊, evolution path p_σ = 0,
set c_σ ≈ 4/n, d_σ ≈ 1.

m ← m + σ y_w where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}   (update mean)
p_σ ← (1 − c_σ) p_σ + √(1 − (1 − c_σ)²) √µ_w y_w
  (the factor √(1 − (1 − c_σ)²) accounts for 1 − c_σ; √µ_w accounts for the w_i)
σ ← σ × exp( (c_σ/d_σ) (‖p_σ‖ / E‖N(0, I)‖ − 1) )   (update step-size)
  (the factor is > 1 ⇔ ‖p_σ‖ is greater than its expectation)
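A minimal sketch (not the authors' code) of cumulative step-size adaptation for an isotropic (µ/µ_w, λ)-ES, following the equations above; the parameter values are illustrative defaults.

import numpy as np

def csa_es(f, m, sigma, lam=10, n_iter=300, seed=2):
    n, mu = len(m), lam // 2
    w = np.ones(mu) / mu                      # equal recombination weights
    mu_w = 1.0 / np.sum(w**2)
    c_sigma, d_sigma = 4.0 / n, 1.0
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n))    # approximation of E||N(0, I)||
    p_sigma = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        y = rng.standard_normal((lam, n))
        x = m + sigma * y
        idx = np.argsort([f(xi) for xi in x])[:mu]   # select the best mu
        y_w = w @ y[idx]
        m = m + sigma * y_w                          # update mean
        p_sigma = (1 - c_sigma) * p_sigma \
                  + np.sqrt(1 - (1 - c_sigma)**2) * np.sqrt(mu_w) * y_w
        sigma *= np.exp(c_sigma / d_sigma
                        * (np.linalg.norm(p_sigma) / chi_n - 1))
    return m, sigma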

Step-Size Control Path Length Control (CSA)
(5/5, 10)-CSA-ES, default parameters
[figure: ‖x − x*‖, σ, and f(x) = Σ_{i=1}^{n} x_i² over function evaluations; initial m ∈ [−0.2, 0.8]^n for n = 30]

Step-Size Control Alternatives to CSA
Two-Point Step-Size Adaptation (TPA)

Sample a pair of symmetric points along the previous mean shift

x_{1,2} = m^(g) ± σ^(g) ‖N(0, I)‖ (m^(g) − m^(g−1)) / ‖m^(g) − m^(g−1)‖_C, where ‖x‖_C := √(xᵀ C⁻¹ x)

Compare the ranking of x_1 and x_2 among the current population

s^(g+1) = (1 − c_s) s^(g) + c_s (rank(x_2) − rank(x_1)) / (λ − 1)
which is > 0 if the previous step still produces a promising solution

Update the step-size
σ^(g+1) = σ^(g) exp( s^(g+1) / d_σ )

[Hansen, 2008] Hansen, N. (2008). CMA-ES with two-point step-size adaptation. Research Report RR-6527, Inria. inria-00276854v5.
[Hansen et al., 2014] Hansen, N., Atamna, A., and Auger, A. (2014). How to assess step-size adaptation mechanisms in randomised search. In Parallel Problem Solving from Nature–PPSN XIII, pages 60–69. Springer.

Step-Size Control Alternatives to CSA
On Sphere with Low Effective Dimension

On a function with low effective dimension
f(x) = Σ_{i=1}^{M} [x]_i², x ∈ ℝ^N, M ≤ N
N − M variables do not affect the function value
[figure: f(x) vs. function evaluations for CSA (left) and TPA (right), with (N, M) = (10, 10), (100, 100), (100, 10)]

40 Step-Size Control Alternatives to CSA Alternatives: Success-Based Step-Size Control comparing the fitness distributions of current and previous iterations

Generalizations of the 1/5th-success-rule for non-elitist and multi-recombinant ES
Median Success Rule [Ait Elhara et al., 2013]
Population Success Rule [Loshchilov, 2014]
controls a success probability
An advantage over CSA and TPA: cheap computation. It depends only on the ranking of the candidate solutions;
cf. CSA and TPA require a computation of C^{-1/2} x and C^{-1} x, respectively.

[Ait Elhara et al., 2013] Ait Elhara, O., Auger, A., and Hansen, N. (2013). A median success rule for non-elitist evolution strategies: Study of feasibility. In Proc. of the GECCO, pages 415–422.
[Loshchilov, 2014] Loshchilov, I. (2014). A computationally efficient limited memory CMA-ES for large scale optimization. In Proc. of the GECCO, pages 397–404.

Step-Size Control Summary
Step-Size Control: Summary
Why Step-Size Control? to achieve linear convergence at near-optimal rate

Cumulative Step-Size Adaptation
  efficient and robust for N ≳ λ
  inefficient on functions with (many) ineffective axes

Alternative Step-Size Adaptation Mechanisms Two-Point Step-Size Adaptation Median Success Rule, Population Success Rule

the effective adaptation of the overall population diversity seems yet to pose open questions, in particular with recombination or without entire control over the realized distribution^a

a Hansen et al. How to Assess Step-Size Adaptation Mechanisms in Randomised Search. PPSN 2014

Step-Size Control Summary
Step-Size Control: Summary
On the 20-D TwoAxes function f(x) = Σ_{i=1}^{N/2} [Rx]_i² + a² Σ_{i=N/2+1}^{N} [Rx]_i², R orthogonal:
the convergence speed of the CSA-ES becomes lower as the function becomes ill-conditioned (a² becomes greater) ⇒ covariance matrix adaptation required

Covariance Matrix Adaptation (CMA) Evolution Strategies
Recalling

New search points are sampled normally distributed

x_i ~ m + σ N_i(0, C) for i = 1, ..., λ
as perturbations of m, where x_i, m ∈ ℝⁿ, σ ∈ ℝ₊, C ∈ ℝⁿˣⁿ
where
the mean vector m ∈ ℝⁿ represents the favorite solution
the so-called step-size σ ∈ ℝ₊ controls the step length
the covariance matrix C ∈ ℝⁿˣⁿ determines the shape of the distribution ellipsoid

The remaining question is how to update C.

Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-One Update
Covariance Matrix Adaptation: Rank-One Update
m ← m + σ y_w, y_w = Σ_{i=1}^{µ} w_i y_{i:λ}, y_i ~ N_i(0, C)

[figure sequence: (1) initial distribution, C = I; (2) y_w, movement of the population mean m (disregarding σ); (3) mixture of distribution C and step y_w, C ← 0.8 × C + 0.2 × y_w y_wᵀ; (4) new distribution (disregarding σ)]

new distribution, C ← 0.8 × C + 0.2 × y_w y_wᵀ
the ruling principle: the adaptation increases the likelihood of successful steps, y_w, to appear again
another viewpoint: the adaptation follows a natural gradient approximation of the expected fitness

Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-One Update
Covariance Matrix Adaptation: Rank-One Update
Initialize m ∈ ℝⁿ, and C = I, set σ = 1, learning rate c_cov ≈ 2/n²
While not terminate

x_i = m + σ y_i, y_i ~ N_i(0, C)
m ← m + σ y_w where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}
C ← (1 − c_cov) C + c_cov µ_w y_w y_wᵀ   (rank-one), where µ_w = 1 / Σ_{i=1}^{µ} w_i² ≥ 1

The rank-one update has been found independently in several domains^{6,7,8,9}

6 Kjellström & Taxén 1981. Stochastic Optimization in System Design, IEEE TCS
7 Hansen & Ostermeier 1996. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation, ICEC
8 Ljung 1999. System Identification: Theory for the User
9 Haario et al 2001. An adaptive Metropolis algorithm, JSTOR
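A minimal sketch (not the authors' code) of the rank-one covariance matrix update C ← (1 − c_cov) C + c_cov µ_w y_w y_wᵀ; all values below are illustrative.

import numpy as np

def rank_one_update(C, y_w, mu_w, c_cov):
    """Increase the variance of C along the selected mean step y_w."""
    return (1 - c_cov) * C + c_cov * mu_w * np.outer(y_w, y_w)

n = 5
C = np.eye(n)
y_w = np.array([1.0, 0.5, 0.0, 0.0, 0.0])   # weighted mean of the selected steps
C = rank_one_update(C, y_w, mu_w=3.0, c_cov=2.0 / n**2)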

C ← (1 − c_cov) C + c_cov µ_w y_w y_wᵀ

The covariance matrix adaptation
learns all pairwise dependencies between variables
  off-diagonal entries in the covariance matrix reflect the dependencies
conducts a principal component analysis (PCA) of steps y_w, sequentially in time and space
  eigenvectors of the covariance matrix C are the principal components / the principal axes of the mutation ellipsoid
learns a new rotated problem representation
  components are independent (only) in the new representation
learns a new (Mahalanobis) metric
  variable metric method
  approximates the inverse Hessian on quadratic functions, i.e. a transformation into the sphere function
for µ = 1: conducts a natural gradient ascent on the distribution N
entirely independent of the given coordinate system

...cumulation, rank-µ 55 Anne Auger & Nikolaus Hansen CMA-ES July, 2014 47 / 81 Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-One Update Invariance Under Rigid Search Space Transformation

[figure: f-level sets in dimension 2 for f = h_Rast (separable Rastrigin) and f = h]

for example, invariance under search space rotation (separable → non-separable) ...

Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-One Update
Invariance Under Rigid Search Space Transformation
[figure: f-level sets in dimension 2 for f = h_Rast ∘ R and f = h ∘ R (rotated)]

for example, invariance under search space rotation (separable → non-separable) ...

Covariance Matrix Adaptation (CMA) Cumulation: the Evolution Path
Cumulation: The Evolution Path
Evolution Path
Conceptually, the evolution path is the search path the strategy takes over a number of generation steps. It can be expressed as a sum of consecutive steps of the mean m.

An exponentially weighted sum of steps y_w is used,
p_c ∝ Σ_{i=0}^{g} (1 − c_c)^{g−i} y_w^(i)   (exponentially fading weights)

The recursive construction of the evolution path (cumulation):
p_c ← (1 − c_c) p_c + √(1 − (1 − c_c)²) √µ_w y_w
  (decay factor; normalization factor; the input is y_w = (m − m_old)/σ)

where µ_w = 1 / Σ w_i², c_c ≪ 1. History information is accumulated in the evolution path.

"Cumulation" is a widely used technique and is also known as
  exponential smoothing in time series and forecasting
  exponentially weighted moving average
  iterate averaging
  momentum in the back-propagation algorithm for ANNs
  ...
"Cumulation" conducts a low-pass filtering, but there is more to it...

...why?

Covariance Matrix Adaptation (CMA) Cumulation: the Evolution Path
Cumulation
Utilizing the Evolution Path
We used y_w y_wᵀ for updating C. Because y_w y_wᵀ = (−y_w)(−y_w)ᵀ, the sign of y_w is lost.

The sign information (signifying correlation between steps) is (re-)introduced by using the evolution path.

p_c ← (1 − c_c) p_c + √(1 − (1 − c_c)²) √µ_w y_w   (decay factor; normalization factor)
C ← (1 − c_cov) C + c_cov p_c p_cᵀ   (rank-one)

where µ_w = 1 / Σ w_i², c_cov ≪ c_c ≪ 1 such that 1/c_c is the "backward time horizon".
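A minimal sketch (not the authors' code) of the path-based rank-one update above: the evolution path p_c accumulates correlated steps, and the covariance matrix is updated along p_c. Parameter values are illustrative.

import numpy as np

def cumulation_rank_one(C, p_c, y_w, mu_w, c_c, c_cov):
    """One iteration of path cumulation and the path-based rank-one update."""
    p_c = (1 - c_c) * p_c + np.sqrt(1 - (1 - c_c)**2) * np.sqrt(mu_w) * y_w
    C = (1 - c_cov) * C + c_cov * np.outer(p_c, p_c)
    return C, p_c

n = 10
C, p_c = np.eye(n), np.zeros(n)
for _ in range(5):                        # repeated, correlated steps along e_1
    y_w = np.array([1.0] + [0.0] * (n - 1))
    C, p_c = cumulation_rank_one(C, p_c, y_w, mu_w=3.0,
                                 c_c=4.0 / n, c_cov=2.0 / n**2)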


Covariance Matrix Adaptation (CMA) Cumulation: the Evolution Path
Using an evolution path for the rank-one update of the covariance matrix reduces the number of function evaluations to adapt to a straight ridge from about O(n²) to O(n).^a
a Hansen & Auger 2013. Principled design of continuous stochastic search: From theory to practice.

[figure: number of f-evaluations divided by dimension on the cigar function f(x) = x_1² + 10⁶ Σ_{i=2}^{n} x_i², plotted over the dimension, for c_c = 1 (no cumulation), c_c = 1/√n, and c_c = 1/n]

The overall model has n² parameters, but important parts of the model can be learned in time of order n.

Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-µ Update
Rank-µ Update
x_i = m + σ y_i, y_i ~ N_i(0, C),
m ← m + σ y_w, y_w = Σ_{i=1}^{µ} w_i y_{i:λ}

The rank-µ update extends the update rule for large population sizes, using µ > 1 vectors to update C at each generation step.
The weighted empirical covariance matrix
C_µ = Σ_{i=1}^{µ} w_i y_{i:λ} y_{i:λ}ᵀ
computes a weighted mean of the outer products of the best µ steps and has rank min(µ, n) with probability one. With µ = λ the weights can be negative.^{10}
The rank-µ update then reads
C ← (1 − c_cov) C + c_cov C_µ
where c_cov ≈ µ_w/n² and c_cov ≤ 1.
10 Jastrebski and Arnold (2006). Improving evolution strategies through active covariance matrix adaptation. CEC.
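A minimal sketch (not the authors' code) of the rank-µ update C ← (1 − c_cov) C + c_cov Σ_i w_i y_i y_iᵀ; the arrays below are illustrative.

import numpy as np

def rank_mu_update(C, y_selected, w, c_cov):
    """y_selected: (mu, n) array of the best mu steps, w: (mu,) weights."""
    C_mu = np.einsum('i,ij,ik->jk', w, y_selected, y_selected)  # sum of weighted outer products
    return (1 - c_cov) * C + c_cov * C_mu

n, mu = 5, 3
rng = np.random.default_rng(0)
C = rank_mu_update(np.eye(n), rng.standard_normal((mu, n)),
                   w=np.array([0.5, 0.3, 0.2]), c_cov=0.1)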

[figure: sampling of λ = 150 solutions with C = I and σ = 1; calculating C_µ with µ = 50, w_1 = ... = w_µ = 1/µ, and c_cov = 1; new distribution]
x_i = m + σ y_i, y_i ~ N(0, C)
C_µ = (1/µ) Σ y_{i:λ} y_{i:λ}ᵀ
m_new ← m + (1/µ) Σ σ y_{i:λ}
C ← (1 − 1) × C + 1 × C_µ

66 Anne Auger & Nikolaus Hansen CMA-ES July, 2014 54 / 81 Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-µ Update

Rank-µ CMA versus Estimation of Multivariate Normal Algorithm EMNA_global^{11}

rank-µ CMA conducts a PCA of steps
x_i = m_old + σ y_i, y_i ~ N(0, C)
C ← (1/µ) Σ (x_{i:λ} − m_old)(x_{i:λ} − m_old)ᵀ
m_new = m_old + (1/µ) Σ σ y_{i:λ}

EMNA_global conducts a PCA of points
x_i = m_old + σ y_i, y_i ~ N(0, C)
C ← (1/µ) Σ (x_{i:λ} − m_new)(x_{i:λ} − m_new)ᵀ
m_new = m_old + (1/µ) Σ σ y_{i:λ}

[figure: sampling of λ = 150 solutions (dots); calculating C from µ = 50 solutions; new distribution]

m_new is the minimizer for the variances when calculating C

11 Hansen, N. (2006). The CMA Evolution Strategy: A Comparing Review. In J.A. Lozano, P. Larrañaga, I. Inza and E. Bengoetxea (Eds.). Towards a new evolutionary computation. Advances in estimation of distribution algorithms. pp. 75-102

Covariance Matrix Adaptation (CMA) Covariance Matrix Rank-µ Update

The rank-µ update
  increases the possible learning rate in large populations, roughly from 2/n² to µ_w/n²
  can reduce the number of necessary generations roughly from O(n²) to O(n)^{12}, given µ_w ∝ λ ∝ n
Therefore the rank-µ update is the primary mechanism whenever a large population size is used, say λ ≥ 3n + 10.
The rank-one update
  uses the evolution path and reduces the number of necessary function evaluations to learn straight ridges from O(n²) to O(n).
Rank-one update and rank-µ update can be combined.

12 Hansen, Müller, and Koumoutsakos 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1-18

[figures: best f-value, mean coordinates, step-size, and principal axis lengths over function evaluations for the rank-one update, the rank-µ update, and the hybrid (combined) update on f_TwoAxes(x) = Σ_{i=1}^{5} x_i² + 10⁶ Σ_{i=6}^{N} x_i², with λ = 10 (default for N = 10) and with λ = 50]

Covariance Matrix Adaptation (CMA) Active CMA

Different Types of Ill-Conditioning (α: Axes Ratio = 10)

Cigar Type: 1 long axis, n−1 short axes: f(x) = x_1² + α · Σ_{i=2}^{n} x_i²
Discus Type: 1 short axis, n−1 long axes: f(x) = α · x_1² + Σ_{i=2}^{n} x_i²

73 Covariance Matrix Adaptation (CMA) Active CMA Active Update utilize negative weights [Jastrebski and Arnold, 2006]

Active Update (rewriting)

C ← C + c_1 p_c p_cᵀ + c_µ Σ_{i=1}^{⌊λ/2⌋} w_i y_{i:λ} y_{i:λ}ᵀ − c_µ Σ_{i=⌊λ/2⌋+1}^{λ} |w_i| y_{i:λ} y_{i:λ}ᵀ
(increasing the variances in promising directions; decreasing the variances in unpromising directions)

increases the variance in the directions of p_c and of promising steps y_{i:λ} (i ≤ ⌊λ/2⌋)
decreases the variance in the directions of unpromising steps y_{i:λ} (i ≥ ⌊λ/2⌋ + 1)

keeps the variance in the subspace orthogonal to the above

[Jastrebski and Arnold, 2006] Jastrebski, G. and Arnold, D. V. (2006). Improving Evolution Strategies through Active Covariance Matrix Adaptation. In 2006 IEEE Congress on Evolutionary Computation, pages 9719–9726. 74 Covariance Matrix Adaptation (CMA) Active CMA On 10D Discus Function

10-D Discus Function (axis ratio: α = 10³)

f(x) = α² · x_1² + Σ_{i=2}^{n} x_i²

[figure: f(x) and eig(C) over function evaluations for the positive update (left) and the positive & negative (active) update (right)]

Positive: wait for the smallest eig(C) decreasing Active: decrease the smallest eig(C) actively

75 Covariance Matrix Adaptation (CMA) Active CMA Summary Active Covariance Matrix Adaptation + Cumulation

C ← (1 − c_1 − c_µ + c_µ⁻) C + c_1 p_c p_cᵀ + c_µ Σ_{i=1}^{⌊λ/2⌋} w_i y_{i:λ} y_{i:λ}ᵀ − c_µ⁻ Σ_{i=⌊λ/2⌋+1}^{λ} |w_i| y_{i:λ} y_{i:λ}ᵀ

w_i < 0 (for i ≥ ⌊λ/2⌋ + 1): negative weight assigned to y_{i:λ}, with Σ_i |w_i| = 1.
c_µ⁻ > 0: learning rate for the active update
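A minimal sketch (not the authors' code) of the active update above: positive weights for the best half of the steps, negative weights for the worst half. Names and weight vectors are illustrative; the caller is assumed to pass weight vectors of matching length.

import numpy as np

def active_update(C, p_c, y_sorted, w_pos, w_neg, c_1, c_mu, c_mu_neg):
    """y_sorted: (lambda, n) steps sorted by fitness, best first."""
    half = len(y_sorted) // 2
    best, worst = y_sorted[:half], y_sorted[-half:]
    C_pos = np.einsum('i,ij,ik->jk', w_pos, best, best)    # promising directions
    C_neg = np.einsum('i,ij,ik->jk', w_neg, worst, worst)  # unpromising directions
    return ((1 - c_1 - c_mu + c_mu_neg) * C
            + c_1 * np.outer(p_c, p_c)
            + c_mu * C_pos
            - c_mu_neg * C_neg)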

These components complement each other: cumulation excels at learning a long axis, but is inefficient for a large λ; the rank-µ update is efficient for a large λ; the active update is effective for learning short axes.

An important yet solvable issue of active update

The positive definiteness of C will be violated if c_µ⁻ is not small enough.

The positive definiteness can be guaranteed w.p.1 by controlling c_µ⁻ w_i.

CMA-ES Summary

Summary of Equations: The Covariance Matrix Adaptation Evolution Strategy

Input: m ∈ ℝⁿ; σ ∈ ℝ₊; λ ∈ ℕ, λ ≥ 2, usually λ ≥ 5, default λ = 4 + ⌊3 ln n⌋
Initialize: C = I, p_c = 0, p_σ = 0
Set: c_c ≈ 4/n, c_σ ≈ 4/n, c_1 ≈ 2/n², c_µ ≈ µ_w/n², c_1 + c_µ ≤ 1, d_σ ≈ 1 + √(µ_w/n), and w_{i=1...λ} such that µ_w = 1 / Σ_{i=1}^{µ} w_i² ≈ 0.3 λ

While not terminate
  x_i = m + σ y_i, y_i ~ N_i(0, C), for i = 1, ..., λ   (sampling)
  m ← Σ_{i=1}^{µ} w_i x_{i:λ} = m + σ y_w, where y_w = Σ_{i=1}^{µ} w_i y_{i:λ}   (update mean)
  p_c ← (1 − c_c) p_c + 1_{‖p_σ‖ < 1.5√n} √(1 − (1 − c_c)²) √µ_w y_w   (cumulation for C)
  p_σ ← (1 − c_σ) p_σ + √(1 − (1 − c_σ)²) √µ_w C^{−1/2} y_w   (path for σ)
  C ← (1 − c_1 − c_µ) C + c_1 p_c p_cᵀ + c_µ Σ_{i=1}^{µ} w_i y_{i:λ} y_{i:λ}ᵀ   (update C)
  σ ← σ × exp( (c_σ/d_σ) (‖p_σ‖ / E‖N(0, I)‖ − 1) )   (update of σ)

Not covered: termination, restarts, useful output, search boundaries and encoding; corrections for positive definiteness guaranty, p_c variance loss, c_σ and d_σ for large λ

Topics 1. What makes the problem difficult to solve?

2. How does the CMA-ES work?

• Normal Distribution, Rank-Based Recombination • Step-Size Adaptation • Covariance Matrix Adaptation

3. What can/should the users do for the CMA-ES to work effectively on their problem?

• Choice of problem formulation and encoding (not covered) • Choice of initial solution and initial step-size • Restarts, Increasing Population Size • Restricted Covariance Matrix 78 What can/should the users do? Strategy Parameters and Initialization Default Parameter Values CMA-ES + (B)IPOP Restart Strategy = Quasi-Parameter Free Optimizer

The following parameters were identified in carefully chosen experimental set ups.

related to selection and recombination
  λ: offspring number, new solutions sampled, population size
  µ: parent number, solutions involved in mean update
  w_i: recombination weights
related to C-update
  c_c: decay rate for the evolution path, cumulation factor
  c_1: learning rate for the rank-one update of C
  c_µ: learning rate for the rank-µ update of C
related to σ-update
  c_σ: decay rate of the evolution path
  d_σ: damping for the σ-change

The default values depend only on the dimension. They do in the first place not depend on the objective function.

79 What can/should the users do? Strategy Parameters and Initialization Parameters to be set depending on the problem Initialization and termination conditions

The following should be set or implemented depending on the problem.

related to the initial search distribution
  m^(0): initial mean vector
  σ^(0) (or σ^(0) √(C_{i,i}^(0))): initial (coordinate-wise) standard deviation
related to stopping conditions
  max. func. evals., max. iterations, function value tolerance, min. axis length, stagnation

Practical Hints: start with an initial guess m^(0) and a relatively small step-size σ^(0) to locally improve the current guess; then increase the step-size, e.g., by a factor of 10, to globally search for a better

solution.

What can/should the users do? Strategy Parameters and Initialization
Python CMA-ES Implementation https://github.com/CMA-ES/pycma

81 What can/should the users do? Strategy Parameters and Initialization Python CMA-ES Demo https://github.com/CMA-ES/pycma Optimizing 10D Rosenbrock Function
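A plausible one-liner version of such a demo (the starting point and step-size here are assumptions, not necessarily the values used in the live demo):

```python
import cma

# Optimize the 10-D Rosenbrock function from (0.1, ..., 0.1)
# with initial step-size 0.3.
res = cma.fmin(cma.ff.rosen, 10 * [0.1], 0.3)
print('best f-value:', res[1])
# cma.plot() displays the logged evolution of f, sigma, axis ratio, etc.
```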

From a practical perspective: given an unknown optimisation problem, the first thing I tend to do is try to improve a given (initial) solution using a small initial sigma. Then I (can) increase sigma successively (by a factor of 10 or more, depending on what I have seen in the initial evolution of sigma previously) and see whether I find the same or better (or different) solutions.
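Rendered as a sketch in pycma, this practice could look as follows; the factor 10, the number of trials, and the test objective are placeholders, not part of the original advice:

```python
import cma

def f(x):                     # stand-in for the unknown objective
    return cma.ff.rosen(x)

x0 = 10 * [0.1]               # the given initial guess
sigma0, best = 0.01, None     # small sigma first: local improvement of x0
for trial in range(4):
    res = cma.fmin(f, x0, sigma0, {'verbose': -9})  # quiet run
    if best is None or res[1] < best[1]:
        best = res
    sigma0 *= 10              # then search increasingly globally
print('best f-value found:', best[1])
```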

82 What can/should the users do? Strategy Parameters and Initialization Python CMA-ES Demo https://github.com/CMA-ES/pycma Optimizing 10D Rosenbrock Function

83 What can/should the users do? Multimodality Multimodality Two approaches for multimodal functions: Try again with • a larger population size • a smaller initial step-size (and random initial mean vector)

84 What can/should the users do? Multimodality Multimodality Approaches for multimodal functions: Try again with • the final solution as initial solution (non-elitist) and a small step-size • a larger population size • a different initial mean vector (and a smaller initial step-size) A restart with a large population size helps if the objective function has a strong global structure • functions such as Schaffer, Rastrigin, BBOB functions 15–19 • loosely: a unimodal global structure + deterministic noise

85 What can/should the users do? Multimodality Multimodality Hansen and Kern. Evaluating the CMA Evolution Strategy on Multimodal Test Functions, PPSN 2004.

(Figures: Rastrigin function and Griewank function)

86 What can/should the users do? Multimodality Multimodality Approaches for multimodal functions: Try again with • the final solution as initial solution (non-elitist) and a small step-size • a larger population size • a different initial mean vector (and a smaller initial step-size) A restart with a small initial step-size helps if the objective function has a weak global structure • functions such as Schwefel, Bi-Sphere, BBOB functions 20–24

• a large population size has a negative effect 87 What can/should the users do? Restart Strategy Restart Strategy A restart strategy makes the CMA-ES parameter free

IPOP: restart with increasing population size • start with the default population size • double the population size after each trial (parameter sweep) • may be considered the gold standard for automated restarts

BIPOP: IPOP regime + local search regime • IPOP regime: restart with increasing population size • Local search regime: restart with a smaller step-size and a smaller population size than in the IPOP regime
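In pycma, both regimes are available through cma.fmin's restart arguments, as far as I recall its interface; a hedged sketch (cma.ff.rastrigin is assumed to be part of pycma's test-function collection):

```python
import cma

# IPOP-style automated restarts: `restarts` re-runs the optimization,
# multiplying the population size by `incpopsize` after each run.
res_ipop = cma.fmin(cma.ff.rastrigin, 10 * [3], 2,
                    restarts=6, incpopsize=2)

# BIPOP: additionally interleaves a local-search regime with a smaller
# step-size and a smaller population size.
res_bipop = cma.fmin(cma.ff.rastrigin, 10 * [3], 2,
                     restarts=6, bipop=True)
```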

88 Topics 1. What makes the problem difficult to solve?

2. How does the CMA-ES work?

• Normal Distribution, Rank-Based Recombination • Step-Size Adaptation • Covariance Matrix Adaptation

3. What can/should the users do for the CMA-ES to work effectively on their problem?

• Choice of problem formulation and encoding (not covered) • Choice of initial solution and initial step-size • Restarts, Increasing Population Size • Restricted Covariance Matrix 89 What can/should the users do? Restricted Covariance Matrix Motivation of the Restricted Covariance Matrix

Bottlenecks of the CMA-ES on high-dimensional problems
1. O(N²) time and space
• to store and update C ∈ R^(N×N)
• to compute the eigendecomposition of C
2. O(1/N²) learning rates for the C-update
• c_µ ≈ µ_w / N²
• c_1 ≈ 2 / N²

Exploit prior knowledge on the problem structure such as separability

⇛ decrease the degrees of freedom of the covariance matrix for • smaller time and space complexity • higher learning rates that potentially accelerate the adaptation

90 What can/should the users do? Restricted Covariance Matrix Variants with Restricted Covariance Matrix

CMA-ES Variants with Restricted Covariance Matrices
• Sep-CMA [Ros and Hansen, 2008]: C = D, D diagonal
• VD-CMA [Akimoto et al., 2014]: C = D(I + v v^T)D, D diagonal, v ∈ R^N
• LM-CMA [Loshchilov, 2014]: C = I + Σ_{i=1}^{k} v_i v_i^T, v_i ∈ R^N
• VkD-CMA [Akimoto and Hansen, 2016]: C = D(I + Σ_{i=1}^{k} v_i v_i^T)D, v_i ∈ R^N
(a small numerical sketch of these factored forms follows the references below)

[Ros and Hansen, 2008] Ros, R. and Hansen, N. (2008). A simple modification in CMA-ES achieving linear time and space complexity. In Parallel Problem Solving from Nature - PPSN X, pages 296–305. Springer.
[Akimoto et al., 2014] Akimoto, Y., Auger, A., and Hansen, N. (2014). Comparison-based natural gradient optimization in high dimension. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 373–380, Vancouver, BC, Canada.
[Loshchilov, 2014] Loshchilov, I. (2014). A computationally efficient limited memory CMA-ES for large scale optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 397–404.
[Akimoto and Hansen, 2016] Akimoto, Y. and Hansen, N. (2016). Projection-based restricted covariance matrix adaptation for high dimension. In Genetic and Evolutionary Computation Conference, GECCO 2016, Denver, Colorado, USA, July 20–24, 2016. ACM.
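To illustrate why these factored forms are attractive, here is a small numpy sketch (not taken from any of the cited papers) that multiplies a vector by C = D(I + Σ v_i v_i^T)D without ever forming the N × N matrix, i.e. in O(Nk) memory and time:

```python
import numpy as np

N, k = 1000, 2
rng = np.random.default_rng(0)
d = rng.uniform(0.5, 2.0, N)                 # diagonal of D
V = rng.standard_normal((k, N)) / N**0.5     # the k direction vectors v_i (rows)

def mult_C(x):
    """y = C x with C = D (I + sum_i v_i v_i^T) D, kept in factored form."""
    y = d * x                    # D x
    y = y + V.T @ (V @ y)        # (I + sum_i v_i v_i^T) D x
    return d * y                 # D (I + ...) D x

x = rng.standard_normal(N)
# Reference check: the explicit C (O(N^2) memory) agrees with the factored product.
C = np.diag(d) @ (np.eye(N) + V.T @ V) @ np.diag(d)
assert np.allclose(C @ x, mult_C(x))
```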

91 What can/should the users do? Restricted Covariance Matrix Separable CMA (Sep-CMA)

CMA:     C^(t+1) = C^(t) + c_1 ( p_c^(t) (p_c^(t))^T - C^(t) ) + c_µ Σ_{i=1}^{µ} w_i ( (x_i - m^(t))(x_i - m^(t))^T - C^(t) )

Sep-CMA: [C^(t+1)]_{k,k} = [C^(t)]_{k,k} + c_1 ( [p_c^(t)]_k² - [C^(t)]_{k,k} ) + c_µ Σ_{i=1}^{µ} w_i ( [x_i - m^(t)]_k² - [C^(t)]_{k,k} )

with a learning rate chosen (N + 2)/3 times greater than in CMA
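A direct numpy transcription of the coordinate-wise (Sep-CMA) update above; the weights, learning rates, and evolution path are assumed to be given, and the σ-normalization is omitted as in the displayed formula:

```python
import numpy as np

def sep_cma_C_update(C_diag, p_c, X, m, weights, c1, cmu):
    """Diagonal (Sep-CMA) covariance update: only the N diagonal entries
    [C]_{k,k} are adapted, in O(mu * N) time.
    C_diag: (N,)  current diagonal of C
    p_c:    (N,)  evolution path
    X:      (mu, N) the mu selected solutions x_1, ..., x_mu
    m:      (N,)  current mean m^(t)
    """
    rank_one = p_c**2                                        # [p_c]_k^2
    rank_mu = np.sum(weights[:, None] * (X - m)**2, axis=0)  # sum_i w_i [x_i - m]_k^2
    return C_diag + c1 * (rank_one - C_diag) + cmu * (rank_mu - C_diag)
```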

92 What can/should the users do? Restricted Covariance Matrix Demo: On 100D Separable Ellipsoid Function

(Plots: Separable-CMA vs. CMA on the 100-D separable ellipsoid)

• CMA needed 10 times more function evaluations and more CPU time • However, Sep-CMA will not be able to solve the rotated ellipsoid function as efficiently as it solves the separable ellipsoid
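In pycma, a diagonal-only run similar to Sep-CMA can be requested via the 'CMA_diagonal' option (as I understand it: an integer limits the diagonal phase to that many iterations, True keeps the matrix diagonal throughout); a sketch:

```python
import cma

# 100-D separable ellipsoid with a diagonal covariance matrix throughout.
es = cma.CMAEvolutionStrategy(100 * [1], 1, {'CMA_diagonal': True,
                                             'ftarget': 1e-8})
es.optimize(cma.ff.elli)   # cma.ff.elli: axis-parallel (separable) ellipsoid
print(es.result.evaluations, es.result.fbest)
# Dropping 'CMA_diagonal' runs the full-matrix CMA for comparison.
```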

93 Summary and Final Remarks

Summary and Final Remarks

94 Summary and Final Remarks Main Characteristics of (CMA) Evolution Strategies

1 Multivariate normal distribution to generate new search points follows the maximum entropy principle

2 Rank-based selection implies invariance: same performance on g(f(x)) for any increasing g; more invariance properties are featured

3 Step-size control facilitates fast (log-linear) convergence and possibly linear scaling with the dimension; in CMA-ES it is based on an evolution path (a non-local trajectory)

4 Covariance matrix adaptation (CMA) increases the likelihood of previously successful steps and can improve performance by orders of magnitude; the update follows the natural gradient; C ∝ H⁻¹ adapts a variable metric ⟺ a new (rotated) problem representation ⟹ f: x ↦ g(x^T H x) reduces to x ↦ x^T x

95 Summary and Final Remarks Limitations of CMA Evolution Strategies

internal CPU-time: roughly 10⁻⁸ n² seconds per function evaluation on a 2 GHz PC (1 000 000 f-evaluations in 100-D take 100 seconds of internal CPU-time); tweaks are available, e.g., variants with a restricted covariance matrix such as Sep-CMA. Better methods are presumably available in case of

• partly separable problems
• specific problems, for example with cheap gradients → specific methods
• small dimension (n ≪ 10) → for example Nelder-Mead
• small running times (number of f-evaluations < 100n) → model-based methods

96 Thank you

Source code for CMA-ES in C, C++, Java, Matlab, Octave, Python, R, and Scilab, as well as practical hints for problem formulation, variable encoding, and parameter setting, are available (or linked to) at http://cma.gforge.inria.fr/cmaes_sourcecode_page.html

97 Comparing Experiments Comparison during BBOB at GECCO 2010 24 functions and 20+ algorithms in 20-D

98