A CMA-ES with Multiplicative Updates

Oswin Krause                              Tobias Glasmachers
Department of Computer Science            Institut für Neuroinformatik
University of Copenhagen                  Ruhr-Universität Bochum
Copenhagen, Denmark                       Bochum, Germany
[email protected]                         [email protected]

ABSTRACT

Covariance matrix adaptation (CMA) mechanisms are core building blocks of modern evolution strategies. Despite sharing a common principle, the exact implementation of CMA varies considerably between different algorithms. In this paper, we investigate the benefits of an exponential parametrization of the covariance matrix in the CMA-ES. This technique was first proposed for the xNES algorithm. It results in a multiplicative update formula for the covariance matrix. We show that the exponential parameterization and the multiplicative update are compatible with all mechanisms of CMA-ES. The resulting algorithm, xCMA-ES, performs at least on par with plain CMA-ES. Its advantages show in particular with updates that actively decrease the sampling variance in specific directions, i.e., for active constraint handling.

Categories and Subject Descriptors

[Continuous Optimization]

General Terms

Algorithms

Keywords

evolution strategies, covariance matrix adaptation, CMA-ES, multiplicative update, exponential coordinates

1. INTRODUCTION

Evolution Strategies (ES) are randomized direct search algorithms suitable for solving black box problems in the continuous domain, i.e., minimization problems f : R^d → R defined on a d-dimensional real vector space. Most of these algorithms generate a number of normally distributed offspring in each generation. The efficiency of this scheme, at least for unimodal problems, crucially depends on online adaptation of the parameters of the Gaussian search distribution N(m, σ²C), namely the global step size σ and the covariance matrix C. Adaptation of the step size enables linear convergence on scale invariant functions [4], while covariance matrix adaptation (CMA) [10] renders the asymptotic convergence rate independent of the condition number of the Hessian in the optimum of a twice continuously differentiable function.

The most prominent algorithm implementing the above principles is CMA-ES [10, 8, 12]. Nowadays there exists a plethora of variants and extensions of the basic algorithm. A generic principle for the online update of parameters (including the search distribution) is to maximize the expected progress. This goal can be approximated by adapting the search distribution so that the probability of the perturbations that generated successful offspring in the past is increased. This is likely to foster the generation of better points in the near future.¹

¹ This statement holds only under (mild) assumptions on the regularity of the fitness landscape, which remain implicit, but may be violated, e.g., in the presence of constraints.

The application of the above principle to CMA means to change the covariance matrix towards the maximum likelihood estimator of successful steps. To this end let N(m, C) denote the search distribution, and let x_1, ..., x_µ denote successful offspring. The maximum likelihood estimator of the covariance matrix generating the step δ_i = x_i − m is the rank-one matrix δ_i δ_i^T. All CMA updates of CMA-ES are of the generic form C ← (1 − c) · C + c · δδ^T, which is a realization of this technique keeping an exponentially fading record of previous successful steps δ. CMA-ES variants differ in which step vectors enter the covariance matrix update. Early variants were based on cumulation of directions in an evolution path vector p_c and a single rank-one update of C per generation [10]. Later versions added a rank-µ update based on immediate information from the current population [8].
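To make the generic update concrete, the following minimal NumPy sketch (an illustration added here, not code from the paper; the learning rate c = 0.1 and the example steps are arbitrary) maintains such an exponentially fading record of successful steps:

```python
import numpy as np

def fading_rank_one_update(C, delta, c=0.1):
    """Generic CMA step: blend the old covariance with the rank-one
    maximum likelihood estimator of the successful step delta."""
    return (1.0 - c) * C + c * np.outer(delta, delta)

# exponentially fading record over a sequence of successful steps
C = np.eye(3)
for delta in [np.array([1.0, 0.0, 0.0]), np.array([0.5, 0.5, 0.0])]:
    C = fading_rank_one_update(C, delta)
```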
A different perspective on CMA techniques is provided within the framework of information geometric optimization (IGO) [14], in particular by the natural evolution strategy (NES) approach [17, 16]. It turns out that the rank-µ update equation can be derived from stochastic natural gradient descent on a stochastically relaxed problem over the statistical manifold of search distributions. This more general perspective opens up new possibilities for CMA mechanisms, e.g., reparameterizing the covariance matrix in exponential form as done in the xNES algorithm [7]. This results in an update equation with the following properties: a) the update is multiplicative, in contrast to the standard additive update, b) it is possible to leave the variance in directions orthogonal to all observed steps unchanged, and c) even when performing "active" (negative) updates the covariance matrix is guaranteed to remain positive definite.

In this paper we incorporate the exponential parameterization of the covariance matrix into CMA-ES. We derive all mechanisms found in the standard CMA-ES algorithm in this framework, demonstrating the compatibility of cumulative step size adaptation and evolution paths (two features missing in xNES) with exponential coordinates. The new algorithm is called xCMA-ES. Its performance on standard (unimodal) benchmarks coincides with that of CMA-ES; in addition, however, it benefits from neat properties of the exponential parameters, which show up prominently when performing active CMA updates with negative weights, e.g., for constraint handling.

In the next section we recap CMA-ES and xNES. Based thereon we present our new xCMA-ES algorithm. In section 3 its performance is evaluated empirically on standard benchmarks. We demonstrate the superiority for special tasks involving active CMA updates.

2. ALGORITHMS

In this section we provide the necessary background for our new algorithm before introducing the xCMA-ES. We cover the well-known CMA-ES algorithm as well as xNES, both with a focus on the components required for our own contribution. In the following algorithms, a few implementation details are not shown, e.g., the decomposition of C required for sampling as well as for the computation of the inverse matrix square root C^(−1/2).

2.1 CMA-ES

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [10, 12] is the most prominent evolution strategy in existence. It comes in many variants, e.g., with extensions for handling of fitness noise [9] and multi-modality [5]. Here we describe what can be considered a baseline version, featuring non-elitist (µ, λ) selection, cumulative step size control (CSA), and two different types of covariance matrix updates, namely a rank-one update based on an evolution path, and the so-called rank-µ update based on the survivors of environmental truncation selection.

The state of CMA-ES is given by the parameters m ∈ R^d, σ > 0, and C ∈ R^(d×d) of its multivariate normal search distribution N(m, σ²C), as well as by the two evolution paths p_s, p_c ∈ R^d. Pseudo-code of a basic CMA-ES is provided in Algorithm 1. The algorithm has a number of tuning constants, e.g., the sizes of parent and offspring population µ and λ, the various learning rates, and the rank-based weights w_1, ..., w_µ. For robust default settings of the different parameters we refer to [12].

Algorithm 1: CMA-ES
  Input: m, σ
  C ← I
  while stopping condition not met do
      // sample and evaluate offspring
      for i ∈ {1, ..., λ} do
          x_i ← N(m, σ²C)
      end
      sort {x_i} with respect to f(x_i)
      // internal update (paths and parameters)
      m' ← Σ_{i=1}^µ w_i x_i
      p_s ← (1 − c_s) · p_s + sqrt(c_s(2 − c_s)µ_eff) · C^(−1/2)(m' − m)/σ
      p_c ← (1 − c_c) · p_c + sqrt(c_c(2 − c_c)µ_eff) · (m' − m)/σ
      C ← (1 − c_1 − c_µ) · C + c_1 · p_c p_c^T + c_µ · Σ_{i=1}^µ w_i ((x_i − m)/σ)((x_i − m)/σ)^T
      σ ← σ · exp(c_s/D_σ · (‖p_s‖/χ_d − 1))
      m ← m'
  end

In each generation, CMA-ES executes the following steps (an illustrative code sketch of one generation is given after the list):

1. Sample offspring x_1, ..., x_λ ∼ N(m, σ²C). This step is realized by sampling standard normally distributed vectors z_1, ..., z_λ ∈ R^d, which are then transformed via x_i ← m + σA z_i, where A is a factor of the covariance matrix fulfilling AA^T = C. It can be computed via a Cholesky decomposition of C; however, the usual method is an eigen decomposition, since this operation is needed anyway later on.

2. Evaluate the offspring's fitness values f(x_1), ..., f(x_λ). This function call is often considered a black box, and it is assumed that its computational cost is substantial, usually exceeding the internal computational complexity of the algorithm.

3. Sort the offspring by fitness. Post-condition: f(x_1) ≤ f(x_2) ≤ ··· ≤ f(x_λ).

4. Perform environmental selection: keep {x_1, ..., x_µ}, discard {x_{µ+1}, ..., x_λ}, usually the worse half.

5. Update the evolution path for cumulative step size adaptation: the path p_s is an exponentially fading record of steps of the mean vector, back-transformed by multiplication with the inverse matrix square root C^(−1/2) into a coordinate system where the sampling distribution is a standard normal distribution. This path is supposed to follow a standard normal distribution.

6. Update the evolution path for covariance matrix adaptation: the path p_c is an exponentially fading record of steps of the mean vector divided (corrected) by the step size. This path models the movement direction of the distribution center over time.

7. Update the covariance matrix: a new matrix is obtained by additive blending of the old matrix with a rank-one matrix formed by the path p_c and a rank-µ matrix formed by the successful steps of the current population.

8. Update the step size: the step size is changed if the norm of p_s indicates a systematic deviation from the standard normal distribution. Note (for later reference) that this update has a multiplicative form, involving the exponential function.

9. Update the mean: discard the old mean and center the distribution on a weighted mean of the new parent population.
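The following NumPy sketch condenses steps 1 to 9 into a single generation. It is an illustration of Algorithm 1, not the authors' reference implementation; the offspring number follows the default of table 1, and the expected norm of a standard normal vector is approximated by the usual series expansion.

```python
import numpy as np

def cma_es_generation(f, m, sigma, C, p_s, p_c, w, c_s, c_c, c1, c_mu, d_sigma):
    """One generation of the baseline CMA-ES (steps 1-9), illustrative only.
    w are the mu positive weights (assumed to sum to one)."""
    d, mu = len(m), len(w)
    lam = 4 + int(3 * np.log(d))                       # default offspring number
    mu_eff = 1.0 / np.sum(w ** 2)                      # variance-effective selection mass
    chi_d = np.sqrt(d) * (1 - 1 / (4 * d) + 1 / (21 * d ** 2))  # approx. E||N(0,I)||
    # step 1: sample offspring x_i = m + sigma * A z_i with A A^T = C
    evals, evecs = np.linalg.eigh(C)
    A = evecs @ np.diag(np.sqrt(evals))
    Z = np.random.randn(lam, d)
    X = m + sigma * Z @ A.T
    # steps 2-4: evaluate, sort, keep the mu best offspring
    order = np.argsort([f(x) for x in X])
    X_sel = X[order[:mu]]
    m_new = w @ X_sel                                  # step 9: weighted mean of the parents
    # step 5: cumulative step size path in the isotropic coordinate system
    C_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    p_s = (1 - c_s) * p_s + np.sqrt(c_s * (2 - c_s) * mu_eff) * C_inv_sqrt @ (m_new - m) / sigma
    # step 6: covariance evolution path
    p_c = (1 - c_c) * p_c + np.sqrt(c_c * (2 - c_c) * mu_eff) * (m_new - m) / sigma
    # step 7: additive rank-one plus rank-mu covariance update
    Y = (X_sel - m) / sigma
    C = (1 - c1 - c_mu) * C + c1 * np.outer(p_c, p_c) + c_mu * (Y.T * w) @ Y
    # step 8: multiplicative step size update
    sigma *= np.exp(c_s / d_sigma * (np.linalg.norm(p_s) / chi_d - 1))
    return m_new, sigma, C, p_s, p_c
```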
2.2 xNES

The exponential natural evolution strategy (xNES) algorithm [7] is a prominent member of the family of natural evolution strategies (NES) [17, 16]. While exhibiting all properties of an evolution strategy, it is derived as a stochastic natural gradient descent method on the statistical manifold of search distributions, an approach that is best understood as an instance of information geometric optimization [14]. NES was found to have close relations to CMA-ES [1, 7].

The NES family of algorithms is derived as follows. Let {P_θ | θ ∈ Θ} denote a family of search distributions with parameters θ, where Θ is a differentiable manifold. The most prominent example is the family of multivariate Gaussian distributions P_θ = N(m, C) with parameters θ = (m, C) and density p_θ(x). The algorithm aims to solve the optimization problem

    min_θ F(θ) = E_{x∼P_θ}[f(x)],

which is lifted from points x to distributions P_θ. The gradient ∇_θ F(θ) = ∫ f(x) ∇_θ log(p_θ(x)) p_θ(x) dx (given that the integral converges in an open set around θ) is intractable in the black box model; however, it can be approximated by the Monte Carlo estimator

    G(θ) = (1/λ) Σ_{i=1}^λ f(x_i) ∇_θ log(p_θ(x_i))

with samples x_i ∼ P_θ. These samples correspond to the offspring population of an ES. The update θ ← θ − γ · G(θ) (with learning rate γ > 0) amounts to minimization of F with stochastic gradient descent (SGD).
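For the Gaussian family the required log-density gradients are available in closed form. As an illustration (a standard identity, not spelled out in the paper), the component of the estimator belonging to the mean reads

    ∇_m log p_(m,C)(x) = C^(−1)(x − m),    so    G_m(θ) = (1/λ) Σ_{i=1}^λ f(x_i) C^(−1)(x_i − m).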
This update is unsatisfactory in the context of NES since it depends on the chosen parameterization θ ↦ P_θ (see [14], and refer to [16, 6] for further shortcomings). In the case of optimizing P_θ, the question which direction to follow has a canonical answer. This is because {P_θ | θ ∈ Θ} is a statistical manifold (a manifold the points of which are distributions) with an intrinsic Riemannian information geometry induced by the KL-divergence [17, 16, 14]. The gradient w.r.t. the intrinsic geometry, pulled back to the parameter space Θ, is known as the natural gradient, usually denoted by ∇̃_θ F(θ). It is obtained from the plain gradient as

    ∇̃_θ F(θ) = I(θ)^(−1) · ∇_θ F(θ),

since the Fisher information matrix I(θ) is the metric tensor describing the intrinsic geometry. It is estimated in a straightforward manner by I(θ)^(−1) G(θ). The update

    θ ← θ − γ · I(θ)^(−1) G(θ)

is known as stochastic natural gradient descent (SNGD). Several different implementations of this general scheme have been developed with a focus on computational aspects [15]. It is common to replace "raw" fitness values f(x_i) with rank-based utility values u_i, which turn out to correspond exactly to the weights w_i of CMA-ES.

The xNES algorithm supersedes earlier developments with two novel techniques. The first is to perform the NES update in a local parameterization θ ↦ P_θ for which the Fisher matrix is the identity matrix, which saves its computation or estimation and in particular its (numerical) inversion. The second is a parameterization of the positive definite covariance matrix involving the matrix exponential, which allows for an unconstrained representation of the covariance matrix.

Covariance matrices are symmetric and positive definite d × d matrices. Symmetric matrices form the d(d+1)/2 dimensional vector space S_d. The requirement of positive definiteness adds a non-linear constraint. Let P_d denote the open sub-manifold of positive definite symmetric matrices. Then the parameter space takes the form (m, C) = θ ∈ Θ = R^d × P_d.

Note that this space is not closed under subtraction, and also not under addition of terms from S_d (e.g., (natural) gradients). When performing an additive update of the covariance matrix of the form C ← C − γ · ∆, e.g., an SGD step, a large enough step γ · ∆ can result in a violation of the positivity constraint.²

² This problem does not appear with standard CMA-ES updates, where positive semi-definite matrices are added to a positive definite matrix, the result of which is always positive definite.

Possible workarounds are a line search for a feasible step length or more elaborate constraint handling techniques. A conceptually easier and more elegant solution is to parameterize the manifold P_d with a vector space and to perform SGD on this new parameter space. This is exactly the role of the matrix exponential exp : S_d → P_d, which is a diffeomorphism (a bijective, smooth mapping with smooth inverse).

The exponential map for matrices is in general defined in terms of the power series expansion exp(M) = Σ_{n=0}^∞ (1/n!) M^n, mapping general d × d matrices to the general linear group of invertible d × d matrices. For symmetric matrices it can be understood in terms of a spectral transformation. Let M = UDU^T with U orthogonal and D diagonal denote the eigen decomposition of M; then it is easy to see from the power series formula that exp(M) = U exp(D) U^T. The exponential of a diagonal matrix is simply the diagonal matrix consisting of the scalar exponentials of the diagonal entries. Hence the matrix exponential corresponds to exponentiation of the eigenvalues, mapping general to positive eigenvalues and thus S_d to P_d.
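The spectral form of the matrix exponential translates directly into code. The following NumPy sketch (an illustration, not part of the original paper) exponentiates a symmetric, indefinite matrix and verifies that the result is positive definite:

```python
import numpy as np

def sym_expm(M):
    """exp(M) for symmetric M via the eigen decomposition M = U D U^T."""
    D, U = np.linalg.eigh(M)
    return U @ np.diag(np.exp(D)) @ U.T

M = np.array([[0.0, 0.5], [0.5, -1.0]])   # symmetric, indefinite
E = sym_expm(M)
print(np.linalg.eigvalsh(E))               # all eigenvalues of exp(M) are positive
```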

−1 ∇˜ θF (θ) = I(θ) · ∇θF (θ) , since the Fisher information matrix I(θ) is the metric ten- sor describing the intrinsic geometry. It is estimated in a 2 −1 This problem does not appear with standard CMA-ES up- straightforward manner by I(θ) G(θ). The update dates where positive semi-definite matrices are added to a −1 positive definite matrix, the result of which is always positive θ → θ − γ ·I(θ) G(θ) definite. 0 0 Pλ distribution parameters (m ,A ) are represented as Proof. The proof has two parts. In the case that i=1 ui = 0, this reduces to G = Pλ u z zT . We first take a look at   1  i=1 i i i (δ, M) 7→ m0,A0 = m + Aδ, A exp M . the case thatu ¯ = Pλ u 6= 0. Then we have 2 i=1 i λ ! The coordinates are chosen so that the Fisher matrix is the X T exp(G) = exp −u¯I + uizizi identity: I(0, 0) = I. Hence the coordinates are orthonormal i=1 w.r.t. the intrinsic geometry. Plain and natural gradient λ ! coincide. X T = exp(−u¯I) exp uizizi Despite the seemingly complicated derivation of xNES in- i=1 volving stochastic natural gradient on a statistical manifold λ ! and a non-linear coordinate system based on the matrix ex- X T = exp(−u¯) exp uizizi ponential, its update equations are surprisingly simple. The i=1 complete pseudo-code is given in algorithm 2. Pλ T In this implementation the covariance matrix factor A is The second step holds since −u¯I commutes with i=1 uizizi . Pλ T represented as A = σB, where the transformation matrix B We can thus assume w.l.o.g. that G = i=1 uizizi , which fulfills det(B) = 1. In the chosen exponential coordinates includes the caseu ¯ = 0. Because of rank(G) ≤ λ we can find T d×λ the corresponds to the trace (see computation an eigen decomposition of G = QDQ with Q ∈ R and λ×λ 2 3 3 of Gσ and GB in the algorithm). The parameters m, σ, and D ∈ R in O(λ d + λ ) time . With this decomposition B can be updated with independent SNGD steps, poten- the matrix exponential can be rewritten as tially with different learning rates.  T  T The parameters of the xNES algorithm are the sample exp(G) = I − QQ + Q exp(D)Q (population) size λ, the learning rates η , η , and η , as m σ B = I + Q (exp(D) − I) QT . well as the rank-based weights u1, . . . , uλ. Population size and weights essentially follow the settings of CMA-ES, with The first equality holds because exp(G) maps all 0-eigenvalues the deviation to subtract the mean from the weights. This of G to 1, which leads to the first term. Insertion of the re- leads to the weights ui = wi − 1/λ, resulting in negative sult into the term of interest yields weights for individuals that simply don’t enter the CMA- T h T i T ES due to truncation selection. The mean learning rate has A exp(G)A = A I + Q (exp(D) − I) Q A the canonical value ηm = 1, which results in the exact same T T mean update as in CMA-ES [1, 7], while the other learning = AA + AQ (exp(D) − I) (AQ) rates were empirically tuned, see [7]. = C + AQ (exp(D) − I) (AQ)T Note that due to the use of the matrix exponential in d×λ 2 xNES the updates of σ and B have exactly the same form. We can compute K = AQ ∈ R in O(λd ) and C + T 2 In contrast to CMA-ES, the covariance matrix update of K (exp(D) − I) K in O(λd ) time. xNES is multiplicative in nature. 
The parameters of the xNES algorithm are the sample (population) size λ, the learning rates η_m, η_σ, and η_B, as well as the rank-based weights u_1, ..., u_λ. Population size and weights essentially follow the settings of CMA-ES, with the deviation that the mean is subtracted from the weights. This leads to the weights u_i = w_i − 1/λ, resulting in negative weights for individuals that simply do not enter the CMA-ES update due to truncation selection. The mean learning rate has the canonical value η_m = 1, which results in the exact same mean update as in CMA-ES [1, 7], while the other learning rates were tuned empirically, see [7].

Note that due to the use of the matrix exponential in xNES the updates of σ and B have exactly the same form. In contrast to CMA-ES, the covariance matrix update of xNES is multiplicative in nature. We argue that conceptually this is a desirable property, since σ (scale) and B (shape) describe complementary properties of the covariance matrix C = σ²BB^T, and they enter the sampling process in a similar way, namely by left-multiplication with the standard normally distributed random vectors z_i. In fact, the exponential parameterization seems to be canonical since it allows for a clear separation of the scale component σ and the shape component B of the search distribution as linear sub-spaces of Θ, see also [7].

2.3 Efficient Multiplicative Update

While the multiplicative update rule of xNES guarantees positive definiteness of the covariance matrix, the matrix exponential in itself is usually a computationally expensive operation. In the following we show that the update can be implemented efficiently with time complexity O(d²λ), which coincides with the complexity of the additive update of the CMA-ES. Thus the computational difference between the updates is merely a constant factor.

Lemma 1. Consider a matrix G = Σ_{i=1}^λ u_i(z_i z_i^T − I) built from λ < d vectors z_i ∈ R^d, weights u_i ∈ R, and a positive definite symmetric matrix C ∈ R^(d×d) for which a decomposition C = AA^T is available. Then the term

    A exp(G) A^T

can be computed with time complexity O(λd² + λ²d + λ³).

Proof. The proof has two parts. In the case Σ_{i=1}^λ u_i = 0, G reduces to G = Σ_{i=1}^λ u_i z_i z_i^T. We first take a look at the case ū = Σ_{i=1}^λ u_i ≠ 0. Then we have

    exp(G) = exp(−ūI + Σ_{i=1}^λ u_i z_i z_i^T)
           = exp(−ūI) exp(Σ_{i=1}^λ u_i z_i z_i^T)
           = exp(−ū) exp(Σ_{i=1}^λ u_i z_i z_i^T).

The second step holds since −ūI commutes with Σ_{i=1}^λ u_i z_i z_i^T. We can thus assume w.l.o.g. that G = Σ_{i=1}^λ u_i z_i z_i^T, which includes the case ū = 0. Because of rank(G) ≤ λ we can find an eigen decomposition G = QDQ^T with Q ∈ R^(d×λ) and D ∈ R^(λ×λ) in O(λ²d + λ³) time.³ With this decomposition the matrix exponential can be rewritten as

    exp(G) = (I − QQ^T) + Q exp(D) Q^T = I + Q (exp(D) − I) Q^T.

The first equality holds because exp(G) maps all 0-eigenvalues of G to 1, which leads to the first term. Insertion of this result into the term of interest yields

    A exp(G) A^T = A [I + Q (exp(D) − I) Q^T] A^T
                 = AA^T + AQ (exp(D) − I) (AQ)^T
                 = C + AQ (exp(D) − I) (AQ)^T.

We can compute K = AQ ∈ R^(d×λ) in O(λd²) and C + K (exp(D) − I) K^T in O(λd²) time.

³ This can be achieved by first applying a QR-decomposition to the matrix Z = (z_1, ..., z_λ), which yields a λ × d matrix B with BB^T = I and BGB^T = K ∈ R^(λ×λ) in O(λd²). An eigenvalue decomposition K = VDV^T can then be performed in O(λ³), and Q = B^T V.

The lemma implies that if a decomposition C = AA^T is available, then the update C ← A exp(G) A^T, or C ← C + K (exp(D) − I) K^T in the notation of the proof, can be seen as a rank-λ update to C. It can be computed significantly faster than a full eigen decomposition of C. Typically, as λ = O(log(d)), the runtime costs are dominated by the O(λd²) operations of the matrix multiplications, which leads to the same runtime complexity as the additive matrix update. If A is a Cholesky factor then it can also be updated efficiently in O(d²λ) operations, without requiring to store or compute C first [13]. If A has been computed through an eigenvalue decomposition then there is currently no fast algorithm known to update the eigenvalue decomposition, and recomputing it from C has time complexity Θ(d³), dominating the overhead of the exponential update.
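The constructive proof can be followed line by line in code. The NumPy sketch below is an illustration of Lemma 1 (assuming λ < d) and not the authors' implementation; it computes A exp(G) A^T without ever forming a d × d matrix exponential.

```python
import numpy as np

def multiplicative_update(C, A, Z, u):
    """Compute A exp(G) A^T for G = sum_i u_i (z_i z_i^T - I) in O(lambda d^2).
    Z holds the vectors z_i as rows (lambda x d); A is any factor with A A^T = C."""
    u_bar = np.sum(u)                 # exp(G) = exp(-u_bar) * exp(sum_i u_i z_i z_i^T)
    # low-rank eigen decomposition of G0 = sum_i u_i z_i z_i^T via QR + small eigenproblem
    Q0, R = np.linalg.qr(Z.T)         # reduced QR: Q0 is d x lambda with orthonormal columns
    K_small = (R * u) @ R.T           # lambda x lambda matrix Q0^T G0 Q0
    D, V = np.linalg.eigh(K_small)
    Q = Q0 @ V                        # G0 = Q diag(D) Q^T with orthonormal columns Q
    K = A @ Q                         # d x lambda
    return np.exp(-u_bar) * (C + K @ np.diag(np.exp(D) - 1.0) @ K.T)
```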
2.4 xCMA-ES

In this section we show that exponential coordinates for the covariance matrix are compatible with all mechanisms of CMA-ES. Consequently, we incorporate this technique into the CMA-ES algorithm, resulting in a new method called exponential CMA-ES, or xCMA-ES for short.

The xCMA-ES algorithm features all techniques found in CMA-ES, but in addition incorporates the multiplicative covariance matrix update of the xNES algorithm. This means that xCMA-ES is equipped with two evolution paths, one for cumulative step size control, and one for the rank-one covariance matrix update. Notably, xCMA-ES comes with explicit step size control, while in xNES the step size is updated with the same mechanism as the shape component of the covariance matrix, with the only difference that the learning rates can be decoupled. For xCMA-ES we follow the procedure of CMA-ES and do not decouple these parameters explicitly (i.e., the scale of the covariance matrix is allowed to drift).

The beauty of this construction is that all mechanisms of standard CMA-ES are naturally compatible with the exponential parameterization of the covariance matrix. In particular, the updates of the evolution paths and the step size do not require any changes.

The stochastic natural gradient component Σ_{i=1}^λ u_i · (z_i z_i^T − I) of xNES deserves particular attention. This matrix is, up to scaling, the quantity entering the matrix exponential. It consists of a weighted sum of outer products of steps in the "natural" coordinate system of the standard normally distributed samples z_i, minus their expected value, which (for standard normal samples) is the identity matrix. Note that due to Σ_i u_i = 0 the weighted identity matrices cancel each other out. Also note that in first order Taylor approximation around the origin the matrix exponential reduces to exp(M) ≈ I + M, resulting in the exact rank-µ update of CMA-ES [1, 7, 14].

Hence it is natural to incorporate the rank-one update of CMA-ES into the multiplicative update by adding the term c_1 · (pp^T − I) to the term entering the matrix exponential, where p = C^(−1/2) p_c is the evolution path transformed back to the coordinate system of standard normally distributed samples.

A disadvantage of the mean-free weights u_i in the xNES algorithm is that the computed steps G_δ are only mean-free with respect to the estimated mean ẑ = (1/λ) Σ_{i=1}^λ z_i, as

    G_δ = Σ_{i=1}^λ u_i · z_i = Σ_{i=1}^λ (w_i − 1/λ) · z_i = Σ_{i=1}^λ w_i · (z_i − ẑ),

thus adding additional noise to the update. By replacing ẑ with the true (zero) mean, we achieve a better estimate, namely the original CMA-ES update.⁴

⁴ With both update choices, µ_eff is the same and computed from the w_i.

These changes applied to CMA-ES define the basic xCMA-ES algorithm. Its pseudo-code is found in Algorithm 3.

Algorithm 3: xCMA-ES
  Input: m, σ
  C ← I
  while stopping condition not met do
      // sample and evaluate offspring
      for i ∈ {1, ..., λ} do
          z_i ← N(0, I)
          x_i ← m + σ · C^(1/2) z_i
      end
      sort {(z_i, x_i)} with respect to f(x_i)
      // internal update (paths and parameters)
      m' ← Σ_{i=1}^λ (u_i + 1/λ) x_i
      p_s ← (1 − c_s) · p_s + sqrt(c_s(2 − c_s)µ_eff) · C^(−1/2)(m' − m)/σ
      p_c ← (1 − c_c) · p_c + sqrt(c_c(2 − c_c)µ_eff) · (m' − m)/σ
      Z ← c_1 · [C^(−1/2) p_c p_c^T C^(−1/2) − I] + c_µ · Σ_{i=1}^λ u_i z_i z_i^T
      C ← C^(1/2) exp(Z) C^(1/2)
      σ ← σ · exp(c_s/D_σ · (‖p_s‖/χ_d − 1))
      m ← m'
  end
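For illustration, the covariance step of Algorithm 3 in isolation might look as follows. This is a dense O(d³) sketch written for clarity (not the authors' code); in practice the O(λd²) scheme of section 2.3 would be used, and the learning rates are assumed to be supplied by the caller.

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def xcma_covariance_update(C, p_c, Z, u, c1, c_mu):
    """Multiplicative rank-one plus rank-mu update: C <- C^{1/2} exp(G) C^{1/2}.
    Z holds the standard normal samples z_i as rows; u are the mean-free weights."""
    d = C.shape[0]
    C_sqrt = np.real(sqrtm(C))
    C_inv_sqrt = np.linalg.inv(C_sqrt)
    p = C_inv_sqrt @ p_c                       # evolution path in the isotropic coordinates
    G = c1 * (np.outer(p, p) - np.eye(d)) \
        + c_mu * (Z.T * u) @ Z                 # the matrix called Z in Algorithm 3
    return C_sqrt @ expm(G) @ C_sqrt
```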
As noted above, when using identical weights and learning rates, in first order approximation the multiplicative update in xCMA-ES coincides with the additive update in standard CMA-ES. Both updates rely on a rank-(λ + 1) matrix⁵ formed by the outer products of the sampling steps and the evolution path p_c. An interesting conceptual difference between the additive and the multiplicative update is that the additive update shrinks the variance in all directions orthogonal to the λ + 1 update vectors by a factor of 1 − c_1 − c_µ, while the multiplicative update leaves these variance components untouched. In our opinion the latter better reflects the fact that no information was observed about these directions. Also, by means of subtraction of the identity (the expected value) from the positive semi-definite weighted sum of outer products, it is natural for the multiplicative update to (multiplicatively) grow or shrink the variance in a specific direction. In contrast, the additive update can only (additively) grow the variance in a specific direction, while shrinking works only through global shrinking and subsequent growing of all other directions, which can be slow.

⁵ In standard CMA-ES the rank is even reduced to µ + 1, since truncation selection effectively sets the weights of discarded offspring to zero.

This problem was mitigated to some extent with so-called active covariance matrix updates [2, 11]. In an active update step the algorithm subtracts an outer product from the covariance matrix, an operation that must carefully ensure the positive definiteness of the result, e.g., by means of line search or finely tuned learning rates. In contrast, the multiplicative update operates on an unconstrained problem and hence can never result in an invalid configuration.

2.5 Constraint Handling

A powerful constraint handling mechanism for the (1+1)-CMA-ES was introduced in [3]. In principle, the same mechanism can be applied to the non-elitist, population-based standard formulation of CMA-ES. The simple yet highly efficient approach amounts to performing active CMA updates for steps resulting in infeasible offspring, effectively reducing the variance in the direction of the constraint normal, while suspending step size adaptation.

The corresponding mechanism for xCMA-ES is essentially the same, namely to perform a standard update with negative weight, but of course without a need for constraining the step size. There is a subtle difference to the elitist setting considered in [3]: with non-elitist selection it is in principle possible for infeasible offspring to enter the new parent population, namely if more than λ − µ offspring happen to be infeasible; this can indeed be observed in experiments. Then the algorithm can get caught in a random walk through the infeasible region. To avoid this effect we propose to make the weights adaptive to the number of infeasible offspring as follows. We first compute the standard weights w_i of CMA-ES. Infeasible offspring are treated as worst, generally obtaining low weights. For the active CMA update we subtract a constant from the weights of infeasible offspring, which is set to (0.4/λ) Σ_i w_i. Then we normalize the absolute values of the weights to one (by dividing all weights by Σ_i |w_i|), compute µ_eff based on these weights, and then make them mean-free by subtracting (1/λ) Σ_i w_i. Finally, the parameters c_s, D_σ, c_1, c_c, c_µ are recomputed based on the new weights.

Finally, we ensure that the mean m stays in the feasible region by finding the minimal k ∈ N_0 such that

    m + γ^k (m' − m)                                                (1)

is feasible, where we set γ = 2/3.
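The two constraint handling ingredients described above can be sketched as follows. This reflects one reading of the procedure and is not the authors' code; the feasibility test is assumed to be provided by the problem, and the recomputation of µ_eff and the strategy constants from the adapted weights is omitted.

```python
import numpy as np

def adapt_weights(w, infeasible):
    """Adapt the CMA-ES weights to infeasible offspring (section 2.5 recipe).
    w: standard weights of length lambda; infeasible: boolean mask over offspring.
    mu_eff and the learning rates are recomputed from the result (not shown)."""
    lam = len(w)
    w = np.asarray(w, dtype=float).copy()
    w[infeasible] -= 0.4 / lam * np.sum(w)   # push infeasible samples towards negative weights
    w /= np.sum(np.abs(w))                   # normalize absolute values to one
    w -= np.sum(w) / lam                     # make the weights mean-free
    return w

def feasible_mean(m, m_new, is_feasible, gamma=2.0 / 3.0):
    """Backtrack the mean step by powers of gamma until it is feasible (equation (1)).
    The old mean m is assumed to lie in the interior of the feasible region."""
    k = 0
    while not is_feasible(m + gamma ** k * (m_new - m)):
        k += 1
    return m + gamma ** k * (m_new - m)
```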
3. EXPERIMENTS

In this section we perform an empirical evaluation of xCMA-ES and compare it to CMA-ES. This comparison highlights the effect of the exponential covariance matrix parameters and the resulting multiplicative covariance matrix update. In particular, we aim to answer the following questions:

1. Does the exponential update impact black box performance?

2. How do active covariance updates perform in exponential form?

In order to answer the first question we perform experiments on the same standard benchmark sets as used for the xNES algorithm [7]. Due to the lack of a fair baseline for comparison we cannot provide a good answer to the second question. However, we demonstrate that active updates work qualitatively as desired in a constrained optimization setting.

As the CMA-ES is tuned on a large number of benchmark functions, ranking stability over speed, it is not hard to find parameters which beat it on a smaller set of functions. Thus, tuning the parameters of xCMA-ES on a small set of functions would lead to an unfair benchmark. To avoid this situation we avoid tuning altogether and instead resort to the default parameters of CMA-ES, which are given in table 1. This choice is justified by the similarity of the update rules, especially in the settings of small dimensionality with the conservative learning rates of CMA-ES.

Table 1: Constants used for the CMA-ES and xCMA-ES algorithms.

    constant | value
    λ, µ     | 4 + ⌊3 log(d)⌋,  λ/2
    c_s      | (µ_eff + 2) / (d + µ_eff + 5)
    c_c      | (4 + µ_eff/d) / (d + 4 + 2µ_eff/d)
    c_1      | 2 / ((d + 1.3)² + µ_eff)
    c_µ      | min{1 − c_1, 2(µ_eff − 2 + 1/µ_eff) / ((d + 2)² + 2µ_eff/2)}
    D_σ      | 1 + c_s + 2 max{0, sqrt((µ_eff − 1)/(d + 1)) − 1}
    w_i      | max{0, log(λ/2 − 1) − log(i − 1)} / Σ_{j=1}^λ max{0, log(λ/2 − 1) − log(j − 1)}

3.1 Black Box Performance

To assess the black box performance of xCMA-ES, we compare it to CMA-ES on the set of benchmark functions used in [7]. We run both algorithms for 100 trials on each function for each dimensionality d ∈ {4, 8, 16, 32, 64} until a target value f_stop is reached. We chose σ = 1/√d as initial step size for all functions. We report the median of the required function evaluations over the successful trials for both algorithms. A trial is considered successful if the algorithm converges to the right optimum (which is only an issue on Rosenbrock). The function definitions and the values of f_stop are given in table 2. The results of the experiments are given in Figure 1.

Table 2: The formulas for the benchmark functions. The default value α = 10⁻⁶ was used in all experiments.

    name        | f(x)                                               | f_stop
    SharpRidge  | −x_1 + 100 sqrt(Σ_{i=2}^d x_i²)                    | −1000
    ParabRidge  | −x_1 + 100 Σ_{i=2}^d x_i²                          | −1000
    Rosenbrock  | Σ_{i=1}^{d−1} 100(x_{i+1} − x_i²)² + (1 − x_i)²    | 10⁻¹⁴
    Sphere      | Σ_{i=1}^d x_i²                                     | 10⁻¹⁴
    Cigar       | α x_1² + Σ_{i=2}^d x_i²                            | 10⁻¹⁴
    Discus      | x_1² + α Σ_{i=2}^d x_i²                            | 10⁻¹⁴
    Ellipsoid   | Σ_{i=1}^d α^((i−1)/(d−1)) x_i²                     | 10⁻¹⁴
    Schwefel    | Σ_{i=1}^d (Σ_{j=1}^i x_j)²                         | 10⁻¹⁴
    DiffPowers  | Σ_{i=1}^d |x_i|^(2 + 10(i−1)/(d−1))                | 10⁻¹⁴

The results show that both algorithms perform equally well on all functions, with nearly no differences between the algorithms and a minimal (maybe insignificant and surely irrelevant) edge for xCMA-ES.

Figure 1: Median number of black box function evaluations over problem dimension for CMA-ES and xCMA-ES on nine standard benchmark problems: (a) Sphere, (b) Cigar, (c) Discus, (d) Ellipsoid, (e) Schwefel, (f) Rosenbrock, (g) DiffPowers, (h) SharpRidge, (i) ParabRidge. Due to low spread, the 25% and 75% quantile indicators are nearly invisible.

3.2 Constraint Handling

Due to the lack of a non-elitist CMA-ES variant with a constraint handling mechanism based on (active) covariance matrix updates, we cannot provide a strong baseline algorithm for comparison with xCMA-ES. To obtain a reasonable baseline we constrain the mean of CMA-ES to never leave the feasible region, using the same mechanism as proposed for xCMA-ES (see equation (1)). The standard selection operator is already capable of handling constraints to some extent when treating infeasible offspring as worse than feasible ones. The only differences between the baseline CMA-ES and xCMA-ES are the additive update vs. the multiplicative update with active constraint handling.

As a benchmark problem we use the constrained sphere function proposed in [3]:

    min_x ‖x‖² − m    s.t.  x_i ≥ 1,  i = 1, ..., m

The optimum x* has components x_i* = 1 for i ≤ m and x_i* = 0 for i > m. The difficulty of this problem is twofold: In the vicinity of the optimum only a fraction of 2^(−m) of the space is feasible; hence the problem gets harder with a growing number of constraints. At the same time, while approaching the optimum, the gradient of the objective function in the unconstrained components vanishes. To counteract this trend the algorithm must reduce the variance of its sampling distribution in the subspace spanned by all constraint normals, a task for which active CMA updates are supposed to be helpful. We perform 100 runs of both algorithms in d = 16 and d = 32 dimensions with m ∈ {2, 4, ..., d/2} constraints.
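The benchmark is easy to state in code. The following sketch (an illustration of the problem definition above, where m denotes the number of constraints) provides the objective and the feasibility test used by both algorithms:

```python
import numpy as np

def constrained_sphere(x, m):
    """f(x) = ||x||^2 - m, subject to x_i >= 1 for i = 1, ..., m; f(x*) = 0."""
    return float(np.dot(x, x)) - m

def is_feasible(x, m):
    """All of the first m coordinates must be at least one."""
    return bool(np.all(x[:m] >= 1.0))
```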
The results are given in figure 2. A plot of an example run with d = 32 and m = 4 is given in figure 3, showing the function values of the best point in the sampled population as well as the current step size.


It turns out that xCMA-ES is always faster than the baseline CMA-ES. Moreover, CMA-ES became unstable with increasing m, and in the case d = 32 and m = 16 it fails to converge in all 100 runs. The plot in figure 3 shows this instability already for m = 4 constraints.

Figure 2: Number of iterations (generations) necessary to reach the target objective value of 10⁻¹² over the number m of constraints for the constrained sphere problem in (a) d = 16 and (b) d = 32 dimensions. The value for CMA-ES at d = 32 and m = 16 is missing because the algorithm failed in all 100 runs.

Figure 3: Plot of a single run of the CMA-ES and xCMA-ES algorithms on the constrained sphere function with d = 32 and m = 4. Plotted are function value and step length over generations.

4. CONCLUSION

We have incorporated a new type of covariance matrix update into the CMA-ES algorithm. This multiplicative update stems from the xNES algorithm. It guarantees positive definiteness of the covariance matrix even with negative update weights. The resulting algorithm features all mechanisms of CMA-ES and is hence called xCMA-ES. We showed that its performance on standard benchmarks is nearly indistinguishable from that of the CMA-ES. We demonstrated that, despite the application of the matrix exponential, it is possible to implement the multiplicative update with a time complexity of only O(d²λ).

We further investigated an extension of the algorithm for constrained optimization problems and showed that by a simple use of negative weights xCMA-ES outperforms the CMA-ES. This demonstrates the benefits of the multiplicative update.

Acknowledgements

Oswin Krause acknowledges support from the Danish National Advanced Technology Foundation through the project "Personalized breast cancer screening".

xCMA-ES xCMA-ES pages 1769–1776, 2005. 5 5 10 10 [6] H.-G. Beyer. Convergence Analysis of Evolutionary Algorithms that are Based on the Paradigm of Information Geometry. Evolutionary Computation, 104 104 22(4):679–709, 2014. [7] T. Glasmachers, T. Schaul, Y. Sun, D. Wierstra, and 103 103 J. Schmidhuber. Exponential natural evolution 2 4 8 2 4 8 16 strategies. In Proceedings of the Genetic and (a) d = 16 (b) d = 32 Evolutionary Computation Conference (GECCO), 2010. Figure 2: Number of iterations (generations) necessary to [8] N. Hansen, S. D. Muller,¨ and P. Koumoutsakos. reach the target objective value of 10−12 over the number m Reducing the time complexity of the derandomized of constraints for the constrained sphere problems in d = 16 evolution strategy with covariance matrix adaptation and d = 32 dimensions. The value for CMA-ES at d = 32 (CMA-ES). Evolutionary Computation, 11(1):1–18, and m = 16 is missing because the algorithm failed in all 2003. 100 runs. [9] N. Hansen, S. P. N. Niederberger, L. Guzzella, and P. Koumoutsakos. A Method for Handling Uncertainty in Evolutionary Optimization with an Application to 102 f(x)-CMA-ES Feedback Control of Combustion. IEEE Transactions σ-CMA-ES on Evolutionary Computation, 13(1):180–197, 2009. f(x)-xCMA-ES σ-xCMA-ES [10] N. Hansen and A. Ostermeier. Completely 10−2 derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001. [11] G. A. Jastrebski and D. V. Arnold. Improving evolution strategies through active covariance matrix 10−6 adaptation. In Evolutionary Computation, 2006. CEC 2006. IEEE Congress on, pages 2814–2821. IEEE, 2006. −10 10 [12] S. Kern, S. D. Muller,¨ N. Hansen, D. Buche,¨ J. Ocenasek, and P. Koumoutsakos. Learning

probability distributions in continuous evolutionary 10−14 algorithms–a comparative review. Natural Computing, 0 50.000 100. 000 150. 000 200. 000 3(1):77–112, 2004. Iterations [13] O. Krause and C. Igel. A More Efficient Rank-one Covariance Matrix Update for Evolution Strategies. In Figure 3: Plot of a single run of the CMA-ES and xCMA-ES J. He, T. Jansen, G. Ochoa, and C. Zarges, editors, algorithms on the constrained Sphere function with d = 32 Foundations of Genetic Algorithms (FOGA). ACM and m = 4. Plotted are function value and step length over Press, 2015. accepted for publication. generations. [14] Y. Ollivier, L. Arnold, A. Auger, and N. Hansen. Information-geometric optimization algorithms: a unifying picture via invariance principles. arXiv 5. REFERENCES preprint arXiv:1106.3708v3, 2014. [1] Y. Akimoto, Y. Nagata, I. Ono, and S. Kobayashi. [15] Y. Sun, D. Wierstra, T. Schaul, and J. Schmidhuber. Bidirectional Relation between CMA Evolution Stochastic Search using the Natural Gradient. In Strategies and Natural Evolution Strategies. In International Conference on Machine Learning Parallel Problem Solving from Nature (PPSN), 2010. (ICML), 2009. [2] D. V. Arnold and N. Hansen. Active covariance [16] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, matrix adaptation for the (1+1)-CMA-ES. In J. Peters, and J. Schmidhuber. Natural evolution Proceedings of the 12th annual conference on Genetic strategies. Journal of Machine Learning Research, and Evolutionary Computation (GECCO), pages 15:949–980, 2014. 385–392. ACM, 2010. [17] D. Wierstra, T. Schaul, J. Peters, and [3] D. V. Arnold and N. Hansen. A (1+1)-CMA-ES for J. Schmidhuber. Natural Evolution Strategies. In constrained optimisation. In Proceedings of the Proceedings of the Congress on Evolutionary Genetic and Evolutionary Computation Conference Computation (CEC). IEEE Press, 2008. (GECCO 2012), pages 297–304. ACM, 2012. [4] A. Auger. Convergence results for the (1, λ)-SA-ES using the theory of ϕ-irreducible Markov chains. Theoretical Computer Science, 334(1–3):35–69, 2005. [5] A. Auger and N. Hansen. A restart CMA evolution