ACCEPTED FOR: IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION

Simplify Your Covariance Matrix Adaptation Evolution Strategy

Hans-Georg Beyer and Bernhard Sendhoff, Senior Member, IEEE

Abstract—The standard Covariance Matrix Adaptation Evolution Strategy (CMA-ES) comprises two evolution paths, one for the learning of the mutation strength and one for the rank-1 update of the covariance matrix. In this paper it is shown that one can approximately transform this algorithm in such a manner that one of the evolution paths and the covariance matrix itself disappear. That is, the covariance update and the covariance matrix operations are no longer needed in this novel so-called Matrix Adaptation (MA) ES. The MA-ES performs nearly as well as the original CMA-ES. This is shown by empirical investigations considering the evolution dynamics and the empirical expected runtime on a set of standard test functions. Furthermore, it is shown that the MA-ES can be used as a search engine in a Bi-Population (BiPop) ES. The resulting BiPop-MA-ES is benchmarked using the BBOB COCO framework and compared with the performance of the CMA-ES-v3.61 production code. It is shown that this new BiPop-MA-ES – while algorithmically simpler – performs nearly equally well as the CMA-ES-v3.61 code.

Index Terms—Matrix Adaptation Evolution Strategies, Black Box Optimization Benchmarking.

H.-G. Beyer is with the Research Center Process and Product Engineering at the Vorarlberg University of Applied Sciences, Dornbirn, Austria, Email: [email protected]
B. Sendhoff is with the Honda Research Institute Europe GmbH, Offenbach/Main, Germany, Email: [email protected]

I. INTRODUCTION

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) has received considerable attention as an algorithm for unconstrained real-parameter optimization, i.e. solving the problem

  ŷ = argopt_{y ∈ R^N} f(y),   (1)

since its publication in its fully developed form including the rank-µ update in [1]. While there are proposals regarding modifications of the original CMA-ES such as the (1+1)-CMA-ES [2] and its latest extension in [19], or the active CMA-ES [3], or the CMSA-ES [4], the design presented in [1] has only slightly changed over the years and state-of-the-art presentations such as [5] still rely on that basic design. Taking the advent of the so-called Natural Evolution Strategies (NES) into account, e.g. [6], one sees that even a seemingly principled design paradigm such as the information gain constrained evolution on statistical manifolds did not cause a substantial change in the CMA-ES design. Actually, in order to get NES competitive, those designers had to borrow from and rely on the ideas/principles empirically found by the CMA-ES designers. Yet, it is somehow remarkable that the basic framework of the CMA-ES has not undergone a deeper scrutiny to call the CMA-ES's peculiarities, such as the covariance matrix adaptation, the evolution path and the cumulative step-size adaptation (CSA), into question. Notably, in [4] a first attempt has been made to replace the CSA mutation strength control by the classical mutative self-adaptation. While getting rid of any kind of evolution path statistics and thus yielding a much simpler strategy¹, the resulting rank-µ update strategy, the so-called Covariance Matrix Self-Adaptation ES (CMSA-ES), does not fully reach the performance of the original CMA-ES in the case of small population sizes. Furthermore, some of the reported performance advantages in [4] were due to a wrongly implemented stalling of the covariance matrix update in the CMA-ES implementation used.

¹By simplicity we mean a reduced complexity of code and especially a decreased number of strategy specific parameters to be fixed by the algorithm designer. Furthermore, the remaining strategy specific parameters have been determined rather by first principles than by empirical parameter tuning studies.

Meanwhile, our theoretical understanding of the evolution dynamics taking place in CMSA- and CMA-ES has advanced. Basically, two reasons for the superior CMA-ES performance can be identified in the case of small population sizes:

• Concentrate evolution along a predicted direction. Using the evolution path information for the covariance matrix update (this rank-1 update was originally used in the first CMA-ES version [7] as the only update) can be regarded as some kind of time series prediction of the evolution of the parent in the R^N search space. That is, the evolution path carries the information in which direction the search steps were most successful. This directional information may be regarded as the most promising one and can be used in the CMA-ES to shrink the evolution of the covariance matrix.² As a result, optimization in landscape topologies with a predominant search direction (such as the Cigar test function or the Rosenbrock function) can benefit from the path cumulation information.

²Shrinking is a well-known technique for covariance matrix estimation in order to control the undesired growth of the covariance matrix [8]. Considering the covariance matrix update in the CMA-ES from the perspective of shrinking has been done first by Meyer-Nieberg and Kropat [9].

• Increased mutation strength. As has been shown by analyzing the dynamics of the CSA on the ellipsoid model in [10], the path-length based mutation strength control yields larger steady state mutation strengths up to a factor of µ (parental population size) compared to ordinary self-adaptive mutation strength control (in the case of non-noisy objective functions). Due to the larger mutation strengths realized, the CSA-based approach can better take advantage of the genetic

repair effect provided by the intermediate (or weighted) recombination. The performance advantage vanishes, however, when the population size is chosen sufficiently large. However, considering the CMA-ES as a general purpose algorithm, the strategy should cope with both small and large population sizes. Therefore, it seems that both the rank-1 and the rank-µ update are needed. This still calls into question whether it is necessary to have two evolution paths and an update of the covariance matrix at all.

In this paper we will show that one can drop one of the evolution paths and remove the covariance matrix totally, thus removing a "p" and the "C" from the CMA-ES without significantly worsening the performance of the resulting "Matrix Adaptation" (MA)-ES. As a byproduct of the analysis, which is necessary to derive the MA-ES from the CMA-ES, one can draw a clearer picture of the CMA-ES, which is often regarded as a rather complicated algorithm.

This work is organized as follows. First, the CMA-ES algorithm is presented in a manner that allows for a simpler analysis of the pseudocode lines. As the next step, the p-evolution path (realized by line C13, Fig. 1, see Section II for details) will be shown to be similar to that of the s-evolution path (realized by line C12, Fig. 1). Both paths can be approximately transformed (asymptotically exact for search space dimensionality N → ∞) into each other by a linear transformation where the transformation matrix is just the matrix square root M of the covariance matrix C. After removing the p-path from the CMA-ES, the update of the C matrix will be replaced by transforming the covariance learning into a direct learning of the M matrix. Thus, one no longer needs the covariance matrix and operations such as Cholesky decomposition or spectral decomposition to calculate the matrix square root. This simplifies the resulting ES considerably, both from the viewpoint of the algorithmic "complexity" and the numerical algebra operations needed. Furthermore, the resulting M-update allows for a new interpretation of the CMA-ES working principles. However, since the derivations rely on assumptions that are only asymptotically exact for search space dimensionality N → ∞, numerical experiments are provided to show that the novel MA-ES performs nearly equally well as the original CMA-ES. Additionally, the CMA-ES and MA-ES will be used as "search engines" in the BiPop-ES framework [11]. The resulting BiPop-MA-ES will be compared with the original BiPop-CMA-ES and the CMA-ES v3.61 production code using the BBOB COCO test environment [12]. It will be shown that the MA-ES performs well in the BiPop-ES framework. The paper concludes with a summary section.

(µ/µ_w, λ)-CMA-ES

  Initialize y^(0), σ^(0), g := 0, p^(0) := 0, s^(0) := 0, C^(0) := I          (C1)
  Repeat                                                                       (C2)
    M^(g) := √(C^(g))                                                          (C3)
    For l := 1 To λ                                                            (C4)
      z̃_l^(g) := N_l(0, I)                                                    (C5)
      d̃_l^(g) := M^(g) z̃_l^(g)                                               (C6)
      ỹ_l^(g) := y^(g) + σ^(g) d̃_l^(g)                                       (C7)
      f̃_l^(g) := f(ỹ_l^(g))                                                  (C8)
    End                                                                        (C9)
    SortOffspringPopulation                                                    (C10)
    y^(g+1) := y^(g) + σ^(g) ⟨d̃^(g)⟩_w                                        (C11)
    s^(g+1) := (1 − c_s) s^(g) + √(µ_eff c_s (2 − c_s)) ⟨z̃^(g)⟩_w            (C12)
    p^(g+1) := (1 − c_p) p^(g) + √(µ_eff c_p (2 − c_p)) ⟨d̃^(g)⟩_w            (C13)
    C^(g+1) := (1 − c_1 − c_w) C^(g) + c_1 p^(g+1) (p^(g+1))^T
               + c_w ⟨d̃^(g) (d̃^(g))^T⟩_w                                     (C14)
    σ^(g+1) := σ^(g) exp[(c_s/d_σ)(‖s^(g+1)‖ / E[‖N(0, I)‖] − 1)]              (C15)
    g := g + 1                                                                 (C16)
  Until(termination condition(s) fulfilled)                                    (C17)

Fig. 1. Pseudocode of the CMA-ES with rank > 1 weighted covariance matrix update.

II. THE STANDARD CMA-ES

Although not immediately obvious, the (µ/µ_w, λ)-CMA-ES displayed in Fig. 1 represents basically the one introduced in [1] and discussed in its current form in [5]. While the exposition in [5] contains additional tweaks regarding strategy specific parameters, its essence has been condensed into the pseudocode of Fig. 1. We will discuss the basic operations performed in Fig. 1 and show that those lines are mathematically equivalent to the original CMA-ES [1]. In order to have a clear notion of generational changes, a generation counter (g) is used even though it is not necessarily needed for the functioning in real CMA-ES implementations.

A CMA-ES generation cycle (also referred to as a generation) is performed within the repeat-until-loop (C2–C17). A number of λ offspring (labeled by a tilde on top of the symbols and indexed by the subscript l) is generated within the for-loop (C4–C9) where the parental state y^(g) ∈ R^N is mutated according to a normal distribution yielding offspring ỹ_l^(g) ∼ N(y^(g), (σ^(g))² C^(g)) with mean y^(g) and covariance (σ^(g))² C^(g). This is technically done by generating at first an isotropically iid normally distributed N-dimensional vector z̃_l^(g) in line (C5). This vector is transformed by M^(g) into a (direction) vector d̃_l^(g) where the transformation matrix M^(g) must obey the condition

  M^(g) (M^(g))^T = C^(g).   (2)

Therefore, M^(g) may be regarded as a "square root" of the matrix C^(g), explaining the meaning of line (C3).
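To make the line-by-line discussion of Fig. 1 concrete, the generation loop can be sketched in NumPy. This is a minimal illustrative transcription, not the production code: the strategy constants follow the standard settings quoted later in the paper (Eqs. (5), (21)–(24), (31), (32), (35)), line (C3) is realized by the symmetric (spectral) square root, and the usual series approximation of E[‖N(0, I)‖] is assumed.

```python
import numpy as np

def cma_es(f, y0, sigma0, max_gens=300, seed=0):
    """Illustrative sketch of the Fig. 1 pseudocode (lines C1-C17)."""
    rng = np.random.default_rng(seed)
    N = len(y0)
    lam = 4 + int(3 * np.log(N))                               # Eq. (21)
    mu = lam // 2
    w = np.log((lam + 1) / 2) - np.log(np.arange(1, mu + 1))   # Eq. (5)
    w /= w.sum()
    mu_eff = 1.0 / np.sum(w ** 2)                              # Eq. (22)
    c_s = (mu_eff + 2) / (mu_eff + N + 5)                      # Eq. (24)
    c_p = (mu_eff / N + 4) / (2 * mu_eff / N + N + 4)          # Eq. (23)
    c_1 = 2.0 / ((N + 1.3) ** 2 + mu_eff)                      # Eq. (31), alpha_cov = 2
    c_w = min(1 - c_1, 2 * (mu_eff + 1 / mu_eff - 2) / ((N + 2) ** 2 + mu_eff))  # Eq. (32)
    d_sigma = 1 + c_s + 2 * max(0.0, np.sqrt((mu_eff - 1) / (N + 1)) - 1)        # Eq. (35)
    chi_N = np.sqrt(N) * (1 - 1 / (4 * N) + 1 / (21 * N ** 2))  # approx. of E[||N(0,I)||]

    y, sigma = np.array(y0, float), float(sigma0)              # (C1)
    p, s, C = np.zeros(N), np.zeros(N), np.eye(N)
    for g in range(max_gens):                                  # (C2)
        eigval, eigvec = np.linalg.eigh(C)                     # (C3): symmetric sqrt of C
        M = eigvec @ np.diag(np.sqrt(np.maximum(eigval, 0.0))) @ eigvec.T
        z = rng.standard_normal((lam, N))                      # (C5)
        d = z @ M.T                                            # (C6): row l is M z_l
        fvals = np.array([f(y + sigma * d_l) for d_l in d])    # (C7), (C8)
        sel = np.argsort(fvals)[:mu]                           # (C10): mu best offspring
        z_w, d_w = w @ z[sel], w @ d[sel]                      # weighted recombinations
        y = y + sigma * d_w                                    # (C11)
        s = (1 - c_s) * s + np.sqrt(mu_eff * c_s * (2 - c_s)) * z_w   # (C12)
        p = (1 - c_p) * p + np.sqrt(mu_eff * c_p * (2 - c_p)) * d_w   # (C13)
        rank_mu = sum(wm * np.outer(dm, dm) for wm, dm in zip(w, d[sel]))
        C = (1 - c_1 - c_w) * C + c_1 * np.outer(p, p) + c_w * rank_mu  # (C14)
        sigma *= np.exp(c_s / d_sigma * (np.linalg.norm(s) / chi_N - 1))  # (C15)
    return y, f(y)

y, fy = cma_es(lambda v: np.sum(v ** 2), y0=np.ones(5), sigma0=1.0)
print(fy)  # Sphere value after 300 generations; should be far below the 1e-10 target
```

Note that the spectral decomposition in line (C3) is exactly the kind of numerical linear algebra that the MA-ES derived below makes unnecessary.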

Technically, M^(g) = √(C^(g)) is calculated in the CMA-ES using Cholesky or spectral value decomposition. After having generated the search direction d̃_l^(g), it is scaled with the mutation strength σ^(g)³, thus forming the mutation vector. This mutation is added to the parental state y^(g) in line (C7) yielding the offspring individual's ỹ_l^(g). The fitness f̃_l^(g) of the offspring is finally calculated in line (C8). After having generated all λ offspring, the population is sorted (ranked) according to the individual fitnesses in line (C10).

The process of ranking is not explicitly displayed; however, it is implicitly covered by the "m; λ" index notation used (see below) in the calculation of the angular bracket notations ⟨· · ·⟩ in lines (C11–C14). Here the m refers to the mth best individual (w.r.t. the objective function values f̃_l, l = 1, ..., λ) in the population of λ offspring. The angular brackets indicate the process of (weighted) intermediate recombination. Given λ offspring objects x̃_l^(g), the recombination is defined as

  ⟨x̃^(g)⟩_w := Σ_{m=1}^µ w_m x̃_{m;λ}^(g),   (3)

where x̃_{m;λ} refers to an object (i.e., d̃, z̃, and d̃d̃^T, respectively) belonging to the mth best individual (w.r.t. fitness) of the current offspring population comprising λ individuals. Originally, the choice of the weights w_m used in (3) was

  w_m := 1/µ, for 1 ≤ m ≤ µ;  w_m := 0, otherwise,   (4)

also known as intermediate multi-recombination [13]. However, later on and also provided in [5], the weight scheme

  w_m := (ln((λ+1)/2) − ln m) / Σ_{k=1}^µ (ln((λ+1)/2) − ln k), for 1 ≤ m ≤ µ;  w_m := 0, otherwise   (5)

has been proposed. This proposal is based on the heuristic argument that the best individuals, i.e., those with a small m, should have a stronger influence on the weighted sum (3). Note, both weight schemes obey the condition

  Σ_{m=1}^µ w_m = 1.   (6)

Calculation of the parent of the new generation (g+1) is done by weighted recombination of the y-vectors of the µ best offspring individuals according to y^(g+1) := ⟨ỹ^(g)⟩_w. Using (C6), (C7), and (3), this can alternatively be expressed as

  y^(g+1) = y^(g) + σ^(g) ⟨d̃^(g)⟩_w   (7)

and is used in line (C11).

The classical CSA path cumulation is done in line (C12). It acts on the isotropically generated z vectors. This line seems to deviate from the original one provided in [1] and [5]. In the latter publication one finds after adopting the symbols used⁴

  s^(g+1) = (1 − c_s) s^(g) + √(µ_eff c_s (2 − c_s)) (C^(g))^(−1/2) (y^(g+1) − y^(g)) / σ^(g).   (8)

However, resolving (7) for ⟨d̃^(g)⟩_w, one immediately sees that this is exactly the fraction term in (8). Now, consider the matrix vector product (C^(g))^(−1/2) ⟨d̃^(g)⟩_w taking (4), (C6), (3), and (C3) into account

  (C^(g))^(−1/2) (y^(g+1) − y^(g)) / σ^(g) = (C^(g))^(−1/2) ⟨d̃^(g)⟩_w
    = Σ_{m=1}^µ w_m (C^(g))^(−1/2) d̃_{m;λ}^(g)
    = Σ_{m=1}^µ w_m (C^(g))^(−1/2) M^(g) z̃_{m;λ}^(g)
    = Σ_{m=1}^µ w_m (C^(g))^(−1/2) (C^(g))^(1/2) z̃_{m;λ}^(g)
    = Σ_{m=1}^µ w_m z̃_{m;λ}^(g)
    = ⟨z̃^(g)⟩_w ;   (9)

one sees that (8) of the original CMA-ES is equivalent to line (C12). The advantage of line (C12) is, however, that there is absolutely no need to perform a back transformation using the inverse of M^(g).

Lines (C12) and (C13) also contain strategy specific constants to be discussed below. Here, we proceed with the second path cumulation used to provide the rank-1 update for the C-matrix in line (C14). Again, line (C13) deviates from the original CMA-ES in [5] where one finds after adopting the symbols used⁵

  p^(g+1) = (1 − c_p) p^(g) + h_σ √(µ_eff c_p (2 − c_p)) (y^(g+1) − y^(g)) / (c_m σ^(g)).   (10)

As has been shown already above, the fraction term in (10) is exactly ⟨d̃^(g)⟩_w. Thus, we have recovered line (C13).

The C-matrix update takes place in line (C14). The update equation deviates from those given in [5]. The latter reads after adopting the symbols⁶

  C^(g+1) = (1 − c_1 + (1 − h_σ²) c_1 c_p (2 − c_p)) C^(g) + c_1 p^(g+1) (p^(g+1))^T
            + c_w Σ_{m=1}^µ w_m [ ((ỹ_{m;λ}^(g) − y^(g))/σ^(g)) ((ỹ_{m;λ}^(g) − y^(g))/σ^(g))^T − C^(g) ].   (11)

Also this equation is equivalent to the corresponding line (C14). First note that due to (6) the weighted sum of C^(g) in the third line of (11) yields C^(g) and the corresponding c_w can be put into the first parentheses of the rhs of the first line of (11). Now, consider line (C7) and resolve it for d̃_{m;λ}^(g); one obtains

  d̃_{m;λ}^(g) = (ỹ_{m;λ}^(g) − y^(g)) / σ^(g).   (12)

³Note, σ is often referred to as the "step-size" of the mutation; however, this is not really true and actually misleading: it is just a scaling factor that allows for a faster adaptation of the length of the mutation.
⁴Note, we also assumed c_m = 1, recommended as standard choice in [5].
⁵We assume c_m = 1, recommended as standard choice in [5], and h_σ = 1.
⁶Note, we also assumed h_σ = 1.
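The chain of equalities in (9) is easy to check numerically. The following sketch (assuming NumPy; the ranking step is irrelevant for the algebraic identity and is therefore skipped) builds the weights of Eq. (5), a random covariance matrix C with symmetric square root M, and verifies that C^(−1/2) ⟨d̃⟩_w equals ⟨z̃⟩_w.

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam = 10, 16
mu = lam // 2
w = np.log((lam + 1) / 2) - np.log(np.arange(1, mu + 1))  # weights per Eq. (5)
w /= w.sum()

# Random SPD covariance C and its symmetric square root M, Eq. (2).
A = rng.standard_normal((N, N))
C = A @ A.T + N * np.eye(N)
eigval, eigvec = np.linalg.eigh(C)
M = eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T
C_inv_sqrt = eigvec @ np.diag(1 / np.sqrt(eigval)) @ eigvec.T

# Treat the first mu of the offspring as the selected ones.
z = rng.standard_normal((mu, N))      # z-vectors, line (C5)
d = z @ M.T                           # d = M z, line (C6)
z_w = w @ z                           # <z>_w
d_w = w @ d                           # <d>_w

print(np.allclose(w.sum(), 1.0))              # Eq. (6)
print(np.allclose(C_inv_sqrt @ d_w, z_w))     # Eq. (9): no back transform needed
```

This is precisely why line (C12) can work on the z̃ vectors directly: the expensive computation of C^(−1/2) in Eq. (8) cancels out.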

Thus, one gets for (11)

  C^(g+1) = (1 − c_1 − c_w) C^(g) + c_1 p^(g+1) (p^(g+1))^T
            + c_w Σ_{m=1}^µ w_m d̃_{m;λ}^(g) (d̃_{m;λ}^(g))^T   (13)

and taking (3) into account one obtains

  C^(g+1) = (1 − c_1 − c_w) C^(g) + c_1 p^(g+1) (p^(g+1))^T + c_w ⟨d̃^(g) (d̃^(g))^T⟩_w.   (14)

This agrees with line (C14). The remaining line (C15) of the CMA-ES pseudocode in Fig. 1 concerns the mutation strength update. It has been directly adopted from [5]. In that line, the actual length of the updated s evolution path vector is compared with the expected length of a random vector with standard normally distributed components. The latter is the expected value E[χ] of the χ-distribution with N degrees of freedom. If the actual length is larger than E[χ], the whole argument in the e-function is positive and the mutation strength is increased. Conversely, it is decreased. The control rule (C15) is meanwhile well understood [14]. A slightly modified rule

  σ^(g+1) = σ^(g) exp[ (1/(2D)) ( ‖s^(g+1)‖²/N − 1 ) ],   (15)

that works without the E[χ] value, proposed by Arnold [14], has been analyzed in [10].

Comparing the pseudocode in Fig. 1 with those presented in [1] or [5], we can conclude that the formulation of the CMA-ES can be significantly simplified: One only has to calculate the weighted recombinations of d̃, z̃, and d̃d̃^T according to

  ⟨z̃^(g)⟩_w := Σ_{m=1}^µ w_m z̃_{m;λ}^(g),   (16)

  ⟨d̃^(g)⟩_w := Σ_{m=1}^µ w_m d̃_{m;λ}^(g),   (17)

  ⟨d̃^(g) (d̃^(g))^T⟩_w := Σ_{m=1}^µ w_m d̃_{m;λ}^(g) (d̃_{m;λ}^(g))^T.   (18)

Furthermore, the calculation of the inverse of the square root of C as needed in the original CMA-ES, Eq. (8), is no longer needed. While the pseudocode in Fig. 1 is equivalent to the standard CMA-ES, it avoids unnecessarily complex derivations, resulting in a clearer exposition of the basic ingredients that seem to be necessary to ensure the functioning of the CMA-ES. However, as will be shown in the next section, this pseudocode can even be transformed into another one using some simple and mild assumptions that allow for removing parts of the evolution path cumulation and the calculation of the C-matrix at all.

III. REMOVING THE p AND THE C FROM THE CMA-ES

The CMA-ES pseudocode in Fig. 1 will be further transformed by first multiplying (C12) with M^(g) and taking (C6) into account

  M^(g) s^(g+1) = (1 − c_s) M^(g) s^(g) + √(µ_eff c_s (2 − c_s)) M^(g) ⟨z̃^(g)⟩_w
                = (1 − c_s) M^(g) s^(g) + √(µ_eff c_s (2 − c_s)) ⟨d̃^(g)⟩_w.   (19)

Now, compare with (C13). Provided that c_p = c_s, one immediately sees that

  c_p = c_s ∧ M^(g) s^(g) = p^(g)  ⇒  M^(g) s^(g+1) = p^(g+1).   (20)

Provided that M^(g+1) ≃ M^(g) asymptotically holds for N → ∞, (C13) can be dropped. Under that assumption (C13) is asymptotically a linear transformation of (C12). This is ensured by (24) as long as c_s → 0 for N → ∞. Yet, the question remains whether the assumptions c_p = c_s and M^(g+1) ≃ M^(g) are admissible in practice (N < ∞). In order to check how strongly c_p deviates from c_s, the quotient c_p/c_s versus the search space dimensionality N is displayed in Fig. 2. The calculation is based on the standard parameter settings provided in [5]⁷

  λ = 4 + ⌊3 ln N⌋,  µ = ⌊λ/2⌋,   (21)

  µ_eff = 1 / Σ_{m=1}^µ w_m²,   (22)

  c_p = (µ_eff/N + 4) / (2 µ_eff/N + N + 4),   (23)

  c_s = (µ_eff + 2) / (µ_eff + N + 5).   (24)

[Fig. 2. The c_p/c_s ratio depending on the search space dimension N assuming the strategy parameter choice given by Eqs. (22)–(24), using the standard population sizing rule (21).]

As one can see, the c_p/c_s ratio is only a slightly decreasing function of N that does not deviate too much from 1. Therefore, one would not expect a pronounced influence on the performance of the CMA-ES.

⁷In [5], c_p has been labeled as c_c and c_s as c_σ. We use the indices p and s to refer to the corresponding path vectors in (C12) and (C13).

Similarly, a replacement of M^(g) by M^(g+1) on the rhs of (20) should only result in small deviations. This will be confirmed in Section IV.

As a next step, line (C14) will be investigated. Applying Eq. (2) for g and g+1, respectively, and using (13) and (C6) one gets

  M^(g+1) (M^(g+1))^T
    = (1 − c_1 − c_w) M^(g) (M^(g))^T
      + c_1 (M^(g) s^(g+1)) (M^(g) s^(g+1))^T
      + c_w Σ_{m=1}^µ w_m (M^(g) z̃_{m;λ}^(g)) (M^(g) z̃_{m;λ}^(g))^T
    = (1 − c_1 − c_w) M^(g) (M^(g))^T
      + c_1 M^(g) s^(g+1) (s^(g+1))^T (M^(g))^T
      + c_w M^(g) ⟨z̃^(g) (z̃^(g))^T⟩_w (M^(g))^T
    = M^(g) (M^(g))^T
      + c_1 [ M^(g) s^(g+1) (s^(g+1))^T (M^(g))^T − M^(g) (M^(g))^T ]
      + c_w [ M^(g) ⟨z̃^(g) (z̃^(g))^T⟩_w (M^(g))^T − M^(g) (M^(g))^T ].   (25)

Pulling out the M^(g) on the rhs of (25) one finally obtains

  M^(g+1) (M^(g+1))^T
    = M^(g) [ I + c_1 ( s^(g+1) (s^(g+1))^T − I ) + c_w ( ⟨z̃^(g) (z̃^(g))^T⟩_w − I ) ] (M^(g))^T.   (26)

This is a very remarkable result that paves the way for getting rid of the covariance matrix update. Instead of considering C one can directly evolve the M matrix if one finds an approach that connects M^(g+1) with M^(g). Technically, one has to perform some kind of "square root" operation on (26). Noting that Eq. (26) is of the form

  A A^T = M [I + B] M^T,   (27)

where B is a symmetric matrix, this suggests to expand A into a matrix power series

  A = M Σ_{k=0}^∞ γ_k B^k.   (28)

As can be easily shown by direct calculation,

  A = M [ I + (1/2) B − (1/8) B² + ... ]   (29)

does satisfy (27) up to the power of two in B. Provided that ‖B‖ is sufficiently small, one even can break off after the linear B-term yielding

  M^(g+1) = M^(g) [ I + (c_1/2) ( s^(g+1) (s^(g+1))^T − I )
                      + (c_w/2) ( ⟨z̃^(g) (z̃^(g))^T⟩_w − I ) + ... ].   (30)

Using Eq. (30) as an approximation for M^(g+1) works especially well if c_1 and c_w are sufficiently small. Actually, using the suggested formulae in [5]⁸

  c_1 = α_cov / ((N + 1.3)² + µ_eff)   (31)

and

  c_w = min( 1 − c_1, α_cov (µ_eff + 1/µ_eff − 2) / ((N + 2)² + α_cov µ_eff/2) ),   (32)

it becomes clear that, given the population sizing (21), it holds

  c_1 = O(1/N²)  and  c_w = O(ln N / N²).   (33)

That is, at least for sufficiently large search space dimensionalities N, Eq. (30) should yield a similar behavior as the original C-matrix update (C14).

Given the new update (30) we can outline the new simplified CMA-ES without covariance matrix and p-path. The new ES may be called Matrix Adaptation ES (MA-ES) and is depicted in Fig. 3. The calculation of ⟨z̃^(g) (z̃^(g))^T⟩_w in (M11) is done analogously to (3)

  ⟨z̃^(g) (z̃^(g))^T⟩_w := Σ_{m=1}^µ w_m z̃_{m;λ}^(g) (z̃_{m;λ}^(g))^T.   (34)

For the damping constant d_σ in (M12) and (C15), the recommended expression in [5]

  d_σ = 1 + c_s + 2 max( 0, √((µ_eff − 1)/(N + 1)) − 1 )   (35)

is used in the empirical investigations of this paper.

Comparing the MA-ES algorithm in Fig. 3 with that of the original CMA-ES, Fig. 1, one sees that this pseudocode is not only shorter, but more importantly, it is less numerically demanding. Since there is no explicit covariance matrix update/evolution, there is also no matrix square root calculation. The numerical operations to be performed are only matrix-matrix and matrix-vector multiplications. It is to be mentioned here that there exists related work aiming at circumventing the covariance matrix operations. In [2] a (1+1)-CMA-ES was proposed that evolves the Cholesky-matrix. This approach has been extended to a multi-parent ES in [19]. Alternatively, considering the field of Natural Evolution Strategies (NES), the xNES [6] evolves a normalized transformation matrix. The latter approach shares some similarities with the MA-ES update in line (M11), most notably the multiplicative matrix update.

⁸According to [5], 0 < α_cov ≤ 2. Throughout this paper it is assumed that α_cov = 2.
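The quality of the truncated expansions (29) and (30) is easy to probe numerically. The sketch below (assuming NumPy) uses a small symmetric perturbation B = c(ss^T − I), mimicking the rank-one structure appearing in (30), and compares the linear and quadratic truncations against the exact right-hand side of (27):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 8
M = rng.standard_normal((N, N)) + 3 * np.eye(N)   # an arbitrary transformation matrix

# Small symmetric perturbation; c plays the role of c_1/2 resp. c_w/2 in Eq. (30).
s = rng.standard_normal(N)
c = 1e-4
B = c * (np.outer(s, s) - np.eye(N))

target = M @ (np.eye(N) + B) @ M.T                 # rhs of Eq. (27)

A1 = M @ (np.eye(N) + B / 2)                       # linear truncation, Eq. (30)
A2 = M @ (np.eye(N) + B / 2 - (B @ B) / 8)         # quadratic truncation, Eq. (29)

err1 = np.linalg.norm(A1 @ A1.T - target)
err2 = np.linalg.norm(A2 @ A2.T - target)
print(err1, err2)   # both tiny; the quadratic truncation is markedly more accurate
```

The linear truncation already reproduces (27) up to O(‖B‖²), which is why (M11) suffices given the O(1/N²) scaling of c_1 and c_w in (33).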

(µ/µ_w, λ)-MA-ES

  Initialize y^(0), σ^(0), g := 0, s^(0) := 0, M^(0) := I           (M1)
  Repeat                                                            (M2)
    For l := 1 To λ                                                 (M3)
      z̃_l^(g) := N_l(0, I)                                         (M4)
      d̃_l^(g) := M^(g) z̃_l^(g)                                    (M5)
      f̃_l^(g) := f(y^(g) + σ^(g) d̃_l^(g))                         (M6)
    End                                                             (M7)
    SortOffspringPopulation                                         (M8)
    y^(g+1) := y^(g) + σ^(g) ⟨d̃^(g)⟩_w                             (M9)
    s^(g+1) := (1 − c_s) s^(g) + √(µ_eff c_s (2 − c_s)) ⟨z̃^(g)⟩_w  (M10)
    M^(g+1) := M^(g) [ I + (c_1/2)(s^(g+1) (s^(g+1))^T − I)
                         + (c_w/2)(⟨z̃^(g) (z̃^(g))^T⟩_w − I) ]     (M11)
    σ^(g+1) := σ^(g) exp[(c_s/d_σ)(‖s^(g+1)‖ / E[‖N(0, I)‖] − 1)]   (M12)
    g := g + 1                                                      (M13)
  Until(termination condition(s) fulfilled)                         (M14)

Fig. 3. Pseudocode of the Matrix Adaptation ES (MA-ES).

Note that the M update (M11) reaches, in expectation, a stationary state as soon as

  E[ s^(g+1) (s^(g+1))^T ] = I  and  E[ ⟨z̃^(g) (z̃^(g))^T⟩_w ] = I   (36)

does hold. That is, provided that the M matrix has evolved in such a manner that the components of the selected z̃ vectors are (nearly) standard normally distributed, the evolution of M has reached a steady state. Of course, such a stationary behavior can only exist for static quadratic fitness landscapes (i.e., f-functions with a constant Hessian) and more general functions obtained from those quadratic functions by strictly increasing f-transformations (conserving ellipsoidal level sets).

TABLE I
TEST FUNCTIONS AND STOPPING CRITERION FOR THE EMPIRICAL PERFORMANCE EVALUATION. THE INITIAL CONDITIONS ARE y^(0) = (1, ..., 1)^T AND σ^(0) = 1.

  Name              | Function                                                  | f_target
  Sphere            | f_Sp(y)  := Σ_{i=1}^N y_i² = ‖y‖²                         | 10^−10
  Cigar             | f_Cig(y) := y_1² + 10⁶ Σ_{i=2}^N y_i²                     | 10^−10
  Tablet            | f_Tab(y) := 10⁶ y_1² + Σ_{i=2}^N y_i²                     | 10^−10
  Ellipsoid         | f_Ell(y) := Σ_{i=1}^N 10^(6(i−1)/(N−1)) y_i²              | 10^−10
  Parabolic Ridge   | f_pR(y)  := −y_1 + 100 Σ_{i=2}^N y_i²                     | −10^10
  Sharp Ridge       | f_sR(y)  := −y_1 + 100 √(Σ_{i=2}^N y_i²)                  | −10^10
  Different-Powers  | f_dP(y)  := Σ_{i=1}^N (y_i²)^(1+5(i−1)/(N−1))             | 10^−10
  Rosenbrock        | f_Ros(y) := Σ_{i=1}^{N−1} [100 (y_i² − y_{i+1})² + (y_i − 1)²] | 10^−10

IV. PERFORMANCE COMPARISON MA- VS. CMA-ES

The performance of the MA-ES has been extensively tested and compared with the CMA-ES. These investigations are intended to show that the MA-ES exhibits similar performance behaviors as the CMA-ES. To this end, not only standard population sizings (λ according to (21)) but also larger population sizes are considered. According to the derivation presented, the MA-ES should perform nearly equally well as the CMA-ES. We will investigate the performance on the set of test functions given in Tab. I. All runs have been initialized with a mutation strength σ^(0) = 1 and y^(0) = (1, ..., 1)^T. For the first tests, the standard strategy parameter choice given by Eqs. (21)–(24) together with the recombination weights (5) have been used. In addition to the test functions in Tab. I, the 24 test functions from the BBOB test environment [12] have been used to evaluate the MA-ES in the BiPop-setting in Section V-B.

1) Mean value dynamics: In Fig. 4 the dynamics of the CMA-ES and the MA-ES are compared on the test functions Sphere, Cigar, Tablet, Ellipsoid, ParabolicRidge, SharpRidge, Rosenbrock, and DifferentPowers for search space dimensionalities N = 3 and N = 30.
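The MA-ES loop of Fig. 3 is compact enough to be transcribed almost line by line. The following NumPy sketch (our own illustrative code, not the implementation used for the experiments; the helper function names and the series approximation of E[‖N(0, I)‖] are ours) runs the strategy on two of the Tab. I functions:

```python
import numpy as np

# Two of the Tab. I test functions (assumed shapes, reconstructed from the table).
def f_sphere(y):
    return float(np.sum(y ** 2))

def f_rosenbrock(y):
    return float(np.sum(100 * (y[:-1] ** 2 - y[1:]) ** 2 + (y[:-1] - 1) ** 2))

def ma_es(f, y0, sigma0, max_gens=300, seed=0):
    """Illustrative sketch of the MA-ES pseudocode of Fig. 3 (lines M1-M14)."""
    rng = np.random.default_rng(seed)
    N = len(y0)
    lam = 4 + int(3 * np.log(N))                               # Eq. (21)
    mu = lam // 2
    w = np.log((lam + 1) / 2) - np.log(np.arange(1, mu + 1))   # Eq. (5)
    w /= w.sum()
    mu_eff = 1.0 / np.sum(w ** 2)                              # Eq. (22)
    c_s = (mu_eff + 2) / (mu_eff + N + 5)                      # Eq. (24)
    c_1 = 2.0 / ((N + 1.3) ** 2 + mu_eff)                      # Eq. (31)
    c_w = min(1 - c_1, 2 * (mu_eff + 1 / mu_eff - 2) / ((N + 2) ** 2 + mu_eff))  # Eq. (32)
    d_sigma = 1 + c_s + 2 * max(0.0, np.sqrt((mu_eff - 1) / (N + 1)) - 1)        # Eq. (35)
    chi_N = np.sqrt(N) * (1 - 1 / (4 * N) + 1 / (21 * N ** 2))  # approx. of E[||N(0,I)||]

    y, sigma = np.array(y0, float), float(sigma0)              # (M1)
    s, M, I = np.zeros(N), np.eye(N), np.eye(N)
    for g in range(max_gens):                                  # (M2)
        z = rng.standard_normal((lam, N))                      # (M4)
        d = z @ M.T                                            # (M5)
        fvals = np.array([f(y + sigma * d_l) for d_l in d])    # (M6)
        sel = np.argsort(fvals)[:mu]                           # (M8)
        z_w, d_w = w @ z[sel], w @ d[sel]
        y = y + sigma * d_w                                    # (M9)
        s = (1 - c_s) * s + np.sqrt(mu_eff * c_s * (2 - c_s)) * z_w   # (M10)
        zzT_w = sum(wm * np.outer(zm, zm) for wm, zm in zip(w, z[sel]))
        M = M @ (I + c_1 / 2 * (np.outer(s, s) - I) + c_w / 2 * (zzT_w - I))  # (M11)
        sigma *= np.exp(c_s / d_sigma * (np.linalg.norm(s) / chi_N - 1))      # (M12)
    return y, f(y)

y, fy = ma_es(f_sphere, y0=np.ones(5), sigma0=1.0)
print(fy)  # Sphere value after 300 generations
```

In contrast to the CMA-ES sketch, no eigendecomposition or Cholesky factorization appears anywhere in the loop; only matrix-matrix and matrix-vector products are needed.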

The curves displayed are the result of an averaging over a number of independent single runs. Thus, they are an estimate of the mean value dynamics.

[Fig. 4. The σ- and f- or |f|-dynamics of CMA-ES and MA-ES on the test functions of Tab. I for N = 3 and N = 30. The f-dynamics of the CMA-ES are in black and those of the MA-ES are in red. The σ-dynamics of the CMA-ES are in cyan and those of the MA-ES are in magenta. The curves are the averages of 20 independent ES runs (except the SharpRidge N = 3 case where 200 runs have been averaged and the Rosenbrock N = 3 case where 100 runs have been averaged). Note that the MA-ES exhibits premature convergence on the SharpRidge while the CMA-ES shows this behavior for N = 30.]

As one can see in Fig. 4, using the standard strategy parameter setting both CMA-ES and MA-ES perform very similarly on the N = 30 test instances. Even in the case N = 3 this behavior is observed, except for the SharpRidge where the MA-ES exhibits premature convergence while the CMA-ES is able to follow the ridge exponentially fast. However, increasing N slightly, also the CMA-ES will fail on the SharpRidge. If one wants to improve the behavior on the SharpRidge (while keeping the basic CMA-ES algorithm), the population sizing rule (21) must be changed and much larger λ values must be used. As a general observation it is to be noted that the N = 3 case produces rather rugged single run dynamics. Regarding the Rosenbrock function, only those runs have been used for averaging that converged to the global minimum (there is a second, but local minimum that sometimes attracts the ES).

2) Expected runtime: In order to compare the N scaling behavior, expected runtime (ERT) experiments have been performed. Since a converging ES does not always reach a predefined objective function target value f_target, but may end up in a local optimizer, a reasonable ERT definition must take into account unsuccessful runs. Therefore, the formula

  ERT = (1/P_s − 1) E[r_u] + E[r_s]   (37)

derived in [15] has been used for the calculation of the ERT. For E[r_u] the empirical estimate of the runtime for unsuccessful runs has been determined. Likewise, the mean value of the runtime of the successful runs serves as an empirical estimate for E[r_s], and P_s is to be estimated as the

empirical success probability. Similarly, one can determine the variance of the run length; the formula reads⁹

  Var[r] = (1/P_s − 1) Var[r_u] + Var[r_s] + ((1 − P_s)/P_s²) E[r_u]².   (38)

The ERT experiments performed regard search space dimensionalities N = 4, 8, 16, 32, 64, and 128. These investigations exclude the SharpRidge since both CMA-ES and MA-ES exhibit premature convergence for moderately large N (for CMA-ES: N > 16 experiments exhibit premature convergence). The results of these experiments are displayed in Fig. 5. Instead of the expected number of generations, the number of function evaluations is displayed versus the search space dimensionality N. The number of function evaluations is simply the product of the expected runtime measured in generations needed (to reach a predefined f_target) times the offspring population size λ.

[Fig. 5. Expected runtime (in terms of # of function evaluations) and its standard deviation of the CMA-ES, displayed by black data points (and error bars), and of the MA-ES, displayed by red data points (and error bars). The population size λ is given by (21). Data points have been obtained for N = 4, 8, 16, 32, 64, and 128. As for the initial conditions and f_target, see Tab. I. For comparison purposes, dashed blue lines are displayed in order to represent linear (smaller slope) and quadratic (larger slope) runtime growth behavior. Note, regarding Rosenbrock especially for small N, the standard deviation of the runtime is rather large and only displayed as a horizontal bar above the respective data point.]

As one can see, both strategies exhibit nearly the same expected runtime performance. Only for the Rosenbrock function there are certain differences. These are mainly due to the effect of occasional convergence of the ES to the local optimizer. This is also the reason for the comparatively large standard deviations √Var[r] observed.

B. Considering λ = 4N Population Sizing

It is well known that the standard population sizing according to (21) is not well suited for multimodal and noisy optimization. First, the case λ = 4N, µ = λ/2 will be considered. In the next section, the λ = 4N² case will be investigated. Since relevant differences in the dynamics (compared to the dynamics of the standard population sizing rule (21) in Fig. 4) can only be observed for the Ridge (and to a certain extent for the Cigar), the graphs are presented in the supplementary material. Regarding the ParabolicRidge and the Cigar one observes a certain performance loss for the MA-ES. Having a closer look at the ERT graphs in Fig. 6 one sees that these differences appear more pronounced at larger search space dimensionalities on peculiar ellipsoid models like the Cigar and the "degenerated Cigar" – the Parabolic Ridge. In those test cases the CMA-ES seems to benefit from the second evolution path: There is a dominating major principal axis in these test functions (being constant in its direction in the search space) in which the search should be predominantly directed. It seems that such a direction can be somewhat faster identified using a second evolution path.

C. Considering λ = 4N² Population Sizing

The mean value dynamics of the CMA-ES and the MA-ES with population sizes proportional to the square of the search space dimensionality N are mainly presented in the supplementary material. The f-dynamics of CMA-ES and MA-ES do not deviate markedly and are similar to those of the other population sizings. Some of the experimental results regarding Sphere, Ellipsoid, and Rosenbrock are also displayed in Fig. 7. The curves are the result of an averaging over a number of independent single runs. As one can see, there is a certain performance loss for the MA-ES except for Rosenbrock at N = 30 where MA excels.

The mutation strength dynamics of the two ES on Rosenbrock are interesting in their own right. In the case of N = 30, σ increases even exponentially fast. Since both the CMA-ES and the MA-ES converge to the optimizer, shrinking the mutation step-size (note, this is not σ as often wrongly stated) is predominantly done by the covariance matrix or the M matrix (in the case of the MA-ES). This can be easily confirmed by considering the dynamics of the maximal and minimal eigenvalues of C (in the case of the CMA-ES) and MM^T (in the case of the MA-ES), respectively. Therefore, it is a general observation for the large population size λ = 4N² case that adaptation of the step-size is mainly done by the update of M and C, respectively, and not by σ. This even holds for the Sphere model as shown in Figure 8.

Next, let us consider the expected runtime (ERT) curves in the interval N = 4 to N = 128 in Fig. 9.¹⁰ Having a

¹⁰Unlike the λ scaling law (21), choosing λ = 4N² prevents the CMA-ES and the MA-ES from premature convergence on the Sharp Ridge.
That is why, 9Note, in [15] a wrong formula has been presented. This is the correct one. the ERT curves are displayed for this test function as well. BEYER AND SENDHOFF: SIMPLIFY YOUR CMA-ES 9

Cigar Tablet Sphere N =3 Sphere N = 30 2 6 10 10 100 ) ) 100 ) -2 )

g 10 g -2

( ( 10 -4 105 y 10 y -4 ( ( 10 105 f 10-6 f 10-6 -8 10 10-8

and -10 and 4 10 10-10 104 10 ) 10-12 ) -12 g g 10 ( ( -14 -14 σ σ # function evaluations # function evaluations 10 10 1 2 1 2 10 10 10 10 0 10 20 30 40 50 60 0 20 40 60 80 100 N N g g Ellipsoid ParabolicRidge Ellipsoid N =3 Ellipsoid N = 30 6 6 105 10 10 105 ) ) ) ) g g

0 ( ( 10 100 y 105 105 y ( ( f f -5 10 10-5 104 104 and and 10-10 -10

) 10 ) g g ( (

# function evaluations 3 # function evaluations 3 σ 10 10 σ 101 102 101 102 10-15 10-15 N N 0 10 20 30 40 50 60 70 80 90 0 20 40 60 80 100 120 g g Rosenbrock DifferentPowers 106 Rosenbrock N =3 Rosenbrock N = 30 102 1010 ) ) 0 6 10 ) 10 ) 5 g 5 g -2 10 ( 10 ( 10 y y -4 ( ( 10 5 0 f 10 f 10 10-6 104 -8 10 10-5 4 and 10 and 10-10 ) ) -10 -12 10 g g 10 ( 3 ( # function evaluations # function evaluations 10 3 -14 σ 10 σ 1 2 1 2 10 10 10 10 10 10-15 N N 0 20 40 60 80 100 0 50 100 150 200 250 300 350 g g Fig. 6. Expected runtime (in terms of # of function evaluations) and its standard deviation of the CMA-ES, displayed by black data points (and error Fig. 7. σ- and f-dynamics of CMA-ES and MA-ES with λ = 4N 2 on bars) and of the MA-ES, displayed by red data points (and error bars). The Sphere, Ellipsoid, and Rosenbrock for N = 3 (left) and N = 30 (right). The λ = 4N, µ = 2N case has been investigated. Data points have been obtained f-dynamics of the CMA-ES are in black and those of the MA-ES are in red. for N = 4, 8, 16, 32, 64, and 128. As for the initial conditions and ftarget, The σ-dynamics of the CMA-ES are in cyan and those of the MA-ES are in see Tab. I. For comparison purposes, dashed blue lines are displayed in order magenta. The curves are the averages of 20 independent ES runs (except the to represent linear (smaller slope) and quadratic (larger slope) runtime growth Rosenbrock N=3 case where 100 runs have been averaged). behavior. Note, regarding Rosenbrock for N = 4 the standard deviation of the runtime is rather large and only displayed as a horizontal bar above the 100 respective data point. 10-2 101 10-4

10-6 closer look at the ERT graphs, one sees that - apart from the 10-8 Parabolic and the Sharp Ridge - both the CMA-ES and the 10-10 10-12

MA-ES exhibit a larger slope than the dotted straight lines 10-14 2 100 N . That is, ERT as a function of N exhibits a super- min and max eigenvalues 0 20 40 60 80 100 0 20 40 60 80 100 ∝quadratic runtime. As for Rosenbrock (see Fig. 9) this might g g be partially attributed to the change of the local topology Fig. 8. Left figure: On the evolution of the minimal and the maximal eigenval- T that requires a permanent change of the covariance matrix ues of C (black curves) and MM (red curves) for a (1800/1800I , 3600)- during the evolution. Regarding the ellipsoid-like models such CMA-ES and (1800/1800I , 3600)-MA-ES, respectively, on the N = 30- as Sphere, Cigar, Tablet, and Ellipsoid this might indicate dimensional Sphere model. Right figure: Corresponding condition number dynamics. that CMA- and MA-ES do not work optimally with large population sizes. This suspicion is also formed by the non- decreasing behavior of the mutation strength σ in Fig. 7 for population sizes and/or different initial mutation strengths N = 30 indicating a possible failure of the σ-control rule. This [16]. An implementation of such a multi-start approach was raises the research question as to the reasons for this observed proposed in terms of the BiPop-CMA-ES [11]. Its pseudocode behavior and whether one can improve the scaling behavior. is presented in Fig. 10.11 It is the aim of this section to show that replacing the basic version of the CMA-ES given in V. PERFORMANCEOFTHE MA-ES AS BUILDING BLOCK Fig. 1 by the MA-ES, Fig. 3, does not significantly deteriorate IN A BIPOP-ES the performance of the BiPop-CMA-ES. The BiPop-CMA- A. The BiPop-Algorithm 11Since a pseudocode of the BiPop-CMA-ES [11] has not been published, In order to solve multi-modal optimization problems, the it has been derived here from the verbal description given in [11], [17] and CMA-ES should be used in a restart setting with increasing private communications with Ilya Loshchilov. 
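As an aside, the run-length statistics behind the ERT measurements (the ERT of Eq. (37) and its variance, Eq. (38)) can be estimated directly from a set of independent runs. A minimal sketch (the function name is ours; it assumes at least one successful and one unsuccessful run):

```python
def ert_stats(run_lengths, successes):
    """Empirical ERT (Eq. 37) and run-length variance (Eq. 38).

    run_lengths: # function evaluations of each independent run
    successes:   True if the run reached f_target, else False
    """
    rs = [r for r, ok in zip(run_lengths, successes) if ok]       # successful runs
    ru = [r for r, ok in zip(run_lengths, successes) if not ok]   # unsuccessful runs
    ps = len(rs) / len(run_lengths)            # empirical success probability Ps
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    var = lambda xs: mean([(x - mean(xs)) ** 2 for x in xs]) if xs else 0.0
    e_ru, v_ru = mean(ru), var(ru)
    e_rs, v_rs = mean(rs), var(rs)
    # Eq. (37): expected run length until the first success
    ert = (1 - ps) / ps * e_ru + e_rs
    # Eq. (38): variance of the run length r
    var_r = (1 - ps) / ps * v_ru + v_rs + (1 - ps) / ps ** 2 * e_ru ** 2
    return ert, var_r
```

With success probability 1/2 and constant run lengths (100 evaluations per failure, 50 per success), this yields ERT = 150 and Var[r] = 2·100² = 20000, as the geometric-restart model predicts.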
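The eigenvalue monitoring used above (Fig. 8) is cheap to reproduce. A sketch for the 2-D case, using the closed-form eigenvalues of the symmetric matrix C = MMᵀ (the function name is ours):

```python
def mmt_condition(M):
    """Condition number of C = M M^T for a 2x2 matrix M.

    Used to monitor how much 'step-size' adaptation happens inside
    the matrix M rather than in the mutation strength sigma.
    """
    # Entries of C = M M^T (symmetric, so only three are needed)
    a = M[0][0] * M[0][0] + M[0][1] * M[0][1]   # C[0][0]
    d = M[1][0] * M[1][0] + M[1][1] * M[1][1]   # C[1][1]
    b = M[0][0] * M[1][0] + M[0][1] * M[1][1]   # C[0][1] = C[1][0]
    # Closed-form eigenvalues of a symmetric 2x2 matrix
    mean = (a + d) / 2.0
    disc = (((a - d) / 2.0) ** 2 + b * b) ** 0.5
    lo, hi = mean - disc, mean + disc           # min and max eigenvalue of C
    return hi / lo
```

For M = diag(2, 1) the condition number of MMᵀ is 4; since C = MMᵀ is invariant under right-multiplication of M by an orthogonal matrix, any rotated variant of this M gives the same value.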

Fig. 9. Expected runtime and its standard deviation of the CMA-ES, displayed by black data points (and error bars), and of the MA-ES, displayed by red data points (and error bars). The λ = 4N², µ = 2N² case has been investigated. Data points have been obtained for N = 4, 8, 16, 32, 64, and 128. As for the initial conditions and ftarget, see Tab. I. For comparison purposes, dashed blue lines represent linear (smaller slope) and quadratic (larger slope) runtime growth behavior. [Panels: Sphere, SharpRidge, Cigar, Tablet, Rosenbrock, DifferentPowers.]

BiPop-(C)MA-ES for minimization:

    Initialize y_lower, y_upper, σ_init, λ_init, f_stop, b_stop, g_stop := ∞       (B1)
    y_init := u[y_lower, y_upper]                                                  (B2)
    (y_min, f_min, b_1) := (C)MA-ES(y_init, σ_init, λ_init, g_stop, TC)            (B3)
    If f_min ≤ f_stop Then Return(y_min, f_min)                                    (B4)
    n := 0; n_small := 0                                                           (B5)
    b_large := 0; b_small := 0                                                     (B6)
    Repeat                                                                         (B7)
        n := n + 1                                                                 (B8)
        λ := 2^(n − n_small) λ_init                                                (B9)
        y_init := u[y_lower, y_upper]                                              (B10)
        If n > 2 AND b_small < b_large Then                                        (B11)
            [lines (B12)–(B16) are not recoverable from this extraction: a
            randomly reduced population size λ_small and a randomly reduced
            initial mutation strength σ_sInit are drawn, a (C)MA-ES run is
            performed, and b_small is updated accordingly]
            n_small := n_small + 1                                                 (B17)
        Else                                                                       (B18)
            (y̌, f̌, b_L) := (C)MA-ES(y_init, σ_init, λ, g_stop, TC)               (B19)
            b_large := b_large + b_L                                               (B20)
        EndIf                                                                      (B21)
        If f̌ ≤ f_min Then                                                         (B22)
            y_min := y̌; f_min := f̌                                               (B23)
        EndIf                                                                      (B24)
    Until(b_small + b_large + b_1 ≥ b_stop OR f_min ≤ f_stop)                      (B25)
    Return(y_min, f_min)                                                           (B26)

Fig. 10. Pseudocode of the BiPop-CMA-ES and the BiPop-MA-ES.

The BiPop-(C)MA-ES uses the (C)MA-ESs¹² as "search engine" in lines B3, B15, and B19, respectively. In order to utilize the (C)MA-ESs in the BiPop-ES, a set of termination conditions TC has to be provided; these conditions influence the benchmark results. Since we want to keep the (C)MA-ESs conceptually generic, we use only four termination conditions in the empirical evaluations:
1) maximum number of generations, g ≥ g_stop;
2) parental f-value smaller than the predefined f_stop, i.e., f(y^(g)) ≤ f_stop;
3) stagnation of the parental f-values, i.e., f(y^(g)) ≈ f(y^(g−G));¹³
4) parental step length smaller than the predefined minimal value ∆_stop.

¹²(C)MA-ES is used as abbreviation to refer to both the classical CMA-ES and the MA-ES.
¹³In the benchmark experiments, g > 10 and G = 10 were used. The implementation of "≈" regards real numbers as approximately equal if these numbers differ only in the two least significant digits.

The BiPop-ES has two termination conditions in (B25). One regards the best f value f_min: if it drops below f_stop, the strategy stops. The other criterion terminates the BiPop-ES if the total number of function evaluations is greater than or equal to the predefined stop budget b_stop. The total function evaluation budget b_stop is the sum of the function evaluations consumed by the first (C)MA-ES run in (B3), being b_1, the cumulated function evaluations in (B15), being b_small, and those in (B19), being b_large.

Depending on the number of function evaluations already consumed by (B15) and (B19), the Then-branch (B12)–(B17) or the Else-branch (B19, B20) is performed. Within the Then-branch (B12)–(B17), the BiPop-ES runs an ES with a randomly derived small population size λ_small (compared to λ) and also a randomly decreased initial σ_sInit. Here the probabilistic choices are introduced by random numbers u uniformly distributed in the interval [0, 1]. As a result of the transformation in (B12), the probability density of σ_sInit is proportional to 1/σ_sInit. Similarly, one can show that λ_small obeys a probability density proportional to 1/(λ_small √(ln λ_small)). That is, the random choices of σ_sInit and λ_small are drawn with a strong tendency towards smaller values; together with (B10) (see below), this initiates basically local searches. The Else-branch (B19, B20) is performed with (exponentially) increasing population sizes λ, the size of which is determined in (B9), thus performing a rather global search.¹⁴ The criterion b_small < b_large in (B11) ensures that the total function evaluation budget is approximately evenly devoted to both the ESs with small population sizes and the ES with large population size. As a result, the Then-branch (small population sizes) is performed more often than the Else-branch (with large population size). Each (C)MA-ES run is started with a uniformly random initialized y in the box interval predefined by the vectors y_lower and y_upper.

¹⁴A deeper, scientifically satisfactory explanation of the details in (B9), (B12)–(B14) cannot be given here. Even the original publication [11] does not provide deeper insights, and it seems that this is rather an ad hoc choice that meets the needs without a derivation from first principles.

TABLE II
TEST FUNCTIONS OF THE BBOB TEST BED, DESCRIPTION TAKEN FROM [18]

Properties                | #   | Name, Description
Separable                 | f1  | Sphere
                          | f2  | Ellipsoid with monotone y-transformation, condition 10⁶
                          | f3  | Rastrigin with both y-transformations, condition 10
                          | f4  | Skew Rastrigin-Bueche (condition 10), skew-condition 10²
                          | f5  | Linear slope, neutral extension outside the domain (not flat)
Low or moderate condition | f6  | Attractive sector function
                          | f7  | Step-ellipsoid, condition 10²
                          | f8  | Rosenbrock, original
                          | f9  | Rosenbrock, rotated
High condition            | f10 | Ellipsoid with monotone y-transformation, condition 10⁶
                          | f11 | Discus with monotone y-transformation, condition 10⁶
                          | f12 | Bent cigar with asymmetric y-transformation, condition 10⁶
                          | f13 | Sharp ridge, slope 1:100, condition 10
                          | f14 | Sum of different powers
Multi-modal               | f15 | Rastrigin with both y-transformations, condition 10
                          | f16 | Weierstrass with monotone y-transformation, condition 10²
                          | f17 | Schaffer F7 with asymmetric y-transformation, condition 10
                          | f18 | Schaffer F7 with asymmetric y-transformation, condition 10³
                          | f19 | f8f2 composition of 2-D Griewank-Rosenbrock
Multi-modal with weak     | f20 | Schwefel x·sin(x) with tridiagonal transformation, condition 10
global structure          | f21 | Gallagher 101 Gaussian peaks, condition up to 10³
                          | f22 | Gallagher 21 Gaussian peaks, condition up to 10³
                          | f23 | Katsuuras repetitive rugged function
                          | f24 | Lunacek bi-Rastrigin, condition 10²

B. Performance Evaluation Using the COCO-Framework

The performance evaluation is realized by the COmparing Continuous Optimizers (COCO) software, version v15.03¹⁵ [12], using the test functions given in Tab. II. We especially present the empirical cumulative performance graphs that compare the overall performance of the CMA-ES and the MA-ES as search engine in the BiPop-ES framework. Additionally, we compare with Hansen's CMA-ES production code, version 3.61.beta or 3.62.beta, respectively (last change: April, 2012).¹⁶ This production code contains ad hoc solutions to improve the performance of the generic CMA-ES. It is expected that this should be visible in better empirical benchmark results compared to the CMA and MA versions. However, extensive tests have shown that the performance advantages in terms of better expected runtimes – while being statistically significant on some test functions – are rather small (see supplementary material).

¹⁵Software and documentation have been obtained from: http://coco.gforge.inria.fr.
¹⁶Source code downloaded from: https://www.lri.fr/∼hansen/count-cmaes-m.php?Down=cmaes.m. There is an ambiguity w.r.t. the version number in the code, where one can find both version numbers 3.61.beta and 3.62.beta in the same code.

The benchmark experiments were performed with a stop budget of b_stop = 1000N² function evaluations. The actual number of function evaluations until leaving the Repeat-Until-loop in Fig. 10 will exceed this value by a certain extent (for simplicity reasons, no additional budget-driven stopping criterion has been incorporated into the set of termination conditions TC within the (C)MA-ESs). The initial offspring population size is determined by (21). The initial mutation strength has been chosen as σ_init = 10/√N; the minimum parental step length was ∆_stop = 10⁻⁷. The initial parental y has been chosen uniformly at random in the box interval y_init ∈ [−5, 5]^N. The lag for stagnation detection in f-space was G = 10.

The benchmark experiments use the standard settings and testbed described in [12]. The goals of these experiments are
a) to show that exchanging the classical CMA-ES search engine, Fig. 1, by the MA-ES, Fig. 3, does not severely affect the overall performance of the BiPop-ES, and
b) to compare the performance with the BiPop-CMA-ES production code version 3.61 to get a feeling for how much can be gained by a sophisticatedly tuned CMA-ES.

In the standard BBOB setting, the performance of each algorithm is evaluated on 15 independent (randomly transformed and perturbed) instantiations of each test function. Based on the observed run lengths, ECDF (empirical cumulative distribution function) graphs are generated. These graphs show the percentages of f target values ftarget reached for a given function value budget b per search space dimensionality N (in the plots, D is used to indicate the dimensionality). The standard BBOB f-target values used are ftarget = min[f] + 10^k, k ∈ {−8, ..., 2}.

Aggregated performance profiles over the whole test function set f1 – f24 (Tab. II) are presented in Fig. 11 and in a more detailed manner in the supplementary material. The ECDF graphs are displayed for search space dimensionalities N = 2, 3, 5, 10, 20, and 40 (in the graphics labeled as -D). The horizontal axes display the logarithm of b/N (where b is labeled as "# f-evals" and N is the "dimension"). Simulation runs are performed up to a function budget b marked by the colored ×-crosses (values exceeding this b were obtained by a so-called bootstrap technique, see [12]). Additionally, the "utopia" ECDF profile of the best BBOB 2009 results¹⁷ is displayed. As one can see, up to the respective function budget of about b_stop = 1000N², all strategies perform very similarly.¹⁸ With respect to the ECDF plots there is not a clear "winner", and the CMAv3.61-ES production code does not consistently perform better than the much simpler generic (C)MA-ES implementations. This basically transfers also to the BBOB COCO ERT investigations (detailed results are to be found in the supplementary material; for a short discussion see below). In the following, we summarize the main observations regarding the ECDF results on the different function classes without presenting the graphs (they can be found in the supplementary material).

Fig. 11. Accumulated performance profiles of the BiPop-CMA(-ES), the BiPop-MA(-ES), and the production code CMAv3.61 on the set comprising all 24 test functions of Tab. II.

¹⁷This "utopia" profile represents the best performance results taken from all algorithms benchmarked in the 2009 GECCO competition.
¹⁸Since these graphs are aggregated performance plots, this does not exclude that the strategies perform differently on individual instances. However, a close scrutiny of the single performance plots did not reveal remarkable differences.

Regarding the ECDF performance on the problem class f1 – f5, one finds performance differences between BiPop-CMA and BiPop-MA mainly for dimensionality N = 3 and less pronounced for N = 5. A detailed view (not presented here) on the single function ECDFs reveals that in the N = 3 case the f4 function (Rastrigin-Bueche) is responsible for these differences. This function seems hard for all CMA-versions in the BBOB test if the dimension gets greater than N = 3.

As for the low or moderate condition test functions f6 – f9, virtually no remarkable differences between the performance of the BiPop-CMA-ES, the BiPop-MA-ES, and the CMAv3.61-ES production code have been found. This statement holds also for the subset of test functions with high condition, f10 – f14.

On the subset of multi-modal functions f15 – f19, BiPop-CMA-ES and BiPop-MA-ES perform nearly equally well. The CMAv3.61-ES performs a bit better than BiPop-CMA-ES and BiPop-MA-ES. This is due to a certain performance advantage of the CMAv3.61-ES on Schaffer's F7 functions f17 and f18, where it permanently reaches higher ECDF values (by about 10% in all dimensions). The reasons for this performance advantage remain unclear.

For the multi-modal test functions with weak global structure, f20 – f24, one can observe performance differences between BiPop-CMA-ES and BiPop-MA-ES with a performance advantage of the former. Having a look into the single function ECDF graphs (not shown here), however, does not reveal a clear picture. There are also cases where BiPop-CMA-ES and BiPop-MA-ES perform equally well. Even more astonishingly, regarding the ECDF the production code CMAv3.61 performs in most of the cases inferior to the BiPop-MA-ES and the BiPop-CMA-ES.

Summarizing the findings of these BBOB ECDF evaluations: w.r.t. the ECDF performance profiles, the newly derived MA-ES does not exhibit a severe performance degradation compared to the CMA-ES. Sometimes it even performs a bit better than the CMA-ES.

Alternatively, one can evaluate and compare the performance of the algorithms using the expected runtime (ERT) graphs (as for the definition of ERT, cp. Eq. (37)). The BBOB COCO performance analyses have been done for the precision target ftarget = min[f] + 10⁻⁸. One finds for BiPop-CMA-ES and BiPop-MA-ES similar performance behaviors.
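The ECDF construction just described reduces, per dimension, to counting solved (function, target) pairs within a given budget. A minimal sketch (the function name is ours; the BBOB bootstrap treatment of unsolved runs is omitted):

```python
def ecdf_profile(run_lengths, budgets):
    """Fraction of (function, target) pairs solved within each budget.

    run_lengths: # function evaluations needed per (function, target)
                 pair; None means the target was never reached.
    budgets:     list of function evaluation budgets b.
    """
    total = len(run_lengths)
    return [
        sum(1 for r in run_lengths if r is not None and r <= b) / total
        for b in budgets
    ]
```

For example, with pairs solved after 10, 100, and 1000 evaluations and one pair never solved, the profile at budgets 10, 100, and 10000 is [0.25, 0.5, 0.75]; plotting these fractions over log10(b/N) gives the ECDF curves of Fig. 11.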
There clear “winner” and the CMAv3.61-ES production code does is a certain performance advantage of the CMAv3.61-ES production code being statistically significant on some of the 17This “utopia” profile represents the best performance results taken from 24 test functions and search space dimensionalities (especially all algorithms benchmarked in the 2009 GECCO competition. for the Sphere, Discus, and Attractive Sector function). It 18Since these graphs are aggregated performance plots, this does not ex- seems that CMAv3.61-ES has been “tuned” for the Sphere clude that the strategies perform differently on individual instances. However, a close scrutiny of the single performance plots did not reveal markable and Sharp ride. Interestingly, BiPop-CMA-ES and BiPop- differences. MA-ES seems to excel on Gallagher for higher dimensions, BEYER AND SENDHOFF: SIMPLIFY YOUR CMA-ES 13 however, the Wilcoxon rank sum test implemented in BBOB operation. This might pave the way for a theoretical analysis did not report statistical significance. Note, there are some test of the dynamics of the ES system, yet a challenge for future functions where no ES version was able to get sufficiently research. close to the global optimizer: (Skew) Rastrigin, Schaffer F7, Griewank-Rosenbrock, Schwefel, Katsuuras and Lunacek bi- REFERENCES Rastrigin. However, except of Bent cigar and Sharp ridge [1] N. Hansen, S. M¨uller, and P. Koumoutsakos, “Reducing the Time all three BiPop-ES versions performed rather similarly. Thus, Complexity of the Derandomized Evolution Strategy with Covariance justifying the statement that one can switch to the MA-ES also Matrix Adaptation (CMA-ES),” Evolutionary Computation, vol. 11, no. 1, pp. 1–18, 2003. from viewpoint of ERT. [2] C. Igel, T. Suttorp, and N. Hansen, “A Computational Efficient Covari- ance Matrix Update and a (1 + 1)-CMA for Evolution Strategies,” in VI. 
SUMMARY AND CONCLUSIONS GECCO’06: Proceedings of the Genetic and Evolutionary Computation Conference. New York, NY: ACM Press, 2006, pp. 453–460. Based on a theoretical analysis of the CMA-ES, Fig. 1, it has [3] G. Jastrebski and D. Arnold, “Improving Evolution Strategies through been shown under the assumption of equal cumulation time Active Covariance Matrix Adaptation,” in Proceedings of the CEC’06 Conference. Piscataway, NJ: IEEE, 2006, pp. 2814–2821. constants 1/cp and 1/cs that one can remove the p-path (line [4] H.-G. Beyer and B. Sendhoff, “Covariance Matrix Adaptation Revisited C13). Additionally it has been shown that one can also bypass – the CMSA Evolution Strategy,” in Parallel Problem Solving from the evolution of the covariance matrix, thus removing the “C” Nature 10. Berlin: Springer, 2008, pp. 123–132. [5] N. Hansen and A. Auger, “Principled Design of Continuous Stochastic from the CMA-ES yielding the MA-ES, Fig. 3. The latter Search: From Theory to Practice,” in Theory and Principled Methods step is accompanied by an approximation assumption, which for the Design of , Y. Borenstein and A. Moraglio, Eds. is increasingly violated for increasing population sizes since Berlin: Springer, 2014, pp. 145–180. c 1 in (32) for N = const. < . In spite of that, this [6] T. Glasmachers, T. Schaul, Y. Sun, D. Wierstra, and J. Schmidhuber, w → ∞ “Exponential Natural Evolution Strategies,” in GECCO’10: Proceedings algorithmic change (line M11, Fig. 3) does not severely affect of the Genetic and Evolutionary Computation Conference. ACM, 2010, the performance on population sizes as large as λ = 4N 2 as pp. 393–400. [7] N. Hansen and A. Ostermeier, “Completely Derandomized Self- has been shown in Section IV-C. Moreover, using the MA- Adaptation in Evolution Strategies,” Evolutionary Computation, vol. 9, ES in the BiPop-ES as search engine does not significantly no. 2, pp. 159–195, 2001. deteriorate the BiPop-CMA-ES performance on the standard [8] O. Ledoit and M. 
Wolf, “Honey, I Shrunk the Sample Covariance Matrix,” The Journal of Portfolio Management, vol. 30, no. 4, pp. 110– BBOB COCO test bed. 119, 2004. The novel MA-ES does not evolve a covariance matrix. [9] S. Meyer-Nieberg and E. Kropat, “A New Look at the Covariance Thus, a matrix square root operation in terms of Cholesky Matrix Estimation in Evolution Strategies,” in Operations Research and Enterprise Systems. Springer, Berlin, 2014, pp. 157–172. or spectral value decomposition is not needed. Only matrix- [10] H.-G. Beyer and M. Hellwig, “The Dynamics of Cumulative Step-Size vector and matrix-matrix operations are to be performed nu- Adaptation on the Ellipsoid Model,” Evolutionary Computation, vol. 24, merically (for example, complex eigenvalues due to numerical no. 1, pp. 25–57, 2016. [11] N. Hansen, “Benchmarking a BI-population CMA-ES on the BBOB- imprecision can not appear). This allows potentially for a more 2009 function testbed,” in Workshop Proceedings of the GECCO Genetic stable working ES. Furthermore, taking advantage of GPUs to and Evolutionary Computation Conference. ACM, 2009, pp. 2389–2395. speed-up calculations should be easy to accomplish. [12] S. Finck, N. Hansen, R. Ros, and A. Auger, “COCO Documentation, Release 15.03,” LRI, Orsay, Tech. Rep., 2015. [Online]. Available: http:// Besides the increased algorithmic simplicity, the new matrix coco.gforge.inria.fr update rule, line M11 in Fig. 3 and Eq. (30), allows also [13] H.-G. Beyer, The Theory of Evolution Strategies, for a novel interpretation of the C-matrix adaptation via the Series. Heidelberg: Springer, 2001. T [14] D. Arnold and H.-G. Beyer, “Performance Analysis of Evolutionary M-matrix adaptation, because C = MM : The change and Optimization With Cumulative Step Length Adaptation,” IEEE Trans- adaptation of the M-matrix is driven by contributions of actions on Automatic Control, vol. 49, no. 4, pp. 617–622, 2004. a) the deviation of the selected isotropically generated zzT [15] A. 
Auger and N. Hansen, “Performance Evaluation of an Advanced Local Search Evolutionary Algorithm,” in Congress on Evolutionary matrix from the isotropy matrix (being the identity ma- Computation, CEC’05. IEEE, 2005, pp. 1777–1784. trix) and [16] I. Loshchilov, M. Schoenauer, and M. Sebag, “Alternative Restart b) the deviation of the evolution path matrix ssT from Strategies for CMA-ES,” in Parallel Problem Solving from Nature 12, 19 Part I. Berlin: Springer, 2012, pp. 296–305. isotropy (i.e. spherical symmetry) [17] I. Loshchilov, “CMA-ES with Restarts for Solving CEC 2013 Bench- Thus, both contributions measure the departure of the ES- mark Problem,” in Evolutionary Computation (CEC), 2013 IEEE Congress on. IEEE, 2013, pp. 369–376. system (comprising the fitness function and the transformation [18] N. Hansen, S. Finck, R. Ros, and A. Auger, “Real-Parameter Black- in M5) from the Sphere model as seen from the mutations Box Optimization Benchmarking 2010: Noiseless Functions Definitions generated in line M4. In other words, from viewpoint of the z- (compiled Nov. 17, 2015),” INRIA, Tech. Rep. RR-6829, 2015. [Online]. Available: http://hal.inria.fr/inria-00362633/en/ mutations in M4, the (C)MA-ES seeks to transform a function [19] O. Krause, D.R. Arbones, and C. Igel, “CMA-ES with optimal covari- f with general ellipsoidal f-level sets into a function with ance Update and Storage Complexity,” in Proc. Adv. Neural Inf. Process. spherical level sets. Syst. Barcelona, Spain, 2016, pp. 370–378. [20] O. Krause and T. Glasmachers, “CMA-ES with Multiplicative Covari- Finally, the novel MA-ES might also be more attractive for ance Matrix Updates,” in Proc. Genet. Evol. Comput. Conf. (GECCO). a theoretical analysis due to a simpler pseudocode compared Madrid, Spain, 2015, pp. 281–288. to the CMA-ES. There is only one evolution path and the M- matrix update bypasses the C evolution and its square root
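The two deviation terms a) and b) above can be made concrete in code. The sketch below assumes the multiplicative update form M ← M(I + (c1/2)(ssᵀ − I) + (cw/2)(Σ wi zi ziᵀ − I)) for line M11/Eq. (30); the coefficient placement follows our reading and should be checked against Fig. 3 of the paper:

```python
def ma_update(M, s, zs, ws, c1, cw):
    """One application of the assumed M-update: M <- M (I + D).

    D collects the deviation of the path outer product s s^T and of the
    weighted average of the selected z z^T from the identity; when both
    statistics are isotropic on average, M drifts nowhere.
    """
    n = len(M)
    # Weighted average of the selected outer products: sum_i w_i z_i z_i^T
    zz = [[sum(w * z[r] * z[c] for w, z in zip(ws, zs)) for c in range(n)]
          for r in range(n)]
    # Deviation matrix D = (c1/2)(s s^T - I) + (cw/2)(zz - I)
    D = [[c1 / 2 * (s[r] * s[c] - (r == c))
          + cw / 2 * (zz[r][c] - (r == c)) for c in range(n)]
         for r in range(n)]
    # Return M (I + D) = M + M D
    return [[M[r][c] + sum(M[r][k] * D[k][c] for k in range(n))
             for c in range(n)] for r in range(n)]
```

As a sanity check of the interpretation: with M = I, a zero evolution path, and two selected z-samples along the coordinate axes, both deviation terms pull M isotropically toward smaller scale, leaving it a multiple of the identity.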

¹⁹These deviations must be seen in the sense of an average deviation over time.

VII. SUPPLEMENTARY MATERIAL: SIMULATIONS

This appendix presents a collection of additional simulations. The underlying experiments are explained in the respective sections of the main text.

A. Evolution Dynamics for the λ = 4N and λ = 4N² Cases

Fig. 12. The σ- and f- or |f|-dynamics of CMA-ES and MA-ES on the test functions of Tab. I for N = 3 and N = 30. The population sizes λ = 12, µ = 6 for N = 3 and λ = 120, µ = 60 for N = 30 have been used. The f-dynamics of the CMA-ES are in black and those of the MA-ES are in red. The σ-dynamics of the CMA-ES are in cyan and those of the MA-ES are in magenta. The curves are the averages of 20 independent ES runs (except the SharpRidge N = 3 case where 200 runs have been averaged and the Rosenbrock N = 3 case where 100 runs have been averaged). Note that the MA-ES exhibits premature convergence on the SharpRidge while the CMA-ES shows this behavior for the larger N = 30. [Panels: Sphere, Cigar, Tablet, Ellipsoid, ParabolicRidge, SharpRidge, Rosenbrock, DifferentPowers, each for N = 3 and N = 30.]

[Fig. 13 panels: Sphere, Cigar, Tablet, Ellipsoid, ParabolicRidge, SharpRidge, Rosenbrock, and DifferentPowers, each for N = 3 and N = 30 — f(y(g))- or |f(y(g))|- and σ(g)-dynamics plotted over the generation number g.]

Fig. 13. The σ- and f- or |f|-dynamics of CMA-ES and MA-ES on the test functions of Tab. I for N = 3 and N = 30. The population sizes λ = 36, µ = 18 for N = 3 and λ = 3600, µ = 1800 for N = 30 have been used. The f-dynamics of the CMA-ES are in black and those of the MA-ES are in red. The σ-dynamics of the CMA-ES are in cyan and those of the MA-ES are in magenta. The curves are the averages of 20 independent ES runs (except the SharpRidge N = 3 case where 200 runs have been averaged and the Rosenbrock N = 3 case where 100 runs have been averaged). Note that the MA-ES exhibits premature convergence on the SharpRidge while the CMA-ES shows this behavior for the larger N = 30.

B. BBOB COCO Performance Evaluation: ECDF-Graphs

In Fig. 14, the aggregated performance profiles of the functions f1–f5 are displayed for the search space dimensionalities N = 2, 3, 5, 10, 20, and 40 (labeled as N-D in the graphics). The horizontal axes display the logarithm of b/N, where b is labeled as "# f-evals" and N as the "dimension". Simulation runs are performed up to a function budget b marked by the colored ×-crosses (values exceeding this b were obtained by a so-called bootstrap technique, see [12]). Additionally, the "utopia" ECDF profile of the best BBOB 2009 results is displayed (see footnote 20). As one can see, up to the respective function budget of about b_stop = 1000 N^2 all strategies perform very similarly (see footnote 21). Performance deviations in Fig. 14 are mainly observed for dimensionality N = 3, where BiPop-CMA and BiPop-MA differ. There is also a less pronounced difference for N = 5. A detailed view of the single-function ECDFs reveals that in the N = 3 case the f4 function (Rastrigin-Bueche) is responsible for the performance differences. This function seems hard for all CMA versions in the BBOB test if the dimension gets greater than N = 3.
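The construction of such an ECDF graph can be sketched as follows. This is an illustrative reconstruction, not the COCO implementation: for each (function, target) pair one records the number of f-evaluations a run needed to reach the target (infinity if the target was missed), and the curve reports the proportion of pairs solved within a given budget, plotted against log10 of the budget divided by the dimension.

```python
import math

def ecdf_profile(runtimes, budgets, dimension):
    """Empirical cumulative distribution of runtimes over
    (function, target) pairs, as in BBOB/COCO ECDF graphs.

    runtimes  : f-evaluation counts, one per (function, target)
                pair; math.inf means the target was never reached.
    budgets   : budgets b (in f-evaluations) at which to evaluate.
    dimension : search space dimension N used for the x-axis scaling.
    Returns a list of (log10(b / N), proportion solved) points.
    """
    n = len(runtimes)
    profile = []
    for b in budgets:
        solved = sum(1 for rt in runtimes if rt <= b)
        profile.append((math.log10(b / dimension), solved / n))
    return profile

# Toy example: 5 (function, target) pairs in dimension N = 3;
# the last pair was never solved.
rts = [30, 120, 450, 3000, math.inf]
pts = ecdf_profile(rts, budgets=[30, 300, 3000, 30000], dimension=3)
# pts -> [(1.0, 0.2), (2.0, 0.4), (3.0, 0.8), (4.0, 0.8)]
```

The flat tail between the last two budgets reflects the unsolved pair: the curve can only reach 1.0 if every (function, target) pair is eventually solved.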

[Fig. 14: Accumulated performance profiles (ECDF graphs, bbob f1–f5, 2-D through 40-D, 15 instances) of the BiPop-CMA(-ES), the BiPop-MA(-ES), and the production code CMAv3.61 on the test set of separable test functions.]

[Fig. 15: Accumulated performance profiles (ECDF graphs, bbob f6–f9, 2-D through 40-D, 15 instances) of the BiPop-CMA(-ES), the BiPop-MA(-ES), and the production code CMAv3.61 on the test set of low and moderate condition test functions.]
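The bootstrap technique mentioned above for runs exceeding the budget (see [12]) can be illustrated as follows. This is a simplified sketch of the idea, not the COCO code: a runtime is simulated by conceptually restarting the algorithm, i.e., drawing recorded runs uniformly at random and accumulating their evaluation counts until a successful run is drawn.

```python
import random

def simulated_restart_runtime(evals, successes, rng=None):
    """Simulated runtime with restarts (bootstrap idea, cf. COCO):
    draw recorded runs uniformly at random, summing their
    f-evaluation counts, until a successful run is drawn.

    evals     : f-evaluations consumed by each recorded run.
    successes : booleans, True if the run reached the target.
    """
    assert any(successes), "needs at least one successful run"
    rng = rng or random.Random()
    total = 0
    while True:
        i = rng.randrange(len(evals))
        total += evals[i]
        if successes[i]:
            return total

# A single successful run is returned unchanged; with unsuccessful
# runs present, their costs accumulate before a success terminates.
r1 = simulated_restart_runtime([100], [True])
r2 = simulated_restart_runtime([50, 50], [False, True])
```

Averaging many such simulated runtimes yields the values that extend an ECDF curve beyond the actually spent budget.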

Figure 15 provides the results for the low or moderate condition test functions. There are virtually no differences between the performance of the BiPop-CMA-ES, the BiPop-MA-ES, and the CMAv3.61-ES production code. This statement also holds for the subset of test functions with high condition, Fig. 16.

[Fig. 16: Accumulated performance profiles (ECDF graphs, bbob f10–f14, 2-D through 40-D, 15 instances) of the BiPop-CMA(-ES), the BiPop-MA(-ES), and the production code CMAv3.61 on the test set of high condition test functions.]

On the subset of multi-modal functions, Fig. 17, BiPop-CMA-ES and BiPop-MA-ES perform nearly equally well. The CMAv3.61-ES performs a bit better than BiPop-CMA-ES and BiPop-MA-ES. This is due to a certain performance advantage of the CMAv3.61-ES on Schaffer's F7 functions f17 and f18, on which it consistently reaches ECDF values that are higher by about 10% in all dimensions. The reasons for this performance advantage remain unclear.

20 This "utopia" profile represents the best performance results taken from all algorithms benchmarked in the 2009 GECCO competition.
21 Since these graphs are aggregated performance plots, this does not exclude that the strategies perform differently on individual instances. However, a close scrutiny of the single performance plots did not reveal notable differences.

[Fig. 17: Accumulated performance profiles (ECDF graphs, bbob f15–f19, 2-D through 40-D, 15 instances) of the BiPop-CMA(-ES), the BiPop-MA(-ES), and the production code CMAv3.61 on the test set of multi-modal test functions.]

For the multi-modal test functions with weak global structure, Fig. 18, one can observe performance differences between BiPop-CMA-ES and BiPop-MA-ES, with a performance advantage for the former. A look into the single-function ECDF graphs (not shown here), however, does not reveal a clear picture. There are also cases where BiPop-CMA-ES and BiPop-MA-ES perform equally well. Even more astonishingly, the production code CMAv3.61 performs significantly worse than the BiPop-MA-ES and the BiPop-CMA-ES in most of the cases.

[Fig. 18: Accumulated performance profiles (ECDF graphs, bbob f20–f24, 2-D through 40-D, 15 instances) of the BiPop-CMA(-ES), the BiPop-MA(-ES), and the production code CMAv3.61 on the test set of multi-modal test functions with weak global structure.]

C. BBOB COCO ERT-Graphs

Alternatively, one can evaluate and compare the performance of the algorithms using expected runtime (ERT) graphs (for the definition of the ERT, cf. Eq. (37)). Figure 19 shows the ERT(N) functions for the precision target f_target = min[f] + 10^-8. As one can see, on most of the test functions BiPop-CMA-ES and BiPop-MA-ES perform very similarly. As already observed in the ECDF graphs, the CMAv3.61-ES production code performs better on some of the functions, especially on the Sphere, the Sharp ridge, and the Bent cigar, and to a certain extent on Rastrigin. It seems that CMAv3.61-ES has been "optimized" for the Sphere and the Sharp ridge. Interestingly, BiPop-CMA-ES and BiPop-MA-ES excel on Gallagher for higher dimensions. Note that there are some test functions where no ES version was able to get sufficiently close to the global optimizer: (Skew) Rastrigin, Schaffer F7, Griewank-Rosenbrock, Schwefel, Katsuuras, and Lunacek bi-Rastrigin. However, except for the Bent cigar and the Sharp ridge, all three BiPop-ES versions performed rather similarly, justifying the statement that one can switch to the MA-ES also from the viewpoint of the ERT.
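The ERT measure underlying Fig. 19 (cf. Eq. (37)) admits a compact sketch; this is the commonly used estimator, shown here for illustration rather than as the paper's exact code: the total number of f-evaluations spent over all runs, successful or not, divided by the number of runs that reached the target.

```python
def expected_runtime(evals, successes):
    """Expected runtime (ERT) estimate from independent runs.

    evals     : f-evaluations consumed by each run (until the target
                was hit, or until the run was stopped).
    successes : booleans, True if the run reached the target.
    Returns the ERT, or float('inf') if no run succeeded.
    """
    n_succ = sum(successes)
    if n_succ == 0:
        return float('inf')
    return sum(evals) / n_succ

# Toy example: 4 runs, 3 of which reached the target.
# ERT = (100 + 150 + 200 + 500) / 3
ert = expected_runtime([100, 150, 200, 500], [True, True, True, False])
```

The cost of the failed run is charged to the successful ones, so a strategy that fails often is penalized even when its successful runs are fast; in the figures the ERT is additionally divided by the dimension N before taking log10.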

[Fig. 19 panels: log10(ERT/N) versus dimension N for the BBOB test functions f2 Ellipsoid separable, f3 Rastrigin separable, f4 Skew Rastrigin-Bueche separable, f5 Linear slope, f6 Attractive sector, f7 Step-ellipsoid, f8 Rosenbrock original, f9 Rosenbrock rotated, f10 Ellipsoid, f11 Discus, f12 Bent cigar, f13 Sharp ridge, f14 Sum of different powers, f15 Rastrigin, f16 Weierstrass, f17 Schaffer F7 (condition 10), f18 Schaffer F7 (condition 1000), f19 Griewank-Rosenbrock F8F2, f20 Schwefel x*sin(x), f21 Gallagher 101 peaks, f22 Gallagher 21 peaks, f23 Katsuuras; target Df: 1e-8, 15 instances per algorithm.]

Fig. 19. Vertical axes: log10 of the expected number of function evaluations (ERT) divided by the dimension N needed to get closer than 10^-8 to the optimum; horizontal axes: search space dimension N. Slanted grid lines indicate quadratic scaling with N. Symbols used: CMAv3.61-ES: ◦, BiPop-CMA-ES: ♦, BiPop-MA-ES: ⋆. Light symbols give the maximum number of function evaluations from the longest trial divided by N. Black stars indicate a statistically better result compared to all other algorithms with p < 0.01 and Bonferroni correction by the number of dimensions. [This caption has been adapted from the GECCO BBOB competition LaTeX template.]