ACCEPTED FOR: IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION
Simplify Your Covariance Matrix Adaptation Evolution Strategy
Hans-Georg Beyer and Bernhard Sendhoff, Senior Member, IEEE
Abstract—The standard Covariance Matrix Adaptation Evolution Strategy (CMA-ES) comprises two evolution paths, one for the learning of the mutation strength and one for the rank-1 update of the covariance matrix. In this paper it is shown that one can approximately transform this algorithm in such a manner that one of the evolution paths and the covariance matrix itself disappear. That is, the covariance update and the covariance matrix square root operations are no longer needed in this novel so-called Matrix Adaptation (MA) ES. The MA-ES performs nearly as well as the original CMA-ES. This is shown by empirical investigations considering the evolution dynamics and the empirical expected runtime on a set of standard test functions. Furthermore, it is shown that the MA-ES can be used as search engine in a Bi-Population (BiPop) ES. The resulting BiPop-MA-ES is benchmarked using the BBOB COCO framework and compared with the performance of the CMA-ES-v3.61 production code. It is shown that this new BiPop-MA-ES – while algorithmically simpler – performs nearly equally well as the CMA-ES-v3.61 code.

Index Terms—Matrix Adaptation Evolution Strategies, Black Box Optimization Benchmarking.

H.-G. Beyer is with the Research Center Process and Product Engineering at the Vorarlberg University of Applied Sciences, Dornbirn, Austria. Email: [email protected]
B. Sendhoff is with the Honda Research Institute Europe GmbH, Offenbach/Main, Germany. Email: [email protected]

I. INTRODUCTION

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) has received considerable attention as an algorithm for unconstrained real-parameter optimization, i.e., solving the problem

    ŷ = argopt_{y ∈ R^N} f(y),    (1)

since its publication in its fully developed form including rank-µ update in [1]. While there are proposals regarding modifications of the original CMA-ES, such as the (1+1)-CMA-ES [2] and its latest extension in [19], the active CMA-ES [3], or the CMSA-ES [4], the design presented in [1] has only slightly changed over the years, and state-of-the-art presentations such as [5] still rely on that basic design. Taking the advent of the so-called Natural Evolution Strategies (NES) into account, e.g. [6], one sees that even a seemingly principled design paradigm such as the information gain constrained evolution on statistical manifolds did not cause a substantial change in the CMA-ES design. Actually, in order to make NES competitive, their designers had to borrow from and rely on the ideas and principles empirically found by the CMA-ES designers. Yet, it is somewhat remarkable that the basic framework of the CMA-ES has not undergone a deeper scrutiny calling the algorithm's peculiarities, such as the covariance matrix adaptation, the evolution path, and the cumulative step-size adaptation (CSA), into question. Notably, in [4] a first attempt has been made to replace the CSA mutation strength control by the classical mutative self-adaptation. While getting rid of any kind of evolution path statistics and thus yielding a much simpler strategy (footnote 1), the resulting rank-µ update strategy, the so-called Covariance Matrix Self-Adaptation ES (CMSA-ES), does not fully reach the performance of the original CMA-ES in the case of small population sizes. Furthermore, some of the reported performance advantages in [4] were due to a wrongly implemented stalling of the covariance matrix update in the CMA-ES implementation used.

Footnote 1: By simplicity we mean a reduced complexity of code and especially a decreased number of strategy specific parameters to be fixed by the algorithm designer. Furthermore, the remaining strategy specific parameters have been determined rather by first principles than by empirical parameter tuning studies.

Meanwhile, our theoretical understanding of the evolution dynamics taking place in CMSA- and CMA-ES has advanced. Basically, two reasons for the superior CMA-ES performance can be identified in the case of small population sizes:

• Concentrate evolution along a predicted direction. Using the evolution path information for the covariance matrix update (this rank-1 update was originally used in the first CMA-ES version [7] as the only update) can be regarded as some kind of time series prediction of the evolution of the parent in the R^N search space. That is, the evolution path carries the information in which direction the search steps were most successful. This directional information may be regarded as the most promising one and can be used in the CMA-ES to shrink the evolution of the covariance matrix (footnote 2). As a result, optimization in landscape topologies with a predominant search direction (such as the Cigar test function or the Rosenbrock function) can benefit from the path cumulation information.

Footnote 2: Shrinking is a well-known technique for covariance matrix estimation in order to control the undesired growth of the covariance matrix [8]. Considering the covariance matrix update in the CMA-ES from the perspective of shrinking has been done first by Meyer-Nieberg and Kropat [9].

• Increased mutation strength. As has been shown by analyzing the dynamics of the CSA on the ellipsoid model in [10], the path-length based mutation strength control yields larger steady state mutation strengths, up to a factor of µ (parental population size), compared to ordinary self-adaptive mutation strength control (in the case of non-noisy objective functions). Due to the larger mutation strengths realized, the CSA-based approach can better take advantage of the genetic
repair effect provided by the intermediate (or weighted) recombination. The performance advantage vanishes, however, when the population size is chosen sufficiently large. However, considering the CMA-ES as a general purpose algorithm, the strategy should cope with both small and large population sizes. Therefore, it seems that both the rank-1 and the rank-µ update are needed. This still calls into question whether it is necessary to have two evolution paths and an update of the covariance matrix at all.

In this paper we will show that one can drop one of the evolution paths and remove the covariance matrix totally, thus removing a "p" and the "C" from the CMA-ES without significantly worsening the performance of the resulting "Matrix Adaptation" (MA)-ES. As a byproduct of the analysis, which is necessary to derive the MA-ES from the CMA-ES, one can draw a clearer picture of the CMA-ES, which is often regarded as a rather complicated evolutionary algorithm.

This work is organized as follows. First, the CMA-ES algorithm is presented in a manner that allows for a simpler analysis of the pseudocode lines. As the next step, the p-evolution path (realized by line C13, Fig. 1, see Section II for details) will be shown to be similar to the s-evolution path (realized by line C12, Fig. 1). Both paths can be approximately transformed (asymptotically exactly for search space dimensionality N → ∞) into each other by a linear transformation where the transformation matrix is just the matrix square root M of the covariance matrix C. After removing the p-path from the CMA-ES, the update of the C matrix will be replaced by transforming the covariance learning into a direct learning of the M matrix. Thus, one no longer needs the covariance matrix or operations such as Cholesky decomposition or spectral decomposition to calculate the matrix square root. This simplifies the resulting ES considerably, both from the viewpoint of the algorithmic "complexity" and of the numerical algebra operations needed. Furthermore, the resulting M-update allows for a new interpretation of the CMA-ES working principles. However, since the derivations rely on assumptions that are only asymptotically exact for search space dimensionality N → ∞, numerical experiments are provided to show that the novel MA-ES performs nearly equally well as the original CMA-ES. Additionally, the CMA-ES and MA-ES will be used as "search engines" in the BiPop-ES framework [11]. The resulting BiPop-MA-ES will be compared with the original BiPop-CMA-ES and the CMA-ES v3.61 production code using the BBOB COCO test environment [12]. It will be shown that the MA-ES performs well in the BiPop-ES framework. The paper concludes with a summary section.

II. THE STANDARD CMA-ES

Although not immediately obvious, the (µ/µ_w, λ)-CMA-ES displayed in Fig. 1 represents basically the one introduced in [1] and discussed in its current form in [5]. While the exposition in [5] contains additional tweaks regarding strategy specific parameters, its essence has been condensed into the pseudocode of Fig. 1.

(µ/µ_w, λ)-CMA-ES
  Initialize y^(0), σ^(0), g := 0, p^(0) := 0, s^(0) := 0, C^(0) := I        (C1)
  Repeat                                                                     (C2)
    M^(g) := sqrt(C^(g))                                                     (C3)
    For l := 1 To λ                                                          (C4)
      z̃_l^(g) := N_l(0, I)                                                   (C5)
      d̃_l^(g) := M^(g) z̃_l^(g)                                               (C6)
      ỹ_l^(g) := y^(g) + σ^(g) d̃_l^(g)                                        (C7)
      f̃_l^(g) := f(ỹ_l^(g))                                                  (C8)
    End                                                                      (C9)
    SortOffspringPopulation                                                  (C10)
    y^(g+1) := y^(g) + σ^(g) <d̃^(g)>_w                                       (C11)
    s^(g+1) := (1 - c_s) s^(g) + sqrt(µ_eff c_s (2 - c_s)) <z̃^(g)>_w         (C12)
    p^(g+1) := (1 - c_p) p^(g) + sqrt(µ_eff c_p (2 - c_p)) <d̃^(g)>_w         (C13)
    C^(g+1) := (1 - c_1 - c_w) C^(g) + c_1 p^(g+1) (p^(g+1))^T
               + c_w <d̃^(g) (d̃^(g))^T>_w                                     (C14)
    σ^(g+1) := σ^(g) exp[(c_s/d_σ)(||s^(g+1)|| / E[||N(0, I)||] - 1)]        (C15)
    g := g + 1                                                               (C16)
  Until(termination condition(s) fulfilled)                                  (C17)

Fig. 1. Pseudocode of the CMA-ES with rank > 1 weighted covariance matrix update.

We will discuss the basic operations performed in Fig. 1 and show that those lines are mathematically equivalent to the original CMA-ES [1]. In order to have a clear notion of generational changes, a generation counter (g) is used even though it is not necessarily needed for the functioning of real CMA-ES implementations.

A CMA-ES generation cycle (also referred to as a generation) is performed within the repeat-until loop (C2–C17). A number of λ offspring (labeled by a tilde on top of the symbols and indexed by the subscript l) is generated within the for-loop (C4–C9), where the parental state y^(g) ∈ R^N is mutated according to a normal distribution, yielding offspring ỹ_l^(g) ~ N(y^(g), (σ^(g))² C^(g)) with mean y^(g) and covariance (σ^(g))² C^(g). This is technically done by first generating an isotropically iid normally distributed N-dimensional vector z̃_l^(g) in line (C5). This vector is transformed by M^(g) into a (direction) vector d̃_l^(g), where the transformation matrix M^(g) must obey the condition

    M^(g) (M^(g))^T = C^(g).    (2)

Therefore, M^(g) may be regarded as a "square root" of the matrix C^(g), explaining the meaning of line (C3). Technically,
M^(g) = sqrt(C^(g)) is calculated in the CMA-ES using Cholesky or spectral value decomposition. After having generated the search direction d̃_l^(g), it is scaled with the mutation strength σ^(g) (footnote 3), thus forming the mutation vector. This mutation is added to the parental state y^(g) in line (C7), yielding the offspring individual's ỹ_l^(g). The fitness f̃_l^(g) of the offspring is finally calculated in line (C8). After having generated all λ offspring, the population is sorted (ranked) according to the individual fitnesses in line (C10).

Footnote 3: Note, σ is often referred to as the "step-size" of the mutation; however, this is not really true and actually misleading, it is just a scaling factor that allows for a faster adaptation of the length of the mutation.

The process of ranking is not explicitly displayed; however, it is implicitly covered by the "m;λ" index notation used (see below) in the calculation of the angular bracket notations <···> in lines (C11–C14). Here the m refers to the mth best individual (w.r.t. the objective function values f̃_l, l = 1, ..., λ) in the population of λ offspring. The angular brackets indicate the process of (weighted) intermediate recombination. Given λ offspring objects x̃_l^(g), the recombination is defined as

    <x̃^(g)>_w := Σ_{m=1}^µ w_m x̃_{m;λ}^(g),    (3)

where x̃_{m;λ} refers to an object (i.e., d̃, z̃, and d̃d̃^T, respectively) belonging to the mth best individual (w.r.t. fitness) of the current offspring population comprising λ individuals. Originally, the choice of the weights w_m used in (3) was

    w_m := 1/µ for 1 ≤ m ≤ µ, and 0 otherwise,    (4)

also known as intermediate multi-recombination [13]. However, later on and also provided in [5], the weight scheme

    w_m := (ln((λ+1)/2) - ln m) / Σ_{k=1}^µ (ln((λ+1)/2) - ln k) for 1 ≤ m ≤ µ, and 0 otherwise    (5)

has been proposed. This proposal is based on the heuristic argument that the best individuals, i.e., those with a small m, should have a stronger influence on the weighted sum (3). Note, both weight schemes obey the condition

    Σ_{m=1}^µ w_m = 1.    (6)

Calculation of the parent of the new generation (g+1) is done by weighted recombination of the y-vectors of the µ best offspring individuals according to y^(g+1) := <ỹ^(g)>_w. Using (C6), (C7), and (3), this can alternatively be expressed as

    y^(g+1) = y^(g) + σ^(g) <d̃^(g)>_w    (7)

and is used in line (C11).

The classical CSA path cumulation is done in line (C12). It acts on the isotropically generated z vectors. This line seems to deviate from the original one provided in [1] and [5]. In the latter publication one finds, after adopting the symbols used (footnote 4),

    s^(g+1) = (1 - c_s) s^(g) + sqrt(µ_eff c_s (2 - c_s)) (C^(g))^(-1/2) (y^(g+1) - y^(g)) / σ^(g).    (8)

Footnote 4: Note, we also assumed c_m = 1, recommended as standard choice in [5].

However, resolving (7) for <d̃^(g)>_w, one immediately sees that this is exactly the fraction term in (8). Now, considering the matrix-vector product (C^(g))^(-1/2) <d̃^(g)>_w and taking (4), (C6), (3), and (C3) into account,

    (C^(g))^(-1/2) (y^(g+1) - y^(g)) / σ^(g) = (C^(g))^(-1/2) <d̃^(g)>_w
      = Σ_{m=1}^µ w_m (C^(g))^(-1/2) d̃_{m;λ}^(g)
      = Σ_{m=1}^µ w_m (C^(g))^(-1/2) M^(g) z̃_{m;λ}^(g)
      = Σ_{m=1}^µ w_m (C^(g))^(-1/2) (C^(g))^(1/2) z̃_{m;λ}^(g)
      = Σ_{m=1}^µ w_m z̃_{m;λ}^(g)
      = <z̃^(g)>_w,    (9)

one sees that (8) of the original CMA-ES is equivalent to line (C12). The advantage of line (C12) is, however, that there is absolutely no need to perform a back transformation using the inverse of M^(g).

Lines (C12) and (C13) also contain strategy specific constants to be discussed below. Here, we proceed with the second path cumulation used to provide the rank-1 update for the C-matrix in line (C14). Again, line (C13) deviates from the original CMA-ES in [5] where one finds, after adopting the symbols used (footnote 5),

    p^(g+1) = (1 - c_p) p^(g) + h_σ sqrt(µ_eff c_p (2 - c_p)) (y^(g+1) - y^(g)) / (c_m σ^(g)).    (10)

Footnote 5: We assume c_m = 1, recommended as standard choice in [5], and h_σ = 1.

As has been shown already above, the fraction term in (10) is exactly <d̃^(g)>_w. Thus, we have recovered line (C13).

The C-matrix update takes place in line (C14). The update equation deviates from those given in [5]. The latter reads, after adopting the symbols (footnote 6),

    C^(g+1) = [1 - c_1 + (1 - h_σ²) c_1 c_p (2 - c_p)] C^(g) + c_1 p^(g+1) (p^(g+1))^T
              + c_w Σ_{m=1}^µ w_m [ ((ỹ_{m;λ}^(g) - y^(g))/σ^(g)) ((ỹ_{m;λ}^(g) - y^(g))/σ^(g))^T - C^(g) ].    (11)

Footnote 6: Note, we also assumed h_σ = 1.

Also this equation is equivalent to the corresponding line (C14). First note that, due to (6), the weighted sum over C^(g) in the last term of (11) yields C^(g), and the corresponding c_w can be put into the first brackets of the rhs of (11). Now, consider line (C7) and resolve for d̃_{m;λ}^(g); one obtains

    d̃_{m;λ}^(g) = (ỹ_{m;λ}^(g) - y^(g)) / σ^(g).    (12)
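To make the generation cycle (C1)–(C17) concrete, the following is a minimal NumPy sketch of Fig. 1. It is illustrative only, not the production code: the function name `cma_es` and its interface are ours, the strategy constants follow the standard recommendations in [5], and the square root (C3) is computed via spectral decomposition.

```python
import numpy as np

def cma_es(f, y, sigma, max_gen=1000, f_target=1e-10):
    """Illustrative sketch of the (mu/mu_w, lambda)-CMA-ES of Fig. 1."""
    N = len(y)
    lam = 4 + int(3 * np.log(N))                       # population sizing per [5]
    mu = lam // 2
    w = np.log((lam + 1) / 2) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                       # weights, Eqs. (5), (6)
    mu_eff = 1.0 / np.sum(w**2)                        # variance effective selection mass
    c_s = (mu_eff + 2) / (mu_eff + N + 5)
    c_p = (mu_eff / N + 4) / (2 * mu_eff / N + N + 4)
    c_1 = 2 / ((N + 1.3)**2 + mu_eff)                  # alpha_cov = 2
    c_w = min(1 - c_1, 2 * (mu_eff + 1 / mu_eff - 2) / ((N + 2)**2 + mu_eff))
    d_sigma = 1 + c_s + 2 * max(0.0, np.sqrt((mu_eff - 1) / (N + 1)) - 1)
    e_chi = np.sqrt(N) * (1 - 1 / (4 * N) + 1 / (21 * N**2))  # approx. E[||N(0,I)||]
    p, s, C = np.zeros(N), np.zeros(N), np.eye(N)      # (C1)
    for _ in range(max_gen):                           # (C2)
        eigval, eigvec = np.linalg.eigh(C)             # (C3): M = sqrt(C) via
        M = eigvec @ np.diag(np.sqrt(np.maximum(eigval, 0.0))) @ eigvec.T  # spectral decomp.
        Z = np.random.randn(lam, N)                    # (C5)
        D = Z @ M.T                                    # (C6): row l is M z_l
        F = np.array([f(y + sigma * d) for d in D])    # (C7), (C8)
        idx = np.argsort(F)[:mu]                       # (C10): ranks m = 1..mu
        z_w, d_w = w @ Z[idx], w @ D[idx]              # weighted recombination, Eq. (3)
        y = y + sigma * d_w                            # (C11)
        s = (1 - c_s) * s + np.sqrt(mu_eff * c_s * (2 - c_s)) * z_w  # (C12)
        p = (1 - c_p) * p + np.sqrt(mu_eff * c_p * (2 - c_p)) * d_w  # (C13)
        dd_w = np.einsum('m,mi,mj->ij', w, D[idx], D[idx])           # <d d^T>_w
        C = (1 - c_1 - c_w) * C + c_1 * np.outer(p, p) + c_w * dd_w  # (C14)
        sigma *= np.exp(c_s / d_sigma * (np.linalg.norm(s) / e_chi - 1))  # (C15)
        if F[idx[0]] < f_target:
            break
    return y, F[idx[0]]
```

For example, `cma_es(lambda x: np.sum(x**2), np.ones(5), 1.0)` minimizes the 5-dimensional Sphere. The eigendecomposition in (C3) is the numerically demanding step that the transformation in Section III eliminates.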
Thus, one gets for (11)

    C^(g+1) = (1 - c_1 - c_w) C^(g) + c_1 p^(g+1) (p^(g+1))^T + c_w Σ_{m=1}^µ w_m d̃_{m;λ}^(g) (d̃_{m;λ}^(g))^T    (13)

and, taking (3) into account, one obtains

    C^(g+1) = (1 - c_1 - c_w) C^(g) + c_1 p^(g+1) (p^(g+1))^T + c_w <d̃^(g) (d̃^(g))^T>_w.    (14)

This agrees with line (C14). The remaining line (C15) of the CMA-ES pseudocode in Fig. 1 concerns the mutation strength update. It has been directly adopted from [5]. In that line, the actual length of the updated s evolution path vector is compared with the expected length of a random vector with standard normally distributed components. The latter is the expected value E[χ] of the χ-distribution with N degrees of freedom. If the actual length is larger than E[χ], the whole argument of the exp-function is positive and the mutation strength is increased. Conversely, it is decreased. The control rule (C15) is meanwhile well understood [14]. A slightly modified rule

    σ^(g+1) = σ^(g) exp[ (1/(2D)) ( ||s^(g+1)||²/N - 1 ) ],    (15)

that works without the E[χ] value, proposed by Arnold [14], has been analyzed in [10].

Comparing the pseudocode in Fig. 1 with those presented in [1] or [5], we can conclude that the formulation of the CMA-ES can be significantly simplified: One only has to calculate the weighted recombinations of z̃, d̃, and d̃d̃^T according to

    <z̃^(g)>_w := Σ_{m=1}^µ w_m z̃_{m;λ}^(g),    (16)
    <d̃^(g)>_w := Σ_{m=1}^µ w_m d̃_{m;λ}^(g),    (17)
    <d̃^(g) (d̃^(g))^T>_w := Σ_{m=1}^µ w_m d̃_{m;λ}^(g) (d̃_{m;λ}^(g))^T.    (18)

Furthermore, the calculation of the inverse of the square root of C, as needed in the original CMA-ES, Eq. (8), is no longer needed. While the pseudocode in Fig. 1 is equivalent to the standard CMA-ES, it avoids unnecessarily complex derivations, resulting in a clearer exposition of the basic ingredients that seem to be necessary to ensure the functioning of the CMA-ES. However, as will be shown in the next section, this pseudocode can even be transformed into another one using some simple and mild assumptions that allow for removing parts of the evolution path cumulation and the calculation of the C-matrix at all.

III. REMOVING THE p AND THE C FROM THE CMA-ES

The CMA-ES pseudocode in Fig. 1 will be further transformed by first multiplying (C12) with M^(g) and taking (C6) into account:

    M^(g) s^(g+1) = (1 - c_s) M^(g) s^(g) + sqrt(µ_eff c_s (2 - c_s)) M^(g) <z̃^(g)>_w
                  = (1 - c_s) M^(g) s^(g) + sqrt(µ_eff c_s (2 - c_s)) <d̃^(g)>_w.    (19)

Now, compare with (C13). Provided that c_p = c_s, one immediately sees that

    c_p = c_s ∧ M^(g) s^(g) = p^(g)  ⇒  M^(g) s^(g+1) = p^(g+1).    (20)

Provided that M^(g+1) ≃ M^(g) asymptotically holds for N → ∞, (C13) can be dropped: under that assumption, (C13) is asymptotically a linear transformation of (C12). This is ensured by (24) as long as c_s → 0 for N → ∞. Yet, the question remains whether the assumptions c_p = c_s and M^(g+1) ≃ M^(g) are admissible in practice (N < ∞). In order to check how strongly c_p deviates from c_s, the quotient c_p/c_s versus the search space dimensionality N is displayed in Fig. 2. The calculation is based on the standard parameter settings provided in [5] (footnote 7)

    λ = 4 + ⌊3 ln N⌋,  µ = λ/2,    (21)
    µ_eff = 1 / Σ_{m=1}^µ w_m²,    (22)
    c_p = (µ_eff/N + 4) / (2µ_eff/N + N + 4),    (23)
    c_s = (µ_eff + 2) / (µ_eff + N + 5).    (24)

Footnote 7: In [5], c_p has been labeled as c_c and c_s as c_σ. We use the indices p and s to refer to the corresponding path vectors in (C12) and (C13).

[Figure 2 (plot): ordinate c_p/c_s between about 0.5 and 1.5, abscissa N from 10 to 10^6.]
Fig. 2. The c_p/c_s ratio depending on the search space dimension N assuming the strategy parameter choice given by Eqs. (22–24), using the standard population sizing rule (21).

As one can see, the c_p/c_s ratio is only a slightly decreasing function of N that does not deviate too much from 1. Therefore, one would not expect a very pronounced influence on the performance of the CMA-ES. Similarly, a replacement of
M^(g) by M^(g+1) on the rhs of (20) should only result in small deviations. This will be confirmed in Section IV.

As a next step, line (C14) will be investigated. Applying Eq. (2) for g and g+1, respectively, and using (13) and (C6), one gets

    M^(g+1) (M^(g+1))^T
      = (1 - c_1 - c_w) M^(g) (M^(g))^T
        + c_1 (M^(g) s^(g+1)) (M^(g) s^(g+1))^T
        + c_w Σ_{m=1}^µ w_m (M^(g) z̃_{m;λ}^(g)) (M^(g) z̃_{m;λ}^(g))^T
      = (1 - c_1 - c_w) M^(g) (M^(g))^T
        + c_1 M^(g) s^(g+1) (s^(g+1))^T (M^(g))^T
        + c_w M^(g) <z̃^(g) (z̃^(g))^T>_w (M^(g))^T
      = M^(g) (M^(g))^T
        + c_1 [ M^(g) s^(g+1) (s^(g+1))^T (M^(g))^T - M^(g) (M^(g))^T ]
        + c_w [ M^(g) <z̃^(g) (z̃^(g))^T>_w (M^(g))^T - M^(g) (M^(g))^T ].    (25)

Pulling out the M^(g) on the rhs of (25), one finally obtains

    M^(g+1) (M^(g+1))^T = M^(g) [ I + c_1 (s^(g+1) (s^(g+1))^T - I) + c_w (<z̃^(g) (z̃^(g))^T>_w - I) ] (M^(g))^T.    (26)

This is a very remarkable result that paves the way for getting rid of the covariance matrix update. Instead of considering C, one can directly evolve the M matrix if one finds an approach that connects M^(g+1) with M^(g). Technically, one has to perform some kind of "square root" operation on (26). Noting that Eq. (26) is of the form

    A A^T = M [I + B] M^T,    (27)

where B is a symmetric matrix, this suggests expanding A into a matrix power series

    A = M Σ_{k=0}^∞ γ_k B^k.    (28)

As can be easily shown by direct calculation,

    A = M [ I + (1/2) B - (1/8) B² + ... ]    (29)

does satisfy (27) up to the power of two in B. Provided that the matrix norm ||B|| is sufficiently small, one can even break off after the linear B-term, yielding

    M^(g+1) = M^(g) [ I + (c_1/2) (s^(g+1) (s^(g+1))^T - I) + (c_w/2) (<z̃^(g) (z̃^(g))^T>_w - I) + ... ].    (30)

Using Eq. (30) as an approximation for M^(g+1) works especially well if c_1 and c_w are sufficiently small. Actually, using the suggested formulae in [5] (footnote 8)

    c_1 = α_cov / ((N + 1.3)² + µ_eff)    (31)

and

    c_w = min( 1 - c_1, α_cov (µ_eff + 1/µ_eff - 2) / ((N + 2)² + α_cov µ_eff/2) ),    (32)

it becomes clear that, given the population sizing (21), it holds

    c_1 = O(1/N²)  and  c_w = O(ln N / N²).    (33)

Footnote 8: According to [5], 0 < α_cov ≤ 2. Throughout this paper it is assumed that α_cov = 2.

That is, at least for sufficiently large search space dimensionalities N, Eq. (30) should yield a similar behavior as the original C-matrix update (C14).

Given the new update (30), we can outline the new simplified CMA-ES without covariance matrix and p-path. The new ES may be called Matrix Adaptation ES (MA-ES) and is depicted in Fig. 3. The calculation of <z̃^(g) (z̃^(g))^T>_w in (M11) is done analogously to (3):

    <z̃^(g) (z̃^(g))^T>_w := Σ_{m=1}^µ w_m z̃_{m;λ}^(g) (z̃_{m;λ}^(g))^T.    (34)

For the damping constant d_σ in (M12) and (C15), the recommended expression in [5],

    d_σ = 1 + c_s + 2 max(0, sqrt((µ_eff - 1)/(N + 1)) - 1),    (35)

is used in the empirical investigations of this paper.

Comparing the MA-ES algorithm in Fig. 3 with that of the original CMA-ES, Fig. 1, one sees that this pseudocode is not only shorter but, more importantly, less numerically demanding. Since there is no explicit covariance matrix update/evolution, there is also no matrix square root calculation. The numerical operations to be performed are only matrix-matrix and matrix-vector multiplications. It is to be mentioned here that there exists related work aiming at circumventing the covariance matrix operations. In [2] a (1+1)-CMA-ES was proposed that evolves the Cholesky matrix. This approach has been extended to a multi-parent ES in [19]. Alternatively, considering the field of Natural Evolution Strategies (NES), the xNES [6] evolves a normalized transformation matrix. The latter approach shares some similarities with the MA-ES update in line (M11), most notably the multiplicative update form.
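Under the same standard parameter settings, the MA-ES of Fig. 3 can be sketched in a few lines of NumPy. This is a minimal illustrative sketch (the name `ma_es` and its interface are ours); line (M11) uses the truncated update (30).

```python
import numpy as np

def ma_es(f, y, sigma, max_gen=1000, f_target=1e-10):
    """Illustrative sketch of the (mu/mu_w, lambda)-MA-ES of Fig. 3, lines (M1)-(M14)."""
    N = len(y)
    lam = 4 + int(3 * np.log(N))                       # population sizing per [5]
    mu = lam // 2
    w = np.log((lam + 1) / 2) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                       # weights, Eq. (5)
    mu_eff = 1.0 / np.sum(w**2)                        # Eq. (22)
    c_s = (mu_eff + 2) / (mu_eff + N + 5)              # Eq. (24)
    c_1 = 2 / ((N + 1.3)**2 + mu_eff)                  # Eq. (31), alpha_cov = 2
    c_w = min(1 - c_1, 2 * (mu_eff + 1 / mu_eff - 2) / ((N + 2)**2 + mu_eff))  # Eq. (32)
    d_sigma = 1 + c_s + 2 * max(0.0, np.sqrt((mu_eff - 1) / (N + 1)) - 1)      # Eq. (35)
    e_chi = np.sqrt(N) * (1 - 1 / (4 * N) + 1 / (21 * N**2))  # approx. E[||N(0,I)||]
    s, M, I = np.zeros(N), np.eye(N), np.eye(N)        # (M1): no C matrix, no p path
    for _ in range(max_gen):                           # (M2)
        Z = np.random.randn(lam, N)                    # (M4)
        D = Z @ M.T                                    # (M5): row l is M z_l
        F = np.array([f(y + sigma * d) for d in D])    # (M6)
        idx = np.argsort(F)[:mu]                       # (M8)
        z_w, d_w = w @ Z[idx], w @ D[idx]
        y = y + sigma * d_w                            # (M9)
        s = (1 - c_s) * s + np.sqrt(mu_eff * c_s * (2 - c_s)) * z_w  # (M10)
        zz_w = np.einsum('m,mi,mj->ij', w, Z[idx], Z[idx])           # Eq. (34)
        M = M @ (I + 0.5 * c_1 * (np.outer(s, s) - I)                # (M11), per Eq. (30)
                   + 0.5 * c_w * (zz_w - I))
        sigma *= np.exp(c_s / d_sigma * (np.linalg.norm(s) / e_chi - 1))  # (M12)
        if F[idx[0]] < f_target:
            break
    return y, F[idx[0]]
```

In contrast to the CMA-ES, neither a covariance matrix, nor a second path p, nor a matrix square root appears; per generation only matrix-vector products and one matrix-matrix product are needed.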
Note that the M update (M11) comes to a halt, in expectation, if

    E[ s^(g+1) (s^(g+1))^T ] = I  and  E[ <z̃^(g) (z̃^(g))^T>_w ] = I    (36)

does hold. That is, provided that the M matrix has evolved in such a manner that the components of the selected z̃ vectors are (nearly) standard normally distributed, the evolution of M has reached a steady state. Of course, such a stationary behavior can only exist for static quadratic fitness landscapes (i.e., f-functions with a constant Hessian) and more general functions obtained from those quadratic functions by strictly increasing f-transformations (conserving ellipsoidal level sets).

(µ/µ_w, λ)-MA-ES
  Initialize y^(0), σ^(0), g := 0, s^(0) := 0, M^(0) := I                    (M1)
  Repeat                                                                     (M2)
    For l := 1 To λ                                                          (M3)
      z̃_l^(g) := N_l(0, I)                                                   (M4)
      d̃_l^(g) := M^(g) z̃_l^(g)                                               (M5)
      f̃_l^(g) := f(y^(g) + σ^(g) d̃_l^(g))                                     (M6)
    End                                                                      (M7)
    SortOffspringPopulation                                                  (M8)
    y^(g+1) := y^(g) + σ^(g) <d̃^(g)>_w                                       (M9)
    s^(g+1) := (1 - c_s) s^(g) + sqrt(µ_eff c_s (2 - c_s)) <z̃^(g)>_w         (M10)
    M^(g+1) := M^(g) [ I + (c_1/2)(s^(g+1) (s^(g+1))^T - I)
               + (c_w/2)(<z̃^(g) (z̃^(g))^T>_w - I) ]                          (M11)
    σ^(g+1) := σ^(g) exp[(c_s/d_σ)(||s^(g+1)|| / E[||N(0, I)||] - 1)]        (M12)
    g := g + 1                                                               (M13)
  Until(termination condition(s) fulfilled)                                  (M14)

Fig. 3. Pseudocode of the Matrix Adaptation ES (MA-ES).

TABLE I
TEST FUNCTIONS AND STOPPING CRITERION FOR THE EMPIRICAL PERFORMANCE EVALUATION. THE INITIAL CONDITIONS ARE y^(0) = (1, ..., 1)^T AND σ^(0) = 1.

  Name               Function                                                        f_target
  Sphere             f_Sp(y) := Σ_{i=1}^N y_i² = ||y||²                              10^-10
  Cigar              f_Cig(y) := y_1² + 10^6 Σ_{i=2}^N y_i²                          10^-10
  Tablet             f_Tab(y) := 10^6 y_1² + Σ_{i=2}^N y_i²                          10^-10
  Ellipsoid          f_Ell(y) := Σ_{i=1}^N 10^(6(i-1)/(N-1)) y_i²                    10^-10
  Parabolic Ridge    f_pR(y) := -y_1 + 100 Σ_{i=2}^N y_i²                            -10^10
  Sharp Ridge        f_sR(y) := -y_1 + 100 sqrt(Σ_{i=2}^N y_i²)                      -10^10
  Different-Powers   f_dP(y) := Σ_{i=1}^N (y_i²)^(1 + 5(i-1)/(N-1))                  10^-10
  Rosenbrock         f_Ros(y) := Σ_{i=1}^{N-1} [100 (y_i² - y_{i+1})² + (y_i - 1)²]  10^-10

IV. PERFORMANCE COMPARISON MA- VS. CMA-ES

The performance of the MA-ES has been extensively tested and compared with the CMA-ES. These investigations are intended to show that the MA-ES exhibits similar performance behaviors as the CMA-ES. To this end, not only standard population sizings (λ according to (21)) but also larger population sizes are considered. According to the derivation presented, the MA-ES should perform nearly equally well as the CMA-ES. We will investigate the performance on the set of test functions given in Tab. I. All runs have been initialized with a mutation strength σ^(0) = 1 and y^(0) = (1, ..., 1)^T. For the first tests, the standard strategy parameter choice given by Eqs. (21–24) together with the recombination weights (5) has been used. In addition to the test functions in Tab. I, the 24 test functions from the BBOB test environment [12] have been used to evaluate the MA-ES in the BiPop setting in Section V-B.

1) Mean value dynamics: In Fig. 4 the dynamics of the CMA-ES and the MA-ES are compared on the test functions Sphere, Cigar, Tablet, Ellipsoid, ParabolicRidge, SharpRidge, Rosenbrock, and DifferentPowers for search space dimensionalities N = 3 and N = 30. The curves displayed are the result
of an averaging over a number of independent single runs. Thus, they are an estimate of the mean value dynamics.

Fig. 4. The σ- and f- or |f|-dynamics of CMA-ES and MA-ES on the test functions of Tab. I for N = 3 and N = 30. The f-dynamics of the CMA-ES are in black and those of the MA-ES are in red. The σ-dynamics of the CMA-ES are in cyan and those of the MA-ES are in magenta. The curves are the averages of 20 independent ES runs (except the SharpRidge N = 3 case, where 200 runs have been averaged, and the Rosenbrock N = 3 case, where 100 runs have been averaged). Note that the MA-ES exhibits premature convergence on the SharpRidge, while the CMA-ES shows this behavior for N = 30.

As one can see, using the standard strategy parameter setting, both CMA-ES and MA-ES perform very similarly on the N = 30 test instances. Even in the case N = 3 this behavior is observed, except for the SharpRidge, where the MA-ES exhibits premature convergence while the CMA-ES is able to follow the ridge exponentially fast. However, increasing N slightly, also the CMA-ES will fail on the SharpRidge. If one wants to improve the behavior on the SharpRidge (while keeping the basic CMA-ES algorithm), the population sizing rule (21) must be changed and much larger λ values must be used. As a general observation, it is to be noted that the N = 3 case produces rather rugged single-run dynamics. Regarding the Rosenbrock function, only those runs have been used for averaging that converged to the global minimum (there is a second, but local, minimum that sometimes attracts the ES).

2) Expected runtime: In order to compare the N scaling behavior, expected runtime (ERT) experiments have been performed. Since a converging ES does not always reach a predefined objective function target value f_target, but may end up in a local optimizer, a reasonable ERT definition must take into account unsuccessful runs. Therefore, the formula

    ERT = (1/P_s - 1) E[r_u] + E[r_s]    (37)

derived in [15] has been used for the calculation of the ERT. For E[r_u] the empirical estimate of the runtime of unsuccessful runs has been determined.
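Eq. (37) is straightforward to evaluate from a set of runs; a minimal sketch (the helper name `ert` and the toy data are ours, assuming at least one successful run):

```python
import numpy as np

def ert(runtimes, successes):
    """Expected runtime per Eq. (37): ERT = (1/P_s - 1) E[r_u] + E[r_s].
    `runtimes`: run lengths (e.g., function evaluations) of all runs;
    `successes`: boolean flags marking runs that reached f_target."""
    runtimes = np.asarray(runtimes, dtype=float)
    successes = np.asarray(successes, dtype=bool)
    p_s = successes.mean()                     # empirical success probability
    e_rs = runtimes[successes].mean()          # E[r_s], successful runs
    # E[r_u], unsuccessful runs (zero if every run succeeded)
    e_ru = runtimes[~successes].mean() if (~successes).any() else 0.0
    return (1.0 / p_s - 1.0) * e_ru + e_rs

# toy example: three successful runs, one unsuccessful run
print(ert([100, 120, 140, 80], [True, True, True, False]))  # (1/0.75 - 1)*80 + 120 = 146.66...
```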
Likewise, the mean case produces rather rugged single run dynamics. Regarding value of the runtime of the successful runs serves as an the Rosenbrock function, only those runs have been used for empirical estimate for E[rs] and Ps is to be estimated as 8 ACCEPTED FOR: IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION
Cigar Tablet of occasional convergence of the ES to the local optimizer. This is also the reason for the comparatively large standard
105 deviations Var[r] observed.
104 p
104 B. Considering λ =4N Population Sizing It is well known that the standard population sizing ac- # function evaluations # function evaluations cording to (21) is not well suited for multimodal and noisy 103 101 102 101 102 N N optimization. First, the case λ = 4N, µ = λ/2 will be 2 Ellipsoid ParabolicRidge considered. In the next section, the λ = 4N case will be investigated. Since relevant differences in the dynamics (compared to the dynamics of the standard population sizing 105 rule (21) in Fig. 4) can only be observed for the Ridge (and 104 to a certain extend for the Cigar), the graphs are presented in 104 the supplementary material. Regarding the ParabolicRidge and the Cigar one observes a certain performance loss for the MA-
# function10 evaluations 3 # function10 evaluations 3 ES. Having a closer look at the ERT graphs in Fig. 6 one sees 101 102 101 102 N N that these differences appear more pronounced at larger search Rosenbrock DifferentPowers space dimensionalities on peculiar ellipsoid models like the 106 Cigar and the “degenerated Cigar” – the Parabolic Ridge. In those test cases the CMA-ES seems to benefit from the second 105 evolution path: There is a dominating major principal axis 105 in these test functions (being constant in its direction in the 4 10 search space) in which the mutations should be predominantly 104 directed. It seems that such a direction can be somewhat faster 103 # function evaluations # function evaluations identified using a second evolution path. 101 102 101 102 N N 2 Fig. 5. Expected runtime (in terms of # of function evaluations) and its C. Considering λ =4N Population Sizing standard deviation of the CMA-ES, displayed by black data points (and error The mean value dynamics of the CMA-ES and the MA- bars) and of the MA-ES, displayed by red data points (and error bars). The population size λ is given by (21). Data points have been obtained for N = ES with population sizes proportional to the square of the 4, 8, 16, 32, 64, and 128. As for the initial conditions and ftarget, see Tab. I. search space dimensionality N are mainly presented in the For comparison purposes, dashed blue lines are displayed in order to represent supplementary material. The f-dynamics of CMA-ES and linear (smaller slope) and quadratic (larger slope) runtime growth behavior. Note, regarding Rosenbrock especially for small N, the standard deviation of MA-ES do not deviate markedly and are similar to those the runtime is rather large and only displayed as a horizontal bar above the of the other population sizings. Some of the experimental respective data point. results regarding Sphere, Ellipsoid, and Rosenbrock are also displayed in Fig. 7. 
The curves are the result of averaging over a number of independent single runs. As one can see, there is a certain performance loss for the MA-ES, except for Rosenbrock at N = 30, where the MA-ES excels.

… empirical success probability. Similarly, one can determine the variance of the run length; the formula reads⁹
Var[r] = (1/P_s − 1) Var[r_u] + Var[r_s] + ((1 − P_s)/P_s²) (E[r_u])².  (38)

The ERT experiments performed regard search space dimensionalities N = 4, 8, 16, 32, 64, and 128. These investigations exclude the SharpRidge since both CMA-ES and MA-ES exhibit premature convergence for moderately large N (for the CMA-ES, experiments with N > 16 exhibit premature convergence). The results of these experiments are displayed in Fig. 5. Instead of the expected number of generations, the number of function evaluations is displayed versus the search space dimensionality N. The number of function evaluations is simply the product of the expected runtime measured in generations (needed to reach a predefined f_target) and the offspring population size λ. As one can see, both strategies exhibit nearly the same expected runtime performance. Only for the Rosenbrock function are there certain differences. These are mainly due to the effect …

The mutation strength dynamics of the two ES algorithms on Rosenbrock are interesting in their own right. In the case of N = 30, σ even increases exponentially fast. Since both the CMA-ES and the MA-ES converge to the optimizer, shrinking the mutation step-size (note, this is not σ, as often wrongly stated) is predominantly done by the covariance matrix or the M matrix (in the case of the MA-ES). This can easily be confirmed by considering the dynamics of the maximal and minimal eigenvalues of C (in the case of the CMA-ES) and MM^T (in the case of the MA-ES), respectively. Therefore, it is a general observation for the large population size λ = 4N² case that adaptation of the step size is mainly done by the update of M and C, respectively, and not by σ. This even holds for the Sphere model, as shown in Fig. 8.

Next, let us consider the expected runtime (ERT) curves in the interval N = 4 to N = 128 in Fig. 9.¹⁰

¹⁰ Unlike the λ scaling law (21), choosing λ = 4N² prevents the CMA-ES and the MA-ES from premature convergence on the Sharp Ridge.
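The eigenvalue diagnostic described above, tracking the minimal and maximal eigenvalues of C (CMA-ES) and of MM^T (MA-ES), can be sketched as follows. This is our own illustrative helper, not code from the paper; it assumes a given transformation matrix M and uses NumPy:

```python
import numpy as np

def transformation_spectrum(M):
    """Extreme eigenvalues and condition number of C = M M^T.

    For the MA-ES, the mutation covariance is sigma^2 * M M^T, so the
    effective step lengths along the principal axes scale with the
    eigenvalues of M M^T.  Diagnostic sketch only, not an ES update step.
    """
    C = M @ M.T
    eig = np.linalg.eigvalsh(C)   # ascending order; C is symmetric
    return eig[0], eig[-1], eig[-1] / eig[0]
```

Logging these three values per generation reproduces the kind of curves shown in Fig. 8: a shrinking maximal eigenvalue indicates that the matrix update, rather than σ, contracts the mutation steps.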
That is why the ERT curves are displayed for this test function as well.

⁹ Note, in [15] a wrong formula has been presented. This is the correct one.
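Formula (38) can be turned into a small estimator over observed run lengths. The following Python sketch uses our own naming; it assumes given samples of successful (r_s) and unsuccessful (r_u) run lengths and an empirical success probability, and combines (38) with the standard restart-strategy expectation E[r] = E[r_s] + ((1 − P_s)/P_s) E[r_u]:

```python
import numpy as np

def restart_run_length_stats(r_s, r_u, p_s):
    """Mean and variance of the total run length r of a restarted ES.

    r_s: generation counts of successful runs, r_u: of unsuccessful runs,
    p_s: empirical success probability.  Implements Eq. (38) together with
    E[r] = E[r_s] + ((1 - p_s)/p_s) E[r_u].  Illustrative estimator with
    our own naming, not the authors' code.
    """
    r_s = np.asarray(r_s, dtype=float)
    if p_s >= 1.0:                      # no unsuccessful runs observed
        e_ru = var_ru = 0.0
    else:
        r_u = np.asarray(r_u, dtype=float)
        e_ru, var_ru = r_u.mean(), r_u.var()
    mean = r_s.mean() + (1.0 - p_s) / p_s * e_ru
    var = ((1.0 / p_s - 1.0) * var_ru + r_s.var()
           + (1.0 - p_s) / p_s**2 * e_ru**2)
    return mean, var
```

The variance formula follows from modeling the number of failed restarts before the first success as a geometric random variable with success probability P_s.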
[Fig. 6 panels: Cigar, Tablet, Ellipsoid, ParabolicRidge, Rosenbrock, DifferentPowers; # function evaluations vs. N.]

Fig. 6. Expected runtime (in terms of # of function evaluations) and its standard deviation of the CMA-ES, displayed by black data points (and error bars), and of the MA-ES, displayed by red data points (and error bars). The λ = 4N, µ = 2N case has been investigated. Data points have been obtained for N = 4, 8, 16, 32, 64, and 128. As for the initial conditions and f_target, see Tab. I. For comparison purposes, dashed blue lines are displayed in order to represent linear (smaller slope) and quadratic (larger slope) runtime growth behavior. Note, regarding Rosenbrock for N = 4, the standard deviation of the runtime is rather large and only displayed as a horizontal bar above the respective data point.

[Fig. 7 panels: Sphere, Ellipsoid, Rosenbrock for N = 3 (left) and N = 30 (right); σ- and f-dynamics vs. generation g.]

Fig. 7. σ- and f-dynamics of CMA-ES and MA-ES with λ = 4N² on Sphere, Ellipsoid, and Rosenbrock for N = 3 (left) and N = 30 (right). The f-dynamics of the CMA-ES are in black and those of the MA-ES are in red. The σ-dynamics of the CMA-ES are in cyan and those of the MA-ES are in magenta. The curves are the averages of 20 independent ES runs (except the Rosenbrock N = 3 case where 100 runs have been averaged).
Having a closer look at the ERT graphs, one sees that, apart from the Parabolic and the Sharp Ridge, both the CMA-ES and the MA-ES exhibit a larger slope than the dotted straight lines ∝ N². That is, ERT as a function of N exhibits a super-quadratic runtime. As for Rosenbrock (see Fig. 9), this might be partially attributed to the change of the local topology that requires a permanent change of the covariance matrix during the evolution. Regarding the ellipsoid-like models such as Sphere, Cigar, Tablet, and Ellipsoid, this might indicate that CMA- and MA-ES do not work optimally with large population sizes. This suspicion is also supported by the non-decreasing behavior of the mutation strength σ in Fig. 7 for N = 30, indicating a possible failure of the σ-control rule. This raises the research question as to the reasons for this observed behavior and whether one can improve the scaling behavior.

[Fig. 8 panels: min/max eigenvalues (left) and condition number (right) vs. generation g.]

Fig. 8. Left figure: On the evolution of the minimal and the maximal eigenvalues of C (black curves) and MM^T (red curves) for a (1800/1800I, 3600)-CMA-ES and a (1800/1800I, 3600)-MA-ES, respectively, on the N = 30-dimensional Sphere model. Right figure: Corresponding condition number dynamics.

V. PERFORMANCE OF THE MA-ES AS BUILDING BLOCK IN A BIPOP-ES

A. The BiPop-Algorithm

In order to solve multi-modal optimization problems, the CMA-ES should be used in a restart setting with increasing population sizes and/or different initial mutation strengths [16]. An implementation of such a multi-start approach was proposed in terms of the BiPop-CMA-ES [11]. Its pseudocode is presented in Fig. 10.¹¹ It is the aim of this section to show that replacing the basic version of the CMA-ES given in Fig. 1 by the MA-ES, Fig. 3, does not significantly degrade the performance of the BiPop-CMA-ES. The BiPop-CMA-

¹¹ Since a pseudocode of the BiPop-CMA-ES [11] has not been published, it has been derived here from the verbal description given in [11], [17] and private communications with Ilya Loshchilov.
[Fig. 9 panels: Sphere, SharpRidge, Cigar, Tablet; # function evaluations vs. N.]

BiPop-(C)MA-ES for minimization

Initialize y_lower, y_upper, σ_init, λ_init,
    f_stop, b_stop, g_stop := ∞                          (B1)
y_init := u[y_lower, y_upper]                            (B2)
(y_min, f_min, b_1) :=
    (C)MA-ES(y_init, σ_init, λ_init, g_stop, T_C)        (B3)
If f_min ≤ f_stop Then Return(y_min, f_min)              (B4)
n := 0; n_small := 0;                                    (B5)
b_large := 0; b_small := 0;                              (B6)
Repeat                                                   (B7)
    n := n + 1                                           (B8)
    λ := 2^(n − n_small) λ_init                          (B9)
    y_init := u[y_lower, y_upper]                        (B10)
    If n > 2 AND b
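As a rough, executable illustration of the restart principle behind steps (B1)-(B10), the following Python sketch wraps a stand-in search routine. Everything here is our own construction under stated assumptions: toy_es is not a (C)MA-ES but a placeholder used to exercise the restart logic, and the small-population regime of the full BiPop scheme (the branch truncated above) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_es(y_init, sigma, lam, g_stop, f):
    """Stand-in for the (C)MA-ES call of step (B3): an elitist search
    with a fixed step size, used only to exercise the restart loop."""
    y, f_best = y_init.copy(), f(y_init)
    for _ in range(g_stop):
        offspring = y + sigma * rng.standard_normal((lam, y.size))
        vals = [f(x) for x in offspring]
        i = int(np.argmin(vals))
        if vals[i] < f_best:          # keep the best point seen so far
            y, f_best = offspring[i], vals[i]
    return y, f_best

def bipop(f, y_lower, y_upper, sigma_init=1.0, lam_init=8,
          g_stop=50, f_stop=1e-8, n_restarts=6):
    """Large-population half of the restart scheme: uniform
    re-initialization in [y_lower, y_upper] per steps (B2)/(B10) and
    population doubling lam = 2^(n - n_small) * lam_init per (B9).
    The small-population regime (n_small > 0) is left out."""
    u = lambda: rng.uniform(y_lower, y_upper)
    y_min, f_min = toy_es(u(), sigma_init, lam_init, g_stop, f)  # (B3)
    n, n_small = 0, 0                                            # (B5)
    while f_min > f_stop and n < n_restarts:                     # (B4)/(B7)
        n += 1                                                   # (B8)
        lam = 2 ** (n - n_small) * lam_init                      # (B9)
        y, fv = toy_es(u(), sigma_init, lam, g_stop, f)
        if fv < f_min:
            y_min, f_min = y, fv
    return y_min, f_min
```

Replacing toy_es by an actual CMA-ES or MA-ES call yields the restart behavior studied in this section; the doubling of λ per restart is the essential ingredient for multimodal problems.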