
BCMA-ES: A Bayesian approach to CMA-ES

Eric Benhamou, LAMSADE and A.I Square Connect, Paris Dauphine, France, [email protected]
David Saltiel, LISIC and A.I Square Connect, ULCO-LISIC, France, [email protected]
Sebastien Verel, ULCO-LISIC, France, [email protected]
Fabien Teytaud, ULCO-LISIC, France, [email protected]

ABSTRACT
This paper introduces a novel theoretically sound approach for the celebrated CMA-ES algorithm. Assuming the parameters of the multivariate normal distribution for the minimum follow a conjugate prior distribution, we derive their optimal update at each iteration step. Not only does this Bayesian framework provide a justification for the update of the CMA-ES algorithm, but it also gives two new versions of CMA-ES, assuming either normal-Wishart or normal-inverse-Wishart priors, depending on whether we parametrize the likelihood by its covariance or its precision matrix. We support our theoretical findings by numerical experiments that show fast convergence of these modified versions of CMA-ES.

CCS CONCEPTS
• Mathematics of computing → Probability and statistics;

KEYWORDS
CMA-ES, Bayesian, conjugate prior, normal-inverse-Wishart

ACM Reference Format:
Eric Benhamou, David Saltiel, Sebastien Verel, and Fabien Teytaud. 2019. BCMA-ES: A Bayesian approach to CMA-ES. In Proceedings of A.I Square Working Paper. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

arXiv:1904.01401v1 [cs.LG] 2 Apr 2019

1 INTRODUCTION
The covariance matrix adaptation evolution strategy (CMA-ES) [12] is arguably one of the most powerful real-valued derivative-free optimization algorithms, finding many applications in machine learning. It is a state-of-the-art optimizer for continuous black-box functions, as shown by the various benchmarks of the COmparing Continuous Optimisers INRIA platform for ill-posed functions. It has led to a large number of papers and articles, and we refer the interested reader to [1, 2, 4–6, 10–12, 15, 21] and [25] to cite a few. It has been successfully applied in many unbiased performance comparisons and numerous real-world applications. In particular, in machine learning, it has been used for direct policy search in reinforcement learning and hyper-parameter tuning in supervised learning ([13, 14, 16] and references therein), as well as for hyperparameter optimization of deep neural networks [18].

In a nutshell, the (µ/λ) CMA-ES is an iterative black box optimization algorithm that, in each of its iterations, samples λ candidate solutions from a multivariate normal distribution, evaluates these solutions (sequentially or in parallel), retains the µ best candidates and adjusts the sampling distribution used for the next iteration to give higher probability to good samples. Each iteration can individually be seen as taking an initial guess or prior for the multivariate parameters, namely the mean and the covariance, and, after making an experiment by evaluating these sample points with the fit function, updating the initial parameters accordingly.

Historically, the CMA-ES has been developed heuristically, mainly by conducting experimental research and validating intuitions empirically. Research was done without much focus on theoretical foundations because of the apparent complexity of this algorithm. It was only recently that [3, 8] and [21] made a breakthrough and provided a theoretical justification of the CMA-ES updates thanks to information geometry. They proved that CMA-ES performs a natural gradient descent in the Fisher information metric. These works provided a nice explanation for the performance of CMA-ES in terms of strong invariance properties, good search directions, etc.

There is however another line of explanation that has been so far ignored and could also bring nice insights about CMA-ES. It is Bayesian theory. In the light of Bayesian statistics, CMA-ES can be seen as an iterative prior posterior update. But there is some real complexity due to tricky updates that may explain why this has always been ignored. First of all, in a regular Bayesian approach, all sample points are taken. This is not the case in the (µ/λ) CMA-ES, as out of the λ generated candidates, only the µ best are selected. The updating weights are also constant, which is not consistent with Bayesian updates. But more importantly, the covariance matrix update is the core of the problem. It calls for important remarks. The update is done according to a weighted combination of a rank one matrix, referred to as p_c p_c^T with parameter c1, and a rank min(µ, n) matrix with parameter cµ, whose details are given for instance in [9]. The two updates for the covariance matrix make the Bayesian interpretation challenging, as these updates are done according to two paths: the isotropic and the anisotropic evolution paths. All this may explain why a Bayesian approach for interpreting and revisiting the CMA-ES algorithm has seemed a daunting task and has not been tackled before.

This is precisely the objective of this paper. Section 2 recalls various Bayesian concepts of updates for prior and posterior to highlight the analogy with an iterative Bayesian update. Section 3 presents in greater detail the Bayesian approach of CMA-ES, with the corresponding family of derived algorithms, emphasizing the various design choices that can lead to multiple algorithms. Section 4 provides numerical experiments and shows that Bayesian adapted CMA-ES algorithms perform well on convex and non convex functions. We finally conclude about some possible extensions and further experiments.
However, the analogy with a successive Bayesian prior posterior update has been so far missing in the landscape of CMA-ES for multiple reasons. First of all, from a cultural point of view, the evolutionary and Bayesian communities have always been quite different and not overlapping. Secondly, the CMA-ES was never formulated in terms of a prior and posterior update, making its connection with the Bayesian world non obvious. Thirdly, when looking in detail at the parameter updates, the weighted combination between the global and the local search makes the interpretation as a Bayesian posterior update non trivial. We will explain in this paper that the global search needs to be addressed with a special dilatation technique that is not common in the Bayesian world.

2 FRAMEWORK
CMA-ES computes at each step an update of the mean and covariance of the distribution of the minimum. From a very general point of view, this can be interpreted as a prior posterior update in Bayesian statistics.

2.1 Bayesian vs Frequentist probability theory
The justification of the Bayesian approach is discussed in [23]. In Bayesian probability theory, we assume a distribution on the unknown parameters of a statistical model, which can be characterized as a probabilization of uncertainty. This procedure leads to an axiomatic reduction from the notion of unknown to the notion of randomness with probability. We do not know the value of the parameters for sure, but we know specific values that these parameters can take with higher probabilities. This creates a prior distribution that is updated as we make some experiments, as shown in [7, 19, 23]. In the Bayesian view, a probability is assigned to a hypothesis, whereas under frequentist inference, a hypothesis is typically tested without being assigned a probability. There are even some nice theoretical justifications for this, as presented in [17].

Definition 2.1. (Infinite exchangeability). We say that (x1, x2, ...) is an infinitely exchangeable sequence of random variables if, for any n, the joint probability p(x1, x2, ..., xn) is invariant to permutation of the indices. That is, for any permutation π,

    p(x1, x2, ..., xn) = p(xπ1, xπ2, ..., xπn)

Equipped with this definition, De Finetti's theorem, as provided below, states that exchangeable observations are conditionally independent relative to some latent variable.

Theorem 2.1. (De Finetti, 1930s). A sequence of random variables (x1, x2, ...) is infinitely exchangeable iff, for all n,

    p(x1, x2, ..., xn) = ∫ ∏_{i=1}^{n} p(xi | θ) P(dθ),

for some measure P on θ.

This representation theorem 2.1 justifies the use of priors on parameters since, for exchangeable data, there must exist a parameter θ, a likelihood p(x|θ) and a distribution π on θ. A proof of De Finetti's theorem is for instance given in [24] (section 1.5).

Remark 2.1. The De Finetti theorem is trivially satisfied in the case of i.i.d. sampling, as the sequence is clearly exchangeable and the joint probability is clearly given by the product of all the marginal distributions. However, De Finetti's theorem goes far beyond, as it proves that infinite exchangeability is enough for the joint distribution to be the product of some marginal distribution for a given parameter θ. The sequence may be neither independent nor identically distributed, which is a much stronger result!

2.2 Conjugate priors
In Bayesian statistical inference, the probability distribution that expresses one's (subjective) beliefs about the distribution parameters before any evidence is taken into account is called the prior probability distribution, often simply called the prior. In CMA-ES, it is the distribution of the mean and covariance. We can then update our prior distribution with the data using Bayes' theorem to obtain a posterior distribution. The posterior distribution is a probability distribution that represents the updated beliefs about the parameters after having seen the data. Bayes' theorem gives us the fundamental rule of Bayesian statistics, that is

    Posterior ∝ Prior × Likelihood

The proportionality sign indicates that one should compute the distribution up to a renormalization constant that enforces that the distribution sums to one. This rule is simply a direct consequence of Bayes' theorem. Mathematically, for a random variable X, let its distribution p depend on a parameter θ that can be multi-dimensional. To emphasize the dependency of the distribution on the parameter, let us write this distribution as p(x|θ) and assume we have access to a prior distribution π(θ). Then the joint distribution of (θ, x) writes simply as

    φ(θ, x) = p(x|θ) π(θ)

The marginal distribution of x is trivially given by marginalizing the joint distribution over θ as follows:

    m(x) = ∫ φ(θ, x) dθ = ∫ p(x|θ) π(θ) dθ

The posterior of θ is obtained by Bayes' formula as

    π(θ|x) = p(x|θ) π(θ) / ∫ p(x|θ) π(θ) dθ ∝ p(x|θ) π(θ)

Computing a posterior is tricky and does not bring much value in general. A key concept in Bayesian statistics is that of conjugate priors, which makes the computation really easy and is described at length below.

Definition 2.2. A prior distribution π(θ) is said to be a conjugate prior if the posterior distribution

    π(θ|x) ∝ p(x|θ) π(θ)    (1)

remains in the same distribution family as the prior.

At this stage, it is relevant to introduce exponential family distributions, as this higher level of abstraction, which encompasses the multivariate normal, trivially solves the issue of finding conjugate priors. This will be very helpful for inferring conjugate priors for the multivariate Gaussian used in CMA-ES.

Definition 2.3. A distribution is said to belong to the exponential family if it can be written (in its canonical form) as:

    p(x|η) = h(x) exp( η · T(x) − A(η) ),    (2)

where η is the natural parameter, T(x) is the sufficient statistic, A(η) is the log-partition function and h(x) is the base measure. η and T(x) may be vector-valued. Here a · b denotes the inner product of a and b. The log-partition function is defined by the integral

    A(η) ≜ log ∫_X h(x) exp( η · T(x) ) dx.    (3)

Also, η ∈ Ω = {η ∈ R^m : A(η) < +∞}, where Ω is the natural parameter space. Moreover, Ω is a convex set and A(·) is a convex function on Ω.

Remark 2.2. Not surprisingly, the normal distribution N(x; µ, Σ) with mean µ ∈ R^d and covariance matrix Σ belongs to the exponential family, but with a different parametrisation. Its exponential family form is given by:

    η(µ, Σ) = [ Σ^{−1} µ ; vec(Σ^{−1}) ],    T(x) = [ x ; vec( −(1/2) x x^T ) ],    (4a)

    h(x) = (2π)^{−d/2},    A(η(µ, Σ)) = (1/2) µ^T Σ^{−1} µ + (1/2) log |Σ|,    (4b)

where in equation (4a) the notation vec(·) means that we have vectorized the matrix, stacking each column on top of each other; hence, for two matrices a and b, we can equivalently write Tr(a^T b) as the scalar product of their vectorizations vec(a) · vec(b). We can remark that the canonical parameters are very different from the traditional (also called moment) parameters. We can also notice that changing slightly the sufficient statistic T(x) changes the corresponding canonical parameters η.

For an exponential family distribution, it is particularly easy to form a conjugate prior.

Proposition 2.2. If the observations have a density of the exponential family form p(x|θ, λ) = h(x) exp( η(θ, λ)^T T(x) − n A(η(θ, λ)) ), with λ a set of hyper-parameters, then the prior defined by π(θ) ∝ exp( µ1 · η(θ, λ) − µ0 A(η(θ, λ)) ), with µ ≜ (µ0, µ1), is a conjugate prior.

The proof is given in appendix subsection 6.1. As we can vary the parameterisation of the likelihood, we can obtain multiple conjugate priors. Because of the conjugacy, if the initial parameters of the multivariate Gaussian follow the prior, the posterior is the true distribution given the information X and stays in the same family, making the update of the parameters really easy. Said differently, with a conjugate prior, we make the optimal update. It is also enlightening to see that, as we get some information from the likelihood, our posterior distribution becomes more peaked, as shown in figure 1.

Figure 1: As we get more and more information using the likelihood, the posterior becomes more peaked.

2.3 Optimal updates for NIW
The two natural conjugate priors for the multivariate normal that update both the mean and the covariance are the normal-inverse-Wishart, if we want to update the mean and covariance of the multivariate normal, and the normal-Wishart, if we are interested in updating the mean and the precision matrix (which is the inverse of the covariance matrix). In this paper, we will stick to the normal-inverse-Wishart to keep things simple. The normal-inverse-Wishart distribution is parametrized by µ0, λ, Ψ, ν and its density is given by

    f(µ, Σ | µ0, λ, Ψ, ν) = N( µ | µ0, (1/λ) Σ ) W^{−1}( Σ | Ψ, ν )

where W^{−1} denotes the inverse Wishart distribution. The key theoretical guarantee of the BCMA-ES is to update the mean and covariance of our CMA-ES optimally, as follows.

Proposition 2.3. If our sampling density follows a d dimensional multivariate normal distribution ∼ N_d(µ, Σ) with unknown mean µ and covariance Σ, if its parameters are distributed according to a normal-inverse-Wishart (µ, Σ) ∼ NIW(µ0, κ0, v0, ψ), and if we observe samples X = (x1, .., xn), then the posterior is also a normal-inverse-Wishart with different parameters NIW(µ0⋆, κ0⋆, v0⋆, ψ⋆) given by

    µ0⋆ = (κ0 µ0 + n x̄) / (κ0 + n),
    κ0⋆ = κ0 + n,
    v0⋆ = v0 + n,    (5)
    ψ⋆ = ψ + Σ_{i=1}^{n} (xi − x̄)(xi − x̄)^T + ( κ0 n / (κ0 + n) ) (x̄ − µ0)(x̄ − µ0)^T

with x̄ the sample mean.

Remark 2.3. This proposition is the cornerstone of the BCMA-ES. It provides the theoretical guarantee that the updates of the parameters in the algorithm are accurate and optimal under the assumption of the prior. In particular, this implies that any other formula for the update of the mean and variance, and in particular the ones used in the mainstream CMA-ES, assumes a different prior.

Proof. A complete proof is given in the appendix section 6.2. □
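To make the update concrete, the following short Python sketch implements equation (5); it is our own illustration (not the paper's supplementary code) and assumes the n observations are stacked in an (n, d) NumPy array.

import numpy as np

def niw_posterior(mu0, kappa0, v0, psi, X):
    # Normal-inverse-Wishart posterior update of equation (5) (illustrative sketch).
    X = np.atleast_2d(np.asarray(X, float))
    n = X.shape[0]
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                # scatter matrix around the sample mean
    mu_star = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    kappa_star = kappa0 + n
    v_star = v0 + n
    psi_star = psi + S + (kappa0 * n / (kappa0 + n)) * np.outer(xbar - mu0, xbar - mu0)
    return mu_star, kappa_star, v_star, psi_star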
3 BAYESIAN CMA-ES
3.1 Main assumptions
Our main assumptions are the following:
• the parameters of the multivariate Gaussian follow a conjugate prior distribution;
• the minimum of our objective function f follows a multivariate normal law.

3.2 Simulating the minimum
One of the main challenges is to simulate the likelihood in order to infer the posterior. The key question is really how to use the additional information of the function value f for candidate points. At step t in our algorithm, we suppose the multivariate Gaussian parameters µ and Σ follow a normal inverse Wishart denoted by NIW(µt, κt, vt, ψt). In full generality, we would need to do a Monte Carlo of Monte Carlo, as the parameters of our multivariate normal are themselves stochastic. However, we can simplify the problem and take their mean values. This is very effective in terms of computation and reduces Monte Carlo noise. For the normal inverse Wishart distribution, there exist closed forms for these mean values, given by:

    Et[µ] = µt    (6)

and

    Et[Σ] = ψt / (vt − n − 1)    (7)
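As an illustration of equations (6) and (7) and of the candidate simulation described next, the following sketch (ours, with illustrative names) draws k candidates from the plug-in distribution N(Et[µ], Et[Σ]):

import numpy as np

def simulate_candidates(mu_t, kappa_t, v_t, psi_t, k, rng=None):
    # Draw k candidates from N(E_t[mu], E_t[Sigma]) under the NIW(mu_t, kappa_t, v_t, psi_t) prior.
    # kappa_t is part of the NIW state but does not enter the mean values (6)-(7);
    # here n denotes the dimension and v_t > n + 1 is required.
    rng = np.random.default_rng() if rng is None else rng
    mean = np.asarray(mu_t, float)                       # E_t[mu], equation (6)
    n = mean.shape[0]
    cov = np.asarray(psi_t, float) / (v_t - n - 1)       # E_t[Sigma], equation (7)
    X = rng.multivariate_normal(mean, cov, size=k)
    return mean, cov, X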

We simulate potential candidates X = {Xi} ∼ N(Et[µ], Et[Σ]) and evaluate them, f(Xi). If the distribution of the minimum were accurate, the minimum would concentrate around Et[µ] and be spread with a variance of Et[Σ]. When evaluating potential candidates, as our guess is not right, we do not get values centered around Et[µ] and spread with a variance of Et[Σ]. This comes from three things:
• Our assumed minimum is not right. We need to shift our normal to the right minimum!
• Our assumed variance is not right. We need to compute it on real data, taking into account the additional information given by f.
• Last but not least, our Monte Carlo simulation adds some random noise.

For the last issue, we can correct any of our estimators by the Monte Carlo bias. This can be done using a standard control variate, as the simulated mean and variance are given by Et[µ] and Et[Σ] respectively, and we can compute for each of them the bias explicitly. The first two issues are more complex. Let us tackle each issue one by one.

To recover the true minimum, we design two strategies.
• We design a strategy where we rebuild our normal distribution using sorted information of our X's weighted by their normal density, to ensure this is a true normal corrected from the Monte Carlo bias. We need to explicitly compute the weights. For each simulated point Xi, we compute its assumed density, denoted by di = N(Et[µ], Et[Σ])(Xi), where N(Et[µ], Et[Σ])(.) denotes the p.d.f. of the multivariate Gaussian. We divide these densities by their sum to get weights (wi)_{i=1..k} that are positive and sum to one, as follows: wj = dj / Σ_{i=1}^{k} di. Hence, for k simulated points, we get {Xi, wi}_{i=1..k}. We reorder jointly the couples (points and weights) in terms of their weights, in decreasing order. To insist that we take values sorted in decreasing order with respect to the weights (wi)_{i=1..k}, we denote the order statistics (i), w↓. This first sorting leads to k new couples {X_(i),w↓, w_(i),w↓}_{i=1..k}. Using a stable sort (that keeps the order of the weights), we sort jointly the couples (points and weights) according to their objective function value (in increasing order this time) and get k new couples {X_(i),f↑, w_(i),w↓}_{i=1..k}. We can now compute the empirical mean µ̂t as follows:

    µ̂t = Σ_{i=1}^{k} w_(i),w↓ · X_(i),f↑ − ( Σ_{i=1}^{k} wi Xi − µt )    (8)

where the first sum is the MC mean for X_{f↑} and the term in parentheses is the MC bias of the simulated X. The intuition of equation (8) is to compute, in the left term, the Monte Carlo mean using points reordered according to their objective value, and to correct our initial computation by the Monte Carlo bias computed as the right term, equal to the initial Monte Carlo mean minus the real mean. We call this strategy one.
• If we think for a minute about strategy one, we get the intuition that when starting the minimization it may not be optimal. This is because the weights are proportional to exp( −(1/2) (X − Et[µ])^T (Et[Σ])^{−1} (X − Et[µ]) ). When we start the algorithm, we use a large search space, hence a large covariance matrix Σt, which leads to weights that are quite similar. Hence, even if we sort candidates by their fit, ranking them according to the value of f in increasing order, we will move our theoretical multivariate Gaussian little by little. A better solution is to brutally move the center of our multivariate Gaussian to the best candidate seen so far, as follows:

    µ̂t = arg min_{X ∈ X} f(X)    (9)

We call this strategy two. Intuitively, strategy two should be best when starting the algorithm, while strategy one would be better once we are close to the solution.

To recover the true variance, we can adapt what we did in strategy one as follows:

    Σ̂t = Σ_{i=1}^{k} w_(i),w↓ · ( X_(i),f↑ − X̄_(.),f↑ )( X_(i),f↑ − X̄_(.),f↑ )^T − ( Σ_{i=1}^{k} wi · ( Xi − X̄ )( Xi − X̄ )^T − Σt )    (10)

where the first sum is the MC covariance for X_{f↑}, the term in parentheses is the MC covariance bias for the simulated X, and X̄_(.),f↑ = Σ_{i=1}^{k} w_(i),w↓ X_(i),f↑ and X̄ = Σ_{i=1}^{k} wi Xi are respectively the means of the sorted and non sorted points.
• Again, we could design another strategy that takes only part of the points, but we leave this to further research.

Once we have the likelihood mean and variance using (9) and (10) or (8) and (10), we update the posterior law according to equation (5). This gives us the iterative conjugate prior parameter updates:

    µ_{t+1} = (κt µt + n µ̂t) / (κt + n),
    κ_{t+1} = κt + n,
    v_{t+1} = vt + n,    (11)
    ψ_{t+1} = ψt + Σ̂t + ( κt n / (κt + n) ) ( µ̂t − µt )( µ̂t − µt )^T

The resulting algorithm is summarized in Algorithm 1.
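Because the two sorts and the weight normalization are easy to get wrong, here is a compact Python sketch of strategies one and two; it is our own reading of equations (8)-(10), not the authors' code, and the names are illustrative.

import numpy as np
from scipy.stats import multivariate_normal

def correct_mean_cov(X, fvals, mu_t, cov_t, strategy="one"):
    # Estimate (mu_hat, Sigma_hat) from candidates, equations (8)-(10).
    # X is (k, d), fvals the objective values f(X_i), mu_t = E_t[mu], cov_t = E_t[Sigma].
    dens = multivariate_normal(mean=mu_t, cov=cov_t).pdf(X)   # densities d_i
    w = dens / dens.sum()                                     # weights, positive and summing to one
    w_sorted = np.sort(w)[::-1]                               # w_(i),w-decreasing
    X_by_f = X[np.argsort(fvals, kind="stable")]              # X_(i),f-increasing (stable sort)

    mc_mean = w @ X                                           # plain MC mean of the sample
    mean_f = w_sorted @ X_by_f                                # MC mean for X_{f up}
    if strategy == "one":
        mu_hat = mean_f - (mc_mean - mu_t)                    # equation (8)
    else:
        mu_hat = X[np.argmin(fvals)]                          # equation (9), strategy two

    dev_f = X_by_f - mean_f
    dev = X - mc_mean
    cov_f = (w_sorted[:, None] * dev_f).T @ dev_f             # MC covariance for X_{f up}
    cov_mc = (w[:, None] * dev).T @ dev                       # MC covariance of the raw sample
    sigma_hat = cov_f - (cov_mc - cov_t)                      # equation (10)
    return mu_hat, sigma_hat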
Proposition 3.1. Under the assumption of a NIW prior, the updates of the BCMA-ES parameters for the expected mean and variance write as a weighted combination of the prior expected mean and variance and the empirical mean and variance, as follows:

    E_{t+1}[µ] = Et[µ] + wt^µ ( µ̂t − Et[µ] ),
    E_{t+1}[Σ] = wt^{Σ,1} Et[Σ] + wt^{Σ,2} ( µ̂t − Et[µ] )( µ̂t − Et[µ] )^T + wt^{Σ,3} Σ̂t    (12)

where

    wt^µ = n / (κt + n),
    wt^{Σ,1} = (vt − n − 1) / (vt − 1),
    wt^{Σ,2} = κt n / ( (κt + n)(vt − 1) ),
    wt^{Σ,3} = 1 / (vt − 1),

so that wt^{Σ,1} acts as a discount factor on Et[Σ], the second term is a rank one matrix, and the last term involves the rank (n−1) empirical covariance matrix Σ̂t.

Remark 3.1. The proposition above is quite fundamental. It justifies that, under the assumption of a NIW prior, the update is a weighted sum of the previous expected mean and covariance. It is striking that it provides very similar formulae to the standard CMA-ES update. Recall that these updates, given for the mean mt and covariance Ct, can be written as follows:

    m_{t+1} = mt + Σ_{i=1}^{µ} wi ( x_{i:λ} − mt )

    C_{t+1} = (1 − c1 − cµ + cs) Ct + c1 pc pc^T + cµ Σ_{i=1}^{µ} wi ( (x_{i:λ} − mt) / σt ) ( (x_{i:λ} − mt) / σt )^T    (13)

where the first term is a discounted Ct, the second term is a rank one matrix, the third term is a rank min(µ, n−1) matrix, and the notations mt, wi, x_{i:λ}, Ct, c1, cµ, cs, σt, etc. are given for instance in [26].

Proof. See 6.3 in the appendix section. □

3.3 Particularities of Bayesian CMA-ES
There are some subtleties that need to be emphasized.
• Although we assume a prior, we do not need to simulate the prior but can at each step use the expected values of the prior, which means that we do not consume additional simulations compared to the standard CMA-ES.
• We need to tackle local minima (we will give examples of this in the numerical section) to avoid being trapped in a bowl! If we are in a local minimum, we need to inflate the variance to increase our search space. We do this whenever our algorithm does not manage to decrease. However, if after a while we do not get a better result, we assume that this is indeed not a local minimum but rather a global minimum, and we start deflating the variance. This mechanism of inflation and deflation ensures we can handle noisy functions like the Rastrigin, Schwefel 1 or Schwefel 2 functions defined in section 4.

3.4 Differences with standard CMA-ES
Since we use a rigorous derivation of the posterior, we have the following features:
• the update of the covariance takes all points. This is different from the (µ/λ) CMA-ES, which uses only a subset of the points.
• by design, the update is optimal, as we compute the posterior at each step.
• the contraction dilatation mechanism is an alternative to the global and local search paths in standard CMA-ES.
• weights vary across iterations, which is also a major difference between the main CMA-ES and the Bayesian CMA-ES. Weights are proportional to exp( −(1/2) X^T Σ^{−1} X ), sorted in decreasing order. Initially, when the variance is large, the weights are quite similar.

3.5 Full algorithm
The complete Bayesian CMA-ES algorithm is summarized in Algorithm 2. It iterates until a stopping condition is met. We use multiple stopping conditions. We stop if we have not improved our best result for a given number of iterations. We stop if we have reached the maximum number of iterations. We stop if our variance norm is small. Additional stopping conditions can be incorporated easily.

Algorithm 1 Predict and Correct parameters at step t
1: Simulate candidates
2: Use mean values Et[µ] = µt and Σt = E[Σ] = ψt / (vt − n − 1)
3: Simulate k points X = {Xi}_{i=1..k} ∼ N(Et[µ], Σt)
4: Compute densities (di)_{i=1..k} = (N(Et[µ], Σt)(Xi))_{i=1..k}
5: Sort in decreasing order with respect to d to get {X_(i),d↓, d_(i),d↓}_{i=1..k}
6: Stable sort in increasing order with respect to f(Xi) to get {X_(i),f↑, d_(i),d↓}_{i=1..k}
7:
8: Correct Et[µ] and Σt
9: Either update Et[µ] and Σt using (9) and (10) (strategy two)
10: Or update Et[µ] and Σt using (8) and (10) (strategy one)
11: Update µ_{t+1}, κ_{t+1}, v_{t+1}, ψ_{t+1} using (11)

Algorithm 2 Bayesian update of CMA-ES parameters
1: Initialization
2: Start with a prior distribution Π on µ and Σ
3: Set retrial to 0
4: Set fmin to max float
5: while stop criteria not satisfied do
6:     X ∼ N(µ, Σ)
7:     Update the parameters of the Gaussian thanks to the posterior law Π(µ, Σ|X), following the details given in Algorithm 1
8:     Handle dilatation and contraction of the variance for local minima as explained in Algorithm 3
9:     if DilateContractFunc(X, Σt, Xmin, fmin, Σt,min) == 1 then
10:        return best solution
11:    end if
12: end while
13: return best solution
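Putting the pieces together, a minimal Python skeleton of the main loop of Algorithm 2 could look as follows. It is our own sketch, not the authors' implementation: it reuses the helper sketches given earlier (simulate_candidates, correct_mean_cov), the strategy switching rule and the stopping thresholds are illustrative, and the dilatation contraction step is omitted here and sketched after Algorithm 3 below.

import numpy as np

def bcma_es(f, mu0, kappa0, v0, psi0, k=20, max_iter=500, var_tol=1e-12, patience=50):
    # Skeleton of Algorithm 2 (illustrative sketch, dilatation/contraction omitted).
    mu_t, kappa_t, v_t, psi_t = np.asarray(mu0, float), kappa0, v0, np.asarray(psi0, float)
    x_min, f_min, stall = mu_t.copy(), np.inf, 0
    for t in range(max_iter):
        mean, cov, X = simulate_candidates(mu_t, kappa_t, v_t, psi_t, k)     # Algorithm 1, steps 1-3
        fvals = np.array([f(x) for x in X])
        i_best = int(np.argmin(fvals))
        if fvals[i_best] < f_min:                                            # track best solution and stagnation
            x_min, f_min, stall = X[i_best].copy(), fvals[i_best], 0
        else:
            stall += 1
        strategy = "two" if t < 10 else "one"                                # illustrative switch between strategies
        mu_hat, sigma_hat = correct_mean_cov(X, fvals, mean, cov, strategy)  # equations (8)/(9) and (10)
        n = k                                                                # number of points fed to the update
        psi_t = psi_t + sigma_hat + (kappa_t * n / (kappa_t + n)) * np.outer(mu_hat - mu_t, mu_hat - mu_t)
        mu_t = (kappa_t * mu_t + n * mu_hat) / (kappa_t + n)                 # conjugate update, equation (11)
        kappa_t, v_t = kappa_t + n, v_t + n
        if stall > patience or np.linalg.norm(cov) < var_tol:                # stopping conditions of section 3.5
            break
    return x_min, f_min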
Last but not least, we have a dilatation contraction mechanism for the variance to handle local minima, with multiple levels of contraction and dilatation, given in function 3. The overall idea is first to dilate the variance if we do not make any progress, in order to increase the search space so that we are not trapped in a local minimum. Should this not succeed, it means that we are reaching something that looks like the global minimum, and we progressively contract the variance. In our implemented algorithm, we take L1 = 5, L2 = 20, L3 = 30, L4 = 40, L5 = 50 and the dilatation and contraction parameters given by k1 = 1.5, k2 = 0.9, k3 = 0.7, k4 = 0.5. We also have a restart at the previous minimum at level L∗ = L2.

Algorithm 3 Dilatation contraction variance for local minima
1: Function DilateContractFunc(X, Σt, Xmin, fmin, Σt,min)
2: if f(X) ≤ fmin then
3:     Set fmin = f(X)
4:     Memorize current point and its variance:
5:     • Xmin = X
6:     • Σt,min = Σt
7:     Set retrial = 0
8: else
9:     Set retrial += 1
10:    if retrial == L∗ then
11:        Restart at previous best solution:
12:        • X = Xmin
13:        • Σt = Σt,min
14:    end if
15:    if L2 > retrial and retrial > L1 then
16:        Dilate variance by k1
17:    else if L3 > retrial and retrial ≥ L2 then
18:        Contract variance by k2
19:    else if L4 > retrial and retrial ≥ L3 then
20:        Contract variance by k3
21:    else if L5 > retrial and retrial ≥ L4 then
22:        Contract variance by k4
23:    else
24:        return 1
25:    end if
26:    return 0
27: end if
28: End Function
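Read as code, Algorithm 3 amounts to the following sketch (ours); the state dictionary and the in-place scaling of the covariance are simplifications of the pseudocode above, and the thresholds are the values quoted in the text.

import numpy as np

# Thresholds and factors quoted in the text; the function itself is an illustrative
# rendering of Algorithm 3 operating on a small state dictionary.
L1, L2, L3, L4, L5 = 5, 20, 30, 40, 50
K1, K2, K3, K4 = 1.5, 0.9, 0.7, 0.5
L_STAR = L2

def dilate_contract(x, f_x, sigma_t, state):
    # Returns (stop, x, sigma_t); state holds x_min, f_min, sigma_min and retrial.
    if f_x <= state["f_min"]:
        state.update(f_min=f_x, x_min=x.copy(), sigma_min=sigma_t.copy(), retrial=0)
        return 0, x, sigma_t
    state["retrial"] += 1
    r = state["retrial"]
    if r == L_STAR:                                   # restart at the previous best solution
        x, sigma_t = state["x_min"].copy(), state["sigma_min"].copy()
    if L1 < r < L2:
        sigma_t = sigma_t * K1                        # dilate the variance
    elif L2 <= r < L3:
        sigma_t = sigma_t * K2                        # contract
    elif L3 <= r < L4:
        sigma_t = sigma_t * K3
    elif L4 <= r < L5:
        sigma_t = sigma_t * K4
    else:
        return 1, x, sigma_t                          # give up: this looks like the global minimum
    return 0, x, sigma_t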

4 NUMERICAL RESULTS
4.1 Functions examined
We have examined four functions to stress test our algorithm. They are listed in increasing order of complexity for our algorithm and correspond to different types of functions. They are all generalized functions that can be defined for any dimension n. For all of them, we present the corresponding equation for a variable x = (x1, x2, .., xn) of dimension n. Code is provided in the supplementary materials. We have frozen the seeds to have reproducible results.

4.1.1 Cone. The simplest function to optimize is the quadratic cone, whose equation is given by (14) and which is represented in figure 2. It is also the standard Euclidean norm. It is obviously convex and is a good test of the performance of an optimization method.

    f(x) = ( Σ_{i=1}^{n} xi^2 )^{1/2} = ∥x∥2    (14)

Figure 2: A simple convex function: the quadratic norm. Minimum in 0.

4.1.2 Schwefel 2 function. A slightly more complicated function is the Schwefel 2 function, whose equation is given by (15) and which is represented in figure 3. It is a piecewise linear function and validates that the algorithm can cope with non convex functions.

    f(x) = Σ_{i=1}^{n} |xi| + Π_{i=1}^{n} |xi|    (15)

Figure 3: Schwefel 2 function: a simple piecewise linear function.

4.1.3 Rastrigin. The Rastrigin function, first proposed by [22] and generalized by [20], is more difficult compared to the Cone and the Schwefel 2 function. Its equation is given by (16) and it is represented in figure 4. It is a non-convex function often used as a performance test problem for optimization algorithms. It is a typical example of a non-linear multi modal function. Finding its minimum is considered a good stress test for an optimization algorithm, due to its large search space and its large number of local minima.

    f(x) = 10 × n + Σ_{i=1}^{n} ( xi^2 − 10 cos(2π xi) )    (16)

Figure 4: Rastrigin function: a non convex, multimodal function with a large number of local minima.

4.1.4 Schwefel 1 function. The last function we tested is the Schwefel 1 function, whose equation is given by (17) and which is represented in figure 5. It is sometimes only defined on [−500, 500]^n. The Schwefel 1 function shares similarities with the Rastrigin function: it is continuous, not convex, multi-modal and has a large number of local minima. The extra difficulty compared to the Rastrigin function is that the local minima are more pronounced local bowls, making the optimization even harder.

    f(x) = 418.9829 × n − Σ_{i=1}^{n} [ xi sin(√|xi|) 1_{|xi|<500} + 500 sin(√500) 1_{|xi|≥500} ]    (17)
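The supplementary code is not reproduced here, but the four generalized test functions are standard and can be written, for instance, as:

import numpy as np

def cone(x):                                   # equation (14): Euclidean norm
    return np.linalg.norm(x)

def schwefel2(x):                              # equation (15): sum plus product of absolute values
    a = np.abs(np.asarray(x, float))
    return a.sum() + a.prod()

def rastrigin(x):                              # equation (16)
    x = np.asarray(x, float)
    return 10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))

def schwefel1(x):                              # equation (17), with the clipping used in the paper
    x = np.asarray(x, float)
    inside = np.abs(x) < 500.0
    terms = np.where(inside, x * np.sin(np.sqrt(np.abs(x))), 500.0 * np.sin(np.sqrt(500.0)))
    return 418.9829 * x.size - terms.sum()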

Figure 6: Convergence for the Cone function

Figure 5: Schwefel 1 function: a non convex, multimodal function with a large number of pronounced local bowls.

4.2 Convergence
For each of the functions, we compared our method using strategy one, entitled B-CMA-ES S1 (update µ̂t and Σ̂t using (8) and (10)), plotted in orange, strategy two, B-CMA-ES S2 (same update but using (9) and (10)), plotted in blue, and the standard CMA-ES as provided by the open-source Python package pycma, plotted in green. We clearly see that strategies one and two behave quite similarly to standard CMA-ES. The convergence graphics, which show the error compared to the minimum, are represented:
• for the cone function in figure 6 (case of a convex function), with initial point (10, 10);
• for the Schwefel 2 function in figure 7 (case of a piecewise linear function), with initial point (10, 10);
• for the Rastrigin function in figure 8 (case of a non convex function with multiple local minima), with initial point (10, 10);
• and for the Schwefel 1 function in figure 9 (case of a non convex function with multiple large bowl local minima), with initial point (400, 400).

Figure 7: Convergence for the Schwefel 2 function

Figure 8: Convergence for the Rastrigin function

Figure 9: Convergence for the Schwefel 1 function

The results are for one test run. In a forthcoming paper, we will benchmark them with more runs to validate the interest of this new method. For the four functions, BCMA-ES achieves a convergence similar to standard CMA-ES. The intuition of this good convergence is that shifting the multivariate mean to the best candidate seen so far is a good guess to update it at the next run (standard CMA-ES or B-CMA-ES S1).
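For reference, a single comparison run in the spirit of this section could be set up as follows. This is our sketch: the pycma options and the NIW prior hyper-parameters are illustrative and are not the settings used to produce figures 6 to 9, and rastrigin and bcma_es refer to the sketches given earlier in this paper.

import numpy as np
import cma                        # the pycma package used as the baseline

# Two-dimensional Rastrigin, initial point (10, 10) as in figure 8.
x0 = np.array([10.0, 10.0])

es = cma.CMAEvolutionStrategy(x0, 1.0, {"seed": 1, "verbose": -9})
es.optimize(rastrigin)
print("pycma   :", es.result.xbest, es.result.fbest)

# Loose, illustrative NIW prior centered on the same initial point.
x_best, f_best = bcma_es(rastrigin, mu0=x0, kappa0=1.0, v0=10.0,
                         psi0=25.0 * np.eye(2), k=20)
print("B-CMA-ES:", x_best, f_best)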

5 CONCLUSION
In this paper, we have revisited the CMA-ES algorithm and provided a Bayesian version of it. Taking conjugate priors, we can find optimal updates for the mean and covariance of the multivariate normal. We have provided the corresponding algorithm, which is a new version of CMA-ES. First numerical experiments show that this new version is competitive with standard CMA-ES on traditional functions such as the cone, Schwefel 1, Rastrigin and Schwefel 2. This faster convergence can be explained, on the theoretical side, by the optimal update of the prior (thanks to the Bayesian update) and by the use of the best candidate seen at each simulation to shift the mean of the multivariate Gaussian likelihood. We envisage further works to benchmark our algorithm against other standard evolutionary algorithms, in particular to use the COCO platform to provide more meaningful tests and confirm the theoretical intuition of good performance of this new version of CMA-ES, and to test the importance of the prior choice.

6 APPENDIX
6.1 Conjugate priors
Proof. Consider n independent and identically distributed (IID) measurements X ≜ {x^j ∈ R^d | 1 ≤ j ≤ n} and assume that these variables have an exponential family density. The likelihood p(X|θ, λ) writes simply as the product of each individual likelihood:

    p(X|θ, λ) = ( Π_{j=1}^{n} h(x^j) ) exp( η(θ, λ)^T Σ_{j=1}^{n} T(x^j) − n A(η(θ, λ)) ).    (18)

If we start with a prior π(θ) of the form π(θ) ∝ exp(F(θ)) for some function F(·), its posterior writes:

    π(θ|X) ∝ p(X|θ) exp(F(θ)) ∝ exp( η(θ, λ) · Σ_{j=1}^{n} T(x^j) − n A(η(θ, λ)) + F(θ) ).    (19)

It is easy to check that the posterior (19) is in the same exponential family as the prior iff F(·) is of the form:

    F(θ) = µ1 · η(θ, λ) − µ0 A(η(θ, λ))    (20)

for some µ ≜ (µ0, µ1), such that:

    π(θ|X) ∝ exp( ( µ1 + Σ_{j=1}^{n} T(x^j) )^T η(θ, λ) − (n + µ0) A(η(θ, λ)) ).    (21)

Hence, the conjugate prior for the likelihood (18) is parametrized by µ and given by:

    π(θ) = (1/Z) exp( µ1 · η(θ, λ) − µ0 A(η(θ, λ)) ),    (22)

where Z = ∫ exp( µ1 · η(θ, λ) − µ0 A(η(θ, λ)) ) dθ. □

6.2 Exact computation of the posterior update for the normal inverse Wishart
To make our proof simple, we first start with the one dimensional case and show that in one dimension the conjugate prior is a normal inverse gamma. We then generalize to the multi dimensional case.

Lemma 6.1. The probability density function of a normal inverse gamma (denoted by NIG) random variable can be expressed as the product of a Normal and an Inverse gamma probability density function.

Proof. We suppose that x | µ0, σ² ∼ N(µ0, σ²/v). We recall the following definition of conditional probability:

Definition 6.1. Suppose that events A, B and C are defined on the same probability space, and the event B is such that P(B) > 0. We have the following expression:

    P(A ∩ B | C) = P(A | B, C) P(B | C).

Applying 6.1, we have:

    p(µ, σ² | µ0, v, α, β) = p(µ | σ², µ0, v, α, β) p(σ² | µ0, v, α, β) = p(µ | σ², µ0, v) p(σ² | α, β).    (23)

Using the definition of the normal inverse gamma law, we end the proof. □

Remark 6.1. If (x, σ²) ∼ NIG(µ, λ, α, β), the probability density function is the following:

    f(x, σ² | µ, λ, α, β) = ( √λ / (σ √(2π)) ) ( β^α / Γ(α) ) (1/σ²)^{α+1} exp( − ( 2β + λ(x − µ)² ) / (2σ²) ).    (24)

Proposition 6.2. The normal inverse gamma NIG(µ0, v, α, β) distribution is a conjugate prior of a normal distribution with unknown mean and variance.

Proof. The posterior is proportional to the product of the prior and the likelihood, hence:

    p(µ, σ² | X) ∝ ( v / (2π σ²) )^{1/2} exp( − v(µ − µ0)² / (2σ²) )
                   × ( β^α / Γ(α) ) (1/σ²)^{α+1} exp( − β / σ² )
                   × ( 1 / (2π σ²) )^{n/2} exp( − Σ_{i=1}^{n} (xi − µ)² / (2σ²) ).    (25)

Defining the empirical mean and variance as x̄ = (1/n) Σ_{i=1}^{n} xi and s = (1/n) Σ_{i=1}^{n} (xi − x̄)², we obtain that Σ_{i=1}^{n} (xi − µ)² = n ( s + (x̄ − µ)² ). So, the conditional density writes:

    p(µ, σ² | X) ∝ √v ( 1/σ² )^{α + n/2 + 3/2} exp( − (1/σ²) [ β + (1/2) ( v(µ − µ0)² + n ( s + (x̄ − µ)² ) ) ] ).    (26)

Besides,

    v(µ − µ0)² + n ( s + (x̄ − µ)² ) = v ( µ² − 2µµ0 + µ0² ) + n s + n ( x̄² − 2x̄µ + µ² )
                                    = µ² (v + n) − 2µ (vµ0 + n x̄) + v µ0² + n s + n x̄².    (27)

Denoting a = v + n and b = vµ0 + n x̄, we have:

    β + (1/2) [ v(µ − µ0)² + n ( s + (x̄ − µ)² ) ] = β + (1/2) [ a µ² − 2bµ + v µ0² + n s + n x̄² ]
                                                  = β + (1/2) [ a ( µ − b/a )² − b²/a + v µ0² + n s + n x̄² ].    (28)

So we can express the proportional expression of the posterior:

    p(µ, σ² | X) ∝ ( 1/σ² )^{α⋆ + 3/2} exp( − ( 2β⋆ + λ⋆ (µ − µ⋆)² ) / (2σ²) ),

with
• α⋆ = α + n/2
• β⋆ = β + (1/2) Σ_{i=1}^{n} (xi − x̄)² + n v (x̄ − µ0)² / ( 2(n + v) )
• µ⋆ = ( v µ0 + n x̄ ) / ( v + n )
• λ⋆ = v + n

We can identify the terms with the expression of the probability density function given in 6.1 to conclude that the posterior follows a NIG(µ⋆, λ⋆, α⋆, β⋆). □
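The one dimensional update derived above can be checked numerically against the multivariate update of equation (5), using the correspondence NIW(µ0, κ0, v0, ψ) ↔ NIG(µ0, κ0, v0/2, ψ/2) in dimension one. The sketch below is ours and reuses the niw_posterior helper given after Proposition 2.3.

import numpy as np

def nig_posterior(mu0, v, alpha, beta, x):
    # Normal inverse gamma posterior update derived in appendix 6.2 (illustrative sketch).
    x = np.asarray(x, float)
    n, xbar = x.size, x.mean()
    mu_star = (v * mu0 + n * xbar) / (v + n)
    lam_star = v + n
    alpha_star = alpha + n / 2.0
    beta_star = beta + 0.5 * np.sum((x - xbar) ** 2) + n * v * (xbar - mu0) ** 2 / (2.0 * (n + v))
    return mu_star, lam_star, alpha_star, beta_star

# Consistency check in dimension one: NIG(0, 2, 3, 1.5) corresponds to NIW(0, 2, 6, 3).
x = np.array([0.3, -1.2, 0.7, 2.1])
mu_s, lam_s, a_s, b_s = nig_posterior(0.0, 2.0, 3.0, 1.5, x)
mu_w, kappa_w, v_w, psi_w = niw_posterior(np.array([0.0]), 2.0, 6.0, np.array([[3.0]]), x[:, None])
assert np.isclose(mu_s, mu_w[0]) and np.isclose(lam_s, kappa_w)
assert np.isclose(a_s, v_w / 2.0) and np.isclose(b_s, psi_w[0, 0] / 2.0)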

We are now ready to prove the following proposition:

Proposition 6.3. The normal inverse Wishart (denoted by NIW) distribution NIW(µ0, κ0, v0, ψ) is a conjugate prior of a multivariate normal distribution with unknown mean and covariance.

Proof. We use the fact that the probability density function of a normal inverse Wishart random variable can be expressed as the product of a Normal and an Inverse Wishart probability density function (by the same reasoning as in 6.1). Besides, the posterior is proportional to the product of the prior and the likelihood. We first express the probability density function of the multivariate Gaussian random variable in a proper way, in order to use it when we write the posterior density function:

    Σ_{i=1}^{n} (xi − µ)^T Σ^{−1} (xi − µ) = n (x̄ − µ)^T Σ^{−1} (x̄ − µ) + Σ_{i=1}^{n} (xi − x̄)^T Σ^{−1} (xi − x̄).    (29)

We can inject the previous result and use the properties of the trace function to express the probability density function of the multivariate Gaussian random variable with parameters µ and Σ. The density writes as:

    ( |Σ|^{−n/2} / (2π)^{pn/2} ) exp( − (n/2) (x̄ − µ)^T Σ^{−1} (x̄ − µ) − (1/2) tr( Σ^{−1} Σ_{i=1}^{n} (xi − x̄)(xi − x̄)^T ) ).    (30)

Hence, we can compute the posterior explicitly as follows:

    p(µ, Σ | X) ∝ ( √κ0 / ( (2π)^{p} |Σ| )^{1/2} ) exp( − (κ0/2) (µ − µ0)^T Σ^{−1} (µ − µ0) )
                 × ( |ψ|^{v0/2} / ( 2^{v0 p/2} Γp(v0/2) ) ) |Σ|^{−(v0+p+1)/2} exp( − (1/2) tr( ψ Σ^{−1} ) )
                 × |Σ|^{−n/2} exp( − (n/2) (x̄ − µ)^T Σ^{−1} (x̄ − µ) − (1/2) tr( Σ^{−1} Σ_{i=1}^{n} (xi − x̄)(xi − x̄)^T ) )    (31)

    ∝ |Σ|^{−(v0+p+2+n)/2} exp( − (κ0/2) (µ − µ0)^T Σ^{−1} (µ − µ0) − (n/2) (x̄ − µ)^T Σ^{−1} (x̄ − µ)
                               − (1/2) tr( Σ^{−1} ( ψ + Σ_{i=1}^{n} (xi − x̄)(xi − x̄)^T ) ) ).    (32)

We organize the terms and find the parameters of our normal inverse Wishart random variable NIW(µ0⋆, κ0⋆, v0⋆, ψ⋆):

    µ0⋆ = (κ0 µ0 + n x̄) / (κ0 + n),    κ0⋆ = κ0 + n,    v0⋆ = v0 + n,
    ψ⋆ = ψ + Σ_{i=1}^{n} (xi − x̄)(xi − x̄)^T + ( κ0 n / (κ0 + n) ) (x̄ − µ0)(x̄ − µ0)^T    (33)

which are exactly the equations provided in (5). □

6.3 Weighted combination for the BCMA-ES update
Proof.

    E_{t+1}[µ] = µ_{t+1} = (κt µt + n µ̂t) / (κt + n) = Et[µ] + wt^µ ( µ̂t − Et[µ] )    (34)

    E_{t+1}[Σ] = ψ_{t+1} / (v_{t+1} − n − 1)
               = (1 / (vt − 1)) ψt + (1 / (vt − 1)) Σ̂t + ( κt n / ( (κt + n)(vt − 1) ) ) ( µ̂t − µt )( µ̂t − µt )^T
               = wt^{Σ,1} Et[Σ] + wt^{Σ,2} ( µ̂t − Et[µ] )( µ̂t − Et[µ] )^T + wt^{Σ,3} Σ̂t    (35)

where

    wt^µ = n / (κt + n),
    wt^{Σ,1} = (vt − n − 1) / (vt − 1),
    wt^{Σ,2} = κt n / ( (κt + n)(vt − 1) ),
    wt^{Σ,3} = 1 / (vt − 1),

so that the first term is a discounted Et[Σ]. Σ̂t is a covariance matrix of rank n − 1, as we subtract the empirical mean (which removes one degree of freedom), while the matrix (µ̂t − Et[µ])(µ̂t − Et[µ])^T is of rank 1, as it is parametrized by the vector µ̂t. □

REFERENCES
[1] Youhei Akimoto, Anne Auger, and Nikolaus Hansen. 2015. Continuous Optimization and CMA-ES. GECCO 2015, Madrid, Spain 1 (2015), 313–344.
[2] Youhei Akimoto, Anne Auger, and Nikolaus Hansen. 2016. CMA-ES and Advanced Adaptation Mechanisms. GECCO, Denver 2016 (2016), 533–562.
[3] Youhei Akimoto, Yuichi Nagata, Isao Ono, and Shigenobu Kobayashi. 2010. Bidirectional Relation between CMA Evolution Strategies and Natural Evolution Strategies. PPSN XI, 1 (2010), 154–163.
[4] Anne Auger and Nikolaus Hansen. 2009. Benchmarking the (1+1)-CMA-ES on the BBOB-2009 noisy testbed. Companion Material GECCO 2009 (2009), 2467–2472.
[5] Anne Auger and Nikolaus Hansen. 2012. Tutorial CMA-ES: evolution strategies and covariance matrix adaptation. Companion Material Proceedings 2012, 12 (2012), 827–848.
[6] Anne Auger, Marc Schoenauer, and Nicolas Vanhaecke. 2004. LS-CMA-ES: A Second-Order Algorithm for Covariance Matrix Adaptation. PPSN VIII, 8th International Conference, Birmingham, UK, September 18-22, 2004, Proceedings (2004), 182–191.
[7] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2004. Bayesian Data Analysis (2nd ed.). Chapman and Hall/CRC, New York.
[8] T. Glasmachers, T. Schaul, S. Yi, D. Wierstra, and J. Schmidhuber. 2010. Exponential natural evolution strategies. In Proceedings of the Genetic and Evolutionary Computation Conference (2010), 393–400.
[9] Nikolaus Hansen. 2016. The CMA Evolution Strategy: A Tutorial. Preprint. arXiv:1604.00772
[10] Nikolaus Hansen and Anne Auger. 2011. CMA-ES: evolution strategies and covariance matrix adaptation. GECCO 2011, 1 (2011), 991–1010.
[11] Nikolaus Hansen and Anne Auger. 2014. Evolution strategies and CMA-ES (covariance matrix adaptation). GECCO Vancouver 2014, 14 (2014), 513–534.
[12] Nikolaus Hansen and Andreas Ostermeier. 2001. Completely Derandomized Self-Adaptation in Evolution Strategies. Evolutionary Computation 9, 2 (2001), 159–195.
[13] Verena Heidrich-Meisner and Christian Igel. 2009. Neuroevolution strategies for episodic reinforcement learning. J. Algorithms 64, 4 (2009), 152–168.
[14] Christian Igel. 2010. Evolutionary Kernel Learning. In Encyclopedia of Machine Learning and Data Mining. Springer, New York, 465–469.
[15] Christian Igel, Nikolaus Hansen, and Stefan Roth. 2007. Covariance Matrix Adaptation for Multi-objective Optimization. Evol. Comput. 15, 1 (March 2007), 1–28.
[16] Christian Igel, Verena Heidrich-Meisner, and Tobias Glasmachers. 2009. Shark. Journal of Machine Learning Research 9 (2009), 993–996.
[17] M. I. Jordan. 2010. Lecture notes: Justification for Bayes. https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture2.pdf
[18] Ilya Loshchilov and Frank Hutter. 2016. CMA-ES for Hyperparameter Optimization of Deep Neural Networks. arXiv e-prints (April 2016). arXiv:cs.NE/1604.07269
[19] Jean-Michel Marin and Christian P. Robert. 2007. Bayesian Core: A Practical Approach to Computational Bayesian Statistics (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg.
[20] H. Mühlenbein, M. Schomisch, and J. Born. 1991. The parallel genetic algorithm as function optimizer. Parallel Comput. 17, 6 (1991), 619–632.
[21] Yann Ollivier, Ludovic Arnold, Anne Auger, and Nikolaus Hansen. 2017. Information-geometric Optimization Algorithms: A Unifying Picture via Invariance Principles. J. Mach. Learn. Res. 18, 1 (Jan. 2017), 564–628.
[22] L. A. Rastrigin. 1974. Systems of extremal control. Mir, Moscow.
[23] C. P. Robert. 2007. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer, New York.
[24] M. J. Schervish. 1996. Theory of Statistics. Springer, New York. https://books.google.fr/books?id=F9A9af4It10C
[25] Konstantinos Varelas, Anne Auger, Dimo Brockhoff, Nikolaus Hansen, Ouassim Ait ElHara, Yann Semet, Rami Kassab, and Frédéric Barbaresco. 2018. A Comparative Study of Large-Scale Variants of CMA-ES. PPSN XV - 15th International Conference, Coimbra, Portugal (2018), 3–15.
[26] Wikipedia. 2018. CMA-ES. https://en.wikipedia.org/wiki/CMA-ES