
Deriving and Improving CMA-ES with Information Geometric Trust Regions

Abbas Abdolmaleki (PARC, Palo Alto Research Center; IEETA, University of Aveiro; LIACC, University of Porto), [email protected]
Bob Price (PARC, Palo Alto Research Center), [email protected]
Nuno Lau (DETI, IEETA, University of Aveiro), [email protected]
Luis Paulo Reis (DSI, University of Minho; LIACC, University of Porto), [email protected]
Gerhard Neumann (CLAS, TU Darmstadt; University of Lincoln), [email protected]

ABSTRACT

CMA-ES is one of the most popular stochastic search algorithms. It performs favourably in many tasks without the need of extensive parameter tuning. The algorithm has many beneficial properties, including automatic step-size adaptation, efficient covariance updates that incorporate the current samples as well as the evolution path, and its invariance properties. Its update rules are composed of well established heuristics where the theoretical foundations of some of these rules are also well understood. In this paper we fully derive all CMA-ES update rules within the framework of expectation-maximisation-based stochastic search algorithms using information-geometric trust regions. We show that the use of the trust region results in similar updates to CMA-ES for the mean and the covariance matrix, while it allows for the derivation of an improved update rule for the step-size. Our new algorithm, Trust-Region Covariance Matrix Adaptation Evolution Strategy (TR-CMA-ES), is fully derived from first order optimization principles and performs favourably in comparison to the standard CMA-ES algorithm.

KEYWORDS

Stochastic Search, Expectation Maximisation, Trust Regions

ACM Reference format:
Abbas Abdolmaleki, Bob Price, Nuno Lau, Luis Paulo Reis, and Gerhard Neumann. 2017. Deriving and Improving CMA-ES with Information Geometric Trust Regions. In Proceedings of the Genetic and Evolutionary Computation Conference 2017, Berlin, Germany, July 15–19, 2017 (GECCO '17), 8 pages. DOI: 10.475/123 4

1 INTRODUCTION

Stochastic search algorithms [10, 19] are black box optimizers of a fitness function that is either unknown or too complex to be modelled explicitly. These algorithms only use the fitness values of individuals and do not require gradients or higher derivatives of the fitness function. Stochastic search algorithms typically maintain a search distribution over individuals or candidate solutions, which is typically a Gaussian distribution. This search distribution is used to generate a population of individuals which are evaluated by their corresponding fitness values. Subsequently, a new search distribution is computed by either computing gradient based updates [19], expectation-maximisation-based updates [7, 12], evolutionary strategies [10], the cross-entropy method [14] or information-theoretic policy updates [1], such that individuals with higher fitness will have a higher selection probability. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is one of the most popular stochastic search algorithms [10]. It performs well in many tasks and does not require extensive parameter tuning. There are three ingredients for the success of CMA-ES. Firstly, the covariance matrix update efficiently combines the old covariance matrix with the sample covariance. Secondly, the evolution path, which stores the average update direction, is also incorporated in the covariance matrix to shape future update directions. Lastly, the step-size adaptation of CMA-ES scales the distribution efficiently, increasing it when subsequent updates are correlated and decreasing it otherwise. All these update rules are well established heuristics. The foundations of some of these heuristics are also already theoretically well understood. However, a unified mathematical framework that explains all these update rules is so far still missing.

In contrast, expectation-maximisation-based algorithms [7, 12, 14] (Section 2.2) optimize a clearly defined objective, i.e., the maximization of a lower bound. The maximisation of the lower bound in each iteration is equivalent to a weighted maximum likelihood estimate (MLE) of the distribution. It is well known that ML estimation of the covariance matrix yields a degenerate, over-fitted distribution [2], which typically results in premature convergence of the algorithm. We extend the EM-based framework by incorporating information-geometric trust regions. Using trust regions restricts the change of the search distribution in each update. We implement such trust regions by bounding the Kullback-Leibler (KL) divergence¹ of the search distribution (Section 3). Using these trust regions, we can recover exactly the same form of covariance update as in CMA-ES, and the mean update of CMA-ES is a degenerate case of the trust region update. Furthermore, we can derive an improved step-size update rule that is based on the same theoretical foundation as the remaining update rules. Our new algorithm, Trust-Region Covariance Matrix Adaptation Evolution Strategy (TR-CMA-ES), performs favourably in comparison to CMA-ES in all tested scenarios, which include typical standard benchmark functions and complex policy optimization tasks from robotics.

¹ For simplicity we use 'KL' and 'KL divergence' interchangeably.

1.1 Related Work

We review CMA-ES and expectation-maximisation-based algorithms in the preliminaries section. In this section, we briefly review other existing stochastic search algorithms. The Natural Evolution Strategy (NES) uses the natural gradient to optimize the expected fitness value under the search distribution [19]. The natural gradient has been shown to outperform the standard gradient in many applications [6]. The intuition of the natural gradient is that we want to obtain an update vector of the parameters of the search distribution that optimises the expected fitness while the KL-divergence between the new and the current search distribution is bounded. To obtain this update vector, a second order approximation of the KL, which is equivalent to the Fisher information matrix, is used. The resulting natural gradient is obtained by multiplying the standard gradient with the inverse of the Fisher matrix. Yet, in contrast to our algorithm, NES family algorithms do not exactly enforce a desired bound on the KL-divergence but use a constant learning rate [19].

The use of trust regions is a common approach in policy search to obtain a stable policy update and to avoid premature convergence [1, 17, 18]. The model-based relative entropy stochastic search algorithm (MORE) [1] uses a local quadratic surrogate to update its Gaussian search distribution. As these surrogates can be inaccurate, MORE does not exploit the surrogate greedily, but uses a trust region in the form of a KL bound. This trust region defines how much the algorithm can exploit the quadratic model. Moreover, MORE explicitly bounds the entropy loss to avoid premature convergence. This method has been shown to be very competitive when the hyper parameters are set correctly. Similar trust regions have been used in other policy search methods such as Relative Entropy Policy Search [17]. Similar to these methods, our framework also uses a KL bound to define a trust region. However, in difference to all these policy search methods, firstly we use this bound in an EM-based framework and secondly we use the reverse KL bound. As the KL is not symmetric, this is a very important difference. For a more detailed discussion please see Section 3.

2 PRELIMINARIES

In stochastic search, we want to maximize a fitness function f(x): R^n → R. The goal is to find one or more individuals x ∈ R^n which have the highest possible fitness value. The only accessible information is the fitness values of the individuals. Typically, stochastic search algorithms maintain a Gaussian distribution over individuals,

x ∼ N(x | µ, σΣ),

where µ ∈ R^n, Σ ∈ R^{n×n} and σ ∈ R^+ are respectively the mean (our current guess of the optimum), the covariance matrix and the step size, which together define the exploration of the search distribution.² We seek to find a distribution over individuals x, denoted π(x;θ), that maximises the expected fitness

E[f(x)] = ∫ π(x;θ) f(x) dx.   (1)

At each new iteration t+1, N individuals x are drawn from the current Gaussian distribution π(x;θ_t), with θ_t = {µ_t, σ_t, Σ_t}. These individuals are evaluated and, together with their fitness values, form a dataset {x_i, f(x_i)}_{i=1...N} that is subsequently used to compute a new Gaussian search distribution π_{θ_{t+1}}, with θ_{t+1} = {µ_{t+1}, σ_{t+1}, Σ_{t+1}}, such that better individuals will have a higher selection probability.

² In this paper σ is used to denote the variance, which we refer to as the step size.

2.1 Covariance Matrix Adaptation Evolution Strategy

As CMA-ES will be our main baseline for comparisons, in this section we briefly explain the update rules CMA-ES uses to obtain a new search distribution.

Fitness Transformation. CMA-ES uses a rank preserving transformation of the fitness values which makes it invariant under monotone transformations of the fitness function [9]. In particular, CMA-ES gives zero weight to the worse half of the population and the weight of the j-th best individual is proportional to

w_j ∝ ln(N/2 + 0.5) − ln(j).   (2)

We use the same weighting method in our algorithm.

Mean Update Rule. Using the weights, the next distribution mean µ_{t+1} is set to the weighted sum of the individuals, i.e.,

µ_{t+1} = Σ_{i=1}^{N/2} w_i x_i / Z,   Z = Σ_{i=1}^{N/2} w_i.

Evolution Path. The evolution path records the sum of consecutive update steps, i.e., µ_{t+1} − µ_t. If consecutive update steps point in the same direction, i.e., they are correlated, their updates sum up. In contrast, if they are decorrelated, the update directions cancel each other out. Using the information from the evolution path leads to significant improvements in terms of convergence speed, as it enables the algorithm to exploit correlations between consecutive steps.

Covariance Matrix Update Rule. The covariance matrix update is computed using the local information about the fitness function from the individuals, called the rank-µ update, and the information about the evolution of the distribution contained in the evolution path p_{c,t+1}, called the rank-1 update. The covariance update is defined as

Σ_{t+1} = (1 − c_µ − c_1) Σ_t + c_µ Σ_{i=1}^{N} w_i (x_i − µ_t)(x_i − µ_t)^T / σ_t + c_1 p_{c,t+1} p_{c,t+1}^T,

where the second term is the rank-µ update, the third term is the rank-1 update, and c_1 ≥ 0 and c_µ ≥ 0 are learning rates which are set by well established heuristics.

Step Size Update Rule. Subsequently, the evolution path is also used to find a new step size σ_{t+1} such that the difference between the distribution of the actual evolution path and an evolution path under random selection is minimised, see [9] for more details.
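As a concrete illustration of these update rules, the following Python sketch computes the rank-based weights of Equation 2 and performs the mean and covariance updates described above. The learning rates c_µ and c_1 and the evolution path p_c are assumed to be given; the default values in the signature are placeholders rather than the CMA-ES defaults, and the snippet only mirrors the standard formulas, it is not a complete CMA-ES implementation.

    import numpy as np

    def cmaes_weights(N):
        # Rank-based weights (Eq. 2): zero weight for the worse half of the
        # population, log-decreasing weights for the better half, normalised to sum to one.
        mu = N // 2
        w = np.log(N / 2 + 0.5) - np.log(np.arange(1, mu + 1))
        return w / w.sum()

    def cmaes_mean_cov_update(x, f, mu_t, sigma_t, Sigma_t, p_c, c_mu=0.1, c_1=0.05):
        # x: (N, n) individuals, f: (N,) fitness values (larger is better).
        w = cmaes_weights(len(f))
        elite = x[np.argsort(f)[::-1][:len(w)]]      # best half, best first
        mu_new = w @ elite                            # mean update (weighted sum)
        d = elite - mu_t
        rank_mu = d.T @ (w[:, None] * d) / sigma_t    # rank-mu term
        rank_one = np.outer(p_c, p_c)                 # rank-1 term from the evolution path
        Sigma_new = (1 - c_mu - c_1) * Sigma_t + c_mu * rank_mu + c_1 * rank_one
        return mu_new, Sigma_new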

Theoretical Foundation. Recently, there have been theoretical studies justifying the update rules of CMA-ES. For example, in [16] and [4] it has been shown that the CMA-ES mean and rank-µ update rules can be derived by computing the approximate natural gradient of the expected fitness under the search distribution. Moreover, Akimoto et al. [4] conceptually sketch the relation between CMA-ES and the EM-based stochastic search framework. However, that work does not define a concrete mathematical framework for deriving the search distribution update rules. While these theoretical studies provide a great understanding of CMA-ES, to the best of our knowledge, they do not lead to an improvement over the original CMA-ES. One reason is that these frameworks [16], [4] do not derive the rank-1 update of CMA-ES and a step-size update rule, which are crucial features of CMA-ES. Our new algorithm will have all the features of CMA-ES while being fully derived from a single principle.

2.2 Stochastic Search by Expectation-Maximisation

An alternative view to the formulation in Equation 1 is to convert the problem into an inference problem [15]. In order to use inference, the common method is to define a binary fitness event R as the observed variable. To simplify notation we will always write R when we mean R = 1. The probability of this fitness event is given by p(R|x) ∝ C(f(x)), where C is a monotonically increasing (i.e., rank preserving) transformation of the fitness values. Intuitively, p(R|x) defines the probability that individual x is the optimal solution. Hence, we now optimise the objective

p(R;θ) = ∫ p(R|x) π(x;θ) dx,   (3)

where p(R;θ) is the marginal distribution of the event R. We would now like to find the distribution parameters θ that maximise p(R;θ). We can also maximize any strictly increasing function of p(R;θ). In particular, the derivations will be simpler if we instead maximise the log probability of the event R, i.e.,

log p(R;θ) = log ∫ p(R|x) π(x;θ) dx.   (4)

To optimise log p(R;θ), we resort to an iterative expectation-maximization (EM) algorithm [7, 15]. In each iteration, the EM algorithm constructs a lower bound on log p(R;θ) (E-step) and then optimizes that lower bound to obtain a new search distribution π_{t+1} = π(x;θ_{t+1}) (M-step).

Constructing the lower bound (E-Step). Similar to prior work [15], we can now introduce a variational distribution q(x) which is used to decompose log p(R;θ), i.e.,

log p(R;θ) = L(q,θ) + KL(q(x) || p(x|R;θ)),   (5)

where the first term

L(q,θ) = ∫ q(x) log [ p(R|x) π(x;θ) / q(x) ] dx   (6)

is the lower bound of log p(R;θ) and the second term

KL(q(x) || p(x|R;θ)) = − ∫ q(x) log [ p(x|R;θ) / q(x) ] dx   (7)

is the KL-divergence between the variational distribution q(x) and the posterior

p(x|R;θ) = p(R|x) π(x;θ) / ∫ p(R|x) π(x;θ) dx,   (8)

which is proportional to the search distribution π(x;θ) weighted by the fitness event probabilities p(R|x). Equation 5 holds for any variational distribution q(x) and parameters θ, and hence

log p(R;θ) ≥ L(q(x),θ)   (9)

as the KL term is always positive. Note that the lower bound is tight at our current distribution parameters θ_t, i.e., log p(R;θ_t) = L(q(x),θ_t), if we choose a variational distribution q(x) such that KL(q(x) || p(x|R;θ_t)) = 0, which is equivalent to choosing

q(x) = p(x|R;θ_t) = p(R|x) π(x;θ_t) / ∫ p(R|x) π(x;θ_t) dx.

This choice of q(x) gives us a lower bound L(p(x|R;θ_t),θ) on log p(R;θ) such that log p(R;θ_t) = L(p(x|R;θ_t),θ_t). Constructing this lower bound corresponds to the E-step.

Optimising the lower bound (M-Step). In the M-step, we then maximise L(p(x|R;θ_t),θ) with respect to the parameters θ to obtain new parameters θ_{t+1}. Typically, p(R|x) is unknown and we only have access to sample evaluations of the individuals x_i generated from the search distribution π(x;θ_t). In this case, the lower bound can be approximated by

L(q(x),θ) ≈ 1/N Σ_i [ q(x_i) / π(x_i;θ_t) ] log [ p(R|x_i) π(x_i;θ) / q(x_i) ]
           ≈ 1/N Σ_i w_i log π(x_i;θ) + const.,   (10)

where w_i ∝ p(R|x_i). This corresponds to a weighted maximum likelihood estimate. Please note that in Equation 10 we focus on the terms depending on θ; the others are treated as constant.

Monotonic improvement guarantee. Expectation-maximization stochastic search methods inherit a monotonic improvement guarantee from the EM algorithm, i.e., log p(R;θ_t) is always increased (or stays the same) at each iteration. This is easy to see by noting that the lower bound L is tight after the E-step, i.e., log p(R;θ_t) = L(q,θ_t). Therefore, optimising the lower bound in the M-step can only improve log p(R;θ_{t+1}) over log p(R;θ_t).

Disadvantage of EM-based algorithms. While this algorithm can improve the search distribution, it can quickly result in a degenerate distribution which stops exploration and leads to premature convergence, which is a problem in many of these methods [12, 14]. The cause of this limitation is mainly the maximum likelihood estimate (Equation 10), which overfits the current individuals and changes the current search distribution drastically [2].
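To make the M-step of Equation 10 concrete for a Gaussian search distribution, the sketch below fits the mean and covariance by weighted maximum likelihood. The softmax transformation used to turn fitness values into weights w_i ∝ p(R|x_i) is only one possible choice and is not prescribed by the text above.

    import numpy as np

    def em_mstep_gaussian(x, f, beta=1.0):
        # x: (N, n) individuals, f: (N,) fitness values.
        w = np.exp(beta * (f - f.max()))   # one choice of p(R|x_i): softmax of the fitness
        w /= w.sum()
        mean = w @ x                       # weighted ML estimate of the mean
        d = x - mean
        cov = d.T @ (w[:, None] * d)       # weighted ML estimate of the covariance
        return mean, cov

Iterating this update without any regularisation exhibits exactly the problem discussed above: the covariance estimate concentrates on the few highest-weighted individuals and the distribution collapses.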


Figure 1: In this illustration, we use a slightly modified version of the McCormick function in two dimensions. The blue cross shows the optimum and blue contours have higher value than green contours. In iteration one, the algorithm starts with a small distribution. In each iteration we sample five individuals. The figures show that our step size strategy effectively controls the growing and shrinking of the distribution. At the beginning, it naturally grows and, once the optimum has been found, it naturally shrinks (iterations 24-39). Please note that the step size strategy is derived from a well defined, consistent principle.

3 TRUST REGIONS FOR COVARIANCE MATRIX ADAPTATION

In order to remedy the problem of premature convergence in EM-based algorithms, we add an information-geometric trust region to the optimization objective of the lower bound. The trust region regularizes the maximization step and bounds the entropy loss of the new distribution with respect to the current distribution. As we no longer fully maximize the lower bound, our algorithm belongs to the class of generalized expectation-maximisation algorithms [8], which retain the expected improvement guarantee of conventional EM algorithms.

Our constraint optimization problem for optimizing the lower bound is now given by

θ_{t+1} = argmax_θ Σ_i w_i log π(x_i;θ)
s.t. KL(π(x;θ_t) || π(x;θ)) ≤ ϵ,   (11)

where ϵ defines the bound on the KL divergence. With this bound we can define a trust region for our updates to avoid a degenerate distribution. In the next section, we show how to solve this constrained optimisation problem efficiently in closed form for a Gaussian search distribution. The resulting update strategies match the update strategies of CMA-ES except for small, but important, differences which we discuss later in this section. While bounding the KL-divergence is a well established method [1, 13, 17, 18], it has so far not been applied to expectation-maximisation stochastic search algorithms. Another interesting observation is that, in contrast to many existing policy search methods that use a trust region [1, 17], we employ the reverse KL divergence measure, i.e., instead of bounding KL(π || π_t) we bound KL(π_t || π). Hence, our derivation reveals a so far unknown connection between CMA-ES and the many trust region optimization algorithms that use the forward KL, KL(π || π_t). Note that in the case of a Gaussian search distribution, we can solve the optimisation program in Equation 11 in closed form only if we use the reverse KL.

3.1 Update Rules for Multivariate Normal Distributions

In each iteration, we solve the optimization program given in Equation 11 with a Gaussian distribution

π(x;θ) = (2π)^{−n/2} |σΣ|^{−1/2} exp(−0.5 (x − µ)^T σ^{−1} Σ^{−1} (x − µ))

and the KL divergence between two Gaussian distributions [9],

KL(π_t || π) = 0.5 ( tr(σ^{−1} Σ^{−1} σ_t Σ_t) + (µ − µ_t)^T σ^{−1} Σ^{−1} (µ − µ_t) − n + ln(|σΣ| / |σ_t Σ_t|) ),   (12)

where |·| and tr(·) denote the determinant and trace operators, respectively. In general, to solve this constrained optimisation problem, we first construct the Lagrangian, i.e.,

L(η,θ) = Σ_i w_i log π(x_i;θ) + η (ϵ − KL(π_t || π)),

where η is a Lagrangian multiplier. Subsequently, we differentiate L(η,θ) with respect to θ and set the derivative to zero to obtain θ*, which depends on the Lagrangian multiplier η. To compute the optimal value of the Lagrangian multiplier η*, we can optimize the dual function L(η,θ*), which is obtained by setting the optimal θ* back into the Lagrangian. In order to perform an efficient optimization and to obtain the CMA-ES updates, we employ a coordinate descent strategy where we optimize each parameter, i.e., the new mean µ_{t+1}, the covariance matrix Σ_{t+1} and the step size σ_{t+1}, independently. That is, when we optimize, for example, for the covariance matrix Σ, we use the current mean µ_t and step size σ_t for the remaining parameters. This decoupling has three advantages. Firstly, it renders the problem a concave optimisation problem for each search distribution component. This is easy to verify by observing that the second derivative of the Lagrangian L(η,θ) with respect to each parameter is negative definite. Moreover, this decoupling allows us to set different ϵ values for each component, i.e., ϵ_µ, ϵ_Σ, ϵ_σ for the mean, the covariance matrix and the step size respectively. Different ϵ lead to different learning rates. We can set a higher learning rate for the mean and the step size than for the covariance, as the covariance matrix has many more parameters, which makes it more prone to overfitting. Natural evolution strategy algorithms also use different learning rates for the mean, step size, and covariance matrix [19]. Finally, by setting the mean to the current mean when optimising the covariance matrix and the step size, we find a covariance matrix Σ_{t+1} and step size σ_{t+1} that increase the likelihood of successful steps instead of the likelihood of successful individuals. CMA-ES also uses the current mean when obtaining the new covariance matrix. This approach has been shown to be less prone to premature convergence [9].
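Equation 12 can be transcribed directly; the helper below treats σ as the scalar variance multiplying the covariance matrix, as in the rest of the paper, and is shown only to fix the notation used in the following subsections.

    import numpy as np

    def kl_old_to_new(mu_t, sigma_t, Sigma_t, mu, sigma, Sigma):
        # KL(pi_t || pi) between N(mu_t, sigma_t * Sigma_t) and N(mu, sigma * Sigma), Eq. (12).
        n = len(mu_t)
        A_t, A = sigma_t * Sigma_t, sigma * Sigma
        A_inv = np.linalg.inv(A)
        diff = mu - mu_t
        return 0.5 * (np.trace(A_inv @ A_t) + diff @ A_inv @ diff - n
                      + np.log(np.linalg.det(A) / np.linalg.det(A_t)))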

3.1.1 Mean Update Rule. We first construct the Lagrangian for obtaining the new mean, i.e.,

L(µ,η) = 2 ϵ_µ η − tr(σ_t^{−1} Σ_t^{−1} F),   (13)

where

F = Σ_i w_i (x_i − µ)(x_i − µ)^T + η (µ − µ_t)(µ − µ_t)^T.

To obtain the new mean µ_{t+1} = argmax_µ L(µ,η), we differentiate the Lagrangian with respect to µ and set the derivative to zero, which yields

µ_{t+1} = (η*_µ µ_t + Σ_i w_i x_i) / (Σ_i w_i + η*_µ),   (14)

where η*_µ is the optimal Lagrangian multiplier, which can be obtained by minimising the convex dual function

g_µ(η) = sup_µ L(η,µ) = L(η, µ_{t+1}),

i.e., η*_µ = argmin_η g_µ(η). As indicated in the equation, the dual function is obtained by setting the optimal solution for the mean back into the Lagrangian. The dual function is convex and can be efficiently optimised by any non-linear optimiser; we use the fmincon tool in Matlab (please see the Matlab code for the implementation). If we set a large KL bound for the mean, i.e., ϵ_µ → ∞, η goes to 0 and we obtain the original CMA-ES mean update µ_{t+1} = Σ_i w_i x_i / Σ_i w_i, which is equivalent to the solution of the unregularized weighted maximum likelihood estimate. In this paper we always set a large learning rate for the mean, which results in a similar mean update as in CMA-ES. Investigating the effect of different learning rates for the mean is part of our future work.

3.1.2 Incorporating the Evolution Path. The evolution path is a crucial ingredient of the performance of CMA-ES [9]. It summarizes the path taken by the recent distribution updates and can be interpreted as a kind of momentum term. In order to exploit the evolution path in our formulation, similar to [3] we treat the evolution path p_c as an individual sample, i.e., our optimisation problem is now given by

max_θ Σ_i w_i log π(x_i;θ) + λ log π(µ_t + p_c;θ)
s.t. KL(π(x;θ_t) || π(x;θ)) ≤ ϵ,   (15)

where λ > 0 is a user-defined weight that specifies the importance of the evolution path with respect to the other individuals. Empirically, we found it is better to use different λ coefficients for the covariance matrix and the step size, i.e., λ_Σ and λ_σ. Please note that we also tried to use the evolution path to adapt the mean as well. However, we observed that the learning process becomes considerably unstable due to overshooting the optimum.

3.1.3 Covariance Matrix Update Rule. We construct the covariance matrix Lagrangian for the optimisation program in Equation 15, i.e.,

L(η,Σ) = −(Σ_i w_i + λ_Σ + η) ln|Σ| + η (2 ϵ_Σ + n + ln|Σ_t|) − tr(Σ^{−1} (Σ_s/σ_t + η Σ_t + λ_Σ p_c p_c^T/σ_t)),   (16)

where

Σ_s = Σ_i w_i (x_i − µ_t)(x_i − µ_t)^T.   (17)

To obtain the new covariance matrix Σ_{t+1} = argmax_Σ L(Σ,η), we differentiate the Lagrangian with respect to Σ^{−1} and set the derivative to zero, which yields

Σ_{t+1} = (η*_Σ Σ_t + Σ_s/σ_t + λ_Σ p_c p_c^T/σ_t) / (Σ_i w_i + λ_Σ + η*_Σ)   (18)

in closed form. The coefficient η*_Σ is again the optimal Lagrangian multiplier. It can be obtained by minimising the dual function, η*_Σ = argmin_η g_Σ(η), where g_Σ(η) = L(η, Σ_{t+1}). Now, by changing variables, i.e.,

1 − α − γ = η / (Σ_i w_i + λ_Σ + η),   γ = λ_Σ / (Σ_i w_i + λ_Σ + η),   α = 1 / (Σ_i w_i + λ_Σ + η),   (19)

we can rewrite the equation for Σ as

Σ_{t+1} = (1 − α − γ) Σ_t + α Σ_s/σ_t + γ p_c p_c^T/σ_t.

Note that as η > 0 and λ_Σ > 0 we can infer that 0 ≤ α + γ ≤ 1. Hence, we obtain the exact form of the covariance matrix update of CMA-ES, where the second and third terms are the rank-µ and rank-1 updates respectively. The only difference is that α and γ are constant coefficients in CMA-ES. Please note that CMA-ES uses constant coefficients to combine the old covariance matrix with the rank-1 and rank-µ terms, while TR-CMA-ES computes these coefficients by satisfying an exact KL bound. We observed that this exact KL bound results in different coefficients in each iteration.

3.1.4 Step Size Update Rule. We also construct the step size Lagrangian for the optimisation program in Equation 15, i.e.,

L(η,σ) = −n (Σ_i w_i + λ_σ + η) ln σ + η (2 ϵ_σ + n + n ln σ_t) − σ^{−1} tr(Σ_t^{−1} (λ_σ p_c p_c^T + Σ_s) + η σ_t I).   (20)

To obtain the new step size σ_{t+1} = argmax_σ L(σ,η), we differentiate the Lagrangian with respect to σ and set the derivative to zero, which yields

σ_{t+1} = (n η*_σ σ_t + tr(Σ_t^{−1} (Σ_s + λ_σ p_c p_c^T))) / (n (Σ_i w_i + λ_σ + η*_σ)),   (21)

where Σ_s is given in Equation 17 and the coefficient η*_σ is the optimal Lagrangian multiplier. It is again obtained by minimising the dual function, η*_σ = argmin_η g_σ(η), where g_σ(η) = L(η, σ_{t+1}). In contrast to current step size control approaches, our algorithm naturally increases or decreases the step size σ based on the current individuals and the evolution path, without explicitly defining any heuristic. More formally, the step size is adapted to increase the likelihood of successful steps and of the evolution path using the same mathematical foundation as the updates for the mean and the covariance matrix. CMA-ES only uses the evolution path without using the current individuals [5]. Figure 1 illustrates the effectiveness of our step size update rule. Now, all the updates follow the same principle.
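The following sketch shows this principle for the covariance coordinate: the closed-form solution of Equation 18 is plugged back into the Lagrangian of Equation 16 and the resulting scalar dual is minimised numerically (here with SciPy rather than Matlab's fmincon); the step-size multiplier η*_σ can be found in exactly the same way. The variable names and the bounded search interval are our own choices, so this is an illustrative sketch rather than the reference implementation.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def tr_covariance_update(x, w, mu_t, sigma_t, Sigma_t, p_c, lam_S, eps_S):
        # Trust-region covariance update: Sigma(eta) from Eq. (18), eta* found by
        # minimising the dual g(eta) = L(eta, Sigma(eta)) built from Eq. (16).
        d = x - mu_t
        Sigma_s = d.T @ (w[:, None] * d)                      # Eq. (17)
        M0 = (Sigma_s + lam_S * np.outer(p_c, p_c)) / sigma_t
        W = w.sum() + lam_S
        n = len(mu_t)
        logdet_t = np.linalg.slogdet(Sigma_t)[1]

        def Sigma_of(eta):
            return (eta * Sigma_t + M0) / (W + eta)           # Eq. (18)

        def dual(eta):
            S = Sigma_of(eta)
            logdet = np.linalg.slogdet(S)[1]
            M = M0 + eta * Sigma_t
            return (-(W + eta) * logdet + eta * (2 * eps_S + n + logdet_t)
                    - np.trace(np.linalg.solve(S, M)))        # Eq. (16) at Sigma(eta)

        eta_star = minimize_scalar(dual, bounds=(1e-8, 1e8), method="bounded").x
        return Sigma_of(eta_star)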

Figure 2: Average number of fitness evaluations (y-axis) over 20 trials required to reach the target fitness value of −10⁻⁵ (10³ for the unbounded ParabR function) for the 8 benchmark functions Sphere, Diffpow, Elli, ParabR, Rosenbrock, Schwefel, Cigar and Tablet, plotted for dimensions 5 to 60. TR-CMA-ES outperformed CMA-ES on most benchmark functions. In fact, only in some dimensions of the ParabR and Cigar functions does CMA-ES perform better than TR-CMA-ES.

3.2 Algorithm

Algorithm 1 shows a compact representation of the new stochastic search method. In each iteration, we generate N individuals x ∈ R^n and evaluate their fitness values (Lines 1-7). Subsequently, we compute a weight for each individual based on its fitness value (Line 8). For ease of comparison we use the same rank-based weighting as CMA-ES; please see the preliminaries section for the weighting method. We then compute the number of effective samples (Line 9). Similar to CMA-ES, we empirically obtained a default setting for the hyper parameters of the algorithm which scales with the dimension of the problem and the size of the population (Line 10). Subsequently we compute the new mean and the new evolution path (Lines 11-15). Finally, we obtain the new covariance matrix and the new step size (Lines 16-21). Note that the η* are obtained by optimising the dual functions explained in the previous section.

4 EXPERIMENTS

In this section, we present our experimental evaluation of TR-CMA-ES in comparison to CMA-ES. We use two sets of benchmarks, the standard functions and three simulated robotics tasks, where we compare against the CMA-ES algorithm. For all experiments we use the same hyper parameter setting as given in Algorithm 1. For the comparison, we use the CMA-ES Matlab source code released on the official CMA-ES website³. We run both algorithms for several trials for each single experiment. For both algorithms, we use the same population size and both algorithms start with the same initial distribution in each trial. However, the initial distribution for each trial is varied slightly. Also, both algorithms use fixed hyper parameter settings, i.e., throughout the experiments we do not set any hyper parameters by hand for either algorithm⁴.

³ https://www.lri.fr/~hansen/cmaes_inmatlab.html
⁴ Matlab source code of TR-CMA-ES (with examples) as well as videos of the robotics experiments are available at https://goo.gl/9u15Wf

4.1 Standard Functions

We empirically evaluate TR-CMA-ES on 8 benchmark functions from [10]. As TR-CMA-ES is a maximiser, we multiply all function values by −1 to turn their minimum into a maximum. Now, except for the ParabR function, which is unbounded, all functions have a global maximum of zero. For both TR-CMA-ES and CMA-ES, we use the formula given in Algorithm 1 to define the population size. For each trial, we randomly generate the mean of the initial distribution from a Gaussian distribution with zero mean. We set the initial covariance matrix to the identity matrix with initial σ = 1. We ran TR-CMA-ES and CMA-ES on these functions with dimensions from 5 to 60 and a target fitness of −10⁻⁵ (10³ for ParabR). We perform 20 trials for each dimension and report the average number of fitness evaluations needed to reach the target fitness.

Results. Figure 2 provides the full results on the set of eight benchmark functions. The results show that TR-CMA-ES in most cases outperforms CMA-ES in terms of the number of evaluations needed to reach the target fitness. In fact, only in some dimensions of the ParabR and Cigar functions does CMA-ES perform better than TR-CMA-ES.
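For reference, two of the standard test functions used above can be written as below, already negated so that the global maximum is zero as described in Section 4.1. These are commonly used forms given here as a sketch; the exact definitions and scalings of the full benchmark suite follow [10] and may differ in detail.

    import numpy as np

    def neg_sphere(x):
        # Sphere function, negated: global maximum 0 at x = 0.
        return -np.sum(np.asarray(x) ** 2)

    def neg_rosenbrock(x):
        # Rosenbrock function, negated: global maximum 0 at x = (1, ..., 1).
        x = np.asarray(x)
        return -np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (x[:-1] - 1.0) ** 2)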

4.2 Simulated Robotics Tasks

We use a 5-link planar robot that has to reach a given point in task space as a toy task for the comparisons, see Figure 3(a). We subsequently made the task more difficult by introducing hard obstacles, which results in a discontinuous fitness function; we denote this task as the hole-reaching task, see Figure 3(b). Finally, we evaluate our algorithm on a physical simulation of a robot playing table tennis (Figure 3(c)). For all tasks, we use dynamic motor primitives (DMPs) [11] as the trajectory generator and we optimise the parameters of the DMPs to achieve optimal movements. For all tasks, we generate 50 samples in each iteration. We ran 10 trials and we report the average fitness in each iteration and the standard deviation over all trials. All other settings are as before if not stated otherwise.

Figure 3: (a) Planar reaching task: a 5-link planar robot has to reach a via-point v_50 = [1, 1] in task space. The via-point is indicated by the red cross. The postures of the resulting motion are shown as an overlay, where darker postures indicate a posture which is close in time to the via-point. (b) Planar hole reaching task: a 5-link planar robot has to reach the bottom of a hole centred at the point [2, 0] in task space while avoiding any collision with the walls. The hole is indicated by the red lines. The postures of the resulting motion are shown as an overlay, where darker postures indicate a posture which is close in time to reaching the bottom of the hole. (c) The table tennis task: the incoming ball has a fixed initial velocity. The goal of the robot is to learn forehand strokes to return the ball to a fixed target position.


Figure 4: Comparisons for the three simulated robotics tasks: (a) reaching task, (b) hole reaching task and (c) table tennis task. The results show that TR-CMA-ES outperforms CMA-ES in terms of the number of needed evaluations on all tasks. In the hole reaching task as well as in the table tennis task, TR-CMA-ES also has a better average final performance than CMA-ES. Please note that y in −10^−y denotes the value on the y-axis.

Planar Reaching Task. For the reaching task, the robot has to reach a via-point v_50 = [1, 1] at time step 50 with its end-effector and the point v_100 = [5, 0] at the final time step T = 100. The reward is given by a quadratic cost term for the distance to the two via-points as well as quadratic costs for high accelerations. The DMP goal attractor for reaching the final state was assumed to be known. Hence, the parameter vector x for a 5-link robot with 5 basis functions per degree of freedom had 25 dimensions. Figure 4(a) shows the learning progress. TR-CMA-ES again outperforms CMA-ES, while both algorithms find the optimal solution.

Planar Hole Reaching Task. For the hole reaching task, the robot's end-effector has to reach the bottom of a hole (35 cm wide and 1 m deep) centred at the point [2, 0] without any collision with the ground or the hole walls. The reward is given by a quadratic cost term for the distance to the bottom of the hole, quadratic costs for high accelerations and quadratic costs for collisions with the environment. Note that this objective function is highly discontinuous due to the quadratic collision costs. The DMP goal attractor for reaching the final state in this task is unknown and also needs to be learned. Hence, our lower level policy for a 5-link robot with 5 basis functions per degree of freedom had 30 dimensions. Figure 4(b) shows the results. For this task we report both the maximum fitness and the average fitness in each iteration. In both cases TR-CMA-ES shows a better performance.

Table Tennis Task. In this task, we use a simulated robot arm (see Figure 3(c)) to learn a forehand hitting stroke in a table tennis game. The robot is mounted on a floating base and has 8 actuated joints including the floating base. The goal of the robot is to return the incoming ball to a target position on the opponent's side of the table. The ball is always served with the same initial velocities in the x, y and z directions. To learn the task, we initialize the mean of the distribution with an initial DMP trajectory obtained by kinaesthetic teaching, such that the movement generates a single forehand stroke. We only learn the final positions and final velocities of the DMP trajectories as well as the τ time-scaling parameter and the starting time point of the DMP, which results in 18 parameters. The reward function is defined by the sum of quadratic penalties for missing the ball (minimum distance between ball and racket trajectory) and for missing the target return position. Figure 4(c) shows the results. TR-CMA-ES robustly finds the solution in all trials, while CMA-ES could not find a good solution in 3 out of 10 trials, and TR-CMA-ES was faster.

Algorithm 1 TR-CMA-ES
1: given n (dimension), N = 4 + ⌊3 log(n)⌋ (number of individuals)
2: initialize µ_{t=0}, σ_{t=0} > 0, p_{c,t=0} = 0, Σ_{t=0} = I, t = 0
3: repeat
4:   for i = 1, ..., N do
5:     x_i ∼ N(µ_t, σ_t Σ_t)
6:     f_i = f(x_i)
7:   end for
8:   w = ComputeWeights({x_i, f_i}_{i=1...N}) (see the preliminaries section)
9:   compute the variance effective selection mass: µ_w = (Σ_{i=1}^{N} w_i)² / Σ_{i=1}^{N} w_i²
10:  set the hyper parameters:
       λ_σ = 1,  λ_Σ = 4 n µ_w / ((n + 1.3)² + µ_w),  ϵ_Σ = min(0.2, 1.5 (µ_w + 1) / ((n + 2)² + µ_w)),
       ϵ_σ = µ_w / (2n),  ϵ_µ = 1000,  c_c = (µ_w + 2) / (n + µ_w + 5)
11:  update the mean:
12:    η*_µ = argmin_η g_µ(η)
13:    µ_{t+1} = (η*_µ µ_t + Σ_i w_i x_i) / (Σ_i w_i + η*_µ)
14:  compute the evolution path:
15:    p_{c,t+1} = (1 − c_c) p_{c,t} + sqrt(c_c (2 − c_c) µ_w) (µ_{t+1} − µ_t)
16:  update the covariance matrix:
17:    η*_Σ = argmin_η g_Σ(η)
18:    Σ_{t+1} = (η*_Σ Σ_t + Σ_i w_i (x_i − µ_t)(x_i − µ_t)^T / σ_t + λ_Σ p_{c,t+1} p_{c,t+1}^T / σ_t) / (Σ_i w_i + λ_Σ + η*_Σ)
19:  update the step size:
20:    η*_σ = argmin_η g_σ(η)
21:    σ_{t+1} = (n η*_σ σ_t + tr(Σ_t^{−1} (Σ_i w_i (x_i − µ_t)(x_i − µ_t)^T + λ_σ p_{c,t+1} p_{c,t+1}^T))) / (n (Σ_i w_i + λ_σ + η*_σ))
22:  t = t + 1
23: until stopping criterion is met
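As a rough, self-contained Python sketch of the main loop of Algorithm 1 (not the authors' Matlab reference implementation): the hyper parameter formulas follow the listing above, the mean is updated with the unregularised weighted average (the large-ϵ_µ case discussed in Section 3.1.1), and the two dual functions are minimised numerically over a bounded interval, which is an implementation choice of this sketch.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def tr_cma_es(f, x0, sigma0=1.0, iterations=100, seed=0):
        rng = np.random.default_rng(seed)
        n = len(x0)
        N = 4 + int(3 * np.log(n))                          # population size (line 1)
        mu, sigma, Sigma, p_c = np.array(x0, float), sigma0, np.eye(n), np.zeros(n)
        w = np.log(N / 2 + 0.5) - np.log(np.arange(1, N // 2 + 1))
        w /= w.sum()                                        # rank-based weights (Eq. 2)
        mu_w = w.sum() ** 2 / np.sum(w ** 2)                # variance effective selection mass
        lam_s, lam_S = 1.0, 4 * n * mu_w / ((n + 1.3) ** 2 + mu_w)
        eps_S = min(0.2, 1.5 * (mu_w + 1) / ((n + 2) ** 2 + mu_w))
        eps_s, c_c = mu_w / (2 * n), (mu_w + 2) / (n + mu_w + 5)
        for _ in range(iterations):
            X = mu + np.sqrt(sigma) * rng.multivariate_normal(np.zeros(n), Sigma, N)
            F = np.array([f(x) for x in X])
            elite = X[np.argsort(F)[::-1][: len(w)]]        # best half of the population
            mu_new = w @ elite                              # mean update for eps_mu -> infinity
            p_c = (1 - c_c) * p_c + np.sqrt(c_c * (2 - c_c) * mu_w) * (mu_new - mu)
            d = elite - mu
            S_s = d.T @ (w[:, None] * d)
            M0 = (S_s + lam_S * np.outer(p_c, p_c)) / sigma
            W_S = w.sum() + lam_S
            logdet_t = np.linalg.slogdet(Sigma)[1]
            Sigma_of = lambda eta: (eta * Sigma + M0) / (W_S + eta)          # Eq. (18)
            def g_S(eta):                                   # covariance dual, from Eq. (16)
                S = Sigma_of(eta)
                M = M0 + eta * Sigma
                return (-(W_S + eta) * np.linalg.slogdet(S)[1]
                        + eta * (2 * eps_S + n + logdet_t)
                        - np.trace(np.linalg.solve(S, M)))
            Sigma_new = Sigma_of(minimize_scalar(g_S, bounds=(1e-8, 1e8), method="bounded").x)
            W_s = w.sum() + lam_s
            T = np.trace(np.linalg.solve(Sigma, S_s + lam_s * np.outer(p_c, p_c)))
            s_of = lambda eta: (n * eta * sigma + T) / (n * (W_s + eta))      # Eq. (21)
            def g_s(eta):                                   # step-size dual, from Eq. (20)
                s = s_of(eta)
                return (-n * (W_s + eta) * np.log(s)
                        + eta * (2 * eps_s + n + n * np.log(sigma))
                        - (T + n * eta * sigma) / s)
            sigma_new = s_of(minimize_scalar(g_s, bounds=(1e-8, 1e8), method="bounded").x)
            mu, Sigma, sigma = mu_new, Sigma_new, sigma_new
        return mu

For example, tr_cma_es(neg_rosenbrock, np.zeros(10), iterations=300) runs this sketch on the negated Rosenbrock function defined in the earlier snippet.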
5 CONCLUSION

In this paper, we derived the full CMA-ES update equations for the mean and covariance within an expectation-maximization-based framework using information-geometric trust regions. The presented update for the covariance matrix shares the same structure as the CMA-ES algorithm. However, CMA-ES does not use a trust region (i.e., a constraint on the KL); instead, a penalty on the KL is used in the objective as a regularizer. As a consequence, the optimal 'Lagrangian' multipliers in CMA-ES are set by hand and remain fixed during learning. However, they are well established and can be left constant throughout many applications. In the trust region formulation, the Lagrangian multipliers are optimized for the given bound ϵ. As we show, in addition to the theoretical foundation, this gives us an improved performance over the original algorithm. In contrast to the mean and the covariance update, our step-size update does not match the step-size update from CMA-ES. Both TR-CMA-ES and CMA-ES take advantage of the evolution path to set the step size σ. However, our update rule for the step size is obtained from the same principle as used for the mean and the covariance matrix. Given the similarities in terms of the mean and covariance matrix updates between CMA-ES and our algorithm, our step size control rule is more consistent and more principled than the update rule used in standard CMA-ES. The new step-size update also performs favourably in our experiments and should also be preferred for CMA-ES due to its consistent derivation. Our algorithm also enjoys all the invariance properties of CMA-ES. For future work, we will investigate the effect of the regularized mean update compared to the unregularized mean update. Moreover, we will investigate other theoretical aspects of our framework, such as the theoretical difference between the forward KL and the inverse KL trust region. We will also extend our framework to contextual stochastic search problems and full reinforcement learning problems.

6 ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions. This research was funded by the Palo Alto Research Center (PARC), by the European Union's FP7 under EuRoC grant agreement CP-IP 608849, and by LIACC (UID/CEC/00027/2015) and IEETA (UID/CEC/00127/2015).

REFERENCES
[1] A. Abdolmaleki, R. Lioutikov, J. Peters, N. Lau, L.P. Reis, and G. Neumann. 2015. Model Based Relative Entropy Stochastic Search. In NIPS.
[2] A. Abdolmaleki, N. Lau, L.P. Reis, and G. Neumann. 2015. Regularized covariance estimation for weighted maximum likelihood policy search methods. In Humanoids Conference.
[3] Y. Akimoto, A. Auger, and N. Hansen. 2014. Comparison-based natural gradient optimization in high dimension. In GECCO.
[4] Y. Akimoto, Y. Nagata, I. Ono, and S. Kobayashi. 2012. Theoretical foundation for CMA-ES from information geometry perspective. Algorithmica (2012).
[5] Y. Akimoto, J. Sakuma, I. Ono, and S. Kobayashi. 2008. Functionally specialized CMA-ES: a modification of CMA-ES based on the specialization of the functions of covariance matrix adaptation and step size adaptation. In GECCO.
[6] S. Amari. 1998. Natural Gradient Works Efficiently in Learning. Neural Computation 10, 2 (1998). DOI: http://dx.doi.org/10.1162/089976698300017746
[7] P. Dayan and G.E. Hinton. 1997. Using expectation-maximization for reinforcement learning. Neural Computation 9, 2 (1997), 271–278. DOI: http://dx.doi.org/10.1162/neco.1997.9.2.271
[8] A. Dempster, N. Laird, and D. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1 (1977), 1–38.
[9] N. Hansen. 2016. The CMA Evolution Strategy: A Tutorial. (2016).
[10] N. Hansen, S.D. Müller, and P. Koumoutsakos. 2003. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation (2003).
[11] A. Ijspeert and S. Schaal. 2003. Learning Attractor Landscapes for Learning Motor Primitives. In Advances in Neural Information Processing Systems 15 (NIPS).
[12] J. Kober, B.J. Mohler, and J. Peters. 2008. Learning Perceptual Coupling for Motor Primitives. In Intelligent Robots and Systems (IROS). 834–839.
[13] I. Loshchilov, M. Schoenauer, and M. Sebag. 2013. KL-based Control of the Learning Schedule for Surrogate Black-Box Optimization. CoRR (2013).
[14] S. Mannor, R. Rubinstein, and Y. Gat. 2003. The Cross Entropy method for Fast Policy Search. In Proceedings of the 20th International Conference on Machine Learning (ICML).
[15] G. Neumann. 2011. Variational Inference for Policy Search in Changing Situations. In ICML.
[16] Y. Ollivier, L. Arnold, A. Auger, and N. Hansen. 2011. Information-geometric optimization algorithms: A unifying picture via invariance principles. arXiv preprint arXiv:1106.3708.
[17] J. Peters, K. Mülling, and Y. Altun. 2010. Relative Entropy Policy Search. In AAAI. AAAI Press.
[18] J. Schulman, S. Levine, P. Abbeel, M.I. Jordan, and P. Moritz. 2015. Trust Region Policy Optimization. In ICML.
[19] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber. 2014. Natural Evolution Strategies. Journal of Machine Learning Research (2014).