Natural Evolution Strategies
Daan Wierstra, Tom Schaul, Jan Peters and Juergen Schmidhuber

Daan Wierstra, Tom Schaul and Juergen Schmidhuber are with IDSIA, Manno-Lugano, Switzerland (email: [daan, tom, juergen]@idsia.ch). Juergen Schmidhuber is also with the Technical University Munich, Garching, Germany. Jan Peters is with the Max Planck Institute for Biological Cybernetics, Tuebingen, Germany (email: [email protected]). This research was funded by SNF grants 200021-111968/1 and 200021-113364/1.

Abstract— This paper presents Natural Evolution Strategies (NES), a novel algorithm for performing real-valued ‘black box’ function optimization: optimizing an unknown objective function where algorithm-selected function measurements constitute the only information accessible to the method. Natural Evolution Strategies search the fitness landscape using a multivariate normal distribution with a self-adapting mutation matrix to generate correlated mutations in promising regions. NES shares this property with Covariance Matrix Adaption (CMA), an Evolution Strategy (ES) which has been shown to perform well on a variety of high-precision optimization tasks. The Natural Evolution Strategies algorithm, however, is simpler, less ad-hoc and more principled. Self-adaptation of the mutation matrix is derived using a Monte Carlo estimate of the natural gradient towards better expected fitness. By following the natural gradient instead of the ‘vanilla’ gradient, we can ensure efficient update steps while preventing early convergence due to overly greedy updates, resulting in reduced sensitivity to local suboptima. We show NES has competitive performance with CMA on unimodal tasks, while outperforming it on several multimodal tasks that are rich in deceptive local optima.

I. INTRODUCTION

Real-valued ‘black box’ function optimization is one of the major branches of modern applied machine learning research [1]. It concerns itself with optimizing the continuous parameters of some unknown objective function, also called a fitness function. The exact structure of the objective function is assumed to be unknown or unspecified, but specific function measurements, freely chosen by the algorithm, are available. This is a recurring problem setup in real-world domains, since often the precise structure of a problem is either not available to the engineer, or too expensive to model or simulate. Numerous real-world problems can be treated as real-valued black box function optimization problems. To illustrate the importance and prevalence of this general problem setup, one could point to a diverse set of tasks such as the classic nozzle shape design problem [2], developing an Aibo robot gait [3] or non-Markovian neurocontrol [4].

Since exhaustively searching the entire space of solution parameters is considered infeasible, and since we do not assume a precise model of our fitness function, we are forced to settle for trying to find a reasonably fit solution that satisfies certain pre-specified constraints. This inevitably involves using a sufficiently intelligent heuristic approach. Though this may sound crude, in practice it has proven to be crucial to find the right domain-specific trade-off on issues such as convergence speed, expected quality of the solutions found, and the algorithm’s sensitivity to local suboptima on the fitness landscape.

A variety of algorithms has been developed within this framework, including methods such as Simulated Annealing [5], Simultaneous Perturbation Stochastic Optimization [6], simple Hill Climbing, Particle Swarm Optimization [7] and the class of Evolutionary Algorithms, of which Evolution Strategies (ES) [8], [9], [10] and in particular its Covariance Matrix Adaption (CMA) instantiation [11] are of great interest to us.

Evolution Strategies, so named because of their inspiration from natural Darwinian evolution, generally produce consecutive generations of samples. During each generation, a batch of samples is generated by perturbing the parents’ parameters – mutating their genes, if you will. (Note that ES generally uses asexual reproduction: new individuals are typically produced without crossover or similar techniques more prevalent in the field of genetic algorithms.) A number of samples is selected based on their fitness values, while the less fit individuals are discarded. The winners are then used as parents for the next generation, and so on, a process which typically leads to increasing fitness over the generations. This basic ES framework, though simple and heuristic in nature, has proven to be very powerful and robust, spawning a wide variety of algorithms.

In Evolution Strategies, mutations are generally drawn from a normal distribution with specific mutation sizes associated with every problem parameter. One of the major research topics in Evolution Strategies concerns the automated adjustment of the mutation sizes for the production of new samples, a procedure generally referred to as self-adaptation of mutation. Choosing mutation sizes too high will produce debilitating mutations and prevent convergence to a sufficiently fit region of parameter space; choosing them too low leads to extremely small convergence rates and causes the algorithm to get trapped in bad local suboptima. Generally the mutation size must be chosen from a small range of values – known as the evolution window – specific to both the problem domain and to the distribution of current individuals on the fitness landscape. ES algorithms must therefore adjust mutation during evolution, based on the progress made on the recent evolution path. Often this is done by simultaneously evolving both the problem parameters and the corresponding mutation sizes, which has been shown to produce excellent results in a number of cases [10].

The Covariance Matrix Adaptation algorithm constitutes a more sophisticated approach. CMA adapts a variance-covariance mutation matrix Σ used by a multivariate normal distribution from which mutations are drawn. This enables the algorithm to generate correlated mutations, speeding up evolution significantly for many real-world fitness landscapes. Self-adaptation of this mutation matrix is then achieved by integrating information on successful mutations along its recent evolution path, making similar mutations more likely. CMA performs excellently on a number of benchmark tasks. One of the problems with the CMA algorithm, however, is its ad-hoc nature and relatively complex or even contrived set of mathematical justifications and ‘rules of thumb’. Another problem pertains to its sensitivity to local optima.

The goal of this paper is to advance the state of the art of real-valued ‘black box’ function optimization while providing a firmer mathematical grounding of its mechanics, derived from first principles. Simultaneously, we want to develop a method that can achieve optimization up to arbitrarily high precision, and reduce sensitivity to local suboptima as compared to certain earlier methods such as CMA. We present an algorithm, Natural Evolution Strategies, that is both elegant and competitive, inheriting CMA’s strength of correlated mutations while enhancing the ability to prevent early convergence to local optima.

Our method, Natural Evolution Strategies (which can actually be seen as a (1, λ)-Evolution Strategy with one candidate solution per generation and λ samples or ‘children’), adapts both a mutation matrix and the parent individual using a natural gradient based update step [12]. Every generation, a gradient towards better expected fitness is estimated using a Monte Carlo approximation. This gradient is then used to update both the parent individual’s parameters and the mutation matrix. By using a natural gradient instead of a ‘vanilla’ gradient, we can prevent early convergence to local optima, while ensuring large update steps. We show that the algorithm has competitive performance with CMA on a number of benchmarks, while it outperforms CMA on the Rastrigin function benchmark, a task rich in deceptive local optima.

The only information accessible to the algorithm consists of function measurements selected by the algorithm. The goal is to optimize f(x), while keeping the number of function evaluations – which are considered costly – as low as possible. This is done by evaluating a number λ of separate individuals z_1 … z_λ each successive generation g, using the information from fitness evaluations f(z_1) … f(z_λ) to adjust both the current candidate objective parameters x and the mutation sizes.

In conventional Evolution Strategies, optimization is achieved by mimicking natural evolution: at every generation, parent solution x produces offspring z_1 … z_λ by mutating string x using a multivariate normal distribution with zero mean and some variance σ. After evaluating all individuals, the best µ individuals are kept (selected), stored as candidate solutions and subsequently used as ‘parents’ for the next generation. This simple process is known to produce excellent results for a number of challenging problems.

Algorithm 1 Natural Evolution Strategies
  g ← 1
  initialize population parameters θ^(g) = ⟨x, Σ = A^T A⟩
  repeat
    for k = 1 … λ do
      draw z_k ∼ π(x, Σ)
      evaluate cost f(z_k)
      ∇_x log π(z_k) = Σ^−1 (z_k − x)
      ∇_Σ log π(z_k) = ½ Σ^−1 (z_k − x)(z_k − x)^T Σ^−1 − ½ Σ^−1
      ∇_A log π(z_k) = A [∇_Σ log π(z_k) + ∇_Σ log π(z_k)^T]
    end for
    Φ = [ ∇_x log π(z_1)  ∇_A log π(z_1)  1
          ⋮                ⋮               ⋮
          ∇_x log π(z_λ)  ∇_A log π(z_λ)  1 ]
    R = [f(z_1), …, f(z_λ)]^T
    δθ = (Φ^T Φ)^−1 Φ^T R
    θ^(g+1) ← θ^(g) − β · δθ
    g ← g + 1
  until stopping criterion is met
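The per-generation loop of Algorithm 1 — sample from π(x, Σ), compute the log-density derivatives, regress the fitness values on them, and step θ against the estimated natural gradient — can be sketched in NumPy. This is a minimal illustration on a toy sphere cost, with our own choices of sample size, learning rate β and stopping rule (none of these values are prescribed by the text above), and with the normal-equations step δθ = (Φ^T Φ)^−1 Φ^T R solved by a least-squares routine for numerical stability.

```python
import numpy as np

def nes_minimize(f, x0, sigma0=1.0, lam=None, beta=0.02, generations=300, seed=0):
    """Sketch of Algorithm 1: adapt parent x and mutation matrix Sigma = A^T A."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    n = x.size
    A = sigma0 * np.eye(n)                     # Sigma = A^T A, initially isotropic
    if lam is None:
        lam = 2 * (n + n * n + 1)              # enough samples for the regression
    for _ in range(generations):
        Sigma = A.T @ A
        Sinv = np.linalg.inv(Sigma)
        Z = rng.multivariate_normal(x, Sigma, size=lam)
        Phi = np.empty((lam, n + n * n + 1))
        R = np.empty(lam)
        for k in range(lam):
            d = Z[k] - x
            g_x = Sinv @ d                                          # grad_x log pi
            g_S = 0.5 * Sinv @ np.outer(d, d) @ Sinv - 0.5 * Sinv   # grad_Sigma log pi
            g_A = A @ (g_S + g_S.T)                                 # grad_A log pi
            Phi[k] = np.concatenate([g_x, g_A.ravel(), [1.0]])      # trailing 1 per row
            R[k] = f(Z[k])
        delta, *_ = np.linalg.lstsq(Phi, R, rcond=None)             # delta_theta
        x -= beta * delta[:n]                                       # update parent
        A -= beta * delta[n:n + n * n].reshape(n, n)                # update mutation matrix
    return x

# usage: minimize a 2-D sphere cost; the parent should approach the origin
sphere = lambda z: float(z @ z)
x_best = nes_minimize(sphere, x0=[2.0, -1.5])
print(x_best)
```

The trailing column of ones in Φ can be read as letting the regression absorb a constant fitness baseline, so the remaining coefficients of δθ reflect fitness differences between samples rather than the overall fitness level.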
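Algorithm 1 relies on closed-form derivatives of the multivariate normal log-density: ∇_x log π(z) = Σ^−1(z − x) and, with Σ = A^T A, ∇_A log π(z) = A[∇_Σ log π(z) + ∇_Σ log π(z)^T]. As a sanity check, these expressions can be compared against central finite differences of the log-density; the sketch below does this in NumPy (the helper names are ours, not the paper's).

```python
import numpy as np

# log_pi is the multivariate normal log-density with mean x, covariance Sigma.
def log_pi(z, x, Sigma):
    d = z - x
    n = x.size
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + d @ np.linalg.inv(Sigma) @ d)

rng = np.random.default_rng(1)
n = 3
x = rng.normal(size=n)
A = rng.normal(size=(n, n)) + 2 * np.eye(n)   # keeps Sigma = A^T A well-conditioned
Sigma = A.T @ A
z = rng.normal(size=n)

# closed-form derivatives from Algorithm 1
Sinv = np.linalg.inv(Sigma)
d = z - x
grad_x = Sinv @ d                                          # grad_x log pi
grad_S = 0.5 * Sinv @ np.outer(d, d) @ Sinv - 0.5 * Sinv   # grad_Sigma log pi
grad_A = A @ (grad_S + grad_S.T)                           # grad_A log pi

# central finite differences with respect to x and to the entries of A
eps = 1e-6
num_x = np.array([(log_pi(z, x + eps * e, Sigma) - log_pi(z, x - eps * e, Sigma)) / (2 * eps)
                  for e in np.eye(n)])
num_A = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        num_A[i, j] = (log_pi(z, x, (A + E).T @ (A + E))
                       - log_pi(z, x, (A - E).T @ (A - E))) / (2 * eps)

err_x = np.max(np.abs(grad_x - num_x))
err_A = np.max(np.abs(grad_A - num_A))
print(err_x, err_A)  # both discrepancies should be tiny
```

Differentiating through Σ = A^T A is what makes the chain-rule factor A[∇_Σ + ∇_Σ^T] appear: perturbing one entry of A moves a whole row and column of Σ at once.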