Population-Based Evolution Optimizes a Meta-Learning Objective

Kevin Frans
Cross Labs, Cross Compass Ltd., Japan
Massachusetts Institute of Technology, Cambridge, MA, USA

Olaf Witkowski
Cross Labs, Cross Compass Ltd., Japan
Earth-Life Science Institute, Tokyo Institute of Technology, Japan
College of Arts and Sciences, University of Tokyo, Japan

ABSTRACT

Meta-learning models, or models that learn to learn, have been a long-desired target for their ability to quickly solve new tasks. Traditional meta-learning methods can require expensive inner and outer loops, thus there is demand for algorithms that discover strong learners without explicitly searching for them. We draw parallels to the study of evolvable genomes in evolutionary systems – genomes with a strong capacity to adapt – and propose that meta-learning and adaptive evolvability optimize for the same objective: high performance after a set of learning iterations. We argue that population-based evolutionary systems with non-static fitness landscapes naturally bias towards high-evolvability genomes, and therefore optimize for populations with strong learning ability. We demonstrate this claim with a simple evolutionary algorithm, Population-Based Meta Learning (PBML), which consistently discovers genomes that display higher rates of improvement over generations, and can rapidly adapt to solve sparse fitness and robotic control tasks.

Figure 1: Meta-learning and adaptive evolvability both measure parallel objectives: performance after a set of learning iterations. An evolutionary system can be considered as "learning to learn" if its population of genomes consistently improves its evolvability, thus allowing the population to more efficiently adapt to changes in the environment.

1 INTRODUCTION

In recent years, there has been growing interest in the community around meta-learning, or learning to learn. From a meta-learning viewpoint, models are judged not just on their strength, but on their capacity to increase this strength through future learning. Strong meta-learning models are therefore desired for many practical applications, such as sim-to-real robotics [4, 27, 43] or low-data learning [15, 33, 42, 47], for their ability to quickly generalize to new tasks. The inherent challenge in traditional meta-learning lies in its expensive inner loop – if learning is expensive, then learning to learn is orders of magnitude harder. Following this problem, there is a desire for alternate meta-learning methods that can discover fast learners without explicitly searching for them.

In the adjacent field of evolutionary computation, a similar interest has arisen for genomes with high evolvability. Genomes represent individual units in a world, such as the structure or behavior of a creature, which together form a global population. An evolutionary system improves by mutating this population of genomes to form offspring, which are then selected based on their fitness scores. A genome's evolvability is its ability to efficiently carry out this mutation-selection process [29, 41]. Precise definitions of evolvability are debated: some measurements capture the diversity of a genome's offspring [20, 31, 37], while others capture the ability for offspring to adapt to new challenges [12, 22, 28].

From a meta-learning perspective, the ability for offspring to adapt to new challenges – which we refer to as adaptive evolvability – is of particular interest. Consider a shifting environment in which genomes must repeatedly adapt to maintain high fitness levels. A genome with high adaptive evolvability would more easily produce offspring that explore towards high-fitness areas. Thus, adaptive evolvability can be measured as the expected fitness of a genome's descendants after some number of generations.

Here we consider the evolutionary system from an optimization perspective. Evolutionary systems are learning models which iteratively improve through mutation and selection. Thus, the learning ability of a system directly depends on its genomes' ability to efficiently explore via mutation. Take two populations and allow them to adapt for a number of generations; the population with higher learning ability will produce descendants with higher average fitness. The meta-learning objective of an evolutionary system is therefore the same as maximizing the adaptive evolvability of its population – genomes should learn to produce high-fitness descendants through mutation (Figure 1).

The key insight in connecting these ideas is that evolvability is known to improve naturally in certain evolutionary systems. In contrast to meta-learning, where meta-parameters are traditionally optimized for in a separate loop [24], evolvability is often a property thought to arise indirectly as an evolutionary process runs [1, 12, 28, 32], a key example being the adaptability of life on Earth [37]. A large body of work has examined what properties lead to increasing evolvability in evolutionary systems [1, 10, 12, 28, 32], and if a population consistently increases in evolvability, it is equivalent to learning to learn.

In this work, we claim that in evolutionary algorithms with a large population and a non-static fitness landscape, high-evolvability genomes will naturally be selected for, creating a viable meta-learning method that does not rely on explicit search. The key idea is that survivable genes in population-based evolutionary systems must grant fitness not only in the present, but also to inheritors of the gene.
In environments with non-static fitness landscapes, this manifests in the form of learning an efficient mutation function, often with inductive biases that balance exploration and exploitation along various dimensions.

We solidify this claim by considering Population-Based Meta Learning (PBML), an evolutionary algorithm that successfully discovers genomes with a strong capacity to improve themselves over generations. We show that learning rates consistently increase even when a genome's learning behavior is decoupled from its fitness, and that genomes can discover non-trivial mutation behavior to efficiently explore in sparse environments. We show that genomes discovered through population-based evolution consistently outperform competitors on both numeric and robotic control tasks.

2 BACKGROUND

2.1 Meta Learning

The field of meta-learning focuses on discovering models that display a strong capacity for additional learning, hence the description learning to learn [49]. Traditionally, meta-learning consists of a distribution of tasks, along with an inner and outer loop. In the inner loop, a new task is selected, and a model is trained to solve the task. In the outer loop, meta-parameters are trained to increase expected performance in the inner loop. A strong meta-learning model will encode shared information in these meta-parameters, so as to quickly learn to solve a new task.

Meta-parameters can take many forms [24], as long as they affect inner-loop training in some way. Hyperparameter search [6, 18, 35] is a simple form of meta-learning, in which parameters such as learning rate are adjusted to increase training efficiency. Methods such as MAML [3, 15, 17] and Reptile [36] define meta-parameters as the initial parameters of an inner training loop, with the objective of discovering robust representations that can be easily adapted to new problems. Others view meta-parameters as specifications for a neural architecture, a subset known as Neural Architecture Search [13, 34, 40]. Still other meta-learning algorithms optimize the inner learning function itself, such as defining the learning function as a neural network [2, 8, 11], or adjusting an auxiliary reward function in reinforcement learning [23, 25, 52]. In our method, meta-parameters are included in the genome, which both defines a starting point for mutation and can also parametrize the mutation function itself.

A common bottleneck in meta-learning is the computational cost of a two-loop process. If the inner loop requires many iterations of training, then the outer loop can become prohibitively expensive. Thus, many meta-learning algorithms focus specifically on the one-shot or few-shot domain to ensure a small inner loop [33, 42, 47]. In many domains, however, it is unrealistic to successfully solve a new task in only a few updates. In addition, meta-parameters that are strong in the short term can degrade in the long term [51]. In this case, other methods must be used to meta-learn over the long term, such as implicit differentiation [39]. The approach we adopt is to combine the inner and outer loops, and optimize parameters and meta-parameters simultaneously, which can significantly reduce computation cost [5, 14, 16, 19]. The key issue here is that models may overfit on their current tasks instead of developing long-term learning abilities. We claim that evolving large populations of genomes reduces this problem, as shown in the experiments below.
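To make the standard two-loop structure concrete, the sketch below meta-learns an initialization over a toy distribution of one-dimensional tasks, in the style of Reptile [36]. The quadratic task family, step sizes, and update rule are illustrative assumptions for this sketch, not details from this paper.

    import random

    def inner_loop(theta, target, steps=10, lr=0.1):
        """Inner loop: train on one task by minimizing (theta - target)^2."""
        for _ in range(steps):
            theta -= lr * 2 * (theta - target)  # gradient of squared error
        return theta

    meta_theta = 0.0  # meta-parameter: the initialization of the inner loop
    for _ in range(1000):  # outer loop over sampled tasks
        target = random.gauss(5.0, 1.0)              # sample a new task
        adapted = inner_loop(meta_theta, target)     # inner loop solves it
        meta_theta += 0.05 * (adapted - meta_theta)  # move init toward solution

    print(meta_theta)  # drifts toward 5.0, the mean of the task distribution

Each outer step requires a full inner training run, which is the computational bottleneck described above; PBML instead collapses the two loops into a single evolutionary process.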
2.2 Evolvability

The definition of evolvability has often shifted around, and while there is no strict consensus, measurements of evolvability generally fall into two camps: diversity and adaptation. Diversity-based evolvability focuses on the ability of a genome to produce phenotypically diverse offspring [20, 29, 31, 37]. This measurement, however, often depends on a hand-specified diversity function. In the context of learning to learn, diverse offspring are not always optimal; rather, diversity is only one factor in an exploration-exploitation tradeoff to produce high-fitness offspring [9]. Instead, we consider adaptation-based evolvability [12, 28, 41], which measures the performance of offspring some generations after a change in the fitness landscape.

As evolvability is a highly desired property in evolutionary algorithms, many works aim to explain how evolvability emerges in evolutionary systems. Some works attribute evolvability to the rules of the environment. Periodically varying goals have been shown to induce evolvability in a population of genomes, as genomes with more evolvable representations can adapt more quickly to survive selection [22, 28]. In addition, environments where novelty is rewarded or where competition is spread around niches will naturally favor genomes that produce diverse offspring [32]. Others take a different viewpoint, and instead consider the genetic algorithm and representations in use as a major cause of evolvability. Specifically, research into evolvable representations has shown that there are benefits to modularity and redundancy in genomes [41]. Exploring neutral landscapes, where multiple genotypes map to the same phenotype, has also been shown to aid in mutation diversity [12]. A few works go a step further and claim that given certain conditions, the appearance of evolvability is inevitable [10, 32], a view we further examine in this work.

2.3 Evolutionary Meta Learning

Many methods take inspiration from both evolutionary systems and meta-learning, in most cases using evolution as a gradient-free optimizer for either the inner or outer meta-learning loop. Evolution in the outer loop has been shown to successfully learn policy gradient algorithms [26], loss functions [21], neural network structures [46], and initial training parameters [45]. There has been little focus, however, on viewing evolutionary systems as both the inner and outer loop in a meta-learning formulation, where a genome evolves while improving its own learning ability.

In addition, a recent trend is to use the method known as Evolution Strategies [44], which involves evolving a global genome by mutating a population of offspring, then updating the genome based on a weighted average of the offspring's parameters. This method has proven the scalability of evolutionary approaches; however, it remains a greedy algorithm, which as shown in the experiments below fails to select for long-term benefits.

3 APPROACH

Our motivation is to merge the benefits of evolutionary algorithms and meta-learning into a single method. Genomes that display strong adaptive abilities – genomes that have learned to learn – are heavily desired in areas such as open-endedness [30, 48] and quality diversity [38], as they can efficiently explore new niches and ideas. Traditional meta-learning methods, however, are often bottlenecked by expensive inner and outer loops, and optimization can be limited by gradient degradation. Rather, we would prefer to continuously train a model on new tasks, and have the model naturally improve its learning ability over time.

We present Population-Based Meta Learning (PBML), an easily specified evolutionary algorithm that favors genomes which quickly adapt to changes in the fitness landscape.

3.1 Reasoning

We first define evolutionary systems as methods that iteratively select for genomes with high fitness. In a standard evolutionary system, we consider a population of genomes along with a fitness function. Every generation, genomes that display high fitness increase in population size, while genomes with lower fitness die off. Exploration is introduced in the form of descendants – each genome occasionally mutates, producing offspring with slight variations. Over generations, the population will skew towards genomes that display high fitness.

We now address the meta-learning objective of such a system. The traditional meta-learning objective is to optimize for performance after a series of learning iterations. In evolutionary systems, this learning iteration is a combination of mutation and selection. A simple visualization is a single genome which mutates to produce 100 offspring, of which the child with the highest fitness survives.
This can be seen as a one-step lookahead search towards the peak of a fitness landscape. As such, the meta-learning performance of a genome can be defined as the fitness of its offspring after a number of generations. This is the same definition as adaptation-based evolvability, a parallel which guides us moving forwards.

A genome that has "learned to learn" must in fact "learn to mutate". The simplest form of learning to mutate involves developing a robust genetic representation. Consider an environment where the mutation function of a genome consists of noise added to its parameters. In this case, a genome where slight changes in parameters result in large changes in the displayed phenotype would be able to explore a larger space than its counterparts. The mutation function may also be directly parametrized in the genome, such as by a parameter defining mutation radius. Crucially, producing diverse offspring through mutation is not always optimal; there is an exploration-exploitation tradeoff. However, some dimensions may consistently be worth exploring – e.g. variation in dental structure helps adapt to new food types, but variation in blood proteins is often lethal. A strong meta-learned genome would encode these inductive biases into its mutation function, and explore in dimensions that are useful for improving fitness.

The key intuition behind PBML is that evolutionary systems with large populations naturally optimize for long-term fitness. As a simple example, consider two genomes, Red and Blue. Red has higher fitness than Blue, but it also has a gene that causes 50% of its offspring to mutate lethally. Over time, descendants of Red will die at higher rates, so the descendants of Blue will become more present in the population. In other words, a gene with high survivability not only grants high fitness to its genome, but must also maintain this fitness advantage in the offspring that inherit it. Thus, over generations, the genes that survive will be genes that allow their inheritors to achieve high fitness – precisely the definition of a gene with strong meta-learning capabilities.

Algorithm 1: Population-Based Meta Learning

Define an initial set of genomes G. Each genome contains parameters, as well as a population count.
Define a fitness function F(), which evaluates the fitness of a genome.
Define a stochastic mutation function M(), which creates offspring by mutating the parameters of a genome.
Define decay ratio D, the ratio of population that dies every generation.
Define offspring count C, the total number of offspring produced every generation.
Define ranking function Rank(), which outputs a genome's floating-point rank between 0 and 1 based on fitness.

for timestep = 0...N do
    for g in G do
        g.fitness = F(g.parameters)
        g.rank = Rank(g.fitness)
        g.population *= (1 - D)
        g.population *= g.rank
        g.population /= total_population
        for child in int(g.population * C) do
            G.append(M(g))
        end
    end
end
An important requirement for meta-learning to be selected for is that a variety of genomes must be allowed to survive each generation. In evolutionary systems, genomes compete with each other to survive by increasing their fitness over generations. It is important that genomes with lower fitness are not immediately removed, so that competition for long-term fitness can emerge. Imagine a greedy evolutionary system where only a single high-fitness genome survives through each generation. Even if that genome's mutation function had a high lethality rate, it would still remain. In comparison, in an evolutionary system where multiple lineages can survive, a genome with lower fitness but stronger learning ability can survive long enough for its benefits to show.

Large population numbers are helpful for two reasons: to counteract noise, and to capture selection effects. A genome with a strong mutation function will produce fitter offspring on average, but these effects are confounded with the noise inherent to mutation. The larger the number of offspring, the more accurately an evolutionary system will skew towards strong mutation functions. In addition, large population counts allow finer-grained effects to be represented. For example, imagine the population of genome A is to fall by 7%, and that of genome B by 10%. If only 10 copies of each genome are present, both genomes will fall to 9 population, and the difference will not be seen. In contrast, if 10,000 copies are present, then the population counts can be more smoothly adjusted.

In PBML, we approximate this behavior by defining each genome's population as a ratio. This provides a continuous way to measure selection pressure due to fitness. A genome with lower fitness will decay its population ratio faster than a higher-fitness genome; however, it will remain in the simulation until it drops below a certain cutoff percentage. As genomes with higher populations produce more offspring, the overall population will gradually skew towards high-fitness genomes, while allowing many lineages of genomes to be compared across generations.
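The sketch below renders Algorithm 1 in Python. The toy fitness and mutation functions, the hyperparameter values, and the choice to give each child an equal initial population share are illustrative assumptions; Algorithm 1 leaves these unspecified.

    import random

    D = 0.5        # decay ratio: fraction of each population lost per generation
    C = 1000       # total offspring produced every generation
    CUTOFF = 1e-3  # genomes below this population ratio are removed

    def F(params):
        """Toy fitness function: higher x is better."""
        return params["x"]

    def M(parent):
        """Stochastic mutation: perturb the parent's parameters with noise."""
        params = dict(parent["params"])
        params["x"] += random.gauss(0.0, 1.0)
        # Assumption: each child starts with an equal share of the population.
        return {"params": params, "population": 1.0 / C}

    G = [{"params": {"x": 0.0}, "population": 1.0}]

    for generation in range(100):
        # Evaluate fitness and assign floating-point ranks in [0, 1].
        for g in G:
            g["fitness"] = F(g["params"])
        for i, g in enumerate(sorted(G, key=lambda g: g["fitness"])):
            g["rank"] = (i + 1) / len(G)

        # Decay each population, weighted by rank, then renormalize to ratios.
        for g in G:
            g["population"] *= (1.0 - D) * g["rank"]
        total = sum(g["population"] for g in G)
        for g in G:
            g["population"] /= total

        # High-population genomes produce proportionally more offspring.
        children = [M(g) for g in G for _ in range(int(g["population"] * C))]
        G = [g for g in G + children if g["population"] >= CUTOFF]

    print(max(F(g["params"]) for g in G))

Note that genomes are removed only at the cutoff, so many lineages coexist and compete across generations, which is the property the reasoning above depends on.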

4 EXPERIMENTS

In a series of experiments, we examine how population-based evolution can discover genomes with strong learning ability. In the Numeric Fitness world, we show that population-based methods optimize for long-term improvement, even if current fitness is decoupled from improvement ability. In the Square Fitness world, we show that genomes can successfully learn mutation functions which efficiently explore areas with high potential fitness, while avoiding zero-fitness areas. Finally, we show that population-based evolution discovers robotic control policies that are easily adaptable to new goals.

Figure 2: Numeric Fitness world, where current fitness (X) is decoupled from long-term learning ability (R). Population-based evolution results in genomes whose learning ability steadily increases, as shown in a quadratic learning curve (top) and an increasing mutation radius (bottom). In contrast, greedy single-genome evolution and random drift show no clear improvement in learning ability.

4.1 Do population-based evolutionary systems select for genomes with long-term fitness improvement?

We first present an environment where a genome's long-term learning ability is decoupled from its current fitness, which we refer to as the Numeric Fitness world. We define a genome as a set of two numbers, X and R. The fitness of a genome is determined entirely by its X value, with higher X values resulting in higher fitness. Learning ability, on the other hand, is encoded by R, which parametrizes the amount of noise inserted in the mutation function.

    Genome = {X, R}                                      (1)
    Fitness(X, R) = X                                    (2)
    Mutate(X, R):  X ← X + Normal() · 1.05^R,
                   R ← R + Normal()                      (3)

In this Numeric Fitness experiment, our goal is to examine whether long-term learning ability appears when only current fitness is selected for. Specifically, long-term learning is represented by the R parameter of the genome, which specifies the mutation radius of X. Genomes with a larger R value will create a larger deviation in their offspring's X values, and are therefore more likely to produce offspring with higher fitness than their peers. Importantly, while R has an effect on the offspring produced by a genome, it has no effect on fitness in the present. Thus, if R steadily increases through the generations, we can claim that there is consistent pressure to improve mutation ability even with no direct fitness gain.
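Equations 1-3 transcribe directly into code; the sketch below is that transcription, assuming Normal() denotes standard Gaussian noise.

    import random

    def fitness(genome):
        """Eq. 2: fitness depends only on X; R is invisible to selection."""
        x, r = genome
        return x

    def mutate(genome):
        """Eq. 3: R scales the mutation radius of X, and itself drifts by noise."""
        x, r = genome
        return (x + random.gauss(0.0, 1.0) * 1.05 ** r,
                r + random.gauss(0.0, 1.0))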

Figure 3: Left: Square Fitness world, where fitness-containing squares are uniformly spread throughout a gridworld. Every ten generations, the fitness value of each square is shuffled. Non-square areas have zero fitness. Right: Hard Square world, where squares have only a 10% chance to be high-fitness.

In Figure 2 (top), we show that population-based evolutionary systems steadily increase their fitness improvement rate across generations. In the population-based system, 1000 offspring are produced each generation, with high-population genomes producing more offspring. Every generation, the population ratios of existing genomes drop by between 25% and 100% depending on their fitness rank, and genomes with a ratio of less than 1/1000 are removed. In the greedy single-genome system, only the highest-fitness genome survives through each generation, and all 1000 offspring are mutations of this single genome. The random drift system is the same as the population-based system, except all genomes are ranked as equal fitness. Notably, while in the beginning the single-genome system achieves higher average fitness, over time the genomes created through population-based evolution outperform the greedy variety.

Figure 2 (bottom) demonstrates the reasoning behind the learning ability gap. In a population-based evolutionary system, the survivability of a genome depends not only on current fitness but also on the fitness of its descendants. This results in a consistent pressure to improve learning ability through generations, which manifests in the form of a consistently increasing R parameter. In contrast, both greedy single-genome evolution and random drift display no significant trend in the R parameter, as a greedy system optimizes only for fitness within the current generation, and in a random drifting system there is no selection pressure at all.

Figure 4: Genomes evolved in a Square Fitness world (top). After training, genomes are transferred to a Hard Square world (bottom). Population-based evolution results in genomes which develop efficient mutation functions, allowing them to consistently locate the high-fitness squares. Genomes evolved in a static environment instead learn to minimize mutation, and thus remain stuck in low-fitness areas when transferred.

4.2 Can selection pressure for long-term fitness result in encoding useful inductive biases within a genome's mutation function?

In a second experiment, the Square Fitness world, we consider a setup where a genome parametrizes its own mutation function in a non-trivial manner. A genome is defined by an X and Y coordinate, along with a 16-length vector R that specifies its mutation behavior. Specifically, a genome mutates by creating a child a certain distance away from its XY coordinate, with the direction chosen randomly. When a genome mutates, each parameter in its R vector determines the probability that a child will be created a certain distance away.

    Genome = {X, Y, R1, R2, R3, ..., R16}                    (4)
    Mutate(X, Y, R):  Radius ~ Softmax(R1, R2, R3, ..., R16),
                      Direction = Random(360),
                      X ← X + Cos(Direction) · Radius,
                      Y ← Y + Sin(Direction) · Radius        (5)

The fitness landscape in this world is defined as a 256x256 coordinate grid, with a fitness value for every coordinate. Fitness is distributed as a series of spaced-out squares. Within each square, fitness can range between 0.3 and 1, whereas fitness remains at 0 in the areas in between. Importantly, a given square always has six neighbors within a fixed distance of itself. Every few generations, the fitness values of the squares shuffle, while their positions remain the same.

In this experiment, a successfully meta-learned genome would smartly define its mutation function to explore efficiently. It is important to be able to explore different squares to find the areas with the highest fitness. It is inefficient, however, to simply explore randomly around a coordinate, as most of the fitness landscape consists of zero-fitness area. Rather, a strong mutation function will only create offspring that land on a variety of other squares.
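The sketch below implements this mutation function, reading Equation 5 as sampling a radius from the softmax over R. The mapping of the 16 entries to radii 0-15 (with radius 0 meaning the child keeps its parent's coordinate) is an assumption, chosen to be consistent with the behavior described in Figure 5.

    import math
    import random

    def softmax(v):
        exps = [math.exp(x) for x in v]
        total = sum(exps)
        return [e / total for e in exps]

    def mutate(x, y, r_vector):
        """Eq. 5: R defines a distribution over radii; direction is uniform."""
        probs = softmax(r_vector)                             # 16 probabilities
        radius = random.choices(range(16), weights=probs)[0]  # assumed radii 0..15
        direction = random.uniform(0.0, 2.0 * math.pi)        # Random(360), in radians
        return (x + math.cos(direction) * radius,
                y + math.sin(direction) * radius)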

Figure 5: Offspring produced by various genomes in the Square Fitness world, according to their learned mutation behavior. Population-based evolution genomes produce offspring in a ring corresponding to the nearest set of neighboring squares, allowing for efficient exploration. Genomes in the static-fitness environment learn to reduce the effect of mutation, whereas mutation functions created from single-genome evolution and random drift show no clear behavior.

Figure 5 shows the mutation behaviors learned from the various evolutionary methods. At the start, genomes have a uniform mutation function at every radius. After running population-based evolution, the genomes instead develop a prominent bias in their mutation function. Specifically, they place a large chunk of probability on maintaining their current coordinate, which can be seen as exploitation. They additionally, however, place some probability on producing offspring at a radius of 9-10 away from themselves – which precisely matches the distance to a neighboring square.

Notably, this effect does not appear under certain variations. First, in a greedy single-genome evolutionary system, there is no pressure on adjusting the mutation function. As such, the general mutation behavior of the population drifts randomly. While we see rings, these rings are an artifact of the homogeneous population, and they do not correspond to any meaningful distance. The second ablation compares to a population-based evolutionary system in which the fitness landscape is held static. In this case, we do in fact see pressure to improve long-term fitness; however, this pressure does not manifest in a mutation function that can be considered a strong learner.
Instead, the population has learned to favor genomes which display as little diversity as possible. Intuitively, the genomes in this population have already achieved the optimal phenotype, and learn to continuously exploit it rather than exploring further.

4.3 Do strong mutation functions enable faster learning on new, unseen tasks?

A common goal in meta-learning is to create models that can quickly solve new tasks. In general, models are first trained on a distribution of related tasks so that general meta-knowledge can be extracted. This meta-knowledge is then used to efficiently learn the specifics of a new task.

We utilize this framework to test the learning capabilities of genomes evolved in the Square Fitness environment. We define a new environment, referred to as the Hard Square environment, in which each square has only a 10% chance to contain high (1.0) fitness. All other squares instead contain low (0.3) fitness, and all non-square areas contain zero fitness. Additionally, high-fitness squares are limited to the outer areas; thus squares near the center are always low-fitness. This task poses a challenging exploration problem, as high-fitness areas are sparsely located, and there is no gradient to inform a direction of improvement.

Figure 4 shows that population-based evolution genomes significantly outperform the others when transferred. This is likely due to the biases encoded in their mutation function – by exploring only in the spaces where squares are present, they are much more likely to discover high-fitness areas. In contrast, genomes evolved in a static-fitness environment demonstrate their weakness – they have learned to reduce exploration, and thus fail to escape the local minimum of the central low-fitness squares. Genomes evolved through single-genome evolution and random drift fail to achieve even this baseline, as their inefficient mutation functions result in a large percentage of offspring being born into zero-fitness areas.

Figure 6: Robotic control task, in which a two-joint arm must be navigated to a goal point. In all methods except the static-environment system, the goal location is randomized every 50 generations. Genomes define a neural network policy which is given observations about the arm's position and velocity, and controls the force applied to each joint.

4.4 Can population-based evolution help to efficiently adapt robotic control policies?

Finally, we wish to examine whether population-based evolution can meta-learn in a domain with more complex dynamics. We consider the Reacher-v2 environment, as defined in OpenAI Gym [7], where a Mujoco-simulated [50] arm must maneuver its tip to a point on a grid. The arm is controlled by adjusting the torque of two joints, and information about the current position and angles of the arm is given as an observation every frame. Fitness is calculated as the distance between the arm and the goal point over 50 frames. Every fifty generations, a new goal point is sampled, forcing genomes to quickly adapt their policies to the new goal.

To allow genomes to develop meta-learning capabilities, each genome is defined as a four-layer neural network. The second and third layers are comprised of three independent feedforward modules, which are summed according to a parametrized softmax function to form the full output of the layer. Every module has an independent mutation radius, encoded as a 12-parameter vector stored in the genome. (A code sketch of this genome appears at the end of this section.)

We define the genome this way to enable two avenues for improving learning: robust representations and smart mutation functions. Neural networks map a set of parameters to a policy. There are multiple networks that can map to the same policy; however, some networks may be better at adapting than others. To increase learning ability, a genome may learn a robust representation that can easily mutate to produce diversity. In addition, a genome can adjust the mutation radii of its modules to develop fine-tuned exploration behavior, such as lowering mutation in a module that parses observations, but increasing mutation in a module that calculates where to position the arm.

Figure 7: Training on the robotic control task. Population-based evolution manages to discover genomes which can adapt to new goal points, whereas single-genome evolution and random drift genomes develop unstable mutation functions that hinder learning.

Figure 7 shows that population-based evolution generally maintains a steady adaptation rate during training, while single-genome evolution results in more unstable behavior. To further measure the learning ability of the various genomes, we then transfer each genome to an unseen test goal. Table 1 showcases the advantages and drawbacks of the population-based system. Notably, the population-based and static-environment genomes learn to constrain their mutation functions, and can maintain strong policies through many generations. In contrast, offspring from genomes in the other methods have a high chance to mutate badly, lowering their average fitness. This constraint, however, comes at a slight cost. The population-based genome comes close to the top-performing policy but falls short, as it likely has stopped mutation in a module that is slightly sub-optimal. This can be seen as an exploration-exploitation tradeoff, in which the genome gains a higher average fitness by constraining its search space of offspring, but can fail when the optimal solution is outside the space it considers.

Table 1: Fitness after 50 generations on a new goal

Method                          Top Fitness    Average Fitness
Population-Based Evolution      -2.962         -2.966
Single-Genome Evolution         -10.835        -34.591
Static-Environment Evolution    -5.197         -5.199
Random Drift                    -3.585         -8.829
From Scratch                    -2.2415        -9.982
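Below is a sketch of the modular policy genome described in this section. The layer widths, the mapping of the 12 mutation-radius parameters onto modules, and the mutation step sizes are our assumptions; the paper specifies only the overall structure.

    import copy
    import numpy as np

    OBS, HID, ACT, MODULES = 11, 32, 2, 3   # widths are assumed, not specified

    class ReacherGenome:
        def __init__(self):
            self.w_in = np.random.randn(OBS, HID) * 0.1       # layer 1
            # Layers 2 and 3: three independent feedforward modules each,
            # mixed by a parametrized softmax to form the layer output.
            self.modules = np.random.randn(2, MODULES, HID, HID) * 0.1
            self.mix = np.zeros((2, MODULES))
            self.w_out = np.random.randn(HID, ACT) * 0.1      # layer 4
            self.log_radius = np.zeros(12)  # per-module mutation radii (assumed layout)

        def act(self, obs):
            h = np.tanh(obs @ self.w_in)
            for layer in range(2):
                w = np.exp(self.mix[layer])
                w /= w.sum()                                  # softmax mixing weights
                h = np.tanh(sum(w[m] * (h @ self.modules[layer, m])
                                for m in range(MODULES)))
            return np.tanh(h @ self.w_out)                    # torque for each joint

        def mutate(self):
            child = copy.deepcopy(self)
            for layer in range(2):
                for m in range(MODULES):
                    # Each module is perturbed with its own learned radius.
                    radius = np.exp(child.log_radius[layer * MODULES + m])
                    child.modules[layer, m] += np.random.randn(HID, HID) * radius * 0.1
                child.mix[layer] += np.random.randn(MODULES) * 0.1
            child.log_radius += np.random.randn(12) * 0.05    # radii themselves evolve
            return child

Because the radii live in the genome, selection over descendants can shrink mutation in observation-parsing modules while keeping it large in goal-dependent ones, which is the exploration behavior described above.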
5 DISCUSSION

In this work, we show that population-based evolutionary systems naturally optimize for genomes that display long-term learning ability. From a meta-learning perspective, this work sheds light on a new way of developing meta-learning algorithms that learn through constant peer competition rather than explicit inner and outer loops.

This paper also contributes to the study of emergent evolvability. Works from the artificial life community have discussed how evolvability emerges in simulated systems [32], while works on biological systems have questioned why the DNA-based genomes of life on Earth display such high adaptive capacity [37]. The ideas we present create a solid perspective on how evolvability can emerge naturally through an evolutionary process over large populations of creatures.

We believe this paper provides a starting point for many future pathways examining evolutionary systems as a form of meta-learning. A promising direction is to examine which populations retain their evolvability as evolution continues. As seen in the experiments where the fitness landscape remains static, in certain environments a genome will lose its learning ability in return for a more consistent set of offspring [10]. However, there may be situations where this does not occur – for example, if a genome derives its evolvability from a robust representation, there would be no pressure to lose that representation. We believe the question of "which genomes retain evolvability" is a promising future path.

Another direction of research lies in asking "what aspects of a population define its learning ability". In this paper, we focused on genomes which learn strong mutation functions, so that their offspring have a higher expected fitness. Another avenue could be to instead learn a strong selection function. For example, rather than allowing only its high-fitness offspring to survive, a genome could encode an artificial selection function that rewards offspring with high diversity. In the long run, such a selection function could still result in higher fitness due to a higher capacity for exploration.

We hope this paper serves as a conduit between the study of meta-learning and evolvability in evolutionary systems, which we believe to be heavily related. There is room for many future works in areas connecting these fields, and we believe that sharing ideas will result in further insight in both directions.

ACKNOWLEDGMENTS

Thanks to Lisa Soros for early feedback.

REFERENCES

[1] Lee Altenberg et al. 1994. The evolution of evolvability in genetic programming. Advances in Genetic Programming 3 (1994), 47–74.
[2] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. 2016. Learning to learn by gradient descent by gradient descent. arXiv preprint arXiv:1606.04474 (2016).
[3] Antreas Antoniou, Harrison Edwards, and Amos Storkey. 2018. How to train your MAML. arXiv preprint arXiv:1810.09502 (2018).
[4] K. Arndt, M. Hazara, A. Ghadirzadeh, and V. Kyrki. 2020. Meta Reinforcement Learning for Sim-to-real Domain Adaptation. In 2020 IEEE International Conference on Robotics and Automation (ICRA). 2725–2731. https://doi.org/10.1109/ICRA40945.2020.9196540
[5] Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. 2017. Online learning rate adaptation with hypergradient descent. arXiv preprint arXiv:1703.04782 (2017).
[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, 2 (2012).
[7] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016).
[8] Yutian Chen, Matthew W Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P Lillicrap, Matt Botvinick, and Nando de Freitas. 2017. Learning to learn without gradient descent by gradient descent. In International Conference on Machine Learning. PMLR, 748–756.
[9] Giuseppe Cuccu and Faustino Gomez. 2011. When novelty is not enough. In European Conference on the Applications of Evolutionary Computation. Springer, 234–243.
[10] Stephane Doncieux, Giuseppe Paolo, Alban Laflaquière, and Alexandre Coninx. 2020. Novelty search makes evolvability inevitable. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference. 85–93.
[11] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. 2016. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779 (2016).
[12] Marc Ebner, Mark Shackleton, and Rob Shipman. 2001. How neutral networks influence evolvability. Complexity 7, 2 (2001), 19–33.
[13] Thomas Elsken, Jan Hendrik Metzen, Frank Hutter, et al. 2019. Neural architecture search: A survey. Journal of Machine Learning Research 20, 55 (2019), 1–21.
[14] Chrisantha Fernando, Jakub Sygnowski, Simon Osindero, Jane Wang, Tom Schaul, Denis Teplyashin, Pablo Sprechmann, Alexander Pritzel, and Andrei Rusu. 2018. Meta-learning by the Baldwin effect. In Proceedings of the Genetic and Evolutionary Computation Conference Companion. 1313–1320.
[15] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning. PMLR, 1126–1135.
[16] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. 2019. Online meta-learning. In International Conference on Machine Learning. PMLR, 1920–1930.
[17] Chelsea Finn, Kelvin Xu, and Sergey Levine. 2018. Probabilistic model-agnostic meta-learning. arXiv preprint arXiv:1806.02817 (2018).
[18] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. 2017. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning. PMLR, 1165–1173.
[19] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. 2017. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767 (2017).
[20] Alexander Gajewski, Jeff Clune, Kenneth O Stanley, and Joel Lehman. 2019. Evolvability ES: Scalable and direct optimization of evolvability. In Proceedings of the Genetic and Evolutionary Computation Conference. 107–115.
[21] Santiago Gonzalez and Risto Miikkulainen. 2020. Improved training speed, accuracy, and data utilization through loss function optimization. In 2020 IEEE Congress on Evolutionary Computation (CEC). IEEE, 1–8.
[22] John J Grefenstette. 1999. Evolvability in dynamic fitness landscapes: A genetic algorithm approach. In Proceedings of the 1999 Congress on Evolutionary Computation (CEC99), Vol. 3. IEEE, 2031–2038.
[23] Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, and Sergey Levine. 2018. Meta-reinforcement learning of structured exploration strategies. arXiv preprint arXiv:1802.07245 (2018).
[24] Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. 2020. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439 (2020).
[25] Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. 2018. Evolved policy gradients. arXiv preprint arXiv:1802.04821 (2018).
[26] Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. 2018. Evolved policy gradients. arXiv preprint arXiv:1802.04821 (2018).
[27] Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci, Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, and Sanja Fidler. 2019. Meta-Sim: Learning to generate synthetic datasets. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4551–4560.
[28] Nadav Kashtan, Elad Noor, and Uri Alon. 2007. Varying environments can speed up evolution. Proceedings of the National Academy of Sciences 104, 34 (2007), 13711–13716.
[29] Marc Kirschner and John Gerhart. 1998. Evolvability. Proceedings of the National Academy of Sciences 95, 15 (1998), 8420–8427.
[30] Joel Lehman and Kenneth O Stanley. 2008. Exploiting open-endedness to solve problems through the search for novelty. In ALIFE. Citeseer, 329–336.
[31] Joel Lehman and Kenneth O Stanley. 2011. Improving evolvability through novelty search and self-adaptation. In 2011 IEEE Congress of Evolutionary Computation (CEC). IEEE, 2693–2700.
[32] Joel Lehman and Kenneth O Stanley. 2013. Evolvability is inevitable: Increasing evolvability without the pressure to adapt. PLoS ONE 8, 4 (2013), e62186.
[33] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. 2017. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835 (2017).
[34] Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018).
[35] Paul Micaelli and Amos Storkey. 2020. Non-greedy gradient-based hyperparameter optimization over long horizons. arXiv preprint arXiv:2007.07869 (2020).
[36] Alex Nichol, Joshua Achiam, and John Schulman. 2018. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999 (2018).
[37] Massimo Pigliucci. 2008. Is evolvability evolvable? Nature Reviews Genetics 9, 1 (2008), 75–82.
[38] Justin K Pugh, Lisa B Soros, and Kenneth O Stanley. 2016. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI 3 (2016), 40.
[39] Aravind Rajeswaran, Chelsea Finn, Sham Kakade, and Sergey Levine. 2019. Meta-learning with implicit gradients. arXiv preprint arXiv:1909.04630 (2019).
[40] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2019. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4780–4789.

[41] Joseph Reisinger, Kenneth O Stanley, and Risto Miikkulainen. 2005. Towards an empirical measure of evolvability. In Proceedings of the 7th Annual Workshop on Genetic and Evolutionary Computation. 257–264.
[42] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. 2018. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676 (2018).
[43] Nataniel Ruiz, Samuel Schulter, and Manmohan Chandraker. 2018. Learning to simulate. arXiv preprint arXiv:1810.02513 (2018).
[44] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. 2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864 (2017).
[45] Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, and Yunhao Tang. 2019. ES-MAML: Simple Hessian-free meta learning. arXiv preprint arXiv:1910.01215 (2019).
[46] Kenneth O Stanley, Jeff Clune, Joel Lehman, and Risto Miikkulainen. 2019. Designing neural networks through neuroevolution. Nature Machine Intelligence 1, 1 (2019), 24–35.
[47] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. 2019. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 403–412.
[48] Tim Taylor. 2019. Evolutionary innovations and where to find them: Routes to open-ended evolution in natural and artificial systems. Artificial Life 25, 2 (2019), 207–224.
[49] Sebastian Thrun and Lorien Pratt. 2012. Learning to Learn. Springer Science & Business Media.
[50] Emanuel Todorov, Tom Erez, and Yuval Tassa. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 5026–5033.
[51] Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. 2018. Understanding short-horizon bias in stochastic meta-optimization. arXiv preprint arXiv:1803.02021 (2018).
[52] Tianbing Xu, Qiang Liu, Liang Zhao, and Jian Peng. 2018. Learning to explore via meta-policy gradient. In International Conference on Machine Learning. PMLR, 5463–5472.