arXiv:1902.01724v3 [cs.NE] 14 Jul 2019

AlphaStar: An Evolutionary Computation Perspective

Kai Arulkumaran
Imperial College London
London, United Kingdom
[email protected]

Antoine Cully
Imperial College London
London, United Kingdom
[email protected]

Julian Togelius
New York University
New York City, NY, United States
[email protected]

ABSTRACT
In January 2019, DeepMind revealed AlphaStar to the world: the first artificial intelligence (AI) system to beat a professional player at the game of StarCraft II, representing a milestone in the progress of AI. AlphaStar draws on many areas of AI research, including deep learning, reinforcement learning, game theory, and evolutionary computation (EC). In this paper we analyze AlphaStar primarily through the lens of EC, presenting a new look at the system and relating it to many concepts in the field. We highlight some of its most interesting aspects: the use of Lamarckian evolution, competitive co-evolution, and quality diversity. In doing so, we hope to provide a bridge between the wider EC community and one of the most significant AI systems developed in recent times.

CCS CONCEPTS
• Computing methodologies → Multi-agent reinforcement learning; Neural networks; Bio-inspired approaches.

KEYWORDS
Lamarckian evolution, co-evolution, quality diversity

ACM Reference Format:
Kai Arulkumaran, Antoine Cully, and Julian Togelius. 2019. AlphaStar: An Evolutionary Computation Perspective. In Genetic and Evolutionary Computation Conference Companion (GECCO '19 Companion), July 13–17, 2019, Prague, Czech Republic. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3319619.3321894

GECCO '19 Companion, July 13–17, 2019, Prague, Czech Republic. © 2019 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery. ACM ISBN 978-1-4503-6748-6/19/07.

1 BACKGROUND
The field of artificial intelligence (AI) has long been involved in trying to create artificial systems that can rival humans in their intelligence, and as such, has looked to games created by humans as a way of challenging AI systems; games therefore have external validity in their use as AI benchmarks [22]. After the defeat of the reigning chess world champion by Deep Blue in 1997, the next major milestone in AI versus human games was in 2016, when a Go grandmaster was defeated by AlphaGo [16].
Both chess and Go were seen as some of the biggest challenges for AI, and arguably one of the few comparable tests remaining is to beat a grandmaster at StarCraft (SC), a real-time strategy game. Both the original game, and its sequel SC II, have several properties that make it considerably more challenging than even Go: real-time play, partial observability, no single dominant strategy, complex rules that make it hard to build a fast forward model, and a particularly large and varied action space.

DeepMind recently took a considerable step towards this grand challenge with AlphaStar, a neural-network-based AI system that was able to beat a professional SC II player in December 2018 [20]. This system, like its predecessor AlphaGo, was initially trained using imitation learning to mimic human play, and then improved through a combination of reinforcement learning (RL) and self-play. At this point the training of AlphaStar agents diverges, as it explicitly utilises population-based training (PBT) [9] to keep a population of agents that train against each other [8]. This training process was built upon multi-agent RL and game-theoretic perspectives [2, 10], but the very notion of a population is central to evolutionary computation (EC), and hence we can examine AlphaStar through this lens as well¹.

¹Note that we present a high-level overview of general interest, and have left aside the many deep links to the crossovers between EC and game theory [17].

2 COMPONENTS

2.1 Lamarckian evolution
Currently, the most popular approach to training the parameters of neural networks is backpropagation (BP). However, there are many methods to tune their hyperparameters, including evolutionary algorithms (EAs). A particularly synergistic approach is a memetic algorithm (MA), in which evolution is run as an outer optimisation algorithm, and individual solutions can be optimised by other means, such as BP, in an inner loop [12]. In this specific case, MAs can combine the exploration and global search properties of EAs with the efficient local search properties of BP.

PBT [9], used in AlphaStar to train agents, is an MA that uses Lamarckian evolution (LE)²: in the inner loop, neural networks are continuously trained using BP, while in the outer loop, networks are picked using one of several selection methods (such as binary tournament selection), with the winner's parameters overwriting the loser's; the loser also receives a mutated copy of the winner's hyperparameters [6]. PBT was originally demonstrated on a range of supervised learning and RL tasks, tuning networks to higher performance than had previously been achieved. It is perhaps most beneficial in problems with highly non-stationary loss surfaces, such as deep RL, as it can change hyperparameters on the fly.

²A more extensive literature review on LE can be found in the original paper.

As a single network may take several gigabytes of memory, or need to train for several hours, scalability is key for PBT. As a consequence, PBT is both asynchronous and distributed [13]. Rather than running many experiments with static hyperparameters, the same amount of hardware can utilise PBT with little overhead: the outer loop reuses solution evaluations from the inner loop, and requires relatively little communication. When considering the effect of non-stationary hyperparameters and the pre-emption of weaker solutions, the savings are even greater.
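To make the mechanism above concrete, the following toy sketch shows one possible PBT-style Lamarckian outer loop: a binary tournament picks a winner, the winner's learned parameters overwrite the loser's, and the loser receives a perturbed copy of the winner's hyperparameters. This is a minimal illustration under our own assumptions (a scalar "network" trained by gradient descent on a toy objective, with a single learning-rate hyperparameter); names such as Member, inner_loop, and exploit_and_explore are ours, and the sketch does not reflect AlphaStar's actual implementation.

```python
import copy
import random

# Toy PBT sketch: each member's "network" is a single scalar weight trained by
# gradient descent on f(w) = (w - 3)^2, while PBT tunes the learning rate.

class Member:
    def __init__(self, weight, lr):
        self.weight = weight              # "network parameters", trained in the inner loop
        self.lr = lr                      # hyperparameter, evolved in the outer loop
        self.fitness = float('-inf')

def inner_loop(m, steps=20):
    """Inner loop: ordinary gradient descent (stands in for BP-based deep RL)."""
    for _ in range(steps):
        grad = 2.0 * (m.weight - 3.0)
        m.weight -= m.lr * grad
    m.fitness = -(m.weight - 3.0) ** 2    # evaluation is reused by the outer loop

def exploit_and_explore(population):
    """Outer loop step: binary tournament, Lamarckian copy of weights, mutated hyperparameters."""
    a, b = random.sample(population, 2)
    winner, loser = (a, b) if a.fitness >= b.fitness else (b, a)
    loser.weight = copy.deepcopy(winner.weight)         # winner's parameters overwrite the loser's
    loser.lr = winner.lr * random.choice([0.8, 1.25])   # loser gets a perturbed copy of the hyperparameters

population = [Member(weight=random.uniform(-5.0, 5.0), lr=10 ** random.uniform(-3.0, -0.5))
              for _ in range(8)]
for generation in range(10):
    for m in population:
        inner_loop(m)                     # in PBT these run asynchronously, in parallel
    exploit_and_explore(population)
print(max(m.fitness for m in population))
```

In a distributed setting, each inner_loop call would run asynchronously on its own workers, with an outer-loop step triggered whenever a member finishes an evaluation.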
Another consequence of these requirements is that PBT is steady state [19], as opposed to generational EAs such as classic genetic algorithms. A natural fit for asynchronous EAs and LE, steady state EAs allow the optimisation and evaluation of individual solutions to proceed uninterrupted, and hence maximise resource efficiency. The fittest solutions survive longer, naturally providing a form of elitism/hall of fame, but even ancestors that are not elites may be preserved, maintaining diversity³.

³When given an appropriate selection pressure [11].

2.2 Co-evolution
When optimising an agent to play a game, as in AlphaStar, it is possible to use self-play for the agent to improve itself. Competitive co-evolutionary algorithms (CCEAs) can be seen as a superset of self-play: rather than keeping only a solution and its predecessors, it is possible to keep and evaluate against an entire population of solutions. Like self-play, CCEAs form a natural curriculum [7], but also confer additional robustness, as solutions are evaluated against a varied set of other solutions [15, 18].

Through the use of PBT in a CCEA setting, Jaderberg et al. [8] were able to train agents to play a first-person game from pixels, utilising BP-based deep RL in combination with evolved reward functions [1]. The design of CCEAs has many aspects [14], and characterising this approach could lead to many potential variants. Here, for example, the interaction method was atypical, based on sampling agents with similar fitness evaluations (Elo ratings), but many other heuristics exist.
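As an illustration of such a matchmaking heuristic, the sketch below samples opponents with probability decreasing in Elo difference, using a Gaussian weighting. This is a generic approximation of "sample agents with similar fitness", not the exact scheme of Jaderberg et al. [8] or AlphaStar; the function names, the weighting, and the constants are our own assumptions.

```python
import math
import random

def sample_opponent(agent_elo, candidates, sigma=200.0):
    """Pick an opponent, favouring candidates with a similar Elo rating (one possible heuristic)."""
    names = list(candidates.keys())
    # Weight each candidate by a Gaussian on the Elo difference.
    weights = [math.exp(-((candidates[n] - agent_elo) ** 2) / (2.0 * sigma ** 2)) for n in names]
    return random.choices(names, weights=weights, k=1)[0]

def update_elo(winner_elo, loser_elo, k=32.0):
    """Standard Elo update after a match; returns the new (winner, loser) ratings."""
    expected_win = 1.0 / (1.0 + 10.0 ** ((loser_elo - winner_elo) / 400.0))
    delta = k * (1.0 - expected_win)
    return winner_elo + delta, loser_elo - delta

# Example: agent_a is usually matched with the similarly rated agent_b rather than agent_c.
ratings = {'agent_a': 1500.0, 'agent_b': 1520.0, 'agent_c': 1800.0}
opponent = sample_opponent(ratings['agent_a'], {n: e for n, e in ratings.items() if n != 'agent_a'})
print(opponent)
```

Different choices here directly shape the curriculum that the population generates for itself.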
2.3 Quality diversity
A major advantage of keeping a population of solutions, as opposed to a single one, is that the population can represent a diverse set of solutions. This is not strictly restricted to multi-objective optimisation problems, but can also be applied to single objectives, where behaviour descriptors (BDs; i.e., solution phenotypes) can be used to pick solutions in the end. Quality diversity (QD) algorithms explicitly optimise for a single objective (quality), but also search for a large variety of solution types, via BDs, to encourage greater diversity in the population [4]. Recently, Ecoffet et al. [5] used a QD algorithm to reach another milestone in playing games with AI: their system was the first to solve Montezuma's Revenge, a platform game notorious for how difficult its environment is to explore.

In SC, there is no single best strategy. Hence, the final AlphaStar agent consists of the set of solutions from the Nash distribution of the population, that is, the set of complementary, least exploitable strategies [2]. In order to improve training, as well as to increase the variety in the final set of solutions, it therefore makes sense to explicitly encourage diversity. As it does so, AlphaStar can also be classified as a QD algorithm. In particular, agents may have game-specific BDs, such as building extra units of a certain type, but also criteria to beat a certain other agent⁴, criteria to beat a set of other agents, or even a mix of these. Furthermore, these specific criteria are also adapted online, which is relatively novel among QD algorithms [21]. There is more that could be done here though: it may be possible to extract BDs from human data [22], or even to learn them in an unsupervised manner [3]. And, given a set of diverse strategies, a natural next step is to infer which might work best against a given opponent, enabling online adaptation.

⁴A concept highly related to competitive fitness sharing in CCEAs [15].
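To ground the idea of BDs, the sketch below implements a minimal MAP-Elites-style QD archive in the spirit of [4]: candidate solutions are binned by a behaviour descriptor, and each bin keeps only its highest-quality solution. The descriptor used here (a hypothetical "fraction of air units built") and all names are our own illustrative assumptions; AlphaStar's diversity criteria are agent-specific and adapted online rather than fixed bins.

```python
import random

def behaviour_descriptor(solution):
    """Map a solution to a low-dimensional BD; here, a hypothetical 'fraction of air units built'."""
    return (round(solution['air_fraction'], 1),)   # discretised into bins of width 0.1

def fitness(solution):
    """Quality measure, e.g. win rate against a fixed set of opponents (dummy values here)."""
    return solution['win_rate']

archive = {}   # BD bin -> best solution found so far with that behaviour

def try_insert(solution):
    """Keep the solution only if its bin is empty or it beats the current elite (the QD update)."""
    bd = behaviour_descriptor(solution)
    if bd not in archive or fitness(solution) > fitness(archive[bd]):
        archive[bd] = solution

# Example: random candidate strategies fill the archive with diverse, high-quality solutions.
for _ in range(1000):
    try_insert({'air_fraction': random.random(), 'win_rate': random.random()})
print(len(archive), "distinct behaviours kept")
```

The archive then holds a set of diverse, individually strong strategies, loosely analogous to the population from which AlphaStar's final agent is drawn.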
3 DISCUSSION
While AlphaStar is a complex system that draws upon many areas of AI research, we believe a hitherto undersold perspective is that of it as an EA. In particular, it combines LE, CCEAs, and QD to spectacular effect. We hope that this perspective will give both the EC and deep RL communities the ability to better appreciate and build upon this significant AI system.

REFERENCES
[1] David Ackley and Michael Littman. 1991. Interactions between learning and evolution. Artificial Life II 10 (1991), 487–509.
[2] David Balduzzi, Karl Tuyls, Julien Perolat, and Thore Graepel. 2018. Re-evaluating evaluation. In NeurIPS.
[3] Antoine Cully and Yiannis Demiris. 2018. Hierarchical Behavioral Repertoires with Unsupervised Descriptors. In GECCO.
[4] Antoine Cully and Yiannis Demiris. 2018. Quality and diversity optimization: A unifying modular framework. TEVC 22, 2 (2018), 245–259.
[5] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. 2019. Go-Explore: a New Approach for Hard-Exploration Problems. arXiv preprint arXiv:1901.10995 (2019).
[6] David E Goldberg and Kalyanmoy Deb. 1991. A comparative analysis of selection schemes used in genetic algorithms. In Foundations of Genetic Algorithms. Vol. 1. 69–93.
[7] W Daniel Hillis. 1990. Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D: Nonlinear Phenomena 42, 1-3 (1990), 228–234.
[8] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, et al. 2018. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281 (2018).
[9] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, et al. 2017. Population based training of neural networks. arXiv preprint arXiv:1711.09846 (2017).
[10] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, et al. 2017. A unified game-theoretic approach to multiagent reinforcement learning. In NeurIPS. 4190–4203.
[11] Brad L Miller and David E Goldberg. 1995. Genetic algorithms, tournament selection, and the effects of noise. Complex Systems 9, 3 (1995), 193–212.
[12] P Moscato. 1989. On Evolution, Search, Optimization, Genetic Algorithms and Martial Arts: Towards Memetic Algorithms. Technical Report. California Institute of Technology.
[13] Mariusz Nowostawski and Riccardo Poli. 1999. Parallel genetic algorithm taxonomy. In KES. 88–92.
[14] Elena Popovici, Anthony Bucci, R Paul Wiegand, and Edwin D De Jong. 2012. Coevolutionary principles. In Handbook of Natural Computing. 987–1033.
[15] Christopher D Rosin and Richard K Belew. 1997. New methods for competitive coevolution. Evolutionary Computation 5, 1 (1997), 1–29.
[16] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484.
[17] John Maynard Smith. 1982. Evolution and the Theory of Games. Cambridge University Press.
[18] Kenneth O Stanley and Risto Miikkulainen. 2004. Competitive coevolution through evolutionary complexification. JAIR 21 (2004), 63–100.
[19] Gilbert Syswerda. 1991. A study of reproduction in generational and steady-state genetic algorithms. In Foundations of Genetic Algorithms. Vol. 1. 94–101.
[20] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, et al. 2019. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/ (2019).


[21] Rui Wang, Joel Lehman, Jeff Clune, and Kenneth O Stanley. 2019. Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions. arXiv preprint arXiv:1901.01753 (2019).
[22] Georgios N Yannakakis and Julian Togelius. 2018. Artificial Intelligence and Games. Springer.