
November 2020

DeepMind Lab2D

Charles Beattie, Thomas Köppe, Edgar A. Duéñez-Guzmán and Joel Z. Leibo
DeepMind

We present DeepMind Lab2D, a scalable environment simulator for artificial intelligence research that facilitates researcher-led experimentation with environment design. DeepMind Lab2D was built with the specific needs of multi-agent deep reinforcement learning researchers in mind, but it may also be useful beyond that particular subfield.

Corresponding author(s): Charles Beattie ⟨cbeattie@.com⟩, DeepMind, 5 New Street Square, EC4A 3TW

1. Introduction

Are you a product of your genes, brain, or environment? What will an artificial intelligence (AI) be? The development of AI systems is inextricably intertwined with questions about the fundamental causal factors shaping any intelligence, including natural intelligence. Even in a completely abstract scenario, prior knowledge is needed for effective learning, but prior learning is the only realistic way to generate such knowledge.

The centrality of this dynamic is illustrated by experiments in biology. In laboratory animals, manipulations of the rearing environment produce profound effects on both brain structure and behavior. For instance, laboratory rodent environments may be enriched by using larger cages which contain larger groups of other individuals—creating more opportunities for social interaction, variable toys and feeding locations, and a wheel to allow for the possibility of voluntary exercise. Rearing animals in such enriched environments improves their learning and memory, increases synaptic arborization, and increases total brain weight (Van Praag et al., 2000). It is the interaction of environmental factors that is thought to produce the enrichment effects, not any single factor in isolation.

While it is conceivable that AI research could stumble upon a perfectly general learning algorithm without needing to consider its environment, it is overwhelmingly clear that environments affect learning in ways that are not just arbitrary quirks, but real phenomena which we must strive to understand theoretically. There is structure there. For instance, the question of why some environments only generate trivial complexity, like tic-tac-toe, while others generate rich complexity, like Go, is not just a matter of the state space size. It is a real scientific question that merits serious study. We have elsewhere referred to this as the problem problem (Leibo et al., 2019).

One reason it is currently difficult to undertake work aimed at the problem problem is that it is rare for any individual person to possess expertise in all the relevant areas. For instance, researchers know how to design well-controlled experiments but struggle with the necessary skills that are more akin to computer game design and engineering. Another reason why it is difficult to pursue this hypothesis is the prevailing culture in machine learning that views any tinkering with the environment as "hand-crafting", "special-casing", or just "cheating". These attitudes are misguided. In our quest for generality, we must not forget the great diversity and particularity of the problem space.

A diverse set of customizable simulation environments for large-scale 3D environments with varying degrees of physical realism exists (Beattie et al., 2016; Juliani et al., 2018; Kempka et al., 2016; Leibo et al., 2018; Todorov et al., 2012). For 2D, excellent simulation environments also exist (Chevalier-Boisvert et al., 2018; Jiang, 2019; Lanctot et al., 2019; Platanios et al., 2020; Schaul, 2013; Stepleton, 2017; Suarez et al., 2019; Zheng et al., 2017); however, they fell short, at the time this project began, in at least one of our requirements of composability, flexibility, multi-agent capabilities, or performance.

1.1. DeepMind Lab2D

DeepMind Lab2D (or "DMLab2D" for short; https://github.com/deepmind/lab2d) is a platform for the creation of two-dimensional, layered, discrete "grid-world" environments, in which pieces (akin to pieces on a chess board) move around. This system is particularly tailored for multi-agent reinforcement learning. The computationally intensive engine is written in C++ for efficiency, while most of the level-specific logic is scripted in Lua.

The grid. The environments of DMLab2D consist of one or more layers of two-dimensional grids. A position in the environment is uniquely identified by a coordinate tuple (x, y, layer). Layers are labeled by strings, and the x- and y-coordinates are non-negative integers. An environment can have an arbitrary number of layers, and their rendering order is controlled by the user.

Pieces. The environments of DMLab2D are populated with pieces. Each piece occupies a position (x, y, layer), and each position is occupied by at most one piece. Pieces also have an orientation, which is one of the traditional cardinal directions (north, east, south, west). Pieces can move around the (x, y)-space and reorient themselves as part of the evolution of the environment, both relative to their current position/orientation and absolutely. It is also possible for a piece to have no position, in which case it is "off the board". Pieces cannot freely move among layers; instead, a piece's layer is controlled through its state (described next).

States. Each piece has an associated state. The state consists of a number of key-value attributes. Values are strings or lists of strings. The possible values are fixed by the designer as part of the environment. The state of each piece can change as part of the evolution of the environment, but the state change can only select from among the fixed available values.

The state of a piece controls the piece's appearance, layer, group membership, and behavior. Concretely, the state of a piece comprises the following attributes:

• layer (string): The label of the layer which the piece occupies.
• sprite (string): The name of the sprite used to draw this piece.
• groups (list of strings): The groups of which this piece is a member. Groups are mostly used for managing updater functions.
• contact (string): A tag name for a contact event. Whenever the piece enters (or leaves) the same (x, y)-coordinate as another piece (which is necessarily on a different layer), all involved pieces experience a contact event. The event is tagged with the value of this attribute.

An attempt to change a piece's state fails if the piece's resulting (x, y, layer)-position is already occupied.
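
To make this data model concrete, here is a minimal sketch in plain Lua (the language used for level logic). The table layout and every name in it (STATES, tryChangeState, the state and layer names) are our own illustrative choices rather than the actual DMLab2D level API; the sketch only mirrors the attributes and the occupancy rule described above.

-- Illustrative sketch only: a plain-Lua model of the piece/state data
-- described above. Names and layout are hypothetical, not the DMLab2D API.

-- A designer-fixed set of states. Each state fixes layer, sprite, group
-- membership and the contact tag.
local STATES = {
  avatar = {
    layer = 'upperPhysical',    -- layer occupied by pieces in this state
    sprite = 'Avatar',          -- sprite used to draw the piece
    groups = {'movers'},        -- groups, e.g. for updater functions
    contact = 'avatarContact',  -- tag attached to enter/leave contact events
  },
  apple = {layer = 'lowerPhysical', sprite = 'Apple', contact = 'food'},
}

-- A piece occupies at most one (x, y, layer) cell and has an orientation.
-- A nil position would mean the piece is "off the board".
local piece = {state = 'avatar', position = {x = 3, y = 5}, orientation = 'N'}

-- A state change may only select among the designer-fixed states, and it
-- fails if the destination (x, y, layer) cell is already occupied.
-- `grid` is indexed as grid[layer][y][x]; the piece is assumed on the board.
local function tryChangeState(grid, p, newState)
  local target = STATES[newState]
  assert(target ~= nil, 'unknown state: ' .. tostring(newState))
  local row = grid[target.layer] and grid[target.layer][p.position.y]
  local occupant = row and row[p.position.x]
  if occupant ~= nil and occupant ~= p then
    return false  -- destination cell occupied: the change fails
  end
  p.state = newState
  return true
end

-- With an empty grid the change succeeds, since (3, 5, 'lowerPhysical') is free.
local grid = {lowerPhysical = {}, upperPhysical = {}}
print(tryChangeState(grid, piece, 'apple'))  --> true
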
Callbacks. Most of the logic in an environment is implemented via callbacks for specific types (states) of pieces. Callbacks are functions which the engine calls when the appropriate event or interaction occurs.

Raycasts and queries. The engine provides two ways to enumerate the pieces in particular positions (and layers) on the grid: raycasts and queries. A raycast, as the name implies, finds the first piece, if any, in a ray from a given position. A query finds all pieces within a particular area in the grid, shaped like a disc, a diamond, or a rectangle.
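
The sketch below continues in the same illustrative spirit: a per-state callback table plus toy raycast and diamond-query functions over a flat list of pieces. Again, every name here is hypothetical; the real engine performs these operations in C++ over its own grid representation.

-- Illustrative sketch only (hypothetical names, not the DMLab2D API).

-- Callbacks keyed by state: functions the engine would call when the
-- corresponding event touches a piece in that state.
local callbacks = {
  apple = {
    onContact = function(piece, other)
      print(('apple at (%d, %d) touched by %s'):format(
          piece.position.x, piece.position.y, other.state))
    end,
  },
}

local function fireContact(piece, other)
  local cb = callbacks[piece.state] and callbacks[piece.state].onContact
  if cb then cb(piece, other) end
end

-- A diamond query: every piece within Manhattan distance `radius` of (x, y).
local function queryDiamond(pieces, x, y, radius)
  local found = {}
  for _, p in ipairs(pieces) do
    if p.position ~= nil then
      local d = math.abs(p.position.x - x) + math.abs(p.position.y - y)
      if d <= radius then found[#found + 1] = p end
    end
  end
  return found
end

-- A raycast: the first piece, if any, along direction (dx, dy) from (x, y).
local function raycast(pieces, x, y, dx, dy, maxSteps)
  for step = 1, maxSteps do
    local cx, cy = x + dx * step, y + dy * step
    for _, p in ipairs(pieces) do
      if p.position and p.position.x == cx and p.position.y == cy then
        return p, step
      end
    end
  end
  return nil
end

local pieces = {
  {state = 'avatar', position = {x = 0, y = 0}},
  {state = 'apple',  position = {x = 2, y = 0}},
}
fireContact(pieces[2], pieces[1])              --> apple at (2, 0) touched by avatar
print(#queryDiamond(pieces, 0, 0, 2))          --> 2
print((raycast(pieces, 0, 0, 1, 0, 5)).state)  --> apple
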

1.2. Why 2D, not 3D?

Two-dimensional environments are inherently easier to understand than three-dimensional ones, at very little, if any, loss of expressiveness. Even a game as simple as Pong, which essentially consists of three moving rectangles on a black background, can capture something fundamental about the real game of table tennis. This abstraction makes it easier to capture the essence of the problems and concepts that we aim to solve. 2D games have a long history of making challenging and interpretable benchmarks for artificially intelligent agents (Mnih et al., 2015; Samuel, 1959; Shannon, 1950).

Rich complexity along numerous dimensions can be studied in 2D just as readily as in 3D, if not more so. When studying a particular research question, it is not clear a priori whether specific aspects of 3D environments are crucial for obtaining the desired behavior in the training agents. Even when explicitly studying phenomena like navigation and exploration, where organisms depend on complex visual processing and continuous-time physical environments, researchers in reinforcement learning often need to discretize the interactions and observations so that they become tractable. Moreover, 2D worlds can often capture the relevant complexity of the problem at hand without the need for continuous-time physical environments. This pattern, in which studying phenomena in 2D worlds is a critical first step towards further advances in more complex and realistic environments, is ubiquitous in the field of artificial intelligence. 2D worlds have been successfully used to study problems as diverse as social complexity, navigation, imperfect information, abstract reasoning, exploration, and many more (Leibo et al., 2017; Lerer and Peysakhovich, 2017; Rafols et al., 2005; Ullman et al., 2009; Zheng et al., 2020).

Another advantage of 2D worlds is that they are easier to design and program than their 3D counterparts. This is particularly true when the 3D world actually exploits the space or physical dynamics beyond the capabilities of 2D ones. 2D worlds do not require complex 3D assets to be evocative, nor do they require reasoning about shaders, lighting, and projections. In most 2D worlds, the agent's egocentric view of the world is inherently compatible with the allocentric view (i.e. the third-person or world view). That is, typically the agent's view is simply a movable window on the whole environment's view.

In addition, 2D worlds are significantly less resource-intensive to run, and typically do not require any specialized hardware (like GPUs) to attain reasonable performance. This keeps specialized hardware, if any, available exclusively for the intensive work of training the agents. Using 2D environments also enables better scalability to a larger number of agents interacting with the same environment, as it costs only very little to render another agent's view. Running 2D simulations is within the capabilities of smaller labs, whereas most 3D physics-based reinforcement learning is still prohibitively expensive in many settings.

1.3. Multi-player support and benchmarking with human players

A large fraction of human skills are social skills. To probe these, simulation environments must provide robust support for multi-agent systems. Most existing environments, however, only provide poor support for multiple players.

DeepMind Lab2D supports multiple simultaneous players interacting in the same environment. These players may be either human or computer-controlled, and it is possible to mix human and computer-controlled players in the same game.

Each player can have a custom view of the world that reveals or obscures particular information, controlled by the designer. A global view, potentially hidden from the players, can be set up and can include privileged information. This can be used for imperfect information games, as well as for human behavioral experiments where the experimenter can see the global state of the environment as the episode is progressing.
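
As a minimal sketch of this view model (combining the "movable window" observation from Section 1.2 with the per-player and global views described here), the following plain-Lua function crops an egocentric window out of a global map. The map, names, and window size are invented for illustration and are not part of the DMLab2D API.

-- Illustrative sketch only: per-player egocentric views as movable windows
-- onto a global view. Map and names are hypothetical.
local WORLD = {
  'WWWWWW',
  'W.A..W',
  'W..B.W',
  'WWWWWW',
}

-- Crop a (2*radius+1)-square window centred on (row, col); off-map cells are
-- blank. A designer could also hide or reveal information per player here.
local function egocentricView(world, row, col, radius)
  local view = {}
  for r = row - radius, row + radius do
    local line = {}
    for c = col - radius, col + radius do
      local ch = ' '
      if world[r] and c >= 1 and c <= #world[r] then ch = world[r]:sub(c, c) end
      line[#line + 1] = ch
    end
    view[#view + 1] = table.concat(line)
  end
  return view
end

-- Player A (at row 2, column 3) sees a 3x3 window; an experimenter running a
-- behavioral study could instead be given the full WORLD table.
for _, line in ipairs(egocentricView(WORLD, 2, 3, 1)) do print(line) end
--> WWW
--> .A.
--> ..B
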

1.4. Exposing metrics, supporting analysis

DeepMind Lab2D provides several flexible mechanisms for exposing internal environment information. The simplest form is through observations, which allow the researcher to add specific information from the environment to the observations that are produced at each time step. The second way is through events, which, similar to observations, can be raised from within the Lua script. Unlike observations, events are not tied to time steps but instead are triggered on specific conditions. Finally, the properties API provides a way to read and write parameters of the environment, typically parameters that change rarely.
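
A rough sketch of these three channels, written as self-contained Lua rather than against the actual DMLab2D bindings (the observation names, event name, and property keys below are invented for illustration):

-- Illustrative sketch only: observations, events and properties modelled
-- with plain Lua tables. Names are hypothetical, not the DMLab2D bindings.

-- 1. Observations: values recomputed and exposed at every time step.
local function makeObservations(env)
  return {
    ['WORLD.STEP'] = env.step,
    ['AVATAR.POSITION'] = {env.avatar.x, env.avatar.y},
  }
end

-- 2. Events: raised from script code whenever a condition triggers,
--    independently of the time-step boundary.
local events = {}
local function raiseEvent(name, payload)
  events[#events + 1] = {name = name, payload = payload}
end

-- 3. Properties: rarely-changing parameters that can be read and written.
local properties = {episodeLengthFrames = 1000, mapName = 'default'}
local function readProperty(key) return properties[key] end
local function writeProperty(key, value) properties[key] = value end

-- Example use.
local env = {step = 7, avatar = {x = 3, y = 5}}
raiseEvent('zap', {zapper = 'player1', zapped = 'player2'})
writeProperty('episodeLengthFrames', 2000)
print(makeObservations(env)['WORLD.STEP'])  --> 7
print(readProperty('episodeLengthFrames'))  --> 2000
print(#events)                              --> 1
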

2. Example results in deep reinforcement learning

For an example, we consider a game called "Running With Scissors" (Fig. 1). A variant of this game with simpler graphics was first described in Vezhnevets et al. (2020). It can be seen as a spatially and temporally embedded extension to Rock-Paper-Scissors. As such, it inherits the rich game-theoretic structure of its parent (described in e.g. Weibull (1997)). The main difference is that, unlike in the matrix game, agents in Running With Scissors do not select their strategies as atomic decisions. Instead, they must learn policies to implement their strategic choices. They must decide how to "play rock" (or paper, or scissors), in addition to deciding that they should do so. Furthermore, it is possible—though not trivial—to observe the policy one's partner is starting to implement and take countermeasures. This induces a wealth of possible feinting strategies, none of which could easily be captured in the classical matrix game formulation.

Figure 1 | "Running With Scissors" screenshot.

Agents can move around the map and collect resources: rock, paper, and scissors. The environment is 16 × 24 units in size but agents view it through a 5 × 5 window.² The episode ends either when a timer runs out or when there is an interaction event, triggered by one agent zapping the other with a beam. The resolution of the interactions is driven by a traditional matrix game, where there is a payoff matrix describing the reward produced by the pure strategies available to the two players. In Running With Scissors, the zapping agent becomes the row player and the zapped agent becomes the column player. The actual strategy of the player depends on the resources it has picked up before the interaction. These resources are represented by the resource vector

    v ∈ Δ² ⊂ ℝ_rock ⊕ ℝ_paper ⊕ ℝ_scissors,

where

    Δ² := { (x_1, x_2, x_3) : 0 ≤ x_1, x_2, x_3, Σ_i x_i = 1 }

is the standard 2-simplex. The initial value of the vector is the centroid (1/3, 1/3, 1/3)ᵀ. The more resources of a given type an agent picks up, the more committed the agent becomes to the pure strategy corresponding to that resource. The rewards r_row and r_col for the (zapping) row and the (zapped) column player, respectively, are assigned via

    r_row = v_rowᵀ A v_col = −r_col,

where

    A = [  0  −1  +1
          +1   0  −1
          −1  +1   0 ].

² The partial viewing window is a square: the agent sees 3 rows in front of itself, 1 row behind, and 2 columns to either side. Elementary actions are to move forward, backward, strafe left, strafe right, turn left or turn right, and fire the interaction beam. A gameplay video may be viewed at https://youtu.be/IukN22qusl8.
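
To make the payoff concrete, the short Lua snippet below simply reproduces this arithmetic (it is a worked example of the formula above, not code taken from the environment): a paper-committed row player who zaps a mostly-rock opponent receives a positive reward, and the column player receives its negative.

-- Worked example of r_row = v_rowᵀ A v_col = −r_col (illustration only).
local A = {
  { 0, -1,  1},   -- row plays rock:     ties rock, loses to paper, beats scissors
  { 1,  0, -1},   -- row plays paper:    beats rock, ties paper, loses to scissors
  {-1,  1,  0},   -- row plays scissors: loses to rock, beats paper, ties scissors
}

local function rewards(vRow, vCol)
  local rRow = 0
  for i = 1, 3 do
    for j = 1, 3 do
      rRow = rRow + vRow[i] * A[i][j] * vCol[j]
    end
  end
  return rRow, -rRow
end

-- Pure paper (row) zaps a mostly-rock opponent (column):
local rRow, rCol = rewards({0, 1, 0}, {0.8, 0.1, 0.1})
print(rRow, rCol)  --> 0.7    -0.7
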

To obtain high rewards, an agent should correctly identify what resource its opponent is collecting (e.g. rock) and collect the resource corresponding to its counter-strategy (e.g. paper). In addition, the rules of the game, i.e. the dynamics of the environment, are not assumed known to the agents. They must explore to discover them. Thus Running With Scissors is simultaneously a game of imperfect information—each player possesses some private information not known to their adversary (as in e.g. poker (Sandholm, 2015))—and of incomplete information, lacking common knowledge of the rules (Harsanyi, 1967).

In Rock-Paper-Scissors, when faced with an opponent playing rock, a best-responding agent will come to play paper. When faced with an opponent playing paper, a best-responding agent will learn to play scissors. And, when one's opponent plays scissors, one will learn to counter with rock. Fig. 2 shows that the same incentives prevail in Running With Scissors. However, in this case, the policies that implement strategic decisions are more complex, since agents must learn to run around the map and collect the resources. Much more complex policies involving scouting and feinting can also be learned in this environment; see Vezhnevets et al. (2020) for details.

To give an indication of the simulation performance: two random agents playing against one another, receiving full RGB observations of size 80 × 80 px (16 × 16 px per tile, with 5 × 5 tiles), run at an average of 250,000 frames per second (measured over 1000 episodes of 1000 steps each) on a single core of an Intel Xeon W-2135 ("Skylake") CPU at 3.70 GHz. The training example shown in Fig. 2 took several days to complete, and the cost of running the simulation is thus entirely negligible.


Figure 2 | A new agent implementing the advantage actor-critic algorithm (Mnih et al., 2016) can be trained to best respond to frozen agents implementing "semi-pure" strategies.

3. Discussion

Artificial intelligence research based on reinforcement learning is beginning to mature as a field. The need for rigorous standards by which the correctness, scale, reproducibility, ethicality, and impact of a contribution may be assessed is now accepted (Henderson et al., 2020; Khetarpal et al., 2018; Mitchell et al., 2019; Osband et al., 2019). But in all these well-received calls for rigor in AI, the humble simulation environment gets short-changed. It would appear that many researchers consider the environment to be none of their concern. A more holistic (and realistic!) view of their work suggests otherwise. Research workflows involve significant time spent authoring game environments and intelligence tests, adding analytic methods, and so forth. But these activities are usually not as simple and easy to extend as they ought to be, though they are clearly critical to the success of the enterprise.

We think that progress toward artificial general intelligence requires robust simulation platforms to enable in silico exploration of agent learning, skill acquisition, and careful measurement. We hope that the system we introduce here, DeepMind Lab2D, can fill this role. It generalizes and extends a popular internal system at DeepMind which supported a large range of research projects. It was especially popular for multi-agent research involving workflows with significant environment-side iteration. In our own experience, we have found that DeepMind Lab2D facilitates researcher creativity in the design of learning environments and intelligence tests. We are excited to see what the research community uses it to build in the future.

4. Acknowledgements

We thank the following people for their contributions to this project: Antonio García Castañeda, Edward Hughes, Ramana Kumar, Jay Lemmon, Kevin McKee, Haroon Qureshi, Denis Teplyashin, Víctor Valdés, and Tom Ward.

References

Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.

Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.

John C Harsanyi. Games with incomplete information played by "Bayesian" players, Part I: The basic model. Management Science, 14(3):159–182, 1967.


Peter Henderson, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky, and Joelle Pineau. Towards the systematic reporting of the energy and carbon footprints of machine learning. arXiv preprint arXiv:2002.05651, 2020.

Shuo Jiang. Multi-agent reinforcement learning environments compilation, 2019.

Arthur Juliani, Vincent-Pierre Berges, Esh Vckay, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange. Unity: A general platform for intelligent agents. arXiv preprint arXiv:1809.02627, 2018.

Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pages 1–8. IEEE, 2016.

Khimya Khetarpal, Zafarali Ahmed, Andre Cianflone, Riashat Islam, and Joelle Pineau. Re-evaluate: Reproducibility in evaluating reinforcement learning algorithms. OpenReview preprint ID:HJgAmITcgm, 2018.

Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Paul Muller, Timo Ewalds, Ryan Faulkner, János Kramár, Bart De Vylder, Brennan Saeta, James Bradbury, David Ding, Sebastian Borgeaud, Matthew Lai, Julian Schrittwieser, Thomas Anthony, Edward Hughes, Ivo Danihelka, and Jonah Ryan-Davis. OpenSpiel: A framework for reinforcement learning in games. CoRR, abs/1908.09453, 2019.

Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473, 2017.

Joel Z Leibo, Cyprien de Masson d'Autume, Daniel Zoran, David Amos, Charles Beattie, Keith Anderson, Antonio García Castañeda, Manuel Sanchez, Simon Green, Audrunas Gruslys, et al. Psychlab: A psychology laboratory for deep reinforcement learning agents. arXiv preprint arXiv:1801.08116, 2018.

Joel Z Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv preprint arXiv:1903.00742, 2019.

Adam Lerer and Alexander Peysakhovich. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068, 2017.

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229, 2019.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, and Hado Van Hasselt. Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568, 2019.

Emmanouil Antonios Platanios, Abulhair Saparov, and Tom Mitchell. Jelly Bean World: A testbed for never-ending learning. arXiv preprint arXiv:2002.06306, 2020.


Eddie J Rafols, Mark B Ring, Richard S Sutton, and Brian Tanner. Using predictive representations to improve generalization in reinforcement learning. In IJCAI, pages 835–840, 2005.

Arthur L Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959.

Tuomas Sandholm. Solving imperfect-information games. Science, 347(6218):122–123, 2015.

Tom Schaul. A video game description language for model-based or interactive learning. In 2013 IEEE Conference on Computational Intelligence in Games (CIG), pages 1–8. IEEE, 2013.

Claude E Shannon. A chess-playing machine. Scientific American, 182(2):48–51, 1950.

Tom Stepleton. The pycolab game engine. https://github.com/deepmind/pycolab, 2017.

Joseph Suarez, Yilun Du, Phillip Isola, and Igor Mordatch. Neural MMO: A massively multiagent game environment for training and evaluating intelligent agents. arXiv preprint arXiv:1903.00784, 2019.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.

Tomer Ullman, Chris Baker, Owen Macindoe, Owain Evans, Noah Goodman, and Joshua B Tenenbaum. Help or hinder: Bayesian models of social goal inference. In Advances in Neural Information Processing Systems, pages 1874–1882, 2009.

Henriette Van Praag, Gerd Kempermann, and Fred H Gage. Neural consequences of environmental enrichment. Nature Reviews Neuroscience, 1(3):191–198, 2000.

Alexander Sasha Vezhnevets, Yuhuai Tony Wu, Maria Eckstein, Rémi Leblond, and Joel Z Leibo. Options as responses: Grounding behavioural hierarchies in multi-agent reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, 2020.

Jörgen W Weibull. Evolutionary Game Theory. MIT Press, 1997.

Lianmin Zheng, Jiacheng Yang, Han Cai, Weinan Zhang, Jun Wang, and Yong Yu. MAgent: A many-agent reinforcement learning platform for artificial collective intelligence. CoRR, abs/1712.00600, 2017.

Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C Parkes, and Richard Socher. The AI economist: Improving equality and productivity with AI-driven tax policies. arXiv preprint arXiv:2004.13332, 2020.
