Accepted to the ICLR 2020 workshop: Beyond tabula rasa in RL (BeTR-RL)

THE NETHACK LEARNING ENVIRONMENT

Heinrich Küttler, Facebook AI Research, [email protected]
Nantas Nardelli, University of Oxford, [email protected]
Roberta Raileanu, New York University, [email protected]

Marco Selvatici, Imperial College London, [email protected]
Edward Grefenstette, Facebook AI Research & University College London, [email protected]
Tim Rocktäschel, Facebook AI Research & University College London, [email protected]

ABSTRACT

Hard and diverse tasks in complex environments drive progress in Reinforcement Learning (RL) research. To this end, we present the NetHack Learning Environment, a scalable, procedurally generated, rich and challenging environment for RL research based on the popular single-player terminal-based NetHack game, along with a suite of initial tasks. This environment is sufficiently complex to drive long-term research on exploration, planning, skill acquisition and complex policy learning, while dramatically reducing the computational resources required to gather a large amount of experience. We compare this environment and task suite to existing alternatives, and discuss why it is an ideal medium for testing the robustness and systematic generalization of RL agents. We demonstrate empirical success for early stages of the game using a distributed deep RL baseline and present a comprehensive qualitative analysis of agents trained in the environment.

1 INTRODUCTION

Recent advances in (Deep) Reinforcement Learning (RL) have been driven by the development of novel simulation environments, such as the Arcade Learning Environment (Bellemare et al., 2013), StarCraft II (Vinyals et al., 2017), BabyAI (Chevalier-Boisvert et al., 2018), and Obstacle Tower (Juliani et al., 2019). These environments introduced new challenges for existing state-of-the-art methods and demonstrated their failure modes. For example, the failure mode on Montezuma's Revenge in the Arcade Learning Environment (ALE) for agents that trained well on other games in the collection highlighted the inability of those agents to successfully learn from a sparse-reward environment. This sparked a long line of research on novel methods for exploration (e.g. Bellemare et al., 2016; Tang et al., 2017; Ostrovski et al., 2017) and learning from demonstrations (e.g. Hester et al., 2017; Salimans & Chen, 2018; Aytar et al., 2018) in Deep RL. This progress has limits: for example, the best approach on this environment, Go-Explore (Ecoffet et al., 2019), overfits to specific properties of ALE, and of Montezuma's Revenge in particular. It exploits the deterministic nature of environment transitions, allowing it to memorize sequences of actions leading to previously visited states, from which the agent can continue to explore.

Amongst attempts to surpass the limits of deterministic or repetitive settings, procedurally generated environments are a promising path towards systematically testing the generalization of RL methods (see e.g. Justesen et al., 2018; Juliani et al., 2019; Risi & Togelius, 2019; Cobbe et al., 2019a). Here, the environment observation is generated programmatically in every episode, making it extremely unlikely for an agent to visit the exact same state more than once. Existing procedurally generated RL environments are either costly to run (e.g. Vinyals et al., 2017; Johnson et al., 2016; Juliani et al., 2019) or are, as we argue, not rich enough to serve as a long-term benchmark for advancing RL research (e.g. Chevalier-Boisvert et al., 2018; Cobbe et al., 2019b; Beattie et al., 2016).


To overcome these roadblocks, we present the NetHack Learning Environment (NLE)1, an extremely rich, procedurally generated environment that strikes a balance between complexity and speed. It is a fully-featured Gym environment around NetHack2, a popular open-source terminal-based single-player turn-based "dungeon-crawler" game. Aside from procedurally generated content, NetHack is an attractive research platform as it contains hundreds of enemy and object types, it has complex and stochastic environment dynamics, and there is a clearly defined goal (descend the dungeon, retrieve an amulet, and ascend). Furthermore, NetHack is difficult to master for human players, who often rely on external knowledge to learn about initial game strategies and the environment's complex dynamics.3

Concretely, this paper makes the following core contributions: (i) we present the NetHack Learning Environment, a complex but fast fully-featured Gym environment for RL research built around the popular terminal-based NetHack roguelike game, (ii) we release an initial suite of default tasks in the environment and demonstrate that novel tasks can easily be added, (iii) we introduce a baseline model trained using IMPALA that leads to interesting non-trivial policies, and (iv) we demonstrate the benefit of NetHack's symbolic observation space by presenting tools for in-depth qualitative analyses of trained agents.

2 NETHACK: A FRONTIER FOR RL RESEARCH

In NetHack, the player acts turn-by-turn in a procedurally generated grid-world environment, with game dynamics strongly focused on exploration, resource management, and continuous discovery of entities and game mechanics (IRDC, 2008). At the beginning of a NetHack game, the player assumes the role of a hero who is placed into a dungeon and tasked with finding the Amulet of Yendor and offering it to an in-game deity. To do so, the player has to first descend to the bottom of over 50 procedurally generated levels to retrieve the amulet and then subsequently escape the dungeon. The procedurally generated content of each level makes it highly unlikely that a player will ever see the same level or experience the same combination of items, monsters, and level dynamics more than once. This provides a fundamental challenge to learning systems and a degree of complexity that enables us to more effectively evaluate an agent's ability to generalize, compared to existing testbeds. See Appendix A for details about the game of NetHack as well as the specific challenges it poses to current RL methods. A full list of symbol and monster types, as well as a partial list of environment dynamics involving these entities, can be found in the NetHack manual (Raymond et al., 2020).

The NetHack Learning Environment (NLE) is built on NetHack 3.6.6, the latest available version of the game. It is designed to provide a standard reinforcement learning interface around the visual interface of NetHack. NLE aims to make it easy for researchers to probe the behavior of their agents by defining new tasks with only a few lines of code, enabled by NetHack's symbolic observation space as well as its rich entities and environment dynamics. To demonstrate that NetHack is a suitable testbed for advancing RL, we release a set of default tasks for tractable subgoals in the game. For all tasks described below, we add a penalty of −0.001 to the reward function if the agent's action did not advance the in-game timer.

Staircase: The agent has to find the staircase down > to the second dungeon level. This task is already challenging, as there is often no direct path to the staircase. Instead, the agent has to learn to reliably open doors +, kick in locked doors, search for hidden doors and passages #, avoid traps ^, or move boulders O that obstruct a passage. The agent receives a reward of 100 once it reaches the staircase down, and the episode terminates after 1000 agent steps.

Pet: Many successful strategies for NetHack rely on taking good care of the hero's pet, for example the little dog d or kitten f that the hero starts with. In this task, the agent only receives a positive reward when it reaches the staircase while the pet is next to the agent.

Eat: To survive in NetHack, players have to make sure their character does not starve to death. There are many edible objects in the game, for example food rations %, tins, and some monster corpses.

1 https://github.com/facebookresearch/NLE
2 nethack.org
3 "NetHack is largely based on discovering secrets and tricks during gameplay. It can take years for one to become well-versed in them, and even experienced players routinely discover new ones." (Fernandez, 2009)


In this task, the agent receives as reward the increase in nutrition as determined by the in-game "Hunger" status (see NetHack Wiki, 2020, "Nutrition" entry for details).

Gold: Throughout the game, the player can collect gold $ to trade for useful items with shopkeepers. The agent receives the amount of gold it collects as reward. This incentivizes the agent to explore dungeon maps fully and to descend dungeon levels to discover new sources of gold. There are many advanced strategies for obtaining large amounts of gold, such as finding and selling gems, stealing from shopkeepers, and many more.

Scout: An important part of the game is exploring dungeon levels. Here, we reward the agent for uncovering previously unknown tiles in the dungeon, for example by entering a new room or following a newly discovered passage. Like the previous task, this incentivizes the agent to explore dungeon levels and to descend deeper down.

Score: In this task, the agent receives as reward the increase in the in-game score. The in-game score is governed by a complex calculation, but in early stages of the game it is dominated by killing monsters and reaching deeper dungeon levels (see NetHack Wiki, 2020, "Score" entry for details).

Oracle: While levels are procedurally generated, there are a number of landmarks that appear in every game of NetHack. One such landmark is the Oracle, which is randomly placed between levels five and nine of the dungeon. Reliably finding the Oracle is difficult, as it requires the agent to go down multiple staircases and often to exhaustively explore each level. In this task, the agent receives a reward of 1000 if it reaches the Oracle @. Since there is only an infinitesimal chance that a random agent would find the Oracle, we provide a shaped reward function using a hundredth of the Scout reward.
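To illustrate how such tasks can be defined with only a few lines of code, the sketch below wraps an NLE Gym environment with a Gold-style reward and the −0.001 timer penalty described above. It is a minimal sketch, not NLE's released task code: the observation is assumed to be the default (glyphs, stats, message, inventory) tuple, and the stats indices follow the layout documented in Appendix C.

```python
import gym

# Indices into the 22-dimensional stats vector (see Appendix C):
# entries 0-1 are the hero's (x, y) coordinates, followed by 20 character stats.
GOLD_IDX = 2 + 10   # Gold
TIME_IDX = 2 + 17   # Time (the in-game timer)


class GoldTask(gym.Wrapper):
    """Reward collected gold; penalize actions that do not advance the in-game timer."""

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        stats = obs[1]  # (glyphs, stats, message, inventory)
        self._last_gold, self._last_time = stats[GOLD_IDX], stats[TIME_IDX]
        return obs

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        stats = obs[1]
        reward = float(stats[GOLD_IDX] - self._last_gold)    # gold collected this step
        if stats[TIME_IDX] == self._last_time:               # timer did not advance
            reward -= 0.001
        self._last_gold, self._last_time = stats[GOLD_IDX], stats[TIME_IDX]
        return obs, reward, done, info
```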

3 EXPERIMENTS AND RESULTS

In addition to this suite of default tasks in the NetHack Learning Environment, we present results of a random policy and a standard distributed deep RL baseline.

For our baseline model, we learn three dense representations based on the observation: one for the entire map, one for an egocentric crop around the agent, and another one for the hero's stats (see Figure 1). These vectors are concatenated and processed using another MLP to arrive at a low-dimensional latent representation o_t of the observation. We then employ a recurrent policy parameterized by an LSTM to calculate the action distribution. We train the agent's policy for 10^9 steps in the environment using IMPALA (Espeholt et al., 2018) as implemented in TorchBeast (Küttler et al., 2019). In our experiments, we train two different agent characters, a neutral human monk (mon-hum-neu-mal) and a lawful valkyrie (val-dwa-law-fem). See Appendix F for further details.

Figure 1: The baseline model released with NLE.

For each model and character combination, we present the mean return averaged over the last 100 episodes and over five repeated experiments in Figure 2. We discuss results for individual tasks below (see Table 5 and Table 6 in Appendix G for full details).

Staircase: Our agents achieve a success rate of 85% and 84% for the monk and valkyrie class, respectively (see Table 5 in Appendix G). To solve this task, the agent learns to kick in locked doors and to reliably search for hidden passages and secret doors.

Pet: The performance on the Pet task is comparable to Staircase, with a win rate of 67% and 72% for the monk and valkyrie, respectively. Unsurprisingly, this task is slightly harder and requires more samples to learn. For example, the pet might get killed or fall into a trap door, making it impossible for the agent to successfully complete the episode.

Eat: The Eat task demonstrates the importance of testing with different character classes in NetHack. The monk starts with a number of food rations %, apples %, and oranges %. A sub-optimal strategy in this task is to consume all of these comestibles right at the start of the episode. In contrast, the valkyrie has to hunt for food, which our agent learns to do slowly over time.


Figure 2: Mean episode return of the last 100 episodes averaged over ten runs (staircase, pet, eat, gold, scout, score, and oracle tasks; cnn_mon, cnn_val, random_mon, and random_val agents; x-axis: environment (= agent) steps).

By having more pressure to quickly learn a sustainable food strategy, the valkyrie learns to outlast the monk and survives the longest in the game across all experiments that we ran (on average 2000 time steps).

Gold: Somewhat surprisingly, our baseline agents fail to learn to search for gold in NetHack. We hypothesize that it could be difficult for the baseline agent to outperform a random policy, as the agent has to learn to descend the dungeon to discover new sources of gold.

Scout & Score: In both of these tasks our agents learn non-trivial policies that take them down a number of levels in the early parts of the dungeon.

Oracle: Finding the Oracle is a difficult task, even if the agent learns to make its way down the dungeon levels through a shaped reward function. We therefore believe this task serves as a good benchmark for advancing exploration methods in procedurally generated environments in the short term. Long term, many tasks harder than this (e.g., reaching Mines' End, Vlad's Tower, or Ascension) can easily be defined in NLE with a few lines of code.
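As a rough sketch of how the random-policy baselines in Figure 2 could be reproduced through NLE's Gym interface, the snippet below rolls out a uniformly random agent and reports the mean return over 100 episodes. The package import name and the environment id are assumptions for illustration; the pre-defined task names in the released code may differ.

```python
import gym
import nle  # noqa: F401  (assumed import name; registers the NLE Gym environments)

env = gym.make("NetHackStaircase-v0")  # illustrative task id

returns = []
for _ in range(100):
    env.reset()
    done, episode_return = False, 0.0
    while not done:
        action = env.action_space.sample()      # uniformly random policy
        _, reward, done, _ = env.step(action)
        episode_return += reward
    returns.append(episode_return)

print(sum(returns) / len(returns))  # mean return over 100 episodes
```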

4 CONCLUSION AND FUTURE WORK

The NetHack Learning Environment (NLE) is a fast and complex procedurally generated environment for reinforcement learning and decision making. We demonstrate that current state-of-the-art model-free RL serves as a sensible baseline, and provide an in-depth analysis of agents.

NetHack provides interesting challenges for exploration methods given the extremely large number of possible states and wide variety of environment dynamics to discover. Previously proposed formulations of intrinsic motivation based on seeking novelty (Bellemare et al., 2016; Ostrovski et al., 2017; Burda et al., 2019b) or maximizing surprise (Pathak et al., 2017; Burda et al., 2019a; Raileanu & Rocktäschel, 2020) are likely insufficient to make progress on NetHack given that an agent will constantly find itself in novel states or observe unexpected environment dynamics.

NetHack poses further challenges since in order to win an agent needs to acquire a wide range of skills such as collecting resources, fighting monsters, eating, manipulating objects, casting spells, or taking care of its pet, to name just a few. The multilevel dependencies present in NetHack can inspire progress in hierarchical RL and long-term planning (Dayan & Hinton, 1992; Kaelbling et al., 1996; Parr & Russell, 1997; Vezhnevets et al., 2017). Transfer to unseen game characters, environment dynamics, or level layouts can be evaluated (Taylor & Stone, 2009). Furthermore, its richness and constant challenge make NetHack an interesting benchmark for lifelong learning (Lopez-Paz & Ranzato, 2017; Parisi et al., 2018; Rolnick et al., 2018; Mankowitz et al., 2018). In addition, the extensive documentation about NetHack can enable research on using prior structured and natural language knowledge for learning, which could lead to improvements in generalization and sample efficiency (Branavan et al., 2011; Luketina et al., 2019; Zhong et al., 2019; Jiang et al., 2019). Lastly, NetHack can also drive research on learning from demonstrations (Abbeel & Ng, 2004; Argall et al., 2009) since a large collection of replay data is available.

In summary, we believe NLE strikes an excellent balance between complexity and speed while encompassing a variety of challenges for the research community.


REFERENCES

Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML '04, 2004.

Brenna Argall, Sonia Chernova, Manuela M. Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57:469–483, 2009.

Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.

Andrea Asperti, Daniele Cortesi, Carlo De Pieri, Gianmaria Pedrini, and Francesco Sovrano. Crawling in rogue's dungeons with deep reinforcement techniques. IEEE Transactions on Games, 2019.

Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, and Nando de Freitas. Playing hard exploration games by watching youtube. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 2935–2945, 2018. URL http://papers.nips.cc/paper/7557-playing-hard-exploration-games-by-watching-youtube.

Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. Deepmind lab. CoRR, abs/1612.03801, 2016. URL http://arxiv.org/abs/1612.03801.

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.

Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res., 47:253–279, 2013. doi: 10.1613/jair.3912. URL https://doi.org/10.1613/jair.3912.

Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement learning. CoRR, abs/1912.06680, 2019. URL http://arxiv.org/abs/1912.06680.

S. R. K. Branavan, David Silver, and Regina Barzilay. Learning to win by reading manuals in a monte-carlo framework. In ACL, 2011.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540.

Yuri Burda, Harrison Edwards, Deepak Pathak, Amos J. Storkey, Trevor Darrell, and Alexei A. Efros. Large-scale study of curiosity-driven learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019a. URL https://openreview.net/forum?id=rJNwDjAqYX.

Yuri Burda, Harrison Edwards, Amos J. Storkey, and Oleg Klimov. Exploration by random network distillation. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019b. URL https://openreview.net/forum?id=H1lJJnR5Ym.

Jonathan Campbell and Clark Verbrugge. Learning combat in NetHack. In Thirteenth Annual AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE 2017), pp. 16–22, October 2017.

Jonathan Campbell and Clark Verbrugge. Exploration in NetHack with secret discovery. IEEE Transactions on Games, 2018. ISSN 2475-1502. doi: 10.1109/TG.2018.2861759.


Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: First Steps Towards Grounded Language Learning With a Human In the Loop. CoRR, abs/1810.08272, 2018. URL http://arxiv.org/abs/1810.08272.

Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic Gridworld Environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.

Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. arXiv preprint arXiv:1912.01588, 2019a.

Karl Cobbe, Oleg Klimov, Christopher Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 1282–1289, 2019b. URL http://proceedings.mlr.press/v97/cobbe19a.html.

Dustin Dannenhauer, Michael W Floyd, Jonathan Decker, and David W Aha. Dungeon crawl stone soup as an evaluation domain for artificial intelligence. arXiv preprint arXiv:1902.01769, 2019.

Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In NIPS, 1992.

Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-Explore: A New Approach for Hard-exploration Problems. arXiv preprint arXiv:1901.10995, 2019.

Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 1406–1415, 2018. URL http://proceedings.mlr.press/v80/espeholt18a.html.

Juan Manuel Sanchez Fernandez. Reinforcement Learning for roguelike type games (eliteMod v0.9). https://kcir.pwr.edu.pl/~witold/aiarr/2009_projekty/elitmod/, 2009. Accessed: 2020-01-19.

Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1146–1155. JMLR.org, 2017.

Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

Gary Gygax and Dave Arneson. Dungeons and dragons, volume 19. Tactical Studies Rules Lake Geneva, WI, 1974.

Luke Harries, Sebastian Lee, Jaroslaw Rzepecki, Katja Hofmann, and Sam Devlin. Mazeexplorer: A customisable 3d benchmark for assessing generalisation in reinforcement learning. In Proc. IEEE Conference on Games, 2019.

Todd Hester, Matej Vecerík, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, Gabriel Dulac-Arnold, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Deep q-learning from demonstrations. In AAAI, 2017.

Felix Hill, Andrew K. Lampinen, Rosalia Schneider, Stephen Clark, Matthew Botvinick, James L. McClelland, and Adam Santoro. Emergent systematic generalization in a situated agent. CoRR, abs/1910.00571, 2019. URL http://arxiv.org/abs/1910.00571.

International Roguelike Development Conference. Berlin Interpretation. http://www.roguebasin.com/index.php?title=Berlin_Interpretation, 2008. Accessed: 2020-01-08.


Yacine Jernite, Kavya Srinet, Jonathan Gray, and Arthur Szlam. Craftassist instruction parsing: Semantic parsing for a minecraft assistant. CoRR, abs/1905.01978, 2019. URL http://arxiv.org/abs/1905.01978.

Yiding Jiang, Shixiang Gu, Kevin Murphy, and Chelsea Finn. Language as an abstraction for hierarchical deep reinforcement learning. In NeurIPS, 2019.

Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pp. 4246–4247, 2016. URL http://www.ijcai.org/Abstract/16/643.

Arthur Juliani, Ahmed Khalifa, Vincent-Pierre Berges, Jonathan Harper, Ervin Teng, Hunter Henry, Adam Crespi, Julian Togelius, and Danny Lange. Obstacle Tower: A Generalization Challenge in Vision, Control, and Planning. In IJCAI, 2019.

Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. Illuminating generalization in deep reinforcement learning through procedural level generation, 2018.

Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. ArXiv, cs.AI/9605103, 1996.

Yuji Kanagawa and Tomoyuki Kaneko. Rogue-gym: A new challenge for generalization in reinforcement learning. arXiv preprint arXiv:1904.08129, 2019.

Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. IEEE, 2016.

Heinrich Küttler, Nantas Nardelli, Thibaut Lavril, Marco Selvatici, Viswanath Sivakumar, Tim Rocktäschel, and Edward Grefenstette. TorchBeast: A PyTorch Platform for Distributed RL. ArXiv, abs/1910.03552, 2019.

David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In NIPS, 2017.

Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. A survey of reinforcement learning informed by natural language. arXiv preprint arXiv:1906.03926, 2019.

Daniel J. Mankowitz, Augustin Zídek, André Barreto, Dan Horgan, Matteo Hessel, John Quan, Junhyuk Oh, Hado van Hasselt, David Silver, and Tom Schaul. Unicorn: Continual learning with a universal, off-policy agent. ArXiv, abs/1802.08294, 2018.

Nantas Nardelli, Gabriel Synnaeve, Zeming Lin, Pushmeet Kohli, Philip HS Torr, and Nicolas Usunier. Value propagation networks. arXiv preprint arXiv:1805.11199, 2018.

NetHack Wiki. NetHackWiki. https://nethackwiki.com/, 2020. Accessed: 2020-02-01.

Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, and John Schulman. Gotta learn fast: A new benchmark for generalization in rl. arXiv preprint arXiv:1804.03720, 2018.

Santiago Ontanón, Gabriel Synnaeve, Alberto Uriarte, Florian Richoux, David Churchill, and Mike Preuss. A survey of real-time strategy game ai research and competition in starcraft. IEEE Transactions on Computational Intelligence and AI in Games, 5(4):293–311, 2013.

Georg Ostrovski, Marc G Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2721–2730. JMLR.org, 2017.

German Ignacio Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. 2018.


Ronald Parr and Stuart J. Russell. Reinforcement learning with hierarchies of machines. In NIPS, 1997.

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2017.

Roberta Raileanu and Tim Rocktäschel. RIDE: Rewarding impact-driven exploration for procedurally-generated environments. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkg-TJBFPB.

Eric S Raymond, Mike Stephenson, et al. A Guide to the Mazes of Menace: Guidebook for NetHack. NetHack.org, 2020. URL http://www.nethack.org/download/3.6.5/nethack-365-Guidebook.pdf.

Sebastian Risi and Julian Togelius. Procedural content generation: From automatically generating game levels to increasing generality in machine learning. CoRR, abs/1911.13071, 2019. URL http://arxiv.org/abs/1911.13071.

David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, and Greg Wayne. Experience replay for continual learning. ArXiv, abs/1811.11682, 2018.

Tim Salimans and Richard Chen. Learning montezuma's revenge from a single demonstration. CoRR, abs/1812.03381, 2018. URL http://arxiv.org/abs/1812.03381.

Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2186–2188. International Foundation for Autonomous Agents and Multiagent Systems, 2019.

Gabriel Synnaeve, Nantas Nardelli, Alex Auvolat, Soumith Chintala, Timothée Lacroix, Zeming Lin, Florian Richoux, and Nicolas Usunier. Torchcraft: a library for machine learning research on real-time strategy games. arXiv preprint arXiv:1611.00625, 2016.

Gabriel Synnaeve, Jonas Gehring, Zeming Lin, Daniel Haziza, Nicolas Usunier, Danielle Rothermel, Vegard Mella, Da Ju, Nicolas Carion, Laura Gustafson, et al. Growing up together: Structured exploration for large action spaces. 2019.

TAEB. TAEB Documentation: Other Bots. https://taeb.github.io/bots.html, 2015. Accessed: 2020-01-19.

Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2753–2762, 2017.

Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res., 10:1633–1685, 2009.

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Manfred Otto Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In ICML, 2017.

Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al. StarCraft II: A New Challenge for Reinforcement Learning. arXiv preprint arXiv:1708.04782, 2017.

Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

Chang Ye, Ahmed Khalifa, Philip Bontrager, and Julian Togelius. Rotation, translation, and cropping for zero-shot generalization, 2020.


Victor Zhong, Tim Rocktäschel, and Edward Grefenstette. RTFM: Generalising to Novel Environment Dynamics via Reading. ArXiv, abs/1910.08210, 2019.


A NETHACK

Figure 3: Annotated example of an agent at two different stages in NetHack (left: a procedurally generated first level of the Dungeons of Doom; right: the Gnomish Mines). Annotations highlight in-game messages, the agent, rooms, items (ring, potion, scroll, weapon, armor, tool, boulder, food, corpse), unexplored territory and fog of war, closed doors, hidden passages, enemies, and the agent's stats line. Legend: " amulet, ) weapon, [ armor, ! potion, ? scroll, / wand, = ring, + spellbook, * gem, ( tool, O boulder, $ gold, % comestible.

NetHack is one of the oldest and most popular roguelikes, originally released in 1987 as a successor to Hack, an open-source implementation of the original Rogue game. The world of NetHack is loosely based on classical role-playing lore such as Dungeons & Dragons (Gygax & Arneson, 1974). At the beginning of the game, the player assumes the role of a hero who is placed into a dungeon and tasked with finding the Amulet of Yendor and offering it to an in-game deity. To do so, the player has to first descend to the bottom of over 50 procedurally generated levels to retrieve the amulet and then subsequently escape the dungeon.

The game provides further variation through five different races (human, elf, dwarf, gnome, orc), thirteen roles (e.g. monk, valkyrie, wizard), three moral alignments (neutral, lawful, chaotic), and gender (male/female) that the player can choose from. Each choice determines some of the character's features, as well as how the character interacts with other entities (e.g. some species of monsters may not be hostile depending on the character race, or priests of a particular religion may only help theologically aligned characters). This selection process heavily conditions the set of viable strategies that the agent can employ, and provides a strong challenge for multi-task and hierarchical reinforcement learning, similarly to how the presence of races in StarCraft and characters in Dota 2 affected the respective efforts to solve these testbeds (Vinyals et al., 2017; Berner et al., 2019). Furthermore, the hero's interaction with several game entities involves pre-defined stochastic dynamics (usually defined by virtual dice tosses), and the game is designed to heavily punish careless exploratory policies.4 This makes NetHack a very interesting environment for evaluating exploration methods such as curiosity-driven learning (Pathak et al., 2017; Burda et al., 2019a), as well as safe reinforcement learning (García & Fernández, 2015).

NetHack provides interesting modelling challenges. The large majority of entities appear rarely across multiple runs, meaning that any model that conditions on them needs to handle a distribution with an extremely long tail of events (closer to real-world tasks). Entities are hierarchically structured (e.g. a monster might be a wolf d, which shares some characteristics with other in-game canines such as coyotes d or hell hounds d) and compositional (e.g. an ant that the hero encounters might turn out to be a soldier ant a or a fire ant a). Furthermore, the in-game messages describing many of the interactions between the hero and other entities provide an interesting setting for multi-modal and language-conditioned reinforcement learning (see the top of Figure 3).

NetHack is an extremely long game. Successful expert runs usually last tens of thousands of turns, while average successful runs can easily last hundreds of thousands of turns, spanning multiple days of play time. Compared to testbeds with long episode horizons such as StarCraft and Dota 2, NetHack's "episodes" are one or two orders of magnitude longer, and they vary wildly depending on policy quality. Similar to the Atari games found in the Arcade Learning Environment (Bellemare et al., 2013), the game provides a score that is, in principle, indicative of the hero's progress. However,

4 Occasionally dying because of simple, avoidable mistakes is so common in the game that the online community has defined an acronym for it: Yet Another Stupid Death (YASD).

10 Accepted to the ICLR 2020 workshop: Beyond tabula rasa in RL (BeTR-RL)

since (a) it can be artificially increased by sub-optimal policies, and (b) changes in the score remain fairly sparse across environment steps, it does not provide a good reward signal for RL agents towards reaching the end of the game successfully. We preliminarily show this in Section 3 and evaluate other metrics such as the agent's dungeon level and experience level reached.

B THE NETHACK LEARNING ENVIRONMENT

The NetHack Learning Environment (NLE) is built on NetHack 3.6.2, the latest available version of the game. It is designed to provide a standard, turn-based (i.e., synchronous) reinforcement learning interface around the standard visual interface of NetHack. We use the game as-is as the backend for our NLE environment, leaving the game dynamics largely unchanged. We added more control over the random number generator for seeding the environment, as well as various modifications to expose the game's internal state to our Python frontend, which in turn implements the featurization.

By default, the observation space is a tuple of four elements (glyphs, stats, message, inventory), where glyphs is a tensor representing the 2D symbolic observation of the level, stats is a vector of agent coordinates and other character attributes (e.g., health points, strength, dexterity, hunger level; normally displayed in the bottom area of the GUI), message is a tensor representing the current message shown to the player (normally displayed in the top area of the GUI), and inventory is a list of padded tensors representing inventory items (normally displayed when requested by the player). Additional observations such as ASCII symbols and their colors or non-permanent game windows can be enabled with slight code modifications. More details about the default observation space and possible extensions can be found in Appendix C.

The environment has 93 available actions, corresponding to all the actions a human player can take in NetHack. More precisely, the action space is composed of 77 command actions and 16 movement actions. The movement actions are split into 8 "1-step" compass directions (i.e., the agent moves a single step in a given direction) and 8 "move far" compass directions (i.e., the agent moves in the specified direction until it runs into some entity). The 77 command actions include eating, opening, kicking, reading, and praying, as well as many others. We refer the reader to Appendix D as well as to Raymond et al. (2020) for the full table of actions and NetHack commands.

The environment comes with a Gym interface (Brockman et al., 2016) and includes multiple pre-defined tasks with different reward functions and action spaces (see Section 2 for details). We designed the interface to be lightweight, achieving competitive results with Gym-based ALE (see Appendix E for a rough comparison). Finally, NLE also includes a dashboard to analyze NetHack runs recorded as terminal tty recordings. This allows NLE users to watch replays of the agent's behavior at an arbitrary speed and provides an interface to visualize action distributions and game events. See Appendix H for details. NLE is available at [redacted] under an open-source license.
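A minimal sketch of instantiating the environment and inspecting these spaces through the Gym interface is shown below; the package import name and the task id are assumptions, and the shapes printed are those documented in Appendix C.

```python
import gym
import nle  # noqa: F401  (assumed import name; registers the NLE Gym environments)

env = gym.make("NetHackScore-v0")  # illustrative task id

obs = env.reset()
glyphs, stats, message, inventory = obs  # default four-element observation tuple
print(glyphs.shape)       # (21, 79) symbolic dungeon map of glyph ids
print(stats.shape)        # (22,) hero coordinates and character stats
print(message.shape)      # (256,) padded byte vector with the current message
print(env.action_space)   # Discrete(93): 16 movement + 77 command actions
```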

C OBSERVATION SPACE

The NetHack Learning Environment Gym environment is implemented by wrapping a lower-level NetHack Python object into a Python class responsible for the featurization, reward schedule, and end-of-episode dynamics. While the low-level NetHack object exposes a large number of NetHack game internals, the Gym wrapper by default exposes only part of this data as numerical observation arrays; the default observation space is a tuple of four elements (glyphs, stats, message, inventory).

glyphs NetHack supports non-ASCII graphical user interfaces, dubbed window-ports. To support displaying different monsters, objects, and floor types in the NetHack dungeon map as different tiles, NetHack internally defines glyphs as ids in the range 0, ..., MAX_GLYPH, where MAX_GLYPH = 5976 in our build. The glyph observation is an integer array of shape (21, 79) of these game glyph ids.5

5 NetHack's set of glyph ids is not necessarily well suited for machine learning. E.g., more than half of all glyph ids are of type "swallow", most of which are guaranteed not to show up in any actual game of NetHack. We provide additional tooling to determine the type of a given glyph id to process this observation further.


stats An integer vector of length 22, containing the (x, y) coordinates of the hero and the following 20 character stats: StrengthPercentage, Strength, Dexterity, Constitution, Intelligence, Wisdom, Charisma, Score, Hitpoints, MaxHitpoints, Gold, Energy, MaxEnergy, ArmorClass, MonsterLevel, ExperienceLevel, ExperiencePoints, Time, HungerLevel, CarryingCapacity.

message A padded byte vector of length 256 representing the current message shown to the player, normally displayed in the top area of the GUI. We support different padding strategies and alphabet sizes, but by default we choose an alphabet size of 96, with the last character used for padding.

inventory In NetHack's default ASCII user interface, the hero's inventory is displayed in one of a number of windows that can open and close during the game. Other user interfaces display a permanent inventory at all times. The NetHack Learning Environment follows that strategy. The inventory observation is a tuple of the following four arrays: glyphs, an integer vector of length 55 of glyph ids, padded with MAX_GLYPH, which is 5976 in our build; strs, a padded byte array of shape (55, 80) describing the inventory items; letters, a padded byte vector of length 55 with the corresponding ASCII character symbols; and oclass, an integer vector of length 55 with ids describing the type of inventory objects, padded with MAXOCLASSES = 18.

The low-level NetHack Python object has additional entries for the dungeon map's character symbols and colors, as well as various parts of the program state of the underlying NetHack process, including the contents of all in-game windows (e.g., menus and other pop-ups). We refer to the source code for a description of these.
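Since downstream code frequently needs individual entries of the stats vector, a small helper that maps it to named fields can be convenient. The sketch below follows the layout documented above; it is not part of NLE's API.

```python
# Field names in the documented order of the 22-dimensional stats vector.
STATS_FIELDS = [
    "x", "y",
    "StrengthPercentage", "Strength", "Dexterity", "Constitution",
    "Intelligence", "Wisdom", "Charisma", "Score", "Hitpoints",
    "MaxHitpoints", "Gold", "Energy", "MaxEnergy", "ArmorClass",
    "MonsterLevel", "ExperienceLevel", "ExperiencePoints", "Time",
    "HungerLevel", "CarryingCapacity",
]


def stats_to_dict(stats):
    """Map the raw stats vector to a name -> value dictionary."""
    assert len(stats) == len(STATS_FIELDS)
    return dict(zip(STATS_FIELDS, (int(v) for v in stats)))
```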

D ACTION SPACE

The game of NetHack uses ASCII inputs, i.e., individual keyboard presses including modifiers like Ctrl and Meta. The NetHack Learning Environment pre-defines 93 actions, 16 of which are compass directions and 77 of which are command actions. Table 2 gives a list of command actions, including their ASCII value and the corresponding key binding in NetHack, while Table 3 lists the 16 compass directions. For a detailed description of these actions, as well as other NetHack commands, we refer to the NetHack guidebook (Raymond et al., 2020).

By default, NLE will auto-apply the MORE action in situations where the game waits for input to display more messages. NLE also auto-answers certain in-game questions, e.g., the “Do you want your possessions identified?” question at the end of the game in order to contain that data in the tty recording.

Name  Value  Key  Description
EXTCMD  35  #  perform an extended command
EXTLIST  191  M-?  list all extended commands
ADJUST  225  M-a  adjust inventory letters
ANNOTATE  193  M-A  name current level
APPLY  97  a  apply (use) a tool (pick-axe, key, lamp...)
ATTRIBUTES  24  C-x  show your attributes
AUTOPICKUP  64  @  toggle the pickup option on/off
CALL  67  C  call (name) something
CAST  90  Z  zap (cast) a spell
CHAT  227  M-c  talk to someone
CLOSE  99  c  close a door
CONDUCT  195  M-C  list voluntary challenges you have maintained
DIP  228  M-d  dip an object into something
DOWN  62  >  go down (e.g., a staircase)
DROP  100  d  drop an item
DROPTYPE  68  D  drop specific item types
EAT  101  e  eat something
ENGRAVE  69  E  engrave writing on the floor
ENHANCE  229  M-e  advance or check weapon and spell skills
FIRE  102  f  fire ammunition from quiver
FORCE  230  M-f  force a lock
GLANCE  59  ;  show what type of thing a map symbol corresponds to
HELP  63  ?  give a help message
HISTORY  86  V  show long version and game history
INVENTORY  105  i  show your inventory
INVENTTYPE  73  I  inventory specific item types
INVOKE  233  M-i  invoke an object's special powers
JUMP  234  M-j  jump to another location
KICK  4  C-d  kick something
KNOWN  92  \  show what object types have been discovered
KNOWNCLASS  96  `  show discovered types for one class of objects
LOOK  58  :  look at what is here
LOOT  236  M-l  loot a box on the floor
MONSTER  237  M-m  use monster's special ability
NAME  78  N  name a monster or an object
OFFER  239  M-o  offer a sacrifice to the gods
OPEN  111  o  open a door
OPTIONS  79  O  show option settings, possibly change them
OVERVIEW  15  C-o  show a summary of the explored dungeon
PAY  112  p  pay your shopping bill
PICKUP  44  ,  pick up things at the current location
PRAY  240  M-p  pray to the gods for help
PREVMSG  16  C-p  view recent game messages
PUTON  80  P  put on an accessory (ring, amulet, etc)
QUAFF  113  q  quaff (drink) something
QUIT  241  M-q  exit without saving current game
QUIVER  81  Q  select ammunition for quiver
READ  114  r  read a scroll or spellbook
REDRAW  18  C-r  redraw screen
REMOVE  82  R  remove an accessory (ring, amulet, etc)
RIDE  210  M-R  mount or dismount a saddled steed
RUB  242  M-r  rub a lamp or a stone
SAVE  83  S  save the game and exit
SEARCH  115  s  search for traps and secret doors
SEEALL  42  *  show all equipment in use
SEETRAP  94  ^  show the type of adjacent trap
SIT  243  M-s  sit down
SWAP  120  x  swap wielded and secondary weapons
TAKEOFF  84  T  take off one piece of armor
TAKEOFFALL  65  A  remove all armor
TELEPORT  20  C-t  teleport around the level
THROW  116  t  throw something
TIP  212  M-T  empty a container
TRAVEL  95  _  travel to a specific location on the map
TURN  244  M-t  turn undead away
TWOWEAPON  88  X  toggle two-weapon combat
UNTRAP  245  M-u  untrap something
UP  60  <  go up (e.g., a staircase)
VERSION  246  M-v  list compile time options
VERSIONSHORT  118  v  show version
WAIT / SELF  46  .  rest one move while doing nothing / apply to self
WEAR  87  W  wear a piece of armor
WHATDOES  38  &  tell what a command does
WHATIS  47  /  show what type of thing a symbol corresponds to
WIELD  119  w  wield (put in use) a weapon
WIPE  247  M-w  wipe off your face
ZAP  122  z  zap a wand

Table 2: Command actions in the NetHack Learning Environment.6

13 Accepted to the ICLR 2020 workshop: Beyond tabula rasa in RL (BeTR-RL)

Direction   1-step Value  1-step Key  move far Value  move far Key
North       107           k           75              K
East        108           l           76              L
South       106           j           74              J
West        104           h           72              H
North east  117           u           85              U
South east  110           n           78              N
South west  98            b           66              B
North west  121           y           89              Y

Table 3: Compass direction actions.
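Some of the pre-defined tasks use restricted action spaces, e.g. only the compass directions plus a handful of commands. The sketch below shows one generic way to do this on top of the full Discrete(93) space using a Gym action wrapper; the concrete indices to pass depend on how the wrapped environment orders its actions, so they are left as an argument rather than hard-coded.

```python
import gym
from gym import spaces


class RestrictedActions(gym.ActionWrapper):
    """Expose only a subset of the wrapped environment's discrete actions."""

    def __init__(self, env, allowed):
        # `allowed` lists indices into the full action space, e.g. the
        # 16 compass directions plus SEARCH, KICK, EAT, and DOWN.
        super().__init__(env)
        self._allowed = list(allowed)
        self.action_space = spaces.Discrete(len(self._allowed))

    def action(self, action):
        # Map the restricted index back into the full action space.
        return self._allowed[action]
```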

E ENVIRONMENT SPEED COMPARISON

Table 4 shows a comparison between popular Gym environments and the NetHack Learning Environment. Atari environments were passed through a common featurization pipeline taken from OpenAI.7 ObstacleTowerEnv was instantiated with (retro=False, real_time=False). All environments were controlled with a uniformly random policy with reset on terminal states, using a MacBook Pro equipped with an Intel Core i7 2.9 GHz, 16GB of RAM, MacOS Mojave, Python 3.7, Conda 4.7.12, and the latest available packages as of January 2020.

Gym environment                          SPS     episodes
NLE (staircase)                          1.1K    88
Atari (PongNoFrameskip-v4)               1.0K    65
Atari (MontezumaRevengeNoFrameskip-v4)   0.9K    589
CartPole-v1                              123.7K  332977
ObstacleTowerEnv                         0.06K   5

Table 4: Comparison between NLE and popular environments when run using the Python Gym interface. SPS stands for "steps per second". Both NLE and Atari use a featurization pipeline implemented in non-optimized Python. Each environment was run for 60 continuous seconds.
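The benchmark in Table 4 amounts to stepping each environment with a uniformly random policy for 60 wall-clock seconds and counting steps and completed episodes. The sketch below illustrates that measurement; it is not the exact script used for the table.

```python
import time

import gym


def steps_per_second(env_id, seconds=60):
    """Step a uniformly random policy for `seconds` wall-clock seconds."""
    env = gym.make(env_id)
    env.reset()
    steps, episodes = 0, 0
    start = time.time()
    while time.time() - start < seconds:
        _, _, done, _ = env.step(env.action_space.sample())
        steps += 1
        if done:          # reset on terminal states
            env.reset()
            episodes += 1
    return steps / (time.time() - start), episodes


print(steps_per_second("CartPole-v1"))  # or e.g. "NetHackStaircase-v0" (assumed id)
```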

F BASELINE DETAILS

For our baseline model, we encode the multi-modal observation o_t as follows. Let the observation o_t at time step t be a tuple (g_t, z_t) consisting of the 21 × 79 matrix of glyph identifiers and a 21-dimensional vector containing agent stats such as its (x, y)-coordinate, health points, experience level, and so on. We produce three dense representations based on the observation (see Figure 1). For each of the 5991 possible glyphs in NetHack (monsters, items, dungeon features, etc.), we learn a k-dimensional vector embedding using a lookup table. We apply a ConvNet to the entire map of glyph embeddings (red). In addition, via another CNN (blue), we learn a dedicated egocentric representation of the 9 × 9 crop of glyph embeddings around the agent to improve its ability to generalize (Hill et al., 2019; Ye et al., 2020). Furthermore, we use an MLP to encode the hero's stats (green). Throughout training, we change NetHack's seed for procedurally generating the environment after every episode.

We use 64 as the embedding dimension of the glyphs, as well as the hidden dimension of the observation o_t and the LSTM output h_t. For encoding the full map of glyphs as well as the 9 × 9 crop, we use a 5-layer ConvNet architecture with filter size 3 × 3, padding 1, and stride 1. The input channel of the first layer of the ConvNet is the embedding size of the glyphs (64). Subsequent layers have an input and output channel dimension of 16. We employ a gradient norm clipping of 40 and clip rewards using r_c = tanh(r/100). We use RMSProp with a learning rate of 0.0001 without momentum and with ε_RMSProp = 0.01. Our entropy cost is set to 0.0001.

7 See github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py
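The following PyTorch sketch puts these hyperparameters together into the encoder described above (glyph embedding, map CNN, egocentric-crop CNN, stats MLP, fusion MLP, LSTM, and policy/baseline heads). It is a simplified re-implementation under the stated assumptions; nonlinearities, the cropping code, and the exact wiring in the released baseline may differ.

```python
import torch
from torch import nn

NUM_GLYPHS = 5991     # number of possible glyph ids
HID = 64              # glyph embedding and hidden dimension
MAP_SHAPE = (21, 79)  # full dungeon map
CROP = 9              # egocentric crop around the agent (computed outside this module)


def conv_stack(in_ch):
    """Five 3x3 conv layers with padding 1 and stride 1; first layer maps in_ch -> 16."""
    layers, ch = [], in_ch
    for _ in range(5):
        layers += [nn.Conv2d(ch, 16, kernel_size=3, stride=1, padding=1), nn.ELU()]
        ch = 16
    return nn.Sequential(*layers)


class BaselineNet(nn.Module):
    """Sketch of the NLE baseline: two glyph CNNs + stats MLP -> MLP -> LSTM -> policy."""

    def __init__(self, num_actions, stats_dim=21):
        super().__init__()
        self.embed = nn.Embedding(NUM_GLYPHS, HID)
        self.map_cnn = conv_stack(HID)
        self.crop_cnn = conv_stack(HID)
        self.stats_mlp = nn.Sequential(nn.Linear(stats_dim, HID), nn.ReLU())
        flat = 16 * MAP_SHAPE[0] * MAP_SHAPE[1] + 16 * CROP * CROP + HID
        self.fc = nn.Sequential(nn.Linear(flat, HID), nn.ReLU())
        self.lstm = nn.LSTMCell(HID, HID)
        self.policy = nn.Linear(HID, num_actions)
        self.baseline = nn.Linear(HID, 1)

    def forward(self, glyphs, crop, stats, core_state):
        # glyphs: (B, 21, 79) long, crop: (B, 9, 9) long, stats: (B, stats_dim) float,
        # core_state: tuple of LSTM hidden and cell states, each of shape (B, HID).
        g = self.embed(glyphs).permute(0, 3, 1, 2)   # (B, HID, 21, 79)
        c = self.embed(crop).permute(0, 3, 1, 2)     # (B, HID, 9, 9)
        x = torch.cat(
            [self.map_cnn(g).flatten(1), self.crop_cnn(c).flatten(1), self.stats_mlp(stats)],
            dim=1,
        )
        o_t = self.fc(x)                             # low-dimensional latent representation
        h_t, c_t = self.lstm(o_t, core_state)
        return self.policy(h_t), self.baseline(h_t), (h_t, c_t)
```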


Figure 4: Action distribution per task for 1000 episodes (tasks: staircase, pet, eat, scout, and score; actions shown include the 16 compass directions ("1-step" and "far"), UP, DOWN, EAT, WAIT, KICK, SEARCH, and MORE).

G ADDITIONAL RESULTS

The different tasks show patterns in their distributions of actions chosen by the trained policy, as exemplified by Figure 4. Naturally, the EAT action is chosen primarily by the agent trained on the Eat task, while SEARCH, which allows the hero to discover hidden doors and passages, occurs much more frequently in the Explore task. The 21 × 79 shape of the dungeon map makes moving east-west more often useful than north-south for all tasks. We also note that while the Eat task tends to favor the "move far" actions, Explore puts a higher weight on 1-step actions. We hypothesize that this relates to the per-step reward clipping used in the IMPALA algorithm, which incentivizes the agent not to explore too large a fraction of the dungeon in one single action.
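An action histogram like the one in Figure 4 can be obtained by counting the actions a policy selects over a batch of episodes. The sketch below shows one way to do this through the Gym interface; `policy` stands in for whatever maps observations to discrete actions (a trained agent or a random baseline), and the import name is an assumption.

```python
import collections

import gym
import nle  # noqa: F401  (assumed import name; registers the NLE Gym environments)


def action_distribution(env_id, policy, episodes=1000):
    """Return the relative frequency of each discrete action over `episodes` rollouts."""
    env = gym.make(env_id)
    counts = collections.Counter()
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = policy(obs)
            counts[action] += 1
            obs, _, done, _ = env.step(action)
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}
```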

H DASHBOARD

We built a web dashboard with NodeJS (Figure 5) to facilitate the visualization and management of NetHack runs, as well as to provide a starting codebase for parsing tty recordings for more advanced user analysis. Currently, it can parse and aggregate multiple NetHack runs and provides an interface to manually step through recorded frames.

I QUALITATIVE ANALYSIS

We analyse the causes of death of our agents during the course of training and present results in Figure 6. The first thing to notice is that starvation becomes a less prominent cause of death over time, most likely because our agents, once they start to learn to descend the dungeon and to fight monsters, are more likely to die in combat before they starve. In the Score and Scout tasks, our agents quickly learn to avoid eating rotten corpses. However, for the Eat task the agent's policy converged to the sub-optimal strategy of obtaining the intermediate positive reward from the nutrition of consuming a rotted corpse, before then dying and ending the episode. Another observation is that, over time, agents die much less frequently from being zapped by wands. This indicates that our agents learn to be friendly to shopkeepers, who are extremely strong compared to low-level heroes. Lastly, we can see that gnome kings G become a more prominent cause of death over time for agents trained in the Score and Scout tasks, which correlates with our agents leveling up and descending deeper into the dungeon.
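Analyses like Figure 6 can be produced by aggregating end-of-episode information over training. A minimal sketch is shown below; the "end_status" and "killer" keys are purely illustrative assumptions about how such information might be surfaced (e.g. via the Gym info dict or by parsing the tty recordings), not NLE's actual API.

```python
import collections


def death_causes(episode_infos):
    """Count causes of death over a list of per-episode info dicts (illustrative keys)."""
    counts = collections.Counter()
    for info in episode_infos:
        if info.get("end_status") == "death":
            counts[info.get("killer", "unknown")] += 1
    return counts.most_common()
```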


Task       Model       Score  Time  Exp   Dungeon  Win
staircase  cnn_mon     22     318   1.05  1.02     0.85
           cnn_val     22     368   1.05  1.03     0.84
           random_mon  12     443   1.00  1.03     0.19
           random_val  11     601   1.00  1.04     0.17
pet        cnn_mon     35     524   1.13  1.05     0.67
           cnn_val     25     485   1.06  1.03     0.72
           random_mon  14     466   1.00  1.07     0.12
           random_val  13     626   1.00  1.07     0.12
eat        cnn_mon     29     1601  1.07  1.05     –
           cnn_val     152    2000  2.13  1.06     –
           random_mon  36     1409  1.04  1.20     –
           random_val  20     1337  1.00  1.13     –
gold       cnn_mon     36     1320  1.04  1.19     –
           cnn_val     24     1488  1.01  1.14     –
           random_mon  44     1380  1.05  1.22     –
           random_val  22     1409  1.00  1.10     –
scout      cnn_mon     227    795   1.54  3.87     –
           cnn_val     231    966   1.47  4.11     –
           random_mon  38     1434  1.05  1.20     –
           random_val  20     1436  1.00  1.12     –
score      cnn_mon     444    976   2.39  3.73     –
           cnn_val     407    1110  2.30  3.58     –
           random_mon  36     1418  1.04  1.21     –
           random_val  22     1450  1.01  1.15     –
oracle     cnn_mon     52     1182  1.09  1.35     0.00
           cnn_val     91     1087  1.08  2.12     0.00
           random_mon  37     1361  1.05  1.21     0.00
           random_val  20     1477  1.00  1.12     0.00

Table 5: Metrics averaged over last 1000 episodes for each task.

J RELATED WORK

Progress in RL has historically been achieved both by algorithmic innovations and by the development of novel environments to train and evaluate agents. Below, we review recent RL environments and delineate their strengths and weaknesses as testbeds for current methods and future research.

J.1 RECENT GAME-BASED ENVIRONMENTS

Retro video games have been a major reason for the recent uptake in Deep Reinforcement Learning work. The Arcade Learning Environment (Bellemare et al., 2013) provides a unified interface to Atari 2600 games, allowing RL algorithms to be tested on high-dimensional, visual observations quickly and cheaply, resulting in an explosion of work (Arulkumaran et al., 2017). The Gym Retro environment (Nichol et al., 2018) expands the list of classic games that can be used as testbeds, but focuses on evaluating visual generalization and transfer learning on a single game, Sonic The Hedgehog.

Both StarCraft: BroodWar and StarCraft II have been successfully employed as RL environments (Synnaeve et al., 2016; Vinyals et al., 2017), contributing to research on, for example, planning (Ontanón et al., 2013; Nardelli et al., 2018), multi-agent systems (Foerster et al., 2017; Samvelyan et al., 2019), imitation learning (Vinyals et al., 2019), and model-free reinforcement learning (Synnaeve et al., 2019; Vinyals et al., 2019). However, the complexity of these games creates a high entry barrier, both in terms of the computational resources required and in terms of complex baseline models that require a high degree of domain knowledge to be extended.


Model     Killer Name                            Score  Exp  Dungeon
cnn_mon   gnome lord                             2088   6    7
          ghoul                                  2054   6    10
          gecko                                  2020   6    6
          human zombie                           1980   6    7
          dwarf                                  1976   5    8
          kitten                                 1884   5    8
          dog                                    1872   6    6
          falling rock                           1868   5    8
          an arrow                               1832   5    9
          werewolf                               1828   5    7
cnn_val   winter wolf cub                        1884   5    9
          snake                                  1868   6    7
          human zombie                           1588   5    8
          giant spider                           1572   5    11
          fire ant                               1556   5    7
          giant beetle                           1534   5    8
          gnome king                             1532   5    8
          hill orc called Aiunu of Goragor       1480   5    6
          wererat                                1468   5    8
          hill orc called Uuna of A-baneagris    1464   5    6

Table 6: Top ten of the last 1000 episodes in the Score task.

Figure 5: Screenshot of the web dashboard included in NLE.

3D games have proven to be useful testbeds for tasks such as navigation and embodied reasoning. Vizdoom (Kempka et al., 2016) modifies the classic first-person shooter game Doom to construct an API for visual control; DeepMind Lab (Beattie et al., 2016) presents a game engine based on Quake III Arena to allow for the creation of tasks based on the dynamics of the original game; Project Malmo (Johnson et al., 2016) and CraftAssist (Jernite et al., 2019) provide visual and symbolic interfaces to the popular Minecraft game, focusing on single-agent and multi-agent interactions respectively, which, however, are much more computationally demanding compared to NetHack.


Figure 6: Counts for some causes of death across training (death by starvation, death by rotted corpse, death by wand, and death by gnome king), plotted against the percentage of episodes completed during training, for the val_score, mon_score, val_eat, mon_eat, val_scout, and mon_scout agents.

More recent work has produced game-like environments with procedurally generated elements, such as the ProcGen benchmark (Cobbe et al., 2019a), MazeExplorer (Harries et al., 2019), and the Obstacle Tower environment (Juliani et al., 2019). However, we argue that, compared to NetHack or Minecraft, these environments do not provide the depth likely necessary to serve as long-term RL testbeds.

In conclusion, none of the current benchmarks combine a fast simulator with a procedurally generated environment, a hard exploration problem, a wide variety of complex environment dynamics, and numerous types of static and interactive entities. The unique combination of challenges present in NetHack makes NLE well-suited for driving research towards more general and robust RL algorithms.

J.2 ROGUELIKES AS REINFORCEMENT LEARNING TESTBEDS

We are not the first to argue for roguelike games to be used as testbeds for RL. Asperti et al. (2019) present an interface to Rogue, the very first roguelike game and one of the simplest roguelikes in terms of game dynamics and difficulty, and show that policies trained with model-free RL algorithms can successfully learn rudimentary navigation. Similarly, Kanagawa & Kaneko (2019) present an environment inspired by Rogue that aims to provide a highly parametrizable generation of Rogue levels. Like us, Dannenhauer et al. (2019) argue that roguelike games could be a useful RL testbed. They discuss another roguelike game, Dungeon Crawl Stone Soup, but their position paper provides neither an environment nor any experiments to validate their claims.

Some work has attempted to create interfaces for handcrafted bots in NetHack, such as TAEB, BotHack, and Saiph (TAEB, 2015; NetHack Wiki, 2020). These largely rely on search heuristics and common planning methods, not involving any learning component. Nonetheless, BotHack is the only programmatic agent to have ever achieved ascension, albeit doing so through abuse of an exploit present in a previous version of NetHack (TAEB, 2015).

Most similar to our work is gym_nethack8 (Campbell & Verbrugge, 2017; 2018), which offers a Gym environment based on a simplified version of NetHack 3.6.0. The authors extract the observation by directly parsing the characters of the textual human interface, omitting game information such as messages and disabling a large number of NetHack's game mechanics and levels to make their experiments tractable. With NLE, we provide an environment with full access to NetHack to enable new research approaches (e.g., learning from demonstrations) and to offer the community new challenges (e.g., beating the game in a realistic setting).

8 github.com/campbelljc/gym_nethack
