Quixote: A NetHack Reinforcement Learning Framework and Agent
Chandler Watson, Department of Mathematics, Stanford University

CS 229, Spring 2019

Abstract

In the 1980s, Michael Toy and Glenn Wichman released a new game that would change the ecosystem of Unix games forever: Rogue. Rogue was a dungeon crawler, a game in which the objective is to explore a dungeon, typically in search of an artifact, and it spawned many spinoffs, including the open-source NetHack, enjoyed by many today. NetHack is a game both beloved and reviled for its incredible difficulty; finishing the game is seen as a great accomplishment.

Figure 1: A screenshot of the end of a game of NetHack. (https://www.reddit.com/r/nethack/comments/3ybcce/yay_my_first_360_ascension/)

In this project, we explore the creation of a new NetHack reinforcement learning (RL) framework, Quixote, and of a Q-learning agent built on that framework. Additionally, we explore a domain-specific state representation and Q-learning tweaks, allowing progress without the involvement of deep architectures.

Objective

NetHack is a notoriously difficult game:
• Low win rate
• Hard to capture an informative enough state
• Sparse "rewards" and a large action space
• Massive number of game mechanics
• Unpredictable NPCs

This makes it a difficult but interesting environment for RL. We therefore:
• Focus on navigation (10 actions)
• Maximize points at the end of the game: go deeper

Framework

• Quixote is an original framework for NetHack RL
• Interactivity allows rapid prototyping and debugging

Figure 2: The layout of the Quixote framework.

Model 1: Basic Q-learning

Simple tabular Q-learning model:
• Update rule: Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
• Simple ε-greedy strategy with a relatively high ε = 0.1
• Primary reward is the change in score, with a retracing penalty and an exploration bonus
• State representation: directions to landmarks
• Tried per-state/action learning rates (opted for a constant learning rate later)

Figure 3: The state representation for Model 1.
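
To make Model 1 concrete, the sketch below shows tabular Q-learning with ε-greedy action selection over a small discrete action set. It is a minimal illustration, assuming a hashable state representation; the class and parameter names are hypothetical and not Quixote's actual API.

```python
import random
from collections import defaultdict

class TabularQAgent:
    """Minimal tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.actions = actions       # e.g., the 10 navigation actions
        self.alpha = alpha           # constant learning rate
        self.gamma = gamma           # discount factor
        self.epsilon = epsilon       # exploration probability
        self.q = defaultdict(float)  # Q[(state, action)] -> value, defaults to 0

    def act(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error
```

Here `reward` would be the per-step score delta combined with the retracing penalty and exploration bonus described above.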

Model 2: Approximate Q-Learning

Ensure that the model is able to generalize between states with linear function approximation (LFA):
• Same hyperparameters, rewards, etc. as Model 1
• Validate over 20 episodes to choose the discount factor

Model 3: ε Scheduling

Random play works well, so bootstrap off it:
• Much exploration is needed early, then exploitation
• Schedule ε downward in steps as training progresses

Figure 4: Stepwise epsilon scheduling.

Randomness as an action:
• Q-learning gradually decreases the amount of randomness
• Blend the learned policy with random play

Figure 7: The "random as action" model.
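
As a sketch of how Models 2 and 3 might fit together, the code below represents Q-values as a linear function of a state feature vector and lowers ε in steps as episodes accumulate. The feature dimension, the schedule thresholds, and all names here are illustrative assumptions, not Quixote's actual implementation.

```python
import random
import numpy as np

class LinearQAgent:
    """Q-learning with linear function approximation and a stepwise epsilon schedule
    (illustrative sketch; names and values are assumptions, not Quixote's API)."""

    def __init__(self, actions, n_features, alpha=0.1, gamma=0.95):
        self.actions = actions
        self.alpha = alpha
        self.gamma = gamma
        # One weight vector per action: Q(s, a) = w_a . phi(s)
        self.weights = {a: np.zeros(n_features) for a in actions}
        # Stepwise epsilon schedule as (episode threshold, epsilon) pairs.
        self.schedule = [(0, 0.5), (10, 0.25), (25, 0.1)]

    def epsilon(self, episode):
        # Epsilon of the latest schedule step whose threshold has been reached.
        return [eps for start, eps in self.schedule if episode >= start][-1]

    def q_value(self, features, action):
        return float(np.dot(self.weights[action], features))

    def act(self, features, episode):
        # Epsilon-greedy with the scheduled epsilon for this episode.
        if random.random() < self.epsilon(episode):
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q_value(features, a))

    def update(self, features, action, reward, next_features):
        # LFA update: w_a <- w_a + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) * phi(s)
        best_next = max(self.q_value(next_features, a) for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q_value(features, action)
        self.weights[action] += self.alpha * td_error * features
```

One possible reading of the "random as action" variant is to instead add an explicit "act randomly" entry to the action set and let Q-learning decide how often to invoke it, blending the learned policy with random play.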

Results

             Rand.    QL       QL + ε   AQL      AQL + ε
Mean score   45.60    44.92    51.96    32.53*   49.52

Figure 5: Results of running Quixote on 50 episodes (mean scores given). *Based on 30 episodes.

Discussion

While shallow Q-learning appeared to perform at roughly the level of random play, the modifications made improved QL over that baseline. Several caveats apply:
• Hard to say how useful the state representation was
• Hard to say whether the linear model underfit
• Bias from restarting when RL stalled
• Variance between runs too high to draw firm conclusions

In any case, the Quixote framework, despite several early bugs, was rugged and has great potential as an RL environment. Possible future directions include reorganization and packaging of Quixote into a bona fide Python package for public use, and the incorporation of deep RL into Quixote for higher performance.

Future Directions

Both for the framework and for the agent:
• Much, much more testing!
• Implement deep RL architectures
• Explore more expressive state representations
• Finer auxiliary rewards
• Use "meta-actions": go to door, fight enemy

Figure 6: Possible "meta-actions" for Quixote.

References

A. Fern, RL for Large State Spaces. Available: https://oregonstate.instructure.com/files/67025084/download
D. Takeshi, Going Deeper into Reinforcement Learning: Understanding Q-Learning and Linear Function Approximation, 2016. Available: https://danieltakeshi.github.io/2016/10/31/going-deeper-into-reinforcement-learning-understanding-q-learning-and-linear-function-approximation/
Q-learning, Wikipedia, 2019. Available: https://en.wikipedia.org/wiki/Q-learning
Y. Liao, K. Li, Z. Yang, CS229 Final Report: Reinforcement Learning to Play Mario, 2012. Available: http://cs229.stanford.edu/proj2012/LiaoYiYang-RLtoPlayMario.pdf