Rogue-Gym: A New Challenge for Generalization in Reinforcement Learning
Yuji Kanagawa, Graduate School of Arts and Sciences, The University of Tokyo, Tokyo, Japan, [email protected]
Tomoyuki Kaneko, Interfaculty Initiative in Information Studies, The University of Tokyo, Tokyo, Japan, [email protected]

A part of this work was supported by JSPS KAKENHI Grant Number 18K19832 and by JST, PRESTO.

Abstract—In this paper, we propose Rogue-Gym, a simple and classic-style roguelike game built for evaluating generalization in reinforcement learning (RL). Combined with the recent progress of deep neural networks, RL has successfully trained human-level agents without human knowledge in many games, such as those for Atari 2600. However, it has been pointed out that agents trained with RL methods often overfit the training environment and work poorly in slightly different environments. To investigate this problem, some research environments with procedural content generation have been proposed. Following these studies, we propose the use of roguelikes as a benchmark for evaluating the generalization ability of RL agents. In our Rogue-Gym, agents need to explore dungeons that are structured differently each time they start a new game. Thanks to the very diverse structures of the dungeons, we believe that the generalization benchmark of Rogue-Gym is sufficiently fair. In our experiments, we evaluate a standard reinforcement learning method, PPO, with and without enhancements for generalization. The results show that some enhancements believed to be effective fail to mitigate the overfitting in Rogue-Gym, although others slightly improve the generalization ability.

Index Terms—roguelike games, reinforcement learning, generalization, domain adaptation, neural networks

I. INTRODUCTION

Reinforcement learning (RL) is a key method for training AI agents without human knowledge. Recent advances in deep reinforcement learning have created human-level agents in many games, such as those for Atari 2600 [1] and the game DOOM [2], using only pixels as inputs. This method could be applied to many domains, from robotics to the game industry.

However, it is still difficult to generalize learned policies between tasks, even for current state-of-the-art RL algorithms. Recent studies (e.g., by Zhang et al. [3] and by Cobbe et al. [4]) have shown that agents trained by RL methods often overfit the training environment and perform poorly in a test environment when the test environment is not exactly the same as the training environment. This is an important problem because test environments often differ somewhat from training environments in many applications of reinforcement learning. For example, in real-world applications including self-driving cars [5], agents are often trained via simulators or in designated areas but need to perform safely in real-world situations that are similar to but different from their training environments. For agents to act appropriately in unknown situations, they need to properly generalize the policies they learned from the training environment. Generalization is also important in transfer learning, where the goal is to transfer a policy learned in a training environment to another similar environment, called the target environment. We can use this method to reduce the training time in many applications of RL. For example, we can imagine a situation in which we train an enemy in an action game on a small number of stages and then transfer it to a larger number of other scenes via generalization.

In this paper, we propose the Rogue-Gym environment, a simple roguelike game built to evaluate the generalization ability of RL agents. As in the original implementation of Rogue, it has very diverse and randomly generated dungeon structures on each floor. Thus, there is no pattern of actions that is always effective, which makes the generalization benchmark in Rogue-Gym sufficiently fair. Instead, in Rogue-Gym agents have to generalize abstract subgoals, such as getting coins or going downstairs, through action sequences that consist of concrete actions (e.g., moving left). Rogue-Gym is designed so that the environment an agent encounters, which includes dungeon maps, items, and enemies, is configurable through a random seed. Thus, we can easily evaluate the generalization score in Rogue-Gym by using random seeds different from those used in training. Since many other properties, including the size of dungeons and the presence of enemies, are completely configurable, researchers can easily adjust the difficulty of learning so that the environment is complex enough to be difficult for simple agents to solve but can still be addressed by state-of-the-art RL methods.
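The seed-based protocol described above can be made concrete with a short evaluation loop. The sketch below is illustrative only: the import path, class name, and constructor argument of the environment are assumptions, and the code relies on nothing beyond a generic Gym-style reset()/step() interface, so any trained policy can be plugged in through policy_fn.

```python
# Minimal sketch of seed-based generalization evaluation (assumed API).
import numpy as np

try:
    from rogue_gym.envs import RogueEnv  # assumed package layout
except ImportError:
    RogueEnv = None  # substitute the actual environment class here

TRAIN_SEEDS = range(0, 10)     # seeds seen during training
EVAL_SEEDS = range(100, 120)   # held-out seeds, disjoint from training


def evaluate(policy_fn, seeds, episodes_per_seed=5, max_steps=500):
    """Average episodic reward of policy_fn over dungeons generated from seeds."""
    returns = []
    for seed in seeds:
        env = RogueEnv(seed=seed)  # assumed constructor signature
        for _ in range(episodes_per_seed):
            obs = env.reset()
            total, done, t = 0.0, False, 0
            while not done and t < max_steps:
                obs, reward, done, _ = env.step(policy_fn(obs))
                total += reward
                t += 1
            returns.append(total)
    return float(np.mean(returns))

# Generalization gap = score on training seeds minus score on unseen seeds:
# gap = evaluate(policy_fn, TRAIN_SEEDS) - evaluate(policy_fn, EVAL_SEEDS)
```

Keeping the two seed sets disjoint is what turns the evaluation score into a measure of generalization rather than memorization of particular dungeon layouts.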
In our experiments, we evaluate a popular deep reinforcement learning (DRL) algorithm with and without generalization methods in Rogue-Gym. We show that some of the methods work poorly in generalization, although they successfully improve the training scores through learning. In contrast, some of these methods, like L2 regularization, achieve better generalization scores than those of the baseline, but the improvement is not sufficient. Therefore, Rogue-Gym is a novel and challenging domain for further studies.

II. BACKGROUND

We follow standard notation and denote a Markov decision process M by M = (S, A, R, P), where S is the state space, A is the action space, R is the immediate reward function that maps a state to a reward, and P is the state transition probability. We denote the policy of an agent by π(a|s), that is, the probability of taking an action a ∈ A given a state s ∈ S. In an instance of an MDP, the goal of reinforcement learning (RL) [6] is to obtain the optimal policy that maximizes the expected total reward by repeatedly taking an action, observing a state, and getting a reward. In this paper, we consider the episodic setting, in which the total reward is defined as R = Σ_{t=0}^{T} R(s_t), where t denotes the time step and the initial state s_0 is sampled from the initial state distribution P^0.

One of the best-known classes of RL algorithms is policy gradient methods. Suppose that a policy π is parameterized by a parameter vector θ. Then, we can denote the policy by π_θ and the gradient of the expected total reward by ∇_θ E[R]. The goal of policy gradient methods is to maximize E[R] by iteratively updating θ on the basis of estimates of ∇_θ E[R] obtained from the agent's experience.

Deep reinforcement learning refers to RL methods that use deep neural networks as function approximators. DRL enables us to train RL agents given only screen pixels as states, through the use of deep convolutional neural networks (CNNs) [1]. PPO [7] is one of the state-of-the-art deep policy gradient methods, and we use it as a baseline method in this paper.
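To make the policy gradient update above concrete, the following is a minimal REINFORCE-style sketch on a toy three-armed bandit, where each episode has a single step so the episodic return is just one reward. The bandit, its reward means, and the learning rate are illustrative assumptions and not part of the paper; the paper's experiments use PPO [7], which refines the same gradient signal with a clipped surrogate objective.

```python
# REINFORCE on a toy 3-armed bandit: estimate grad_theta E[R] from samples
# of (action, reward) and ascend it. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
TRUE_MEANS = np.array([0.2, 0.5, 0.9])  # unknown to the agent
theta = np.zeros(3)                     # softmax logits, the policy parameters


def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()                  # pi_theta(a)


for step in range(2000):
    probs = policy(theta)
    grad = np.zeros_like(theta)
    for _ in range(16):                 # a small batch of one-step episodes
        a = rng.choice(3, p=probs)
        r = rng.normal(TRUE_MEANS[a], 0.1)
        # gradient of log pi_theta(a) for a softmax policy: one_hot(a) - probs
        grad += (np.eye(3)[a] - probs) * r
    theta += 0.05 * grad / 16           # ascend the estimated gradient

print("learned policy:", np.round(policy(theta), 3))  # mass should concentrate on the best arm
```

PPO keeps the same kind of gradient signal but limits how far each update can move π_θ, which stabilizes learning on high-dimensional inputs such as game screens.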
III. RELATED WORK

Farebrother et al. [8] proposed the use of ALE [9], an environment based on an Atari 2600 emulator, to evaluate generalization in RL. They conducted experiments using the different game modes of Atari 2600 games introduced by Machado et al. [10] and showed that regularization techniques like L2 regularization mitigate the overfitting of DQN [1]. However, the number of environments in ALE is limited, which allows us to tune algorithms for specific environments.

To increase the number of training/evaluation environments, procedural content generation is considered a promising method. Zhang et al. [3] conducted experiments using simple 2D gridworld mazes generated procedurally and showed that some of the typical methods used for mitigating overfitting in RL often fail. Juliani et al. [12] proposed Obstacle Tower, an environment with procedurally generated floors, which makes the task hard for RL algorithms. In their experiments, they showed that state-of-the-art algorithms, including PPO, struggle to generalize learned policies in Obstacle Tower.

Our work is most similar to Cobbe et al. [4] and Juliani et al. [12] in proposing a new environment with procedural content generation for evaluating generalization, though we place more importance on customizability and reasonable difficulty.

In addition to regularization, state representation learning [13] is also a promising approach to generalization. The key idea is the use of abstract state representations to bridge the gap between a training environment and a test environment. We can obtain such representations via unsupervised learning methods such as variational autoencoders (VAEs) [14].

Higgins et al. [15] adopted this idea for RL and proposed DARLA, which learns disentangled state representations by using a β-VAE [16] on randomly collected observations and then learns a policy on top of these representations. They manually set up training and test environments by changing the colors and/or locations of objects in 3D navigation tasks in DeepMind Lab [17] and showed that DARLA improves generalization scores in these tasks. We evaluated β-VAE in our experiments.

IV. ROGUE-GYM ENVIRONMENT

In this section, we introduce the Rogue-Gym environment, a simple roguelike game built for evaluating the generalization performance of RL agents.

To fairly evaluate the generalization ability of RL algorithms, we claim that structural diversity across training and test environments is important, in addition to a sufficient number of test environments. In the context of evaluating an RL agent in a single task, Machado et al. [10] argued that the stochasticity of an environment is important, showing that a simple algorithm that memorizes only an effective sequence of actions performs well in a deterministic environment. This kind of hack is also possible in a generalization setting if the training and test environments do not have diverse structures and share an undesirable structural similarity. For example, in the task of CoinRun [4], stages share a common structure in that the player is initially on the left side of the stage and the coin of each stage is placed on the right.
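The "memorized action sequence" hack mentioned above can be checked directly: record one action sequence that works on a training dungeon and replay it open loop on unseen seeds. The sketch below is illustrative only and reuses the assumed environment interface and seed values from the earlier evaluation sketch; none of it is code from the paper.

```python
# Open-loop replay of a fixed action sequence: it may score well on the seed
# it was recorded on, but on structurally diverse dungeons it should collapse
# on unseen seeds. Environment construction is assumed, as in the earlier sketch.
import numpy as np


def run_sequence(env, actions, max_steps=500):
    """Replay a fixed action sequence, ignoring observations entirely."""
    env.reset()
    total, done = 0.0, False
    for action in actions[:max_steps]:
        if done:
            break
        _, reward, done, _ = env.step(action)
        total += reward
    return total


def memorization_gap(make_env, recorded_actions, train_seed, test_seeds):
    """Score of the replayed sequence on its training seed vs. unseen seeds."""
    train_score = run_sequence(make_env(train_seed), recorded_actions)
    test_scores = [run_sequence(make_env(s), recorded_actions) for s in test_seeds]
    return train_score, float(np.mean(test_scores))

# Hypothetical usage, assuming the RogueEnv class from the earlier sketch:
# train_score, test_score = memorization_gap(lambda s: RogueEnv(seed=s),
#                                             recorded_actions, 0, range(100, 120))
```

A large drop from train_score to test_score indicates that the benchmark cannot be beaten by memorizing a single trajectory, which is exactly the property the diverse dungeon structures of Rogue-Gym are intended to provide.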