
Quantifying Generalization in Reinforcement Learning

Karl Cobbe 1 Oleg Klimov 1 Chris Hesse 1 Taehoon Kim 1 John Schulman 1

1 OpenAI, San Francisco, CA, USA. Correspondence to: Karl Cobbe.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s). arXiv:1812.02341v3 [cs.LG] 14 Jul 2019.

Abstract

In this paper, we investigate the problem of overfitting in deep reinforcement learning. Among the most common benchmarks in RL, it is customary to use the same environments for both training and testing. This practice offers relatively little insight into an agent's ability to generalize. We address this issue by using procedurally generated environments to construct distinct training and test sets. Most notably, we introduce a new environment called CoinRun, designed as a benchmark for generalization in RL. Using CoinRun, we find that agents overfit to surprisingly large training sets. We then show that deeper convolutional architectures improve generalization, as do methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation and batch normalization.

1. Introduction

Generalizing between tasks remains difficult for state of the art deep reinforcement learning (RL) algorithms. Although trained agents can solve complex tasks, they struggle to transfer their experience to new environments. Agents that have mastered ten levels in a video game often fail catastrophically when first encountering the eleventh. Humans can seamlessly generalize across such similar tasks, but this ability is largely absent in RL agents. In short, agents become overly specialized to the environments encountered during training.

That RL agents are prone to overfitting is widely appreciated, yet the most common RL benchmarks still encourage training and evaluating on the same set of environments. We believe there is a need for more metrics that evaluate generalization by explicitly separating training and test environments. In the same spirit as the Sonic Benchmark (Nichol et al., 2018), we seek to better quantify an agent's ability to generalize.

To begin, we train agents on CoinRun, a procedurally generated environment of our own design, and we report the surprising extent to which overfitting occurs. Using this environment, we investigate how several key algorithmic and architectural decisions impact the generalization performance of trained agents.

The main contributions of this work are as follows:

1. We show that the number of training environments required for good generalization is much larger than the number used by prior work on transfer in RL.

2. We propose a generalization metric using the CoinRun environment, and we show how this metric provides a useful signal upon which to iterate.

3. We evaluate the impact of different convolutional architectures and forms of regularization, finding that these choices can significantly improve generalization performance.

2. Related Work

Our work is most directly inspired by the Sonic Benchmark (Nichol et al., 2018), which proposes to measure generalization performance by training and testing RL agents on distinct sets of levels in the Sonic the Hedgehog™ video game franchise. Agents may train arbitrarily long on the training set, but are permitted only 1 million timesteps at test time to perform fine-tuning. This benchmark was designed to address the problems inherent to "training on the test set."

(Farebrother et al., 2018) also address this problem, accurately recognizing that conflating train and test environments has contributed to the lack of regularization in deep RL. They propose using different game modes of Atari 2600 games to measure generalization. They turn to supervised learning for inspiration, finding that both L2 regularization and dropout can help agents learn more generalizable features.

Figure 1. Two levels in CoinRun. The level on the left is much easier than the level on the right.

(Packer et al., 2018) propose a different benchmark to measure generalization using six classic environments, each of which has been modified to expose several internal parameters. By training and testing on environments with different parameter ranges, their benchmark quantifies agents' ability to interpolate and extrapolate. (Zhang et al., 2018a) measure overfitting in continuous domains, finding that generalization improves as the number of training seeds increases. They also use randomized rewards to determine the extent of undesirable memorization.

Other works create distinct train and test environments using procedural generation. (Justesen et al., 2018) use the General Video Game AI (GVG-AI) framework to generate levels from several unique games. By varying difficulty settings between train and test levels, they find that RL agents regularly overfit to a particular training distribution. They further show that the ability to generalize to human-designed levels strongly depends on the level generators used during training.

(Zhang et al., 2018b) conduct experiments on procedurally generated gridworld mazes, reporting many insightful conclusions on the nature of overfitting in RL agents. They find that agents have a high capacity to memorize specific levels in a given training set, and that techniques intended to mitigate overfitting in RL, including sticky actions (Machado et al., 2018) and random starts (Hausknecht and Stone, 2015), often fail to do so.

In Section 5.4, we similarly investigate how injecting stochasticity impacts generalization. Our work mirrors (Zhang et al., 2018b) in quantifying the relationship between overfitting and the number of training environments, though we additionally show how several methods, including some more prevalent in supervised learning, can reduce overfitting in our benchmark.

These works, as well as our own, highlight the growing need for experimental protocols that directly address generalization in RL.

3. Quantifying Generalization

3.1. The CoinRun Environment

We propose the CoinRun environment to evaluate the generalization performance of trained agents. The goal of each CoinRun level is simple: collect the single coin that lies at the end of the level. The agent controls a character that spawns on the far left, and the coin spawns on the far right. Several obstacles, both stationary and non-stationary, lie between the agent and the coin. A collision with an obstacle results in the agent's immediate death. The only reward in the environment is obtained by collecting the coin, and this reward is a fixed positive constant. The level terminates when the agent dies, the coin is collected, or after 1000 time steps.

We designed the game CoinRun to be tractable for existing algorithms. That is, given a sufficient number of training levels and sufficient training time, our algorithms learn a near optimal policy for all CoinRun levels. Each level is generated deterministically from a given seed, providing agents access to an arbitrarily large and easily quantifiable supply of training data. CoinRun mimics the style of platformer games like Sonic, but it is much simpler. For the purpose of evaluating generalization, this simplicity can be highly advantageous.

Levels vary widely in difficulty, so the distribution of levels naturally forms a curriculum for the agent. Two different levels are shown in Figure 1. See Appendix A for more details about the environment and Appendix B for additional screenshots. Videos of a trained agent playing can be found here, and environment code can be found here.

3.2. CoinRun Generalization Curves

Using the CoinRun environment, we can measure how successfully agents generalize from a given set of training levels to an unseen set of test levels.
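Concretely, the protocol can be sketched as follows (illustrative Python only, not the official CoinRun API): disjoint train and test seed sets are drawn from a large seed space, training episodes sample uniformly from the training set, and the zero-shot metric is the fraction of episodes in which the coin is collected. Here `run_episode` is a hypothetical rollout function.

```python
import random

def make_level_sets(num_train, num_test, rng_seed=0):
    """Sample disjoint train/test level seeds from a large (32-bit) seed space."""
    rng = random.Random(rng_seed)
    seeds = rng.sample(range(2**31 - 1), num_train + num_test)
    return seeds[:num_train], seeds[num_train:]

def percent_levels_solved(run_episode, level_seeds, num_episodes=10_000, rng_seed=1):
    """Zero-shot metric: average success over episodes, each on a level sampled
    uniformly from `level_seeds`. `run_episode(seed) -> bool` should roll out the
    frozen policy and report whether the coin was collected."""
    rng = random.Random(rng_seed)
    solved = sum(run_episode(rng.choice(level_seeds)) for _ in range(num_episodes))
    return 100.0 * solved / num_episodes

train_seeds, test_seeds = make_level_sets(num_train=500, num_test=1000)
# Dummy rollout for illustration only: pretend even-numbered seeds are solved.
print(percent_levels_solved(lambda s: s % 2 == 0, test_seeds, num_episodes=100))
```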

[Figure 2: train and test curves for Nature-CNN (left) and IMPALA-CNN (right); x-axis: # Training Levels (log scale, 10^2 to 10^4), y-axis: % Levels Solved.]

(a) Final train and test performance of Nature-CNN agents after 256M timesteps, as a function of the number of training levels. (b) Final train and test performance of IMPALA-CNN agents after 256M timesteps, as a function of the number of training levels.

Figure 2. Dotted lines denote final mean test performance of the agents trained with an unbounded set of levels. The solid line and shaded regions represent the mean and standard deviation respectively across 5 seeds. Training sets are generated separately for each seed.

Train and test levels are drawn from the same distribution, so the gap between train and test performance determines the extent of overfitting. As the number of available training levels grows, we expect the performance on the test set to improve, even when agents are trained for a fixed number of timesteps. At test time, we measure the zero-shot performance of each agent on the test set, applying no fine-tuning to the agent's parameters.

We train 9 agents to play CoinRun, each on a training set with a different number of levels. During training, each new episode uniformly samples a level from the appropriate set. The first 8 agents are trained on sets ranging in size from 100 to 16,000 levels. We train the final agent on an unbounded set of levels, where each level is seeded randomly. With 2^32 level seeds, collisions are unlikely. Although this agent encounters approximately 2M unique levels during training, it still does not encounter any test levels until test time. We repeat this whole experiment 5 times, regenerating the training sets each time.

We first train agents with policies using the same 3-layer convolutional architecture proposed by (Mnih et al., 2015), which we henceforth call Nature-CNN. Agents are trained with Proximal Policy Optimization (Schulman et al., 2017; Dhariwal et al., 2017) for a total of 256M timesteps across 8 workers. We train agents for the same number of timesteps independent of the number of levels in the training set. We average gradients across all 8 workers on each mini-batch. We use γ = .999, as an optimal agent takes between 50 and 500 timesteps to solve a level, depending on level difficulty. See Appendix D for a full list of hyperparameters.

Results are shown in Figure 2a. We collect each data point by averaging the final agent's performance across 10,000 episodes, where each episode samples a level from the appropriate set. We can see that substantial overfitting occurs when there are fewer than 4,000 training levels. Even with 16,000 training levels, overfitting is still noticeable. Agents perform best when trained on an unbounded set of levels, when a new level is encountered in every episode. See Appendix E for performance details.

Now that we have generalization curves for the baseline architecture, we can evaluate the impact of various algorithmic and architectural decisions.

4. Evaluating Architectures

We choose to compare the convolutional architecture used in IMPALA (Espeholt et al., 2018) against our Nature-CNN baseline. With the IMPALA-CNN, we perform the same experiments described in Section 3.2, with results shown in Figure 2b. We can see that across all training sets, the IMPALA-CNN agents perform better at test time than Nature-CNN agents.
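For reference, the following is a PyTorch-style sketch of the kind of residual convolutional encoder compared here. It is an approximation rather than the paper's implementation; the channel depths, pooling parameters, and embedding size are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv0(torch.relu(x))
        out = self.conv1(torch.relu(out))
        return x + out  # residual connection

class ConvSequence(nn.Module):
    """One conv sequence: conv, max-pool, then two residual blocks."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.res0 = ResidualBlock(out_channels)
        self.res1 = ResidualBlock(out_channels)

    def forward(self, x):
        x = self.pool(self.conv(x))
        return self.res1(self.res0(x))

def impala_cnn(depths=(16, 32, 32), obs_size=64, embed_dim=256):
    """IMPALA-style encoder. Channel depths and embedding size are assumed;
    a deeper/wider variant in the spirit of IMPALA-Large would use five
    sequences with roughly doubled depths."""
    layers, in_ch = [], 3
    for d in depths:
        layers.append(ConvSequence(in_ch, d))
        in_ch = d
    spatial = obs_size // (2 ** len(depths))  # each sequence halves H and W
    layers += [nn.Flatten(), nn.ReLU(),
               nn.Linear(in_ch * spatial * spatial, embed_dim), nn.ReLU()]
    return nn.Sequential(*layers)

# 64x64 RGB observations, as in CoinRun.
features = impala_cnn()(torch.zeros(1, 3, 64, 64))  # -> shape (1, 256)
```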

[Figure 3 plots: x-axis Timesteps (1e8), y-axis % Levels Solved; legend: Nature, IMPALA, IMPALA-Large, with Train and Test curves.]

(a) Performance of Nature-CNN and IMPALA-CNN agents during training, on an unbounded set of training levels. (b) Performance of Nature-CNN and IMPALA-CNN agents during training, on a set of 500 training levels.

Figure 3. The lines and shaded regions represent the mean and standard deviation respectively across 3 runs.

To evaluate generalization performance, one could train agents on the unbounded level set and directly compare learning curves. In this setting, it is impossible for an agent to overfit to any subset of levels. Since every level is new, the agent is evaluated on its ability to continually generalize. For this reason, performance with an unbounded training set can serve as a reasonable proxy for the more explicit train-to-test generalization performance. Figure 3a shows a comparison between training curves for IMPALA-CNN and Nature-CNN, with an unbounded set of training levels. As we can see, the IMPALA-CNN architecture is substantially more sample efficient.

However, it is important to note that learning faster with an unbounded training set will not always correlate positively with better generalization performance. In particular, well chosen hyperparameters might lead to improved training speed, but they are less likely to lead to improved generalization. We believe that directly evaluating generalization, by training on a fixed set of levels, produces the most useful metric. Figure 3b shows the performance of different architectures when training on a fixed set of 500 levels. The same training set is used across seeds.

In both settings, it is clear that the IMPALA-CNN architecture is better at generalizing across levels of CoinRun. Given the success of the IMPALA-CNN, we experimented with several larger architectures, finding a deeper and wider variant of the IMPALA architecture (IMPALA-Large) that performs even better. This architecture uses 5 residual blocks instead of 3, with twice as many channels at each layer. Results with this architecture are shown in Figure 3.

It is likely that further architectural tuning could yield even greater generalization performance. As is common in supervised learning, we expect much larger networks to have a higher capacity for generalization. In our experiments, however, we noticed diminishing returns when increasing the network size beyond IMPALA-Large, particularly as wall clock training time can dramatically increase. In any case, we leave further architectural investigation to future work.

5. Evaluating Regularization

Regularization has long played a significant role in supervised learning, where generalization is a more immediate concern. Datasets always include separate training and test sets, and there are several well established regularization techniques for reducing the generalization gap. These regularization techniques are less often employed in deep RL, presumably because they offer no perceivable benefits in the absence of a generalization gap – that is, when the training and test sets are one and the same.

Now that we are directly measuring generalization in RL, we have reason to believe that regularization will once again prove effective. Taking inspiration from supervised learning, we choose to investigate the impact of L2 regularization, dropout, data augmentation, and batch normalization in the CoinRun environment.

Throughout this section we train agents on a fixed set of 500 CoinRun levels, following the same experimental procedure shown in Figure 3b.
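As a concrete illustration of how dropout and an L2 penalty attach to an RL policy network, here is a minimal PyTorch-style sketch; the layer sizes and action count are placeholders, while the learning rate, dropout probability, and L2 weight match the values reported in Section 5.1 and Appendix D.

```python
import torch
import torch.nn as nn

# Hypothetical policy head with dropout inserted after a hidden layer.
policy_head = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Dropout(p=0.1),      # dropout probability p
    nn.Linear(256, 15),     # action logits; the action count is a placeholder
)

# An L2 penalty with weight w can be applied via the optimizer's weight decay,
# which adds w * theta to each parameter's gradient.
optimizer = torch.optim.Adam(policy_head.parameters(), lr=5e-4, weight_decay=1e-4)
```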

[Figure 4 plots: (a) x-axis L2 Weight (w) in units of 1e-4; (b) x-axis Dropout Probability (p); (c) x-axis Timesteps (1e8), legend: Baseline, Data Aug., Batch Norm., L2 (w = 10^-4), All; y-axis % Levels Solved, with Train and Test curves.]

(a) Final train and test performance after 256M timesteps as a function of the L2 weight penalty. Mean and standard deviation is shown across 5 runs. (b) Final train and test performance after 512M timesteps as a function of the dropout probability. Mean and standard deviation is shown across 5 runs. (c) The effect of using data augmentation, batch normalization and L2 regularization when training on 500 levels. Mean and standard deviation is shown across 3 runs.

Figure 4. The impact of different forms of regularization.

We have already seen that substantial overfitting occurs, so we expect this setting to provide a useful signal for evaluating generalization. In all subsequent experiments, figures show the mean and standard deviation across 3-5 runs. In these experiments, we use the original IMPALA-CNN architecture with 3 residual blocks, but we notice qualitatively similar results with other architectures.

5.1. Dropout and L2 Regularization

We first train agents with either dropout probability p ∈ [0, 0.25] or with L2 penalty w ∈ [0, 2.5 × 10^-4]. We train agents with L2 regularization for 256M timesteps, and we train agents with dropout for 512M timesteps. We do this since agents trained with dropout take longer to converge. We report both the final train and test performance. The results of these experiments are shown in Figure 4. Both L2 regularization and dropout noticeably reduce the generalization gap, though dropout has a smaller impact. Empirically, the most effective dropout probability is p = 0.1 and the most effective L2 weight is w = 10^-4.

5.2. Data Augmentation

Data augmentation is often effective at reducing overfitting on supervised learning benchmarks. There have been a wide variety of augmentation transformations proposed for images, including translations, rotations, and adjustments to brightness, contrast, or sharpness. (Cubuk et al., 2018) search over a diverse space of augmentations and train a policy to output effective data augmentations for a target dataset, finding that different datasets often benefit from different sets of augmentations.

We take a simple approach in our experiments, using a slightly modified form of Cutout (Devries and Taylor, 2017). For each observation, multiple rectangular regions of varying sizes are masked, and these masked regions are assigned a random color. See Appendix C for screenshots. This method closely resembles domain randomization (Tobin et al., 2017), used in robotics to transfer from simulations to the real world. Figure 4c shows the boost this data augmentation scheme provides in CoinRun. We expect that other methods of data augmentation would prove similarly effective and that the effectiveness of any given augmentation will vary across environments.

5.3. Batch Normalization

Batch normalization (Ioffe and Szegedy, 2015) is known to have a substantial regularizing effect in supervised learning (Luo et al., 2018). We investigate the impact of batch normalization on generalization, by augmenting the IMPALA-CNN architecture with batch normalization after every convolutional layer. Training workers normalize with the statistics of the current batch, and test workers normalize with a moving average of these statistics. We show the comparison to baseline generalization in Figure 4c. As we can see, batch normalization offers a significant performance boost.

5.4. Stochasticity

We now evaluate the impact of stochasticity on generalization in CoinRun. We consider two methods, one varying the environment's stochasticity and one varying the policy's stochasticity. First, we inject environmental stochasticity by following ε-greedy action selection: with probability ε at each timestep, we override the agent's preferred action with a random action.
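A minimal sketch of this ε-greedy override (illustrative only; the action-space size used here is an assumption):

```python
import random

def epsilon_greedy(preferred_action, num_actions, epsilon, rng=random):
    """With probability epsilon, replace the policy's preferred action with a
    uniformly random one, injecting stochasticity into the environment loop."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return preferred_action

action = epsilon_greedy(preferred_action=3, num_actions=7, epsilon=0.075)
```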

[Figure 5 plots: (a) x-axis Timesteps (1e8), legend: kH = .01, ε = 0; kH = .01, ε = .075; kH = .03, ε = 0; (b) x-axis Entropy Bonus (kH); (c) x-axis Epsilon; y-axis % Levels Solved, with Train and Test curves.]

(a) Comparison of ε-greedy and high entropy bonus agents to baseline during training. (b) Final train and test performance for agents trained with different entropy bonuses. (c) Final train and test performance for ε-greedy agents trained with different values of ε.

Figure 5. The impact of introducing stochasticity into the environment, via ε-greedy action selection and an entropy bonus. Training occurs over 512M timesteps. Mean and standard deviation is shown across 3 runs.

In previous work, ε-greedy action selection has been used both as a means to encourage exploration and as a theoretical safeguard against overfitting (Bellemare et al., 2012; Mnih et al., 2013). Second, we control policy stochasticity by changing the entropy bonus in PPO. Note that our baseline agent already uses an entropy bonus of kH = .01.

We increase training time to 512M timesteps as training now proceeds more slowly. Results are shown in Figure 5. It is clear that an increase in either the environment's or the policy's stochasticity can improve generalization. Furthermore, each method in isolation offers a similar generalization boost. It is notable that training with increased stochasticity improves generalization to a greater extent than any of the previously mentioned regularization methods. In general, we expect the impact of these stochastic methods to vary substantially between environments; we would expect less of a boost in environments whose dynamics are already highly stochastic.
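For reference, a minimal sketch of how an entropy bonus with coefficient kH typically enters a PPO-style objective (illustrative PyTorch, not the paper's code):

```python
import torch

def entropy_bonus(logits, k_h=0.01):
    """Mean policy entropy scaled by the coefficient k_H; it is added to the
    PPO objective (i.e., subtracted from the loss) to keep the policy stochastic."""
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return k_h * entropy.mean()

# Combined with the usual clipped surrogate and value losses during training.
bonus = entropy_bonus(torch.randn(8, 7), k_h=0.01)
```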

5.5. Combining Regularization Methods

We briefly investigate the effects of combining several of the aforementioned techniques. Results are shown in Figure 4c. We find that combining data augmentation, batch normalization, and L2 regularization yields slightly better test time performance than using any one of them individually. However, the small magnitude of the effect suggests that these regularization methods are perhaps addressing similar underlying causes of poor generalization. Furthermore, for unknown reasons, we had little success combining ε-greedy action selection and high entropy bonuses with other forms of regularization.

Figure 6. Levels from CoinRun-Platforms (left) and RandomMazes (right). In RandomMazes, the agent's observation space is shaded in green.

6. Additional Environments

The preceding sections have revealed the high degree of overfitting present in one particular environment. We corroborate these results by quantifying overfitting on two additional environments: a CoinRun variant called CoinRun-Platforms and a simple maze navigation environment called RandomMazes.

We apply the same experimental procedure described in Section 3.2 to both CoinRun-Platforms and RandomMazes, to determine the extent of overfitting. We use the original IMPALA-CNN architecture followed by an LSTM (Hochreiter and Schmidhuber, 1997), as memory is necessary for the agent to explore optimally. These experiments further reveal how susceptible our algorithms are to overfitting.
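A rough PyTorch-style sketch of such a recurrent agent, assuming a generic convolutional encoder followed by an LSTM with policy and value heads; the layer sizes and action count are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Encoder features are fed through an LSTM before the policy/value heads."""
    def __init__(self, encoder, feat_dim, num_actions, hidden=256):
        super().__init__()
        self.encoder = encoder
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.pi = nn.Linear(hidden, num_actions)   # policy logits
        self.v = nn.Linear(hidden, 1)              # value estimate

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, C, H, W)
        b, t = obs_seq.shape[:2]
        feats = self.encoder(obs_seq.reshape(b * t, *obs_seq.shape[2:]))
        out, state = self.lstm(feats.reshape(b, t, -1), state)
        return self.pi(out), self.v(out), state

# Placeholder encoder for illustration; an IMPALA-style CNN would be used in practice.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU())
policy = RecurrentPolicy(encoder, feat_dim=256, num_actions=15)
logits, value, state = policy(torch.zeros(2, 8, 3, 64, 64))
```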

[Figures 7 and 8 plots: x-axis # Training Levels (log scale); y-axis Level Score (Figure 7) and % Levels Solved (Figure 8), with Train and Test curves.]

Figure 7. Final train and test performance in CoinRun-Platforms after 2B timesteps, as a function of the number of training levels.

Figure 8. Final train and test performance in RandomMazes after 256M timesteps, as a function of the number of training levels.

6.1. CoinRun-Platforms

In CoinRun-Platforms, there are several coins that the agent attempts to collect within the 1000 step time-limit. Coins are randomly scattered across platforms in the level. Levels are larger than in CoinRun, so the agent must actively explore, sometimes retracing its steps. Collecting any coin gives a reward of 1, and collecting all coins in a level gives an additional reward of 9. Each level contains several moving monsters that the agent must avoid. The episode ends only when all coins are collected, when time runs out, or when the agent dies. See Appendix B for environment screenshots.

As CoinRun-Platforms is a much harder game, we train each agent for a total of 2B timesteps. Figure 7 shows that overfitting occurs up to around 4000 training levels. Beyond the extent of overfitting, it is also surprising that agents' training performance increases as a function of the number of training levels, past a certain threshold. This is notably different from supervised learning, where training performance generally decreases as the training set becomes larger. We attribute this trend to the implicit curriculum in the distribution of generated levels. With additional training data, agents are more likely to learn skills that generalize even across training levels, thereby boosting the overall training performance.

6.2. RandomMazes

In RandomMazes, each level consists of a randomly generated square maze with dimension uniformly sampled from 3 to 25. Mazes are generated using Kruskal's algorithm (Kruskal, 1956). The environment is partially observed, with the agent observing the 9 × 9 patch of cells directly surrounding its current location. Every cell is either a wall, an empty space, the goal, or the agent. The episode ends when the agent reaches the goal or when time expires after 500 timesteps. The agent's only actions are to move to an empty adjacent square. If the agent reaches the goal, a constant reward is received. Figure 8 reveals particularly strong overfitting, with a sizeable generalization gap even when training on 20,000 levels.

6.3. Discussion

In both CoinRun-Platforms and RandomMazes, agents must learn to leverage recurrence and memory to optimally navigate the environment. The need to memorize and recall past experience presents challenges to generalization unlike those seen in CoinRun. It is unclear how well suited LSTMs are to this task. We empirically observe that given sufficient data and training time, agents using LSTMs eventually converge to a near optimal policy. However, the relatively poor generalization performance raises the question of whether different recurrent architectures might be better suited for generalization in these environments. This investigation is left for future work.

7. Conclusion

Our results provide insight into the challenges underlying generalization in RL. We have observed the surprising extent to which agents can overfit to a fixed training set. Using the procedurally generated CoinRun environment, we can precisely quantify such overfitting. With this metric, we can better evaluate key architectural and algorithmic decisions. We believe that the lessons learned from this environment will apply in more complex settings, and we hope to use this benchmark, and others like it, to iterate towards more generalizable agents.

References

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. CoRR, abs/1207.4708, 2012.

E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018. URL http://arxiv.org/abs/1805.09501.

T. Devries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552, 2017. URL http://arxiv.org/abs/1708.04552.

P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.

L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. CoRR, abs/1802.01561, 2018.

J. Farebrother, M. C. Machado, and M. Bowling. Generalization and regularization in DQN. CoRR, abs/1810.00123, 2018. URL http://arxiv.org/abs/1810.00123.

M. J. Hausknecht and P. Stone. The impact of determinism on learning Atari 2600 games. In Learning for General Competency in Video Games, Papers from the 2015 AAAI Workshop, Austin, Texas, USA, January 26, 2015, 2015. URL http://aaai.org/ocs/index.php/WS/AAAIW15/paper/view/9564.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

N. Justesen, R. R. Torrado, P. Bontrager, A. Khalifa, J. Togelius, and S. Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. CoRR, abs/1806.10729, 2018.

J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. In Proceedings of the American Mathematical Society, 7, pages 48–50, 1956.

P. Luo, X. Wang, W. Shao, and Z. Peng. Towards understanding regularization in batch normalization. CoRR, abs/1809.00846, 2018. URL http://arxiv.org/abs/1809.00846.

M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. J. Hausknecht, and M. Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. J. Artif. Intell. Res., 61:523–562, 2018. doi: 10.1613/jair.5699. URL https://doi.org/10.1613/jair.5699.

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

A. Nichol, V. Pfau, C. Hesse, O. Klimov, and J. Schulman. Gotta learn fast: A new benchmark for generalization in RL. CoRR, abs/1804.03720, 2018. URL http://arxiv.org/abs/1804.03720.

C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun, and D. Song. Assessing generalization in deep reinforcement learning. CoRR, abs/1810.12282, 2018.

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. CoRR, abs/1703.06907, 2017. URL http://arxiv.org/abs/1703.06907.

A. Zhang, N. Ballas, and J. Pineau. A dissection of overfitting and generalization in continuous reinforcement learning. CoRR, abs/1806.07937, 2018a.

C. Zhang, O. Vinyals, R. Munos, and S. Bengio. A study on overfitting in deep reinforcement learning. CoRR, abs/1804.06893, 2018b. URL http://arxiv.org/abs/1804.06893.

A. Level Generation and Environment Details

A.1. CoinRun

Each CoinRun level has a difficulty setting from 1 to 3. To generate a new level, we first uniformly sample over difficulties. Several choices throughout the level generation process are conditioned on this difficulty setting, including the number of sections in the level, the length and height of each section, and the frequency of obstacles. We find that conditioning on difficulty in this way creates a distribution of levels that forms a useful curriculum for the agent. For more efficient training, one could adjust the difficulty of sampled levels based on the agent's current skill, as done in (Justesen et al., 2018). However, we chose not to do so in our experiments, for the sake of simplicity.

At each timestep, the agent receives a 64 × 64 × 3 RGB observation, centered on the agent. Given the game mechanics, an agent must know its current velocity in order to act optimally. This requirement can be satisfied by using frame stacking or by using a recurrent model. Alternatively, we can include velocity information in each observation by painting two small squares in the upper left corner, corresponding to x and y velocity. In practice, agents can adequately learn under any of these conditions. Directly painting velocity information led to the fastest learning, and we report results on CoinRun using this method. We noticed similar qualitative results using frame stacking or recurrent models, though with unsurprisingly diminished generalization performance.
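A sketch of this velocity-painting idea follows; the square size, placement, and normalization here are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def paint_velocity(obs, vx, vy, max_speed=1.0, size=4):
    """Paint two small grayscale squares in the upper-left corner of a
    64x64x3 observation, encoding normalized x and y velocity."""
    obs = obs.copy()
    for i, v in enumerate((vx, vy)):
        value = np.uint8(255 * (0.5 + 0.5 * np.clip(v / max_speed, -1, 1)))
        obs[0:size, i * size:(i + 1) * size, :] = value
    return obs

frame = np.zeros((64, 64, 3), dtype=np.uint8)
frame = paint_velocity(frame, vx=0.3, vy=-0.7)
```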

A.2. CoinRun-Platforms

As with CoinRun, agents receive a 64 × 64 × 3 RGB observation at each timestep in CoinRun-Platforms. We don't paint any velocity information into observations in experiments with CoinRun-Platforms; this information can be encoded by the LSTM used in these experiments. Unlike CoinRun, levels in CoinRun-Platforms have no explicit difficulty setting, so all levels are drawn from the same distribution. In practice, of course, some levels will still be drastically easier than others.

It is worth emphasizing that CoinRun-Platforms is a much more difficult game than CoinRun. We trained for 2B timesteps in Figure 7, but even this was not enough time for training to fully converge. We found that with unrestricted training levels, training converged after approximately 6B timesteps at a mean score of 20 per level. Although this agent occasionally makes mistakes, it appears to be reasonably near optimal performance. In particular, it demonstrates a robust ability to explore the level.

A.3. RandomMazes

As with the previous environments, agents receive a 64 × 64 × 3 RGB observation at each timestep in RandomMazes. Given the visual simplicity of the environment, using such a large observation space is clearly not necessary. However, we chose to do so for the sake of consistency. We did conduct experiments with smaller observation spaces, but we noticed similar levels of overfitting.

B. Environment Screenshots

CoinRun Difficulty 1 Levels:

CoinRun Difficulty 2 Levels:

Note: Images scaled to fit.

CoinRun Difficulty 3 Levels:

Note: Images scaled to fit.

CoinRun-Platforms Levels:

Note: Images scaled to fit.

C. Data Augmentation Screenshots

Example observations augmented with our modified version of Cutout (Devries and Taylor, 2017):
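A rough sketch of this masking procedure; the number, sizes, and colors of the boxes are illustrative assumptions, not the exact settings used in the experiments.

```python
import numpy as np

def random_cutout(obs, num_boxes=(1, 4), box_frac=0.3, rng=None):
    """Mask several rectangles of random size with random colors."""
    rng = rng or np.random.default_rng()
    out = obs.copy()
    h, w, _ = out.shape
    for _ in range(rng.integers(num_boxes[0], num_boxes[1] + 1)):
        bh = rng.integers(1, max(2, int(h * box_frac)))
        bw = rng.integers(1, max(2, int(w * box_frac)))
        y = rng.integers(0, h - bh + 1)
        x = rng.integers(0, w - bw + 1)
        out[y:y + bh, x:x + bw] = rng.integers(0, 256, size=3)  # random color
    return out

augmented = random_cutout(np.zeros((64, 64, 3), dtype=np.uint8))
```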

D. Hyperparameters and Settings

We used the following hyperparameters and settings in our baseline experiments with the 3 environments. Notably, we forgo the LSTM in CoinRun, and we use more environments per worker in CoinRun-Platforms.

                            CoinRun      CoinRun-Platforms   RandomMazes
γ                           .999         .999                .999
λ                           .95          .95                 .95
# timesteps per rollout     256          256                 256
Epochs per rollout          3            3                   3
# minibatches per epoch     8            8                   8
Entropy bonus (kH)          .01          .01                 .01
Adam learning rate          5 × 10^-4    5 × 10^-4           5 × 10^-4
# environments per worker   32           96                  32
# workers                   8            8                   8
LSTM?                       No           Yes                 Yes
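Here γ is the discount factor and λ the generalized advantage estimation (GAE) parameter used by PPO. As a reminder of how they enter advantage estimation, a minimal sketch (not the paper's code):

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.999, lam=0.95):
    """Generalized advantage estimation over one rollout.
    `values` has one extra bootstrap entry for the state after the rollout."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    return adv

adv = gae_advantages(np.zeros(256), np.zeros(257), np.zeros(256))
```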

E. Performance

# Levels   Nature Train    Nature Test     IMPALA Train    IMPALA Test
100        99.45 ± 0.09    66.79 ± 1.09    99.39 ± 0.08    66.58 ± 1.91
500        97.85 ± 0.46    70.54 ± 0.62    99.16 ± 0.19    80.25 ± 1.07
1000       95.7 ± 0.65     72.51 ± 0.68    97.71 ± 1.04    84.84 ± 2.24
2000       92.65 ± 0.71    75.6 ± 0.28     97.82 ± 0.32    90.92 ± 0.45
4000       90.18 ± 1.04    78.35 ± 1.47    97.7 ± 0.19     95.87 ± 0.62
8000       88.94 ± 1.08    84.02 ± 0.96    98.13 ± 0.21    97.29 ± 1.04
12000      89.11 ± 0.58    86.41 ± 0.46    98.14 ± 0.56    97.51 ± 0.46
16000      89.24 ± 0.77    87.58 ± 0.79    98.04 ± 0.17    97.77 ± 0.68
∞          90.87 ± 0.53    90.04 ± 0.9     98.11 ± 0.26    98.29 ± 0.16

Table 1. CoinRun (across 5 seeds)

# Levels   Train           Test
100        14.04 ± 0.86    2.22 ± 0.2
400        12.16 ± 0.11    5.74 ± 0.22
1600       10.36 ± 0.35    8.91 ± 0.28
6400       11.71 ± 0.73    11.38 ± 0.55
25600      12.23 ± 0.49    12.21 ± 0.64
102400     13.84 ± 0.82    13.75 ± 0.91
409600     15.15 ± 1.01    15.18 ± 0.98
∞          15.96 ± 0.37    15.89 ± 0.47

Table 2. CoinRun-Platforms IMPALA (across 3 seeds)

# Levels   Train           Test
1000       92.09 ± 0.35    68.13 ± 1.3
2000       93.57 ± 2.06    74.08 ± 1.25
4000       92.94 ± 1.47    77.79 ± 1.87
8000       98.47 ± 0.4     83.33 ± 0.66
16000      99.19 ± 0.19    87.49 ± 1.45
32000      98.62 ± 0.37    93.07 ± 1.0
64000      98.64 ± 0.14    97.71 ± 0.24
128000     99.06 ± 0.21    99.1 ± 0.34
256000     98.97 ± 0.18    99.21 ± 0.11
∞          98.83 ± 0.71    99.36 ± 0.16

Table 3. RandomMazes IMPALA (across 3 seeds)