
Automatic Data Augmentation for Generalization in Reinforcement Learning


Roberta Raileanu 1   Max Goldstein 1   Denis Yarats 1 2   Ilya Kostrikov 1   Rob Fergus 1

1 New York University, New York City, NY, USA.  2 Facebook AI Research, New York City, NY, USA. Correspondence to: Roberta Raileanu.

arXiv:2006.12862v2 [cs.LG] 20 Feb 2021

Abstract

Deep reinforcement learning (RL) agents often fail to generalize beyond their training environments. To alleviate this problem, recent work has proposed the use of data augmentation. However, different tasks tend to benefit from different types of augmentations and selecting the right one typically requires expert knowledge. In this paper, we introduce three approaches for automatically finding an effective augmentation for any RL task. These are combined with two novel regularization terms for the policy and value function, required to make the use of data augmentation theoretically sound for actor-critic algorithms. Our method achieves a new state-of-the-art on the Procgen benchmark and outperforms popular RL algorithms on DeepMind Control tasks with distractors. In addition, our agent learns policies and representations which are more robust to changes in the environment that are irrelevant for solving the task, such as the background. Our implementation is available at https://github.com/rraileanu/auto-drac.

1. Introduction

Generalization to new environments remains a major challenge in deep reinforcement learning (RL). Current methods fail to generalize to unseen environments even when trained on similar settings (Farebrother et al., 2018; Packer et al., 2018; Zhang et al., 2018a; Cobbe et al., 2018; Gamrian & Goldberg, 2019; Cobbe et al., 2019; Song et al., 2020). This indicates that standard RL agents memorize specific trajectories rather than learning transferable skills. Several strategies have been proposed to alleviate this problem, such as the use of regularization (Farebrother et al., 2018; Zhang et al., 2018a; Cobbe et al., 2018; Igl et al., 2019), data augmentation (Cobbe et al., 2018; Lee et al., 2020; Ye et al., 2020; Kostrikov et al., 2020; Laskin et al., 2020), or representation learning (Zhang et al., 2020a;c). In this work, we focus on the use of data augmentation in RL. We identify key differences between supervised learning and reinforcement learning which need to be taken into account when using data augmentation in RL.

More specifically, we show that a naive application of data augmentation can lead to both theoretical and practical problems with standard RL algorithms, such as unprincipled objective estimates and poor performance. As a solution, we propose Data-regularized Actor-Critic, or DrAC, a new algorithm that enables the use of data augmentation with actor-critic algorithms in a theoretically sound way. Specifically, we introduce two regularization terms which constrain the agent's policy and value function to be invariant to various state transformations. Empirically, this approach allows the agent to learn useful behaviors (outperforming strong RL baselines) in settings in which a naive use of data augmentation completely fails or converges to a suboptimal policy. While we use Proximal Policy Optimization (PPO, Schulman et al. (2017)) to describe and validate our approach, the method can be easily integrated with any actor-critic algorithm with a discrete stochastic policy such as A3C (Mnih et al., 2013), SAC (Haarnoja et al., 2018), or IMPALA (Espeholt et al., 2018).

The current use of data augmentation in RL either relies on expert knowledge to pick an appropriate augmentation (Cobbe et al., 2018; Lee et al., 2020; Kostrikov et al., 2020) or separately evaluates a large number of transformations to find the best one (Ye et al., 2020; Laskin et al., 2020). In this paper, we propose three methods for automatically finding a useful augmentation for a given RL task. The first two learn to select the best augmentation from a fixed set, using either a variant of the upper confidence bound algorithm (UCB, Auer (2002)) or meta-learning (RL2, Wang et al. (2016)). We refer to these methods as UCB-DrAC and RL2-DrAC, respectively. The third method, Meta-DrAC, directly meta-learns the weights of a convolutional network, without access to predefined transformations (MAML, Finn et al. (2017)). Figure 1 gives an overview of UCB-DrAC.

We evaluate these approaches on the Procgen generalization benchmark (Cobbe et al., 2019), which consists of 16 procedurally generated environments with visual observations. Our results show that UCB-DrAC is the most effective among these at finding a good augmentation, and is comparable to or better than using DrAC with the best augmentation from a given set. UCB-DrAC also outperforms baselines specifically designed to improve generalization in RL (Igl et al., 2019; Lee et al., 2020; Laskin et al., 2020) on both train and test. In addition, we show that our agent learns policies and representations that are more invariant to changes in the environment which do not alter the reward or transition function (i.e. they are inconsequential for control), such as the background theme.

To summarize, our work makes the following contributions: (i) we introduce a principled way of using data augmentation with actor-critic algorithms, (ii) we propose a practical approach for automatically selecting an effective augmentation in RL settings, (iii) we show that the use of data augmentation leads to policies and representations that better capture task invariances, and (iv) we demonstrate state-of-the-art results on the Procgen benchmark and outperform popular RL methods on four DeepMind Control tasks with natural and synthetic distractors.

Figure 1. Overview of UCB-DrAC. A UCB bandit selects an image transformation (e.g. random-conv) and applies it to the observations. The augmented and original observations are passed to a regularized actor-critic agent (i.e. DrAC) which uses them to learn a policy and value function which are invariant to this transformation.

2. Background

We consider a distribution q(m) of Partially Observable Markov Decision Processes (POMDPs, Bellman (1957)) m ∈ M, with m defined by the tuple (S_m, O_m, A, T_m, R_m, γ), where S_m is the state space, O_m is the observation space, A is the action space, T_m(s'|s, a) is the transition function, R_m(s, a) is the reward function, and γ is the discount factor. During training, we restrict access to a fixed set of POMDPs, M_train = {m_1, . . . , m_n}, where m_i ∼ q, ∀ i = 1, n. The goal is to find a policy π_θ which maximizes the expected discounted reward over the entire distribution of POMDPs,

    J(\pi_\theta) = \mathbb{E}_{q, \pi, T_m, p_m} \left[ \sum_{t=0}^{T} \gamma^t R_m(s_t, a_t) \right].

In practice, we use the Procgen benchmark, which contains 16 procedurally generated games. Each game corresponds to a distribution of POMDPs q(m), and each level of a game corresponds to a POMDP sampled from that game's distribution, m ∼ q. The POMDP m is determined by the seed (i.e. integer) used to generate the corresponding level. Following the setup from Cobbe et al. (2019), agents are trained on a fixed set of n = 200 levels (generated using seeds from 1 to 200) and tested on the full distribution of levels (generated by sampling seeds uniformly at random from all computer integers).

Proximal Policy Optimization (PPO, Schulman et al. (2017)) is an actor-critic algorithm that learns a policy π_θ and a value function V_θ with the goal of finding an optimal policy for a given MDP. PPO alternates between sampling data through interaction with the environment and maximizing a clipped surrogate objective function J_PPO using stochastic gradient ascent. See Appendix A for a full description of PPO. One component of the PPO objective is the policy gradient term J_PG, which is estimated using importance sampling:

    J_{PG}(\theta) = \sum_{a \in \mathcal{A}} \pi_\theta(a|s) \hat{A}_{\theta_{old}}(s, a) = \mathbb{E}_{a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} \hat{A}_{\theta_{old}}(s, a) \right],    (1)

where Â(·) is an estimate of the advantage function, π_θold is the behavior policy used to collect trajectories (i.e. that generates the training distribution of states and actions), and π_θ is the policy we want to optimize (i.e. that generates the true distribution of states and actions).

3. Automatic Data Augmentation for RL

3.1. Data Augmentation in RL

Image augmentation has been successfully applied in computer vision for improving generalization on object classification tasks (Simard et al., 2003; Cireşan et al., 2011; Ciregan et al., 2012; Krizhevsky et al., 2012). As noted by Kostrikov et al. (2020), those tasks are invariant to certain image transformations such as rotations or flips, which is not always the case in RL. For example, if your observation is flipped, the corresponding reward will be reversed for the left and right actions and will not provide an accurate signal to the agent. While data augmentation has been previously used in RL settings without other algorithmic changes (Cobbe et al., 2018; Ye et al., 2020; Laskin et al., 2020), we argue that this approach is not theoretically sound.

If transformations are naively applied to observations in PPO's buffer, as done in Laskin et al. (2020), the PPO objective changes and equation (1) is replaced by

    J_{PG}(\theta) = \mathbb{E}_{a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_\theta(a|f(s))}{\pi_{\theta_{old}}(a|s)} \hat{A}_{\theta_{old}}(s, a) \right],    (2)

where f : S × H → S is the image transformation. However, the right hand side of the above equation is not a sound estimate of the left hand side because π_θ(a|f(s)) ≠ π_θ(a|s), since nothing constrains π_θ(a|f(s)) to be close to π_θ(a|s). Note that in the on-policy case when θ = θ_old, the ratio used to estimate the advantage should be equal to one, which is not necessarily the case when using equation (2). In fact, one can define certain transformations f(·) that result in an arbitrarily large ratio π_θ(a|f(s)) / π_θold(a|s).

Figure 3 shows examples where a naive use of data augmentation prevents PPO from learning a good policy in practice, suggesting that this is not just a theoretical concern. In the following section, we propose an algorithmic change that enables the use of data augmentation with actor-critic algorithms in a principled way.

3.2. Policy and Value Function Regularization

Inspired by the recent work of Kostrikov et al. (2020), we propose two novel regularization terms for the policy and value functions that enable the proper use of data augmentation for actor-critic algorithms. Our algorithmic contribution differs from that of Kostrikov et al. (2020) in that it constrains both the actor and the critic, as opposed to only regularizing the Q-function. This allows our method to be used with a different (and arguably larger) class of RL algorithms, namely those that learn a policy and a value function.

Following Kostrikov et al. (2020), we define an optimality-invariant state transformation f : S × H → S as a mapping that preserves both the agent's policy π and its value function V such that V(s) = V(f(s, ν)) and π(a|s) = π(a|f(s, ν)), ∀ s ∈ S, ν ∈ H, where ν are the parameters of f(·), drawn from the set of all possible parameters H.

To ensure that the policy and value functions are invariant to such transformations of the input state, we propose an additional loss term for regularizing the policy,

    G_\pi = KL\left[ \pi_\theta(a|s) \,\|\, \pi_\theta(a|f(s, \nu)) \right],    (3)

as well as an extra loss term for regularizing the value function,

    G_V = \left( V_\phi(s) - V_\phi(f(s, \nu)) \right)^2.    (4)

Thus, our data-regularized actor-critic method, or DrAC, maximizes the following objective:

    J_{DrAC} = J_{PPO} - \alpha_r (G_\pi + G_V),    (5)

where α_r is the weight of the regularization term (see Algorithm 1).

Algorithm 1 DrAC: Data-regularized Actor-Critic
Black: unmodified actor-critic algorithm. Cyan: image transformation. Red: policy regularization. Blue: value function regularization.
1: Hyperparameters: image transformation f, regularization loss coefficient α_r, minibatch size M, replay buffer size T, number of updates K.
2: for k = 1, . . . , K do
3:   Collect D = {(s_i, a_i, r_i, s_{i+1})}_{i=1}^{T} using π_θ.
4:   for j = 1, . . . , T/M do
5:     {(s_i, a_i, r_i, s_{i+1})}_{i=1}^{M} ∼ D
6:     for i = 1, . . . , M do
7:       ν_i ∼ H
8:       π̂_i ← π_θ(·|s_i)
9:       V̂_i ← V_φ(s_i)
10:    end for
11:    G_π(θ) = (1/M) Σ_{i=1}^{M} KL[π̂_i | π_θ(·|f(s_i, ν_i))]
12:    G_V(φ) = (1/M) Σ_{i=1}^{M} (V̂_i − V_φ(f(s_i, ν_i)))^2
13:    J_DrAC(θ, φ) = J_PPO − α_r (G_π(θ) + G_V(φ))
14:    θ ← argmax_θ J_DrAC
15:    φ ← argmax_φ J_DrAC
16:  end for
17: end for

The use of G_π and G_V ensures that the agent's policy and value function are invariant to the transformations induced by various augmentations. Particular transformations can be used to impose certain inductive biases relevant for the task (e.g. invariance with respect to colors or translations). In addition, G_π and G_V can be added to the objective of any actor-critic algorithm with a discrete stochastic policy (e.g. A3C, TRPO, ACER, SAC, or IMPALA) without any other changes.

Note that when using DrAC, as opposed to the method proposed by Laskin et al. (2020), we still use the correct importance sampling estimate of the left hand side objective in equation (1) (instead of a wrong estimate as in equation (2)). This is because the transformed observations f(s) are only used to compute the regularization losses G_π and G_V, and thus are not used for the main PPO objective. Without these extra terms, the only way to use data augmentation is as explained in Section 3.1, which leads to inaccurate estimates of the PPO objective. Hence, DrAC benefits from the regularizing effect of using data augmentation, while mitigating adverse consequences on the RL objective. See Appendix B for a more detailed explanation of why a naive application of data augmentation with certain policy gradient algorithms is theoretically unsound.
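To make equations (3)-(5) concrete, the following is a minimal PyTorch-style sketch of how the two DrAC regularization terms could be computed during a PPO update. The function and argument names (policy, value_net, augment, alpha_r) are our own illustrative assumptions rather than the released implementation; as noted in Appendix E, gradients are only propagated through the augmented predictions, so the targets computed on the original observations are detached.

import torch
import torch.nn.functional as F
from torch.distributions import Categorical, kl_divergence

def drac_regularization(policy, value_net, augment, obs, alpha_r=0.1):
    """Compute alpha_r * (G_pi + G_V) from eqs. (3)-(5).

    policy(obs) is assumed to return action logits and value_net(obs) a value
    estimate; `augment` is an image transformation f(s, nu) applied to a batch.
    The returned term is added to the PPO loss (i.e. subtracted from J_PPO).
    """
    aug_obs = augment(obs)

    # Targets from the original observations are detached, so the regularizers
    # only pull the augmented predictions toward them.
    with torch.no_grad():
        pi_orig = Categorical(logits=policy(obs))
        v_orig = value_net(obs)

    pi_aug = Categorical(logits=policy(aug_obs))
    v_aug = value_net(aug_obs)

    g_pi = kl_divergence(pi_orig, pi_aug).mean()   # eq. (3)
    g_v = F.mse_loss(v_aug, v_orig)                # eq. (4)
    return alpha_r * (g_pi + g_v)                  # regularization term in eq. (5)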
3.3. Automatic Data Augmentation

Since different tasks benefit from different types of transformations, we would like to design a method that can automatically find an effective transformation for any given task. Such a technique would significantly reduce the computational requirements for applying data augmentation in RL. In this section, we describe three approaches for doing this. In all of them, the augmentation learner is trained at the same time as the agent learns to solve the task using DrAC. Hence, the distribution of rewards varies significantly as the agent improves, making the problem highly nonstationary.

Upper Confidence Bound. The problem of selecting a data augmentation from a given set can be formulated as a multi-armed bandit problem, where the action space is the set of available transformations F = {f^1, . . . , f^n}. A popular algorithm for such settings is the upper confidence bound, or UCB (Auer, 2002), which selects actions according to the following policy:

    f_t = \mathrm{argmax}_{f \in F} \left[ Q_t(f) + c \sqrt{\frac{\log(t)}{N_t(f)}} \right],    (6)

where f_t is the transformation selected at time step t, N_t(f) is the number of times transformation f has been selected before time step t, and c is UCB's exploration coefficient. Before the t-th DrAC update, we use equation (6) to select an augmentation f. Then, we use equation (5) to update the agent's policy and value function. We also update the counter: N_t(f) = N_{t-1}(f) + 1. Next, we collect rollouts with the new policy and update the Q-function: Q_t(f) = (1/K) Σ_{i=t-K}^{t} R(f_i = f), which is computed as a sliding window average of the past K mean returns obtained by the agent after being updated using augmentation f. We refer to this algorithm as UCB-DrAC (Algorithm 2). Note that UCB-DrAC's estimation of Q(f) differs from that of a typical UCB algorithm, which uses rewards from the entire history. However, the choice of estimating Q(f) using only more recent rewards is crucial due to the nonstationarity of the problem.
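A minimal sketch of the UCB selection rule in equation (6) with the sliding-window estimate of Q(f) described above; the class and variable names are our own, not the released implementation.

import math
from collections import deque

class UCBAugmentationSelector:
    """Select an augmentation via UCB (eq. 6) with sliding-window Q estimates."""

    def __init__(self, augmentations, c=0.1, window=10):
        self.augs = list(augmentations)
        self.c = c
        self.n = {f: 1 for f in self.augs}                 # selection counts N(f)
        self.returns = {f: deque(maxlen=window) for f in self.augs}

    def q(self, f):
        r = self.returns[f]
        return sum(r) / len(r) if r else 0.0

    def select(self, t):
        # argmax_f  Q(f) + c * sqrt(log(t) / N(f));  t is the DrAC update index, starting at 1.
        return max(self.augs, key=lambda f: self.q(f) + self.c * math.sqrt(math.log(t) / self.n[f]))

    def update(self, f, mean_return):
        # Record the mean return obtained after updating the agent with augmentation f.
        self.returns[f].append(mean_return)
        self.n[f] += 1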
Meta-Learning the Selection of an Augmentation. Alternatively, the problem of selecting a data augmentation from a given set can be formulated as a meta-learning problem. Here, we consider a meta-learner like the one proposed by Wang et al. (2016). Before each DrAC update, the meta-learner selects an augmentation, which is then used to update the agent using equation (5). We then collect rollouts using the new policy and update the meta-learner using the mean return of these trajectories. We refer to this approach as RL2-DrAC (Algorithm 3).

Meta-Learning the Weights of an Augmentation. Another approach for automatically finding an appropriate augmentation is to directly learn the weights of a certain transformation rather than selecting an augmentation from a given set. In this work, we focus on meta-learning the weights of a convolutional network which can be applied to the observations to obtain a perturbed image. We meta-learn the weights of this network using an approach similar to the one proposed by Finn et al. (2017). For each agent update, we also perform a meta-update of the transformation function by splitting DrAC's buffer into meta-train and meta-test sets. We refer to this approach as Meta-DrAC (Algorithm 4). More details about the implementation of these methods can be found in Appendix D.

4. Experiments

In this section, we evaluate our methods on four DeepMind Control environments with natural and synthetic distractors and the full Procgen benchmark (Cobbe et al., 2019), which consists of 16 procedurally generated games (see Figure 6 in Appendix G). Procgen has a number of attributes that make it a good testbed for generalization in RL: (i) it has a diverse set of games in a similar spirit with the ALE benchmark (Bellemare et al., 2013), (ii) each of these games has procedurally generated levels which present agents with meaningful generalization challenges, (iii) agents have to learn motor control directly from images, and (iv) it has a clear protocol for testing generalization.

All environments use a discrete 15 dimensional action space and produce 64 × 64 × 3 RGB observations. We use Procgen's easy setup, so for each game, agents are trained on 200 levels and tested on the full distribution of levels. We use PPO as a base for all our methods. More details about our experimental setup and hyperparameters can be found in Appendix E.

Data Augmentation. In our experiments, we use a set of eight transformations: crop, grayscale, cutout, cutout-color, flip, rotate, random convolution (random-conv), and color-jitter (Krizhevsky et al., 2012; DeVries & Taylor, 2017). We use RAD's (Laskin et al., 2020) implementation of these transformations, except for crop, in which we pad the image with 12 (boundary) pixels on each side and select random crops of 64 × 64. We found this implementation of crop to be significantly better on Procgen, and thus it can be considered an empirical upper bound of RAD in this case. For simplicity, we will refer to our implementation as RAD. DrAC uses the same set of transformations as RAD, but is trained with additional regularization losses for the actor and the critic, as described in Section 3.2.
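For concreteness, the pad-and-crop scheme described above (pad each side by 12 pixels, then take a random 64 × 64 crop per image) could be implemented roughly as follows; this is a sketch under our own naming, not RAD's or DrAC's actual code.

import torch
import torch.nn.functional as F

def pad_and_random_crop(obs, pad=12, out_size=64):
    """obs: (B, C, H, W) image batch. Pad each side, then take a random crop per image."""
    b, c, h, w = obs.shape
    padded = F.pad(obs, (pad, pad, pad, pad), mode='replicate')
    crops = torch.empty(b, c, out_size, out_size, device=obs.device, dtype=obs.dtype)
    for i in range(b):
        top = torch.randint(0, h + 2 * pad - out_size + 1, (1,)).item()
        left = torch.randint(0, w + 2 * pad - out_size + 1, (1,)).item()
        crops[i] = padded[i, :, top:top + out_size, left:left + out_size]
    return crops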

Table 1. Train and test performance for the Procgen benchmark (aggregated over all 16 tasks, 10 seeds). (a) compares PPO with four baselines specifically designed to improve generalization in RL and shows that they do not significantly help. (b) compares using the best augmentation from our set with and without regularization, corresponding to DrAC and RAD respectively, and shows that regularization improves performance on both train and test. (c) compares different approaches for automatically finding an augmentation for each task, namely using UCB or RL2 for selecting the best transformation from a given set, or meta-learning the weights of a convolutional network (Meta-DrAC). (d) shows additional ablations: DrA regularizes only the actor, DrC regularizes only the critic, Rand-DrAC selects an augmentation using a uniform distribution, Crop-DrAC uses image crops for all tasks, and UCB-RAD is an ablation that does not use the regularization losses. UCB-DrAC performs best on both train and test, and achieves a return comparable with or better than DrAC (which uses the best augmentation).

                                      PPO-Normalized Return (%)
                                    Train                       Test
      Method               Median    Mean    Std       Median    Mean    Std
      PPO                   100.0    100.0   7.2        100.0    100.0   8.5
(a)   Rand-FM                93.4     87.6   8.9         91.6     78.0   9.0
      IBAC-SNI               91.9    103.4   8.5         86.2    102.9   8.6
      Mixreg                 95.8    104.2   3.1        105.9    114.6   3.3
      PLR                   101.5    106.7   5.6        107.1    128.3   5.8
(b)   DrAC (Best) (Ours)    114.0    119.6   9.4        118.5    138.1  10.5
      RAD (Best)            103.7    109.1   9.6        114.2    131.3   9.4
(c)   UCB-DrAC (Ours)       102.3    118.9   8.8        118.5    139.7   8.4
      RL2-DrAC               96.3     95.0   8.8         99.1    105.3   7.1
      Meta-DrAC             101.3    100.1   8.5        101.7    101.2   7.3
(d)   DrA (Best)            102.6    117.7  11.1        110.8    126.6   9.0
      DrC (Best)            103.3    108.2  10.8        110.6    115.4   8.5
      Rand-DrAC             100.4     99.5   8.4        102.4    103.4   7.0
      Crop-DrAC              97.4    112.8   9.8        114.0    132.7  11.0
      UCB-RAD               100.4    104.8   8.4        103.0    125.9   9.5

Automatic Selection of Data Augmentation. We compare three different approaches for automatically finding an effective transformation: UCB-DrAC, which uses UCB (Auer, 2002) to select an augmentation from a given set, RL2-DrAC, which uses RL2 (Wang et al., 2016) to do the same, and Meta-DrAC, which uses MAML (Finn et al., 2017) to meta-learn the weights of a convolutional network. Meta-DrAC is implemented using the higher library (Grefenstette et al., 2019). Note that we do not expect these approaches to be better than DrAC with the best augmentation. In fact, DrAC with the best augmentation can be considered an upper bound for these automatic approaches, since it uses the best augmentation during the entire training process.

Ablations. DrC and DrA are ablations to DrAC that use only the value or only the policy regularization terms, respectively. DrC can be thought of as an analogue of DrQ (Kostrikov et al., 2020) applied to PPO rather than SAC. Rand-DrAC uses a uniform distribution to select an augmentation each time. Crop-DrAC uses crop for all games (which is the most effective augmentation on half of the Procgen games). UCB-RAD combines UCB with RAD (i.e. it does not use the regularization terms).

Baselines. We also compare with Rand-FM (Lee et al., 2020), IBAC-SNI (Igl et al., 2019), Mixreg (Wang et al., 2020), and PLR (Jiang et al., 2020), four methods specifically designed for improving generalization in RL and previously tested on Procgen environments. Rand-FM uses a random convolutional network to regularize the learned representations, while IBAC-SNI uses an information bottleneck with selective noise injection. Mixreg uses mixtures of observations to impose linearity constraints between the agent's inputs and the outputs, while PLR samples levels according to their learning potential.

Evaluation Metrics. At the end of training, for each method and each game, we compute the average score over 100 episodes and 10 different seeds. The scores are then normalized using the corresponding PPO score on the same game. We aggregate the normalized scores over all 16 Procgen games and report the resulting mean, median, and standard deviation (Table 1). For a per-game breakdown, see Tables 6 and 7 in Appendix I.

4.1. Generalization Performance on Procgen

Table 1 shows train and test performance on Procgen. UCB-DrAC significantly outperforms PPO, Rand-FM, IBAC-SNI, PLR, and Mixreg. As shown in Jiang et al. (2020), combining PLR with UCB-DrAC achieves a new state-of-the-art on Procgen, leading to a 76% gain over PPO. Regularizing the policy and value function leads to improvements over merely using data augmentation, and thus the performance of DrAC is better than that of RAD (both using the best augmentation for each game).

[Figure 2 panels: Cartpole Balance, Finger Spin, Walker Walk, and Cheetah Run with natural backgrounds; each panel plots average return against environment steps (10^6) for PPO, UCB-DrAC, and RAD (best).]

Figure 2. Average return on DMC tasks with natural video backgrounds with mean and standard deviation computed over 5 seeds. UCB-DrAC outperforms PPO and RAD with the best augmentations.

[Figure 3 panels: Chaser, Miner, and StarPilot; each panel plots test score against training steps (1e6) for PPO, RAD (random-conv), RAD (grayscale), DrAC (random-conv), and DrAC (grayscale).]

Figure 3. Comparison between RAD and DrAC with the same augmentations, grayscale and random convolution, on the test environments of Chaser (left), Miner (center), and StarPilot (right). While DrAC's performance is comparable or better than PPO's, not using the regularization terms, i.e. using RAD, significantly hurts performance relative to PPO. This is because, in contrast to DrAC, RAD does not use a principled (importance sampling) estimate of PPO's objective.

In addition, we demonstrate the importance of regularizing both the policy and the value function, rather than either one of them, by showing that DrAC is superior to both DrA and DrC (see Figures 9 and 10 in Appendix J). Our experiments show that the most effective way of automatically finding an augmentation is UCB-DrAC. As expected, meta-learning the weights of a CNN using Meta-DrAC performs reasonably well on the games in which the random convolution augmentation helps. But overall, Meta-DrAC and RL2-DrAC are worse than UCB-DrAC. In addition, UCB is generally more stable, easier to implement, and requires less fine-tuning compared to meta-learning algorithms. See Figures 8 and 7 in Appendix J for a comparison of these three approaches on each game. Moreover, automatically selecting the augmentation from a given set using UCB-DrAC performs similarly well or even better than a method that uses the best augmentation for each task throughout the entire training process. UCB-DrAC also achieves higher returns than Rand-DrAC, an ablation that uses a uniform distribution to select an augmentation each time, and it is better than Crop-DrAC, which uses crop for all the games (crop being the best augmentation for eight of the Procgen games, as shown in Tables 4 and 5 from Appendix H).

4.2. DeepMind Control with Distractors

In this section, we evaluate our approach on the DeepMind Control Suite from pixels (DMC, Tassa et al. (2018)). We use four tasks, namely Cartpole Balance, Finger Spin, Walker Walk, and Cheetah Run, in three settings with different types of backgrounds, namely the default, simple distractors, and natural videos from the Kinetics dataset (Kay et al., 2017), as introduced in Zhang et al. (2020b). See Figure 11 for a few examples. Note that in the simple and natural settings, the background is sampled from a list of videos at the beginning of each episode, which creates spurious correlations between the backgrounds and the rewards. In the most challenging setting with natural distractors, UCB-DrAC outperforms PPO and RAD on all these environments, as shown in Figure 13. See Appendix K for results on DMC with default and simple distractor backgrounds, where our method also outperforms the baselines.

[Figure 4 panels: for Ninja and Dodgeball, the cumulative number of times UCB selects each augmentation (crop, random-conv, grayscale, flip, rotate, cutout, cutout-color, color-jitter) over training steps (1e6), and the corresponding train/test scores over training for PPO, DrAC (best augmentation), and UCB-DrAC.]

Figure 4. Cumulative number of times UCB selects each augmentation over the course of training for Ninja (a) and Dodgeball (c). Train and test performance for PPO, DrAC with the best augmentation for each game (color-jitter and crop, respectively), and UCB-DrAC for Ninja (b) and Dodgeball (d). UCB-DrAC finds the most effective augmentation from the given set and reaches the performance of DrAC. Our methods improve both train and test performance.

Table 2. JSD and Cycle-Consistency (%) (aggregated across all Procgen tasks) for PPO, RAD and UCB-DrAC, measured between observations that vary only in their background themes (i.e. colors and patterns that do not interact with the agent). UCB-DrAC learns more robust policies and representations that are more invariant to changes in the observation that are irrelevant for the task.

                        JSD                  Cycle-Consistency (%)
                                            2-way              3-way
Method            Mean    Median       Mean    Median      Mean    Median
PPO               0.25    0.23         20.50   18.70       12.70    5.60
RAD               0.19    0.18         24.40   22.20       15.90    8.50
UCB-DrAC          0.16    0.15         27.70   24.80       17.30   10.30

4.3. Regularization Effect

In Section 3.1, we argued that additional regularization terms are needed in order to make the use of data augmentation in RL theoretically sound. However, one might wonder whether this problem actually appears in practice. Thus, we empirically investigate the effect of regularizing the policy and value function. For this purpose, we compare the performance of RAD and DrAC with grayscale and random convolution augmentations on Chaser, Miner, and StarPilot. Figure 3 shows that not regularizing the policy and value function with respect to the transformations used can lead to drastically worse performance than vanilla RL methods, further emphasizing the importance of these loss terms. In contrast, using the regularization terms as part of the RL objective (as DrAC does) results in an agent that is comparable or, in some cases, significantly better than PPO.

4.4. Automatic Augmentation

Our experiments indicate there is not a single augmentation that works best across all Procgen games (see Tables 4 and 5 in Appendix H). Moreover, our intuitions regarding the best transformation for each game might be misleading. For example, at first sight, Ninja appears to be somewhat similar to Jumper, but the augmentation that performs best on Ninja is color-jitter, while for Jumper it is random-conv (see Tables 4 and 5). In contrast, Miner seems like a different type of game than Climber or Ninja, but they all have the same best performing augmentation, namely color-jitter. These observations further underline the need for a method that can automatically find the right augmentation for each task.

Table 1, along with Figures 8 and 7 in the Appendix, compares different approaches for automatically finding an augmentation, showing that UCB-DrAC performs best and reaches the asymptotic performance obtained when the most effective transformation for each game is used throughout the entire training process. Figure 4 illustrates an example of UCB's policy during training on Ninja and Dodgeball, showing that it converges to always selecting the most effective augmentation, namely color-jitter for Ninja and crop for Dodgeball. Figure 5 in Appendix F illustrates how UCB's behavior varies with its exploration coefficient.

4.5. Robustness Analysis

To further investigate the generalizing ability of these agents, we analyze whether the learned policies and state representations are invariant to changes in the observations which are irrelevant for solving the task.

We first measure the Jensen-Shannon divergence (JSD) between the agent's policy for an observation from a training level and a modified version of that observation with a different background theme (i.e. color and pattern). Note that the JSD also represents a lower bound for the joint empirical risk across train and test (Ilse et al., 2020). The background theme is randomly selected from the set of backgrounds available for all other Procgen environments, except for the one of the original training level. Note that the modified observation has the same semantics as the original one (with respect to the reward function), so the agent should have the same policy in both cases. Moreover, many of the backgrounds are not uniform and can contain items such as trees or planets which can easily be mistaken for objects the agent can interact with. As seen in Table 2, UCB-DrAC has a lower JSD than PPO, indicating that it learns a policy that is more robust to changes in the background.

To quantitatively evaluate the quality of the learned representation, we use the cycle-consistency metric proposed by Aytar et al. (2018) and also used by Lee et al. (2020). See Appendix C for more details about this metric. Table 2 reports the percentage of input observations in the seen environment that are cycle-consistent with trajectories in modified unseen environments, which have a different background but the same layout. UCB-DrAC has higher cycle-consistency than PPO, suggesting that it learns representations that better capture relevant task invariances.
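As an illustration, the JSD between the agent's policy on an original observation and on its background-modified counterpart could be computed as follows (a sketch with assumed names; policy(obs) is taken to return action logits).

import torch
from torch.distributions import Categorical, kl_divergence

def policy_jsd(policy, obs, obs_modified):
    """Jensen-Shannon divergence between pi(.|s) and pi(.|s'), where s' differs
    from s only in its background theme. Lower values indicate a policy that is
    more invariant to task-irrelevant changes."""
    with torch.no_grad():
        p = Categorical(logits=policy(obs))
        q = Categorical(logits=policy(obs_modified))
        m = Categorical(probs=0.5 * (p.probs + q.probs))   # mixture distribution
        jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return jsd.mean()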
5. Related Work

Generalization in Deep RL. A recent body of work has pointed out the problem of overfitting in deep RL (Rajeswaran et al., 2017; Machado et al., 2018; Justesen et al., 2018; Packer et al., 2018; Zhang et al., 2018a;b; Nichol et al., 2018; Cobbe et al., 2018; 2019; Juliani et al., 2019; Raileanu & Rocktäschel, 2020; Kuttler et al., 2020; Grigsby & Qi, 2020). A promising approach to prevent overfitting is to apply regularization techniques originally developed for supervised learning such as dropout (Srivastava et al., 2014; Igl et al., 2019) or batch normalization (Ioffe & Szegedy, 2015; Farebrother et al., 2018; Igl et al., 2019). For example, Igl et al. (2019) use selective noise injection with a variational information bottleneck, while Lee et al. (2020) regularize the agent's representation with respect to random convolutional transformations. The use of state abstractions has also been proposed for improving generalization in RL (Zhang et al., 2020a;c; Agarwal et al., 2021). Similarly, Roy & Konidaris (2020) and Sonar et al. (2020) learn domain-invariant policies via feature alignment, while Stooke et al. (2020) decouple representation learning from policy learning. Igl et al. (2020) reduce non-stationarity using policy distillation, while Mazoure et al. (2020) maximize the mutual information between the agent's representations of successive time steps. Jiang et al. (2020) improve efficiency and generalization by sampling levels according to their learning potential, while Wang et al. (2020) use mixtures of observations to impose linearity constraints between the agent's inputs and the outputs. More similar to our work, Cobbe et al. (2018), Ye et al. (2020) and Laskin et al. (2020) add augmented observations to the training buffer of an RL agent. However, as we show here, naively applying data augmentation in RL can lead to both theoretical and practical issues. Our algorithmic contributions alleviate these problems while still benefitting from the regularization effect of data augmentation.

Data Augmentation has been extensively used in computer vision for both supervised (LeCun et al., 1989; Becker & Hinton, 1992; LeCun et al., 1998; Simard et al., 2003; Cireşan et al., 2011; Ciresan et al., 2011; Krizhevsky et al., 2012) and self-supervised (Dosovitskiy et al., 2016; Misra & van der Maaten, 2019) learning. More recent work uses data augmentation for contrastive learning, leading to state-of-the-art results on downstream tasks (Ye et al., 2019; Hénaff et al., 2019; He et al., 2019; Chen et al., 2020). Domain randomization can also be considered a type of data augmentation, which has proven useful for transferring RL policies from simulation to the real world (Tobin et al., 2017). However, it requires access to a physics simulator, which is not always available. Recently, a few papers propose the use of data augmentation in RL (Cobbe et al., 2018; Lee et al., 2020; Srinivas et al., 2020; Kostrikov et al., 2020; Laskin et al., 2020), but all of them use a fixed (set of) augmentation(s) rather than automatically finding the most effective one. The most similar work to ours is that of Kostrikov et al. (2020), who propose to regularize the Q-function in Soft Actor-Critic (SAC) (Haarnoja et al., 2018) using random shifts of the input image. Our work differs from theirs in that it automatically selects an augmentation from a given set, regularizes both the actor and the critic, and focuses on the problem of generalization rather than sample efficiency. While there is a body of work on the automatic use of data augmentation (Cubuk et al., 2019a;b; Fang et al., 2019; Shi et al., 2019; Li et al., 2020), these approaches were designed for supervised learning and, as we explain here, cannot be applied to RL without further algorithmic changes.
6. Discussion

In this work, we propose UCB-DrAC, a method for automatically finding an effective data augmentation for RL tasks. Our approach enables the principled use of data augmentation with actor-critic algorithms by regularizing the policy and value functions with respect to state transformations. We show that UCB-DrAC avoids the theoretical and empirical pitfalls typical in naive applications of data augmentation in RL. Our approach improves training performance by 19% and test performance by 40%, setting a new state-of-the-art on the Procgen benchmark. In addition, the learned policies and representations are more invariant to spurious correlations between observations and rewards. A promising avenue for future research is to use a more expressive function class for meta-learning the augmentations.

References

Agarwal, R., Machado, M. C., Castro, P. S., and Bellemare, M. G. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. 2021.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

Aytar, Y., Pfaff, T., Budden, D., Paine, T., Wang, Z., and de Freitas, N. Playing hard exploration games by watching youtube. In Advances in Neural Information Processing Systems, pp. 2930–2941, 2018.

Becker, S. and Hinton, G. E. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161–163, 1992.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellman, R. A markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684, 1957.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. E. A simple framework for contrastive learning of visual representations. ArXiv, abs/2002.05709, 2020.

Ciregan, D., Meier, U., and Schmidhuber, J. Multi-column deep neural networks for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3642–3649. IEEE, 2012.

Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. High-performance neural networks for visual object classification. ArXiv, abs/1102.0183, 2011.

Cireşan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.

Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.

Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. Leveraging procedural generation to benchmark reinforcement learning. arXiv preprint arXiv:1912.01588, 2019.

Cubuk, E. D., Zoph, B., Mané, D., Vasudevan, V., and Le, Q. V. Autoaugment: Learning augmentation strategies from data. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 113–123, 2019a.

Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. Randaugment: Practical automated data augmentation with a reduced search space. arXiv preprint, 2019b.

DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

Dosovitskiy, A., Fischer, P., Springenberg, J. T., Riedmiller, M. A., and Brox, T. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38:1734–1747, 2016.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Fang, B., Jiang, M., and Shen, J. J. Paganda: An adaptive task-independent automatic data augmentation. 2019.

Farebrother, J., Machado, M. C., and Bowling, M. H. Generalization and regularization in dqn. ArXiv, abs/1810.00123, 2018.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126–1135. JMLR.org, 2017.

Gamrian, S. and Goldberg, Y. Transfer learning for related reinforcement learning tasks via image-to-image translation. ArXiv, abs/1806.07377, 2019.

Grefenstette, E., Amos, B., Yarats, D., Htut, P. M., Molchanov, A., Meier, F., Kiela, D., Cho, K., and Chintala, S. Generalized inner loop meta-learning. arXiv preprint arXiv:1910.01727, 2019.

Grigsby, J. and Qi, Y. Measuring visual generalization in continuous control from pixels. ArXiv, abs/2010.06740, 2020.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. B. Momentum contrast for unsupervised visual representation learning. ArXiv, abs/1911.05722, 2019.

Hénaff, O. J., Srinivas, A., Fauw, J. D., Razavi, A., Doersch, C., Eslami, S. M. A., and van den Oord, A. Data-efficient image recognition with contrastive predictive coding. ArXiv, abs/1905.09272, 2019.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9:1735–1780, 1997.

Igl, M., Ciosek, K., Li, Y., Tschiatschek, S., Zhang, C., Devlin, S., and Hofmann, K. Generalization in reinforcement learning with selective noise injection and information bottleneck. In Advances in Neural Information Processing Systems, pp. 13956–13968, 2019.

Igl, M., Farquhar, G., Luketina, J., Böhmer, W., and Whiteson, S. The impact of non-stationarity on generalisation in deep reinforcement learning. ArXiv, abs/2006.05826, 2020.

Ilse, M., Tomczak, J. M., and Forré, P. Designing data augmentation for simulating interventions. arXiv preprint arXiv:2005.01856, 2020.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ArXiv, abs/1502.03167, 2015.

Jiang, M., Grefenstette, E., and Rocktäschel, T. Prioritized level replay. ArXiv, abs/2010.03934, 2020.

Juliani, A., Khalifa, A., Berges, V.-P., Harper, J., Henry, H., Crespi, A., Togelius, J., and Lange, D. Obstacle tower: A generalization challenge in vision, control, and planning. ArXiv, abs/1902.01378, 2019.

Justesen, N., Torrado, R., Bontrager, P., Khalifa, A., Togelius, J., and Risi, S. Illuminating generalization in deep reinforcement learning through procedural level generation. ArXiv, 2018.

Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, A., Suleyman, M., and Zisserman, A. The kinetics human action video dataset. ArXiv, abs/1705.06950, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.

Kostrikov, I. Pytorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail, 2018.

Kostrikov, I., Yarats, D., and Fergus, R. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649, 2020.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Kuttler, H., Nardelli, N., Miller, A. H., Raileanu, R., Selvatici, M., Grefenstette, E., and Rocktäschel, T. The nethack learning environment. ArXiv, abs/2006.13760, 2020.

Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., and Srinivas, A. Reinforcement learning with augmented data. arXiv preprint arXiv:2004.14990, 2020.

LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. 1998.

Lee, K., Lee, K., Shin, J., and Lee, H. Network randomization: A simple technique for generalization in deep reinforcement learning. In International Conference on Learning Representations, https://openreview.net/forum, 2020.

Li, Y., Hu, G., Wang, Y., Hospedales, T. M., Robertson, N. M., and Yang, Y. Dada: Differentiable automatic data augmentation. ArXiv, abs/2003.03780, 2020.

Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. J., and Bowling, M. H. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. In IJCAI, 2018.

Mazoure, B., des Combes, R. T., Doan, T., Bachman, P., and Hjelm, R. D. Deep reinforcement and infomax learning. ArXiv, abs/2006.07217, 2020.

Misra, I. and van der Maaten, L. Self-supervised learning of pretext-invariant representations. ArXiv, abs/1912.01991, 2019.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A. Playing atari with deep reinforcement learning. ArXiv, abs/1312.5602, 2013.

Nichol, A., Pfau, V., Hesse, C., Klimov, O., and Schulman, J. Gotta learn fast: A new benchmark for generalization in rl. ArXiv, abs/1804.03720, 2018.

Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., and Song, D. X. Assessing generalization in deep reinforcement learning. ArXiv, abs/1810.12282, 2018.

Raileanu, R. and Rocktäschel, T. Ride: Rewarding impact-driven exploration for procedurally-generated environments. ArXiv, abs/2002.12292, 2020.

Rajeswaran, A., Lowrey, K., Todorov, E., and Kakade, S. M. Towards generalization and simplicity in continuous control. ArXiv, abs/1703.02660, 2017.

Roy, J. and Konidaris, G. Visual transfer for reinforcement learning via wasserstein domain confusion. arXiv preprint arXiv:2006.03465, 2020.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In ICML, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shi, Y., Qin, T., Liu, Y., Lu, J., Gao, Y., and Shen, D. Automatic data augmentation by learning the deterministic policy. ArXiv, abs/1910.08343, 2019.

Simard, P. Y., Steinkraus, D., Platt, J. C., et al. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, 2003.

Sonar, A., Pacelli, V., and Majumdar, A. Invariant policy optimization: Towards stronger generalization in reinforcement learning. ArXiv, abs/2006.01096, 2020.

Song, X., Jiang, Y., Tu, S., Du, Y., and Neyshabur, B. Observational overfitting in reinforcement learning. ArXiv, abs/1912.02975, 2020.

Srinivas, A., Laskin, M., and Abbeel, P. Curl: Contrastive unsupervised representations for reinforcement learning. ArXiv, abs/2004.04136, 2020.

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Stooke, A., Lee, K., Abbeel, P., and Laskin, M. Decoupling representation learning from reinforcement learning. ArXiv, abs/2009.08319, 2020.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. A. Deepmind control suite. ArXiv, abs/1801.00690, 2018.

Tobin, J., Fong, R. H., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30, 2017.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

Wang, K., Kang, B., Shao, J., and Feng, J. Improving generalization in reinforcement learning with mixture regularization. ArXiv, abs/2010.10814, 2020.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 2004.

Ye, C., Khalifa, A., Bontrager, P., and Togelius, J. Rotation, translation, and cropping for zero-shot generalization. arXiv preprint arXiv:2001.09908, 2020.

Ye, M., Zhang, X., Yuen, P. C., and Chang, S.-F. Unsupervised embedding learning via invariant and spreading instance feature. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6203–6212, 2019.

Zhang, A., Ballas, N., and Pineau, J. A dissection of overfitting and generalization in continuous reinforcement learning. ArXiv, abs/1806.07937, 2018a.

Zhang, A., Lyle, C., Sodhani, S., Filos, A., Kwiatkowska, M., Pineau, J., Gal, Y., and Precup, D. Invariant causal prediction for block mdps. arXiv preprint arXiv:2003.06016, 2020a.

Zhang, A., McAllister, R., Calandra, R., Gal, Y., and Levine, S. Learning invariant representations for reinforcement learning without reconstruction. 2020b.

Zhang, A., McAllister, R., Calandra, R., Gal, Y., and Levine, S. Learning invariant representations for reinforcement learning without reconstruction. arXiv preprint arXiv:2006.10742, 2020c.

Zhang, C., Vinyals, O., Munos, R., and Bengio, S. A study on overfitting in deep reinforcement learning. ArXiv, abs/1804.06893, 2018b.

A. PPO

Proximal Policy Optimization (PPO) (Schulman et al., 2017) is an actor-critic RL algorithm that learns a policy πθ and a value function Vθ with the goal of finding an optimal policy for a given MDP. PPO alternates between sampling data through interaction with the environment and optimizing an objective function using stochastic gradient ascent. At each iteration, PPO maximizes the following objective:

    J_{PPO} = J_\pi - \alpha_1 J_V + \alpha_2 S_{\pi_\theta},    (7)

where α_1, α_2 are weights for the different loss terms, S_{π_θ} is the entropy bonus for aiding exploration, and J_V is the value function loss, defined as

    J_V = \left( V_\theta(s) - V_t^{\text{target}} \right)^2.

The policy objective term J_π is based on the policy gradient objective, which can be estimated using importance sampling in off-policy settings (i.e. when the policy used for collecting data is different from the policy we want to optimize):

    J_{PG}(\theta) = \sum_{a \in \mathcal{A}} \pi_\theta(a|s) \hat{A}_{\theta_{old}}(s, a) = \mathbb{E}_{a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} \hat{A}_{\theta_{old}}(s, a) \right],    (8)

where Â(·) is an estimate of the advantage function, θ_old are the policy parameters before the update, π_θold is the behavior policy used to collect trajectories (i.e. that generates the training distribution of states and actions), and π_θ is the policy we want to optimize (i.e. that generates the true distribution of states and actions). This objective can also be written as

    J_{PG}(\theta) = \mathbb{E}_{a \sim \pi_{\theta_{old}}} \left[ r(\theta) \hat{A}_{\theta_{old}}(s, a) \right],    (9)

where

    r(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}

is the importance weight for estimating the advantage function.

PPO is inspired by TRPO (Schulman et al., 2015), which constrains the update so that the policy does not change too much in one step. This significantly improves training stability and leads to better results than vanilla policy gradient algorithms. TRPO achieves this by constraining the KL divergence between the old (i.e. before an update) and the new (i.e. after an update) policy. PPO implements the constraint in a simpler way by using a clipped surrogate objective instead of the more complicated TRPO objective. More specifically, PPO imposes the constraint by forcing r(θ) to stay within a small interval around 1, precisely [1 − ε, 1 + ε], where ε is a hyperparameter. The policy objective term from equation (7) becomes

    J_\pi = \mathbb{E}_\pi \left[ \min\left( r(\theta) \hat{A}, \; \mathrm{clip}\left( r(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A} \right) \right],

where Â = Â_{θ_old}(s, a) for brevity. The function clip(r(θ), 1 − ε, 1 + ε) clips the ratio to be no more than 1 + ε and no less than 1 − ε. The PPO objective takes the minimum of the unclipped and clipped terms, so that the agent is discouraged from making extreme policy updates in pursuit of larger rewards. Note that the use of the Adam optimizer (Kingma & Ba, 2015) accommodates loss components of different magnitudes, so G_π and G_V from equations (3) and (4) can be used as part of the DrAC objective in equation (5) with the same loss coefficient α_r. This alleviates the burden of hyperparameter search and means that DrAC only introduces a single extra hyperparameter, α_r.

B. Naive Application of Data Augmentation in RL

In this section, we further clarify why a naive application of data augmentation with certain RL algorithms is theoretically unsound. This argument applies to all algorithms that use importance sampling for estimating the policy gradient loss, including TRPO, PPO, IMPALA, or ACER. Importance sampling is typically employed when the algorithm performs more than a single policy update using the same data, in order to correct for the off-policy nature of the updates. For brevity, we will use PPO to explain this problem.

The correct estimate of the policy gradient objective used in PPO is the one in equation (1) (or, equivalently, equation (8)), which does not use the augmented observations at all, since we are estimating advantages for the actual observations, A(s, a). The probability distribution used to sample advantages is π_old(a|s) (rather than π_old(a|f(s))), since we can only interact with the environment via the true observations and not the augmented ones (because the reward and transition functions are not defined for augmented observations). Hence, the correct importance sampling estimate uses π(a|s)/π_old(a|s). Using π(a|f(s))/π_old(a|f(s)) instead would be incorrect for the reasons mentioned above. What we argue is that, in the case of RAD, the only way to use the augmented observations f(s) is in the policy gradient objective, whether via π(a|f(s))/π_old(a|f(s)) or π(a|f(s))/π_old(a|s), depending on the exact implementation, but both of these are incorrect. In contrast, DrAC does not change the policy gradient objective at all, which remains the one in equation (1), and instead uses the augmented observations in the additional regularization losses, as shown in equations (3), (4), and (5).
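As an illustration of the clipped surrogate objective above, and of the correct (non-augmented) importance ratio argued for in Appendix B, a minimal PyTorch sketch is given below; the function name and arguments are our own.

import torch

def ppo_clipped_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective (negated, so it can be minimized with gradient descent).

    Log-probabilities are evaluated on the original (non-augmented) observations,
    so the ratio equals pi_theta(a|s) / pi_theta_old(a|s)."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()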

C. Cycle-Consistency

Here is a description of the cycle-consistency metric proposed by Aytar et al. (2018) and also used in Lee et al. (2020) for analyzing the learned representations of RL agents. Given two trajectories V and U, v_i ∈ V first locates its nearest neighbor in the other trajectory, u_j = argmin_{u ∈ U} ‖h(v_i) − h(u)‖^2, where h(·) denotes the output of the penultimate layer of the trained agent. Then, the nearest neighbor of u_j in V is located, i.e., v_k = argmin_{v ∈ V} ‖h(v) − h(u_j)‖^2, and v_i is defined as cycle-consistent if |i − k| ≤ 1, i.e., it can return to the original point. Note that this cycle-consistency implies that the two trajectories are accurately aligned in the hidden space. Similar to Aytar et al. (2018), we also evaluate the three-way cycle-consistency by measuring whether v_i remains cycle-consistent along both paths, V → U → J → V and V → J → U → V, where J is a third trajectory.
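A small NumPy sketch of the two-way cycle-consistency computation described above (our own illustrative code, assuming precomputed penultimate-layer embeddings for the two trajectories):

import numpy as np

def cycle_consistent_fraction(h_v, h_u, tol=1):
    """Two-way cycle-consistency between trajectories V and U.

    h_v, h_u: arrays of shape (len(V), d) and (len(U), d) holding embeddings
    from the penultimate layer of the trained agent."""
    consistent = 0
    for i, v in enumerate(h_v):
        j = np.argmin(np.linalg.norm(h_u - v, axis=1))        # nearest neighbor of v_i in U
        k = np.argmin(np.linalg.norm(h_v - h_u[j], axis=1))   # nearest neighbor of u_j in V
        consistent += int(abs(i - k) <= tol)
    return consistent / len(h_v)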

D. Automatic Data Augmentation Algorithms In this section, we provide more details about the automatic augmentation approaches, as well as pseudocodes for all three methods we propose.

Algorithm 2 UCB-DrAC
1: Hyperparameters: Set of image transformations F = {f^1, . . . , f^n}, exploration coefficient c, window W for estimating the Q-functions, number of updates K, initial policy π_θ, initial value function V_φ.
2: N(f) = 1, ∀ f ∈ F    . Initialize the number of times each augmentation was selected
3: Q(f) = 0, ∀ f ∈ F    . Initialize the Q-functions for all augmentations
4: R(f) = FIFO(W), ∀ f ∈ F    . Initialize the lists of returns for all augmentations
5: for k = 1, . . . , K do
6:   f_k = argmax_{f ∈ F} [ Q(f) + c √(log(k) / N(f)) ]    . Use UCB to select an augmentation
7:   Update the policy and value function according to Algorithm 1 with f = f_k and K = 1:
8:     θ ← argmax_θ J_DrAC    . Update the policy
9:     φ ← argmax_φ J_DrAC    . Update the value function
10:  Compute the mean return r_k obtained by the new policy.
11:  Add r_k to the R(f_k) list using the first-in-first-out rule.
12:  Q(f_k) ← (1 / |R(f_k)|) Σ_{r ∈ R(f_k)} r
13:  N(f_k) ← N(f_k) + 1
14: end for

RL2-DrAC uses an LSTM (Hochreiter & Schmidhuber, 1997) network to select an effective augmentation from a given set, which is then used to update the agent's policy and value function according to the DrAC objective from Equation (5). We will refer to this network as a (recurrent) selection policy. The LSTM takes as inputs the previously selected augmentation and the average return obtained after performing one update of the DrAC agent with this augmentation (using Algorithm 1 with K = 1). It outputs an augmentation from the given set and is rewarded with the average return obtained by the agent after one update with the selected augmentation. The LSTM is trained using REINFORCE (Williams, 2004). See Algorithm 3 for pseudocode of RL2-DrAC.

Algorithm 3 RL2-DrAC
1: Hyperparameters: set of image transformations F = {f1, . . . , fn}, number of updates K, initial policy πθ, initial value function Vφ.
2: Initialize the selection policy as an LSTM network g with parameters ψ.
3: f0 ∼ F    ▷ Randomly initialize the augmentation
4: r0 ← 0    ▷ Initialize the average return
5: for k = 1, . . . , K do
6:    fk ∼ gψ(fk−1, rk−1)    ▷ Select an augmentation according to the recurrent policy
7:    Update the policy and value function according to Algorithm 1 with f = fk and K = 1:
8:       θ ← arg maxθ JDrAC    ▷ Update the policy
9:       φ ← arg maxφ JDrAC    ▷ Update the value function
10:   Compute the mean return rk obtained by the new policy.
11:   Reward gψ with rk.
12:   Update gψ using REINFORCE.
13: end for
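As a rough illustration of how the recurrent selection policy could be parameterized, the PyTorch sketch below conditions an LSTM on the previously selected augmentation (one-hot encoded) and the previous average return; the architecture and the REINFORCE step shown in the comments are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SelectionPolicy(nn.Module):
    def __init__(self, num_augmentations, hidden_dim=32):
        super().__init__()
        # Input: one-hot encoding of the previous augmentation plus the previous average return.
        self.lstm = nn.LSTM(num_augmentations + 1, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_augmentations)

    def forward(self, prev_aug_onehot, prev_return, state=None):
        x = torch.cat([prev_aug_onehot, prev_return], dim=-1).view(1, 1, -1)
        out, state = self.lstm(x, state)
        dist = torch.distributions.Categorical(logits=self.head(out.view(-1)))
        return dist, state

# One REINFORCE step, where r_k is the mean return after the corresponding DrAC update:
#   loss = -dist.log_prob(chosen_augmentation) * r_k
#   loss.backward(); optimizer.step()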

Meta-DrAC meta-learns the weights of a convolutional neural network (CNN) which is used to augment the observations when updating the agent's policy and value function according to the DrAC objective from Equation (5). For each DrAC update of the agent, we split the trajectories in the replay buffer into meta-train and meta-test sets using a 9 to 1 ratio. The CNN's weights are updated using MAML (Finn et al., 2017), where the objective is to maximize the average return obtained by DrAC after an update using Algorithm 1 with K = 1 and the CNN as the transformation f (see the sketch following Algorithm 4).

Algorithm 4 Meta-DrAC
1: Hyperparameters: distribution over tasks (or levels) q(m), number of updates K, step size parameters α and β, initial policy πθ, initial value function Vφ.
2: Initialize the set of all training levels D = {li}, i = 1, . . . , L.
3: Initialize the augmentation as a CNN g with parameters ψ.
4: for k = 1, . . . , K do
5:    Sample a batch of tasks mi ∼ q(m)
6:    for all mi do
7:       Collect trajectories on task mi using the current policy.
8:       Update the policy and value function according to Algorithm 1 with f = gψ and K = 1:
9:          θ ← arg maxθ JDrAC    ▷ Update the policy
10:         φ ← arg maxφ JDrAC    ▷ Update the value function
11:      Compute the return rmi(gψ) of the new agent on task mi after being updated with gψ.
12:      ψ′i ← ψ + α ∇ψ rmi(gψ)
13:   end for
14:   ψ ← ψ + β ∇ψ Σ_{mi∼q(m)} rmi(gψ′i)
15: end for
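A very schematic sketch of one Meta-DrAC meta-update is shown below, using the higher library to differentiate through a single inner agent update; the agent, drac_objective, and data-split arguments are hypothetical placeholders rather than the authors' implementation, and only the augmentation CNN receives meta-gradients here.

import torch
import higher

def meta_update(aug_cnn, meta_optimizer, agent, drac_objective, meta_train, meta_test):
    inner_opt = torch.optim.Adam(agent.parameters(), lr=5e-4)
    meta_optimizer.zero_grad()
    with higher.innerloop_ctx(agent, inner_opt, copy_initial_weights=True) as (fagent, diffopt):
        # Inner step: one DrAC update of the agent, with the CNN acting as the augmentation f.
        inner_loss = -drac_objective(fagent, aug_cnn, meta_train)
        diffopt.step(inner_loss)
        # Outer step: evaluate the updated agent on held-out data and backpropagate
        # through the inner update into the CNN's parameters psi.
        outer_loss = -drac_objective(fagent, aug_cnn, meta_test)
        outer_loss.backward()
    meta_optimizer.step()    # meta_optimizer is assumed to optimize aug_cnn.parameters()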

E. Hyperparameters

We use Kostrikov (2018)'s implementation of PPO (Schulman et al., 2017), on top of which all our methods are built. The agent is parameterized by the ResNet architecture from Espeholt et al. (2018), which was used to obtain the best results in Cobbe et al. (2019). Following Cobbe et al. (2019), we also share parameters between the policy and value networks. To improve stability when training with DrAC, we only backpropagate gradients through π(a|f(s, ν)) and V(f(s, ν)) in equations (3) and (4), respectively. Unless otherwise noted, we use the best hyperparameters found in Cobbe et al. (2019) for the easy mode of Procgen (i.e., the same experimental setup as the one used here), which are listed in Table 3.

Table 3. List of hyperparameters used to obtain the results in this paper.
Hyperparameter | Value
γ | 0.999
λ | 0.95
# timesteps per rollout | 256
# epochs per rollout | 3
# minibatches per epoch | 8
entropy bonus | 0.01
clip range | 0.2
reward normalization | yes
learning rate | 5e-4
# workers | 1
# environments per worker | 64
# total timesteps | 25M
optimizer | Adam
LSTM | no
frame stack | no
αr (DrAC regularization coefficient) | 0.1
c (UCB exploration coefficient) | 0.1
K (UCB sliding window size) | 10

We use the Adam (Kingma & Ba, 2015) optimizer for all our experiments. Note that by using Adam, we do not need separate coefficients for the policy and value regularization terms, since Adam rescales the gradients of each loss component accordingly.

For DrAC, we did a grid search over the regularization coefficient αr ∈ [0.0001, 0.01, 0.05, 0.1, 0.5, 1.0] used in equation (5) and found that the best value is αr = 0.1, which was used to produce all the results in this paper. For UCB-DrAC, we did grid searches over the exploration coefficient c ∈ [0.0, 0.1, 0.5, 1.0, 5.0] and the size of the sliding window used to compute the Q-values, K ∈ [10, 50, 100]. We found that the best values are c = 0.1 and K = 10, which were used to obtain the results shown here. For RL2-DrAC, we performed a hyperparameter search over the dimension of the recurrent hidden state h ∈ [16, 32, 64], the learning rate l ∈ [3e−4, 5e−4, 7e−4], and the entropy coefficient e ∈ [1e−4, 1e−3, 1e−2], and found h = 32, l = 5e−4, and e = 1e−3 to work best. We used Adam with ε = 1e−5 as the optimizer. For Meta-DrAC, the convolutional network whose weights we meta-learn consists of a single convolutional layer with 3 input and 3 output channels, kernel size 3, stride 1, and no padding (see the sketch below). At each epoch, we perform one meta-update in which we unroll the inner optimizer on the training set and compute the meta-test return on the validation set. We did the same hyperparameter grid searches as for RL2-DrAC and found that the best values in this case were l = 7e−4 and e = 1e−2. The buffer of experience (collected before each PPO update) was split into 90% for meta-training and 10% for meta-testing.

For Rand-FM (Lee et al., 2020), we use the hyperparameters recommended in the authors' released implementation, which were the best values for CoinRun (Cobbe et al., 2018), one of the Procgen games used for evaluation in Lee et al. (2020). For IBAC-SNI (Igl et al., 2019), we also use the authors' open-sourced implementation. We use the parameters corresponding to IBAC-SNI with λ = 0.5, weight regularization with l2 = 0.0001, data augmentation turned on, β = 0.0001 (which turns on the variational information bottleneck), and selective noise injection turned on. This corresponds to the best version of this approach, as found by the authors after evaluating it on CoinRun (Cobbe et al., 2018). While IBAC-SNI outperforms the other methods on maze-like games such as Heist, Maze, and Miner, it is still significantly worse than our approach on the entire Procgen benchmark. For both baselines, Rand-FM and IBAC-SNI, we use the same experimental setup for training and testing as the one used for our methods. Hence, we train them for 25M frames on the easy mode of each Procgen game, using (the same) 200 levels for training and the rest of the levels for testing.
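For concreteness, the meta-learned augmentation described above (a single 3×3 convolution with 3 input and 3 output channels, stride 1, and no padding) could be defined as in the following PyTorch sketch; the class name is ours and the module is only an illustration.

import torch.nn as nn

class MetaAugmentation(nn.Module):
    def __init__(self):
        super().__init__()
        # Single convolutional layer: 3 input channels, 3 output channels, kernel 3, stride 1, no padding.
        self.conv = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3, stride=1, padding=0)

    def forward(self, obs):
        # obs: a batch of image observations of shape (B, 3, H, W);
        # the output plays the role of f(s) in the DrAC regularization terms.
        return self.conv(obs)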


Figure 5. Behavior of UCB for different values of its exploration coefficient c on Dodgeball. Each panel shows how many times UCB selects each augmentation (crop, random-conv, grayscale, flip, rotate, cutout, cutout-color, color-jitter) over the 25M training steps. When c is too small, UCB can converge to a suboptimal augmentation; when c is too large, UCB can take too long to converge.

F. Analysis of UCB's Behavior

In Figure 4, we show the behavior of UCB during training, along with train and test performance on the respective environments. In the case of Ninja, UCB converges to always selecting the best augmentation only after 15M training steps. This is because the augmentations have similar effects on the agent early in training, so it takes longer to identify the best augmentation from the given set. In contrast, on Dodgeball, UCB finds the most effective augmentation much earlier in training because there is a significant difference between the effects of the various augmentations. Early discovery of an effective augmentation leads to significant improvements over PPO on both train and test environments.

Another important factor is the exploration coefficient used by UCB (see equation (6)) to balance the exploration and exploitation of different augmentations. Figure 5 compares UCB's behavior for different values of the exploration coefficient. Note that if the coefficient is 0, UCB always selects the augmentation with the largest Q-value. This can sometimes lead to UCB converging to a suboptimal augmentation due to the lack of exploration. However, if the exploration term of equation (6) is too large relative to the differences in the Q-values of the various augmentations, UCB may take too long to converge. In our experiments, we found that an exploration coefficient of 0.1 results in a good exploration-exploitation balance and works well across all Procgen games.

G. Procgen Benchmark

Figure 6. Screenshots of multiple procedurally-generated levels from 15 Procgen environments: StarPilot, CaveFlyer, Dodgeball, FruitBot, Chaser, Miner, Jumper, Leaper, Maze, BigFish, Heist, Climber, Plunder, Ninja, BossFight (from left to right, top to bottom).

H. Best Augmentations

Table 4. Best augmentation type for each game, as evaluated on the test environments.
Game | BigFish | StarPilot | FruitBot | BossFight | Ninja | Plunder | CaveFlyer | CoinRun
Best Augmentation | crop | crop | crop | flip | color-jitter | crop | rotate | random-conv

Table 5. Best augmentation type for each game, as evaluated on the test environments.
Game | Jumper | Chaser | Climber | Dodgeball | Heist | Leaper | Maze | Miner
Best Augmentation | random-conv | crop | color-jitter | crop | crop | crop | crop | color-jitter

I. Breakdown of Procgen Scores

Table 6. Procgen scores on train levels after training on 25M environment steps. The mean and standard deviation are computed using 10 runs. The best augmentation for each game is used when computing the results for DrAC and RAD.
Game | PPO | Rand + FM | IBAC-SNI | DrAC | RAD | UCB-DrAC | RL2-DrAC | Meta-DrAC
BigFish | 8.9 ± 1.5 | 6.0 ± 0.9 | 19.1 ± 0.8 | 13.1 ± 2.2 | 13.2 ± 2.8 | 13.2 ± 2.2 | 10.1 ± 1.9 | 9.28 ± 1.9
StarPilot | 29.8 ± 2.3 | 26.3 ± 0.8 | 26.7 ± 0.7 | 38.0 ± 3.1 | 36.5 ± 3.9 | 35.3 ± 2.2 | 30.6 ± 2.6 | 30.5 ± 3.9
FruitBot | 29.1 ± 1.1 | 29.2 ± 0.7 | 29.4 ± 0.8 | 29.4 ± 1.0 | 26.1 ± 3.0 | 29.5 ± 1.2 | 29.2 ± 1.0 | 29.4 ± 1.1
BossFight | 8.5 ± 0.7 | 5.6 ± 0.7 | 7.9 ± 0.7 | 8.2 ± 1.0 | 8.1 ± 1.1 | 8.2 ± 0.8 | 8.4 ± 0.8 | 7.9 ± 0.5
Ninja | 7.4 ± 0.7 | 7.2 ± 0.6 | 8.3 ± 0.8 | 8.8 ± 0.5 | 8.9 ± 0.9 | 8.5 ± 0.3 | 8.1 ± 0.6 | 7.8 ± 0.4
Plunder | 6.0 ± 0.5 | 5.5 ± 0.7 | 6.0 ± 0.6 | 9.9 ± 1.3 | 8.4 ± 1.5 | 11.1 ± 1.6 | 5.3 ± 0.5 | 6.5 ± 0.5
CaveFlyer | 6.8 ± 0.6 | 6.5 ± 0.5 | 6.2 ± 0.5 | 8.2 ± 0.7 | 6.0 ± 0.8 | 5.7 ± 0.6 | 5.3 ± 0.8 | 6.5 ± 0.7
CoinRun | 9.3 ± 0.3 | 9.6 ± 0.6 | 9.6 ± 0.4 | 9.7 ± 0.2 | 9.6 ± 0.4 | 9.5 ± 0.3 | 9.1 ± 0.3 | 9.4 ± 0.2
Jumper | 8.3 ± 0.4 | 8.9 ± 0.4 | 8.5 ± 0.6 | 9.1 ± 0.4 | 8.6 ± 0.4 | 8.1 ± 0.7 | 8.6 ± 0.4 | 8.4 ± 0.5
Chaser | 4.9 ± 0.5 | 2.8 ± 0.7 | 3.1 ± 0.8 | 7.1 ± 0.5 | 6.4 ± 1.0 | 7.6 ± 1.0 | 4.5 ± 0.7 | 5.5 ± 0.8
Climber | 8.4 ± 0.8 | 7.5 ± 0.8 | 7.1 ± 0.7 | 9.9 ± 0.8 | 9.3 ± 1.1 | 9.0 ± 0.4 | 7.9 ± 0.9 | 8.5 ± 0.5
Dodgeball | 4.2 ± 0.5 | 4.3 ± 0.3 | 9.4 ± 0.6 | 7.5 ± 1.0 | 5.0 ± 0.7 | 8.3 ± 0.9 | 6.3 ± 1.1 | 4.8 ± 0.6
Heist | 7.1 ± 0.5 | 6.0 ± 0.5 | 4.8 ± 0.7 | 6.8 ± 0.7 | 6.2 ± 0.9 | 6.9 ± 0.4 | 5.6 ± 0.8 | 6.6 ± 0.6
Leaper | 5.5 ± 0.4 | 3.2 ± 0.7 | 2.7 ± 0.4 | 5.0 ± 0.7 | 4.9 ± 0.9 | 5.3 ± 0.5 | 2.7 ± 0.6 | 3.7 ± 0.6
Maze | 9.1 ± 0.3 | 8.9 ± 0.6 | 8.2 ± 0.8 | 8.3 ± 0.7 | 8.4 ± 0.7 | 8.7 ± 0.6 | 7.0 ± 0.7 | 9.2 ± 0.2
Miner | 12.2 ± 0.3 | 11.7 ± 0.8 | 8.5 ± 0.7 | 12.5 ± 0.3 | 12.6 ± 1.0 | 12.5 ± 0.2 | 10.9 ± 0.5 | 12.4 ± 0.3

Table 7. Procgen scores on test levels after training on 25M environment steps. The mean and standard deviation are computed using 10 runs. The best augmentation for each game is used when computing the results for DrAC and RAD.
Game | PPO | Rand + FM | IBAC-SNI | DrAC | RAD | UCB-DrAC | RL2-DrAC | Meta-DrAC
BigFish | 4.0 ± 1.2 | 0.6 ± 0.8 | 0.8 ± 0.9 | 8.7 ± 1.4 | 9.9 ± 1.7 | 9.7 ± 1.0 | 6.0 ± 0.5 | 3.3 ± 0.6
StarPilot | 24.7 ± 3.4 | 8.8 ± 0.7 | 4.9 ± 0.8 | 29.5 ± 5.4 | 33.4 ± 5.1 | 30.2 ± 2.8 | 29.4 ± 2.0 | 26.6 ± 2.8
FruitBot | 26.7 ± 0.8 | 24.5 ± 0.7 | 24.7 ± 0.8 | 28.2 ± 0.8 | 27.3 ± 1.8 | 28.3 ± 0.9 | 27.5 ± 1.6 | 27.4 ± 0.8
BossFight | 7.7 ± 1.0 | 1.7 ± 0.9 | 1.0 ± 0.7 | 7.5 ± 0.8 | 7.9 ± 0.6 | 8.3 ± 0.8 | 7.6 ± 0.9 | 7.7 ± 0.7
Ninja | 5.9 ± 0.7 | 6.1 ± 0.8 | 9.2 ± 0.6 | 7.0 ± 0.4 | 6.9 ± 0.8 | 6.9 ± 0.6 | 6.2 ± 0.5 | 5.9 ± 0.7
Plunder | 5.0 ± 0.5 | 3.0 ± 0.6 | 2.1 ± 0.8 | 9.5 ± 1.0 | 8.5 ± 1.2 | 8.9 ± 1.0 | 4.6 ± 0.3 | 5.6 ± 0.4
CaveFlyer | 5.1 ± 0.9 | 5.4 ± 0.8 | 8.0 ± 0.8 | 6.3 ± 0.8 | 5.1 ± 0.6 | 5.3 ± 0.9 | 4.1 ± 0.9 | 5.5 ± 0.4
CoinRun | 8.5 ± 0.5 | 9.3 ± 0.4 | 8.7 ± 0.6 | 8.8 ± 0. | 9.0 ± 0.8 | 8.5 ± 0.6 | 8.3 ± 0.5 | 8.6 ± 0.5
Jumper | 5.8 ± 0.5 | 5.3 ± 0.6 | 3.6 ± 0.6 | 6.6 ± 0.4 | 6.5 ± 0.6 | 6.4 ± 0.6 | 6.5 ± 0.5 | 5.8 ± 0.7
Chaser | 5.0 ± 0.8 | 1.4 ± 0.7 | 1.3 ± 0.5 | 5.7 ± 0.6 | 5.9 ± 1.0 | 6.7 ± 0.6 | 3.8 ± 0.5 | 5.1 ± 0.6
Climber | 5.7 ± 0.8 | 5.3 ± 0.7 | 3.3 ± 0.6 | 7.1 ± 0.7 | 6.9 ± 0.8 | 6.5 ± 0.8 | 6.3 ± 0.5 | 6.6 ± 0.6
Dodgeball | 11.7 ± 0.3 | 0.5 ± 0.4 | 1.4 ± 0.4 | 4.3 ± 0.8 | 2.8 ± 0.7 | 4.7 ± 0.7 | 3.0 ± 0.8 | 1.9 ± 0.5
Heist | 2.4 ± 0.5 | 2.4 ± 0.6 | 9.8 ± 0.6 | 4.0 ± 0.8 | 4.1 ± 1.0 | 4.0 ± 0.7 | 2.4 ± 0.4 | 2.0 ± 0.6
Leaper | 4.9 ± 0.7 | 6.2 ± 0.5 | 6.8 ± 0.6 | 5.3 ± 1.1 | 4.3 ± 1.0 | 5.0 ± 0.3 | 2.8 ± 0.7 | 3.3 ± 0.4
Maze | 5.7 ± 0.6 | 8.0 ± 0.7 | 10.0 ± 0.7 | 6.6 ± 0.8 | 6.1 ± 1.0 | 6.3 ± 0.6 | 5.6 ± 0.3 | 5.2 ± 0.6
Miner | 8.5 ± 0.5 | 7.7 ± 0.6 | 8.0 ± 0.6 | 9.8 ± 0.6 | 9.4 ± 1.2 | 9.7 ± 0.7 | 8.0 ± 0.4 | 9.2 ± 0.7

J. Procgen Learning Curves


Figure 7. Test performance of the approaches that automatically select an augmentation, namely UCB-DrAC, RL2-DrAC, and Meta-DrAC, alongside PPO and DrAC with the best augmentation for each game. The mean and standard deviation are computed using 10 runs.


Figure 8. Train performance of the approaches that automatically select an augmentation, namely UCB-DrAC, RL2-DrAC, and Meta-DrAC, alongside PPO and DrAC with the best augmentation for each game. The mean and standard deviation are computed using 10 runs.


Figure 9. Test performance of PPO, DrAC, and its two ablations, DrC and DrA (all with the best augmentation for each game), which regularize only the critic or only the actor, respectively. The mean and standard deviation are computed using 10 runs.


Figure 10. Train performance of PPO, DrAC, and its two ablations, DrC and DrA (all with the best augmentation for each game), which regularize only the critic or only the actor, respectively. The mean and standard deviation are computed using 10 runs.

K. DeepMind Control Suite Experiments

For our DMC experiments, we ran a grid search over the learning rate in [1e−4, 3e−4, 7e−4, 1e−3], the number of minibatches in [32, 8, 16, 64], the entropy coefficient in [0.0, 0.01, 0.001, 0.0001], and the number of PPO epochs per update in [3, 5, 10, 20]. For Walker Walk and Finger Spin we use an action repeat of 2, and for the other tasks we use 4. We also use 3 stacked frames as observations. For Finger Spin, we found 10 PPO epochs, an entropy coefficient of 0.0, 16 minibatches, and a learning rate of 0.0001 to work best. For Cartpole Balance, we used the same settings except for a learning rate of 0.001. For Walker Walk, we used 5 PPO epochs, an entropy coefficient of 0.001, 32 minibatches, and a learning rate of 0.001. For Cheetah Run, we used 3 PPO epochs, an entropy coefficient of 0.0001, 64 minibatches, and a learning rate of 0.001. We use γ = 0.99 and λ = 0.95 for the generalized advantage estimates, 2048 steps, 1 process, a value loss coefficient of 0.5, and linear learning rate decay over 1 million environment steps. We also found crop to be the best augmentation from our set of eight transformations. Any other hyperparameters not mentioned here were set to the same values as the ones used for Procgen, as described above.
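For reference, the sketch below shows one way the action-repeat and frame-stacking preprocessing described above could be implemented as gym-style wrappers (assuming the older four-tuple step API and channel-first image observations); this is an illustration rather than the code used for the experiments.

import gym
import numpy as np
from collections import deque

class ActionRepeat(gym.Wrapper):
    def __init__(self, env, repeat):
        super().__init__(env)
        self.repeat = repeat              # 2 for Walker Walk and Finger Spin, 4 for the other tasks

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.repeat):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info

class FrameStack(gym.Wrapper):
    def __init__(self, env, k=3):
        super().__init__(env)
        self.frames = deque(maxlen=k)     # keep the k most recent frames

    def reset(self):
        obs = self.env.reset()
        for _ in range(self.frames.maxlen):
            self.frames.append(obs)
        return np.concatenate(self.frames, axis=0)   # stack along the channel dimension

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.concatenate(self.frames, axis=0), reward, done, info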

Figure 11. DMC environment examples. Top row: default backgrounds without any distractors. Middle row: simple distractor backgrounds with ideal gas videos. Bottom row: natural distractor backgrounds with Kinetics videos. Tasks from left to right: Finger Spin, Cheetah Run, Walker Walk.


Figure 12. Average return on DMC tasks with default (i.e. no distractor) backgrounds with mean and standard deviation computed over 5 seeds. UCB-DrAC outperforms PPO and RAD with the best augmentations.


Figure 13. Average return on DMC tasks with simple (i.e. synthetic) distractor backgrounds with mean and standard deviation computed over 5 seeds. UCB-DrAC outperforms PPO and RAD with the best augmentations.