Decoupling Representation Learning from Reinforcement Learning

Adam Stooke¹  Kimin Lee¹  Pieter Abbeel¹  Michael Laskin¹

¹University of California, Berkeley. Correspondence to: Adam Stooke, Michael Laskin.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

In an effort to overcome limitations of reward-driven feature learning in deep reinforcement learning (RL) from images, we propose decoupling representation learning from policy learning. To this end, we introduce a new unsupervised learning (UL) task, called Augmented Temporal Contrast (ATC), which trains a convolutional encoder to associate pairs of observations separated by a short time difference, under image augmentations and using a contrastive loss. In online RL experiments, we show that training the encoder exclusively using ATC matches or outperforms end-to-end RL in most environments. Additionally, we benchmark several leading UL algorithms by pre-training encoders on expert demonstrations and using them, with weights frozen, in RL agents; we find that agents using ATC-trained encoders outperform all others. We also train multi-task encoders on data from multiple environments and show generalization to different downstream RL tasks. Finally, we ablate components of ATC, and introduce a new data augmentation to enable replay of (compressed) latent images from pre-trained encoders when RL requires augmentation. Our experiments span visually diverse RL benchmarks in DeepMind Control, DeepMind Lab, and Atari, and our complete code is available at https://github.com/astooke/rlpyt/tree/master/rlpyt/ul.

1. Introduction

Ever since the first fully-learned approach succeeded at playing Atari games from screen images (Mnih et al., 2015), standard practice in deep reinforcement learning (RL) has been to learn visual features and a control policy jointly, end-to-end. Several such deep RL algorithms have matured (Hessel et al., 2018; Schulman et al., 2017; Mnih et al., 2016; Haarnoja et al., 2018) and have been successfully applied to domains ranging from real-world (Levine et al., 2016; Kalashnikov et al., 2018) and simulated robotics (Lee et al., 2019; Laskin et al., 2020a; Hafner et al., 2020) to sophisticated video games (Berner et al., 2019; Jaderberg et al., 2019), and even high-fidelity driving simulators (Dosovitskiy et al., 2017). While the simplicity of end-to-end methods is appealing, relying on the reward function to learn visual features can be severely limiting. For example, it leaves features difficult to acquire under sparse rewards, and it can narrow their utility to a single task. Although our intent is broader than to focus on either sparse-reward or multi-task settings, they arise naturally in our studies. We investigate how to learn visual representations which are agnostic to rewards, without degrading the control policy.

A number of recent works have significantly improved RL performance by introducing auxiliary losses, which are unsupervised tasks that provide feature-learning signal to the convolutional neural network (CNN) encoder in addition to the RL loss (Jaderberg et al., 2017; van den Oord et al., 2018; Laskin et al., 2020b; Guo et al., 2020; Schwarzer et al., 2020). Meanwhile, in the field of computer vision, recent efforts in unsupervised and self-supervised learning (Chen et al., 2020; Grill et al., 2020; He et al., 2019) have demonstrated that powerful feature extractors can be learned without labels, as evidenced by their usefulness for downstream tasks such as ImageNet classification. Together, these advances suggest that visual features for RL could possibly be learned entirely without rewards, which would grant greater flexibility to improve overall learning performance. To our knowledge, however, no single unsupervised learning (UL) task has been shown adequate for this purpose in general vision-based environments.

In this paper, we demonstrate the first decoupling of representation learning from reinforcement learning that performs as well as or better than end-to-end RL. We update the encoder weights using only UL and train a control policy independently, on the (compressed) latent images. This capability stands in contrast to previous state-of-the-art methods, which have trained the UL and RL objectives jointly, or which observed diminished performance with decoupled encoders (Laskin et al., 2020b).

Our main enabling contribution is a new unsupervised task tailored to reinforcement learning, which we call Augmented Temporal Contrast (ATC). ATC requires a model to associate observations from nearby time steps within the same trajectory (Anand et al., 2019). Observations are encoded via a convolutional neural network (shared with the RL agent) into a small latent space, where the InfoNCE loss is applied (van den Oord et al., 2018). Within each randomly sampled training batch, the positive observation, o_{t+k}, for every anchor, o_t, serves as negative for all other anchors. For regularization, observations undergo stochastic data augmentation (Laskin et al., 2020b) prior to encoding, namely random shift (Kostrikov et al., 2020), and a momentum encoder (He et al., 2020; Laskin et al., 2020b) is used to process the positives. A learned predictor layer further processes the anchor code (Grill et al., 2020; Chen et al., 2020) prior to contrasting. In summary, our algorithm is a novel combination of elements that enables generic learning of the structure of observations and transitions in MDPs without requiring rewards or actions as input.

We include extensive experimental studies establishing the effectiveness of our algorithm in a visually diverse range of common RL environments: DeepMind Control Suite (DMControl; Tassa et al. 2018), DeepMind Lab (DMLab; Beattie et al. 2016), and Atari (Bellemare et al., 2013). Our experiments span discrete and continuous control, 2D and 3D visuals, and both on-policy and off-policy RL algorithms. Complete code for all of our experiments is available at https://github.com/astooke/rlpyt/tree/master/rlpyt/ul. Our empirical contributions are summarized as follows:

Online RL with UL: We find that the convolutional encoder trained solely with the unsupervised ATC objective can fully replace the end-to-end RL encoder without degrading policy performance. ATC achieves nearly equal or greater performance in all DMControl and DMLab environments tested and in 5 of the 8 Atari games tested. In the other 3 Atari games, using ATC as an auxiliary loss or for weight initialization still brings improvements over end-to-end RL.

Encoder Pre-Training Benchmarks: We pre-train the convolutional encoder to convergence on expert demonstrations, and evaluate it by training an RL agent using the encoder with weights frozen. We find that ATC matches or outperforms all prior UL algorithms as tested across all domains, demonstrating that ATC is a state-of-the-art UL algorithm for RL.

Multi-Task Encoders: An encoder is trained on demonstrations from multiple environments, and is evaluated, with weights frozen, in separate downstream RL agents. A single encoder trained on four DMControl environments generalizes successfully, performing equal or better than end-to-end RL in four held-out environments. Similar attempts to generalize across eight diverse Atari games result in mixed performance, confirming some limited feature sharing among games.

Ablations and Encoder Analysis: Components of ATC are ablated, showing their individual effects. Additionally, data augmentation is shown to be necessary in DMControl during RL even when using a frozen encoder. We introduce a new augmentation, subpixel random shift, which matches performance while augmenting the latent images, unlocking computation and memory benefits.

2. Related Work

Several recent works have used unsupervised/self-supervised representation learning methods to improve performance in RL. The UNREAL agent (Jaderberg et al., 2017) introduced unsupervised auxiliary tasks to deep RL, including the Pixel Control task, a Q-learning method requiring predictions of screen changes in discrete control environments, which has become a standard in DMLab (Hessel et al., 2019). CPC (van den Oord et al., 2018) applied contrastive losses over multiple time steps as an auxiliary task for the convolutional and recurrent layers of RL agents, and it has been extended with future action-conditioning (Guo et al., 2018). Recently, PBL (Guo et al., 2020) surpassed these methods with an auxiliary loss of forward and backward predictions in the recurrent latent space using partial agent histories. Where the trend is toward increasing sophistication in auxiliary recurrent architectures, our algorithm is markedly simpler, requiring only observations, and yet it proves sufficient in partially observed settings (POMDPs).

ST-DIM (Anand et al., 2019) introduced various temporal, contrastive losses, including ones that operate on "local" features from an intermediate layer within the encoder, without data augmentation. CURL (Laskin et al., 2020b) introduced an augmented, contrastive auxiliary task similar to ours, including a momentum encoder but without temporal contrast. (Mazoure et al., 2020) provided extensive analysis pertaining to InfoNCE losses on functions of successive time steps in MDPs, including local features in their auxiliary loss (DRIML), similar to ST-DIM, and finally conducted experiments using global temporal contrast of augmented observations in the Procgen (Cobbe et al., 2019) environment. Most recently, MPR (Schwarzer et al., 2020) combined data augmentation with multi-step, convolutional forward modeling and a similarity loss to improve DQN agents in the Atari 100k benchmark. (Hafner et al., 2019; 2020; Lee et al., 2019) proposed to leverage world-modeling in a latent space for continuous control. A small number of model-free methods have attempted to decouple encoder training from the RL loss as ablations, but have met reduced performance relative to end-to-end RL (Laskin et al., 2020b; Lee et al., 2020). None have previously been shown effective in as diverse a collection of RL environments as ours (Bellemare et al., 2013; Tassa et al., 2018; Beattie et al., 2016).

(Finn et al., 2016; Ha & Schmidhuber, 2018) are example works which pretrained encoder features in advance using image reconstruction losses such as the VAE (Kingma & Welling, 2013). (Devin et al., 2018; Kipf et al., 2019) pretrained object-centric representations, the latter learning a forward model by way of contrastive losses; (Yan et al., 2020) introduced a similar technique to learn encoders supporting manipulation of deformable objects by traditional control methods. MERLIN (Wayne et al., 2018) trained a convolutional encoder and sophisticated memory module online, detached from the RL agent, which learned read-only accesses to memory. It used reconstruction and one-step latent-prediction losses and achieved high performance in DMLab-like environments with extreme partial observability. Our loss function may benefit those settings, as it outperforms similar reconstruction losses in our experiments. Decoupling unsupervised pretraining from downstream tasks is common in computer vision (Hénaff et al., 2019; He et al., 2019; Chen et al., 2020) and has the favorable property of providing task-agnostic features which can be used for training smaller task-specific networks, yielding significant gains in computational efficiency over end-to-end methods.

3. Augmented Temporal Contrast

Our unsupervised learning task, Augmented Temporal Contrast (ATC), requires a model to associate an observation, o_t, with one from a specified, near-future time step, o_{t+k}. Within each training batch, we apply stochastic data augmentation to the observations (Laskin et al., 2020b), namely random shift (Kostrikov et al., 2020), which is simple to implement and provides highly effective regularization in most cases. The augmented observations are encoded into a small latent space where a contrastive loss is applied. This task encourages the learned encoder to extract meaningful elements of the structure of the MDP from observations.

Figure 1. Augmented Temporal Contrast: augmented observations are processed through a learned encoder f_θ, compressor g_φ, and residual predictor h_ψ, and are associated through a contrastive loss with a positive example from k time steps later, processed through a momentum encoder.

Our architecture for ATC consists of four learned components: (i) a convolutional encoder, f_θ, which processes the anchor observation, o_t, into the latent image z_t = f_θ(AUG(o_t)); (ii) a linear global compressor, g_φ, to produce a small latent code vector c_t = g_φ(z_t); (iii) a residual predictor MLP, h_ψ, which acts as an implicit forward model to advance the code, p_t = h_ψ(c_t) + c_t; and (iv) a contrastive transformation matrix, W. To process the positive observation, o_{t+k}, into the target code c̄_{t+k} = g_φ̄(f_θ̄(AUG(o_{t+k}))), we use a momentum encoder (He et al., 2019) parameterized as a slowly moving average of the weights from the learned encoder and compressor layer:

    θ̄ ← (1 − τ) θ̄ + τ θ ;   φ̄ ← (1 − τ) φ̄ + τ φ .    (1)

The complete architecture is shown in Figure 1. The convolutional encoder, f_θ, alone is shared with the RL agent.

We employ the InfoNCE loss (Gutmann & Hyvärinen, 2010; van den Oord et al., 2018) using logits computed bilinearly, as l = p_t W c̄_{t+k}. In our implementation, every anchor in the training batch utilizes the positives corresponding to all other anchors as its negative examples. Denoting an observation indexed from dataset O as o_i, and its positive as o_{i+}, the logits can be written as l_{i,j+} = p_i W c̄_{j+}; our loss function in practice is:

    L_ATC = −E_O [ log ( exp(l_{i,i+}) / Σ_{o_j ∈ O} exp(l_{i,j+}) ) ] .    (2)
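To make the update concrete, the following is a minimal PyTorch sketch of one ATC training step. The module names f, g, h, the weight matrix W, the momentum copies f_bar and g_bar, and the augment function are placeholders chosen for illustration (they are not the released rlpyt implementation); the computation follows Equations (1) and (2).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(online, target, tau=0.01):
    # Eq. (1): target parameters track a slow moving average of the online ones.
    # tau = 0.01 is an illustrative value, not necessarily the paper's setting.
    for p, p_bar in zip(online.parameters(), target.parameters()):
        p_bar.mul_(1.0 - tau).add_(tau * p)

def atc_loss(o_t, o_tk, f, g, h, W, f_bar, g_bar, augment):
    # Anchor path: augment -> encoder -> compressor -> residual predictor.
    z_t = f(augment(o_t))                    # latent image
    c_t = g(z_t)                             # small latent code
    p_t = h(c_t) + c_t                       # implicit forward model (residual)

    # Positive path: augment -> momentum encoder and compressor (no gradients).
    with torch.no_grad():
        c_bar = g_bar(f_bar(augment(o_tk)))  # target codes

    # Bilinear logits l_{i,j} = p_i W c_bar_j; the positives of all other
    # anchors in the batch serve as negatives, so Eq. (2) reduces to a
    # standard cross-entropy over the batch (InfoNCE).
    logits = p_t @ W @ c_bar.T               # shape [B, B]
    labels = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, labels)
```

In the online setting this unsupervised update is simply interleaved with RL updates on data drawn from a small replay buffer, with momentum_update applied to the encoder and compressor after each gradient step.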

4. Experiments

4.1. Evaluation Environments and Algorithms

We evaluate ATC on three standard, visually diverse RL benchmarks: the DeepMind control suite (DMControl; Tassa et al. 2018), Atari games in the Arcade Learning Environment (Bellemare et al., 2013), and DeepMind Lab (DMLab; Beattie et al. 2016). Atari requires discrete control in arcade-style games. DMControl is comprised of continuous control robotic locomotion and manipulation tasks. In contrast, DMLab requires the RL agent to reason in more visually complex 3D maze environments with partial observability.

We use ATC to enhance both on-policy and off-policy RL algorithms. For DMControl, we use RAD-SAC (Laskin et al., 2020a; Haarnoja et al., 2018) with the augmentation of (Kostrikov et al., 2020), which randomly shifts the image in each coordinate (by up to 4 pixels), replicating edge pixel values as necessary to restore the original image size.
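A minimal sketch of this random-shift augmentation is below, assuming batched image tensors of shape [B, C, H, W]; the per-image Python loop is kept for clarity, whereas practical implementations vectorize the crop.

```python
import torch
import torch.nn.functional as F

def random_shift(imgs, pad=4):
    """Translate each image by up to `pad` pixels per coordinate, replicating
    edge pixel values so the output keeps the original size."""
    b, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    # Independent (x, y) crop origin for each image in the batch.
    x0 = torch.randint(0, 2 * pad + 1, (b,))
    y0 = torch.randint(0, 2 * pad + 1, (b,))
    return torch.stack([padded[i, :, y0[i]:y0[i] + h, x0[i]:x0[i] + w]
                        for i in range(b)])
```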

A difference from prior work is that we use more downsampling in our convolutional network, by using strides (2, 2, 2, 1) instead of (2, 1, 1, 1), to reduce the convolution output image by 25x.¹ For both Atari and DMLab, we use PPO (Schulman et al., 2017). In Atari, we use feed-forward agents, sticky actions, and no end-of-life boundaries for RL episodes. In DMLab we used recurrent, LSTM agents receiving only a single time-step image input, the four-layer convolution encoder from (Jaderberg et al., 2019), and we tuned the entropy bonus for each level. In the online setting, the ATC loss is trained using a small replay buffer of recent experiences.

¹For our input image size 84 × 84, the convolution output image is 7 × 7 rather than 35 × 35. Performance remains largely unchanged, except for a small decrease in the HALF-CHEETAH environment, but the experiments run significantly faster and use less GPU memory.
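For reference, a sketch of such a four-layer encoder is below. The 3×3 kernels and 32-channel width are assumptions, chosen only because they reproduce the 7×7 and 35×35 output sizes quoted in the footnote; they may not match the exact architecture used.

```python
import torch
import torch.nn as nn

def conv_encoder(strides=(2, 2, 2, 1), channels=32):
    # Four 3x3 conv layers; channel width is an illustrative assumption.
    layers, c_in = [], 3
    for s in strides:
        layers += [nn.Conv2d(c_in, channels, kernel_size=3, stride=s), nn.ReLU()]
        c_in = channels
    return nn.Sequential(*layers)

x = torch.zeros(1, 3, 84, 84)                # 84 x 84 input image
print(conv_encoder((2, 2, 2, 1))(x).shape)   # -> [1, 32, 7, 7]   (more downsampling)
print(conv_encoder((2, 1, 1, 1))(x).shape)   # -> [1, 32, 35, 35] (prior work)
```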

We include all our own baselines for fair comparison and provide complete settings in an appendix. In each plot, the bold lines show the average return across seeds, and the shaded area around each curve represents one standard deviation in measured return.

4.2. Online RL with ATC

DMControl  In the online setting, we found ATC to be capable of training the encoder by itself (i.e., with the encoder fully detached from any RL gradient update), achieving essentially equal or better scores versus end-to-end RL in all six environments we tested, Figure 2. In CARTPOLE-SWINGUP-SPARSE, where rewards are only received once the pole reaches vertical, ATC training enabled the agent to master the task significantly faster. The encoder is trained with one update for every RL update to the policy, using the same batch size, except in CHEETAH-RUN, which required twice the rate of ATC updates.

DMLab  We experimented with two kinds of levels in DMLab: EXPLORE GOAL LOCATIONS, which requires repeatedly navigating a maze whose layout is randomized every episode, and LASERTAG THREE OPPONENTS, which requires fast reflexes to pursue and tag enemies at a distance. We found ATC capable of training fully detached encoders while achieving equal or better performance than end-to-end RL. Results are shown in Figure 3. Both environments exhibit sparsity which is greater in the "large" version than in the "small" version, which our algorithm addresses, discussed next.

In EXPLORE, the goal object is rarely seen, especially early on, making its appearance difficult to learn. We therefore introduced prioritized sampling for ATC, with priorities corresponding to empirical absolute returns, p ∝ 1 + R_abs, where R_abs = Σ_{t=0}^{n} γ^t |r_t|, to train more frequently on more informative scenes.² Whereas uniform-ATC performs slightly below RL, prioritized-ATC outperforms RL and nearly matches using ATC (uniform) as an auxiliary task. By considering the encoder as a stand-alone feature extractor separate from the policy, no importance sampling correction is required.

²In EXPLORE GOAL LOCATIONS, the only reward is +10, earned when reaching the goal object.
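A small sketch of one way to implement this prioritization is below. Computing R_abs as a forward-looking discounted sum of absolute rewards from each time step is our reading of the formula above, and the function names are illustrative rather than taken from the released code.

```python
import numpy as np

def atc_priorities(rewards, discount=0.99):
    # One reading of p ∝ 1 + R_abs: R_abs is the discounted sum of absolute
    # rewards looking forward from each time step of the trajectory.
    r_abs = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = abs(rewards[t]) + discount * running
        r_abs[t] = running
    return 1.0 + r_abs

def sample_batch_indices(priorities, batch_size, rng=None):
    # Sample observations in proportion to priority. Because the encoder is a
    # stand-alone feature extractor, no importance-sampling correction is used.
    if rng is None:
        rng = np.random.default_rng()
    probs = priorities / priorities.sum()
    return rng.choice(len(priorities), size=batch_size, p=probs)
```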

Figure 2. Online encoder training by ATC, fully detached from RL training, performs as well as end-to-end RL in DMControl, and better in sparse-reward environments (environment steps shown, see appendix for action repeats). Each curve is 10 random seeds.

Figure 3. Online encoder training by ATC, fully detached from the RL agent, performs as well or better than end-to-end RL in DMLab (1 agent step = 4 environment steps, the standard action repeat). Prioritized ATC replay (EXPLORE) or increased ATC training (LASERTAG) addresses sparsities to nearly match the performance of RL with ATC as an auxiliary loss (RL+ATC). Each curve is 5 random seeds.

In LASERTAG, enemies are often seen, but the reward for tagging one is rarely achieved by the random agent. ATC learns the relevant features anyway, boosting performance while the RL-only agent remains at zero average score. We found that increasing the rate of UL training to do twice as many updates³ further improved the score to match the ATC-auxiliary agent, showing flexibility to address the representation-learning bottleneck when opponents are dispersed.

³Since the ATC batch size was 512 but the RL batch size was 1024, performing twice as many UL updates still only consumed the same amount of encoder training data as RL. We did not fine-tune for batch size.

Atari  We tested a diverse subset of eight Atari games, shown in Figure 4. We found detached-encoder training to work as well as end-to-end RL in five games, but performance suffered in BREAKOUT and SPACE INVADERS in particular. Using ATC as an auxiliary task, however, improves performance in these games and others. We found it helpful to anneal the amount of UL training over the course of RL in Atari (details in an appendix). Notably, we found several games, including SPACE INVADERS, to benefit from using ATC only to initialize encoder weights, done using an initial 100k transitions gathered with a uniform random policy. Some of our remaining experiments provide more insights into the challenges of this domain.

4.3. Encoder Pre-Training Benchmarks

To benchmark the effectiveness of different UL algorithms for RL, we propose a new evaluation methodology that is similar to how UL pre-training techniques are measured in computer vision (see e.g. (Chen et al., 2020; Grill et al., 2020)): (i) collect a data set composed of expert demonstrations from each environment; (ii) pre-train the CNN encoder with that data offline using UL; (iii) evaluate by using RL to learn a control policy while keeping the encoder weights frozen. This procedure isolates the asymptotic performance of each UL algorithm for RL. For convenience, we drew expert demonstrations from partially-trained RL agents, and every UL algorithm trained on the same data set for each environment. Our RL agents used the same post-encoder architectures as in the online experiments. Further details about pre-training by each algorithm are provided in an appendix.

DMControl  We compare ATC against two competing algorithms: Augmented Contrast (AC), from CURL (Laskin et al., 2020b), which uses the same observation for the anchor and the positive, and a VAE (Kingma & Welling, 2013), for which we found better performance by introducing a time delay to the target observation (VAE-T). We found ATC to match or outperform the other algorithms in all four test environments, as shown in Figure 5. Further, ATC is the only one to match or outperform the reference end-to-end RL across all cases.

Figure 4. Online encoder training using ATC, fully detached from the RL agent, works well in 5 of 8 games tested (1 agent step = 4 environment steps, the standard action repeat). 6 of 8 games benefit significantly from using ATC as an auxiliary loss or for weight initialization. Each curve is 8 random seeds.

Figure 5. RL in DMControl, using encoders pre-trained on expert demonstrations using UL, with weights frozen: across all domains, ATC outperforms prior methods and the end-to-end RL reference. Each curve is a minimum of 4 random seeds.

DMLab  We compare against both Pixel Control (Jaderberg et al., 2017; Hessel et al., 2019) and CPC (van den Oord et al., 2018), which have been shown to bring strong benefits in DMLab. While all algorithms perform similarly well in EXPLORE, ATC performs significantly better in LASERTAG, Figure 6. Our algorithm is simpler than Pixel Control and CPC in the sense that it uses neither actions, deconvolution, nor recurrence.

Atari  We compare against Pixel Control, VAE-T, and a basic inverse model which predicts actions between pairs of observations. We also compare against Spatio-Temporal Deep InfoMax (ST-DIM), which uses temporal contrastive losses with "local" features from an intermediate convolution layer to ensure attention to the whole screen; it was shown to produce detailed game-state knowledge when applied to individual frames (Anand et al., 2019). Of the four games shown in Figure 7, ATC is the only UL algorithm to match the end-to-end RL reference in GRAVITAR and BREAKOUT, and it performs best in SPACE INVADERS.

Figure 6. RL in DMLab, using pre-trained encoders with weights frozen: in LASERTAG especially, ATC outperforms leading prior UL algorithms. Each curve is 4 random seeds.

4.4. Multi-Task Encoders

In the offline setting, we conducted initial explorations into the capability of ATC to learn multi-task encoders, simply by pre-training on demonstrations from multiple environments. We evaluate the encoder by using it, with frozen weights, in separate RL agents learning each downstream task.

DMControl  Figure 8 shows our results in DMControl, where we pretrained using only the four environments in the top row. Although the encoder was never trained on the HOPPER, PENDULUM, nor FINGER domains, the multi-task encoder supports efficient RL in them. PENDULUM-SWINGUP and CARTPOLE-SWINGUP-SPARSE stand out as challenging environments which benefited from cross-domain and cross-task pre-training, respectively. The pre-training was remarkably efficient, requiring only 20,000 updates to the encoder.

Figure 7. RL in Atari, using pre-trained encoders with weights frozen—ATC outperforms several leading, prior UL algorithms and exceeds the end-to-end RL reference in 3 of the 4 games tested. Each curve is minimum 3 random seeds.

Figure 8. Separate RL agents using a single encoder with weights frozen after pre-training on expert demonstrations from the four top environments. The encoder generalizes to four new environments, bottom row, where sparse-reward tasks especially benefit from the transfer. Each curve is a minimum of 4 random seeds.

Atari  Atari proved a more challenging domain for learning multi-task encoders. Learning all eight games together (see appendix) resulted in diminished performance relative to single-game pretraining in three of the eight. The decrease was partially alleviated by widening the encoder with twice as many filters per layer, indicating that representation capacity is a limiting factor. To test generalization, we conducted a seven-game pre-training experiment where we test the encoder on the held-out game. Most games suffered diminished performance (although still performing significantly higher than with a frozen random encoder), confirming the limited extent to which visual features transfer across these games.

4.5. Ablations and Encoder Analysis

Random Shift in ATC  In offline experiments, we discovered random shift augmentations to be helpful in all domains. To our knowledge, this is the first application of random shift to 3D visual environments as in DMLab. In Atari, we found performance in GRAVITAR to suffer from random shift, but reducing the probability of applying random shift to each observation from 1.0 to 0.1 alleviated the effect while still bringing benefits in other games, so we used this setting in our main experiments. Results are shown in Figure 9.

Random Shift in RL  In DMControl, we found the best results when using random shift during RL, even when training with a frozen encoder. This is evidence that the augmentation regularizes not only the representation but also the policy, which first processes the latent image into a 50-dimensional vector. To unlock the computation and memory benefits of replaying only the latent images for the RL agent, we attempted to apply data augmentation to the latent image. But we found the smallest possible random shifts to be too extreme. Instead, we introduce a new augmentation, subpixel random shift, which linearly interpolates among neighboring pixels.

In subpixel random shift, new pixels are a linearly weighted average of the four nearest pixels to a randomly chosen coordinate location. We used uniformly random horizontal and vertical shifts, and tested maximum displacements in (±) {0.1, 0.25, 0.5, 0.75, 1.0} pixels (with "edge" mode padding ±1). We found 0.5 to work well in all tested domains, restoring the performance of raw image augmentation but eliminating convolutions entirely from the RL training updates. As shown in Figure 10, this augmentation restores performance when applied to the latent images, allowing a pre-trained encoder to be entirely bypassed during policy training updates.
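A minimal sketch of one way to implement subpixel random shift on a batch of latent images is shown below, using bilinear resampling; the use of grid_sample and its "border" padding mode stands in for the "edge"-mode padding described above, and the exact details of the released implementation may differ.

```python
import torch
import torch.nn.functional as F

def subpixel_random_shift(latents, max_shift=0.5):
    """Shift each latent image [B, C, H, W] by a random subpixel amount,
    linearly interpolating among the four nearest pixels; border values
    handle coordinates that fall outside the image."""
    b, c, h, w = latents.shape
    # Identity sampling grid in normalized [-1, 1] coordinates.
    ys = torch.linspace(-1.0, 1.0, h, device=latents.device)
    xs = torch.linspace(-1.0, 1.0, w, device=latents.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([grid_x, grid_y], dim=-1).expand(b, h, w, 2).clone()
    # Per-image shift, uniform in [-max_shift, max_shift] pixels, converted to
    # normalized grid units (one pixel = 2 / (size - 1)).
    dx = (torch.rand(b, 1, 1, device=latents.device) * 2 - 1) * max_shift * 2 / (w - 1)
    dy = (torch.rand(b, 1, 1, device=latents.device) * 2 - 1) * max_shift * 2 / (h - 1)
    grid[..., 0] += dx
    grid[..., 1] += dy
    return F.grid_sample(latents, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

Because the shift acts on the compressed latent image rather than the raw pixels, the augmented samples can be drawn directly from a latent replay buffer without re-running the convolutional encoder.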

Figure 9. Random shift augmentation helps in some Atari games and hurts in others, but applying to each image with probability 0.1 is a performant middle ground. DMLab, a 3-D visual environment, benefits from random shift. (Offline pre-training.)

Figure 10. Even after pre-training encoders for DMControl using random shift, RL requires augmentation—our subpixel augmentation acts on the (compressed) latent image, permitting its use in the replay buffer.

Temporal Contrast on Sequences  In BREAKOUT alone, we discovered that composing the UL training batch of trajectory segments, rather than individual transitions, gave a significant benefit. Treating all elements of the training batch independently provides "hard" negatives, since the encoder must distinguish between neighboring time steps. This setting had no effect in the other Atari games tested, and we found equal or better performance using individual transitions in DMControl and DMLab. Figure 11 further shows that using a similarity loss (Grill et al., 2020) does not capture the benefit.

Figure 11. BREAKOUT benefits from contrasting against negatives from several neighboring time steps.

Encoder Analysis  We analyzed the learned encoders in BREAKOUT to further study this ablation effect. Similar to (Zagoruyko & Komodakis, 2016), we compute spatial attention maps by mean-pooling the absolute values of the activations along the channel dimension and following with a 2-dimensional spatial softmax. Figure 12 shows the attention of four different encoders on the displayed scene. The poorly performing UL encoder heavily utilizes the paddle to distinguish the observation. The UL encoder trained with random shift and sequence data, however, focuses near the ball, as does the fully-trained RL encoder. (The random encoder mostly highlights the bricks, which are less relevant for control.) In an appendix, we include other example encoder analyses from Atari and DMLab which show ATC-trained encoders attending only to key objects on the game screen, while RL-trained encoders additionally attend to potentially distracting features such as the game score.

Figure 12. An example scene from BREAKOUT, where a low-performance UL encoder (without shift) focuses on the paddle. Introducing random shift and sequence data makes the high-performance UL encoder (full ATC) focus near the ball, as does the encoder from a fully-trained, end-to-end RL agent.
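For reference, a short sketch of the attention-map computation described above is given here; it assumes access to the encoder's convolutional activations and is illustrative only.

```python
import torch
import torch.nn.functional as F

def spatial_attention(activations):
    """Attention map from conv activations [B, C, H, W]: mean of absolute
    values over channels, followed by a softmax over the H x W locations."""
    b, c, h, w = activations.shape
    attn = activations.abs().mean(dim=1)         # [B, H, W]
    attn = F.softmax(attn.view(b, -1), dim=-1)   # 2-D spatial softmax
    return attn.view(b, h, w)
```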

We note anecdotally that most of our ATC encoders trained offline reached an accuracy well in excess of 90% at identifying the correct positive example among the hundreds of negatives present in the training batch. Applying random shift augmentation often reduced the accuracy, even while improving RL performance. Among the environments we tested, we did not observe a strong correlation between ATC accuracy at convergence and downstream RL performance relative to end-to-end training, suggesting more research could improve understanding of feature-relevance for RL.

5. Conclusion

Reward-free representation learning from images provides flexibility and insights for improving deep RL agents. We have shown a broad range of cases where our new unsupervised learning algorithm can fully replace RL for training convolutional encoders while maintaining or improving online performance. In a small number of environments (a few of the Atari games), including the RL loss for encoder training still surpasses our UL-only method, leaving opportunities for further improvements in UL for RL.

Our preliminary efforts to use actions as inputs (into the predictor MLP) or as prediction outputs (inverse loss) with ATC did not immediately yield strong improvements. We experimented only with random shift, but other augmentations may be useful, as well. In multi-task encoder training, our technique avoids any need for sophisticated reward-balancing (Hessel et al., 2019), but more advanced training methods may still help when the required features are in conflict, as in Atari, or if they otherwise impact our loss function unequally. On the theoretical side, it may be helpful to analyze the effects of domain shift on the policy when a detached representation is learned online.

One obvious application of our offline methodology would be in the batch RL setting, where the agent learns from a fixed data set. Our offline experiments showed that a relatively small number of transitions are sufficient to learn rich representations by UL, and the lower limit could be further explored. Overall, we hope that our algorithm and experiments spur further developments leveraging unsupervised learning for reinforcement learning.

Acknowledgments

This work was supported by ONR PECASE N00014161723 and Open Philanthropy. We thank Ankesh Anand and Evan Racah for helpful discussions.

References

Anand, A., Racah, E., Ozair, S., Bengio, Y., Côté, M.-A., and Hjelm, R. D. Unsupervised state representation learning in Atari. In Advances in Neural Information Processing Systems, 2019.

Beattie, C., Leibo, J. Z., Teplyashin, D., Ward, T., Wainwright, M., Küttler, H., Lefrancq, A., Green, S., Valdés, V., Sadik, A., Schrittwieser, J., Anderson, K., York, S., Cant, M., Cain, A., Bolton, A., Gaffney, S., King, H., Hassabis, D., Legg, S., and Petersen, S. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. Leveraging procedural generation to benchmark reinforcement learning. arXiv preprint arXiv:1912.01588, 2019.

Devin, C., Abbeel, P., Darrell, T., and Levine, S. Deep object-centric representations for generalizable robot learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7111–7118. IEEE, 2018.

Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and Koltun, V. CARLA: An open urban driving simulator. arXiv preprint arXiv:1711.03938, 2017.

Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 512–519, 2016.

Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.

Guo, D., Pires, B. A., Piot, B., Grill, J.-B., Altché, F., Munos, R., and Azar, M. G. Bootstrap latent-predictive representations for multitask reinforcement learning. arXiv preprint arXiv:2004.14646, 2020.

Guo, Z. D., Azar, M. G., Piot, B., Pires, B. A., and Munos, R. Neural predictive belief representations. arXiv preprint arXiv:1811.06407, 2018.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In International Conference on Artificial Intelligence and Statistics, 2010.

Ha, D. and Schmidhuber, J. World models. arXiv preprint arXiv:1803.10122, 2018.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, 2019.

Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

Hénaff, O. J., Srinivas, A., De Fauw, J., Razavi, A., Doersch, C., Eslami, S., and Oord, A. v. d. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.

Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In AAAI Conference on Artificial Intelligence, 2018.

Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with PopArt. In AAAI Conference on Artificial Intelligence, 2019.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations, 2017.

Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.

Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
A., Piot, B., Grill, J.-b., Altche,´ F., Munos, ment learning with unsupervised auxiliary tasks. In Inter- R., and Azar, M. G. Bootstrap latent-predictive repre- national Conference on Learning Representations, 2017. sentations for multitask reinforcement learning. arXiv preprint arXiv:2004.14646, 2020. Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, Guo, Z. D., Azar, M. G., Piot, B., Pires, B. A., and Munos, R. N. C., Morcos, A. S., Ruderman, A., et al. Human-level Neural predictive belief representations. arXiv preprint performance in 3d multiplayer games with population- arXiv:1811.06407, 2018. based reinforcement learning. Science, 364(6443):859– 865, 2019. Gutmann, M. and Hyvarinen,¨ A. Noise-contrastive esti- mation: A new estimation principle for unnormalized Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, statistical models. In International Conference on Artifi- A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., cial Intelligence and Statistics, 2010. Vanhoucke, V., et al. Qt-opt: Scalable deep reinforcement Decoupling Representation Learning from Reinforcement Learning

learning for vision-based robotic manipulation. arXiv Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. preprint arXiv:1806.10293, 2018. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. Deepmind control suite. arXiv preprint Kingma, D. P. and Welling, M. Auto-encoding variational arXiv:1801.00690, 2018. bayes. arXiv preprint arXiv:1312.6114, 2013. van den Oord, A., Li, Y., and Vinyals, O. Representa- Kipf, T., van der Pol, E., and Welling, M. Contrastive tion learning with contrastive predictive coding. arXiv learning of structured world models. arXiv preprint preprint arXiv:1807.03748, 2018. arXiv:1911.12247, 2019. Kostrikov, I., Yarats, D., and Fergus, R. Image augmentation Wayne, G., Hung, C.-C., Amos, D., Mirza, M., Ahuja, is all you need: Regularizing deep reinforcement learning A., Grabska-Barwinska, A., Rae, J., Mirowski, P., from pixels. arXiv preprint arXiv:2004.13649, 2020. Leibo, J. Z., Santoro, A., et al. Unsupervised predic- tive memory in a goal-directed agent. arXiv preprint Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., and arXiv:1803.10760, 2018. Srinivas, A. Reinforcement learning with augmented data. arXiv preprint arXiv:2004.14990, 2020a. Yan, W., Vangipuram, A., Abbeel, P., and Pinto, L. Learning predictive representations for deformable objects using Laskin, M., Srinivas, A., and Abbeel, P. Curl: Contrastive contrastive estimation. arXiv preprint arXiv:2003.05436, unsupervised representations for reinforcement learn- 2020. ing. In International Conference on Machine Learning, 2020b. Zagoruyko, S. and Komodakis, N. Paying more attention to attention: Improving the performance of convolutional Lee, A. X., Nagabandi, A., Abbeel, P., and Levine, neural networks via attention transfer. arXiv preprint S. Stochastic latent actor-critic: Deep reinforcement arXiv:1612.03928, 2016. learning with a model. arXiv preprint arXiv:1907.00953, 2019. Lee, K.-H., Fischer, I., Liu, A., Guo, Y., Lee, H., Canny, J., and Guadarrama, S. Predictive information accelerates learning in rl. Advances in Neural Information Processing Systems, 33, 2020. Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to- end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016. Mazoure, B., Combes, R. T. d., Doan, T., Bachman, P., and Hjelm, R. D. Deep reinforcement and infomax learning. arXiv preprint arXiv:2006.07217, 2020. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidje- land, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529–533, 2015. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asyn- chronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. Schwarzer, M., Anand, A., Goel, R., Hjelm, R. D., Courville, A., and Bachman, P. Data-efficient reinforcement learn- ing with momentum predictive representations. arXiv preprint arXiv:2007.05929, 2020.