Unsupervised State Representation Pretraining in Reinforcement Learning Applied to Atari Games

DEGREE PROJECT IN ARCHITECTURE, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

FRANCESCO NUZZO
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Machine Learning
Date: October 28, 2020
Supervisors: Fredrik Carlsson, Ali Ghadirzadeh
Examiner: Danica Kragic Jensfelt
Host company: RISE
Swedish title: Oövervakad förträning av tillståndsrepresentation i förstärkningsinlärning tillämpat på atari-spel

Abstract

State representation learning aims to extract useful features from the observations received by a Reinforcement Learning agent interacting with an environment. These features give the agent a low-dimensional and informative representation that improves its efficiency in solving tasks. In this work, we study unsupervised state representation learning in Atari games. We use an RNN architecture for learning features that depend on sequences of observations, and pretrain a single-frame encoder architecture with different methods on randomly collected frames. Finally, we empirically evaluate how pretrained state representations perform compared with a randomly initialized architecture. For this purpose, we let an RL agent train on 22 different Atari 2600 games, initializing the encoder either randomly or with one of the following unsupervised methods: VAE, CPC and ST-DIM. Promising results are obtained in most games when ST-DIM is chosen as the pretraining method, while VAE often performs worse than a random initialization.

Summary (Sammanfattning)

State representation learning is about extracting useful features from the observations received by an agent interacting with an environment in reinforcement learning. These features enable the agent to exploit the low-dimensional and informative representation to improve its efficiency when solving tasks. In this work, we study unsupervised learning in Atari games. We use an RNN architecture to learn features that depend on sequences of observations, and pretrain a single-frame encoder architecture with different methods on randomly collected frames. Finally, we empirically evaluate how pretrained state representations perform compared with a randomly initialized architecture. For this purpose, we let an RL agent train on 22 different Atari 2600 games, initializing the encoder either randomly or with one of the following unsupervised methods: VAE, CPC and ST-DIM. Promising results are obtained in most games when ST-DIM is chosen as the training method, while VAE often performs worse than a random initialization.

Acknowledgement

This thesis was possible thanks to RISE and the computational resources they made available for running most of the experiments. I would like to thank my supervisors Fredrik Carlsson and Ali Ghadirzadeh for their help and advice throughout the work, and Prof. Danica Kragic for following and evaluating the thesis.

I am very grateful to all my friends who supported and encouraged me during this Master's, without whom it would not have been so wonderful.

Most importantly, I would like to express my profound gratitude to my family for sustaining me while studying at KTH, and to Valeria and Vlera for constantly demonstrating their precious and motivating support.

Contents

1 Introduction
    1.1 Research Question
    1.2 Ethics, Societal Aspects and Sustainability
2 Background
    2.1 Reinforcement Learning
        2.1.1 Markov Decision Process
    2.2 Representation Learning
        2.2.1 Convolutional Neural Networks
        2.2.2 Recurrent Neural Networks
    2.3 State Representation Learning for Control and Problem Formulation
3 Related Work
    3.1 State Representation Learning
        3.1.1 Contrastive methods
        3.1.2 Robotic priors and auxiliary objective functions
        3.1.3 Evaluating a state representation
4 Methods
    4.1 Overview
    4.2 Architecture
        4.2.1 Traditional CNN architecture
        4.2.2 Proposed CNN+RNN architecture
    4.3 Encoder pretraining
        4.3.1 Spatiotemporal Deep Infomax (ST-DIM)
        4.3.2 Variational AutoEncoder (VAE)
        4.3.3 Contrastive Predictive Coding (CPC)
    4.4 Evaluation
5 Results
6 Discussion
7 Conclusions
    7.1 Future work
Bibliography
A Hyperparameters

Chapter 1 Introduction

The vast class of deep representation learning algorithms has contributed significantly to a variety of machine learning problems across numerous domains. Often, the representation learned by a model is the result of end-to-end learning that makes use of labeled data or rewards. Moreover, the high complexity of such models and their enormous number of parameters often make them sample-inefficient and limit their capacity for generalization and transfer learning. However, the human brain appears to learn mainly without explicit supervision, indicating that there exist priors that we can leverage to learn to extract compact and useful information from our perceptual data, i.e. to model representations of useful features independently of the task to be performed.

In the context of Reinforcement Learning, learning a representation is a fundamental component of effective and efficient policy optimization. State representation learning focuses on extracting low-dimensional features from observational data captured from the environment, which restricts the search space of the policy to lower dimensions.
The representation should capture variations in the environment, such as the position, speed and direction of moving objects; a feature vector containing this type of information is particularly suitable and useful for robotics and control tasks. Using a low-dimensional representation has several advantages: it naturally tackles the curse of dimensionality [1, p. 932], it accelerates the process of learning an optimal policy [2, 3] and it improves the interpretability and explainability of the model [4], providing an easier way to study the cause-effect relationships learned by the model. Moreover, faster policy learning means a more efficient use of data and therefore better sample efficiency, which is especially useful in real-world applications, such as robotics, where interacting with the environment is often expensive.

Unsupervised state representation learning aims at learning a representation from unlabeled data, i.e. observations for which the corresponding true state of the environment is not available, and then reusing that representation on the same environment to train an agent to perform a given task. The learned representation makes it possible to transfer knowledge about how the environment works, so that an agent can improve and speed up its learning process on that environment for different tasks and reward functions.

In our work, we use different unsupervised state representation algorithms to pretrain the architecture on frames collected from Atari 2600 games. We then empirically evaluate and compare them by using the pretrained parameters to initialize the feature extractor's architecture and training an RL agent to maximize the reward signal received from the environment.
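The pretrain-then-initialize comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the names `init_params`, `build_agent` and `act_logits`, the linear encoder and all dimensions are hypothetical stand-ins for the actual encoder and policy, and the "pretrained" weights are random placeholders for parameters that would come from VAE, CPC or ST-DIM pretraining.

```python
import copy
import random


def init_params(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_agent(obs_dim, state_dim, n_actions, pretrained_encoder=None, seed=0):
    """Build a toy agent whose encoder is either random or pretrained.

    Hypothetical helper: the encoder is a single linear map, used only to
    show where pretrained parameters would be plugged in.
    """
    rng = random.Random(seed)
    # The policy head is always randomly initialized. It is drawn first so
    # that both variants share the same head for a given seed.
    policy_head = init_params(n_actions, state_dim, rng)
    if pretrained_encoder is None:
        encoder = init_params(state_dim, obs_dim, rng)  # random baseline
    else:
        encoder = copy.deepcopy(pretrained_encoder)  # reuse pretrained weights
    return {"encoder": encoder, "policy_head": policy_head}


def matvec(matrix, vector):
    """Multiply a list-of-lists matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def act_logits(agent, observation):
    """Encode the observation into a low-dimensional state, then score actions."""
    state = matvec(agent["encoder"], observation)
    return matvec(agent["policy_head"], state)


# Placeholder for weights produced by an unsupervised pretraining method.
pretrained = init_params(4, 16, random.Random(123))

baseline_agent = build_agent(16, 4, 3)
pretrained_agent = build_agent(16, 4, 3, pretrained_encoder=pretrained)
logits = act_logits(pretrained_agent, [0.5] * 16)
```

In this sketch both agents would then be trained with the same RL algorithm on the same reward signal, so any difference in learning speed can be attributed to the encoder initialization.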
1.1 Research Question

This thesis addresses two research questions:

• How can the state representation architecture of an RL agent for Atari games be modified to account for temporal features without stacking multiple frames?
• Is it possible to improve the training performance of the agent by pretraining the state representation architecture on a small amount of unlabeled data?

1.2 Ethics, Societal Aspects and Sustainability

We highlight some aspects of deep learning models that are crucial under modern legislation. Since 2018, decision-making algorithms have been regulated by the European Union to include a “right to explanation”, a “right to opt-out” and “non-discrimination” of models [5]. Two problems might arise in this regard when deploying State Representation Learning algorithms: difficult interpretability and models biased by their training data.

Firstly, restricting the representation to a lower dimension makes it easier to interpret which characteristic of the environment each latent variable corresponds to [6, p. 308]. In Lesort et al. [7], interpretability in the context of State Representation Learning is defined as the capacity of a human to link a variation in the representation to a variation in the environment. An interpretable state representation improves the explainability of the learnt policy, making it easier to understand what the agent has learned.

Secondly, unsupervised training in RL is biased towards the type of data that can be collected through a random policy, possibly restricting the breadth of the representation to a limited part of the state space. As observed in [8], high-quality representations depend on well-designed automatic exploration. From an ethical perspective, this is fundamental to guaranteeing equity and fairness in decisions made by algorithms that affect people's lives.

State dimension is chosen empirically.
This choice relates to the bias-variance trade-off: a higher-dimensional state increases the capacity of the model and reduces the
