DEGREE PROJECT IN ARCHITECTURE, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020
Unsupervised state representation pretraining in Reinforcement Learning applied to Atari games
FRANCESCO NUZZO
KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Machine Learning
Date: October 28, 2020
Supervisors: Fredrik Carlsson, Ali Ghadirzadeh
Examiner: Danica Kragic Jensfelt
School of Electrical Engineering and Computer Science
Host company: RISE
Swedish title: Oövervakad förträning av tillståndsrepresentation i förstärkningsinlärning tillämpat på atari-spel
Abstract
State representation learning aims to extract useful features from the observations received by a Reinforcement Learning agent interacting with an environment. These features allow the agent to take advantage of a low-dimensional and informative representation to solve tasks more efficiently. In this work, we study unsupervised state representation learning in Atari games. We use an RNN architecture for learning features that depend on sequences of observations, and pretrain a single-frame encoder architecture with different methods on randomly collected frames. Finally, we empirically evaluate how pretrained state representations perform compared with a randomly initialized architecture. For this purpose, we let an RL agent train on 22 different Atari 2600 games, initializing the encoder either randomly or with one of the following unsupervised methods: VAE, CPC and ST-DIM. Promising results are obtained in most games when ST-DIM is chosen as the pretraining method, while VAE often performs worse than a random initialization.
Sammanfattning (Summary)
State representation learning is about extracting useful features from the observations received by an agent interacting with an environment in reinforcement learning. These features allow the agent to take advantage of the low-dimensional and informative representation to improve efficiency when solving tasks. In this work, we study unsupervised learning in Atari games. We use an RNN architecture for learning features that depend on sequences of observations, and pretrain a single-frame encoder architecture with different methods on randomly collected frames. Finally, we empirically evaluate how pretrained state representations perform compared with a randomly initialized architecture. For this purpose, we let an RL agent train on 22 different Atari 2600 games, initializing the encoder either randomly or with one of the following unsupervised methods: VAE, CPC and ST-DIM. Promising results are obtained in most games when ST-DIM is chosen as the pretraining method, while VAE often performs worse than a random initialization.
Acknowledgement
This thesis was possible thanks to RISE and their computational resources made available for running most of the experiments.
I would like to thank my supervisors Fredrik Carlsson and Ali Ghadirzadeh for their help and providing advice throughout the work, and Prof. Danica Kragic for following and evaluating the thesis.
I am very grateful to all my friends that supported and encouraged me during this Master, without whom it would not have been so wonderful.
Most importantly, I would like to express my profound gratitude to my family for sustaining me while studying at KTH, and to Valeria and Vlera for constantly demonstrating their precious and motivating support.

Contents
1 Introduction
  1.1 Research Question
  1.2 Ethics, Societal Aspects and Sustainability

2 Background
  2.1 Reinforcement Learning
    2.1.1 Markov Decision Process
  2.2 Representation Learning
    2.2.1 Convolutional Neural Networks
    2.2.2 Recurrent Neural Networks
  2.3 State Representation Learning for Control and Problem Formulation

3 Related Work
  3.1 State Representation Learning
    3.1.1 Contrastive methods
    3.1.2 Robotic priors and auxiliary objective functions
    3.1.3 Evaluating a state representation

4 Methods
  4.1 Overview
  4.2 Architecture
    4.2.1 Traditional CNN architecture
    4.2.2 Proposed CNN+RNN architecture
  4.3 Encoder pretraining
    4.3.1 Spatiotemporal Deep Infomax (ST-DIM)
    4.3.2 Variational AutoEncoder (VAE)
    4.3.3 Contrastive Predictive Coding (CPC)
  4.4 Evaluation

5 Results

6 Discussion

7 Conclusions
  7.1 Future work

Bibliography

A Hyperparameters
Chapter 1
Introduction
The vast class of deep representation learning algorithms has contributed significantly to a variety of machine learning problems across numerous domains. Often, the representation learned by a model is the result of end-to-end learning that makes use of labeled data or rewards. Moreover, the high complexity of such models and their enormous number of parameters often make them sample-inefficient and limit their capacity for generalization and transfer learning. However, the human brain appears to learn mainly without explicit supervision, indicating that there exist priors that we can leverage to learn to extract compact and useful information from our perceptual data, i.e. to model representations of useful features independently of the task to be performed.
In the context of Reinforcement Learning, learning a representation is a fundamental component for effective and efficient policy optimization. State representation learning focuses on extracting low-dimensional features from observational data captured from the environment, which makes it possible to restrict the search space of the policy to fewer dimensions. The representation should capture variations in the environment, such as the position, speed and direction of moving objects; a feature vector containing this type of information is particularly suitable and useful for robotics and control tasks. The choice of a low-dimensional representation comes with several advantages: it naturally tackles the curse of dimensionality [1, p. 932], it accelerates the process of learning an optimal policy [2, 3] and it improves the interpretability and explainability of the model [4], providing an easier way to study the cause-effect relationships learned by the model. Moreover, faster policy learning means a more efficient use of data, and therefore improved sample efficiency, which is especially useful in real world applications, such as robotics, where interacting with the environment is often expensive.
Unsupervised state representation learning aims at learning a representation from unlabeled data, i.e. observations for which the corresponding true state of the environment is not available, and then reusing that representation on the same environment to train an agent to perform a given task. The learned representation carries knowledge about how the environment works, and therefore makes it possible for an agent to improve and speed up learning on that environment for different tasks and reward functions.
In our work, we use different unsupervised state representation algorithms to pretrain the architecture on frames collected on Atari 2600 games. Then we empirically evaluate and compare them by using the pretrained parameters as initialization of the feature extractor’s architecture and train an RL agent to maximize the reward signal received from the environment.
1.1 Research Question
This thesis focuses on addressing two research questions:
• How to modify the state representation architecture of an RL agent for Atari games to account for temporal features without stacking multiple frames?
• Is it possible to improve the training performance of the agent by pretraining the state representation architecture on a small amount of unlabeled data?
1.2 Ethics, Societal Aspects and Sustainability
We highlight the importance of some crucial aspects of deep learning models with respect to modern legislation. Since 2018, decision-making algorithms have been regulated by the European Union to include a “right to explanation”, a “right to opt-out” and “non-discrimination” of models [5]. Two problems might arise in this regard when deploying State Representation Learning algorithms: difficult interpretability and models biased by their training data.
Firstly, restricting the representation to a lower dimension makes it easier to interpret which characteristic of the environment each latent variable corresponds to [6, p. 308]. In Lesort et al. [7], interpretability in the context of State Representation Learning is defined as the capacity for a human to link a variation in the representation to a variation in the environment. The interpretability of the state representation improves the explainability of the learnt policy, and therefore helps in understanding what the agent has learned.
Secondly, unsupervised training in RL is biased towards the type of data that can be collected through a random policy, possibly restricting the breadth of the representation to a limited part of the state space. As observed in [8], high quality representations depend on a well-designed automatic exploration. From an ethical perspective, this is fundamental to guarantee equity and fairness in decisions made by algorithms that affect people's lives.
The state dimension is chosen empirically. This choice relates to the bias-variance trade-off: a higher-dimensional state increases the capacity of the model and reduces the training error, but may also lead to overfitting.

Chapter 2
Background
This chapter provides an introduction to the theoretical aspects on which this thesis is built. We first describe the key relevant concepts of Reinforcement Learning (RL) in Section 2.1 and of Representation Learning through Deep Learning (DL) in Section 2.2. Secondly, we illustrate the concept of State Representation Learning (SRL), and how it relates to RL algorithms. Finally, we formulate the problem that we are addressing with our research questions. For further details, the reader is referred to Sutton and Barto [9] and Goodfellow, Bengio, and Courville [10] for RL and Deep Learning respectively.
2.1 Reinforcement Learning
Reinforcement Learning is an area of Machine Learning concerned with making an agent learn to achieve a goal by interacting with an environment through actions. At each time step, the agent observes the state of the environment, i.e. the set of values describing its characteristics, performs an action and receives a reward signal. The choice of the action to take in a given state is defined by the policy. Training an agent to find an optimal policy that maximizes the cumulative reward requires balancing the exploration of new state-action pairs against the exploitation of the current best policy. This is known as the exploration-exploitation dilemma.
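As an illustration of this trade-off, the sketch below uses the ε-greedy rule, a common heuristic that is named here only as an example and is not taken from the thesis itself. The function name, the action-value estimates and the value of ε are all hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability epsilon a uniformly random action is taken (exploration);
    otherwise the action with the highest estimate is taken (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical value estimates for three actions in some state.
q = [0.1, 0.7, 0.3]
greedy_action = epsilon_greedy(q, epsilon=0.0)                        # always exploits
random_action = epsilon_greedy(q, epsilon=1.0, rng=random.Random(0))  # always explores
```

A small ε keeps the agent mostly on its current best action while still visiting other state-action pairs from time to time.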
2.1.1 Markov Decision Process
In RL, the problem to solve is described as a Markov Decision Process (MDP). While in this section we assume that the observation received by the agent corresponds to the true state of the environment, often this is not the case, and the problem needs to be formulated differently, as we discuss in Section 2.3. An MDP is a 5-tuple (S, A, P, R, γ), where
• S is a set of states, called the state space, describing all the possible configurations of the environment.

• A is a set of actions, called the action space.

• P(s, a, s′) ∈ P is the probability of transitioning from state s at time t to s′ at time t + 1 by taking action a.

• R(s, a, s′) ∈ R is the immediate reward received after a transition from state s to s′, due to action a.

• γ is the discount factor which is used to generate a discounted reward.
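To make the role of the discount factor concrete: a reward received t steps in the future is weighted by γ^t before being summed. The following is a minimal sketch of that computation for a finite list of observed rewards; the function name and the reward values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], with t starting at 0.

    gamma = 1.0 gives the plain (undiscounted) sum of rewards;
    gamma < 1.0 gives progressively less weight to later rewards.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical rewards collected along a short trajectory.
rewards = [1.0, 0.0, 2.0]
undiscounted = discounted_return(rewards, gamma=1.0)  # 1 + 0 + 2
discounted = discounted_return(rewards, gamma=0.5)    # 1 + 0.5*0 + 0.25*2
```

With γ = 0.5 the final reward of 2.0 contributes only 0.5 to the total, which shows how smaller discount factors make the agent favour rewards that arrive sooner.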
We define the policy π(a|s), which determines the probability that the agent performs action a in state s. The time horizon T defines the time that the agent has to complete the task. If T < ∞, the problem is a finite horizon MDP, and the objective is

\[ \max_{\pi} \; \mathbb{E}\left[ \sum_{t=1}^{T} R_t\left(s_t^{\pi}, a_t^{\pi}, s_{t+1}^{\pi}\right) \right] \tag{2.1} \]

therefore maximizing the sum of the immediate rewards along the trajectory. Alternatively, if the time horizon is infinite (T → ∞), the problem is defined as an infinite horizon MDP or discounted MDP. The objective becomes

\[ \max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R_t\left(s_t^{\pi}, a_t^{\pi}, s_{t+1}^{\pi}\right) \right] \tag{2.2} \]

where the discount factor γ ∈ [0, 1) determines how important future rewards are relative to immediate ones. Since the policy is conditioned on the state, it is important that the state fully captures the meaningful information about the current configuration of the environment. However, as we elaborate in Section 2.3, this assumption might not always hold, especially for Atari games.
2.2 Representation Learning
Feature generation has two major areas: Representation Learning and hand-crafted feature engineering [11]. In our work, we focus on how to learn representations, rather than on manually designing transformations to extract features. Representation Learning is the field of Machine Learning concerned with learning abstract features that characterize data, and is sometimes referred to as Feature Learning. The general task of representation learning is to discover a mapping $f : X^D \to Y^d$, which transforms the raw input data that lie in the D-dimensional original feature space X, e.g. pixels, to a more compact d-dimensional representation space Y, with usually $d \ll D$.

It is often difficult to evaluate the quality of a representation, especially because the importance of some features might depend on the task that we want to solve. Nevertheless, we can identify general-purpose priors that are desirable for a good representation, as discussed in [12]. Examples of such priors are the following:
• Smoothness: the learned mapping f is smooth if the values of f(x) and of its derivative f′(x) are close to the values of f(x + ∆) and f′(x + ∆) respectively when x and x + ∆ are close, as defined by a kernel or a distance.
• A hierarchical organization of explanatory factors: useful features that describe the world can be expressed as a composition of other concepts, in a hierarchy where more abstract information is on a higher level. This key concept relates to feature re-use and abstraction in deep architectures.
• Simplicity of factor dependencies: the dependencies that relate high-level factors with each other are simple, typically linear.
• Manifolds: the probability mass is assumed to concentrate in a region of much smaller dimensionality than the original space where the raw data lives.
• Natural clustering: in classification, local variations on the same manifold do not induce a change in the value of the category. In general, a linear interpolation between examples belonging to different classes involves going through a low density region, where P(X | Y = i) for different i does not overlap among classes. This reflects the idea that humans categorize and separate classes based on the statistical structure that underlies the data.
• Temporal and spatial coherence: time-dependent features are assumed to change slowly over time. A small variation tends to keep the observations on the surface of the high-density manifold.
• Sparsity: many values of an encoded observation are often zero. This comes from the hypothesis that, for a given observation, only a small fraction of the possible hidden features are relevant. This prior is at the core of models such as over-complete AutoEncoders [13] and the class of Sparse Coding methods [14, 15].

A representation learning method is often more powerful as it incorporates more of the above-mentioned general priors.

As we further discuss in Chapter 3, it is important to distinguish how representation learning takes place in supervised, self-supervised and unsupervised learning. Both unsupervised learning and self-supervised learning make use of non-annotated data. While the former mainly focuses on generative or reconstructive losses, the latter defines surrogate losses that are optimized on labels derived automatically from the data itself. These two classes of methods have gained increasing interest, as their purpose is to extract relevant information without any prior knowledge about the task for which that information is going to be exploited. Therefore, they relate more to real-world intelligence, focusing on generalization and knowledge transfer, following the idea that the human brain mostly learns without labeled data.

On the other hand, supervised learning methods are more prone to overfit the training data and might incur a significant generalization error. Supervised learning is more limited than unsupervised and self-supervised learning: it is highly dependent on the quantity and quality of the labeled data, which might be expensive to produce.

Khastavaneh and Ebrahimpour-Komleh [11] categorize representation learning methods in four main classes: sub-space based, manifold based, shallow architectures, and deep architectures.
• Sub-space Based Representation Learning Approaches. This class of algorithms mainly makes use of linear combinations of the original features to generate new features through basis functions. The most popular of these methods include Principal Component Analysis (PCA) [16], Metric Multi-Dimensional Scaling (MDS) [17], Independent Component Analysis (ICA) [18], and Linear Discriminant Analysis (LDA) [19].

• Manifold Based Representation Learning Approaches. Manifold based methods use non-linear transformations, and assume that the data lie on or near a low-dimensional manifold. Each method reduces the dimension of the original data while attempting to preserve the geometrical properties characterizing the underlying manifold.

• Shallow Representation Learning Approaches. Shallow architecture methods comprise multilayer perceptrons with fewer than five layers and local kernel machines. However, they require a number of parameters that is exponential in the input dimension, and are therefore not considered to be sufficiently compact. Examples of shallow methods are Restricted Boltzmann Machines (RBMs) [20], KernelPCA [21], Autoencoders (AE) [22], as well as Variational Autoencoders (VAE) [23].

• Deep Representation Learning Approaches. Deep architectures are employed to tackle the limitations of shallow architectures. The hierarchical nature of deep architectures allows the data to be processed multiple times on different layers, automatically discovering abstractions from low level observations to high level concepts. Because of the large number of layers and therefore parameters, deep neural networks (DNN) require a large amount of training data to generalize well. In order to promote generalization, different techniques, such as regularization through dropout, are used. Examples of deep neural networks are Deep Belief Networks (DBN) [24], Convolutional Neural Networks (CNN) [25] and Deep Autoencoders.
Khastavaneh and Ebrahimpour-Komleh [11] consider deep architectures as the most complete class of methods for successful representation learning, as they cover more general priors of real-world intelligence [26, 27]. The hierarchical organization of features is the most important prior that they incorporate, organizing features on multiple levels of abstraction. Furthermore, since low-level features, such as edges for images, are useful in multiple tasks, deep architectures lend themselves well to transfer learning. For this reason, deep architectures are often trained on large datasets, and then fine-tuned for a specific task. We briefly discuss Convolutional Neural Networks (CNN) and how they learn to extract relevant features. Similarly, we discuss Recurrent Neural Networks (RNN) in the following section.
2.2.1 Convolutional Neural Networks
Convolutional neural networks are deep networks based on the mathematical operation of convolution, and are the most popular deep learning architecture for image processing. CNNs capture spatial dependencies by analyzing nearby pixels through a filter. Each filter is convolved with the input to compute a feature map, and a non-linear activation function, such as ReLU, is applied to the output. As shown in Figure 2.1, another common component of a CNN is the pooling layer, which reduces the spatial size of its input and extracts dominant features.
Figure 2.1: A CNN for classifying handwritten digits from the MNIST dataset. Image from [28].
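The three operations just described (filtering, non-linearity and pooling) can be sketched in plain Python on a toy image. This is only an illustration under simplifying assumptions: a single channel, no padding, stride 1, and a hand-written kernel rather than a learned one. All names and values are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    As in most deep learning libraries, the kernel is not flipped, so this is
    strictly a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(feature_map):
    """Element-wise max(0, x)."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling; an odd trailing row or column is dropped."""
    h, w = len(feature_map) // 2, len(feature_map[0]) // 2
    return [
        [max(feature_map[2 * i][2 * j], feature_map[2 * i][2 * j + 1],
             feature_map[2 * i + 1][2 * j], feature_map[2 * i + 1][2 * j + 1])
         for j in range(w)]
        for i in range(h)
    ]

# Toy 5x5 image with a vertical edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
# Hand-written filter that responds to dark-to-bright vertical edges.
kernel = [[-1.0, 1.0],
          [-1.0, 1.0]]

feature_map = relu(conv2d_valid(image, kernel))  # 4x4 map, non-zero only at the edge
pooled = max_pool_2x2(feature_map)               # 2x2 map that keeps the edge response
```

The feature map is non-zero only where the filter overlaps the edge, and pooling halves each spatial dimension while preserving that response.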
In a CNN, each layer learns filters of increasing complexity (Figure 2.2). The first layers learn basic feature detection filters: edges, corners and basic small patterns. The middle layers learn filters that detect parts of objects. For faces, they might learn to respond to eyes, noses, etc. The last layers learn higher-level representations: they recognize full objects, in different shapes and positions.
Figure 2.2: Visualization of a CNN from Zeiler and Fergus [29].
In the context of our work, it is relevant to discuss how self-supervised and unsupervised training methods can be used for performing Transfer Learning. This paradigm makes it possible to reuse the trained weights for accelerating and/or improving the supervised training.

On the one hand, when a CNN is trained with labels, it learns features in order to discriminate an image belonging to a class from the others. Supervised classification, however, may easily lead to overfitting on the training samples. Moreover, it has been shown that CNNs trained on ImageNet are biased towards texture, in contrast with how humans are hypothesized to perceive images [30].

On the other hand, self-supervised and unsupervised learning provide a more general and flexible framework for dealing with the limited availability of large labeled datasets and for focusing the learning process on extracting the underlying latent factors of the data. Among the methods of these classes that make use of CNNs we can find AutoEncoders, GANs, self-supervised prediction of image rotation, and many others. In our work, the weights are first trained with an unsupervised method, and then fine-tuned during the learning process of the RL agent.
2.2.2 Recurrent Neural Networks
Recurrent neural networks [31] are a class of neural networks that are naturally suited to processing time-series data and other sequential data, such as natural language. In our work we use a particular type of RNN, called Long Short-Term Memory (LSTM) [32]. An LSTM unit is made of a cell, an input gate, an output gate and a forget gate, as shown in Figure 2.3. This architecture makes it possible to remember values over time and therefore to model long-term dependencies.
Figure 2.3: The Long Short-Term Memory (LSTM) cell can process data sequentially and keep its hidden state through time. [33]
The use of gates helps the architecture overcome the vanishing gradient problem observed in vanilla RNNs. Each gate selectively retains or discards information. First, the forget gate decides what information should be thrown away or kept, passing the hidden state and the input through a sigmoid function. Then, the input gate updates the cell state with the information coming from the new input and the hidden state. Finally, the output gate computes the new hidden state that will be propagated to the next time step.
In the context of state representation learning, LSTMs are useful for POMDPs as they extract features that depend on sequences of input data. In a typical RL setting, this could be the direction and velocity or acceleration of a moving object, or the behavior state of an element observed for multiple frames, such as a non-player character (NPC) in an Atari game.
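The gate sequence described above can be written out as a single time step. The sketch below follows the standard LSTM formulation with list-based vectors and matrices; the parameter layout (a dictionary keyed by gate name) and all numeric values are assumptions made for illustration, not the configuration used in this thesis.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(weights, bias, vector):
    """Return weights @ vector + bias for list-based matrices."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    `params` maps each of 'f', 'i', 'g', 'o' to a pair (W, b), where W has one
    row per hidden unit and len(h_prev) + len(x) columns.
    """
    z = h_prev + x  # concatenation of previous hidden state and current input
    f = [sigmoid(v) for v in affine(*params["f"], z)]    # forget gate
    i = [sigmoid(v) for v in affine(*params["i"], z)]    # input gate
    g = [math.tanh(v) for v in affine(*params["g"], z)]  # candidate cell values
    o = [sigmoid(v) for v in affine(*params["o"], z)]    # output gate
    # New cell state: keep part of the old state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed view of the cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Hypothetical sizes and all-zero parameters, so every gate outputs 0.5.
hidden, inputs = 2, 1
zeros = ([[0.0] * (hidden + inputs) for _ in range(hidden)], [0.0] * hidden)
params = {name: zeros for name in "figo"}
h, c = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)
```

With all-zero parameters the forget gate halves the previous cell state and the candidate contributes nothing, which makes the data flow easy to follow by hand before moving to learned weights.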
2.3 State Representation Learning for Control and Problem Formulation
State Representation Learning, in the context of Reinforcement Learning, is a particular kind of representation learning where the learned features are embedded in a low-dimensional space and evolve through time, often as a consequence of the agent's actions.
It is important to notice the difference between observation and state. An observation consists of the raw data collected from the environment, for example measurements coming from robotic sensors, temperatures, positions, angles, pixels. In order to make an agent learn from these data, there are two main problems that we should address:
• How to extract important features from an observation?
• How to make the state hold the Markov property?
The first problem is related to the curse of dimensionality. With most learning algorithms, if the dimensionality of the input is large, the learner will need to see numerous samples in order to model the target function. This comes from the fact that in a high-dimensional input space, two samples might be distant, while being very close on the low-dimensional data manifold. Following Raffin et al. [34], SRL corresponds to learning a mapping ϕ from the observation space O to the state space S. The policy exploits the state $s_t \in S$ to output an action $a_t$ to maximize the expected reward:
\[ o_t \xrightarrow[\text{SRL}]{\phi} s_t \xrightarrow[\text{RL}]{\pi} a_t \tag{2.3} \]

The second problem is related to sequences in temporal systems, such as RL environments. If the RL problem is formulated as an MDP, we assume that the Markov property holds for that stochastic process. By definition, a stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only upon the present state. In terms of conditional probabilities, this corresponds to

\[ P(X_n = x_n \mid X_{n-1} = x_{n-1}, \dots, X_0 = x_0) = P(X_n = x_n \mid X_{n-1} = x_{n-1}). \tag{2.4} \]

However, whether the process is guaranteed to be Markovian, i.e. whether the Markov property holds, is often unknown. This entirely depends on what information the state $s_t$ represents. We argue that the state $s_t$ needs to be augmented with additional information coming from previous observations of the environment in order to make the Markov property hold. To address this problem, we modify Equation 2.3 to make the current state depend not only on the current observation, but also on all the previous ones, as follows

\[ H_t \xrightarrow[\text{SRL}]{\phi} s_t \xrightarrow[\text{RL}]{\pi} a_t \tag{2.5} \]

where $H_t = (o_1, o_2, \dots, o_t)$ is the history of observations up to timestep t. Note that we might as well consider $H_t$ itself as the state of the environment for it to be Markovian. However, the further step of computing a low-dimensional $s_t = \phi(H_t)$ is necessary to address the curse of dimensionality.

Chapter 3
Related Work
The state representation problem focuses on learning a low-dimensional embedding that only preserves useful information and reduces the dimensionality of the search space, therefore aiming to tackle a major challenge of Reinforcement Learning: sample efficiency. This chapter summarizes different supervised and unsupervised methods used to learn state representations in both model-free and model-based RL, discussing how they differ from each other as well as their advantages and drawbacks.
3.1 State Representation Learning
We can distinguish two main types of State Representation Learning (SRL) algorithms: supervised and unsupervised. While the former consists in learning a representation solely based on the optimization of a policy for a specific reward function, the latter utilizes unsupervised methods to learn to extract useful features from environment observations, independently of the task to be solved.
SRL is often not specifically addressed when trying to solve an Atari game with RL. Atari games require both a good understanding of the features embedded in a frame, such as the position of the player or the enemies, and a representation of temporal features, i.e. features that can only be defined from two or more observations, for example the velocity of elements in a frame and the direction in which they are moving. In this section we discuss how some model-free and model-based RL algorithms address the SRL problem in game environments.
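The need for more than one observation can be illustrated with a toy example. The sketch below assumes a hypothetical one-dimensional ball moving at constant velocity, where each observation is just the current position. None of the function names come from the thesis.

```python
def phi_single(observation):
    """State built from the current observation only: just the position."""
    return (observation,)

def phi_history(history):
    """State built from a history of observations: the position plus a velocity
    estimate from the last two observations (0.0 if only one is available)."""
    position = history[-1]
    velocity = history[-1] - history[-2] if len(history) > 1 else 0.0
    return (position, velocity)

def predict_next_position(state):
    """Predict the next position, assuming the ball keeps its current velocity."""
    position = state[0]
    velocity = state[1] if len(state) > 1 else 0.0
    return position + velocity

# Two hypothetical trajectories that end in the same position.
moving_right = [1.0, 2.0, 3.0]
moving_left = [5.0, 4.0, 3.0]

same_single_state = phi_single(moving_right[-1]) == phi_single(moving_left[-1])
next_right = predict_next_position(phi_history(moving_right))
next_left = predict_next_position(phi_history(moving_left))
```

Both trajectories produce the same single-frame state, so no predictor could tell them apart from that state alone, whereas the history-based state separates them and yields different predictions.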
One of the first methods proposed as a solution to Atari games is DQN [35]. The authors address the problem of SRL by pre-processing the frames to reduce the use of computational resources. Namely, each frame is down-sampled from 210 × 160 to 110 × 84, converted to gray-scale, and cropped to 84 × 84. The problem of time-dependent features is addressed by stacking the last 4 frames of the history to produce the input to the CNN, which extracts a 256-dimensional feature vector used to learn the Q function. Frame stacking has become a widely adopted method to account for temporal features. However, it can be argued that 4 consecutive frames might not be enough for temporal features that vary over longer periods of time. Moreover, model-free RL methods such as DQN learn a supervised state representation that is guided by the reward, and this representation is not guaranteed to be flexible enough for transferring knowledge to other tasks on the same environment, i.e. the representation and the policy overfit the task-specific reward function.
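The pre-processing and stacking steps above can be sketched as follows. This is a minimal standard-library illustration, not the original DQN code: the nearest-neighbour resizing, the luminance weights, the crop offset and the handling of the first frame of an episode are all assumptions chosen for simplicity.

```python
from collections import deque

RAW_H, RAW_W = 210, 160      # raw Atari 2600 frame size (rows, columns)
SMALL_H, SMALL_W = 110, 84   # size after down-sampling
CROP = 84                    # side of the final square frame
CROP_TOP = 18                # assumed row offset of the crop (not given in the text)
STACK = 4                    # number of stacked frames

def downsample(frame, out_h=SMALL_H, out_w=SMALL_W):
    """Nearest-neighbour resize of a frame stored as a list of pixel rows."""
    in_h, in_w = len(frame), len(frame[0])
    return [[frame[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]

def to_gray(frame):
    """RGB to gray-scale using the common BT.601 luminance weights (an assumption)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in frame]

def crop(frame, top=CROP_TOP, size=CROP):
    """Keep a size x size window starting at row `top`."""
    return [row[:size] for row in frame[top:top + size]]

def preprocess(frame):
    """Down-sample, convert to gray-scale, then crop, in the order given in the text."""
    return crop(to_gray(downsample(frame)))

class FrameStack:
    """Keeps the last STACK pre-processed frames; the returned list is the CNN input."""

    def __init__(self, size=STACK):
        self.frames = deque(maxlen=size)

    def push(self, frame):
        processed = preprocess(frame)
        if not self.frames:
            # Assumption: at the start of an episode the first frame is repeated.
            self.frames.extend([processed] * self.frames.maxlen)
        else:
            self.frames.append(processed)
        return list(self.frames)

# Usage with a synthetic all-black RGB frame.
black = [[(0, 0, 0)] * RAW_W for _ in range(RAW_H)]
stack = FrameStack()
observation = stack.push(black)  # 4 frames of 84 x 84 gray-scale values
```

The `deque` with `maxlen=4` discards the oldest frame automatically, so the agent only ever sees the four most recent frames. Anything that happened earlier is invisible to it.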
A more recent work proposes a model-based approach, called MuZero, to solve Atari games [36]. In the context of our work, it is of particular interest to discuss the representation and the dynamics functions. The input of the representation function includes a stack made of the last 32 RGB frames at resolution 96 × 96, and a plane for each action that led to each of those frames, for a total input of shape 96 × 96 × 128. The dynamics function takes as input the output of the representation function, which has dimensionality 6 × 6 × 256, and a plane representing the action according to the policy. The model learns to predict those aspects of the future that are directly relevant for planning, and is able to learn to play board games, such as chess, without being given the explicit state transition according to the rules. The hidden states are free to represent the state in whatever way is relevant to predicting current and future values and policies. This allows the agent to invent, internally, the rules or dynamics that are most convenient for planning. While overall the method achieves state-of-the-art results, providing a new way to effectively perform model-based RL on Atari games, it is important to consider its limitations regarding computational feasibility and flexibility. Stacking 32 frames certainly gives important temporal information; however, this greatly increases the computational resources required, and assumes that the last 32 frames contain enough information, i.e. that they make the state satisfy the Markov property. Furthermore, the representation is constrained by the task-specific reward function, and might not be reusable for learning new tasks on the same environment.
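The input shape quoted above follows directly from the stated numbers, and a short calculation makes the computational cost relative to DQN-style frame stacking explicit. The variable names are hypothetical, and the comparison counts only raw input values, not network compute.

```python
FRAMES = 32          # stacked RGB frames
RGB_CHANNELS = 3
HEIGHT = WIDTH = 96

frame_planes = FRAMES * RGB_CHANNELS  # 96 colour planes
action_planes = FRAMES                # one plane per action that led to a frame
total_planes = frame_planes + action_planes

input_shape = (HEIGHT, WIDTH, total_planes)
values_per_input = HEIGHT * WIDTH * total_planes

# Four stacked 84 x 84 gray-scale frames, as in the DQN setting.
dqn_values = 84 * 84 * 4
ratio = values_per_input / dqn_values
```

Under these numbers a single input holds about 1.18 million values, roughly 40 times more than four stacked 84 × 84 gray-scale frames.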
Other works have focused on learning in an unsupervised manner. Namely, a model is first trained on data collected with a random policy to learn a representation of the world and its dynamics, which is then used to train an agent to perform a task in the same environment. A method of this kind, called World Model, is formulated in [37], although it is not tested on Atari games, but rather in the Car Racing and VizDoom environments. The model is illustrated in Figure 3.1. Firstly, a random agent is used to collect 10,000 rollouts with a maximum length of 1,000 steps, containing both observations and actions. Secondly, a Variational AutoEncoder (VAE) is trained to learn a 32-dimensional latent representation of each observed frame. Then, an MDN-RNN is trained to predict the probability distribution of the embedding of the next frame, $P(z_{t+1} \mid a_t, z_t, h_t)$. In particular, the authors show empirically that this component is fundamental for the stability of the driving agent, as the hidden state $h_t$ of the RNN accounts for temporal features and allows the agent to tackle sharp corners effectively. Finally, a controller C is trained to maximize the expected cumulative reward. The controller is a linear model, deliberately designed to be small and simple, so that most of the agent's complexity resides in the world model. The advantage of such a model is that it is general and modular enough to prevent the state representation from overfitting the reward function, by decoupling these two optimization tasks. Furthermore, the agent can be trained inside the World Model, providing a solution for environments where interaction is expensive and sample efficiency is crucial. Nevertheless, the authors point out that for more complex environments, such as Atari games, an iterative procedure might be required, as new environment scenarios are discovered and explored and the World Model representation needs to be updated.
Figure 3.1: World Models architecture. Image from [37]
Figure 3.2: Phases of the SimPLe algorithm. Image from [38]
A similar approach that builds on world models is presented by Kaiser et al. [38]. They introduce Simulated Policy Learning (SimPLe), a model-based deep RL algorithm based on video prediction models. It alternates between learning a world model and using this model to optimize a policy with model-free RL (Figure 3.2). The model receives as input four stacked frames and the action selected by the agent, and predicts the next frame and the expected reward using the architecture shown in Figure 3.3. This setting proves to be particularly effective in the low-data regime, i.e. 400k frames on Atari games. However, the model turned out to be very computationally expensive, requiring three weeks of training per game. The burden comes from the 15.2M interactions that take place in the learned world model during the iterative training procedure. Another drawback is the dependency of the model on a defined reward function, which makes it harder to transfer the learned world model to a new task.
Figure 3.3: Architecture of the stochastic model in SimPLe. Image from [38]
Hausknecht and Stone [39] argue that stacking four frames might not be enough for Atari games. Some games might require more temporal information than that contained in four stacked frames. For this reason, the authors formulate the RL problem with single-frame observations as a POMDP. The proposed solution is a change in the architecture, as shown in Figure 3.4: the CNN takes as input a tensor of dimension 84 × 84 × 1, and its output is processed through an LSTM that accounts for temporal information. They also demonstrate the efficacy of this approach with a Flickering Pong environment, showing that a single frame and a recurrent network are sufficient to compensate for the flickering and to model the velocity of the ball.
Figure 3.4: DRQN architecture proposed by Hausknecht and Stone [39] where an LSTM is added at the end of the network. Image from [39]
In the context of robotic control, several approaches have been proposed for state representation learning. Hämäläinen et al. [40] propose a variational encoder-decoder structure that uses a low-dimensional latent representation computed by the encoder to create an affordance map, as opposed to reconstructing the input as in a VAE. Each pixel value in a channel of the affordance map represents how likely it is that the affordance occurs at that position. However, this method requires labeled affordance images, and is therefore not unsupervised.
Ghadirzadeh et al. [41] find a low-dimensional state representation for policy learning by training a perception super-layer. The architecture uses a convolutional spatial autoencoder and a Variational Autoencoder, both trained end-to-end with the policy. In particular, the two autoencoder structures are respectively trained to extract task-relevant states from raw camera observations and to learn, with a generative model, the low-dimensional representation in the action manifold representing long motor trajectories.
Chen et al. [42] demonstrate that it is possible to train visuomotor policies and transfer them to novel task domains by using adversarial training. The adversarial method is trained end-to-end to extract visual features that generalize to instances of task objects with the supervision of weak labels indicating whether an image contains a task object or not. The adversarial features are pretrained jointly using PPO, and two auxiliary networks, a discriminator and a classifier, are used respectively to discriminate whether the image comes from the source or the target domain, and to recognize whether there is a task object in the image from the target domain or not.
3.1.1 Contrastive methods
Contrastive learning is a class of unsupervised methods adopted for learning representations. Such algorithms force the inner product of representations of similar pairs with each other to be higher on average than with negative samples. Arora et al. [43] analyze the theory behind contrastive learning by introducing a framework based on latent classes, assuming that semantically similar data points belong to the same latent class. First, they show how the learned representation successfully contains useful features for classification tasks. Secondly, the generalization bound indicates that contrastive learning with blocks of negative samples promotes a representation that is beneficial for sample complexity and generalization.
Gutmann and Hyvärinen [44] introduce noise-contrastive estimation (NCE) as a means of "learning by comparison" in the context of unnormalized models, i.e. probabilistic models with an analytically intractable normalizing constant. The unsupervised learning problem is recast as a supervised one: a model is trained through binary classification to discriminate the original observed data from artificially generated noise, sampled from a contrastive noise distribution. Formally, having T observed data points xt and T artificially generated data points yt, we want to optimize
$$J_T(\theta) = \frac{1}{2T} \sum_{t} \ln\left[h(x_t; \theta)\right] + \ln\left[1 - h(y_t; \theta)\right] \tag{3.1}$$
with $h(u; \theta) = r_\nu(G(u; \theta))$, where $r_\nu(u) = \frac{1}{1 + \nu \exp(-u)}$ is a parametrized logistic function and $G(u; \theta) = \ln p_m(u; \theta) - \ln p_n(u)$ is the log-ratio between the class-conditional pdfs $p_m(\cdot\,; \theta)$ and $p_n$, respectively for positive and negative samples. Understandably, the choice of the noise distribution is crucial: it should be close to the data distribution, because otherwise the classification task might be too easy and no learning of actual data structures would be required to distinguish the two classes.
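As a minimal sketch of objective (3.1), the snippet below evaluates $J_T$ on a one-dimensional toy problem. The Gaussian model, the Gaussian noise distribution, the sample sizes and all function names are illustrative assumptions and are not taken from [44]; for simplicity the toy model is already normalized, whereas NCE is designed for unnormalized ones.

```python
import math
import random


def log_gauss(u, mean, std):
    """Log-density of a normalized 1-D Gaussian (toy stand-in for p_m and p_n)."""
    return -0.5 * ((u - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def nce_objective(xs, ys, log_pm, log_pn, nu=1.0):
    """J_T from Eq. (3.1): xs are observed points, ys are noise points."""
    def h(u):
        g = log_pm(u) - log_pn(u)               # G(u; theta), the log-ratio
        return 1.0 / (1.0 + nu * math.exp(-g))  # r_nu(G(u; theta))

    total = sum(math.log(h(x)) for x in xs) + sum(math.log(1.0 - h(y)) for y in ys)
    return total / (2 * len(xs))


random.seed(0)
xs = [random.gauss(2.0, 1.0) for _ in range(500)]  # observed data (assumed N(2, 1))
ys = [random.gauss(0.0, 2.0) for _ in range(500)]  # artificial noise (assumed N(0, 4))


def log_pn(u):
    return log_gauss(u, 0.0, 2.0)


# Objective for a model centred on the data and for a badly placed model.
j_good = nce_objective(xs, ys, lambda u: log_gauss(u, 2.0, 1.0), log_pn)
j_bad = nce_objective(xs, ys, lambda u: log_gauss(u, -2.0, 1.0), log_pn)
```

In this toy setting the objective is larger for the model whose mean matches the data, which is the behaviour the classifier-based formulation relies on.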
Based on Gutmann and Hyvärinen [44], Oord, Li, and Vinyals [45] introduce Contrastive Predictive Coding (CPC) for learning representations. They train an encoder and an autoregressive model to optimize a loss named InfoNCE. This objective function aims to maximally preserve the mutual information (MI) between the original signals x and the context latent representation c, defined as
$$I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x \mid c)}{p(x)}. \tag{3.2}$$
The authors prove that minimizing the InfoNCE loss maximizes a lower bound on mutual information. Furthermore, they show that adding CPC as an auxiliary loss significantly improves the performance of an RL agent trained with A2C on 3D environments of DeepMind Lab [46].
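To make definition (3.2) concrete, the following sketch evaluates the mutual information of a small discrete joint distribution given as a table. The function name and the table layout (rows index x, columns index c) are assumptions made for illustration only.

```python
import math


def mutual_information(joint):
    """I(x; c) = sum_{x,c} p(x, c) * log(p(x | c) / p(x)) for a discrete joint table."""
    p_x = [sum(row) for row in joint]        # marginal over c
    p_c = [sum(col) for col in zip(*joint)]  # marginal over x
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xc in enumerate(row):
            if p_xc > 0.0:
                # p(x | c) / p(x) equals p(x, c) / (p(x) * p(c))
                mi += p_xc * math.log(p_xc / (p_x[i] * p_c[j]))
    return mi


independent = [[0.25, 0.25], [0.25, 0.25]]  # x carries no information about c
dependent = [[0.5, 0.0], [0.0, 0.5]]        # x determines c exactly
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The two tables bracket the possible values for binary variables: zero nats when x and c are independent, and log 2 nats when one determines the other.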
Another method for learning unsupervised representations is introduced by Hjelm et al. [47]. This approach, named Deep InfoMax (DIM), maximizes mutual information while favouring representations that preserve locally-consistent information across structural locations, such as image patches. With the aim of extracting useful representations for classification, DIM maximizes the average MI between the high-level representation, i.e. the output of the encoder, and local patches of the image, which are middle-level feature maps of the encoder. As MI estimators, the authors compare the one based on the Donsker-Varadhan representation (DV) [48], the Jensen-Shannon divergence (JSD) [49] and InfoNCE [45]. The JSD estimator is found to be insensitive to the number of negative samples, while InfoNCE performs better with larger batches of negative samples, and DV consistently performs worse than InfoNCE.
Anand et al. [50] present Spatiotemporal Deep InfoMax (ST-DIM), an unsupervised representation learning technique that maximizes the mutual information in the representations across spatial and temporal axes. This method is based on both DIM and InfoNCE. From the former, following Hjelm et al. [47], they maximize a sum of patch-level mutual information objectives. Following Oord, Li, and Vinyals [45], they use InfoNCE for modelling two objectives that respectively account for spatial and temporal information between two consecutive observations. Moreover, a new evaluation method for state representation learning in Atari games is proposed. Specifically, the encoder's weights trained to minimize the loss function are kept fixed, and a linear classifier that takes as input the output of the encoder is trained to predict the RAM state, which represents the ground truth of the environment state at a specific time step. ST-DIM outperforms generative methods, such as VAE and pixel prediction, as well as other contrastive methods, such as CPC. In our work, we adopt ST-DIM as a pretraining method for state representation learning.
3.1.2 Robotic priors and auxiliary objective functions
State representation learning can be augmented through the addition of auxiliary loss functions that assume various specific constraints and serve different purposes depending on the environment characteristics. The learning process of the agent can be constrained by prior knowledge to account for intuitive physics, physical laws, mental states of other agents, and modelling regularities such as compositionality and causality. Lesort et al. [8] provide a review of state-of-the-art approaches to loss functions designed for state representation learning.
These priors can be used in two ways: either as auxiliary loss functions while training the RL agent, or as the objective of an unsupervised state representation pretraining. The following loss functions are applied to the state space to constrain the representation according to specific criteria. We denote by $\Delta s_t = s_{t+1} - s_t$ the difference between the states at two consecutive timesteps $t$ and $t+1$, and by $D$ a set of observations.
• Slowness Principle
The slowness principle is used in physical systems where features of interest are assumed to fluctuate slowly and continuously through time and abrupt changes are unlikely [51, 7, 52].
$$L_{\text{Slowness}}(D, \phi) = \mathbb{E}\left[\lVert \Delta s_t \rVert^2\right] \tag{3.3}$$
• Variability
The variability prior assumes that the positions of relevant objects vary, and the state representation should focus on such moving elements [52].
$$L_{\text{Variability}}(D, \phi) = \mathbb{E}\left[e^{-\lVert s_{t_1} - s_{t_2} \rVert}\right] \tag{3.4}$$
This prior is used to counter-balance the slowness prior, which would otherwise lead to constant values.
• Proportionality
The proportionality prior, proposed by Jonschkowski and Brock [51], is used in systems where we expect the environment to change with the same magnitude as a response to the same stimulus, i.e. the agent's action.
$$L_{\text{Prop}}(D, \phi) = \mathbb{E}\left[\left(\lVert \Delta s_{t_2} \rVert - \lVert \Delta s_{t_1} \rVert\right)^2 \,\middle|\, a_{t_1} = a_{t_2}\right] \tag{3.5}$$
• Repeatability
If the same action was applied at times $t_1$ and $t_2$ and the two states were similar, the state change should be similar both in magnitude and direction. This prior is named repeatability loss [51].
$$L_{\text{Rep}}(D, \phi) = \mathbb{E}\left[e^{-\lVert s_{t_2} - s_{t_1} \rVert^2} \lVert \Delta s_{t_2} - \Delta s_{t_1} \rVert^2 \,\middle|\, a_{t_1} = a_{t_2}\right] \tag{3.6}$$
• Controllability
We consider controllable things to be relevant. Objects that can be manipulated by an agent are likely to be useful to represent in order to achieve a given task. The following robotic prior connects the accelerations of objects to the actions of the agent. Jonschkowski et al. [52] define a loss function per action dimension $i$ that, when minimized, maximizes the covariance between action dimension $i$ and the acceleration in state dimension $i$, denoted $s^{(a)}_{t+1,i}$.
$$L_{\text{controllability}}(i) = e^{-\operatorname{Cov}\left(a_{t,i},\, s^{(a)}_{t+1,i}\right)} = e^{-\mathbb{E}\left[\left(a_{t,i} - \mathbb{E}[a_{t,i}]\right)\left(s^{(a)}_{t+1,i} - \mathbb{E}\left[s^{(a)}_{t+1,i}\right]\right)\right]} \tag{3.7}$$
• Selectivity
A further assumption consists in considering the features of the state space as independently controllable factors that change according to the agent's actions. By assuming that the state space is $K$-dimensional, we train $K$ policies $\pi_k$ so that each policy selectively causes a change only to the $k$-th feature of the state representation. Thomas et al. [53] introduce the following selectivity loss
$$L_{\text{sel}}(D, \phi, k) = \mathbb{E}\left[\frac{\left\lVert s^{(k)}_{t+1} - s^{(k)}_{t} \right\rVert}{\sum_{k'} \left\lVert s^{(k')}_{t+1} - s^{(k')}_{t} \right\rVert} \,\middle|\, s_{t+1} \sim P^{a}_{s_t, s_{t+1}}\right] \tag{3.8}$$
• Causality
The causality prior is used as an auxiliary objective for state representation learning when a reward signal is available [52, 7]. This prior assumes that if the same action $a_{t_1} = a_{t_2}$, performed at two different timesteps $t_1$ and $t_2$, returns two different rewards $r_{t_1+1}$ and $r_{t_2+1}$, then the two states in which the action was taken should be differentiated and distant in the representation space.
$$L_{\text{Caus}}(D, \phi) = \mathbb{E}\left[e^{-\lVert s_{t_2} - s_{t_1} \rVert^2} \,\middle|\, a_{t_1} = a_{t_2},\; r_{t_1+1} \neq r_{t_2+1}\right] \tag{3.9}$$
• Dynamic verification
Dynamic verification is a discriminative approach proposed by Shelhamer et al. [54] that consists in learning to detect a corrupted observation $o_{t_c}$ in a sequence of consecutive observations $o_t$, where $t \in [0, K]$. Each observation is first encoded into a state, and then a classifier predicts the index of the corrupted observation in the sequence.
• Forward prediction
Forward models rely on a loss function that computes the prediction error on the next state, given the current state and the action performed by the agent [55].
$$L_{\text{fwd}}\left(s_{t+1}, \hat{f}(s_t, a_t)\right) = \frac{1}{2}\left\lVert \hat{f}(s_t, a_t) - s_{t+1} \right\rVert_2^2 \tag{3.10}$$
In non-deterministic systems, the prediction might be a distribution instead of a single state value, such as the MDN-RNN in [37]. The prediction error can also be used as an internal reward to guide the exploration of novel states [55].
• Inverse prediction
By turning around the forward model, we can use two consecutive states $s_t$ and $s_{t+1}$ to predict which action $a_t$ made the transition happen. This inverse model is implemented by Pathak et al. [55] and integrated together with the forward model.
$$\hat{a}_t = g\left(s_t, s_{t+1}; \theta_I\right) \tag{3.11}$$
We minimize the following loss, which measures the difference between the performed action $a_t$ and the predicted action $\hat{a}_t$:
$$L_{\text{inv}}\left(\hat{a}_t, a_t\right) \tag{3.12}$$
In [55], the loss corresponds to Maximum Likelihood Estimation of a multinomial distribution.
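Several of the priors listed above can be sketched as empirical averages over a single recorded trajectory. The code below is a hedged illustration of Eqs. (3.3) to (3.6) and (3.9), not the implementation of [51, 52]: the function names, the list-based state format, the use of all index pairs within one trajectory to approximate the expectations, and the convention that `rewards[t]` stores $r_{t+1}$ are all assumptions.

```python
import math
from itertools import combinations


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def deltas(states):
    """Delta s_t = s_{t+1} - s_t for every step that has a successor."""
    return [sub(nxt, cur) for cur, nxt in zip(states, states[1:])]


def same_action_pairs(actions, n_steps):
    """Pairs t1 < t2 with a_t1 == a_t2, restricted to steps with a successor."""
    return [(i, j) for i, j in combinations(range(n_steps), 2) if actions[i] == actions[j]]


def slowness(states):  # Eq. (3.3)
    return mean(norm(d) ** 2 for d in deltas(states))


def variability(states):  # Eq. (3.4)
    return mean(math.exp(-norm(sub(a, b))) for a, b in combinations(states, 2))


def proportionality(states, actions):  # Eq. (3.5)
    d = deltas(states)
    return mean((norm(d[j]) - norm(d[i])) ** 2
                for i, j in same_action_pairs(actions, len(d)))


def repeatability(states, actions):  # Eq. (3.6)
    d = deltas(states)
    return mean(math.exp(-norm(sub(states[j], states[i])) ** 2) * norm(sub(d[j], d[i])) ** 2
                for i, j in same_action_pairs(actions, len(d)))


def causality(states, actions, rewards):  # Eq. (3.9); rewards[t] holds r_{t+1}
    d = deltas(states)
    return mean(math.exp(-norm(sub(states[j], states[i])) ** 2)
                for i, j in same_action_pairs(actions, len(d))
                if rewards[i] != rewards[j])


# Toy trajectory: the state moves one unit along the first axis at every step.
states = [[float(t), 0.0] for t in range(5)]
actions = [0, 0, 0, 0, 0]
```

On this toy trajectory every step has the same magnitude and direction, so the proportionality and repeatability terms vanish while the slowness term equals the squared step length.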
Jonschkowski et al. [52] use an encoder for extracting features regarding the robot's positions and velocities at the current timestep. In their work, the encoder is pretrained in an unsupervised manner with some of the above loss functions on a dataset of randomly sampled trajectories. Through PCA, the learned state representation is shown to successfully encode meaningful information in a very compact state space. The authors also show that, by training a small multilayer perceptron after the encoder, it is possible to regress the ground truth state with low mean squared error.
3.1.3 Evaluating a state representation
There exist several measures to assess the quality of a state representation [8]:
1. Task performance
2. Disentanglement metric score
3. Distortion
4. NIEQA (Normalization Independent Embedding Quality Assessment)
5. KNN-MSE
6. Supervised learning
Task performance is the most common evaluation metric, and it consists of letting an RL agent learn to accomplish a task while using the learned representation, in order to test its transferability [51, 52, 56, 57, 58, 55, 54, 59, 60, 61, 62, 63, 64]. However, especially in real world environments, training an RL agent is costly and inefficient, as it requires plentiful resources in terms of observational data received from the environment, training time and computation. For this reason, other methods have been designed to evaluate representations.
Thomas et al. [53] assume that there are latent factors in the observations coming from the RL environment that are independently controllable, meaning that when an action modifies the value of one of these variables, the values of the other factors are not affected. However, this is a fair assumption only when actions are known to be independent. Higgins et al. [65] propose a way to quantitatively estimate the amount of disentanglement by using a disentanglement metric score. It is based on the assumption that the ground truth process utilizes a set of generative factors, some of which are conditionally independent and interpretable. The disentanglement score is the accuracy of a simple, low-capacity linear classifier with low VC-dimension. The classifier predicts which generative factor was kept fixed, given the differences between pairs of representations that share the same value of that factor.
Two state representation evaluation methods based on manifold learning theory are Distortion and NIEQA.
Indyk [66] shows that distortion, defined as the measure of how local and global geometry coherence in the representation varies with respect to the ground truth, is a useful measure in the context of embeddings. In particular, considering that many problems over metric spaces are defined in terms of the input metric properties, a low-distortion embedding makes it possible to reduce a problem to simpler metrics.
Zhang, Ren, and Zhang [67] first propose an anisotropic scaling independent measure (ASIM), and then, based on it, introduce a quality estimation method named normalization independent embedding quality assessment (NIEQA). This method helps to address two important questions:
• Does the representation preserve the geometric structure of local neighborhoods?
• Does the representation preserve the global topology of the manifold?
This is done by performing a local assessment, a global assessment, and a combination of the two. The local assessment measures the average ASIM between a local neighborhood on the data manifold and its corresponding embedding. The global assessment checks whether the representation preserves the global topology structure by measuring the matching degree, through geodesic distance, between a neighborhood of "representative" data samples and its corresponding embedding. Lastly, the overall assessment is defined as a linear combination of the other two.
Nearest-neighbor search allows a visual inspection of the representation space: given two semantically similar observations, we can verify that they are as close in the representation space as they are in the ground truth state space. In addition to this qualitative comparison, we can derive a quantitative measure called KNN-MSE. By using the ground truth state space value of the observations, the KNN-MSE is computed as
$$\text{KNN-MSE}(s) = \frac{1}{k} \sum_{s' \in \text{KNN}(s, k)} \lVert \tilde{s} - \tilde{s}' \rVert^2 \tag{3.13}$$
where $s$ is the considered state, $s'$ is one of its $k$ nearest neighbors, and $\tilde{s}$ and $\tilde{s}'$ are their respective associated ground truth values.
If labels are available, it is possible to train a supervised model on top of the feature extractor that computes the state representation. Jonschkowski et al. [52] train a fully connected neural network to perform regression from the learned representation to the actual state values. An example of supervised classification for state representation evaluation can be found in [50]. The authors use linear probing, i.e. they train linear classifiers to predict the latent generative factors from the learned representation and report their accuracy. In this case, these factors are represented by the RAM annotations of each frame of an Atari 2600 game.
Chapter 4
Methods
4.1 Overview
In this work, we focus on two main aspects that we believe might improve the state representation that an RL agent uses and refines while learning to play Atari 2600 games. Specifically, we explore how to enhance state representation learning so that it captures two important kinds of features: spatial and temporal. For this reason, we use an architecture that, as opposed to the convention of stacking four consecutive raw observations, receives as input one single frame and temporal information propagated from the previous step. This modification is based on the following hypothesis: it is possible to improve sample efficiency on Atari games by representing temporal information and increasing the number of gradient updates while keeping the number of processed frames unchanged. This becomes possible by extracting temporal features through an RNN instead of using a CNN on stacked game frames.
4.2 Architecture
Many architectures for state representation employed in RL agents on Atari games use an input consisting of the last four frames observed by the agent. However, this implies that the agent will be unable to master games that require the player to remember events that occurred more than four observations in the past. In other words, even though the game states are often assumed to be Markovian, any game that requires a memory of more than four frames will turn out to be non-Markovian, as future game states would depend on more than a single input.
4.2.1 Traditional CNN architecture
The most common architecture used for extracting features from raw observations consists of a CNN, illustrated in Figure 4.1, that takes as input four preprocessed game frames [35]. Grayscaling and downsampling are applied, which change the shape of the input from 4 × 210 × 160 × 3 to 4 × 84 × 84.
Figure 4.1: Architecture from Mnih et al. [35] (DQN), often adopted for fea- ture extraction in other RL algorithms applied to Atari games.
4.2.2 Proposed CNN+RNN architecture
Inspired by [39], our architecture receives as input a single frame at each time step. Similarly, we transform the observation from RGB values to grayscale. However, as opposed to [39], we do not downsample to 84 × 84, and instead take the input observation $o_t$ at its original size of 210 × 160. This is a consequence of our choice to use the same CNN encoder architecture as in [50]. The complete architecture of the CNN encoder is shown in Figure 4.2.
Figure 4.2: Architecture from Anand et al. [50], with the minor change of the feature size increased from 256 to 512. The input is not downsampled to allow the subsequent linear probing for the RAM annotations prediction.
The only difference lies in the size of the feature vector extracted by the encoder: while in [50] it has 256 dimensions, we use 512.
Figure 4.3: Architecture of the RL agent. In all our experiments $z_t \in \mathbb{R}^{512}$ and $h_t \in \mathbb{R}^{512}$.
Figure 4.3 shows the whole architecture for the state representation, policy and value function. The observed frame $o_t$ passes through the encoder, which extracts the feature vector $z_t = \phi(o_t)$. Then, the agent receives a 512-dimensional hidden state vector $h_{t-1}$ from the previous timestep. This vector contains temporal information that requires more than one frame to be extracted, such as velocity, movement directions and other higher-level features. The RNN takes as input $h_{t-1}$ and $z_t$ and computes the updated temporal information $h_t$, which is a fundamental part of the state representation. In order to relieve the RNN of the burden of learning an identity mapping for useful spatial features extracted by the encoder, we use as state representation the concatenation of the two feature vectors $x_t = [z_t \; h_t]$, with an overall dimensionality of 1024. This is similar to [37], where the same concatenation takes place, but differs in how the RNN is used: while in [37] the RNN is pre-trained to predict the next encoded frame $z_{t+1}$ from such input, we use the RNN to compute an enriched temporal state representation vector $h_t$. Similarly to [37], the concatenated input is fed to the controller, which in our agent consists of the actor and the critic models, as shown in Figure 4.3. These two simple models consist of a single linear layer each. This choice lets the agent focus more easily on the credit assignment problem in the low-dimensional search space.
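The data flow described above can be sketched in a few lines. This is an illustrative toy and not the implementation used in this work: the encoder is replaced by a single linear layer with ReLU instead of the CNN of Figure 4.2, the recurrent cell is assumed to be a plain tanh RNN, the dimensions are tiny rather than 512, and the class and function names are hypothetical.

```python
import math
import random


def linear(weights, bias, x):
    """y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def init(n_out, n_in, rng):
    """Random weight matrix and zero bias (placeholder initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class ToyAgent:
    def __init__(self, obs_dim, feat_dim, n_actions, seed=0):
        rng = random.Random(seed)
        self.enc = init(feat_dim, obs_dim, rng)          # stand-in for the CNN encoder phi
        self.rnn = init(feat_dim, 2 * feat_dim, rng)     # acts on [z_t, h_{t-1}]
        self.actor = init(n_actions, 2 * feat_dim, rng)  # single linear layer
        self.critic = init(1, 2 * feat_dim, rng)         # single linear layer

    def step(self, obs, h_prev):
        z = [max(0.0, v) for v in linear(*self.enc, obs)]          # z_t = phi(o_t)
        h = [math.tanh(v) for v in linear(*self.rnn, z + h_prev)]  # h_t from (h_{t-1}, z_t)
        x = z + h                                                  # x_t = [z_t, h_t]
        logits = linear(*self.actor, x)
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        policy = [e / total for e in exps]                         # softmax over actions
        value = linear(*self.critic, x)[0]
        return policy, value, h


agent = ToyAgent(obs_dim=6, feat_dim=4, n_actions=3)
h = [0.0] * 4                       # initial hidden state
policy, value, h = agent.step([0.1] * 6, h)
```

The hidden state returned by `step` is passed back in at the next time step, which is how temporal information propagates without stacking frames.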
4.3 Encoder pretraining
Following Anand et al. [50], we pretrain the encoder with different methods on data collected from the environment by letting an agent interact through actions sampled from a uniformly random policy. The intent is to transfer knowledge about the partial state representation that the encoder is able to extract after being trained on single observations. We compare three different pretraining methods: VAE, ST-DIM and CPC. The pretrained weights are used as an initialization of the actor-critic agent's encoder and are set to be trainable. Furthermore, we compare with a random initialization that uses semi-orthogonal matrices, as suggested in Saxe, McClelland, and Ganguli [68]. In our experiments, each encoder is pretrained on 100,000 grayscale frames of shape 210 × 160 from the environment.
4.3.1 Spatiotemporal Deep Infomax (ST-DIM)
ST-DIM is a contrastive method based on the mutual information estimator InfoNCE [45]. In our work, the training of the encoder with ST-DIM is performed exactly as illustrated in [50]. The formulation of the objective function is as follows.
Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a dataset of $N$ pairs of samples from a joint distribution $p(x, y)$. We define as positive samples all the pairs $(x_i, y_i)$, coming from the joint $p(x, y)$, and as negative samples all the pairs $(x_i, y_j)$ with $i \neq j$, coming from the product of marginals $p(x)p(y)$. We optimize the parameters of a score function $f(x, y)$, which takes large values for positive samples and small values for negative samples, by maximizing the mutual information bound
$$I_{\text{NCE}}\left(\{(x_i, y_i)\}_{i=1}^{N}\right) = \sum_{i=1}^{N} \log \frac{\exp f(x_i, y_i)}{\sum_{j=1}^{N} \exp f(x_i, y_j)}. \tag{4.1}$$
The score function $f(x, y)$, following Oord, Li, and Vinyals [45], is chosen to be a bilinear model $\phi(x)^T W \phi(y)$, forcing the encoder to learn linearly predictable features which are more meaningful on a semantic level, and therefore provide a better encoder representation $\phi$. In the case of ST-DIM, positive samples are pairs of consecutive observations $(x_t, x_{t+1})$ and negative samples are pairs of non-consecutive observations $(x_t, x_{t^*})$. Two losses are constructed to respectively account for temporal and spatial information: a local-local objective and a global-local objective. In the two following equations, $X_{\text{next}}$ corresponds to the set of next observations. The global-local objective is defined as
$$\mathcal{L}_{GL} = \sum_{m=1}^{M} \sum_{n=1}^{N} -\log \frac{\exp\left(g_{m,n}(x_t, x_{t+1})\right)}{\sum_{x_{t^*} \in X_{\text{next}}} \exp\left(g_{m,n}(x_t, x_{t^*})\right)} \tag{4.2}$$
where the score function $g_{m,n}(x_t, x_{t+1}) = \phi(x_t)^T W_g \phi_{m,n}(x_{t+1})$ combines a local feature vector $\phi_{m,n}$, which is an intermediate output of the encoder indexed by the spatial location $(m, n)$, and the output of the last layer $\phi$. The local-local objective, in turn, corresponds to
$$\mathcal{L}_{LL} = \sum_{m=1}^{M} \sum_{n=1}^{N} -\log \frac{\exp\left(f_{m,n}(x_t, x_{t+1})\right)}{\sum_{x_{t^*} \in X_{\text{next}}} \exp\left(f_{m,n}(x_t, x_{t^*})\right)} \tag{4.3}$$
where the score function $f_{m,n}(x_t, x_{t+1}) = \phi_{m,n}(x_t)^T W_l \phi_{m,n}(x_{t+1})$ relates the encoder's outputs with respect to a paired input.
Figure 4.4: Illustration of ST-DIM from Anand et al. [50]. On the left, the two MI objective functions: local-local infomax and global-local infomax. On the right, the global-local contrastive task. As in [50], xt∗ is a collection of negative samples.
The final objective L = LGL + LLL is minimized while training by gradient descent with the Adam optimizer [69].
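As a rough illustration of the bound in Eq. (4.1) with a bilinear score, the sketch below computes $I_{\text{NCE}}$ for a batch of precomputed feature vectors. It is not the code of [50]: the function names are hypothetical, the features are plain lists rather than encoder outputs, and the spatial sums over $(m, n)$ in Eqs. (4.2) and (4.3) are omitted. Negating the returned value gives a loss of the same form as a single term of those objectives.

```python
import math


def bilinear_score(u, w, v):
    """f(x, y) = u^T W v for feature vectors u = phi(x) and v = phi(y)."""
    return sum(ui * sum(wij * vj for wij, vj in zip(row, v)) for ui, row in zip(u, w))


def info_nce(feats_x, feats_y, w):
    """Eq. (4.1): sum_i log(exp f(x_i, y_i) / sum_j exp f(x_i, y_j))."""
    total = 0.0
    for i, u in enumerate(feats_x):
        scores = [bilinear_score(u, w, v) for v in feats_y]
        top = max(scores)  # log-sum-exp shift for numerical stability
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[i] - log_denominator
    return total


identity = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[5.0, 0.0], [0.0, 5.0]]    # each x_i scores highest with its own y_i
shuffled = [[0.0, 5.0], [5.0, 0.0]]   # positives and negatives swapped
collapsed = [[1.0, 0.0], [1.0, 0.0]]  # all features identical

bound_aligned = info_nce(aligned, aligned, identity)
bound_shuffled = info_nce(aligned, shuffled, identity)
bound_collapsed = info_nce(collapsed, collapsed, identity)
```

The bound is never positive; it approaches zero when every positive pair clearly outscores its negatives, and a collapsed representation yields $-N \log N$ because positives and negatives are indistinguishable.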
4.3.2 Variational AutoEncoder (VAE)
Variational autoencoders [23] are generative models that learn the probability distribution of the data in the latent space. They make strong assumptions concerning the distribution of latent variables. It is assumed that the data are generated by a directed graphical model $p_\theta(x \mid h)$ and that the encoder learns an approximation $q_\phi(h \mid x)$ to the posterior distribution $p_\theta(h \mid x)$, where $\phi$ and $\theta$ denote the parameters of the encoder and decoder respectively. The encoder-decoder architecture is trained to minimize the following objective function, which corresponds to the negative Evidence Lower BOund (ELBO):