DEGREE PROJECT IN ARCHITECTURE, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

Unsupervised state representation pretraining in Reinforcement Learning applied to Atari games

FRANCESCO NUZZO

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Unsupervised state representation pretraining in Reinforcement Learning applied to Atari games

FRANCESCO NUZZO

Master in
Date: October 28, 2020
Supervisor: Fredrik Carlsson, Ali Ghadirzadeh
Examiner: Danica Kragic Jensfelt
School of Electrical Engineering and Computer Science
Host company: RISE
Swedish title: Oövervakad förträning av tillståndsrepresentation i förstärkningsinlärning tillämpat på atari-spel


Abstract

State representation learning aims to extract useful features from the observations received by a Reinforcement Learning agent interacting with an environment. These features allow the agent to take advantage of the low-dimensional and informative representation to improve its efficiency in solving tasks. In this work, we study unsupervised state representation learning in Atari games. We use an RNN architecture for learning features that depend on sequences of observations, and pretrain a single-frame encoder architecture with different methods on randomly collected frames. Finally, we empirically evaluate how pretrained state representations perform compared with a randomly initialized architecture. For this purpose, we let an RL agent train on 22 different Atari 2600 games, initializing the encoder either randomly or with one of the following unsupervised methods: VAE, CPC and ST-DIM. Promising results are obtained in most games when ST-DIM is chosen as the pretraining method, while VAE often performs worse than a random initialization.

Sammanfattning

State representation learning is about extracting useful features from the observations received by an agent interacting with an environment in Reinforcement Learning. These features make it possible for the agent to take advantage of the low-dimensional and informative representation to improve the efficiency in solving tasks. In this work, we study unsupervised learning in Atari games. We use an RNN architecture for learning features that depend on sequences of observations, and pretrain a single-frame encoder architecture with different methods on randomly collected frames. Finally, we empirically evaluate how pretrained state representations perform compared with a randomly initialized architecture. For this purpose, we let an RL agent train on 22 different Atari 2600 games, initializing the encoder either randomly or with one of the following unsupervised methods: VAE, CPC and ST-DIM. Promising results are obtained in most games when ST-DIM is chosen as the pretraining method, while VAE often performs worse than a random initialization.

Acknowledgement

This thesis was made possible thanks to RISE and the computational resources they made available for running most of the experiments.

I would like to thank my supervisors Fredrik Carlsson and Ali Ghadirzadeh for their help and advice throughout the work, and Prof. Danica Kragic for following and evaluating the thesis.

I am very grateful to all my friends that supported and encouraged me during this Master, without whom it would not have been so wonderful.

Most importantly, I would like to express my profound gratitude to my family for supporting me while studying at KTH, and to Valeria and Vlera for constantly demonstrating their precious and motivating support.

Contents

1 Introduction
  1.1 Research Question
  1.2 Ethics, Societal Aspects and Sustainability

2 Background
  2.1 Reinforcement Learning
    2.1.1 Markov Decision Process
  2.2 Representation Learning
    2.2.1 Convolutional Neural Networks
    2.2.2 Recurrent Neural Networks
  2.3 State Representation Learning for Control and Problem Formulation

3 Related Work
  3.1 State Representation Learning
    3.1.1 Contrastive methods
    3.1.2 Robotic priors and auxiliary objective functions
    3.1.3 Evaluating a state representation

4 Methods
  4.1 Overview
  4.2 Architecture
    4.2.1 Traditional CNN architecture
    4.2.2 Proposed CNN+RNN architecture
  4.3 Encoder pretraining
    4.3.1 Spatiotemporal Deep Infomax (ST-DIM)
    4.3.2 Variational AutoEncoder (VAE)
    4.3.3 Contrastive Predictive Coding (CPC)
  4.4 Evaluation

5 Results

6 Discussion

7 Conclusions
  7.1 Future work

Bibliography

A Hyperparameters

Chapter 1

Introduction

The vast class of deep representation learning methods has brought significant contributions to a variety of machine learning problems across numerous domains. Often, the representation learned by a model is the result of end-to-end learning that makes use of labeled data or rewards. Moreover, the high complexity of such models and the enormous number of parameters often make them sample-inefficient and not capable enough of generalization or transfer learning. However, the human brain appears to mainly learn without explicit supervision, indicating that there exist priors on which we can leverage to learn to extract compact and useful information from our perceptive data, i.e. to model representations of useful features independently of a given task to be performed.

In the context of Reinforcement Learning, learning a representation is a fundamental component for effective and efficient policy optimization. State representation learning focuses on extracting low-dimensional features from observational data captured from the environment, allowing the search space of the policy to be restricted to lower dimensions. The representation should capture variations in the environment, such as the position, speed and direction of moving objects; a feature vector containing this type of information is particularly suitable and useful for robotics and control tasks. The choice of using a low-dimensional representation comes with different advantages: it naturally tackles the curse of dimensionality [1, p. 932], it accelerates the process of learning an optimal policy [2, 3] and it improves the interpretability and explainability of the model [4], providing an easier way to study the cause-effect relationships learned by the model. Moreover, faster policy learning means a more efficient use of data, therefore improving sample efficiency, which is especially useful in real-world applications, such as robotics, where interacting with the environment is often expensive.

Unsupervised state representation learning aims at learning a representation from unlabeled data, i.e. observations for which the corresponding true state of the environment is not available, and then reusing that representation on the same environment when training an agent to perform a given task. The learned representation allows knowledge about how the environment works to be transferred, therefore making it possible for an agent to improve and speed up the learning process on the environment for different tasks and reward functions.

In our work, we use different unsupervised state representation algorithms to pretrain the architecture on frames collected from Atari 2600 games. Then we empirically evaluate and compare them by using the pretrained parameters as initialization of the feature extractor's architecture and training an RL agent to maximize the reward signal received from the environment.

1.1 Research Question

This thesis focuses on addressing two research questions:

• How to modify the state representation architecture of an RL agent for Atari games to account for temporal features without stacking multiple frames?

• Is it possible to improve the training performance of the agent by pretraining the state representation architecture on a small amount of unlabeled data?

1.2 Ethics, Societal Aspects and Sustainability

We highlight the importance of some crucial aspects for machine learning models in accordance with modern legislation. Decision-making algorithms have been regulated since 2018 by the European Union, which requires a "right to explanation", a "right to opt-out" and "non-discrimination" of models [5]. Two problems might arise in this regard when deploying State Representation Learning algorithms: difficult interpretability and models biased by the training data.

Firstly, restricting the representation to a lower dimension makes it easier to interpret what environment characteristic each latent variable corresponds to [6, p. 308]. In Lesort et al. [7], interpretability in the context of State Representation Learning is defined as the capacity for a human to be able to link a variation in the representation to a variation in the environment. The interpretability of the state representation allows the explainability of the learnt policy to be improved, and therefore an understanding of what the agent has learned. Secondly, unsupervised training in RL is biased towards the type of data that can be collected through a random policy, possibly restricting the broadness of the representation to a limited part of the state space. As observed in [8], high quality representations depend on a well-designed automatic exploration. From an ethical perspective, this is fundamental to guarantee equity and fairness in decisions made by algorithms that affect people's lives. The state dimension is chosen empirically. This choice relates to the bias-variance trade-off: a higher-dimensional state augments the capacity of the model and reduces the training error, but may also lead to overfitting.

Chapter 2

Background

This chapter provides an introduction to the theoretical aspects on which this thesis is built. We first describe the key relevant concepts of Reinforcement Learning (RL) in Section 2.1 and of Representation Learning through Deep Learning (DL) in Section 2.2. Secondly, we illustrate the concept of State Representation Learning (SRL) and how it relates to RL algorithms. Finally, we formulate the problem that we are addressing with our research questions. For further details, the reader is referred to Sutton and Barto [9] and Goodfellow, Bengio, and Courville [10] for RL and Deep Learning respectively.

2.1 Reinforcement Learning

Reinforcement Learning is an area of Machine Learning interested in making an agent learn to achieve a goal by interacting with an environment through actions. At each time step, the agent observes the state of the environment, i.e. the set of values describing its characteristics, performs an action and receives a reward signal. The choice of the action taken by the agent in a given state is defined by the policy. Training an agent to find an optimal policy that maximizes the cumulative reward requires balancing between the exploration of new state-action pairs and the exploitation of the current optimal policy. This is known as the exploration-exploitation dilemma.

2.1.1 Markov Decision Process

In RL, the problem to be solved is described as a Markov Decision Process (MDP). While in this section we assume that the observation received by the agent corresponds to the true state of the environment, often this is not the case, and the problem needs to be formulated differently, as we discuss in Section 2.3. An MDP is a 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where

• $\mathcal{S}$ is a set of states, called the state space, describing all the possible configurations of the environment.

• $\mathcal{A}$ is a set of actions, called the action space.

• $P(s, a, s') \in P$ is the probability of transitioning from state $s$ at time $t$ to $s'$ at time $t+1$ by taking action $a$.

• $R(s, a, s') \in R$ is the immediate reward received after a transition from state $s$ to $s'$, due to action $a$.

• $\gamma$ is the discount factor, which is used to compute the discounted reward.

We define the policy $\pi(a|s)$, which determines the probability that the agent performs action $a$ in state $s$. The time horizon $T$ defines the time that the agent has to complete the task. If $T < \infty$, the problem is a finite horizon MDP, and the objective is

$$\max_\pi \; \mathbb{E}\left[\sum_{t=1}^{T} R_t\left(s_t^\pi, a_t^\pi, s_{t+1}^\pi\right)\right] \tag{2.1}$$

therefore maximizing the sum of the immediate rewards along the trajectory. Alternatively, if the time horizon is infinite ($T \to \infty$), the problem is defined as an infinite horizon MDP or discounted MDP. The objective becomes

$$\max_\pi \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t\left(s_t^\pi, a_t^\pi, s_{t+1}^\pi\right)\right] \tag{2.2}$$

where the discount factor $\gamma \in [0, 1)$ tells how important future rewards are to the current state. As can be seen, since the policy is conditioned on the state, it is important to have a state that fully contains meaningful information about the current state of the environment. However, as we will elaborate in Section 2.3, this assumption might not always hold, especially for Atari games.

2.2 Representation Learning

Feature generation has two major areas: Representation Learning and manual feature engineering [11]. In our work, we focus on how to learn representations, rather than manually designing transformations to extract features. Representation Learning is the field in Machine Learning that concerns learning abstract features that characterize data, and is sometimes referred to as Feature Learning. The general task of representation learning is to discover a mapping $f : X^D \to Y^d$, which transforms the raw input data that lie in the $D$-dimensional original feature space $X$, e.g. pixels, to a more compact $d$-dimensional representation space $Y$, usually with $d \ll D$.

It is often difficult to evaluate the quality of a representation, especially because the importance of some features might depend on the task that we want to solve. Nevertheless, we can identify general-purpose priors that are desirable for a good representation, as discussed in [12]. Examples of such priors are the following:

• Smoothness: the learned mapping $f$ is smooth, such that the values of $f(x)$ and of its derivative $f'(x)$ are close to the values of $f(x + \Delta)$ and $f'(x + \Delta)$ respectively when $x$ and $x + \Delta$ are close as defined by a kernel or a distance.

• A hierarchical organization of explanatory factors: useful features that describe the world can be expressed as a composition of other concepts, in a hierarchy where more abstract information is on a higher level. This key concept relates to feature re-use and abstraction in deep architectures.

• Simplicity of factor dependencies: the dependencies that relate high-level factors with each other are simple, typically linear.

• Manifolds: the probability mass is assumed to concentrate in a much smaller dimensionality than the original space where the raw data live.

• Natural clustering: in classification, local variations on the same manifold do not induce a change in the value of the category. In general, a linear interpolation between examples belonging to different classes involves going through a low-density region where $P(X \mid Y = i)$ for different $i$ does not overlap among classes. This reflects the idea that humans categorize and separate classes based on the statistical structure that underlies the data.

• Temporal and spatial coherence: time-dependent features are assumed to change slowly over time. A small variation tends to keep the observations on the surface of the high-density manifold.

• Sparsity: many values of an encoded observation are often zero. This comes from the hypothesis that, for a given observation, only a small fraction of the possible hidden features are relevant. This prior is at the core of models such as over-complete autoencoders [13] and the class of Sparse Coding methods [14, 15].

A representation learning method is often more powerful as it incorporates more of the above-mentioned general priors.

As we further discuss in Chapter 3, it is important to distinguish how representation learning takes place differently in supervised, self-supervised and unsupervised learning. Both unsupervised and self-supervised learning make use of non-annotated data. While the former mainly focuses on generative or reconstructive losses, the latter defines surrogate losses that are optimized on automatically annotated data. These two classes of methods have gained increasing interest as their purpose is to extract relevant information without any prior knowledge about the task for which that information is going to be exploited. Therefore, they relate more to real-world intelligence, focusing on generalization and knowledge transfer, following the idea that the human brain mostly learns without labeled data. On the other hand, supervised learning methods are more prone to overfitting the training data. Supervised learning is more limited than unsupervised and self-supervised learning: it is highly dependent on the quantity and quality of the labeled data, which might be expensive to produce. Khastavaneh and Ebrahimpour-Komleh [11] categorize representation learning methods into four main classes: sub-space based, manifold based, shallow architectures, and deep architectures.

• Sub-space Based Representation Learning Approaches
This class of algorithms mainly makes use of linear combinations of the original features to generate new features through basis functions. The most popular of these methods include Principal Component Analysis (PCA) [16], Metric Multi-Dimensional Scaling (MDS) [17], Independent Component Analysis (ICA) [18], and Linear Discriminant Analysis (LDA) [19].

• Manifold Based Representation Learning Approaches
Manifold based methods use non-linear transformations and assume that the data lie on or near a low-dimensional manifold. Each method reduces the dimension of the original data while attempting to preserve the geometrical properties characterizing the underlying manifold.

• Shallow Representation Learning Approaches
Shallow architecture methods comprise multilayer perceptrons with less than five layers and local kernel machines. However, they require an exponential number of parameters with respect to the input dimension, and therefore are not considered to be sufficiently compact. Examples of shallow methods are Restricted Boltzmann Machines (RBMs) [20], Kernel PCA [21], Autoencoders (AE) [22], as well as Variational Autoencoders (VAE) [23].

• Deep Representation Learning Approaches
Deep architectures are employed to tackle the limitations of shallow architectures. The hierarchical nature of deep architectures allows the data to be processed multiple times in different layers, automatically discovering abstractions from low-level observations to high-level concepts. Because of the large number of layers, and therefore parameters, deep neural networks (DNN) require a huge amount of training data to achieve decent generalization. In order to promote generalization, different techniques, such as regularization through dropout, are used. Examples of deep neural networks are Deep Belief Networks (DBN) [24], Convolutional Neural Networks (CNN) [25] and Deep Autoencoders.

Khastavaneh and Ebrahimpour-Komleh [11] consider deep architectures the most complete class of methods for successful representation learning, as they cover more general priors of real-world intelligence [26, 27]. The hierarchical organization of features is the most important prior that they incorporate, organizing features on multiple levels of abstraction. Furthermore, since low-level features, such as edges for images, are useful in multiple tasks, the lower layers are more amenable to transfer learning. For this reason, deep architectures are often trained on large datasets, and then fine-tuned for a specific task. We briefly discuss Convolutional Neural Networks (CNN) and how they learn to extract relevant features, and similarly discuss Recurrent Neural Networks (RNN) in the following sections.

2.2.1 Convolutional Neural Networks

Convolutional neural networks are deep networks based on the mathematical operation of convolution, and are the most popular deep learning architecture for image processing. CNNs are able to successfully capture spatial dependencies by analyzing nearby pixels through a filter. Each filter is convolved with the input to compute a feature map, and a non-linear activation function, such as ReLU, is applied to the output. As shown in Figure 2.1, another common component of a CNN is the pooling layer, which is responsible for reducing the spatial size of the layer's input and extracting dominant features.

Figure 2.1: A CNN for classifying handwritten digits from the MNIST dataset. Image from [28].
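As a minimal sketch of the convolution, activation and pooling pattern described above (the layer sizes are arbitrary and chosen only for illustration):

```python
import torch
import torch.nn as nn

# One convolutional block: the filters are convolved with the input to
# produce feature maps, a ReLU non-linearity is applied, and max pooling
# halves the spatial size while keeping the dominant activations.
block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

x = torch.randn(1, 1, 84, 84)   # a batch containing one grayscale image
features = block(x)             # shape: (1, 16, 42, 42)
```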

In a CNN, each layer learns filters of increasing complexity (Figure 2.2). The first layers learn basic feature detection filters: edges, corners and basic small patterns. The middle layers learn filters that detect parts of objects. For faces, they might learn to respond to eyes, noses, etc. The last layers learn higher-level representations: they learn to recognize full objects, in different shapes and positions.

Figure 2.2: Visualization of a CNN from Zeiler and Fergus [29].

In the context of our work, it is relevant to discuss how self-supervised and unsupervised training methods can be used for performing Transfer Learning. This paradigm allows the trained weights to be re-utilized for accelerating and/or improving the supervised training. On the one hand, when a CNN is trained with labels, it learns features in order to discriminate an image belonging to a class from the others. Supervised classification, however, may easily lead to overfitting the training samples. Moreover, it has been shown that CNNs trained on ImageNet are biased towards texture, conflicting with the hypothesis on how humans observe images [30]. On the other hand, self-supervised and unsupervised learning provide a more general and flexible framework for dealing with the limited availability of large labeled datasets and focusing the learning process on extracting the underlying latent factors of the data. Among the classes of methods that make use of CNNs we can find AutoEncoders, GANs, self-supervised prediction of image rotation, and many others. In our work, the weights are first trained with an unsupervised method, and then fine-tuned in the learning process of the RL agent.

2.2.2 Recurrent Neural Networks

Recurrent neural networks [31] are a class of neural networks that are naturally suited to processing time-series data and other sequential data, such as natural language. In particular, in our work we use a particular type of RNN, called LSTM [32]. Long Short-Term Memory (LSTM) networks are made of a cell, an input gate, an output gate and a forget gate, as shown in Figure 2.3. This architecture allows the network to remember values over time and therefore to model long-term dependencies.

Figure 2.3: The Long Short-Term Memory (LSTM) cell can process data sequentially and keep its hidden state through time. [33]

The use of gates makes the architecture overcome the vanishing gradient problem observed in vanilla RNNs. Each gate selectively retains or discards information. First, the forget gate decides what information should be thrown away or kept, passing the hidden state and the input through a sigmoid function. Then, the input gate updates the cell state with the information coming from the new input and the hidden state. Finally, the output gate computes the new hidden state that will be propagated to the next time step. In the context of state representation learning, LSTMs are useful for POMDPs as they extract features that depend on sequences of input data. In a typical RL setting, this could be the direction and velocity or acceleration of a moving object, or the behavior state of an element observed over multiple frames, such as a non-player character (NPC) in an Atari game.
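The sketch below shows how an LSTM cell carries its hidden and cell states across a sequence of single-frame feature vectors; the dimensions are arbitrary and only meant to illustrate the recurrence.

```python
import torch
import torch.nn as nn

feature_dim, hidden_dim = 512, 512
cell = nn.LSTMCell(input_size=feature_dim, hidden_size=hidden_dim)

# The hidden state h and cell state c start at zero and are propagated
# across timesteps, so h depends on the whole sequence of observed features.
h = torch.zeros(1, hidden_dim)
c = torch.zeros(1, hidden_dim)
for z_t in torch.randn(10, 1, feature_dim):   # 10 per-frame feature vectors
    h, c = cell(z_t, (h, c))
```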

2.3 State Representation Learning for Control and Problem Formulation

State Representation Learning, in the context of Reinforcement Learning, is a particular kind of representation learning where the learned features are embedded in a low-dimensional space and mutate through time, often as a consequence of the agent's actions.

It is important to notice the difference between observation and state. An observation consists of the pure raw data that are collected from the environment, for example measurements coming from robotic sensors, temperatures, positions, angles, or pixels. In order to make an agent learn from these data there are two main problems that we should address:

• How to extract important features from an observation?

• How to make the state hold the Markov property?

The first problem is related to the curse of dimensionality. With most learning algorithms, if the dimensionality of the input is large, the learner will need to see numerous samples in order to model the target function. This comes from the fact that in a high-dimensional input space, two samples might be distant, while being very close on the data's low-dimensional manifold. Following Raffin et al. [34], SRL corresponds to learning a mapping $\varphi$ from the observation space $O$ to the state space $S$. The policy exploits the state $s_t \in S$ to output an action $a_t$ to maximize the expected reward:

$$o_t \xrightarrow[\text{SRL}]{\varphi} s_t \xrightarrow[\text{RL}]{\pi} a_t \tag{2.3}$$

The second problem is related to sequences in temporal systems, such as RL environments. If the RL problem is formulated as an MDP, it means that we assume that the Markov property holds for that stochastic process. By definition, a stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only upon the present state. In terms of conditional probabilities, this corresponds to

$$P(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_0 = x_0) = P(X_n = x_n \mid X_{n-1} = x_{n-1}). \tag{2.4}$$

However, whether the process is guaranteed to be Markovian, i.e. whether the Markov property holds, is often unknown. This entirely depends on what information the state $s_t$ represents. We argue that the state $s_t$ needs to be augmented with additional information coming from previous observations of the environment in order to make the Markov property hold. To address this problem, we modify Equation 2.3 to make the current state depend not only on the current observation, but also on all the previous ones, as follows

$$H_t \xrightarrow[\text{SRL}]{\varphi} s_t \xrightarrow[\text{RL}]{\pi} a_t \tag{2.5}$$

where $H_t = \{o_1, o_2, \ldots, o_t\}$ is the history of observations up to timestep $t$. We can notice that we might as well consider $H_t$ itself as the state of the environment for it to be Markovian. However, the further step of computing a low-dimensional $s_t = \varphi(H_t)$ is necessary to address the curse of dimensionality.

Chapter 3

Related Work

The state representation problem focuses on learning a low-dimensional embedding that only preserves useful information and reduces the dimensionality of the search space, therefore aiming to tackle a major challenge of Reinforcement Learning: sample efficiency. This chapter summarizes different supervised and unsupervised methods used to learn state representations in both model-free and model-based RL, discussing how they differ from each other, along with their advantages and drawbacks.

3.1 State Representation Learning

We can distinguish mainly two types of State Representation Learning (SRL) algorithms: supervised and unsupervised. While the former consists in learning a representation solely based on the optimization of a policy for a specific reward function, the latter utilizes unsupervised methods to learn to extract useful features from environment observations, independently of the task to be solved.

SRL is often not specifically addressed when trying to solve an Atari game with RL. Atari games require both a good understanding of the features embedded in a frame, such as the position of the player or the enemies, and a representation of temporal features, defined as features that require information over two or more observations to be defined, for example the velocity of elements in a frame and the direction in which they are moving. In this section we discuss how some model-free and model-based RL algorithms address the SRL problem in game environments.


One of the first methods proposed as a solution to Atari games is DQN [35]. The authors address the problem of SRL by pre-processing the frames to reduce the use of computational resources. Namely, each frame is downsampled from 210 × 160 to 110 × 84, converted to gray-scale, and cropped to 84 × 84. The problem of time-dependent features is addressed by stacking the last 4 frames of the history to produce the input to the CNN, which extracts a 256-dimensional feature vector used to learn the Q function. Frame-stacking has become a widely adopted method to account for time features. However, it can be argued that 4 consecutive frames might not be enough for time features that vary over longer periods of time. Moreover, the family of methods belonging to model-free RL, such as DQN itself, learn a supervised state representation that is guided by the reward, and is not guaranteed to be flexible enough for transferring knowledge to other tasks in the same environment, i.e. the representation and the policy overfit the task-specific reward function.
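A sketch of the preprocessing described above, assuming OpenCV is available; the crop offsets are illustrative, since they are not specified here.

```python
import cv2
import numpy as np
from collections import deque

def preprocess(frame_rgb):
    """Grayscale, downsample 210x160 -> 110x84, then crop an 84x84 region
    (the crop offsets below are illustrative)."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (84, 110))     # cv2.resize takes (width, height)
    return small[18:102, :]                 # 84x84 region covering the playing area

# The last four preprocessed frames are stacked to form the CNN input.
history = deque(maxlen=4)
for _ in range(4):
    history.append(preprocess(np.zeros((210, 160, 3), dtype=np.uint8)))
state = np.stack(history, axis=0)           # shape: (4, 84, 84)
```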

A more recent work proposes a model-based approach, called MuZero, to solve Atari games [36]. In the context of our work, it is of particular interest to discuss the representation and the dynamics functions. The input of the representation function includes a stack made of the last 32 RGB frames at resolution 96 × 96, and a plane for each action that led to each of those frames, for a total input of shape 96 × 96 × 128. The dynamics function takes as input the output of the representation function, which has dimensionality 6 × 6 × 256, and a plane representing the action according to the policy. The model learns to predict those aspects of the future that are directly relevant for planning, and is able to learn to play board games, such as Go, chess and shogi, without being given the explicit state transitions according to the rules. The hidden states are free to represent the state in whatever way is relevant to predicting current and future values and policies. This allows the agent to invent, internally, the rules or dynamics that are most convenient for planning. While overall the method achieves state-of-the-art results, providing a new way to effectively perform model-based RL on Atari games, it is important to consider the limitations regarding computational feasibility and flexibility. Stacking 32 frames certainly gives important temporal information; however, this greatly increases the required computational resources, and assumes that the last 32 frames contain enough information, i.e. make the state satisfy the Markov property. Furthermore, the representation is constrained by the task-specific reward function, and might not be reusable for learning new tasks in the same environment.

Other works have focused on learning in an unsupervised manner. Namely, a model is first trained on data collected with a random policy to learn a representation of the world and its dynamics, which is then used to train an agent to perform a task in the same environment. A method of this kind, called World Models, is formulated in [37], although not tested on Atari games, but rather in the Car Racing and VizDoom environments. The model is illustrated in Figure 3.1. Firstly, a random agent is used to collect 10,000 rollouts with a maximum length of 1,000 steps, containing both observations and actions. Secondly, a Variational AutoEncoder (VAE) is trained to learn a 32-dimensional latent representation of each observed frame. Then, an MDN-RNN is trained to predict the probability distribution of the embedding of the next frame, $P(z_{t+1} \mid a_t, z_t, h_t)$. In particular, this component is empirically shown by the authors to be fundamental for the stability of the driving agent, as the hidden state of the RNN $h_t$ accounts for temporal features and allows sharp corners to be tackled effectively. Finally, a controller C is used to train the agent to maximize the expected cumulative reward. The controller is a linear model, deliberately designed to be small and simple, so that most of the agent's complexity resides in the world model. The advantage of such a model is that it is general and modular enough to prevent the state representation from overfitting the reward function by decoupling these two optimization tasks. Furthermore, the agent can be trained inside the World Model, providing a solution for environments where interaction is expensive and sample efficiency is crucial. Nevertheless, the authors point out that for more complex environments, such as Atari games, an iterative procedure might be required as new environment scenarios are discovered and explored and the World Model representation needs to be updated.

Figure 3.1: World Models architecture. Image from [37] CHAPTER 3. RELATED WORK 17

Figure 3.2: Phases of the SimPLe algorithm. Image from [38]

A similar approach that leverages World Models is presented by Kaiser et al. [38]. They introduce Simulated Policy Learning (SimPLe), a model-based deep RL algorithm based on video prediction models. It consists of alternating between learning a world model and using this model to optimize a policy with model-free RL (Figure 3.2). The model receives as input four stacked frames and the action selected by the agent, and predicts the next frame and the expected reward using the architecture shown in Figure 3.3. This setting proves to be particularly effective in the low data regime, i.e. 400k frames on Atari games. However, the model turns out to be very computationally expensive, requiring three weeks of training per game. The burden comes from the 15.2M interactions that take place in the learned world model during the iterative training procedure. Another drawback is the dependency of the model on a defined reward function, making it harder to transfer the learned world model to a new task.

Figure 3.3: Architecture of the stochastic model in SimPLe. Image from [38]

Hausknecht and Stone [39] argue that stacking four frames might not be enough for Atari games. Some games might require more temporal information than the one contained in four stacked frames. For this reason, the authors formulate the RL problem with single-frame observations as a POMDP. The proposed solution is a change in the architecture, as shown in Figure 3.4: the CNN takes as input a tensor of dimension 84 × 84 × 1, and its output is processed through an LSTM that accounts for temporal information. They also prove the efficacy of this approach in a Flickering Pong environment, demonstrating that a single frame and a recurrent network are sufficient for compensating for the flickering and modelling the velocity of the ball.

Figure 3.4: DRQN architecture proposed by Hausknecht and Stone [39] where an LSTM is added at the end of the network. Image from [39]

In the context of robotic control, several approaches have been proposed for state representation learning. Hämäläinen et al. [40] propose a variational encoder-decoder structure that uses a low-dimensional latent representation computed by the encoder to create an affordance map, as opposed to reconstructing the input as in a VAE. Each pixel value in a channel of the affordance map represents how likely that affordance is to occur at that position. However, this method requires labeled affordance images, and therefore is not unsupervised. Ghadirzadeh et al. [41] find a low-dimensional state representation for policy learning by training a perception super-layer. The architecture uses a convolutional spatial autoencoder and a Variational Autoencoder, both trained end-to-end with the policy. In particular, the two autoencoder structures are respectively trained to extract task-relevant states from raw camera observations and to learn, with a generative model, the low-dimensional representation in the action manifold representing long motor trajectories. Chen et al. [42] demonstrate that it is possible to train visuomotor policies and transfer them to novel task domains by using adversarial training. The adversarial method is trained end-to-end to extract visual features that generalize to instances of task objects with the supervision of weak labels indicating whether an image contains a task object or not. The adversarial features are pretrained jointly using PPO, and two auxiliary networks, a discriminator and a classifier, are used to respectively discriminate whether the image comes from the source or the target domain, and recognize whether there is a task object in the image from the target domain or not.

3.1.1 Contrastive methods

Contrastive learning is a class of unsupervised methods adopted for learning representations. Such algorithms force the inner product between the representations of similar pairs to be higher on average than the one with negative samples. Arora et al. [43] analyze the theory behind contrastive learning by introducing a framework based on latent classes, assuming that semantically similar data points belong to the same latent class. First, they show how the learned representation successfully contains useful features for classification tasks. Secondly, their generalization bound indicates that contrastive learning with blocks of negative samples promotes a representation that is beneficial for sample complexity and generalization.

Gutmann and Hyvärinen [44] introduce noise-contrastive estimation (NCE) as a means for "learning by comparison" in the context of unnormalized models, i.e. probabilistic models with an analytically intractable normalizing constant. The unsupervised learning problem is approached with a model trained through supervised learning to discriminate, via binary classification, the original observed data from artificially generated noise sampled from a contrastive noise distribution. Formally, having $T$ observed data points $x_t$ and $T$ artificially generated data points $y_t$, we want to optimize

$$J_T(\theta) = \frac{1}{2T} \sum_t \ln\left[h(x_t; \theta)\right] + \ln\left[1 - h(y_t; \theta)\right] \tag{3.1}$$

with $h(u; \theta) = r_\nu(G(u; \theta))$, where $r_\nu(u) = \frac{1}{1 + \nu \exp(-u)}$ is a parametrized logistic function and $G(u; \theta) = \ln p_m(u; \theta) - \ln p_n(u)$ is the log-ratio between the class-conditional pdfs $p_m(\cdot; \theta)$ and $p_n$, respectively for positive and negative samples. Understandably, the choice of the noise distribution is crucial: it should be close to the data distribution, because otherwise the classification task might be too easy and no learning of actual data structures would be required to distinguish the two classes.
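A sketch of the (negated) NCE objective of Eq. (3.1); the log-densities of the model and of the noise distribution are assumed to be computed elsewhere and passed in as tensors.

```python
import math
import torch

def nce_loss(log_pm_x, log_pn_x, log_pm_y, log_pn_y, nu=1.0):
    """Sketch of the negated NCE objective in Eq. (3.1), to be minimized.

    log_pm_* / log_pn_*: log-densities of the model and of the noise
    distribution, evaluated at observed data x_t and at noise samples y_t.
    """
    def h(log_pm, log_pn):
        G = log_pm - log_pn                       # log-ratio G(u; theta)
        return torch.sigmoid(G - math.log(nu))    # r_nu(G) = 1 / (1 + nu * exp(-G))

    pos = torch.log(h(log_pm_x, log_pn_x))        # ln h(x_t; theta)
    neg = torch.log(1.0 - h(log_pm_y, log_pn_y))  # ln (1 - h(y_t; theta))
    return -0.5 * (pos.mean() + neg.mean())
```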

Based on Gutmann and Hyvärinen [44], Oord, Li, and Vinyals [45] introduce Contrastive Predictive Coding (CPC) for learning representations. They train an encoder and an autoregressive model to optimize a loss named InfoNCE. This objective function allows maximal preservation of the mutual information (MI) between the original signals $x$ and the context latent representation $c$, defined as

$$I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x \mid c)}{p(x)}. \tag{3.2}$$

The authors prove that minimizing the InfoNCE loss maximizes a lower bound on the mutual information. Furthermore, they show that adding CPC as an auxiliary loss significantly improves the performance of an RL agent trained with A2C on 3D environments of DeepMind Lab [46].

Another method for learning unsupervised representations is introduced by Hjelm et al. [47]. This approach, named Deep InfoMax (DIM), maximizes mutual information in favour of representations that preserve locally-consistent information across structural locations, such as image patches. With the aim of extracting useful representations for classification, DIM maximizes the average MI between the high-level representation, the output of the encoder, and local patches of the image, which are middle-level feature maps of the encoder. As MI estimators, the authors compare one based on the Donsker-Varadhan representation (DV) [48], the Jensen-Shannon divergence (JSD) [49] and InfoNCE [45]. The JSD estimator is found to be insensitive to the number of negative samples, while InfoNCE has higher performance for larger batches of negative samples, and DV consistently performs worse than InfoNCE.

Anand et al. [50] present Spatiotemporal Deep InfoMax (ST-DIM), an unsupervised representation learning technique that maximizes the mutual information in the representations across spatial and temporal axes. This method is based on both DIM and InfoNCE. From the first, following Hjelm et al. [47], they maximize a sum of patch-level mutual information objectives. Following Oord, Li, and Vinyals [45], they use InfoNCE for modelling two objectives that respectively account for spatial and temporal information between two consecutive observations. Moreover, a new evaluation method for state representation learning in Atari games is proposed. Specifically, the encoder's weights, trained to minimize the unsupervised loss, are kept fixed, and a linear classifier that takes as input the output of the encoder is trained to predict the RAM state, which represents the ground truth of the environment state at a specific time step. ST-DIM outperforms generative methods, such as VAE and pixel prediction, as well as other contrastive methods, such as CPC. In our work, we adopt ST-DIM as a pretraining method for state representation learning.

3.1.2 Robotic priors and auxiliary objective functions

State representation learning can be augmented through the addition of auxiliary loss functions that assume various specific constraints and serve different uses depending on the characteristics of the environment. The learning process of the agent can be constrained by prior knowledge to account for intuitive physics, physical laws, mental states of other agents, and modelling regularities such as compositionality and causality. Lesort et al. [8] provide a review of state-of-the-art approaches to loss functions designed for state representation learning. These priors can be used in two ways: they can either be used as auxiliary loss functions while training the RL agent, or be the objective of an unsupervised state representation pretraining. The following loss functions are applied to the state space to constrain the representation according to specific criteria; a code sketch of some of them is given after the list. We denote by $\Delta s_t = s_{t+1} - s_t$ the difference between the states at the two consecutive timesteps $t$ and $t+1$, and by $D$ a set of observations.

• Slowness Principle
The slowness principle is used in physical systems where features of interest are assumed to fluctuate slowly and continuously through time, and abrupt changes are unlikely [51, 7, 52].

$$L_{\text{Slowness}}(D, \phi) = \mathbb{E}\left[\|\Delta s_t\|^2\right] \tag{3.3}$$

• Variability
The variability prior assumes that the positions of relevant objects vary, and that the state representation should focus on such moving elements [52].

$$L_{\text{Variability}}(D, \phi) = \mathbb{E}\left[e^{-\|s_{t_1} - s_{t_2}\|}\right] \tag{3.4}$$

This prior is used to counter-balance the slowness prior, which would otherwise lead to constant values.

• Proportionality
The proportionality prior, proposed by Jonschkowski and Brock [51], is used in systems where we expect the environment to change with the same magnitude in response to the same stimulus, i.e. the agent's action.

$$L_{\text{Prop}}(D, \phi) = \mathbb{E}\left[\left(\|\Delta s_{t_2}\| - \|\Delta s_{t_1}\|\right)^2 \mid a_{t_1} = a_{t_2}\right] \tag{3.5}$$

• Repeatability
If the same action was applied at times $t_1$ and $t_2$ and the two states were similar, the state change should be similar both in magnitude and direction. This prior is named the repeatability loss [51].

$$L_{\text{Rep}}(D, \phi) = \mathbb{E}\left[e^{-\|s_{t_2} - s_{t_1}\|^2} \|\Delta s_{t_2} - \Delta s_{t_1}\|^2 \mid a_{t_1} = a_{t_2}\right] \tag{3.6}$$

• Controllability
We consider controllable things to be relevant. Objects that can be manipulated by an agent are likely to be useful to represent in order to achieve a given task. The following robotic prior connects the accelerations of objects to the actions of the agent. Jonschkowski et al. [52] define a loss function per action dimension $i$ to optimize the covariance between action dimension $i$ and the accelerations in state dimension $i$.

$$L_{\text{controllability}}(i) = e^{-\operatorname{Cov}\left(a_{t,i},\, s^{(a)}_{t+1,i}\right)} = e^{-\mathbb{E}\left[\left(a_{t,i} - \mathbb{E}[a_{t,i}]\right)\left(s^{(a)}_{t+1,i} - \mathbb{E}\left[s^{(a)}_{t+1,i}\right]\right)\right]} \tag{3.7}$$

• Selectivity
A further assumption consists in considering each feature of the state space as an independently controllable factor that mutates according to the agent's actions. Assuming that the state space is $K$-dimensional, we train $K$ policies $\pi_k$ so that each policy selectively causes a change only in the $k$-th feature of the state representation. Thomas et al. [53] introduce the following selectivity loss

$$L_{\text{sel}}(D, \phi, k) = \mathbb{E}\left[\frac{s^{(k)}_{t+1} - s^{(k)}_{t}}{\sum_{k'} \left|s^{(k')}_{t+1} - s^{(k')}_{t}\right|} \;\middle|\; s_{t+1} \sim P^{a}_{s_t, s_{t+1}}\right] \tag{3.8}$$

• Causality
The causality prior is used as an auxiliary objective for state representation learning when a reward signal is available [52, 7]. This prior assumes that if the same action $a_{t_1} = a_{t_2}$ performed at two different timesteps $t_1, t_2$ returns two different rewards $r_{t_1+1}$ and $r_{t_2+1}$, then the two states in which the action was taken should be differentiated and distant in the representation space.

$$\hat{L}_{\text{Caus}}(D, \phi) = \mathbb{E}\left[e^{-\|s_{t_2} - s_{t_1}\|^2} \mid a_{t_1} = a_{t_2},\; r_{t_1+1} \neq r_{t_2+1}\right] \tag{3.9}$$

• Dynamic verification
Dynamic verification is a discriminative approach proposed by Shelhamer et al. [54] that consists in learning to detect a corrupted observation $o_{t_c}$ in a sequence of consecutive observations $o_t$, where $t \in [0, K]$. Each observation is first encoded into a state, and then a classifier predicts the index of the corrupted observation in the sequence.

• Forward prediction
Forward models rely on a loss function that computes the prediction error on the next state, given the current state and the action performed by the agent [55].

$$L_{\text{fwd}}\left(s_{t+1}, \hat{f}(s_t, a_t)\right) = \frac{1}{2}\left\|\hat{f}(s_t, a_t) - s_{t+1}\right\|_2^2 \tag{3.10}$$

In non-deterministic systems, the prediction might be a distribution instead of a single state value, as with the MDN-RNN in [37]. The prediction error can also be used as an internal reward to guide the exploration of novel states [55].

• Inverse prediction
By turning the forward model around, we can use two consecutive states $s_t$ and $s_{t+1}$ to predict which action $a_t$ made the transition happen. This inverse model is implemented by Pathak et al. [55] and integrated together with the forward model.

$$\hat{a}_t = g(s_t, s_{t+1}; \theta_I) \tag{3.11}$$

We minimize the following loss, which measures the difference between the performed action $a_t$ and the predicted action $\hat{a}_t$. In [55], the loss corresponds to the Maximum Likelihood Estimation of a multinomial distribution.

$$L_{\text{inv}}(\hat{a}_t, a_t) \tag{3.12}$$
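As an illustration of how some of these priors translate into code, the sketch below implements the slowness, variability and causality losses for batches of encoded states; the selection of the state pairs (same action, different rewards, and so on) is assumed to happen outside these functions.

```python
import torch

def slowness_loss(s_t, s_t1):
    """Slowness prior (Eq. 3.3): consecutive states should change little."""
    return ((s_t1 - s_t) ** 2).sum(dim=1).mean()

def variability_loss(s_a, s_b):
    """Variability prior (Eq. 3.4): states at different timesteps should differ."""
    return torch.exp(-torch.norm(s_a - s_b, dim=1)).mean()

def causality_loss(s_a, s_b):
    """Causality prior (Eq. 3.9), assuming the pairs were pre-filtered so that
    the same action yielded different rewards at the two timesteps."""
    return torch.exp(-torch.norm(s_a - s_b, dim=1) ** 2).mean()
```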

Jonschkowski et al. [52] use an encoder for extracting features regarding the robotic positions and velocities at the current timestep. In their work, the encoder is pretrained in an unsupervised manner with some of the above loss functions on a dataset that consists of randomly sampled trajectories. The learned state representation is shown through PCA to successfully encode meaningful information in a very compact state space. The authors also show that, by training a small neural network after the encoder, it is possible to perform regression on the prediction of the ground truth state, achieving a low mean squared error.

3.1.3 Evaluating a state representation

There exist several measures to assess the quality of a state representation [8]:

1. Task performance
2. Disentanglement metric score
3. Distortion
4. NIEQA (Normalization Independent Embedding Quality Assessment)
5. KNN-MSE
6. Supervised learning

Task performance is the most common evaluation metric, and it consists of letting an RL agent learn to accomplish a task while using the learned representation, in order to test its transferability [51, 52, 56, 57, 58, 55, 54, 59, 60, 61, 62, 63, 64]. However, especially in real-world environments, training an RL agent is costly and inefficient, as it requires plentiful resources in terms of observational data received from the environment, training time and computation. For this reason, other methods have been designed to evaluate representations. Thomas et al. [53] assume that there are latent factors in the observations coming from the RL environment that are independently controllable, meaning that when an action modifies the value of one of these variables, the values of the other factors are not affected. However, this is a fair assumption only when actions are known to be independent. Higgins et al. [65] propose a way to quantitatively estimate the amount of disentanglement by using a disentanglement metric score. It is based on the assumption that the ground truth process utilizes a set of generative factors, some of which are conditionally independent and interpretable. The model that measures disentanglement uses the accuracy of a simple low-capacity, low VC-dimension linear classifier. The classifier predicts the generative factor that was kept fixed for a given difference of pairs of representations from the same latent factor. Two state representation evaluation methods based on manifold learning theory are Distortion and NIEQA. Indyk [66] shows that distortion, defined as the measure of how local and global geometry coherence in the representation varies with respect to the ground truth, is a useful measure in the context of embeddings. In particular, considering that many problems over metric spaces are defined in terms of the input metric properties, a low-distortion embedding enables reducing the problem to simpler metrics. Zhang, Ren, and Zhang [67] first propose an anisotropic scaling independent measure (ASIM), and then, based on it, introduce an innovative quality estimation method named normalization independent embedding quality assessment (NIEQA). This method helps to address two important questions:

• Does the representation preserve the geometric structure of local neighborhoods?

• Does the representation preserve the global topology of the manifold?

This is done by performing a local assessment, a global assessment, and a combination of the two. The local assessment measures the average ASIM between a local neighborhood on the data manifold and its corresponding embedding. The global assessment checks whether the representation preserves the global topology structure by measuring the matching degree, through the geodesic distance, between a neighborhood of "representative" data samples and its corresponding embedding. Lastly, the overall assessment is defined as a linear combination of the other two. Nearest-Neighbors is an algorithm that allows a visual inspection, in the representation space, of two semantically similar observations to verify that they are close, as in the ground truth state space. In addition to this qualitative comparison, we can derive a quantitative measure called KNN-MSE. By using the ground truth state space values of the observations, the KNN-MSE is computed as

$$\text{KNN-MSE}(s) = \frac{1}{k} \sum_{s' \in \text{KNN}(s, k)} \|\tilde{s} - \tilde{s}'\|^2 \tag{3.13}$$

where $s$ is the considered state, $s'$ is one of its $k$ nearest neighbors, and $\tilde{s}$ and $\tilde{s}'$ are their respective associated ground truths. If labels are available, it is possible to train a supervised model on top of the feature extractor that computes the state representation. Jonschkowski et al. [52] train a fully connected neural network to perform regression from the learned representation to the actual state values. An example of supervised classification for state representation evaluation can be found in [50]. The authors use linear probing to maximise the accuracy of linear classifiers that predict the latent generative factors from the learned representation. In this case, these factors are represented by the RAM annotations of each frame of an Atari 2600 game.
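As a concrete illustration of the KNN-MSE measure in Eq. (3.13), the sketch below assumes that the learned representations and their associated ground-truth states are available as NumPy arrays:

```python
import numpy as np

def knn_mse(representations, ground_truth, index, k=5):
    """KNN-MSE (Eq. 3.13) for the state at `index`.

    Neighbors are found in the learned representation space; the error is
    measured between the corresponding ground-truth states.
    """
    dists = np.linalg.norm(representations - representations[index], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]      # skip the point itself
    diffs = ground_truth[neighbors] - ground_truth[index]
    return np.mean(np.sum(diffs ** 2, axis=1))
```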

Chapter 4

Methods

4.1 Overview

In this work, we focus on two main aspects that we believe might improve the state representation that an RL agent utilizes and improves while learning to play Atari 2600 games. Specifically, we explore how to enhance state representation learning so that it captures two important kinds of features: spatial and temporal. For this reason, we use an architecture that, as opposed to the convention of stacking four consecutive raw observations, receives as input one single frame and temporal information propagated from the previous step. This modification is based on the following hypothesis: it is possible to improve sample efficiency on Atari games by representing temporal information and increasing the number of gradient updates while keeping the number of processed frames unchanged. This becomes possible by extracting temporal features through an RNN instead of using a CNN on stacked game frames.

4.2 Architecture

Many architectures for state representation employed in RL agents on Atari games use an input consisting of the last four frames observed by the agent. However, this implies that the agent will be unable to master games that require the player to remember events that occurred more than four observations in the past. In other words, even though the game states are often assumed to be Markovian, any game that requires a memory of more than four frames will be non-Markovian, as future game states would depend on more than a single input.


4.2.1 Traditional CNN architecture

The most common architecture used for extracting features from raw observations consists of a CNN, illustrated in Figure 4.1, that takes as input four preprocessed game frames [35]. Grayscaling and downsampling are applied, changing the shape of the input from 4 × 210 × 160 × 3 to 4 × 84 × 84.

Figure 4.1: Architecture from Mnih et al. [35] (DQN), often adopted for feature extraction in other RL algorithms applied to Atari games.

4.2.2 Proposed CNN+RNN architecture

Inspired by [39], our architecture receives as input a single frame at each time step. Similarly, we transform the observation from RGB values to grayscale. However, as opposed to [39], we do not downsample to 84 × 84, taking the input observation $o_t$ with the original size of 210 × 160. This is a consequence of our choice to use the same CNN encoder architecture as in [50]. The complete architecture of the CNN encoder is shown in Figure 4.2.

Figure 4.2: Architecture from Anand et al. [50], with the minor change of the feature size increased from 256 to 512. The input is not downsampled to allow the subsequent linear probing for the RAM annotations prediction.

The only difference lies in the size of the feature vector extracted by the encoder: while in [50] it has 256 dimensions, we use 512.

Figure 4.3: Architecture of the RL agent. In all our experiments $z_t \in \mathbb{R}^{512}$ and $h_t \in \mathbb{R}^{512}$.

Figure 4.3 shows the whole architecture for the state representation, policy and value function. The observed frame $o_t$ passes through the encoder, which extracts the feature vector $z_t = \phi(o_t)$. Then, the agent receives a 512-dimensional hidden state vector $h_{t-1}$ from the previous timestep. This vector contains temporal information that requires more than one frame to be extracted, such as velocity, movement directions and other higher-level features. The RNN takes as input $h_{t-1}$ and $z_t$ and computes the updated temporal information $h_t$, which is a fundamental part of the state representation. In order to reduce the burden on the RNN of learning an identity mapping for the useful spatial features extracted by the encoder, we use as state representation the concatenation of the two feature vectors $x_t = [z_t\; h_t]$, with an overall dimensionality of 1024. This is similar but not identical to [37], where the same concatenation takes place. While in [37] the RNN is pre-trained to use such input to predict the next encoded frame $z_{t+1}$, we use the RNN to compute an enriched temporal state representation vector $h_t$. Similarly to [37], the concatenated input is fed to the controller, which in our agent consists of the actor and the critic models, as shown in Figure 4.3. These two simple models consist of a single linear layer each. This choice lets the agent focus more easily on the credit assignment problem in the low-dimensional search space.
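A minimal sketch of this forward pass is given below; the CNN encoder of Figure 4.2 is assumed to be provided, and the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class RecurrentAgent(nn.Module):
    def __init__(self, encoder, num_actions, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.encoder = encoder                      # CNN of Figure 4.2 (assumed given)
        self.rnn = nn.LSTMCell(feat_dim, hidden_dim)
        self.actor = nn.Linear(feat_dim + hidden_dim, num_actions)
        self.critic = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, obs, h, c):
        z = self.encoder(obs)                       # spatial features z_t
        h, c = self.rnn(z, (h, c))                  # temporal features h_t
        x = torch.cat([z, h], dim=1)                # state x_t = [z_t h_t]
        return self.actor(x), self.critic(x), (h, c)
```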

4.3 Encoder pretraining

Following Anand et al. [50], we pretrain the encoder with different methods on data collected from the environment by letting an agent interact through actions sampled from a uniformly random policy. This is performed with the intent to transfer knowledge about the partial state representation that the encoder is able to extract after being trained on single observations. We compare three different pretraining methods: VAE, ST-DIM and CPC. The pretrained weights are used as an initialization of the actor-critic agent's encoder and are set to be trainable. Furthermore, we compare with a random initialization that uses semi-orthogonal matrices, as suggested by Saxe, Mcclelland, and Ganguli [68]. In our experiments, each encoder is pretrained on 100,000 grayscale frames from the environment of shape 210 × 160.
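A sketch of the random-policy data collection, assuming a classic Gym-style Atari environment; the environment name and the grayscale conversion are illustrative.

```python
import gym
import numpy as np

env = gym.make("BreakoutNoFrameskip-v4")
frames, obs = [], env.reset()
while len(frames) < 100_000:
    # Uniformly random policy: sample an action from the action space.
    obs, reward, done, info = env.step(env.action_space.sample())
    frames.append(np.mean(obs, axis=2).astype(np.uint8))   # grayscale 210x160
    if done:
        obs = env.reset()
```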

4.3.1 Spatiotemporal Deep Infomax (ST-DIM)

ST-DIM is a contrastive method based on the mutual information estimator InfoNCE [45]. In our work, the training of the encoder with ST-DIM is performed exactly as illustrated in [50]. The objective function is formulated as follows. Let {(x_i, y_i)}_{i=1}^N be a dataset of N sample pairs from a joint distribution p(x, y). We define as positive samples all the pairs (x_i, y_i) coming from the joint p(x, y), and as negative samples all the pairs (x_i, y_j) with i ≠ j, coming from the product of marginals p(x)p(y). We optimize the parameters of a score function f(x, y), which takes large values for positive samples and small values for negative samples, by maximizing the mutual information bound

I_{NCE}\left(\{(x_i, y_i)\}_{i=1}^{N}\right) = \sum_{i=1}^{N} \log \frac{\exp f(x_i, y_i)}{\sum_{j=1}^{N} \exp f(x_i, y_j)}.    (4.1)

The score function f(x, y), following Oord, Li, and Vinyals [45], is chosen to be a bilinear model φ(x)^T W φ(y), forcing the encoder to learn linearly

predictable features which are more meaningful on a semantic level, and therefore provide a better encoder representation φ. In the case of ST-DIM, positive samples are pairs of consecutive observations (x_t, x_{t+1}) and negative samples are pairs of non-consecutive observations (x_t, x_{t*}). Two losses are constructed to respectively account for temporal and spatial information: a local-local objective and a global-local objective. In the two following equations, X_next corresponds to the set of next observations. The global-local objective is defined as

\mathcal{L}_{GL} = -\sum_{m=1}^{M} \sum_{n=1}^{N} \log \frac{\exp\big(g_{m,n}(x_t, x_{t+1})\big)}{\sum_{x_{t^*} \in X_{\text{next}}} \exp\big(g_{m,n}(x_t, x_{t^*})\big)}    (4.2)

where the score function g_{m,n}(x_t, x_{t+1}) = φ(x_t)^T W_g φ_{m,n}(x_{t+1}) combines a local feature vector φ_{m,n}, which is an intermediate output of the encoder, with the output of the last layer φ. The local-local objective corresponds to

\mathcal{L}_{LL} = -\sum_{m=1}^{M} \sum_{n=1}^{N} \log \frac{\exp\big(f_{m,n}(x_t, x_{t+1})\big)}{\sum_{x_{t^*} \in X_{\text{next}}} \exp\big(f_{m,n}(x_t, x_{t^*})\big)}    (4.3)

where the score function f_{m,n}(x_t, x_{t+1}) = φ_{m,n}(x_t)^T W_l φ_{m,n}(x_{t+1}) relates the encoder's intermediate outputs for a pair of inputs.

Figure 4.4: Illustration of ST-DIM from Anand et al. [50]. On the left, the two MI objective functions: local-local infomax and global-local infomax. On the right, the global-local contrastive task. As in [50], x_{t*} is a collection of negative samples.

The final objective L = L_{GL} + L_{LL} is minimized during training with the Adam optimizer [69].
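As an illustration of how such a contrastive term can be computed over a batch of consecutive frame pairs, the sketch below implements a simplified global-local InfoNCE loss in PyTorch. The tensor shapes and the use of in-batch negatives are assumptions made for clarity; the actual experiments reuse the implementation of Anand et al. [50].

import torch
import torch.nn as nn
import torch.nn.functional as F

def global_local_infonce(global_feats, local_feats_next, W_g):
    # global_feats:     (B, D)        phi(x_t) for a batch of anchor frames
    # local_feats_next: (B, C, M, N)  intermediate map phi_{m,n}(x_{t+1})
    # W_g:              nn.Linear(C, D, bias=False), the bilinear weight
    # Negatives for sample i are the next frames of the other batch elements.
    B, C, M, N = local_feats_next.shape
    target = torch.arange(B, device=global_feats.device)
    loss = 0.0
    for m in range(M):
        for n in range(N):
            local = local_feats_next[:, :, m, n]            # (B, C)
            scores = global_feats @ W_g(local).t()          # (B, B) bilinear scores
            loss = loss + F.cross_entropy(scores, target)   # positives on the diagonal
    return loss / (M * N)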

4.3.2 Variational AutoEncoder (VAE)

Variational autoencoders [23] are generative models that learn the probability distribution of the data in a latent space. They make strong assumptions concerning the distribution of the latent variables. It is assumed that the data are generated by a directed graphical model pθ(x|h) and that the encoder learns an approximation qφ(h|x) to the posterior distribution pθ(h|x), where φ and θ denote the parameters of the encoder and decoder respectively. The encoder-decoder architecture is trained to minimize the following objective function, which corresponds to the negative Evidence Lower BOund (ELBO):

\mathcal{L}(\phi, \theta, x) = D_{KL}\big(q_\phi(h|x) \,\|\, p_\theta(h)\big) - \mathbb{E}_{q_\phi(h|x)}\big[\log p_\theta(x|h)\big]    (4.4)

where D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} is the Kullback-Leibler divergence from Q to P.

The encoder architecture employed in our experiments is the one in Figure 4.2, while the decoder architecture is made of transposed convolution layers. The number of filters, kernel size and stride are the same as in the encoder, but in reverse order.
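A minimal sketch of objective (4.4) in PyTorch is given below. The pixel-wise mean-squared-error reconstruction term and the diagonal-Gaussian parameterization of qφ(h|x) are assumptions for illustration, not necessarily the exact choices of the implementation used here.

import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Negative ELBO: KL(q_phi(h|x) || N(0, I)) plus the (negated) expected
    # log-likelihood, approximated by a pixel-wise MSE reconstruction term.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def reparameterize(mu, logvar):
    # Sample h ~ q_phi(h|x) with the reparameterization trick.
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)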

Figure 4.5: Diagram of a Variational AutoEncoder. Image from [70].

4.3.3 Contrastive Predictive Coding (CPC)

Contrastive Predictive Coding is a contrastive method that extracts compact latent representations encoding predictions over future observations. It maximizes the mutual information between the input x and a context vector c,

I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x|c)}{p(x)}.    (4.5)

This allows extracting the underlying latent variables that the elements of the input sequence have in common.

In our experiments, each element in a training batch is a sequence of 100 game frames. First, the CNN encoder extracts a mapping z_t = g_enc(x_t) from a frame x_t. Then, an autoregressive model g_ar encapsulates all the information in z_{≤t} in the context latent representation c_t = g_ar(z_{≤t}). Following [45], a GRU is used as the autoregressive model. Next, we can estimate the density ratio

f_k(x_{t+k}, c_t) \propto \frac{p(x_{t+k}|c_t)}{p(x_{t+k})}    (4.6)

which preserves the mutual information between x_{t+k} and c_t. A simple log-bilinear model is used to obtain a positive real score

f_k(x_{t+k}, c_t) = \exp\big(z_{t+k}^{T} W_k c_t\big).    (4.7)

Since the length of the sequence is 100, there are 100 linear models W_k.

Figure 4.6: Visualization from [45] of Contrastive Predictive Coding on MNIST. Similarly, our sequence is made of consecutive game frames, while negative samples are non-consecutive frames.

The loss of CPC is based on Noise-Contrastive Estimation (NCE) and is named InfoNCE. Given a set X of N samples containing one positive sample from p(x_{t+k}|c_t) and N − 1 negative samples from the distribution p(x_{t+k}), we minimize

\mathcal{L}_N = -\mathbb{E}_{X}\left[\log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}\right].    (4.8)

As proved by Oord, Li, and Vinyals [45], minimizing the InfoNCE loss L_N maximizes a lower bound on the mutual information.
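The following PyTorch sketch illustrates the CPC objective for a batch of already encoded frame sequences: a GRU summarizes past codes into c_t, per-step linear models W_k predict future codes, and the InfoNCE loss uses the other sequences in the batch as negatives. The dimensions, the number of prediction steps and the in-batch negatives are illustrative assumptions, not the exact configuration used in the experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCObjective(nn.Module):
    # Sketch of CPC on pre-encoded sequences z of shape (B, T, z_dim).
    def __init__(self, z_dim=512, c_dim=512, steps=3):
        super().__init__()
        self.gru = nn.GRU(z_dim, c_dim, batch_first=True)
        self.Wk = nn.ModuleList([nn.Linear(c_dim, z_dim, bias=False) for _ in range(steps)])

    def forward(self, z):
        B, T, _ = z.shape
        c, _ = self.gru(z)                                  # context c_t = g_ar(z_<=t)
        target = torch.arange(B, device=z.device)
        loss = 0.0
        for k, Wk in enumerate(self.Wk, start=1):
            pred = Wk(c[:, : T - k])                        # predicted future codes W_k c_t
            true = z[:, k:]                                 # actual z_{t+k}
            for t in range(T - k):
                scores = pred[:, t] @ true[:, t].t()        # (B, B) log-bilinear scores
                loss = loss + F.cross_entropy(scores, target)
        return loss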

4.4 Evaluation

We empirically evaluate the state representation by measuring task performance, as opposed to Anand et al. [50], where linear probing is used to predict the ground truth of the state, represented by the game RAM values.

Task performance is measured as the mean reward that the agent collects in the environment over the last 100 episodes, after being trained on a fixed number of frames. The environments are 22 Atari 2600 games that run on ALE (the Arcade Learning Environment) through the open-source Gym package [71] from OpenAI. The agent is trained using Proximal Policy Optimization (PPO). Since our goal is to find which representation speeds up the RL agent's learning process, we focus on sample efficiency by fixing the number of frames on which the agent is trained and comparing the mean reward at the end of training. Note that an agent that receives stacked frames as input performs fewer gradient updates, and may therefore need more samples to achieve the same performance.

Chapter 5

Results

In this chapter we describe the experimental setup and report the results.

Each initialization that we compare is evaluated on the 22 Atari 2600 games used as environments in Anand et al. [50]. The pretraining process of each method is as follows. First, 100,000 frames are collected from the environment by performing actions drawn from a uniform distribution. The frames collected for each game are the same for all the methods. Then, the data are split into a training and a validation set, of size 80,000 and 20,000 respectively. Each method minimizes its objective function with the Adam optimizer and is trained until the validation loss does not improve for 15 consecutive steps (patience = 15). The hyperparameters for the pretraining process and for the PPO agent training are reported in Table A.1 and Table A.2 respectively.
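A minimal version of this early-stopping loop is sketched below, assuming a generic loss_fn(model, batch) for whichever pretraining objective is being minimized and treating one epoch as one patience step; both are assumptions made for illustration.

import torch

def pretrain(model, loss_fn, train_loader, val_loader, lr=3e-4, patience=15):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state, bad_steps = float("inf"), None, 0
    while bad_steps < patience:
        model.train()
        for batch in train_loader:
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model, b).item() for b in val_loader) / len(val_loader)
        if val < best_val:                       # keep the best checkpoint so far
            best_val, bad_steps = val, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            bad_steps += 1
    model.load_state_dict(best_state)
    return model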

All the experiments are performed on a 64-bit machine with 24 Intel(R) Xeon(R) E5-2620 v3 CPUs @ 2.40GHz, a GeForce GTX 1080 Ti GPU and 256GB of RAM. We modify code from [50] for the pretraining, and from [72] for the PPO agent training. In Figure 5.1 we show the learning progress of the RL agent, represented by the mean reward over the last 100 episodes. Then, we compare in Table 5.1 the final mean reward for each environment and each initialization. In addition, we compare our results with the performance obtained in the original PPO paper [73], which uses an architecture that stacks 4 frames preprocessed with downsampling and grayscaling.


Finally, we study how each initialization affects sample efficiency by examining how fast the agent achieves high rewards. In Figure 5.2 we compare the area under each learning curve in Figure 5.1, normalized with respect to this measure computed for the agent with a random initialization. In the game Venture, the randomly initialized agent does not receive any reward. For this reason, we normalize with respect to the second worst initialization, which corresponds to CPC.
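The normalization used in Figure 5.2 amounts to the following computation, sketched here with NumPy; the assumption is that the mean reward of each run is logged at matching points during training.

import numpy as np

def normalized_auc(rewards, baseline_rewards, x=None):
    # Area under a learning curve, divided by the same area for the
    # baseline run (the randomly initialized encoder in our comparison).
    x = np.arange(len(rewards)) if x is None else np.asarray(x)
    return np.trapz(rewards, x) / np.trapz(baseline_rewards, x)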

Table 5.1: Mean reward over the last 100 episodes of training (40M frames).

Game               Random init   VAE        CPC        ST-DIM      4 frames [73]
Asteroids          2396.2        1993.2     3155.4     2553.8      2097.5
Berzerk            1604.4        876        1362.6     1358.5      //
Bowling            30.91         32.35      30.11      29.86       40.1
Boxing             95.52         99.47      95.79      99.07       94.6
Breakout           556.89        430.59     633.86     649.65      274.8
DemonAttack        230801.1      4586.15    315610.2   352508.4    11378.4
Freeway            33.4          0          31.95      33.4        32.5
Frostbite          279.7         259.5      318.5      1291.8      314.2
Hero               20762.95      13943.2    36711      40383.9     //
MontezumaRevenge   0             0          0          188         42.0
MsPacman           1793.4        1976.9     2620.4     2932        2096.5
Pitfall            -0.4          0          0          0           -32.9
Pong               19.92         18.04      20.58      20.72       20.7
PrivateEye         0             0          0          0           69.5
Qbert              12978.75      10771.5    18239.25   18042       14293.3
Riverraid          3887.4        3031.2     8989.8     9736.5      8393.6
Seaquest           953.2         1787.4     960        1858        1204.5
SpaceInvaders      1558.6        843.55     1574.25    1603.2      942.5
Tennis             -2.7          -7.4       -2.43      -6.09       -14.8
Venture            0             0          0          0           0
VideoPinball       26996.9       25129.21   77066.45   264582.79   37389.0
YarsRevenge        19886.05      20500.48   34599.16   21092.32    //

Figure 5.1: Comparison of training a PPO agent with different initializations (Random init, VAE, CPC, ST-DIM) on 22 Atari games, one learning curve per game. The horizontal and vertical axes respectively indicate the number of gradient updates and the mean reward over the last 100 episodes.

Figure 5.2: Comparison of the area below the learning curves in Figure 5.1, normalized with respect to the agent with the randomly initialized encoder, for each of the 22 games (ST-DIM, CPC, Random init, VAE).

Chapter 6

Discussion

With our experiments we want to investigate two questions:

• Is the proposed architecture, initialized randomly, effective in learning on Atari environments?

• Which pretraining method best improves the performance?

From Table 5.1 we can observe that, in the majority of the games, the randomly initialized encoder achieves performance similar to the frame-stacking architecture. ST-DIM greatly outperforms the other methods and the traditional CNN architecture in most of the games when the comparison is based on the final mean reward, proving to be an effective pretraining method for initializing the encoder. This is in line with the results obtained by Anand et al. [50], where ST-DIM achieves the highest F1 score in most of the games on the prediction of the RAM values of the frames. Similarly to Shelhamer et al. [54], we find the reconstruction objective of the VAE to be mostly harmful. On the other hand, CPC usually has performance comparable with ST-DIM and performs better than the random initialization and the VAE. This can be explained by the fact that contrastive methods like CPC and ST-DIM prefer capturing high-entropy features, independently of the pixel size of the elements, while generative methods prefer capturing large objects, which have low entropy. Since small objects, such as projectiles, collectable items, enemies and other elements, are often more significant than large objects in Atari 2600 games, contrastive methods result in better performance in most of them.

In order to compare sample efficiency, we compute the area below the learning curves in Figure 5.1 for each method, as this measure takes into account


how fast an agent learns to maximize the reward. Furthermore, we normalize with respect to this same measure for the agent with a randomly initialized architecture. We observe that ST-DIM is the best initialization for 14 out of 22 games, and the second best in 6 out of 22. ST-DIM is better than a random initialization in all environments except a few games: Berzerk, Bowling and Private Eye. The low performance in Bowling is due to a known issue regarding reward clipping: restricting the reward between −1 and 1 makes the agent prefer knocking off bowling pins one by one rather than aiming for one big reward of 300 (clipped to 1 for the RL agent) by striking 10 times [74]. This affects all the initialization methods. In the game Berzerk, the room layouts and other elements are randomly generated. Because of this stochastic nature, a random initialization results in less bias towards certain configurations and is more adaptable to new episodes with different scenarios. Low performance on Private Eye affects all the initializations, as this environment has a highly sparse reward and many "rooms" that the agent has to navigate in order to collect rewards.

In half of the games, the VAE initialization achieves the worst performance. This might indicate that reconstruction of frames is not an optimal objective for unsupervised pretraining. In [37] the authors pretrain a VAE on randomly collected frames and successfully train an agent on the learned representation. We hypothesize that such a method might work on simpler environments (Car Racing and VizDoom), but not on many Atari games, as they contain more complex scenarios. In particular, while a 32-dimensional latent space is sufficient for the World Models method, 512 hidden features in the encoder require a much larger dataset than 100,000 frames to avoid overfitting on the training set, and such a dataset might not be sufficient for representing later stages of the environment. CPC, like ST-DIM, performs better than the random initialization most of the time, confirming that contrastive methods provide a representation more suitable for Atari games. It is important to notice that, in contrast with the original work on CPC [45], we use an encoder with weights pretrained using CPC instead of using the InfoNCE objective as an auxiliary loss function while training the agent. The demonstrated effectiveness provides a new sample-efficient alternative for learning representations.

Chapter 7

Conclusions

Learning a useful state representation is crucial for efficient policy learning. The contribution of our work is two-fold:

• We propose an architecture for addressing temporal features and enriching the state representation to guarantee the Markov property.

• We pretrain the encoder for the state representation with different unsupervised methods on randomly collected frames and compare how they affect the training performance of the RL agent.

We find that an architecture that processes frames individually and uses an RNN, namely an LSTM, for propagating temporal features achieves performance comparable with an architecture that takes as input a stack of four consecutive frames. This is in line with the results from [39], where a Deep Recurrent Q-Network is capable of integrating information across frames to detect relevant state information such as the velocity or direction of on-screen objects. We observe positive results for initializations based on contrastive pretraining (ST-DIM and CPC). Interestingly, VAE does not provide a good initial state representation and often performs worse than a random initialization. On the other hand, results with ST-DIM are particularly promising and show how beneficial it is to use this unsupervised method to boost the agent's training in most Atari 2600 games. Our work provides a new approach to address the sample efficiency problem in Reinforcement Learning. The application of unsupervised state representation pretraining might prove especially useful in scenarios where interaction is expensive, such as robotics and control.


7.1 Future work

Although a good initialization of the state representation architecture based on randomly collected frames is helpful in many environments, it is not sufficient for difficult environments where there are multiple scenarios with different varieties of elements and the reward is sparse, e.g. Atari games like Montezuma's Revenge, Private Eye and Pitfall. Further research could focus on integrating ST-DIM as an auxiliary loss function for the RL agent, or on applying the iterative procedure proposed in World Models for alternately optimizing the agent and the representation [37]. Moreover, in order to keep the state representation unsupervised, the ST-DIM loss function could be added to an intrinsically motivated agent [55], which would help in learning representations for later stages of the game or environment. In this case, actions taken by the agent would be used to train a forward model to predict the next state, and the prediction error would be the reward guiding the exploration. We did not perform an extensive hyperparameter search. This leaves space for improvement: scheduling the learning rate over training, or using different learning rates for the actor-critic agent and the state representation architecture, might further improve performance. Finally, the first part of the training process, i.e. the unsupervised state representation pretraining, is independent of the RL algorithm subsequently used to train the agent, and can therefore be applied to algorithms other than PPO.

Bibliography

[1] “Chapter 10 - Feature Selection and Dimensionality Reduction”. In: Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Ed. by Gary Miner et al. Boston: Academic Press, 2012, pp. 929–934. doi: 10.1016/B978-0-12-386979-1.00038-4.
[2] Nicolò Botteghi et al. Low Dimensional State Representation Learning with Reward-shaped Priors. 2020. arXiv: 2007.16044 [cs.LG].
[3] Yang Yu. “Towards Sample Efficient Reinforcement Learning”. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. July 2018, pp. 5739–5743. doi: 10.24963/ijcai.2018/820.
[4] Erico Tjoa and Cuntai Guan. A Survey on Explainable Artificial Intelligence (XAI): Towards Medical XAI. 2020. arXiv: 1907.07374 [cs.LG].
[5] Bryce Goodman and Seth Flaxman. “European Union Regulations on Algorithmic Decision-Making and a “Right to Explanation””. In: AI Magazine 38.3 (Oct. 2017), pp. 50–57. doi: 10.1609/aimag.v38i3.2741.
[6] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. 3rd ed. Amsterdam: Morgan Kaufmann, 2011.


[7] Timothée Lesort et al. “Unsupervised state representation learning with robotic priors: a robustness benchmark”. In: CoRR abs/1709.05185 (2017). arXiv: 1709.05185.
[8] Timothée Lesort et al. “State representation learning for control: An overview”. In: Neural Networks 108 (Dec. 2018), pp. 379–392. doi: 10.1016/j.neunet.2018.07.006.
[9] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. The MIT Press, 2018. url: http://incompleteideas.net/book/the-book-2nd.html.
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. Cambridge, MA, USA: MIT Press, 2016. url: http://www.deeplearningbook.org.
[11] Hassan Khastavaneh and Hossein Ebrahimpour-Komleh. “Representation Learning Techniques: An Overview”. In: Data Science: From Research to Application. Ed. by Mahdi Bohlouli et al. Cham: Springer International Publishing, 2020, pp. 89–104.
[12] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. 2012. arXiv: 1206.5538 [cs.LG].
[13] Alireza Makhzani and Brendan Frey. k-Sparse Autoencoders. 2014. arXiv: 1312.5663 [cs.LG].
[14] Lei Le, Raksha Kumaraswamy, and Martha White. “Learning Sparse Representations in Reinforcement Learning with Sparse Coding”. In: CoRR abs/1707.08316 (2017). arXiv: 1707.08316.
[15] Jai Priya, R. Vidya Raghavendran, and G. M. Naisra. “Sparse Coding: A Deep Learning using Unlabeled Data for High-Level Representation”. In: Proceedings of the IEEE (Mar. 2014), pp. 124–127. doi: 10.1109/WCCCT.2014.69.
[16] Karl Pearson. “LIII. On lines and planes of closest fit to systems of points in space”. Nov. 1901. doi: 10.1080/14786440109462720.

[17] Michael A. A. Cox and Trevor F. Cox. “Multidimensional Scaling”. In: Handbook of Data Visualization. Berlin, Heidelberg: Springer, 2008, pp. 315–347. doi: 10.1007/978-3-540-33037-0_14.
[18] Te-Won Lee. Independent Component Analysis: Theory and Applications. USA: Kluwer Academic Publishers, 1998.
[19] R. A. Fisher. “The Use of Multiple Measurements in Taxonomic Problems”. In: Annals of Eugenics 7.2 (1936), pp. 179–188. doi: 10.1111/j.1469-1809.1936.tb02137.x.
[20] G. E. Hinton and R. R. Salakhutdinov. “Reducing the dimensionality of data with neural networks”. In: Science 313.5786 (July 2006), pp. 504–507. doi: 10.1126/science.1127647.
[21] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. “Kernel principal component analysis”. In: Artificial Neural Networks — ICANN’97. Ed. by Wulfram Gerstner et al. Berlin, Heidelberg: Springer, 1997, pp. 583–588.
[22] Mark A. Kramer. “Nonlinear principal component analysis using autoassociative neural networks”. In: AIChE Journal 37.2 (1991), pp. 233–243. doi: 10.1002/aic.690370209.
[23] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. 2013. arXiv: 1312.6114 [stat.ML].
[24] G. E. Hinton. “Deep belief networks”. In: Scholarpedia 4.5 (2009), p. 5947. doi: 10.4249/scholarpedia.5947.
[25] Yann LeCun et al. “Gradient-based learning applied to document recognition”. In: Proceedings of the IEEE. 1998, pp. 2278–2324.

[26] Yoshua Bengio and Yann LeCun. “Scaling learning algorithms towards AI”. In: Large-scale kernel machines. Ed. by L. Bottou et al. MIT Press, 2007.
[27] Aditya Khamparia and Karan Mehtab Singh. “A systematic review on deep learning architectures and applications”. In: Expert Systems 36.3 (2019), e12400. doi: 10.1111/exsy.12400.
[28] CNN MNIST. https://machinelearningmastery.com/how-to-develop-a-convolutional-neural-network-from-scratch-for-mnist-handwritten-digit-classification/.
[29] Matthew D. Zeiler and Rob Fergus. “Visualizing and Understanding Convolutional Networks”. In: CoRR abs/1311.2901 (2013). arXiv: 1311.2901.
[30] Robert Geirhos et al. “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness”. In: CoRR abs/1811.12231 (2018). arXiv: 1811.12231.
[31] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. “Learning representations by back-propagating errors”. In: Nature 323.6088 (Oct. 1986), pp. 533–536. doi: 10.1038/323533a0.
[32] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-term Memory”. In: Neural Computation 9 (Dec. 1997), pp. 1735–1780. doi: 10.1162/neco.1997.9.8.1735.
[33] Guillaume Chevalier. The LSTM cell. 2018. url: https://commons.wikimedia.org/wiki/File:The_LSTM_cell.png.
[34] Antonin Raffin et al. “S-RL Toolbox: Environments, Datasets and Evaluation Metrics for State Representation Learning”. In: CoRR abs/1809.09369 (2018). arXiv: 1809.09369.
[35] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (Feb. 2015), pp. 529–533. doi: 10.1038/nature14236.
[36] Julian Schrittwieser et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. 2019. arXiv: 1911.08265 [cs.LG].

[37] David Ha and Jürgen Schmidhuber. “World Models”. In: CoRR abs/1803.10122 (2018). arXiv: 1803.10122.
[38] Lukasz Kaiser et al. Model-Based Reinforcement Learning for Atari. 2019. arXiv: 1903.00374 [cs.LG].
[39] Matthew Hausknecht and Peter Stone. Deep Recurrent Q-Learning for Partially Observable MDPs. 2015. arXiv: 1507.06527 [cs.LG].
[40] Aleksi Hämäläinen et al. Affordance Learning for End-to-End Visuomotor Robot Control. 2019. arXiv: 1903.04053 [cs.RO].
[41] Ali Ghadirzadeh et al. Deep Predictive Policy Training using Reinforcement Learning. 2017. arXiv: 1703.00727 [cs.RO].
[42] Xi Chen et al. Adversarial Feature Training for Generalizable Robotic Visuomotor Control. 2019. arXiv: 1909.07745 [cs.RO].
[43] Sanjeev Arora et al. “A Theoretical Analysis of Contrastive Unsupervised Representation Learning”. In: CoRR abs/1902.09229 (2019). arXiv: 1902.09229.
[44] Michael U. Gutmann and Aapo Hyvärinen. “Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics”. In: Journal of Machine Learning Research 13 (Feb. 2012), pp. 307–361.
[45] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. “Representation Learning with Contrastive Predictive Coding”. In: CoRR abs/1807.03748 (2018). arXiv: 1807.03748.
[46] Charles Beattie et al. DeepMind Lab. 2016. arXiv: 1612.03801 [cs.AI].
[47] R Devon Hjelm et al. Learning deep representations by mutual information estimation and maximization. 2018. arXiv: 1808.06670 [stat.ML].
[48] M. D. Donsker and S. R. S. Varadhan. “Asymptotic evaluation of certain Markov process expectations for large time. IV”. In: Communications on Pure and Applied Mathematics 36.2 (1983), pp. 183–212. doi: 10.1002/cpa.3160360204.

[49] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. 2016. arXiv: 1606.00709 [stat.ML].
[50] Ankesh Anand et al. Unsupervised State Representation Learning in Atari. 2019. arXiv: 1906.08226 [cs.LG].
[51] Rico Jonschkowski and Oliver Brock. “Learning State Representations with Robotic Priors”. In: Autonomous Robots 39.3 (Oct. 2015), pp. 407–428. doi: 10.1007/s10514-015-9459-7.
[52] Rico Jonschkowski et al. PVEs: Position-Velocity Encoders for Unsupervised Learning of Structured State Representations. 2017. arXiv: 1705.09805 [cs.RO].
[53] Valentin Thomas et al. Independently Controllable Factors. 2017. arXiv: 1708.01289 [cs.LG].
[54] Evan Shelhamer et al. “Loss is its own Reward: Self-Supervision for Reinforcement Learning”. In: CoRR abs/1612.07307 (2016). arXiv: 1612.07307.
[55] Deepak Pathak et al. Curiosity-driven Exploration by Self-supervised Prediction. 2017. arXiv: 1705.05363 [cs.LG].
[56] J. Munk, J. Kober, and R. Babuška. “Learning state representation for deep actor-critic control”. In: 2016 IEEE 55th Conference on Decision and Control (CDC). 2016, pp. 4667–4673.
[57] Herke van Hoof et al. “Stable Reinforcement Learning with Autoencoders for Tactile and Visual Data”. Oct. 2016. doi: 10.1109/IROS.2016.7759578.
[58] Chelsea Finn et al. Deep Spatial Autoencoders for Visuomotor Learning. 2015. arXiv: 1509.06113 [cs.LG].
[59] Junhyuk Oh, Satinder Singh, and Honglak Lee. Value Prediction Network. 2017. arXiv: 1707.03497 [cs.AI].
[60] Simone Parisi, Simon Ramstedt, and Jan Peters. “Goal-Driven Dimensionality Reduction for Reinforcement Learning”. Sept. 2017. doi: 10.1109/IROS.2017.8206334.
[61] John-Alexander M. Assael et al. Data-Efficient Learning of Feedback Policies from Image Pixels using Deep Dynamical Models. 2015. arXiv: 1510.02173 [cs.AI].

[62] Irina Higgins et al. DARLA: Improving Zero-Shot Transfer in Reinforcement Learning. 2017. arXiv: 1707.08475 [stat.ML].
[63] Amy Zhang, Harsh Satija, and Joelle Pineau. Decoupling Dynamics and Reward for Transfer Learning. 2018. arXiv: 1804.10689 [cs.LG].
[64] Samuel Alvernaz and Julian Togelius. Autoencoder-augmented Neuroevolution for Visual Doom Playing. 2017. arXiv: 1707.03902 [cs.AI].
[65] Irina Higgins et al. “beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework”. In: ICLR. 2017.
[66] P. Indyk. “Algorithmic applications of low-distortion geometric embeddings”. In: Proceedings 42nd IEEE Symposium on Foundations of Computer Science. 2001, pp. 10–33.
[67] Peng Zhang, Yuanyuan Ren, and Bo Zhang. “A new embedding quality assessment method for manifold learning”. In: Neurocomputing 97 (2012), pp. 251–266. doi: 10.1016/j.neucom.2012.05.013.
[68] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”. In: International Conference on Learning Representations. 2014.
[69] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2017. arXiv: 1412.6980 [cs.LG].
[70] Eduardo Pinho and Carlos Costa. “Unsupervised Learning for Concept Detection in Medical Images: A Comparative Analysis”. In: Applied Sciences 8.8 (July 2018), p. 1213. doi: 10.3390/app8081213.
[71] Greg Brockman et al. OpenAI Gym. 2016. arXiv: 1606.01540.
[72] Ilya Kostrikov. PyTorch Implementations of Reinforcement Learning Algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail. 2018.
[73] John Schulman et al. “Proximal Policy Optimization Algorithms”. In: CoRR abs/1707.06347 (2017). arXiv: 1707.06347.

[74] Marin Toromanoff, Emilie Wirbel, and Fabien Moutarde. Is Deep Reinforcement Learning Really Superhuman on Atari? Leveling the playing field. 2019. arXiv: 1908.04683 [cs.AI].

Appendix A

Hyperparameters

Parameter                                   Value
Image Width                                 160
Image Height                                210
Grayscaling                                 Yes
Action Repetitions                          4
Max-pool over last N action repeat frames   2
Frame Stacking                              None
End of episode when life lost               Yes
No-Op action reset                          Yes
Seed                                        2020
Latent size                                 512
Batch size                                  64
Sequence Length (CPC)                       100
Learning Rate                               3e-4
Encoder training steps                      80000

Table A.1: Hyperparameters for the unsupervised methods.


Parameter                 Value
Frames                    40M
Learning Rate             2.5e-5 × α
Horizon (T)               128
Num. epochs               3
Minibatch size            32 × 8
Discount (γ)              0.99
GAE parameter (λ)         0.95
Number of actors          8
Clipping parameter (ε)    0.1
VF coeff. c1              0.5
Entropy coeff. c2         0.01
Encoder latent size       512
LSTM hidden size          512
Seed                      9

Table A.2: Hyperparameters for the PPO agent training. We refer to Schulman et al. [73] for detailed information about each hyperparameter. α is linearly annealed from 1 to 0 over the course of learning.
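As a small illustration of the annealing schedule in Table A.2, the learning rate at a given PPO update can be computed as sketched below; the function name is a placeholder introduced for this example.

def annealed_learning_rate(update, total_updates, base_lr=2.5e-5):
    # alpha decreases linearly from 1 to 0 over the course of training,
    # so the effective learning rate is base_lr * alpha.
    alpha = 1.0 - update / total_updates
    return base_lr * alpha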

TRITA-EECS-EX-2020:840

www.kth.se