
Representation Learning for Control

A Thesis Presented by

Kevin James Doty

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements for the degree of

Master of Science

in

Electrical and Computer Engineering

Northeastern University Boston, Massachusetts

May 2019

To my family.

Contents

List of Figures

List of Tables

List of Acronyms

Acknowledgments

Abstract of the Thesis

1 Introduction

2 Background
  2.1 Markov Decision Process
  2.2 Solving MDPs
    2.2.1 Q-Learning
  2.3 Convolutional Neural Network
  2.4 Deep Q-Learning
    2.4.1 Experience Replay
    2.4.2 Policy and Target Network
  2.5 Autoencoder
  2.6 Variational Autoencoder

3 Methodology
  3.1 Setup
  3.2 Puck-Stack Environment
  3.3 Data Generation
  3.4 Proposed Architectures
    3.4.1 Reconstruction Architectures
    3.4.2 Predictive Variational Autoencoder
    3.4.3 Inverse Models
  3.5 Training Procedure
  3.6 Evaluation

4 Results

5 Conclusions
  5.1 The Motivation for Representation Learning
  5.2 Future Work
    5.2.1 Partial Observability
    5.2.2 Representation Conducive to Planning

Bibliography

A VAE Loss Derivation

B Architecture Building Block Summary

List of Figures

2.1 MDP Agent Environment Interaction

3.1 Successful Sequence in PuckStack
3.2 VAE Structure
3.3 Predictive VAE Structure
3.4 Inverse Model Structure
3.5 Variational Inverse Model Structure

4.1 Comparing latent controller performances to the fully convolutional DQN and the VAE latent-DQN, min-max-mean training curves over five training trials

List of Tables

4.1 Optimized Representation Learning Architecture Parameters

List of Acronyms

AE Autoencoder. A neural network structure that maps an input to itself through a bottleneck layer.

CNN Convolutional Neural Network. A network that filters an input feature map by applying the correlation operation between kernels with learned weights and the input features.

DQN Deep Q-Network. An extension of traditional Q-learning to continuous state spaces by using neural networks as Q-function approximators.

MDP Markov Decision Process. A discrete time control process with perfect information. At each time step a decision making agent perceives a state, selects an action, and then progresses to a new state, optionally receiving a reward.

PCA Principal Component Analysis. A statistical procedure for converting data into a set of uncorrelated features.

POMDP Partially Observable Markov Decision Process. A special case of the Markov decision process, in which the agent receives observations that are not the true state and do not contain perfect information.

SARS (State, Action, Reward, Next State)

VAE Variational Autoencoder. A probabilistic autoencoding structure in which the encoder is trained to approximate the posterior distribution of the data generation process.

Acknowledgments

I would like to thank my family and friends for their support and my advisor Rob Platt for the insightful discussions and guidance. I would also like to express my appreciation for the robotics community of NEU at large. The past couple of years have truly been fulfilling. It has been my pleasure to be here learning with all of you.

Abstract of the Thesis

Representation Learning for Control

by Kevin James Doty

Master of Science in Electrical and Computer Engineering
Northeastern University, May 2019

Dr. Robert Platt, Advisor

State representation learning finds an embedding from a high dimensional observation space to a lower dimensional and information dense state space, without supervision. Effective state representation learning can improve the sample efficiency and stability of downstream tasks such as control. In this work, two autoencoder variants and two inverse models are explored as representation learning architectures. The DQN algorithm is used to learn control in the resulting latent spaces. Latent-control performance is benchmarked against a standard fully convolutional DQN on a downstream control task.

Chapter 1

Introduction

One of the fundamental challenges of reinforcement learning is learning control policies in high dimensional observation spaces. This is particularly important in robotics applications, where we would like to develop behaviors that are responsive to raw sensory data such as images, point clouds or audio. Often these raw data contain many features irrelevant to the task. Learning in these conditions can require excessive amounts of data and time, and is prone to instability. One idea that can be applied to get around the curse of dimensionality in raw data is state representation learning. In state representation learning, the structure in information gathered passively or actively is exploited to learn mappings from a high dimensional observation space to a compact and meaningful state space. The learned mappings can then be used to pre-process raw data before sending it to a reinforcement learning algorithm as state. This process can greatly simplify the problem of policy learning. In this work, two forms of representation learning are explored: the autoencoder and the inverse model. The autoencoder has a fully unsupervised training process in which an input is mapped to itself through an information bottleneck. The structure is trained to reconstruct its input after compressing the information down to a predefined dimensionality. The inverse model is a self-supervised technique. Here, agent actions and the observations influenced by those actions are recorded simultaneously. A siamese network structure is then given sequential observations as input and trained to predict the action that occurred between them. This forces the encoder to learn features relevant for discerning the effect of actions in the environment.


Autoencoders and inverse models have been applied in various settings. This work seeks to understand the relative performance of the two architectures, as well as to test two novel architectures based on the autoencoding and inverse model paradigms. To accomplish this, these architectures are trained on data collected in the PuckStack environment. Deep Q-Network (DQN) agents are then trained in the latent spaces of the resulting encoders. The number of training episodes until convergence, training stability, and performance of these controllers are compared to each other and to a DQN baseline agent that learns directly from environment pixels. The goals of this work are to produce further understanding of the abilities and relative performances of the tested representation learning techniques, and to inform the design of these architectures for future applications. Effective state representation learning has the potential to solve many of the difficult aspects of learning control from raw data, in an unsupervised or self-supervised manner. The long-term goals for this field are to reduce the amount of data and time necessary to learn control policies, and to increase the stability and overall capability of these policies. With some progress, it may become feasible to train policies on physical robots in the lab in days or hours.

Chapter 2

Background

2.1 Markov Decision Process

The Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision making. The MDP involves the notions of an agent and an environment that interact through actions, state transitions, and reward signals. The decision-making agent perceives the state of the environment, chooses an action, progresses to a new state, and receives a reward signal.

Figure 2.1: MDP Agent Environment Interaction (Source [1])

An MDP can be formally defined by a state space, an action space, a state transition function, and a reward function. The state transition function returns a probability distribution over next states given a current state and action. The reward function returns a distribution over rewards given a state and action.
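As an illustrative sketch, the four components above can be written out in tabular form for a toy two-state MDP. The chain and all of its numbers are invented for illustration only; they are not part of this work.

import numpy as np

# Toy tabular MDP: two states, two actions. All values here are illustrative.
n_states, n_actions = 2, 2

# Transition function: P[s, a, s'] is the probability of landing in s'
# after taking action a in state s.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1]   # action 0 in state 0 mostly stays put
P[0, 1] = [0.2, 0.8]   # action 1 in state 0 mostly advances
P[1, 0] = [0.0, 1.0]   # state 1 is absorbing
P[1, 1] = [0.0, 1.0]

# Reward function: R[s, a] is the expected immediate reward.
R = np.zeros((n_states, n_actions))
R[0, 1] = 1.0

def step(s, a, rng=np.random.default_rng(0)):
    """Sample one transition (s, a) -> (r, s')."""
    s_next = int(rng.choice(n_states, p=P[s, a]))
    return R[s, a], s_next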


In this work we consider only finite episodic MDPs, in which the state space and action space are finite and the agent's interactions with the environment consist of a series of finite-length sequences. The interaction of the agent and the environment continues until the agent encounters some terminal state. A traversal through the environment from initial state to terminal state is referred to as an episode.

2.2 Solving MDPs

The goal of the agent is to maximize the cumulative reward it receives from the environment during an episode. This can be formalized as the return defined in 2.1, where γ is a discount factor with a value in the interval [0, 1). This factor prevents the sum from diverging, and can be used to weight the importance of short-term versus long-term rewards in a particular application. The formula can be regrouped as in 2.2 and then expressed recursively as 2.3.

G_t ≜ R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³R_{t+4} + ···        (2.1)
    = R_{t+1} + γ(R_{t+2} + γR_{t+3} + γ²R_{t+4} + ···)        (2.2)
    = R_{t+1} + γG_{t+1}        (2.3)
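As a quick numeric check of the identity between the direct sum 2.1 and the recursive form 2.3, consider the following sketch; the reward sequence and discount factor are arbitrary.

gamma = 0.9
rewards = [1.0, 0.0, 0.0, 2.0]               # R_{t+1}, R_{t+2}, ...

# Direct sum, equation 2.1.
G_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive form, equation 2.3, evaluated backwards from the episode end.
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G

assert abs(G - G_direct) < 1e-12             # both equal 1.0 + 0.9**3 * 2.0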

Action selection in an MDP dictates not only the reward received at the current time step, but also subsequent states, and therefore the rewards that can be received in future time steps. In solving an MDP, current and future rewards must both be considered. This can involve selecting low-reward actions in the present to gain access to higher rewards in future steps. The function the agent uses to select actions given states is referred to as a policy. This policy can be probabilistic or deterministic. We want to learn an optimal policy, that is, a policy that achieves the maximum return possible in the environment. There are many different ways of solving for optimal policies in the reinforcement learning literature. These techniques vary in their assumed knowledge of the MDP, the functions they approximate, and the manner in which they recursively update their estimates.


Two important functions used for finding optimal policies are the value function and the action-value function. For a given policy, the value function 2.4 estimates the expected return from a state if the policy is used to select actions from that state until the terminal state is reached.

v_π(s) ≜ E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]        (2.4)

The action-value function 2.5 is defined as the expected return starting from a given state s, taking a given action a, and following a policy π thereafter until the terminal state is reached.

q_π(s, a) ≜ E_π[G_t | S_t = s, A_t = a] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]        (2.5)

2.2.1 Q-Learning

In this work, an extended form of the Q-learning algorithm is used to learn control policies. Q-learning approximates the state-action value function, often referred to as the Q-function. The Q-learning algorithm approximates this function directly from experience in the environment, without estimating any form of transition model, and it updates its estimate at every time step using a recursive update equation. Another feature of this algorithm is that its recursive update equation 2.6 assumes the optimal policy is followed rather than the current policy. Hence, Q-learning is an off-policy, model-free, temporal-difference learning algorithm.

Q(S_t, A_t) ← Q(S_t, A_t) + α[ R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]        (2.6)


Algorithm 1 Q-Learning
1: Initialize Q(S, A) arbitrarily
2: for all episodes do
3:     Initialize S
4:     while S is not terminal do
5:         Choose A given S using a policy derived from Q
6:         Take action A, observe R, S'
7:         Q(S, A) ← Q(S, A) + α[ R + γ max_a Q(S', a) − Q(S, A) ]
8:         S ← S'
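Algorithm 1 translates directly to a tabular implementation. The following is a minimal sketch; the environment interface (reset() and step(action) returning next state, reward, and a done flag) and all hyperparameter values are assumptions for illustration, not part of this work.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning following Algorithm 1, with epsilon-greedy
    action selection derived from the current Q estimate."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))            # arbitrary initialization
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if rng.random() < epsilon:             # explore
                a = int(rng.integers(n_actions))
            else:                                  # exploit
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Recursive update, equation 2.6.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q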

2.3 Convolutional Neural Network

The goal in representation learning is to extract low dimensional, meaningful features from high dimensional data. When that high dimensional data takes the form of pixels, at present, the dominant tool for the job is the Convolutional Neural Network (CNN). All encoders tested in this work incorporate CNN layers in their architectures. A CNN layer operates by passing kernels over an input feature map and performing a correlation operation. A unique kernel for each input channel is swept over its channel of the input feature map. The resulting feature maps are summed to produce one channel of the final output feature map. This process is repeated for each channel of the output feature map. The number of output channels, kernel spatial dimensions, stride, and padding are user-defined parameters. The weight values in each kernel are learned by backpropagation and gradient descent. Repeated stacking of convolutional layers creates a CNN. In typical network design paradigms the output feature map gradually decreases in spatial dimensions and overall size, while the number of channels increases and the receptive field of each feature grows larger [2]. It has been demonstrated through visualization of learned feature maps in the late layers of CNNs trained on object recognition and other tasks that kernels learn to identify complex, semantically meaningful features in the original input [3].
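For a sense of the shapes involved, the snippet below applies a single convolutional layer with the same hyperparameters as the first encoder layer listed in Appendix B; PyTorch is assumed here purely for illustration.

import torch
import torch.nn as nn

# One convolutional layer: 1 input channel, 16 output channels,
# 4x4 kernels, stride 2, padding 1 (the first encoder layer in Appendix B).
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=4, stride=2, padding=1)

x = torch.zeros(1, 1, 28, 28)   # a batch of one 28x28 elevation image
y = conv(x)
print(y.shape)                  # torch.Size([1, 16, 14, 14]): stride 2 halves
                                # the spatial size, 16 kernels give 16 channels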


2.4 Deep Q-Learning

Traditional Q-learning has only been successful in very low dimensional environments, in which all combinations of potential state action pairs can be tabulated and visited repeatedly until convergence. In higher dimensional domains it is not computationally tractable to visit every state action pair multiple times, so the estimated state-action value function will not converge to the optimal state-action value function. DQN extended Q-learning to high dimensional state spaces by introducing an architecture and training procedure for using neural networks as Q-function approximators [4]. The relevant primary components of this algorithm are described in the following subsections.

2.4.1 Experience Replay

DQN gathers data in the same manner as traditional Q-learning approaches. The agent steps through the environment experiencing (State, Action, Reward, Next State) (SARS) tuples. Traditional Q-learning can immediately apply the Bellman update equation to incoming tuples and update the tabulated Q-function. DQN cannot train on the data as it arrives, because it uses a neural network to map states directly to Q-values. A new training procedure is needed to avoid the phenomenon of catastrophic forgetting [5]. This is the phenomenon in which neural networks trained on non-stationary data distributions alter previously learned weights to fit the current distribution of the data, wiping out any important patterns they may have learned for previously experienced data distributions. Deep reinforcement learning runs into catastrophic forgetting because the distribution of SARS tuples experienced by the agent can change significantly as it traverses an MDP. To overcome this, the DQN algorithm stores SARS tuples in a large memory, and randomly samples batches from this memory to train the neural network. Random sampling of batches eliminates the temporal correlations in the training data, making the distribution of SARS tuples more uniform.
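A minimal replay memory amounts to a fixed-size queue with uniform random sampling, as sketched below; the capacity and batch size shown are illustrative defaults, not a statement of the values used later in this work.

import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of SARS tuples with uniform random sampling,
    which breaks the temporal correlation in the agent's experience."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)    # oldest tuples are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=128):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)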


2.4.2 Policy and Target Network

DQN uses the standard Q-learning update equation 2.6, but instead of explicitly setting the output of the Q-function to the updated value, an error is calculated between the current and updated values and used to optimize the network through backpropagation and gradient descent. The nature of neural network weight learning introduces one problem into this recursive update procedure. The Q-value for the optimal next state action pair is used to raise or lower the Q-value for the current state action pair. If the Q-values for both pairs are calculated with the same network, the adjustment for one pair will likely push the output values for both pairs in a similar direction. This has the effect of introducing broader change to Q-values than would occur in traditional Q-learning, which can result in unstable training behavior. DQN therefore introduces two networks, the policy network and the target network. The target network is used for calculating the recursive update term in the Q-learning update equation. The policy network is used for controlling the agent and is the one updated by the update equation. The weights from the policy network are periodically transferred over to the target network, implementing a form of low-pass filtering and increasing the stability of the training procedure.
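One possible shape of such an update step is sketched below in PyTorch, with the target network supplying the bootstrap term of equation 2.6. The Huber loss, tensor shapes, and function names are assumptions for illustration rather than the exact implementation used in this work.

import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One optimization step on the policy network. `batch` is assumed to be
    SARS tensors: states, integer actions, rewards, next states, and a 0/1
    done flag."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) from the policy network for the actions actually taken.
    q_sa = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrap target from the frozen target network.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1.0 - dones)

    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodic synchronization of the target network:
# target_net.load_state_dict(policy_net.state_dict())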

2.5 Autoencoder

The autoencoder is a neural network structure that maps an input to itself through a low dimensional bottleneck layer [6]. This is an unsupervised learning problem that can be broken down into two components: an encoding component, in which an arbitrary function compresses the input to a pre-specified dimensionality, and a decoding component, in which an arbitrary function maps the compressed representation back to its original dimensionality. A reconstruction loss term is applied to the output and used to adjust the weights of the network; mean squared error is commonly used as this loss. The purpose of the structure is to produce an encoder-decoder pair that can achieve high levels of compression while maintaining the important features of the data.


2.6 Variational Autoencoder

The Variational Autoencoder (VAE) is an autoencoding architecture that maps an input to a low dimensional probability distribution. This distribution is then sampled from and decoded to reconstruct the input. This structure can be thought of as a directed probabilistic graphical model. It is assumed the training data X is generated by a process p(X|Z), where Z is an unobserved continuous random variable. Z is recovered by estimating the posterior of the data generation process, q_φ(Z|X), in the form of an encoder, and estimating the generation process likelihood, p_θ(X|Z), in the form of a decoder, where φ and θ are the parameters of the encoder and decoder networks. Solving for a posterior distribution with Bayes' theorem requires calculating the evidence p(X), which can involve an intractable integral. The posterior can be approximated with methods like Markov Chain Monte Carlo, but this can become computationally expensive in high dimensions. The VAE training procedure performs a gradient-based form of variational inference, in which a family of distributions is assumed for the posterior and the parameters of that family are tuned to estimate the posterior as closely as possible. These parameters are calculated and optimized with a neural network. This estimation procedure has been named Stochastic Gradient Variational Bayes, and was shown to be an unbiased estimator of the variational lower bound [7]. The loss function for training the VAE is shown in equation 2.7. The first term is the expected negative log-likelihood of a datapoint with respect to the encoder's distribution. This can be interpreted as a reconstruction loss that pushes the decoder towards accurately reconstructing the data. The second term can be interpreted as a regularization that minimizes the divergence between an assumed prior p(Z) and the encoder's estimated posterior distribution. p(Z) is typically a unit normal Gaussian for the standard VAE.

l_i(θ, φ) = −E_Z[ log p_θ(X_i | Z) ] + KL( q_φ(Z | X_i) ‖ p(Z) )        (2.7)

This loss function is derived by minimizing the Kullback-Leibler divergence between the estimated posterior q_φ(Z|X) and the true posterior p(Z|X). The derivation can be found in Appendix A.


There is one problem to be overcome in estimating posterior parameters with a network. Backpropagation requires operations for which partial derivatives can be calculated, in order to determine the direction of weight adjustment. A random sampling process from a Gaussian with mean µ and variance σ² is not differentiable. To bypass this, the VAE forward pass implements the reparameterization trick. Instead of sampling directly from a distribution with the given µ and σ², the algorithm samples from a standard normal Gaussian, scales the sample by the standard deviation σ, and shifts it by µ. This results in differentiable operations influencing the value of the sampled latent variable, which can be optimized with backpropagation and gradient descent.
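In code the trick amounts to a few lines. The sketch below assumes the encoder outputs µ and log σ² (a common convention, not stated explicitly above) and uses PyTorch for illustration.

import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so the sampling step
    remains differentiable with respect to the encoder outputs mu and
    logvar (log sigma^2)."""
    std = torch.exp(0.5 * logvar)    # sigma
    eps = torch.randn_like(std)      # draw from the standard normal
    return mu + std * eps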

Chapter 3

Methodology

3.1 Setup

A random decision policy is used to create a dataset of observation, action, reward, next observation tuples in the PuckStack environment. Four proposed representation learning architectures are trained on the generated dataset. Each architecture is used to map pixel-wise observations of the environment to a low dimensional vector. For each architecture, a controller is trained using that architecture's latent representation of the observation as state. The controller consists of two fully connected layers mapping state to action. The weights of this controller are optimized with the DQN algorithm. The performance of the in-latent-space controllers is compared to a fully convolutional DQN applied directly to the environment observation pixels. The goal of this procedure is to evaluate two common and two novel state representation learning architectures on the grounds of their ability to facilitate fast and stable control policy learning.

3.2 Puck-Stack Environment

Controller learning performance is tested in the PuckStack domain. The environment consists of a 4x4 grid with two pucks placed randomly on the grid. The goal is to pick up one puck and place it on the other. Each action takes one time step.


The environment episode terminates when the pucks have been successfully stacked, or when ten actions have been taken. This environment serves as a simplified version of a pick-and-place task. PuckStack actions are specified by choosing a location in the grid, i.e., selecting an integer from 0 through 15. Environment observations are elevation images, in which the pucks have a height of 1. The environment includes a sparse reward signal that equals 1 when the pucks have been successfully stacked, and is zero at all other times.

Figure 3.1: Successful Sequence in PuckStack. (a) Start state, (b) Picked top puck, (c) Placed puck on the other.

3.3 Data Generation

A dataset of 10,000 environment episode sequences is generated by stepping an agent, governed by a discrete uniform random policy, through the environment. The state, action, reward, next state tuples experienced by the agent are recorded. To reduce training time, the observation images are re-sized to 28x28 pixels. Pixel values are scaled by one half, such that the maximum pixel value in any observation is one and the minimum is zero.
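The collection procedure can be sketched as a simple rollout loop. The environment interface (reset()/step()) below is hypothetical, and the resizing and scaling described above are assumed to happen inside the environment wrapper.

import numpy as np

def collect_random_episodes(env, n_episodes=10000, n_actions=16, seed=0):
    """Gather (observation, action, reward, next observation, done) tuples
    under a discrete uniform random policy."""
    rng = np.random.default_rng(seed)
    dataset = []
    for _ in range(n_episodes):
        obs = env.reset()
        done = False
        while not done:
            action = int(rng.integers(n_actions))      # uniform over grid cells
            next_obs, reward, done = env.step(action)
            dataset.append((obs, action, reward, next_obs, done))
            obs = next_obs
    return dataset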

3.4 Proposed Architectures

There are a number of representation learning techniques that have been used in the control setting. In this work, the focus is on a direct comparison of two types, the variational autoencoder and the inverse model, and a variant of each type. To facilitate a fair comparison, architecture details are kept as similar as possible. Each architecture has the exact same encoder structure. A single fully connected layer maps encoder output features to a specified latent dimension size. The controller for all latent DQN agents consists of a two-layer fully connected network whose input size is equal to the latent space dimensionality. This is mapped to a 512-dimensional space and then mapped to the action space. Architectural details for all models can be found in the appendix.

3.4.1 Reconstruction Architectures

A fundamental idea in representation learning is that the 'true' state was used to create the observation, and is present in the observation, along with noise. With this idea in mind, one technique is to compress the observation to a lower dimensional space that describes most of the variation in the observation space. This lower dimensional representation is then used as an approximation of the true state. A simple linear transformation that accomplishes this is Principal Component Analysis (PCA). Deep autoencoder architectures can be thought of as a non-linear form of PCA with a constraint on the dimensionality of the latent space. Investigations into the latent space structure of standard autoencoder networks have shown the learned representation space can have an irregular shape, with gaps where no training samples appear. It is suspected that this irregular structure can make downstream tasks more difficult. One of the discussed benefits of the VAE is the structure it imposes on the latent space through the normalization loss applied to the latent distribution. A Kullback-Leibler divergence term penalizes the latent distribution for being dissimilar to an assumed latent distribution. This has the effect of clustering the learned representations into a compact area of the latent space. The most common assumed latent distribution is the unit normal Gaussian. One drawback of the reconstruction approach is that it is a truly unsupervised method. There is no signal present that enables the differentiation between features that are relevant to the downstream task, and those that aren't. Reconstruction methods simply represent the most salient features in the image, which can lead to the presence of distractor features in the representation. Despite this, they have been used with reasonable success in a number of applications.

3.4.1.1 Variational Autoencoder

As detailed in background section 2.6, the VAE structure is trained to reconstruct the input data after compressing it to a low dimensional statistical distribution, usually Gaussian, and sampling from it. The network is trained with a reconstruction loss and a normalization loss on the latent distribution. The reconstruction loss is the element-wise binary cross entropy between the network output and the input. The normalization loss is the Kullback-Leibler divergence between the latent distribution and the standard unit Gaussian with mean zero and standard deviation one. The VAE has been used in a variety of applications. In control, it has been used to learn state representations for robotic manipulators [8], and for simulated games like Car-Racing and a variant of Doom [9]. Because of its ubiquity, this is the first representation learning technique tested, and it will serve as a benchmark among the latent space methods. All subsequent architectures will be compared to the standard DQN as well as the VAE.


Figure 3.2: VAE Structure
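Combining the reconstruction and normalization losses described above gives a per-batch objective along the following lines. This is a sketch in PyTorch under the common convention of summing over elements; it is not the exact code used for the experiments.

import torch
import torch.nn.functional as F

def vae_loss(recon, target, mu, logvar):
    """Element-wise binary cross entropy between reconstruction and input,
    plus the closed-form KL divergence between N(mu, sigma^2) and the unit
    Gaussian (equation 2.7), averaged over the batch."""
    bce = F.binary_cross_entropy(recon, target, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (bce + kl) / target.size(0)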

3.4.2 Predictive Variational Autoencoder

The second structure tested is a novel variant of the VAE. Again, an encoder maps an input observation to Gaussian distribution parameters. The sampled latent vector is fed to two parallel decoders. The target of the first decoder is to reconstruct the current observation, as in the standard structure. The second decoder is trained to reconstruct the next time step's observation. The losses are the same as for the standard VAE, except that two reconstruction loss terms are calculated, one for each decoder output. The intuition behind this configuration is that the structure is trained to produce a compact representation that parameterizes the current observation as well as a distribution over possible future states. This has conceptual similarities to the successor representation [10], in which states are encoded by their predictive relationship with other states. It is hypothesized that this information about possible future states could better inform action selection and facilitate learning.


Figure 3.3: Predictive VAE Structure
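The loss for this variant can be sketched as below: the same latent sample is decoded twice, against the current and the next observation, with a single KL term on the encoder distribution. The decoder modules and the equal weighting of the terms are assumptions for illustration.

import torch
import torch.nn.functional as F

def predictive_vae_loss(decoder_now, decoder_next, z, obs, next_obs, mu, logvar):
    """Two reconstruction terms (current and next observation) plus one KL
    term on the encoder's latent distribution, averaged over the batch."""
    recon_now = decoder_now(z)
    recon_next = decoder_next(z)
    bce_now = F.binary_cross_entropy(recon_now, obs, reduction="sum")
    bce_next = F.binary_cross_entropy(recon_next, next_obs, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (bce_now + bce_next + kl) / obs.size(0)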

3.4.3 Inverse Models

Another representation that has been used in control applications is the inverse model. This structure maps two sequential observations to the action that occurred between them. The model can be thought of as learning features useful for predicting actionable changes in the environment. One benefit of this is the ability to ignore distracting features irrelevant to control. Where the VAE attempts to reconstruct all features of an observation, the inverse model is only incentivized to encode features that are affected by control. Other works have paired an inverse model with a forward model to form a representation [11], or to inform internal rewards and exploration [12]. Here, an inverse model is applied alone as a means of learning a latent space.

3.4.3.1 Inverse Model

The inverse model is trained in a siamese fashion, in which the weights of the encoders being applied to both images are shared. The output features of these encoders are flattened into vectors, concatenated, and passed to a fully connected layer that classifies the action. The softmax operation is applied to the final output. Loss is calculated as the cross entropy between the softmax vector and a one-hot vector specifying the true action.

Figure 3.4: Inverse Model Structure
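A sketch of this structure in PyTorch, assuming a shared encoder module and placeholder feature dimensions; folding the softmax into the cross-entropy loss is equivalent to the description above.

import torch
import torch.nn as nn

class InverseModel(nn.Module):
    """Siamese inverse model: one shared encoder is applied to both
    observations, the flattened features are concatenated, and a linear
    layer produces action logits."""
    def __init__(self, encoder, feature_dim, n_actions=16):
        super().__init__()
        self.encoder = encoder                            # shared weights
        self.classifier = nn.Linear(2 * feature_dim, n_actions)

    def forward(self, obs, next_obs):
        f1 = self.encoder(obs).flatten(start_dim=1)
        f2 = self.encoder(next_obs).flatten(start_dim=1)
        return self.classifier(torch.cat([f1, f2], dim=1))

# Training loss: cross entropy between the logits and the true action index;
# torch.nn.functional.cross_entropy applies the softmax internally.
# loss = torch.nn.functional.cross_entropy(model(obs, next_obs), action_indices)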

3.4.3.2 Variational Inverse Model

The final architecture tested is an inverse model modified to include a variational inference step in the encoding process, analogous to what is done in the VAE. The encoder of the model now maps observations to the parameters of a Gaussian distribution. The distributions of two sequential observations are sampled from and concatenated to predict the action that occurred between them. The motivation for this is to attain a more structured and compact representation space, as is observed in the VAE. This architecture is trained with both the cross entropy between the output and the true action, and the Kullback-Leibler divergence between the encoded distribution and the unit normal distribution.


Figure 3.5: Variational Inverse Model Structure

3.5 Training Procedure

Each structure is trained with its appropriate training procedure, on the random policy data. The encoder of each structure is then used to preprocess pixel-wise observations into state representations for a two layer fully connected network controller that is optimized with the DQN algorithm.

3.6 Evaluation

The structures are evaluated on the number of episodes required to learn a policy that has a high average performance over 100 episodes of the environment, as well as learning curve variance over repeated training runs. The fully convolutional DQN algorithm will serve as a non-representation learning baseline. The VAE will serve as a representation learning baseline due to its wide popularity and use.

Chapter 4

Results

A significant search was performed over latent space dimensionality and number of training epochs to produce encodings that facilitate the most performant downstream controllers. A summary of these final parameters can be found in Table 4.1. Figure 4.1 displays the training curves of all the architectures. The training curves of the predictive VAE, inverse model, and variational inverse model are plotted against the baseline fully convolutional DQN and the VAE latent controller. All controllers were trained with the DQN training procedure. Environment episodes were limited to 10 steps. A memory size of 10000 SARS tuples was used. Every 100 steps in the environment a policy network update occurs, in which a batch of 128 SARS tuples is randomly sampled from the memory and used to train the network. Every 2000 steps in the environment the weights from the policy network are copied to the target network. The epsilon-greedy strategy is used for environment exploration. The exploration rate starts at 95% and decays to 5% linearly over 100000 steps in the environment. It was discovered after performing these experiments that the usage of data in the memory here is suboptimal. These parameters use the data in the memory on average about once. It is possible to train more frequently, such that the data in the memory is used 50 to 100 times over. Applying these changes would scale all training curves along the x-axis. In five training trials, the baseline DQN algorithm learned a near-perfect policy in approximately 10000 episodes of experience.
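For reference, the training settings stated above are collected in one place below; the dictionary itself is merely an illustrative way of organizing the values taken from the text.

# DQN training settings used for all controllers in this chapter.
dqn_config = {
    "max_episode_steps": 10,          # environment episodes limited to 10 steps
    "memory_size": 10000,             # SARS tuples held in replay memory
    "batch_size": 128,                # tuples sampled per policy-network update
    "update_every_steps": 100,        # environment steps between updates
    "target_sync_every_steps": 2000,  # steps between policy -> target copies
    "epsilon_start": 0.95,            # initial epsilon-greedy exploration rate
    "epsilon_end": 0.05,              # final exploration rate
    "epsilon_decay_steps": 100000,    # linear decay horizon
}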


Figure 4.1: Comparing latent controller performances to the fully convolutional DQN and the VAE latent-DQN; min-max-mean training curves over five training trials. (a) Predictive VAE, (b) Inverse Model, (c) Variational Inverse Model. Each panel plots successes in 100 rollouts against training episodes.


The controllers trained in the VAE latent space reached equivalent performance at approximately 8000 episodes of experience, 20 percent fewer than the standard DQN. In addition, the VAE latent controllers had more stable training behavior. This was an expected result, as the controller does not need to learn any parameters for the feature-extracting convolutional layers. The predictive VAE latent space controllers performed nearly identically to the VAE latent space controllers. This seems to refute the notion that information about next possible states aids in decision making. It is possible that the current testing environment is too simple to gain any advantage from information about next possible states. It is notable that the most performant predictive VAE has the same latent dimensionality as the most performant VAE. This architecture was optimized for downstream control performance, not for its ability to encode information about next possible states. Another possibility is that this encoder, intentionally or unintentionally, never learned to encode information about these states, and that this is the reason for the performance similarity to the standard VAE. Analysis of the predictive VAE's reconstructions would be necessary to assess this. It was hypothesized that the inverse model latent space would be at least as good as the VAE's latent space for policy learning, if not better. Surprisingly, in terms of sample efficiency, performance was consistently between that of the VAE and the standard DQN. The reason for this is not entirely clear. One would expect a sufficiently trained inverse model to encode all the information necessary for perfect control in a compact latent space. However, the inverse model does not have any normalization terms enforcing structure in the latent space. This could be one influencing factor, and was the motivation for the final structure tested, the variational inverse model. After a significant parameter search, no combination was found that resulted in a performant variational inverse model. The reasons for this are unclear. Another interesting observation is that the inverse models required a larger latent space to produce performant controllers than the architectures focused on reconstruction.


Table 4.1: Optimized Representation Learning Architecture Parameters

Structure   Latent Dimension   Epochs Trained
VAE         64                 10
PVAE        64                 16
IM          256                16
VIM         288                16

Chapter 5

Conclusions

5.1 The Motivation for Representation Learning

Representation learning focuses on learning compact, information dense representations from complex high dimensional data, taking advantage of the natural structure in that data in an unsupervised or self-supervised manner. One application of representation learning in robotics is state abstraction, which could reduce training time and improve the performance of reinforcement learning algorithms. Many techniques exist for learning abstractions from data, but the performance of these methods in terms of downstream control is not well understood. In this work, two common representation learning structures and two novel structures are trained and compared on the basis of downstream control.

5.2 Future Work

One problem with the current approach is that the architecture training relies on a dataset generated by a randomly controlled agent. In tasks where a large portion of the state space only becomes visible after certain sub-tasks have been accomplished, the encoding architecture will never be trained to represent these areas. This problem can be side-stepped by iterative methods that train representations from random policy data, learn a policy in the representation space, use that policy to generate new data, and repeat.

A more elegant solution could be to learn representations that inform intrinsic motivation and curiosity, to speed up the exploration process, similar to what is achieved in [12]. This is a relatively young area of research with many exciting directions to pursue. In the methods tested in this work, few constraints are placed on the structure of the resulting latent space. There are certain properties of a latent space that could be very useful for performing tasks. One such property is disentanglement, in which the variables of the latent space have little or no correlation with one another [13]. Other properties that could be useful include hierarchical structure [14], and dynamics similar to the physical world [15]. It could also be useful to learn representations of a very specific design, such as the positions and velocities of objects [16], or with a predefined graphical structure [17]. These are all directions of active research that could prove very useful for robotic control. The following subsections detail two areas that are actively being pursued as follow-up work.

5.2.1 Partial Observability

Most environments are inherently non-Markovian. A simple example of this is a moving object. The state of a moving object cannot be known from a single visual observation alone; velocity is lost when only a snapshot of the present is available. One interesting direction for future work is learning representations that encode this temporal information for effective control in the continuous, non-Markovian world. As a continuation of this thesis, recurrent representation learning structures are being explored for this task.

5.2.2 Representation Conducive to Planning

The methods tested in this work use unsupervised and self-supervised learning to map observations to lower dimensional spaces. These methods make no assumptions about the dynamics or behavior of the lower dimensional space. Another interesting direction is to enforce latent space dynamics that are conducive to particular types of planning and control. For example, one thought is to map observations to a linear dynamical space such that LQR and other classical techniques can be applied for control.

Bibliography

[1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.

[2] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[3] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833.

[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.

[5] M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” in Psychology of learning and motivation. Elsevier, 1989, vol. 24, pp. 109–165.

[6] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

[7] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.

[8] A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine, “Visual reinforcement learning with imagined goals,” in Advances in Neural Information Processing Systems, 2018, pp. 9191–9200.


[9] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” in Advances in Neural Information Processing Systems, 2018, pp. 2450–2462.

[10] P. Dayan, “Improving generalization for temporal difference learning: The successor representation,” Neural Computation, vol. 5, no. 4, pp. 613–624, 1993.

[11] P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine, “Learning to poke by poking: Experiential learning of intuitive physics,” in Advances in Neural Information Processing Systems, 2016, pp. 5074–5082.

[12] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 16–17.

[13] S. Narayanaswamy, T. B. Paige, J.-W. Van de Meent, A. Desmaison, N. Goodman, P. Kohli, F. Wood, and P. Torr, “Learning disentangled representations with semi- supervised deep generative models,” in Advances in Neural Information Processing Systems, 2017, pp. 5925–5935.

[14] E. Mathieu, C. L. Lan, C. J. Maddison, R. Tomioka, and Y. W. Teh, “Hierarchical representations with Poincaré variational auto-encoders,” arXiv preprint arXiv:1901.06033, 2019.

[15] R. Jonschkowski and O. Brock, “State representation learning in robotics: Using prior knowledge about physical interaction.” in Robotics: Science and Systems, 2014.

[16] R. Jonschkowski, R. Hafner, J. Scholz, and M. Riedmiller, “Pves: Position-velocity encoders for unsupervised learning of structured state representations,” arXiv preprint arXiv:1705.09805, 2017.

[17] M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta, “Composing graphical models with neural networks for structured representations and fast inference,” in Advances in neural information processing systems, 2016, pp. 2946–2954.

Appendix A

VAE Loss Derivation

We want to approximate the true posterior of the data generation process p(Z|X) with a variational posterior qλ(Z|X) where λ indexes the distribution parameters. The quality of the approximation is measured with the Kullback-Leibler divergence between the true and approximated posteriors. The optimal approximation minimizes the value of the divergence.

q_λ*(Z|X) = argmin_λ KL( q_λ(Z|X) ‖ p(Z|X) )

We re-express the divergence to understand how it can be minimized:

KL( q_λ(Z|X) ‖ p(Z|X) ) = Σ_Z q_λ(Z|X) log [ q_λ(Z|X) / p(Z|X) ]
                        = E_Z[ log ( q_λ(Z|X) / p(Z|X) ) ]
                        = E_Z[ log q_λ(Z|X) − log p(Z|X) ]
                        = E_Z[ log q_λ(Z|X) − log ( p(X, Z) / p(X) ) ]
                        = E_Z[ log q_λ(Z|X) − log p(X, Z) ] + log p(X)

log p(X) = KL( q_λ(Z|X) ‖ p(Z|X) ) + E_Z[ log p(X, Z) − log q_λ(Z|X) ]


Here the equation has been rearranged so that log p(X), which is constant with respect to λ, is equal to the divergence we want to minimize plus the term on the right. The term on the right is the Evidence Lower Bound (ELBO) by definition. By Jensen's inequality, the KL divergence is always greater than or equal to zero, so we can minimize the divergence by maximizing the ELBO.

ELBO = E_Z[ log p(X, Z) − log q_λ(Z|X) ]
     = E_Z[ log p(X|Z) + log p(Z) − log q_λ(Z|X) ]
     = E_Z[ log p(X|Z) ] − E_Z[ log ( q_λ(Z|X) / p(Z) ) ]
     = E_Z[ log p(X|Z) ] − KL( q_λ(Z|X) ‖ p(Z) )

−ELBO = −E_Z[ log p(X|Z) ] + KL( q_λ(Z|X) ‖ p(Z) )

We can now minimize the negative of ELBO with gradient descent, resulting in the loss function for VAE training.

Appendix B

Architecture Building Block Summary

Encoder:

• 2D Conv: in-chan = 1, out-chan = 16, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu
• 2D Conv: in-chan = 16, out-chan = 32, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu
• 2D Conv: in-chan = 32, out-chan = 32, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu

Decoder:

• 2D DeConv: in-chan = 32, out-chan = 32, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu
• 2D DeConv: in-chan = 32, out-chan = 16, ksize = 4, stride = 2, pad = 0
• batchnorm2D
• relu
• 2D DeConv: in-chan = 16, out-chan = 1, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• sigmoid

Latent DQN Architecture:

• Linear: in-dim = latent-size + 1, out-dim = 512
• relu
• Linear: in-dim = 512, out-dim = 16

Fully Convolutional DQN Architecture:

• 2D Conv: in-chan = 1, out-chan = 16, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu
• 2D Conv: in-chan = 16, out-chan = 32, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu
• 2D Conv: in-chan = 32, out-chan = 32, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu
• Linear: in-dim = 32*3*3 + 1, out-dim = 16
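One possible PyTorch rendering of the fully convolutional DQN blocks listed above is sketched below. The extra +1 input to the final linear layer is kept as in the listing; how that additional scalar is produced is not specified here, so it is simply passed in as an argument, and the class name is an invention for this sketch.

import torch
import torch.nn as nn

class FullyConvDQN(nn.Module):
    """Three conv-batchnorm-relu blocks followed by a linear layer over the
    flattened 32x3x3 features plus one extra scalar input."""
    def __init__(self, n_actions=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
        )
        self.head = nn.Linear(32 * 3 * 3 + 1, n_actions)

    def forward(self, obs, extra_scalar):
        # obs: (batch, 1, 28, 28) elevation image; extra_scalar: (batch, 1)
        f = self.features(obs).flatten(start_dim=1)   # (batch, 32*3*3)
        return self.head(torch.cat([f, extra_scalar], dim=1))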
