
Representation Learning for Control

A Thesis Presented by

Kevin James Doty

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements for the degree of

Master of Science

in

Electrical and Computer Engineering

Northeastern University Boston, Massachusetts

May 2019

To my family.

Contents

List of Figures

List of Tables

List of Acronyms

Acknowledgments

Abstract of the Thesis

1 Introduction

2 Background
  2.1 Markov Decision Process
  2.2 Solving MDPs
    2.2.1 Q-Learning
  2.3 Convolutional Neural Network
  2.4 Deep Q-Learning
    2.4.1 Experience Replay
    2.4.2 Policy and Target Network
  2.5 Autoencoder
  2.6 Variational Autoencoder

3 Methodology
  3.1 Setup
  3.2 Puck-Stack Environment
  3.3 Data Generation
  3.4 Proposed Architectures
    3.4.1 Reconstruction Architectures
    3.4.2 Predictive Variational Autoencoder
    3.4.3 Inverse Models
  3.5 Training Procedure
  3.6 Evaluation

4 Results

5 Conclusions
  5.1 The Motivation for Representation Learning
  5.2 Future Work
    5.2.1 Partial Observability
    5.2.2 Representation Conducive to Planning

Bibliography

A VAE Loss Derivation

B Architecture Building Block Summary

List of Figures

2.1 MDP Agent Environment Interaction

3.1 Successful Sequence in PuckStack
3.2 VAE Structure
3.3 Predictive VAE Structure
3.4 Inverse Model Structure
3.5 Variational Inverse Model Structure

4.1 Comparing latent controller performances to the fully convolutional DQN and the VAE latent-DQN, min-max-mean training curves over five training trials

List of Tables

4.1 Optimized Representation Learning Architecture Parameters

List of Acronyms

AE Autoencoder. A neural network structure that maps an input to itself through a bottleneck layer.

CNN Convolutional Neural Network. A network that filters an input feature map by applying the correlation operation between kernels with learned weights and the input features.

DQN Deep Q-Network. An extension of traditional Q-learning to continuous state spaces by using neural networks as Q-function approximators.

MDP Markov Decision Process. A discrete time control process with perfect information. At each time step a decision making agent perceives a state, selects an action, and then progresses to a new state, optionally receiving a reward.

PCA Principal Component Analysis. A statistical procedure for converting data into a set of uncorrelated features.

POMDP Partially Observable Markov Decision Process. A special case of the Markov decision process, in which the agent receives observations that are not the true state and do not contain perfect information.

SARS (State, Action, Reward, Next State)

VAE Variational Autoencoder. A probabilistic autoencoding structure in which the encoder is trained to approximate the posterior distribution of the data generation process.

Acknowledgments

I would like to thank my family and friends for their support and my advisor Rob Platt for the insightful discussions and guidance. I would also like to express my appreciation for the robotics community of NEU at large. The past couple of years have truly been fulfilling. It has been my pleasure to be here learning with all of you.

Abstract of the Thesis

Representation Learning for Control

by Kevin James Doty

Master of Science in Electrical and Computer Engineering
Northeastern University, May 2019

Dr. Robert Platt, Advisor

State representation learning finds an embedding from a high dimensional observation space to a lower dimensional and information dense state space, without supervision. Effective state representation learning can improve the sample efficiency and stability of downstream tasks such as control. In this work, two autoencoder variants and two inverse models are explored as representation learning architectures. The DQN algorithm is used to learn control in the resulting latent spaces. Latent-control performance is benchmarked against a standard fully convolutional DQN on a downstream control task.

Chapter 1

Introduction

One of the fundamental challenges of reinforcement learning is learning control policies in high dimensional observation spaces. This is particularly important in robotics applications, where we would like to develop behaviors that are responsive to raw sensory data such as images, point clouds or audio. Often these raw data contain many features irrelevant to the task. Learning in these conditions can require excessive amounts of data and time, and is prone to instability. One idea that can be applied to get around the curse of dimensionality in raw data is state representation learning. In state representation learning, the structure in information gathered passively or actively is exploited to learn mappings from a high dimensional observation space to a compact and meaningful state space. The learned mappings can then be used to pre-process raw data before sending it to a reinforcement learning algorithm as state. This process can greatly simplify the problem of policy learning. In this work, two forms of representation learning are explored: the autoencoder and the inverse model. The autoencoder has a fully unsupervised training process in which an input is mapped to itself through an information bottleneck. The structure is trained to reconstruct its input after compressing the information down to a predefined dimensionality. The inverse model is a self-supervised technique. Here, agent actions and the observations influenced by those actions are recorded simultaneously. A siamese network structure is then given sequential observations as input and trained to predict the action that occurred between them. This forces the encoder to learn features relevant for discerning the effect of actions in the environment.


Autoencoders and inverse models have been applied in various settings. This work seeks to understand the relative performance of the two architectures, as well as to test two novel architectures based on the autoencoding and inverse model paradigms. To accomplish this, these architectures are trained on data collected in the PuckStack environment. Deep Q-Network (DQN) agents are then trained in the latent spaces of the resulting encoders. The number of training episodes until convergence, training stability, and performance of these controllers are compared to each other and to a DQN baseline agent that learns directly from environment pixels. The goals of this work are to produce further understanding of the abilities and relative performances of the tested representation learning techniques, and to inform the design of these architectures for future applications. Effective state representation learning has the potential to solve many of the difficult aspects of learning control from raw data, in an unsupervised or self-supervised manner. The long-term goals for this field are to reduce the amount of data and time necessary to learn control policies, and to increase the stability and overall capability of these policies. With some progress, it may become feasible to train policies on physical robots in the lab in days or hours.

Chapter 2

Background

2.1 Markov Decision Process

The Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision making. The MDP involves the notions of an agent and an environment that interact through actions, state transitions, and reward signals. The decision-making agent perceives the state of the environment, chooses an action, progresses to a new state, and receives a reward signal.

Figure 2.1: MDP Agent Environment Interaction (Source [1])

An MDP can be formally defined by a state space, an action space, a state transition function, and a reward function. The state transition function returns a probability distribution over next states given a current state and action. The reward function returns a distribution over rewards given a state and action.
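As an illustrative sketch, the four components above can be written out in tabular form for a toy two-state MDP. The chain and all of its numbers are invented for illustration only; they are not part of this work.

import numpy as np

# Toy tabular MDP: two states, two actions. All values here are illustrative.
n_states, n_actions = 2, 2

# Transition function: P[s, a, s'] is the probability of landing in s'
# after taking action a in state s.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1]   # action 0 in state 0 mostly stays put
P[0, 1] = [0.2, 0.8]   # action 1 in state 0 mostly advances
P[1, 0] = [0.0, 1.0]   # state 1 is absorbing
P[1, 1] = [0.0, 1.0]

# Reward function: R[s, a] is the expected immediate reward.
R = np.zeros((n_states, n_actions))
R[0, 1] = 1.0

def step(s, a, rng=np.random.default_rng(0)):
    """Sample one transition (s, a) -> (r, s')."""
    s_next = int(rng.choice(n_states, p=P[s, a]))
    return R[s, a], s_next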


In this work we consider only finite episodic MDPs, in which the state space and action space are finite and the agent's interactions with the environment consist of a series of finite-length sequences. The interaction of the agent and the environment continues until the agent encounters some terminal state. A traversal through the environment from initial state to terminal state is referred to as an episode.

2.2 Solving MDPs

The goal of the agent is to maximize the cumulative reward it receives from the environment during an episode. This can be formalized as the return defined in 2.1, where γ is a discount factor with a value in the interval [0, 1). This factor prevents the sum from diverging, and can be used to weight the importance of short-term versus long-term rewards in a particular application. The formula can be regrouped as in 2.2 and then expressed recursively as 2.3.

G_t ≜ R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³R_{t+4} + ···        (2.1)
    = R_{t+1} + γ(R_{t+2} + γR_{t+3} + γ²R_{t+4} + ···)        (2.2)
    = R_{t+1} + γG_{t+1}        (2.3)
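As a quick numeric check of the identity between the direct sum 2.1 and the recursive form 2.3, consider the following sketch; the reward sequence and discount factor are arbitrary.

gamma = 0.9
rewards = [1.0, 0.0, 0.0, 2.0]               # R_{t+1}, R_{t+2}, ...

# Direct sum, equation 2.1.
G_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive form, equation 2.3, evaluated backwards from the episode end.
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G

assert abs(G - G_direct) < 1e-12             # both equal 1.0 + 0.9**3 * 2.0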

Action selection in an MDP dictates not only the reward received at the current time step, but also subsequent states, and therefore the rewards that can be received in future time steps. In solving an MDP, current and future rewards must both be considered. This can involve selecting low-reward actions in the present to gain access to higher rewards in future steps. The function the agent uses to select actions given states is referred to as a policy. This policy can be probabilistic or deterministic. We want to learn an optimal policy, that is, a policy that achieves the maximum return possible in the environment. There are many different ways of solving for optimal policies in the reinforcement learning literature. These techniques vary in their assumed knowledge of the MDP, the functions they approximate, and the manner in which they recursively update their estimates.


Two important functions used for finding optimal policies are the value function and the action-value function. For a given policy, the value function 2.4 estimates the expected return from a state if the policy is used to select actions from that state until the terminal state is reached.

v_π(s) ≜ E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]        (2.4)

The action-value function 2.5 is defined as the expected return starting from a given state s, taking a given action a, and following a policy π thereafter until the terminal state is reached.

q_π(s, a) ≜ E_π[G_t | S_t = s, A_t = a] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]        (2.5)

2.2.1 Q-Learning

In this work, an extended form of the Q-learning algorithm is used to learn control policies. Q-learning approximates the state-action value function, often referred to as the Q-function. The Q-learning algorithm approximates this function directly from experience in the environment, without estimating any form of transition model, and it updates its estimate at every time step using a recursive update equation. Another feature of this algorithm is that its recursive update equation 2.6 assumes the optimal policy is followed rather than the current policy. Hence, Q-learning is an off-policy, model-free, temporal-difference learning algorithm.

Q(S_t, A_t) ← Q(S_t, A_t) + α[ R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]        (2.6)


Algorithm 1 Q-Learning
1: Initialize Q(S, A) arbitrarily
2: for all episodes do
3:     Initialize S
4:     while S is not terminal do
5:         Choose A given S using a policy derived from Q
6:         Take action A, observe R, S'
7:         Q(S, A) ← Q(S, A) + α[ R + γ max_a Q(S', a) − Q(S, A) ]
8:         S ← S'
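Algorithm 1 translates directly to a tabular implementation. The following is a minimal sketch; the environment interface (reset() and step(action) returning next state, reward, and a done flag) and all hyperparameter values are assumptions for illustration, not part of this work.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning following Algorithm 1, with epsilon-greedy
    action selection derived from the current Q estimate."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))            # arbitrary initialization
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if rng.random() < epsilon:             # explore
                a = int(rng.integers(n_actions))
            else:                                  # exploit
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Recursive update, equation 2.6.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q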

2.3 Convolutional Neural Network

The goal in representation learning is to extract low dimensional, meaningful features from high dimensional data. When that high dimensional data takes the form of pixels, at present, the dominant tool for the job is the Convolutional Neural Network (CNN). All encoders tested in this work incorporate CNN layers in their architectures. A CNN layer operates by passing kernels over an input feature map and performing a correlation operation. A unique kernel for each input channel is swept over its channel of the input feature map. The resulting feature maps are summed to produce one channel of the final output feature map. This process is repeated for each channel of the output feature map. The number of output channels, kernel spatial dimensions, stride, and padding are user-defined parameters. The weight values in each kernel are learned by backpropagation and gradient descent. Repeated stacking of convolutional layers creates a CNN. In typical network design paradigms the output feature map gradually decreases in spatial dimensions and overall size, while the number of channels increases and the receptive field of each feature grows larger [2]. It has been demonstrated through visualization of learned feature maps in the late layers of CNNs trained on object recognition and other tasks that kernels learn to identify complex, semantically meaningful features in the original input [3].
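For a sense of the shapes involved, the snippet below applies a single convolutional layer with the same hyperparameters as the first encoder layer listed in Appendix B; PyTorch is assumed here purely for illustration.

import torch
import torch.nn as nn

# One convolutional layer: 1 input channel, 16 output channels,
# 4x4 kernels, stride 2, padding 1 (the first encoder layer in Appendix B).
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=4, stride=2, padding=1)

x = torch.zeros(1, 1, 28, 28)   # a batch of one 28x28 elevation image
y = conv(x)
print(y.shape)                  # torch.Size([1, 16, 14, 14]): stride 2 halves
                                # the spatial size, 16 kernels give 16 channels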


2.4 Deep Q-Learning

Traditional Q-learning has only been successful in very low dimensional environments, in which all combinations of potential state action pairs can be tabulated and visited repeatedly until convergence. In higher dimensional domains it is not computationally tractable to visit every state action pair multiple times, so the estimated state-action value function will not converge to the optimal state-action value function. DQN extended Q-learning to high dimensional state spaces by introducing an architecture and training procedure for using neural networks as Q-function approximators [4]. The relevant primary components of this algorithm are described in the following subsections.

2.4.1 Experience Replay

DQN gathers data in the same manner as traditional Q-learning approaches. The agent steps through the environment experiencing (State, Action, Reward, Next State) (SARS) tuples. Traditional Q-learning can immediately apply the Bellman update equation to incoming tuples and update the tabulated Q-function. DQN cannot train on the data as it arrives, because it uses a neural network to map states directly to Q-values. A new training procedure is needed to avoid the phenomenon of catastrophic forgetting [5]. This is the phenomenon in which neural networks trained on non-stationary data distributions alter previously learned weights to fit the current distribution of the data, wiping out any important patterns they may have learned for previously experienced data distributions. Deep reinforcement learning runs into catastrophic forgetting because the distribution of SARS tuples experienced by the agent can change significantly as it traverses an MDP. To overcome this, the DQN algorithm stores SARS tuples in a large memory, and randomly samples batches from this memory to train the neural network. Random sampling of batches eliminates the temporal correlations in the training data, making the distribution of SARS tuples more uniform.
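A minimal replay memory amounts to a fixed-size queue with uniform random sampling, as sketched below; the capacity and batch size shown are illustrative defaults, not a statement of the values used later in this work.

import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of SARS tuples with uniform random sampling,
    which breaks the temporal correlation in the agent's experience."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)    # oldest tuples are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=128):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)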


2.4.2 Policy and Target Network

DQN uses the standard Q-learning update equation 2.6, but instead of explicitly setting the output of the Q-function to the updated value, an error is calculated between the current and updated values and used to optimize the network through backpropagation and gradient descent. The nature of neural network weight learning introduces one problem into this recursive update procedure. The Q-value for the optimal next state action pair is used to raise or lower the Q-value for the current state action pair. If the Q-values for both pairs are calculated with the same network, the adjustment for one pair will likely push the output values for both pairs in a similar direction. This has the effect of introducing broader change to Q-values than would occur in traditional Q-learning, which can result in unstable training behavior. DQN therefore introduces two networks, the policy network and the target network. The target network is used for calculating the recursive update term in the Q-learning update equation. The policy network is used for controlling the agent and is the one updated by the update equation. The weights from the policy network are periodically transferred over to the target network, implementing a form of low-pass filtering and increasing the stability of the training procedure.
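One possible shape of such an update step is sketched below in PyTorch, with the target network supplying the bootstrap term of equation 2.6. The Huber loss, tensor shapes, and function names are assumptions for illustration rather than the exact implementation used in this work.

import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One optimization step on the policy network. `batch` is assumed to be
    SARS tensors: states, integer actions, rewards, next states, and a 0/1
    done flag."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) from the policy network for the actions actually taken.
    q_sa = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrap target from the frozen target network.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1.0 - dones)

    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodic synchronization of the target network:
# target_net.load_state_dict(policy_net.state_dict())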

2.5 Autoencoder

The autoencoder is a neural network structure that maps an input to itself through a low dimensional bottleneck layer [6]. This is an unsupervised learning problem that can be broken down into two components: an encoding component, in which an arbitrary function compresses the input to a pre-specified dimensionality, and a decoding component, in which an arbitrary function maps the compressed representation back to its original dimensionality. A reconstruction loss term is applied to the output and used to adjust the weights of the network; mean squared error is commonly used as this loss. The purpose of the structure is to produce an encoder-decoder pair that can achieve high levels of compression while maintaining the important features of the data.


2.6 Variational Autoencoder

The Variational Autoencoder (VAE) is an autoencoding architecture that maps an input to a low dimensional probability distribution. This distribution is then sampled from and decoded to reconstruct the input. This structure can be thought of as a directed probabilistic graphical model. It is assumed the training data X is generated by a process p(X|Z), where Z is an unobserved continuous random variable. Z is recovered by estimating the posterior of the data generation process, q_φ(Z|X), in the form of an encoder, and estimating the generation process likelihood, p_θ(X|Z), in the form of a decoder, where φ and θ are the parameters of the encoder and decoder networks. Solving for a posterior distribution with Bayes' theorem requires calculating the evidence p(X), which can involve an intractable integral. The posterior can be approximated with methods like Markov Chain Monte Carlo, but this can become computationally expensive in high dimensions. The VAE training procedure performs a gradient-based form of variational inference, in which a family of distributions is assumed for the posterior and the parameters of that family are tuned to estimate the posterior as closely as possible. These parameters are calculated and optimized with a neural network. This estimation procedure has been named Stochastic Gradient Variational Bayes, and was shown to be an unbiased estimator of the variational lower bound [7]. The loss function for training the VAE is shown in equation 2.7. The first term is the expected negative log-likelihood of a datapoint with respect to the encoder's distribution. This can be interpreted as a reconstruction loss that pushes the decoder towards accurately reconstructing the data. The second term can be interpreted as a regularization that minimizes the divergence between an assumed prior p(Z) and the encoder's estimated posterior distribution. p(Z) is typically a unit normal Gaussian for the standard VAE.

l_i(θ, φ) = −E_Z[ log p_θ(X_i | Z) ] + KL( q_φ(Z | X_i) ‖ p(Z) )        (2.7)

This loss function is derived by minimizing the Kullback-Leibler divergence between the estimated posterior q_φ(Z|X) and the true posterior p(Z|X). The derivation can be found in Appendix A.


There is one problem to be overcome in estimating posterior parameters with a network. Backpropagation requires operations for which partial derivatives can be calculated, in order to determine the direction of weight adjustment. A random sampling process from a Gaussian with mean µ and variance σ² is not differentiable. To bypass this, the VAE forward pass implements the reparameterization trick. Instead of sampling directly from a distribution with the given µ and σ², the algorithm samples from a standard normal Gaussian, scales the sample by the standard deviation σ, and shifts it by µ. This results in differentiable operations influencing the value of the sampled latent variable, which can be optimized with backpropagation and gradient descent.
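In code the trick amounts to a few lines. The sketch below assumes the encoder outputs µ and log σ² (a common convention, not stated explicitly above) and uses PyTorch for illustration.

import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so the sampling step
    remains differentiable with respect to the encoder outputs mu and
    logvar (log sigma^2)."""
    std = torch.exp(0.5 * logvar)    # sigma
    eps = torch.randn_like(std)      # draw from the standard normal
    return mu + std * eps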

Chapter 3

Methodology

3.1 Setup

A random decision policy is used to create a dataset of observation, action, reward, next observation tuples in the PuckStack environment. Four proposed representation learning architectures are trained on the generated dataset. Each architecture is used to map pixel-wise observations of the environment to a low dimensional vector. For each architecture, a controller is trained using that architecture's latent representation of the observation as state. The controller consists of two fully connected layers mapping state to action. The weights of this controller are optimized with the DQN algorithm. The performance of the in-latent-space controllers is compared to a fully convolutional DQN applied directly to the environment observation pixels. The goal of this procedure is to evaluate two common and two novel state representation learning architectures on the grounds of their ability to facilitate fast and stable control policy learning.

3.2 Puck-Stack Environment

Controller learning performance is tested in the PuckStack domain. The environment consists of a 4x4 grid with two pucks placed randomly on the grid. The goal is to pick up one puck and place it on the other. Each action takes one time step.


The environment episode terminates when the pucks have been successfully stacked, or when ten actions have been taken. This environment serves as a simplified version of a pick-and-place task. PuckStack actions are specified by choosing a location in the grid, i.e., selecting an integer from 0 through 15. Environment observations are elevation images, in which the pucks have a height of 1. The environment includes a sparse reward signal that equals 1 when the pucks have been successfully stacked, and is zero at all other times.

Figure 3.1: Successful Sequence in PuckStack. (a) Start state, (b) Picked top puck, (c) Placed puck on the other.

3.3 Data Generation

A dataset of 10,000 environment episode sequences is generated by stepping an agent, governed by a discrete uniform random policy, through the environment. The state, action, reward, next state tuples experienced by the agent are recorded. To reduce training time, the observation images are re-sized to 28x28 pixels. Pixel values are scaled by one half, such that the maximum pixel value in any observation is one and the minimum is zero.
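The collection procedure can be sketched as a simple rollout loop. The environment interface (reset()/step()) below is hypothetical, and the resizing and scaling described above are assumed to happen inside the environment wrapper.

import numpy as np

def collect_random_episodes(env, n_episodes=10000, n_actions=16, seed=0):
    """Gather (observation, action, reward, next observation, done) tuples
    under a discrete uniform random policy."""
    rng = np.random.default_rng(seed)
    dataset = []
    for _ in range(n_episodes):
        obs = env.reset()
        done = False
        while not done:
            action = int(rng.integers(n_actions))      # uniform over grid cells
            next_obs, reward, done = env.step(action)
            dataset.append((obs, action, reward, next_obs, done))
            obs = next_obs
    return dataset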

3.4 Proposed Architectures

There are a number of representation learning techniques that have been used in the control setting. In this work, the focus is on a direct comparison of two types, the variational autoencoder and the inverse model, and a variant of each type. To facilitate a fair comparison, architecture details are kept as similar as possible. Each architecture has the exact same encoder structure. A single fully connected layer maps encoder output features to a specified latent dimension size. The controller for all latent DQN agents consists of a two-layer fully connected network whose input size is equal to the latent space dimensionality. This is mapped to a 512-dimensional space and then mapped to the action space. Architectural details for all models can be found in the appendix.

3.4.1 Reconstruction Architectures

A fundamental idea in representation learning is that the 'true' state was used to create the observation, and is present in the observation, along with noise. With this idea in mind, one technique is to compress the observation to a lower dimensional space that describes most of the variation in the observation space. This lower dimensional representation is then used as an approximation of the true state. A simple linear transformation that accomplishes this is Principal Component Analysis (PCA). Deep autoencoder architectures can be thought of as a non-linear form of PCA with a constraint on the dimensionality of the latent space. Investigations into the latent space structure of standard autoencoder networks have shown the learned representation space can have an irregular shape, with gaps where no training samples appear. It is suspected that this irregular structure can make downstream tasks more difficult. One of the discussed benefits of the VAE is the structure it imposes on the latent space through the normalization loss applied to the latent distribution. A Kullback-Leibler divergence term penalizes the latent distribution for being dissimilar to an assumed latent distribution. This has the effect of clustering the learned representations into a compact area of the latent space. The most common assumed latent distribution is the unit normal Gaussian. One drawback of the reconstruction approach is that it is a truly unsupervised method. There is no signal present that enables the differentiation between features that are relevant to the downstream task, and those that aren't. Reconstruction methods simply represent the most salient features in the image, which can lead to the presence of distractor features in the representation. Despite this, they have been used with reasonable success in a number of applications.

3.4.1.1 Variational Autoencoder

As detailed in background section 2.6, the VAE structure is trained to reconstruct the input data after compressing it to a low dimensional statistical distribution, usually Gaussian, and sampling from it. The network is trained with a reconstruction loss and a normalization loss on the latent distribution. The reconstruction loss is the element-wise binary cross entropy between the network output and the input. The normalization loss is the Kullback-Leibler divergence between the latent distribution and the standard unit Gaussian with mean zero and standard deviation one. The VAE has been used in a variety of applications. In control, it has been used to learn state representations for robotic manipulators [8], and for simulated games like Car-Racing and a variant of Doom [9]. Because of its ubiquity, this is the first representation learning technique tested, and it will serve as a benchmark among the latent space methods. All subsequent architectures will be compared to the standard DQN as well as the VAE.


Figure 3.2: VAE Structure
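Combining the reconstruction and normalization losses described above gives a per-batch objective along the following lines. This is a sketch in PyTorch under the common convention of summing over elements; it is not the exact code used for the experiments.

import torch
import torch.nn.functional as F

def vae_loss(recon, target, mu, logvar):
    """Element-wise binary cross entropy between reconstruction and input,
    plus the closed-form KL divergence between N(mu, sigma^2) and the unit
    Gaussian (equation 2.7), averaged over the batch."""
    bce = F.binary_cross_entropy(recon, target, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (bce + kl) / target.size(0)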

3.4.2 Predictive Variational Autoencoder

The second structure tested is a novel variant of the VAE. Again, an encoder maps an input observation to Gaussian distribution parameters. The sampled latent vector is fed to two parallel decoders. The target of the first decoder is to reconstruct the current observation, as in the standard structure. The second decoder is trained to reconstruct the next time step's observation. The losses are the same as for the standard VAE, except that two reconstruction loss terms are calculated, one for each decoder output. The intuition behind this configuration is that the structure is trained to produce a compact representation that parameterizes the current observation as well as a distribution over possible future states. This has conceptual similarities to the successor representation [10], in which states are encoded by their predictive relationship with other states. It is hypothesized that this information about possible future states could better inform action selection and facilitate learning.


Figure 3.3: Predictive VAE Structure
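The loss for this variant can be sketched as below: the same latent sample is decoded twice, against the current and the next observation, with a single KL term on the encoder distribution. The decoder modules and the equal weighting of the terms are assumptions for illustration.

import torch
import torch.nn.functional as F

def predictive_vae_loss(decoder_now, decoder_next, z, obs, next_obs, mu, logvar):
    """Two reconstruction terms (current and next observation) plus one KL
    term on the encoder's latent distribution, averaged over the batch."""
    recon_now = decoder_now(z)
    recon_next = decoder_next(z)
    bce_now = F.binary_cross_entropy(recon_now, obs, reduction="sum")
    bce_next = F.binary_cross_entropy(recon_next, next_obs, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (bce_now + bce_next + kl) / obs.size(0)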

3.4.3 Inverse Models

Another representation that has been used in control applications is the inverse model. This structure maps two sequential observations to the action that occurred between them. The model can be thought of as learning features useful for predicting actionable changes in the environment. One benefit of this is the ability to ignore distracting features irrelevant to control. Where the VAE attempts to reconstruct all features of an observation, the inverse model is only incentivized to encode features that are affected by control. Other works have paired an inverse model with a forward model to form a representation [11], or to inform internal rewards and exploration [12]. Here, an inverse model is applied alone as a means of learning a latent space.

3.4.3.1 Inverse Model

The inverse model is trained in a siamese fashion, in which the weights of the encoders being applied to both images are shared. The output features of these encoders are flattened into vectors, concatenated, and passed to a fully connected layer that classifies the action. The softmax operation is applied to the final output. Loss is calculated as the cross entropy between the softmax vector and a one-hot vector specifying the true action.

Figure 3.4: Inverse Model Structure
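A sketch of this structure in PyTorch, assuming a shared encoder module and placeholder feature dimensions; folding the softmax into the cross-entropy loss is equivalent to the description above.

import torch
import torch.nn as nn

class InverseModel(nn.Module):
    """Siamese inverse model: one shared encoder is applied to both
    observations, the flattened features are concatenated, and a linear
    layer produces action logits."""
    def __init__(self, encoder, feature_dim, n_actions=16):
        super().__init__()
        self.encoder = encoder                            # shared weights
        self.classifier = nn.Linear(2 * feature_dim, n_actions)

    def forward(self, obs, next_obs):
        f1 = self.encoder(obs).flatten(start_dim=1)
        f2 = self.encoder(next_obs).flatten(start_dim=1)
        return self.classifier(torch.cat([f1, f2], dim=1))

# Training loss: cross entropy between the logits and the true action index;
# torch.nn.functional.cross_entropy applies the softmax internally.
# loss = torch.nn.functional.cross_entropy(model(obs, next_obs), action_indices)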

3.4.3.2 Variational Inverse Model

The final architecture tested is an inverse model modified to include a variational inference step in the encoding process, analogous to what is done in the VAE. The encoder of the model now maps observations to the parameters of a Gaussian distribution. The distributions of two sequential observations are sampled from and concatenated to predict the action that occurred between them. The motivation for this is to attain a more structured and compact representation space, as is observed in the VAE. This architecture is trained with both the cross entropy between the output and the true action, and the Kullback-Leibler divergence between the encoded distribution and the unit normal distribution.


Figure 3.5: Variational Inverse Model Structure

3.5 Training Procedure

Each structure is trained with its appropriate training procedure, on the random policy data. The encoder of each structure is then used to preprocess pixel-wise observations into state representations for a two layer fully connected network controller that is optimized with the DQN algorithm.

3.6 Evaluation

The structures are evaluated on the number of episodes required to learn a policy that has a high average performance over 100 episodes of the environment, as well as learning curve variance over repeated training runs. The fully convolutional DQN algorithm will serve as a non-representation learning baseline. The VAE will serve as a representation learning baseline due to its wide popularity and use.

Chapter 4

Results

A significant search was performed over latent space dimensionality and number of training epochs to produce encodings that facilitate the most performant downstream controllers. A summary of these final parameters can be found in Table 4.1. Figure 4.1 displays the training curves of all the architectures. The training curves of the predictive VAE, inverse model, and variational inverse model are plotted against the baseline fully convolutional DQN and the VAE latent controller. All controllers were trained with the DQN training procedure. Environment episodes were limited to 10 steps. A memory size of 10000 SARS tuples was used. Every 100 steps in the environment a policy network update occurs, in which a batch of 128 SARS tuples is randomly sampled from the memory and used to train the network. Every 2000 steps in the environment the weights from the policy network are copied to the target network. The epsilon-greedy strategy is used for environment exploration. The exploration rate starts at 95% and decays to 5% linearly over 100000 steps in the environment. It was discovered after performing these experiments that the usage of data in the memory here is suboptimal. These parameters use the data in the memory on average about once. It is possible to train more frequently, such that the data in the memory is used 50 to 100 times over. Applying these changes would scale all training curves along the x-axis. In five training trials, the baseline DQN algorithm learned a near-perfect policy in approximately 10000 episodes of experience.
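For reference, the training settings stated above are collected in one place below; the dictionary itself is merely an illustrative way of organizing the values taken from the text.

# DQN training settings used for all controllers in this chapter.
dqn_config = {
    "max_episode_steps": 10,          # environment episodes limited to 10 steps
    "memory_size": 10000,             # SARS tuples held in replay memory
    "batch_size": 128,                # tuples sampled per policy-network update
    "update_every_steps": 100,        # environment steps between updates
    "target_sync_every_steps": 2000,  # steps between policy -> target copies
    "epsilon_start": 0.95,            # initial epsilon-greedy exploration rate
    "epsilon_end": 0.05,              # final exploration rate
    "epsilon_decay_steps": 100000,    # linear decay horizon
}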


Figure 4.1: Comparing latent controller performances to the fully convolutional DQN and the VAE latent-DQN; min-max-mean training curves over five training trials. (a) Predictive VAE, (b) Inverse Model, (c) Variational Inverse Model. Each panel plots successes in 100 rollouts against training episodes.


The controllers trained in the VAE latent space reached equivalent performance at approximately 8000 episodes of experience, 20 percent fewer than the standard DQN. In addition, the VAE latent controllers had more stable training behavior. This was an expected result, as the controller does not need to learn any parameters for the feature-extracting convolutional layers. The predictive VAE latent space controllers performed nearly identically to the VAE latent space controllers. This seems to refute the notion that information about next possible states aids in decision making. It is possible that the current testing environment is too simple to gain any advantage from information about next possible states. It is notable that the most performant predictive VAE has the same latent dimensionality as the most performant VAE. This architecture was optimized for downstream control performance, not for its ability to encode information about next possible states. Another possibility is that this encoder, intentionally or unintentionally, never learned to encode information about these states, and that this is the reason for the performance similarity to the standard VAE. Analysis of the predictive VAE's reconstructions would be necessary to assess this. It was hypothesized that the inverse model latent space would be at least as good as the VAE's latent space for policy learning, if not better. Surprisingly, in terms of sample efficiency, performance was consistently between that of the VAE and the standard DQN. The reason for this is not entirely clear. One would expect a sufficiently trained inverse model to encode all the information necessary for perfect control in a compact latent space. However, the inverse model does not have any normalization terms enforcing structure in the latent space. This could be one influencing factor, and was the motivation for the final structure tested, the variational inverse model. After a significant parameter search, no combination was found that resulted in a performant variational inverse model. The reasons for this are unclear. Another interesting observation is that the inverse models required a larger latent space to produce performant controllers than the architectures focused on reconstruction.


Table 4.1: Optimized Representation Learning Architecture Parameters

Structure   Latent Dimension   Epochs Trained
VAE         64                 10
PVAE        64                 16
IM          256                16
VIM         288                16

Chapter 5

Conclusions

5.1 The Motivation for Representation Learning

Representation learning focuses on learning compact, information dense representations from complex high dimensional data, taking advantage of the natural structure in that data in an unsupervised or self-supervised manner. One application of representation learning in robotics is state abstraction, which could reduce training time and improve the performance of reinforcement learning algorithms. Many techniques exist for learning abstractions from data, but the performance of these methods in terms of downstream control is not well understood. In this work, two common representation learning structures and two novel structures are trained and compared on the basis of downstream control.

5.2 Future Work

One problem with the current approach is that the architecture training relies on a dataset generated by a randomly controlled agent. In tasks where a large portion of the state space only becomes visible after certain sub-tasks have been accomplished, the encoding architecture will never be trained to represent these areas. This problem can be side-stepped by iterative methods that train representations from random policy data, learn a policy in the representation space, use that policy to generate new data, and repeat.

A more elegant solution could be to learn representations that inform intrinsic motivation and curiosity, to speed up the exploration process, similar to what is achieved in [12]. This is a relatively young area of research with many exciting directions to pursue. In the methods tested in this work, few constraints are placed on the structure of the resulting latent space. There are certain properties of a latent space that could be very useful for performing tasks. One such property is disentanglement, in which the variables of the latent space have little or no correlation with one another [13]. Other properties that could be useful include hierarchical structure [14], and dynamics similar to the physical world [15]. It could also be useful to learn representations of a very specific design, such as the positions and velocities of objects [16], or with a predefined graphical structure [17]. These are all directions of active research that could prove very useful for robotic control. The following subsections detail two areas that are actively being pursued as follow-up work.

5.2.1 Partial Observability

Most environments are inherently non-Markovian. A simple example of this is a moving object. The state of a moving object cannot be known from a single visual observation alone; velocity is lost when only a snapshot of the present is available. One interesting direction for future work is learning representations that encode this temporal information for effective control in the continuous, non-Markovian world. As a continuation of this thesis, recurrent representation learning structures are being explored for this task.

5.2.2 Representation Conducive to Planning

The methods tested in this work use unsupervised and self-supervised learning to map observations to lower dimensional spaces. These methods make no assumptions about the dynamics or behavior of the lower dimensional space. Another interesting direction is to enforce latent space dynamics that are conducive to particular types of planning and control. For example, one thought is to map observations to a linear dynamical space such that LQR and other classical techniques can be applied for control.

Bibliography

[1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.

[2] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[3] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833.

[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.

[5] M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” in Psychology of learning and motivation. Elsevier, 1989, vol. 24, pp. 109–165.

[6] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

[7] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.

[8] A. V. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine, “Visual reinforcement learning with imagined goals,” in Advances in Neural Information Processing Systems, 2018, pp. 9191–9200.


[9] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” in Advances in Neural Information Processing Systems, 2018, pp. 2450–2462.

[10] P. Dayan, “Improving generalization for temporal difference learning: The successor representation,” Neural Computation, vol. 5, no. 4, pp. 613–624, 1993.

[11] P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine, “Learning to poke by poking: Experiential learning of intuitive physics,” in Advances in Neural Information Processing Systems, 2016, pp. 5074–5082.

[12] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 16–17.

[13] S. Narayanaswamy, T. B. Paige, J.-W. Van de Meent, A. Desmaison, N. Goodman, P. Kohli, F. Wood, and P. Torr, “Learning disentangled representations with semi- supervised deep generative models,” in Advances in Neural Information Processing Systems, 2017, pp. 5925–5935.

[14] E. Mathieu, C. L. Lan, C. J. Maddison, R. Tomioka, and Y. W. Teh, “Hierarchical representations with Poincaré variational auto-encoders,” arXiv preprint arXiv:1901.06033, 2019.

[15] R. Jonschkowski and O. Brock, “State representation learning in robotics: Using prior knowledge about physical interaction.” in Robotics: Science and Systems, 2014.

[16] R. Jonschkowski, R. Hafner, J. Scholz, and M. Riedmiller, “Pves: Position-velocity encoders for unsupervised learning of structured state representations,” arXiv preprint arXiv:1705.09805, 2017.

[17] M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta, “Composing graphical models with neural networks for structured representations and fast inference,” in Advances in neural information processing systems, 2016, pp. 2946–2954.

Appendix A

VAE Loss Derivation

We want to approximate the true posterior of the data generation process p(Z|X) with a variational posterior qλ(Z|X) where λ indexes the distribution parameters. The quality of the approximation is measured with the Kullback-Leibler divergence between the true and approximated posteriors. The optimal approximation minimizes the value of the divergence.

q_λ*(Z|X) = argmin_λ KL( q_λ(Z|X) ‖ p(Z|X) )

We re-express the divergence to understand how it can be minimized:

KL( q_λ(Z|X) ‖ p(Z|X) ) = Σ_Z q_λ(Z|X) log [ q_λ(Z|X) / p(Z|X) ]
                        = E_Z[ log ( q_λ(Z|X) / p(Z|X) ) ]
                        = E_Z[ log q_λ(Z|X) − log p(Z|X) ]
                        = E_Z[ log q_λ(Z|X) − log ( p(X, Z) / p(X) ) ]
                        = E_Z[ log q_λ(Z|X) − log p(X, Z) ] + log p(X)

log p(X) = KL( q_λ(Z|X) ‖ p(Z|X) ) + E_Z[ log p(X, Z) − log q_λ(Z|X) ]


Here the equation has been rearranged so that log p(X), which is constant with respect to λ, is equal to the divergence we want to minimize plus the term on the right. The term on the right is the Evidence Lower Bound (ELBO) by definition. By Jensen's inequality, the KL divergence is always greater than or equal to zero, so we can minimize the divergence by maximizing the ELBO.

ELBO = E_Z[ log p(X, Z) − log q_λ(Z|X) ]
     = E_Z[ log p(X|Z) + log p(Z) − log q_λ(Z|X) ]
     = E_Z[ log p(X|Z) ] − E_Z[ log ( q_λ(Z|X) / p(Z) ) ]
     = E_Z[ log p(X|Z) ] − KL( q_λ(Z|X) ‖ p(Z) )

−ELBO = −E_Z[ log p(X|Z) ] + KL( q_λ(Z|X) ‖ p(Z) )

We can now minimize the negative of ELBO with gradient descent, resulting in the loss function for VAE training.

Appendix B

Architecture Building Block Summary

Encoder:

• 2D Conv: in-chan = 1, out-chan = 16, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu
• 2D Conv: in-chan = 16, out-chan = 32, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu
• 2D Conv: in-chan = 32, out-chan = 32, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu

Decoder:

• 2D DeConv: in-chan = 32, out-chan = 32, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu
• 2D DeConv: in-chan = 32, out-chan = 16, ksize = 4, stride = 2, pad = 0
• batchnorm2D
• relu
• 2D DeConv: in-chan = 16, out-chan = 1, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• sigmoid

Latent DQN Architecture:

• Linear: in-dim = latent-size + 1, out-dim = 512
• relu
• Linear: in-dim = 512, out-dim = 16

Fully Convolutional DQN Architecture:

• 2D Conv: in-chan = 1, out-chan = 16, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu
• 2D Conv: in-chan = 16, out-chan = 32, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu
• 2D Conv: in-chan = 32, out-chan = 32, ksize = 4, stride = 2, pad = 1
• batchnorm2D
• relu
• Linear: in-dim = 32*3*3 + 1, out-dim = 16
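One possible PyTorch rendering of the fully convolutional DQN blocks listed above is sketched below. The extra +1 input to the final linear layer is kept as in the listing; how that additional scalar is produced is not specified here, so it is simply passed in as an argument, and the class name is an invention for this sketch.

import torch
import torch.nn as nn

class FullyConvDQN(nn.Module):
    """Three conv-batchnorm-relu blocks followed by a linear layer over the
    flattened 32x3x3 features plus one extra scalar input."""
    def __init__(self, n_actions=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
        )
        self.head = nn.Linear(32 * 3 * 3 + 1, n_actions)

    def forward(self, obs, extra_scalar):
        # obs: (batch, 1, 28, 28) elevation image; extra_scalar: (batch, 1)
        f = self.features(obs).flatten(start_dim=1)   # (batch, 32*3*3)
        return self.head(torch.cat([f, extra_scalar], dim=1))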
