DeepMDP: Learning Continuous Latent Space Models for Representation Learning

Carles Gelada¹, Saurabh Kumar¹, Jacob Buckman², Ofir Nachum¹, Marc G. Bellemare¹

¹Google Brain  ²Center for Language and Speech Processing, Johns Hopkins University. Correspondence to: Carles Gelada <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Many reinforcement learning (RL) tasks provide the agent with high-dimensional observations that can be simplified into low-dimensional continuous states. To formalize this process, we introduce the concept of a DeepMDP, a parameterized latent space model that is trained via the minimization of two tractable latent space losses: prediction of rewards and prediction of the distribution over next latent states. We show that the optimization of these objectives guarantees (1) the quality of the embedding function as a representation of the state space and (2) the quality of the DeepMDP as a model of the environment. Our theoretical findings are substantiated by the experimental result that a trained DeepMDP recovers the latent structure underlying high-dimensional observations on a synthetic environment. Finally, we show that learning a DeepMDP as an auxiliary task in the Atari 2600 domain leads to large performance improvements over model-free RL.

1. Introduction

In reinforcement learning (RL), it is typical to model the environment as a Markov Decision Process (MDP). However, for many practical tasks, the state representations of these MDPs include a large amount of redundant information and task-irrelevant noise. For example, image observations from the Arcade Learning Environment (Bellemare et al., 2013a) consist of 33,600-dimensional pixel arrays, yet it is intuitively clear that there exist lower-dimensional approximate representations for all games. Consider PONG; observing only the positions and velocities of the three objects in the frame is enough to play. Converting each frame into such a simplified state before learning a policy facilitates the learning process by reducing the redundant and irrelevant information presented to the agent. Representation learning techniques for reinforcement learning seek to improve the learning efficiency of existing RL algorithms by doing exactly this: learning a mapping from states to simplified states.

Bisimulation metrics (Ferns et al., 2004; 2011) define two states to be behaviourally similar if they (1) produce close immediate rewards and (2) transition to states which are themselves behaviourally similar. Bisimulation metrics have been used to reduce the dimensionality of the state space by aggregating states (a form of representation learning), but they have not received much attention due to their high computational cost. Furthermore, state aggregation techniques, whether based on bisimulation or other methods (Abel et al., 2017; Li et al., 2006; Singh et al., 1995; Givan et al., 2003; Jiang et al., 2015; Ruan et al., 2015), suffer from poor compatibility with function approximation methods. Instead, to support stochastic gradient descent-based training procedures, we explore the use of continuous latent representations. Specifically, for any MDP, we propose utilizing the latent space of its corresponding DeepMDP. A DeepMDP is a latent space model of an MDP which has been trained to minimize two tractable losses: predicting the rewards and predicting the distribution of next latent states. DeepMDPs can be viewed as a formalization of recent works which use neural networks to learn latent space models of the environment (Ha & Schmidhuber, 2018; Oh et al., 2017; Hafner et al., 2018). The state of a DeepMDP can be interpreted as a representation of the original MDP's state, and doing so reveals a profound theoretical connection to bisimulation. We show that minimization of the DeepMDP losses guarantees that two non-bisimilar states can never be collapsed into a single representation. Additionally, this guarantees that value functions in the DeepMDP are good approximations of value functions in the original MDP. These results not only provide a theoretically-grounded approach to representation learning but also represent a promising first step towards principled latent-space model-based RL algorithms.
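The paragraph above only names the two DeepMDP training objectives; the precise losses are defined later in the paper and do not appear in this excerpt. As a rough sketch of what "predicting the rewards and predicting the distribution of next latent states" could look like in practice, the snippet below pairs an embedding network with a reward head and a deterministic latent transition head. All module names (`phi`, `reward_head`, `latent_transition`) and the squared-error surrogate for the transition term are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch (not the paper's training code) of reward-prediction and
# latent-transition losses for a deterministic latent model.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, latent_dim, num_actions = 128, 32, 4

phi = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
reward_head = nn.Linear(latent_dim + num_actions, 1)                  # predicts R(s, a)
latent_transition = nn.Linear(latent_dim + num_actions, latent_dim)   # predicts next latent

def deepmdp_losses(s, a_onehot, r, s_next):
    """Reward and latent-transition losses for a batch of (s, a, r, s') tuples."""
    z, z_next = phi(s), phi(s_next)
    za = torch.cat([z, a_onehot], dim=-1)
    reward_loss = (reward_head(za).squeeze(-1) - r).abs().mean()
    # With a deterministic latent model, "predicting the distribution of next
    # latent states" is approximated here by a distance between the predicted
    # and embedded next latent states (one common simplification).
    transition_loss = (latent_transition(za) - z_next).pow(2).sum(-1).mean()
    return reward_loss, transition_loss

# Example usage on a random batch.
batch = 8
s, s_next = torch.randn(batch, obs_dim), torch.randn(batch, obs_dim)
a = F.one_hot(torch.randint(num_actions, (batch,)), num_actions).float()
r = torch.randn(batch)
print(deepmdp_losses(s, a, r, s_next))
```

In practice the two losses would be minimized jointly (for example, summed with weights and passed to an optimizer over the parameters of `phi`, `reward_head`, and `latent_transition`), possibly alongside a model-free RL loss when the DeepMDP is learned as an auxiliary task.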
In a synthetic environment, we show that a DeepMDP learns to recover the low-dimensional latent structure underlying high-dimensional observations. We then demonstrate that learning a DeepMDP as an auxiliary task to model-free RL in the Atari 2600 environment (Bellemare et al., 2013b) leads to a significant improvement in performance when compared to a baseline model-free method.

2. Background

Define a Markov Decision Process (MDP) in standard fashion: $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$ (Puterman, 1994). For simplicity of notation we will assume that $\mathcal{S}$ and $\mathcal{A}$ are discrete spaces unless otherwise stated. A policy $\pi$ defines a distribution over actions conditioned on the state, $\pi(a|s)$. Denote by $\Pi$ the set of all stationary policies. The value function of a policy $\pi \in \Pi$ at a state $s$ is the expected sum of future discounted rewards obtained by running the policy from that state. $V^\pi : \mathcal{S} \to \mathbb{R}$ is defined as:

$$V^\pi(s) = \mathbb{E}_{\substack{a_t \sim \pi(\cdot|s_t) \\ s_{t+1} \sim \mathcal{P}(\cdot|s_t, a_t)}} \left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t) \;\middle|\; s_0 = s \right].$$

The action value function is similarly defined:

$$Q^\pi(s, a) = \mathbb{E}_{\substack{a_t \sim \pi(\cdot|s_t) \\ s_{t+1} \sim \mathcal{P}(\cdot|s_t, a_t)}} \left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t) \;\middle|\; s_0 = s, a_0 = a \right].$$

We denote by $\pi^*$ the optimal policy in $\mathcal{M}$, i.e., the policy which maximizes expected future reward, and by $V^*, Q^*$ the optimal state and action value functions with respect to $\pi^*$.

We denote the stationary distribution of a policy $\pi$ in $\mathcal{M}$ by $d_\pi$; i.e.,

$$d_\pi(s) = \sum_{\dot{s} \in \mathcal{S},\, \dot{a} \in \mathcal{A}} \mathcal{P}(s \mid \dot{s}, \dot{a})\, \pi(\dot{a} \mid \dot{s})\, d_\pi(\dot{s}).$$

The state-action stationary distribution is given by $\xi_\pi(s, a) = d_\pi(s)\pi(a|s)$. Although only non-terminating MDPs have stationary distributions, a state distribution with similar properties exists for terminating MDPs (Gelada & Bellemare, 2019).
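The definitions of $V^\pi$ and $Q^\pi$ above can be made concrete on a small tabular MDP: $V^\pi$ is the unique fixed point of the standard policy-evaluation equation $V = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi V$, and can therefore be obtained with a direct linear solve. The transition tensor, rewards, and uniform policy in the sketch below are made-up illustrative values, not anything from the paper.

```python
# Exact policy evaluation for a small tabular MDP, to make the definitions of
# V^pi and Q^pi concrete. The transition tensor P[s, a, s'], rewards R[s, a],
# and the uniform policy below are arbitrary illustrative choices.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)                    # make each P(.|s, a) a distribution
R = rng.random((n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform policy

R_pi = (pi * R).sum(axis=1)                 # expected reward under pi, shape (S,)
P_pi = np.einsum("sa,sat->st", pi, P)       # state-to-state transition matrix under pi

# V^pi is the fixed point of V = R_pi + gamma * P_pi V, i.e. a linear solve.
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
# Q^pi(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) V^pi(s').
Q = R + gamma * P @ V

print("V^pi:", V)
print("Q^pi:", Q)
```

The same system could instead be iterated to its fixed point in the style of value iteration; the closed-form solve is simply the most compact way to realize the definition for a toy example.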
2.1. Wasserstein Distance

For two distributions $P$ and $Q$ defined on a metric space $\langle \chi, d \rangle$, the optimal transport problem (Villani, 2008) studies how to transform the probability mass of $P$ into $Q$ with minimum cost, where the cost of moving a particle of mass from point $x$ to point $y$ is given by a metric $d(x, y)$. The Wasserstein-1 metric between $P$ and $Q$, denoted by $W(P, Q)$, is the minimal possible cost of such a transport.

Definition 1. Let $d$ be any metric. The Wasserstein-1 metric $W$ between distributions $P$ and $Q$ is defined as

$$W_d(P, Q) = \inf_{\lambda \in \Gamma(P, Q)} \int\!\!\int d(x, y)\, \lambda(x, y)\, dx\, dy,$$

where $\Gamma(P, Q)$ denotes the set of all couplings of $P$ and $Q$.

When it is clear what the underlying metric is, we will simply write $W$. The Wasserstein metric has an equivalent dual form (Mueller, 1997):

$$W_d(P, Q) = \sup_{f \in \mathcal{F}_d} \big| \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{y \sim Q} f(y) \big|, \qquad (1)$$

where $\mathcal{F}_d$ is the set of absolutely continuous 1-Lipschitz functions:

$$\mathcal{F}_d = \{ f : |f(x) - f(y)| \leq d(x, y) \}.$$

The Wasserstein metric can be extended to pseudometrics. A pseudometric $d$ satisfies all the properties of a metric except the identity of indiscernibles, $d(x, y) = 0 \Leftrightarrow x = y$. The kernel of a pseudometric is the equivalence relation defined over all points at pseudometric distance 0 from one another. Note how the triangle inequality of the pseudometric ensures that the kernel is a valid equivalence relation satisfying the transitive property. Intuitively, using a pseudometric for the Wasserstein can be interpreted as letting points that are distinct be equivalent under the pseudometric, and thus require no transportation.

Central to the results in this work is the connection between the Wasserstein metric and Lipschitz smoothness. The following property, trivially derived from the dual form of the Wasserstein distance, will be used throughout. For any $K$-Lipschitz function $f$,

$$\big| \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{y \sim Q} f(y) \big| \leq K \cdot W(P, Q). \qquad (2)$$

2.2. Lipschitz MDPs

Several works have studied Lipschitz smoothness constraints on the transition and reward functions (Hinderer, 2005; Asadi et al., 2018) in order to provide conditions for value functions to be Lipschitz. Closely following their formulation, we define Lipschitz MDPs as follows:

Definition 2. Let $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$ be an MDP with a continuous, metric state space $(\mathcal{S}, d_\mathcal{S})$, where $d_\mathcal{S} : \mathcal{S} \times \mathcal{S} \to \mathbb{R}^+$, and a discrete action space $\mathcal{A}$. We say $\mathcal{M}$ is $(K_\mathcal{R}, K_\mathcal{P})$-Lipschitz if, for all $s_1, s_2 \in \mathcal{S}$ and $a \in \mathcal{A}$:

$$|\mathcal{R}(s_1, a) - \mathcal{R}(s_2, a)| \leq K_\mathcal{R}\, d_\mathcal{S}(s_1, s_2),$$
$$W(\mathcal{P}(\cdot|s_1, a), \mathcal{P}(\cdot|s_2, a)) \leq K_\mathcal{P}\, d_\mathcal{S}(s_1, s_2).$$

From here onwards, we restrict our attention to the set of Lipschitz MDPs for which the constant $K_\mathcal{P}$ is sufficiently small, as stated in the following assumption.

Assumption 1. The Lipschitz constant $K_\mathcal{P}$ of the transition function $\mathcal{P}$ is strictly smaller than $\frac{1}{\gamma}$.

From a practical standpoint, Assumption 1 is relatively strong, but it simplifies our analysis by ensuring that close states cannot have future trajectories that are "divergent." An MDP might not exhibit divergent behaviour even when $K_\mathcal{P} \geq \frac{1}{\gamma}$. In particular, when episodes terminate after a finite amount of time, Assumption 1 becomes unnecessary. We leave as future work how to improve on this assumption.

The main use of Lipschitz MDPs will be to study the Lipschitz properties of value functions.
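When $P$ and $Q$ are finite distributions, Definition 1 reduces to a small linear program over couplings, which makes it easy to sanity-check Definition 2, Assumption 1, and the bound in Equation (2) numerically. The sketch below does this on a made-up MDP whose states are points on the real line; the specific rewards, "lazy" transition family, and test function are illustrative choices and not taken from the paper.

```python
# Illustrative check of Definition 1, Equation (2), Definition 2, and
# Assumption 1 on a small, made-up MDP whose states are points on the real
# line. Nothing here comes from the paper; it is only a numerical sanity check.
import numpy as np
from scipy.optimize import linprog

def wasserstein1(p, q, d):
    """W_1(p, q) for finite distributions, via the primal LP of Definition 1."""
    n = len(p)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # row marginals of the coupling equal p
        A_eq[n + i, i::n] = 1.0            # column marginals of the coupling equal q
    res = linprog(d.flatten(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None))
    return res.fun

gamma = 0.9
states = np.array([0.0, 1.0, 2.0, 3.0])             # finite state space embedded in R
d_S = np.abs(states[:, None] - states[None, :])     # metric d_S(s1, s2) = |s1 - s2|
n_s, n_a = len(states), 2

# Rewards: R(s, a) = sin(s) + 0.1 * a is 1-Lipschitz in s for each action.
R = np.sin(states)[:, None] + 0.1 * np.arange(n_a)[None, :]
# "Lazy" transitions: stay put with probability 1 - c_a, else jump uniformly.
# For this family, W(P(.|s1, a), P(.|s2, a)) = (1 - c_a) * |s1 - s2|.
c = np.array([0.3, 0.5])
P = np.zeros((n_s, n_a, n_s))
for a in range(n_a):
    P[:, a, :] = (1 - c[a]) * np.eye(n_s) + c[a] / n_s

# Estimate K_R and K_P by maximizing over state pairs and actions (Definition 2).
K_R = K_P = 0.0
for a in range(n_a):
    for i in range(n_s):
        for j in range(i + 1, n_s):
            K_R = max(K_R, abs(R[i, a] - R[j, a]) / d_S[i, j])
            K_P = max(K_P, wasserstein1(P[i, a], P[j, a], d_S) / d_S[i, j])
print(f"K_R ~ {K_R:.3f}, K_P ~ {K_P:.3f}, 1/gamma = {1 / gamma:.3f}")
print("Assumption 1 (K_P < 1/gamma):", K_P < 1 / gamma)

# Check Equation (2) with the 2-Lipschitz function f(s) = 2 * cos(s).
f, K = 2.0 * np.cos(states), 2.0
p, q = P[0, 0], P[1, 0]
print(abs(f @ p - f @ q), "<=", K * wasserstein1(p, q, d_S))
```

For the "lazy" transition family used here, the Wasserstein distance between next-state distributions can also be computed by hand as $(1 - c_a)\,|s_1 - s_2|$, so the LP output doubles as a check on the implementation.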
