
DeepMDP: Learning Continuous Latent Space Models for Representation Learning

Carles Gelada¹  Saurabh Kumar¹  Jacob Buckman²  Ofir Nachum¹  Marc G. Bellemare¹

¹Google Brain  ²Center for Language and Speech Processing, Johns Hopkins University. Correspondence to: Carles Gelada <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Many reinforcement learning (RL) tasks provide the agent with high-dimensional observations that can be simplified into low-dimensional continuous states. To formalize this process, we introduce the concept of a DeepMDP, a parameterized latent space model that is trained via the minimization of two tractable latent space losses: prediction of rewards and prediction of the distribution over next latent states. We show that the optimization of these objectives guarantees (1) the quality of the embedding function as a representation of the state space and (2) the quality of the DeepMDP as a model of the environment. Our theoretical findings are substantiated by the experimental result that a trained DeepMDP recovers the latent structure underlying high-dimensional observations on a synthetic environment. Finally, we show that learning a DeepMDP as an auxiliary task in the Atari 2600 domain leads to large performance improvements over model-free RL.

1. Introduction

In reinforcement learning (RL), it is typical to model the environment as a Markov Decision Process (MDP). However, for many practical tasks, the state representations of these MDPs include a large amount of redundant information and task-irrelevant noise. For example, image observations from the Arcade Learning Environment (Bellemare et al., 2013a) consist of 33,600-dimensional pixel arrays, yet it is intuitively clear that there exist lower-dimensional approximate representations for all games. Consider PONG; observing only the positions and velocities of the three objects in the frame is enough to play. Converting each frame into such a simplified state before learning a policy facilitates the learning process by reducing the redundant and irrelevant information presented to the agent. Representation learning techniques for reinforcement learning seek to improve the learning efficiency of existing RL algorithms by doing exactly this: learning a mapping from states to simplified states.

Bisimulation metrics (Ferns et al., 2004; 2011) define two states to be behaviourally similar if they (1) produce close immediate rewards and (2) transition to states which are themselves behaviourally similar. Bisimulation metrics have been used to reduce the dimensionality of the state space by aggregating states (a form of representation learning), but have not received much attention due to their high computational cost. Furthermore, state aggregation techniques, whether based on bisimulation or other methods (Abel et al., 2017; Li et al., 2006; Singh et al., 1995; Givan et al., 2003; Jiang et al., 2015; Ruan et al., 2015), suffer from poor compatibility with function approximation methods. Instead, to support stochastic gradient descent-based training procedures, we explore the use of continuous latent representations. Specifically, for any MDP, we propose utilizing the latent space of its corresponding DeepMDP. A DeepMDP is a latent space model of an MDP which has been trained to minimize two tractable losses: predicting the rewards and predicting the distribution over next latent states. DeepMDPs can be viewed as a formalization of recent works which use neural networks to learn latent space models of the environment (Ha & Schmidhuber, 2018; Oh et al., 2017; Hafner et al., 2018). The state of a DeepMDP can be interpreted as a representation of the original MDP's state, and doing so reveals a profound theoretical connection to bisimulation. We show that minimization of the DeepMDP losses guarantees that two non-bisimilar states can never be collapsed into a single representation. Additionally, this guarantees that value functions in the DeepMDP are good approximations of value functions in the original MDP. These results serve not only to provide a theoretically grounded approach to representation learning, but also represent a promising first step towards principled latent-space model-based RL algorithms.

In a synthetic environment, we show that a DeepMDP learns to recover the low-dimensional latent structure underlying high-dimensional observations.

We then demonstrate that learning a DeepMDP as an auxiliary task to model-free RL in the Atari 2600 environment (Bellemare et al., 2013b) leads to significant improvement in performance when compared to a baseline model-free method.

2. Background

Define a Markov Decision Process (MDP) in standard fashion: M = ⟨S, A, R, P, γ⟩ (Puterman, 1994). For simplicity of notation we will assume that S and A are discrete spaces unless otherwise stated. A policy π defines a distribution over actions conditioned on the state, π(a|s). Denote by Π the set of all stationary policies. The value function of a policy π ∈ Π at a state s is the expected sum of future discounted rewards obtained by running the policy from that state. V^π : S → R is defined as

V^π(s) = E_{a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s ].

The action value function is similarly defined:

Q^π(s, a) = E_{a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)} [ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, a_0 = a ].

We denote by π* the optimal policy in M, i.e., the policy which maximizes expected future reward, and we denote the optimal state and action value functions with respect to π* by V*, Q*. We denote the stationary distribution of a policy π in M by d_π; i.e.,

d_π(s) = Σ_{ṡ ∈ S, ȧ ∈ A} P(s|ṡ, ȧ) π(ȧ|ṡ) d_π(ṡ).

The state-action stationary distribution is given by ξ_π(s, a) = d_π(s) π(a|s). Although only non-terminating MDPs have stationary distributions, a state distribution with similar properties exists for terminating MDPs (Gelada & Bellemare, 2019).

2.1. Wasserstein Distance

For two distributions P and Q defined on a metric space ⟨χ, d⟩, the optimal transport problem (Villani, 2008) studies how to transform the probability mass of P into Q with the minimum cost, where the cost of moving a particle from point x to y is given by the metric d(x, y). The Wasserstein-1 metric between P and Q, denoted by W(P, Q), is the minimal possible cost of such a transport.

Definition 1. Let d be any metric. The Wasserstein-1 metric W_d between distributions P and Q is defined as

W_d(P, Q) = inf_{λ ∈ Γ(P,Q)} ∫ d(x, y) λ(x, y) dx dy,

where Γ(P, Q) denotes the set of all couplings of P and Q.

When it is clear what the underlying metric is, we will simply write W. The Wasserstein has an equivalent dual form (Mueller, 1997):

W_d(P, Q) = sup_{f ∈ F_d} | E_{x ∼ P} f(x) − E_{y ∼ Q} f(y) |,   (1)

where F_d is the set of absolutely continuous 1-Lipschitz functions:

F_d = { f : |f(x) − f(y)| ≤ d(x, y) }.

The Wasserstein metric can be extended to pseudometrics. A pseudometric d satisfies all the properties of a metric except the identity of indiscernibles, d(x, y) = 0 ⇔ x = y. The kernel of a pseudometric is the equivalence relation that relates all pairs of points at pseudometric distance 0. Note how the triangle inequality of the pseudometric ensures that the kernel is a valid equivalence relation (in particular, that it is transitive). Intuitively, using a pseudometric in the Wasserstein can be interpreted as letting points that are distinct, but equivalent under the pseudometric, require no transportation.

Central to the results in this work is the connection between the Wasserstein metric and Lipschitz smoothness. The following property, trivially derived from the dual form of the Wasserstein distance, will be used throughout. For any K-Lipschitz function f,

| E_{x ∼ P} f(x) − E_{y ∼ Q} f(y) | ≤ K · W(P, Q).   (2)

2.2. Lipschitz MDPs

Several works have studied Lipschitz smoothness constraints on the transition and reward functions (Hinderer, 2005; Asadi et al., 2018) to provide conditions for value functions to be Lipschitz. Closely following their formulation, we define Lipschitz MDPs as follows:

Definition 2. Let M = ⟨S, A, R, P, γ⟩ be an MDP with a continuous, metric state space (S, d_S), where d_S : S × S → R⁺, and a discrete action space A. We say M is (K_R, K_P)-Lipschitz if, for all s_1, s_2 ∈ S and a ∈ A:

|R(s_1, a) − R(s_2, a)| ≤ K_R d_S(s_1, s_2),
W(P(·|s_1, a), P(·|s_2, a)) ≤ K_P d_S(s_1, s_2).

From here onwards, we restrict our attention to the set of Lipschitz MDPs for which the constant K_P is sufficiently small, as stated in the following assumption.

Assumption 1. The Lipschitz constant K_P of the transition function P is strictly smaller than 1/γ.

From a practical standpoint, Assumption 1 is relatively strong, but it simplifies our analysis by ensuring that close states cannot have future trajectories that are "divergent."
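To make Definition 2 and Assumption 1 concrete, the sketch below estimates the constants K_R and K_P of a toy deterministic MDP by sampling state pairs; because the transitions are deterministic, the Wasserstein term reduces to the distance between next states. This is an illustrative NumPy example, not code from the paper; the environment, sampling scheme, and sample count are assumptions.

```python
import numpy as np

# Toy deterministic MDP on the unit interval: s' = clip(s + 0.1*a, 0, 1) and
# R(s, a) = -|s - 0.5|. For deterministic transitions, the Wasserstein term in
# Definition 2 reduces to d_S between the two next states.
gamma = 0.99
actions = [-1.0, 0.0, 1.0]

def step(s, a):
    return np.clip(s + 0.1 * a, 0.0, 1.0)

def reward(s, a):
    return -abs(s - 0.5)

rng = np.random.default_rng(0)
K_R, K_P = 0.0, 0.0
for _ in range(10_000):
    s1, s2 = rng.uniform(0.0, 1.0, size=2)
    d = abs(s1 - s2) + 1e-12
    for a in actions:
        K_R = max(K_R, abs(reward(s1, a) - reward(s2, a)) / d)
        K_P = max(K_P, abs(step(s1, a) - step(s2, a)) / d)

print(f"estimated K_R ~ {K_R:.3f}, estimated K_P ~ {K_P:.3f}")
print("Assumption 1 (K_P < 1/gamma):", K_P < 1.0 / gamma)
# The ratio K_R / (1 - gamma * K_P) reappears below as a Lipschitz bound on value functions.
print("K_R / (1 - gamma * K_P) =", K_R / (1 - gamma * K_P))
```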

An MDP might not exhibit divergent behaviour even when K_P ≥ 1/γ. In particular, when episodes terminate after a finite amount of time, Assumption 1 becomes unnecessary. We leave improving on this assumption as future work.

The main use of Lipschitz MDPs will be to study the Lipschitz properties of value functions.¹

¹Another benefit of MDP smoothness is improved learning dynamics. Pirotta et al. (2015) suggest that the smaller the Lipschitz constant of an MDP, the faster it is to converge to a near-optimal policy.

Definition 3. A policy π ∈ Π is K_V-Lipschitz-valued if for all s_1, s_2 ∈ S and a ∈ A:

|V^π(s_1) − V^π(s_2)| ≤ K_V d_S(s_1, s_2),
|Q^π(s_1, a) − Q^π(s_2, a)| ≤ K_V d_S(s_1, s_2).

Lipschitz MDPs allow us to provide a simple condition for a policy to have a Lipschitz value function.

Lemma 1. Let M be (K_R, K_P)-Lipschitz and let π be any policy with the property that, for all s_1, s_2 ∈ S,

|V^π(s_1) − V^π(s_2)| ≤ max_{a ∈ A} |Q^π(s_1, a) − Q^π(s_2, a)|;

then π is K_R/(1 − γK_P)-Lipschitz-valued.

Proof. See the Appendix for all proofs.

This defines a rich set of Lipschitz-valued policies. Notably, the optimal policy π* satisfies the condition of Lemma 1.

Corollary 1. Let M be (K_R, K_P)-Lipschitz; then π* is K_R/(1 − γK_P)-Lipschitz-valued.

2.3. Latent Space Models

For some MDP M, let M̄ = ⟨S̄, A, R̄, P̄, γ⟩ be an MDP where S̄ ⊂ R^D for finite D and the action space is shared between M and M̄. Furthermore, let φ : S → S̄ be an embedding function which connects the state spaces of these two MDPs. We refer to (M̄, φ) as a latent space model of M.

Since M̄ is, by definition, an MDP, value functions can be defined in the standard way. We use V̄^π̄, Q̄^π̄ to denote the value functions of a policy π̄ ∈ Π̄, where Π̄ is the set of policies defined on the state space S̄. We use π̄* to denote the optimal policy in M̄. The corresponding optimal state and action value functions are then V̄*, Q̄*. For ease of notation, when s ∈ S, we use π̄(·|s) := π̄(·|φ(s)) to denote first using φ to map s to the state space S̄ of M̄ and subsequently using π̄ to generate the probability distribution over actions.

Although similar definitions of latent space models have been previously studied (Francois-Lavet et al., 2018; Zhang et al., 2018; Ha & Schmidhuber, 2018; Oh et al., 2017; Hafner et al., 2018; Kaiser et al., 2019; Silver et al., 2017), the parametrizations and training objectives used to learn such models have varied widely. For example, Ha & Schmidhuber (2018), Hafner et al. (2018) and Kaiser et al. (2019) use pixel prediction losses to learn the latent representation, while Oh et al. (2017) instead optimize the model to predict next latent states with the same value function as the sampled next states.

In this work, we study the minimization of latent space losses defined with respect to rewards and transitions in the latent space:

L_R̄(s, a) = |R(s, a) − R̄(φ(s), a)|,   (3)
L_P̄(s, a) = W( φP(·|s, a), P̄(·|φ(s), a) ),   (4)

where we use the shorthand notation φP(·|s, a) to denote the probability distribution over S̄ obtained by first sampling s′ ∼ P(·|s, a) and then embedding s̄′ = φ(s′). Francois-Lavet et al. (2018) and Chung et al. (2019) have studied similar latent losses, but to the best of our knowledge ours is the first theoretical analysis of latent space models as auxiliary losses.

We use the term DeepMDP to refer to a parameterized latent space model which minimizes the latent losses L_R̄ and L_P̄ (sometimes referred to as the DeepMDP losses). In Section 3, we derive theoretical guarantees for DeepMDPs when minimizing the DeepMDP losses over the whole state space (which we term global optimization). However, our principal objective is to learn DeepMDPs parameterized by deep networks, which requires losses in the form of expectations. We show in Section 4 that similar theoretical guarantees can be obtained in this setting.

3. Global DeepMDP Bounds

We refer to the following losses as the global DeepMDP losses, to emphasize their dependence on the whole state and action space:²

L^∞_R̄ = sup_{s,a} L_R̄(s, a),   (5)
L^∞_P̄ = sup_{s,a} L_P̄(s, a).   (6)

²The ∞ notation is a reference to the ℓ∞ norm.

3.1. Value Difference Bound

We start by bounding the difference of the value functions Q^π̄ and Q̄^π̄ for any policy π̄ ∈ Π̄. Note that Q^π̄(s, a) is computed using P and R on S, while Q̄^π̄(φ(s), a) is computed using P̄ and R̄ on S̄.

Lemma 2. Let M and M̄ be an MDP and DeepMDP respectively, with an embedding function φ and global loss functions L^∞_R̄ and L^∞_P̄. For any K_V̄-Lipschitz-valued policy π̄ ∈ Π̄ the value difference can be bounded by

|Q^π̄(s, a) − Q̄^π̄(φ(s), a)| ≤ (L^∞_R̄ + γ K_V̄ L^∞_P̄) / (1 − γ).

The previous result holds for all policies in Π̄ ⊆ Π, a subset of all possible policies Π. The reader might ask whether this is an interesting set of policies to consider; in Section 5, we characterize this set of policies via a connection with bisimulation.

A similar bound can be found in Asadi et al. (2018), who study non-latent transition models with the use of an exact reward function. We also note that our results are arguably simpler, since we do not require the treatment of MDP transitions in terms of distributions over a set of deterministic components.

3.2. Representation Quality Bound

When a representation is used to predict the value functions of policies π̄ ∈ Π̄, a clear failure case is when two states with different value functions are collapsed to the same representation. The following result demonstrates that when the global DeepMDP losses satisfy L^∞_R̄ = 0 and L^∞_P̄ = 0, this failure case can never occur for the embedding function φ.

Theorem 1. Let M and M̄ be an MDP and DeepMDP respectively, with an embedding function φ and global loss functions L^∞_R̄ and L^∞_P̄. For any K_V̄-Lipschitz-valued policy π̄ ∈ Π̄ the representation φ guarantees that for any s_1, s_2 ∈ S and a ∈ A,

|Q^π̄(s_1, a) − Q^π̄(s_2, a)| ≤ K_V̄ ‖φ(s_1) − φ(s_2)‖_2 + 2 (L^∞_R̄ + γ K_V̄ L^∞_P̄) / (1 − γ).

This result justifies using the embedding function φ as a representation to predict values.

3.3. Suboptimality Bound

Although we do not make use of any form of model-based RL in this work, we bound the performance loss of running the optimal policy of M̄ in M, compared to the optimal policy π*.

Theorem 2. Let M and M̄ be an MDP and a (K_R̄, K_P̄)-Lipschitz DeepMDP respectively, with an embedding function φ and global loss functions L^∞_R̄ and L^∞_P̄. For all s ∈ S, the suboptimality of the optimal policy π̄* of M̄, evaluated on M, can be bounded by

V*(s) − V^{π̄*}(s) ≤ 2 (L^∞_R̄ + γ K_V̄ L^∞_P̄) / (1 − γ),

where K_V̄ = K_R̄/(1 − γK_P̄) is an upper bound on the Lipschitz constant of the value function V̄^{π̄*}, as shown by Corollary 1.

4. Local DeepMDP Bounds

In large-scale tasks, data from many regions of the state space is often unavailable,³ making it infeasible to measure – let alone optimize – global losses. Further, when the capacity of a model is limited, or when sample efficiency is a concern, it might not even be desirable to precisely learn a model of the whole state space. Interestingly, we can still provide similar guarantees based on the DeepMDP losses, as measured under an expectation over a state-action distribution, denoted here as ξ. We refer to these as the losses local to ξ. Taking L^ξ_R̄, L^ξ_P̄ to be the reward and transition losses under ξ, respectively, we have the following local DeepMDP losses:

L^ξ_R̄ = E_{s,a ∼ ξ} |R(s, a) − R̄(φ(s), a)|,   (7)
L^ξ_P̄ = E_{s,a ∼ ξ} [ W( φP(·|s, a), P̄(·|φ(s), a) ) ].   (8)

³Challenging exploration environments like Montezuma's Revenge are a prime example.

Losses of this form are compatible with the stochastic gradient descent methods used to train neural networks. Thus, the study of the local losses allows us to bridge the gap between theory and practice.

4.1. Value Difference Bound

We provide a value function bound for the local case, analogous to Lemma 2.

Lemma 3. Let M and M̄ be an MDP and DeepMDP respectively, with an embedding function φ. For any K_V̄-Lipschitz-valued policy π̄ ∈ Π̄, the expected value function difference can be bounded using the local loss functions L^{ξ_π̄}_R̄ and L^{ξ_π̄}_P̄ measured under ξ_π̄, the stationary state-action distribution of π̄:

E_{s,a ∼ ξ_π̄} |Q^π̄(s, a) − Q̄^π̄(φ(s), a)| ≤ (L^{ξ_π̄}_R̄ + γ K_V̄ L^{ξ_π̄}_P̄) / (1 − γ).

The provided bound guarantees that for any policy π̄ ∈ Π̄ which visits state-action pairs (s, a) where L_R̄(s, a) and L_P̄(s, a) are small, the DeepMDP will provide accurate value functions for the states likely to be seen under the policy.⁴

⁴The value functions might be inaccurate in states that the policy π̄ rarely visits.
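To illustrate how losses of the form (7)–(8) are optimized from samples, the sketch below implements a small encoder with latent reward and transition heads and computes the two local losses on a sampled batch. It is an illustrative PyTorch sketch, not the authors' implementation (which uses TensorFlow); it assumes a deterministic latent transition model, in which case the Wasserstein term in Eq. (8) reduces to a distance between the embedded next state and the predicted next latent state (the same reduction used in the experiments of Section 7). All architecture sizes are arbitrary.

```python
import torch
import torch.nn as nn

class DeepMDPSketch(nn.Module):
    """Encoder phi plus latent reward head R_bar and deterministic latent
    transition head P_bar. Sizes and architectures are illustrative only."""
    def __init__(self, obs_dim, n_actions, latent_dim=32):
        super().__init__()
        self.n_actions = n_actions
        self.phi = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
        self.reward_head = nn.Linear(latent_dim + n_actions, 1)               # R_bar(s_bar, a)
        self.transition_head = nn.Linear(latent_dim + n_actions, latent_dim)  # P_bar(s_bar, a)

    def local_losses(self, obs, actions, rewards, next_obs):
        a = nn.functional.one_hot(actions, self.n_actions).float()
        s_bar = self.phi(obs)
        next_s_bar = self.phi(next_obs)      # a sample from phi P(.|s, a)
        sa = torch.cat([s_bar, a], dim=-1)
        loss_R = (rewards - self.reward_head(sa).squeeze(-1)).abs().mean()    # Eq. (7)
        # Deterministic-model surrogate for the Wasserstein term in Eq. (8).
        loss_P = (next_s_bar - self.transition_head(sa)).norm(dim=-1).mean()
        return loss_R, loss_P

# Usage on a batch of transitions sampled from xi (e.g. a replay buffer):
model = DeepMDPSketch(obs_dim=16, n_actions=4)
obs, next_obs = torch.randn(64, 16), torch.randn(64, 16)
actions, rewards = torch.randint(0, 4, (64,)), torch.randn(64)
loss_R, loss_P = model.local_losses(obs, actions, rewards, next_obs)
(loss_R + loss_P).backward()  # whether to stop gradients through next_s_bar is a design choice
```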

4.2. Representation Quality Bound

We can also extend the local value difference bound to provide a local bound on how well the representation φ can be used to predict the value function of a policy π̄ ∈ Π̄, analogous to Theorem 1.

Theorem 3. Let M and M̄ be an MDP and DeepMDP respectively, with an embedding function φ. Let π̄ ∈ Π̄ be any K_V̄-Lipschitz-valued policy with stationary distribution d_π̄(s), and let L^{ξ_π̄}_R̄ and L^{ξ_π̄}_P̄ be the local loss functions measured under ξ_π̄, the stationary state-action distribution of π̄. For any two states s_1, s_2 ∈ S, the local representation similarity can be bounded by

|V^π̄(s_1) − V^π̄(s_2)| ≤ K_V̄ ‖φ(s_1) − φ(s_2)‖_2 + (L^{ξ_π̄}_R̄ + γ K_V̄ L^{ξ_π̄}_P̄) / (1 − γ) · ( 1/d_π̄(s_1) + 1/d_π̄(s_2) ).

Thus, the representation quality argument given in Section 3.2 holds for any two states s_1 and s_2 which are visited often by a policy π̄.

5. Connections to Bisimulation

As we will now see, the representation learned by optimizing the DeepMDP losses is closely connected to bisimulation. Givan et al. (2003) first studied bisimulation relations in the context of RL as a formalization of behavioural equivalence between states. They proposed grouping equivalent states to reduce the dimensionality of the MDP.

Definition 4 (Givan et al. (2003)). Given an MDP M, an equivalence relation B between states is a bisimulation relation if for all states s_1, s_2 ∈ S equivalent under B (i.e. s_1 B s_2), the following conditions hold:

R(s_1, a) = R(s_2, a),
P(G|s_1, a) = P(G|s_2, a),   ∀G ∈ S/B,

where S/B denotes the partition of S under the relation B (the set of all sets of equivalent states), and where P(G|s, a) = Σ_{s′ ∈ G} P(s′|s, a).

Essentially, two states are bisimilar if (1) they have the same immediate reward for all actions and (2) both of their distributions over next states contain states which are themselves bisimilar. Of particular interest is the maximal bisimulation relation ∼, which defines the partition S/∼ with the fewest elements.

A drawback of bisimulation relations is their all-or-nothing nature. Two states that are nearly identical, but differ slightly in their reward or transition functions, are treated as though they were just as unrelated as two states with nothing in common. Ferns et al. (2004) introduced the usage of bisimulation metrics, which are pseudometrics that quantify the behavioural similarity of two discrete states, and proposed the aggregation of states that are ε away under the bisimulation metric. We present the extension of bisimulation metrics to continuous state spaces as proposed in Ferns et al. (2011).

Definition 5. Given an MDP M, a bisimulation metric d̃ satisfies the fixed point

d̃(s_1, s_2) = max_a [ (1 − γ) |R(s_1, a) − R(s_2, a)| + γ W_d̃( P(·|s_1, a), P(·|s_2, a) ) ].

This recurrent formulation of bisimulation metrics has a single fixed point d̃, which has as its kernel the maximal bisimulation relation ∼ (i.e. d̃(s_1, s_2) = 0 ⟺ s_1 ∼ s_2).

We show a connection between bisimulation metrics and the representation φ learned by the global DeepMDP losses (see Lemma 5 in the Appendix). The main application of this result will be to characterize the set of policies Π̄. With that aim, we first define the concept of Lipschitz-bisimilar policies:

Definition 6. We denote by Π̃_K the set of K-Lipschitz-bisimilar policies, s.t. for all s_1, s_2 ∈ S, a ∈ A,

Π̃_K = { π : π ∈ Π, |π(a|s_1) − π(a|s_2)| ≤ K d̃(s_1, s_2) }.

Lipschitz-bisimilar policies act differently only on states that are different under the bisimulation metric: Π̃_K excludes any policies which act differently on states that are fundamentally equivalent. We guarantee that the set of deep policies is sufficiently expressive by showing that minimizing the global DeepMDP losses ensures that for any π ∈ Π̃_K, there is a deep policy in Π̄ which is close. The following result thus characterizes the set of policies Π̄:

Theorem 4. Let M be an MDP and M̄ be a (K_R̄, K_P̄)-Lipschitz DeepMDP, with an embedding function φ and global loss functions L^∞_R̄ and L^∞_P̄. Denote by Π̄_K̄ the set of K̄-Lipschitz deep policies { π̄ : π̄ ∈ Π̄, |π̄(a|s_1) − π̄(a|s_2)| ≤ K̄ ‖φ(s_1) − φ(s_2)‖_2, ∀s_1, s_2 ∈ S, a ∈ A }. Finally, define the constant C = (1 − γ) K_R̄ / (1 − γK_P̄). Then for any π ∈ Π̃_K there exists a π̄ ∈ Π̄_{CK} which is close to π in the sense that, for all s ∈ S and a ∈ A,

|π(a|s) − π̄(a|s)| ≤ ( K_R̄ / (1 − γK_P̄) ) ( L^∞_R̄ + γ L^∞_P̄ ).

We speculate that similar results should be possible based on local DeepMDP losses, but they would require a generalization of bisimulation metrics to the local setting.

Although bisimulation metrics have been used for state aggregation (Givan et al., 2003; Ferns et al., 2004; Ruan et al., 2015), feature discovery (Comanici & Precup, 2011) and transfer learning between MDPs (Castro & Precup, 2010), they have not been scaled up to modern deep reinforcement learning techniques. In that sense, our method is of independent interest as a practical representation learning scheme for deep reinforcement learning that provides the desirable properties of bisimulation metrics via learning objectives that are tractable to compute.

6. Related Work

We have shown that learning a DeepMDP via the minimization of latent space losses leads to representations that:

• Allow for the good approximation of a large set of interesting policies.
• Allow for the good approximation of the value functions of these policies.

Thus, DeepMDPs are a mathematically sound approach to representation learning that is compatible with neural networks and simple to implement.

A similar connection between the quality of representations and model-based objectives in the linear setting was made by Parr et al. (2008). There exists an extensive body of literature on exploiting the transition function structure for representation learning (Mahadevan & Maggioni, 2007; Barreto et al., 2017), but these works do not rely on the reward function. Recently, Bellemare et al. (2019) approached the representation learning problem from the perspective that a good representation is one that allows the prediction, via a linear map, of any possible value function in the value function polytope (Dadashi et al., 2019). Other auxiliary tasks, without the same level of theoretical justification, have been shown to improve the performance of RL agents (Jaderberg et al., 2016; van den Oord et al., 2018; Mirowski et al., 2017). Lyle et al. (2019) argued that the performance benefits of distributional RL (Bellemare et al., 2017a) can also be explained as a form of auxiliary task.

7. Empirical Evaluation

Our results depend on minimizing losses in expectation, which is the main requirement for deep networks to be applicable. Still, two main obstacles arise when turning these theoretical results into practical algorithms:

(1) Minimization of the Wasserstein. Arjovsky et al. (2017) first proposed the use of the Wasserstein distance for Generative Adversarial Networks (GANs) via its dual formulation (see Equation 1). Their approach consists of training a network, constrained to be 1-Lipschitz, to attain the supremum of the dual. Once this supremum is attained, the Wasserstein can be minimized by differentiating through the network. Quantile regression has been proposed as an alternative solution to the minimization of the Wasserstein (Dabney et al., 2018b; 2018a), and has been shown to perform well for distributional RL. The reader might note that issues with the stochastic minimization of the Wasserstein distance have been found by Bellemare et al. (2017b) and Bińkowski et al. (2018). In our experiments, we circumvent these issues by assuming that both P and P̄ are deterministic. This reduces the Wasserstein distance W( φP(·|s, a), P̄(·|φ(s), a) ) to ‖φ(P(s, a)) − P̄(φ(s), a)‖_2, where P(s, a) and P̄(s̄, a) denote the deterministic transition functions.

(2) Control of the Lipschitz constants K_R̄ and K_P̄. We also turn to the field of Wasserstein GANs for approaches to constrain deep networks to be Lipschitz. Originally, Arjovsky et al. (2017) used a projection step to constrain the discriminator function to be 1-Lipschitz. Gulrajani et al. (2017a) proposed using a gradient penalty instead, and showed improved learning dynamics. Lipschitz continuity has also been proposed as a regularization method by Gouk et al. (2018), who provided an approach to compute an upper bound on the Lipschitz constant of neural networks. In our experiments, we follow Gulrajani et al. (2017a) and utilize the gradient penalty.

7.1. DonutWorld Experiments

In order to evaluate whether we can learn effective representations, we study the representations learned by DeepMDPs in a simple synthetic environment we call DonutWorld. DonutWorld consists of an agent rewarded for running clockwise around a fixed track. Staying in the center of the track results in faster movement. Observations are given in terms of 32x32 greyscale pixel arrays, but there is a simple 2D latent state space (the x-y coordinates of the agent). We investigate whether the x-y coordinates are correctly recovered when learning a two-dimensional representation.

This task epitomizes the low-dimensional dynamics, high-dimensional observations structure typical of Atari 2600 games, while being sufficiently simple to experiment with. We implement the DeepMDP training procedure using TensorFlow and compare it to a simple autoencoder baseline. See Appendix B for a full environment specification, experimental setup, and additional experiments. Code for replicating all experiments is included in the supplementary material.

In order to investigate whether the learned representations correspond well to reality, we plot a heatmap of closeness of representation for various states. Figure 1(a) shows that the DeepMDP representations effectively recover the underlying state of the agent, i.e. its 2D position, from the high-dimensional pixel observations. In contrast, the autoencoder representations are less meaningful, even when the autoencoder solves the task near-perfectly.
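The two practical choices described above (deterministic latent transitions for the Wasserstein term, and a Gulrajani-style gradient penalty for Lipschitz control) can be sketched as follows. This is an illustrative PyTorch rendering rather than the authors' TensorFlow code; the penalty form (pushing gradient norms towards 1), the interpolation scheme, and the penalty coefficient are assumptions.

```python
import torch
import torch.nn as nn

def gradient_penalty(net, z1, z2):
    """Gulrajani-style penalty encouraging `net` to be approximately 1-Lipschitz
    along random interpolates between latent states z1 and z2."""
    eps = torch.rand(z1.size(0), 1)
    z = (eps * z1 + (1 - eps) * z2).detach().requires_grad_(True)  # random interpolates
    out = net(z)
    grads, = torch.autograd.grad(outputs=out.sum(), inputs=z, create_graph=True)
    return ((grads.norm(2, dim=-1) - 1.0) ** 2).mean()

# Example: regularize a latent transition head (input slice for one fixed action).
transition_head = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
z1, z2 = torch.randn(64, 32), torch.randn(64, 32)
penalty = gradient_penalty(transition_head, z1, z2)
# total = loss_R + loss_P + gp_weight * penalty   # gp_weight is a tuning knob, not from the paper
```

In the DeepMDP setting the penalty would be applied to the latent reward and transition heads, to keep K_R̄ and K_P̄ small, rather than to a GAN discriminator.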

(a) One-track DonutWorld.  (b) Four-track DonutWorld.

Figure 1. Given a state in our DonutWorld environment (first row), we plot a heatmap of the distance between that latent state and each other latent state, for both autoencoder representations (second row) and DeepMDP representations (third row). More-similar latent states are represented by lighter colors.

In Figure 1(b), we modify the environment: rather than a single track, the environment now has four identical tracks. The agent starts in one uniformly at random, and cannot move between tracks. The DeepMDP hidden state correctly merges all states with indistinguishable value functions, learning a deep state representation which is almost completely invariant to which track the agent is in.

7.2. Atari 2600 Experiments

In this section, we demonstrate practical benefits of approximately learning a DeepMDP in the Arcade Learning Environment (Bellemare et al., 2013a). Our results on representation similarity indicate that learning a DeepMDP is a principled method for learning a high-quality representation. Therefore, we minimize DeepMDP losses as an auxiliary task alongside model-free reinforcement learning, learning a single representation which is shared between both tasks. Our implementations of the proposed algorithms are based on Dopamine (Castro et al., 2018).

We adopt the distributional Q-learning approach to model-free RL; specifically, we use as a baseline the C51 agent (Bellemare et al., 2017a), which estimates probability masses on a discrete support and minimizes the KL divergence between the estimated distribution and a target distribution. C51 encodes the input frames using a convolutional neural network φ : S → S̄, outputting a dense vector representation s̄ = φ(s). The C51 Q-function is a feed-forward neural network which maps s̄ to an estimate of the return distribution's logits.

To incorporate learning a DeepMDP as an auxiliary learning objective, we define a deep reward function and a deep transition function. These are each implemented as a feed-forward neural network which uses s̄ to estimate the immediate reward and the next-state representation, respectively. The overall objective function is a simple linear combination of the standard C51 loss and the Wasserstein distance-based approximations to the local DeepMDP losses given by Equations 7 and 8. For experimental details, see Appendix C.

By optimizing φ to jointly minimize both C51 and DeepMDP losses, we hope to learn meaningful s̄ that form the basis for learning good value functions. In the following subsections, we aim to answer the following questions: (1) How does the learning of a DeepMDP affect the overall performance of C51 on Atari 2600 games? (2) How do the DeepMDP objectives compare with similar representation-learning approaches?

7.3. DeepMDPs as an Auxiliary Task

We show that when using the best-performing DeepMDP architecture described in Appendix C.2, we obtain nearly consistent performance improvements over C51 on the suite of 60 Atari 2600 games (see Figure 2).

7.4. Comparison to Alternative Objectives

We empirically compare the effect of the DeepMDP auxiliary objectives on the performance of a C51 agent to a variety of alternatives. In the experiments in this section, we replace the deep transition loss suggested by the DeepMDP bounds with each of the following:

(1) Observation Reconstruction: We train a state decoder to reconstruct observations s ∈ S from s̄. This framework is similar to that of Ha & Schmidhuber (2018), who learn a latent space representation of the environment with an autoencoder and use it to train an RL agent.

(2) Next Observation Prediction: We train a transition model to predict next observations s′ ∼ P(·|s, a) from the current state representation s̄. This framework is similar to model-based RL algorithms which predict future observations (Xu et al., 2018).

(3) Next Logits Prediction: We train a transition model to predict next-state representations such that the Q-function correctly predicts the logits of (s′, a′), where a′ is the action associated with the max Q-value of s′. This can be understood as a distributional analogue of the Value Prediction Network (VPN; Oh et al., 2017). Note that this auxiliary loss is used to update only the parameters of the representation encoder and the transition model, not the Q-function.
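To make the shared-encoder auxiliary-task setup above concrete, here is a minimal sketch of how the model-free loss and the DeepMDP losses might be combined into a single objective. It is an illustrative PyTorch sketch, not the paper's Dopamine/TensorFlow implementation: a plain TD loss stands in for the categorical C51 loss to keep it short, and the mixing weight and layer sizes are assumptions (see Appendix C of the paper for the actual experimental details).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A shared encoder feeds both the RL head and the DeepMDP heads; the losses
# are combined linearly. A plain TD loss stands in for the C51 loss here.
obs_dim, n_actions, latent_dim, gamma = 16, 4, 32, 0.99
encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
q_head = nn.Linear(latent_dim, n_actions)
reward_head = nn.Linear(latent_dim + n_actions, 1)
transition_head = nn.Linear(latent_dim + n_actions, latent_dim)

def combined_loss(obs, actions, rewards, next_obs, aux_weight=1.0):
    a_onehot = F.one_hot(actions, n_actions).float()
    s_bar, next_s_bar = encoder(obs), encoder(next_obs)

    # Model-free objective (stand-in for the categorical C51 loss).
    q = q_head(s_bar).gather(1, actions[:, None]).squeeze(1)
    target = rewards + gamma * q_head(next_s_bar).max(dim=1).values.detach()
    rl_loss = F.smooth_l1_loss(q, target)

    # DeepMDP auxiliary losses with a deterministic latent transition model.
    sa = torch.cat([s_bar, a_onehot], dim=-1)
    loss_R = (rewards - reward_head(sa).squeeze(-1)).abs().mean()
    loss_P = (next_s_bar - transition_head(sa)).norm(dim=-1).mean()
    return rl_loss + aux_weight * (loss_R + loss_P)

loss = combined_loss(torch.randn(8, obs_dim), torch.randint(0, n_actions, (8,)),
                     torch.randn(8), torch.randn(8, obs_dim))
loss.backward()
```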

Figure 2. We compare the DeepMDP agent versus the C51 agent on the 60 games from the ALE (3 seeds each). For each game, the percentage performance improvement of DeepMDP over C51 is recorded.

Figure 3. Using various auxiliary tasks in the Arcade Learning Environment. We compare predicting the next state's representation (Next Latent State, recommended by the theoretical bounds on DeepMDPs) with reconstructing the current observation (Observation), predicting the next observation (Next Observation), and predicting the next C51 logits (Next Logits). Training curves for a baseline C51 agent are also shown.

Our experiments demonstrate that the deep transition loss suggested by the DeepMDP bounds (i.e. predicting the next state's representation) outperforms all three ablations (see Figure 3). Accurately modeling Atari 2600 frames, whether through observation reconstruction or next observation prediction, forces the representation to encode irrelevant information with respect to the underlying task. VPN-style losses have been shown to be helpful when using the learned predictive model for planning (Oh et al., 2017); however, we find that with a distributional RL agent, using this as an auxiliary task tends to hurt performance.

8. Conclusions

We introduce the concept of a DeepMDP: a parameterized latent space model trained via the minimization of tractable latent space losses. Theoretical analysis of DeepMDPs reveals several insights. A novel connection to bisimulation metrics guarantees that our analysis applies to a large set of interesting policies. Further, the representation allows the value functions of these policies to be predicted. Together, these findings suggest a novel approach to representation learning. Our results are corroborated by strong performance on large-scale Atari 2600 experiments, demonstrating that DeepMDP losses can serve as useful model-based auxiliary tasks in model-free RL. Using the transition and reward models of the DeepMDP for model-based RL (e.g. planning, exploration) is a promising future research direction. Additionally, extending DeepMDPs to accommodate action spaces or time scales different from those of the original MDP could be a promising path towards learning hierarchical models of the environment.

9. Acknowledgements

The authors would like to thank Philip Amortila and Robert Dadashi for invaluable feedback on the theoretical results; and Pablo Samuel Castro, Doina Precup, Nicolas Le Roux, Sasha Vezhnevets, Simon Osindero, Arthur Gretton, Adrien Ali Taiga, Fabian Pedregosa and Shane Gu for useful discussions and feedback.

References

Abel, D., Hershkowitz, D. E., and Littman, M. L. Near optimal behavior via approximate state abstraction. arXiv preprint arXiv:1701.04113, 2017.
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In ICML, 2017.
Asadi, K., Misra, D., and Littman, M. L. Lipschitz continuity in model-based reinforcement learning. arXiv preprint arXiv:1804.07193, 2018.
Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., Silver, D., and van Hasselt, H. P. Successor features for transfer in reinforcement learning. In NIPS, 2017.
Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013a.
Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. H. The Arcade Learning Environment: An evaluation platform for general agents. J. Artif. Intell. Res., 47:253–279, 2013b.
Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017a.
Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., and Munos, R. The Cramer distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017b.
Bellemare, M. G., Dabney, W., Dadashi, R., Taiga, A. A., Castro, P. S., Roux, N. L., Schuurmans, D., Lattimore, T., and Lyle, C. A geometric perspective on optimal representations for reinforcement learning. CoRR, abs/1901.11530, 2019.
Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1lUOzWCW.
Castro, P. and Precup, D. Using bisimulation for policy transfer in MDPs. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-2010), 2010.
Castro, P. S., Moitra, S., Gelada, C., Kumar, S., and Bellemare, M. G. Dopamine: A research framework for deep reinforcement learning. arXiv, 2018.
Chung, W., Nath, S., Joseph, A. G., and White, M. Two-timescale networks for nonlinear value function approximation. In International Conference on Learning Representations, 2019.
Comanici, G. and Precup, D. Basis function discovery using spectral clustering and bisimulation metrics. In AAMAS, 2011.
Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. In ICML, 2018a.
Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. In AAAI, 2018b.
Dadashi, R., Taiga, A. A., Roux, N. L., Schuurmans, D., and Bellemare, M. G. The value function polytope in reinforcement learning. CoRR, abs/1901.11524, 2019.
Ferns, N., Panangaden, P., and Precup, D. Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 162–169, 2004.
Ferns, N., Panangaden, P., and Precup, D. Bisimulation metrics for continuous Markov decision processes. SIAM Journal on Computing, 40(6):1662–1714, 2011.
Francois-Lavet, V., Bengio, Y., Precup, D., and Pineau, J. Combined reinforcement learning via abstract representations. arXiv preprint arXiv:1809.04506, 2018.
Gelada, C. and Bellemare, M. G. Off-policy deep reinforcement learning by bootstrapping the covariate shift. CoRR, abs/1901.09455, 2019.
Givan, R., Dean, T., and Greig, M. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1-2):163–223, 2003.
Gouk, H., Frank, E., Pfahringer, B., and Cree, M. J. Regularisation of neural networks by enforcing Lipschitz continuity. CoRR, abs/1804.04368, 2018.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In NIPS, 2017a.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017b.
Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pp. 2455–2467, 2018.
Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
Hinderer, K. Lipschitz continuity of value functions in Markovian decision processes. Math. Meth. of OR, 62:3–22, 2005.
Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
Jiang, N., Kulesza, A., and Singh, S. Abstraction selection in model-based reinforcement learning. In International Conference on Machine Learning, pp. 179–188, 2015.
Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Sepassi, R., Tucker, G., and Michalewski, H. Model-based reinforcement learning for Atari. CoRR, abs/1903.00374, 2019.
Li, L., Walsh, T. J., and Littman, M. L. Towards a unified theory of state abstraction for MDPs. In ISAIM, 2006.
Lyle, C., Castro, P. S., and Bellemare, M. G. A comparative analysis of expected and distributional reinforcement learning. CoRR, abs/1901.11084, 2019.
Mahadevan, S. and Maggioni, M. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8:2169–2231, 2007.
Mirowski, P. W., Pascanu, R., Viola, F., Soyer, H., Ballard, A. J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., Kumaran, D., and Hadsell, R. Learning to navigate in complex environments. CoRR, abs/1611.03673, 2017.
Mueller, A. Integral probability metrics and their generating classes of functions. 1997.
Oh, J., Singh, S., and Lee, H. Value prediction network. In Advances in Neural Information Processing Systems, pp. 6118–6128, 2017.
Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In ICML, 2008.
Pirotta, M., Restelli, M., and Bascetta, L. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283, 2015.
Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1994.
Ruan, S. S., Comanici, G., Panangaden, P., and Precup, D. Representation discovery for MDPs using bisimulation metrics. In AAAI, 2015.
Silver, D., van Hasselt, H. P., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D. P., Rabinowitz, N. C., Barreto, A., and Degris, T. The predictron: End-to-end learning and planning. In ICML, 2017.
Singh, S. P., Jaakkola, T., and Jordan, M. I. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, pp. 361–368, 1995.
van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018.
Villani, C. Optimal Transport: Old and New. 2008.
Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.
Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M. J., and Levine, S. SOLAR: Deep structured latent representations for model-based reinforcement learning. arXiv preprint arXiv:1808.09105, 2018.