
Lipschitz Continuity in Model-based Reinforcement Learning

Kavosh Asadi * 1 Dipendra Misra * 2 Michael L. Littman 1

* Equal contribution. 1 Department of Computer Science, Brown University, Providence, USA. 2 Department of Computer Science and Cornell Tech, Cornell University, New York, USA. Correspondence to: Kavosh Asadi.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We examine the impact of learning Lipschitz continuous models in the context of model-based reinforcement learning. We provide a novel bound on the multi-step prediction error of Lipschitz models, where we quantify the error using the Wasserstein metric. We go on to prove an error bound for the value-function estimate arising from Lipschitz models and show that the estimated value function is itself Lipschitz. We conclude with empirical results that show the benefits of controlling the Lipschitz constant of neural-network models.

1. Introduction

The model-based approach to reinforcement learning (RL) focuses on predicting the dynamics of the environment to plan and make high-quality decisions (Kaelbling et al., 1996; Sutton & Barto, 1998). Although the behavior of model-based algorithms in tabular environments is well understood and can be effective (Sutton & Barto, 1998), scaling up to the approximate setting can cause instabilities. Even small model errors can be magnified by the planning process, resulting in poor performance (Talvitie, 2014).

In this paper, we study model-based RL through the lens of Lipschitz continuity, intuitively related to the smoothness of a function. We show that the ability of a model to make accurate multi-step predictions is related to the model's one-step accuracy, but also to the magnitude of the Lipschitz constant (smoothness) of the model. We further show that the dependence on the Lipschitz constant carries over to the value-prediction problem, ultimately influencing the quality of the policy found by planning.

We consider a setting with continuous state spaces and stochastic transitions where we quantify the distance between distributions using the Wasserstein metric. We introduce a novel characterization of models, referred to as a Lipschitz model class, that represents stochastic dynamics using a set of component deterministic functions. This allows us to study any stochastic dynamic using the Lipschitz continuity of its component deterministic functions. To learn a Lipschitz model class in continuous state spaces, we provide an Expectation-Maximization algorithm (Dempster et al., 1977).

One promising direction for mitigating the effects of inaccurate models is the idea of limiting the complexity of the learned models or reducing the horizon of planning (Jiang et al., 2015). Doing so can sometimes make models more useful, much as regularization in supervised learning can improve generalization performance (Tibshirani, 1996). In this work, we also examine a type of regularization that comes from controlling the Lipschitz constant of models. This regularization technique can be applied efficiently, as we will show, when we represent the transition model by neural networks.

2. Background

We consider the Markov decision process (MDP) setting in which the RL problem is formulated by the tuple ⟨S, A, R, T, γ⟩. Here, by S we mean a continuous state space and by A we mean a discrete action set. The functions R : S × A → R and T : S × A → Pr(S) denote the reward and transition dynamics. Finally, γ ∈ [0, 1) is the discount rate. If |A| = 1, the setting is called a Markov reward process (MRP).

2.1. Lipschitz Continuity

Our analyses leverage the "smoothness" of various functions, quantified as follows.

Definition 1. Given two metric spaces (M1, d1) and (M2, d2) consisting of a space and a distance metric, a function f : M1 ↦ M2 is Lipschitz continuous (sometimes simply Lipschitz) if the Lipschitz constant, defined as

    K_{d1,d2}(f) := sup_{s1 ∈ M1, s2 ∈ M1} d2(f(s1), f(s2)) / d1(s1, s2),    (1)

is finite.
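As a concrete illustration of Definition 1 (our own sketch, not part of the paper), the following snippet lower-bounds the Lipschitz constant of a one-dimensional function by sampling pairs of points; the function and the sampling range are arbitrary choices for the example.

```python
# A minimal sketch: numerically lower-bounding the Lipschitz constant of
# Definition 1 by sampling pairs of points (Euclidean metric on both spaces).
import numpy as np

def empirical_lipschitz(f, low=-2.0, high=2.0, n_pairs=100_000, seed=0):
    """Lower bound on K_{d1,d2}(f) = sup d2(f(s1), f(s2)) / d1(s1, s2)."""
    rng = np.random.default_rng(seed)
    s1 = rng.uniform(low, high, n_pairs)
    s2 = rng.uniform(low, high, n_pairs)
    mask = np.abs(s1 - s2) > 1e-8                  # avoid division by zero
    ratios = np.abs(f(s1[mask]) - f(s2[mask])) / np.abs(s1[mask] - s2[mask])
    return ratios.max()

# Example: f(s) = sin(3s) is Lipschitz with constant 3; sampling approaches it.
print(empirical_lipschitz(lambda s: np.sin(3 * s)))
```

Sampling only approximates the supremum in (1) from below; the analysis in the paper works with the exact constants.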

Equivalently, for a Lipschitz f,

    ∀s1 ∀s2   d2(f(s1), f(s2)) ≤ K_{d1,d2}(f) d1(s1, s2).

The concept of Lipschitz continuity is visualized in Figure 1.

Figure 1. An illustration of Lipschitz continuity. Pictorially, Lipschitz continuity ensures that f lies in between the two affine functions (colored in blue) with slopes K and −K.

A Lipschitz function f is called a non-expansion when K_{d1,d2}(f) = 1 and a contraction when K_{d1,d2}(f) < 1. Lipschitz continuity, in one form or another, has been a key tool in the theory of reinforcement learning (Bertsekas, 1975; Bertsekas & Tsitsiklis, 1995; Littman & Szepesvári, 1996; Müller, 1996; Ferns et al., 2004; Hinderer, 2005; Rachelson & Lagoudakis, 2010; Szepesvári, 2010; Pazis & Parr, 2013; Pirotta et al., 2015; Pires & Szepesvári, 2016; Berkenkamp et al., 2017; Bellemare et al., 2017) and bandits (Kleinberg et al., 2008; Bubeck et al., 2011). Below, we also define Lipschitz continuity over a subset of inputs.

Definition 2. A function f : M1 × A ↦ M2 is uniformly Lipschitz continuous in A if

    K^A_{d1,d2}(f) := sup_{a ∈ A} sup_{s1,s2} d2(f(s1, a), f(s2, a)) / d1(s1, s2)    (2)

is finite. Note that the metric d1 is defined only on M1.

2.2. Wasserstein Metric

We quantify the distance between two distributions using the following metric:

Definition 3. Given a metric space (M, d) and the set P(M) of all probability measures on M, the Wasserstein metric (or the first Kantorovich metric) between two probability distributions µ1 and µ2 in P(M) is defined as

    W(µ1, µ2) := inf_{j ∈ Λ} ∫∫ j(s1, s2) d(s1, s2) ds2 ds1,    (3)

where Λ denotes the collection of all joint distributions j on M × M with marginals µ1 and µ2 (Vaserstein, 1969).

Sometimes referred to as "Earth Mover's distance", Wasserstein is the minimum expected distance between pairs of points where the joint distribution j is constrained to match the marginals µ1 and µ2. New applications of this metric have been discovered in machine learning, namely in the context of generative adversarial networks (Arjovsky et al., 2017) and value distributions in reinforcement learning (Bellemare et al., 2017). Wasserstein is linked to Lipschitz continuity using duality:

    W(µ1, µ2) = sup_{f : K_{d,d_R}(f) ≤ 1} ∫ ( f(s) µ1(s) − f(s) µ2(s) ) ds.    (4)

This equivalence, known as Kantorovich-Rubinstein duality (Villani, 2008), lets us compute Wasserstein by maximizing over a Lipschitz set of functions f : S ↦ R, a relatively easier problem to solve. In our theory, we utilize both definitions, namely the primal definition (3) and the dual definition (4).

3. Lipschitz Model Class

We introduce a novel representation of stochastic MDP transitions in terms of a distribution over a set of deterministic components.

Definition 4. Given a metric state space (S, d_S) and an action space A, we define F_g as a collection of functions F_g = {f : S ↦ S} distributed according to g(f | a), where a ∈ A. We say that F_g is a Lipschitz model class if

    K_F := sup_{f ∈ F_g} K_{d_S,d_S}(f)

is finite.

Our definition captures a subset of stochastic transitions, namely ones that can be represented as a state-independent distribution over deterministic transitions. An example is provided in Figure 2. We further prove in the appendix (see Claim 1) that any finite MDP transition probabilities can be decomposed into a state-independent distribution g over a finite set of deterministic functions f.

Associated with a Lipschitz model class is a transition function given by:

    T̂(s' | s, a) = Σ_f 1( f(s) = s' ) g(f | a).

Given a state distribution µ(s), we also define a generalized notion of transition function T̂_G(· | µ, a) given by:

    T̂_G(s' | µ, a) = ∫_s Σ_f 1( f(s) = s' ) g(f | a) µ(s) ds,

where the inner sum is exactly T̂(s' | s, a).
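To make these definitions concrete, here is a minimal sketch (our own, not the authors' code) of a Lipschitz model class over a one-dimensional integer state space: two deterministic component functions, a state-independent distribution g(f | a), and the induced transition T̂(s' | s, a) = Σ_f 1[f(s) = s'] g(f | a). All names in the snippet are illustrative.

```python
# A minimal sketch: a Lipschitz model class F_g with two deterministic
# components and the induced transition probability T_hat(s' | s, a).
import numpy as np

# Deterministic component functions: "move right" and "stay".
component_functions = {"right": lambda s: s + 1, "stay": lambda s: s}

# State-independent distribution g(f | a) over the components, per action.
g = {"forward": {"right": 0.8, "stay": 0.2}}

def t_hat(s_next, s, a):
    """T_hat(s' | s, a) = sum_f 1[f(s) = s'] * g(f | a)."""
    return sum(p for name, p in g[a].items()
               if component_functions[name](s) == s_next)

# Both components are Lipschitz with constant 1, so K_F = 1 here and F_g is a
# Lipschitz model class.
print(t_hat(5, 4, "forward"), t_hat(4, 4, "forward"))   # 0.8 0.2
```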

Figure 2. An example of a Lipschitz model class in a gridworld environment (Russell & Norvig, 1995). The dynamics are such that any action choice results in an attempted transition in the corresponding direction with probability 0.8 and in the neighboring directions with probabilities 0.1 and 0.1. We can define F_g = {f_up, f_right, f_down, f_left}, where each f outputs a deterministic next position in the grid (factoring in obstacles). For a = up, we have: g(f_up | a = up) = 0.8, g(f_right | a = up) = g(f_left | a = up) = 0.1, and g(f_down | a = up) = 0. Defining distances between states as their Manhattan distance in the grid, then ∀f sup_{s1,s2} d(f(s1), f(s2)) / d(s1, s2) = 2, and so K_F = 2. So, the four functions and g comprise a Lipschitz model class.

We are primarily interested in K^A_{d,W}(T̂_G), the Lipschitz constant of T̂_G. However, since T̂_G takes as input a probability distribution and also outputs a probability distribution, we require a notion of distance between two distributions. This notion is quantified using Wasserstein and is justified in the next section.

4. On the Choice of Probability Metric

We consider the stochastic model-based setting and show through an example that the Wasserstein metric is a reasonable choice compared to other common options.

Consider a uniform distribution over states µ(s) as shown in black in Figure 3 (top). Take a transition function T_G in the environment that, given an action a, uniformly randomly adds or subtracts a scalar c1. The distribution of states after one transition is shown in red in Figure 3 (middle). Now, consider a transition model T̂_G that approximates T_G by uniformly randomly adding or subtracting the scalar c2. The distribution over states after one transition using this imperfect model is shown in blue in Figure 3 (bottom).

Figure 3. A state distribution µ(s) (top), a stochastic environment that randomly adds or subtracts c1 (middle), and an approximate transition model that randomly adds or subtracts a second scalar c2 (bottom).

We desire a metric that captures the similarity between the output of the two transition functions. We first consider Kullback-Leibler (KL) divergence and observe that:

    KL( T_G(· | µ, a), T̂_G(· | µ, a) ) := ∫ T_G(s' | µ, a) log [ T_G(s' | µ, a) / T̂_G(s' | µ, a) ] ds' = ∞,

unless the two constants are exactly the same. The next possible choice is Total Variation (TV), defined as:

    TV( T_G(· | µ, a), T̂_G(· | µ, a) ) := (1/2) ∫ | T_G(s' | µ, a) − T̂_G(s' | µ, a) | ds' = 1,

if the two distributions have disjoint supports, regardless of how far the supports are from each other. In contrast, Wasserstein is sensitive to how far apart the constants are:

    W( T_G(· | µ, a), T̂_G(· | µ, a) ) = | c1 − c2 |.

It is clear that, of the three, Wasserstein corresponds best to the intuitive sense of how closely T̂_G approximates T_G. This is particularly important in high-dimensional spaces where the true distribution is known to usually lie in low-dimensional manifolds (Narayanan & Mitter, 2010).
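The following sketch (our own, not the paper's released implementation) reproduces the spirit of this example numerically, starting from a point mass rather than a uniform µ: with c1 ≠ c2 the two next-state distributions have disjoint supports, so KL is infinite and TV saturates at 1, while Wasserstein reports |c1 − c2|.

```python
# A minimal sketch of the Section 4 example: the environment adds or subtracts
# c1, the model adds or subtracts c2, both from a point-mass start state.
import numpy as np
from scipy.stats import wasserstein_distance

c1, c2, s0 = 1.0, 1.4, 0.0
rng = np.random.default_rng(0)
true_next  = s0 + rng.choice([-c1, c1], size=100_000)
model_next = s0 + rng.choice([-c2, c2], size=100_000)

# Wasserstein estimated from samples: close to |c1 - c2| = 0.4.
print(wasserstein_distance(true_next, model_next))

# TV and KL on the (disjoint) discrete supports {-c1, c1} vs {-c2, c2}.
p = np.array([0.5, 0.5, 0.0, 0.0])   # true distribution on [-c1, c1, -c2, c2]
q = np.array([0.0, 0.0, 0.5, 0.5])   # model distribution
tv = 0.5 * np.abs(p - q).sum()       # = 1, no matter how far apart c1 and c2 are
with np.errstate(divide="ignore", invalid="ignore"):
    kl = np.sum(np.where(p > 0, p * np.log(p / q), 0.0))   # = inf
print(tv, kl)
```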

5. Understanding the Compounding Error Phenomenon

To extract a prediction with a horizon n > 1, model-based algorithms typically apply the model for n steps by taking the state input in step t to be the state output from step t−1. Previous work has shown that model error can result in poor long-horizon predictions and ineffective planning (Talvitie, 2014; 2017). Observed even beyond reinforcement learning (Lorenz, 1972; Venkatraman et al., 2015), this is referred to as the compounding error phenomenon. The goal of this section is to provide a bound on the multi-step prediction error of a model. We formalize the notion of model accuracy below:

Definition 5. Given an MDP with a transition function T, we identify a Lipschitz model F_g as ∆-accurate if its induced T̂ satisfies:

    ∀s ∀a   W( T̂(· | s, a), T(· | s, a) ) ≤ ∆.

We want to express the multi-step Wasserstein error in terms of the single-step Wasserstein error and the Lipschitz constant of the transition function T̂_G. We provide a bound on the Lipschitz constant of T̂_G using the following lemma:

Lemma 1. A generalized transition function T̂_G induced by a Lipschitz model class F_g is Lipschitz with a constant:

    K^A_{W,W}(T̂_G) := sup_a sup_{µ1,µ2} W( T̂_G(· | µ1, a), T̂_G(· | µ2, a) ) / W(µ1, µ2) ≤ K_F.

Intuitively, Lemma 1 states that, if the two input distributions are similar, then for any action the output distributions given by T̂_G are also similar up to a K_F factor. We prove this lemma, as well as the subsequent lemmas, in the appendix.

Given the one-step error (Definition 5), a start state distribution µ, and a fixed sequence of actions a_0, ..., a_{n−1}, we desire a bound on the n-step error:

    δ(n) := W( T̂_G^n(· | µ), T_G^n(· | µ) ),

where T̂_G^n(· | µ) := T̂_G(· | T̂_G(· | ... T̂_G(· | µ, a_0) ..., a_{n−2}), a_{n−1}) denotes n recursive calls, and T_G^n(· | µ) is defined similarly. We provide a useful lemma followed by the theorem.

Lemma 2. (Composition Lemma) Define three metric spaces (M1, d1), (M2, d2), and (M3, d3). Define Lipschitz functions f : M2 ↦ M3 and g : M1 ↦ M2 with constants K_{d2,d3}(f) and K_{d1,d2}(g). Then h := f ∘ g : M1 ↦ M3 is Lipschitz with constant K_{d1,d3}(h) ≤ K_{d2,d3}(f) K_{d1,d2}(g).

Similar to composition, we can show that summation preserves Lipschitz continuity, with a constant bounded by the sum of the Lipschitz constants of the two functions. We omit this result for brevity.

Theorem 1. Define a ∆-accurate T̂_G with Lipschitz constant K_F and an MDP with a Lipschitz transition function T_G with constant K_T. Let K̄ = min{K_F, K_T}. Then ∀n ≥ 1:

    δ(n) := W( T̂_G^n(· | µ), T_G^n(· | µ) ) ≤ ∆ Σ_{i=0}^{n−1} K̄^i.

Proof. We construct a proof by induction. Using Kantorovich-Rubinstein duality (the Lipschitz property of f is not shown for brevity), we first prove the base of the induction:

    δ(1) := W( T̂_G(· | µ, a_0), T_G(· | µ, a_0) )
          = sup_f ∫∫ ( T̂(s' | s, a_0) − T(s' | s, a_0) ) f(s') µ(s) ds ds'
          ≤ ∫ sup_f ∫ ( T̂(s' | s, a_0) − T(s' | s, a_0) ) f(s') ds' µ(s) ds
          = ∫ W( T̂(· | s, a_0), T(· | s, a_0) ) µ(s) ds    (duality (4))
          ≤ ∫ ∆ µ(s) ds = ∆    (Definition 5).

We now prove the inductive step. Assuming δ(n−1) := W( T̂_G^{n−1}(· | µ), T_G^{n−1}(· | µ) ) ≤ ∆ Σ_{i=0}^{n−2} (K_F)^i, we can write:

    δ(n) := W( T̂_G^n(· | µ), T_G^n(· | µ) )
          ≤ W( T̂_G^n(· | µ), T̂_G(· | T_G^{n−1}(· | µ), a_{n−1}) ) + W( T̂_G(· | T_G^{n−1}(· | µ), a_{n−1}), T_G^n(· | µ) )    (triangle inequality)
          = W( T̂_G(· | T̂_G^{n−1}(· | µ), a_{n−1}), T̂_G(· | T_G^{n−1}(· | µ), a_{n−1}) ) + W( T̂_G(· | T_G^{n−1}(· | µ), a_{n−1}), T_G(· | T_G^{n−1}(· | µ), a_{n−1}) ).

We now use Lemma 1 and Definition 5 to upper bound the first and the second term of the last line, respectively:

    δ(n) ≤ K_F W( T̂_G^{n−1}(· | µ), T_G^{n−1}(· | µ) ) + ∆ = K_F δ(n−1) + ∆ ≤ ∆ Σ_{i=0}^{n−1} (K_F)^i.    (5)

Note that in the triangle inequality above we may instead replace T̂_G(· | T_G^{n−1}(· | µ), a_{n−1}) with T_G(· | T̂_G^{n−1}(· | µ), a_{n−1}) and follow the same basic steps to get:

    W( T̂_G^n(· | µ), T_G^n(· | µ) ) ≤ ∆ Σ_{i=0}^{n−1} (K_T)^i.    (6)

Combining (5) and (6) allows us to write:

    δ(n) = W( T̂_G^n(· | µ), T_G^n(· | µ) ) ≤ min{ ∆ Σ_{i=0}^{n−1} (K_T)^i, ∆ Σ_{i=0}^{n−1} (K_F)^i } = ∆ Σ_{i=0}^{n−1} (K̄)^i,

which concludes the proof.
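A small numerical check of Theorem 1 (our own sketch, not from the paper): a toy one-dimensional Lipschitz model class with dynamics s' = Ks ± c, where the environment uses one scalar and the model another, so the one-step error is ∆ = |c_true − c_model| and both transition functions are Lipschitz with constant K.

```python
# A minimal sketch: empirically checking delta(n) <= Delta * sum_i K^i
# (Theorem 1) on a toy 1-D stochastic model.
import numpy as np
from scipy.stats import wasserstein_distance

K, c_true, c_model, s0, n_steps = 0.9, 1.0, 1.2, 0.0, 10
Delta = abs(c_true - c_model)          # one-step Wasserstein error
rng = np.random.default_rng(0)

def rollout(c, n, samples=100_000):
    """Samples from the n-step state distribution under s' = K*s +/- c."""
    s = np.full(samples, s0, dtype=float)
    for _ in range(n):
        s = K * s + c * rng.choice([-1.0, 1.0], size=samples)
    return s

for n in range(1, n_steps + 1):
    err = wasserstein_distance(rollout(c_true, n), rollout(c_model, n))
    bound = Delta * sum(K ** i for i in range(n))
    print(f"n={n:2d}  empirical={err:5.2f}  bound={bound:5.2f}")
```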

There exist similar results in the literature relating one-step transition error to multi-step transition error and sub-optimality bounds for planning with an approximate model. The Simulation Lemma (Kearns & Singh, 2002; Strehl et al., 2009) is for discrete-state MDPs and relates error in the one-step model to the value obtained by using it for planning. A related result for continuous state spaces (Kakade et al., 2003) bounds the error in estimating the probability of a trajectory using total variation. A second related result (Venkatraman et al., 2015) provides a slightly looser bound for prediction error in the deterministic case—our result can be thought of as a generalization of their result to the probabilistic case.

6. Value Error with Lipschitz Models

We next investigate the error in the state-value function induced by a Lipschitz model class. To answer this question, we consider an MRP M1 denoted by ⟨S, A, T, R, γ⟩ and a second MRP M2 that only differs from the first in its transition function, ⟨S, A, T̂, R, γ⟩. Let A = {a} be the action set with a single action a. We further assume that the reward function is only dependent upon state. We first express the state-value function for a start state s with respect to the two transition functions. By δ_s below, we mean a Dirac delta denoting a distribution with probability 1 at state s:

    V_T(s) := Σ_{n=0}^∞ γ^n ∫ T_G^n(s' | δ_s) R(s') ds',

    V_T̂(s) := Σ_{n=0}^∞ γ^n ∫ T̂_G^n(s' | δ_s) R(s') ds'.

Next we derive a bound on V_T(s) − V_T̂(s) ∀s.

Theorem 2. Assume a Lipschitz model class F_g with a ∆-accurate T̂ and K̄ = min{K_F, K_T}. Further, assume a Lipschitz reward function with constant K_R = K_{d_S,R}(R). Then ∀s ∈ S and K̄ ∈ [0, 1/γ):

    V_T(s) − V_T̂(s) ≤ γ K_R ∆ / ( (1 − γ)(1 − γK̄) ).

Proof. We first define the function f(s) = R(s) / K_R. It can be observed that K_{d_S,R}(f) = 1. We now write:

    V_T(s) − V_T̂(s) = Σ_{n=0}^∞ γ^n ∫ R(s') ( T_G^n(s' | δ_s) − T̂_G^n(s' | δ_s) ) ds'
                     = K_R Σ_{n=0}^∞ γ^n ∫ f(s') ( T_G^n(s' | δ_s) − T̂_G^n(s' | δ_s) ) ds'.

Let F = { h : K_{d_S,R}(h) ≤ 1 }. Then, given f ∈ F:

    K_R Σ_{n=0}^∞ γ^n ∫ f(s') ( T_G^n(s' | δ_s) − T̂_G^n(s' | δ_s) ) ds'
      ≤ K_R Σ_{n=0}^∞ γ^n sup_{f ∈ F} ∫ f(s') ( T_G^n(s' | δ_s) − T̂_G^n(s' | δ_s) ) ds'
      = K_R Σ_{n=0}^∞ γ^n W( T_G^n(· | δ_s), T̂_G^n(· | δ_s) )    (duality (4))
      ≤ K_R Σ_{n=0}^∞ γ^n ∆ Σ_{i=0}^{n−1} (K̄)^i    (Theorem 1)
      = K_R ∆ Σ_{n=0}^∞ γ^n (1 − K̄^n) / (1 − K̄)
      = γ K_R ∆ / ( (1 − γ)(1 − γK̄) ).

We can derive the same bound for V_T̂(s) − V_T(s) using the fact that the Wasserstein distance is a metric, and therefore symmetric, thereby completing the proof.

Regarding the tightness of our bounds, we can show that when the transition model is deterministic and linear, Theorem 1 provides a tight bound. Moreover, if the reward function is linear, the bound provided by Theorem 2 is tight. (See Claim 2 in the appendix.) Notice also that our proof does not require a bounded reward function.
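Continuing the toy model from the earlier sketch (again our own illustration, not from the paper), the following compares a Monte Carlo estimate of the value gap with the Theorem 2 bound, using the Lipschitz reward R(s) = |s| so that K_R = 1.

```python
# A minimal sketch: Monte Carlo value gap vs. the Theorem 2 bound on the toy
# 1-D model class s' = K*s +/- c, with reward R(s) = |s| (so K_R = 1).
import numpy as np

K, c_true, c_model, gamma, s0 = 0.9, 1.0, 1.2, 0.95, 0.0
Delta, K_bar = abs(c_true - c_model), K        # one-step error and min(K_F, K_T)
rng = np.random.default_rng(0)

def mc_value(c, horizon=300, samples=50_000):
    s = np.full(samples, s0, dtype=float)
    v = np.zeros(samples)
    for n in range(horizon):
        v += gamma ** n * np.abs(s)            # reward R(s) = |s|
        s = K * s + c * rng.choice([-1.0, 1.0], size=samples)
    return v.mean()

value_gap = abs(mc_value(c_true) - mc_value(c_model))
bound = gamma * 1.0 * Delta / ((1 - gamma) * (1 - gamma * K_bar))
print(f"|V_T(s0) - V_T_hat(s0)| ~ {value_gap:.2f}  <=  bound {bound:.2f}")
```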

7. Lipschitz Generalized Value Iteration

We next show that, given a Lipschitz transition model, solving for the fixed point of a class of Bellman equations yields a Lipschitz state-action value function. Our proof is in the context of Generalized Value Iteration (GVI) (Littman & Szepesvári, 1996), which defines Value Iteration (Bellman, 1957) for planning with arbitrary backup operators.

Algorithm 1 GVI algorithm
    Input: initial Q̂(s, a), δ, and choose an operator f
    repeat
        for each s, a ∈ S × A do
            Q̂(s, a) ← R(s, a) + γ ∫ T̂(s' | s, a) f( Q̂(s', ·) ) ds'
        end for
    until convergence
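A minimal sketch of Algorithm 1 for a finite MDP (our own code, not the authors' released implementation), using the mellowmax operator (introduced in Lemma 4 below) as the non-expansion backup f; the tabular model T̂ here is just a random stochastic tensor for illustration.

```python
# A minimal sketch of GVI (Algorithm 1) on a finite MDP with the mellowmax
# backup operator, which is a non-expansion in the max norm.
import numpy as np

def mellowmax(q, beta=5.0):
    """mm_beta(x) = log(mean(exp(beta * x))) / beta, applied over the action axis."""
    return np.log(np.mean(np.exp(beta * q), axis=-1)) / beta

def gvi(R, T_hat, gamma=0.95, f=mellowmax, tol=1e-8):
    """R: (S, A) rewards; T_hat: (S, A, S) model transition probabilities."""
    n_s, n_a = R.shape
    Q = np.zeros((n_s, n_a))
    while True:
        # Q(s, a) <- R(s, a) + gamma * sum_s' T_hat(s'|s, a) f(Q(s', .))
        Q_new = R + gamma * T_hat @ f(Q)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

# Tiny random MDP for illustration.
rng = np.random.default_rng(0)
T_hat = rng.random((4, 2, 4)); T_hat /= T_hat.sum(axis=-1, keepdims=True)
R = rng.random((4, 2))
print(gvi(R, T_hat).round(3))
```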

To prove the result, we make use of the following lemmas.

Lemma 3. Given a Lipschitz function f : S ↦ R with constant K_{d_S,d_R}(f):

    K^A_{d_S,d_R}( ∫ T̂(s' | s, a) f(s') ds' ) ≤ K_{d_S,d_R}(f) K^A_{d_S,W}(T̂).

Lemma 4. The following operators (Asadi & Littman, 2017) are Lipschitz with constants:

1. K_{||·||∞,d_R}(max(x)) = K_{||·||∞,d_R}(mean(x)) = K_{||·||∞,d_R}(ε-greedy(x)) = 1

2. K_{||·||∞,d_R}( mm_β(x) := log( Σ_{i=1}^n e^{βx_i} / n ) / β ) = 1

3. K_{||·||∞,d_R}( boltz_β(x) := Σ_{i=1}^n x_i e^{βx_i} / Σ_{i=1}^n e^{βx_i} ) ≤ √|A| + β V_max |A|

Theorem 3. For any non-expansion backup operator f outlined in Lemma 4, GVI computes a value function with a Lipschitz constant bounded by K^A_{d_S,d_R}(R) / ( 1 − γ K^A_{d_S,W}(T̂) ) if γ K^A_{d_S,W}(T̂) < 1.

Proof. From Algorithm 1, in the nth round of GVI updates:

    Q̂_{n+1}(s, a) ← R(s, a) + γ ∫ T̂(s' | s, a) f( Q̂_n(s', ·) ) ds'.

Now observe that:

    K^A_{d_S,d_R}(Q̂_{n+1}) ≤ K^A_{d_S,d_R}(R) + γ K^A_{d_S,d_R}( ∫ T̂(s' | s, a) f( Q̂_n(s', ·) ) ds' )
                           ≤ K^A_{d_S,d_R}(R) + γ K^A_{d_S,W}(T̂) K_{d_S,d_R}( f( Q̂_n(s, ·) ) )
                           ≤ K^A_{d_S,d_R}(R) + γ K^A_{d_S,W}(T̂) K_{||·||∞,d_R}(f) K^A_{d_S,d_R}(Q̂_n)
                           = K^A_{d_S,d_R}(R) + γ K^A_{d_S,W}(T̂) K^A_{d_S,d_R}(Q̂_n),

where we used Lemmas 3, 2, and 4 for the second, third, and fourth lines, respectively. Equivalently:

    K^A_{d_S,d_R}(Q̂_{n+1}) ≤ K^A_{d_S,d_R}(R) Σ_{i=0}^{n} ( γ K^A_{d_S,W}(T̂) )^i + ( γ K^A_{d_S,W}(T̂) )^n K^A_{d_S,d_R}(Q̂_0).

By computing the limit of both sides, we get:

    lim_{n→∞} K^A_{d_S,d_R}(Q̂_{n+1}) ≤ lim_{n→∞} K^A_{d_S,d_R}(R) Σ_{i=0}^{n} ( γ K^A_{d_S,W}(T̂) )^i + lim_{n→∞} ( γ K^A_{d_S,W}(T̂) )^n K^A_{d_S,d_R}(Q̂_0)
                                     = K^A_{d_S,d_R}(R) / ( 1 − γ K^A_{d_S,W}(T̂) ) + 0.

This concludes the proof.

Two implications of this result: First, PAC exploration in continuous state spaces is shown assuming a Lipschitz value function (Pazis & Parr, 2013). However, the theorem shows that it is sufficient to have a Lipschitz model, an assumption perhaps easier to confirm. The second implication relates to the value-aware model learning (VAML) objective (Farahmand et al., 2017). Using the above theorem, we can show that minimizing Wasserstein is equivalent to minimizing the VAML objective (Asadi et al., 2018).

8. Experiments

Our first goal in this section is to compare TV, KL, and Wasserstein in terms of the ability to best quantify the error of an imperfect model. To this end, we built finite MRPs with random transitions, |S| = 10 states, and γ = 0.95. In the first case the reward signal is randomly sampled from [0, 10], and in the second case the reward of a state is the index of that state, so a small Euclidean distance between two states is an indication of similar values. For 10^5 trials, we generated an MRP and a random model, and then computed model error and planning error (Figure 4). We understand a good metric as one that computes a model error with a high correlation with value error. We show these correlations for different values of γ in Figure 5.

Figure 4. Value error (x axis) and model error (y axis). When the reward is the index of the state (right), the correlation between Wasserstein error and value-prediction error is high. This highlights the fact that when closeness in the state space is an indication of similar values, Wasserstein can be a powerful metric for model-based RL. Note that Wasserstein provides no advantage given random rewards (left).

Figure 5. Correlation between value-prediction error and model error for the three metrics using random rewards (left) and index rewards (right). Given a useful notion of state similarities, low Wasserstein error is a better indication of planning error.

1 We release the code here: github.com/kavosh8/Lip
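A condensed sketch of this comparison (our own code, not the released implementation at github.com/kavosh8/Lip): random finite MRPs with index rewards, a random model per trial, and the correlation of each metric's model error with the value-prediction error. The number of trials is reduced here for speed, and the specific error aggregation is a design choice of the sketch.

```python
# A minimal sketch: correlate model error under W / TV / KL with value error
# on random finite MRPs, using the state index as both reward and position.
import numpy as np
from scipy.stats import wasserstein_distance

n_states, gamma, trials = 10, 0.95, 2000
states = np.arange(n_states)
rng = np.random.default_rng(0)

def value(T, R):
    return np.linalg.solve(np.eye(n_states) - gamma * T, R)

def random_stochastic():
    T = rng.random((n_states, n_states))
    return T / T.sum(axis=1, keepdims=True)

errors = {"W": [], "TV": [], "KL": []}
value_errors = []
R = states.astype(float)                     # reward = index of the state
for _ in range(trials):
    T, T_hat = random_stochastic(), random_stochastic()
    value_errors.append(np.abs(value(T, R) - value(T_hat, R)).mean())
    errors["W"].append(np.mean([wasserstein_distance(states, states, T[s], T_hat[s])
                                for s in states]))
    errors["TV"].append(0.5 * np.abs(T - T_hat).sum(axis=1).mean())
    errors["KL"].append((T * np.log(T / T_hat)).sum(axis=1).mean())

for name, errs in errors.items():
    print(name, round(float(np.corrcoef(errs, value_errors)[0, 1]), 3))
```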

Function f                        Definition                   Lipschitz constant K_{||·||p,||·||p}(f)
                                                               p = 1              p = 2                   p = ∞
ReLu : R^n → R^n                  ReLu(x)_i := max{0, x_i}     1                  1                       1
+b : R^n → R^n, ∀b ∈ R^n          +b(x) := x + b               1                  1                       1
W· : R^n → R^m, ∀W ∈ R^{m×n}      W(x) := Wx                   Σ_j ||W_j||_∞      √( Σ_j ||W_j||_2^2 )    sup_j ||W_j||_1

Table 1. Lipschitz constants for various functions used in a neural network. Here, W_j denotes the jth row of a weight matrix W.
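The bounds in Table 1 are easy to compute directly. The sketch below (our own, not from the paper) evaluates them for the weight matrices of a small network and multiplies them across layers, which is valid because ReLU and bias addition contribute a factor of 1 and composition multiplies constants (Lemma 2). The layer sizes are arbitrary.

```python
# A minimal sketch: per-layer Lipschitz bounds from Table 1 and the
# network-level bound obtained by composition (Lemma 2).
import numpy as np

def linear_lipschitz_bounds(W):
    """Upper bounds on K_{||.||p,||.||p}(x -> Wx) for p = 1, 2, inf (W_j = jth row)."""
    return {
        1: np.abs(W).max(axis=1).sum(),        # sum_j ||W_j||_inf
        2: np.sqrt((W ** 2).sum()),            # sqrt(sum_j ||W_j||_2^2), the Frobenius norm
        "inf": np.abs(W).sum(axis=1).max(),    # sup_j ||W_j||_1
    }

rng = np.random.default_rng(0)
layers = [rng.normal(size=(16, 4)), rng.normal(size=(1, 16))]   # a 2-layer net

# ReLU and bias addition contribute a factor of 1, so the bound on the whole
# network is the product of the per-layer linear bounds.
for p in (1, 2, "inf"):
    bound = np.prod([linear_lipschitz_bounds(W)[p] for W in layers])
    print(p, round(float(bound), 2))
```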

It is known that controlling the Lipschitz constant of neural nets can help in terms of improving generalization error due to a lower bound on Rademacher complexity (Neyshabur et al., 2015; Bartlett & Mendelson, 2002). It then follows from Theorems 1 and 2 that controlling the Lipschitz constant of a learned transition model can achieve better error bounds for multi-step and value predictions. To enforce this constraint during learning, we bound the Lipschitz constant of the various operations used in building the neural network. The bound on the constant of the entire neural network then follows from Lemma 2. In Table 1, we provide the Lipschitz constants of the operations used in our experiments (see Appendix for proofs). We quantify these results for different p-norms ||·||_p.

Given these simple methods for enforcing Lipschitz continuity, we performed empirical evaluations to understand the impact of Lipschitz continuity of transition models, specifically when the transition model is used to perform multi-step state predictions and policy improvements. We chose two standard domains: Cart Pole and Pendulum. In Cart Pole, we trained a network on a dataset of 15 × 10^3 tuples ⟨s, a, s'⟩. During training, we ensured that the weights of the network are smaller than k. For each k, we performed 20 independent model estimations and chose the model based on cross-validation error.

Using the learned model, along with the actual reward signal of the environment, we then performed stochastic actor-critic RL (Barto et al., 1983; Sutton et al., 2000). This required an interaction between the policy and the learned model for relatively long trajectories. To measure the usefulness of the model, we then tested the learned policy on the actual domain. We repeated this experiment on Pendulum. To train the neural transition model for this domain we used 10^4 samples. Notably, we used deterministic policy gradient (Silver et al., 2014) for training the policy network, with the hyperparameters suggested by Lillicrap et al. (2015). We report these results in Figure 6.

Figure 6. Impact of the Lipschitz constant of learned models in Cart Pole (left) and Pendulum (right); the y-axis shows average return per episode. An intermediate value of k (Lipschitz constant) yields the best performance.

Observe that an intermediate Lipschitz constant yields the best result. Consistent with the theory, controlling the Lipschitz constant in practice can combat the compounding errors and can help in the value-estimation problem. This ultimately results in learning a better policy.

We next examined if the benefits carry over to stochastic settings. To capture stochasticity we need an algorithm to learn a Lipschitz model class (Definition 4). We used an EM algorithm to jointly learn a set of functions f, parameterized by θ = {θ^f : f ∈ F_g}, and a distribution over functions g. Note that in practice our dataset only consists of a set of samples ⟨s, a, s'⟩ and does not include the function each sample is drawn from. Hence, we consider this as our latent variable z. As is standard with EM, we start with the log-likelihood objective (for simplicity of presentation we assume a single action in the derivation):

    L(θ) = Σ_{i=1}^N log p(s_i, s'_i; θ)
         = Σ_{i=1}^N log Σ_f p(z_i = f, s_i, s'_i; θ)
         = Σ_{i=1}^N log Σ_f q(z_i = f | s_i, s'_i) [ p(z_i = f, s_i, s'_i; θ) / q(z_i = f | s_i, s'_i) ]
         ≥ Σ_{i=1}^N Σ_f q(z_i = f | s_i, s'_i) log [ p(z_i = f, s_i, s'_i; θ) / q(z_i = f | s_i, s'_i) ],

where we used Jensen's inequality and the concavity of log in the last line. This derivation leads to the following EM algorithm.

In the M step, find θ_t by solving for:

    argmax_θ Σ_{i=1}^N Σ_f q_{t−1}(z_i = f | s_i, s'_i) log [ p(z_i = f, s_i, s'_i; θ) / q_{t−1}(z_i = f | s_i, s'_i) ].

In the E step, compute the posteriors:

    q_t(z_i = f | s_i, s'_i) = p(s_i, s'_i | z_i = f; θ_t^f) g(z_i = f; θ_t) / Σ_f p(s_i, s'_i | z_i = f; θ_t^f) g(z_i = f; θ_t).

Note that we assume each point is drawn from a neural network f with probability

    p(s_i, s'_i | z_i = f; θ_t^f) = N( s'_i − f(s_i; θ_t^f), σ² ),

with a fixed variance σ² tuned as a hyper-parameter.
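A self-contained sketch of this EM procedure (our own, not the authors' implementation): linear component functions f(s; θ_f) = a_f s + b_f stand in for the paper's neural networks so that the M step has a closed form, and σ² is fixed, as above. The data-generating maps are invented for the example.

```python
# A minimal sketch of the EM algorithm above, with linear component functions
# in place of neural networks and a fixed residual variance sigma^2.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: next states generated by one of two unknown deterministic maps.
s = rng.uniform(-2, 2, size=500)
true_maps = [lambda x: 1.5 * x + 1.0, lambda x: -0.5 * x - 1.0]
z = rng.integers(0, 2, size=s.size)
s_next = np.where(z == 0, true_maps[0](s), true_maps[1](s)) + 0.05 * rng.normal(size=s.size)

n_f, sigma2 = 2, 0.1
theta = rng.normal(size=(n_f, 2))            # [a_f, b_f] per component function
g = np.full(n_f, 1.0 / n_f)                  # distribution over components

X = np.stack([s, np.ones_like(s)], axis=1)   # design matrix for the M step
for _ in range(50):
    # E step: q(z_i = f | s_i, s'_i) proportional to N(s'_i - f(s_i); 0, sigma^2) * g(f).
    resid = s_next[:, None] - X @ theta.T                        # shape (N, n_f)
    log_post = -0.5 * resid ** 2 / sigma2 + np.log(g)
    q = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M step: weighted least squares per component, then update g.
    for f in range(n_f):
        w = q[:, f]
        theta[f] = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * s_next))
    g = q.mean(axis=0)

# With a reasonable initialization this roughly recovers the two true maps.
print(theta.round(2), g.round(2))
```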

We used a supervised-learning domain to evaluate the EM algorithm. We generated 30 points from 5 functions (listed at the end of the Appendix) and trained 5 neural networks to fit these points. Iterations of a single run are shown in Figure 7, and the summary of results is presented in Figure 8. Observe that the EM algorithm is effective, and that controlling the Lipschitz constant is again useful.

Figure 7. A stochastic problem solved by training a Lipschitz model class using EM. The top left figure shows the functions before any training (iteration 0), and the bottom right figure shows the final results (iteration 50).

Figure 8. Impact of controlling the Lipschitz constant in the supervised-learning domain. Notice the U-shape of the final Wasserstein loss with respect to the Lipschitz constant k.

We next applied EM to train a transition model for an RL setting, namely the gridworld domain from Moerland et al. (2017). Here a useful model needs to capture the stochastic behavior of the two ghosts. We modify the reward to be −1 whenever the agent is in the same cell as either one of the ghosts and 0 otherwise. We performed environmental interactions for 1000 time-steps and measured the return. We compared against standard tabular methods (Sutton & Barto, 1998) and a deterministic model that predicts the expected next state (Sutton et al., 2008; Parr et al., 2008). In all cases we used value iteration for planning.

Results in Figure 9 show that tabular models fail due to no generalization, and expected models fail since the ghosts do not move on expectation, a prediction not useful for the planner. Performing value iteration with a Lipschitz model class outperforms the baselines.

Figure 9. Performance of a Lipschitz model class on the gridworld domain. We show model test accuracy (left) and quality of the policy found using the model (right). Notice the poor performance of the tabular and expected models.

9. Conclusion

We took an important step towards understanding model-based RL with function approximation. We showed that Lipschitz continuity of an estimated model plays a central role in multi-step prediction error and in value-estimation error. We also showed the benefits of employing Wasserstein for model-based RL. An important direction for future work is to apply these ideas to larger problems.

10. Acknowledgements

The authors recognize the assistance of Eli Upfal, John Langford, George Konidaris, and members of Brown's RLab, specifically Cameron Allen, David Abel, and Evan Cater. The authors also thank anonymous ICML reviewer 1 for insights on a value-aware interpretation of Wasserstein.

References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223, 2017.

Asadi, K. and Littman, M. L. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 243–252, 2017.

Asadi, K., Cater, E., Misra, D., and Littman, M. L. Equivalence between Wasserstein and value-aware model-based reinforcement learning. arXiv preprint arXiv:1806.01265, 2018.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, pp. 834–846, 1983.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.

Bellman, R. A Markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684, 1957.

Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pp. 908–919, 2017.

Bertsekas, D. Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control, 20(3):415–419, 1975.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pp. 560–564. IEEE, 1995.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695, 2011.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.

Farahmand, A.-M., Barreto, A., and Nikovski, D. Value-aware loss function for model-based reinforcement learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 1486–1494, 2017.

Ferns, N., Panangaden, P., and Precup, D. Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 162–169. AUAI Press, 2004.

Hinderer, K. Lipschitz continuity of value functions in Markovian decision processes. Mathematical Methods of Operations Research, 62(1):3–22, 2005.

Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of AAMAS, pp. 1181–1189, 2015.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

Kakade, S., Kearns, M. J., and Langford, J. Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 306–312, 2003.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 681–690. ACM, 2008.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Littman, M. L. and Szepesvári, C. A generalized reinforcement-learning model: Convergence and applications. In Proceedings of the 13th International Conference on Machine Learning, pp. 310–318, 1996.

Lorenz, E. Predictability: does the flap of a butterfly's wing in Brazil set off a tornado in Texas? 1972.

Moerland, T. M., Broekens, J., and Jonker, C. M. Learning multimodal transition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470, 2017.

Müller, A. Optimal selection from distributions with unknown parameters: Robustness of Bayesian models. Mathematical Methods of Operations Research, 44(3):371–386, 1996.

Narayanan, H. and Mitter, S. Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems, pp. 1786–1794, 2010.

Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In Proceedings of The 28th Conference on Learning Theory, pp. 1376–1401, 2015.

Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 752–759. ACM, 2008.

Pazis, J. and Parr, R. PAC optimal exploration in continuous space Markov decision processes. In AAAI, 2013.

Pires, B. Á. and Szepesvári, C. Policy error bounds for model-based reinforcement learning with factored linear models. In Conference on Learning Theory, pp. 121–151, 2016.

Pirotta, M., Restelli, M., and Bascetta, L. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283, 2015.

Rachelson, E. and Lagoudakis, M. G. On the locality of action domination in sequential decision making. In International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2010.

Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach, 1995.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.

Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Sutton, R. S., Szepesvári, C., Geramifard, A., and Bowling, M. H. Dyna-style planning with linear function approximation and prioritized sweeping. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pp. 528–536, 2008.

Szepesvári, C. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.

Talvitie, E. Model regularization for stable sample rollouts. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 780–789. AUAI Press, 2014.

Talvitie, E. Self-correcting models for model-based reinforcement learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288, 1996.

Vaserstein, L. N. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72, 1969.

Venkatraman, A., Hebert, M., and Bagnell, J. A. Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.