
Lipschitz Continuity in Model-based Reinforcement Learning

Kavosh Asadi * 1 Dipendra Misra * 2 Michael L. Littman 1

* Equal contribution. 1 Department of Computer Science, Brown University, Providence, USA. 2 Department of Computer Science and Cornell Tech, Cornell University, New York, USA. Correspondence to: Kavosh Asadi.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We examine the impact of learning Lipschitz continuous models in the context of model-based reinforcement learning. We provide a novel bound on the multi-step prediction error of Lipschitz models, where we quantify the error using the Wasserstein metric. We go on to prove an error bound for the value-function estimate arising from Lipschitz models and show that the estimated value function is itself Lipschitz. We conclude with empirical results that show the benefits of controlling the Lipschitz constant of neural-network models.

1. Introduction

The model-based approach to reinforcement learning (RL) focuses on predicting the dynamics of the environment to plan and make high-quality decisions (Kaelbling et al., 1996; Sutton & Barto, 1998). Although the behavior of model-based algorithms in tabular environments is well understood and can be effective (Sutton & Barto, 1998), scaling up to the approximate setting can cause instabilities. Even small model errors can be magnified by the planning process, resulting in poor performance (Talvitie, 2014).

In this paper, we study model-based RL through the lens of Lipschitz continuity, intuitively related to the smoothness of a function. We show that the ability of a model to make accurate multi-step predictions is related to the model's one-step accuracy, but also to the magnitude of the Lipschitz constant (smoothness) of the model. We further show that the dependence on the Lipschitz constant carries over to the value-prediction problem, ultimately influencing the quality of the policy found by planning.

We consider a setting with continuous state spaces and stochastic transitions where we quantify the distance between distributions using the Wasserstein metric. We introduce a novel characterization of models, referred to as a Lipschitz model class, that represents stochastic dynamics using a set of component deterministic functions. This allows us to study any stochastic dynamic using the Lipschitz continuity of its component deterministic functions. To learn a Lipschitz model class in continuous state spaces, we provide an Expectation-Maximization algorithm (Dempster et al., 1977).

One promising direction for mitigating the effects of inaccurate models is the idea of limiting the complexity of the learned models or reducing the horizon of planning (Jiang et al., 2015). Doing so can sometimes make models more useful, much as regularization in supervised learning can improve generalization performance (Tibshirani, 1996). In this work, we also examine a type of regularization that comes from controlling the Lipschitz constant of models. This regularization technique can be applied efficiently, as we will show, when we represent the transition model by neural networks.

2. Background

We consider the Markov decision process (MDP) setting in which the RL problem is formulated by the tuple ⟨S, A, R, T, γ⟩. Here, by S we mean a continuous state space and by A we mean a discrete action set. The functions R : S × A → R and T : S × A → Pr(S) denote the reward and transition dynamics. Finally, γ ∈ [0, 1) is the discount rate. If |A| = 1, the setting is called a Markov reward process (MRP).

2.1. Lipschitz Continuity

Our analyses leverage the "smoothness" of various functions, quantified as follows.

Definition 1. Given two metric spaces (M1, d1) and (M2, d2) consisting of a space and a distance metric, a function f : M1 ↦ M2 is Lipschitz continuous (sometimes simply Lipschitz) if the Lipschitz constant, defined as

    K_{d1,d2}(f) := sup_{s1 ∈ M1, s2 ∈ M1} d2(f(s1), f(s2)) / d1(s1, s2),    (1)

is finite.
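As a concrete illustration of Definition 1 (our own sketch, not part of the paper), the following snippet lower-bounds the Lipschitz constant of a one-dimensional function by sampling pairs of points; the function and the sampling range are arbitrary choices for the example.

```python
# A minimal sketch: numerically lower-bounding the Lipschitz constant of
# Definition 1 by sampling pairs of points (Euclidean metric on both spaces).
import numpy as np

def empirical_lipschitz(f, low=-2.0, high=2.0, n_pairs=100_000, seed=0):
    """Lower bound on K_{d1,d2}(f) = sup d2(f(s1), f(s2)) / d1(s1, s2)."""
    rng = np.random.default_rng(seed)
    s1 = rng.uniform(low, high, n_pairs)
    s2 = rng.uniform(low, high, n_pairs)
    mask = np.abs(s1 - s2) > 1e-8                  # avoid division by zero
    ratios = np.abs(f(s1[mask]) - f(s2[mask])) / np.abs(s1[mask] - s2[mask])
    return ratios.max()

# Example: f(s) = sin(3s) is Lipschitz with constant 3; sampling approaches it.
print(empirical_lipschitz(lambda s: np.sin(3 * s)))
```

Sampling only approximates the supremum in (1) from below; the analysis in the paper works with the exact constants.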

Equivalently, for a Lipschitz f,

    ∀s1 ∀s2   d2(f(s1), f(s2)) ≤ K_{d1,d2}(f) d1(s1, s2).

The concept of Lipschitz continuity is visualized in Figure 1.

Figure 1. An illustration of Lipschitz continuity. Pictorially, Lipschitz continuity ensures that f lies in between the two affine functions (colored in blue) with slopes K and −K.

A Lipschitz function f is called a non-expansion when K_{d1,d2}(f) = 1 and a contraction when K_{d1,d2}(f) < 1. Lipschitz continuity, in one form or another, has been a key tool in the theory of reinforcement learning (Bertsekas, 1975; Bertsekas & Tsitsiklis, 1995; Littman & Szepesvári, 1996; Müller, 1996; Ferns et al., 2004; Hinderer, 2005; Rachelson & Lagoudakis, 2010; Szepesvári, 2010; Pazis & Parr, 2013; Pirotta et al., 2015; Pires & Szepesvári, 2016; Berkenkamp et al., 2017; Bellemare et al., 2017) and bandits (Kleinberg et al., 2008; Bubeck et al., 2011). Below, we also define Lipschitz continuity over a subset of inputs.

Definition 2. A function f : M1 × A ↦ M2 is uniformly Lipschitz continuous in A if

    K^A_{d1,d2}(f) := sup_{a ∈ A} sup_{s1,s2} d2(f(s1, a), f(s2, a)) / d1(s1, s2)    (2)

is finite. Note that the metric d1 is defined only on M1.

2.2. Wasserstein Metric

We quantify the distance between two distributions using the following metric:

Definition 3. Given a metric space (M, d) and the set P(M) of all probability measures on M, the Wasserstein metric (or the first Kantorovich metric) between two probability distributions µ1 and µ2 in P(M) is defined as

    W(µ1, µ2) := inf_{j ∈ Λ} ∫∫ j(s1, s2) d(s1, s2) ds2 ds1,    (3)

where Λ denotes the collection of all joint distributions j on M × M with marginals µ1 and µ2 (Vaserstein, 1969).

Sometimes referred to as "Earth Mover's distance", Wasserstein is the minimum expected distance between pairs of points where the joint distribution j is constrained to match the marginals µ1 and µ2. New applications of this metric have been discovered in machine learning, namely in the context of generative adversarial networks (Arjovsky et al., 2017) and value distributions in reinforcement learning (Bellemare et al., 2017). Wasserstein is linked to Lipschitz continuity using duality:

    W(µ1, µ2) = sup_{f : K_{d,d_R}(f) ≤ 1} ∫ ( f(s) µ1(s) − f(s) µ2(s) ) ds.    (4)

This equivalence, known as Kantorovich-Rubinstein duality (Villani, 2008), lets us compute Wasserstein by maximizing over a Lipschitz set of functions f : S ↦ R, a relatively easier problem to solve. In our theory, we utilize both definitions, namely the primal definition (3) and the dual definition (4).

3. Lipschitz Model Class

We introduce a novel representation of stochastic MDP transitions in terms of a distribution over a set of deterministic components.

Definition 4. Given a metric state space (S, d_S) and an action space A, we define F_g as a collection of functions F_g = {f : S ↦ S} distributed according to g(f | a), where a ∈ A. We say that F_g is a Lipschitz model class if

    K_F := sup_{f ∈ F_g} K_{d_S,d_S}(f)

is finite.

Our definition captures a subset of stochastic transitions, namely ones that can be represented as a state-independent distribution over deterministic transitions. An example is provided in Figure 2. We further prove in the appendix (see Claim 1) that any finite MDP transition probabilities can be decomposed into a state-independent distribution g over a finite set of deterministic functions f.

Associated with a Lipschitz model class is a transition function given by:

    T̂(s' | s, a) = Σ_f 1( f(s) = s' ) g(f | a).

Given a state distribution µ(s), we also define a generalized notion of transition function T̂_G(· | µ, a) given by:

    T̂_G(s' | µ, a) = ∫_s Σ_f 1( f(s) = s' ) g(f | a) µ(s) ds,

where the inner sum is exactly T̂(s' | s, a).
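To make these definitions concrete, here is a minimal sketch (our own, not the authors' code) of a Lipschitz model class over a one-dimensional integer state space: two deterministic component functions, a state-independent distribution g(f | a), and the induced transition T̂(s' | s, a) = Σ_f 1[f(s) = s'] g(f | a). All names in the snippet are illustrative.

```python
# A minimal sketch: a Lipschitz model class F_g with two deterministic
# components and the induced transition probability T_hat(s' | s, a).
import numpy as np

# Deterministic component functions: "move right" and "stay".
component_functions = {"right": lambda s: s + 1, "stay": lambda s: s}

# State-independent distribution g(f | a) over the components, per action.
g = {"forward": {"right": 0.8, "stay": 0.2}}

def t_hat(s_next, s, a):
    """T_hat(s' | s, a) = sum_f 1[f(s) = s'] * g(f | a)."""
    return sum(p for name, p in g[a].items()
               if component_functions[name](s) == s_next)

# Both components are Lipschitz with constant 1, so K_F = 1 here and F_g is a
# Lipschitz model class.
print(t_hat(5, 4, "forward"), t_hat(4, 4, "forward"))   # 0.8 0.2
```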

Figure 2. An example of a Lipschitz model class in a gridworld environment (Russell & Norvig, 1995). The dynamics are such that any action choice results in an attempted transition in the corresponding direction with probability 0.8 and in the neighboring directions with probabilities 0.1 and 0.1. We can define F_g = {f_up, f_right, f_down, f_left}, where each f outputs a deterministic next position in the grid (factoring in obstacles). For a = up, we have: g(f_up | a = up) = 0.8, g(f_right | a = up) = g(f_left | a = up) = 0.1, and g(f_down | a = up) = 0. Defining distances between states as their Manhattan distance in the grid, then ∀f sup_{s1,s2} d(f(s1), f(s2)) / d(s1, s2) = 2, and so K_F = 2. So, the four functions and g comprise a Lipschitz model class.

We are primarily interested in K^A_{d,W}(T̂_G), the Lipschitz constant of T̂_G. However, since T̂_G takes as input a probability distribution and also outputs a probability distribution, we require a notion of distance between two distributions. This notion is quantified using Wasserstein and is justified in the next section.

4. On the Choice of Probability Metric

We consider the stochastic model-based setting and show through an example that the Wasserstein metric is a reasonable choice compared to other common options.

Consider a uniform distribution over states µ(s) as shown in black in Figure 3 (top). Take a transition function T_G in the environment that, given an action a, uniformly randomly adds or subtracts a scalar c1. The distribution of states after one transition is shown in red in Figure 3 (middle). Now, consider a transition model T̂_G that approximates T_G by uniformly randomly adding or subtracting the scalar c2. The distribution over states after one transition using this imperfect model is shown in blue in Figure 3 (bottom).

Figure 3. A state distribution µ(s) (top), a stochastic environment that randomly adds or subtracts c1 (middle), and an approximate transition model that randomly adds or subtracts a second scalar c2 (bottom).

We desire a metric that captures the similarity between the output of the two transition functions. We first consider Kullback-Leibler (KL) divergence and observe that:

    KL( T_G(· | µ, a), T̂_G(· | µ, a) ) := ∫ T_G(s' | µ, a) log [ T_G(s' | µ, a) / T̂_G(s' | µ, a) ] ds' = ∞,

unless the two constants are exactly the same. The next possible choice is Total Variation (TV), defined as:

    TV( T_G(· | µ, a), T̂_G(· | µ, a) ) := (1/2) ∫ | T_G(s' | µ, a) − T̂_G(s' | µ, a) | ds' = 1,

if the two distributions have disjoint supports, regardless of how far the supports are from each other. In contrast, Wasserstein is sensitive to how far apart the constants are:

    W( T_G(· | µ, a), T̂_G(· | µ, a) ) = | c1 − c2 |.

It is clear that, of the three, Wasserstein corresponds best to the intuitive sense of how closely T̂_G approximates T_G. This is particularly important in high-dimensional spaces where the true distribution is known to usually lie in low-dimensional manifolds (Narayanan & Mitter, 2010).
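The following sketch (our own, not the paper's released implementation) reproduces the spirit of this example numerically, starting from a point mass rather than a uniform µ: with c1 ≠ c2 the two next-state distributions have disjoint supports, so KL is infinite and TV saturates at 1, while Wasserstein reports |c1 − c2|.

```python
# A minimal sketch of the Section 4 example: the environment adds or subtracts
# c1, the model adds or subtracts c2, both from a point-mass start state.
import numpy as np
from scipy.stats import wasserstein_distance

c1, c2, s0 = 1.0, 1.4, 0.0
rng = np.random.default_rng(0)
true_next  = s0 + rng.choice([-c1, c1], size=100_000)
model_next = s0 + rng.choice([-c2, c2], size=100_000)

# Wasserstein estimated from samples: close to |c1 - c2| = 0.4.
print(wasserstein_distance(true_next, model_next))

# TV and KL on the (disjoint) discrete supports {-c1, c1} vs {-c2, c2}.
p = np.array([0.5, 0.5, 0.0, 0.0])   # true distribution on [-c1, c1, -c2, c2]
q = np.array([0.0, 0.0, 0.5, 0.5])   # model distribution
tv = 0.5 * np.abs(p - q).sum()       # = 1, no matter how far apart c1 and c2 are
with np.errstate(divide="ignore", invalid="ignore"):
    kl = np.sum(np.where(p > 0, p * np.log(p / q), 0.0))   # = inf
print(tv, kl)
```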

5. Understanding the Compounding Error Phenomenon

To extract a prediction with a horizon n > 1, model-based algorithms typically apply the model for n steps by taking the state input in step t to be the state output from step t−1. Previous work has shown that model error can result in poor long-horizon predictions and ineffective planning (Talvitie, 2014; 2017). Observed even beyond reinforcement learning (Lorenz, 1972; Venkatraman et al., 2015), this is referred to as the compounding error phenomenon. The goal of this section is to provide a bound on the multi-step prediction error of a model. We formalize the notion of model accuracy below:

Definition 5. Given an MDP with a transition function T, we identify a Lipschitz model F_g as ∆-accurate if its induced T̂ satisfies:

    ∀s ∀a   W( T̂(· | s, a), T(· | s, a) ) ≤ ∆.

We want to express the multi-step Wasserstein error in terms of the single-step Wasserstein error and the Lipschitz constant of the transition function T̂_G. We provide a bound on the Lipschitz constant of T̂_G using the following lemma:

Lemma 1. A generalized transition function T̂_G induced by a Lipschitz model class F_g is Lipschitz with a constant:

    K^A_{W,W}(T̂_G) := sup_a sup_{µ1,µ2} W( T̂_G(· | µ1, a), T̂_G(· | µ2, a) ) / W(µ1, µ2) ≤ K_F.

Intuitively, Lemma 1 states that, if the two input distributions are similar, then for any action the output distributions given by T̂_G are also similar up to a K_F factor. We prove this lemma, as well as the subsequent lemmas, in the appendix.

Given the one-step error (Definition 5), a start state distribution µ, and a fixed sequence of actions a_0, ..., a_{n−1}, we desire a bound on the n-step error:

    δ(n) := W( T̂_G^n(· | µ), T_G^n(· | µ) ),

where T̂_G^n(· | µ) := T̂_G(· | T̂_G(· | ... T̂_G(· | µ, a_0) ..., a_{n−2}), a_{n−1}) denotes n recursive calls, and T_G^n(· | µ) is defined similarly. We provide a useful lemma followed by the theorem.

Lemma 2. (Composition Lemma) Define three metric spaces (M1, d1), (M2, d2), and (M3, d3). Define Lipschitz functions f : M2 ↦ M3 and g : M1 ↦ M2 with constants K_{d2,d3}(f) and K_{d1,d2}(g). Then h := f ∘ g : M1 ↦ M3 is Lipschitz with constant K_{d1,d3}(h) ≤ K_{d2,d3}(f) K_{d1,d2}(g).

Similar to composition, we can show that summation preserves Lipschitz continuity, with a constant bounded by the sum of the Lipschitz constants of the two functions. We omit this result for brevity.

Theorem 1. Define a ∆-accurate T̂_G with Lipschitz constant K_F and an MDP with a Lipschitz transition function T_G with constant K_T. Let K̄ = min{K_F, K_T}. Then ∀n ≥ 1:

    δ(n) := W( T̂_G^n(· | µ), T_G^n(· | µ) ) ≤ ∆ Σ_{i=0}^{n−1} K̄^i.

Proof. We construct a proof by induction. Using Kantorovich-Rubinstein duality (the Lipschitz property of f is not shown for brevity), we first prove the base of the induction:

    δ(1) := W( T̂_G(· | µ, a_0), T_G(· | µ, a_0) )
          = sup_f ∫∫ ( T̂(s' | s, a_0) − T(s' | s, a_0) ) f(s') µ(s) ds ds'
          ≤ ∫ sup_f ∫ ( T̂(s' | s, a_0) − T(s' | s, a_0) ) f(s') ds' µ(s) ds
          = ∫ W( T̂(· | s, a_0), T(· | s, a_0) ) µ(s) ds    (duality (4))
          ≤ ∫ ∆ µ(s) ds = ∆    (Definition 5).

We now prove the inductive step. Assuming δ(n−1) := W( T̂_G^{n−1}(· | µ), T_G^{n−1}(· | µ) ) ≤ ∆ Σ_{i=0}^{n−2} (K_F)^i, we can write:

    δ(n) := W( T̂_G^n(· | µ), T_G^n(· | µ) )
          ≤ W( T̂_G^n(· | µ), T̂_G(· | T_G^{n−1}(· | µ), a_{n−1}) ) + W( T̂_G(· | T_G^{n−1}(· | µ), a_{n−1}), T_G^n(· | µ) )    (triangle inequality)
          = W( T̂_G(· | T̂_G^{n−1}(· | µ), a_{n−1}), T̂_G(· | T_G^{n−1}(· | µ), a_{n−1}) ) + W( T̂_G(· | T_G^{n−1}(· | µ), a_{n−1}), T_G(· | T_G^{n−1}(· | µ), a_{n−1}) ).

We now use Lemma 1 and Definition 5 to upper bound the first and the second term of the last line, respectively:

    δ(n) ≤ K_F W( T̂_G^{n−1}(· | µ), T_G^{n−1}(· | µ) ) + ∆ = K_F δ(n−1) + ∆ ≤ ∆ Σ_{i=0}^{n−1} (K_F)^i.    (5)

Note that in the triangle inequality above we may instead replace T̂_G(· | T_G^{n−1}(· | µ), a_{n−1}) with T_G(· | T̂_G^{n−1}(· | µ), a_{n−1}) and follow the same basic steps to get:

    W( T̂_G^n(· | µ), T_G^n(· | µ) ) ≤ ∆ Σ_{i=0}^{n−1} (K_T)^i.    (6)

Combining (5) and (6) allows us to write:

    δ(n) = W( T̂_G^n(· | µ), T_G^n(· | µ) ) ≤ min{ ∆ Σ_{i=0}^{n−1} (K_T)^i, ∆ Σ_{i=0}^{n−1} (K_F)^i } = ∆ Σ_{i=0}^{n−1} (K̄)^i,

which concludes the proof.
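A small numerical check of Theorem 1 (our own sketch, not from the paper): a toy one-dimensional Lipschitz model class with dynamics s' = Ks ± c, where the environment uses one scalar and the model another, so the one-step error is ∆ = |c_true − c_model| and both transition functions are Lipschitz with constant K.

```python
# A minimal sketch: empirically checking delta(n) <= Delta * sum_i K^i
# (Theorem 1) on a toy 1-D stochastic model.
import numpy as np
from scipy.stats import wasserstein_distance

K, c_true, c_model, s0, n_steps = 0.9, 1.0, 1.2, 0.0, 10
Delta = abs(c_true - c_model)          # one-step Wasserstein error
rng = np.random.default_rng(0)

def rollout(c, n, samples=100_000):
    """Samples from the n-step state distribution under s' = K*s +/- c."""
    s = np.full(samples, s0, dtype=float)
    for _ in range(n):
        s = K * s + c * rng.choice([-1.0, 1.0], size=samples)
    return s

for n in range(1, n_steps + 1):
    err = wasserstein_distance(rollout(c_true, n), rollout(c_model, n))
    bound = Delta * sum(K ** i for i in range(n))
    print(f"n={n:2d}  empirical={err:5.2f}  bound={bound:5.2f}")
```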

There exist similar results in the literature relating one-step transition error to multi-step transition error and sub-optimality bounds for planning with an approximate model. The Simulation Lemma (Kearns & Singh, 2002; Strehl et al., 2009) is for discrete-state MDPs and relates error in the one-step model to the value obtained by using it for planning. A related result for continuous state spaces (Kakade et al., 2003) bounds the error in estimating the probability of a trajectory using total variation. A second related result (Venkatraman et al., 2015) provides a slightly looser bound for prediction error in the deterministic case—our result can be thought of as a generalization of their result to the probabilistic case.

6. Value Error with Lipschitz Models

We next investigate the error in the state-value function induced by a Lipschitz model class. To answer this question, we consider an MRP M1 denoted by ⟨S, A, T, R, γ⟩ and a second MRP M2 that only differs from the first in its transition function, ⟨S, A, T̂, R, γ⟩. Let A = {a} be the action set with a single action a. We further assume that the reward function is only dependent upon state. We first express the state-value function for a start state s with respect to the two transition functions. By δ_s below, we mean a Dirac delta denoting a distribution with probability 1 at state s:

    V_T(s) := Σ_{n=0}^∞ γ^n ∫ T_G^n(s' | δ_s) R(s') ds',

    V_T̂(s) := Σ_{n=0}^∞ γ^n ∫ T̂_G^n(s' | δ_s) R(s') ds'.

Next we derive a bound on V_T(s) − V_T̂(s) ∀s.

Theorem 2. Assume a Lipschitz model class F_g with a ∆-accurate T̂ and K̄ = min{K_F, K_T}. Further, assume a Lipschitz reward function with constant K_R = K_{d_S,R}(R). Then ∀s ∈ S and K̄ ∈ [0, 1/γ):

    V_T(s) − V_T̂(s) ≤ γ K_R ∆ / ( (1 − γ)(1 − γK̄) ).

Proof. We first define the function f(s) = R(s) / K_R. It can be observed that K_{d_S,R}(f) = 1. We now write:

    V_T(s) − V_T̂(s) = Σ_{n=0}^∞ γ^n ∫ R(s') ( T_G^n(s' | δ_s) − T̂_G^n(s' | δ_s) ) ds'
                     = K_R Σ_{n=0}^∞ γ^n ∫ f(s') ( T_G^n(s' | δ_s) − T̂_G^n(s' | δ_s) ) ds'.

Let F = { h : K_{d_S,R}(h) ≤ 1 }. Then, given f ∈ F:

    K_R Σ_{n=0}^∞ γ^n ∫ f(s') ( T_G^n(s' | δ_s) − T̂_G^n(s' | δ_s) ) ds'
      ≤ K_R Σ_{n=0}^∞ γ^n sup_{f ∈ F} ∫ f(s') ( T_G^n(s' | δ_s) − T̂_G^n(s' | δ_s) ) ds'
      = K_R Σ_{n=0}^∞ γ^n W( T_G^n(· | δ_s), T̂_G^n(· | δ_s) )    (duality (4))
      ≤ K_R Σ_{n=0}^∞ γ^n ∆ Σ_{i=0}^{n−1} (K̄)^i    (Theorem 1)
      = K_R ∆ Σ_{n=0}^∞ γ^n (1 − K̄^n) / (1 − K̄)
      = γ K_R ∆ / ( (1 − γ)(1 − γK̄) ).

We can derive the same bound for V_T̂(s) − V_T(s) using the fact that the Wasserstein distance is a metric, and therefore symmetric, thereby completing the proof.

Regarding the tightness of our bounds, we can show that when the transition model is deterministic and linear, Theorem 1 provides a tight bound. Moreover, if the reward function is linear, the bound provided by Theorem 2 is tight. (See Claim 2 in the appendix.) Notice also that our proof does not require a bounded reward function.
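Continuing the toy model from the earlier sketch (again our own illustration, not from the paper), the following compares a Monte Carlo estimate of the value gap with the Theorem 2 bound, using the Lipschitz reward R(s) = |s| so that K_R = 1.

```python
# A minimal sketch: Monte Carlo value gap vs. the Theorem 2 bound on the toy
# 1-D model class s' = K*s +/- c, with reward R(s) = |s| (so K_R = 1).
import numpy as np

K, c_true, c_model, gamma, s0 = 0.9, 1.0, 1.2, 0.95, 0.0
Delta, K_bar = abs(c_true - c_model), K        # one-step error and min(K_F, K_T)
rng = np.random.default_rng(0)

def mc_value(c, horizon=300, samples=50_000):
    s = np.full(samples, s0, dtype=float)
    v = np.zeros(samples)
    for n in range(horizon):
        v += gamma ** n * np.abs(s)            # reward R(s) = |s|
        s = K * s + c * rng.choice([-1.0, 1.0], size=samples)
    return v.mean()

value_gap = abs(mc_value(c_true) - mc_value(c_model))
bound = gamma * 1.0 * Delta / ((1 - gamma) * (1 - gamma * K_bar))
print(f"|V_T(s0) - V_T_hat(s0)| ~ {value_gap:.2f}  <=  bound {bound:.2f}")
```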

7. Lipschitz Generalized Value Iteration

We next show that, given a Lipschitz transition model, solving for the fixed point of a class of Bellman equations yields a Lipschitz state-action value function. Our proof is in the context of Generalized Value Iteration (GVI) (Littman & Szepesvári, 1996), which defines Value Iteration (Bellman, 1957) for planning with arbitrary backup operators.

Algorithm 1 GVI algorithm
    Input: initial Q̂(s, a), δ, and choose an operator f
    repeat
        for each s, a ∈ S × A do
            Q̂(s, a) ← R(s, a) + γ ∫ T̂(s' | s, a) f( Q̂(s', ·) ) ds'
        end for
    until convergence
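A minimal sketch of Algorithm 1 for a finite MDP (our own code, not the authors' released implementation), using the mellowmax operator (introduced in Lemma 4 below) as the non-expansion backup f; the tabular model T̂ here is just a random stochastic tensor for illustration.

```python
# A minimal sketch of GVI (Algorithm 1) on a finite MDP with the mellowmax
# backup operator, which is a non-expansion in the max norm.
import numpy as np

def mellowmax(q, beta=5.0):
    """mm_beta(x) = log(mean(exp(beta * x))) / beta, applied over the action axis."""
    return np.log(np.mean(np.exp(beta * q), axis=-1)) / beta

def gvi(R, T_hat, gamma=0.95, f=mellowmax, tol=1e-8):
    """R: (S, A) rewards; T_hat: (S, A, S) model transition probabilities."""
    n_s, n_a = R.shape
    Q = np.zeros((n_s, n_a))
    while True:
        # Q(s, a) <- R(s, a) + gamma * sum_s' T_hat(s'|s, a) f(Q(s', .))
        Q_new = R + gamma * T_hat @ f(Q)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

# Tiny random MDP for illustration.
rng = np.random.default_rng(0)
T_hat = rng.random((4, 2, 4)); T_hat /= T_hat.sum(axis=-1, keepdims=True)
R = rng.random((4, 2))
print(gvi(R, T_hat).round(3))
```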

To prove the result, we make use of the following lemmas.

Lemma 3. Given a Lipschitz function f : S ↦ R with constant K_{d_S,d_R}(f):

    K^A_{d_S,d_R}( ∫ T̂(s' | s, a) f(s') ds' ) ≤ K_{d_S,d_R}(f) K^A_{d_S,W}(T̂).

Lemma 4. The following operators (Asadi & Littman, 2017) are Lipschitz with constants:

1. K_{||·||∞,d_R}(max(x)) = K_{||·||∞,d_R}(mean(x)) = K_{||·||∞,d_R}(ε-greedy(x)) = 1

2. K_{||·||∞,d_R}( mm_β(x) := log( Σ_{i=1}^n e^{βx_i} / n ) / β ) = 1

3. K_{||·||∞,d_R}( boltz_β(x) := Σ_{i=1}^n x_i e^{βx_i} / Σ_{i=1}^n e^{βx_i} ) ≤ √|A| + β V_max |A|

Theorem 3. For any non-expansion backup operator f outlined in Lemma 4, GVI computes a value function with a Lipschitz constant bounded by K^A_{d_S,d_R}(R) / ( 1 − γ K^A_{d_S,W}(T̂) ) if γ K^A_{d_S,W}(T̂) < 1.

Proof. From Algorithm 1, in the nth round of GVI updates:

    Q̂_{n+1}(s, a) ← R(s, a) + γ ∫ T̂(s' | s, a) f( Q̂_n(s', ·) ) ds'.

Now observe that:

    K^A_{d_S,d_R}(Q̂_{n+1}) ≤ K^A_{d_S,d_R}(R) + γ K^A_{d_S,d_R}( ∫ T̂(s' | s, a) f( Q̂_n(s', ·) ) ds' )
                           ≤ K^A_{d_S,d_R}(R) + γ K^A_{d_S,W}(T̂) K_{d_S,d_R}( f( Q̂_n(s, ·) ) )
                           ≤ K^A_{d_S,d_R}(R) + γ K^A_{d_S,W}(T̂) K_{||·||∞,d_R}(f) K^A_{d_S,d_R}(Q̂_n)
                           = K^A_{d_S,d_R}(R) + γ K^A_{d_S,W}(T̂) K^A_{d_S,d_R}(Q̂_n),

where we used Lemmas 3, 2, and 4 for the second, third, and fourth lines, respectively. Equivalently:

    K^A_{d_S,d_R}(Q̂_{n+1}) ≤ K^A_{d_S,d_R}(R) Σ_{i=0}^{n} ( γ K^A_{d_S,W}(T̂) )^i + ( γ K^A_{d_S,W}(T̂) )^n K^A_{d_S,d_R}(Q̂_0).

By computing the limit of both sides, we get:

    lim_{n→∞} K^A_{d_S,d_R}(Q̂_{n+1}) ≤ lim_{n→∞} K^A_{d_S,d_R}(R) Σ_{i=0}^{n} ( γ K^A_{d_S,W}(T̂) )^i + lim_{n→∞} ( γ K^A_{d_S,W}(T̂) )^n K^A_{d_S,d_R}(Q̂_0)
                                     = K^A_{d_S,d_R}(R) / ( 1 − γ K^A_{d_S,W}(T̂) ) + 0.

This concludes the proof.

Two implications of this result: First, PAC exploration in continuous state spaces is shown assuming a Lipschitz value function (Pazis & Parr, 2013). However, the theorem shows that it is sufficient to have a Lipschitz model, an assumption perhaps easier to confirm. The second implication relates to the value-aware model learning (VAML) objective (Farahmand et al., 2017). Using the above theorem, we can show that minimizing Wasserstein is equivalent to minimizing the VAML objective (Asadi et al., 2018).

8. Experiments

Our first goal in this section is to compare TV, KL, and Wasserstein in terms of the ability to best quantify the error of an imperfect model. To this end, we built finite MRPs with random transitions, |S| = 10 states, and γ = 0.95. In the first case the reward signal is randomly sampled from [0, 10], and in the second case the reward of a state is the index of that state, so a small Euclidean distance between two states is an indication of similar values. For 10^5 trials, we generated an MRP and a random model, and then computed model error and planning error (Figure 4). We understand a good metric as one that computes a model error with a high correlation with value error. We show these correlations for different values of γ in Figure 5.

Figure 4. Value error (x axis) and model error (y axis). When the reward is the index of the state (right), the correlation between Wasserstein error and value-prediction error is high. This highlights the fact that when closeness in the state space is an indication of similar values, Wasserstein can be a powerful metric for model-based RL. Note that Wasserstein provides no advantage given random rewards (left).

Figure 5. Correlation between value-prediction error and model error for the three metrics using random rewards (left) and index rewards (right). Given a useful notion of state similarities, low Wasserstein error is a better indication of planning error.

1 We release the code here: github.com/kavosh8/Lip
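A condensed sketch of this comparison (our own code, not the released implementation at github.com/kavosh8/Lip): random finite MRPs with index rewards, a random model per trial, and the correlation of each metric's model error with the value-prediction error. The number of trials is reduced here for speed, and the specific error aggregation is a design choice of the sketch.

```python
# A minimal sketch: correlate model error under W / TV / KL with value error
# on random finite MRPs, using the state index as both reward and position.
import numpy as np
from scipy.stats import wasserstein_distance

n_states, gamma, trials = 10, 0.95, 2000
states = np.arange(n_states)
rng = np.random.default_rng(0)

def value(T, R):
    return np.linalg.solve(np.eye(n_states) - gamma * T, R)

def random_stochastic():
    T = rng.random((n_states, n_states))
    return T / T.sum(axis=1, keepdims=True)

errors = {"W": [], "TV": [], "KL": []}
value_errors = []
R = states.astype(float)                     # reward = index of the state
for _ in range(trials):
    T, T_hat = random_stochastic(), random_stochastic()
    value_errors.append(np.abs(value(T, R) - value(T_hat, R)).mean())
    errors["W"].append(np.mean([wasserstein_distance(states, states, T[s], T_hat[s])
                                for s in states]))
    errors["TV"].append(0.5 * np.abs(T - T_hat).sum(axis=1).mean())
    errors["KL"].append((T * np.log(T / T_hat)).sum(axis=1).mean())

for name, errs in errors.items():
    print(name, round(float(np.corrcoef(errs, value_errors)[0, 1]), 3))
```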

Function f                        Definition                   Lipschitz constant K_{||·||p,||·||p}(f)
                                                               p = 1              p = 2                   p = ∞
ReLu : R^n → R^n                  ReLu(x)_i := max{0, x_i}     1                  1                       1
+b : R^n → R^n, ∀b ∈ R^n          +b(x) := x + b               1                  1                       1
W· : R^n → R^m, ∀W ∈ R^{m×n}      W(x) := Wx                   Σ_j ||W_j||_∞      √( Σ_j ||W_j||_2^2 )    sup_j ||W_j||_1

Table 1. Lipschitz constants for various functions used in a neural network. Here, W_j denotes the jth row of a weight matrix W.
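The bounds in Table 1 are easy to compute directly. The sketch below (our own, not from the paper) evaluates them for the weight matrices of a small network and multiplies them across layers, which is valid because ReLU and bias addition contribute a factor of 1 and composition multiplies constants (Lemma 2). The layer sizes are arbitrary.

```python
# A minimal sketch: per-layer Lipschitz bounds from Table 1 and the
# network-level bound obtained by composition (Lemma 2).
import numpy as np

def linear_lipschitz_bounds(W):
    """Upper bounds on K_{||.||p,||.||p}(x -> Wx) for p = 1, 2, inf (W_j = jth row)."""
    return {
        1: np.abs(W).max(axis=1).sum(),        # sum_j ||W_j||_inf
        2: np.sqrt((W ** 2).sum()),            # sqrt(sum_j ||W_j||_2^2), the Frobenius norm
        "inf": np.abs(W).sum(axis=1).max(),    # sup_j ||W_j||_1
    }

rng = np.random.default_rng(0)
layers = [rng.normal(size=(16, 4)), rng.normal(size=(1, 16))]   # a 2-layer net

# ReLU and bias addition contribute a factor of 1, so the bound on the whole
# network is the product of the per-layer linear bounds.
for p in (1, 2, "inf"):
    bound = np.prod([linear_lipschitz_bounds(W)[p] for W in layers])
    print(p, round(float(bound), 2))
```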

It is known that controlling the Lipschitz constant of neural nets can help in terms of improving generalization error due to a lower bound on Rademacher complexity (Neyshabur et al., 2015; Bartlett & Mendelson, 2002). It then follows from Theorems 1 and 2 that controlling the Lipschitz constant of a learned transition model can achieve better error bounds for multi-step and value predictions. To enforce this constraint during learning, we bound the Lipschitz constant of the various operations used in building the neural network. The bound on the constant of the entire neural network then follows from Lemma 2. In Table 1, we provide the Lipschitz constants of the operations used in our experiments (see Appendix for proofs). We quantify these results for different p-norms ||·||_p.

Given these simple methods for enforcing Lipschitz continuity, we performed empirical evaluations to understand the impact of Lipschitz continuity of transition models, specifically when the transition model is used to perform multi-step state predictions and policy improvements. We chose two standard domains: Cart Pole and Pendulum. In Cart Pole, we trained a network on a dataset of 15 × 10^3 tuples ⟨s, a, s'⟩. During training, we ensured that the weights of the network are smaller than k. For each k, we performed 20 independent model estimations and chose the model based on cross-validation error.

Using the learned model, along with the actual reward signal of the environment, we then performed stochastic actor-critic RL (Barto et al., 1983; Sutton et al., 2000). This required an interaction between the policy and the learned model for relatively long trajectories. To measure the usefulness of the model, we then tested the learned policy on the actual domain. We repeated this experiment on Pendulum. To train the neural transition model for this domain we used 10^4 samples. Notably, we used deterministic policy gradient (Silver et al., 2014) for training the policy network, with the hyperparameters suggested by Lillicrap et al. (2015). We report these results in Figure 6.

Figure 6. Impact of the Lipschitz constant of learned models in Cart Pole (left) and Pendulum (right); the y-axis shows average return per episode. An intermediate value of k (Lipschitz constant) yields the best performance.

Observe that an intermediate Lipschitz constant yields the best result. Consistent with the theory, controlling the Lipschitz constant in practice can combat the compounding errors and can help in the value-estimation problem. This ultimately results in learning a better policy.

We next examined if the benefits carry over to stochastic settings. To capture stochasticity we need an algorithm to learn a Lipschitz model class (Definition 4). We used an EM algorithm to jointly learn a set of functions f, parameterized by θ = {θ^f : f ∈ F_g}, and a distribution over functions g. Note that in practice our dataset only consists of a set of samples ⟨s, a, s'⟩ and does not include the function each sample is drawn from. Hence, we consider this as our latent variable z. As is standard with EM, we start with the log-likelihood objective (for simplicity of presentation we assume a single action in the derivation):

    L(θ) = Σ_{i=1}^N log p(s_i, s'_i; θ)
         = Σ_{i=1}^N log Σ_f p(z_i = f, s_i, s'_i; θ)
         = Σ_{i=1}^N log Σ_f q(z_i = f | s_i, s'_i) [ p(z_i = f, s_i, s'_i; θ) / q(z_i = f | s_i, s'_i) ]
         ≥ Σ_{i=1}^N Σ_f q(z_i = f | s_i, s'_i) log [ p(z_i = f, s_i, s'_i; θ) / q(z_i = f | s_i, s'_i) ],

where we used Jensen's inequality and the concavity of log in the last line. This derivation leads to the following EM algorithm.

In the M step, find θ_t by solving for:

    argmax_θ Σ_{i=1}^N Σ_f q_{t−1}(z_i = f | s_i, s'_i) log [ p(z_i = f, s_i, s'_i; θ) / q_{t−1}(z_i = f | s_i, s'_i) ].

In the E step, compute the posteriors:

    q_t(z_i = f | s_i, s'_i) = p(s_i, s'_i | z_i = f; θ_t^f) g(z_i = f; θ_t) / Σ_f p(s_i, s'_i | z_i = f; θ_t^f) g(z_i = f; θ_t).

Note that we assume each point is drawn from a neural network f with probability

    p(s_i, s'_i | z_i = f; θ_t^f) = N( s'_i − f(s_i; θ_t^f), σ² ),

with a fixed variance σ² tuned as a hyper-parameter.
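A self-contained sketch of this EM procedure (our own, not the authors' implementation): linear component functions f(s; θ_f) = a_f s + b_f stand in for the paper's neural networks so that the M step has a closed form, and σ² is fixed, as above. The data-generating maps are invented for the example.

```python
# A minimal sketch of the EM algorithm above, with linear component functions
# in place of neural networks and a fixed residual variance sigma^2.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: next states generated by one of two unknown deterministic maps.
s = rng.uniform(-2, 2, size=500)
true_maps = [lambda x: 1.5 * x + 1.0, lambda x: -0.5 * x - 1.0]
z = rng.integers(0, 2, size=s.size)
s_next = np.where(z == 0, true_maps[0](s), true_maps[1](s)) + 0.05 * rng.normal(size=s.size)

n_f, sigma2 = 2, 0.1
theta = rng.normal(size=(n_f, 2))            # [a_f, b_f] per component function
g = np.full(n_f, 1.0 / n_f)                  # distribution over components

X = np.stack([s, np.ones_like(s)], axis=1)   # design matrix for the M step
for _ in range(50):
    # E step: q(z_i = f | s_i, s'_i) proportional to N(s'_i - f(s_i); 0, sigma^2) * g(f).
    resid = s_next[:, None] - X @ theta.T                        # shape (N, n_f)
    log_post = -0.5 * resid ** 2 / sigma2 + np.log(g)
    q = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M step: weighted least squares per component, then update g.
    for f in range(n_f):
        w = q[:, f]
        theta[f] = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * s_next))
    g = q.mean(axis=0)

# With a reasonable initialization this roughly recovers the two true maps.
print(theta.round(2), g.round(2))
```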

We used a supervised-learning domain to evaluate the EM algorithm. We generated 30 points from 5 functions (listed at the end of the Appendix) and trained 5 neural networks to fit these points. Iterations of a single run are shown in Figure 7, and the summary of results is presented in Figure 8. Observe that the EM algorithm is effective, and that controlling the Lipschitz constant is again useful.

Figure 7. A stochastic problem solved by training a Lipschitz model class using EM. The top left figure shows the functions before any training (iteration 0), and the bottom right figure shows the final results (iteration 50).

Figure 8. Impact of controlling the Lipschitz constant in the supervised-learning domain. Notice the U-shape of the final Wasserstein loss with respect to the Lipschitz constant k.

We next applied EM to train a transition model for an RL setting, namely the gridworld domain from Moerland et al. (2017). Here a useful model needs to capture the stochastic behavior of the two ghosts. We modify the reward to be −1 whenever the agent is in the same cell as either one of the ghosts and 0 otherwise. We performed environmental interactions for 1000 time-steps and measured the return. We compared against standard tabular methods (Sutton & Barto, 1998) and a deterministic model that predicts the expected next state (Sutton et al., 2008; Parr et al., 2008). In all cases we used value iteration for planning.

Results in Figure 9 show that tabular models fail due to no generalization, and expected models fail since the ghosts do not move on expectation, a prediction not useful for the planner. Performing value iteration with a Lipschitz model class outperforms the baselines.

Figure 9. Performance of a Lipschitz model class on the gridworld domain. We show model test accuracy (left) and quality of the policy found using the model (right). Notice the poor performance of the tabular and expected models.

9. Conclusion

We took an important step towards understanding model-based RL with function approximation. We showed that Lipschitz continuity of an estimated model plays a central role in multi-step prediction error and in value-estimation error. We also showed the benefits of employing Wasserstein for model-based RL. An important direction for future work is to apply these ideas to larger problems.

10. Acknowledgements

The authors recognize the assistance of Eli Upfal, John Langford, George Konidaris, and members of Brown's RLab, specifically Cameron Allen, David Abel, and Evan Cater. The authors also thank anonymous ICML reviewer 1 for insights on a value-aware interpretation of Wasserstein.

References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223, 2017.

Asadi, K. and Littman, M. L. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 243–252, 2017.

Asadi, K., Cater, E., Misra, D., and Littman, M. L. Equivalence between Wasserstein and value-aware model-based reinforcement learning. arXiv preprint arXiv:1806.01265, 2018.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, pp. 834–846, 1983.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.

Bellman, R. A Markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684, 1957.

Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pp. 908–919, 2017.

Bertsekas, D. Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control, 20(3):415–419, 1975.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pp. 560–564. IEEE, 1995.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695, 2011.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.

Farahmand, A.-M., Barreto, A., and Nikovski, D. Value-aware loss function for model-based reinforcement learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 1486–1494, 2017.

Ferns, N., Panangaden, P., and Precup, D. Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 162–169. AUAI Press, 2004.

Hinderer, K. Lipschitz continuity of value functions in Markovian decision processes. Mathematical Methods of Operations Research, 62(1):3–22, 2005.

Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of AAMAS, pp. 1181–1189, 2015.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

Kakade, S., Kearns, M. J., and Langford, J. Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 306–312, 2003.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 681–690. ACM, 2008.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Littman, M. L. and Szepesvári, C. A generalized reinforcement-learning model: Convergence and applications. In Proceedings of the 13th International Conference on Machine Learning, pp. 310–318, 1996.

Lorenz, E. Predictability: does the flap of a butterfly's wing in Brazil set off a tornado in Texas? 1972.

Moerland, T. M., Broekens, J., and Jonker, C. M. Learning multimodal transition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470, 2017.

Müller, A. Optimal selection from distributions with unknown parameters: Robustness of Bayesian models. Mathematical Methods of Operations Research, 44(3):371–386, 1996.

Narayanan, H. and Mitter, S. Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems, pp. 1786–1794, 2010.

Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In Proceedings of The 28th Conference on Learning Theory, pp. 1376–1401, 2015.

Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 752–759. ACM, 2008.

Pazis, J. and Parr, R. PAC optimal exploration in continuous space Markov decision processes. In AAAI, 2013.

Pires, B. Á. and Szepesvári, C. Policy error bounds for model-based reinforcement learning with factored linear models. In Conference on Learning Theory, pp. 121–151, 2016.

Pirotta, M., Restelli, M., and Bascetta, L. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283, 2015.

Rachelson, E. and Lagoudakis, M. G. On the locality of action domination in sequential decision making. In International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2010.

Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach, 1995.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.

Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Sutton, R. S., Szepesvári, C., Geramifard, A., and Bowling, M. H. Dyna-style planning with linear function approximation and prioritized sweeping. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pp. 528–536, 2008.

Szepesvári, C. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.

Talvitie, E. Model regularization for stable sample rollouts. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 780–789. AUAI Press, 2014.

Talvitie, E. Self-correcting models for model-based reinforcement learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288, 1996.

Vaserstein, L. N. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72, 1969.

Venkatraman, A., Hebert, M., and Bagnell, J. A. Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.