
Lipschitz Continuity in Model-based Reinforcement Learning

Kavosh Asadi *1   Dipendra Misra *2   Michael L. Littman 1

* Equal contribution. 1 Department of Computer Science, Brown University, Providence, USA. 2 Department of Computer Science and Cornell Tech, Cornell University, New York, USA. Correspondence to: Kavosh Asadi.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We examine the impact of learning Lipschitz continuous models in the context of model-based reinforcement learning. We provide a novel bound on multi-step prediction error of Lipschitz models where we quantify the error using the Wasserstein metric. We go on to prove an error bound for the value-function estimate arising from Lipschitz models and show that the estimated value function is itself Lipschitz. We conclude with empirical results that show the benefits of controlling the Lipschitz constant of neural-network models.

1. Introduction

The model-based approach to reinforcement learning (RL) focuses on predicting the dynamics of the environment to plan and make high-quality decisions (Kaelbling et al., 1996; Sutton & Barto, 1998). Although the behavior of model-based algorithms in tabular environments is well understood and can be effective (Sutton & Barto, 1998), scaling up to the approximate setting can cause instabilities. Even small model errors can be magnified by the planning process, resulting in poor performance (Talvitie, 2014).

In this paper, we study model-based RL through the lens of Lipschitz continuity, intuitively related to the smoothness of a function. We show that the ability of a model to make accurate multi-step predictions is related not only to the model's one-step accuracy, but also to the magnitude of the Lipschitz constant (smoothness) of the model. We further show that the dependence on the Lipschitz constant carries over to the value-prediction problem, ultimately influencing the quality of the policy found by planning.

We consider a setting with continuous state spaces and stochastic transitions where we quantify the distance between distributions using the Wasserstein metric. We introduce a novel characterization of models, referred to as a Lipschitz model class, that represents stochastic dynamics using a set of component deterministic functions. This allows us to study any stochastic dynamics through the Lipschitz continuity of its component deterministic functions. To learn a Lipschitz model class in continuous state spaces, we provide an Expectation-Maximization algorithm (Dempster et al., 1977).

One promising direction for mitigating the effects of inaccurate models is to limit the complexity of the learned models or to reduce the planning horizon (Jiang et al., 2015). Doing so can sometimes make models more useful, much as regularization in supervised learning can improve generalization performance (Tibshirani, 1996). In this work, we also examine a type of regularization that comes from controlling the Lipschitz constant of models. This regularization technique can be applied efficiently, as we will show, when we represent the transition model by neural networks.

2. Background

We consider the Markov decision process (MDP) setting in which the RL problem is formulated by the tuple $\langle \mathcal{S}, \mathcal{A}, R, T, \gamma \rangle$. Here, by $\mathcal{S}$ we mean a continuous state space and by $\mathcal{A}$ we mean a discrete action set. The functions $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and $T: \mathcal{S} \times \mathcal{A} \to \Pr(\mathcal{S})$ denote the reward and transition dynamics. Finally, $\gamma \in [0, 1)$ is the discount rate. If $|\mathcal{A}| = 1$, the setting is called a Markov reward process (MRP).

2.1. Lipschitz Continuity

Our analyses leverage the "smoothness" of various functions, quantified as follows.
Definition 1. Given two metric spaces $(M_1, d_1)$ and $(M_2, d_2)$, each consisting of a space and a distance metric, a function $f: M_1 \mapsto M_2$ is Lipschitz continuous (sometimes simply Lipschitz) if the Lipschitz constant, defined as

$$K_{d_1,d_2}(f) := \sup_{s_1 \in M_1,\, s_2 \in M_1} \frac{d_2\big(f(s_1), f(s_2)\big)}{d_1(s_1, s_2)}, \qquad (1)$$

is finite.
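As a quick illustration of Definition 1 (not from the paper), the ratio in Equation (1) can be estimated from sampled pairs of points. The function, metrics, and sampling range below are assumptions of the example, and the estimate only lower-bounds the true supremum.

```python
# A minimal sketch: lower-bound K_{d1,d2}(f) from Equation (1) by taking the
# supremum of the ratio over randomly sampled pairs of points.
import numpy as np

def empirical_lipschitz_constant(f, sample_states, n_pairs=100_000, seed=0):
    """Lower-bounds sup_{s1,s2} d2(f(s1), f(s2)) / d1(s1, s2)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(sample_states), size=n_pairs)
    j = rng.integers(0, len(sample_states), size=n_pairs)
    keep = i != j                                  # avoid d1(s, s) = 0
    s1, s2 = sample_states[i[keep]], sample_states[j[keep]]
    d1 = np.linalg.norm(s1 - s2, axis=-1)          # assumed metric on inputs
    d2 = np.linalg.norm(f(s1) - f(s2), axis=-1)    # assumed metric on outputs
    return float(np.max(d2 / d1))

states = np.random.uniform(-2, 2, size=(5000, 1))
print(empirical_lipschitz_constant(np.tanh, states))   # close to 1 for tanh
```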

Figure 1. An illustration of Lipschitz continuity. Pictorially, Lipschitz continuity ensures that $f$ lies in between the two affine functions (colored in blue) with slopes $K$ and $-K$.

Equivalently, for a Lipschitz $f$,

$$\forall s_1, \forall s_2 \quad d_2\big(f(s_1), f(s_2)\big) \le K_{d_1,d_2}(f)\, d_1(s_1, s_2).$$

The concept of Lipschitz continuity is visualized in Figure 1. A Lipschitz function $f$ is called a non-expansion when $K_{d_1,d_2}(f) = 1$ and a contraction when $K_{d_1,d_2}(f) < 1$.

Lipschitz continuity, in one form or another, has been a key tool in the theory of reinforcement learning (Bertsekas, 1975; Bertsekas & Tsitsiklis, 1995; Littman & Szepesvári, 1996; Müller, 1996; Ferns et al., 2004; Hinderer, 2005; Rachelson & Lagoudakis, 2010; Szepesvári, 2010; Pazis & Parr, 2013; Pirotta et al., 2015; Pires & Szepesvári, 2016; Berkenkamp et al., 2017; Bellemare et al., 2017) and bandits (Kleinberg et al., 2008; Bubeck et al., 2011). Below, we also define Lipschitz continuity over a subset of inputs.

Definition 2. A function $f: M_1 \times \mathcal{A} \mapsto M_2$ is uniformly Lipschitz continuous in $\mathcal{A}$ if

$$K^{\mathcal{A}}_{d_1,d_2}(f) := \sup_{a \in \mathcal{A}} \sup_{s_1, s_2} \frac{d_2\big(f(s_1, a), f(s_2, a)\big)}{d_1(s_1, s_2)}, \qquad (2)$$

is finite. Note that the metric $d_1$ is defined only on $M_1$.

2.2. Wasserstein Metric

We quantify the distance between two distributions using the following metric:

Definition 3. Given a metric space $(M, d)$ and the set $P(M)$ of all probability measures on $M$, the Wasserstein metric (or the 1st Kantorovich metric) between two probability distributions $\mu_1$ and $\mu_2$ in $P(M)$ is defined as

$$W(\mu_1, \mu_2) := \inf_{j \in \Lambda} \iint j(s_1, s_2)\, d(s_1, s_2)\, ds_2\, ds_1, \qquad (3)$$

where $\Lambda$ denotes the collection of all joint distributions $j$ on $M \times M$ with marginals $\mu_1$ and $\mu_2$ (Vaserstein, 1969).

Sometimes referred to as "Earth Mover's distance", Wasserstein is the minimum expected distance between pairs of points where the joint distribution $j$ is constrained to match the marginals $\mu_1$ and $\mu_2$. New applications of this metric are discovered in machine learning, namely in the context of generative adversarial networks (Arjovsky et al., 2017) and value distributions in reinforcement learning (Bellemare et al., 2017).

Wasserstein is linked to Lipschitz continuity using duality:

$$W(\mu_1, \mu_2) = \sup_{f: K_{d,d_{\mathbb{R}}}(f) \le 1} \int \big(f(s)\mu_1(s) - f(s)\mu_2(s)\big)\, ds. \qquad (4)$$

This equivalence, known as Kantorovich-Rubinstein duality (Villani, 2008), lets us compute Wasserstein by maximizing over a Lipschitz set of functions $f: \mathcal{S} \mapsto \mathbb{R}$, a relatively easier problem to solve. In our theory, we utilize both definitions, namely the primal definition (3) and the dual definition (4).

3. Lipschitz Model Class

We introduce a novel representation of stochastic MDP transitions in terms of a distribution over a set of deterministic components.

Definition 4. Given a metric state space $(\mathcal{S}, d_{\mathcal{S}})$ and an action space $\mathcal{A}$, we define $F_g$ as a collection of functions, $F_g = \{f: \mathcal{S} \mapsto \mathcal{S}\}$, distributed according to $g(f \mid a)$ where $a \in \mathcal{A}$. We say that $F_g$ is a Lipschitz model class if

$$K_F := \sup_{f \in F_g} K_{d_{\mathcal{S}},d_{\mathcal{S}}}(f)$$

is finite.

Our definition captures a subset of stochastic transitions, namely ones that can be represented as a state-independent distribution over deterministic transitions. An example is provided in Figure 2. We further prove in the appendix (see Claim 1) that any finite MDP transition probabilities can be decomposed into a state-independent distribution $g$ over a finite set of deterministic functions $f$.

Associated with a Lipschitz model class is a transition function given by:

$$\widehat{T}(s' \mid s, a) = \sum_f \mathbf{1}\big(f(s) = s'\big)\, g(f \mid a).$$

Given a state distribution $\mu(s)$, we also define a generalized notion of transition function $\widehat{T}_G(\cdot \mid \mu, a)$ given by:

$$\widehat{T}_G(s' \mid \mu, a) = \int_s \underbrace{\sum_f \mathbf{1}\big(f(s) = s'\big)\, g(f \mid a)}_{\widehat{T}(s' \mid s, a)}\, \mu(s)\, ds.$$
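A minimal sketch (an illustration under assumed dynamics, not the paper's code) of a Lipschitz model class and its induced transition functions: a state-independent distribution $g(f \mid a)$ over deterministic component functions, sampled to simulate $\widehat{T}$ and $\widehat{T}_G$.

```python
# A Lipschitz model class for a 1-d state space with a single action:
# two deterministic component functions and a distribution g over them.
import numpy as np

rng = np.random.default_rng(0)

functions = [lambda s: s + 1.0, lambda s: s - 1.0]   # f_1, f_2 (both 1-Lipschitz)
g = np.array([0.5, 0.5])                             # g(f_1 | a), g(f_2 | a)

def sample_next_state(s):
    """Draw s' ~ T_hat(. | s, a): pick f ~ g(. | a), then apply it to s."""
    f = functions[rng.choice(len(functions), p=g)]
    return f(s)

def sample_generalized(mu_samples):
    """Push samples of a state distribution mu through T_hat_G(. | mu, a)."""
    return np.array([sample_next_state(s) for s in mu_samples])

mu = rng.uniform(-1.0, 1.0, size=5)     # samples from some start distribution
print(sample_generalized(mu))
```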

Figure 2. An example of a Lipschitz model class in a gridworld environment (Russell & Norvig, 1995). The dynamics are such that any action choice results in an attempted transition in the corresponding direction with probability 0.8 and in the neighboring directions with probabilities 0.1 and 0.1. We can define $F_g = \{f^{up}, f^{right}, f^{down}, f^{left}\}$ where each $f$ outputs a deterministic next position in the grid (factoring in obstacles). For $a = up$, we have: $g(f^{up} \mid a = up) = 0.8$, $g(f^{right} \mid a = up) = g(f^{left} \mid a = up) = 0.1$, and $g(f^{down} \mid a = up) = 0$. Defining distances between states as their Manhattan distance in the grid, then $\forall f\ \sup_{s_1,s_2} d\big(f(s_1), f(s_2)\big)/d(s_1, s_2) = 2$, and so $K_F = 2$. So, the four functions and $g$ comprise a Lipschitz model class.

We are primarily interested in $K^{\mathcal{A}}_{d,d}(\widehat{T}_G)$, the Lipschitz constant of $\widehat{T}_G$. However, since $\widehat{T}_G$ takes as input a probability distribution and also outputs a probability distribution, we require a notion of distance between two distributions. This notion is quantified using Wasserstein and is justified in the next section.

4. On the Choice of Probability Metric

We consider the stochastic model-based setting and show through an example that the Wasserstein metric is a reasonable choice compared to other common options.

Consider a uniform distribution over states $\mu(s)$ as shown in black in Figure 3 (top). Take a transition function $T_G$ in the environment that, given an action $a$, uniformly randomly adds or subtracts a scalar $c_1$. The distribution of states after one transition is shown in red in Figure 3 (middle). Now, consider a transition model $\widehat{T}_G$ that approximates $T_G$ by uniformly randomly adding or subtracting a second scalar $c_2$. The distribution over states after one transition using this imperfect model is shown in blue in Figure 3 (bottom).

Figure 3. A state distribution $\mu(s)$ (top), a stochastic environment that randomly adds or subtracts $c_1$ (middle), and an approximate transition model that randomly adds or subtracts a second scalar $c_2$ (bottom).

We desire a metric that captures the similarity between the output of the two transition functions. We first consider Kullback-Leibler (KL) divergence and observe that:

$$KL\big(T_G(\cdot \mid \mu, a), \widehat{T}_G(\cdot \mid \mu, a)\big) := \int T_G(s' \mid \mu, a)\, \log \frac{T_G(s' \mid \mu, a)}{\widehat{T}_G(s' \mid \mu, a)}\, ds' = \infty,$$

unless the two constants are exactly the same.

The next possible choice is Total Variation (TV) defined as:

$$TV\big(T_G(\cdot \mid \mu, a), \widehat{T}_G(\cdot \mid \mu, a)\big) := \frac{1}{2} \int \big| T_G(s' \mid \mu, a) - \widehat{T}_G(s' \mid \mu, a) \big|\, ds' = 1,$$

if the two distributions have disjoint supports, regardless of how far the supports are from each other.

In contrast, Wasserstein is sensitive to how far the constants are, as:

$$W\big(T_G(\cdot \mid \mu, a), \widehat{T}_G(\cdot \mid \mu, a)\big) = |c_1 - c_2|.$$

It is clear that, of the three, Wasserstein corresponds best to the intuitive sense of how closely $\widehat{T}_G$ approximates $T_G$. This is particularly important in high-dimensional spaces where the true distribution is known to usually lie in low-dimensional manifolds (Narayanan & Mitter, 2010).
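This example can be checked numerically. The snippet below (an illustration with arbitrary discretization and sample-size choices, not from the paper) approximates the three quantities for the add-or-subtract dynamics above.

```python
# Compare KL, TV, and Wasserstein for the Section 4 example: the environment
# adds or subtracts c1, the model adds or subtracts c2.
import numpy as np
from scipy.stats import wasserstein_distance

c1, c2 = 2.0, 2.5
rng = np.random.default_rng(0)
mu = rng.uniform(-0.1, 0.1, size=100_000)              # start-state distribution
true_next = mu + rng.choice([-c1, c1], size=mu.size)    # T_G(. | mu, a)
model_next = mu + rng.choice([-c2, c2], size=mu.size)   # T_hat_G(. | mu, a)

bins = np.linspace(-4, 4, 401)
p, _ = np.histogram(true_next, bins=bins)
q, _ = np.histogram(model_next, bins=bins)
p, q = p / p.sum(), q / q.sum()

tv = 0.5 * np.abs(p - q).sum()                    # ~1: the supports are disjoint
kl = np.sum(np.where(p > 0, p * np.log(p / np.where(q > 0, q, 1e-300)), 0.0))
w = wasserstein_distance(true_next, model_next)   # ~|c1 - c2| = 0.5

print(f"TV = {tv:.3f}, KL = {kl:.1f} (finite only due to the floor), W = {w:.3f}")
```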

5. Understanding the Compounding Error Phenomenon

To extract a prediction with a horizon $n > 1$, model-based algorithms typically apply the model for $n$ steps by taking the state input in step $t$ to be the state output from the step $t-1$. Previous work has shown that model error can result in poor long-horizon predictions and ineffective planning (Talvitie, 2014; 2017). Observed even beyond reinforcement learning (Lorenz, 1972; Venkatraman et al., 2015), this is referred to as the compounding error phenomenon. The goal of this section is to provide a bound on multi-step prediction error of a model. We formalize the notion of model accuracy below:

Definition 5. Given an MDP with a transition function $T$, we identify a Lipschitz model $F_g$ as $\Delta$-accurate if its induced $\widehat{T}$ satisfies:

$$\forall s\ \forall a \quad W\big(\widehat{T}(\cdot \mid s, a), T(\cdot \mid s, a)\big) \le \Delta.$$

We want to express the multi-step Wasserstein error in terms of the single-step Wasserstein error and the Lipschitz constant of the transition function $\widehat{T}_G$. We provide a bound on the Lipschitz constant of $\widehat{T}_G$ using the following lemma:

Lemma 1. A generalized transition function $\widehat{T}_G$ induced by a Lipschitz model class $F_g$ is Lipschitz with a constant:

$$K^{\mathcal{A}}_{W,W}(\widehat{T}_G) := \sup_a \sup_{\mu_1,\mu_2} \frac{W\big(\widehat{T}_G(\cdot \mid \mu_1, a), \widehat{T}_G(\cdot \mid \mu_2, a)\big)}{W(\mu_1, \mu_2)} \le K_F.$$

Intuitively, Lemma 1 states that, if the two input distributions are similar, then for any action the output distributions given by $\widehat{T}_G$ are also similar up to a $K_F$ factor. We prove this lemma, as well as the subsequent lemmas, in the appendix.

Given the one-step error (Definition 5), a start state distribution $\mu$ and a fixed sequence of actions $a_0, \ldots, a_{n-1}$, we desire a bound on the $n$-step error:

$$\delta(n) := W\big(\widehat{T}^n_G(\cdot \mid \mu), T^n_G(\cdot \mid \mu)\big),$$

where $\widehat{T}^n_G(\cdot \mid \mu) := \widehat{T}_G\big(\cdot \mid \widehat{T}_G(\cdot \mid \ldots \widehat{T}_G(\cdot \mid \mu, a_0) \ldots, a_{n-2}), a_{n-1}\big)$ denotes $n$ recursive calls, and $T^n_G(\cdot \mid \mu)$ is defined similarly. We provide a useful lemma followed by the theorem.

Lemma 2. (Composition Lemma) Define three metric spaces $(M_1, d_1)$, $(M_2, d_2)$, and $(M_3, d_3)$. Define Lipschitz functions $f: M_2 \mapsto M_3$ and $g: M_1 \mapsto M_2$ with constants $K_{d_2,d_3}(f)$ and $K_{d_1,d_2}(g)$. Then, $h := f \circ g: M_1 \mapsto M_3$ is Lipschitz with constant $K_{d_1,d_3}(h) \le K_{d_2,d_3}(f)\, K_{d_1,d_2}(g)$.

Similar to composition, we can show that summation preserves Lipschitz continuity with a constant bounded by the sum of the Lipschitz constants of the two functions. We omitted this result due to brevity.

Theorem 1. Define a $\Delta$-accurate $\widehat{T}_G$ with the Lipschitz constant $K_F$ and an MDP with a Lipschitz transition function $T_G$ with constant $K_T$. Let $\bar{K} = \min\{K_F, K_T\}$. Then $\forall n \ge 1$:

$$\delta(n) := W\big(\widehat{T}^n_G(\cdot \mid \mu), T^n_G(\cdot \mid \mu)\big) \le \Delta \sum_{i=0}^{n-1} (\bar{K})^i.$$

Proof. We construct a proof by induction. Using Kantorovich-Rubinstein duality (the Lipschitz property of $f$ is not shown for brevity), we first prove the base of induction:

$$\delta(1) := W\big(\widehat{T}_G(\cdot \mid \mu, a_0), T_G(\cdot \mid \mu, a_0)\big) := \sup_f \iint \big(\widehat{T}(s' \mid s, a_0) - T(s' \mid s, a_0)\big) f(s')\, \mu(s)\, ds\, ds'$$
$$\le \int \underbrace{\sup_f \int \big(\widehat{T}(s' \mid s, a_0) - T(s' \mid s, a_0)\big) f(s')\, ds'}_{= W(\widehat{T}(\cdot \mid s, a_0),\, T(\cdot \mid s, a_0)) \text{ due to duality (4)}}\, \mu(s)\, ds
= \int \underbrace{W\big(\widehat{T}(\cdot \mid s, a_0), T(\cdot \mid s, a_0)\big)}_{\le \Delta \text{ due to Definition 5}}\, \mu(s)\, ds \le \int \Delta\, \mu(s)\, ds = \Delta.$$

We now prove the inductive step. Assuming $\delta(n-1) := W\big(\widehat{T}^{n-1}_G(\cdot \mid \mu), T^{n-1}_G(\cdot \mid \mu)\big) \le \Delta \sum_{i=0}^{n-2} (K_F)^i$, we can write:

$$\delta(n) := W\big(\widehat{T}^n_G(\cdot \mid \mu), T^n_G(\cdot \mid \mu)\big)$$
$$\le W\big(\widehat{T}^n_G(\cdot \mid \mu),\, \widehat{T}_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1})\big) + W\big(\widehat{T}_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1}),\, T^n_G(\cdot \mid \mu)\big) \quad \text{(triangle inequality)}$$
$$= W\big(\widehat{T}_G(\cdot \mid \widehat{T}^{n-1}_G(\cdot \mid \mu), a_{n-1}),\, \widehat{T}_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1})\big) + W\big(\widehat{T}_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1}),\, T_G(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1})\big).$$

We now use Lemma 1 and Definition 5 to upper bound the first and the second term of the last line, respectively:

$$\delta(n) \le K_F\, W\big(\widehat{T}^{n-1}_G(\cdot \mid \mu), T^{n-1}_G(\cdot \mid \mu)\big) + \Delta = K_F\, \delta(n-1) + \Delta \le \Delta \sum_{i=0}^{n-1} (K_F)^i. \qquad (5)$$

Note that in the triangle inequality above, we may replace $\widehat{T}_G\big(\cdot \mid T^{n-1}_G(\cdot \mid \mu), a_{n-1}\big)$ with $T_G\big(\cdot \mid \widehat{T}^{n-1}_G(\cdot \mid \mu), a_{n-1}\big)$ and follow the same basic steps to get:

$$W\big(\widehat{T}^n_G(\cdot \mid \mu), T^n_G(\cdot \mid \mu)\big) \le \Delta \sum_{i=0}^{n-1} (K_T)^i. \qquad (6)$$

Combining (5) and (6) allows us to write:

$$\delta(n) = W\big(\widehat{T}^n_G(\cdot \mid \mu), T^n_G(\cdot \mid \mu)\big) \le \min\Big\{ \Delta \sum_{i=0}^{n-1} (K_T)^i,\ \Delta \sum_{i=0}^{n-1} (K_F)^i \Big\} = \Delta \sum_{i=0}^{n-1} (\bar{K})^i,$$

which concludes the proof.
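A small numerical check of the Theorem 1 bound (an illustration under assumed linear deterministic dynamics, not the paper's code); in this special case the bound is met with equality, consistent with Claim 2 in the appendix.

```python
# With deterministic linear dynamics T(s) = K*s and a model T_hat(s) = K*s + Delta,
# the n-step Wasserstein error equals Delta * sum_{i<n} K^i (here K_bar = K).
import numpy as np

K, Delta, s0 = 0.9, 0.05, 1.0

def rollout(step, s, n):
    for _ in range(n):
        s = step(s)
    return s

for n in (1, 2, 5, 10):
    true_sn = rollout(lambda s: K * s, s0, n)
    model_sn = rollout(lambda s: K * s + Delta, s0, n)
    empirical = abs(true_sn - model_sn)            # W between two point masses
    bound = Delta * sum(K ** i for i in range(n))  # Theorem 1 with K_bar = K
    print(f"n={n:2d}  error={empirical:.4f}  bound={bound:.4f}")
```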

There exist similar results in the literature relating one-step transition error to multi-step transition error and sub-optimality bounds for planning with an approximate model. The Simulation Lemma (Kearns & Singh, 2002; Strehl et al., 2009) is for discrete state MDPs and relates error in the one-step model to the value obtained by using it for planning. A related result for continuous state-spaces (Kakade et al., 2003) bounds the error in estimating the probability of a trajectory using total variation. A second related result (Venkatraman et al., 2015) provides a slightly looser bound for prediction error in the deterministic case; our result can be thought of as a generalization of their result to the probabilistic case.

6. Value Error with Lipschitz Models

We next investigate the error in the state-value function induced by a Lipschitz model class. To answer this question, we consider an MRP $M_1$ denoted by $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$ and a second MRP $M_2$ that only differs from the first in its transition function, $\langle \mathcal{S}, \mathcal{A}, \widehat{T}, R, \gamma \rangle$. Let $\mathcal{A} = \{a\}$ be the action set with a single action $a$. We further assume that the reward function is only dependent upon state. We first express the state-value function for a start state $s$ with respect to the two transition functions. By $\delta_s$ below, we mean a Dirac delta function denoting a distribution with probability 1 at state $s$:

$$V_T(s) := \sum_{n=0}^{\infty} \gamma^n \int T^n_G(s' \mid \delta_s)\, R(s')\, ds', \qquad V_{\widehat{T}}(s) := \sum_{n=0}^{\infty} \gamma^n \int \widehat{T}^n_G(s' \mid \delta_s)\, R(s')\, ds'.$$

Next we derive a bound on $V_T(s) - V_{\widehat{T}}(s)\ \forall s$.

Theorem 2. Assume a Lipschitz model class $F_g$ with a $\Delta$-accurate $\widehat{T}$ and with $\bar{K} = \min\{K_F, K_T\}$. Further, assume a Lipschitz reward function with constant $K_R = K_{d_{\mathcal{S}},d_{\mathbb{R}}}(R)$. Then $\forall s \in \mathcal{S}$ and $\bar{K} \in [0, \frac{1}{\gamma})$:

$$V_T(s) - V_{\widehat{T}}(s) \le \frac{\gamma K_R \Delta}{(1-\gamma)(1-\gamma \bar{K})}.$$

Proof. We first define the function $f(s) = \frac{R(s)}{K_R}$. It can be observed that $K_{d_{\mathcal{S}},\mathbb{R}}(f) = 1$. Let $\mathcal{F} = \{h : K_{d_{\mathcal{S}},\mathbb{R}}(h) \le 1\}$. Then, given $f \in \mathcal{F}$, we now write:

$$V_T(s) - V_{\widehat{T}}(s) = \sum_{n=0}^{\infty} \gamma^n \int R(s') \big(T^n_G(s' \mid \delta_s) - \widehat{T}^n_G(s' \mid \delta_s)\big)\, ds'
= K_R \sum_{n=0}^{\infty} \gamma^n \int f(s') \big(T^n_G(s' \mid \delta_s) - \widehat{T}^n_G(s' \mid \delta_s)\big)\, ds'$$
$$\le K_R \sum_{n=0}^{\infty} \gamma^n \underbrace{\sup_{f \in \mathcal{F}} \int f(s') \big(T^n_G(s' \mid \delta_s) - \widehat{T}^n_G(s' \mid \delta_s)\big)\, ds'}_{:= W(T^n_G(\cdot \mid \delta_s),\, \widehat{T}^n_G(\cdot \mid \delta_s)) \text{ due to duality (4)}}
= K_R \sum_{n=0}^{\infty} \gamma^n \underbrace{W\big(T^n_G(\cdot \mid \delta_s), \widehat{T}^n_G(\cdot \mid \delta_s)\big)}_{\le \sum_{i=0}^{n-1} \Delta (\bar{K})^i \text{ due to Theorem 1}}$$
$$\le K_R \sum_{n=0}^{\infty} \gamma^n \sum_{i=0}^{n-1} \Delta (\bar{K})^i = K_R \Delta \sum_{n=0}^{\infty} \gamma^n\, \frac{1 - \bar{K}^n}{1 - \bar{K}} = \frac{\gamma K_R \Delta}{(1-\gamma)(1-\gamma\bar{K})}.$$

We can derive the same bound for $V_{\widehat{T}}(s) - V_T(s)$ using the fact that the Wasserstein distance is a metric, and therefore symmetric, thereby completing the proof.

Regarding the tightness of our bounds, we can show that when the transition model is deterministic and linear then Theorem 1 provides a tight bound. Moreover, if the reward function is also linear, the bound provided by Theorem 2 is tight. (See Claim 2 in the appendix.) Notice also that our proof does not require a bounded reward function.

7. Lipschitz Generalized Value Iteration

We next show that, given a Lipschitz transition model, solving for the fixed point of a class of Bellman equations yields a Lipschitz state-action value function. Our proof is in the context of Generalized Value Iteration (GVI) (Littman & Szepesvári, 1996), which defines Value Iteration (Bellman, 1957) for planning with arbitrary backup operators.

Algorithm 1 GVI algorithm
  Input: initial $\widehat{Q}(s, a)$, $\delta$, and choose an operator $f$
  repeat
    for each $s, a \in \mathcal{S} \times \mathcal{A}$ do
      $\widehat{Q}(s, a) \leftarrow R(s, a) + \gamma \int \widehat{T}(s' \mid s, a)\, f\big(\widehat{Q}(s', \cdot)\big)\, ds'$
    end for
  until convergence
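Algorithm 1 is straightforward to realize for finite state and action sets. The following is a tabular sketch (the random MDP, operator choices, and hyper-parameters are illustrative assumptions, not the paper's experimental setup).

```python
# Tabular GVI: the backup operator f can be max, mean, or mellowmax (Lemma 4).
import numpy as np

def gvi(R, T, gamma, f, tol=1e-8, max_iters=10_000):
    """R: (S, A) rewards; T: (S, A, S) transition probabilities;
    f: backup operator mapping the vector Q(s', .) to a scalar."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iters):
        backups = np.array([f(Q[s_next]) for s_next in range(n_states)])
        Q_new = R + gamma * T @ backups          # (S, A, S) @ (S,) -> (S, A)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

def mellowmax(x, beta=5.0):                      # a non-expansion (Lemma 4, item 2)
    return np.log(np.mean(np.exp(beta * x))) / beta

# Usage on a tiny random MDP.
rng = np.random.default_rng(0)
S, A = 4, 2
T = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))
print(np.round(gvi(R, T, 0.9, np.max), 3))       # standard value iteration
print(np.round(gvi(R, T, 0.9, mellowmax), 3))    # GVI with a softer operator
```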

To prove the result, we make use of the following lemmas.

Lemma 3. Given a Lipschitz function $f: \mathcal{S} \mapsto \mathbb{R}$ with constant $K_{d_{\mathcal{S}},d_{\mathbb{R}}}(f)$:

$$K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}\Big(\int \widehat{T}(s' \mid s, a)\, f(s')\, ds'\Big) \le K_{d_{\mathcal{S}},d_{\mathbb{R}}}(f)\, K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T}).$$

Lemma 4. The following operators (Asadi & Littman, 2017) are Lipschitz with constants:

1. $K_{\|\cdot\|_\infty,d_{\mathbb{R}}}\big(\max(x)\big) = K_{\|\cdot\|_\infty,d_{\mathbb{R}}}\big(\text{mean}(x)\big) = K_{\|\cdot\|_\infty,d_{\mathbb{R}}}\big(\epsilon\text{-greedy}(x)\big) = 1$

2. $K_{\|\cdot\|_\infty,d_{\mathbb{R}}}\Big(mm_\beta(x) := \frac{\log \frac{\sum_i e^{\beta x_i}}{n}}{\beta}\Big) = 1$

3. $K_{\|\cdot\|_\infty,d_{\mathbb{R}}}\Big(boltz_\beta(x) := \frac{\sum_{i=1}^{n} x_i e^{\beta x_i}}{\sum_{i=1}^{n} e^{\beta x_i}}\Big) \le \sqrt{|\mathcal{A}|} + \beta V_{max} |\mathcal{A}|$

Theorem 3. For any non-expansion backup operator $f$ outlined in Lemma 4, GVI computes a value function with a Lipschitz constant bounded by $\frac{K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(R)}{1 - \gamma K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T})}$ if $\gamma K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T}) < 1$.

Proof. From Algorithm 1, in the $n$th round of GVI updates:

$$\widehat{Q}_{n+1}(s, a) \leftarrow R(s, a) + \gamma \int \widehat{T}(s' \mid s, a)\, f\big(\widehat{Q}_n(s', \cdot)\big)\, ds'.$$

Now observe that:

$$K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(\widehat{Q}_{n+1}) \le K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(R) + \gamma K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}\Big(\int \widehat{T}(s' \mid s, a)\, f\big(\widehat{Q}_n(s', \cdot)\big)\, ds'\Big)$$
$$\le K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(R) + \gamma K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T})\, K_{d_{\mathcal{S}},\mathbb{R}}\big(f(\widehat{Q}_n(s, \cdot))\big)$$
$$\le K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(R) + \gamma K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T})\, K_{\|\cdot\|_\infty,d_{\mathbb{R}}}(f)\, K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(\widehat{Q}_n)
= K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(R) + \gamma K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T})\, K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(\widehat{Q}_n),$$

where we used Lemmas 3, 2, and 4 for the second, third, and fourth inequalities, respectively. Equivalently:

$$K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(\widehat{Q}_{n+1}) \le K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(R) \sum_{i=0}^{n} \big(\gamma K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T})\big)^i + \big(\gamma K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T})\big)^n\, K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(\widehat{Q}_0).$$

By computing the limit of both sides, we get:

$$\lim_{n \to \infty} K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(\widehat{Q}_{n+1}) \le \lim_{n \to \infty} K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(R) \sum_{i=0}^{n} \big(\gamma K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T})\big)^i + \lim_{n \to \infty} \big(\gamma K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T})\big)^n\, K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(\widehat{Q}_0) = \frac{K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}(R)}{1 - \gamma K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T})} + 0.$$

This concludes the proof.

Two implications of this result: First, PAC exploration in continuous state spaces is shown assuming a Lipschitz value function (Pazis & Parr, 2013). However, the theorem shows that it is sufficient to have a Lipschitz model, an assumption perhaps easier to confirm. The second implication relates to the value-aware model learning (VAML) objective (Farahmand et al., 2017). Using the above theorem, we can show that minimizing Wasserstein is equivalent to minimizing the VAML objective (Asadi et al., 2018).

8. Experiments

Our first goal in this section is to compare TV, KL, and Wasserstein in terms of the ability to best quantify the error of an imperfect model. To this end, we built finite MRPs with random transitions, $|\mathcal{S}| = 10$ states, and $\gamma = 0.95$. In the first case the reward signal is randomly sampled from $[0, 10]$, and in the second case the reward of a state is the index of that state, so a small Euclidean distance between two states is an indication of similar values. For $10^5$ trials, we generated an MRP and a random model, and then computed model error and planning error (Figure 4). We understand a good metric as the one that computes a model error with a high correlation with value error. We show these correlations for different values of $\gamma$ in Figure 5.

¹ We release the code here: github.com/kavosh8/Lip

Figure 4. Value error (x axis) and model error (y axis). When the reward is the index of the state (right), correlation between Wasserstein error and value-prediction error is high. This highlights the fact that when closeness in the state-space is an indication of similar values, Wasserstein can be a powerful metric for model-based RL. Note that Wasserstein provides no advantage given random rewards (left).

Figure 5. Correlation between value-prediction error and model error for the three metrics using random rewards (left) and index rewards (right). Given a useful notion of state similarities, low Wasserstein error is a better indication of planning error.
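The paper's released code is at github.com/kavosh8/Lip; independently of it, a simplified sketch of the index-reward case of this experiment could look like the following (the trial count, Dirichlet priors, and error summaries are assumptions of this illustration).

```python
# For random MRPs and random models, correlate each metric's one-step model
# error with the resulting value-prediction error.
import numpy as np
from scipy.stats import wasserstein_distance

def value(T, R, gamma):
    return np.linalg.solve(np.eye(len(R)) - gamma * T, R)

rng = np.random.default_rng(0)
S, gamma, n_trials = 10, 0.95, 1000
states = np.arange(S)                        # ground metric d(s, s') = |s - s'|
records = []
for _ in range(n_trials):
    T = rng.dirichlet(np.ones(S), size=S)        # true MRP transitions
    T_hat = rng.dirichlet(np.ones(S), size=S)    # random model
    R = states.astype(float)                     # index reward
    v_err = np.mean(np.abs(value(T, R, gamma) - value(T_hat, R, gamma)))
    w_err = np.mean([wasserstein_distance(states, states, T[s], T_hat[s])
                     for s in range(S)])
    tv_err = np.mean(0.5 * np.abs(T - T_hat).sum(axis=1))
    kl_err = np.mean((T * np.log(T / T_hat)).sum(axis=1))
    records.append((v_err, w_err, tv_err, kl_err))

v, w, tv, kl = np.array(records).T
for name, m in (("Wasserstein", w), ("TV", tv), ("KL", kl)):
    print(name, np.corrcoef(v, m)[0, 1])
```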

Table 1. Lipschitz constant $K_{\|\cdot\|_p,\|\cdot\|_p}(f)$ for various functions used in a neural network. Here, $W_j$ denotes the $j$th row of a weight matrix $W$.

| Function $f$ | Definition | $p = 1$ | $p = 2$ | $p = \infty$ |
|---|---|---|---|---|
| $ReLu: \mathbb{R}^n \to \mathbb{R}^n$ | $ReLu(x)_i := \max\{0, x_i\}$ | $1$ | $1$ | $1$ |
| $+b: \mathbb{R}^n \to \mathbb{R}^n,\ \forall b \in \mathbb{R}^n$ | $+b(x) := x + b$ | $1$ | $1$ | $1$ |
| $\times W: \mathbb{R}^m \to \mathbb{R}^n,\ \forall W \in \mathbb{R}^{n \times m}$ | $\times W(x) := Wx$ | $\sum_j \|W_j\|_\infty$ | $\sqrt{\sum_j \|W_j\|_2^2}$ | $\sup_j \|W_j\|_1$ |
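Table 1, combined with Lemma 2 (composition), yields a simple way to bound and cap a feed-forward network's Lipschitz constant. The sketch below is a minimal illustration of one way to impose the constraint (it is not the training code used in the experiments).

```python
# Bound a network's Lipschitz constant by the product of per-layer constants
# (ReLU and bias layers contribute a factor of 1), and rescale weights so that
# each layer's constant is at most a cap k.
import numpy as np

def layer_constant(W, p=2):
    """Upper bound on the Lipschitz constant of x -> W x, from Table 1."""
    if p == 1:
        return np.max(np.abs(W), axis=1).sum()     # sum_j ||W_j||_inf
    if p == 2:
        return np.sqrt((W ** 2).sum())             # sqrt(sum_j ||W_j||_2^2)
    return np.abs(W).sum(axis=1).max()             # p = inf: sup_j ||W_j||_1

def network_bound(weights, p=2):
    """Lemma 2: the composition constant is at most the product of the factors."""
    return float(np.prod([layer_constant(W, p) for W in weights]))

def project(weights, k, p=2):
    """Rescale each weight matrix so its layer constant is at most k."""
    return [W * min(1.0, k / layer_constant(W, p)) for W in weights]

rng = np.random.default_rng(0)
weights = [rng.normal(size=(32, 4)), rng.normal(size=(32, 32)), rng.normal(size=(1, 32))]
print(network_bound(weights))          # typically large before projection
weights = project(weights, k=1.5)
print(network_bound(weights))          # now at most 1.5 ** len(weights)
```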

It is known that controlling the Lipschitz constant of neural nets can help in terms of improving generalization error due to a lower bound on Rademacher complexity (Neyshabur et al., 2015; Bartlett & Mendelson, 2002). It then follows from Theorems 1 and 2 that controlling the Lipschitz constant of a learned transition model can achieve better error bounds for multi-step and value predictions. To enforce this constraint during learning, we bound the Lipschitz constant of the various operations used in building the neural network. The bound on the constant of the entire neural network then follows from Lemma 2. In Table 1, we provide Lipschitz constants for the operations (see Appendix for proofs) used in our experiments. We quantify these results for different $p$-norms $\|\cdot\|_p$.

Given these simple methods for enforcing Lipschitz continuity, we performed empirical evaluations to understand the impact of Lipschitz continuity of transition models, specifically when the transition model is used to perform multi-step state-predictions and policy improvements. We chose two standard domains: Cart Pole and Pendulum. In Cart Pole, we trained a network on a dataset of $15 \times 10^3$ tuples $\langle s, a, s' \rangle$. During training, we ensured that the weights of the network are smaller than $k$. For each $k$, we performed 20 independent model estimations, and chose the best model according to cross-validation error.

Using the learned model, along with the actual reward signal of the environment, we then performed stochastic actor-critic RL (Barto et al., 1983; Sutton et al., 2000). This required an interaction between the policy and the learned model for relatively long trajectories. To measure the usefulness of the model, we then tested the learned policy on the actual domain. We repeated this experiment on Pendulum. To train the neural transition model for this domain we used $10^4$ samples. Notably, we used deterministic policy gradient (Silver et al., 2014) for training the policy network with the hyper-parameters suggested by Lillicrap et al. (2015). We report these results in Figure 6.

Figure 6. Impact of the Lipschitz constant of learned models in Cart Pole (left) and Pendulum (right), measured as average return per episode. An intermediate value of $k$ (Lipschitz constant) yields the best performance.

Observe that an intermediate Lipschitz constant yields the best result. Consistent with the theory, controlling the Lipschitz constant in practice can combat the compounding errors and can help in the value estimation problem. This ultimately results in learning a better policy.

We next examined if the benefits carry over to stochastic settings. To capture stochasticity we need an algorithm to learn a Lipschitz model class (Definition 4). We used an EM algorithm to jointly learn a set of functions $f$, parameterized by $\theta = \{\theta^f : f \in F_g\}$, and a distribution over functions $g$. Note that in practice our dataset only consists of a set of samples $\langle s, a, s' \rangle$ and does not include the function the sample is drawn from. Hence, we consider this as our latent variable $z$. As is standard with EM, we start with the log-likelihood objective (for simplicity of presentation we assume a single action in the derivation):

$$L(\theta) = \sum_{i=1}^{N} \log p(s_i, s'_i; \theta) = \sum_{i=1}^{N} \log \sum_f p(z_i = f, s_i, s'_i; \theta)
= \sum_{i=1}^{N} \log \sum_f q(z_i = f \mid s_i, s'_i)\, \frac{p(z_i = f, s_i, s'_i; \theta)}{q(z_i = f \mid s_i, s'_i)}$$
$$\ge \sum_{i=1}^{N} \sum_f q(z_i = f \mid s_i, s'_i)\, \log \frac{p(z_i = f, s_i, s'_i; \theta)}{q(z_i = f \mid s_i, s'_i)},$$

where we used Jensen's inequality and concavity of log in the last line. This derivation leads to the following EM algorithm.
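As a concrete illustration of the E and M steps stated below, here is a compact sketch of the procedure. The paper trains neural-network components; to keep the example self-contained and runnable, each component $f$ is a linear function and the data-generating process is an assumption of this illustration.

```python
# EM for a Lipschitz model class with linear components f(s) = w*s + b and a
# Gaussian likelihood with fixed variance sigma^2 (a hyper-parameter), for a
# single action as in the derivation above.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: next states generated by randomly adding or subtracting 1.
s = rng.uniform(-2, 2, size=300)
s_next = s + rng.choice([-1.0, 1.0], size=s.size)

n_components, sigma2 = 2, 0.05
w = np.ones(n_components)                        # per-component slopes
b = rng.normal(size=n_components)                # per-component offsets
g = np.full(n_components, 1.0 / n_components)    # distribution over functions

for _ in range(50):
    # E step: posterior q(z_i = f | s_i, s'_i), normalized in log space.
    pred = w[None, :] * s[:, None] + b[None, :]                  # (N, F)
    log_post = -0.5 * (s_next[:, None] - pred) ** 2 / sigma2 + np.log(g[None, :])
    log_post -= log_post.max(axis=1, keepdims=True)
    q = np.exp(log_post)
    q /= q.sum(axis=1, keepdims=True)
    # M step: weighted least squares for each component, then update g.
    X = np.stack([s, np.ones_like(s)], axis=1)
    for f in range(n_components):
        sw = np.sqrt(q[:, f])
        coef, *_ = np.linalg.lstsq(X * sw[:, None], s_next * sw, rcond=None)
        w[f], b[f] = coef
    g = q.mean(axis=0)

print(np.round(w, 2), np.round(b, 2), np.round(g, 2))  # offsets should settle near +1 and -1
```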

Figure 7. A stochastic problem solved by training a Lipschitz model class using EM. The top left figure shows the functions before any training (iteration 0), and the bottom right figure shows the final results (iteration 50).

Figure 8. Impact of controlling the Lipschitz constant in the supervised-learning domain. Notice the U-shape of the final Wasserstein loss with respect to the Lipschitz constant $k$.

In the M step, find $\theta_t$ by solving for:

$$\arg\max_\theta \sum_{i=1}^{N} \sum_f q_{t-1}(z_i = f \mid s_i, s'_i)\, \log \frac{p(z_i = f, s_i, s'_i; \theta)}{q_{t-1}(z_i = f \mid s_i, s'_i)}.$$

In the E step, compute the posteriors:

$$q_t(z_i = f \mid s_i, s'_i) = \frac{p(s_i, s'_i \mid z_i = f; \theta^f_t)\, g(z_i = f; \theta_t)}{\sum_f p(s_i, s'_i \mid z_i = f; \theta^f_t)\, g(z_i = f; \theta_t)}.$$

Note that we assume each point is drawn from a neural network $f$ with probability:

$$p\big(s_i, s'_i \mid z_i = f; \theta^f_t\big) = \mathcal{N}\big(s'_i - f(s_i, \theta^f_t),\ \sigma^2\big),$$

and with a fixed variance $\sigma^2$ tuned as a hyper-parameter.

We used a supervised-learning domain to evaluate the EM algorithm. We generated 30 points from 5 functions (written at the end of the Appendix) and trained 5 neural networks to fit these points. Iterations of a single run are shown in Figure 7 and the summary of results is presented in Figure 8. Observe that the EM algorithm is effective, and that controlling the Lipschitz constant is again useful.

We next applied EM to train a transition model for an RL setting, namely the gridworld domain from Moerland et al. (2017). Here a useful model needs to capture the stochastic behavior of the two ghosts. We modify the reward to be -1 whenever the agent is in the same cell as either one of the ghosts and 0 otherwise. We performed environmental interactions for 1000 time-steps and measured the return. We compared against standard tabular methods (Sutton & Barto, 1998), and a deterministic model that predicts the expected next state (Sutton et al., 2008; Parr et al., 2008). In all cases we used value iteration for planning.

Figure 9. Performance of a Lipschitz model class on the gridworld domain. We show model test accuracy (left) and quality of the policy found using the model (right). Notice the poor performance of tabular and expected models.

Results in Figure 9 show that tabular models fail due to no generalization, and expected models fail since the ghosts do not move on expectation, a prediction not useful for the planner. Performing value iteration with a Lipschitz model class outperforms the baselines.

9. Conclusion

We took an important step towards understanding model-based RL with function approximation. We showed that Lipschitz continuity of an estimated model plays a central role in multi-step prediction error, and in value-estimation error. We also showed the benefits of employing Wasserstein for model-based RL. An important future work is to apply these ideas to larger problems.

10. Acknowledgements

The authors recognize the assistance of Eli Upfal, John Langford, George Konidaris, and members of Brown's Rlab, specifically Cameron Allen, David Abel, and Evan Cater. The authors also thank anonymous ICML reviewer 1 for insights on a value-aware interpretation of Wasserstein.

References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223, 2017.

Asadi, K. and Littman, M. L. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 243-252, 2017.

Asadi, K., Cater, E., Misra, D., and Littman, M. L. Equivalence between Wasserstein and value-aware model-based reinforcement learning. arXiv preprint arXiv:1806.01265, 2018.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, pp. 834-846, 1983.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449-458, 2017.

Bellman, R. A Markovian decision process. Journal of Mathematics and Mechanics, pp. 679-684, 1957.

Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pp. 908-919, 2017.

Bertsekas, D. Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control, 20(3):415-419, 1975.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming: An overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pp. 560-564. IEEE, 1995.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. X-armed bandits. Journal of Machine Learning Research, 12(May):1655-1695, 2011.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.

Farahmand, A.-M., Barreto, A., and Nikovski, D. Value-aware loss function for model-based reinforcement learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 1486-1494, 2017.

Ferns, N., Panangaden, P., and Precup, D. Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 162-169. AUAI Press, 2004.

Fox, R., Pakman, A., and Tishby, N. G-learning: Taming the noise in reinforcement learning via soft updates. In Uncertainty in Artificial Intelligence, 2016.

Gao, B. and Pavel, L. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805, 2017.

Hinderer, K. Lipschitz continuity of value functions in Markovian decision processes. Mathematical Methods of Operations Research, 62(1):3-22, 2005.

Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of AAMAS, pp. 1181-1189, 2015.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Kakade, S., Kearns, M. J., and Langford, J. Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 306-312, 2003.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.

Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 681-690. ACM, 2008.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Littman, M. L. and Szepesvári, C. A generalized reinforcement-learning model: Convergence and applications. In Proceedings of the 13th International Conference on Machine Learning, pp. 310-318, 1996.

Lorenz, E. Predictability: Does the flap of a butterfly's wing in Brazil set off a tornado in Texas? 1972.

Moerland, T. M., Broekens, J., and Jonker, C. M. Learning multimodal transition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470, 2017.

Müller, A. Optimal selection from distributions with unknown parameters: Robustness of Bayesian models. Mathematical Methods of Operations Research, 44(3):371-386, 1996.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892, 2017.

Narayanan, H. and Mitter, S. Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems, pp. 1786-1794, 2010.

Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory, pp. 1376-1401, 2015.

Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 752-759. ACM, 2008.

Pazis, J. and Parr, R. PAC optimal exploration in continuous space Markov decision processes. In AAAI, 2013.

Pires, B. Á. and Szepesvári, C. Policy error bounds for model-based reinforcement learning with factored linear models. In Conference on Learning Theory, pp. 121-151, 2016.

Pirotta, M., Restelli, M., and Bascetta, L. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2-3):255-283, 2015.

Rachelson, E. and Lagoudakis, M. G. On the locality of action domination in sequential decision making. In International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2010.

Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach. 1995.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.

Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413-2444, 2009.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057-1063, 2000.

Sutton, R. S., Szepesvári, C., Geramifard, A., and Bowling, M. H. Dyna-style planning with linear function approximation and prioritized sweeping. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pp. 528-536, 2008.

Szepesvári, C. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1-103, 2010.

Talvitie, E. Model regularization for stable sample rollouts. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 780-789. AUAI Press, 2014.

Talvitie, E. Self-correcting models for model-based reinforcement learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267-288, 1996.

Vaserstein, L. N. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64-72, 1969.

Venkatraman, A., Hebert, M., and Bagnell, J. A. Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Appendix

Claim 1. In a finite MDP, transition probabilities can be expressed using a finite set of deterministic functions and a distribution over the functions.

Proof. Let $Pr(s, a, s')$ denote the probability of a transition from $s$ to $s'$ when executing the action $a$. Define an ordering over states $s_1, \ldots, s_n$ with an additional unreachable state $s_0$. Now define the cumulative probability distribution:

$$C(s, a, s_i) := \sum_{j=0}^{i} Pr(s, a, s_j).$$

Further define $L$ as the set of distinct entries in $C$:

$$L := \big\{ C(s, a, s_i) \mid s \in \mathcal{S},\ i \in [0, n] \big\}.$$

Note that, since the MDP is assumed to be finite, $|L|$ is finite. We sort the values of $L$ and denote by $c_i$ the $i$th smallest value of the set. Note that $c_0 = 0$ and $c_{|L|} = 1$. We now build a deterministic set of functions $f_1, \ldots, f_{|L|}$ as follows: $\forall i = 1$ to $|L|$ and $\forall j = 1$ to $n$, define $f_i(s) = s_j$ if and only if:

$$C(s, a, s_{j-1}) < c_i \le C(s, a, s_j).$$

We also define the probability distribution $g$ over $f$ as follows:

$$g(f_i \mid a) := c_i - c_{i-1}.$$

Given the functions $f_1, \ldots, f_{|L|}$ and the distribution $g$, we can now compute the probability of a transition to $s_j$ from $s$ after executing action $a$:

$$\sum_i \mathbf{1}\big(f_i(s) = s_j\big)\, g(f_i \mid a) = \sum_i \mathbf{1}\big(C(s, a, s_{j-1}) < c_i \le C(s, a, s_j)\big)\, (c_i - c_{i-1}) = C(s, a, s_j) - C(s, a, s_{j-1}) = Pr(s, a, s_j),$$

where $\mathbf{1}$ is a binary function that outputs one if and only if its condition holds. We have thus reconstructed the transition probabilities using the distribution $g$ and the deterministic functions $f_1, \ldots, f_{|L|}$.

Claim 2. Given a deterministic and linear transition model, and a linear reward signal, the bounds provided in Theorems 1 and 2 are both tight.
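The construction in the proof of Claim 1 is easy to implement directly. The sketch below (a single-action illustration with a random matrix, both assumptions of the example) decomposes a row-stochastic transition matrix and verifies the reconstruction identity.

```python
# Decompose a transition matrix P[s, s'] into deterministic functions f_1..f_|L|
# and a distribution g over them, following the proof of Claim 1.
import numpy as np

def decompose(P):
    """P: (S, S) row-stochastic matrix for one action."""
    n = P.shape[0]
    C = np.cumsum(P, axis=1)                       # C(s, a, s_i), with C(., s_0) = 0
    C[:, -1] = 1.0                                 # guard against float drift
    cuts = np.unique(np.concatenate([[0.0], np.minimum(C.ravel(), 1.0)]))  # sorted set L
    g = np.diff(cuts)                              # g(f_i | a) = c_i - c_{i-1}
    # f_i(s) = s_j  iff  C(s, a, s_{j-1}) < c_i <= C(s, a, s_j)
    functions = [np.array([np.searchsorted(C[s], c) for s in range(n)])
                 for c in cuts[1:]]
    return functions, g

# Check: sum_i 1(f_i(s) = s') g(f_i | a) recovers P.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=4)
functions, g = decompose(P)
P_rebuilt = np.zeros_like(P)
for f, gf in zip(functions, g):
    P_rebuilt[np.arange(len(f)), f] += gf
print(np.allclose(P, P_rebuilt))                   # True (up to float error)
```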

Proof. Assume a linear transition function $T$ defined as $T(s) = Ks$, and assume our learned transition function is $\widehat{T}(s) := Ks + \Delta$. Note that:

$$\max_s \big| T(s) - \widehat{T}(s) \big| = \Delta,$$

and that $\min\{K_T, K_{\widehat{T}}\} = K$. First observe that the bound of Theorem 1 is tight for $n = 2$:

$$\forall s \quad \big| T\big(T(s)\big) - \widehat{T}\big(\widehat{T}(s)\big) \big| = \big| K^2 s - \big(K^2 s + \Delta(1 + K)\big) \big| = \Delta \sum_{i=0}^{1} K^i,$$

and more generally, after $n$ compositions of the models, denoted by $T^n$ and $\widehat{T}^n$, the following equality holds:

$$\forall s \quad \big| T^n(s) - \widehat{T}^n(s) \big| = \Delta \sum_{i=0}^{n-1} K^i.$$

Let us further assume that the reward is linear:

$$R(s) = K_R s.$$

Consider the state $s = 0$. Note that clearly $v(0) = 0$. We now compute the value predicted using $\widehat{T}$, denoted by $\hat{v}(0)$:

$$\hat{v}(0) = R(0) + \gamma R\Big(0 + \Delta \sum_{i=0}^{0} K^i\Big) + \gamma^2 R\Big(0 + \Delta \sum_{i=0}^{1} K^i\Big) + \gamma^3 R\Big(0 + \Delta \sum_{i=0}^{2} K^i\Big) + \ldots$$
$$= 0 + \gamma K_R \Delta \sum_{i=0}^{0} K^i + \gamma^2 K_R \Delta \sum_{i=0}^{1} K^i + \gamma^3 K_R \Delta \sum_{i=0}^{2} K^i + \ldots
= K_R \Delta \sum_{n=0}^{\infty} \gamma^n \sum_{i=0}^{n-1} K^i = \frac{\gamma K_R \Delta}{(1-\gamma)(1-\gamma \bar{K})},$$

and so:

$$\big| v(0) - \hat{v}(0) \big| = \frac{\gamma K_R \Delta}{(1-\gamma)(1-\gamma \bar{K})}.$$

Note that this exactly matches the bound derived in Theorem 2.

Lemma 1. A generalized transition function $\widehat{T}_G$ induced by a Lipschitz model class $F_g$ is Lipschitz with a constant:

$$K^{\mathcal{A}}_{W,W}(\widehat{T}_G) := \sup_a \sup_{\mu_1,\mu_2} \frac{W\big(\widehat{T}_G(\cdot \mid \mu_1, a), \widehat{T}_G(\cdot \mid \mu_2, a)\big)}{W(\mu_1, \mu_2)} \le K_F.$$

Proof.

$$W\big(\widehat{T}_G(\cdot \mid \mu_1, a), \widehat{T}_G(\cdot \mid \mu_2, a)\big) := \inf_j \int_{s'_1} \int_{s'_2} j(s'_1, s'_2)\, d(s'_1, s'_2)\, ds'_1\, ds'_2$$
$$= \inf_j \int_{s_1} \int_{s_2} \int_{s'_1} \int_{s'_2} \sum_f \mathbf{1}\big(f(s_1) = s'_1 \wedge f(s_2) = s'_2\big)\, j(s_1, s_2, f)\, d(s'_1, s'_2)\, ds'_1\, ds'_2\, ds_1\, ds_2$$
$$= \inf_j \int_{s_1} \int_{s_2} \sum_f j(s_1, s_2, f)\, d\big(f(s_1), f(s_2)\big)\, ds_1\, ds_2
\le K_F \inf_j \int_{s_1} \int_{s_2} \sum_f g(f \mid a)\, j(s_1, s_2)\, d(s_1, s_2)\, ds_1\, ds_2$$
$$= K_F \sum_f g(f \mid a) \inf_j \int_{s_1} \int_{s_2} j(s_1, s_2)\, d(s_1, s_2)\, ds_1\, ds_2
= K_F \sum_f g(f \mid a)\, W(\mu_1, \mu_2) = K_F\, W(\mu_1, \mu_2).$$

Dividing by $W(\mu_1, \mu_2)$ and taking the sup over $a$, $\mu_1$, and $\mu_2$, we conclude:

$$K^{\mathcal{A}}_{W,W}(\widehat{T}_G) = \sup_a \sup_{\mu_1,\mu_2} \frac{W\big(\widehat{T}_G(\cdot \mid \mu_1, a), \widehat{T}_G(\cdot \mid \mu_2, a)\big)}{W(\mu_1, \mu_2)} \le K_F.$$

We can also prove this using the Kantorovich-Rubinstein duality theorem. For every $\mu_1$, $\mu_2$, and $a \in \mathcal{A}$ we have:

$$W\big(\widehat{T}_G(\cdot \mid \mu_1, a), \widehat{T}_G(\cdot \mid \mu_2, a)\big) = \sup_{f: K_{d_{\mathcal{S}},\mathbb{R}}(f) \le 1} \int_s \big(\widehat{T}_G(s \mid \mu_1, a) - \widehat{T}_G(s \mid \mu_2, a)\big) f(s)\, ds$$
$$= \sup_{f: K_{d_{\mathcal{S}},\mathbb{R}}(f) \le 1} \int_s \int_{s'} \big(\widehat{T}(s \mid s', a)\mu_1(s') - \widehat{T}(s \mid s', a)\mu_2(s')\big) f(s)\, ds\, ds'
= \sup_{f: K_{d_{\mathcal{S}},\mathbb{R}}(f) \le 1} \int_s \int_{s'} \widehat{T}(s \mid s', a)\big(\mu_1(s') - \mu_2(s')\big) f(s)\, ds\, ds'$$
$$= \sup_{f: K_{d_{\mathcal{S}},\mathbb{R}}(f) \le 1} \int_s \int_{s'} \sum_t g(t \mid a)\, \mathbf{1}\big(t(s') = s\big)\big(\mu_1(s') - \mu_2(s')\big) f(s)\, ds\, ds'
= \sup_{f: K_{d_{\mathcal{S}},\mathbb{R}}(f) \le 1} \sum_t g(t \mid a) \int_{s'} \int_s \mathbf{1}\big(t(s') = s\big)\big(\mu_1(s') - \mu_2(s')\big) f(s)\, ds\, ds'$$
$$= \sup_{f: K_{d_{\mathcal{S}},\mathbb{R}}(f) \le 1} \sum_t g(t \mid a) \int_{s'} \big(\mu_1(s') - \mu_2(s')\big) f\big(t(s')\big)\, ds'
\le \sum_t g(t \mid a) \sup_{f: K_{d_{\mathcal{S}},\mathbb{R}}(f) \le 1} \int_{s'} \big(\mu_1(s') - \mu_2(s')\big) f\big(t(s')\big)\, ds'.$$

The composition $f \circ t$ is Lipschitz with constant upper bounded by $K_F$, so $\frac{f(t(\cdot))}{K_F}$ has Lipschitz constant at most 1. Therefore:

$$= K_F \sum_t g(t \mid a) \sup_{f: K_{d_{\mathcal{S}},\mathbb{R}}(f) \le 1} \int_{s'} \big(\mu_1(s') - \mu_2(s')\big)\, \frac{f\big(t(s')\big)}{K_F}\, ds'
\le K_F \sum_t g(t \mid a) \sup_{h: K_{d_{\mathcal{S}},\mathbb{R}}(h) \le 1} \int_{s'} \big(\mu_1(s') - \mu_2(s')\big)\, h(s')\, ds'$$
$$= K_F \sum_t g(t \mid a)\, W(\mu_1, \mu_2) = K_F\, W(\mu_1, \mu_2).$$

Again we conclude by dividing by $W(\mu_1, \mu_2)$ and taking the sup over $a$, $\mu_1$, and $\mu_2$.

Lemma 2. (Composition Lemma) Define three metric spaces $(M_1, d_1)$, $(M_2, d_2)$, and $(M_3, d_3)$. Define Lipschitz functions $f: M_2 \mapsto M_3$ and $g: M_1 \mapsto M_2$ with constants $K_{d_2,d_3}(f)$ and $K_{d_1,d_2}(g)$. Then, $h := f \circ g: M_1 \mapsto M_3$ is Lipschitz with constant $K_{d_1,d_3}(h) \le K_{d_2,d_3}(f)\, K_{d_1,d_2}(g)$.

Proof.

$$K_{d_1,d_3}(h) = \sup_{s_1,s_2} \frac{d_3\big(f(g(s_1)), f(g(s_2))\big)}{d_1(s_1, s_2)}
= \sup_{s_1,s_2} \frac{d_3\big(f(g(s_1)), f(g(s_2))\big)}{d_2\big(g(s_1), g(s_2)\big)} \cdot \frac{d_2\big(g(s_1), g(s_2)\big)}{d_1(s_1, s_2)}$$
$$\le \sup_{s_1,s_2} \frac{d_3\big(f(s_1), f(s_2)\big)}{d_2(s_1, s_2)} \cdot \sup_{s_1,s_2} \frac{d_2\big(g(s_1), g(s_2)\big)}{d_1(s_1, s_2)} = K_{d_2,d_3}(f)\, K_{d_1,d_2}(g).$$

Lemma 3. Given a Lipschitz function $f: \mathcal{S} \mapsto \mathbb{R}$ with constant $K_{d_{\mathcal{S}},d_{\mathbb{R}}}(f)$:

$$K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}\Big(\int \widehat{T}(s' \mid s, a)\, f(s')\, ds'\Big) \le K_{d_{\mathcal{S}},d_{\mathbb{R}}}(f)\, K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T}).$$

Proof.

$$K^{\mathcal{A}}_{d_{\mathcal{S}},d_{\mathbb{R}}}\Big(\int \widehat{T}(s' \mid s, a)\, f(s')\, ds'\Big) = \sup_a \sup_{s_1,s_2} \frac{\big| \int_{s'} \big(\widehat{T}(s' \mid s_1, a) - \widehat{T}(s' \mid s_2, a)\big) f(s')\, ds' \big|}{d(s_1, s_2)}$$
$$= \sup_a \sup_{s_1,s_2} \frac{\big| \int_{s'} \big(\widehat{T}(s' \mid s_1, a) - \widehat{T}(s' \mid s_2, a)\big)\, K_{d_{\mathcal{S}},d_{\mathbb{R}}}(f)\, \frac{f(s')}{K_{d_{\mathcal{S}},d_{\mathbb{R}}}(f)}\, ds' \big|}{d(s_1, s_2)}
= K_{d_{\mathcal{S}},d_{\mathbb{R}}}(f) \sup_a \sup_{s_1,s_2} \frac{\big| \int_{s'} \big(\widehat{T}(s' \mid s_1, a) - \widehat{T}(s' \mid s_2, a)\big)\, \frac{f(s')}{K_{d_{\mathcal{S}},d_{\mathbb{R}}}(f)}\, ds' \big|}{d(s_1, s_2)}$$
$$\le K_{d_{\mathcal{S}},d_{\mathbb{R}}}(f) \sup_a \sup_{s_1,s_2} \frac{\sup_{g: K_{d_{\mathcal{S}},d_{\mathbb{R}}}(g) \le 1} \int_{s'} \big(\widehat{T}(s' \mid s_1, a) - \widehat{T}(s' \mid s_2, a)\big)\, g(s')\, ds'}{d(s_1, s_2)}$$
$$= K_{d_{\mathcal{S}},d_{\mathbb{R}}}(f) \sup_a \sup_{s_1,s_2} \frac{W\big(\widehat{T}(\cdot \mid s_1, a), \widehat{T}(\cdot \mid s_2, a)\big)}{d(s_1, s_2)} = K_{d_{\mathcal{S}},d_{\mathbb{R}}}(f)\, K^{\mathcal{A}}_{d_{\mathcal{S}},W}(\widehat{T}).$$

Lemma 4. The following operators (Asadi & Littman, 2017) are Lipschitz with constants:

1. $K_{\|\cdot\|_\infty,d_{\mathbb{R}}}\big(\max(x)\big) = K_{\|\cdot\|_\infty,d_{\mathbb{R}}}\big(\text{mean}(x)\big) = K_{\|\cdot\|_\infty,d_{\mathbb{R}}}\big(\epsilon\text{-greedy}(x)\big) = 1$

2. $K_{\|\cdot\|_\infty,d_{\mathbb{R}}}\Big(mm_\beta(x) := \frac{\log \frac{\sum_i e^{\beta x_i}}{n}}{\beta}\Big) = 1$

3. $K_{\|\cdot\|_\infty,d_{\mathbb{R}}}\Big(boltz_\beta(x) := \frac{\sum_{i=1}^{n} x_i e^{\beta x_i}}{\sum_{i=1}^{n} e^{\beta x_i}}\Big) \le \sqrt{|\mathcal{A}|} + \beta V_{max} |\mathcal{A}|$

Proof. Item 1 was proven by Littman & Szepesvári (1996), and item 2 has been proven several times (Fox et al., 2016; Asadi & Littman, 2017; Nachum et al., 2017; Neu et al., 2017). We focus on proving item 3. Define

$$\rho(x)_i = \frac{e^{\beta x_i}}{\sum_{i=1}^{n} e^{\beta x_i}},$$

and observe that $boltz_\beta(x) = x^\top \rho(x)$. Gao & Pavel (2017) showed that $\rho$ is Lipschitz:

$$\|\rho(x_1) - \rho(x_2)\|_2 \le \beta \|x_1 - x_2\|_2. \qquad (7)$$

Using their result, we can further show:

$$\big| \rho(x_1)^\top x_1 - \rho(x_2)^\top x_2 \big| \le \big| \rho(x_1)^\top x_1 - \rho(x_1)^\top x_2 \big| + \big| \rho(x_1)^\top x_2 - \rho(x_2)^\top x_2 \big|$$
$$\le \|\rho(x_1)\|_2\, \|x_1 - x_2\|_2 + \|x_2\|_2\, \|\rho(x_1) - \rho(x_2)\|_2 \quad \text{(Cauchy-Schwarz)}$$
$$\le \|\rho(x_1)\|_2\, \|x_1 - x_2\|_2 + \|x_2\|_2\, \beta\, \|x_1 - x_2\|_2 \quad \text{(from Eqn. 7)}$$
$$\le \big(1 + \beta V_{max} \sqrt{|\mathcal{A}|}\big)\, \|x_1 - x_2\|_2 \le \big(\sqrt{|\mathcal{A}|} + \beta V_{max} |\mathcal{A}|\big)\, \|x_1 - x_2\|_\infty.$$

Dividing both sides by $\|x_1 - x_2\|_\infty$ leads to item 3.
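As a quick numerical illustration of Lemma 4 (not from the paper), the snippet below evaluates mellowmax and the Boltzmann operator on a pair of nearby vectors: mellowmax behaves as a non-expansion in the infinity norm, while the Boltzmann operator can expand. The specific pair and the value of $\beta$ are arbitrary choices for the demonstration.

```python
import numpy as np

def mellowmax(x, beta):
    return np.log(np.mean(np.exp(beta * np.asarray(x)))) / beta

def boltzmann(x, beta):
    x = np.asarray(x, dtype=float)
    w = np.exp(beta * (x - x.max()))        # numerically stable softmax weights
    return float(x @ (w / w.sum()))

beta = 16.0
x1, x2 = np.array([0.0, 0.100]), np.array([0.0, 0.101])
d = np.max(np.abs(x1 - x2))                 # infinity-norm distance = 0.001

print(abs(mellowmax(x1, beta) - mellowmax(x2, beta)) / d)   # <= 1: non-expansion
print(abs(boltzmann(x1, beta) - boltzmann(x2, beta)) / d)   # > 1: an expansion
```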

Below, we derive the Lipschitz constants for the various functions mentioned in Table 1.

ReLU non-linearity. We show that $ReLu: \mathbb{R}^n \to \mathbb{R}^n$ has Lipschitz constant 1 for every $\|\cdot\|_p$:

$$K_{\|\cdot\|_p,\|\cdot\|_p}(ReLu) = \sup_{x_1,x_2} \frac{\|ReLu(x_1) - ReLu(x_2)\|_p}{\|x_1 - x_2\|_p} = \sup_{x_1,x_2} \frac{\big(\sum_i |ReLu(x_1)_i - ReLu(x_2)_i|^p\big)^{\frac{1}{p}}}{\|x_1 - x_2\|_p}.$$

We can show that $|ReLu(x_1)_i - ReLu(x_2)_i| \le |x_{1,i} - x_{2,i}|$, and so:

$$K_{\|\cdot\|_p,\|\cdot\|_p}(ReLu) \le \sup_{x_1,x_2} \frac{\big(\sum_i |x_{1,i} - x_{2,i}|^p\big)^{\frac{1}{p}}}{\|x_1 - x_2\|_p} = \sup_{x_1,x_2} \frac{\|x_1 - x_2\|_p}{\|x_1 - x_2\|_p} = 1.$$

Matrix multiplication. Let $W \in \mathbb{R}^{n \times m}$. We derive the Lipschitz continuity of the function $\times W(x) := Wx$. For $p = \infty$ we have:

$$K_{\|\cdot\|_\infty,\|\cdot\|_\infty}(\times W) = \sup_{x_1,x_2} \frac{\|\times W(x_1) - \times W(x_2)\|_\infty}{\|x_1 - x_2\|_\infty} = \sup_{x_1,x_2} \frac{\|W(x_1 - x_2)\|_\infty}{\|x_1 - x_2\|_\infty} = \sup_{x_1,x_2} \frac{\sup_j |W_j(x_1 - x_2)|}{\|x_1 - x_2\|_\infty}$$
$$\le \sup_{x_1,x_2} \frac{\sup_j \|W_j\|_1\, \|x_1 - x_2\|_\infty}{\|x_1 - x_2\|_\infty} = \sup_j \|W_j\|_1 \quad \text{(H\"older's inequality)},$$

where $W_j$ refers to the $j$th row of the weight matrix $W$. Similarly, for $p = 1$ we have:

$$K_{\|\cdot\|_1,\|\cdot\|_1}(\times W) = \sup_{x_1,x_2} \frac{\|W(x_1 - x_2)\|_1}{\|x_1 - x_2\|_1} = \sup_{x_1,x_2} \frac{\sum_j |W_j(x_1 - x_2)|}{\|x_1 - x_2\|_1} \le \sup_{x_1,x_2} \frac{\sum_j \|W_j\|_\infty\, \|x_1 - x_2\|_1}{\|x_1 - x_2\|_1} = \sum_j \|W_j\|_\infty,$$

and finally for $p = 2$:

$$K_{\|\cdot\|_2,\|\cdot\|_2}(\times W) = \sup_{x_1,x_2} \frac{\|W(x_1 - x_2)\|_2}{\|x_1 - x_2\|_2} = \sup_{x_1,x_2} \frac{\sqrt{\sum_j |W_j(x_1 - x_2)|^2}}{\|x_1 - x_2\|_2} \le \sup_{x_1,x_2} \frac{\sqrt{\sum_j \|W_j\|_2^2\, \|x_1 - x_2\|_2^2}}{\|x_1 - x_2\|_2} = \sqrt{\sum_j \|W_j\|_2^2}.$$

Vector addition. We show that $+b: \mathbb{R}^n \to \mathbb{R}^n$ has Lipschitz constant 1 for $p \in \{1, 2, \infty\}$ and for all $b \in \mathbb{R}^n$:

$$K_{\|\cdot\|_p,\|\cdot\|_p}(+b) = \sup_{x_1,x_2} \frac{\|+b(x_1) - (+b)(x_2)\|_p}{\|x_1 - x_2\|_p} = \sup_{x_1,x_2} \frac{\|(x_1 + b) - (x_2 + b)\|_p}{\|x_1 - x_2\|_p} = \frac{\|x_1 - x_2\|_p}{\|x_1 - x_2\|_p} = 1.$$
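A small sanity check (an illustration, not part of the paper) that the three matrix-multiplication constants derived above indeed upper-bound the observed expansion for random inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))

bounds = {
    1: np.max(np.abs(W), axis=1).sum(),     # sum_j ||W_j||_inf
    2: np.sqrt((W ** 2).sum()),             # sqrt(sum_j ||W_j||_2^2)
    np.inf: np.abs(W).sum(axis=1).max(),    # sup_j ||W_j||_1
}
for p, bound in bounds.items():
    x1 = rng.normal(size=(10_000, 5))
    x2 = rng.normal(size=(10_000, 5))
    num = np.linalg.norm((x1 - x2) @ W.T, ord=p, axis=1)
    den = np.linalg.norm(x1 - x2, ord=p, axis=1)
    print(p, float(np.max(num / den)), "<=", float(bound))
```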

Supervised-learning domain. We used the following 5 functions to generate the dataset:

$$f_0(x) = \tanh(x) + 3$$
$$f_1(x) = x \cdot x$$
$$f_2(x) = \sin(x) - 5$$
$$f_3(x) = \sin(x) - 3$$
$$f_4(x) = \sin(x) \cdot \sin(x)$$

We sampled each function 30 times, where the input was chosen uniformly randomly from $[-2, 2]$ each time.
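For completeness, the dataset can be regenerated directly from these functions; the random seed below is an arbitrary assumption.

```python
import numpy as np

functions = [
    lambda x: np.tanh(x) + 3,         # f0
    lambda x: x * x,                  # f1
    lambda x: np.sin(x) - 5,          # f2
    lambda x: np.sin(x) - 3,          # f3
    lambda x: np.sin(x) * np.sin(x),  # f4
]

rng = np.random.default_rng(0)
xs, ys = [], []
for f in functions:
    x = rng.uniform(-2, 2, size=30)   # each function sampled 30 times on [-2, 2]
    xs.append(x)
    ys.append(f(x))
dataset = np.stack([np.concatenate(xs), np.concatenate(ys)], axis=1)  # (150, 2)
print(dataset.shape)
```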