arXiv:1905.12654v1 [cs.LG] 29 May 2019

On the Generalization Gap in Reparameterizable Reinforcement Learning

Huan Wang 1   Stephan Zheng 1   Caiming Xiong 1   Richard Socher 1

1 Salesforce Research, Palo Alto, CA, USA. Correspondence to: Huan Wang <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Understanding generalization in reinforcement learning (RL) is a significant challenge, as many common assumptions of traditional supervised learning theory do not apply. We focus on a special class of problems, where the trajectory distribution can be decomposed using the reparametrization trick. For this problem class, estimating the expected return is efficient and the trajectory can be computed deterministically given peripheral random variables, which enables us to study reparametrizable RL using supervised learning and transfer learning. Through these relationships, we derive guarantees on the gap between the empirical and expected return for both intrinsic and external errors, based on Rademacher complexity as well as the PAC-Bayes bound. Our bound suggests the generalization capability of reparameterizable RL is related to multiple factors including the "smoothness" of the environment transition, reward and agent policy function class. We also empirically verify the relationship between the generalization gap and these factors through simulations.

1. Introduction

Reinforcement learning (RL) has proven successful in a series of applications such as games (Mnih et al., 2015; Silver et al., 2016; 2017; Vinyals et al., 2017; OpenAI, 2018), recommendation systems (Shani et al., 2005; Li et al., 2010), resource management (Mao et al., 2016), robotics (Kober et al., 2013), neural architecture design (Baker et al., 2017; Mirhoseini et al., 2018), and more. However, some key questions in reinforcement learning remain unsolved. One that draws more and more attention is the issue of overfitting in reinforcement learning (Sutton, 1995).

A model that performs well in the training environment may or may not perform well in the testing environment. There is a growing interest in understanding the conditions for model generalization and also in developing general algorithms that improve generalization. In general, we would like to measure how accurately an algorithm is able to predict on previously unseen data. One metric of interest is the gap between the training and testing loss or reward. It has been observed that such gaps are related to multiple factors: the initial state distribution, the environment transition, the level of "difficulty" in the environment, model architectures, and optimization.

Zhang et al. (2018b) split randomly sampled initial states into training and testing, and evaluated the performance gap in deep reinforcement learning. They empirically observed overfitting caused by the randomness of the initial state distribution, even if the transition and environment are kept the same as training. On the other hand, Cobbe et al. (2018) allowed the test environment to vary from training, and observed huge differences in testing performance. Packer et al. (2018) also reported very different testing behaviors across models and RL algorithms, even for the same problem, from time to time.

Although overfitting has been empirically observed in RL (Cobbe et al., 2018; Farebrother et al., 2018; Justesen et al., 2018; Packer et al., 2018; Zhang et al., 2018a;b), theoretical guarantees, especially finite-sample guarantees on generalization, are still missing. In this work, we focus on on-policy RL, where agent policies are trained based on episodes sampled "on-the-fly" using the experience of the current policy in training.

We identify two major obstacles in the analysis of on-policy RL. First, the episode distribution keeps changing as the policy gets updated during optimization. Therefore, episodes have to be continuously redrawn from the new distribution induced by the updated policy. For finite-sample analysis, this leads to a process with complex dependencies. Second, state-of-the-art research on RL tends to mix the errors caused by shifts in the environment distribution and the randomness in the environment. We argue that these two types of overfitting errors are actually very different. One, which we call intrinsic error, is analogous to overfitting in supervised learning, and the other, called external error, looks more like the errors in transfer learning.

Our key observation is that there exists a special class of RL, called reparameterizable RL, where randomness in the environment can be decoupled from the transition and initialization procedures via the reparameterization trick (Kingma & Welling, 2014). Through reparameterization, an episode's dependency on the policy is "lifted" to the states. Hence, as the policy gets updated, episodes are deterministic given peripheral random variables. As a consequence, the expected reward in reparameterizable RL is connected to the Rademacher complexity as well as the PAC-Bayes bound. The reparameterization trick also makes the analysis for the second type of errors, i.e., when the environment distribution is shifted, much easier, since the environment parameters are also "lifted" to the representation of states.

Related Work   Generalization in reinforcement learning has been investigated a lot both theoretically and empirically. Theoretical work includes bandit analysis (Agarwal et al., 2014; Auer et al., 2002; 2009; Beygelzimer et al., 2011), Probably Approximately Correct (PAC) analysis (Jiang et al., 2017; Dann et al., 2017; Strehl et al., 2009; Lattimore & Hutter, 2014), as well as minimax analysis (Azar et al., 2017; Chakravorty & Hyland, 2003). Most works focus on the analysis of regret and consider the gap between the expected value and the optimal return. On the empirical side, besides the previously mentioned work, Whiteson et al. (2011) propose generalized methodologies based on multiple environments sampled from a distribution. Nair et al. (2015) also use random starts to test generalization.

Other research has also examined generalization from a transfer learning perspective. Lazaric (2012); Taylor & Stone (2009); Zhan & Taylor (2015); Laroche (2017) examine model generalization across different learning tasks, and provide guarantees on asymptotic performance. There are also works in robotics for transferring policies from a simulator to the real world and optimizing an internal model from data (Kearns & Singh, 2002), or works trying to solve abstracted or compressed MDPs (Majeed & Hutter, 2018).

Our Contributions:

• A connection between (on-policy) reinforcement learning and supervised learning through the reparameterization trick. It simplifies the finite-sample analysis for RL, and yields Rademacher and PAC-Bayes bounds on Markov Decision Processes (MDPs).

• Identifying a class of reparameterizable RL and providing a simple bound for "smooth" environments and models with a limited number of parameters.

• A guarantee for reparameterized RL when the environment is changed during testing. In particular, we discuss two cases of environment shift: a change in the initial distribution for the states, or in the transition function.

2. Notation and Formulation

We denote a Markov Decision Process (MDP) as a 5-tuple (S, A, P, r, P_0). Here S is the state space, A is the action space, P(s, a, s'): S × A × S → [0, 1] is the transition probability from state s to s' when taking action a, r(s): S → R represents the reward function, and P_0(s): S → [0, 1] is the initial state distribution. Let π(s) ∈ Π: S → A be the policy map that returns the action a at state s.

We consider episodic MDPs with a finite horizon. Given the policy map π and the transition probability P, the state-to-state transition probability is T_π(s, s') = P(s, π(s), s'). Without loss of generality, the length of an episode is T + 1. We denote a sequence of states [s_0, s_1, ..., s_T] as s. The total reward in an episode is R(s) = Σ_{t=0}^{T} γ^t r_t, where γ ∈ (0, 1] is a discount factor and r_t = r(s_t).

Denote the joint distribution of the sequence of states in an episode s = [s_0, s_1, ..., s_T] as D_π. Note D_π is also related to P and P_0. In this work we assume P and P_0 are fixed, so D_π is a function of π. Our goal is to find a policy that maximizes the expected total discounted reward (return):

    π* = argmax_{π∈Π} E_{s∼D_π} R(s) = argmax_{π∈Π} E_{s∼D_π} Σ_{t=0}^{T} γ^t r_t.    (1)

Suppose during training we have a budget of n episodes; then the empirical return maximizer is

    πˆ = argmax_{π∈Π, s^i∼D_π} (1/n) Σ_{i=1}^{n} R(s^i),    (2)

where s^i = [s^i_0, s^i_1, ..., s^i_T] is the i-th episode of length T + 1. We are interested in the generalization gap

    Φ = (1/n) Σ_{i=1}^{n} R(s^i) − E_{s∼D'_πˆ} R(s).    (3)

Note that in (3) the distribution D'_πˆ may be different from D_πˆ, since in the testing environment P' as well as P'_0 may be shifted compared to the training environment.
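To make the quantities above concrete, here is a minimal Monte Carlo sketch (not from the paper) of the empirical return in (2), evaluated for a fixed rather than optimized policy, and of the gap (3) under a shifted initial distribution. The tabular toy MDP, its sizes, and the random policy are illustrative assumptions; since the policy is fixed, the printed gap mixes the finite-sample part and the distribution-shift part that Section 4 separates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP (illustrative, not from the paper): |S| states, |A| actions.
S, A, T_horizon, gamma = 5, 3, 20, 0.9
P0 = rng.dirichlet(np.ones(S))                      # initial state distribution P_0
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a] is a distribution over next states
r = rng.uniform(size=S)                             # reward r(s)
policy = rng.integers(A, size=S)                    # a fixed deterministic tabular policy pi(s)

def rollout(P0, P, policy):
    """Sample one episode s_0, ..., s_T and return its discounted return R(s)."""
    s = rng.choice(S, p=P0)
    ret = 0.0
    for t in range(T_horizon + 1):
        ret += gamma ** t * r[s]
        s = rng.choice(S, p=P[s, policy[s]])
    return ret

def empirical_return(P0, P, policy, n):
    """(1/n) sum_i R(s^i), the empirical return of Eq. (2) for a fixed policy."""
    return np.mean([rollout(P0, P, policy) for _ in range(n)])

n = 128
train_estimate = empirical_return(P0, P, policy, n)

# Approximate E_{s ~ D'_pi} R(s) with a large batch of test episodes, here under a
# slightly shifted initial distribution P0' (an external shift as in Eq. (3)).
P0_test = rng.dirichlet(np.ones(S))
test_expectation = empirical_return(P0_test, P, policy, 20_000)
print("generalization gap (Eq. 3) estimate:", train_estimate - test_expectation)
```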

3. Generalization in Reinforcement Learning using the triangle inequality. The first term in (5) is the v.s. Supervised Learning concentrationerror between the empirical reward and its ex- pectation. Since it is caused by intrinsic randomness of the Generalization has been well studied in the supervised environment, we call it the intrinsic error. Even if the test learning scenario. A popular assumption is that samples are environment shares the same distribution with training, in independent and identically distributed (xi,yi) ∼ D, ∀i ∈ the finite-sample scenario there is still a gap between train- {1, 2,...,n}. Similar to empirical return maximization dis- ing and testing. This is analogous to the overfitting problem cussed in Section 2, in supervised learning a popular algo- studied in supervised learning. Zhang et al. (2018b) mainly rithm is empirical risk minimization: focuses on this aspect of generalization. In particular, their n randomness is carefully controlled in experiments to only 1 fˆ = arg min ℓ(f, xi,yi), (4) come from the initial states s0 ∼P0. f n ∈F i=1 X We call the second term in (5) external error, as it is caused where f ∈ F : X → Y is the prediction function to be by shifts of the distribution in the environment. For exam- learned and ℓ : F×X×Y → R+ is the . Simi- ple, the transition distribution P or the initialization distri- larly generalization in supervised learning concerns the gap bution P0 may get changed during testing, which leads to a E between the expected loss [ℓ(f,x,y)] and the empirical different underlying episode distribution Dπ′ . This is analo- 1 n loss n i=1 ℓ(f, xi,yi). gous to the transfer learning problem. For instance, gener- alization as in Cobbe et al. (2018) is mostly external error It is easy to find the correspondence between the episodes P since the number of levels used for training and testing are defined in Section 2 and the samples (x ,y ) in supervised i i different even though the difficult level parameters are sam- learning. Just like supervised learning where (x, y) ∼ D, pled from the same distribution. The setting in Packer et al. in (episodic) reinforcement learning si ∼ D . Also the re- π (2018) covers both intrinsic and external errors. ward function R in reinforcement learning is similar to the loss function ℓ in supervised learning. However, reinforce- ment learning is different because 5. Why Intrinsic Generalization Error? If π is fixed, by concentrationof measures, as the numberof • In supervised learning, the sample distribution D is episodes n increases, the intrinsic error decreases roughly kept fixed, and the loss function ℓ ◦ f changes as we with 1 . For example, if the reward is bounded |R(si)|≤ choose different predictors f. √n c/2, by McDiarmid’s bound, with probability at least 1 − δ, • In reinforcement learning, the reward function R is n kept fixed, but the sample distribution Dπ changes as 2 1 i log δ we choose different policy maps π. R(s ) − Es [R(s)] ≤ c , (6) n ∼D s 2n i=1 X As a consequence, the training procedure in reinforcement where c > 0. Note the bound above also holds for the test learning is also different. Popular methods such as RE- samples if the distribution D is fixed and stest ∼ D. INFORCE (Williams, 1992), Q-learning (Sutton & Barto., 1998), and actor-critic methods (Mnih et al., 2016) draw For the population argument (1), π∗ is defined determinis- E tically since the value s π R(s) is a deterministic func- new states and episodes on the fly as the policy π is being ∼D updated. 
That is, the distribution D_π from which the episodes s^i are drawn always changes during optimization. In contrast, in supervised learning we only update the predictor f, without affecting the underlying sample distribution D. In the finite-sample case (2), however, the policy map πˆ is stochastic: it depends on the samples s^i. As a consequence, the underlying distribution D_πˆ is not fixed, and the expectation E_{s∼D_πˆ}[R(s)] in (6) becomes a random variable, so (6) no longer holds.
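Before that issue appears, the fixed-policy concentration in (6) is easy to check numerically. The sketch below uses i.i.d. bounded stand-ins for the episode returns R(s^i), an illustrative assumption that is valid only while π stays fixed and breaks down exactly when πˆ is fitted on the same episodes, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-in for R(s^i) under a *fixed* policy: i.i.d. bounded returns with |R| <= c/2.
def sample_returns(n, c=1.0):
    return rng.uniform(-c / 2, c / 2, size=n)

true_mean = 0.0                                  # E[R] for the uniform toy model
for n in [10, 100, 1_000, 10_000]:
    gaps = [abs(sample_returns(n).mean() - true_mean) for _ in range(200)]
    mcdiarmid = 1.0 * np.sqrt(np.log(2 / 0.05) / (2 * n))   # RHS of Eq. (6) with c = 1, delta = 0.05
    print(f"n={n:>6}  mean |empirical - expected| ~ {np.mean(gaps):.4f}   bound (6) = {mcdiarmid:.4f}")
```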

4. Intrinsic vs External Generalization Errors

The generalization gap (3) can be bounded as

    Φ ≤ |(1/n) Σ_{i=1}^{n} R(s^i) − E_{s∼D_πˆ} R(s)|   [intrinsic]
      + |E_{s∼D_πˆ} R(s) − E_{s∼D'_πˆ} R(s)|            [external]    (5)

One way of fixing the issue caused by the random D_πˆ is to prove a bound that holds uniformly for all policies π ∈ Π. If Π is finite, by applying a union bound, it follows that:

Lemma 1. If Π is finite, and |R(s)| ≤ c/2, then with probability at least 1 − δ, for all π ∈ Π,

    |(1/n) Σ_{i=1}^{n} R(s^i) − E_{s∼D_π}[R(s)]| ≤ c sqrt(log(2|Π|/δ) / (2n)),    (7)

where |Π| is the cardinality of Π.
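As a quick numerical illustration of Lemma 1, the right-hand side of (7) can be evaluated directly; the values of |Π|, c, δ, and n below are arbitrary assumptions.

```python
import numpy as np

# Right-hand side of Eq. (7): c * sqrt(log(2|Pi|/delta) / (2n)).
def lemma1_bound(card_pi, n, c=1.0, delta=0.05):
    return c * np.sqrt(np.log(2 * card_pi / delta) / (2 * n))

for card_pi in [10, 10**3, 10**6]:
    for n in [100, 10_000]:
        print(f"|Pi| = {card_pi:>8}, n = {n:>6}: bound = {lemma1_bound(card_pi, n):.3f}")
```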

Unfortunately in most of the applications, Π is not finite. Algorithm 1 Reparameterized MDP One difficulty in analyzing the intrinsic generalization er- Initialization: Sample g ,g ,g ,...,g ∼ G S . s = ror is that the policy changes during the optimization proce- init 0 1 T | | 0 argmax(g + log P ), R =0. dure. This leads to a change in the episode distribution D . init 0 π for t in 0,...,T do Usually π is updated using episodes generated from some R = R + γtr(s ) “previous” distributions, which are then used to generate t st+1 = argmax(gt + log P(st, π(st))) new episodes. In this case it is not easy to split episodes end for into a training and testing set, since during optimization return R. samples always come from the updated policy distribution.
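The on-policy sampling pattern described above can be made concrete with a minimal, purely illustrative sketch: each update consumes fresh episodes drawn from the current policy, so the data distribution moves with π and there is no fixed dataset of episodes to split into training and testing. The one-state softmax bandit and the REINFORCE-style update below are toy assumptions, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-state environment with |A| actions; an "episode" is a single reward draw.
A = 4
true_means = rng.uniform(size=A)
theta = np.zeros(A)                                # softmax policy parameters

def sample_action(theta):
    p = np.exp(theta - theta.max())
    p /= p.sum()
    return rng.choice(A, p=p), p

for step in range(500):
    # On-policy: every update draws *fresh* episodes from the *current* policy.
    a, p = sample_action(theta)
    reward = true_means[a] + 0.1 * rng.normal()
    grad_log_p = -p
    grad_log_p[a] += 1.0                           # grad_theta log pi(a) for a softmax policy
    theta += 0.1 * reward * grad_log_p             # REINFORCE-style update (no baseline)

_, p_final = sample_action(theta)
print("learned action probabilities:", np.round(p_final, 3))
print("true mean rewards:           ", np.round(true_means, 3))
```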

S 6. Reparameterization Trick In the reparameterized MDP procedure, G| | is an |S|- dimensional Gumbel distribution. g0,...,gT are |S|- The reparameterization trick has been popular in the op- dimensional vectors with each entry being a Gumbel ran- timization of deep networks (Kingma & Welling, 2014; dom variable. Also g0 + log P0 and gt + log P(st,at) are Maddison et al., 2017; Jang et al., 2017; Tokui & Sato, entry-wise vector sums, so they are both |S|-dimensional 2016) and used, e.g., for the purpose of optimization effi- vectors. arg max(v) returns the index of the maximum en- ciency. In RL, suppose the objective (1) is reparameteriz- try in the |S|-dimensional vector v. In the reparameterized able: MDP procedure shown above, the states st are represented as an index in . After reparameterization, we E s E s {1, 2,..., |S|} s π R( )= ξ p(ξ)R( (f(ξ, π))). 1 ∼D ∼ may rewrite the RL objective (2) as: n Then under some weak assumptions 1 πˆ =arg max R(si(gi; π)), (9) E s E s i |S|T ∇θ s π R( )= ∇θ ξ p(ξ)R( (f(ξ, πθ))) π Π,g n ∼D θ ∼ ∈ ∼G i=1 E s X = ξ p(ξ) [∇θR( (f(ξ, πθ)))] (8) i i i i i ∼ where g = [g0,g1,...,gT ], gt is an |S|-dimensional Gum- The reparameterization trick has already been used: for ex- bel random variable, and ample, PGPE (R¨uckstieß et al., 2010) uses policy reparam- T si i t i i i i eterization, and SVG (Heess et al., 2015) uses policy and R( (g ; π)) = γ r(st(g0,g1,...,gt; π)) (10) t=0 environment dynamics reparameterization. In this work, X we will show the reparameterization trick can help to an- is the discounted return for one episode of length T +1. alyze the generalization gap. More precisely, we will show The reparameterized objective (9) maximizes the empiri- that since both P and P are fixed, even if they are un- 0 cal reward by varying the policy . The distribution from known, as long as they satisfy some “smoothness” assump- π which the random variables i are drawn does not depend tions, we can provide theoretical guarantees on the test per- g on the policy anymore, and the policy only affects the formance. π π reward R(si(gi; π)) through the states si. 7. Reparameterized MDP The objective (9) is a discrete function due to the arg max operator. One way to circumvent this is to use Gumbel soft- We start our analysis with reparameterizing a Markov Deci- max to approximate the argmax operator (Maddison et al., sion Process with discrete states. We will give a general ar- 2017; Jang et al., 2017). If we denote s as a one-hot vec- S gument on reparameterizableRL in the next section. In this tor in R| |, and further relax the entries in s to take pos- section we slightly abuse notation by letting P0 and P(s,a) itive values that sum up to one, we may use the softmax denote |S|-dimensional probability vectors for multinomial to approximate the arg max operator. For instance, the distributions for initialization and transition respectively. reparametrized initial-state distribution becomes:

One difficulty in the analysis of the generalization in rein- exp{(g + log P0)/τ} s0 = , (11) forcement learning rises from the sampling steps in MDP k exp{(g + log P0)/τ}k1 where states are drawn from multinomial distributions spec- where g is an |S|-dimensional Gumbel random variable, P ified by either P or P(s ,a ), because the sampling proce- 0 0 t t is an |S|-dimensional probability vector in multinomial dis- dure does not explicitly connect the states and the distri- tribution, and τ is a positive scalar. As the temperature τ → bution parameters. We can use standard Gumbel random 0, the softmax approaches s = argmax(g + log P ) ∼P variables g ∼ exp(−g + exp(−g)) to reparameterize sam- 0 0 in terms of the one-hot vector representation. pling and get a procedure equivalent to classical MDPs but i i with slightly different expressions, as shown in Algorithm 1Again we abuse the notation by denoting s (f(g ; π)) as i i 1. s (g ; π). On the Generalization Gap in Reparameterizable RL
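Below is a minimal sketch of the reparameterized MDP of Algorithm 1, together with the Gumbel-softmax relaxation of (11). The tabular P_0, P, r, the toy tabular policy, and all sizes are illustrative assumptions; only the Gumbel-max and tempered-softmax steps follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

S, A, T_horizon, gamma = 6, 3, 15, 0.95
P0 = rng.dirichlet(np.ones(S))                     # initial distribution (illustrative toy MDP)
P = rng.dirichlet(np.ones(S), size=(S, A))         # P[s, a] is a distribution over next states
r = rng.uniform(size=S)
theta = rng.normal(size=(S, A))                    # toy tabular policy parameters

def pi(s, theta):
    return int(np.argmax(theta[s]))                # deterministic toy policy pi(s)

def gumbel(shape):
    """Standard Gumbel noise g = -log(-log(U)), U ~ Uniform(0, 1)."""
    return -np.log(-np.log(rng.uniform(size=shape)))

def reparameterized_rollout(theta, g_init, g):
    """Algorithm 1 (sketch): with the Gumbel noise pre-sampled, the episode and its return are
    deterministic functions of theta; argmax(g + log p) is a sample from Categorical(p)."""
    s = int(np.argmax(g_init + np.log(P0)))                    # s_0 = argmax(g_init + log P0)
    ret = 0.0
    for t in range(T_horizon + 1):
        ret += gamma ** t * r[s]
        s = int(np.argmax(g[t] + np.log(P[s, pi(s, theta)])))  # s_{t+1} = argmax(g_t + log P(s_t, pi(s_t)))
    return ret

def soft_init_state(g, tau=0.5):
    """Gumbel-softmax relaxation of the initial state, Eq. (11): a simplex vector that
    approaches the one-hot argmax(g + log P0) as tau -> 0."""
    z = np.exp((g + np.log(P0)) / tau)
    return z / z.sum()

g_init, g = gumbel(S), gumbel((T_horizon + 1, S))       # peripheral randomness, drawn once
print(reparameterized_rollout(theta, g_init, g))        # re-evaluating with a different theta
print(reparameterized_rollout(theta + 0.5, g_init, g))  # reuses the same noise
print(np.round(soft_init_state(gumbel(S), tau=0.1), 3))
```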

8. Reparameterizable RL expected and the empirical reward. In particular, the as- sumptions we make are3 In general, as long as the transition and initialization pro- R S A R S cess can be reparameterized so that the environment param- Assumption 1. T (s,a): | | × R| | → | | is Lt1- eters are separated from the random variables, the objective Lipschitz in terms of the first variable s, and Lt2-Lipschitz can always be reformulated so that the policy only affects in terms of the second variable a. That is, ∀x, x′,y,y′,z, the reward instead of the underlying distribution. The repa- kT (x,y,z) − T (x′,y,z)k≤ Lt1kx − x′k, rameterizable RL procedure is shown in Algorithm 2. kT (x,y,z) − T (x, y′,z)k≤ Lt2ky − y′k. Algorithm 2 Reparameterizzble RL Assumption 2. The policy is parameterized as π(s; θ): S m A R| | × R → R| |, and π(s; θ) is Lπ1-Lipschitz in terms Initialization: Sample ξ0, ξ1,...,ξT . s0 = I(ξ0), R = 0. of the states, and Lπ2-Lipschitz in terms of the parameter Rm for t in 0,...,T do θ ∈ , that is, ∀s,s′,θ,θ′ R = R + γtr(s ) t kπ(s; θ) − π(s′; θ)k≤ Lπ1ks − s′k, s = T (s , π(s ), ξ ) t+1 t t t kπ(s; θ) − π(s; θ )k≤ L kθ − θ k. end for ′ π2 ′ return R. S Assumption 3. The reward r(s): R| | → R is Lr- Lipschitz:

In this procedure, ξs are d-dimensional random variables |r(s′) − r(s)|≤ Lrks′ − sk. but they are not necessarily sampled from the same dis- tribution.2 In many scenarios they are treated as random If assumptions (1)(2)and (3) hold, we have the following: Rd R S noise. I : → | | is the initialization function. Dur- Theorem 1. In reparameterizable RL, suppose the tran- ing initialization, the random variable ξ is taken as input 0 sition T ′ in the test environment satisfies ∀x, y, z, k(T ′ − and the output is an initial state s0. The transition function T )(x,y,z)k ≤ ζ, and suppose the initialization function S A d S T : R ×R ×R → R , takes the current state st, the | | | | | | I′ in the test environment satisfies ∀ξ, k(I′ − I)(ξ)k ≤ ǫ. action produced by the policy π(st), and a random variable If assumptions (1),(2)and(3) hold, the peripheral random ξt to produce the next state st+1. variables ξi for each episode are i.i.d., and the reward is In reparameterizable RL, the peripheral random variables bounded |R(s)|≤ c/2, then with probability at least 1 − δ, for all policies : ξ0, ξ1,...,ξT can be sampled before the episode is gener- π ∈ Π ated. In this way, the randomnessis decoupledfromthe pol- 1 i |E [R(s(ξ; π, T ′, I′))] − R(s(ξ ; π, T , I))| icy function, and as the policy π gets updated, the episodes ξ n i can be computed deterministically. X T t T t ν − 1 t t The class of reparamterizable RL problems includes those ≤ Rad(Rπ, , )+ Lrζ γ + Lrǫ γ ν T I ν − 1 whose initial state, transition, reward and optimal policy t=0 t=0 distribution can be reparameterized. Generally, a distribu- X X log(1/δ) tion can be reparameterized, e.g., if it has a tractable in- + O c , verse CDF, is a composition of reparameterizable distribu- r n ! tions (Kingma & Welling, 2014), or is a limit of smooth where ν = Lt1 + Lt2Lπ1, and Rad(Rπ, , ) = approximators (Maddison et al., 2017; Jang et al., 2017). n T I E E sup 1 σ R(si(ξi; π, T , I)) is the Reparametrizable RL settings include LQR (Lewis et al., ξ σ π n i=1 i Rademacher complexity of R(s(ξ; π, T , I)) under the 1995) and physical systems (e.g., robotics) where the dy-   training transitionP T , the training initialization I, and n is namics are given by stochastic partial differential equations the number if training episodes. (PDE) with reparameterizable components over continuous state-action spaces. Note the i.i.d. assumption on the peripheral variables ξi is across episodes. Within the same episode, there could be i 9. Main Result correlations among the ξts at different time steps. For reparameterizable RL, if the environments and the Similar arguments can also be made when the transition T ′ policy are “smooth”, we can control the error between the in the test environment stays the same as T , but the initial- ization I′ is different from I. In the following sections we 2They may also have different dimensions. In this work, with- will bound the intrinsic and external errors respectively. out loss of generality, we assume the random variables have the 3 m same dimension d. k · k is the L2 norm, and θ ∈ R . On the Generalization Gap in Reparameterizable RL
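A sketch of the generic reparameterizable rollout of Algorithm 2 for a continuous toy system follows. The linear dynamics, Gaussian peripheral noise, linear policy, and norm-based reward are illustrative assumptions, chosen so that I, T, π and r are Lipschitz in the sense of Assumptions 1-3.

```python
import numpy as np

rng = np.random.default_rng(0)

dim_s, dim_a, d, T_horizon, gamma = 4, 2, 4, 30, 0.95
A_mat = 0.9 * np.eye(dim_s)                            # toy stable linear dynamics (illustrative)
B_mat = rng.normal(scale=0.1, size=(dim_a, dim_s))
theta = rng.normal(scale=0.1, size=(dim_a, dim_s))     # linear policy parameters

def I(xi0):
    """Initialization s_0 = I(xi_0); here a scaled copy of the noise (Lipschitz in xi_0)."""
    return 0.5 * xi0

def pi(s, theta):
    """pi(s; theta) = theta s, Lipschitz in s and, on bounded states, in theta."""
    return theta @ s

def T(s, a, xi):
    """Transition s_{t+1} = T(s_t, a_t, xi_t): linear map plus reparameterized additive noise."""
    return A_mat @ s + a @ B_mat + 0.05 * xi

def r(s):
    """A 1-Lipschitz reward: negative distance to the origin."""
    return -np.linalg.norm(s)

def reparameterized_return(theta, xi):
    """Algorithm 2 (sketch): with the peripheral noise xi_0, ..., xi_T fixed up front, the whole
    episode, and hence R, is a deterministic function of the policy parameters theta."""
    s, ret = I(xi[0]), 0.0
    for t in range(T_horizon + 1):
        ret += gamma ** t * r(s)
        s = T(s, pi(s, theta), xi[t])
    return ret

xi = rng.normal(size=(T_horizon + 1, d))               # xi_0, ..., xi_T sampled before the episode
print(reparameterized_return(theta, xi))               # same xi, different theta:
print(reparameterized_return(theta + 0.01, xi))        # the return changes deterministically
```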

10. Bounding Intrinsic Generalization Error In the context of , deep neural networks are over-parameterized models that have proven to work well After reparameterization, the objective (9) is essentially in many applications. However, the bound above does the same as an empirical risk minimization problem in not explain why over-parameterized models also general- the supervised learning scenario. According to classical ize well since the Rademacher complexity bound (14) can learning theory, the following lemma is straight-forward be extremely large as m grows. To ameliorate this is- (Shalev-Shwartz & Ben-David, 2014): sue, recently Arora et al. (2018) proposed a compression Lemma 2. If the reward is bounded, |R(s)|≤ c/2,c> 0, approach that compresses a neural network to a smaller i S T and g ∼ G| |× are i.i.d. for each episode, with probabil- one with fewer parameters but has roughly the same train- ity at least 1 − δ, for ∀π ∈ Π: ing errors. Whether this also applies to reparameterizable RL is yet to be proven. There are also trajectory-based E s 1 si i | g |S|×T [R( (g; π))] − R( (g ; π))| techniques proposed to sharpen the generalization bound ∼G n i (Li et al., 2018; Allen-Zhu et al., 2018; Arora et al., 2019; X Cao & Gu, 2019). log(1/δ) ≤ Rad(R )+ O c , (12) π n r ! 10.1. PAC-Bayes Bound on Reparameterizable RL E E 1 n si i where Rad(Rπ) = g σ supπ n i=1 σiR( (g ; π)) We can also analyze the Rademacher complexity of the em- is the Rademacher complexity of R(s(g; π)). pirical return by making a slightly different assumption on  P  the policy. Suppose π is parameterized as π(θ), and θ is The bound (12) holds uniformly for all π ∈ Π, sampled from some posterior distribution θ ∼ Q. Accord- so it also holds for πˆ. Unfortunately, in MDPs ing to the PAC-Bayes theorem (McAllester, 1998; 2003; Rad(Rπ) is hard to control, mainly due to the recur- Neyshabur et al., 2018; Langford & Shawe-Taylor, 2002): sive argmax in the representation of the states, s = t+1 Lemma 5. Given a “prior” distribution D , with probabil- argmax(g + log P(s , π(s ))). 0 t t t ity at least 1 − δ over the draw of n episodes, ∀Q: On the other hand, for general reparameterizable RL we 1 i may control the intrinsic generalization gap by assuming Eg[Rθ (g)] ≥ Rθ (g ) ∼Q n ∼Q some “smoothness” conditions on the transitions T , as well i as the policy π. In particular, it is straight-forward to prove X 2n 2(KL(Q||D0) + log ) that the empirical return R is “smooth” if the transitions − 2 δ , (15) and policies are all Lipschitz. s n − 1 i i i Lemma 3. For reparameterizable RL, given assumptions Rθ (g )= Eθ R(s (g ; π(θ))) ∼Q ∼Q 1, 2, and 3, the empirical return R defined in (10), as a T E  t i i  function of the parameter θ, has a Lipschitz constant of = θ γ r(st(g ; π(θ))) , (16) ∼Q "t=0 # T t X t ν − 1 β = LrLt2 Lπ2 γ , (13) where Rθ (g) is the expected “Bayesian” reward. ν − 1 ∼Q t=0 X The bound (15) holds for all posterior Q. In particular it where ν = Lt1 + Lt2Lπ1. holds if Q is θ + u where θ could be any solution pro- vided by empirical return maximization, and u is a pertur- Also, if the number of parameters m in π(θ) is bounded, bation, e.g., zero-centered uniform or Gaussian distribution. then the Rademacher complexity Rad(R ) in Lemma 2 π This suggests maximizing a perturbed objective instead can be controlled (van der Vaart., 1998; Bartlett, 2013). may lead to better generalization performance, which has Lemma 4. For reparameterizable RL, given assumptions already been observed empirically (Wang et al., 2018b). 
1, 2, and 3, if the parameters θ ∈ Rm is bounded such that kθk≤ 1, and the function class of the reparameterized The tricky part about perturbing the policy is choosing the reward R is closed under negations, then the Rademacher level of noise. Suppose there is an empirical reward opti- mizer π(θˆ). When the noise level is small, the first term in complexity Rad(Rπ) is bounded by (15) is large, but the second term may also be large since ˆ m the posterior Q is too focused on θ but the “prior” D0 can- Rad(Rπ)= O β (14) ˆ n not depend on θ, and vice versa. On the other hand, if the  r  reward function is “nice”, e.g., if some “smoothness” as- where β is the Lipschitz constant defined in (13), and n is sumption holds in a local neighborhood of θˆ, then one can the number of episodes. prove the optimal noise level roughly scales inversely as On the Generalization Gap in Reparameterizable RL the square root of the local Hessian diagonals (Wang et al., Table 1. Intrinsic Gap versus Smoothness 2018a). Temperature Policy State Action 1 ˆl τ Gap τ Πlkθ kF Gap Gap 11. Bounding External Generalization Error 0.001 0.554 2.20 · 106 0.632 0.612 0.01 0.494 4.46 · 105 0.632 0.608 Another source of generalization error in RL comes from 5 the change of environment. For example, in an MDP 0.1 0.482 1.74 · 10 0.633 0.603 1 0.478 8.83 · 104 0.598 0.598 (S, A, P, r, P ), the transition probability P or the initial- 0 10 0.479 5.06 · 104 0.588 0.594 ization distribution P is different in the test environment. 0 100 0.468 4.77 · 104 0.581 0.594 Cobbe et al. (2018) and Packer et al. (2018) show that as 4 the distribution of the environment varies the gap between 1000 0.471 3.29 · 10 0.590 0.594 the training and testing could be huge. Indeed if the test distribution is drastically different from The other possible environment change is that the test ini- the training environment, there is no guarantee the perfor- tialization I stays the same but the transition changes from mance of the same model could possibly work for testing. the training transition T to T ′. Similar to before, we have: On the other hand, if the test distribution D′ is not too far Lemma 7. In reparameterizable RL, suppose the transi- away from the training distribution D then the test error can tion T ′ in the test environment satisfies ∀x, y, z, k(T ′ − still be controlled. For example, for supervised learning, T )(x,y,z)k ≤ ζ, and the initialization I in the test en- Mohri & Medina (2012) prove the expected loss of a drift- vironment is the same as training. If assumptions (1),(2) ing distribution is also bounded. In addition to Rademacher and (3) hold then complexity and a concentration tail, there is one more term E s E s in the gap that measures the discrepancy between the train- | ξ[R( (ξ; T ′))] − ξ[R( (ξ; T ))]| ing and testing distribution. T νt − 1 ≤ L ζ γt (18) For reparameterizable RL, since the environment parame- r ν − 1 t=0 ters are lifted into the reward function in the reformulated X objective (9), the analysis becomes easier. For MDPs, a where ν = Lt1 + Lt2Lπ1. small change in environment could cause large difference in the reward since arg max is not continuous. However The difference between (18)and(17) is that the change ζ in if the transition function is “smooth”, the expected reward transition T is further enlarged during an episode: as long in the new environment can also be controlled. 
e.g., if we as ν > 1, the gapin (18) is larger and can become huge as assume the transition function T , the reward function r, as the length T of the episode increases. well as the policy function π are all Lipschitz, as in section 10. 12. Simulation If the transition function T is the same in the test environ- We now present empirical measurements in simulations to ment and the only difference is the initialization, we can verify some claims made in section 10 and 11. The bound prove the following lemma: (14) suggests the gap between the expected reward and the Lemma 6. In reparameterizable RL, suppose the ini- empirical reward is related to the Lipschitz constant β of R, which according to equation (13) is related to the Lipschitz tialization function I′ in the test environment satisfies constant of a series of functions including , , and . ∀ξ, k(I′ − I)(ξ)k ≤ ζ for ζ > 0, and the transition func- π T r tion T in the test environment is the same as training. If assumptions (1),(2), and (3) hold, then: 12.1. Intrinsic Generalization Gap In (13), as the length of the episode T increases, the dom- |Eξ[R(s(ξ; I′))] − Eξ[R(s(ξ; I))]| inating factors in β are Lt1, Lt2 and Lπ1. Our first sim- T ulation fixes the environment and verifies L . In the sim- t t π ≤ Lrζ γ (Lt1 + Lt2Lπ1) (17) ulation, we assume the initialization I and the transition t=0 X T are all known and fixed. I is an identity function, and S ξ0 ∈ R| | is a vector of i.i.d. uniformly distributed ran- Lemma 6 means that if the initialization in the test environ- dom variables: ξ0[k] ∼ U[0, 1], ∀k ∈ 1,... |S|. The transit S ment is not too different from the training one, and if the function is T (s,a,ξ)= sT1 + aT2 + ξT3, where s ∈ R| |, A 2 S S transition, policy and reward functions are smooth, then the a ∈ R| |, ξ ∈ R are row vectors, and T1 ∈ R| |×| |, A S 2 S expected reward in the test environmentwon’t deviate from T2 ∈ R| |×| |, and T3 ∈ R ×| | are matrices used to that of training too much. project the states, actions, and noise respectively. T1, T2, On the Generalization Gap in Reparameterizable RL and T3 are randomly generated and then kept fixed during Params 65.6k 131.3k 263.2k 583.4k 1.1m the experiment. We use γ = 1 as the discounting constant Gap 0.204 0.183 0.214 0.336 0.418 throughout. Table 2. Empirical gap vs #policy params. The policy π(s,θ) is modeled using a multiple layer per- ζ in I 1 10 100 1,000 ceptron (MLP) with rectified linear as the activation. The Gap 0.481 0.477 0.659 0.532 last layer of MLP is a linear layer followed by a softmax x[k] exp τ Table 3. Empirical generalization gap vs shift in initialization. function with temperature: q(x[k]; τ)= x[k] . Pk exp τ ζ in T 1 10 100 1,000 By varying the temperature we are able to control the τ Gap 11 451 8,260 73,300 Lipschitz constant of the policy class Lπ1 and Lπ2 if we as- sume the bound on the parameters kθk≤ B is unchanged. Table 4. Empirical generalization gap vs shift in transition. We set the length of the episode T = 128, and randomly sample ξ0, ξ1,...,ξT for n = 128 training and testing also vary the smoothness in the transition function a a func- episodes. Then we use the same random noise to evalu- tion of states (T1), and actions (T2), by applying softmax ate a series of policy classes with different temperatures with different temperatures τ to the singular values of the τ ∈{0.001, 0.01, 0.1, 1, 10, 100, 1000}. randomly generated matrix. 
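The simulated setup of this section can be sketched as follows: identity initialization, the linear transition T(s, a, ξ) = sT1 + aT2 + ξT3, and a bias-free MLP policy with a tempered softmax. The network width, the bounded reward, and the projection scales below are illustrative assumptions, and the last few lines mimic, loosely, the external shift of Section 12.2 by perturbing T1 with a rescaled Rademacher matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

S, A, d, T_horizon, gamma = 8, 4, 2, 128, 1.0
T1 = rng.normal(scale=0.5 / np.sqrt(S), size=(S, S))   # fixed random projections; the scales are
T2 = rng.normal(scale=0.5 / np.sqrt(A), size=(A, S))   # illustrative, chosen to keep states bounded
T3 = rng.normal(scale=0.5 / np.sqrt(d), size=(d, S))
W1 = rng.normal(scale=0.1, size=(S, 32))               # bias-free two-layer MLP policy (width 32 assumed)
W2 = rng.normal(scale=0.1, size=(32, A))

def softmax_temp(x, tau):
    """q(x[k]; tau) = exp(x[k]/tau) / sum_k exp(x[k]/tau); smaller tau gives a sharper, less smooth policy."""
    z = np.exp((x - x.max()) / tau)
    return z / z.sum()

def policy(s, tau):
    return softmax_temp(np.maximum(s @ W1, 0.0) @ W2, tau)

def transition(s, a, xi, T1=T1):
    """T(s, a, xi) = s T1 + a T2 + xi T3, the linear transition of Section 12.1."""
    return s @ T1 + a @ T2 + xi @ T3

def episode_return(xi0, xis, tau, T1=T1, reward=lambda s: np.tanh(s).mean()):
    """One reparameterized episode: s_0 = xi_0 (identity initialization), then the linear
    transition; the bounded reward is an illustrative assumption."""
    s, ret = xi0, 0.0
    for t in range(T_horizon + 1):
        ret += gamma ** t * reward(s)
        s = transition(s, policy(s, tau), xis[t], T1)
    return ret

xi0 = rng.uniform(size=S)                         # xi_0[k] ~ U[0, 1], I = identity
xis = rng.uniform(size=(T_horizon + 1, d))        # peripheral transition noise
for tau in [0.01, 0.1, 1.0, 10.0]:
    print(f"tau={tau:>5}: return = {episode_return(xi0, xis, tau):.3f}")

# Loose analogue of the external shift in Section 12.2: perturb the transition by a Rademacher
# matrix rescaled to norm zeta and re-evaluate with the SAME peripheral noise.
zeta = 1.0
delta = rng.choice([-1.0, 1.0], size=T1.shape)
delta *= zeta / np.linalg.norm(delta)
print("shifted-transition return:", episode_return(xi0, xis, tau=1.0, T1=T1 + delta))
```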
Since we assume I and T are known, during training the Table 1 shows the average generalization gap roughly de- 1 ˆl computation graph is complete. Hence we can directly op- creases as τ decreases. The metric τ Πlkθ kF also de- timize the coefficients θ in π(s; θ) just as in supervised creases similarly as the average gap. In particular, the 2nd learning.4 We use Adam (Kingma & Ba, 2015) to optimize and 3rd column shows the average gap as the policy be- 2 3 with initial learning rates 10− and 10− . When the reward comes “smoother”. The 4th column shows, if we fix the stops increasing we halved the learning rate. and analyze policy-τ as well as setting T2 = 1, the generalization gap the gap between the average training and testing reward. decreases as we increase the transition-τ for T1 (states). Similarly the last column is the gap as the transition- for First, we observe the gap is affected by the optimization τ actions ( ) varies. In Table 2 the environment is fixed and procedure. For example, different learning rates can lead T2 for each parameter configuration the gap is averaged from to different local optima, even if we decrease the learning trials with randomly initialized and then optimized poli- rate by half when the reward does not increase. Second, 100 cies. even if we know the environment I and T , so that we can optimize the policy π(s; θ) directly, we still experience un- stable learning just like other RL algorithms. This suggests 12.2. External Generalization Gap that the unstableness of the RL algorithms may not rise To measure the external generalization gap, we vary the from the estimation of the environmentfor the model based transition T as well as the initialization I in the test envi- algorithms such as A2C and A3C (Mnih et al., 2016), since ronment. For that, we add a vector of Rademacher random even if we know the environment the learning is still unsta- variables ∆ to I or T , with k∆k = ζ. We adjust the level ble. of noise δ in the simulation and report the change of the Given the unstable training procedure, for each trial we ran average gap in Table 3 and Table 4. It is not surprising that the training for 1024 epochs with learning rate of 1e-2 and the change ∆T in transition T leads to a higher generaliza- 1e-3, and the one with higher training reward at the last tion gap since the impact from ∆T is accumulated across epoch is used for reporting. Ideally as we vary τ, the Lip- time steps. Indeed if we compare the bound (18) and (17), schitz constant for the function class π ∈ Π is changed when γ =1 as long as ν > 1,the gapin (18) is larger. accordingly given the assumption kθk≤ B. However, it is unclear if B is changed or not for different configurations. 13. Discussion and Future Work After all, the assumption that the parameters are bounded is artificial. To ameliorate this defect we also check the Even though a variety of distributions, discrete or continu- 1 l l ous, can be reparameterized, and we have shown that the metric τ Πlkθ kF , where θ is the weight matrix of the lth layer of MLP. In our experimentthere is no bias term in the classical MDP with discrete states is reparameterizable, it 1 l is not clear in general under which conditions reinforce- linear layers in MLP, so Πlkθˆ kF can be used as a metric τ ment learning problems are reparameterizable. Classifying on the Lipschitz constant L at the solution point θˆ. 
We π1 particular cases where RL is not reparameterizable is an in- 4In real applications this is not doable since T and I are un- teresting direction for future work. Second, the transitions known. Here we assume they are known just to investigate the of discrete MDPs are inherently non-smooth, so Theorem generalization gap. 1 does not apply. In this case, the PAC-Bayes bound can be On the Generalization Gap in Reparameterizable RL applied, but this requires a totally different framework. It Cobbe, K., Klimov, O., Hesse, C., Kim, T., and will be interesting to see if there is a “Bayesian” version of Schulman, J. Quantifying generalization in Theorem 1. Finally, our analysis only covers “on-policy” reinforcement learning. CoRR, 2018. URL RL. Studying generalization for “off-policy” RL remains http://arxiv.org/abs/1812.02341. an interesting future topic. Dann, C., Lattimore, T., and Brunskill, E. Unifying pac and regret: Uniform pac bounds for episodic reinforcement References learning. International Conference on Neural Informa- Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and tion Processing Systems (NIPS), 2017. Schapire, R. Taming the monster: A fast and simple al- gorithm for contextual bandits. International Conference Farebrother, J., Machado, M. C., and Bowling, M. Gener- on , 2014. alization and regularization in dqn. CoRR, 2018. URL https://arxiv.org/abs/1810.00123. Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and gen- eralization in overparameterized neural networks, going Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., beyond two layers. CoRR, abs/1811.04918, 2018. and Tassa, Y. Learning continuous control policies by stochastic value gradients. Advances in Neural Informa- Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger tion Processing Systems, 2015. generalization bounds for deep nets via a compression approach. International Conference on Machine Learn- Jang, E., Gu, S., and Poole, B. Categorical reparameteri- ing, 2018. zation by gumbel-softmax. International Conference on Learning Representations, 2017. Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine- grained analysis of optimization and generalization for Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., overparameterized two-layer neural networks. Interna- and Schapire, R. E. Contextual decision processes with tional Conference on Machine Learning, 2019. low Bellman rank are PAC-learnable. International Con- ference on Machine Learning, 2017. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time anal- ysis of the multiarmed bandit problem. Maching Learn- Justesen, N., Torrado, R. R., Bontrager, P., Khalifa, A., ing, 2002. Togelius, J., and Risi, S. Illuminating generalization in deep reinforcement learning through procedural level Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret generation. NeurIPS Deep RL Workshop, 2018. bounds for reinforcement learning. Advances in Neural Information Processing Systems 21, 2009. Kearns, M. and Singh, S. Near-optimal reinforcement Azar, M. G., Osband, I., and Munos, R. Minimax regret learning in polynomial time. Mache Learning, 2002. International Con- bounds for reinforcement learning. Kingma, D. P. and Ba, J. Adam: A method for stochas- ference on Machine Learning , 2017. tic optimization. International Conference on Learning Baker, B., Gupta, O., Naik, N., and Raskar, R. Designing Representations, 2015. neural network architectures using reinforcement learn- Kingma, D. P. and Welling, M. Auto-encoding variational ing. 2017. bayes. 
International Conference on Learning Represen- Bartlett, P. Lecture notes on theoretical statistics. 2013. tations, 2014.

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Kober, J., Bagnell, J. A., and Peters, J. Reinforcement Schapire, R. Contextual bandit algorithms with super- learning in robotics: A survey. International Journal of vised learning guarantees. Proceedings of the Fourteenth Robotic Research, 2013. International Conference on Artificial Intelligence and Statistics, 2011. Langford, J. and Shawe-Taylor, J. Pac-bayes & margins. In- ternational Conference on Neural Information Process- Cao, Y. and Gu, Q. A generalization theory of gradient ing Systems (NIPS), 2002. descent for learning over-parameterized deep relu net- works. CoRR, abs/1902.01384, 2019. Laroche, R. Transfer reinforcement learning with shared dynamics. 2017. Chakravorty, S. and Hyland, D. C. Minimax reinforcement learning. American Institute of Aeronautics and Astro- Lattimore, T. and Hutter, M. Near-optimal PAC bounds for nautic, 2003. discounted mdps. Theoretical Computer Science, 2014. On the Generalization Gap in Reparameterizable RL

Lazaric, A. Transfer in reinforcement learning: a frame- Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., work and a survey. Reinforcement Learning - State of Fearon, R., Maria, A. D., Panneershelvam, V., the Art, Springer, 2012. Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Lewis, F., Syrmos, V., and Syrmos, V. Opti- Massively parallel methods for deep reinforcement mal Control. A Wiley-interscience publication. learning. CoRR, abs/1507.04296, 2015. URL Wiley, 1995. ISBN 9780471033783. URL http://arxiv.org/abs/1507.04296. https://books.google.com/books?id=jkD37elP6NIC. Neyshabur, B., Bhojanapalli, S., and Srebro, N. A Li, L., Chu, W., Langford, J., and Schapire, R. E. A pac-bayesian approach to spectrally-normalized margin contextual-bandit approach to personalized news article bounds for neural networks. International Conference recommendation. Proceedings of the 19th International on Learning Representations (ICLR), 2018. Conference on World Wide Web, 2010. OpenAI. Openai five. Li, Y., Ma, T., and Zhang, H. Algorithmic regularization in https://blog.openai.com/openai-five/, over-parameterized matrix recovery, 2018. 2018.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete Packer, C., Gao, K., Kos, J., Kr¨ahenb¨uhl, P., Koltun, distribution: a continuous relaxation of discrete random V., and Song, D. Assessing generalization in variables. International Conference on Learning Repre- deep reinforcement learning. CoRR, 2018. URL sentations, 2017. https://arxiv.org/abs/1810.12282.

Majeed, S. J. and Hutter, M. Performance guarantees R¨uckstieß, T., Sehnke, F., Schaul, T., Wierstra, D., Sun, for homomorphisms beyond markov decision processes. Y., and Schmidhuber, J. Exploring parameter space in CoRR, abs/1811.03895, 2018. reinforcement learning. Paladyn, 2010.

Mao, H., Alizadeh, M., Menache, I., and Kandula, S. Re- Shalev-Shwartz, S. and Ben-David, S. Understanding Ma- source management with deep reinforcement learning. chine Learning: From Theory to Algorithms. Cambridge 2016. University Press, New York, NY, USA, 2014. ISBN 1107057132, 9781107057135. McAllester, D. A. Some pac-bayesian theorems. Confer- ence on Learning Theory (COLT), 1998. Shani, G., Brafman, R. I., and Heckerman, D. An mdp- based recommender system. The Journal of Machine McAllester, D. A. Simplified pac-bayesian margin bounds. Learning Research, 2005. Conference on Learning Theory (COLT), 2003. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Mirhoseini, A., Goldie, A., Pham, H., Steiner, van den Driessche, G., Schrittwieser, J., Antonoglou, I., B., Le, Q. V., and Dean, J. Hierarchical Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, planning for device placement. 2018. URL D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, https://openreview.net/pdf?id=Hkc-TeZ0W. T., Leach, M., Kavukcuoglu, K., Graepel, T., and Has- sabis, D. Mastering the game of go with deep neural Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve- networks and tree search. Nature, 2016. ness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wier- M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Grae- stra, D., Legg, S., and Hassabis, D. Human-level control pel, T., Lillicrap, T. P., Simonyan, K., and Hassabis, D. through deep reinforcement learning. Nature, 2015. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, 2017. URL Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, http://arxiv.org/abs/1712.01815. T., Harley, T., Silver, D., and Kavukcuoglu, K. Asyn- chronous methods for deep reinforcement learning. In- Strehl, A. L., Li, L., and Littman, M. L. Reinforcement ternational Conference on Machine Learning, 2016. learning in finite mdps: Pac analysis. Journal of Machine Learning Research, 2009. Mohri, M. and Medina, A. M. New analysis and algo- rithm for learning with drifting distributions. Algorith- Sutton, R. and Barto., A. Reinforcement Learning: An In- mic Learning Theory, 2012. troduction. MIT Press, 1998. On the Generalization Gap in Reparameterizable RL

Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. 1995.

Taylor, M. E. and Stone, P. Transfer learning for reinforce- ment learning domains: A survey. J. Mach. Learn. Res., 2009. Tokui, S. and Sato, I. Reparameterization trick for discrete variables. CoRR, 2016. URL https://arxiv.org/abs/1611.01239. van der Vaart., A. Asymptotic Statistics.. Cambridge, 1998. Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., K¨uttler, H., Agapiou, J., Schrittwieser, J., Quan, J., Gaffney, S., Petersen, S., Simonyan, K., Schaul, T., van Has- selt, H., Silver, D., Lillicrap, T. P., Calderone, K., Keet, P., Brunasso, A., Lawrence, D., Ekermo, A., Repp, J., and Tsing, R. Starcraft II: A new chal- lenge for reinforcement learning. CoRR, 2017. URL http://arxiv.org/abs/1708.04782. Wang, H., Keskar, N. S., Xiong, C., and Socher, R. Identifying generalization prop- erties in neural networks. 2018a. URL https://openreview.net/forum?id=BJxOHs0cKm. Wang, J., Liu, Y., and Li, B. Reinforcement learning with perturbed rewards. CoRR, abs/1810.01032, 2018b. URL http://arxiv.org/abs/1810.01032. Whiteson, S., Tanner, B., Taylor, M. E., and Stone, P. Pro- tecting against evaluation overfitting in empirical rein- forcement learning. 2011 IEEE Symposium on Adap- tive Dynamic Programming and Reinforcement Learn- ing (ADPRL), 2011. Williams, R. J. Simple statistical gradient-following al- gorithms for connectionist reinforcement learning. Ma- chine Learning, 1992. Zhan, Y. and Taylor, M. E. Online transfer learning in re- inforcement learning domains. CoRR, abs/1507.00436, 2015. Zhang, A., Ballas, N., and Pineau, J. A dis- section of overfitting and generalization in continu- ous reinforcement learning. CoRR, 2018a. URL https://arxiv.org/abs/1806.07937.

Zhang, C., Vinyals, O., Munos, R., and Ben- gio, S. A study on overfitting in deep re- inforcement learning. CoRR, 2018b. URL http://arxiv.org/abs/1804.06893. On the Generalization Gap in Reparameterizable RL

A. Proof of Lemma 3 By assumption 3, r(s) is Lr-Lipschitz, so Lemma. For Reparameterizable RL, given assumptions 1, kr(s ) − r(s )k≤ L ks − s k 2, and 3, the empirical reward R defined in (10), as a func- t′ t r t′ t t tion of the parameter θ, has a Lipschitz constant of ν − 1 ≤ L L L kθ′ − θk r t2 π2 ν − 1 T νt − 1 β = γtL L L r t2 π2 ν − 1 So the reward t=0 X T T where ν = Lt1 + Lt2Lπ1. t t |R(s′) − R(s)| = | γ r(st′ ) − γ r(st)| t=0 t=0 X X T T Proof. Let’s denote st′ = st(θ′), and st = st(θ). We start t t by investigating the policy function across different time ≤ | γ (r(st′ ) − r(st))|≤ γ |r(st′ ) − r(st))| t=0 t=0 steps: X X T t t ν − 1 kπ(s ; θ ) − π(s ; θ)k ≤ γ LrLt2 Lπ2 kθ′ − θk = βkθ′ − θk t′ ′ t ν − 1 t=0 = kπ(st′ ; θ′) − π(st; θ′)+ π(st; θ′) − π(st; θ)k X

≤kπ(st′ ; θ′) − π(st; θ′)k + kπ(st; θ′) − π(st; θ)k

≤ Lπ1kst′ − stk + Lπ2kθ′ − θk (19)

The first inequality is the triangle inequality, and the second B. Proof of Lemma 6 is from our Lipschitz assumption 2. Lemma. In reparameterizable RL, suppose the initializa- If we look at the change of states as the episode proceeds: tion function I′ in the test environment satisfies k(I′ − I)(ξ)k ≤ δ, and the transition function is the same for kst′ − stk both training and testing environment. If assumptions (1), (2), and (3) hold then = kT (st′ 1, π(st′ 1; θ′), ξt 1) − T (st 1, π(st 1; θ), ξt 1)k − − − − − − ≤ kT (st′ 1, π(st′ 1; θ′), ξt 1) − T (st 1, π(st′ 1; θ′), ξt 1)k − − − − − − |Eξ[R(s(ξ; I′))] − Eξ[R(s(ξ; I))]|≤ + kT (st 1, π(st′ 1; θ′), ξt 1) − T (st 1, π(st 1; θ), ξt 1)k − − − − − − T t t ≤ Lt1kst′ 1 − st 1k + Lt2kπ(st′ 1; θ′) − π(st 1; θ)k γ L (L + L L ) δ − − − − r t1 t2 π1 (20) t=0 X

Now combine both (19)and (20), Proof. Denote the states at time t with I′ as the initializa- tion function as st′ . Againwe lookat the differencebetween kst′ − stk st′ and st. By triangle inequality and assumptions 1 and 2, ≤ Lt1kst′ 1 − st 1k − − + Lt2(Lπ1kst′ 1 − st 1k + Lπ2kθ′ − θk) ks′ − stk − − t ≤ (L + L L )ks′ − s k + L L kθ′ − θk t1 t2 π1 t 1 t 1 t2 π2 = kT (st′ 1, π(st′ 1), ξt 1) − T (st 1, π(st 1), ξt 1)k − − − − − − − − ≤ kT (st′ 1, π(st′ 1), ξt 1) − T (st 1, π(st′ 1), ξt 1)k − − − − − − In the initialization, we know s′ = s since the initializa- 0 0 + kT (st 1, π(st′ 1), ξt 1) − T (st 1, π(st 1), ξt 1)k tion process does not involve any computation using the − − − − − − ≤ Lt1kst′ 1 − st 1k + Lt2kπ(st′ 1) − π(st 1)k parameter θ in the policy π. − − − − ≤ Lt1kst′ 1 − st 1k + Lt2Lπ1kst′ 1 − st 1k By recursion, we get − − − − = (Lt1 + Lt2Lπ1)kst′ 1 − st 1k − − t t 1 ≤ (Lt1 + Lt2Lπ1) ks0′ − s0k − t kst′ − stk≤ Lt2 Lπ2kθ′ − θk (Lt1 + Lt2Lπ1) t ≤ (Lt1 + Lt2Lπ1) δ t=0 X νt − 1 = L L kθ′ − θk where the last inequality is due to the assumption that t2 π2 ν − 1

ks′ − s0k = kI′(ξ) − I(ξ)k≤ δ where ν = Lt1 + Lt2Lπ1. 0 On the Generalization Gap in Reparameterizable RL

Also since r(s) is also Lipschitz, Again we have the initialization condition

T T s0′ = s0 |R(s ) − R(s)| = | γtr(s ) − γtr(s )| ′ t′ t since the initialization procedure I stays the same. By re- t=0 t=0 X X cursion we have T T t t t 1 ≤ γ |r(st′ ) − r(st)|≤ γ Lrkst′ − stk − t t=0 t=0 kst′ − stk≤ δ (Lt1 + Lt2Lπ1) (22) X X t=0 T X t t ≤ Lrδ γ (Lt1 + Lt2Lπ1) By assumption 3, t=0 X T T t t The argument above holds for any given random input ξ, so |R(s′) − R(s)| = | γ r(st′ ) − γ r(st)| t=0 t=0 X X T T |Eξ[R(s′(ξ)] − Eξ[R(s(ξ)]| t t ≤ γ |r(st′ ) − r(st)|≤ γ Lrkst′ − stk ≤ (R(s′(ξ)) − R(s(ξ))) t=0 t=0 X X Zξ T t 1 t − k ≤ Lrδ γ (Lt1 + Lt2Lπ1) ≤ |R(s′(ξ)) − R(s(ξ))| t=0 k ! ξ X X=0 Z T T νt − 1 t t ≤ L δ γt ≤ Lrδ γ (Lt1 + Lt2Lπ1) r ν − 1 t=0 t=0 X X where ν = Lt1 + Lt2Lπ1. Again the argument holds for any given random input ξ, so

|Eξ[R(s′(ξ)] − Eξ[R(s(ξ)]| C. Proof of Lemma 7 ≤ (R(s′(ξ)) − R(s(ξ))) Lemma. In reparameterizable RL, suppose the transi- Zξ tion in the test environment satisfies T ′ ∀x, y, z, k(T ′ − T )(x,y,z)k≤ δ, and the initialization is the same for both ≤ |R(s′(ξ)) − R(s(ξ))| ξ the training and testing environment. If assumptions (1),(2) Z T t and (3) hold then t ν − 1 ≤ L δ γ r ν − 1 t=0 T t X t 1 − ν |E [R(s(ξ; T ′))] − E [R(s(ξ; T ))]|≤ γ L δ ξ ξ r 1 − ν t=0 X (21) D. Proof of Theorem 1 where ν = Lt1 + Lt2Lπ1 Theorem. In reparameterizable RL, suppose the transi- tion T ′ in the test environment satisfies ∀x, y, z, k(T ′ − Proof. Again let’s denote the state at time t with the new T )(x,y,z)k ≤ ζ, and suppose the initialization function transition function T ′ as st′ , and the state at time t with the I′ in the test environment satisfies ∀ξ, k(I′ − I)(ξ)k ≤ ǫ. original transition function T as st, then If assumptions (1),(2)and(3) hold, the peripheral random variables ξi for each episode are i.i.d., and the reward is kst′ − stk bounded |R(s)|≤ c/2, then with probability at least 1 − δ, = kT ′(st′ 1, π(st′ 1), ξt 1) − T (st 1, π(st 1), ξt 1)k for all policy π ∈ Π, − − − − − − ≤ kT ′(st′ 1, π(st′ 1), ξt 1) − T ′(st 1, π(st 1), ξt 1)k+ 1 i − − − − − − |Eξ[R(s(ξ; π, T ′, I′))] − R(s(ξ ; π, T , I))| kT ′(st , π(st ), ξt ) − T (st , π(st ), ξt )k n 1 1 1 1 1 1 i − − − − − − X ≤ kT ′(st′ 1, π(st′ 1), ξt 1) − T ′(st 1, π(st′ 1), ξt 1)k T t T − − − − − − t ν − 1 t t + kT ′(st 1, π(st′ 1), ξt 1) − T ′(st 1, π(st 1), ξt 1)k + δ ≤ Rad(Rπ, , )+ Lrζ γ + Lrǫ γ ν − − − − − T I ν − 1 − t=0 t=0 ≤ Lt1kst′ 1 − st 1k + Lt2kπ(st′ 1) − π(st 1)k + δ X X − − − − log(1/δ) ≤ Lt1kst′ 1 − st 1k + Lt2Lπ1kst′ 1 − st 1k + δ − − + O c − − n ! = (Lt1 + Lt2Lπ1)kst′ 1 − st 1k + δ r − − On the Generalization Gap in Reparameterizable RL where ν = Lt1 + Lt2Lπ1, and

n 1 i i Rad(Rπ, , )= EξEσ sup σiR(s (ξ ; π, T , I)) T I π n " i=1 # X is the Rademacher complexity of R(s(ξ; π, T , I)) under the training transition T , the training initialization I, and n is the number if training episodes.

Proof. Note

1 i | R(s(ξ ; π, T , I)) − E [R(s(ξ; π, T ′, I′))]| n ξ i X 1 ≤ | R(s(ξi; π, T , I)) − E [R(s(ξ; π, T , I))]| n ξ i X + |Eξ[R(s(ξ; π, T , I))] − Eξ[R(s(ξ; π, T ′, I))]|

+ |Eξ[R(s(ξ; π, T ′, I))] − Eξ[R(s(ξ; π, T ′, I′))]|

Then theorem 1 is a direct consequence of Lemma 2, Lemma 6, and Lemma 7.