arXiv:1905.12654v1 [cs.LG] 29 May 2019

On the Generalization Gap in Reparameterizable Reinforcement Learning

Huan Wang 1   Stephan Zheng 1   Caiming Xiong 1   Richard Socher 1

1 Salesforce Research, Palo Alto, CA, USA. Correspondence to: Huan Wang <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Understanding generalization in reinforcement learning (RL) is a significant challenge, as many common assumptions of traditional supervised learning theory do not apply. We focus on a special class of problems, where the trajectory distribution can be decomposed using the reparametrization trick. For this problem class, estimating the expected return is efficient and the trajectory can be computed deterministically given peripheral random variables, which enables us to study reparametrizable RL using supervised learning and transfer learning. Through these relationships, we derive guarantees on the gap between the empirical and expected return for both intrinsic and external errors, based on Rademacher complexity as well as the PAC-Bayes bound. Our bound suggests the generalization capability of reparameterizable RL is related to multiple factors including the "smoothness" of the environment transition, reward and agent policy function class. We also empirically verify the relationship between the generalization gap and these factors through simulations.

1. Introduction

Reinforcement learning (RL) has proven successful in a series of applications such as games (Mnih et al., 2015; Silver et al., 2016; 2017; Vinyals et al., 2017; OpenAI, 2018), recommendation systems (Shani et al., 2005; Li et al., 2010), resource management (Mao et al., 2016), robotics (Kober et al., 2013), neural architecture design (Baker et al., 2017; Mirhoseini et al., 2018), and more. However, some key questions in reinforcement learning remain unsolved. One that draws more and more attention is the issue of overfitting in reinforcement learning (Sutton, 1995).

A model that performs well in the training environment may or may not perform well in the testing environment. There is a growing interest in understanding the conditions for model generalization and also in developing general algorithms that improve generalization. In general, we would like to measure how accurately an algorithm is able to predict on previously unseen data. One metric of interest is the gap between the training and testing loss or reward. It has been observed that such gaps are related to multiple factors: the initial state distribution, the environment transition, the level of "difficulty" in the environment, model architectures, and optimization.

Zhang et al. (2018b) split randomly sampled initial states into training and testing, and evaluated the performance gap in deep reinforcement learning. They empirically observed overfitting caused by the randomness of the initial state distribution, even if the transition and environment are kept the same as training. On the other hand, Cobbe et al. (2018) allowed the test environment to vary from training, and observed huge differences in testing performance. Packer et al. (2018) also reported very different testing behaviors across models and RL algorithms, even for the same problem, from time to time.

Although overfitting has been empirically observed in RL (Cobbe et al., 2018; Farebrother et al., 2018; Justesen et al., 2018; Packer et al., 2018; Zhang et al., 2018a;b), theoretical guarantees, especially finite-sample guarantees on generalization, are still missing. In this work, we focus on on-policy RL, where agent policies are trained based on episodes sampled "on-the-fly" using the experience of the current policy in training.

We identify two major obstacles in the analysis of on-policy RL. First, the episode distribution keeps changing as the policy gets updated during optimization. Therefore, episodes have to be continuously redrawn from the new distribution induced by the updated policy. For finite-sample analysis, this leads to a process with complex dependencies. Second, state-of-the-art research on RL tends to mix the errors caused by shifts in the environment distribution and the randomness in the environment. We argue that these two types of overfitting errors are actually very different. One, which we call intrinsic error, is analogous to overfitting in supervised learning, and the other, called external error, looks more like the errors in transfer learning.

Our key observation is that there exists a special class of RL, called reparameterizable RL, where randomness in the environment can be decoupled from the transition and initialization procedures via the reparameterization trick (Kingma & Welling, 2014). Through reparameterization, an episode's dependency on the policy is "lifted" to the states. Hence, as the policy gets updated, episodes are deterministic given peripheral random variables. As a consequence, the expected reward in reparameterizable RL is connected to the Rademacher complexity as well as the PAC-Bayes bound. The reparameterization trick also makes the analysis for the second type of errors, i.e., when the environment distribution is shifted, much easier, since the environment parameters are also "lifted" to the representation of states.

Related Work   Generalization in reinforcement learning has been investigated a lot both theoretically and empirically. Theoretical work includes bandit analysis (Agarwal et al., 2014; Auer et al., 2002; 2009; Beygelzimer et al., 2011), Probably Approximately Correct (PAC) analysis (Jiang et al., 2017; Dann et al., 2017; Strehl et al., 2009; Lattimore & Hutter, 2014), as well as minimax analysis (Azar et al., 2017; Chakravorty & Hyland, 2003). Most works focus on the analysis of regret and consider the gap between the expected value and the optimal return. On the empirical side, besides the previously mentioned work, Whiteson et al. (2011) propose generalized methodologies based on multiple environments sampled from a distribution. Nair et al. (2015) also use random starts to test generalization.

Other research has also examined generalization from a transfer learning perspective. Lazaric (2012); Taylor & Stone (2009); Zhan & Taylor (2015); Laroche (2017) examine model generalization across different learning tasks, and provide guarantees on asymptotic performance. There are also works in robotics for transferring policies from a simulator to the real world and optimizing an internal model from data (Kearns & Singh, 2002), or works trying to solve abstracted or compressed MDPs (Majeed & Hutter, 2018).

Our Contributions:

• A connection between (on-policy) reinforcement learning and supervised learning through the reparameterization trick. It simplifies the finite-sample analysis for RL, and yields Rademacher and PAC-Bayes bounds on Markov Decision Processes (MDPs).

• Identifying a class of reparameterizable RL and providing a simple bound for "smooth" environments and models with a limited number of parameters.

• A guarantee for reparameterized RL when the environment is changed during testing. In particular, we discuss two cases of environment shift: a change in the initial distribution for the states, or in the transition function.

2. Notation and Formulation

We denote a Markov Decision Process (MDP) as a 5-tuple (S, A, P, r, P_0). Here S is the state space, A is the action space, P(s, a, s'): S × A × S → [0, 1] is the transition probability from state s to s' when taking action a, r(s): S → R represents the reward function, and P_0(s): S → [0, 1] is the initial state distribution. Let π(s) ∈ Π: S → A be the policy map that returns the action a at state s.

We consider episodic MDPs with a finite horizon. Given the policy map π and the transition probability P, the state-to-state transition probability is T_π(s, s') = P(s, π(s), s'). Without loss of generality, the length of an episode is T + 1. We denote a sequence of states [s_0, s_1, ..., s_T] as s. The total reward in an episode is R(s) = Σ_{t=0}^{T} γ^t r_t, where γ ∈ (0, 1] is a discount factor and r_t = r(s_t).

Denote the joint distribution of the sequence of states in an episode s = [s_0, s_1, ..., s_T] as D_π. Note D_π is also related to P and P_0. In this work we assume P and P_0 are fixed, so D_π is a function of π. Our goal is to find a policy that maximizes the expected total discounted reward (return):

    π* = argmax_{π∈Π} E_{s∼D_π} R(s) = argmax_{π∈Π} E_{s∼D_π} Σ_{t=0}^{T} γ^t r_t.    (1)

Suppose during training we have a budget of n episodes; then the empirical return maximizer is

    πˆ = argmax_{π∈Π, s^i∼D_π} (1/n) Σ_{i=1}^{n} R(s^i),    (2)

where s^i = [s^i_0, s^i_1, ..., s^i_T] is the i-th episode of length T + 1. We are interested in the generalization gap

    Φ = (1/n) Σ_{i=1}^{n} R(s^i) − E_{s∼D'_πˆ} R(s).    (3)

Note that in (3) the distribution D'_πˆ may be different from D_πˆ, since in the testing environment P' as well as P'_0 may be shifted compared to the training environment.
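To make the quantities above concrete, here is a minimal Monte Carlo sketch (not from the paper) of the empirical return in (2), evaluated for a fixed rather than optimized policy, and of the gap (3) under a shifted initial distribution. The tabular toy MDP, its sizes, and the random policy are illustrative assumptions; since the policy is fixed, the printed gap mixes the finite-sample part and the distribution-shift part that Section 4 separates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP (illustrative, not from the paper): |S| states, |A| actions.
S, A, T_horizon, gamma = 5, 3, 20, 0.9
P0 = rng.dirichlet(np.ones(S))                      # initial state distribution P_0
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a] is a distribution over next states
r = rng.uniform(size=S)                             # reward r(s)
policy = rng.integers(A, size=S)                    # a fixed deterministic tabular policy pi(s)

def rollout(P0, P, policy):
    """Sample one episode s_0, ..., s_T and return its discounted return R(s)."""
    s = rng.choice(S, p=P0)
    ret = 0.0
    for t in range(T_horizon + 1):
        ret += gamma ** t * r[s]
        s = rng.choice(S, p=P[s, policy[s]])
    return ret

def empirical_return(P0, P, policy, n):
    """(1/n) sum_i R(s^i), the empirical return of Eq. (2) for a fixed policy."""
    return np.mean([rollout(P0, P, policy) for _ in range(n)])

n = 128
train_estimate = empirical_return(P0, P, policy, n)

# Approximate E_{s ~ D'_pi} R(s) with a large batch of test episodes, here under a
# slightly shifted initial distribution P0' (an external shift as in Eq. (3)).
P0_test = rng.dirichlet(np.ones(S))
test_expectation = empirical_return(P0_test, P, policy, 20_000)
print("generalization gap (Eq. 3) estimate:", train_estimate - test_expectation)
```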

3. Generalization in Reinforcement Learning using the triangle inequality. The first term in (5) is the v.s. Supervised Learning concentrationerror between the empirical reward and its ex- pectation. Since it is caused by intrinsic randomness of the Generalization has been well studied in the supervised environment, we call it the intrinsic error. Even if the test learning scenario. A popular assumption is that samples are environment shares the same distribution with training, in independent and identically distributed (xi,yi) ∼ D, ∀i ∈ the finite-sample scenario there is still a gap between train- {1, 2,...,n}. Similar to empirical return maximization dis- ing and testing. This is analogous to the overfitting problem cussed in Section 2, in supervised learning a popular algo- studied in supervised learning. Zhang et al. (2018b) mainly rithm is empirical risk minimization: focuses on this aspect of generalization. In particular, their n randomness is carefully controlled in experiments to only 1 fˆ = arg min ℓ(f, xi,yi), (4) come from the initial states s0 ∼P0. f n ∈F i=1 X We call the second term in (5) external error, as it is caused where f ∈ F : X → Y is the prediction function to be by shifts of the distribution in the environment. For exam- learned and ℓ : F×X×Y → R+ is the . Simi- ple, the transition distribution P or the initialization distri- larly generalization in supervised learning concerns the gap bution P0 may get changed during testing, which leads to a E between the expected loss [ℓ(f,x,y)] and the empirical different underlying episode distribution Dπ′ . This is analo- 1 n loss n i=1 ℓ(f, xi,yi). gous to the transfer learning problem. For instance, gener- alization as in Cobbe et al. (2018) is mostly external error It is easy to find the correspondence between the episodes P since the number of levels used for training and testing are defined in Section 2 and the samples (x ,y ) in supervised i i different even though the difficult level parameters are sam- learning. Just like supervised learning where (x, y) ∼ D, pled from the same distribution. The setting in Packer et al. in (episodic) reinforcement learning si ∼ D . Also the re- π (2018) covers both intrinsic and external errors. ward function R in reinforcement learning is similar to the loss function ℓ in supervised learning. However, reinforce- ment learning is different because 5. Why Intrinsic Generalization Error? If π is fixed, by concentrationof measures, as the numberof • In supervised learning, the sample distribution D is episodes n increases, the intrinsic error decreases roughly kept fixed, and the loss function ℓ ◦ f changes as we with 1 . For example, if the reward is bounded |R(si)|≤ choose different predictors f. √n c/2, by McDiarmid’s bound, with probability at least 1 − δ, • In reinforcement learning, the reward function R is n kept fixed, but the sample distribution Dπ changes as 2 1 i log δ we choose different policy maps π. R(s ) − Es [R(s)] ≤ c , (6) n ∼D s 2n i=1 X As a consequence, the training procedure in reinforcement where c > 0. Note the bound above also holds for the test learning is also different. Popular methods such as RE- samples if the distribution D is fixed and stest ∼ D. INFORCE (Williams, 1992), Q-learning (Sutton & Barto., 1998), and actor-critic methods (Mnih et al., 2016) draw For the population argument (1), π∗ is defined determinis- E tically since the value s π R(s) is a deterministic func- new states and episodes on the fly as the policy π is being ∼D updated. 
That is, the distribution D_π from which the episodes s^i are drawn always changes during optimization. In contrast, in supervised learning we only update the predictor f, without affecting the underlying sample distribution D. In the finite-sample case (2), however, the policy map πˆ is stochastic: it depends on the samples s^i. As a consequence, the underlying distribution D_πˆ is not fixed, and the expectation E_{s∼D_πˆ}[R(s)] in (6) becomes a random variable, so (6) no longer holds.
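Before that issue appears, the fixed-policy concentration in (6) is easy to check numerically. The sketch below uses i.i.d. bounded stand-ins for the episode returns R(s^i), an illustrative assumption that is valid only while π stays fixed and breaks down exactly when πˆ is fitted on the same episodes, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-in for R(s^i) under a *fixed* policy: i.i.d. bounded returns with |R| <= c/2.
def sample_returns(n, c=1.0):
    return rng.uniform(-c / 2, c / 2, size=n)

true_mean = 0.0                                  # E[R] for the uniform toy model
for n in [10, 100, 1_000, 10_000]:
    gaps = [abs(sample_returns(n).mean() - true_mean) for _ in range(200)]
    mcdiarmid = 1.0 * np.sqrt(np.log(2 / 0.05) / (2 * n))   # RHS of Eq. (6) with c = 1, delta = 0.05
    print(f"n={n:>6}  mean |empirical - expected| ~ {np.mean(gaps):.4f}   bound (6) = {mcdiarmid:.4f}")
```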

4. Intrinsic vs External Generalization Errors

The generalization gap (3) can be bounded as

    Φ ≤ |(1/n) Σ_{i=1}^{n} R(s^i) − E_{s∼D_πˆ} R(s)|   [intrinsic]
      + |E_{s∼D_πˆ} R(s) − E_{s∼D'_πˆ} R(s)|            [external]    (5)

One way of fixing the issue caused by the random D_πˆ is to prove a bound that holds uniformly for all policies π ∈ Π. If Π is finite, by applying a union bound, it follows that:

Lemma 1. If Π is finite, and |R(s)| ≤ c/2, then with probability at least 1 − δ, for all π ∈ Π,

    |(1/n) Σ_{i=1}^{n} R(s^i) − E_{s∼D_π}[R(s)]| ≤ c sqrt(log(2|Π|/δ) / (2n)),    (7)

where |Π| is the cardinality of Π.
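As a quick numerical illustration of Lemma 1, the right-hand side of (7) can be evaluated directly; the values of |Π|, c, δ, and n below are arbitrary assumptions.

```python
import numpy as np

# Right-hand side of Eq. (7): c * sqrt(log(2|Pi|/delta) / (2n)).
def lemma1_bound(card_pi, n, c=1.0, delta=0.05):
    return c * np.sqrt(np.log(2 * card_pi / delta) / (2 * n))

for card_pi in [10, 10**3, 10**6]:
    for n in [100, 10_000]:
        print(f"|Pi| = {card_pi:>8}, n = {n:>6}: bound = {lemma1_bound(card_pi, n):.3f}")
```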

Unfortunately in most of the applications, Π is not finite. Algorithm 1 Reparameterized MDP One difficulty in analyzing the intrinsic generalization er- Initialization: Sample g ,g ,g ,...,g ∼ G S . s = ror is that the policy changes during the optimization proce- init 0 1 T | | 0 argmax(g + log P ), R =0. dure. This leads to a change in the episode distribution D . init 0 π for t in 0,...,T do Usually π is updated using episodes generated from some R = R + γtr(s ) “previous” distributions, which are then used to generate t st+1 = argmax(gt + log P(st, π(st))) new episodes. In this case it is not easy to split episodes end for into a training and testing set, since during optimization return R. samples always come from the updated policy distribution.
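The on-policy sampling pattern described above can be made concrete with a minimal, purely illustrative sketch: each update consumes fresh episodes drawn from the current policy, so the data distribution moves with π and there is no fixed dataset of episodes to split into training and testing. The one-state softmax bandit and the REINFORCE-style update below are toy assumptions, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-state environment with |A| actions; an "episode" is a single reward draw.
A = 4
true_means = rng.uniform(size=A)
theta = np.zeros(A)                                # softmax policy parameters

def sample_action(theta):
    p = np.exp(theta - theta.max())
    p /= p.sum()
    return rng.choice(A, p=p), p

for step in range(500):
    # On-policy: every update draws *fresh* episodes from the *current* policy.
    a, p = sample_action(theta)
    reward = true_means[a] + 0.1 * rng.normal()
    grad_log_p = -p
    grad_log_p[a] += 1.0                           # grad_theta log pi(a) for a softmax policy
    theta += 0.1 * reward * grad_log_p             # REINFORCE-style update (no baseline)

_, p_final = sample_action(theta)
print("learned action probabilities:", np.round(p_final, 3))
print("true mean rewards:           ", np.round(true_means, 3))
```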

S 6. Reparameterization Trick In the reparameterized MDP procedure, G| | is an |S|- dimensional Gumbel distribution. g0,...,gT are |S|- The reparameterization trick has been popular in the op- dimensional vectors with each entry being a Gumbel ran- timization of deep networks (Kingma & Welling, 2014; dom variable. Also g0 + log P0 and gt + log P(st,at) are Maddison et al., 2017; Jang et al., 2017; Tokui & Sato, entry-wise vector sums, so they are both |S|-dimensional 2016) and used, e.g., for the purpose of optimization effi- vectors. arg max(v) returns the index of the maximum en- ciency. In RL, suppose the objective (1) is reparameteriz- try in the |S|-dimensional vector v. In the reparameterized able: MDP procedure shown above, the states st are represented as an index in . After reparameterization, we E s E s {1, 2,..., |S|} s π R( )= ξ p(ξ)R( (f(ξ, π))). 1 ∼D ∼ may rewrite the RL objective (2) as: n Then under some weak assumptions 1 πˆ =arg max R(si(gi; π)), (9) E s E s i |S|T ∇θ s π R( )= ∇θ ξ p(ξ)R( (f(ξ, πθ))) π Π,g n ∼D θ ∼ ∈ ∼G i=1 E s X = ξ p(ξ) [∇θR( (f(ξ, πθ)))] (8) i i i i i ∼ where g = [g0,g1,...,gT ], gt is an |S|-dimensional Gum- The reparameterization trick has already been used: for ex- bel random variable, and ample, PGPE (R¨uckstieß et al., 2010) uses policy reparam- T si i t i i i i eterization, and SVG (Heess et al., 2015) uses policy and R( (g ; π)) = γ r(st(g0,g1,...,gt; π)) (10) t=0 environment dynamics reparameterization. In this work, X we will show the reparameterization trick can help to an- is the discounted return for one episode of length T +1. alyze the generalization gap. More precisely, we will show The reparameterized objective (9) maximizes the empiri- that since both P and P are fixed, even if they are un- 0 cal reward by varying the policy . The distribution from known, as long as they satisfy some “smoothness” assump- π which the random variables i are drawn does not depend tions, we can provide theoretical guarantees on the test per- g on the policy anymore, and the policy only affects the formance. π π reward R(si(gi; π)) through the states si. 7. Reparameterized MDP The objective (9) is a discrete function due to the arg max operator. One way to circumvent this is to use Gumbel soft- We start our analysis with reparameterizing a Markov Deci- max to approximate the argmax operator (Maddison et al., sion Process with discrete states. We will give a general ar- 2017; Jang et al., 2017). If we denote s as a one-hot vec- S gument on reparameterizableRL in the next section. In this tor in R| |, and further relax the entries in s to take pos- section we slightly abuse notation by letting P0 and P(s,a) itive values that sum up to one, we may use the softmax denote |S|-dimensional probability vectors for multinomial to approximate the arg max operator. For instance, the distributions for initialization and transition respectively. reparametrized initial-state distribution becomes:

One difficulty in the analysis of the generalization in rein- exp{(g + log P0)/τ} s0 = , (11) forcement learning rises from the sampling steps in MDP k exp{(g + log P0)/τ}k1 where states are drawn from multinomial distributions spec- where g is an |S|-dimensional Gumbel random variable, P ified by either P or P(s ,a ), because the sampling proce- 0 0 t t is an |S|-dimensional probability vector in multinomial dis- dure does not explicitly connect the states and the distri- tribution, and τ is a positive scalar. As the temperature τ → bution parameters. We can use standard Gumbel random 0, the softmax approaches s = argmax(g + log P ) ∼P variables g ∼ exp(−g + exp(−g)) to reparameterize sam- 0 0 in terms of the one-hot vector representation. pling and get a procedure equivalent to classical MDPs but i i with slightly different expressions, as shown in Algorithm 1Again we abuse the notation by denoting s (f(g ; π)) as i i 1. s (g ; π). On the Generalization Gap in Reparameterizable RL
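Below is a minimal sketch of the reparameterized MDP of Algorithm 1, together with the Gumbel-softmax relaxation of (11). The tabular P_0, P, r, the toy tabular policy, and all sizes are illustrative assumptions; only the Gumbel-max and tempered-softmax steps follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

S, A, T_horizon, gamma = 6, 3, 15, 0.95
P0 = rng.dirichlet(np.ones(S))                     # initial distribution (illustrative toy MDP)
P = rng.dirichlet(np.ones(S), size=(S, A))         # P[s, a] is a distribution over next states
r = rng.uniform(size=S)
theta = rng.normal(size=(S, A))                    # toy tabular policy parameters

def pi(s, theta):
    return int(np.argmax(theta[s]))                # deterministic toy policy pi(s)

def gumbel(shape):
    """Standard Gumbel noise g = -log(-log(U)), U ~ Uniform(0, 1)."""
    return -np.log(-np.log(rng.uniform(size=shape)))

def reparameterized_rollout(theta, g_init, g):
    """Algorithm 1 (sketch): with the Gumbel noise pre-sampled, the episode and its return are
    deterministic functions of theta; argmax(g + log p) is a sample from Categorical(p)."""
    s = int(np.argmax(g_init + np.log(P0)))                    # s_0 = argmax(g_init + log P0)
    ret = 0.0
    for t in range(T_horizon + 1):
        ret += gamma ** t * r[s]
        s = int(np.argmax(g[t] + np.log(P[s, pi(s, theta)])))  # s_{t+1} = argmax(g_t + log P(s_t, pi(s_t)))
    return ret

def soft_init_state(g, tau=0.5):
    """Gumbel-softmax relaxation of the initial state, Eq. (11): a simplex vector that
    approaches the one-hot argmax(g + log P0) as tau -> 0."""
    z = np.exp((g + np.log(P0)) / tau)
    return z / z.sum()

g_init, g = gumbel(S), gumbel((T_horizon + 1, S))       # peripheral randomness, drawn once
print(reparameterized_rollout(theta, g_init, g))        # re-evaluating with a different theta
print(reparameterized_rollout(theta + 0.5, g_init, g))  # reuses the same noise
print(np.round(soft_init_state(gumbel(S), tau=0.1), 3))
```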

8. Reparameterizable RL expected and the empirical reward. In particular, the as- sumptions we make are3 In general, as long as the transition and initialization pro- R S A R S cess can be reparameterized so that the environment param- Assumption 1. T (s,a): | | × R| | → | | is Lt1- eters are separated from the random variables, the objective Lipschitz in terms of the first variable s, and Lt2-Lipschitz can always be reformulated so that the policy only affects in terms of the second variable a. That is, ∀x, x′,y,y′,z, the reward instead of the underlying distribution. The repa- kT (x,y,z) − T (x′,y,z)k≤ Lt1kx − x′k, rameterizable RL procedure is shown in Algorithm 2. kT (x,y,z) − T (x, y′,z)k≤ Lt2ky − y′k. Algorithm 2 Reparameterizzble RL Assumption 2. The policy is parameterized as π(s; θ): S m A R| | × R → R| |, and π(s; θ) is Lπ1-Lipschitz in terms Initialization: Sample ξ0, ξ1,...,ξT . s0 = I(ξ0), R = 0. of the states, and Lπ2-Lipschitz in terms of the parameter Rm for t in 0,...,T do θ ∈ , that is, ∀s,s′,θ,θ′ R = R + γtr(s ) t kπ(s; θ) − π(s′; θ)k≤ Lπ1ks − s′k, s = T (s , π(s ), ξ ) t+1 t t t kπ(s; θ) − π(s; θ )k≤ L kθ − θ k. end for ′ π2 ′ return R. S Assumption 3. The reward r(s): R| | → R is Lr- Lipschitz:

In this procedure, ξs are d-dimensional random variables |r(s′) − r(s)|≤ Lrks′ − sk. but they are not necessarily sampled from the same dis- tribution.2 In many scenarios they are treated as random If assumptions (1)(2)and (3) hold, we have the following: Rd R S noise. I : → | | is the initialization function. Dur- Theorem 1. In reparameterizable RL, suppose the tran- ing initialization, the random variable ξ is taken as input 0 sition T ′ in the test environment satisfies ∀x, y, z, k(T ′ − and the output is an initial state s0. The transition function T )(x,y,z)k ≤ ζ, and suppose the initialization function S A d S T : R ×R ×R → R , takes the current state st, the | | | | | | I′ in the test environment satisfies ∀ξ, k(I′ − I)(ξ)k ≤ ǫ. action produced by the policy π(st), and a random variable If assumptions (1),(2)and(3) hold, the peripheral random ξt to produce the next state st+1. variables ξi for each episode are i.i.d., and the reward is In reparameterizable RL, the peripheral random variables bounded |R(s)|≤ c/2, then with probability at least 1 − δ, for all policies : ξ0, ξ1,...,ξT can be sampled before the episode is gener- π ∈ Π ated. In this way, the randomnessis decoupledfromthe pol- 1 i |E [R(s(ξ; π, T ′, I′))] − R(s(ξ ; π, T , I))| icy function, and as the policy π gets updated, the episodes ξ n i can be computed deterministically. X T t T t ν − 1 t t The class of reparamterizable RL problems includes those ≤ Rad(Rπ, , )+ Lrζ γ + Lrǫ γ ν T I ν − 1 whose initial state, transition, reward and optimal policy t=0 t=0 distribution can be reparameterized. Generally, a distribu- X X log(1/δ) tion can be reparameterized, e.g., if it has a tractable in- + O c , verse CDF, is a composition of reparameterizable distribu- r n ! tions (Kingma & Welling, 2014), or is a limit of smooth where ν = Lt1 + Lt2Lπ1, and Rad(Rπ, , ) = approximators (Maddison et al., 2017; Jang et al., 2017). n T I E E sup 1 σ R(si(ξi; π, T , I)) is the Reparametrizable RL settings include LQR (Lewis et al., ξ σ π n i=1 i Rademacher complexity of R(s(ξ; π, T , I)) under the 1995) and physical systems (e.g., robotics) where the dy-   training transitionP T , the training initialization I, and n is namics are given by stochastic partial differential equations the number if training episodes. (PDE) with reparameterizable components over continuous state-action spaces. Note the i.i.d. assumption on the peripheral variables ξi is across episodes. Within the same episode, there could be i 9. Main Result correlations among the ξts at different time steps. For reparameterizable RL, if the environments and the Similar arguments can also be made when the transition T ′ policy are “smooth”, we can control the error between the in the test environment stays the same as T , but the initial- ization I′ is different from I. In the following sections we 2They may also have different dimensions. In this work, with- will bound the intrinsic and external errors respectively. out loss of generality, we assume the random variables have the 3 m same dimension d. k · k is the L2 norm, and θ ∈ R . On the Generalization Gap in Reparameterizable RL
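A sketch of the generic reparameterizable rollout of Algorithm 2 for a continuous toy system follows. The linear dynamics, Gaussian peripheral noise, linear policy, and norm-based reward are illustrative assumptions, chosen so that I, T, π and r are Lipschitz in the sense of Assumptions 1-3.

```python
import numpy as np

rng = np.random.default_rng(0)

dim_s, dim_a, d, T_horizon, gamma = 4, 2, 4, 30, 0.95
A_mat = 0.9 * np.eye(dim_s)                            # toy stable linear dynamics (illustrative)
B_mat = rng.normal(scale=0.1, size=(dim_a, dim_s))
theta = rng.normal(scale=0.1, size=(dim_a, dim_s))     # linear policy parameters

def I(xi0):
    """Initialization s_0 = I(xi_0); here a scaled copy of the noise (Lipschitz in xi_0)."""
    return 0.5 * xi0

def pi(s, theta):
    """pi(s; theta) = theta s, Lipschitz in s and, on bounded states, in theta."""
    return theta @ s

def T(s, a, xi):
    """Transition s_{t+1} = T(s_t, a_t, xi_t): linear map plus reparameterized additive noise."""
    return A_mat @ s + a @ B_mat + 0.05 * xi

def r(s):
    """A 1-Lipschitz reward: negative distance to the origin."""
    return -np.linalg.norm(s)

def reparameterized_return(theta, xi):
    """Algorithm 2 (sketch): with the peripheral noise xi_0, ..., xi_T fixed up front, the whole
    episode, and hence R, is a deterministic function of the policy parameters theta."""
    s, ret = I(xi[0]), 0.0
    for t in range(T_horizon + 1):
        ret += gamma ** t * r(s)
        s = T(s, pi(s, theta), xi[t])
    return ret

xi = rng.normal(size=(T_horizon + 1, d))               # xi_0, ..., xi_T sampled before the episode
print(reparameterized_return(theta, xi))               # same xi, different theta:
print(reparameterized_return(theta + 0.01, xi))        # the return changes deterministically
```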

10. Bounding Intrinsic Generalization Error In the context of , deep neural networks are over-parameterized models that have proven to work well After reparameterization, the objective (9) is essentially in many applications. However, the bound above does the same as an empirical risk minimization problem in not explain why over-parameterized models also general- the supervised learning scenario. According to classical ize well since the Rademacher complexity bound (14) can learning theory, the following lemma is straight-forward be extremely large as m grows. To ameliorate this is- (Shalev-Shwartz & Ben-David, 2014): sue, recently Arora et al. (2018) proposed a compression Lemma 2. If the reward is bounded, |R(s)|≤ c/2,c> 0, approach that compresses a neural network to a smaller i S T and g ∼ G| |× are i.i.d. for each episode, with probabil- one with fewer parameters but has roughly the same train- ity at least 1 − δ, for ∀π ∈ Π: ing errors. Whether this also applies to reparameterizable RL is yet to be proven. There are also trajectory-based E s 1 si i | g |S|×T [R( (g; π))] − R( (g ; π))| techniques proposed to sharpen the generalization bound ∼G n i (Li et al., 2018; Allen-Zhu et al., 2018; Arora et al., 2019; X Cao & Gu, 2019). log(1/δ) ≤ Rad(R )+ O c , (12) π n r ! 10.1. PAC-Bayes Bound on Reparameterizable RL E E 1 n si i where Rad(Rπ) = g σ supπ n i=1 σiR( (g ; π)) We can also analyze the Rademacher complexity of the em- is the Rademacher complexity of R(s(g; π)). pirical return by making a slightly different assumption on  P  the policy. Suppose π is parameterized as π(θ), and θ is The bound (12) holds uniformly for all π ∈ Π, sampled from some posterior distribution θ ∼ Q. Accord- so it also holds for πˆ. Unfortunately, in MDPs ing to the PAC-Bayes theorem (McAllester, 1998; 2003; Rad(Rπ) is hard to control, mainly due to the recur- Neyshabur et al., 2018; Langford & Shawe-Taylor, 2002): sive argmax in the representation of the states, s = t+1 Lemma 5. Given a “prior” distribution D , with probabil- argmax(g + log P(s , π(s ))). 0 t t t ity at least 1 − δ over the draw of n episodes, ∀Q: On the other hand, for general reparameterizable RL we 1 i may control the intrinsic generalization gap by assuming Eg[Rθ (g)] ≥ Rθ (g ) ∼Q n ∼Q some “smoothness” conditions on the transitions T , as well i as the policy π. In particular, it is straight-forward to prove X 2n 2(KL(Q||D0) + log ) that the empirical return R is “smooth” if the transitions − 2 δ , (15) and policies are all Lipschitz. s n − 1 i i i Lemma 3. For reparameterizable RL, given assumptions Rθ (g )= Eθ R(s (g ; π(θ))) ∼Q ∼Q 1, 2, and 3, the empirical return R defined in (10), as a T E  t i i  function of the parameter θ, has a Lipschitz constant of = θ γ r(st(g ; π(θ))) , (16) ∼Q "t=0 # T t X t ν − 1 β = LrLt2 Lπ2 γ , (13) where Rθ (g) is the expected “Bayesian” reward. ν − 1 ∼Q t=0 X The bound (15) holds for all posterior Q. In particular it where ν = Lt1 + Lt2Lπ1. holds if Q is θ + u where θ could be any solution pro- vided by empirical return maximization, and u is a pertur- Also, if the number of parameters m in π(θ) is bounded, bation, e.g., zero-centered uniform or Gaussian distribution. then the Rademacher complexity Rad(R ) in Lemma 2 π This suggests maximizing a perturbed objective instead can be controlled (van der Vaart., 1998; Bartlett, 2013). may lead to better generalization performance, which has Lemma 4. For reparameterizable RL, given assumptions already been observed empirically (Wang et al., 2018b). 
1, 2, and 3, if the parameters θ ∈ Rm is bounded such that kθk≤ 1, and the function class of the reparameterized The tricky part about perturbing the policy is choosing the reward R is closed under negations, then the Rademacher level of noise. Suppose there is an empirical reward opti- mizer π(θˆ). When the noise level is small, the first term in complexity Rad(Rπ) is bounded by (15) is large, but the second term may also be large since ˆ m the posterior Q is too focused on θ but the “prior” D0 can- Rad(Rπ)= O β (14) ˆ n not depend on θ, and vice versa. On the other hand, if the  r  reward function is “nice”, e.g., if some “smoothness” as- where β is the Lipschitz constant defined in (13), and n is sumption holds in a local neighborhood of θˆ, then one can the number of episodes. prove the optimal noise level roughly scales inversely as On the Generalization Gap in Reparameterizable RL the square root of the local Hessian diagonals (Wang et al., Table 1. Intrinsic Gap versus Smoothness 2018a). Temperature Policy State Action 1 ˆl τ Gap τ Πlkθ kF Gap Gap 11. Bounding External Generalization Error 0.001 0.554 2.20 · 106 0.632 0.612 0.01 0.494 4.46 · 105 0.632 0.608 Another source of generalization error in RL comes from 5 the change of environment. For example, in an MDP 0.1 0.482 1.74 · 10 0.633 0.603 1 0.478 8.83 · 104 0.598 0.598 (S, A, P, r, P ), the transition probability P or the initial- 0 10 0.479 5.06 · 104 0.588 0.594 ization distribution P is different in the test environment. 0 100 0.468 4.77 · 104 0.581 0.594 Cobbe et al. (2018) and Packer et al. (2018) show that as 4 the distribution of the environment varies the gap between 1000 0.471 3.29 · 10 0.590 0.594 the training and testing could be huge. Indeed if the test distribution is drastically different from The other possible environment change is that the test ini- the training environment, there is no guarantee the perfor- tialization I stays the same but the transition changes from mance of the same model could possibly work for testing. the training transition T to T ′. Similar to before, we have: On the other hand, if the test distribution D′ is not too far Lemma 7. In reparameterizable RL, suppose the transi- away from the training distribution D then the test error can tion T ′ in the test environment satisfies ∀x, y, z, k(T ′ − still be controlled. For example, for supervised learning, T )(x,y,z)k ≤ ζ, and the initialization I in the test en- Mohri & Medina (2012) prove the expected loss of a drift- vironment is the same as training. If assumptions (1),(2) ing distribution is also bounded. In addition to Rademacher and (3) hold then complexity and a concentration tail, there is one more term E s E s in the gap that measures the discrepancy between the train- | ξ[R( (ξ; T ′))] − ξ[R( (ξ; T ))]| ing and testing distribution. T νt − 1 ≤ L ζ γt (18) For reparameterizable RL, since the environment parame- r ν − 1 t=0 ters are lifted into the reward function in the reformulated X objective (9), the analysis becomes easier. For MDPs, a where ν = Lt1 + Lt2Lπ1. small change in environment could cause large difference in the reward since arg max is not continuous. However The difference between (18)and(17) is that the change ζ in if the transition function is “smooth”, the expected reward transition T is further enlarged during an episode: as long in the new environment can also be controlled. 
e.g., if we as ν > 1, the gapin (18) is larger and can become huge as assume the transition function T , the reward function r, as the length T of the episode increases. well as the policy function π are all Lipschitz, as in section 10. 12. Simulation If the transition function T is the same in the test environ- We now present empirical measurements in simulations to ment and the only difference is the initialization, we can verify some claims made in section 10 and 11. The bound prove the following lemma: (14) suggests the gap between the expected reward and the Lemma 6. In reparameterizable RL, suppose the ini- empirical reward is related to the Lipschitz constant β of R, which according to equation (13) is related to the Lipschitz tialization function I′ in the test environment satisfies constant of a series of functions including , , and . ∀ξ, k(I′ − I)(ξ)k ≤ ζ for ζ > 0, and the transition func- π T r tion T in the test environment is the same as training. If assumptions (1),(2), and (3) hold, then: 12.1. Intrinsic Generalization Gap In (13), as the length of the episode T increases, the dom- |Eξ[R(s(ξ; I′))] − Eξ[R(s(ξ; I))]| inating factors in β are Lt1, Lt2 and Lπ1. Our first sim- T ulation fixes the environment and verifies L . In the sim- t t π ≤ Lrζ γ (Lt1 + Lt2Lπ1) (17) ulation, we assume the initialization I and the transition t=0 X T are all known and fixed. I is an identity function, and S ξ0 ∈ R| | is a vector of i.i.d. uniformly distributed ran- Lemma 6 means that if the initialization in the test environ- dom variables: ξ0[k] ∼ U[0, 1], ∀k ∈ 1,... |S|. The transit S ment is not too different from the training one, and if the function is T (s,a,ξ)= sT1 + aT2 + ξT3, where s ∈ R| |, A 2 S S transition, policy and reward functions are smooth, then the a ∈ R| |, ξ ∈ R are row vectors, and T1 ∈ R| |×| |, A S 2 S expected reward in the test environmentwon’t deviate from T2 ∈ R| |×| |, and T3 ∈ R ×| | are matrices used to that of training too much. project the states, actions, and noise respectively. T1, T2, On the Generalization Gap in Reparameterizable RL and T3 are randomly generated and then kept fixed during Params 65.6k 131.3k 263.2k 583.4k 1.1m the experiment. We use γ = 1 as the discounting constant Gap 0.204 0.183 0.214 0.336 0.418 throughout. Table 2. Empirical gap vs #policy params. The policy π(s,θ) is modeled using a multiple layer per- ζ in I 1 10 100 1,000 ceptron (MLP) with rectified linear as the activation. The Gap 0.481 0.477 0.659 0.532 last layer of MLP is a linear layer followed by a softmax x[k] exp τ Table 3. Empirical generalization gap vs shift in initialization. function with temperature: q(x[k]; τ)= x[k] . Pk exp τ ζ in T 1 10 100 1,000 By varying the temperature we are able to control the τ Gap 11 451 8,260 73,300 Lipschitz constant of the policy class Lπ1 and Lπ2 if we as- sume the bound on the parameters kθk≤ B is unchanged. Table 4. Empirical generalization gap vs shift in transition. We set the length of the episode T = 128, and randomly sample ξ0, ξ1,...,ξT for n = 128 training and testing also vary the smoothness in the transition function a a func- episodes. Then we use the same random noise to evalu- tion of states (T1), and actions (T2), by applying softmax ate a series of policy classes with different temperatures with different temperatures τ to the singular values of the τ ∈{0.001, 0.01, 0.1, 1, 10, 100, 1000}. randomly generated matrix. 
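The simulated setup of this section can be sketched as follows: identity initialization, the linear transition T(s, a, ξ) = sT1 + aT2 + ξT3, and a bias-free MLP policy with a tempered softmax. The network width, the bounded reward, and the projection scales below are illustrative assumptions, and the last few lines mimic, loosely, the external shift of Section 12.2 by perturbing T1 with a rescaled Rademacher matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

S, A, d, T_horizon, gamma = 8, 4, 2, 128, 1.0
T1 = rng.normal(scale=0.5 / np.sqrt(S), size=(S, S))   # fixed random projections; the scales are
T2 = rng.normal(scale=0.5 / np.sqrt(A), size=(A, S))   # illustrative, chosen to keep states bounded
T3 = rng.normal(scale=0.5 / np.sqrt(d), size=(d, S))
W1 = rng.normal(scale=0.1, size=(S, 32))               # bias-free two-layer MLP policy (width 32 assumed)
W2 = rng.normal(scale=0.1, size=(32, A))

def softmax_temp(x, tau):
    """q(x[k]; tau) = exp(x[k]/tau) / sum_k exp(x[k]/tau); smaller tau gives a sharper, less smooth policy."""
    z = np.exp((x - x.max()) / tau)
    return z / z.sum()

def policy(s, tau):
    return softmax_temp(np.maximum(s @ W1, 0.0) @ W2, tau)

def transition(s, a, xi, T1=T1):
    """T(s, a, xi) = s T1 + a T2 + xi T3, the linear transition of Section 12.1."""
    return s @ T1 + a @ T2 + xi @ T3

def episode_return(xi0, xis, tau, T1=T1, reward=lambda s: np.tanh(s).mean()):
    """One reparameterized episode: s_0 = xi_0 (identity initialization), then the linear
    transition; the bounded reward is an illustrative assumption."""
    s, ret = xi0, 0.0
    for t in range(T_horizon + 1):
        ret += gamma ** t * reward(s)
        s = transition(s, policy(s, tau), xis[t], T1)
    return ret

xi0 = rng.uniform(size=S)                         # xi_0[k] ~ U[0, 1], I = identity
xis = rng.uniform(size=(T_horizon + 1, d))        # peripheral transition noise
for tau in [0.01, 0.1, 1.0, 10.0]:
    print(f"tau={tau:>5}: return = {episode_return(xi0, xis, tau):.3f}")

# Loose analogue of the external shift in Section 12.2: perturb the transition by a Rademacher
# matrix rescaled to norm zeta and re-evaluate with the SAME peripheral noise.
zeta = 1.0
delta = rng.choice([-1.0, 1.0], size=T1.shape)
delta *= zeta / np.linalg.norm(delta)
print("shifted-transition return:", episode_return(xi0, xis, tau=1.0, T1=T1 + delta))
```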
Since we assume I and T are known, during training the Table 1 shows the average generalization gap roughly de- 1 ˆl computation graph is complete. Hence we can directly op- creases as τ decreases. The metric τ Πlkθ kF also de- timize the coefficients θ in π(s; θ) just as in supervised creases similarly as the average gap. In particular, the 2nd learning.4 We use Adam (Kingma & Ba, 2015) to optimize and 3rd column shows the average gap as the policy be- 2 3 with initial learning rates 10− and 10− . When the reward comes “smoother”. The 4th column shows, if we fix the stops increasing we halved the learning rate. and analyze policy-τ as well as setting T2 = 1, the generalization gap the gap between the average training and testing reward. decreases as we increase the transition-τ for T1 (states). Similarly the last column is the gap as the transition- for First, we observe the gap is affected by the optimization τ actions ( ) varies. In Table 2 the environment is fixed and procedure. For example, different learning rates can lead T2 for each parameter configuration the gap is averaged from to different local optima, even if we decrease the learning trials with randomly initialized and then optimized poli- rate by half when the reward does not increase. Second, 100 cies. even if we know the environment I and T , so that we can optimize the policy π(s; θ) directly, we still experience un- stable learning just like other RL algorithms. This suggests 12.2. External Generalization Gap that the unstableness of the RL algorithms may not rise To measure the external generalization gap, we vary the from the estimation of the environmentfor the model based transition T as well as the initialization I in the test envi- algorithms such as A2C and A3C (Mnih et al., 2016), since ronment. For that, we add a vector of Rademacher random even if we know the environment the learning is still unsta- variables ∆ to I or T , with k∆k = ζ. We adjust the level ble. of noise δ in the simulation and report the change of the Given the unstable training procedure, for each trial we ran average gap in Table 3 and Table 4. It is not surprising that the training for 1024 epochs with learning rate of 1e-2 and the change ∆T in transition T leads to a higher generaliza- 1e-3, and the one with higher training reward at the last tion gap since the impact from ∆T is accumulated across epoch is used for reporting. Ideally as we vary τ, the Lip- time steps. Indeed if we compare the bound (18) and (17), schitz constant for the function class π ∈ Π is changed when γ =1 as long as ν > 1,the gapin (18) is larger. accordingly given the assumption kθk≤ B. However, it is unclear if B is changed or not for different configurations. 13. Discussion and Future Work After all, the assumption that the parameters are bounded is artificial. To ameliorate this defect we also check the Even though a variety of distributions, discrete or continu- 1 l l ous, can be reparameterized, and we have shown that the metric τ Πlkθ kF , where θ is the weight matrix of the lth layer of MLP. In our experimentthere is no bias term in the classical MDP with discrete states is reparameterizable, it 1 l is not clear in general under which conditions reinforce- linear layers in MLP, so Πlkθˆ kF can be used as a metric τ ment learning problems are reparameterizable. Classifying on the Lipschitz constant L at the solution point θˆ. 
We π1 particular cases where RL is not reparameterizable is an in- 4In real applications this is not doable since T and I are un- teresting direction for future work. Second, the transitions known. Here we assume they are known just to investigate the of discrete MDPs are inherently non-smooth, so Theorem generalization gap. 1 does not apply. In this case, the PAC-Bayes bound can be On the Generalization Gap in Reparameterizable RL applied, but this requires a totally different framework. It Cobbe, K., Klimov, O., Hesse, C., Kim, T., and will be interesting to see if there is a “Bayesian” version of Schulman, J. Quantifying generalization in Theorem 1. Finally, our analysis only covers “on-policy” reinforcement learning. CoRR, 2018. URL RL. Studying generalization for “off-policy” RL remains http://arxiv.org/abs/1812.02341. an interesting future topic. Dann, C., Lattimore, T., and Brunskill, E. Unifying pac and regret: Uniform pac bounds for episodic reinforcement References learning. International Conference on Neural Informa- Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and tion Processing Systems (NIPS), 2017. Schapire, R. Taming the monster: A fast and simple al- gorithm for contextual bandits. International Conference Farebrother, J., Machado, M. C., and Bowling, M. Gener- on , 2014. alization and regularization in dqn. CoRR, 2018. URL https://arxiv.org/abs/1810.00123. Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and gen- eralization in overparameterized neural networks, going Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., beyond two layers. CoRR, abs/1811.04918, 2018. and Tassa, Y. Learning continuous control policies by stochastic value gradients. Advances in Neural Informa- Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger tion Processing Systems, 2015. generalization bounds for deep nets via a compression approach. International Conference on Machine Learn- Jang, E., Gu, S., and Poole, B. Categorical reparameteri- ing, 2018. zation by gumbel-softmax. International Conference on Learning Representations, 2017. Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine- grained analysis of optimization and generalization for Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., overparameterized two-layer neural networks. Interna- and Schapire, R. E. Contextual decision processes with tional Conference on Machine Learning, 2019. low Bellman rank are PAC-learnable. International Con- ference on Machine Learning, 2017. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time anal- ysis of the multiarmed bandit problem. Maching Learn- Justesen, N., Torrado, R. R., Bontrager, P., Khalifa, A., ing, 2002. Togelius, J., and Risi, S. Illuminating generalization in deep reinforcement learning through procedural level Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret generation. NeurIPS Deep RL Workshop, 2018. bounds for reinforcement learning. Advances in Neural Information Processing Systems 21, 2009. Kearns, M. and Singh, S. Near-optimal reinforcement Azar, M. G., Osband, I., and Munos, R. Minimax regret learning in polynomial time. Mache Learning, 2002. International Con- bounds for reinforcement learning. Kingma, D. P. and Ba, J. Adam: A method for stochas- ference on Machine Learning , 2017. tic optimization. International Conference on Learning Baker, B., Gupta, O., Naik, N., and Raskar, R. Designing Representations, 2015. neural network architectures using reinforcement learn- Kingma, D. P. and Welling, M. Auto-encoding variational ing. 2017. bayes. 
International Conference on Learning Represen- Bartlett, P. Lecture notes on theoretical statistics. 2013. tations, 2014.

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Kober, J., Bagnell, J. A., and Peters, J. Reinforcement Schapire, R. Contextual bandit algorithms with super- learning in robotics: A survey. International Journal of vised learning guarantees. Proceedings of the Fourteenth Robotic Research, 2013. International Conference on Artificial Intelligence and Statistics, 2011. Langford, J. and Shawe-Taylor, J. Pac-bayes & margins. In- ternational Conference on Neural Information Process- Cao, Y. and Gu, Q. A generalization theory of gradient ing Systems (NIPS), 2002. descent for learning over-parameterized deep relu net- works. CoRR, abs/1902.01384, 2019. Laroche, R. Transfer reinforcement learning with shared dynamics. 2017. Chakravorty, S. and Hyland, D. C. Minimax reinforcement learning. American Institute of Aeronautics and Astro- Lattimore, T. and Hutter, M. Near-optimal PAC bounds for nautic, 2003. discounted mdps. Theoretical Computer Science, 2014. On the Generalization Gap in Reparameterizable RL

Lazaric, A. Transfer in reinforcement learning: a frame- Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., work and a survey. Reinforcement Learning - State of Fearon, R., Maria, A. D., Panneershelvam, V., the Art, Springer, 2012. Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Lewis, F., Syrmos, V., and Syrmos, V. Opti- Massively parallel methods for deep reinforcement mal Control. A Wiley-interscience publication. learning. CoRR, abs/1507.04296, 2015. URL Wiley, 1995. ISBN 9780471033783. URL http://arxiv.org/abs/1507.04296. https://books.google.com/books?id=jkD37elP6NIC. Neyshabur, B., Bhojanapalli, S., and Srebro, N. A Li, L., Chu, W., Langford, J., and Schapire, R. E. A pac-bayesian approach to spectrally-normalized margin contextual-bandit approach to personalized news article bounds for neural networks. International Conference recommendation. Proceedings of the 19th International on Learning Representations (ICLR), 2018. Conference on World Wide Web, 2010. OpenAI. Openai five. Li, Y., Ma, T., and Zhang, H. Algorithmic regularization in https://blog.openai.com/openai-five/, over-parameterized matrix recovery, 2018. 2018.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete Packer, C., Gao, K., Kos, J., Kr¨ahenb¨uhl, P., Koltun, distribution: a continuous relaxation of discrete random V., and Song, D. Assessing generalization in variables. International Conference on Learning Repre- deep reinforcement learning. CoRR, 2018. URL sentations, 2017. https://arxiv.org/abs/1810.12282.

Majeed, S. J. and Hutter, M. Performance guarantees R¨uckstieß, T., Sehnke, F., Schaul, T., Wierstra, D., Sun, for homomorphisms beyond markov decision processes. Y., and Schmidhuber, J. Exploring parameter space in CoRR, abs/1811.03895, 2018. reinforcement learning. Paladyn, 2010.

Mao, H., Alizadeh, M., Menache, I., and Kandula, S. Re- Shalev-Shwartz, S. and Ben-David, S. Understanding Ma- source management with deep reinforcement learning. chine Learning: From Theory to Algorithms. Cambridge 2016. University Press, New York, NY, USA, 2014. ISBN 1107057132, 9781107057135. McAllester, D. A. Some pac-bayesian theorems. Confer- ence on Learning Theory (COLT), 1998. Shani, G., Brafman, R. I., and Heckerman, D. An mdp- based recommender system. The Journal of Machine McAllester, D. A. Simplified pac-bayesian margin bounds. Learning Research, 2005. Conference on Learning Theory (COLT), 2003. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Mirhoseini, A., Goldie, A., Pham, H., Steiner, van den Driessche, G., Schrittwieser, J., Antonoglou, I., B., Le, Q. V., and Dean, J. Hierarchical Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, planning for device placement. 2018. URL D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, https://openreview.net/pdf?id=Hkc-TeZ0W. T., Leach, M., Kavukcuoglu, K., Graepel, T., and Has- sabis, D. Mastering the game of go with deep neural Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve- networks and tree search. Nature, 2016. ness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wier- M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Grae- stra, D., Legg, S., and Hassabis, D. Human-level control pel, T., Lillicrap, T. P., Simonyan, K., and Hassabis, D. through deep reinforcement learning. Nature, 2015. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, 2017. URL Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, http://arxiv.org/abs/1712.01815. T., Harley, T., Silver, D., and Kavukcuoglu, K. Asyn- chronous methods for deep reinforcement learning. In- Strehl, A. L., Li, L., and Littman, M. L. Reinforcement ternational Conference on Machine Learning, 2016. learning in finite mdps: Pac analysis. Journal of Machine Learning Research, 2009. Mohri, M. and Medina, A. M. New analysis and algo- rithm for learning with drifting distributions. Algorith- Sutton, R. and Barto., A. Reinforcement Learning: An In- mic Learning Theory, 2012. troduction. MIT Press, 1998. On the Generalization Gap in Reparameterizable RL

Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. 1995.

Taylor, M. E. and Stone, P. Transfer learning for reinforce- ment learning domains: A survey. J. Mach. Learn. Res., 2009. Tokui, S. and Sato, I. Reparameterization trick for discrete variables. CoRR, 2016. URL https://arxiv.org/abs/1611.01239. van der Vaart., A. Asymptotic Statistics.. Cambridge, 1998. Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., K¨uttler, H., Agapiou, J., Schrittwieser, J., Quan, J., Gaffney, S., Petersen, S., Simonyan, K., Schaul, T., van Has- selt, H., Silver, D., Lillicrap, T. P., Calderone, K., Keet, P., Brunasso, A., Lawrence, D., Ekermo, A., Repp, J., and Tsing, R. Starcraft II: A new chal- lenge for reinforcement learning. CoRR, 2017. URL http://arxiv.org/abs/1708.04782. Wang, H., Keskar, N. S., Xiong, C., and Socher, R. Identifying generalization prop- erties in neural networks. 2018a. URL https://openreview.net/forum?id=BJxOHs0cKm. Wang, J., Liu, Y., and Li, B. Reinforcement learning with perturbed rewards. CoRR, abs/1810.01032, 2018b. URL http://arxiv.org/abs/1810.01032. Whiteson, S., Tanner, B., Taylor, M. E., and Stone, P. Pro- tecting against evaluation overfitting in empirical rein- forcement learning. 2011 IEEE Symposium on Adap- tive Dynamic Programming and Reinforcement Learn- ing (ADPRL), 2011. Williams, R. J. Simple statistical gradient-following al- gorithms for connectionist reinforcement learning. Ma- chine Learning, 1992. Zhan, Y. and Taylor, M. E. Online transfer learning in re- inforcement learning domains. CoRR, abs/1507.00436, 2015. Zhang, A., Ballas, N., and Pineau, J. A dis- section of overfitting and generalization in continu- ous reinforcement learning. CoRR, 2018a. URL https://arxiv.org/abs/1806.07937.

Zhang, C., Vinyals, O., Munos, R., and Ben- gio, S. A study on overfitting in deep re- inforcement learning. CoRR, 2018b. URL http://arxiv.org/abs/1804.06893. On the Generalization Gap in Reparameterizable RL

A. Proof of Lemma 3 By assumption 3, r(s) is Lr-Lipschitz, so Lemma. For Reparameterizable RL, given assumptions 1, kr(s ) − r(s )k≤ L ks − s k 2, and 3, the empirical reward R defined in (10), as a func- t′ t r t′ t t tion of the parameter θ, has a Lipschitz constant of ν − 1 ≤ L L L kθ′ − θk r t2 π2 ν − 1 T νt − 1 β = γtL L L r t2 π2 ν − 1 So the reward t=0 X T T where ν = Lt1 + Lt2Lπ1. t t |R(s′) − R(s)| = | γ r(st′ ) − γ r(st)| t=0 t=0 X X T T Proof. Let’s denote st′ = st(θ′), and st = st(θ). We start t t by investigating the policy function across different time ≤ | γ (r(st′ ) − r(st))|≤ γ |r(st′ ) − r(st))| t=0 t=0 steps: X X T t t ν − 1 kπ(s ; θ ) − π(s ; θ)k ≤ γ LrLt2 Lπ2 kθ′ − θk = βkθ′ − θk t′ ′ t ν − 1 t=0 = kπ(st′ ; θ′) − π(st; θ′)+ π(st; θ′) − π(st; θ)k X

≤kπ(st′ ; θ′) − π(st; θ′)k + kπ(st; θ′) − π(st; θ)k

≤ Lπ1kst′ − stk + Lπ2kθ′ − θk (19)

The first inequality is the triangle inequality, and the second B. Proof of Lemma 6 is from our Lipschitz assumption 2. Lemma. In reparameterizable RL, suppose the initializa- If we look at the change of states as the episode proceeds: tion function I′ in the test environment satisfies k(I′ − I)(ξ)k ≤ δ, and the transition function is the same for kst′ − stk both training and testing environment. If assumptions (1), (2), and (3) hold then = kT (st′ 1, π(st′ 1; θ′), ξt 1) − T (st 1, π(st 1; θ), ξt 1)k − − − − − − ≤ kT (st′ 1, π(st′ 1; θ′), ξt 1) − T (st 1, π(st′ 1; θ′), ξt 1)k − − − − − − |Eξ[R(s(ξ; I′))] − Eξ[R(s(ξ; I))]|≤ + kT (st 1, π(st′ 1; θ′), ξt 1) − T (st 1, π(st 1; θ), ξt 1)k − − − − − − T t t ≤ Lt1kst′ 1 − st 1k + Lt2kπ(st′ 1; θ′) − π(st 1; θ)k γ L (L + L L ) δ − − − − r t1 t2 π1 (20) t=0 X

Now combine both (19)and (20), Proof. Denote the states at time t with I′ as the initializa- tion function as st′ . Againwe lookat the differencebetween kst′ − stk st′ and st. By triangle inequality and assumptions 1 and 2, ≤ Lt1kst′ 1 − st 1k − − + Lt2(Lπ1kst′ 1 − st 1k + Lπ2kθ′ − θk) ks′ − stk − − t ≤ (L + L L )ks′ − s k + L L kθ′ − θk t1 t2 π1 t 1 t 1 t2 π2 = kT (st′ 1, π(st′ 1), ξt 1) − T (st 1, π(st 1), ξt 1)k − − − − − − − − ≤ kT (st′ 1, π(st′ 1), ξt 1) − T (st 1, π(st′ 1), ξt 1)k − − − − − − In the initialization, we know s′ = s since the initializa- 0 0 + kT (st 1, π(st′ 1), ξt 1) − T (st 1, π(st 1), ξt 1)k tion process does not involve any computation using the − − − − − − ≤ Lt1kst′ 1 − st 1k + Lt2kπ(st′ 1) − π(st 1)k parameter θ in the policy π. − − − − ≤ Lt1kst′ 1 − st 1k + Lt2Lπ1kst′ 1 − st 1k By recursion, we get − − − − = (Lt1 + Lt2Lπ1)kst′ 1 − st 1k − − t t 1 ≤ (Lt1 + Lt2Lπ1) ks0′ − s0k − t kst′ − stk≤ Lt2 Lπ2kθ′ − θk (Lt1 + Lt2Lπ1) t ≤ (Lt1 + Lt2Lπ1) δ t=0 X νt − 1 = L L kθ′ − θk where the last inequality is due to the assumption that t2 π2 ν − 1

ks′ − s0k = kI′(ξ) − I(ξ)k≤ δ where ν = Lt1 + Lt2Lπ1. 0 On the Generalization Gap in Reparameterizable RL

Also since r(s) is also Lipschitz, Again we have the initialization condition

T T s0′ = s0 |R(s ) − R(s)| = | γtr(s ) − γtr(s )| ′ t′ t since the initialization procedure I stays the same. By re- t=0 t=0 X X cursion we have T T t t t 1 ≤ γ |r(st′ ) − r(st)|≤ γ Lrkst′ − stk − t t=0 t=0 kst′ − stk≤ δ (Lt1 + Lt2Lπ1) (22) X X t=0 T X t t ≤ Lrδ γ (Lt1 + Lt2Lπ1) By assumption 3, t=0 X T T t t The argument above holds for any given random input ξ, so |R(s′) − R(s)| = | γ r(st′ ) − γ r(st)| t=0 t=0 X X T T |Eξ[R(s′(ξ)] − Eξ[R(s(ξ)]| t t ≤ γ |r(st′ ) − r(st)|≤ γ Lrkst′ − stk ≤ (R(s′(ξ)) − R(s(ξ))) t=0 t=0 X X Zξ T t 1 t − k ≤ Lrδ γ (Lt1 + Lt2Lπ1) ≤ |R(s′(ξ)) − R(s(ξ))| t=0 k ! ξ X X=0 Z T T νt − 1 t t ≤ L δ γt ≤ Lrδ γ (Lt1 + Lt2Lπ1) r ν − 1 t=0 t=0 X X where ν = Lt1 + Lt2Lπ1. Again the argument holds for any given random input ξ, so

|Eξ[R(s′(ξ)] − Eξ[R(s(ξ)]| C. Proof of Lemma 7 ≤ (R(s′(ξ)) − R(s(ξ))) Lemma. In reparameterizable RL, suppose the transi- Zξ tion in the test environment satisfies T ′ ∀x, y, z, k(T ′ − T )(x,y,z)k≤ δ, and the initialization is the same for both ≤ |R(s′(ξ)) − R(s(ξ))| ξ the training and testing environment. If assumptions (1),(2) Z T t and (3) hold then t ν − 1 ≤ L δ γ r ν − 1 t=0 T t X t 1 − ν |E [R(s(ξ; T ′))] − E [R(s(ξ; T ))]|≤ γ L δ ξ ξ r 1 − ν t=0 X (21) D. Proof of Theorem 1 where ν = Lt1 + Lt2Lπ1 Theorem. In reparameterizable RL, suppose the transi- tion T ′ in the test environment satisfies ∀x, y, z, k(T ′ − Proof. Again let’s denote the state at time t with the new T )(x,y,z)k ≤ ζ, and suppose the initialization function transition function T ′ as st′ , and the state at time t with the I′ in the test environment satisfies ∀ξ, k(I′ − I)(ξ)k ≤ ǫ. original transition function T as st, then If assumptions (1),(2)and(3) hold, the peripheral random variables ξi for each episode are i.i.d., and the reward is kst′ − stk bounded |R(s)|≤ c/2, then with probability at least 1 − δ, = kT ′(st′ 1, π(st′ 1), ξt 1) − T (st 1, π(st 1), ξt 1)k for all policy π ∈ Π, − − − − − − ≤ kT ′(st′ 1, π(st′ 1), ξt 1) − T ′(st 1, π(st 1), ξt 1)k+ 1 i − − − − − − |Eξ[R(s(ξ; π, T ′, I′))] − R(s(ξ ; π, T , I))| kT ′(st , π(st ), ξt ) − T (st , π(st ), ξt )k n 1 1 1 1 1 1 i − − − − − − X ≤ kT ′(st′ 1, π(st′ 1), ξt 1) − T ′(st 1, π(st′ 1), ξt 1)k T t T − − − − − − t ν − 1 t t + kT ′(st 1, π(st′ 1), ξt 1) − T ′(st 1, π(st 1), ξt 1)k + δ ≤ Rad(Rπ, , )+ Lrζ γ + Lrǫ γ ν − − − − − T I ν − 1 − t=0 t=0 ≤ Lt1kst′ 1 − st 1k + Lt2kπ(st′ 1) − π(st 1)k + δ X X − − − − log(1/δ) ≤ Lt1kst′ 1 − st 1k + Lt2Lπ1kst′ 1 − st 1k + δ − − + O c − − n ! = (Lt1 + Lt2Lπ1)kst′ 1 − st 1k + δ r − − On the Generalization Gap in Reparameterizable RL where ν = Lt1 + Lt2Lπ1, and

n 1 i i Rad(Rπ, , )= EξEσ sup σiR(s (ξ ; π, T , I)) T I π n " i=1 # X is the Rademacher complexity of R(s(ξ; π, T , I)) under the training transition T , the training initialization I, and n is the number if training episodes.

Proof. Note

1 i | R(s(ξ ; π, T , I)) − E [R(s(ξ; π, T ′, I′))]| n ξ i X 1 ≤ | R(s(ξi; π, T , I)) − E [R(s(ξ; π, T , I))]| n ξ i X + |Eξ[R(s(ξ; π, T , I))] − Eξ[R(s(ξ; π, T ′, I))]|

+ |Eξ[R(s(ξ; π, T ′, I))] − Eξ[R(s(ξ; π, T ′, I′))]|

Then theorem 1 is a direct consequence of Lemma 2, Lemma 6, and Lemma 7.