Sample Complexity of Reinforcement Learning using Linearly Combined Model Ensembles

Aditya Modi¹, Nan Jiang², Ambuj Tewari¹, Satinder Singh¹
¹University of Michigan, Ann Arbor    ²University of Illinois at Urbana-Champaign

Abstract

Reinforcement learning (RL) methods have been shown to be capable of learning intelligent behavior in rich domains. However, this has largely been done in simulated domains without adequate focus on the process of building the simulator. In this paper, we consider a setting where we have access to an ensemble of pre-trained and possibly inaccurate simulators (models). We approximate the real environment using a state-dependent linear combination of the ensemble, where the coefficients are determined by the given state features and some unknown parameters. Our proposed algorithm provably learns a near-optimal policy with a sample complexity polynomial in the number of unknown parameters, and incurs no dependence on the size of the state (or action) space. As an extension, we also consider the more challenging problem of model selection, where the state features are unknown and can be chosen from a large candidate set. We provide exponential lower bounds that illustrate the fundamental hardness of this problem, and develop a provably efficient algorithm under additional natural assumptions.

1 INTRODUCTION

Reinforcement learning methods with deep neural networks as function approximators have recently demonstrated prominent success in solving complex and observation-rich tasks like games (Mnih et al., 2015; Silver et al., 2016), simulated control problems (Todorov et al., 2012; Lillicrap et al., 2015; Mordatch et al., 2016) and a range of robotics tasks (Christiano et al., 2016; Tobin et al., 2017). A common aspect in most of these success stories is the use of simulation. Arguably, given a simulator of the real environment, it is possible to use RL to learn a near-optimal policy from (usually a large amount of) simulation data. If the simulator is highly accurate, the learned policy should also perform well in the real environment.

Apart from some cases where the true environment and the simulator coincide (e.g., in game playing) or a nearly perfect simulator can be created from the laws of physics (e.g., in simple control problems), in general we will need to construct the simulator using data from the real environment, making the overall approach an instance of model-based RL. As the algorithms for learning from simulated experience mature (which is what the RL community has mostly focused on), the bottleneck has shifted to the creation of a good simulator. How can we learn a good model of the world from interaction experiences?

A popular approach for meeting this challenge is to learn using a wide variety of simulators, which imparts robustness and adaptivity to the learned policies. Recent works have demonstrated the benefits of using such an ensemble of models, which can be used either to transfer policies from simulated to real-world domains, or to simply learn robust policies (Andrychowicz et al., 2018; Tobin et al., 2017; Rajeswaran et al., 2017). Borrowing the motivation from these empirical works, we note that the process of learning a simulator inherently involves various choices, such as inductive biases, the data collection policy, and other design aspects. As such, instead of relying on a sole approximate model for learning in simulation, interpolating between models obtained from different sources can provide a better approximation of the real environment. Previous works like Buckman et al. (2018); Lee et al. (2019); Kurutach et al. (2018) have also demonstrated the effectiveness of using an ensemble of models for decreasing the modelling error, or its effect during learning.

In this paper, we consider building an approximate model of the real environment from interaction data using a set (or ensemble) of possibly inaccurate models, which we will refer to as the base models. The simplest way to combine the base models is to take a weighted combination, but such an approach is rather limited. For instance, each base model might be accurate in certain regions of the state space, in which case it is natural to consider a state-dependent mixture. We consider the problem of learning in such a setting, where one has to identify an appropriate combination of the base models through real-world interactions, so that the induced policy performs well in the real environment. The data collected through interaction with the real world can be a precious resource and, therefore, we need the learning procedure to be sample-efficient. Our main result is an algorithm that enjoys polynomial sample complexity guarantees, where the polynomial has no dependence on the size of the state and action spaces. We also study a more challenging setting where the featurization of states for learning the combination is unknown and has to be discovered from a large set of candidate features.

Outline. We formally set up the problem and notation in Section 2, and discuss related work in Section 3. The main algorithm is introduced in Section 4, together with its sample complexity guarantees. We then proceed to the feature selection problem in Section 5 and conclude in Section 6.

2 SETTING AND NOTATION

We consider episodic Markov decision processes (MDPs). An MDP M is specified by a tuple (S, A, P, R, H, P_1), where S is the state space and A is the action space. P denotes the transition kernel describing the system dynamics, P : S × A → ∆(S), and R is the per-timestep reward function, R : S × A → ∆([0, 1]). The agent interacts with the environment for a fixed number of timesteps, H, which determines the horizon of the problem. The initial state distribution is P_1. The agent's goal is to find a policy π : S × [H] → A which maximizes the value of the policy:

$$v^{\pi}_{M} := \mathbb{E}_{s \sim P_1}\big[V^{\pi}_{M,1}(s)\big],$$

where the value function at step h is defined as:

$$V^{\pi}_{M,h}(s) = \mathbb{E}\Big[\textstyle\sum_{h'=h}^{H} r_{h'} \,\Big|\, s_h = s,\; a_{h:H} \sim \pi,\; s_{h:H} \sim M\Big].$$

Here we use "$s_{h:H} \sim M$" to imply that the sequence of states is generated according to the dynamics of M. A policy is said to be optimal for M if it maximizes the value $v^{\pi}_{M}$. We denote such a policy as $\pi_M$ and its value as $v_M$. We use M* to denote the model of the true environment, and use π* and v* as shorthand for $\pi_{M^*}$ and $v_{M^*}$, respectively.

In our setting, the agent is given access to a set of K base MDPs {M_1, ..., M_K}. They share the same S, A, H, P_1, and only differ in P and R. In addition, a feature map φ : S × A → ∆_{d−1} is given which maps state-action pairs to d-dimensional real vectors. Given these two objects, we consider the class of models which can be obtained from the following state-dependent linear combination of the base models:

Definition 2.1 (Linear Combination of Base Models). For a given model ensemble {M_1, ..., M_K} and the feature map φ : S × A → ∆_{d−1}, we consider models parametrized by W with the following transition and reward functions:

$$P^{W}(\cdot \mid s,a) = \sum_{k=1}^{K} [W\phi(s,a)]_k \, P^{k}(\cdot \mid s,a), \qquad R^{W}(\cdot \mid s,a) = \sum_{k=1}^{K} [W\phi(s,a)]_k \, R^{k}(\cdot \mid s,a).$$

We will use M(W) to denote such a model for any parameter $W \in \mathcal{W}_0 \equiv \{W \in [0,1]^{K \times d} : \sum_{i=1}^{K} W_{ij} = 1 \text{ for all } j \in [d]\}$.

For now, let's assume that there exists some W* such that M* = M(W*), i.e., the true environment can be captured by our model class; we will relax this assumption shortly.

To develop intuition, consider a simplified scenario where d = 1 and φ(s, a) ≡ 1. In this case, the matrix W becomes a K × 1 stochastic vector, and the true environment is approximated by a linear combination of the base models.

Example 2.2 (Global convex combination of models). If the base models are combined using a set of constant weights $w \in \Delta_{K-1}$, then this is a special case of Definition 2.1 where d = 1 and each state's feature vector is φ(s, a) ≡ 1.

In the more general case of d > 1, we allow the combination weights to be a linear transformation of the features, namely Wφ(s, a), and hence obtain more flexibility in choosing different combination weights in different regions of the state-action space. A special case of this more general setting is when φ corresponds to a partition of the state-action space into multiple groups, and the linear combination coefficients are constant within each group.

Example 2.3 (State space partition). Let $S \times A = \bigcup_{i \in [d]} \mathcal{X}_i$ be a partition (i.e., $\{\mathcal{X}_i\}$ are disjoint). Let $\phi_i(s,a) = \mathbb{1}[(s,a) \in \mathcal{X}_i]$ for all i ∈ [d], where 1[·] is the indicator function. This φ satisfies the condition that φ(s, a) ∈ ∆_{d−1}, and when combined with a set of base models, forms a special case of Definition 2.1.
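To make Definition 2.1 and Example 2.3 concrete, the following minimal sketch (Python/NumPy, assuming a finite state space and tabular base models) assembles the transition distribution and expected reward of M(W) at a single state-action pair. The helper `partition_id` and the tabular data layout are illustrative assumptions and not part of the paper.

```python
import numpy as np

def partition_features(s, a, partition_id, d):
    """Example 2.3: one-hot feature indicating which group (s, a) falls in.

    `partition_id` is any user-supplied map from (s, a) to {0, ..., d-1};
    it is a hypothetical stand-in for a concrete partition.
    """
    phi = np.zeros(d)
    phi[partition_id(s, a)] = 1.0
    return phi

def combined_model(W, phi_sa, base_P_sa, base_R_sa):
    """Definition 2.1: state-dependent linear combination of K base models.

    W         : (K, d) matrix whose columns lie in the simplex.
    phi_sa    : (d,) feature vector phi(s, a), also in the simplex.
    base_P_sa : (K, |S|) array; row k is P^k(.|s, a) of base model k.
    base_R_sa : (K,) array; entry k is the mean reward of base model k at
                (s, a) (the reward distributions are combined the same way).
    Returns the next-state distribution and mean reward of M(W) at (s, a).
    """
    weights = W @ phi_sa       # [W phi(s, a)]_k, a K-vector in the simplex
    P_W = weights @ base_P_sa  # sum_k [W phi]_k P^k(.|s, a)
    R_W = weights @ base_R_sa  # sum_k [W phi]_k E[R^k(.|s, a)]
    return P_W, R_W
```

With d = 1 and φ(s, a) ≡ 1 this reduces to the global convex combination of Example 2.2.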

Goal. We consider the popular PAC learning objective: with probability at least 1 − δ, the algorithm should output a policy π with value $v^{\pi}_{M^*} \ge v^* - \epsilon$ by collecting poly(d, K, H, 1/ε, log(1/δ)) episodes of data. Importantly, the sample complexity is not allowed to depend on |S| or |A|. However, the assumption that M* lies in the class of linear models can be limiting and, therefore, we will allow some approximation error in our setting, defined as follows:

$$\theta := \min_{W \in \mathcal{W}} \; \sup_{(s,a) \in S \times A} \; \big\| P^{*}(\cdot \mid s,a) - P^{W}(\cdot \mid s,a) \big\|_1 + \big\| R^{*}(\cdot \mid s,a) - R^{W}(\cdot \mid s,a) \big\|_1. \tag{1}$$

We denote the optimal parameter attaining this value by W*. The case of θ = 0 represents the realizable setting where M* = M(W*) for some W* ∈ W. When θ ≠ 0, we cannot guarantee returning a policy with value close to v*, and will have to pay an additional penalty term proportional to the approximation error θ, as is standard in RL theory.
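For intuition, the quantity inside the minimization of Eq. (1) can be evaluated directly for any fixed candidate W when the environment is small and tabular. A minimal sketch follows; as a simplification it summarizes rewards by their means rather than full distributions, and it does not perform the minimization over W.

```python
import numpy as np

def misspecification(W, phi, P_true, R_true, base_P, base_R):
    """Worst-case model error of a fixed candidate W (inner term of Eq. (1)).

    Tabular sketch: P_true is (|S|, |A|, |S|); R_true is (|S|, |A|) mean
    rewards; base_P is (K, |S|, |A|, |S|); base_R is (K, |S|, |A|);
    phi is (|S|, |A|, d).
    """
    S, A = R_true.shape
    worst = 0.0
    for s in range(S):
        for a in range(A):
            w = W @ phi[s, a]                 # combination weights [W phi(s, a)]_k
            P_W = w @ base_P[:, s, a, :]      # mixture transition at (s, a)
            R_W = w @ base_R[:, s, a]         # mixture mean reward at (s, a)
            err = np.abs(P_true[s, a] - P_W).sum() + abs(R_true[s, a] - R_W)
            worst = max(worst, err)
    return worst
```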

Further Notations. Let $\pi_W$ be a shorthand for $\pi_{M(W)}$, the optimal policy in M(W). When referring to value functions and state-action distributions, we will use the superscript to specify the policy and the subscript to specify the MDP in which the policy is evaluated. For example, we will use $V^{W}_{W',h}$ to denote the value of $\pi_W$ (the optimal policy for model M(W)) when evaluated in model M(W') starting from timestep h. The term $d^{W}_{W',h}$ denotes the state-action distribution induced by policy $\pi_W$ at timestep h in the MDP M(W'). Furthermore, we will write $V^{W}_{M^*,h}$ and $d^{W}_{M^*,h}$ when the evaluation environment is M*. For conciseness, $V_{W,h}$ and $Q_{W,h}$ will denote the optimal (state- and Q-) value functions in model M(W) at step h (e.g., $V_{W,h}(s) \equiv V^{W}_{W,h}(s)$). The expected return of a policy π in model M(W) is defined as:

$$v^{\pi}_{W} = \mathbb{E}_{s \sim P_1}\big[V^{\pi}_{M(W),1}(s)\big]. \tag{2}$$

We assume that the total reward $\sum_{h=1}^{H} r_h$ lies in [0, 1] almost surely in all MDPs of interest and under all policies. Further, whenever used, any value function at step H + 1 (e.g., $V^{\pi}_{W,H+1}$) evaluates to 0 for any policy and any model.

3 RELATED WORK

MDPs with low-rank transition matrices. Yang and Wang (2019b,a); Jin et al. (2019) have recently considered structured MDPs whose transition matrices admit a low-rank factorization, where the left matrix in the factorization is known to the learner as state-action features (corresponding to our φ). Their environmental assumption is a special case of ours, where the transition dynamics of each base model, $P^{k}(\cdot \mid s,a)$, is independent of s and a, i.e., each base MDP can be fully specified by a single density distribution over S.¹ This special case enjoys many nice properties: for instance, the value function of any policy is linear in the state-action features, and the linear value-function class is closed under the Bellman update operators, both of which are heavily exploited in their algorithms and analyses. In contrast, none of these properties hold under our more general setup, yet we are still able to provide sample efficiency guarantees. That said, we do note that the special case allows these recent works to obtain stronger results: their algorithms are both statistically and computationally efficient (ours is only statistically efficient), and some of these algorithms work without knowing the K base distributions. We leave the problem of utilizing this linear structure in a regret minimization framework (Jin et al., 2019) for future work.

¹ In our setting, not knowing the base models immediately leads to hardness of learning, as it is equivalent to learning a general MDP without any prior knowledge even when d = K = 1. This requires Ω(|S||A|) sample complexity (Azar et al., 2012), which is vacuous as we are interested in solving problems with arbitrarily large state and action spaces.

Contextual MDPs. Abbasi-Yadkori and Neu (2014); Modi et al. (2018); Modi and Tewari (2019) consider a setting similar to our Example 2.2, except that the linear combination coefficients are visible to the learner and the base models are unknown. Therefore, despite the similarity in environmental assumptions, the learning objectives and the resulting sample complexities are significantly different (e.g., their guarantees depend on |S| and |A|).

Bellman rank. Jiang et al. (2017) have identified a structural parameter called Bellman rank for exploration under general value-function approximation, and devised an algorithm called OLIVE whose sample complexity is polynomial in the Bellman rank. A related notion is the witness rank (the model-based analog of Bellman rank) proposed by Sun et al. (2019). While our algorithm and analyses draw inspiration from these works, our setting does not obviously yield low Bellman rank or witness rank.² We will also provide a more detailed comparison to Sun et al. (2019), whose algorithm is most similar to ours among the existing works, in Section 4.

² In contrast, the low-rank MDPs considered by Yang and Wang (2019b,a); Jin et al. (2019) do admit low Bellman rank and low witness rank.

Mixtures/ensembles of models. The closest work to our setting is the multiple model-based RL (MMRL) architecture proposed by Doya et al. (2002), which also decomposes a given domain as a convex combination of multiple models. However, instead of learning the combination coefficients for a given ensemble, their method trains the model ensemble and simultaneously learns a mixture weight for each base model as a function of state features. Their experiments demonstrate that each model specializes in a different region of the state space where the environment dynamics is predictable, thereby providing a justification for using a convex combination of models for simulation. Further, the idea of combining different models is inherently present in Bayesian learning methods, where a posterior approximation of the real environment is iteratively refined using interaction data. For instance, Rajeswaran et al. (2017) introduce the EPOpt algorithm, which uses an ensemble of simulated domains to learn robust and generalizable policies. During learning, they adapt the ensemble distribution (convex combination) over source domains using data from the target domain to progressively make it a better approximation. Similarly, Lee et al. (2019) combine a set of parameterized models by adaptively refining the mixture distribution over the latent parameter space. Here, we study a relatively simpler setting where a finite number of such base models are combined, and give a frequentist sample complexity analysis for our method.

4 ALGORITHM AND MAIN RESULTS

Algorithm 1  PAC Algorithm for Linear Model Ensembles

1: Input: {M_1, ..., M_K}, ε, δ, φ(·,·), $\mathcal{W}_0$
2: for t = 1, 2, ... do
3:   Compute the optimistic model $W_t \leftarrow \arg\max_{W \in \mathcal{W}_{t-1}} v_W$ and set $\pi_t$ to $\pi_{W_t}$
4:   Estimate the value of $\pi_t$ using $n_{\mathrm{eval}}$ trajectories:
     $$\hat{v}_t := \frac{1}{n_{\mathrm{eval}}} \sum_{i=1}^{n_{\mathrm{eval}}} \sum_{h=1}^{H} r_h^{(i)} \tag{3}$$
5:   if $v_{W_t} - \hat{v}_t \le 3\epsilon/4 + (3\sqrt{dK} + 1)H\theta$ then
6:     Terminate and output $\pi_t$
7:   Collect n trajectories using $\pi_t$: $a_h \sim \pi_t(s_h)$
8:   Estimate the matrix $\hat{Z}_t$ and $\hat{y}_t$ as:
     $$\hat{Z}_t := \frac{1}{n} \sum_{i=1}^{n} \sum_{h=1}^{H} \bar{V}_{t,h}\big(s_h^{(i)}, a_h^{(i)}\big)\, \phi\big(s_h^{(i)}, a_h^{(i)}\big)^{\top} \tag{4}$$
     $$\hat{y}_t := \frac{1}{n} \sum_{i=1}^{n} \sum_{h=1}^{H} \Big[ r_h^{(i)} + V_{t,h+1}\big(s_{h+1}^{(i)}\big) \Big] \tag{5}$$
9:   Update the version space to $\mathcal{W}_t$ as the set:
     $$\Big\{ W \in \mathcal{W}_{t-1} : \big|\hat{y}_t - \langle W, \hat{Z}_t\rangle\big| \le \frac{\epsilon}{12\sqrt{dK}} + H\theta \Big\} \tag{6}$$

In this section we introduce the main algorithm, which learns a near-optimal policy in the aforementioned setup with a poly(d, K, H, 1/ε, log(1/δ)) sample complexity. We will first give the intuition behind the algorithm, and then present the formal sample complexity guarantees. Due to space constraints, we present the complete proof in the appendix. For simplicity, we will describe the intuition for the realizable case with θ = 0 ($P^* \equiv P^{W^*}$). The pseudocode (Algorithm 1) and the results are, however, stated for the general case of θ ≠ 0.

At a high level, our algorithm proceeds in iterations t = 1, 2, ..., and gradually refines a version space $\mathcal{W}_t$ of plausible parameters. Our algorithm follows an explore-or-terminate template and in each iteration either chooses to explore with a carefully chosen policy or terminates with a near-optimal policy. For exploration in the t-th iteration, we collect n trajectories $\{(s_1^{(i)}, a_1^{(i)}, r_1^{(i)}, s_2^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}, r_H^{(i)})\}_{i \in [n]}$ following some exploration policy $\pi_t$ (Line 7). A key component of the algorithm is to extract knowledge about W* from these trajectories. In particular, for every h, the bag of samples $\{s_{h+1}^{(i)}\}_{i \in [n]}$ may be viewed as an unbiased draw from the following distribution:

$$\frac{1}{n} \sum_{i=1}^{n} P^{W^*}\big(\cdot \mid s_h^{(i)}, a_h^{(i)}\big). \tag{7}$$

The situation for rewards is similar and will be omitted in the discussion. So in principle we could substitute W* in Eq. (7) with any candidate W, and if the resulting distribution differs significantly from the real samples $\{s_{h+1}^{(i)}\}_{h \in [H], i \in [n]}$, we can assert that W ≠ W* and eliminate W from the version space. However, the state space can be arbitrarily large in our setting, and comparing state distributions directly can be intractable. Instead, we project the state distribution in Eq. (7) using a (non-stationary) discriminator function $\{f_{t,h}\}_{h=1}^{H}$ (which will be chosen later) and consider the following scalar property:

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{h=1}^{H} \mathbb{E}_{r \sim R^{W^*}(\cdot \mid s_h^{(i)}, a_h^{(i)}),\; s' \sim P^{W^*}(\cdot \mid s_h^{(i)}, a_h^{(i)})}\big[ r + f_{t,h+1}(s') \big], \tag{8}$$

which can be effectively estimated by

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{h=1}^{H} \Big[ r_h^{(i)} + f_{t,h+1}\big(s_{h+1}^{(i)}\big) \Big]. \tag{9}$$

Since we have projected states onto ℝ, Eq. (9) is the average of scalar random variables and enjoys state-space-independent concentration. Now, in order to test the validity of a parameter W in a given version space, we compare the estimate in Eq. (9) with the prediction given by M(W), which is:

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{h=1}^{H} \mathbb{E}_{r \sim R^{W}(\cdot \mid s_h^{(i)}, a_h^{(i)}),\; s' \sim P^{W}(\cdot \mid s_h^{(i)}, a_h^{(i)})}\big[ r + f_{t,h+1}(s') \big]. \tag{10}$$

As we consider a linear model class, by linearity of expectations Eq. (10) may also be written as:

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{h=1}^{H} \big[ W \phi(s_h^{(i)}, a_h^{(i)}) \big]^{\top} \bar{V}_{t,h}\big(s_h^{(i)}, a_h^{(i)}\big) \tag{11}$$

$$= \Big\langle W, \; \frac{1}{n} \sum_{i=1}^{n} \sum_{h=1}^{H} \bar{V}_{t,h}\big(s_h^{(i)}, a_h^{(i)}\big)\, \phi\big(s_h^{(i)}, a_h^{(i)}\big)^{\top} \Big\rangle, \tag{12}$$

where ⟨A, B⟩ denotes Tr(A⊤B) for any two matrices A and B. In Eq. (12), $\bar{V}_{t,h}$ is a function that maps (s, a) to a K-dimensional vector with each entry being:

$$\big[\bar{V}_{t,h}(s,a)\big]_k := \mathbb{E}_{r \sim R^{k}(\cdot \mid s,a),\; s' \sim P^{k}(\cdot \mid s,a)}\big[ r + f_{t,h+1}(s') \big]. \tag{13}$$

The intuition behind Eq. (11) is that for each fixed state-action pair $(s_h^{(i)}, a_h^{(i)})$, the expectation in Eq. (8) can be computed by first taking the expectation of $r + f_{t,h+1}(s')$ over the reward and transition distributions of each of the K base models—which gives $\bar{V}_{t,h}$—and then aggregating the results using the combination coefficients. Rewriting Eq. (11) as Eq. (12), we see that Eq. (8) can also be viewed as a linear measurement of W*, where the measurement matrix is again $\frac{1}{n} \sum_{i=1}^{n} \sum_{h=1}^{H} \bar{V}_{t,h}(s_h^{(i)}, a_h^{(i)})\, \phi(s_h^{(i)}, a_h^{(i)})^{\top}$. Therefore, by estimating this measurement matrix and the outcome (Eq. (9)), we obtain an approximate linear equality constraint over $\mathcal{W}_{t-1}$ and can eliminate any candidate W that violates such constraints. By using a finite-sample concentration bound on the inner product, we get a linear inequality constraint with which to update the version space (Eq. (6)).

The remaining concern is to choose the exploration policy $\pi_t$ and the discriminator function $\{f_{t,h}\}$ to ensure that the linear constraint induced in each iteration is significantly different from the previous ones and induces deep cuts in the version space. We guarantee this by choosing $\pi_t := \pi_{W_t}$ and $f_{t,h} := V_{W_t,h}$,³ where $W_t$ is the optimistic model as computed in Line 3. That is, $W_t$ predicts the highest optimal value among all candidate models in $\mathcal{W}_{t-1}$. Following a terminate-or-explore argument, we show that as long as $\pi_t$ is suboptimal, the linear constraint induced by our choice of $\pi_t$ and $\{f_{t,h}\}$ will significantly reduce the volume of the version space, and the iteration complexity can be bounded as poly(d, K) by an ellipsoid argument similar to that of Jiang et al. (2017). Similarly, the sample size needed in each iteration only depends polynomially on d and K and incurs no dependence on |S| or |A|, as we have summarized high-dimensional objects such as $f_{t,h}$ (a function over states) using low-dimensional quantities such as $\bar{V}_{t,h}$ (a vector of length K).

³ We use the simplified notation $V_{t,h}$ for $V_{W_t,h}$.

The bound on the number of iterations and the number of samples needed per iteration leads to the following sample complexity result:

Theorem 4.1 (PAC bound for Alg. 1). In Algorithm 1, if $n_{\mathrm{eval}} := \frac{32H^2}{\epsilon^2}\log\frac{4T}{\delta}$ and $n = \frac{1800\, d^2 K H^2}{\epsilon^2}\log\frac{8dKT}{\delta}$, where $T = dK \log\frac{2\sqrt{2}KH}{\epsilon} \big/ \log\frac{5}{3}$, then with probability at least 1 − δ, the algorithm terminates after using at most

$$\tilde{O}\!\Big( \frac{d^3 K^2 H^2}{\epsilon^2}\, \log\frac{dKH}{\epsilon\delta} \Big) \tag{14}$$

trajectories and returns a policy $\pi_T$ with a value $v^{\pi_T} \ge v^* - \epsilon - (3\sqrt{dK} + 2)H\theta$.

By setting d and K to appropriate values, we obtain the following sample complexity bounds as corollaries:

Corollary 4.2 (Sample complexity for partitions). Since the state-action partitioning setting (Example 2.3) is subsumed by the general setup, the sample complexity is again:

$$\tilde{O}\!\Big( \frac{d^3 K^2 H^2}{\epsilon^2}\, \log\frac{dKH}{\epsilon\delta} \Big). \tag{15}$$

Corollary 4.3 (Sample complexity for global convex combination). When the base models are combined without any dependence on state-action features (Example 2.2), the setting is a special case of the general setup with d = 1. Thus, the sample complexity is:

$$\tilde{O}\!\Big( \frac{K^2 H^2}{\epsilon^2}\, \log\frac{KH}{\epsilon\delta} \Big). \tag{16}$$

Our algorithm, therefore, satisfies the requirement of learning a near-optimal policy without any dependence on |S| or |A|. Moreover, we can also account for the approximation error θ, but we then incur a cost of $(3\sqrt{dK} + 1)H\theta$ in the performance guarantee of the final policy. As we use the projection of value functions through the linear model class, we do not model the complete dynamics of the environment. This leads to an additive loss of $3\sqrt{dK}\,H\theta$ in value, in addition to the best achievable value loss of 2Hθ (see Corollary A.4 in the appendix).
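To illustrate the estimation step in Lines 8–9 of Algorithm 1 (Eqs. (4)–(6) and (11)–(13)), the following sketch (Python/NumPy, finite state space, base models summarized by mean rewards) builds the measurement matrix and outcome from a batch of trajectories and tests a candidate W against the induced linear constraint. The data layout and helper names are illustrative assumptions.

```python
import numpy as np

def V_bar(s, a, f_next, base_P, base_R):
    """Eq. (13): K-vector of one-step backups of the discriminator under each
    base model. f_next is a length-|S| array; base_P is (K, |S|, |A|, |S|)
    and base_R is (K, |S|, |A|) mean rewards."""
    return base_R[:, s, a] + base_P[:, s, a, :] @ f_next          # shape (K,)

def linear_constraint(trajectories, f, phi, base_P, base_R):
    """Eqs. (4)-(5): average the measurement matrix Z_hat and outcome y_hat.

    trajectories: list of (states, actions, rewards) with len(states) = H + 1
    and len(actions) = len(rewards) = H.
    f: list of H + 1 arrays over states, where f[j] stores V_{W_t, j+1}
    (so f[H], the step-(H+1) value, is identically zero).
    phi: (|S|, |A|, d) array of feature vectors.
    """
    K, d = base_P.shape[0], phi.shape[-1]
    Z_hat, y_hat, n = np.zeros((K, d)), 0.0, len(trajectories)
    for states, actions, rewards in trajectories:
        for h in range(len(actions)):
            s, a = states[h], actions[h]
            Z_hat += np.outer(V_bar(s, a, f[h + 1], base_P, base_R), phi[s, a])
            y_hat += rewards[h] + f[h + 1][states[h + 1]]
    return Z_hat / n, y_hat / n

def survives(W, Z_hat, y_hat, eps, d, K, H, theta=0.0):
    """Eq. (6): keep W in the version space iff the linear constraint holds."""
    return abs(y_hat - np.sum(W * Z_hat)) <= eps / (12 * np.sqrt(d * K)) + H * theta
```

Note that `np.sum(W * Z_hat)` is exactly the trace inner product ⟨W, Ẑ_t⟩ used in Eq. (6).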

Comparison to OLIME (Sun et al., 2019). Our Algorithm 1 shares some structural similarity with the OLIME algorithm proposed by Sun et al. (2019), but there are also several important differences. First of all, OLIME in each iteration picks a time step and takes uniformly random actions during data collection, and consequently incurs polynomial dependence on |A| in its sample complexity. In comparison, our main data collection step (Line 7) never takes a random deviation, and we do not pay any dependence on the cardinality of the action space. Secondly, similar to how we project the transition distributions onto a discriminator function (Eqs. (7) and (8)), OLIME projects the distributions onto a static discriminator class and uses the corresponding integral probability metric (IPM) as a measure of model misfit. In our setting, however, we find that the most efficient and elegant way to extract knowledge from data is to use a dynamic discriminator function, $V_{W_t,h}$, which changes from iteration to iteration and depends on the previously collected data. Such a choice of discriminator function allows us to make direct cuts on the parameter space $\mathcal{W}$, whereas OLIME can only make cuts in the value prediction space.

Computational Characteristics. In each iteration, our algorithm computes the optimistic policy within the version space. Therefore, we rely on access to the following optimistic planning oracle:

Assumption 4.4 (Optimistic planning oracle). We assume that, when given a version space $\mathcal{W}_t$, we can obtain the optimistic model through a single oracle call for $W_t = \arg\max_{W \in \mathcal{W}_t} v_W$.

It is important to note that any version space $\mathcal{W}_t$ that we deal with is always an intersection of half-spaces induced by the linear inequality constraints. Therefore, one would hope to solve the optimistic planning problem in a computationally efficient manner given the nice geometrical form of the version space. However, even for a finite state-action space, we are not aware of any efficient solution, as the planning problem induces bilinear and non-convex constraints despite the linearity assumption. Many recently proposed algorithms also suffer from such a computational difficulty (Jiang et al., 2017; Dann et al., 2018; Sun et al., 2019).

Further, we also assume that for any given W, we can compute the optimal policy $\pi_W$ and its value function: our elimination criterion in Eq. (6) uses the estimates $\hat{Z}_t$ and $\hat{y}_t$, which in turn depend on the value function. This requirement corresponds to a standard planning oracle, and aligns with the motivation of our setting, as we can delegate these computations to any learning algorithm operating in the simulated environment with the given combination coefficients. Our algorithm, instead, focuses on careful and systematic exploration to minimize the sample complexity in the real world.
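For a small tabular instance, the standard planning oracle described above can be instantiated by finite-horizon value iteration in the combined model M(W); a minimal sketch (same tabular layout and mean-reward simplification as before) is shown below. The optimistic oracle of Assumption 4.4 additionally maximizes the resulting value over the version space, which is the computationally hard part.

```python
import numpy as np

def plan_in_combined_model(W, phi, base_P, base_R, H):
    """Finite-horizon value iteration in M(W) (tabular sketch).

    Returns the optimal value functions V (shape (H+1, |S|), with V[H] = 0)
    and the greedy non-stationary policy pi (shape (H, |S|)).
    """
    K, S, A, _ = base_P.shape
    # Pre-compute M(W)'s transition kernel and mean reward at every (s, a).
    w = np.einsum('kd,sad->sak', W, phi)         # weights [W phi(s, a)]_k
    P_W = np.einsum('sak,ksat->sat', w, base_P)  # (S, A, S) mixture transitions
    R_W = np.einsum('sak,ksa->sa', w, base_R)    # (S, A) mixture mean rewards
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):               # backward induction
        Q = R_W + P_W @ V[h + 1]                 # (S, A) Q-values at step h
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return V, pi
```

Under this sketch, the quantity $v_W$ queried by the optimistic oracle is $\mathbb{E}_{s \sim P_1}[V_{W,1}(s)]$, i.e., `P1 @ V[0]` for an initial-state distribution `P1`.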

5 MODEL SELECTION WITH CANDIDATE PARTITIONS

In the previous section we showed that a near-optimal policy can be PAC-learned under our modeling assumptions, where the feature map φ : S × A → ∆_{d−1} is given along with the approximation error θ. In this section, we explore the more interesting and challenging setting where a realizable feature map φ is unknown, but we know that the realizable φ belongs to a candidate set $\{\phi_i\}_{i=1}^{N}$, i.e., the true environment satisfies our modeling assumption in Definition 2.1 under $\phi = \phi_{i^*}$ for some $i^* \in [N]$ with $\theta_{i^*} = 0$. Note that Definition 2.1 may be satisfied by multiple $\phi_i$'s; for example, adding redundant features to an already realizable $\phi_{i^*}$ still yields a realizable feature map. In such cases, we consider $\phi_{i^*}$ to be the most succinct feature map among all realizable ones, i.e., the one with the lowest dimensionality. Let $d_i$ denote the dimensionality of $\phi_i$, and $d^* = d_{i^*}$.

One obvious baseline in this setup is to run Algorithm 1 with each $\phi_i$ and select the best policy among the returned ones. This leads to a sample complexity of roughly $\sum_{i=1}^{N} d_i^3$ (only the dependence on $\{d_i\}_{i=1}^{N}$ is considered), which can be very inefficient: when there exists j such that $d^* \ll d_j$, we pay for $d_j^3$, which is much greater than the sample complexity under $d^*$; when the $\{d_i\}$ are relatively uniform, we pay a linear dependence on N, preventing us from competing with a large set of candidate feature maps.

So the key result we want to obtain is a sample complexity that scales as $(d^*)^3$, possibly with a mild multiplicative overhead depending on $d^*$ and/or N (e.g., $\log d^*$ and $\log N$).

Hardness Result for Unstructured $\{\phi_i\}$. Unfortunately, we show via a lower bound that this is impossible when $\{\phi_i\}$ is unstructured. In the lower bound construction, we have an exponentially large set of candidate feature maps, all of which are state space partitions. Each of the partitions has trivial dimensionality ($d_i = 2$, K = 2), but the sample complexity of learning is exponential, which can only be explained away as Ω(N).

Proposition 5.1. For the aforementioned problem of learning an ε-optimal policy using a candidate feature set of size N, no algorithm can achieve poly($d^*$, K, H, 1/ε, 1/δ, $N^{1-\alpha}$) sample complexity for any constant 0 < α < 1.

On a separate note, besides providing formal justification for the structural assumption we will introduce later, this proposition is of independent interest as it also sheds light on the hardness of model selection with state abstractions. We discuss the further implications in Appendix B.

Proof of Proposition 5.1. We construct a linear class of MDPs with two base models $M_1$ and $M_2$ in the following way: consider a complete tree of depth H with a branching factor of 2. The vertices form the state space of $M_1$ and $M_2$, and the two outgoing edges in each state are the available actions. Both MDPs share the same deterministic transitions, and each non-leaf node yields 0 reward. Every leaf node yields +1 reward in $M_1$ and 0 in $M_2$. Now we construct a candidate partition set $\{\phi_i\}$ of size $2^H$: for $\phi_i$, the i-th leaf node belongs to one equivalence class while all other leaf nodes belong to the other. (Non-leaf nodes can belong to either class, as $M_1$ and $M_2$ agree on their transitions and rewards.)

Observe that the above model class contains a finite family of $2^H$ MDPs, each of which has only 1 rewarding leaf node. Concretely, the MDP whose i-th leaf is rewarding is exactly realized under the feature map $\phi_i$, whose corresponding W* is the identity matrix: the i-th leaf yields +1 reward as in $M_1$, and all other leaves yield 0 reward as in $M_2$. Learning in this family of $2^H$ MDPs is provably hard (Krishnamurthy et al., 2016): when the rewarding leaf is chosen adversarially, the learner has no choice but to visit almost all leaf nodes to identify the rewarding leaf, as long as ε is below a constant threshold. The proposition follows from the fact that in this setting $d^* = 2$, K = 2, 1/ε is a constant, and $N = 2^H$, but the sample complexity is $\Omega(2^H)$.

This lower bound shows the necessity of introducing structural assumptions on $\{\phi_i\}$. Below, we consider a particular structure of nested partitions that is natural and enables sample-efficient learning. Similar assumptions have also been considered in the state abstraction literature (e.g., Jiang et al., 2015).

Nested Partitions as a Structural Assumption. Consider the case where every $\phi_i$ is a partition. W.l.o.g. let $d_1 \le d_2 \le \ldots \le d_N$. We assume $\{\phi_i\}$ is nested, meaning that for all (s, a), (s', a'),

$$\phi_i(s, a) = \phi_i(s', a') \;\Longrightarrow\; \phi_j(s, a) = \phi_j(s', a'), \quad \forall\, i \le j.$$

While this structural assumption almost allows us to develop sample-efficient algorithms, it is still insufficient, as demonstrated by the following hardness result.

Proposition 5.2. Fixing K = 2, there exist base models $M_1$ and $M_2$ and nested state space partitions $\phi_1$ and $\phi_2$ such that it is information-theoretically impossible for any algorithm to obtain poly($d^*$, H, K, 1/ε, 1/δ) sample complexity when an adversary chooses an MDP that satisfies our environmental assumption (Definition 2.1) under either $\phi_1$ or $\phi_2$.

Proof. We will again use an exponential-tree-style construction to prove the lower bound. Specifically, we construct two MDPs M and M′, which are obtained by combining two base MDPs $M_1$ and $M_2$ using two different partitions $\phi_1$ and $\phi_2$. The specification of $M_1$ and $M_2$ is exactly the same as in the proof of Proposition 5.1. We choose $\phi_1$ to be a partition of size $d_1 = 1$, where all nodes are grouped together. $\phi_2$ has size $d_2 = 2^H$, where each leaf node belongs to a separate group. (As before, which group the inner nodes belong to does not matter.) $\phi_1$ and $\phi_2$ are obviously nested. We construct M, which is realizable under $\phi_2$, by randomly choosing a leaf and setting the weights for the convex combination to (1/2 + 2ε, 1/2 − 2ε) for that leaf; for all other leaves, the weights are (1/2, 1/2). This is equivalent to randomly choosing M from a set of $2^H$ MDPs, each of which has only one good leaf node yielding a random reward drawn from Ber(1/2 + 2ε) instead of Ber(1/2). In contrast, M′ is such that all leaf nodes yield Ber(1/2) reward, which is realizable under $\phi_1$ with weights (1/2, 1/2).

Observe that M and M′ are exactly the same as the constructions in the proof of the multi-armed bandit lower bound by Auer et al. (2002) (the number of arms is $2^H$), where it has been shown that distinguishing between M and M′ takes $\Omega(2^H/\epsilon^2)$ samples. Now assume towards contradiction that there exists an algorithm that achieves poly($d^*$, H, K, 1/ε, 1/δ) sample complexity; let f be the specific polynomial in its guarantee. After f(1, H, 2, 1/ε, 1/δ) trajectories are collected, the algorithm must stop if the true environment is M′, in order to honor the sample complexity guarantee (since $d^* = 1$, K = 2), and proceed to collect more trajectories if M is the true environment (since $d^* = 2^H$). Making this decision essentially requires distinguishing between M′ and M using f(1, H, 2, 1/ε, 1/δ) = poly(H) trajectories, which contradicts the known hardness result from Auer et al. (2002). This proves the statement.

Essentially, the lower bound creates a situation where $d_1 \ll d_2$, and nature may adversarially choose a model such that either $\phi_1$ or $\phi_2$ is realizable. If $\phi_1$ is realizable, the learner is only allowed a small sample budget and cannot fully explore with $\phi_2$, and if $\phi_1$ is not realizable, the learner must do the opposite. The information-theoretic lower bound shows that it is fundamentally hard to distinguish between the two situations: once the learner explores with $\phi_1$, she cannot decide whether she should stop or move on to $\phi_2$ without collecting a large amount of data.

This hardness result motivates our last assumption in this section: the learner knows the value of v* (a scalar) as side information. This way, the learner can compare the value of the returned policy in each round to v* and effectively decide when to stop. This naturally leads to our Algorithm 2, which uses a doubling scheme over $\{d_i\}$, with the following sample complexity guarantee.

Algorithm 2  Model Selection with Nested Partitions

Input: $\{\phi_1, \phi_2, \ldots, \phi_N\}$, $\{M_1, \ldots, M_K\}$, ε, δ, v*.
i ← 0
while True do
  Choose $\phi_i$ such that $d_i$ is the largest among $\{d_j : d_j \le 2^i\}$.
  Run Algorithm 1 with $\phi_i$, $\epsilon_i = \epsilon/2$ and $\delta_i = \delta/2N$.
  Terminate the sub-routine if $t > d_i K \log\frac{2\sqrt{2}KH}{\epsilon} \big/ \log\frac{5}{3}$.
  Let $\pi_i$ be the returned policy (if any). Let $\hat{v}_i$ be the estimated return of $\pi_i$ using $n_{\mathrm{eval}} = \frac{9}{2\epsilon^2}\log\frac{2N}{\delta}$ Monte-Carlo trajectories.
  if $\hat{v}_i \ge v^* - \frac{2\epsilon}{3}$ then
    Terminate with output $\pi_i$.
  i ← i + 1
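A minimal sketch of the doubling scheme in Algorithm 2 follows, treating the call to Algorithm 1 and the Monte-Carlo policy evaluation as black boxes; `run_algorithm_1` and `estimate_return` are placeholder names for these oracles and not part of the paper.

```python
import math

def model_selection(phis, dims, v_star, eps, delta, K, H,
                    run_algorithm_1, estimate_return):
    """Doubling scheme over nested candidate partitions (Algorithm 2 sketch).

    run_algorithm_1(phi, eps_i, delta_i, max_iters) should return a policy,
    or None if the iteration budget is exhausted; estimate_return(policy,
    n_eval) should return a Monte-Carlo estimate of its value in the real
    environment.
    """
    N = len(phis)
    n_eval = math.ceil(9 / (2 * eps ** 2) * math.log(2 * N / delta))
    i = 0
    while True:
        # Largest candidate dimensionality within the current budget 2^i.
        eligible = [j for j in range(N) if dims[j] <= 2 ** i]
        if eligible:
            m = max(eligible, key=lambda j: dims[j])
            # Iteration cap matching the termination threshold of Algorithm 2.
            max_iters = math.ceil(dims[m] * K
                                  * math.log(2 * math.sqrt(2) * K * H / eps)
                                  / math.log(5 / 3))
            policy = run_algorithm_1(phis[m], eps / 2, delta / (2 * N), max_iters)
            if policy is not None:
                v_hat = estimate_return(policy, n_eval)
                if v_hat >= v_star - 2 * eps / 3:
                    return policy
        i += 1
```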

Theorem 5.3. When Algorithm 2 is run with the input v*, with probability at least 1 − δ, it returns a near-optimal policy π with $v^{\pi} \ge v^* - \epsilon$ using at most $\tilde{O}\!\big( \frac{d^{*3} K^2 H^2}{\epsilon^2} \log d^* \log\frac{d^* K H N}{\delta} \big)$ samples.

Proof. In Algorithm 2, for each partition i, we run Algorithm 1 until termination or until the sample budget is exhausted. By a union bound it is easy to verify that, with probability at least 1 − δ, all calls to Algorithm 1 succeed and the Monte-Carlo estimates of the returned policies are (ε/3)-accurate; we only consider this success event in the rest of the proof. When the partition under consideration is realizable, we get $v^{\pi_i}_{M^*} \ge v^* - \epsilon/3$, and therefore

$$\hat{v}_i \ge v^{\pi_i} - \tfrac{\epsilon}{3} \ge v^* - \tfrac{2\epsilon}{3},$$

so the algorithm will terminate after considering a realizable $\phi_i$. Similarly, whenever the algorithm terminates, we have $v^{\pi_i} \ge v^* - \epsilon$. This is because

$$v^{\pi_i} \ge \hat{v}_i - \tfrac{\epsilon}{3} \ge v^* - \epsilon,$$

where the last inequality holds thanks to the termination condition of Algorithm 2, which relies on knowledge of v*. The total number of iterations of the algorithm is at most $O(\log d^*)$. Therefore, by taking a union bound over all possible iterations, the sample complexity is

$$\sum_{i=1}^{J} \tilde{O}\!\Big( \frac{d_i^3 K^2 H^2}{\epsilon^2} \Big) \;\le\; \tilde{O}\!\Big( \frac{d^{*3} K^2 H^2}{\epsilon^2} \log d^* \Big).$$

Discussion. Model selection in online learning—especially in the context of sequential decision making—is generally considered very challenging, and until recently there has been relatively limited work beyond some special cases. For instance, Foster et al. (2019) consider the model selection problem in linear contextual bandits with a sequence of nested policy classes with dimensions $d_1 < d_2 < \ldots$. They consider a similar goal of achieving sub-linear regret bounds which only scale with the optimal dimension $d_{m^*}$. In contrast to our result, they do not need to know the achievable value in the environment and give no-regret learning methods in this knowledge-free setting. However, this does not contradict our lower bound: due to the extremely delayed reward signal, our construction is equivalent to a multi-armed bandit problem with $2^H$ arms. Our negative result (Proposition 5.2) shows a lower bound on sample complexity which is exponential in the horizon, thereby eliminating the possibility of sample-efficient and knowledge-free model selection in MDPs.

6 CONCLUSION

In this paper, we proposed a sample-efficient model-based algorithm which learns a near-optimal policy by approximating the true environment via a feature-dependent convex combination of a given ensemble. Our algorithm offers a sample complexity bound which is independent of the size of the environment and only depends on the number of parameters being learnt. In addition, we also consider a model selection problem, show exponential lower bounds, and then give sample-efficient methods under natural assumptions. The proposed algorithm and its analysis rely on a linearity assumption and share this aspect with existing exploration methods for rich observation MDPs. We leave the possibility of considering a richer class of convex combinations to future work. Lastly, our work also revisits the open problem of developing a computationally and sample-efficient model-based learning algorithm.

Acknowledgements

This work was supported in part by a grant from the Open Philanthropy Project to the Center for Human-Compatible AI, and in part by NSF grant CAREER IIS-1452099.

References

Abbasi-Yadkori, Y. and Neu, G. (2014). Online learning in MDPs with side information. arXiv preprint arXiv:1406.6812.

Agarwal, A., Rakhlin, A., and Bartlett, P. (2008). Matrix regularization techniques for online multitask learning. Technical Report UCB/EECS-2008-138, EECS Department, University of California, Berkeley.

Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. (2018). Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256.

Azar, M. G., Munos, R., and Kappen, H. J. (2012). On the sample complexity of reinforcement learning with a generative model. In Proceedings of the 29th International Conference on Machine Learning, pages 1707–1714. Omnipress.

Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. (2018). Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pages 8224–8234.

Christiano, P., Shah, Z., Mordatch, I., Schneider, J., Blackwell, T., Tobin, J., Abbeel, P., and Zaremba, W. (2016). Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518.

Dann, C., Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. (2018). On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems, pages 1422–1432.

Doya, K., Samejima, K., Katagiri, K.-i., and Kawato, M. (2002). Multiple model-based reinforcement learning. Neural Computation, 14(6):1347–1369.

Foster, D. J., Krishnamurthy, A., and Luo, H. (2019). Model selection for contextual bandits. arXiv preprint arXiv:1906.00531.

Givan, R., Dean, T., and Greig, M. (2003). Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1-2):163–223.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. (2017). Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1704–1713. JMLR.org.

Jiang, N., Kulesza, A., and Singh, S. (2015). Abstraction selection in model-based reinforcement learning. In International Conference on Machine Learning, pages 179–188.

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2019). Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388.

Kearns, M. and Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232.

Krishnamurthy, A., Agarwal, A., and Langford, J. (2016). PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems 29, pages 1840–1848.

Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. (2018). Model-ensemble trust-region policy optimization. In International Conference on Learning Representations.

Lee, G., Hou, B., Mandalika, A., Lee, J., and Srinivasa, S. S. (2019). Bayesian policy optimization for model uncertainty. In International Conference on Learning Representations.

Li, L., Walsh, T. J., and Littman, M. L. (2006). Towards a unified theory of state abstraction for MDPs. In ISAIM.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.

Modi, A., Jiang, N., Singh, S., and Tewari, A. (2018). Markov decision processes with continuous side information. In Algorithmic Learning Theory, pages 597–618.

Modi, A. and Tewari, A. (2019). Contextual Markov decision processes using generalized linear models. arXiv preprint arXiv:1903.06187.

Mordatch, I., Mishra, N., Eppner, C., and Abbeel, P. (2016). Combining model-based policy search with online model learning for control of physical humanoids. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 242–248. IEEE.

Rajeswaran, A., Ghotra, S., Ravindran, B., and Levine, S. (2017). EPOpt: Learning robust neural network policies using model ensembles. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484.

Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. (2019). Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2898–2933, Phoenix, USA. PMLR.

Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE.

Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE.

Whitt, W. (1978). Approximations of dynamic programs, I. Mathematics of Operations Research, 3(3):231–243.

Yang, L. and Wang, M. (2019a). Sample-optimal parametric Q-learning using linearly additive features. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6995–7004, Long Beach, California, USA. PMLR.

Yang, L. F. and Wang, M. (2019b). Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389.