
The Sample-Complexity of General Reinforcement Learning

Tor Lattimore    [email protected]    Australian National University
Marcus Hutter    [email protected]    Australian National University
Peter Sunehag    [email protected]    Australian National University

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

We present a new algorithm for general reinforcement learning where the true environment is known to belong to a finite class of N arbitrary models. The algorithm is shown to be near-optimal for all but O(N log² N) time-steps with high probability. Infinite classes are also considered, where we show that compactness is a key criterion for determining the existence of uniform sample-complexity bounds. A matching lower bound is given for the finite case.

1. Introduction

Reinforcement Learning (RL) is the task of learning policies that lead to nearly-optimal rewards where the environment is unknown. One metric of the efficiency of an RL algorithm is sample-complexity, which is a high-probability upper bound on the number of time-steps when that algorithm is not nearly-optimal that holds for all environments in some class. Such bounds are typically shown for very specific classes of environments, such as (partially observable/factored) Markov Decision Processes (MDPs) and bandits. We consider more general classes of environments where at each time-step an agent takes an action a ∈ A, whereupon it receives a reward r ∈ [0, 1] and an observation o ∈ O, which are generated stochastically by the environment and may depend arbitrarily on the entire history sequence.

We present a new reinforcement learning algorithm, named Maximum Exploration Reinforcement Learning (MERL), that accepts as input a finite set M := {ν1, · · · , νN} of arbitrary environments, an accuracy ε, and a confidence δ. The main result is that MERL has a sample-complexity of

    Õ( (N / (ε²(1 − γ)³)) log²( N / (δε(1 − γ)) ) ),

where 1/(1 − γ) is the effective horizon determined by discount rate γ. We also consider the case where M is infinite, but compact with respect to a particular topology. In this case, a variant of MERL has the same sample-complexity as above, but where N is replaced by the size of the smallest ε-cover. A lower bound is also given that matches the upper bound except for logarithmic factors. Finally, if M is non-compact then in general no finite sample-complexity bound exists.

1.1. Related Work

Many authors have worked on the sample-complexity of RL in various settings. The simplest case is the multiarmed bandit problem, which has been extensively studied with varying assumptions. The typical measure of efficiency in the bandit literature is regret, but sample-complexity bounds are also known and sometimes used. The next step from bandits is finite-state MDPs, of which bandits are an example with only a single state. There are two main settings when MDPs are considered: the discounted case, where sample-complexity bounds are proven, and the undiscounted (average reward) case, where regret bounds are more typical. In the discounted setting the upper and lower bounds on sample-complexity are now extremely refined. See Strehl et al. (2009) for a detailed review of the popular algorithms and theorems. More recent work on closing the gap between upper and lower bounds is by Szita & Szepesvári (2010); Lattimore & Hutter (2012); Azar et al. (2012). In the undiscounted case it is necessary to make some form of ergodicity assumption, as without this regret bounds cannot be given. In this work we avoid ergodicity assumptions and discount future rewards.
Nevertheless, our algorithm borrows some tricks used by UCRL2 (Auer et al., 2010). Previous work for more general environment classes is somewhat limited. For factored MDPs there are known bounds; see (Chakraborty & Stone, 2011) and references therein. Even-Dar et al. (2005) give essentially unimprovable exponential bounds on the sample-complexity of learning in finite partially observable MDPs. Maillard et al. (2013) show regret bounds for undiscounted RL where the true environment is assumed to be finite, Markov and communicating, but where the state is not directly observable. As far as we know there has been no work on the sample-complexity of RL when environments are completely general, but asymptotic results have garnered some attention, with positive results by Hutter (2002); Ryabko & Hutter (2008); Sunehag & Hutter (2012) and (mostly) negative ones by Lattimore & Hutter (2011b). Perhaps the closest related work is (Diuk et al., 2009), which deals with a similar problem in the rather different setting of learning the optimal predictor from a class of N experts. They obtain an O(N log N) bound, which is applied to the problem of structure learning for discounted finite-state factored MDPs. Our work generalises this approach to the non-Markov case and compact model classes.

2. Notation

The definition of environments is borrowed from the work of ?, although the notation is slightly more formal to ease the application of martingale inequalities.

General. N = {0, 1, 2, · · · } is the natural numbers. For the indicator function we write [[x = y]] = 1 if x = y and 0 otherwise. We use ∧ and ∨ for logical and/or respectively. If A is a set then |A| is its size and A∗ is the set of all finite strings (sequences) over A. If x and y are sequences then x ⊏ y means that x is a prefix of y. Unless otherwise mentioned, log represents the natural logarithm. For a random variable X we write EX for its expectation. For x ∈ R, ⌈x⌉ is the ceiling function.

Environments and policies. Let A, O and R ⊂ R be finite sets of actions, observations and rewards respectively and H := A × O × R. H∞ is the set of infinite history sequences while H∗ := (A × O × R)∗ is the set of finite history sequences. If h ∈ H∗ then ℓ(h) is the number of action/observation/reward tuples in h. We write at(h), ot(h), rt(h) for the tth action/observation/reward of history sequence h. For h ∈ H∗, Γh := {h′ ∈ H∞ : h ⊏ h′} is the cylinder set. Let F := σ({Γh : h ∈ H∗}) and Ft := σ({Γh : h ∈ H∗ ∧ ℓ(h) = t}) be σ-algebras. An environment µ is a set of conditional probability distributions over observation/reward pairs given the history so far. A policy π is a function π : H∗ → A. An environment and policy interact sequentially to induce a measure, Pµ,π, on the filtered probability space (H∞, F, {Ft}). For convenience, we abuse notation and write Pµ,π(h) := Pµ,π(Γh). If h ⊏ h′ then conditional probabilities are Pµ,π(h′|h) := Pµ,π(h′)/Pµ,π(h). Rt(h; d) := Σ_{k=t}^{t+d} γ^{k−t} rk(h) is the d-step return function and Rt(h) := lim_{d→∞} Rt(h; d). Given history ht with ℓ(ht) = t, the value function is defined by V^π_µ(ht; d) := E[Rt(h; d) | ht], where the expectation is taken with respect to Pµ,π(·|ht), and V^π_µ(ht) := lim_{d→∞} V^π_µ(ht; d). The optimal policy for environment µ is π*_µ := arg max_π V^π_µ, which with our assumptions is known to exist (Lattimore & Hutter, 2011a). The value of the optimal policy is V*_µ := V^{π*_µ}_µ. In general, µ denotes the true environment while ν is a model. π will typically be the policy of the algorithm under consideration. Q*_µ(h, a) is the value in history h of following policy π*_µ except for the first time-step, when action a is taken. M is a set of environments (models).

Sample-complexity. Policy π is ε-optimal in history h and environment µ if V*_µ(h) − V^π_µ(h) ≤ ε. The sample-complexity of a policy π in environment class M is the smallest Λ such that, with high probability, π is ε-optimal for all but Λ time-steps for all µ ∈ M. Define L^ε_{µ,π} : H∞ → N ∪ {∞} to be the number of time-steps when π is not ε-optimal,

    L^ε_{µ,π}(h) := Σ_{t=1}^∞ [[ V*_µ(ht) − V^π_µ(ht) > ε ]],

where ht is the length-t prefix of h. The sample-complexity of policy π is Λ with respect to accuracy ε and confidence 1 − δ if P{L^ε_{µ,π}(h) > Λ} < δ for all µ ∈ M.
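The short sketch below illustrates two of the quantities defined above, the d-step return Rt(h; d) and the counter L^ε_{µ,π}, restricted to a finite history prefix. The plain-list representation of rewards and the precomputed value sequences are assumptions made only for illustration; computing the value functions themselves is the hard part and is treated abstractly in the paper.

    def d_step_return(rewards, t, d, gamma):
        # R_t(h; d) = sum_{k=t}^{t+d} gamma^(k-t) r_k(h), truncated at the end
        # of the observed reward sequence (a sketch-level convenience).
        end = min(t + d, len(rewards) - 1)
        return sum(gamma ** (k - t) * rewards[k] for k in range(t, end + 1))

    def suboptimal_steps(v_star, v_pi, eps):
        # L^eps_{mu,pi} restricted to the observed prefix: the number of
        # time-steps t with V*_mu(h_t) - V^pi_mu(h_t) > eps. The two value
        # sequences are assumed to be supplied by an external routine.
        return sum(1 for vs, vp in zip(v_star, v_pi) if vs - vp > eps)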
3. Finite Case

We start with the finite case where the true environment is known to belong to a finite set of models, M. The Maximum Exploration Reinforcement Learning algorithm is model-based in the sense that it maintains a set, Mt ⊆ M, where models are eliminated once they become implausible. The algorithm operates in phases of exploration and exploitation, choosing to exploit if it knows all plausible environments are reasonably close under all optimal policies and to explore otherwise. This method of exploration essentially guarantees that MERL is nearly optimal whenever it is exploiting and that the number of exploration phases is limited with high probability. The main difficulty is specifying what it means to be plausible. Previous authors working on finite environments, such as MDPs or bandits, have removed models for which the transition probabilities are not sufficiently close to their empirical estimates.
In the more general setting this approach fails because states (histories) are never visited more than once, so sufficient empirical estimates cannot be collected. Instead, we eliminate environments if the reward we actually collect over time is not sufficiently close to the reward we expected given that environment.

Before giving the explicit algorithm, we explain the operation of MERL more formally in two parts. First we describe how it chooses to explore and exploit, and then how the model class is maintained. See Figure 1 for a diagram of how exploration and exploitation occur.

Exploring and exploiting. At each time-step t, MERL computes the pair of environments ν̄, ν̲ in the model class Mt and the policy π maximising the difference

    ∆ := V^π_ν̄(h; d) − V^π_ν̲(h; d),    d := (1/(1 − γ)) log( 8/(ε(1 − γ)) ).

If ∆ > ε/4, then MERL follows policy π for d time-steps, which we call an exploration phase. Otherwise, for one time-step it follows the optimal policy with respect to the first environment currently in the model class. Therefore, if MERL chooses to exploit, then all policies and environments in the model class lead to similar values, which implies that exploiting is near-optimal. If MERL explores, then either V^π_ν̄(h; d) − V^π_µ(h; d) > ε/8 or V^π_µ(h; d) − V^π_ν̲(h; d) > ε/8, which will allow us to apply concentration inequalities to eventually eliminate either ν̄ (the upper bound) or ν̲ (the lower bound).

The model class. An exploration phase is a κ-exploration phase if ∆ ∈ [2^{κ−2}ε, 2^{κ−1}ε), where

    κ ∈ K := { 0, 1, 2, · · · , ⌈log₂(1/(ε(1 − γ)))⌉ + 2 }.

For each environment ν ∈ M and each κ ∈ K, MERL associates a counter E(ν, κ), which is incremented at the start of a κ-exploration phase if ν ∈ {ν̄, ν̲}. At the end of each κ-exploration phase MERL calculates the discounted return actually received during that exploration phase, R ∈ [0, 1/(1 − γ)], and records the values

    X(ν̄, κ) := (1 − γ)( V^π_ν̄(h; d) − R )
    X(ν̲, κ) := (1 − γ)( R − V^π_ν̲(h; d) ),

where h is the history at the start of the exploration phase. So X(ν̄, κ) is the difference between the return expected if the true model were ν̄ and the actual return, and X(ν̲, κ) is the difference between the actual return and the expected return if the true model were ν̲. Since the expected value of R is V^π_µ(h; d), and ν̄, ν̲ are upper and lower bounds respectively, the expected values of both X(ν̄, κ) and X(ν̲, κ) are non-negative, and at least one of them has expectation larger than ε(1 − γ)/8. MERL eliminates environment ν from the model class if the cumulative sum of X(ν, κ) over all exploration phases where ν ∈ {ν̄, ν̲} is sufficiently large, but it tests this condition only when the count E(ν, κ) has increased enough since the last test. Let αj := ⌈α^j⌉ for α ∈ (1, 2) as defined in the algorithm. MERL only tests whether ν should be removed from the model class when E(ν, κ) = αj for some j ∈ N. This restriction ensures that tests are not performed too often, which allows us to apply the union bound without losing too much. Note that if the true environment µ ∈ {ν̄, ν̲}, then E_{µ,π}X(µ, κ) = 0, which will ultimately be enough to ensure that µ remains in the model class with high probability. The reason for using κ to bucket exploration phases will become apparent later, in the proof of Lemma 3.

Algorithm 1 MERL
 1: Inputs: ε, δ and M := {ν1, ν2, · · · , νN}.
 2: t = 1 and h the empty history
 3: d := (1/(1 − γ)) log(8/(ε(1 − γ))) and δ1 := δ/(32|K|N^{3/2})
 4: α := 4√N/(4√N − 1) and αj := ⌈α^j⌉
 5: E(ν, κ) := 0, ∀ν ∈ M and κ ∈ N
 6: loop
 7:   repeat
 8:     Π := {π*_ν : ν ∈ M}
 9:     ν̄, ν̲, π := arg max_{ν̄,ν̲∈M, π∈Π} V^π_ν̄(h; d) − V^π_ν̲(h; d)
10:     if ∆ := V^π_ν̄(h; d) − V^π_ν̲(h; d) > ε/4 then
11:       h̃ = h and R = 0
12:       for j = 0 → d do
13:         R = R + γ^j rt(h)
14:         Act(π)
15:       end for
16:       κ := min{ κ ∈ N : ∆ > 2^{κ−2}ε }
17:       E(ν̄, κ) = E(ν̄, κ) + 1 and E(ν̲, κ) = E(ν̲, κ) + 1
18:       X(ν̄, κ)_{E(ν̄,κ)} = (1 − γ)(V^π_ν̄(h̃; d) − R)
19:       X(ν̲, κ)_{E(ν̲,κ)} = (1 − γ)(R − V^π_ν̲(h̃; d))
20:     else
21:       i := min{i : νi ∈ M} and Act(π*_{νi})
22:     end if
23:   until ∃ν ∈ M, κ, j ∈ N such that E(ν, κ) = αj and Σ_{i=1}^{E(ν,κ)} X(ν, κ)_i ≥ sqrt( 2E(ν, κ) log(E(ν, κ)/δ1) )
24:   M = M − {ν}
25: end loop
26: function Act(π)
27:   Take action at = π(h) and receive reward rt and observation ot from the environment
28:   t ← t + 1 and h ← h at ot rt
29: end function

Subscripts. For clarity, we have omitted subscripts in the pseudo-code above. In the analysis we will refer to Et(ν, κ) and Mt for the values of E(ν, κ) and M respectively at time-step t. We write νt for the νi in line 21 and similarly πt := π*_{νt}.
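To make the control flow of Algorithm 1 concrete, the following sketch shows the explore/exploit decision and the elimination test in Python. The model interface (optimal_policy(), value(h, policy, d)), the use of a set for the model class and the other names are assumptions made purely for illustration; computing these value functions exactly is in general intractable, as discussed in Section 7.

    import math
    from itertools import product

    def choose_phase(models, policies, h, eps, d):
        # Decision rule of lines 8-10: find the pair of environments and the
        # policy maximising the value gap Delta, and explore iff Delta > eps/4.
        delta, nu_up, nu_low, pi = max(
            ((up.value(h, p, d) - low.value(h, p, d), up, low, p)
             for up, low, p in product(models, models, policies)),
            key=lambda x: x[0])
        return (delta > eps / 4), nu_up, nu_low, pi, delta

    def eliminate(models, stats, counts, test_points, delta1):
        # Elimination test of line 23: when the counter E(nu, kappa) hits a
        # test point alpha_j, remove nu if the cumulative statistic exceeds
        # sqrt(2 E log(E / delta1)). `stats[(nu, kappa)]` holds the recorded
        # X(nu, kappa)_i values and `models` is assumed to be a set.
        for (nu, kappa), xs in list(stats.items()):
            e = counts[(nu, kappa)]
            if e in test_points and sum(xs) >= math.sqrt(2 * e * math.log(e / delta1)):
                models.discard(nu)
        return models

A full implementation would additionally maintain the per-(ν, κ) counters, record X(ν̄, κ) and X(ν̲, κ) at the end of each exploration phase as in lines 16–19, and act with the optimal policy of the first remaining model when exploiting.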

Phases. An exploration phase is a period of exactly d time-steps, starting at time-step t if

1. t is not currently in an exploration phase;
2. ∆ := V^π_ν̄(ht; d) − V^π_ν̲(ht; d) > ε/4.

We say it is a ν-exploration phase if ν = ν̄ or ν = ν̲, and a κ-exploration phase if ∆ ∈ [2^{κ−2}ε, 2^{κ−1}ε) ≡ [εκ, 2εκ), where εκ := 2^{κ−2}ε. It is a (ν, κ)-exploration phase if it satisfies both of the previous statements. We say that MERL is exploiting at time-step t if t is not in an exploration phase. A failure phase is also a period of d time-steps and starts in time-step t if

1. t is not in an exploration phase or an earlier failure phase;
2. V*_µ(ht) − V^π_µ(ht) > ε.

Unlike exploration phases, the algorithm does not depend on the failure phases, which are only used in the analysis. An exploration or failure phase starting at time-step t is proper if µ ∈ Mt. The effective horizon d is chosen to ensure that V^π_µ(h; d) ≥ V^π_µ(h) − ε/8 for all π, µ and h.

Probabilities. For the remainder of this section, unless otherwise mentioned, all probabilities and expectations are with respect to Pµ,π, where π is the policy of Algorithm 1 and µ ∈ M is the true environment.

Analysis. Define Gmax := (2^16 N|K| / (ε²(1 − γ)²)) log²( 2^9 N / (ε²(1 − γ)²δ) ) and Emax := (2^16 N / (ε²(1 − γ)²)) log²( 2^9 N / (ε²(1 − γ)²δ1) ), which are high-probability bounds on the number of failure and exploration phases respectively.

Theorem 1. Let µ ∈ M = {ν1, ν2, · · · , νN} be the true environment and π be the policy of Algorithm 1. Then

    P{ L^ε_{µ,π}(h) ≥ d · (Gmax + Emax) } ≤ δ.

[Figure 1. Exploration/exploitation/failure phases, d = 4. The time-line alternates between exploiting, a failure phase (where V*_µ(ht) − V^π_µ(ht) > ε) and exploration phases (here with κ = 4 and κ = 2), which are triggered when V^π_ν̄(h; d) − V^π_ν̲(h; d) > ε/4.]

If lower-order logarithmic factors are dropped, then the sample-complexity bound of MERL given by Theorem 1 is Õ( (N / (ε²(1 − γ)³)) log²( N / (δε(1 − γ)) ) ).
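For a sense of scale, the snippet below evaluates the horizon d, the number of buckets |K|, δ1 and the constants Emax and Gmax as they are written above for one example choice of parameters. The formulas are transcribed from the text (with d rounded up to an integer number of time-steps), so any looseness in the constants carries over; only their order of magnitude matters for Theorem 1.

    import math

    def merl_constants(N, eps, gamma, delta):
        # Constants as stated in Section 3 (transcribed, not re-derived).
        d = math.ceil((1 / (1 - gamma)) * math.log(8 / (eps * (1 - gamma))))
        K = math.ceil(math.log2(1 / (eps * (1 - gamma)))) + 3  # kappa ranges over 0, ..., ceil(log2(1/(eps(1-gamma)))) + 2
        delta1 = delta / (32 * K * N ** 1.5)
        e_max = (2 ** 16 * N / (eps ** 2 * (1 - gamma) ** 2)) * math.log(2 ** 9 * N / (eps ** 2 * (1 - gamma) ** 2 * delta1)) ** 2
        g_max = (2 ** 16 * N * K / (eps ** 2 * (1 - gamma) ** 2)) * math.log(2 ** 9 * N / (eps ** 2 * (1 - gamma) ** 2 * delta)) ** 2
        return d, K, e_max, g_max, d * (g_max + e_max)

    # Example parameters: N = 10 environments, eps = 0.1, gamma = 0.9, delta = 0.05.
    print(merl_constants(10, 0.1, 0.9, 0.05))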

Test statistics. We have previously remarked that most traditional model-based algorithms with sample-complexity guarantees record statistics about the transition probabilities of an environment. Since the environments are assumed to be finite, these statistics eventually become accurate (or irrelevant) and the standard theory on the concentration of measure can be used for hypothesis testing. In the general case, environments can be infinite and so we cannot collect useful statistics about individual transitions. Instead, we use the statistics X(ν, κ), which depend on the value function rather than on individual transitions. These satisfy E_{µ,π}[X(µ, κ)_i] = 0, while E_{µ,π}[X(ν, κ)_i] ≥ 0 for all ν ∈ Mt. Testing is then performed on the statistic Σ_{i=1}^{αj} X(ν, κ)_i, which will satisfy certain martingale inequalities.

Updates. As MERL explores, it updates its model class, Mt ⊆ M, by removing environments that have become implausible. This is comparable to the updating of confidence intervals for algorithms such as MBIE (Strehl & Littman, 2005) or UCRL2 (Auer et al., 2010). In MBIE, the confidence interval about the empirical estimate of a transition probability is updated after every observation. A slight theoretical improvement used by UCRL2 is to only update when the number of samples of a particular statistic doubles. The latter trick allows a cheap application of the union bound over all updates without wasting too many samples. For our purposes, however, we need to update slightly more often than the doubling trick would allow. Instead, we check whether an environment should be eliminated when the number of (ν, κ)-exploration phases is exactly αj for some j, where αj := ⌈α^j⌉ and α := 4√N/(4√N − 1) ∈ (1, 2). Since the growth of αj is still exponential, the union bound will still be applicable.

Theorem 1 follows from three lemmas.

Lemma 2. µ ∈ Mt for all t with probability 1 − δ/4.

Lemma 3. The number of proper failure phases is bounded by

    Gmax := (2^16 N|K| / (ε²(1 − γ)²)) log²( 2^9 N / (ε²(1 − γ)²δ) )

with probability at least 1 − δ/2.

Lemma 4. The number of proper exploration phases is bounded by

    Emax := (2^16 N / (ε²(1 − γ)²)) log²( 2^9 N / (ε²(1 − γ)²δ1) )

with probability at least 1 − δ/4.

Proof of Theorem 1. Applying the union bound to the results of Lemmas 2, 3 and 4 gives the following with probability at least 1 − δ:

1. There are no non-proper exploration or failure phases.
2. The number of proper exploration phases is at most Emax.

3. The number of proper failure phases is at most Gmax.

If π is not ε-optimal at time-step t then t is either in an exploration or a failure phase. Since both are exactly d time-steps long, the total number of time-steps when π is sub-optimal is at most d · (Gmax + Emax). □

We now turn our attention to proving Lemmas 2, 3 and 4. Of these, Lemma 4 is more conceptually challenging, while Lemma 3 is intuitively unsurprising but technically difficult.

Proof of Lemma 2. If µ is removed from M, then there exists a κ and j ∈ N such that

    Σ_{i=1}^{αj} X(µ, κ)_i ≥ sqrt( 2αj log(αj/δ1) ).

Fix a κ ∈ K, let E∞(µ, κ) := lim_t Et(µ, κ) and Xi := X(µ, κ)_i. Define a sequence of random variables

    X̃i := Xi if i ≤ E∞(µ, κ), and X̃i := 0 otherwise.

Now we claim that Bn := Σ_{i=1}^n X̃i is a martingale with |B_{i+1} − B_i| ≤ 1 and EBi = 0. That it is a martingale with zero expectation follows because if t is the time-step at the start of the exploration phase associated with variable Xi, then E[Xi|Ft] = 0. |B_{i+1} − B_i| ≤ 1 because discounted returns are bounded in [0, 1/(1 − γ)] and by the definition of Xi. For all j ∈ N we have by Azuma's inequality that

    P{ B_{αj} ≥ sqrt(2αj log(αj/δ1)) } ≤ δ1/αj.

Apply the union bound over all j:

    P{ ∃j ∈ N : B_{αj} ≥ sqrt(2αj log(αj/δ1)) } ≤ Σ_{j=1}^∞ δ1/αj.

Complete the result by the union bound over all κ, applying Lemma 10 (see Appendix) and the definition of δ1 to bound Σ_{κ∈K} Σ_{j=1}^∞ δ1/αj ≤ δ/4. □

We are now ready to give a high-probability bound on the number of proper exploration phases. If MERL starts a proper exploration phase at time-step t, then at least one of the following holds:¹

1. E[X(ν̄, κ)_{E(ν̄,κ)} | Ft] > ε(1 − γ)/8.
2. E[X(ν̲, κ)_{E(ν̲,κ)} | Ft] > ε(1 − γ)/8.

This contrasts with E[X(µ, κ)_{E(µ,κ)} | Ft] = 0, which ensures that µ remains in M for all time-steps. If one could know which of the above statements were true at each time-step, then it would be comparatively easy to show by means of Azuma's inequality that all environments that are not ε-close are quickly eliminated after O(1/(ε²(1 − γ)²)) ν-exploration phases, which would lead to the desired bound. Unfortunately though, the truth of (1) or (2) above cannot be determined, which greatly increases the complexity of the proof.

Proof of Lemma 4. Fix a κ ∈ K and let Emax,κ be a constant to be chosen later. Let ht be the history at the start of some κ-exploration phase. We say a (ν̄, κ)-exploration phase is ν̄-effective if

    E[X(ν̄, κ)_{E(ν̄,κ)} | Ft] ≡ (1 − γ)( V^π_ν̄(ht; d) − V^π_µ(ht; d) ) > (1 − γ)εκ/2,

and ν̲-effective if the same condition holds for ν̲. Now, since t is the start of a proper exploration phase we have that µ ∈ Mt and so

    V^π_ν̄(ht; d) ≥ V^π_µ(ht; d) ≥ V^π_ν̲(ht; d)   and   V^π_ν̄(ht; d) − V^π_ν̲(ht; d) > εκ.

Therefore every proper exploration phase is either ν̄-effective or ν̲-effective. Let Et,κ := Σ_ν Et(ν, κ), which is twice the number of κ-exploration phases at time t, and E∞,κ := lim_t Et,κ, which is twice the total number of κ-exploration phases. Let Ft(ν, κ) be the number of ν-effective (ν, κ)-exploration phases up to time-step t. Since each proper κ-exploration phase is either ν̄-effective or ν̲-effective or both, Σ_ν Ft(ν, κ) ≥ Et,κ/2. Applying Lemma 8 to y_ν := Et(ν, κ)/Et,κ and x_ν := Ft(ν, κ)/Et(ν, κ) shows that if E∞,κ > Emax,κ then there exist a t′ and a ν such that E_{t′,κ} = Emax,κ and

    F_{t′}(ν, κ)² / ( Emax,κ E_{t′}(ν, κ) ) ≥ 1/(4N),    (1)

which implies that

    F_{t′}(ν, κ) ≥ sqrt( Emax,κ E_{t′}(ν, κ) / (4N) ) ≥(a) E_{t′}(ν, κ)/sqrt(4N),    (2)

where (a) follows because Emax,κ = E_{t′,κ} ≥ E_{t′}(ν, κ). Let Z(ν) be the event that there exists a t′ satisfying (1). We will shortly show that P{Z(ν)} < δ/(4N|K|). Therefore

    P{E∞,κ > Emax,κ} ≤ P{∃ν : Z(ν)} ≤ Σ_{ν∈M} P{Z(ν)} ≤ δ/(4|K|).

Finally, take the union bound over all κ and let

    Emax := Σ_{κ∈K} (1/2) Emax,κ,

where we used (1/2)Emax,κ because Emax,κ is a high-probability upper bound on E∞,κ, which is twice the number of κ-exploration phases.

¹ Note that it is never the case that ν̄ = ν̲ at the start of an exploration phase, since in this case ∆ = 0.
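The argument above tests an environment only when its counter equals αj = ⌈α^j⌉ and relies on Lemma 10 (see Appendix), which bounds Σ_j 1/αj by 4√N. The short numerical check below illustrates the schedule and that bound for a few values of N; it is only a sanity check of the constants as reconstructed here, not part of the original analysis.

    import math

    def alpha_schedule(N, j_max=500):
        # Test points alpha_j = ceil(alpha^j) with alpha = 4*sqrt(N) / (4*sqrt(N) - 1).
        alpha = 4 * math.sqrt(N) / (4 * math.sqrt(N) - 1)
        return [math.ceil(alpha ** j) for j in range(1, j_max + 1)]

    for n in (2, 10, 100):
        partial_sum = sum(1 / a for a in alpha_schedule(n))
        # Lemma 10 asserts the full sum is at most 4 * sqrt(N).
        print(n, round(partial_sum, 3), "<=", round(4 * math.sqrt(n), 3))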

Bounding P{Z(ν)} < δ/(4N|K|). Fix a ν ∈ M and let X1, X2, · · · , X_{E∞(ν,κ)} be the sequence with Xi := X(ν, κ)_i, and let ti be the time-step at the start of the ith (ν, κ)-exploration phase.

Define a sequence

    Yi := Xi − E[Xi | F_{ti}] if i ≤ E∞(ν, κ), and Yi := 0 otherwise,

and let λ(E) := sqrt( 2E log(E/δ1) ). Now if Z(ν) occurs, then the largest time-step t′ ≤ t with E_{t′}(ν, κ) = αj for some j ∈ N is

    t′ := max{ t′ ≤ t : ∃j ∈ N s.t. αj = E_{t′}(ν, κ) },

which exists and satisfies:

1. E_{t′}(ν, κ) = αj for some j;
2. E∞(ν, κ) > E_{t′}(ν, κ);
3. F_{t′}(ν, κ) ≥ sqrt( E_{t′}(ν, κ) Emax,κ / (16N) );
4. E_{t′}(ν, κ) ≥ Emax,κ/(16N),

where parts 1 and 2 are straightforward and parts 3 and 4 follow by the definition of {αj}, which was chosen specifically for this part of the proof. Since E∞(ν, κ) > E_{t′}(ν, κ), at the end of the exploration phase starting at time-step t′, ν must remain in M. Therefore

    λ(αj) ≥(a) Σ_{i=1}^{αj} Xi ≥(b) Σ_{i=1}^{αj} Yi + εκ(1 − γ)F_{t′}(ν, κ)/2 ≥(c) Σ_{i=1}^{αj} Yi + (εκ(1 − γ)/8) sqrt( αj Emax,κ / N ),    (3)

where in (a) we used the definition of the confidence interval of MERL; in (b) we used the definition of Yi and the fact that EXi ≥ 0 for all i and EXi ≥ εκ(1 − γ)/2 if Xi is effective; and in (c) we used the lower bound on the number of effective ν-exploration phases, F_{t′}(ν, κ) (part 3 above). If

    Emax,κ := (2^11 N / (εκ²(1 − γ)²)) log²( 2^9 N / (ε²(1 − γ)²δ1) ),

then by applying Lemma 9 with a = 2^9 N/(εκ²(1 − γ)²) and b = 1/δ1 we obtain

    Emax,κ ≥ (2^9 N / (εκ²(1 − γ)²)) log( Emax,κ/δ1 ) ≥ (2^9 N / (εκ²(1 − γ)²)) log( αj/δ1 ).

Multiplying both sides by αj, rearranging, and using the definition of λ(αj) leads to

    (εκ(1 − γ)/8) sqrt( αj Emax,κ / N ) ≥ 2λ(αj).

Inserting this into Equation (3) shows that Z(ν) implies that there exists an αj such that Σ_{i=1}^{αj} Yi ≤ −λ(αj). Now, by the same argument as in the proof of Lemma 2, Bn := Σ_{i=1}^n Yi is a martingale with |B_{i+1} − B_i| ≤ 1. Therefore by Azuma's inequality

    P{ Σ_{i=1}^{αj} Yi ≤ −λ(αj) } ≤ δ1/αj.

Finally, apply the union bound over all j. □

Recall that if MERL is exploiting at time-step t, then πt is the optimal policy with respect to the first environment in the model class. To prove Lemma 3 we start by showing that in this case πt is nearly-optimal.

Lemma 5. Let t be a time-step and ht be the corresponding history. If µ ∈ Mt and MERL is exploiting (not exploring), then V*_µ(ht) − V^{πt}_µ(ht) ≤ 5ε/8.

Proof of Lemma 5. Since MERL is not exploring,

    V*_µ(ht) − V^{πt}_µ(ht) ≤(a) V*_µ(ht; d) − V^{πt}_µ(ht; d) + ε/8
                            ≤(b) V^{π*_µ}_{νt}(ht; d) − V^{πt}_{νt}(ht; d) + 5ε/8
                            ≤(c) 5ε/8,

where (a) follows by truncating the value function, (b) follows because µ ∈ Mt and MERL is exploiting, and (c) is true since πt is the optimal policy in νt. □

Lemma 5 is almost sufficient to prove Lemma 3. The only problem is that MERL only follows πt = π*_{νt} until there is an exploration phase. The idea of the proof of Lemma 3 is as follows:

1. If there is a low probability of entering an exploration phase within the next d time-steps following policy πt, then π is nearly as good as πt, which itself is nearly optimal by Lemma 5.
2. The number of time-steps when the probability of entering an exploration phase within the next d time-steps is high is unlikely to be too large before an exploration phase is triggered. Since there are not many exploration phases with high probability, there are also unlikely to be too many time-steps when π expects to enter one with high probability.

Before the proof of Lemma 3 we remark on an easier to prove (but weaker) version of Theorem 1. If MERL is exploiting, then Lemma 5 shows that V*_µ(h) − Q*_µ(h, π(h)) ≤ 5ε/8 < ε. Therefore, if we cared about the number of time-steps when this is not the case (rather than V*_µ − V^π_µ), then we would already be done by combining Lemmas 4 and 5.

Proof of Lemma 3. Let t be the start of a proper failure phase with corresponding history h. Therefore V*_µ(h) − V^π_µ(h) > ε. By Lemma 5, V*_µ(h) − V^π_µ(h) = V*_µ(h) − V^{πt}_µ(h) + V^{πt}_µ(h) − V^π_µ(h) ≤ 5ε/8 + V^{πt}_µ(h) − V^π_µ(h), and so

    V^{πt}_µ(h) − V^π_µ(h) ≥ 3ε/8.    (4)
We define the set Hκ ⊂ H∗ to be the set of extensions of h that trigger κ-exploration phases. Formally, Hκ ⊂ H∗ is the prefix-free set such that h′ ∈ Hκ if h ⊏ h′ and h′ triggers a κ-exploration phase for the first time since t.

Let Hκ,d := {h′ : h′ ∈ Hκ ∧ ℓ(h′) ≤ t + d}, which is the set of extensions of h that are at most d long and trigger κ-exploration phases. Therefore

    3ε/8 ≤(a) V^{πt}_µ(h) − V^π_µ(h)
         =(b) Σ_{κ∈K} Σ_{h′∈Hκ} P(h′|h) γ^{ℓ(h′)−t} ( V^{πt}_µ(h′) − V^π_µ(h′) )
         ≤(c) Σ_{κ∈K} Σ_{h′∈Hκ,d} P(h′|h) ( V^{πt}_µ(h′) − V^π_µ(h′) ) + ε/8
         ≤(d) Σ_{κ∈K} Σ_{h′∈Hκ,d} P(h′|h) ( V*_µ(h′; d) − V^π_µ(h′; d) ) + ε/4
         ≤(e) Σ_{κ∈K} Σ_{h′∈Hκ,d} P(h′|h) 4εκ + ε/4,

where (a) follows from Equation (4); (b) by noting that π = πt until an exploration phase is triggered; (c) by replacing Hκ with Hκ,d and noting that if h′ ∈ Hκ − Hκ,d, then γ^{ℓ(h′)−t} ≤ ε(1 − γ)/8; (d) by substituting V*_µ(h′) ≥ V^{πt}_µ(h′) and by using the effective horizon to truncate the value functions; and (e) by the definition of a κ-exploration phase.

Since the maximum of a set is greater than the average, there exists a κ ∈ K such that Σ_{h′∈Hκ,d} P(h′|h) ≥ 2^{−κ−3}/|K|, which is the probability that MERL encounters a κ-exploration phase within d time-steps from h. Now fix a κ and let t1, t2, · · · , t_{Gκ} be the sequence of time-steps such that ti is the start of a failure phase and the probability of a κ-exploration phase within the next d time-steps is at least 2^{−κ−3}/|K|. Let Yi ∈ {0, 1} be the event that a κ-exploration phase does occur within d time-steps of ti, and define an auxiliary infinite sequence Ỹ1, Ỹ2, · · · by Ỹi := Yi if i ≤ Gκ and Ỹi := 1 otherwise. Let Eκ be the number of κ-exploration phases and Gmax,κ be a constant to be chosen later, and suppose Gκ > Gmax,κ. Then Σ_{i=1}^{Gmax,κ} Ỹi = Σ_{i=1}^{Gmax,κ} Yi and either Σ_{i=1}^{Gmax,κ} Ỹi ≤ Emax,κ or Eκ > Emax,κ, where the latter follows because Yi = 1 implies a κ-exploration phase occurred. Therefore

    P{Gκ > Gmax,κ} ≤ P{ Σ_{i=1}^{Gmax,κ} Ỹi < Emax,κ } + P{Eκ > Emax,κ}
                   ≤ P{ Σ_{i=1}^{Gmax,κ} Ỹi < Emax,κ } + δ/(4|K|).

We now choose Gmax,κ sufficiently large to bound the first term in the display above by δ/(4|K|). By the definition of Yi and Ỹi, if i ≤ Gκ then E[Ỹi|F_{ti}] ≥ 2^{−κ−3}/|K|, and for i > Gκ, Ỹi is always 1. Setting

    Gmax,κ := 2^{κ+4} |K| Emax,κ = (2^17 N|K| / (ε εκ (1 − γ)²)) log²( 2^9 N / (ε²(1 − γ)²δ1) )

is sufficient to guarantee E[ Σ_{i=1}^{Gmax,κ} Ỹi ] > 2Emax,κ, and an application of Azuma's inequality to the martingale difference sequence completes the result. Finally, we apply the union bound over all κ and set Gmax := Σ_{κ∈K} Gmax,κ. □

4. Compact Case

In the last section we presented MERL and proved a sample-complexity bound for the case when the environment class is finite. In this section we show that if the number of environments is infinite, but compact with respect to the topology generated by a natural metric, then sample-complexity bounds are still possible with a minor modification of MERL. The key idea is to use compactness to cover the space of environments with ε-balls and compute statistics on these balls rather than on individual environments. Since all environments in the same ε-ball are sufficiently close, the resulting statistics cannot be significantly different and all analysis goes through identically to the finite case. Define a topology on the space of all environments induced by the pseudo-metric

    d(ν1, ν2) := sup_{h,π} | V^π_{ν1}(h) − V^π_{ν2}(h) |.

Theorem 6. Let M be compact and coverable by N ε-balls. Then a modification of Algorithm 1 satisfies

    P{ L^{2ε}_{µ,π}(h) ≥ d · (Gmax + Emax) } ≤ δ.

The main modification is to define statistics on elements of the cover, rather than on specific environments (a sketch of one way to construct such a cover is given at the end of this section).

1. Let U1, · · · , UN be an ε-cover of M.
2. At each time-step choose Ū and U̲ such that ν̄ ∈ Ū and ν̲ ∈ U̲.
3. Define statistics {X} on elements of the cover, rather than environments, by

       X(Ū, κ)_{E(Ū,κ)} := inf_{ν∈Ū} (1 − γ)( V^π_ν(h; d) − R )
       X(U̲, κ)_{E(U̲,κ)} := inf_{ν∈U̲} (1 − γ)( R − V^π_ν(h; d) ).

4. If there exists a U where the test fails, then eliminate all environments in that element of the cover.

The proof requires only small modifications to show that with high probability the U containing the true environment is never discarded, while those not containing the true environment are, if tested sufficiently often.
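The modification above only requires an ε-cover of M under the pseudo-metric d. When a finite candidate set and an evaluable upper bound on d are available, a simple greedy construction suffices; both of those availability assumptions, and all names below, are for illustration only, since the supremum over histories and policies cannot in general be computed exactly.

    def greedy_eps_cover(candidates, dist, eps):
        # Greedily pick centers so that every candidate environment lies within
        # eps of some center under the pseudo-metric
        #     d(nu1, nu2) = sup_{h, pi} |V^pi_nu1(h) - V^pi_nu2(h)|,
        # supplied here as a callable `dist` (an assumed oracle or upper bound).
        centers, balls = [], {}
        for nu in candidates:
            for c in centers:
                if dist(nu, c) <= eps:
                    balls[c].append(nu)
                    break
            else:
                centers.append(nu)
                balls[nu] = [nu]
        return centers, balls

Statistics are then maintained per ball exactly as in Algorithm 1, with the infimum over the ball taken when X(Ū, κ) and X(U̲, κ) are recorded, and a failed test eliminating every environment in the offending ball.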

5. Unbounded Environment Classes

If the environment class is non-compact then we cannot in general expect finite sample-complexity bounds. Indeed, even asymptotic results are usually not possible.

Theorem 7. There exist non-compact M for which no agent has a finite PAC bound.

The obvious example is when M is the set of all environments. Then for any policy, M includes an environment that is tuned to ensure the policy acts sub-optimally infinitely often. A more interesting example is the class of all computable environments, which is non-compact and also does not admit algorithms with uniform finite sample-complexity. See the negative results by Lattimore & Hutter (2011b) for counter-examples.

6. Lower Bound

We now turn our attention to the lower bound. In specific cases, the bound in Theorem 1 is very weak. For example, if M is the class of finite MDPs with |S| states, then a natural covering leads to a PAC bound with exponential dependence on the state-space, while it is known that the true dependence is at most quadratic. This should not be surprising, since information about the transitions for one state gives information about a large subset of M, not just a single environment. We show that the bound in Theorem 1 is unimprovable for general environment classes except for logarithmic factors. That is, there exists a class of environments where Theorem 1 is nearly tight.

The simplest counter-example is a set of MDPs with four states, S = {0, 1, ⊕, ⊖}, and N actions, A = {a1, · · · , aN}. The rewards and transitions are depicted in the accompanying figure, where the transition probabilities depend on the action. Let M := {ν1, · · · , νN}, where for νk we set ε(ai) = [[i = k]] ε(1 − γ). Therefore in environment νk, ak is the optimal action in state 1. M can be viewed as a set of bandits with rewards in (0, 1/(1 − γ)). In the bandit domain tight lower bounds on sample-complexity are known and given in Mannor & Tsitsiklis (2004). These results can be applied as in Strehl et al. (2009) and Lattimore & Hutter (2012) to show that no algorithm has sample-complexity less than O( (N/(ε²(1 − γ)³)) log(1/δ) ).

[Figure: the four-state MDP class used in the lower bound, with states 0 and 1 (r = 0), ⊕ (r = 1) and ⊖ (r = 0). From state 1, action a leads to ⊕ with probability (1 + ε(a))/2 and to ⊖ with probability (1 − ε(a))/2; the remaining transitions are governed by p := 1/(2 − γ) and q := 2 − 1/γ.]

7. Conclusions

Summary. The Maximum Exploration Reinforcement Learning algorithm was presented. For finite classes of arbitrary environments, a sample-complexity bound was given that is linear in the number of environments. We also presented lower bounds that show that in general this cannot be improved except for logarithmic factors. Learning is also possible for compact classes, with the sample-complexity depending on the size of the smallest ε-cover, where the distance between two environments is the difference in value functions over all policies and history sequences. Finally, for non-compact classes of environments sample-complexity bounds are typically not possible.

Running time. The running time of MERL can be arbitrarily large, since computing the policy maximising ∆ depends on the environment class used. Even assuming the distribution of observations/rewards given the history can be computed in constant time, the values of optimal policies can still only be computed in time exponential in the horizon.

Future work. MERL is close to unimprovable in the sense that there exists a class of environments where the upper bound is nearly tight. On the other hand, there are classes of environments where the bound of Theorem 1 scales badly compared to the bounds of tuned algorithms (for example, finite-state MDPs). It would be interesting to show that MERL, or a variant thereof, actually performs comparably to the optimal sample-complexity even in these cases. This question is likely to be subtle, since there are unrealistic classes of environments where the algorithm minimising sample-complexity should take actions leading directly to a trap where it receives low reward eternally, but is never (again) sub-optimal. Since MERL will not behave this way, it will tend to have poor sample-complexity bounds in this type of environment class. This is really a failure of the sample-complexity optimality criterion rather than of MERL, since jumping into non-rewarding traps is clearly sub-optimal by any realistic measure.

Acknowledgements. This work was supported by ARC grant DP120100950.

A. Technical Results

Lemma 8. Let x, y ∈ [0, 1]^N satisfy Σ_{i=1}^N yi = 1 and Σ_{i=1}^N xi yi ≥ 1/2. Then max_i x_i² y_i > 1/(4N).

Lemma 9. Let a, b > 2 and x := 4a(log ab)². Then x ≥ a log bx.

Lemma 10. Let αj := ⌈α^j⌉ where α := 4√N/(4√N − 1). Then Σ_{j=1}^∞ αj^{−1} ≤ 4√N.

References

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., 99:1563–1600, August 2010.

Azar, M., Munos, R., and Kappen, B. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the 29th International Conference on Machine Learning, New York, NY, USA, 2012. ACM.

Chakraborty, D. and Stone, P. Structure learning in ergodic factored MDPs without knowledge of the transition function's in-degree. In Proceedings of the 28th International Conference on Machine Learning (ICML'11), 2011.

Diuk, C., Li, L., and Leffler, B. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pp. 249–256. ACM, 2009.

Even-Dar, E., Kakade, S., and Mansour, Y. Reinforcement learning in POMDPs without resets. In IJCAI, pp. 690–695, 2005.

Hutter, M. Self-optimizing and Pareto-optimal policies in general environments based on Bayes-mixtures. In Proc. 15th Annual Conf. on Computational Learning Theory (COLT'02), volume 2375 of LNAI, pp. 364–379, Sydney, 2002. Springer, Berlin. URL http://arxiv.org/abs/cs.AI/0204040.

Lattimore, T. and Hutter, M. Time consistent discounting. In Algorithmic Learning Theory, volume 6925 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2011a.

Lattimore, T. and Hutter, M. Asymptotically optimal agents. In Algorithmic Learning Theory, volume 6925 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2011b.

Lattimore, T. and Hutter, M. PAC bounds for discounted MDPs. Technical report, 2012. http://tor-lattimore.com/pubs/pac-tech.pdf.

Maillard, O.-A., Nguyen, P., Ortner, R., and Ryabko, D. Optimal regret bounds for selecting the state representation in reinforcement learning. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), 2013.

Mannor, S. and Tsitsiklis, J. The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res., 5:623–648, December 2004.

Ryabko, D. and Hutter, M. On the possibility of learning in reactive environments with arbitrary dependence. Theoretical Computer Science, 405(3):274–284, 2008.

Strehl, A. and Littman, M. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning (ICML'05), pp. 856–863, 2005.

Strehl, A., Li, L., and Littman, M. Reinforcement learning in finite MDPs: PAC analysis. J. Mach. Learn. Res., 10:2413–2444, December 2009.

Sunehag, P. and Hutter, M. Optimistic agents are asymptotically optimal. In Proceedings of the 25th Australasian AI Conference, 2012.

Szita, I. and Szepesvári, C. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pp. 1031–1038, New York, NY, USA, 2010. ACM.