On the Sample Complexity of Reinforcement Learning with a Generative Model

Mohammad Gheshlaghi Azar [email protected]
Department of Biophysics, Radboud University Nijmegen, 6525 EZ Nijmegen, The Netherlands

Rémi Munos [email protected]
INRIA Lille, SequeL Project, 40 avenue Halley, 59650 Villeneuve d'Ascq, France

Hilbert J. Kappen [email protected]
Department of Biophysics, Radboud University Nijmegen, 6525 EZ Nijmegen, The Netherlands

Abstract

We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample complexity of the model-based value iteration algorithm in the presence of a generative model, which indicates that for an MDP with $N$ state-action pairs and discount factor $\gamma \in [0,1)$, only $O\big(N \log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$ samples are required to find an $\varepsilon$-optimal estimate of the action-value function with probability $1-\delta$. We also prove a matching lower bound of $\Theta\big(N \log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$ on the sample complexity of estimating the optimal action-value function by any RL algorithm. To the best of our knowledge, this is the first matching result on the sample complexity of estimating the optimal (action-)value function, in which the upper bound matches the lower bound of RL in terms of $N$, $\varepsilon$, $\delta$ and $1/(1-\gamma)$. Also, both our lower bound and our upper bound significantly improve on the state of the art in terms of $1/(1-\gamma)$.

1. Introduction

Model-based value iteration (VI) (Kearns & Singh, 1999; Buşoniu et al., 2010) is a well-known reinforcement learning (RL) (Szepesvári, 2010; Sutton & Barto, 1998) algorithm which relies on an empirical estimate of the state-transition distributions to estimate the optimal (action-)value function through the Bellman recursion. In finite state-action problems, it has been shown that an action-value based variant of VI, model-based Q-value iteration (QVI), finds an $\varepsilon$-optimal estimate of the action-value function with high probability using only $T = \widetilde{O}\big(N/((1-\gamma)^4\varepsilon^2)\big)$ samples (Kearns & Singh, 1999; Kakade, 2004, chap. 9.1), where $N$ and $\gamma$ denote the size of the state-action space and the discount factor, respectively.[1] Although this bound matches the best existing upper bound on the sample complexity of estimating the action-value function (Azar et al., 2011a), it has not been clear, so far, whether this is a tight bound on the performance of QVI or whether it can be improved by a more careful analysis of the QVI algorithm. This is mainly due to the fact that there is a gap of order $1/(1-\gamma)^2$ between the upper bound of QVI and the state-of-the-art lower bound, which is of order $\Omega\big(N/((1-\gamma)^2\varepsilon^2)\big)$ (Azar et al., 2011b).

In this paper, we focus on problems which are formulated as finite state-action discounted infinite-horizon Markov decision processes (MDPs), and prove a new tight bound of $O\big(N\log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$ on the sample complexity of the QVI algorithm. The new upper bound improves on the existing bound of QVI by an order of $1/(1-\gamma)$.[2] We also present a new matching lower bound of $\Theta\big(N\log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$, which also improves on the best existing lower bound of RL by an order of $1/(1-\gamma)$.

[1] The notation $g = \widetilde{O}(f)$ implies that there are constants $c_1$ and $c_2$ such that $g \le c_1 f \log^{c_2}(f)$.
[2] In this paper, to keep the presentation succinct, we only consider the value iteration algorithm. However, one can prove upper bounds of the same order for other model-based methods, such as policy iteration and linear programming, using the results of this paper.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).
The new results, which close the above-mentioned gap between the lower and the upper bound of RL, guarantee that no learning method, given a generative model of the MDP, can be significantly more efficient than QVI in terms of the sample complexity of estimating the action-value function.

The main idea in improving the upper bound of QVI is to express the performance loss of QVI in terms of the variance of the sum of discounted rewards, as opposed to the maximum $V_{\max} = R_{\max}/(1-\gamma)$ used in previous results. For this we make use of Bernstein's concentration inequality (Cesa-Bianchi & Lugosi, 2006, appendix, pg. 361), which bounds the estimation error in terms of the variance of the value function. We also rely on the fact that the variance of the sum of discounted rewards, like the expected value of the sum (the value function), satisfies a Bellman-like equation, in which the variance of the value function plays the role of the instant reward (Munos & Moore, 1999; Sobel, 1982), to derive a sharp bound on the variance of the value function. In the case of the lower bound, we improve on the result of Azar et al. (2011b) by adding some structure to the class of MDPs for which we prove the lower bound: in the new model, there is a high probability of transition from every intermediate state to itself. This adds to the difficulty of estimating the value function, since even a small estimation error may propagate throughout the recursive structure of the MDP and inflict a large performance loss, especially for $\gamma$ close to 1.

The rest of the paper is organized as follows. After introducing the notation used in the paper in Section 2, we describe the model-based Q-value iteration (QVI) algorithm in Subsection 2.1. We then state our main theoretical results, which are in the form of PAC sample complexity bounds, in Section 3. Section 4 contains the detailed proofs of the results of Section 3, i.e., the sample complexity bound of QVI and a new general lower bound for RL. Finally, we conclude the paper and propose some directions for future work in Section 5.

2. Background

In this section, we review some standard concepts and definitions from the theory of Markov decision processes (MDPs). We then present the model-based Q-value iteration algorithm of Kearns & Singh (1999). We consider the standard reinforcement learning (RL) framework (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998) in which a learning agent interacts with a stochastic environment and this interaction is modeled as a discrete-time discounted MDP. A discounted MDP is a quintuple $(\mathcal{X}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{X}$ and $\mathcal{A}$ are the sets of states and actions, $P$ is the state-transition distribution, $R$ is the reward function, and $\gamma \in (0,1)$ is a discount factor. We denote by $P(\cdot|x,a)$ and $r(x,a)$ the probability distribution over the next state and the immediate reward of taking action $a$ at state $x$, respectively.

Remark 1. To keep the presentation succinct, in the sequel we use the notation $\mathcal{Z}$ for the joint state-action space $\mathcal{X} \times \mathcal{A}$. We also make use of the shorthand notations $z$ and $\beta$ for the state-action pair $(x,a)$ and $1/(1-\gamma)$, respectively.

Assumption 1 (MDP Regularity). We assume that $\mathcal{Z}$ and, subsequently, $\mathcal{X}$ and $\mathcal{A}$ are finite sets with cardinalities $N$, $|\mathcal{X}|$ and $|\mathcal{A}|$, respectively. We also assume that the immediate reward $r(x,a)$ is taken from the interval $[0,1]$.

A mapping $\pi : \mathcal{X} \to \mathcal{A}$ is called a stationary and deterministic Markovian policy, or just a policy for short. Following a policy $\pi$ in an MDP means that at each time step $t$ the control action $A_t \in \mathcal{A}$ is given by $A_t = \pi(X_t)$, where $X_t \in \mathcal{X}$. The value and the action-value functions of a policy $\pi$, denoted respectively by $V^\pi : \mathcal{X} \to \mathbb{R}$ and $Q^\pi : \mathcal{Z} \to \mathbb{R}$, are defined as the expected sum of discounted rewards that are encountered when the policy $\pi$ is executed. Given an MDP, the goal is to find a policy that attains the best possible values, $V^*(x) \triangleq \sup_\pi V^\pi(x)$, for all $x \in \mathcal{X}$. The function $V^*$ is called the optimal value function. Similarly, the optimal action-value function is defined as $Q^*(x,a) = \sup_\pi Q^\pi(x,a)$. We say that a policy $\pi^*$ is optimal if it attains the optimal value $V^*(x)$ for all $x \in \mathcal{X}$. The policy $\pi$ defines the state-transition kernel $P_\pi$ as $P_\pi(y|x) \triangleq P(y|x,\pi(x))$ for all $x \in \mathcal{X}$. The right-linear operators $P^\pi\cdot$, $P\cdot$ and $P_\pi\cdot$ are then defined as $(P^\pi Q)(z) \triangleq \sum_{y\in\mathcal{X}} P(y|z)\, Q(y,\pi(y))$ and $(PV)(z) \triangleq \sum_{y\in\mathcal{X}} P(y|z)\, V(y)$ for all $z\in\mathcal{Z}$, and $(P_\pi V)(x) \triangleq \sum_{y\in\mathcal{X}} P_\pi(y|x)\, V(y)$ for all $x\in\mathcal{X}$, respectively. Finally, $\|\cdot\|$ shall denote the supremum ($\ell_\infty$) norm, defined as $\|g\| \triangleq \max_{y\in\mathcal{Y}} |g(y)|$, where $\mathcal{Y}$ is a finite set and $g : \mathcal{Y} \to \mathbb{R}$ is a real-valued function.[3]

2.1. Model-based Q-value Iteration (QVI)

The algorithm draws $n$ transition samples from each state-action pair $z \in \mathcal{Z}$, for which it makes $n$ calls to the generative model.[4] It then builds an empirical model of the transition probabilities as $\widehat{P}(y|z) \triangleq m(y,z)/n$, where $m(y,z)$ denotes the number of times that the state $y \in \mathcal{X}$ has been reached from $z \in \mathcal{Z}$. The algorithm then makes an empirical estimate of the optimal action-value function $Q^*$ by iterating an action-value function $Q_k$, with initial value $Q_0$, through the empirical Bellman optimality operator $\widehat{\mathcal{T}}$.[5]

[3] For ease of exposition, in the sequel we remove the dependence on $z$ and $x$, e.g., writing $Q$ for $Q(z)$ and $V$ for $V(x)$, when there is no possible confusion.
[4] The total number of calls to the generative model is given by $T = nN$.
[5] The operator $\widehat{\mathcal{T}}$ is defined on the action-value function $Q$, for all $z \in \mathcal{Z}$, by $\widehat{\mathcal{T}}Q(z) = r(z) + \gamma \widehat{P}V(z)$, where $V(x) = \max_{a\in\mathcal{A}} Q(x,a)$ for all $x \in \mathcal{X}$.
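To make the procedure concrete, the following is a minimal Python sketch of model-based QVI under a generative model. The interface `generative_model(x, a)`, the array layout, and the fixed iteration count are illustrative assumptions and not part of the paper; the paper only specifies that $n$ next-state samples are drawn per state-action pair and that $Q_k$ is obtained by iterating the empirical Bellman optimality operator $\widehat{\mathcal{T}}$.

```python
import numpy as np

def qvi(generative_model, rewards, num_states, num_actions, gamma, n, num_iters):
    """Sketch of model-based Q-value iteration (QVI) with a generative model.

    generative_model(x, a) -> sampled next state (hypothetical interface).
    rewards: array of shape (num_states, num_actions), r(x, a) in [0, 1].
    n: next-state samples per state-action pair, so T = n * N calls in total.
    """
    # Empirical transition model P_hat(y | x, a) = m(y, (x, a)) / n.
    P_hat = np.zeros((num_states, num_actions, num_states))
    for x in range(num_states):
        for a in range(num_actions):
            for _ in range(n):
                y = generative_model(x, a)
                P_hat[x, a, y] += 1.0
    P_hat /= n

    # Iterate the empirical Bellman optimality operator:
    # (T_hat Q)(x, a) = r(x, a) + gamma * sum_y P_hat(y | x, a) * max_b Q(y, b).
    Q = np.zeros((num_states, num_actions))   # Q_0, anywhere in [0, beta]
    for _ in range(num_iters):
        V = Q.max(axis=1)
        Q = rewards + gamma * P_hat @ V
    return Q
```

With the per-pair budget $n = T/N$ and the iteration count $k$ given in Theorem 1 below, this is the estimator whose error the remainder of the paper analyzes.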
3. Main Results

Our main results are in the form of PAC (probably approximately correct) bounds on the $\ell_\infty$-norm of the difference between the optimal action-value function $Q^*$ and its sample estimate:

Theorem 1 (PAC bound for model-based Q-value iteration). Let Assumption 1 hold and let $T$ be a positive integer. Then, there exist some constants $c$ and $c_0$ such that for all $\varepsilon \in (0,1)$ and $\delta \in (0,1)$ a total sampling budget of
$$T = \Big\lceil \frac{c\beta^3 N}{\varepsilon^2} \log\frac{c_0 N}{\delta} \Big\rceil$$
suffices for the uniform approximation error $\|Q^* - Q_k\| \le \varepsilon$, w.p. (with probability) at least $1-\delta$, after only $k = \lceil \log(6\beta/\varepsilon)/\log(1/\gamma) \rceil$ iterations of the QVI algorithm.[6] In particular, one may choose $c = 68$ and $c_0 = 12$.

The following general result provides a tight lower bound on the number of transitions $T$ for every RL algorithm to achieve an $\varepsilon$-optimal estimate of the action-value function w.p. $1-\delta$, under the assumption that the algorithm is $(\varepsilon,\delta,T)$-correct:

Definition 1 ($(\varepsilon,\delta,T)$-correct algorithm). Let $Q^{\mathcal{A}}_T$ be the estimate of $Q^*$ produced by an RL algorithm $\mathcal{A}$ after observing $T \ge 0$ transition samples. We say that $\mathcal{A}$ is $(\varepsilon,\delta,T)$-correct on the class of MDPs $\mathbb{M}$ if $\|Q^* - Q^{\mathcal{A}}_T\| \le \varepsilon$ with probability at least $1-\delta$ for all $M \in \mathbb{M}$.[7]

Theorem 2 (Lower bound on the sample complexity of estimating the optimal action-value function). There exist some constants $\varepsilon_0$, $\delta_0$, $c_1$, $c_2$, and a class of MDPs $\mathbb{M}$, such that for all $\varepsilon \in (0,\varepsilon_0)$, $\delta \in (0,\delta_0/N)$, and every $(\varepsilon,\delta,T)$-correct RL algorithm $\mathcal{A}$ on the class of MDPs $\mathbb{M}$, the number of transitions needs to be at least
$$T = \Big\lceil \frac{\beta^3 N}{c_1\varepsilon^2} \log\frac{N}{c_2\delta} \Big\rceil.$$

[6] For every real number $u$, $\lceil u \rceil$ is defined as the smallest integer not less than $u$.
[7] The algorithm $\mathcal{A}$, unlike QVI, does not need to generate the same number of transition samples for every state-action pair and can generate samples in an arbitrary way.
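As a purely numerical reading of the two theorems, the sketch below plugs example problem sizes into the stated bounds; the chosen values of $N$, $\gamma$, $\varepsilon$ and $\delta$ are arbitrary illustrations, and the constants are the ones quoted in Theorem 1.

```python
import math

def qvi_budget(N, gamma, eps, delta, c=68, c0=12):
    """Sampling budget T, per-pair sample size n, and iteration count k from Theorem 1."""
    beta = 1.0 / (1.0 - gamma)
    T = math.ceil(c * beta**3 * N / eps**2 * math.log(c0 * N / delta))
    k = math.ceil(math.log(6 * beta / eps) / math.log(1.0 / gamma))
    n = math.ceil(T / N)      # samples drawn from each state-action pair
    return T, n, k

# Example: N = 1000 state-action pairs, gamma = 0.9, eps = 0.1, delta = 0.05.
print(qvi_budget(1000, 0.9, 0.1, 0.05))
```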

4. Analysis

In this section, we first provide the full proof of the finite-time PAC bound of QVI, reported in Theorem 1, in Subsection 4.1. We then prove Theorem 2, the RL lower bound, in Subsection 4.2. We also emphasize that we omit some of the proofs of the technical lemmas due to lack of space; those proofs are provided in a long version of this paper.

4.1. Proof of Theorem 1

We begin by introducing some new notation. Consider a stationary policy $\pi$. We define $\mathbb{V}^\pi(z) \triangleq \mathbb{E}\big[\big|\sum_{t\ge 0}\gamma^t r(Z_t) - Q^\pi(z)\big|^2\big]$ as the variance of the sum of discounted rewards starting from $z\in\mathcal{Z}$ under the policy $\pi$. Also, define $\sigma^\pi(z) \triangleq \gamma^2 \sum_{y\in\mathcal{Z}} P^\pi(y|z)\,\big|Q^\pi(y) - (P^\pi Q^\pi)(z)\big|^2$ as the immediate variance at $z\in\mathcal{Z}$, i.e., $\gamma^2\,\mathbb{V}_{Y\sim P^\pi(\cdot|z)}[Q^\pi(Y)]$. Also, we shall denote by $v^\pi$ and $v^*$ the immediate variances of the value functions $V^\pi$ and $V^*$, defined as $v^\pi(z) \triangleq \gamma^2\,\mathbb{V}_{Y\sim P(\cdot|z)}[V^\pi(Y)]$ and $v^*(z) \triangleq \gamma^2\,\mathbb{V}_{Y\sim P(\cdot|z)}[V^*(Y)]$, for all $z\in\mathcal{Z}$, respectively. Further, we denote the immediate variances of the action-value function $\widehat{Q}^\pi$ and of $\widehat{V}^\pi$ and $\widehat{V}^*$ (defined under the empirical model $\widehat{P}$) by $\widehat{\sigma}^\pi$, $\widehat{v}^\pi$ and $\widehat{v}^*$, respectively.

We now state our first result, which indicates that $Q_k$ is close to $\widehat{Q}^*$ up to an order of $O(\gamma^k)$. Therefore, to bound $\|Q^* - Q_k\|$, one only needs to bound $\|Q^* - \widehat{Q}^*\|$ in high probability.

Lemma 1. Let Assumption 1 hold and let $Q_0(z)$ be in the interval $[0,\beta]$ for all $z\in\mathcal{Z}$. Then we have
$$\|Q_k - \widehat{Q}^*\| \le \gamma^k \beta.$$
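Lemma 1 is the standard consequence of the $\gamma$-contraction of the (empirical) Bellman optimality operator, together with $\|Q_0 - \widehat{Q}^*\| \le \beta$. A short numerical sanity check on an arbitrary randomly generated MDP, which stands in for the empirical model $\widehat{P}$ (the random generator is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 20, 4, 0.9
beta = 1.0 / (1.0 - gamma)

P = rng.dirichlet(np.ones(S), size=(S, A))   # plays the role of P_hat(y | x, a)
r = rng.uniform(0.0, 1.0, size=(S, A))       # rewards in [0, 1]

# Accurate fixed point Q_hat_star via a long run of value iteration.
Q_star = np.zeros((S, A))
for _ in range(5000):
    Q_star = r + gamma * P @ Q_star.max(axis=1)

# Lemma 1: starting from Q_0 = 0 (which lies in [0, beta]), ||Q_k - Q_hat_star|| <= gamma^k * beta.
Q = np.zeros((S, A))
for k in range(1, 60):
    Q = r + gamma * P @ Q.max(axis=1)
    assert np.max(np.abs(Q - Q_star)) <= gamma**k * beta + 1e-8
```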

In the rest of this subsection, we focus on proving a high-probability bound on $\|Q^* - \widehat{Q}^*\|$. One can prove a crude bound of $\widetilde{O}(\beta^2/\sqrt{n})$ on $\|Q^* - \widehat{Q}^*\|$ by first proving that $\|Q^* - \widehat{Q}^*\| \le \beta \|(P - \widehat{P})V^*\|$ and then using Hoeffding's tail inequality (Cesa-Bianchi & Lugosi, 2006, appendix, pg. 359) to bound $\|(P - \widehat{P})V^*\|$ in high probability. Here, we follow a different and more subtle approach to bound $\|Q^* - \widehat{Q}^*\|$, which leads to a tight bound of $\widetilde{O}(\beta^{1.5}/\sqrt{n})$: (i) we prove in Lemma 2 component-wise upper and lower bounds on the error $Q^* - \widehat{Q}^*$, which are expressed in terms of $(I - \gamma\widehat{P}^{\pi^*})^{-1}\big[(P - \widehat{P})V^*\big]$ and $(I - \gamma\widehat{P}^{\widehat{\pi}^*})^{-1}\big[(P - \widehat{P})V^*\big]$, respectively; (ii) we make use of the sharp result of Bernstein's inequality to bound $(P - \widehat{P})V^*$ in terms of the square root of the variance of $V^*$ in high probability; (iii) we prove the key result of this subsection (Lemma 5), which shows that the variance of the sum of discounted rewards satisfies a Bellman-like recursion, in which the instant reward $r(z)$ is replaced by $\sigma^\pi(z)$. Based on this result we prove an upper bound of order $O(\beta^{1.5})$ on $\|(I - \gamma P^\pi)^{-1}\sqrt{\mathbb{V}^\pi}\|$ for any policy $\pi$, which, combined with the previous steps, leads to the sharp upper bound of $\widetilde{O}(\beta^{1.5}/\sqrt{n})$ on $\|Q^* - \widehat{Q}^*\|$.

We proceed with the following lemma, which bounds $Q^* - \widehat{Q}^*$ from above and below:

Lemma 2 (Component-wise bounds on $Q^* - \widehat{Q}^*$).
$$Q^* - \widehat{Q}^* \le \gamma\big(I - \gamma\widehat{P}^{\pi^*}\big)^{-1}\big(P - \widehat{P}\big)V^*, \qquad (1)$$
$$Q^* - \widehat{Q}^* \ge \gamma\big(I - \gamma\widehat{P}^{\widehat{\pi}^*}\big)^{-1}\big(P - \widehat{P}\big)V^*. \qquad (2)$$

We now concentrate on bounding the RHS (right-hand sides) of (1) and (2). We first state the following technical result, which relates $v^*$ to $\widehat{\sigma}^{\pi^*}$ and $\widehat{\sigma}^*$.

Lemma 3. Let Assumption 1 hold and $0 < \delta < 1$. Then, w.p. at least $1-\delta$:
$$v^* \le \widehat{\sigma}^{\pi^*} + b_v \mathbf{1}, \qquad (3)$$
$$v^* \le \widehat{\sigma}^{*} + b_v \mathbf{1}, \qquad (4)$$
where $b_v$ is defined as
$$b_v \triangleq \sqrt{\frac{18\gamma^4\beta^4 \log\frac{3N}{\delta}}{n}} + \frac{4\gamma^2\beta^4 \log\frac{3N}{\delta}}{n},$$
and $\mathbf{1}$ is the function which assigns 1 to every $z\in\mathcal{Z}$.

Based on Lemma 3 we prove the following sharp bound on $\gamma(P - \widehat{P})V^*$, for which we also rely on Bernstein's inequality (Cesa-Bianchi & Lugosi, 2006, appendix, pg. 361).

Lemma 4. Let Assumption 1 hold and $0 < \delta < 1$. Define $c_{pv} \triangleq 2\log(2N/\delta)$ and $b_{pv}$ as
$$b_{pv} \triangleq \frac{6(\gamma\beta)^{4/3}\big(\log\frac{6N}{\delta}\big)^{3/4}}{n^{3/4}} + \frac{5\gamma\beta^2 \log\frac{6N}{\delta}}{n}.$$
Then w.p. $1-\delta$ we have
$$\gamma\big(P - \widehat{P}\big)V^* \le \sqrt{\frac{c_{pv}\,\widehat{\sigma}^{\pi^*}}{n}} + b_{pv}\mathbf{1}, \qquad (5)$$
$$\gamma\big(P - \widehat{P}\big)V^* \ge -\sqrt{\frac{c_{pv}\,\widehat{\sigma}^{*}}{n}} - b_{pv}\mathbf{1}. \qquad (6)$$

Proof. For all $z\in\mathcal{Z}$ and all $0<\delta<1$, Bernstein's inequality implies that w.p. at least $1-\delta$:
$$\big(P - \widehat{P}\big)V^*(z) \le \sqrt{\frac{2 v^*(z)\log\frac{1}{\delta}}{\gamma^2 n}} + \frac{2\beta\log\frac{1}{\delta}}{3n},$$
$$\big(P - \widehat{P}\big)V^*(z) \ge -\sqrt{\frac{2 v^*(z)\log\frac{1}{\delta}}{\gamma^2 n}} - \frac{2\beta\log\frac{1}{\delta}}{3n}.$$
We deduce (using a union bound):
$$\gamma\big(P - \widehat{P}\big)V^* \le \sqrt{\frac{c'_{pv}\,v^*}{n}} + b'_{pv}\mathbf{1}, \qquad (7)$$
$$\gamma\big(P - \widehat{P}\big)V^* \ge -\sqrt{\frac{c'_{pv}\,v^*}{n}} - b'_{pv}\mathbf{1}, \qquad (8)$$
where $c'_{pv} \triangleq 2\log(2N/\delta)$ and $b'_{pv} \triangleq 2\gamma\beta\log(2N/\delta)/(3n)$. The result then follows by plugging (3) and (4) into (7) and (8), respectively, and then taking a union bound.
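The reason Bernstein's inequality (rather than Hoeffding's) appears in Lemma 4 is that Hoeffding only sees the range $\beta$ of $V^*$, whereas Bernstein sees its next-state variance $v^*(z)$, which is what ultimately buys the extra $\sqrt{\beta}$ in the final bound. A minimal numerical comparison of the two deviation bounds for the empirical mean of $n$ i.i.d. bounded variables (all numbers below are illustrative):

```python
import math

def hoeffding_halfwidth(value_range, n, delta):
    # |mean_hat - mean| <= range * sqrt(log(2/delta) / (2n)) w.p. at least 1 - delta.
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def bernstein_halfwidth(variance, value_range, n, delta):
    # |mean_hat - mean| <= sqrt(2 var log(2/delta) / n) + 2 range log(2/delta) / (3n).
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * variance * log_term / n) + 2.0 * value_range * log_term / (3.0 * n)

# A variable bounded by beta = 100 whose variance is far below beta^2:
beta, variance, n, delta = 100.0, 25.0, 10_000, 0.05
print(hoeffding_halfwidth(beta, n, delta))            # driven by the range beta
print(bernstein_halfwidth(variance, beta, n, delta))  # driven by sqrt(variance)
```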

We now state the key lemma of this section, which shows that for any policy $\pi$ the variance $\mathbb{V}^\pi$ satisfies the following Bellman-like recursion. Later, we use this result, in Lemma 6, to bound $(I - \gamma P^\pi)^{-1}\sigma^\pi$.

Lemma 5. $\mathbb{V}^\pi$ satisfies the Bellman equation
$$\mathbb{V}^\pi = \sigma^\pi + \gamma^2 P^\pi\, \mathbb{V}^\pi.$$

Proof. For all $z\in\mathcal{Z}$ we have
$$\mathbb{V}^\pi(z) = \mathbb{E}\Big[\Big|\sum_{t\ge 0}\gamma^t r(Z_t) - Q^\pi(z)\Big|^2\Big]$$
$$= \mathbb{E}_{Z_1\sim P^\pi(\cdot|z)}\,\mathbb{E}\Big[\Big|\Big(\sum_{t\ge 1}\gamma^t r(Z_t) - \gamma Q^\pi(Z_1)\Big) - \big(Q^\pi(z) - r(z) - \gamma Q^\pi(Z_1)\big)\Big|^2\Big]$$
$$= \gamma^2\,\mathbb{E}_{Z_1\sim P^\pi(\cdot|z)}\,\mathbb{E}\Big[\Big|\sum_{t\ge 1}\gamma^{t-1} r(Z_t) - Q^\pi(Z_1)\Big|^2\Big] - 2\,\mathbb{E}_{Z_1\sim P^\pi(\cdot|z)}\Big[\big(Q^\pi(z) - r(z) - \gamma Q^\pi(Z_1)\big)\,\mathbb{E}\Big(\sum_{t\ge 1}\gamma^t r(Z_t) - \gamma Q^\pi(Z_1)\,\Big|\,Z_1\Big)\Big] + \mathbb{E}_{Z_1\sim P^\pi(\cdot|z)}\Big[\big|Q^\pi(z) - r(z) - \gamma Q^\pi(Z_1)\big|^2\Big]$$
$$= \gamma^2\,\mathbb{E}_{Z_1\sim P^\pi(\cdot|z)}\,\mathbb{E}\Big[\Big|\sum_{t\ge 1}\gamma^{t-1} r(Z_t) - Q^\pi(Z_1)\Big|^2\Big] + \gamma^2\,\mathbb{V}_{Z_1\sim P^\pi(\cdot|z)}\big(Q^\pi(Z_1)\big)$$
$$= \gamma^2 \sum_{y\in\mathcal{Z}} P^\pi(y|z)\,\mathbb{V}^\pi(y) + \sigma^\pi(z),$$
in which we rely on $\mathbb{E}\big(\sum_{t\ge 1}\gamma^t r(Z_t) - \gamma Q^\pi(Z_1)\,\big|\,Z_1\big) = 0$ and on the Bellman equation $Q^\pi(z) = r(z) + \gamma (P^\pi Q^\pi)(z)$, which makes the last expectation equal to $\gamma^2\,\mathbb{V}_{Z_1\sim P^\pi(\cdot|z)}(Q^\pi(Z_1)) = \sigma^\pi(z)$.
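Lemma 5 can be verified numerically: on any small Markov reward process (an MDP with a fixed policy), the Monte-Carlo variance of the discounted return should agree with the solution of the linear system $\mathbb{V}^\pi = (I - \gamma^2 P^\pi)^{-1}\sigma^\pi$. The sketch below does this for a randomly generated chain; the generator, truncation horizon and sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
S, gamma, horizon, n_rollouts = 5, 0.9, 200, 20_000

P = rng.dirichlet(np.ones(S), size=S)     # P^pi(y | z) for a fixed policy pi
r = rng.uniform(0.0, 1.0, size=S)         # r(z)

Q = np.linalg.solve(np.eye(S) - gamma * P, r)             # Q^pi = (I - gamma P^pi)^{-1} r
sigma = gamma**2 * (P @ Q**2 - (P @ Q) ** 2)              # immediate variance sigma^pi
V_rec = np.linalg.solve(np.eye(S) - gamma**2 * P, sigma)  # Lemma 5 solved for V^pi

# Monte-Carlo estimate of Var[sum_t gamma^t r(Z_t)] with all rollouts started at z = 0.
z = np.zeros(n_rollouts, dtype=int)
returns = np.zeros(n_rollouts)
disc, cdf = 1.0, P.cumsum(axis=1)
for _ in range(horizon):
    returns += disc * r[z]
    disc *= gamma
    z = (rng.random(n_rollouts)[:, None] < cdf[z]).argmax(axis=1)

print(V_rec[0], returns.var())   # the two values should agree up to Monte-Carlo error
```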

Based on Lemma 5, one can prove the following result on the immediate variance.

Lemma 6.
$$\big(I - \gamma^2 P^\pi\big)^{-1}\sigma^\pi \le \beta^2\,\mathbf{1}, \qquad (9)$$
$$\big\|\big(I - \gamma P^\pi\big)^{-1}\sqrt{\sigma^\pi}\big\| \le 2\log(2)\,\beta^{1.5}. \qquad (10)$$

Proof. The first inequality follows from Lemma 5 by solving the Bellman equation of Lemma 5 in terms of $\mathbb{V}^\pi$ and taking the sup-norm over both sides of the resulting equation. In the case of Eq. (10) we have
$$\big(I - \gamma P^\pi\big)^{-1}\sqrt{\sigma^\pi} = \sum_{k\ge 0}(\gamma P^\pi)^k\sqrt{\sigma^\pi} = \sum_{l\ge 0}(\gamma P^\pi)^{tl}\sum_{j=0}^{t-1}(\gamma P^\pi)^{j}\sqrt{\sigma^\pi} \le \sum_{l\ge 0}(\gamma^t)^l\,\Big\|\sum_{j=0}^{t-1}(\gamma P^\pi)^{j}\sqrt{\sigma^\pi}\Big\|\,\mathbf{1} = \frac{1}{1-\gamma^t}\,\Big\|\sum_{j=0}^{t-1}(\gamma P^\pi)^{j}\sqrt{\sigma^\pi}\Big\|\,\mathbf{1}, \qquad (11)$$
in which we write $k = tl + j$, where $t$ is a positive integer, $l \ge 0$ and $0 \le j \le t-1$. We now prove a bound on $\sum_{j=0}^{t-1}(\gamma P^\pi)^{j}\sqrt{\sigma^\pi}$ by making use of Jensen's inequality as well as the Cauchy-Schwarz inequality:
$$\sum_{j=0}^{t-1}(\gamma P^\pi)^{j}\sqrt{\sigma^\pi} \le \sum_{j=0}^{t-1}\gamma^{j}\sqrt{(P^\pi)^{j}\sigma^\pi} \le \sqrt{t}\,\sqrt{\sum_{j=0}^{t-1}(\gamma^2 P^\pi)^{j}\sigma^\pi} \qquad (12)$$
$$\le \sqrt{t}\,\sqrt{\big(I - \gamma^2 P^\pi\big)^{-1}\sigma^\pi} \le \beta\sqrt{t}\,\mathbf{1},$$
where in the last step we rely on (9). The result then follows by plugging (12) into (11) and optimizing the bound in terms of $t$ to achieve the best dependency on $\beta$.

Now we make use of Lemma 6 and Lemma 4 to bound $\|Q^* - \widehat{Q}^*\|$ in high probability.

Lemma 7. Let Assumption 1 hold. Then, for any $0<\delta<1$, w.p. at least $1-\delta$,
$$\|Q^* - \widehat{Q}^*\| \le b,$$
where the scalar $b$ is given by
$$b \triangleq \sqrt{\frac{17\beta^3\log\frac{2N}{\delta}}{n}} + \frac{6(\gamma\beta^2)^{4/3}\big(\log\frac{6N}{\delta}\big)^{3/4}}{n^{3/4}} + \frac{5\gamma\beta^3\log\frac{6N}{\delta}}{n}.$$

Proof. By incorporating the results of Lemma 4 and Lemma 6 into Lemma 2, we deduce that
$$Q^* - \widehat{Q}^* \le b\,\mathbf{1}, \qquad Q^* - \widehat{Q}^* \ge -b\,\mathbf{1},$$
w.p. $1-\delta$, with the scalar $b$ as defined above. The result then follows by combining these two bounds and taking the $\ell_\infty$ norm.

Proof of Theorem 1. We combine the results of Lemma 7 and Lemma 1 in order to bound $\|Q^* - Q_k\|$ in high probability. We then solve the resulting bound w.r.t. $n$ and $k$.

4.2. Proof of the Lower Bound

In this section, we provide the proof of Theorem 2. In our analysis, we rely on the likelihood-ratio method, which has previously been used to prove a lower bound for multi-armed bandits (Mannor & Tsitsiklis, 2004), and we extend this approach to RL and MDPs. We begin by defining a class of MDPs for which the proposed lower bound will be obtained (see Figure 1). We define the class of MDPs $\mathbb{M}$ as the set of all MDPs with a state-action space of cardinality $N = 3KL$, where $K$ and $L$ are positive integers. Also, we assume that for all $M \in \mathbb{M}$, the state space $\mathcal{X}$ consists of three smaller sets $\mathcal{S}$, $\mathcal{Y}^1$ and $\mathcal{Y}^2$. The set $\mathcal{S}$ includes $K$ states, each of which is associated with the set of actions $\mathcal{A} = \{a_1, a_2, \dots, a_L\}$, whereas the states in $\mathcal{Y}^1$ and $\mathcal{Y}^2$ are single-action states. By taking an action $a \in \mathcal{A}$ from any state $x \in \mathcal{S}$, we move to the next state $y(z) \in \mathcal{Y}^1$ with probability 1, where $z = (x,a)$. The transition probability from $\mathcal{Y}^1$ is characterized by the transition probability $p_M$ from every $y(z) \in \mathcal{Y}^1$ to itself, and with probability $1 - p_M$ to the corresponding state in $\mathcal{Y}^2$. Further, for all $M \in \mathbb{M}$, the states in $\mathcal{Y}^2$ are absorbing.

[Figure 1. The class of MDPs considered in the proof of Theorem 2. Nodes represent states and arrows show transitions between the states (see the text for details).]

In the rest of the proof, we concentrate on proving the lower bound for $|Q^*(z) - Q^{\mathcal{A}}_T(z)|$ for all $z \in \mathcal{S}\times\mathcal{A}$. Now, let us consider a set of two MDPs $\mathbb{M}^* = \{M_0, M_1\}$ in $\mathbb{M}$ with the transition probabilities
$$p_M = \begin{cases} p & M = M_0,\\ p + \alpha & M = M_1,\end{cases}$$
where $\alpha$ and $p$ are some positive numbers such that $0 < p < p + \alpha \le 1$. We assume that the discount factor $\gamma$ is bounded from below by some positive constant $\gamma_0$ for the class of MDPs $\mathbb{M}^*$. We denote by $\mathbb{E}_m$ and $\mathbb{P}_m$ the expectation and the probability under the model $M_m$ in the rest of this section.

We follow these steps in the proof: (i) we prove a lower bound on the sample complexity of learning the value function for every state $y \in \mathcal{Y}^1$ on the class of MDPs $\mathbb{M}^*$; (ii) we then make use of the fact that the estimates of $Q^*(z)$ for different $z \in \mathcal{S}\times\mathcal{A}$ are independent of each other to combine these bounds and prove the tight result of Theorem 2.

We now introduce some new notation: let $Q^{\mathcal{A}}_t(z)$ be an empirical estimate of the action-value function $Q^*(z)$ by the RL algorithm $\mathcal{A}$ using $t > 0$ transition samples from the state $y(z) \in \mathcal{Y}^1$, for $z \in \mathcal{S}\times\mathcal{A}$. We define the event $E_1(z) \triangleq \{|Q^*_0(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\}$ for all $z \in \mathcal{S}\times\mathcal{A}$, where $Q^*_0 \triangleq \gamma/(1-\gamma p)$ is the optimal action-value function for all $z \in \mathcal{S}\times\mathcal{A}$ under the MDP $M_0$. We then define $k \triangleq r_1 + r_2 + \cdots + r_t$ as the sum of rewards obtained by making $t$ transitions from $y(z) \in \mathcal{Y}^1$. We also introduce the event $E_2(z)$, for all $z \in \mathcal{S}\times\mathcal{A}$, as
$$E_2(z) \triangleq \Big\{\, pt - k \le \sqrt{2p(1-p)t\,\log\tfrac{c'_2}{2\theta}}\,\Big\}.$$
Further, we define $E(z) \triangleq E_1(z) \cap E_2(z)$.

We then state the following lemmas required for our analysis. We begin our analysis of the lower bound with the following lemma:

Lemma 8. Define $\theta \triangleq \exp\big(-c'_1\alpha^2 t/(p(1-p))\big)$. Then, for every RL algorithm $\mathcal{A}$, there exist an MDP $M_m \in \mathbb{M}^*$ and constants $c'_1 > 0$ and $c'_2 > 0$ such that
$$\mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_t(z)| > \varepsilon\big) > \frac{\theta}{c'_2}, \qquad (14)$$
by the choice of $\alpha = 2(1-\gamma p)^2\varepsilon/\gamma^2$.

Proof. To prove this result we make use of a contradiction argument, i.e., we assume that there exists an algorithm $\mathcal{A}$ for which
$$\mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_t(z)| > \varepsilon\big) \le \frac{\theta}{c'_2}, \quad\text{or equivalently}\quad \mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\big) \ge 1 - \frac{\theta}{c'_2}, \qquad (15)$$
for all $M_m \in \mathbb{M}^*$, and show that this assumption leads to a contradiction. We now state the following technical lemmas required for the proof:

Lemma 9. For all $z \in \mathcal{S}\times\mathcal{A}$,
$$\mathbb{P}_0(E_2(z)) > 1 - \frac{2\theta}{c'_2}.$$

Now, by the assumption that $\mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_t(z)| > \varepsilon\big) \le \theta/c'_2$ for all $M_m \in \mathbb{M}^*$, we have $\mathbb{P}_0(E_1(z)) \ge 1 - \theta/c'_2 \ge 1 - 1/c'_2$. This, combined with Lemma 9 and with the choice of $c'_2 = 6$, implies that $\mathbb{P}_0(E(z)) > 1/2$ for all $z \in \mathcal{S}\times\mathcal{A}$.

Lemma 10. Let $\varepsilon \le \frac{1-p}{4\gamma^2(1-\gamma p)^2}$. Then, for all $z \in \mathcal{S}\times\mathcal{A}$: $\mathbb{P}_1(E_1(z)) > \theta/c'_2$.

Now, by the choice of $\alpha = 2(1-\gamma p)^2\varepsilon/\gamma^2$, we have that
$$Q^*_1(z) - Q^*_0(z) = \frac{\gamma}{1-\gamma(p+\alpha)} - \frac{\gamma}{1-\gamma p} > 2\varepsilon,$$
thus $Q^*_0(z) + \varepsilon < Q^*_1(z) - \varepsilon$ and the event $\{|Q^*_0(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\}$ does not overlap with the event $\{|Q^*_1(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\}$.

Now let us return to the assumption of Eq. (15), which states that for all $M_m \in \mathbb{M}^*$, $\mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\big) \ge 1 - \theta/c'_2$ under algorithm $\mathcal{A}$. Based on Lemma 10 we have $\mathbb{P}_1\big(|Q^*_0(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\big) > \theta/c'_2$. This, combined with the fact that the events $\{|Q^*_0(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\}$ and $\{|Q^*_1(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\}$ do not overlap, implies that $\mathbb{P}_1\big(|Q^*_1(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\big) < 1 - \theta/c'_2$, which violates the assumption of Eq. (15). The contradiction between the result of Lemma 10 and the assumption which leads to it proves the lower bound of Eq. (14).
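The non-overlap argument above hinges on the perturbation $\alpha = 2(1-\gamma p)^2\varepsilon/\gamma^2$ pushing the optimal action-values of $M_0$ and $M_1$ more than $2\varepsilon$ apart, so that no single estimate can be $\varepsilon$-accurate under both models. A quick numerical check of this gap, using only the closed form $Q^*_m(z) = \gamma/(1-\gamma p_{M_m})$ quoted above (the particular values of $\gamma$ and $\varepsilon$ are arbitrary):

```python
gamma, eps = 0.95, 0.05
p = (4 * gamma - 1) / (3 * gamma)            # the choice of p used later in the proof
alpha = 2 * (1 - gamma * p) ** 2 * eps / gamma ** 2

q0 = gamma / (1 - gamma * p)                 # Q*_0(z) for z in S x A (under M_0)
q1 = gamma / (1 - gamma * (p + alpha))       # Q*_1(z) (under M_1)
print(q1 - q0, 2 * eps)                      # the gap strictly exceeds 2 * eps
```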

Now, by the choice of $p = \frac{4\gamma - 1}{3\gamma}$ and $c_1 = 8100$, we have that for every $\varepsilon \in (0,3]$ and for all $0.4 = \gamma_0 \le \gamma < 1$ there exists an MDP $M_m \in \mathbb{M}^*$ such that
$$\mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_{T_z}(z)| > \varepsilon\big) > \frac{1}{c'_2}\,e^{-\frac{c_1 T_z \varepsilon^2}{6\beta^3}}.$$
This result implies that for any state-action pair $z \in \mathcal{S}\times\mathcal{A}$, the probability of making an estimation error of $\varepsilon$ is at least $\delta$ on $M_0$ or $M_1$ whenever the number of transition samples $T_z$ from $z \in \mathcal{Z}$ is less than $\xi(\varepsilon,\delta) \triangleq \frac{6\beta^3}{c_1\varepsilon^2}\log\frac{1}{c'_2\delta}$. We now extend this result to the whole state-action space $\mathcal{S}\times\mathcal{A}$.

Lemma 11. Assume that for every algorithm $\mathcal{A}$ and for every state-action pair $z \in \mathcal{S}\times\mathcal{A}$ we have[11]
$$\mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_{T_z}(z)| > \varepsilon \,\big|\, T_z = t_z\big) > \delta. \qquad (16)$$
Then for any $\delta' \in (0, 1/2)$ and for any algorithm $\mathcal{A}$ using a total number of transition samples less than $T = \frac{N}{6}\,\xi\big(\varepsilon, \frac{12\delta'}{N}\big)$, there exists an MDP $M_m \in \mathbb{M}^*$ such that
$$\mathbb{P}_m\big(\|Q^* - Q^{\mathcal{A}}_T\| > \varepsilon\big) > \delta', \qquad (17)$$
where $Q^{\mathcal{A}}_T$ denotes the empirical estimate of the optimal action-value function $Q^*$ by $\mathcal{A}$ using $T$ transition samples.

[11] Note that we allow $T_z$ to be random.

Proof. First note that if the total number of observed transitions is less than $(KL/2)\,\xi(\varepsilon,\delta) = (N/6)\,\xi(\varepsilon,\delta)$, then there exist at least $KL/2 = N/6$ state-action pairs that are sampled at most $\xi(\varepsilon,\delta)$ times (indeed, if this were not the case, then the total number of transitions would be strictly larger than $(N/6)\,\xi(\varepsilon,\delta)$, which implies a contradiction). Now let us denote those state-action pairs by $z_{(1)},\dots,z_{(N/6)}$.

We consider the specific class of MDPs described in Figure 1. In order to prove that (17) holds for any algorithm, it is sufficient to prove it for the class of algorithms that return an estimate $Q^{\mathcal{A}}_{T_z}(z)$ for each state-action pair $z$ based on the transition samples observed from $z$ only (indeed, since the samples from $z$ and $z'$ are independent, the samples collected from $z'$ do not bring more information about $Q^*(z)$ than the information brought by the samples collected from $z$). Thus, by defining $\mathcal{Q}(z) \triangleq \{|Q^*(z) - Q^{\mathcal{A}}_{T_z}(z)| > \varepsilon\}$, we have that for such algorithms the events $\mathcal{Q}(z)$ and $\mathcal{Q}(z')$ are conditionally independent given $T_z$ and $T_{z'}$. Thus, there exists an MDP $M_m \in \mathbb{M}^*$ such that
$$\mathbb{P}_m\Big(\{\mathcal{Q}(z_{(i)})^c\}_{1\le i\le N/6} \cap \{T_{z_{(i)}} \le \xi(\varepsilon,\delta)\}_{1\le i\le N/6}\Big)$$
$$= \sum_{t_1=0}^{\xi(\varepsilon,\delta)}\cdots\sum_{t_{N/6}=0}^{\xi(\varepsilon,\delta)} \mathbb{P}_m\Big(\{T_{z_{(i)}} = t_i\}_{1\le i\le N/6} \cap \{\mathcal{Q}(z_{(i)})^c\}_{1\le i\le N/6}\Big)$$
$$= \sum_{t_1=0}^{\xi(\varepsilon,\delta)}\cdots\sum_{t_{N/6}=0}^{\xi(\varepsilon,\delta)} \mathbb{P}_m\Big(\{T_{z_{(i)}} = t_i\}_{1\le i\le N/6}\Big)\prod_{1\le i\le N/6} \mathbb{P}_m\Big(\mathcal{Q}(z_{(i)})^c \,\Big|\, T_{z_{(i)}} = t_i\Big)$$
$$\le \sum_{t_1=0}^{\xi(\varepsilon,\delta)}\cdots\sum_{t_{N/6}=0}^{\xi(\varepsilon,\delta)} \mathbb{P}_m\Big(\{T_{z_{(i)}} = t_i\}_{1\le i\le N/6}\Big)\,(1-\delta)^{N/6} \le (1-\delta)^{N/6},$$
from Eq. (16). Thus
$$\mathbb{P}_m\Big(\{\mathcal{Q}(z_{(i)})^c\}_{1\le i\le N/6}\,\Big|\,\{T_{z_{(i)}} \le \xi(\varepsilon,\delta)\}_{1\le i\le N/6}\Big) \le (1-\delta)^{N/6}.$$
We finally deduce that if the total number of transition samples is less than $(N/6)\,\xi(\varepsilon,\delta)$, then
$$\mathbb{P}_m\big(\|Q^* - Q^{\mathcal{A}}_T\| > \varepsilon\big) \ge \mathbb{P}_m\Big(\bigcup_{z\in\mathcal{S}\times\mathcal{A}} \mathcal{Q}(z)\Big) \ge 1 - \mathbb{P}_m\Big(\{\mathcal{Q}(z_{(i)})^c\}_{1\le i\le N/6}\,\Big|\,\{T_{z_{(i)}} \le \xi(\varepsilon,\delta)\}_{1\le i\le N/6}\Big) \ge 1 - (1-\delta)^{N/6} \ge \frac{\delta N}{12},$$
whenever $\delta N/6 \le 1$. Setting $\delta' = \delta N/12$, we obtain the desired result.

Lemma 11 implies that if the total number of samples $T$ is less than $\frac{\beta^3 N}{c_1\varepsilon^2}\log\frac{N}{c_2\delta}$, with the choice of $c_1 = 8100$ and $c_2 = 72$, then the probability of $\|Q^* - Q^{\mathcal{A}}_T\| \le \varepsilon$ is at most $1-\delta$ on either $M_0$ or $M_1$. This is equivalent to the statement that for every RL algorithm $\mathcal{A}$ to be $(\varepsilon,\delta,T)$-correct on the set $\mathbb{M}^*$, and subsequently on the class of MDPs $\mathbb{M}$, the total number of transitions $T$ needs to satisfy the inequality $T > \frac{\beta^3 N}{c_1\varepsilon^2}\log\frac{N}{c_2\delta}$, which concludes the proof of Theorem 2.
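To see how tightly Theorems 1 and 2 bracket the sample complexity, the sketch below evaluates both bounds for the same problem size: both scale as $\beta^3 N \log(N/\delta)/\varepsilon^2$, so the remaining gap is only in the numerical constants ($c=68$, $c_0=12$ versus $c_1=8100$, $c_2=72$). The example values are arbitrary.

```python
import math

def upper_bound(N, gamma, eps, delta, c=68, c0=12):
    beta = 1.0 / (1.0 - gamma)
    return c * beta**3 * N / eps**2 * math.log(c0 * N / delta)       # Theorem 1: QVI suffices

def lower_bound(N, gamma, eps, delta, c1=8100, c2=72):
    beta = 1.0 / (1.0 - gamma)
    return beta**3 * N / (c1 * eps**2) * math.log(N / (c2 * delta))  # Theorem 2: any algorithm needs

N, gamma, eps, delta = 600, 0.99, 0.1, 0.01
print(upper_bound(N, gamma, eps, delta), lower_bound(N, gamma, eps, delta))
```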

5. Conclusion and Future Work

In this paper, we have presented the first minimax bound on the sample complexity of estimating the optimal action-value function in discounted-reward MDPs. We have proven that the model-based Q-value iteration algorithm (QVI) is an optimal learning algorithm, since it minimizes the dependencies on $1/\varepsilon$, $N$, $\delta$ and $\beta$. Also, our results significantly improve on the state of the art in terms of the dependency on $\beta$. Overall, we conclude that QVI is an efficient RL algorithm which completely closes the gap between the lower and the upper bound on the sample complexity of RL in the presence of a generative model of the MDP.

In this work, we are only interested in the estimation of the optimal action-value function and not in the problem of exploration. Therefore, we did not compare our results with the state of the art in PAC-MDP (Strehl et al., 2009; Szita & Szepesvári, 2010) and upper-confidence-bound based algorithms (Bartlett & Tewari, 2009; Jaksch et al., 2010), in which the choice of the exploration policy has an influence on the behavior of the learning algorithm. However, we believe that it would be possible to improve on the state of the art in PAC-MDP based on the results of this paper. This is mainly due to the fact that most PAC-MDP algorithms rely on an extended variant of model-based Q-value iteration to estimate the action-value function, but they use the naive result of Hoeffding's inequality for concentration of measure, which leads to non-tight sample complexity results. One can improve on those results, in terms of the dependency on $\beta$, using the improved analysis of this paper, which makes use of the sharp result of Bernstein's inequality as opposed to the Hoeffding's inequality used in previous work. Also, we believe that the existing lower bound on the sample complexity of exploration of any reinforcement learning algorithm (Strehl et al., 2009) can be significantly improved in terms of the dependency on $\beta$ using the new "hard" class of MDPs presented in this paper.

Acknowledgments

The authors appreciate support from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 231495. We also thank Bruno Scherrer for useful discussions and the anonymous reviewers for their valuable comments.

References

Azar, M. Gheshlaghi, Munos, R., Ghavamzadeh, M., and Kappen, H. J. Speedy Q-learning. In Advances in Neural Information Processing Systems 24, pp. 2411–2419, 2011a.

Azar, M. Gheshlaghi, Munos, R., Ghavamzadeh, M., and Kappen, H. J. Reinforcement learning with a near optimal rate of convergence. Technical Report inria-00636615, INRIA, 2011b.

Bartlett, P. L. and Tewari, A. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.

Buşoniu, L., Babuška, R., De Schutter, B., and Ernst, D. Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Boca Raton, Florida, 2010.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

Kakade, S. M. On the Sample Complexity of Reinforcement Learning. PhD thesis, Gatsby Computational Neuroscience Unit, 2004.

Kearns, M. and Singh, S. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems 12, pp. 996–1002. MIT Press, 1999.

Mannor, S. and Tsitsiklis, J. N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623–648, 2004.

Munos, R. and Moore, A. Influence and variance of a Markov chain: Application to adaptive discretizations in optimal control. In Proceedings of the 38th IEEE Conference on Decision and Control, 1999.

Sobel, M. J. The variance of discounted Markov decision processes. Journal of Applied Probability, 19:794–802, 1982.

Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, 1998.

Szepesvári, Cs. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.

Szita, I. and Szepesvári, Cs. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pp. 1031–1038. Omnipress, 2010.