On the Sample Complexity of Reinforcement Learning with a Generative Model
Mohammad Gheshlaghi Azar  [email protected]
Department of Biophysics, Radboud University Nijmegen, 6525 EZ Nijmegen, The Netherlands

Rémi Munos  [email protected]
INRIA Lille, SequeL Project, 40 avenue Halley, 59650 Villeneuve d'Ascq, France

Hilbert J. Kappen  [email protected]
Department of Biophysics, Radboud University Nijmegen, 6525 EZ Nijmegen, The Netherlands
Abstract

We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample complexity of the model-based value iteration algorithm in the presence of a generative model, which indicates that for an MDP with $N$ state-action pairs and discount factor $\gamma \in [0,1)$, only $O\big(N \log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$ samples are required to find an $\varepsilon$-optimal estimate of the action-value function with probability $1-\delta$. We also prove a matching lower bound of $\Theta\big(N \log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$ on the sample complexity of estimating the optimal action-value function by any RL algorithm. To the best of our knowledge, this is the first matching result on the sample complexity of estimating the optimal (action-)value function, in which the upper bound matches the lower bound of RL in terms of $N$, $\varepsilon$, $\delta$ and $1/(1-\gamma)$. Moreover, both our lower bound and our upper bound significantly improve on the state of the art in terms of $1/(1-\gamma)$.

1. Introduction

Model-based value iteration (VI) (Kearns & Singh, 1999; Buşoniu et al., 2010) is a well-known reinforcement learning (RL) (Szepesvári, 2010; Sutton & Barto, 1998) algorithm which relies on an empirical estimate of the state-transition distributions to estimate the optimal (action-)value function through the Bellman recursion. In finite state-action problems, it has been shown that an action-value-based variant of VI, model-based Q-value iteration (QVI), finds an $\varepsilon$-optimal estimate of the action-value function with high probability using only $T = \tilde O\big(N/((1-\gamma)^4\varepsilon^2)\big)$ samples (Kearns & Singh, 1999; Kakade, 2004, chap. 9.1), where $N$ and $\gamma$ denote the size of the state-action space and the discount factor, respectively.¹ Although this matches the best existing upper bound on the sample complexity of estimating the action-value function (Azar et al., 2011a), it has not been clear, so far, whether this bound is tight or whether it can be improved by a more careful analysis of the QVI algorithm. This is mainly because there is a gap of order $1/(1-\gamma)^2$ between the upper bound of QVI and the state-of-the-art lower bound, which is of order $\Omega\big(N/((1-\gamma)^2\varepsilon^2)\big)$ (Azar et al., 2011b).

In this paper, we focus on problems formulated as finite state-action discounted infinite-horizon Markov decision processes (MDPs), and prove a new tight bound of $O\big(N\log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$ on the sample complexity of the QVI algorithm. The new upper bound improves on the existing bound for QVI by an order of $1/(1-\gamma)$.² We also present a new matching lower bound of $\Theta\big(N\log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$, which also improves on the best existing lower bound of RL by an order of $1/(1-\gamma)$. The new results, which close the above-mentioned gap between the lower bound and the upper bound of RL, guarantee that no learning method, given a generative model of the MDP, can be significantly more efficient than QVI in terms of the sample complexity of estimating the action-value function.

¹ The notation $g = \tilde O(f)$ implies that there are constants $c_1$ and $c_2$ such that $g \le c_1 f \log^{c_2}(f)$.
² In this paper, to keep the presentation succinct, we only consider the value iteration algorithm. However, one can prove upper bounds of the same order for other model-based methods, such as policy iteration and linear programming, using the results of this paper.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).
The main idea behind the improved upper bound for QVI is to express the performance loss of QVI in terms of the variance of the sum of discounted rewards, as opposed to the maximum $V_{\max} = R_{\max}/(1-\gamma)$ used in previous results. For this we make use of Bernstein's concentration inequality (Cesa-Bianchi & Lugosi, 2006, appendix, pg. 361), which bounds the estimation error in terms of the variance of the value function. We also rely on the fact that the variance of the sum of discounted rewards, like the expected value of the sum (the value function), satisfies a Bellman-like equation in which the variance of the value function plays the role of the instant reward (Munos & Moore, 1999; Sobel, 1982); this yields a sharp bound on the variance of the value function. In the case of the lower bound, we improve on the result of Azar et al. (2011b) by adding some structure to the class of MDPs for which we prove the lower bound: in the new model, there is a high probability of transition from every intermediate state to itself. This adds to the difficulty of estimating the value function, since even a small estimation error may propagate through the recursive structure of the MDP and inflict a large performance loss, especially for $\gamma$'s close to 1.

The rest of the paper is organized as follows. After introducing the notation used in the paper in Section 2, we describe the model-based Q-value iteration (QVI) algorithm in Subsection 2.1. We then state our main theoretical results, which are in the form of PAC sample-complexity bounds, in Section 3. Section 4 contains the detailed proofs of the results of Section 3, i.e., the sample complexity bound of QVI and a new general lower bound for RL. Finally, we conclude the paper and propose some directions for future work in Section 5.

2. Background

In this section, we review some standard concepts and definitions from the theory of Markov decision processes (MDPs). We then present the model-based Q-value iteration algorithm of Kearns & Singh (1999). We consider the standard reinforcement learning (RL) framework (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998), in which a learning agent interacts with a stochastic environment and this interaction is modeled as a discrete-time discounted MDP. A discounted MDP is a quintuple $(X, A, P, R, \gamma)$, where $X$ and $A$ are the sets of states and actions, $P$ is the state-transition distribution, $R$ is the reward function, and $\gamma \in (0,1)$ is a discount factor. We denote by $P(\cdot|x,a)$ and $r(x,a)$ the probability distribution over the next state and the immediate reward of taking action $a$ at state $x$, respectively.

Remark 1. To keep the presentation succinct, in the sequel we use the notation $Z$ for the joint state-action space $X \times A$. We also make use of the shorthand notations $z$ and $\beta$ for the state-action pair $(x,a)$ and $1/(1-\gamma)$, respectively.

Assumption 1 (MDP Regularity). We assume $Z$ and, subsequently, $X$ and $A$ are finite sets with cardinalities $N$, $|X|$ and $|A|$, respectively. We also assume that the immediate reward $r(x,a)$ lies in the interval $[0,1]$.

A mapping $\pi : X \to A$ is called a stationary and deterministic Markovian policy, or just a policy in short. Following a policy $\pi$ in an MDP means that at each time step $t$ the control action $A_t \in A$ is given by $A_t = \pi(X_t)$, where $X_t \in X$. The value and action-value functions of a policy $\pi$, denoted respectively by $V^\pi : X \to \mathbb{R}$ and $Q^\pi : Z \to \mathbb{R}$, are defined as the expected sum of discounted rewards encountered when the policy $\pi$ is executed. Given an MDP, the goal is to find a policy that attains the best possible values, $V^*(x) \triangleq \sup_\pi V^\pi(x)$ for all $x \in X$. The function $V^*$ is called the optimal value function. Similarly, the optimal action-value function is defined as $Q^*(x,a) = \sup_\pi Q^\pi(x,a)$. We say that a policy $\pi^*$ is optimal if it attains the optimal value $V^*(x)$ for all $x \in X$. The policy $\pi$ defines the state-transition kernel $P_\pi$ by $P_\pi(y|x) \triangleq P(y|x,\pi(x))$ for all $x \in X$. The right-linear operators $P^\pi\cdot$, $P\cdot$ and $P_\pi\cdot$ are then defined as $(P^\pi Q)(z) \triangleq \sum_{y\in X} P(y|z)\,Q(y,\pi(y))$ and $(PV)(z) \triangleq \sum_{y\in X} P(y|z)\,V(y)$ for all $z \in Z$, and $(P_\pi V)(x) \triangleq \sum_{y\in X} P_\pi(y|x)\,V(y)$ for all $x \in X$, respectively. Finally, $\|\cdot\|$ shall denote the supremum ($\ell_\infty$) norm, defined as $\|g\| \triangleq \max_{y\in Y}|g(y)|$, where $Y$ is a finite set and $g : Y \to \mathbb{R}$ is a real-valued function.³

³ For ease of exposition, in the sequel we drop the dependence on $z$ and $x$, e.g., writing $Q$ for $Q(z)$ and $V$ for $V(x)$, when there is no possible confusion.

2.1. Model-based Q-value Iteration (QVI)

The algorithm draws $n$ transition samples from each state-action pair $z \in Z$, for which it makes $n$ calls to the generative model.⁴ It then builds an empirical model of the transition probabilities as $\widehat P(y|z) \triangleq m(y,z)/n$, where $m(y,z)$ denotes the number of times that the state $y \in X$ has been reached from $z \in Z$. The algorithm then makes an empirical estimate of the optimal action-value function $Q^*$ by iterating some action-value function $Q_k$, with initial value $Q_0$, through the empirical Bellman optimality operator $\widehat{\mathcal T}$.⁵

⁴ The total number of calls to the generative model is given by $T = nN$.
⁵ The operator $\widehat{\mathcal T}$ is defined on the action-value function $Q$, for all $z \in Z$, by $\widehat{\mathcal T}Q(z) = r(z) + \gamma \widehat P V(z)$, where $V(x) = \max_{a\in A} Q(x,a)$ for all $x \in X$.
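The QVI procedure just described is compact enough to sketch in full. The following is an illustrative implementation, not the authors' code: the names (`qvi`, `sample`, `R`) are mine, and the generative model is abstracted as a callable that returns one next-state sample drawn from $P(\cdot|x,a)$.

```python
import numpy as np

def qvi(sample, R, gamma, n, num_iters):
    """Model-based Q-value iteration with a generative model (sketch).

    sample(x, a) draws one next state from P(.|x, a);
    R is an (S, A) array of immediate rewards in [0, 1].
    """
    S, A = R.shape
    # Empirical transition model P_hat(y|z) = m(y, z) / n.
    P_hat = np.zeros((S, A, S))
    for x in range(S):
        for a in range(A):
            for _ in range(n):
                P_hat[x, a, sample(x, a)] += 1.0
    P_hat /= n
    # Iterate the empirical Bellman optimality operator:
    # (T_hat Q)(z) = r(z) + gamma * (P_hat V)(z), with V(x) = max_a Q(x, a).
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        Q = R + gamma * (P_hat @ Q.max(axis=1))
    return Q

# Tiny MDP to exercise the sketch.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.5
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.random((S, A))
Q_hat = qvi(lambda x, a: rng.choice(S, p=P[x, a]), R, gamma, n=4000, num_iters=60)

# Exact value iteration with the true model, for comparison.
Q_star = np.zeros((S, A))
for _ in range(60):
    Q_star = R + gamma * (P @ Q_star.max(axis=1))
```

With $n = 4000$ samples per state-action pair, `Q_hat` tracks `Q_star` closely on this toy problem; Theorem 1 below quantifies exactly how large $n$ must be in general.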
3. Main Results

Our main results are in the form of PAC (probably approximately correct) bounds on the $\ell_\infty$-norm of the difference between the optimal action-value function $Q^*$ and its sample estimate:

Theorem 1 (PAC bound for model-based Q-value iteration). Let Assumption 1 hold and let $T$ be a positive integer. Then, there exist constants $c$ and $c_0$ such that for all $\varepsilon \in (0,1)$ and $\delta \in (0,1)$, a total sampling budget of
$$T = \left\lceil \frac{c\beta^3 N}{\varepsilon^2}\log\frac{c_0 N}{\delta}\right\rceil$$
suffices for the uniform approximation error $\|Q^* - Q_k\| \le \varepsilon$, w.p. (with probability) at least $1-\delta$, after only $k = \lceil \log(6\beta/\varepsilon)/\log(1/\gamma)\rceil$ iterations of the QVI algorithm.⁶ In particular, one may choose $c = 68$ and $c_0 = 12$.

The following general result provides a tight lower bound on the number of transitions $T$ needed for any RL algorithm to achieve an $\varepsilon$-optimal estimate of the action-value function w.p. $1-\delta$, under the assumption that the algorithm is $(\varepsilon,\delta,T)$-correct:

Definition 1 ($(\varepsilon,\delta,T)$-correct algorithm). Let $Q^{\mathcal A}_T$ be the estimate of $Q^*$ produced by an RL algorithm $\mathcal A$ after observing $T \ge 0$ transition samples. We say that $\mathcal A$ is $(\varepsilon,\delta,T)$-correct on the class of MDPs $\mathbb M$ if $\|Q^* - Q^{\mathcal A}_T\| \le \varepsilon$ with probability at least $1-\delta$ for all $M \in \mathbb M$.⁷

Theorem 2 (Lower bound on the sample complexity of estimating the optimal action-value function). There exist constants $\varepsilon_0$, $\delta_0$, $c_1$, $c_2$, and a class of MDPs $\mathbb M$, such that for all $\varepsilon \in (0,\varepsilon_0)$, $\delta \in (0,\delta_0/N)$, and every $(\varepsilon,\delta,T)$-correct RL algorithm $\mathcal A$ on the class of MDPs $\mathbb M$, the number of transitions needs to be at least
$$T = \left\lceil \frac{\beta^3 N}{c_1\varepsilon^2}\log\frac{N}{c_2\delta}\right\rceil.$$

⁶ For every real number $u$, $\lceil u\rceil$ is defined as the smallest integer not less than $u$.
⁷ The algorithm $\mathcal A$, unlike QVI, does not need to generate the same number of transition samples for every state-action pair, and can generate samples in an arbitrary way.

4. Analysis

In this section, we first provide the full proof of the finite-time PAC bound of QVI, reported in Theorem 1, in Subsection 4.1. We then prove Theorem 2, the RL lower bound, in Subsection 4.2. We emphasize that we omit some of the proofs of the technical lemmas due to lack of space; those results are provided in a long version of this paper.

4.1. Proof of Theorem 1

We begin by introducing some new notation. Consider a stationary policy $\pi$. We define $\mathbb V^\pi(z) \triangleq \mathbb E\big[\big|\sum_{t\ge0}\gamma^t r(Z_t) - Q^\pi(z)\big|^2\big]$ as the variance of the sum of discounted rewards starting from $z \in Z$ under the policy $\pi$. Also, define $\sigma^\pi(z) \triangleq \gamma^2\sum_{y\in Z}P^\pi(y|z)\,\big|Q^\pi(y) - P^\pi Q^\pi(z)\big|^2$ as the immediate variance at $z \in Z$, i.e., $\gamma^2\,\mathbb V_{Y\sim P^\pi(\cdot|z)}[Q^\pi(Y)]$. Also, we shall denote by $v^\pi$ and $v^*$ the immediate variances of the value functions $V^\pi$ and $V^*$, defined as $v^\pi(z) \triangleq \gamma^2\,\mathbb V_{Y\sim P(\cdot|z)}[V^\pi(Y)]$ and $v^*(z) \triangleq \gamma^2\,\mathbb V_{Y\sim P(\cdot|z)}[V^*(Y)]$ for all $z \in Z$, respectively. Further, we denote the immediate variances of $\widehat Q^\pi$, $\widehat V^\pi$ and $\widehat V^*$ by $\widehat\sigma^\pi$, $\widehat v^\pi$ and $\widehat v^*$, respectively.

We now state our first result, which indicates that $Q_k$ is very close to $\widehat Q^*$ up to an order of $O(\gamma^k)$. Therefore, to prove a bound on $\|Q^* - Q_k\|$, one only needs to bound $\|Q^* - \widehat Q^*\|$ in high probability.

Lemma 1. Let Assumption 1 hold and let $Q_0(z)$ lie in the interval $[0,\beta]$ for all $z \in Z$. Then we have $\|Q_k - \widehat Q^*\| \le \gamma^k\beta$.

In the rest of this subsection, we focus on proving a high-probability bound on $\|Q^* - \widehat Q^*\|$. One can prove a crude bound of $\tilde O(\beta^2/\sqrt n)$ on $\|Q^* - \widehat Q^*\|$ by first showing that $\|Q^* - \widehat Q^*\| \le \beta\,\|(P - \widehat P)V^*\|$ and then using Hoeffding's tail inequality (Cesa-Bianchi & Lugosi, 2006, appendix, pg. 359) to bound the random variable $\|(P - \widehat P)V^*\|$ in high probability. Here, we follow a different and more subtle approach to bound $\|Q^* - \widehat Q^*\|$, which leads to a tighter bound of $\tilde O(\beta^{1.5}/\sqrt n)$: (i) we prove in Lemma 2 component-wise upper and lower bounds on the error $Q^* - \widehat Q^*$, expressed in terms of $\gamma(I - \gamma\widehat P^{\pi^*})^{-1}(P - \widehat P)V^*$ and $\gamma(I - \gamma\widehat P^{\widehat\pi^*})^{-1}(P - \widehat P)V^*$, respectively; (ii) we make use of the sharp result of Bernstein's inequality to bound $(P - \widehat P)V^*$ in terms of the square root of the variance of $V^*$ in high probability; (iii) we prove the key result of this subsection (Lemma 5), which shows that the variance of the sum of discounted rewards satisfies a Bellman-like recursion, in which the instant reward $r(z)$ is replaced by $\sigma^\pi(z)$.
Based on this result we prove an upper bound of order $O(\beta^{1.5})$ on $(I - \gamma P^\pi)^{-1}\sqrt{\mathbb V(Q^\pi)}$ for any policy $\pi$, which, combined with the previous steps, leads to the sharp upper bound of $\tilde O(\beta^{1.5}/\sqrt n)$ on $\|Q^* - \widehat Q^*\|$.

We proceed with the following lemma, which bounds $Q^* - \widehat Q^*$ from above and below:

Lemma 2 (Component-wise bounds on $Q^* - \widehat Q^*$).
$$Q^* - \widehat Q^* \le \gamma(I - \gamma\widehat P^{\pi^*})^{-1}(P - \widehat P)V^*,\quad(1)$$
$$Q^* - \widehat Q^* \ge \gamma(I - \gamma\widehat P^{\widehat\pi^*})^{-1}(P - \widehat P)V^*.\quad(2)$$

We now concentrate on bounding the RHS (right-hand sides) of (1) and (2). We first state the following technical result, which relates $v^*$ to $\widehat\sigma^{\pi^*}$ and $\widehat\sigma^*$.

Lemma 3. Let Assumption 1 hold and $0 < \delta < 1$. Then, w.p. at least $1-\delta$:
$$v^* \le \widehat\sigma^{\pi^*} + b_v\mathbf 1,\quad(3)$$
$$v^* \le \widehat\sigma^{*} + b_v\mathbf 1,\quad(4)$$
where $b_v$ is defined as
$$b_v \triangleq \sqrt{\frac{18\gamma^4\beta^4\log\frac{3N}{\delta}}{n}} + \frac{4\gamma^2\beta^4\log\frac{3N}{\delta}}{n},$$
and $\mathbf 1$ is the function which assigns 1 to all $z \in Z$.

Based on Lemma 3, we prove the following sharp bound on $\gamma(P - \widehat P)V^*$, for which we also rely on Bernstein's inequality (Cesa-Bianchi & Lugosi, 2006, appendix, pg. 361).

Lemma 4. Let Assumption 1 hold and $0 < \delta < 1$. Define $c_{pv} \triangleq 2\log(2N/\delta)$ and $b_{pv}$ as
$$b_{pv} \triangleq \left(\frac{6(\gamma\beta)^{4/3}\log\frac{6N}{\delta}}{n}\right)^{3/4} + \frac{5\gamma\beta^2\log\frac{6N}{\delta}}{n}.$$
Then w.p. $1-\delta$ we have
$$\gamma(P - \widehat P)V^* \le \sqrt{\frac{c_{pv}\widehat\sigma^{\pi^*}}{n}} + b_{pv}\mathbf 1,\quad(5)$$
$$\gamma(P - \widehat P)V^* \ge -\sqrt{\frac{c_{pv}\widehat\sigma^{*}}{n}} - b_{pv}\mathbf 1.\quad(6)$$

Proof. For all $z \in Z$ and all $0 < \delta < 1$, Bernstein's inequality implies that w.p. at least $1-\delta$:
$$(P - \widehat P)V^*(z) \le \sqrt{\frac{2v^*(z)\log\frac1\delta}{\gamma^2 n}} + \frac{2\beta\log\frac1\delta}{3n},$$
$$(P - \widehat P)V^*(z) \ge -\sqrt{\frac{2v^*(z)\log\frac1\delta}{\gamma^2 n}} - \frac{2\beta\log\frac1\delta}{3n}.$$
We deduce (using a union bound):
$$\gamma(P - \widehat P)V^* \le \sqrt{\frac{c'_{pv}v^*}{n}} + b'_{pv}\mathbf 1,\quad(7)$$
$$\gamma(P - \widehat P)V^* \ge -\sqrt{\frac{c'_{pv}v^*}{n}} - b'_{pv}\mathbf 1,\quad(8)$$
where $c'_{pv} \triangleq 2\log(N/\delta)$ and $b'_{pv} \triangleq 2\gamma\beta\log(N/\delta)/(3n)$. The result then follows by plugging (3) and (4) into (7) and (8), respectively, and then taking a union bound. □

We now state the key lemma of this section, which shows that for any policy $\pi$ the variance $\mathbb V^\pi$ satisfies the following Bellman-like recursion. Later, we use this result, in Lemma 6, to bound $(I - \gamma P^\pi)^{-1}\sigma^\pi$.

Lemma 5. $\mathbb V^\pi$ satisfies the Bellman equation
$$\mathbb V^\pi = \sigma^\pi + \gamma^2 P^\pi\mathbb V^\pi.$$

Proof. For all $z \in Z$ we have
$$\begin{aligned}
\mathbb V^\pi(z) &= \mathbb E\Big[\Big(\textstyle\sum_{t\ge0}\gamma^t r(Z_t) - Q^\pi(z)\Big)^2\Big]\\
&= \mathbb E_{Z_1\sim P^\pi(\cdot|z)}\,\mathbb E\Big[\Big(\textstyle\sum_{t\ge1}\gamma^t r(Z_t) - \gamma Q^\pi(Z_1) - \big(Q^\pi(z) - r(z) - \gamma Q^\pi(Z_1)\big)\Big)^2\,\Big|\,Z_1\Big]\\
&= \gamma^2\,\mathbb E_{Z_1\sim P^\pi(\cdot|z)}\,\mathbb E\Big[\Big(\textstyle\sum_{t\ge1}\gamma^{t-1}r(Z_t) - Q^\pi(Z_1)\Big)^2\,\Big|\,Z_1\Big]\\
&\quad - 2\,\mathbb E_{Z_1\sim P^\pi(\cdot|z)}\Big[\big(Q^\pi(z) - r(z) - \gamma Q^\pi(Z_1)\big)\,\mathbb E\Big[\textstyle\sum_{t\ge1}\gamma^t r(Z_t) - \gamma Q^\pi(Z_1)\,\Big|\,Z_1\Big]\Big]\\
&\quad + \mathbb E_{Z_1\sim P^\pi(\cdot|z)}\Big[\big|Q^\pi(z) - r(z) - \gamma Q^\pi(Z_1)\big|^2\Big]\\
&= \gamma^2\sum_{y\in Z}P^\pi(y|z)\,\mathbb V^\pi(y) + \sigma^\pi(z),
\end{aligned}$$
in which we rely on $\mathbb E\big[\sum_{t\ge1}\gamma^t r(Z_t) - \gamma Q^\pi(Z_1)\,\big|\,Z_1\big] = 0$, so that the cross term vanishes, while the last term equals $\gamma^2\,\mathbb V_{Z_1\sim P^\pi(\cdot|z)}[Q^\pi(Z_1)] = \sigma^\pi(z)$. □

Based on Lemma 5, one can prove the following result on the immediate variance.

Lemma 6.
$$\big\|(I - \gamma^2 P^\pi)^{-1}\sigma^\pi\big\| \le \beta^2,\quad(9)$$
$$\big\|(I - \gamma P^\pi)^{-1}\sqrt{\sigma^\pi}\big\| \le 2\log(2)\,\beta^{1.5}.\quad(10)$$
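Since Lemma 5 gives $\mathbb V^\pi = (I - \gamma^2 P^\pi)^{-1}\sigma^\pi$ in the finite case, the bounds of Lemma 6 are easy to sanity-check numerically. The sketch below does so on a randomly generated transition kernel; the construction is illustrative and is not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
nz, gamma = 20, 0.9          # |Z| under a fixed policy pi, discount factor
beta = 1.0 / (1.0 - gamma)

P_pi = rng.dirichlet(np.ones(nz), size=nz)   # stochastic matrix P^pi on Z
r = rng.random(nz)                            # rewards in [0, 1]
I = np.eye(nz)

# Q^pi is the fixed point of Q = r + gamma * P^pi Q.
Q = np.linalg.solve(I - gamma * P_pi, r)

# Immediate variance sigma^pi(z) = gamma^2 * Var_{Y ~ P^pi(.|z)}[Q^pi(Y)];
# clip tiny negative floating-point residue before taking square roots.
sigma = np.clip(gamma ** 2 * (P_pi @ Q ** 2 - (P_pi @ Q) ** 2), 0.0, None)

# Lemma 5: the return variance solves V = sigma + gamma^2 * P^pi V.
V_ret = np.linalg.solve(I - gamma ** 2 * P_pi, sigma)

# Lemma 6, Eq. (9):  ||(I - gamma^2 P^pi)^{-1} sigma|| <= beta^2.
assert np.abs(V_ret).max() <= beta ** 2
# Lemma 6, Eq. (10): ||(I - gamma P^pi)^{-1} sqrt(sigma)|| <= 2 log(2) beta^1.5.
lhs10 = np.linalg.solve(I - gamma * P_pi, np.sqrt(sigma))
assert np.abs(lhs10).max() <= 2 * np.log(2) * beta ** 1.5
```

Both assertions hold with a wide margin here; the point of Lemma 6 is that they hold for every MDP satisfying Assumption 1.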
Proof. The first inequality follows from Lemma 5 by solving the Bellman recursion in terms of $\mathbb V^\pi$, which gives $\mathbb V^\pi = (I - \gamma^2 P^\pi)^{-1}\sigma^\pi$, and taking the sup-norm over both sides of the resulting equation. In the case of Eq. (10) we have⁸
$$(I - \gamma P^\pi)^{-1}\sqrt{\sigma^\pi} = \sum_{k\ge0}(\gamma P^\pi)^k\sqrt{\sigma^\pi} = \sum_{l\ge0}(\gamma P^\pi)^{tl}\sum_{j=0}^{t-1}(\gamma P^\pi)^j\sqrt{\sigma^\pi} \le \sum_{l\ge0}(\gamma^t)^l\Big\|\sum_{j=0}^{t-1}(\gamma P^\pi)^j\sqrt{\sigma^\pi}\Big\|\mathbf 1 = \frac{1}{1-\gamma^t}\Big\|\sum_{j=0}^{t-1}(\gamma P^\pi)^j\sqrt{\sigma^\pi}\Big\|\mathbf 1,\quad(11)$$
in which we write $k = tl + j$ with $t$ a positive integer; the inequality holds component-wise since $P^\pi$ is a stochastic operator. We now bound $\sum_{j=0}^{t-1}(\gamma P^\pi)^j\sqrt{\sigma^\pi}$ by making use of Jensen's inequality as well as the Cauchy–Schwarz inequality:
$$\sum_{j=0}^{t-1}(\gamma P^\pi)^j\sqrt{\sigma^\pi} \le \sum_{j=0}^{t-1}\gamma^j\sqrt{(P^\pi)^j\sigma^\pi} \le \sqrt t\,\sqrt{\sum_{j=0}^{t-1}(\gamma^2 P^\pi)^j\sigma^\pi} \le \sqrt t\,\sqrt{(I - \gamma^2 P^\pi)^{-1}\sigma^\pi} \le \beta\sqrt t\,\mathbf 1,\quad(12)$$
where in the last step we rely on (9). The result then follows by plugging (12) into (11) and optimizing the bound in terms of $t$ to achieve the best dependence on $\beta$. □

We now make use of Lemma 6 and Lemma 4 to bound $\|Q^* - \widehat Q^*\|$ in high probability.

Lemma 7. Let Assumption 1 hold. Then, for any $0 < \delta < 1$, w.p. at least $1-\delta$ we have $\|Q^* - \widehat Q^*\| \le b$.

Proof. By incorporating the results of Lemma 4 and Lemma 6 into Lemma 2, we deduce that
$$Q^* - \widehat Q^* \le b\mathbf 1, \qquad Q^* - \widehat Q^* \ge -b\mathbf 1,$$
w.p. $1-\delta$, where the scalar $b$ is given by
$$b = \sqrt{\frac{17\beta^3\log\frac{2N}{\delta}}{n}} + \left(\frac{6(\gamma\beta^2)^{4/3}\log\frac{6N}{\delta}}{n}\right)^{3/4} + \frac{5\gamma\beta^3\log\frac{6N}{\delta}}{n}.$$
The result then follows by combining these two bounds and taking the $\ell_\infty$ norm. □

Proof of Theorem 1. We combine the bounds of Lemma 7 and Lemma 1 in order to bound $\|Q^* - Q_k\|$ in high probability. We then solve the resulting bound w.r.t. $n$ and $k$.⁹ □
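To get a feel for the quantities in Theorem 1, the following sketch evaluates the sampling budget $T$ and iteration count $k$ for one arbitrary choice of problem parameters, using the constants $c = 68$ and $c_0 = 12$ suggested by the theorem:

```python
import math

def qvi_budget(N, gamma, eps, delta, c=68, c0=12):
    """Sampling budget T and iteration count k from Theorem 1."""
    beta = 1.0 / (1.0 - gamma)
    T = math.ceil(c * beta ** 3 * N / eps ** 2 * math.log(c0 * N / delta))
    k = math.ceil(math.log(6 * beta / eps) / math.log(1.0 / gamma))
    return T, k

T, k = qvi_budget(N=100, gamma=0.9, eps=0.1, delta=0.05)
n_per_pair = T // 100        # QVI spreads the budget evenly: T = nN
# Here k = ceil(log(600) / log(1/0.9)) = 61 iterations suffice.
```

Note how the budget is dominated by the $\beta^3 = 1/(1-\gamma)^3$ factor, while the number of iterations grows only logarithmically in $\beta/\varepsilon$.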
4.2. Proof of the Lower Bound

In this section, we provide the proof of Theorem 2. In our analysis, we rely on the likelihood-ratio method, which has previously been used to prove a lower bound for multi-armed bandits (Mannor & Tsitsiklis, 2004), and extend this approach to RL and MDPs. We begin by defining the class of MDPs for which the proposed lower bound will be obtained (see Figure 1). We define the class of MDPs $\mathbb M$ as the set of all MDPs with a state-action space of cardinality $N = 3KL$, where $K$ and $L$ are positive integers. Also, we assume that for all $M \in \mathbb M$, the state space $X$ consists of three smaller sets $S$, $Y^1$ and $Y^2$. The set $S$ includes $K$ states, each of which is equipped with the set of actions $A = \{a_1, a_2, \dots, a_L\}$, whereas the states in $Y^1$ and $Y^2$ are single-action states. By taking an action $a \in A$ from any state $x \in S$, we move to the next state $y(z) \in Y^1$ with probability 1, where $z = (x,a)$. The transition from $Y^1$ is characterized by the probability $p_M$ of moving from every $y(z) \in Y^1$ to itself, and the probability $1 - p_M$ of moving to the corresponding $y(z) \in Y^2$.¹⁰ Further, for all $M \in \mathbb M$, $Y^2$
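The three-layer family above is straightforward to instantiate. The sketch below builds the transition tensor for one member of the class; the indexing scheme is mine, and since the excerpt cuts off before specifying the dynamics of $Y^2$, the $Y^2$ states are assumed absorbing here purely for illustration:

```python
import numpy as np

def hard_mdp_transitions(K, L, p):
    """Transition tensor for one member of the lower-bound class M (sketch).

    State indexing (mine): S = 0..K-1, Y1 = K..K+KL-1,
    Y2 = K+KL..K+2KL-1, one (y1, y2) pair per z = (x, a).
    Single-action states are modeled by repeating one row
    across all L action slots.
    """
    n_states = K + 2 * K * L
    P = np.zeros((n_states, L, n_states))
    for x in range(K):
        for a in range(L):
            y1 = K + x * L + a             # y(z) in Y1 for z = (x, a)
            y2 = K + K * L + x * L + a     # corresponding state in Y2
            P[x, a, y1] = 1.0              # S -> Y1 with probability 1
            P[y1, :, y1] = p               # Y1 self-loop with probability p
            P[y1, :, y2] = 1.0 - p         # Y1 -> Y2 with probability 1 - p
            P[y2, :, y2] = 1.0             # Y2 assumed absorbing (illustration)
    return P

P = hard_mdp_transitions(K=2, L=3, p=0.9)
assert P.shape == (14, 3, 14)              # K + 2KL = 2 + 12 states
assert np.allclose(P.sum(axis=2), 1.0)     # every row is a distribution
```

The self-loop probability `p` is the parameter the likelihood-ratio argument perturbs across members of $\mathbb M$: when $p$ is close to $\gamma$-dependent thresholds, small errors in estimating $p$ translate into large errors in the value of the corresponding $y(z)$, which is what forces the $\beta^3$ dependence.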