On the Sample Complexity of Reinforcement Learning with a Generative Model

Mohammad Gheshlaghi Azar [email protected]
Department of Biophysics, Radboud University Nijmegen, 6525 EZ Nijmegen, The Netherlands

Rémi Munos [email protected]
INRIA Lille, SequeL Project, 40 avenue Halley, 59650 Villeneuve d'Ascq, France

Hilbert J. Kappen [email protected]
Department of Biophysics, Radboud University Nijmegen, 6525 EZ Nijmegen, The Netherlands

Abstract

We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample complexity of the model-based value iteration algorithm in the presence of a generative model, which indicates that for an MDP with $N$ state-action pairs and discount factor $\gamma \in [0,1)$, only $O\big(N \log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$ samples are required to find an $\varepsilon$-optimal estimate of the action-value function with probability $1-\delta$. We also prove a matching lower bound of $\Theta\big(N \log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$ on the sample complexity of estimating the optimal action-value function by any RL algorithm. To the best of our knowledge, this is the first matching result on the sample complexity of estimating the optimal (action-)value function, in which the upper bound matches the lower bound of RL in terms of $N$, $\varepsilon$, $\delta$ and $1/(1-\gamma)$. Also, both our lower bound and our upper bound significantly improve on the state of the art in terms of $1/(1-\gamma)$.

1. Introduction

Model-based value iteration (VI) (Kearns & Singh, 1999; Buşoniu et al., 2010) is a well-known reinforcement learning (RL) (Szepesvári, 2010; Sutton & Barto, 1998) algorithm which relies on an empirical estimate of the state-transition distributions to estimate the optimal (action-)value function through the Bellman recursion. In finite state-action problems, it has been shown that an action-value based variant of VI, model-based Q-value iteration (QVI), finds an $\varepsilon$-optimal estimate of the action-value function with high probability using only $T = \widetilde{O}\big(N/((1-\gamma)^4\varepsilon^2)\big)$ samples (Kearns & Singh, 1999; Kakade, 2004, chap. 9.1), where $N$ and $\gamma$ denote the size of the state-action space and the discount factor, respectively.[1] Although this bound matches the best existing upper bound on the sample complexity of estimating the action-value function (Azar et al., 2011a), it has not been clear, so far, whether this is a tight bound on the performance of QVI or whether it can be improved by a more careful analysis of the QVI algorithm. This is mainly due to the fact that there is a gap of order $1/(1-\gamma)^2$ between the upper bound of QVI and the state-of-the-art lower bound, which is of order $\Omega\big(N/((1-\gamma)^2\varepsilon^2)\big)$ (Azar et al., 2011b).

In this paper, we focus on problems which are formulated as finite state-action discounted infinite-horizon Markov decision processes (MDPs), and prove a new tight bound of $O\big(N\log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$ on the sample complexity of the QVI algorithm. The new upper bound improves on the existing bound of QVI by an order of $1/(1-\gamma)$.[2] We also present a new matching lower bound of $\Theta\big(N\log(N/\delta)/((1-\gamma)^3\varepsilon^2)\big)$, which also improves on the best existing lower bound of RL by an order of $1/(1-\gamma)$.

[1] The notation $g = \widetilde{O}(f)$ implies that there are constants $c_1$ and $c_2$ such that $g \le c_1 f \log^{c_2}(f)$.
[2] In this paper, to keep the presentation succinct, we only consider the value iteration algorithm. However, one can prove upper bounds of the same order for other model-based methods, such as policy iteration and linear programming, using the results of this paper.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).
The new results, which close the above-mentioned gap between the lower and the upper bound of RL, guarantee that no learning method, given a generative model of the MDP, can be significantly more efficient than QVI in terms of the sample complexity of estimating the action-value function.

The main idea in improving the upper bound of QVI is to express the performance loss of QVI in terms of the variance of the sum of discounted rewards, as opposed to the maximum $V_{\max} = R_{\max}/(1-\gamma)$ used in previous results. For this we make use of Bernstein's concentration inequality (Cesa-Bianchi & Lugosi, 2006, appendix, pg. 361), which bounds the estimation error in terms of the variance of the value function. We also rely on the fact that the variance of the sum of discounted rewards, like the expected value of the sum (the value function), satisfies a Bellman-like equation, in which the variance of the value function plays the role of the instant reward (Munos & Moore, 1999; Sobel, 1982), to derive a sharp bound on the variance of the value function. In the case of the lower bound, we improve on the result of Azar et al. (2011b) by adding some structure to the class of MDPs for which we prove the lower bound: in the new model, there is a high probability of transition from every intermediate state to itself. This adds to the difficulty of estimating the value function, since even a small estimation error may propagate throughout the recursive structure of the MDP and inflict a large performance loss, especially for $\gamma$ close to 1.

The rest of the paper is organized as follows. After introducing the notation used in the paper in Section 2, we describe the model-based Q-value iteration (QVI) algorithm in Subsection 2.1. We then state our main theoretical results, which are in the form of PAC sample complexity bounds, in Section 3. Section 4 contains the detailed proofs of the results of Section 3, i.e., the sample complexity bound of QVI and a new general lower bound for RL. Finally, we conclude the paper and propose some directions for future work in Section 5.

2. Background

In this section, we review some standard concepts and definitions from the theory of Markov decision processes (MDPs). We then present the model-based Q-value iteration algorithm of Kearns & Singh (1999). We consider the standard reinforcement learning (RL) framework (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998) in which a learning agent interacts with a stochastic environment and this interaction is modeled as a discrete-time discounted MDP. A discounted MDP is a quintuple $(\mathcal{X}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{X}$ and $\mathcal{A}$ are the sets of states and actions, $P$ is the state-transition distribution, $R$ is the reward function, and $\gamma \in (0,1)$ is a discount factor. We denote by $P(\cdot|x,a)$ and $r(x,a)$ the probability distribution over the next state and the immediate reward of taking action $a$ at state $x$, respectively.

Remark 1. To keep the presentation succinct, in the sequel we use the notation $\mathcal{Z}$ for the joint state-action space $\mathcal{X} \times \mathcal{A}$. We also make use of the shorthand notations $z$ and $\beta$ for the state-action pair $(x,a)$ and $1/(1-\gamma)$, respectively.

Assumption 1 (MDP Regularity). We assume that $\mathcal{Z}$ and, subsequently, $\mathcal{X}$ and $\mathcal{A}$ are finite sets with cardinalities $N$, $|\mathcal{X}|$ and $|\mathcal{A}|$, respectively. We also assume that the immediate reward $r(x,a)$ is taken from the interval $[0,1]$.

A mapping $\pi : \mathcal{X} \to \mathcal{A}$ is called a stationary and deterministic Markovian policy, or just a policy for short. Following a policy $\pi$ in an MDP means that at each time step $t$ the control action $A_t \in \mathcal{A}$ is given by $A_t = \pi(X_t)$, where $X_t \in \mathcal{X}$. The value and the action-value functions of a policy $\pi$, denoted respectively by $V^\pi : \mathcal{X} \to \mathbb{R}$ and $Q^\pi : \mathcal{Z} \to \mathbb{R}$, are defined as the expected sum of discounted rewards that are encountered when the policy $\pi$ is executed. Given an MDP, the goal is to find a policy that attains the best possible values, $V^*(x) \triangleq \sup_\pi V^\pi(x)$, for all $x \in \mathcal{X}$. The function $V^*$ is called the optimal value function. Similarly, the optimal action-value function is defined as $Q^*(x,a) = \sup_\pi Q^\pi(x,a)$. We say that a policy $\pi^*$ is optimal if it attains the optimal value $V^*(x)$ for all $x \in \mathcal{X}$. The policy $\pi$ defines the state-transition kernel $P_\pi$ as $P_\pi(y|x) \triangleq P(y|x,\pi(x))$ for all $x \in \mathcal{X}$. The right-linear operators $P^\pi\cdot$, $P\cdot$ and $P_\pi\cdot$ are then defined as $(P^\pi Q)(z) \triangleq \sum_{y\in\mathcal{X}} P(y|z)\, Q(y,\pi(y))$ and $(PV)(z) \triangleq \sum_{y\in\mathcal{X}} P(y|z)\, V(y)$ for all $z\in\mathcal{Z}$, and $(P_\pi V)(x) \triangleq \sum_{y\in\mathcal{X}} P_\pi(y|x)\, V(y)$ for all $x\in\mathcal{X}$, respectively. Finally, $\|\cdot\|$ shall denote the supremum ($\ell_\infty$) norm, defined as $\|g\| \triangleq \max_{y\in\mathcal{Y}} |g(y)|$, where $\mathcal{Y}$ is a finite set and $g : \mathcal{Y} \to \mathbb{R}$ is a real-valued function.[3]

2.1. Model-based Q-value Iteration (QVI)

The algorithm draws $n$ transition samples from each state-action pair $z \in \mathcal{Z}$, for which it makes $n$ calls to the generative model.[4] It then builds an empirical model of the transition probabilities as $\widehat{P}(y|z) \triangleq m(y,z)/n$, where $m(y,z)$ denotes the number of times that the state $y \in \mathcal{X}$ has been reached from $z \in \mathcal{Z}$. The algorithm then makes an empirical estimate of the optimal action-value function $Q^*$ by iterating an action-value function $Q_k$, with initial value $Q_0$, through the empirical Bellman optimality operator $\widehat{\mathcal{T}}$.[5]

[3] For ease of exposition, in the sequel we remove the dependence on $z$ and $x$, e.g., writing $Q$ for $Q(z)$ and $V$ for $V(x)$, when there is no possible confusion.
[4] The total number of calls to the generative model is given by $T = nN$.
[5] The operator $\widehat{\mathcal{T}}$ is defined on the action-value function $Q$, for all $z \in \mathcal{Z}$, by $\widehat{\mathcal{T}}Q(z) = r(z) + \gamma \widehat{P}V(z)$, where $V(x) = \max_{a\in\mathcal{A}} Q(x,a)$ for all $x \in \mathcal{X}$.
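To make the procedure concrete, the following is a minimal Python sketch of model-based QVI under a generative model. The interface `generative_model(x, a)`, the array layout, and the fixed iteration count are illustrative assumptions and not part of the paper; the paper only specifies that $n$ next-state samples are drawn per state-action pair and that $Q_k$ is obtained by iterating the empirical Bellman optimality operator $\widehat{\mathcal{T}}$.

```python
import numpy as np

def qvi(generative_model, rewards, num_states, num_actions, gamma, n, num_iters):
    """Sketch of model-based Q-value iteration (QVI) with a generative model.

    generative_model(x, a) -> sampled next state (hypothetical interface).
    rewards: array of shape (num_states, num_actions), r(x, a) in [0, 1].
    n: next-state samples per state-action pair, so T = n * N calls in total.
    """
    # Empirical transition model P_hat(y | x, a) = m(y, (x, a)) / n.
    P_hat = np.zeros((num_states, num_actions, num_states))
    for x in range(num_states):
        for a in range(num_actions):
            for _ in range(n):
                y = generative_model(x, a)
                P_hat[x, a, y] += 1.0
    P_hat /= n

    # Iterate the empirical Bellman optimality operator:
    # (T_hat Q)(x, a) = r(x, a) + gamma * sum_y P_hat(y | x, a) * max_b Q(y, b).
    Q = np.zeros((num_states, num_actions))   # Q_0, anywhere in [0, beta]
    for _ in range(num_iters):
        V = Q.max(axis=1)
        Q = rewards + gamma * P_hat @ V
    return Q
```

With the per-pair budget $n = T/N$ and the iteration count $k$ given in Theorem 1 below, this is the estimator whose error the remainder of the paper analyzes.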
3. Main Results

Our main results are in the form of PAC (probably approximately correct) bounds on the $\ell_\infty$-norm of the difference between the optimal action-value function $Q^*$ and its sample estimate:

Theorem 1 (PAC bound for model-based Q-value iteration). Let Assumption 1 hold and let $T$ be a positive integer. Then, there exist some constants $c$ and $c_0$ such that for all $\varepsilon \in (0,1)$ and $\delta \in (0,1)$ a total sampling budget of
$$T = \Big\lceil \frac{c\beta^3 N}{\varepsilon^2} \log\frac{c_0 N}{\delta} \Big\rceil$$
suffices for the uniform approximation error $\|Q^* - Q_k\| \le \varepsilon$, w.p. (with probability) at least $1-\delta$, after only $k = \lceil \log(6\beta/\varepsilon)/\log(1/\gamma) \rceil$ iterations of the QVI algorithm.[6] In particular, one may choose $c = 68$ and $c_0 = 12$.

The following general result provides a tight lower bound on the number of transitions $T$ for every RL algorithm to achieve an $\varepsilon$-optimal estimate of the action-value function w.p. $1-\delta$, under the assumption that the algorithm is $(\varepsilon,\delta,T)$-correct:

Definition 1 ($(\varepsilon,\delta,T)$-correct algorithm). Let $Q^{\mathcal{A}}_T$ be the estimate of $Q^*$ produced by an RL algorithm $\mathcal{A}$ after observing $T \ge 0$ transition samples. We say that $\mathcal{A}$ is $(\varepsilon,\delta,T)$-correct on the class of MDPs $\mathbb{M}$ if $\|Q^* - Q^{\mathcal{A}}_T\| \le \varepsilon$ with probability at least $1-\delta$ for all $M \in \mathbb{M}$.[7]

Theorem 2 (Lower bound on the sample complexity of estimating the optimal action-value function). There exist some constants $\varepsilon_0$, $\delta_0$, $c_1$, $c_2$, and a class of MDPs $\mathbb{M}$, such that for all $\varepsilon \in (0,\varepsilon_0)$, $\delta \in (0,\delta_0/N)$, and every $(\varepsilon,\delta,T)$-correct RL algorithm $\mathcal{A}$ on the class of MDPs $\mathbb{M}$, the number of transitions needs to be at least
$$T = \Big\lceil \frac{\beta^3 N}{c_1\varepsilon^2} \log\frac{N}{c_2\delta} \Big\rceil.$$

[6] For every real number $u$, $\lceil u \rceil$ is defined as the smallest integer not less than $u$.
[7] The algorithm $\mathcal{A}$, unlike QVI, does not need to generate the same number of transition samples for every state-action pair and can generate samples in an arbitrary way.
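As a purely numerical reading of the two theorems, the sketch below plugs example problem sizes into the stated bounds; the chosen values of $N$, $\gamma$, $\varepsilon$ and $\delta$ are arbitrary illustrations, and the constants are the ones quoted in Theorem 1.

```python
import math

def qvi_budget(N, gamma, eps, delta, c=68, c0=12):
    """Sampling budget T, per-pair sample size n, and iteration count k from Theorem 1."""
    beta = 1.0 / (1.0 - gamma)
    T = math.ceil(c * beta**3 * N / eps**2 * math.log(c0 * N / delta))
    k = math.ceil(math.log(6 * beta / eps) / math.log(1.0 / gamma))
    n = math.ceil(T / N)      # samples drawn from each state-action pair
    return T, n, k

# Example: N = 1000 state-action pairs, gamma = 0.9, eps = 0.1, delta = 0.05.
print(qvi_budget(1000, 0.9, 0.1, 0.05))
```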

4. Analysis

In this section, we first provide the full proof of the finite-time PAC bound of QVI, reported in Theorem 1, in Subsection 4.1. We then prove Theorem 2, the RL lower bound, in Subsection 4.2. We also emphasize that we omit some of the proofs of the technical lemmas due to lack of space; those proofs are provided in a long version of this paper.

4.1. Proof of Theorem 1

We begin by introducing some new notation. Consider a stationary policy $\pi$. We define $\mathbb{V}^\pi(z) \triangleq \mathbb{E}\big[\big|\sum_{t\ge 0}\gamma^t r(Z_t) - Q^\pi(z)\big|^2\big]$ as the variance of the sum of discounted rewards starting from $z\in\mathcal{Z}$ under the policy $\pi$. Also, define $\sigma^\pi(z) \triangleq \gamma^2 \sum_{y\in\mathcal{Z}} P^\pi(y|z)\,\big|Q^\pi(y) - (P^\pi Q^\pi)(z)\big|^2$ as the immediate variance at $z\in\mathcal{Z}$, i.e., $\gamma^2\,\mathbb{V}_{Y\sim P^\pi(\cdot|z)}[Q^\pi(Y)]$. Also, we shall denote by $v^\pi$ and $v^*$ the immediate variances of the value functions $V^\pi$ and $V^*$, defined as $v^\pi(z) \triangleq \gamma^2\,\mathbb{V}_{Y\sim P(\cdot|z)}[V^\pi(Y)]$ and $v^*(z) \triangleq \gamma^2\,\mathbb{V}_{Y\sim P(\cdot|z)}[V^*(Y)]$, for all $z\in\mathcal{Z}$, respectively. Further, we denote the immediate variances of the action-value function $\widehat{Q}^\pi$ and of $\widehat{V}^\pi$ and $\widehat{V}^*$ (defined under the empirical model $\widehat{P}$) by $\widehat{\sigma}^\pi$, $\widehat{v}^\pi$ and $\widehat{v}^*$, respectively.

We now state our first result, which indicates that $Q_k$ is close to $\widehat{Q}^*$ up to an order of $O(\gamma^k)$. Therefore, to bound $\|Q^* - Q_k\|$, one only needs to bound $\|Q^* - \widehat{Q}^*\|$ in high probability.

Lemma 1. Let Assumption 1 hold and let $Q_0(z)$ be in the interval $[0,\beta]$ for all $z\in\mathcal{Z}$. Then we have
$$\|Q_k - \widehat{Q}^*\| \le \gamma^k \beta.$$
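Lemma 1 is the standard consequence of the $\gamma$-contraction of the (empirical) Bellman optimality operator, together with $\|Q_0 - \widehat{Q}^*\| \le \beta$. A short numerical sanity check on an arbitrary randomly generated MDP, which stands in for the empirical model $\widehat{P}$ (the random generator is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 20, 4, 0.9
beta = 1.0 / (1.0 - gamma)

P = rng.dirichlet(np.ones(S), size=(S, A))   # plays the role of P_hat(y | x, a)
r = rng.uniform(0.0, 1.0, size=(S, A))       # rewards in [0, 1]

# Accurate fixed point Q_hat_star via a long run of value iteration.
Q_star = np.zeros((S, A))
for _ in range(5000):
    Q_star = r + gamma * P @ Q_star.max(axis=1)

# Lemma 1: starting from Q_0 = 0 (which lies in [0, beta]), ||Q_k - Q_hat_star|| <= gamma^k * beta.
Q = np.zeros((S, A))
for k in range(1, 60):
    Q = r + gamma * P @ Q.max(axis=1)
    assert np.max(np.abs(Q - Q_star)) <= gamma**k * beta + 1e-8
```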

In the rest of this subsection, we focus on proving a high-probability bound on $\|Q^* - \widehat{Q}^*\|$. One can prove a crude bound of $\widetilde{O}(\beta^2/\sqrt{n})$ on $\|Q^* - \widehat{Q}^*\|$ by first proving that $\|Q^* - \widehat{Q}^*\| \le \beta \|(P - \widehat{P})V^*\|$ and then using Hoeffding's tail inequality (Cesa-Bianchi & Lugosi, 2006, appendix, pg. 359) to bound $\|(P - \widehat{P})V^*\|$ in high probability. Here, we follow a different and more subtle approach to bound $\|Q^* - \widehat{Q}^*\|$, which leads to a tight bound of $\widetilde{O}(\beta^{1.5}/\sqrt{n})$: (i) we prove in Lemma 2 component-wise upper and lower bounds on the error $Q^* - \widehat{Q}^*$, which are expressed in terms of $(I - \gamma\widehat{P}^{\pi^*})^{-1}\big[(P - \widehat{P})V^*\big]$ and $(I - \gamma\widehat{P}^{\widehat{\pi}^*})^{-1}\big[(P - \widehat{P})V^*\big]$, respectively; (ii) we make use of the sharp result of Bernstein's inequality to bound $(P - \widehat{P})V^*$ in terms of the square root of the variance of $V^*$ in high probability; (iii) we prove the key result of this subsection (Lemma 5), which shows that the variance of the sum of discounted rewards satisfies a Bellman-like recursion, in which the instant reward $r(z)$ is replaced by $\sigma^\pi(z)$. Based on this result we prove an upper bound of order $O(\beta^{1.5})$ on $\|(I - \gamma P^\pi)^{-1}\sqrt{\mathbb{V}^\pi}\|$ for any policy $\pi$, which, combined with the previous steps, leads to the sharp upper bound of $\widetilde{O}(\beta^{1.5}/\sqrt{n})$ on $\|Q^* - \widehat{Q}^*\|$.

We proceed with the following lemma, which bounds $Q^* - \widehat{Q}^*$ from above and below:

Lemma 2 (Component-wise bounds on $Q^* - \widehat{Q}^*$).
$$Q^* - \widehat{Q}^* \le \gamma\big(I - \gamma\widehat{P}^{\pi^*}\big)^{-1}\big(P - \widehat{P}\big)V^*, \qquad (1)$$
$$Q^* - \widehat{Q}^* \ge \gamma\big(I - \gamma\widehat{P}^{\widehat{\pi}^*}\big)^{-1}\big(P - \widehat{P}\big)V^*. \qquad (2)$$

We now concentrate on bounding the RHS (right-hand sides) of (1) and (2). We first state the following technical result, which relates $v^*$ to $\widehat{\sigma}^{\pi^*}$ and $\widehat{\sigma}^*$.

Lemma 3. Let Assumption 1 hold and $0 < \delta < 1$. Then, w.p. at least $1-\delta$:
$$v^* \le \widehat{\sigma}^{\pi^*} + b_v \mathbf{1}, \qquad (3)$$
$$v^* \le \widehat{\sigma}^{*} + b_v \mathbf{1}, \qquad (4)$$
where $b_v$ is defined as
$$b_v \triangleq \sqrt{\frac{18\gamma^4\beta^4 \log\frac{3N}{\delta}}{n}} + \frac{4\gamma^2\beta^4 \log\frac{3N}{\delta}}{n},$$
and $\mathbf{1}$ is the function which assigns 1 to every $z\in\mathcal{Z}$.

Based on Lemma 3 we prove the following sharp bound on $\gamma(P - \widehat{P})V^*$, for which we also rely on Bernstein's inequality (Cesa-Bianchi & Lugosi, 2006, appendix, pg. 361).

Lemma 4. Let Assumption 1 hold and $0 < \delta < 1$. Define $c_{pv} \triangleq 2\log(2N/\delta)$ and $b_{pv}$ as
$$b_{pv} \triangleq \frac{6(\gamma\beta)^{4/3}\big(\log\frac{6N}{\delta}\big)^{3/4}}{n^{3/4}} + \frac{5\gamma\beta^2 \log\frac{6N}{\delta}}{n}.$$
Then w.p. $1-\delta$ we have
$$\gamma\big(P - \widehat{P}\big)V^* \le \sqrt{\frac{c_{pv}\,\widehat{\sigma}^{\pi^*}}{n}} + b_{pv}\mathbf{1}, \qquad (5)$$
$$\gamma\big(P - \widehat{P}\big)V^* \ge -\sqrt{\frac{c_{pv}\,\widehat{\sigma}^{*}}{n}} - b_{pv}\mathbf{1}. \qquad (6)$$

Proof. For all $z\in\mathcal{Z}$ and all $0<\delta<1$, Bernstein's inequality implies that w.p. at least $1-\delta$:
$$\big(P - \widehat{P}\big)V^*(z) \le \sqrt{\frac{2 v^*(z)\log\frac{1}{\delta}}{\gamma^2 n}} + \frac{2\beta\log\frac{1}{\delta}}{3n},$$
$$\big(P - \widehat{P}\big)V^*(z) \ge -\sqrt{\frac{2 v^*(z)\log\frac{1}{\delta}}{\gamma^2 n}} - \frac{2\beta\log\frac{1}{\delta}}{3n}.$$
We deduce (using a union bound):
$$\gamma\big(P - \widehat{P}\big)V^* \le \sqrt{\frac{c'_{pv}\,v^*}{n}} + b'_{pv}\mathbf{1}, \qquad (7)$$
$$\gamma\big(P - \widehat{P}\big)V^* \ge -\sqrt{\frac{c'_{pv}\,v^*}{n}} - b'_{pv}\mathbf{1}, \qquad (8)$$
where $c'_{pv} \triangleq 2\log(2N/\delta)$ and $b'_{pv} \triangleq 2\gamma\beta\log(2N/\delta)/(3n)$. The result then follows by plugging (3) and (4) into (7) and (8), respectively, and then taking a union bound.
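The reason Bernstein's inequality (rather than Hoeffding's) appears in Lemma 4 is that Hoeffding only sees the range $\beta$ of $V^*$, whereas Bernstein sees its next-state variance $v^*(z)$, which is what ultimately buys the extra $\sqrt{\beta}$ in the final bound. A minimal numerical comparison of the two deviation bounds for the empirical mean of $n$ i.i.d. bounded variables (all numbers below are illustrative):

```python
import math

def hoeffding_halfwidth(value_range, n, delta):
    # |mean_hat - mean| <= range * sqrt(log(2/delta) / (2n)) w.p. at least 1 - delta.
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def bernstein_halfwidth(variance, value_range, n, delta):
    # |mean_hat - mean| <= sqrt(2 var log(2/delta) / n) + 2 range log(2/delta) / (3n).
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * variance * log_term / n) + 2.0 * value_range * log_term / (3.0 * n)

# A variable bounded by beta = 100 whose variance is far below beta^2:
beta, variance, n, delta = 100.0, 25.0, 10_000, 0.05
print(hoeffding_halfwidth(beta, n, delta))            # driven by the range beta
print(bernstein_halfwidth(variance, beta, n, delta))  # driven by sqrt(variance)
```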

We now state the key lemma of this section, which shows that for any policy $\pi$ the variance $\mathbb{V}^\pi$ satisfies the following Bellman-like recursion. Later, we use this result, in Lemma 6, to bound $(I - \gamma P^\pi)^{-1}\sigma^\pi$.

Lemma 5. $\mathbb{V}^\pi$ satisfies the Bellman equation
$$\mathbb{V}^\pi = \sigma^\pi + \gamma^2 P^\pi\, \mathbb{V}^\pi.$$

Proof. For all $z\in\mathcal{Z}$ we have
$$\mathbb{V}^\pi(z) = \mathbb{E}\Big[\Big|\sum_{t\ge 0}\gamma^t r(Z_t) - Q^\pi(z)\Big|^2\Big]$$
$$= \mathbb{E}_{Z_1\sim P^\pi(\cdot|z)}\,\mathbb{E}\Big[\Big|\Big(\sum_{t\ge 1}\gamma^t r(Z_t) - \gamma Q^\pi(Z_1)\Big) - \big(Q^\pi(z) - r(z) - \gamma Q^\pi(Z_1)\big)\Big|^2\Big]$$
$$= \gamma^2\,\mathbb{E}_{Z_1\sim P^\pi(\cdot|z)}\,\mathbb{E}\Big[\Big|\sum_{t\ge 1}\gamma^{t-1} r(Z_t) - Q^\pi(Z_1)\Big|^2\Big] - 2\,\mathbb{E}_{Z_1\sim P^\pi(\cdot|z)}\Big[\big(Q^\pi(z) - r(z) - \gamma Q^\pi(Z_1)\big)\,\mathbb{E}\Big(\sum_{t\ge 1}\gamma^t r(Z_t) - \gamma Q^\pi(Z_1)\,\Big|\,Z_1\Big)\Big] + \mathbb{E}_{Z_1\sim P^\pi(\cdot|z)}\Big[\big|Q^\pi(z) - r(z) - \gamma Q^\pi(Z_1)\big|^2\Big]$$
$$= \gamma^2\,\mathbb{E}_{Z_1\sim P^\pi(\cdot|z)}\,\mathbb{E}\Big[\Big|\sum_{t\ge 1}\gamma^{t-1} r(Z_t) - Q^\pi(Z_1)\Big|^2\Big] + \gamma^2\,\mathbb{V}_{Z_1\sim P^\pi(\cdot|z)}\big(Q^\pi(Z_1)\big)$$
$$= \gamma^2 \sum_{y\in\mathcal{Z}} P^\pi(y|z)\,\mathbb{V}^\pi(y) + \sigma^\pi(z),$$
in which we rely on $\mathbb{E}\big(\sum_{t\ge 1}\gamma^t r(Z_t) - \gamma Q^\pi(Z_1)\,\big|\,Z_1\big) = 0$ and on the Bellman equation $Q^\pi(z) = r(z) + \gamma (P^\pi Q^\pi)(z)$, which makes the last expectation equal to $\gamma^2\,\mathbb{V}_{Z_1\sim P^\pi(\cdot|z)}(Q^\pi(Z_1)) = \sigma^\pi(z)$.
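Lemma 5 can be verified numerically: on any small Markov reward process (an MDP with a fixed policy), the Monte-Carlo variance of the discounted return should agree with the solution of the linear system $\mathbb{V}^\pi = (I - \gamma^2 P^\pi)^{-1}\sigma^\pi$. The sketch below does this for a randomly generated chain; the generator, truncation horizon and sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
S, gamma, horizon, n_rollouts = 5, 0.9, 200, 20_000

P = rng.dirichlet(np.ones(S), size=S)     # P^pi(y | z) for a fixed policy pi
r = rng.uniform(0.0, 1.0, size=S)         # r(z)

Q = np.linalg.solve(np.eye(S) - gamma * P, r)             # Q^pi = (I - gamma P^pi)^{-1} r
sigma = gamma**2 * (P @ Q**2 - (P @ Q) ** 2)              # immediate variance sigma^pi
V_rec = np.linalg.solve(np.eye(S) - gamma**2 * P, sigma)  # Lemma 5 solved for V^pi

# Monte-Carlo estimate of Var[sum_t gamma^t r(Z_t)] with all rollouts started at z = 0.
z = np.zeros(n_rollouts, dtype=int)
returns = np.zeros(n_rollouts)
disc, cdf = 1.0, P.cumsum(axis=1)
for _ in range(horizon):
    returns += disc * r[z]
    disc *= gamma
    z = (rng.random(n_rollouts)[:, None] < cdf[z]).argmax(axis=1)

print(V_rec[0], returns.var())   # the two values should agree up to Monte-Carlo error
```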

Based on Lemma 5, one can prove the following result on the immediate variance.

Lemma 6.
$$\big(I - \gamma^2 P^\pi\big)^{-1}\sigma^\pi \le \beta^2\,\mathbf{1}, \qquad (9)$$
$$\big\|\big(I - \gamma P^\pi\big)^{-1}\sqrt{\sigma^\pi}\big\| \le 2\log(2)\,\beta^{1.5}. \qquad (10)$$

Proof. The first inequality follows from Lemma 5 by solving the Bellman equation of Lemma 5 in terms of $\mathbb{V}^\pi$ and taking the sup-norm over both sides of the resulting equation. In the case of Eq. (10) we have
$$\big(I - \gamma P^\pi\big)^{-1}\sqrt{\sigma^\pi} = \sum_{k\ge 0}(\gamma P^\pi)^k\sqrt{\sigma^\pi} = \sum_{l\ge 0}(\gamma P^\pi)^{tl}\sum_{j=0}^{t-1}(\gamma P^\pi)^{j}\sqrt{\sigma^\pi} \le \sum_{l\ge 0}(\gamma^t)^l\,\Big\|\sum_{j=0}^{t-1}(\gamma P^\pi)^{j}\sqrt{\sigma^\pi}\Big\|\,\mathbf{1} = \frac{1}{1-\gamma^t}\,\Big\|\sum_{j=0}^{t-1}(\gamma P^\pi)^{j}\sqrt{\sigma^\pi}\Big\|\,\mathbf{1}, \qquad (11)$$
in which we write $k = tl + j$, where $t$ is a positive integer, $l \ge 0$ and $0 \le j \le t-1$. We now prove a bound on $\sum_{j=0}^{t-1}(\gamma P^\pi)^{j}\sqrt{\sigma^\pi}$ by making use of Jensen's inequality as well as the Cauchy-Schwarz inequality:
$$\sum_{j=0}^{t-1}(\gamma P^\pi)^{j}\sqrt{\sigma^\pi} \le \sum_{j=0}^{t-1}\gamma^{j}\sqrt{(P^\pi)^{j}\sigma^\pi} \le \sqrt{t}\,\sqrt{\sum_{j=0}^{t-1}(\gamma^2 P^\pi)^{j}\sigma^\pi} \qquad (12)$$
$$\le \sqrt{t}\,\sqrt{\big(I - \gamma^2 P^\pi\big)^{-1}\sigma^\pi} \le \beta\sqrt{t}\,\mathbf{1},$$
where in the last step we rely on (9). The result then follows by plugging (12) into (11) and optimizing the bound in terms of $t$ to achieve the best dependency on $\beta$.

Now we make use of Lemma 6 and Lemma 4 to bound $\|Q^* - \widehat{Q}^*\|$ in high probability.

Lemma 7. Let Assumption 1 hold. Then, for any $0<\delta<1$, w.p. at least $1-\delta$,
$$\|Q^* - \widehat{Q}^*\| \le b,$$
where the scalar $b$ is given by
$$b \triangleq \sqrt{\frac{17\beta^3\log\frac{2N}{\delta}}{n}} + \frac{6(\gamma\beta^2)^{4/3}\big(\log\frac{6N}{\delta}\big)^{3/4}}{n^{3/4}} + \frac{5\gamma\beta^3\log\frac{6N}{\delta}}{n}.$$

Proof. By incorporating the results of Lemma 4 and Lemma 6 into Lemma 2, we deduce that
$$Q^* - \widehat{Q}^* \le b\,\mathbf{1}, \qquad Q^* - \widehat{Q}^* \ge -b\,\mathbf{1},$$
w.p. $1-\delta$, with the scalar $b$ as defined above. The result then follows by combining these two bounds and taking the $\ell_\infty$ norm.

Proof of Theorem 1. We combine the results of Lemma 7 and Lemma 1 in order to bound $\|Q^* - Q_k\|$ in high probability. We then solve the resulting bound w.r.t. $n$ and $k$.

4.2. Proof of the Lower Bound

In this section, we provide the proof of Theorem 2. In our analysis, we rely on the likelihood-ratio method, which has previously been used to prove a lower bound for multi-armed bandits (Mannor & Tsitsiklis, 2004), and we extend this approach to RL and MDPs. We begin by defining a class of MDPs for which the proposed lower bound will be obtained (see Figure 1). We define the class of MDPs $\mathbb{M}$ as the set of all MDPs with a state-action space of cardinality $N = 3KL$, where $K$ and $L$ are positive integers. Also, we assume that for all $M \in \mathbb{M}$, the state space $\mathcal{X}$ consists of three smaller sets $\mathcal{S}$, $\mathcal{Y}^1$ and $\mathcal{Y}^2$. The set $\mathcal{S}$ includes $K$ states, each of which is associated with the set of actions $\mathcal{A} = \{a_1, a_2, \dots, a_L\}$, whereas the states in $\mathcal{Y}^1$ and $\mathcal{Y}^2$ are single-action states. By taking an action $a \in \mathcal{A}$ from any state $x \in \mathcal{S}$, we move to the next state $y(z) \in \mathcal{Y}^1$ with probability 1, where $z = (x,a)$. The transition probability from $\mathcal{Y}^1$ is characterized by the transition probability $p_M$ from every $y(z) \in \mathcal{Y}^1$ to itself, and with probability $1 - p_M$ to the corresponding state in $\mathcal{Y}^2$. Further, for all $M \in \mathbb{M}$, the states in $\mathcal{Y}^2$ are absorbing.

[Figure 1. The class of MDPs considered in the proof of Theorem 2. Nodes represent states and arrows show transitions between the states (see the text for details).]

In the rest of the proof, we concentrate on proving the lower bound for $|Q^*(z) - Q^{\mathcal{A}}_T(z)|$ for all $z \in \mathcal{S}\times\mathcal{A}$. Now, let us consider a set of two MDPs $\mathbb{M}^* = \{M_0, M_1\}$ in $\mathbb{M}$ with the transition probabilities
$$p_M = \begin{cases} p & M = M_0,\\ p + \alpha & M = M_1,\end{cases}$$
where $\alpha$ and $p$ are some positive numbers such that $0 < p < p + \alpha \le 1$. We assume that the discount factor $\gamma$ is bounded from below by some positive constant $\gamma_0$ for the class of MDPs $\mathbb{M}^*$. We denote by $\mathbb{E}_m$ and $\mathbb{P}_m$ the expectation and the probability under the model $M_m$ in the rest of this section.

We follow these steps in the proof: (i) we prove a lower bound on the sample complexity of learning the value function for every state $y \in \mathcal{Y}^1$ on the class of MDPs $\mathbb{M}^*$; (ii) we then make use of the fact that the estimates of $Q^*(z)$ for different $z \in \mathcal{S}\times\mathcal{A}$ are independent of each other to combine these bounds and prove the tight result of Theorem 2.

We now introduce some new notation: let $Q^{\mathcal{A}}_t(z)$ be an empirical estimate of the action-value function $Q^*(z)$ by the RL algorithm $\mathcal{A}$ using $t > 0$ transition samples from the state $y(z) \in \mathcal{Y}^1$, for $z \in \mathcal{S}\times\mathcal{A}$. We define the event $E_1(z) \triangleq \{|Q^*_0(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\}$ for all $z \in \mathcal{S}\times\mathcal{A}$, where $Q^*_0 \triangleq \gamma/(1-\gamma p)$ is the optimal action-value function for all $z \in \mathcal{S}\times\mathcal{A}$ under the MDP $M_0$. We then define $k \triangleq r_1 + r_2 + \cdots + r_t$ as the sum of rewards obtained by making $t$ transitions from $y(z) \in \mathcal{Y}^1$. We also introduce the event $E_2(z)$, for all $z \in \mathcal{S}\times\mathcal{A}$, as
$$E_2(z) \triangleq \Big\{\, pt - k \le \sqrt{2p(1-p)t\,\log\tfrac{c'_2}{2\theta}}\,\Big\}.$$
Further, we define $E(z) \triangleq E_1(z) \cap E_2(z)$.

We then state the following lemmas required for our analysis. We begin our analysis of the lower bound with the following lemma:

Lemma 8. Define $\theta \triangleq \exp\big(-c'_1\alpha^2 t/(p(1-p))\big)$. Then, for every RL algorithm $\mathcal{A}$, there exist an MDP $M_m \in \mathbb{M}^*$ and constants $c'_1 > 0$ and $c'_2 > 0$ such that
$$\mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_t(z)| > \varepsilon\big) > \frac{\theta}{c'_2}, \qquad (14)$$
by the choice of $\alpha = 2(1-\gamma p)^2\varepsilon/\gamma^2$.

Proof. To prove this result we make use of a contradiction argument, i.e., we assume that there exists an algorithm $\mathcal{A}$ for which
$$\mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_t(z)| > \varepsilon\big) \le \frac{\theta}{c'_2}, \quad\text{or equivalently}\quad \mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\big) \ge 1 - \frac{\theta}{c'_2}, \qquad (15)$$
for all $M_m \in \mathbb{M}^*$, and show that this assumption leads to a contradiction. We now state the following technical lemmas required for the proof:

Lemma 9. For all $z \in \mathcal{S}\times\mathcal{A}$,
$$\mathbb{P}_0(E_2(z)) > 1 - \frac{2\theta}{c'_2}.$$

Now, by the assumption that $\mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_t(z)| > \varepsilon\big) \le \theta/c'_2$ for all $M_m \in \mathbb{M}^*$, we have $\mathbb{P}_0(E_1(z)) \ge 1 - \theta/c'_2 \ge 1 - 1/c'_2$. This, combined with Lemma 9 and with the choice of $c'_2 = 6$, implies that $\mathbb{P}_0(E(z)) > 1/2$ for all $z \in \mathcal{S}\times\mathcal{A}$.

Lemma 10. Let $\varepsilon \le \frac{1-p}{4\gamma^2(1-\gamma p)^2}$. Then, for all $z \in \mathcal{S}\times\mathcal{A}$: $\mathbb{P}_1(E_1(z)) > \theta/c'_2$.

Now, by the choice of $\alpha = 2(1-\gamma p)^2\varepsilon/\gamma^2$, we have that
$$Q^*_1(z) - Q^*_0(z) = \frac{\gamma}{1-\gamma(p+\alpha)} - \frac{\gamma}{1-\gamma p} > 2\varepsilon,$$
thus $Q^*_0(z) + \varepsilon < Q^*_1(z) - \varepsilon$ and the event $\{|Q^*_0(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\}$ does not overlap with the event $\{|Q^*_1(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\}$.

Now let us return to the assumption of Eq. (15), which states that for all $M_m \in \mathbb{M}^*$, $\mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\big) \ge 1 - \theta/c'_2$ under algorithm $\mathcal{A}$. Based on Lemma 10 we have $\mathbb{P}_1\big(|Q^*_0(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\big) > \theta/c'_2$. This, combined with the fact that the events $\{|Q^*_0(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\}$ and $\{|Q^*_1(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\}$ do not overlap, implies that $\mathbb{P}_1\big(|Q^*_1(z) - Q^{\mathcal{A}}_t(z)| \le \varepsilon\big) < 1 - \theta/c'_2$, which violates the assumption of Eq. (15). The contradiction between the result of Lemma 10 and the assumption which leads to it proves the lower bound of Eq. (14).
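The non-overlap argument above hinges on the perturbation $\alpha = 2(1-\gamma p)^2\varepsilon/\gamma^2$ pushing the optimal action-values of $M_0$ and $M_1$ more than $2\varepsilon$ apart, so that no single estimate can be $\varepsilon$-accurate under both models. A quick numerical check of this gap, using only the closed form $Q^*_m(z) = \gamma/(1-\gamma p_{M_m})$ quoted above (the particular values of $\gamma$ and $\varepsilon$ are arbitrary):

```python
gamma, eps = 0.95, 0.05
p = (4 * gamma - 1) / (3 * gamma)            # the choice of p used later in the proof
alpha = 2 * (1 - gamma * p) ** 2 * eps / gamma ** 2

q0 = gamma / (1 - gamma * p)                 # Q*_0(z) for z in S x A (under M_0)
q1 = gamma / (1 - gamma * (p + alpha))       # Q*_1(z) (under M_1)
print(q1 - q0, 2 * eps)                      # the gap strictly exceeds 2 * eps
```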

Now, by the choice of $p = \frac{4\gamma - 1}{3\gamma}$ and $c_1 = 8100$, we have that for every $\varepsilon \in (0,3]$ and for all $0.4 = \gamma_0 \le \gamma < 1$ there exists an MDP $M_m \in \mathbb{M}^*$ such that
$$\mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_{T_z}(z)| > \varepsilon\big) > \frac{1}{c'_2}\,e^{-\frac{c_1 T_z \varepsilon^2}{6\beta^3}}.$$
This result implies that for any state-action pair $z \in \mathcal{S}\times\mathcal{A}$, the probability of making an estimation error of $\varepsilon$ is at least $\delta$ on $M_0$ or $M_1$ whenever the number of transition samples $T_z$ from $z \in \mathcal{Z}$ is less than $\xi(\varepsilon,\delta) \triangleq \frac{6\beta^3}{c_1\varepsilon^2}\log\frac{1}{c'_2\delta}$. We now extend this result to the whole state-action space $\mathcal{S}\times\mathcal{A}$.

Lemma 11. Assume that for every algorithm $\mathcal{A}$ and for every state-action pair $z \in \mathcal{S}\times\mathcal{A}$ we have[11]
$$\mathbb{P}_m\big(|Q^*(z) - Q^{\mathcal{A}}_{T_z}(z)| > \varepsilon \,\big|\, T_z = t_z\big) > \delta. \qquad (16)$$
Then for any $\delta' \in (0, 1/2)$ and for any algorithm $\mathcal{A}$ using a total number of transition samples less than $T = \frac{N}{6}\,\xi\big(\varepsilon, \frac{12\delta'}{N}\big)$, there exists an MDP $M_m \in \mathbb{M}^*$ such that
$$\mathbb{P}_m\big(\|Q^* - Q^{\mathcal{A}}_T\| > \varepsilon\big) > \delta', \qquad (17)$$
where $Q^{\mathcal{A}}_T$ denotes the empirical estimate of the optimal action-value function $Q^*$ by $\mathcal{A}$ using $T$ transition samples.

[11] Note that we allow $T_z$ to be random.

Proof. First note that if the total number of observed transitions is less than $(KL/2)\,\xi(\varepsilon,\delta) = (N/6)\,\xi(\varepsilon,\delta)$, then there exist at least $KL/2 = N/6$ state-action pairs that are sampled at most $\xi(\varepsilon,\delta)$ times (indeed, if this were not the case, then the total number of transitions would be strictly larger than $(N/6)\,\xi(\varepsilon,\delta)$, which implies a contradiction). Now let us denote those state-action pairs by $z_{(1)},\dots,z_{(N/6)}$.

We consider the specific class of MDPs described in Figure 1. In order to prove that (17) holds for any algorithm, it is sufficient to prove it for the class of algorithms that return an estimate $Q^{\mathcal{A}}_{T_z}(z)$ for each state-action pair $z$ based on the transition samples observed from $z$ only (indeed, since the samples from $z$ and $z'$ are independent, the samples collected from $z'$ do not bring more information about $Q^*(z)$ than the information brought by the samples collected from $z$). Thus, by defining $\mathcal{Q}(z) \triangleq \{|Q^*(z) - Q^{\mathcal{A}}_{T_z}(z)| > \varepsilon\}$, we have that for such algorithms the events $\mathcal{Q}(z)$ and $\mathcal{Q}(z')$ are conditionally independent given $T_z$ and $T_{z'}$. Thus, there exists an MDP $M_m \in \mathbb{M}^*$ such that
$$\mathbb{P}_m\Big(\{\mathcal{Q}(z_{(i)})^c\}_{1\le i\le N/6} \cap \{T_{z_{(i)}} \le \xi(\varepsilon,\delta)\}_{1\le i\le N/6}\Big)$$
$$= \sum_{t_1=0}^{\xi(\varepsilon,\delta)}\cdots\sum_{t_{N/6}=0}^{\xi(\varepsilon,\delta)} \mathbb{P}_m\Big(\{T_{z_{(i)}} = t_i\}_{1\le i\le N/6} \cap \{\mathcal{Q}(z_{(i)})^c\}_{1\le i\le N/6}\Big)$$
$$= \sum_{t_1=0}^{\xi(\varepsilon,\delta)}\cdots\sum_{t_{N/6}=0}^{\xi(\varepsilon,\delta)} \mathbb{P}_m\Big(\{T_{z_{(i)}} = t_i\}_{1\le i\le N/6}\Big)\prod_{1\le i\le N/6} \mathbb{P}_m\Big(\mathcal{Q}(z_{(i)})^c \,\Big|\, T_{z_{(i)}} = t_i\Big)$$
$$\le \sum_{t_1=0}^{\xi(\varepsilon,\delta)}\cdots\sum_{t_{N/6}=0}^{\xi(\varepsilon,\delta)} \mathbb{P}_m\Big(\{T_{z_{(i)}} = t_i\}_{1\le i\le N/6}\Big)\,(1-\delta)^{N/6} \le (1-\delta)^{N/6},$$
from Eq. (16). Thus
$$\mathbb{P}_m\Big(\{\mathcal{Q}(z_{(i)})^c\}_{1\le i\le N/6}\,\Big|\,\{T_{z_{(i)}} \le \xi(\varepsilon,\delta)\}_{1\le i\le N/6}\Big) \le (1-\delta)^{N/6}.$$
We finally deduce that if the total number of transition samples is less than $(N/6)\,\xi(\varepsilon,\delta)$, then
$$\mathbb{P}_m\big(\|Q^* - Q^{\mathcal{A}}_T\| > \varepsilon\big) \ge \mathbb{P}_m\Big(\bigcup_{z\in\mathcal{S}\times\mathcal{A}} \mathcal{Q}(z)\Big) \ge 1 - \mathbb{P}_m\Big(\{\mathcal{Q}(z_{(i)})^c\}_{1\le i\le N/6}\,\Big|\,\{T_{z_{(i)}} \le \xi(\varepsilon,\delta)\}_{1\le i\le N/6}\Big) \ge 1 - (1-\delta)^{N/6} \ge \frac{\delta N}{12},$$
whenever $\delta N/6 \le 1$. Setting $\delta' = \delta N/12$, we obtain the desired result.

Lemma 11 implies that if the total number of samples $T$ is less than $\frac{\beta^3 N}{c_1\varepsilon^2}\log\frac{N}{c_2\delta}$, with the choice of $c_1 = 8100$ and $c_2 = 72$, then the probability of $\|Q^* - Q^{\mathcal{A}}_T\| \le \varepsilon$ is at most $1-\delta$ on either $M_0$ or $M_1$. This is equivalent to the statement that for every RL algorithm $\mathcal{A}$ to be $(\varepsilon,\delta,T)$-correct on the set $\mathbb{M}^*$, and subsequently on the class of MDPs $\mathbb{M}$, the total number of transitions $T$ needs to satisfy the inequality $T > \frac{\beta^3 N}{c_1\varepsilon^2}\log\frac{N}{c_2\delta}$, which concludes the proof of Theorem 2.
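To see how tightly Theorems 1 and 2 bracket the sample complexity, the sketch below evaluates both bounds for the same problem size: both scale as $\beta^3 N \log(N/\delta)/\varepsilon^2$, so the remaining gap is only in the numerical constants ($c=68$, $c_0=12$ versus $c_1=8100$, $c_2=72$). The example values are arbitrary.

```python
import math

def upper_bound(N, gamma, eps, delta, c=68, c0=12):
    beta = 1.0 / (1.0 - gamma)
    return c * beta**3 * N / eps**2 * math.log(c0 * N / delta)       # Theorem 1: QVI suffices

def lower_bound(N, gamma, eps, delta, c1=8100, c2=72):
    beta = 1.0 / (1.0 - gamma)
    return beta**3 * N / (c1 * eps**2) * math.log(N / (c2 * delta))  # Theorem 2: any algorithm needs

N, gamma, eps, delta = 600, 0.99, 0.1, 0.01
print(upper_bound(N, gamma, eps, delta), lower_bound(N, gamma, eps, delta))
```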

5. Conclusion and Future Work

In this paper, we have presented the first minimax bound on the sample complexity of estimating the optimal action-value function in discounted-reward MDPs. We have proven that the model-based Q-value iteration algorithm (QVI) is an optimal learning algorithm, since it minimizes the dependencies on $1/\varepsilon$, $N$, $\delta$ and $\beta$. Also, our results significantly improve on the state of the art in terms of the dependency on $\beta$. Overall, we conclude that QVI is an efficient RL algorithm which completely closes the gap between the lower and the upper bound on the sample complexity of RL in the presence of a generative model of the MDP.

In this work, we are only interested in the estimation of the optimal action-value function and not in the problem of exploration. Therefore, we did not compare our results with the state of the art in PAC-MDP (Strehl et al., 2009; Szita & Szepesvári, 2010) and upper-confidence-bound based algorithms (Bartlett & Tewari, 2009; Jaksch et al., 2010), in which the choice of the exploration policy has an influence on the behavior of the learning algorithm. However, we believe that it would be possible to improve on the state of the art in PAC-MDP based on the results of this paper. This is mainly due to the fact that most PAC-MDP algorithms rely on an extended variant of model-based Q-value iteration to estimate the action-value function, but they use the naive result of Hoeffding's inequality for concentration of measure, which leads to non-tight sample complexity results. One can improve on those results, in terms of the dependency on $\beta$, using the improved analysis of this paper, which makes use of the sharp result of Bernstein's inequality as opposed to the Hoeffding's inequality used in previous work. Also, we believe that the existing lower bound on the sample complexity of exploration of any reinforcement learning algorithm (Strehl et al., 2009) can be significantly improved in terms of the dependency on $\beta$ using the new "hard" class of MDPs presented in this paper.

Acknowledgments

The authors appreciate support from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 231495. We also thank Bruno Scherrer for useful discussions and the anonymous reviewers for their valuable comments.

References

Azar, M. Gheshlaghi, Munos, R., Ghavamzadeh, M., and Kappen, H. J. Speedy Q-learning. In Advances in Neural Information Processing Systems 24, pp. 2411–2419, 2011a.

Azar, M. Gheshlaghi, Munos, R., Ghavamzadeh, M., and Kappen, H. J. Reinforcement learning with a near optimal rate of convergence. Technical Report inria-00636615, INRIA, 2011b.

Bartlett, P. L. and Tewari, A. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.

Buşoniu, L., Babuška, R., De Schutter, B., and Ernst, D. Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Boca Raton, Florida, 2010.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

Kakade, S. M. On the Sample Complexity of Reinforcement Learning. PhD thesis, Gatsby Computational Neuroscience Unit, 2004.

Kearns, M. and Singh, S. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems 12, pp. 996–1002. MIT Press, 1999.

Mannor, S. and Tsitsiklis, J. N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623–648, 2004.

Munos, R. and Moore, A. Influence and variance of a Markov chain: Application to adaptive discretizations in optimal control. In Proceedings of the 38th IEEE Conference on Decision and Control, 1999.

Sobel, M. J. The variance of discounted Markov decision processes. Journal of Applied Probability, 19:794–802, 1982.

Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, 1998.

Szepesvári, Cs. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.

Szita, I. and Szepesvári, Cs. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pp. 1031–1038. Omnipress, 2010.