
PAC Model-Free Reinforcement Learning

Alexander L. Strehl [email protected]
Lihong Li [email protected]
Department of Computer Science, Rutgers University, Piscataway, NJ 08854 USA

Eric Wiewiora [email protected]
Computer Science and Engineering Department, University of California, San Diego

John Langford [email protected]
TTI-Chicago, 1427 E 60th Street, Chicago, IL 60637 USA

Michael L. Littman [email protected]
Department of Computer Science, Rutgers University, Piscataway, NJ 08854 USA

Abstract

For a Markov Decision Process with finite state (size S) and action spaces (size A per state), we propose a new algorithm—Delayed Q-learning. We prove it is PAC, achieving near optimal performance except for Õ(SA) timesteps using O(SA) space, improving on the Õ(S^2 A) bounds of best previous algorithms. This result proves efficient reinforcement learning is possible without learning a model of the MDP from experience. Learning takes place from a single continuous thread of experience—no resets nor parallel sampling is used. Beyond its smaller storage and experience requirements, Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.

1. Introduction

In the reinforcement-learning (RL) problem (Sutton & Barto, 1998), an agent acts in an unknown or incompletely known environment with the goal of maximizing an external reward signal. One of the fundamental obstacles in RL is the exploration-exploitation dilemma: whether to act to gain new information (explore) or to act consistently with past experience to maximize reward (exploit). This paper models the RL problem as a Markov Decision Process (MDP) environment with finite state and action spaces.

When evaluating RL algorithms, there are three essential traits to consider: space complexity, computational complexity, and sample complexity. We define a timestep to be a single interaction with the environment. Space complexity measures the amount of memory required to implement the algorithm, while computational complexity measures the number of operations needed to execute the algorithm, per timestep. Sample complexity measures the number of timesteps for which the algorithm does not behave near optimally or, in other words, the amount of experience it takes to learn to behave well.

We will call algorithms whose sample complexity can be bounded by a polynomial in the environment size and approximation parameters, with high probability, PAC-MDP (Probably Approximately Correct in Markov Decision Processes). All algorithms known to be PAC-MDP to date involve the maintenance and solution (often by value iteration or mathematical programming) of an internal MDP model. Such algorithms, including Rmax (Brafman & Tennenholtz, 2002), E^3 (Kearns & Singh, 2002), and MBIE (Strehl & Littman, 2005), are called model-based algorithms and have relatively high space and computational complexities. Another class of algorithms, including most forms of Q-learning (Watkins & Dayan, 1992), make no effort to learn a model and can be called model free.

It is difficult to articulate a hard and fast rule dividing model-free and model-based algorithms, but model-based algorithms generally retain some transition information during learning whereas model-free algorithms only keep value-function information. Instead of formalizing this intuition, we have decided to adopt a crisp, if somewhat unintuitive, definition. For our purposes, a model-free RL algorithm is one whose space complexity is asymptotically less than the space required to store an MDP.

Definition 1 A learning algorithm is said to be model free if its space complexity is always o(S^2 A), where S is the number of states and A is the number of actions of the MDP used for learning.

Although they tend to have low space and computational complexity, no model-free algorithm has been proven to be PAC-MDP. In this paper, we present a new model-free algorithm, Delayed Q-learning, and prove it is the first such algorithm.

The hardness of learning an arbitrary MDP as measured by sample complexity is still relatively unexplored. For simplicity, we let Õ(·) (Ω̃(·)) represent O(·) (Ω(·)) where logarithmic factors are ignored. When we consider only the dependence on S and A, the lower bound of Kakade (2003) says that with probability greater than 1 − δ, the sample complexity of any algorithm will be Ω̃(SA). However, the best upper bound known provides an algorithm whose sample complexity is Õ(S^2 A) with probability at least 1 − δ. In other words, there are algorithms whose sample complexity is known to be no greater than approximately the number of bits required to specify an MDP to fixed precision. However, there has been no argument proving that learning to act near-optimally takes as long as approximating the dynamics of an MDP. We solve this open problem, first posed by Kakade (2003), by showing that Delayed Q-learning has sample complexity Õ(SA), with high probability. Our result therefore proves that efficient RL is possible without learning a model of the environment from experience.

2. Definitions and Notation

This section introduces the Markov Decision Process notation used throughout the paper; see Sutton and Barto (1998) for an introduction. An MDP M is a five tuple ⟨S, A, T, R, γ⟩, where S is the state space, A is the action space, T : S × A × S → R is a transition function, R : S × A → R is a reward function, and 0 ≤ γ < 1 is a discount factor on the summed sequence of rewards. We also let S and A denote the number of states and the number of actions, respectively. From state s under action a, the agent receives a random reward r, which has expectation R(s, a), and is transported to state s' with probability T(s'|s, a). A policy is a strategy for choosing actions. Only deterministic policies are dealt with in this paper. A stationary policy is one that produces an action based on only the current state. We assume that rewards all lie between 0 and 1.

For any policy π, let V^π_M(s) (Q^π_M(s, a)) denote the discounted, infinite-horizon value (action-value or Q-value) function for π in M (which may be omitted from the notation) from state s. If T is a positive integer, let V^π_M(s, T) denote the T-step value function of policy π. Specifically,

  V^π_M(s) = E[ ∑_{j=1}^{∞} γ^{j−1} r_j ]   and   V^π_M(s, T) = E[ ∑_{j=1}^{T} γ^{j−1} r_j ],

where [r_1, r_2, ...] is the reward sequence generated by following policy π from state s. These expectations are taken over all possible infinite paths the agent might follow. The optimal policy is denoted π^* and has value functions V^*_M(s) and Q^*_M(s, a). Note that a policy cannot have a value greater than 1/(1 − γ) in any state.
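To make the value-function definitions concrete, here is a minimal sketch (ours, not from the paper) of iterative policy evaluation for a stationary deterministic policy, assuming a small tabular MDP stored in hypothetical dictionaries T and R as commented below:

def evaluate_policy(states, policy, T, R, gamma, tol=1e-10):
    """Iteratively solve V(s) = R(s, pi(s)) + gamma * sum_{s'} T(s'|s, pi(s)) * V(s')."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            backup = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V

# Toy two-state, one-action MDP with made-up numbers (rewards in [0, 1], gamma < 1):
states = [0, 1]
T = {0: {0: {0: 0.9, 1: 0.1}}, 1: {0: {0: 0.5, 1: 0.5}}}   # T[s][a][s'] = transition probability
R = {(0, 0): 1.0, (1, 0): 0.0}                              # R[(s, a)] = expected reward
policy = {0: 0, 1: 0}                                       # a stationary deterministic policy
print(evaluate_policy(states, policy, T, R, gamma=0.9))

Because rewards lie in [0, 1], the resulting values never exceed 1/(1 − γ), consistent with the remark above.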
3. Learning Efficiently

In our discussion, we assume that the learner receives S, A, ε, δ, and γ as input. The learning problem is defined as follows. The agent always occupies a single state s of the MDP M. The learning algorithm is told this state and must select an action a. The agent receives a reward r and is then transported to another state s' according to the rules from Section 2. This procedure then repeats forever. The first state occupied by the agent may be chosen arbitrarily.

There has been much discussion in the RL community over what defines efficient learning or how to define sample complexity. For any fixed ε, Kakade (2003) defines the sample complexity of exploration (sample complexity, for short) of an algorithm A to be the number of timesteps t such that the non-stationary policy at time t, A_t, is not ε-optimal from the current state¹, s_t at time t (formally, V^{A_t}(s_t) < V^*(s_t) − ε). We believe this definition captures the essence of measuring learning. An algorithm A is then said to be PAC-MDP (Probably Approximately Correct in Markov Decision Processes) if, for any ε and δ, the sample complexity of A is less than some polynomial in the relevant quantities (S, A, 1/ε, 1/δ, 1/(1 − γ)), with probability at least 1 − δ.

The above definition penalizes the learner for executing a non-ε-optimal policy rather than for a non-optimal policy. Keep in mind that, with only a finite amount of experience, no algorithm can identify the optimal policy with complete confidence. In addition, due to noise, any algorithm may be misled about the underlying dynamics of the system. Thus, a failure probability of at most δ is allowed. See Kakade (2003) for a full motivation of this performance measure.

¹ Note that A_t is completely defined by A and the agent's history up to time t.
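Read as a counting rule, the sample complexity of exploration of a finished run could be tallied as in the sketch below (ours; it assumes oracle access to V^*(s_t) and to the value V^{A_t}(s_t) of the agent's non-stationary policy at each step, which is available in simulation studies but not to the learner):

def sample_complexity_of_exploration(v_opt, v_agent, epsilon):
    """Count timesteps t at which the agent was not epsilon-optimal,
    i.e., V^{A_t}(s_t) < V*(s_t) - epsilon."""
    return sum(1 for v_star, v_alg in zip(v_opt, v_agent) if v_alg < v_star - epsilon)

# Hypothetical per-timestep values from a simulated run:
print(sample_complexity_of_exploration([5.0, 5.0, 4.8], [3.9, 4.95, 4.75], epsilon=0.1))  # -> 1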

4. Delayed Q-learning

In this section we describe a new reinforcement-learning algorithm, Delayed Q-learning.

Delayed Q-learning maintains Q-value estimates, Q(s, a), for each state-action pair (s, a). At time t (= 1, 2, ...), let Q_t(s, a) denote the algorithm's current Q-value estimate for (s, a) and let V_t(s) denote max_{a∈A} Q_t(s, a). The learner always acts greedily with respect to its estimates, meaning that if s is the tth state reached, a := argmax_{a'∈A} Q_t(s, a') is the next action chosen.

In addition to Q-value estimates, the algorithm maintains a Boolean flag LEARN(s, a) for each (s, a). Let LEARN_t(s, a) denote the value of LEARN(s, a) at time t, that is, the value immediately before the tth action is taken. The flag indicates whether the learner is considering a modification to its Q-value estimate Q(s, a). The algorithm also relies on two free parameters, ε_1 ∈ (0, 1) and a positive integer m. In the analysis of Section 5, we provide precise values for these parameters in terms of the other inputs (S, A, ε, δ, and γ) that guarantee the resulting algorithm is PAC-MDP. Finally, a counter l(s, a) (l_t(s, a) at time t) is also maintained for each (s, a). Its value represents the amount of data (sample points) acquired for use in an upcoming update of Q(s, a). Once m samples are obtained and LEARN(s, a) is true, an update is attempted.

4.1. Initialization of the Algorithm

The Q-value estimates are initialized to 1/(1 − γ), the counters l(s, a) to zero, and the LEARN flags to true. That is, Q_1(s, a) = 1/(1 − γ), l_1(s, a) = 0, and LEARN_1(s, a) = true for all (s, a) ∈ S × A.

4.2. The Update Rule

Suppose that at time t ≥ 1, action a is performed from state s, resulting in an attempted update, according to the rules to be defined in Section 4.3. Let s_{k_1}, s_{k_2}, ..., s_{k_m} be the m most recent next-states observed from executing (s, a), at times k_1 < k_2 < ··· < k_m, respectively (k_m = t). For the remainder of the paper, we also let r_i denote the ith reward received during execution of Delayed Q-learning. Thus, at time k_i, action a was taken from state s, resulting in a transition to state s_{k_i} and an immediate reward r_{k_i}. After the tth action, the following update occurs:

  Q_{t+1}(s, a) = (1/m) ∑_{i=1}^{m} ( r_{k_i} + γ V_{k_i}(s_{k_i}) ) + ε_1    (1)

as long as performing the update would result in a new Q-value estimate that is at least ε_1 smaller than the previous estimate. In other words, the following equation must be satisfied for an update to occur:

  Q_t(s, a) − (1/m) ∑_{i=1}^{m} ( r_{k_i} + γ V_{k_i}(s_{k_i}) ) ≥ 2ε_1.    (2)

If any of the above conditions do not hold, then no update is performed. In this case, Q_{t+1}(s, a) = Q_t(s, a).
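As a concrete illustration of Equations 1 and 2, the following sketch (a hypothetical helper of ours, not the paper's implementation; see Algorithm 1 below for the full procedure) applies one attempted update given the m stored targets r_{k_i} + γ V_{k_i}(s_{k_i}):

def attempted_update(q_sa, targets, eps1):
    """Return the new Q(s, a) after an attempted update.

    `targets` holds the m values r_{k_i} + gamma * V_{k_i}(s_{k_i}) collected since
    the last attempted update of (s, a)."""
    avg = sum(targets) / len(targets)
    if q_sa - avg >= 2 * eps1:          # Equation 2: the update must lower Q by at least eps1
        return avg + eps1               # Equation 1: average target plus the eps1 bonus
    return q_sa                         # unsuccessful attempt: estimate unchanged

# Example with m = 3 stored targets and eps1 = 0.05:
print(attempted_update(q_sa=10.0, targets=[9.0, 9.2, 8.8], eps1=0.05))  # -> 9.05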

4.3. Maintenance of the LEARN Flags

We provide an intuition behind the behavior of the LEARN flags. Please see Section 4.4 for a formal description of the update rules. The main computation of the algorithm is that every time a state-action pair (s, a) is experienced m times, an update of Q(s, a) is attempted as in Section 4.2. For our analysis to hold, however, we cannot allow an infinite number of attempted updates. Therefore, attempted updates are only allowed for (s, a) when LEARN(s, a) is true. Besides being set to true initially, LEARN(s, a) is also set to true when any state-action pair is updated (because our estimate Q(s, a) may need to reflect this change). LEARN(s, a) can only change from true to false when no updates are made during a length of time for which (s, a) is experienced m times and the next attempted update of (s, a) fails. In this case, no more attempted updates of (s, a) are allowed until another Q-value estimate is updated.

4.4. Implementation of Delayed Q-learning

We provide an efficient implementation, Algorithm 1, of Delayed Q-learning that achieves our desired computational and space complexities.

4.5. Discussion

Delayed Q-learning is similar in many aspects to traditional Q-learning. Suppose that at time t, action a is taken from state s, resulting in reward r_t and next-state s'. Then, the Q-learning update is

  Q_{t+1}(s, a) = (1 − α_t) Q_t(s, a) + α_t ( r_t + γ V_t(s') )    (3)

where α_t ∈ [0, 1] is the learning rate. Note that if we let α_t = 1/(l_t(s, a) + 1), then m repetitions of Equation 3 is similar to the update for Delayed Q-learning (Equation 1) minus a small bonus of ε_1. However, Q-learning changes its Q-value estimates on every timestep, while Delayed Q-learning waits for m sample updates to make any changes. This variation has an averaging effect that mitigates some of the effects of randomness and, when combined with the bonus of ε_1, achieves optimism (Q(s, a) ≥ Q^*(s, a)) with high probability (see Lemma 2).

The property of optimism is useful for safe exploration and appears in many existing RL algorithms. The intuition is that if an action's Q-value is overly optimistic the agent will learn much by executing that action. Since the action-selection strategy is greedy, the Delayed Q-learning agent will tend to choose overly optimistic actions, therefore achieving directed exploration when necessary. If sufficient learning has been completed and all Q-values are close to their true Q^*-values, selecting the maximum will guarantee near-optimal behavior. In the next section, we provide a formal argument that Delayed Q-learning exhibits sufficient exploration for learning, specifically that it is PAC-MDP.
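The relationship to Equation 3 can be checked numerically: holding the m sampled targets fixed and using α_t = 1/(l_t(s, a) + 1) with the counter starting at zero, m incremental updates produce the plain average of the targets, that is, Equation 1 without the ε_1 bonus. A small sketch with made-up numbers:

def m_incremental_updates(q0, targets):
    """Apply the Q-learning update (Equation 3) m times with alpha_t = 1/(l_t + 1)."""
    q = q0
    for l, target in enumerate(targets):    # l = 0, 1, ..., m-1
        alpha = 1.0 / (l + 1)
        q = (1 - alpha) * q + alpha * target
    return q

targets = [0.7, 0.9, 0.8, 0.6]    # hypothetical samples of r + gamma * V(s')
print(m_incremental_updates(q0=10.0, targets=targets))   # prints 0.75 (up to rounding)
print(sum(targets) / len(targets))                       # the batch average used in Equation 1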

Algorithm 1 Delayed Q-learning
 0: Inputs: γ, S, A, m, ε_1
 1: for all (s, a) do
 2:   Q(s, a) ← 1/(1 − γ)      // Q-value estimates
 3:   U(s, a) ← 0              // used for attempted updates
 4:   l(s, a) ← 0              // counters
 5:   t(s, a) ← 0              // time of last attempted update
 6:   LEARN(s, a) ← true       // the LEARN flags
 7: end for
 8: t^* ← 0                    // time of most recent Q-value change
 9: for t = 1, 2, 3, ... do
10:   Let s denote the state at time t.
11:   Choose action a := argmax_{a'∈A} Q(s, a').
12:   Let r be the immediate reward and s' the next state after executing action a from state s.
13:   if LEARN(s, a) = true then
14:     U(s, a) ← U(s, a) + r + γ max_{a'} Q(s', a')
15:     l(s, a) ← l(s, a) + 1
16:     if l(s, a) = m then
17:       if Q(s, a) − U(s, a)/m ≥ 2ε_1 then
18:         Q(s, a) ← U(s, a)/m + ε_1
19:         t^* ← t
20:       else if t(s, a) ≥ t^* then
21:         LEARN(s, a) ← false
22:       end if
23:       t(s, a) ← t, U(s, a) ← 0, l(s, a) ← 0
24:     end if
25:   else if t(s, a) < t^* then
26:     LEARN(s, a) ← true
27:   end if
28: end for
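For readers who want to run Algorithm 1, here is a minimal Python sketch (our transcription under stated assumptions, not the authors' code). The class and attribute names (DelayedQAgent, n_actions, eps1, and so on) are ours; the update logic mirrors the pseudocode above.

from collections import defaultdict

class DelayedQAgent:
    """Minimal sketch of Algorithm 1 (Delayed Q-learning) for a tabular MDP."""

    def __init__(self, n_actions, gamma, m, eps1):
        self.n_actions = n_actions
        self.gamma = gamma
        self.m = m                              # samples per attempted update
        self.eps1 = eps1                        # bonus / minimum improvement epsilon_1
        q0 = 1.0 / (1.0 - gamma)                # optimistic initial value
        self.Q = defaultdict(lambda: q0)        # Q(s, a) estimates
        self.U = defaultdict(float)             # U(s, a): accumulated targets
        self.l = defaultdict(int)               # l(s, a): sample counters
        self.tsa = defaultdict(int)             # t(s, a): time of last attempted update
        self.learn = defaultdict(lambda: True)  # LEARN(s, a) flags
        self.t_star = 0                         # time of most recent Q-value change
        self.t = 0                              # global timestep

    def act(self, s):
        """Greedy action selection: argmax_a Q(s, a)."""
        return max(range(self.n_actions), key=lambda a: self.Q[(s, a)])

    def observe(self, s, a, r, s_next):
        """Process one experience (s, a, r, s'), following lines 13-27 of Algorithm 1."""
        self.t += 1
        if self.learn[(s, a)]:
            v_next = max(self.Q[(s_next, b)] for b in range(self.n_actions))
            self.U[(s, a)] += r + self.gamma * v_next
            self.l[(s, a)] += 1
            if self.l[(s, a)] == self.m:        # attempted update
                target = self.U[(s, a)] / self.m
                if self.Q[(s, a)] - target >= 2 * self.eps1:
                    self.Q[(s, a)] = target + self.eps1   # successful update
                    self.t_star = self.t
                elif self.tsa[(s, a)] >= self.t_star:
                    self.learn[(s, a)] = False            # stop attempting for now
                self.tsa[(s, a)] = self.t
                self.U[(s, a)] = 0.0
                self.l[(s, a)] = 0
        elif self.tsa[(s, a)] < self.t_star:
            self.learn[(s, a)] = True           # some Q-value changed since last attempt

# Typical use in a simulation loop (environment code not shown; env_step is hypothetical):
#   agent = DelayedQAgent(n_actions=A, gamma=0.95, m=20, eps1=0.01)
#   a = agent.act(s); s2, r = env_step(s, a); agent.observe(s, a, r, s2)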
5. Analysis

We briefly address space and computational complexity before focusing on analyzing the sample complexity of Delayed Q-learning.

5.1. Space and Computational Complexity

An implementation of Delayed Q-learning, as in Section 4.4, can be achieved with O(SA) space complexity². With use of a priority queue for choosing actions with maximum value, the algorithm can achieve O(ln A) computational complexity per timestep. Asymptotically, Delayed Q-learning's computational and space complexity are on par with those of Q-learning. In contrast, the Rmax algorithm, a standard model-based method, has worst-case space complexity of Θ(S^2 A) and computational complexity of Ω(S^2 A) per experience.

² We measure complexity assuming individual numbers require unit storage and can be manipulated arithmetically in unit time. Removing this assumption increases space and computational complexities by logarithmic factors.

5.2. Sample Complexity

The main result of this section, whose proof is provided in Section 5.2.1, is that the Delayed Q-learning algorithm is PAC-MDP:

Theorem 1 Let M be any MDP and let ε and δ be two positive real numbers. If Delayed Q-learning is executed on MDP M, then it will follow an ε-optimal policy on all but

  O( (SA / (ε^4 (1 − γ)^8)) ln(1/δ) ln(1/(ε(1 − γ))) ln(SA/(δε(1 − γ))) )

timesteps, with probability at least 1 − δ.

To analyze the sample complexity of Delayed Q-learning, we first bound the number of successful updates. By Condition 2, there can be no more than

  κ := 1 / ((1 − γ)ε_1)    (4)

successful updates of a fixed state-action pair (s, a). This bound follows from the fact that Q(s, a) is initialized to 1/(1 − γ) and that every successful update of Q(s, a) results in a decrease of at least ε_1. Also, by our assumption of non-negative rewards, it is impossible for any update to result in a negative Q-value estimate. Thus, the total number of successful updates is at most SAκ.

Now, consider the number of attempted updates for a single state-action pair (s, a). At the beginning of learning, LEARN(s, a) = true, which means that once (s, a) has been experienced m times, an attempted update will occur. After that, a successful update of some Q-value estimate must take place for LEARN(s, a) to be set to true. Therefore, there can be at most 1 + SAκ attempted updates of (s, a). Hence, there are at most

  SA(1 + SAκ)    (5)

total attempted updates.

During timestep t of learning, we define K_t to be the set of all state-action pairs (s, a) such that:

  Q_t(s, a) − ( R(s, a) + γ ∑_{s'} T(s'|s, a) V_t(s') ) ≤ 3ε_1.    (6)

Observe that K_t is defined by the true transition and reward functions T and R, and therefore cannot be known to the learner.

Now, consider the following statement:

Assumption A1 Suppose an attempted update of state-action pair (s, a) occurs at time t, and that the m most recent experiences of (s, a) happened at times k_1 < k_2 < ··· < k_m = t. If (s, a) ∉ K_{k_1}, then the attempted update will be successful.

During any given infinite-length execution of Delayed Q-learning, the statement (A1) may be true (all attempted updates with (s, a) ∉ K_{k_1} are successful) or it may be broken (some unsuccessful update may occur when (s, a) ∉ K_{k_1}). When (s, a) ∉ K_{k_1}, as above, our value function estimate Q(s, a) is very inconsistent with our other value function estimates. Thus, we would expect our next attempted update to succeed. The next lemma shows this intuition is valid. Specifically, with probability at least 1 − δ/3, A1 will be true. We are now ready to specify a value for m:

  m := ln( 3SA(1 + SAκ)/δ ) / ( 2ε_1^2 (1 − γ)^2 ).    (7)
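The parameter definitions in Equations 4 and 7 are straightforward to evaluate. The sketch below (illustrative numbers only, our own helper) computes κ, m, and the attempted-update bound SA(1 + SAκ), using the setting ε_1 = ε(1 − γ)/9 that appears later in the proof of Theorem 1. The resulting values are very large, reflecting the worst-case constants of the analysis rather than practical tuning.

from math import ceil, log

def delayed_q_parameters(S, A, epsilon, delta, gamma):
    """Compute eps1, kappa (Eq. 4), the attempted-update bound (Eq. 5), and m (Eq. 7)."""
    eps1 = epsilon * (1 - gamma) / 9
    kappa = 1 / ((1 - gamma) * eps1)
    attempted = S * A * (1 + S * A * kappa)
    m = ceil(log(3 * S * A * (1 + S * A * kappa) / delta) / (2 * eps1**2 * (1 - gamma)**2))
    return eps1, kappa, attempted, m

# Example: a 100-state, 4-action MDP with epsilon = 0.1, delta = 0.05, gamma = 0.9.
print(delayed_q_parameters(S=100, A=4, epsilon=0.1, delta=0.05, gamma=0.9))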
Lemma 1 The probability that A1 is violated during execution of Delayed Q-learning is at most δ/3.

Proof sketch: Fix any timestep k_1 (and the complete history of the agent up to k_1) satisfying the following: (s, a) ∉ K_{k_1} is to be experienced by the agent on timestep k_1 and, if (s, a) is experienced m − 1 more times after timestep k_1, then an attempted update will result. Let Q = [(s[1], r[1]), ..., (s[m], r[m])] ∈ (S × R)^m be any sequence of m next-state and immediate reward tuples. Due to the Markov assumption, whenever the agent is in state s and chooses action a, the resulting next-state and immediate reward are chosen independently of the history of the agent. Thus, the probability that (s, a) is experienced m − 1 more times and that the resulting next-state and immediate reward sequence equals Q is at most the probability that Q is obtained by m independent draws from the transition and reward distributions (for (s, a)). Therefore, it suffices to prove this lemma by showing that the probability that a random sequence Q could cause an unsuccessful update of (s, a) is at most δ/3. We prove this statement next.

Suppose that m rewards, r[1], ..., r[m], and m next states, s[1], ..., s[m], are drawn independently from the reward and transition distributions, respectively, for (s, a). By a straightforward application of the Hoeffding bound (with random variables X_i := r[i] + γ V_{k_1}(s[i])), it can be shown that our choice of m guarantees that

  | (1/m) ∑_{i=1}^{m} ( r[i] + γ V_{k_1}(s[i]) ) − E[X_1] | < ε_1

holds with probability at least 1 − δ/(3SA(1 + SAκ)). If it does hold and an attempted update is performed for (s, a) using these m samples, then the resulting update will succeed. To see the claim's validity, suppose that (s, a) is experienced at times k_1 < k_2 < ··· < k_m = t and at time k_i the agent is transitioned to state s[i] and receives reward r[i] (causing an attempted update at time t). Then, we have that

  Q_t(s, a) − (1/m) ∑_{i=1}^{m} ( r[i] + γ V_{k_i}(s[i]) ) > Q_t(s, a) − E[X_1] − ε_1 > 2ε_1.

We have used the fact that V_{k_i}(s') ≤ V_{k_1}(s') for all s' and i = 1, ..., m. Therefore, with high probability, Condition 2 will be satisfied and the attempted update of Q(s, a) at time k_m will succeed.

Finally, we extend our argument, using the union bound, to all possible timesteps k_1 satisfying the condition above. The number of such timesteps is bounded by the same bound we showed for the number of attempted updates (SA(1 + SAκ)). □

The next lemma states that, with high probability, Delayed Q-learning will maintain optimistic Q-values.

Lemma 2 During execution of Delayed Q-learning, Q_t(s, a) ≥ Q^*(s, a) holds for all timesteps t and state-action pairs (s, a), with probability at least 1 − δ/3.

Proof sketch: It can be shown, by a similar argument as in the proof of Lemma 1, that (1/m) ∑_{i=1}^{m} ( r_{k_i} + γ V^*(s_{k_i}) ) > Q^*(s, a) − ε_1 holds, for all attempted updates, with probability at least 1 − δ/3. Assuming this equation does hold, the proof is by induction on the timestep t. For the base case, note that Q_1(s, a) = 1/(1 − γ) ≥ Q^*(s, a) for all (s, a). Now, suppose the claim holds for all timesteps less than or equal to t. Thus, we have that Q_t(s, a) ≥ Q^*(s, a) and V_t(s) ≥ V^*(s) for all (s, a).

Suppose s is the tth state reached and a is the action taken at time t. If it doesn't result in an attempted update or it results in an unsuccessful update, then no Q-value estimates change, and we are done. Otherwise, by Equation 1, we have that Q_{t+1}(s, a) = (1/m) ∑_{i=1}^{m} ( r_{k_i} + γ V_{k_i}(s_{k_i}) ) + ε_1 ≥ (1/m) ∑_{i=1}^{m} ( r_{k_i} + γ V^*(s_{k_i}) ) + ε_1 ≥ Q^*(s, a), by the induction hypothesis and an application of the equation from above. □

Lemma 3 If Assumption A1 holds, then the following statement holds: If an unsuccessful update occurs at time t and LEARN_{t+1}(s, a) = false, then (s, a) ∈ K_{t+1}.

Proof: Suppose an attempted update of (s, a) occurs at time t. Let s_{k_1}, s_{k_2}, ..., s_{k_m} be the m most recent next-states resulting from executing action a from state s at times k_1 < k_2 < ··· < k_m = t, respectively. By A1, if (s, a) ∉ K_{k_1}, then the update will be successful. Now, suppose that (s, a) ∈ K_{k_1} but that (s, a) ∉ K_{k_i} for some i ∈ {2, ..., m}. In this case, the attempted update at time k_m may be unsuccessful. However, some Q-value estimate was successfully updated between time k_1 and time k_m (otherwise K_{k_1} would equal K_{k_i}). Thus, by the rules of Section 4.3, LEARN(s, a) will be set to true after this unsuccessful update (LEARN_{t+1}(s, a) will be true). □

The following lemma bounds the number of timesteps t in which a state-action pair (s, a) ∉ K_t is experienced.

Lemma 4 The number of timesteps t such that a state-action pair (s, a) ∉ K_t is experienced is at most 2mSAκ.

Proof: Suppose (s, a) ∉ K_t is experienced at time t and LEARN_t(s, a) = false (implying the last attempted update was unsuccessful). By Lemma 3, we have that (s, a) ∈ K_{t'+1}, where t' was the time of the last attempted update of (s, a). Thus, some successful update has occurred since time t' + 1. By the rules of Section 4.3, we have that LEARN(s, a) will be set to true and, by A1, the next attempted update will succeed.

Now, suppose that (s, a) ∉ K_t is experienced at time t and LEARN_t(s, a) = true. Within at most m more experiences of (s, a), an attempted update of (s, a) will occur. Suppose this attempted update takes place at time q and that the m most recent experiences of (s, a) happened at times k_1 < k_2 < ··· < k_m = q. By A1, if (s, a) ∉ K_{k_1}, the update will be successful. Otherwise, if (s, a) ∈ K_{k_1}, then some successful update must have occurred between times k_1 and t (since K_{k_1} ≠ K_t). Hence, even if the update is unsuccessful, LEARN(s, a) will remain true, (s, a) ∉ K_{q+1} will hold, and the next attempted update of (s, a) will be successful.

In either case, if (s, a) ∉ K_t, then within at most 2m more experiences of (s, a), a successful update of Q(s, a) will occur. Thus, reaching a state-action pair not in K_t at time t will happen at most 2mSAκ times. □

We will make use of the following lemma from Strehl and Littman (2005).

Lemma 5 (Generalized Induced Inequality) Let M be an MDP, K a set of state-action pairs, M' an MDP equal to M on K (identical transition and reward functions), π a policy, and T some positive integer. Let A_M be the event that a state-action pair not in K is encountered in a trial generated by starting from state s and following π for T timesteps in M. Then,

  V^π_M(s, T) ≥ V^π_{M'}(s, T) − Pr(A_M)/(1 − γ).

5.2.1. Proof of the Main Result

Proof of Theorem 1: Suppose Delayed Q-learning is run on MDP M. We assume that A1 holds and that Q_t(s, a) ≥ Q^*(s, a) holds for all timesteps t and state-action pairs (s, a). The probability that either one of these assumptions is broken is at most 2δ/3, by Lemmas 1 and 2.

Consider timestep t during learning. Let A_t be the non-stationary policy being executed by the learning algorithm. Let π_t be the current greedy policy, that is, for all states s, π_t(s) = argmax_a Q_t(s, a). Let s_t be the current state, occupied by the agent at time t. We define a new MDP, M'. This MDP is equal to M on K_t (identical transition and reward functions). For (s, a) ∉ K_t, we add a distinguished state S_{s,a} to the state space of M' and a transition of probability one to that state from s when taking action a. Furthermore, S_{s,a} self-loops on all actions with reward [Q_t(s, a) − R(s, a)](1 − γ)/γ (so that V^π_{M'}(S_{s,a}) = [Q_t(s, a) − R(s, a)]/γ and Q^π_{M'}(s, a) = Q_t(s, a), for any policy π). Let T = O( (1/(1 − γ)) ln(1/(ε_2(1 − γ))) ) be large enough so that |V^{π_t}_{M'}(s_t, T) − V^{π_t}_{M'}(s_t)| ≤ ε_2 (see Lemma 2 of Kearns and Singh (2002)). Let Pr(A_M) denote the probability of reaching a state-action pair (s, a) not in K_t, while executing policy A_t from state s_t in M for T timesteps. Let Pr(U) denote the probability of the algorithm performing a successful update on some state-action pair (s, a), while executing policy A_t from state s_t in M for T timesteps.

We have that

  V^{A_t}_M(s_t, T) ≥ V^{A_t}_{M'}(s_t, T) − Pr(A_M)/(1 − γ)
                    ≥ V^{π_t}_{M'}(s_t, T) − Pr(A_M)/(1 − γ) − Pr(U)/(1 − γ)
                    ≥ V^{π_t}_{M'}(s_t) − ε_2 − (Pr(A_M) + Pr(U))/(1 − γ).

The first step above follows from Lemma 5³. The second step follows from the fact that A_t behaves identically to π_t as long as no Q-value estimate updates are performed. The third step follows from the definition of T above.

³ Lemma 5 is valid for all policies, including non-stationary ones.

Now, consider two mutually exclusive cases. First, suppose that Pr(A_M) + Pr(U) ≥ ε_2(1 − γ), meaning that an agent following A_t will either perform a successful update in T timesteps, or encounter some (s, a) ∉ K_t in T timesteps with probability at least ε_2(1 − γ)/2 (since Pr(A_M or U) ≥ (Pr(A_M) + Pr(U))/2). The former event cannot happen more than SAκ times. By assumption, the latter event will happen no more than 2mSAκ times (see Lemma 4). Define ζ = (2m + 1)SAκ. Using the Hoeffding bound, after O( (ζT/(ε_2(1 − γ))) ln(1/δ) ) timesteps where Pr(A_M) + Pr(U) ≥ ε_2(1 − γ), every state-action pair will have been updated 1/(ε_1(1 − γ)) times, with probability at least 1 − δ/3, and no further updates will be possible. This fact implies that the number of timesteps t such that Pr(A_M) + Pr(U) ≥ ε_2(1 − γ) is bounded by O( (ζT/(ε_2(1 − γ))) ln(1/δ) ), with high probability.

Next, suppose that Pr(A_M) + Pr(U) < ε_2(1 − γ). We claim that the following holds for all states s:

  0 < V_t(s) − V^{π_t}_{M'}(s) ≤ 3ε_1/(1 − γ).    (8)

Recall that for all (s, a), either Q_t(s, a) = Q^{π_t}_{M'}(s, a) (when (s, a) ∉ K_t), or Q_t(s, a) − ( R(s, a) + γ ∑_{s'} T(s'|s, a) V_t(s') ) ≤ 3ε_1 (when (s, a) ∈ K_t). Note that V^{π_t}_{M'} is the solution to the following set of equations:

  V^{π_t}_{M'}(s) = R(s, π_t(s)) + γ ∑_{s'∈S} T(s'|s, π_t(s)) V^{π_t}_{M'}(s'),   if (s, π_t(s)) ∈ K_t,
  V^{π_t}_{M'}(s) = Q_t(s, π_t(s)),                                              if (s, π_t(s)) ∉ K_t.

The vector V_t is the solution to a similar set of equations except with some additional positive reward terms, each bounded by 3ε_1. Using these facts, we have that

  V^{A_t}_M(s_t) ≥ V^{A_t}_M(s_t, T)
                 ≥ V^{π_t}_{M'}(s_t, T) − ε_2 − (Pr(A_M) + Pr(U))/(1 − γ)
                 ≥ V^{π_t}_{M'}(s_t) − ε_2 − ε_2
                 ≥ V_t(s_t) − 3ε_1/(1 − γ) − 2ε_2
                 ≥ V^*(s_t) − 3ε_1/(1 − γ) − 2ε_2.

The third step follows from the fact that Pr(A_M) + Pr(U) < ε_2(1 − γ) and the fourth step from Equation 8. The last step made use of our assumption that V_t(s_t) ≥ V^*(s_t) always holds.

Finally, by setting ε_1 := ε(1 − γ)/9 and ε_2 := ε/3, we have that

  V^{A_t}_M(s_t, T) ≥ V^*(s_t) − ε

is true for all but

  O( (ζT/(ε_2(1 − γ))) ln(1/δ) ) = O( (SA/(ε^4(1 − γ)^8)) ln(1/δ) ln(1/(ε(1 − γ))) ln(SA/(δε(1 − γ))) )

timesteps, with probability at least 1 − δ. We guarantee a failure probability of at most δ by bounding the three sources of failure: from Lemmas 1, 2, and from the above application of Hoeffding's bound. Each of these will fail with probability at most δ/3. □

Ignoring log factors, the best sample complexity bound previously proven has been

  Õ( S^2 A / (ε^3(1 − γ)^3) )

for the Rmax algorithm as analyzed by Kakade (2003). Using the notation of Kakade (2003)⁴, our bound of Theorem 1 reduces to

  Õ( SA / (ε^4(1 − γ)^4) ).

It is clear that there is no strict improvement of the bounds, since a factor of S is being traded for one of 1/(ε(1 − γ)). Nonetheless, to the extent that the dependence on S and A is of primary importance, this tradeoff is a net improvement. We also note that the best lower bound known for the problem, due to Kakade (2003), is Ω̃( SA/(ε(1 − γ)) ).

⁴ The use of normalized value functions reduces the dependence on 1/(1 − γ).
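To get a feel for the trade-off discussed above, the following sketch (ours; constants and logarithmic factors are ignored, as in the Õ notation) compares the two leading terms for a few hypothetical problem sizes:

def delayed_q_leading_term(S, A, epsilon, gamma):
    """Leading term of Theorem 1 in the normalized notation: SA / (eps^4 (1-gamma)^4)."""
    return S * A / (epsilon**4 * (1 - gamma)**4)

def rmax_leading_term(S, A, epsilon, gamma):
    """Leading term of the previous best (Rmax) bound: S^2 A / (eps^3 (1-gamma)^3)."""
    return S**2 * A / (epsilon**3 * (1 - gamma)**3)

for S in (10, 1000):
    d = delayed_q_leading_term(S, A=4, epsilon=0.1, gamma=0.9)
    r = rmax_leading_term(S, A=4, epsilon=0.1, gamma=0.9)
    print(S, d, r, d < r)   # Delayed Q-learning wins once S exceeds 1/(eps*(1-gamma)), here 100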

Our analysis of Delayed Q-learning required that γ be less than 1. The analyses of Kakade (2003) and Kearns and Singh (2002), among others, also considered the case of γ = 1. Here, instead of evaluating a policy with respect to the infinite horizon, only the next H action-choices of the agent contribute to the value function. See Kakade (2003) for a discussion of how to evaluate hard horizon policies in an online exploration setting. For completeness, we also analyzed a version of Delayed Q-learning that works in this setting. We find that the agent will follow an ε-optimal policy for horizon H on all but Õ(SAH^5/ε^4) timesteps, with probability at least 1 − δ. In terms of the dependence on the number of states (S), this bound is an improvement (from quadratic to linear) over previous bounds.

6. Related Work

There has been a great deal of theoretical work analyzing RL algorithms. Early results include proving that under certain conditions various algorithms can, in the limit, compute the optimal value function, from which the optimal policy can be extracted (Watkins & Dayan, 1992). These convergence results make no performance guarantee after only a finite amount of experience. Even-Dar and Mansour (2003) studied the convergence rate of Q-learning. They showed that, under a certain assumption, Q-learning converges to a near-optimal value function in a polynomial number of timesteps. The result requires input of an exploration policy that, with high probability, tries every state-action pair every L timesteps (for some polynomial L). Such a policy may be hard to find in some MDPs and is impossible in others. The work by Fiechter (1994) proves that efficient learning (PAC) is achievable, via a model-based algorithm, when the agent has an action that resets it to a distinguished start state.
Other recent work has shown that various model-based algorithms, including E^3 (Kearns & Singh, 2002), Rmax (Brafman & Tennenholtz, 2002), and MBIE (Strehl & Littman, 2005), are PAC-MDP. The bound from Theorem 1 improves upon those bounds when only the dependence on S and A is considered. Delayed Q-learning is also significantly more computationally efficient than these algorithms.

Delayed Q-learning can be viewed as an approximation of the real-time dynamic programming algorithm (Barto et al., 1995), with an added exploration bonus (of ε_1). The algorithm and its analysis are also similar to phased Q-learning and its analysis (Kearns & Singh, 1999). In both of the above works, exploration is not completely dealt with. In the former, the transition matrix is given as input to the agent. In the latter, an idealized exploration policy, one that samples every state-action pair simultaneously, is assumed to be provided to the agent.

7. Conclusion

We presented Delayed Q-learning, a provably efficient model-free reinforcement-learning algorithm. Its analysis solves an important open problem in the community. Future work includes closing the gap between the upper and lower bounds on PAC-MDP learning (see Section 5.2.1). More important is how to extend the results, using generalization, to richer world models with an infinite number of states and actions.

Acknowledgments

Thanks to the National Science Foundation (IIS-0325281). We also thank Yishay Mansour, Sham M. Kakade, Satinder Singh, and our anonymous reviewers for suggestions. Eric Wiewiora contributed to this research as an intern at TTI-Chicago.

References

Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72, 81–138.

Brafman, R. I., & Tennenholtz, M. (2002). R-MAX—a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 213–231.

Even-Dar, E., & Mansour, Y. (2003). Learning rates for Q-learning. Journal of Machine Learning Research, 5, 1–25.

Fiechter, C.-N. (1994). Efficient reinforcement learning. Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory (pp. 88–97). Association for Computing Machinery.

Kakade, S. M. (2003). On the sample complexity of reinforcement learning. Doctoral dissertation, Gatsby Computational Neuroscience Unit, University College London.

Kearns, M., & Singh, S. (1999). Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems 11 (pp. 996–1002). The MIT Press.

Kearns, M. J., & Singh, S. P. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49, 209–232.

Strehl, A. L., & Littman, M. L. (2005). A theoretical analysis of model-based interval estimation. Proceedings of the Twenty-Second International Conference on Machine Learning (ICML-05) (pp. 857–864).

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. The MIT Press.

Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.