PAC Model-Free Reinforcement Learning Alexander L. Strehl
[email protected] Lihong Li
[email protected] Department of Computer Science, Rutgers University, Piscataway, NJ 08854 USA Eric Wiewiora
[email protected] Computer Science and Engineering Department University of California, San Diego John Langford
[email protected] TTI-Chicago, 1427 E 60th Street, Chicago, IL 60637 USA Michael L. Littman
[email protected] Department of Computer Science, Rutgers University, Piscataway, NJ 08854 USA Abstract When evaluating RL algorithms, there are three es- For a Markov Decision Process with finite sential traits to consider: space complexity, computa- state (size S) and action spaces (size A per tional complexity, and sample complexity. We define state), we propose a new algorithm—Delayed a timestep to be a single interaction with the environ- Q-Learning. We prove it is PAC, achieving ment. Space complexity measures the amount of mem- near optimal performance except for O˜(SA) ory required to implement the algorithm while compu- timesteps using O(SA) space, improving on tational complexity measures the amount of operations the O˜(S2A) bounds of best previous algo- needed to execute the algorithm, per timestep. Sam- rithms. This result proves efficient reinforce- ple complexity measures the amount of timesteps for ment learning is possible without learning a which the algorithm does not behave near optimally model of the MDP from experience. Learning or, in other words, the amount of experience it takes takes place from a single continuous thread of to learn to behave well. experience—no resets nor parallel sampling We will call algorithms whose sample complexity can is used.