
The Sample-Complexity of General Reinforcement Learning

Tor Lattimore    [email protected]    Australian National University
Marcus Hutter    [email protected]    Australian National University
Peter Sunehag    [email protected]    Australian National University

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

We present a new algorithm for general reinforcement learning where the true environment is known to belong to a finite class of N arbitrary models. The algorithm is shown to be near-optimal for all but O(N log² N) time-steps with high probability. Infinite classes are also considered, where we show that compactness is a key criterion for determining the existence of uniform sample-complexity bounds. A matching lower bound is given for the finite case.

1. Introduction

Reinforcement Learning (RL) is the task of learning policies that lead to nearly-optimal rewards where the environment is unknown. One metric of the efficiency of an RL algorithm is sample-complexity, which is a high-probability upper bound on the number of time-steps when that algorithm is not nearly-optimal that holds for all environments in some class. Such bounds are typically shown for very specific classes of environments, such as (partially observable/factored) Markov Decision Processes (MDPs) and bandits. We consider more general classes of environments where at each time-step an agent takes an action a ∈ A, whereupon it receives a reward r ∈ [0, 1] and an observation o ∈ O, which are generated stochastically by the environment and may depend arbitrarily on the entire history sequence.

We present a new reinforcement learning algorithm, named Maximum Exploration Reinforcement Learning (MERL), that accepts as input a finite set M := {ν1, · · · , νN} of arbitrary environments, an accuracy ε, and a confidence δ. The main result is that MERL has a sample-complexity of

    Õ( (N / (ε²(1 − γ)³)) log²( N / (δε(1 − γ)) ) ),

where 1/(1 − γ) is the effective horizon determined by discount rate γ. We also consider the case where M is infinite, but compact with respect to a particular topology. In this case, a variant of MERL has the same sample-complexity as above, but where N is replaced by the size of the smallest ε-cover. A lower bound is also given that matches the upper bound except for logarithmic factors. Finally, if M is non-compact then in general no finite sample-complexity bound exists.

1.1. Related Work

Many authors have worked on the sample-complexity of RL in various settings. The simplest case is the multiarmed bandit problem, which has been extensively studied with varying assumptions. The typical measure of efficiency in the bandit literature is regret, but sample-complexity bounds are also known and sometimes used. The next step from bandits is finite-state MDPs, of which bandits are an example with only a single state. There are two main settings when MDPs are considered: the discounted case, where sample-complexity bounds are proven, and the undiscounted (average reward) case, where regret bounds are more typical. In the discounted setting the upper and lower bounds on sample-complexity are now extremely refined. See Strehl et al. (2009) for a detailed review of the popular algorithms and theorems. More recent work on closing the gap between upper and lower bounds is by Szita & Szepesvári (2010); Lattimore & Hutter (2012); Azar et al. (2012). In the undiscounted case it is necessary to make some form of ergodicity assumption, as without this regret bounds cannot be given. In this work we avoid ergodicity assumptions and discount future rewards.
Nevertheless, our algorithm borrows some tricks used by UCRL2 (Auer et al., 2010). Previous work for more general environment classes is somewhat limited. For factored MDPs there are known bounds; see (Chakraborty & Stone, 2011) and references therein. Even-Dar et al. (2005) give essentially unimprovable exponential bounds on the sample-complexity of learning in finite partially observable MDPs. Maillard et al. (2013) show regret bounds for undiscounted RL where the true environment is assumed to be finite, Markov and communicating, but where the state is not directly observable. As far as we know there has been no work on the sample-complexity of RL when environments are completely general, but asymptotic results have garnered some attention, with positive results by Hutter (2002); Ryabko & Hutter (2008); Sunehag & Hutter (2012) and (mostly) negative ones by Lattimore & Hutter (2011b). Perhaps the closest related work is (Diuk et al., 2009), which deals with a similar problem in the rather different setting of learning the optimal predictor from a class of N experts. They obtain an O(N log N) bound, which is applied to the problem of structure learning for discounted finite-state factored MDPs. Our work generalises this approach to the non-Markov case and compact model classes.

2. Notation

The definition of environments is borrowed from the work of ?, although the notation is slightly more formal to ease the application of martingale inequalities.

General. N = {0, 1, 2, · · · } is the natural numbers. For the indicator function we write [[x = y]] = 1 if x = y and 0 otherwise. We use ∧ and ∨ for logical and/or respectively. If A is a set then |A| is its size and A∗ is the set of all finite strings (sequences) over A. If x and y are sequences then x ⊏ y means that x is a prefix of y. Unless otherwise mentioned, log represents the natural logarithm. For a random variable X we write EX for its expectation. For x ∈ R, ⌈x⌉ is the ceiling function.

Environments and policies. Let A, O and R ⊂ R be finite sets of actions, observations and rewards respectively and H := A × O × R. H∞ is the set of infinite history sequences while H∗ := (A × O × R)∗ is the set of finite history sequences. If h ∈ H∗ then ℓ(h) is the number of action/observation/reward tuples in h. We write at(h), ot(h), rt(h) for the tth action/observation/reward of history sequence h. For h ∈ H∗, Γh := {h′ ∈ H∞ : h ⊏ h′} is the cylinder set. Let F := σ({Γh : h ∈ H∗}) and Ft := σ({Γh : h ∈ H∗ ∧ ℓ(h) = t}) be σ-algebras. An environment µ is a set of conditional probability distributions over observation/reward pairs given the history so far. A policy π is a function π : H∗ → A. An environment and policy interact sequentially to induce a measure, Pµ,π, on the filtered probability space (H∞, F, {Ft}). For convenience, we abuse notation and write Pµ,π(h) := Pµ,π(Γh). If h ⊏ h′ then conditional probabilities are Pµ,π(h′|h) := Pµ,π(h′)/Pµ,π(h). Rt(h; d) := Σ_{k=t}^{t+d} γ^{k−t} rk(h) is the d-step return function and Rt(h) := lim_{d→∞} Rt(h; d). Given history ht with ℓ(ht) = t, the value function is defined by V^π_µ(ht; d) := E[Rt(h; d) | ht], where the expectation is taken with respect to Pµ,π(·|ht), and V^π_µ(ht) := lim_{d→∞} V^π_µ(ht; d). The optimal policy for environment µ is π*_µ := arg max_π V^π_µ, which with our assumptions is known to exist (Lattimore & Hutter, 2011a). The value of the optimal policy is V*_µ := V^{π*_µ}_µ. In general, µ denotes the true environment while ν is a model. π will typically be the policy of the algorithm under consideration. Q*_µ(h, a) is the value in history h of following policy π*_µ except for the first time-step, when action a is taken. M is a set of environments (models).

Sample-complexity. Policy π is ε-optimal in history h and environment µ if V*_µ(h) − V^π_µ(h) ≤ ε. The sample-complexity of a policy π in environment class M is the smallest Λ such that, with high probability, π is ε-optimal for all but Λ time-steps for all µ ∈ M. Define L^ε_{µ,π} : H∞ → N ∪ {∞} to be the number of time-steps when π is not ε-optimal,

    L^ε_{µ,π}(h) := Σ_{t=1}^∞ [[ V*_µ(ht) − V^π_µ(ht) > ε ]],

where ht is the length-t prefix of h. The sample-complexity of policy π is Λ with respect to accuracy ε and confidence 1 − δ if P{L^ε_{µ,π}(h) > Λ} < δ for all µ ∈ M.
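The short sketch below illustrates two of the quantities defined above, the d-step return Rt(h; d) and the counter L^ε_{µ,π}, restricted to a finite history prefix. The plain-list representation of rewards and the precomputed value sequences are assumptions made only for illustration; computing the value functions themselves is the hard part and is treated abstractly in the paper.

    def d_step_return(rewards, t, d, gamma):
        # R_t(h; d) = sum_{k=t}^{t+d} gamma^(k-t) r_k(h), truncated at the end
        # of the observed reward sequence (a sketch-level convenience).
        end = min(t + d, len(rewards) - 1)
        return sum(gamma ** (k - t) * rewards[k] for k in range(t, end + 1))

    def suboptimal_steps(v_star, v_pi, eps):
        # L^eps_{mu,pi} restricted to the observed prefix: the number of
        # time-steps t with V*_mu(h_t) - V^pi_mu(h_t) > eps. The two value
        # sequences are assumed to be supplied by an external routine.
        return sum(1 for vs, vp in zip(v_star, v_pi) if vs - vp > eps)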
3. Finite Case

We start with the finite case where the true environment is known to belong to a finite set of models, M. The Maximum Exploration Reinforcement Learning algorithm is model-based in the sense that it maintains a set, Mt ⊆ M, where models are eliminated once they become implausible. The algorithm operates in phases of exploration and exploitation, choosing to exploit if it knows all plausible environments are reasonably close under all optimal policies and to explore otherwise. This method of exploration essentially guarantees that MERL is nearly optimal whenever it is exploiting and that the number of exploration phases is limited with high probability. The main difficulty is specifying what it means to be plausible. Previous authors working on finite environments, such as MDPs or bandits, have removed models for which the transition probabilities are not sufficiently close to their empirical estimates.
In the more general setting this approach fails because states (histories) are never visited more than once, so sufficient empirical estimates cannot be collected. Instead, we eliminate environments if the reward we actually collect over time is not sufficiently close to the reward we expected given that environment.

Before giving the explicit algorithm, we explain the operation of MERL more formally in two parts. First we describe how it chooses to explore and exploit, and then how the model class is maintained. See Figure 1 for a diagram of how exploration and exploitation occur.

Exploring and exploiting. At each time-step t, MERL computes the pair of environments ν̄, ν̲ in the model class Mt and the policy π maximising the difference

    ∆ := V^π_ν̄(h; d) − V^π_ν̲(h; d),    d := (1/(1 − γ)) log( 8/(ε(1 − γ)) ).

If ∆ > ε/4, then MERL follows policy π for d time-steps, which we call an exploration phase. Otherwise, for one time-step it follows the optimal policy with respect to the first environment currently in the model class. Therefore, if MERL chooses to exploit, then all policies and environments in the model class lead to similar values, which implies that exploiting is near-optimal. If MERL explores, then either V^π_ν̄(h; d) − V^π_µ(h; d) > ε/8 or V^π_µ(h; d) − V^π_ν̲(h; d) > ε/8, which will allow us to apply concentration inequalities to eventually eliminate either ν̄ (the upper bound) or ν̲ (the lower bound).

The model class. An exploration phase is a κ-exploration phase if ∆ ∈ [2^{κ−2}ε, 2^{κ−1}ε), where

    κ ∈ K := { 0, 1, 2, · · · , ⌈log₂(1/(ε(1 − γ)))⌉ + 2 }.

For each environment ν ∈ M and each κ ∈ K, MERL associates a counter E(ν, κ), which is incremented at the start of a κ-exploration phase if ν ∈ {ν̄, ν̲}. At the end of each κ-exploration phase MERL calculates the discounted return actually received during that exploration phase, R ∈ [0, 1/(1 − γ)], and records the values

    X(ν̄, κ) := (1 − γ)( V^π_ν̄(h; d) − R )
    X(ν̲, κ) := (1 − γ)( R − V^π_ν̲(h; d) ),

where h is the history at the start of the exploration phase. So X(ν̄, κ) is the difference between the return expected if the true model were ν̄ and the actual return, and X(ν̲, κ) is the difference between the actual return and the expected return if the true model were ν̲. Since the expected value of R is V^π_µ(h; d), and ν̄, ν̲ are upper and lower bounds respectively, the expected values of both X(ν̄, κ) and X(ν̲, κ) are non-negative, and at least one of them has expectation larger than ε(1 − γ)/8. MERL eliminates environment ν from the model class if the cumulative sum of X(ν, κ) over all exploration phases where ν ∈ {ν̄, ν̲} is sufficiently large, but it tests this condition only when the count E(ν, κ) has increased enough since the last test. Let αj := ⌈α^j⌉ for α ∈ (1, 2) as defined in the algorithm. MERL only tests whether ν should be removed from the model class when E(ν, κ) = αj for some j ∈ N. This restriction ensures that tests are not performed too often, which allows us to apply the union bound without losing too much. Note that if the true environment µ ∈ {ν̄, ν̲}, then E_{µ,π}X(µ, κ) = 0, which will ultimately be enough to ensure that µ remains in the model class with high probability. The reason for using κ to bucket exploration phases will become apparent later, in the proof of Lemma 3.

Algorithm 1 MERL
 1: Inputs: ε, δ and M := {ν1, ν2, · · · , νN}.
 2: t = 1 and h the empty history
 3: d := (1/(1 − γ)) log(8/(ε(1 − γ))) and δ1 := δ/(32|K|N^{3/2})
 4: α := 4√N/(4√N − 1) and αj := ⌈α^j⌉
 5: E(ν, κ) := 0, ∀ν ∈ M and κ ∈ N
 6: loop
 7:   repeat
 8:     Π := {π*_ν : ν ∈ M}
 9:     ν̄, ν̲, π := arg max_{ν̄,ν̲∈M, π∈Π} V^π_ν̄(h; d) − V^π_ν̲(h; d)
10:     if ∆ := V^π_ν̄(h; d) − V^π_ν̲(h; d) > ε/4 then
11:       h̃ = h and R = 0
12:       for j = 0 → d do
13:         R = R + γ^j rt(h)
14:         Act(π)
15:       end for
16:       κ := min{ κ ∈ N : ∆ > 2^{κ−2}ε }
17:       E(ν̄, κ) = E(ν̄, κ) + 1 and E(ν̲, κ) = E(ν̲, κ) + 1
18:       X(ν̄, κ)_{E(ν̄,κ)} = (1 − γ)(V^π_ν̄(h̃; d) − R)
19:       X(ν̲, κ)_{E(ν̲,κ)} = (1 − γ)(R − V^π_ν̲(h̃; d))
20:     else
21:       i := min{i : νi ∈ M} and Act(π*_{νi})
22:     end if
23:   until ∃ν ∈ M, κ, j ∈ N such that E(ν, κ) = αj and Σ_{i=1}^{E(ν,κ)} X(ν, κ)_i ≥ sqrt( 2E(ν, κ) log(E(ν, κ)/δ1) )
24:   M = M − {ν}
25: end loop
26: function Act(π)
27:   Take action at = π(h) and receive reward rt and observation ot from the environment
28:   t ← t + 1 and h ← h at ot rt
29: end function

Subscripts. For clarity, we have omitted subscripts in the pseudo-code above. In the analysis we will refer to Et(ν, κ) and Mt for the values of E(ν, κ) and M respectively at time-step t. We write νt for the νi in line 21 and similarly πt := π*_{νt}.
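To make the control flow of Algorithm 1 concrete, the following sketch shows the explore/exploit decision and the elimination test in Python. The model interface (optimal_policy(), value(h, policy, d)), the use of a set for the model class and the other names are assumptions made purely for illustration; computing these value functions exactly is in general intractable, as discussed in Section 7.

    import math
    from itertools import product

    def choose_phase(models, policies, h, eps, d):
        # Decision rule of lines 8-10: find the pair of environments and the
        # policy maximising the value gap Delta, and explore iff Delta > eps/4.
        delta, nu_up, nu_low, pi = max(
            ((up.value(h, p, d) - low.value(h, p, d), up, low, p)
             for up, low, p in product(models, models, policies)),
            key=lambda x: x[0])
        return (delta > eps / 4), nu_up, nu_low, pi, delta

    def eliminate(models, stats, counts, test_points, delta1):
        # Elimination test of line 23: when the counter E(nu, kappa) hits a
        # test point alpha_j, remove nu if the cumulative statistic exceeds
        # sqrt(2 E log(E / delta1)). `stats[(nu, kappa)]` holds the recorded
        # X(nu, kappa)_i values and `models` is assumed to be a set.
        for (nu, kappa), xs in list(stats.items()):
            e = counts[(nu, kappa)]
            if e in test_points and sum(xs) >= math.sqrt(2 * e * math.log(e / delta1)):
                models.discard(nu)
        return models

A full implementation would additionally maintain the per-(ν, κ) counters, record X(ν̄, κ) and X(ν̲, κ) at the end of each exploration phase as in lines 16–19, and act with the optimal policy of the first remaining model when exploiting.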

Phases. An exploration phase is a period of exactly d time-steps, starting at time-step t if

1. t is not currently in an exploration phase;
2. ∆ := V^π_ν̄(ht; d) − V^π_ν̲(ht; d) > ε/4.

We say it is a ν-exploration phase if ν = ν̄ or ν = ν̲, and a κ-exploration phase if ∆ ∈ [2^{κ−2}ε, 2^{κ−1}ε) ≡ [εκ, 2εκ), where εκ := 2^{κ−2}ε. It is a (ν, κ)-exploration phase if it satisfies both of the previous statements. We say that MERL is exploiting at time-step t if t is not in an exploration phase. A failure phase is also a period of d time-steps and starts in time-step t if

1. t is not in an exploration phase or an earlier failure phase;
2. V*_µ(ht) − V^π_µ(ht) > ε.

Unlike exploration phases, the algorithm does not depend on the failure phases, which are only used in the analysis. An exploration or failure phase starting at time-step t is proper if µ ∈ Mt. The effective horizon d is chosen to ensure that V^π_µ(h; d) ≥ V^π_µ(h) − ε/8 for all π, µ and h.

Probabilities. For the remainder of this section, unless otherwise mentioned, all probabilities and expectations are with respect to Pµ,π, where π is the policy of Algorithm 1 and µ ∈ M is the true environment.

Analysis. Define Gmax := (2^16 N|K| / (ε²(1 − γ)²)) log²( 2^9 N / (ε²(1 − γ)²δ) ) and Emax := (2^16 N / (ε²(1 − γ)²)) log²( 2^9 N / (ε²(1 − γ)²δ1) ), which are high-probability bounds on the number of failure and exploration phases respectively.

Theorem 1. Let µ ∈ M = {ν1, ν2, · · · , νN} be the true environment and π be the policy of Algorithm 1. Then

    P{ L^ε_{µ,π}(h) ≥ d · (Gmax + Emax) } ≤ δ.

[Figure 1. Exploration/exploitation/failure phases, d = 4. The time-line alternates between exploiting, a failure phase (where V*_µ(ht) − V^π_µ(ht) > ε) and exploration phases (here with κ = 4 and κ = 2), which are triggered when V^π_ν̄(h; d) − V^π_ν̲(h; d) > ε/4.]

If lower-order logarithmic factors are dropped, then the sample-complexity bound of MERL given by Theorem 1 is Õ( (N / (ε²(1 − γ)³)) log²( N / (δε(1 − γ)) ) ).
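For a sense of scale, the snippet below evaluates the horizon d, the number of buckets |K|, δ1 and the constants Emax and Gmax as they are written above for one example choice of parameters. The formulas are transcribed from the text (with d rounded up to an integer number of time-steps), so any looseness in the constants carries over; only their order of magnitude matters for Theorem 1.

    import math

    def merl_constants(N, eps, gamma, delta):
        # Constants as stated in Section 3 (transcribed, not re-derived).
        d = math.ceil((1 / (1 - gamma)) * math.log(8 / (eps * (1 - gamma))))
        K = math.ceil(math.log2(1 / (eps * (1 - gamma)))) + 3  # kappa ranges over 0, ..., ceil(log2(1/(eps(1-gamma)))) + 2
        delta1 = delta / (32 * K * N ** 1.5)
        e_max = (2 ** 16 * N / (eps ** 2 * (1 - gamma) ** 2)) * math.log(2 ** 9 * N / (eps ** 2 * (1 - gamma) ** 2 * delta1)) ** 2
        g_max = (2 ** 16 * N * K / (eps ** 2 * (1 - gamma) ** 2)) * math.log(2 ** 9 * N / (eps ** 2 * (1 - gamma) ** 2 * delta)) ** 2
        return d, K, e_max, g_max, d * (g_max + e_max)

    # Example parameters: N = 10 environments, eps = 0.1, gamma = 0.9, delta = 0.05.
    print(merl_constants(10, 0.1, 0.9, 0.05))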

Test statistics. We have previously remarked that most traditional model-based algorithms with sample-complexity guarantees record statistics about the transition probabilities of an environment. Since the environments are assumed to be finite, these statistics eventually become accurate (or irrelevant) and the standard theory on the concentration of measure can be used for hypothesis testing. In the general case, environments can be infinite and so we cannot collect useful statistics about individual transitions. Instead, we use the statistics X(ν, κ), which depend on the value function rather than on individual transitions. These satisfy E_{µ,π}[X(µ, κ)_i] = 0, while E_{µ,π}[X(ν, κ)_i] ≥ 0 for all ν ∈ Mt. Testing is then performed on the statistic Σ_{i=1}^{αj} X(ν, κ)_i, which will satisfy certain martingale inequalities.

Updates. As MERL explores, it updates its model class, Mt ⊆ M, by removing environments that have become implausible. This is comparable to the updating of confidence intervals for algorithms such as MBIE (Strehl & Littman, 2005) or UCRL2 (Auer et al., 2010). In MBIE, the confidence interval about the empirical estimate of a transition probability is updated after every observation. A slight theoretical improvement used by UCRL2 is to only update when the number of samples of a particular statistic doubles. The latter trick allows a cheap application of the union bound over all updates without wasting too many samples. For our purposes, however, we need to update slightly more often than the doubling trick would allow. Instead, we check whether an environment should be eliminated when the number of (ν, κ)-exploration phases is exactly αj for some j, where αj := ⌈α^j⌉ and α := 4√N/(4√N − 1) ∈ (1, 2). Since the growth of αj is still exponential, the union bound will still be applicable.

Theorem 1 follows from three lemmas.

Lemma 2. µ ∈ Mt for all t with probability 1 − δ/4.

Lemma 3. The number of proper failure phases is bounded by

    Gmax := (2^16 N|K| / (ε²(1 − γ)²)) log²( 2^9 N / (ε²(1 − γ)²δ) )

with probability at least 1 − δ/2.

Lemma 4. The number of proper exploration phases is bounded by

    Emax := (2^16 N / (ε²(1 − γ)²)) log²( 2^9 N / (ε²(1 − γ)²δ1) )

with probability at least 1 − δ/4.

Proof of Theorem 1. Applying the union bound to the results of Lemmas 2, 3 and 4 gives the following with probability at least 1 − δ:

1. There are no non-proper exploration or failure phases.
2. The number of proper exploration phases is at most Emax.

3. The number of proper failure phases is at most Gmax.

If π is not ε-optimal at time-step t then t is either in an exploration or a failure phase. Since both are exactly d time-steps long, the total number of time-steps when π is sub-optimal is at most d · (Gmax + Emax). □

We now turn our attention to proving Lemmas 2, 3 and 4. Of these, Lemma 4 is more conceptually challenging, while Lemma 3 is intuitively unsurprising but technically difficult.

Proof of Lemma 2. If µ is removed from M, then there exists a κ and j ∈ N such that

    Σ_{i=1}^{αj} X(µ, κ)_i ≥ sqrt( 2αj log(αj/δ1) ).

Fix a κ ∈ K, let E∞(µ, κ) := lim_t Et(µ, κ) and Xi := X(µ, κ)_i. Define a sequence of random variables

    X̃i := Xi if i ≤ E∞(µ, κ), and X̃i := 0 otherwise.

Now we claim that Bn := Σ_{i=1}^n X̃i is a martingale with |B_{i+1} − B_i| ≤ 1 and EBi = 0. That it is a martingale with zero expectation follows because if t is the time-step at the start of the exploration phase associated with variable Xi, then E[Xi|Ft] = 0. |B_{i+1} − B_i| ≤ 1 because discounted returns are bounded in [0, 1/(1 − γ)] and by the definition of Xi. For all j ∈ N we have by Azuma's inequality that

    P{ B_{αj} ≥ sqrt(2αj log(αj/δ1)) } ≤ δ1/αj.

Apply the union bound over all j:

    P{ ∃j ∈ N : B_{αj} ≥ sqrt(2αj log(αj/δ1)) } ≤ Σ_{j=1}^∞ δ1/αj.

Complete the result by the union bound over all κ, applying Lemma 10 (see Appendix) and the definition of δ1 to bound Σ_{κ∈K} Σ_{j=1}^∞ δ1/αj ≤ δ/4. □

We are now ready to give a high-probability bound on the number of proper exploration phases. If MERL starts a proper exploration phase at time-step t, then at least one of the following holds:¹

1. E[X(ν̄, κ)_{E(ν̄,κ)} | Ft] > ε(1 − γ)/8.
2. E[X(ν̲, κ)_{E(ν̲,κ)} | Ft] > ε(1 − γ)/8.

This contrasts with E[X(µ, κ)_{E(µ,κ)} | Ft] = 0, which ensures that µ remains in M for all time-steps. If one could know which of the above statements were true at each time-step, then it would be comparatively easy to show by means of Azuma's inequality that all environments that are not ε-close are quickly eliminated after O(1/(ε²(1 − γ)²)) ν-exploration phases, which would lead to the desired bound. Unfortunately though, the truth of (1) or (2) above cannot be determined, which greatly increases the complexity of the proof.

Proof of Lemma 4. Fix a κ ∈ K and let Emax,κ be a constant to be chosen later. Let ht be the history at the start of some κ-exploration phase. We say a (ν̄, κ)-exploration phase is ν̄-effective if

    E[X(ν̄, κ)_{E(ν̄,κ)} | Ft] ≡ (1 − γ)( V^π_ν̄(ht; d) − V^π_µ(ht; d) ) > (1 − γ)εκ/2,

and ν̲-effective if the same condition holds for ν̲. Now, since t is the start of a proper exploration phase we have that µ ∈ Mt and so

    V^π_ν̄(ht; d) ≥ V^π_µ(ht; d) ≥ V^π_ν̲(ht; d)   and   V^π_ν̄(ht; d) − V^π_ν̲(ht; d) > εκ.

Therefore every proper exploration phase is either ν̄-effective or ν̲-effective. Let Et,κ := Σ_ν Et(ν, κ), which is twice the number of κ-exploration phases at time t, and E∞,κ := lim_t Et,κ, which is twice the total number of κ-exploration phases. Let Ft(ν, κ) be the number of ν-effective (ν, κ)-exploration phases up to time-step t. Since each proper κ-exploration phase is either ν̄-effective or ν̲-effective or both, Σ_ν Ft(ν, κ) ≥ Et,κ/2. Applying Lemma 8 to y_ν := Et(ν, κ)/Et,κ and x_ν := Ft(ν, κ)/Et(ν, κ) shows that if E∞,κ > Emax,κ then there exist a t′ and a ν such that E_{t′,κ} = Emax,κ and

    F_{t′}(ν, κ)² / ( Emax,κ E_{t′}(ν, κ) ) ≥ 1/(4N),    (1)

which implies that

    F_{t′}(ν, κ) ≥ sqrt( Emax,κ E_{t′}(ν, κ) / (4N) ) ≥(a) E_{t′}(ν, κ)/sqrt(4N),    (2)

where (a) follows because Emax,κ = E_{t′,κ} ≥ E_{t′}(ν, κ). Let Z(ν) be the event that there exists a t′ satisfying (1). We will shortly show that P{Z(ν)} < δ/(4N|K|). Therefore

    P{E∞,κ > Emax,κ} ≤ P{∃ν : Z(ν)} ≤ Σ_{ν∈M} P{Z(ν)} ≤ δ/(4|K|).

Finally, take the union bound over all κ and let

    Emax := Σ_{κ∈K} (1/2) Emax,κ,

where we used (1/2)Emax,κ because Emax,κ is a high-probability upper bound on E∞,κ, which is twice the number of κ-exploration phases.

¹ Note that it is never the case that ν̄ = ν̲ at the start of an exploration phase, since in this case ∆ = 0.
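The argument above tests an environment only when its counter equals αj = ⌈α^j⌉ and relies on Lemma 10 (see Appendix), which bounds Σ_j 1/αj by 4√N. The short numerical check below illustrates the schedule and that bound for a few values of N; it is only a sanity check of the constants as reconstructed here, not part of the original analysis.

    import math

    def alpha_schedule(N, j_max=500):
        # Test points alpha_j = ceil(alpha^j) with alpha = 4*sqrt(N) / (4*sqrt(N) - 1).
        alpha = 4 * math.sqrt(N) / (4 * math.sqrt(N) - 1)
        return [math.ceil(alpha ** j) for j in range(1, j_max + 1)]

    for n in (2, 10, 100):
        partial_sum = sum(1 / a for a in alpha_schedule(n))
        # Lemma 10 asserts the full sum is at most 4 * sqrt(N).
        print(n, round(partial_sum, 3), "<=", round(4 * math.sqrt(n), 3))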

Bounding P{Z(ν)} < δ/(4N|K|). Fix a ν ∈ M and let X1, X2, · · · , X_{E∞(ν,κ)} be the sequence with Xi := X(ν, κ)_i, and let ti be the time-step at the start of the ith (ν, κ)-exploration phase.

Define a sequence

    Yi := Xi − E[Xi | F_{ti}] if i ≤ E∞(ν, κ), and Yi := 0 otherwise,

and let λ(E) := sqrt( 2E log(E/δ1) ). Now if Z(ν) occurs, then the largest time-step t′ ≤ t with E_{t′}(ν, κ) = αj for some j ∈ N is

    t′ := max{ t′ ≤ t : ∃j ∈ N s.t. αj = E_{t′}(ν, κ) },

which exists and satisfies:

1. E_{t′}(ν, κ) = αj for some j;
2. E∞(ν, κ) > E_{t′}(ν, κ);
3. F_{t′}(ν, κ) ≥ sqrt( E_{t′}(ν, κ) Emax,κ / (16N) );
4. E_{t′}(ν, κ) ≥ Emax,κ/(16N),

where parts 1 and 2 are straightforward and parts 3 and 4 follow by the definition of {αj}, which was chosen specifically for this part of the proof. Since E∞(ν, κ) > E_{t′}(ν, κ), at the end of the exploration phase starting at time-step t′, ν must remain in M. Therefore

    λ(αj) ≥(a) Σ_{i=1}^{αj} Xi ≥(b) Σ_{i=1}^{αj} Yi + εκ(1 − γ)F_{t′}(ν, κ)/2 ≥(c) Σ_{i=1}^{αj} Yi + (εκ(1 − γ)/8) sqrt( αj Emax,κ / N ),    (3)

where in (a) we used the definition of the confidence interval of MERL; in (b) we used the definition of Yi and the fact that EXi ≥ 0 for all i and EXi ≥ εκ(1 − γ)/2 if Xi is effective; and in (c) we used the lower bound on the number of effective ν-exploration phases, F_{t′}(ν, κ) (part 3 above). If

    Emax,κ := (2^11 N / (εκ²(1 − γ)²)) log²( 2^9 N / (ε²(1 − γ)²δ1) ),

then by applying Lemma 9 with a = 2^9 N/(εκ²(1 − γ)²) and b = 1/δ1 we obtain

    Emax,κ ≥ (2^9 N / (εκ²(1 − γ)²)) log( Emax,κ/δ1 ) ≥ (2^9 N / (εκ²(1 − γ)²)) log( αj/δ1 ).

Multiplying both sides by αj, rearranging, and using the definition of λ(αj) leads to

    (εκ(1 − γ)/8) sqrt( αj Emax,κ / N ) ≥ 2λ(αj).

Inserting this into Equation (3) shows that Z(ν) implies that there exists an αj such that Σ_{i=1}^{αj} Yi ≤ −λ(αj). Now, by the same argument as in the proof of Lemma 2, Bn := Σ_{i=1}^n Yi is a martingale with |B_{i+1} − B_i| ≤ 1. Therefore by Azuma's inequality

    P{ Σ_{i=1}^{αj} Yi ≤ −λ(αj) } ≤ δ1/αj.

Finally, apply the union bound over all j. □

Recall that if MERL is exploiting at time-step t, then πt is the optimal policy with respect to the first environment in the model class. To prove Lemma 3 we start by showing that in this case πt is nearly-optimal.

Lemma 5. Let t be a time-step and ht be the corresponding history. If µ ∈ Mt and MERL is exploiting (not exploring), then V*_µ(ht) − V^{πt}_µ(ht) ≤ 5ε/8.

Proof of Lemma 5. Since MERL is not exploring,

    V*_µ(ht) − V^{πt}_µ(ht) ≤(a) V*_µ(ht; d) − V^{πt}_µ(ht; d) + ε/8
                            ≤(b) V^{π*_µ}_{νt}(ht; d) − V^{πt}_{νt}(ht; d) + 5ε/8
                            ≤(c) 5ε/8,

where (a) follows by truncating the value function, (b) follows because µ ∈ Mt and MERL is exploiting, and (c) is true since πt is the optimal policy in νt. □

Lemma 5 is almost sufficient to prove Lemma 3. The only problem is that MERL only follows πt = π*_{νt} until there is an exploration phase. The idea of the proof of Lemma 3 is as follows:

1. If there is a low probability of entering an exploration phase within the next d time-steps following policy πt, then π is nearly as good as πt, which itself is nearly optimal by Lemma 5.
2. The number of time-steps when the probability of entering an exploration phase within the next d time-steps is high is unlikely to be too large before an exploration phase is triggered. Since there are not many exploration phases with high probability, there are also unlikely to be too many time-steps when π expects to enter one with high probability.

Before the proof of Lemma 3 we remark on an easier to prove (but weaker) version of Theorem 1. If MERL is exploiting, then Lemma 5 shows that V*_µ(h) − Q*_µ(h, π(h)) ≤ 5ε/8 < ε. Therefore, if we cared about the number of time-steps when this is not the case (rather than V*_µ − V^π_µ), then we would already be done by combining Lemmas 4 and 5.

Proof of Lemma 3. Let t be the start of a proper failure phase with corresponding history h. Therefore V*_µ(h) − V^π_µ(h) > ε. By Lemma 5, V*_µ(h) − V^π_µ(h) = V*_µ(h) − V^{πt}_µ(h) + V^{πt}_µ(h) − V^π_µ(h) ≤ 5ε/8 + V^{πt}_µ(h) − V^π_µ(h), and so

    V^{πt}_µ(h) − V^π_µ(h) ≥ 3ε/8.    (4)
We define the set Hκ ⊂ H∗ to be the set of extensions of h that trigger κ-exploration phases. Formally, Hκ ⊂ H∗ is the prefix-free set such that h′ ∈ Hκ if h ⊏ h′ and h′ triggers a κ-exploration phase for the first time since t.

Let Hκ,d := {h′ : h′ ∈ Hκ ∧ ℓ(h′) ≤ t + d}, which is the set of extensions of h that are at most d long and trigger κ-exploration phases. Therefore

    3ε/8 ≤(a) V^{πt}_µ(h) − V^π_µ(h)
         =(b) Σ_{κ∈K} Σ_{h′∈Hκ} P(h′|h) γ^{ℓ(h′)−t} ( V^{πt}_µ(h′) − V^π_µ(h′) )
         ≤(c) Σ_{κ∈K} Σ_{h′∈Hκ,d} P(h′|h) ( V^{πt}_µ(h′) − V^π_µ(h′) ) + ε/8
         ≤(d) Σ_{κ∈K} Σ_{h′∈Hκ,d} P(h′|h) ( V*_µ(h′; d) − V^π_µ(h′; d) ) + ε/4
         ≤(e) Σ_{κ∈K} Σ_{h′∈Hκ,d} P(h′|h) 4εκ + ε/4,

where (a) follows from Equation (4); (b) by noting that π = πt until an exploration phase is triggered; (c) by replacing Hκ with Hκ,d and noting that if h′ ∈ Hκ − Hκ,d, then γ^{ℓ(h′)−t} ≤ ε(1 − γ)/8; (d) by substituting V*_µ(h′) ≥ V^{πt}_µ(h′) and by using the effective horizon to truncate the value functions; and (e) by the definition of a κ-exploration phase.

Since the maximum of a set is greater than the average, there exists a κ ∈ K such that Σ_{h′∈Hκ,d} P(h′|h) ≥ 2^{−κ−3}/|K|, which is the probability that MERL encounters a κ-exploration phase within d time-steps from h. Now fix a κ and let t1, t2, · · · , t_{Gκ} be the sequence of time-steps such that ti is the start of a failure phase and the probability of a κ-exploration phase within the next d time-steps is at least 2^{−κ−3}/|K|. Let Yi ∈ {0, 1} be the event that a κ-exploration phase does occur within d time-steps of ti, and define an auxiliary infinite sequence Ỹ1, Ỹ2, · · · by Ỹi := Yi if i ≤ Gκ and Ỹi := 1 otherwise. Let Eκ be the number of κ-exploration phases and Gmax,κ be a constant to be chosen later, and suppose Gκ > Gmax,κ. Then Σ_{i=1}^{Gmax,κ} Ỹi = Σ_{i=1}^{Gmax,κ} Yi and either Σ_{i=1}^{Gmax,κ} Ỹi ≤ Emax,κ or Eκ > Emax,κ, where the latter follows because Yi = 1 implies a κ-exploration phase occurred. Therefore

    P{Gκ > Gmax,κ} ≤ P{ Σ_{i=1}^{Gmax,κ} Ỹi < Emax,κ } + P{Eκ > Emax,κ}
                   ≤ P{ Σ_{i=1}^{Gmax,κ} Ỹi < Emax,κ } + δ/(4|K|).

We now choose Gmax,κ sufficiently large to bound the first term in the display above by δ/(4|K|). By the definition of Yi and Ỹi, if i ≤ Gκ then E[Ỹi|F_{ti}] ≥ 2^{−κ−3}/|K|, and for i > Gκ, Ỹi is always 1. Setting

    Gmax,κ := 2^{κ+4} |K| Emax,κ = (2^17 N|K| / (ε εκ (1 − γ)²)) log²( 2^9 N / (ε²(1 − γ)²δ1) )

is sufficient to guarantee E[ Σ_{i=1}^{Gmax,κ} Ỹi ] > 2Emax,κ, and an application of Azuma's inequality to the martingale difference sequence completes the result. Finally, we apply the union bound over all κ and set Gmax := Σ_{κ∈K} Gmax,κ. □

4. Compact Case

In the last section we presented MERL and proved a sample-complexity bound for the case when the environment class is finite. In this section we show that if the number of environments is infinite, but compact with respect to the topology generated by a natural metric, then sample-complexity bounds are still possible with a minor modification of MERL. The key idea is to use compactness to cover the space of environments with ε-balls and compute statistics on these balls rather than on individual environments. Since all environments in the same ε-ball are sufficiently close, the resulting statistics cannot be significantly different and all analysis goes through identically to the finite case. Define a topology on the space of all environments induced by the pseudo-metric

    d(ν1, ν2) := sup_{h,π} | V^π_{ν1}(h) − V^π_{ν2}(h) |.

Theorem 6. Let M be compact and coverable by N ε-balls. Then a modification of Algorithm 1 satisfies

    P{ L^{2ε}_{µ,π}(h) ≥ d · (Gmax + Emax) } ≤ δ.

The main modification is to define statistics on elements of the cover, rather than on specific environments (a sketch of one way to construct such a cover is given at the end of this section).

1. Let U1, · · · , UN be an ε-cover of M.
2. At each time-step choose Ū and U̲ such that ν̄ ∈ Ū and ν̲ ∈ U̲.
3. Define statistics {X} on elements of the cover, rather than environments, by

       X(Ū, κ)_{E(Ū,κ)} := inf_{ν∈Ū} (1 − γ)( V^π_ν(h; d) − R )
       X(U̲, κ)_{E(U̲,κ)} := inf_{ν∈U̲} (1 − γ)( R − V^π_ν(h; d) ).

4. If there exists a U where the test fails, then eliminate all environments in that element of the cover.

The proof requires only small modifications to show that with high probability the U containing the true environment is never discarded, while those not containing the true environment are, if tested sufficiently often.
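The modification above only requires an ε-cover of M under the pseudo-metric d. When a finite candidate set and an evaluable upper bound on d are available, a simple greedy construction suffices; both of those availability assumptions, and all names below, are for illustration only, since the supremum over histories and policies cannot in general be computed exactly.

    def greedy_eps_cover(candidates, dist, eps):
        # Greedily pick centers so that every candidate environment lies within
        # eps of some center under the pseudo-metric
        #     d(nu1, nu2) = sup_{h, pi} |V^pi_nu1(h) - V^pi_nu2(h)|,
        # supplied here as a callable `dist` (an assumed oracle or upper bound).
        centers, balls = [], {}
        for nu in candidates:
            for c in centers:
                if dist(nu, c) <= eps:
                    balls[c].append(nu)
                    break
            else:
                centers.append(nu)
                balls[nu] = [nu]
        return centers, balls

Statistics are then maintained per ball exactly as in Algorithm 1, with the infimum over the ball taken when X(Ū, κ) and X(U̲, κ) are recorded, and a failed test eliminating every environment in the offending ball.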

5. Unbounded Environment Classes

If the environment class is non-compact then we cannot in general expect finite sample-complexity bounds. Indeed, even asymptotic results are usually not possible.

Theorem 7. There exist non-compact M for which no agent has a finite PAC bound.

The obvious example is when M is the set of all environments. Then for any policy, M includes an environment that is tuned to ensure the policy acts sub-optimally infinitely often. A more interesting example is the class of all computable environments, which is non-compact and also does not admit algorithms with uniform finite sample-complexity. See the negative results by Lattimore & Hutter (2011b) for counter-examples.

6. Lower Bound

We now turn our attention to the lower bound. In specific cases, the bound in Theorem 1 is very weak. For example, if M is the class of finite MDPs with |S| states, then a natural covering leads to a PAC bound with exponential dependence on the state-space, while it is known that the true dependence is at most quadratic. This should not be surprising, since information about the transitions for one state gives information about a large subset of M, not just a single environment. We show that the bound in Theorem 1 is unimprovable for general environment classes except for logarithmic factors. That is, there exists a class of environments where Theorem 1 is nearly tight.

The simplest counter-example is a set of MDPs with four states, S = {0, 1, ⊕, ⊖}, and N actions, A = {a1, · · · , aN}. The rewards and transitions are depicted in the accompanying figure, where the transition probabilities depend on the action. Let M := {ν1, · · · , νN}, where for νk we set ε(ai) = [[i = k]] ε(1 − γ). Therefore in environment νk, ak is the optimal action in state 1. M can be viewed as a set of bandits with rewards in (0, 1/(1 − γ)). In the bandit domain tight lower bounds on sample-complexity are known and given in Mannor & Tsitsiklis (2004). These results can be applied as in Strehl et al. (2009) and Lattimore & Hutter (2012) to show that no algorithm has sample-complexity less than O( (N/(ε²(1 − γ)³)) log(1/δ) ).

[Figure: the four-state MDP class used in the lower bound, with states 0 and 1 (r = 0), ⊕ (r = 1) and ⊖ (r = 0). From state 1, action a leads to ⊕ with probability (1 + ε(a))/2 and to ⊖ with probability (1 − ε(a))/2; the remaining transitions are governed by p := 1/(2 − γ) and q := 2 − 1/γ.]

7. Conclusions

Summary. The Maximum Exploration Reinforcement Learning algorithm was presented. For finite classes of arbitrary environments, a sample-complexity bound was given that is linear in the number of environments. We also presented lower bounds that show that in general this cannot be improved except for logarithmic factors. Learning is also possible for compact classes, with the sample-complexity depending on the size of the smallest ε-cover, where the distance between two environments is the difference in value functions over all policies and history sequences. Finally, for non-compact classes of environments sample-complexity bounds are typically not possible.

Running time. The running time of MERL can be arbitrarily large, since computing the policy maximising ∆ depends on the environment class used. Even assuming the distribution of observations/rewards given the history can be computed in constant time, the values of optimal policies can still only be computed in time exponential in the horizon.

Future work. MERL is close to unimprovable in the sense that there exists a class of environments where the upper bound is nearly tight. On the other hand, there are classes of environments where the bound of Theorem 1 scales badly compared to the bounds of tuned algorithms (for example, finite-state MDPs). It would be interesting to show that MERL, or a variant thereof, actually performs comparably to the optimal sample-complexity even in these cases. This question is likely to be subtle, since there are unrealistic classes of environments where the algorithm minimising sample-complexity should take actions leading directly to a trap where it receives low reward eternally, but is never (again) sub-optimal. Since MERL will not behave this way, it will tend to have poor sample-complexity bounds in this type of environment class. This is really a failure of the sample-complexity optimality criterion rather than of MERL, since jumping into non-rewarding traps is clearly sub-optimal by any realistic measure.

Acknowledgements. This work was supported by ARC grant DP120100950.

A. Technical Results

Lemma 8. Let x, y ∈ [0, 1]^N satisfy Σ_{i=1}^N yi = 1 and Σ_{i=1}^N xi yi ≥ 1/2. Then max_i x_i² y_i > 1/(4N).

Lemma 9. Let a, b > 2 and x := 4a(log ab)². Then x ≥ a log bx.

Lemma 10. Let αj := ⌈α^j⌉ where α := 4√N/(4√N − 1). Then Σ_{j=1}^∞ αj^{−1} ≤ 4√N.

References

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., 99:1563–1600, August 2010.

Azar, M., Munos, R., and Kappen, B. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the 29th International Conference on Machine Learning, New York, NY, USA, 2012. ACM.

Chakraborty, D. and Stone, P. Structure learning in ergodic factored MDPs without knowledge of the transition function's in-degree. In Proceedings of the 28th International Conference on Machine Learning (ICML'11), 2011.

Diuk, C., Li, L., and Leffler, B. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pp. 249–256. ACM, 2009.

Even-Dar, E., Kakade, S., and Mansour, Y. Reinforcement learning in POMDPs without resets. In IJCAI, pp. 690–695, 2005.

Hutter, M. Self-optimizing and Pareto-optimal policies in general environments based on Bayes-mixtures. In Proc. 15th Annual Conf. on Computational Learning Theory (COLT'02), volume 2375 of LNAI, pp. 364–379, Sydney, 2002. Springer, Berlin. URL http://arxiv.org/abs/cs.AI/0204040.

Lattimore, T. and Hutter, M. Time consistent discounting. In Algorithmic Learning Theory, volume 6925 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2011a.

Lattimore, T. and Hutter, M. Asymptotically optimal agents. In Algorithmic Learning Theory, volume 6925 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2011b.

Lattimore, T. and Hutter, M. PAC bounds for discounted MDPs. Technical report, 2012. http://tor-lattimore.com/pubs/pac-tech.pdf.

Maillard, O.-A., Nguyen, P., Ortner, R., and Ryabko, D. Optimal regret bounds for selecting the state representation in reinforcement learning. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), 2013.

Mannor, S. and Tsitsiklis, J. The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res., 5:623–648, December 2004.

Ryabko, D. and Hutter, M. On the possibility of learning in reactive environments with arbitrary dependence. Theoretical Computer Science, 405(3):274–284, 2008.

Strehl, A. and Littman, M. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning (ICML'05), pp. 856–863, 2005.

Strehl, A., Li, L., and Littman, M. Reinforcement learning in finite MDPs: PAC analysis. J. Mach. Learn. Res., 10:2413–2444, December 2009.

Sunehag, P. and Hutter, M. Optimistic agents are asymptotically optimal. In Proceedings of the 25th Australasian AI Conference, 2012.

Szita, I. and Szepesvári, C. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pp. 1031–1038, New York, NY, USA, 2010. ACM.