On Thompson Sampling and Asymptotic Optimality∗
Jan Leike (DeepMind; Future of Humanity Institute, University of Oxford), Tor Lattimore (Indiana University), Laurent Orseau (DeepMind), Marcus Hutter (Australian National University)

∗This is an abridged version of Leike et al. [2016a].

Abstract

We discuss some recent results on Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markovian, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges in mean to the optimal value and (2) given a recoverability assumption regret is sublinear. We conclude with a discussion about optimality in reinforcement learning.

Keywords. General reinforcement learning, Thompson sampling, asymptotic optimality, regret, discounting, recoverability, AIXI.

1 Introduction

In reinforcement learning (RL) an agent interacts with an unknown environment with the goal of maximizing rewards. Recently reinforcement learning has received a surge of interest, triggered by its success in applications such as simple video games [Mnih et al., 2015]. However, theory is lagging behind application, and most theoretical analyses have been done in the bandit framework and for Markov decision processes (MDPs). These restricted environment classes fall short of the full reinforcement learning problem, and theoretical results usually assume ergodicity and visiting every state infinitely often. Needless to say, these assumptions are not satisfied for any but the simplest applications. The goal of this line of work is to lift these restrictions; we consider general reinforcement learning [Hutter, 2005; Lattimore, 2013; Leike, 2016b], a top-down approach to RL with the aim to understand the fundamental underlying problems in their generality. Our approach to general RL is nonparametric: we only assume that the true environment belongs to a given countable environment class. However, we leave computational considerations aside for now.

We are interested in agents that maximize rewards optimally. Since the agent does not know the true environment in advance, it is not obvious what optimality should mean. We discuss two different notions of optimality: asymptotic optimality and worst-case regret.

Asymptotic optimality [Lattimore and Hutter, 2011] requires that asymptotically the agent learns to act optimally, i.e., that the discounted value of the agent's policy π converges to the optimal discounted value for every environment from the environment class. Asymptotic optimality can be achieved through an exploration component on top of a Bayes-optimal agent [Lattimore, 2013, Ch. 5] or through optimism [Sunehag and Hutter, 2015].

Asymptotic optimality in mean is essentially a qualitative version of probably approximately correct (PAC) that comes without a concrete convergence rate: for all ε > 0 and δ > 0 the probability that our policy is ε-suboptimal converges to zero (at an unknown rate). Eventually this probability will be less than δ forever thereafter. Since our environment class can be very large and non-compact, concrete PAC/convergence rates are likely impossible.
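To make these two notions slightly more concrete, the display below sketches how they can be written formally. The value notation is our own shorthand rather than a definition given in this text: we write $V^{\pi}_{\mu}(æ_{<t})$ for the discounted value of policy π in environment μ after history $æ_{<t}$ and $V^{*}_{\mu}$ for the corresponding optimal value; the precise definitions in Leike et al. [2016a] are authoritative.

% Asymptotic optimality in mean (sketch): the agent's value converges in mean
% to the optimal value in every environment of the class M.
\[
  \lim_{t \to \infty} \mathbb{E}\left[ V^{*}_{\mu}(æ_{<t}) - V^{\pi}_{\mu}(æ_{<t}) \right] = 0
  \qquad \text{for all } \mu \in \mathcal{M}.
\]
% Qualitative PAC reading: by Markov's inequality, for every eps > 0,
\[
  \Pr\left[ V^{*}_{\mu}(æ_{<t}) - V^{\pi}_{\mu}(æ_{<t}) > \varepsilon \right] \to 0
  \quad \text{as } t \to \infty,
\]
% at an unspecified rate; eventually this probability stays below any delta > 0.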
Regret is how many expected rewards the agent forfeits by not following the best informed policy. Different problem classes have different regret rates, depending on the structure and the difficulty of the problem class. Multi-armed bandits provide a (problem-independent) worst-case regret bound of $\Omega(\sqrt{KT})$ where $K$ is the number of arms [Bubeck and Cesa-Bianchi, 2012]. In Markov decision processes (MDPs) the lower bound is $\Omega(\sqrt{DSAT})$ where $S$ is the number of states, $A$ the number of actions, and $D$ the diameter of the MDP [Auer et al., 2010]. For a countable class of environments given by state representation functions that map histories to MDP states, a regret of $\tilde{O}(T^{2/3})$ is achievable assuming the resulting MDP is weakly communicating [Nguyen et al., 2013]. A problem class is considered learnable if there is an algorithm that has a sublinear regret guarantee.

This paper continues a narrative that started with the definition of the Bayesian agent AIXI [Hutter, 2000] and the proof that it satisfies various optimality guarantees [Hutter, 2002]. Recently it was revealed that these optimality notions are subjective [Leike and Hutter, 2015]: a Bayesian agent does not explore enough to lose the prior's bias, and a particularly bad prior can make the agent conform to any arbitrarily bad policy as long as this policy yields some rewards. In particular, general Bayesian agents are not asymptotically optimal [Orseau, 2013]. These negative results put the Bayesian approach to RL into question. We remedy the situation by showing that using Bayesian techniques an agent can indeed be optimal in an objective sense.

We report recent results on a strategy called Thompson sampling, posterior sampling, or the Bayesian control rule [Thompson, 1933]. This strategy samples an environment ρ from the posterior, follows the ρ-optimal policy for a while, and then repeats. We show that this policy is asymptotically optimal in mean. Furthermore, using a recoverability assumption on the environment, and some (minor) assumptions on the discount function, we prove that the worst-case regret is sublinear.
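The Python sketch below illustrates the sample-and-follow loop just described. It is a schematic illustration only, not the paper's algorithm statement: the callables prior, optimal_action, likelihood, env_step, and effective_horizon are hypothetical placeholders, and the computability issues that the paper sets aside are ignored here as well.

import random

def thompson_sampling_agent(prior, optimal_action, likelihood, env_step,
                            effective_horizon, total_steps):
    """Schematic Thompson sampling loop: sample an environment rho from the
    posterior, follow the rho-optimal policy for a while, then resample.

    prior             -- dict mapping candidate environments nu to prior weights
    optimal_action    -- optimal_action(rho, history) -> action under model rho
    likelihood        -- likelihood(nu, history, action, percept) -> nu(percept | history, action)
    env_step          -- env_step(history, action) -> percept from the true environment
    effective_horizon -- effective_horizon(t) -> length of the current phase
    """
    history = []             # the history ae_{<t} as (action, percept) pairs
    posterior = dict(prior)  # unnormalized posterior weights w(nu | ae_{<t})
    envs = list(posterior)
    t = 1
    while t <= total_steps:
        # Sample rho from the posterior (proportional to its weight).
        rho = random.choices(envs, weights=[posterior[nu] for nu in envs])[0]
        # Follow the rho-optimal policy for one phase, then repeat.
        for _ in range(max(1, effective_horizon(t))):
            if t > total_steps:
                break
            action = optimal_action(rho, history)
            percept = env_step(history, action)
            # Bayes update: w(nu) is multiplied by nu(percept | history, action).
            for nu in envs:
                posterior[nu] *= likelihood(nu, history, action, percept)
            history.append((action, percept))
            t += 1
    return history

How long "a while" is matters for the analysis; the ε-effective horizon defined in Section 2 is the natural choice for the phase length.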
Thompson sampling was originally proposed by Thompson as a bandit algorithm [Thompson, 1933]. It is easy to implement and often achieves quite good results [Chapelle and Li, 2011]. In multi-armed bandits it attains optimal regret [Agrawal and Goyal, 2011; Kaufmann et al., 2012]. Thompson sampling has also been considered for MDPs: as a model-free method relying on distributions over Q-functions with a convergence guarantee [Dearden et al., 1998], and as a model-based algorithm without theoretical analysis [Strens, 2000]. Bayesian and frequentist regret bounds have also been established [Osband et al., 2013; Osband and Van Roy, 2014; Gopalan and Mannor, 2015]. PAC guarantees have been established for an optimistic variant of Thompson sampling for MDPs [Asmuth et al., 2009].

For general RL, Thompson sampling was first suggested by Ortega and Braun [2010] with resampling at every time step. The authors prove that the action probabilities of Thompson sampling converge to the action probabilities of the optimal policy almost surely, but require a finite environment class and two (arguably quite strong) technical assumptions on the behavior of the posterior distribution (akin to ergodicity) and the similarity of environments in the class. Our convergence results do not require these assumptions, but we rely on an (unavoidable) recoverability assumption for our regret bound.

Thompson sampling can be viewed as inference over optimal policies [Ortega and Braun, 2012]. With each environment $\nu \in \mathcal{M}$ we associate an optimal policy $\pi^{*}_{\nu}$. At time step $t$ conditional on history $æ_{<t}$, the posterior belief over environment $\nu$ is $w(\nu \mid æ_{<t})$. A Bayesian agent averages over all environments by maximizing reward according to the Bayesian mixture $\xi(\cdot \mid æ_{<t}) = \sum_{\nu} w(\nu \mid æ_{<t})\,\nu(\cdot \mid æ_{<t})$. In contrast, Thompson sampling averages over optimal policies and we get $\pi_{T} = \sum_{\nu} w(\nu \mid æ_{<t})\,\pi^{*}_{\nu}$. This way no explicit reward structure is needed, only a mapping from environment $\nu$ to optimal policy $\pi^{*}_{\nu}$.

[...] concentrates around the likely optimal actions, so sampling a policy that takes the suboptimal action is very unlikely. This has been a known effect in the context of partial monitoring problems [Bartók et al., 2014], that commonly involve information that can only be gained by taking suboptimal actions. However, in the most common theoretical frameworks for RL, multi-armed bandits and tabular MDPs, this problem does not exist and thus has gone unnoticed so far by the theoretical literature.

2 Preliminaries and Notation

In reinforcement learning, an agent interacts with an environment in cycles: at time step $t$ the agent chooses an action $a_t$ and receives a percept $e_t = (o_t, r_t)$ consisting of an observation $o_t$ and a real-valued reward $r_t$; the cycle then repeats for time step $t + 1$. A history is a sequence of actions and percepts: we use $æ_{<t}$ to denote a history of length $t - 1$. In the following we assume that rewards are bounded between 0 and 1.

In contrast to most of the literature on reinforcement learning, we are agnostic towards the discounting strategy. Our goal is to maximize discounted rewards $\sum_{t=1}^{\infty} \gamma_t r_t$ for a fixed discount function $\gamma : \mathbb{N} \to \mathbb{R}$ with $\gamma_t \geq 0$ and $\sum_{t=1}^{\infty} \gamma_t < \infty$. Geometric discounting ($\gamma_t = \gamma^t$ for some constant $\gamma \in (0, 1)$) is the most common form of discounting, although other forms can be used [Lattimore and Hutter, 2014]. The discount normalization factor is defined as $\Gamma_t := \sum_{k=t}^{\infty} \gamma_k$.

An ε-effective horizon $H_t(\varepsilon)$ is a horizon that is long enough to encompass all but an ε of the discount function's mass:

\[
  H_t(\varepsilon) := \min\{k \mid \Gamma_{t+k}/\Gamma_t \leq \varepsilon\} \qquad (1)
\]

An ε-effective horizon is a central quantity in online reinforcement learning, and has a similar function to an episode in the episodic setting. It is the amount of time that an agent needs to plan ahead while losing only a fraction ε of the possible value. It can be used to constrain the planning horizon to a finite number of steps regardless of the discount function used. For geometric discounting, the horizon is $\lceil \log_\gamma \varepsilon \rceil$ (see Leike, 2016b, Tab. 4.1).

A policy is a function $\pi(a \mid æ_{<t})$ specifying the probability of taking action $a$ after seeing the history $æ_{<t}$. Likewise, an environment is a function $\nu(e \mid æ_{<t} a_t)$ specifying the probability of emitting percept $e$ after seeing the history $æ_{<t}$ and taking action $a_t$.
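As a small worked example of Equation (1): for geometric discounting $\gamma_t = \gamma^t$ we have $\Gamma_t = \gamma^t/(1-\gamma)$, so $\Gamma_{t+k}/\Gamma_t = \gamma^k$ and the horizon does not depend on $t$. The Python snippet below is our own illustration (not code from the paper); it computes $H_t(\varepsilon)$ directly from this ratio and checks it against the closed form $\lceil \log_\gamma \varepsilon \rceil$ quoted above.

import math

def effective_horizon_geometric(gamma: float, epsilon: float) -> int:
    """H_t(eps) = min{k : Gamma_{t+k}/Gamma_t <= eps}; for geometric
    discounting the ratio simplifies to gamma**k, independent of t."""
    k = 0
    while gamma ** k > epsilon:
        k += 1
    return k

def effective_horizon_closed_form(gamma: float, epsilon: float) -> int:
    """The closed form ceil(log_gamma(epsilon)) quoted in the text."""
    return math.ceil(math.log(epsilon) / math.log(gamma))

if __name__ == "__main__":
    for gamma, eps in [(0.9, 0.1), (0.99, 0.01), (0.5, 0.05)]:
        h = effective_horizon_geometric(gamma, eps)
        assert h == effective_horizon_closed_form(gamma, eps)
        print(f"gamma={gamma}, eps={eps}: H(eps)={h}")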