Arxiv:1903.01021V2 [Cs.LG] 27 May 2019 Made That the Problem Be Dropped Over Germany, Is a Separate Problem
Total Page:16
File Type:pdf, Size:1020Kb
A Strongly Asymptotically Optimal Agent in General Environments Michael K. Cohen1∗ , Elliot Catt1 and Marcus Hutter1 1Australian National University fmichael.cohen, elliot.carpentercatt, [email protected] Abstract Our work is within the Reinforcement Learning (RL) paradigm: an agent selects an action, and the environment Reinforcement Learning agents are expected to responds with an observation and a reward. The interaction eventually perform well. Typically, this takes the may end, or it may continue forever. Each interaction cycle form of a guarantee about the asymptotic behav- is called a timestep. The agent has a discount function that ior of an algorithm given some assumptions about weights its relative concern for the reward it achieves at vari- the environment. We present an algorithm for a ous future timesteps. The agent’s job is to select actions that policy whose value approaches the optimal value maximize the total expected discounted reward it achieves in with probability 1 in all computable probabilistic its lifetime. The “value” of an agent’s policy at a certain point environments, provided the agent has a bounded in time is the expected total discounted reward it achieves af- horizon. This is known as strong asymptotic op- ter that time if it follows that policy. One formal specification timality, and it was previously unknown whether it of the exploration-exploitation problem is: what policy can an was possible for a policy to be strongly asymptot- agent follow so that the policy’s value approaches the value ically optimal in the class of all computable prob- of the optimal informed policy with probability 1, even when abilistic environments. Our agent, Inquisitive Re- the agent doesn’t start out knowing the true dynamics of its inforcement Learner (Inq), is more likely to ex- environment? plore the more it expects an exploratory action to Most work in RL makes strong assumptions about the en- reduce its uncertainty about which environment it vironment—that the environment is Markov, for instance. is in, hence the term inquisitive. Exploring inquis- Impressive recent development in the field of reinforcement itively is a strategy that can be applied generally; learning often makes use of the Markov assumption, includ- for more manageable environment classes, inquis- ing Deep Q Networks [Mnih et al., 2015], A3C [Mnih et al., itiveness is tractable. We conducted experiments 2016], Rainbow [Hessel et al., 2018], and AlphaZero [Silver in “grid-worlds” to compare the Inquisitive Rein- et al., 2017]. Another example of making strong assumptions forcement Learner to other weakly asymptotically in RL comes from some model-based algorithms that implic- optimal agents. itly assume that the environment is representable by, for ex- ample, a fixed-size neural network, or whatever construct is 1 Introduction used to model the environment. We do not make any such “Efforts to solve [an instance of the exploration- assumptions. exploitation problem] so sapped the energies and Many recent developments in RL are largely about minds of Allied analysts that the suggestion was tractably learning to exploit; how to explore intelligently arXiv:1903.01021v2 [cs.LG] 27 May 2019 made that the problem be dropped over Germany, is a separate problem. We address the latter problem. as the ultimate instrument of intellectual sabotage.” Our approach, inquisitiveness, is based on Orseau et al.’s –Peter Whittle [Whittle, 1979] [2013] Knowledge Seeking Agent for Stochastic Environ- ments, which selects the actions that best inform the agent The Allied analysts were considering the simplest possible about what environment it is in. Our Inquisitive Reinforce- problem in which there is a trade-off to be made between ment Learner (Inq) explores like a knowledge seeking agent, exploiting, taking the apparently best option, and exploring, and is more likely to explore when there is apparently (ac- choosing a different option to learn more. We tackle what cording to its current beliefs) more to be learned. Some- we consider the most difficult instance of the exploration- times exploring well requires “expeditions,” or many consec- exploitation trade-off problem: when the environment could utive exploratory actions. Inq entertains expeditions of all be any computable probability distribution, not just a multi- lengths, although it follows the longer ones less often, and it armed bandit, how can one achieve optimal performance in doesn’t resolutely commit in advance to seeing the expedition the limit? through. ∗Contact Author This is a very human approach to information acquisition. When we spot an opportunity to learn something about our and definitions for quick reference. Appendix B contains the natural environment, we feel inquisitive. We get distracted. proofs of the lemmas. We are inclined to check it out, even if we don’t see directly in advance how this information might help us better achieve 2 Notation our goals. Moreover, if we can tell that the opportunity to We follow the notation of Orseau, et al. [2013]. The rein- learn something requires a longer term project, we may find forcement learning setup is as follows: A is a finite set of ourselves less inquisitive. actions available to the agent; O is a finite set of observa- For the class of computable environments (stochastic envi- tions it might observe, and R = [0; 1] \ Q is the set of ronments that follow a computable probability distribution), possible rewards. The set of all possible interactions in a it was previously unknown whether any policy could achieve timestep is H := A × O × R. At every timestep, one el- strong asymptotic optimality (convergence of the value to op- ement from this set occurs. A reinforcement learner’s policy [ ] timality with probability 1). Lattimore et al. 2011 showed π is a stochastic function which outputs an action given an in- that no deterministic policy could achieve this. The key ad- ∗ ∗ S1 i teraction history, denoted by π : H A.(X := i=0 X vantage that stochastic policies have is that they can let the represents all finite strings from an alphabet X ). An envi- exploration probability go to 0 while still exploring infinitely ronment is a stochastic function which outputs an observa- often. (For example, an agent that explores with probability tion and reward given an interaction history and an action: 1=t at time t still explores infinitely often). ∗ ν : H × A O × R. For a stochastic function f : X!Y, There is a weaker notion of optimality–“weak asymptotic f(yjx) denotes the probability that f outputs y 2 Y when optimality”–for which positive results already exist; this con- x 2 X is input. dition requires that the average value over the agent’s life- A policy and an environment induce a probability measure time approach optimality. Lattimore et al. [2011] identified a over H1, the set of all possible infinite histories: for h 2 H∗, weakly asymptotically optimal agent for deterministic com- π Pν (h) denotes the probability that an infinite history begins putable environments; the agent maintains a list of environ- with h when actions are sampled from the policy π, and ob- ments consistent with its observations, exploiting as if it is servations and rewards are sampled from the environment ν. in the first such one, and exploring in bursts. A recent algo- π Formally, we define this inductively: Pν () 7! 1, where is rithm for a Thompson Sampling Bayesian agent was shown, the empty history, and for h 2 H∗, a 2 A, o 2 O, r 2 R, with an elegant proof, to be weakly asymptotically optimal in π π we define Pν (haor) 7! Pν (h)π(ajh)ν(orjha). In an infinite all computable environments, but not strongly asymptotically history h 2 H1, a , o , and r refer to the tth action, ob- optimal [Leike et al., 2016]. 1:1 t t t servation and reward, and ht refers to the tth timestep: atotrt. Most work in RL regards (Partially Observable) Markov h<t refers to the first t − 1 timesteps, and ht:k refers to the Decision Processes (PO)MDPs. However, environments string of timesteps t through k (inclusive). Strings of actions, that enter completely novel states infinitely often render observations, and rewards are notated similarly. (PO)MDP algorithms helpless. For example, an RL agent A Bayesian agent deems a class of environments a priori acting as a chatbot, optimizing a function, or proving mathe- feasible. Its “beliefs” take the form of a probability distribu- matical theorems would struggle to model the environment as tion over which environment is the true one. We call this the an MDP, and would likely require an exploration mechanism agent’s belief distribution. In our formulation, Inq considers like ours. In the chatbot case, for instance, as a conversation any computable environment feasible, and starts with a prior with a person progresses, the person never returns to the same belief distribution based on the environments’ Kolmogorov state. complexities: that is, the length of the shortest program that If we formally compare Inq to existing algorithms in computes the environment on some reference machine. How- MDPs, we find that many achieve asymptotic optimality. ever, all our results hold as long as the true environment is Epsilon-greedy, upper confidence bound, and Thompson contained in the class of environments that are considered sampling exploration strategies suffice in MDPs. Our primary feasible, and as long as the prior belief distribution assigns motivation is for the sorts of environments described above. nonzero probability to each environment in the class. We To discriminate between exploratory approaches in ergodic take M to be the class of all computable environments, and MDPs, one can formally bound regret, and we would like to w(ν) := 2−K(ν)(1+")=N to be the prior probability of the en- do this for Inq in the future.