Axioms for Rational Reinforcement Learning
Peter Sunehag and Marcus Hutter
Research School of Computer Science
Australian National University
Canberra, ACT, 0200, Australia
{Peter.Sunehag,Marcus.Hutter}@anu.edu.au

July 2011

arXiv:1107.5520v1 [cs.LG] 27 Jul 2011

Abstract

We provide a formal, simple and intuitive theory of rational decision making, including sequential decisions that affect the environment. The theory has a geometric flavor, which makes the arguments easy to visualize and understand. Our theory is for complete decision makers, which means that they have a complete set of preferences. Our main result shows that a complete rational decision maker implicitly has a probabilistic model of the environment. We have a countable version of this result that sheds light on the issue of countable vs. finite additivity by showing how it depends on the geometry of the space over which we have preferences. This is achieved by fruitfully connecting rationality with the Hahn-Banach Theorem. The theory presented here can be viewed as a formalization and extension of the betting odds approach to probability of Ramsey and De Finetti [Ram31, deF37].

Contents

1 Introduction
2 Rational Decisions for Accepting or Rejecting Contracts
3 Countable Sets of Events
4 Rational Agents for Classes of Environments
5 Conclusions

Keywords: Rationality; Probability; Utility; Banach Space; Linear Functional.

1 Introduction

We study complete decision makers that can take a sequence of actions to rationally pursue any given task. We suppose that the task is described in a reinforcement learning framework where the agent takes actions and receives observations and rewards. The aim is to maximize total reward in some given sense. Rationality is meant in the sense of internal consistency [Sug91], which is how it has been used in [NM44] and [Sav54].
In [NM44], it is proven that preferences together with rationality axioms and probabilities for possible events imply the existence of utility values for those events that explain the preferences as arising through maximizing expected utility. Their rationality axioms are:

1. Completeness: Given any two choices, we either prefer one of them to the other or we consider them equally preferable;
2. Transitivity: A preferable to B and B preferable to C imply A preferable to C;
3. Independence: If A is preferable to B and t ∈ [0, 1], then tA + (1 − t)C is preferable (or equal) to tB + (1 − t)C;
4. Continuity: If A is preferable to B and B to C, then there exists t ∈ [0, 1] such that B is equally preferable to tA + (1 − t)C.

In [Sav54] the probabilities are not given; it is instead proven that preferences together with rationality axioms imply the existence of both probabilities and utilities. We are interested here in the case where one is given utilities (rewards) and preferences over actions, and then derives the existence of a probabilistic world model. We put an emphasis on extensions to sequential decision making with respect to a countable class of environments. We set up simple axioms for a rational decision maker, which imply that the decisions can be explained (or defined) by probabilistic beliefs. The theory of [Sav54] is called subjective expected utility theory (SEUT) and was intended to provide statistics with a strictly behavioral foundation. The behavioral approach stands in stark contrast to approaches that directly postulate axioms that "degrees of belief" should satisfy [Cox46, Hal99, Jay03]. Cox's approach [Cox46, Jay03] has also been found [Par94] to require technical assumptions beyond the common-sense axioms originally listed by Cox. The original proof by [Cox46] has been exposed as not mathematically rigorous and his theorem as wrong [Hal99]. An alternative approach, due to [Ram31, deF37], interprets probabilities as fair betting odds.
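As a small numerical illustration of the expected-utility view in [NM44], the following sketch (all utilities and lotteries are hypothetical, not from the paper) compares two lotteries by expected utility and checks the Independence axiom for one mixture:

```python
# A lottery is a probability vector over outcomes; a rational agent in the
# sense of [NM44] prefers the lottery with higher expected utility.

def expected_utility(lottery, utility):
    """Expected utility of a lottery given per-outcome utilities."""
    return sum(p * u for p, u in zip(lottery, utility))

def mix(t, lottery_a, lottery_b):
    """Convex combination t*A + (1-t)*B of two lotteries."""
    return [t * a + (1 - t) * b for a, b in zip(lottery_a, lottery_b)]

utility = [0.0, 1.0, 3.0]   # hypothetical utilities of three outcomes
A = [0.1, 0.1, 0.8]         # lottery A
B = [0.4, 0.4, 0.2]         # lottery B
C = [1 / 3, 1 / 3, 1 / 3]   # an arbitrary third lottery C

# A is preferred to B ...
assert expected_utility(A, utility) > expected_utility(B, utility)
# ... and Independence holds for the mixtures with C at, e.g., t = 0.6:
t = 0.6
assert expected_utility(mix(t, A, C), utility) >= expected_utility(mix(t, B, C), utility)
```

Independence holds here by linearity of expectation, which is exactly why mixing both lotteries with the same C cannot reverse the preference.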
The theory of [Sav54] has greatly influenced economics [Sug91], where it has been used as a description of rational agents. Seemingly strange behavior was explained as arising from beliefs (probabilities) and tastes (utilities) different from those of the person to whom it looked irrational. This has turned out to be insufficient as a description of human behavior [All53, Ell61], and the theory is better suited as a normative theory or design principle in artificial intelligence. In this article, we are interested in studying the necessity for rational agents (biological or not) to have a probabilistic model of their environment. To achieve this, and to have as simple common-sense axioms of rationality as possible, we postulate that given any set of values (a contract) associated with the possible events, the decision maker needs to have an opinion on whether he prefers these values to a guaranteed zero outcome, prefers the zero outcome, or considers them equal. From this setting and our other rationality axioms we deduce the existence of probabilities that explain all preferences as maximizing expected value. There is an intuitive similarity to the idea of explaining/deriving probabilities as a bookmaker's betting odds, as done in [deF37] and [Ram31]. One can argue that the theory presented here (in Section 2) is a formalization and extension of the betting odds approach. Geometrically, the result says that there is a hyperplane in the space of contracts that separates accept from reject. We generalize this statement, by using the Hahn-Banach Theorem, to the countable case, where the set of hyperplanes (the dual space) depends on the space of contracts. The answers for different cases can then be found in the Banach space theory literature. This provides a new approach to understanding issues like finite vs. countable additivity.
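The finite-dimensional picture can be sketched concretely. Assuming a decision maker whose implicit beliefs form a (hypothetical) probability vector p, it accepts exactly the contracts with non-negative expected value; the hyperplane {x : p·x = 0} then separates accept from reject:

```python
# Hypothetical implicit beliefs over an alphabet of m = 3 symbols.
p = [0.2, 0.3, 0.5]

def decision(x):
    """Decide a contract x by its expected value p.x under the beliefs p."""
    ev = sum(pi * xi for pi, xi in zip(p, x))
    if ev > 0:
        return "accepted"
    if ev < 0:
        return "rejected"
    return "either"   # x lies on the separating hyperplane {x : p.x = 0}

print(decision([2.0, -1.0, 1.0]))   # positive expected value: accepted
print(decision([-2.0, 0.0, 0.0]))   # negative expected value: rejected
print(decision([-1.0, -1.0, 1.0]))  # expected value zero: either
```

The direction of the argument in the paper is the reverse: starting from a complete, rational accept/reject map, one recovers such a p.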
We take advantage of this to formulate rational agents that can deal successfully with countable (possibly universal, as in all computable environments) classes of environments. Our presentation begins in Section 2 by first looking at a fundamental case where one has to accept or reject certain contracts defining positive and negative rewards that depend on the outcome of an event with finitely many possibilities. To draw the conclusion that there are implicit unique probabilistic beliefs, it is important that the decision maker has an opinion (acceptable, rejectable or both) on every possible contract. This is what we mean when we say complete decision maker. In a more general setting, we consider sequential decision making where, given any contract on the sequence of observations and actions, the decision maker must be able to choose a policy (i.e. an action tree). Note that the actions may affect the environment. A contract on such a sequence can, e.g., be viewed as describing a reward structure for a task. An example of a task is a cleaning robot that gets positive rewards for collecting dust and negative rewards for falling down the stairs. A prerequisite for being able to continue to collect dust can be to recharge the battery before running out. A specialized decision maker that deals only with one contract/task does not always need to have implicit probabilities; qualitative beliefs can suffice for making reasonable decisions. A qualitative belief can be that one pizza delivery company (e.g. Pizza Hut vs. Dominos) is more likely to arrive on time than the other. If one believes the pizzas are equally good and the price is the same, we will choose the company we believe delivers on time more often. Considering all contracts (reward structures) on the actions and events leads to a situation where having a way of making rational (coherent) decisions implies that the decision maker has implicit probabilistic beliefs.
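The sequential setting can be illustrated in miniature (all probabilities and rewards below are hypothetical): given a contract r(a, o) over a single action a followed by an observation o, an agent holding probabilistic beliefs P(o|a) picks the action (the simplest possible policy) maximizing expected contract value:

```python
# Hypothetical beliefs P(o|a) over observation o after action a.
P = {0: [0.7, 0.3],   # P(o=0|a=0), P(o=1|a=0)
     1: [0.4, 0.6]}   # P(o=0|a=1), P(o=1|a=1)

# A hypothetical contract: reward r(a, o) for each action-observation pair.
r = {(0, 0): 1.0, (0, 1): -2.0,
     (1, 0): -1.0, (1, 1): 2.0}

def expected_value(a):
    """Expected contract value of taking action a under the beliefs P."""
    return sum(P[a][o] * r[(a, o)] for o in (0, 1))

best_action = max(P, key=expected_value)
print(best_action)  # the action with the larger expected contract value
```

With longer horizons, a policy becomes a tree assigning an action to every observation history, and the same expected-value comparison applies over policies.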
We say that the probabilities are implicit because the decision maker, which might, e.g., be a human, a dog, a computer or just a set of rules, might have a non-probabilistic description of how the decisions are made. In Section 3, we investigate extensions to the case with countably many possible outcomes and the interesting issue of countable versus finite additivity. Savage's axioms are known to lead only to finite additivity, while [Arr70] showed that adding a monotone continuity assumption guarantees countable additivity. We find that in our setting, the answer depends on the space of contracts in an interesting way. In Section 4, we discuss a setting where we have a class of environments.

2 Rational Decisions for Accepting or Rejecting Contracts

We consider a setting where we observe a symbol (letter) from a finite alphabet and are offered a form of bet, which we call a contract, that we can accept or not.

Definition 1 (Passive Environment, Event) A passive environment is a sequence of symbols (letters) jt, called events, presented one at a time. At time t the symbols j1, ..., jt are available. We can equivalently say that a passive environment is a function ν from finite strings to {0, 1} where ν(j1, ..., jt) = 1 if and only if the environment begins with j1, ..., jt.

Definition 2 (Contract) Suppose that we have a passive environment with symbols from an alphabet with m elements. A contract for an event is an element x = (x1, ..., xm) in R^m, where xj is the reward received if the event is the j-th symbol, under the assumption that the contract is accepted (see next definition).

Definition 3 (Decision Maker, Decision) A decision maker (for some unknown environment) is a set Z ⊂ R^m which defines exactly the contracts that are acceptable. In other words, a decision maker is a function from R^m to {accepted, rejected, either}. The function value is called the decision. If x ∈ Z and λ ≥ 0, then we want λx ∈ Z, since λx is simply a multiple of the same contract.
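Definitions 1-3 can be sketched in a few lines (the concrete environment, beliefs and contract below are hypothetical). The decision maker here is the expected-value one, which anticipates the main result rather than being part of the definitions; note that it satisfies the required closure under scaling by lambda >= 0:

```python
# Definition 1: a passive environment as a prefix indicator on strings.
env_sequence = "abab"   # a hypothetical passive environment

def nu(prefix):
    """nu(j1...jt) = 1 iff the environment begins with j1...jt."""
    return 1 if env_sequence.startswith(prefix) else 0

# Definitions 2-3: contracts x in R^m and a decision maker Z, here given
# by non-negative expected value under hypothetical beliefs p (m = 2).
p = [0.5, 0.5]

def in_Z(x):
    """Is the contract x = (x1, ..., xm) acceptable?"""
    return sum(pi * xi for pi, xi in zip(p, x)) >= 0

x = [3.0, -1.0]
assert in_Z(x)
lam = 2.5
# Closure under non-negative scaling: x in Z and lam >= 0 imply lam*x in Z.
assert in_Z([lam * xi for xi in x])
```

Scaling by a non-negative constant multiplies the expected value by the same constant, so any expected-value decision maker automatically has the scaling property the text requires of Z.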