arXiv:1107.5520v1 [cs.LG] 27 Jul 2011 xosfrRtoa enocmn Learning Reinforcement Rational for Axioms ainlAet o lse fEvrnet 13 F Linear Space; Banach Utility; Probability; Rationality; 4 Environments of Classes Conclusions for Agents Contracts 5 Events Rational Rejecting of or Sets 4 Accepting Countable for Decisions 3 Rational 2 Introduction 1 dsapoc opoaiiyo asyadD iet [Ram31 Finetti De and extensio Ramsey and of achie formalization probability is a to as approach This viewed odds be over. can Th on preferences here Hahn-Banach depends have presented the with we it rationality which how connecting showing space fruitfully by the l additivity of brings finite that try result vs this mod countable tha of version shows probabilistic of countable result a a have main has We mea Our implicitly ment. which maker makers, preferences. decision of decision rational set complete visuali complete for to a is easy have theory arguments Our the makes stand. which flavor, environme geometric the affect a that decisions sequential including ing epoieafra,sml n nutv hoyo rational of theory intuitive and simple formal, a provide We ee Sunehag Peter { Peter.Sunehag,Marcus.Hutter eerhSho fCmue Science Computer of School Research abra C,00,Australia 0200, ACT, Canberra, utainNtoa University National Australian uy2011 July Keywords Contents Abstract and 1 acsHutter Marcus } @anu.edu.au unctional. oe.Tetheory The eorem. t h hoyhas theory The nt. lo h environ- the of el gto h issue the on ight ftebetting the of n eadunder- and ze eiinmak- decision complete a t deF37]. , they that ns e through ved h geome- the 15 10 2 1 Introduction

We study complete decision makers that can take a sequence of actions to rationally pursue any given task. We suppose that the task is described in a reinforcement learning framework where the agent takes actions and receives observations and rewards. The aim is to maximize total reward in some given sense. Rationality is meant in the sense of internal consistency [Sug91], which is how it has been used in [NM44] and [Sav54]. In [NM44], it is proven that preferences together with rationality axioms and probabilities for possible events imply the existence of utility values for those events that explain the preferences as arising through maximizing expected utility. Their rationality axioms are

1. Completeness: Given any two choices we either prefer one of them to the other or we consider them to be equally preferable;

2. Transitivity: A preferable to B and B to C imply A preferable to C;

3. Independence: If A is preferable to B and t ∈ [0, 1] then tA + (1 − t)C is preferable (or equal) to tB + (1 − t)C;

4. Continuity: If A is preferable to B and B to C then there exists t ∈ [0, 1] such that B is equally preferable to tA + (1 − t)C.

In [Sav54] the probabilities are not given but it is instead proven that preferences together with rationality axioms imply the existence of probabilities and utilities. We are here interested in the case where one is given utility (rewards) and preferences over actions and then deriving the existence of a probabilistic world model. We put an emphasis on extensions to sequential decision making with respect to a countable class of environments. We set up simple axioms for a rational decision maker, which implies that the decisions can be explained (or defined) from probabilistic beliefs. The theory of [Sav54] is called subjective expected utility theory (SEUT) and was intended to provide statistics with a strictly behaviorial foundation. The behavioral approach stands in stark contrast to approaches that directly postulate axioms that “degrees of belief” should satisfy [Cox46, Hal99, Jay03]. Cox’s approach [Cox46, Jay03] has also been found [Par94] to need additional technical assumptions in addition to the common sense axioms originally listed by Cox. The original proof by [Cox46] has been exposed as not mathematically rigorous and his theorem as wrong [Hal99]. An alternative approach by [Ram31, deF37] is interpreting probabilities as fair betting odds. The theory of [Sav54] has greatly influenced economics [Sug91] where it has been used as a description of rational agents. Seemingly strange behavior was explained as having beliefs (probabilities) and tastes (utilities) that were different from those of the person to whom it looked irrational. This has turned out to be insufficient as a description of human behavior [All53, Ell61] and it is better suited as a normative theory or design principle in artificial intelligence. In this article, we are interested

2 in studying the necessity for rational agents (biological or not) to have a probabilis- tic model of their environment. To achieve this, and to have as simple common sense axioms of rationality as possible, we postulate that given any set of values (a contract) associated with the possible events, the decision maker needs to have an opinion on wether he prefers these values to a guaranteed zero outcome or not (or equal). From this setting and our other rationality axioms we deduce the existence of probabilities that explain all preferences as maximizing expected value. There is an intuitive similarity to the idea of explaining/deriving probabilities as a book- maker’s betting odds as done in [deF37] and [Ram31]. One can argue that the theory presented here (in Section 2) is a formalization and extension of the betting odds approach. Geometrically, the result says that there is a hyper-plane in the space of contracts that separates accept from reject. We generalize this statement, by using the Hahn-Banach Theorem, to the countable case where the set of hyper-planes (the ) depends on the space of contract. The answers for different cases can then be found in the theory literature. This provides a new approach to understanding issues like finite vs. countable additivity. We take advantage of this to formulate rational agents that can deal successfully with countable (possibly universal as in all computable environments) classes of environments. Our presentation begins in Section 2 by first looking at a fundamental case where one has to accept or reject certain contracts defining positive and negative rewards that depend on the outcome of an event with finitely many possibilities. To draw the conclusion that there are implicit unique probabilistic beliefs, it is important that the decision maker has an opinion (acceptable, rejectable or both) on every possible contract. This is what we mean when we say complete decision maker. In a more general setting, we consider sequential decision making where given any contract on the sequence of observations and actions, the decision maker must be able to choose a policy (i.e. an action tree). Note that the actions may affect the environment. A contract on such a sequence can e.g. be viewed as describing a re- ward structure for a task. An example of a task is a cleaning robot that gets positive rewards for collecting dust and negative for falling down the stairs. A prerequisite for being able to continue to collect dust can be to recharge the battery before run- ning out. A specialized decision maker that deals only with one contract/task does not always need to have implicit probabilities, it can suffice with qualitative beliefs to take reasonable decisions. A qualitative belief can be that one pizza delivery com- pany (e.g. Pizza Hut vs Dominos) is more likely to arrive on time than the other. If one believes the pizzas are equally good and the price is the same, we will chose the company we believe is more often delivering on time. Considering all contracts (reward structures) on the actions and events, leads to a situation where having a way of making rational (coherent) decisions, implies that the decision maker has implicit probabilistic beliefs. We say that the probabilities are implicit because the decision maker, which might e.g. be a human, a dog, a computer or just a set of rules, might have a non-probabilistic description of how the decisions are made. In Section 3, we investigate extensions to the case with countably many possible

3 outcomes and the interesting issue of countable versus finite additivity. Savage’s axioms are known to only lead to finite additivity while [Arr70] showed that adding a monotone continuity assumption guarantees countable additivity. We find that in our setting, it depends on the space of contracts in an interesting way. In Section 4, we discuss a setting where we have a class of environments.

2 Rational Decisions for Accepting or Rejecting Contracts

We consider a setting where we observe a symbol (letter) from a finite alphabet and we are offered a form of bet we call a contract that we can accept or not.

Definition 1 (Passive Environment, Event) A passive environment is a se- quence of symbols (letters) jt, called events, being presented one at a time. At time t the symbols j1, ..., jt are available. We can equivalently say that a passive environment is a function ν from finite strings to {0, 1} where ν(j1, ..., jt)=1 if and only if the environment begins with j1, ..., jt.

Definition 2 (Contract) Suppose that we have a passive environment with sym- bols from an alphabet with m elements. A contract for an event is an element m x =(x1, ..., xm) in R and xj is the reward received if the event is the j:th symbol, under the assumption that the contract is accepted (see next definition).

Definition 3 (Decision Maker, Decision) A decision maker (for some unknown environment) is a set Z ⊂ Rm which defines exactly the contracts that are accept- able. In other words, a decision maker is a function from Rm to {accepted, rejected, either}. The function value is called the decision.

If x ∈ Z and λ ≥ 0 then we want λx ∈ Z since it is simply a multiple of the same contract. We also want the sum of two acceptable contracts to be acceptable. If we cannot lose money we are prepared to accept the contract. If we are guaranteed to win money we are not prepared to reject it. We summarize these properties in the definition below of a rational decision maker.

Definition 4 (Rationality I) We say that the decision maker (Z ⊂ Rm) is ratio- nal if

1. Every contract x ∈ Rm is either acceptable or rejectable or both;

2. x is acceptable if and only if −x is rejectable;

3. x, y ∈ Z, λ,γ ≥ 0 then λx + γy ∈ Z;

4. If xk ≥ 0 ∀k then x =(x1, ..., xm) ∈ Z while if xk < 0 ∀k then x∈ / Z.

4 If we want to compare these axioms to rationality axioms for a preference relation on contracts we will say that x is better or equal (as in equally good) than y if x − y is acceptable while it is worse or equal if x − y is rejectable. The first axiom is completeness. The second says that if x is better or equal than y then y is worse or equal to x. The third implies transitivity since (x − y)+(y − z)=(x − z). The fourth says that if x has a better (or equal) reward than y for any event, then x is better (or equal) than y.

2.1 Probabilities and Expectations Theorem 5 (Existence of Probabilities) Given a rational decision maker, there are numbers pi ≥ 0 that satisfy

{x | X xipi > 0} ⊂ Z ⊆{x | X xipi ≥ 0}. (1)

Assuming Pi pi =1 makes the numbers unique and we will use the notation Pr(i)= pi.

Proof. See the proof of the more general Theorem 23. It tells us that the closure Z¯ of Z is a closed half space and can be written as {x | P xipi ≥ 0} for some vector Rm p =(pi) (since every linear functional on is of the form f(x)= P xipi) and not every pi is 0. The fourth property tells us that pi ≥ 0 ∀i.

Definition 6 (Expectation) We will refer to the function g(x)= P pixi from (1) as the decision makers expectation. In this terminology, a rational decision maker has an expectation function and accepts a contract x if g(x) > 0 and reject it if g(x) < 0.

Remark 7 Suppose that we have a contract x = (xi) where xi = 1 for all i. If we want g(x)=1, we need P pi =1.

We will write E(x) instead of g(x) (assuming P pi = 1) from now on and call it the expected value or expectation of x.

2.2 Multiple Events Suppose that the contract is such that we can view the symbol to be drawn as consisting of two (or several) symbols from smaller alphabets. That is we can write a drawn symbol as (i, j) where all the possibilities can be found through 1 ≤ i ≤ m, 1 ≤ j ≤ n. In this way of writing, a contract is defined by real numbers xi,j. Theorem 5 tells us that for a rational decision maker there exists unique ri,j ≥ 0 such that Pi,j ri,j = 1 and an expectation function g(x) = P ri,jxi,j such that contracts are accepted if g(x) > 0 and rejected if g(x) < 0.

5 2.3 Marginals Suppose that we can take rational decisions on bets for a pair of horse races, while the person that offers us bets only cares about the first race. Then we are still equipped to respond since the bets that only depend on the first race is a subset of all bets on the pair of races. Definition 8 (Marginals) Suppose that we have a rational decision maker (Z) for contracts on the events (i, j). Then we say that the marginal decision maker for the first symbol (Z1) is the restriction of the decision maker Z to the contracts xi,j that only depend on i, i.e. xi,j = xi. In other words given a contract y =(yi) on the first event, we extend that contract to a contract on (i, j) by letting yi,j = yi and then the original decision maker can decide.

Suppose that xi,j = xi. Then the expectation P ri,jxi,j can be rewritten as P pixi where pi = Pj ri,j. We write that

Pr(i)= X Pr(i, j). j These are the marginal probabilities for the first variable that describe the marginal decision maker for that variable. Naturally we can also define a marginal for the second variable (considering contracts xi,j = xj) by letting qj = Pi ri,j and Pr(j)= Rm Rn Pi Pr(i, j). The marginals define sets Z1 ⊂ and Z2 ⊂ of acceptable contracts on the first and second variables separately.

2.4 Conditioning Again suppose that we are taking decisions on bets for a pair of horse races, but this time suppose that the first race is already over and we know the result. We are still equipped to respond to bets on the second race by extending the bet to a bet on both where there is no reward for (pairs of) events that are inconsistent with what we know. Definition 9 (Conditioning) Suppose that we have a rational decision maker (Z) for contracts on the events (i, j). We define the conditional decision maker Zj=j0 for i given j = j0 by restricting the original decision maker Z to contracts xi,j which are such that xi,j =0 if j =6 j0. In other words if we start with a contract y = (yi) on i we extend it to a contract on (i, j) by letting yi,j0 = yi and yi,j = 0 if j =6 j0. Then the original decision maker can make a decision for that contract.

Suppose that xi,j = 0 if j =6 j0. The unconditional expectation of this contract is Pi,j ri,jxi,j as usual which equals Pi ri,j0 xi,j0 . This leads to the same decisions ri,j0 (i.e. the same Z) as using xi,j0 which is of the form in Theorem 5. We Pi Pk rk,j0 write that Pr(i, j0) Pr(i, j0) Pr(i|j0)= = . (2) Pk Pr(k, j0) Pr(j0) 6 From this it follows that

Pr(i0)Pr(j0|i0)= Pr(j0)Pr(i0|j0) (3) which is one way of writing Bayes rule.

2.5 Learning In the previous section we defined conditioning which lead us to a definition of what it means to learn. Given that we have probabilities for events that are sequences of a certain number of symbols and we have observed one or several of them, we use conditioning to determine what our belief regarding the remaining symbols should be.

Definition 10 (Learning) Given a rational decision maker, defined by pi1,...,iT for T the events (it)t=1 and the first t−1 symbols i1, ..., it−1, we define the informed rational decision maker for it by conditioning on the past i1, ..., it−1 and marginalize over the future it+1, ..., iT . Formally,

pi1,...,i ,j +1,...,j informed Pjt+1,...,jT t t T Pit (i)= Pr(i|i1, ..., it)= . pi1,...,i −1,j ,...,j Pjt,...,jT t t T

2.6 Choosing between Contracts Definition 11 (Choosing contract) We say that to rationally prefer contract x over y is (equivalent) to rationally consider x − y to be acceptable.

As before we assume that we have a decision maker that takes rational decisions on accepting or rejecting contracts x that are based on an event that will be observed. Hence there exist implicit probabilities that represent all choices and an expectation function. Suppose that an agent has to choose between action a1 that leads to receiving reward xi if i is drawn and action a2 that leads to receiving yi in the case of seeing i. Let zi = xi −yi. We can now go back to choosing between accepting and rejecting a contract by saying that choosing (preferring) a1 over a2 means accepting the contract z. In other words if E(x) > E(y) choose a1 and if E(x) < E(y) choose a2.

Remark 12 We note that if we postulate that choosing between contract x and the zero contract is the same as choosing between accepting or rejecting x, then being able to choose between contracts implies the ability to choose between accepting and rejecting one contract. We, therefore, can say that the ability to choose between a pair of contracts is equivalent to the ability to choose to accept or reject a single contract.

7 We can also choose between several contracts. Suppose that action ak gives us k k m j k the contract x = (xi )i=1. If E(x ) > E(x ) ∀k =6 j then we strictly prefer aj over all other actions. In other words a contract xj − xk would for all k be accepted and not rejected by a rational decision maker.

Remark 13 If we have a rational decision maker for accepting or rejecting con- tracts, then there are implicitly probabilities pi for symbol i that characterize the k decisions. A rational choice between actions ak leading to contracts x is taken by choosing action ∗ k a = argmax X pixi . (4) k i

2.7 Choosing between Environments In this section, we assume that the event that the contracts are concerned with might be affected by the choice of action.

Definition 14 (Reactive environment) An environment is a tree with symbols jt (percepts) on the nodes and actions at on the edges. We provide the environment with an action at at each time t and it presents the symbol jt at the node we arrive at by following the edge chosen by the action. We can also equivalently say that a reactive environment ν is a function from strings a1j1, ..., atjt to {0, 1} which equals 1 if and only if ν would produce j1, ..., jt given the actions a1, ..., at.

We will define the concept of a decision maker for the case where one decision will be taken in a situation where not only the contract, but also the outcome can depend on the choice. We do this by defining the choice as being between two different environments.

Definition 15 (Active decision maker) Consider a choice between having con- tract x for passive environment env1 or contract y for passive environment env2. A decision maker is a set Z ⊂ Rm1 × Rm2 which defines exactly the pairs (x, y) for which we choose env1 with x over env2 with y.

Definition 16 (Rational active choice) To choose between action a1 with con- tract x and a2 with contract y in a situation where the action may affect the event, we consider two separate environments, namely the environments that result from the two different actions. We would then have a situation where we will have one observation from each environment. Preferring a1 with x to a2 with y is (equivalent) to consider x − y to be an acceptable contract for the pair of events.

Remark 17 Definition 16 means that a1 with x is preferred over a2 with y if a1 with x − y is preferred over a2 with the zero contract.

8 Proposition 18 (Probabilities for reactive setting) Suppose that we have a reactive environment and a rational active decision maker that will make one choice between action a1 and a2 as described in Definitions 15 and 16, then there exist pi ≥ 0 and qi ≥ 0 such that action a1 with contract x is preferred over action a2 with contract y if P pixi > P qiyi and the reverse if P pixi < P qiyi. This means that the decision maker acts according to probabilities Pr(·|a1) and Pr(·|a2).

Proof. Let Z˜ be all contracts that when combined with action a1 is preferred over a2 with the zero contract. Theorem 1 guarantees the existence of pi such that ˜ ˜ P pixi > 0 implies that x ∈ Z and P pixi < 0 implies that x∈ / Z. The same way we find qi that describe when we prefer a2 with y to a1 with the zero contract. That these probabilities (pi and qi) explain the full decision maker as stated in the proposition now follows directly from Definition 16 understood as in Remark 17. Suppose that we are going to make a sequence of T < ∞ decisions where at every point of time we will have a finite number of actions to chose between. We will consider contracts, which can pay out some reward at each time step and that can depend on everything (actions chosen and symbols observed) that has happened up until this time and we want to maximize the accumulated reward at time T . We can view the choice as just making one choice, namely choosing an action tree. We will sometimes call an action tree a policy.

Definition 19 (Action tree) An action tree is a function from histories of sym- bols j1, ..., jt and decisions a1, ..., at−1 to new decisions, given that the decisions were made according to the function. Formally,

f(a1, j1, ..., at−1, jt−1)= at. An action tree will assign exactly one action for any of the circumstances that one can end up in. That is, given the history up to any time t < T of actions and events, we have a chosen action. We can, therefore, choose an action tree at time 0 and receive a total accumulated reward at time T . This brings us back to the situation of one event and one rational choice.

Definition 20 (Sequential decisions) Given a rational decision maker for the T events (jt)t=1 and the first t − 1 symbols j1, ..., jt−1 and decisions a1, ..., at−1, we define the informed rational decision maker at time t by conditioning on the past a1, j1..., at−1, jt−1. Proposition 21 (Beliefs for sequential decisions) Suppose that we have a re- active environment and a rational decision maker that will take T < ∞ decisions. Furthermore, suppose that the decisions 0 ≤ t < T have been taken and resulted in history a1, j1..., at−1, jt−1. Then the decision makers preferences at this time can be explained (through expected utility maximization) by probabilities

Pr(jt, ..., jT |a1, j1..., at−1, jt−1, at, at+1..., aT ).

9 Proof. Definition 20 and Proposition 18 immediately lead us to the conclusion that given a past up to a point t − 1 and a policy for the time t to T we have probabilistic beliefs over the possible future sequences from time t to T and the choice is categorized by maximizing expected accumulated reward at time T .

3 Countable Sets of Events

Instead of a finite set of possible outcomes, we will in this section assume a countable set. We suppose that the set of contracts is a vector space of sequences xk,k = 0, 1, 2, ... where we use pointwise addition and multiplication with scalar. We will define a space by choosing a norm and let the space consist of the sequences that have finite norm as is common in Banach space theory. If the norm makes the space complete it is called a Banach [Die84]. Interesting examples ∞ are ℓ of bounded sequences with the maximum norm k(αk)k∞ = max |αk|, c0 of sequence that converges to 0 equipped with the same maximum norm and ℓp which for 1 ≤ p< ∞ is defined by the norm

p 1/p k(αk)kp =(X |αk| ) .

For all of these spaces we can consider weighted versions (wk > 0) where

k(αk)kp,wk = k(αkwk)kp.

p p ∞ This means that α ∈ ℓ (w) iff (αkwk) ∈ ℓ , e.g. α ∈ ℓ (w) iff supk |αkwk| < ∞. Given a Banach (sequence) space X we use X′ to denote the dual space that consists of all continuous linear functionals f : X → R. It is well known that a linear functional on a Banach space is continuous if and only if it is bounded, i.e. that |f(x)| ′ there is C < ∞ such that kxk ≤ C ∀x ∈ X. Equipping X with the norm |f(x)| 1 ′ ∞ kfk = sup kxk makes it into a Banach space. Some examples are (ℓ ) = ℓ , ′ 1 p ′ q c0 = ℓ and for 1

f(x)= X xipi where the dual space is the space that (pi) must lie in to make the functional both well defined and bounded. It is clear that ℓ1 ⊂ (ℓ∞)′ but (ℓ∞)′ also contains “stranger” objects. The existence of these other objects can be deduced from the Hahn-Banach theorem (see e.g. [Kre89] or [NB97]) that says that if we have a linear function defined on a subspace Y ∈ X and if it is bounded on Y then there is an extension to a bounded linear functional on X. If Y is dense in X the extension is unique but in general it is not. One can use this Theorem by first looking at the subspace ∞ of all sequences in ℓ that converge and let f(α) = limk→∞ αk. The Hahn-Banach

10 theorem guarantees the existence of extensions to bounded linear functionals that are defined on all of ℓ∞. These are called Banach limits. The space (ℓ∞)′ can be identified with the so called ba space of bounded and finitely additive measures with the variation norm kνk = |ν|(A) where A is the underlying set. Note that ℓ1 can be identified with the smaller space of countably additive bounded measures with the same norm. The Hahn-Banach Theorem has several equivalent forms. One of these identifies the hyper-planes with the bounded linear functionals [NB97].

Definition 22 (Rationality II) Given a Banach sequence space X of contracts, we say that the decision maker (subset Z of X defining acceptable contracts) is rational if

1. Every contract x ∈ X is either acceptable or rejectable or both;

2. x is acceptable if and only if −x is rejectable;

3. x, y ∈ Z, λ,γ ≥ 0 then λx + γy ∈ Z;

4. If xk ≥ 0 ∀k then x = (xk) is acceptable while if xk > 0 ∀k then x is not rejectable.

Theorem 23 (Linear separation) Suppose that we have a space of contracts X that is a Banach sequence space. Given a rational decision maker there is a positive continuous linear functional f : X → R such that

{x | f(x) > 0} ⊂ Z ⊆{x | f(x) ≥ 0}. (5)

Proof. The third property tells us that Z and −Z are convex cones. The second and fourth property tells us that Z =6 Rm. Suppose that there is a point x that lies in both the interior of Z and of −Z. Then the same is true for −x according to the second property and for the origin. That a ball around the origin lies in Z means that Z = Rm which is not true. Thus the interiors of Z and −Z are disjoint open convex sets and can, therefore, be separated by a hyperplane (according to the Hahn-Banach theorem) which goes through the origin (since according to the second and fourth property the origin is both acceptable and rejectable). The first two properties tell us that Z ∪ −Z = Rm. Given a separating hyperplane (between the interiors of Z and −Z), Z must contain everything on one side. This means that Z is a half space whose boundary is a hyperplane that goes through the origin and the closure Z¯ of Z is a closed half space and can be written as {x | f(x) ≥ 0} for some f ∈ X′. The fourth property tells us that f is positive.

Corollary 24 (Additivity) 1. If X = c0 then a rational decision maker is de- scribed by a countably additive (probability) measure. 2. If X = ℓ∞ then a rational decision maker is described by a finitely additive (probability) measure.

11 It seems from Corollary 24 that we pay the price of losing countable additivity ∞ for expanding the space of contracts from c0 to ℓ but we can expand the space ∞ ′ even more by looking at c0(w) where wk → 0 which contains ℓ and X is then 1 ℓ ((1/wk)). This means that we get countable additivity back but we instead have a restriction on how fast the probabilities pk must tend to 0. Note that a bounded linear functional on c0 can always be extended to a bounded linear functional on ∞ ℓ by the formula f(x) = P pixi but that is not the unique extension. Note also ∞ that every bounded linear functional on ℓ can be restricted to c0 and there be ∞ represented as f(x)= P pixi. Therefore, a rational decision maker on ℓ contracts has probabilistic beliefs (unless pi = 0 ∀i), though it might also take asymptotic behavior of a contract into account. For example (and here pi =0 ∀i), the decision 1 n maker that makes decisions based on asymptotic averages limn→∞ n Pi=1 xi when they exist. That strategy can be extended to all of ℓ∞ (a ). The following proposition will help us decide which decision maker on ℓ∞ is described with countably additive probabilities.

∞ ′ ∞ j Proposition 25 Suppose that f ∈ (ℓ ) . For any x ∈ ℓ , let xi = xi if i ≤ j and j xi =0 otherwise. If for any x, lim f(xj)= f(x), j→∞

∞ then f can be written as f(x)= P pixi where pi ≥ 0 and Pi=1 pi < ∞. ∞ Proof. The restriction of f to c0 gives us numbers pi ≥ 0 such that Pi=1 pi < ∞ j j ∞ and f(x) = P pixi for x ∈ c0. This means that f(x ) = Pi=1 pixi for any x ∈ ℓ j ∞ and j < ∞. Thus limj→∞ f(x )= Pi=1 pixi. Definition 26 (Monotone decisions) We define the concept of a monotone deci- sion maker in the following way. Suppose that for every x ∈ ℓ∞ there is N < ∞ such that the decision is the same for all xj, j ≥ N (See Proposition 25 for definition) as for x. Then we say that the decision maker is monotone.

∞ Example 27 Let f ∈ ℓ be such that if lim αk → L then f(α) = L (i.e. f is a Banach limit). Furthermore define a rational decision maker by letting the set of acceptable contracts be Z = {x | f(x) ≥ 0}. Then f(xj)=0 (where we use notation from Proposition 25) for all j < ∞ and regardless of which x we define xj from. Therefore, all sequences that are eventually zero are acceptable contracts. This means that this decision maker is not monotone since there are contracts that are not acceptable.

Theorem 28 (Monotone rationality) Given a monotone rational decision ∞ maker for ℓ contracts, there are pi ≥ 0 such that P pi < ∞ and

{x | X xipi > 0} ⊂ Z ⊆{x | X xipi ≥ 0}. (6)

12 Proof. According to Theorem 23 there is f ∈ (ℓ∞)′ such that (the closure of ¯ Z) Z = {x| f(x) ≥ 0} . Let pi ≥ 0 be such that P pi < ∞ and such that j f(x) = P xipi for x ∈ c0. Remember that x (notation as in Proposition 25) is always in c0. Suppose that there is x such that x is accepted but xipi < 0. This n P violate monotonicity since there exist N < ∞ such that Pi=1 xipi < 0 for all n ≥ N and, therefore, xj is not accepted for j ≥ N but x is accepted. We conclude that if x is accepted then P pixi ≥ 0 and if P pixi > 0 then x is accepted.

4 Rational Agents for Classes of Environments

We will here study agents that are designed to deal with a large range of situations. Given a class of environments we want to define agents that can learn to act well when placed in any of them, assuming it is at all possible. Definition 29 (Universality for a class) We say that a decision maker is uni- versal for a class of environments M if for any outcome sequence a1j1a2j2... that given the actions would be produced by some environment in the class, there is c> 0 (depending on the sequence) such that the decision maker has probabilities that sat- isfy Pr(j1, ..., jt|a1, ..., at) ≥ c ∀t. This is obviously true if the decision maker’s probabilistic beliefs are a convex com- bination Pν∈M wνν, wν > 0 and Pν wν =1. We will next discuss how to define some large classes of environments and agents that can succeed for them. We assume that the total accumulated reward from the environment will be finite regardless of our actions since we want any policy to have finite utility. Furthermore, we assume that rewards are positive and that it is possible to achieve strictly positive rewards in any environment. We would like the agent to perform well regardless of which environment from the chosen class it is placed in. For any possible policy (action tree) π and environment ν, there is a total reward π Vν that following π in ν would result in. This means that for any π there is a contract π sequence (Vν )ν, assuming we have enumerated our set of environments. Let ∗ π Vν = max Vν . π ∗ π We know that Vν > 0 for all ν. Every contract sequence (Vν )ν lies in X = ∞ ∗ π ℓ ((1/Vν )) and k(Vν )kX ≤ 1. The rational decision makers are the positive, con- ′ 1 ∗ tinuous linear functionals on X. X contains the space ℓ (Vν ). In other words if ∗ wν ≥ 0 and P wνVν < ∞ then the sequence (wν) defines a rational decision maker for the contract space X. These are exactly the monotone rational decision makers. Letting (which is the AIXI agent from [Hut05])

∗ π π ∈ arg max wνVν (7) π X ν

13 we have a choice with the property that for any other π with

π π∗ X wνVν < X wνVν . ν ν π∗ π ∗ Hence the contract (Vν − Vν ) is not rejectable. In other words π is strictly ∗ preferable to π. By letting pν = wνVν , we can rewrite (7) as π ∗ Vν π ∈ arg max X pν . (8) π V ∗ ν ν ∗ If one further restricts the class of environments by assuming Vν ≤ 1 for all ν then π ∞ for every π,(Vν ) ∈ ℓ . Therefore, by Theorem 28 the monotone rational agents for this setting can be formulated as in (7) with (wν) ∈ ℓ1, i.e. Pν wν < ∞. However, since (pν) ∈ ℓ1, a formulation of the form of (8) is also possible. Normalizing p and w individually to probabilities makes (7) into a maximum expected utility criterion and (8) into maximum relative utility. As long as our w and p relate the way they do it is still the same decisions. If we would base both expectations on the same probabilistic beliefs it would be different criteria. When we have an upper bound ∗ Vν

4.1 Asymptotic Optimality

π Denote a chosen countable class of environments by M. Let Vν,k be the rewards achieved after time k using policy π in environment ν. We suppress the dependence on the history so far. Let π π Vν,k Wν,k = ∗ Vν,k denote the skill (relative reward) of π in environment ν from time k. The maximum possible skill is 1. We would like to have a policy π such that

π lim Wν,k =1 ∀ν ∈M. k→∞ This would mean that the agent asymptotically achieve maximum skill when placed in any environment from M. Let I(hk, ν) = 1 if ν is consistent with history hk and I(hk, ν) = 0 otherwise. Furthermore, let pν,0 pν,k = Pµ∈M pµ,0I(hk,µ) be the agent’s weight for environment ν at time k and let πp be a policy that at time k acts according to a policy in π Vν,k arg max X pν,k . (9) π V ∗ ν ν,k

14 In the following theorem, we prove that for every environment ν ∈M, the policy πp will asymptotically achieve perfect relative rewards. We have to assume that there exists a sequence of policies πk > 0 with this property (as for the similar Theorem 5.34 in [Hut05] which dealt with discounted values). The convergence in W -values is the relevant sense of optimality for our setting, since the V -values converge to zero for any policy.

Theorem 30 (Asymptotic optimality) Suppose that we have a decision maker that is universal (i.e. pν > 0 ∀ν) with respect to the countable class M of environ- ments (which can be stochastic) and that there exists policies πk such that for all πk,ν ν, Wk → 1 if ν is the actual environment (or the sequence is consistent with ν). πp,µ This implies that Wk → 1 where µ is the actual environment.

The proof technique is similar to that of Theorem 5.34 in [Hut05]. Proof. Let πk,ν k k k 0 ≤ 1 − Wk =: ∆ν , ∆ = X pν,k∆ν . (10) ν k πk,ν The assumptions tells us that ∆ν = Wk − 1 → 0 for all ν that are consistent with the sequence (pν,k = 0 if ν is inconsistent with the history at time k) and since k ∆ν ≤ 1 , it follows that k k ∆ = X pν,k∆ν → 0. ν πp,µ ν k Note that p (1 − W ) ≤ p (1 − W p ) ≤ p (1 − W ) = µ,k k Pν ν,k π ,k Pν ν,k πk,ν k k pν,k∆ν = ∆ . Since we also know that pµ,k ≥ pµ,0 > 0 it follows that P πp,µ (1 − Wk ) → 0.

5 Conclusions

We studied complete rational decision makers including the cases of actions that may affect the environment and sequential decision making. We set up simple common sense rationality axioms that imply that a complete rational decision maker has preferences that can be characterized as maximizing expected utility. Of particular interest is the countable case where our results follow from identifying the Banach space dual of the space of contracts. Acknowledgement. This work was supported by ARC grant DP0988049.

15 References

[All53] M Allais. Le comportement de l’homme rationnel devant le risque: Critique des postulats et axiomes de l’ecole americaine. Econometrica, 21(4):503–546, 1953. [Arr70] K Arrow. Essays in the Theory of Risk-Bearing. North-Holland, 1970. [Cox46] R. T. Cox. Probability, frequency and reasonable expectation. Am. Jour. Phys, 14:1–13, 1946. [deF37] B. deFinetti. La prvision: Ses lois logiques, ses sources subjectives. In Annales de l’Institut Henri Poincar 7, pages 1–68. Paris, 1937. [Die84] Joseph Diestel. Sequences and series in Banach spaces. Springer-Verlag, 1984. [Ell61] Daniel Ellsberg. Risk, Ambiguity, and the Savage Axioms. The Quarterly Journal of Economics, 75(4):643–669, 1961. [Hal99] Joseph Y. Halpern. A counterexample to theorems of Cox and Fine. Journal of AI research, 10:67–85, 1999. [Hut05] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005. [Jay03] E. T. Jaynes. Probability theory: the logic of science. Cambridge University Press, 2003. [Kre89] Erwin Kreyszig. Introductory With Applications. Wiley, 1989. [NB97] Lawrence Naricia and Edward Beckenstein. The Hahn-Banach theorem: the life and times. Topology and its Applications, 77(2):193–211, 1997. [NM44] J. Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1944. [Par94] J. B. Paris. The uncertain reasoner’s companion: a mathematical perspective. Cambridge University Press, New York, NY, USA, 1994. [Ram31] Frank P. Ramsey. Truth and probability. In R. B. Braithwaite, editor, The Foundations of Mathematics and other Logical Essays, chapter 7, pages 156–198. Brace & Co., 1931. [Sav54] L. Savage. The Foundations of Statistics. Wiley, New York, 1954. [Sug91] Robert Sugden. Rational choice: A survey of contributions from economics and philosophy. Economic Journal, 101(407):751–85, July 1991.

16