Optimal Policies Tend To Seek Power

Alexander Matt Turner Logan Smith Rohin Shah Oregon State University Mississippi State University UC Berkeley [email protected] [email protected] [email protected]

Andrew Critch Prasad Tadepalli UC Berkeley Oregon State University [email protected] [email protected]

Abstract

Some researchers speculate that intelligent (RL) agents would be incentivized to seek resources and power in pursuit of their objectives. Other researchers are skeptical, because human-like power-seeking instincts need not be present in RL agents. To clarify this debate, we develop the first formal theory of the statistical tendencies of optimal policies in reinforcement learning. In the context of Markov decision processes (MDPs), we prove that certain environmental symmetries are sufficient for optimal policies to tend to seek power over the environment. These symmetries exist in many environments in which the agent can be shut down or destroyed. We prove that for most prior beliefs one might have about the agent’s reward function (including as a special case the situations where the reward function is known), one should expect optimal policies to seek power in these environments. These policies seek power by keeping a range of options available and, when the discount rate is sufficiently close to 1, by navigating towards larger sets of potential terminal states.

1 Introduction

Omohundro[2008], Bostrom[2014], Russell[2019] hypothesize that highly intelligent agents tend to seek power in pursuit of their goals. Such power-seeking agents might gain power over humans. For example, imagined that an agent tasked with proving the Riemann hypothesis might rationally turn the planet – along with everyone on it – into computational resources [Russell and Norvig, 2009]. arXiv:1912.01683v7 [cs.AI] 1 Jun 2021 Some researchers argue that these worries stem from anthropomorphization of AI [Various, 2019, Pinker and Russell, 2020, Mitchell, 2021]. LeCun and Zador[2019] argue that AI “never needed to evolve, so it didn’t develop the survival instinct that leads to the impulse to dominate others.” We clarify this debate by grounding the claim that highly intelligent agents will tend to seek power. In section4, we identify optimal policies in MDPs as a reasonable formalization of “highly intelligent agents.” Optimal policies “tend to” take an action when the action is optimal for most reward functions. Section5 defines “power” as the ability to achieve a wide range of goals: after all, “money is power”, and money is instrumentally useful for many goals. Conversely, it’s harder to pursue most goals when physically restrained, and so a physically restrained person has little power. An action “seeks power” if it leads to states where the agent has higher power. We make no claims about when real-world AI power-seeking behavior could become plausible. Instead, we consider the theoretical consequences of optimal action in MDPs. Future work is needed

Preprint. Under review. to translate our theory from optimal policies to learned, real-world policies – we expect that our arguments will at least hold conceptually, if not provably. Section6 shows that power-seeking tendencies arise not from anthropomorphism, but from the combination of optimal behavior and certain graphical symmetries present in many MDPs. These symmetries automatically occur in many environments where the agent can be shut down or destroyed, yielding broad applicability of our main result (theorem 6.13).

2 Related Work

An action is instrumental to an objective when it helps achieve that objective. Some actions are instrumental to many objectives, making them robustly instrumental. The claim that power-seeking is robustly instrumental is a specific instance of the instrumental convergence thesis:

Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents [Bostrom, 2014].

For example, in Atari games, avoiding (virtual) death is instrumental for both completing the game and for optimizing curiosity [Burda et al., 2019]. Many AI alignment researchers hypothesize that most advanced AI agents will have concerning instrumental incentives, such as resisting deactiva- tion [Soares et al., 2015, Milli et al., 2017, Hadfield-Menell et al., 2017, Carey, 2018] and acquiring resources [Benson-Tilsen and Soares, 2016]. Lastly, Menache et al.[2002] identify and navigate towards robustly instrumental “bottleneck states.” We formalize power as the ability to achieve a wide variety of goals. AppendixA compares our formalization with information-theoretic empowerment [Salge et al., 2014]. Some of our results relate the formal power of states to the structure of the environment. Foster and Dayan[2002], Drummond[1998], Sutton et al.[2011], Schaul et al.[2015] note that value functions encode important information about the environment, as they capture the agent’s ability to achieve different goals. Turner et al.[2020] speculate that a state’s optimal value correlates strongly across reward functions. In particular, Schaul et al.[2015] learn regularities across value functions, suggesting that some states are valuable for many different reward functions (i.e. powerful). We are not the first to study convergence of behavior, form, or function. In economics, turnpike theory studies how certain paths of accumulation tend to be optimal [McKenzie, 1976]. In biology, convergent evolution occurs when similar features (e.g. flight) independently evolve in different time periods [Reece and Campbell, 2011]. Lastly, computer vision networks reliably learn e.g. edge detectors [Olah et al., 2020].

3 State Visit Distribution Functions Quantify The Agent’s Available Options

We clarify the power-seeking debate by prov- ing what optimal policies “usually look like” ` ∅ r in a given environment. We illustrate our - % results with a simple case study, before ex- plaining how to reason about a wide range of ` `/ r. r MDPs. Appendix C.1 lists MDP theory contri- . F & butions of independent interest, appendixC left right lists definitions and theorems, and appendix Figure 1: ` is a 1-cycle, and ∅ is a terminal state. D contains the proofs. Arrows represent. deterministic transitions induced by Definition 3.1 (Rewardless MDP). taking some action a . Since the left subgraph is , ,T is a rewardless MDP with “almost a copy” of the∈right A subgraph, proposition 6.9 finitehS A statei and action spaces and will prove that more reward functions have optimal , and stochastic transition functionS policies which go right than which go left at state A T : ∆( ). We treat the discount ?, and that such policies seek power – both intuitively, rateSγ ×as A a variable → S with domain [0, 1]. and in a reasonable formal sense.

2 Definition 3.2 (1-cycle states). Let es R|S| be the unit vector for state s, such that there is a 1 in ∈ the entry for state s and 0 elsewhere. State s is a 1-cycle if a : T (s, a) = es. s is a terminal state if a : T (s, a) = e . ∃ ∈ A ∀ ∈ A s Our theorems apply to stochastic environments, but we present a deterministic case study for clarity. The environment of fig.1 is small, but its structure is rich. For example, the agent has more “options” at ? than at the terminal state ∅. Formally, ? has more visit distribution functions than ∅ does. Definition 3.3 (State visit distribution [Sutton and Barto, 1998]). Π := S , the set of stationary deterministic policies. The visit distribution induced by following policy πAfrom state s at discount π,s t π,s : P∞ : rate γ [0, 1) is f (γ) = t=0 γ Est π s [est ]. f is a visit distribution function; (s) = f π,s ∈π Π . ∼ | F { | ∈ } In fig.1, starting from ` , the agent can stay at ` or alternate between ` and ` , and so (` ) = 1 1 . . . - F 1 . e` , 2 (e` + γe` ) . In contrast, agents at ∅ must stay at ∅. (∅) = e . 1 γ . 1 γ . - 1 γ ∅ { − π,s − } F {1 − } π f is usually non-injective – at ∅, all policies π map to visit distribution function 1 γ e∅. 7→ − Before moving on, we introduce two important concepts used in our main results. First, we sometimes restrict our attention to visit distributions which take certain actions (fig.2). Definition 3.4 ( single-state restriction). Considering only visit distribution functions induced by F  π,s policies taking action a at state s0, (s π(s0) = a) := f (s) π Π : π(s0) = a, f = f . F | ∈ F | ∃ ∈ Second, some f (s) are “unimportant.” Consider an agent ∈ F optimizing reward function er (1 reward when at r , 0 oth- r & & erwise) at e.g. γ = 1 . Its optimal policies navigate to r and % 2 & stay there. Similarly, for reward function er , optimal policies navigate to and stay there. However, for% no reward function r r. r % F & is it uniquely optimal to alternate between r and r . Only dom- right inated visit distribution functions alternate% between&r and r (definition 3.6). % & Figure 2: The subgraph corre- Definition 3.5 (Value function). Let π Π. For any reward sponding to (? π(?) = ∈ right). GrayF dotted| actions function R RS over the state space, the on-policy value at state ∈ π π,s s and discount rate γ [0, 1) is V (s, γ) := f (γ)>r, where are only taken by the policies of R π ∈ dominated f (?) nd(?). r R|S| is R expressed as a column vector (one entry per state). ∈ F \F ∈ π The optimal value is VR∗ (s, γ) := maxπ VR (s, γ). Definition 3.6 (Non-domination).

π π π0 nd(s) := f (s) r R|S|, γ (0, 1) : f (γ)>r > max f (γ)>r . (1) F { ∈ F | ∃ ∈ ∈ f π0 (s) f π } ∈F \{ } For any reward function R and discount rate γ, f π (s) is (weakly) dominated by f π0 (s) ∈ F ∈ F if V π(s, γ) V π0 (s, γ). f π (s) is non-dominated if there exist R and γ at which f π is not R ≤ R ∈ Fnd dominated by any other f π0 .

4 Some Actions Have A Greater Probability Of Being Optimal

We claim that optimal policies “tend” to take certain actions in certain situations. We first consider the probability that certain actions are optimal. 1 Reconsider the reward function er , optimized at γ = . Starting from ?, the optimal trajectory goes & 2 right to r. to r , where the agent remains. The right action is optimal at ? under these incentives. Optimal policy sets& capture the behavior incentivized by a reward function and a discount rate.

Definition 4.1 (Optimal policy set function). Π∗ (R, γ) is the optimal policy set for reward function R at γ (0, 1). All R have at least one optimal policy π Π [Puterman, 2014]. Π∗ (R, 0) := ∈ ∈ limγ 0 Π∗ (R, γ) and Π∗ (R, 1) := limγ 1 Π∗ (R, γ) exist by lemma D.32 (taking the limits with respect→ to the discrete topology over policy→ sets). We may be unsure which reward function an agent will optimize. We may expect to deploy a system in a known environment, without knowing the exact form of e.g. the reward shaping [Ng et al., 1999]

3 or intrinsic motivation [Pathak et al., 2017]. Alternatively, one might attempt to reason about future RL agents, whose details are unknown. Our power-seeking results do not hinge on such uncertainty, as they also apply to degenerate distributions (i.e. we know what reward function will be optimized). Definition 4.2 (Reward function distributions). Different results make different distributional as- sumptions. Results with any hold for any probability distribution over R|S|. We sometimes assume D a bounded distribution bound in order to ensure well-defined expectations. Letting X be any continu- D ous bounded distribution over R, X-IID := X|S|. For example, when Xu := unif(0, 1), Xu-IID is the maximum-entropy distribution.D is the degenerate distribution on the state indicatorD reward Ds function es, which assigns 1 reward to s and 0 elsewhere.

With any representing our prior beliefs about the agent’s reward function, what behavior should we expectD from its optimal policies? Perhaps we want to reason about the probability that it’s optimal to go from ? to ∅ or to r. and then stay at r . In this case, we quantify the optimality probability of 2 % : γ γ F = e? + 1 γ e∅, e? + γer. + 1 γ er . { − − % } ( ) [0 1] ( ) := Definition 4.3 (Visit distribution optimality probability). Let F s , γ , . P any F, γ π  ⊆ F ∈ D PR any f F : π Π∗ (R, γ) . ∼D ∃ ∈ ∈ Alternatively, perhaps we’re interested in the probability that right is optimal at ?. Definition 4.4 (Action optimality probability). At discount rate γ and at state s, the optimality probability of action ( ) := Π ( ) : ( ) =  a is P any s, a, γ PR any π∗ ∗ R, γ π∗ s a . D ∼D ∃ ∈ Optimality probability may seem hard to reason about. It’s hard enough to compute an optimal policy for a single reward function, let alone uncountably many! But consider any -IID. When γ = 0, DX optimal policies maximize next-state reward. At ?, identically distributed reward means `/ and r. have an equal probability of having maximal next-state reward. Therefore, ( left 0) = P X-IID ?, , ( right 0). This is not a proof, but such statements are provable. D P X-IID ?, , D With being the degenerate distribution on reward function e , left 1  = 1 0 = `/ `/ P ` ?, , 2 > D D / ?, right, 1 . Similarly, ?, left, 1  = 0 < 1 = ?, right, 1 . Therefore, “what P `/ 2 P r. 2 P r. 2 doD optimal policies ‘tend’ to lookD like?” seems to depend on one’sD prior beliefs. But in fig.1, we claimed that left is optimal for fewer reward functions than right is. The claim is meaningful and true, but we will return to it in section6.

5 Some States Give The Agent More Control Over The Future

The agent has more “options” at ` than at the inescapable terminal state ∅. Furthermore, since r has a loop, the agent has more “options”. at r than at ` . A glance at fig.3 leads us to intuit that% & . r affords the agent more power than ∅. & What is power? Philosophers have many answers. One prominent answer is the dispositional view: power is the ability to achieve a range of goals [Sattarov, 2019]. In an MDP, optimal value functions VR∗ (s, γ) capture the agent’s ability to “achieve the goal” R. So average optimal value captures the agent’s ability to achieve a range of goals bound. D Definition 5.1 (Average optimal value). The average optimal value at state s and discount rate h i (0 1) is ( ) :=  ( ) = max f( ) r γ , V ∗bound s, γ ER bound VR∗ s, γ Er bound f (s) γ > . ∈ D ∼D ∼D ∈F Figure3 shows the pleasing result that for the max-entropy distribution, r has greater average optimal value than . However, ( ) has a few problems as a measure& of power. All f ( ) ∅ V ∗bound s, γ s D ∈ F count the agent’s initial presence at the initial state , and f( ) = 1 (proposition D.3) diverges s γ 1 1 γ as 1. Accordingly, lim ( ) tends to diverge. Definition− 5.2 fixes these issues. γ γ 1 V ∗bound s, γ → → D Definition 5.2 (POWER). Let γ (0, 1). ∈ " # 1 γ 1 γ >   POWER bound (s, γ) := E max − f(γ) es r = − E VR∗ (s, γ) R(s) . D r bound f (s) γ − γ R bound − ∼D ∈F ∼D (2)

4 ` r - ∅ %

` r. r `/ F & . left right

1 1 1 γ 2 1 Figure 3: For Xu := unif(0, 1), V ∗ (∅, γ) = , V ∗ (` , γ) = + 2 ( + γ), Xu-IID 2 1 γ Xu-IID . 2 1 γ 3 2 1 γ 2 D1 2 − D − and V ∗ (r , γ) = + . and are the expected maxima of one and two draws from Xu-IID & 2 1 γ 3 2 3 the uniformD distribution, respectively.− For all γ (0, 1), V ( , γ) < V (` , γ) < ∗X -IID ∅ ∗X -IID ∈ D u D u . V (r , γ). POWER ( , γ) = 1 , POWER (` , γ) = 1 ( 2 + 1 γ), and ∗X -IID Xu-IID ∅ 2 Xu-IID 1+γ 3 2 D u & D D . POWER ( ) = 2 . The POWER of is due to the agent only choosing the best of Xu-IID r , γ 3 ` two statesD on every& other time step. .

POWER has nice formal properties.

Lemma 5.3 (Continuity of POWER). POWER bound (s, γ) is Lipschitz continuous on γ [0, 1]. D ∈   Proposition 5.4 (Maximal POWER). POWER bound (s, γ) ER bound maxs R(s) , with equality if s can deterministically reach all states in oneD step and≤ all states∼D are 1-cycles.∈S

Proposition 5.5 (POWER is smooth across reversible dynamics). Let bound be bounded [b, c]. Sup- D pose s and s0 can both reach each other in one step with probability 1.  POWER bound (s, γ) POWER bound s0, γ (c b)(1 γ). (3) D − D ≤ − − We now formalize what it means for actions to “seek power” in a situation. Definition 5.6 (POWER-seeking actions). At state s and discount rate γ [0, 1], action a seeks more   ∈   POWER bound than a0 when Esa T (s,a) POWER bound (sa, γ) Es T (s,a ) POWER bound (sa0 , γ) . D ∼ D ≥ a0 ∼ 0 D

POWER is sensitive to the choice of distribution. ` gives maximal POWER ` to ` . r D . D . . D & assigns maximal POWER to r . even gives maximal POWER to ! In what sense does r ∅ ∅ ∅ ∅ D & & D D have “less POWER” than r , and in what sense does going right “tend to seek POWER” compared to left? &

6 Certain Environmental Symmetries Produce Power-Seeking Tendencies

We prove that for all γ [0, 1] and for most distributions , POWER (` , γ) POWER (r , γ). But first, we explore why∈ this must be true. D D . ≤ D & 1 1 1 1 1 (` ) = 1 γ e` , 1 γ2 (e` + γe` ) and (r ) = 1 γ er , 1 γ2 (er + γer ), 1 γ er . F . { . . - } F & { & & % % } These two sets− look awfully− similar. (` ) is a “subset” of− (r )−, only with “different− states.” Figure4 demonstrates a state permutationF .φ which embeds (`F ) &into (r ). F . F & Definition 6.1 (Similarity of visitation distribu- ` ∅ r tion sets). Let F,F 0 R|S|. Given state per- - % ⊆ mutation φ S inducing permutation matrix Involution φ ∈ |S| P , ( ) := P f f . is r φ R|S|×|S| φ F φ F F ` `/ F r. ∈ | ∈ . & similar to F 0 when φ : φ(F ) = F 0. φ is an invo- left right 1 ∃ lution if φ = φ− – it either swaps states, or leaves them be. Figure 4: nd(` ) is similar to a strict sub- F . Let I R. If F,F 0 instead contain vector-valued set of (r ) via involution φ: ⊆ F & functions I R|S|, F is similar to F 0 when φ :  7→  ∃  γ I : Pφf(γ) f F = f 0(γ) f 0 F 0 . φ nd(` ) ∀ ∈ | ∈ | ∈ F . : 1 1 = 1 γ Pφe` , 1 γ2 (Pφe` + γPφe` ) Consider a reward function R0 which assigns { − . − . - } 1 1 R0(` ) = R0(` ) = 1, R0(r ) = R0(r ) = 0. = 1 γ er , 1 γ2 (er + γer ) ( (r ). . - & % { − & − & % } F &

5 R assigns more optimal value to ` than to r : V (` , γ) = 1 > 0 = V (r , γ). However, 0 R∗0 1 γ R∗0 . & . − & consider φ R0: (φ R0)(` ) := R0(φ(` )) = R0(r ) = 0. This permuted reward function switches · · . . &1 the optimal value functions: V ∗ (` , γ) = 0 < = V ∗ (r , γ). φ R0 1 γ φ R0 · . − · & Remarkably, this φ has the property that for any R which assigns ` greater optimal value than r (i.e. . & VR∗(` , γ) > VR∗(r , γ)), the opposite holds for the permuted φ R: Vφ∗R(` , γ) < Vφ∗R(r , γ). . & · · . · & We can permute reward functions, but we can also permute reward function distributions. Permuted distributions simply permute which states get which rewards.

Definition 6.2 (Pushforward distribution of a permutation). ( ) Let φ S . φ any is the pushforward distribution in- D ∈ |S| D φswap duced by applying the random vector f(r) := P r to any. y φ D 0 Definition 6.3 (Orbit of a probability distribution). The D orbit of any under the symmetric group S is S any := D |S| |S|·D φ( any) φ S . { D | ∈ |S|} x For example, the orbit of a degenerate state indicator distri- Figure 5: Probability density plots of bution s is S s = s s0 , and fig.5 shows |S| 0 the Gaussian distributions and 0 the orbitD of a 2D Gaussian·D {D distribution.| ∈ S} 2 D D over R . The symmetric group S2 Considering again the involution φ of fig.4: for every bound contains the identity permutation φid D and the reflection permutation φswap for which ` has more POWER bound than r , ` has less . D & . POWERφ( ) than r . This fact is not obvious – it is (switching the y and x values). The bound & shown byD the proof of lemma D.22. orbit of consists of φid( ) = and D D D φswap( ) = 0. Imagine bound’s orbit elements “voting” whether ` or r D D . & has strictlyD more POWER. Proposition 6.5 will show that r can’t lose the “vote” for the orbit of any bounded reward function distribution. Definition 6.4 formalizes& this “voting” notion.

Definition 6.4 (Inequalities which hold for most bounded probability distributions). Let f1, f2 : ∆(R|S|) R be any functions from reward function distributions to real numbers. We write → 1 f most f when, for all bound, the following cardinality inequality holds: 1 ≥ 2 D

S bound f1( ) > f2( ) S bound f1( ) < f2( ) . (4) {D ∈ |S| ·D | D D } ≥ {D ∈ |S| ·D | D D }

Proposition 6.5 (States with “more options” generally have more POWER). Suppose (s0) is Fnd similar to a subset of (s) via involution φ. Then γ [0, 1] : POWER bound (s0, γ) most F ∀ ∈ D ≤ POWER bound (s, γ). D  If (s) φ (s0) is non-empty, then for all γ (0, 1), the inequality is strict for all -IID Fnd \ Fnd ∈ DX and POWER bound (s0, γ) most POWER bound (s, γ) – i.e. POWER bound (s0, γ) most POWER bound (s, γ) does not hold.D 6≥ D D ≥ D

Proposition 6.5 proves that for all γ [0, 1], POWER bound (` , γ) most POWER bound (r , γ) via : : ∈ D . ≤ 1 D & s0 = ` , s = r , and the involution φ shown in fig.4. In fact, because ( 1 γ er ) nd(r ) . & − % ∈ F & \ φ( nd(` )), r has “strictly more options” and therefore fulfills proposition 6.5’s stronger condition. F . & Proposition 6.5 is shown using the fact that φ injectively maps under which r has less POWER , to distributions φ( ) which agree with the intuition that r offersD more control.& Therefore, at leastD D & half of each orbit must agree, and r never “loses the POWER vote” against ` .2 & .

1 Boundedness only ensures that POWER is well-defined. Optimality probability results hold for all Dany orbits. 2 Proposition 6.5 also proves that in general, ∅ has less POWER than `. and r&. However, this does not prove that most distributions D satisfy the joint inequality POWERD(∅, γ) ≤ POWERD(`., γ) ≤ POWERD(r&, γ) – only that these inequalities hold pairwise for most D. The orbit elements D which agree that ∅ has less 0 OWER D OWER P D than `. need not be the same elements which agree that `. has less P D0 than r&.

6 6.1 Keeping options open tends to be POWER-seeking and tends to be optimal

Certain symmetries in the MDP structure ensure that, compared to left, going right tends to be optimal and to be POWER-seeking. Intuitively, by going right, the agent has “strictly more choices.” Proposition 6.9 formalizes this tendency, but we need several technical concepts to state the result. Definition 6.6 (Equivalent actions). Actions a and a are equivalent at state s (written a a ) if 1 2 1 ≡s 2 they induce the same transition probabilities: T (s, a1) = T (s, a2).

The agent can only reach states in r., r , r by taking actions equivalent to right at state ?. { % &} Definition 6.7 (State-space bottleneck). Starting from s, state s0 is a bottleneck for S via action a when state s can reach the states of S with positive probability, but only by taking actions⊆ S equivalent a to a at state s0. We write this as s s0 S. → →  Definition 6.8 (States reachable after taking an action). REACH s0, a is the set of states reachable with positive probability after taking the action a in state s0. Proposition 6.9 (Keeping options open tends to be POWER-seeking and tends to be optimal).

a  a0  Suppose that s s0 REACH s0, a and s s0 REACH s0, a0 . F := (s π(s0) = → → → → a0 F | a0),Fa := (s π(s0) = a). Suppose Fa is similar to a subset of Fa via involution φ which fixes all F |  0  states not belonging to REACH s0, a0 or REACH s0, a .   Then for all γ [0, 1], Es T (s ,a ) POWER bound (sa0 , γ) most ∈ a0 ∼ 0 0 D ≤  OWER ( ) and ( ) ( ). Esa T (s0,a) P bound sa, γ P bound Fa0 , γ most P bound Fa, γ ∼ D D ≤ D If (s) F φ(F ) is non-empty, then for all γ (0, 1), both inequalities are strict Fnd ∩ a \ a0 ∈ for all ,  OWER ( )  OWER ( ), and X-IID Esa T (s0,a0) P bound sa0 , γ most Esa T (s0,a) P bound sa, γ D 0 ∼ . D 6≥ ∼ D P bound (Fa0 , γ) most P bound (Fa, γ) D 6≥ D

We check the conditions of proposition 6.9. s = s0 := ?, : : ` r a0 = left, a = right. Figure6 shows that ? - ∅ % left right → Involution φ ? `/, ` , ` and ? ? r., r , r . → { - .} → → { % &} (? π(?) = left) is similar to a strict subset of ` r. r `/ F & F(? | π(?) = right) via permutation φ, which fixes . left right F | 2 ? and ∅. Furthermore, nd(?) e? + γer. + γ er + & γ3 γF2 ∩ { γ2 1 γ er , e?+γer. + 1 γ er = e?+γer. + 1 γ er % % } { % } Figure 6 is− non-empty, and so all− conditions are met. − For any γ [0, 1] and such that P (?, left, γ) > P (?, right, γ), environmental symmetry ∈ D D D ensures that Pφ( ) (?, left, γ) < Pφ( ) (?, right, γ). A similar statement holds for POWER. D D 6.2 When γ 1, optimal policies tend to navigate towards “larger” sets of cycles ≈ Proposition 6.5 and proposition 6.9 are powerful because they apply to all γ [0, 1], but they can only be applied given hard-to-satisfy environmental symmetries. In contrast,∈ proposition 6.12 and theorem 6.13 apply to many finite structured environments common to RL. Starting from ?, consider the cycles which the agent can reach. Recurrent state distributions (RSDs) generalize deterministic graphical cycles to potentially stochastic environments. RSDs simply record how often the agent tends to visit a state in the limit of infinitely many time steps. Definition 6.10 (Recurrent state distributions [Puterman, 2014]). The recurrent state distributions  which can be induced from state s are RSD (s) := limγ 1(1 γ)f(γ) f (s) . RSDnd (s) is → the set of RSDs which strictly maximize average reward for some− reward| function.∈ F

1 1 As suggested by fig.3, RSD (?) = e` , 2 (e` + e` ), e∅, er , 2 (er + er ), er . As dis- 1 { . . - % % & & } cussed in section3, 2 (er + er ) is dominated: alternating between r and r is never strictly better than choosing one or% the other.& % & A reward function’s optimal policies can vary with the discount rate. When γ 1, optimal policies “ignore” transient reward because average reward is the dominant consideration.≈

7 Definition 6.11 (Blackwell optimal policies [Blackwell, 1962]). Π∗ (R, 1) := limγ 1 Π∗ (R, γ) is the Blackwell optimal policy set for reward function R. →

Blackwell optimal policies maximize average reward. Average reward is governed by RSD access. For example, r has “more” RSDs than ∅; therefore, it usually has greater POWER when γ = 1. &  Proposition 6.12 (When γ = 1, RSDs control POWER). If RSDnd s0 is similar to a subset  of RSD (s) via involution φ, then POWER bound s0, 1 most POWER bound (s, 1). If RSDnd (s) D ≤ D  \ φ(RSDnd(s0)) is non-empty, then this inequality is strict for all X-IID, and POWER bound s0, 1 most D D 6≥ POWER bound (s, 1). D

We check that both conditions of proposition 6.12 are satisfied when s0 := ∅, s := r , and the & involution φ swaps ∅ and r . φ(RSDnd (∅)) = φ( e∅ ) = er ( er , er = RSDnd(r ). The conditions are satisfied.& { } { & } { & % } & Informally, states with more RSDs generally have more POWER at γ = 1, no matter their transient dynamics. Furthermore, Blackwell optimal policies are more likely to end up in “larger” sets of RSDs than in “smaller” ones. Thus, Blackwell optimal policies tend to navigate towards parts of the state space which contain more RSDs.

Theorem 6.13 (Blackwell optimal policies tend to end up in “larger” sets of RSDs). Let D0,D ⊆ RSD (s). Suppose that D0 is similar to a subset of D via involution φ and that the sets D0 D  ∪ and RSDnd (s) D0 D have pairwise orthogonal vector elements (i.e. pairwise disjoint vector support). \ ∪ Then  . If  is non-empty, the inequality P bound D0, 1 most P bound (D, 1) RSDnd (s) D φ(D0) D ≤ D ∩ \ is strict for all , and  . X-IID P bound D0, 1 most P bound (D, 1) D D 6≥ D Corollary 6.14 (Blackwell optimal policies tend not to end up in any given 1-cycle). Suppose are distinct. Then  . esx , es0 RSD (s) P bound esx , 1 most P bound RSD (s) esx , 1 ∈ D { } ≤ D \{ } If there exists a third , then  . es00 RSD (s) P bound esx , 1 most P bound RSD (s) esx , 1 ∈ D { } 6≥ D \{ }

Figure7 illustrates that e∅, er , er RSD (?). Thus, both conclusions& of% corol-∈ lary 6.14 hold: e 1 ` r P bound ∅ , most ∅ D { } ≤ - % RSD ( ) e 1 and P bound ? ∅ , D \{ } e , 1 RSD (?) e , 1. In P bound ∅ most P bound ∅ ` ` r. r D { } 6≥ D \{ } / F & other words, Blackwell optimal policies tend to end up . left right in RSDs besides ∅. Since ∅ is a terminal state, it cannot reach other RSDs. Since Blackwell optimal policies tend to end up in other RSDs, Blackwell optimal policies tend Figure 7: The cycles in RSD (?). to avoid ∅. This section’s results prove the γ = 1 case. Lemma 5.3 and proposition D.33 respectively show that POWER and optimality probability are continuous at γ = 1. Therefore, if strict POWER-seeking or strictly greater optimality probability holds when γ = 1, the same holds for discount rates sufficiently close to 1.

Lastly, since most (definition 6.4) results apply to all bound, our key results apply to all degenerate distributions.≥ Therefore, our key results apply not justD to distributions over reward functions, but to individual reward functions.

6.3 How to reason about other environments

Consider an embodied navigation task through a room with a vase. Proposition 6.9 suggests that optimal policies tend to avoid breaking the vase, since doing so would strictly decrease available options. Theorem 6.13 dictates where Blackwell optimal agents tend to end up, but not what actions they tend to take in order to reach their RSDs. Therefore, care is needed. In appendixB, fig. 10 demonstrates an environment in which seeking POWER is a “detour” for most reward functions.

8 Even though randomly generated MDPs are unlikely to satisfy our sufficient conditions for POWER- seeking tendencies, theorem 6.13 is easy to apply to many structured RL environments. Loosely speaking, if the “vast majority” of RSDs are only reachable by following a subset of policies, theorem 6.13 implies that that subset tends to be Blackwell optimal. In fig.8, the player dies by going left, but can reach thousands of RSDs by heading in other directions. Even if some Blackwell optimal policies go left in order to reach fig.8’s ‘game over’ terminal state, all other RSDs cannot be reached by going left. There are many 1-cycles besides the immediate terminal state. Therefore, corollary 6.14 proves that Blackwell optimal left right ··· policies tend to not go left in this situation. Blackwell Figure 8: Consider the dynamics of the optimal policies tend to avoid immediately dying in Pac- Pac-Man video game. Ghosts kill the Man, even though most reward functions do not resemble player, at which point we consider the Pac-Man’s original score function. player to enter a ‘game over’ terminal state which shows the final configuration. Optimal policies for e.g. Pac-Man do not seek power in This rewardless MDP has Pac-Man’s dy- the real world. However, we think that our results suggest namics, but not its usual score function. that optimal policies for real-world environments tend to Fixing the dynamics, what actions are op- seek real-world power. timal as we vary the reward function?

7 Discussion

Corollary 6.14 dictates where Blackwell optimal agents tend to end up, but not how they get there. Corollary 6.14 says that such agents tend not to stay in any given 1-cycle. It does not say that such agents will avoid entering such states. For example, in an embodied navigation task, a robot may enter a 1-cycle by idling in the center of a room. Corollary 6.14 implies that Blackwell optimal robots tend not to idle in that particular spot, but not that they tend to avoid that spot entirely. However, Blackwell optimal robots do tend to avoid getting shut down. A terminal state is, by definition 3.2, unable to access other 1-cycles. Since corollary 6.14 shows that Blackwell optimal agents tend to end up in other 1-cycles, Blackwell optimal policies must tend to completely avoid the terminal state. The agent’s task MDP often represents agent shutdown with terminal states. Therefore, we conclude that in many such situations, Blackwell optimal policies tend to avoid shutdown. Reconsider the case of a hypothetical intelligent real-world agent. Suppose the designers initially have control over a Blackwell optimal agent. If the agent began to misbehave, they could just deactivate it. Unfortunately, our results suggest that this strategy might not work. Blackwell optimal agents would generally stop us from deactivating them, if physically possible. Extrapolating from our results, we conjecture that Blackwell optimal policies tend to seek power by accumulating resources, to the detriment of any other agents in the environment.

Future work. Most real-world tasks are partially observable. Although our results only apply to optimal policies in finite MDPs, we expect the key conclusions to generalize. We look forward to future work which addresses partially observable environments or suboptimal policies. Past work shows that it would be bad for an agent to disempower humans in its environment. In a two-player agent / human game, minimizing the human’s information-theoretic empowerment [Salge et al., 2014] produces adversarial agent behavior [Guckelsberger et al., 2018]. In contrast, maximizing human empowerment produces helpful agent behavior [Salge and Polani, 2017, Guckelsberger et al., 2016, Du et al., 2020]. We do not yet formally understand if, when, or why POWER-seeking policies tend to disempower other agents in the environment. More complex environments probably have more pronounced power-seeking incentives. Intuitively, there are often combinatorially many ways for power-seeking to be optimal, and relatively few ways for power-seeking not to be optimal. For example, suppose that in some environment, theo- rem 6.13 holds for one million involutions φ. Does this guarantee more pronounced incentives than if theorem 6.13 only held for one involution? We proved sufficient conditions for when reward functions tend to incentivize power-seeking. In the absence of prior information, one should expect that an arbitrary reward function distribution incentivizes power-seeking behavior under these conditions. However, we have prior information:

9 AI designers usually try to specify a good reward function. Even so, it may be hard to specify orbit elements which do not incentivize bad power-seeking.

Societal impact. We believe that this paper builds toward a rigorous understanding of the risks presented by AI power-seeking incentives. Understanding these risks is the first step in addressing them. However, basic theoretical work can have many consequences. For example, this theory could somehow help future researchers build power-seeking agents which go on to disempower humanity. We believe that the benefit of understanding outweighs the potential societal harm.

Conclusion. We developed the first formal theory of the statistical tendencies of optimal policies in reinforcement learning. In the context of MDPs, we proved sufficient conditions under which optimal policies tend to seek power, both formally (by taking POWER-seeking actions) and intuitively (by taking actions which keep the agent’s options open). Many real-world environments have symmetries which produce power-seeking incentives. In particular, optimal policies tend to seek power when the agent can be shut down or destroyed. Maximizing control over the environment will often involve resisting shutdown, and perhaps monopolizing resources. We caution that many real-world tasks are partially observable and that learned policies are rarely optimal. Our results do not mathematically prove that hypothetical superintelligent AI agents will seek power. However, we hope that this work will foster thoughtful, serious, and rigorous discussion of this possibility.

Acknowledgments

This work was supported by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund. Yousif Almulla, John E. Ball, Daniel Blank, Steve Byrnes, Ryan Carey, Michael Dennis, Scott Emmons, Alan Fern, Daniel Filan, Ben Garfinkel, Ofer Givoli, Adam Gleave, Evan Hubinger, DNL Kok, Vanessa Kosoy, Victoria Krakovna, Cassidy Laidlaw, Joel Lehman, David Lindner, Dylan Hadfield-Menell, Richard Möhn, Alexandra Nolan, Matt Olson, Neale Ratzlaff, Adam Shimi, Sam Toyer, Joshua Turner, Cody Wild, and Davide Zagami provided valuable feedback.

References Tsvi Benson-Tilsen and Nate Soares. Formalizing convergent instrumental goals. Workshops at the Thirtieth AAAI Conference on Artificial Intelligence, 2016.

David Blackwell. Discrete dynamic programming. The Annals of Mathematical Statistics, page 9, 1962.

Nick Bostrom. . Oxford University Press, 2014.

Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A. Efros. Large-scale study of curiosity-driven learning. In International Conference on Learning Representations, 2019.

Ryan Carey. Incorrigibility in the CIRL framework. AI, Ethics, and Society, 2018.

Chris Drummond. Composing functions to speed up reinforcement learning in a changing world. In Machine Learning: ECML-98, volume 1398, pages 370–381. Springer, 1998.

Yuqing Du, Stas Tiomkin, Emre Kiciman, Daniel Polani, Pieter Abbeel, and Anca Dragan. AvE: Assistance via empowerment. Advances in Neural Information Processing Systems, 33, 2020.

David Foster and Peter Dayan. Structure in the space of value functions. Machine Learning, pages 325–346, 2002.

Christian Guckelsberger, Christoph Salge, and Simon Colton. Intrinsically motivated general companion NPCs via coupled empowerment maximisation. In IEEE Conference on Computational Intelligence and Games, pages 1–8, 2016.

Christian Guckelsberger, Christoph Salge, and Julian Togelius. New and surprising ways to be mean. In IEEE Conference on Computational Intelligence and Games, pages 1–8, 2018.

10 Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 220–227, 2017. Tejas D Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J Gershman. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396, 2016. Yann LeCun and Anthony Zador. Don’t fear the Terminator, September 2019. URL https://blogs. scientificamerican.com/observations/dont-fear-the-terminator/. Steven A Lippman. On the set of optimal policies in discrete dynamic programming. Journal of Mathematical Analysis and Applications, 24(2):440–445, 1968. Lionel W McKenzie. Turnpike theory. Econometrica: Journal of the Econometric Society, pages 841–865, 1976. Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-cut—dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning, pages 295–306. Springer, 2002. Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, and Stuart Russell. Should robots be obedient? In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4754–4760, 2017. Melanie Mitchell. Why AI is harder than we think. arXiv preprint arXiv:2104.12871, 2021. Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 278–287. Morgan Kaufmann, 1999. Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. Stephen Omohundro. The basic AI drives, 2008. Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self- supervised prediction. In ICML, 2017. Steven Pinker and Stuart Russell. The foundations, benefits, and possible existential threat of AI, June 2020. URL https://futureoflife.org/2020/06/15/steven-pinker-and-stuart-russell-on- the-foundations-benefits-and-possible-existential-risk-of-ai/. Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014. J.B. Reece and N.A. Campbell. Campbell Biology. Pearson Australia, 2011. Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs using nondominated policies. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010. Stuart Russell. : Artificial intelligence and the problem of control. Viking, 2019. Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach. Pearson Education Limited, 2009. Christoph Salge and Daniel Polani. Empowerment as replacement for the three laws of robotics. Frontiers in Robotics and AI, 4:25, 2017. Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment–an introduction. In Guided Self- Organization: Inception, pages 67–114. Springer, 2014. Faridun Sattarov. Power and technology: a philosophical and ethical analysis. Rowman & Littlefield Interna- tional, Ltd, 2019. Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320, 2015. Nate Soares, Benja Fallenstein, Stuart Armstrong, and . Corrigibility. AAAI Workshops, 2015. Richard S Sutton and Andrew G Barto. Reinforcement learning: an introduction. MIT Press, 1998. Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, pages 761–768, 2011.

11 Alexander Matt Turner, Dylan Hadfield-Menell, and Prasad Tadepalli. Conservative agency via attainable utility preservation. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 385–391, 2020.

Various. Debate on instrumental convergence between LeCun, Russell, Bengio, Zador, and more, 2019. URL https://www.alignmentforum.org/posts/WxW6Gc6f2z3mzmqKs/debate-on-instrumental- convergence-between-lecun-russell. Tao Wang, Michael Bowling, and Dale Schuurmans. Dual representations for dynamic programming and rein- forcement learning. In International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 44–51. IEEE, 2007.

Tao Wang, Michael Bowling, Dale Schuurmans, and Daniel J Lizotte. Stable dual dynamic programming. In Advances in Neural Information Processing Systems, pages 1569–1576, 2008.

12 Appendix A Comparing POWER with Information-Theoretic Empowerment

Salge et al.[2014] define information-theoretic empowerment as the maximum possible mutual information between the agent’s actions and the state observations n steps in the future, written En(s). This notion requires an arbitrary choice of horizon, failing to account for the agent’s discount rate γ. “In a discrete deterministic world empowerment reduces to the logarithm of the number of sensor states reachable with the available actions” [Salge et al., 2014]. Figure9 demonstrates how empowerment can return counterintuitive verdicts with respect to the agent’s control over the future.

3 4

1 2

(a) (b) (c) Figure 9: Proposed empowerment measures fail to adequately capture how future choice is affected by present actions. In (a): En(s1) varies depending on whether n is even; thus, limn En(s1) does not exist. In (b) and (c): n : E (s ) = E (s ), even though s allows greater control→∞ over ∀ n 3 n 4 4 future state trajectories than s3 does. For example, suppose that in both (b) and (c), the leftmost black state and the rightmost red state have 1 reward while all other states have 0 reward. In (c), the agent can independently maximize the intermediate black-state reward and the delayed red-state reward. Independent maximization is not possible in (b).

POWER returns intuitive answers in these situations. limγ 1 POWER bound (s1, γ) converges by lemma 5.3. Consider the obvious involution φ which takes→ each stateD in fig. 9b to its coun-  terpart in fig. 9c. Since φ nd(s3) ( nd(s4), proposition 6.5 proves that γ [0, 1] : F F ∀ ∈ POWER bound (s3, γ) most POWER bound (s4, γ), with strict inequality under all X-IID when γ (0, 1). D ≤ D D ∈ Empowerment can be adjusted to account for these cases, perhaps by considering the channel capacity between the agent’s actions and the state trajectories induced by stationary policies. However, since POWER is formulated in terms of optimal value, we believe that POWER is better suited for MDPs than information-theoretic empowerment is.

Appendix B Seeking POWER can be a “Detour”

Remark. The results of appendixD do not depend on this section’s results.

One might suspect that optimal policies tautologically tend to seek POWER. This intuition is wrong.

Proposition B.1 (Greater POWER bound does not imply greater P bound ). D D Action a seeking more POWER bound than a0 at state s and γ does not imply that D . P bound (s, a, γ) P bound s, a0, γ D ≥ D 2 3 Proof. Consider the environment of fig. 10. Let Xu := unif(0, 1), and consider , which is bounded. Direct computation of the Xu-IID N OWER D OWER ( 1) = NE P expectation (definition 5.2) yields P Xu-IID s2, 3 2 D 1 > = POWER X -IID (s3, 1). Therefore, N seeks more 4 3 D u POWER X -IID than NE at state s1 and γ = 1. D u Figure 10 However, (s , N, 1) = 1 < 2 = (s , NE, 1). P X -IID 1 3 3 P X -IID 1 D u D u

Lemma B.2 ( most and trivial orbits). Suppose f most f . For all bounded reward function ≥ 1 ≥ 2 distributions with one-element orbits, f1( ) f2( ). In particular, has a one-element orbit when it is IID.D D ≥ D D

13   Proof. By definition 6.4, at least half of the elements S satisfy f1( ) f2( ). But D ∈ |S| ·D D ≥ D S bound = 1, f1( ) f2( ) must hold. |S| ·D D ≥ D If is IID, it has a one-element orbit due to the assumed identical distribution of reward. D Proposition B.3 (Actions which tend to seek POWER do not necessarily tend to be optimal). Action tending to seek more OWER than at state and does not imply that a P a0 s γ P bound (s, a, γ) most D ≥ . P bound s, a0, γ D

Proof. Consider the environment of fig. 10. Since RSD (s3) ( RSDnd (s2), theorem 6.13 shows that POWER bound (s2, 1) most POWER bound (s3, 1) via s0 := s3, s := s2, φ the identity permutation D ≥ D (which is an involution). Therefore, N tends to seek more POWER than NE at state s1 and γ = 1. If ( N ) ( NE ), then lemma B.2 shows that ( N ) P bound s1, , γ most P bound s1, , γ P X-IID s1, , γ most D ( NE ) for≥ all D . But the proof of proposition B.1 showed thatD ( N ≥1) P bound s1, , γ X-IID P X -IID s1, , < D D D u ( NE 1) for := unif(0 1). Therefore, it cannot be true that ( N ) P X -IID s1, , Xu , P bound s1, , γ most D u D ≥ ( ) P bound s1, NE, γ . D Appendix C Lists of Results

List of Definitions

3.1 Definition (Rewardless MDP)...... 2 3.2 Definition (1-cycle states) ...... 3 3.3 Definition (State visit distribution [Sutton and Barto, 1998]) ...... 3 3.4 Definition ( single-state restriction) ...... 3 F 3.5 Definition (Value function) ...... 3 3.6 Definition (Non-domination) ...... 3 4.1 Definition (Optimal policy set function) ...... 3 4.2 Definition (Reward function distributions) ...... 4 4.3 Definition (Visit distribution optimality probability) ...... 4 4.4 Definition (Action optimality probability) ...... 4 5.1 Definition (Average optimal value) ...... 4 5.2 Definition (POWER)...... 4 5.6 Definition (POWER-seeking actions) ...... 5 6.1 Definition (Similarity of visitation distribution sets) ...... 5 6.2 Definition (Pushforward distribution of a permutation) ...... 6 6.3 Definition (Orbit of a probability distribution) ...... 6 6.4 Definition (Inequalities which hold for most bounded probability distributions) . .6 6.6 Definition (Equivalent actions) ...... 7 6.7 Definition (State-space bottleneck) ...... 7 6.8 Definition (States reachable after taking an action) ...... 7 6.10 Definition (Recurrent state distributions [Puterman, 2014]) ...... 7 6.11 Definition (Blackwell optimal policies [Blackwell, 1962]) ...... 8 D.2 Definition (Transition matrix induced by a policy) ...... 17 D.6 Definition (Continuous reward function distribution) ...... 18

14 D.9 Definition (Non-dominated linear functionals) ...... 19 D.12 Definition (Non-dominated vector functions) ...... 19 D.13 Definition (Affine transformation of visit distribution sets) ...... 19

D.17 Definition (Support of any)...... 21 D D.18 Definition (Linear functional optimality probability) ...... 21 D.23 Definition (Indicator function) ...... 23 D.28 Definition (Evaluating sets of visit distribution functions at γ)...... 26 D.36 Definition (Discount-normalized value function) ...... 28 D.39 Definition (Normalized visit distribution function) ...... 30 D.48 Definition (Non-dominated single-state restriction) ...... 36 F List of Theorems

5.3 Lemma (Continuity of POWER)...... 5 5.4 Proposition (Maximal POWER)...... 5 5.5 Proposition (POWER is smooth across reversible dynamics) ...... 5 6.5 Proposition (States with “more options” generally have more POWER)...... 6 6.9 Proposition (Keeping options open tends to be POWER-seeking and tends to be optimal) 7 6.12 Proposition (When γ = 1, RSDs control POWER)...... 8 6.13 Theorem (Blackwell optimal policies tend to end up in “larger” sets of RSDs) . . .8 6.14 Corollary (Blackwell optimal policies tend not to end up in any given 1-cycle) . . .8

B.1 Proposition (Greater POWER bound does not imply greater P bound )...... 13 D D B.2 Lemma ( most and trivial orbits) ...... 13 ≥ B.3 Proposition (Actions which tend to seek POWER do not necessarily tend to be optimal) 14 D.1 Lemma (A policy is optimal iff it induces an optimal visit distribution at every state) 17 D.3 Proposition (Properties of visit distribution functions) ...... 17 D.4 Lemma (f (s) is multivariate rational on γ)...... 18 ∈ F D.5 Corollary (On-policy value is rational on γ)...... 18 D.7 Lemma (Distinct linear functionals disagree almost everywhere on their domains) . 18 D.8 Corollary (Unique maximization of almost all vectors) ...... 19 D.10 Lemma (All vectors are maximized by a non-dominated linear functional) . . . . . 19 D.11 Corollary (Maximal value is invariant to restriction to non-dominated functionals) . 19 D.14 Lemma (Invariance of non-domination under positive affine transform) ...... 19

D.15 Lemma (Helper lemma for demonstrating f most f )...... 20 1 ≥ 2 D.16 Lemma (A helper result for expectations of functions) ...... 20 D.19 Proposition (Non-dominated linear functionals and their optimality probability) . . 21 D.20 Lemma (Non-dominated functionals strictly increase expected optimal value) . . . 21

D.21 Lemma (For continuous IID distributions -IID, b < c : (b, c)|S| supp( -IID)) 22 DX ∃ ⊆ DX D.22 Lemma (Expectation superiority lemma) ...... 22 D.24 Lemma (Optimality probability inclusion relations) ...... 23

15 D.25 Lemma (Optimality probability of similar linear functional sets) ...... 24 D.26 Lemma (Optimality probability superiority lemma) ...... 25 D.27 Proposition (How to transfer optimal policy sets across discount rates) ...... 25 D.29 Lemma (Non-domination across γ values for expectations of visit distributions) . . 26 D.30 Lemma ( γ (0, 1) : d (s, γ) iff d ND (s, γ))...... 26 ∀ ∈ ∈ Fnd ∈ F : D.31 Lemma ( γ [0, 1) VR∗ (s, γ) = maxf nd(s) f(γ)>r)...... 26 ∀ ∈ ∈F D.32 Lemma (Optimal policy shift bound) ...... 26 D.33 Proposition (Optimality probability’s limits exist) ...... 26 D.34 Lemma (Optimality probability identity) ...... 27 D.35 Lemma (POWER identities) ...... 27 D.37 Lemma (Normalized value functions have uniformly bounded derivative) . . . . . 28 D.38 Lemma (Lower bound on current POWER based on future POWER)...... 29 D.40 Lemma (Normalized visit distribution functions are continuous) ...... 31 D.41 Lemma (Non-domination of normalized visit distribution functions) ...... 31 D.42 Lemma (POWER limit identity) ...... 31 D.43 Lemma (Lemma for POWER superiority) ...... 32 D.44 Lemma ( (s) “factorizes” across bottlenecks) ...... 34 F D.45 Lemma (Non-dominated visit distribution functions never agree with other visit distribution functions at that state) ...... 35 D.46 Corollary (Cardinality of non-dominated visit distributions) ...... 35 D.47 Lemma (Limit probability inequalities which hold for most distributions) . . . . . 36 D.49 Lemma (Optimality probability and state bottlenecks) ...... 36 D.50 Lemma (POWER identity when γ = 1)...... 39 D.51 Proposition (RSD properties) ...... 41 D.52 Lemma (When reachable with probability 1, 1-cycles induce non-dominated RSDs) 42

C.1 Contributions of independent interest

We developed new basic MDP theory by exploring the structural properties of visit distribution functions. Echoing Wang et al.[2007, 2008], we believe that this area is interesting and underexplored.

C.1.1 Optimal value theory : Lemma D.37 shows that f(γ∗) = limγ∗ γ (1 γ∗)VR∗ (s, γ∗) is Lipschitz continuous on γ [0, 1], → − ∈ with Lipschitz constant depending only on R 1. For all states s and policies π Π, corollary D.5 π k k ∈ shows that VR (s, γ) is smooth on γ.

Optimal value has a well-known dual formulation: VR∗ (s, γ) = maxf (s) f(γ)>r. ∈F : Lemma D.31 ( γ [0, 1) VR∗ (s, γ) = maxf nd(s) f(γ)>r). ∀ ∈ ∈F In a fixed rewardless MDP, lemma D.31 may enable more efficient computation of optimal value functions for multiple reward functions.

C.1.2 Optimal policy theory Proposition D.27 demonstrates how to preserve optimal incentives while changing the discount rate. Proposition D.27 (How to transfer optimal policy sets across discount rates). Suppose reward function R has optimal policy set Π∗ (R, γ) at discount rate γ (0, 1). For any γ∗ (0, 1), we ∈ ∈

16  can construct a reward function R0 such that Π∗ R0, γ∗ = Π∗ (R, γ). Furthermore, V ∗ ( , γ∗) = R0 · V ∗ ( , γ). R · C.1.3 Visit distribution theory While Regan and Boutilier[2010] consider a visit distribution function f (s) to be non-dominated ∈ F if it is optimal for some reward function in a set R|S|, our stricter definition 3.6 considers f to R ⊆ be non-dominated when r R|S|, γ (0, 1) : f(γ)>r > maxf (s) f f 0(γ)>r. ∃ ∈ ∈ 0∈F \{ } h i π,s0 Kulkarni et al.[2016] learn to estimate Succ(s, a) := Es T (s,a) f (γ) in order to infer state- 0∼ space bottlenecks. We prove that bottlenecks “factorize” state visit distribution functions.

a1 s s0

Figure 11: Informally, lemma D.44 shows that a bottleneck at s0 “factorizes” (s π(s0) = a1) into combinations of “what happens before acting at the bottleneck” and “what happensF | after acting at the a1  bottleneck.” In this rewardless MDP, s s0 REACH s0, a1 . Therefore, (s) = 9: each of the → → F three red “prefix” visit distribution functions (induced before reaching s0) can combine with the three green “suffix” visit distribution functions (induced after reaching s0).

ai Lemma D.44 ( (s) “factorizes” across bottlenecks). Suppose 1 i k : s s0  F Pk P ∀ ≤ ≤ → → REACH s0, ai . Then let 1reach := es and 1rest := 1 1reach (where i=1 sj REACH(s0,ai) j − b  π,s∈ 1 1 R|S| is the all-ones vector). Let Frest := f rest π Π (with the Hadamard product), ∈ | ∈ b  π,s b n π,s o : 1 : : ai Frest,a = f rest π Π π(s0) s0 ai , and Fa = Esa T (s ,ai) [f ] π Π . In i | ∈ ≡ i i ∼ 0 | ∈ the following, γ is left variable on [0, 1).     b b (s π(s0) = ai) = frest(γ) + 1 (1 γ) frest(γ) fai (γ) frest F , fai F . F | − − 1 | ∈ rest,a1 ∈ ai (149)

Appendix D Theoretical Results

Lemma D.1 (A policy is optimal iff it induces an optimal visit distribution at every state). Let γ (0, 1) and let R be a reward function. π Π∗ (R, γ) iff π induces an optimal visit distribution at∈ every state. ∈

Proof. By definition, a policy π is optimal iff π induces the maximal on-policy value at each state, which is true iff π induces an optimal visit distribution at every state (by the dual formulation of optimal value functions).

Definition D.2 (Transition matrix induced by a policy). Tπ is the transition matrix induced by policy π π t π Π, where T es := T (s, π(s)). (T ) es gives the probability distribution over the states visited at∈ time step t, after following π for t steps from s.

π,s Proposition D.3 (Properties of visit distribution functions). Let s, s0 , f (s). ∈ S ∈ F 1. f π,s(γ) is element-wise non-negative and element-wise monotonically increasing on γ [0, 1). ∈ 2. [0 1) : f π,s( ) = 1 . γ , γ 1 1 γ ∀ ∈ −

17 π,s P π t π t Proof. Item1: by examination of definition 3.3, f = t∞=0 (γT ) es. Since each (T ) is left stochastic and es is the standard unit vector, each entry in each summand is non-negative. Therefore, π,s γ [0, 1) : f (γ)>e 0, and this function monotonically increases on γ. ∀ ∈ s0 ≥ Item2:

π,s X∞ π t f (γ) = (γT ) es (5) 1 t=0 1

X∞ t π t = γ (T ) es (6) 1 t=0 X∞ = γt (7) t=0 1 = . (8) 1 γ − Equation (6) follows because all entries in each (Tπ)t e are non-negative by item1. Equation (7) s π t π t follows because each (T ) is left stochastic and es is a stochastic vector, and so (T ) es = 1 1.

Lemma D.4 (f (s) is multivariate rational on γ). f π (s) is a multivariate rational function on γ [0, 1). ∈ F ∈ F ∈ π π Proof. Let r R|S| and consider f (s). Let vR be the VR∗ (s, γ) function in column vector form, with one∈ entry per state value. ∈ F

π π 1 π 1 By the Bellman equations, v = (I γT )− r. Let A := (I γT )− , and for state s, form R − γ − As,γ by replacing Aγ ’s column for state s with r. As noted by Lippman[1968], by Cramer’s rule, V π(s, γ) = det As,γ is a rational function with numerator and denominator having degree at most R det Aγ . |S| e π( ) = f π,s( ) e In particular, for each state indicator reward function si , Vsi s, γ γ > si is a rational function of γ whose numerator and denominator each have degree at most . This implies that f π(γ) is multivariate rational on γ [0, 1). |S| ∈ π Corollary D.5 (On-policy value is rational on γ). Let π Π and R be any reward function. VR (s, γ) is rational on γ [0, 1). ∈ ∈ π π,s Proof. VR (s, γ) = f (γ)>r, and f is a multivariate rational function of γ by lemma D.4. Therefore, π,s for fixed r, f (γ)>r is a rational function of γ.

D.1 Non-dominated visit distribution functions

Definition D.6 (Continuous reward function distribution). Results with cont hold for any absolutely continuous reward function distribution. D

Remark. We assume R|S| is endowed with the standard topology.

Lemma D.7 (Distinct linear functionals disagree almost everywhere on their domains). Let x, x0 be distinct.  . ∈ R|S| Pr cont x>r = x0>r = 0 ∼D n o Proof. r R|S| (x x0)>r = 0 is a hyperplane since x x0 = 0. Therefore, it has no interior ∈ | − − 6 in the standard topology on R|S|. Since this empty-interior set is also convex, it has zero Lebesgue measure. By the Radon-Nikodym theorem, it has zero measure under any continuous distribution cont. D

18 Corollary D.8 (Unique maximization of almost all vectors). Let X (R|S| be finite.   . Pr cont arg maxx X x00>r > 1 = 0 ∼D 00∈

Proof. Let x, x0 X be distinct. For any r R|S|, x, x0 arg maxx X x00>r iff x>r = x0>r ∈ ∈ ∈ 00∈ ≥ maxx X x,x x00>r. By lemma D.7, x>r = x0>r holds with probability 0 under any cont. 00∈ \{ 0} D D.1.1 Generalized non-domination results

Our formalism includes both nd(s) and RSDnd (s); we therefore prove results that are applicable to both. F

Definition D.9 (Non-dominated linear functionals). Let X (R|S| be finite. ND (X) := n o x X r R|S| : x>r > maxx X x x0>r . ∈ | ∃ ∈ 0∈ \{ }

Lemma D.10 (All vectors are maximized by a non-dominated linear functional). Let r R|S| and ∈ let X (R|S| be finite and non-empty. x∗ ND (X) : x∗>r = maxx X x>r. ∃ ∈ ∈

Proof. Let A(r X) := arg maxx X x>r = x1,..., xn . Then | ∈ { }

x1>r = = xn>r > max x0>r. (9) ··· x X A(r X) 0∈ \ |

In eq. (9), each x>r expression is linear on r. The max is piecewise linear on r since it is the maximum of a finite set of linear functionals. In particular, all expressions in eq. (9) are continuous on r, and so we can find some δ > 0 neighborhood B(r, δ) such that r0 B(r, δ) : maxxi A(r X) xi>r0 > ∀ ∈ ∈ | maxx X A(r X) x0>r0. 0∈ \ | But almost all r0 B(r, δ) are maximized by a unique functional x∗ by corollary D.8; in particular, ∈ at least one such r00 exists. Formally, r00 B(r, δ) : x∗>r00 > maxx X x x0>r00. Therefore, ∃ ∈ 0∈ \{ ∗} x∗ ND (X) by definition D.9. ∈ x∗>r0 maxxi A(r X) xi>r0 > maxx X A(r X) x0>r0, with the strict inequality following because ≥ ∈ | 0∈ \ | r00 B(r, δ). These inequalities imply that x∗ A(r X). ∈ ∈ |

Corollary D.11 (Maximal value is invariant to restriction to non-dominated functionals). Let r R|S| ∈ and let X (R|S| be finite. maxx X x>r = maxx ND(X) x>r. ∈ ∈ Proof. If X is empty, the statement holds trivially. Otherwise, apply lemma D.10.

Definition D.12 (Non-dominated vector functions). Let I R and let  I ⊆ F ( R|S| be a finite set of vector-valued functions on I. ND (F ) := n o f F γ I, r R|S| : f(γ)>r > maxf F f f 0(γ)>r . ∈ | ∃ ∈ ∈ 0∈ \{ } Observe that for all states s, (s) = ND (s) by definition 3.6. Fnd F Definition D.13 (Affine transformation of visit distribution sets). For notational convenience, we de-  fine set-scalar multiplication and set-vector addition on X R|S|: for c R, cX := cx x X .  ⊆ ∈ | ∈ For a R|S|, X + a := x + a x X . ∈ | ∈ Similar operations hold when X is a set of vector functions R R|S|. 7→ Lemma D.14 (Invariance of non-domination under positive affine transform).

1. Let X (R|S| be finite. If x ND (X), then c > 0, a R|S| : (cx + a) ND (cX + a). ∈ ∀ ∈ ∈  I 2. Let I R and let F ( R|S| be a finite set of vector-valued functions on I. If ⊆ f ND (F ), then c > 0, a R|S| : (cf + a) ND (cF + a). ∈ ∀ ∈ ∈

19 Proof. Item1: suppose x ND (X) is strictly optimal for r R|S|. Then let c > 0, a R|S| be ∈ ∈ ∈ arbitrary, and define b := a>r.

x>r > max x0>r (10) x X x 0∈ \{ } cx>r + b > max cx0>r + b (11) x X x 0∈ \{ } (cx + a)>r > max (cx0 + a)>r (12) x X x 0∈ \{ } (cx + a)>r > max x00>r. (13) x (cX+a) cx+a 00∈ \{ } Equation (11) follows because c > 0. Equation (12) follows by the definition of b.

Item2: if f ND (F ), then by definition D.12, there exist γ I, r R|S| such that ∈ ∈ ∈ f(γ)>r > max f 0(γ)>r. (14) f F f 0∈ \{ } Apply item1 to conclude

(cf(γ) + a)>r > max (cf 0(γ) + a)>r. (15) (cf +a) (cF +a) cf+a 0 ∈ \{ } Therefore, (cf + a) ND (cF + a). ∈ D.1.2 Inequalities which hold under most reward function distributions

Definition 6.4 (Inequalities which hold for most bounded probability distributions). Let f1, f2 : ∆(R|S|) R be any functions from reward function distributions to real numbers. We write → 3 f most f when, for all bound, the following cardinality inequality holds: 1 ≥ 2 D

S bound f1( ) > f2( ) S bound f1( ) < f2( ) . (4) {D ∈ |S| ·D | D D } ≥ {D ∈ |S| ·D | D D }

Lemma D.15 (Helper lemma for demonstrating f1 most f2). If φ S such that for all bound, ≥ ∃ ∈ |S| D f ( bound) < f ( bound) implies that f φ ( bound) > f φ ( bound) , then f most f . 1 D 2 D 1 D 2 D 1 ≥ 2

Proof. φ acts as a bijection from S bound to itself. By assumption on φ, the image of |S| ·D {D ∈ S bound f1( ) < f2( ) under φ is a subset of S bound f1( ) > f2( ) . Since |S| ·D | D D } {D ∈ |S| ·D | D D } φ is injective, S bound f1( ) < f2( ) S bound f1( ) > f2( ) . {D ∈ |S| ·D | D D } ≤ {D ∈ |S| ·D | D D } f most f by definition 6.4. 1 ≥ 2

Lemma D.16 (A helper result for expectations of functions). Let B1,...,Bn R|S| and let any ⊆ D be any probability distribution on R|S|. Suppose f is a function of the form "  #  f B1,...,Bn any = E g max b1>r,..., max bn>r (16) | D r any b1 B1 bn Bn ∼D ∈ ∈ for some function g. Let φ be a state permutation. Then    f B ,...,B any = f φ(B ), . . . , φ(B ) φ any . (17) 1 n | D 1 n | D  Proof. Let distribution any have probability measure F , and let φ any have probability measure D D Fφ.  f B ,...,B any (18) 1 n | D 3 Boundedness only ensures that POWER is well-defined. Optimality probability results hold for all Dany orbits.

20 "  # := E g max b1>r,..., max bn>r (19) r any b1 B1 bn Bn ∼D ∈ ∈ Z   := g max b1>r,..., max bn>r dF (r) (20) b1 B1 bn Bn R|S| ∈ ∈ Z   = g max b1>r,..., max bn>r dFφ(Pφr) (21) b1 B1 bn Bn R|S| ∈ ∈ Z    1   1  = g max b1> P−φ r0 ,..., max bn> Pφ− r0 det Pφ dFφ(r0) (22) b1 B1 bn Bn R|S| ∈ ∈ Z     = g max Pφb1 > r0,..., max Pφbn > r0 dFφ(r0) (23) b1 B1 bn Bn R|S| ∈ ∈ Z ! = g max b10>r0,..., max bn0>r0 dFφ(r0) (24) b0 φ(B1) b φ(Bn) R|S| 1∈ n0 ∈   =: f φ(B ), . . . , φ(B ) φ any . (25) 1 n | D

Equation (21) follows by the definition of Fφ (definition 6.2). Equation (22) follows by substituting r0 := Pφr. Equation (23) follows from the fact that all permutation matrices have unitary determinant 1 and are orthogonal (and so (Pφ− )> = Pφ).

Definition D.17 (Support of any). Let any be any reward function distribution. supp( any) is the D D D smallest closed subset of R|S| whose complement has measure zero under any. D Definition D.18 (Linear functional optimality probability). For finite A, B (R|S|, the probability under that is optimal over ( ) := max a r max b r any A B is p any A B Pr any a A > b B > . D D ≥ ∼D ∈ ≥ ∈

Proposition D.19 (Non-dominated linear functionals and their optimality probability). Let A (R|S| be finite. If b < c : [b, c]|S| supp( any), then a ND (A) implies that a is strictly optimal for a ∃ ⊆ D ∈ set of reward functions with positive measure under any. D

Proof. Suppose b < c : [b, c]|S| supp( any). If a ND (A), then let r be such that a>r > ∃ ⊆ D ∈ maxa A a a0>r. For a1 > 0, a2 R, positively affinely transform r0 := a1r + a21 (where 0∈ \{ } ∈ 1 R|S| is the all-ones vector) so that r0 (b, c)|S|. ∈ ∈ Note that a is still strictly optimal for r0:

a>r > max a0>r a>r0 > max a0>r0. (26) a A a ⇐⇒ a A a 0∈ \{ } 0∈ \{ } Furthermore, by the continuity of both terms on the right-hand side of eq. (26), a is strictly optimal for reward functions in some open neighborhood N of r0. Let N 0 := N (b, c)|S|. N 0 is still open in ∩ R|S| since it is the intersection of two open sets N and (b, c)|S|.

any must assign positive probability measure to all open sets in its support. Therefore, any assigns D D positive probability to N 0 supp( any). ⊆ D

Lemma D.20 (Non-dominated functionals strictly increase expected optimal value). Let A, B (R|S| be finite, let bound be any bounded reward function distribution, and let g : R R be an increasing D → function. If ND (A) is similar to some B0 B via permutation φ S , then ⊆ ∈ |S| "  # "  # E g max a>r E g max b>r . (27) r bound a A ≤ r φ( bound) b B ∼D ∈ ∼ D ∈

If g is strictly increasing, ND (B) B0 is non-empty, and b < c : (b, c)|S| supp( bound), then eq. (27) is strict. \ ∃ ⊆ D

21 Proof.   "  # ! E g max a>r = E g max a>r  (28) r bound a A r bound a ND(A) ∼D ∈ ∼D ∈      = E g  max a>r (29) r φ( bound) a φ(ND(A)) ∼ D ∈ "  # = E g max b>r (30) r φ( bound) b B ∼ D ∈ 0 "  # E g max b>r . (31) ≤ r φ( bound) b B ∼ D ∈

Equation (28) holds because r R|S| : maxa A a>r = maxa ND(A) a>r by corollary D.11. ∀ ∈ ∈ ∈ Equation (29) holds by lemma D.16. Equation (30) holds by the definition of B0. Furthermore, our assumption on φ guarantees that B0 B. Therefore, maxa B0 a>r maxa B a>r, and so eq. (31) holds by the fact that g is an increasing⊆ function. ∈ ≤ ∈

Suppose that g is strictly increasing, ND (B) B0 is non-empty, and b < c : (b, c)|S| \ ∃ ⊆ supp( bound). Let x ND (B) B0. D ∈ \   "  # ! E g max b>r < E g max b>r  (32) r φ( bound) b B r φ( bound) a B x ∼ D ∈ 0 ∼ D ∈ 0∪{ } "  # E g max b>r . (33) ≤ r φ( bound) b B ∼ D ∈ x is strictly optimal for a positive-probability subset of supp( bound) by proposition D.19. Since g is strictly increasing, eq. (32) is strict. Therefore, we conclude thatD eq. (27) is strict.

Lemma D.21 (For continuous IID distributions -IID, b < c : (b, c)|S| supp( -IID)). DX ∃ ⊆ DX

Proof. X-IID := XS is isomorphic to X|S| (and we generally do not bother to distinguish between the two).D Since state reward distribution X is continuous, X must have support on some open interval

(b, c). Since -IID is IID across states, (b, c)|S| supp( -IID). DX ⊆ DX

Lemma D.22 (Expectation superiority lemma). Let A, B (R|S| be finite and let g : R R be an → increasing function. If ND (A) is similar to some B0 B via involution φ S , then ⊆ ∈ |S| "  # "  # E g max a>r most E g max b>r . (34) r bound a A ≤ r bound b B ∼D ∈ ∼D ∈

Furthermore, if g is strictly increasing and ND (B) B0 is non-empty, then eq. (34) is strict for all h i \ h i X-IID. In particular, Er bound g maxa A a>r most Er bound g maxb B b>r . D ∼D ∈ 6≥ ∼D ∈

h i Proof. Suppose that bound is such that Er bound g maxb B b>r < D ∼D ∈ h i Er bound g maxa A a>r . ∼D ∈   "  # ! E g max a>r = E g max a>r  (35) r φ( bound) a A r φ( bound) a ND(A) ∼ D ∈ ∼ D ∈

22 "  # g max b>r (36) 2E ≤ r φ ( bound) b B ∼ D ∈ "  # = E g max b>r (37) r bound b B ∼D ∈ "  # < E g max a>r (38) r bound a A ∼D ∈  ! = E g max a>r  (39) r bound a ND(A) ∼D ∈ "  # E g max b>r . (40) ≤ r φ( bound) b B ∼ D ∈

Equation (35) and eq. (39) follow because r R|S| : maxa A a>r = maxa ND(A) a>r by corollary D.11. Equation (36) follows by applying∀ ∈ lemma D.20 with∈ permutation φ∈. Equation (37) 1 2 follows because involutions satisfy φ− = φ, and φ is therefore the identity. Equation (38) h i h i follows because we assumed that Er bound g maxb B b>r < Er bound g maxa A a>r . ∼D ∈ ∼D ∈ Equation (40) follows by applying lemma D.20 with permutation φ. By lemma D.15, eq. (34) holds.

Suppose g is strictly increasing and ND (B) B0 is non-empty. Let φ0 S . \ ∈ |S| "  # "  # E g max a>r = E g max a>r (41) r φ ( X-IID ) a A r X-IID a A ∼ 0 D ∈ ∼D ∈ "  # < E g max b>r (42) r φ( X-IID ) b B ∼ D ∈ "  # = E g max b>r . (43) r φ ( X-IID ) b B ∼ 0 D ∈

Equation (41) and eq. (43) hold because -IID distributes reward identically across states: φ DX ∀ x ∈ S : φx ( X-IID) = X-IID. By lemma D.21, b < c : (b, c)|S| supp( X-IID). Therefore, apply lemma|S| D.20D to concludeD that eq. (42) holds. ∃ ⊆ D h i h i :   Therefore, φ0 S Er φ ( X-IID ) g maxa A a>r < Er φ ( X-IID ) g maxb B b>r , and ∀ ∈ |S| ∼ 0 D ∈ ∼ 0 D ∈ h i h i so Er bound g maxa A a>r most Er bound g maxb B b>r by definition 6.4. ∼D ∈ 6≥ ∼D ∈ 1 Definition D.23 (Indicator function). Let L be a predicate which takes input x. L(x) is the function which returns 1 when L(x) is true, and 0 otherwise.

Lemma D.24 (Optimality probability inclusion relations). Let X,Y (R|S| be finite and suppose Y 0 Y . ⊆     p any (X Y ) p any X Y 0 p any X Y Y 0 Y . (44) D ≥ ≤ D ≥ ≤ D ∪ \ ≥  If b < c : (b, c)|S| supp( any), X Y , and ND (Y ) Y Y 0 is non-empty, then the second inequality∃ is strict. ⊆ D ⊆ ∩ \

Proof. h i ( ) := 1 (45) p any X Y E maxx X x>r maxy Y y>r D ≥ r any ∈ ≥ ∈ ∼D h i 1 (46) E maxx X x>r maxy Y y>r ≤ r any ∈ ≥ ∈ 0 ∼D

23 h i 1 (47) E maxx X (Y Y ) x>r maxy Y y>r ≤ r any ∈ ∪ \ 0 ≥ ∈ 0 ∼D h i = 1 (48) E maxx X (Y Y ) x>r maxy Y (Y Y ) y>r r any ∈ ∪ \ 0 ≥ ∈ 0∪ \ 0 ∼D h i = 1 (49) E maxx X (Y Y ) x>r maxy Y y>r r any ∈ ∪ \ 0 ≥ ∈ ∼D    =: p any X Y Y 0 Y . (50) D ∪ \ ≥ Equation (46) follows because r : 1 1 R|S| maxx X x>r maxy Y y>r maxx X x>r maxy Y y>r ∀ ∈ ∈ ≥ ∈ ≤ ∈ ≥ ∈ 0 since Y 0 Y ; note that eq. (46) equals p any X Y 0 , and so the first inequality of ⊆ D ≥ eq. (44) is shown. Equation (47) holds because r : 1 R|S| maxx X x>r maxy Y y>r 1 . ∀ ∈ ∈ ≥ ∈ 0 ≤ maxx X (Y Y ) x>r maxy Y b>r ∈ ∪ \ 0 ≥ ∈ 0  Suppose b < c : (b, c)|S| supp( any), X Y , and ND (Y ) Y Y 0 is non-empty. Let ∃  ⊆ D ⊆ ∩ \ y∗ ND (Y ) Y Y 0 . By proposition D.19, y∗ is strictly optimal on a subset of supp( any) ∈ ∩ \ D with positive measure under any. In particular, for a set of r∗ with positive measure under any, we D D have y∗>r∗ > maxy Y y>r∗. ∈ 0 Then eq. (47) is strict, and therefore the second inequality of eq. (44) is strict as well.

Lemma D.25 (Optimality probability of similar linear functional sets). Let A, B, C (R|S| be finite.    If A is similar to B0 B via φ such that φ ND (C) B B0 = ND (C) B B0 , then ⊆ \ \ \ \ p (A C) p (B C) . (51) any φ( any) D ≥ ≤ D ≥  If b < c : (b, c)|S| supp( any), B0 C, and ND (C) B B0 is non-empty, then eq. (51) is strict.∃ ⊆ D ⊆ ∩ \

Proof.  p any (A C) = p any A ND (C) (52) D ≥ D ≥   p any A ND (C) B B0 (53) ≤ D ≥ \ \     = p φ (A) φ ND (C) B B0 (54) φ( any) D ≥ \ \   = p B0 ND (C) B B0 (55) φ( any) D ≥ \ \    p B0 B B0 ND (C) (56) φ( any) ≤ D ∪ \ ≥ = p (B C) . (57) φ( any) D ≥

Equation (52) follows because r R|S| : maxa C a>r = maxa ND(C) a>r by corollary D.11. Equation (53) follows by applying∀ ∈ the first inequality∈ of lemma∈ D.24 with X := A, Y := ND (C) ,Y 0 := ND (C) (B B0). Equation (54) follows by applying lemma D.16 to eq. (52) with permutation φ. \ \ Equation (55) follows by our assumptions on φ. Equation (56) follows because by applying the second inequality of lemma D.24 with X := B0,Y := ND (C) ,Y 0 := ND (C) (B B0). \ \  Suppose b < c : (b, c)|S| supp( any); note that (b, c)|S| supp(φ any ), since such support ∃ ⊆ D ⊆ D  must be invariant to permutation. Further suppose that B0 C and that ND (C) B B0 is ⊆ ∩ \ non-empty. Then letting X := B0,Y := ND (C) ,Y 0 := ND (C) (B B0) and noting that ND ND (Y ) = ND (Y ), apply lemma D.24 to eq. (56) to conclude that\ eq.\ (51) is strict.

24 Lemma D.26 (Optimality probability superiority lemma). Let A, B, C (R|S| be finite. If A is    similar to B0 B via involution φ S such that φ ND (C) B B0 = ND (C) B B0 , ⊆ ∈ |S| \ \ \ \ then p bound (A C) most p bound (B C). D ≥ ≤ D ≥  If B0 C and ND (C) B B0 is non-empty, then the inequality is strict for all -IID and ⊆ ∩ \ DX p bound (A C) most p bound (B C). D ≥ 6≥ D ≥

Proof. Suppose bound is such that p bound (B C) < p bound (A C). D D ≥ D ≥ 1 pφ( bound) (A C) = pφ ( bound) (A C) (58) D ≥ − D ≥ p bound (B C) (59) ≤ D ≥ < p bound (A C) (60) D ≥ pφ( bound) (B C) . (61) ≤ D ≥ Equation (58) holds because φ is an involution. Equation (59) and eq. (61) hold by ap- plying lemma D.25 with permutation φ. Equation (60) holds by assumption. Therefore, p bound (A C) most p bound (B C) by lemma D.15. D ≥ ≤ D ≥  Suppose B0 C and ND (C) B B0 is non-empty, and let -IID be any continuous distribution ⊆ ∩ \ DX which distributes reward independently and identically across states. Let φ0 S . ∈ |S|

pφ ( X-IID ) (A C) = p X-IID (A C) (62) 0 D ≥ D ≥

< pφ( X-IID ) (B C) (63) D ≥

= pφ ( X-IID ) (A C) . (64) 0 D ≥

Equation (62) and eq. (64) hold because -IID distributes reward identically across states, φ DX ∀ x ∈ S : φx ( X-IID) = X-IID. By lemma D.21, b < c : (b, c)|S| supp( X-IID). Therefore, apply lemma|S| D.25D to concludeD that eq. (63) holds. ∃ ⊆ D : Therefore, φ0 S pφ ( X-IID ) (A C) < pφ ( X-IID ) (B C). In particular, ∀ ∈ |S| 0 D ≥ 0 D ≥ p bound (A C) most p bound (B C) by definition 6.4. D ≥ 6≥ D ≥ D.1.3 results Fnd Proposition D.27 (How to transfer optimal policy sets across discount rates). Suppose reward function R has optimal policy set Π∗ (R, γ) at discount rate γ (0, 1). For any γ∗ (0, 1), we  ∈ ∈ can construct a reward function R0 such that Π∗ R0, γ∗ = Π∗ (R, γ). Furthermore, V ∗ ( , γ∗) = R0 · V ∗ ( , γ). R ·

Proof. Let R be any reward function. Suppose γ∗ (0, 1) and construct R0(s) := VR∗ (s, γ) h i ∈ − γ∗ maxa Es T (s,a) VR∗ s0, γ . ∈A 0∼  Let π Π be any policy. By the definition of optimal policies, π Π∗ R0, γ∗ iff for all s: ∈ ∈ h i h i R0(s) + γ∗ V ∗ s0, γ∗ = R0(s) + γ∗ max V ∗ s0, γ∗ (65) E R0 E R0 s T (s,π(s)) a s0 T (s,a) 0∼ ∈A ∼ h i h i R0(s) + γ∗ E VR∗ s0, γ = R0(s) + γ∗ max E VR∗ s0, γ (66) s T (s,π(s)) a s0 T (s,a) 0∼ ∈A ∼ h i h i γ∗ E VR∗ s0, γ = γ∗ max E VR∗ s0, γ (67) s T (s,π(s)) a s0 T (s,a) 0∼ ∈A ∼ h i h i E VR∗ s0, γ = max E VR∗ s0, γ . (68) s T (s,π(s)) a s0 T (s,a) 0∼ ∈A ∼

h i By the Bellman equations, R0(s) = VR∗ (s, γ∗) γ∗ maxa Es T (s,a) VR∗ s0, γ∗ . By the 0 − ∈A 0∼ 0 definition of R0, V ∗ ( , γ∗) = V ∗ ( , γ) must be the unique solution to the Bellman equations for R0 · R ·

25 R0 at γ∗. Therefore, eq. (66) holds. Equation (67) follows by plugging in R0 := VR∗ (s, γ) h i − γ∗ maxa Es T (s,a) VR∗ s0, γ to eq. (66) and doing algebraic manipulation. Equation (68) ∈A 0∼ follows because γ∗ > 0.  h i Equation (68) shows that π Π∗ R0, γ∗ iff s : Es T (s,π(s)) VR∗ s0, γ = ∈ ∀ 0∼ h i  maxa Es T (s,a) VR∗ s0, γ . That is, π Π∗ R0, γ∗ iff π Π∗ (R, γ). ∈A 0∼ ∈ ∈

Definition D.28 (Evaluating sets of visit distribution functions at γ). For γ (0, 1), define (s, γ) := f(γ) f (s) and (s, γ) := f(γ) f (s) . If F∈ (s), then F | ∈ F Fnd | ∈ Fnd ⊆ F F (γ) := f(γ) f F . | ∈ Lemma D.29 (Non-domination across γ values for expectations of visit distributions). Let ∆   d ∈  π,sd ∆ R|S| be any state distribution and let F := Esd ∆d [f ] π Π . f ND (F ) iff γ∗  ∼ | ∈ ∈ ∀ ∈ (0, 1) : f(γ∗) ND F (γ∗) . ∈

Proof. Let f π ND (F ) be strictly optimal for reward function R at discount rate γ (0, 1): ∈ ∈ π π0 f (γ)>r > max f (γ)>r. (69) f π0 F f π ∈ \{ }  Let γ∗ (0, 1). By proposition D.27, we can produce R0 such that Π∗ R0, γ∗ = Π∗ (R, γ). Since the optimal∈ policy sets are equal, lemma D.1 implies that

π π0 f (γ∗)>r0 > max f (γ∗)>r0. (70) f π0 F f π ∈ \{ }

π  Therefore, f (γ∗) ND F (γ∗) . ∈ The reverse direction follows by the definition of ND (F ).

Lemma D.30 ( γ (0, 1) : d (s, γ) iff d ND (s, γ)). ∀ ∈ ∈ Fnd ∈ F n o Proof. By definition D.28, (s, γ) := f(γ) f ND (s) . By applying lemma D.29 with Fnd | ∈ F ∆ := e , f ND (s) iff γ (0, 1) : f(γ) ND (s, γ). d s ∈ F ∀ ∈ ∈ F

: Lemma D.31 ( γ [0, 1) VR∗ (s, γ) = maxf nd(s) f(γ)>r). ∀ ∈ ∈F

Proof. ND (s, γ) = (s, γ) by lemma D.30, so apply corollary D.11 with X := (s, γ). F Fnd F D.2 Some actions have greater probability of being optimal

Lemma D.32 (Optimal policy shift bound). For fixed R, Π∗ (R, γ) can take on at most (2 + P (s)  |S| 1) F distinct values over γ (0, 1). s | 2 | ∈

Proof. By lemma D.1, Π∗ (R, γ) changes value iff there is a change in optimality status for some visit distribution function at some state. Lippman[1968] showed that two visit distribution functions can (s)  trade off optimality status at most 2 + 1 times. At each state s, there are F such pairs. |S| | 2 |

Let . Proposition D.33 (Optimality probability’s limits exist). F (s) P any (F, 0) = ⊆ F D and . limγ 0 P any (F, γ) P any (F, 1) = limγ 1 P any (F, γ) → D D → D

26 Proof. First consider the limit as γ 1. Let any have probability measure Fany, and define  → D  δ(γ) := Fany R RS γ∗ [γ, 1) : Π∗ (R, γ∗) = Π∗ (R, 1) . Since Fany is a probability ∈ | ∃ ∈ 6 measure, δ(γ) is bounded [0, 1], and δ(γ) is monotone decreasing. Therefore, limγ 1 δ(γ) exists. → If limγ 1 δ(γ) > 0, then there exist reward functions whose optimal policy sets Π∗ (R, γ) never → converge (in the discrete topology on sets) to the Blackwell-optimal policy set Π∗ (R, 1), contradicting lemma D.32. So limγ 1 δ(γ) = 0. → ( ) ( ) ( 1) By the definition of optimality probability (definition 4.3) and of δ γ , P any F, γ P any F, | D − D | ≤ ( ) lim ( ) = 0 lim ( ) = ( 1) δ γ . Since γ 1 δ γ , γ 1 P any F, γ P any F, . → → D D lim ( ) = ( 0) A similar proof shows that γ 0 P any F, γ P any F, . → D D Lemma D.34 (Optimality probability identity). Let γ (0, 1) and let F (s). ∈ ⊆ F   P any (F, γ) = p 0 F (γ) (s, γ) = p 0 F (γ) nd(s, γ) . (71) D D ≥ F D ≥ F Proof. Let γ (0, 1). ∈ ( ) := f π : Π ( ) P any F, γ P F π ∗ R, γ (72) D R any ∃ ∈ ∈ ∼D h i = 1 (73) E maxf F f(γ)>r=maxf (s) f 0(γ)>r r any ∈ 0∈F ∼D h i = 1 (74) E maxf F f(γ)>r=maxf (s) f 0(γ)>r r any ∈ 0∈Fnd ∼D  =: p F (γ) nd(s, γ) . (75) D0 ≥ F Equation (73) follows because lemma D.1 shows that π is optimal iff it induces an optimal visit distribution f at every state. Equation (74) follows because r R|S| : maxf (s) f 0(γ)>r = ∀ ∈ 0∈F maxf nd(s) f 0(γ)>r by lemma D.31. 0∈F

D.3 Basic properties of POWER

Lemma D.35 (POWER identities). Let γ (0, 1). ∈ " # 1 γ > POWER bound (s, γ) = E max − f(γ) es r (76) D r bound f nd(s) γ − ∼D ∈F 1 γ   = − E VR∗ (s, γ) R(s) (77) γ r bound − ∼D  1 γ   = − V ∗bound (s, γ) E R(s) (78) γ D − R bound ∼D   h π i = E max E (1 γ) VR s0, γ  . (79) R bound π Π s T (s,π(s)) − ∼D ∈ 0∼

Proof. " # 1 γ > POWER bound (s, γ) := E max − f(γ) es r (80) D r bound f (s) γ − ∼D ∈F " # 1 γ  = E max − f(γ) es > r (81) r bound f nd(s) γ − ∼D ∈F " # 1 γ  = E max − f(γ) es > r (82) r bound f (s) γ − ∼D ∈F 1 γ   = − E VR∗ (s, γ) R(s) (83) γ r bound − ∼D

27 1 γ   = ( )  ( ) − V ∗bound s, γ E R s (84) γ D − R bound ∼D   h i π,s0 = E max E (1 γ) f (γ)>r  (85) r bound π Π s T (s,π(s)) − ∼D ∈ 0∼   h π i = E max E (1 γ) VR s0, γ  . (86) R bound π Π s T (s,π(s)) − ∼D ∈ 0∼

Equation (81) follows from lemma D.31. Equation (83) follows from the dual formulation of optimal value functions. Equation (84) holds by the definition of ( ) (definition 5.1). Equation (85) V ∗bound s, γ h iD f π,s( ) = e + f π,s0 ( ) holds because γ s γ Es T (s,π(s)) γ by the definition of a visit distribution 0∼ function (definition 3.3).

Definition D.36 (Discount-normalized value function). Let π be a policy, R a reward function, and s π : π a state. For γ [0, 1], VR, norm (s, γ) = limγ∗ γ (1 γ∗)VR (s, γ∗). ∈ → − Lemma D.37 (Normalized value functions have uniformly bounded derivative). There exists K 0 ≥ d π such that for all reward functions r R|S|, sups ,π Π,γ [0,1] dγ VR, norm (s, γ) K r 1. ∈ ∈S ∈ ∈ ≤ k k

π Proof. Let π be any policy, s a state, and R a reward function. Since VR, norm (s, γ) = limγ∗ γ (1 → π,s d π π,s − γ∗)f (γ∗)>r, VR, norm (s, γ) is controlled by the behavior of limγ∗ γ (1 γ∗)f (γ∗). We dγ → − show that this function’s gradient is bounded in infinity norm. π,s By lemma D.4, f (γ) is a multivariate rational function on γ. Therefore, for any state s0, π,s P (γ) π,s 1 f (γ)>e = in reduced form. By proposition D.3, 0 f (γ)>e . Thus, s0 Q(γ) ≤ s0 ≤ 1 γ Q may only have a root of multiplicity 1 at γ = 1, and Q(γ) = 0 for γ [0−, 1). Let π,s 6 ∈ f (γ) := (1 γ)f (γ)>e . s0 − s0 If Q(1) = 0, then the derivative f (γ) is bounded on γ [0, 1) because the polynomial (1 γ)P (γ) s00 cannot diverge6 on a bounded domain. ∈ −

If Q(1) = 0, then factor out the root as Q(γ) = (1 γ)Q∗(γ). − d (1 γ)P (γ) f 0 (γ) = − (87) s0 dγ Q(γ) d  P (γ)  = (88) dγ Q∗(γ) P (γ)Q (γ) (Q ) (γ)P (γ) = 0 ∗ ∗ 0 (89) − 2 . (Q∗(γ))

Since Q∗(γ) is a polynomial with no roots on γ [0, 1], f 0 (γ) is bounded on γ [0, 1). ∈ s0 ∈ Therefore, whether or not Q(γ) has a root at γ = 1, f 0 (γ) is bounded on γ [0, 1). Furthermore, s0 ∈ sup (1 γ)f π,s(γ) = sup max f (γ) is finite since there are only finitely γ [0,1) γ [0,1) s0 s00 many∈ states.∇ − ∞ ∈ ∈S

There are finitely many π Π, and finitely many states s, and so there exists some K0 such that sup (1 )f∈π,s( ) . Then (1 )f π,s( ) =: . s , γ γ K0 γ γ 1 K0 K π Π,γ∈S[0,1) ∇ − ∞ ≤ ∇ − ≤ |S| ∈ ∈

d π d π sup VR,norm (s, γ) := sup lim (1 γ∗)VR (s, γ∗) (90) γ γ s , dγ s , dγ ∗→ − π Π,γ∈S[0,1) π Π,γ∈S[0,1) ∈ ∈ ∈ ∈ d π = sup (1 γ)VR (s, γ) (91) s , dγ − π Π,γ∈S[0,1) ∈ ∈

28

π,s = sup (1 γ)f (γ)>r (92) s , ∇ − π Π,γ∈S[0,1) ∈ ∈ sup (1 )f π,s( ) r (93) γ γ 1 1 ≤ s , ∇ − k k π Π,γ∈S[0,1) ∈ ∈ K r . (94) ≤ k k1 π Equation (91) holds because VR (s, γ) is continuous on γ [0, 1) by corollary D.5. Equation (93) holds by the Cauchy-Schwarz inequality. ∈

Since d V π (s, γ) is bounded for all γ [0, 1), eq. (94) also holds for γ 1. dγ R,norm ∈ →

Lemma 5.3 (Continuity of POWER). POWER bound (s, γ) is Lipschitz continuous on γ [0, 1]. D ∈

Proof. Let b, c be such that supp( bound) [b, c]|S|. For any r supp( bound) and π Π, π D ⊆ ∈ D ∈ VR, norm (s, γ) has Lipschitz constant K r 1 K r K max( c , b ) on γ (0, 1) by lemma D.37. k k ≤ |S| k k∞ ≤ |S| | | | | ∈   h i For (0 1), POWER ( ) = max (1 ) π by γ , bound s, γ ER bound π Π Es T (s,π(s)) γ VR s0, γ ∈ D ∼D ∈ 0∼ − eq. (86). The expectation of the maximum of a set of functions which share a Lipschitz constant, also shares the Lipschitz constant. This shows that POWER bound (s, γ) is Lipschitz continuous on γ (0, 1). Thus, its limits are well-defined as γ 0 and γ D 1. So it is Lipschitz continuous on the∈ closed unit interval. → →   Proposition 5.4 (Maximal POWER). POWER bound (s, γ) ER bound maxs R(s) , with equality if s can deterministically reach all states in oneD step and≤ all states∼D are 1-cycles.∈S

Proof. Let γ (0, 1). ∈ " # h i POWER bound (s, γ) = E max E (1 γ)VR∗ s0, γ (95) D R bound π Π s T (s,π(s)) − ∼D ∈ 0∼ "  # maxs R(s00) E max E (1 γ) 00∈S (96) ≤ R bound π Π s T (s,π(s)) − 1 γ ∼D ∈ 0∼ −   = E max R(s00) . (97) R bound s ∼D 00∈S

 maxs R(s00) Equation (95) follows from lemma D.35. Equation (96) follows because V ∗ s0, γ 00∈S R ≤ 1 γ – no policy can do better than achieving maximal reward at each time step. Taking limits, the inequality− holds for all γ [0, 1]. ∈ Suppose that s can deterministically reach all states in one step and all states are 1-cycles. Then eq. (96) is an equality for all γ (0, 1), since for each R, the agent can select an action which deterministically transitions to a∈ state with maximal reward. Thus the equality holds for all γ [0, 1]. ∈

Lemma D.38 (Lower bound on current POWER based on future POWER).   h i POWER bound (s, γ) (1 γ) min E R(s0) + γ max E POWER bound s0, γ . D ≥ − a s0 T (s,a), a s T (s,a) D ∼ 0∼ R bound ∼D (98)

h i Proof. Let γ (0, 1) and let a∗ arg maxa Es T (s,a) POWER bound s0, γ . ∈ ∈ 0∼ D

POWER bound (s, γ) (99) D

29 " # h i = (1 γ) E max E VR∗ s0, γ (100) − R bound a s T (s,a) ∼D 0∼   h i (1 γ) max E E VR∗ s0, γ (101) ≥ − a s T (s,a) R bound 0 ∼D ∼ h i = (1 ) max  γ E V ∗bound s0, γ (102) − a s0 T (s,a) D ∼     γ  = (1 γ) max E E R(s0) + POWER bound s0, γ (103) − a s T (s,a) R bound 1 γ D 0 ∼D ∼  −    γ  (1 γ) E E R(s0) + POWER bound s0, γ (104) ≥ − s T (s,a ) R bound 1 γ D 0∼ ∗ ∼D −   h i (1 γ) min E R(s0) + γ E POWER bound s0, γ . (105) ≥ − a s0 T (s,a), s T (s,a ) D ∼ 0∼ ∗ R bound ∼D   Equation (100) holds by lemma D.35. Equation (101) follows because Ex X maxa f(a, x)   ∼ ≥ maxa Ex X f(a, x) by Jensen’s inequality, and eq. (103) follows by lemma D.35. ∼ The inequality also holds when we take the limits γ 0 or γ 1. → → Proposition 5.5 (POWER is smooth across reversible dynamics). Let bound be bounded [b, c]. Sup- D pose s and s0 can both reach each other in one step with probability 1.  POWER bound (s, γ) POWER bound s0, γ (c b)(1 γ). (3) D − D ≤ − −  Proof. Suppose γ [0, 1]. First consider the case where POWER bound (s, γ) POWER bound s0, γ . ∈ D ≥ D      POWER bound s0, γ (1 γ) min E R(sx) + γ max E POWER bound (sx, γ) D ≥ − a sx T (s0,a), a sx T (s ,a) D ∼ ∼ 0 R bound ∼D (106)

(1 γ)b + γPOWER bound (s, γ) . (107) ≥ − D Equation (106) follows by lemma D.38. Equation (107) follows because reward is lower-bounded by b and because s0 can reach s in one step with probability 1.

  POWER bound (s, γ) POWER bound s0, γ = POWER bound (s, γ) POWER bound s0, γ (108) D − D D − D  POWER bound (s, γ) (1 γ)b + γPOWER bound (s, γ) ≤ D − − D (109)  = (1 γ) POWER bound (s, γ) b (110) − D −   ! (1 γ) E max R(s00) b (111) ≤ − R bound s00 − ∼D ∈S (1 γ)(c b). (112) ≤ − −  Equation (108) follows because POWER bound (s, γ) POWER bound s0, γ . Equation (109) follows by eq. (107). Equation (111) follows byD proposition≥ 5.4. EquationD (112) follows because reward under bound is upper-bounded by c. D  The case where POWER bound (s, γ) POWER bound s0, γ is similar, leveraging the fact that s can D ≤ D also reach s0 in one step with probability 1.

D.4 Seeking POWER is often more probable under optimality

D.4.1 Keeping options open tends to be POWER-seeking and tends to be optimal

Definition D.39 (Normalized visit distribution function). Let f : [0, 1) R|S| be a vector function. → For γ [0, 1], NORM (f, γ) := limγ γ (1 γ∗)f(γ∗) (this limit need not exist for arbitrary f). If ∈ ∗→ − F is a set of such f, then NORM (F, γ) := NORM (f, γ) f F . | ∈

30  Remark. RSD (s) = NORM (s), 1 . F Lemma D.40 (Normalized visit distribution functions are continuous). Let ∆s ∆( ) be a state π,s ∈ S probability distribution, let π Π, and let f ∗ := Es ∆s [f ]. NORM (f ∗, γ) is continuous on γ [0, 1]. ∈ ∼ ∈ Proof.  π,s  NORM (f ∗, γ) := lim (1 γ∗) E f (γ∗) (113) γ∗ γ − s ∆s →  ∼  π,s = E lim (1 γ∗)f (γ∗) (114) s ∆s γ∗ γ − ∼ →  π,s  =: E NORM (f , γ) . (115) s ∆s ∼ Equation (114) follows because the expectation is over a finite set. Each f π,s (s) is continuous π,s ∈ F on γ [0, 1) by lemma D.4, and limγ 1(1 γ∗)f (γ∗) exists because RSDs are well-defined ∗→ [Puterman∈ , 2014]. Therefore, each NORM (f π,s−, γ) is continuous on γ [0, 1]. Lastly, eq. (115)’s expectation over finitely many continuous functions is itself continuous.∈

Lemma D.41 (Non-domination of normalized visit distribution functions). Let ∆s ∆( )  π,s ∈ S be a state probability distribution and let F := Es ∆s [f ] π Π . For all γ [0, 1],   ∼ | ∈ ∈ ND NORM (F, γ) NORM ND (F ) , γ , with equality when γ (0, 1). ⊆ ∈ Proof. Suppose γ (0, 1). ∈   ND NORM (F, γ) = ND (1 γ)F (γ) (116) − = (1 γ)ND F (γ) (117) − = (1 γ) ND (F )(γ) (118) −  = NORM ND (F ) , γ . (119)

Equation (116) and eq. (119) follow by the continuity of NORM (f, γ) (lemma D.40). Equation (117) follows by lemma D.14 item1. Equation (118) follows by lemma D.29.  Let γ = 1. Let d ND NORM (F, 1) be strictly optimal for r∗ R|S|. Then let Fd F be the ∈ ∈ ⊆ subset of f F such that NORM (f, 1) = d. ∈  max NORM (f, 1)> r∗ > max NORM f 0, 1 > r∗. (120) f Fd f F Fd ∈ 0∈ \

Since NORM (f, 1) is continuous at γ = 1 (lemma D.40), x>r∗ is continuous on x R|S|, and F ∈ is finite, eq. (120) holds for some γ∗ (0, 1) sufficiently close to γ = 1. By lemma D.10, at least ∈  one f Fd is an element of ND F (γ∗) . Then by lemma D.29, f ND (F ). We conclude that ∈   ∈ ND NORM (F, 1) NORM ND (F ) , 1 . ⊆ The case for γ = 0 proceeds similarly.

Lemma D.42 (POWER limit identity). Let γ [0, 1]. ∈ " # 1 γ∗ > POWER bound (s, γ) = E max lim − f(γ∗) es r . (121) D r bound f nd(s) γ∗ γ γ − ∼D ∈F → ∗

Proof. Let γ [0, 1] and let supp( bound) [b, c]|S|. ∈ D ⊆

POWER bound (s, γ) = lim POWER bound (s, γ∗) (122) D γ∗ γ D → " # 1 γ∗  = lim E max − f(γ∗) es > r (123) γ∗ γ r bound f nd(s) γ − → ∼D ∈F ∗

31 " # 1 γ∗  = E lim max − f(γ∗) es > r (124) r bound γ∗ γ f nd(s) γ − ∼D → ∈F ∗ " # 1 γ∗  = E max lim − f(γ∗) es > r . (125) r bound f nd(s) γ∗ γ γ − ∼D ∈F → ∗

Equation (122) follows because POWER bound (s, γ) is continuous on γ [0, 1] by lemma 5.3. Equation (123) follows by lemma D.35. D ∈

1 γ∗ > For γ∗ (0, 1), let fγ (r) := maxf nd(s) − f(γ∗) es r. For any sequence γn γ, ∈ ∗ ∈F γ∗ − →  ∞ fγn is a sequence of functions which are piecewise linear on r R|S|, which means n=1 ∈ they are continuous and therefore measurable. Since lemma D.4 shows that each f (s) ∈ Fnd is multivariate rational (and therefore continuous),  ∞ converges pointwise to limit func- fγn n=1 tion fγ on the bounded domain supp( bound) [b, c]|S|. Furthermore, for R supp( bound), b c D ⊆ 1 γ∗ ∈ D VR∗ (s, γ∗) , b R(s) c, and so fγn (r) = − (VR∗ (s, γ∗) R(s)) is bounded 1 γ∗ ≤ ≤ 1 γ∗ ≤ ≤ γ∗ − [b,− c]. Therefore, apply Lebesgue’s− bounded interval convergence theorem to conclude that eq. (124) holds. Equation (125) holds because max is a continuous function.

Lemma D.43 (Lemma for POWER superiority). Let ∆1, ∆2 ∆ ( ) be state probability distri-  1 π,si ∈ S butions. For i = 1, 2, let F∆i := γ− Esi ∆i [f esi ] π Π . Suppose ND (F∆1 ) is ∼ − | ∈   similar to Fsub0 F∆2 via involution φ. Then γ [0, 1] : Es1 ∆1 POWER bound (s1, γ) most  ⊆  ∀ ∈ ∼ D ≤ Es2 ∆2 POWER bound (s2, γ) . ∼ D

If ND (F∆ ) F 0 is non-empty, then for all γ (0, 1), the inequality is strict for all X-IID and  2 \ sub   ∈  D Es1 ∆1 POWER bound (s1, γ) most Es2 ∆2 POWER bound (s2, γ) . ∼ D 6≥ ∼ D  π,si These results also hold when replacing F∆i with F∆∗ := Esi ∆i [f ] π Π for i = 1, 2. i ∼ | ∈ Proof.     φ ND NORM (F , γ) φ NORM ND (F ) , γ (126) ∆1 ⊆ ∆1   := Pφ lim (1 γ∗)f(γ∗) f ND (F∆ ) (127) γ γ 1 ∗→ − | ∈   = lim (1 γ∗)Pφf(γ∗) f ND (F∆ ) (128) γ γ 1 ∗→ − | ∈   = lim (1 γ∗)f(γ∗) f Fsub0 (129) γ γ ∗→ − | ∈   lim (1 γ∗)f(γ∗) f F∆ (130) γ γ 2 ⊆ ∗→ − | ∈

=: NORM (F∆2 , γ) . (131)

Equation (126) follows by lemma D.41. Equation (128) follows because Pφ is a continuous linear operator. Equation (130) follows by assumption.   1 γ∗   π,s1 > E POWER bound (s1, γ) := E max lim − f (γ∗) es1 r (132) s1 ∆1 D s1 ∆1, π Π γ∗ γ γ − ∼ ∼ ∈ → ∗ r bound ∼D   1 γ∗  π,s1 > = E max lim − E f (γ∗) es1 r (133) r bound π Π γ∗ γ γ s1 ∆1 − ∼D ∈ → ∗ ∼  

= E  max d>r (134) r bound d NORM(F∆ ,γ) ∼D ∈ 1  

= E  max d>r (135) r bound   d ND NORM(F∆ ,γ) ∼D ∈ 1

32  

most E  max d>r (136) r ≤ bound d NORM(F∆ ,γ) ∼D ∈ 2   1 γ∗  π,s2 > = E max lim − E f (γ∗) es2 r (137) r bound π Π γ∗ γ γ s2 ∆2 − ∼D ∈ → ∗ ∼   1 γ∗ π,s2 > = E max lim − f (γ∗) es2 r (138) s2 ∆2, π Π γ∗ γ γ − ∼ ∈ → ∗ r bound ∼D   =: E POWER bound (s2, γ) . (139) s2 ∆2 D ∼ Equation (132) and eq. (139) follow by lemma D.42. Equation (133) and eq. (138) follow because each R has a stationary deterministic optimal policy π Π∗ (R, γ) Π which simultaneously achieves optimal value at all states. Equation (135) follows∈ by corollary⊆ D.11.

Apply lemma D.22 with A := NORM (F∆1 , γ) ,B := NORM (F∆2 , γ), g the identity function, and involution φ (satisfying φ ND (A) B by eq. (131)) in order to conclude that eq. (136) holds. ⊆ Suppose that ND (F∆ ) F 0 is non-empty. Lemma D.29 shows that for all γ (0, 1),  2 \ sub ∈ ND F (γ) F 0 (γ) is non-empty. Lemma D.14 item1 then implies that ND (B) φ(A) = ∆2 \ sub \ 1 γ     1 γ  −γ ND F∆2 (γ) es −γ Fsub0 (γ) is non-empty. Then lemma D.22 implies that for − \   all γ (0, 1), eq. (136) is strict for all X-IID and Es1 ∆1 POWER bound (s1, γ) most ∈  D ∼ D 6≥ Es2 ∆2 POWER bound (s2, γ) . ∼ D We show that this result’s preconditions holding for implies the preconditions. Suppose F∆∗ i F∆i     π,si F∆∗ := Esi ∆i [f ] π Π for i = 1, 2 are such that Fsub∗ := φ ND F∆∗ F∆∗ . In the i ∼ | ∈ 1 ⊆ 2 following, the ∆i are represented as vectors in R|S|, and γ is a variable.     φ γf f ND (F ) = φ ND F ∗ ∆ (140) | ∈ ∆1 ∆1 − 1    = φ ND F ∗ ∆ (141) ∆1 − 1 n o = P f P ∆ f ND F ∗ (142) φ − φ 1 | ∈ ∆1  f ∆ f F ∗ (143) ⊆ − 2 | ∈ ∆2 = γf f F . (144) | ∈ ∆2    Equation (141) follows from lemma D.14 item2. Since we assumed that φ ND F ∗ F ∗ , ∆1 ⊆ ∆2      φ ∆ = φ ND F ∗ (0) F ∗ (0) = ∆ . This implies that P ∆ = ∆ and so { 1} ∆1 ⊆ ∆2 { 2} φ 1 2 eq. (143) follows.    Equation (144) shows that φ γf f ND (F∆1 ) γf f F∆2 . But we then   | ∈ ⊆ n| ∈ o have φ γf f ND (F ) := γP f f ND (F ) = γf f φ ND (F ) | ∈ ∆1 φ | ∈ ∆1 | ∈ ∆1 ⊆ γf f F . Thus, φ ND (F ) F . | ∈ ∆2 ∆1 ⊆ ∆2      Suppose ND F ∗ φ ND F ∗ is non-empty, which implies that ∆2 \ ∆1   n o φ γf f ND (F ) = P f P ∆ f ND F ∗ (145) | ∈ ∆1 φ − φ 1 | ∈ ∆1     = f P ∆ f φ ND F ∗ (146) − φ 1 | ∈ ∆1 n o ( f ∆2 f ND F ∗ (147) − | ∈ ∆2

33 = γf f ND (F ) . (148) | ∈ ∆2  Then ND (F∆2 ) φ ND (F∆1 ) must be non-empty. Therefore, if the preconditions of this result are met for ,\ they are met for . F∆∗ i F∆i

Proposition 6.5 (States with “more options” generally have more POWER). Suppose (s0) is Fnd similar to a subset of (s) via involution φ. Then γ [0, 1] : POWER bound (s0, γ) most F ∀ ∈ D ≤ POWER bound (s, γ). D  If (s) φ (s0) is non-empty, then for all γ (0, 1), the inequality is strict for all -IID Fnd \ Fnd ∈ DX and POWER bound (s0, γ) most POWER bound (s, γ) – i.e. POWER bound (s0, γ) most POWER bound (s, γ) does not hold.D 6≥ D D ≥ D

: : : : Proof. Let Fsub = φ( nd(s0)) (s). Let ∆1 = es0 , ∆2 = es, and define F∆∗ = F ⊆ F   i  π,si Esi ∆i [f ] π Π for i = 1, 2. Then nd(s0) = ND F∆∗ is similar to Fsub = ∼ | ∈ F 1 Fsub∗ F∆∗ = (s) via involution φ. Apply lemma D.43 to conclude that γ [0, 1] : ⊆ 2 F ∀ ∈ POWER bound s0, γ most POWER bound (s, γ). D ≤ D   Furthermore, ( ) = ND , and = , and so if ( ) ( ) := ( ) nd s F∆∗ 2 Fsub Fsub∗ nd s φ nd s0 nd s  F  F \ F F \ = ND is non-empty, then lemma D.43 shows that for all (0 1), the inequality Fsub F∆∗ 2 Fsub∗ γ , \  ∈ is strict for all X-IID and POWER bound s0, γ most POWER bound (s, γ). D D 6≥ D ai Lemma D.44 ( (s) “factorizes” across bottlenecks). Suppose 1 i k : s s0  F Pk P ∀ ≤ ≤ → → REACH s0, ai . Then let 1reach := es and 1rest := 1 1reach (where i=1 sj REACH(s0,ai) j − b  π,s∈ 1 1 R|S| is the all-ones vector). Let Frest := f rest π Π (with the Hadamard product), ∈ | ∈ b  π,s b n π,s o : 1 : : ai Frest,a = f rest π Π π(s0) s0 ai , and Fa = Esa T (s ,ai) [f ] π Π . In i | ∈ ≡ i i ∼ 0 | ∈ the following, γ is left variable on [0, 1).     b b (s π(s0) = ai) = frest(γ) + 1 (1 γ) frest(γ) fai (γ) frest F , fai F . F | − − 1 | ∈ rest,a1 ∈ ai (149)

Proof. Keep in mind that in order to detail how state-space bottlenecks affect the structure of visit distribution functions, we hold γ variable on [0, 1). (s π(s0) = a ) (150) F | i  π,s := f π Π : π(s0) = a (151) | ∈ i  π,s = f π Π : π(s0) a (152) | ∈ ≡s0 i  π,s 1 1 : = f (γ) ( rest + reach) π Π π(s0) s0 ai (153)  | ∈ ≡  π,s  π,s  π = f (γ) 1rest + 1 (1 γ) f (γ) 1rest f (γ) π Π : π(s0) s ai (154) − − 1 ai | ∈ ≡ 0     b b = frest(γ) + 1 (1 γ) frest(γ) fai (γ) frest F , fai F (155) − − 1 | ∈ rest,ai ∈ ai     b b = frest(γ) + 1 (1 γ) frest(γ) fai (γ) frest F , fai F . (156) − − 1 | ∈ rest,a1 ∈ ai Equation (152) holds because by the definition of action equivalence (definition 6.6), equiva- lent actions induce identical state visit distribution functions f π,s. Equation (153) follows since 1rest + 1reach = 1. To see that eq. (154) follows, consider first that once the agent takes an action π b equivalent to a at state s0, it induces state visit distribution f (γ) F (γ) (by the definition of i ai ∈ ai b f π,s Fai ). Since the bottleneck assumption ensures that no other components of visit the states of k  π,s 1 π [0,1) REACH s0, ai , f (γ) reach = c(γ)f (γ) for some scaling function c R . ∪i=1 ai ∈ 1 f π,s(γ) = (157) 1 1 γ − 34 π,s π 1 f (γ) 1rest + c(γ)f (γ) = (158) ai 1 1 γ − π,s π 1 f (γ) 1rest + c(γ)f (γ) = (159) 1 ai 1 1 γ −   π 1 1 π,s c(γ) = f (γ) − f (γ) 1rest (160) ai 1 1 γ − 1  −  1 π,s c(γ) = (1 γ) f (γ) 1rest (161) − 1 γ − 1 − π,s c(γ) = 1 (1 γ) f (γ) 1rest . (162) − − 1 Equation (157) and eq. (161) follow by proposition D.3 item2. Equation (159) follows because f ( ) f π ( ) 0 ( ) = ( ) ( ) f π,s( ) rest γ , ai γ . In eq. (162), c γ c γ must hold because if c γ were negative, γ would contain negative entries (which is impossible by proposition D.3 item1). Equation (162) demonstrates that eq. (154) holds.  To see that eq. (155) holds, consider that policy choices on REACH s0, ai cannot affect the visit π,s  distribution function f (γ) 1rest. This is because the definition of REACH s0, ai ensures that  once the agent reaches REACH s0, ai , it never leaves. Therefore, for all π, π0 Π which only  π,s π0,s ∈ π b disagree on s REACH s0, a , f (γ) 1rest = f (γ) 1rest. This implies that any f F j ∈ i ∈ ai is compatible with any f π0 F b . Therefore, eq. (155) holds. ∈ rest b = b Lastly, we show that Frest,ai Frest,a1 , which shows that eq. (156) holds. In the following, let  π,s  d(γ) := 1 (1 γ) f (γ) 1rest . − − 1 b Frest,ai (163)  π,s := f 1rest π Π : π(s0) a (164) | ∈ ≡s0 i n π,s π  o = f (γ) 1rest + d(γ)f (γ) 1rest π Π : π(s0) a (165) ai | ∈ ≡s0 i n π,s π  o = f (γ) 1rest + d(γ)f (γ) 1rest π Π : π(s0) a (166) a1 | ∈ ≡s0 i n π,s π  o = f (γ) 1rest + d(γ)f (γ) 1rest π Π : π(s0) a (167) a1 | ∈ ≡s0 1  π,s = f 1rest π Π : π(s0) a (168) | ∈ ≡s0 1 =: b Frest,a1 . (169)

Equation (165) and eq. (168) follow by eq. (154) above. Equation (166) follows because by the 1 π b π b π 1 π 1 definition of rest and of fa Fa , fa Fa , we have d(γ)fa (γ) rest = d(γ)fa (γ) rest = 0 i ∈ i 1 ∈ 1 i 1  (the all-zeros vector in R|S|). Equation (167) follows because the definition of REACH s0, ai ensures  that once the agent takes actions equivalent to a at s , it only visits states s k REACH s , a . i 0 j i0=1 0 i0 π,s ∈ ∪ The same is true for a . Therefore, by the definition of 1rest, f (γ) 1rest is invariant to the choice 1 of action ai versus a1. We conclude that eq. (156) holds, which proves the desired equality.

Lemma D.45 (Non-dominated visit distribution functions never agree with other visit distribution functions at that state). Let f (s), f 0 (s) f . γ (0, 1) : f(γ) = f 0(γ). ∈ Fnd ∈ F \{ } ∀ ∈ 6

Proof. Let γ (0, 1). Since f nd(s), there exists a γ∗ (0, 1) at which f is strictly optimal for some reward function.∈ Then by∈ proposition F D.27, we can produce∈ another reward function for which f is strictly optimal at discount rate γ; in particular, proposition D.27 guarantees that the policies which induce f 0 are not optimal at γ. So f(γ) = f 0(γ). 6 Corollary D.46 (Cardinality of non-dominated visit distributions). Let F (s). γ (0, 1) : ⊆ F ∀ ∈ F nd(s) = F (γ) nd(s, γ) . ∩ F ∩ F Proof. Let γ (0, 1). ∈

35   By applying lemma D.29 with ∆d := es, f nd(s) = ND (s) iff f(γ) ND (s, γ) . By  ∈ F F ∈ F lemma D.30, ND (s, γ) = nd(s, γ). So all f F nd(s) induce f(γ) F (γ) nd(s, γ), F F ∈ ∩ F ∈ ∩ F and F nd(s) F (γ) nd(s, γ) . ∩ F ≥ ∩ F

Lemma D.45 implies that for all f, f 0 nd(s), f = f 0 iff f(γ) = f 0(γ). Therefore, F nd(s) ∈ F ∩ F ≤ F (γ) nd(s, γ) . So F nd(s) = F (γ) nd(s, γ) . ∩ F ∩ F ∩ F Lemma D.47 (Limit probability inequalities which hold for most distributions). Let I R and ⊆ let FA,FB,FC be finite sets of vector functions I R|S|. Let γ be a limit point of I such that  7→  f1( any) := limγ γ p any FB(γ∗) FC (γ∗) , f2( any) := limγ γ p any FA(γ∗) FC (γ∗) D ∗→ D ≥ D ∗→ D ≥ are well-defined for all any. D Suppose ND (FA) is similar to a subset of FB via involution φ such that       φ ND (F ) F φ ND (F ) = ND (F ) F φ ND (F ) . Then f most f . C \ B \ A C \ B \ A 2 ≤ 1

Proof. Suppose bound is such that f ( bound) > f ( bound). D 2 D 1 D   1  f φ( bound) = f φ− ( bound) (170) 2 D 2 D  := lim 1 ( ) ( ) pφ− ( bound) FA γ∗ FC γ∗ (171) γ∗ γ D ≥ →  lim p bound FB(γ∗) FC (γ∗) (172) ≤ γ∗ γ D ≥ →  < lim p bound FA(γ∗) FC (γ∗) (173) γ∗ γ D ≥ →  lim pφ( bound) FB(γ∗) FC (γ∗) (174) ≤ γ∗ γ D ≥ →  =: f φ( bound) . (175) 1 D 1 Equation (170) follows since φ = φ− because φ is an involution. For all γ∗ I, ap- ∈ ply lemma D.25 with A := FA(γ∗),B := FB(γ∗),C := FC (γ∗), and involution φ obey-   ing φ ND (C) B φ(A) = ND (C) B φ(A) (by assumption) to conclude that \ \  \ \  1 pφ ( bound) FA(γ∗) FC (γ∗) p bound FB(γ∗) FC (γ∗) . Therefore, the limit inequality − D ≥ ≤ D ≥ eq. (172) holds. Equation (173) follows because we assumed that f1( bound) < f2( bound). Equa- tion (174) holds by reasoning similar to that given for eq. (172). D D   Therefore, f ( bound) > f ( bound) implies that f φ( bound) < f φ( bound) , and so apply 2 D 1 D 2 D 1 D lemma D.15 to conclude that f most f . 2 ≤ 1

s

sa0 s0 sa a0 a

  Figure 12: Starting from s, state s0 is a bottleneck with respect to REACH s0, a0 and REACH s0, a via the appropriate actions (definition 6.7). The red subgraph highlights the visit distributions of F and the green subgraph highlights Fsub; the black subgraph illustrates F Fsub. Since F a0 a \ a0 is similar to Fsub ( Fa, a affords “strictly more transient options.” Proposition 6.9 proves that (0 1) : ( ) ( ) and POWER ( ) POWER ( ). γ , P X-IID Fa0 , γ < P X-IID Fa, γ X-IID sa0 , γ < X-IID sa, γ ∀ ∈ D D D D

Definition D.48 (Non-dominated single-state restriction). nd(s π(s0) = a) := (s π(s0) = a) (s). F F | F | ∩ Fnd Lemma D.49 (Optimality probability and state bottlenecks). Suppose that s can reach   REACH s0, a0 REACH s0, a , but only by taking actions a0 or a at state s0. F := (s ∪ a0 Fnd | π(s0) = a0),F := (s π(s0) = a). Suppose F is similar to Fsub F via involution a F | a0 ⊆ a

36   φ which fixes all states not belonging to REACH s0, a0 REACH s0, a . Then γ [0, 1] : . ∪ ∀ ∈ P bound (Fa0 , γ) most P bound (Fa, γ) D ≤ D  If nd(s) Fa Fsub is non-empty, then for all γ (0, 1), the inequality is strict for all X-IID, andF ∩ \ . ∈ D P bound (Fa0 , γ) most P bound (Fa, γ) D 6≥ D Proof.   φ (s) F Fsub (176) Fnd \ a \    = φ (s) F Fsub (177) Fnd \ a ∪ [  = φ (s π(s0) = a00) F (178) Fnd | ∪ a0 a :a a 00∈A 006≡s0 [  = (s π(s0) = a00) (s) Fsub F (179) Fnd | ∪ Fnd ∩ ∪ a0 a00 : ∈A (a00 a) (a00 a0) 6≡s0 ∧ 6≡s0  = (s) F Fsub . (180) Fnd \ a \ Equation (178) follows by the definition of Fa and from the fact that φ (Fsub) = Fa . By assumption,   0 φ fixes all s0 REACH s0, a0 REACH s0, a . Suppose f nd(s) (Fa Fa). By the bottleneck 6∈ ∪  ∈ F \ 0 ∪ assumption, f does not visit states in REACH s0, a0 REACH s0, a . Therefore, Pφf = f, and so eq. (179) follows. ∪

Case: γ (0, 1). ∈ ( ) = ( ) ( ) P bound Fa0 , γ p bound Fa0 γ s, γ (181) D D ≥ F  most p bound Fa(γ) (s, γ) (182) ≤ D ≥ F = ( ) P bound Fa0 , γ . (183) D Equation (181) and eq. (183) follow from lemma D.34. Equation (182) follows by applying : : : : lemma D.26 with A = Fa0 (γ),B0 = Fsub(γ),B = Fa(γ),C = (s, γ) and involution φ.  F  Therefore, φ(A) = B0 since φ(F ) = Fsub and φ ND (C) B B0 = ND (C) B B0 by a0 \ \ \ \ eq. (180).

  Suppose nd(s) Fa Fsub is non-empty. 0 < nd(s) Fa Fsub = F ∩ \ F ∩ \   nd(s, γ) Fa(γ) Fsub(γ) =: ND (C) B B0 (with the first equality hold- F ∩ \ ∩ \  ing by corollary D.46), and so ND (C) B B0 is non-empty. We also have ∩ \ B := Fa(γ) (s, γ) =: C. Then reapplying lemma D.26, eq. (182) is strict for all ⊆( F ) ( ) X-IID, and P bound Fa0 , γ most P bound Fa, γ . D D 6≥ D Case: γ = 1, γ = 0. ( 1) = lim ( ) P bound Fa0 , P bound Fa0 , γ∗ (184) D γ∗ 1 D →  = lim p bound Fa0 (γ∗) (s, γ∗) (185) γ∗ 1 D ≥ F →  most lim p bound Fa(γ∗) (s, γ∗) (186) γ 1 D ≤ ∗→ ≥ F = lim ( ) P bound Fa, γ∗ (187) γ∗ 1 D → = ( 1) P bound Fa, . (188) D Equation (184) and eq. (188) hold by proposition D.33. Equation (185) and eq. (187) follow by : : : : : lemma D.34. Applying lemma D.47 with γ = 1,I = (0, 1),FA = Fa0 ,FB = Fa,FC = (s), we conclude that eq. (186) follows. F The γ = 0 case proceeds similarly to γ = 1.

37 Proposition 6.9 (Keeping options open tends to be POWER-seeking and tends to be optimal).

a  a0  Suppose that s s0 REACH s0, a and s s0 REACH s0, a0 . F := (s π(s0) = → → → → a0 F | a0),Fa := (s π(s0) = a). Suppose Fa is similar to a subset of Fa via involution φ which fixes all F |  0  states not belonging to REACH s0, a0 or REACH s0, a .   Then for all γ [0, 1], Es T (s ,a ) POWER bound (sa0 , γ) most ∈ a0 ∼ 0 0 D ≤  OWER ( ) and ( ) ( ). Esa T (s0,a) P bound sa, γ P bound Fa0 , γ most P bound Fa, γ ∼ D D ≤ D If (s) F φ(F ) is non-empty, then for all γ (0, 1), both inequalities are strict Fnd ∩ a \ a0 ∈ for all ,  OWER ( )  OWER ( ), and X-IID Esa T (s0,a0) P bound sa0 , γ most Esa T (s0,a) P bound sa, γ D 0 ∼ . D 6≥ ∼ D P bound (Fa0 , γ) most P bound (Fa, γ) D 6≥ D Proof. Optimality probability. The bottleneck assumptions imply the weaker condition that s can   reach REACH s0, a0 REACH s0, a , but only by taking actions a0 or a at state s0. Furthermore,  ∪  φ nd(s) Fa0 φ (Fa0 ) Fa. Lemma D.49’s nd(s) Fa φ(Fa0 ) non-emptiness precon- ditionF is met∩ by this⊆ result’s⊆ precondition. ThereforeF we have∩ met\ the conditions of lemma D.49, which we apply to show this result’s optimality probability claims.

POWER. For sets of visit distribution functions X ,X , define X ∗ X :=  1 2 1 × 2   x1(γx) + 1 (1 γx) x1(γx) x2(γx) x1 X1, x2 X2 , where γx is understood to − − 1 | ∈ ∈ vary on [0, 1). P Let 1reach := e , 1rest := 1 1reach (where 1 |S| sj REACH(s0,a0) REACH(s0,a) sj R ∈ b∪  π,s − b∈ is the all-ones vector), and F := f 1rest π Π . For action a , F := rest | ∈ i rest,ai  π,s b n π,s o 1 : : ai f rest π Π π(s0) s0 ai , and Fa = Esa T (s ,ai) [f ] π Π . | ∈ ≡ i i ∼ 0 | ∈  b b  φ (Fa ) = φ F ∗ F (189) 0 rest,a0 × a0 (   )   b b := Pφ frest(γx) + 1 (1 γx) frest(γx) fai (γx) frest F , fai F − − 1 | ∈ rest,a0 ∈ a0 (190)     b b = frest(γx) + 1 (1 γx) frest(γx) Pφfai (γx) frest F , fai F (191) − − 1 | ∈ rest,a0 ∈ a0 b  b  =: F ∗ φ F (192) rest,a0 × a0 F (193) ⊆ a b b = F ∗ F . (194) rest,a0 × a Equation (189) and eq. (194) follow by lemma D.44. Equation (191) follows because φ fixes the states   s REACH s0, a0 REACH s0, a . Equation (193) follows by our assumption that φ (F ) F . j 6∈ ∪ a0 ⊆ a b b  b b b  b b  Because Frest,a ∗ φ Fa Frest,a ∗ Fa , we conclude that φ Fa Fa . Since ND Fa 0 × 0 ⊆ 0 × 0 ⊆ 0 ⊆ φ F b  F b , φ ND F b  F b. a0 a0 a0 ⊆ a   E POWER bound (sa , γ) (195) s T (s ,a ) D 0 a0 ∼ 0 0 " # 1 γ∗  = E max lim − fa (γ∗) T (s0, a0) > r (196) r b γ γ 0 bound fa F ∗ γ∗ − ∼D 0 ∈ a0 → " # 1 γ∗  most max lim − fa (γ∗) T (s0, a) > r (197) E b 0 ≤ r bound fa F γ∗ γ γ − ∼D ∈ a → ∗  !  1 h i > γ∗ π0,sa = E max lim − E f (γ∗) esa r (198) r bound π Π γ∗ γ γ sa T (s ,a) − ∼D ∈ → ∗ ∼ 0

38   = E POWER bound (sa, γ) . (199) sa T (s ,a) D ∼ 0

b Equation (196) follows by the definition of F and lemma D.42. Let ∆1 := T (s0, a0), ∆2 := T (s0, a), a0   and define :=  [f π,si ] Π for = 1 2. ND is similar to F∆∗ i Esi ∆i π i , F∆∗ 1 Fsub∗ F∆∗ 2 ∼ | ∈   ⊆    b  b via involution φ, because we showed above that φ ND F ∗ = φ ND F F = ∆1 a0 ⊆ a :  OWER  F∆∗ . Apply lemma D.43 to conclude that γ [0, 1] Es T (s ,a ) P bound (sa0 , γ) most 2   ∀ ∈ a0 ∼ 0 0 D ≤ Esa T (s ,a) POWER bound (sa, γ) and so eq. (197) holds. ∼ 0 D     b b  b b  Suppose nd(s) Fa φ(Fa ) = nd(s) F ∗ F F ∗ φ F (equality F ∩ \ 0 F ∩ rest,a0 × a \ rest,a0 × a0 follows by eq. (194)) is non-empty, and that f is one of its elements. Since f (s), f is strictly ∈ Fnd optimal for some r R|S| at γ∗ (0, 1): ∈ ∈   f(γ∗)>r = frest(γ∗)>r + 1 (1 γ∗) frest(γ∗) fa(γ∗)>r (200) − − 1 > max f 0(γ∗)>r (201) f (s) f 0∈F \{ } max f 0(γ∗)>r (202) ≥ f (s π(s )=a) f 0∈F | 0 \{ } max f 0(γ∗)>r (203)   ≥ b b f 0 F ∗φ(F ) ∈ rest,a0 × a0   = max f ( ) r + 1 (1 ) f ( ) max f ( ) r (204) rest0 γ∗ > γ∗ rest0 γ∗ 1 a0 γ∗ > f F b b rest0 rest,a − − fa0 φ(F ) ∈ 0 ∈ a0   frest(γ∗)>r + 1 (1 γ∗) frest(γ∗) max fa0 (γ∗)>r. (205) 1 b ≥ − − fa0 φ(F ) ∈ a0

Since f F , eq. (200) follows by lemma D.44. Equation (201) follows by the assumed strict ∈ a optimality of f for r at γ∗. Equation (202) follows because (s π(s0) = a) (s) by definition 3.4. F | ⊆ F  b b  Equation (203) follows because F ∗ φ F (s π(s0) = a) f by lemma D.44 and rest,a0 × a0 ⊆ F | \{ }  b b  the fact that f F ∗ φ F by assumption. Equation (204) follows by the definition of 6∈ rest,a0 × a0 b ∗. Equation (205) follows because frest F by assumption. × ∈ rest,a0 If f(γ∗) = frest(γ∗), then eq. (205) shows that frest(γ∗)>r > frest(γ∗)>r, which is absurd. Therefore, b Equation (205) implies that fa(γ∗)>r > max b f 0 (γ∗)>r. This implies that ND F fa0 φ(F ) a a ∈ a0   \ φ F b  is non-empty. Since ND F b φ F b  ND F b φ ND F b  , the latter set is also a0 a \ a0 ⊆ a \ a0 non-empty.

 b  b    Therefore, when nd(s) Fa φ(Fa ) is non-empty, ND F φ ND F =: ND F ∗ F ∩ \ 0 a \ a0 ∆2 \    ND is non-empty as well. Therefore, lemma D.43 further implies that when φ F∆∗ 1   γ (0, 1), eq. (197) is strict for all X-IID, and that Es T (s ,a ) POWER bound (sa0 , γ) most ∈   D a0 ∼ 0 0 D 6≥ Esa T (s ,a) POWER bound (sa, γ) . ∼ 0 D

D.4.2 When γ 1, optimal policies tend to navigate towards “larger” sets of cycles ≈ Lemma D.50 (POWER identity when γ = 1). Let bound be any bounded reward function distribution. D " # " #

POWER bound (s, 1) = E max d>r = E max d>r . (206) D r bound d RSD(s) r bound d RSDnd(s) ∼D ∈ ∼D ∈

39 Proof. " # 1 γ π,s  POWER (s, 1) = max lim − f (γ) es > r (207) bound E π,s D r bound f (s) γ 1 γ − ∼D ∈F → " # = E max d>r (208) r bound d RSD(s) ∼D ∈ " # = E max d>r . (209) r bound d RSDnd(s) ∼D ∈ Equation (207) follows by lemma D.42. Equation (208) follows by the definition of RSD (s) (definition 6.10). Equation (209) follows because for all r R|S|, corollary D.11 shows that max d r = max d r =: max ∈d r. d RSD(s) > d ND(RSD(s)) > d RSDnd(s) > ∈ ∈ ∈  Proposition 6.12 (When γ = 1, RSDs control POWER). If RSDnd s0 is similar to a subset  of RSD (s) via involution φ, then POWER bound s0, 1 most POWER bound (s, 1). If RSDnd (s) D ≤ D  \ φ(RSDnd(s0)) is non-empty, then this inequality is strict for all X-IID, and POWER bound s0, 1 most D D 6≥ POWER bound (s, 1). D  Proof. Suppose RSDnd s0 is similar to D RSD (s) via involution φ. ⊆ " #  POWER bound s0, 1 = E max d>r (210) D r bound d RSDnd(s ) ∼D ∈ 0 " # most E max d>r (211) ≤ r bound d RSDnd(s) ∼D ∈

= POWER bound (s, 1) (212) D Equation (210) and eq. (212) follow from lemma D.50. By applying lemma D.22 with A :=  RSD s0 ,B0 := D,B := RSD (s) and g the identity function, eq. (211) follows.

Suppose RSDnd (s) D is non-empty. By the same result, eq. (211) is a strict inequality for all \  X-IID, and we conclude that POWER bound s0, 1 most POWER bound (s, 1). D D 6≥ D

s

sa0 s0 sa a0 a

Figure 13: At s0, a is more probable under optimality than a0 because it allows access to more RSDs. Let D0 contain the cycle in the left subgraph and let D contain the cycles in the right subgraph. contains only the isolated green 1-cycle. Theorem 6.13 shows that 1 = Dsub ( D P X-IID D0, ( 1), as they both only contain a 1-cycle. Because RSD ( ) (D ) con- P X-IID Dsub, nd s D Dsub tainsD the black cycles on the right, it is non-empty. Therefore, theorem ∩6.13 \implies that  1 ( 1). Furthermore, by proposition 6.12, is POWER -seeking com- P X-IID D0, < P X-IID D, a X-IID D D D pared to a0:POWER X-IID (sa , 1) < POWER X-IID (sa, 1). D 0 D

Theorem 6.13 (Blackwell optimal policies tend to end up in “larger” sets of RSDs). Let D0,D ⊆ RSD (s). Suppose that D0 is similar to a subset of D via involution φ and that the sets D0 D  ∪ and RSDnd (s) D0 D have pairwise orthogonal vector elements (i.e. pairwise disjoint vector support). \ ∪ Then  . If  is non-empty, the inequality P bound D0, 1 most P bound (D, 1) RSDnd (s) D φ(D0) D ≤ D ∩ \ is strict for all , and  . X-IID P bound D0, 1 most P bound (D, 1) D D 6≥ D

40 Proof. Let Dsub := φ(D0), where Dsub D by assumption. Let S :=  ⊆ si maxd D D d>esi > 0 . Define ∈ S | ∈ 0∪ ( φ(si) if si S φ0(si) := ∈ (213) si else.

Since φ is an involution, φ0 is also an involution. Furthermore, φ0(D0) = Dsub and φ0(Dsub) = D0 (because we assumed that both equalities hold for φ).    φ0 RSDnd (s) (D Dsub) = φ0 D0 Dsub RSDnd (s) (D0 D) (214) \ \ ∪ ∪ \ ∪ = Dsub D0 RSDnd (s) (D0 D) (215) ∪ ∪ \ ∪ = RSDnd (s) (D Dsub). (216) \ \

In eq. (215), we know that φ0(D0) = Dsub and φ0(Dsub) = D0. We just need to show that  φ0 RSDnd (s) (D0 D) = RSDnd (s) (D0 D). \ ∪ \ ∪ Suppose s S, d0 RSDnd (s) (D0 D) : d0>e > 0. By the definition of S, d D0 D : ∃ i ∈ ∈ \ ∪ si ∃ ∈ ∪ d>esi > 0. Then

X|S| d>d0 = d>(d0 e ) (217) sj j=1

d>(d0 esi ) (218) ≥   = d> (d0>esi )esi (219)

= (d0>e ) (d>e ) (220) si · si > 0. (221)

Equation (217) follows from the definitions of the dot and Hadamard products. Equation (218) follows because d and d0 have non-negative entries. Equation (221) follows because d>esi and d0>esi are both positive. But eq. (221) shows that d>d0 > 0, contradicting our assumption that d and d0 are orthogonal. n o := max d e 0 ( Therefore, such an si cannot exist, and S0 si0 d0 RSDnd(s) (D0 D) 0> si > ∈ S | ∈ \ ∪ ⊆ S\ S). By eq. (213), si0 S0 : φ0(si0 ) = si0 . Thus, φ0 RSDnd (s) (D0 D) = RSDnd (s) (D0 D), and eq. (215) follows.∀ ∈ \ ∪ \ ∪  We conclude that φ0 RSDnd (s) (D Dsub) = RSDnd (s) (D Dsub). \ \ \ \ 1 = ( ) P bound D0, p bound D0 RSD s (222) D D ≥  most p bound D RSD (s) (223) ≤ D ≥ = ( 1) P bound D, . (224) D

Equation (223) holds by applying lemma D.26 with A := D0,B0 := Dsub,B := D,C := RSD (s)  and involution φ0 which satisfies φ0 ND (C) (B B0) = ND (C) (B B0) by eq. (216). \ \ \ \  When RSDnd (s) D Dsub is non-empty, lemma D.26 also shows that eq. (223) is strict for all ∩ \ 1 ( 1) X-IID, and that P bound D0, most P bound D, . D D 6≥ D

Proposition D.51 (RSD properties). Let d RSD (s). d is element-wise non-negative and d 1 = 1. ∈ k k

Proof. d has non-negative elements because it equals the limit of limγ 1(1 γ)f(γ), whose elements are non-negative by proposition D.3 item1. → −

d = lim (1 γ)f(γ) (225) 1 γ 1 k k → − 1

41

= lim (1 γ) f(γ) (226) γ 1 1 → − = 1. (227)

Equation (225) follows because the definition of RSDs (definition 6.10) ensures that f (s) : ∃ ∈ F limγ 1(1 γ)f(γ) = d. Equation (226) follows because is a continuous function. Equa- → − k·k1 tion (227) follows because f( ) = 1 by proposition D.3 item2. γ 1 1 γ − Lemma D.52 (When reachable with probability 1, 1-cycles induce non-dominated RSDs). If e s0 ∈ RSD (s), then e RSDnd (s). s0 ∈ Proof. If d RSD (s) is distinct from e , then d = 1 and d has non-negative entries by ∈ s0 k k1 proposition D.51. Since d is distinct from es0 , then its entry for index s0 must be strictly less than 1: d>es < 1 = e>es . Therefore, es RSD (s) is strictly optimal for the reward function r := es , 0 s0 0 0 ∈ 0 and so e RSDnd (s). s0 ∈ Corollary 6.14 (Blackwell optimal policies tend not to end up in any given 1-cycle). Suppose are distinct. Then  . esx , es0 RSD (s) P bound esx , 1 most P bound RSD (s) esx , 1 ∈ D { } ≤ D \{ } If there exists a third , then  . es00 RSD (s) P bound esx , 1 most P bound RSD (s) esx , 1 ∈ D { } 6≥ D \{ } : : : Proof. Suppose esx , es0 RSD (s) are distinct. Let φ = (sx s0),D0 = esx ,D = RSD (s) ∈ : { } \ esx . φ(D0) = es0 RSD (s) esx = D since sx = s0. D0 D = RSD (s) and RSDnd (s) {RSD}(s) = trivially{ } ⊆ have pairwise\{ orthogonal} vector6 elements.∪ Then apply theorem 6.13 to\ ∅ e 1 ( ) e 1 conclude that P bound sx , most P bound RSD s sx , . D { } ≤ D \{ } Suppose there exists another es RSD (s). By lemma D.52, es RSDnd (s). Further-  00 ∈  00 ∈ more, since s00 s0, sx , es RSD (s) es es = D φ(D0). Therefore, 6∈  00 ∈ \{ x } \{ 0 } \ e RSDnd (s) D φ(D0) . Then apply the second condition of theorem 6.13 to conclude that s00 ∈ ∩ \ e 1 ( ) e 1 P bound sx , most P bound RSD s sx , . D { } 6≥ D \{ }

42