
The Value Function Polytope in Reinforcement Learning

Robert Dadashi 1  Adrien Ali Taïga 1 2  Nicolas Le Roux 1  Dale Schuurmans 1 3  Marc G. Bellemare 1

Abstract

We establish geometric and topological properties of the space of value functions in finite state-action Markov decision processes. Our main contribution is the characterization of the nature of its geometry: a general polytope (Aigner et al., 2010). To demonstrate this result, we exhibit several properties of the structural relationship between policies and value functions, including the line theorem, which shows that the value functions of policies constrained on all but one state describe a line segment. Finally, we use this novel perspective to introduce visualizations to enhance the understanding of the dynamics of reinforcement learning algorithms.

1. Introduction

The notion of value function is central to reinforcement learning (RL). It arises directly in the design of algorithms such as value iteration (Bellman, 1957), policy gradient (Sutton et al., 2000), policy iteration (Howard, 1960), and evolutionary strategies (e.g. Szita & Lőrincz, 2006), which either predict it directly or estimate it from samples, while also seeking to maximize it. The value function is also a useful tool for the analysis of approximation errors (Bertsekas & Tsitsiklis, 1996; Munos, 2003).

In this paper we study the map π ↦ V^π from stationary policies, which are typically used to describe the behaviour of RL agents, to their respective value functions. Specifically, we vary π over the joint simplex describing all policies and show that the resulting image forms a polytope, albeit one that is possibly self-intersecting and non-convex (Figure 1).

Figure 1. Mapping between policies and value functions.

We provide three results, all based on the notion of "policy agreement", whereby we study the behaviour of the map π ↦ V^π as we only allow the policy to vary at a subset of all states.

Line theorem. We show that policies that agree on all but one state generate a line segment within the value function polytope, and that this segment is monotone (all state values increase or decrease along it).

Relationship between faces and semi-deterministic policies. We show that d-dimensional faces of this polytope are mapped one-to-many to policies which behave deterministically in at least d states.

Sub-polytope characterization. We use this result to generalize the line theorem to higher dimensions, and demonstrate that varying a policy along d states generates a d-dimensional sub-polytope.

Although our "line theorem" may not be completely surprising or novel to expert practitioners, we believe we are the first to highlight its existence. In turn, it forms the basis of the other two results, which require additional technical machinery which we develop in this paper, leaning on results from convex analysis and topology.

While our characterization is interesting in and of itself, it also opens up new perspectives on the dynamics of learning algorithms. We use the value polytope to visualize the expected behaviour and pitfalls of common algorithms: value iteration, policy iteration, policy gradient, natural policy gradient (Kakade, 2002), and finally the cross-entropy method (De Boer et al., 2004).

1 Google Brain  2 Mila, Université de Montréal  3 Department of Computing Science, University of Alberta. Correspondence to: Robert Dadashi.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

2. Preliminaries

We are in the reinforcement learning setting (Sutton & Barto, 2018). We consider a Markov decision process M := ⟨S, A, r, P, γ⟩ with S the finite state space, A the finite action space, r the reward function, P the transition function, and γ the discount factor, for which we assume γ ∈ [0, 1). We denote the number of states by |S| and the number of actions by |A|.

A stationary policy π is a mapping from states to distributions over actions; we denote the space of all policies by P(A)^S. Taken with the transition function P, a policy π defines a state-to-state transition function P^π:

P^π(s' | s) = Σ_{a ∈ A} π(a | s) P(s' | s, a).

The value V^π is defined as the expected cumulative reward from starting in a particular state and acting according to π:

V^π(s) = E_{P^π} [ Σ_{i=0}^∞ γ^i r(s_i, a_i) | s_0 = s ].

The Bellman equation (Bellman, 1957) connects the value function V^π at a state s with the value function at the subsequent states when following π:

V^π(s) = E_{P^π} [ r(s, a) + γ V^π(s') ].   (1)

Throughout we will make use of vector notation (e.g. Puterman, 1994). Specifically, we view (with some abuse of notation) P^π as a |S| × |S| matrix, V^π as a |S|-dimensional vector, and write r_π for the vector of expected rewards under π. In this notation, the Bellman equation for a policy π is

V^π = r_π + γ P^π V^π = (I − γ P^π)^{-1} r_π.

In this work we study how the value function V^π changes as we continuously vary the policy π. As such, we will find it convenient to also view this value function as the functional

f_v : P(A)^S → R^S,   π ↦ V^π = (I − γ P^π)^{-1} r_π.

We will use the notation V^π when the emphasis is on the vector itself, and f_v when the emphasis is on the mapping from policies to value functions.

Finally, we will use ≼ and ≽ for element-wise vector inequalities, and for a function f : F → G and a subset F′ ⊆ F we write f(F′) to mean the image of f applied to F′.
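To make the vector notation concrete, the following is a minimal sketch, in Python with NumPy, of evaluating f_v(π) = (I − γP^π)^{-1} r_π. The two-state, two-action MDP used here is an illustrative assumption and not one of the MDPs of the paper (those are listed in its Appendix A).

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.7, 0.3],    # P[s, a, s']: from state 0, action 0
               [0.2, 0.8]],   # from state 0, action 1
              [[0.5, 0.5],    # from state 1, action 0
               [0.9, 0.1]]])  # from state 1, action 1
r = np.array([[0.5, 0.0],     # r[s, a]
              [1.0, -0.2]])

def value_of(pi):
    """f_v(pi) = (I - gamma * P_pi)^{-1} r_pi for a policy pi[s, a]."""
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-to-state transition matrix P^pi
    r_pi = np.einsum('sa,sa->s', pi, r)     # expected reward vector r_pi
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

pi_uniform = np.full((2, 2), 0.5)           # uniformly random policy
print(value_of(pi_uniform))                 # its value function V^pi
```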

2.1. Polytopes in R^n

Central to our work will be the result that the image of the functional f_v applied to the space of policies forms a polytope, possibly nonconvex and self-intersecting, with certain structural properties. This section lays down some of the necessary definitions and notations. For a complete overview on the topic, we refer the reader to Grünbaum et al. (1967); Ziegler (2012); Brondsted (2012).

We begin by characterizing what it means for a subset P ⊆ R^n to be a convex polytope or polyhedron. In what follows we write Conv(x_1, . . . , x_k) to denote the convex hull of the points x_1, . . . , x_k.

Definition 1 (Convex Polytope). P is a convex polytope iff there are k ∈ N points x_1, x_2, . . . , x_k ∈ R^n such that P = Conv(x_1, . . . , x_k).

Definition 2 (Convex Polyhedron). P is a convex polyhedron iff there are k ∈ N half-spaces Ĥ_1, Ĥ_2, . . . , Ĥ_k whose intersection is P, that is

P = ∩_{i=1}^k Ĥ_i.

A celebrated result from convex analysis relates these two definitions: a bounded, convex polyhedron is a convex polytope (Ziegler, 2012).

The next two definitions generalize convex polytopes and polyhedra to non-convex bodies.

Definition 3 (Polytope). A (possibly non-convex) polytope is a finite union of convex polytopes.

Definition 4 (Polyhedron). A (possibly non-convex) polyhedron is a finite union of convex polyhedra.

We will make use of another, recursive characterization based on the notion that the boundaries of a polytope should be "flat" in a topological sense (Klee, 1959).

For an affine subspace K ⊆ R^n, V_x ⊂ K is a relative neighbourhood of x in K if x ∈ V_x and V_x is open in K. For P ⊂ K, the relative interior of P in K, denoted relint_K(P), is then the set of points in P which have a relative neighbourhood in K ∩ P. The notion of "open in K" is key here: a point that lies on an edge of the unit square does not have a relative neighbourhood in the square, but it has a relative neighbourhood in that edge. The relative boundary ∂_K P is defined as the set of points in P not in the relative interior of P, that is

∂_K P = P \ relint_K(P).

Finally, we recall that H ⊆ K is a hyperplane if H is an affine subspace of K of dimension dim(K) − 1.

Proposition 1. P is a polyhedron in an affine subspace K ⊆ R^n if

(i) P is closed;

(ii) There are k ∈ N hyperplanes H_1, . . . , H_k in K whose union contains the boundary of P in K: ∂_K P ⊂ ∪_{i=1}^k H_i; and

(iii) For each of these hyperplanes, P ∩ H_i is a polyhedron in H_i.

All proofs may be found in the appendix.

3. The Space of Value Functions

We now turn to the main object of our study, the space of value functions V. The space of value functions is the set of all value functions that are attained by some policy. As noted earlier, this corresponds to the image of P(A)^S under the mapping f_v:

V = f_v(P(A)^S) = { f_v(π) | π ∈ P(A)^S }.   (2)

As a warm-up, Figure 2 depicts the space V corresponding to four 2-state MDPs; each set is made of value functions corresponding to 50,000 policies sampled uniformly at random from P(A)^S. The specifics of all MDPs depicted in this work can be found in Appendix A.

3.1. Basic Shape from Topology

We begin with a first result on how the functional f_v transforms the space of policies into the space of value functions (Figure 1). Recall that

f_v(π) = (I − γ P^π)^{-1} r_π.

Hence f_v is infinitely differentiable everywhere on P(A)^S (Appendix C). The following is a topological consequence of this property, along with the fact that P(A)^S is a compact and connected set.

Lemma 1. The space of value functions V is compact and connected.
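The warm-up experiment behind Figure 2 amounts to sampling policies uniformly over P(A)^S and pushing them through f_v. A minimal sketch, again over a hypothetical randomly generated two-state MDP (the paper's actual MDPs are in its Appendix A):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n_states, n_actions = 0.9, 2, 2
# Hypothetical MDP (illustrative numbers).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(-1, 1, size=(n_states, n_actions))                # r[s, a]

def value_of(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Sample policies uniformly over the product of simplices P(A)^S.
policies = rng.dirichlet(np.ones(n_actions), size=(50_000, n_states))
values = np.array([value_of(pi) for pi in policies])
# Plotting values[:, 0] against values[:, 1] traces the value polytope,
# as in Figure 2 of the paper.
print(values.min(axis=0), values.max(axis=0))
```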


Figure 2. Space of value functions for various two-state MDPs.

While the space of policies P(A)^S is easily described (it is the Cartesian product of |S| simplices), value function spaces arise as complex polytopes. Of note, they may be non-convex – justifying our more intricate definition.

In passing, we remark that the polytope gives a clear illustration of the following classic results regarding MDPs (e.g. Bertsekas & Tsitsiklis, 1996):

• (Dominance of V*) The optimal value function V* is the unique dominating vertex of V;

• (Monotonicity) The edges of V are oriented with the positive orthant;

• (Continuity) The space V is connected.

The next sections will formalize these and other, less-understood properties of the space of value functions.

3.2. Policy Agreement and Policy Determinism

Two notions play a central role in our analysis: policy agreement and policy determinism.

Definition 5 (Policy Agreement). Two policies π_1, π_2 agree on states s_1, . . . , s_k ∈ S if π_1(· | s_i) = π_2(· | s_i) for each s_i, i = 1, . . . , k.

For a given policy π, we denote by Y^π_{s_1,...,s_k} ⊆ P(A)^S the set of policies which agree with π on s_1, . . . , s_k; we will also write Y^π_{S\{s}} to describe the set of policies that agree with π on all states except s. Note that policy agreement does not imply disagreement; in particular, π ∈ Y^π_{S′} for any subset of states S′ ⊆ S.

Definition 6 (Policy Determinism). A policy π is

(i) s-deterministic for s ∈ S if π(a | s) ∈ {0, 1};

(ii) semi-deterministic if it is s-deterministic for at least one s ∈ S;

(iii) deterministic if it is s-deterministic for all states s ∈ S.

We will denote by D_{s,a} the set of semi-deterministic policies that take action a when in state s.

Lemma 2. Consider two policies π_1, π_2 that agree on s_1, . . . , s_k ∈ S. Then the vector r_{π_1} − r_{π_2} has zeros in the components corresponding to s_1, . . . , s_k, and the matrix P^{π_1} − P^{π_2} has zeros in the corresponding rows.

This lemma highlights that when two policies agree on a given state they have the same immediate dynamics on this state, i.e. they get the same expected reward, and have the same next state transition probabilities. Lemma 3 in Section 3.3 will be a direct consequence of this property.
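Lemma 2 is straightforward to check numerically. The sketch below, under the same hypothetical-MDP assumptions as the earlier snippets, builds two policies that agree on state s_1 and verifies that the corresponding component of r_{π_1} − r_{π_2} and the corresponding row of P^{π_1} − P^{π_2} vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 2, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # hypothetical P[s, a, s']
r = rng.uniform(-1, 1, size=(n_states, n_actions))                # hypothetical r[s, a]

pi1 = rng.dirichlet(np.ones(n_actions), size=n_states)
pi2 = rng.dirichlet(np.ones(n_actions), size=n_states)
pi2[0] = pi1[0]  # the two policies agree on state s_1 (index 0)

def induced(pi):
    # Expected reward vector r_pi and transition matrix P^pi induced by a policy.
    return np.einsum('sa,sa->s', pi, r), np.einsum('sa,sat->st', pi, P)

r1, P1 = induced(pi1)
r2, P2 = induced(pi2)
# Lemma 2: agreement on s_1 zeroes out that reward component and that transition row.
assert np.isclose(r1[0] - r2[0], 0.0)
assert np.allclose(P1[0] - P2[0], 0.0)
```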

3.3. Value Functions and Policy Agreement

We now consider policies that agree everywhere except on a single state s. Their value functions describe a line segment, and this segment is monotone (one end dominates the other end). Furthermore, the extremes of this line segment can be taken to be s-deterministic policies. This is the main result of this section, which we now state more formally.

Theorem 1 (Line Theorem). Let s be a state and π a policy. Then there are two s-deterministic policies in Y^π_{S\{s}}, denoted π_l, π_u, which bracket the value of all other policies π′ ∈ Y^π_{S\{s}}:

f_v(π_l) ≼ f_v(π′) ≼ f_v(π_u).

Furthermore, the image of f_v restricted to Y^π_{S\{s}} is a line segment, and the following three sets are equivalent:

(i) f_v(Y^π_{S\{s}}),

(ii) { f_v(α π_l + (1 − α) π_u) | α ∈ [0, 1] },

(iii) { α f_v(π_l) + (1 − α) f_v(π_u) | α ∈ [0, 1] }.

Figure 3. Illustration of Theorem 1. The orange points are the value functions of mixtures of policies that agree everywhere but one state.

The second part of Theorem 1 states that one can generate the set of value functions f_v(Y^π_{S\{s}}) in two ways: either by drawing the line segment in value space, from f_v(π_l) to f_v(π_u), or by drawing the line segment in policy space, from π_l to π_u, and then mapping to value space. Note that this result is a consequence of the Sherman-Morrison formula, which has been used in reinforcement learning for efficient sequential matrix inverse estimation (Bradtke & Barto, 1996). While somewhat technical, this characterization of line segments is needed to prove some of our later results. Figure 3 illustrates the path drawn by interpolating between two policies that agree on state s_2.

Theorem 1 depends on two lemmas, which we now provide in turn. Consider a policy π and k states s_1, . . . , s_k, and write C^π_{k+1}, . . . , C^π_{|S|} for the columns of the matrix (I − γP^π)^{-1} corresponding to states other than s_1, . . . , s_k. Define the affine vector space

H^π_{s_1,...,s_k} = V^π + Span(C^π_{k+1}, . . . , C^π_{|S|}).

Lemma 3. Consider a policy π and k states s_1, . . . , s_k. Then the value functions generated by Y^π_{s_1,..,s_k} are contained in the affine vector space H^π_{s_1,..,s_k}:

f_v(Y^π_{s_1,..,s_k}) = V ∩ H^π_{s_1,..,s_k}.

Put another way, Lemma 3 shows that if we fix the policies on k states, the induced space of value functions loses at least k degrees of freedom, specifically that it lies in a |S| − k dimensional affine vector space.

For k = |S| − 1, Lemma 3 implies that the value functions lie on a line – however, the following lemma is necessary to expose the full structure of V within this line.

Lemma 4. Consider the ensemble Y^π_{S\{s}} of policies that agree with a policy π everywhere but on s ∈ S. For π_0, π_1 ∈ Y^π_{S\{s}} define the function g : [0, 1] → V,

g(µ) = f_v(µ π_1 + (1 − µ) π_0).

Then the following hold regarding g:

(i) g is continuously differentiable;

(ii) (Total order) g(0) ≼ g(1) or g(0) ≽ g(1);

(iii) If g(0) = g(1), then g(µ) = g(0) for all µ ∈ [0, 1];

(iv) (Monotone interpolation) If g(0) ≠ g(1), there is a ρ : [0, 1] → R such that g(µ) = ρ(µ) g(1) + (1 − ρ(µ)) g(0), and ρ is a strictly monotonic rational function of µ.

The result (ii) in Lemma 4 was established in (Mansour & Singh, 1999) for deterministic policies. Note that in general, ρ(µ) ≠ µ in the above, as the following example demonstrates.

Example 1. Suppose S = {s_1, s_2}, with s_2 terminal with no reward associated to it, and A = {a_1, a_2}. The transitions and rewards are defined by P(s_2 | s_1, a_2) = 1, P(s_1 | s_1, a_1) = 1, r(s_1, a_1) = 0, r(s_1, a_2) = 1. Define two deterministic policies π_1, π_2 such that π_1(a_1 | s_1) = 1 and π_2(a_2 | s_1) = 1. Interpolating between these two policies yields a ρ(µ) that differs from µ.
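Theorem 1 and Lemma 4 are easy to visualize numerically: mixing two policies that agree on every state but one traces a straight, monotone segment in value space, even though the mixture coefficient µ is mapped through a non-linear ρ(µ). A minimal sketch under the same hypothetical-MDP assumptions as before:

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, n_states, n_actions = 0.9, 2, 3
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # hypothetical P[s, a, s']
r = rng.uniform(-1, 1, size=(n_states, n_actions))                # hypothetical r[s, a]

def value_of(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Two policies that agree on every state except s_1 (index 0).
pi0 = rng.dirichlet(np.ones(n_actions), size=n_states)
pi1 = pi0.copy()
pi1[0] = rng.dirichlet(np.ones(n_actions))

mus = np.linspace(0.0, 1.0, 11)
g = np.array([value_of(mu * pi1 + (1 - mu) * pi0) for mu in mus])

# Collinearity: every g(mu) lies on the segment between g(0) and g(1).
direction = g[-1] - g[0]
residual = (g - g[0]) - np.outer((g - g[0]) @ direction, direction) / (direction @ direction)
print(np.abs(residual).max())          # ~0 up to numerical error
# Total order: component-wise, the values move in a single direction along the segment.
diffs = np.diff(g, axis=0)
print(np.all(diffs >= -1e-10) or np.all(diffs <= 1e-10))
```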

Remarkably, Theorem 1 shows that policies agreeing on all but one state draw line segments irrespective of the size of the action space; this may be of particular interest in the context of continuous action problems. Second, this structure is unique, in the sense that the paths traced by interpolating between two arbitrary policies may be neither linear, nor monotonic (Figure 4 depicts two examples).

Figure 4. Value functions of mixtures of two policies in the general case. The orange points describe the value functions of mixtures of two policies.

3.4. Convex Consequences of Theorem 1

Some consequences arise immediately from Theorem 1. First, the result suggests a recursive application from the value function V^π of a policy π into its deterministic constituents.

Corollary 1. For any set of states s_1, .., s_k ∈ S and a policy π, V^π can be expressed as a convex combination of value functions of {s_1, .., s_k}-deterministic policies. In particular, V is included in the convex hull of the value functions of deterministic policies.

Figure 5. Visual representation of Corollary 1. The space of value functions is included in the convex hull of value functions of deterministic policies (red dots).

This result indicates a relationship between the vertices of V and deterministic policies. Nevertheless, we observe in Figure 5 that the value functions of deterministic policies are not necessarily the vertices of V and that the vertices of V are not necessarily attained by value functions of deterministic policies.

The space of value functions is in general not convex. However, it does possess a weaker structural property regarding paths between value functions which is reminiscent of policy iteration-type results.

Corollary 2. Let V^π and V^{π′} be two value functions. Then there exists a sequence of k ≤ |S| policies, π_1, . . . , π_k, such that V^{π_1} = V^π, V^{π_k} = V^{π′}, and for every i ∈ {1, . . . , k − 1}, the set

{ f_v(α π_i + (1 − α) π_{i+1}) | α ∈ [0, 1] }

forms a line segment.

3.5. The Boundary of V

We are almost ready to show that V is a polytope. To do so, however, we need to show that the boundary of the space of value functions is described by semi-deterministic policies. While at first glance reasonable given our earlier topological analysis, the result is complicated by the many-to-one mapping from policies to value functions, and requires additional tooling not provided by the line theorem. Recall from Lemma 3 the use of the affine vector space H^π_{s_1,..,s_k} to constrain the value functions generated by fixing certain action probabilities.

Theorem 2. Consider the ensemble of policies Y^π_{s_1,..,s_k} that agree with π on states {s_1, .., s_k}. Suppose that ∀ s ∉ {s_1, .., s_k}, ∀ a ∈ A, ∄ π′ ∈ Y^π_{s_1,..,s_k} ∩ D_{s,a} s.t. f_v(π′) = f_v(π). Then f_v(π) has a relative neighborhood in V ∩ H^π_{s_1,..,s_k}.

Theorem 2 demonstrates by contraposition that the boundary of the space of value functions is a subset of the ensemble of value functions of semi-deterministic policies. Figure 6 shows that the latter can be a proper subset.

Corollary 3. Consider a policy π ∈ P(A)^S, the states s_1, .., s_k, and the ensemble Y^π_{s_1,..,s_k} of policies that agree with π on s_1, .., s_k. Define V_y = f_v(Y^π_{s_1,..,s_k}); then the relative boundary of V_y in H^π_{s_1,..,s_k} is included in the value functions spanned by policies in Y^π_{s_1,..,s_k} that are s-deterministic for some s ∉ {s_1, .., s_k}:

∂ V_y ⊂ ∪_{s ∉ {s_1,..,s_k}} ∪_{a ∈ A} f_v(Y^π_{s_1,..,s_k} ∩ D_{s,a}),

where ∂ refers to ∂_{H^π_{s_1,..,s_k}}.
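Corollary 1 above has a simple computational reading: the finitely many deterministic policies give candidate extreme value functions, and every other value function lies in their convex hull. The sketch below, again for a hypothetical two-state MDP, enumerates the deterministic policies and checks a necessary consequence for a random policy (membership in the axis-aligned bounding box of the deterministic value functions); a full hull-membership test would require a small linear program, which we omit.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
gamma, n_states, n_actions = 0.9, 2, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # hypothetical P[s, a, s']
r = rng.uniform(-1, 1, size=(n_states, n_actions))                # hypothetical r[s, a]

def value_of(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Value functions of all |A|^|S| deterministic policies (Corollary 1's candidate extreme points).
det_values = []
for actions in itertools.product(range(n_actions), repeat=n_states):
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), actions] = 1.0
    det_values.append(value_of(pi))
det_values = np.array(det_values)

v_random = value_of(rng.dirichlet(np.ones(n_actions), size=n_states))
# Necessary condition for hull membership: inside the axis-aligned bounding box.
print(np.all(det_values.min(0) - 1e-9 <= v_random) and np.all(v_random <= det_values.max(0) + 1e-9))
```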

3.6. The Polytope of Value Functions

We are now in a position to combine the results of the previous section to arrive at our main contribution: V is a polytope in the sense of Def. 3 and Prop. 1. Our result is in fact stronger: we show that any subset of policies Y^π_{s_1,..,s_k} generates a sub-polytope of V.

Theorem 3. Consider a policy π ∈ P(A)^S, the states s_1, .., s_k ∈ S, and the ensemble Y^π_{s_1,..,s_k} of policies that agree with π on s_1, .., s_k. Then f_v(Y^π_{s_1,..,s_k}) is a polytope and in particular, V = f_v(Y^π_∅) is a polytope.

Despite the evidence gathered in the previous section in favour of the above theorem, the result is surprising given the fundamental non-linearity of the functional f_v: again, mixtures of policies can describe curves (Figure 4), and even the mapping g in Lemma 4 is nonlinear in µ.

That the polytope can be non-convex is obvious from the preceding figures. As Figure 6 (right) shows, this can happen when value functions along two different line segments cross. At that intersection, something interesting occurs: there are two policies with the same value function but that do not agree on either state. We will illustrate the effect of this structure on learning dynamics in Section 5.

Finally, there is a natural sub-polytope structure in the space of value functions. If policies are free to vary only on a subset of states of cardinal k, then there is a polytope of dimension k associated with the induced space of value functions. This makes sense since constraining policies on a subset of states is equivalent to defining a new MDP, where the transitions associated with the complement of this subset of states are not dependent on policy decisions.

4. Related Work

The link between geometry and reinforcement learning has been so far fairly limited. However we note the former use of convex polyhedra in the following:

Simplex Method and Policy Iteration. The policy iteration algorithm (Howard, 1960) closely relates to the simplex algorithm (Dantzig, 1948).

Linear Programming. In the primal linear programming formulation of an MDP, the constraints are defined by {V ∈ R^{|S|} | V ≽ T*V}, where T* is the optimality Bellman operator. Notice that there is a unique value function V ∈ V that is feasible, which is exactly the optimal value function V*. The dual formulation consists of maximizing the expected return for a given initial state distribution, as a function of the discounted state action visit frequency distribution. Contrary to the primal form, any feasible discounted state action visit frequency distribution maps to an actual policy (Wang et al., 2007).

5. Dynamics in the Polytope

In this section we study how the behaviour of common reinforcement learning algorithms is reflected in the value function polytope. We consider two value-based methods, value iteration and policy iteration, three variants of the policy gradient method, and an evolutionary strategy.

Our experiments use the two-state, two-action MDP depicted elsewhere in this paper (details in Appendix A). Value-based methods are parametrized directly in terms of the value vector in R^2; policy-based methods are parametrized using the softmax distribution, with one parameter per state.¹ We initialize all methods at the same starting value functions (indicated on Figure 7): near a vertex (V_1^i), near a boundary (V_2^i), and in the interior of the polytope (V_3^i).

We are chiefly interested in three aspects of the different algorithms' learning dynamics: 1) the path taken through the value polytope, 2) the speed at which they traverse the polytope, and 3) any accumulation points that occur along this path. As such, we compute model-based versions of all relevant updates; in the case of evolutionary strategies, we use large population sizes (De Boer et al., 2004).

¹ The use of the softmax precludes initializing policy-based methods exactly at boundaries.

5.1. Value Iteration

Value iteration (Bellman, 1957) consists of the repeated application of the optimality Bellman operator T*,

V_{k+1} := T* V_k.

In all cases, V_0 is initialized to the relevant starting value function. Figure 7 depicts the paths in value space taken by value iteration, from the starting point to the optimal value function. We observe that the path does not remain within the polytope: value iteration generates a sequence of vectors that are not necessarily the value functions of actual policies.

Figure 7. Value iteration dynamics for three initialization points.
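A sketch of this experiment: apply the optimality Bellman operator from a chosen starting value vector and record the iterates, which can then be plotted against the sampled polytope. The MDP below is again a hypothetical stand-in for the two-state MDP of the paper's Appendix A.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, n_states, n_actions = 0.9, 2, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # hypothetical P[s, a, s']
r = rng.uniform(-1, 1, size=(n_states, n_actions))                # hypothetical r[s, a]

def bellman_optimality(V):
    # (T* V)(s) = max_a [ r(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    return np.max(r + gamma * P @ V, axis=1)

V = np.zeros(n_states)          # starting value vector (cf. the initialization points of Figure 7)
path = [V]
for _ in range(50):
    V = bellman_optimality(V)
    path.append(V)
path = np.array(path)           # each row is an iterate; plot path[:, 0] vs path[:, 1]
print(path[-1])                 # approximately V*
```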

5.2. Policy Iteration

Policy iteration (Howard, 1960) consists of the repeated application of a policy improvement step and a policy evaluation step until convergence to the optimal policy. The policy improvement step updates the policy by acting greedily according to the current value function; the value function of the new policy is then evaluated. The algorithm is based on the following update rule

π_{k+1} := greedy(V_k),   V_{k+1} := evaluate(π_{k+1}),

with V_0 initialized as in value iteration. The sequence of value functions visited by policy iteration (Figure 8) corresponds to value functions of deterministic policies, which in this specific MDP corresponds to vertices of the polytope.

Figure 8. Policy iteration. The red arrows show the sequence of value functions (blue) generated by the algorithm.

5.3. Policy Gradient

Policy gradient is a popular approach for directly optimizing the value function via parametrized policies (Williams, 1992; Konda & Tsitsiklis, 2000; Sutton et al., 2000). For a policy π_θ with parameters θ the policy gradient is

∇_θ J(θ) = E_{s ∼ d^π, a ∼ π(· | s)} [ ∇_θ log π(a | s) ( r(s, a) + γ E[V^π(s')] ) ],

where d^π is the discounted stationary distribution; here we assume a uniformly random initial distribution over the states. The policy gradient update is then (η ∈ [0, 1])

θ_{k+1} := θ_k + η ∇_θ J(θ_k).

Figure 9. Value functions generated by policy gradient.

Figure 9 shows that the convergence rate of policy gradient strongly depends on the initial condition. In particular, Figure 9 a), b) show accumulation points along the update path (not shown here, the method does eventually converge to V*). This behaviour is sensible given the dependence of ∇_θ J(θ) on π(· | s), with gradients vanishing at the boundary of the polytope.

5.4. Entropy Regularized Policy Gradient

Entropy regularization adds an entropy term to the objective (Williams & Peng, 1991), yielding a modified policy gradient. Compared to the results for the unregularized policy gradient, we find that this improves the convergence rate of the optimization procedure (Figure 10). One trade-off is that the policy converges to a sub-optimal policy, which is not deterministic.

Figure 10. Value functions generated by policy gradient with entropy, for three different initialization points.
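For a two-state MDP, the vanilla policy gradient experiment of Section 5.3 can be sketched directly. The snippet below parametrizes the policy with a softmax (one parameter per state, as in the paper's setup) and ascends a finite-difference estimate of J(θ), the expected value under a uniform start distribution; the finite-difference gradient is a simplification standing in for the analytic policy gradient above, and the MDP numbers are again hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
gamma, n_states, n_actions = 0.9, 2, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # hypothetical P[s, a, s']
r = rng.uniform(-1, 1, size=(n_states, n_actions))                # hypothetical r[s, a]

def value_of(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def policy(theta):
    # Softmax with one parameter per state: logits (theta[s], 0) over the two actions.
    logits = np.stack([theta, np.zeros_like(theta)], axis=1)
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def objective(theta):
    return value_of(policy(theta)).mean()   # uniform initial state distribution

theta, eta, eps = np.zeros(n_states), 0.5, 1e-5
path = [value_of(policy(theta))]
for _ in range(200):
    grad = np.array([(objective(theta + eps * e) - objective(theta - eps * e)) / (2 * eps)
                     for e in np.eye(n_states)])
    theta = theta + eta * grad              # gradient ascent on J(theta)
    path.append(value_of(policy(theta)))
print(path[-1])                             # approaches V* as the number of updates grows
```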


5.5. Natural Policy Gradient

Natural policy gradient (Kakade, 2002) is a second-order policy optimization method. The gradient updates condition the standard policy gradient with the inverse Fisher information matrix F (Kakade, 2002), leading to the following update rule:

θ_{k+1} := θ_k + η F^{-1} ∇_θ J(θ_k).

This causes the gradient steps to follow the steepest ascent direction in the underlying structure of the parameter space.

Figure 11. Natural policy gradient.

In our experiment, we observe that natural policy gradient is less prone to accumulation than policy gradient (Fig. 11), in part because the step-size is better conditioned. Figure 11 b) shows unregularized policy gradient does not, surprisingly enough, take the "shortest path" through the polytope to the optimal value function: instead, it moves from one vertex to the next, similar to policy iteration.

5.6. Cross-Entropy Method

Gradient-free optimization methods have shown impressive performance over complex control tasks (De Boer et al., 2004; Salimans et al., 2017). We present the dynamics of the cross-entropy method (CEM), without noise and with a constant noise factor (CEM-CN) (Szita & Lőrincz, 2006). The mechanics of the algorithm is threefold: (i) sample a population of size N of policy parameters from a Gaussian distribution of mean θ and covariance C; (ii) evaluate the returns of the population; (iii) select the top K members, and fit a new Gaussian onto them. In the CEM-CN variant, we inject additional isotropic noise at each iteration. We use the identity matrix of size 2 as initial covariance, and a constant noise of 0.05 I.

Figure 12. The cross-entropy method without noise (CEM) (a, b, c); with constant noise (CEM-CN) (d, e, f).

As observed in the original work (Szita & Lőrincz, 2006), the covariance of CEM without noise collapses (Figure 12 a), b), c)), and therefore reaches convergence for a suboptimal policy. The noise-injected variant CEM-CN (Figure 12 d), e), f)) instead converges to the optimal value function for all three initialization points.
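A sketch of the cross-entropy method in this setting, following the three steps above (sample, evaluate, refit) with the constant-noise variant; the population size, elite count, and the MDP itself are illustrative assumptions rather than the exact experimental settings.

```python
import numpy as np

rng = np.random.default_rng(6)
gamma, n_states, n_actions = 0.9, 2, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # hypothetical P[s, a, s']
r = rng.uniform(-1, 1, size=(n_states, n_actions))                # hypothetical r[s, a]

def value_of(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def policy(theta):
    logits = np.stack([theta, np.zeros_like(theta)], axis=1)   # softmax, one parameter per state
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

mean, cov = np.zeros(n_states), np.eye(n_states)   # initial Gaussian over policy parameters
N, K, noise = 500, 50, 0.05 * np.eye(n_states)     # population, elites, constant noise (CEM-CN)
for _ in range(30):
    thetas = rng.multivariate_normal(mean, cov, size=N)               # (i) sample the population
    returns = np.array([value_of(policy(t)).mean() for t in thetas])  # (ii) evaluate returns
    elite = thetas[np.argsort(returns)[-K:]]                          # (iii) keep the top K members
    mean, cov = elite.mean(axis=0), np.cov(elite, rowvar=False) + noise  # refit and inject noise
print(value_of(policy(mean)))   # value function of the final mean parameters
```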

7. Acknowledgements

The authors would like to thank their colleagues at Google Brain for their help; Carles Gelada, Doina Precup, Georg Ostrovski, Marco Cuturi, Marek Petrik, Matthieu Geist, Olivier Pietquin, Pablo Samuel Castro, Rémi Munos, Rémi Tachet, Saurabh Kumar, and Zafarali Ahmed for useful discussion and feedback; Jake Levinson and Mathieu Guay-Paquet for their insights on the proof of Proposition 1; and Mark Rowland for providing invaluable feedback on two earlier versions of this manuscript.

References

Aigner, M., Ziegler, G. M., Hofmann, K. H., and Erdos, P. Proofs from the Book, volume 274. Springer, 2010.

Bellemare, M. G., Dabney, W., Dadashi, R., Taiga, A. A., Castro, P. S., Roux, N. L., Schuurmans, D., Lattimore, T., and Lyle, C. A geometric perspective on optimal representations for reinforcement learning. arXiv preprint arXiv:1901.11530, 2019.

Bellman, R. Dynamic Programming. Dover Publications, 1957.

Bertsekas, D. P. Generic rank-one corrections for value iteration in Markovian decision problems. Technical report, M.I.T., 1994.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, 1996.

Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.

Brondsted, A. An Introduction to Convex Polytopes, volume 90. Springer Science & Business Media, 2012.

Dantzig, G. B. Programming in a linear structure. Washington, DC, 1948.

De Boer, P.-T., Kroese, D. P., Mannor, S., and Rubinstein, R. Y. A tutorial on the cross-entropy method. Annals of Operations Research, 2004.

De Farias, D. P. and Van Roy, B. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.

Engelking, R. General Topology. Heldermann, 1989.

Grünbaum, B., Klee, V., Perles, M. A., and Shephard, G. C. Convex Polytopes. Springer, 1967.

Howard, R. A. Dynamic Programming and Markov Processes. MIT Press, 1960.

Kakade, S. M. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538, 2002.

Klee, V. Some characterizations of convex polyhedra. Acta Mathematica, 102(1-2):79–107, 1959.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014, 2000.

Littman, M. L., Dean, T. L., and Kaelbling, L. P. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 394–402. Morgan Kaufmann Publishers Inc., 1995.

Mansour, Y. and Singh, S. On the complexity of policy iteration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 401–408. Morgan Kaufmann Publishers Inc., 1999.

Munos, R. Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning, 2003.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.

Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Szita, I. and Lőrincz, A. Learning Tetris using the noisy cross-entropy method. Neural Computation, 2006.

Wang, T., Bowling, M., and Schuurmans, D. Dual representations for dynamic programming and reinforcement learning. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), pp. 44–51. IEEE, 2007.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.

Ye, Y. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603, 2011.

Ziegler, G. M. Lectures on Polytopes, volume 152. Springer Science & Business Media, 2012.

Appendix

A. Details of Markov Decision Processes

In this section we give the specifics of the Markov Decision Processes presented in this work. We use the following convention:

r(s_i, a_j) = r̂[i × |A| + j]

P(s_k | s_i, a_j) = P̂[i × |A| + j][k], where P̂ and r̂ are the vectors given below.
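For concreteness, the indexing convention can be sketched in code as follows (a minimal sketch, assuming numpy; the values are those of MDP (a) below, and the helper names are ours, not the paper's):

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
r_hat = np.array([0.06, 0.38, -0.13, 0.64])
P_hat = np.array([[0.01, 0.99], [0.92, 0.08], [0.08, 0.92], [0.70, 0.30]])

def r(i, j):
    # Reward for taking action a_j in state s_i.
    return r_hat[i * n_actions + j]

def P(k, i, j):
    # Probability of reaching state s_k after taking action a_j in state s_i.
    return P_hat[i * n_actions + j][k]

assert np.isclose(r(1, 0), -0.13)    # r(s_1, a_0)
assert np.isclose(P(1, 0, 1), 0.08)  # P(s_1 | s_0, a_1)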

In Section 3, Figure 2:

(a) |A| = 2, γ = 0.9
r̂ = [0.06, 0.38, −0.13, 0.64]
P̂ = [[0.01, 0.99], [0.92, 0.08], [0.08, 0.92], [0.70, 0.30]]

(b) |A| = 2, γ = 0.9
r̂ = [0.88, −0.02, −0.98, 0.42]
P̂ = [[0.96, 0.04], [0.19, 0.81], [0.43, 0.57], [0.72, 0.28]]

(c) |A| = 3, γ = 0.9
r̂ = [−0.93, −0.49, 0.63, 0.78, 0.14, 0.41]
P̂ = [[0.52, 0.48], [0.5, 0.5], [0.99, 0.01], [0.85, 0.15], [0.11, 0.89], [0.1, 0.9]]

(d) |A| = 2, γ = 0.9
r̂ = [−0.45, −0.1, 0.5, 0.5]
P̂ = [[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]]

In Section 3, Figures 3, 4, 5, 6:

(left) |A| = 3, γ = 0.8
r̂ = [−0.1, −1., 0.1, 0.4, 1.5, 0.1]
P̂ = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.05, 0.95], [0.25, 0.75], [0.3, 0.7]]

(right) |A| = 2, γ = 0.9
r̂ = [−0.45, −0.1, 0.5, 0.5]
P̂ = [[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]]

In Section 5: |A| = 2, γ = 0.9
r̂ = [−0.45, −0.1, 0.5, 0.5]
P̂ = [[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]]

B. Notation for the proofs

In this section we present the notation used to establish the results in the main text. The space of policies P(A)^S is a Cartesian product of simplices, which we can express as a space of |S| × |A| matrices. However, we adopt for policies, as well as for the other components of M, a convenient matrix form similar to that of Wang et al. (2007).

• The transition matrix P is a |S||A| × |S| matrix denoting the probability of going to state s′ when taking action a in state s.

• A policy π is represented by a block-diagonal |S| × |S||A| matrix M_π. Suppose the state s is indexed by i and the action a is indexed by j in the matrix form; then M_π(i, i × |A| + j) = π(a|s), and the remaining entries of M_π are 0. From now on, we identify π with M_π to enhance readability.

• The transition matrix P^π = πP induced by a policy π is a |S| × |S| matrix denoting the probability of going from state s to state s′ when following the policy π.

• The reward vector r is a |S||A| × 1 vector denoting the expected reward when taking action a in state s. The reward vector of a policy π is r_π = πr, a |S| × 1 vector.

• The value function V^π of a policy π is a |S| × 1 vector.

• We denote by C^π_i the i-th column of (I − γP^π)^{-1}.

Under this notation, we can define the Bellman operator T^π and the optimality Bellman operator T* as follows:

T^π V = r_π + γP^π V = π(r + γPV)

∀s ∈ S,  T*V(s) = max_{π′ ∈ P(A)^S} [ r_{π′}(s) + γ(P^{π′}V)(s) ].
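As an illustration of this matrix form, the following minimal sketch (assuming numpy; the MDP is the Section 5 one from Appendix A, and the policy pi_table is an arbitrary example) builds M_π, r_π, and P^π and applies the Bellman operator T^π once:

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
r = np.array([-0.45, -0.1, 0.5, 0.5])                               # |S||A| reward vector
P = np.array([[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]])  # |S||A| x |S| transitions
pi_table = np.array([[0.3, 0.7], [0.8, 0.2]])                       # pi(a_j | s_i)

def policy_matrix(pi_table):
    # Block-diagonal |S| x |S||A| matrix M_pi with M_pi[i, i*|A|+j] = pi(a_j | s_i).
    M = np.zeros((n_states, n_states * n_actions))
    for i in range(n_states):
        M[i, i * n_actions:(i + 1) * n_actions] = pi_table[i]
    return M

M_pi = policy_matrix(pi_table)
r_pi = M_pi @ r     # expected reward under pi (|S| vector)
P_pi = M_pi @ P     # transition matrix under pi (|S| x |S|)

V = np.zeros(n_states)
TV = r_pi + gamma * P_pi @ V                       # T^pi V = r_pi + gamma * P^pi V
assert np.allclose(TV, M_pi @ (r + gamma * P @ V)) # equals pi(r + gamma * P V)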

C. Supplementary Results

Lemma 5. f_v is infinitely differentiable on P(A)^S.

Proof. We have that:

f_v(π) = (I − γπP)^{-1} πr
       = (1 / det(I − γπP)) adj(I − γπP) πr,

where det is the determinant and adj is the adjugate. For all π ∈ P(A)^S, det(I − γπP) ≠ 0; therefore f_v is infinitely differentiable.
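A minimal numerical sketch of f_v (assuming numpy; the MDP is again the Section 5 one from Appendix A), using a linear solve rather than the explicit inverse:

import numpy as np

gamma = 0.9
r = np.array([-0.45, -0.1, 0.5, 0.5])
P = np.array([[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]])

def value_function(pi_table, P, r, gamma):
    n_states, n_actions = pi_table.shape
    M_pi = np.zeros((n_states, n_states * n_actions))
    for i in range(n_states):
        M_pi[i, i * n_actions:(i + 1) * n_actions] = pi_table[i]
    P_pi, r_pi = M_pi @ P, M_pi @ r
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)   # f_v(pi)
    assert np.allclose(V, r_pi + gamma * P_pi @ V)               # Bellman fixed point
    return V

print(value_function(np.array([[0.3, 0.7], [0.8, 0.2]]), P, r, gamma))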

Lemma 6. Let π ∈ P(A)^S, s_1, .., s_k ∈ S, and π′ ∈ Y^π_{s_1,..,s_k}. We have

Span(C^π_{k+1}, .., C^π_{|S|}) = Span(C^{π′}_{k+1}, .., C^{π′}_{|S|}).

Proof. As P^π and P^{π′} are equal on their first k rows, (I − γP^π) and (I − γP^{π′}) are also equal on their first k rows. We denote these k rows by L_1, ..., L_k. By assumption, we have that:

∀i ∈ {1, . . . , k}, ∀j ∈ {k + 1, . . . , |S|},  L_i C^π_j = 0,  L_i C^{π′}_j = 0,

which we can rewrite as:

Span(C^π_{k+1}, ..., C^π_{|S|}) ⊂ Span(L_1, ..., L_k)^⊥
Span(C^{π′}_{k+1}, ..., C^{π′}_{|S|}) ⊂ Span(L_1, ..., L_k)^⊥.

Now, using dim Span(C^π_{k+1}, ..., C^π_{|S|}) = dim Span(C^{π′}_{k+1}, ..., C^{π′}_{|S|}) = dim Span(L_1, ..., L_k)^⊥ = |S| − k, we have:

Span(C^π_{k+1}, ..., C^π_{|S|}) = Span(L_1, ..., L_k)^⊥
Span(C^{π′}_{k+1}, ..., C^{π′}_{|S|}) = Span(L_1, ..., L_k)^⊥.

D. Proofs

Lemma 1. The space of value functions V is compact and connected.

Proof. P(A)^S is connected since it is convex, and it is compact because it is closed and bounded in a finite-dimensional real vector space. Since f_v is continuous (Lemma 5), f_v(P(A)^S) = V is compact and connected.

Lemma 2. Consider two policies π_1, π_2 that agree on s_1, . . . , s_k ∈ S. Then the vector r_{π_1} − r_{π_2} has zeros in the components corresponding to s_1, . . . , s_k, and the matrix P^{π_1} − P^{π_2} has zeros in the corresponding rows.

Proof. Suppose without loss of generality that {s1, .., sk} are the first k states in the matrix form notation. We have,

r_{π_1} = π_1 r
r_{π_2} = π_2 r
P^{π_1} = π_1 P
P^{π_2} = π_2 P.

Since π_1(· | s) = π_2(· | s) for all s ∈ {s_1, .., s_k}, the first k rows of π_1 and π_2 are identical in the matrix form notation. Therefore, the first k elements of r_{π_1} and r_{π_2} are identical, and the first k rows of P^{π_1} and P^{π_2} are identical, hence the result.

Lemma 3. Consider a policy π and k states s_1, . . . , s_k. Then the value functions generated by Y^π_{s_1,..,s_k} are contained in the affine vector space H^π_{s_1,..,s_k}:

f_v(Y^π_{s_1,..,s_k}) = V ∩ H^π_{s_1,..,s_k}.

Proof. Let us first show that f_v(Y^π_{s_1,..,s_k}) ⊂ V ∩ H^π_{s_1,..,s_k}. Let π′ ∈ Y^π_{s_1,..,s_k}, i.e. π′ agrees with π on s_1, .., s_k. Using Bellman's equation, we have:

V^{π′} − V^π = r_{π′} − r_π + γP^{π′}V^{π′} − γP^πV^π
             = r_{π′} − r_π + γ(P^{π′} − P^π)V^{π′} + γP^π(V^{π′} − V^π)
             = (I − γP^π)^{-1} ( r_{π′} − r_π + γ(P^{π′} − P^π)V^{π′} ).    (3)

Since the policies π′ and π agree on the states s_1, . . . , s_k, we have, using Lemma 2:

r_{π′} − r_π is zero in its first k elements,
P^{π′} − P^π is zero in its first k rows.

Hence, the right-hand side of Eq. (3) is the product of a matrix with a vector whose first k elements are 0. Therefore

V^{π′} ∈ V^π + Span(C^π_{k+1}, .., C^π_{|S|}).

We shall now show that V ∩ H^π_{s_1,..,s_k} ⊂ f_v(Y^π_{s_1,..,s_k}).

Suppose V^π̂ ∈ V ∩ H^π_{s_1,..,s_k}. We want to show that there is a policy π′ ∈ Y^π_{s_1,..,s_k} such that V^{π′} = V^π̂. We construct π′ the following way:

π′(· | s) = π(· | s)   if s ∈ {s_1, .., s_k}
π′(· | s) = π̂(· | s)   otherwise.

Therefore, using the result of the first implication of this proof:

V^π̂ − V^{π′} ∈ Span(C^π_{k+1}, .., C^π_{|S|})   by assumption,
V^π̂ − V^{π′} ∈ Span(C^{π′}_1, .., C^{π′}_k)     since π̂ and π′ agree on s_{k+1}, . . . , s_{|S|}.

However, as π, π′ ∈ Y^π_{s_1,..,s_k}, we have, using Lemma 6:

Span(C^π_{k+1}, .., C^π_{|S|}) = Span(C^{π′}_{k+1}, .., C^{π′}_{|S|}).

Therefore, V^π̂ − V^{π′} ∈ Span(C^{π′}_1, .., C^{π′}_k) ∩ Span(C^{π′}_{k+1}, .., C^{π′}_{|S|}) = {0}, meaning that V^π̂ = V^{π′} ∈ f_v(Y^π_{s_1,..,s_k}).

Lemma 4. Consider the ensemble Y^π_{S∖{s}} of policies that agree with a policy π everywhere but on s ∈ S. For π_0, π_1 ∈ Y^π_{S∖{s}}, define the function g : [0, 1] → V by

g(µ) = f_v(µπ_1 + (1 − µ)π_0).

Then the following hold regarding g:

(i) g is continuously differentiable;

(ii) (Total order) g(0) ⪯ g(1) or g(0) ⪰ g(1);

(iii) If g(0) = g(1), then g(µ) = g(0) for all µ ∈ [0, 1];

(iv) (Monotone interpolation) If g(0) ≠ g(1), there is a ρ : [0, 1] → R such that g(µ) = ρ(µ)g(1) + (1 − ρ(µ))g(0), and ρ is a strictly monotonic rational function of µ.

Proof. (i) g is continuously differentiable as a composition of two continuously differentiable functions.

(ii) We want to show that we have either V^{π_1} ⪯ V^{π_0} or V^{π_1} ⪰ V^{π_0}. Suppose, without loss of generality, that s is the first state in the matrix form. Using Lemma 3, we have:

V^{π_0} = V^{π_1} + αC^{π_1}_1, with α ∈ R.

As (I − γP^{π_1})^{-1} = Σ_{i=0}^{∞} (γP^{π_1})^i, whose entries are all nonnegative, C^{π_1}_1 is a vector with nonnegative entries. Therefore we have V^{π_1} ⪯ V^{π_0} or V^{π_1} ⪰ V^{π_0}, depending on the sign of α.

(iii) We have, using Equation (3)

V^{π_0} − V^{π_µ} = (I − γP^{π_µ})^{-1} ( r_{π_0} − r_{π_µ} + γ(P^{π_0} − P^{π_µ})V^{π_0} )

V^{π_0} − V^{π_1} = (I − γP^{π_1})^{-1} ( r_{π_0} − r_{π_1} + γ(P^{π_0} − P^{π_1})V^{π_0} ).

Now, using

r_{π_0} = π_0 r
r_{π_µ} = π_µ r = π_0 r + µ(π_1 − π_0)r
P^{π_0} = π_0 P
P^{π_µ} = π_µ P = π_0 P + µ(π_1 − π_0)P,

we have

V^{π_0} − V^{π_µ} = µ(I − γP^{π_µ})^{-1} ( r_{π_0} − r_{π_1} + γ(P^{π_0} − P^{π_1})V^{π_0} )

                  = µ(I − γP^{π_µ})^{-1} (I − γP^{π_1})(V^{π_0} − V^{π_1}).    (4)

Therefore, g(0) = g(1) ⇒ V^{π_0} − V^{π_1} = 0 ⇒ V^{π_0} − V^{π_µ} = 0 ⇒ g(µ) = g(0).

(iv) If g(0) = g(1), the result is true since we can take ρ = 0 using (iii). Suppose g(0) ≠ g(1); let us prove the existence of ρ and that it is a rational function of µ. Reusing Equation (4), we have

V^{π_0} − V^{π_µ} = µ(I − γP^{π_µ})^{-1} (I − γP^{π_1})(V^{π_0} − V^{π_1})
                  = µ(I − γ(P^{π_0} + µ(P^{π_1} − P^{π_0})))^{-1} (I − γP^{π_1})(V^{π_0} − V^{π_1})
                  = µ(I − γP^{π_1} − γ(1 − µ)(P^{π_0} − P^{π_1}))^{-1} (I − γP^{π_1})(V^{π_0} − V^{π_1}).

As P^{π_0} − P^{π_1} is a rank-one matrix (Lemma 2), we can express it as P^{π_0} − P^{π_1} = uv^t with u, v ∈ R^{|S|}. From the Sherman–Morrison formula:

(I − γP^{π_1} − γ(1 − µ)(P^{π_0} − P^{π_1}))^{-1} = (I − γP^{π_1})^{-1} − γ(1 − µ) [ (I − γP^{π_1})^{-1}(P^{π_0} − P^{π_1})(I − γP^{π_1})^{-1} ] / [ 1 + γ(1 − µ) v^t (I − γP^{π_1})^{-1} u ].

Defining ω_{π_1,π_0} = v^t (I − γP^{π_1})^{-1} u, we have

V^{π_0} − V^{π_µ} = µV^{π_0} − µV^{π_1} − [ γµ(1 − µ) / (1 + ω_{π_1,π_0} γ(1 − µ)) ] (I − γP^{π_1})^{-1}(P^{π_0} − P^{π_1})(V^{π_0} − V^{π_1}).

As in (ii), (P^{π_0} − P^{π_1})(V^{π_0} − V^{π_1}) is zero on its last |S| − 1 elements, by an argument similar to Lemma 2; therefore

(I − γP^{π_1})^{-1}(P^{π_0} − P^{π_1})(V^{π_0} − V^{π_1}) = β_{π_0,π_1} C^{π_1}_1, with β_{π_0,π_1} ∈ R.

Now recall from the proof of (ii) that, similarly, we have

V^{π_0} − V^{π_1} = α_{π_0,π_1} C^{π_1}_1.

As V^{π_0} − V^{π_1} ≠ 0 by assumption, we have:

(I − γP^{π_1})^{-1}(P^{π_0} − P^{π_1})(V^{π_0} − V^{π_1}) = (β_{π_0,π_1} / α_{π_0,π_1}) (V^{π_0} − V^{π_1}).

Finally, we have

V^{π_0} − V^{π_µ} = µV^{π_0} − µV^{π_1} − [ γµ(1 − µ) / (1 + ω_{π_1,π_0} γ(1 − µ)) ] (β_{π_0,π_1} / α_{π_0,π_1}) (V^{π_0} − V^{π_1})
                  = ( µ − [ γµ(1 − µ) / (1 + ω_{π_1,π_0} γ(1 − µ)) ] (β_{π_0,π_1} / α_{π_0,π_1}) ) (V^{π_0} − V^{π_1}).

Therefore, ρ is a rational function of µ, hence continuous, which we can express as:

ρ(µ) = µ − [ γµ(1 − µ) / (1 + ω_{π_1,π_0} γ(1 − µ)) ] (β_{π_0,π_1} / α_{π_0,π_1}).

Now let us prove that ρ is strictly monotonic. Suppose that ρ is not strictly monotonic. As ρ is continuous, ρ is then not injective. Hence, there exist distinct µ_0, µ_1 ∈ [0, 1], with associated mixture policies π_{µ_0}, π_{µ_1}, such that

g(µ_0) = g(µ_1) ⇔ V^{π_{µ_0}} = V^{π_{µ_1}}
                ⇔ T^{π_{µ_0}} V^{π_{µ_0}} = T^{π_{µ_1}} V^{π_{µ_1}}
                ⇔ T^{π_{µ_0}} V^{π_{µ_0}} = T^{π_{µ_1}} V^{π_{µ_0}}

                ⇔ (µ_0 T^{π_1} + (1 − µ_0)T^{π_0}) V^{π_{µ_0}} = (µ_1 T^{π_1} + (1 − µ_1)T^{π_0}) V^{π_{µ_0}}
                ⇔ µ_0 (T^{π_1} − T^{π_0}) V^{π_{µ_0}} = µ_1 (T^{π_1} − T^{π_0}) V^{π_{µ_0}}
                ⇔ T^{π_0} V^{π_{µ_0}} = T^{π_1} V^{π_{µ_0}}.

Therefore we have

V^{π_{µ_0}} = T^{π_{µ_0}} V^{π_{µ_0}} = (µ_0 T^{π_1} + (1 − µ_0) T^{π_0}) V^{π_{µ_0}} = T^{π_1} V^{π_{µ_0}}.

Therefore

T^{π_0} V^{π_{µ_0}} = T^{π_1} V^{π_{µ_0}} = V^{π_{µ_0}}.

However, each Bellman operator T^π has a unique fixed point, namely V^π; therefore

V^{π_0} = V^{π_1} = V^{π_{µ_0}}, which contradicts our assumption that g(0) ≠ g(1).

Theorem 1. [Line Theorem] Let s be a state and π a policy. Then there are two s-deterministic policies in Y^π_{S∖{s}}, denoted π_l, π_u, which bracket the value of all other policies π′ ∈ Y^π_{S∖{s}}:

f_v(π_l) ⪯ f_v(π′) ⪯ f_v(π_u).

Furthermore, the image of f_v restricted to Y^π_{S∖{s}} is a line segment, and the following three sets are equivalent:

(i) f_v(Y^π_{S∖{s}}),

(ii) {f_v(απ_l + (1 − α)π_u) | α ∈ [0, 1]},

(iii) {αf_v(π_l) + (1 − α)f_v(π_u) | α ∈ [0, 1]}.

Proof. Let us start by proving the first statement of the theorem, which is the existence of two s-deterministic policies π_u, π_l in Y^π_{S∖{s}} that respectively dominate and are dominated by all other policies.

The existence of π_l and π_u (without enforcing their s-determinism), whose value functions respectively are dominated by or dominate all other value functions of policies of Y^π_{S∖{s}}, is given by:

• f_v(Y^π_{S∖{s}}) is compact, as the intersection of a compact set and an affine plane (Lemma 3).

• There is a total order on this compact space ((ii) in Lemma 4).

Suppose π_l is not s-deterministic; then there is a ∈ A such that π_l(a|s) = µ* ∈ (0, 1). Hence we can write π_l as a mixture of π_1, π_2, defined as follows:

∀s′ ∈ S, a′ ∈ A,

π_1(a′|s′) = 1                           if s′ = s, a′ = a
π_1(a′|s′) = 0                           if s′ = s, a′ ≠ a        (5)
π_1(a′|s′) = π_l(a′|s′)                  otherwise.

π_2(a′|s′) = 0                           if s′ = s, a′ = a
π_2(a′|s′) = (1/(1 − µ*)) π_l(a′|s′)     if s′ = s, a′ ≠ a        (6)
π_2(a′|s′) = π_l(a′|s′)                  otherwise.

Therefore π_l = µ*π_1 + (1 − µ*)π_2. We can use (iv) in Lemma 4, which gives that g : µ ↦ f_v(µπ_1 + (1 − µ)π_2) is either strictly monotonic or constant. If g were strictly monotonic, we would have a contradiction with f_v(π_l) being minimal. Therefore g is constant, and in particular f_v(π_l) = g(1) = f_v(π_1), with π_1 s-deterministic.

Similarly, we can show there is an s-deterministic policy that has the same value function as π_u, hence proving the result. Now let us prove the equivalence between (i), (ii), and (iii).

• Let π′ ∈ Y^π_{S∖{s}}. We have f_v(π_l) ⪯ f_v(π′) ⪯ f_v(π_u), and f_v(π_l), f_v(π′), f_v(π_u) are on the same line (Lemma 3). Therefore, f_v(π′) is a convex combination of f_v(π_l) and f_v(π_u), hence (i) ⊂ (iii).

• By definition, (ii) ⊂ (i).

• Lemma 4 gives µ ↦ f_v(µπ_u + (1 − µ)π_l) = f_v(π_l) + ρ(µ)(f_v(π_u) − f_v(π_l)) with ρ continuous and ρ(0) = 0, ρ(1) = 1. Using the intermediate value theorem on ρ, we have that ρ takes all values between 0 and 1. Therefore (iii) ⊂ (ii).

We have (iii) ⊂ (ii) ⊂ (i) ⊂ (iii). Therefore (i) = (ii) = (iii).
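The line theorem can also be checked numerically; a minimal sketch (assuming numpy; the MDP is the Section 5 one from Appendix A, and the fixed behaviour on s_1 is an arbitrary choice):

import numpy as np

gamma = 0.9
r = np.array([-0.45, -0.1, 0.5, 0.5])
P = np.array([[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]])

def value(pi_table):
    n_s, n_a = pi_table.shape
    M = np.zeros((n_s, n_s * n_a))
    for i in range(n_s):
        M[i, i * n_a:(i + 1) * n_a] = pi_table[i]
    return np.linalg.solve(np.eye(n_s) - gamma * (M @ P), M @ r)

# Two s_0-deterministic policies that agree on s_1.
pi_l = np.array([[1.0, 0.0], [0.4, 0.6]])
pi_u = np.array([[0.0, 1.0], [0.4, 0.6]])
V_l, V_u = value(pi_l), value(pi_u)

# Every policy that agrees with them on s_1 has its value on the segment [V_l, V_u].
for p in np.linspace(0.0, 1.0, 11):
    V = value(np.array([[p, 1.0 - p], [0.4, 0.6]]))
    alpha = np.dot(V - V_l, V_u - V_l) / np.dot(V_u - V_l, V_u - V_l)
    assert np.allclose(V, V_l + alpha * (V_u - V_l), atol=1e-8)   # collinear
    assert -1e-8 <= alpha <= 1 + 1e-8                             # bracketed by V_l, V_u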

Corollary 1. For any set of states s_1, .., s_k ∈ S and any policy π, V^π can be expressed as a convex combination of value functions of {s_1, .., s_k}-deterministic policies. In particular, V is included in the convex hull of the value functions of deterministic policies.

Proof. We prove the result by induction on the number of states k. If k = 1, the result is true by Theorem 1.

Suppose the result is true for k states. Let s_1, .., s_{k+1} ∈ S; we have by assumption that

∃n ∈ N, π_1, .., π_n {s_1, .., s_k}-deterministic policies in P(A)^S, and α_1, .., α_n ∈ [0, 1] with Σ_{i=1}^n α_i = 1, such that V^π = Σ_{i=1}^n α_i V^{π_i}.

However, using Theorem 1, for all i ∈ [1, n] there exist π_{i,l}, π_{i,u} ∈ P(A)^S such that:

π_{i,l}, π_{i,u} are s_{k+1}-deterministic,
π_{i,l}, π_{i,u} agree with π_i on s_1, .., s_k,
∃β_i ∈ [0, 1], V^{π_i} = β_i V^{π_{i,l}} + (1 − β_i) V^{π_{i,u}}.

Therefore

V^π = Σ_{i=1}^n α_i (β_i V^{π_{i,l}} + (1 − β_i) V^{π_{i,u}}),

thus concluding the proof.

Proposition 1. P is a polyhedron in an affine subspace K ⊆ R^n if

(i) P is closed;

(ii) There are k ∈ N hyperplanes H_1, .., H_k in K whose union contains the boundary of P in K: ∂_K P ⊂ ∪_{i=1}^k H_i; and

(iii) For each of these hyperplanes, P ∩ H_i is a polyhedron in H_i.

Proof. We will show the result by induction on the dimension of K. For dim(K) = 1, the proposition is true since P is a polyhedron iff its boundary is a finite number of points. Suppose the proposition is true for dim(K) = n, let us show that it is true for dim(K) = n + 1. We can verify that if P is a polyhedron, then:

• P is closed.

• There is a finite number of hyperplanes covering its boundaries (the boundaries of the half-spaces defining each convex polyhedron composing P ).

• The intersections of P with these hyperplanes are still polyhedra.

Now let us consider the other direction of the implication, i.e. suppose that P is closed, ∂_K P ⊂ ∪_{i=1}^k H_i, and for all i, P ∩ H_i is a polyhedron. We will show that we can express P as a finite union of polyhedra.

Suppose x ∈ P and x ∉ ∪_{i=1}^k H_i. By assumption, we have that x ∈ relint_K(P). We will show that x is in an intersection of closed half-spaces defined by the hyperplanes H_1, .., H_k, and that any other vector in this intersection is also in P (otherwise we would have a contradiction with the boundary assumption). A hyperplane H_i defines two closed half-spaces, denoted by H_i^{+1} and H_i^{−1} (the signs being arbitrary), and the intersections of those half-spaces form a partition of K; therefore:

∃δ ∈ {−1, 1}^k, x ∈ ∩_{i=1}^k H_i^{δ(i)} = P_δ.

By assumption, x ∈ relint_K(P_δ), since we assumed that x ∉ ∪_{i=1}^k H_i. Now suppose there exists y ∈ relint_K(P_δ) s.t. y ∉ P; we have:

∃λ ∈ [0, 1], λx + (1 − λ)y = z ∈ ∂_K P.

However, z ∈ relint_K(P_δ) because relint_K(P_δ) is convex. Therefore z ∉ ∪_{i=1}^k H_i since z ∈ relint_K(P_δ), which gives a contradiction. We thus have either relint_K(P_δ) ⊂ P or relint_K(P_δ) ∩ P = ∅.

Now suppose relint_K(P_δ) ⊂ P and relint_K(P_δ) is nonempty. We have that cl_K(relint_K(P_δ)) = P_δ (Brondsted, 2012, Theorem 3.3) and P is closed, meaning that P_δ ⊂ P. Therefore, we have

∃δ_1, .., δ_j ∈ {−1, 1}^k s.t. P = (∪_{i=1}^k (P ∩ H_i)) ∪ (∪_{i=1}^j P_{δ_i}).

P is thus a finite union of polyhedra, as {P ∩ H_i} are polyhedra by assumption, and {P_{δ_i}} are convex polyhedra by definition.

Corollary 2. Let V^π and V^{π′} be two value functions. Then there exists a sequence of k ≤ |S| policies, π_1, . . . , π_k, such that V^π = V^{π_1}, V^{π′} = V^{π_k}, and for every i ∈ {1, . . . , k − 1}, the set

{f_v(απ_i + (1 − α)π_{i+1}) | α ∈ [0, 1]}

forms a line segment.

Proof. We can define the policies π_2, .., π_{|S|−1} the following way:

∀i ∈ [2, |S| − 1],  π_i(· | s_j) = π′(· | s_j)  if s_j ∈ {s_1, .., s_{i−1}},
                    π_i(· | s_j) = π(· | s_j)    if s_j ∈ {s_i, .., s_{|S|}}.

Therefore, two consecutive policies π_i, π_{i+1} only differ on one state. We can apply Theorem 1 and thus conclude the proof.

Theorem 2. Consider the ensemble Y^π_{s_1,..,s_k} of policies that agree with π on the states S̄ = {s_1, .., s_k}. Suppose that for all s ∉ S̄ and all a ∈ A, there is no π′ ∈ Y^π_{s_1,..,s_k} ∩ D_{s,a} s.t. f_v(π′) = f_v(π). Then f_v(π) has a relative neighborhood in V ∩ H^π_{s_1,..,s_k}.

Proof. We will prove the result by showing that V^π is contained in |S| − k linearly independent segments, by applying the line theorem to a policy π̂ that has the same value function as π. We will then be able to conclude using the regularity of f_v.

We can find a policy π̂ ∈ Y^π_{s_1,..,s_k} that has the same value function as π, by applying Theorem 1 recursively on the states {s_{k+1}, .., s_{|S|}}, such that:

∃a_{k+1,l}, a_{k+1,u}, .., a_{|S|,l}, a_{|S|,u} ∈ A, ∀i ∈ {k + 1, .., |S|},

π̂(a_{i,l}|s_i) = 1 − π̂(a_{i,u}|s_i) = µ̂_i ∈ (0, 1).

Note that µ̂_i ∉ {0, 1} because we assumed that no s-deterministic policy has the same value function as π.

We define µ̂ = (µ̂_{k+1}, ..., µ̂_{|S|}) ∈ (0, 1)^{|S|−k} and the function g : (0, 1)^{|S|−k} → H^π_{s_1,..,s_k} such that g(µ) = f_v(π_µ), with

π_µ(a_{i,l}|s_i) = 1 − π_µ(a_{i,u}|s_i) = µ_i   if i ∈ {k + 1, ..., |S|},
π_µ(· | s_i) = π̂(· | s_i)                       otherwise.

We have that:

1. g is continuously differentiable

2. g(µ̂) = f_v(π̂)

3. ∂g/∂µ_i is non-zero at µ̂ (Lemma 4.iv)

4. ∂g/∂µ_i is along the i-th column of (I − γP^π̂)^{-1} (Lemma 3)

Therefore, the Jacobian of g is invertible at µ̂ since the columns of (I − γP^π̂)^{-1} are independent; hence, by the inverse function theorem, there is a neighborhood of g(µ̂) = f_v(π̂) in the image space, which gives the result.

Corollary 3. Consider a policy π ∈ P(A)^S, the states S̄ = {s_1, .., s_k}, and the ensemble Y^π_{s_1,..,s_k} of policies that agree with π on s_1, .., s_k. Define V^y = f_v(Y^π_{s_1,..,s_k}); then the relative boundary of V^y in H^π_{s_1,..,s_k} is included in the value functions spanned by policies in Y^π_{s_1,..,s_k} that are s-deterministic for s ∉ S̄:

∂V^y ⊂ ∪_{s ∉ S̄} ∪_{a ∈ A} f_v(Y^π_{s_1,..,s_k} ∩ D_{s,a}),

where ∂ refers to the boundary in H^π_{s_1,..,s_k}.

Proof. Let V^π ∈ V^y; from Theorem 2, we have that

V^π ∉ ∪_{i=k+1}^{|S|} ∪_{j=1}^{|A|} f_v(Y^π_{s_1,..,s_k} ∩ D_{s_i,a_j})  ⇒  V^π ∈ relint(V^y),

where relint refers to the relative interior in H^π_{s_1,..,s_k}. Therefore,

∂V^y ⊂ ∪_{i=k+1}^{|S|} ∪_{j=1}^{|A|} f_v(Y^π_{s_1,..,s_k} ∩ D_{s_i,a_j}).

Theorem 3. Consider a policy π ∈ P(A)^S, the states s_1, .., s_k ∈ S, and the ensemble Y^π_{s_1,..,s_k} of policies that agree with π on s_1, .., s_k. Then f_v(Y^π_{s_1,..,s_k}) is a polytope and, in particular, V = f_v(Y^π_∅) is a polytope.

Proof. We prove the result by induction on the number of constrained states k.

If k = |S|, then f_v(Y^π_{s_1,..,s_k}) = {f_v(π)}, which is a polytope.

Suppose that the result is true for k + 1; let us show that it is still true for k. Let π ∈ P(A)^S and s_1, .., s_k ∈ S; define the ensemble Y^π_{s_1,..,s_k} of policies that agree with π on s_1, .., s_k, and V^y = f_v(Y^π_{s_1,..,s_k}). Using Lemma 3, we have that V^y = V ∩ H^π_{s_1,..,s_k}.

Now, using Corollary 3, we have that:

∂V^y ⊂ ∪_{i=k+1}^{|S|} ∪_{j=1}^{|A|} f_v(Y^π_{s_1,..,s_k} ∩ D_{s_i,a_j}) = ∪_{i=k+1}^{|S|} ∪_{j=1}^{|A|} V^y ∩ H_{i,j},

where ∂ refers to the relative boundary in H^π_{s_1,..,s_k}, and H_{i,j} is an affine hyperplane of H^π_{s_1,..,s_k} (Lemma 3). Therefore we have that:

1. V^y = V ∩ H^π_{s_1,..,s_k} is closed, since it is an intersection of two closed sets,

2. ∂V^y ⊂ ∪_{i=k+1}^{|S|} ∪_{j=1}^{|A|} H_{i,j}, a finite union of affine hyperplanes in H^π_{s_1,..,s_k},

3. V^y ∩ H_{i,j} = f_v(Y^π_{s_1,..,s_k} ∩ D_{s_i,a_j}) is a polyhedron (induction assumption).

V^y verifies (i), (ii), (iii) in Proposition 1, therefore it is a polyhedron. Moreover, V^y is bounded since V^y ⊂ V, which is bounded. Therefore, V^y is a polytope.
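Finally, the value polytope figures of the paper can be reproduced from the MDP specifications of Appendix A; a minimal sketch (assuming numpy and matplotlib; the sample size and random seed are arbitrary choices of ours):

import numpy as np
import matplotlib.pyplot as plt

gamma = 0.9
r = np.array([-0.45, -0.1, 0.5, 0.5])
P = np.array([[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]])

def value(pi_table):
    n_s, n_a = pi_table.shape
    M = np.zeros((n_s, n_s * n_a))
    for i in range(n_s):
        M[i, i * n_a:(i + 1) * n_a] = pi_table[i]
    return np.linalg.solve(np.eye(n_s) - gamma * (M @ P), M @ r)

rng = np.random.default_rng(0)
# Values of randomly sampled policies (Dirichlet rows) and of the deterministic policies.
points = np.array([value(rng.dirichlet(np.ones(2), size=2)) for _ in range(5000)])
corners = np.array([value(np.eye(2)[[a0, a1]]) for a0 in range(2) for a1 in range(2)])

plt.scatter(points[:, 0], points[:, 1], s=2, alpha=0.3, label="random policies")
plt.scatter(corners[:, 0], corners[:, 1], c="red", label="deterministic policies")
plt.xlabel("V(s_0)"); plt.ylabel("V(s_1)"); plt.legend(); plt.show()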