The Value Function Polytope in Reinforcement Learning
Robert Dadashi (1), Adrien Ali Taïga (1,2), Nicolas Le Roux (1), Dale Schuurmans (1,3), Marc G. Bellemare (1)

(1) Google Brain (2) Mila, Université de Montréal (3) Department of Computing Science, University of Alberta. Correspondence to: Robert Dadashi <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We establish geometric and topological properties of the space of value functions in finite state-action Markov decision processes. Our main contribution is the characterization of the nature of its shape: a general polytope (Aigner et al., 2010). To demonstrate this result, we exhibit several properties of the structural relationship between policies and value functions, including the line theorem, which shows that the value functions of policies constrained on all but one state describe a line segment. Finally, we use this novel perspective to introduce visualizations to enhance the understanding of the dynamics of reinforcement learning algorithms.

1. Introduction

The notion of value function is central to reinforcement learning (RL). It arises directly in the design of algorithms such as value iteration (Bellman, 1957), policy gradient (Sutton et al., 2000), policy iteration (Howard, 1960), and evolutionary strategies (e.g. Szita & Lőrincz, 2006), which either predict it directly or estimate it from samples, while also seeking to maximize it. The value function is also a useful tool for the analysis of approximation errors (Bertsekas & Tsitsiklis, 1996; Munos, 2003).

In this paper we study the map $\pi \mapsto V^\pi$ from stationary policies, which are typically used to describe the behaviour of RL agents, to their respective value functions. Specifically, we vary $\pi$ over the joint simplex describing all policies and show that the resulting image forms a polytope, albeit one that is possibly self-intersecting and non-convex.

Figure 1. Mapping between policies and value functions.

We provide three results, all based on the notion of "policy agreement", whereby we study the behaviour of the map $\pi \mapsto V^\pi$ as we only allow the policy to vary at a subset of all states.

Line theorem. We show that policies that agree on all but one state generate a line segment within the value function polytope, and that this segment is monotone (all state values increase or decrease along it).

Relationship between faces and semi-deterministic policies. We show that d-dimensional faces of this polytope are mapped one-to-many to policies which behave deterministically in at least d states.

Sub-polytope characterization. We use this result to generalize the line theorem to higher dimensions, and demonstrate that varying a policy along d states generates a d-dimensional sub-polytope.

Although our "line theorem" may not be completely surprising or novel to expert practitioners, we believe we are the first to highlight its existence. In turn, it forms the basis of the other two results, which require additional technical machinery that we develop in this paper, leaning on results from convex analysis and topology.

While our characterization is interesting in and of itself, it also opens up new perspectives on the dynamics of learning algorithms. We use the value polytope to visualize the expected behaviour and pitfalls of common algorithms: value iteration, policy iteration, policy gradient, natural policy gradient (Kakade, 2002), and finally the cross-entropy method (De Boer et al., 2004).

2. Preliminaries

We are in the reinforcement learning setting (Sutton & Barto, 2018). We consider a Markov decision process $M := \langle \mathcal{S}, \mathcal{A}, r, P, \gamma \rangle$ with $\mathcal{S}$ the finite state space, $\mathcal{A}$ the finite action space, $r$ the reward function, $P$ the transition function, and $\gamma$ the discount factor, for which we assume $\gamma \in [0, 1)$. We denote the number of states by $|\mathcal{S}|$ and the number of actions by $|\mathcal{A}|$.

A stationary policy $\pi$ is a mapping from states to distributions over actions; we denote the space of all policies by $\mathcal{P}(\mathcal{A})^{\mathcal{S}}$. Taken with the transition function $P$, a policy $\pi$ defines a state-to-state transition function $P^\pi$:
$$P^\pi(s' \mid s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, P(s' \mid s, a).$$

The value $V^\pi$ is defined as the expected cumulative reward from starting in a particular state and acting according to $\pi$:
$$V^\pi(s) = \mathbb{E}_{P^\pi} \Big[ \sum_{i=0}^{\infty} \gamma^i r(s_i, a_i) \,\Big|\, s_0 = s \Big].$$

The Bellman equation (Bellman, 1957) connects the value function $V^\pi$ at a state $s$ with the value function at the subsequent states when following $\pi$:
$$V^\pi(s) = \mathbb{E}_{P^\pi} \big[ r(s, a) + \gamma V^\pi(s') \big]. \qquad (1)$$

Throughout we will make use of vector notation (e.g. Puterman, 1994). Specifically, we view (with some abuse of notation) $P^\pi$ as an $|\mathcal{S}| \times |\mathcal{S}|$ matrix, $V^\pi$ as an $|\mathcal{S}|$-dimensional vector, and write $r_\pi$ for the vector of expected rewards under $\pi$. In this notation, the Bellman equation for a policy $\pi$ is
$$V^\pi = r_\pi + \gamma P^\pi V^\pi = (I - \gamma P^\pi)^{-1} r_\pi.$$

In this work we study how the value function $V^\pi$ changes as we continuously vary the policy $\pi$. As such, we will find it convenient to also view this value function as the functional
$$f_v : \mathcal{P}(\mathcal{A})^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}, \qquad \pi \mapsto V^\pi = (I - \gamma P^\pi)^{-1} r_\pi.$$
We will use the notation $V^\pi$ when the emphasis is on the vector itself, and $f_v$ when the emphasis is on the mapping from policies to value functions.

Finally, we will use $\leq$ and $<$ for element-wise vector inequalities, and for a function $f : \mathcal{F} \to \mathcal{G}$ and a subset $F \subset \mathcal{F}$ we write $f(F)$ to mean the image of $f$ applied to $F$.
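The closed-form evaluation $f_v(\pi) = (I - \gamma P^\pi)^{-1} r_\pi$ can be made concrete in a few lines of NumPy. The sketch below is our own illustration rather than code from the paper; the two-state MDP (transition tensor P, reward table r) and the policy are arbitrary assumptions chosen only to show the computation.

```python
import numpy as np

def value_function(policy, P, r, gamma):
    """Evaluate f_v(pi) = (I - gamma P^pi)^{-1} r_pi for a finite MDP.

    policy: (S, A) array, each row a distribution over actions.
    P:      (S, A, S) array, P[s, a, s'] = transition probability.
    r:      (S, A) array of expected rewards.
    gamma:  discount factor in [0, 1).
    """
    # State-to-state transition matrix P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a).
    P_pi = np.einsum('sa,sat->st', policy, P)
    # Expected reward vector r_pi(s) = sum_a pi(a|s) r(s,a).
    r_pi = np.einsum('sa,sa->s', policy, r)
    # Solve the linear system (I - gamma P^pi) V = r_pi.
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

# Arbitrary 2-state, 2-action MDP used purely for illustration (not from the paper).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0],
              [1.0, 0.5]])
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])
print(value_function(pi, P, r, gamma=0.9))  # V^pi as an |S|-dimensional vector
```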
2.1. Polytopes in $\mathbb{R}^n$

Central to our work will be the result that the image of the functional $f_v$ applied to the space of policies forms a polytope, possibly non-convex and self-intersecting, with certain structural properties. This section lays down some of the necessary definitions and notation. For a complete overview of the topic, we refer the reader to Grünbaum et al. (1967); Ziegler (2012); Brøndsted (2012).

We begin by characterizing what it means for a subset $P \subseteq \mathbb{R}^n$ to be a convex polytope or polyhedron. In what follows we write $\mathrm{Conv}(x_1, \ldots, x_k)$ to denote the convex hull of the points $x_1, \ldots, x_k$.

Definition 1 (Convex Polytope). $P$ is a convex polytope iff there are $k \in \mathbb{N}$ points $x_1, x_2, \ldots, x_k \in \mathbb{R}^n$ such that $P = \mathrm{Conv}(x_1, \ldots, x_k)$.

Definition 2 (Convex Polyhedron). $P$ is a convex polyhedron iff there are $k \in \mathbb{N}$ half-spaces $\hat{H}_1, \hat{H}_2, \ldots, \hat{H}_k$ whose intersection is $P$, that is
$$P = \bigcap_{i=1}^{k} \hat{H}_i.$$

A celebrated result from convex analysis relates these two definitions: a bounded, convex polyhedron is a convex polytope (Ziegler, 2012).

The next two definitions generalize convex polytopes and polyhedra to non-convex bodies.

Definition 3 (Polytope). A (possibly non-convex) polytope is a finite union of convex polytopes.

Definition 4 (Polyhedron). A (possibly non-convex) polyhedron is a finite union of convex polyhedra.

We will make use of another, recursive characterization based on the notion that the boundaries of a polytope should be "flat" in a topological sense (Klee, 1959).

For an affine subspace $K \subseteq \mathbb{R}^n$, $V_x \subset K$ is a relative neighbourhood of $x$ in $K$ if $x \in V_x$ and $V_x$ is open in $K$. For $P \subset K$, the relative interior of $P$ in $K$, denoted $\mathrm{relint}_K(P)$, is then the set of points in $P$ which have a relative neighbourhood in $K \cap P$. The notion of "open in $K$" is key here: a point that lies on an edge of the unit square does not have a relative neighbourhood in the square, but it has a relative neighbourhood in that edge. The relative boundary $\partial_K P$ is defined as the set of points in $P$ not in the relative interior of $P$, that is
$$\partial_K P = P \setminus \mathrm{relint}_K(P).$$
Finally, we recall that $H \subseteq K$ is a hyperplane if $H$ is an affine subspace of $K$ of dimension $\dim(K) - 1$.

Proposition 1. $P$ is a polyhedron in an affine subspace $K \subseteq \mathbb{R}^n$ if
(i) $P$ is closed;
(ii) there are $k \in \mathbb{N}$ hyperplanes $H_1, \ldots, H_k$ in $K$ whose union contains the boundary of $P$ in $K$: $\partial_K P \subset \bigcup_{i=1}^{k} H_i$; and
(iii) for each of these hyperplanes, $P \cap H_i$ is a polyhedron in $H_i$.

All proofs may be found in the appendix.
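As a concrete illustration of the equivalence between Definitions 1 and 2 (our own addition, not part of the paper), the short sketch below recovers both representations of a small convex polytope with SciPy: the vertex description $\mathrm{Conv}(x_1, \ldots, x_k)$ and the bounding half-spaces whose intersection gives the same set. The unit-square point cloud is an arbitrary choice.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Points whose convex hull is the unit square; (0.5, 0.5) is interior.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
hull = ConvexHull(points)

# Definition 1 (vertex description): the vertices x_1, ..., x_k of Conv(...).
print("vertices:\n", points[hull.vertices])

# Definition 2 (half-space description): each facet gives a half-space
# {x : n . x + b <= 0}; the polytope is the intersection of these half-spaces.
A, b = hull.equations[:, :-1], hull.equations[:, -1]
def inside(x):
    return bool(np.all(A @ x + b <= 1e-9))

print(inside(np.array([0.25, 0.75])))  # True: inside the square
print(inside(np.array([1.50, 0.50])))  # False: outside the square
```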
3. The Space of Value Functions

We now turn to the main object of our study, the space of value functions $\mathcal{V}$. The space of value functions is the set of all value functions that are attained by some policy. As noted earlier, this corresponds to the image of $\mathcal{P}(\mathcal{A})^{\mathcal{S}}$ under the mapping $f_v$:
$$\mathcal{V} = f_v\big(\mathcal{P}(\mathcal{A})^{\mathcal{S}}\big) = \big\{ f_v(\pi) \mid \pi \in \mathcal{P}(\mathcal{A})^{\mathcal{S}} \big\}. \qquad (2)$$

3.1. Basic Shape from Topology

We begin with a first result on how the functional $f_v$ transforms the space of policies into the space of value functions (Figure 1). Recall that
$$f_v(\pi) = (I - \gamma P^\pi)^{-1} r_\pi.$$
Hence $f_v$ is infinitely differentiable everywhere on $\mathcal{P}(\mathcal{A})^{\mathcal{S}}$ (Appendix C). As a warm-up, Figure 2 depicts the space $\mathcal{V}$ corresponding to four 2-state MDPs; each set is made of value functions
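To get a feel for what such a space $\mathcal{V}$ looks like, one can sample many policies of a small MDP and push them through $f_v$, which is essentially how plots in the spirit of Figure 2 can be reproduced. The sketch below is our own illustration, not the authors' plotting code; the 2-state, 2-action MDP (its transitions and rewards) is an arbitrary assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
gamma = 0.9

# Arbitrary 2-state, 2-action MDP, chosen only for illustration.
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.9, 0.1], [0.1, 0.9]]])
r = np.array([[-0.45, -0.10],
              [ 0.50,  0.70]])

def f_v(policy):
    """Map a policy to its value function, V^pi = (I - gamma P^pi)^{-1} r_pi."""
    P_pi = np.einsum('sa,sat->st', policy, P)
    r_pi = np.einsum('sa,sa->s', policy, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Sample policies from the joint simplex: one independent action
# distribution per state (here a single Bernoulli parameter, since |A| = 2).
p = rng.uniform(size=(20000, 2, 1))
policies = np.concatenate([p, 1.0 - p], axis=-1)    # shape (N, |S|, |A|)

values = np.array([f_v(pi) for pi in policies])      # shape (N, |S|)
plt.scatter(values[:, 0], values[:, 1], s=1)
plt.xlabel(r'$V^\pi(s_1)$')
plt.ylabel(r'$V^\pi(s_2)$')
plt.title('Sampled image of $f_v$ for a 2-state MDP')
plt.show()
```

With only two states, each value function is a point in the plane, so the scatter plot directly traces out the (generally non-convex) value polytope discussed in this section.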