Universal Value Function Approximators

Tom Schaul [email protected]
Dan Horgan [email protected]
Karol Gregor [email protected]
David Silver [email protected]
Google DeepMind, 5 New Street Square, EC4A 3TW London

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

Value functions are a core component of reinforcement learning systems. The main idea is to construct a single function approximator V(s; θ) that estimates the long-term reward from any state s, using parameters θ. In this paper we introduce universal value function approximators (UVFAs) V(s, g; θ) that generalise not just over states s but also over goals g. We develop an efficient technique for supervised learning of UVFAs, by factoring observed values into separate embedding vectors for state and goal, and then learning a mapping from s and g to these factored embedding vectors. We show how this technique may be incorporated into a reinforcement learning algorithm that updates the UVFA solely from observed rewards. Finally, we demonstrate that a UVFA can successfully generalise to previously unseen goals.

1. Introduction

Value functions are perhaps the most central idea in reinforcement learning (Sutton & Barto, 1998). The main idea is to cache knowledge in a single function V(s) that represents the utility of any state s in achieving the agent's overall goal or reward function. Storing this knowledge enables the agent to immediately assess and compare the utility of states and/or actions. The value function may be efficiently learned, even from partial trajectories or under off-policy evaluation, by bootstrapping from value estimates at a later state (Precup et al., 2001).

However, value functions may be used to represent knowledge beyond the agent's overall goal. General value functions V_g(s) (Sutton et al., 2011) represent the utility of any state s in achieving a given goal g (e.g. a waypoint), represented by a pseudo-reward function that takes the place of the real rewards in the problem. Each such value function represents a chunk of knowledge about the environment: how to evaluate or control a specific aspect of the environment (e.g. toward a waypoint). A collection of general value functions provides a powerful form of knowledge representation that can be utilised in several ways. For example, the Horde architecture (Sutton et al., 2011) consists of a discrete set of value functions ('demons'), all of which may be learnt simultaneously from a single stream of experience, by bootstrapping off-policy from successive value estimates (Modayil et al., 2014). Each value function may also be used to generate a policy or option, for example by acting greedily with respect to the values, and terminating at goal states. Such a collection of options can be used to provide a temporally abstract action-space for learning or planning (Sutton et al., 1999). Finally, a collection of value functions can be used as a predictive representation of state, where the predicted values themselves are used as a feature vector (Sutton & Tanner, 2005; Schaul & Ring, 2013).

In large problems, the value function is typically represented by a function approximator V(s, θ), such as a linear combination of features or a neural network with parameters θ. The function approximator exploits the structure in the state space to efficiently learn the value of observed states and generalise to the value of similar, unseen states. However, the goal space often contains just as much structure as the state space (Foster & Dayan, 2002). Consider for example the case where the agent's goal is described by a single desired state: it is clear that there is just as much similarity between the value of nearby goals as there is between the value of nearby states. Our main idea is to extend the idea of value function approximation to both states s and goals g, using a universal value function approximator (UVFA¹) V(s, g; θ).

¹Pronounce 'YOU-fah'.

A sufficiently expressive function approximator can in principle identify and exploit structure across both s and g. By universal, we mean that the value function can generalise to any goal g in a set G of possible goals: for example a discrete set of goal states; their power set; a set of continuous goal regions; or a vector representation of arbitrary pseudo-reward functions.

This UVFA effectively represents an infinite Horde of demons that summarizes a whole class of predictions in a single object. Any system that enumerates separate value functions and learns each individually (like the Horde) is hampered in its scalability, as it cannot take advantage of any shared structure (unless the demons share parameters). In contrast, UVFAs can exploit two kinds of structure between goals: similarity encoded a priori in the goal representations g, and the structure in the induced value functions discovered bottom-up. Also, the complexity of UVFA learning does not depend on the number of demons but on the inherent domain complexity. This complexity is larger than standard value function approximation, and representing a UVFA may require a rich function approximator such as a deep neural network.

Learning a UVFA poses special challenges. In general, the agent will only see a small subset of possible combinations of states and goals (s, g), but we would like to generalise in several ways. Even in a supervised learning context, when the true value V_g(s) is provided, this is a challenging regression problem. We introduce a novel factorization approach that decomposes the regression into two stages. We view the data as a sparse table of values that contains one row for each observed state s and one column for each observed goal g, and find a low-rank factorization of the table into state embeddings φ(s) and goal embeddings ψ(g). We then learn non-linear mappings from states s to state embeddings φ(s), and from goals g to goal embeddings ψ(g), using standard regression techniques (e.g. gradient descent on a neural network). In our experiments, this factorized approach learned UVFAs an order of magnitude faster than naive regression.

Finally, we return to reinforcement learning, and provide two algorithms for learning UVFAs directly from rewards. The first algorithm maintains a finite Horde of general value functions V_g(s), and uses these values to seed the table and hence learn a UVFA V(s, g; θ) that generalizes to previously unseen goals. The second algorithm bootstraps directly from the value of the UVFA at successor states. On the Atari game of Ms Pacman, we then demonstrate that UVFAs can scale to larger visual input spaces and different types of goals, and show they generalize across policies for obtaining possible pellets.

2. Background

Consider a Markov Decision Process defined by a set of states s ∈ S, a set of actions a ∈ A, and transition probabilities T(s, a, s') := P(s_{t+1} = s' | s_t = s, a_t = a). For any goal g ∈ G, we define a pseudo-reward function R_g(s, a, s') and a pseudo-discount function γ_g(s). The pseudo-discount γ_g takes the double role of state-dependent discounting, and of soft termination, in the sense that γ_g(s) = 0 if and only if s is a terminal state according to goal g (e.g. the waypoint is reached). For any policy π : S ↦ A and each g ∈ G, and under some technical regularity conditions, we define a general value function² that represents the expected cumulative pseudo-discounted future pseudo-return, i.e.,

V_{g,\pi}(s) := \mathbb{E}\Big[ \sum_{t=0}^{\infty} R_g(s_{t+1}, a_t, s_t) \prod_{k=0}^{t} \gamma_g(s_k) \;\Big|\; s_0 = s \Big],

where the actions are generated according to π, as well as an action-value function

Q_{g,\pi}(s, a) := \mathbb{E}_{s'}\big[ R_g(s, a, s') + \gamma_g(s') \cdot V_{g,\pi}(s') \big].

Any goal admits an optimal policy π*_g(s) := argmax_a Q_{g,π}(s, a), and a corresponding optimal value function V*_g := V_{g,π*_g}. Similarly, Q*_g := Q_{g,π*_g}.

²Aka the value of an active forecast (Schaul & Ring, 2013).

3. Universal Value Function Approximators

Our main idea is to represent a large set of optimal value functions by a single, unified function approximator that generalises over both states and goals. Specifically, we consider function approximators V(s, g; θ) ≈ V*_g(s) or Q(s, a, g; θ) ≈ Q*_g(s, a), parameterized by θ ∈ R^d, that approximate the optimal value function both over a potentially large state space s ∈ S, and also a potentially large goal space g ∈ G.

Figure 1. Diagram of the presented function approximation architectures and training setups. In blue dashed lines, we show the learning targets for the output of each network (cloud). Left: concatenated architecture. Center: two-stream architecture with two separate sub-networks φ and ψ combined at h. Right: decomposed view of the two-stream architecture when trained in two stages, where target embedding vectors are formed by matrix factorization (right sub-diagram) and two embedding networks are trained with those as multi-variate regression targets (left and center sub-diagrams).

Figure 1 schematically depicts possible function approximators: the most direct approach, F : S × G ↦ R, simply concatenates state and goal together as a joint input. The mapping from concatenated input to regression target can then be dealt with by a non-linear function approximator such as a multi-layer perceptron (MLP). A two-stream architecture, on the other hand, assumes that the problem has a factorized structure and computes its output from two components φ : S ↦ R^n and ψ : G ↦ R^n that both map into an n-dimensional vector space of embeddings, and an output function h : R^n × R^n ↦ R that maps two such embedding vectors to a single scalar regression output. We will focus on the case where the mappings φ and ψ are general function approximators, such as MLPs, whereas h is a simple function, such as the dot-product.
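To make the two architectures concrete, here is a minimal NumPy sketch (not the paper's implementation): the concatenated approximator F and the two-stream approximator with a dot-product h. The layer sizes, feature dimensions and helper names are illustrative assumptions.

```python
# Minimal sketch of the two UVFA architectures of Figure 1, assuming states
# and goals are given as fixed-length feature vectors. All layer sizes and
# names are illustrative choices, not the paper's.
import numpy as np

def mlp(x, weights):
    """Tiny MLP: linear layers with ReLUs on hidden layers, linear output."""
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def init_mlp(sizes, rng):
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

rng = np.random.default_rng(0)
state_dim, goal_dim, n = 49, 49, 7      # e.g. 7x7 grid features, embedding rank n

# Concatenated architecture F : S x G -> R
F = init_mlp([state_dim + goal_dim, 64, 1], rng)
def value_concat(s, g):
    return mlp(np.concatenate([s, g]), F)[0]

# Two-stream architecture: phi : S -> R^n, psi : G -> R^n, h = dot product
phi = init_mlp([state_dim, 64, n], rng)
psi = init_mlp([goal_dim, 64, n], rng)
def value_two_stream(s, g):
    return float(mlp(s, phi) @ mlp(g, psi))   # h(a, b) = a . b
```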

The two-stream architecture is capable of exploiting the common structure between states and goals. First, goals are often defined in terms of states, e.g. G ⊆ S. We exploit this property by sharing features between φ and ψ. Specifically, when using MLPs for φ and ψ, the parameters of their first layers may be shared, so that common features are learned for both states and goals. Second, the UVFA may be known to be symmetric (i.e. V*_g(s) = V*_s(g) for all s, g), for example a UVFA that computes the distance between state s and goal g in a reversible environment. This symmetry can be exploited by using the same network φ = ψ, and a symmetric output function h (e.g. dot product). We will refer to these two cases as partially symmetric and symmetric architectures respectively. In particular, when choosing a symmetric architecture with a distance-based h, and G = S, a UVFA will learn an embedding such that small distances according to h imply nearby states, which may be a very useful representation of state (as discussed in Appendix D).³

³Note that in the minimal case of a single goal, both architectures collapse to conventional function approximation, with ψ being a constant multiplier and φ(s) a linear function of the value.

3.1. Supervised Learning of UVFAs

We consider two approaches to learning UVFAs, first using a direct end-to-end training procedure, and second using a two-stage training procedure that exploits the factorised structure of a two-stream function approximator. We first consider the simpler, supervised learning setting, before returning to the reinforcement learning setting in section 5.

The direct approach to learning the parameters θ is by end-to-end training. This is achieved by backpropagating a suitable loss function, such as the mean-squared error (MSE)

\mathbb{E}\big[ \big( V^*_g(s) - V(s, g; \theta) \big)^2 \big],

to compute a descent direction. Parameters θ are then updated in this direction by a variant of stochastic gradient descent (SGD). This approach may be applied to any of the architectures in section 3. For two-stream architectures, we further introduce a two-stage training procedure based on matrix factorization, which proceeds as follows:

• Stage 1: lay out all the values V*_g(s) in a data matrix with one row for each observed state s and one column for each observed goal g, and factorize that matrix⁴, finding a low-rank approximation that defines n-dimensional embedding spaces for both states and goals. We denote by φ̂_s the resulting target embedding vector for the row of s and by ψ̂_g the target embedding vector for the column of g.

• Stage 2: learn the parameters of the two networks φ and ψ via separate multivariate regression toward the target embedding vectors φ̂_s and ψ̂_g from stage 1, respectively (see Figure 1, right).

⁴This factorization can be done with SGD for most choices of h, while the linear case (dot-product) also admits more efficient algorithms like singular value decomposition.

The first stage leverages the well-understood technique of matrix factorisation as a subprocedure. The factorisation identifies idealised row and column embeddings φ̂ and ψ̂ that can accurately reconstruct the observed values, ignoring the actual states s and goals g. The second stage then tries to achieve these idealised embeddings, i.e. it learns how to convert an actual state s into its idealised embedding φ(s) and an actual goal g into its idealised embedding ψ(g). The two stages are given in pseudo-code in lines 17 to 24 of Algorithm 1, below. An optional third stage fine-tunes the joint two-stream architecture with end-to-end training.

In general, the data matrix may be large, sparse and noisy, but there is a rich literature on dealing with these cases; concretely, we used OptSpace (Keshavan et al., 2009) for matrix factorization when the data matrix is available in batch form (and h is the dot-product), and SGD otherwise. As Figure 3 shows, training can be sped up by an order of magnitude when using the two-stage approach, compared to end-to-end training, on a 2-room 7x7 LavaWorld.

Figure 2. Clustering structure of embedding vectors on a one-room LavaWorld, after processing them with t-SNE (Van der Maaten & Hinton, 2008) to map their similarities in 2D. Colors correspond. Left: room layout. Center: state embeddings. Right: goal embeddings. Note how both embeddings recover the cycle and dead-end structure of the original environment.

Figure 3. Comparative performance of a UVFA on LavaWorld (two rooms of 7x7) as a function of training steps for two different architectures: the two-stream architecture with a dot-product on top ('dot') and the concatenated architecture ('concat'), and for two training modes: end-to-end supervised regression ('e2e') and two-stage training ('2stage'). Note that two-stage training is impossible for the concatenated network. Dotted lines indicate minimum and maximum observed values among 10 random seeds. The computational cost of the matrix factorization is orders of magnitude smaller than that of the regression and is omitted in these plots.
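The following is a small sketch of the two-stage procedure under simplifying assumptions: the value table is fully observed, h is the dot-product, a truncated SVD stands in for OptSpace, and a linear least-squares map stands in for the MLP regression of Stage 2. All names and dimensions are illustrative.

```python
# Sketch of two-stage UVFA training with a dot-product h, assuming a fully
# observed value table V of shape (num_states, num_goals). The paper uses
# OptSpace for sparse/noisy matrices; a truncated SVD is a dense stand-in here,
# and plain least squares replaces the MLP regression to keep the example short.
import numpy as np

def two_stage_targets(V, n):
    """Stage 1: rank-n factorisation V ~= Phi_hat @ Psi_hat.T."""
    U, s, Wt = np.linalg.svd(V, full_matrices=False)
    Phi_hat = U[:, :n] * np.sqrt(s[:n])        # target state embeddings (one row per state)
    Psi_hat = Wt[:n, :].T * np.sqrt(s[:n])     # target goal embeddings (one row per goal)
    return Phi_hat, Psi_hat

def fit_linear(X, Y):
    """Stage 2 (linear stand-in for an MLP): least-squares map features -> embeddings."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# Toy data: a rank-5 value table over 50 states and 40 goals with 5-d features.
rng = np.random.default_rng(1)
S_feats = rng.standard_normal((50, 5))
G_feats = rng.standard_normal((40, 5))
V = S_feats @ rng.standard_normal((5, 5)) @ G_feats.T

Phi_hat, Psi_hat = two_stage_targets(V, n=5)
W_phi = fit_linear(S_feats, Phi_hat)                # phi(s) := s_features @ W_phi
W_psi = fit_linear(G_feats, Psi_hat)                # psi(g) := g_features @ W_psi
V_hat = (S_feats @ W_phi) @ (G_feats @ W_psi).T     # reconstructed UVFA values
print("reconstruction MSE:", np.mean((V - V_hat) ** 2))
```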
4. Supervised Learning Experiments

We ran several experiments to investigate the generalisation capabilities of UVFAs. In each case, the scenario is one of supervised learning, where the ground truth values V*_g(s) or Q*_g(s, a) are only given for some training set of pairs (s, g). We trained a UVFA on that data, and evaluated its generalisation capability in two ways. First, we measured the prediction error (MSE) on the value of a held-out set of unseen (s, g) pairs. Second, we measured the policy quality of a value function approximator Q̂(s, a, g; θ), defined as the true expected discounted reward according to its goal g, averaged over all start states, when following the soft-max policy of these values with temperature τ, as compared to doing the same with the optimal value function. A non-zero temperature makes this evaluation criterion change smoothly with respect to the parameters, which gives partial credit to near-perfect policies as in Figure 8.⁵ We normalise the policy quality such that optimal behaviour has a score of 1, and the uniform random policy scores 0.

⁵Compared to ε-greedy, a soft-max also prevents deadly actions (as in LavaWorld) from being selected too often.
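As a concrete reading of this metric, the sketch below computes policy quality for a small tabular problem with known dynamics. The helper names and the exact-policy-evaluation shortcut are our assumptions, not the paper's evaluation code.

```python
# Sketch of the policy-quality measure for a tabular problem with known
# dynamics P[s, a, s'], rewards R[s, a] and discount gamma. Normalised so
# that (the soft-max over) the optimal values scores 1 and the uniform
# random policy scores 0, as described in the text. Names are illustrative.
import numpy as np

def softmax(q, tau):
    z = (q - q.max(axis=-1, keepdims=True)) / tau
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def policy_return(pi, P, R, gamma):
    """Value averaged over all start states of stochastic policy pi[s, a]."""
    nS = P.shape[0]
    P_pi = np.einsum('sa,sap->sp', pi, P)     # transition matrix under pi
    r_pi = np.einsum('sa,sa->s', pi, R)       # expected one-step reward under pi
    v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return v.mean()

def policy_quality(Q_hat, Q_star, P, R, gamma, tau=0.1):
    nS, nA = Q_hat.shape
    ret = policy_return(softmax(Q_hat, tau), P, R, gamma)
    opt = policy_return(softmax(Q_star, tau), P, R, gamma)
    rnd = policy_return(np.full((nS, nA), 1.0 / nA), P, R, gamma)
    return (ret - rnd) / (opt - rnd)
```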

To help provide intuition, we use two small-scale example domains. The first is a classical grid-world with 4 rooms and an action space with the 4 cardinal directions (see Figure 4); the second is LavaWorld (see Figure 2, left), a grid-world with a couple of differences: it contains lava blocks that are deadly when touched, it has multiple rooms, each with a 'door' that teleports the agent to the next room, and the observation features show only the current room.

In this paper we will side-step the thorny issue of where goals come from, and how they are represented; instead, if not mentioned otherwise, we will explore the simple case where goals are states themselves, i.e. G ⊂ S, and entering a goal state is rewarded. The resulting pseudo-discount and pseudo-reward functions can then be defined as:

R_g(s, a, s') = \begin{cases} 1 & \text{if } s' = g \text{ and } \gamma_{ext}(s) \neq 0 \\ 0 & \text{otherwise} \end{cases}
\qquad
\gamma_g(s) = \begin{cases} 0 & \text{if } s = g \\ \gamma_{ext}(s) & \text{otherwise} \end{cases}

where γ_ext is the external discount function.
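These definitions translate directly into code; the sketch below assumes hashable states and an externally supplied discount function gamma_ext, and the helper name is ours.

```python
# Direct transcription of the goal-reaching pseudo-reward and pseudo-discount
# defined above, assuming hashable states and an external discount function
# gamma_ext(s) supplied by the environment.
def make_goal_functions(g, gamma_ext):
    def pseudo_reward(s, a, s_next):
        # 1 on transitions that enter the goal state (within an episode), else 0
        return 1.0 if s_next == g and gamma_ext(s) != 0.0 else 0.0

    def pseudo_discount(s):
        # soft termination: the discount drops to 0 once the goal is reached
        return 0.0 if s == g else gamma_ext(s)

    return pseudo_reward, pseudo_discount
```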

Figure 4. Illustrations. Left: color-coded value function overlaid on the maze structure of the 4-rooms environment, for 5 different goal locations. Middle: reconstruction of these same values, after a rank-3 factorization of the UVFA. Right: overlay for the three factors (embedding dimensions) from the low-rank factorization of the UVFA: the first one appears to correspond to an inside-vs-corners dimension, and the other two appear to jointly encode room identity.

4.1. Tabular Completion

Our initial experiments focus on the simplest case, using a tabular representation of the UVFA. Both states and goals are represented unambiguously by 1-hot unit vectors⁶, and the mappings φ and ψ are identity functions. We investigate how accurately unseen (s, g) pairs can be reconstructed with a low-rank approximation. Figure 4 provides an illustration. We found that the policy quality saturated at an optimal behaviour level already at a low rank, while the value error kept improving (see Figure 5). This suggests that a low-rank approximation can capture much of the structure of a universal value function, at least insofar as its induced policies are concerned. Another illustration is given in Figure 2, showing how the low-rank embeddings can recover the topological structure of policies in LavaWorld.

⁶A single dimension has value 1, all others are zero.

Figure 5. Generalization to unseen values as a function of rank. Left: policy quality. Right: prediction error.

Next, we evaluate how resilient the process is with respect to missing or unreliable data. We represent the data as a sparse matrix M of (s, g) pairs, and apply OptSpace to perform sparse matrix completion (Keshavan et al., 2009) such that M ≈ φ̂ᵀψ̂, reconstructing V(s, g; θ) := φ̂_sᵀ ψ̂_g. Figure 6 shows that the matrix completion procedure provides successful generalisation, and furthermore that policy quality degrades gracefully as less and less value information is provided. See also Figure 13 in the appendix for an illustration of this process.

Figure 6. Generalization to unseen values as a function of sparsity for rank n = 7. The sparsity is the proportion of (s, g) values that are unobserved. Left: policy quality. Right: prediction error.

In summary, the UVFA is learned jointly over states and goals, which lets it infer V(s, g) even if s and g have never before been encountered together; this form of generalization would be inconceivable with a single value function.
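A rough sketch of this tabular-completion setup, covering the missing-data case as well: a masked alternating-least-squares factorization is used as a simple stand-in for OptSpace, and the function name, regularisation and iteration count are illustrative assumptions.

```python
# Sketch of tabular completion: fill an incompletely observed value table with
# a rank-n factorisation. Alternating least squares is a simple stand-in for
# OptSpace (Keshavan et al., 2009); hyperparameters are illustrative.
import numpy as np

def complete_table(V_obs, mask, n, iters=100, reg=1e-3):
    """V_obs: (S, G) table; mask: True where V_obs was observed; n: rank."""
    S, G = V_obs.shape
    rng = np.random.default_rng(0)
    Phi = rng.standard_normal((S, n)) * 0.1    # state embeddings (rows)
    Psi = rng.standard_normal((G, n)) * 0.1    # goal embeddings (columns)
    eye = reg * np.eye(n)
    for _ in range(iters):
        for i in range(S):                     # update each state embedding
            cols = mask[i]
            A = Psi[cols].T @ Psi[cols] + eye
            Phi[i] = np.linalg.solve(A, Psi[cols].T @ V_obs[i, cols])
        for j in range(G):                     # update each goal embedding
            rows = mask[:, j]
            A = Phi[rows].T @ Phi[rows] + eye
            Psi[j] = np.linalg.solve(A, Phi[rows].T @ V_obs[rows, j])
    return Phi @ Psi.T                         # reconstructed V(s, g) for all pairs
```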

4.2. Interpolation

One use case for UVFAs is the continual learning setting, when the set of goals under consideration expands over time (Ring, 1994). Ideally, a UVFA trained to estimate the values on a training set of goals G_T should give reasonable estimates on new, never-seen goals from a test set G_V, if there is structure in the space of all goals that the mapping function ψ can capture. We investigate this form of generalization from a training set to a test set of goals in the same supervised setup as before, but with non-tabular function approximators.

Concretely, we represent states in the LavaWorld by a grid of pixels for the currently observed room. Each pixel is a binary vector indicating the presence of the agent, lava, empty space, or door respectively. We represent goals as a desired state grid, also represented in terms of pixels, i.e. G ⊂ S. The data matrix M is constructed from all states, but only the goals in the training set (half of all possible states, randomly selected); a separate three-layer MLP is used for φ and ψ, training follows our proposed two-stage approach (lines 17 to 24 in Algorithm 1 below; see also Section 3.1 and Appendix B), and we use a small rank of n = 7 that provides sufficient training performance (i.e., 90% policy quality, see Figure 5). Figure 7 summarizes the results, showing that it is possible to interpolate the value function to a useful level of quality on the test set of goals (the remaining half of G). Furthermore, transfer learning is straightforward, by post-training the UVFA on goals outside its training set: Figure 12 shows that this leads to very quick learning, as compared to training a UVFA with the same architecture from scratch.

Figure 7. Policy quality (top) and prediction error (bottom) of the UVFA learned by the two-stream architecture (with two-stage training), as a function of training samples (log-scale), on a 7x7 LavaWorld with 1 room. Left: on the training set of goals. Right: on the test set of goals.

4.3. Extrapolation

The interpolation between goals is feasible because similar goals are represented with similar features, and because the training set is broad enough to cover the different parts of goal space, in our case because we took a random subset of G. More challengingly, we may still be able to generalize to unseen goals in completely new parts of goal space; in particular, if states are represented with the same features as goals, and states in these new parts of space have already been encountered, then a partially symmetric architecture (see section 3) allows the knowledge transfer from φ to ψ.

We conduct a concrete experiment to show the feasibility of the idea, on the 4-room environment, where the training set now includes only goals in three rooms, and the test set contains the goals in the fourth room. Figure 8 shows the resulting generalization of the UVFA, which does indeed extrapolate to induce a policy that is likely to reach those test goals, with a policy quality of 24% ± 8%.

Figure 8. A UVFA trained only on goals in the first 3 rooms of the maze generalizes such that it can achieve goals in the fourth room as well. Above: the learned UVFA for 3 test goals (marked as red dots) in the fourth room. The color encodes the value function, and the arrows indicate the actions of the greedy policy. Below: ground truth for the same 3 test goals.

Algorithm 1 UVFA learning from Horde targets

1: Input: rank n, training goals G_T, budgets b1, b2, b3
2: Initialise transition history H
3: for t = 1 to b1 do
4:   H ← H ∪ (s_t, a_t, γ_ext, s_{t+1})
5: end for
6: for i = 1 to b2 do
7:   Pick a random transition t from H
8:   Pick a random goal g from G_T
9:   Update Q_g given transition t
10: end for
11: Initialise data matrix M
12: for (s_t, a_t, γ_ext, s_{t+1}) in H do
13:   for g in G_T do
14:     M_{t,g} ← Q_g(s_t, a_t)
15:   end for
16: end for
17: Compute rank-n factorisation M ≈ φ̂ᵀψ̂
18: Initialise embedding networks φ and ψ
19: for i = 1 to b3 do
20:   Pick a random transition t from H
21:   Do regression update of φ(s_t, a_t) toward φ̂_t
22:   Pick a random goal g from G_T
23:   Do regression update of ψ(g) toward ψ̂_g
24: end for
25: return Q(s, a, g) := h(φ(s, a), ψ(g))

5. Reinforcement Learning Experiments

Having established that UVFAs can generalize across goals in a supervised learning setting, we now turn to a more realistic reinforcement learning scenario, where only a stream of observations, actions and rewards is available, but there are no ground-truth target values. Also, fully observable states are no longer assumed, as observations could be noisy or aliased. We consider two approaches: firstly relying on Horde to provide targets to the UVFA, and secondly bootstrapping from the UVFA itself. Given the results from Section 4 we retain a two-stream architecture throughout, with relatively small embedding dimensions n.

5.1. Generalizing from Horde

One way to incorporate UVFAs into a reinforcement learning agent is to make use of the Horde architecture (Sutton et al., 2011). Each demon in the Horde approximates the value function for a single, specific goal. These values are learnt off-policy, in parallel, from a shared stream of interactions with the environment. The key new ingredient is to seed a data matrix with values from the Horde, and use the two-stream factorization to build a UVFA that generalises to other goals than those learned directly by the Horde.

Algorithm 1 provides pseudocode for this approach. Each demon learns a specific value function Q_g(s, a) for its goal g (lines 2-10), off-policy, using the Horde architecture. After processing a certain number of transitions, we build a data matrix M from their estimates, where each column g corresponds to one goal/demon, and each row t corresponds to the time-index of one transition in the history, from which we then produce target embeddings φ̂_t for each time-step and ψ̂_g for each goal (lines 11-17, as in section 3.1), and in the following stage the two networks φ and ψ are trained by regression to match these (lines 18-24).

The performance of this approach now depends on two factors: the amount of experience the Horde has accumulated, which directly affects the quality of the targets used in the matrix factorization; and the amount of computation used to build the UVFA from this data (i.e. training the embeddings φ(s) and ψ(g)).

This reinforcement learning approach is more challenging than the simple supervised learning setting that we addressed in the previous section. One complication is that the data itself depends on the way in which the behaviour policy explores the environment. In many cases, we do not see much data relevant to goals that we might like to learn about (in the Horde), or generalise to (in the UVFA). Nevertheless, we found that this approach is sufficiently effective. Figure 9 visualizes performance results along both these dimensions, showing that there is a tipping point in the quantity of experience after which the UVFA approximates the ground truth reasonably well; training iterations, on the other hand, lead to smooth improvement. As section 4.2 promised, the UVFA also generalizes to those goals for which no demon was trained.

Figure 9. A UVFA with embedding dimension n = 6, learned with a horde of 12 demons, on a 3-room LavaWorld of size 7 × 7, following Algorithm 1. Heatmaps of prediction error with (log) data samples on the vertical axis and (log) learning updates on the horizontal axis: warmer is better. Left: learned UVFA performance on the 12 training goals estimated by the demons. Right: UVFA generalization to a test set of 25 unseen goals. Given the difficulty of data imbalance, the prediction error is adequate on the training set (MSE of 0.6) and generally better on the test set (MSE of 0.4), even though we already see an overfitting trend toward the larger number of updates.
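A sketch of the matrix-seeding step (lines 11 to 17 of Algorithm 1), assuming the transition history is a list of tuples and each demon's value estimate is available as a callable; `two_stage_targets` refers to the illustrative factorization helper sketched in Section 3.1, not to code from the paper.

```python
# Sketch of seeding the data matrix from Horde demons (Algorithm 1, lines 11-17),
# assuming `history` is a list of transitions (s, a, gamma_ext, s_next) and
# `demons[g]` is a callable returning the demon's estimate Q_g(s, a).
import numpy as np

def seed_data_matrix(history, demons, training_goals):
    M = np.zeros((len(history), len(training_goals)))
    for t, (s, a, gamma_ext, s_next) in enumerate(history):
        for j, g in enumerate(training_goals):
            M[t, j] = demons[g](s, a)   # one row per transition, one column per goal
    return M

# Usage sketch (helpers from the earlier illustrative code, not the paper's):
# M = seed_data_matrix(history, demons, training_goals)
# Phi_hat, Psi_hat = two_stage_targets(M, n)     # rank-n factorisation (line 17)
# ...then regress phi(s_t, a_t) toward Phi_hat[t] and psi(g) toward Psi_hat[j].
```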

Figure 11. Policy quality after learning with direct bootstrapping from a stream of transitions using Equation 1, as a function of the fraction of possible training pairs (s, g) that were held out, on a 2-room LavaWorld of size 10 × 10, after 10⁴ updates.

5.2. Ms Pacman

Scaling up, we applied our approach to learn a UVFA for the Atari 2600 game Ms. Pacman. We used a hand-crafted goal space G: for each pellet on the screen, we defined eating it as an individual goal g ∈ R², represented by the pellet's (x, y) coordinate on-screen. Following Algorithm 1, a Horde with 150 demons was trained. Each demon processed the visual input directly from the screen (see Appendix C for further experimental details). A subset of 29 'training' demons was used to seed the matrix and train a UVFA. We then investigated generalization performance on 5 unseen goal locations g, corresponding to 5 of the 150 pellet locations, randomly selected after excluding the 29 training locations. Figure 10 visualises the value functions of the UVFA at these 5 pellet locations, compared to the value functions learned directly by a larger Horde. These results demonstrate that a UVFA learned from a small Horde can successfully approximate the knowledge of a much larger Horde.

Figure 10. UVFA results on Ms Pacman. Top left: locations of the 29 training set pellet goals highlighted in green. Top right: locations of 120 test set pellet goals in pink. Center row: color-coded value functions for 5 test set pellet goals, as learned by the Horde. Bottom row: value functions predicted by a UVFA, trained only from the 29 training set demons, as predicted on the same 5 test set goals. The wrapping of values between far left and far right is due to hidden passages in the game map.

Preliminary, non-quantitative experiments further indicate that representing multiple value functions in a single UVFA reduces noise from the individual value functions, and that the resulting policy quality is decent.

5.3. Direct Bootstrapping

Instead of a detour via an explicit Horde, it is also possible to train a UVFA end-to-end with direct bootstrapping updates using a variant of Q-learning:

Q(s_t, a_t, g) \leftarrow \alpha \Big( r_g + \gamma_g(s_{t+1}) \max_{a'} Q(s_{t+1}, a', g) \Big) + (1 - \alpha)\, Q(s_t, a_t, g) \qquad (1)

applied at a randomly selected transition (s_t, a_t, s_{t+1}) and goal g, where r_g = R_g(s_t, a_t, s_{t+1}).

When bootstrapping with function approximation at the same time as generalizing over goals, the learning process is prone to become unstable, which can sometimes be addressed by much smaller learning rates (and slowed convergence), or by using a more well-behaved h function (at the cost of generality). In particular, here we exploit our prior knowledge that our pseudo-reward functions are bounded between 0 and 1, and use a distance-based h(a, b) = γ^{‖a−b‖₂}, where ‖·‖₂ denotes the Euclidean distance between embedding vectors. Figure 11 shows generalization performance after learning when only a fraction of state-goal pairs is present (as in Section 4.1). We do not recover 100% policy quality with this setup, yet the UVFA generalises well, as long as it is trained on about a quarter of the possible (s, g) pairs.
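A minimal sketch of the update in Equation 1 for a tabular Q indexed by (s, a, g), together with the distance-based output function described above; the variable names and the tabular simplification are our assumptions.

```python
# Sketch of the direct bootstrapping update of Equation (1) for a tabular
# Q[s, a, g], plus the distance-based output function that keeps values in
# (0, 1], matching the bounded pseudo-rewards. Names are illustrative.
import numpy as np

def q_learning_update(Q, transition, g, pseudo_reward, pseudo_discount, alpha):
    s, a, s_next = transition
    r_g = pseudo_reward(s, a, s_next)
    target = r_g + pseudo_discount(s_next) * Q[s_next, :, g].max()
    Q[s, a, g] = alpha * target + (1.0 - alpha) * Q[s, a, g]

def h_distance(phi_sa, psi_g, gamma):
    # h(a, b) = gamma ** ||a - b||_2 : small embedding distance means high value
    return gamma ** np.linalg.norm(phi_sa - psi_g)
```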

6. Related Work

A number of previous methods have focused on generalising across tasks rather than goals (Caruana, 1997). The distinction is that a task may have different MDP dynamics, whereas a goal only changes the reward function but not the transition structure of an MDP. These methods have almost exclusively been explored in the context of policy search, and fall into two classes: methods that combine local policies into a single situation-sensitive controller, for example parameterized skills (Da Silva et al., 2012) or parameterized motor primitives (Kober et al., 2012); and methods that directly parameterize a policy by the task as well as the state, e.g. using model-based RL (Deisenroth et al., 2014). In contrast, our approach is model-free, value-based, and exploits the special shared structure inherent to states and goals, which may not be present in arbitrary tasks. Furthermore, like the Horde (Sutton et al., 2011), but unlike policy search methods, the UVFA does not need to see complete separate episodes for each task or goal, as all goals can in principle be learnt by off-policy bootstrapping from any behaviour policy (see section 5.3). A different line of work has explored generalizing across value functions in a relational context (van Otterlo, 2009).

Perhaps the closest prior work to our own is the fragment-based approach of Foster and Dayan (Foster & Dayan, 2002). This work also explored shared structure in value functions across many different goals. Their mechanism used a specific mixture-of-Gaussians approach that can be viewed as a probabilistic UVFA where the mean is represented by a mixture-of-constants. The mixture components were learned in an unsupervised fashion, and those mixtures were then recombined to give specific values. However, the mixture-of-constants representation was found to be limited, and was augmented in their experiments by a tabular representation of states and goals. In contrast, we consider general function approximators that can represent the joint structure in a much more flexible and powerful way, for example by exploiting the representational capabilities of deep neural networks.

7. Discussion

This paper has developed a universal approximator for goal-directed knowledge. We have demonstrated that our UVFA model is learnable either from supervised targets, or directly from real experience, and that it generalises effectively to unseen goals. We conclude by discussing several ways in which UVFAs may be used.

First, UVFAs can be used for transfer learning to new tasks with the same dynamics but different goals. Specifically, the values V(s, g; θ) in a UVFA can be used to initialise a new, single value function V_g(s) for a new task with unseen goal g. Figure 12 demonstrates that an agent which starts from transferred values in this fashion can learn to solve the new task g considerably faster than with random value initialization.

Figure 12. Policy quality (left) and prediction error (right) as a function of additional training on the test set of goals for the 3-room LavaWorld environment of size 7x7 and n = 15. 'Baseline' denotes training on the test set from scratch (end-to-end), and 'transfer' is the same setup, but with the UVFA initialised by training on the training set. We find that embeddings learned on the training set are helpful, and permit the system to attain good performance on the test set with just 1000 samples, rather than 100000 samples when training from scratch.

Second, generalized value functions can be used as features to represent state (Schaul & Ring, 2013); this is a form of predictive representation (Singh et al., 2003). A UVFA compresses a large or infinite number of predictions into a short feature vector. Specifically, the state embedding φ(s) can be used as a feature vector that represents state s. Furthermore, the goal embedding ψ(g) can be used as a separate feature vector that represents goal g. Figure 2 illustrated that these features can capture non-trivial structure in the domain.

Third, a UVFA could be used to generate temporally abstract options (Sutton et al., 1999). For any goal g ∈ G a corresponding option may be constructed that acts (soft-)greedily with respect to V(s, g; θ) and terminates e.g. upon reaching g. The UVFA then effectively provides a universal option that represents (approximately) optimal behaviour towards any goal g ∈ G. This in turn allows a hierarchical policy to choose any goal g ∈ G as a temporally abstract action (Kaelbling, 1993).
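A minimal sketch of this option construction, assuming a UVFA exposed as a callable `uvfa_q(s, a, g)` and a discrete action set; the plainly greedy (rather than soft-greedy) choice and all names are illustrative, not the paper's code.

```python
# Sketch of building a goal-directed option from a UVFA: act greedily with
# respect to Q(s, a, g; theta) and terminate upon reaching g.
import numpy as np

def make_goal_option(uvfa_q, g, num_actions):
    def policy(s):
        q = np.array([uvfa_q(s, a, g) for a in range(num_actions)])
        return int(q.argmax())        # greedy action toward goal g

    def termination(s):
        return s == g                 # terminate once the goal is reached

    return policy, termination
```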
Finally, a UVFA could also be used as a universal option model (Yao et al., 2014). Specifically, if pseudo-rewards are defined by goal achievement (as in section 4), then V(s, g; θ) approximates the (discounted) probability of reaching g from s, under a policy that tries to reach it.

Acknowledgments

We thank Theophane Weber, Hado van Hasselt, Thomas Degris, Arthur Guez and Adrià Puigdomènech for their many helpful comments, and the anonymous reviewers for their constructive feedback.

References

Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. arXiv preprint arXiv:1207.4708, 2012.

Caruana, Rich. Multitask learning. Machine Learning, 28(1):41-75, 1997.

Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clément. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011a.

Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493-2537, 2011b.

Da Silva, Bruno, Konidaris, George, and Barto, Andrew. Learning parameterized skills. arXiv preprint arXiv:1206.6398, 2012.

Deisenroth, Marc Peter, Englert, Peter, Peters, Jan, and Fox, Dieter. Multi-task policy search for robotics. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pp. 3876-3881. IEEE, 2014.

Foster, David and Dayan, Peter. Structure in the space of value functions. Machine Learning, 49(2-3):325-346, 2002.

Kaelbling, Leslie Pack. Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, volume 951, pp. 167-173, 1993.

Keshavan, Raghunandan H., Oh, Sewoong, and Montanari, Andrea. Matrix completion from a few entries. CoRR, abs/0901.3150, 2009.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Kober, Jens, Wilhelm, Andreas, Oztop, Erhan, and Peters, Jan. Reinforcement learning to adjust parametrized motor primitives to new situations. Autonomous Robots, 33(4):361-379, 2012.

Konidaris, George, Scheidwasser, Ilya, and Barto, Andrew G. Transfer in reinforcement learning via shared features. The Journal of Machine Learning Research, 13(1):1333-1371, 2012.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S., and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Modayil, Joseph, White, Adam, and Sutton, Richard S. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior, 22(2):146-160, 2014.

Precup, Doina, Sutton, Richard S., and Dasgupta, Sanjoy. Off-policy temporal-difference learning with function approximation. In ICML, pp. 417-424. Citeseer, 2001.

Ring, Mark Bishop. Continual Learning in Reinforcement Environments. PhD thesis, University of Texas at Austin, 1994.

Schaul, Tom and Ring, Mark. Better generalization with forecasts. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 1656-1662. AAAI Press, 2013.

Singh, Satinder, Littman, Michael L., Jong, Nicholas K., Pardoe, David, and Stone, Peter. Learning predictive state representations. In ICML, pp. 712-719, 2003.

Sutton, Richard S. and Barto, Andrew G. Introduction to Reinforcement Learning. MIT Press, 1998.

Sutton, Richard S. and Tanner, Brian. Temporal-difference networks. In Saul, L.K., Weiss, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 17, pp. 1377-1384. MIT Press, 2005.

Sutton, Richard S., Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181-211, 1999.

Sutton, Richard S., Modayil, Joseph, Delp, Michael, Degris, Thomas, Pilarski, Patrick M., White, Adam, and Precup, Doina. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pp. 761-768, 2011.

Van der Maaten, Laurens and Hinton, Geoffrey. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(2579-2605):85, 2008.

van Otterlo, Martijn. The Logic of Adaptive Behavior: Knowledge Representation and Algorithms for Adaptive Sequential Decision Making under Uncertainty in First-Order and Relational Domains, volume 192. IOS Press, 2009.

Yao, Hengshuai, Szepesvári, Csaba, Sutton, Richard S., Modayil, Joseph, and Bhatnagar, Shalabh. Universal option models. In Advances in Neural Information Processing Systems, pp. 990-998, 2014.