Deep Reinforcement Learning in Large Discrete Action Spaces
Gabriel Dulac-Arnold*, Richard Evans*, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, Ben Coppin
Google DeepMind
*Equal contribution.
arXiv:1512.07679v2 [cs.AI], 4 Apr 2016

Abstract

Being able to reason in an environment with a large number of discrete actions is essential to bringing reinforcement learning to a larger class of problems. Recommender systems, industrial plants and language models are only some of the many real-world tasks involving large numbers of discrete actions for which current methods are difficult or even often impossible to apply.

An ability to generalize over the set of actions as well as sub-linear complexity relative to the size of the set are both necessary to handle such tasks. Current approaches are not able to provide both of these, which motivates the work in this paper. Our proposed approach leverages prior information about the actions to embed them in a continuous space upon which it can generalize. Additionally, approximate nearest-neighbor methods allow for logarithmic-time lookup complexity relative to the number of actions, which is necessary for time-wise tractable training. This combined approach allows reinforcement learning methods to be applied to large-scale learning problems previously intractable with current methods. We demonstrate our algorithm's abilities on a series of tasks having up to one million actions.

1. Introduction

Advanced AI systems will likely need to reason with a large number of possible actions at every step. Recommender systems used in large systems such as YouTube and Amazon must reason about hundreds of millions of items every second, and control systems for large industrial processes may have millions of possible actions that can be applied at every time step. All of these systems are fundamentally reinforcement learning (Sutton & Barto, 1998) problems, but current algorithms are difficult or impossible to apply.

In this paper, we present a new policy architecture which operates efficiently with a large number of actions. We achieve this by leveraging prior information about the actions to embed them in a continuous space upon which the actor can generalize. This embedding also allows the policy's complexity to be decoupled from the cardinality of our action set. Our policy produces a continuous action within this space, and then uses an approximate nearest-neighbor search to find the set of closest discrete actions in logarithmic time. We can either apply the closest action in this set directly to the environment, or fine-tune this selection by selecting the highest-valued action in this set relative to a cost function. This approach allows for generalization over the action set in logarithmic time, which is necessary for making both learning and acting tractable in time.

We begin by describing our problem space and then detail our policy architecture, demonstrating how we can train it using policy gradient methods in an actor-critic framework. We demonstrate the effectiveness of our policy on various tasks with up to one million actions, but with the intent that our approach could scale well beyond millions of actions.

2. Definitions

We consider a Markov Decision Process (MDP) where $\mathcal{A}$ is the set of discrete actions, $\mathcal{S}$ is the set of discrete states, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1]$ is a discount factor for future rewards. Each action $a \in \mathcal{A}$ corresponds to an $n$-dimensional vector, such that $a \in \mathbb{R}^n$. This vector provides information related to the action. In the same manner, each state $s \in \mathcal{S}$ is a vector $s \in \mathbb{R}^m$.

The return of an episode in the MDP is the discounted sum of rewards received by the agent during that episode: $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$. The goal of RL is to learn a policy $\pi : \mathcal{S} \to \mathcal{A}$ which maximizes the expected return over all episodes, $\mathbb{E}[R_1]$. The state-action value function $Q^\pi(s, a) = \mathbb{E}[R_1 \mid s_1 = s, a_1 = a, \pi]$ is the expected return starting from a given state $s$ and taking an action $a$, following $\pi$ thereafter. $Q^\pi$ can be expressed in a recursive manner using the Bellman equation:

$$Q^\pi(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q^\pi(s', \pi(s')).$$

In this paper, both $Q$ and $\pi$ are approximated by parametrized functions.

3. Problem Description

There are two primary families of policies often used in RL systems: value-based, and actor-based policies. For value-based policies, the policy's decisions are directly conditioned on the value function. One of the more common examples is a policy that is greedy relative to the value function:

$$\pi_Q(s) = \arg\max_{a \in \mathcal{A}} Q(s, a). \quad (1)$$

In the common case that the value function is a parameterized function which takes both state and action as input, $|\mathcal{A}|$ evaluations are necessary to choose an action. This quickly becomes intractable, especially if the parameterized function is costly to evaluate, as is the case with deep neural networks. This approach does, however, have the desirable property of being capable of generalizing over actions when using a smooth function approximator. If $a_i$ and $a_j$ are similar, learning about $a_i$ will also inform us about $a_j$. Not only does this make learning more efficient, it also allows value-based policies to use the action features to reason about previously unseen actions. Unfortunately, execution complexity grows linearly with $|\mathcal{A}|$, which renders this approach intractable when the number of actions grows significantly.
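For concreteness, the sketch below spells out the exhaustive arg max of Equation (1). It is illustrative only and not from the paper: the bilinear critic and all dimensions are assumed stand-ins for a learned deep Q-network, but the loop over every action is exactly the linear-in-$|\mathcal{A}|$ cost discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stand-in for a learned critic Q(s, a); in practice this would be
# a deep neural network, which is exactly why |A| evaluations per step hurt.
M, N = 8, 4                                  # illustrative state / action-embedding dims
W = rng.standard_normal((M, N))

def q_value(state, action_vec):
    return float(state @ W @ action_vec)     # toy bilinear critic

def greedy_policy(state, action_embeddings):
    """Equation (1): pi_Q(s) = argmax_{a in A} Q(s, a), by exhaustive evaluation."""
    values = [q_value(state, a) for a in action_embeddings]   # |A| critic calls
    return int(np.argmax(values))

actions = rng.standard_normal((1_000, N))    # |A| = 1,000 toy action vectors in R^n
state = rng.standard_normal(M)
best_index = greedy_policy(state, actions)   # cost grows linearly with |A|
```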
In a standard actor-critic approach, the policy is explicitly defined by a parameterized actor function $\pi_\theta : \mathcal{S} \to \mathcal{A}$. In practice $\pi_\theta$ is often a classifier-like function approximator, which scales linearly in relation to the number of actions. However, actor-based architectures avoid the computational cost of evaluating a likely costly Q-function on every action in the arg max in Equation (1). Nevertheless, actor-based approaches do not generalize over the action space as naturally as value-based approaches, and cannot extend to previously unseen actions.

Sub-linear complexity relative to the action space and an ability to generalize over actions are both necessary to handle the tasks we interest ourselves with. Current approaches are not able to provide both of these, which motivates the approach described in this paper.

4. Proposed Approach

We propose a new policy architecture which we call the Wolpertinger architecture. This architecture avoids the heavy cost of evaluating all actions while retaining generalization over actions. This policy builds upon the actor-critic (Sutton & Barto, 1998) framework. We define both an efficient action-generating actor, and utilize the critic to refine our actor's choices for the full policy. We use multi-layer neural networks as function approximators for both our actor and critic functions. We train this policy using Deep Deterministic Policy Gradient (Lillicrap et al., 2015). The Wolpertinger policy's algorithm is described fully in Algorithm 1 and illustrated in Figure 1. We will detail these in the following sections.

Algorithm 1 Wolpertinger Policy
    State $s$ previously received from environment.
    $\hat{a} = f_{\theta^\pi}(s)$   {Receive proto-action from actor.}
    $\mathcal{A}_k = g_k(\hat{a})$   {Retrieve $k$ approximately closest actions.}
    $a = \arg\max_{a_j \in \mathcal{A}_k} Q_{\theta^Q}(s, a_j)$
    Apply $a$ to environment; receive $r$, $s'$.

[Figure 1. Wolpertinger Architecture: state, actor, proto-action, K-NN over the action embedding, critic, arg max, action.]
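To make the control flow of Algorithm 1 concrete, here is a minimal sketch assuming toy linear stand-ins for the trained actor and critic, and an exact nearest-neighbor search in place of the approximate method the paper relies on for logarithmic-time lookup; all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed stand-ins: in the paper the actor and critic are multi-layer
# neural networks trained with DDPG; here they are fixed linear maps.
M, N, NUM_ACTIONS, K = 8, 4, 1_000, 10
actor_W = rng.standard_normal((N, M))                       # f_theta_pi: R^m -> R^n
critic_W = rng.standard_normal((M, N))                      # Q_theta_Q(s, a) = s^T W a
action_embeddings = rng.standard_normal((NUM_ACTIONS, N))   # the set A, embedded in R^n

def actor(state):
    """Proto-action a_hat = f_theta_pi(s); generally not a valid action in A."""
    return actor_W @ state

def g_k(proto_action, k=K):
    """Indices of the k closest actions by L2 distance (exact search here;
    the paper uses an approximate nearest-neighbor method for log-time lookup)."""
    dists = np.linalg.norm(action_embeddings - proto_action, axis=1)
    return np.argpartition(dists, k)[:k]

def critic(state, action_vec):
    return float(state @ critic_W @ action_vec)

def wolpertinger_action(state):
    """Algorithm 1: proto-action -> k candidates -> critic arg max."""
    a_hat = actor(state)
    candidates = g_k(a_hat)
    values = [critic(state, action_embeddings[i]) for i in candidates]  # only k critic calls
    return int(candidates[int(np.argmax(values))])

state = rng.standard_normal(M)
chosen = wolpertinger_action(state)   # apply action_embeddings[chosen] to the environment
```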
4.1. Action Generation

Our architecture reasons over actions within a continuous space $\mathbb{R}^n$, and then maps this output to the discrete action set $\mathcal{A}$. We will first define:

$$f_{\theta^\pi} : \mathcal{S} \to \mathbb{R}^n, \qquad f_{\theta^\pi}(s) = \hat{a}.$$

$f_{\theta^\pi}$ is a function parametrized by $\theta^\pi$, mapping from the state representation space $\mathbb{R}^m$ to the action representation space $\mathbb{R}^n$. This function provides a proto-action in $\mathbb{R}^n$ for a given state, which will likely not be a valid action, i.e. it is likely that $\hat{a} \notin \mathcal{A}$. Therefore, we need to be able to map from $\hat{a}$ to an element in $\mathcal{A}$. We can do this with:

$$g_k : \mathbb{R}^n \to \mathcal{A}^k, \qquad g_k(\hat{a}) = \arg\min^{k}_{a \in \mathcal{A}} \; |a - \hat{a}|_2.$$

$g_k$ is a $k$-nearest-neighbor mapping from a continuous space to a discrete set. It returns the $k$ actions in $\mathcal{A}$ that are closest to $\hat{a}$ by $L_2$ distance.

As we demonstrate in Section 7, this second pass, in which the critic selects the highest-valued of the $k$ returned actions, makes our algorithm significantly more robust to imperfections in the choice of action representation, and is essential in making our system learn in certain domains. The size of the generated action set, $k$, is task specific, and allows for an explicit trade-off between policy quality and speed.

4.3. Training with Policy Gradient

Although the architecture of our policy is not fully differentiable, we argue that we can nevertheless train our policy by following the policy gradient of $f_{\theta^\pi}$. We will first consider the training of a simpler policy, one defined only as $\tilde{\pi}_\theta = g \circ f_{\theta^\pi}$. In this initial case we can consider that the policy is $f_{\theta^\pi}$ and that the effects of $g$ are a deterministic aspect of the environment. This allows us to maintain a standard policy gradient approach to train $f_{\theta^\pi}$ on its output $\hat{a}$, effectively interpreting the effects of $g$ as environmental dynamics.
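As an illustration only, the following sketch shows one such policy-gradient step under very simple, assumed forms (a linear actor and a hand-picked concave critic, both hypothetical). The point is the structure of the update: the gradient is evaluated at the proto-action $\hat{a}$, and the discrete lookup $g$ contributes no gradient, exactly as if it were part of the environment.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical linear actor f_theta(s) = theta @ s and hand-picked critic
# Q(s, a) = s^T W a - 0.5 * ||a||^2; the paper instead trains deep networks
# with DDPG (Lillicrap et al., 2015). Only the shape of the update is shown:
# the gradient is taken at the proto-action a_hat, and the discrete mapping
# g is treated as environment dynamics (no gradient flows through it).
M, N = 8, 4
theta = rng.standard_normal((N, M))          # actor parameters theta^pi
W = rng.standard_normal((M, N))              # critic parameters (held fixed here)
LR = 1e-3

def proto_action(state):
    return theta @ state                     # a_hat = f_theta(s)

def critic_grad_wrt_action(state, a_hat):
    return W.T @ state - a_hat               # dQ/da evaluated at a = a_hat

def actor_step(state):
    """One deterministic policy-gradient step on the proto-action a_hat."""
    global theta
    a_hat = proto_action(state)
    grad_a = critic_grad_wrt_action(state, a_hat)
    theta += LR * np.outer(grad_a, state)    # dQ/dtheta = (dQ/da_hat) (da_hat/dtheta)

state = rng.standard_normal(M)
actor_step(state)
```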