Exploratory Gradient Boosting for Reinforcement Learning in Complex Domains


David Abel‡, Alekh Agarwal†, Fernando Diaz†, Akshay Krishnamurthy†, Robert E. Schapire†
‡ Department of Computer Science, Brown University, Providence, RI 02912
† Microsoft Research, New York, NY 10011

Abstract

High-dimensional observations and complex real-world dynamics present major challenges in reinforcement learning for both function approximation and exploration. We address both of these challenges with two complementary techniques: First, we develop a gradient-boosting-style, non-parametric function approximator for learning on Q-function residuals. Second, we propose an exploration strategy inspired by the principles of state abstraction and information acquisition under uncertainty. We demonstrate the empirical effectiveness of these techniques, first, as a preliminary check, on two standard tasks (Blackjack and n-Chain), and then on two much larger and more realistic tasks with high-dimensional observation spaces. Specifically, we introduce two benchmarks built within the game Minecraft where the observations are pixel arrays of the agent's visual field. A combination of our two algorithmic techniques performs competitively on the standard reinforcement-learning tasks while consistently and substantially outperforming baselines on the two tasks with high-dimensional observation spaces. The new function approximator, exploration strategy, and evaluation benchmarks are each of independent interest in the pursuit of reinforcement-learning methods that scale to real-world domains.

Figure 1: Visual Hill Climbing: the agent is rewarded for navigating to higher terrain while receiving raw visual input.

1 Introduction

Many real-world domains have very large state spaces and complex dynamics, requiring an agent to reason over extremely high-dimensional observations. For example, this is the case for the task in Figure 1, in which an agent must navigate to the highest location using only raw visual input. Developing efficient and effective algorithms for such environments is critically important across a variety of domains.

Even relatively straightforward tasks like the one above can cause existing approaches to flounder; for instance, simple linear function approximation cannot scale to visual input, while nonlinear function approximation, such as deep Q-learning [Mnih et al., 2015], tends to use relatively simple exploration strategies.

In this paper, we propose two techniques for scaling reinforcement learning to such domains. First, we present a novel non-parametric function-approximation scheme based on gradient boosting [Friedman, 2001; Mason et al., 2000], a method designed for i.i.d. data, here adapted to reinforcement learning. The approach has several merits. Like the deep-learning-based methods [Mnih et al., 2015], which succeed by learning good function approximations, it builds on a powerful learning system. Unlike the deep-learning approaches, however, gradient-boosting models are amenable to training and prediction on a single laptop rather than relying on GPUs. The model is naturally trained on residuals, which was recently shown to be helpful even in the deep-learning literature [He et al., 2015]. Furthermore, boosting has a rich theoretical foundation in supervised learning; this theory could plausibly be extended to reinforcement-learning settings in future work.

As our second contribution, we give a complementary exploration tactic, inspired by the principle of information acquisition under uncertainty (IAUU), that improves over ε-uniform exploration by incentivizing novel action applications. With its extremely simple design and efficient use of data, we demonstrate how our new algorithm combining these techniques, called Generalized Exploratory Q-learning (GEQL), can be the backbone for an agent facing highly complex tasks with raw visual observations.
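To give a concrete feel for this kind of exploration, the sketch below shows one simple way to incentivize novel action applications: a count-based bonus computed over a compressed (abstracted) representation of the observation. This is an illustrative assumption rather than the IAUU tactic used by GEQL, which is specified later in the paper; the hash-based abstraction `phi`, the bonus form, and the constants are placeholders.

```python
import hashlib
from collections import defaultdict

import numpy as np

# Illustrative sketch only: a count-based novelty bonus over a compressed
# state representation. The abstraction phi() and the bonus form are
# placeholder assumptions, not the IAUU rule defined in the paper.

def phi(observation, n_buckets=4096):
    """Compress a high-dimensional observation (e.g., a pixel array) into a
    small discrete code by hashing a coarsened copy of it."""
    coarse = (np.asarray(observation) // 32).astype(np.uint8).tobytes()
    return int(hashlib.md5(coarse).hexdigest(), 16) % n_buckets

class NoveltyBonus:
    """Rewards applying actions that are novel for the current abstract state."""

    def __init__(self, scale=1.0):
        self.counts = defaultdict(int)
        self.scale = scale

    def bonus(self, observation, action):
        # Count how often this (abstract state, action) pair has been tried
        # and return a bonus that decays with repeated applications.
        key = (phi(observation), action)
        self.counts[key] += 1
        return self.scale / np.sqrt(self.counts[key])
```

Such a bonus would simply be added to the environment reward when selecting or updating actions, steering the agent toward state-action combinations it has rarely tried.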
We empirically evaluate our techniques on two standard RL domains (Blackjack [Sutton and Barto, 1998] and n-chain [Strens, 2000]) and on two much larger, more realistic tasks with high-dimensional observation spaces. Both of the latter tasks were built within the game Minecraft, where observations are pixel arrays of the agent's visual field, as in Figure 1. The Minecraft experiments are made possible by a new Artificial Intelligence eXperimentation (AIX) platform, which we describe in detail below. We find that on the standard tasks our technique performs competitively, while on the two large, high-dimensional Minecraft tasks our method consistently and quite substantially outperforms the baseline.

2 Related Work

Because the literature on reinforcement learning is so vast, we focus only on the most related results, specifically on function approximation and exploration strategies. For a more general introduction, see [Sutton and Barto, 1998].

Function approximation is an important technique for scaling reinforcement-learning methods to complex domains. While linear function approximators are effective for many problems [Sutton, 1984], complex non-linear models for function approximation often demonstrate stronger performance on many challenging domains [Anderson, 1986; Tesauro, 1994]. Unlike recent approaches based on neural-network architectures [Mnih et al., 2015], we adopt gradient-boosted regression trees [Friedman, 2001], a non-parametric class of regression models with competitive performance on supervised-learning tasks. Although similar ensemble approaches to reinforcement learning have been applied in previous work [Marivate and Littman, 2013], these assume a fixed set of independently trained agents rather than a boosting-style ensemble.

Our work introduces the interleaving of boosting iterations and data collection. By its iterative nature, our approximation resembles the offline, batch-style training of Fitted Q-Iteration [Ernst et al., 2005], in which a Q-learner is iteratively fit to a fixed set of data. Our algorithm differs in that, at each iteration, the current Q-function approximation guides subsequent data collection, the results of which are used to drive the next update of the Q-function. This adaptive data-collection strategy is critical, as the exploration problem is central to reinforcement learning, and, in experiments, our interleaved method significantly outperforms Fitted Q-Iteration.

Our other main algorithmic innovation is a new exploration strategy for reinforcement learning with function approximation. Our approach is similar to some work on state abstraction, where the learning agent constructs and uses a compact model of the world [Dietterich, 2000; Li et al., 2006]. An important difference is that our algorithm uses the compact model for exploration only, rather than for both exploration and policy learning. Consequently, model compression does not compromise the expressivity of our learning algorithm, which can still learn [...]

[...] of the predicted state to a memory bank to inform exploration decisions. Another approach is to learn a dynamics model and then to use either optimistic estimates [Xie et al., 2015] or uncertainty [Stadie et al., 2015] in the model to provide exploration bonuses (see also [Guez et al., 2012]). Lastly, there are some exploration strategies with theoretical guarantees for domains with certain metric structure [Kakade et al., 2003], but this structure must be known a priori, and it is unclear how to construct such structure in general.

3 The GEQL Algorithm

In this section, we present our new model-free reinforcement-learning algorithm, Generalized Exploratory Q-Learning (GEQL), which includes two independent but complementary components: a new function-approximation scheme based on gradient boosting, and a new exploration tactic based on model compression.

3.1 The setting

We consider the standard discounted, model-free reinforcement-learning setting in which an agent interacts with an environment with the goal of accumulating high reward. At each time step t, the agent observes its state $s_t \in S$, which might be represented by a high-dimensional vector, such as the raw visual input in Figure 1. The agent then selects an action $a_t \in A$ whose execution modifies the state of the environment, typically by moving the agent. Finally, the agent receives some real-valued reward $r_t$. This process either repeats indefinitely or for a fixed number of actions. The agent's goal is to maximize its long-term discounted reward, $\sum_{t=1}^{\infty} \gamma^{t-1} r_t$, where $\gamma \in (0, 1)$ is a pre-specified discount factor.

This process is typically assumed to define a Markov decision process (MDP), meaning that (1) the next state reached, $s_{t+1}$, is a fixed stochastic function that depends only on the previous state $s_t$ and the action $a_t$ that was executed; and (2) the reward $r_t$ similarly depends only on $s_t$ and $a_t$.

For simplicity, we assume in our development that the states are in fact fully observable. However, in many realistic settings, what the agent observes might not fully define the underlying state; in other words, the environment might only be a partially observable MDP. Nevertheless, in practice, it may often be reasonable to treat the observations as if they were the underlying states, especially when the observations are rich and informative. Alternatively, for this purpose, we could use a recent window of past observations and actions, or even the entire past history.
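As a concrete illustration of this interaction protocol, the sketch below rolls out a single episode and accumulates the discounted return $\sum_{t=1}^{\infty} \gamma^{t-1} r_t$, truncated at a finite number of actions. The `env` and `agent` interfaces are hypothetical stand-ins introduced only for this example, not part of the paper.

```python
def run_episode(env, agent, gamma=0.99, max_steps=1000):
    """Roll out one episode and return the discounted return
    sum_t gamma^(t-1) * r_t, truncated after max_steps actions.

    Hypothetical interfaces assumed here:
      env.reset() -> state
      env.step(action) -> (next_state, reward, done)
      agent.act(state) -> action
    """
    state = env.reset()
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = agent.act(state)
        state, reward, done = env.step(action)
        discounted_return += discount * reward
        discount *= gamma  # gamma^(t-1) factor for the next step
        if done:
            break
    return discounted_return
```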
3.2 Boosting-based Q-function approximation

Our approach is based on Q-learning, a standard RL technique [...]
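As a rough illustration of a boosting-based Q-function approximator interleaved with data collection, the sketch below fits a small regression tree to the temporal-difference residuals of the current ensemble after each batch of newly collected transitions. The scikit-learn regressor, the state-action feature map, and the hyperparameters shown are illustrative assumptions, not the exact GEQL construction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class BoostedQ:
    """Q(s, a) represented as a shrunken sum of small regression trees fit to
    TD residuals. Illustrative sketch; featurize(s, a) is assumed to return a
    1-D numpy feature vector, and terminal-state handling is omitted."""

    def __init__(self, actions, featurize, gamma=0.99, lr=0.5, max_depth=3):
        self.actions, self.featurize = actions, featurize
        self.gamma, self.lr, self.max_depth = gamma, lr, max_depth
        self.trees = []

    def q(self, s, a):
        # Ensemble prediction: shrunken sum over all boosted trees (0 if empty).
        x = self.featurize(s, a).reshape(1, -1)
        return sum(self.lr * tree.predict(x)[0] for tree in self.trees)

    def greedy(self, s):
        # Greedy action under the current ensemble.
        return max(self.actions, key=lambda a: self.q(s, a))

    def boost(self, transitions):
        """One boosting round: fit a tree to the TD residuals of the current
        ensemble on a batch of (s, a, r, s') transitions."""
        X = np.array([self.featurize(s, a) for s, a, _, _ in transitions])
        residuals = np.array([
            r + self.gamma * max(self.q(s2, b) for b in self.actions) - self.q(s, a)
            for s, a, r, s2 in transitions
        ])
        tree = DecisionTreeRegressor(max_depth=self.max_depth).fit(X, residuals)
        self.trees.append(tree)
```

In contrast to the fixed batch used by Fitted Q-Iteration, each call to `boost` here would be preceded by collecting new transitions under the current (exploratory) policy, matching the interleaved training described above.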
