Using Monte Carlo Tree Search as a Demonstrator within Asynchronous Deep RL
Bilal Kartal∗, Pablo Hernandez-Leal∗ and Matthew E. Taylor
{bilal.kartal, pablo.hernandez, [email protected]}
Borealis AI, Edmonton, Canada

∗Equal contribution. To appear in the AAAI-19 Workshop on Reinforcement Learning in Games.

Abstract

Deep reinforcement learning (DRL) has achieved great successes in recent years with the help of novel methods and higher compute power. However, there are still several challenges to be addressed, such as convergence to locally optimal policies and long training times. In this paper, we first augment the Asynchronous Advantage Actor-Critic (A3C) method with a novel self-supervised auxiliary task, Terminal Prediction, which measures temporal closeness to terminal states; we name the resulting method A3C-TP. Second, we propose a new framework where planning algorithms such as Monte Carlo tree search, or other sources of (simulated) demonstrators, can be integrated into asynchronous distributed DRL methods. Compared to vanilla A3C, our proposed methods both learn faster and converge to better policies on a two-player mini version of the Pommerman game.

Introduction

Deep reinforcement learning (DRL) combines reinforcement learning (Sutton and Barto 1998) with deep learning (LeCun, Bengio, and Hinton 2015), enabling better scalability and generalization for challenging domains. DRL has been one of the most active areas of research in recent years, with great successes such as mastering Atari games from raw images (Mnih et al. 2015), AlphaGo Zero (Silver et al. 2017) for a Go-playing agent skilled well beyond any human player (a combination of Monte Carlo tree search and DRL), and, very recently, great success in DOTA 2 (OpenAI 2018b). For a more general and technical overview of DRL, please see the recent surveys (Arulkumaran et al. 2017; Li 2017; Hernandez-Leal, Kartal, and Taylor 2018).

On the one hand, one of the biggest challenges for reinforcement learning is sample efficiency (Yu 2018). However, once a DRL agent is trained, it can be deployed to act in real time by performing only an inference pass through the trained model. On the other hand, planning methods such as Monte Carlo tree search (MCTS) (Browne et al. 2012) do not have a training phase; instead, they perform simulation-based rollouts, assuming access to a simulator, to find the best action to take.

There are several ways to get the best of both DRL and search methods. For example, AlphaGo Zero (Silver et al. 2017) and Expert Iteration (Anthony, Tian, and Barber 2017) concurrently proposed the idea of combining DRL and MCTS in an imitation learning framework where both components improve each other. These works combine search and neural networks sequentially. First, search is used to generate an expert move dataset, which is used to train a policy network (Guo et al. 2014). Second, this network is used to improve expert search quality (Anthony, Tian, and Barber 2017), and this is repeated in a loop. However, expert move data collection by vanilla search algorithms can be slow in a sequential framework (Guo et al. 2014).

In this paper, we show that it is also possible to blend search with distributed DRL methods such that the search and neural network components are decoupled and can be executed simultaneously in an on-policy fashion. The planner (MCTS or other methods (LaValle 2006)) can be used as a demonstrator to speed up learning for model-free RL methods. Distributed RL methods enable efficient exploration and yield faster learning results, and several such methods have been proposed (Mnih et al. 2016; Jaderberg et al. 2016; Schulman et al. 2017; Espeholt et al. 2018).

In this paper, we consider Asynchronous Advantage Actor-Critic (A3C) (Mnih et al. 2016) as a baseline algorithm. We augment A3C with auxiliary tasks, i.e., additional tasks that the agent can learn without extra signals from the environment besides policy optimization, to improve the performance of DRL algorithms. Our contributions are as follows:

• We propose a novel auxiliary task, namely Terminal Prediction, based on temporal closeness to terminal states, on top of the A3C algorithm. Our method, named A3C-TP, both learns faster and helps to converge to better policies when trained on a two-player mini version of the Pommerman game, depicted in Figure 1. An illustrative formalization of this auxiliary task is sketched below.

• We propose a new framework based on diversifying some of the workers of the A3C method with MCTS-based planners (serving as demonstrators), using the parallelized asynchronous training architecture to improve training efficiency.
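As an illustration of the first contribution (a minimal sketch rather than the exact formulation used by our method), one natural Terminal Prediction target is the normalized episode progress: at step t of an episode that terminates at step T, an auxiliary head of the network outputs a prediction that is regressed toward t/T, and the resulting mean-squared error is added to the standard A3C loss with a weighting coefficient lambda_TP (a hyperparameter introduced here purely for illustration):

\[
  \mathcal{L}_{TP} \;=\; \frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t - \frac{t}{T}\right)^{2},
  \qquad
  \mathcal{L}_{A3C\text{-}TP} \;=\; \mathcal{L}_{A3C} \;+\; \lambda_{TP}\,\mathcal{L}_{TP}.
\]

Because the target requires no extra signal from the environment beyond knowing when the episode ends, the task is self-supervised in the sense described above.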
Figure 1: An example of the 8 × 8 Pommerman board, randomly generated by the simulator. Agents' initial positions are randomly selected among the four corners at each episode.

Related Work

Our work lies at the intersection of the research areas of deep reinforcement learning methods, imitation learning, and Monte Carlo tree search based planning. In this section, we review some of the existing work in these fields.

Deep Reinforcement Learning

Reinforcement learning (RL) seeks to maximize the sum of discounted rewards an agent collects by interacting with an environment. RL approaches mainly fall under three categories: value-based methods such as Q-learning (Watkins and Dayan 1992) or Deep Q-Network (Mnih et al. 2015), policy-based methods such as REINFORCE (Williams 1992), and a combination of value- and policy-based techniques, i.e., actor-critic methods (Konda and Tsitsiklis 2000). Recently, there have been several distributed actor-critic based DRL algorithms (Mnih et al. 2016; Jaderberg et al. 2016; Espeholt et al. 2018; Gruslys et al. 2018).

A3C (Asynchronous Advantage Actor-Critic) (Mnih et al. 2016) is an algorithm that employs a parallelized asynchronous training scheme (using multiple CPU cores) for efficiency. It is an on-policy RL method that does not use an experience replay buffer. A3C allows multiple workers to simultaneously interact with the environment and compute gradients locally. All the workers pass their computed local gradients to a global neural network that performs the optimization and synchronizes with the workers asynchronously. The policy and the value function are updated after every t_max actions or when a terminal state is reached. There is also the A2C (Advantage Actor-Critic) method, which combines the gradients from all the workers to update the global neural network synchronously.

The UNREAL framework (Jaderberg et al. 2016) is built on top of A3C. In particular, UNREAL proposes unsupervised auxiliary tasks (e.g., reward prediction) to speed up learning; these require no additional feedback from the environment. In contrast to A3C, UNREAL uses an experience replay buffer that is sampled with higher priority given to positively rewarded interactions, to improve the critic network.
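To make the asynchronous training scheme concrete, the following is a minimal, illustrative sketch of the A3C worker pattern on a toy corridor task, using a tabular softmax actor-critic and plain Python threads in place of CPU processes. It is a structural sketch rather than the implementation used in this work: the toy environment, the hyperparameters, and the lock-free (Hogwild-style) application of gradients to the shared parameters are simplifications introduced here for illustration.

import math, random, threading

N_STATES, ACTIONS = 6, (-1, +1)          # corridor 0..5; reaching cell 5 gives reward 1
GAMMA, LR, T_MAX = 0.95, 0.1, 5

# Shared "global network": tabular policy logits and state values.
global_logits = [[0.0, 0.0] for _ in range(N_STATES)]
global_values = [0.0 for _ in range(N_STATES)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def env_step(state, action_idx):
    """Toy environment: move left/right; the episode ends with reward 1 at the last cell."""
    nxt = min(max(state + ACTIONS[action_idx], 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def worker(n_updates):
    state = 0
    for _ in range(n_updates):
        # Sync local copies of the shared ("global") parameters, as in A3C.
        logits = [row[:] for row in global_logits]
        values = global_values[:]
        rollout, done = [], False
        for _ in range(T_MAX):                      # collect up to t_max transitions
            probs = softmax(logits[state])
            a = 0 if random.random() < probs[0] else 1
            nxt, r, done = env_step(state, a)
            rollout.append((state, a, r))
            state = nxt
            if done:
                break
        # n-step bootstrapped return; local gradients are applied to the shared
        # parameters without locks, standing in for the asynchronous optimizer.
        R = 0.0 if done else values[state]
        for s, a, r in reversed(rollout):
            R = r + GAMMA * R
            adv = R - values[s]                     # advantage estimate A = R - V(s)
            probs = softmax(logits[s])
            for i in range(2):                      # policy-gradient step on the logits
                global_logits[s][i] += LR * ((1.0 if i == a else 0.0) - probs[i]) * adv
            global_values[s] += LR * adv            # move V(s) toward the n-step return
        if done:
            state = 0                               # start a new episode

threads = [threading.Thread(target=worker, args=(300,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("P(right) per state:", [round(softmax(row)[1], 2) for row in global_logits])

Each worker computes its gradients from a locally synced copy of the parameters and applies them to the shared ones after at most t_max steps or at episode end, mirroring the update rule described above.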
Monte Carlo Tree Search

Monte Carlo Tree Search (see Figure 2) is a best-first search algorithm that gained traction after its breakthrough performance in Go (Coulom 2006). Beyond game-playing agents, MCTS has been employed in a variety of domains such as robotics (Kartal et al. 2016; Zhang, Atanasov, and Daniilidis 2017) and Sokoban puzzle generation (Kartal, Sohre, and Guy 2016). A recent work (Vodopivec, Samothrakis, and Ster 2017) provided an excellent unification of MCTS and RL.

[Figure 2: Overview of Monte Carlo Tree Search (four search-tree panels, each rooted at the current state).]

Imitation Learning

Domains where rewards are delayed and sparse are difficult exploration RL problems, and they are particularly difficult when learning tabula rasa. Imitation learning can be used to train agents much faster compared to learning from scratch.

Approaches such as DAGGER (Ross, Gordon, and Bagnell 2011) or its extended version (Sun et al. 2017) formulate imitation learning as a supervised problem where the aim is to match the performance of the demonstrator. However, the performance of agents using these methods is upper-bounded by the demonstrator's performance.

Previously, Lagoudakis et al. (2003) proposed a classification-based RL method that uses Monte Carlo rollouts for each action to construct a training dataset and improve the policy iteratively. Other, more recent works such as Expert Iteration (Anthony, Tian, and Barber 2017) extend imitation learning to the RL setting, where the demonstrator is also continuously improved during training. There has been a growing body of work on imitation learning where human or simulated demonstrators' data is used to speed up policy learning in RL (Hester et al. 2017; Subramanian, Isbell Jr, and Thomaz 2016; Cruz Jr, Du, and Taylor 2017; Christiano et al. 2017; Nair et al. 2018).

Hester et al. (2017) used demonstrator data by combining the supervised learning loss with the Q-learning loss within the DQN algorithm to pretrain, and showed that their method achieves good results on Atari games using only a few minutes of game-play data. Cruz et al. (2017) employed human demonstrators to pretrain their neural network in a supervised learning fashion to improve feature learning, so that the RL method with the pretrained network can focus more on policy learning; this reduced training times for Atari games. Kim et al. (2013) proposed a learning-from-demonstration method where limited demonstrator data is used to impose constraints on the policy iteration phase, and they theoretically prove bounds on the Bellman error.

In some domains, such as robotics, the tasks can be too difficult or time consuming for humans to provide full demonstrations. Instead, humans can provide sparser feedback (Loftin et al. 2014; Christiano et al. 2017) on alternative agent trajectories that RL can use to speed up learning. Along this direction, Christiano et al. (2017) proposed a method that constructs a reward function from data containing human feedback on agent trajectories and showed that a small amount of non-expert human feedback suffices to learn complex agent behaviours.
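Connecting the imitation learning perspective back to the framework proposed in this paper, the sketch below shows, in simplified form, one way a planner could serve as a demonstrator for a learning policy: a flat Monte Carlo planner (a stand-in for full MCTS) suggests an action for the current state, and a tabular softmax policy is pushed toward that action with a supervised cross-entropy update. This is an illustration rather than our exact method; the flat-rollout planner, the interaction loop, and the omission of the accompanying actor-critic loss are simplifications introduced here.

import math, random

N_STATES, ACTIONS = 6, (-1, +1)          # same toy corridor: reward 1 at the last cell
GAMMA, LR, ROLLOUTS, HORIZON = 0.95, 0.2, 16, 10

def env_step(state, action_idx):
    nxt = min(max(state + ACTIONS[action_idx], 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def rollout_value(state):
    """Discounted return of a random playout (the planner's simulator access)."""
    total, discount = 0.0, 1.0
    for _ in range(HORIZON):
        state, r, done = env_step(state, random.randrange(2))
        total += discount * r
        discount *= GAMMA
        if done:
            break
    return total

def planner_action(state):
    """Demonstrator: pick the action with the best average rollout return."""
    scores = []
    for a in range(2):
        nxt, r, done = env_step(state, a)
        returns = [r if done else r + GAMMA * rollout_value(nxt) for _ in range(ROLLOUTS)]
        scores.append(sum(returns) / ROLLOUTS)
    return max(range(2), key=lambda a: scores[a])

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return [e / sum(exps) for e in exps]

policy_logits = [[0.0, 0.0] for _ in range(N_STATES)]

state = 0
for _ in range(300):                       # demonstrator-worker interaction loop
    a_demo = planner_action(state)         # ask the planner for a suggested action
    probs = softmax(policy_logits[state])
    for i in range(2):                     # cross-entropy gradient toward a_demo
        policy_logits[state][i] += LR * ((1.0 if i == a_demo else 0.0) - probs[i])
    state, _, done = env_step(state, a_demo)  # act with the demonstrator's action
    if done:
        state = 0

print("P(right) per state after imitation:",
      [round(softmax(row)[1], 2) for row in policy_logits])

In the full framework, only some of the parallel workers would be paired with such a planner while the rest remain standard A3C workers, so that the (slower) search and the neural network training can run simultaneously.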