Learning to Search with MCTSnets
Arthur Guez*, Théophane Weber*, Ioannis Antonoglou, Karen Simonyan, Oriol Vinyals, Daan Wierstra, Rémi Munos, David Silver (DeepMind, London, UK)

*Equal contribution. Correspondence to: A. Guez and T. Weber.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s). arXiv:1802.04697v2 [cs.AI], 17 Jul 2018.

Abstract

Planning problems are among the most important and well-studied problems in artificial intelligence. They are most typically solved by tree search algorithms that simulate ahead into the future, evaluate future states, and back up those evaluations to the root of a search tree. Among these algorithms, Monte-Carlo tree search (MCTS) is one of the most general, powerful and widely used. A typical implementation of MCTS uses cleverly designed rules, optimised to the particular characteristics of the domain. These rules control where the simulation traverses, what to evaluate in the states that are reached, and how to back up those evaluations. In this paper we instead learn where, what and how to search. Our architecture, which we call an MCTSnet, incorporates simulation-based search inside a neural network, by expanding, evaluating and backing up a vector embedding. The parameters of the network are trained end-to-end using gradient-based optimisation. When applied to small searches in the well-known planning problem Sokoban, the learned search algorithm significantly outperformed MCTS baselines.

1. Introduction

Many success stories in artificial intelligence are based on the application of powerful tree search algorithms to challenging planning problems (Samuel, 1959; Knuth & Moore, 1975; Jünger et al., 2009). It has been well documented that planning algorithms can be highly optimised by tailoring them to the domain (Schaeffer, 2000). For example, the performance can often be dramatically improved by modifying the rules that select the trajectory to traverse, the states to expand, the evaluation function by which performance is measured, and the backup rule by which those evaluations are propagated up the search tree. Our contribution is a new search algorithm in which all of these steps can be learned automatically and efficiently. Our work fits into a more general trend of learning differentiable versions of algorithms.

One particularly powerful and general method for planning is Monte-Carlo tree search (MCTS) (Coulom, 2006; Kocsis & Szepesvári, 2006), as used in the recent AlphaGo program (Silver et al., 2016). A typical MCTS algorithm consists of several phases. First, it simulates trajectories into the future, starting from a root state. Second, it evaluates the performance of leaf states, either using a random rollout or an evaluation function such as a 'value network'. Third, it backs up these evaluations to update internal values along the trajectory, for example by averaging over evaluations.

We present a neural network architecture that includes the same processing stages as a typical MCTS, but inside the neural network itself, as a dynamic computational graph. The key idea is to represent the internal state of the search, at each node, by a memory vector. The computation of the network proceeds forwards from the root state, just like a simulation of MCTS, using a simulation policy based on the memory vector to select the trajectory to traverse. The leaf state is then processed by an embedding network to initialize the memory vector at the leaf. The network proceeds backwards up the trajectory, updating the memory at each visited state according to a backup network that propagates from child to parent. Finally, the root memory vector is used to compute an overall prediction of value or action.
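To make the forward and backward passes concrete, the following sketch traces one such simulation. It is a minimal illustration under assumed interfaces: the Node fields and the transition, embed, backup, policy and readout callables are hypothetical stand-ins for the environment model and the learned networks, not the paper's exact implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    state: object                                   # environment state at this node
    memory: object = None                           # learned memory vector for this node
    children: dict = field(default_factory=dict)    # action -> child Node

def run_simulation(root, transition, embed, backup, policy, readout):
    """One MCTSnet-style simulation (illustrative sketch, assumed interfaces)."""
    # Forward pass: the simulation policy reads the memory vector at the current
    # node and picks an internal action; if the chosen child already exists we
    # descend into it, otherwise we create it with the transition model and stop.
    path, node = [root], root
    while node.memory is not None:
        action = policy(node.memory)                # discrete internal decision
        if action not in node.children:
            node.children[action] = Node(state=transition(node.state, action))
        node = node.children[action]
        path.append(node)

    # Evaluate the newly reached (leaf) state with the embedding network.
    node.memory = embed(node.state)

    # Backward pass: update memories from child to parent via the backup network.
    for parent, child in zip(reversed(path[:-1]), reversed(path[1:])):
        parent.memory = backup(parent.memory, child.memory)

    # After the final simulation, the root memory is decoded into the overall
    # prediction (e.g. an action distribution or a value).
    return readout(root.memory)
```

Repeating this procedure for a fixed budget of simulations grows the tree and progressively refines the root memory before the final readout.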
The major benefit of our planning architecture, compared to more traditional planning algorithms, is that it can be exposed to gradient-based optimisation. This allows us to replace every component of MCTS with a richer, learnable equivalent, while maintaining the desirable structural properties of MCTS such as the use of a model, iterative local computations, and structured memory. We jointly train the parameters of the evaluation network, backup network and simulation policy so as to optimise the overall predictions of the MCTS network (MCTSnet). The majority of the network is fully differentiable, allowing for efficient training by gradient descent. Still, internal action sequences directing the control flow of the network cannot be differentiated, and learning this internal policy presents a challenging credit assignment problem. To address this, we propose a novel, generally-applicable approximate scheme for credit assignment that leverages the anytime property of our computational graph, allowing us to also effectively learn this part of the search network from data.
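The approximate credit-assignment scheme itself is presented later in the paper. Purely as a generic illustration of the problem, the sketch below combines the ordinary backpropagated prediction loss with a score-function (REINFORCE-style) surrogate term for the sampled internal actions; the function and argument names are assumptions, and this is not the paper's specific estimator.

```python
import torch

def training_loss(prediction_loss, internal_log_probs):
    """Generic illustration only (assumed names, not the paper's scheme).

    prediction_loss: differentiable loss on the network's final output.
    internal_log_probs: list of log-probabilities of the sampled internal
        actions (the discrete tree-traversal decisions of the simulation policy).
    """
    # Gradients flow through prediction_loss into the embedding, backup and
    # readout networks as usual. The sampled internal actions block gradient
    # flow, so the simulation policy instead receives a score-function term
    # that treats the (detached) prediction loss as a negative return.
    reinforce_term = prediction_loss.detach() * torch.stack(internal_log_probs).sum()
    return prediction_loss + reinforce_term
```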
In the Sokoban domain, a classic planning task (Botea et al., 2003), we justify our network design choices and show that our learned search algorithm is able to outperform various model-free and model-based baselines.

2. Related Work

There has been significant previous work on learning evaluation functions, using supervised learning or reinforcement learning, that are subsequently combined with a search algorithm (Tesauro, 1994; Baxter et al., 1998; Silver et al., 2016). However, the learning process is typically decoupled from the search algorithm, and has no awareness of how the search algorithm will combine those evaluations into an overall decision.

Several previous search architectures have learned to tune the parameters of the evaluation function so as to achieve the most effective overall search results given a specified search algorithm. The learning-to-search framework (Chang et al., 2015) learns an evaluation function that is effective in the context of beam search. Samuel's checkers player (Samuel, 1959), the TD(leaf) algorithm (Baxter et al., 1998; Schaeffer et al., 2001), and the TreeStrap algorithm apply reinforcement learning to find an evaluation function that combines with minimax search to produce an accurate root evaluation (Veness et al., 2009); while comparison training (Tesauro, 1988) applies supervised learning to the same problem. These methods have been successful in chess, checkers and shogi. In all cases the evaluation function is scalar valued.

There have been a variety of previous efforts to frame the learning of internal search decisions as a meta-reasoning problem, one which can be optimized directly (Russell, 1995). Kocsis et al. (2005) apply black-box optimisation to learn the meta-parameters controlling an alpha-beta search, but do not learn fine-grained control over the search decisions. Considering action choices at tree nodes as a bandit problem led to the widely used UCT variant of MCTS (Kocsis & Szepesvári, 2006). Hay & Russell (2011) also studied the meta-problem in MCTS, but they only considered a myopic policy without function approximation. Pascanu et al. (2017) also investigate learning-to-plan using neural networks. However, their system uses an unstructured memory which makes complex branching very unlikely. Initial results have been provided for toy domains, but it has not yet been demonstrated to succeed in any domain approaching the complexity of Sokoban.

Other neural network architectures have also incorporated Monte-Carlo simulations. The I2A architecture (Weber et al., 2017) aggregates the results of several simulations into its network computation. MCTSnets both generalise and extend some ideas behind I2A: introducing a tree-structured memory that stores node-specific statistics, and learning the simulation and tree expansion strategy, rather than rolling out each possible action from the root state with a fixed policy. Similar to I2A, the predictron architecture (Silver et al., 2017b) also aggregates over multiple simulations; however, in that case the simulations roll out an implicit transition model, rather than concrete steps from the actual environment. A recent extension by Farquhar et al. (2017) performs planning over a fixed-tree expansion of such an implicit model.

3. MCTSnet

The MCTSnet architecture may be understood from two distinct but equivalent perspectives. First, it may be understood as a search algorithm with a control flow that closely mirrors the simulation-based tree traversals of MCTS. Second, it may be understood as a neural network represented by a dynamic computation graph that processes input states, performs intermediate computations on hidden states, and outputs a final decision. We present each of these perspectives in turn, starting with the original, unmodified search algorithm.

3.1. MCTS Algorithm

The goal of planning is to find the optimal strategy that maximises the total reward in an environment defined by a deterministic transition model s' = T(s, a), mapping each state and action to a successor state s', and a reward model r(s, a), describing the goodness of each transition.

MCTS is a simulation-based search algorithm that converges to a solution to the planning problem. At a high level, the idea of MCTS is to maintain statistics at each node, such as the visit count and mean evaluation, and to use these statistics to decide which branches of the tree to visit.

MCTS proceeds by running a number of simulations. Each simulation traverses the tree, selecting the most promising child according to the statistics, until a leaf node is reached. The leaf node is then evaluated using a rollout or value network (Silver et al., 2016). This value is then propagated during a back-up phase that updates the statistics of the tree along the traversed path, tracking the visit counts N(s) and N(s, a) and the mean evaluation Q(s, a) following from each state s and action a. Search proceeds in an anytime fashion: the statistics gradually become more accurate, and simulations focus on increasingly promising regions of the tree.
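The sketch below restates this generic MCTS procedure in code, assuming hashable states, a deterministic model T(s, a), a reward model r(s, a), a legal-action function, and a leaf evaluator (random rollout or value network). The UCT selection rule and exploration constant shown are one common choice for "most promising child", not a detail specified here.

```python
import math
from collections import defaultdict

def mcts(root_state, T, r, actions, evaluate, num_simulations=100, c=1.0):
    """Generic MCTS sketch under assumed interfaces: T(s, a) -> s' is the
    deterministic transition model, r(s, a) the reward model, actions(s) the
    legal actions, and evaluate(s) a rollout or value-network leaf estimate."""
    N = defaultdict(int)      # visit counts N(s) and N(s, a)
    Q = defaultdict(float)    # mean evaluations Q(s, a)

    def uct(s, a):
        # Standard UCT score: exploit high Q(s, a), explore rarely tried actions.
        if N[(s, a)] == 0:
            return float("inf")
        return Q[(s, a)] + c * math.sqrt(math.log(N[s]) / N[(s, a)])

    for _ in range(num_simulations):
        s, path = root_state, []
        # Selection: descend to the most promising child until a leaf
        # (here, a previously unvisited state) is reached.
        while N[s] > 0:
            a = max(actions(s), key=lambda act: uct(s, act))
            path.append((s, a))
            s = T(s, a)
        # Evaluation of the leaf, then back-up along the traversed path.
        ret = evaluate(s)
        N[s] += 1
        for s_i, a_i in reversed(path):
            ret = r(s_i, a_i) + ret
            N[s_i] += 1
            N[(s_i, a_i)] += 1
            Q[(s_i, a_i)] += (ret - Q[(s_i, a_i)]) / N[(s_i, a_i)]

    # Recommend the root action with the highest visit count.
    return max(actions(root_state), key=lambda a: N[(root_state, a)])
```

Because the statistics improve with every simulation, the loop can be stopped at any time and still return the best recommendation found so far, which is the anytime property referred to above.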