Learning to Search with MCTSnets

Arthur Guez*¹, Théophane Weber*¹, Ioannis Antonoglou¹, Karen Simonyan¹, Oriol Vinyals¹, Daan Wierstra¹, Rémi Munos¹, David Silver¹

*Equal contribution. ¹DeepMind, London, UK. Correspondence to: A. Guez and T. Weber <{aguez, theophane}@google.com>. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Planning problems are among the most important and well-studied problems in artificial intelligence. They are most typically solved by tree search algorithms that simulate ahead into the future, evaluate future states, and back-up those evaluations to the root of a search tree. Among these algorithms, Monte-Carlo tree search (MCTS) is one of the most general, powerful and widely used. A typical implementation of MCTS uses cleverly designed rules, optimised to the particular characteristics of the domain. These rules control where the simulation traverses, what to evaluate in the states that are reached, and how to back-up those evaluations. In this paper we instead learn where, what and how to search. Our architecture, which we call an MCTSnet, incorporates simulation-based search inside a neural network, by expanding, evaluating and backing-up a vector embedding. The parameters of the network are trained end-to-end using gradient-based optimisation. When applied to small searches in the well-known planning problem Sokoban, the learned search algorithm significantly outperformed MCTS baselines.

1. Introduction

Many success stories in artificial intelligence are based on the application of powerful tree search algorithms to challenging planning problems (Samuel, 1959; Knuth & Moore, 1975; Jünger et al., 2009). It has been well documented that planning algorithms can be highly optimised by tailoring them to the domain (Schaeffer, 2000). For example, the performance can often be dramatically improved by modifying the rules that select the trajectory to traverse, the states to expand, the evaluation function by which performance is measured, and the backup rule by which those evaluations are propagated up the search tree. Our contribution is a new architecture in which all of these steps can be learned automatically and efficiently. Our work fits into a more general trend of learning differentiable versions of algorithms.

One particularly powerful and general method for planning is Monte-Carlo tree search (MCTS) (Coulom, 2006; Kocsis & Szepesvári, 2006), as used in the recent AlphaGo program (Silver et al., 2016). A typical MCTS algorithm consists of several phases. First, it simulates trajectories into the future, starting from a root state. Second, it evaluates the performance of leaf states, either using a random rollout or using an evaluation function such as a 'value network'. Third, it backs up these evaluations to update internal values along the trajectory, for example by averaging over evaluations.

We present a neural network architecture that includes the same processing stages as a typical MCTS, but inside the neural network itself, as a dynamic computational graph. The key idea is to represent the internal state of the search, at each node, by a memory vector. The computation of the network proceeds forwards from the root state, just like a simulation of MCTS, using a simulation policy based on the memory vector to select the trajectory to traverse. The leaf state is then processed by an embedding network to initialize the memory vector at the leaf. The network proceeds backwards up the trajectory, updating the memory at each visited state according to a backup network that propagates from child to parent. Finally, the root memory vector is used to compute an overall prediction of value or action.

The major benefit of our planning architecture, compared to more traditional planning algorithms, is that it can be exposed to gradient-based optimisation.
This allows us to replace every component of MCTS with a richer, learnable equivalent, while maintaining the desirable structural properties of MCTS such as the use of a model, iterative local computations, and structured memory. We jointly train the parameters of the evaluation network, backup network and simulation policy so as to optimise the overall predictions of the MCTS network (MCTSnet). The majority of the network is fully differentiable, allowing for efficient training by gradient descent. Still, internal action sequences directing the control flow of the network cannot be differentiated, and learning this internal policy presents a challenging credit assignment problem. To address this, we propose a novel, generally-applicable approximate scheme for credit assignment that leverages the anytime property of our computational graph, allowing us to also effectively learn this part of the search network from data.

In the Sokoban domain, a classic planning task (Botea et al., 2003), we justify our network design choices and show that our learned search algorithm is able to outperform various model-free and model-based baselines.

2. Related Work

There has been significant previous work on learning evaluation functions, using supervised learning or reinforcement learning, that are subsequently combined with a search algorithm (Tesauro, 1994; Baxter et al., 1998; Silver et al., 2016). However, the learning process is typically decoupled from the search algorithm, and has no awareness of how the search algorithm will combine those evaluations into an overall decision.
Several previous search architectures have learned to tune the parameters of the evaluation function so as to achieve the most effective overall search results given a specified search algorithm. The learning-to-search framework (Chang et al., 2015) learns an evaluation function that is effective in the context of beam search. Samuel's checkers player (Samuel, 1959), the TD(leaf) algorithm (Baxter et al., 1998; Schaeffer et al., 2001), and the TreeStrap algorithm apply reinforcement learning to find an evaluation function that combines with minimax search to produce an accurate root evaluation (Veness et al., 2009); while comparison training (Tesauro, 1988) applies supervised learning to the same problem. These methods have been successful in games such as chess and checkers. In all cases the evaluation function is scalar valued.

There have been a variety of previous efforts to frame the learning of internal search decisions as a meta-reasoning problem, one which can be optimized directly (Russell, 1995). Kocsis et al. (2005) apply black-box optimisation to learn the meta-parameters controlling an alpha-beta search, but do not learn fine-grained control over the search decisions. Considering action choices at tree nodes as a bandit problem led to the widely used UCT variant of MCTS (Kocsis & Szepesvári, 2006). Hay & Russell (2011) also studied the meta-problem in MCTS, but they only considered a myopic policy without function approximation. Pascanu et al. (2017) also investigate learning-to-plan using neural networks. However, their system uses an unstructured memory which makes complex branching very unlikely. Initial results have been provided for toy domains but it has not yet been demonstrated to succeed in any domain approaching the complexity of Sokoban.

Other neural network architectures have also incorporated Monte-Carlo simulations. The I2A architecture (Weber et al., 2017) aggregates the results of several simulations into its network computation. MCTSnets both generalise and extend some ideas behind I2A: introducing a tree-structured memory that stores node-specific statistics; and learning the simulation and tree expansion strategy, rather than rolling out each possible action from the root state with a fixed policy. Similar to I2A, the predictron architecture (Silver et al., 2017b) also aggregates over multiple simulations; however, in that case the simulations roll out an implicit transition model, rather than concrete steps from the actual environment. A recent extension by Farquhar et al. (2017) performs planning over a fixed-tree expansion of such an implicit model.

3. MCTSnet

The MCTSnet architecture may be understood from two distinct but equivalent perspectives. First, it may be understood as a search algorithm with a control flow that closely mirrors the simulation-based tree traversals of MCTS. Second, it may be understood as a neural network represented by a dynamic computation graph that processes input states, performs intermediate computations on hidden states, and outputs a final decision. We present each of these perspectives in turn, starting with the original, unmodified search algorithm.

3.1. MCTS Algorithm

The goal of planning is to find the optimal strategy that maximises the total reward in an environment defined by a deterministic transition model s' = T(s, a), mapping each state and action to a successor state s', and a reward model r(s, a), describing the goodness of each transition.

MCTS is a simulation-based search algorithm that converges to a solution to the planning problem. At a high level, the idea of MCTS is to maintain statistics at each node, such as the visit count and mean evaluation, and to use these statistics to decide which branches of the tree to visit.

MCTS proceeds by running a number of simulations. Each simulation traverses the tree, selecting the most promising child according to the statistics, until a leaf node is reached. The leaf node is then evaluated using a rollout or value network (Silver et al., 2016). This value is then propagated during a back-up phase that updates the statistics of the tree along the traversed path, tracking the visit counts N(s), N(s, a) and mean evaluation Q(s, a) following from each state s and action a. Search proceeds in an anytime fashion: the statistics gradually become more accurate, and simulations focus on increasingly promising regions of the tree.
We now describe a value-network MCTS in more detail. Each simulation from the root state s_A is composed of four stages:¹

Algorithm 1: Value-Network Monte-Carlo Tree Search

1. Initialize simulation time t = 0 and current node s_0 = s_A.

2. Forward simulation from root state. Do until we reach a leaf node (N(s_t) = 0):

   (a) Sample action a_t based on the simulation policy, a_t ∼ π(a | s_t, {N(s_t), N(s_t, a), Q(s_t, a)}),

   (b) Compute the reward r_t = r(s_t, a_t) and next state s_{t+1} = T(s_t, a_t),

   (c) Increment t.

3. Evaluate leaf node s_L found at depth L.

   (a) Obtain value estimate V(s_L),

   (b) Set N(s_L) = 1.

4. Back-up phase from leaf node s_L, for each t < L:

   (a) Set (s, a) = (s_t, a_t),

   (b) Update Q(s, a) towards the Monte-Carlo return:

      Q(s, a) ← Q(s, a) + (1 / (N(s, a) + 1)) (R_t − Q(s, a)),

      where R_t = Σ_{t'=t}^{L−1} γ^{t'−t} r_{t'} + γ^{L−t} V(s_L),

   (c) Update visit counts N(s) and N(s, a): N(s) ← N(s) + 1, N(s, a) ← N(s, a) + 1.

When the search completes, it selects the action at the root with the most visit counts. The simulation policy π is chosen to trade off exploration and exploitation in the tree. In the UCT variant of MCTS (Kocsis & Szepesvári, 2006), π is inspired by the UCB bandit algorithm (Auer, 2002).²

¹ We write states with letter indices (s_A, s_B, ...) when referring to states in the tree across simulations, and use time indices (s_0, s_1, ...) for states within a simulation.

² In this case, the simulation policy is deterministic, π(s) = argmax_a Q(s, a) + c √(log(N(s)) / N(s, a)).
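To make the control flow of Algorithm 1 concrete, the following Python sketch implements a value-network MCTS under simplifying assumptions: env_model(s, a) is a deterministic model returning (next_state, reward), value_net(s) is the leaf evaluator, states are hashable, and a single fixed action set is shared by all states. It is an illustration of the algorithm above, not the implementation used in the paper.

```python
import math
from collections import defaultdict

def mcts_value_network(root, env_model, value_net, actions,
                       num_simulations=100, gamma=1.0, c=1.0):
    """Illustrative value-network MCTS (Algorithm 1); states must be hashable."""
    N = defaultdict(int)      # visit counts N(s)
    Na = defaultdict(int)     # visit counts N(s, a)
    Q = defaultdict(float)    # mean evaluation Q(s, a)

    def uct_action(s):
        # Deterministic UCT simulation policy (footnote 2); unvisited actions first.
        def score(a):
            if Na[(s, a)] == 0:
                return float('inf')
            return Q[(s, a)] + c * math.sqrt(math.log(N[s]) / Na[(s, a)])
        return max(actions, key=score)

    for _ in range(num_simulations):
        s, path, rewards = root, [], []
        # 1-2. Forward simulation until a leaf node (N(s) == 0) is reached.
        while N[s] > 0:
            a = uct_action(s)
            s_next, r = env_model(s, a)
            path.append((s, a))
            rewards.append(r)
            s = s_next
        # 3. Evaluate the leaf with the value network.
        leaf_value = value_net(s)
        N[s] = 1
        # 4. Back up the Monte-Carlo return along the traversed path.
        ret = leaf_value
        for (s_t, a_t), r_t in reversed(list(zip(path, rewards))):
            ret = r_t + gamma * ret
            Q[(s_t, a_t)] += (ret - Q[(s_t, a_t)]) / (Na[(s_t, a_t)] + 1)
            N[s_t] += 1
            Na[(s_t, a_t)] += 1

    # Final decision: the most-visited action at the root.
    return max(actions, key=lambda a: Na[(root, a)])
```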
Conditioned on a tree path function with range [0, 1] and f is the learned update func- m+1 p = s0, a0, s1, a1, ··· , sL for simulation m + 1, the tion. We justify this architecture in Sec. 4.2, by comparing MCTSnet memory gets updated as follows: it to a simpler MLP which maps φ to the updated value of hs. 1. For t = L: Learned simulation policy π In its basic, unstructured hm+1 ← (s ) form, the simulation policy network is a simple MLP map- sL L (1) ping the statistics hs to the logits ψ(s, a), which define π(a|s; θs) ∝ exp(ψ(s, a)). 2. For t < L: We consider adding structure by modulating each logit with

3.3. MCTSnet: neural network architecture

We now present MCTSnet as a neural network architecture. The algorithm described above effectively defines a form of tree-structured memory: each node s_k of the tree maintains its own corresponding statistics h_k. The statistics are initialized by the embedding network, but otherwise kept constant until the next time the node is visited. They are then updated using the backup network. For a fixed tree expansion, this allows us to see MCTSnet as a deep residual and recursive network with numerous skip connections, as well as a large number of inputs, all corresponding to different potential future states of the environment. We describe MCTSnet again following this different viewpoint in this section.

It is useful to introduce an index for the simulation count m, so that the tree memory after simulation m is the set of h_s^m for all tree nodes s. Conditioned on a tree path p^{m+1} = s_0, a_0, s_1, a_1, ..., s_L for simulation m + 1, the MCTSnet memory gets updated as follows:

1. For t = L:

      h_{s_L}^{m+1} ← ε(s_L)      (1)

2. For t < L:

      h_{s_t}^{m+1} ← β(h_{s_t}^m, h_{s_{t+1}}^{m+1}, r(s_t, a_t), a_t; θ_b)      (2)

3. For all other s:

      h_s^{m+1} ← h_s^m      (node s is skipped by simulation m + 1)      (3)

The tree path that gates this memory update is sampled as:

      p^{m+1} ∼ P(s_0 a_0 ··· s_L | h^m)      (4)
              ∝ ∏_{t=0}^{L−1} π(a_t | h_{s_t}^m; θ_s) 1[s_{t+1} = T(s_t, a_t)],      (5)

where L is a random stopping time for the tree path defined by (N^m(s_{L−1}) > 0, N^m(s_L) = 0). An illustration of this update process is provided in Fig. 1.

Note that the computation flow of the MCTS network is not only defined by the final tree, but also by the order in which nodes are visited (the tree expansion process). Furthermore, taken as a whole, MCTSnet is a stochastic feed-forward network with a single input (the initial state) and a single output (the action probabilities). However, thanks to the tree-structured memory, MCTSnet naturally allows for partial replanning, in a fashion similar to MCTS. Assume that from root state s_A, the MCTS network chooses action a, and transitions (in the real environment) to a new state s'_A. We can initialize the MCTS network for s'_A as the subtree rooted in s'_A, and initialize node statistics of the subtree to their previously computed values; the resulting network would then be recurrent across real time-steps.

3.4. Design choices

We now provide design details for each sub-network in MCTSnet.

Backup β. The backup network contains a gated residual connection, allowing it to selectively ignore information originating from a node's subtree. It updates h_s as follows:

      h_s ← β(φ; θ_b) = h_s + g(φ; θ_b) f(φ; θ_b),      (6)

where φ = (h_s, h_{s'}, r, a) and where g is a learned gating function with range [0, 1] and f is the learned update function. We justify this architecture in Sec. 4.2, by comparing it to a simpler MLP which maps φ to the updated value of h_s.

Learned simulation policy π. In its basic, unstructured form, the simulation policy network is a simple MLP mapping the statistics h_s to the logits ψ(s, a), which define π(a|s; θ_s) ∝ exp(ψ(s, a)). We consider adding structure by modulating each logit with side-information corresponding to each action. One form of action-specific information is obtained from the child statistic h_{T(s,a)}. Another form of information comes from a learned policy prior µ over actions, with log-probabilities ψ_p(s, a; θ_p) = log µ(s, a; θ_p), as in PUCT (Rosin, 2011). In our case, the policy prior comes from learning a small, model-free residual network on the same data. Combined, we obtain the following modulated network version for the simulation policy logits:

      ψ(s, a) = w_0 ψ_p(s, a) + w_1 u(h_s, h_{T(s,a)}),      (7)

where u is a small network that combines information from the parent and child statistics.

Embedding ε and readout network ρ. The embedding network is a standard residual convolution network. The readout network, ρ, is a simple MLP that transforms a memory vector at the root into the required output format, in this case an action distribution. See the appendix for details.

3.5. Training MCTSnet

The readout network of MCTSnet ultimately outputs an overall decision or evaluation from the entire search. This final output may in principle be trained according to any loss function, such as by value-based or policy gradient reinforcement learning.

However, in order to focus on the novel aspects of our architecture, we choose to investigate the MCTSnet architecture in a supervised learning setup in which labels are desired actions a* in each state (the dataset is detailed in the appendix). Our objective is then to train MCTSnet to predict the action a* from state s.
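For illustration, the gated residual backup of Eq. (6) and the modulated logits of Eq. (7) can be written as small NumPy functions. The parameterisation below (single-layer gating and update functions, a one-hot action encoding inside φ, and a zero vector standing in for the statistics of unexpanded children) is a hypothetical stand-in for the learned sub-networks, not the configuration used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_backup(h_parent, h_child, reward, action_onehot, params):
    """Eq. (6): h_s <- h_s + g(phi) * f(phi), a gated residual update of the parent memory.

    params holds hypothetical weight matrices/vectors of compatible shapes.
    """
    phi = np.concatenate([h_parent, h_child, [reward], action_onehot])
    gate = sigmoid(params['Wg'] @ phi + params['bg'])     # g(phi), elementwise in [0, 1]
    update = np.tanh(params['Wf'] @ phi + params['bf'])   # f(phi), the proposed update
    return h_parent + gate * update

def modulated_logits(h_parent, child_memories, prior_logits, params, w0=1.0, w1=1.0):
    """Eq. (7): psi(s, a) = w0 * psi_p(s, a) + w1 * u(h_s, h_T(s,a)).

    child_memories[a] stands for h_T(s,a); a zero vector is assumed for
    children that have not been expanded yet.
    """
    logits = []
    for a, h_child in enumerate(child_memories):
        pair = np.concatenate([h_parent, h_child])
        u = float(params['wu'] @ np.tanh(params['Wu'] @ pair + params['bu']))
        logits.append(w0 * prior_logits[a] + w1 * u)
    return np.array(logits)
```

The gated form lets the backup start close to an identity skip-connection and only gradually admit information from the subtree, which is the behaviour examined experimentally in Sec. 4.2.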


Figure 1. This diagram shows an execution of a search with M = 4. (Top) The evolution of the search tree rooted at s_A after each simulation, with the last simulation path highlighted in red. (Bottom) The computation graph in MCTSnet resulting from these simulations. Black arrows represent the application of the embedding network ε(s) to initialize h at tree node s. Red arrows represent the forward traversal during a simulation, using the simulation policy (based on the last memory state) and the environment model, until a leaf node is reached. Blue arrows correspond to the backup network β, which updates the memory statistics h along the traversed simulation path based on the child statistic and the last updated parent memory (in addition to transition information such as reward). The diagram makes it clear that this backup mechanism can skip over simulations where a particular node was not visited. For example, the fourth simulation updates h_B based on h_B from the second simulation, since s_B was not visited during the third simulation. Finally, the readout network ρ, in green, outputs the action distribution based on the last root memory h_A. Additional diagrams are available in the appendix.

We denote by z_m the set of actions sampled stochastically during the m-th simulation, by z_{≤m} the set of all stochastic actions taken up to simulation m, and by z = z_{≤M} the set of all stochastic actions. The number of simulations M is either chosen to be fixed, or taken from a stochastic distribution p_M.

After performing the desired number of simulations, the network output is the action probability vector p_θ(a|s, z). It is a random function of the state s due to the stochastic actions z. We choose to optimize p_θ(a|s, z) so that the prediction is on average correct, by minimizing the average cross entropy between the prediction and the correct label a*. For a pair (s, a*), the loss is:

      ℓ(s, a*) = E_{z∼π(z|s)} [− log p_θ(a*|s, z)].      (8)

This can also be interpreted as a lower bound on the log-likelihood of the marginal distribution p_θ(a|s) = E_{z∼π(z|s)} (p_θ(a|z, s)).

We minimize ℓ(s, a*) by computing a single-sample estimate of its gradient (Schulman et al., 2015):

      ∇_θ ℓ(s, a*) = −E_z [ ∇_θ log p_θ(a*|s, z) + (∇_θ log π(z|s; θ_s)) log p_θ(a*|s, z) ].      (9)

The first term of the gradient corresponds to the differentiable path of the network as described in the previous section. The second term corresponds to the gradient with respect to the simulation distribution, and uses the REINFORCE or score-function method. In this term, the final log-likelihood log p_θ(a*|s, z) plays the role of a 'reward' signal: in effect, the quality of the search is determined by the confidence in the correct label (as measured by its log-likelihood); the higher that confidence, the better the tree expansion, and the more the stochastic actions z that induced that tree expansion will be reinforced. In addition, we follow the common method of adding a neg-entropy regularization term on π(a|s; θ_s) to the loss, to prevent premature convergence.

3.6. A credit assignment technique for anytime algorithms

Although it is unbiased, the REINFORCE gradient above has high variance; this is due to the difficulty of credit assignment in this problem: the number of decisions that contribute to a single decision a* is large (between O(M log M) and O(M²) for M simulations), and understanding how each decision contributed to a low error through a better tree expansion structure is very intricate.

In order to address this issue, we design a novel credit assignment technique for anytime algorithms, by casting the loss minimization for a single example as a sequential decision problem, and using reinforcement learning techniques to come up with a family of estimators, allowing us to manipulate the bias-variance trade-off.
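In an automatic-differentiation framework, the single-sample estimate of Eq. (9) is usually obtained by building a surrogate objective whose gradient coincides with it: the cross-entropy of the final prediction, plus a score-function term in which the stopped-gradient log-likelihood weights the log-probabilities of the sampled tree-expansion actions, plus the neg-entropy regulariser. The TensorFlow sketch below only illustrates this construction; the tensor layout, names and entropy coefficient are assumptions, not the paper's training code, and the 'reward' used here is the plain terminal one that the credit assignment scheme below refines.

```python
import tensorflow as tf

def mctsnet_surrogate_loss(readout_logits, expert_action,
                           sim_logits, sim_actions, entropy_coef=1e-2):
    """Single-sample surrogate objective whose gradient matches Eq. (9).

    readout_logits: [num_actions], output logits of the readout network rho.
    expert_action:  scalar int tensor, the label a*.
    sim_logits:     [num_decisions, num_actions], simulation-policy logits for
                    every stochastic tree-expansion decision taken in the search.
    sim_actions:    [num_decisions] int tensor, the sampled decisions z.
    """
    # Differentiable path: cross-entropy -log p_theta(a* | s, z) of the final prediction.
    xent = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=expert_action, logits=readout_logits)

    # Score-function (REINFORCE) path: sum of log-probabilities of the sampled
    # decisions, weighted by the stopped-gradient cross-entropy; minimising this
    # reinforces decisions in proportion to log p_theta(a*|s, z).
    sum_log_prob = -tf.reduce_sum(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=sim_actions, logits=sim_logits))
    reinforce = tf.stop_gradient(xent) * sum_log_prob

    # Neg-entropy regularisation of the simulation policy (prevents premature convergence).
    log_pi = tf.nn.log_softmax(sim_logits, axis=-1)
    neg_entropy = tf.reduce_sum(tf.nn.softmax(sim_logits, axis=-1) * log_pi)

    return xent + reinforce + entropy_coef * neg_entropy
```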

Consider a general anytime algorithm which, given an initial state s, can run for an arbitrary number of internal steps M; in MCTSnet, these are simulations. For each step m = 1 ... M, any number of stochastic decisions (collectively denoted z_m) may be taken, and at the end of each step, a candidate output distribution p_θ(a|s, z_{≤m}) may be evaluated against a loss function ℓ. The value of the loss at the end of step m is denoted ℓ_m = ℓ(p_θ(a|s, z_{≤m})). We assume the objective is to maximize the terminal negative loss −ℓ_M. Letting ℓ_0 = 0, we rewrite the terminal loss as a telescoping sum:

      −ℓ_M = −(ℓ_M − ℓ_0) = Σ_{m=1...M} −(ℓ_m − ℓ_{m−1}) = Σ_{m=1...M} r̄_m,

where we define the (surrogate) reward r̄_m as the decrease in loss −(ℓ_m − ℓ_{m−1}) obtained during the m-th step. We then define the return R_m = Σ_{m'≥m} r̄_{m'} as the sum of future rewards from step m; by definition we have R_1 = −ℓ_M. The REINFORCE term of equation (9), −∇_θ log π(z|s; θ_s) log p_θ(a*|s, z) = A, can be rewritten:

      A = Σ_m ∇_θ log π(z_m|s; θ_s) R_1.      (10)

Since stochastic variables in z_m can only affect future rewards r̄_{m'}, m' ≥ m, it follows from a classical policy gradient argument that (10) is, in expectation, also equal to:

      A = Σ_m ∇_θ log π(z_m|s, z_{<m}; θ_s) R_m.

Introducing a discount γ ≤ 1 over future surrogate rewards gives the discounted return R_m^γ = Σ_{m'≥m} γ^{m'−m} r̄_{m'}, which can be used in place of R_m in the estimator above. R_m^γ can be rewritten as the average of future baselined losses ℓ_{m+t} − ℓ_{m−1}, where t follows a truncated geometric distribution with parameter γ and maximum value M − m. Letting γ = 0 leads to a greedy behavior, where actions of simulation m are only chosen so as to maximize the immediate improvement in loss −(ℓ_m − ℓ_{m−1}). This myopic mode is linked to the single-step assumption proposed by Russell & Wefald (1989) in an analog context.

4. Experiments

Figure 2. Evolution of success ratio in Sokoban during training using a continuous evaluator. MCTSnet (with M = 25) against two model-free baselines. In one case (M = 2), the copy-model has access to the same number of parameters and the same subnetworks. When M = 25, the baseline also matches the amount of computation. We also provide performance of MCTS with UCT with a variable number of simulations.

We investigated the effect of the gated residual design of the backup network (Eq. 6). For larger searches (M > 25), we found the non-residual backup network to be simply numerically unstable. This can be explained by a large number of updates being recurrently applied, which can cause divergence when the network has arbitrary form. Instead, the gated network can quickly reduce the influence of the update coming from the subtree, and then slowly depart from the identity skip-connection; this is the behavior we've observed in practice.

We also examine the benefits of the approximate credit assignment scheme proposed in Sec. 3.6 and the modulated policy architecture in Sec. 3.4. To show this, we train MCTSnets for different values of the discount γ using the modulated policy architecture. The results in Fig. 4b demonstrate that the value of γ = 1, for which the network is optimizing the true loss, is not the ideal choice in MCTSnet. Lower values of γ perform better at different stages of training. In late training, when the estimation problem is more stationary, the advantage of γ < 1 reduces but remains. The best performing MCTSnet architecture is shown in comparison to others in Fig. 4a.

⁴ A video of MCTSnet solving Sokoban levels is available here: https://goo.gl/2Bu8HD.
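A minimal sketch of the scheme used in these experiments, assuming the losses ℓ_1, ..., ℓ_M have been recorded after every simulation: the surrogate rewards r̄_m and discounted returns R_m^γ computed below replace the single terminal return R_1 = −ℓ_M that weights every log-probability term in Eq. (10). The function name and the way the returns are subsequently combined with the per-simulation log-probabilities are illustrative assumptions.

```python
import numpy as np

def credit_assignment_returns(losses, gamma=1.0):
    """Per-simulation returns for the anytime credit assignment scheme (Sec. 3.6).

    losses: array [M] with losses l_1..l_M measured after each simulation.
    Returns R_m = sum_{m'>=m} gamma^(m'-m) * r_m', with surrogate reward
    r_m = -(l_m - l_{m-1}) and l_0 = 0.  With gamma=1, R_1 equals -l_M;
    with gamma=0, each return reduces to the greedy reward -(l_m - l_{m-1}).
    """
    losses = np.asarray(losses, dtype=np.float64)
    prev = np.concatenate([[0.0], losses[:-1]])   # l_0 = 0
    rewards = -(losses - prev)                    # surrogate rewards r_m
    returns = np.zeros_like(rewards)
    acc = 0.0
    for m in reversed(range(len(rewards))):       # discounted suffix sums
        acc = rewards[m] + gamma * acc
        returns[m] = acc
    return returns
```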

Figure 4. a) Comparison of different simulation policy strategies in MCTSnet with M = 25 simulations. b) Effect of different values of γ in our approximate credit assignment scheme.

We also investigated whether simply providing the policy prior term in Eq. 7 (i.e., setting w_1 = 0) could match these results. With the policy prior learned with the right entropy regularization, this is indeed a well-performing simulation policy for MCTSnet with 25 simulations, but it did not match our best performing learned policy. We refer to it as distilled simulation.

4.4. Scalability with number of simulations

Thanks to weight sharing, MCTSnets can in principle be run for an arbitrary number of simulations M ≥ 1. With a larger number of simulations, the search has progressively more opportunities to query the environment model. To check whether our search network could take advantage of this during training, we compare MCTSnet for different numbers of simulations M, applied during both training and evaluation. In Fig. 5, we find that our approach was able to query and extract relevant information with additional simulations, generally achieving better results with fewer training steps.

Figure 5. Performance of MCTSnets trained with different numbers of simulations M (nsims). Generally, larger searches achieve better results with fewer training steps.

5. Discussion

We have shown that it is possible to learn successful search algorithms by framing them as a dynamic computational graph that can be optimized with gradient-based methods. This may be viewed as a step towards a long-standing AI ambition: meta-reasoning about the internal processing of the agent. In particular, we proposed a neural version of the MCTS algorithm. The aim was to maintain the desirable properties of MCTS while allowing some flexibility to improve on the choice of nodes to expand, the statistics to store in memory, and the way in which they are propagated, all using gradient-based learning. A more pronounced departure from existing search algorithms could also be considered, although this may come at the expense of a harder optimization problem. We have also assumed that the true environment is available as a simulator, but this could also be relaxed: a model of the environment could be learned separately (Weber et al., 2017), or even end-to-end (Silver et al., 2017b). One advantage of this approach is that the search algorithm could learn how to make use of an imperfect model. Although we have focused on a supervised learning setup, our approach could easily be extended to a reinforcement learning setup by leveraging policy iteration with MCTS (Silver et al., 2017a; Anthony et al., 2017). We have focused on small searches, more similar in scale to the plans that are processed by the human brain (Arbib, 2003), than to the massive-scale searches in high-performance games or planning applications. In fact, our learned search performed better than a standard MCTS with more than an order-of-magnitude more computation, suggesting that learned approaches to search may ultimately replace their handcrafted counterparts.

References

Anthony, T., Tian, Z., and Barber, D. Thinking fast and slow with deep learning and tree search. arXiv preprint arXiv:1705.08439, 2017.

Arbib, M. A. The Handbook of Brain Theory and Neural Networks. MIT Press, 2003.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

Baxter, J., Tridgell, A., and Weaver, L. KnightCap: A chess program that learns by combining TD(λ) with game-tree search. In Proceedings of the 15th International Conference on Machine Learning, 1998.

Botea, A., Müller, M., and Schaeffer, J. Using abstraction for planning in Sokoban. In Computers and Games, volume 2883, pp. 360, 2003.

Chang, K.-W., Krishnamurthy, A., Agarwal, A., Daumé, H., and Langford, J. Learning to search better than your teacher. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2058–2066, 2015.

Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pp. 72–83. Springer, 2006.

Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. In ICLR, 2017.

Hay, N. and Russell, S. J. Metareasoning for Monte Carlo tree search. Technical Report UCB/EECS-2011-119, EECS Department, University of California, Berkeley, 2011.

Jünger, M., Liebling, T. M., Naddef, D., Nemhauser, G. L., Pulleyblank, W. R., Reinelt, G., Rinaldi, G., and Wolsey, L. A. 50 Years of Integer Programming 1958-2008: From the Early Years to the State-of-the-Art. Springer, 2009.

Knuth, D. E. and Moore, R. W. An analysis of alpha-beta pruning. Artificial Intelligence, 6(4):293–326, 1975.

Kocsis, L. and Szepesvári, C. Bandit based Monte-Carlo planning. In ECML, volume 6, pp. 282–293. Springer, 2006.

Kocsis, L., Szepesvári, C., and Winands, M. H. RSPSA: Enhanced parameter optimization in games. In Advances in Computer Games, pp. 39–56. Springer, 2005.

Pascanu, R., Li, Y., Vinyals, O., Heess, N., Buesing, L., Racanière, S., Reichert, D., Weber, T., Wierstra, D., and Battaglia, P. Learning model-based planning from scratch. arXiv preprint arXiv:1707.06170, 2017.

Rosin, C. D. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.

Russell, S. Rationality and intelligence. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, pp. 950–957. Morgan Kaufmann Publishers Inc., 1995.

Russell, S. and Wefald, E. On optimal game-tree search using rational meta-reasoning. In Proceedings of the 11th International Joint Conference on Artificial Intelligence - Volume 1, pp. 334–340, 1989.

Samuel, A. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210, 1959.

Schaeffer, J. The games computers (and people) play. Advances in Computers, 52:189–266, 2000.

Schaeffer, J., Hlynka, M., and Jussila, V. Temporal difference learning applied to a high-performance game-playing program. In Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 1, pp. 529–534. Morgan Kaufmann Publishers Inc., 2001.

Schulman, J., Heess, N., Weber, T., and Abbeel, P. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pp. 3528–3536, 2015.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., et al. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017a.

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., et al. The predictron: End-to-end learning and planning. In ICML, 2017b.

Tesauro, G. Connectionist learning of expert preferences by comparison training. In Advances in Neural Information Processing Systems, pp. 99–106, 1988.

Tesauro, G. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6:215–219, 1994.

Veness, J., Silver, D., Blair, A., and Uther, W. Bootstrapping from game tree search. In Advances in Neural Information Processing Systems, pp. 1937–1945, 2009.

Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y., et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.

A. Architectural choices and experimental setup

For the Sokoban domain, we use a 10 × 10 map layout with four boxes and targets. For level generation, we gained access to the level generator described by Weber et al. (2017). We directly provide a symbolic representation of the environment, coded as a 10 × 10 × 4 feature map (with one feature map per type of object: wall, agent, box, target); see Fig. 8 for a visual representation.
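For illustration, this symbolic input can be assembled as follows; the level encoding (a mapping from grid coordinates to object type) and the ordering of the four planes are assumptions made for this sketch rather than details taken from the paper.

```python
import numpy as np

OBJECT_PLANES = {'wall': 0, 'agent': 1, 'box': 2, 'target': 3}

def encode_sokoban(level, size=10):
    """Binary 10x10x4 feature map, one plane per object type."""
    planes = np.zeros((size, size, len(OBJECT_PLANES)), dtype=np.float32)
    for (row, col), obj in level.items():   # level: {(row, col): 'wall'|'agent'|'box'|'target'}
        planes[row, col, OBJECT_PLANES[obj]] = 1.0
    return planes
```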

A.1. Dataset Vanilla MCTS with a deep value network (see Alg. 1) and long search times (approx. 2000 simulations per step) was employed to generate good quality trajectories, as an oracle surrogate. Specifically, we use a pre-trained value network for leaf evaluation, depth-wise transposition tables to deal with symmetries, and we reuse the relevant search subtree after each real step. Other solving methods could have been used since this is only to generate labels for supervised learning. The training dataset consists of 250000 trajectories of distinct levels; approximately 92% of those levels are solved by the agent; solved levels take on average 60 steps, while unsolved levels are interrupted after 100 steps. We also create a testing set with 2500 trajectories.

A.2. Network architecture details

Our embedding network ε is a convolutional network with 3 residual blocks. Each residual block is composed of two 64-channel convolution layers with 3×3 kernels applied with stride 1. The residual blocks are preceded by a convolution layer with the same properties, and followed by a convolution layer with a 1×1 kernel mapping to 32 channels. A linear layer maps the final convolutional activations into a 1D vector of size 128. The readout network is a simple MLP with a single hidden layer of size 128. Non-linearities between all layers are ReLUs. The policy prior network has a similar architecture to the embedding network, but with 2 residual blocks and 32-channel convolutions.
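A possible rendering of this embedding network in tf.keras is sketched below, following the description above (a 64-channel 3×3 convolution, three residual blocks of two 64-channel 3×3 convolutions, a 1×1 convolution down to 32 channels, and a linear layer to a 128-dimensional vector). The exact placement of non-linearities and the absence of normalisation layers are guesses rather than the authors' configuration.

```python
import tensorflow as tf

def embedding_network(input_shape=(10, 10, 4), channels=64, n_blocks=3, embed_dim=128):
    """Residual convolutional embedding network epsilon (see A.2); illustrative only."""
    x_in = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(channels, 3, strides=1, padding='same', activation='relu')(x_in)
    for _ in range(n_blocks):
        y = tf.keras.layers.Conv2D(channels, 3, padding='same', activation='relu')(x)
        y = tf.keras.layers.Conv2D(channels, 3, padding='same')(y)
        x = tf.keras.layers.ReLU()(tf.keras.layers.Add()([x, y]))
    x = tf.keras.layers.Conv2D(32, 1, padding='same', activation='relu')(x)
    x = tf.keras.layers.Flatten()(x)
    h = tf.keras.layers.Dense(embed_dim)(x)          # 128-dimensional memory vector
    return tf.keras.Model(x_in, h)
```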

A.3. Implementation and Training

We implemented MCTSnet in TensorFlow, making extensive use of control flow operations to generate the necessary search logic. Tree node and simulation statistics are stored in arrays of variables and tensors that are accessed with sparse operators. The variables help determine the computational graph for a given execution (by storing the relation between nodes and temporary variables such as rewards). A single forward execution of the network generates the full search, which includes multiple simulations. For a fixed expanded tree, gradients can flow through all the node statistics to learn the backup and embedding networks.

We train MCTSnet in an asynchronous distributed fashion. We use a batch of size 1, 32 workers, and SGD for optimization with a learning rate of 5e-4.

B. Additional figures


Figure 6. Diagram illustrating an MCTSnet search (i.e., a single forward pass of the network). The first two simulations are detailed, as well as an extract of the last simulation where the readout network is employed to output the action distribution from the root memory statistic. A detail of a longer simulation is given in Figure 7.


Figure 7. This diagram represents simulation m + 1 in MCTSnet (applied to Sokoban). The leftmost part represents the simulation phase down the tree until a leaf node s_L, using the current state of the tree memory at the end of simulation m. The right part of the diagram illustrates the embedding and backup phase, where the memory vectors h on the traversed tree path get updated in a bottom-up fashion. Memory vectors for nodes not visited on this tree path stay constant.


Figure 8. The different elements composing a Sokoban frame. The frame is represented to the agent as four 10 × 10 planes (wall, agent, box, target).