An Analysis of Monte Carlo Tree Search
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) An Analysis of Monte Carlo Tree Search Steven James,∗ George Konidaris,† Benjamin Rosman∗‡ ∗University of the Witwatersrand, Johannesburg, South Africa †Brown University, Providence RI 02912, USA ‡Council for Scientific and Industrial Research, Pretoria, South Africa [email protected], [email protected], [email protected] Abstract UCT calls for rollouts to be performed by randomly se- lecting actions until a terminal state is reached. The outcome Monte Carlo Tree Search (MCTS) is a family of directed of the simulation is then propagated to the root of the tree. search algorithms that has gained widespread attention in re- cent years. Despite the vast amount of research into MCTS, Averaging these results over many iterations performs re- the effect of modifications on the algorithm, as well as the markably well, despite the fact that actions during the sim- manner in which it performs in various domains, is still not ulation are executed completely at random. As the outcome yet fully known. In particular, the effect of using knowledge- of the simulations directly affects the entire algorithm, one heavy rollouts in MCTS still remains poorly understood, with might expect that the manner in which they are performed surprising results demonstrating that better-informed rollouts has a major effect on the overall strength of the algorithm. often result in worse-performing agents. We present exper- A natural assumption to make is that completely random imental evidence suggesting that, under certain smoothness simulations are not ideal, since they do not map to realistic conditions, uniformly random simulation policies preserve actions. A different approach is that of so-called heavy roll- the ordering over action preferences. This explains the suc- cess of MCTS despite its common use of these rollouts to outs, where moves are intelligently selected using domain- evaluate states. We further analyse non-uniformly random specific rules or knowledge. Counterintuitively, some results rollout policies and describe conditions under which they of- indicate that using these stronger rollouts can actually result fer improved performance. in a decrease in overall performance (Gelly and Silver 2007). We propose that the key aspect of a rollout policy is the Introduction way in which it ranks the available actions. We demonstrate that under certain smoothness conditions, a uniformly ran- Monte Carlo Tree Search (MCTS) is a general-purpose plan- dom rollout policy preserves the optimal action ranking, ning algorithm that has found great success in a number which in turn allows UCT to select the correct action at the of seemingly unrelated applications, ranging from Bayesian root of the tree. reinforcement learning (Guez, Silver, and Dayan 2013) We also investigate the effect of heavy rollouts. Given that to General-Game Playing (Finnsson and Bjornsson¨ 2008). both heavy and uniformly random policies are suboptimal, Originally developed to tackle the game of Go (Coulom we are interested in the reason these objectively stronger 2007), it is often applied to domains where it is difficult rollouts can often negatively affect the performance of UCT. to incorporate expert knowledge. MCTS combines a tree We show that heavy rollouts can indeed improve perfor- search approach with Monte Carlo simulations (also known mance, but identify low-variance policies as potentially dan- as rollouts), and uses the outcome of these simulations to gerous choices, which can lead to worse performance. evaluate states in a lookahead tree. It has also shown itself to There are a number of conflating factors that make be a flexible planner, recently combining with deep neural analysing UCT in the context of games difficult, espe- networks to achieve superhuman performance in Go (Silver cially in the multi-agent case: the strength of our opponents, et al. 2016). whether they adapt their policies in response to our own, While many variants of MCTS exist, the Upper Confi- and the requirement of rollout policies for multiple play- dence bound applied to Trees (UCT) algorithm (Kocsis and ers. Aside from Silver and Tesauro (2009) who propose the Szepesvari´ 2006) is widely used in practice, despite its short- concept of simulation balancing to learn a Go rollout pol- comings (Domshlak and Feldman 2013). A great deal of icy that is weak but “fair” to both players, there is little analysis on UCT revolves around the tree-building phase of to indicate how best to simulate our opponents. Further- the algorithm, which provides theoretical convergence guar- more, the vagaries of the domain itself can often add to the antees and upper-bounds on the regret (Coquelin and Munos complexity—Nau (1983) demonstrates how making better 2007). Less, however, is understood about the simulation decisions throughout a game does not necessarily result in phase. the expected increase in winning rate. Given all of the above, Copyright c 2017, Association for the Advancement of Artificial we simplify matters by restricting our investigation to the Intelligence (www.aaai.org). All rights reserved. single-agent case. 3576 Background and average return Xs. For a given node s, the policy then In this paper we consider only environments that can be selects child i that maximises the upper confidence bound modelled by a finite Markov Decision Process (MDP). An MDP is defined by the tuple S, A,T,R,γ over states S, 2ln(ns) Xi + Cp , actions A, transition function T : S×A×S→[0, 1], re- ni ward function R : S×A×S →R, and discount factor γ ∈ (0, 1) (Sutton and Barto 1998). where Cp is an exploration parameter. Our aim is to learn a policy π : S×A → [0, 1] Once a leaf node is reached, the expansion phase adds a that specifies the probability of executing an action in a new node to the tree. A simulation is then run from this node given state so as to maximise our expected return. Sup- according to the rollout policy, with the outcome being back- pose that after time step t we observe the sequence of propagated up through the tree, updating the nodes’ average rewards rt+1,rt+2,rt+3,... Our expected return E[Rt] is scores and visitation counts. then simply the discounted sum of rewards, where Rt = This cycle of selection, expansion, simulation and back- ∞ i i=0 γ rt+i+1. propagation is repeated until some halting criteria is met, at For a given policy π, we can calculate the value of any which point the best action is selected. There are numerous state s as the expected reward gained by following π: strategies for the final action selection, such as choosing the action that leads to the most visited state (robust child), or π E | V (s)= π[Rt st = s]. the one that leads to the highest valued state (max child). We A policy π∗ is said to be optimal if it achieves the maxi- adopt the robust child strategy throughout this paper, noting mum possible value in all states. That is, ∀s ∈S, that previous results have shown little difference between the ∗ two (Chaslot et al. 2008). V π (s)=V ∗(s)=maxV π(s). π Smoothness One final related concept is that of the Q-value function There is some evidence to suggest that the key property which assigns a value to an action in a given state under of a domain is the smoothness of its underlying value policy π: function. The phenomenon of game tree pathology (Nau π Q (s, a)=Eπ[Rt | st = s, at = a]. 1982), as well as work by Ramanujan, Sabharwal, and Sel- man (2011), advance the notion of trap states, which occur Similarly, the optimal Q-value function is given by when the value of two sibling nodes differs greatly—the lat- ∗ Qπ (s, a)=Q∗(s, a)=maxQπ(s, a). ter argue that UCT is unsuited to domains possessing many π such states. Furthermore, in the context of X -armed bandits (where X is some measurable space), UCT can be seen as Monte Carlo Tree Search a specific instance of the Hierarchical Optimistic Optimisa- MCTS iteratively builds a search tree by executing four tion algorithm, which attempts to find the global maximum phases (Figure 1). Each node in the tree represents a single of the expected payoff function using MCTS. Its selection state, while the tree’s edges correspond to actions. In the se- policy is similar to UCB1, but contains an additional term lection phase, a child-selection policy is recursively applied that depends on the smoothness of the function. For an in- until a leaf node1 is reached. finitely smooth function, this term goes to 0 and the algo- rithm becomes UCT (Bubeck et al. 2011). In defining what is meant by smoothness, one notion that can be employed is that of Lipschitz continuity, which limits the rate of change of a function. Formally, a value function V is M-Lipschitz continuous if ∀s, t ∈S, |V (s) − V (t)|≤Md(s, t), where M ≥ 0 is a constant, d(s, t)= k(s) − k(t) and k (a) Selection (b) Expansion (c) Simulation (d) Backpropagation is a mapping from state-space to some normed vector space (Pazis and Parr 2011). Figure 1: Phases of the Monte Carlo tree search algorithm. A search tree, rooted at the current state, is grown through UCT in Smooth Domains repeated application of the above four phases. We tackle the analysis of UCT by first considering only the role of the simulation policy in evaluating states (ignoring UCT uses a policy known as UCB1, a solution to the the selection and iterative tree-building phase of UCT), and multi-armed bandit problem (Auer, Cesa-Bianchi, and Fis- then discussing the full algorithm.