Dual Monte Carlo Tree Search

Prashank Kadam 1 Ruiyang Xu 1 Karl Lieberherr 1

Abstract

AlphaZero, using a combination of Deep Neural Networks and Monte Carlo Tree Search (MCTS), has successfully trained game-playing agents in a tabula-rasa way. The neural MCTS algorithm has been successful in finding near-optimal strategies for games through self-play. However, the AlphaZero algorithm has a significant drawback; it takes a long time to converge and requires high computational power due to complex neural networks for solving games like Chess, Go, Shogi, etc. Owing to this, it is very difficult to pursue neural MCTS research without cutting-edge hardware, which is a roadblock for many aspiring neural MCTS researchers. In this paper, we propose a new neural MCTS algorithm, called Dual MCTS, which helps overcome these drawbacks. Dual MCTS uses two different search trees, a single deep neural network, and a new update technique for the search trees using a combination of the PUCB, a sliding window, and the ε-greedy algorithm. This technique is applicable to any MCTS-based algorithm to reduce the number of updates to the tree. We show that Dual MCTS performs better than one of the most widely used neural MCTS algorithms, AlphaZero, for various symmetric and asymmetric games.

1 Khoury College of Computer Sciences, Northeastern University, Boston, Massachusetts. Correspondence to: Prashank Kadam.

1. Introduction

Deepmind's AlphaGo (Silver et al., 2016b) was the first algorithm able to beat the human Go champion. AlphaGo uses a combination of Monte Carlo Tree Search (MCTS) and a Deep Neural Network (DNN). However, the algorithm had some game-specific configurations, which did not make it a generic algorithm that could be used for playing any game. Deepmind later came up with AlphaZero (Silver et al., 2017), which is a generic algorithm that can be used to play any game. Since then, a combination of MCTS and a DNN has been the most widely used way to build game-playing programs. There is a need to balance accurate state estimation by the DNN and the number of simulations by the MCTS. Larger DNNs are better for accurate evaluations but take a toll on the cost of computation. On the other hand, a smaller network can be used for faster evaluations and thus a larger number of MCTS simulations in the given amount of time. There is a need to find the optimum trade-off between the network's size and the number of MCTS simulations. One of the recently developed methods for this purpose is the Multiple Policy Value Monte Carlo Tree Search (MPV-MCTS) algorithm (Lan et al., 2019), which combines two Policy-Value Neural Networks (PV-NNs) of different sizes to retain the advantages of each network. The smaller network performs a larger number of simulations on its tree to assign priorities to states based on its evaluations. The larger network then evaluates these states, starting from the highest priority, to achieve better accuracy. This algorithm shows a notable improvement in convergence compared to AlphaZero, but the two-network configuration requires high run-time memory, due to which the algorithm is difficult to run locally on average hardware.

Another problem that all neural MCTS algorithms face is the large number of updates required during the backup phase of the tree. These updates increase exponentially as the tree depth increases. The computational time takes a big hit due to these updates, and there is a need to reduce the number of updates to the tree while keeping the values of each tree node highly accurate. We propose a technique that uses a combination of a sliding window and an ε-greedy search over a Polynomial Upper Confidence Tree (PUCT) (Rosin, 2011) to achieve this objective, thus reducing the time required for updates considerably. This technique can be applied to any MCTS-based algorithm as it optimizes over the core algorithm.

MB(q, n) =
    True,   if n = 2
    False,  if n < 2 ∨ (q = 0 ∧ n > 2)
    ∃m ∈ [1..n − 1] : MB(q − 1, m) ∧ MB(q − 1, n − m),   otherwise
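The recursion MB(q, n) displayed above can be evaluated directly with memoization. The sketch below is a literal transcription of the three cases, provided only as a reading aid; the function name and the small demonstration at the bottom are ours and assume nothing beyond the displayed definition.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def mb(q: int, n: int) -> bool:
    """Literal transcription of the MB(q, n) recursion displayed above."""
    if n == 2:
        return True
    if n < 2 or (q == 0 and n > 2):
        return False
    # Otherwise: there exists m in [1..n-1] with MB(q-1, m) and MB(q-1, n-m).
    return any(mb(q - 1, m) and mb(q - 1, n - m) for m in range(1, n))

if __name__ == "__main__":
    print([(n, mb(3, n)) for n in range(2, 10)])
```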

In this paper, we develop a novel algorithm that helps overcome the drawbacks of AlphaZero and MPV-MCTS and helps in accelerating training speeds over various symmetric and asymmetric games. We show our improvements over AlphaZero and MPV-MCTS using the Elo rating, α-rank, and training times, which are the most widely used metrics for evaluating games. We train our model on symmetric and asymmetric problems with different complexities and show that our model's performance improves considerably, compared to other neural MCTS algorithms, as the game's state space increases.

2. Background

2.1. AlphaZero

In a nutshell, AlphaZero uses a single neural network as the policy and value approximator. During each learning iteration, it carries out multiple rounds of self-play. Each self-play runs several MCTS simulations to estimate an empirical policy at each state, then samples from that policy, takes a move, and continues. After each round of self-play, the game's outcome is backed up to all states in the game trajectory. The game trajectories generated during self-play are then stored in a replay buffer, which is used to train the neural network.

In self-play, for a given state, the neural MCTS runs a given number of simulations on a game tree, rooted at that state, to generate an empirical policy. Each simulation, guided by the policy and value networks, passes through 4 phases:

1. SELECT: At the beginning of each iteration, the algorithm selects a path from the root (current game state) to a leaf (either a terminal state or an unvisited state) according to a predictor upper confidence boundary (PUCB) algorithm (Rosin, 2011). Specifically, suppose the root is s_0. The PUCB determines a series of states {s_0, s_1, ..., s_l} by the following process:

    a_i = argmax_a [ Q(s_i, a) + c·π_θ(s_i, a)·√(Σ_{a'} N(s_i, a')) / (N(s_i, a) + 1) ]
    s_{i+1} = move(s_i, a_i)        (1)

It has been proved in (Grill et al., 2020) that selecting simulation actions using Eq. 1 is equivalent to optimizing the empirical policy

    π̂(s, a) = (1 + N(s, a)) / (|A| + Σ_{a'} N(s, a'))        (2)

where |A| is the size of the current action space, so that it approximates the solution of the following regularized policy optimization problem:

    π* = argmax_π [ Q(s, ·)^T π(s, ·) − λ·KL[π_θ(s, ·), π(s, ·)] ]
    λ = √(Σ_{a'} N(s_i, a')) / (|A| + Σ_{a'} N(s, a'))        (3)

This also means that an MCTS simulation is a regularized policy optimization (Grill et al., 2020), and as long as the value network is accurate, the MCTS simulation will optimize the output policy so that it maximizes the action-value output while minimizing the change to the policy network.

2. EXPAND: Once the selection phase ends at an unvisited state s_l, the state will be fully expanded and marked as visited. All its child nodes will be considered as leaf nodes during the next iteration of selection.

3. ROLL-OUT: The roll-out is carried out for every child of the expanded leaf node s_l. Starting from any child of s_l, the algorithm uses the value network to estimate the result of the game; the value is then backed up to each node in the next phase.

4. BACKUP: This is the last phase of an iteration, in which the algorithm updates the statistics for each node in the selected states {s_0, s_1, ..., s_l} from the first phase. To illustrate this process, suppose the selected states and corresponding actions are

    {(s_0, a_0), (s_1, a_1), ..., (s_{l−1}, a_{l−1}), (s_l, _)}

Let V_θ(s_i) be the estimated value for child s_i. We want to update the Q-value so that it equals the averaged cumulative reward over each access of the underlying state, i.e., Q(s, a) = (Σ_{i=1}^{N(s,a)} r_t^i) / N(s, a). To rewrite this updating rule in an iterative form, for each (s_t, a_t) pair, we have:

    N(s_t, a_t) ← N(s_t, a_t) + 1
    Q(s_t, a_t) ← Q(s_t, a_t) + (V_θ(s_r) − Q(s_t, a_t)) / N(s_t, a_t)        (4)

Such a process is carried out for all of the roll-out outcomes from the last phase.

Once the given number of iterations has been reached, the algorithm returns the empirical policy π̂(s) for the current state s. After the MCTS simulation, the action is sampled from π̂(s), and the game moves to the next state. In this way, for each self-play iteration, MCTS samples each player's states and actions alternately until the game ends, which generates a trajectory for the current self-play. After a given number of self-plays, all trajectories are stored in a replay buffer so that they can be used to train and update the neural networks.

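To make the four phases concrete, the following is a minimal, self-contained Python sketch of the simulation loop. It is an illustration under our own simplifications, not the paper's implementation: a toy take-1-or-2 pile game stands in for a real game, a uniform-prior / zero-value stub stands in for the trained network, and, for brevity, the expanded leaf itself is evaluated rather than every child. Selection follows the PUCB rule of Eq. 1, the empirical policy follows Eq. 2, and the backup uses the incremental update of Eq. 4.

```python
import math
from collections import defaultdict

# Toy two-player game: a single pile of stones; a move removes 1 or 2 stones,
# and the player who takes the last stone wins. A state is
# (stones_left, player_to_move).
def legal_actions(s):
    return [a for a in (1, 2) if a <= s[0]]

def next_state(s, a):
    return (s[0] - a, 1 - s[1])

def is_terminal(s):
    return s[0] == 0

# Stand-in for the policy/value network: uniform prior, neutral value,
# both from the perspective of the player to move at s.
def policy_value(s):
    acts = legal_actions(s)
    return {a: 1.0 / len(acts) for a in acts}, 0.0

N = defaultdict(int)     # visit counts N(s, a)
Q = defaultdict(float)   # action values Q(s, a), from the mover's perspective
visited = set()
C_PUCT = 1.5

def select_action(s, prior):
    # PUCB rule (Eq. 1): argmax_a Q + c * prior * sqrt(sum_a' N) / (N + 1)
    total = sum(N[(s, a)] for a in prior)
    return max(prior, key=lambda a: Q[(s, a)] +
               C_PUCT * prior[a] * math.sqrt(total) / (N[(s, a)] + 1))

def simulate(root):
    # SELECT: follow PUCB through visited states until a leaf or terminal state.
    path, s = [], root
    while s in visited and not is_terminal(s):
        prior, _ = policy_value(s)
        a = select_action(s, prior)
        path.append((s, a))
        s = next_state(s, a)
    # EXPAND + ROLL-OUT: mark the leaf visited; estimate its value with the
    # value stub (a terminal state is a loss for the player to move, since the
    # opponent just took the last stone).
    if is_terminal(s):
        v = -1.0
    else:
        visited.add(s)
        _, v = policy_value(s)
    # BACKUP (Eq. 4): incremental mean update along the selected path, flipping
    # the sign each ply so Q(s, a) stays in the mover-at-s perspective.
    for (st, at) in reversed(path):
        v = -v
        N[(st, at)] += 1
        Q[(st, at)] += (v - Q[(st, at)]) / N[(st, at)]

def empirical_policy(s):
    # Eq. 2: pi_hat(s, a) = (1 + N(s, a)) / (|A| + sum_a' N(s, a'))
    acts = legal_actions(s)
    total = sum(N[(s, a)] for a in acts)
    return {a: (1 + N[(s, a)]) / (len(acts) + total) for a in acts}

if __name__ == "__main__":
    root = (5, 0)
    visited.add(root)
    for _ in range(400):
        simulate(root)
    print(empirical_policy(root))  # taking 2 from a pile of 5 is the winning move in this toy game
```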
2.2. Multiple Policy Value Monte Carlo Tree Search

AlphaZero uses a single DNN, called a Policy-Value Neural Network (PV-NN), consisting of two heads, one for policy and one for value approximation. In this subsection, we explain Multiple Policy Value Monte Carlo Tree Search (Lan et al., 2019), in which the overall system consists of two PV-NNs, f_S and f_L (the smaller and the larger network, respectively). Let b_S (b_L) be an assigned number of simulations, or budget; the corresponding network used would be f_S (f_L), and this network would have its own search tree T_S (T_L). As a part of this problem, we would now like to find a stronger policy:

    π(s, (f_S(b_S), f_L(b_L)))

such that b_S ≥ b_L. If a state is common to both trees, the value V(s) and prior probability P(s, a) would be the same. For each simulation, we choose either f_S or f_L and, depending on the choice, we choose the leaf state of T_S or T_L to evaluate and update the tree. Any number of ways could be defined for the networks to take turns, as long as b_S ≥ b_L.

This architecture's objective is for T_S to provide the benefits of a look-ahead search, where it balances between exploration and exploitation as it grows. Thus f_S, being a smaller network, should perform a higher number of simulations than f_L with the same amount of resources. In each simulation for f_S, a leaf is selected using the simulation count of f_S. In the case of f_L, for each simulation, we need to identify the critical states. Here we assume that the nodes in T_S with higher visit counts are more important, and hence these states are assigned a higher value while evaluating T_L. To achieve this, we select the states with the highest visit counts from T_S for each simulation. In case a particular leaf is not visited by f_S (rare), we re-select an unevaluated leaf state using PUCB for f_L instead. This process is repeated until convergence is achieved. Here the overall network is expected to have faster convergence due to the advantage of having a look-ahead search using T_S.

3. Dual Monte Carlo Tree Search (Dual MCTS)

This section defines the neural MCTS algorithm that we have developed as a part of this paper, which we call the Dual MCTS algorithm. The idea behind this algorithm is similar to MPV-MCTS, where we have two different tree simulations, i.e., a smaller tree and a larger tree, and we want to provide a look-ahead to the larger tree by running a higher number of simulations on the smaller tree and prioritizing the states evaluated by the smaller tree. In our case, we use a single neural network to simulate both the trees instead of using two separate networks like MPV-MCTS. We also introduce a novel method to reduce the number of value updates to the MCTS using a combination of the PUCT, a sliding window over the tree levels, and the ε-greedy algorithm. This technique applies to any neural MCTS-based algorithm.

We have a single DNN that generates two different trees. This network consists of two policy-value heads: one is connected to one of the network's intermediate layers, and the other is connected to the last layer of the network. What we are essentially doing here is subsetting a single network into two different sub-networks, where one acts as a smaller network and the other acts as a larger network. We can consider the network from the input layer to the first policy-value head as the smaller network, f_sub, and the network from the input layer to the second policy-value head, i.e., the complete network, as the larger network, f_full. Each of the networks f_sub and f_full generates its own tree, T_sub and T_full. Since f_sub is a smaller network, we can allocate a large simulation budget b_sub to this network. Thus, f_sub will generate a large number of self-plays in a short period. Although this network's value estimations will not be very accurate, we only want it to prioritize the most visited states. This priority list can then be used by f_full to give a higher preference to states with higher priorities while simulating its search tree. The two networks can then be represented by f_sub (f_full) and the corresponding search trees can be represented by T_sub (T_full). The policy over which we would like to optimize can therefore be given by:

    π(s, (f_sub(b_sub), f_full(b_full)))

such that b_sub ≥ b_full, where b_sub and b_full are the budgets assigned to f_sub and f_full, respectively. In case we find a state which is common to both the trees, the policy and the value for that state would be the same. Several methods can be used to combine multiple strategies during MCTS value evaluations. Some of these include Rapid Action Value Estimation (RAVE) (Gelly & Silver, 2011), implicit minimax backups (Lanctot et al., 2014) and asynchronous policy value MCTS (APV-MCTS) (Silver et al., 2016a). The first two combine MCTS evaluations with other heuristics. MPV-MCTS is most related to asynchronous policy value MCTS (Silver et al., 2016a). Hence, we will use the following method, which is similar to asynchronous policy value MCTS:

    V(s) = α·V_sub(s) + (1 − α)·V_full(s)

    P(s, a) = β·p_sub(a|s) + (1 − β)·p_full(a|s)

where α, β ∈ [0, 1] are weight coefficients.
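The single two-headed network can be sketched as follows. This is a minimal PyTorch illustration under assumed layer sizes (a small fully connected trunk; the class and argument names are ours, not the paper's code): the first policy-value head taps an intermediate layer (f_sub) and the second taps the final layer (f_full), and the α/β blending of values and priors follows the equations above.

```python
import torch
import torch.nn as nn

class PolicyValueHead(nn.Module):
    def __init__(self, in_dim, n_actions):
        super().__init__()
        self.policy = nn.Linear(in_dim, n_actions)
        self.value = nn.Linear(in_dim, 1)

    def forward(self, h):
        return torch.softmax(self.policy(h), dim=-1), torch.tanh(self.value(h))

class DualHeadNet(nn.Module):
    """One trunk, two policy-value heads: f_sub taps an intermediate layer,
    f_full taps the last layer."""
    def __init__(self, obs_dim, n_actions, width=64):
        super().__init__()
        self.lower = nn.Sequential(nn.Linear(obs_dim, width), nn.ReLU(),
                                   nn.Linear(width, width), nn.ReLU())
        self.upper = nn.Sequential(nn.Linear(width, width), nn.ReLU(),
                                   nn.Linear(width, width), nn.ReLU())
        self.head_sub = PolicyValueHead(width, n_actions)   # f_sub head
        self.head_full = PolicyValueHead(width, n_actions)  # f_full head

    def forward_sub(self, x):
        # Cheap f_sub evaluation: stop at the intermediate head.
        return self.head_sub(self.lower(x))

    def forward(self, x):
        h_mid = self.lower(x)
        p_sub, v_sub = self.head_sub(h_mid)
        p_full, v_full = self.head_full(self.upper(h_mid))
        return (p_sub, v_sub), (p_full, v_full)

def blend(net, x, alpha=0.5, beta=0.5):
    """V(s) = a*V_sub + (1-a)*V_full,  P(s,.) = b*p_sub + (1-b)*p_full."""
    (p_sub, v_sub), (p_full, v_full) = net(x)
    return beta * p_sub + (1 - beta) * p_full, alpha * v_sub + (1 - alpha) * v_full

if __name__ == "__main__":
    net = DualHeadNet(obs_dim=16, n_actions=4)
    p, v = blend(net, torch.randn(1, 16))
    print(p.shape, v.shape)
```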

Either of the two networks can go first in a simulation, depending on the implementation approach. Suppose f_sub is selected to go first; then we select a leaf from T_sub to be evaluated, and the values are updated accordingly for each of the states. The values of b_sub and b_full can be chosen conveniently. For this paper, we choose the value of b_full as a random number between 1 and n and then choose b_sub to be γ·b_full, where γ ∈ [1, 1.5]; over multiple runs, we determine which combination of b_sub and b_full fits best for our combinatorial game.

We now introduce a novel way of updating the MCTS algorithm. Here, we use a combination of PUCT, a sliding window (τ), and an ε-greedy algorithm for action selection. During the MCTS search tree's backup phase, we update the values evaluated using the PUCT algorithm only up to the τ previous nodes of the trajectory being followed in the current iteration. This method considerably reduces the number of updates required to be made to the MCTS, but it also impacts the PUCT algorithm's exploration term, as N(s, a) is not updated for states outside our sliding window. For this purpose, we combine our sliding window approach with an ε-greedy algorithm, which helps us facilitate exploration. The mathematical formalism of our action selection approach can be given as follows. The PUCB action selection is defined as:

    A_k = argmax_a [ Q(s_i, a) + c·P_φ(a|s_i)·√(Σ_{a'} N(s_i, a')) / (N(s_i, a) + 1) ]

Here we add a sliding window to our PUCB algorithm such that the algorithm updates only the nodes up to the window size (τ) from the current node. The sliding-window PUCB action selection then occurs in the following way:

    A_k = a_k,   for 0 ≤ k ≤ τ − 1
    A_k = argmax_a [ Q(s_i, a, τ) + c·P_φ(a|s_i)·√(Σ_{a'} N(s_i, a', τ)) / (N(s_i, a, τ) + 1) ],   for τ ≤ k ≤ K − 1

where a_k is the action evaluated for that node once the state is out of the window. Here we introduce the ε-greedy algorithm in order to balance exploitation and exploration outside the window τ; within the window, the PUCB takes care of this. The ε-greedy sliding-window PUCB thus becomes:

    A_k = a_k,   for 0 ≤ k ≤ τ − 1 and U_k ≥ ε_0·ν^k
    A_k = Y_k,   for 0 ≤ k ≤ τ − 1 and U_k < ε_0·ν^k
    A_k = argmax_a [ Q(s_i, a, τ) + c·P_φ(a|s_i)·√(Σ_{a'} N(s_i, a', τ)) / (N(s_i, a, τ) + 1) ],   for τ ≤ k ≤ K − 1

Notice that we have introduced the term ν^k; this is a decay term that is tuned for exploration depending on the state space of the game. The update for our ε-greedy sliding-window PUCB then becomes:

    For 0 ≤ k ≤ τ − 1:   Q_new(s_i, a_i) = Q_old(s_i, a_i),   N_new(s_i, a_i) = N_old(s_i, a_i)
    For τ ≤ k ≤ K − 1:   Q_new(s_i, a_i) = (Q_old(s_i, a_i)·N_old(s_i, a_i) + V_φ(s_r)) / (N_old(s_i, a_i) + 1),   N_new(s_i, a_i) = N_old(s_i, a_i) + 1

Figure 1. Dual MCTS block diagram

During each simulation for T_sub, the leaf state is selected based on our action selection algorithm, and for T_full each state is selected based on the priorities assigned by T_sub to each state: the higher the state value, the higher the priority. In case a particular state has not been visited by T_sub, the evaluation in T_full is done using our action selection technique. The choice of τ for the sliding window is based on the state space of the game being evaluated. For smaller games like Nim, the value of τ would be small, but as the board size increases, τ also increases, logarithmically in the state space.
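As a sketch of how the windowed update and the ε-greedy selection could be realized, the snippet below hard-codes our own reading of the symbols: U_k is a uniform random draw, Y_k a uniformly random action, and a_k the action cached for a node when it was last updated inside the window. The constants and function names are illustrative, not taken from the paper's implementation.

```python
import math
import random
from collections import defaultdict

N = defaultdict(int)       # visit counts N(s, a)
Q = defaultdict(float)     # action values Q(s, a)
cached = {}                # a_k: action previously evaluated for a frozen node
C, EPS0, NU = 1.5, 0.25, 0.9

def pucb(s, actions, prior):
    total = sum(N[(s, a)] for a in actions)
    return max(actions, key=lambda a: Q[(s, a)] +
               C * prior[a] * math.sqrt(total) / (N[(s, a)] + 1))

def select_action(k, tau, s, actions, prior):
    """Epsilon-greedy sliding-window PUCB: nodes with k < tau lie outside the
    window; with probability eps0 * nu**k they explore a random action,
    otherwise they reuse the cached action a_k. Nodes with k >= tau use PUCB."""
    if k < tau:
        if random.random() < EPS0 * NU ** k:           # U_k < eps0 * nu^k
            return random.choice(actions)               # Y_k: random exploration
        return cached.get(s, pucb(s, actions, prior))   # a_k
    return pucb(s, actions, prior)

def windowed_backup(trajectory, tau, v_leaf):
    """Update only the entries with index k >= tau; earlier statistics stay
    frozen, which is where the reduction in per-simulation update cost comes from."""
    for k, (s, a) in enumerate(trajectory):
        if k < tau:
            continue                                    # Q, N unchanged
        Q[(s, a)] = (Q[(s, a)] * N[(s, a)] + v_leaf) / (N[(s, a)] + 1)
        N[(s, a)] += 1
        cached[s] = a

if __name__ == "__main__":
    actions, prior = [0, 1], {0: 0.5, 1: 0.5}
    traj = [("s%d" % k, random.choice(actions)) for k in range(6)]
    windowed_backup(traj, tau=3, v_leaf=1.0)
    print(select_action(k=4, tau=3, s="s4", actions=actions, prior=prior))
```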

4. Experiments

This section describes our experiments to evaluate the Dual MCTS algorithm defined above. The algorithm has been implemented in Python in Deepmind's OpenSpiel framework (Lanctot et al., 2019). All the experiments have been performed on a Core i7-7500U processor with 8 Gigabytes of RAM. For these experiments, we use two problems. The first one is the single-pile Nim problem, which is a symmetric problem where both players have the same action space. The second one is the Highest Safe Rung (HSR) problem (Xu & Lieberherr, 2019; 2020), which is an asymmetric problem, meaning that the players have different action spaces. We claim that our algorithm performs better as the state space of the game increases. We also show our evaluations over the Connect-Four game, which has a considerably larger state space than HSR or Nim. Since the Connect-Four game took a long time to run on our current configuration, we upgraded the system to 16 Gigabytes of RAM. Note that this configuration is specific to the Connect-4 evaluations only.

The hyper-parameters used for the neural network are kept constant across all the different algorithms that we evaluate. In the case of the number of MCTS simulations, the number of simulations for the smaller and the larger tree is kept the same in MPV-MCTS and Dual MCTS. The number of simulations applied to the AlphaZero search tree is the sum of the simulations of Dual MCTS's two search trees. For the DNN, we use a ResNet with four layers and 64 neurons each. For Dual MCTS, f_sub consists of the first two layers, and its output is connected to a policy-value head, while the complete network acts as f_full. For MPV-MCTS, we have two separate networks, with 2 and 4 layers for the smaller and the larger network, respectively. The number of simulations assigned to the smaller tree is 50, and that to the larger tree is 35.

Algorithm 2 Dual MCTS
    s ← initial board state
    f_full ← initialize the full network
    b_full ∈ [1, n], b_sub ← γ·b_full, γ ∈ [1, 1.5]
    for i = 1 to b_sub + b_full do
        if i ∈ [1, b_sub] then
            s_leaf = SelectUnevaluatedLeafState(T_sub)
            (p, v) = f_sub(s_leaf)
            Update(T_sub, s_leaf, (p, v))
        else
            s_leaf = SelectUnevaluatedLeafStateByPriority(T_full)
            if N_sub(s_leaf) = 0 then
                s_leaf = SelectUnevaluatedLeafState(T_full)
            end if
            (p, v) = f_full(s_leaf)
            Update(T_full, s_leaf, (p, v))
        end if
    end for

5. Evaluations

We evaluate each of the algorithms, including AlphaZero, MPV-MCTS and Dual MCTS, on HSR, Nim and Connect-4 using the following four metrics:

• Elo rating (Elo, 1978)
• α-rank (Omidshafiei et al., 2019)
• Time per iteration
• Time required for convergence

For the purpose of comparison, we assume convergence when the α-rank score of the algorithm crosses 0.9. Note that this does not mean that the algorithm has found the optimal strategy to play the game, but we use this metric to compare these algorithms with each other, and it gives us a fair estimate of how they are performing.

5.1. HSR Evaluations

The Highest Safe Rung (HSR) problem (Kleinberg & Tardos, 2005) (Xu & Lieberherr, 2019; 2020) is an asymmetric problem in which the two agents' action spaces are different. HSR(k, q, n) defines a stress-testing problem where one, given k jars and q test chances, throws jars from specific rungs of a ladder of height n to locate the highest safe rung. If n is small enough, then one can locate the highest safe rung with at most k jars and q tests; otherwise, if n is too big, there is no way to locate the highest safe rung. The problem can be described as follows:

    HSR(k, q, n) =
        True,   if n = 1
        False,  if n > 1 ∧ (k = 0 ∨ q = 0)
        ∃m ∈ [1..n] : HSR(k − 1, q − 1, m) ∧ HSR(k, q − 1, n − m),   otherwise

It is well known that the above formula can be translated into a logic game using Hintikka rules (Hodges & Väänänen, 2019).

Here we use HSR(4,4,16), which means that the board size is (1×16), with four jars and four test chances available. Figure 2 shows how each of the algorithms performs on the HSR problem. As we see from Figure 2, MPV-MCTS converges a step faster than AlphaZero and Dual MCTS for both the Elo rating and the α-rank. However, the time required per training step for AlphaZero and Dual MCTS is much lower than for MPV-MCTS, due to which the total time taken for convergence of MPV-MCTS is much more than that of AlphaZero or Dual MCTS; Dual MCTS outperforms AlphaZero by 11.58% and MPV-MCTS by 29.42%.
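The HSR(k, q, n) recursion above can be checked directly with a short memoized search. This sketch (function name and demonstration ours) is only for illustration; it confirms, for instance, that a ladder of height 16 is the largest that four jars and four tests can handle.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def hsr(k: int, q: int, n: int) -> bool:
    """HSR(k, q, n) recursion: k jars, q test chances, ladder of height n."""
    if n == 1:
        return True
    if k == 0 or q == 0:
        return False
    # Split the ladder at rung m: below needs HSR(k-1, q-1, m) (jar breaks),
    # above needs HSR(k, q-1, n-m) (jar survives).
    return any(hsr(k - 1, q - 1, m) and hsr(k, q - 1, n - m)
               for m in range(1, n + 1))

if __name__ == "__main__":
    print(hsr(4, 4, 16), hsr(4, 4, 17))  # True False: height 16 is solvable, 17 is not
```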

5.2. Nim Evaluations

Nim (Frankl & Tokushige, 2003) is a symmetric problem in which the players take turns removing x stones from a pile of n stones, where x ≤ n. The player who picks the last stone wins.
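As an aside, the winning and losing positions of this single-pile game can be computed with a few lines of memoized search. The sketch below assumes that Nim(3, 20), as used in our evaluation, means a pile of 20 stones with at most 3 stones removable per turn; that reading of the parameters, and the code itself, are ours.

```python
from functools import lru_cache

MAX_TAKE, PILE = 3, 20  # our reading of Nim(3, 20): remove 1..3 stones per turn

@lru_cache(maxsize=None)
def mover_wins(stones: int) -> bool:
    """True if the player to move can force taking the last stone."""
    return any(x == stones or not mover_wins(stones - x)
               for x in range(1, min(MAX_TAKE, stones) + 1))

if __name__ == "__main__":
    print([s for s in range(1, PILE + 1) if not mover_wins(s)])  # losing piles: multiples of 4
```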

Figure 2. HSR Evaluations

Figure 3. Nim Evaluations

Figure 3 shows the results of our evaluations on the Nim(3,20) game.

Here we see results quite similar to the HSR problem: although MPV-MCTS converges a step faster than AlphaZero or Dual MCTS, it takes a much longer time per training iteration due to its dual-network configuration. Dual MCTS, in this case, surpasses AlphaZero by 8.84% and MPV-MCTS by 27.78% in training time.

5.3. Connect 4 Evaluations

To show that our algorithm outperforms the more conventional algorithms like AlphaZero and MPV-MCTS by even larger margins, we evaluate the game of Connect-4 (Hill & Kemp, 2018). Connect-4 has a much larger state space than HSR(4,4,16) or Nim(3,20). Figure 4 shows the results of our evaluations for all the models that we are testing.

Figure 4. Connect-4 Evaluations

As we can see from Figure 4, Dual MCTS performs much better than AlphaZero or MPV-MCTS. As in our previous evaluations, there is a similar pattern where MPV-MCTS takes a lower number of timesteps to converge than AlphaZero or Dual MCTS, but it takes a much longer time per iteration. As a result, Dual MCTS outperforms AlphaZero by 32.30% and MPV-MCTS by 33.92% in training time. This result provides conclusive evidence that our algorithm works much better for problems with larger state spaces.

Note that throughout all the experiments we have performed, Dual MCTS consistently outperforms AlphaZero and MPV-MCTS in terms of convergence time. All these experiments are run on very modest hardware, and we show here that Dual MCTS can be a very useful algorithm for use cases where high-end hardware is unavailable. Another thing to notice here is how both MPV-MCTS and Dual MCTS perform across various state spaces. MPV-MCTS converges much faster for Connect-4 than for HSR/Nim as compared to the other algorithms, but it takes more time per training step as it has to update two different neural networks. In the very same way, Dual MCTS also shows faster convergence for the Connect-4 game; this shows us the advantage of multiple tree searches. Along with this, the single-network configuration and an efficient update strategy help Dual MCTS converge much faster than its competitors.

Table 1. Average time required per training step, number of steps required for convergence, and total time for convergence, for each algorithm and game.

                 HSR                        Nim                        Connect-4
Algorithm        Time    Steps    Time      Time    Steps    Time      Time    Steps    Time
                 (Step)  (Conv.)  (Conv.)   (Step)  (Conv.)  (Conv.)   (Step)  (Conv.)  (Conv.)
AlphaZero        9.34    8        74.72     10.71   7        74.97     18.71   13       243.23
MPV-MCTS         12.38   7        86.66     14.67   6        88.02     24.62   10       246.20
Dual MCTS        8.37    8        66.96     9.84    7        68.88     15.32   12       183.84
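As a quick sanity check on Table 1, the convergence times are simply the per-step times multiplied by the number of steps, and the relative improvements quoted in Section 5 follow from them (up to rounding). The few lines below, with the table values hard-coded by us, reproduce those numbers.

```python
# Table 1 values: (time per step, steps to convergence) for each game.
results = {
    "HSR":       {"AlphaZero": (9.34, 8),   "MPV-MCTS": (12.38, 7),  "Dual MCTS": (8.37, 8)},
    "Nim":       {"AlphaZero": (10.71, 7),  "MPV-MCTS": (14.67, 6),  "Dual MCTS": (9.84, 7)},
    "Connect-4": {"AlphaZero": (18.71, 13), "MPV-MCTS": (24.62, 10), "Dual MCTS": (15.32, 12)},
}

for game, rows in results.items():
    total = {alg: t * s for alg, (t, s) in rows.items()}   # Time (Conv.) column
    dual = total["Dual MCTS"]
    for alg in ("AlphaZero", "MPV-MCTS"):
        gain = 100 * (total[alg] - dual) / dual             # improvement relative to Dual MCTS
        print(f"{game}: {alg} {total[alg]:.2f} vs Dual MCTS {dual:.2f} -> {gain:.2f}%")
```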

6. Conclusions

In this paper, we have proposed a new neural MCTS algorithm, Dual MCTS, which uses a combination of two MCTS trees along with a novel update technique for the Monte Carlo search tree to reduce the number of updates to the tree. This combination accelerates agent training. We train our agents on different symmetric and asymmetric problems using Dual MCTS and show that Dual MCTS performs better than more conventionally used algorithms like AlphaZero (Silver et al., 2017) and MPV-MCTS (Lan et al., 2019). We also show that Dual MCTS achieves even larger improvements in convergence times as the state space of the problems that we solve increases. We show our evaluations using different metrics.

Dual MCTS can also be scaled up for larger problems by creating smaller intermediate trees (> 2), which could provide a look-ahead to the following tree in a chained fashion and help achieve even faster convergence in huge state spaces. Establishing a constraint which can accurately form such hierarchies and strike a balance between the state space of the problem and the number of trees required for the fastest solution is a part of our future work.

References

Elo, A. E. The rating of chessplayers, past and present. 1978.

Frankl, P. and Tokushige, N. The game of n-times nim. Discrete Math., 260(1–3):205–209, January 2003. ISSN 0012-365X. doi: 10.1016/S0012-365X(02)00667-2.

Gelly, S. and Silver, D. Monte Carlo tree search and rapid action value estimation in computer Go. Artificial Intelligence, 175, 2011.

Grill, J.-B., Altché, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I., and Munos, R. Monte-Carlo tree search as regularized policy optimization. ArXiv, abs/2007.12509, 2020.

Hill, G. and Kemp, S. M. Connect 4: A novel paradigm to elicit positive and negative insight and search problem solving. Frontiers in Psychology, 9:1755, 2018.

Hodges, W. and Väänänen, J. Logic and Games. In Zalta, E. N. (ed.), The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, fall 2019 edition, 2019.

Kleinberg, J. and Tardos, E. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc., USA, 2005. ISBN 0321295358.

Lan, L.-C., Li, W., Wei, T.-H., and Wu, I. Multiple policy value Monte Carlo tree search. In IJCAI, 2019.

Lanctot, M., Winands, M. H., Pepels, T., and Sturtevant, N. R. Monte Carlo tree search with heuristic evaluations using implicit minimax backups. IEEE Conference on Computational Intelligence and Games, pp. 1–8, 2014.

Lanctot, M., Lockhart, E., Lespiau, J.-B., Zambaldi, V., Upadhyay, S., Pérolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., Hennes, D., Morrill, D., Muller, P., Ewalds, T., Faulkner, R., Kramár, J., Vylder, B. D., Saeta, B., Bradbury, J., Ding, D., Borgeaud, S., Lai, M., Schrittwieser, J., Anthony, T., Hughes, E., Danihelka, I., and Ryan-Davis, J. OpenSpiel: A framework for reinforcement learning in games. 2019.

Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R. α-rank: Multi-agent evaluation by evolution. Scientific Reports, 9, 2019.

Rosin, C. D. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61:203–230, 2011.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., and Lanctot, M. Mastering the game of Go with deep neural networks and tree search. Nature, 529, Jul 2016a.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016b.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, 550:354, Oct 2017.

Xu, R. and Lieberherr, K. Learning self-game-play agents for combinatorial optimization problems. AAMAS, Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2276–2278, 2019.

Xu, R. and Lieberherr, K. Learning self-play agents for combinatorial optimization problems. The Knowledge Engineering Review, 35:e11, 2020. doi: 10.1017/S026988892000020X.