Improving AlphaZero Using Monte-Carlo Graph Search
Proceedings of the Thirty-First International Conference on Automated Planning and Scheduling (ICAPS 2021)

Johannes Czech (1), Patrick Korus (1), Kristian Kersting (1, 2, 3)
(1) Department of Computer Science, TU Darmstadt, Germany
(2) Centre for Cognitive Science, TU Darmstadt, Germany
(3) Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt, Germany
[email protected], [email protected], [email protected]

Abstract

The AlphaZero algorithm has been successfully applied in a range of discrete domains, most notably board games. It utilizes a neural network that learns a value and policy function to guide the exploration in a Monte-Carlo Tree Search. Although many search improvements, such as graph search, have been proposed for Monte-Carlo Tree Search in the past, most of them refer to an older variant of the Upper Confidence bounds for Trees algorithm that does not use a policy for planning. We improve the search algorithm for AlphaZero by generalizing the search tree to a directed acyclic graph. This enables information flow across different subtrees and greatly reduces memory consumption. Along with Monte-Carlo Graph Search, we propose a number of further extensions, such as the inclusion of ε-greedy exploration, a revised terminal solver and the integration of domain knowledge as constraints. In our empirical evaluations, we use the CrazyAra engine on chess and crazyhouse as examples to show that these changes bring significant improvements to AlphaZero.

Figure 1: It is possible to obtain the King's Knight Opening (A) with different move sequences (B, C). Trajectories in bold are the most common move order to reach the final position. As one can see, graphs are a much more concise representation.

Introduction

The planning process of most humans for discrete domains is said to resemble the AlphaZero Monte-Carlo Tree Search (MCTS) variant (Silver et al. 2017), which uses a guidance policy as its prior and an evaluation function to distinguish between good and bad positions (Hassabis et al. 2017). The human reasoning process, however, may not be as stoic and structured as a search algorithm running on a computer, and humans are typically not able to process as many chess board positions in a given time. Nevertheless, humans are quite good at making valid connections between different subproblems and at reusing intermediate results for other subproblems. In other words, humans not only look at a problem step by step but are also able to jump between sequences of moves, so-called trajectories; they seem to have some kind of global memory buffer to store relevant information. This gives them a crucial advantage over a traditional tree search, which does not share information between different trajectories even when an identical state occurs in more than one of them.

Triggered by this intuition, the search tree is generalized to a Directed Acyclic Graph (DAG), yielding Monte-Carlo Graph Search (MCGS) (Saffidine, Cazenave, and Méhat 2012). The search in the DAG follows the scheme of the Upper Confidence bounds for Trees (UCT) algorithm (Auer, Cesa-Bianchi, and Fischer 2002), but employs a modified forward and backpropagation procedure to cope with the graph structure. Figure 1 illustrates how nodes with more than one parent, so-called transposition nodes, allow information to be shared between different subtrees and vastly improve memory efficiency.
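To make the transposition idea from Figure 1 concrete, the following is a minimal, hypothetical Python sketch; it does not reproduce the data structures of CrazyAra or of the MCGS algorithm described in this paper. Nodes are keyed by a hash of the position instead of by the move sequence, so the two move orders from panels B and C resolve to a single shared node whose statistics and network evaluation can be reused.

from dataclasses import dataclass, field

@dataclass
class Node:
    state_key: str                 # simplified FEN-like encoding of the position
    value_sum: float = 0.0         # accumulated evaluations
    visits: int = 0
    children: dict = field(default_factory=dict)   # action -> successor Node

class SearchGraph:
    def __init__(self):
        self.table = {}            # transposition table: state_key -> Node

    def get_or_create(self, state_key: str) -> Node:
        # Different move orders reaching the same position return the same
        # Node object, turning the search tree into a directed acyclic graph.
        if state_key not in self.table:
            self.table[state_key] = Node(state_key)
        return self.table[state_key]

graph = SearchGraph()
# King's Knight Opening reached via 1. e4 e5 2. Nf3 Nc6 and via 1. Nf3 Nc6 2. e4 e5
# (castling rights, en-passant squares and move counters are omitted from this key):
a = graph.get_or_create("r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w")
b = graph.get_or_create("r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w")
assert a is b   # both trajectories share one transposition node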
More importantly, a graph search reduces the number of neural network evaluations that normally take place when the search reaches a leaf node. Together, MCGS can result in a substantial performance boost, both for a fixed time constraint and for a fixed number of evaluations. This is demonstrated in an empirical evaluation on the example of chess and its variant crazyhouse.

Please note that we intentionally keep our focus here on the planning aspect of AlphaZero. We provide a novel planning algorithm that is based on DAGs and comes with a number of additional enhancements. Specifically, our contributions for improving the search of AlphaZero are as follows:

1. Transforming the search tree of the AlphaZero framework into a DAG and providing a backpropagation algorithm which is stable both for low and high simulation counts.

2. Introducing a terminal solver to make optimal choices in the tree or search graph in situations where outcomes can be computed exactly.

3. Combining the UCT algorithm with ε-greedy search, which helps UCT escape local optima.

4. Using Q-value information for move selection, which helps to switch to the second-best candidate move faster if there is a gap in their Q-values.

5. Adding constraints to narrow the search to events that are important with respect to the domain. In the case of chess, these are trajectories that include checks, captures and threats.

We proceed as follows. We start off by discussing related work. Then, we give a quick recap of the PUCT algorithm (Rosin 2011), the variant of UCT used in AlphaZero, and explain our realisation of MCGS and each enhancement individually. Before concluding, we touch upon our empirical results. We explicitly exclude Reinforcement Learning (RL) and use neural network models with pre-trained weights instead. However, we give insight into the stability of the search modifications by performing experiments under different simulation counts and time controls.

Related Work

There exists a lot of prior work on improving UCT search, such as the Rapid Action Value Estimation (RAVE) heuristic (Gelly and Silver 2011), move ordering and parallelization regimes. A comprehensive overview of these techniques is given by Browne et al. (2012). However, these earlier techniques focus on a version of UCT that relies only on a value approximator without a policy. Consequently, some of these extensions became obsolete in practice, providing an insignificant improvement or even deteriorating performance. Each of the proposed enhancements also increases complexity, and most require a full re-evaluation when they are used for PUCT combined with a deep neural network.

Saffidine, Cazenave, and Méhat (2012) proposed to use the UCT algorithm with a DAG and suggested an adapted UCT move selection formula (1). It selects the next move a_t with additional hyperparameters (d_1, d_2, d_3) \in \mathbb{N}^3 by

a_t = \operatorname*{argmax}_a \left( Q_{d_1}(s_t, a) + U_{d_2,d_3}(s_t, a) \right),    (1)

where

U_{d_2,d_3}(s_t, a) = c_{\text{puct}} \cdot \sqrt{\frac{\log \sum_b N_{d_2}(s_t, b)}{N_{d_3}(s_t, a)}}.    (2)

The values for d_1, d_2 and d_3 relate to the respective depth and were chosen to be either 0, 1, 2 or \infty. Their algorithm was tested on several environments, including a toy environment called LEFTRIGHT, the board game HEX and a 6 × 6 GO board, using 100, 1000 and 10 000 simulations, respectively. Their results were mixed: depending on the game, a different hyperparameter constellation performed best, and sometimes even the default UCT formula, corresponding to (d_1 = 0, d_2 = 0, d_3 = 0), achieved the highest score.
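For readers who prefer code, here is a small, hedged Python sketch of evaluating the adapted selection rule of Equations (1) and (2); the depth-indexed statistics are supplied as plain callables q_d and n_d, which is an assumption of this illustration rather than the representation used by Saffidine, Cazenave, and Méhat (2012).

import math
from typing import Callable, Sequence

def adapted_uct_select(state, actions: Sequence, q_d: Callable, n_d: Callable,
                       d1: int, d2: int, d3: int, c_puct: float = 1.0):
    """Return argmax_a of Q_{d1}(s,a) + c_puct * sqrt(log(sum_b N_{d2}(s,b)) / N_{d3}(s,a))."""
    total_d2 = sum(n_d(d2, state, b) for b in actions)
    log_total = math.log(max(total_d2, 1))          # guard against log(0)
    best_action, best_score = None, -math.inf
    for a in actions:
        visits = n_d(d3, state, a)
        if visits == 0:
            return a   # explore unvisited actions first (a common UCT convention)
        score = q_d(d1, state, a) + c_puct * math.sqrt(log_total / visits)
        if score > best_score:
            best_action, best_score = a, score
    return best_action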
Choe and Kim (2019) and Bauer, Patten, and Vincze (2019) build upon the approach of Saffidine, Cazenave, and Méhat (2012). More specifically, Choe and Kim (2019) modified the selection formula (2), reducing the number of hyperparameters to two, and applied the algorithm to the card game Hearthstone. Similarly, Bauer, Patten, and Vincze (2019) employed a tuned selection policy formula of (2) which considers the empirical variance of rewards and used it for the task of object pose verification.

Graphs were also studied as an alternative representation in a trial-based heuristic tree search framework by Keller and Helmert (2013) and further elaborated on by Keller (2015). Moreover, Childs, Brodeur, and Kocsis (2008) explored different variants for exploiting transpositions in UCT. Finally, Pélissier, Nakamura, and Tabata (2019) made use of transpositions for the task of feature selection. All of these directed acyclic graph search approaches have in common that they were developed and evaluated in a UCT-like framework that does not include an explicit policy distribution.

The other enhancement suggested in the present paper, namely the terminal solver, also extends an existing concept. Chen, Chen, and Lin (2018) built upon the work by Winands, Björnsson, and Saito (2008) and presented a terminal solver for MCTS which is able to deal with drawing nodes and allows losing nodes to be pruned.

Finally, challenged by the Montezuma's Revenge environment, which is a hard exploration problem with sparse rewards, Ecoffet et al. (2021) described an algorithm called Go-Explore, which remembers promising states and their respective trajectories. Subsequently, they are able both to return to these states and to robustify and optimize the corresponding trajectory. Their work only indirectly influenced this paper, but it gave us the motivation to abandon the search tree structure and to keep a memory of trajectories in our proposed methods.

The PUCT Algorithm

The essential idea behind the UCT algorithm is to iteratively build a search tree in order to find a good choice within a given time limit (Auer, Cesa-Bianchi, and Fischer 2002). Nodes represent states s_t and edges denote actions a_t for each time step t. Consider, e.g., chess: here, nodes represent board positions and edges denote legal moves that transition to a new board position. Each iteration of the tree search consists of three steps: first, selecting a leaf node; then expanding and evaluating that leaf node; and, finally, updating the values V(s_t) and visit counts N(s_t) on all nodes in the trajectory, from the selected leaf node up to the root node.
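To make these three steps concrete, the following is a minimal, generic Python sketch of one PUCT iteration on a plain search tree (not the MCGS variant developed in this paper). The evaluate, legal_actions and apply_action callables, the per-node statistics and the value-sign convention are assumptions of this illustration, and the exploration term uses the widely known AlphaZero-style PUCT formula rather than anything defined in this section.

import math

class TreeNode:
    def __init__(self, state, prior=0.0):
        self.state = state
        self.prior = prior          # P(s, a) from the policy head
        self.visits = 0             # N(s, a)
        self.value_sum = 0.0        # W(s, a); Q(s, a) = W / N
        self.children = {}          # action -> TreeNode

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct_score(parent, child, c_puct=2.5):
    # Q(s, a) + c_puct * P(s, a) * sqrt(N(s)) / (1 + N(s, a))
    return child.q() + c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)

def run_iteration(root, evaluate, legal_actions, apply_action):
    # 1) Selection: descend along maximal PUCT scores until a leaf is reached.
    node, path = root, [root]
    while node.children:
        _, node = max(node.children.items(),
                      key=lambda kv: puct_score(path[-1], kv[1]))
        path.append(node)
    # 2) Expansion and evaluation: query the neural network once at the leaf.
    policy, value = evaluate(node.state)           # policy: dict action -> prior
    for action in legal_actions(node.state):
        node.children[action] = TreeNode(apply_action(node.state, action),
                                         prior=policy.get(action, 0.0))
    # 3) Backpropagation: update the value sums and visit counts on every node of
    #    the trajectory, flipping the sign for alternating players (zero-sum games).
    for i, n in enumerate(reversed(path)):
        n.visits += 1
        n.value_sum += value if i % 2 == 0 else -value

In this plain tree form, every iteration costs exactly one network evaluation at the leaf; the graph-based search motivated in the introduction can skip that evaluation whenever the leaf turns out to be a known transposition, which is where the reported savings come from.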