Monte-Carlo Graph Search for AlphaZero
Johannes Czech¹, Patrick Korus¹, Kristian Kersting¹,²,³
¹ Department of Computer Science, Technical University of Darmstadt
² Centre for Cognitive Science, Technical University of Darmstadt
³ hessian.ai, The Hessian Center for Artificial Intelligence
Darmstadt, Germany 64289
December 22, 2020
[email protected], [email protected], [email protected]
arXiv:2012.11045v1 [cs.AI] 20 Dec 2020

Abstract

The AlphaZero algorithm has been successfully applied in a range of discrete domains, most notably board games. It utilizes a neural network that learns a value and policy function to guide the exploration in a Monte-Carlo Tree Search. Although many search improvements have been proposed for Monte-Carlo Tree Search in the past, most of them refer to an older variant of the Upper Confidence bounds for Trees algorithm that does not use a policy for planning. We introduce a new, improved search algorithm for AlphaZero which generalizes the search tree to a directed acyclic graph. This enables information flow across different subtrees and greatly reduces memory consumption. Along with Monte-Carlo Graph Search, we propose a number of further extensions, such as the inclusion of ε-greedy exploration, a revised terminal solver and the integration of domain knowledge as constraints. In our evaluations, we use the CrazyAra engine on chess and crazyhouse as examples to show that these changes bring significant improvements to AlphaZero.

Keywords: Monte-Carlo Tree Search, Graph Search, Directed Acyclic Graphs, Epsilon-Greedy Search, Chess, Crazyhouse, Upper Confidence bounds for Trees, AlphaZero

[Figure 1 (image): panel A shows a chess board with the King's Knight Opening position; panels B and C show the move sequences e4, e5, Nf3 and Nc6 arranged in different orders.]
Figure 1: It is possible to obtain the King's Knight Opening (A) with different move sequences (B, C). Trajectories in bold are the most common move order to reach the final position. As one can see, graphs are a much more concise representation.

1 Introduction

The planning process of most humans for discrete domains resembles the AlphaZero Monte-Carlo Tree Search (MCTS) variant (Silver et al., 2017), which uses a guidance policy as its prior and an evaluation function to distinguish between good and bad positions (Hassabis et al., 2017). The human reasoning process may not be as stoic and structured as a search algorithm running on a computer, and humans are typically not able to process as many chess board positions in a given time. Nevertheless, humans are quite good at making valid connections between different subproblems and at reusing intermediate results for other subproblems. In other words, humans not only look at a problem step by step but are also able to jump between sequences of moves, so-called trajectories; they seem to have some kind of global memory buffer to store relevant information. This gives them a crucial advantage over a traditional tree search, which does not share information between different trajectories, although an identical state may occur in several of them.
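To make the notion of such a recurring state (a transposition) concrete, the following minimal Python sketch is purely illustrative and not part of the paper: it uses the third-party python-chess package, and the dictionary-based statistics table and the choice of board.epd() as a state key are our own assumptions for this example. It shows that two different move orders reach the same King's Knight Opening position from Figure 1, so a plain tree search would create two separate nodes for it, while a graph search can share a single entry.

```python
# Illustrative sketch (not from the paper). Requires the third-party
# "python-chess" package: pip install chess
import chess

def play(moves):
    """Play a sequence of SAN moves from the initial chess position."""
    board = chess.Board()
    for san in moves:
        board.push_san(san)
    return board

# Two move orders that transpose into the King's Knight Opening (cf. Figure 1).
line_1 = ["e4", "e5", "Nf3", "Nc6"]
line_2 = ["Nf3", "e5", "e4", "Nc6"]

# Key the statistics by position (EPD: piece placement, side to move,
# castling and en-passant rights) instead of by move path, as a graph
# search would do; a tree search keyed by move path would store two nodes.
table = {}
for line in (line_1, line_2):
    key = play(line).epd()
    table.setdefault(key, {"visits": 0})["visits"] += 1

print(len(table))  # 1: both trajectories end in the same shared entry
```

Running the sketch prints 1, i.e. both trajectories map to one shared entry whose visit count has been updated twice; with a tree keyed by move sequence, the same statistics would have to be stored and evaluated twice.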
Triggered by this intuition, we generalize the search tree to a Directed Acyclic Graph (DAG), yielding Monte-Carlo Graph Search (MCGS). The search in our DAG follows the scheme of the Upper Confidence Bound for Trees (UCT) algorithm (Auer et al., 2002), but employs a modified forward and backpropagation procedure to cope with the graph structure. Figure 1 illustrates how nodes with more than one parent, so-called transposition nodes, allow information to be shared between different subtrees and vastly improve memory efficiency.

More importantly, a graph search reduces the number of neural network evaluations that normally take place when the search reaches a leaf node. Together, MCGS can result in a substantial performance boost, both for a fixed time constraint and for a fixed amount of evaluations. This is demonstrated in an empirical evaluation on the example of chess and its variant crazyhouse.

Please note that we intentionally keep our focus here on the planning aspect of AlphaZero. We provide a novel and generally applicable planning algorithm that is based on DAGs and comes with a number of additional enhancements. Specifically, our contributions for improving the search of AlphaZero are as follows:

1. Transforming the search tree into a DAG and providing a backpropagation algorithm which is stable both for low and high simulation counts.

2. Introducing a terminal solver to make optimal choices in the tree or search graph in situations where outcomes can be computed exactly.

3. Combining the UCT algorithm with ε-greedy search, which helps UCT to escape local optima.

4. Using Q-value information for move selection, which helps to switch to the second best candidate move faster if there is a gap in their Q-values.

5. Adding constraints to narrow the search to events that are important with respect to the domain. In the case of chess, these are trajectories that include checks, captures and threats.

Note that each enhancement, with the exception of the domain knowledge based constraint, can be seen as a standalone improvement and also generalizes to other domains.

We proceed as follows. We start off by discussing related work. Then, we give a quick recap of the PUCT algorithm (Rosin, 2011), the variant of UCT used in AlphaZero, and explain our realisation of MCGS and each enhancement individually. Before concluding, we touch upon our empirical results. We ran an ablation study showing how each of them increases the performance of AlphaZero's planning individually, as well as that combining all of them boosts the performance the most. We explicitly exclude Reinforcement Learning (RL) and use neural network models with pre-trained weights instead. However, we give insight into the stability of the search modifications by performing experiments under different simulation counts and time controls.

2 Related Work

There exists quite a lot of prior work on improving UCT search, including transposition usage, the Rapid Action Value Estimation (RAVE) heuristic (Gelly et al., 2011), move ordering and different parallelization regimes. A comprehensive overview of these techniques is given by Browne et al. (2012). However, these earlier techniques focus on a version of UCT which only relies on a value without a policy approximator. Consequently, some of these extensions became obsolete in practice, providing an insignificant improvement or even deteriorating performance. Each of the proposed enhancements also increases complexity, and most require a full new evaluation when they are used for PUCT combined with a deep neural network.

Saffidine et al. (2012) proposed to use the UCT algorithm with a DAG and suggested an adapted UCT move selection formula (1). It selects the next move $a_t$ with additional hyperparameters $(d_1, d_2, d_3) \in \mathbb{N}^3$ by

$$a_t = \operatorname{argmax}_a \left( Q_{d_1}(s_t, a) + U_{d_2,d_3}(s_t, a) \right), \quad (1)$$

where

$$U_{d_2,d_3}(s_t, a) = c_{\text{puct}} \cdot \sqrt{\frac{\log \sum_b N_{d_2}(s_t, b)}{N_{d_3}(s_t, a)}}. \quad (2)$$

The values $d_1$, $d_2$ and $d_3$ relate to the respective depth and were chosen to be either 0, 1, 2 or ∞. Their algorithm was tested on several environments, including a toy environment called LEFTRIGHT, the board game HEX and a 6 × 6 GO board, using 100, 1000 and 10 000 simulations respectively. Their results were mixed: depending on the game, a different hyperparameter constellation performed best, and sometimes even the default UCT formula, corresponding to $(d_1 = 0, d_2 = 0, d_3 = 0)$, achieved the highest score.
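As a concrete reading of formulas (1) and (2), the following self-contained Python sketch is our own illustration and not code from Saffidine et al.: the per-depth statistics q and n are hypothetical inputs, and how those statistics are maintained in the DAG is not reproduced; the sketch only shows the move selection step itself.

```python
import math

def adapted_uct_move(q, n, d1, d2, d3, c_puct=1.0):
    """Select a move following formulas (1) and (2).

    q[d][a] and n[d][a] are assumed to hold the Q-values and visit counts
    aggregated with depth parameter d for every action a (all counts > 0);
    handling of unvisited actions is omitted for brevity.
    """
    actions = list(q[d1].keys())
    total = sum(n[d2][b] for b in actions)  # sum_b N_{d2}(s_t, b)

    def score(a):
        u = c_puct * math.sqrt(math.log(total) / n[d3][a])  # formula (2)
        return q[d1][a] + u                                   # formula (1)

    return max(actions, key=score)

# Toy example with hand-made statistics for two moves and depths 0 and 1.
q = {0: {"e4": 0.2, "Nf3": 0.4}, 1: {"e4": 0.25, "Nf3": 0.35}}
n = {0: {"e4": 7, "Nf3": 42},    1: {"e4": 10,  "Nf3": 45}}
print(adapted_uct_move(q, n, d1=0, d2=0, d3=0))  # the default UCT setting
```

With $(d_1, d_2, d_3) = (0, 0, 0)$ this reduces to the plain UCT selection rule mentioned in the text; other settings simply read the Q-values and counts aggregated at a different depth.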
The other enhancement suggested in this paper, the terminal solver, also extends an already existing concept. Chen et al. (2018) build upon the work of Winands et al. (2008) and present a terminal solver for MCTS which is able to deal with drawing nodes and allows losing nodes to be pruned.

Finally, challenged by the Montezuma's Revenge environment, which is a hard exploration problem with sparse rewards, Ecoffet et al. (2018) described an algorithm called Go-Explore, which remembers promising states and their respective trajectories. Subsequently, they are able both to return to these states and to robustify and optimize the corresponding trajectory. Their work only indirectly influenced this paper, but gave us the motivation to abandon the search tree structure and to keep a memory of trajectories in our proposed methods.

[Figure 2 (image): two parent nodes with V(s_t) = 0.3, N(s_t) = 57 and V(s_t) = 0.1, N(s_t) = 13 lead, via edges with Q(s_t, a) = 0.2, N(s_t, a) = 7 and Q(s_t, a) = 0.4, N(s_t, a) = 42, to a shared child node with V(s_t) = -0.4, N(s_t) = 49; one of the two trajectories is marked as the active trajectory.]
Figure 2: Scenario of accessing a child node on two possible trajectories. The proposed data structure stores the visits and Q-values both on the edges and in the nodes.

3 The PUCT Algorithm

The essential idea behind the UCT algorithm is to iteratively build a search tree in order to find a good choice within a given time limit (Auer et al., 2002). Nodes represent states s_t and edges denote actions a_t for each time step t. Consider, e.g., chess: here, nodes represent board positions and edges denote legal moves that transition to a new board position. Each iteration of the tree search consists of three steps: first, selecting a leaf node; then expanding and evaluating that leaf node; and, finally, updating the values and visit counts on all nodes in the trajectory, from the selected leaf node up to the root node.

4 Monte-Carlo Graph Search

The main motivation of having a DAG structure instead