Quality-Based Rewards for Monte-Carlo Tree Search Simulations
Tom Pepels, Mandy J. W. Tak, Marc Lanctot, and Mark H. M. Winands
Games and AI Group, Department of Knowledge Engineering (DKE), Maastricht University, Maastricht, The Netherlands

Abstract. Monte-Carlo Tree Search is a best-first search technique based on simulations to sample the state space of a decision-making problem. In games, positions are evaluated based on estimates obtained from the rewards of numerous randomized play-outs. Generally, rewards from play-outs are discrete values representing the outcome of the game (loss, draw, or win), e.g., r ∈ {−1, 0, 1}, which are backpropagated from expanded leaf nodes to the root node. However, a play-out may provide additional information. In this paper, we introduce new measures for assessing the a posteriori quality of a simulation. We show that altering the rewards of play-outs based on their assessed quality improves results in six distinct two-player games and in the General Game Playing agent CADIAPLAYER. We propose two specific enhancements, the Relative Bonus and the Qualitative Bonus. Both are used as control variates, a variance-reduction method for statistical simulation. The Relative Bonus is based on the number of moves made during a simulation, and the Qualitative Bonus relies on a domain-dependent assessment of the game's terminal state. We show that the proposed enhancements, both separate and combined, lead to significant performance increases in the domains discussed.

1 INTRODUCTION

Monte-Carlo Tree Search (MCTS) [7, 12] is a simulation-based best-first search technique for decision-making problems. Recently, MCTS has been shown to improve performance in various domains, such as the two-player games Go [17], Lines of Action [25], and Hex [1]. Moreover, MCTS has seen successes in other domains such as real-time strategy games [5], arcade games such as Ms Pac-Man [14] and the Physical Traveling Salesman [15], but also in real-life domains such as optimization, scheduling, and security [5].

Standard MCTS runs simulated play-outs until a terminal state is reached. The returned reward signal represents the outcome of the game's final position (win, draw, or loss), but any other information about the play-out is disregarded. Several techniques for determining the quality of simulations have been proposed previously. Early cut-offs terminate a play-out and return a heuristic value of the state [24]. Evaluating the final score of a game, e.g., using the number of points achieved, and combining it with the outcome has been shown to improve results in points-based games [19]. However, in some domains a heuristic evaluation may not be available, or may be too time-consuming. Additionally, many games do not have point-based scores and only have the distinct possible outcomes win, draw, or loss.

In this paper, two techniques are proposed for determining the quality of a simulation based on properties of play-outs. The first, the Relative Bonus, assesses the quality of a simulation based on its length. The second, the Qualitative Bonus, formulates a quality assessment of the terminal state. Adjusting simulation results using these values leads to increased performance in six distinct two-player games. Moreover, we determine the advantage of using the Relative Bonus in the General Game Playing agent CADIAPLAYER [4], winner of the International GGP Competition in 2007, 2008, and 2012.

The paper is structured as follows. First, the general MCTS framework is discussed in Section 2. Next, two different techniques for assessing the quality of play-outs are detailed in Section 3. Section 4 explains how rewards can be altered using the quality measures from the previous section. This is followed by pseudo-code outlining the proposed algorithm in Section 5. Finally, the performance of the proposed enhancements is determined in Section 6, accompanied by a discussion and conclusion in Section 7.

2 MONTE-CARLO TREE SEARCH

Monte-Carlo Tree Search (MCTS) is a simulation-based search method [7, 12]. MCTS grows a search tree incrementally, expanding a leaf node of the tree in every simulation. The rewards stored at a node, averaged over the results of numerous simulations, represent an estimate of the win probability of simulations that pass through that node. Each simulation consists of two parts: 1) the selection step, where moves are selected and played inside the tree according to the selection policy, and 2) the play-out step, where moves are played outside the tree according to a simulation strategy. At the end of each play-out a terminal state is reached, and the result r, usually expressed numerically in some discrete range, e.g., r ∈ {−1, 0, 1} representing a loss, draw, or win, respectively, is backpropagated along the tree from the expanded leaf to the root.

In its basic form, MCTS does not require an evaluation function. Nonetheless, in most domains it is beneficial to add some domain knowledge to the play-out policy. MCTS can be terminated at any time, for instance when some computational limit is reached, to select a move to return. The move to make is then selected by choosing either the child of the root with the highest number of visits, the child with the highest average reward, or a combination of both [6].
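To make the two steps concrete, the following minimal Python sketch (not taken from the paper) runs a single MCTS simulation: selection inside the tree, expansion of one leaf, a uniformly random play-out, and backpropagation of r ∈ {−1, 0, 1}. The game-state interface (legal_moves, play, is_terminal, result, player_to_move) and the Node fields are assumptions made for illustration; the selection policy is passed in as select_child, and a UCT version of it is sketched after Eq. (1) below.

    import random

    class Node:
        """Tree node: stores the visit count and the summed simulation rewards."""
        def __init__(self, state, parent=None, move=None):
            self.state, self.parent, self.move = state, parent, move
            self.children = []                        # expanded child nodes
            self.untried = list(state.legal_moves())  # moves not yet expanded
            self.visits, self.total_reward = 0, 0.0

    def simulate(root, select_child):
        """One MCTS simulation: selection, expansion, play-out, backpropagation."""
        node = root
        # 1) Selection: descend the tree using the selection policy (e.g., UCT).
        while not node.untried and node.children:
            node = select_child(node)
        # Expansion: add one new leaf node for a not-yet-tried move.
        if node.untried:
            move = node.untried.pop(random.randrange(len(node.untried)))
            node.children.append(Node(node.state.play(move), parent=node, move=move))
            node = node.children[-1]
        # 2) Play-out: sample moves outside the tree until a terminal state is reached.
        state = node.state
        while not state.is_terminal():
            state = state.play(random.choice(state.legal_moves()))
        # Backpropagation: propagate r in {-1, 0, 1} from the expanded leaf to the root.
        # NOTE: in two-player games the reward is normally negated per ply (negamax);
        # this is omitted here for brevity.
        r = state.result(root.state.player_to_move())
        while node is not None:
            node.visits += 1
            node.total_reward += r
            node = node.parent
        return r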
During the selection step, a policy is required to explore the tree and to decide on promising options. UCT [12] is derived from the UCB1 policy [2] for maximizing the rewards of a multi-armed bandit. In UCT, each node is treated as a bandit problem whose arms are the moves that lead to different child nodes. UCT balances the exploitation of rewarding nodes whilst allowing exploration of less often visited nodes. Consider a node p with children I(p); the policy determining which child i to select is defined as:

    i* = argmax_{i ∈ I(p)} ( v_i + C √( ln n_p / n_i ) ),        (1)

where v_i is the score of child i based on the average result of the simulations that visited it, n_p and n_i are the visit counts of the current node and its child, respectively, and C is the exploration constant to tune.
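The selection rule of Eq. (1) can be implemented directly. The sketch below assumes the Node fields from the earlier simulation sketch and a placeholder value C = 0.7 (the exploration constant must be tuned per domain); it can be plugged in as simulate(root, uct_select).

    import math

    def uct_select(parent, C=0.7):
        """Return the child maximizing v_i + C * sqrt(ln(n_p) / n_i), Eq. (1).
        v_i is the child's average simulation result; n_p and n_i are visit counts.
        C = 0.7 is only a placeholder value."""
        log_np = math.log(parent.visits)
        def uct_value(child):
            v_i = child.total_reward / child.visits   # average result of simulations through i
            return v_i + C * math.sqrt(log_np / child.visits)
        return max(parent.children, key=uct_value)

In the earlier sketch every child is visited once when it is expanded, so n_i ≥ 1 and the division is safe.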
3 ASSESSING SIMULATION QUALITY

In this section, we discuss two quality assessments of the terminal state of a simulation. First, in Subsection 3.1 the length of a simulation is discussed as a measure of its quality. Second, in Subsection 3.2 a quality assessment of the terminal state of a match is considered. In the next section we establish how these quantities can be used to enhance the rewards of MCTS simulations.

Figure 1. A single MCTS simulation [9].

3.1 Simulation Length

The first assessment of a simulation's quality is the length of the simulated game played. Consider a single MCTS simulation as depicted in Figure 1; we can then define two separate distances:

1. the number of moves made from the root S to the leaf N, d_SN, and
2. the number of moves required to reach the simulation's terminal state T from N during the play-out, d_NT.

The length of the simulation is defined as the sum of these distances:

    d = d_SN + d_NT,        (2)

i.e., the total number of moves made before reaching the terminal state T of the simulation from S.

A play-out policy chooses moves at each state uniformly at random, is rule-based, reactive, or is combined with a source of expert or heuristic information such as an ε-greedy policy [20, 21]. Alternative methods have been proposed, such as using low-level αβ searches [24], and methods that learn a strategy online, such as the Last-Good-Reply policy [3], the Move-Average Sampling Technique (MAST) [10], or N-Grams [22]. However, any play-out policy balances the computational effort required to select high-quality moves against the number of simulations performed in the allowed search time. Therefore, moves sampled from the play-out policy are far from optimal.

3.2 Terminal State Quality

The second quality assessment evaluates terminal states directly in order to augment the information used by MCTS. In some domains, MCTS' performance is improved by using either a static or an early cut-off of the simulations and evaluating the position reached. However, a heuristic evaluation may not be available or may be too time-consuming. In this paper, we consider any available information that can help describe the quality of a terminal state, such as how convincing a win was based on in-game scores.

As before, consider a single MCTS simulation as depicted in Figure 1. When a terminal state is reached, a quality assessment function is called to evaluate the position with respect to the winning player. This measure, q, should reflect the quality of the terminal state. For instance, in a game with material, such as Breakthrough, Chess, or Checkers, an evaluation can be based on scoring the remaining material of the winning player. For a racing game such as Chinese Checkers, the inverse of the number of pieces the opponent has in his target base can be considered. As such, the quality is based on an a posteriori evaluation of the terminal state: having witnessed the states and actions performed from S to T, the score is based on an assessment of T given the progression S ... N ... T (see Figure 1).

Score-based bonuses have previously been shown to enhance the performance of MCTS in Go [8, 26], Mancala [16], and Blokus-Duo [19]. As detailed in the next section, our approach differs from previous work because the bonuses are normalized based on their observed distribution during the search.
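To illustrate the two measures of this section, the sketch below computes the simulation length d of Eq. (2) from the two move counts and gives two possible quality functions q, one for material games and one for racing games such as Chinese Checkers. The game-state methods used (pieces, in_target_base) and the +1 in the denominator are assumptions made for illustration, not definitions from the paper.

    def simulation_length(d_sn, d_nt):
        """Eq. (2): d = d_SN + d_NT, the number of moves from the root S to the terminal state T."""
        return d_sn + d_nt

    def material_quality(terminal_state, winner):
        """Quality q for a material game (e.g., Breakthrough, Chess, or Checkers):
        score the remaining material of the winning player.
        terminal_state.pieces(player) is an assumed interface."""
        return len(terminal_state.pieces(winner))

    def racing_quality(terminal_state, winner, opponent):
        """Quality q for a racing game such as Chinese Checkers: the inverse of the
        number of pieces the opponent already has in his target base."""
        in_base = sum(1 for piece in terminal_state.pieces(opponent)
                      if terminal_state.in_target_base(piece, opponent))
        return 1.0 / (1 + in_base)   # +1 is an assumption to avoid division by zero

Only the raw measurements are sketched here; Section 4 describes how such values are turned into bonuses that adjust the reward r.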