Pipeline Pattern for Parallel MCTS

Pipeline Pattern for Parallel MCTS S. Ali Mirsoleimani1;2, Jaap van den Herik1, Aske Plaat1 and Jos Vermaseren2 1Leiden Centre of Data Science, Leiden University Niels Bohrweg 1, 2333 CA Leiden, The Netherlands 2Nikhef Theory Group, Nikhef Science Park 105, 1098 XG Amsterdam, The Netherlands Keywords: MCTS, Parallelization, Pipeline Pattern, Search Overhead Abstract: In this paper, we present a new algorithm for parallel Monte Carlo tree search (MCTS). It is based on the pipeline pattern and allows flexible management of the control flow of the operations in parallel MCTS. The pipeline pattern provides for the first structured parallel programming approach to MCTS. The Pipeline Pattern for Parallel MCTS algorithm (called 3PMCTS) scales very well to a higher number of cores when compared to the existing methods. The observed speedup is 21 on a 24-core machine. 1 Introduction parallel thread for execution on separate processors (Chaslot et al., 2008a; Schaefers and Platzner, 2014; Mirsoleimani et al., 2015a). This type of parallelism In recent years there has been much interest in the is called iteration-level parallelism (ILP). Close anal- Monte Carlo tree search (MCTS) algorithm. In 2006 ysis has learned us that each iteration in the chunk it was a new, adaptive, randomized optimization algo- can also be decomposed into separate operations for rithm (Coulom, 2006; Kocsis and Szepesvari,´ 2006). parallelization. Based on this idea, we introduce In fields as diverse as Artificial Intelligence, Opera- operation-level parallelism (OLP). The main point tions Research, and High Energy Physics, research is to assign each operation of MCTS to a separate has established that MCTS can find valuable approx- processing element for execution by separate proces- imate answers without domain-dependent heuristics sors. This leads to flexibility in managing the control (Kuipers et al., 2013). The strength of the MCTS flow of operations in the MCTS algorithm. The main algorithm is that it provides answers with a random contribution of this paper is introducing a new algo- amount of error for any fixed computational budget rithm based on the pipeline pattern for parallel MCTS (Goodfellow et al., 2016). Much effort has been (3PMCTS) and showing its benefits. put into the development of parallel algorithms for MCTS to reduce the running time. The efforts are The remainder of the paper is organized as fol- applied to a broad spectrum of parallel systems; rang- lows. In section 2 the required background informa- ing from small shared-memory multicore machines tion is briefly described. Section 3 provides necessary to large distributed-memory clusters. In the last two definitions and explanations for the design of 3PM- years, parallel MCTS played a major role in the suc- CTS. Section 4 gives the explanations for the imple- cess of AI by defeating humans in the game of Go mentation the 3PMCTS algorithm, Section 5 shows (Silver et al., 2016; Hassabis and Silver, 2017). the experimental setup, and Section 6 gives the exper- The general MCTS algorithm has four operations imental results. Finally, in Section 7 we conclude the inside its main loop (see Algorithm 1). This loop is paper. a good candidate for parallelization. Hence, a signif- icant effort has been put into the development of parallelization methods for MCTS (Chaslot et al., 2008a; Yoshizoe et al., 2011; Fern and Lewis, 2011; Schae- 2 Background fers and Platzner, 2014; Mirsoleimani et al., 2015b). To implement these methods, the computation associated with each iteration is assumed to be independent (Mirsoleimani et al., 2015a). Therefore, we can as- Below we discuss MCTS in Section 2.1, in Sec- sign a chunk of iterations as a separate task to each tion 2.2 the parallelization of MCTS is explained. 2.1 The MCTS Algorithm with the maximum number of visits). The purpose of MCTS is to approximate the The MCTS algorithm iteratively repeats four steps game-theoretic value of the actions that may be se- or operations to construct a search tree until a pre- lected from the current state by iteratively creating a defined computational budget (i.e., time or iteration partial search tree (Browne et al., 2012). How the constraint) is reached (Chaslot et al., 2008b; Coulom, search tree is built depends on how nodes in the tree 2006). Algorithm 1 shows the general MCTS algo- are selected (i.e., tree selection policy). In particular, rithm. nodes in the tree are selected according to the esti- In the beginning, the search tree has only a root mated probability that they are better than the current (v0) which represents the initial state in a domain. best action. It is essential to reduce the estimation er- Each node in the search tree resembles a state of the ror of the nodes’ values as quickly as possible. There- domain, and directed edges to child nodes represent fore, the tree selection policy in the MCTS algorithm actions leading to the succeeding states. Figure 1 il- aims at balancing exploitation (look in areas which lustrates one iteration of the MCTS algorithm on a appear to be promising) and exploration (look in ar- search tree that already has nine nodes. The non- eas that have not been well sampled yet) (Kocsis and terminal and internal nodes are represented by circles. Szepesvari,´ 2006). Squares show the terminal nodes. The Upper Confidence Bounds for Trees (UCT) algorithm addresses the exploitation-exploration 1.S ELECT: A path of nodes inside the search tree is selected from the root node until a non-terminal dilemma in the selection step of the MCTS algorithm leaf with unvisited children is reached (v ). Each using the UCB1 policy (Kocsis and Szepesvari,´ 6 j of the nodes inside the path is selected based on a 2006). A child node is selected to maximize: predefined tree selection policy (see Figure 1a). s 2ln(N(v)) 2.E XPAND: One of the children (v ) of the selected UCT( j) = X j + 2Cp (1) 9 N(v j) non-terminal leaf (v6) is generated randomly and added to the tree together with the selected path Q(v j) where X j = is an approximation of the game- (see Figure 1b). N(v j) theoretic value of node j. Q(v j) is the total reward of 3.P LAYOUT: From the given state of the newly all playouts that passed through node j, N(v j) is the added node, a sequence of randomly simulated ac- number of times node j has been visited, N(v) is the tions (i.e., RANDOMSIMULATION) is performed number of times the parent of node j has been vis- until a terminal state in the domain is reached. The ited, and Cp ≥ 0 is a constant. The left-hand term is terminal state is evaluated using a utility function for exploitation and the right-hand term is for explo- (i.e., EVALUATION) to produce a reward value D ration (Kocsis and Szepesvari,´ 2006). The decrease or (see Figure 1c). increase in the amount of exploration can be adjusted 4.B ACKUP: For each node in the selected path, the by Cp in the exploration term (see Section 6). number N(v) of times it has been visited is incre- mented by 1 and its total reward value Q(v) is up- 2.2 Parallelization of MCTS dated according to D (Browne et al., 2012). These values are required by the tree selection policy Parallelization of MCTS consists of a precise arrange- (see Figure 1d). ment of tasks and data dependencies. In section 2.2.1 As soon as the computational budget is exhausted, the we explain how to decompose MCTS into tasks. In best child of the root node is returned (e.g., the one section 2.2.2 we investigate what types of data dependencies exist among these tasks. In Section 2.2.3 the existing parallelization methods for MCTS are dis- Algorithm 1: The general MCTS algorithm. cussed. 1 Function MCTS(s0) 2.2.1 Decomposition into Tasks 2 v0 := creat root node with state s0; 3 while within search budget do 4 < vl ;sl > := SELECT(v0;s0); The first step towards parallelizing MCTS is to find 5 < vl ;sl > := EXPAND(vl ;sl ); concurrent tasks in MCTS. As stated above, there are 6 D := PLAYOUT(vl ;sl ); two levels of task decomposition in MCTS. 7 BACKUP(v ,D); l 1. Iteration-level tasks (ILT): In MCTS the compu- 8 end tation associated with each iteration is indepen- 9 return action a for the best child of v0 dent. Therefore, these are candidates to guide a v0 v0 v0 v0 v1 v2 v3 ∆ v1 v2 v3 v1 v2 v3 v1 v2 v3 v4 v5 v6 v7 ∆ v4 v5 v6 v7 v4 v5 v6 v7 v4 v5 v6 v7 v8 v9 ∆ v v v v v 8 8 9 ∆ 8 9 (a) SELECT (b) EXPAND (c) PLAYOUT (d) BACKUP Figure 1: One iteration of MCTS. task decomposition by mapping a chunk of iterations onto a task. v0 2. Operation-level tasks (OLT): The task decomposition for MCTS occurs inside each iteration. v1 v2 v3 Each of the four MCTS operations can be treated as a separate task. 2 v4 v5 v6 v7 2.2.2 Data Dependencies 1 3 v8 v9 v10 The second step is dealing adequately with the data dependency. When a search tree is shared among mul- Figure 2: Tree parallelization. The curly arrows represent tiple parallel threads, There are two levels of data de- threads. The rectangles are terminal leaf nodes. pendency. 1. Iteration-level dependencies (ILD): Strictly speaking, in MCTS, iteration j has a soft dependency to its predecessor iteration j − 1.

Load more