<<

Pattern for Parallel MCTS

S. Ali Mirsoleimani1,2, Jaap van den Herik1, Aske Plaat1 and Jos Vermaseren2 1Leiden Centre of Data Science, Leiden University Niels Bohrweg 1, 2333 CA Leiden, The Netherlands 2Nikhef Theory Group, Nikhef Science Park 105, 1098 XG Amsterdam, The Netherlands

Keywords: MCTS, Parallelization, Pipeline Pattern, Search Overhead

Abstract: In this paper, we present a new algorithm for parallel Monte Carlo tree search (MCTS). It is based on the pipeline pattern and allows flexible management of the control flow of the operations in parallel MCTS. The pipeline pattern provides for the first structured parallel programming approach to MCTS. The Pipeline Pattern for Parallel MCTS algorithm (called 3PMCTS) scales very well to a higher number of cores when compared to the existing methods. The observed speedup is 21 on a 24-core machine.

1 Introduction parallel for execution on separate processors (Chaslot et al., 2008a; Schaefers and Platzner, 2014; Mirsoleimani et al., 2015a). This type of parallelism In recent years there has been much interest in the is called iteration-level parallelism (ILP). Close anal- Monte Carlo tree search (MCTS) algorithm. In 2006 ysis has learned us that each iteration in the chunk it was a new, adaptive, randomized optimization algo- can also be decomposed into separate operations for rithm (Coulom, 2006; Kocsis and Szepesvari,´ 2006). parallelization. Based on this idea, we introduce In fields as diverse as Artificial Intelligence, Opera- operation-level parallelism (OLP). The main point tions Research, and High Energy Physics, research is to assign each operation of MCTS to a separate has established that MCTS can find valuable approx- processing element for execution by separate proces- imate answers without domain-dependent heuristics sors. This leads to flexibility in managing the control (Kuipers et al., 2013). The strength of the MCTS flow of operations in the MCTS algorithm. The main algorithm is that it provides answers with a random contribution of this paper is introducing a new algo- amount of error for any fixed computational budget rithm based on the pipeline pattern for parallel MCTS (Goodfellow et al., 2016). Much effort has been (3PMCTS) and showing its benefits. put into the development of parallel algorithms for MCTS to reduce the running time. The efforts are The remainder of the paper is organized as fol- applied to a broad spectrum of parallel systems; rang- lows. In section 2 the required background informa- ing from small shared-memory multicore machines tion is briefly described. Section 3 provides necessary to large distributed-memory clusters. In the last two definitions and explanations for the design of 3PM- years, parallel MCTS played a major role in the suc- CTS. Section 4 gives the explanations for the imple- cess of AI by defeating humans in the game of Go mentation the 3PMCTS algorithm, Section 5 shows (Silver et al., 2016; Hassabis and Silver, 2017). the experimental setup, and Section 6 gives the exper- The general MCTS algorithm has four operations imental results. Finally, in Section 7 we conclude the inside its main loop (see Algorithm 1). This loop is paper. a good candidate for parallelization. Hence, a signif- icant effort has been put into the development of par- allelization methods for MCTS (Chaslot et al., 2008a; Yoshizoe et al., 2011; Fern and Lewis, 2011; Schae- 2 Background fers and Platzner, 2014; Mirsoleimani et al., 2015b). To implement these methods, the computation associ- ated with each iteration is assumed to be independent (Mirsoleimani et al., 2015a). Therefore, we can as- Below we discuss MCTS in Section 2.1, in Sec- sign a chunk of iterations as a separate task to each tion 2.2 the parallelization of MCTS is explained. 2.1 The MCTS Algorithm with the maximum number of visits). The purpose of MCTS is to approximate the The MCTS algorithm iteratively repeats four steps game-theoretic value of the actions that may be se- or operations to construct a search tree until a pre- lected from the current state by iteratively creating a defined computational budget (i.e., time or iteration partial search tree (Browne et al., 2012). How the constraint) is reached (Chaslot et al., 2008b; Coulom, search tree is built depends on how nodes in the tree 2006). Algorithm 1 shows the general MCTS algo- are selected (i.e., tree selection policy). In particular, rithm. nodes in the tree are selected according to the esti- In the beginning, the search tree has only a root mated probability that they are better than the current (v0) which represents the initial state in a domain. best action. It is essential to reduce the estimation er- Each node in the search tree resembles a state of the ror of the nodes’ values as quickly as possible. There- domain, and directed edges to child nodes represent fore, the tree selection policy in the MCTS algorithm actions leading to the succeeding states. Figure 1 il- aims at balancing exploitation (look in areas which lustrates one iteration of the MCTS algorithm on a appear to be promising) and exploration (look in ar- search tree that already has nine nodes. The non- eas that have not been well sampled yet) (Kocsis and terminal and internal nodes are represented by circles. Szepesvari,´ 2006). Squares show the terminal nodes. The Upper Confidence Bounds for Trees (UCT) algorithm addresses the exploitation-exploration 1.S ELECT: A path of nodes inside the search tree is selected from the root node until a non-terminal dilemma in the selection step of the MCTS algorithm leaf with unvisited children is reached (v ). Each using the UCB1 policy (Kocsis and Szepesvari,´ 6 j of the nodes inside the path is selected based on a 2006). A child node is selected to maximize: predefined tree selection policy (see Figure 1a). s 2ln(N(v)) 2.E XPAND: One of the children (v ) of the selected UCT( j) = X j + 2Cp (1) 9 N(v j) non-terminal leaf (v6) is generated randomly and added to the tree together with the selected path Q(v j) where X j = is an approximation of the game- (see Figure 1b). N(v j) theoretic value of node j. Q(v j) is the total reward of 3.P LAYOUT: From the given state of the newly all playouts that passed through node j, N(v j) is the added node, a sequence of randomly simulated ac- number of times node j has been visited, N(v) is the tions (i.e., RANDOMSIMULATION) is performed number of times the parent of node j has been vis- until a terminal state in the domain is reached. The ited, and Cp ≥ 0 is a constant. The left-hand term is terminal state is evaluated using a utility function for exploitation and the right-hand term is for explo- (i.e., EVALUATION) to produce a reward value ∆ ration (Kocsis and Szepesvari,´ 2006). The decrease or (see Figure 1c). increase in the amount of exploration can be adjusted 4.B ACKUP: For each node in the selected path, the by Cp in the exploration term (see Section 6). number N(v) of times it has been visited is incre- mented by 1 and its total reward value Q(v) is up- 2.2 Parallelization of MCTS dated according to ∆ (Browne et al., 2012). These values are required by the tree selection policy Parallelization of MCTS consists of a precise arrange- (see Figure 1d). ment of tasks and data dependencies. In section 2.2.1 As soon as the computational budget is exhausted, the we explain how to decompose MCTS into tasks. In best child of the root node is returned (e.g., the one section 2.2.2 we investigate what types of data depen- dencies exist among these tasks. In Section 2.2.3 the existing parallelization methods for MCTS are dis- Algorithm 1: The general MCTS algorithm. cussed. 1 Function MCTS(s0) 2.2.1 Decomposition into Tasks 2 v0 := creat root node with state s0; 3 while within search budget do

4 < vl ,sl > := SELECT(v0,s0); The first step towards parallelizing MCTS is to find 5 < vl ,sl > := EXPAND(vl ,sl ); concurrent tasks in MCTS. As stated above, there are 6 ∆ := PLAYOUT(vl ,sl ); two levels of task decomposition in MCTS. 7 BACKUP(v ,∆); l 1. Iteration-level tasks (ILT): In MCTS the compu- 8 end tation associated with each iteration is indepen- 9 return action a for the best child of v0 dent. Therefore, these are candidates to guide a v0

v0 v0 v0 v1 v2 v3 ∆

v1 v2 v3 v1 v2 v3 v1 v2 v3 v4 v5 v6 v7 ∆ v4 v5 v6 v7 v4 v5 v6 v7 v4 v5 v6 v7 v8 v9 ∆ v v v v v 8 8 9 ∆ 8 9 (a) SELECT (b) EXPAND (c) PLAYOUT (d) BACKUP Figure 1: One iteration of MCTS.

task decomposition by mapping a chunk of itera- tions onto a task. v0 2. Operation-level tasks (OLT): The task decom- position for MCTS occurs inside each iteration. v1 v2 v3 Each of the four MCTS operations can be treated as a separate task. 2 v4 v5 v6 v7 2.2.2 Data Dependencies 1 3

v8 v9 v10 The second step is dealing adequately with the data dependency. When a search tree is shared among mul- Figure 2: Tree parallelization. The curly arrows represent tiple parallel threads, There are two levels of data de- threads. The rectangles are terminal leaf nodes. pendency. 1. Iteration-level dependencies (ILD): Strictly speaking, in MCTS, iteration j has a soft de- pendency to its predecessor iteration j − 1. Obviously, to select an optimal path, it requires each other. When the search is over, they are merged, updates on the search tree from the previous and the action of the best child of the root is selected. iteration.1 A parallel MCTS can ignore ILD and simply suffers from the search overhead.2 The leaf parallelization and tree parallelization 2. Operation-level dependencies (OLD): Each of the methods belong to category (B). In the leaf par- four operations in MCTS has a hard dependency allelization, the parallel threads perform multiple to its predecessor.3 For example, the EXPAND op- PLAYOUT operations from a non-terminal leaf node eration cannot start until the SELECT operation of the shared tree. These PLAYOUT operations are in- has been completed. dependent of each other, and therefore there is no data dependency. In tree parallelization, parallel threads 2.2.3 Parallelization Methods are potentially able to perform different MCTS op- erations on a same node of the shared tree (Chaslot There are three parallelization methods for MCTS et al., 2008a). Figure 2 shows the tree parallelization (i.e., root parallelization, leaf parallelization, and tree where three threads simultaneously perform different parallelization) that belong to two main categories: MCTS operations on the tree. This is the topic of our (A) parallelization with an ensemble of trees, and (B) research. parallelization with a single shared tree. The root parallelization method belongs to cate- The existing parallel implementations for tree par- gory (A). It creates an ensemble of search trees (i.e., allelization are based on ILP (Chaslot et al., 2008a; one for each thread). The trees are independent of Enzenberger and Muller,¨ 2010; Mirsoleimani et al.,

1 2015b; Mirsoleimani et al., 2015a). The tasks are i.e., a violation of ILD does not impact the correctness of the type of ILT, and the only dependency that ex- of the algorithm. ists among them is ILD. The potential concurrency is 2Occurs when a parallel implementation in a search al- gorithm searches more nodes of the search space than the exploited by assigning a chunk of iterations to sepa- sequential version; for example, since the information to rate parallel threads and having them work on sepa- guide the search is not yet available. rate processors. Our proposed algorithm allows im- 3i.e., a violation of OLD yields an incorrect algorithm. plementation of tree parallelization based on OLP. 3 Design of 3PMCTS Buffer Select

Expand In this section, we describe our proposed method for parallelizing MCTS. Section 3.1 describes how Buffer Select Playout Buffer Playout the pipeline pattern is applied in MCTS. Section 3.2 Expand Buffer provides the 3PMCTS algorithm.

Buffer Playout Buffer

3.1 Pipeline Pattern for MCTS Backup Backup (a) (b) Below we describe how the pipeline pattern is used Figure 3: (3a) Flowchart of a pipeline with sequential stages as a building block in the design of 3PMCTS. Figure for MCTS. (3b) Flowchart of a pipeline with parallel stages 3 shows two types of pipelines for MCTS. The inter- for MCTS. stage buffers are used to pass information between the 1: PE(S) C1(S) C2(S) C3(S) C4(S) stages. When a stage of the pipeline completes its computation; it sends a path of nodes from the search 2: PE(E) C1(E) C2(E) C3(E) C4(E) to the next buffer. The subsequent stage picks a path 3: PE(P) C1(P ) C2(P ) C3(P ) C4(P ) from the buffer and starts its computation. Here we introduce two possible types of pipelines for MCTS. 4: PE(B) C1(B) C2(B) C3(B) C4(B)

1. Pipeline with sequential stages: Figure 3a shows t0 t1 t2 t3 t4 t5 t6 t7 t8 [T] a pipeline with sequential stages for MCTS. The Figure 4: diagram of a pipeline with sequential idea is to map each MCTS operations to pipeline stages for MCTS. The computation for stages are equal. stages so that each stage of the pipeline com- putes one operation. Figure 4 illustrates how the the expected execution time for 4 paths is approx- pipeline executes the MCTS operations over time. imately 11 × T. Let Ci represent a multiple-step computation on path i. Ci( j) is the jth step of the computation in 2. Pipeline with parallel stages: Figure 3b shows a MCTS (i.e., j ∈ {S, E, P, and B}). Initially, the pipeline for MCTS with two parallel PLAYOUT first stage of the pipeline performs C1(S). After stages. Using two PLAYOUT stages in the pipeline the step has been completed, the second stage of results in an overall speed of approximately T the pipeline receives the first path and computes units of time per path as the number of paths C1(E) while the first stage computes the first step grows. Figure 6 shows that the MCTS pipeline is of the second path, C2(S). Next, the third stage perfectly balanced by using two PLAYOUT stages. computes C1(P), while the second stage computes The expected execution time for 4 paths is ap- C2(E) and the first stage C3(S). If each stage proximately 8 × T. Therefore, introducing paral- of the pipeline takes the same amount of time to lel stages improves the scalability of the MCTS do its work, say T. Figure 4 shows that the ex- pipeline. pected execution time for 4 paths in an MCTS pipeline with four stages is approximately 7 × T. 3.2 Pipeline Construction In contrast, the sequential version takes approxi- mately 16 × T because each of the 4 paths must The pseudocode of MCTS is shown in Algorithm 1. be processed one after another. The pipeline pat- Each operation in MCTS constitutes a stage of the tern works best if the operations performed by pipeline in 3PMCTS. In contrast to the existing meth- the various stages of the pipeline are all about ods, our 3PMCTS algorithm is based on OLP for equally computationally intensive. If the stages parallelizing MCTS. The pipeline pattern can satisfy in the pipeline vary in computational effort, the the operation-level dependencies (OLDs) among the slowest stage creates a bottleneck for the aggre- iteration-level tasks (OLTs). gate throughput. In other words, when there are The potential concurrency is also exploited by as- sufficient processors for each pipeline stage, the signing each stage of the pipeline to a separate pro- speed of a pipeline is approximately equal to the cessing element for execution on separate processors. speed of its slowest stage. For example, Figure If the pipeline has only sequential stages then the 5 shows the scheduling diagram that occurs when speedup is limited to the number of stages.4 However the PLAYOUT stage takes 2×T units of time while others take T units of time. Figure 5 shows that 4When the operations performed by the various stages 1: PE(S) C1(S) C2(S) C3(S) C4(S)

2: PE(E) C1(E) C2(E) C3(E) C4(E)

3: PE(P) C1(P ) C2(P ) C3(P ) C4(P )

4: PE(Idle)

5: PE(B) C1(B) C2(B) C3(B) C4(B)

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 [T] Figure 5: Scheduling diagram of a pipeline with sequential Figure 7: The 3PMCTS algorithm with a pipeline that stages for MCTS. The computation for stages are not equal. has three parallel stages (i.e., EXPAND,RANDOMSIMULA- TION, and EVALUATION).

1:PE(S) C1(S) C2(S) C3(S) C4(S)

2:PE(E) C1(E) C2(E) C3(E) C4(E) 4 Implementation

3:PE(P) C1(P ) C3(P )

4:PE(P) C2(P ) C4(P ) We have implemented the proposed 3PMCTS al-

5:PE(B) C1(B) C2(B) C3(B) C4(B) gorithm in the ParallelUCT package (Mirsoleimani et al., 2015a). The ParallelUCT package is an open t0 t1 t2 t3 t4 t5 t6 t7 t8 [T] source tool for parallelization of the UCT algorithm.5 Figure 6: Scheduling diagram of a pipeline with parallel It uses task-level parallelism to implement differ- stages for MCTS. Using parallel stages create load balanc- ent parallelization methods for MCTS. We have also ing. used an algorithm called grain-sized control paral- lel MCTS (GSCPM) to measure the performance of ILP for MCTS. The GSCPM algorithm creates tasks in MCTS, the operations are not equally computa- based on the -join pattern (McCool et al., 2012). tionally intensive, e.g., the PLAYOUT operation (ran- More details about this algorithm can be found in dom simulations plus evaluation of a terminal state) (Mirsoleimani et al., 2015a). Both 3PMCTS and could be more computationally expensive than other GSCPM are implemented by TBB parallel program- operations. Therefore, our 3PMCTS algorithm uses ming library (Reinders, 2007) and they are available a pipeline with parallel stages. Introducing parallel online as part of the ParallelUCT package. In our im- stages makes 3PMCTS more scalable. plementation for 3PMCTS, we can specify the num- Figure 7 depicts one of the possible pipeline con- ber of in-flight tokens. This is equal to the number of structions for 3PMCTS. We split the PLAYOUT op- tasks for GSCPM algorithm. eration into two stages to achieve more parallelism (See Section 2.1). The five stages run the MCTS op- erations SELECT,EXPAND,RANDOMSIMULATION, EVALUATION, and BACKUP, in that order. The 5 Experimental Setup SELECT stage and BACKUP stage are serial. The three middle stages (EXPAND,RANDOMSIMULA- The performance of 3PMCTS is measured by us- TION, and EVALUATION) are parallel and do the most ing a High Energy Physics (HEP) expression simpli- time-consuming part of the search. A serial stage does fication problem (Kuipers et al., 2013; Ruijl et al., one token at a time. A parallel stage is able 2014). Our setup follows closely (Kuipers et al., to process more than one token. Therefore, it needs 2013). We discuss the case study in 5.1, the hardware more than one in-flight token. A token represents a in 5.2, and the performance metrics in 5.3. path of nodes inside the search tree during the search. The pipeline depicted in Figure 7 is one of the 5.1 Case Study possible constructions for 3PMCTS. Each of the five stages could be either serial or parallel. Therefore, Our case study is in the field of Horner’s rule, which 3PMCTS provides a great level of flexibility. For ex- is an algorithm for polynomial computation that re- ample, a pipeline could have a serial stage for the SE- duces the number of multiplications and results in a LECT operation and a parallel stage for the BACKUP computationally efficient form. For a polynomial in operation. In our experiments we use this construc- one variable tion (see Section 6). n n−1 p(x) = anx + an−1x + ... + a0, (2) are all about equally computationally intensive. 5https://github.com/mirsoleimani/paralleluct/ the rule simply factors out powers of x. Thus, the 24 22 polynomial can be written in the form 20 18 p(x) = ((anx + an−1)x + ...)x + a0. (3) 16 14 This representation reduces the number of multipli- 12

Speedup 10 cations to n and has n additions. Therefore, the total 8 evaluation cost of the polynomial is 2n. 6 4 GSCPM Horner’s rule can be generalized for multivariate 2 3PMCTS

1 2 4 8 16 32 64 polynomials. Here, Eq. 3 applies on a polynomial 128 256 512 1024 2048 4096 for each variable, treating the other variables as con- Number of Tasks stants. The order of choosing variables may be dif- Figure 8: Playout-speedup as function of the number of ferent, each order of the variables is called a Horner tasks (tokens). Each data point is an average of 21 runs scheme. for a search budget of 8192 playouts. The constant Cp is The number of operations can be reduced even 0.5. Here a higher value is better. more by performing common subexpression elimi- nation (CSE) after transforming a polynomial with Horner’s rule. CSE creates new symbols for each Where T1 is the time of the program with one subexpression that appears twice or more and replaces worker and Tp is the time of the program with P work- them inside the polynomial. Then, the subexpression ers. In our results we report the scalability of our par- has to be computed only once. allelization as strong scalability which means that the We are using the HEP(σ) expression with 15 vari- problem size remains fixed as P varies. The problem ables to study the results of 3PMCTS. The MCTS is size is the number of playouts (i.e., the search budget) used to find an order of the variables that gives effi- and the P is the number of tasks. In the literature this cient Horner schemes (Ruijl et al., 2014). The root form of speedup is called playout-speedup (Chaslot node has n children, with n the number of variables. et al., 2008a). The children of other nodes represent the remaining The second important metric is the number of op- unchosen variables in the order. Starting at the root erations in the optimized expression. A lower value node, a path of nodes (variables) inside the search tree is desirable when higher number of tasks is used. is selected. The incomplete order is completed with the remaining variables added randomly (i.e., RAN- DOMSIMULATION). The complete order is then used 6 Experimental Results for Horners method followed by CSE to optimize the expression. The number of operations (i.e., ∆) in this optimized expression is counted (i.e., EVALUATION). In this section, the performance of 3PMCTS is measured. Table 1 shows the sequential time to ex- 5.2 Hardware ecute the specified number of playouts. Figure 8 shows the playout-speedup for both Our experiments were performed on a dual socket In- 3PMCTS and GSCPM, as a function of the number tel machine with 2 Intel Xeon E5-2596v2 CPUs run- of tasks (from 1 to 4096). The search budget for both ning at 2.4 GHz. Each CPU has 12 cores, 24 hyper- algorithms is 8192 playouts. The 3PMCTS algorithm threads, and 30 MB L3 cache. Each physical core has uses a pipeline with five stages for MCTS operations. 256KB L2 cache. The peak TurboBoost frequency is Four stages are parallel; the SELECT stage is chosen 3.2 GHz. The machine has 192GB physical memory. to be serial (see the end of Section 3.2). A playout- We compiled the code using the Intel C++ compiler speedup close to 21 on a 24-core machine is observed with a -O3 flag. for both algorithms. From our results, we may provi- sionally conclude that the 3PMCTS algorithm shows 5.3 Performance Metrics (a) a speedup less than GSCPM from 4 to 32 parallel tasks, and (b) a better speedup from 64 to 512 paral- lel tasks (see Figure 8). At the same time, 3PMCTS One important metric related to performance and par- also allows flexible control of the parallel or serial ex- allelism is speedup. Speedup compares the time for solving the identical computational problem on one worker versus that on P workers: Table 1: Sequential time in seconds when Cp = 0.5. Processor Num. Playouts Time (s) T speedup = 1 . (4) CPU 8192 215.72 ± 4.12 TP Cp =0.01 Cp =0.5 Cp =0.01 Cp =0.5 GSCPM,Cp =0.5 4450 4350 Cp =0.1 Cp =1 4350 Cp =0.1 Cp =1 3PMCTS,Cp =0.5

4400 Root Par.,Cp =0.01 4300 4300 4350

4250 4250 4300

4250 4200 4200 4200 Number of Operations of Number Number of Operations of Number Operations of Number 4150 4150 4150

4100 4100 4100 1 2 4 8 1 2 4 8 1 2 4 8 16 32 64 16 32 64 16 32 64 128 256 512 128 256 512 128 256 512 1024 2048 4096 1024 2048 4096 1024 2048 4096 Number of Tasks Number of Tasks Number of Tasks

(a) 3PMCTS (b) GSCPM (c) Root Parallelization Figure 9: Number of operations as function of the number of tasks (tokens). Each data point is an average of 21 runs for a search budget of 8192 playouts. Here a lower value is better. ecution of MCTS operations (e.g., the SELECT stage feasible choice in this domain. is sequential and the BACKUP stage is parallel in our Kuipers et al. remarked that tree parallelization case), something that GSCPM cannot provide. would give a result that is statistically a little bit in- Figure 9a and 9b show the results of the optimiza- ferior to a run with sequential MCTS with the same tion in the number of operations in the final expres- number of playouts due to the violation of iteration- sion for both algorithms. These results show consis- level dependency (ILD) that produces search over- tency with the findings in (Kuipers et al., 2013; Ruijl head (Kuipers et al., 2015). It is clear from our results et al., 2014). From our results, we may conclude two that, the effectiveness of any parallelization method observations. (1) When MCTS is sequential (i.e., the for MCTS depends heavily on the choice of three pa- C number of tasks is 1), for small values of Cp, such that rameters: (1) the p constant, (2) the number of play- MCTS behaves exploitively, the method gets trapped outs, and (3) the number of tasks. If we select these in local minima, and the number of operations is high. parameters carefully, it is possible to overcome the For larger values of Cp, such that MCTS behaves ex- search overhead to some extent. Furthermore, the ploratively, lower values for the number of operations 3PMCTS algorithm provides the flexibility of man- is found. (2) When MCTS is parallel, for small num- aging the execution (serial or parallel) of different bers of tasks (from 2 to 8), it turns out to be good MCTS operations that helps us even more to achieve to choose a high value for the constant Cp (e.g., 1) this goal. for both 3PMCTS and GSCPM. With higher numbers of tasks, a lower value for Cp in the range [0.5; 1) seems suitable for both algorithms. Figure 9c also 7 Conclusion and Future Work shows that 3PMCTS can find lower number of oper- ations for 8, 16, and 32 tasks when Cp = 0.5. When Monte Carlo Tree Search (MCTS) is a randomized both algorithms find the same number of operations, algorithm that is successful in a wide range of opti- one with higher speedup is better. The 3PMCTS algo- mization problems. The main loop in MCTS consists rithm finds the same number of operations compared of individual iterations, suggesting that the algorithm to GSCPM for 64 tasks, but it has higher speedup is well suited for parallelization. The existing par- when Cp = 0.5. Note that these values hold for this allelization methods, e.g., tree parallelization, simply particular polynomial and that different polynomials fans out the iterations over available cores. give different optimal values for Cp and number of In this paper, a new parallel algorithm based on tasks. the pipeline pattern for MCTS is proposed. The idea A comparison to root parallelization is illustrated is to break-up the iterations themselves, splitting them in Figure 9c. Both 3PMCTS and GSCPM belong to into individual operations, which are then parallelized the category of tree parallelization. For Cp = 0.01, in a pipeline. Experiments with an application from root parallelization finds a lower number of opera- High Energy Physics show that our implementation of tions for both 16 and 32 tasks compared to the two 3PMCTS scales well. Scalability is only one issue, al- other methods. However, increasing the number of though it is an important one. The second issue is the tasks causes root parallelization to provide a much flexibility of task decomposition in parallelism. These higher number of operations. From these results, we higher levels of flexibility allow fine-grained manag- may conclude that root parallelization could also be a ing of the execution of operations in MCTS which provide different pipeline constructions. We consider ECML’06 Proceedings of the 17th European the flexibility an even more important characteristic conference on Machine Learning, volume 4212 of 3PMCTS. For future work, we will study the ef- of Lecture Notes in Computer Science, pages fectiveness of 3PMCTS with regards to the different 282–293. Springer Berlin Heidelberg. pipeline constructions. Kuipers, J., Plaat, A., Vermaseren, J., and van den Herik, J. (2013). Improving Multivariate Horner Schemes with Monte Carlo Tree Search. Com- ACKNOWLEDGEMENTS puter Physics Communications, 184(11):2391– 2395. This work is supported in part by the ERC Advanced Kuipers, J., Ueda, T., and Vermaseren, J. A. M. Grant no. 320651, “HEPGAME.” (2015). Code optimization in FORM. Computer Physics Communications, 189(October):1–19. McCool, M., Reinders, J., and Robison, A. (2012). REFERENCES Structured Parallel Programming: Patterns for Efficient Computation. Elsevier. Browne, C. B., Powley, E., Whitehouse, D., Lucas, Mirsoleimani, S. A., Plaat, A., van den Herik, J., and S. M., Cowling, P. I., Rohlfshagen, P., Tavener, Vermaseren, J. (2015a). Parallel Monte Carlo S., Perez, D., Samothrakis, S., and Colton, S. Tree Search from Multi-core to Many-core Pro- (2012). A Survey of Monte Carlo Tree Search cessors. In ISPA 2015 : The 13th IEEE Inter- Methods. Computational Intelligence and AI in national Symposium on Parallel and Distributed Games, IEEE Transactions on, 4(1):1–43. Processing with Applications (ISPA), pages 77– Chaslot, G., Winands, M., and van den Herik, J. 83, Helsinki. (2008a). Parallel Monte-Carlo Tree Search. Mirsoleimani, S. A., Plaat, A., van den Herik, J., and In the 6th Internatioal Conference on Comput- Vermaseren, J. (2015b). Scaling Monte Carlo ers and Games, volume 5131, pages 60–71. Tree Search on Intel Xeon Phi. In Parallel and Springer Berlin Heidelberg. Distributed Systems (ICPADS), 2015 20th IEEE Chaslot, G. M. J. B., Winands, M. H. M., van den International Conference on, pages 666–673. Herik, J., Uiterwijk, J. W. H. M., and Bouzy, B. Reinders, J. (2007). Intel threading building blocks: (2008b). Progressive strategies for Monte-Carlo outfitting C++ for multi-core processor paral- tree search. New Mathematics and Natural Com- lelism. ” O’Reilly Media, Inc.”. putation, 4(03):343–357. Ruijl, B., Vermaseren, J., Plaat, A., and van den Coulom, R. (2006). Efficient Selectivity and Backup Herik, J. (2014). Combining Simulated Anneal- Operators in Monte-Carlo Tree Search. In Pro- ing and Monte Carlo Tree Search for Expression ceedings of the 5th International Conference on Simplification. Proceedings of ICAART Confer- Computers and Games, volume 4630 of CG’06, ence 2014, 1(1):724–731. pages 72–83. Springer-Verlag. Schaefers, L. and Platzner, M. (2014). Distributed Enzenberger, M. and Muller,¨ M. (2010). A - Monte-Carlo Tree Search : A Novel Technique free multithreaded Monte-Carlo tree search algo- and its Application to Computer Go. IEEE rithm. Advances in Computer Games, 6048:14– Transactions on Computational Intelligence and 20. AI in Games, 6(3):1–15. Fern, A. and Lewis, P. (2011). Ensemble Monte-Carlo Silver, D., Huang, A., Maddison, C. J., Guez, A., Planning: An Empirical Study. In ICAPS, pages Sifre, L., Van Den Driessche, G., Schrittwieser, 58–65. J., Antonoglou, I., Panneershelvam, V., Lanctot, Goodfellow, I., Bengio, Y., and Courville, A. (2016). M., et al. (2016). Mastering the game of go with Deep Learning. Adaptive Computation and Ma- deep neural networks and tree search. Nature, chine Learning Series. MIT Press. 529(7587):484–489. Hassabis, D. and Silver, D. (2017). Alphago’s next Yoshizoe, K., Kishimoto, A., Kaneko, T., Yoshi- move. moto, H., and Ishikawa, Y. (2011). Scalable Kocsis, L. and Szepesvari,´ C. (2006). Bandit based Distributed Monte-Carlo Tree Search. In Fourth Monte-Carlo Planning Levente. In Furnkranz,¨ Annual Symposium on Combinatorial Search, J., Scheffer, T., and Spiliopoulou, M., editors, pages 180–187.