Speculative Parallelism in Cilk++

Ruben Perez (MIT), [email protected]
Gregory Malecha (Harvard University SEAS), [email protected]

ABSTRACT
Backtracking search algorithms are useful in many domains, from SAT solvers to artificial intelligences for playing games such as chess. Searching disjoint branches can, inherently, be done in parallel, though it can considerably increase the amount of work that the algorithm does. Such parallelism is speculative: once a solution is found, additional work is irrelevant, but the individual branches each have their own potential to find the solution. In sequential algorithms, heuristics are used to prune regions of the search space. In parallel implementations this pruning often corresponds to aborting existing computations that can be shown to be pursuing dead-ends. While some systems provide native support for aborting work, Intel's current parallel extensions to C++, implemented in the cilk++ compiler [Lei09], do not.

In this work, we show several methods for implementing abort as a library in the cilk++ system. We compare our implementations to each other and quantify the benefit of abort in a real game AI. We derive a mostly mechanical translation, based on continuation passing style, to convert programs with native abort into cilk++.

Keywords
Speculative parallelism, α-β pruning, Cilk++

1. INTRODUCTION
Over the years many algorithms have been devised for solving difficult problems efficiently: graph algorithms such as max-flow [EK72] and shortest path [Flo62], and numerical computations such as matrix multiplication [Str69]. Despite much success addressing many problems, fundamental limits govern the efficiency of algorithms for a range of important problems that rely on search. In these cases, parallelism can be used to mitigate some of the inefficiency of solving such problems.

Many of these algorithms come from artificial intelligence and are based around backtracking search, often optimized with heuristics. Algorithms such as these can benefit tremendously from parallel processing because separate threads can be used to search independent paths in parallel. We call this strategy speculative parallelism because only one solution matters and, as soon as that is found, all other work can be stopped.

While simple to describe, this use of parallelism is not always easy to implement. One parallel programming platform, cilk5 [FLR98], included language support for this type of parallelism, but it has since been removed in more recent versions of the platform adapted to C++, cilk++ [Lei09]. Our goal is to regain the power of speculative execution in a way that is relatively natural to program with, without the need to modify the runtime.

Contributions
In this work we consider the parallel implementation of algorithms that rely on speculative parallelism in cilk++ [Lei09]. We begin with a brief overview of the mechanisms for speculative execution in cilk5 (Section 2.2) and one prominent use of these mechanisms, developing game AI using α-β pruning (Section 2.1). We then consider our contributions:

• We present a technique for compositional speculative execution in cilk++ which uses polling and can be built as a library without any special runtime support (Section 3).
• We present a mostly-mechanical translation from cilk5 code to use our library (Section 3.1).
• We demonstrate several implementation strategies for the library needed for our technique (Section 3.2).
• We compare the performance tradeoffs of several implementations of our abort library (Section 4.1).
• We determine the effect that polling granularity has on performance (Section 4.2).
• We attempt to quantify the difficulty of porting an existing program that uses speculative execution to our framework (Section 4.3).
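As a minimal illustration of what speculative parallelism means in practice (this sketch is ours and uses standard C++ threads, not the cilk++ mechanisms developed in this paper), consider two branches racing to find any multiple of 91: the first branch to succeed publishes its result and raises a shared flag, and the losing branch polls that flag so it can stop instead of finishing its half of the search.

```cpp
#include <atomic>
#include <thread>

// Illustrative sketch only: names and the toy "search" are ours.
std::atomic<bool> done{false};
std::atomic<int> answer{0};

void branch(int lo, int hi) {
  for (int i = lo; i < hi; ++i) {
    if (done.load(std::memory_order_relaxed)) return;  // speculative work aborted
    if (i % 91 == 0) {            // found a "solution"
      answer.store(i);
      done.store(true);           // tell the sibling branch to stop
      return;
    }
  }
}

int speculative_first() {
  done.store(false);
  std::thread t1(branch, 1, 1000);
  std::thread t2(branch, 1000, 2000);
  t1.join();
  t2.join();
  return answer.load();
}
```

Which branch wins is nondeterministic, but only one answer is kept; the point is that the flag converts "useless" work into work that terminates early.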

We conclude by considering the difficulty of analyzing programs with speculative execution and then describe several avenues of future research.

2. BACKGROUND
We begin with an overview of our target application, the parallelization of α-β pruning, before describing how speculative execution is incorporated into the cilk5 programming language.

2.1 Alpha-Beta Search
Alpha-Beta (α-β) search is an algorithm commonly used to search a 2-player game tree [KM75]. A game tree is a tree where nodes represent states of a game, and edges represent the transitions from one game state to another, i.e. legal moves in the game. Game states are ranked by a heuristic static board evaluator which assigns a number that is meant to encode a player's preference for being in the board configuration.

A naive game tree search algorithm, such as minimax, considers possible game states at each level of the tree. It assumes that one player, the maximizing player, wins the game by getting to the game state with the highest value. The other player, the minimizing player, wins by forcing the maximizing player to a game state with a low value. Each player will always make the best moves possible; the maximizing player will choose game states with high values, and the minimizing player will choose game states with low values.

The Alpha-Beta algorithm keeps track of two values, alpha and beta, for each game state in the tree. The alpha value represents a score that the maximizing player is guaranteed to do better than, a lower bound on his possible score. The beta value represents the lowest score the minimizing player can force the other player to get, an upper bound on the maximizing player's score. With these values, we can tell that a game state where alpha is greater than beta would never be reached. Such a game state would mean that one player is not playing the best moves possible, and allowed the maximizing player to reach a game state where he is guaranteed a higher score than another available move, the move that yielded the beta value. For these game states, there is no need to further explore the subtree, as we know the move will never be reached. This elimination of work is called alpha-beta pruning, as we essentially prune away branches of the search tree.

An example of such a pruning is shown in Figure 1, a small game tree where a prune would occur. The left two leaves have been evaluated to have values of 3 and 0, while the alpha-beta values, contained in a tuple, for the root's third child are being determined. Having evaluated the other leaves, we know that the maximizing player can get at least a three, hence that game state's alpha value is three. Once we evaluate the game state's left child, with a value of 2, we can say that the beta value is 2, as that now is an upper bound on the maximizing player's score at that game state. However, now we see a contradiction, where alpha is greater than beta. Given that this state is reached, the maximizing player is guaranteed to get less than or equal to 2, because the minimizing player will always choose that particular game state (or one that has a lower score). It makes no sense for the maximizing player to ever choose to go to this game state, since he has a better move available, the root's left-most child with a value of 3. Thus we can abort exploration of the rest of the root's right-most subtree, and save ourselves some work.

[Figure 1 shows a max/min game tree with two left leaves valued 3 and 0, a third child whose interval narrows from [3, ∞] to [3, 2], and the remaining subtree marked "Prune".]
Figure 1: Example of Alpha Beta Pruning

A simple parallelization of the algorithm would be to simply search the children of a game state in parallel. However, such an implementation could execute work that a serial implementation would not, by exploring branches of the tree that the serial version would have pruned away. At the time we spawn, we cannot know whether this is the case, so we speculate that such work will be useful, and search anyway. If any spawned search does invoke the cutoff condition, then we must terminate all of the other spawned threads, as they are now performing useless work by exploring game states that we will never reach. Thus we see why an abort mechanism is essential to an efficient parallel alpha-beta search algorithm, and to speculative parallel programming in general.

2.2 Speculative Execution in Cilk5
MIT cilk5 [BJK+95, FLR98] includes an abort primitive to support early termination of speculative work. To properly explain how it is used, we must introduce another feature, inlets. An inlet is a local C function which shares stack space with its parent function. It is declared inside its parent function and is local to the scope of the parent function. cilk5 allows an argument to a call to an inlet to be a spawned computation. When the spawned computation returns, the inlet is called with the result. The cilk5 runtime guarantees that all inlets associated with a single activation record run sequentially. This allows us to conveniently access local data to accumulate a result without the need for explicit locking. Figure 2 shows an example of using inlets to implement non-deterministic choice. The code interleaves two long computations and returns the value returned by one of them.

   1  int f_1(void* args);
   2  int f_2(void* args);
   3
   4  int first(void* args) {
   5    int x;
   6
   7    inlet void reduce(int r) {
   8      x = r;
   9      abort;
  10    }
  11
  12    reduce(spawn f_1(args));
  13    reduce(spawn f_2(args));
  14    sync;
  15
  16    return x;
  17  }

Figure 2: Using speculative parallelism in cilk5 to implement non-deterministic choice.

The inlet (lines 7-10) takes the return value of one of the speculated computations and stores it in the parent's local variable x (line 8). The inlet can then abort the remaining computation because a value has been computed (line 9). After the sync, we know that the inlet will have run at least once, so the value in x is valid and we can return it.

3. COMPOSITIONAL SPECULATION
For many applications, such as exhaustive search, it is acceptable to have a single-level abort that allows the entire computation to be aborted; however, the structure of α-β search requires the ability to abort individual subtrees of the search tree. We refer to this as hierarchical abort. Figure 3 demonstrates how this type of abort works. While the algorithm explores node 3, it notes that it can prune the entire branch below node 1, but it can't prune the computation below node 2.

[Figure 3 shows an abort tree rooted at node 0 with children 1 and 2, grandchildren 3-6, and leaves 7, 8, 11, and 12; the subtree below node 1 is marked "Abort".]
Figure 3: Using hierarchical abort, we can abort any subtree without affecting the rest of the tree.

In order to implement abort without modifying the runtime system, we implement abort via polling. The granularity of polling will control the tradeoff between the responsiveness of aborts and the overhead of supporting abort in the code. We explore this relationship in greater detail in Section 4.2. In this section we begin by describing an algorithm for translating programs that use inlets into programs that do not use abort (Section 3.1). We then discuss a variety of strategies for implementing the interface for polling (Section 3.2).

3.1 Translating Inlets
We translate inlets using continuation passing style, a common code transformation in functional language compilers [AJ89]. The basic idea behind continuation passing style is to pass the continuation (a function pointer) that should be used to return the value to the caller. When we CPS-convert a function of type:

  T foo(...);

we rewrite it into a function of the following type:

  T foo(..., void* env, T& (*cont)(void*, T&));

where cont is the pointer to the continuation and env stores any state needed by the continuation.

Strictly speaking, this translation is not the full CPS-translation. A full CPS-translation would change the return type to void [1] and would ensure that the function never returns. We skip this to simplify the translation of code and localize the necessary code changes.

In order to convert the body of foo to use continuation passing style, we convert return statements of the form:

  return x;

into returns of function calls of the passed continuation:

  return cont(env, x);

A good C++ compiler should be able to tail-optimize the extra function call, ameliorating some of the overhead incurred by the translation. This translation allows us to run the inlet in the same thread that runs foo regardless of whether a steal occurs.

In order to describe our translation in greater detail, we consider the simple implementation of the non-deterministic choice code presented earlier. Figure 4 shows the code and its translation. In the code, the functions f_1 and f_2 are meant to compute the same value with algorithms that are efficient on only part of their domain. The first function speculatively evaluates both f_1 and f_2 in parallel and returns one of their return values [2].

Since first contains an inlet, we need to convert the functions that it calls that use it into continuation passing style. We elide the bodies of these functions since this translation was described previously. As can be seen from the cilk5 code, the return values should be passed along to the reduce inlet, so we must factor out reduce as the continuation. Naïvely lifting this function is problematic, however, because the variable x is scoped in the first function. Thus we need the InletEnv structure to carry this value as well as the mutex to ensure mutual exclusion and a handle to our abort mechanism, embodied in the Abort class which we describe in the next section.

The content of the inlet is mostly unchanged except that it now must reference variables through the given environment and, since cilk5 guarantees sequential execution of the inlets for an activation record, we acquire and release the mutex in the environment. The abort statement is translated into an invocation of the abort member function in the Abort class which handles the abort.

The final aspect of the translation is to inject polling calls into the speculated functions, f_1 and f_2. Since our implementation relies on polling, the granularity of these poll statements is exactly the granularity at which aborts can occur. Long periods between polls will result in wasting fewer cycles polling but will potentially allow more wasted computation to be done between aborting the child and having the child actually stop working. We investigate this trade-off further in Section 4.1.

[1] Technically any type can be used since the function would never return.
[2] If abort is implemented correctly, then the result will usually be the return value of the function that returned first, though we don't bother to guarantee this in our code.

cilk5 Code:

  int f_1(void* args);
  int f_2(void* args);

  int first(void* args) {
    int x;

    inlet void reduce(int r) {
      x = r;
      abort;
    }

    reduce(spawn f_1(args));
    reduce(spawn f_2(args));
    sync;

    return x;
  }

cilk++ Code:

  struct InletEnv { cilk::mutex m; Abort abort; int x; };

  int f_1(void* args, int (*cont)(int, void*), void* env);
  int f_2(void* args, int (*cont)(int, void*), void* env);

  int first_inlet(int result, void* _env) {
    InletEnv* env = (InletEnv*)_env;
    env->m.lock();

    env->x = result;      // x = r;
    env->abort.abort();   // abort;

    env->m.unlock();
    return result;
  }

  int first(void* args) {
    InletEnv env;
    cilk_spawn f_1(args, first_inlet, &env);
    cilk_spawn f_2(args, first_inlet, &env);
    cilk_sync;
    return env.x;
  }

Figure 4: Translation of abort and inlet.
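The shape of this translation can be exercised in plain, sequential C++ by standing in for the cilk primitives: the sketch below (ours, not the paper's library) replaces cilk_spawn with ordinary calls, replaces the Abort handle with a bare flag, and keeps the InletEnv/continuation structure of Figure 4 so the data flow from speculated function, through the continuation, into the parent's environment is visible.

```cpp
#include <mutex>

// Sketch of the CPS-translated non-deterministic choice. Assumptions:
// sequential calls stand in for cilk_spawn, and a bool stands in for Abort.
struct InletEnv {
  std::mutex m;
  bool aborted = false;  // stand-in for the paper's Abort handle
  int x = 0;
};

typedef int (*Cont)(int, void*);  // continuation signature, as in Figure 4

int first_inlet(int result, void* _env) {
  InletEnv* env = (InletEnv*)_env;
  std::lock_guard<std::mutex> lock(env->m);  // cilk5 ran inlets sequentially
  env->x = result;       // was: x = r;
  env->aborted = true;   // was: abort;
  return result;
}

int f_1(void* args, Cont cont, void* env) {
  (void)args;
  return cont(1, env);   // CPS-converted "return 1;"
}

int f_2(void* args, Cont cont, void* env) {
  (void)args;
  InletEnv* e = (InletEnv*)env;
  if (e->aborted) return 0;  // injected poll: a sibling already returned
  return cont(2, env);
}

int first(void* args) {
  InletEnv env;
  f_1(args, first_inlet, &env);  // would be cilk_spawn in cilk++
  f_2(args, first_inlet, &env);  // would be cilk_spawn in cilk++
  return env.x;                  // after the (implicit) sync
}
```

Run sequentially, f_1 wins, its result lands in env.x through the continuation, and f_2's poll observes the abort and skips its work.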

3.2 The Abort Library
The interface to our hierarchical abort library is given in Figure 5. The default constructor is used to construct the root of an abort tree. Since the abort tree will often mirror the stack, we rely on stack-allocated Abort objects. The "copy-constructor" is used to construct a new abort tree which is a child of the given tree. We call this when we speculate on a computation. Given the structure of the tree, the interface exposes two member functions, isAborted and abort. The first determines if the current computation has been aborted by determining if any parent in the tree has been asked to abort. The second member function, abort, actually marks a tree as needing to abort.

  class Abort {
  public:
    Abort();
    explicit Abort(Abort* p);
    int isAborted() const;
    void abort();
  };

Figure 5: The Abort interface.

The polling in our interface admits two general strategies for implementing the interface: 1) abort just sets a flag while isAborted polls up the tree until it reaches the root, checking for abort at each level; 2) abort can propagate the abort flag downwards to all of its children and isAborted can simply check the abort flag of the current level.

We begin with the implementation that polls upward (sketch implementation given in Figure 6). The implementation is completely straightforward. In isAborted we or the current abort flag with the result of polling up the tree (C++ short-circuit semantics of || ensure that we don't poll up if our current flag is set). The abort function simply sets the current abort flag. When trees are deep, we can amortize the cost of recurring all the way up the tree by lazily propagating the abort flag downward. This functionality is enabled by setting the POLL_UP_CACHE flag at compile time. The code is similar except that we must ensure that we only update the value of the aborted flag if we are setting it to true, since this operation causes potential race conditions that can alter the desired behavior of abort.

  class Abort {
    Abort* parent;
    bool aborted;

  public:
    Abort()
      : aborted(false), parent(NULL) { }

    explicit Abort(Abort* p)
      : aborted(false), parent(p) { }

    int isAborted() const {
  #ifndef POLL_UP_CACHE
      return this->aborted
          || (this->parent &&
              this->parent->isAborted());
  #else
      if (this->aborted) return 1;
      if (this->parent &&
          this->parent->isAborted())
        const_cast<Abort*>(this)->aborted = true;
      return this->aborted;
  #endif /* POLL_UP_CACHE */
    }

    inline void abort() {
      this->aborted = true;
    }
  };

Figure 6: Implementation of Abort that polls upward.

The alternative implementation of abort is considerably more complex due to the intricacies of manual memory management in C++. Figure 7 shows a sketch implementation of the interface. Naively, abort simply sets its aborted flag and recursively calls abort on its children. When we speculate by calling the second constructor, the new Abort object is recorded in the parent. When an Abort object is destroyed, it is removed from the parent's children list. When abort is called, the code iterates the list of children and aborts each one. Note, however, that this can race with the children's destructors, so we use a combination of compare-and-swap and mutexes to protect against accessing freed memory. While a lock-free implementation would likely be superior, the locking implementation is already fairly complicated and we were unable to finish the lock-free version.

  class Abort {
    list<Abort*> children;
    Abort** parent;
    bool aborted;

    cilk::mutex m;

  private:
    void registerListener(Abort* res) {
      m.lock();
      if (this->aborted) {
        res->parent = NULL;
        res->aborted = true;
      } else {
        children.push_back(res);
        res->parent = &(*children.end());
      }
      m.unlock();
    }

  public:
    Abort() : aborted(false), parent(NULL) { }

    Abort(Abort* parent) {
      parent->registerListener(this);
    }

    ~Abort() {
      if (this->parent &&
          !__bool_cas(this->parent, this, NULL)) {
        while (!this->aborted);
        m.lock();
        m.unlock();
      }
    }

    inline int isAborted() const {
      return this->aborted;
    }

    void abort() {
      Abort** tmp = this->parent;
      if (tmp && !__bool_cas(tmp, this, NULL))
        return;

      m.lock();

      if (this->aborted) {
        m.unlock();
        return;
      }

      this->aborted = true;

      for (iterator itr = children.begin();
           itr != children.end(); itr++) {
        Abort* r = *itr;
        if (!r) continue;
        if (__bool_cas(&*itr, r, NULL)) {
          r->abort();
        }
      }
      m.unlock();
    }
  };

Figure 7: Sketch implementation of Abort that pushes the abort flag down.

4. ANALYSIS
In this section we analyze the trade-offs inherent in our technique. At the high level, we are interested in three aspects: first, the performance implications of the different Abort implementations; second, the effect of polling granularity on the useful work that our system does; and third, we wish to quantify the ease with which the technique can be applied to existing code.

4.1 Cost of Different Aborts
In Section 3.2 we proposed three different implementations of the Abort library. In this section, we do our best to compare these implementations in as fair a manner as possible. We begin at a theoretical level by considering the amount of work performed by each of the operations. Since both abort and isAborted are sequential in all of our implementations, we consider only the amount of work done.

Consider first the push-down implementation in which abort propagates the abort downward and isAborted simply checks a flag. It is easy to see that isAborted is constant time while abort takes time proportional to the size of the tree being aborted (|C|).

  abort↓ ∈ Θ(|C|)
  isAborted↓ ∈ Θ(1)

In this implementation, abort will be considerably more expensive than isAborted; however, abort does work proportional to the amount of work that is saved by the abort, since each node can be aborted only once and, in theory, once a node is aborted, no additional work is done at that node.

The implementation of poll-up has the opposite tradeoff. The abort procedure is constant time, making it cheap to abort even a large subtree. Instead the cost is paid in more expensive isAborted calls: isAborted takes time linear in the depth of the tree (log |S|) being aborted, or linear in the depth of the entire tree if the node is not aborted.

  abort↑ ∈ Θ(1)
  isAborted↑ ∈ O(log |S|)

The benefit of using one algorithm or another will depend on a variety of factors including the amount of polling that the system does and the size of trees being aborted. Therefore, we consider the following experimental setup. We perform, in parallel, a breadth-first construction of an abort tree of height h and branching factor b, gathering the leaves into a list (Build). We then randomly permute the list in order to avoid some of the cache effects. Considering all of the leaves in parallel, we poll 100 times at each leaf (Poll (x100)). Note that these nodes are not aborted yet. We then abort the root node (Abort). Finally, we poll a single time at each leaf, noting that this time the node is aborted. Figure 8 shows the time that each of these phases takes in processor clock ticks.

Figure 8: Time for building, polling and aborting a tree.

The difference in tree size and shape drastically changes whether the poll-up or push-down implementation is more efficient. When the tree height is 14 with a branching factor of 4, poll-up is considerably faster than push-down, but this changes completely when increasing the tree height to 20 and decreasing the branching factor to 2. It appears that the cost of polling repeatedly doesn't overcome the cost of the sequential abort until the tree height is quite large.

Note also the difference between the orange and green segments. Even on the naive poll-up strategy, polling up 100 times only takes twice as much time as polling up once when the tree height is 14. This suggests that subsequent poll-ups are essentially free, which is most likely due to caching effects. Traversing the tree upward a second time simply results in 14 more hits to the L1 cache, which is very cheap on current hardware.

Finally, we note that while the caching is meant to be an optimization on the poll-up strategy, it actually degrades performance considerably. This is likely due to the added complexity of the code, which requires several jumps in addition to the recursion. The graph of height 20 suggests that this overhead is proportional to the size of the tree, since it is not significant for the smaller test, but predominates in the larger tree test case.

4.2 Varying the Frequency of Polls
One additional variable common to all three polling methods is how often polling should be performed. The cost of polling, as elaborated on in the previous section, is not trivial, but polling less frequently means aborted threads waste time by performing useless work. To measure this effect we ran the ported Pousse code using our poll-up strategy without caching and varied the frequency at which client code polls.

The implementation uses the iterative deepening search strategy, which performs the standard α-β search to a given level d and iteratively increases d until time runs out. This strategy ensures that there is always a decent, valid move. Because aborts that occur closer to the leaves result in less work being saved, there should be a tradeoff between the depth of polling and the amount of work that is done.

We measured the tradeoff between work and polling depth by considering the number of nodes that different polling strategies were able to search. For our tests, the search algorithm was given 8 seconds per turn, a roughly constant amount of which is used for input and output. The strategies that we tested are:

• Aggressive polls at the beginning of and periodically throughout search.
• Coarsen-n polls at the beginning of search when the remaining depth is greater than or equal to n. For example, Coarsen-2 does not poll at the last two levels of the tree.
• Spawn polls only before a cilk_spawn statement and thus avoids spawning extra work, but does not abort currently running work.

The results of our trials are shown in Figure 9.
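The Coarsen-n rule can be sketched as a depth-gated poll inside a toy negamax-style alpha-beta search. Everything in this sketch is ours, chosen only for illustration: the branching factor, the seed-based stand-in for a static board evaluator, and the poll_coarsen predicate that skips polling in the last n levels of the tree.

```cpp
#include <algorithm>

// Illustrative sketch (not the paper's Pousse code) of Coarsen-n polling.
static int polls = 0;            // how many times we paid for a poll
static bool abort_flag = false;  // stand-in for Abort::isAborted()

bool poll_coarsen(int depth, int n) {
  if (depth < n) return false;   // Coarsen-n: no polling at the last n levels
  ++polls;
  return abort_flag;
}

int search(int depth, int alpha, int beta, int seed, int n) {
  if (poll_coarsen(depth, n)) return alpha;      // aborted: result is unused
  if (depth == 0) return (seed * 37) % 100 - 50; // toy static evaluator
  for (int move = 0; move < 3; ++move) {
    int score = -search(depth - 1, -beta, -alpha, seed * 3 + move, n);
    alpha = std::max(alpha, score);
    if (alpha >= beta) break;    // the usual alpha-beta cutoff
  }
  return alpha;
}
```

With the abort flag never raised, the search result is independent of n, but larger n trades away poll overhead near the leaves, which is exactly the tradeoff measured in this section.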

[Figure 9 consists of two bar charts: the total number of nodes computed (x1000) and the percentage of polls that lead to aborts, each plotted against the polling strategies Aggressive, Coarsen-3, Coarsen-4, Coarsen-5, and Spawn.]

Figure 9: Pousse performance modifying the granularity of polling.
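For reference, the poll-up scheme whose cost these measurements exercise reduces to a few lines of standard C++. The class below is our reduction of the Figure 6 sketch (without the POLL_UP_CACHE variant), renamed PollUpAbort to make clear it is illustrative rather than the library itself: abort() is constant time, and each poll walks toward the root, so isAborted() costs the depth of the hierarchy.

```cpp
#include <cstddef>

// Illustrative reduction of the poll-up Abort tree (names are ours).
class PollUpAbort {
  PollUpAbort* parent;
  bool aborted;
public:
  PollUpAbort() : parent(NULL), aborted(false) {}
  explicit PollUpAbort(PollUpAbort* p) : parent(p), aborted(false) {}

  // Poll: aborted if this node or any ancestor has been aborted.
  bool isAborted() const {
    return aborted || (parent && parent->isAborted());
  }

  // O(1): descendants discover the abort the next time they poll.
  void abort() { aborted = true; }
};
```

Aborting one child of the root marks its whole subtree as aborted on the next poll while leaving siblings untouched, which is the hierarchical behavior of Figure 3.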

We note that the more aggressively we poll, the fewer nodes we explore. This is because we are wasting more of our time doing polling, which leaves less time for the actual search. We peak at Coarsen-3, which does significantly better than both Coarsen-2 and Coarsen-4.

Since polls are purely wasted work if they don't result in an abort, we consider the percentage of polls that lead to aborts. The graph shows that this percentage peaks at 2%, which agrees with the total nodes computed graph for the best strategy. The percentage of polls that lead to aborts is considerably lower for the Aggressive polling strategy. This, at least partially, explains its low node counts, because a larger portion of its polls are wasted.

A confounding factor with these tests is that the static evaluator, the function that heuristically assigns a value to a game state, is relatively cheap for the Pousse game. For many other games, this may not be the case and could impact the effectiveness of not polling at the leaves of computation where the static evaluator is performed.

4.3 Ease of Use
In addition to quantitatively analyzing the performance of our strategy, it is also important to consider the amount of work that is required to make use of our technique. To measure this, we consider the amount of work that was required to port the existing cilk5 Pousse code to work with cilk++. Figure 10 gives the quantitative difference in code metrics for porting the code. These numbers do not include the size of the Abort library.

  Code Type        Lines of Code
  cilk5 Pousse     956
    Inlet Code     41
  cilk++ Pousse    1011
    Env. Decls.    15
    Inlet Code     52
  Code Increase    5.07%

Figure 10: Code metrics for porting cilk5 Pousse to cilk++ Pousse.

Our experience shows a 5.07% increase in the number of lines of code (an increase of 63 lines). This seems fairly reasonable considering that we are implementing a primitive language construct by a code transformation. On a more qualitative note, the transformation was relatively painless; our initial translation had no bugs outside of the Abort library. In addition to making porting code easy, this also means that reading and maintaining the code is only slightly more difficult than reading and maintaining the original cilk5 Pousse code. The largest difference here is the need to explicitly denote which variables need to be accessible in the inlet so that they can be explicitly placed within the structure passed as the continuation environment.

While the transformation is simple to perform, the problem with it is that it is not modular, because it changes the signatures of functions which use the abort mechanism. This breaks the possibility of separate compilation without explicit annotations specifying which functions should be compiled to work with inlets and abort. However, if the need to abort does not need to be exposed across interface boundaries, it can be effectively isolated.

The other problem with the polling strategy in general is determining when to poll. Our results in Section 4.2 suggest that polling relatively infrequently is quite effective, at least for our code. However, as we noted above, this might not always be the case. If static board evaluation or trying a move were more expensive, it might be better to poll more often in order to avoid these overheads. In addition, as we saw in the previous section, it is often beneficial to change the polling strategy at different levels in the search tree based on how much work will be saved. Retrofitting this into some algorithms may require additional work.

5. CONCLUSIONS
We demonstrated how speculative parallelism in the style of cilk5 using inlet and abort can be implemented as a simple CPS translation with a small amount of additional work threading through an Abort object. Our results show a performance improvement over the same α-β search algorithm that does not use abort. In addition, our experiences suggest that the translation is relatively easy to carry out when the transformation is local to a single module.

In general analyzing parallel programs is more difficult than analyzing sequential programs, but we found that analyzing programs that use speculative parallelism is even harder. The primary difficulty is that small changes in scheduling decisions can drastically affect the amount of work performed by the system if analyzing a particular node triggers a large abort. Perhaps one way to analyze this is to consider a notion of virtual work, which is the amount of work that would be done by the minimax algorithm if it were allowed to complete the current depth of the tree. In our analysis, we attempted to control for this when we compared the abort implementations; however, we found that this strategy contributed relatively little insight into the Pousse code because the frequency of aborts is much more random.

5.1 Future Work
Our implementation focuses heavily on the porting of cilk5 Pousse. This is good because it has the behavior of a real game, but it does not incorporate a very sophisticated/expensive static board evaluator like other games. Especially since the polling method requires making choices about how often to poll, we believe that further exploration of a game that uses a more sophisticated static board evaluator would shed additional light on problems likely to be encountered in parallelizing other code.

In cilk5, abort is implemented in the runtime system using signals. This is nice because it avoids the need to poll, which in turn obviates the need to tune an extra polling frequency parameter. In addition, implementing inlets via a compiler transformation is convenient because the compiler can mechanically generate additional code to make speculating across module boundaries more reasonable, and can ensure that returns through continuations are compiled efficiently as tail-calls whenever possible. We believe that it would be beneficial to re-implement abort and inlet in the runtime using signals in order to compare the total performance overhead of polling. In addition to the compiler support, we believe that additional tools for gathering statistics, such as the number of times aborts occurred and estimates of the amount of work that was aborted, would greatly help developers in understanding their algorithms and how best to achieve efficient implementations.

6. REFERENCES
[AJ89] A. W. Appel and T. Jim. Continuation-passing, closure-passing style. In POPL '89: Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 293-302, New York, NY, USA, 1989. ACM.
[BJK+95] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. SIGPLAN Notices, 30(8):207-216, 1995.
[EK72] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. J. ACM, 19(2):248-264, 1972.
[Flo62] Robert W. Floyd. Algorithm 97: Shortest path. Commun. ACM, 5(6):345, 1962.
[FLR98] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. SIGPLAN Notices, 33(5):212-223, 1998.
[KM75] Donald E. Knuth and Ronald W. Moore. An analysis of alpha-beta pruning. Artificial Intelligence, 6(4):293-326, 1975.
[Lei09] Charles E. Leiserson. The Cilk++ concurrency platform. In DAC '09: Proceedings of the 46th Annual Design Automation Conference, pages 522-527, New York, NY, USA, 2009. ACM.
[Str69] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354-356, 1969.