Parallel Alpha-Beta Algorithm on the GPU
Journal of Computing and Information Technology - CIT 19, 2011, 4, 269–274
doi:10.2498/cit.1002029

Damjan Strnad and Nikola Guid
Faculty of Electrical Engineering and Computer Science, University of Maribor, Slovenia

In the paper we present the parallel implementation of the alpha-beta algorithm running on the graphics processing unit (GPU). We compare the speed of the parallel player with the standard serial one using the game of reversi with boards of different sizes. We show that for small boards the level of available parallelism is insufficient for efficient GPU utilization, but for larger boards substantial speed-ups can be achieved on the GPU. The results indicate that the GPU-based alpha-beta implementation would be advantageous for similar games of higher computational complexity (e.g. hex and go) in their standard form.

Keywords: alpha-beta, reversi, GPU, CUDA, parallelization

1. Introduction

The alpha-beta algorithm is central to most engines for playing two-player zero-sum perfect information board games [10]. Its efficiency stems from the pruning of the game tree, which relies heavily on the serial interdependence of tree traversal. This sequential nature of the algorithm has also proved to be the major culprit for its effective parallelization, which has been the focus of many researchers in the past [3, 6, 13]. Noticeable speed-ups have been achieved in simulation [7], but required special hardware [11, 12, 21] or network configurations [4, 19] to be demonstrated in practice. With multi-threaded solutions on broadly accessible multi-core CPUs, a relatively modest acceleration has been obtained [2].

In this paper, we present the parallel alpha-beta algorithm running on the graphics processing unit (GPU). Modern GPUs are massively parallel processors residing on the graphics boards installed in common desktop computers and laptops. They feature hardware-accelerated scheduling of thousands of lightweight threads running on a number of streaming multiprocessors, resembling in operation the SIMD (single instruction multiple data) machines. In recent years, the parallel architecture of the GPU has become available for general purpose programming not related to graphics rendering. This was facilitated by the emergence of specialized toolkits like C for CUDA [16] and OpenCL [8], which propelled the successful parallelization of many compute-intensive tasks [15].

Our goal with the GPU-based alpha-beta implementation is to compare its speed of play against the basic serial version of the algorithm. The game of reversi (also known as Othello) serves as a testing ground for the comparison, where different board sizes are used to vary the complexity of the game. This allows us to set the future directions of research on more complex games that exhibit a higher level of exploitable parallelism.

A GPU-based implementation of game tree search has previously been reported by Bleiweiss, but he demonstrated significant speed-ups only for the practically limited case of thousands of simultaneously running games [1]. Another report shows good acceleration results with exhaustive minimax tree search, which represents the foundation of the alpha-beta algorithm [17]. In their work, Rocki and Suda indicate some intricacies of the GPU architecture that impair the performance of parallel minimax and are inherited by alpha-beta.
The rest of the paper is organized as follows: in Section 2 we describe the principles of general purpose computation on graphics processors, followed by the description of the alpha-beta parallelization methodology and its GPU-based implementation in Section 3. In Section 4 we give the specification of the testing details, and in Section 5 the results are presented. In Section 6 we conclude and present some ideas for future work.

2. General Purpose GPU Programming and CUDA

In our research the GPU programming toolkit C for CUDA by Nvidia has been used because it is the most mature and well documented [18]. Here we give a short overview of the technology.

The parallel architecture of Nvidia GPUs consists of a set of pipelined multiprocessors. The parallel computation on the GPU is performed as a set of concurrently executing thread blocks, which are organized into a 1D or 2D grid. The blocks themselves can be 1D, 2D, or 3D, with each thread designated by a unique combination of indices. The hardware schedules the execution of blocks on the multiprocessors in units of 32 threads called warps. The threads of the same warp run in lockstep, with each divergent branch in the code suspending some of the threads and thereby reducing the rate of parallelism. The hierarchical structure of GPU processing also makes careful thread synchronization necessary to avoid race conditions and to optimize performance.

In CUDA, the C-like code to be executed on the GPU is written in the form of functions called kernels. For an efficient kernel implementation, one must consider the limited amount of available resources, such as on-chip memory and registers, as well as the restrictions on the execution configuration (i.e. the dimensions of the grid and blocks). The properties of the GPU are described by its compute capability, which can be queried at run-time and used to adjust the kernel parameters.

The memory available to the GPU is of several types. Each thread block has at its disposal a limited amount (16 KB – 48 KB) of fast local storage called shared memory, which resides on the GPU and has low latency. It is used for internal computation and communication between the threads of the block. The off-chip, on-device memory comes in much larger quantities (1 GB and more) and can be used for inter-block communication, but has a high latency. It is further divided into read-only texture and constant memory, and read-write global memory. The purpose and characteristics of each memory type are detailed in the documentation [9].

The usual template of operation in CUDA kernels is to copy the data from global to shared memory, process it there, and copy the results back. All of these steps are performed in parallel. The kernel complexity in terms of shared memory usage and register count determines the number of blocks that can be simultaneously scheduled on a multiprocessor and thereby affects the GPU occupancy. The expected GPU occupancy can be calculated using the tools provided with CUDA, so that the execution configuration and kernel compilation can be tuned for optimal utilization of resources. Higher GPU occupancy allows the scheduler to swap warps waiting for global memory transfers with queued warps.

Another major influence on kernel performance is the way the threads of a warp address memory locations. The desired addressing scheme uses coalesced memory reads and writes, which means that consecutive threads access sequential memory locations, or at least a permutation of them. Conflicting memory accesses are serialized and lower the effective memory bandwidth.
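To make this template concrete, the following minimal sketch stages data from global into shared memory with coalesced accesses, processes it, and writes the results back. It is our own illustrative example rather than code from the paper; the kernel name scaleKernel, the scaling operation, and the fixed block size of 256 threads are assumptions made only for the illustration.

#define BLOCK_SIZE 256

/* Minimal sketch of the usual kernel template: stage a tile of data from
   global to shared memory, process it there, and write the results back.
   The trivial scaling stands in for the per-block computation; the launch
   is assumed to use exactly BLOCK_SIZE threads per block. */
__global__ void scaleKernel(const float *in, float *out, int n, float factor)
{
    __shared__ float tile[BLOCK_SIZE];            /* fast on-chip storage per block */

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n)
        tile[threadIdx.x] = in[idx];              /* coalesced read: consecutive threads
                                                     access consecutive addresses */
    __syncthreads();                              /* make the tile visible to all threads */

    if (idx < n)
        out[idx] = tile[threadIdx.x] * factor;    /* process and write back, again coalesced */
}

Such a kernel would be launched from the host with an execution configuration like scaleKernel<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, n, 2.0f); the occupancy tools mentioned above can then be used to check how many such blocks fit on a multiprocessor at once.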
3. Implementation of Parallel Alpha-Beta on the GPU

The main source of parallelism in the alpha-beta algorithm is the concurrent processing of multiple sibling nodes at any level of the game tree. If the best move is evaluated first, then the rest of the moves can be refuted in parallel with maximal efficiency. Unfortunately, the a priori quality of a move can only be estimated heuristically. The parallel processing of nodes therefore usually introduces some search overhead, which may surpass the amount of serial search in the case of an unfavourable move order.

The basis for our parallel alpha-beta implementation is the PV-split algorithm [14], in which the parallelism is employed at the nodes on the principal variation (i.e. the leftmost path in the heuristically ordered game tree), also known as the PV-nodes or type 1 nodes [10]. In the GPU-based variant the leftmost child of each PV-node is searched on the CPU to establish the lower bound on the PV-node value. The rest of the PV-node's descendants are searched in parallel on the GPU using the narrower window. Figure 1 shows the principle of parallel execution in the PV-nodes.

Figure 1. GPU-based PV-split

The parallel processing of sibling nodes is realized using multiple thread blocks on the GPU, but this is only the high level of parallelization exploited by the GPU-based alpha-beta. On the lower level, each node (i.e. reversi board) is processed in parallel by the threads of a two-dimensional block, with a single block used per node. Figure 2 depicts the described two-level parallelism, which differs from the approach of Rocki and Suda, who delegate the nodes to individual warps of the same block [17].

Figure 2. Two level parallelism of GPU

The dimension of the block may be smaller than that of the board because:
• using a smaller block is desirable when it results in higher GPU occupancy, and
• the number of threads in a block is limited to 1024, which requires the use of smaller blocks for boards larger than 32×32 squares.

If the block and board are of the same dimension, then the thread with index pair (x, y) is mapped directly to square (x, y) of the board. If the block is smaller than the board, then the mapping scheme shown in Figure 3 is employed, in which each thread covers the appropriate number of consecutive squares, which results in nicely coalesced memory accesses.

Figure 3. The mapping of threads to consecutive board squares
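As an illustration of this two-level scheme, the sketch below evaluates one sibling board per thread block, with a block that is smaller than the board in the y direction, so that each thread covers several consecutive rows at its x position and neighbouring threads read neighbouring addresses. The board encoding (+1 for the player's discs, -1 for the opponent's, 0 for empty squares), the simple disc-difference evaluation, the 16×16 board size, and all identifiers such as evaluateSiblingsKernel are assumptions made only for this example; the paper's actual kernel and the exact mapping of Figure 3 are not reproduced here.

#define BOARD_DIM        16                        /* board is BOARD_DIM x BOARD_DIM squares */
#define BLOCK_DIM_Y       4                        /* block is BOARD_DIM x BLOCK_DIM_Y threads */
#define ROWS_PER_THREAD  (BOARD_DIM / BLOCK_DIM_Y) /* consecutive rows handled by one thread */

/* One thread block evaluates one sibling board (high level of parallelism);
   within the block, each thread sums its run of consecutive squares (low level). */
__global__ void evaluateSiblingsKernel(const int *boards, int *scores)
{
    __shared__ int partial[BOARD_DIM * BLOCK_DIM_Y];

    /* blockIdx.x selects the sibling node (board) processed by this block */
    const int *board = boards + blockIdx.x * BOARD_DIM * BOARD_DIM;

    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int sum = 0;

    /* Thread (x, y) scans rows y*ROWS_PER_THREAD .. y*ROWS_PER_THREAD+ROWS_PER_THREAD-1
       at column x; threads with consecutive x read consecutive addresses,
       so the loads stay coalesced. */
    for (int r = 0; r < ROWS_PER_THREAD; ++r) {
        int row = threadIdx.y * ROWS_PER_THREAD + r;
        sum += board[row * BOARD_DIM + threadIdx.x];
    }

    partial[tid] = sum;
    __syncthreads();

    /* Standard shared-memory reduction to a single disc-difference score per board. */
    for (int stride = (BOARD_DIM * BLOCK_DIM_Y) / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        scores[blockIdx.x] = partial[0];
}

A host-side PV-split step would then launch one block per remaining sibling of the PV-node, e.g. evaluateSiblingsKernel<<<numSiblings, dim3(BOARD_DIM, BLOCK_DIM_Y)>>>(d_boards, d_scores), after the leftmost child has been searched serially on the CPU. In the actual algorithm each block would search its subtree with the narrowed window rather than merely evaluating a single board, but the block-per-node launch pattern would be the same.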