Parallelization of Tree-Recursive Algorithms on a SIMD Machine *
From: AAAI Technical Report SS-93-04. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Curt Powley, Richard E. Korf, and Chris Ferguson
Computer Science Department
University of California, Los Angeles
Los Angeles, CA 90024

ABSTRACT

The set of tree-recursive algorithms is large, including constraint satisfaction using backtracking, iterative-deepening search such as IDA*, depth-first branch-and-bound, two-player game minimax search, and many divide-and-conquer algorithms. We describe a structured method for implementing such algorithms on SIMD machines, and identify measures for determining if a tree-recursive application is amenable or ill-suited for SIMD parallelization. Using these ideas, we evaluate results from four testbeds.

1 Introduction

1.1 Tree-Recursive Algorithms

A tree-recursive algorithm is one that traverses a tree in executing multiple recursive calls. A simple example is this procedure for calculating Fibonacci (n), where n is a non-negative integer.

    Fib (n)
        if (n <= 1) return (1)
        else return (Fib (n - 1) + Fib (n - 2))

The set of tree-recursive algorithms includes constraint satisfaction using backtracking, iterative-deepening search such as IDA*, depth-first branch-and-bound, two-player game minimax search, and most divide-and-conquer algorithms.

1.2 SIMD Machines

Two examples of single-instruction, multiple-data (SIMD) machines are the MasPar¹ and the Connection Machine (CM)². The MasPar is composed of up to 16,384 (16K) processors. The Connection Machine, the computer used in this study, is composed of up to 65,536 (64K) one-bit processors.

Every processor on a SIMD machine must execute the same instruction at the same time, or no instruction at all. This can make programming and verification of straightforward tasks easier, but can considerably complicate the programming of complex tasks, such as tree traversal. Because of this programming constraint, SIMD machines have largely been used for "data parallel" applications. Traversing a large, irregular tree that is dynamically generated does not fit this computational paradigm. On the other hand, tree traversal consists of performing the same computation of node expansion and evaluation repeatedly. This suggests that fine-grain SIMD parallel computers might be appropriate for implementation of tree-recursive algorithms.

Some of this research was previously reported in [14], [15], [16], [17], and [18].

* This work is supported in part by W. M. Keck Foundation grant W880615, NSF grants DIR-9024251 and IRI-9119825, NSF Biological Facilities grant BBS-8714206, the Defense Advanced Research Projects Agency under Contract MDA903-87-C0663, Rockwell International, the Advanced Computing Facility of the Mathematics and Computer Science Division of Argonne National Laboratory, the University of Maryland Institute for Advanced Computer Studies, the University of Minnesota/Army High Performance Computing Research Center, the Massively Parallel Computer Research Laboratory of Sandia National Laboratories, and by Thinking Machines Corporation.

¹ MasPar is a trademark of MasPar Computer Corporation.

² Connection Machine is a trademark of Thinking Machines Corporation.

2 SIMD Tree Search (STS)

Our basic SIMD Tree Search algorithm, or STS, consists of an initial distribution of nodes to processors, followed by alternating phases of depth-first tree-traversal (search) and load balancing [15].
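At this level of description, the control flow of STS can be pictured with a short simulation. The sketch below is an illustration only, not Connection Machine code: the names sts, expand, is_goal, load_balance, and the trigger fraction are hypothetical stand-ins for the mechanisms detailed in the remainder of this section, and the per-processor steps that a real SIMD machine would execute in lockstep are written here as an ordinary loop.

    def sts(root, num_procs, expand, is_goal, load_balance, trigger=0.5):
        # Initial distribution: expand breadth-first until there are enough
        # frontier nodes to give every simulated processor one node.
        frontier = [root]
        while 0 < len(frontier) < num_procs:
            frontier.extend(expand(frontier.pop(0)))
        stacks = [[node] for node in frontier]  # one depth-first stack per "processor"

        solutions = []
        while any(stacks):
            # Search phase: every busy processor expands one node.
            for stack in stacks:
                if stack:
                    node = stack.pop()
                    if is_goal(node):
                        solutions.append(node)
                    else:
                        stack.extend(expand(node))
            # When too few processors remain busy, suspend search and rebalance.
            if sum(1 for s in stacks if s) < trigger * len(stacks):
                load_balance(stacks)
        return solutions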
Initially, the tree consists of just the root node, located on a single active processor of the machine, and all the remaining processors are inactive. This processor expands the root, generating its children, and assigns each child to a different processor. Each of these processors, in parallel, expands its node, generating more children that are assigned to different processors. This process continues until there are no more free processors available.

If the processor assigned to a node remains associated with that node after it is expanded, that processor will be wasted by not having any work to do during the subsequent depth-first search. To avoid this, when a node is expanded, its first child is assigned to the same processor that held the parent, and only additional children are assigned to free processors. This guarantees that once the initial distribution is completed, all processors will be assigned to nodes on the frontier of the tree, and can participate in the depth-first search.

Once the initial distribution is completed, each processor conducts a depth-first search of its assigned frontier node. Processors use a stack to represent the current path in the tree. Unfortunately, the trees generated by almost all tree-recursive problems of interest have irregular branching factors and depths, and some processors will finish searching their subtrees long before others. Thus, load balancing is necessary to effectively use a large number of processors. On a MIMD machine, these idle processors can get work from busy processors without interrupting the other busy processors [19]. On a SIMD machine, however, since every processor must execute the same instruction at the same time, or no instruction at all, in order to share work, all search activity must be temporarily suspended. Thus, when the number of active processors becomes small enough to make it worthwhile, search stops and load balancing begins.

Two important problems are determining when to trigger load balancing, and how to distribute the work. STS uses a dynamic trigger [14] that automatically adjusts itself to the problem size, and to different stages within a single problem. It is a greedy approach that maximizes the average rate of work over a search/load-balance cycle. After triggering occurs, the work is redistributed during load balancing as follows. Each busy processor scans its stack to find a node with unexplored children [19]. When there are more active processors than idle processors, a subset of the active processors must be selected to export work. Criteria for choosing exporting processors include: (1) largest estimated load [14], (2) random selection [7], (3) selection of least-recently used [8], and (4) location of the processor's node in the tree so that the tree is searched in more of a left-to-right fashion [20]. The effectiveness of the work-distribution method depends on the application.
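One possible load_balance routine for the simulation sketched earlier in this section, using criterion (1), largest estimated load, is given below. Treating stack depth as the load estimate, and the routine as a whole, are assumptions made for illustration rather than the implementation used in STS.

    def load_balance(stacks):
        # Idle processors have empty stacks; only processors holding more
        # than one frontier node have work to spare.
        idle = [s for s in stacks if not s]
        donors = sorted((s for s in stacks if len(s) > 1), key=len, reverse=True)
        # Pair each idle processor with one of the most heavily loaded donors.
        for taker, giver in zip(idle, donors):
            # Donate the node nearest the root of the donor's stack, which
            # usually represents the largest unexplored piece of work.
            taker.append(giver.pop(0))

When there are fewer idle processors than willing donors, only the most heavily loaded donors export work, which corresponds to selecting a subset of the active processors as described above.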
3 Measures of Performance

Our primary measures of performance are speedup and efficiency. Speedup is the time that would be required by the most efficient serial program for the application, running on one processor of the SIMD machine, divided by the time required by STS. Efficiency is simply speedup divided by the number of processors. Since we are interested in the factors that contribute to overall efficiency, we decompose efficiency into four components: raw speed ratio R, fraction of time working F, utilization U, and work ratio N. The product of these four factors equals efficiency.

The raw speed ratio is the ratio of the node generation rate of a single busy processor in the parallel STS algorithm, compared to the serial algorithm, and reflects the overhead of SIMD machines in executing conditional code. Next is the fraction of total time that is devoted to working (searching), as opposed to load balancing. A third factor is the processor utilization, which is the average fraction of processors that are busy during search phases. Utilization reflects the extent to which processors finish their work and are idle while waiting to receive work in the next load balancing phase. The final factor is the ratio of total nodes generated by the serial algorithm to total nodes generated by the parallel algorithm, or the work ratio. Depending on the application, this may or may not be significant.

4 Constraint Satisfaction

Our simplest testbed for STS is backtracking for constraint-satisfaction problems. For example, the N-Queens problem is to place N queens on an N x N chessboard so that no two are on the same row, column, or diagonal. In such applications, a path is cut off when a constraint is violated, or a solution found. The N-Queens problem can be solved by placing queens one at a time. When a constraint is violated, the search backtracks to the last untried legal move.

Using 16K processors on a CM-2, we solved a 16-queens problem with a speedup of 10,669, for an efficiency of 65%. This corresponds to a raw speed ratio of 0.750, a fraction of time working of 0.870, a utilization of 0.998, and a work ratio of 1.0. The work ratio is one because we found all solutions, making the search tree identical in the serial and parallel cases. The total work done was 18.02 billion node generations in 5.2 hours, for a rate of 58 million node generations per minute. In contrast, the node generation rate on a Hewlett Packard 9000 model 350 workstation was 2.1 million per minute.

5 Iterative-Deepening

Another important algorithm in this class is iterative-deepening search, such as depth-first iterative-deepening (DFID) and Iterative-Deepening A* (IDA*) [9]. DFID performs a series of depth-first searches. A path is cut off when the depth of the path exceeds a depth threshold for the iteration. Thresholds for succeeding iterations are increased so that when a solution is found it is guaranteed to be of lowest cost. IDA* reduces the search space by adding to the path cost a heuristic estimate of the cost to reach a goal node.

Iterative-deepening is also used in these algorithms, and in fact originated in this setting [21]. Alpha and Beta bounds produced by searching one subtree affect the amount of work necessary to search neighboring subtrees. These local bounds must be propagated throughout the tree, and a parallel algorithm will almost certainly do more work than a serial algorithm. Thus, the work ratio becomes especially important in these applications.
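As a concrete check of the decomposition in Section 3, the four factors reported for the 16-queens run multiply out to the quoted efficiency and speedup; the closing comment uses a purely hypothetical work ratio of 0.5 to illustrate why the work ratio dominates in applications that do extra work in parallel.

    # Efficiency is the product of the four factors from Section 3;
    # speedup is efficiency times the number of processors.
    R = 0.750  # raw speed ratio
    F = 0.870  # fraction of time working
    U = 0.998  # utilization
    N = 1.0    # work ratio (all solutions found, so the trees are identical)

    efficiency = R * F * U * N
    print(round(efficiency, 3))       # 0.651, i.e. about 65%
    print(round(efficiency * 16384))  # 10669, the reported speedup on 16K processors

    # A hypothetical work ratio of N = 0.5 (the parallel search generating
    # twice the nodes of the serial search) would halve both figures, with
    # R, F, and U unchanged.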