Internally Deterministic Parallel Algorithms Can Be Fast
Guy E. Blelloch*   Jeremy T. Fineman†   Phillip B. Gibbons‡   Julian Shun*
*Carnegie Mellon University   †Georgetown University   ‡Intel Labs, Pittsburgh

Abstract

The virtues of deterministic parallelism have been argued for decades, and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there is a unique dependence graph representing a trace of the computation annotated with every operation and value. This has been referred to as internal determinism, and it implies a sequential semantics—i.e., considering any sequential traversal of the dependence graph is sufficient for analyzing the correctness of the code. In addition to returning deterministic results, internal determinism has many advantages, including ease of reasoning about the code, ease of verifying correctness, ease of debugging, ease of defining invariants, ease of defining good coverage for testing, and ease of formally, informally, and experimentally reasoning about performance. On the other hand, one needs to consider the possible downsides of determinism, which might include making algorithms (i) more complicated, unnatural, or special purpose and/or (ii) slower or less scalable.

In this paper we study the effectiveness of this strong form of determinism through a broad set of benchmark problems. Our main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code. We leverage an approach to determinism suggested by Steele (1990), which is to use nested parallelism with commutative operations. Our algorithms apply several diverse programming paradigms that fit within the model, including (i) a strict functional style (no shared state among concurrent operations), (ii) an approach we refer to as deterministic reservations, and (iii) the use of commutative, linearizable operations on data structures. We describe algorithms for the benchmark problems that use these deterministic approaches and present performance results on a 32-core machine. Perhaps surprisingly, for all problems, our internally deterministic algorithms achieve good speedup and good performance even relative to prior nondeterministic solutions.

Categories and Subject Descriptors  D.1 [Concurrent Programming]: Parallel programming

General Terms  Algorithms, Experimentation, Performance

Keywords  Parallel algorithms, deterministic parallelism, parallel programming, commutative operations, graph algorithms, geometry algorithms, sorting, string processing

1. Introduction

One of the key challenges of parallel programming is dealing with nondeterminism. For many computational problems, there is no inherent nondeterminism in the problem statement, and indeed a serial program would be deterministic—the nondeterminism arises solely due to the parallel program and/or due to the parallel machine and its runtime environment. The challenges of nondeterminism have been recognized and studied for decades [23, 24, 37, 42]. Steele's 1990 paper, for example, seeks "to prevent the behavior of the program from depending on any accidents of execution order that can arise from the indeterminacy" of asynchronous programs [42]. More recently, there has been a surge of advocacy for and research in determinism, seeking to remove sources of nondeterminism via specially designed hardware mechanisms [19, 20, 28], runtime systems and compilers [3, 5, 36, 45], operating systems [4], and programming languages/frameworks [11].

While there seems to be a growing consensus that determinism is important, there is disagreement as to what degree of determinism is desired (worth paying for). Popular options include:

• Data-race free [2, 22], which eliminates a particularly problematic type of nondeterminism: the data race. Synchronization constructs such as locks or atomic transactions protect ordinary accesses to shared data, but nondeterminism among such constructs (e.g., the order of lock acquires) can lead to considerable nondeterminism in the execution.
• Determinate (or external determinism), which requires that the program always produce the same output when run on the same input. Program executions for a given input may vary widely, as long as the program "converges" to the same output each time.

• Internal determinism, in which key aspects of intermediate steps of the program are also deterministic, as discussed in this paper.

• Functional determinism, where the absence of side effects in purely functional languages makes all components independent and safe to run in parallel.

• Synchronous parallelism, where parallelism proceeds in lock step (e.g., SIMD-style) and each step has a deterministic outcome.

There are trade-offs among these options, with stronger forms of determinism often viewed as better for reasoning and debugging but worse for performance and perhaps programmability. Making the proper choice for an application requires understanding what the trade-offs are. In particular, is there a "sweet spot" for determinism, which provides a particularly useful combination of debuggability, performance, and programmability?

In this paper, we advocate a particular form of internal determinism as providing such a sweet spot for a class of nested-parallel (i.e., nested fork-join) computations in which there is no inherent nondeterminism in the problem statement. An execution of a nested-parallel program defines a dependence DAG (directed acyclic graph) that represents every operation executed by the computation (the nodes) along with the control dependencies among them (the edges). These dependencies represent ordering within sequential code sequences, dependencies from a fork operation to its children, and dependencies from the end of such children to the join point of the forking parent. We refer to this DAG, when annotated with the operations performed at each node (including arguments and return values, if any), as the trace. Informally, a program/algorithm is internally deterministic if for any input there is a unique trace. This definition depends on the level of abstraction of the operations in the trace. At the most primitive level the operations could represent individual machine instructions, but more generally, and as used in this paper, it is any abstraction level at which the implementation is hidden from the programmer. We note that internal determinism does not imply a fixed schedule, since any schedule that is consistent with the DAG is valid.
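To make the trace concrete, consider the following minimal fork-join sketch. It is our own illustration rather than code from the paper (the paper's implementations use Cilk++); std::async stands in for a Cilk-style spawn, and the function name and leaf cutoff are arbitrary. Because every node computes a pure function of the input, the annotated trace (the DAG plus every argument and return value) is identical on every run, however the children are scheduled.

#include <cstddef>
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

// Divide-and-conquer sum as a nested fork-join computation. Each call is
// a subcomputation of the dependence DAG: the fork adds edges to the two
// children, and the join adds edges from the children back to the parent.
long long psum(const std::vector<long long>& a, std::size_t lo, std::size_t hi) {
  if (hi - lo <= (1 << 14))                      // leaf: plain sequential code
    return std::accumulate(a.begin() + lo, a.begin() + hi, 0LL);
  std::size_t mid = lo + (hi - lo) / 2;
  auto left = std::async(std::launch::async,     // "fork": spawn one child
                         psum, std::cref(a), lo, mid);
  long long right = psum(a, mid, hi);            // run the other child inline
  return left.get() + right;                     // "join": wait, then combine
}

int main() {
  std::vector<long long> a(1 << 20, 1);
  std::printf("%lld\n", psum(a, 0, a.size()));   // prints 1048576 on every run
}

By contrast, a variant in which both children added into one shared accumulator under a lock would still be externally deterministic (addition commutes), but the intermediate values observed at each update, and hence the trace, would vary with the lock-acquisition order.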
Internal determinism has many benefits. In addition to leading to external determinism [37], it implies a sequential semantics—i.e., considering any sequential traversal of the dependence DAG is sufficient for analyzing the correctness of the code. This in turn leads to many advantages, including ease of reasoning about the code, ease of verifying correctness, ease of debugging, ease of defining invariants, ease of defining good coverage for testing, and ease of formally, informally, and experimentally reasoning about performance [3–5, 11, 19, 20, 28, 36, 45]. Two primary concerns [...]

[...]straints in transactional systems [26, 30], an approach that does not guarantee determinism. In contrast, this paper identifies useful applications of non-trivial commutativity that can be used in the design of internally deterministic algorithms.

We describe, for example, an approach we refer to as deterministic reservations for parallelizing certain greedy algorithms (a schematic sketch appears at the end of this section). In this approach the user implements a loop with potential loop-carried dependencies by splitting each iteration into reserve and commit phases. The loop is then processed in rounds, in which each round takes a prefix of the unprocessed iterates, applying the reserve phase in parallel and then the commit phase in parallel. Some iterates can fail during the commit due to conflicts with earlier iterates and need to be retried in the next round, but as long as the operations commute within the reserve and commit phases and the prefix size is selected deterministically, the computation is internally deterministic (the same iterates always fail).

We describe algorithms for the benchmark problems using these approaches and present performance results for our Cilk++ [31] implementations on a 32-core machine. Perhaps surprisingly, for all problems, our internally deterministic algorithms achieve good speedup and good performance even relative to prior nondeterministic and externally deterministic solutions, implying that the performance penalty of internal determinism is quite low. We achieve speedups of up to 31.6 on 32 cores with 2-way hyperthreading (for sorting). Almost all our speedups are above 16. Compared to what we believe are quite good sequential implementations we range [...]
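Returning to deterministic reservations: the following schematic sketch renders the round-based loop just described. This is our own sketch under the paper's description, not the paper's library code; the name speculative_for, the Step interface, and the plain for loops standing in for parallel loops are all illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

// Round-based loop for deterministic reservations. Step is any user type
// providing reserve(i) and commit(i): reserve(i) records iterate i's
// priority on shared data (e.g., with a commutative write-with-min), and
// commit(i) returns false if i lost a conflict and must be retried.
template <typename Step>
void speculative_for(Step& step, std::size_t n, std::size_t prefix_size) {
  std::vector<std::size_t> pending(n);
  for (std::size_t i = 0; i < n; ++i) pending[i] = i; // iterates in input order

  while (!pending.empty()) {
    std::size_t p = std::min(prefix_size, pending.size());

    // Reserve phase over the prefix (parallel in a real implementation;
    // internal determinism needs only that the reserve operations commute).
    for (std::size_t j = 0; j < p; ++j) step.reserve(pending[j]);

    // Commit phase over the same prefix; record the iterates that failed.
    std::vector<std::size_t> next;
    for (std::size_t j = 0; j < p; ++j)
      if (!step.commit(pending[j])) next.push_back(pending[j]);

    // Failed iterates retry first in the next round, in their original
    // relative order, so the prefix is always selected deterministically.
    next.insert(next.end(), pending.begin() + p, pending.end());
    pending.swap(next);
  }
}

For a greedy algorithm such as spanning forest, reserve(i) might write edge i's index to both endpoints with a commutative write-with-min, and commit(i) succeed only if edge i still holds both reservations; because min-writes commute within a phase, the same iterates win and fail on every run.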