Scaling Implicit Parallelism via Dynamic Control Replication
Michael Bauer (NVIDIA), Wonchan Lee (NVIDIA), Elliott Slaughter (SLAC National Accelerator Laboratory), Zhihao Jia (Carnegie Mellon University), Mario Di Renzo (Sapienza University of Rome), Manolis Papadakis (NVIDIA), Galen Shipman (Los Alamos National Laboratory), Patrick McCormick (Los Alamos National Laboratory), Michael Garland (NVIDIA), Alex Aiken (Stanford University)

Abstract

We present dynamic control replication, a run-time program analysis that enables scalable execution of implicitly parallel programs on large machines through a distributed and efficient dynamic dependence analysis. Dynamic control replication distributes dependence analysis by executing multiple copies of an implicitly parallel program while ensuring that they still collectively behave as a single execution. By distributing and parallelizing the dependence analysis, dynamic control replication supports efficient, on-the-fly computation of dependences for programs with arbitrary control flow at scale. We describe an asymptotically scalable algorithm for implementing dynamic control replication that maintains the sequential semantics of implicitly parallel programs. An implementation of dynamic control replication in the Legion runtime delivers the same programmer productivity as writing in other implicitly parallel programming models, such as Dask or TensorFlow, while providing better performance (11.4X and 14.9X respectively in our experiments), and scalability to hundreds of nodes. We also show that dynamic control replication provides good absolute performance and scaling for HPC applications, competitive in many cases with explicitly parallel programming systems.

CCS Concepts • Software and its engineering → Runtime environments; • Computing methodologies → Parallel and distributed programming languages;

Keywords implicit parallelism, scalable dependence analysis, dynamic control replication, Legion, task-based runtime

Publication rights licensed to ACM.

1 Introduction

Implicitly parallel programming models, where apparently sequential programs are automatically parallelized, are increasingly popular as programmers endeavor to solve computationally difficult problems using massively parallel and distributed machines. Many of these programmers have limited experience writing explicitly parallel code and instead choose to leverage high productivity frameworks—such as TensorFlow [7], PyTorch [38], and Spark [53]—that hide the details of parallelism and data distribution.

In implicitly parallel programming models, distinguished functions, often called tasks, are the units of parallelism. Tasks may consume data produced by other tasks preceding them in program order. Thus, to discover implicit parallelism, programming systems must perform a dependence analysis to discover a legal partial order on task execution. In the worst case, dependence analysis has a computational complexity that is quadratic in the number of tasks: each task must be checked against all its predecessors in the program for dependences. Since the number of tasks to be analyzed is generally proportional to the size of the target machine, dependence analysis is expensive at scale, and its apparently sequential nature can make it difficult to parallelize. Therefore, all existing systems make tradeoffs limiting the scalability of their implementation, the precision of their analysis, and/or the expressivity of their programming model to avoid or reduce the costs of a fully general dependence analysis.

One such class of systems performs dependence analysis at compile-time, thereby eliminating run-time overhead. Sequoia [20], Deterministic Parallel Java [13], and Regent [43]
ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
PPoPP '21, February 27-March 3, 2021, Virtual Event, Republic of Korea
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8294-6/21/02...$15.00
https://doi.org/10.1145/3437801.3441587

are compiler-based systems that statically infer data dependences to transform a program into an explicitly parallel code that runs across many distributed nodes. However, common idioms such as data-dependent control flow, or potentially aliased data structures, can defeat static analysis, resulting in failed compilation or generated code incapable of scaling to large node counts. Some systems limit the expressivity of the programming model to ensure static analysis is successful
Figure 1. Approaches to implicit parallelism: static analysis and lazy evaluation centralize analysis and distribute partitioned programs to workers for execution. DCR executes a replicated program, dynamically assigns tasks to shards, and performs an on-the-fly distributed dependence analysis. [Diagram: an example task T (a while !done() loop launching subtasks A()-F()) executed under static analysis (compiler partitions loops across execution nodes), lazy evaluation (control node feeds worker nodes), and DCR (shards T1 and T2 across execution nodes).]

(e.g., Sequoia). The compile-time approach works best for the subset of programs amenable to precise static analysis.

To avoid the limitations of static analysis, programming systems such as Spark and TensorFlow use lazy evaluation to perform dependence analysis at run-time. A single node executes the program's control where all tasks are issued, but tasks are not immediately executed. Instead the control node performs a dynamic dependence analysis on the set of launched tasks to build a representation of the program. At certain points, the system dynamically compiles and optimizes this representation and then distributes it to worker nodes for execution. The main difficulty is that the control node can limit scalability, either as a sequential bottleneck in distributing tasks to worker nodes, or as a limitation on the size of the program representation that can be stored in memory. Both Spark and TensorFlow take steps to mitigate these risks. To avoid having the centralized controller become a sequential bottleneck, these systems either memoize repeated executions of code (Spark) or amortize the cost of dependence analysis by explicitly representing loops with some (but not arbitrary) control flow (TensorFlow) [52]. Therefore, programs with large repeated loops of execution can be compiled, optimized, and partitioned for distribution to worker nodes efficiently. Furthermore, to avoid overflowing memory, these systems limit the expressible dependence patterns. Consequently, Spark and TensorFlow work well for programs with minimal dynamic control flow and simple and symmetric dependence patterns.

For programs with complex control flow, arbitrary data dependences, and/or the need to run arbitrary tasks on any node, a different approach is needed. The dependence analysis should be distributed to ensure that no node is a bottleneck. In particular, there is little difficulty if a task launches a small, constant number of subtasks; these can be analyzed in essentially constant time and with small overhead in current tasking systems. The problem is with tasks that launch a number of subtasks proportional to the size of the machine, where the dependence analysis quickly becomes a bottleneck when scaling up to large clusters.

We introduce dynamic control replication (DCR), a parallel and distributed run-time dependence analysis for implicitly parallel programming models that can scale to large node counts and support arbitrary control flow, data usage, and task execution on any node. Figure 1 gives an example program consisting of a top-level task with a loop that launches six subtasks A-F with some dependences between them, along with an illustration of static, lazy evaluation, and dynamic control replication approaches to the program's execution. In the static approach, to target n nodes, the tasks are partitioned into n loops with explicit cross-node synchronization to enforce dependences. The compiled program is similar to an MPI implementation [45], with each node explicitly communicating and synchronizing with other nodes. The lazy approach attempts to achieve a similar effect at run-time, with all of the dependence analysis carried out by the control node, and then just-in-time schedules of tasks sent to each worker. These schedules can be cached on the workers to avoid repeating the analysis if possible, but in general when the program's behavior changes, the control node must carry out another lazy gathering of tasks, dependence analysis, and schedule creation for the workers.

In DCR, illustrated at the bottom of Figure 1, the dependence analysis is itself spread across all the nodes of the system. The execution of task T is replicated into shards; each shard Ti executes a copy of T that replicates all of the control structures of the task T, but only performs the dependence analysis for a subset of the subtasks launched by T. In the figure, shard T1 handles subtasks A, C, and E for the first loop iteration, while shard T2 handles subtasks B, D and F. The shards cooperate to discover and enforce any cross-shard dependences, in this case the dependences between B and C and between C and F. The assignment of subtasks of T to shards Ti is done by the runtime system using a sharding function f. The only requirements of f are that it be a function (each subtask is assigned to one shard) and total (every subtask is assigned to some shard). For performance, the sharding function should also balance assignments across shards. Because the sharding is computed dynamically, the sharding function can assign different instances of a subtask to different shards; in Figure 1 the shards of T swap roles on the second iteration, with T1 handling subtasks B, D and F and shard T2 handling the rest. Crucial to the scalability of DCR is the observation that consecutive independent tasks, such as A and B, can be aggregated into group tasks that can be launched and analyzed more efficiently as a single operation. We formally introduce group tasks in Section 2.

In DCR the execution of a task is replicated, but collectively the shards behave as a single logical task. By performing dependence analysis on the fly, DCR can easily react to data-dependent control flow. Since the dependence analysis is distributed, each node is only responsible for its subset of the overall analysis and storing a subset of the program representation, yielding an inherently scalable approach.
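These requirements are easy to state in code. The sketch below is our own illustration (the names are not Legion's API): a cyclic sharding function is total (every task index maps to some shard) and functional (each index maps to exactly one shard), and balances tasks round-robin across shards.

```python
def cyclic_shard(task_index, num_shards):
    """Round-robin owner shard for a task: total and functional."""
    return task_index % num_shards

def owned_tasks(task_indices, shard_id, num_shards):
    """The subset of a task group that a given shard analyzes."""
    return [i for i in task_indices if cyclic_shard(i, num_shards) == shard_id]

# Two shards splitting a group of six subtasks (A-F as indices 0-5):
# shard 0 owns [0, 2, 4] and shard 1 owns [1, 3, 5].
```

Because every index is mapped and no index is mapped twice, the shards' subsets partition each task group, which is exactly what lets the shards analyze disjoint work while collectively covering the whole program.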
Each subsequent section presents a contribution: We formalize parallel dynamic dependence analysis and prove this algorithm is equivalent to sequential dependence analysis (Section 2). DCR introduces another new issue: the replicated portion of the program must behave identically in all copies and therefore cannot exhibit non-determinism; we show how to dynamically verify this property of control determinism (Section 3). We describe our implementation of DCR (Section 4) and important optimizations required to achieve good performance in practice. Our evaluation demonstrates that DCR delivers the same productivity with better performance and scalability for widely-used data analysis and machine learning programs, and provides good absolute performance and scalability for HPC applications (Section 5).

Figure 2. Parallel dependence analysis DEP_rep. The rules are:

(Ta)  S = <s1, ..., (tg; pi, ci, ∅), ..., sn>    di = ci ×⇒ tg(i) ≠ ∅
      S' = <s1, ..., (tg; pi, ci, di), ..., sn>
      ------------------------------------------------
      (S, G) →rep (S', G)

(Tb)  S = <s1, ..., (tg; pi, ci, di), ..., sn>    di ≠ ∅    ∀(t^k, t) ∈ di. t^k ∈ ck
      S' = <s1, ..., (pi, ci ∪ tg, ∅), ..., sn>    T' = T ∪ tg(i)    D' = D ∪ di
      ------------------------------------------------
      (S, <T, D>) →rep (S', <T', D'>)

(Tc)  S = <s1, ..., (tg; pi, ci, ∅), ..., sn>    ci ×⇒ tg(i) = ∅
      S' = <s1, ..., (pi, ci ∪ tg, ∅), ..., sn>    T' = T ∪ tg(i)
      ------------------------------------------------
      (S, <T, D>) →rep (S', <T', D>)

2 Foundations of Control Replication

Dynamic control replication improves the performance of run-time dependence analysis by enlisting the parallel resources of a distributed machine. To this end, DCR partitions the analysis work, enabling one or more shards of a replicated task, running on multiple nodes, to handle a subset of the full analysis.

is owned by shard k. Each shard k analyzes only the subset tg(k) ≜ {t^k ∈ tg} of a task group tg.
DCR requires distinguishing where analysis can be done in parallel, and where coordination between the shards is required. A key observation is that programs typically launch tasks in groups of independent tasks (i.e., tasks with no dependences between them); shards can analyze tasks in each task group in parallel. A program is modeled as a sequence of task groups:

  p ∈ Program = TaskGroup*    p = ε | tg; p
  tg ∈ TaskGroup = {Task}
  t ∈ Task

This definition of a program is equivalent to a top-level task that launches a sequence of task groups that do not launch further subtasks (e.g., the example in Figure 1). In our implementation, we allow any task to be replicated with DCR, and subtasks launched in replicated tasks may launch (optionally replicated) subtasks of their own, but this simplified model is sufficient to formalize the key ideas.
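The task-group model can be made concrete with a short reference implementation. The sketch below is our own illustration, not the paper's code: a program is a list of task groups, a caller-supplied oracle `depends(t1, t2)` stands in for the runtime's pairwise dependence test, and the result is a task graph of tasks and edges.

```python
def sequential_dependence_analysis(program, depends):
    """Reference (sequential) analysis over a program modeled as a list
    of task groups.  `depends(t1, t2)` answers whether a later task t2
    depends on an earlier task t1.  Returns the task graph (tasks, edges)."""
    tasks, edges = [], set()
    for group in program:
        for t2 in group:
            for t1 in tasks:            # all predecessors in program order
                if depends(t1, t2):
                    edges.add((t1, t2))
        tasks.extend(group)             # tasks within a group stay unordered
    return tasks, edges

# Toy oracle: two tasks conflict when they touch the same datum.
writes = {"A": "x", "B": "y", "C": "x", "D": "y"}
T, D = sequential_dependence_analysis([["A", "B"], ["C", "D"]],
                                      lambda t1, t2: writes[t1] == writes[t2])
# T == ["A", "B", "C", "D"]; D == {("A", "C"), ("B", "D")}
```

Note that tasks inside one group are never compared against each other, which encodes the pairwise-independence requirement on task groups.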
We assume that there is a separate dependence analysis, which we model as an oracle, that determines whether two tasks are independent (our implementation similarly reuses the existing pairwise task dependence analysis of the underlying runtime system). We write t1 ∗ t2 to indicate tasks t1 and t2 are independent. Tasks in a task group must be pairwise independent: ∀t1, t2 ∈ tg. t1 = t2 ∨ t1 ∗ t2. We write t1 ⇒ t2 if task t2 depends on t1, i.e., t2 occurs after t1 in the program and ¬(t1 ∗ t2). The dependence analysis's output is a task graph G = <T, D>, a DAG of tasks T whose directed edges D are dependences.

The first step in DCR is to apply a sharding function, which determines the owner shard for each task (see Section 4). We assume that a sharding function is already applied to the tasks in a program to be analyzed, and we write t^k if task t

Once the owner shard is determined for each task, a program is analyzed by shards in parallel. The analysis state (S, G) consists of a tuple S = <s1, ..., sn> of states (one for each shard), and a global task graph G shared by all shards. Each shard i's state si = (pi, ci, di) consists of a program pi, a set of tasks that the shard has finished analyzing ci, and a set of outstanding dependences di.

Figure 2 gives an operational semantics DEP_rep for the parallel dependence analysis. Each inference rule corresponds to a single analysis step. The analysis starts with an initial state S = <(p, ∅, ∅), ..., (p, ∅, ∅)> where the program is replicated in the shards' states and finishes when all shards have no remaining task groups to analyze. In the rules Ta and Tc, we write T ×⇒ T' to denote a set of dependences between tasks in T and those in T' (obtained by querying the oracle):

  T ×⇒ T' ≜ {(t, t') | t ∈ T ∧ t' ∈ T' ∧ t ⇒ t'}

The analysis of a task group in each shard is done in two steps: First, a shard i identifies dependences for the subset tg(i) of the task group tg being analyzed and records them locally in the set di of outstanding dependences (rule Ta). Second, the outstanding dependences of a task t are registered to the global task graph if all of t's dependent predecessors have been analyzed by their owner shards (rule Tb); to check this condition, each shard i maintains a set ci of tasks whose analysis is completed. The rule Tc handles the case where the tasks being analyzed have no dependences, allowing those tasks to be registered to the task graph immediately. In rules Ta and Tc, shard i compares tg(i) with ci and not with the tasks in the task graph, as shards can make progress
at different rates and therefore the task graph may not yet contain all the dependent predecessors of tg(i).

The key property of DEP_rep is that it produces the same task graph as a sequential analysis; Figure 3 shows a system DEP_seq that sequentially analyzes dependences.

Figure 3. Sequential dependence analysis DEP_seq:

  T' = T ∪ tg    D' = D ∪ (T ×⇒ tg)
  ------------------------------------------------
  (tg; p, <T, D>) →seq (p, <T', D'>)

Theorem 1. For a program p, the following holds:

  (p, <∅, ∅>) →seq★ (ε, Gs) ∧ (Sι^N, <∅, ∅>) →rep★ (S∅^N, Gr) ⟹ Gs = Gr,

where Sι^N = <(p, ∅, ∅), ..., (p, ∅, ∅)> (N copies), S∅^N = <(ε, p, ∅), ..., (ε, p, ∅)>, and R★ denotes the reflexive transitive closure of a relation R.

Theorem 1 is a direct consequence of a lemma stating that any consecutive transitions A and B, in two different shards, can be reordered to B followed by A if the task group analyzed in A appears later than that analyzed in B in the original program order. From this lemma, we derive that the transitions for an analysis of a program using the system DEP_rep can be reordered so that they simulate an analysis of the same program with the system DEP_seq. A detailed proof of Theorem 1 is included in Appendix A.

For simplicity, we have described dependence analysis as if it always compares each task with all predecessors. In practice, we can minimize comparisons because transitive dependences are redundant: if the system already has dependences t1 ⇒ t2 and t2 ⇒ t3 in the task graph, then t1 ⇒ t3 does not further constrain the scheduling of tasks [11].

3 Control Determinism

Our semantics (Section 2) and our implementation (Section 4) rely on all shards analyzing task groups in the same order. Any deterministic program run with the same input in all shards will satisfy this requirement. In practice, however, there can be sources of "input" that differ across shards. The state of the memory allocator, the presence of profile-guided garbage collectors, address space randomization, the choice of hash functions in an interpreter, and hardware clocks for timing are all examples of data sources that may be visible to the program and different in each shard.

Fortunately, dynamic control replication does not require all data sources be identical across shards. A program is control deterministic if all shards make the same sequence of API calls into the runtime system. We illustrate three violations of control determinism we have encountered.

Figure 4. Branching on a random number:

  import random
  ...
  if (random.random() < 0.5):
      run_algorithm0()
  else:
      run_algorithm1()

Figure 5. Branching on a timing-dependent value:

  future = launch_task0()
  if future.is_ready():
      value = future.get_value()
      run_task1_inline(value)
  else:
      launch_task1(precondition=future)

Figure 6. Iterating a data structure with undefined order:

  regions = set()
  for i in range(num_regions):
      regions.add(create_region())
  for region in regions:
      launch_task(region)

Figure 4 selects different algorithms to run based on the result of a random number generator. To ensure shards produce the same random number sequences, we provide a pseudo-random number generator backed by a parallel counter-based generator [40].

Figure 5 branches on a value whose result depends on timing, in this case an optimization: If the future is unresolved, rather than block on the future's value a task is launched with the resolution of the future as a precondition. The future may resolve at different speeds in different shards, and so some shards may launch a subtask while others do not.

Figure 6 launches one task per set element. In Python, a set is ordered by hash values of the items. For security, the hashing is randomized, so shards will launch the same tasks but likely in different orders. Such situations are easily fixed by using a data structure with a defined order, such as a list.

We enforce control determinism using a dynamic analysis. For each runtime API call from a shard of a replicated task (and only for such calls), we compute a 128-bit hash that captures the API call and all its actual arguments. An all-reduce collective (see Section 4.2) checks that the hashes from all shards are identical, indicating that (assuming no hash conflicts) the program is control deterministic to that point in its execution. The all-reduce is performed asynchronously to hide its latency and many applications run without performance degradation even with the check enabled if there is unused communication bandwidth (see Section 5.5). If a check fails, the runtime system aborts with an error listing the operation that failed to be control deterministic, which we have found sufficient for debugging.

For completeness, two additional sources of input that must be handled by any system implementing dynamic control replication are files and interactions with garbage collectors. We expect the support for these sources of input to be implementation specific; we discuss how our implementation of DCR handles both cases in Section 4.3.
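A minimal sketch of the dynamic check (our own illustration: SHA-256 truncated to 128 bits stands in for the unspecified hash function, and a list of per-shard hashes stands in for the asynchronous all-reduce):

```python
import hashlib

def call_hash(api_name, *args):
    """128-bit digest capturing a runtime API call and its arguments."""
    return hashlib.sha256(repr((api_name, args)).encode()).hexdigest()[:32]

def check_control_deterministic(shard_hashes):
    """Stand-in for the all-reduce: the call is control deterministic
    (up to hash collisions) iff every shard produced the same hash."""
    return len(set(shard_hashes)) == 1

# Both shards issue the same launch with the same arguments: check passes.
ok = check_control_deterministic([call_hash("launch", "add_one", 0),
                                  call_hash("launch", "add_one", 0)])
# Shards diverge (one launches a different subtask): violation detected.
bad = check_control_deterministic([call_hash("launch", "add_one", 0),
                                   call_hash("launch", "mul_two", 0)])
```

Because only the hashes are exchanged, the comparison cost is independent of the size of the call's arguments, which is what makes hiding the check's latency behind unused bandwidth plausible.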
4 Dynamic Control Replication in Legion

We have implemented dynamic control replication in the Legion runtime [11]. We work through the execution of the sequential 1D stencil program in Figure 7 with DCR on a small machine with two nodes, each with one CPU and two GPUs. This program is written in Regent [42], a high-level programming language that compiles down to calls to the Legion runtime system. To explain how DCR works with this program, we need to introduce some Legion terminology to explain specifically how Legion expresses task groups.

Figure 7. A 1-D stencil program written in Regent [42]:

  1  fspace Cell {
  2    state : double,
  3    flux : double,
  4  }
  5
  6  task add_one(cells: region(ispace(int1d), Cell))
  7  where reads writes(cells.state) do
  8    for i in cells do
  9      cells[i].state = cells[i].state + 1
  10   end
  11 end
  12
  13 task mul_two(cells: region(ispace(int1d), Cell))
  14 where reads writes(cells.flux) do
  15   for i in cells do
  16     cells[i].flux = cells[i].flux * 2
  17   end
  18 end
  19
  20 task stencil(owned: region(ispace(int1d), Cell),
  21              ghost: region(ispace(int1d), Cell))
  22 where reads writes(owned.flux),
  23       reads(ghost.state) do
  24   for i in owned do
  25     owned[i].flux = owned[i].flux +
  26       0.5*(ghost[i-1].state + ghost[i+1].state)
  27   end
  28 end
  29
  30 __demand(__replicable)
  31 task main()
  32   -- Parse args ncells, ntiles=4, nsteps, init
  33   var grid = ispace(int1d, { x = ncells })
  34   var tiles = ispace(int1d, { x = ntiles })
  35   var cells = region(grid, Cell)
  36   -- Create partitions: owned, interior, ghost
  37   -- Code omitted for space, see Figure 8
  38   fill(cells.{state, flux}, init)
  39   for t = 0, nsteps do
  40     for i in tiles do
  41       add_one(owned[i])
  42     end
  43     for i in tiles do
  44       mul_two(interior[i])
  45     end
  46     for i in tiles do
  47       stencil(interior[i], ghost[i])
  48     end
  49   end
  50 end

Legion tasks operate on regions, which are tables created from a set of points called an index space (which in this case is 1D) that name the table's rows, and a set of typed fields naming the table's columns; see lines 1-3 (which define two field names), line 33 (which creates a 1D index space), and line 35 (which defines a region).

Regions can be partitioned into arbitrary subregions [49, 50]; partitions are used like arrays of subregions (lines 41, 44, and 47). Subregions can be further partitioned, and in general, programs construct region trees by recursively partitioning a root region. Figure 8 depicts the region tree used by the stencil program (the partitioning code is omitted from Figure 7 for space reasons but can be found online [6]). An important property of region trees is that any region in the tree is a superset of all the regions in its subtree. Thus, any set of regions can be over-approximated by a region r, or more generally a partition of r, that is their least common ancestor in the tree. We will make use of this property in our analysis of the dependences between task groups.¹ In this example three different partitions of the cells are created, representing the cells owned by each node, the cells in the interior of the domain within each owned subregion, and the owned regions plus ghost cells on either side.

The __demand(__replicable) annotation on line 30 indicates that main is a control deterministic task and therefore eligible for DCR. Most applications do not require further changes, except for minor tweaks to interact correctly with external side effects (see Section 4.3).

Regardless of whether this program is executed with DCR or not, the Regent compiler transforms inner loops of independent task launches into group task launches [42]. To enable each task in a group task launch to have different arguments, the task calls have the form t(p[f(i_j)]) where t is the task, i1, ...
, in is an index space, p is a partition containing the subregions accessed in the original loop, and f is a function that returns the subregion index of p for the j-th call of t.²

Figure 8. A region tree with three partitions to describe the different subsets of cells used by different tasks in the stencil example. Ranges of sub-regions are depicted visually at top. [Diagram: the root region cells partitioned into owned, interior, and ghost partitions, each with its own subregions.]

¹Note that this property of region trees is used in our implementation of DCR in Legion and is not a requirement of DCR in general.
²For region trees with multiple levels of partitioning, a more general form of this function can choose any subregion in the subtree of p [11].

Lines 40-42, 43-45, and 46-48 are converted to group task launches, which define task groups in the sense of Section 2. For example, the loop on lines 40-42 is replaced by the group task launch add_one(owned[id(·)]) over the tiles index space, where id is the identity function. While
DCR works correctly for individual tasks and operations (such as the fill to initialize data on line 38), the scalability of dependence analysis hinges on the efficiency of analyzing programs consisting primarily of group launches.

In our Legion implementation we do not attempt to decide automatically when to use DCR; instead we expose this decision in the Legion mapping interface [11], an API for application- and machine-specific policies that affect performance. A client of the mapping interface is a mapper. Our mapping interface extensions enable mappers to specify which task(s) to dynamically control replicate, the number of shards, and on which processors shards should execute. For the example program, the mapper (not shown) requests one shard of main to execute on a CPU of each node. We note that there is nothing that prevents the use of DCR from being automated by heuristics in the runtime system to decide when to use it; we have simply chosen to expose it through an API so users can decide when best to deploy DCR.

When a DCR task executes, Legion queries mappers to select a sharding function for each subtask launch. Sharding functions are pure functions that map individual tasks or tasks in a group launch to shards. Each shard is responsible for the dependence analysis of the tasks assigned to it. A good sharding function assigns tasks near where they will execute, while a poor choice may require significant movement of meta-data by the runtime. In the example, the cyclic sharding function (ID 0) round-robins tasks across the shards. Because sharding functions are pure, we can memoize their results, which reduces their dynamic invocation overhead.

The implementation described here is specific to Legion, but DCR is not. The essential requirements for any implicitly parallel programming system to support DCR are task groups, a way to indicate a task is control deterministic, and sharding functions. While our Legion implementation exposes the implementation of sharding functions, allowing customization for additional performance, this is not essential; a fixed sharding function could also be implemented within the runtime system.

4.1 Dependence Analysis

In a straightforward implementation of DCR, each shard would analyze all subtasks launched by the replicated task, which would be problematic because the number of subtasks is often proportional to the number of compute nodes and dependence analysis costs would scale with node count. To improve scalability, our implementation uses a hierarchical analysis that achieves O(log N) overhead in the average case (where N is the number of replicated shards), and appears to execute with O(1) cost if there is sufficient task parallelism.

In our approach, each shard implements the dependence analysis in Figure 9, which consists of a coarse stage and a fine stage.

Figure 9. Pseudo-code for the dependence analysis algorithm. Each stage operates independently and continuously on every shard for the lifetime of a DCR program. Tasks and other operations are inserted into the coarse queue in program order; ideally most are group tasks or group ops.

Coarse Stage Algorithm:
  1  while (!coarse_queue.empty()):
  2    task_or_op = coarse_queue.pop()
  3    deps = set()
  4    foreach upper bound partition P in task_or_op:
  5      foreach dep in find_dependences(task_or_op, P):
  6        if requires_shard_fence(dep):
  7          deps.add(upgrade_to_shard_fence(dep))
  8        else:
  9          deps.add(dep)
  10   dispatch_to_fine_queue_when_ready(task_or_op, deps)

Fine Stage Algorithm:
  1  while (!fine_queue.empty()):
  2    task_or_op = fine_queue.pop()
  3    local_points = sharding_func(task_or_op, shard_id)
  4    foreach point_task_or_op in local_points:
  5      events = set()
  6      foreach region in point_task_or_op:
  7        events.add(make_valid_region(region))
  8      launch(point_task_or_op, events)
  9    complete_stage1_dependences(task_or_op)

that t2 depends on t1, then there will be a coarse-stage dependence between groups G1 and G2. Importantly, these group-level dependences are discovered without enumerating the individual tasks within a group. Consider a task with one argument. Conceptually, we construct a single task t(r) that is representative of an entire task group tg by taking r to be an upper bound in the region tree of all the region arguments of tasks in tg. We then query the dependence oracle for any dependences between the task representative t and representatives of other task groups. In Figure 9, since the call to a task group already provides an upper bound partition that covers all possible region arguments in the task group, we use the partition and do not calculate another upper bound.
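The upper-bound construction can be illustrated on a toy region tree (our own sketch, loosely mirroring Figure 8; a real region tree carries much more structure than parent links, and Legion skips this computation when the launch already supplies a partition):

```python
# Toy region tree as child -> parent links: the root region `cells` is
# partitioned into owned, interior, and ghost, each with two subregions.
parent = {"owned0": "owned", "owned1": "owned",
          "interior0": "interior", "interior1": "interior",
          "ghost0": "ghost", "ghost1": "ghost",
          "owned": "cells", "interior": "cells", "ghost": "cells",
          "cells": None}

def ancestors(r):
    """A region together with all of its ancestors up to the root."""
    while r is not None:
        yield r
        r = parent[r]

def upper_bound(regions):
    """Least common ancestor: the deepest single region that is a
    superset of every region argument in a task group."""
    common = set(ancestors(regions[0]))
    for r in regions[1:]:
        common &= set(ancestors(r))
    return max(common, key=lambda r: len(list(ancestors(r))))
```

For example, a group whose arguments are the owned subregions is represented by `owned` itself, while a group mixing subregions of different partitions forces the bound up to the root `cells`, so the representative task over-approximates the whole group with a single region.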
For a task with multiple arguments, we repeat this construction for each argument. All shards perform dependence analysis for all task groups in the coarse stage, so every shard is aware of all dependences between task groups.

A coarse stage dependence from group G1 to G2 requires that the fine stage analysis of all individual tasks in G1 finishes before the fine stage analysis of any tasks in G2. As coarse stage dependences are satisfied, tasks and operations proceed to the fine stage where a precise dependence analysis is performed for the tasks owned by the shard.

Coarse stage dependences are trivially satisfied among tasks owned by the same shard, as shards analyze their tasks in program order. To handle cross-shard dependences, the coarse stage promotes dependences to cross-shard fences to enforce a partial order between the fine stage analyses on different shards (lines 6-7). Cross-shard fences behave similarly
The coarse stage analysis discovers dependences to scoped memory fences on parts of the region tree, pro- at the granularity of task groups; i.e., for two task groups viding ordering of fine stage analysis across shards for tasks 퐺1 and 퐺2, if there are any tasks 푡1 ∈ 퐺1 and 푡2 ∈ 퐺2 such and operations that access the fenced regions/partitions. Scaling Implicit Parallelism via Dynamic Control Replication PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea
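The sharding functions described above can be illustrated with a minimal Python sketch. This is not Legion's actual API (Legion's sharding functors are C++); the names cyclic_shard and owned_tasks are hypothetical, and the sketch only shows why purity makes memoization safe and how each shard selects the tasks it owns for fine-stage analysis.

```python
from functools import lru_cache

# Memoization is sound only because a sharding function is pure: the same
# (task point, shard count) always yields the same shard ID.
@lru_cache(maxsize=None)
def cyclic_shard(task_point: int, num_shards: int) -> int:
    """Round-robin tasks across shards (like sharding function ID 0 in the example)."""
    return task_point % num_shards

def owned_tasks(group_points, shard_id, num_shards):
    """The subset of a group launch that this shard analyzes in the fine stage."""
    return [p for p in group_points if cyclic_shard(p, num_shards) == shard_id]
```

For example, with two shards and a group launch over points 0-3, shard 0 owns tasks [0, 2] and shard 1 owns tasks [1, 3], matching the cyclic assignment used in the running example.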
Figure 10. Coarse stage dependence analysis computed by every shard for the program in Figure 7.

Figure 11. Changed dependence analysis (in red) from an alternate selection of sharding function for mul_two.

Any operation can begin the fine stage once all its incoming dependences from the coarse stage analysis have been satisfied (line 2). The fine stage analyzes individual tasks or operations (line 4). For each individual task or operation, it performs any data movement (line 7), gathers event preconditions (lines 5, 7), and dispatches execution to the lowest layer of Legion [48]. On line 3 we use our local shard ID to evaluate the sharding function for each task or operation to determine the set of individual tasks or operations owned by the local shard. Note that both stages of the analysis operate asynchronously with respect to each other, and we exploit pipeline parallelism to hide the latency of the on-the-fly analysis; different tasks and operations can be in both stages of the pipeline simultaneously.

We illustrate how the two-stage algorithm analyzes the program in Figure 7. Figure 10 shows the result of the dependence analysis for the beginning of the program. After the fill is issued to both fields of the region, the next operation analyzed is the add_one group task launch over the domain 0−3. This task uses the state field of the owned partition (see Figure 8) and therefore depends on the fill, so a dependence is added. Additionally, we insert a cross-shard fence on the cells region because the fill will be performed on shard 0, whereas the cyclic sharding function (ID 0) for add_one maps individual tasks 1 and 3, which depend on the fill, to shard 1. The second group task launch for mul_two has a similar dependence on the fill for the flux field and therefore inserts a fence on the cells region for the flux field.

Next to be analyzed is the stencil task group, which has dependences on the state field of the previous add_one task as well as on the flux field of the previous mul_two task. The coarse analysis requires a cross-shard fence for the first but not the second dependence. While the sharding functions are the same, the first dependence is between two different partitions of the same region: owned and ghost. We cannot easily discern the aliasing relationship between the sub-regions of these two partitions, so we must conservatively insert the fence to handle any cross-shard dependences. For the second dependence, we know all individual tasks on each shard access the same subregions of the same partition (interior) because they have the same sharding function. Therefore all the individual task dependences will be shard-local, and we can safely elide the cross-shard fence because the fine stage is guaranteed to analyze individual task dependences on local shards in program order. If the mapper picked a different sharding function for mul_two, as in Figure 11, we would need a cross-shard fence because there may be dependences between individual tasks owned by different shards. The tasks from later iterations are analyzed similarly.

Finally, for Legion we implement the dependence oracle for two task calls 푡1(푟1) and 푡2(푟2) by first checking whether regions 푟1 and 푟2 share any index points—if they are disjoint then the tasks are independent. If they share points, then we check whether they are accessing at least one field in common. If the tasks are accessing the same points of at least one field, we lastly check whether either task writes its region argument; if at least one is writing then a dependence is required. This procedure is the standard dynamic dependence analysis carried out by the Legion runtime [11] with no modifications required for DCR.

We make three observations. First, the coarse analysis stage does not require enumerating individual tasks for task groups. The cost of the coarse stage is independent of the size of the group launches, which are usually proportional to the size of the machine. This independence from machine size makes the coarse stage scalable.

Second, in the common case of data parallel operations, we can prove that all dependences are shard-local, and therefore the cross-shard fences can be elided, which avoids unnecessary synchronization (line 6 of the coarse stage in Figure 9). Currently, we do this proof symbolically by analyzing the index space of the group task launches, the regions or partitions that are upper bounds for task groups or operations, and the names of the projection and sharding functions.

Third, when cross-shard fences are required, we insert these on specific regions or partitions. This design is a middle ground between full analysis barriers that create all-to-all dependences between preceding and succeeding tasks, and direct shard-to-shard dependences that could be computed with an expensive inversion of the functions that determine which subregions are used by individual tasks.
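The three-step dependence oracle check described above can be sketched in a few lines of Python. This is a simplification for illustration: region arguments are modeled as explicit sets of index points and fields, whereas the real Legion analysis works on region-tree metadata without enumerating points; the function name depends is hypothetical.

```python
def depends(points1, fields1, writes1, points2, fields2, writes2) -> bool:
    """Dependence oracle for two task calls t1(r1) and t2(r2).

    Each region argument r is modeled as (index points, fields, writes?).
    """
    if not (points1 & points2):   # disjoint index points -> independent
        return False
    if not (fields1 & fields2):   # no field accessed in common -> independent
        return False
    return writes1 or writes2     # shared data and at least one writer -> dependence
```

For instance, two read-only tasks on the same region are independent, while a writer overlapping a reader on a common field yields a dependence.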
Our cross-shard fences are implemented using a collective (see Section 4.2) which has cost O(log 푁) in latency across the shards. Performing the coarse-grained analysis discovers coarse task parallelism that can be exploited during the fine-grained analysis stage to hide the cost of the fences. In practice, most applications we have observed have sufficient task parallelism to hide these latencies.

4.2 Collectives

Our DCR implementation uses a set of collective primitives for performing cooperative work between shards: broadcast from one shard to all shards, reduce from all shards to one shard, all-gather from all shards to all other shards, and all-reduce, which produces a single value for all shards. As an example, our implementation of the cross-shard dependence fences from Section 4.1 is performed with an all-gather collective with no data payload. The collectives are implemented using standard tree or butterfly communication networks with O(log 푁) latency.

4.3 Side Effects

Control replicated programs must interact correctly with the external world. We have focused on two aspects: file I/O and support for languages with garbage collectors. To support file I/O, we provide implementations of Legion's attach and detach operations [26] that execute correctly in dynamically control replicated contexts. Attach operations allow applications to associate external memory allocations (e.g., from an MPI program [45]) or files (e.g., HDF5 [46]) with existing regions in a program; detach operations flush any changes (e.g., write back any file updates to disk) and remove these associations. With dynamic control replication, we provide support for sharding attach and detach operations just like any other kind of operation. Normal files are read and written by a single owner shard; group variants of attach and detach provide support for parallel file I/O.

We have also added support for using DCR with languages that rely on garbage collectors (GC), such as Lua and Python. Finalizers [2] that make runtime API calls can violate control determinism because collections can occur at arbitrary times in each shard. In practice, we have seen finalizers perform detach operations as well as delete regions and fields (see Section 5.4). To handle such cases without violating control determinism, we provide an option to delay detach operations and deletions of Legion resources. The runtime periodically polls the shards to see if they have all observed the same delayed operation. The polling is done with exponential back-off so that it can quickly handle cases when GC is active, but incurs minimal overhead when GC is absent. Once the shards concur, the operation is inserted into the same location in the dependence analysis of each shard.

5 Evaluation

We evaluate DCR against a wide variety of applications drawn from HPC, data analytics, and machine learning, written in Legion without DCR, Regent [43], MPI [45], TensorFlow [7], and Dask+NumPy [39]. We show that compared to other implicitly parallel models (TensorFlow, Dask+NumPy) we achieve much better performance, and that for HPC applications our performance is often competitive with explicitly parallel programming models such as MPI. Using DCR with these applications required marking the top-level task as replicable and writing sharding functions, all of which were one or two lines of code (e.g., round-robin or tiled sharding of tasks). We used either one shard per node or one shard per GPU in all experiments. To underscore the portability of our approach, our experiments are run on a diverse collection of the world's top supercomputers, including Summit (#2), Sierra (#3), Piz-Daint (#10), and Lassen (#14) [5], among others, with a mix of heterogeneous processor types.

5.1 Benchmarks

We begin by benchmarking DCR against both Legion without control replication and the implementation of static control replication in Regent. Recall that static control replication, when it applies, has no runtime overhead because it is a compile-time program transformation. Furthermore, because Regent is also built on the Legion runtime, and its static control replication approach is known to perform well [43], this comparison provides a measure of the end-to-end overheads of our dynamic control replication implementation. Our first benchmark is an implicitly parallel 2D stencil code [6], similar to Figure 7, that requires the system to recognize a nearest-neighbors communication pattern in multiple dimensions. For this benchmark, Figure 12 gives both weak and strong scaling results using the GPUs on Piz-Daint. Without control replication, the overhead of a centralized controller node dominates once the cost of analyzing and distributing all tasks eclipses the execution time of tasks assigned to a single processor. DCR weak scales nearly as well as static control replication, with only a 2.5% slowdown at 512 nodes. For strong scaling on the chosen problem size, static and dynamic control replication scale to 128 and 64 nodes, respectively, before degrading. Unsurprisingly, DCR has higher runtime overheads than static control replication, but they are within a factor of two.

Our next benchmark is a circuit simulation that iteratively updates currents on wires and voltages on nodes in a graph of circuit components. The partitioning of the graph is done dynamically, so the communication pattern must also be established at runtime. Figure 13 gives weak and strong scaling results, both of which are significantly better with DCR

³There are approaches to scaling task-based programs based on using explicit communication and synchronization in a SPMD-like structure [12]. Our evaluation focuses on programs written purely using implicit tasking.
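The butterfly communication pattern mentioned in Section 4.2 can be illustrated with a minimal in-process simulation. This is an assumption-laden sketch, not the Legion implementation (which exchanges messages between distributed shards): shards are simulated as list slots, the function name butterfly_allreduce is hypothetical, and the shard count is assumed to be a power of two.

```python
import math

def butterfly_allreduce(values, op=lambda a, b: a + b):
    """Simulate a recursive-doubling (butterfly) all-reduce over N shards.

    In round k, each shard exchanges with the partner at distance 2^k,
    so after O(log N) rounds every shard holds the fully reduced value.
    """
    n = len(values)                     # number of shards
    assert n > 0 and n & (n - 1) == 0,  "sketch assumes a power-of-two shard count"
    vals = list(values)
    rounds = int(math.log2(n)) if n > 1 else 0
    for k in range(rounds):
        dist = 1 << k
        nxt = list(vals)
        for shard in range(n):
            partner = shard ^ dist      # butterfly partner for this round
            nxt[shard] = op(vals[shard], vals[partner])
        vals = nxt
    return vals                         # identical reduced value on every shard
```

With four shards holding [1, 2, 3, 4], two rounds suffice and every shard ends with 10; a cross-shard fence corresponds to running such a collective with no data payload, purely for its synchronization effect.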
Figure 12. Scaling of 2D Stencil Benchmark. (a) Weak Scaling and (b) Strong Scaling of throughput from 1 to 512 nodes, comparing No Control Replication, Static Control Replication, and Dynamic Control Replication.

Figure 14. Weak scaling of Pennant compared to MPI. Throughput (iterations/s) from 1 to 32 DGX-1V nodes (8 to 256 GPUs), comparing MPI CPU-only, MPI+CUDA, MPI+CUDA+GPUDirect, Legion No Control Replication, and Legion Dynamic Control Replication.

[Figure legend fragment: TensorFlow; FlexFlow (No Control Replication); FlexFlow (Dynamic Control Replication).]