Scaling Implicit Parallelism via Dynamic Control Replication
Michael Bauer (NVIDIA), Wonchan Lee (NVIDIA), Elliott Slaughter (SLAC National Accelerator Laboratory), Zhihao Jia (Carnegie Mellon University), Mario Di Renzo (Sapienza University of Rome), Manolis Papadakis (NVIDIA), Galen Shipman (Los Alamos National Laboratory), Patrick McCormick (Los Alamos National Laboratory), Michael Garland (NVIDIA), Alex Aiken (Stanford University)

Abstract

We present dynamic control replication, a run-time program analysis that enables scalable execution of implicitly parallel programs on large machines through a distributed and efficient dynamic dependence analysis. Dynamic control replication distributes dependence analysis by executing multiple copies of an implicitly parallel program while ensuring that they still collectively behave as a single execution. By distributing and parallelizing the dependence analysis, dynamic control replication supports efficient, on-the-fly computation of dependences for programs with arbitrary control flow at scale. We describe an asymptotically scalable algorithm for implementing dynamic control replication that maintains the sequential semantics of implicitly parallel programs. An implementation of dynamic control replication in the Legion runtime delivers the same programmer productivity as writing in other implicitly parallel programming models, such as Dask or TensorFlow, while providing better performance (11.4X and 14.9X respectively in our experiments), and scalability to hundreds of nodes. We also show that dynamic control replication provides good absolute performance and scaling for HPC applications, competitive in many cases with explicitly parallel programming systems.

CCS Concepts • Software and its engineering → Runtime environments; • Computing methodologies → Parallel and distributed programming languages;

Keywords implicit parallelism, scalable dependence analysis, dynamic control replication, Legion, task-based runtime

Publication rights licensed to ACM.

1 Introduction

Implicitly parallel programming models, where apparently sequential programs are automatically parallelized, are increasingly popular as programmers endeavor to solve computationally difficult problems using massively parallel and distributed machines. Many of these programmers have limited experience writing explicitly parallel code and instead choose to leverage high productivity frameworks—such as TensorFlow [7], PyTorch [38], and Spark [53]—that hide the details of parallelism and data distribution.

In implicitly parallel programming models, distinguished functions, often called tasks, are the units of parallelism. Tasks may consume data produced by other tasks preceding them in program order. Thus, to discover implicit parallelism, programming systems must perform a dependence analysis to discover a legal partial order on task execution. In the worst case, dependence analysis has a computational complexity that is quadratic in the number of tasks: each task must be checked against all its predecessors in the program for dependences. Since the number of tasks to be analyzed is generally proportional to the size of the target machine, dependence analysis is expensive at scale, and its apparently sequential nature can make it difficult to parallelize. Therefore, all existing systems make tradeoffs limiting the scalability of their implementation, the precision of their analysis, and/or the expressivity of their programming model to avoid or reduce the costs of a fully general dependence analysis.

One such class of systems performs dependence analysis at compile-time, thereby eliminating run-time overhead. Sequoia [20], Deterministic Parallel Java [13], and Regent [43]
ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
PPoPP '21, February 27-March 3, 2021, Virtual Event, Republic of Korea
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8294-6/21/02...$15.00
https://doi.org/10.1145/3437801.3441587

are compiler-based systems that statically infer data dependences to transform a program into an explicitly parallel code that runs across many distributed nodes. However, common idioms such as data-dependent control flow, or potentially aliased data structures, can defeat static analysis, resulting in failed compilation or generated code incapable of scaling to large node counts. Some systems limit the expressivity of the programming model to ensure static analysis is successful
Figure 1. Approaches to implicit parallelism: static analysis and lazy evaluation centralize analysis and distribute partitioned programs to workers for execution. DCR executes a replicated program, dynamically assigns tasks to shards, and performs an on-the-fly distributed dependence analysis. [Diagram: an example task T (a while !done() loop launching subtasks A()-F()) executed under static analysis (compiler partitions loops across execution nodes), lazy evaluation (control node feeds worker nodes), and DCR (shards T1 and T2 across execution nodes).]

(e.g., Sequoia). The compile-time approach works best for the subset of programs amenable to precise static analysis.

To avoid the limitations of static analysis, programming systems such as Spark and TensorFlow use lazy evaluation to perform dependence analysis at run-time. A single node executes the program's control where all tasks are issued, but tasks are not immediately executed. Instead the control node performs a dynamic dependence analysis on the set of launched tasks to build a representation of the program. At certain points, the system dynamically compiles and optimizes this representation and then distributes it to worker nodes for execution. The main difficulty is that the control node can limit scalability, either as a sequential bottleneck in distributing tasks to worker nodes, or as a limitation on the size of the program representation that can be stored in memory. Both Spark and TensorFlow take steps to mitigate these risks. To avoid having the centralized controller become a sequential bottleneck, these systems either memoize repeated executions of code (Spark) or amortize the cost of dependence analysis by explicitly representing loops with some (but not arbitrary) control flow (TensorFlow) [52]. Therefore, programs with large repeated loops of execution can be compiled, optimized, and partitioned for distribution to worker nodes efficiently. Furthermore, to avoid overflowing memory, these systems limit the expressible dependence patterns. Consequently, Spark and TensorFlow work well for programs with minimal dynamic control flow and simple and symmetric dependence patterns.

For programs with complex control flow, arbitrary data dependences, and/or the need to run arbitrary tasks on any node, a different approach is needed. The dependence analysis should be distributed to ensure that no node is a bottleneck. In particular, there is little difficulty if a task launches a small, constant number of subtasks; these can be analyzed in essentially constant time and with small overhead in current tasking systems. The problem is with tasks that launch a number of subtasks proportional to the size of the machine, where the dependence analysis quickly becomes a bottleneck when scaling up to large clusters.

We introduce dynamic control replication (DCR), a parallel and distributed run-time dependence analysis for implicitly parallel programming models that can scale to large node counts and support arbitrary control flow, data usage, and task execution on any node. Figure 1 gives an example program consisting of a top-level task with a loop that launches six subtasks A-F with some dependences between them, along with an illustration of static, lazy evaluation, and dynamic control replication approaches to the program's execution. In the static approach, to target n nodes, the tasks are partitioned into n loops with explicit cross-node synchronization to enforce dependences. The compiled program is similar to an MPI implementation [45], with each node explicitly communicating and synchronizing with other nodes. The lazy approach attempts to achieve a similar effect at run-time, with all of the dependence analysis carried out by the control node, and then just-in-time schedules of tasks sent to each worker. These schedules can be cached on the workers to avoid repeating the analysis if possible, but in general when the program's behavior changes, the control node must carry out another lazy gathering of tasks, dependence analysis, and schedule creation for the workers.

In DCR, illustrated at the bottom of Figure 1, the dependence analysis is itself spread across all the nodes of the system. The execution of task T is replicated into shards; each shard Ti executes a copy of T that replicates all of the control structures of the task T, but only performs the dependence analysis for a subset of the subtasks launched by T. In the figure, shard T1 handles subtasks A, C, and E for the first loop iteration, while shard T2 handles subtasks B, D and F. The shards cooperate to discover and enforce any cross-shard dependences, in this case the dependences between B and C and between C and F. The assignment of subtasks of T to shards Ti is done by the runtime system using a sharding function f. The only requirements of f are that it be a function (each subtask is assigned to one shard) and total (every subtask is assigned to some shard). For performance, the sharding function should also balance assignments across shards. Because the sharding is computed dynamically, the sharding function can assign different instances of a subtask to different shards; in Figure 1 the shards of T swap roles on the second iteration, with T1 handling subtasks B, D and F and shard T2 handling the rest. Crucial to the scalability of DCR is the observation that consecutive independent tasks, such as A and B, can be aggregated into group tasks that can be launched and analyzed more efficiently as a single operation. We formally introduce group tasks in Section 2.

In DCR the execution of a task is replicated, but collectively the shards behave as a single logical task. By performing dependence analysis on the fly, DCR can easily react to data-dependent control flow. Since the dependence analysis is distributed, each node is only responsible for its subset of the overall analysis and storing a subset of the program representation, yielding an inherently scalable approach.
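These requirements are easy to state in code. The sketch below is our own illustration (the names are not Legion's API): a cyclic sharding function is total (every task index maps to some shard) and functional (each index maps to exactly one shard), and balances tasks round-robin across shards.

```python
def cyclic_shard(task_index, num_shards):
    """Round-robin owner shard for a task: total and functional."""
    return task_index % num_shards

def owned_tasks(task_indices, shard_id, num_shards):
    """The subset of a task group that a given shard analyzes."""
    return [i for i in task_indices if cyclic_shard(i, num_shards) == shard_id]

# Two shards splitting a group of six subtasks (A-F as indices 0-5):
# shard 0 owns [0, 2, 4] and shard 1 owns [1, 3, 5].
```

Because every index is mapped and no index is mapped twice, the shards' subsets partition each task group, which is exactly what lets the shards analyze disjoint work while collectively covering the whole program.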
Each subsequent section presents a contribution: We formalize parallel dynamic dependence analysis and prove this algorithm is equivalent to sequential dependence analysis (Section 2). DCR introduces another new issue: the replicated portion of the program must behave identically in all copies and therefore cannot exhibit non-determinism; we show how to dynamically verify this property of control determinism (Section 3). We describe our implementation of DCR (Section 4) and important optimizations required to achieve good performance in practice. Our evaluation demonstrates that DCR delivers the same productivity with better performance and scalability for widely-used data analysis and machine learning programs, and provides good absolute performance and scalability for HPC applications (Section 5).

Figure 2. Parallel dependence analysis DEP_rep. The rules are:

(Ta)  S = <s1, ..., (tg; pi, ci, ∅), ..., sn>    di = ci ×⇒ tg(i) ≠ ∅
      S' = <s1, ..., (tg; pi, ci, di), ..., sn>
      ------------------------------------------------
      (S, G) →rep (S', G)

(Tb)  S = <s1, ..., (tg; pi, ci, di), ..., sn>    di ≠ ∅    ∀(t^k, t) ∈ di. t^k ∈ ck
      S' = <s1, ..., (pi, ci ∪ tg, ∅), ..., sn>    T' = T ∪ tg(i)    D' = D ∪ di
      ------------------------------------------------
      (S, <T, D>) →rep (S', <T', D'>)

(Tc)  S = <s1, ..., (tg; pi, ci, ∅), ..., sn>    ci ×⇒ tg(i) = ∅
      S' = <s1, ..., (pi, ci ∪ tg, ∅), ..., sn>    T' = T ∪ tg(i)
      ------------------------------------------------
      (S, <T, D>) →rep (S', <T', D>)

2 Foundations of Control Replication

Dynamic control replication improves the performance of run-time dependence analysis by enlisting the parallel resources of a distributed machine. To this end, DCR partitions the analysis work, enabling one or more shards of a replicated task, running on multiple nodes, to handle a subset of the full analysis.

is owned by shard k. Each shard k analyzes only the subset tg(k) ≜ {t^k ∈ tg} of a task group tg.
DCR requires distinguishing where analysis can be done in parallel, and where coordination between the shards is required. A key observation is that programs typically launch tasks in groups of independent tasks (i.e., tasks with no dependences between them); shards can analyze tasks in each task group in parallel. A program is modeled as a sequence of task groups:

  p ∈ Program = TaskGroup*    p = ε | tg; p
  tg ∈ TaskGroup = {Task}
  t ∈ Task

This definition of a program is equivalent to a top-level task that launches a sequence of task groups that do not launch further subtasks (e.g., the example in Figure 1). In our implementation, we allow any task to be replicated with DCR, and subtasks launched in replicated tasks may launch (optionally replicated) subtasks of their own, but this simplified model is sufficient to formalize the key ideas.
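The task-group model can be made concrete with a short reference implementation. The sketch below is our own illustration, not the paper's code: a program is a list of task groups, a caller-supplied oracle `depends(t1, t2)` stands in for the runtime's pairwise dependence test, and the result is a task graph of tasks and edges.

```python
def sequential_dependence_analysis(program, depends):
    """Reference (sequential) analysis over a program modeled as a list
    of task groups.  `depends(t1, t2)` answers whether a later task t2
    depends on an earlier task t1.  Returns the task graph (tasks, edges)."""
    tasks, edges = [], set()
    for group in program:
        for t2 in group:
            for t1 in tasks:            # all predecessors in program order
                if depends(t1, t2):
                    edges.add((t1, t2))
        tasks.extend(group)             # tasks within a group stay unordered
    return tasks, edges

# Toy oracle: two tasks conflict when they touch the same datum.
writes = {"A": "x", "B": "y", "C": "x", "D": "y"}
T, D = sequential_dependence_analysis([["A", "B"], ["C", "D"]],
                                      lambda t1, t2: writes[t1] == writes[t2])
# T == ["A", "B", "C", "D"]; D == {("A", "C"), ("B", "D")}
```

Note that tasks inside one group are never compared against each other, which encodes the pairwise-independence requirement on task groups.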
We assume that there is a separate dependence analysis, which we model as an oracle, that determines whether two tasks are independent (our implementation similarly reuses the existing pairwise task dependence analysis of the underlying runtime system). We write t1 ∗ t2 to indicate tasks t1 and t2 are independent. Tasks in a task group must be pairwise independent: ∀t1, t2 ∈ tg. t1 = t2 ∨ t1 ∗ t2. We write t1 ⇒ t2 if task t2 depends on t1, i.e., t2 occurs after t1 in the program and ¬(t1 ∗ t2). The dependence analysis's output is a task graph G = <T, D>, a DAG of tasks T whose directed edges D are dependences.

The first step in DCR is to apply a sharding function, which determines the owner shard for each task (see Section 4). We assume that a sharding function is already applied to the tasks in a program to be analyzed, and we write t^k if task t

Once the owner shard is determined for each task, a program is analyzed by shards in parallel. The analysis state (S, G) consists of a tuple S = <s1, ..., sn> of states (one for each shard), and a global task graph G shared by all shards. Each shard i's state si = (pi, ci, di) consists of a program pi, a set of tasks that the shard has finished analyzing ci, and a set of outstanding dependences di.

Figure 2 gives an operational semantics DEP_rep for the parallel dependence analysis. Each inference rule corresponds to a single analysis step. The analysis starts with an initial state S = <(p, ∅, ∅), ..., (p, ∅, ∅)> where the program is replicated in the shards' states and finishes when all shards have no remaining task groups to analyze. In the rules Ta and Tc, we write T ×⇒ T' to denote a set of dependences between tasks in T and those in T' (obtained by querying the oracle):

  T ×⇒ T' ≜ {(t, t') | t ∈ T ∧ t' ∈ T' ∧ t ⇒ t'}

The analysis of a task group in each shard is done in two steps: First, a shard i identifies dependences for the subset tg(i) of the task group tg being analyzed and records them locally in the set di of outstanding dependences (rule Ta). Second, the outstanding dependences of a task t are registered to the global task graph if all of t's dependent predecessors have been analyzed by their owner shards (rule Tb); to check this condition, each shard i maintains a set ci of tasks whose analysis is completed. The rule Tc handles the case where the tasks being analyzed have no dependences, allowing those tasks to be registered to the task graph immediately. In rules Ta and Tc, shard i compares tg(i) with ci and not with the tasks in the task graph, as shards can make progress
at different rates and therefore the task graph may not yet contain all the dependent predecessors of tg(i).

The key property of DEP_rep is that it produces the same task graph as a sequential analysis; Figure 3 shows a system DEP_seq that sequentially analyzes dependences.

Figure 3. Sequential dependence analysis DEP_seq:

  T' = T ∪ tg    D' = D ∪ (T ×⇒ tg)
  ------------------------------------------------
  (tg; p, <T, D>) →seq (p, <T', D'>)

Theorem 1. For a program p, the following holds:

  (p, <∅, ∅>) →seq★ (ε, Gs) ∧ (Sι^N, <∅, ∅>) →rep★ (S∅^N, Gr) ⟹ Gs = Gr,

where Sι^N = <(p, ∅, ∅), ..., (p, ∅, ∅)> (N copies), S∅^N = <(ε, p, ∅), ..., (ε, p, ∅)>, and R★ denotes the reflexive transitive closure of a relation R.

Theorem 1 is a direct consequence of a lemma stating that any consecutive transitions A and B, in two different shards, can be reordered to B followed by A if the task group analyzed in A appears later than that analyzed in B in the original program order. From this lemma, we derive that the transitions for an analysis of a program using the system DEP_rep can be reordered so that they simulate an analysis of the same program with the system DEP_seq. A detailed proof of Theorem 1 is included in Appendix A.

For simplicity, we have described dependence analysis as if it always compares each task with all predecessors. In practice, we can minimize comparisons because transitive dependences are redundant: if the system already has dependences t1 ⇒ t2 and t2 ⇒ t3 in the task graph, then t1 ⇒ t3 does not further constrain the scheduling of tasks [11].

3 Control Determinism

Our semantics (Section 2) and our implementation (Section 4) rely on all shards analyzing task groups in the same order. Any deterministic program run with the same input in all shards will satisfy this requirement. In practice, however, there can be sources of "input" that differ across shards. The state of the memory allocator, the presence of profile-guided garbage collectors, address space randomization, the choice of hash functions in an interpreter, and hardware clocks for timing are all examples of data sources that may be visible to the program and different in each shard.

Fortunately, dynamic control replication does not require all data sources be identical across shards. A program is control deterministic if all shards make the same sequence of API calls into the runtime system. We illustrate three violations of control determinism we have encountered.

Figure 4. Branching on a random number:

  import random
  ...
  if (random.random() < 0.5):
      run_algorithm0()
  else:
      run_algorithm1()

Figure 5. Branching on a timing-dependent value:

  future = launch_task0()
  if future.is_ready():
      value = future.get_value()
      run_task1_inline(value)
  else:
      launch_task1(precondition=future)

Figure 6. Iterating a data structure with undefined order:

  regions = set()
  for i in range(num_regions):
      regions.add(create_region())
  for region in regions:
      launch_task(region)

Figure 4 selects different algorithms to run based on the result of a random number generator. To ensure shards produce the same random number sequences, we provide a pseudo-random number generator backed by a parallel counter-based generator [40].

Figure 5 branches on a value whose result depends on timing, in this case an optimization: If the future is unresolved, rather than block on the future's value a task is launched with the resolution of the future as a precondition. The future may resolve at different speeds in different shards, and so some shards may launch a subtask while others do not.

Figure 6 launches one task per set element. In Python, a set is ordered by hash values of the items. For security, the hashing is randomized, so shards will launch the same tasks but likely in different orders. Such situations are easily fixed by using a data structure with a defined order, such as a list.

We enforce control determinism using a dynamic analysis. For each runtime API call from a shard of a replicated task (and only for such calls), we compute a 128-bit hash that captures the API call and all its actual arguments. An all-reduce collective (see Section 4.2) checks that the hashes from all shards are identical, indicating that (assuming no hash conflicts) the program is control deterministic to that point in its execution. The all-reduce is performed asynchronously to hide its latency and many applications run without performance degradation even with the check enabled if there is unused communication bandwidth (see Section 5.5). If a check fails, the runtime system aborts with an error listing the operation that failed to be control deterministic, which we have found sufficient for debugging.

For completeness, two additional sources of input that must be handled by any system implementing dynamic control replication are files and interactions with garbage collectors. We expect the support for these sources of input to be implementation specific; we discuss how our implementation of DCR handles both cases in Section 4.3.
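A minimal sketch of the dynamic check (our own illustration: SHA-256 truncated to 128 bits stands in for the unspecified hash function, and a list of per-shard hashes stands in for the asynchronous all-reduce):

```python
import hashlib

def call_hash(api_name, *args):
    """128-bit digest capturing a runtime API call and its arguments."""
    return hashlib.sha256(repr((api_name, args)).encode()).hexdigest()[:32]

def check_control_deterministic(shard_hashes):
    """Stand-in for the all-reduce: the call is control deterministic
    (up to hash collisions) iff every shard produced the same hash."""
    return len(set(shard_hashes)) == 1

# Both shards issue the same launch with the same arguments: check passes.
ok = check_control_deterministic([call_hash("launch", "add_one", 0),
                                  call_hash("launch", "add_one", 0)])
# Shards diverge (one launches a different subtask): violation detected.
bad = check_control_deterministic([call_hash("launch", "add_one", 0),
                                   call_hash("launch", "mul_two", 0)])
```

Because only the hashes are exchanged, the comparison cost is independent of the size of the call's arguments, which is what makes hiding the check's latency behind unused bandwidth plausible.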
4 Dynamic Control Replication in Legion

We have implemented dynamic control replication in the Legion runtime [11]. We work through the execution of the sequential 1D stencil program in Figure 7 with DCR on a small machine with two nodes, each with one CPU and two GPUs. This program is written in Regent [42], a high-level programming language that compiles down to calls to the Legion runtime system. To explain how DCR works with this program, we need to introduce some Legion terminology to explain specifically how Legion expresses task groups.

Figure 7. A 1-D stencil program written in Regent [42]:

  1  fspace Cell {
  2    state : double,
  3    flux : double,
  4  }
  5
  6  task add_one(cells: region(ispace(int1d), Cell))
  7  where reads writes(cells.state) do
  8    for i in cells do
  9      cells[i].state = cells[i].state + 1
  10   end
  11 end
  12
  13 task mul_two(cells: region(ispace(int1d), Cell))
  14 where reads writes(cells.flux) do
  15   for i in cells do
  16     cells[i].flux = cells[i].flux * 2
  17   end
  18 end
  19
  20 task stencil(owned: region(ispace(int1d), Cell),
  21              ghost: region(ispace(int1d), Cell))
  22 where reads writes(owned.flux),
  23       reads(ghost.state) do
  24   for i in owned do
  25     owned[i].flux = owned[i].flux +
  26       0.5*(ghost[i-1].state + ghost[i+1].state)
  27   end
  28 end
  29
  30 __demand(__replicable)
  31 task main()
  32   -- Parse args ncells, ntiles=4, nsteps, init
  33   var grid = ispace(int1d, { x = ncells })
  34   var tiles = ispace(int1d, { x = ntiles })
  35   var cells = region(grid, Cell)
  36   -- Create partitions: owned, interior, ghost
  37   -- Code omitted for space, see Figure 8
  38   fill(cells.{state, flux}, init)
  39   for t = 0, nsteps do
  40     for i in tiles do
  41       add_one(owned[i])
  42     end
  43     for i in tiles do
  44       mul_two(interior[i])
  45     end
  46     for i in tiles do
  47       stencil(interior[i], ghost[i])
  48     end
  49   end
  50 end

Legion tasks operate on regions, which are tables created from a set of points called an index space (which in this case is 1D) that name the table's rows, and a set of typed fields naming the table's columns; see lines 1-3 (which define two field names), line 33 (which creates a 1D index space), and line 35 (which defines a region).

Regions can be partitioned into arbitrary subregions [49, 50]; partitions are used like arrays of subregions (lines 41, 44, and 47). Subregions can be further partitioned, and in general, programs construct region trees by recursively partitioning a root region. Figure 8 depicts the region tree used by the stencil program (the partitioning code is omitted from Figure 7 for space reasons but can be found online [6]). An important property of region trees is that any region in the tree is a superset of all the regions in its subtree. Thus, any set of regions can be over-approximated by a region r, or more generally a partition of r, that is their least common ancestor in the tree. We will make use of this property in our analysis of the dependences between task groups.¹ In this example three different partitions of the cells are created, representing the cells owned by each node, the cells in the interior of the domain within each owned subregion, and the owned regions plus ghost cells on either side.

The __demand(__replicable) annotation on line 30 indicates that main is a control deterministic task and therefore eligible for DCR. Most applications do not require further changes, except for minor tweaks to interact correctly with external side effects (see Section 4.3).

Regardless of whether this program is executed with DCR or not, the Regent compiler transforms inner loops of independent task launches into group task launches [42]. To enable each task in a group task launch to have different arguments, the task calls have the form t(p[f(i_j)]) where t is the task, i1, ...
, in is an index space, p is a partition containing the subregions accessed in the original loop, and f is a function that returns the subregion index of p for the j-th call of t.²

Figure 8. A region tree with three partitions to describe the different subsets of cells used by different tasks in the stencil example. Ranges of sub-regions are depicted visually at top. [Diagram: the root region cells partitioned into owned, interior, and ghost partitions, each with its own subregions.]

¹Note that this property of region trees is used in our implementation of DCR in Legion and is not a requirement of DCR in general.
²For region trees with multiple levels of partitioning, a more general form of this function can choose any subregion in the subtree of p [11].

Lines 40-42, 43-45, and 46-48 are converted to group task launches, which define task groups in the sense of Section 2. For example, the loop on lines 40-42 is replaced by the group task launch add_one(owned[id(·)]) over the tiles index space, where id is the identity function. While
DCR works correctly for individual tasks and operations (such as the fill to initialize data on line 38), the scalability of dependence analysis hinges on the efficiency of analyzing programs consisting primarily of group launches.

In our Legion implementation we do not attempt to decide automatically when to use DCR; instead we expose this decision in the Legion mapping interface [11], an API for application- and machine-specific policies that affect performance. A client of the mapping interface is a mapper. Our mapping interface extensions enable mappers to specify which task(s) to dynamically control replicate, the number of shards, and on which processors shards should execute. For the example program, the mapper (not shown) requests one shard of main to execute on a CPU of each node. We note that there is nothing that prevents the use of DCR from being automated by heuristics in the runtime system to decide when to use it; we have simply chosen to expose it through an API so users can decide when best to deploy DCR.

When a DCR task executes, Legion queries mappers to select a sharding function for each subtask launch. Sharding functions are pure functions that map individual tasks or tasks in a group launch to shards. Each shard is responsible for the dependence analysis of the tasks assigned to it. A good sharding function assigns tasks near where they will execute, while a poor choice may require significant movement of meta-data by the runtime. In the example, the cyclic sharding function (ID 0) round-robins tasks across the shards. Because sharding functions are pure, we can memoize their results, which reduces their dynamic invocation overhead.

The implementation described here is specific to Legion, but DCR is not. The essential requirements for any implicitly parallel programming system to support DCR are task groups, a way to indicate a task is control deterministic, and sharding functions. While our Legion implementation exposes the implementation of sharding functions, allowing customization for additional performance, this is not essential; a fixed sharding function could also be implemented within the runtime system.

4.1 Dependence Analysis

In a straightforward implementation of DCR, each shard would analyze all subtasks launched by the replicated task, which would be problematic because the number of subtasks is often proportional to the number of compute nodes and dependence analysis costs would scale with node count. To improve scalability, our implementation uses a hierarchical analysis that achieves O(log N) overhead in the average case (where N is the number of replicated shards), and appears to execute with O(1) cost if there is sufficient task parallelism.

In our approach, each shard implements the dependence analysis in Figure 9, which consists of a coarse stage and a fine stage.

Figure 9. Pseudo-code for the dependence analysis algorithm. Each stage operates independently and continuously on every shard for the lifetime of a DCR program. Tasks and other operations are inserted into the coarse queue in program order; ideally most are group tasks or group ops.

Coarse Stage Algorithm:
  1  while (!coarse_queue.empty()):
  2    task_or_op = coarse_queue.pop()
  3    deps = set()
  4    foreach upper bound partition P in task_or_op:
  5      foreach dep in find_dependences(task_or_op, P):
  6        if requires_shard_fence(dep):
  7          deps.add(upgrade_to_shard_fence(dep))
  8        else:
  9          deps.add(dep)
  10   dispatch_to_fine_queue_when_ready(task_or_op, deps)

Fine Stage Algorithm:
  1  while (!fine_queue.empty()):
  2    task_or_op = fine_queue.pop()
  3    local_points = sharding_func(task_or_op, shard_id)
  4    foreach point_task_or_op in local_points:
  5      events = set()
  6      foreach region in point_task_or_op:
  7        events.add(make_valid_region(region))
  8      launch(point_task_or_op, events)
  9    complete_stage1_dependences(task_or_op)

that t2 depends on t1, then there will be a coarse-stage dependence between groups G1 and G2. Importantly, these group-level dependences are discovered without enumerating the individual tasks within a group. Consider a task with one argument. Conceptually, we construct a single task t(r) that is representative of an entire task group tg by taking r to be an upper bound in the region tree of all the region arguments of tasks in tg. We then query the dependence oracle for any dependences between the task representative t and representatives of other task groups. In Figure 9, since the call to a task group already provides an upper bound partition that covers all possible region arguments in the task group, we use the partition and do not calculate another upper bound.
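The upper-bound construction can be illustrated on a toy region tree (our own sketch, loosely mirroring Figure 8; a real region tree carries much more structure than parent links, and Legion skips this computation when the launch already supplies a partition):

```python
# Toy region tree as child -> parent links: the root region `cells` is
# partitioned into owned, interior, and ghost, each with two subregions.
parent = {"owned0": "owned", "owned1": "owned",
          "interior0": "interior", "interior1": "interior",
          "ghost0": "ghost", "ghost1": "ghost",
          "owned": "cells", "interior": "cells", "ghost": "cells",
          "cells": None}

def ancestors(r):
    """A region together with all of its ancestors up to the root."""
    while r is not None:
        yield r
        r = parent[r]

def upper_bound(regions):
    """Least common ancestor: the deepest single region that is a
    superset of every region argument in a task group."""
    common = set(ancestors(regions[0]))
    for r in regions[1:]:
        common &= set(ancestors(r))
    return max(common, key=lambda r: len(list(ancestors(r))))
```

For example, a group whose arguments are the owned subregions is represented by `owned` itself, while a group mixing subregions of different partitions forces the bound up to the root `cells`, so the representative task over-approximates the whole group with a single region.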
For a task with multiple arguments, we repeat this construction for each argument. All shards perform dependence analysis for all task groups in the coarse stage, so every shard is aware of all dependences between task groups.

A coarse stage dependence from group G1 to G2 requires that the fine stage analysis of all individual tasks in G1 finishes before the fine stage analysis of any tasks in G2. As coarse stage dependences are satisfied, tasks and operations proceed to the fine stage where a precise dependence analysis is performed for the tasks owned by the shard.

Coarse stage dependences are trivially satisfied among tasks owned by the same shard, as shards analyze their tasks in program order. To handle cross-shard dependences, the coarse stage promotes dependences to cross-shard fences to enforce a partial order between the fine stage analyses on different shards (lines 6-7). Cross-shard fences behave similarly
The coarse stage analysis discovers dependences to scoped memory fences on parts of the region tree, pro- at the granularity of task groups; i.e., for two task groups viding ordering of fine stage analysis across shards for tasks 퐺1 and 퐺2, if there are any tasks 푡1 ∈ 퐺1 and 푡2 ∈ 퐺2 such and operations that access the fenced regions/partitions. Scaling Implicit Parallelism via Dynamic Control Replication PPoPP ’21, February 27-March 3, 2021, Virtual Event, Republic of Korea
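The sharding functions described above can be illustrated with a minimal Python sketch. This is not Legion's actual API (Legion's sharding functors are C++); the names cyclic_shard and owned_tasks are hypothetical, and the sketch only shows why purity makes memoization safe and how each shard selects the tasks it owns for fine-stage analysis.

```python
from functools import lru_cache

# Memoization is sound only because a sharding function is pure: the same
# (task point, shard count) always yields the same shard ID.
@lru_cache(maxsize=None)
def cyclic_shard(task_point: int, num_shards: int) -> int:
    """Round-robin tasks across shards (like sharding function ID 0 in the example)."""
    return task_point % num_shards

def owned_tasks(group_points, shard_id, num_shards):
    """The subset of a group launch that this shard analyzes in the fine stage."""
    return [p for p in group_points if cyclic_shard(p, num_shards) == shard_id]
```

For example, with two shards and a group launch over points 0-3, shard 0 owns tasks [0, 2] and shard 1 owns tasks [1, 3], matching the cyclic assignment used in the running example.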
Figure 10. Coarse stage dependence analysis computed by every shard for the program in Figure 7.

Figure 11. Changed dependence analysis (in red) from an alternate selection of sharding function for mul_two.

Any operation can begin the fine stage once all its incoming dependences from the coarse stage analysis have been satisfied (line 2). The fine stage analyzes individual tasks or operations (line 4). For each individual task or operation, it performs any data movement (line 7), gathers event preconditions (lines 5, 7), and dispatches execution to the lowest layer of Legion [48]. On line 3 we use our local shard ID to evaluate the sharding function for each task or operation to determine the set of individual tasks or operations owned by the local shard. Note that both stages of the analysis operate asynchronously with respect to each other, and we exploit pipeline parallelism to hide the latency of the on-the-fly analysis; different tasks and operations can be in both stages of the pipeline simultaneously.

We illustrate how the two-stage algorithm analyzes the program in Figure 7. Figure 10 shows the result of the dependence analysis for the beginning of the program. After the fill is issued to both fields of the region, the next operation analyzed is the add_one group task launch over the domain 0−3. This task uses the state field of the owned partition (see Figure 8) and therefore depends on the fill, so a dependence is added. Additionally, we insert a cross-shard fence on the cells region because the fill will be performed on shard 0, whereas the cyclic sharding function (ID 0) for add_one maps individual tasks 1 and 3, which depend on the fill, to shard 1. The second group task launch for mul_two has a similar dependence on the fill for the flux field and therefore inserts a fence on the cells region for the flux field.

Next to be analyzed is the stencil task group, which has dependences on the state field of the previous add_one task as well as on the flux field of the previous mul_two task. The coarse analysis requires a cross-shard fence for the first but not the second dependence. While the sharding functions are the same, the first dependence is between two different partitions of the same region: owned and ghost. We cannot easily discern the aliasing relationship between the sub-regions of these two partitions, so we must conservatively insert the fence to handle any cross-shard dependences. For the second dependence, we know all individual tasks on each shard access the same subregions of the same partition (interior) because they have the same sharding function. Therefore all the individual task dependences will be shard-local, and we can safely elide the cross-shard fence because the fine stage is guaranteed to analyze individual task dependences on local shards in program order. If the mapper picked a different sharding function for mul_two, as in Figure 11, we would need a cross-shard fence because there may be dependences between individual tasks owned by different shards. The tasks from later iterations are analyzed similarly.

Finally, for Legion we implement the dependence oracle for two task calls 푡1(푟1) and 푡2(푟2) by first checking whether regions 푟1 and 푟2 share any index points—if they are disjoint then the tasks are independent. If they share points, then we check whether they are accessing at least one field in common. If the tasks are accessing the same points of at least one field, we lastly check whether either task writes its region argument; if at least one is writing then a dependence is required. This procedure is the standard dynamic dependence analysis carried out by the Legion runtime [11] with no modifications required for DCR.

We make three observations. First, the coarse analysis stage does not require enumerating individual tasks for task groups. The cost of the coarse stage is independent of the size of the group launches, which are usually proportional to the size of the machine. This independence from machine size makes the coarse stage scalable.

Second, in the common case of data parallel operations, we can prove that all dependences are shard-local, and therefore the cross-shard fences can be elided, which avoids unnecessary synchronization (line 6 of the coarse stage in Figure 9). Currently, we do this proof symbolically by analyzing the index space of the group task launches, the regions or partitions that are upper bounds for task groups or operations, and the names of the projection and sharding functions.

Third, when cross-shard fences are required, we insert these on specific regions or partitions. This design is a middle ground between full analysis barriers that create all-to-all dependences between preceding and succeeding tasks, and direct shard-to-shard dependences that could be computed with an expensive inversion of the functions that determine which subregions are used by individual tasks.
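The three-step dependence oracle check described above can be sketched in a few lines of Python. This is a simplification for illustration: region arguments are modeled as explicit sets of index points and fields, whereas the real Legion analysis works on region-tree metadata without enumerating points; the function name depends is hypothetical.

```python
def depends(points1, fields1, writes1, points2, fields2, writes2) -> bool:
    """Dependence oracle for two task calls t1(r1) and t2(r2).

    Each region argument r is modeled as (index points, fields, writes?).
    """
    if not (points1 & points2):   # disjoint index points -> independent
        return False
    if not (fields1 & fields2):   # no field accessed in common -> independent
        return False
    return writes1 or writes2     # shared data and at least one writer -> dependence
```

For instance, two read-only tasks on the same region are independent, while a writer overlapping a reader on a common field yields a dependence.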
Our cross-shard fences are implemented using a collective (see Section 4.2) which has cost O(log 푁) in latency across the shards. Performing the coarse-grained analysis discovers coarse task parallelism that can be exploited during the fine-grained analysis stage to hide the cost of the fences. In practice, most applications we have observed have sufficient task parallelism to hide these latencies.

4.2 Collectives

Our DCR implementation uses a set of collective primitives for performing cooperative work between shards: broadcast from one shard to all shards, reduce from all shards to one shard, all-gather from all shards to all other shards, and all-reduce, which produces a single value for all shards. As an example, our implementation of the cross-shard dependence fences from Section 4.1 is performed with an all-gather collective with no data payload. The collectives are implemented using standard tree or butterfly communication networks with O(log 푁) latency.

4.3 Side Effects

Control replicated programs must interact correctly with the external world. We have focused on two aspects: file I/O and support for languages with garbage collectors. To support file I/O, we provide implementations of Legion's attach and detach operations [26] that execute correctly in dynamically control replicated contexts. Attach operations allow applications to associate external memory allocations (e.g., from an MPI program [45]) or files (e.g., HDF5 [46]) with existing regions in a program; detach operations flush any changes (e.g., write back any file updates to disk) and remove these associations. With dynamic control replication, we provide support for sharding attach and detach operations just like any other kind of operation. Normal files are read and written by a single owner shard; group variants of attach and detach provide support for parallel file I/O.

We have also added support for using DCR with languages that rely on garbage collectors (GC), such as Lua and Python. Finalizers [2] that make runtime API calls can violate control determinism because collections can occur at arbitrary times in each shard. In practice, we have seen finalizers perform detach operations as well as delete regions and fields (see Section 5.4). To handle such cases without violating control determinism, we provide an option to delay detach operations and deletions of Legion resources. The runtime periodically polls the shards to see if they have all observed the same delayed operation. The polling is done with exponential back-off so that it can quickly handle cases when GC is active, but incurs minimal overhead when GC is absent. Once the shards concur, the operation is inserted into the same location in the dependence analysis of each shard.

5 Evaluation

We evaluate DCR against a wide variety of applications drawn from HPC, data analytics, and machine learning, written in Legion without DCR, Regent [43], MPI [45], TensorFlow [7], and Dask+NumPy [39]. We show that compared to other implicitly parallel models (TensorFlow, Dask+NumPy) we achieve much better performance, and that for HPC applications our performance is often competitive with explicitly parallel programming models such as MPI. Using DCR with these applications required marking the top-level task as replicable and writing sharding functions, all of which were one or two lines of code (e.g., round-robin or tiled sharding of tasks). We used either one shard per node or one shard per GPU in all experiments. To underscore the portability of our approach, our experiments are run on a diverse collection of the world's top supercomputers, including Summit (#2), Sierra (#3), Piz-Daint (#10), and Lassen (#14) [5], among others, with a mix of heterogeneous processor types.

5.1 Benchmarks

We begin by benchmarking DCR against both Legion without control replication and the implementation of static control replication in Regent. Recall that static control replication, when it applies, has no runtime overhead because it is a compile-time program transformation. Furthermore, because Regent is also built on the Legion runtime, and its static control replication approach is known to perform well [43], this comparison provides a measure of the end-to-end overheads of our dynamic control replication implementation. Our first benchmark is an implicitly parallel 2D stencil code [6], similar to Figure 7, that requires the system to recognize a nearest-neighbors communication pattern in multiple dimensions. For this benchmark, Figure 12 gives both weak and strong scaling results using the GPUs on Piz-Daint. Without control replication, the overhead of a centralized controller node dominates once the cost of analyzing and distributing all tasks eclipses the execution time of tasks assigned to a single processor. DCR weak scales nearly as well as static control replication, with only a 2.5% slowdown at 512 nodes. For strong scaling on the chosen problem size, static and dynamic control replication scale to 128 and 64 nodes, respectively, before degrading. Unsurprisingly, DCR has higher runtime overheads than static control replication, but they are within a factor of two.

Our next benchmark is a circuit simulation that iteratively updates currents on wires and voltages on nodes in a graph of circuit components. The partitioning of the graph is done dynamically, so the communication pattern must also be established at runtime. Figure 13 gives weak and strong scaling results, both of which are significantly better with DCR

³There are approaches to scaling task-based programs based on using explicit communication and synchronization in a SPMD-like structure [12]. Our evaluation focuses on programs written purely using implicit tasking.
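The butterfly communication pattern mentioned in Section 4.2 can be illustrated with a minimal in-process simulation. This is an assumption-laden sketch, not the Legion implementation (which exchanges messages between distributed shards): shards are simulated as list slots, the function name butterfly_allreduce is hypothetical, and the shard count is assumed to be a power of two.

```python
import math

def butterfly_allreduce(values, op=lambda a, b: a + b):
    """Simulate a recursive-doubling (butterfly) all-reduce over N shards.

    In round k, each shard exchanges with the partner at distance 2^k,
    so after O(log N) rounds every shard holds the fully reduced value.
    """
    n = len(values)                     # number of shards
    assert n > 0 and n & (n - 1) == 0,  "sketch assumes a power-of-two shard count"
    vals = list(values)
    rounds = int(math.log2(n)) if n > 1 else 0
    for k in range(rounds):
        dist = 1 << k
        nxt = list(vals)
        for shard in range(n):
            partner = shard ^ dist      # butterfly partner for this round
            nxt[shard] = op(vals[shard], vals[partner])
        vals = nxt
    return vals                         # identical reduced value on every shard
```

With four shards holding [1, 2, 3, 4], two rounds suffice and every shard ends with 10; a cross-shard fence corresponds to running such a collective with no data payload, purely for its synchronization effect.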
Figure 12. Scaling of 2D Stencil Benchmark. (a) Weak Scaling and (b) Strong Scaling of throughput from 1 to 512 nodes, comparing No Control Replication, Static Control Replication, and Dynamic Control Replication.

Figure 14. Weak scaling of Pennant compared to MPI. Throughput (iterations/s) from 1 to 32 DGX-1V nodes (8 to 256 GPUs), comparing MPI CPU-only, MPI+CUDA, MPI+CUDA+GPUDirect, Legion No Control Replication, and Legion Dynamic Control Replication.

[Figure legend fragment: TensorFlow; FlexFlow (No Control Replication); FlexFlow (Dynamic Control Replication).]