Control Replication: Compiling Implicit Parallelism to Efficient SPMD with Logical Regions
Elliott Slaughter (Stanford University, SLAC National Accelerator Laboratory), Wonchan Lee (Stanford University), Sean Treichler (Stanford University, NVIDIA), Wen Zhang (Stanford University), Michael Bauer (NVIDIA), Galen Shipman (Los Alamos National Laboratory), Patrick McCormick (Los Alamos National Laboratory), Alex Aiken (Stanford University)

SC17, November 2017, Denver, CO, USA. https://doi.org/10.1145/3126908.3126949

ABSTRACT
We present control replication, a technique for generating high-performance and scalable SPMD code from implicitly parallel programs. In contrast to traditional parallel programming models that require the programmer to explicitly manage threads and the communication and synchronization between them, implicitly parallel programs have sequential execution semantics and by their nature avoid the pitfalls of explicitly parallel programming. However, without optimizations to distribute control overhead, scalability is often poor. Performance on distributed-memory machines is especially sensitive to communication and synchronization in the program, and thus optimizations for these machines require an intimate understanding of a program's memory accesses. Control replication achieves particularly effective and predictable results by leveraging language support for first-class data partitioning in the source programming model. We evaluate an implementation of control replication for Regent and show that it achieves up to 99% parallel efficiency at 1024 nodes with absolute performance comparable to hand-written MPI(+X) codes.

CCS CONCEPTS
• Software and its engineering → Parallel programming languages; Compilers;

KEYWORDS
Control replication; Regent; Legion; regions; task-based runtimes

Figure 1: Comparison of implicit and explicit parallelism.

(a) Original program:
    for t = 0, T do
      for i = 0, N do -- Parallel
        B[i] = F(A[i])
      end
      for j = 0, N do -- Parallel
        A[j] = G(B[h(j)])
      end
    end

(b) Transposed program:
    for i = 0, N do -- Parallel
      for t = 0, T do
        B[i] = F(A[i])
        -- Synchronization needed
        A[i] = G(B[h(i)])
      end
    end

(c) Implicitly parallel execution of original program (diagram: per time step, F(A[0]) ... F(A[N-1]) run in parallel, then G(B[h(0)]) ... G(B[h(N-1)])).

(d) SPMD execution of transposed program (diagram: N long-lived workers, each alternating F(A[i]) and G(B[h(i)]) across time steps).

1 INTRODUCTION
Programs with sequential semantics are easy to read, understand, debug, and maintain, compared to explicitly parallel codes. In certain cases, sequential programs also lend themselves naturally to parallel execution. Consider the code in Figure 1a. Assuming there are no loop-carried dependencies, the iterations of each of the two inner loops can be executed in parallel on multiple processors in a straightforward fork-join style.
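As a concrete (if toy) illustration of this execution model, the following Python sketch runs the two inner loops of Figure 1a in fork-join style. It is not from the paper: F, G, h, and the problem sizes N and T are placeholders invented here.

    # Fork-join sketch of Figure 1a (illustrative only).
    from concurrent.futures import ThreadPoolExecutor

    N, T = 8, 4
    A = [float(i) for i in range(N)]
    B = [0.0] * N

    def F(x): return x + 1.0          # placeholder worker for the first loop
    def G(x): return 2.0 * x          # placeholder worker for the second loop
    def h(i): return (i + 1) % N      # arbitrary, data-dependent access pattern

    with ThreadPoolExecutor(max_workers=N) as pool:
        for t in range(T):
            # Fork N workers for the first loop, then join before continuing.
            B = list(pool.map(lambda i: F(A[i]), range(N)))
            # Fork N workers for the second loop; control returns to the
            # main thread at the end of each loop.
            A = list(pool.map(lambda j: G(B[h(j)]), range(N)))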
As illustrated in Figure 1c, a main thread launches a number of worker threads for the first loop, each of which executes one (or more) loop iterations. There is a synchronization point at the end of the loop where control returns to the main thread; the second loop is executed similarly. Because the second loop can have a completely different data access pattern than the first (indicated by the arbitrary function h in B[h(j)]), complex algorithms can be expressed. With considerable variation, this implicitly parallel style is the basis of many parallel programming models. Classic parallelizing compilers [10, 25, 29], HPF [30, 36] and the data parallel subset of Chapel [16] are canonical examples; other examples include MapReduce [18] and Spark [48] and task-based models such as Sequoia [23], Legion [6], StarPU [4], and PaRSEC [13].

In practice, programmers don't write highly scalable, high-performance codes in the implicitly parallel style of Figure 1a. Instead, they write the program in Figure 1b. Here the launching of a set of worker threads happens once, at program start, and the workers run until the end of the computation. We can see in Figures 1c and 1d that conceptually the correspondence between the programs is simple. Where Figure 1a launches N workers in the first loop and then N workers in the second loop, Figure 1b launches N long-lived threads that act as workers across iterations of the inner loops of the program. This single program multiple data, or SPMD, programming style is the basis of MPI [42], UPC [2] and Titanium [1], and also forms a useful subset of Chapel, X10 [17] and Legion.

While Figure 1a and Figure 1b are functionally equivalent, they have very different scalability and programmer productivity properties. Figure 1b is much more scalable, and not just by a constant factor. To see this, consider what happens in Figure 1a as the number of workers N (the "height" of the execution graph in Figure 1c) increases. Under weak scaling, the time to execute each worker task (e.g., F(A[i]) in the first loop) remains constant, but the main control thread does O(N) work to launch N workers. (In some cases, a broadcast tree can be used to reduce this overhead to O(log N).) Thus, for some N, the runtime overhead of launching workers exceeds the individual worker's execution time and the program ceases to scale. While the exact scalability in practice always depends on how long-running the parallel worker tasks are, our experience is that many implicitly parallel programs don't scale beyond 10 to 100 nodes when task granularities are on the order of milliseconds to tens of milliseconds. In contrast, the SPMD program in Figure 1b, while it still must launch N workers, does so only once; in particular, the launch overhead is not incurred in every time step (the T loop). Programs written in SPMD style can scale to thousands or tens of thousands of nodes.

On the other hand, the implicitly parallel program in Figure 1a is much easier to write and maintain than the program in Figure 1b. While it is not possible to give precise measurements, it is clear that the difference in productivity is large: in our experience an [...]

[...] pieces of the first loop are communicated to the threads that will read those values in the distributed pieces of the second loop. In most SPMD models (and specifically in MPI) this data movement must be explicitly written and optimized by the programmer. The synchronization and the data movement are by far the most difficult and time-consuming parts of SPMD programs to get right, and these are exactly the parts that are not required in implicitly parallel programs.
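To make that burden concrete, here is a minimal Python sketch of the SPMD structure of Figure 1b, using the same invented placeholders as above. Each long-lived worker owns one index, and barriers stand in for the synchronization that the programmer must supply by hand; this sketch is an assumption-laden illustration, not the paper's code.

    # SPMD sketch of Figure 1b (illustrative only).
    import threading

    N, T = 8, 4
    A = [float(i) for i in range(N)]
    B = [0.0] * N

    def F(x): return x + 1.0
    def G(x): return 2.0 * x
    def h(i): return (i + 1) % N

    barrier = threading.Barrier(N)

    def worker(i):
        # One long-lived worker per index, launched once for the whole run.
        for t in range(T):
            B[i] = F(A[i])
            barrier.wait()   # B[h(i)] may have been produced by another worker
            A[i] = G(B[h(i)])
            barrier.wait()   # stay in the same time step before B is overwritten

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

In shared memory a barrier suffices; in an MPI-style distributed setting the same program point would additionally require explicit sends and receives of the B values selected by h.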
This paper presents control replication, a technique for generating high-performance and scalable SPMD code from implicitly parallel programs with sequential semantics. The goal of control replication is to both "have our cake and eat it": to allow programmers to write in the productive implicitly parallel style and to use a combination of compiler optimizations and runtime analysis to automatically produce the scalable (but much more complex) SPMD equivalent. Control replication works by generating a set of shards, or long-running tasks, from the control flow of the original program, to amortize overhead and enable efficient execution on large numbers of processors. Intuitively, the control thread of an implicitly parallel program is replicated across the shards, with each shard maintaining enough state to mimic the decisions of the original control thread. An important feature of control replication is that it is a local transformation, applying to a single collection of loops. Thus, it need not be applied only at the top level, and can in fact be applied independently to different parts of a program and at multiple different scales of nested parallelism.

As suggested above, the heart of the control replication transformation depends on the ability to analyze the implicitly parallel program with sufficient precision to generate the needed synchronization and data movement between shards. Similar analyses are known to be very difficult in traditional programming languages. Past approaches that have attempted optimizations with comparable goals to control replication have relied either on very sophisticated, complex, and therefore unpredictable static analysis (e.g., HPF) or much more heavily on dynamic analysis with its associated run-time overheads (e.g., inspector-executor systems [35]).

A key aspect of our work is that we leverage recent advances in parallel programming model design that greatly simplify the static analysis component of control replication and make it reliable and predictable. Many parallel programming models allow programmers to specify a partition of the data, to name different subsets of the data on which parallel computations will be carried out. Recent proposals allow programmers to define and use multiple partitions of the same data [6, 11]. For example, returning to our abstract example in Figure 1a, one loop may be accessing a matrix partitioned by columns while the other loop accesses the same matrix partitioned by rows.
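As a rough illustration of that idea, independent of any particular language, the following NumPy sketch (invented here; it is not the Regent or Legion partitioning API) builds two partitions of the same matrix, one by row blocks and one by column blocks, so that different loops can each be distributed along the partition matching their access pattern.

    # Two partitions of the same data (illustrative only).
    import numpy as np

    M = np.arange(16.0).reshape(4, 4)   # one matrix, viewed through two partitions
    blocks = 2
    rows = M.shape[0] // blocks
    cols = M.shape[1] // blocks

    # Partition by rows and, independently, by columns; each entry is a view
    # into the same underlying storage.
    row_part = [M[i * rows:(i + 1) * rows, :] for i in range(blocks)]
    col_part = [M[:, j * cols:(j + 1) * cols] for j in range(blocks)]

    # One loop updates the data through the row partition...
    for block in row_part:
        block += 1.0
    # ...while another reads it through the column partition. Because the two
    # partitions alias the same elements, a compiler or runtime must reason
    # about this aliasing to place synchronization and communication correctly.
    col_totals = [block.sum() for block in col_part]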