Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures

Tal Ben-Nun, Johannes de Fine Licht, Alexandros N. Ziogas, Timo Schneider, Torsten Hoefler Department of Computer Science, ETH Zurich, Switzerland {talbn,definelicht,alziogas,timos,htor}@inf.ethz.ch ABSTRACT Section 2 Sections 3-4 Sections 5-6 Domain Scientist Performance Engineer System The ubiquity of accelerators in high-performance computing has Problem Formulation Hardware driven programming complexity beyond the skill-set of the average Information = 0 domain scientist. To maintain performance portability in the fu- ๐œ•๐œ•๐œ•๐œ• 2 โˆ’ ๐›ผ๐›ผ๐›ป๐›ป ๐‘ข๐‘ข Transformed Python๐œ•๐œ•๐œ•๐œ• / Dataflow SDFG Compiler ture, it is imperative to decouple architecture-specific programming DSLs numpy paradigms from the underlying scientific computations. We present Data-Centric Intermediate TensorFlow MATLAB Representation (SDFG, ยง3) the Stateful DataFlow multiGraph (SDFG), a data-centric intermedi- CPU Binary Performance * GPU Binary ate representation that enables separating program definition from * Results Runtime SDFG Builder API * its optimization. By combining fine-grained data dependencies with ๐‘ณ๐‘ณ ๐‘น๐‘น * FPGA Modules * * high-level control-flow, SDFGs are both expressive and amenable High-Level Program Graph Transformations Thin Runtime to program transformations, such as tiling and double-buffering. (API, Interactive, ยง4) Infrastructure These transformations are applied to the SDFG in an interactive Figure 1: Proposed Development Scheme , using extensible pattern matching, graph rewriting, and a graphical user interface. We demonstrate SDFGs on CPUs, GPUs, utilization. Programmers would now not only worry about com- and FPGAs over various motifs โ€” from fundamental computational munication (fortunately, the MPI specification grew by less than kernels to graph analytics. We show that SDFGs deliver competitive 4x from MPI-1.0 to 3.1) but also about the much more complex performance, allowing domain scientists to develop applications on-node heterogeneous programming. The sheer number of new naturally and port them to approach peak hardware performance approaches, such as OpenACC, OpenCL, or CUDA demonstrate the without modifying the original scientific code. difficult situation in on-node programming. This increasing com- plexity makes it nearly impossible for domain scientists to write CCS CONCEPTS portable and performant code today. The growing complexity in performance programming led to a โ€ข Software and its engineering โ†’ Parallel programming lan- specialization of roles into domain scientists and performance engi- guages; Data flow languages; Just-in-time compilers; โ€ข Human- neers. Performance engineers typically optimize codes by moving centered computing โ†’ Interactive systems and tools. functionality to performance libraries such as BLAS or LAPACK. If ACM Reference Format: this is insufficient, they translate the user-code to optimized ver- Tal Ben-Nun, Johannes de Fine Licht, Alexandros N. Ziogas, Timo Schneider, sions, often in different languages such as assembly code, CUDA, Torsten Hoefler. 2019. Stateful Dataflow Multigraphs: A Data-Centric Model or tuned OpenCL. Both libraries and manual tuning reduce code for Performance Portability on Heterogeneous Architectures. 
In The Inter- maintainability, because the optimized versions are not only hard national Conference for High Performance Computing, Networking, Storage, to understand for the original author (the domain scientist) but also and Analysis (SC โ€™19), November 17โ€“22, 2019, Denver, CO, USA. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3295500.3356173 cannot be changed without major effort. Code annotations as used by OpenMP or OpenACC do not 1 MOTIVATION change the original code that then remains understandable to the domain programmer. However, the annotations must re-state (or HPC programmers have long sacrificed ease of programming and modify) some of the semantics of the annotated code (e.g., data place- portability for achieving better performance. This mindset was ment or reduction operators). This means that a (domain scientist) established at a time when computer nodes had a single proces- programmer who modifies the code, must modify some annota- sor/core and were programmed with C/Fortran and MPI. The last tions or she may introduce hard-to-find bugs. With heterogeneous decade, witnessing the end of Dennard scaling and Mooreโ€™s law, target devices, it now becomes common that the complexity of brought a flurry of new technologies into the compute nodes. Those annotations is higher than the code they describe [56]. Thus, scien- range from simple multi-core and manycore CPUs to heterogeneous tific programmers can barely manage the complexity of thecode GPUs and specialized FPGAs. To support those architectures, the targeted at heterogeneous devices. complexity of OpenMPโ€™s specification grew by more than an or- The main focus of the community thus moved from der of magnitude from 63 pages in OpenMP 1.0 to 666 pages in to performance portability as a major research target [69]. We call OpenMP 5.0. This one example illustrates how (performance) pro- a code-base performance-portable if the domain scientistโ€™s view gramming complexity shifted from network scalability to node (โ€œwhat is computedโ€) does not change while the code is optimized to SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA different target architectures, achieving consistently high performance. 2019. ACM ISBN 978-1-4503-6229-0/19/11...$15.00 The execution should be approximately as performant (e.g., attaining https://doi.org/10.1145/3295500.3356173 SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA Ben-Nun et al. similar ratio of peak performance) as the best-known implementation @dace . program def Laplace (A: dace .float64[2,N], A or theoretical best performance on the target architecture [67]. As T: dace . uint32 ): A[t%2, 0:N] [i = 1:N-1] for t in range (T): A[t%2, i-1] A[t%2, i+1] discussed before, hardly any existing programming model that A[t%2,i] for i in dace .map [1:N -1]: t < T; t++ Laplace supports portability to different accelerators satisfies this definition. A[(t+1)%2, i] = \ A[(t+1)%2, i] A[t%2, i-1:i+2] * [1,-2,1] Our Data-centric Parallel Programming (DAPP) concept ad- [i = 1:N-1] A[(t+1)%2, 1:N-1] dresses performance portability. It uses a data-centric viewpoint a = numpy.random.rand(2, 2033) Laplace(A=a, T=500) A of an application to separate the roles of domain scientist and per- t T formance programmer, as shown in Fig. 1. 
DAPP relies on Stateful โ‰ฅ DataFlow multiGraphs (SDFGs) to represent code semantics and (a) Python Representation (b) Resulting SDFG transformations, and supports modifying them to tune for particular Figure 2: Data-Centric Computation of a Laplace Operator target architectures. It bases on the observation that data-movement 2 DATA-CENTRIC PROGRAMMING dominates time and energy in todayโ€™s computing systems [66] and pioneers the necessary fundamental change of view in parallel Current approaches in high-performance computing optimizations programming. As such, it builds on ideas of data-centric mappers [66] revolve around improving data locality. Regardless of the un- and schedule annotations such as Legion [9] and Halide [58] and derlying architecture, the objective is to keep information as close extends them with a multi-level visualization of data movement, code as possible to the processing elements and promote memory reuse. transformation and compilation for heterogeneous targets, and strict Even a simple application, such as matrix multiplication, requires separation of concerns for programming roles. The domain program- multiple stages of transformations, including data layout modifica- mer thus works in a convenient and well-known language such as tions (packing) and register-aware caching [33]. Because optimiza- (restricted) Python or MATLAB. The compiler transforms the code tions do not modify computations and differ for each architecture, into an SDFG, on which the performance engineer solely works maintaining performance portability of scientific applications re- on, specifying transformations that match certain data-flow struc- quires separating computational semantics from data movement. tures on all levels (from registers to inter-node communication) SDFGs enable separating application development into two and modify them. Our transformation language can implement stages, as shown in Fig. 2. The problem is formulated as a high-level arbitrary changes to the SDFG and supports creating libraries of program (Fig. 2a), and is then transformed into a human-readable transformations to optimize workflows. Thus, SDFGs separate the SDFG as an Intermediate Representation (IR, Fig. 2b). The SDFG concerns of the domain scientist and the performance engineers can then be modified without changing the original code, and as through a clearly defined interface, enabling highest productivity long as the dataflow aspects do not change, the original code can of both roles. be updated while keeping SDFG transformations intact. What dif- We provide a full implementation of this concept in our Data- ferentiates the SDFG from other IRs is the ability to hierarchically Centric (DaCe) programming environment, which supports (lim- and parametrically view data movement, where scopes in the graph ited) Python, MATLAB, and TensorFlow as frontends, as well as contain overall data requirements. This enables reusing transfor- support for selected DSLs. DaCe is easily extensible to other fron- mations (e.g., tiling) at different levels of the memory hierarchy, as tends through an SDFG builder interface. Performance engineers well as performing cross-level optimizations. develop potentially domain-specific transformation libraries (e.g., The modifications to the SDFG are not completely automatic. for stencil-patterns) and can tune them through DaCeโ€™s Interactive Rather, they are made by the performance engineer as a result Optimization Environment (DIODE). 
The current implementation of informed decisions based on the program structure, hardware in- focuses on on-node parallelism as the most challenging problem formation, and intermediate performance results. To support this, a in scientific computing today. However, it is conceivable that the transformation interface and common optimization libraries should principles can be extended beyond node-boundaries to support be at the performance engineerโ€™s disposal, enabling modification large-scale parallelism using MPI as a backend. of the IR in a verifiable manner (i.e., without breaking semantics), The key contributions of our work are as follows: either programmatically or interactively. The domain scientist, in turn, writes an entire application once for all architectures, and โ€ข We introduce the principle of Data-centric Parallel Program- can freely update the underlying calculations without undoing ming, in which we use Stateful Dataflow Multigraphs, a data- optimizations on the SDFG. centric Intermediate Representation that enables separating Conceptually, we perform the separation of computation from code definition from its optimization. data movement logic by viewing programs as data flowing between โ€ข We provide an open-source implementation1 of the data- operations, much like Dataflow and Flow-Based Programming [40]. centric environment and its performance-optimization IDE. One key difference between dataflow and data-centric parallel pro- โ€ข We demonstrate performance portability on fundamental gramming, however, is that in a pure dataflow model execution is kernels, graph algorithms, and a real-world quantum trans- stateless, which means that constructs such as loops have to be un- port simulator โ€” results are competitive with and faster than rolled. At the other extreme, traditional, control-centric programs expert-tuned libraries from Intel and NVIDIA, approaching revolve around statements that are executed in order. Data-centric peak hardware performance, and up to five orders of magni- parallel programming promotes the use of stateful dataflow, in tude faster than naรฏve FPGA code written with High-Level which execution order depends first on data dependencies, but also Synthesis, all from the same program source code. on a global execution state. The former fosters the expression of 1https://www.github.com/spcl/dace Stateful Dataflow Multigraphs SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA concurrency, whereas the latter increases expressiveness and com- var << A(1, WCR)[0:N] pactness by enabling concepts such as loops and data-dependent Local variable name Data Conflict Resolution Location/Range execution. The resulting concurrency works in several granularities, Direction (<<, >>) Number of accesses from utilizing processing elements on the same chip, to ensuring Figure 3: Anatomy of a Python Memlet overlapped copy and execution of programs on accelerators in access data unless it was explicitly moved in or out using pre- clusters. A data-centric model combines the following concepts: declared operators (<<, >>) on arrays, as shown in the code. (1) Separating Containers from Computation: Data- Data movement operations (memlets) can be versatile, and the holding constructs with volatile or non-volatile information Python syntax of explicit memlets is defined using the syntax shown are defined as separate entities from computations, which in Fig. 3. 
First, a local variable (i.e., that can be used in computation) consist of stateless functional units that perform arithmetic is defined, whether it is an input or an output. After the direction or logical operations in any granularity. of the movement, the data container is specified, along with an (2) Dataflow: The concept of information moving from one optional range (or index). In some applications (e.g., with indirect container or computation to another. This may be translated or data-dependent access), it is a common occurrence that the subset to copying, communication, or other forms of movement. of the accessed data is known, but not exact indices; specifying (3) States: Constructs that provide a mechanism to introduce memory access constraints both enables this behavior and facilitates execution order independent of data movement. access tracking for decomposition, e.g., which data to send to an (4) Coarsening: The ability to view parallel patterns in a hier- accelerator. Finally, the two optional values in parentheses govern archical manner, e.g., by grouping repeating computations. the nature of the access โ€” the number of data elements moved, The resulting programming interface should thus enable these used for performance modeling, and a lambda function that is concepts without drastically modifying development process, both called when write-conflicts may occur. in terms of languages and integration with existing codebases. @dace . program def spmv(A_row: dace .uint32[H + 1], A_col: dace .uint32[nnz], A_val : dace . float32 [nnz], x: dace . float32 [W], b: dace . float32 [H]): 2.1 Domain Scientist Interface for i in dace .map [0:H]: for j in dace .map[A_row[i]:A_row[i+1]]: Languages Scientific applications typically employ different pro- with dace . tasklet : gramming models and Domain-Specific Languages (DSLs) to solve a << A_val[j] in_x << x[A_col[j]] problems. To cater to the versatile needs of the domain scientists, out >> b(1, dace .sum )[i] SDFGs should be easily generated from various languages. We thus out = a * in_x implement SDFG frontends in high-level languages (Python, MAT- Figure 4: Sparse Matrix-Vector Mult. with Memlets LAB, TensorFlow), and provide a low-level (builder) API to easily Using explicit dataflow is beneficial when defining nontrivial map other DSLs to SDFGs. In the rest of this section, we focus on data accesses. Fig. 4 depicts a full implementation of Sparse Matrix- the Python [30] interface, which is the most extensible. Vector multiplication (SpMV). In the implementation, the access Interface The Python interface creates SDFGs from restricted x[A_col[j]] is translated into an indirect access subgraph (see Python code, supporting numpy operators and functions, as well as Appendix F) that can be identified and used in transformations. the option to explicitly specify dataflow. In Fig. 2a, we demonstrate External Code Supporting scientific code, in terms of performance the data-centric interface on a one-dimensional Laplace operator. and productivity, requires the ability to call previously-defined DaCe programs exist as decorated, strongly-typed functions in functions or invoke custom code (e.g., intrinsics or assembly). In the application ecosystem, so that they can interact with existing addition to falling back to Python, the frontend enables defining codes using array-based interfaces (bottom of figure). The Python tasklet code in the generated code language directly. In Fig. 
5 we see interface contains primitives such as map and reduce (which trans- a DaCe program that calls a BLAS function directly. The semantics late directly into SDFG components), allows programmers to use of such tasklets require that memlets are defined separately (for multi-dimensional arrays, and implements an extensible subset of correctness); the code can in turn interact with the memory directly operators from numpy [25] on such arrays to ease the use of lin- (memlets that are larger than one element are pointers). With this ear algebra operators. For instance, the code A@B generates the feature, users can use existing codes and benefit from concurrent dataflow of a matrix multiplication. scheduling that the SDFG provides. Extensibility For operators and functions that are not imple- Parametric Dimensions To support parametric sizes (e.g., of ar- mented, a user can easily provide dataflow implementations using rays and maps) in DaCe, we utilize symbolic math evaluation. In decorated functions (@dace.replaces(โ€™numpy.conjโ€™)) that de- scribe the SDFG. Otherwise, unimplemented functions fall-back @dace . program def extmm (A: dace .complex128[M,K], B: dace .complex128[K,N], into Python, casting the array pointers (which may be defined in- C: dace .complex128[M,N]): ternally in the DaCe program) into numpy arrays and emitting a with dace .tasklet(language=dace .Language.CPP, code_global='#include '): โ€œpotential slowdownโ€ warning. If the syntax is unsupported (e.g., a << A; b << B; in_c << C; out_c >> C dynamic dictionaries), an error is raised. ''' dace::complex128 alpha(1, 0), beta(0, 0); Explicit Dataflow If the programmer does not use predefined cblas_zgemm(CblasRowMajor, 'N ', 'N ',M,N,K,&alpha,a,M, b,K,&beta, out_c,M); operators (e.g., for custom element-wise computation), dataflow ''' โ€œintrinsicsโ€ can be explicitly defined separately from code, in con- structs which we call Tasklets. Specifically, tasklet functions cannot Figure 5: External Code in DaCe SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA Ben-Nun et al.

Table 1: SDFG Syntax A B A B Primitive Description A[0:N] B[0:N] A[0] B[0] A[2] B[2] [i=0:N] A[1] B[1] Data-Centric Model A[i] B[i] a b a b a b a b Data: N-dimensional array c = a + b c = a + b c = a + b c = a + b c c c c Data Transient Data container. C[i] [i=0:N] C[0] C[1] C[2] C[0:N] Stream: Streaming data con- C Stream tainer. C Tasklet: Fine-grained compu- (a) Parametric (b) Expanded (N = 3) Tasklet tational block. Figure 6: Parametric Parallelism in SDFGs Memlet: Data movement de- A(1) [0:M,k] scriptor. containers either generate appropriate memory copy operations or fail with illegal accesses (for instance, when trying to access paged CPU memory within a GPU kernel). In FPGAs, Stream nodes instan- iter < N s0 s1 State: State machine element. tiate FIFO interfaces that can be used to connect hardware modules. Another property of containers is whether they are transient, i.e., Parametric Concurrency only allocated for the duration of SDFG execution. This allows Map: Parametric graph ab- [i=0:M, j=0:N] [i=0:M, j=0:N] transformations and performance engineers to distinguish between straction for parallelism. buffers that interact with external systems, and ones that canbe Consume: Dynamic mapping manipulated (e.g., data layout) or eliminated entirely, even across [p=0:P], s>0 [p=0:P], s>0 of computations on streams. devices. This feature is advantageous, as standard compilers cannot Write-Conflict Resolution: make this distinction, especially in the presence of accelerators. C [i,j] (CR: Sum) Defines behavior during con- flicting writes. Parallel Primitives and Nesting 3.2 Computation sum, id: 0 Reduce: Reduction over one Tasklet nodes contain stateless, arbitrary computational functions or more axes. of any granularity. The SDFG is designed, however, for fine-grained tasklets, so as to enable performance engineers to analyze and op- Invoke Invoke: Call a nested SDFG. timize the most out of the code, leaving computational semantics intact. Throughout the process of data-centric transformations and particular, we extend the SymPy [64] library to support our expres- compilation, the tasklet code remains immutable. This code, pro- sions and strong typing. The code can thus define symbolic sizes vided that it cannot access external memory without memlets, can and use complex memlet subset expressions, which will be analyzed be written in any source language that can compile to the target during SDFG compilation. The separation of access and computa- platform, and is implemented in Python by default. tion, flexible interface, and symbolic sizes are the core enablers of In order to support Python as a high-level language for tasklets, data-centric parallel programming, helping domain scientists create we implement a Python-to-C++ converter. The converter traverses programs that are amenable to efficient hardware mapping. the Python Abstract Syntax Tree (AST), performs type and shape inference, tracks local variables for definitions, and uses features 3 STATEFUL DATAFLOW MULTIGRAPHS from C++14 (such as lambda expressions and std::tuples) to cre- We define an SDFG asa directed graph of directed acyclic multigraphs, ate the corresponding code. Features that are not supported include whose components are summarized in Table 1. Briefly, the SDFG is dictionaries, dynamically-sized lists, exceptions, and other Python composed of acyclic dataflow multigraphs, in which nodes repre- high-level constructs. 
Given that tasklets are meant to be fine- sent containers or computation, and edges (memlets) represent data grained, and that our DaCe interface is strongly typed (ยง 2.1), this movement. To support cyclic data dependencies and control-flow, feature-set is sufficient for HPC kernels and real-world applications. these multigraphs reside in State nodes at the top-level graph. Fol- lowing complete execution of the dataflow in a state, state transition 3.3 Concurrency edges on the top-level graph specify conditions and assignments, Expressing parallelism is inherent in SDFGs by design, supported forming a state machine. For complete operational semantics of by the Map and Consume scopes. Extending the traditional task- SDFGs, we refer to Appendix A. based model, SDFGs expose concurrency by grouping parallel sub- graphs (computations, local data, movement) into one symbolic 3.1 Containers instance, enclosed within two โ€œscopeโ€ nodes. Formally, we define an As a data-centric model, SDFGs offer two forms of data containers: enclosed subgraph as nodes dominated by a scope entry node and Data and Stream nodes. Data nodes represent a location in mem- post-dominated by an exit node. The subgraphs are thus connected ory that is mapped to a multi-dimensional array, whereas Stream to external data only through scope nodes, which enables analysis nodes are defined as multi-dimensional arrays of concurrent queues, of their overall data requirements (useful, e.g., for automatically which can be accessed using push/pop semantics. Containers are transforming a map to GPU code). tied to a specific storage location (as a node property), which may Map scopes represent parallel computation on all levels, and can be on a GPU or even a file. In the generated code, memlets between be nested hierarchically. This feature consolidates many parallel Stateful Dataflow Multigraphs SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA

[p = 0:P] DRAM A A B A[0:M,0:K] B[0:K,0:N] pipes Read Write A[0:N] [i=0:M, j=0:N, k=0:K] pipes(T*N*M)[p] [i=0:N] A[i,k] B[k,j] PE ComputationJacobi PE0 PEP-1 A[i] mult pipes(T*N*M)[p+1] query tmp[i,j,k] size (CR: Sum) S [i=0:M, j=0:N, k=0:K] pipes PE1 ... tmp[0:M,0:N,0:K] [i=0:N] size (CR: Sum) S tmp [p = 0:P] PE2 PE4 tmp[0:M,0:N,0:K] S [axis: 2, Sum] (a) SDFG (b) Hardware Modules out[0:N] C[0:M,0:N] Figure 7: Parametric Generation of Systolic Arrays size out C programming concepts, including multi-threading, GPU kernels, (a) Query (b) Matrix Multiplication multi-GPU synchronization, and multiple processing elements on Figure 9: Write-Conflicts and Reductions FPGAs. The semantics of a Map are illustrated in Fig. 6 โ€” a sym- such memlets are visually highlighted for the performance engi- bolic integer set attribute of the scope entry/exit nodes called range neer using dashed lines. Implementation-wise, such memlets can be (Fig. 6a) defines how the subgraph should be expanded oneval- implemented as atomic operations, critical sections, or accumulator uation (Fig. 6b). Like containers, Maps are tied to schedules that modules, depending on the target architecture and the function. determine how they translate to code. When mapped to multi- Reduce nodes complement conflict resolution by implementing core CPUs, Map scopes generate OpenMP parallel for loops; target-optimized reduction procedures on data nodes. An example for GPUs, device schedules generate CUDA kernels (with the map can be seen with a map-reduce implementation of matrix multipli- range as -block indices), whereas thread-block schedules de- cation (Fig. 9b), where a tensor with multiplied pairs of the input termine the dimensions of blocks, emitting synchronization calls matrices is reduced to the resulting matrix. As we shall show in the (__syncthreads) as necessary; for FPGAs, Maps synthesize differ- next section, this inefficient representation can be easily optimized ent hardware modules as processing elements. Streams can also using data-centric transformations. be used in conjunction with Maps to compactly represent systolic Different connected components within an SDFG multigraph arrays, constructs commonly used in circuit design to represent also run concurrently (by definition). Thus, they are mapped to efficient pipelines, as can be seen in Fig. 7. Note that nodatais parallel sections in OpenMP, different CUDA streams on GPUs, flowing in or out of the Map scope (using empty memlets forthe or different command queues on FPGAs. These concepts are noto- enclosed subgraph) โ€” this would replicate the scopeโ€™s contents as riously cumbersome to program manually for all platforms, where separate connected components. synchronization mistakes, order of library calls, or less-known S features (e.g., nowait, non-blocking CUDA streams) may drasti- S(dyn) cally impact performance or produce wrong results. Therefore, the [p=0:P], len(S) = 0 N S(1) SDFGโ€™s automatic management of concurrency, and configurable S(1) fibonacci fine-tuning of synchronization aspects by the performance engineer S out (CR: Sum) S(2) (e.g., maximum number of concurrent streams, nested parallelism) [p=0:P], len(S) = 0 out (CR: Sum) S(dyn) make the IR attractive for HPC programming on all platforms. out S Figure 8: Asynchronous Fibonacci SDFG Consume scopes enable producer/consumer relationships via dynamic processing of streams. 
Consume nodes are defined by 3.4 State the number of processing elements, an input stream to consume Sequential operation in SDFGs either implicitly occurs following from, and a quiescence condition that, when evaluated to true, stops data dependencies, or explicitly specified using multiple states. processing. An example is shown in Fig. 8, which computes the State transition edges define a condition, which can depend on Fibonacci recurrence relation of an input N without memoization. data in containers, and a list of assignments to inter-state symbols In the SDFG, the value is first pushed into the stream S and asyn- (e.g., loop iteration). The concept of a state machine enables both chronously processed by P workers, with the memlet annotated as complex control flow patterns, such as imperfectly nested loops, dyn for dynamic number of accesses. The tasklet adds the result to and data-dependent execution, as shown in Fig. 10a. out and pushes two more values to S for processing. The consume To enable control-flow within data-flow (e.g., a loop in amap), scope then operates until the number of elements in the stream is or a parametric number of state machines, SDFGs can be nested zero, which terminates the program. via the Invoke node. The semantics of Invoke are equivalent to Consume scopes are implemented using batch stream dequeue a tasklet, thereby disallowing access to external memory without and atomic operations to asynchronously pop and process elements. memlets. The Mandelbrot example (Fig. 10b) demonstrates nested The potential to encompass complex parallel patterns like work SDFGs. In the program, each pixel requires a different number of stealing schedulers using high-performance implementations of iterations to converge. In this case, an invoke node calls another this node dramatically reduces code complexity. SDFG within the map to manage the convergence loop. Recursive In order to handle concurrent memory writes from scopes, we calls to the same SDFG are disallowed, as the nested state machine define Write-Conflict Resolution memlets. As shown in Fig. 9a, may be inlined or transformed by the performance engineer. SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA Ben-Nun et al.

my_tasklet my_tasklet

$A[$ar] C [y=0:H,x=0:W] $B[$ar] ๐‘ณ๐‘ณ * ๐‘น๐‘น ci * C โ‰ค 5 $A[:] A B co = 2*ci โˆ… x y $B[$br] co init $A X (CR: $REDUCE) update arr arr i = 0 $B x y $A[:] a b A x y c = a + b + < ; $REDUCE i =๐Ÿ๐Ÿ i + 1๐Ÿ๐Ÿ c C i ๐’™๐’™ ๐’š๐’š ๐Ÿ’๐Ÿ’ $B[$br] ci image[y,x] C co = ci/2 [y=0:H,x=0:W] $B C >5 co image[0:H,0:W] A image (a) Map-Reduce Fusion

[โ€ฆ] [โ€ฆ] (a) Branching (b) Mandelbrot (Nested SDFG) Figure 10: Data-Dependent Execution * * $A[$r_out] ๐‘ณ๐‘ณ $A[$r_out] ๐‘น๐‘น tmp_$A tmp_$A[:] 4 PERFORMANCE ENGINEER WORKFLOW * $A[$r_in] * The Stateful Dataflow Multigraph is designed to expose application tmp_$A[$r_in - $r_out] * data movement motifs, regardless of the underlying computations. * As such, the optimization process of an SDFG consists of finding (b) Local Storage and leveraging such motifs, in order to mutate program dataflow. Below, we describe the two principal tools we provide the perfor- Figure 11: SDFG Transformations mance engineer to guide optimization, followed by the process of compiling an SDFG into an executable library.

4.1 Graph Transformations Informally, we define a graph transformation on an SDFG as aโ€œfind and replaceโ€ operation, either within one state or between several, which can be performed if all of the conditions match. For general optimization strategies (e.g., tiling), we provide a standard library of such transformations, which is meant to be used as a baseline for performance engineers. Transformations can be designed with symbolic expressions, or specialized for given sizes in order to fine-tune applications. A list of 16 transformations implemented in DaCe can be found in Appendix B. Figure 12: DIODE Graphical User Interface Transformations consist of a pattern subgraph and a replacement subgraph. A transformation also includes a matching function, used or DIODE (Fig. 12), is a specialized IDE for performance engi- to identify instances of the pattern in an SDFG, and programmati- neers that enables editing SDFGs and applying transformations cally verify that requirements are met. To find matching subgraphs in a guided manner, in addition to the programmatic interface. In in SDFGs, we use the VF2 algorithm [19] to find isomorphic sub- particular, performance engineers can: graphs. Transformations can be applied interactively, or using a โ€ข interactively modify attributes of SDFG nodes and memlets; Python API for matching and applying transformations. An exam- โ€ข transform and tune transformation parameters; ple of a full source code of a transformation is found in Appendix D. โ€ข inspect an integrated program view that maps between lines Using the transformation infrastructure, we enable seamless knowl- in the domain code, SDFG components, and generated code; edge transfer of optimizations across applications. โ€ข run and compare historical performance of transformations; Two examples of SDFG transformations can be found in Figures โ€ข save transformation chains to files, a form of โ€œoptimization 11a and 11b. In Fig. 11a, the transformation detects a pattern L version controlโ€ that can be used when tuning to different where Reduce is invoked immediately following a Map, reusing architectures (diverging from a mid-point in the chain); and the computed values. In this case, the transient array $A can be โ€ข hierarchically interact with large-scale programs, collapsing removed (if unused later) and computations can be fused with a irrelevant parts and focusing on bottleneck subgraphs. conflict resolution, resulting in the replacement R. In the second We demonstrate the process of interactively optimizing the ma- example (Fig. 11b), a local array, which can be assigned to scratch- trix multiplication SDFG (Fig. 9b) in a video2. Using the IDE and pad memory, is added between two map nodes. As a result, the the SDFG representation yields nearly the same performance as relative indices are changed in all subsequent memlets. Intelโ€™s optimized Math Kernel Library (MKL) [38] in minutes (ยง6.2).

4.3 Compilation Pipeline 4.2 DIODE Compiling an SDFG consists of three steps: โถ data dependency SDFGs are intended to be malleable and interactive, which we inference, โท code generation, and โธ compiler invocation. realize with an Integrated Development Environment (IDE). The Data-centric Integrated Optimization Development Environment, 2https://www.vimeo.com/301317247 Stateful Dataflow Multigraphs SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA

2mm 3mm adi atax bicg cholesky correlation covariance deriche doitgen durbin fdtdโˆ’2d floydโˆ’warshall gemm gemver 3 3 3 2.0 0.020 0.006 4 0.020 2.0 0.3 9 0.75 90 0.2 1.5 0.02 0.015 2 0.015 1.5 3 2 2 0.004 6 0.50 0.2 1.0 60 2 0.010 0.010 1.0 0.1 0.01 1 1 1 3 0.25 0.002 30 0.1 1 0.005 0.005 0.5 0.5

0 0 0 Compiler Error 0.000 0.000 0.0 0 0 0.0 0.00 0.000 0.0 0 0.0 0.00

gesummv gramschmidt heatโˆ’3d jacobiโˆ’1d jacobiโˆ’2d lu ludcmp mvt nussinov seidelโˆ’2d symm syr2k syrk trisolv trmm 0.012 0.006 Time [s] 5 1.5 0.8 1.5 3 2.0 3 0.020 20 3 4 0.009 4 0.010 0.6 2 15 1.5 0.004 3 0.015 2 1.0 1.0 3 2 2 0.006 0.4 0.010 10 1.0 2 0.005 2 1 0.002 1 1 0.5 0.5 0.003 1 0.2 1 1 0.005 5 0.5

0.000 0 0 0.000 0 0 0 0.000 0 0 0.0 0.0 0.0 0.000 0.0 ICC ICC ICC ICC ICC ICC ICC ICC ICC ICC ICC ICC ICC ICC ICC Polly Polly Polly Polly Polly Polly Polly Polly Polly Polly Polly Polly Polly Polly Polly GCC GCC GCC GCC GCC GCC GCC GCC GCC GCC GCC GCC GCC GCC GCC Pluto Pluto Pluto Pluto Pluto Pluto Pluto Pluto Pluto Pluto Pluto Pluto Pluto Pluto Pluto Clang Clang Clang Clang Clang Clang Clang Clang Clang Clang Clang Clang Clang Clang Clang SDFG SDFG SDFG SDFG SDFG SDFG SDFG SDFG SDFG SDFG SDFG SDFG SDFG SDFG SDFG

(a) CPU (1.43ร— geometric mean over best general-purpose compiler)

2574 2304 746 639 500 1 1.6 1.3 1.5 12 100 2.6 44 18 34 160 7400 0.67 0.5

400 0.4 Compiler SDFG 300 0.3 PPCG Time [s] Time [s] 200 0.2

0.1 100 Compiler Error 1.109 0.95 0.753 0.665 0.607 0.386 0.217 0 0.031 lu lu adi adi mvt mvt bicg syrk bicg syrk atax atax 2mm 3mm 2mm 3mm trmm trmm syr2k syr2k trisolv trisolv symm symm gemm durbin gemm durbin fdtd2d heat3d ludcmp ludcmp doitgen deriche doitgen gemver deriche gemver fdtdโˆ’2d heatโˆ’3d jacobi1d jacobi2d seidel2d cholesky cholesky nussinov nussinov jacobiโˆ’1d jacobiโˆ’2d seidelโˆ’2d gesummv gesummv correlation correlation covariance covariance gramschmidt gramschmidt floydโˆ’warshall floyd_warshall

(b) GPU (1.12ร— geometric mean speedup over PPCG) (c) FPGA Figure 13: Polyhedral Application Runtime (lower is better, best viewed in color)

In step โถ, data dependencies on all levels are resolved. First, 5 ASSESSING PERFORMANCE WITHOUT a validation pass is run on the graph to ensure that scopes are TRANSFORMATIONS correctly structured, memlets are connected properly, and map To understand how the inherently-concurrent representation of schedules and data storage locations are feasible (failing when, e.g., the SDFG creates reasonably performing naรฏve code, we reimple- FPGA code is specified in a GPU map). Then, memlet ranges are ment and run the Polybench [57] benchmark over SDFGs, without propagated from tasklets and containers outwards (through scopes) any optimizing transformations, using the experimental setup of to obtain the overall data dependencies of each scope, using the Section 6. We show that the representation itself exposes enough image of the scope function (e.g., Map range) on the union of the parallelism to compete with state-of-the-art polyhedral compilers, internal memlet subsets. This information is later used to generate outperform them on GPUs, and provide the first complete set of exact memory copy calls to/from accelerators. placed-and-routed Polybench kernels on an FPGA. The code generation process of an SDFG (step โท) is hierarchical, To demonstrate the wide coverage provided by SDFGs, we ap- starting from top-level states and traversing into scopes. It begins ply the FPGATransform automatic transformation to offload each by emitting external interface code and the top-level state machine. Polybench application to the FPGA during runtime, use our simu- Within each state, nodes are traversed in topological order, and a lation flow to verify correctness, and finally perform the full place- platform-specific dispatcher is assigned to generate the respective ment and routing process. The same applies for GPUTransform. code, depending on the nodeโ€™s storage or schedule type. The process We execute all kernels on the accelerators, including potentially continues recursively via map/consume scopes and nested SDFGs, unprofitable ones (e.g., including tasklets without parallelism). generating heterogeneous codes using several dispatchers. Between The results are shown in Fig. 13, comparing SDFGs both with states, transitions are generated by emitting for-loops and branches general-purpose compilers (green bars in the figure), and with when detected, or using conditional goto statements as a fallback. pattern-matching and polyhedral compilers (blue bars). We use In step โธ, we invoke compiler(s) for the generated code according the default Polybench flags, the Large dataset size, and select the to the used dispatchers. The compiled library can then be used best performance of competing compilers from the flags specified directly by way of inclusion in existing applications, or through in the Appendix C. On the CPU, we see that for most kernels, Python/DaCe. the performance of unoptimized SDFGs is closer to that of the polyhedral compilers than to the general-purpose compilers. The cases where SDFGs are on the order of standard compilers are solvers (e.g., cholesky, lu) and simple linear algebra (e.g., gemm). In SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA Ben-Nun et al.

Query MM Histogram Jacobi SpMV 6 12 โ— 0.06 โ— โ— both cases, data-centric transformations are necessary to optimize 0.08 โ— โ—โ— โ—โ—โ—โ— โ— โ— โ— โ— 10.74 18.98 โ— โ— 41.41 32.12 0.4 โ—โ— โ— the computations, which we exclude from this test in favor of โ— โ— 9 โ—โ— 0.06 โ—

0.3 โ— โ— โ— 4 0.04 demonstrating SDFG expressibility. โ— โ— 0.04 6 0.2 On the GPU, in most cases SDFGs generate code that outper- โ— โ—โ— Time [s] 2 0.02 forms PPCG, a tool specifically designed to transform polyhedral 0.02 3 0.1 โ— โ— โ—โ— โ— โ— โ—

โ— โ— โ— โ— โ— โ— โ— โ— โ—โ— โ— โ— โ—โ— โ—โ—โ—โ—โ—โ—โ—โ— โ— โ— โ— โ— โ— โ— โ—โ— โ— โ— โ— โ— โ— โ— โ—โ—โ—โ— โ— โ—โ— โ— โ—โ—โ—โ— โ— โ— โ—โ— โ— 0.44 0.35 0.05 ร— 0.29 0.03 โ— bicg 0.02 applications to GPUs. In particular, the GPU SDFG is 11.8 โ— โ—โ—โ—โ— โ—โ—โ—โ—โ—โ—โ— โ— โ— 0.0 0 โ—โ—โ—โ—โ—โ—โ—โ— โ—โ—โ—โ—โ—โ—โ—โ— โ— โ— 0.00 0 0.00 faster than PPCG. We attribute these speedups to avoiding unnec- ICC ICC ICC ICC ICC HPX MKL MKL Polly Polly Polly Polly Polly GCC GCC GCC GCC GCC Pluto Clang Clang Clang Clang Clang SDFG SDFG SDFG SDFG SDFG Halide essary array copies due to explicit data dependencies, as well as the Halide Halide inherent parallel construction of the data-centric representation. (a) CPU

Query MM Histogram Jacobi SpMV

7750 2434 โ— โ— 6 PERFORMANCE EVALUATION 4 10.0 5 1.5

4 We evaluate the performance of SDFGs on a set of fundamental 3 7.5 200 1.0 kernels, followed by three case studies: analysis of matrix multipli- 3 2 5.0

โ— 2 100 โ— โ— โ— cation, a graph algorithm, and a real-world physics application. Time [ms] 0.5 1 2.5 Experimental Setup We run all of our experiments on a server that 1 contains an Intel 12-core Xeon E5-2650 v4 CPU (clocked at 2.20GHz, 0 0.0 0 0 0.0 CUB CUB SDFG SDFG SDFG SDFG SDFG NVCC NVCC NVCC NVCC NVCC Halide Halide HyperThreading disabled, 64 GB DDR4 RAM) and a Tesla P100 GPU PPCG CUBLAS CUTLASS (16 GB HBM2 RAM) connected over PCI Express. For FPGA results, (b) GPU cuSPARSE we use a Xilinx VCU1525 board, hosting an XCVU9P FPGA and four Query MM Histogram Jacobi SpMV

DDR4 banks at 2400 MT/s. We run the experiments 30 times and โ— โ— โ— โ— 1140 51393 2.5 1.00 โ— โ— 1.00 6 report the median running time (including memory copy), where 0.6 2.0 the error-bars indicate 95% confidence interval of all runs, and 0.75 0.75 1.5 4 0.4 points are outliers (Tukey fences, 1.5 IQR). For Polybench running 0.50 0.50

Time [s] 1.0 times, we use the provided measurement tool, which reports the 0.2 2 0.25 0.25 โ— โ—โ—โ— โ— 0.5 โ—โ— โ— โ— โ—โ— โ— โ— โ— โ—

โ— โ— average time of five runs. All reported results were executed in โ—โ— โ—โ—โ—โ— โ—โ—โ— โ— โ—โ—โ—โ—โ— hardware, including the FPGA. 0.0 0.00 0.0 0.00 0 HLS HLS HLS HLS HLS Compilation Generated code from SDFGs is compiled using gcc SDFG SDFG SDFG SDFG SDFG 8 for CPU results, CUDA 9.2 for GPU, and Xilinx SDAccel 2018.2 (c) FPGA for FPGAs. Compilation flags: -std=c++14 -O3 -march=native Figure 14: Fundamental Kernel Runtime (lower is better) -ffast-math for CPU, -std=c++14 -arch sm_60 -O3 for GPU, and -std=c++11 -O3 for FPGA. Fundamental kernels use single- In MM, a highly tuned kernel, SDFGs achieve โˆผ98.6% of the precision floating point types, Polybench uses the default experi- performance of MKL, โˆผ70% of CUBLAS, and โˆผ90% of CUTLASS, ment data-types (mostly double-precision), and graphs use integers. which is the upper bound of CUDA-based implementations of MM. On FPGA, SDFGs a result 4,992ร— faster than naรฏve HLS over 6.1 Fundamental Computational Kernels SDAccel. We also run the FPGA kernel for 1024 ร— 1024 matrices We begin by evaluating 5 core scientific computing kernels, imple- and compare to the runtime of 878 ms reported for Spatial [43] on mented over SDFGs: the same chip. We measure 45 ms, yielding a speedup of 19.5ร—. โ€ข Matrix Multiplication (MM) of two 2,048ร—2,048 matrices. We observe similar results in SpMV, which is more complicated โ€ข Jacobi Stencil: A 5-point stencil repeatedly computed on a to optimize due to its irregular memory access characteristics. SD- 2,048 square 2D domain forT =1,024 iterations, with constant FGs are on par with MKL (99.9% performance) on CPU, and are (zero) boundary conditions. successfully vectorized on GPUs. โ€ข Histogram: Outputs the number of times each value (evenly For Histogram, SDFGs enable vectorizing the program, achieving binned) occurs in a 8,192 square 2D image. 8ร— the performance of gcc, where other compilers fail due to the โ€ข Query: Filters a column of 67,108,864 elements according to kernelโ€™s data-dependent accesses. We implement a two-stage kernel a given condition (filters roughly 50%). for the FPGA, where the first stage reads 16 element vectors and โ€ข Sparse Matrix-Vector Multiplication (SpMV) of a CSR scatters them to 16 parallel processing elements generated with map matrix (8,192 square; 33,554,432 non-zero values). unrolling (similar to Fig. 7), each computing a separate histogram. In the second stage, the histograms are merged on the FPGA before Our results, shown in Fig. 14, are compared with naรฏve imple- copying back the final result. This yields a 10ร— speedup in hardware. mentations of the code, compiled with GCC, Clang, NVCC, and In Query, SDFGs are able to use streaming data access to par- ICC; Intel MKL [38] and HPX [41] corresponding library calls for allelize the operation automatically, achieving significantly better CPU; NVIDIA CUBLAS [54], CUSPARSE [55], and CUB [20], as results than HPX and STL. On FPGA we read wide vectors, then well as Hybrid Hexagonal Tiling over PPCG [68] for GPU; Halide use a deep pipeline to pack the sparse filtered vectors into dense [58] (auto- and manually-tuned) for CPU and GPU; and Xilinx vectors. This scheme yields a 10ร— speedup, similar to Histogram. Vivado HLS/SDAccel [71, 72] and Spatial [43] for FPGA. For Jacobi on CPU, we use a domain-specific transformation On all applications, our SDFG results only employ data-centric (DiamondTiling). We see that it outperforms standard implementa- transformations, keeping tasklets intact (ยง4.1). 
We highlight key tions by up to two orders of magnitude, performing 90ร— faster than results for all platforms below. Stateful Dataflow Multigraphs SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA

โ— G_col depth frontier fsz G_row โ— โ— โ— Intel MKL G_col[0:E] depth[0:V] frontier[:] โ— fsz[k] G_row[0:V+1] โ— โ— [t = 0 : T] 150 G_col[0:E] depth[0:V] frontier[:] fsz[k] G_row[0:V+1] โ— โ— โ— โ— โ— Loop Peeling โ— โ— [f = : , ( + ) ] ๐’‡๐’‡๐’‡๐’‡๐’‡๐’‡ ๐’‡๐’‡๐’‡๐’‡๐’‡๐’‡ โ— โ— Data Packing (B) 100 โ— โ— ๐ญ๐ญ ๐‘ป๐‘ป ๐ฆ๐ฆ๐ฆ๐ฆ๐ฆ๐ฆ ๐’‡๐’‡๐’‡๐’‡๐’‡๐’‡frontier[f]๐ญ๐ญ ๐Ÿ๐Ÿ ๐‘ป๐‘ป G_row(2) [0:V+1] โ— โ— โ— indirection โ— โ— โ— โ— โ— Tiling (Registers) G_col[0:E] depth[:] โ— โ— begin end 50 Performance [GFlop/s] Performance [nid = begin : end] โ— โ— โ— โ— โ— โ— โ— โ— Tiling (L3) G_col(-1)[0:E] depth(-1)[:] โ— โ— โ— โ— โ— โ— โ— Loop Reorder โ— โ— โ— update_and_push 0 โ— โ— โ— Unopt. โ— โ— โ— MapReduceFusion depth(-1)[:] Lstream(-1) Lfsz (Sum) 500 1000 1500 2000 [nid = begin : end] Problem Size depth[:] Lstream Lfsz (Sum)

Figure 15: Performance of Transformed GEMM SDFG [f = : , ( + ) ] ๐’‡๐’‡๐’‡๐’‡๐’‡๐’‡ ๐’‡๐’‡๐’‡๐’‡๐’‡๐’‡ Polly and marginally outperforming Pluto, which uses a similar ๐ญ๐ญ ๐‘ป๐‘ป ๐’Ž๐’Ž๐’Ž๐’Ž๐’Ž๐’Ž ๐’‡๐’‡๐’‡๐’‡๐’‡๐’‡ ๐ญ๐ญ ๐Ÿ๐Ÿ ๐‘ป๐‘ปLfsz (Sum) Lstream tiling technique. In Halide, when all timesteps are hard-coded in the Lfsz depth[:] Lfrontier pipeline (which takes โˆผ68 minutes to compile due to the stateless fsz (Sum) dataflow representation), its auto-tuner yields the best result, which [t = 0 : T] is 20% faster than SDFGs. For FPGAs, Jacobi is mapped to a systolic depth[:] gstream fsz (Sum) array of processing elements, allowing it to scale up with FPGA depth frontier fsz logic to 139 GOp/s. Overall, the results indicate that data-centric fsz>0; d++ transformations can yield competitive performance across both โ€ฆ architectures and memory access patterns. Figure 16: Main State of Optimized BFS SDFG 6.2 Case Study I: Matrix Multiplication gstream, major overhead was incurred due to atomic operations. The transformation chain and performance results from Fig. 9b to Using โท LocalStream, the results are accumulated to Lfrontier the MM CPU SDFG in the previous section are shown in Fig. 15. and bulk updates are performed to the global frontier. Similarly, we After fusing the map and reduction into a write-conflict reso- use โธ LocalStorage for frontier size accumulation. lution memlet (MapReduceFusion transformation), we largely fol- kron osmeur soclj twitter usa 0.125 โ— โ— 0.20 421 โ—โ— 26 low the approach of Goto and van de Geijn [33]. Manually re- 6 5 6 โ— 0.100 โ— ordering the map dimensions in the SDFG reduces data movement 0.15 โ— 4 4 0.075 โ— 4 and yields a marginal improvement. We then tile for the L3 cache 3 (MapTiling) and tile again for mapping to vector registers. To use 0.10 โ— 0.050 2 Time [s] 2 2 caches efficiently, we pack the tiles of B and store tiles of C using the 0.05 โ— โ— 0.025 1

LocalStorage transformation (twice), followed by Vectorization โ— โ— to generate code that uses vector extensions. Lastly, we apply a 0.00 0 0.000 0 0 Gluon Gluon Gluon Gluon Gluon SDFG SDFG SDFG SDFG SDFG transformation to convert the internal write-conflict resolution Galois Galois Galois Galois Galois memlet into a state machine, peeling the loop (ReducePeeling). Figure 17: BFS Performance The figure shows that not all transformations yield immediate We compare our results with two state-of-the-art CPU graph speedups, yet they are necessary to expose the next steps. Moreover, processing frameworks: Galois [53] and Gluon [23]. We use the after only 7 steps the performance increases by โˆผ536x (75% of MKL), default settings (bfs_push for Gluon, SyncTile for Galois) and and further increases to 98.6% of MKL after tuning transformation use 12 threads (1 thread per core). In Fig. 17, we see that SDFGs parameters for a specific size (Fig. 14). perform on-par with the frameworks on all graphs, where Galois is โˆผ ร— twitter 6.3 Case Study II: Graph Algorithms marginally faster on social networks ( 1.53 on ) and the Kronecker graph. However, in road maps (usa, osm-eur) SDFGs We implement an irregular computation problem on multi-core are up to 2ร— faster than Galois. This result could stem from our CPUs: Breadth-First Search (BFS). We use the data-driven push- fine-grained scheduling imposed by the map scopes. based algorithm [12], and test five graphs with different character- istics as shown in Appendix E. 6.4 Case Study III: Quantum Transport Due to the irregular nature of the algorithm, BFS is not a trivial Quantum Transport (QT) Simulation is used for estimating heat problem to optimize. However, SDFGs inherently support construct- dissipation in nanoscale transistors. OMEN [50] is an efficient (two- ing the algorithm using streams and data-dependent map ranges. time Gordon Bell finalist) QT simulator based on a nonlinear solver, The primary state of the optimized SDFG is shown in Fig. 16, which written in C++ and CUDA using MKL, CUBLAS, and CUSPARSE. contains only 14 nodes (excluding input/output data) and is the We use SDFGs and transformations to optimize the computa- result of three transformations from the base Python program. In tion of OMEN, the full details of which are described by Ziogas particular, the initial generated OpenMP code does not map well to et al. [74]. A significant portion of the OMEN runtime is spent in cores due to the dynamic scheduling imposed by the frontier. We computing Scattering Self-Energies (SSE), which is given by the mitigate this effect by applying โถ MapTiling with T tiles. Since the formula in Fig. 18 (top-left). Here we focus on the computation accesses to the next frontier are performed concurrently through of ฮฃโ‰ท. Upon generating the SDFG from the formula, we see that SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA Ben-Nun et al.

1 2 3 4 [ + , G x D x , ] ๐บ๐บ ๐›ป๐›ป,๐ป๐ป ๐ท๐ท ๐‘ง๐‘ง ๐‘ง๐‘ง x 1 2 3 4 5 6 7 8 โ‰ท โˆ ๏ฟฝ ๏ฟฝ ๐‘‘๐‘‘๐‘‘๐‘‘ ๐›ป๐›ป๐ป๐ป โ‹… ๐บ๐บ ๐ธ๐ธ โ„๐œ”๐œ” ๐‘˜๐‘˜ โˆ’ ๐‘ž๐‘ž โ‹… ๐บ๐บ ๐บ๐บ ๐บ๐บ ๐บ๐บ ๐บ๐บ ๐บ๐บ ๐บ๐บ ๐บ๐บ โ€ฆ โ€ฆ 1 ๐‘ง๐‘ง + ๐ท๐ท ๐‘ž๐‘ž = ๐‘˜๐‘˜โˆ’2 ๐‘˜๐‘˜โˆ’1 ๐‘˜๐‘˜ ๐‘˜๐‘˜+1 ๐‘˜๐‘˜+2 ๐‘ง๐‘ง ๐บ๐บ ๐บ๐บ ๐บ๐บ ๐บ๐บ ๐บ๐บ ๐‘˜๐‘˜ , , , ... ฮฃ ๐›ป๐›ป๐ป๐ป โ‹… ๐ท๐ท ๐œ”๐œ” ๐‘ž๐‘ž x ๐ท๐ท2 = ๐‘Ž๐‘Ž ๐‘๐‘ ฮฃ โ‰ท ๐บ๐บ๐‘˜๐‘˜โˆ’1 ๐บ๐บ๐‘˜๐‘˜ ๐บ๐บ๐‘˜๐‘˜+1 ๐บ๐บ๐‘˜๐‘˜+2 ๐บ๐บ๐‘˜๐‘˜+3 ๐‘˜๐‘˜+1 [ , : , : ] [ , , ] [ , , ] [ , , , x ฮฃ ๐‘ง๐‘ง ๐‘˜๐‘˜ ๐‘˜๐‘˜+1 ๐‘˜๐‘˜+2 ๐‘˜๐‘˜+3 ๐‘˜๐‘˜+4 3 ๐‘˜๐‘˜+2 โ‰ท ๐‘–๐‘– ๐‘ž๐‘ž ๐œ”๐œ” ๐‘–๐‘– ๐‘—๐‘— , , ] ฮฃ ๐บ๐บ1 ๐บ๐บ5 ๐บ๐บ9 ๐บ๐บ13 ๐บ๐บ2 ๐บ๐บ6 ๐บ๐บ10 ๐บ๐บ14 ๐บ๐บ ๐บ๐บ ๐บ๐บ ๐บ๐บ ๐บ๐บ ๐ท๐ท ฮฃ โ‰ท ๐บ๐บ ๐‘“๐‘“ @ ๐›ป๐›ป๐ป๐ป ๐‘Ž๐‘Ž ๐‘๐‘ ๐‘–๐‘– ๐›ป๐›ป๐ป๐ป ๐‘Ž๐‘Ž ๐‘๐‘ ๐‘—๐‘— * ๐ท๐ท ๐‘ž๐‘ž๐‘ง๐‘ง ๐œ”๐œ” ๐‘Ž๐‘Ž ๐ท๐ท4 ๐บ๐บ ๐›ป๐›ป๐ป๐ป ๐ท๐ท G D G [ , : , : ] D [ , , ] ๐‘๐‘ ๐‘–๐‘– ๐‘—๐‘— โ‰ท โ‰ท , , , , , , , ๐›ป๐›ป๐ป๐ป ๐‘–๐‘– ๐›ป๐›ป๐ป๐ป ๐‘–๐‘– ๐‘ž๐‘ž๐‘ง๐‘ง ๐œ”๐œ” , , , ๐›ป๐›ป๐ป๐ป ๐›ป๐›ป๐ป๐ป G D [ , [ , , ] [ , , ] [ , , , G D , , , , , ๐‘ง๐‘ง ๐‘ง๐‘ง ๐‘ง๐‘ง ๐‘–๐‘– ๐‘ž๐‘ž ๐œ”๐œ” ๐‘–๐‘– ๐‘—๐‘— , ] โ‰ท , , ] G D โ‰ท ๐‘˜๐‘˜ ๐ธ๐ธ ๐‘ž๐‘ž ๐œ”๐œ” ๐‘–๐‘– ๐‘—๐‘— ๐‘Ž๐‘Ž ๐‘๐‘ ๐›ป๐›ป๐ป๐ป ๐›ป๐›ป๐ป๐ป ๐‘ง๐‘ง ๐‘ง๐‘ง ๐ท๐ท ๐‘ž๐‘ž๐‘ง๐‘ง ๐œ”๐œ” ๐‘Ž๐‘Ž ๐›ป๐›ป๐ป๐ป ๐›ป๐›ป๐ป๐ป ๐‘ง๐‘ง ๐‘ง๐‘ง ๐บ๐บ ๐‘˜๐‘˜ โˆ’ ๐‘ž๐‘ž ๐›ป๐›ป๐ป๐ป ๐‘Ž๐‘Ž ๐‘๐‘ ๐‘–๐‘– ๐›ป๐›ป๐ป๐ป ๐‘Ž๐‘Ž ๐‘๐‘ ๐‘—๐‘— ๐‘Ž๐‘Ž ๐‘๐‘ ๐‘–๐‘– ๐‘˜๐‘˜ ๐‘ž๐‘ž ๐ธ๐ธ โ‰ท , , , โ‰ท ๐ธ๐ธ โˆ’ ๐œ”๐œ” ๐‘“๐‘“ ๐‘๐‘ ๐‘–๐‘– ๐‘—๐‘— , , , , , , , , , , , ๐›ป๐›ปG๐ป๐ป[ , , ๐›ป๐›ป๐ป๐ป @ * ๐‘ง๐‘ง ๐‘ง๐‘ง D [ , , : ] ๐‘ง๐‘ง ๐‘ง๐‘ง : ] ๐‘–๐‘– ๐‘˜๐‘˜ ๐‘ž๐‘ž ๐ธ๐ธ ๐‘ง๐‘ง ๐‘ง๐‘ง ๐Ž๐Ž ๐‘Ž๐‘Ž ๐‘๐‘ ๐‘–๐‘– ๐‘˜๐‘˜ ๐‘ž๐‘ž ๐ธ๐ธ โ‰ท โ‰ท Map ๐‘˜๐‘˜ ๐ธ๐ธ ๐‘ž๐‘ž ๐Ž๐Ž ๐‘–๐‘– ๐‘Ž๐‘Ž ๐‘๐‘ Data @ Strided Specialized SDFG ๐›ป๐›ป๐ป๐ป ๐‘–๐‘– ๐‘˜๐‘˜๐‘ง๐‘ง โˆ’ ๐‘ž๐‘ž๐‘ง๐‘ง SBSMM ๐›ป๐›ป๐ป๐ป ๐‘–๐‘– ๐‘ž๐‘ž๐‘ง๐‘ง GEMM ๐œ”๐œ” G D Fission @ Layout Multiplication Implementation ๐ธ๐ธ โˆ’ ๐‘๐‘ ๐ธ๐ธ @ , , , [ , , ] (CR: Sum) [ , , ] (CR: Sum) ๐›ป๐›ป๐ป๐ป ๐›ป๐›ป๐ป๐ป , , , , , , , , , , , ๐‘ง๐‘ง ๐‘ง๐‘ง โ‰ท ๐Ž๐Ž ๐‘–๐‘– ๐‘˜๐‘˜ ๐‘ž๐‘ž โ‰ท๐ธ๐ธ , , , , , ๐‘ง๐‘ง , , , ฮฃ, ,๐‘˜๐‘˜๐‘ง๐‘ง, ๐ธ๐ธ,๐‘Ž๐‘Ž ๐‘ง๐‘ง ๐‘ง๐‘ง , ฮฃ ๐‘Ž๐‘Ž ๐‘˜๐‘˜ ๐ธ๐ธ ๐‘˜๐‘˜๐‘ง๐‘ง ๐ธ๐ธ ๐‘ž๐‘ž๐‘ง๐‘ง ๐Ž๐Ž ๐‘–๐‘– ๐‘Ž๐‘Ž ๐‘๐‘ ๐‘Ž๐‘Ž ๐‘๐‘ ๐‘–๐‘– ๐‘˜๐‘˜ ๐‘ž๐‘ž ๐ธ๐ธ ๐‘ง๐‘ง ๐‘ง๐‘ง ๐‘ง๐‘ง ๐‘ง๐‘ง ๐‘Ž๐‘Ž ๐‘๐‘ ๐‘–๐‘– ๐‘˜๐‘˜ ๐‘ž๐‘ž ๐ธ๐ธ ๐‘Ž๐‘Ž ๐‘๐‘ ๐‘˜๐‘˜ ๐ธ๐ธ ๐‘ž๐‘ž ๐œ”๐œ” ๐‘–๐‘– ๐‘—๐‘— ๐‘Ž๐‘Ž ๐‘๐‘ โ‰ท โ‰ท โ‰ท ฮฃ ฮฃ ฮฃ โ‰ท โ‰ท ฮฃ ฮฃ Figure 18: Optimizing Scattering Self-Energies with SDFG Transformations [74] there are multiple parallel dimensions that compute small matrix 7 RELATED WORK multiplications (denoted as @) and Hadamard products (*), reducing Multiple previous works locally address issues posed in this paper. the result with a summation. We discuss those papers below, and summarize them in Fig. 19. 
In step โถ, we split the parallel computation into several maps, Performance Portability Performance-portable programming creating transient 5D and 6D arrays. Steps โท and โธ transform the models consist of high-performance libraries and programming data layout by reorganizing the map dimensions and transient ar- languages. Kokkos [26] and RAJA [48] are task-based HPC libraries rays such that the multiplications can be implemented with one that provide parallel primitives (e.g., forall, reduce) and offer ex- โ€œbatched-stridedโ€ optimized CUBLAS GEMM operation. Lastly, step ecution policies/spaces for CPUs and GPUs. The Legion [9] pro- โน replaces the CUBLAS call with a specialized nested SDFG that per- gramming model adds hierarchical parallelism and asynchronous forms small-scale batched-strided matrix multiplication (SBSMM), tasking by inferring data dependencies from access sets, called transformed w.r.t. matrix dimensions and cache sizes in hardware. logical regions, which are composed of index spaces and accessed fields. Logical regions are similar to subsets in memlets, butdif- Table 2: SSE Performance fer by implicitly representing data movement. The Legion model was also extended to support numpy as a frontend for distributed Variant Tflop Time [s] % Peak Speedup GPU computations in Legate [8]. Language and directive-based OMEN [50] 63.6 965.45 1.3% 1ร— standards such as OpenCL [35], OpenACC, and OpenMP [22] also Python (numpy) 63.6 30,560.13 0.2% 0.03ร— provide portability, where some introduce FPGA support through DaCe 31.8 29.93 20.4% 32.26ร— extensions [21, 47]. SYCL [37] is an embedded DSL standard ex- tending OpenCL to enable single-source (C++) heterogeneous com- puting, basing task scheduling on data dependencies. However, true performance portability cannot be achieved with these stan- Table 3: Strided Matrix Multiplication Performance dards, as optimized code/directives vastly differ on each platform CUBLAS DaCe (SBSMM) (especially in the case of FPGAs). Other frameworks mentioned be- low [7, 18, 31, 45, 58, 61, 63] also support imperative and massively GPU Gflop Time % Peak (Useful) Gflop Time %Peak parallel architectures (CPUs, GPUs), where Halide and Tiramisu P100 27.42 6.73 ms 86.6% (6.1%) 1.92 4.03 ms 10.1% have been extended [62] to target FPGA kernels. As opposed to V100 27.42 4.62 ms 84.8% (5.9%) 1.92 0.97 ms 28.3% SDFGs, none of the above models were designed to natively support both load/store architectures and reconfigurable hardware. Separation of Concerns Multiple frameworks explicitly sepa- Table 2 lists the overall SSE runtime simulating a 4,864 atom rate the computational algorithm from subsequent optimization nanostructure over OMEN, numpy (using MKL for dense/sparse schemes. In CHiLL [18], a user may write high-level transformation LA), and DaCe. Using data-centric transformations, the under- scripts for existing code, describing sequences of loop transforma- utilization resulting from the multitude of small matrix multipli- tions. These scripts are then combined with C/C++, FORTRAN or cations is mitigated, translating into a 32.26ร— speedup for SD- CUDA [59] programs to produce optimized code using the poly- FGs over manually-tuned implementations, and 1,021ร— over hedral model. Image processing pipelines written in the Halide Python. 
7 RELATED WORK

Multiple previous works locally address issues posed in this paper. We discuss those papers below, and summarize them in Fig. 19.

Performance Portability Performance-portable programming models consist of high-performance libraries and programming languages. Kokkos [26] and RAJA [48] are task-based HPC libraries that provide parallel primitives (e.g., forall, reduce) and offer execution policies/spaces for CPUs and GPUs. The Legion [9] programming model adds hierarchical parallelism and asynchronous tasking by inferring data dependencies from access sets, called logical regions, which are composed of index spaces and accessed fields. Logical regions are similar to subsets in memlets, but differ by implicitly representing data movement. The Legion model was also extended to support numpy as a frontend for distributed GPU computations in Legate [8]. Language and directive-based standards such as OpenCL [35], OpenACC, and OpenMP [22] also provide portability, where some introduce FPGA support through extensions [21, 47]. SYCL [37] is an embedded DSL standard extending OpenCL to enable single-source (C++) heterogeneous computing, basing task scheduling on data dependencies. However, true performance portability cannot be achieved with these standards, as optimized code/directives vastly differ on each platform (especially in the case of FPGAs). Other frameworks mentioned below [7, 18, 31, 45, 58, 61, 63] also support imperative and massively parallel architectures (CPUs, GPUs), where Halide and Tiramisu have been extended [62] to target FPGA kernels. As opposed to SDFGs, none of the above models were designed to natively support both load/store architectures and reconfigurable hardware.

Separation of Concerns Multiple frameworks explicitly separate the computational algorithm from subsequent optimization schemes. In CHiLL [18], a user may write high-level transformation scripts for existing code, describing sequences of loop transformations. These scripts are then combined with C/C++, FORTRAN or CUDA [59] programs to produce optimized code using the polyhedral model. Image processing pipelines written in the Halide [58] embedded DSL are defined as operations, whose schedule is separately generated in the code by invoking commands such as tile, vectorize, and parallel. Tiramisu [7] operates in a similar manner, enabling loop and data manipulation. In SPIRAL [31], high-level specifications of computational kernels are written in a DSL, followed by using breakdown and rewrite rules to lower them to optimized algorithmic implementations. SDFGs, along with DIODE and the data-centric transformation workflow, streamline such approaches and promote a solution that enables knowledge transfer of optimization techniques across applications.

Figure 19: Related Work. Categories: Separation of Concerns; Data Access Decoupling; Data-Centric Transformations; Representations (Polyhedral, Streaming, Graphs, Dataflow, Transformable); Performance Portability (Load/Store Architectures, Reconfigurable Hardware).

Decoupling Data Access and Computation The Chapel [17] language supports controlling and reasoning about locality by defining object locations and custom iteration spaces. Charm++ [42] is a parallel programming framework, organized around message passing between collections of objects that perform tasks in shared- or distributed-memory environments. In the Sequoia [28] programming model, all communication is made explicit by having tasks exchange data through argument passing and calling subtasks. MAPS [11] separates data accesses from computation by coupling data with their memory access patterns. This category also includes all frameworks that implement the polyhedral model, including CHiLL, PENCIL [6], Pluto [13], Polly [34] and Tiramisu. Furthermore, the concept of decoupling data access can be found in models for graph analytics [23, 53], streaming [10, 65], machine learning [3], as well as other high-throughput [36] and high-performance [4] libraries. Such models enable automatic optimization by reasoning about accessed regions. However, it is assumed that the middleware will carry most of the burden of optimization, and thus frameworks are tuned for existing memory hierarchies and architectures. This does not suffice for fine-tuning kernels to approach peak performance, nor is it portable to new architectures. In these cases, a performance engineer typically resorts to a full re-implementation of the algorithm, as opposed to the workflow proposed here, where SDFG transformations can be customized or extended.

Dataflow Representations Several IRs combine dataflow with control flow in graph form. The LLVM IR [46] is a control-flow graph composed of basic blocks of statements. Each block is given in Static Single Assignment (SSA) form and can be transformed into a dataflow DAG. Program Dependence Graphs (PDGs) [29] represent statements and predicate expressions with nodes, and the edges represent both data dependencies and control flow conditions for execution. The PDG is a control-centric representation in which statements are assumed to execute dynamically, a model that fits instruction-fetching architectures well. SDFGs, on the other hand, define explicit state machines of dataflow execution, which translate natively to reconfigurable hardware. Additionally, in PDGs data-dependent access (e.g., resulting from an indirection) creates the same edge as an access of the full array. This inhibits certain transformations that rely on such accesses, and does not enable inferring the total data movement volume, as opposed to the memlet definition. SDFGs can be trivially converted to the SSA and PDG representations. In the other direction, however, the parametric concurrency that the SDFG scopes offer would be lost (barring specific application classes, e.g., polyhedral). As the representations operate on a low level, with statements as the unit element, they do not encapsulate the multi-level view as in SDFGs. HPVM [45] extends the LLVM IR by introducing hierarchical dataflow graphs for mapping to accelerators, yet still lacks a high-level view and explicit state machines that SDFGs offer. Other representations include Bamboo [73], an object-oriented dataflow model that tracks state locally through data structure mutation over the course of the program; Dryad [39] and Naiad [52], parametric graphs intended for coarse-grained distributed data-parallel applications, where Naiad extends Dryad with definition of loops in a nested context; simplified data dependency graphs for optimization of GPU applications [70]; deterministic producer/consumer graphs [15]; and other combinations of task DAGs with data movement [32]. As the SDFG provides general-purpose state machines with dataflow, all the above models can be fully represented within it, where SDFGs have the added benefit of encapsulating fine-grained data dependencies.

Data-Centric Transformations Several representations [18, 44, 51, 58, 60, 63] provide a fixed set of high-level program transformations, similar to those presented on SDFGs. In particular, Halide's schedules are by definition data-centric, and the same applies to polyhedral loop transformations in CHiLL. HPVM also applies a number of optimization passes on a higher level than LLVM, such as tiling, node fusion, and mapping of data to GPU constant memory. Lift [63] programs are written in a high-level functional language with predefined primitives (e.g., map, reduce, split), while a set of rewrite rules is used to optimize the expressions and map them to OpenCL constructs. Loop transformations and the other aforementioned coarse-grained optimizations are all contained within our class of data-centric graph-based transformations, which can express arbitrary data movement patterns.
Summary In essence, SDFGs provide the expressiveness of a general-purpose programming language, while enabling performance portability without interfering with the original code. Differing from previous models, the SDFG is not limited to specific application classes or hardware, and the extensible data-centric transformations generalize existing code optimization approaches.

8 CONCLUSION

In this paper, we present a novel data-centric model for producing high-performance computing applications from scientific code. Leveraging dataflow tracking and graph rewriting, we enable the role of a performance engineer: a developer who is well-versed in program optimization, but does not need to comprehend the underlying domain-specific mathematics. We show that by performing transformations on an SDFG alone, i.e., without modifying the input code, it is possible to achieve performance comparable to the state-of-the-art on three fundamentally different platforms.

The IR proposed in this paper can be extended in several directions. Given a collection of transformations, research may be conducted into their systematic application, enabling automatic optimization with reduced human intervention. More information about tasklets, such as arithmetic intensity, can be recovered and added to such automated cost models to augment the dataflow captured by memlets. Another direction is the application of SDFGs to distributed systems, where data movement minimization is akin to a communication-avoiding formulation.

ACKNOWLEDGMENTS

We thank Hussein Harake, Colin McMurtrie, and the whole CSCS team for granting access to the Greina and Daint machines, and for their excellent technical support. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 programme (grant agreement DAPP, No. 678880). T.B.N. is supported by the ETH Zurich Postdoctoral Fellowship and Marie Curie Actions for People COFUND program.

REFERENCES

[13] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Program Optimization System. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
[14] Roberto Bruni and Ugo Montanari. 2017. Operational Semantics of IMP. Springer International Publishing, Cham, 53–76. https://doi.org/10.1007/978-3-319-42900-7_3
[15] Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff Lowney, Ryan Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach, and Sağnak Taşirlar. 2010. Concurrent Collections. Sci. Program. 18, 3-4 (Aug. 2010), 203–217. https://doi.org/10.1155/2010/521797
[16] Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and P Krishna Gummadi. 2010. Measuring User Influence in Twitter: The Million Follower Fallacy. ICWSM 10, 10-17 (2010), 30.
[17] B.L. Chamberlain, D. Callahan, and H.P. Zima. 2007. Parallel Programmability and the Chapel Language. The International Journal of High Performance Computing Applications 21, 3 (2007), 291–312. https://doi.org/10.1177/1094342007078442
[18] Chun Chen, Jacqueline Chame, and Mary Hall. 2008. CHiLL: A framework for composing high-level loop transformations. Technical Report. University of Southern California.
[19] Luigi P. Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. 2004. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 10 (Oct 2004), 1367–1372. https://doi.org/10.1109/TPAMI.2004.75
[20] CUB 2019. CUB Library Documentation. http://nvlabs.github.io/cub/.
[21] T. S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yiannacouras, and D. P. Singh. 2012. From OpenCL to high-performance hardware on FPGAs. In 22nd International Conference on Field Programmable Logic and Applications (FPL). 531–534. https://doi.org/10.1109/FPL.2012.6339272
[22] Leonardo Dagum and Ramesh Menon. 1998. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Comput. Sci. Eng. 5, 1 (Jan. 1998), 46–55. https://doi.org/10.1109/99.660313
[23] Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, and Keshav Pingali. 2018. Gluon: A Communication-optimizing Substrate for Distributed Heterogeneous Graph Analytics. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018). ACM, New York, NY, USA, 752–768. https://doi.org/10.1145/3192366.3192404
[24] Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38, 1, Article 1 (2011), 25 pages.
[25] NumPy Developers. 2019. NumPy Scientific Computing Package. http://www.numpy.org
[1] 2010. 9th DIMACS Implementation Challenge. http://www.dis.uniroma1.it/challenge9/download.shtml
[2] 2014. Karlsruhe Institute of Technology, OSM Europe Graph. http://i11www.iti.uni-karlsruhe.de/resources/roadgraphs.php
[3] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 265–283.
[4] Ping An, Alin Jula, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy Amato, and Lawrence Rauchwerger. 2003. STAPL: An Adaptive, Generic Parallel C++ Library. Springer Berlin Heidelberg, Berlin, Heidelberg, 193–208.
[5] David A.
Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner [26] H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. 2014. Kokkos: En- (Eds.). 2013. Graph Partitioning and Graph Clustering, 10th DIMACS Implemen- abling manycore performance portability through polymorphic memory access tation Challenge Workshop, Georgia Institute of Technology, Atlanta, GA, USA, patterns. J. Parallel and Distrib. Comput. 74, 12 (2014), 3202 โ€“ 3216. Domain- February 13-14, 2012. Proceedings. Contemporary Mathematics, Vol. 588. American Specific Languages and High-Level Frameworks for High-Performance Comput- Mathematical Society. ing. [6] Riyadh Baghdadi, Ulysse Beaugnon, Albert Cohen, Tobias Grosser, Michael Kruse, [27] H. Ehrig, K. Ehrig, U. Prange, and G. Taentzer. 2006. Fundamentals of Algebraic Chandan Reddy, Sven Verdoolaege, Adam Betts, Alastair F. Donaldson, Jeroen Graph Transformation (Monographs in Theoretical Computer Science. An EATCS Ketema, Javed Absar, Sven van Haastregt, Alexey Kravets, Anton Lokhmotov, Series). Springer-Verlag, Berlin, Heidelberg. Robert David, and Elnar Hajiyev. 2015. PENCIL: A Platform-Neutral Compute [28] Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem, Mike Intermediate Language for Accelerator Programming. In Proceedings of the 2015 Houston, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken, William J. Dally, International Conference on Parallel Architecture and Compilation (PACT) (PACT and Pat Hanrahan. 2006. Sequoia: Programming the Memory Hierarchy. In โ€™15). IEEE Computer Society, Washington, DC, USA, 138โ€“149. https://doi.org/10. Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC โ€™06). ACM, 1109/PACT.2015.17 New York, NY, USA, Article 83. https://doi.org/10.1145/1188455.1188543 [7] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Patricia [29] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. 1987. The Program Suriana, Shoaib Kamil, and Saman P. Amarasinghe. 2018. Tiramisu: A Code Dependence Graph and Its Use in Optimization. ACM Trans. Program. Lang. Syst. Optimization Framework for High Performance Systems. CoRR abs/1804.10694 9, 3 (July 1987), 319โ€“349. https://doi.org/10.1145/24039.24041 (2018). arXiv:1804.10694 http://arxiv.org/abs/1804.10694 [30] Python Software Foundation. 2019. The Python Programming Language. https: [8] Michael Bauer and Michael Garland. 2019. Legate NumPy: Accelerated and //www.python.org Distributed Array Computing. In Proceedings of the International Conference for [31] Franz Franchetti, Tze-Meng Low, Thom Popovici, Richard Veras, Daniele G. High Performance Computing, Networking, Storage and Analysis (SC โ€™19). ACM, Spampinato, Jeremy Johnson, Markus Pรผschel, James C. Hoe, and Josรฉ M. F. New York, NY, USA, Article 23, 23 pages. https://doi.org/10.1145/3295500.3356175 Moura. 2018. SPIRAL: Extreme Performance Portability. Proceedings of the IEEE, [9] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: special issue on โ€œFrom High Level Specification to High Performance Codeโ€ 106, 11 Expressing Locality and Independence with Logical Regions. In Proceedings of (2018). the International Conference on High Performance Computing, Networking, Storage [32] V. Gajinov, S. Stipic, O. S. Unsal, T. Harris, E. Ayguadรฉ, and A. Cristal. 2012. and Analysis (SC โ€™12). IEEE Computer Society Press, Los Alamitos, CA, USA, Supporting stateful tasks in a dataflow graph. In 2012 21st International Conference Article 66, 11 pages. 
on Parallel Architectures and Compilation Techniques (PACT). 435โ€“436. [10] Tobias Becker, Oskar Mencer, and Georgi Gaydadjiev. 2016. Spatial Programming [33] Kazushige Goto and Robert A. van de Geijn. 2008. Anatomy of High-performance with OpenSPL. Springer International Publishing, Cham, 81โ€“95. Matrix Multiplication. ACM Trans. Math. Softw. 34, 3, Article 12 (May 2008), [11] Tal Ben-Nun, Ely Levy, Amnon Barak, and Eri Rubin. 2015. Memory Access 25 pages. https://doi.org/10.1145/1356052.1356053 Patterns: The Missing Piece of the Multi-GPU Puzzle. In Proceedings of the Inter- [34] Tobias Grosser, Armin Groesslinger, and Christian Lengauer. 2012. Polly โ€” national Conference for High Performance Computing, Networking, Storage and Performing Polyhedral Optimizations on a Low-Level Intermediate Representa- Analysis (SC โ€™15). ACM, Article 19, 12 pages. tion. Parallel Processing Letters 22, 04 (2012), 1250010. https://doi.org/10.1142/ [12] Maciej Besta, Michaล‚ Podstawski, Linus Groner, Edgar Solomonik, and Torsten S0129626412500107 Hoefler. 2017. To Push or To Pull: On Reducing Communication and Synchroniza- [35] Khronos Group. 2019. OpenCL. https://www.khronos.org/opencl tion in Graph Computations. In Proceedings of the 26th International Symposium [36] Khronos Group. 2019. OpenVX. https://www.khronos.org/openvx on High-Performance Parallel and (HPDC โ€™17). ACM, New [37] Khronos Group. 2019. SYCL. https://www.khronos.org/sycl York, NY, USA, 93โ€“104. https://doi.org/10.1145/3078597.3078616 [38] Intel. 2019. Math Kernel Library (MKL). https://software.intel.com/en-us/mkl Stateful Dataflow Multigraphs SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA

[39] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. John Mellor-Crummey, and Vivek Sarkar (Eds.). Springer Berlin Heidelberg, Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In Berlin, Heidelberg, 136โ€“150. Proceedings of the 2Nd ACM SIGOPS/EuroSys European Conference on Computer [60] Alina Sbรฎrlea, Jun Shirako, Louis-Noรซl Pouchet, and Vivek Sarkar. 2016. Polyhe- Systems 2007 (EuroSys โ€™07). ACM, New York, NY, USA, 59โ€“72. https://doi.org/10. dral Optimizations for a Data-Flow Graph Language. In Revised Selected Papers 1145/1272996.1273005 of the 28th International Workshop on Languages and Compilers for Parallel Com- [40] Wesley M. Johnston, J. R. Paul Hanna, and Richard J. Millar. 2004. Advances in puting - Volume 9519 (LCPC 2015). Springer-Verlag, Berlin, Heidelberg, 57โ€“72. Languages. ACM Comput. Surv. 36, 1 (March 2004), 1โ€“34. https://doi.org/10.1007/978-3-319-29778-1_4 https://doi.org/10.1145/1013208.1013209 [61] Elliott Slaughter, Wonchan Lee, Sean Treichler, Michael Bauer, and Alex Aiken. [41] Hartmut Kaiser, Thomas Heller, Bryce Adelstein-Lelbach, Adrian Serio, and 2015. Regent: A High-productivity Programming Language for HPC with Logical Dietmar Fey. 2014. HPX: A Task Based Programming Model in a Global Address Regions. In Proceedings of the International Conference for High Performance Space. In Proceedings of the 8th International Conference on Partitioned Global Computing, Networking, Storage and Analysis (SC โ€™15). ACM, New York, NY, USA, Address Space Programming Models (PGAS โ€™14). ACM, New York, NY, USA, Article Article 81, 12 pages. https://doi.org/10.1145/2807591.2807629 6, 11 pages. https://doi.org/10.1145/2676870.2676883 [62] Emanuele Del Sozzo, Riyadh Baghdadi, Saman P. Amarasinghe, and Marco D. [42] Laxmikant V. Kale and Sanjeev Krishnan. 1993. CHARM++: A Portable Concur- Santambrogio. 2018. A Unified Backend for Targeting FPGAs from DSLs. 2018 rent Object Oriented System Based on C++. In Proceedings of the Eighth Annual IEEE 29th International Conference on Application-specific Systems, Architectures Conference on Object-oriented Programming Systems, Languages, and Applications and Processors (ASAP) (2018), 1โ€“8. (OOPSLA โ€™93). ACM, New York, NY, USA, 91โ€“108. https://doi.org/10.1145/165854. [63] Michel Steuwer, Christian Fensch, Sam Lindley, and Christophe Dubach. 2015. 165874 Generating Performance Portable Code Using Rewrite Rules: From High-level [43] David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Functional Expressions to High-performance OpenCL Code. In Proceedings of the Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, 20th ACM SIGPLAN International Conference on Functional Programming (ICFP and Kunle Olukotun. 2018. Spatial: A Language and Compiler for Application 2015). ACM, New York, NY, USA, 205โ€“217. https://doi.org/10.1145/2784731. Accelerators. In Proceedings of the 39th ACM SIGPLAN Conference on Programming 2784754 Language Design and Implementation (PLDI 2018). ACM, New York, NY, USA, [64] SymPy Development Team. 2019. SymPy Symbolic Math Library. http://www. 296โ€“311. https://doi.org/10.1145/3192366.3192379 sympy.org [44] M. Kong, L. N. Pouchet, P. Sadayappan, and V. Sarkar. 2016. PIPES: A Language [65] William Thies, Michal Karczmarek, and Saman P. Amarasinghe. 2002. StreamIt: and Compiler for Task-Based Programming on Distributed-Memory Clusters. A Language for Streaming Applications. 
In Proceedings of the 11th International In SC16: International Conference for High Performance Computing, Networking, Conference on Compiler Construction (CC โ€™02). Springer-Verlag, London, UK, UK, Storage and Analysis. 456โ€“467. https://doi.org/10.1109/SC.2016.38 179โ€“196. [45] Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Komuravelli, [66] D. Unat, A. Dubey, T. Hoefler, J. Shalf, M. Abraham, M. Bianco, B. L. Chamberlain, Vikram Adve, and Sarita Adve. 2018. HPVM: Heterogeneous Parallel Virtual R. Cledat, H. C. Edwards, H. Finkel, K. Fuerlinger, F. Hannig, E. Jeannot, A. Kamil, Machine. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and J. Keasler, P. H. J. Kelly, V. Leung, H. Ltaief, N. Maruyama, C. J. Newburn, and Practice of Parallel Programming (PPoPP โ€™18). ACM, New York, NY, USA, 68โ€“80. M. Pericร s. 2017. Trends in Data Locality Abstractions for HPC Systems. IEEE https://doi.org/10.1145/3178487.3178493 Transactions on Parallel and Distributed Systems 28, 10 (Oct 2017), 3007โ€“3020. [46] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for https://doi.org/10.1109/TPDS.2017.2703149 Lifelong Program Analysis and Transformation. In CGO. San Jose, CA, USA, [67] US Department of Energy. 2019. Definition - Performance Portability. 75โ€“88. https://performanceportability.org/perfport/definition/. [47] S. Lee, J. Kim, and J. S. Vetter. 2016. OpenACC to FPGA: A Framework for [68] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, Josรฉ Ignacio Gรณmez, Christian Directive-Based High-Performance Reconfigurable Computing. In 2016 IEEE Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for International Parallel and Distributed Processing Symposium (IPDPS). 544โ€“554. CUDA. ACM Trans. Archit. Code Optim. 9, 4 (Jan. 2013), 54:1โ€“54:23. https: https://doi.org/10.1109/IPDPS.2016.28 //doi.org/10.1145/2400682.2400713 [48] LLNL. 2019. RAJA Performance Portability Layer. https://github.com/LLNL/ [69] Jeffrey S. Vetter, Ron Brightwell, Maya Gokhale, Pat McCormick, Rob Ross, John RAJA Shalf, Katie Antypas, David Donofrio, Travis Humble, Catherine Schuman, Brian [49] Michael Lรถwe. 1993. Algebraic Approach to Single-pushout Graph Transforma- Van Essen, Shinjae Yoo, Alex Aiken, David Bernholdt, Suren Byna, Kirk Cameron, tion. Theor. Comput. Sci. 109, 1-2 (March 1993), 181โ€“224. https://doi.org/10.1016/ Frank Cappello, Barbara Chapman, Andrew Chien, Mary Hall, Rebecca Hartman- 0304-3975(93)90068-5 Baker, Zhiling Lan, Michael Lang, John Leidel, Sherry Li, Robert Lucas, John [50] Mathieu Luisier, Andreas Schenk, Wolfgang Fichtner, and Gerhard Klimeck. 2006. Mellor-Crummey, Paul Peltz Jr., Thomas Peterka, Michelle Strout, and Jeremiah Atomistic simulation of nanowires in the sp3d 5sโˆ— tight-binding formalism: From Wilke. 2018. Extreme Heterogeneity 2018 - Productive in boundary conditions to strain calculations. Phys. Rev. B 74 (Nov 2006), 205323. the Era of Extreme Heterogeneity: Report for DOE ASCR Workshop on Extreme Issue 20. Heterogeneity. (12 2018). https://doi.org/10.2172/1473756 [51] Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: [70] Mohamed Wahib and Naoya Maruyama. 2014. Scalable Kernel Fusion for Memory- Automatic Optimization for Image Processing Pipelines. SIGARCH Comput. Archit. bound GPU Applications. In Proceedings of the International Conference for High News 43, 1 (March 2015), 429โ€“443. 
https://doi.org/10.1145/2786763.2694364 Performance Computing, Networking, Storage and Analysis (SC โ€™14). IEEE Press, [52] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, Piscataway, NJ, USA, 191โ€“202. https://doi.org/10.1109/SC.2014.21 and Martรญn Abadi. 2013. Naiad: A Timely Dataflow System. In Proceedings of the [71] Xilinx. 2019. SDAccel. https://www.xilinx.com/products/design-tools/ Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP โ€™13). ACM, software-zone/sdaccel.html New York, NY, USA, 439โ€“455. https://doi.org/10.1145/2517349.2522738 [72] Xilinx. 2019. Vivado HLS. https://www.xilinx.com/products/design-tools/vivado [53] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. 2013. A Lightweight [73] Jin Zhou and Brian Demsky. 2010. Bamboo: A Data-centric, Object-oriented Infrastructure for Graph Analytics. In Proceedings of the Twenty-Fourth ACM Approach to Many-core Software. SIGPLAN Not. 45, 6 (June 2010), 388โ€“399. Symposium on Operating Systems Principles (SOSP โ€™13). ACM, New York, NY, USA, https://doi.org/10.1145/1809028.1806640 456โ€“471. https://doi.org/10.1145/2517349.2522739 [74] A. N. Ziogas, T. Ben-Nun, G. Indalecio Fernรกndez, T. Schneider, M. Luisier, and T. [54] NVIDIA. 2019. CUBLAS Library Documentation. Hoefler. 2019. A Data-Centric Approach to Extreme-Scale Ab initio Dissipative http://docs.nvidia.com/cuda/cublas. Quantum Transport Simulations. In Proceedings of the International Conference [55] NVIDIA. 2019. CUSPARSE Library Documentation. for High Performance Computing, Networking, Storage and Analysis (SC โ€™19). http://docs.nvidia.com/cuda/cusparse. [56] Patrick McCormick. 2019. Yin & Yang: Hardware Heterogeneity & Software Productivity . Talk at SOS23 meeting, Asheville, NC. [57] L. N. Pouchet. 2016. PolyBench: The Polyhedral Benchmark suite. https: //sourceforge.net/projects/polybench [58] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frรฉdo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI โ€™13). ACM, New York, NY, USA, 519โ€“530. https://doi.org/10.1145/2491956.2462176 [59] Gabe Rudy, Malik Murtaza Khan, Mary Hall, Chun Chen, and Jacqueline Chame. 2011. A Programming Language Interface to Describe Transformations and Code Generation. In Languages and Compilers for , Keith Cooper, SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA Ben-Nun et al.

A SDFG OPERATIONAL SEMANTICS and implement them as lists of exclusive ranges, where each range We begin by defining the different elements of our IR, which isa refers to one data dimension and defined by start:end:stride:tilesize. graph of graphs. We follow by defining how a function expressed tilesize is used to propagate multiple elements at a time, e.g., vector in our IR must be called and then give semantic rules of how an types. Fig. 20 extends Fig. 5b from the paper by showing the same SDFG is evaluated. We precede each operational semantic rule by a memlets in subset/reindex notation rather than array index labels. less formal description of the intent of the rule. A B A.1 Elements of an SDFG 0๏ƒ 0 0๏ƒ 0 2๏ƒ 0 2๏ƒ 0 1๏ƒ 0 1๏ƒ 0 An SDFG is a directed multigraph defined by the tuple (S, T,s ), a b a b a b 0 c = a + b c = a + b c = a + b whose vertex set S represent states, its edges T represent transitions c c c between them, and has one start-state s0. It is a multigraph since 0๏ƒ 0 0๏ƒ 1 0๏ƒ 2 there can be multiple transitions between a pair of states. C An SDFG state s โˆˆ S is a named, directed acyclic multigraph defined by the tuple (V , E,name). Each node v โˆˆ V is of one of i Figure 20: Expanded Map with subsetโ†’reindex Notation the following types, as shown in Table 1 of the paper: data, tasklet, mapentry, mapexit, stream, reduce, consume-entry, consume-exit, invoke. Each node type defines connectors โ€” attachment points The wcr optional attribute (write-conflict resolution) is a func- for edges defining the nature of the connection, and the edges E tion S ร— S โ†’ S for a data type S. The function receives the old indicate dependencies which constrain the execution order. Each value currently present at the destination and the new value, and edge carries a memlet, an annotation that specifies dataflow, as well produces the value that will be written. This attribute is normally as the connectors on its source and destination nodes. Below, we defined when data flows concurrently from multiple sources, e.g., describe the node types in SDFG states and the exact definition of in a parallel dot product. memlets. Map entry/exit nodes are defined by a tuple (range, inputs, As opposed to inherently sequential representations (cf. [14, outputs), creating a scope. The subgraph within the scope of a map is Rule 3.15]), in SDFGs the execution order is mainly constrained by defined by the nodes that are dominated by the map-entry andpost- explicit dependencies. dominated by the map-exit. This subgraph is expanded at runtime To parameterize some aspects of the graph, there exists a global and executed concurrently according to the symbolic range of the namespace for symbols. Symbols can be used in all symbolic ex- mapentry, which takes the form identifier=begin:end:step. Input and pressions mentioned below, and at runtime they evaluate to scalar output connectors of map nodes are either defined as IN_*/OUT_* (* values. (ยง A.2). is any string) to map memlets outside the scope to memlets within A data node represents a typed location in memory. The memory it; or other identifiers, which can be used to create data-dependent itself can be allocated outside of the SDFG and passed as a pointer map ranges. The range identifier becomes a scope-wide symbol, upon invocation, or allocated at runtime if transient. A data node and can used to define memlet attributessubset ( , reindex). is a tuple (name, basetype, dimensions, transient). 
The name is an The invoke node allows to call an SDFG within an SDFG state. identifier, the basetype a basic type (e.g., int, float, double), and Its semantics are equivalent to a tasklet that has input connectors dimensions a list of symbolic integer expressions. Memory pointed for each data-node and undefined symbol used in the invoked SDFG, to by differently named data nodes must not alias. A data nodehas and an output node for each data node. an implicit connector for each adjacent edge. The reduce alias node is used to implement target-optimized A tasklet node represents computation. It is defined by the tuple reduction procedures. It is defined by axes and a wcr function, con- (inputs, outputs, code), where inputs and outputs are sets of elements sisting of an input and an output connector (of the same type). of the form (name, basetype, dimensions) that define connectors with It is semantically equivalent to a map whose range spans the in- the same name, representing the only external memory the code coming and outgoing memlet data, containing an identity tasklet can read or write. Code is a string, which can be written in Python (output=input) and an output memlet annotated by the given wcr or languages supported by the target architecture, such as C++. for the reduced axes. A stream node represents an array of typed locations in memory, Consume entry/exit nodes complement maps, enabling pro- accessed using queue semantics, i.e., using push and pop operations. ducer/consumer relationships via dynamic parallel processing of A stream has the input connectors data and push, as well as the streams. The two nodes form a scope, similarly to a map, but also output connectors data and pop. The data connectors allow initial- accept a stream_in connector (connected to a stream), and a qui- ization/copying of the data stored in the queues from/to an array, escence condition after which processing stops. Thus, a consume push and pop enqueue and dequeue elements, respectively. entry node is a tuple (range, cond, inputs, outputs). Semantically, a The memlet annotation represents dataflow, defined by the tu- consume scope is equivalent to a map using the same ranะดe, with ple (src_conn, dst_conn, subset, reindex, accesses, wcr). The subset an internal invoked SDFG containing the scope. This state connects function selects which subset of elements visible at the source con- back to itself without any assignments and the condition cond. nector will flow to the destination connector. The reindex function A state transition is defined as a tuple (source, destination, con- specifies at which indices the data will be visible at the destination dition, assignments). Condition can depend on symbols and data node. We express subset and reindex as functions on integer sets, from data nodes in the source state, whereas assignments take the Stateful Dataflow Multigraphs SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA form symbol = expression. Once a state finishes execution, all outgo- Element Processing In each step, we take one element q (either a ing state transitions of that state are evaluated in an arbitrary order, memlet or a node) of current, for which all input connectors have and the destination of the first transition whose condition is true is visible data, then: the next state which will be executed. If no transition evaluates to If q is a memlet (src, dst, subset, reindex, wcr), update vis[dst] to true, the program terminates. 
Before starting the execution of the wcr(reindex(subset(vis[src]))): next state, all assignments are performed, and the left-hand side of q = memlet(src, dst, subset, reindex, wcr), (vis[src], ฯ) , โˆ…, assignments become symbols. (wcr(reindex(subset(vis[src]))), ฯ) โ†’ [d0, .., dn ] (q, ฯ) โ†’ ฯ[vis[dst] 7โ†’ [d0, .., dn ]] A.2 Operational Semantics If q is a data node, update its referenced memory for an input A.2.1 Initialization. Notation We denote collections (sets/lists) as connector ci , capital letters and their members with the corresponding lowercase M[loc(id(q)),.., loc(id(q)) + size(vis[q.data])] = vis[q.data]: letter and a subscript, i.e., in an SDFG G = (S,T,s ) the set of states q = data(id, dims, bt, transient), 0 ( x, q.ci โˆˆ current : vis[q.ci ], ฯ) , โˆ…, โˆ€ i i i S as s , with 0 โ‰ค i < |S|. Without loss of generality we assume s to ( x, q.ci โˆˆ current : vis[q.ci ], ฯ) โ†’ [d , d , .., d ], i 0 โˆ€ 0 1 ki di : loc(id(q)) + offs(di ) = li be the start state. We denote the value stored at memory location a โˆ€ j j j i i i i [ ] (q, ฯ) โ†’ ฯ[ i, ki : M[l , .., l + size(d )] = d ] as M a , and assume all basic types are size-one elements to simplify โˆ€ j j j j address calculations. If q is a tasklet, generate a prologue that allocates local variables The state of execution is denoted by ฯ. Within the state we carry for all input connectors ci of q, initialized to vis[ci ] (P1), as well several sets: loc, which maps names of data nodes and transients as output connectors (P2). Generate an epilogue Ep which updates to memory addresses; sym, which maps symbol names (identifiers) ฯ[vis[ci ] 7โ†’ vi ] for each output connector ci of q with the contents to their current value; and vis, which maps connectors to the data of the appropriate variable (declared in P2). Execute the concatena- visible at that connector in the current state of execution. tion of (P1; P2;code; Ep; ). We define a helper function size(), which returns the product of q = tasklet(Cin, Cout, code), ( x, q.ci โˆˆ current : vis[q.ci ], ฯ) , โˆ…, โˆ€ all dimensions of the data node or element given as argument (using (P1 = [ ci โˆˆ Cin : type(ci )id(ci ) = vis[ci ]];, ฯ), โˆ€ P2 = [ ci โˆˆ Cout : type(ci )id(ci ); ], ฯ to resolve symbolic values). Furthermore, id() returns the name โˆ€ (Ep = [ ci โˆˆ Cout :&vis(ci ) = id(ci ); ], ฯ), โˆ€ property of a data or transient node, and offs() the offset of a data (q, ฯ) โ†’ ฯ[exec(P1; P2; code; Ep; )] element relative to the start of the memory region it is stored in. If q is a mapentry node with range y = R (y is the identifier) and The function copy() creates a copy of the object given as argument, scope o โŠ‚ V : Remove o from current. Remove q and the correspond- i.e., when we modify the copy, the original object remains the same. ing map exit node from o. For each element in ri โˆˆ R, replicate o, Invocation When an SDFG G is called with the data arguments resolve any occurrence of y to ri , connect incoming connectors of A โ‰ก [ai = pi ] (ai is an identifier, pi is an address/pointer) and q and p in state. symbol arguments Z โ‰ก [zi = vi ] (zi is an identifier, vi an integer) q = mapentry(Cin, Cout, R), cini : vis[cini ] , โˆ…, we initialize the configuration ฯ: oโˆ€ = scope(q), oโ€ฒ = scope(q)\{q, mexit(q)}, NewSyms = [cini : ยš(cini , y)] (1) For all symbols zi in Z: sym[zi ] โ† vi . 
โ€ฒ โ€ฒ (q, ฯ) โ†’ ฯ[ current 7โ†’ o โˆช [ ri : ressym(copy(o ), ri )], โˆ€ (2) For all data and stream nodes โˆˆ without incoming edges ni โˆˆ NewSym : sym[ni ] 7โ†’ vis[cini ]] di G โˆ€ s.t. id(di ) = ai : loc(di ) โ† pi ,vis(di .data) โ† M[pi ,...,pi + If q is a consume-entry node, defined by (range, cond, cin, cout), size(di )]. replace q with a mapentry and do the same for the correspond- (3) Set current to a copy of the start state of G, s0. ing consume exit node. Then we create a new SDFG new, which (4) Set state to id(s0). contains the contents of the original consume scope scope(q). new (5) Set qsize[fi ] to zero for all stream nodes fi โˆˆ G. consists of one state s0, and a single state transition to the same This can be expressed as the following rule: state with the condition cond, defined by (s0,s0,cond, []). Finally, G = (S, T ), we replace scope(q) in current with an invoke node for new and start_state(G) โ†’ s0, reconnect the appropriate edges between the entry and exit nodes. D : data nodes in s0, F : stream nodes in s0 q = consume โˆ’ entry(range, cond, cin, cout) (call(G, A, Z ), ฯ) โ†’ ฯ[ state 7โ†’ id(s0), newsdfg = SDFG(scope(q)\{q, cexit(q)}, (s0, s0, cond, [])) current 7โ†’ copy(s0), iv = invoke(newsdfg) ai = pi โˆˆ A : loc[ai 7โ†’ pi ], men = mapentry(range, cin, cout) โˆ€ zi = vi โˆˆ Z : sym[zi ] 7โ†’ vi , mex = mapexit(cexit(q).cin, cexit(q).cout) โˆ€ di โˆˆ D : vis[di .data] 7โ†’ M[pi , .., pi + size(di )], (q, ฯ) โ†’ ฯ[current 7โ†’ (current \ q, cexit(q)) โˆช {iv, men, mex}] โˆ€ fi โˆˆ G .S : qsize[id(fi )] 7โ†’ 0] โˆ€ If q is a reduce node defined by the tuple (cin, cout, range), we A.2.2 Propagating Data in a State. Execution of a state entails create a mapentry node men with the same range, a mapexit node propagating data along edges, governed by the rules defined below. mex, and a tasklet o = i. We add these nodes to the node set of Data is propagated within a state in an arbitrary order, con- current, nd(current). We connect them by adding edges to the edge strained only by dependencies expressed by edges. We begin by set of current. considering all nodes in the current state, which we add to current. q = reduce(cin, cout, range, wcr), We gradually remove nodes as they are processed, until current is vis[cin] , โˆ…, men = mapentry({IN _1}, {OUT _1}, r anะดe), empty and we can proceed to the next state or termination. mex = mapexit({IN _1}, {OUT _1}), To keep the description concise, we omit details of address trans- t = tasklet({i}, {o}, โ€œo = i;โ€) (q, ฯ) โ†’ ฯ[ nd(cur r ent ) 7โ†’ nd(cur r ent) โˆช {men, mex, t } lation and indexing. We also use the subset and reindex properties ed(cur r ent) 7โ†’ ed(cur r ent ) โˆช {(men.OUT _1, t .i), (t .o, mex .IN _1)} vis[men.IN _1] 7โ†’ vis[q.cin], as functions that resolve symbols and return data elements directly. vis[q.IN ] 7โ†’ โˆ…] SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA Ben-Nun et al.

If q is a stream, we add the data visible at the push connector to Table 4: Supported Transformations the appropriate memory location (indicated by qsize[id(q)]), and Name Description increment it. A stream node with an edge connected to its pop con- nector can only be evaluated if qsize[id(q)] โ‰ฅ 1. When propagating Map transformations through streams, there are three cases for one step: MapCollapse Collapses two nested maps into one. The new map โถ Data is both pushed into and popped from the stream: has the union of the dimensions of the original q = stream(push, pop, size), maps. vis[push] , emptyset, MapExpansion Expands a multi-dimensional map to two nested (q.pop, dst) โˆˆ cur r ent, โˆƒฯ[qsize(id(q)) โ‰ฅ 1] ones. The dimensions are split to two disjoint sub- (q, ฯ) โ†’ ฯ[ vis[q.pop] 7โ†’ M[offs(q), .., offs(q) + size(q)], sets, one for each new map. M[offs(q) + size(q) ร— qsize[q], .., offs(q) + size(q) ร— (qsize[q] + 1)] 7โ†’ vis[push], ] MapFusion Fuses two consecutive maps that have the same โท Data is only pushed to the stream node: dimensions and range. MapInterchange q = stream(push, pop, size), Interchanges the position of two nested maps. vis[push] , emptyset, MapReduceFusion Fuses a map and a reduction node with the same ยš( . , ) โˆˆ q pop dst cur r ent dimensions, using conflict resolution. (q, ฯ) โ†’ ฯ[ M[offs(q) + size(q)qsize[q], .., offs(q) + size(q)(qsize[q] + 1)] 7โ†’ vis[push], MapTiling Applies orthogonal tiling to a map. qsize[q] 7โ†’ qsize[q] + 1] โธ Data is popped from the stream but not pushed into it: Data transformations q = stream(push, pop, size), DoubleBuffering Pipelines writing to and processing from a tran- ยš(sr c, q.push) โˆˆ Sst at e , (q.pop, dst) โˆˆ cur r ent sient using two buffers. โˆƒ (q, ฯ) โ†’ ฯ[ vis[q.pop] 7โ†’ M[offs(q), .., offs(q) + size(q)], LocalStorage Introduces a transient for caching data. qsize[q] 7โ†’ qsize[q] โˆ’ 1] LocalStream Accumulates data to a local transient stream. Following the element processing step, we remove q from current, Vectorization Alters the data accesses to use vectors. repeating the above step until current is empty. Control-flow transformations A.2.3 Evaluating State Transitions. Once current is empty, we eval- MapToForLoop Converts a map to a for-loop. uate all outgoing state transitions of the current state: StateFusion Fuses two states into one. ( ) โˆˆ state,next,cond, assiะดs T . For each transition, we resolve all InlineSDFG Inlines a single-state nested SDFG into a state. symbols in cond and the right-hand sides of assiะดs using ฯ, then evaluate arithmetic and boolean expressions using standard seman- Hardware mapping transformations tic rules, which we omit here. If no condition evaluates to true, FPGATransform Converts a CPU SDFG to be fully invoked on an signal the completion of G to the caller and stop the evaluation of FPGA, copying memory to the device. G: GPUTransform Converts a CPU SDFG to run on a GPU, copying memory to it and executing kernels. current = โˆ…, ยš(ฯ[st at e], next, cond, assiะดs) โˆˆ T : (cond, ฯ) โ†’ T rue MPITransform Converts a CPU Map to run using MPI, assigning ฯ[st at e 7โ†’ โˆ…] work to ranks. Otherwise, we choose an arbitrary transition for which cond โ†’ true and update ฯ: Set state to next, set curr to a copy of S[next]. For each left-hand side of an assignment zi , update sym[zi ] with B DATA-CENTRIC GRAPH the value of the corresponding right-hand side vi . 
Data propagation TRANSFORMATIONS then follows Section A.2.2: Table 4 lists the transformations used in the Performance Evaluation current= โˆ…, (ฯ[st at e], next, cond, assiะดs) โˆˆ T : (cond, ฯ) โ†’ T rue, section of the paper. โˆƒ (assiะดs, ฯ) โ†’ [zi = vi ] ฯ[ st at e 7โ†’ next, cur r ent 7โ†’ copy(next), r sym[zi ] 7โ†’ vi ] L R


Figure 21: Single-Pushout Approach

The implementation of graph transformations is based on al- gebraic graph rewriting [27]. Each transformation is implemented r by a rule p : L โˆ’โ†’ R, which consists of the two sub-graphs L and R, while r is a relation that maps the vertices and edges of L to elements of the same kind in R. Moreover, a specific matching of L in the SDFG G is represented by a relation m : L โ†’ G. Applying the optimization on G produces the transformed graph H, which can be constructed as part of the pushout < H,mโ€ฒ,r โ€ฒ > of p, G and m.A Stateful Dataflow Multigraphs SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA visualization of the above method, also known as the single-pushout 1 class RedundantArray(pm.Transformation): approach [49], is shown in Fig. 21. 2 """ Implements the redundant array removal transformation, 3 applied whena transient array is copied to and from(to 4 another array), but never used anywhere else.""" C POLYBENCH FLAGS 5 6 _in_array = nodes.AccessNode('_') We compile Polybench code for all compilers using the following 7 _out_array = nodes.AccessNode('_') base flags: -O3 -march=native -mtune=native. Each compiler 8 9 @staticmethod was also tested with different variants of flags, where we report 10 def expressions(): 11 return [ the best-performing result in the paper. The flags are (variants 12 nxutil.node_path_graph(RedundantArray._in_array, separated by semicolons): 13 RedundantArray._out_array), 14 ] โ€ข gcc 8.2.0: -O2; Base flags 15 โ€ข 16 @staticmethod clang 6.0: Base flags 17 def can_be_applied(graph, candidate, expr_index, sdfg, โ€ข icc 18.0.3: Base flags; -mkl=parallel -parallel 18 strict = False ): 19 in_array = \ โ€ข Polly (over clang 6.0): -mllvm -polly; 20 graph.nodes()[candidate[RedundantArray._in_array]] -mllvm -polly -mllvm -polly-parallel -lgomp 21 out_array = \ 22 graph.nodes()[candidate[RedundantArray._out_array]] โ€ข Pluto 0.11.4: -ftree-vectorize -fopenmp and 23 among best of --tile; --tile --parallel; 24 # Ensure out degree is one(only one target, out_array) 25 if graph.out_degree(in_array) != 1: --tile --parallel --partlbtile; 26 return False 27 --tile --parallel --lbtile --multipar 28 # Make sure that the candidate isa transient variable โ€ข PPCG 0.8: Base flags 29 if not in_array.desc(sdfg).transient: 30 return False All variants were also tested with compile-time size specializa- 31 32 # Make sure both arrays are using the same storage location tion. 33 if (in_array.desc(sdfg).storage != 34 out_array.desc(sdfg).storage): 35 return False D TRANSFORMATION EXAMPLE: 36 37 # Find occurrences in this and other states REDUNDANT ARRAY REMOVAL 38 occurrences = [] A simple transformation implemented in DaCe is RedundantArray, 39 for state in sdfg.nodes(): 40 occurrences.extend([ whose code can be found below. The transformation removes a 41 n for n in state.nodes() 42 if isinstance (n, nodes.AccessNode) transient array that is used directly before another array, creating 43 and n.desc(sdfg) == in_array.desc(sdfg) a copy, and not used anywhere else (making the copy redundant). 44 ]) 45 This situation often happens after transformations and due to the 46 if len(occurrences) > 1: strict nature of some language frontends (e.g., TensorFlow). The 47 return False 48 subgraph expression to match (lines 6โ€“14) is a path graph of size 49 # If strict, only apply if arrays are of same shape two, connecting the two access nodes. 
The nodes have no restric- 50 #(no need to modify memlet subset) 51 if (strict and (len(in_array.desc(sdfg).shape) != tion on their content (see lines 6โ€“7). The function can_be_applied 52 len(out_array.desc(sdfg).shape)) 53 or any(i != o for i, o in zip( is called on a matching subgraph for further programmatic checks 54 in_array.desc(sdfg).shape, (lines 16โ€“58), and the function apply (lines 60โ€“78) applies the trans- 55 out_array.desc(sdfg).shape))): 56 return False formation using the SDFG builder API. The checks ensure that the 57 array is indeed transient and not used in other instances of data 58 return True 59 access nodes. To avoid recomputing subsets (which may not be 60 def apply (self, sdfg): feasible to compute symbolically), if the transformation operates in 61 graph = sdfg.nodes()[self.state_id] 62 in_array = gnode(RedundantArray._in_array) strict mode, it only matches two arrays of the same shape (lines 51โ€“ 63 out_array = gnode(RedundantArray._out_array) 64 56). The transformation then operates in a straightforward manner, 65 for e in graph.in_edges(in_array): renaming the memlets to point to the second (not removed) array 66 # Modify all incoming edges to point to out_array 67 path = graph.memlet_path(e) (lines 66โ€“70) and redirecting dataflow edges to that data access 68 for pe in path : node (lines 73โ€“74). Lastly, the in_array node is removed from the 69 if pe.data.data == in_array.data: 70 pe.data.data = out_array.data SDFG state (line 78). 71 In line 81, the transformation is registered with a global registry. 72 # Redirect edge to out_array 73 graph.remove_edge(e) This process enables users to import custom transformations and in- 74 graph.add_edge(e.src, e.src_conn, out_array, 75 e.dst_conn, e.data) tegrate them into DaCe easily. However, because RedundantArray 76 is a strict transformation (i.e., can only improve performance), it is 77 # Finally, remove in_array node 78 graph.remove_node(in_array) also hardcoded in a set of such transformations within the SDFG 79 class. Strict transformations are applied automatically after pro- 80 # Register transformation in global registry 81 pm.Transformation.register_pattern(RedundantArray) cessing a DaCe program, and other such transformations include StateFusion and InlineSDFG. SC โ€™19, November 17โ€“22, 2019, Denver, CO, USA Ben-Nun et al.
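To make the matching step m : L → G of Fig. 21 concrete, the following self-contained sketch enumerates occurrences of a two-node path pattern in a toy directed graph using networkx. It is an illustration of subgraph pattern matching in general, not of DaCe's internal implementation, and the node names are invented for the example:

    import networkx as nx
    from networkx.algorithms import isomorphism

    # Toy "state" graph G with three dataflow edges
    G = nx.DiGraph()
    G.add_edges_from([('tasklet', 'tmp'), ('tmp', 'out_array'), ('out_array', 'reader')])

    # Pattern L: a path of two nodes, analogous to the subgraph returned by
    # RedundantArray.expressions() (two access nodes connected by an edge)
    L = nx.DiGraph()
    L.add_edge('_in_array', '_out_array')

    # Each candidate matching maps nodes of G to nodes of L; in DaCe, candidates
    # are then filtered by can_be_applied() before apply() rewrites the graph
    matcher = isomorphism.DiGraphMatcher(G, L)
    for m in matcher.subgraph_isomorphisms_iter():
        print(m)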

E GRAPH DATASETS Table 5 lists the graph datasets used in the BFS case study of the paper.

Table 5: Graph Properties

Name                   Nodes   Edges    Avg. Degree   Max Degree   Size (GB)
Road Maps
USA [1]                24M     58M      2.41          9            0.62
OSM-eur-k [2]          174M    348M     2.00          15           3.90
Social Networks
soc-LiveJournal1 [24]  5M      69M      14.23         20,293       0.56
twitter [16]           51M     1,963M   38.37         779,958      16.00
Synthetic Graphs
kron21.sym [5]         2M      182M     86.82         213,904      1.40

F INDIRECT MEMORY ACCESS

Indirect memory access, i.e., A[b[i]], is an important characteristic of sparse data structures and algorithms. Indirection cannot be directly represented by a single memlet. As shown in Fig. 22 (an excerpt of Sparse Matrix-Vector Multiplication), such statements are converted to a subgraph, where the internal access is given exactly (A_col[j] in the figure), and the indirectly accessed memory is given by a memlet with one access (x(1)[:]). The memory is then accessed dynamically and copied using a tasklet.

@dace.map
def spmv(j: _[prow[0]:prow[1]]):
    a << A_val[j]
    x_in << x[A_col[j]]
    out >> b(1, lambda x,y: x+y, 0)[i]
    out = a * x_in

Figure 22: Indirect Memory Access Dataflow
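For reference, the excerpt in Fig. 22 corresponds to the following plain Python row loop; the array names follow the figure (CSR values A_val, column indices A_col, and the row-pointer pair prow for row i), and the sketch only restates the computation, not the generated SDFG:

    def spmv_row(A_val, A_col, prow, x, b, i):
        # prow holds the row-pointer pair [start, end) of row i
        for j in range(prow[0], prow[1]):
            # x[A_col[j]] is the indirect access discussed above; the sum over j
            # corresponds to the write-conflict resolution (Sum) on b[i]
            b[i] += A_val[j] * x[A_col[j]]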