HPVM: Heterogeneous Parallel Virtual Machine

HPVM: Heterogeneous Parallel Virtual Machine Maria Kotsifakou∗ Prakalp Srivastava∗ Matthew D. Sinclair Department of Computer Science Department of Computer Science Department of Computer Science University of Illinois at University of Illinois at University of Illinois at Urbana-Champaign Urbana-Champaign Urbana-Champaign [email protected] [email protected] [email protected] Rakesh Komuravelli Vikram Adve Sarita Adve Qualcomm Technologies Inc. Department of Computer Science Department of Computer Science [email protected]. University of Illinois at University of Illinois at com Urbana-Champaign Urbana-Champaign [email protected] [email protected] Abstract hardware, and that runtime scheduling policies can make We propose a parallel program representation for heteroge- use of both program and runtime information to exploit the neous systems, designed to enable performance portability flexible compilation capabilities. Overall, we conclude that across a wide range of popular parallel hardware, including the HPVM representation is a promising basis for achieving GPUs, vector instruction sets, multicore CPUs and poten- performance portability and for implementing parallelizing tially FPGAs. Our representation, which we call HPVM, is a compilers for heterogeneous parallel systems. hierarchical dataflow graph with shared memory and vector CCS Concepts • Computer systems organization → instructions. HPVM supports three important capabilities for Heterogeneous (hybrid) systems; programming heterogeneous systems: a compiler intermediate representation (IR), a virtual instruction set (ISA), and Keywords Virtual ISA, Compiler, Parallel IR, Heterogeneous a basis for runtime scheduling; previous systems focus on Systems, GPU, Vector SIMD only one of these capabilities. As a compiler IR, HPVM aims to enable effective code generation and optimization for het- 1 Introduction erogeneous systems. As a virtual ISA, it can be used to ship Heterogeneous parallel systems are becoming increasingly executable programs, in order to achieve both functional popular in systems ranging from portable mobile devices to portability and performance portability across such systems. high-end supercomputers to data centers. Such systems are At runtime, HPVM enables flexible scheduling policies, both attractive because they use specialized computing elements, through the graph structure and the ability to compile indi- including GPUs, vector hardware, FPGAs, and domain- vidual nodes in a program to any of the target devices on a specific accelerators, that can greatly improve energy ef- system. We have implemented a prototype HPVM system, ficiency, performance, or both, compared with traditional defining the HPVM IR as an extension of the LLVM compiler homogeneous systems. A major drawback, however, is that IR, compiler optimizations that operate directly on HPVM programming heterogeneous systems is extremely challeng- graphs, and code generators that translate the virtual ISA to ing at multiple levels: algorithm designers, application devel- NVIDIA GPUs, Intel’s AVX vector units, and to multicore opers, parallel language designers, compiler developers and X86-64 processors. Experimental results show that HPVM op- hardware engineers must all reason about performance, scal- timizations achieve significant performance improvements, ability, and portability across many different combinations HPVM translators achieve performance competitive with of diverse parallel hardware. manually developed OpenCL code for both GPUs and vector At a fundamental level, we believe these challenges arise from three root causes: (1) diverse hardware parallelism mod- ∗The first two authors contributed equally to the paper. els; (2) diverse memory architectures; and (3) diverse hard- Permission to make digital or hard copies of part or all of this work for ware instruction sets. Some widely used heterogeneous sys- personal or classroom use is granted without fee provided that copies are tems, such as GPUs, partially address these problems by not made or distributed for profit or commercial advantage and that copies defining a virtual instruction set (ISA) spanning one or more bear this notice and the full citation on the first page. Copyrights for third- families of devices, e.g., PTX for NVIDIA GPUs, HSAIL for party components of this work must be honored. For all other uses, contact GPUs from several vendors and SPIR for devices running the owner/author(s). OpenCL. Software can be shipped in virtual ISA form and PPoPP ’18, February 24–28, 2018, Vienna, Austria © 2018 Copyright held by the owner/author(s). then translated to the native ISA for execution on a supported ACM ISBN 978-1-4503-4982-6/18/02. device within the target family at install time or runtime, https://doi.org/10.1145/3178487.3178493 thus achieving portability of “virtual object code” across the PPoPP ’18, February 24–28, 2018, Vienna, Austria Kotsifakou, Srivastava, Sinclair, Komuravelli, V. Adve, and S. Adve corresponding family of devices. Except for SPIR, which is The parallel program representation we propose is a hi- essentially a lower-level representation of the OpenCL lan- erarchical dataflow graph with shared memory. The graph guage, these virtual ISAs are primarily focused on GPUs nodes can represent either coarse-grain or fine-grain compu- and do not specifically address other hardware classes, like tational tasks, although we focus on moderately coarse-grain vector hardware or FPGAs. Moreover, none of these virtual tasks (such as an entire inner-loop iteration) in this work. The ISAs aim to address the other challenges, such as algorithm dataflow graph edges capture explicit data transfers between design, language design, and compiler optimizations, across nodes, while ordinary load and store instructions express diverse heterogeneous devices. implicit communication via shared memory. The graph is We believe that these challenges can be best addressed by hierarchical because a node may contain another dataflow developing a single parallel program representation flexible graph. The leaf nodes can contain both scalar and vector enough to support at least three different purposes: (1) A com- computations. A graph node represents a static computation, piler intermediate representation, for compiler optimizations and any such node can be “instantiated” in a rectangular grid and code generation for diverse heterogeneous hardware. of dynamic node instances, representing independent parallel Such a compiler IR must be able to implement a wide range instances of the computation (in which case, the incident of different parallel languages, including general-purpose edges are instantiated as well, as described later). ones like OpenMP, CUDA and OpenCL, and domain-specific The hierarchical dataflow graphs naturally capture all the ones like Halide and TensorFlow. (2) A virtual ISA, to allow important kinds of coarse- and fine-grain data and task par- virtual object code to be shipped and then translated down allelism in heterogeneous systems. In particular, the graph to native code for different heterogeneous system configu- structure captures coarse-grain task parallelism (including rations. This requirement is essential to enable application pipelined parallelism in streaming computations); the graph teams to develop and ship application code for multiple de- hierarchy captures multiple levels and granularities of nested vices within a family. (3) A representation for runtime sched- parallelism; the node instantiation mechanism captures ei- uling, to enable flexible mapping and load-balancing policies, ther coarse- or fine-grain SPMD-style data parallelism; and in order to accommodate static variations among different explicit vector instructions within leaf nodes capture fine- compute kernels and dynamic variations due to external ef- grain vector parallelism (this can also be generated by auto- fects like energy fluctuations or job arrivals and departures. matic vectorization across independent node instances). We believe that a representation that can support all these We describe a prototype system (also called HPVM) that three capabilities could (in future) also simplify parallel algo- supports all three capabilities listed earlier. The system de- rithm development and influence parallel language design, fines a compiler IR as an extension of the LLVM IR[30] by although we do not explore those in this work. adding HPVM abstractions as a higher-level layer describing In this paper, we propose such a parallel program repre- the parallel structure of a program. sentation, Heterogeneous Parallel Virtual Machine (HPVM), As examples of the use of HPVM as a compiler IR, we and evaluate it for three classes of parallel hardware: GPUs, have implemented two illustrative compiler optimizations, SIMD vector instructions, and multicore CPUs. (In ongoing graph node fusion and tiling, both of which operate directly work, we are also targeting FPGAs using the same program on the HPVM dataflow graphs. Node fusion achieves “kernel representation.) Our evaluation shows that HPVM can serve fusion”, and the graph structure makes it explicit when it all three purposes listed above: a compiler IR, a virtual ISA, is safe to fuse two or more nodes. Similarly (and somewhat and a scheduling representation, as described below. surprisingly), we find that the graph hierarchy is alsoan No previous system we know of can achieve all three pur- effective and portable method to capture tiling of computa- poses,

HPVM: Heterogeneous Parallel Virtual Machine

A Case for High Performance Computing with Virtual Machines

Introduction to Openacc 2018 HPC Workshop: Parallel Programming

GPU Computing with Openacc Directives

Openacc Course October 2017. Lecture 1 Q&As

Exploring Massive Parallel Computation with GPU

Dcuda: Hardware Supported Overlap of Computation and Communication

On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the Gvirtus Framework

Multi-Threaded GPU Accelerration of ORBIT with Minimal Code

Openacc Getting Started Guide

Productive High Performance Parallel Programming with Auto-Tuned Domain-Speciﬁc Embedded Languages

Introduction to GPU Programming with CUDA and Openacc

Parallel Processing Using PVM on a Linux Cluster