Embracing Heterogeneity — Parallel Programming for Changing Hardware

Michael D. Linderman, James Balfour, Teresa H. Meng and William J. Dally
Center for Integrated Systems and Computer Systems Laboratory, Stanford University
{mlinderm, jbalfour}@stanford.edu

Abstract

Computer systems are undergoing significant change: to improve performance and efficiency, architects are exposing more microarchitectural details directly to programmers. Software that exploits specialized accelerators, such as GPUs, and specialized processor features, such as software-controlled memory, exposes limitations in existing compiler and OS infrastructure. In this paper we propose a pragmatic approach, motivated by our experience with Merge [3], for building applications that will tolerate changing hardware. Our approach allows programmers to leverage different processor-specific or domain-specific toolchains to create software modules specialized for different hardware configurations, and it provides language mechanisms to enable the automatic mapping of the application to these processor-specific modules. We show this approach can be used to manage computing resources in complex heterogeneous processors and to enable aggressive compiler optimizations.

Figure 1: Sketch of the Merge framework. An application is built with high-level parallel programming tools where available, compiled by multiple processor- or domain-specific compilers, and automatically mapped by a selection engine to processor-specific software modules running on the CPU and accelerators.

1 Introduction

Heterogeneous computer systems, which may integrate GPUs, FPGAs and other accelerators alongside conventional CPUs, offer significantly better performance and efficiency. However, they often do so by exposing to programmers architectural mechanisms, such as low-latency scratchpad memories and inter-processor interconnect, that are either hidden or unavailable in general-purpose CPUs. The software that executes on these accelerators often bears little resemblance to its CPU counterpart: source languages and assembly differ, and often entirely different algorithms are needed to exploit the capabilities of the different hardware.

The ISAs of commodity general-purpose processors have changed remarkably little during the past 30 years. Decades-old software still runs correctly, and fast, on modern processors. Unlike their ISAs, processor microarchitectures and system architectures have changed significantly. As modern architectures expose more microarchitectural and system details to software to improve performance and efficiency, programmers are no longer insulated from the evolution of the underlying hardware. Programming models need to be inclusive of different processor architectures, and tolerant of continual, often radical, changes in hardware.

To exploit these new and different hardware resources, a diverse set of vendor-specific, architecture-specific and application-specific programming models has been, and continues to be, developed. The rapid evolution of hardware ensures that programming models will continue to be developed at a torrid pace. Integrating different toolchains, whether from different vendors or using different high-level semantics, remains a challenge. However, integrating many narrowly-focused tools is more effective than attempting to craft a single all-encompassing solution; consequently, that is the approach we take.

In this paper, we present a methodology, motivated by our experiences with the Merge framework [3], for building programs that target diverse and evolving heterogeneous multicore systems. Our approach, summarized in Figure 1, automatically maps applications to specialized software modules, implemented with different processor-specific or domain-specific toolchains.
Specialized domain-specific languages and accelerator-specific assembly are encapsulated in C/C++ functions to provide a uniform interface and an inclusive abstraction for computations of any complexity. Different implementations of a function are bundled together, creating a layer of indirection between the caller and the implementation that facilitates the mapping between application and implementation.

Section 2 motivates the use of encapsulation and bundling, summarizing and updating the techniques first described in [3]. Sections 3 and 4 present our most recent work, in which we show how encapsulation and bundling can be used to effectively manage computing resources in complex heterogeneous systems and enable aggressive compiler optimizations.

2 An Extensible Programming Model

The canonical compiler reduces a computation expressed in some high-level language to a small, fixed set of primitive operations that abstract the capabilities of the target hardware. Compilation and optimization strategies are biased by the choice of primitive operations. Optimizations developed for one set of primitives are often of limited use when the primitive operations fail to abstract important aspects of the target hardware or application.

Unfortunately, no one set of primitive operations can effectively abstract all of the unique and specialized capabilities provided by modern hardware. For instance, the capabilities of scalar processors are represented well by three-address operations on scalar operands; the capabilities of SIMD processors, such as Cell, GPUs and SSE units, are better represented by short-vector operations; and the capabilities of FPGAs are better represented by binary decision diagrams and data flow graphs with variable-precision operands. Much as the limitations of scalar primitives motivated the adoption of short-vector primitives in compilers targeting SIMD architectures, compilers that target complex accelerators such as FPGAs will find representations based on simple scalar and short-vector primitives limiting and ineffective.

We argue that nascent parallel programming systems should allow software that uses different programming models and primitives to be integrated simply and efficiently. These systems require variable and inclusive primitives: primitives that can abstract computational features of any complexity (variable), and for any architecture or using any programming model (inclusive).

2.1 Encapsulating Specialized Code

Fortunately, programming languages already provide variable and inclusive primitives: functions. Programming systems such as EXOCHI [7] and CUDA [5] allow programmers to inline domain-specific languages (DSLs) and accelerator-specific assembly into C-like functions, thereby creating a uniform interface, compatible with existing software infrastructure, that is independent of the actual implementation. Figure 2 shows an example in which kmeans is implemented using combinations of standard C, a DSL, and GPU-specific assembly. All versions present the same interface and all appear to the caller to execute in the CPU memory space. The proxy layer (e.g., EXOCHI, CUDA) provides the data transfer and other runtime infrastructure needed to support the interaction between the CPU and the accelerator.

Figure 2: Encapsulation of inline accelerator-specific assembly or domain-specific languages. The kmeans(...) interface dispatches through the proxy layer to implementations written in a domain-specific language, GPU code, and plain C code.

These enhanced functions, which we term function-intrinsics, are conceptually similar to existing compiler intrinsics, such as those used to represent SSE operations. Unlike conventional intrinsics, programmers are not limited to a small fixed set of operations; instead, programmers can create intrinsics for operations of any complexity, for any architecture and using any programming model supported by a proxy interface. When programmers use a non-C language, such as GPU assembly, the appropriate compiler is invoked and the resulting binary (or an intermediate representation and a just-in-time compiler or interpreter) is packaged into the application binary.
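To make the encapsulation concrete, the following minimal C++ sketch shows two variants of a kmeans-style function-intrinsic presenting one uniform interface. It illustrates the idea rather than Merge's actual syntax: the Matrix type, the variant names and the commented-out proxy call are assumptions made for the example.

    #include <cstddef>
    #include <vector>

    // A simple value type visible to every variant; all variants appear
    // to the caller to operate entirely in CPU memory.
    struct Matrix {
        std::size_t rows = 0, cols = 0;
        std::vector<float> data;   // row-major storage
    };

    // Reference implementation in standard C++; always available.
    Matrix kmeans_cpu(const Matrix& points, int k) {
        Matrix centroids;
        centroids.rows = static_cast<std::size_t>(k);
        centroids.cols = points.cols;
        centroids.data.assign(centroids.rows * centroids.cols, 0.0f);
        // ... standard Lloyd's iterations over `points` would go here ...
        return centroids;
    }

    // Accelerator variant with the same signature. In a real bundle the
    // body would be inline GPU assembly or DSL code; the proxy layer
    // (EXOCHI, CUDA) would move `points` to the accelerator and the
    // result back, so the caller still sees only CPU memory.
    Matrix kmeans_gpu(const Matrix& points, int k) {
        // proxy_launch("kmeans_kernel", points, k);  // hypothetical proxy call
        return kmeans_cpu(points, k);                 // placeholder for the sketch
    }

Because both variants are ordinary functions with identical signatures, they can be produced by entirely different toolchains and still be invoked interchangeably by the caller.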
2.2 A Concurrent Function Call ABI

Using the function call interface to integrate specialized implementations is actually common. For example, most systems ship with a version of the C standard library that is optimized for that particular platform. Often the optimized implementation includes machine-specific assembly and operating-system-specific system calls. We extend this approach beyond a few standardized libraries: we believe programmers will need to extend and specialize many different APIs to exploit different hardware efficiently.

The simple and complete definition of the C function call ABI provides a reasonable starting point, but it must be enhanced to provide the guarantees needed for correct concurrent execution. Additional restrictions are required to ensure different implementations of the same function can be invoked interchangeably, independently and potentially concurrently. Thus, we require that all function-intrinsics be independent and potentially concurrent; only access data passed as arguments; execute atomically with regard to each other; and limit direct communication to call and return operations (a minimal sketch of this contract appears at the end of this subsection).

From the perspective of the function caller on the CPU, the computation starts and completes on the CPU, and all communication occurs through the CPU. Accordingly, we make the CPU and its memory space the hub of the system (Figure 3). This organization reflects the typical construction of computer systems, in which the CPU coordinates activities throughout the system.

Figure 3: Relationship between different accelerators (GPU, crypto, FPGA, video coding) and the CPU, which acts as a hub: accelerators execute through proxies in the CPU memory space, with inter-accelerator communication alongside.
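The following deliberately trivial sketch restates that contract on a concrete function; the function itself is invented for illustration and is not drawn from Merge.

    #include <cstddef>

    // A conforming function-intrinsic:
    //  1. independent and potentially concurrent -- touches no shared state;
    //  2. only accesses data passed as arguments -- no globals, no I/O;
    //  3. atomic with respect to other intrinsics -- its effects are
    //     confined to its arguments and return value;
    //  4. communicates only through call and return.
    float dot_product(const float* a, const float* b, std::size_t n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            sum += a[i] * b[i];
        return sum;   // the only channel back to the caller
    }

Any variant that obeys this contract, whatever language or toolchain produced it, can be substituted by the dispatch machinery described next.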
2.3 Bundling Function Intrinsics

Since any one implementation of a function may not be the most efficient for all inputs, multiple implementations should be allowed to coexist. Dynamically selecting which implementation to use allows an application to perform well across different workloads and different platforms. Conventional systems use combinations of static and dynamic techniques (#ifdef, dynamic/static linking, if-else blocks) to select implementations. For example, the C standard library is specialized through system-specific linking. However, as the diversity of heterogeneous systems increases and systems with multiple accelerators become commonplace, the number of available implementations will make such approaches impractical. The problem is particularly acute if programmers must manually select implementations.

We replace the current ad-hoc static and dynamic selection process with a unified approach built around predicate dispatch [4, 6]. Predicate dispatch subsumes single and multiple dispatch, conditioning invocation on a boolean predicate function over the argument types, values, and system configuration. A programmer supplies a set of annotations with each function implementation. These annotations provide a common mechanism for describing invariants for a given function, and are independent of the programming model used to implement the particular function intrinsic.

There are three classes of annotations: input restrictions, which are boolean restrictions on the input arguments (e.g. data set size < 10000); configuration restrictions, which specify the necessary compute resources (e.g. availability of a suitable GPU); and traits, which describe properties that are useful to users of the function (e.g. associativity). At compile time, when function variants implementing the same computation are bundled together, the annotations are analyzed and translated into a set of dispatch wrapper functions that implement the generic function interface and provide introspection into the variants available in the bundle.

The dispatch wrappers can be used to automatically select an implementation, freeing the programmer from having to manually map an application to particular function-intrinsics. A particular variant is selected by evaluating the annotations for each function-intrinsic until a variant whose annotation predicates evaluate to true is found. In addition to ensuring that only applicable function variants are invoked, the dispatch wrappers provide basic load balancing. The dispatch system checks the dynamic availability of the requested resources before invoking a variant. Thus, it will not, for example, invoke a function-intrinsic that targets the GPU if the GPU is being used by the graphics subsystem. Variants are ordered by annotation specificity, performance, and programmer-supplied hints [3].

In its simplest use, the dispatch system transparently selects a particular function intrinsic. The objective function used in the scheduling algorithm, greedy selection based on the ordering described above, is implicit in the implementation of the dispatch wrappers. The tradeoff is that the compiler and runtime must infer a "good" objective function for a particular application and set of machine configurations. However, the results presented in [3] show that good performance can be achieved using these very simple inferred objective functions. For those programmers and programs that require more control, alternate scheduling approaches, such as the one described below, could be layered on top. By making automatic and transparent selection the default, non-expert programmers are not obligated to immerse themselves in the details of the particular specialized function-intrinsics that might be available.

In more advanced usage, the programmer might explicitly use the introspection capabilities offered by the dispatch wrappers to implement additional functionality, such as more sophisticated schedulers, on top of the core bundling infrastructure. For example, the Harmony programming model [1] implements first-to-finish scheduling of different kernels onto heterogeneous compute resources. Presently, at scheduling time, the Harmony runtime computes the intersection between the implementations available and the installed processors to determine the set of kernels over which the computation can be scheduled. Using the Merge bundle system, a system like Harmony could provide more comprehensive specialization: instead of a single predicate (processor architecture), the intersection can include an arbitrary set of conditions on the input or machine configuration. In this usage model, the Harmony-like runtime would explicitly query the function bundles for all applicable implementations, and then choose among them based on its own scheduling algorithm. Both the generated dispatch loop and this introspective usage are sketched below.

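The sketch below illustrates, in plain C++, both the generated dispatch loop and the introspective usage described above. It is a hand-written stand-in for what Merge's compiler generates from the annotations; the CallContext fields, the predicate encoding and the helper names are assumptions made for the example.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Inputs a predicate may inspect: argument properties and machine state.
    struct CallContext {
        std::size_t data_set_size = 0;
        bool gpu_available = false;     // configuration-restriction input
    };

    struct Variant {
        std::function<bool(const CallContext&)> applicable;  // input + config predicates
        bool associative;                                    // a trait, visible to callers
        std::function<void(const CallContext&)> run;
    };

    // Stand-in for a generated dispatch wrapper: try variants in order of
    // specificity/performance and invoke the first applicable one.
    void dispatch(const std::vector<Variant>& bundle, const CallContext& ctx) {
        for (const Variant& v : bundle)
            if (v.applicable(ctx)) { v.run(ctx); return; }
    }

    // Stand-in for the introspection interface: a scheduler can ask for
    // every applicable variant and apply its own policy instead.
    std::vector<const Variant*> applicable_variants(
            const std::vector<Variant>& bundle, const CallContext& ctx) {
        std::vector<const Variant*> out;
        for (const Variant& v : bundle)
            if (v.applicable(ctx)) out.push_back(&v);
        return out;
    }

    // Example bundle: a GPU variant guarded by configuration and input
    // restrictions, followed by an always-applicable C/C++ fallback.
    const std::vector<Variant> kmeans_bundle = {
        { [](const CallContext& c) {
              return c.gpu_available && c.data_set_size >= 10000; },
          true,
          [](const CallContext&) { /* launch the GPU function-intrinsic */ } },
        { [](const CallContext&) { return true; },
          true,
          [](const CallContext&) { /* run the plain C/C++ variant */ } },
    };

A first-to-finish scheduler in the style of Harmony [1] could launch several of the variants returned by applicable_variants on different processors and keep whichever result arrives first.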
Figure 4: Combining accelerators (a) to create new resource sets to achieve performance guarantees, or (b) to exploit dedicated communication resources. In both organizations, CPUs and DRAM are joined through integration logic (direct connect or chipset).

3 Managing Resource Sets

Extensive resource virtualization in commodity general-purpose processors has allowed programmers to largely ignore resource management. However, the hardware required for virtualization, such as TLBs, is expensive and rarely implemented in accelerators, such as GPUs. For example, CUDA programmers must explicitly manage the GPU's scratchpad memory. For the same efficiency reasons, embedded systems often do not virtualize hardware resources; programmers must explicitly allocate resources, such as memory and bandwidth, in modern heterogeneous SoCs. However, for a number of embedded applications, notably cell phones, market pressures favor opening systems to third-party application programmers, bringing issues of resource protection and allocation to the forefront.

General-purpose processor-like virtualization is ineffective for heterogeneous systems. In the current model, the programmer can only control a virtualized time-slice on a single core, which is insufficient for managing small software-controlled memories or bandwidth to a shared resource, such as a crypto accelerator. To efficiently exploit diverse hardware resources, programmers need to be able to assemble more complex resource sets. For example, two processors that share a dedicated communication link, as shown in Figure 4, can be scheduled and managed as a single resource. However, allocation cannot be all-or-none if resources are to be shared among multiple clients. For example, allocating an entire accelerator, such as the GPU, to a single process is wasteful if the process cannot fully utilize it. Flexible resource sets, a compromise between current general-purpose and embedded approaches, can address this problem.

Flexible resource sets allow programmers to assemble multiple, otherwise independent resources into a single unit when needed. We can consider each resource set to be a unique hardware resource, and sometimes even a different class of processor that might favor a different programming model. For example, tiled architectures might be treated as many independent tiles and programmed using existing threading frameworks (e.g., POSIX threads), or might be treated as a single coordinated systolic array and programmed using a streaming language [2]. And between these extremes, there are usage models that blend the high-level streaming language with custom-implemented kernels that use low-level threading primitives. The flexible encapsulation, annotations and function overloading provide the necessary compiler infrastructure to support flexible resource sets.

Different programming models, possibly targeting different resource sets, can be encapsulated in C/C++ functions. The proxy layer, shown in Figure 2, allows resources that an OS normally considers independent to be grouped into a single OS resource in which most of the resources are explicitly managed by the programmer [8]. For example, n cores appear to the OS as one, with system calls for the n − 1 cores proxied through the one exposed to the OS. The configuration annotations allow programmers to tell the compiler and runtime what resources are required for each function intrinsic.

Without virtualization, resource allocation requests are more likely to fail. Applications must include tedious and error-prone boiler-plate code to test the availability of heterogeneous resources, allocate them, and recover from allocation failures. Predicate dispatch, controlled by the configuration annotations, replaces this ad-hoc approach to resource management. The compiler translates the annotations into calls into the appropriate driver to query availability and allocate resources. If any part of the request fails, the runtime can automatically invoke alternate implementations provided in the function bundles. New or different fallback implementations can be integrated as new function-intrinsics; no changes to existing code, such as adding if-else statements to explicitly control fallback on failure, are required (see the sketch below).
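A minimal sketch of this idea, with invented types throughout: the resource enumeration and the driver query are placeholders for whatever the real drivers expose, and in Merge the equivalent of try_allocate would be derived from the configuration annotations rather than written by hand.

    #include <vector>

    // Otherwise independent resources grouped into one schedulable unit.
    enum class Resource { GpuScratchpad, CryptoUnit, InterAccelLink };

    struct ResourceSet {
        std::vector<Resource> members;
    };

    bool resource_free(Resource) { return true; }  // placeholder driver query

    // All-or-none acquisition of the set; on failure the runtime falls
    // back to another variant in the bundle instead of failing the call.
    bool try_allocate(const ResourceSet& rs) {
        for (Resource r : rs.members)
            if (!resource_free(r)) return false;
        return true;
    }

If try_allocate fails for, say, a GPU-plus-interconnect set, dispatch simply proceeds to the next applicable variant; no explicit fallback code appears in the application.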
Flexible resource sets will be hidden from most programmers behind library APIs. For those programmers that need more control, the Merge approach provides a framework for integrating implementations that target more specific sets of resources. By collecting otherwise independent resources together to create units that are allocated and scheduled as a single resource, systems can preserve the conventional CPU-centric architecture shown in Figure 3 and leverage existing software infrastructure, such as the OS, while exploiting inter-accelerator interconnect and other difficult-to-virtualize resources. The combination of configuration annotations and runtime function variant selection provides a limited form of OS-like resource protection and allocation until more sophisticated OS infrastructure is developed.

4 Compiler Optimizations

Successfully exploiting complex heterogeneous systems requires that the programmer assemble appropriate resource sets (described in Section 3) and smartly structure the computation to take advantage of those resources. For example, to profit from offloading a computation to a discrete GPU, the computation must have enough arithmetic intensity to amortize the latency of transferring data between the CPU and GPU. Identifying an appropriate granularity at which to offload computation to specialized accelerators is one of the key challenges of heterogeneous systems.

matrix H(matrix A, matrix B, matrix C) {
    matrix T1 = F(A, B);
    matrix T2 = G(T1, C);
    return T2;
}

Figure 5: Example function that could benefit from inter-procedural optimization.

Consider the pseudo-code in Figure 5, in which two functions are called in sequence. In the simplest use of Merge, the F and G function calls could be independently mapped to different hardware resources, with data copied between the CPU and accelerator memory spaces as needed. For some inputs, the overhead of the data copying will be adequately amortized, and this approach will be satisfactory. Dispatch annotations, supplied by the programmer or generated through execution profiling, can be used to limit the invocation of a particular implementation to just those inputs for which it will be beneficial. The H function is no different; it can also be mapped to different implementations. If programmers desire better performance, they can create a new optimized implementation of H, in effect inter-procedurally optimizing across F and G, that can be bundled alongside the version in Figure 5.

When there is little or no sophisticated compiler support, there is no option other than for the programmer to manually build up optimized implementations. Many of the function-intrinsics developed for the Intel X3000 integrated GPU in [3] were implemented this way: functions were fused together until the function-intrinsic performed enough computation to amortize the data transfer latency. As compiler support improves, these optimizations will be automated. The encapsulation and bundling in Merge can facilitate these inter-procedural optimizations. Encapsulated languages provide the input for the optimization, with the product, and its associated dispatch annotations, integrated into the function bundles as an alternate implementation.

The compiler can implement effective optimizations with only a basic understanding of the target architecture. For example, for sequentially invoked functions, like F and G in Figure 5, we are developing tools to eliminate intermediate data transfers on CUDA-enabled GPUs. The optimizer queries the F and G function bundles for CUDA implementations. If they are found, the optimizer creates a new implementation of H in which F and G are inlined and the intermediate copies eliminated. This tool does not need to understand the GPU code; it just needs to be able to identify data transfers and inline calls to CUDA device functions (similar to inlining C++ function calls).

With a deeper understanding of the target architecture, more sophisticated optimizations are possible. However, specialized implementations, such as those written in assembly or a low-level language and intended for direct execution on a particular processor, are rarely a good starting point for optimization. In these cases, we can exploit the encapsulation and bundling capabilities to integrate implementations using high-level DSLs, such as streaming languages, that better support aggressive optimizations. These encapsulated DSLs are particularly useful for established multicore systems that have sophisticated compiler support, but are nonetheless challenging to program using only low-level tools.
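The shape of that rewrite can be shown in plain C++, with invented elementwise kernels standing in for the real F and G. On a CUDA target the fusion eliminates the round trip of T1 through CPU memory; in this scalar sketch the same rewrite eliminates the intermediate vector.

    #include <cstddef>
    #include <vector>

    using Vec = std::vector<float>;

    Vec F(const Vec& a, const Vec& b) {            // stand-in: elementwise add
        Vec t(a.size());
        for (std::size_t i = 0; i < a.size(); ++i) t[i] = a[i] + b[i];
        return t;
    }

    Vec G(const Vec& t, const Vec& c) {            // stand-in: elementwise multiply
        Vec r(t.size());
        for (std::size_t i = 0; i < t.size(); ++i) r[i] = t[i] * c[i];
        return r;
    }

    // Unoptimized H: two dispatches, with the intermediate T1 materialized
    // (and, on an accelerator, copied between memory spaces).
    Vec H(const Vec& a, const Vec& b, const Vec& c) { return G(F(a, b), c); }

    // Optimizer-generated variant: F and G inlined and the intermediate
    // eliminated. It is bundled alongside H, not substituted for it.
    Vec H_fused(const Vec& a, const Vec& b, const Vec& c) {
        Vec r(a.size());
        for (std::size_t i = 0; i < a.size(); ++i)
            r[i] = (a[i] + b[i]) * c[i];
        return r;
    }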
The Merge framework includes a DSL, based on the map-reduce pattern, that provides an expressive and flexible way for programmers to expose parallelism. However, executing unoptimized map-reduce code can impose a significant performance penalty; directly executing the map-reduce implementation of the k-means clustering algorithm on a single core is 5× slower than the C reference implementation. The compiler support for the map-reduce DSL presented in [3] was limited to simple intra-procedural optimizations, and as a result, the map-reduce function-intrinsics were primarily used for coarse-grain task-level parallelism (distributed across heterogeneous processors). We are currently developing more advanced, inter-procedural optimizers, targeting x86 processors with SSE extensions and CUDA-enabled GPUs. Preliminary results for the most aggressive optimizations, including inlining, algebraic simplification and automatic vectorization using SSE extensions, show a 1.56× speedup of k-means relative to the C reference implementation on a single processor core.
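Merge's DSL has its own syntax; the plain C++ sketch below only mirrors the map-reduce structure of one k-means step so that the optimization targets are visible: the map over points and the reduction into per-cluster sums are exactly the loops an inter-procedural optimizer can inline, algebraically simplify and vectorize. The Point layout and helper names are assumptions.

    #include <cstddef>
    #include <vector>

    struct Point { float x, y; };

    // "Map": assign a point to its nearest centroid.
    std::size_t nearest(const Point& p, const std::vector<Point>& centroids) {
        std::size_t best = 0;
        float best_d = 0.0f;
        for (std::size_t k = 0; k < centroids.size(); ++k) {
            float dx = p.x - centroids[k].x, dy = p.y - centroids[k].y;
            float d = dx * dx + dy * dy;
            if (k == 0 || d < best_d) { best_d = d; best = k; }
        }
        return best;
    }

    // "Reduce": accumulate per-cluster sums and counts, then recompute means.
    void kmeans_step(const std::vector<Point>& points,
                     std::vector<Point>& centroids) {
        std::vector<Point> sum(centroids.size(), Point{0.0f, 0.0f});
        std::vector<std::size_t> count(centroids.size(), 0);
        for (const Point& p : points) {
            std::size_t k = nearest(p, centroids);  // inlining candidate
            sum[k].x += p.x;
            sum[k].y += p.y;
            ++count[k];
        }
        for (std::size_t k = 0; k < centroids.size(); ++k)
            if (count[k] > 0)
                centroids[k] = Point{sum[k].x / count[k], sum[k].y / count[k]};
    }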
Optimization of encapsulated DSLs is most useful in the broad middle ground between traditional uniprocessors and bleeding-edge accelerators. Uniprocessor systems are readily targeted using conventional programming tools, while new accelerators invariably lack sophisticated compiler support and must be programmed using accelerator-specific assembly or other low-level tools. Function bundling enables implementations targeting both kinds of systems to coexist; applications can exploit the newest and most powerful computing resources without compromising performance on legacy architectures. By also including DSL-based function-intrinsics, programmers can leverage steadily improving compiler technology to improve productivity and application performance for established heterogeneous systems. Functions written in the map-reduce DSL, for instance, now benefit from support for SIMD extensions, with support for GPUs forthcoming.

The product of an optimizer, a new function-intrinsic, will be just one of possibly several different implementations of a computation. An optimizer does not need to generate the one best implementation for all scenarios. Instead it can focus on generating a great implementation for a particular input or hardware configuration. Tasks that would be common to many optimizers, such as eliminating unneeded implementations and performance-ranking function variants, are provided as part of the bundling infrastructure using static analysis, heuristics and profiling [3]. With a focused mission and powerful supporting infrastructure, optimizers are simpler and easier to build, accelerating the development of sophisticated compiler support for new and evolving hardware.

5 Conclusion

Computer systems will change significantly in the coming decade and beyond. Although steadily improving compiler technology will enable programmers to target more and more different architectures using the same high-level source code, there will always be important accelerators with little or no sophisticated compiler support that require expert-created low-level modules. Enabling the easy integration of different programming models and different processors, and the efficient reuse of expert-developed code, will be key to navigating this ongoing transition. In this paper we have presented a pragmatic approach to developing applications for complex heterogeneous systems. We described how function encapsulation and bundling can be used to integrate many different processors, or combinations of processors, while also supporting advanced optimization techniques, ensuring that programmers can take advantage of state-of-the-art hardware and compiler tools as both become available.

6 Acknowledgments

Merge originated during an internship at the Intel Microarchitecture Research Lab, and we would like to thank Hong Wang, Jamison Collins, Perry Wang and many others at Intel for their support and help. Additionally, we would like to thank Shih-wei Liao, David Sheffield, Mattan Erez and the anonymous reviewers, whose valuable feedback has helped the authors greatly improve the quality of this paper. This work was partially supported by the Focus Center for Circuit and Systems Solutions (C2S2), one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program, and by the Cadence Design Systems Stanford Graduate Fellowship.

References

[1] Diamos, G., and Yalamanchili, S. Harmony: An execution model and runtime for heterogeneous many core systems. In Proc. of HPDC (2008), pp. 197–200.

[2] Gordon, M. I., Thies, W., Karczmarek, M., Lin, J., Meli, A. S., Lamb, A. A., Leger, C., Wong, J., Hoffmann, H., Maze, D., and Amarasinghe, S. A stream compiler for communication-exposed architectures. In Proc. of ASPLOS (2002), pp. 291–303.

[3] Linderman, M. D., Collins, J. D., Wang, H., and Meng, T. H. Merge: A programming model for heterogeneous multi-core systems. In Proc. of ASPLOS (2008), pp. 287–296.

[4] Millstein, T. Practical predicate dispatch. In Proc. of OOPSLA (2004), pp. 345–364.

[5] NVIDIA. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 2.0 ed., 2008.
[6] Pardyak, P., and Bershad, B. Dynamic binding for an extensible system. In Proc. of OSDI (1996), pp. 201–212.

[7] Wang, P. H., Collins, J. D., Chinya, G. N., Jiang, H., Tian, X., Girkar, M., Yang, N. Y., Lueh, G.-Y., and Wang, H. EXOCHI: Architecture and programming environment for a heterogeneous multi-core multithreaded system. In Proc. of PLDI (2007), pp. 156–166.

[8] Wang, P. H., Collins, J. D., Chinya, G. N., Lint, B., Mallick, A., Yamada, K., and Wang, H. Sequencer virtualization. In Proc. of ICS (2007), pp. 148–157.