Hassan Chafi, Arvind K. Sujeeth, Kevin J. Brown, HyoukJoong Lee, Anand R. Atreya, Kunle Olukotun
Pervasive Parallelism Laboratory, Stanford University
{hchafi, asujeeth, kjbrown, hyouklee, aatreya, kunle}@stanford.edu

Abstract

Exploiting heterogeneous parallel hardware currently requires mapping application code to multiple disparate programming models. Unfortunately, general-purpose programming models available today can yield high performance but are too low-level to be accessible to the average programmer. We propose leveraging domain-specific languages (DSLs) to map high-level application code to heterogeneous devices. To demonstrate the potential of this approach we present OptiML, a DSL for machine learning. OptiML programs are implicitly parallel and can achieve high performance on heterogeneous hardware with no modification required to the source code. For such a DSL-based approach to be tractable at large scales, better tools are required for DSL authors to simplify language creation and parallelization. To address this concern, we introduce Delite, a system designed specifically for DSLs that is both a framework for creating an implicitly parallel DSL as well as a dynamic runtime providing automated targeting to heterogeneous parallel hardware. We show that OptiML running on Delite achieves single-threaded, parallel, and GPU performance superior to explicitly parallelized MATLAB code in nearly all cases.

Categories and Subject Descriptors D.1.3 [Programming Techniques]: Concurrent Programming – Parallel programming; D.3.4 [Programming Languages]: Processors – Code generation, Optimization, Run-time environments

General Terms Languages, Performance

Keywords Parallel Programming, Domain-Specific Languages, Dynamic Optimizations

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PPoPP’11, February 12–16, 2011, San Antonio, Texas, USA.
Copyright © 2011 ACM 978-1-4503-0119-0/11/02. $10.00

1. Introduction

Current industry trends favor chip multiprocessors consisting of simpler cores [18, 29] as well as heterogeneous systems consisting of general-purpose processors, SIMD units and accelerator devices such as GPUs [3, 31]. Existing applications can no longer take advantage of the additional compute power available in these new and emerging systems without a significant parallel programming effort. Writing parallel programs, however, is not straightforward because in contrast to the familiar and standard von Neumann model for sequential programming, a variety of incompatible parallel programming models are available, each with their own set of trade-offs. Emerging heterogeneous systems further complicate this challenge as each accelerator vendor usually provides a distinct driver API and programming model to interface with the device.

It is not realistic to expect the average programmer to deal with all this complexity. Moreover, exposing the programmer directly to the various models supported by each compute device will ultimately be detrimental to application portability, forward scalability and maintenance. As new system configurations emerge, applications will constantly need to be rewritten to take advantage of any new capabilities. It is essential to develop appropriate abstractions so that programmers can write high-level code and not worry about low-level details that negatively impact productivity. Thus, there is a need for parallel heterogeneous programming models that target average programmers who are not interested in becoming parallel/heterogeneous programming experts. This mass market parallel heterogeneous programming model should be driven by the following goals:

• Productivity: the application developer can, ideally, write programs without having to use any explicit parallel or heterogeneous constructs.
• Performance: the application should achieve good performance without sacrificing productivity. The system metric should be performance per man-hour.
• Portability and Forward Scalability: the application should leverage the varying amount of compute resources across different systems, both existing and emerging. The forward scalability goal manifests itself across two dimensions: the number of a particular compute resource and the diversity of compute resource types.

There has been a resurgence in research aimed at simplifying parallel programming [8] and delivering on these goals. This paper describes key elements of an ongoing effort to create a development environment that uses a domain-specific approach to solve the issues relating to heterogeneous parallelism. The components of this environment are shown in Figure 1. The environment consists of four main components: applications composed of multiple domain-specific languages (DSLs), DSLs embedded in the Scala programming language [28], a Scala-based framework that simplifies the parallelization of DSLs and a runtime for DSL parallelization and mapping to heterogeneous architectures.

[Figure 1 diagram, layers from top to bottom. Applications: Scientific Engineering, Virtual Worlds, Personal Robotics, Data informatics. Domain Specific Languages: Rendering, Physics, Scripting, Probabilistic, Machine Learning (OptiML). DSL Infrastructure: Domain Embedding Language (Scala); Parallelization Framework (Delite): Static Domain Specific Opt.; Parallel Runtime (Delite): Dynamic Domain Spec. Opt., Task & Data Parallelism, Locality Aware Scheduling. Heterogeneous Hardware (Hardware Architecture): OOO Cores, SIMD Cores, Threaded Cores, Specialized Cores.]

Figure 1: An environment for domain-specific programming of heterogeneous parallel architectures.

A domain-specific approach to parallel programming can address all of the goals of a mass market parallel heterogeneous programming model. A domain-specific language is a computer programming language of restricted expressiveness focused on a particular domain [35]. DSLs are in widespread use in a variety of domains and are becoming more popular. Examples of widely used DSLs are TeX and LaTeX for typesetting academic papers, SQL for database querying, Rails for web application development and VHDL for hardware design. OpenGL can also be viewed as a DSL. By exposing an interface for specifying polygons and the rules to shade them, OpenGL created a high-level programming model for real-time graphics decoupled from the hardware or software used to render it, allowing for aggressive performance gains as graphics hardware evolves. The use of DSLs can provide significant gains in the productivity and creativity of application developers, the portability of applications, and application performance. We exploit this trend towards DSLs and propose an approach to parallel heterogeneous programming that hides the complexity of the underlying machine behind a collection of DSLs. A programmer using one or more of these DSLs writes her programs using domain-specific notation and constructs. The programs appear sequential and all parallelism and use of the heterogeneous machine resources is implicit. DSLs raise the level of abstraction and can provide a sequential model which satisfies the productivity goal.

An additional benefit of using a domain-specific approach is the ability to use domain knowledge to apply static and dynamic optimizations to a program written using a DSL. Most of these domain-specific optimizations would not be possible if the program was written in a general-purpose language. General-purpose languages are limited when it comes to optimization for at least two reasons. First, they must produce correct code across a very wide range of applications. This makes it difficult to apply aggressive optimizations. Compiler developers must err on the side of correctness. Sec-

Since interesting applications might leverage a variety of DSLs, it is critical to not only simplify the development of DSLs by creating a shared infrastructure, but also to allow these DSLs to interoperate. Our current approach is to embed these DSLs in a common embedding language. Scala, our choice for the embedding language, provides features that simplify this task [9, 16]. This approach should be applicable to any sufficiently expressive embedding language.

The ability to easily embed DSLs simplifies the task of a DSL developer. However, assistance in parallelizing and targeting heterogeneous resources is also needed. Delite, our framework and runtime for building and executing parallel DSLs, provides facilities that allow DSL developers to easily parallelize their DSLs. Using Delite, a DSL developer implicitly exposes task level parallelism by enabling a run-ahead model, similar to recent proposals [13, 19], across each invocation of the DSL’s operations. Delite also allows the developer to express data-level parallelism available within DSL operations. Using such a runtime allows us to deliver on our portability and forward scalability goal. We provide details of the Delite framework and runtime in Section 3. Our specific contributions are:

• We present OptiML, a DSL for machine learning, which provides implicitly parallel domain-specific abstractions. We show that such a DSL can be used to simplify programming heterogeneous parallel systems.
