
Virtual Instruction Set Computing for Heterogeneous Systems ∗

Vikram Adve, Sarita Adve, Rakesh Komuravelli, Matthew D. Sinclair and Prakalp Srivastava
University of Illinois at Urbana-Champaign
{vadve,sadve,komurav1,mdsincl2,psrivas2}@illinois.edu

Abstract

Developing applications for emerging and future heterogeneous systems with diverse combinations of hardware is significantly harder than for homogeneous multicore systems. In this paper, we identify three root causes that underlie the programming challenges: (1) diverse parallelism models; (2) diverse memory architectures; and (3) diverse hardware instruction set semantics. We believe that these issues must be addressed using a language-neutral, virtual instruction set layer that abstracts away most of the low-level details of hardware, an approach we call Virtual Instruction Set Computing. Most importantly, the virtual instruction set must abstract away and unify the diverse forms of parallelism and memory architectures using only one or two models of parallelism. We discuss how this approach can solve the root causes of the programmability challenges, illustrate the design with an example, and discuss the research challenges that arise in realizing this vision.

1 Introduction

The future of computing is heterogeneous. Single chips currently exist with several billion transistors, and this number will continue to increase for at least the next decade [9]. However, power dissipation in these chips is an increasing problem, especially given the limited power envelopes of battery-powered devices. Given this, we have two options: turn off large portions of the chip (the problem known as Dark Silicon [20]) or find more power efficient options to keep the transistors on.

Heterogeneous computing falls into the second category. It provides the ability to integrate a variety of processing elements, such as general-purpose cores, GPUs, DSPs, FPGAs, and custom or semi-custom hardware into a single system. If applications can execute code on the device which best suits it, then heterogeneous systems can provide higher energy efficiency than conventional processors. In the best case, customized hardware accelerators have been shown to provide 100x-1000x better power efficiency for specific applications [23, 26].

However, there are numerous challenges to getting such a system to operate effectively and efficiently. One major challenge is the difficulty of programming applications to use diverse computing elements. We identify three fundamental root causes that underlie these challenges: (1) diverse models of parallelism; (2) diverse memory architectures; and (3) diverse hardware instruction sets and semantics. We discuss these root causes and the challenges they engender in Section 2.

In this paper, we describe a broad vision and some preliminary design choices for solving the programmability problem by eliminating all three of these root causes. We believe that this can be achieved only by abstracting away the differences in heterogeneous hardware, and presenting a more uniform abstraction across devices to software. More specifically, a low-level, language-neutral, virtual instruction set can encapsulate all the relevant programmable hardware components on target systems. In this instruction set, source-level applications are compiled, optimized, and shipped as "virtual object code" and then translated down to a specific hardware configuration, usually at install time, using system-specific back ends ("translators"). We call this strategy Virtual Instruction Set Computing (or VISC, as opposed to CISC or RISC) [2].

This broad strategy is not new – it has been used in a few commercial systems such as IBM System/38 and AS/400, Transmeta processors, NVIDIA's PTX and Microsoft's DirectCompute [16, 36, 14, 18, 32, 1] and explored in a few research systems [19, 35, 2]. PTX and DirectCompute have used this approach very successfully to abstract away families of GPUs and are strong evidence that the VISC approach is commercially viable and can deliver high performance. Addressing a wider range of heterogeneous hardware, however, requires solving all three of the root causes above. PTX and DirectCompute only partially address the challenges of diverse memory architectures and of diverse hardware instruction sets, focusing only on the case of GPUs. We discuss further details of these and other current efforts in Section 3.

The key novelty in our approach is that our instruction set exposes a very small number of models of parallelism and a few memory abstractions essential for high performance software development and tuning. Together, we expect that these abstractions will effectively capture a wide range of heterogeneous hardware, which would greatly simplify major programming challenges such as algorithm design, source level portability, and performance tuning. Our overall approach is illustrated in Figure 1 and is discussed further in Section 4. We conclude by discussing open research problems in Section 5.

∗This work was funded in part by NSF Award Numbers 0720772 and CCF-1018796, and by the Intel-sponsored Illinois-Intel Parallelism Center at the University of Illinois at Urbana-Champaign.

Figure 1: System Organization for Virtual Instruction Set Computing in a Heterogeneous System

2 Programmability Challenges

Heterogeneous parallel computing systems, including both mobile System-on-Chip (SOC) designs such as Qualcomm's Snapdragon and nVidia's Tesla, and high-end supercomputers like NCSA's Blue Waters (which has many GPU coprocessors) or Convey's FPGA-based HC-1, raise numerous difficult programming challenges. We believe these challenges arise from three fundamental root causes, and we first discuss these root causes and then outline the challenges they engender.

Root Causes of Programmability Challenges:

(1) Diverse models of parallelism: Different hardware components in heterogeneous systems support different models of parallelism. We tentatively identify five broad classes of programmable hardware that have qualitatively different models of parallelism:

1. General purpose cores: flexible multithreading
2. Vector hardware: vector parallelism
3. GPUs: restrictive parallelism
4. FPGAs: customized dataflow
5. Custom accelerators: various forms

In addition, applications running on multiple such components may exhibit asynchronous or synchronous parallelism relative to each other.

(2) Diverse memory architectures: With the different parallelism models come deep differences in the memory system. Common choices in the various components above include cache-coherent memory hierarchies, vector register files, private or "scratchpad" memory, stream buffers, and custom memory designs used in custom accelerators. These differences in memory architectures strongly influence both algorithm design and application programming. Moreover, the performance tradeoffs are becoming even more complex as new architectures provide more options, e.g., nVidia's Fermi architecture allows a 64 KB block of SRAM to be partitioned flexibly into part L1 cache and part private scratchpad, as illustrated in the brief sketch following these root causes.

(3) Diverse hardware-level instruction sets and execution semantics: Finally, the various hardware components have very different instruction sets, register architectures, performance characteristics, and execution semantics. These differences have an especially profound effect on object-code portability. They also have other negative effects, described below.
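As a concrete illustration of the flexible cache/scratchpad partitioning mentioned under root cause (2), the short sketch below uses the standard CUDA runtime call cudaDeviceSetCacheConfig to request one split or the other. This is ordinary host-side C++ against the CUDA runtime API, shown only to make the point that such memory-architecture choices are visible to, and tuned by, software.

#include <cuda_runtime.h>

// Hint to the CUDA runtime that subsequent kernels prefer a larger L1 cache
// (e.g., a 48 KB L1 / 16 KB scratchpad split on Fermi-class GPUs);
// cudaFuncCachePreferShared requests the opposite split. The driver treats
// this as a hint and may ignore it.
void preferL1OverScratchpad() {
  cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
}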

Major Programmability Challenges:

These fundamental forms of diversity create deep programmability challenges for heterogeneous systems. First, it is extremely difficult to design a single algorithm for a given problem that works well across a range of such different models of parallelism, with such different memory systems, as previous work has shown [33, 4]. We envisage two options to address this problem: design algorithms that achieve good, but not optimal, performance across the targeted range of hardware, or use multiple algorithms for a given problem and select among them when the actual hardware configuration is known (e.g., at install time, load time, or run time) [28, 4]. In practice, both approaches will likely be necessary.

Second, it is much more difficult to design effective source-level programming languages for heterogeneous systems. A single language or programming model typically supports only one or two models of parallelism. For example, CUDA, OpenCL and AMP naturally support fine-grain data parallelism, with a kernel function replicated across a large number of threads, but other parallelism models (like more flexible dataflow) are not specifically addressed. Similarly, BlueSpec [30] and Lime [5], which have proved successful for FPGAs, are both dataflow models, but it is not clear whether these can be mapped effectively to GPUs. The consequence today is that a programmer must program each hardware component differently, which creates a huge barrier to entry for widespread use of heterogeneous hardware.

A third challenge is source code portability. Heterogeneous systems can provide different combinations of hardware, both within a single manufacturer's family of devices and across different manufacturers' devices. This makes source-code level portability difficult, in two ways. First, each component must solve the algorithm portability problem (above). Second, compilers must map source code embodying one or more algorithms for each component down to the various hardware components on which those algorithms must run. This task is greatly complicated by all three forms of diversity.

Fourth, performance tuning for heterogeneous systems will also be significantly more complex. The disparate parallelism models, memory architectures, and lower-level performance details require significantly different performance models and tuning strategies. Because of these disparities, the programmer training, software tools, and application libraries all become prohibitively expensive as the number of different hardware components grows.

Finally, object-code portability across the same and different manufacturers' devices is essential as well. An application vendor must be able to ship a single software version for a broad range of devices – it is impractical to create, test, market and support different versions of an application package for all the different devices running a single platform, e.g., all the devices running Android. Today, Android solves this problem by using Java bytecode, but only for CPU application code: few application components take advantage of the on-phone GPU or DSPs, and those are usually libraries. Moreover, the debuggers, profilers and performance tools for a family of heterogeneous systems must support the full range of available hardware, both within a single system and also across different system configurations. Because both tuning and debugging often need to go down to the level of object code, these tools become expensive to develop, learn and use for each family of hardware.

3 State of the Art

Most of the existing research on programming heterogeneous systems has focused on source level programming models and languages. Existing languages like CUDA, OpenCL, AMP and OpenACC are primarily focused on GPU computing. (The OpenMP standards committee is developing OpenMP extensions for accelerators, which are expected to be very similar to OpenACC [6].) In particular, they primarily support a single parallelism model: parallel execution of a kernel function replicated across a large number of cores, with explicit copying of data from host to device and back. (OpenCL supports task parallelism, but that model is a poor match to GPUs, which are the primary targets of existing OpenCL implementations. For example, it has been shown that CUDA programs that aren't data parallel often perform poorly on GPUs [34].) All commercial systems we know of focus on GPUs and general purpose multicore CPUs and do not address other components in a heterogeneous system.

The Liquid Metal project [24] aims to program hybrid CPU/accelerator code, using a single object-oriented programming language called Lime. Their efforts to date have focused on CPU/FPGA systems. The FPGA part is compiled first to a language based on a dataflow graph of filters connected by input/output queues, which in turn is compiled to Verilog. The approach is very specific to FPGAs and it is not clear if it would be effective for more diverse components.

Delite [11] takes a promising approach. Broadly, Delite provides a framework to create implicitly parallel domain-specific languages (DSLs) and a runtime to execute these parallel DSLs on a heterogeneous platform. The Delite run-time creates a dynamic execution graph of method executions along with their dependencies, and this graph is then scheduled across the system. Because DSLs can be flexible and high-level, the DSL approach could be used to shield the programmer from most of the differences between different heterogeneous devices. In practice, however, this puts an enormous burden on the compiler and run-time system to translate the high-level language to a variety of different hardware components to achieve adequate performance. For example, in Delite, regular data-parallel kernels are automatically translated to the target device language (e.g., CUDA for GPU), but for irregular ones, the DSL author is required to provide a hand-written CUDA kernel, which undercuts the promise of high-level programming.

3 Microsoft’s DirectCompute are virtual instruction sets rectly by building our virtual ISA on top of the LLVM for GPU computing. Both these systems aim to sup- instruction set, which has already proven extremely ef- port a stable programming model and instruction set fective at abstracting away a wide range of sequential (for CUDA and AMP programs respectively), while also hardware instructions sets [2, 27, 29]. providing efficient code. These instruction sets expose 4.2 Abstractions for Individual Elements a data-parallel programming model and provide object Because the Virtual ISA is an abstraction of the hard- code portability, PTX across NVIDIA GPUs but Direct- ware, its design must be driven by the classes of hard- Compute across a wide range of GPUs. Their primary ware it represents. It is important that the virtual ISA not limitation is that they do not support other classes of simply be a union of 5 different kinds of abstractions for hardware. In particular, they do not aim to address the the five broad classes of hardware identified in Section 2. first root cause we have identified – expose only a restric- We expect that virtually all the parallelism exploited in tive throughput-oriented data parallelism and do not ex- individual non-CPU elements will be data parallelism in pose other models of parallelism like dataflow-style par- the broadest , including both regular and irregular allelism, or general task parallelism required across dif- data parallelism. Therefore, we focus on mapping data ferent compute elements. Also PTX and DirectCompute parallelism to individual computing elements, and leave only partially address the second root cause – abstract more general functional or task parallelism to be handled away memory architectures only within a single class of by parallel execution across elements, discussed together heterogeneous components, GPUs and thus not support- with memory abstractions, below. ing other classes of heterogeneous components. Never- In designing the virtual ISA for individual elements, theless, they do provide strong evidence that a virtual we have considered several parallelism models that have instruction set approach is commercially viable, highly been used for different hardware elements today, as well scalable, and can be used to achieve high performance. as combinations of them that can be integrated cleanly. 4 The VISC Approach In the interests of space, we omit the discussions of our 4.1 of The VISC Design Strategy analysis and describe the tentative outcome instead. All these choices will build on some variant of the sequential We propose to leverage Virtual Instruction Set Comput- LLVM instruction set. ing (VISC) to address all three root causes of the pro- Our tentative strategy is to use a combination of two grammability challenges for heterogeneous systems. Our parallelism abstractions: vector parallelism and general- system organization is shown in Figure 1. The key point ized macro dataflow graphs. A virtual vector instruction in the figure is that the only software components that can set, such as Vector LLVA [8] or Vapor SIMD [31], is an “see” the hardware details in a system are the translators obvious candidate because vectorizable code is likely to (i.e., the compiler back ends), a minimal set of low-level capture a large category of codes (though not all) that OS components sufficient to implement a full-featured run successfully on GPUs, FPGAs, and vector hardware. OS kernel [17], and potentially some device drivers. 
The Moreover, when applicable, vector instructions are sim- rest of the software stack, including most of an OS ker- ple, have intuitive deterministic semantics, and can be nel, higher-level language compilers and other devel- made highly portable through careful design of the vector opment tools, application-independent libraries, frame- lengths, control flow, memory alignment, and operation works and middleware, and all , live abstractions [8, 31]. above the virtual ISA. For computations not cleanly expressible as vector To our knowledge, the VISC approach has never be- code, our second choice is a general dataflow graph, fore been applied to multiple kinds of heterogeneous which may include fine-grain dataflow such as Static Sin- hardware, which presents unique and difficult chal- gle Assignment (SSA) form within procedures (which al- lenges. To solve the three root causes of programmabil- ready exists in LLVM) and “macro dataflow” representa- ity problems described earlier, the VISC approach must tions for coarse-grain data flow. This dataflow graph will have three features: (i) A very small number of funda- not be “pure dataflow” in the sense that individual nodes mental abstractions for parallel (preferably, may include side-effects, i.e,. reads and writes, to shared only one or two) in the virtual instruction set; (ii) abstrac- memory, because explicitly managing memory accesses tions for memory and that support - may be important for performance tuning and front-end rithm design, source level language compilation, and ap- compiler optimization. Although these side effects may plication performance tuning; and (iii) a uniform instruc- cause concurrency errors like data races or unintentional tion set that can be mapped down to different hardware non-determinism, it is important to note that the virtual instruction sets and execution models in different hetero- ISA is not a source-level programming language but an geneous hardware components. We discuss the first two object code language: guaranteeing correctness proper- briefly, in turn. The third feature we expect to inherit di- ties like the absence of concurrency errors is not the re-
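As a rough illustration of what such a virtual vector abstraction provides, the C++ sketch below expresses a SAXPY-style kernel against a symbolic vector width VL rather than a hardware-specific one. The names VL and vsaxpy are our own illustrative choices, not part of Vector LLVA or Vapor SIMD; the idea is that a back-end translator would later fix VL and restructure the loop to match the target's SIMD width, GPU thread grouping, or FPGA pipeline.

#include <cstddef>

// Hypothetical sketch: the virtual ISA would express this with target-neutral
// vector operations; here a compile-time VL stands in for the unspecified
// vector length that each back-end translator instantiates for its hardware.
template <std::size_t VL>
void vsaxpy(float a, const float* x, const float* y, float* out, std::size_t n) {
  std::size_t i = 0;
  // Main vectorizable body: each outer iteration is an independent VL-wide operation.
  for (; i + VL <= n; i += VL)
    for (std::size_t lane = 0; lane < VL; ++lane)   // one "virtual vector" op
      out[i + lane] = a * x[i + lane] + y[i + lane];
  // Remainder handled with scalar code (a real translator could use masking).
  for (; i < n; ++i)
    out[i] = a * x[i] + y[i];
}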

For computations not cleanly expressible as vector code, our second choice is a general dataflow graph, which may include fine-grain dataflow such as Static Single Assignment (SSA) form within procedures (which already exists in LLVM) and "macro dataflow" representations for coarse-grain data flow. This dataflow graph will not be "pure dataflow" in the sense that individual nodes may include side-effects, i.e., reads and writes to shared memory, because explicitly managing memory accesses may be important for performance tuning and front-end compiler optimization. Although these side effects may cause concurrency errors like data races or unintentional non-determinism, it is important to note that the virtual ISA is not a source-level programming language but an object code language: guaranteeing correctness properties like the absence of concurrency errors is not the responsibility of the virtual ISA and is instead left to source level languages.

We believe that such a (less restrictive) dataflow graph language will be able to capture all the forms of data parallelism. For example, the implicit "grid of kernel functions" used in PTX [32], Renderscript [3], WebGL [37], and DirectCompute [1] are naturally expressible as dataflow graphs. While dataflow graphs can also capture vectorizable code, they may be less efficient and less compact than vector instructions, which is why we expect to use vector instructions as well.

Moreover, having one monolithic dataflow graph for an entire application would severely restrict the modularity and flexibility of the code. We therefore propose to make the dataflow graphs hierarchical, by simply allowing the individual nodes in the overall graph to be either a vector subprogram or another dataflow graph of its own. Individual nodes in the graph would represent computation within each element, either as simple vector code or as rich fine-grain dataflow graphs. Different nodes can use different choices for different kernels. Together, this combination would give an elegant, yet powerful, mechanism for expressing parallelism across an entire heterogeneous system, since a wide range of high-level data-parallel programming models can be compiled efficiently down to either a dataflow model or a vector model, including OpenCL, Array Building Blocks [21], CUDA, Renderscript, OpenMP, Fortran 9x, Hierarchical Tiled Arrays (HTAs) [7, 22], Concurrent Collections [10], Lime, StreamIt [38], and others.
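To make the hierarchical structure concrete, the following C++ sketch shows one plausible in-memory representation of such a graph, in which every node either is a leaf carrying a vector subprogram or carries a nested dataflow subgraph. The type and field names are our own illustration, not part of any existing virtual ISA or of LLVM.

#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Sketch of a hierarchical macro dataflow graph node. A node is either a
// leaf that carries a vector subprogram for one compute element, or an
// internal node that carries a nested dataflow graph (children + edges).
struct DataflowNode {
  struct Edge {
    std::size_t producer, producerPort;  // index of a child node and its output
    std::size_t consumer, consumerPort;  // index of a child node and its input
  };

  std::string name;            // e.g., "Butterfly", "Stage", "FFT"
  std::size_t numInputs = 0;
  std::size_t numOutputs = 0;

  // Leaf payload: opaque handle to vector/scalar virtual object code.
  std::string vectorKernel;

  // Internal payload: the nested subgraph this node expands into.
  std::vector<std::unique_ptr<DataflowNode>> children;
  std::vector<Edge> edges;

  bool isLeaf() const { return children.empty(); }
};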
In Listing 1, we give an example of the hierarchical dataflow representation of the FFT radix-2 algorithm, which is the simplest and most commonly used form of the Cooley and Tukey FFT algorithm [15], in our proposed macro dataflow language. We have studied two implementations of this algorithm, an FPGA implementation from the ERCBench benchmark suite [12] and a GPU implementation from the Parboil benchmark suite [25], and tried to capture them in a common dataflow representation.

Figure 2: Macro Dataflow representation of FFT. E_X-Y denotes dataflow edges between nodes X and Y. (The figure expands the FFT node into log2 N Stage nodes, each containing N/2 Butterfly nodes that compute A + ωB and A − ωB.)

Figure 2 gives the pictorial view of the dataflow graph created in Listing 1. FFT is a node in the graph, which in turn consists of log2 N Stage nodes, where each Stage node further consists of N/2 Butterfly nodes. In main() we first describe the node hierarchy in the dataflow graph. Next we use the generate_dataflow_edges() function to create the dataflow edge connections among different nodes in the graph. As shown in the figure, we have classified these edges into three types (E_stage-stage, E_butterfly-stage and E_stage-fft) based on the type of nodes they connect. Additionally, these edges are capable of carrying user-defined data types such as Complex. The example shows that the dataflow graph is capable of exposing the parallelism exploited by the GPU and FPGA implementations. The utility of abstracting away the low-level details of data movement through dataflow edges is discussed in Section 4.3.

4.3 Memory System Abstractions

An important aspect of VISC is what abstraction of the overall memory system we present to programmers for coordinating parallelism across compute elements. When programming GPUs in CUDA or OpenCL, we need to explicitly manage data movement between the CPU and the GPU, which is an additional overhead to the programmer. Additionally, newer accelerators like GPUs based on NVIDIA's Fermi architecture combine increasingly flexible options for caches or scratchpad memory. (In an allied research project, we are exploring much more flexible memory system designs, combining highly reconfigurable SRAM memory blocks for energy efficiency and performance.) Unfortunately, these kinds of designs enormously complicate the task of application developers, who have to optimize their code to take advantage of such hardware features. Our VISC design approach aims to enable such flexibility for hardware designers, without putting an excessive burden on application designers. Applications should be able to express an overall computation in a simple and abstract computational model. Such a model would convey the essential coordination and data transfer requirements and leave the low-level mapping and execution details to the underlying compiler, scheduler and hardware.

The hierarchical dataflow graphs proposed in Section 4.2 provide a clean, flexible communication and coordination abstraction across computing elements. For instance, in Listing 1 we abstract away the data movement details in the FPGA and GPU (e.g., CUDA memcpy operations) through edges which move data from the producer node to the consumer node. This dataflow across elements can then be efficiently mapped to either uncached local memories or to cache-coherent shared memory. Although explicit loads and stores of the GPU and FPGA implementations have been abstracted away in this example, we do expect to need explicit loads/stores, which is not uncommon in macro dataflow languages. For example, an alternative approach to expressing the FFT algorithm would be to use loads and stores inside the Stage node for transferring data between successive stages, instead of using explicit dataflow edges as shown. (We omit that pseudocode for lack of space.)
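For contrast with the dataflow edges used in Listing 1 below, the following sketch shows the kind of explicit host/device data management that CUDA programs perform today with the standard runtime calls cudaMalloc, cudaMemcpy, and cudaFree. The surrounding function and its structure are our own illustrative placeholders, not taken from the Parboil implementation.

#include <cuda_runtime.h>

// Explicit host/device data management that a dataflow edge would otherwise
// abstract away: allocate device memory, copy inputs in, run the (elided)
// kernels, and copy results back. float2 is the CUDA built-in pair type.
void copyInRunCopyOut(const float2* hostIn, float2* hostOut, int n) {
  float2* devData = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&devData), n * sizeof(float2));
  cudaMemcpy(devData, hostIn, n * sizeof(float2), cudaMemcpyHostToDevice);

  // ... launch the FFT stage kernels on devData here ...

  cudaMemcpy(hostOut, devData, n * sizeof(float2), cudaMemcpyDeviceToHost);
  cudaFree(devData);
}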

//---------- Global declarations ----------
struct Complex = type { float real, float imag };
const int k = log2 N;
Node FFT (Complex In[]) : (Complex Out[]);
Node Stage (Complex In[], Complex ω[]) : (Complex Out[]);
Node Butterfly (Complex A, Complex B, Complex ω) : (Complex O1, Complex O2);

//--------------- Functions ---------------
Complex complexAdd(Complex A, Complex B) {
  // complexSub(), complexMul() are similar
  Complex result;
  result.real = A.real + B.real;
  result.imag = A.imag + B.imag;
  return result;
}

(Complex, Complex) computeButterfly(Complex A, Complex B, Complex ω) {
  Complex temp = complexMul(B, ω);
  Complex O1 = complexAdd(A, temp);  // A + Bω
  Complex O2 = complexSub(A, temp);  // A − Bω
  return (O1, O2);
}

main() {
  generate_nodes();
  generate_dataflow_edges();
}

//-------- Generate node hierarchy --------
generate_nodes() {
  Butterfly := Node 1 of computeButterfly();
  Stage := Node N/2 of Butterfly;
  FFT := Node k of Stage;
}

//-------- Generate dataflow edges --------
generate_dataflow_edges() {
  // E_stage-stage: Inter-stage edges
  for i in 0 to k − 2 do
    for j in 0 to N/2 − 1 do
      expr = floor(j / 2^i) * 2^(i+1) + j mod 2^i;
      Stage[i].Out[2j] => Stage[i + 1].In[expr];
      Stage[i].Out[2j + 1] => Stage[i + 1].In[expr + 2^i];

  // E_butterfly-stage: Edges between a stage and its butterfly nodes
  for i in 0 to k − 1 do
    for j in 0 to N/2 − 1 do
      with (Stage[i]) {
        In[j] => Butterfly[j].A;
        In[j + N/2] => Butterfly[j].B;
        ω[j] => Butterfly[j].ω;
        Butterfly[j].O1 => Out[2j];
        Butterfly[j].O2 => Out[2j + 1];
      }

  // E_stage-fft: Edges between FFT and stage nodes
  for m in 0 to N − 1 do
    FFT.In[m] => Stage[0].In[m];
    Stage[k − 1].Out[m] => FFT.Out[m];
}

Listing 1: Example Dataflow Representation for FFT. '=>' denotes an edge; ':=' denotes a node definition.
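To clarify the computation that the Butterfly and Stage nodes in Listing 1 represent, the following C++ sketch is a plain sequential rendering of one radix-2 stage, using the same output-to-next-stage index expression (floor(j / 2^i) * 2^(i+1) + j mod 2^i) as the E_stage-stage edges above. It is our own reference code, not part of the proposed virtual ISA, the ERCBench FPGA code, or the Parboil GPU code.

#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<float>;

// One radix-2 stage: butterfly j reads In[j] and In[j + N/2], producing
// Out[2j] = A + ωB and Out[2j + 1] = A − ωB, mirroring the Butterfly node
// and the E_butterfly-stage edges of Listing 1.
std::vector<Complex> runStage(const std::vector<Complex>& in,
                              const std::vector<Complex>& omega) {
  const std::size_t half = in.size() / 2;
  std::vector<Complex> out(in.size());
  for (std::size_t j = 0; j < half; ++j) {
    const Complex a = in[j];
    const Complex b = in[j + half];
    out[2 * j]     = a + omega[j] * b;   // A + ωB
    out[2 * j + 1] = a - omega[j] * b;   // A − ωB
  }
  return out;
}

// The E_stage-stage edges of Listing 1 shuffle stage i's outputs into stage
// i+1's inputs: output 2j goes to index expr, output 2j+1 to expr + 2^i,
// where expr = floor(j / 2^i) * 2^(i+1) + j mod 2^i.
std::vector<Complex> shuffleToNextStage(const std::vector<Complex>& out,
                                        std::size_t i) {
  const std::size_t half = out.size() / 2;
  const std::size_t p = std::size_t{1} << i;   // 2^i
  std::vector<Complex> next(out.size());
  for (std::size_t j = 0; j < half; ++j) {
    const std::size_t expr = (j / p) * (2 * p) + (j % p);
    next[expr]     = out[2 * j];
    next[expr + p] = out[2 * j + 1];
  }
  return next;
}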
5 Open Research Problems

We must solve several key problems to bring these ideas to fruition. We must design a back-end compilation process from the Virtual ISA to native code for each hardware device, which can exploit Virtual ISA information to perform hardware-dependent optimizations. We must develop compilation techniques that translate communication and data movement down to hardware with a variety of memory architectures, e.g., some combination of non-cached local memories, cache-coherent shared memory, or streaming memory. We must design an autotuning partitioner that decides how to partition the program effectively between diverse hardware elements, tune individual component code for specific hardware, and guide the run-time scheduler. Finally, we must design run-time schedulers that take advantage of information extracted by autotuning ahead of time and the flexibility offered by the Virtual ISA to optimize power and execution time.

Another key direction of work is more effective, software-aware memory organization. To minimize energy usage or to maximize performance in a heterogeneous system, depending on the computation, a given set of data may ideally be stored as hardware-managed cache lines, software-managed scratchpads with flexible data granularities and layouts, vector registers, etc. Moreover, hardware caching today is largely software-oblivious, but there is much information in software that can be used to optimize how memory is managed. In prior work on the DeNovo architecture, we have successfully used program-level information about data and access patterns for efficiencies in power, performance, and complexity when managing data for homogeneous multicores with caches [13]. These ideas can be extended for even greater benefits in heterogeneous systems. To accomplish these gains, the Virtual ISA must provide a general abstraction to exploit all such storage structures, the backend compiler must map this abstraction efficiently, and the hardware must be designed to exploit this flexibility.
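As a rough sketch of the kind of run-time scheduler contemplated in the first paragraph of this section, the C++ fragment below chooses, for each dataflow node, among pre-translated native versions using cost estimates produced by ahead-of-time autotuning. All of the types, names, and weights here are illustrative assumptions, not measurements or a committed design.

#include <limits>
#include <map>
#include <string>

enum class Device { Cpu, Gpu, Fpga };

// Ahead-of-time autotuning record for one dataflow node on one device:
// estimated execution time and energy for a representative input size.
struct TunedCost {
  double timeMs;
  double energyMj;
};

struct NodeProfile {
  std::string nodeName;               // e.g., "Stage" from Listing 1
  std::map<Device, TunedCost> costs;  // only devices with a translated binary
};

// Pick the device that minimizes a weighted time/energy objective, exploiting
// the flexibility of having one virtual-ISA version translated for several targets.
Device chooseDevice(const NodeProfile& p, double timeWeight, double energyWeight) {
  Device best = Device::Cpu;
  double bestScore = std::numeric_limits<double>::max();
  for (const auto& [device, cost] : p.costs) {
    const double score = timeWeight * cost.timeMs + energyWeight * cost.energyMj;
    if (score < bestScore) {
      bestScore = score;
      best = device;
    }
  }
  return best;
}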

References

[1] DirectCompute Programming Guide. http://developer.download.nvidia.com/compute/DevZone/docs//DirectCompute/doc/DirectCompute_Programming_Guide.pdf, 2010.

[2] V. Adve, C. Lattner, M. Brukman, A. Shukla, and B. Gaeke. LLVA: A Low-level Virtual Instruction Set Architecture. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 36, 2003.

[3] Android Developers. Renderscript. http://developer.android.com/reference/android/renderscript/package-summary.html.

[4] J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: A language and compiler for algorithmic choice. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '09, 2009.

[5] J. Auerbach, D. F. Bacon, P. Cheng, and R. Rabbah. Lime: A Java-compatible and synthesizable language for heterogeneous architectures. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '10, 2010.

[6] J. Beyer, E. Stotzer, A. Hart, and B. D. Supinski. OpenMP for Accelerators. 2011.

[7] G. Bikshandi, J. Guo, D. Hoeflinger, G. Almasi, B. B. Fraguela, M. J. Garzarán, D. Padua, and C. von Praun. Programming for parallelism and locality with hierarchically tiled arrays. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '06, 2006.

[8] R. L. Bocchino, Jr. and V. S. Adve. Vector LLVA: A virtual vector instruction set for media processing. In Proceedings of the 2nd International Conference on Virtual Execution Environments, VEE '06, 2006.

[9] J. Brodman, B. Fraguela, M. J. Garzaran, and D. Padua. Design Issues in Parallel Array Languages for Shared Memory. In 8th Int. Workshop on Systems, Architectures, Modeling, and Simulation, 2008.

[10] Z. Budimlic, M. Burke, V. Cavé, K. Knobe, G. Lowney, R. Newton, J. Palsberg, D. Peixotto, V. Sarkar, F. Schlimbach, and S. Tasirlar. Concurrent Collections. Scientific Programming, 18(3-4):203–217, 2010.

[11] H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun. A domain-specific approach to heterogeneous parallelism. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, 2011.

[12] D. Chang, C. Jenkins, P. Garcia, S. Gilani, P. Aguilera, A. Nagarajan, M. Anderson, M. Kenny, S. Bauer, M. Schulte, et al. ERCBench: An open-source benchmark suite for embedded and reconfigurable computing. In International Conference on Field Programmable Logic and Applications, FPL, 2010.

[13] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. V. Adve, V. S. Adve, N. P. Carter, and C.-T. Chou. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism. In 20th International Conference on Parallel Architectures and Compilation Techniques, PACT, 2011.

[14] B. E. Clark and M. J. Corrigan. Application System/400 performance characteristics. IBM Systems Journal, 28(3):407–423, 1989.

[15] J. Cooley and J. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19(90):297–301, 1965.

[16] IBM Corporation. System/38-A high-level machine. IBM System/38 Technical Developments, 1978. Available through IBM branch offices.

[17] J. Criswell, B. Monroe, and V. Adve. A Virtual Instruction Set Interface for Operating System Kernels. In WIOSCA, 2006.

[18] J. C. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler, A. Klaiber, and J. Mattson. The Transmeta Code Morphing Software: Using speculation, recovery and adaptive retranslation to address real-life challenges. In CGO, 2003.

[19] K. Ebcioglu and E. R. Altman. DAISY: Dynamic compilation for 100% architectural compatibility. In Proc. Int'l Conf. on Computer Architecture (ISCA), pages 26–37, 1997.

[20] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38th International Symposium on Computer Architecture, ISCA, 2011.

[21] A. Ghuloum, A. Sharp, N. Clemons, S. D. Toit, R. Malladi, M. Gangadhar, M. McCool, and H. Pabst. Array Building Blocks: A Flexible Parallel Programming Model for Multicore and Many-Core Architectures. http://drdobbs.com/go-parallel/article/showArticle.jhtml?articleID=227300084, 2010.

[22] J. Guo, G. Bikshandi, B. B. Fraguela, M. J. Garzaran, and D. Padua. Programming with tiles. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, 2008.

[23] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA, 2010.

[24] S. S. Huang, A. Hormati, D. F. Bacon, and R. Rabbah. Liquid Metal: Object-oriented programming across the hardware/software boundary. In Proceedings of the 22nd European Conference on Object-Oriented Programming, ECOOP, 2008.

[25] IMPACT Research Group and others. Parboil benchmark suite, 2007.

[26] R. Iyer, S. Srinivasan, O. Tickoo, Z. Fang, R. Illikkal, S. Zhang, V. Chadha, P. M. Stillwell, and S. E. Lee. CogniServe: Heterogeneous Server Architecture for Large-Scale Recognition. IEEE Micro, 31(3):20–31, 2011.

[27] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis and transformation. In CGO, 2004.

[28] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. Merge: A programming model for heterogeneous multi-core systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008.

[29] LLVM User Community. http://llvm.org/Users.html.

[30] R. S. Nikhil. Bluespec System Verilog: Efficient, correct RTL from high level specifications. In Proceedings of the Second ACM and IEEE International Conference on Formal Methods and Models for Co-Design, MEMOCODE, 2004.

[31] D. Nuzman, S. Dyshel, E. Rohou, I. Rosen, K. Williams, D. Yuste, A. Cohen, and A. Zaks. Vapor SIMD: Auto-vectorize once, run everywhere. In 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO, 2011.

[32] Nvidia Compute. PTX: Parallel Thread Execution ISA Version 2.3. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_2.3.pdf, 2011.

[33] S. Ohshima, K. Kise, T. Katagiri, and T. Yuba. Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment. In Proceedings of the 7th International Conference on High Performance Computing for Computational Science, VECPAR, 2007.

[34] M. D. Sinclair. Enabling New Uses for GPUs. Master's thesis, University of Wisconsin-Madison, 2011.

[35] J. E. Smith, T. Heil, S. Sastry, and T. Bezenek. Achieving High Performance via Co-designed Virtual Machines. In Proc. Int'l Workshop on Innovative Architecture, IWIA, 1999.

[36] F. Soltis. Design of a small business data processing system. IEEE Computer, 14:77–93, 1981.

[37] The Khronos Group. WebGL - OpenGL ES 2.0 for the Web. http://www.khronos.org/webgl/, 2009.

[38] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A Language for Streaming Applications. In Compiler Construction, 2002.
