
Virtual Instruction Set Computing for Heterogeneous Systems ∗

Vikram Adve, Sarita Adve, Rakesh Komuravelli, Matthew D. Sinclair and Prakalp Srivastava
University of Illinois at Urbana-Champaign
{vadve,sadve,komurav1,mdsincl2,psrivas2}@illinois.edu

Abstract

Developing applications for emerging and future heterogeneous systems with diverse combinations of hardware is significantly harder than for homogeneous multicore systems. In this paper, we identify three root causes that underlie the programming challenges: (1) diverse parallelism models; (2) diverse memory architectures; and (3) diverse hardware instruction set semantics. We believe that these issues must be addressed using a language-neutral, virtual instruction set layer that abstracts away most of the low-level details of hardware, an approach we call Virtual Instruction Set Computing. Most importantly, the virtual instruction set must abstract away and unify the diverse forms of parallelism and memory architectures using only one or two models of parallelism. We discuss how this approach can solve the root causes of the programmability challenges, illustrate the design with an example, and discuss the research challenges that arise in realizing this vision.

1 Introduction

The future of computing is heterogeneous. Single chips currently exist with several billion transistors, and this number will continue to increase for at least the next decade [9]. However, power dissipation in these chips is an increasing problem, especially given the limited power envelopes of battery-powered devices. Given this, we have two options: turn off large portions of the chip (the problem known as Dark Silicon [20]) or find more power efficient options to keep the transistors on.

Heterogeneous computing falls into the second category. It provides the ability to integrate a variety of processing elements, such as general-purpose cores, GPUs, DSPs, FPGAs, and custom or semi-custom hardware into a single system. If applications can execute code on the device which best suits it, then heterogeneous systems can provide higher energy efficiency than conventional processors. In the best case, customized hardware accelerators have been shown to provide 100x-1000x better power efficiency for specific applications [23, 26].

However, there are numerous challenges to getting such a system to operate effectively and efficiently. One major challenge is the difficulty of programming applications to use diverse computing elements. We identify three fundamental root causes that underlie these challenges: (1) diverse models of parallelism; (2) diverse memory architectures; and (3) diverse hardware instruction sets and semantics. We discuss these root causes and the challenges they engender in Section 2.

In this paper, we describe a broad vision and some preliminary design choices for solving the programmability problem by eliminating all three of these root causes. We believe that this can be achieved only by abstracting away the differences in heterogeneous hardware, and presenting a more uniform abstraction across devices to software. More specifically, a low-level, language-neutral, virtual instruction set can encapsulate all the relevant programmable hardware components on target systems. In this instruction set, source-level applications are compiled, optimized, and shipped as "virtual object code" and then translated down to a specific hardware configuration, usually at install time, using system-specific back ends ("translators"). We call this strategy Virtual Instruction Set Computing (or VISC, as opposed to CISC or RISC) [2].

This broad strategy is not new – it has been used in a few commercial systems such as IBM System/38 and AS/400, Transmeta processors, NVIDIA's PTX and Microsoft's DirectCompute [16, 36, 14, 18, 32, 1] and explored in a few research systems [19, 35, 2]. PTX and DirectCompute have used this approach very successfully to abstract away families of GPUs and are strong evidence that the VISC approach is commercially viable and can deliver high performance. Addressing a wider range of heterogeneous hardware, however, requires solving all three of the root causes above. PTX and DirectCompute only partially address the challenges of diverse memory architectures and of diverse hardware instruction sets, focusing only on the case of GPUs. We discuss further details of these and other current efforts in Section 3.

The key novelty in our approach is that our instruction set exposes a very small number of models of parallelism and a few memory abstractions essential for high performance software development and tuning. Together, we expect that these abstractions will effectively capture a wide range of heterogeneous hardware, which would greatly simplify major programming challenges such as algorithm design, source level portability, and performance tuning. Our overall approach is illustrated in Figure 1 and is discussed further in Section 4. We conclude by discussing open research problems in Section 5.

∗This work was funded in part by NSF Award Numbers 0720772 and CCF-1018796, and by the Intel-sponsored Illinois-Intel Parallelism Center at the University of Illinois at Urbana-Champaign.

Figure 1: System Organization for Virtual Instruction Set Computing in a Heterogeneous System

2 Programmability Challenges

Heterogeneous parallel computing systems, including both mobile System-on-Chip (SOC) designs such as Qualcomm's Snapdragon and nVidia's Tesla, and high-end supercomputers like NCSA's Blue Waters (which has many GPU coprocessors) or Convey's FPGA-based HC-1, raise numerous difficult programming challenges. We believe these challenges arise from three fundamental root causes, and we first discuss these root causes and then outline the challenges they engender.

Root Causes of Programmability Challenges:

(1) Diverse models of parallelism: Different hardware components in heterogeneous systems support different models of parallelism. We tentatively identify five broad classes of programmable hardware that have qualitatively different models of parallelism:

1. General purpose cores: flexible multithreading
2. Vector hardware: vector parallelism
3. GPUs: restrictive parallelism
4. FPGAs: customized dataflow
5. Custom accelerators: various forms

In addition, applications running on multiple such components may exhibit asynchronous or synchronous parallelism relative to each other.

(2) Diverse memory architectures: With the different parallelism models come deep differences in the memory system. Common choices in the various components above include cache-coherent memory hierarchies, vector register files, private or "scratchpad" memory, stream buffers, and custom memory designs used in custom accelerators. These differences in memory architectures strongly influence both algorithm design and application programming. Moreover, the performance tradeoffs are becoming even more complex as new architectures provide more options, e.g., nVidia's Fermi architecture allows a 64 KB block of SRAM to be partitioned flexibly into part L1 cache and part private scratchpad, as illustrated in the brief sketch following these root causes.

(3) Diverse hardware-level instruction sets and execution semantics: Finally, the various hardware components have very different instruction sets, register architectures, performance characteristics, and execution semantics. These differences have an especially profound effect on object-code portability. They also have other negative effects, described below.
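As a concrete illustration of the flexible cache/scratchpad partitioning mentioned under root cause (2), the short sketch below uses the standard CUDA runtime call cudaDeviceSetCacheConfig to request one split or the other. This is ordinary host-side C++ against the CUDA runtime API, shown only to make the point that such memory-architecture choices are visible to, and tuned by, software.

#include <cuda_runtime.h>

// Hint to the CUDA runtime that subsequent kernels prefer a larger L1 cache
// (e.g., a 48 KB L1 / 16 KB scratchpad split on Fermi-class GPUs);
// cudaFuncCachePreferShared requests the opposite split. The driver treats
// this as a hint and may ignore it.
void preferL1OverScratchpad() {
  cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
}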

Major Programmability Challenges:

These fundamental forms of diversity create deep programmability challenges for heterogeneous systems. First, it is extremely difficult to design a single algorithm for a given problem that works well across a range of such different models of parallelism, with such different memory systems, as previous work has shown [33, 4]. We envisage two options to address this problem: design algorithms that achieve good, but not optimal, performance across the targeted range of hardware, or use multiple algorithms for a given problem and select among them when the actual hardware configuration is known (e.g., at install time, load time, or run time) [28, 4]. In practice, both approaches will likely be necessary.

Second, it is much more difficult to design effective source-level programming languages for heterogeneous systems. A single language or programming model typically supports only one or two models of parallelism. For example, CUDA, OpenCL and AMP naturally support fine-grain data parallelism, with a kernel function replicated across a large number of threads, but other parallelism models (like more flexible dataflow) are not specifically addressed. Similarly, BlueSpec [30] and Lime [5], which have proved successful for FPGAs, are both dataflow models, but it is not clear whether these can be mapped effectively to GPUs. The consequence today is that a programmer must program each hardware component differently, which creates a huge barrier to entry for widespread use of heterogeneous hardware.

A third challenge is source code portability. Heterogeneous systems can provide different combinations of hardware, both within a single manufacturer's family of devices and across different manufacturers' devices. This makes source-code level portability difficult, in two ways. First, each component must solve the algorithm portability problem (above). Second, compilers must map source code embodying one or more algorithms for each component down to the various hardware components on which those algorithms must run. This task is greatly complicated by all three forms of diversity.

Fourth, performance tuning for heterogeneous systems will also be significantly more complex. The disparate parallelism models, memory architectures, and lower-level performance details require significantly different performance models and tuning strategies. Because of these disparities, the programmer training, software tools, and application libraries all become prohibitively expensive as the number of different hardware components grows.

Finally, object-code portability across the same and different manufacturers' devices is essential as well. An application vendor must be able to ship a single software version for a broad range of devices – it is impractical to create, test, market and support different versions of an application package for all the different devices running a single platform, e.g., all the devices running Android. Today, Android solves this problem by using Java bytecode, but only for CPU application code: few application components take advantage of the on-phone GPU or DSPs, and those are usually libraries. Moreover, the debuggers, profilers and performance tools for a family of heterogeneous systems must support the full range of available hardware, both within a single system and also across different system configurations. Because both tuning and debugging often need to go down to the level of object code, these tools become expensive to develop, learn and use for each family of hardware.

3 State of the Art

Most of the existing research on programming heterogeneous systems has focused on source level programming models and languages. Existing languages like CUDA, OpenCL, AMP and OpenACC are primarily focused on GPU computing. (The OpenMP standards committee is developing OpenMP extensions for accelerators, which are expected to be very similar to OpenACC [6].) In particular, they primarily support a single parallelism model: parallel execution of a kernel function replicated across a large number of cores, with explicit copying of data from host to device and back. (OpenCL supports task parallelism, but that model is a poor match to GPUs, which are the primary targets of existing OpenCL implementations. For example, it has been shown that CUDA programs that aren't data parallel often perform poorly on GPUs [34].) All commercial systems we know of focus on GPUs and general purpose multicore CPUs and do not address other components in a heterogeneous system.

The Liquid Metal project [24] aims to program hybrid CPU/accelerator code, using a single object-oriented programming language called Lime. Their efforts to date have focused on CPU/FPGA systems. The FPGA part is compiled first to a language based on a dataflow graph of filters connected by input/output queues, which in turn is compiled to Verilog. The approach is very specific to FPGAs and it is not clear if it would be effective for more diverse components.

Delite [11] takes a promising approach. Broadly, Delite provides a framework to create implicitly parallel domain-specific languages (DSLs) and a runtime to execute these parallel DSLs on a heterogeneous platform. The Delite run-time creates a dynamic execution graph of method executions along with their dependencies, and this graph is then scheduled across the system. Because DSLs can be flexible and high-level, the DSL approach could be used to shield the programmer from most of the differences between different heterogeneous devices. In practice, however, this puts an enormous burden on the compiler and run-time system to translate the high-level language to a variety of different hardware components to achieve adequate performance. For example, in Delite, regular data-parallel kernels are automatically translated to the target device language (e.g., CUDA for GPU), but for irregular ones, the DSL author is required to provide a hand-written CUDA kernel, which undercuts the promise of high-level programming.

3 Microsoft’s DirectCompute are virtual instruction sets rectly by building our virtual ISA on top of the LLVM for GPU computing. Both these systems aim to sup- instruction set, which has already proven extremely ef- port a stable programming model and instruction set fective at abstracting away a wide range of sequential (for CUDA and AMP programs respectively), while also hardware instructions sets [2, 27, 29]. providing efficient code. These instruction sets expose 4.2 Abstractions for Individual Elements a data-parallel programming model and provide object Because the Virtual ISA is an abstraction of the hard- code portability, PTX across NVIDIA GPUs but Direct- ware, its design must be driven by the classes of hard- Compute across a wide range of GPUs. Their primary ware it represents. It is important that the virtual ISA not limitation is that they do not support other classes of simply be a union of 5 different kinds of abstractions for hardware. In particular, they do not aim to address the the five broad classes of hardware identified in Section 2. first root cause we have identified – expose only a restric- We expect that virtually all the parallelism exploited in tive throughput-oriented data parallelism and do not ex- individual non-CPU elements will be data parallelism in pose other models of parallelism like dataflow-style par- the broadest , including both regular and irregular allelism, or general task parallelism required across dif- data parallelism. Therefore, we focus on mapping data ferent compute elements. Also PTX and DirectCompute parallelism to individual computing elements, and leave only partially address the second root cause – abstract more general functional or task parallelism to be handled away memory architectures only within a single class of by parallel execution across elements, discussed together heterogeneous components, GPUs and thus not support- with memory abstractions, below. ing other classes of heterogeneous components. Never- In designing the virtual ISA for individual elements, theless, they do provide strong evidence that a virtual we have considered several parallelism models that have instruction set approach is commercially viable, highly been used for different hardware elements today, as well scalable, and can be used to achieve high performance. as combinations of them that can be integrated cleanly. 4 The VISC Approach In the interests of space, we omit the discussions of our 4.1 of The VISC Design Strategy analysis and describe the tentative outcome instead. All these choices will build on some variant of the sequential We propose to leverage Virtual Instruction Set Comput- LLVM instruction set. ing (VISC) to address all three root causes of the pro- Our tentative strategy is to use a combination of two grammability challenges for heterogeneous systems. Our parallelism abstractions: vector parallelism and general- system organization is shown in Figure 1. The key point ized macro dataflow graphs. A virtual vector instruction in the figure is that the only software components that can set, such as Vector LLVA [8] or Vapor SIMD [31], is an “see” the hardware details in a system are the translators obvious candidate because vectorizable code is likely to (i.e., the compiler back ends), a minimal set of low-level capture a large category of codes (though not all) that OS components sufficient to implement a full-featured run successfully on GPUs, FPGAs, and vector hardware. OS kernel [17], and potentially some device drivers. 
The Moreover, when applicable, vector instructions are sim- rest of the software stack, including most of an OS ker- ple, have intuitive deterministic semantics, and can be nel, higher-level language compilers and other devel- made highly portable through careful design of the vector opment tools, application-independent libraries, frame- lengths, control flow, memory alignment, and operation works and middleware, and all , live abstractions [8, 31]. above the virtual ISA. For computations not cleanly expressible as vector To our knowledge, the VISC approach has never be- code, our second choice is a general dataflow graph, fore been applied to multiple kinds of heterogeneous which may include fine-grain dataflow such as Static Sin- hardware, which presents unique and difficult chal- gle Assignment (SSA) form within procedures (which al- lenges. To solve the three root causes of programmabil- ready exists in LLVM) and “macro dataflow” representa- ity problems described earlier, the VISC approach must tions for coarse-grain data flow. This dataflow graph will have three features: (i) A very small number of funda- not be “pure dataflow” in the sense that individual nodes mental abstractions for parallel (preferably, may include side-effects, i.e,. reads and writes, to shared only one or two) in the virtual instruction set; (ii) abstrac- memory, because explicitly managing memory accesses tions for memory and that support - may be important for performance tuning and front-end rithm design, source level language compilation, and ap- compiler optimization. Although these side effects may plication performance tuning; and (iii) a uniform instruc- cause concurrency errors like data races or unintentional tion set that can be mapped down to different hardware non-determinism, it is important to note that the virtual instruction sets and execution models in different hetero- ISA is not a source-level programming language but an geneous hardware components. We discuss the first two object code language: guaranteeing correctness proper- briefly, in turn. The third feature we expect to inherit di- ties like the absence of concurrency errors is not the re-
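As a rough illustration of what such a virtual vector abstraction provides, the C++ sketch below expresses a SAXPY-style kernel against a symbolic vector width VL rather than a hardware-specific one. The names VL and vsaxpy are our own illustrative choices, not part of Vector LLVA or Vapor SIMD; the idea is that a back-end translator would later fix VL and restructure the loop to match the target's SIMD width, GPU thread grouping, or FPGA pipeline.

#include <cstddef>

// Hypothetical sketch: the virtual ISA would express this with target-neutral
// vector operations; here a compile-time VL stands in for the unspecified
// vector length that each back-end translator instantiates for its hardware.
template <std::size_t VL>
void vsaxpy(float a, const float* x, const float* y, float* out, std::size_t n) {
  std::size_t i = 0;
  // Main vectorizable body: each outer iteration is an independent VL-wide operation.
  for (; i + VL <= n; i += VL)
    for (std::size_t lane = 0; lane < VL; ++lane)   // one "virtual vector" op
      out[i + lane] = a * x[i + lane] + y[i + lane];
  // Remainder handled with scalar code (a real translator could use masking).
  for (; i < n; ++i)
    out[i] = a * x[i] + y[i];
}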

For computations not cleanly expressible as vector code, our second choice is a general dataflow graph, which may include fine-grain dataflow such as Static Single Assignment (SSA) form within procedures (which already exists in LLVM) and "macro dataflow" representations for coarse-grain data flow. This dataflow graph will not be "pure dataflow" in the sense that individual nodes may include side-effects, i.e., reads and writes to shared memory, because explicitly managing memory accesses may be important for performance tuning and front-end compiler optimization. Although these side effects may cause concurrency errors like data races or unintentional non-determinism, it is important to note that the virtual ISA is not a source-level programming language but an object code language: guaranteeing correctness properties like the absence of concurrency errors is not the responsibility of the virtual ISA and is instead left to source level languages.

We believe that such a (less restrictive) dataflow graph language will be able to capture all the forms of data parallelism. For example, the implicit "grid of kernel functions" used in PTX [32], Renderscript [3], WebGL [37], and DirectCompute [1] are naturally expressible as dataflow graphs. While dataflow graphs can also capture vectorizable code, they may be less efficient and less compact than vector instructions, which is why we expect to use vector instructions as well.

Moreover, having one monolithic dataflow graph for an entire application would severely restrict the modularity and flexibility of the code. We therefore propose to make the dataflow graphs hierarchical, by simply allowing the individual nodes in the overall graph to be either a vector subprogram or another dataflow graph of its own. Individual nodes in the graph would represent computation within each element, either as simple vector code or as rich fine-grain dataflow graphs. Different nodes can use different choices for different kernels. Together, this combination would give an elegant, yet powerful, mechanism for expressing parallelism across an entire heterogeneous system, since a wide range of high-level data-parallel programming models can be compiled efficiently down to either a dataflow model or a vector model, including OpenCL, Array Building Blocks [21], CUDA, Renderscript, OpenMP, Fortran 9x, Hierarchical Tiled Arrays (HTAs) [7, 22], Concurrent Collections [10], Lime, StreamIt [38], and others.
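To make the hierarchical structure concrete, the following C++ sketch shows one plausible in-memory representation of such a graph, in which every node either is a leaf carrying a vector subprogram or carries a nested dataflow subgraph. The type and field names are our own illustration, not part of any existing virtual ISA or of LLVM.

#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Sketch of a hierarchical macro dataflow graph node. A node is either a
// leaf that carries a vector subprogram for one compute element, or an
// internal node that carries a nested dataflow graph (children + edges).
struct DataflowNode {
  struct Edge {
    std::size_t producer, producerPort;  // index of a child node and its output
    std::size_t consumer, consumerPort;  // index of a child node and its input
  };

  std::string name;            // e.g., "Butterfly", "Stage", "FFT"
  std::size_t numInputs = 0;
  std::size_t numOutputs = 0;

  // Leaf payload: opaque handle to vector/scalar virtual object code.
  std::string vectorKernel;

  // Internal payload: the nested subgraph this node expands into.
  std::vector<std::unique_ptr<DataflowNode>> children;
  std::vector<Edge> edges;

  bool isLeaf() const { return children.empty(); }
};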
In Listing 1, we give an example of the hierarchical dataflow representation of the FFT radix-2 algorithm, which is the simplest and most commonly used form of the Cooley and Tukey FFT algorithm [15], in our proposed macro dataflow language. We have studied two implementations of this algorithm, an FPGA implementation from the ERCBench benchmark suite [12] and a GPU implementation from the Parboil benchmark suite [25], and tried to capture them in a common dataflow representation.

Figure 2: Macro Dataflow representation of FFT. E_X-Y denotes dataflow edges between nodes X and Y. (The figure expands the FFT node into log2 N Stage nodes, each containing N/2 Butterfly nodes that compute A + ωB and A − ωB.)

Figure 2 gives the pictorial view of the dataflow graph created in Listing 1. FFT is a node in the graph, which in turn consists of log2 N Stage nodes, where each Stage node further consists of N/2 Butterfly nodes. In main() we first describe the node hierarchy in the dataflow graph. Next we use the generate_dataflow_edges() function to create the dataflow edge connections among different nodes in the graph. As shown in the figure, we have classified these edges into three types (E_stage-stage, E_butterfly-stage and E_stage-fft) based on the type of nodes they connect. Additionally, these edges are capable of carrying user-defined data types such as Complex. The example shows that the dataflow graph is capable of exposing the parallelism exploited by the GPU and FPGA implementations. The utility of abstracting away the low-level details of data movement through dataflow edges is discussed in Section 4.3.

4.3 Memory System Abstractions

An important aspect of VISC is what abstraction of the overall memory system we present to programmers for coordinating parallelism across compute elements. When programming GPUs in CUDA or OpenCL, we need to explicitly manage data movement between the CPU and the GPU, which is an additional overhead to the programmer. Additionally, newer accelerators like GPUs based on NVIDIA's Fermi architecture combine increasingly flexible options for caches or scratchpad memory. (In an allied research project, we are exploring much more flexible memory system designs, combining highly reconfigurable SRAM memory blocks for energy efficiency and performance.) Unfortunately, these kinds of designs enormously complicate the task of application developers, who have to optimize their code to take advantage of such hardware features. Our VISC design approach aims to enable such flexibility for hardware designers, without putting an excessive burden on application designers. Applications should be able to express an overall computation in a simple and abstract computational model. Such a model would convey the essential coordination and data transfer requirements and leave the low-level mapping and execution details to the underlying compiler, scheduler and hardware.

The hierarchical dataflow graphs proposed in Section 4.2 provide a clean, flexible communication and coordination abstraction across computing elements. For instance, in Listing 1 we abstract away the data movement details in the FPGA and GPU (e.g., CUDA memcpy operations) through edges which move data from the producer node to the consumer node. This dataflow across elements can then be efficiently mapped to either uncached local memories or to cache-coherent shared memory. Although explicit loads and stores of the GPU and FPGA implementations have been abstracted away in this example, we do expect to need explicit loads/stores, which is not uncommon in macro dataflow languages. For example, an alternative approach to expressing the FFT algorithm would be to use loads and stores inside the Stage node for transferring data between successive stages, instead of using explicit dataflow edges as shown. (We omit that pseudocode for lack of space.)
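For contrast with the dataflow edges used in Listing 1 below, the following sketch shows the kind of explicit host/device data management that CUDA programs perform today with the standard runtime calls cudaMalloc, cudaMemcpy, and cudaFree. The surrounding function and its structure are our own illustrative placeholders, not taken from the Parboil implementation.

#include <cuda_runtime.h>

// Explicit host/device data management that a dataflow edge would otherwise
// abstract away: allocate device memory, copy inputs in, run the (elided)
// kernels, and copy results back. float2 is the CUDA built-in pair type.
void copyInRunCopyOut(const float2* hostIn, float2* hostOut, int n) {
  float2* devData = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&devData), n * sizeof(float2));
  cudaMemcpy(devData, hostIn, n * sizeof(float2), cudaMemcpyHostToDevice);

  // ... launch the FFT stage kernels on devData here ...

  cudaMemcpy(hostOut, devData, n * sizeof(float2), cudaMemcpyDeviceToHost);
  cudaFree(devData);
}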

//---------- Global declarations ----------
struct Complex = type { float real, float imag };
const int k = log2 N;
Node FFT (Complex In[]) : (Complex Out[]);
Node Stage (Complex In[], Complex ω[]) : (Complex Out[]);
Node Butterfly (Complex A, Complex B, Complex ω) : (Complex O1, Complex O2);

//--------------- Functions ---------------
Complex complexAdd(Complex A, Complex B) {
  // complexSub(), complexMul() are similar
  Complex result;
  result.real = A.real + B.real;
  result.imag = A.imag + B.imag;
  return result;
}

(Complex, Complex) computeButterfly(Complex A, Complex B, Complex ω) {
  Complex temp = complexMul(B, ω);
  Complex O1 = complexAdd(A, temp);  // A + Bω
  Complex O2 = complexSub(A, temp);  // A − Bω
  return (O1, O2);
}

main() {
  generate_nodes();
  generate_dataflow_edges();
}

//-------- Generate node hierarchy --------
generate_nodes() {
  Butterfly := Node 1 of computeButterfly();
  Stage := Node N/2 of Butterfly;
  FFT := Node k of Stage;
}

//-------- Generate dataflow edges --------
generate_dataflow_edges() {
  // E_stage-stage: Inter-stage edges
  for i in 0 to k − 2 do
    for j in 0 to N/2 − 1 do
      expr = floor(j / 2^i) * 2^(i+1) + j mod 2^i;
      Stage[i].Out[2j] => Stage[i + 1].In[expr];
      Stage[i].Out[2j + 1] => Stage[i + 1].In[expr + 2^i];

  // E_butterfly-stage: Edges between a stage and its butterfly nodes
  for i in 0 to k − 1 do
    for j in 0 to N/2 − 1 do
      with (Stage[i]) {
        In[j] => Butterfly[j].A;
        In[j + N/2] => Butterfly[j].B;
        ω[j] => Butterfly[j].ω;
        Butterfly[j].O1 => Out[2j];
        Butterfly[j].O2 => Out[2j + 1];
      }

  // E_stage-fft: Edges between FFT and stage nodes
  for m in 0 to N − 1 do
    FFT.In[m] => Stage[0].In[m];
    Stage[k − 1].Out[m] => FFT.Out[m];
}

Listing 1: Example Dataflow Representation for FFT. '=>' denotes an edge; ':=' denotes a node definition.
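To clarify the computation that the Butterfly and Stage nodes in Listing 1 represent, the following C++ sketch is a plain sequential rendering of one radix-2 stage, using the same output-to-next-stage index expression (floor(j / 2^i) * 2^(i+1) + j mod 2^i) as the E_stage-stage edges above. It is our own reference code, not part of the proposed virtual ISA, the ERCBench FPGA code, or the Parboil GPU code.

#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<float>;

// One radix-2 stage: butterfly j reads In[j] and In[j + N/2], producing
// Out[2j] = A + ωB and Out[2j + 1] = A − ωB, mirroring the Butterfly node
// and the E_butterfly-stage edges of Listing 1.
std::vector<Complex> runStage(const std::vector<Complex>& in,
                              const std::vector<Complex>& omega) {
  const std::size_t half = in.size() / 2;
  std::vector<Complex> out(in.size());
  for (std::size_t j = 0; j < half; ++j) {
    const Complex a = in[j];
    const Complex b = in[j + half];
    out[2 * j]     = a + omega[j] * b;   // A + ωB
    out[2 * j + 1] = a - omega[j] * b;   // A − ωB
  }
  return out;
}

// The E_stage-stage edges of Listing 1 shuffle stage i's outputs into stage
// i+1's inputs: output 2j goes to index expr, output 2j+1 to expr + 2^i,
// where expr = floor(j / 2^i) * 2^(i+1) + j mod 2^i.
std::vector<Complex> shuffleToNextStage(const std::vector<Complex>& out,
                                        std::size_t i) {
  const std::size_t half = out.size() / 2;
  const std::size_t p = std::size_t{1} << i;   // 2^i
  std::vector<Complex> next(out.size());
  for (std::size_t j = 0; j < half; ++j) {
    const std::size_t expr = (j / p) * (2 * p) + (j % p);
    next[expr]     = out[2 * j];
    next[expr + p] = out[2 * j + 1];
  }
  return next;
}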
5 Open Research Problems

We must solve several key problems to bring these ideas to fruition. We must design a back-end compilation process from the Virtual ISA to native code for each hardware device, which can exploit Virtual ISA information to perform hardware-dependent optimizations. We must develop compilation techniques that translate communication and data movement down to hardware with a variety of memory architectures, e.g., some combination of non-cached local memories, cache-coherent shared memory, or streaming memory. We must design an autotuning partitioner that decides how to partition the program effectively between diverse hardware elements, tune individual component code for specific hardware, and guide the run-time scheduler. Finally, we must design run-time schedulers that take advantage of information extracted by autotuning ahead of time and the flexibility offered by the Virtual ISA to optimize power and execution time.

Another key direction of work is more effective, software-aware memory organization. To minimize energy usage or to maximize performance in a heterogeneous system, depending on the computation, a given set of data may ideally be stored as hardware-managed cache lines, software-managed scratchpads with flexible data granularities and layouts, vector registers, etc. Moreover, hardware caching today is largely software-oblivious, but there is much information in software that can be used to optimize how memory is managed. In prior work on the DeNovo architecture, we have successfully used program-level information about data and access patterns for efficiencies in power, performance, and complexity when managing data for homogeneous multicores with caches [13]. These ideas can be extended for even greater benefits in heterogeneous systems. To accomplish these gains, the Virtual ISA must provide a general abstraction to exploit all such storage structures, the backend compiler must map this abstraction efficiently, and the hardware must be designed to exploit this flexibility.
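As a rough sketch of the kind of run-time scheduler contemplated in the first paragraph of this section, the C++ fragment below chooses, for each dataflow node, among pre-translated native versions using cost estimates produced by ahead-of-time autotuning. All of the types, names, and weights here are illustrative assumptions, not measurements or a committed design.

#include <limits>
#include <map>
#include <string>

enum class Device { Cpu, Gpu, Fpga };

// Ahead-of-time autotuning record for one dataflow node on one device:
// estimated execution time and energy for a representative input size.
struct TunedCost {
  double timeMs;
  double energyMj;
};

struct NodeProfile {
  std::string nodeName;               // e.g., "Stage" from Listing 1
  std::map<Device, TunedCost> costs;  // only devices with a translated binary
};

// Pick the device that minimizes a weighted time/energy objective, exploiting
// the flexibility of having one virtual-ISA version translated for several targets.
Device chooseDevice(const NodeProfile& p, double timeWeight, double energyWeight) {
  Device best = Device::Cpu;
  double bestScore = std::numeric_limits<double>::max();
  for (const auto& [device, cost] : p.costs) {
    const double score = timeWeight * cost.timeMs + energyWeight * cost.energyMj;
    if (score < bestScore) {
      bestScore = score;
      best = device;
    }
  }
  return best;
}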

References

[1] DirectCompute Programming Guide. http://developer.download.nvidia.com/compute/DevZone/docs//DirectCompute/doc/DirectCompute_Programming_Guide.pdf, 2010.

[2] V. Adve, C. Lattner, M. Brukman, A. Shukla, and B. Gaeke. LLVA: A Low-level Virtual Instruction Set Architecture. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 36, 2003.

[3] Android Developers. Renderscript. http://developer.android.com/reference/android/renderscript/package-summary.html.

[4] J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: A language and compiler for algorithmic choice. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '09, 2009.

[5] J. Auerbach, D. F. Bacon, P. Cheng, and R. Rabbah. Lime: A Java-compatible and synthesizable language for heterogeneous architectures. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '10, 2010.

[6] J. Beyer, E. Stotzer, A. Hart, and B. D. Supinski. OpenMP for Accelerators. 2011.

[7] G. Bikshandi, J. Guo, D. Hoeflinger, G. Almasi, B. B. Fraguela, M. J. Garzarán, D. Padua, and C. von Praun. Programming for parallelism and locality with hierarchically tiled arrays. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '06, 2006.

[8] R. L. Bocchino, Jr. and V. S. Adve. Vector LLVA: A virtual vector instruction set for media processing. In Proceedings of the 2nd International Conference on Virtual Execution Environments, VEE '06, 2006.

[9] J. Brodman, B. Fraguela, M. J. Garzaran, and D. Padua. Design Issues in Parallel Array Languages for Shared Memory. In 8th Int. Workshop on Systems, Architectures, Modeling, and Simulation, 2008.

[10] Z. Budimlic, M. Burke, V. Cavé, K. Knobe, G. Lowney, R. Newton, J. Palsberg, D. Peixotto, V. Sarkar, F. Schlimbach, and S. Tasirlar. Concurrent Collections. Scientific Programming, 18(3-4):203–217, 2010.

[11] H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun. A domain-specific approach to heterogeneous parallelism. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, 2011.

[12] D. Chang, C. Jenkins, P. Garcia, S. Gilani, P. Aguilera, A. Nagarajan, M. Anderson, M. Kenny, S. Bauer, M. Schulte, et al. ERCBench: An open-source benchmark suite for embedded and reconfigurable computing. In International Conference on Field Programmable Logic and Applications, FPL, 2010.

[13] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. V. Adve, V. S. Adve, N. P. Carter, and C.-T. Chou. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism. In 20th International Conference on Parallel Architectures and Compilation Techniques, PACT, 2011.

[14] B. E. Clark and M. J. Corrigan. Application System/400 performance characteristics. IBM Systems Journal, 28(3):407–423, 1989.

[15] J. Cooley and J. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19(90):297–301, 1965.

[16] IBM Corporation. System/38-A high-level machine. IBM System/38 Technical Developments, 1978. Available through IBM branch offices.

[17] J. Criswell, B. Monroe, and V. Adve. A Virtual Instruction Set Interface for Operating System Kernels. In WIOSCA, 2006.

[18] J. C. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler, A. Klaiber, and J. Mattson. The Transmeta Code Morphing Software: Using speculation, recovery and adaptive retranslation to address real-life challenges. In CGO, 2003.

[19] K. Ebcioglu and E. R. Altman. DAISY: Dynamic compilation for 100% architectural compatibility. In Proc. Int'l Conf. on Computer Architecture (ISCA), pages 26–37, 1997.

[20] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38th International Symposium on Computer Architecture, ISCA, 2011.

[21] A. Ghuloum, A. Sharp, N. Clemons, S. D. Toit, R. Malladi, M. Gangadhar, M. McCool, and H. Pabst. Array Building Blocks: A Flexible Parallel Programming Model for Multicore and Many-Core Architectures. http://drdobbs.com/go-parallel/article/showArticle.jhtml?articleID=227300084, 2010.

[22] J. Guo, G. Bikshandi, B. B. Fraguela, M. J. Garzaran, and D. Padua. Programming with tiles. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, 2008.

[23] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA, 2010.

[24] S. S. Huang, A. Hormati, D. F. Bacon, and R. Rabbah. Liquid Metal: Object-oriented programming across the hardware/software boundary. In Proceedings of the 22nd European Conference on Object-Oriented Programming, ECOOP, 2008.

[25] IMPACT Research Group and others. Parboil benchmark suite, 2007.

[26] R. Iyer, S. Srinivasan, O. Tickoo, Z. Fang, R. Illikkal, S. Zhang, V. Chadha, P. M. Stillwell, and S. E. Lee. CogniServe: Heterogeneous Server Architecture for Large-Scale Recognition. IEEE Micro, 31(3):20–31, 2011.

[27] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis and transformation. In CGO, 2004.

[28] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. Merge: A programming model for heterogeneous multi-core systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008.

[29] LLVM User Community. http://llvm.org/Users.html.

[30] R. S. Nikhil. Bluespec System Verilog: Efficient, correct RTL from high level specifications. In Proceedings of the Second ACM and IEEE International Conference on Formal Methods and Models for Co-Design, MEMOCODE, 2004.

[31] D. Nuzman, S. Dyshel, E. Rohou, I. Rosen, K. Williams, D. Yuste, A. Cohen, and A. Zaks. Vapor SIMD: Auto-vectorize once, run everywhere. In 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO, 2011.

[32] Nvidia Compute. PTX: Parallel Thread Execution ISA Version 2.3. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_2.3.pdf, 2011.

[33] S. Ohshima, K. Kise, T. Katagiri, and T. Yuba. Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment. In Proceedings of the 7th International Conference on High Performance Computing for Computational Science, VECPAR, 2007.

[34] M. D. Sinclair. Enabling New Uses for GPUs. Master's thesis, University of Wisconsin-Madison, 2011.

[35] J. E. Smith, T. Heil, S. Sastry, and T. Bezenek. Achieving High Performance via Co-designed Virtual Machines. In Proc. Int'l Workshop on Innovative Architecture, IWIA, 1999.

[36] F. Soltis. Design of a small business data processing system. IEEE Computer, 14:77–93, 1981.

[37] The Khronos Group. WebGL - OpenGL ES 2.0 for the Web. http://www.khronos.org/webgl/, 2009.

[38] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A Language for Streaming Applications. In Compiler Construction, 2002.
