Cross-Platform Heterogeneous Runtime Environment

A Dissertation Presented by

Enqiang Sun

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Engineering

Northeastern University Boston, Massachusetts

April 2016

To my family.

Contents

List of Figures

List of Tables

Acknowledgments

Abstract of the Dissertation

1 Introduction
    1.1 A Brief History of Heterogeneous Computing
    1.2 Heterogeneous Computing with OpenCL
    1.3 Task-level Parallelism across Platforms with Multiple Computing Devices
    1.4 Scope and Contribution of This Thesis
    1.5 Organization of This Thesis

2 Background and Related Work
    2.1 From Serial to Parallel
    2.2 Many-Core Architecture
    2.3 Programming Paradigms for Many Core Architecture
        2.3.1 Pthreads
        2.3.2 OpenMP
        2.3.3 MPI
        2.3.4 Hadoop MapReduce
    2.4 Computing with Graphic Processing Units
        2.4.1 The Emergence of Programmable Graphics Hardware
        2.4.2 General Purpose GPUs
    2.5 OpenCL
        2.5.1 An OpenCL Platform
        2.5.2 OpenCL Execution Model
        2.5.3 OpenCL Memory Model
    2.6 Heterogeneous Computing
        2.6.1 Discussion
    2.7 SURF in OpenCL
    2.8 Monte Carlo Extreme in OpenCL

3 Cross-platform Heterogeneous Runtime Environment
    3.1 Limitations of the OpenCL Command-Queue Approach
        3.1.1 Working with Multiple Devices
    3.2 The Task Queuing Execution System
        3.2.1 Work Units
        3.2.2 Work Pools
        3.2.3 Common Runtime Layer
        3.2.4 Resource Manager
        3.2.5 Scheduler
        3.2.6 Task-Queuing API

4 Experimental Results
    4.1 Experimental Environment
    4.2 Static Workload Balancing
        4.2.1 Performance Opportunities on a Single GPU Device
        4.2.2 Heterogeneous Platform with Multiple GPU Devices
        4.2.3 Heterogeneous Platform with CPU and GPU (APU) Device
    4.3 Design Space Exploration for Flexible Workload Balancing
        4.3.1 Synthetic Workload Generator
        4.3.2 Dynamic Workload Balancing
        4.3.3 Workload Balancing with Irregular Work Units
    4.4 Cross-Platform Heterogeneous Execution of clSURF and MCXCL
    4.5 Productivity

5 Summary and Conclusions
    5.1 Portable Execution across Platforms
    5.2 Dynamic Workload Balancing
    5.3 APIs to Expose Both Task-level and Data-level Parallelism
    5.4 Future Research Directions
        5.4.1 Including Flexible Workload Balancing Schemes
        5.4.2 Running specific kernels on the best computing devices
        5.4.3 Prediction of data locality

Bibliography

List of Figures

1.1 Multi-core CPU, GPU, and Heterogeneous System-on-Chip CPU and GPU. At present, designers are able to make decisions among diverse architecture choices: homogeneous multi-core with cores of varying size and complexity, or heterogeneous system-on-chip architectures.

2.1 Intel Processor Introduction Trends
2.2 Multi-core Processors with Shared Memory
2.3 Intel's TeraFlops Architecture
2.4 Intel's Xeon Phi Architecture
2.5 IBM's Cell
2.6 High-Level Block Diagram of a GPU
2.7 OpenCL Architecture
2.8 An OpenCL Platform
2.9 The OpenCL Execution Model
2.10 OpenCL work-items mapping to GPU devices.
2.11 OpenCL work-items mapping to CPU devices.
2.12 The OpenCL ...
2.13 Qilin Software Architecture
2.14 The OpenCL environment with the IBM OpenCL common runtime.
2.15 Maestro's Optimization Flow
2.16 Symphony Overview
2.17 The Program Flow of clSURF.
2.18 Block Diagram of the Parallel Monte Carlo simulation for photon migration.

3.1 Distributing work units from work pools to multiple devices.
3.2 CPU and GPU execution
3.3 An example execution of vector addition on multiple devices with different processing capabilities.

4.1 The performance of our work pool implementation on a single device – One Work Pool.
4.2 The performance of our work pool implementation on a single device – Two Work Pools.
4.3 Load balancing on dual devices – V9800P and HD6970.

4.4 Load balancing on dual devices – V9800P and GTX 285.
4.5 Performance assuming different device fission configurations and load balancing schemes between CPU and Fused HD6550D GPU.
4.6 The Load Balancing on dual devices – HD6550D and CPU.
4.7 Performance of different workload balancing schemes on all 3 CPU and GPU devices, an A8-3850 CPU, a V7800 GPU and a HD6550D GPU, as compared to a V7800 GPU device alone.
4.8 Performance of different workload balancing schemes on 1 CPU and 2 GPU devices, an NVS 5400M GPU, a Core i5-3360M CPU and an Intel HD Graphics 4000 GPU, as compared to the NVS 5400M GPU device alone.
4.9 Performance Comparison of clSURF implemented with various workload balancing schemes on the platform with V7800 and HD6550D GPUs.
4.10 Performance Comparison of MCXCL implemented with various workload balancing schemes
4.11 Number of lines of the source code using our runtime API versus a baseline OpenCL implementation.

List of Tables

3.1 Typical memory bandwidth between different processing units for reads.
3.2 Typical memory bandwidth between different processing units for writes.
3.3 The Classes and Methods

4.1 Input sets emphasizing different phases of the SURF algorithm.

Acknowledgments

First, I would like to thank my advisor, Prof. David Kaeli, for his insightful and inspiring guidance during the course of my graduate study. I have always enjoyed talking with him about various research topics, and his computer architecture class is one of the best classes I have ever taken. The enlightening suggestions from my committee members, Prof. Norman Rubin and Prof. Ningfang Mi, have been a great help to this thesis. Norm was my mentor when I was doing a 6-month internship at AMD, and that is where this thesis essentially started. I would also like to thank Dr. Xinping Zhu, who gave me valuable guidance early in my graduate study. My fellow NUCAR colleagues, Dana Schaa, Byunghyun Jang, Perhaad Mistry, and others, also helped me so much through technical discussions and feedback. If life is a train ride, I cherish every moment and every scene outside the window that we share together. My deepest appreciation goes to my family, as it is always where I can put myself together, surrounded by their endless love. I would like to thank my mom and dad for their constant support and motivation, and my brother for his advice and encouragement. And finally, but most importantly, I would like to thank my wife and the mother of my two kids, Liwei, for her understanding, patience, and faith in me. I could not have finished this thesis without her love.

Abstract of the Dissertation

Cross-Platform Heterogeneous Runtime Environment

by Enqiang Sun

Doctor of Philosophy in Computer Engineering

Northeastern University, April 2016

Dr. David Kaeli, Adviser

Heterogeneous platforms are becoming widely adopted thanks to the support from new programming languages and models. Among these languages and models, OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units within a system. OpenCL is one of the first standards that focuses on portability, allowing programs to be written once and run unmodified on multiple heterogeneous devices, regardless of vendor. While OpenCL has been widely adopted, there still remains a lack of support for automatic workload balancing and data consistency when multiple devices are present in the system. To address this need, we have designed a cross-platform heterogeneous runtime environment which provides a high-level, unified execution model coupled with an intelligent resource management facility. The main motivation for developing this runtime environment is to provide OpenCL programmers with a convenient programming paradigm to fully utilize all available devices in a system and to incorporate flexible workload balancing schemes, without compromising the user's ability to assign tasks according to data affinity. Our work removes much of the cumbersome initialization of the platform; devices and the related OpenCL objects are now hidden under the hood. Equipped with this new runtime environment and its associated programming interface, the programmer can focus on designing the application and worry less about customization for the target platform. Further, the programmer can now take advantage of multiple devices using a dynamic workload balancing algorithm to reap the benefits of task-level parallelism. To demonstrate the value of this cross-platform heterogeneous runtime environment, we have evaluated it running both micro-benchmarks and popular OpenCL benchmark applications. With minimal overhead for managing data objects across devices, we show that we can achieve scalable performance and application speedup as we increase the number of computing devices, without any changes to the program source code.

Chapter 1

Introduction

Moore's law describes technology advances that double transistor density on integrated circuits every 12 to 18 months [1]. However, with the size of transistors approaching the size of individual atoms, and with power density outpacing current cooling techniques, the end of Moore's law has appeared on the horizon. This has encouraged the research community to look at new solutions in system architecture, including heterogeneous computing architectures. Since 2003, the semiconductor industry has settled on three main design trends. The first trend is to continue improving sequential execution speed while increasing the number of cores [2]. Microprocessors of this kind are called multicore processors. An example of a multicore CPU is Intel's widely used Core 2 Duo. It has dual processor cores, each of which is an out-of-order, multiple-instruction-issue processor implementing the full instruction set, supporting hyperthreading with two hardware threads, and designed to maximize the execution speed of sequential programs. The second trend focuses on the execution of parallel applications with as many threads as possible. Processors of this kind are called many-thread processors. Most current popular GPUs are many-thread architectures. For example, at full occupancy NVIDIA's GTX 970 can host 26,624 threads, executing in a large number of simple, in-order pipelines. The third trend combines the multicore and many-thread architectures. This kind of processor is represented by most current desktop processors with integrated graphics processing units. For example, the 6th-generation Intel Core i7-6567U processor has a dual-core CPU with an integrated Iris Graphics 550 GPU, which has 72 execution units [3]. AMD's A8-3850 fusion processor has four x86-64 CPU cores integrated together with a Radeon HD6550D GPU, which has 5 SIMD engines (16-wide) and a total of 400 streaming processors [4]. With its long evolving history, the design philosophy of a CPU has been to minimize the execution

latency of a single thread. Large on-chip caches are integrated to store frequently accessed data and to turn some long-latency memory accesses into short-latency cache accesses. There is also prediction logic, such as branch prediction and data prefetching, designed to minimize the effective latency of operations at the cost of increased chip area and power. With all of these hardware components, the CPU greatly reduces the execution latency of each individual thread. However, the large cache memory, low-latency arithmetic units, and sophisticated prediction logic consume chip area and power that could otherwise be used to provide more arithmetic execution units and memory access channels. This design style emphasizes minimizing latency and is referred to as latency-oriented design. GPUs, either standalone or integrated, are on the other hand designed as parallel, throughput-oriented computing engines. The application software is expected to be organized with much more data parallelism. The hardware takes advantage of a large number of arithmetic execution units and pipelines execution while some of them are waiting for long-latency memory accesses or arithmetic operations. Only a limited amount of cache memory is supplied, to help satisfy the memory bandwidth requirements of these applications and to facilitate data synchronization between multiple threads that access the same memory data. This design style strives to maximize the total execution throughput of a large amount of data parallelism, while allowing individual threads to take a potentially much longer time to execute. GPUs have been leading the race in floating-point performance since 2013. With enough data parallelism and proper memory arrangement, the performance gap can be more than ten-fold. These figures are not necessarily application speeds, but only the raw speed that the execution resources can potentially support. For applications that have only one or a few threads, CPUs can achieve much higher performance than GPUs. Therefore, heterogeneous architectures combining CPUs and GPUs are a natural choice for applications that can execute their sequential parts on the CPU and their numerically intensive parallel parts on the GPU. Figure 1.1 is a high-level illustration of a multi-core CPU, a many-thread accelerator GPU, and a heterogeneous system-on-chip architecture with the CPU and GPU on the same die. High-performance computing might emphasize single-threaded latency, whereas commercial transaction processing might emphasize aggregate throughput. Designers began to put these devices with very different characteristics together, expecting a performance gain from proper workload distribution and balancing. Graphics processing units used to be very difficult to program, since programmers had to use the corresponding graphics application programming interfaces.



Figure 1.1: Multi-core CPU, GPU, and Heterogeneous System-on-Chip CPU and GPU. At present, designers are able to make decisions among diverse architecture choices: homogeneous multi-core with cores of varying size and complexity, or heterogeneous system-on-chip architectures.

OpenGL and Direct3D are the most widely used graphics API specifications. More precisely, a computation had to be mapped to a graphical function that programs a pixel processing engine so that it could be executed on the early GPUs. These APIs require extensive knowledge of graphics processing and also limit the kinds of applications that one could actually write in early general-purpose GPU programming. To meet the increasing demand, new GPU programming paradigms became more and more popular, such as CUDA [5], OpenCL [6], OpenACC [7], and C++ AMP [8]. Many runtime and execution systems have also been designed to help developers manage heterogeneous platforms with multiple computing devices that have dramatically different characteristics. In this thesis, we present a cross-platform heterogeneous runtime environment, providing a convenient programming interface to fully utilize all available devices in a heterogeneous system. Our framework incorporates flexible workload balancing schemes without compromising the user's ability to assign tasks according to data affinity. Our framework provides significant enhancements to the state-of-the-art in OpenCL programming practice in terms of workload balancing and distribution. Furthermore, the details of programming a specific platform are hidden from the programmer, enabling the programmer to focus more on the high-level design of the algorithms. In this chapter, we present the reader with an introduction to some basic concepts of heterogeneous computing. This includes a very brief history of heterogeneous computing with CPUs and GPUs, the potential benefits that heterogeneous computing provides, and the ability of our runtime framework to adapt applications to heterogeneous computing platforms. Finally, we

highlight the contributions of this thesis and outline the organization of its remaining chapters.

1.1 A Brief History of Heterogeneous Computing

Over the last decade, developers have witnessed the field of computer architecture transitioning from single-core compute devices to a wide range of parallel architectures. The change in architecture has also produced new challenges with the underlying parallel programming paradigms. Existing algorithms designed to scale with single-core systems had to be redesigned to reap the performance benefits of new parallel architectures. Multi-core is the path the industry has chosen to quench the thirst for performance while at the same time respecting thermal and power design limits. While multi-core processors have ushered in a new era of concurrency, there has also been work on exploiting existing parallel platforms such as GPUs. Since the early 1990s, software architects have explored how best to run general-purpose applications on computer graphics hardware (i.e., GPUs). GPUs were originally designed to execute a set of predefined functions as a graphics rendering pipeline. Even today, GPUs are mainly designed to calculate the color of pixels on the screen to support complex graphics processing functions. GPUs provide deterministic performance when rendering frames. In the beginning of this revolution, GPU programming was done using a graphics Application Programming Interface (API) such as OpenGL [9] or DirectX [10]. This model required general-purpose application developers to have intimate knowledge of graphics hardware and graphics APIs. These restrictions severely impacted the implementation of many algorithms on GPUs. General-purpose GPU (GPGPU) programming was not widely accepted until new GPU architectures unified vertex and pixel processors (first available in the R600 family from AMD and the G80 family from NVIDIA). New general-purpose programming languages such as CUDA [5] and Brook+ [11] were introduced in 2006. The introduction of fully programmable hardware and new programming languages lifted many of the restrictions and greatly increased the interest in using GPUs for general-purpose computing. Heterogeneous platforms that include GPUs as powerful data-parallel co-processors have been adopted in many scientific and engineering environments [12][13][14][15]. On current systems, discrete GPUs are connected to the rest of the system through a PCI Express bus. All data transfer between the CPU and GPU is limited by the speed of the PCI Express protocol. Recently, industry leaders have recognized that scalar processing on the CPU, combined with parallel processing on the GPU, could be a powerful model for application throughput. More

recently, the Heterogeneous System Architecture (HSA) Foundation [16] was founded in 2012 by many vendors. HSA has provided industry with standards to further support heterogeneity across systems and devices. We have also seen that solutions with a CPU and a GPU on the same die, such as AMD's APU series [4], Intel's Ivy Bridge series [17], and Qualcomm's Snapdragon [18], have demonstrated potential power/performance savings. Current state-of-the-art supercomputers utilize heterogeneous solutions. Heterogeneous systems can be found in every domain of computing, ranging from high-performance computing servers to low-power embedded processors in mobile phones and tablets. Industry and academia are investing a huge amount of effort and budget to improve every aspect of heterogeneous computing [19] [20] [21] [15] [22] [23].

1.2 Heterogeneous Computing with OpenCL

The emerging software framework for programming heterogeneous devices is the Open Computing Language (OpenCL) [6]. OpenCL is an open industry standard managed by the non-profit technology consortium Khronos Group. Support for OpenCL has been growing among major companies such as Qualcomm, AMD, Intel and Imagination. The aim of OpenCL is to serve as a universal language for programming heterogeneous platforms such as GPUs, CPUs, DSPs, and FPGAs. In order to support such a wide variety of heterogeneous devices, some elements of the OpenCL API are necessarily low-level. As with the CUDA/C language [5], OpenCL does not provide support for automatic workload balancing, nor does it guarantee global data consistency; it is up to the programmer to explicitly define tasks and enqueue them on devices, and to move data between devices as required. Furthermore, when different implementations of OpenCL produced by different vendors are used, OpenCL objects from vendor A's implementation may not run on vendor B's hardware. Given these limitations, there still remain barriers to achieving straightforward heterogeneous computing.
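To make these burdens concrete, the following is a minimal host-side sketch in C of what a programmer must write today to split a single vector-addition kernel evenly across two devices. Error handling and object release are omitted; the kernel name vec_add, the even 50/50 split, and the assumption that both devices belong to the same platform are purely illustrative.

    #include <CL/cl.h>

    /* Manually split one data-parallel task across two devices.
       Assumes: the platform exposes at least two devices, `src` holds the
       OpenCL C source of a kernel named "vec_add", and n is even. */
    void vec_add_two_devices(const char *src, float *a, float *b,
                             float *c, size_t n)
    {
        cl_platform_id platform;
        cl_device_id dev[2];
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 2, dev, NULL);

        cl_context ctx = clCreateContext(NULL, 2, dev, NULL, NULL, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 2, dev, NULL, NULL, NULL);

        size_t half = n / 2, bytes = half * sizeof(float);
        for (int d = 0; d < 2; d++) {
            /* One command queue, one set of buffers, and one launch per
               device -- the split is chosen statically by the programmer. */
            cl_command_queue q = clCreateCommandQueue(ctx, dev[d], 0, NULL);
            cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);
            cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);
            cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
            cl_kernel k = clCreateKernel(prog, "vec_add", NULL);

            size_t off = d * half;
            clEnqueueWriteBuffer(q, da, CL_FALSE, 0, bytes, a + off, 0, NULL, NULL);
            clEnqueueWriteBuffer(q, db, CL_FALSE, 0, bytes, b + off, 0, NULL, NULL);
            clSetKernelArg(k, 0, sizeof(cl_mem), &da);
            clSetKernelArg(k, 1, sizeof(cl_mem), &db);
            clSetKernelArg(k, 2, sizeof(cl_mem), &dc);
            clEnqueueNDRangeKernel(q, k, 1, NULL, &half, NULL, 0, NULL, NULL);
            clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c + off, 0, NULL, NULL);
            clFinish(q);
        }
    }

Every command queue, buffer, transfer, and the static partitioning itself must be managed by hand; the runtime environment proposed in this thesis moves this bookkeeping and the partitioning decision into a work-pool scheduler.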

1.3 Task-level Parallelism across Platforms with Multiple Computing Devices

Platform agnosticism is a quality that is taken for granted with many existing programming languages such as C/C++ and Java. Programmers rely on compilers or run-time systems to automatically generate executables for different processing units. Until recently, there did not exist a set of


API functions that would enable the programmer to automatically exploit all computing resources when the characteristics of the underlying platform change (e.g., the number of processing units and accelerators). To help illustrate some of the challenges with heterogeneous computing, we consider the OpenCL open-source implementation of OpenSURF (Open source Speeded Up Robust Features) [24] to demonstrate a typical use of the OpenCL programming model. In OpenSURF, the degree of data parallelism within a single kernel can vary when executing on different computing devices. Execution dynamics also depend on the characteristics of the input images or video frames, such as their size and image complexity. Furthermore, without proper runtime management, when mapping to another platform with a different number of devices we usually have to re-design the kernel binding and the associated data transfers. Without runtime workload balancing, the additional processing units available on the targeted accelerator may remain idle unless the application is redesigned. Even with the range of parallelism present in OpenSURF, an application has no inherent ability to exploit extra computing resources, and is not able to improve performance if we upgrade our hardware platform. In this thesis we present a cross-platform heterogeneous runtime environment that helps ameliorate many of the burdens faced when performing heterogeneous programming. New programming models such as OpenCL and CUDA provide the ability to dynamically initialize the platforms and objects, and to query the processing capability of each device, such as the number of compute units, core frequency, etc. The presented runtime environment augments this ability, and incorporates a central task queuing/scheduling system. This central task queuing system is based on the concepts of work pools and work units, and cooperates with workload balancing algorithms to execute applications on heterogeneous hardware platforms. Using the runtime API, programmers can easily develop and tune flexible workload balancing schemes across different platforms. In the proposed runtime environment, data-parallel kernels in an application are wrapped with metadata into work units. These work units are then enqueued into a work pool and assigned to computing devices according to a selectable workload balancing policy. A resource management system is seamlessly integrated into the central task-queuing system to provide for the migration of kernels between devices and platforms. We demonstrate the utility of this class of task queuing runtime system by implementing selected benchmark applications from OpenCL benchmark suites. We also benchmark the performance trade-offs by implementing real-world applications such as clSURF [25], an OpenCL open-source implementation of the OpenSURF (Open source Speeded Up Robust Features) framework, and Monte Carlo Extreme in OpenCL [26], a Monte Carlo simulation for time-resolved

photon transport in 3D turbid media.

1.4 Scope and Contribution of This Thesis

Unlike the ubiquity of the x86 architecture and the long life cycle of CPU designs, GPUs often have much shorter release cycles and ever-changing ISAs and hardware features. Platforms incorporating GPUs as accelerators can have very different configurations in terms of processing capabilities and number of devices. As such, the need has arisen for a programming interface and runtime execution system that allows a single program to be portable across different platforms, and can automatically use all devices supported by an associated workload balancing scheme. The key contribution of this thesis is the development of a cross-platform heterogeneous runtime environment, which enables flexible task-level workload balancing on heterogeneous platforms with multiple computing devices. Together with the application programming interface, this extension layer is designed in the form of a library. Different levels of this runtime environment are considered. We study the following aspects of our runtime environment:

• We enable portable execution of applications across platforms. Our runtime environment provides a unified abstraction for all processing units, including CPU, GPU and many existing OpenCL devices. With this unified abstraction, tasks are able to be distributed on all devices. An application is portable across different platforms with a variable number of processing units.

• We provide APIs to expose both task-level and data-level parallelism. The program designer is in the best position to identify all levels of parallelism present in his/her application. We provide API functions and a dependency description mechanism so that the programmer can expose task-level parallelism. When combined with the data-level parallelism present in OpenCL kernels, the run-time and/or the compiler can effectively adapt to any type of parallel machine without modification of the source code.

• We balance task execution dynamically based on static and run-time profiling information. The optimal static mapping of task execution onto the underlying platform requires a significant amount of analysis of all of the devices on the platform, and it is impossible for programmers to perform such analysis and remapping whenever new hardware is used. A dynamic workload balancing scheme makes it possible for the same source code to obtain portable performance.


• We support the management of data locality at runtime. Due to the data transfer overhead and its impact on the overall performance of OpenCL applications, data locality is an important issue for the portable execution of tasks. In our OpenCL support, data management is tightly integrated with workload migration decisions. The runtime layer ensures data availability and data coherency throughout the whole system.

• We simplify the initialization of platforms. Scientific programmers usually are not familiar with the best way to initialize platforms across different types of OpenCL devices. With a new API designed for our runtime environment, we shift this burden to the underlying execution system, so that the programmer can focus on the development of his/her algorithms.

1.5 Organization of This Thesis

The rest of this thesis is organized as follows. Chapter 2 provides necessary background information and related work on heterogeneous computing. It also presents a summary of related work on previously proposed runtime environments targeting heterogeneous platforms. In Chapter 3, we describe the structure and components in our cross-platform heterogeneous runtime environment and discuss how it can facilitate more effective use of the resources present on heterogeneous platforms. In Chapter 4, we explore the design space by using our heterogeneous runtime environment equipped with different scheduling schemes when running synthetic workloads. We then demonstrate the true value of our proposed runtime environment by evaluating the performance of benchmark applications run on multiple cross-vendor heterogeneous platforms. We present a detailed analysis on the performance components and demonstrate the programming efficiency. In Chapter 5, we conclude this thesis, summarizing the major contributions embodied in this work, and describe potential directions for future work.

Chapter 2

Background and Related Work

2.1 From Serial to Parallel

In July 2004, Intel released a statement that their 4 GHz chip, originally targeted for the fourth quarter, would be delayed until the first quarter of the following year. Company spokesman Howard High said the delay would help ensure that the company could deliver high chip quantities when the product launched. Later, in mid-October, in a surprising announcement, Intel officially abandoned its plans to release the 4 GHz version of the processor and moved its engineers onto other projects. This marked an abrupt change after 34 years of CPU frequency scaling, during which CPU frequency had grown exponentially over time. Figure 2.1 illustrates a brief history of Intel processors, plotting the number of transistors per chip and the associated clock speed [27]. As the total number of transistors continued to climb, the clock speed did not keep up. The reason behind this major change in CPU development is power and cooling challenges, more specifically power density. The power density in processors has already exceeded that of a hot plate. Continuing to increase the frequency would require either new cooling technologies or new materials to relax the physical limits of what a processor can withstand. Processor design has hit the power wall. Our ability to improve performance automatically by increasing the frequency of the processor is gone. To further improve application throughput, major silicon vendors elected to provide multi-core designs, providing higher performance within the constraints of thermal limits and power density thresholds.


Figure 2.1: Intel Processor Introduction Trends (data sources: Intel, Wikipedia, K. Olukotun). The plot tracks transistor count (thousands), clock speed (MHz), power (W), and performance per clock (ILP) for Intel processors from 1970 to 2010, spanning parts from the 386 and Pentium through the Pentium 4 and quad-core Ivy Bridge.

2.2 Many-Core Architecture

In recent years, multi-core processors have become the norm. Figure 2.2 shows an example of a multi-core processor. A multi-core processor has two or more processing cores on a single chip, each core with its own level-1 cache. A common global memory is shared among the different processing cores while multiple tasks execute on the multi-core processor. Intel's TeraFlops architecture [28] was designed to demonstrate a prototype of a many-core processor, as shown in Figure 2.3. Developed by Intel Corporation's Tera-Scale Computing Research Program, this research processor contains 80 tiles of cores and can yield 1.8 teraflops at 5.6 GHz.


Figure 2.2: Multi-core Processors with Shared Memory. Four cores (Core 0 through Core 3), each with private L1 and L2 caches, share an L3 cache and access a common system memory.

While data transfers can occur between any pair of cores, no cache coherency is enforced across cores, and all memory transfers are explicit. Therefore, the biggest hurdle to fully taking advantage of the power of these 80 cores is parallel programming. As shown in Figure 2.3, another interesting point is that dedicated hardware engines can be integrated with some of the cores for multimedia, networking, security, and other tasks. The Intel Xeon Phi [29] inherited many design elements from the Larrabee project [30], another high-performance co-processor based on the TeraFlops architecture. The Intel Xeon Phi coprocessor is primarily composed of processing cores, caches, PCIe client logic, and a very high bandwidth, bidirectional ring interconnect, as illustrated in Figure 2.4. Intel is using the Xeon Phi as the primary element of its family of Many Integrated Core architectures. Intel revealed its second-generation Many Integrated Core architecture in November 2013, with the codename Knights Landing [31]. Knights Landing contains up to 72 cores across 36 tiles manufactured in 14 nm technology, with each core running 4 threads. The Knights Landing chip also has a 2 MB coherent cache shared between the 2 cores in a tile, which reflects the effort to make this architecture as programmable as possible. Knights Landing is ISA-compatible with the Intel Xeon processors, with support for Advanced Vector Extensions 512, and supports most of today's parallel optimizations.


Figure 2.3: Intel's TeraFlops Architecture. The chip is organized as a grid of tiles, each pairing a local cache with a processing engine (PE); selected tiles can be augmented with dedicated engines such as HD video, cryptography, DSP, graphics, and physics accelerators.

One interesting feature of Knights Landing is that it can serve either as the main processor on a compute node or as a coprocessor in a PCIe slot. Intel is exploring different heterogeneous computing organizations. Another example of a heterogeneous many-core architecture is IBM's Cell processor [32]. It includes a general-purpose PowerPC core with 8 very simple SIMD coprocessors, which are specially designed for accelerating vector or multimedia operations. An operating system runs on the main core, which is called the Power Processing Unit (PPU). The PPU functions as a master device controlling the 8 coprocessors, which are called Synergistic Processing Elements (SPEs). Each SPE is a dual-issue, in-order processor composed of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC).


Figure 2.4: Intel's Xeon Phi Architecture. Cores with private L2 caches, tag directories (TD), GDDR5 memory controllers (GDDR MC), and PCIe client logic are connected by a very high bandwidth, bidirectional ring interconnect.

The Element Interconnect Bus (EIB) is the internal communication bus connecting the various on-chip system elements. The Cell processor was used as the processor for Sony's PlayStation 3 game console and in some high-performance computing servers, such as the IBM Roadrunner supercomputer and Mercury Systems servers with Cell accelerator boards [33]. By November 2009, IBM had discontinued development of the Cell processor. The Cell processor benefits from very high internal memory bandwidth, but all transfers must be explicitly programmed using low-level asynchronous DMA transfers. It requires significant expertise to write efficient code for this architecture, especially given the limited size of the local storage on each SPU (256 KB). Load balancing is another challenging issue on the Cell: the application programmer is responsible for evenly mapping the different pieces of computation onto the SPUs. Beyond the novelty of each of these many-core hardware designs, industry has realized that programmability cannot be overlooked any longer. When the hardware design of these processors reaches an unprecedented level of complexity, it is impossible for software designers to manage all the processing elements manually. Suitable programming models are desperately needed to exploit the computing power of these architectures.


Figure 2.5: IBM's Cell. The Power Processing Unit (PPU), with its cache, and eight Synergistic Processing Units (SPUs), each with a local store (LS), are connected by the Element Interconnect Bus (EIB) to main memory (RAM).

2.3 Programming Paradigms for Many Core Architecture

Given the development of a number of many-core architectures, many parallel programming models have been developed to facilitate the use of these architectures.

2.3.1 Pthreads

Pthreads, or Portable Operating System Interface (POSIX) Threads, is a set of C programming-language types, functions and variables [34]. Pthreads is implemented as a header (pthread.h) and a library, which creates and manages multiple threads. When using Pthreads, the programmer has to explicitly create and destroy threads by making use of the pthread API functions. The Pthreads library provides mechanisms to synchronize different threads, resolve race conditions, avoid deadlock conditions, and protect critical sections. However, the programmer has the responsibility to manage threads explicitly. Therefore, it is usually very challenging to design a scalable multithreaded application on modern many-core architectures, especially on systems with hundreds of cores in a single machine.
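As a minimal illustration of the explicit thread management described above, the following C sketch (error handling omitted) creates a fixed number of POSIX threads, assigns each a contiguous slice of an array to sum, and joins them before combining the partial results.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define N 1024

    static double data[N];
    static double partial[NUM_THREADS];

    /* Each thread sums its contiguous slice of the array. */
    static void *worker(void *arg) {
        long id = (long)arg;
        long begin = id * (N / NUM_THREADS);
        long end = begin + (N / NUM_THREADS);
        double sum = 0.0;
        for (long i = begin; i < end; i++)
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_THREADS];
        for (long i = 0; i < N; i++) data[i] = 1.0;

        /* The programmer explicitly creates and later joins every thread. */
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, worker, (void *)t);

        double total = 0.0;
        for (long t = 0; t < NUM_THREADS; t++) {
            pthread_join(threads[t], NULL);
            total += partial[t];
        }
        printf("sum = %f\n", total);
        return 0;
    }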


2.3.2 OpenMP

OpenMP is an open specification for shared memory parallelism [35][36]. It comprises compiler directives, callable runtime library routines and environment variables which extend FORTRAN, C and C++ programs. OpenMP is portable across shared memory architectures. Thread management is implicit, and the programmer uses special directives to specify which sections of code are to be run in parallel. The number of threads to be used is specified through environment variables. OpenMP has also been extended as a parallel programming model for clusters. OpenMP uses several constructs to support implicit synchronization, so that the programmer is relieved from worrying about the actual synchronization mechanism. As with Pthreads, scalability is still an issue for OpenMP, as it is a thread-based mechanism. Furthermore, since OpenMP uses implicit thread management, there is no fine-grained way to perform thread-to-processor mapping.
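The corresponding OpenMP version of the same array sum is considerably shorter, since thread creation, work partitioning, and the reduction are expressed through directives rather than managed explicitly; this is a minimal sketch.

    #include <omp.h>
    #include <stdio.h>

    #define N 1024

    int main(void) {
        static double data[N];
        double sum = 0.0;
        for (int i = 0; i < N; i++) data[i] = 1.0;

        /* The directive marks the loop as parallel; thread creation,
           work partitioning, and the reduction are handled implicitly. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += data[i];

        printf("sum = %f (using up to %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }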

2.3.3 MPI

The Message Passing Interface (MPI) [37] provides virtual topology, synchronization, and communication functionality between nodes in clusters. It is a natural candidate for accelerating applications on distributed systems. MPI is currently the most widely used standard for developing High Performance Computing (HPC) applications for distributed memory architectures. It provides programming interfaces for C, C++, and FORTRAN. Some of the well-known MPI implementations include OpenMPI [38], MVAPICH [39], MPICH [40], GridMPI [41], and LAM/MPI [42]. Similar to Pthreads, workload partitioning and task mapping have to be done by the programmer, but message passing is a convenient way to express data transfer between different processors. MPI barriers are used to specify that synchronization is needed. The barrier operation blocks each process from continuing its execution until all processes have entered the barrier. A typical usage of barriers is to ensure that global data has been dispersed to the appropriate processes.
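The following minimal MPI sketch in C illustrates the broadcast and barrier usage described above: rank 0 disperses a parameter, every process computes on its share, a barrier ensures all ranks have reached the same point, and a reduction collects the result (error handling omitted).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Rank 0 owns the global parameter; the broadcast disperses it. */
        int param = (rank == 0) ? 42 : 0;
        MPI_Bcast(&param, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Each process computes on its share of the work. */
        int local = param + rank;

        /* Barrier: no process continues until all have reached this point. */
        MPI_Barrier(MPI_COMM_WORLD);

        int total = 0;
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum over %d ranks = %d\n", size, total);

        MPI_Finalize();
        return 0;
    }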

2.3.4 Hadoop MapReduce

Hadoop MapReduce is a software framework for easily developing parallel applications, and is especially well suited for processing vast amounts of data (e.g., multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware, in a reliable, fault-tolerant manner [43]. A MapReduce job usually splits the input data-set into independent chunks

which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. Typically the compute nodes and the storage nodes are the same. For example, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster [44].

2.4 Computing with Graphic Processing Units

Although multicore architectures have made it possible for applications to overcome some of the physical limits encountered with purely sequential architectures, their degree of parallelization is not comparable to the parallelism available on graphics processing units (GPUs). Intrinsically, GPUs are designed for highly parallel problems. As graphics problems have become more and more complex, new architectures and APIs have been created, and GPUs have become more and more programmable. In this section, we briefly review the current state of GPU computing, and the GPU's transition from a hardware implementation of standard graphics APIs to a fully programmable general-purpose processing unit.

2.4.1 The Emergence of Programmable Graphics Hardware

Interactive 3D graphics applications have very different characteristics compared to general-purpose applications. Specifically, interactive 3D applications require high throughput and exhibit substantial parallelism. Since the late 1990s, custom hardware has been built to take advantage of the native parallelism in these applications. Those early custom accelerators were designed in the form of fixed-function pipelines based on a hardware implementation of the OpenGL [9] standard and Microsoft's DirectX programming APIs. At each stage of the pipeline, a sequence of different operations was implemented in hardware units for specific tasks. Given that the GPU was originally designed to produce visual realism in rendered images, fixed-function pipeline graphics hardware has some limitations in what it can perform efficiently. In the meantime, offline rendering systems such as Pixar's RenderMan [45] were able to achieve impressive visual results by replacing the fixed-function pipeline with a more flexible programmable pipeline. In

the programmable pipeline, fixed-function operations are replaced by user-provided pieces of code called shaders. Pixel shaders, vertex shaders and geometry shaders were introduced to enable flexible processing at each programmable pipeline stage. Initially, in early shader models, vertex and pixel shaders were implemented using very different instruction sets. Later, in 2006, OpenGL's Unified Shader Model and DirectX 10's Shader Model 4.0 provided consistent instruction sets across all shader types: geometry, vertex and pixel shaders. All three types of shaders have almost the same capabilities. For example, they can perform the same set of arithmetic instructions and read from texture or data buffers. Graphics hardware designers continued to explore the best ISA for the shader models. Before the unified shader model, ATI's Xenos graphics chip, integrated in the Xbox 360, used a unified shader architecture. Most GPU designs continued to use dedicated hardware units for each shader type, even though they supported a unified shader model. But eventually, all major GPU makers chose a Unified Shader Architecture, which allows a single type of processing unit to be used for all types of shaders. The Unified Shader Architecture decouples the type of shader from the processing unit, and allows a dynamic assignment of shaders to the different processing cores. This flexibility leads to better workload balance, allowing hardware resources to be allocated dynamically to different types of shaders, based on the needs of the workload. Figure 2.6 is a high-level block diagram of a modern GPU architecture.

2.4.2 General Purpose GPUs

With the emergence of programmable graphics hardware, new shader languages and programming APIs have been created to facilitate the programming effort. Since DirectX 9, Microsoft has been using the High Level Shading Language (HLSL) [46], which supports shader construction with C-like syntax, types, expressions, statements and functions. Similarly, the OpenGL Shading Language (GLSL) [47] is the corresponding high-level language targeting OpenGL shader programs. Nvidia's Cg [48] is a collaborative effort with Microsoft; the Cg compiler outputs both DirectX and OpenGL shader programs. Although these shader languages are very popular across the graphics community, mainstream programmers feel a lack of connection between the graphics primitives in these shader languages and the constructs in general-purpose programming languages. With the introduction of unified shader architectures and unified shader models, a uniform ISA makes it easier to design high-level languages for this workload. Some examples of these higher-level languages include Brook [49], Scott [50], Glift [51], Nvidia's CUDA [5], and the Khronos Group's OpenCL [6], which is an extension of Brook.


Figure 2.6: High-Level Block Diagram of a GPU. An array of shader cores is connected through an interconnection network to a shared L2 cache and global memory.

These high-level languages hide the graphics primitives behind programming constructs which are more familiar to general-purpose programmers. The availability of CUDA and OpenCL, currently the two most popular languages, has dramatically increased the programmability of GPU hardware. As a result, GPUs have been widely adopted in many general-purpose platforms for executing data-parallel, computationally-intensive workloads [52]. Many key applications possessing a high degree of data-level parallelism have been successfully accelerated using GPUs. GPUs have been included in the standard configuration for many desktop machines and servers. The availability of high-level languages has allowed industry to support both graphics and compute on the same GPU. According to the 42nd TOP500 list, GPUs are used in the No. 2 and No. 6 fastest supercomputers in the world [53]. Intel Xeon Phi processors are used in the No. 1 and No. 7 fastest supercomputers in the world. A total of fifty-three systems on the list use accelerator/co-processor technology. Thirty-eight of these systems use NVIDIA GPU chips, two use ATI Radeon,

and there are now thirteen systems with Intel MIC technology (Xeon Phi).

Figure 2.7: OpenCL Architecture. An application consists of host code that calls the OpenCL API and OpenCL kernels written in the OpenCL C language; the OpenCL framework is layered on the OpenCL runtime and the OpenCL driver, which ultimately target the GPU hardware.

2.5 OpenCL

OpenCL (Open Computing Language) is an open standard for general-purpose parallel programming on CPUs, GPUs and other processors, giving software developers portable and efficient access to the computing resources of these heterogeneous processing platforms [54]. OpenCL allows a heterogeneous platform to be viewed as a single platform with multiple computing devices. It is a mature framework that includes a language definition, a set of APIs, compiler libraries, and a runtime system to support software development. Figure 2.7 shows a high-level breakdown of the OpenCL architecture.


Figure 2.8: An OpenCL Platform. A host is connected to one or more compute devices; each compute device contains one or more compute units, and each compute unit contains one or more processing elements.

2.5.1 An OpenCL Platform

The OpenCL framework adopts the concept of a platform, in which a host is interconnected with multiple OpenCL devices [55]. An OpenCL device can be a CPU, a GPU, or any other type of processing unit that supports the OpenCL standard. An OpenCL device can be divided into one or more compute units (CUs), and a CU can be further divided into one or more processing elements (PEs). Figure 2.8 shows how the OpenCL standard hierarchically describes a heterogeneous platform with multiple OpenCL devices, multiple CUs and multiple PEs.
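As a minimal host-side sketch of this platform model (error handling omitted and the array bounds chosen arbitrarily), the following C code enumerates the available platforms and devices and queries each device's name and number of compute units through the standard OpenCL API.

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);

        for (cl_uint p = 0; p < num_platforms; p++) {
            cl_device_id devices[16];
            cl_uint num_devices = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16,
                           devices, &num_devices);

            for (cl_uint d = 0; d < num_devices; d++) {
                char name[256];
                cl_uint cus = 0;
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                                sizeof(name), name, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                                sizeof(cus), &cus, NULL);
                printf("platform %u, device %u: %s (%u compute units)\n",
                       p, d, name, cus);
            }
        }
        return 0;
    }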

2.5.2 OpenCL Execution Model

The execution model of OpenCL consists of two parts: a host program, running on the host device, that sets up data and schedules execution on compute devices, and kernels that execute on one or more OpenCL devices [56]. Figure 2.9 shows the OpenCL execution model. An OpenCL command-queue is where the host interacts with an OpenCL device by enqueuing commands. Each command-queue is associated with a single device. There are three types of commands in a command-queue (a minimal host-side sketch using all three follows the list below):


Figure 2.9: The OpenCL Execution Model. A host program creates a context containing a program with kernels (e.g., foo(), bar(), baz(), qux()) and enqueues them, through command queues, onto the available devices (Device 0 through Device 3).

• Kernel-enqueue commands: Enqueues a kernel for execution on a device.

• Memory commands: Transfers data between the host and device memory, between memory objects, or maps and unmaps memory objects from the host address space.

• Synchronization commands: Explicit synchronization points that define ordering constraints between commands.
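The following C sketch shows the three command types in one place: a non-blocking write (memory command), a kernel enqueue that waits on the write through an event, a marker (synchronization command), and finally a blocking read. The function signature, buffer names, and the two-argument kernel layout are assumptions made for illustration; error handling is omitted.

    #include <CL/cl.h>

    /* Sketch: enqueue one step of work on an existing queue. The queue,
       kernel, and device buffers are assumed to be created elsewhere. */
    void enqueue_step(cl_command_queue queue, cl_kernel kernel,
                      cl_mem d_in, cl_mem d_out,
                      const float *host_in, float *host_out, size_t n)
    {
        size_t nbytes = n * sizeof(float);
        cl_event write_done, kernel_done;

        /* Memory command: non-blocking host-to-device transfer. */
        clEnqueueWriteBuffer(queue, d_in, CL_FALSE, 0, nbytes, host_in,
                             0, NULL, &write_done);

        /* Kernel-enqueue command: launches only after the write completes. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL,
                               1, &write_done, &kernel_done);

        /* Synchronization command: a marker whose event completes once the
           kernel has reached CL_COMPLETE. */
        cl_event marker;
        clEnqueueMarkerWithWaitList(queue, 1, &kernel_done, &marker);

        /* Blocking memory command: returns only when results are on the host. */
        clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, nbytes, host_out,
                            1, &kernel_done, NULL);
    }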

Commands communicate their status through event objects. Successful completion is indicated by setting the event status to CL_COMPLETE. Unsuccessful completion results in abnormal termination of the command, which is indicated by setting the event status to a negative value. In this case, the command-queue associated with the abnormally terminated command, and all other command-queues in the same context, may no longer be available, and their behavior is implementation-defined. A command submitted to a device will not launch until the prerequisites that constrain the order of commands have been resolved. These prerequisites have two sources. First, they may

arise from commands submitted to a command-queue that constrain the order in which commands are launched. For example, commands that follow a command-queue barrier will not launch until all commands prior to the barrier are complete. The second source of prerequisites is dependencies between commands expressed through events. A command may include an optional list of events. The command will wait and not launch until all the events in the list are in the CL_COMPLETE state. Using this mechanism, event objects define ordering constraints between commands and coordinate execution between the host and one or more devices [54]. In our cross-platform runtime system, we expand this mechanism to support dependencies between events across OpenCL devices from different vendors. A command may be submitted to a device and yet have no visible side effects other than to wait on and satisfy event dependencies. Examples include markers, kernels executed over ranges with no work-items, and copy operations of zero size. Such commands may pass directly from the ready state to the ended state. Command execution can be blocking or non-blocking. Consider a sequence of OpenCL commands. For blocking commands, the OpenCL API functions that enqueue commands do not return until the command has completed. Alternatively, OpenCL functions that enqueue non-blocking commands return immediately and require that the programmer define dependencies between enqueued commands to ensure that enqueued commands are not launched before the needed resources are available. In both cases, the actual execution of the command may occur asynchronously with the execution of the host program. Multiple command-queues can be present within a single context. Multiple command-queues execute commands independently. Event objects visible to the host program can be used to define synchronization points between commands in multiple command-queues. If such synchronization points are established between commands in multiple command-queues, an implementation must assure that the command-queues progress concurrently and correctly account for the dependencies established by the synchronization points. The core of the OpenCL execution model is defined by how the kernels execute. When a kernel-enqueue command submits a kernel for execution, an index space is defined. The kernel, the argument values associated with the arguments to the kernel, and the parameters that define the index space define a kernel-instance. When a kernel-instance executes on a device, the kernel function executes for each point in the defined index space. Each of these executing kernel functions is called a work-item. The work-items associated with a given kernel-instance are managed by the device in groups called work-groups. These work-groups define a coarse-grained decomposition of the index

space. Work-groups are further divided into sub-groups, which provide an additional level of control over execution.

Figure 2.10: OpenCL work-items mapping to GPU devices. An NDRange of size (Gx, Gy) is divided into work-groups of size (Sx, Sy); the work-item with local ID (sx, sy) in work-group (wx, wy) has global ID (wx Sx + sx, wy Sy + sy).

2.5.2.1 Mapping OpenCL Work-items

Each work-item's global ID is an N-dimensional tuple. The global ID components are values in the range from the global offset F to F plus the number of elements in that dimension, minus one. If a kernel is compiled as an OpenCL 2.0 kernel [20], the size of work-groups in an NDRange (the local size) need not be the same for all work-groups. In this case, any single dimension for which the global size is not divisible by the local size will be partitioned into two regions. One region will have work-groups that have the same number of work-items as was specified for that dimension by the programmer (the local size). The other region will have work-groups with fewer than the number of work-items specified by the local size parameter in that dimension (the remainder work-groups). Work-group sizes can be non-uniform in multiple dimensions, potentially producing work-groups of up to 4 different sizes in a 2D range and 8 different sizes in a 3D range. Each work-item is assigned to a work-group and is given a local ID to represent its position within the work-group. A work-item's local ID is an N-dimensional tuple with components in the range from zero to the size of the work-group in that dimension minus one.
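As a concrete illustration of these index relations, the following minimal OpenCL C kernel (names assumed for illustration) records each work-item's global ID alongside the value reconstructed from its work-group ID, local ID, work-group size, and global offset; for a correctly configured launch the two output arrays are identical.

    /* Minimal OpenCL C kernel illustrating the work-item index space.
       For dimension 0: global_id = group_id * local_size + local_id
       (plus the global offset specified at enqueue time, if any). */
    __kernel void show_ids(__global int *global_ids,
                           __global int *reconstructed)
    {
        size_t gid   = get_global_id(0);    /* position in the NDRange    */
        size_t lid   = get_local_id(0);     /* position in the work-group */
        size_t group = get_group_id(0);     /* work-group index           */
        size_t lsize = get_local_size(0);   /* work-group size            */

        global_ids[gid]    = (int)gid;
        reconstructed[gid] = (int)(group * lsize + lid + get_global_offset(0));
    }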


Figure 2.11: OpenCL work-items mapping to CPU devices. Each CPU thread (Thread 0 through Thread n-1) executes whole work-groups (Workgroup 0 through Workgroup 2n-1) one at a time, serializing the work-items within each work-group and switching between them at barriers.

Work-groups are assigned IDs similarly. The number of work-groups in each dimension is not directly defined, but is inferred from the local and global NDRanges provided when a kernel instance is enqueued. A work-group's ID is an N-dimensional tuple with components in the range from 0 to the ceiling of the global size in that dimension divided by the local size in the same dimension, minus one. As a result, the combination of a work-group ID and the local ID within a work-group uniquely defines a work-item. Each work-item is identifiable in two ways: in terms of a global index, and in terms of a work-group index plus a local index within a work-group. On a CPU device, work-items are mapped by a different mechanism. An example mapping of OpenCL execution onto a CPU is shown in Figure 2.11. In this example, one worker thread is created per physical CPU core when executing a kernel. This worker thread, which is usually a CPU thread, takes a work-group from the ND-range and begins to execute its associated work-items one by one in sequence. If an OpenCL barrier is reached, the work-item's state is stored and the execution of the following work-item begins. When all work-items in the work-group have reached the barrier, execution goes back to the first work-item that stopped at the barrier, and it resumes

execution until the next synchronization point. In the absence of barriers, the first work-item will run to the end of the kernel before switching to the next. In both cases, the CPU will continuously process the work-items until the entire work-group has been executed. During this whole process, idle CPU threads look for any remaining work-groups in the ND-range and begin to process them.
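The sketch below is a simplified, conceptual C illustration of this serialization strategy; it is not the implementation of any particular OpenCL CPU runtime, and execute_segment is a hypothetical stub standing in for the kernel code between two barriers.

    #include <stddef.h>

    typedef struct { size_t local_id; /* ... saved registers, private data ... */ } work_item_state;

    /* Hypothetical helper: runs the kernel code between two barriers for a
       single work-item (stubbed out here). */
    static void execute_segment(work_item_state *wi, int segment)
    {
        (void)wi; (void)segment;
    }

    /* One worker thread executes an entire work-group by serializing its
       work-items: all work-items finish barrier segment s before any
       work-item starts segment s + 1, which preserves barrier semantics. */
    static void run_work_group(work_item_state *items, size_t local_size,
                               int num_barrier_segments)
    {
        for (int segment = 0; segment < num_barrier_segments; segment++)
            for (size_t i = 0; i < local_size; i++)
                execute_segment(&items[i], segment);
    }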

2.5.2.2 Kernel Execution

A kernel object is defined to include a function within the program object and a collection of arguments connecting the kernel to a set of argument values [57]. The host program enqueues a kernel object to the command queue, along with the NDRange and the work-group decomposition. These define a kernel instance. In addition, an optional set of events may be defined when the kernel is enqueued. The events associated with a particular kernel instance are used to constrain when the kernel instance is launched with respect to other commands in the queue or with respect to commands in other queues within the same context. A kernel instance is submitted to a device. For an in-order command queue, the kernel instances appear to launch and then execute in that same order. Once its event dependencies and queue-ordering constraints are met, the kernel instance is launched and the work-groups associated with the kernel instance are placed into a pool of ready-to-execute work-groups. The device schedules work-groups from the pool for execution on the compute units of the device. The kernel-enqueue command is complete when all work-groups associated with the kernel instance end their execution, updates to global memory associated with a command are visible globally, and the device signals successful completion by setting the event associated with the kernel-enqueue command to CL_COMPLETE. While a command-queue is associated with only one device, a single device may be associated with multiple command-queues. A device may also be associated with command queues associated with different contexts within the same platform. The device will pull work-groups from the pool and execute them on one or several compute units in any order, possibly interleaving execution of work-groups from multiple commands. A conforming implementation may choose to serialize the work-groups, so a correct algorithm cannot assume that work-groups will execute in parallel. There is no safe and portable way to synchronize across the independent execution of work-groups since they can execute in any order. The work-items within a single sub-group execute concurrently, but not necessarily in parallel (i.e., they are not guaranteed to make independent forward progress). Therefore, only

high-level synchronization constructs (e.g., sub-group functions such as barriers) that apply to all the work-items in a sub-group are well defined and included in OpenCL. Sub-groups execute concurrently within a given work-group and with appropriate device support may make independent forward progress with respect to each other, with respect to host threads and with respect to any entities external to the OpenCL system but running on an OpenCL device, even in the absence of work-group barrier operations. In this situation, sub-groups are able to internally synchronize using barrier operations without synchronizing with each other and may perform operations that rely on runtime dependencies on operations other sub-groups perform. The work-items within a single work-group execute concurrently, but are only guaranteed to make independent progress in the presence of sub-groups and device support. In the absence of this capability, only high-level synchronization constructs (e.g., work-group functions such as barriers), that apply to all the work-items in a work-group, are well defined and included in OpenCL for synchronization within a work-group.
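
The host-side mechanics can be illustrated with the short sketch below, which uses only standard OpenCL C API calls. The queue, kernels, and buffer are assumed to have been created during normal setup, the work-group size of 64 is an arbitrary choice assumed to divide the global size, and the event wait-list is what constrains when the second kernel instance may launch.

#include <CL/cl.h>

// Sketch: enqueue two kernel instances so that kernelB only launches after
// kernelA has completed, using an event as the dependency. Assumes `queue`,
// `kernelA`, `kernelB`, and `buf` were created during normal OpenCL setup.
cl_int enqueue_dependent_kernels(cl_command_queue queue,
                                 cl_kernel kernelA, cl_kernel kernelB,
                                 cl_mem buf, size_t num_elements)
{
    cl_int err = clSetKernelArg(kernelA, 0, sizeof(cl_mem), &buf);
    if (err != CL_SUCCESS) return err;
    err = clSetKernelArg(kernelB, 0, sizeof(cl_mem), &buf);
    if (err != CL_SUCCESS) return err;

    // NDRange and work-group decomposition for this kernel instance.
    size_t global[1] = { num_elements };
    size_t local[1]  = { 64 };            // assumed to divide num_elements

    cl_event a_done;
    err = clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, global, local,
                                 0, NULL, &a_done);
    if (err != CL_SUCCESS) return err;

    // kernelB's wait-list contains kernelA's event, so it cannot launch
    // before kernelA's results are visible.
    err = clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, global, local,
                                 1, &a_done, NULL);
    clReleaseEvent(a_done);
    if (err != CL_SUCCESS) return err;

    return clFinish(queue);               // block until both instances complete
}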

2.5.2.3 Synchronization

Synchronization across all work-items within a single work-group is carried out using a work-group function [58]. These functions carry out collective operations across all the work-items in a work-group. Available collective operations are: barrier, reduction, broadcast, prefix sum, and evaluation of a predicate. A work-group function must occur within a converged control flow; i.e., all work-items in the work-group must encounter precisely the same work-group function. For example, if a work-group function occurs within a loop, the work-items must encounter the same work-group function in the same loop iterations. All the work-items of a work-group must execute the work-group function and complete reads and writes to memory before any are allowed to continue execution beyond the work-group function. Work-group functions that apply between work-groups are not provided in OpenCL since OpenCL does not define forward progress or ordering relations between work-groups, hence collective synchronization operations are not well defined.

Synchronization across all work-items within a single sub-group is carried out using a sub-group function. These functions carry out collective operations across all the work-items in a sub-group. Available collective operations are: barrier, reduction, broadcast, prefix sum, and evaluation of a predicate. A sub-group function must occur within a converged control flow; i.e., all work-items in the sub-group must encounter precisely the same sub-group function. For example, if a sub-group function occurs within a loop, the work-items must encounter the same sub-group

function in the same loop iterations. All the work-items of a sub-group must execute the sub-group function and complete reads and writes to memory before any are allowed to continue execution beyond the sub-group function. Synchronization between sub-groups must either be performed using work-group functions, or through memory operations. Memory operations used for sub-group synchronization should be used carefully, as forward progress of sub-groups relative to each other is only supported optionally by OpenCL implementations. A synchronization point between a pair of commands (A and B) assures that the results of command A happen before command B is launched. This requires that any updates to memory from command A complete and are made available to other commands before the synchronization point completes. Likewise, this requires that command B waits until after the synchronization point before loading values from global memory. The concept of a synchronization point works in a similar fashion for commands such as a barrier that apply to two sets of commands. All the commands prior to the barrier must complete and make their results available to following commands. Furthermore, any commands following the barrier must wait for the commands prior to the barrier before loading values and continuing their execution.
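
As a concrete example of a work-group function, the kernel sketch below (ordinary OpenCL C held in a C++ string, not taken from the thesis) uses barrier() around a local-memory tree reduction. Every work-item in the work-group encounters each barrier in converged control flow, as required; the work-group size is assumed to be a power of two.

// Sketch (assumptions: local size is a power of two and the local buffer is
// at least that large). The first barrier makes the local-memory writes
// visible to the whole work-group; the barrier inside the loop keeps the
// tree reduction in lock step.
static const char *kGroupSumKernel = R"CLC(
__kernel void group_sum(__global const float *in,
                        __global float       *per_group_sum,
                        __local  float       *scratch)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);                 // work-group function

    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);             // converged control flow
    }

    if (lid == 0)
        per_group_sum[get_group_id(0)] = scratch[0];
}
)CLC";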

2.5.3 OpenCL Memory Model

The OpenCL memory model describes the structure, contents, and behavior of the memory exposed by an OpenCL platform as an OpenCL program runs [59]. The model allows a programmer to reason about values in memory as the host program and multiple kernel-instances execute. An OpenCL program defines a context that includes a host, one or more devices, command-queues, and memory exposed within the context. Consider the units of execution involved with such a program. The host program runs as one or more host threads managed by the operating system running on the host (the details of which are defined outside of OpenCL). There may be multiple devices in a single context which all have access to memory objects defined by OpenCL. On a single device, multiple work-groups may execute in parallel with potentially overlapping updates to memory. Finally, within a single work-group, multiple work-items concurrently execute, once again with potentially overlapping updates to memory. The memory regions, and their relationship to the OpenCL Platform model, are summarized in Figure 2.12. Local and private memories are always associated with a particular device. The global and constant memories, however, are shared between all devices within a given context. An OpenCL device may include a cache to support efficient access to these shared memories.


Figure 2.12: The OpenCL Memory Hierarchy. (Host memory resides with the host; global and constant memory are shared by all devices; each compute unit has its own local memory, and each processing element its own private memory.)

To understand memory in OpenCL, it is important to appreciate the relationship between these named address spaces. The four named address spaces available to a device are disjoint, which means that they do not overlap. This is their logical relationship, however, and an implementation may choose to let these disjoint named address spaces share physical memory. Programmers often need functions callable from kernels, where the pointers manipulated by those functions can point to multiple named address spaces. This saves a programmer from the error-prone and wasteful practice of creating multiple copies of functions, one for each named address space. To support this, OpenCL 2.0 defines a generic address space to which the global, local, and private address spaces belong.
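
The following fragment is a minimal sketch of this point, assuming the program is compiled as OpenCL 2.0 (e.g., with -cl-std=CL2.0): the helper function takes an unqualified, generic pointer, so a single definition can be called with both __local and __global arguments instead of being duplicated per address space.

// Sketch assuming an OpenCL 2.0 compiler (-cl-std=CL2.0).
static const char *kGenericAddressSpaceKernel = R"CLC(
// Helper with a generic-address-space pointer: one definition serves both
// __local and __global data.
void scale(float *p, int n, float factor)
{
    for (int i = 0; i < n; ++i)
        p[i] *= factor;
}

__kernel void scale_both(__global float *data, int n)
{
    __local float tile[64];              // assumes a work-group size of 64
    int lid = (int)get_local_id(0);

    tile[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid == 0)
        scale(tile, 64, 2.0f);           // __local pointer  -> generic parameter
    if (get_global_id(0) == 0)
        scale(data, n, 0.5f);            // __global pointer -> generic parameter
}
)CLC";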

2.6 Heterogeneous Computing

To take full advantage of the resources on a heterogeneous platform, the programmer has to manage the allocation of these resources. In this section, we introduce several projects that were designed or extended to support heterogeneous computing platforms. All of these runtimes or libraries provide higher-level software layers with convenient abstractions, which relieve the programmer of the burden of managing resources on the targeted heterogeneous platform.


Figure 2.13: Qilin Software Architecture. (The application calls the Qilin API; the Qilin system layer, consisting of the dynamic compiler, code cache, libraries, development tools, and scheduler, maps computation onto the CPU and GPU hardware.)

2.6.0.1 Qilin

Qilin [60] is a programming system recently developed for heterogeneous multiprocessors. Figure 2.13 shows the software architecture of Qilin. At the application level, Qilin provides an API that programmers use to describe parallelizable operations. By explicitly expressing these computations through the API, the compiler does not have to extract any implicit parallelism from the serial code, and instead can focus on performance tuning. Similar to OpenMP, the Qilin API is built on top of C/C++ so that it can be easily adopted. But unlike standard OpenMP, where parallelization only happens on the CPU, Qilin can exploit the hardware parallelism available on both the CPU and the GPU.

Beneath the API layer is the Qilin system layer, which consists of a dynamic compiler and its code cache, a number of libraries, a set of development tools, and a scheduler. The compiler dynamically translates the API calls into native machine code. It also produces a near-optimal mapping from computations to processing elements using an adaptive algorithm. To reduce compilation overhead, translated code is stored in the code cache so that it can be reused without recompilation whenever possible. Once native machine code is available, it can be scheduled to run on the CPU and/or the GPU by the scheduler. Libraries include commonly used functions such as BLAS and FFT. Finally, debugging, visualization, and profiling tools can be provided to facilitate the development of Qilin programs.

Qilin uses off-line profiling to obtain information about each task on each computing device. This information is then used to partition tasks and create an appropriate performance model for the targeted heterogeneous platform. However, the overhead to carry out the initial profiling phase can be prohibitively high, and results may be inaccurate if computation behavior is heavily input dependent.

Figure 2.14: The OpenCL environment with the IBM OpenCL common runtime. (The common runtime presents a single platform to the application; one context spans Device 0 and Device 1, sharing programs and memory objects across their command queues.)


2.6.0.2 IBM OpenCL common runtime

IBM's OpenCL common runtime [61] improves the OpenCL programming experience by removing the burden from the programmer of managing multiple OpenCL platforms and duplicated resources, such as contexts and memory objects. In the conventional OpenCL programming environment, programmers are responsible for managing the movement of memory between two or more contexts when multiple OpenCL devices are present on the platform. In this case, the application is forced to perform host-side synchronization in order to move memory objects between coordinating contexts. Equipped with the common runtime, this movement and synchronization is done automatically.

In addition, the common runtime also improves the OpenCL programming experience by relieving the programmer of managing cross-queue scheduling and event dependencies. OpenCL requires that command-queue event dependencies originate from the same context as that of the command queue. In a multiple-context environment, this restriction forces programmers to manage their own cross-queue scheduling and dependencies. Again, this requires additional host-side synchronization in the application. With the common runtime, cross-queue event dependencies and scheduling are handled for the programmer.

Finally, the common runtime improves application portability and resource usage, which reduces application complexity. In the conventional OpenCL environment, coordination of OpenCL resources is more than just an inconvenience. Managing resources comes with challenges of application portability, which becomes an issue when code is tuned for a particular underlying platform. Applications are forced to choose between supporting only one platform, potentially leaving compute resources unused, and adding complexity to manage resources across a range of platforms. Using the unifying platform provided by the IBM OpenCL common runtime, applications are more portable and resources can be more easily exploited.

IBM's OpenCL common runtime is designed to improve the OpenCL programming experience by managing multiple OpenCL platforms and duplicated resources. It minimizes application complexity by presenting the programming environment as a single OpenCL platform. Shared OpenCL resources, such as data buffers, events, and kernel programs, are transparently managed across the installed vendor implementations. The result is simpler programming in heterogeneous environments. However, even equipped with this commercially developed common runtime, many multi-context features, such as scheduling decisions and data synchronization, must still be manually performed by the programmer.


2.6.0.3 StarPU

StarPU [62] automatically schedules tasks across the different processing units of an accelerator-based machine. Applications using StarPU do not have to deal with low-level concerns such as data transfers or efficient load balancing, which are target-system dependent. StarPU is a C library that provides an API to describe application data and to asynchronously submit tasks that are dispatched and executed transparently over the entire machine in an efficient way. Providing a separation of concerns between writing efficient algorithms and mapping them on complex accelerator-based machines therefore makes it possible to achieve portable performance, tapping into the potential of both accelerators and multi-core architectures.

An application first has to register data with StarPU. Once a piece of data has been registered, its state is fully described using an opaque data structure, called a handle. Programmers must then divide their applications into sets of possibly inter-dependent tasks. In order to obtain portable performance, programmers do not explicitly choose which processing units will process the different tasks. Each task is described by a structure that contains the list of handles of the data that the task will manipulate, the corresponding access modes (i.e., read, write, etc.), and a multi-versioned kernel called a codelet, which gathers the various kernel implementations available on the different types of processing units. The different tasks are submitted asynchronously to StarPU, which automatically decides where to execute them. Thanks to the data description stored in the handle data structure, StarPU also ensures that a coherent replica of each piece of data accessed by a task is automatically transferred to the appropriate processing unit. If StarPU selects a CUDA device to execute a task, the CUDA implementation of the corresponding codelet will be provided with pointers to locally replicated data allocated in the memory on the GPU.

Programmers need not worry about where the tasks are executed, nor how data replicas are managed for these tasks. They simply need to register data, submit tasks with their implementations for the various processing units, and just wait for their termination, or simply rely on task dependencies.

StarPU is a simple tasking API that provides numerical kernel designers with a convenient way to execute parallel tasks on heterogeneous platforms, and incorporates a number of different scheduling policies. StarPU is based on the integration of a resource management facility with a task execution engine. Several scientific kernels [63][64][65][66] have been deployed on StarPU to utilize the computing power of heterogeneous platforms. However, StarPU is implemented in C and

the basic schedulable units (codelets) have to be implemented multiple times if they are targeting multiple devices. This limits the migration of the codelets across platforms, and increases the programmer's burden. To overcome this limitation, StarPU has initiated a recent effort to incorporate OpenCL [67] as the front-end.
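
As a hedged illustration of the codelet/handle model described above, the fragment below follows the pattern of StarPU's publicly documented vector-scaling example; only a CPU implementation is registered for brevity, and the exact field and macro names (e.g., STARPU_MAIN_RAM) should be checked against the StarPU release in use.

#include <starpu.h>
#include <stdint.h>

// Sketch based on StarPU's documented vector-scaling example; names follow
// the public StarPU C interface and may differ slightly across versions.
static void scal_cpu(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface *v =
        (struct starpu_vector_interface *) buffers[0];
    unsigned n   = STARPU_VECTOR_GET_NX(v);
    float   *val = (float *) STARPU_VECTOR_GET_PTR(v);
    float factor = *(float *) cl_arg;
    for (unsigned i = 0; i < n; i++)
        val[i] *= factor;
}

int main(void)
{
    const unsigned NX = 1024;
    float vec[NX];
    for (unsigned i = 0; i < NX; i++) vec[i] = 1.0f;
    float factor = 3.0f;

    starpu_init(NULL);

    // The codelet is the multi-versioned kernel: cuda_funcs / opencl_funcs
    // implementations could be registered alongside cpu_funcs.
    static struct starpu_codelet cl = {};
    cl.cpu_funcs[0] = scal_cpu;
    cl.nbuffers     = 1;
    cl.modes[0]     = STARPU_RW;

    // Register the data; from now on it is referred to only through its handle.
    starpu_data_handle_t handle;
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t) vec, NX, sizeof(vec[0]));

    // Submit an asynchronous task; StarPU decides where it runs.
    struct starpu_task *task = starpu_task_create();
    task->cl          = &cl;
    task->handles[0]  = handle;
    task->cl_arg      = &factor;
    task->cl_arg_size = sizeof(factor);
    starpu_task_submit(task);

    starpu_task_wait_for_all();
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}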

2.6.0.4 Maestro

The Maestro model [68] unifies the disparate, device-specific queues into a single, high-level task queue. At runtime, Maestro queries OpenCL to obtain information about the available GPUs or other accelerators in a given system. Based on this information, Maestro can transfer data and divide work among the available devices automatically. This frees the programmer from having to synchronize multiple devices and keep track of device-specific information. Since OpenCL can execute on devices which differ radically in architecture and computational capabilities, it is difficult to develop simple heuristics with strong performance guarantees. Hence, Maestro's optimizations rely solely on empirical data, instead of any performance model or a priori knowledge. Maestro's general strategy for all optimizations can be summarized by the iterative steps shown in Figure 2.15: estimate based on benchmarks, collect empirical data from execution, optimize based on the results, and repeat while performance continues to improve. This strategy is used to optimize a variety of parameters, including local work group size, data transfer size, and the division of work across multiple devices. However, these dynamic execution parameters are only one of the obstacles to true portability. Another obstacle is the choice of hardware-specific kernel optimizations. For instance, some kernel optimizations may result in excellent performance on a GPU, but reduce performance on a CPU. This remains an open problem. Since the solution will no doubt involve editing kernel source code, it is beyond the scope of Maestro. Maestro is an open source library for data orchestration on OpenCL devices. It provides automatic data transfer, task decomposition across multiple devices, and auto-tuning of dynamic execution parameters for selected problems. However, Maestro relies heavily on empirical data and benchmark profiling beforehand. This limits its ability to run on applications with data-dependent program flow and/or data dependencies.

2.6.0.5 Symphony

Symphony [69], previously known as MARE (Multicore Asynchronous Runtime Environment) [70], seamlessly integrates heterogeneous execution into a concurrent task graph and removes the burden from the programmer of managing data transfers and explicit data copies between kernels


Figure 2.15: Maestro's Optimization Flow. (Estimate based on benchmarks; collect empirical data from execution; optimize based on results; repeat until performance stops improving, yielding the final performance strategy.)

executing on different devices. At a low level, Symphony provides state-of-the-art algorithms for work stealing and power optimizations that can hide hardware idiosyncrasies, allowing for portable application development. In addition, Symphony is designed to support dynamic mapping of kernels to heterogeneous execution units. Moreover, expert programmers can take charge of the execution through a carefully designed system of attributes and directives that provide the runtime system with additional semantic information about the patterns, tasks, and buffers that Symphony uses as building blocks.

Symphony runs on top of a runtime system that will execute the concurrent applications on all the available computational resources on the SoC. The Symphony runtime system is essentially a resource manager for threads, address spaces, and devices. It builds on a set of state-of-the-art algorithms to free programmers from the need to manage these resources explicitly and provides the best performance for the Symphony execution model. However, similar to StarPU, the kernels running on different devices in Symphony are in device-specific formats, and OpenCL functions merely as a language for offloading computing kernels to the GPU device as an accelerator on a Qualcomm


Figure 2.16: Symphony Overview

Snapdragon platform.

2.6.0.6 Workload mapping using machine learning

Grewe et al. [71] propose a static partitioning model for OpenCL programs targeting heterogeneous CPU-GPU systems. The model focuses on how to predict and partition the different tasks according to their computational characteristics, and does not abstract to any common programming interface that would enable more rapid adoption. The approach to partitioning data-parallel OpenCL tasks described in their work is purely static. There is no profiling of the target program, and the run-time overhead of dynamic schemes is avoided. In their approach, static analysis is used to extract code features from OpenCL programs. Given this information, the system determines the best partitioning across the processors and divides tasks into as many chunks as there are processors, each processor receiving the appropriate chunk. Deriving the optimal partitioning from a program's features is a difficult problem and depends heavily on the characteristics of the system. They rely on machine-learning techniques to automatically build a model that maps code features to partitions. Because the process

is entirely automatic, it is easily portable across different systems and implementations. They focus on CPU-GPU systems, as this is arguably the most common form of heterogeneous system on the market today. However, their static approach is inherently not able to adjust to changing input data sets.

2.6.1 Discussion

The concept of developing a cross-platform parallel scheme was first introduced by Bernard et al. [72]. Together with work stealing [73], this prior work focuses on the load balance problem for processors with asymmetric processing capabilities. Apart from StarPU, none of the above approaches focus on task partitioning or discuss how to exploit the task-level parallelism that is commonly present in large and diverse applications. We address this issue by enhancing OpenCL programming with a resource management facility. None of the above works attempts to propose a programming mechanism for portability across heterogeneous platforms.

2.7 SURF in OpenCL

In this thesis, we use real user applications as a demonstration of our cross-platform runtime environment. The first application is SURF (Speeded Up Robust Feature). The SURF application was first presented by Bay et al. [74]. The basic idea of SURF is to summarize images by using only a relatively small number of interesting points. The algorithm analyzes an image and produces feature vectors for every interesting point. SURF features have been widely used in real-life applications such as object recognition [75], feature comparison, and face recognition [76]. Numerous projects have implemented elements of SURF in parallel using OpenMP [77], CUDA [78, 79] and OpenCL [80]. We reference the state-of-the-art OpenCL implementation, clSURF, and use it as the baseline for performance comparison in this thesis. Figure 2.17 shows the program flow of clSURF for processing an image or one frame of a video stream. The whole program flow of clSURF is implemented as several stages, and these stages can be further separated into two phases. In the first phase, the amount of computation is mainly influenced by the size of the image. In the second phase, the amount of computation depends more on the number of interesting points, which is a reflection of the complexity of the image.


Figure 2.17: The Program Flow of clSURF. (Phase I: Build Integral Image, Calculate Hessian Determinant, Non-max Suppression; Phase II: Calculate Orientation, Calculate and Normalize Descriptors.)

Previous work has also evaluated SURF on hybrid and heterogeneous platforms [81]. However, that work concentrates on speeding up the algorithm and does not explore distributing the workload across the multiple devices of a platform.

2.8 Monte Carlo Extreme in OpenCL

Monte Carlo methods are a set of statistical computational algorithms particularly suitable for estimating the distribution of an unknown probabilistic entity. These methods usually apply to problems where it is difficult or impossible to obtain an analytical model or apply a deterministic algorithm. Instead, Monte Carlo methods generate a large number of independent random trials and collect statistics from them to estimate the probability distribution.


Modeling photon migration in turbid media by Monte Carlo methods has been demonstrated to be a valuable tool [82] in bio-optical imaging applications such as brain functional imaging and small-animal imaging [83]. Monte Carlo simulation of photon migration allows the reconstruction of the optical properties deep into the tissue, while traditional microscopy techniques limit the depth penetration to a few millimeters. Monte Carlo Extreme in OpenCL (MCXCL) is the OpenCL implementation of a Monte Carlo simulation for 3D complex media with efficient parallel random-number generators and boundary reflection schemes. The OpenCL implementation enables the simulation to utilize the massively parallel computing resources available on GPUs, and is extremely fast compared to single-threaded CPU-based simulations.

Figure 2.18: Block Diagram of the Parallel Monte Carlo simulation for photon migration.

Figure 2.18 is a block diagram of this parallel Monte Carlo simulation processing pipeline [26]. We reference this OpenCL implementation and use it as the baseline for the performance comparison in this thesis.

Chapter 3

Cross-platform Heterogeneous Runtime Environment

In this chapter, we describe our cross-platform heterogeneous runtime environment in detail, and consider some of the design decisions made before arriving at task-level workload balancing on multiple devices.

3.1 Limitations of the OpenCL Command-Queue Approach

In OpenCL, we use command queues as the mechanism to support the host interaction with devices. Equipped with a command queue, the host submits commands to be executed by a device. These commands include the execution of programs (kernels), as well as data transfers. The OpenCL standard specifies that each command queue is only associated with a single device; therefore if N devices are used, then N command queues are required. There are a number of factors that can limit performance when kernel execution runs across multiple devices.

3.1.1 Working with Multiple Devices

When a CPU and GPU are present in a system, the CPU can potentially help with the processing of some workloads that would normally be offloaded entirely to the GPU. Using the CPU as a compute device requires creating a separate command queue, and specifying which commands should be sent to that queue. To allow for this, we need to decide which workloads will target the CPU at compile time. At runtime, once a kernel is enqueued on a command queue, there is no

mechanism for a command to be removed from a queue or assigned to another queue. Effective workload balancing thus requires that we profile the kernel on both platforms, compute the relative performance, and divide the work accordingly. The disadvantages of this approach are: 1) the CPU may have some unknown amount of work to do between calls, 2) the performance of one or both devices may vary based on the specific input data used, and 3) the host CPU may be executing unrelated tasks which may be difficult to identify and which introduce noise into the profile.

Working with multiple GPU devices presents a similar problem. If multiple devices are used to accelerate the execution of a single kernel, we would still need to statically pre-partition the data. This is especially tricky with heterogeneous GPUs, as it does not allow for the consideration of runtime factors such as delays from bus contention, computation time changes based on the input data sets used, relative computational power, etc. If multiple devices are used for separate kernels, we would have to add code to either split the kernels between devices, or change the targeted command queue based on some other factors (e.g., number of loop iterations). Creating an algorithm that divides multiple tasks between a variable number of devices is not a trivial undertaking.

Another persuasive argument for changing the current queuing model is the fact that it limits effective use of Fusion-like devices. If multiple tasks are competing to run concurrently on a Fusion processor, one of the tasks may elect to run on the GPU, but if the device is already busy, it may be acceptable to run the specific tasks or kernel on the CPU. Unless we introduce new functionality that allows swapping contexts on the GPU, we are limited to using the current model, which forces programs that target the GPU to wait until all previous kernels have completed execution. Even if swapping is implemented, there may be just too many tasks attempting to share the GPU, and executing on the CPU may be preferable to waiting a long time for the GPU.
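
The sketch below makes the static pre-partitioning problem concrete. It shows the conventional approach using only standard OpenCL calls: one NDRange is split between a CPU queue and a GPU queue at a fixed, arbitrarily chosen 30/70 ratio by way of the global work offset argument. The kernels, queues, and contexts are assumed to have been created beforehand.

#include <CL/cl.h>

// Sketch of static pre-partitioning across two devices. The split ratio is
// fixed at enqueue time, which is exactly the limitation discussed above:
// it cannot react to run-time load, input-dependent cost, or bus contention.
void static_split(cl_command_queue cpu_queue, cl_command_queue gpu_queue,
                  cl_kernel cpu_kernel, cl_kernel gpu_kernel,
                  size_t total_items)
{
    size_t cpu_share = (total_items * 30) / 100;   // arbitrary 30% to the CPU
    size_t gpu_share = total_items - cpu_share;    // remaining 70% to the GPU

    size_t cpu_offset[1] = { 0 };
    size_t cpu_global[1] = { cpu_share };
    size_t gpu_offset[1] = { cpu_share };
    size_t gpu_global[1] = { gpu_share };

    // The two kernel objects come from two programs built for two contexts;
    // once these commands are enqueued there is no way to move work between
    // the queues, even if one device finishes long before the other.
    clEnqueueNDRangeKernel(cpu_queue, cpu_kernel, 1, cpu_offset, cpu_global,
                           NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(gpu_queue, gpu_kernel, 1, gpu_offset, gpu_global,
                           NULL, 0, NULL, NULL);

    clFinish(cpu_queue);
    clFinish(gpu_queue);
}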

3.2 The Task Queuing Execution System

The basic idea of the cross-platform execution is that programmers are not targeting a specific platform when they design their applications. Instead, all logical parallelism is exposed by the proper programming paradigm and the execution units are sent to a central task queue. Based on this task-level and data-level parallelism information, work units are eventually distributed to the appropriate computing devices according to the workload balancing algorithms. Instead of using command queues to communicate with devices (one command queue per device), we introduce the concept of a work pool as an improved task queuing extension to OpenCL.


Figure 3.1: Distributing work units from work pools to multiple devices. (The task-queuing extension API sits above the work unit queue with its enqueue and dequeue engines and the resource management unit, which in turn sit above the OpenCL interface and device drivers.)

Work pools function similarly to command queues, except that a work pool can automatically target any device in the system. In order to manage execution between multiple devices, OpenCL kernels need to be combined with additional information and wrapped into structures called work units. Figure 3.1 shows the components in the work pool-based task queuing execution system that runs on top of OpenCL. Once kernels are wrapped into work units, they can be enqueued into a work pool by the enqueue engine. On the back-end, the scheduler can dequeue work units which are ready to execute (i.e., have all dependencies resolved) on one of the devices in the system. Each work pool has access to all of the devices on the platform. It is possible to define multiple work pools and enqueue/dequeue engines. If there is task-level parallelism in the host code, the runtime environment allows the programmer to create multiple work pools that correspond to different workload balancing algorithms. From a practical standpoint, creating multiple work pools to assign kernels to the same device can increase device utilization. The following subsections more clearly define the concepts used in our task-queueing extension for OpenCL. The API functions that implement these concepts are described in detail in Section 3.2.6.


3.2.1 Work Units

In the proposed runtime environment, work units are the basic schedulable units of execution. A work unit consists of the actual OpenCL kernel (cl_kernel) and its dependencies that need to be resolved prior to its execution. When a work unit is created, the programmer supplies a list of work units that must complete execution before the current work unit can be distributed according to the implemented algorithm. This functionality follows the current OpenCL standard, where each clEnqueue function takes a list of cl_events as an argument. This list of cl_events must be completed before the current work unit is distributed and executed. We improve the dependency mechanism in OpenCL by creating a new event data structure, which can carry the dependency information across OpenCL contexts and platforms. To enable a work unit to execute on any device, the OpenCL kernel is pre-compiled for all devices in the system. When the work unit is distributed for execution, the kernel corresponding to the chosen device is selected.

3.2.2 Work Pools

A work pool is a central container holding the collection of work units to be distributed, with help from the resource management unit (Section 3.2.4). A scheduler (Section 3.2.5) interacts with this central execution system, dequeuing and executing work units according to their accompanying dependency information. Equipped with dependency information for each work unit, the work pool has system-wide runtime information available, so it is possible to make informed scheduling decisions. The work pool also has detailed knowledge of the system (such as the number and types of devices), and has the ability to work with multiple devices from different vendors. Together with the resource management unit, the runtime system tracks the status of all memory objects and events for each computing device in the system.

3.2.2.1 Enqueue and Dequeue Engine

The enqueue and dequeue operations on the work pool are performed by two independent CPU threads, and a mutex is used to protect the work pool, which is a shared collection of work units. Before a work unit is enqueued into the work pool, the runtime system assumes that all the OpenCL kernel information, such as kernel arguments, work group size, and work group dimensions, etc., is


already packaged in the work unit during initialization. However, it is possible that some of these configurations are dynamically generated. Under this circumstance, we rely on a callback mechanism that allows the user to inject initialization and finalization functions which are executed before and after the execution of the work unit, respectively. With these callback functions, we can change the work unit configuration based on the runtime information.

Figure 3.2: CPU and GPU execution. (a) Baseline Implementation; (b) Work Pool Implementation.

In the baseline implementation of the clSURF application, the kernel execution on the GPU and the host program execution on the CPU (e.g., program flow control, data structure initialization and image display, etc.) are synchronized with each other, as illustrated in Figure 3.2(a). By using separate enqueue and dequeue threads and the callback mechanism, we can execute some of the CPU

host programs asynchronously with the GPU kernel execution. This usually results in achieving higher utilization for the GPU command queue, and an overall performance gain for the application (Section 4.2.2). When utilizing multiple dequeue threads for each available device, multiple kernels can be further overlapped if there are no dependencies between them, and the utilization of the overall platform is further improved.
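
A conceptual sketch of this mechanism is given below; it is not the thesis implementation, only an illustration of a mutex-protected work pool with an enqueue interface and blocking dequeue threads, one per device.

#include <condition_variable>
#include <deque>
#include <mutex>

// Conceptual sketch of a work pool protected by a mutex, with independent
// enqueue and dequeue threads (one dequeue thread per device in the system).
struct WorkUnit { int id; /* kernel, arguments, dependencies, ... */ };

class WorkPool {
public:
    void enqueue(WorkUnit wu) {
        { std::lock_guard<std::mutex> lock(m_);
          pool_.push_back(wu); }
        ready_.notify_one();
    }
    bool dequeue(WorkUnit &wu) {             // blocks until a work unit is ready
        std::unique_lock<std::mutex> lock(m_);
        ready_.wait(lock, [this] { return !pool_.empty() || done_; });
        if (pool_.empty()) return false;     // pool drained and finished
        wu = pool_.front();                  // a real scheduler would pick by
        pool_.pop_front();                   // dependency/priority, not FIFO
        return true;
    }
    void finish() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        ready_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable ready_;
    std::deque<WorkUnit> pool_;
    bool done_ = false;
};

void device_dequeue_thread(WorkPool &pool /*, device handle */) {
    WorkUnit wu;
    while (pool.dequeue(wu)) {
        // Launch wu on this thread's device via the OpenCL interface,
        // then run its finalization callback, resolve dependents, etc.
    }
}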

3.2.3 Common Runtime Layer

In OpenCL, the scope of objects (such as kernels, buffers, and events) is limited to a context. When an object is created, a context must always be specified, and objects that are created in one context are not visible to another context. This becomes a problem when devices from different vendors are used. If a programmer wanted to use an AMD GPU and an NVIDIA GPU simultaneously, they would need to install both vendors' OpenCL implementations: AMD's implementation can interact with AMD GPUs and any x86 CPU, while NVIDIA's implementation can only interact with NVIDIA GPUs. With the current OpenCL specification, contexts cannot span different vendor implementations. This means that the programmer would have to create two contexts (one per implementation), and initialize both of them. The programmer would also have to explicitly manage synchronization between contexts, including transferring data and ensuring dependencies are satisfied. Using our work pool-based runtime environment, we remove the restriction of object scope by device type and implement a common runtime layer in the work pool back-end. Each individual work pool is able to directly manage the task of object communication and synchronization across multiple contexts. This new level of flexibility enables the programmer to transparently use all possible devices and take full advantage of the heterogeneous platform.

3.2.4 Resource Manager

In the current OpenCL environment, when many devices are used, memory objects must be managed explicitly by the programmer. If kernels that use the same data run on different devices, the programmer must always be aware of which device updated the buffer last, and transfer data accordingly. Using the new API in our runtime environment, the programmer cannot predict a priori which device each kernel will execute on, so we have designed the runtime to act as an automated resource manager that manages memory objects between devices.


In our runtime environment, prior to execution, the programmer must explicitly inform the API that it wishes to use a block of memory as the input to a kernel by passing the associated data pointer. The resource manager then determines whether the data buffer is already present on that device by looking into a lookup table with existing data buffers. If this is the first time the data is used, or if the valid version of the data is on another device, a data transfer will be initiated, and the new valid data buffer will be moved to the target device. If the data is already present on the correct device, no action is required. Since the data transfer time overhead is not negligible, data is only transferred as necessary prior to kernel dispatch. This data management scheme can be easily extended to avoid data transfers altogether if all devices use a unified memory. In the current implementation, the resource manager assumes that data sizes are smaller than the capacity of any one single device.
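
The buffer-tracking logic can be sketched as follows, with simplified types standing in for the real OpenCL objects: the manager keys a table by the host data pointer, records which device currently holds the valid copy, and only issues a transfer when the requesting device differs.

#include <cstddef>
#include <unordered_map>

// Conceptual sketch of the resource manager's lookup table. Device IDs and the
// transfer routine are placeholders for the real OpenCL buffer management.
struct BufferRecord {
    int    valid_device;     // device holding the most recent copy
    size_t size;             // assumed smaller than any single device's memory
};

class ResourceManager {
public:
    // Called before a work unit is dispatched to `target_device` with `host_ptr`
    // as one of its kernel arguments.
    void request_buffer(const void *host_ptr, size_t size, int target_device) {
        auto it = table_.find(host_ptr);
        if (it == table_.end()) {
            // First use of this data: allocate on the target device and copy in.
            transfer(host_ptr, size, /*from=*/HOST, target_device);
            table_[host_ptr] = { target_device, size };
        } else if (it->second.valid_device != target_device) {
            // Valid copy lives elsewhere: move it, then update the table.
            transfer(host_ptr, size, it->second.valid_device, target_device);
            it->second.valid_device = target_device;
        }
        // else: data already resident on the right device, nothing to do.
    }
private:
    static const int HOST = -1;
    void transfer(const void *, size_t, int /*from*/, int /*to*/) {
        // Placeholder: a read/write buffer command or device-to-device copy
        // would be issued here.
    }
    std::unordered_map<const void *, BufferRecord> table_;
};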

3.2.5 Scheduler

At the back-end of the central task queuing execution system, one or multiple independent dequeue threads continuously take ready work units out of the work pool according to the defined workload balancing policy. The scheduler evaluates the dependency information associated with the work units enqueued in the work pool and uses the specified policy to determine which work unit to execute next and on which device. OpenCL represents dependencies by generating an event for one API call and passing it to a successive call as an argument. An enqueued kernel will wait for all events in its wait-list to complete before its own execution. However, OpenCL events are restricted to the context in which they are created. The status of an event created on a certain context is unknown to another context. In our heterogeneous runtime system, we remove this constraint by sharing the valid status of events across the entire platform and across multiple OpenCL contexts. Therefore, programmers can still use events to describe a dependency, even though the work units can be distributed to devices in different contexts. When work units are created, the values of the kernel arguments and work-item configuration information may or may not be available. If the information is not available, work units can still be enqueued into the work pool, but cannot be executed until the information is provided. To update a work unit with this information, a callback mechanism is provided for the programmer. For example, in the clSURF application, the work-item configuration information for the work units in the second phase is determined by the number of interest points, which is the output

data that was computed during the first phase of the application. We program the initialization and finalization in the callback functions of related work units so that the number of interesting points is updated after the work units are already enqueued.

3.2.5.1 Designing a Cross-platform Workload Balancing Algorithm

"Prediction is very difficult, especially about the future." This principle obviously applies to the evolution of computer hardware as well. Many HPC applications were designed to exploit single-core performance, so programmers have had to re-design many of them using multi-threaded schemes when targeting multi-core systems. When applications are designed assuming a single, fixed machine, any change in the underlying hardware can significantly impact application performance. Programmers have less and less control and certainty about the target platform when designing applications. When a new class of processor or accelerator becomes available, it is crucial that programmers do not have to redesign their entire application. A statically-mapped application may be able to exploit the resources on a specific machine very well. However, the code must undergo a major redesign to support new architectural features, such as those provided by accelerators. Therefore, to ensure that applications will be ready to run on tomorrow's architectures, it is important to be able to dispatch different computations dynamically to different processing units. In this thesis, we show that it is possible to utilize adaptive workload balancing schemes across different platforms with flexible libraries, compilation environments and run-time systems. We implement several static and dynamic workload balancing schemes to map the kernel execution to multiple devices. By using a dynamic workload balancing algorithm, computing tasks are automatically assigned to multiple devices according to their processing capabilities, run-time availability, and predefined preferences. A profiling pass operating at a work-unit granularity is also incorporated to provide run-time feedback to the scheduler to guide dynamic decisions.

Figure 3.3: An example execution of vector addition on multiple devices with different processing capabilities (HD6970, HD6550D, and A8-3850 CPU).


3.2.5.2 Heterogeneous devices with different processing capabilities

When running the same workload on different types of processing units, the execution time can vary greatly depending on the processing capabilities of the device.

Table 3.1: Typical memory bandwidth between different processing units for reads.

Memory Read Operations           CPU      Discrete GPU   Fused GPU
System Memory                    20GB/s   6GB/s          1GB/s
Global Memory of Discrete GPU    6GB/s    150GB/s        5GB/s
Global Memory of Fused GPU       1GB/s    5GB/s          30GB/s

Table 3.2: Typical memory bandwidth between different processing units for writes.

Memory Write Operations          CPU      Discrete GPU   Fused GPU
System Memory                    20GB/s   6GB/s          1GB/s
Global Memory of Discrete GPU    6GB/s    150GB/s        5GB/s
Global Memory of Fused GPU       1GB/s    5GB/s          30GB/s

Table 3.3: The Classes and Methods

class work_pool       Description
init                  Initialize a work pool, define its capacity and initialize a buffer table.
get_context           Provide information for all OpenCL devices in the system.
enqueue               Enqueue a new work unit into the work pool.
dequeue (optional)    Extract a ready work unit and distribute it to a device.
request_buffer        Request a new or existing buffer for the data referenced by a pointer.
query                 Query the information about the next work unit.
finish                Indicates the end of application program flow.

class work_unit       Description
init                  Initialize a work unit.
compile_program       Compile an OpenCL kernel file for all possible devices.
create_kernel         Create an OpenCL kernel for all possible devices.
set_argument          Register these arguments in the metadata of the work unit.

Figure 3.3 shows the results of profiling the execution of vector addition kernels on a heterogeneous platform with one CPU and two GPUs, using a greedy workload balancing scheme. The vector addition kernels are continuously enqueued into the work pool, and on the back-end, all processing units greedily grab and execute work units from the work pool. Even though we are executing the workload on all available devices, the resulting performance is still well below the optimal performance. The tasks assigned to the CPU run significantly slower than when run on a


GPU, and when all the work units are processed on the GPUs, the last work unit on the CPU still takes a considerable amount of time to finish. A better workload balancing scheme would be to distribute the work units based on the processing capabilities of each device, so that all the devices are busy, and so we do not wait for other devices to complete.
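
One simple alternative to the greedy scheme is sketched below, under the assumption that a per-device throughput estimate is available (for example, from the work-unit-granularity profiling pass mentioned above): each device receives a share of the work units proportional to its estimated capability, so all devices are expected to finish at roughly the same time.

#include <cstddef>
#include <numeric>
#include <vector>

// Conceptual sketch of capability-proportional workload balancing: each device
// receives a share of the work units proportional to its estimated throughput
// (work units per second, e.g. obtained from the runtime profiling pass).
std::vector<std::size_t> proportional_shares(std::size_t total_work_units,
                                             const std::vector<double> &throughput)
{
    double total = std::accumulate(throughput.begin(), throughput.end(), 0.0);
    std::vector<std::size_t> share(throughput.size(), 0);

    std::size_t assigned = 0;
    for (std::size_t d = 0; d < throughput.size(); ++d) {
        share[d] = static_cast<std::size_t>(total_work_units * throughput[d] / total);
        assigned += share[d];
    }
    // Give any rounding remainder to the fastest device so all devices are
    // expected to finish at roughly the same time.
    std::size_t fastest = 0;
    for (std::size_t d = 1; d < throughput.size(); ++d)
        if (throughput[d] > throughput[fastest]) fastest = d;
    share[fastest] += total_work_units - assigned;
    return share;
}

// Example: 1000 vector-addition work units over one CPU and two GPUs whose
// measured throughputs are 1 : 8 : 10 would yield roughly 52 / 421 / 527.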

3.2.5.3 Overhead of Data Transfers

work_pool work_pool_vec;
work_pool_vec.init(WORKPOOL_CAPACITY, init_num_work_units, &status);

work_unit work_unit_vec;
work_unit_vec.init(&work_pool_vec,
                   NULL,            //preferred context
                   "vectoradd.cl",  //kernel file
                   "vecAdd",        //kernel name
                   NULL,            //dependency description
                   1,               //work dimension
                   NULL,            //global work offset
                   &globalSize,     //global work size
                   &localSize,      //local work size
                   &status);

work_unit_vec.set_argument(index,
                           ARG_TYPE_CL_MEM,
                           int_value,       //if integer
                           float_value,     //if float
                           (void *)&data,   //data pointer
                           data_size,       //data size
                           READ_WRITE_FLAG,
                           CL_TRUE);

work_pool_vec.enqueue(&work_unit_vec, PRIORITY_LEVEL, &status);
work_pool_vec.finish();

Listing 3.1: Example code of enqueuing a work unit to the work pool.

In our runtime system, while the resource manager automatically handles data transfers between devices, it cannot eliminate the overhead of those transfers. The scheduler has to take into

consideration the data transfer overhead when making workload balancing decisions, especially when the data transfer time is significant compared with the kernel execution time.

3.2.6 Task-Queuing API

Tables 3.1 and 3.2 provide memory bandwidth estimates for our experimental setup (the second platform described in Section 4.1), for read and write performance, respectively. The performance may vary across systems and drivers, but we can clearly see that as we increase the number of processing units, data transfer becomes the main bottleneck, so this overhead must be taken into consideration when making workload balancing decisions.

Our runtime environment is designed to facilitate the execution of OpenCL applications on multiple devices in a convenient and efficient way. On top of the OpenCL programming API, we provide a new set of APIs based on two basic classes: 1) work pools, and 2) work units. Table 3.3 provides a brief description of these two classes and their major methods. The work_pool class initializes the task queuing execution system and provides knowledge about all OpenCL devices present in the system. It also allocates and coordinates resources across devices. The work_unit class concentrates more on the kernel itself. Besides packaging the compiled kernels for all possible devices, it also incorporates dependency information. Listing 3.1 shows an example of how to declare a vector addition work unit and enqueue this work unit to the work pool.

Chapter 4

Experimental Results

4.1 Experimental Environment

To evaluate our task-queueing extension framework for OpenCL, we use three different heterogeneous platforms that include both CPU and GPU devices from AMD, Intel and NVIDIA. The first platform has one AMD Radeon HD6970 GPU [84] and one NVIDIA GTX 285 GPU [85]. The AMD Radeon HD6970 GPU has 24 SIMD engines (16-wide), which are 4-way VLIW, for a total of 1536 streaming processors running at 880MHz. The NVIDIA GTX 285 GPU has 15 SIMD engines (16-wide) with scalar cores, for a total of 240 streaming cores running at 1.4GHz. The second platform has one AMD Fusion A8-3850 APU (which fuses a CPU and GPU on the same chip) [4] and one discrete AMD FirePro V7800 GPU [86]. The AMD A8-3850 Fusion APU is used as the host device. On the single chip of the A8-3850, four x86-64 CPU cores running at 2.9GHz are integrated together with a Radeon HD6550D GPU, which has 5 SIMD engines (16-wide) with 5-way VLIW cores, for a total of 400 streaming processors. The AMD FirePro V7800 GPU has 18 SIMD engines (16-wide), where each core is a 5-way VLIW, for a total of 1440 streaming processors running at 700MHz. The third platform has one Intel Core i5-3360M CPU [87], one Intel HD Graphics 4000 mobile GPU [88], and one NVIDIA NVS 5400M GPU [89]. The Core i5-3360M has two processing cores with hyper-threading enabled, and runs at 2.8GHz. The Intel HD Graphics 4000 mobile GPU has 16 compute units running at 350MHz. The NVIDIA NVS 5400M GPU has 96 streaming cores running at 1320MHz. To explore the benefits of heterogeneity, the host CPU is also used as a computing device

in our experiments. However, to ensure that the CPU's quality of service as a host scheduling device can be maintained, we utilize the device fission extensions [90] supported by OpenCL to reserve part of the CPU for scheduling. By using device fission, we can divide an OpenCL device into sub-devices and use only part of it for compute, while at the same time maintaining its ability to manage the host program. On the first two platforms, experiments were performed using the AMD Accelerated Parallel Processing (APP) SDK 2.5 on top of the vendor-specific drivers (Catalyst 11.12 for AMD's GPUs and CUDA 4.0.1 for NVIDIA's GPU). On the third platform, the experiments were performed using Intel's SDK for OpenCL Applications 2013. The NVIDIA NVS 5400M driver has support for CUDA 5.5. The Open Source Computer Vision (OpenCV) library v2.2 [91] is used by clSURF to handle the extraction of frames from video files and to display the processed video frames.

4.2 Static Workload Balancing

4.2.1 Performance Opportunities on a Single GPU Device

We first investigate the performance opportunities solely afforded by our extension layer on a single device. In this experiment, we use our new OpenCL extension layer to implement the clSURF framework, a real-world application that presents us with more complicated program flow, and our baseline is the reference OpenCL implementation of clSURF. The clSURF application uses OpenCV to display the processed images on the screen. This function always runs on the host machine, and frames must be processed serially in order to be displayed. When we have multiple devices processing independent frames, displaying the resulting frames often becomes a major bottleneck. To provide a more accurate view of the performance capabilities of the task queuing implementation on multiple devices, we will present results with and without the display functionality enabled for each evaluation.

When processing a video in clSURF, the characteristics of the video frames can dramatically impact the execution of specific kernels in the application. For each frame, the performance of the kernels in the first stage is a function of the size of the image, and the performance of the kernels in the second stage is dependent on the number of interest points found in the image. To provide coverage of the entire performance space, we selected four different videos that provide a range of these different attributes (Table 4.1).

Table 4.1: Input sets emphasizing different phases of the SURF algorithm.

Input    Size         Number of Interesting Points   Description
Video1   240 x 320    312                            Small video size & small number of interesting points
Video2   704 x 790    3178                           Small video size & large number of interesting points
Video3   720 x 1280   569                            Large video size & small number of interesting points
Video4   720 x 1280   4123                           Large video size & large number of interesting points

Figure 4.2 shows the speedup when we create two work pools on a single GPU device. The two work pools process independent frames, and there is no dependency between frames. When we decompose the application across frames, we obtain an additional average speedup of 15%. Using two work pools produces better utilization of the GPU command queue.

4.2.2 Heterogeneous Platform with Multiple GPU Devices

To demonstrate the work pool implementation of the clSURF application as run on multiple GPU devices, we choose two combinations of GPU devices. The first platform has an AMD FirePro V9800P and an AMD Radeon HD6970; the second platform has an AMD FirePro V9800P and an NVIDIA GTX 285. We manually load-balance the workload on the different GPU devices, and compare the performance against a single work pool implementation by measuring the average execution time per video frame. Figure 4.3 shows the performance of our two work pool implementation on the V9800P/HD6970 and compares this against a single work pool implementation on a V9800P. Since the V9800P and the HD6970 have comparable computing power, we achieve the best performance when the workload on


Figure 4.1: The performance of our work pool implementation on a single device – One Work Pool. (a) With Display; (b) Without Display.

both devices is allocated evenly. The speedup achieved is up to 55.4%. We also include the baseline implementation for reference. The AMD FirePro V9800P and the NVIDIA GTX 285 provide significant differences in computational power. In Figure 4.4, we compare the performance of our two work pool implementation on the V9800P/GTX 285 combination against a single work pool implementation


on a GTX 285. When we schedule twice the number of frames on the V9800P as on the GTX 285, we achieve a speedup of up to 2.8x, given the more powerful processing unit on the AMD device.

Figure 4.2: The performance of our work pool implementation on a single device – Two Work Pools. (a) With Display; (b) Without Display.


Figure 4.3: Load balancing on dual devices – V9800P and HD6970. (a) With Display; (b) Without Display. (Series: baseline on V9800P, V9800P work pool, HD 6970:V9800P = 1:1, 10:9, and 2:1.)


Figure 4.4: Load balancing on dual devices – V9800P and GTX 285. (a) With Display; (b) Without Display. (Series: baseline on GTX 285, GTX 285 work pool, V9800P:GTX 285 = 1:1, 10:9, and 2:1.)

4.2.3 Heterogeneous Platform with CPU and GPU(APU) Device

In OpenCL programming, the CPU can also be used as a compute device for the kernel. With APUs such as AMD’s Fusion, the communication overhead between GPU and CPU decreases, making heterogeneous programming feasible for a wider variety of applications. In this section, we demonstrate how our extension allows the CPU to participate in kernel execution alongside the GPU accelerator. When using the GPU and CPU as co-accelerators, the characteristics of each processing unit must be considered. For example, GPUs are much better suited for large, data-parallel algorithms than CPUs. However, as more processing cores appear on a single die, the CPU has the potential to supply valuable computing power instead of being used only as a scheduler and coordinator. Furthermore, we must be careful when assigning computing tasks to CPU devices: the performance of the whole system can suffer if the kernel computation consumes too many resources and host CPU threads are blocked from execution.

Figure 4.5: Performance assuming different device fission configurations and load balancing schemes between CPU and Fused HD6550D GPU. (Axes: time per frame in ms; number of computing cores used for execution; ratio of frames processed on the CPU versus the fused GPU.)

One solution to this problem is to utilize the device fission extension [90] supported by OpenCL. By using device fission, the programmer can divide an OpenCL device into sub-devices and use only part of it for execution. Currently, this device extension is only supported for multi-core CPU devices, and can be used to ensure that the CPU has enough resources to perform additional tasks, specifically task scheduling, while executing an OpenCL program. A minimal sketch of this approach appears below.
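The following sketch illustrates how a multi-core CPU device might be partitioned so that only part of it is exposed as a compute sub-device. It uses the core OpenCL 1.2 clCreateSubDevices call, which standardizes the fission extension referenced above; the choice of two cores is purely illustrative and is not necessarily the exact configuration used in our experiments, and error handling is omitted.

/* A minimal sketch (assuming OpenCL 1.2 headers): carve a 2-core sub-device
 * out of a multi-core CPU device so that kernel execution does not starve
 * the host-side scheduler. Error handling omitted for brevity. */
#include <CL/cl.h>

cl_device_id create_cpu_subdevice(cl_device_id cpu_device)
{
    const cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_COUNTS,
        2,                                      /* compute units given to OpenCL kernels */
        CL_DEVICE_PARTITION_BY_COUNTS_LIST_END,
        0
    };

    cl_device_id sub_device = NULL;
    cl_uint num_returned = 0;

    /* Request a single sub-device containing two compute units; the
     * remaining cores stay free for host scheduling work. */
    clCreateSubDevices(cpu_device, props, 1, &sub_device, &num_returned);
    return sub_device;
}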


Figure 4.6: Load balancing on dual devices – HD6550D and CPU. (Series: baseline on HD6550D; HD6550D with the work pool and the last work unit on CPU; HD6550D:CPU = 100:2; HD6550D:CPU = 100:4.)

Figure 4.5 shows the impact of the number of CPU cores used for OpenCL processing on the execution time per video frame for Video2. In this experiment, the GPU and CPU devices work together to process the video file. The x-axis represents the ratio between the number of frames distributed to the CPU and to the fused GPU of the AMD APU A8-3850. From this figure we can see that aggressively assigning computation to the CPU is detrimental to performance, as kernel execution impacts the CPU’s ability to function as an efficient host device.

Figure 4.6 shows the performance of our work pool implementation while using both a CPU and a GPU as compute devices, compared with the baseline execution on a fused HD6550D GPU. We configure the CPU using the device fission extension and use only part of the CPU as the computing device. From this figure we can see that our extension enables programmers to utilize all of the available computing resources on the GPU and the CPU. However, we only achieve speedup in one scenario; in most cases, kernel execution on the CPU slows down the whole application. Since the CPU in the AMD APU A8-3850 is a quad-core CPU, using it as a computing device severely impacts its ability to act as the scheduling unit, even though we use only a subset of the cores via the fission extension. The enqueue operation on the GPU work pool is frequently blocked when part of the CPU is reserved for the computing task, which slows down the overall execution.

4.3 Design Space Exploration for Flexible Workload Balancing

In our second experiment we use a synthetic workload generator to explore the design space of incorporating flexible workload balancing schemes with our task-queueing OpenCL extension.

4.3.1 Synthetic Workload Generator

To better understand the potential benefits of a centralized task queuing system that supports our extension layer, we designed a synthetic workload generator. This workload generator continuously enqueues Vector Addition work units to the central work pool. The Vector Addition kernel performs single precision vector additions, and is easily parallelized on any data-parallel platform. On the back end, multiple dequeue threads consume the work units according to the selected workload balancing scheme. To accurately assess load balancing, all of the work units have the same configuration; configuration parameters include global/local work size, dimensions, etc. Moreover, there are no dependencies between these work units, and they can be executed by any available device on the platform. Listing 4.1 shows the source code of the kernel.

4.3.2 Dynamic Workload Balancing

To demonstrate the ability of our task-queueing extension to employ a range of workload balancing schemes, we have implemented a number of static and dynamic workload balancing algorithms, and compare their performance using our synthetic workload.

We present results for three basic workload balancing algorithms: 1) Round Robin (RR) - the scheduler distributes the work units to all the devices in a round-robin fashion; 2) Greedy (GREEDY) - every dequeue thread grabs a new work unit as soon as it finishes its current one; and 3) Partitioned (PARTITION) - based on each device’s inherent processing capabilities, we predefine a partition, allocating a set number of work units to each device. A simplified host-side sketch of these three dispatch rules is given after Listing 4.1.

__kernel void vecAdd(__global float *a,
                     __global float *b,
                     __global float *c,
                     const unsigned int n)
{
    // Get our global thread ID
    int id = get_global_id(0);
    int i;
    float r = 1.0f;

    if (id < n) {
        // Repeatedly accumulate the sum to generate a configurable
        // amount of synthetic work per work-item.
        for (i = 0; i < n; i++) {
            r += a[id] + b[id];
        }
        c[id] = r;
    }
}

Listing 4.1: A synthetic vector-addition-like kernel.
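To make the three baseline dispatch rules concrete, the following host-side sketch shows how a scheduler might pick the destination device for the next work unit. The types and fields are simplified, hypothetical stand-ins for the runtime’s internal bookkeeping, not the actual task-queuing implementation.

/* Simplified sketch only: device_state_t and its fields are hypothetical
 * stand-ins for the runtime's internal bookkeeping. */
typedef enum { POLICY_RR, POLICY_GREEDY, POLICY_PARTITION } policy_t;

typedef struct {
    int assigned;   /* work units handed to this device so far           */
    int quota;      /* PARTITION: precomputed share for this device      */
    int idle;       /* GREEDY: set by the dequeue thread when it is idle */
} device_state_t;

/* Return the index of the device that should receive the next work unit. */
int pick_device(policy_t policy, device_state_t *dev, int num_devices,
                int unit_index)
{
    int i;
    switch (policy) {
    case POLICY_RR:
        /* Rotate over all devices, regardless of their capabilities. */
        return unit_index % num_devices;
    case POLICY_GREEDY:
        /* Give the unit to the first device whose dequeue thread is idle. */
        for (i = 0; i < num_devices; i++)
            if (dev[i].idle)
                return i;
        return unit_index % num_devices;  /* nobody idle: fall back to RR */
    case POLICY_PARTITION:
        /* Respect the statically precomputed per-device quotas. */
        for (i = 0; i < num_devices; i++)
            if (dev[i].assigned < dev[i].quota)
                return i;
        return num_devices - 1;
    }
    return 0;
}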

When designing our adaptive dynamic workload balancing algorithm, we detect the utilization of each device by monitoring it from the first enqueue of a work unit until the end of its execution. The basic idea is to detect whether a device is overloaded by monitoring its command queue. When a device is overloaded, future work units are dynamically migrated to other devices that have excess processing capacity. Algorithm 1 shows the details of our dynamic workload balancing algorithm. Figure 4.7 presents the performance speedup of the different workload balancing schemes using our new task-queueing extension on all 3 available devices, which include the A8-3850 CPU, a FirePro V7800 GPU and a fused HD6550D GPU on the second testing platform. The baseline is the execution time of all work units executed on the FirePro V7800 GPU, which is the most powerful device on the platform.


Algorithm 1 Dynamic workload balancing algorithm for multi-device heterogeneous systems.

Wi = number of work units assigned to the current device i
Ui = utilization indicator for device i (1 = busy, 0 = not busy)
T[0..2] = queue times of the 3 most recent work units (oldest first)
N = total number of devices on the platform

if T[0] < T[1] < T[2] then
    Ui = 1                      // Device is getting busy
else
    Ui = 0
end if

if Ui == 1 then
    for j = 0 to N - 1 do
        if j != i then
            if Uj != 1 then     // Find a non-busy device
                Wi = Wi - 1
                Wj = Wj + 1     // Move over one work unit
                break
            end if
        end if
    end for
end if
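A direct host-code transcription of Algorithm 1 might look like the sketch below. The structure fields and the function name are hypothetical; in our runtime the queue times are recorded by the per-device dequeue threads, measured from the enqueue of a work unit until the end of its execution.

/* Sketch of the rebalancing step in Algorithm 1; field and function names
 * are illustrative rather than the runtime's actual identifiers. */
#define NUM_RECENT 3

typedef struct {
    double queue_time[NUM_RECENT]; /* queue times of the 3 most recent units */
    int    busy;                   /* utilization indicator U_i              */
    int    work_units;             /* W_i: units currently assigned          */
} sched_device_t;

void rebalance(sched_device_t *dev, int num_devices, int i)
{
    /* Monotonically increasing queue times mean device i is falling behind. */
    dev[i].busy = (dev[i].queue_time[0] < dev[i].queue_time[1] &&
                   dev[i].queue_time[1] < dev[i].queue_time[2]);
    if (!dev[i].busy)
        return;

    /* Migrate one pending work unit to the first non-busy device. */
    for (int j = 0; j < num_devices; j++) {
        if (j != i && !dev[j].busy) {
            dev[i].work_units -= 1;
            dev[j].work_units += 1;
            break;
        }
    }
}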



From this figure, we can see that the Round Robin (RR) scheme performs the worst, since it treats the CPU and GPU devices equally. Our static partitioning scheme considers the parameters of each device, including the number of SIMD engines, the number of processing cores and the operating frequency. This provides us with a rough estimate of a device’s processing capability. Using this scheme we obtain an average speedup of 11% over the baseline, while the dynamic scheme delivers a larger speedup (19% on average). In the figure, GREEDY represents the greedy scheduling scheme. As we increase the number of work units, GREEDY quickly catches up to our dynamic scheme. However, it produces very different performance results as we vary the number of work units. Depending on the targeted device, the granularity of the schedulable units can differ dramatically; for example, the execution time on a CPU device can be 10x longer than on a GPU device. GREEDY workload balancing makes decisions based solely on a device’s availability, and therefore is not able to achieve the best performance when the number of work units is small.

Figure 4.7: Performance of different workload balancing schemes on all 3 CPU and GPU devices, an A8-3850 CPU, a V7800 GPU and an HD6550D GPU, as compared to a V7800 GPU device alone. (Schemes: RR, GREEDY, PARTITION, DYNAMIC; x-axis: total number of work units; y-axis: speedup.)

Figure 4.8: Performance of different workload balancing schemes on 1 CPU and 2 GPU devices, an NVS 5400M GPU, a Core i5-3360M CPU and an Intel HD Graphics 4000 GPU, as compared to the NVS 5400M GPU device alone. (Schemes: RR, GREEDY, PARTITION, DYNAMIC; x-axis: total number of work units; y-axis: speedup.)

4.3.3 Workload Balancing with Irregular Work Units

In this experiment, we configure the workload generator to generate irregular work units by changing the configuration of each work unit. We use three types of work units, each possessing a different number of work items. This results in a range of execution times for the work units. All the work units are enqueued into the work pool in a random sequence. Again, we apply the different workload balancing schemes described in the previous section, and compare their performance by measuring the execution time. Figure 4.8 shows the performance speedup of the different workload balancing schemes using our task queueing system on a Core i5-3360M CPU, an Intel HD Graphics 4000 GPU and an NVS 5400M GPU. The baseline is the execution time of all work units executed on the NVS 5400M GPU, which is the most powerful compute device on the platform.

From this figure, we can see that, again, the Round Robin scheme slows down the whole execution by treating the slower devices the same as the fast device. If we consider the different processing capabilities when using the PARTITION scheme, we can better utilize the computing power of multiple devices and improve performance by 19.2% on average. When we adopt our DYNAMIC workload balancing scheme, we further improve performance, reaching a speedup of 25.2% compared with the baseline execution on a single NVS 5400M GPU device. In addition to investigating how our runtime environment facilitates workload balancing at the granularity of work units, we also investigate the automatic partitioning of the workload within a single work unit. We create a special API function and use a flag to indicate whether the input and output data of a work unit can be partitioned and distributed across different compute devices. For partitioning to make sense, the input data should be large enough to amortize the overhead of partitioning and distributing data to different compute devices. However, in this implementation we assume there are no data dependencies between work items, which limits the scope of the problems to which this API function can be applied. A hedged sketch of such a flag-based interface is shown below.
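The fragment below only illustrates the idea; the function name, flag name, and stub body are hypothetical placeholders rather than the actual identifiers of our task-queuing API.

/* Hypothetical illustration: names are placeholders, not the actual API.
 * A flag on the work unit declares that its input and output buffers may be
 * split across devices because its work items are independent. */
#define WU_FLAG_PARTITIONABLE  (1u << 0)

typedef struct work_unit work_unit_t;   /* opaque runtime handle (assumed) */

/* Assumed enqueue call: when the flag is set, the runtime is free to split
 * the global work range and the associated buffers across devices.
 * The body is a stub so the sketch is self-contained. */
static int workpool_enqueue(work_unit_t *wu, unsigned int flags)
{
    (void)wu; (void)flags;
    return 0;
}

void submit_partitionable(work_unit_t *wu)
{
    /* Only safe when work items carry no cross-item dependencies and the
     * input is large enough to amortize the distribution overhead. */
    workpool_enqueue(wu, WU_FLAG_PARTITIONABLE);
}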

4.4 Cross-Platform Heterogeneous Execution of clSURF and MCXCL

Next, we apply our new task-queueing extension to real world applications that present us with more complicated program flow. We collect experimental results on the second testing platform with an AMD V7800 GPU and an HD6550D GPU, and on the third testing platform with an Intel Core i5-3360M CPU and an NVIDIA NVS 5400M GPU. The baseline for this experiment is the original state-of-the-art OpenCL implementation on a single GPU device, which represents the most powerful accelerator in either configuration. We also include results when using static partitioning, which divides the workload based on the processing capabilities of each device. During the processing of each frame in clSURF, each kernel execution is directly dependent on the output of the previous kernel. The migration of a kernel to a different device therefore incurs overhead associated with migrating the kernel’s data objects. After applying migration based on profile information, further workload balancing is performed at the granularity of video frames.

Figure 4.9 shows the performance of clSURF when using our new task-queueing extension. Using our method, users can easily adopt different workload balancing schemes. In this figure, we compare the performance of a work pool implementation on a V7800, as well as static partitioning and dynamic workload balancing on both GPU devices, against a baseline implementation run on a V7800. The y-axis is the speedup and the x-axis plots the different video inputs. If we ignore the overhead of displaying the image, the speedup is up to 30.4%. When we utilize both GPU devices in the system, we can see that the dynamic workload balancing scheme further improves performance compared to the static workload balancing scheme.

Figure 4.9: Performance comparison of clSURF implemented with various workload balancing schemes on the platform with V7800 and HD6550D GPUs. (a) With display; (b) without display. (Series: baseline on V7800, V7800 work pool, dynamic scheduling, static scheduling.)

Figure 4.10: Performance comparison of MCXCL implemented with various workload balancing schemes. (a) Platform with V7800 and HD6550D GPUs; (b) platform with 5400M GPU and i5-3360M CPU. (Series: baseline and work pool on the single most powerful GPU, dynamic scheduling, static scheduling; x-axis: total number of work units; y-axis: speedup.)


Figure 4.10 illustrates the performance of MCXCL implemented using our new task-queueing extension on two different heterogeneous platforms. Without changing the program source code, the programmer can easily adopt different workload balancing schemes. MCXCL has only one large kernel, and from Figure 4.10(a) we can see that it is not always beneficial to divide the workload into work units if we only use one device. However, we can easily use all the devices on the platform and obtain up to a 22.7% performance gain by taking advantage of the CPU and GPU working together. Figure 4.10(b) shows the potential for dividing the workload into small chunks. When using a device with a very limited number of processing cores, it is beneficial to separate the workload into smaller chunks so that potential contention or register spills can be avoided. We can obtain further performance gains by applying an adaptive dynamic workload balancing scheme.

4.5 Productivity

To demonstrate how our cross-platform runtime environment improves the programmability of the heterogeneous platforms, we implement several OpenCL benchmark applications on top of our runtime API, and compare the number of lines of the source code with their baseline OpenCL implementation. Figure 4.11 shows the comparison of the number of lines for selected benchmark appli- cations, including stencil, sgemm, bfs, and histogram, selected from the Parboil suite [92]. From these results we can clearly see that our runtime environment is able to hide most of the cumbersome device and platform initialization. Using only 46.5% (on average) of the original lines of code, we can implement the same OpenCL functionality. The savings are based on the number of different devices on the platform, and the degree of heterogeneity of the devices.


Figure 4.11: Number of lines of source code using our runtime API versus a baseline OpenCL implementation, for stencil, sgemm, bfs, and histo. (Series: baseline CL, cross-platform runtime, baseline CL & header files, cross-platform runtime & header files.)

Chapter 5

Summary and Conclusions

This thesis presents a novel design, implementation and optimization of a cross-platform heterogeneous runtime system. Our runtime system enables flexible task-level workload balancing on a heterogeneous platform with multiple computing devices. We conclude by reviewing the major contributions of this work.

5.1 Portable Execution across Platforms

Without our cross-platform runtime system, each application would have to be re-implemented for each new platform, and the programmer would be required to have detailed knowledge of the platform, such as how many computing devices it contains, the types of those devices, etc. Our runtime environment provides a unified abstraction for all processing units, including many existing OpenCL devices. With this unified abstraction, tasks can be distributed to all devices, and an application is portable across platforms with different numbers of processing units. We demonstrate the portable execution of two real world applications on top of our runtime environment (Section 4.4). By utilizing more of the computing devices on the platform, we achieved more than 20% speedup.

5.2 Dynamic Workload Balancing

An optimal static mapping of task execution onto the underlying platform requires a significant amount of analysis of all the devices on the platform, and it is difficult for the programmer to perform this remapping whenever a new hardware platform is targeted. A dynamic workload balancing scheme makes it possible for the same source code to obtain portable performance.


This work explores a concise adaptive workload balancing scheme that is effective across different numbers of computing devices with a range of characteristics. Our work finds that a proper workload balancing scheme is an important factor for portable performance (Section 4.3.3). We demonstrate an average speedup of 22.1% when running the application on platforms with multiple GPUs and CPUs.

5.3 APIs to Expose Both Task-level and Data-level Parallelism

The program designer is the best person to identify all levels of parallelism in their application. We provide API functions and a dependency description mechanism so that the programmer can expose task-level parallelism (Section 3.2.6). Together with the data parallelism expressed in OpenCL kernels, the runtime and/or the compiler can adapt the execution to any type of parallel machine without modifications to the source code. A hedged sketch of how such an interface might be used is shown below.
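The fragment below gives only the flavor of this style of interface; the function, type, and kernel names are hypothetical placeholders with stub bodies, not the exact API of Section 3.2.6.

/* Hypothetical placeholder API: names and stubs are illustrative only.
 * Task-level parallelism is exposed by enqueuing work units and declaring
 * the edges between them; data parallelism stays inside each OpenCL kernel. */
typedef struct work_unit work_unit_t;

static work_unit_t *workpool_create_unit(const char *kernel_name)
{ (void)kernel_name; return 0; }
static void workpool_add_dependency(work_unit_t *consumer, work_unit_t *producer)
{ (void)consumer; (void)producer; }
static void workpool_enqueue_unit(work_unit_t *wu)
{ (void)wu; }

void build_frame_pipeline(void)
{
    work_unit_t *stage1 = workpool_create_unit("detect_interest_points");
    work_unit_t *stage2 = workpool_create_unit("build_descriptors");

    /* Stage 2 may only run once stage 1 has produced its output for the
     * same frame; independent frames remain free to run in parallel. */
    workpool_add_dependency(stage2, stage1);

    workpool_enqueue_unit(stage1);
    workpool_enqueue_unit(stage2);
}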

5.4 Future Research Directions

The following sections describe potential directions for future work, based on the contributions of this thesis.

5.4.1 Including Flexible Workload Balancing Schemes

In our experiments, we evaluated several workload balancing schemes, including static scheduling, which only considers the processing capabilities of the processing units. We also considered dynamic scheduling schemes, such as Round Robin, Greedy, and a profile-based feedback algorithm. We have designed a flexible API as an interface for designing new portable workload balancing strategies. Applications can query profile information for the current execution, collecting the status of the processing units at runtime. A scheduler can later use this information to make further workload balancing decisions. Different workload balancing algorithms can be used in conjunction with our API. The large body of literature [93] focused on task scheduling and workload balancing in parallel and distributed systems clearly indicates that this problem remains an open area ripe for future research. We have focused our work on only a select class of workload balancing schemes, and on a select class of heterogeneous platforms. Instead of looking for a single best workload balancing algorithm for all platforms, it seems more reasonable to provide programmers with multiple options from which they can choose.

This will give the programmer the opportunity to make their own decisions based on the characteristics of the application and the underlying platform. Vu and Derbel [94] have started to explore workload balancing in a large scale heterogeneous compute environment. Another interesting aspect of workload balancing is the correlation between scheduling overhead and the granularity of the workload. Depending on the complexity of the workload balancing scheme, the overhead of making the workload balancing decisions can vary. It would be interesting to investigate the correlation between workload balancing overhead and the size of the work units. For example, when the work units are very small, the scheduling overhead may become too large to amortize, even if we can take advantage of multiple devices in the system.

5.4.2 Running specific kernels on the best computing devices

In our current runtime system, a single version of a kernel is run on different devices, guided by our workload distribution schemes. However, sometimes the kernel execution suffers from serious performance degradation. Usually this degradation occurs because the kernel was originally designed for a specific type of accelerator, or was heavily optimized for a specific memory hierarchy. Therefore, one future research direction is for our runtime system to support running the same kernels with portable performance across different classes of systems. Based on dynamic decisions, different optimizations could be applied, which could significantly benefit performance. Vinas et al. [95] have started to explore performance portability for heterogeneous platforms on top of OpenCL. They proposed a suite of micro-benchmarks, and illustrate how this suite reflects the characteristics of a given processing unit. They also provide guidance on future application design based on the results from running these micro-benchmarks. Jaaskelainen et al. [96] have explored using the OpenCL compiler for portable performance; they implemented their own OpenCL kernel compiler based on the open source LLVM compiler [97].

5.4.3 Prediction of data locality

Data locality is a very important factor when considering performance. One of the major sources of overhead when distributing multiple kernels across different devices is the movement of data, since the cost of data transfers can be the dominant factor in terms of performance. Any time we make a decision to distribute kernel execution across different devices, we have to take the data transfer overhead into consideration, together with the potential performance gain.


A promising research direction is to treat this tradeoff analytically, using a workload balancing algorithm that can predict data locality based on feedback from runtime profiling information. Ideally, we could apply a machine learning algorithm, but offline profiling can also be useful for establishing data locality. van Werkhoven et al. [98] proposed an analytical performance model that includes PCIe transfers and the overlapping of computation and data communication. For heterogeneous platforms with multiple CPU and GPU devices, the ultimate solution will be accurate prediction of data locality.

Bibliography

[1] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics, vol. 38, no. 8, April 1965.

[2] W.-m. Hwu, K. Keutzer, and T. G. Mattson, “The concurrency challenge,” IEEE Des. Test, vol. 25, no. 4, pp. 312–320, Jul. 2008.

[3] “The 6th Generation Core i7 Processors.” [Online]. Available: http://www.intel.com/content/ www/us/en/processors/core/core-i7-processor.html

[4] “AMD Accelerated Processors for Desktop PCs.” [Online]. Available: http://www.amd.com/us/products/desktop/apu/mainstream/pages/mainstream.aspx

[5] “NVIDIA’s parallel computing architecture,” http://www.nvidia.com/object/cuda_home_new.html.

[6] “OpenCL - The open standard for parallel programming of heterogeneous systems,” http://www.khronos.org/opencl/.

[7] “OpenACC Directives for Accelerators,” http://www.openacc.org/.

[8] A. Miller and K. Gregory, C++ AMP, ser. Developer Reference. Pearson Education, 2012.

[9] “The Industry Standard for High Performance Graphics,” http://www.opengl.org/.

[10] P. Taylor, “Programmable Shaders for DirectX 8.0,” 1989.

[11] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for gpus: Stream computing on graphics hardware,” in ACM SIGGRAPH 2004 Papers, ser. SIGGRAPH ’04, 2004, pp. 777–786.


[12] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, “Optimization principles and application performance evaluation of a multithreaded gpu using CUDA,” in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’08, 2008, pp. 73–82.

[13] P. Mistry, D. Schaa, B. Jang, D. R. Kaeli, A. Dvornik, and D. Meglan, “Data structures and transformations for physically based simulation on a gpu.” in VECPAR, ser. Lecture Notes in Computer Science, J. M. L. M. Palma, M. J. Dayd, O. Marques, and J. C. Lopes, Eds., vol. 6449. Springer, 2010, pp. 162–171.

[14] B. Jang, P. Mistry, D. Schaa, R. Dominguez, and D. Kaeli, “Data transformations enabling loop vectorization on multithreaded data parallel architectures,” in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’10, 2010, pp. 353–354.

[15] E. Sun and D. R. Kaeli, “Aggressive value prediction on a GPU,” International Journal of Parallel Programming, vol. 42, no. 1, pp. 30–48, 2014.

[16] “HSA Foundation,” http://hsafoundation.com/.

[17] “Intel Ivybridge,” http://ark.intel.com/products/codename/29902/Ivy-Bridge.

[18] “Snapdragon Mobile Processors and Chipsets, Qualcomm.” [Online]. Available: https: //www.qualcomm.com/products/snapdragon

[19] Y. Ukidave, F. N. Paravecino, L. Yu, C. Kalra, A. Momeni, Z. Chen, N. Materise, B. Daley, P. Mistry, and D. Kaeli, “Nupar: A benchmark suite for modern gpu architectures,” in Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, ser. ICPE ’15. New York, NY, USA: ACM, 2015, pp. 253–264.

[20] D. R. Kaeli, P. Mistry, D. Schaa, and D. P. Zhang, Heterogeneous Computing with OpenCL 2.0. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2015.

[21] F. Azmandian, A. Yilmazer, J. G. Dy, J. A. Aslam, and D. R. Kaeli, “Harnessing the power of gpus to speed up feature selection for outlier detection,” J. Comput. Sci. Technol., vol. 29, no. 3, pp. 408–422, 2014.


[22] E. Sun, D. Schaa, R. Bagley, N. Rubin, and D. Kaeli, “Enabling task-level scheduling on heterogeneous platforms,” in Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, ser. GPGPU-5. New York, NY, USA: ACM, 2012, pp. 84–93. [Online]. Available: http://doi.acm.org/10.1145/2159430.2159440

[23] A. Yilmazer, Z. Chen, and D. R. Kaeli, “Scalar waving: Improving the efficiency of SIMD execution on gpus,” in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, Phoenix, AZ, USA, May 19-23, 2014, 2014, pp. 103–112.

[24] C. Evans, “Notes on the opensurf library,” University of Bristol, Tech. Rep. CSTR-09-001, January 2009.

[25] “OpenCL implementation of the Speeded Up Robust Features (SURF) algorithm,” http://code.google.com/p/clsurf/.

[26] Q. Fang and D. A. Boas, “Monte carlo simulation of photon migration in 3d turbid media accelerated by graphics processing units,” Opt. Express, vol. 17, no. 22, pp. 20 178–20 190, Oct 2009. [Online]. Available: http://www.opticsexpress.org/abstract.cfm?URI=oe-17-22-20178

[27] H. Sutter, “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software,” http://www.gotw.ca/publications/concurrency-ddj.htm.

[28] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, N. Borkar, and S. Borkar, “An 80-tile sub-100-w teraflops processor in 65-nm cmos.”

[29] “Intel Xeon Phi Core Architecture,” https://software.intel.com/en-us/articles/intel-xeon-phi- core-micro-architecture.

[30] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan, “Larrabee: A many-core x86 architecture for visual computing,” in ACM SIGGRAPH 2008 Papers, ser. SIGGRAPH ’08. New York, NY, USA: ACM, 2008, pp. 18:1–18:15.

[31] “What public disclosures has Intel made about Knights Landing?” https://software.intel.com/en- us/articles/what-disclosures-has-intel-made-about-knights-landing/.

[32] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, “Introduction to the cell multiprocessor,” IBM J. Res. Dev., vol. 49, no. 4/5, pp. 589–604, Jul. 2005. [Online]. Available: http://dl.acm.org/citation.cfm?id=1148882.1148891


[33] B. Bouzas, R. Cooper, J. Greene, M. Pepe, and M. J. Prelle, “MultiCore Framework: An API for programming heterogeneous multicore processors,” Tech. Rep., 2006.

[34] D. R. Butenhof, Programming with POSIX Threads. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1997.

[35] “OpenMP,” http://www.openmp.org.

[36] B. Chapman, G. Jost, and R. v. d. Pas, Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, 2007.

[37] P. S. Pacheco, Parallel Programming with MPI. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1996.

[38] “OpenMPI,” http://www.open-mpi.org.

[39] “MVAPICH,” http://mvapich.cse.ohio-state.edu.

[40] “MPICH,” http://www.mcs.anl.gov/research/projects/mpich2.

[41] “GRIDMPI,” http://www.gridmpi.org.

[42] “LAM/DMPI,” http://www.lam-mpi.org.

[43] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.

[44] “MapReduce Tutorial,” https://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.pdf.

[45] S. Upstill, RenderMan Companion: A Programmer’s Guide to Realistic Computer Graphics. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1989.

[46] “Programming Guide for HLSL,” http://msdn.microsoft.com/en-us/library/bb509635(v=VS.85).aspx.

[47] R. J. Rost, OpenGL(R) Shading Language (2Nd Edition). Addison-Wesley Professional, 2005.

[48] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard, “Cg: A system for programming graphics hardware in a c-like language,” ACM Transactions on Graphics, vol. 22, pp. 896–907, 2003.


[49] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for gpus: Stream computing on graphics hardware,” ACM Trans. Graph., vol. 23, no. 3, pp. 777–786, Aug. 2004. [Online]. Available: http://doi.acm.org/10.1145/1015706.1015800

[50] P. McCormick, J. Inman, J. Ahrens, J. Mohd-Yusof, G. Roth, and S. Cummins, “Scout: A data- parallel programming language for graphics processors,” Parallel Comput., vol. 33, no. 10-11, pp. 648–662, Nov. 2007. [Online]. Available: http://dx.doi.org/10.1016/j.parco.2007.09.001

[51] A. E. Lefohn, S. Sengupta, J. Kniss, R. Strzodka, and J. D. Owens, “Glift: Generic, efficient, random-access gpu data structures,” ACM Trans. Graph., vol. 25, no. 1, pp. 60–99, Jan. 2006. [Online]. Available: http://doi.acm.org/10.1145/1122501.1122505

[52] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips, “GPU computing,” Proceedings of the IEEE, vol. 96, no. 5, pp. 879 –899, may 2008.

[53] “Top500 Supercomputers sites,” http://www.top500.org/.

[54] “The OpenCL Specification, version 2.1.” [Online]. Available: https://www.khronos.org/ registry/cl/specs/opencl-2.1.pdf

[55] A. Munshi, B. Gaster, T. G. Mattson, J. Fung, and D. Ginsburg, OpenCL Programming Guide, 1st ed. Addison-Wesley Professional, 2011.

[56] B. Gaster, L. Howes, D. R. Kaeli, P. Mistry, and D. Schaa, Heterogeneous Computing with OpenCL, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.

[57] R. Tay, OpenCL Parallel Programming Development Cookbook. Packt Publishing, 2013.

[58] B. Gaster, L. Howes, D. Kaeli, P. Mistry, and D. Schaa, Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 Edition. Elsevier Science, 2012. [Online]. Available: https://books.google.com/books?id=yyI8jfvi9-8C

[59] J. Kowalik and T. Puzniakowski,´ Using OpenCL: Programming Massively Parallel Computers, ser. Advances in parallel computing. IOS Press, 2012. [Online]. Available: https://books.google.com/books?id=T0sKa4T-sN0C

[60] C.-K. Luk, S. Hong, and H. Kim, “Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping,” in Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, Dec. 2009, pp. 45–55.


[61] “IBM OpenCL Common Runtime for Linux on x86 Architecture,” http://www.alphaworks.ibm.com/tech/ocr.

[62] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, “Starpu: A unified platform for task scheduling on heterogeneous multicore architectures,” in Proceedings of the 15th International Euro-Par Conference on Parallel Processing, ser. Euro-Par ’09. Berlin, Heidelberg: Springer- Verlag, 2009, pp. 863–874.

[63] E. Hermann, B. Raffin, F. Faure, T. Gautier, and J. Allard, “Multi-gpu and multi-cpu parallelization for interactive physics simulations,” in Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II, ser. Euro-Par’10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 235–246.

[64] E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, and S. Tomov, “Faster, Cheaper, Better – a Hybridization Methodology to Develop Linear Algebra Software for GPUs,” in GPU Computing Gems. Morgan Kaufmann, Sep. 2010, vol. 2.

[65] E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, and S. Tomov, “QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators,” in 25th IEEE International Parallel & Distributed Processing Symposium, Anchorage, USA, May 2011.

[66] E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov, “LU factorization for accelerator-based systems,” in 9th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 11), Sharm El-Sheikh, Egypt, 2011.

[67] S. Henry, “OpenCL as StarPU frontend,” National Institute for Research in Computer Science and Control, INRIA, Tech. Rep., March 2010.

[68] K. Spafford, J. Meredith, and J. Vetter, “Maestro: data orchestration and tuning for opencl devices,” in Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II, ser. Euro-Par’10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 275–286.

[69] “Symphony System Manager SDK.” [Online]. Available: https://developer.qualcomm.com/ software/symphony-system-manager-sdk

[70] “Multicore Asynchronous Runtime Environment.” [Online]. Available: https://developer. qualcomm.com/software/mare-sdk


[71] D. Grewe and M. F. P. O’Boyle, “A static task partitioning approach for heterogeneous systems using opencl,” in Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software, ser. CC’11/ETAPS’11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 286–305.

[72] J. Bernard, J.-L. Roch, and D. Traore, “Processor-oblivious parallel stream computations,” 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008), pp. 72–76, 2008.

[73] M. Frigo, C. E. Leiserson, and K. H. Randall, “The implementation of the Cilk-5 multithreaded language,” in Proceedings of the ACM SIGPLAN ’98 Conference on Programming Language Design and Implementation (PLDI), Montreal, Quebec, Canada, Jun. 1998, pp. 212–223.

[74] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (surf),” Comput. Vis. Image Underst., vol. 110, pp. 346–359, June 2008.

[75] J. Luo, Y. Ma, E. Takikawa, S. Lao, M. Kawade, and B.-L. Lu, “Person-specific sift features for face recognition,” in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 2, april 2007, pp. II–593 –II–596.

[76] D. Kim and R. Dahyot, “Face components detection using surf descriptors and svms,” in Machine Vision and Image Processing Conference, 2008. IMVIP ’08. International, sept. 2008, pp. 51 –56.

[77] S. Srinivasan, Z. Fang, R. Iyer, S. Zhang, M. Espig, D. Newell, D. Cermak, Y. Wu, I. Kozintsev, and H. Haussecker, “Performance characterization and optimization of mobile augmented reality on handheld platforms,” IEEE Workload Characterization Symposium, vol. 0, pp. 128–137, 2009.

[78] N. Zhang, “Computing parallel speeded-up robust features (p-surf) via posix threads,” in Proceedings of the 5th international conference on Emerging intelligent computing technology and applications, ser. ICIC’09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 287–296.

[79] P. Furgale, C. H. Tong, and G. Kenway, “Ece 1724 project report: Speed-up speed-up robust features,” 2009.

[80] P. Mistry, C. Gregg, N. Rubin, D. Kaeli, and K. Hazelwood, “Analyzing program flow within a many-kernel opencl application,” in Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, ser. GPGPU-4. New York, NY, USA: ACM, 2011, pp. 10:1–10:8.

[81] Z. Fang, D. Yang, W. Zhang, H. Chen, and B. Zang, “A comprehensive analysis and parallelization of an image retrieval algorithm,” in Performance Analysis of Systems and Software (ISPASS), 2011 IEEE International Symposium on, April 2011, pp. 154–164.

[82] V. Tuchin, Handbook of Optical Biomedical Diagnostics, ser. Press Monographs. Society of Photo Optical, 2002. [Online]. Available: http://books.google.com/books?id=U7cNCRwZqtcC

[83] A. T. N. Kumar, S. B. Raymond, A. K. Dunn, B. J. Bacskai, and D. A. Boas, “A time domain fluorescence tomography system for small animal imaging.” IEEE Trans. Med. Imaging, vol. 27, no. 8, pp. 1152–1163, 2008. [Online]. Available: http://dblp.uni-trier.de/db/journals/tmi/tmi27.html#KumarRDBB08

[84] “AMD Radeon HD 6970 Graphics.” [Online]. Available: http: //www.amd.com/us/products/desktop/graphics-/amd-radeon-hd-6000/hd-6970-/Pages/ amd-radeon-hd-6970-overview.aspx

[85] “NVIDIA GeForce GTX 285 Graphics,” http://www.nvidia.com/object/product_geforce_gtx_285_us.html.

[86] “AMD FirePro V7800 Professional Graphics,” http://www.amd.com/us/products/workstation/grap-hics/ati-firepro-3d/ v7800/Pages/v7800.aspx.

[87] “INTEL Core i5-3360M Processor,” http://ark.intel.com/products/64895.

[88] “Intel HD Graphics 4000 for 3rd Generation Intel Core Processors,” http://www.intel.com/content/www/us/en/support/graphics-drivers/intel-hd-graphics-4000- for-3rd-generation-intel-core-processors.html.

[89] “NVIDIA NVS notebook solutions,” http://www.nvidia.com/object/notebook-nvs.html.

[90] “OpenCL device extension: Device fission,” http://www.khronos.org/registry/cl/extensions/ext/cl_ext_device_fission.txt.

[91] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools, 2000.


[92] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L. Chang, G. Liu, and W.-M. W. Hwu, “Parboil: A revised benchmark suite for scientific and commercial throughput computing,” University of Illinois at Urbana-Champaign, Urbana, Tech. Rep. IMPACT-12-01, Mar. 2012.

[93] H. El-Rewini, T. G. Lewis, and H. H. Ali, Task Scheduling in Parallel and Distributed Systems. Prentice-Hall, Inc., 1994.

[94] T.-T. Vu and B. Derbel, “Parallel branch-and-bound in multi-core multi-cpu multi-gpu het- erogeneous environments,” Future Gener. Comput. Syst., vol. 56, no. C, pp. 95–109, Mar. 2016.

[95] M. Viñas, Z. Bozkus, B. B. Fraguela, D. Andrade, and R. Doallo, “Developing adaptive multi-device applications with the heterogeneous programming library,” J. Supercomput., vol. 71, no. 6, pp. 2204–2220, Jun. 2015. [Online]. Available: http://dx.doi.org/10.1007/s11227-014-1352-1

[96] P. Jääskeläinen, C. S. Lama, E. Schnetter, K. Raiskila, J. Takala, and H. Berg, “Pocl: A performance-portable opencl implementation,” Int. J. Parallel Program., vol. 43, no. 5, pp. 752–785, Oct. 2015. [Online]. Available: http://dx.doi.org/10.1007/s10766-014-0320-y

[97] C. Lattner and V. Adve, “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation,” in Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO’04), Palo Alto, California, Mar 2004.

[98] B. van Werkhoven, J. Maassen, F. Seinstra, and H. Bal, “Performance models for cpu-gpu data transfers,” 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and , vol. 0, pp. 11–20, 2014.
