Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping

Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping Chi-Keung Luk Sunpyo Hong Hyesoon Kim Software Pathfinding and Electrical and Computer College of Computing Innovations Engineering School of Computer Science Software and Services Group Georgia Institute of Georgia Institute of Intel Corporation Technology Technology Hudson, MA 01749 Atlanta, GA 30332 Atlanta, GA 30332 [email protected] [email protected] [email protected] ABSTRACT 1. INTRODUCTION Heterogeneous multiprocessors are increasingly important in the Multiprocessors have emerged as mainstream computing plat- multi-core era due to their potential for high performance and en- forms nowadays. Among them, an increasingly popular class are ergy efficiency. In order for software to fully realize this potential, those with heterogeneous architectures. By providing processing the step that maps computations to processing elements must be as elements (PEs)1 of different performance/energy characteristics on automated as possible. However, the state-of-the-art approach is the same machine, these architectures could deliver high perfor- to rely on the programmer to specify this mapping manually and mance and energy efficiency [14]. The most well-known hetero- statically. This approach is not only labor intensive but also not geneous architecture today is probably the IBM/Sony Cell archi- adaptable to changes in runtime environments like problem sizes tecture, which consists of a Power processor and eight synergistic and hardware/software configurations. In this study, we propose processors [26]. In the personal computer (PC) world, a desktop adaptive mapping, a fully automatic technique to map computa- now has a multicore CPU and a GPU, exposing multiple levels of tions to processing elements on a CPU+GPU machine. We have hardware parallelism to software, as illustrated in Figure 1. implemented it in our experimental heterogeneous programming In order for mainstream programmers to fully tap into the poten- system called Qilin. Our results show that, by judiciously distribut- tial of heterogeneous multiprocessors, the step that maps compu- ing works over the CPU and GPU, automatic adaptive mapping tations to processing elements must be as automated as possible. achieves a 25% reduction in execution time and a 20% reduction Unfortunately, the state-of-the-art approach [16, 22, 36] is to rely in energy consumption than static mappings on average for a set of on the programmer to perform this mapping manually: for the Cell, important computation benchmarks. We also demonstrate that our O’Brien et al. extend the IBM XL compiler to support OpenMP on technique is able to adapt to changes in the input problem size and this architecture [22]; for commodity PCs, Linderman et al. pro- system configuration. pose the Merge framework [16], a library-based system to program CPU and GPU together using the map-reduce paradigm. In both Categories and Subject Descriptors cases, the computation-to-processor mappings are statically deter- mined by the programmer. This manual approach not only puts bur- C.1.2 [Processor Architectures]: Multiple Data Stream Architec- dens on the programmer but also is not adaptable, since the optimal tures—Parallel Processors; D.1.3 [Software]: Programming Tech- mapping is likely to change with different applications, different in- niques—Parallel Programming put problem sizes, and different hardware/software configurations. In this paper, we address this problem by introducing a fully General Terms automatic approach that decides the mapping from computations to processing elements using run-time adaptation. We have im- Performance plemented our approach into Qilin2, our experimental system for programming heterogeneous multiprocessors. Experimental results Keywords demonstrate that our automated adaptive mapping performs nearly Multicore, heterogeneous, GPU, adaptive, mapping, dynamic com- as well as manual mapping in terms of both execution time and en- pilation. ergy consumption, while at the same time can tolerate changes in input problem sizes and hardware/software configurations. The rest of this paper is organized as follows. In Section 2, we use a case study to further motivate the need of adaptive mapping. Section 3 then describes the Qilin system in details. We will focus on our runtime adaptation techniques in Section 4. Experimental evidences will be given in Section 5. Finally, we relate our works Permission to make digital or hard copies of all or part of this work for to others in Section 6 and conclude in Section 7. personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies 1 bear this notice and the full citation on the first page. To copy otherwise, to The term processing element (PE) is a generic term for a hardware element that executes a stream of instructions. republish, to post on servers or to redistribute to lists, requires prior specific 2 permission and/or a fee. “Qilin” is a mythical chimerical creature in the Chinese culture; MICRO’09, December 12–16, 2009, New York, NY, USA. we picked it as the project name for the heterogeneity of this crea- Copyright 2009 ACM 978-1-60558-798-1/09/12 ...$10.00. ture. N = 1000 CPU GPU (a) Experiment 1 ( , #CPU Cores = 8) 9x Core-0 Core-1 8x 7.7 7 7x 6.1 6x 5.7 5.7 Core-2 Core-3 5.1 5.2 5x 4.7 4.1 3.6 4x 3.2 Core-3 3x SIMD 2x Speedup over Serial 1x 0x ly n Figure 1: Multiple levels of hardware parallelism exposed to 90/10 80/20 70/30 60/40 50/50 40/60 30/70 20/80 10/90 GPU-o CPU-only software on current CPU+GPU systems (The GPU has tens to hundreds of special-purpose cores, while the CPU has a few (b) Experiment 2 (N = 6000, #CPU Cores = 8) general-purpose cores. Within each CPU core, there is short- 12x 10.3 vector parallelism provided by the SIMD extension of the ISA.) 9.7 10x 8.7 9 8 7.7 8x 7.6 7.4 6.7 5.9 2. A CASE STUDY: WHY DO WE NEED ADAP- 6x 5.3 TIVE MAPPING? 4x We now motivate the need of adaptive mapping with a case study Speedup over Serial on matrix multiplication, a very commonly used computation ker- 2x nel in scientific computing. We measured the parallelization speedups 0x on matrix multiplication with a heterogeneous machine consisting only of an Intel multicore CPU and a Nvidia multicore GPU (details of 90/10 80/20 70/30 60/40 50/50 40/60 30/70 20/80 10/90 - the machine configuration are given in Section 5.1). GPU-only CPU We did three experiments with different input matrix sizes and (c) Experiment 3 (N = 6000, #CPU Cores = 2) the number of CPU cores used. In each experiment, we varied the 10x 9.3 distribution of work between the CPU and GPU. For matrix multi- 9x 8.4 plication C = A ∗ B, we first divide A into two smaller matrices 8x 7.6 A1 and A2 by rows. We then compute C1 = A1 ∗ B on the CPU 7x 6.5 C = A ∗ B 6x and 2 2 on the GPU in parallel. Finally, we obtain 5 5x C by combining C1 and C2. We use the best matrix multiplica- 4 4x 3.3 tion libraries available: for the CPU, we use the Intel Math Kernel 2.8 3x 2.5 Library (MKL) [11]; for the GPU, we use the CUDA CUBLAS 2.2 2 Speedup over Serial 2x library [20]. 1x Figure 2 shows the results of these three experiments. All input 0x matrices are N ∗ N square matrices. The y-axis is the speedup only over the serial case. The x-axis is the distribution of work across 90/10 80/20 70/30 60/40 50/50 40/60 30/70 20/80 10/90 - the CPU and GPU, where the notation “X/Y” means X% of work GPU-only CPU mapped to the GPU and Y% of work mapped to the CPU. At the two extremes are the cases where we schedule all the work on either Figure 2: Matrix multiplication experiments with different in- the GPU or CPU. put matrix sizes and number of CPU cores used. The input In Experiment 1, we use a relatively small problem size (N = matrices are N by N. The notation “X/Y” on the x-axis means 1000) with eight CPU cores. The low computation-to-communication X% of work mapped to the GPU and Y% of work mapped to ratio with this problem size renders the GPU less effective. As a the CPU. result, the optimal mapping is to schedule all work on the CPU. In Experiment 2, we increase the problem size to N = 6000 and keep different optimal mappings. Therefore, we believe that any static the same number of CPU cores. Now, with a higher computation- mapping techniques would not be satisfactory. What we want is to-communication ratio, the GPU becomes more effective—both a dynamic mapping technique that can automatically adapt to the the GPU-alone and the CPU-alone speedups are over 7x. And the runtime environment, as we are going to propose next. optimal mapping is to schedule 60% of work on the GPU and 40% on the CPU, resulting into a 10.3x speedup. In Experiment 3, we keep the problem size to N = 6000 but reduce the number of CPU 3. THE QILIN PROGRAMMING SYSTEM cores to only two. With much less CPU horsepower, the CPU-only Qilin is a programming system we have recently developed for speedup is limited to 2x and the optimal mapping now is to sched- heterogeneous multiprocessors.

Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping

Optimizing Applications for Multicore by Intel Software Engineer Levent Akyil Welcome to the Parallel Universe

Andrzej Nowak - Bio

Efficiency, Energy Efficiency and Programming of Accelerated HPC Servers: Highlights of PRACE Studies

Design of a Parallel Multi-Threaded Programming Model for Multi-Core Processors

Download (633Kb)

General Purpose Programming on Modern Graphics Hardware

Next Generation Developer Tools from Intel

Gossamer: a Lightweight Approach to Using Multicore Machines

Software Plattform Embedded Systems 2020

Cross-Platform Software Optimization with Intel's Ct Technology

Intel® Software Development Tools Intel® Parallel Studio Seminar