Programmability, Portability and Performance in Heterogeneous Accelerator-Based Systems

R. Govindarajan
High Performance Computing Lab, SERC & CSA, IISc
[email protected]

Dec. 2015

Overview

• Introduction
• Challenges in Accelerator-Based Systems
• Addressing Programming Challenges
  – Managing parallelism across Accelerator cores
  – Managing data transfer between CPU and GPU
  – Managing Synergistic Execution across CPU and GPU cores
• Implementation and Evaluation
• Conclusions

Moore's Law

No. of Transistors doubles every 18 months

Doubling of Performance every 18 months

Source: Univ. of Wisconsin

Moore's Law: Processor Architecture Roadmap

[Figure: processor architecture roadmap showing ILP processors (superscalar, EPIC)]

Multicores: The "Right Turn"

[Figure: performance scaling shifts from increasing clock speed on a single core (1 GHz to 3 GHz to 6 GHz) to increasing core counts (2, 4, 16 cores at around 3 GHz)]

Source: LinusTechTips

Moore's Law: Processor Architecture Roadmap

[Figure: processor architecture roadmap extended beyond ILP processors (superscalar, EPIC) and multicores to accelerators]

Accelerators

Accelerator Architecture: Fermi S2050

Source: icrontic.com

Comparing CPU and GPU

• CPU: 8 cores @ 3 GHz, 380 GFLOPS
• GPU: 2880 CUDA cores @ 1.67 GHz, 1500 GFLOPS

[Figure: multicore CPU (cores C0 … C7, shared last-level cache, integrated memory controller, QPI) connected via PCI-e to a GPU, each with its own memory]

Accelerators: Hype or Reality?

Rank | Site | Manufacturer | Computer | Country | Cores | Rmax [Pflops] | Power [MW]
1 | National Center in Tianjin | NUDT | Tianhe-2, Xeon E5 2691 and Xeon Phi 31S1 | China | 3,120,000 | 33.86 | 17.80
2 | Oak Ridge National Labs | Cray | XK7, Opteron 6274 (2.2 GHz) + NVIDIA Kepler K-20 | USA | 560,640 | 17.59 | 8.20
3 | Lawrence Livermore Labs | IBM | Sequoia – BlueGene/Q | USA | 1,572,864 | 17.17 | 7.89
4 | RIKEN Advanced Institute for Computational Science | Fujitsu | K Computer, SPARC64 VIIIfx 2.0 GHz, Tofu Interconnect | Japan | 705,024 | 10.51 | 12.66
5 | DOE/SC/ANL | IBM | BlueGene/Q, Power BQC 16C/1.6 GHz | USA | 786,432 | 8.58 | 3.94
6 | DOE/LANL/SNL | Cray | Xeon E5-2698 v3 (2.3 GHz) + Aries Interconnect | USA | 301,056 | 8.10 | –
7 | Swiss National Computing Centre | Cray | Xeon E5-2670 (2.6 GHz) + Nvidia Kepler K20x | Switzerland | 115,984 | 6.27 | 2.35
8 | HLRS, Stuttgart | Cray | Xeon E5-2680 v3 (2.5 GHz) + Aries Interconnect | Germany | 196,608 | 5.64 | –

Top 500 Nov 2015 List: www.top500.org

Emergence of Accelerators

104 Systems in Top500 in Nov 2015

Source: Nicole Hemsoth, HPC Wire, June 2014

Overview

• Introduction
• Challenges in Accelerator-Based Systems
• Addressing Programming Challenges
  – Managing parallelism across Accelerator cores
  – Managing data transfer between CPU and GPU
  – Managing Synergistic Execution across CPU and GPU cores
• Implementation and Evaluation
• Conclusions

© RG@SERC,IISc 12 Accelerator Programming: Good News • Emergence of Programming Languages for GPUs/Accelerators  CUDA, OpenCL, OpenACC – Efficient, but low programmability • Growing collection of code base – CUDAzone – Packages supporting GPUs by Software Vendors • Impressive performance • What about Programmability and Portability while retaining Performance?

Handling the Multicore Challenge

• Shared and distributed memory programming languages: OpenMP, MPI
  – Scaling w.r.t. no. of cores
• New parallel languages (partitioned global address space languages): X10, UPC, Chapel, …
  – Higher programmability, but performance and applicability to accelerators are issues

Source: Top500.org

Accelerator Programming: Boon or Bane?

• Challenges in programming accelerators
  – Managing parallelism across various cores
    • Task, data, and thread-level parallelism
  – Efficient partitioning of work across different devices
  – Transfer of data between CPU and Accelerator (see the sketch after this list)
  – Managing CPU-Accelerator memory bandwidth efficiently
  – Efficient use of different types of memory (device memory, shared memory, constant and texture memory, …)
  – Synchronization across multiple cores/devices
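To make the programming burden concrete, here is a minimal CUDA sketch (my illustration, not taken from the talk) of what a programmer must manage explicitly even for a trivial element-wise computation: device allocation, CPU-to-GPU transfer, grid/block selection, kernel launch, and the transfer back. The kernel name scaleAdd and the sizes are hypothetical.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical element-wise kernel: each thread handles one element.
__global__ void scaleAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = 2.0f * a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Explicit device memory management and CPU-to-GPU transfers.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Explicit choice of grid and block shape (parallelism management).
    int block = 256, grid = (n + block - 1) / block;
    scaleAdd<<<grid, block>>>(dA, dB, dC, n);

    // Explicit GPU-to-CPU transfer and cleanup.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}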

What Parallelism(s) to Exploit?

Our Approach

[Figure: array languages (MATLAB), OpenMP, MPI, streaming languages, CUDA/OpenCL, and parallel languages feed into a compiler/runtime system that targets synergistic execution on multiple heterogeneous cores: multicores, SSE, GPUs, Xeon Phi, CellBE]

Overview

• Introduction
• Challenges in Accelerator-Based Systems
• Addressing Programming Challenges
  – Managing parallelism across Accelerator cores
  – Managing data transfer between CPU and GPU
  – Managing Synergistic Execution across CPU and GPU cores
• Implementation and Evaluation
• Conclusions

Programming Heterogeneous Systems: Our Contributions

1. Managing parallelism across Accelerator cores
   – Identifying and exposing implicit parallelism (MATLAB, StreamIt)
   – Software pipelining in StreamIt
2. Managing data transfer between CPU and GPU
   – Compiler scheme (MATLAB)
   – Hybrid scheme (X10-CUDA)
3. Managing synergistic execution across CPU and GPU cores
   – Profile-based partitioning (StreamIt, MATLAB)
   – Runtime mechanism for OpenCL (FluidiCL)

1. Managing Parallelism across Cores

StreamIt Language

• A program is a hierarchical composition of three basic constructs:
  – Pipeline
  – SplitJoin
  – FeedbackLoop
• A program represents an infinite stream of data and the computation on it

[Figure: example stream graph with filters A, B, C, D]

1. Managing Parallelism across Cores

• Multithreading
  – Stream of data as multiple instances of a thread – data parallelism
  – Software pipelining of filters or tasks – pipelined parallelism
• Task partitioning between GPU and CPU cores, work scheduling, and processor assignment
  – Profile-based approach for task mapping
  – Takes communication bandwidth restrictions into account
  – Task parallelism

Mapping StreamIt on GPUs

[Figure: stream graph with filters A, B, C, D (exposing task, data, and pipeline parallelism) mapped to a software-pipelined schedule of filter instances A1…A4, B1…B4, C1…C4, D1…D4]

• Resource constraints and dependency constraints must be satisfied
• Optimal schedule using an Integer Linear Program, or a near-optimal heuristic schedule for efficiency (a generic sketch of such a formulation follows)
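For illustration only, here is a generic sketch (in LaTeX, and not the exact formulation of the CGO 2009 work) of the kind of ILP such a scheduler solves: binary variables x_{f,p} assign filter f to processor p, w_f is the profiled work of f, S_f is the pipeline stage of f, and the objective bounds the initiation interval II while respecting dependences.

\begin{align*}
\text{minimize}\quad & II \\
\text{subject to}\quad
  & \sum_{p} x_{f,p} = 1 && \forall f \quad \text{(each filter assigned to exactly one processor)}\\
  & \sum_{f} w_f \, x_{f,p} \le II && \forall p \quad \text{(work per processor bounds the initiation interval)}\\
  & S_g \ge S_f && \forall (f \rightarrow g) \quad \text{(dependences respected across pipeline stages)}\\
  & x_{f,p} \in \{0,1\}, \quad S_f \in \mathbb{Z}_{\ge 0}
\end{align*}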

[CGO 2009]

Mapping StreamIt on GPUs

• The schedule accounts for data transfers between CPU and GPU over the DMA channel (a CUDA sketch of this transfer/compute overlap follows)

[Figure: software-pipelined execution interleaving GPU work, DMA transfers, and CPU tasks for stream-graph filter instances An, Bn, Cn, Dn]

[LCTES 2009]
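A minimal CUDA sketch (my own, not code generated by the StreamIt compiler) of the idea the schedule exploits: double-buffering chunks across two streams so that DMA transfers of one chunk overlap kernel execution of another. The filter body, chunk size, and buffer names are hypothetical, and the host buffers are assumed to be pinned (allocated with cudaMallocHost) so the asynchronous copies can actually overlap.

#include <cuda_runtime.h>

__global__ void filterB(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f + 1.0f;   // stand-in for a StreamIt filter body
}

// Process 'total' elements in chunks, overlapping copy-in, compute, and copy-out.
// hIn and hOut are assumed to be pinned host buffers.
void pipeline(const float *hIn, float *hOut, int total, int chunk) {
    cudaStream_t s[2];
    float *dIn[2], *dOut[2];
    for (int k = 0; k < 2; ++k) {
        cudaStreamCreate(&s[k]);
        cudaMalloc(&dIn[k], chunk * sizeof(float));
        cudaMalloc(&dOut[k], chunk * sizeof(float));
    }
    for (int off = 0, k = 0; off < total; off += chunk, k ^= 1) {
        int n = (total - off < chunk) ? (total - off) : chunk;
        // Copy-in, compute, and copy-out are ordered within stream k, so the
        // next chunk's DMA (on the other stream) overlaps this chunk's kernel.
        cudaMemcpyAsync(dIn[k], hIn + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        filterB<<<(n + 255) / 256, 256, 0, s[k]>>>(dIn[k], dOut[k], n);
        cudaMemcpyAsync(hOut + off, dOut[k], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    for (int k = 0; k < 2; ++k) {
        cudaStreamDestroy(s[k]); cudaFree(dIn[k]); cudaFree(dOut[k]);
    }
}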

MEGHA: MATLAB Execution on GPUs

Frontend: Code Simplification → SSA Construction → Type Inference → Kernel Identification
Backend: Mapping and Scheduling → Data Transfer Insertion → Code Generation

Example (A, B, C are N x N arrays) after simplification:

1: tempVar0 = (B + C);
2: tempVar1 = (A + C);
3: A_1 = tempVar0 + tempVar1;
4: C_1 = A_1 * C;

Task Partitioning and Mapping in MATLAB Execution

• Kernel Composition: combining IR statements into "common loops" under memory and register constraints, using clustering methods (a fused-kernel sketch follows this slide)
  – Kernel 1: statements 1–3 (tempVar0 = B + C; tempVar1 = A + C; A_1 = tempVar0 + tempVar1)
  – Kernel 2: statement 4 (C_1 = A_1 * C)
• Type and shape inferencing
• Scalar reduction to reduce storage and kernel call overheads
• Data flow graph construction
• Generated loop nests: loop order is chosen for coalescing on the GPU and for cache locality on the CPU

for i = 1:N
  for j = 1:N
    tempVar0_s = B(i, j) + C(i, j);
    tempVar1_s = A(i, j) + C(i, j);
    A_1(i, j) = tempVar0_s + tempVar1_s;
  endfor
endfor

[PLDI 2011]
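As a hedged illustration of kernel composition (a sketch of the idea, not MEGHA's generated code), the three element-wise IR statements grouped into Kernel 1 can be fused into one CUDA kernel so that tempVar0 and tempVar1 stay in registers instead of materializing as N x N temporaries; statement 4 (the matrix product) keeps its own kernel because its access pattern differs. It can be launched like the earlier sketch, with one thread per element.

// Kernel 1 fused: tempVar0 = B + C; tempVar1 = A + C; A_1 = tempVar0 + tempVar1.
// The temporaries live in registers, so no intermediate N x N arrays are allocated.
__global__ void kernel1_fused(const float *A, const float *B, const float *C,
                              float *A_1, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // flattened (i, j) index
    if (idx < n * n) {
        float tempVar0 = B[idx] + C[idx];
        float tempVar1 = A[idx] + C[idx];
        A_1[idx] = tempVar0 + tempVar1;
    }
}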

Overview

• Introduction
• Challenges in Accelerator-Based Systems
• Addressing Programming Challenges
  – Managing parallelism across Accelerator cores
  – Managing data transfer between CPU and GPU
  – Managing Synergistic Execution across CPU and GPU cores
• Implementation and Evaluation
• Conclusions

2. Data Transfer Insertion

Original MATLAB program:

1: n = 100;
2: A = rand(n, n);
3: for i = 1:k
4:   A = A * A;
5: end
6: print(A);

After mapping (basic block b1 on the CPU, b2 on the GPU, b3 on the CPU), transfers are inserted only where needed:

b1 (CPU): n = 100; A = rand(n, n);
          TransferToGPU(A)        -- data transfer required
b2 (GPU): A = A * A;              -- loop back-edge needs no transfer
          TransferToCPU(A)        -- data transfer required
b3 (CPU): print(A);

• Data flow analysis to determine the locations of variables at the start of each basic block
• Edge splitting to insert necessary transfers (a CUDA sketch of the resulting transfer placement follows)
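A hedged CUDA sketch (illustrative only, not the compiler's actual output) of what the inserted transfers amount to for this example: A is copied to the GPU once on the b1-to-b2 edge, the squaring kernel runs k times entirely on the device, and A is copied back once on the b2-to-b3 edge before the CPU prints it. The matmul kernel and squareKTimes wrapper are hypothetical names.

#include <cuda_runtime.h>
#include <utility>

// Hypothetical kernel: naive n x n matrix multiply, out = a * b.
__global__ void matmul(float *out, const float *a, const float *b, int n) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < n && c < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k) sum += a[r * n + k] * b[k * n + c];
        out[r * n + c] = sum;
    }
}

// Transfers hoisted out of the loop: one CPU-to-GPU copy on the b1->b2 edge,
// one GPU-to-CPU copy on the b2->b3 edge, and none inside the loop.
void squareKTimes(float *hA, int n, int k) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA, *dTmp;
    cudaMalloc(&dA, bytes); cudaMalloc(&dTmp, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);    // TransferToGPU(A)

    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    for (int i = 0; i < k; ++i) {
        matmul<<<grid, block>>>(dTmp, dA, dA, n);          // A = A * A on the GPU
        std::swap(dA, dTmp);
    }
    cudaMemcpy(hA, dA, bytes, cudaMemcpyDeviceToHost);     // TransferToCPU(A)
    cudaFree(dA); cudaFree(dTmp);
    // print(A) now runs on the CPU with an up-to-date copy.
}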

Hybrid Compiler-Runtime Approach for Memory Consistency

• Efficient hybrid compiler-runtime consistency scheme to avoid redundant data transfers (a simplified sketch of the runtime check follows)
• Object-level memory consistency at the GPU; page-level memory consistency at the CPU

[Figure: example flow: the CPU writes A and reads B; a coherence check transfers A to the GPU before kernel K(A), which reads and writes A; later CPU reads of A and B trigger checks that transfer back only the data that is stale]

[PACT 2012]
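A simplified sketch of the general idea (my own, not the X10-CUDA implementation, which tracks objects on the GPU side and pages on the CPU side): each managed array carries a coherence status, the compiler inserts check calls around reads, writes, and kernel launches, and the runtime issues a transfer only when the requesting side holds a stale copy. The names ManagedArray, checkOnGpu, and checkOnCpu are hypothetical.

#include <cuda_runtime.h>

enum class Coherence { Shared, CpuDirty, GpuDirty };   // per-object status tracked by the runtime

struct ManagedArray {
    float *host;
    float *dev;
    size_t bytes;
    Coherence state = Coherence::CpuDirty;
};

// Inserted before a GPU kernel reads/writes 'a': copy only if the GPU copy is stale.
void checkOnGpu(ManagedArray &a, bool kernelWrites) {
    if (a.state == Coherence::CpuDirty)
        cudaMemcpy(a.dev, a.host, a.bytes, cudaMemcpyHostToDevice);
    a.state = kernelWrites ? Coherence::GpuDirty : Coherence::Shared;
}

// Inserted before a CPU read/write of 'a': copy only if the CPU copy is stale.
void checkOnCpu(ManagedArray &a, bool cpuWrites) {
    if (a.state == Coherence::GpuDirty)
        cudaMemcpy(a.host, a.dev, a.bytes, cudaMemcpyDeviceToHost);
    a.state = cpuWrites ? Coherence::CpuDirty : Coherence::Shared;
}

A second access to data that is already up to date finds the status unchanged and skips the copy, which is how redundant transfers are avoided.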

Overview

• Introduction
• Challenges in Accelerator-Based Systems
• Addressing Programming Challenges
  – Managing parallelism across Accelerator cores
  – Managing data transfer between CPU and GPU
  – Managing Synergistic Execution across CPU and GPU cores
• Implementation and Evaluation
• Conclusions

3. Synergistic Execution on Heterogeneous Devices

Compile-time approaches:
• Profile-based scheduling of tasks on CPU and GPU cores
• Compiler analysis of CPU- and GPU-friendly codes
• Compiler analysis for data partition and transfer

[Figure: software-pipelined execution of filter instances A1…D4 across CPU and GPU]

3. Synergistic Execution using Runtime Methods

• Different kernels execute better (faster) on different devices (CPU, GPU, Xeon Phi, …)
• OpenCL allows different kernels to be executed on different devices
• But it requires programmers to specify the device for kernel execution
• Can a single OpenCL kernel transparently utilize all devices?

FluidiCL: A Runtime System for Cooperative and Transparent Execution of OpenCL Programs on Heterogeneous Devices

FluidiCL Runtime

Kernel Execution

• Kernel launch is modified with a wrapper call
• Data is transferred to both devices
• Two additional buffers on the GPU: one for data coming from the CPU, one to hold an original (twin) copy
• The entire grid is launched on the GPU
• In addition, a few work-groups, starting from the end, are launched on the CPU (a simplified sketch of the work-group split follows)

[Figure: host-side buffers GPUData, Twin, and CPUData coordinate the CPU and GPU copies]
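FluidiCL operates on OpenCL kernels; purely as an illustration (written in CUDA syntax to match the other sketches, and not FluidiCL's actual wrapper), the key trick is to parameterize the kernel by a work-group offset so that one device executes work-groups from the front of the grid while another covers work-groups from the end. The real runtime launches the full grid on the GPU, duplicates the trailing work-groups on the CPU, and merges whichever results arrive first; that bookkeeping is omitted here. workKernel and launchSplit are hypothetical names.

#include <cuda_runtime.h>

// Kernel body parameterized by a work-group (block) offset, so different devices
// can each cover a slice of the logical grid.
__global__ void workKernel(float *data, int n, int blockOffset) {
    int logicalBlock = blockIdx.x + blockOffset;      // position in the full grid
    int i = logicalBlock * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i] + 1.0f;
}

// Wrapper replacing the original launch: the front portion of the grid runs first,
// and the back portion (blocks counted from the end) would run on the other device
// in the real system; here both launches go to the same GPU for simplicity.
void launchSplit(float *dData, int n, int totalBlocks, int blocksFromEnd) {
    int frontBlocks = totalBlocks - blocksFromEnd;
    workKernel<<<frontBlocks, 256>>>(dData, n, 0);                 // front portion
    workKernel<<<blocksFromEnd, 256>>>(dData, n, frontBlocks);     // back portion
    cudaDeviceSynchronize();
}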

[CGO 2014]

FluidiCL: CPU + GPU + Xeon Phi

• Consider CPU + Xeon Phi as a single merged device, and the GPU as the other device
• Use FluidiCL's two-device solution between the merged device and the GPU
• Use FluidiCL's two-device solution for work distribution between the CPU and the Xeon Phi

[Figure: CPU and Xeon Phi paired as one device, working alongside the GPU]

Overview

• Introduction
• Challenges in Accelerator-Based Systems
• Addressing Programming Challenges
  – Managing parallelism across Accelerator cores
  – Managing data transfer between CPU and GPU
  – Managing Synergistic Execution across CPU and GPU cores
• Implementation and Evaluation
• Conclusions

Implementation & Evaluation

• Prototype compiler implementation: source-to-source translation
  – Insertion of wrapper functions, data transfers, …
• Use of the Nvidia CUDA compiler and OpenCL compiler
• Use of the CUDA and OpenCL runtimes
• Rodinia and CUDA SDK benchmarks, along with benchmarks from StreamIt and MATLAB

Compiler Framework for StreamIt

StreamIt Program

[Figure: compilation flow. Generate code for profiling → execute profile runs → configuration selection → ILP partitioner or heuristic partitioner (instance partitioning, task partitioning) → modulo scheduling → code generation → CUDA + C code]

StreamIt: Experimental Results

[Chart: speedup (higher is better) for CPU Only, Mostly GPU, and Syn-Heuristic configurations; some bars exceed the 10x axis, reaching more than 32x, 52x, and 65x]

MATLAB Execution: Experimental Results

[Chart: speedup over MATLAB (higher is better) for CPU+GPU, CPU only, Jacket, and GPUmat on the bscholes, clos, fdtd, nb1d, and nb3d benchmarks; two bars exceed the 20x axis, reaching 113x and 64x]

Compiler Framework for X10-CUDA

X10 Code without explicit memory management

Compiler:
• CPU coherence: insert checks at every read/write; remove redundant coherence checks
• GPU coherence: identify variables that need to be transferred; insert checks before and after kernel invocation

Runtime:
• Track coherence status
• Initiate data transfers between CPU and GPU only when an access to stale data is detected

Automatic Memory Management

[Chart: execution time, lower is better]

• 3.33x faster than naïve
• Comparable to or better than hand-tuned (explicit) memory management
• 1.29x faster than CGCM

Transparent Kernel Execution

[Chart: execution time, lower is better]

• FluidiCL execution times are comparable to the best of the devices
• Better than best-device performance in two applications

Summary

• Another exciting decade awaits accelerators
• Need innovations in
  – Compilers
  – Programming Languages & Models
  – Runtime Systems
• Key challenges: Programmability, Portability and Performance

Acknowledgements

• My Students
  – Abhishek Udupa, Ashwin Prasad, Jayvant Anantpur, Prasanna Pandit, Radhika Sarma, Raghu Prabhakar, Sreepathi Pai
  – Kaushik Rajan, R. Manikantan, Nagendra Gulur
• Funding Agencies
  – AMD, IBM, Intel, Microsoft, Nvidia

Computer System Research @ HPC Lab

[Photo: IBM-JNC Workshop, Oct. 2011]

Thank You !!
