Programmability, Portability and Performance in Heterogeneous Accelerator-Based Systems
R. Govindarajan
High Performance Computing Lab, SERC & CSA, IISc
[email protected]
Dec. 2015

Overview
• Introduction
• Challenges in Accelerator-Based Systems
• Addressing Programming Challenges
  – Managing parallelism across accelerator cores
  – Managing data transfer between CPU and GPU
  – Managing synergistic execution across CPU and GPU cores
• Implementation and Evaluation
• Conclusions

Moore's Law
• The number of transistors doubles every 18 months
• Performance doubles every 18 months
[Figure: transistor-count and performance trends. Source: Univ. of Wisconsin]

Moore's Law: Processor Architecture Roadmap
[Figure: architecture roadmap from superscalar and EPIC ILP processors, later extended to multicores and accelerators]

Multicores: The "Right Turn"
• Frequency scaling of a single core (1 GHz, 3 GHz, towards 6 GHz) gives way to growing core counts (1, 2, 4, 16 cores) at roughly constant frequency
[Figure: performance trend illustrating the shift from clock scaling to multicores. Source: LinusTechTips]

Accelerators
[Figure: accelerator architecture, NVIDIA Fermi-based Tesla S2050. Source: icrontic.com]

Comparing CPU and GPU
• CPU: 8 cores @ 3 GHz, 380 GFLOPS
• GPU: 2880 CUDA cores @ 1.67 GHz, 1500 GFLOPS
[Figure: block diagrams of a multicore CPU (cores C0-C7, shared last-level cache, integrated memory controller, QPI) and a GPU attached over PCI-e, each with its own memory]

Accelerators: Hype or Reality?
Rank | Site | Manufacturer | Computer | Country | Cores | Rmax [Pflops] | Power [MW]
1 | National SuperComputer Center in Tianjin | NUDT | Tianhe-2: Xeon E5-2692 + Xeon Phi 31S1P | China | 3,120,000 | 33.86 | 17.80
2 | Oak Ridge National Labs | Cray | Titan, Cray XK7: Opteron 6274 (2.2 GHz) + NVIDIA Kepler K20 | USA | 560,640 | 17.59 | 8.20
3 | Lawrence Livermore Labs | IBM | Sequoia, BlueGene/Q | USA | 1,572,864 | 17.17 | 7.89
4 | RIKEN Advanced Institute for Computational Science | Fujitsu | K Computer: SPARC64 VIIIfx (2.0 GHz), Tofu interconnect | Japan | 705,024 | 10.51 | 12.66
5 | DOE/SC/ANL | IBM | BlueGene/Q, Power BQC 16C (1.6 GHz) | USA | 786,432 | 8.58 | 3.94
6 | DOE/LANL/SNL | Cray | Xeon E5-2698 v3 (2.3 GHz) + Aries interconnect | USA | 301,056 | 8.10 | -
7 | Swiss National Supercomputing Centre | Cray | Xeon E5-2670 (2.6 GHz) + NVIDIA Kepler K20x | Switzerland | 115,984 | 6.27 | 2.35
8 | HLRS, Stuttgart | Cray | Xeon E5-2680 v3 (2.5 GHz) + Aries interconnect | Germany | 196,608 | 5.64 | -
Top 500 Nov 2015 list: www.top500.org

Emergence of Accelerators
• 104 systems in the Top500 (Nov 2015) use accelerators
[Source: Nicole Hemsoth, HPC Wire, June 2014]

Accelerator Programming: Good News
• Emergence of programming languages for GPUs/accelerators: CUDA, OpenCL, OpenACC
  – Efficient, but low programmability (see the CUDA sketch below)
• Growing collection of code base
  – CUDA Zone
  – Packages supporting GPUs from software vendors
• Impressive performance
• But what about programmability and portability while retaining performance?

Handling the Multicore Challenge
• Shared- and distributed-memory programming models: OpenMP, MPI
  – Scaling with the number of cores
• New parallel languages (partitioned global address space languages): X10, UPC, Chapel, ...
  – Higher programmability, but performance and applicability to accelerators are issues
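Illustrative sketch (not from the talk; all names and sizes below are made up): even a simple vector addition in CUDA forces the programmer to allocate device memory, move data explicitly in both directions, and pick a kernel launch geometry. This is the low-level burden the rest of the talk aims to hide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element; the grid is sized to cover all n elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host data.
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Explicit device allocation and host-to-device transfers.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Kernel launch: the programmer chooses the thread-block geometry.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // Explicit device-to-host transfer of the result.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

The data transfers, memory management and work partitioning handled by hand here are exactly the challenges listed on the next slide.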
Accelerator Programming: Boon or Bane
• Challenges in programming accelerators:
  – Managing parallelism across various cores
    • Task-, data- and thread-level parallelism
  – Efficient partitioning of work across different devices
  – Transfer of data between CPU and accelerator
  – Managing the CPU-accelerator memory bandwidth efficiently
  – Efficient use of the different types of memory (device memory, shared memory, constant and texture memory, ...)
  – Synchronization across multiple cores/devices

What Parallelism(s) to Exploit?

Our Approach
• Source programs written in array languages (MATLAB), OpenMP, MPI, streaming languages, CUDA/OpenCL, and other parallel languages
• A compiler/runtime system maps them onto synergistic execution on multiple heterogeneous cores: multicores (with SSE), GPUs, Xeon Phi, Cell BE
[Figure: layered diagram of the source languages feeding a common compiler/runtime system that targets the heterogeneous devices]

Programming Heterogeneous Systems: Our Contributions
1. Managing parallelism across accelerator cores
   – Identifying and exposing implicit parallelism (MATLAB, StreamIt)
   – Software pipelining in StreamIt
2. Managing data transfer between CPU and GPU
   – Compiler scheme (MATLAB)
   – Hybrid scheme (X10-CUDA)
3. Managing synergistic execution across CPU and GPU cores
   – Profile-based partitioning (StreamIt, MATLAB)
   – Runtime mechanism for OpenCL (FluidiCL)

1. Managing Parallelism across Cores: The StreamIt Language
• A program is a hierarchical composition of three basic constructs:
  – Pipeline
  – SplitJoin
  – Feedback loop
• A program represents an infinite stream of data and the computation on it (see the data-parallel filter sketch below)
[Figure: example stream graph composed of filters A, B, C, D]
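Illustrative sketch (not from the talk and not the StreamIt compiler's actual output; the filter, its parameters and the batch size are invented): consecutive firings of a stateless filter are independent, so data parallelism in a stream program can be exposed by assigning one firing per CUDA thread over a batch of the stream.

```cuda
#include <cuda_runtime.h>

// Hypothetical stateless "filter": each stream element is scaled and offset.
// Many consecutive firings of the same filter are independent, so one
// GPU thread executes one firing.
__global__ void scaleFilter(const float *in, float *out,
                            float gain, float bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = gain * in[i] + bias;   // one filter firing per thread
}

// Launch one firing per element of a batch of the (conceptually infinite) stream.
void runFilterBatch(const float *dIn, float *dOut, int batch) {
    int threads = 128;
    int blocks = (batch + threads - 1) / threads;
    scaleFilter<<<blocks, threads>>>(dIn, dOut, 2.0f, 0.5f, batch);
}
```

Filters with state or peeking, as well as pipelined and task parallelism, need the scheduling machinery described in the following slides.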
1. Managing Parallelism across Cores
• Data parallelism: a stream of data is processed as multiple instances of a thread
• Pipelined parallelism: software pipelining of filters or tasks
• Task parallelism: task partitioning between GPU and CPU cores, with work scheduling and processor assignment
  – Profile-based approach for task mapping
  – Takes communication bandwidth restrictions into account

Mapping StreamIt on GPUs
• Resource constraints and dependency constraints of the stream graph must be satisfied
• The stream graph exposes task parallelism (independent filters such as B and C), data parallelism (multiple instances of a filter, e.g. A1..A4), and pipeline parallelism (consecutive filters in steady state)
• An optimal schedule is obtained with an integer linear program, or a near-optimal heuristic schedule is used for efficiency, yielding a software-pipelined execution
[Figure: stream graph A -> (B, C) -> D and the corresponding software-pipelined schedule] [CGO 2009]

Mapping StreamIt on GPUs: Data Transfer
• The schedule also accounts for data transfers between CPU and GPU over the DMA channel
[Figure: software-pipelined execution interleaving GPU work, DMA transfers and CPU work] [LCTES 2009]

MEGHA: MATLAB Execution on GPUs
• Compilation flow: MATLAB program -> frontend (code simplification, SSA construction, type inference) -> kernel identification -> mapping and scheduling -> data transfer insertion -> code generation (backend)
• Running example (IR after simplification):
    array A[N][N]; array B[N][N]; array C[N][N]
    1: tempVar0 = (B + C);
    2: tempVar1 = (A + C);
    3: A_1 = tempVar0 + tempVar1;
    4: C_1 = A_1 * C;

Task Partitioning and Mapping in MATLAB Execution
• Kernel composition: IR statements are combined into "common loops" under memory and register constraints, using clustering methods
• Type and shape inferencing; scalar reduction to reduce storage and kernel-call overheads
• Data flow graph construction
• In the example, statements 1-3 are composed into Kernel 1 and statement 4 (C_1 = A_1 * C) forms Kernel 2; Kernel 1 is generated as a loop nest whose loop order is chosen for memory coalescing on the GPU or cache locality on the CPU:
    for i = 1:N
      for j = 1:N
        tempVar0_s = B(i, j) + C(i, j);
        tempVar1_s = A(i, j) + C(i, j);
        A_1(i, j) = tempVar0_s + tempVar1_s;
      end
    end
[PLDI 2011]

2. Data Transfer Insertion
• Running example, annotated with the basic block and device on which each statement executes:
    1: n = 100;        (b1, CPU)
    2: A = rand(n, n); (b1, CPU)
    3: for i = 1:k
    4:   A = A * A;    (b2, GPU)
    5: end
    6: print(A);       (b3, CPU)
• A must be transferred to the GPU before the loop body executes (TransferToGPU(A)) and back to the CPU before the print (TransferToCPU(A)); no transfer is needed where the data is already resident, e.g. on the loop back edge
• Data flow analysis determines the location of each variable at the start of every basic block
• Edge splitting is used to insert the necessary transfers

Hybrid Compiler-Runtime Approach for Memory Consistency (X10-CUDA)
• An efficient hybrid compiler-runtime consistency scheme avoids redundant data transfers
• Object-level memory consistency is maintained at the GPU; page-level memory consistency at the CPU
• Example: for "Write A; Read B" followed by a GPU kernel K(A) that writes A, a consistency check transfers A to the GPU only if the GPU copy is stale; a later "Read A; Read B" on the CPU triggers checks that transfer A (and, if needed, B) back to the CPU (a state-tracking sketch follows below)
[PACT 2012]
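Illustrative sketch (not the actual MEGHA or X10-CUDA implementation; the Residence state and helper names are invented): the core idea of avoiding redundant transfers can be captured by tracking, per array, where the up-to-date copy lives and issuing a transfer only when the requested side is stale.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical per-array coherence state: where the up-to-date copy lives.
enum class Residence { HostOnly, DeviceOnly, Both };

struct ManagedArray {
    float *host;
    float *device;
    size_t bytes;
    Residence state;
};

// Transfer to the GPU only if the device copy is stale.
void ensureOnDevice(ManagedArray &a) {
    if (a.state == Residence::HostOnly) {
        cudaMemcpy(a.device, a.host, a.bytes, cudaMemcpyHostToDevice);
        a.state = Residence::Both;
    }
}

// Transfer back to the CPU only if the host copy is stale.
void ensureOnHost(ManagedArray &a) {
    if (a.state == Residence::DeviceOnly) {
        cudaMemcpy(a.host, a.device, a.bytes, cudaMemcpyDeviceToHost);
        a.state = Residence::Both;
    }
}

// A GPU kernel that writes the array invalidates the host copy;
// a CPU write analogously invalidates the device copy.
void noteDeviceWrite(ManagedArray &a) { a.state = Residence::DeviceOnly; }
void noteHostWrite(ManagedArray &a)   { a.state = Residence::HostOnly; }
```

In the actual scheme the compiler inserts the checks while the runtime tracks consistency at object granularity on the GPU and page granularity on the CPU, as described above.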
3. Synergistic Execution on Heterogeneous Devices: Compile-Time Approaches
• Profile-based scheduling of tasks on CPU and GPU cores
• Compiler analysis to identify CPU-friendly and GPU-friendly code
• Compiler analysis for data partitioning and transfer
[Figure: software-pipelined execution schedule spanning CPU and GPU]

3. Synergistic Execution Using Runtime Methods
• Different kernels execute better (faster) on different devices (CPU, GPU, Xeon Phi, ...)
• OpenCL allows different kernels to be executed on different devices
• But it requires the programmer to specify the device for each kernel execution
• Can a single OpenCL kernel transparently utilize all devices?
• FluidiCL: a runtime system for cooperative and transparent execution of OpenCL programs on heterogeneous devices

FluidiCL Runtime: Kernel Execution
• The kernel launch is modified into a wrapper call
• Data is transferred to both devices
• Two additional buffers are kept on the GPU: one for data coming from the CPU and one to hold an original copy (twin)
• The entire grid is launched on the GPU
• In addition, a few work-groups, starting from the end of the grid, are launched on the CPU (a simplified sketch of this partitioning appears below)
[Figure: host-side and device-side buffers (GPUData, Twin, CPUData) shared between CPU and GPU] [CGO 2014]

FluidiCL: CPU + GPU + Xeon Phi
• Consider the CPU and the Xeon Phi together as a single merged device, and the GPU as the other device
• Use FluidiCL's two-device solution between the merged device and the GPU
• Use FluidiCL's two-device solution again for work distribution between the CPU and the Xeon Phi within the merged device
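Illustrative sketch (not the FluidiCL implementation: FluidiCL launches the whole grid on the GPU, lets the CPU take work-groups from the end, and resolves the overlap dynamically at runtime; here a static split in plain CUDA stands in for that idea, and the kernel and cut point are invented): the GPU executes the leading work-groups while the host processes the trailing ones, and the results are merged afterwards.

```cuda
#include <algorithm>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: one "work-group" (thread block) handles WG_SIZE elements.
constexpr int WG_SIZE = 256;

__global__ void squareKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

// CPU execution of one work-group, taken from the end of the grid.
static void squareGroupOnCpu(const float *in, float *out, int n, int group) {
    for (int t = 0; t < WG_SIZE; ++t) {
        int i = group * WG_SIZE + t;
        if (i < n) out[i] = in[i] * in[i];
    }
}

// Static split for illustration: the GPU gets work-groups [0, cut),
// the CPU takes the trailing work-groups [cut, numGroups).
void runCooperatively(const float *dIn, float *dOut,   // device buffers
                      const float *hIn, float *hOut,   // host copies of the same data
                      int n, int cut) {
    int numGroups = (n + WG_SIZE - 1) / WG_SIZE;

    // GPU part: launch only its share of the grid (asynchronous).
    if (cut > 0)
        squareKernel<<<cut, WG_SIZE>>>(dIn, dOut, n);

    // CPU part: trailing work-groups, processed while the GPU kernel runs.
    for (int g = numGroups - 1; g >= cut; --g)
        squareGroupOnCpu(hIn, hOut, n, g);

    cudaDeviceSynchronize();

    // Merge: only the GPU-computed prefix needs to be copied back to the host.
    size_t prefix = std::min((size_t)cut * WG_SIZE, (size_t)n);
    cudaMemcpy(hOut, dOut, prefix * sizeof(float), cudaMemcpyDeviceToHost);
}
```

The point of the runtime approach is that the cut is not fixed in the program: the devices race through the grid from opposite ends, and whichever finishes a work-group first contributes its result.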