GPU Acceleration at Scale with OpenPower Platforms in Code_Saturne

Samuel Antao (1), Charles Moulinec (2), Yvan Fournier (3), Robert Sawko (1), Malgorzata Zimon (1), Christopher Thompson (1), Alex Skillen (2), Juan Uribe (4), David R. Emerson (2)

(1) IBM Research, (2) SCD, STFC Daresbury Laboratory, UK, (3) EDF R&D, FR, (4) EDF R&D Centre, UK

Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA on OpenPOWER platforms.

Main accomplishments

• A library providing GPU acceleration at scale for Code_Saturne running on top of OpenPower machines using CUDA.
• Managed data environments that control the scope of data in the device.
• A single templated device entry function to run application kernels back to back and mitigate kernel-launching latencies.
• 2.9x speedup over a CPU-only simulation on a 256-node POWER9 machine, the ORNL Summit supercomputer.
• Enables multi-billion-cell unstructured-mesh simulations.

What is Code_Saturne?

• www.code-saturne.org
• Open-source computational fluid dynamics (CFD) software package.
• Developed by EDF R&D and tailored to tackle extra-large problems, e.g. whole thermo-nuclear reactors.
• OpenMP + MPI; C/C++, Fortran and Python.
• Typical 3D compute time is dominated by Gauss-Seidel (GS) for velocity and scalars and by Algebraic Multigrid (AMG) for pressure.

What are the challenges?

• Sparse-matrix solvers cannot take full advantage of GPUs, which are tailored for dense computation.
• Domain boundaries introduce intra- and inter-node communication overheads.
• The bottlenecks are spread across a scattered set of compute kernels.

Execution time distribution:
[Chart: single-thread profiling of Code_Saturne 5.0+. Contributions include the Gauss-Seidel solver (velocity), pressure GS (AMG), matrix-vector multiplication (MSR and CSR), dot products, multigrid setup, computing coarse cells from fine cells, other AMG-related work, and other.]
• Many components (kernels) contribute to the total execution time.
• There are data dependencies between consecutive kernels.
• There are opportunities to keep data in the device between kernels.
• Some kernels may have lower compute intensity, yet it can still be worthwhile to compute them on the GPU if the data is already there.

How do we tackle them?

• Mitigate the latency of launching kernels (SpMV, dot products, etc.) back to back:
  – Pack the arguments of the various kernels into a single data structure.
  – Leverage pack expansion to wrap multiple kernels in a single CUDA kernel; specialised implementations unpack their arguments inside the device code.
• Leverage the NVLINK fast interconnect to shift bottlenecks from data movement towards compute.
• Create data environments without major code disruption:
  – Use our own device memory pool and record information about data environments.
  – Within an environment, data stays in the device.
  – Reduce device memory allocation/movement overheads.
• Run multiple ranks per GPU with the NVIDIA Multi-Process Service (MPS) and dual-buffering techniques: improve the overlap of data movements and compute within the same rank and across different ranks.

The single templated device entry function that wraps the kernels into one CUDA kernel is shown below: each kernel kind unpacks its arguments from the packed structure, and pack expansion chains the specialised implementations back to back.

template <KernelKinds Kind>
__device__ int any_kernel(KernelArgsBase &Arg, unsigned n_rows_per_block) {
  switch (Kind) {
  // Matrix-vector multiplication (MSR format):
  case MV_MSR:
    matrix_vector_multiplication(
        /* dev_row_index */ Arg.getArg(0),
        /* dev_col_id    */ Arg.getArg(1),
        /* dev_val       */ Arg.getArg(2),
        /* dev_d_val     */ Arg.getArg(3),
        /* dev_x         */ Arg.getArg(4),
        /* dev_y         */ Arg.getArg(5),
        /* n_rows        */ Arg.getArg(6),
        /* n_cols        */ Arg.getArg(7),
        /* n_rows_per_block */ n_rows_per_block);
    break;
  // ...
  case DP_xx:
    dot_product(
        /* version */ Arg.getArg(0),
        /* n_rows  */ Arg.getArg(1),
        /* x       */ Arg.getArg(2),
        /* y       */ nullptr,
        /* z       */ nullptr,
        /* res     */ Arg.getArg(3),
        /* n_rows_per_block */ n_rows_per_block);
    break;
  // ...
  }
  __syncthreads();
  return 0;
}

template <KernelKinds... Kinds>
__global__ void any_kernels(void) {
  auto *KA = reinterpret_cast<KernelArgsSeries *>(&KernelArgsSeriesGPU[0]);
  const unsigned n_rows_per_block = KA->RowsPerBlock;
  unsigned idx = 0;
  int dummy[] = { any_kernel<Kinds>(KA->Args[idx++], n_rows_per_block)... };
  (void) dummy;
}

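As a usage illustration only (the host-side interface of the port is not shown in the listing above), chaining an MSR SpMV and a dot product then costs a single kernel launch. The wrapper function and launch geometry below are placeholders, and it is assumed the packed argument block has already been copied into the device-side KernelArgsSeriesGPU buffer:

// Hypothetical host-side sketch: fuse an MSR SpMV and a dot product into
// one launch of any_kernels so only one kernel-launch latency is paid.
// Assumes KernelArgsSeriesGPU has already been filled by the host.
void run_fused_step(unsigned n_blocks, unsigned threads_per_block) {
  any_kernels<MV_MSR, DP_xx><<<n_blocks, threads_per_block>>>();
}
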

Results

[Charts: CPU+GPU speedup over CPU-only execution and strong-scaling efficiency. (a) POWER8 + P100 GPUs, NVLINK 1.0, 111M-cell mesh, 1 to 32 nodes. (b) POWER9 + V100 GPUs, NVLINK 2.0, 889M-cell mesh, 64 to 512 nodes.]

• Compute and data movements overlap almost perfectly for the GS algorithm.
• Demonstrated good scaling characteristics: a consistent speedup above 2x with only ~100K cells per rank.
• NVLINK 2.0, available in POWER9 machines, enables better strong scaling than the previous generation: better efficiency when using 16x more resources for a problem that is only 8x larger.

Acknowledgements

POWER9 scaling experiments were only possible with the Summit Early Access programme sponsored by the Oak Ridge Leadership Computing Facility. This work was partially funded by UKTC consortium grants EP/L000261/1 and EP/R029326/1.

More information: [email protected]

[Figure 1: OpenMP 4.5 profiling overview of the AMG solver. (a) AMG cycle overview. (b) Coarse mesh iteration detail. Labelled contributions include 8.3% for the conjugate gradient (including MSR and CSR SpMVs) and 7.0% for computing coarse cells from fine cells.]

3.4 CUDA porting

We implemented a CUDA port providing the same functionality as the OpenMP 4.5 port, while addressing the limitations listed in Section 3.3. We implemented the port in a separate folder using C++ and exposed a C interface so that it can be invoked from the existing Code Saturne code.

3.4.1 Kernels' implementation

We implemented four templated kernels using the CUDA language: Gauss-Seidel, SpMV, dot products and stream operations. The different variations of SpMV (MSR, CSR, with/without diagonal) are controlled with template arguments. Similarly, we use template arguments to reuse code for the different flavours of dot product and other stream operations. For the Gauss-Seidel kernel, we use a template argument to control whether the kernel is combined with the computation of a dot product. Using templated functions improves the code scalability without increasing the complexity of the control flow during execution. Listings 6 and 7 present the implementations of the SpMV kernel and the stream operations.

The implementation of SpMV (the Gauss-Seidel kernels are very similar) follows the OpenMP implementation except for the reductions, where we can benefit from the GPU double-precision shuffle instructions, which deliver better performance than communicating through the scratchpad memory. The implementation of the dot products also benefits from the shuffle operation to perform reductions across the warp.

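A rough sketch of this warp-level reduction pattern is given below. It is illustrative only, not the actual Code_Saturne kernel: warp_reduce_sum and dot_product_sketch are hypothetical names, and the grid-stride loop and atomic accumulation are assumptions about how partial results would be combined.

#include <cuda_runtime.h>

// Minimal sketch: reduce one double per thread across a warp using
// __shfl_down_sync, avoiding a round trip through shared (scratchpad) memory.
__device__ double warp_reduce_sum(double val) {
  // 0xffffffff: all 32 lanes of the warp participate.
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffffu, val, offset);
  return val;  // lane 0 holds the warp-wide sum
}

// Illustrative partial dot product: each thread accumulates its elements,
// each warp reduces in registers, and lane 0 adds to the global result.
__global__ void dot_product_sketch(const double *x, const double *y,
                                   double *res, int n) {
  double sum = 0.0;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x)
    sum += x[i] * y[i];
  sum = warp_reduce_sum(sum);
  if ((threadIdx.x & (warpSize - 1)) == 0)
    atomicAdd(res, sum);  // double atomicAdd needs compute capability 6.0+
}

The shuffle keeps the partial sums in registers for the whole warp, which is what avoids the shared-memory round trip mentioned above.
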
We observe that the number of iterations for the Gauss-Seidel solver increases from 64 to 76 for the GPU-accelerated version. This solver relies on a data race during the execution of the solver loop, in the sense that the input vector is also the output vector. It is therefore possible that the output of one iteration is used as the input of another, depending on how the different threads are scheduled. This data sharing between iterations helps the solver converge in fewer iterations.

For the GPU version of the code, the number of active threads in the execution of the loop is two or more orders of magnitude higher than in the CPU version (only up to 8 active threads). This makes the propagation of data between iterations less likely to happen, so the solver takes longer to converge, as it behaves more like a Jacobi solver than a Gauss-Seidel solver.

[Figure 3: Single-rank timeline details for the CUDA port using a 1.5M-cell mesh. (a) Gauss-Seidel solver. (b) AMG solver, finer mesh. (c) AMG solver, coarser mesh.]

Figure 3 shows the detail of the GPU execution timeline for the Gauss-Seidel solver and for the coarser and finer levels of the AMG solver. We observe that the Gauss-Seidel and the AMG finer-mesh solvers make better use of the GPU. The AMG finer-mesh solver still accounts for a significant amount of idle GPU time. A methodology to reduce the amount of idle time is discussed in Section 4.

[Figure 6: Details for the GPU execution of 5 ranks bound to a single GPU. (a) Gauss-Seidel solver. (b) AMG solver, finer mesh. (c) AMG solver, coarser mesh.]

3.4.2 Memory management

We implemented a complete infrastructure to manage the GPU storage and its mapping to host variables. During the program initialisation phase, the implementation detects the number of ranks using a given GPU and chunks all the available GPU memory by rank. Each rank then manages its own chunk, without any expensive memory allocation/deallocation, making sure that all variables are 256-byte aligned (the GPU cache block size). When a given variable needs to be allocated in GPU memory, the implementation finds the minimum slot that fits the requested size in the already allocated memory. The map between the host address and the device address is kept in a list so that it can be retrieved later. As the number of variables mapped to the GPU at a given point is small, managing this list introduces negligible overhead.

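A simplified sketch of such a pool is given below. It is not the actual Code_Saturne infrastructure, and all names (DevicePool, map_variable, unmap_variable) are illustrative; it only shows the ingredients described above: a single per-rank chunk, 256-byte-aligned slots, reuse of the smallest free slot that fits, and a small host-to-device mapping list.

#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Illustrative per-rank GPU memory pool: one large chunk is allocated at
// start-up, variables get 256-byte-aligned slots, and a small list maps
// host addresses to device addresses.
class DevicePool {
  struct Slot {
    size_t offset, bytes;   // position and aligned size inside the chunk
    const void *host;       // host variable currently mapped (nullptr = free)
  };
  char *base_ = nullptr;
  size_t capacity_ = 0, used_ = 0;
  std::vector<Slot> slots_;

  static size_t align256(size_t n) { return (n + 255) & ~size_t(255); }

public:
  explicit DevicePool(size_t bytes_per_rank) : capacity_(bytes_per_rank) {
    // Single allocation for this rank's chunk; no cudaMalloc later on.
    cudaMalloc(reinterpret_cast<void **>(&base_), capacity_);
  }
  ~DevicePool() { cudaFree(base_); }

  // Return the device address for a host variable, allocating a slot on
  // first use. Already-mapped variables are looked up in the (small) list.
  void *map_variable(const void *host_ptr, size_t bytes) {
    const size_t need = align256(bytes);
    Slot *best = nullptr;
    for (auto &s : slots_) {
      if (s.host == host_ptr) return base_ + s.offset;      // already mapped
      if (!s.host && s.bytes >= need && (!best || s.bytes < best->bytes))
        best = &s;                                           // smallest free fit
    }
    if (!best) {                        // no free slot fits: carve a new one
      if (used_ + need > capacity_) return nullptr;          // pool exhausted
      slots_.push_back({used_, need, nullptr});
      used_ += need;
      best = &slots_.back();
    }
    best->host = host_ptr;
    return base_ + best->offset;
  }

  // Release a variable's slot so a later mapping can reuse it.
  void unmap_variable(const void *host_ptr) {
    for (auto &s : slots_)
      if (s.host == host_ptr) { s.host = nullptr; return; }
  }
};

Because the chunk is allocated once at initialisation, mapping a variable inside the solver loop never calls cudaMalloc, which is the point of the scheme.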





APPENDIX A
ARTIFACT DESCRIPTION APPENDIX: GPU ACCELERATION AT SCALE WITH OPENPOWER PLATFORMS IN Code Saturne

A. Abstract

This description contains the information on how the results presented in the SC18 poster "GPU Acceleration at Scale with OpenPower platforms in Code Saturne" were obtained and how they can be reproduced. We present details on the dependencies, the Code Saturne building process, the case configuration, and job submission.

B. Description

1) Check-list (artifact meta information):
• Algorithm: Velocity-pressure coupling with Gauss-Seidel (velocity) and Algebraic Multigrid with Conjugate Gradient at each level (pressure).
• Program: C/C++, Fortran and Python source. For GPU runs, CUDA source code is used. Solver implemented in C99 (host) and CUDA (GPUs).
• Compilation: gcc 6.4.0 and CUDA 9.1 (POWER8) or CUDA 9.2 (POWER9).
• Binary: Native PPC64LE ELF64 binary. For GPU runs, CUDA source code is used and the NVIDIA GPU binary is embedded in the host binary.
• Data set: 111M-cell and 889M-cell meshes. A 13M-cell mesh was generated using icem-cfd, and mesh multiplication was applied to obtain these large meshes.
• Run-time environment: IBM Spectrum LSF (POWER8) and IBM CSM (POWER9).
• Hardware: POWER8 with P100 NVIDIA GPUs and POWER9 with V100 NVIDIA GPUs.
• Execution: OpenMPI mpirun launching (POWER8) and jsrun launching on top of IBM Spectrum MPI (POWER9).
• Output: Simulation wall time of the 3rd time step.
• Experiment workflow: Build the binary that implements the selected test case and execute it with the MPI launching wrapper.
• Publicly available?: Binary yes, source no (arrangements are being made to contribute it to the Code Saturne main repository).

2) How software can be obtained (if available): Code Saturne is open-source software. The CPU-only code can be obtained from github.com/code-saturne/code_saturne. We used a snapshot from September 2017 - e5a72a6fa. The GPU port was rebased on the same snapshot and its code is not upstreamed (and therefore not publicly available) as of yet. Nevertheless, the binary of the GPU port is available to users of the POWER8 Paragon system at the Hartree Centre, STFC, UK. It is made available as a beta LMOD module that can be conveniently used by any user with access to the system.

3) Hardware dependencies: The current GPU-port implementation requires a POWER8 or POWER9 machine with NVIDIA GPUs of CUDA compute capability 3.5 or higher. The vanilla Code Saturne implementation is known to work on other hardware, including IBM BlueGene/Q machines.

4) Software dependencies: Code Saturne is known to build successfully against different dependencies and different versions of these dependencies [1]. For the results presented in the poster, these are the software dependencies:
• POWER8 and NVIDIA GPU P100:
  – GNU compilers 6.4.0
  – OpenMPI 3.0.0
  – CUDA 9.1
• POWER9 and NVIDIA GPU V100:
  – GNU compilers 6.4.0
  – IBM Spectrum MPI 10.1
  – jsrun 1.1.0
  – CUDA 9.2

5) Datasets: The input data set consists mainly of the mesh of a cubic cavity that uses 111M or 889M tetrahedral cells for the POWER8 and POWER9 runs, respectively. The mesh was obtained using icem-cfd for a 13M-cell mesh, and mesh multiplication was then applied to obtain these large meshes.

C. Installation

As part of the GPU port work, we also updated the build system, so building with and without GPU offload support is straightforward - see Listing 1. To disable GPU offload and get a CPU-only implementation, one has to remove the line --enable-cuda-offload.

./configure \
  --disable-shared \
  --enable-static \
  --enable-openmp \
  --enable-cuda-offload \
  --enable-long-gnum \
  --host=ppc64le \
  --build=ppc64le \
  --without-modules \
  --disable-gui \
  --without-libxml2 \
  --without-hdf5 \
  --without-salome-kernel \
  --without-salome-gui \
  --prefix=<install prefix> \
  CC=mpicc CFLAGS="-g -O3" \
  CXX=mpic++ CXXFLAGS="-g -O3" \
  FC=mpifort FCFLAGS="-g -O3" && make install

Listing 1. Command to configure and build Code Saturne with CUDA offload support.

D. Experiment workflow

The first step of the experiment is to generate the case. In our setup, we provide a template case for the users - see Section A-F on how to customise it. The template contains a SRC folder with several files, as shown in Listing 2.

cs_user_boundary_conditions.f90
cs_user_parameters.c
cs_user_parameters.f90
cs_user_performance_tuning.c
cs_user_postprocess.c
mesh_input

Listing 2. Contents of the Code Saturne case source folder.

The file mesh_input is the input mesh and the remaining files specify customisations to the tool. These files are compiled together with Code Saturne to change its default behaviour. Running code_saturne compile in that folder compiles these files along with the Code Saturne implementation libraries and creates the binary cs_solver. This binary is the Code Saturne executable that performs the simulation.

Listings 3 and 4 show the content of the submission scripts of the GPU runs for POWER8 and POWER9, respectively. In both cases, the job is submitted by piping the script into the bsub command. In these listings there are two wrapper scripts: cs_mps and cs_ompi. The former starts 4 MPS servers per node (one for each GPU) and sets the GPU visibility so that each set of 5 consecutive ranks uses the same MPS server instance. The latter just sets an OpenMPI-compatible environment from the existing CSM environment prior to invoking cs_solver - see Listing 5 - CSM handles the MPS server itself.

#!/bin/bash
#BSUB -J <job name>
#BSUB -W 01:30                   # wall-clock time
#BSUB -q paragon                 # queue
#BSUB -eo errors.log             # error file name
#BSUB -oo output.log             # output file name
#BSUB -x                         # exclusive mode
#BSUB -n 512                     # number of tasks in job
#BSUB -R "span[ptile=16]"        # ranks per node
#BSUB -gpu "num=4:mode=shared"   # activate the 4 GPUs
#----------------------------------------------------
ulimit -s 10240
export OMP_NUM_THREADS=8

mpirun --map-by socket -bind-to core \
  ./cs_mps ./cs_solver

Listing 3. Submission script for the POWER8 system running IBM Spectrum LSF - 32-node run (16 ranks per node) - 4 GPUs per node.

#!/bin/bash
#BSUB -P <project>
#BSUB -J <job name>
#BSUB -W 00:30                   # wall-clock time
#BSUB -q batch                   # queue
#BSUB -eo errors.log             # error file name
#BSUB -oo output.log             # output file name
#BSUB -nnodes 512                # number of nodes
#BSUB -alloc_flags "gpumps smt4" # MPS and SMT=4
#----------------------------------------------------
ulimit -s 10240
export OMP_NUM_THREADS=4

jsrun --rs_per_host 6 --gpu_per_rs 1 \
  --tasks_per_rs 3 --cpu_per_rs 6 --nrs 3072 \
  -D CUDA_VISIBLE_DEVICES --smpiargs="-gpu" \
  -d plane:3 -b packed:2 ./cs_ompi ./cs_solver

Listing 4. Submission script for the POWER9 system running IBM CSM - 512-node run (18 ranks per node) - 6 GPUs per node.

export OMPI_COMM_WORLD_RANK=$JSM_NAMESPACE_RANK
export OMPI_COMM_WORLD_SIZE=$JSM_NAMESPACE_SIZE
export OMPI_COMM_WORLD_LOCAL_RANK=$JSM_NAMESPACE_LOCAL_RANK
export OMPI_COMM_WORLD_LOCAL_SIZE=$JSM_NAMESPACE_LOCAL_SIZE
$@

Listing 5. Contents of the cs_ompi wrapper.

E. Evaluation and expected result

Once the simulation completes, the time-step execution wall time is extracted from the generated timer_stats.csv file. We do not use the first and last time-step times because they also include initialisation and clean-up actions that are not relevant for the performance assessment.

For the GPU port, results are verified by comparing the content of the generated listing file with the CPU-only run. The different quantities (e.g. velocity, pressure) and the number of iterations must be similar - they will not be exactly the same, since the rounding error as well as the loop scheduling by threads is different.

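As an illustration of this extraction step only (not part of the artifact; the exact layout of timer_stats.csv, with one row per time step and the wall time in a given column, is an assumption), a small helper could pull out the wall time of the 3rd time step:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read per-time-step wall times from timer_stats.csv,
// assuming one row per time step with the wall time in the chosen column.
int main(int argc, char **argv) {
  const std::string path = argc > 1 ? argv[1] : "timer_stats.csv";
  const std::size_t col = argc > 2 ? std::stoul(argv[2]) : 1;

  std::ifstream in(path);
  if (!in) { std::cerr << "cannot open " << path << "\n"; return 1; }

  std::vector<double> wall_times;
  std::string line;
  std::getline(in, line);                      // skip the header row
  while (std::getline(in, line)) {
    std::stringstream ss(line);
    std::string field;
    for (std::size_t c = 0; std::getline(ss, field, ','); ++c)
      if (c == col) wall_times.push_back(std::stod(field));
  }

  // First and last time steps include initialisation/clean-up and are not
  // used for the performance assessment; report the 3rd time step instead.
  if (wall_times.size() < 3) { std::cerr << "not enough time steps\n"; return 1; }
  std::cout << "3rd time-step wall time: " << wall_times[2] << " s\n";
  return 0;
}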

F. Experiment customisation

Customisation happens at two levels: the application and the execution environment.

1) Application Customisation: Application customisation is performed by editing the files in Listing 2. We used the exact same files, with default values, for both CPU and GPU runs.

2) Execution Environment Customisation: Customisation here has to do with the number and distribution of ranks, and the threading in each rank, to get better performance. The application is OpenMP enabled, so we use all the threads in a core for all configurations, i.e. 8 threads for POWER8 and 4 threads for POWER9. For POWER8, we use 1 rank per core, which means we use 20 ranks per node. For GPU runs that stays the same, and each group of 5 ranks uses one GPU. For POWER9, we use 21 ranks per node, which is half the number of cores in the node, so we have a pair of cores for each rank; as the L3 cache is shared by each pair of cores, this resulted in better performance. For GPU runs we use only 18 ranks per node, so that it can be divided by 6, the number of GPUs in the system, and each group of 3 ranks shares the same GPU.

REFERENCES
[1] Code Saturne project website, https://www.code-saturne.org.