
GPU Acceleration at Scale with OpenPOWER Platforms in Code_Saturne

Samuel Antao¹, Charles Moulinec², Yvan Fournier³, Robert Sawko¹, Malgorzata Zimon¹, Christopher Thompson¹, Alex Skillen², Juan Uribe⁴, David R. Emerson²
¹IBM Research UK, ²SCD, STFC Daresbury Laboratory UK, ³EDF R&D FR, ⁴EDF R&D Centre UK

Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA on OpenPOWER platforms.

What is Code_Saturne?
• www.code-saturne.org
• Open-source computational fluid dynamics (CFD) software package.
• Developed by EDF R&D and tailored to tackle extra-large problems, e.g. whole thermo-nuclear reactors.
• OpenMP + MPI; C/C++, Fortran and Python.
• Solvers: Gauss-Seidel (GS) for velocity and scalars; Algebraic Multigrid (AMG) for pressure.
• Typical 3D compute-time distribution: many components (kernels) contribute to the total execution time.

[Figure 1: OpenMP 4.5 profiling overview of the AMG solver. Single-thread profiling, Code_Saturne 5.0+. Main components: Gauss-Seidel solver (velocity), pressure GS (AMG), matrix-vector mult. (MSR), matrix-vector mult. (CSR), dot products, multigrid setup, compute coarse cells from fine cells, other AMG-related, other.]

What are the challenges?
• Sparse-matrix solvers cannot take full advantage of GPU resources tailored for dense computation.
• Domain boundaries: intra- and inter-node communication overheads.
• Scattered set of compute kernels/bottlenecks, with data dependencies between consecutive kernels.
• Some kernels have lower compute intensity; it can still be worthwhile to compute them on the GPU if the data is already there, so there are opportunities to keep data in the device between kernels.

How do we tackle them?
• Mitigate the latencies of launching kernels back-to-back: pack the arguments of the various kernels (SpMV, dot products, etc.) into a single data structure, and use a single templated device entry function to run application kernels back to back.
• Leverage template pack expansion to wrap multiple kernels in a single CUDA kernel (see the listing below).
• Leverage the NVLINK(TM) fast interconnect to shift bottlenecks from data movement towards compute.
• Manage data environments: control the scope of data in the device. Data environments are created without major code disruption, using our own device memory pool that records information about each environment; within an environment, data stays in the device (see the sketch after this list).
• Combine with the NVIDIA Multi-Process Service (MPS) and dual buffering: improve the overlap of data movement and compute within and across ranks, and reduce device memory allocation/movement overheads.

[Diagram: kernels of different kinds (Kind A-D) are wrapped into a single CUDA kernel; their packed arguments are unpacked on the device and dispatched to specialised implementations; multiple ranks share the GPU through MPS.]
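The poster does not show the memory pool itself. The sketch below illustrates the idea under assumptions of ours: the class name cs_device_pool and its map/unmap interface are hypothetical, not the port's actual API. The point is that arrays mapped inside a data environment stay resident on the GPU, so back-to-back kernels reuse the device copy instead of re-allocating and re-transferring it.

#include <cstddef>
#include <map>
#include <cuda_runtime.h>

// Hypothetical device memory pool: tracks which host arrays are mapped in
// the current data environment (host pointer -> device copy).
class cs_device_pool {
  std::map<const void *, void *> mapped_;
public:
  // Entering a data environment: map a host array once; subsequent calls
  // find the device copy already there and trigger no allocation/transfer.
  void *map(const void *host, std::size_t bytes, bool copy_in) {
    auto it = mapped_.find(host);
    if (it != mapped_.end())
      return it->second;               // data already resident on the device
    void *dev = nullptr;
    cudaMalloc(&dev, bytes);
    if (copy_in)
      cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    mapped_[host] = dev;
    return dev;
  }
  // Leaving the environment: optionally copy results back, then release.
  void unmap(void *host, std::size_t bytes, bool copy_out) {
    auto it = mapped_.find(host);
    if (it == mapped_.end())
      return;
    if (copy_out)
      cudaMemcpy(host, it->second, bytes, cudaMemcpyDeviceToHost);
    cudaFree(it->second);
    mapped_.erase(it);
  }
};

A real pool would additionally recycle freed blocks rather than calling cudaFree eagerly, which is what the point about reducing device memory allocation overheads refers to.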
Listing: templated kernel dispatch (any_kernel) and pack expansion over a series of kernels (any_kernels).

template<KernelKinds Kind>
__device__ int any_kernel(KernelArgsBase &Arg, unsigned n_rows_per_block) {
  switch (Kind) {
  // Matrix-vector multiplication (MSR format):
  case MV_MSR:
    matrix_vector_multiplication<Kind>(
        /* dev_row_index */ Arg.getArg<cs_lnum_t *>(0),
        /* dev_col_id */ Arg.getArg<cs_lnum_t *>(1),
        /* dev_val */ Arg.getArg<cs_real_t *>(2),
        /* dev_d_val */ Arg.getArg<cs_real_t *>(3),
        /* dev_x */ Arg.getArg<cs_real_t *>(4),
        /* dev_y */ Arg.getArg<cs_real_t *>(5),
        /* n_rows */ Arg.getArg<cs_lnum_t>(6),
        /* n_cols */ Arg.getArg<cs_lnum_t>(7),
        /* n_rows_per_block */ n_rows_per_block);
    break;
  // ...
  case DP_xx:
    dot_product<Kind>(
        /* version */ Arg.getArg<cs_lnum_t>(0),
        /* n_rows */ Arg.getArg<cs_lnum_t>(1),
        /* x */ Arg.getArg<cs_real_t *>(2),
        /* y */ nullptr,
        /* z */ nullptr,
        /* res */ Arg.getArg<cs_real_t *>(3),
        /* n_rows_per_block */ n_rows_per_block);
    break;
  // ...
  }
  __syncthreads();
  return 0;
}

template<KernelKinds... Kinds>
__global__ void any_kernels(void) {
  auto *KA = reinterpret_cast<KernelArgsSeries *>(&KernelArgsSeriesGPU[0]);
  const unsigned n_rows_per_block = KA->RowsPerBlock;
  unsigned idx = 0;

  int dummy[] = { any_kernel<Kinds>(KA->Args[idx++], n_rows_per_block)... };
  (void) dummy;
}
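The host side of this wrapper is not shown on the poster. As a sketch only, the support types below (64-bit argument slots in KernelArgsBase, a KernelArgsSeries pack, a constant-memory buffer) are simplified stand-ins for the port's real definitions; they show how a single launch can drive an SpMV followed by a dot product.

#include <cuda_runtime.h>

enum KernelKinds { MV_MSR, DP_xx /* ... */ };

// Simplified stand-ins: every argument, pointer or integer, travels in a
// 64-bit slot and is cast back to its real type on the device.
struct KernelArgsBase {
  unsigned long long Slot[8];
  template <typename T>
  __host__ __device__ T getArg(int i) { return (T)Slot[i]; }
  void setArg(int i, unsigned long long v) { Slot[i] = v; }
};

struct KernelArgsSeries {
  unsigned RowsPerBlock;
  KernelArgsBase Args[8];
};

__constant__ char KernelArgsSeriesGPU[sizeof(KernelArgsSeries)];

// any_kernel / any_kernels from the listing above go here.

// Host side: pack the arguments of an SpMV and a dot product once, copy the
// pack to constant memory, and launch both kernels fused into one.
void launch_spmv_then_dot(KernelArgsSeries &KA, int n_blocks) {
  cudaMemcpyToSymbol(KernelArgsSeriesGPU, &KA, sizeof(KA));
  any_kernels<MV_MSR, DP_xx><<<n_blocks, 128>>>();
}

The design point is that one <<<...>>> launch replaces a sequence of launches, so the launch latency is paid once per fused series rather than once per kernel.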
Main accomplishments
• A library providing GPU acceleration at scale for Code_Saturne running on OpenPOWER machines using CUDA.
• 2.9x speedup over the CPU-only simulation on 256 nodes of the POWER9 machine, the ORNL Summit supercomputer.
• Enables multi-billion-cell unstructured-mesh simulations.

Results
CPU+GPU speedup over CPU-only, and efficiency (strong scaling):

[Figure: CPU+GPU speedup over CPU-only and strong-scaling efficiency vs. node count. (a) Gauss-Seidel solver, POWER8 + P100 GPUs, 111M-cell mesh, 1-32 nodes: speedup grows from 2.00x to 2.42x. (b) POWER9 + V100 GPUs, 889M-cell mesh, 64-512 nodes: speedups between 2.31x and 2.90x.]

• Compute and data movement overlap almost perfectly for the GS algorithm.
• Demonstrated good scaling characteristics: a consistent speedup above 2x with only ~100K cells per rank.
• NVLINK 2.0, available on POWER9 machines, enables better strong scaling than the previous generation: better efficiency when using 16x more resources for a problem that is only 8x larger.

Acknowledgements
POWER9 scaling experiments were only possible with the Summit Early Access programme sponsored by the Oak Ridge Leadership Computing Facility. Work partially funded by UKTC consortium grants EP/L000261/1 and EP/R029326/1.

More information: [email protected]

3.4 CUDA porting

We implemented a CUDA port providing the same functionality as the OpenMP 4.5 port, while addressing the limitations listed in Section 3.3. We implemented the port in a separate folder using C++ and exposed a C interface so that it could be invoked from within Code_Saturne's existing code.

3.4.1 Kernels' implementation

We implemented four templated kernels using the CUDA language: Gauss-Seidel, SpMV, dot products and stream operations. The different variations of SpMV (MSR, CSR, with/without diagonal) are controlled with template arguments. Similarly, we use template arguments to reuse code for the different flavours of dot product and other stream operations. For the Gauss-Seidel kernel, we use a template argument to control whether the kernel is combined with the computation of a dot product. Using templated functions improves the code's scalability without increasing the complexity of control flow during execution. Listings 6 and 7 present the implementation of the SpMV kernel and the stream operations.

[Figure 3: Single-rank timeline details for the CUDA port using a 1.5M-cell mesh. Panels: Gauss-Seidel solver; AMG solver, finer mesh; AMG solver, coarser mesh; AMG cycle overview; coarse-mesh iteration detail. Annotated hotspots: 8.3% conjugate gradient (includes MSR and CSR SpMVs); 7.0% compute coarse cells from fine cells.]

We observe that the number of iterations for the Gauss-Seidel solver increases from 64 to 76 for the GPU-accelerated version. This solver relies on a data race during the execution.
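Listings 6 and 7 are not part of this excerpt. For reference, here is a minimal sketch of an SpMV over the MSR format, where the diagonal is stored separately from the CSR-like off-diagonal arrays (matching the dev_d_val argument of the wrapper listing above). It assigns one thread per row; the port's actual kernel processes n_rows_per_block rows per block and is more elaborate.

typedef int cs_lnum_t;     // assumed Code_Saturne local index type
typedef double cs_real_t;  // assumed Code_Saturne real type

// y = A*x with A in MSR format: CSR-style arrays (row_index, col_id, val)
// hold the off-diagonal entries, and d_val holds the diagonal.
__global__ void spmv_msr(const cs_lnum_t *row_index, const cs_lnum_t *col_id,
                         const cs_real_t *val, const cs_real_t *d_val,
                         const cs_real_t *x, cs_real_t *y, cs_lnum_t n_rows) {
  cs_lnum_t row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= n_rows)
    return;
  cs_real_t sum = d_val[row] * x[row];            // diagonal contribution
  for (cs_lnum_t j = row_index[row]; j < row_index[row + 1]; j++)
    sum += val[j] * x[col_id[j]];                 // off-diagonal contributions
  y[row] = sum;
}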
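The compile-time flavour selection described above can likewise be sketched as follows. The DotKind enum and the atomic reduction are illustrative simplifications of ours (a production kernel reduces through shared memory; atomicAdd on double needs compute capability 6.0+, which P100/V100 provide).

typedef int cs_lnum_t;     // assumed local index type, as above
typedef double cs_real_t;  // assumed real type, as above

enum DotKind { DOT_XX, DOT_XY };  // illustrative flavour selector

// The branch on Kind is resolved at compile time, so each instantiation
// keeps a single control flow, in the spirit of Section 3.4.1.
template <DotKind Kind>
__global__ void dot_product_sketch(const cs_real_t *x, const cs_real_t *y,
                                   cs_real_t *res, cs_lnum_t n_rows) {
  cs_lnum_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n_rows)
    return;
  cs_real_t v = (Kind == DOT_XX) ? x[i] * x[i] : x[i] * y[i];
  atomicAdd(res, v);  // simplified reduction; requires sm_60+ for double
}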