GPU Acceleration at Scale with OpenPower Platforms in Code_Saturne

Samuel Antao (1), Charles Moulinec (2), Yvan Fournier (3), Robert Sawko (1), Malgorzata Zimon (1), Christopher Thompson (1), Alex Skillen (2), Juan Uribe (4), David R. Emerson (2)

(1) IBM Research, (2) SCD, STFC Daresbury Laboratory, UK, (3) EDF R&D, FR, (4) EDF R&D Centre, UK

Task-based GPU acceleration in CFD with OpenMP 4.5 and CUDA on OpenPOWER platforms.

Main accomplishments

• A library providing GPU acceleration at scale for Code_Saturne running on top of OpenPower machines using CUDA.
• Managed data environments that control the scope of data in the device.
• A single templated device entry function to run application kernels back to back and mitigate kernel-launching latencies.
• 2.9x speedup over a CPU-only simulation on a 256-node POWER9 machine, the ORNL Summit supercomputer.
• Enables multi-billion-cell unstructured-mesh simulations.

What is Code_Saturne?

• www.code-saturne.org
• Open-source computational fluid dynamics (CFD) software package.
• Developed by EDF R&D and tailored to tackle extra-large problems, e.g. whole thermo-nuclear reactors.
• OpenMP + MPI; C/C++, Fortran and Python.
• Typical 3D compute time is dominated by Gauss-Seidel (GS) for velocity and scalars and by Algebraic Multigrid (AMG) for pressure.

What are the challenges?

• Sparse-matrix solvers cannot take full advantage of GPUs, which are tailored for dense computation.
• Domain boundaries introduce intra- and inter-node communication overheads.
• The bottlenecks are spread across a scattered set of compute kernels.

Execution time distribution:
[Chart: single-thread profiling of Code_Saturne 5.0+. Contributions include the Gauss-Seidel solver (velocity), pressure GS (AMG), matrix-vector multiplication (MSR and CSR), dot products, multigrid setup, computing coarse cells from fine cells, other AMG-related work, and other.]
• Many components (kernels) contribute to the total execution time.
• There are data dependencies between consecutive kernels.
• There are opportunities to keep data in the device between kernels.
• Some kernels may have lower compute intensity, yet it can still be worthwhile to compute them on the GPU if the data is already there.

How do we tackle them?

• Mitigate the latency of launching kernels (SpMV, dot products, etc.) back to back:
  – Pack the arguments of the various kernels into a single data structure.
  – Leverage pack expansion to wrap multiple kernels in a single CUDA kernel; specialised implementations unpack their arguments inside the device code.
• Leverage the NVLINK fast interconnect to shift bottlenecks from data movement towards compute.
• Create data environments without major code disruption:
  – Use our own device memory pool and record information about data environments.
  – Within an environment, data stays in the device.
  – Reduce device memory allocation/movement overheads.
• Run multiple ranks per GPU with the NVIDIA Multi-Process Service (MPS) and dual-buffering techniques: improve the overlap of data movements and compute within the same rank and across different ranks.

The single templated device entry function that wraps the kernels into one CUDA kernel is shown below: each kernel kind unpacks its arguments from the packed structure, and pack expansion chains the specialised implementations back to back.

template <KernelKinds Kind>
__device__ int any_kernel(KernelArgsBase &Arg, unsigned n_rows_per_block) {
  switch (Kind) {
  // Matrix-vector multiplication (MSR format):
  case MV_MSR:
    matrix_vector_multiplication(
        /* dev_row_index */ Arg.getArg(0),
        /* dev_col_id    */ Arg.getArg(1),
        /* dev_val       */ Arg.getArg(2),
        /* dev_d_val     */ Arg.getArg(3),
        /* dev_x         */ Arg.getArg(4),
        /* dev_y         */ Arg.getArg(5),
        /* n_rows        */ Arg.getArg(6),
        /* n_cols        */ Arg.getArg(7),
        /* n_rows_per_block */ n_rows_per_block);
    break;
  // ...
  case DP_xx:
    dot_product(
        /* version */ Arg.getArg(0),
        /* n_rows  */ Arg.getArg(1),
        /* x       */ Arg.getArg(2),
        /* y       */ nullptr,
        /* z       */ nullptr,
        /* res     */ Arg.getArg(3),
        /* n_rows_per_block */ n_rows_per_block);
    break;
  // ...
  }
  __syncthreads();
  return 0;
}

template <KernelKinds... Kinds>
__global__ void any_kernels(void) {
  auto *KA = reinterpret_cast<KernelArgsSeries *>(&KernelArgsSeriesGPU[0]);
  const unsigned n_rows_per_block = KA->RowsPerBlock;
  unsigned idx = 0;
  int dummy[] = { any_kernel<Kinds>(KA->Args[idx++], n_rows_per_block)... };
  (void) dummy;
}

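As a usage illustration only (the host-side interface of the port is not shown in the listing above), chaining an MSR SpMV and a dot product then costs a single kernel launch. The wrapper function and launch geometry below are placeholders, and it is assumed the packed argument block has already been copied into the device-side KernelArgsSeriesGPU buffer:

// Hypothetical host-side sketch: fuse an MSR SpMV and a dot product into
// one launch of any_kernels so only one kernel-launch latency is paid.
// Assumes KernelArgsSeriesGPU has already been filled by the host.
void run_fused_step(unsigned n_blocks, unsigned threads_per_block) {
  any_kernels<MV_MSR, DP_xx><<<n_blocks, threads_per_block>>>();
}
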

Results

[Charts: CPU+GPU speedup over CPU-only execution and strong-scaling efficiency. (a) POWER8 + P100 GPUs, NVLINK 1.0, 111M-cell mesh, 1 to 32 nodes. (b) POWER9 + V100 GPUs, NVLINK 2.0, 889M-cell mesh, 64 to 512 nodes.]

• Compute and data movements overlap almost perfectly for the GS algorithm.
• Demonstrated good scaling characteristics: a consistent speedup above 2x with only ~100K cells per rank.
• NVLINK 2.0, available in POWER9 machines, enables better strong scaling than the previous generation: better efficiency when using 16x more resources for a problem that is only 8x larger.

Acknowledgements

POWER9 scaling experiments were only possible with the Summit Early Access programme sponsored by the Oak Ridge Leadership Computing Facility. This work was partially funded by UKTC consortium grants EP/L000261/1 and EP/R029326/1.

More information: [email protected]

[Figure 1: OpenMP 4.5 profiling overview of the AMG solver. (a) AMG cycle overview. (b) Coarse mesh iteration detail. Labelled contributions include 8.3% for the conjugate gradient (including MSR and CSR SpMVs) and 7.0% for computing coarse cells from fine cells.]

3.4 CUDA porting

We implemented a CUDA port providing the same functionality as the OpenMP 4.5 port, while addressing the limitations listed in Section 3.3. We implemented the port in a separate folder using C++ and exposed a C interface so that it can be invoked from the existing Code Saturne code.

3.4.1 Kernels' implementation

We implemented four templated kernels using the CUDA language: Gauss-Seidel, SpMV, dot products and stream operations. The different variations of SpMV (MSR, CSR, with/without diagonal) are controlled with template arguments. Similarly, we use template arguments to reuse code for the different flavours of dot product and other stream operations. For the Gauss-Seidel kernel, we use a template argument to control whether the kernel is combined with the computation of a dot product. Using templated functions improves the code scalability without increasing the complexity of the control flow during execution. Listings 6 and 7 present the implementations of the SpMV kernel and the stream operations.

The implementation of SpMV (the Gauss-Seidel kernels are very similar) follows the OpenMP implementation except for the reductions, where we can benefit from the GPU double-precision shuffle instructions, which deliver better performance than communicating through the scratchpad memory. The implementation of the dot products also benefits from the shuffle operation to perform reductions across the warp.

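A rough sketch of this warp-level reduction pattern is given below. It is illustrative only, not the actual Code_Saturne kernel: warp_reduce_sum and dot_product_sketch are hypothetical names, and the grid-stride loop and atomic accumulation are assumptions about how partial results would be combined.

#include <cuda_runtime.h>

// Minimal sketch: reduce one double per thread across a warp using
// __shfl_down_sync, avoiding a round trip through shared (scratchpad) memory.
__device__ double warp_reduce_sum(double val) {
  // 0xffffffff: all 32 lanes of the warp participate.
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffffu, val, offset);
  return val;  // lane 0 holds the warp-wide sum
}

// Illustrative partial dot product: each thread accumulates its elements,
// each warp reduces in registers, and lane 0 adds to the global result.
__global__ void dot_product_sketch(const double *x, const double *y,
                                   double *res, int n) {
  double sum = 0.0;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x)
    sum += x[i] * y[i];
  sum = warp_reduce_sum(sum);
  if ((threadIdx.x & (warpSize - 1)) == 0)
    atomicAdd(res, sum);  // double atomicAdd needs compute capability 6.0+
}

The shuffle keeps the partial sums in registers for the whole warp, which is what avoids the shared-memory round trip mentioned above.
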
We observe that the number of iterations for the Gauss-Seidel solver increases from 64 to 76 for the GPU-accelerated version. This solver relies on a data race during the execution of the solver loop, in the sense that the input vector is also the output vector. It is therefore possible that the output of one iteration is used as the input of another, depending on how the different threads are scheduled. This data sharing between iterations helps the solver converge in fewer iterations.

For the GPU version of the code, the number of active threads in the execution of the loop is two or more orders of magnitude higher than in the CPU version (only up to 8 active threads). This makes the propagation of data between iterations less likely to happen, so the solver takes longer to converge, as it behaves more like a Jacobi solver than a Gauss-Seidel solver.

[Figure 3: Single-rank timeline details for the CUDA port using a 1.5M-cell mesh. (a) Gauss-Seidel solver. (b) AMG solver, finer mesh. (c) AMG solver, coarser mesh.]

Figure 3 shows the detail of the GPU execution timeline for the Gauss-Seidel solver and for the coarser and finer levels of the AMG solver. We observe that the Gauss-Seidel and the AMG finer-mesh solvers make better use of the GPU. The AMG finer-mesh solver still accounts for a significant amount of idle GPU time. A methodology to reduce the amount of idle time is discussed in Section 4.

[Figure 6: Details for the GPU execution of 5 ranks bound to a single GPU. (a) Gauss-Seidel solver. (b) AMG solver, finer mesh. (c) AMG solver, coarser mesh.]

3.4.2 Memory management

We implemented a complete infrastructure to manage the GPU storage and its mapping to host variables. During the program initialisation phase, the implementation detects the number of ranks using a given GPU and chunks all the available GPU memory by rank. Each rank then manages its own chunk, without any expensive memory allocation/deallocation, making sure that all variables are 256-byte aligned (the GPU cache block size). When a given variable needs to be allocated in GPU memory, the implementation finds the minimum slot that fits the requested size in the already allocated memory. The map between the host address and the device address is kept in a list so that it can be retrieved later. As the number of variables mapped to the GPU at a given point is small, managing this list introduces negligible overhead.

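A simplified sketch of such a pool is given below. It is not the actual Code_Saturne infrastructure, and all names (DevicePool, map_variable, unmap_variable) are illustrative; it only shows the ingredients described above: a single per-rank chunk, 256-byte-aligned slots, reuse of the smallest free slot that fits, and a small host-to-device mapping list.

#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Illustrative per-rank GPU memory pool: one large chunk is allocated at
// start-up, variables get 256-byte-aligned slots, and a small list maps
// host addresses to device addresses.
class DevicePool {
  struct Slot {
    size_t offset, bytes;   // position and aligned size inside the chunk
    const void *host;       // host variable currently mapped (nullptr = free)
  };
  char *base_ = nullptr;
  size_t capacity_ = 0, used_ = 0;
  std::vector<Slot> slots_;

  static size_t align256(size_t n) { return (n + 255) & ~size_t(255); }

public:
  explicit DevicePool(size_t bytes_per_rank) : capacity_(bytes_per_rank) {
    // Single allocation for this rank's chunk; no cudaMalloc later on.
    cudaMalloc(reinterpret_cast<void **>(&base_), capacity_);
  }
  ~DevicePool() { cudaFree(base_); }

  // Return the device address for a host variable, allocating a slot on
  // first use. Already-mapped variables are looked up in the (small) list.
  void *map_variable(const void *host_ptr, size_t bytes) {
    const size_t need = align256(bytes);
    Slot *best = nullptr;
    for (auto &s : slots_) {
      if (s.host == host_ptr) return base_ + s.offset;      // already mapped
      if (!s.host && s.bytes >= need && (!best || s.bytes < best->bytes))
        best = &s;                                           // smallest free fit
    }
    if (!best) {                        // no free slot fits: carve a new one
      if (used_ + need > capacity_) return nullptr;          // pool exhausted
      slots_.push_back({used_, need, nullptr});
      used_ += need;
      best = &slots_.back();
    }
    best->host = host_ptr;
    return base_ + best->offset;
  }

  // Release a variable's slot so a later mapping can reuse it.
  void unmap_variable(const void *host_ptr) {
    for (auto &s : slots_)
      if (s.host == host_ptr) { s.host = nullptr; return; }
  }
};

Because the chunk is allocated once at initialisation, mapping a variable inside the solver loop never calls cudaMalloc, which is the point of the scheme.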





APPENDIX A
ARTIFACT DESCRIPTION APPENDIX: GPU ACCELERATION AT SCALE WITH OPENPOWER PLATFORMS IN Code Saturne

A. Abstract

This description contains the information on how the results presented in the SC18 poster "GPU Acceleration at Scale with OpenPower platforms in Code Saturne" were obtained and how they can be reproduced. We present details on the dependencies, the Code Saturne building process, the case configuration, and job submission.

B. Description

1) Check-list (artifact meta information):
• Algorithm: Velocity-pressure coupling with Gauss-Seidel (velocity) and Algebraic Multigrid with Conjugate Gradient at each level (pressure).
• Program: C/C++, Fortran and Python source. For GPU runs, CUDA source code is used. Solver implemented in C99 (host) and CUDA (GPUs).
• Compilation: gcc 6.4.0 and CUDA 9.1 (POWER8) or CUDA 9.2 (POWER9).
• Binary: Native PPC64LE ELF64 binary. For GPU runs, CUDA source code is used and the NVIDIA GPU binary is embedded in the host binary.
• Data set: 111M-cell and 889M-cell meshes. A 13M-cell mesh was generated using icem-cfd, and mesh multiplication was applied to obtain these large meshes.
• Run-time environment: IBM Spectrum LSF (POWER8) and IBM CSM (POWER9).
• Hardware: POWER8 with P100 NVIDIA GPUs and POWER9 with V100 NVIDIA GPUs.
• Execution: OpenMPI mpirun launching (POWER8) and jsrun launching on top of IBM Spectrum MPI (POWER9).
• Output: Simulation wall time of the 3rd time step.
• Experiment workflow: Build the binary that implements the selected test case and execute it with the MPI launching wrapper.
• Publicly available?: Binary yes, source no (arrangements are being made to contribute it to the Code Saturne main repository).

2) How software can be obtained (if available): Code Saturne is open-source software. The CPU-only code can be obtained from github.com/code-saturne/code_saturne. We used a snapshot from September 2017 - e5a72a6fa. The GPU port was rebased on the same snapshot and its code is not upstreamed (and therefore not publicly available) as of yet. Nevertheless, the binary of the GPU port is available to users of the POWER8 Paragon system at the Hartree Centre, STFC, UK. It is made available as a beta LMOD module that can be conveniently used by any user with access to the system.

3) Hardware dependencies: The current GPU-port implementation requires a POWER8 or POWER9 machine with NVIDIA GPUs of CUDA compute capability 3.5 or higher. The vanilla Code Saturne implementation is known to work on other hardware, including IBM BlueGene/Q machines.

4) Software dependencies: Code Saturne is known to build successfully against different dependencies and different versions of these dependencies [1]. For the results presented in the poster, these are the software dependencies:
• POWER8 and NVIDIA GPU P100:
  – GNU compilers 6.4.0
  – OpenMPI 3.0.0
  – CUDA 9.1
• POWER9 and NVIDIA GPU V100:
  – GNU compilers 6.4.0
  – IBM Spectrum MPI 10.1
  – jsrun 1.1.0
  – CUDA 9.2

5) Datasets: The input data set consists mainly of the mesh of a cubic cavity that uses 111M or 889M tetrahedral cells for the POWER8 and POWER9 runs, respectively. The mesh was obtained using icem-cfd for a 13M-cell mesh, and mesh multiplication was then applied to obtain these large meshes.

C. Installation

As part of the GPU port work, we also updated the build system, so building with and without GPU offload support is straightforward - see Listing 1. To disable GPU offload and get a CPU-only implementation, one has to remove the line --enable-cuda-offload.

./configure \
  --disable-shared \
  --enable-static \
  --enable-openmp \
  --enable-cuda-offload \
  --enable-long-gnum \
  --host=ppc64le \
  --build=ppc64le \
  --without-modules \
  --disable-gui \
  --without-libxml2 \
  --without-hdf5 \
  --without-salome-kernel \
  --without-salome-gui \
  --prefix=<install prefix> \
  CC=mpicc CFLAGS="-g -O3" \
  CXX=mpic++ CXXFLAGS="-g -O3" \
  FC=mpifort FCFLAGS="-g -O3" && make install

Listing 1. Command to configure and build Code Saturne with CUDA offload support.

D. Experiment workflow

The first step of the experiment is to generate the case. In our setup, we provide a template case for the users - see Section A-F on how to customise it. The template contains a SRC folder with several files, as shown in Listing 2.

cs_user_boundary_conditions.f90
cs_user_parameters.c
cs_user_parameters.f90
cs_user_performance_tuning.c
cs_user_postprocess.c
mesh_input

Listing 2. Contents of the Code Saturne case source folder.

The file mesh_input is the input mesh and the remaining files specify customisations to the tool. These files are compiled together with Code Saturne to change its default behaviour. Running code_saturne compile in that folder compiles these files along with the Code Saturne implementation libraries and creates the binary cs_solver. This binary is the Code Saturne executable that performs the simulation.

Listings 3 and 4 show the content of the submission scripts of the GPU runs for POWER8 and POWER9, respectively. In both cases, the job is submitted by piping the script into the bsub command. In these listings there are two wrapper scripts: cs_mps and cs_ompi. The former starts 4 MPS servers per node (one for each GPU) and sets the GPU visibility so that each set of 5 consecutive ranks uses the same MPS server instance. The latter just sets an OpenMPI-compatible environment from the existing CSM environment prior to invoking cs_solver - see Listing 5 - CSM handles the MPS server itself.

#!/bin/bash
#BSUB -J <job name>
#BSUB -W 01:30                   # wall-clock time
#BSUB -q paragon                 # queue
#BSUB -eo errors.log             # error file name
#BSUB -oo output.log             # output file name
#BSUB -x                         # exclusive mode
#BSUB -n 512                     # number of tasks in job
#BSUB -R "span[ptile=16]"        # ranks per node
#BSUB -gpu "num=4:mode=shared"   # activate the 4 GPUs
#----------------------------------------------------
ulimit -s 10240
export OMP_NUM_THREADS=8

mpirun --map-by socket -bind-to core \
  ./cs_mps ./cs_solver

Listing 3. Submission script for the POWER8 system running IBM Spectrum LSF - 32-node run (16 ranks per node) - 4 GPUs per node.

#!/bin/bash
#BSUB -P <project>
#BSUB -J <job name>
#BSUB -W 00:30                   # wall-clock time
#BSUB -q batch                   # queue
#BSUB -eo errors.log             # error file name
#BSUB -oo output.log             # output file name
#BSUB -nnodes 512                # number of nodes
#BSUB -alloc_flags "gpumps smt4" # MPS and SMT=4
#----------------------------------------------------
ulimit -s 10240
export OMP_NUM_THREADS=4

jsrun --rs_per_host 6 --gpu_per_rs 1 \
  --tasks_per_rs 3 --cpu_per_rs 6 --nrs 3072 \
  -D CUDA_VISIBLE_DEVICES --smpiargs="-gpu" \
  -d plane:3 -b packed:2 ./cs_ompi ./cs_solver

Listing 4. Submission script for the POWER9 system running IBM CSM - 512-node run (18 ranks per node) - 6 GPUs per node.

export OMPI_COMM_WORLD_RANK=$JSM_NAMESPACE_RANK
export OMPI_COMM_WORLD_SIZE=$JSM_NAMESPACE_SIZE
export OMPI_COMM_WORLD_LOCAL_RANK=$JSM_NAMESPACE_LOCAL_RANK
export OMPI_COMM_WORLD_LOCAL_SIZE=$JSM_NAMESPACE_LOCAL_SIZE
$@

Listing 5. Contents of the cs_ompi wrapper.

E. Evaluation and expected result

Once the simulation completes, the time-step execution wall time is extracted from the generated timer_stats.csv file. We do not use the first and last time-step times because they also include initialisation and clean-up actions that are not relevant for the performance assessment.

For the GPU port, results are verified by comparing the content of the generated listing file with the CPU-only run. The different quantities (e.g. velocity, pressure) and the number of iterations must be similar - they will not be exactly the same, since the rounding error as well as the loop scheduling by threads is different.

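As an illustration of this extraction step only (not part of the artifact; the exact layout of timer_stats.csv, with one row per time step and the wall time in a given column, is an assumption), a small helper could pull out the wall time of the 3rd time step:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read per-time-step wall times from timer_stats.csv,
// assuming one row per time step with the wall time in the chosen column.
int main(int argc, char **argv) {
  const std::string path = argc > 1 ? argv[1] : "timer_stats.csv";
  const std::size_t col = argc > 2 ? std::stoul(argv[2]) : 1;

  std::ifstream in(path);
  if (!in) { std::cerr << "cannot open " << path << "\n"; return 1; }

  std::vector<double> wall_times;
  std::string line;
  std::getline(in, line);                      // skip the header row
  while (std::getline(in, line)) {
    std::stringstream ss(line);
    std::string field;
    for (std::size_t c = 0; std::getline(ss, field, ','); ++c)
      if (c == col) wall_times.push_back(std::stod(field));
  }

  // First and last time steps include initialisation/clean-up and are not
  // used for the performance assessment; report the 3rd time step instead.
  if (wall_times.size() < 3) { std::cerr << "not enough time steps\n"; return 1; }
  std::cout << "3rd time-step wall time: " << wall_times[2] << " s\n";
  return 0;
}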

F. Experiment customisation

Customisation happens at two levels: the application and the execution environment.

1) Application Customisation: Application customisation is performed by editing the files in Listing 2. We used the exact same files, with default values, for both CPU and GPU runs.

2) Execution Environment Customisation: Customisation here has to do with the number and distribution of ranks, and the threading in each rank, to get better performance. The application is OpenMP enabled, so we use all the threads in a core for all configurations, i.e. 8 threads for POWER8 and 4 threads for POWER9. For POWER8, we use 1 rank per core, which means we use 20 ranks per node. For GPU runs that stays the same, and each group of 5 ranks uses one GPU. For POWER9, we use 21 ranks per node, which is half the number of cores in the node, so we have a pair of cores for each rank; as the L3 cache is shared by each pair of cores, this resulted in better performance. For GPU runs we use only 18 ranks per node, so that it can be divided by 6, the number of GPUs in the system, and each group of 3 ranks shares the same GPU.

REFERENCES
[1] Code Saturne project website, https://www.code-saturne.org.