SIMD Abstraction and Interfacing with CUDA

GeantV Geometry: SIMD abstraction and interfacing with CUDA Johannes de Fine Licht (johannes.defi[email protected]) Sandro Wenzel ([email protected]) ! Thanks to Michal Husejko and Romain Wartel at TechLab for providing hardware. PH - SFT Johannes de Fine Licht (johannes.defi[email protected]) Abstracted algorithms Write common code for multiple types of SIMD “backends” (Vc/Cilk/CUDA). Eases code maintenance; requires only one kernel to be written for a given functionality Must retain performance of a low level implementation PH - SFT "2 Johannes de Fine Licht (johannes.defi[email protected]) Illustrating scalar/SIMD abstraction and kernels Single particle Vector External interface interface CUDA kernels C-like abstract kernels PH - SFT "3 Johannes de Fine Licht (johannes.defi[email protected]) Illustrating scalar/SIMD abstraction and kernels Single particle Vector External interface interface CUDA kernels Scalar Looper Cilk Plus Looper Vc Looper Thread access C-like abstract kernels PH - SFT "3 Johannes de Fine Licht (johannes.defi[email protected]) Illustrating scalar/SIMD abstraction and kernels Single particle Vector External interface interface CUDA kernels Scalar Looper Cilk Plus Looper Vc Looper Thread access Backend instantiation Scalar Backend Vc Backend Cilk Plus Backend CUDA Backend C-like abstract kernels PH - SFT "3 Johannes de Fine Licht (johannes.defi[email protected]) Illustrating scalar/SIMD abstraction and kernels Single particle Vector External interface interface CUDA kernels Scalar Looper Cilk Plus Looper Vc Looper Thread access Backend instantiation Scalar Backend Vc Backend Cilk Plus Backend CUDA Backend Kernel instantiation C-like abstract C-like specialized kernels kernels PH - SFT "3 Johannes de Fine Licht (johannes.defi[email protected]) Generic kernels on abstract types Generic programming; write algorithms in a generic way on abstract types ! Implementation depends on the backend ! Backend determines the types on which the high level operations are performed PH - SFT "4 Johannes de Fine Licht (johannes.defi[email protected]) Generic kernels on abstract types Generic programming; write algorithms in a generic way on abstract types ! Implementation depends on the backend ! Backend determines the types on which the high level operations are performed PH - SFT "4 Johannes de Fine Licht (johannes.defi[email protected]) Implementations wrapped in types Operations are abstracted Backends implement all operations Vc already supports arithmetics CUDA uses primitive types Wrapper structs are implemented if needed PH - SFT "5 Johannes de Fine Licht (johannes.defi[email protected]) Interfacing with CUDA Involving the GPU should be easy ! Principle of common code ! Issue: GPU code needs GNU/Clang: nvcc: nvcc, but we want a C++11 CUDA Vc vs. different compiler for CPU Cilk PH - SFT "6 Johannes de Fine Licht (johannes.defi[email protected]) Dual namespaces Utilize distinct namespaces but same code for CPU and GPU ! Classes live in one or both environments ! Provide abstraction to interface between environments namespace vecgeom Host memory ! namespace vecgeom_cuda vecgeom::CudaManager GPU memory namespace vecgeom ! vecgeom::UnplacedBox CUDA vecgeom_cuda::UnplacedBox vecgeom::LogicalVolume kernel vecgeom_cuda::LogicalVolume … … PH - SFT "7 Johannes de Fine Licht (johannes.defi[email protected]) Dual namespaces Utilize distinct namespaces but same code for CPU and GPU ! Classes live in one or both environments ! Provide abstraction to interface between environments namespace vecgeom Host memory Allocation ! namespace vecgeom_cuda vecgeom::CudaManager GPU memory namespace vecgeom ! vecgeom::UnplacedBox CUDA vecgeom_cuda::UnplacedBox vecgeom::LogicalVolume kernel vecgeom_cuda::LogicalVolume … … PH - SFT "7 Johannes de Fine Licht (johannes.defi[email protected]) Dual namespaces Utilize distinct namespaces but same code for CPU and GPU ! Classes live in one or both environments ! Provide abstraction to interface between environments namespace vecgeom Host memory Allocation ! namespace vecgeom_cuda Synchronize content vecgeom::CudaManager GPU memory namespace vecgeom ! vecgeom::UnplacedBox Launch kernel CUDA Instantiate vecgeom_cuda::UnplacedBox vecgeom::LogicalVolume kernel vecgeom_cuda::LogicalVolume … … PH - SFT "7 Johannes de Fine Licht (johannes.defi[email protected]) Compilation scheme Source files Source files .cpp .cu + Optional modules + CUDA interface (ROOT, benchmarking…) .cu Interface Compile for C++11 Compile with NVCC with vector backend header with CUDA backend namespace vecgeom namespace vecgeom_cuda .o Linking .o Executable PH - SFT "8 Johannes de Fine Licht (johannes.defi[email protected]) Status Common code for box methods runs for scalar, Vc, Cilk and CUDA backends Particle location in box geometry ! 4.5 GPU 4.0 GPU with overhead Implemented GPU memory CPU synchronization 3.5 3.0 ! 2.5 2.0 Dual namespaces allows dispatching Speedup work to CPU or GPU 1.5 1.0 ! 0.5 Location can run in either environment 0.0 Number of particles located ! Preliminary benchmarks run in hybrid CPU/GPU environment for four-level box geometry. Results are in agreement Location of particles is a tree algorithm, so speedups ! are only seen at very high particle multiplicity. PH - SFT "9 Johannes de Fine Licht (johannes.defi[email protected]) Appendix: CPU/GPU same interface Main executable compiled with C++ compiler Same code used CUDA kernel compiled with NVCC PH - SFT "10 Johannes de Fine Licht (johannes.defi[email protected]) Appendix: Common code PH - SFT "11 Johannes de Fine Licht (johannes.defi[email protected]) Appendix: Template hierarchy PH - SFT "12 Johannes de Fine Licht (johannes.defi[email protected]) Vector interface External CUDA kernels C-like abstract PH - SFT kernels "13 Johannes de Fine Licht (johannes.defi[email protected]) Vector interface Vc Scalar Backend Backend Cilk Plus Backend CUDA Backend External CUDA kernels C-like abstract PH - SFT kernels "13.

Load more