GeantV Geometry: SIMD abstraction and interfacing with CUDA
Johannes de Fine Licht (johannes.defi[email protected]) Sandro Wenzel ([email protected]) ! Thanks to Michal Husejko and Romain Wartel at TechLab for providing hardware.
PH - SFT Johannes de Fine Licht (johannes.defi[email protected]) Abstracted algorithms
Write common code for multiple types of SIMD “backends” (Vc/Cilk/CUDA).
Eases code maintenance; requires only one kernel to be written for a given functionality
Must retain performance of a low level implementation
PH - SFT 2 Johannes de Fine Licht (johannes.defi[email protected]) Illustrating scalar/SIMD abstraction and kernels Single particle Vector External interface interface CUDA kernels
C-like abstract kernels PH - SFT 3 Johannes de Fine Licht (johannes.defi[email protected]) Illustrating scalar/SIMD abstraction and kernels Single particle Vector External interface interface CUDA kernels
Scalar Looper Cilk Plus Looper
Vc Looper Thread access
C-like abstract kernels PH - SFT 3 Johannes de Fine Licht (johannes.defi[email protected]) Illustrating scalar/SIMD abstraction and kernels Single particle Vector External interface interface CUDA kernels
Scalar Looper Cilk Plus Looper
Vc Looper Thread access
Backend instantiation
Scalar Backend Vc Backend Cilk Plus Backend CUDA Backend
C-like abstract kernels PH - SFT 3 Johannes de Fine Licht (johannes.defi[email protected]) Illustrating scalar/SIMD abstraction and kernels Single particle Vector External interface interface CUDA kernels
Scalar Looper Cilk Plus Looper
Vc Looper Thread access
Backend instantiation
Scalar Backend Vc Backend Cilk Plus Backend CUDA Backend
Kernel instantiation
C-like abstract C-like specialized kernels kernels PH - SFT 3 Johannes de Fine Licht (johannes.defi[email protected]) Generic kernels on abstract types
Generic programming; write algorithms in a generic way on abstract types ! Implementation depends on the backend ! Backend determines the types on which the high level operations are performed
PH - SFT 4 Johannes de Fine Licht (johannes.defi[email protected]) Generic kernels on abstract types
Generic programming; write algorithms in a generic way on abstract types ! Implementation depends on the backend ! Backend determines the types on which the high level operations are performed
PH - SFT 4 Johannes de Fine Licht (johannes.defi[email protected]) Implementations wrapped in types
Operations are abstracted
Backends implement all operations
Vc already supports arithmetics
CUDA uses primitive types
Wrapper structs are implemented if needed
PH - SFT 5 Johannes de Fine Licht (johannes.defi[email protected]) Interfacing with CUDA
Involving the GPU should be easy ! Principle of common code ! Issue: GPU code needs GNU/Clang: nvcc: nvcc, but we want a C++11 CUDA Vc vs. different compiler for CPU Cilk
PH - SFT 6 Johannes de Fine Licht (johannes.defi[email protected]) Dual namespaces
Utilize distinct namespaces but same code for CPU and GPU ! Classes live in one or both environments ! Provide abstraction to interface between environments
namespace vecgeom Host memory ! namespace vecgeom_cuda vecgeom::CudaManager GPU memory namespace vecgeom ! vecgeom::UnplacedBox CUDA vecgeom_cuda::UnplacedBox vecgeom::LogicalVolume kernel vecgeom_cuda::LogicalVolume … …
PH - SFT 7 Johannes de Fine Licht (johannes.defi[email protected]) Dual namespaces
Utilize distinct namespaces but same code for CPU and GPU ! Classes live in one or both environments ! Provide abstraction to interface between environments
namespace vecgeom Host memory Allocation ! namespace vecgeom_cuda vecgeom::CudaManager GPU memory namespace vecgeom ! vecgeom::UnplacedBox CUDA vecgeom_cuda::UnplacedBox vecgeom::LogicalVolume kernel vecgeom_cuda::LogicalVolume … …
PH - SFT 7 Johannes de Fine Licht (johannes.defi[email protected]) Dual namespaces
Utilize distinct namespaces but same code for CPU and GPU ! Classes live in one or both environments ! Provide abstraction to interface between environments
namespace vecgeom Host memory Allocation ! namespace vecgeom_cuda Synchronize content vecgeom::CudaManager GPU memory namespace vecgeom ! vecgeom::UnplacedBox Launch kernel CUDA Instantiate vecgeom_cuda::UnplacedBox vecgeom::LogicalVolume kernel vecgeom_cuda::LogicalVolume … …
PH - SFT 7 Johannes de Fine Licht (johannes.defi[email protected]) Compilation scheme
Source files Source files .cpp .cu + Optional modules + CUDA interface (ROOT, benchmarking…) .cu Interface Compile for C++11 Compile with NVCC with vector backend header with CUDA backend
namespace vecgeom namespace vecgeom_cuda .o Linking .o
Executable PH - SFT 8 Johannes de Fine Licht (johannes.defi[email protected]) Status
Common code for box methods runs for scalar, Vc, Cilk and CUDA backends
Particle location in box geometry ! 4.5 GPU 4.0 GPU with overhead Implemented GPU memory CPU synchronization 3.5 3.0
! 2.5
2.0 Dual namespaces allows dispatching Speedup work to CPU or GPU 1.5 1.0
! 0.5 Location can run in either environment 0.0 Number of particles located ! Preliminary benchmarks run in hybrid CPU/GPU environment for four-level box geometry. Results are in agreement Location of particles is a tree algorithm, so speedups ! are only seen at very high particle multiplicity.
PH - SFT 9 Johannes de Fine Licht (johannes.defi[email protected]) Appendix: CPU/GPU same interface
Main executable compiled with C++ compiler Same code used
CUDA kernel compiled with NVCC
PH - SFT 10 Johannes de Fine Licht (johannes.defi[email protected]) Appendix: Common code
PH - SFT 11 Johannes de Fine Licht (johannes.defi[email protected]) Appendix: Template hierarchy
PH - SFT 12 Johannes de Fine Licht (johannes.defi[email protected]) Vector interface
External CUDA kernels
C-like abstract PH - SFT kernels 13 Johannes de Fine Licht (johannes.defi[email protected]) Vector interface
Vc Scalar Backend Backend Cilk Plus Backend
CUDA Backend External CUDA kernels
C-like abstract PH - SFT kernels 13