GeantV Geometry: SIMD abstraction and interfacing with CUDA

Johannes de Fine Licht (johannes.defi[email protected]) Sandro Wenzel ([email protected]) ! Thanks to Michal Husejko and Romain Wartel at TechLab for providing hardware.

PH - SFT Johannes de Fine Licht (johannes.defi[email protected]) Abstracted algorithms

Write common code for multiple types of SIMD “backends” (Vc//CUDA).

Eases code maintenance; requires only one kernel to be written for a given functionality

Must retain performance of a low level implementation

PH - SFT 2 Johannes de Fine Licht (johannes.defi[email protected]) Illustrating scalar/SIMD abstraction and kernels Single particle Vector External interface interface CUDA kernels

C-like abstract kernels PH - SFT 3 Johannes de Fine Licht (johannes.defi[email protected]) Illustrating scalar/SIMD abstraction and kernels Single particle Vector External interface interface CUDA kernels

Scalar Looper Cilk Plus Looper

Vc Looper access

C-like abstract kernels PH - SFT 3 Johannes de Fine Licht (johannes.defi[email protected]) Illustrating scalar/SIMD abstraction and kernels Single particle Vector External interface interface CUDA kernels

Scalar Looper Cilk Plus Looper

Vc Looper Thread access

Backend instantiation

Scalar Backend Vc Backend Cilk Plus Backend CUDA Backend

C-like abstract kernels PH - SFT 3 Johannes de Fine Licht (johannes.defi[email protected]) Illustrating scalar/SIMD abstraction and kernels Single particle Vector External interface interface CUDA kernels

Scalar Looper Cilk Plus Looper

Vc Looper Thread access

Backend instantiation

Scalar Backend Vc Backend Cilk Plus Backend CUDA Backend

Kernel instantiation

C-like abstract C-like specialized kernels kernels PH - SFT 3 Johannes de Fine Licht (johannes.defi[email protected]) Generic kernels on abstract types

Generic programming; write algorithms in a generic way on abstract types ! Implementation depends on the backend ! Backend determines the types on which the high level operations are performed

PH - SFT 4 Johannes de Fine Licht (johannes.defi[email protected]) Generic kernels on abstract types

Generic programming; write algorithms in a generic way on abstract types ! Implementation depends on the backend ! Backend determines the types on which the high level operations are performed

PH - SFT 4 Johannes de Fine Licht (johannes.defi[email protected]) Implementations wrapped in types

Operations are abstracted

Backends implement all operations

Vc already supports arithmetics

CUDA uses primitive types

Wrapper structs are implemented if needed

PH - SFT 5 Johannes de Fine Licht (johannes.defi[email protected]) Interfacing with CUDA

Involving the GPU should be easy ! Principle of common code ! Issue: GPU code needs GNU/: nvcc: nvcc, but we want a C++11 CUDA Vc vs. different compiler for CPU Cilk

PH - SFT 6 Johannes de Fine Licht (johannes.defi[email protected]) Dual namespaces

Utilize distinct namespaces but same code for CPU and GPU ! Classes live in one or both environments ! Provide abstraction to interface between environments

namespace vecgeom Host memory ! namespace vecgeom_cuda vecgeom::CudaManager GPU memory namespace vecgeom ! vecgeom::UnplacedBox CUDA vecgeom_cuda::UnplacedBox vecgeom::LogicalVolume kernel vecgeom_cuda::LogicalVolume … …

PH - SFT 7 Johannes de Fine Licht (johannes.defi[email protected]) Dual namespaces

Utilize distinct namespaces but same code for CPU and GPU ! Classes live in one or both environments ! Provide abstraction to interface between environments

namespace vecgeom Host memory Allocation ! namespace vecgeom_cuda vecgeom::CudaManager GPU memory namespace vecgeom ! vecgeom::UnplacedBox CUDA vecgeom_cuda::UnplacedBox vecgeom::LogicalVolume kernel vecgeom_cuda::LogicalVolume … …

PH - SFT 7 Johannes de Fine Licht (johannes.defi[email protected]) Dual namespaces

Utilize distinct namespaces but same code for CPU and GPU ! Classes live in one or both environments ! Provide abstraction to interface between environments

namespace vecgeom Host memory Allocation ! namespace vecgeom_cuda Synchronize content vecgeom::CudaManager GPU memory namespace vecgeom ! vecgeom::UnplacedBox Launch kernel CUDA Instantiate vecgeom_cuda::UnplacedBox vecgeom::LogicalVolume kernel vecgeom_cuda::LogicalVolume … …

PH - SFT 7 Johannes de Fine Licht (johannes.defi[email protected]) Compilation scheme

Source files Source files .cpp .cu + Optional modules + CUDA interface (ROOT, benchmarking…) .cu Interface Compile for C++11 Compile with NVCC with vector backend header with CUDA backend

namespace vecgeom namespace vecgeom_cuda .o Linking .o

Executable PH - SFT 8 Johannes de Fine Licht (johannes.defi[email protected]) Status

Common code for box methods runs for scalar, Vc, Cilk and CUDA backends

Particle location in box geometry ! 4.5 GPU 4.0 GPU with overhead Implemented GPU memory CPU synchronization 3.5 3.0

! 2.5

2.0 Dual namespaces allows dispatching work to CPU or GPU 1.5 1.0

! 0.5 Location can run in either environment 0.0 Number of particles located ! Preliminary benchmarks run in hybrid CPU/GPU environment for four-level box geometry. Results are in agreement Location of particles is a tree algorithm, so speedups ! are only seen at very high particle multiplicity.

PH - SFT 9 Johannes de Fine Licht (johannes.defi[email protected]) Appendix: CPU/GPU same interface

Main executable compiled with C++ compiler Same code used

CUDA kernel compiled with NVCC

PH - SFT 10 Johannes de Fine Licht (johannes.defi[email protected]) Appendix: Common code

PH - SFT 11 Johannes de Fine Licht (johannes.defi[email protected]) Appendix: Template hierarchy

PH - SFT 12 Johannes de Fine Licht (johannes.defi[email protected]) Vector interface

External CUDA kernels

C-like abstract PH - SFT kernels 13 Johannes de Fine Licht (johannes.defi[email protected]) Vector interface

Vc Scalar Backend Backend Cilk Plus Backend

CUDA Backend External CUDA kernels

C-like abstract PH - SFT kernels 13