SYCL Based Seed Finding in Acts
Attila Krasznahorkay, with material from Angéla Czirkós

Seed Finding

● Seed finding is one of the early steps of charged particle (track) reconstruction
  ○ It takes 3D measurement points from the detector (space points) and forms triplets out of them that are compatible with a charged particle coming from the “interaction region” of the detector
● In the end it is a big combinatorial problem
  ○ We sort all space points into 3 categories (“bottom”, “middle” and “top”) based on their 3D positions, and select the triplets that fulfil all our requirements
● Even with GPUs, however, we need to be smart about how we do this
  ○ In “high activity events” we see >100k space points in our detector. Blindly trying all triplet combinations of these would take very long to evaluate.
  ○ So we split our detector into regions in which we form seeds separately, and we use “clever” algorithms for forming the triplets, not just blind loops over all combinations
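To give a feeling for why splitting the detector into regions matters, here is a minimal C++ sketch that only counts triplet candidates. It is not the Acts implementation: the SpacePoint struct, the choice of slicing in phi only, and all function names are assumptions made for this illustration, and real seed finding applies geometric cuts on top of any such split.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical space point type for this sketch (not the Acts class).
struct SpacePoint {
  float x, y, z;
};

// Candidates if every point is blindly tried in each of the three roles
// (bottom, middle, top).
std::uint64_t bruteForceCandidates(std::uint64_t nPoints) {
  return nPoints * nPoints * nPoints;
}

// Count the space points falling into each of nBins equal slices in phi.
// Slicing in phi only is an assumption of this sketch.
std::vector<std::uint64_t> countPerPhiBin(const std::vector<SpacePoint>& points,
                                          std::size_t nBins) {
  assert(nBins > 0);
  const double pi = std::acos(-1.0);
  std::vector<std::uint64_t> counts(nBins, 0);
  for (const SpacePoint& sp : points) {
    const double phi = std::atan2(static_cast<double>(sp.y),
                                  static_cast<double>(sp.x));  // in [-pi, pi]
    const double scaled = (phi + pi) / (2.0 * pi) * static_cast<double>(nBins);
    long bin = static_cast<long>(std::floor(scaled));
    // Clamp so that rounding at phi == -pi or phi == +pi stays in range.
    if (bin < 0) bin = 0;
    if (bin >= static_cast<long>(nBins)) bin = static_cast<long>(nBins) - 1;
    ++counts[static_cast<std::size_t>(bin)];
  }
  return counts;
}

// Candidates if triplets are only formed inside each region separately.
std::uint64_t binnedCandidates(const std::vector<std::uint64_t>& counts) {
  std::uint64_t total = 0;
  for (std::uint64_t n : counts) total += n * n * n;
  return total;
}
```

As a back-of-the-envelope use of these counts: 100 000 points tried blindly give 10^15 candidates, while the same points spread evenly over 100 independent regions give 100 × 1000^3 = 10^11.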

Seed Finding

● You can find a better description of the whole thing at: https://acts.readthedocs.io/en/latest/core/seeding.html

Seed Finding Using SYCL

● As part of her Summer Studentship with CERN’s OpenLab, Angéla implemented code that performs the seed finding using SYCL
  ○ https://github.com/acts-project/acts/tree/master/Plugins/Sycl
● She gave a presentation about this work a little while ago: https://indico.cern.ch/event/948829/

SYCL Seed Finding Performance

● Angéla posted some very nice updates to her code just today, including updated performance figures (https://github.com/acts-project/acts/pull/466)

Our SYCL Experience

● Instead of talking in detail about that code, I would rather just talk about our experience with writing it…
● We used Intel’s oneAPI compiler for the code development
  ○ Both their oneAPI binary distribution for using their own hardware, and our own builds from the https://github.com/intel/llvm repository to make use of our GPU
  ○ Since the NVidia GPU support in https://github.com/intel/llvm comes from Codeplay, we did not try to use Codeplay’s own compiler on this code yet
    ■ Basically, we just didn’t find the time to do that yet…
● In the following I’ll just highlight a couple of things that I think could be worth discussing

Memory Management

● cl::sycl::buffer objects are not really appropriate for some of the things that we do in this code
  ○ The algorithm needs to use relatively large chunks of memory on just the GPU, without having to return all that information to the host
  ○ With CUDA we do this using basic cudaMalloc(...) / cudaMemcpy(...) calls. Intel’s implementation provides USM (Unified Shared Memory) to do something similar. The SYCL standard itself seems to provide “sub-buffers” for copying just parts of buffers back and forth between the GPU and the host, but there’s no support for allocating device-only memory using cl::sycl::buffer. (And sub-buffers don’t really work in Intel’s compiler at the moment…)

Memory Management

● In the current SYCL implementation, the sub-optimal memory management seems to be the main cause of the performance difference with respect to the CUDA implementation.

Virtual Functions, Function Pointers

● The CPU implementation of the seed finding allows clients to implement their own filtering of seed candidates by asking them to implement “filter functions” themselves. (By creating a class that implements a pure virtual interface.)
● With CUDA it is possible, even if not at all conveniently, to let users implement their filtering in their own CUDA functions and pass those to Acts’s code, so that the Acts seed finding kernels use the user’s functions internally.
● With SYCL there is absolutely no way at the moment to receive functions “from the outside” that should be used in the accelerated code. ☹
  ○ Of course if we expose all the SYCL code through templates, it can be done. But Acts is meant to hide how exactly the seed finding is done. The user’s code should not have to be designed around whether it’s a CPU or a GPU implementation that’s running.
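The trade-off between the two styles can be sketched in plain host C++. This is neither SYCL code nor the Acts API: SeedCandidate, ISeedFilter, MinWeightFilter and both counting functions are hypothetical names made up for this illustration. The first function takes the user's filter through a pure virtual interface, which SYCL 1.2.1 does not allow inside device code (it forbids virtual calls and function pointers in kernels). The second takes it as a template parameter, which would be usable in a kernel but forces the library code into public headers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical seed candidate, reduced to a single quality figure.
struct SeedCandidate {
  float weight;
};

// CPU style: clients derive from a pure virtual interface.
class ISeedFilter {
 public:
  virtual ~ISeedFilter() = default;
  virtual bool accept(const SeedCandidate& seed) const = 0;
};

// Example of a user-provided filter.
class MinWeightFilter : public ISeedFilter {
 public:
  explicit MinWeightFilter(float minWeight) : m_minWeight(minWeight) {}
  bool accept(const SeedCandidate& seed) const override {
    return seed.weight >= m_minWeight;
  }

 private:
  float m_minWeight;
};

// The library only sees the interface; the call is resolved at run time,
// so this function can be compiled once and hidden in a library.
std::size_t countAcceptedVirtual(const std::vector<SeedCandidate>& seeds,
                                 const ISeedFilter& filter) {
  std::size_t n = 0;
  for (const SeedCandidate& s : seeds) {
    if (filter.accept(s)) ++n;
  }
  return n;
}

// Template style: the filter type is known at compile time, but the function
// body now has to be visible to (and compiled with) the user's code.
template <typename Filter>
std::size_t countAcceptedTemplated(const std::vector<SeedCandidate>& seeds,
                                   const Filter& filter) {
  std::size_t n = 0;
  for (const SeedCandidate& s : seeds) {
    if (filter(s)) ++n;
  }
  return n;
}
```

With the templated version a client could pass a lambda such as `[](const SeedCandidate& s) { return s.weight >= 1.0f; }`, but only by compiling the library's code together with its own, which is exactly the coupling that Acts wants to avoid.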

Documentation

● At the moment the only source of documentation about the SYCL standard is basically: https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf
  ○ But we very often ended up reading through the compiler’s headers to figure out how something actually works
● Improvements to the documentation overall would definitely be very welcome…

Summary

● The SYCL language standard has a number of things going for it
  ○ The ability to fall back to running the same code on the host if no GPU is available sounds very good. Although creating algorithms that are equally performant on a CPU and a GPU with the same code design may just be a dream.
  ○ Being able to mix CPU / GPU code in the same files/functions fairly seamlessly results in much less code bloat than what we see with CUDA
● However, it will only be used seriously if it can deliver performance very similar to what you can get with CUDA
  ○ The portability, while very nice in theory, is unlikely to let us avoid writing separate CPU / GPU implementations of our performance-critical algorithms
● Do not take this as negative feedback though! We in ATLAS are definitely very interested in SYCL.
