SYCL Building Blocks for C++ libraries
Maria Rovatsou, SYCL specification editor
Principal Software Engineer, Codeplay Software

© Copyright 2014 SYCL Building Blocks for C++ libraries

Overview
- SYCL standard
- What is the SYCL toolkit for?
- C++ libraries using SYCL: Under construction
- SYCL building blocks:
  - asynchronous execution on multiple SYCL devices
  - command groups for embedding computational kernels in a library
  - buffers and accessors for data storage and access
  - vector classes shared between host and device
  - pipes
  - hierarchical command group execution
  - printing on device using cl::sycl::stream

© Copyright Khronos Group 2014 SYCL standard

Khronos royalty-free open standard that works with OpenCL™

SYCL™ for OpenCL™ 1.2 is a published specification that can be found at:
https://www.khronos.org/registry/sycl/specs/sycl-1.2.pdf
SYCL™ for OpenCL™ 2.2 is a published provisional specification that the group would like feedback on:
https://www.khronos.org/registry/sycl/specs/sycl-2.2.pdf

What is the SYCL toolkit for?


- Performance: harness processing power in order to create much more powerful applications
- Programmability: use existing C++ frameworks on new targets, e.g. from HPC to mobile and embedded and vice versa
- Heterogeneous system: integrating a host multi-core CPU with GPUs and accelerators
- Power efficiency: best h/w architecture per application
- Longevity of software written for custom architectures

C++ libraries using SYCL building blocks: Under construction

C++ libraries using SYCL building blocks: VisionCpp

VisionCpp is a templated library with a domain-specific C++ dialect focused on providing a framework for efficient Machine Vision algorithms.

auto q = get_queue();
cv::VideoCapture cap(0);
cap >> frame;
std::shared_ptr ptr;
auto a = Node(frame.data);
auto b = Node(a);
auto c = Node(b);
auto d = Node(c);
auto e = Node(d);
auto f = Node(e);
auto g = Node(f);
auto h = Node(g);
for (;;) {
  auto fused_kernel = execute(h, q);
  ptr = fused_kernel.getData();
  cv::imshow("Colour Degradation", ooo);
  if (cv::waitKey(30) >= 0)
    break;
  cap >> frame; // get a new frame from camera
}

C++ libraries using SYCL building blocks: Eigen Tensor Library

Eigen is a C++ templated library for linear algebra. SYCL support for the Tensor module is in progress.

// use gpu
cl::sycl::gpu_selector s;
cl::sycl::queue q(s);
Eigen::SyclDevice sycl_device(q);
// create a tensor range
Eigen::array tensorRange = {{100, 100, 100}};
// create tensors
Eigen::Tensor in1(tensorRange);
Eigen::Tensor in2(tensorRange);
Eigen::Tensor in3(tensorRange);
Eigen::Tensor out(tensorRange);
// create random data for each tensor
in1 = in1.random();
in2 = in2.random();
in3 = in3.random();
// create a map view for each tensor
Eigen::TensorMap gpu_in1(in1.data(), tensorRange);
Eigen::TensorMap gpu_in2(in2.data(), tensorRange);
Eigen::TensorMap gpu_in3(in3.data(), tensorRange);
Eigen::TensorMap gpu_out(out.data(), tensorRange);
// assign operator invokes executor
gpu_out.device(sycl_device) = gpu_in1 * gpu_in2 + gpu_in3;

C++ libraries using SYCL building blocks: Parallel STL

Parallel STL adds parallel versions of the standard algorithms to the C++ standard library: each algorithm takes an execution policy that selects how, and with a SYCL policy where, it runs.

vector<int> data = { 8, 9, 1, 4 };
std::sort(sycl_policy, data.begin(), data.end());
if (std::is_sorted(data.begin(), data.end())) {
  cout << "Data is sorted!" << endl;
}

SYCL building blocks

SYCL Building Blocks: asynchronous execution on multiple SYCL devices

- asynchronous queue
- device_selector
- buffers and accessors
- command groups for atomically scheduling computational kernels on queues
- parallel_for functions for encapsulating kernels over an index space of work-items
- C++ lambdas or functors compiled for SYCL devices
- single-source offline compiler solutions for multiple host/device targets
- asynchronous error handling mechanism
- events
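The asynchronous error handling mechanism can be sketched as follows (a minimal SYCL 1.2-style sketch; the handler name and the trivial body are illustrative):

```cpp
#include <CL/sycl.hpp>
#include <iostream>

int main() {
  // An async_handler is a callable that receives a cl::sycl::exception_list
  // holding errors raised by asynchronously executed command groups.
  cl::sycl::async_handler handleAsyncErrors = [](cl::sycl::exception_list errors) {
    for (std::exception_ptr e : errors) {
      try {
        std::rethrow_exception(e);
      } catch (const cl::sycl::exception& ex) {
        std::cerr << "Asynchronous SYCL error: " << ex.what() << std::endl;
      }
    }
  };

  // A queue constructed with the handler reports errors from its command
  // groups when wait_and_throw() is called (or on queue destruction).
  cl::sycl::queue myQueue(cl::sycl::default_selector{}, handleAsyncErrors);
  // ... myQueue.submit(someCommandGroup); ...
  myQueue.wait_and_throw();
}
```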

SYCL Building Blocks: command groups for embedding computational kernels in a library

A command group is a C++ lambda or functor that encapsulates a kernel together with its access requirements. The data movement is managed by buffers and accessors.

auto myCommandGroup = ([&](cl::sycl::execution_handler& cgh) {
  auto acc = buffer.get_access(cgh);
  cgh.parallel_for(
      cl::sycl::nd_range<2>(range<2>(16, 16), range<2>(4, 4)),
      [=](cl::sycl::nd_item<2> idx) {
        group my_group = idx.get_group();
        int x = a_function();
        int sum = my_group.reduce(x);
        //..
      });
});

On submit, the kernel and its access requirements are scheduled for execution on an asynchronous queue.

queue myQueue;
auto myEvent = myQueue.submit(myCommandGroup);

SYCL Building Blocks: buffers and accessors for data storage and access

Data is managed across host and device using the buffer class.

buffer positionsBuf(positions.data(), range<1>(number_objects));
buffer velocitiesBuf(velocities.data(), range<1>(number_objects));

The buffer class takes ownership of the data. On destruction, the buffer waits for all usage of its data by the SYCL runtime to be completed.
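A minimal sketch of this RAII behaviour (the kernel name and sizes are illustrative; the point is that leaving the buffer's scope blocks until the runtime has finished with the data, after which the host vector is safe to read):

```cpp
#include <CL/sycl.hpp>
#include <vector>

int main() {
  std::vector<float> positions(1024, 0.0f);
  {
    // The buffer takes ownership of the host data for its lifetime.
    cl::sycl::buffer<float, 1> positionsBuf(
        positions.data(), cl::sycl::range<1>(positions.size()));
    cl::sycl::queue q;
    q.submit([&](cl::sycl::handler& cgh) {
      auto acc = positionsBuf.get_access<cl::sycl::access::mode::write>(cgh);
      cgh.parallel_for<class init_positions>(
          cl::sycl::range<1>(positions.size()),
          [=](cl::sycl::id<1> i) { acc[i] = 1.0f; });
    });
  } // Buffer destructor waits for the kernel and writes data back.

  // Safe here: all device work on the data has completed.
  float first = positions[0];
}
```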

SYCL Building Blocks: accessors for data storage and access

The accessor class gives access to a memory buffer within a command group. The SYCL runtime effectively constructs an asynchronous execution graph that schedules all the command groups and their memory transfers.

auto cg = [&](handler &ch) {
  auto posAcc = positionsBuf.get_access(ch);
  auto velAcc = velocitiesBuf.get_access(ch);
  auto newPosAcc = newposBuf.get_access(ch);
  ch.parallel_for(
      nd_range<1>(range<1>(number_objects), range<1>(256)),
      ([=](nd_item<1> item_) {
        unsigned int i = item_.get_global_linear_id();
        // Computation and update of velocities[ ... ]
        // Compute the new positions
        newPosAcc[i][0] = posAcc[i][0] + dtCube * velAcc[i][0] + 0.5f * a[0];
        newPosAcc[i][1] = posAcc[i][1] + dtCube * velAcc[i][1] + 0.5f * a[1];
        newPosAcc[i][2] = posAcc[i][2] + dtCube * velAcc[i][2] + 0.5f * a[2];
      }));
};

Host accessors force synchronisation of all copies of the buffer on devices with the host.
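By contrast, a host accessor can be sketched as (a minimal sketch against the buffer above; constructing it blocks until all scheduled device work on the buffer has completed and its data is visible on the host):

```cpp
// Requesting access with no command group handler yields a host accessor;
// its constructor synchronises all device copies of the buffer with the host.
auto hostAcc = newposBuf.get_access<cl::sycl::access::mode::read>();
float x = hostAcc[0][0];  // Safe to read on the host here.
```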

SYCL Building Blocks: SVM allocation for data storage and access

svm_allocator provides allocation capabilities for SYCL 2.2 devices with fine-grained SVM capabilities.

cl::sycl::svm_allocator fineGrainAllocator(fineGrainedContext);
/* Use the instance of the svm_allocator with any custom container, for
 * example, with std::shared_ptr.
 */
std::shared_ptr svmFineBuffer =
    std::allocate_shared(fineGrainAllocator, numElements);
/* Initialize the pointer on the host */
*svmFineBuffer = 66;

Registered usage of the SVM allocated pointer:

q.submit([&](cl::sycl::execution_handler& cgh) {
  auto rawSvmFinePointer = svmFineBuffer.get();
  cgh.register_access(rawSvmFinePointer);
  cgh.parallel_for(
      range<1>(1), [=](id<1> index) { rawSvmFinePointer[index]++; });
});

Direct usage of the SVM allocated fine-grained pointer on the host:

int result = *svmFineBuffer;

SYCL Building Blocks: SIMD vector classes available on host and device

The vec class can be used on host and device.

for (int i = 0; i < 64; i++) {
  dataA[i] = float4(2.0f, 1.0f, 1.0f, static_cast<float>(i));
  dataB[i] = float3(0.0f, 0.0f, 0.0f);
}

Swizzles are supported as functions, as well as all the common operators for vectors.

myQueue.submit([&](handler& cgh) {
  auto ptrA = bufA.get_access(cgh);
  auto ptrB = bufB.get_access(cgh);

  cgh.parallel_for(range<3>(4, 4, 4), [=](item<3> item) {
    float4 in = ptrA[item.get_linear_id()];
    float w = in.w();
    float3 swizzle = in.xyz();
    float3 trans = swizzle * w;
    ptrB[item.get_linear_id()] = trans;
  });
});

SYCL Building Blocks: pipes

Pipes are available in SYCL 2.2 for OpenCL 2.2 devices. A pipe is a memory object with a sequential ordering of data that cannot be accessed from the host.

cl::sycl::static_pipe p;

// Launch the producer to stream A to the pipe
q.submit([&](cl::sycl::execution_handler &cgh) {
  // Get write access to the pipe
  auto kp = p.get_access(cgh);
  // Get read access to the data
  auto ka = ba.get_access(cgh);

  cgh.single_task([=] {
    for (int i = 0; i != N; i++)
      // Try to write to the pipe until success
      while (!(kp.write(ka[i])));
  });
});

The static_pipe is a pipe with constexpr capacity that is defined for only one target device. This type of pipe can be useful for compile-time graphs and pipelines.
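The consumer side of the pipe above can be sketched symmetrically (a sketch under the same assumptions; the output buffer bb is hypothetical):

```cpp
// Launch the consumer to read the stream back from the pipe.
q.submit([&](cl::sycl::execution_handler &cgh) {
  // Get read access to the pipe.
  auto kp = p.get_access<cl::sycl::access::mode::read>(cgh);
  // Get write access to the output data.
  auto kb = bb.get_access<cl::sycl::access::mode::write>(cgh);

  cgh.single_task([=] {
    for (int i = 0; i != N; i++) {
      // Try to read from the pipe until a value is available.
      float value;
      while (!(kp.read(value)));
      kb[i] = value;
    }
  });
});
```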

SYCL Building Blocks: hierarchical command group kernels

parallel_for_work_group(): in this scope the code is executed once per work-group and the memory is allocated in the local address space. In a parallel_for_work_item scope the code is executed once per work-item and the memory is allocated in private memory. In SYCL 2.2, there is also a parallel_for_sub_group, which executes once per vector of work-items.

auto command_group = [&](handler & cgh) {
  // Issue 8 work-groups of 8 work-items each
  cgh.parallel_for_work_group(
      range<3>(2, 2, 2), range<3>(2, 2, 2), [=](group<3> myGroup) {
        int myLocal;
        private_memory myPrivate(myGroup);
        parallel_for_work_item(myGroup, [=](item<3> myItem) {
          //[work-item code]
          myPrivate(myItem) = 0;
        });
        parallel_for_work_item(myGroup, [=](item<3> myItem) {
          //[work-item code]
          output[myGroup.get_local_range() * myGroup.get() + myItem] =
              myPrivate(myItem);
        });
        //[workgroup code]
      });
};

SYCL Building Blocks: basic stream class

The stream object provides basic support for printing from the device, for debugging or notification purposes.

myQueue.submit([&](handler& cgh) {
  /* We create a stream object to print from the device. */
  stream os(1024, 80, cgh);
  cgh.single_task([=]() {
    /* We use the stream operator on the stream object we created above to
     * print to stdout from the device. */
    os << "hello World!" << endl;
  });
});

SYCL Building Blocks for C++ libraries: Questions?

SYCL building blocks:
- asynchronous execution on multiple SYCL devices
- command groups for embedding computational kernels in a library
- buffers and accessors for data storage and access
- vector classes shared between host and device
- pipes
- hierarchical command group execution
- printing on device using cl::sycl::stream

• SYCL spec and forums: www.khronos.org/opencl/sycl

• New website coming! sycl.tech

© Copyright Khronos Group 2016