SYCL Building Blocks for C++ libraries
Maria Rovatsou, SYCL specification editor
Principal Software Engineer, Codeplay Software

© Copyright 2014 SYCL Building Blocks for C++ libraries

Overview
- SYCL standard
- What is the SYCL toolkit for?
- C++ libraries using SYCL: Under construction
- SYCL building blocks:
  - asynchronous execution on multiple SYCL devices
  - command groups for embedding computational kernels in a library
  - buffers and accessors for data storage and access
  - vector classes shared between host and device
  - pipes
  - hierarchical command group execution
  - printing on device using cl::sycl::stream

© Copyright Khronos Group 2014 SYCL standard

Khronos royalty-free open standard that works with OpenCL™

SYCL™ for OpenCL™ 1.2 is a published specification that can be found at:
https://www.khronos.org/registry/sycl/specs/sycl-1.2.pdf
SYCL™ for OpenCL™ 2.2 is a published provisional specification that the group would like feedback on:
https://www.khronos.org/registry/sycl/specs/sycl-2.2.pdf

What is the SYCL toolkit for?


- Performance: harness processing power in order to create much more powerful applications
- Programmability: use existing C++ frameworks on new targets, e.g. from HPC to mobile and embedded and vice versa
- Heterogeneous system: integrating a host multi-core CPU with GPUs and accelerators
- Power efficiency: best h/w architecture per application
- Longevity of software written for custom architectures

C++ libraries using SYCL building blocks: Under construction

C++ libraries using SYCL building blocks: VisionCpp

VisionCpp is a templated library with a domain-specific C++ dialect focused on providing a framework for efficient Machine Vision algorithms.

auto q = get_queue();
cv::VideoCapture cap(0);
cap >> frame;
std::shared_ptr ptr;
auto a = Node(frame.data);
auto b = Node(a);
auto c = Node(b);
auto d = Node(c);
auto e = Node(d);
auto f = Node(e);
auto g = Node(f);
auto h = Node(g);
for (;;) {
  auto fused_kernel = execute(h, q);
  ptr = fused_kernel.getData();
  cv::imshow("Colour Degradation", ooo);
  if (cv::waitKey(30) >= 0)
    break;
  cap >> frame; // get a new frame from camera
}

C++ libraries using SYCL building blocks: Eigen Tensor Library

Eigen is a C++ templated library for linear algebra. SYCL support for the Tensor module is in progress.

// use gpu
cl::sycl::gpu_selector s;
cl::sycl::queue q(s);
Eigen::SyclDevice sycl_device(q);
// create a tensor range
Eigen::array tensorRange = {{100, 100, 100}};
// create tensors
Eigen::Tensor in1(tensorRange);
Eigen::Tensor in2(tensorRange);
Eigen::Tensor in3(tensorRange);
Eigen::Tensor out(tensorRange);
// create random data for each tensor
in1 = in1.random();
in2 = in2.random();
in3 = in3.random();
// create a map view for each tensor
Eigen::TensorMap gpu_in1(in1.data(), tensorRange);
Eigen::TensorMap gpu_in2(in2.data(), tensorRange);
Eigen::TensorMap gpu_in3(in3.data(), tensorRange);
Eigen::TensorMap gpu_out(out.data(), tensorRange);
// assign operator invokes executor
gpu_out.device(sycl_device) = gpu_in1 * gpu_in2 + gpu_in3;

C++ libraries using SYCL building blocks: Parallel STL

Parallel STL adds parallel versions of the standard algorithms to the C++ standard library: each algorithm takes an execution policy that selects how, and with a SYCL policy where, it runs.

vector<int> data = { 8, 9, 1, 4 };
std::sort(sycl_policy, data.begin(), data.end());
if (std::is_sorted(data.begin(), data.end())) {
  cout << "Data is sorted!" << endl;
}

SYCL building blocks

SYCL Building Blocks: asynchronous execution on multiple SYCL devices

- asynchronous queue
- device_selector
- buffers and accessors
- command groups for atomically scheduling computational kernels on queues
- parallel_for functions for encapsulating kernels over an index space of work-items
- C++ lambdas or functors compiled for SYCL devices
- single-source offline compiler solutions for multiple host/device targets
- asynchronous error handling mechanism
- events
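The asynchronous error handling mechanism can be sketched as follows (a minimal SYCL 1.2-style sketch; the handler name and the trivial body are illustrative):

```cpp
#include <CL/sycl.hpp>
#include <iostream>

int main() {
  // An async_handler is a callable that receives a cl::sycl::exception_list
  // holding errors raised by asynchronously executed command groups.
  cl::sycl::async_handler handleAsyncErrors = [](cl::sycl::exception_list errors) {
    for (std::exception_ptr e : errors) {
      try {
        std::rethrow_exception(e);
      } catch (const cl::sycl::exception& ex) {
        std::cerr << "Asynchronous SYCL error: " << ex.what() << std::endl;
      }
    }
  };

  // A queue constructed with the handler reports errors from its command
  // groups when wait_and_throw() is called (or on queue destruction).
  cl::sycl::queue myQueue(cl::sycl::default_selector{}, handleAsyncErrors);
  // ... myQueue.submit(someCommandGroup); ...
  myQueue.wait_and_throw();
}
```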

SYCL Building Blocks: command groups for embedding computational kernels in a library

A command group is a C++ lambda or functor that encapsulates a kernel together with its access requirements. The data movement is managed by buffers and accessors.

auto myCommandGroup = ([&](cl::sycl::execution_handler& cgh) {
  auto acc = buffer.get_access(cgh);
  cgh.parallel_for(
      cl::sycl::nd_range<2>(range<2>(16, 16), range<2>(4, 4)),
      [=](cl::sycl::nd_item<2> idx) {
        group my_group = idx.get_group();
        int x = a_function();
        int sum = my_group.reduce(x);
        //..
      });
});

On submit, the kernel and its access requirements are scheduled for execution on an asynchronous queue.

queue myQueue;
auto myEvent = myQueue.submit(myCommandGroup);

SYCL Building Blocks: buffers and accessors for data storage and access

Data is managed across host and device using the buffer class.

buffer positionsBuf(positions.data(), range<1>(number_objects));
buffer velocitiesBuf(velocities.data(), range<1>(number_objects));

The buffer class takes ownership of the data. On destruction, the buffer waits for all usage of its data by the SYCL runtime to be completed.
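A minimal sketch of this RAII behaviour (the kernel name and sizes are illustrative; the point is that leaving the buffer's scope blocks until the runtime has finished with the data, after which the host vector is safe to read):

```cpp
#include <CL/sycl.hpp>
#include <vector>

int main() {
  std::vector<float> positions(1024, 0.0f);
  {
    // The buffer takes ownership of the host data for its lifetime.
    cl::sycl::buffer<float, 1> positionsBuf(
        positions.data(), cl::sycl::range<1>(positions.size()));
    cl::sycl::queue q;
    q.submit([&](cl::sycl::handler& cgh) {
      auto acc = positionsBuf.get_access<cl::sycl::access::mode::write>(cgh);
      cgh.parallel_for<class init_positions>(
          cl::sycl::range<1>(positions.size()),
          [=](cl::sycl::id<1> i) { acc[i] = 1.0f; });
    });
  } // Buffer destructor waits for the kernel and writes data back.

  // Safe here: all device work on the data has completed.
  float first = positions[0];
}
```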

SYCL Building Blocks: accessors for data storage and access

The accessor class gives access to a memory buffer within a command group. The SYCL runtime effectively constructs an asynchronous execution graph that schedules all the command groups and their memory transfers.

auto cg = [&](handler &ch) {
  auto posAcc = positionsBuf.get_access(ch);
  auto velAcc = velocitiesBuf.get_access(ch);
  auto newPosAcc = newposBuf.get_access(ch);
  ch.parallel_for(
      nd_range<1>(range<1>(number_objects), range<1>(256)),
      ([=](nd_item<1> item_) {
        unsigned int i = item_.get_global_linear_id();
        // Computation and update of velocities[ ... ]
        // Compute the new positions
        newPosAcc[i][0] = posAcc[i][0] + dtCube * velAcc[i][0] + 0.5f * a[0];
        newPosAcc[i][1] = posAcc[i][1] + dtCube * velAcc[i][1] + 0.5f * a[1];
        newPosAcc[i][2] = posAcc[i][2] + dtCube * velAcc[i][2] + 0.5f * a[2];
      }));
};

Host accessors force synchronisation of all copies of the buffer on devices with the host.
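By contrast, a host accessor can be sketched as (a minimal sketch against the buffer above; constructing it blocks until all scheduled device work on the buffer has completed and its data is visible on the host):

```cpp
// Requesting access with no command group handler yields a host accessor;
// its constructor synchronises all device copies of the buffer with the host.
auto hostAcc = newposBuf.get_access<cl::sycl::access::mode::read>();
float x = hostAcc[0][0];  // Safe to read on the host here.
```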

SYCL Building Blocks: SVM allocation for data storage and access

svm_allocator provides allocation capabilities for SYCL 2.2 devices with fine-grained SVM capabilities.

cl::sycl::svm_allocator fineGrainAllocator(fineGrainedContext);
/* Use the instance of the svm_allocator with any custom container, for
 * example, with std::shared_ptr.
 */
std::shared_ptr svmFineBuffer =
    std::allocate_shared(fineGrainAllocator, numElements);
/* Initialize the pointer on the host */
*svmFineBuffer = 66;

Registered usage of the SVM allocated pointer:

q.submit([&](cl::sycl::execution_handler& cgh) {
  auto rawSvmFinePointer = svmFineBuffer.get();
  cgh.register_access(rawSvmFinePointer);
  cgh.parallel_for(
      range<1>(1), [=](id<1> index) { rawSvmFinePointer[index]++; });
});

Direct usage of the SVM allocated fine-grained pointer on the host:

int result = *svmFineBuffer;

SYCL Building Blocks: SIMD vector classes available on host and device

The vec class can be used on host and device.

for (int i = 0; i < 64; i++) {
  dataA[i] = float4(2.0f, 1.0f, 1.0f, static_cast<float>(i));
  dataB[i] = float3(0.0f, 0.0f, 0.0f);
}

Swizzles are supported as functions, as well as all the common operators for vectors.

myQueue.submit([&](handler& cgh) {
  auto ptrA = bufA.get_access(cgh);
  auto ptrB = bufB.get_access(cgh);

  cgh.parallel_for(range<3>(4, 4, 4), [=](item<3> item) {
    float4 in = ptrA[item.get_linear_id()];
    float w = in.w();
    float3 swizzle = in.xyz();
    float3 trans = swizzle * w;
    ptrB[item.get_linear_id()] = trans;
  });
});

SYCL Building Blocks: pipes

Pipes are available in SYCL 2.2 for OpenCL 2.2 devices. A pipe is a memory object with a sequential ordering of data that cannot be accessed from the host.

cl::sycl::static_pipe p;

// Launch the producer to stream A to the pipe
q.submit([&](cl::sycl::execution_handler &cgh) {
  // Get write access to the pipe
  auto kp = p.get_access(cgh);
  // Get read access to the data
  auto ka = ba.get_access(cgh);

  cgh.single_task([=] {
    for (int i = 0; i != N; i++)
      // Try to write to the pipe until success
      while (!(kp.write(ka[i])));
  });
});

The static_pipe is a pipe with constexpr capacity that is defined for only one target device. This type of pipe can be useful for compile-time graphs and pipelines.
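The consumer side of the pipe above can be sketched symmetrically (a sketch under the same assumptions; the output buffer bb is hypothetical):

```cpp
// Launch the consumer to read the stream back from the pipe.
q.submit([&](cl::sycl::execution_handler &cgh) {
  // Get read access to the pipe.
  auto kp = p.get_access<cl::sycl::access::mode::read>(cgh);
  // Get write access to the output data.
  auto kb = bb.get_access<cl::sycl::access::mode::write>(cgh);

  cgh.single_task([=] {
    for (int i = 0; i != N; i++) {
      // Try to read from the pipe until a value is available.
      float value;
      while (!(kp.read(value)));
      kb[i] = value;
    }
  });
});
```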

SYCL Building Blocks: hierarchical command group kernels

parallel_for_work_group(): in this scope the code is executed once per work-group and the memory is allocated in the local address space. In a parallel_for_work_item scope the code is executed once per work-item and the memory is allocated in private memory. In SYCL 2.2, there is also a parallel_for_sub_group, which executes once per vector of work-items.

auto command_group = [&](handler & cgh) {
  // Issue 8 work-groups of 8 work-items each
  cgh.parallel_for_work_group(
      range<3>(2, 2, 2), range<3>(2, 2, 2), [=](group<3> myGroup) {
        int myLocal;
        private_memory myPrivate(myGroup);
        parallel_for_work_item(myGroup, [=](item<3> myItem) {
          //[work-item code]
          myPrivate(myItem) = 0;
        });
        parallel_for_work_item(myGroup, [=](item<3> myItem) {
          //[work-item code]
          output[myGroup.get_local_range() * myGroup.get() + myItem] =
              myPrivate(myItem);
        });
        //[workgroup code]
      });
};

SYCL Building Blocks: basic stream class

The stream object provides basic support for printing from the device, for debugging or notification purposes.

myQueue.submit([&](handler& cgh) {
  /* We create a stream object to print from the device. */
  stream os(1024, 80, cgh);
  cgh.single_task([=]() {
    /* We use the stream operator on the stream object we created above to
     * print to stdout from the device. */
    os << "hello World!" << endl;
  });
});

SYCL Building Blocks for C++ libraries: Questions?

SYCL building blocks:
- asynchronous execution on multiple SYCL devices
- command groups for embedding computational kernels in a library
- buffers and accessors for data storage and access
- vector classes shared between host and device
- pipes
- hierarchical command group execution
- printing on device using cl::sycl::stream

• SYCL spec and forums: www.khronos.org/opencl/sycl

• New website coming! sycl.tech

© Copyright Khronos Group 2016