Modern C++, heterogeneous computing & OpenCL SYCL
Ronan Keryell
Khronos OpenCL SYCL committee
05/12/2015 — IWOCL 2015 SYCL Tutorial •C++14 I Outline
1 C++14
2 C++ dialects for OpenCL (and heterogeneous computing)
3 OpenCL SYCL 1.2 C++... putting everything altogether
4 OpenCL SYCL 2.1...
5 Conclusion
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 2 / 43 •C++14 I C++14
• 2 Open Source compilers available before ratification (GCC & Clang/LLVM) • Confirm new momentum & pace: 1 major (C++11) and 1 minor (C++14) version on a 6-year cycle • Next big version expected in 2017 (C++1z) I Already being implemented! • Monolithic committee replaced, by many smaller parallel task forces I Parallelism TS (Technical Specification) with Parallel STL I Concurrency TS (threads, mutex...) I Array TS (multidimensional arrays à la Fortran) I Transactional Memory TS... Race to parallelism! Definitely matters for HPC and heterogeneous computing! C++ is a complete new language • Forget about C++98, C++03... • Send your proposals and get involved in C++ committee (pushing heterogeneous computing)!
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 3 / 43 •C++14 I Modern C++ & HPC (I)
• Huge library improvements I
• Easy functional programming style with lambda (anonymous) functions std::transform(std::begin(v), std::end(v), [] (intv) { return2 ∗v ; } ) ;
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 4 / 43 •C++14 I Modern C++ & HPC (II)
• Lot of meta-programming improvements to make meta-programming easy easier: variadic templates, type traits
I Same in C++14 but compiled+ static compile-time type-checking: auto add= [] (auto x, autoy) { returnx+y; }; std::cout << add(2, 3) << std::endl;//5 std::cout << add("2"s,"3"s) << std::endl;// 23 ((( Without using templated code! template(
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 5 / 43 •C++14 I Modern C++ & HPC (III)
• R-value references & std :: move semantics I matrix_A = matrix_B + matrix_C Avoid copying (TB, PB, EB... ) when assigning or function return • Avoid raw pointers, malloc()/free()/ /delete[]: use references and smart pointers instead // Allocatea double with new() and wrap it ina smart pointer auto gen() { return std ::make_shared
• Lot of other amazing stuff... • Allow both low-level & high-level programming... Useful for heterogeneous computing
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 6 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Outline
1 C++14
2 C++ dialects for OpenCL (and heterogeneous computing)
3 OpenCL SYCL 1.2 C++... putting everything altogether
4 OpenCL SYCL 2.1...
5 Conclusion
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 7 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I OpenCL 2.1 C++ kernel language (I)
• Announced at GDC, March 2015 • Move from C99-based kernel language to C++14-based // Template classes to express OpenCL address spaces local_array
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 8 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I OpenCL 2.1 C++ kernel language (II)
• Kernel side enqueue I Replace OpenCL 2 infamous Apple GCD block syntax by C++11 lambda kernel void main_kernel(intN, int ∗ array ) { // Only work−item0 will launcha new kernel i f (get_global_id(0) == 0) // Wait for the end of this work−group before starting the new kernel get_default_queue(). enqueue_kernel (CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP, ndrange { N }, [ = ] kernel{ array[get_global_id(0)] = 7; }); }
• C++14 memory model and atomic operations • Newer SPIR-V binary IR format
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 9 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I OpenCL 2.1 C++ kernel language (III)
• Amazing progress but no single source solution à la CUDA yet I Still need to play with OpenCL host API to deal with buffers, etc.
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 10 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Bolt C++ (I)
• Parallel STL + map-reduce https://github.com/HSA-Libraries/Bolt • Developed by AMD on top of OpenCL, C++AMP or TBB #include
• Simple!
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 11 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Bolt C++ (II)
• But... I No direct interoperability with OpenCL world I No specific compiler required with OpenCL some special syntax to define operation on device OpenCL kernel source strings for complex operations; with macros BOLT_FUNCTOR(), BOLT_CREATE_TYPENAME(), BOLT_CREATE_CLCODE()... Work better with AMD Static C++ Kernel Language Extension (now in OpenCL 2.1) & best with C++AMP (but no OpenCL interoperability...)
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 12 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Boost.Compute (I)
• Boost library accepted in 2015 https://github.com/boostorg/compute • Provide 2 levels of abstraction I High-level parallel STL I Low-level C++ wrapping of OpenCL concepts
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 13 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Boost.Compute (II)
// Geta default command queue on the default accelerator auto queue = boost::compute::system:: default_queue (); // Allocatea vector ina buffer on the device boost ::compute:: vector
// Create an equivalent OpenCL kernel BOOST_COMPUTE_FUNCTION(float , add_four, (float x), { return x+4; });
boost ::compute:: transform(device_vector.begin() , device_vector.end() , device_vector.begin() , add_four , queue);
boost::compute:: sort(device_vector.begin() , device_vector.end() , queue); // Lambda expression equivalent boost ::compute:: transform(device_vector.begin() , device_vector.end() , device_vector.begin() , boost ::compute::lambda::_1 ∗ 3 − 4 , queue ) ;
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 14 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Boost.Compute (III)
• Elegant implicit C++ conversions between OpenCL and Boost.Compute types for finer control and optimizations auto command_queue = boost ::compute::system:: default_queue (); auto context = command_queue.get_context (); auto program = boost ::compute::program:: create_with_source_file(kernel_file_name , context ) ; program. build (); boost ::compute:: kernel im2col_kernel { program,"im2col"};
boost::compute:: buffer im_buffer { context , image_size∗ s i z e o f(float), CL_MEM_READ_ONLY } ; command_queue. enqueue_write_buffer(im_buffer , 0/ ∗ O f f s e t ∗ /, im_data.size() ∗ s i z e o f(decltype(im_data ):: value_type), im_data.data ());
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 15 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Boost.Compute (IV)
im2col_kernel.set_args(im_buffer , height , width, ksize_h , ksize_w, pad_h, pad_w, stride_h , stride_w , height_col , width_col , data_col ) ;
command_queue. enqueue_nd_range_kernel (kernel, boost::compute::extents<1> { 0 }/ ∗ g l o b a l work offset ∗ /, boost::compute::extents<1> { workitems }/ ∗ g l o b a l work−item ∗ /, boost::compute:: extents<1> { workgroup_size };/ ∗ Work group size ∗ /);
• Provide program caching • Direct OpenCL interoperability for extreme performance • No specific compiler required some special syntax to define operation on device • Probably the right tool to use to; translate CUDA & Thrust to OpenCL world
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 16 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I VexCL (I)
• Parallel STL similar to Boost.Compute + mathematical libraries https://github.com/ddemidov/vexcl I Random generators (Random123) I FFT I Tensor operations I Sparse matrix-vector products I Stencil convolutions I ... • OpenCL (CL.hpp or Boost.Compute) & CUDA back-end • Allow device vectors & operations to span different accelerators from different vendors in a same context vex::Context ctx { vex:: Filter ::Type { CL_DEVICE_TYPE_GPU } && vex:: Filter ::DoublePrecision }; vex:: vector
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 17 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I VexCL (II)
I Allow easy interoperability with back-end // Get the cl_buffer storingA on the device2 auto clBuffer = A(2);
• Use heroic meta-programming to generate kernels without using specific compiler with deep embedded DSL I Use symbolic types (prototypal arguments) to extract function structure // Set recorder for expression sequence std :: ostringstream body; vex:: generator :: set_recorder(body); vex:: symbolic
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 18 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I VexCL (III)
// Now use the kernel foobar (A ) ;
I VexCL is probably the most advanced tool to generate OpenCL without requiring a specific compiler... • Interoperable with OpenCL, Boost.Compute for extreme performance & ViennaCL • Kernel caching to avoid useless compiling • Probably the right tool to use to translate CUDA & Thrust to OpenCL world
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 19 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I ViennaCL (I)
https://github.com/viennacl/viennacl-dev • OpenCL/CUDA/OpenMP back-end • Similar to VexCL for sharing context between various platforms • Linear algebra (dense & sparse) • Iterative solvers • FFT • OpenCL kernel generator from high-level expressions • Some interoperability with Matlab
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 20 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I C++AMP
// Use iota algorithm inC++AMP parallel_for_each(e, #include
void iota_n(size_t n, int dst[]) { • Developed by Microsoft, AMD & // Select the first true accelerator found as the default one f o r(auto const & acc : accelerator::get_all()) MultiCoreWare i f (!acc.get_is_emulated()) { accelerator :: set_default(acc.get_device_path ()); • Single source: easy to write kernels break; } • Require specific compiler // Define the iteration space extent<1> e(n); • Not pure C++ (restrict, tile_static) // Createa buffer from the given array memory array_view
Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 21 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I OpenMP 4
#include
enum { NWITEMS = 512 } ; r e t u r n 0; i n t array [NWITEMS ] ; }
void iota_n(size_t n, int dst[n]) { • Old HPC standard from the 90’s #pragma omp target map(from: dst[0:n −1]) #pragma omp parallel for • Use #pragma to express parallelism f o r(int i =0; i Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 22 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Other (non-)OpenCL C++ framework (I) • ArrayFire, Aura, CLOGS, hemi, HPL, Kokkos, MTL4, SkelCL, SkePU, EasyCL... • nVidia CUDA 7 now C++11-based I Single source simpler for the programmer I nVidia Thrust ≈ parallel STL+map-reduce on top of CUDA, OpenMP or TBB https://github.com/thrust/thrust; Not very clean because device pointers returned by cudaMalloc() do not have a special type use some ugly casts • OpenACC;≈ OpenMP 4 restricted to accelerators + LDS finer control Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 23 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Missing link... • No tool providing I OpenCL interoperability I Modern C++ environment I Single source for programming productivity Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 24 / 43 •OpenCL SYCL 1.2 I Outline 1 C++14 2 C++ dialects for OpenCL (and heterogeneous computing) 3 OpenCL SYCL 1.2 C++... putting everything altogether 4 OpenCL SYCL 2.1... 5 Conclusion Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 25 / 43 •OpenCL SYCL 1.2 I Puns and pronunciation explained OpenCL SYCL OpenCL SPIR sickle [ "si-k@l ] spear [ "spir ] Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 26 / 43 •OpenCL SYCL 1.2 I OpenCL SYCL goals • Ease of use I Single source programming model Take advantage of CUDA & C++AMP simplicity and power Compiled for host and device(s) • Easy development/debugging on host: host fall-back target • Programming interface based on abstraction of OpenCL components (data management, error handling...) • Most modern C++ features available for OpenCL I Enabling the creation of higher level programming models I C++ templated libraries based on OpenCL I Exceptions for error handling • Portability across platforms and compilers • Providing the full OpenCL feature set and seamless integration with existing OpenCL code • Task graph programming model with interface à la TBB/Cilk (C++17) • High performance http://www.khronos.org/opencl/sycl Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 27 / 43 •OpenCL SYCL 1.2 I Complete example of matrix addition in OpenCL SYCL #include Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 28 / 43 •OpenCL SYCL 1.2 I Asynchronous task graph model • Theoretical graph of an application described implicitly with kernel tasks using buffers through accessors cl::sycl::accessor Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 29 / 43 •OpenCL SYCL 1.2 I Task graph programming — the code #include const int size = 10; i n t data[size]; const int gsize = 2; Very close to OpenMP 4 style! b u f f e r Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 31 / 43 •OpenCL SYCL 1.2 I C++11 allocators •∃ C++11 allocators to control the way objects are allocated in memory I For example to allocate some vectors on some storage I Concept of scoped_allocator to control storage of nested data structures I Example: vector of strings, with vector data and string data allocated in different memory areas (speed, power consumption, caching, read-only...) • SYCL reuses allocator to specify how buffer and image are allocated on the host side Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 32 / 43 •OpenCL SYCL 1.2 IC++... putting everything altogether Outline 1 C++14 2 C++ dialects for OpenCL (and heterogeneous computing) 3 OpenCL SYCL 1.2 C++... putting everything altogether 4 OpenCL SYCL 2.1... 5 Conclusion Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 33 / 43 •OpenCL SYCL 1.2 IC++... putting everything altogether Exascale-ready • Use your own C++ compiler I Only kernel outlining needs SYCL compiler • SYCL with C++ can address most of the hierarchy levels I MPI I OpenMP I C++-based PGAS (Partitioned Global Address Space) DSeL (Domain-Specific embedded Language, such as Coarray C++...) I Remote accelerators in clusters I Use SYCL buffer allocator for RDMA Out-of-core, mapping to a file PiM (Processor in Memory) ... Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 34 / 43 •OpenCL SYCL 1.2 IC++... putting everything altogether Debugging • Difficult to debug code or detect precondition violation on GPU and at large... • Rely on C++ to help debugging I Overload some operations and functions to verify preconditions I Hide tracing/verification code in constructors/destructors I Can use pure-C++ host implementation for bug-tracking with favorite debugger Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 35 / 43 •OpenCL SYCL 1.2 IC++... putting everything altogether Poor-man SVM with C++11 + SYCL • For complex data structures I Objects need to be in buffers to be shipped between CPU and devices I Do not want marshaling/unmarshaling objects... I Use C++11 allocator to allocate some objects in 1 SYCL buffer Useful to send efficiently data through MPI and RDMA too! I But since no SVM, not same address on CPU and GPU side... How to deal with pointers? Override all pointer accessed/ (for example use std::pointer_trait) to do address translation on kernel side Cost: 1 addition, per *p • When no or inefficient SVM... I Also useful optimization when need to work on a copy only on the GPU Only allocation on GPU side Spare some TLB trashing on the CPU Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 36 / 43 •OpenCL SYCL 1.2 IC++... putting everything altogether ¿¿¿Fortran??? • Fortran 2003 introduces C-interoperability that can be used for C++ interoperability... SYCL • C++ boost::multi_array & others provides à la Fortran arrays I Allows triplet notation I Can be used from inside SYCL to deal with Fortran-like arrays • Perhaps the right time to switch your application to modern C++? , Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 37 / 43 •OpenCL SYCL 1.2 IC++... putting everything altogether Using SYCL-like models in other areas • SYCL ≡ generic heterogeneous computing model beyond OpenCL I queue expresses where computations happen I parallel_for launches computations I accessor defines the way we access data I buffer for storing data I allocator for defining how data are allocated/backed • Example for HSA: almost direct mapping à la OpenCL • Example in PiM world I Use queue to run on some PiM chips I Use allocator to distribute data structures or to allocate buffer in special memory (memory page, chip...) I Use accessor to use alternative data access (split address from computation, streaming only, PGAS...) I Use pointer_trait to use specific way to interact with memory I ... Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 38 / 43 •OpenCL SYCL 2.1... I Outline 1 C++14 2 C++ dialects for OpenCL (and heterogeneous computing) 3 OpenCL SYCL 1.2 C++... putting everything altogether 4 OpenCL SYCL 2.1... 5 Conclusion Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 39 / 43 •OpenCL SYCL 2.1... I SYCL 2.1 is coming! • Skip directly to OpenCL 2.1 and C++14 • Kernel side enqueue • Shared memory between host and accelerator • Parallel STL C++17 • Array TS Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 40 / 43 •OpenCL SYCL 2.1... I SYCL and fine-grain system shared memory (OpenCL 2) #include Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 41 / 43 •Conclusion I Outline 1 C++14 2 C++ dialects for OpenCL (and heterogeneous computing) 3 OpenCL SYCL 1.2 C++... putting everything altogether 4 OpenCL SYCL 2.1... 5 Conclusion Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 42 / 43 •Conclusion I Conclusion •∃ Many C++ frameworks to leverage OpenCL I None of them provides seamless single source Require some kind of macros & weird syntax I But they should be preferred to plain OpenCL C for productivity • SYCL provides seamless single source with OpenCL interoperability I Can be used to improve other higher-level frameworks • SYCL ≡ pure C++ integration with other C/C++ HPC frameworks: OpenCL, OpenMP, libraries (MPI, numerical), C++ DSeL (PGAS...)... • SYCL also interesting; as co-design tool for architectural & programming model research (PiM, Near-Memory Computing, various computing models...) • Modern C++ is not just C program in .cpp file Invest in learning modern C++ , ; Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 43 / 43 •Table of content I Complete example of matrix addition in OpenCL SYCL 28 1 C++14 Asynchronous task graph model 29 Outline 2 Task graph programming — the code 30 C++14 3 From work-groups & work-items to hierarchical parallelism 31 Modern C++ & HPC 4 C++11 allocators 32 C++... putting everything altogether 2 C++ dialects for OpenCL (and heterogeneous computing) Outline 33 Outline 7 Exascale-ready 34 OpenCL 2.1 C++ kernel language 8 Debugging 35 Bolt C++ 11 Poor-man SVM with C++11 + SYCL 36 Boost.Compute 13 ¿¿¿Fortran??? 37 VexCL 17 Using SYCL-like models in other areas 38 ViennaCL 20 C++AMP 21 4 OpenCL SYCL 2.1... OpenMP 4 22 Outline 39 Other (non-)OpenCL C++ framework 23 SYCL 2.1 is coming! 40 Missing link... 24 SYCL and fine-grain system shared memory (OpenCL 2) 41 3 OpenCL SYCL 1.2 5 Conclusion Outline 25 Outline 42 Puns and pronunciation explained 26 Conclusion 43 OpenCL SYCL goals 27 You are here ! 44 Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 43 / 43