Modern C++, Heterogeneous Computing & Opencl SYCL

Modern C++, heterogeneous computing & OpenCL SYCL Ronan Keryell Khronos OpenCL SYCL committee 05/12/2015 — IWOCL 2015 SYCL Tutorial •C++14 I Outline 1 C++14 2 C++ dialects for OpenCL (and heterogeneous computing) 3 OpenCL SYCL 1.2 C++... putting everything altogether 4 OpenCL SYCL 2.1... 5 Conclusion Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 2 / 43 •C++14 I C++14 • 2 Open Source compilers available before ratification (GCC & Clang/LLVM) • Confirm new momentum & pace: 1 major (C++11) and 1 minor (C++14) version on a 6-year cycle • Next big version expected in 2017 (C++1z) I Already being implemented! • Monolithic committee replaced, by many smaller parallel task forces I Parallelism TS (Technical Specification) with Parallel STL I Concurrency TS (threads, mutex...) I Array TS (multidimensional arrays à la Fortran) I Transactional Memory TS... Race to parallelism! Definitely matters for HPC and heterogeneous computing! C++ is a complete new language • Forget about C++98, C++03... • Send your proposals and get involved in C++ committee (pushing heterogeneous computing)! Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 3 / 43 •C++14 I Modern C++ & HPC (I) • Huge library improvements I <thread> library and multithread memory model <atomic> HPC I Hash-map I Algorithms ; I Random numbers I ... • Uniform initialization and range-based for loop std ::vector<int> my_vector { 1, 2, 3, 4, 5 }; f o r(int &e : my_vector) e += 1; • Easy functional programming style with lambda (anonymous) functions std::transform(std::begin(v), std::end(v), [] (intv) { return2 ∗v ; } ) ; Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 4 / 43 •C++14 I Modern C++ & HPC (II) • Lot of meta-programming improvements to make meta-programming easy easier: variadic templates, type traits <type_traits>... • Make simple things simpler to be able to write generic numerical libraries, etc. • Automatic type inference for terse programming I Python 3.x (interpreted): def add(x, y): r e t u r nx+y p r i n t(add(2, 3))#5 p r i n t (add("2","3"))# 23 I Same in C++14 but compiled+ static compile-time type-checking: auto add= [] (auto x, autoy) { returnx+y; }; std::cout << add(2, 3) << std::endl;//5 std::cout << add("2"s,"3"s) << std::endl;// 23 ((( Without using templated code! template( <typename(( > (((( , Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 5 / 43 •C++14 I Modern C++ & HPC (III) • R-value references & std :: move semantics I matrix_A = matrix_B + matrix_C Avoid copying (TB, PB, EB... ) when assigning or function return • Avoid raw pointers, malloc()/free()/ /delete[]: use references and smart pointers instead // Allocatea double with new() and wrap it ina smart pointer auto gen() { return std ::make_shared<double> { 3.14 }; } [...] { auto p = gen(), q = p; ∗q = 2.718; // Out of scope, no longer use of the memory: deallocation happens here } • Lot of other amazing stuff... • Allow both low-level & high-level programming... Useful for heterogeneous computing Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 6 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Outline 1 C++14 2 C++ dialects for OpenCL (and heterogeneous computing) 3 OpenCL SYCL 1.2 C++... putting everything altogether 4 OpenCL SYCL 2.1... 5 Conclusion Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 7 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I OpenCL 2.1 C++ kernel language (I) • Announced at GDC, March 2015 • Move from C99-based kernel language to C++14-based // Template classes to express OpenCL address spaces local_array<int, N> array; l o c a l<float> v; constant_ptr<double> p; // UseC++11 generalized attributes, to ignore vector dependencies [[safelen(8), ivdep]] f o r(int i =0; i <N; i++) // Can infer that offset >=8 array[i+offset] = array[i] + 42; Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 8 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I OpenCL 2.1 C++ kernel language (II) • Kernel side enqueue I Replace OpenCL 2 infamous Apple GCD block syntax by C++11 lambda kernel void main_kernel(intN, int ∗ array ) { // Only work−item0 will launcha new kernel i f (get_global_id(0) == 0) // Wait for the end of this work−group before starting the new kernel get_default_queue(). enqueue_kernel (CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP, ndrange { N }, [ = ] kernel{ array[get_global_id(0)] = 7; }); } • C++14 memory model and atomic operations • Newer SPIR-V binary IR format Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 9 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I OpenCL 2.1 C++ kernel language (III) • Amazing progress but no single source solution à la CUDA yet I Still need to play with OpenCL host API to deal with buffers, etc. Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 10 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Bolt C++ (I) • Parallel STL + map-reduce https://github.com/HSA-Libraries/Bolt • Developed by AMD on top of OpenCL, C++AMP or TBB #include <bolt/cl/sort.h> #include <vector> #include <algorithm> i n t main() { // generate random data(on host) std ::vector<int> a(8192); std::generate(a.begin(), a.end(), rand); // sort, run on best device in the platform bolt::cl::sort(a.begin(), a.end()); r e t u r n 0; } • Simple! Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 11 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Bolt C++ (II) • But... I No direct interoperability with OpenCL world I No specific compiler required with OpenCL some special syntax to define operation on device OpenCL kernel source strings for complex operations; with macros BOLT_FUNCTOR(), BOLT_CREATE_TYPENAME(), BOLT_CREATE_CLCODE()... Work better with AMD Static C++ Kernel Language Extension (now in OpenCL 2.1) & best with C++AMP (but no OpenCL interoperability...) Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 12 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Boost.Compute (I) • Boost library accepted in 2015 https://github.com/boostorg/compute • Provide 2 levels of abstraction I High-level parallel STL I Low-level C++ wrapping of OpenCL concepts Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 13 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Boost.Compute (II) // Geta default command queue on the default accelerator auto queue = boost::compute::system:: default_queue (); // Allocatea vector ina buffer on the device boost ::compute:: vector<float> device_vector { N, queue.get_context() }; boost::compute:: iota(device_vector.begin() , device_vector.end() , 0); // Create an equivalent OpenCL kernel BOOST_COMPUTE_FUNCTION(float , add_four, (float x), { return x+4; }); boost ::compute:: transform(device_vector.begin() , device_vector.end() , device_vector.begin() , add_four , queue); boost::compute:: sort(device_vector.begin() , device_vector.end() , queue); // Lambda expression equivalent boost ::compute:: transform(device_vector.begin() , device_vector.end() , device_vector.begin() , boost ::compute::lambda::_1 ∗ 3 − 4 , queue ) ; Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 14 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Boost.Compute (III) • Elegant implicit C++ conversions between OpenCL and Boost.Compute types for finer control and optimizations auto command_queue = boost ::compute::system:: default_queue (); auto context = command_queue.get_context (); auto program = boost ::compute::program:: create_with_source_file(kernel_file_name , context ) ; program. build (); boost ::compute:: kernel im2col_kernel { program,"im2col"}; boost::compute:: buffer im_buffer { context , image_size∗ s i z e o f(float), CL_MEM_READ_ONLY } ; command_queue. enqueue_write_buffer(im_buffer , 0/ ∗ O f f s e t ∗ /, im_data.size() ∗ s i z e o f(decltype(im_data ):: value_type), im_data.data ()); Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 15 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I Boost.Compute (IV) im2col_kernel.set_args(im_buffer , height , width, ksize_h , ksize_w, pad_h, pad_w, stride_h , stride_w , height_col , width_col , data_col ) ; command_queue. enqueue_nd_range_kernel (kernel, boost::compute::extents<1> { 0 }/ ∗ g l o b a l work offset ∗ /, boost::compute::extents<1> { workitems }/ ∗ g l o b a l work−item ∗ /, boost::compute:: extents<1> { workgroup_size };/ ∗ Work group size ∗ /); • Provide program caching • Direct OpenCL interoperability for extreme performance • No specific compiler required some special syntax to define operation on device • Probably the right tool to use to; translate CUDA & Thrust to OpenCL world Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 16 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I VexCL (I) • Parallel STL similar to Boost.Compute + mathematical libraries https://github.com/ddemidov/vexcl I Random generators (Random123) I FFT I Tensor operations I Sparse matrix-vector products I Stencil convolutions I ... • OpenCL (CL.hpp or Boost.Compute) & CUDA back-end • Allow device vectors & operations to span different accelerators from different vendors in a same context vex::Context ctx { vex:: Filter ::Type { CL_DEVICE_TYPE_GPU } && vex:: Filter ::DoublePrecision }; vex:: vector<double>A { ctx, N }, B { ctx, N }, C{ ctx, N }; A = 2 ∗ B − s i n (C ) ; Modern C++, heterogeneous computing & OpenCL SYCL IWOCL 2015 17 / 43 •C++ dialects for OpenCL (and heterogeneous computing) I VexCL (II) I Allow easy interoperability with back-end //

Load more