Modern C++, heterogeneous computing & OpenCL SYCL

Ronan Keryell

Khronos OpenCL SYCL committee

05/12/2015 — IWOCL 2015 SYCL Tutorial

Outline

1 C++14

2 C++ dialects for OpenCL (and heterogeneous computing)

3 OpenCL SYCL 1.2 C++... putting everything altogether

4 OpenCL SYCL 2.1...

5 Conclusion

C++14

• 2 open-source compilers available before ratification (GCC & Clang/LLVM)
• Confirms the new momentum & pace: 1 major (C++11) and 1 minor (C++14) version on a 6-year cycle
• Next big version expected in 2017 (C++1z), already being implemented!
• Monolithic committee replaced by many smaller parallel task forces:
  - Parallelism TS (Technical Specification) with Parallel STL
  - Concurrency TS (threads, mutex...)
  - Array TS (multidimensional arrays à la Fortran)
  - Transactional Memory TS...
• Race to parallelism! Definitely matters for HPC and heterogeneous computing!
• C++ is a complete new language: forget about C++98, C++03...
• Send your proposals and get involved in the C++ committee (pushing heterogeneous computing)!
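To make the "complete new language" claim concrete, here is a minimal sampler of C++14 additions; the names (`twice`, `million`, `boxed`) are made up for illustration and are not from the talk:

```cpp
#include <cassert>
#include <memory>

// Return type deduction for ordinary functions (C++14)
auto twice(int x) { return 2 * x; }

// Digit separators make large constants readable (C++14)
constexpr long million = 1'000'000;

// Generic lambda combined with C++14 std::make_unique
auto boxed = [](auto v) { return std::make_unique<decltype(v)>(v); };
```

None of this needs explicit template syntax at the call site, which is part of what makes modern C++ attractive for generic HPC libraries.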

Modern C++ & HPC (I)

• Huge improvements useful for HPC:
  - library and multithread memory model
  - hash-map
  - algorithms
  - random numbers
  - ...
• Uniform initialization and range-based for loop:

```cpp
std::vector<int> my_vector { 1, 2, 3, 4, 5 };
for (int &e : my_vector)
  e += 1;
```

• Easy functional programming style with lambda (anonymous) functions:

```cpp
std::transform(std::begin(v), std::end(v), std::begin(v),
               [] (int v) { return 2*v; });
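The same in-place transform, wrapped in a small helper so it can be exercised end to end (the function name is illustrative only):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Double every element with std::transform and a lambda, in place
std::vector<int> doubled(std::vector<int> v) {
  std::transform(std::begin(v), std::end(v), std::begin(v),
                 [](int x) { return 2 * x; });
  return v;
}
```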

Modern C++ & HPC (II)

• Lots of meta-programming improvements to make meta-programming easier: variadic templates, type traits...
• Make simple things simpler, to be able to write generic numerical libraries, etc.
• Automatic type inference for terse programming
  - Python 3.x (interpreted):

```python
def add(x, y):
    return x + y

print(add(2, 3))      # 5
print(add("2", "3"))  # 23
```

  - Same in C++14, but compiled + static compile-time type-checking:

```cpp
auto add = [] (auto x, auto y) { return x + y; };
std::cout << add(2, 3) << std::endl;       // 5
std::cout << add("2"s, "3"s) << std::endl; // 23
```

Without using templated code! ☺

Modern C++ & HPC (III)

• R-value references & std::move semantics
  - matrix_A = matrix_B + matrix_C: avoid copying (TB, PB, EB...) when assigning or returning from a function
• Avoid raw pointers, malloc()/free()/new/delete[]: use references and smart pointers instead

```cpp
// Allocate a double with new and wrap it in a smart pointer
auto gen() { return std::make_shared<double>(3.14); }
// [...]
{
  auto p = gen(), q = p;
  *q = 2.718;
  // Out of scope, no longer any use of the memory: deallocation happens here
}
```
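A small sketch of what move semantics buys when assigning or returning: the buffer is transferred, not copied. The function names are illustrative, not from the slides:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returning a container by value moves it (or elides the copy entirely)
std::vector<double> make_matrix() { return std::vector<double>(1000, 3.14); }

bool moved_without_copy() {
  std::vector<double> a = make_matrix();
  const double *original_storage = a.data();
  // std::move steals the buffer: no element is copied
  std::vector<double> b = std::move(a);
  // b now owns the exact same storage; the moved-from vector is empty
  return b.data() == original_storage && a.empty();
}
```

With multi-terabyte matrices, this is the difference between an O(1) pointer swap and an unaffordable deep copy.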

• Lots of other amazing stuff...
• Allows both low-level & high-level programming... useful for heterogeneous computing

C++ dialects for OpenCL (and heterogeneous computing)


OpenCL 2.1 C++ kernel language (I)

• Announced at GDC, March 2015
• Move from a C99-based kernel language to a C++14-based one

```cpp
// Template classes to express OpenCL address spaces
local_array<int, N> array;
local<int> v;
constant_ptr<int> p;
// Use C++11 generalized attributes to ignore vector dependencies
[[safelen(8), ivdep]] for (int i = 0; i < N; i++)
  // Assuming offset >= 8
  array[i + offset] = array[i] + 42;
```

OpenCL 2.1 C++ kernel language (II)

• Kernel-side enqueue
  - Replace the infamous OpenCL 2 Apple GCD block syntax with a C++11 lambda

```cpp
kernel void main_kernel(int N, int *array) {
  // Only work-item 0 will launch a new kernel
  if (get_global_id(0) == 0)
    // Wait for the end of this work-group before starting the new kernel
    get_default_queue().enqueue_kernel(CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP,
                                       ndrange { N },
                                       [=] kernel { array[get_global_id(0)] = 7; });
}
```

• C++14 memory model and atomic operations
• New SPIR-V binary IR format

OpenCL 2.1 C++ kernel language (III)

• Amazing progress, but no single-source solution à la CUDA yet
  - Still need to play with the OpenCL host API to deal with buffers, etc.

Bolt C++ (I)

• Parallel STL + map-reduce https://github.com/HSA-Libraries/Bolt
• Developed by AMD on top of OpenCL, C++AMP or TBB

```cpp
#include <bolt/cl/sort.h>
#include <algorithm>
#include <vector>

int main() {
  // generate random data (on host)
  std::vector<int> a(8192);
  std::generate(a.begin(), a.end(), rand);
  // sort, run on best device in the platform
  bolt::cl::sort(a.begin(), a.end());
  return 0;
}
```

• Simple!

Bolt C++ (II)

• But...
  - No direct interoperability with the OpenCL world
  - No specific compiler required, but with OpenCL some special syntax is needed to define operations on the device:
    - OpenCL kernel source strings for complex operations, with macros BOLT_FUNCTOR(), BOLT_CREATE_TYPENAME(), BOLT_CREATE_CLCODE()...
    - Works better with the AMD Static C++ Kernel Language Extension (now in OpenCL 2.1) & best with C++AMP (but no OpenCL interoperability...)

Boost.Compute (I)

• Boost library accepted in 2015 https://github.com/boostorg/compute
• Provides 2 levels of abstraction
  - High-level parallel STL
  - Low-level C++ wrapping of OpenCL concepts

Boost.Compute (II)

```cpp
// Get a default command queue on the default accelerator
auto queue = boost::compute::system::default_queue();
// Allocate a vector in a buffer on the device
boost::compute::vector<float> device_vector { N, queue.get_context() };
boost::compute::iota(device_vector.begin(), device_vector.end(), 0);

// Create an equivalent OpenCL kernel
BOOST_COMPUTE_FUNCTION(float, add_four, (float x),
                       { return x + 4; });

boost::compute::transform(device_vector.begin(), device_vector.end(),
                          device_vector.begin(), add_four, queue);

boost::compute::sort(device_vector.begin(), device_vector.end(), queue);

// Lambda expression equivalent
boost::compute::transform(device_vector.begin(), device_vector.end(),
                          device_vector.begin(),
                          boost::compute::lambda::_1 * 3 - 4, queue);
```

Boost.Compute (III)

• Elegant implicit C++ conversions between OpenCL and Boost.Compute types for finer control and optimizations

```cpp
auto command_queue = boost::compute::system::default_queue();
auto context = command_queue.get_context();
auto program =
    boost::compute::program::create_with_source_file(kernel_file_name,
                                                     context);
program.build();
boost::compute::kernel im2col_kernel { program, "im2col" };

boost::compute::buffer im_buffer { context,
                                   image_size*sizeof(float),
                                   CL_MEM_READ_ONLY };
command_queue.enqueue_write_buffer(im_buffer,
                                   0 /* Offset */,
                                   im_data.size()
                                   * sizeof(decltype(im_data)::value_type),
                                   im_data.data());
```

Boost.Compute (IV)

```cpp
im2col_kernel.set_args(im_buffer, height, width, ksize_h, ksize_w,
                       pad_h, pad_w, stride_h, stride_w,
                       height_col, width_col, data_col);

command_queue.enqueue_nd_range_kernel(im2col_kernel,
    boost::compute::extents<1> { 0 } /* global work offset */,
    boost::compute::extents<1> { workitems } /* global work-items */,
    boost::compute::extents<1> { workgroup_size } /* work-group size */);
```

• Provides program caching
• Direct OpenCL interoperability for extreme performance
• No specific compiler required, but some special syntax to define operations on the device
• Probably the right tool to use to translate CUDA & Thrust to the OpenCL world

VexCL (I)

• Parallel STL similar to Boost.Compute + mathematical libraries https://github.com/ddemidov/vexcl
  - Random generators (Random123)
  - FFT
  - Tensor operations
  - Sparse matrix-vector products
  - Stencil convolutions
  - ...
• OpenCL (CL.hpp or Boost.Compute) & CUDA back-ends
• Allows device vectors & operations to span different accelerators from different vendors in the same context

```cpp
vex::Context ctx { vex::Filter::Type { CL_DEVICE_TYPE_GPU }
                   && vex::Filter::DoublePrecision };
vex::vector<double> A { ctx, N }, B { ctx, N }, C { ctx, N };
A = 2*B - sin(C);
```

VexCL (II)

  - Allows easy interoperability with the back-end

```cpp
// Get the cl_buffer storing A on device 2
auto clBuffer = A(2);
```

• Uses heroic meta-programming to generate kernels without a specific compiler, through a deeply embedded DSL
  - Uses symbolic types (prototypal arguments) to extract the function structure

```cpp
// Set recorder for expression sequence
std::ostringstream body;
vex::generator::set_recorder(body);
vex::symbolic<double> sym_x { vex::symbolic<double>::VectorParameter };
sym_x = sin(sym_x) + 3;
sym_x = cos(2*sym_x) + 5;
// Build kernel from the recorded sequence
auto foobar = vex::generator::build_kernel(ctx, "foobar", body.str(), sym_x);
```

VexCL (III)

```cpp
// Now use the kernel
foobar(A);
```

  - VexCL is probably the most advanced tool to generate OpenCL without requiring a specific compiler...
• Interoperable with OpenCL, Boost.Compute & ViennaCL for extreme performance
• Kernel caching to avoid useless recompiling
• Probably the right tool to use to translate CUDA & Thrust to the OpenCL world

ViennaCL (I)

https://github.com/viennacl/viennacl-dev
• OpenCL/CUDA/OpenMP back-ends
• Similar to VexCL for sharing a context between various platforms
• Linear algebra (dense & sparse)
• Iterative solvers
• FFT
• OpenCL kernel generator from high-level expressions
• Some interoperability with Matlab

C++AMP

```cpp
#include <amp.h>
#include <cstddef>
// To avoid writing Concurrency:: everywhere
using namespace Concurrency;

enum { NWITEMS = 512 };
int data[NWITEMS];

void iota_n(size_t n, int dst[]) {
  // Select the first true accelerator found as the default one
  for (auto const &acc : accelerator::get_all())
    if (!acc.get_is_emulated()) {
      accelerator::set_default(acc.get_device_path());
      break;
    }
  // Define the iteration space
  extent<1> e(n);
  // Create a buffer from the given array memory
  array_view<int> a(e, dst);
  // Is there a better way to express write-only data?
  a.discard_data();
  // Execute a kernel in parallel: the iota algorithm in C++AMP
  parallel_for_each(e,
                    // Define the kernel to execute
                    [=](Concurrency::index<1> i) restrict(amp) {
                      a[i] = i[0];
                    });
  // In the destruction of array_view "a" happening here,
  // the data are copied back before iota_n() returns
}
```

• Developed by Microsoft, AMD & MultiCoreWare
• Single source: easy to write kernels
• Requires a specific compiler
• Not pure C++ (restrict, tile_static)
• No OpenCL interoperability
• Difficult to optimize the data transfers

OpenMP 4

```c
#include <stdio.h>

enum { NWITEMS = 512 };
int array[NWITEMS];

void iota_n(size_t n, int dst[n]) {
#pragma omp target map(from: dst[0:n])
#pragma omp parallel for
  for (int i = 0; i < n; i++)
    dst[i] = i;
}

int main() {
  iota_n(NWITEMS, array);
  for (int i = 0; i < NWITEMS; i++)
    printf("%d %d\n", i, array[i]);
  return 0;
}
```

• Old HPC standard from the 90's
• Uses #pragma to express parallelism

Other (non-)OpenCL C++ frameworks (I)

• ArrayFire, Aura, CLOGS, hemi, HPL, Kokkos, MTL4, SkelCL, SkePU, EasyCL...
• CUDA 7 now C++11-based
  - Single source, simpler for the programmer
  - nVidia Thrust ≈ parallel STL + map-reduce on top of CUDA, OpenMP or TBB https://github.com/thrust/thrust
    - Not very clean, because device pointers returned by cudaMalloc() do not have a special type: requires some ugly casts
• OpenACC ≈ OpenMP 4 restricted to accelerators + finer LDS control

Missing link...

• No tool providing
  - OpenCL interoperability
  - a modern C++ environment
  - single source for programming productivity

OpenCL SYCL 1.2


Puns and pronunciation explained

OpenCL SYCL: pronounced "sickle" [ˈsɪkəl]
OpenCL SPIR: pronounced "spear" [ˈspɪr]

OpenCL SYCL goals

• Ease of use
  - Single source programming model
    - Take advantage of CUDA & C++AMP simplicity and power
    - Compiled for host and device(s)
• Easy development/debugging on host: host fall-back target
• Programming interface based on abstraction of OpenCL components (data management, error handling...)
• Most modern C++ features available for OpenCL
  - Enabling the creation of higher-level programming models
  - C++ templated libraries based on OpenCL
  - Exceptions for error handling
• Portability across platforms and compilers
• Providing the full OpenCL feature set and seamless integration with existing OpenCL code
• Task graph programming model with an interface à la TBB (C++17)
• High performance

http://www.khronos.org/opencl/sycl

Complete example of matrix addition in OpenCL SYCL

```cpp
#include <CL/sycl.hpp>
#include <iostream>

using namespace cl::sycl;

constexpr size_t N = 2;
constexpr size_t M = 3;
using Matrix = float[N][M];

int main() {
  Matrix a = { { 1, 2, 3 }, { 4, 5, 6 } };
  Matrix b = { { 2, 3, 4 }, { 5, 6, 7 } };
  Matrix c;

  { // Create a queue to work on
    queue myQueue;
    // Wrap some buffers around our data
    buffer<float, 2> A { a, range<2> { N, M } };
    buffer<float, 2> B { b, range<2> { N, M } };
    buffer<float, 2> C { c, range<2> { N, M } };
    // Enqueue some computation kernel task
    myQueue.submit([&](handler &cgh) {
      // Define the data used/produced
      auto ka = A.get_access<access::mode::read>(cgh);
      auto kb = B.get_access<access::mode::read>(cgh);
      auto kc = C.get_access<access::mode::write>(cgh);
      // Create & call OpenCL kernel named "mat_add"
      cgh.parallel_for<class mat_add>(range<2> { N, M },
                                      [=](id<2> i) { kc[i] = ka[i] + kb[i]; });
    }); // End of our commands for this queue
  } // End scope, so wait for the queue to complete.
    // Copy back the buffer data with RAII behaviour.
  return 0;
}
```

Asynchronous task graph model

• Theoretical graph of an application described implicitly with kernel tasks using buffers through accessors

[Figure: task graph. Kernels init_a and init_b write cl::sycl::buffer a and b through cl::sycl::accessor objects; matrix_add reads a and b and writes cl::sycl::buffer c through accessors; c is finally displayed.]

• Possible schedule by the SYCL runtime: init_b, init_a, matrix_add, Display, with automatic overlap of kernels & communications
  - Even better when looping around in an application
  - Assume it will be translated into a pure OpenCL event graph
  - Runtime uses as many threads & OpenCL queues as necessary (AMD synchronous queues, AMD compute rings, AMD DMA rings...)

Task graph programming — the code

```cpp
#include <CL/sycl.hpp>
#include <iostream>

using namespace cl::sycl;

// Size of the matrices
const size_t N = 2000;
const size_t M = 3000;

int main() {
  { // By sticking all the SYCL work in a {} block, we ensure
    // all SYCL tasks must complete before exiting the block

    // Create a queue to work on
    queue myQueue;
    // Create some 2D buffers of float for our matrices
    buffer<double, 2> a({ N, M });
    buffer<double, 2> b({ N, M });
    buffer<double, 2> c({ N, M });
    // Launch a first asynchronous kernel to initialize a
    myQueue.submit([&](auto &cgh) {
      // The kernel writes a, so get a write accessor on it
      auto A = a.get_access<access::mode::write>(cgh);
      // Enqueue a parallel kernel on a N*M 2D iteration space
      cgh.parallel_for({ N, M },
                       [=](auto index) {
                         A[index] = index[0]*2 + index[1];
                       });
    });
    /* From the access pattern above, the SYCL runtime detects that this
       command_group is independent from the first one and can be
       scheduled independently */
    // Launch an asynchronous kernel to initialize b
    myQueue.submit([&](auto &cgh) {
      // The kernel writes b, so get a write accessor on it
      auto B = b.get_access<access::mode::write>(cgh);
      // Enqueue a parallel kernel on a N*M 2D iteration space
      cgh.parallel_for({ N, M },
                       [=](auto index) {
                         B[index] = index[0]*2014 + index[1]*42;
                       });
    });
    // Launch an asynchronous kernel to compute matrix addition c = a + b
    myQueue.submit([&](auto &cgh) {
      // In the kernel a and b are read, but c is written
      auto A = a.get_access<access::mode::read>(cgh);
      auto B = b.get_access<access::mode::read>(cgh);
      auto C = c.get_access<access::mode::write>(cgh);
      // From these accessors, the SYCL runtime will ensure that when
      // this kernel is run, the kernels computing a and b completed
      // Enqueue a parallel kernel on a N*M 2D iteration space
      cgh.parallel_for({ N, M },
                       [=](auto index) {
                         C[index] = A[index] + B[index];
                       });
    });
    /* Request an access to read c from the host-side. The SYCL runtime
       ensures that c is ready when the accessor is returned */
    auto C = c.get_access<access::mode::read>();
    std::cout << std::endl << "Result:" << std::endl;
    for (size_t i = 0; i < N; i++)
      for (size_t j = 0; j < M; j++)
        std::cout << C[i][j] << " ";
  } /* End scope of myQueue, this waits for any remaining operations on the
       queue to complete */
  std::cout << "Good computation!" << std::endl;
  return 0;
}
```

From work-groups & work-items to hierarchical parallelism

```cpp
const int size = 10;
int data[size];
const int gsize = 2;
buffer<int> my_buffer { data, size };

my_queue.submit([&](auto &cgh) {
  auto in = my_buffer.get_access<access::mode::read>(cgh);
  auto out = my_buffer.get_access<access::mode::write>(cgh);
  // Iterate on the work-groups
  cgh.parallel_for_workgroup({ size, gsize },
                             [=](group<> grp) {
    // Code executed only once per work-group
    std::cerr << "Gid=" << grp[0] << std::endl;
    // Iterate on the work-items of a work-group
    cgh.parallel_for_workitem(grp, [=](item<1> tile) {
      std::cerr << "id = " << tile.get_local()[0]
                << " " << tile.get_global()[0] << std::endl;
      out[tile] = in[tile] * 2;
    });
    // Can have other cgh.parallel_for_workitem() here...
  });
});
```

Very close to the OpenMP 4 style!
• Easy to understand the concept of work-groups
• Easy to write work-group-only code
• Replace barrier-style code with several parallel_for_workitem()
  - Performance-portable between CPU and GPU
  - No need to think about barriers (automatically deduced)
  - Easier to compose components & algorithms
  - Ready for future GPUs with non-uniform work-group size

C++11 allocators

• ∃ C++11 allocators to control the way objects are allocated in memory
  - For example to allocate some vectors on some storage
  - Concept of scoped_allocator to control storage of nested data structures
  - Example: vector of strings, with vector data and string data allocated in different memory areas (speed, power consumption, caching, read-only...)
• SYCL reuses allocators to specify how buffer and image objects are allocated on the host side
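A minimal sketch of the allocator idea, assuming nothing beyond standard C++11: a stateful allocator that counts the bytes a container requests, a toy stand-in for "allocate this buffer in special memory". The class and function names are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stateful allocator that records how many bytes containers request,
// then defers the actual allocation to the default heap
template <typename T>
struct counting_allocator {
  using value_type = T;
  std::size_t *allocated; // shared byte counter owned by the caller
  explicit counting_allocator(std::size_t *c) : allocated(c) {}
  template <typename U>
  counting_allocator(const counting_allocator<U> &o) : allocated(o.allocated) {}
  T *allocate(std::size_t n) {
    *allocated += n * sizeof(T); // record the request...
    return std::allocator<T>{}.allocate(n); // ...then allocate normally
  }
  void deallocate(T *p, std::size_t n) { std::allocator<T>{}.deallocate(p, n); }
};
template <typename T, typename U>
bool operator==(const counting_allocator<T> &a, const counting_allocator<U> &b) {
  return a.allocated == b.allocated;
}
template <typename T, typename U>
bool operator!=(const counting_allocator<T> &a, const counting_allocator<U> &b) {
  return !(a == b);
}

std::size_t bytes_through_allocator() {
  std::size_t count = 0;
  std::vector<int, counting_allocator<int>> v(counting_allocator<int>{&count});
  v.reserve(100); // all storage for the 100 ints goes through our allocator
  return count;
}
```

Swap the body of `allocate()` for pinned, file-backed or device-visible memory and the same mechanism gives a SYCL-style buffer allocator.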

C++... putting everything altogether


Exascale-ready

• Use your own C++ compiler
  - Only kernel outlining needs the SYCL compiler
• SYCL with C++ can address most of the hierarchy levels
  - MPI
  - OpenMP
  - C++-based PGAS (Partitioned Global Address Space) DSeL (Domain-Specific embedded Language, such as Coarray C++...)
  - Remote accelerators in clusters
  - Use the SYCL buffer allocator for
    - RDMA
    - Out-of-core, mapping to a file
    - PiM (Processor in Memory)
    - ...

Debugging

• Difficult to debug code or detect precondition violations on GPU, and at large...
• Rely on C++ to help debugging
  - Overload some operations and functions to verify preconditions
  - Hide tracing/verification code in constructors/destructors
  - Can use the pure-C++ host implementation for bug-tracking with your favorite debugger
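The "hide tracing in constructors/destructors" point is plain RAII; a minimal sketch, with a global log string and names invented for illustration:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Global trace log (a real implementation would use a proper sink)
std::string trace_log;

// RAII tracer: construction and destruction bracket the traced scope
struct tracer {
  std::string name;
  explicit tracer(std::string n) : name(std::move(n)) {
    trace_log += "enter " + name + ";"; // trace on entry
  }
  ~tracer() { trace_log += "exit " + name + ";"; } // trace on exit, even on early return
};

void traced_kernel() {
  tracer t{"kernel"};
  // ... the real work (and precondition checks) would go here ...
}
```

Because the destructor always runs, the exit record is emitted on every path out of the scope, which is exactly what makes this pattern convenient for instrumenting a host fall-back implementation.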

Poor-man SVM with C++11 + SYCL

• For complex data structures
  - Objects need to be in buffers to be shipped between CPU and devices
  - Do not want to marshal/unmarshal objects...
  - Use a C++11 allocator to allocate some objects in 1 SYCL buffer
    - Useful to send data efficiently through MPI and RDMA too!
  - But since there is no SVM, not the same address on the CPU and GPU side...
    - How to deal with pointers?
    - Override all pointer accesses (for example use std::pointer_traits) to do address translation on the kernel side: cost is 1 addition per *p
• When no or inefficient SVM...
  - Also a useful optimization when you need to work on a copy only on the GPU
    - Only allocation on the GPU side
    - Spares some TLB thrashing on the CPU
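A sketch of the translation idea in plain C++: store offsets from the buffer base instead of raw addresses, so the structure stays valid when the buffer lands at a different address on the device. Names (`offset_ptr`, `deref`) are illustrative, not from the talk:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A "pointer" that is just an offset into some buffer
struct offset_ptr { std::ptrdiff_t off; };

// Dereference = base + offset: the 1-addition cost mentioned above
inline int &deref(char *base, offset_ptr p) {
  return *reinterpret_cast<int *>(base + p.off);
}

int roundtrip() {
  std::vector<char> host(64);      // the "buffer" as seen by the host
  offset_ptr p{16};
  deref(host.data(), p) = 42;      // write through the host base address
  std::vector<char> device = host; // bitwise copy to a different address
  return deref(device.data(), p);  // same offset, new base: still valid
}
```

A production version would wrap this in a fancy-pointer class usable through std::pointer_traits so containers can store it transparently.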

¿¿¿Fortran???

• Fortran 2003 introduces C interoperability, which can be used for C++/SYCL interoperability...
• C++ boost::multi_array & others provide à la Fortran arrays
  - Allow triplet notation
  - Can be used from inside SYCL to deal with Fortran-like arrays
• Perhaps the right time to switch your application to modern C++? ☺

Using SYCL-like models in other areas

• SYCL ≡ generic heterogeneous computing model beyond OpenCL
  - queue expresses where computations happen
  - parallel_for launches computations
  - accessor defines the way we access data
  - buffer for storing data
  - allocator for defining how data are allocated/backed
• Example for HSA: almost direct mapping à la OpenCL
• Example in the PiM world
  - Use queue to run on some PiM chips
  - Use allocator to distribute data structures or to allocate buffer in special memory (memory page, chip...)
  - Use accessor to use alternative data access (split address from computation, streaming only, PGAS...)
  - Use pointer_traits to use a specific way to interact with memory
  - ...

OpenCL SYCL 2.1...


SYCL 2.1 is coming!

• Skip directly to OpenCL 2.1 and C++14
• Kernel-side enqueue
• Shared virtual memory between host and accelerator
• Parallel STL (C++17)
• Array TS

SYCL and fine-grain system shared memory (OpenCL 2)

```cpp
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>

using namespace cl::sycl;

int main() {
  std::vector<int> a { 1, 2, 3 };
  std::vector<int> b { 5, 6, 8 };
  std::vector<int> c(a.size());
  // Enqueue a parallel kernel
  parallel_for(a.size(),
               [&](int index) {
                 c[index] = a[index] + b[index];
               });
  // Since there is no queue or accessor, we assume parallel_for are blocking kernels
  std::cout << std::endl << "Result:" << std::endl;
  for (auto e : c)
    std::cout << e << " ";
  std::cout << std::endl;
  return 0;
}
```

• Very close to OpenMP simplicity
• Can still use buffers & accessors for compatibility & finer control (task graph, optimizations...)
  - SYCL can remove the copy when possible


Conclusion

• ∃ many C++ frameworks to leverage OpenCL
  - None of them provides seamless single source
    - They require some kind of macros & weird syntax
  - But they should be preferred to plain OpenCL C for productivity
• SYCL provides seamless single source with OpenCL interoperability
  - Can be used to improve other higher-level frameworks
• SYCL ≡ pure C++ integration with other C/C++ HPC frameworks: OpenCL, OpenMP, libraries (MPI, numerical), C++ DSeL (PGAS...)...
• SYCL is also interesting as a co-design tool for architectural & programming model research (PiM, Near-Memory Computing, various computing models...)
• Modern C++ is not just a C program in a .cpp file: invest in learning modern C++ ☺
