Khronos Group SYCL standard
triSYCL Open Source Implementation
Ronan Keryell, Xilinx Research Labs
SC 2016

C++
• Direct mapping to hardware
• Zero-overhead abstraction

Even better with modern C++ (C++14, C++17)
• Huge library improvements
• Simpler syntax

    std::vector<int> my_vector { 1, 2, 3, 4, 5 };
    for (auto &e : my_vector)
      e += 1;

• Automatic type inference for terse generic programming
  ▶ Python 3.x (interpreted):

      def add(x, y):
          return x + y
      print(add(2, 3))       # 5
      print(add("2", "3"))   # 23
      print(add(2, "Boom"))  # Fails at run-time :-(

  ▶ Same in C++14, but compiled, with static compile-time type-checking:

      using namespace std::string_literals; // for the "s" suffix
      auto add = [] (auto x, auto y) { return x + y; };
      std::cout << add(2, 3) << std::endl;        // 5
      std::cout << add("2"s, "3"s) << std::endl;  // 23
      std::cout << add(2, "Boom"s) << std::endl;  // Does not compile :-)

  Without using templated code! (no explicit template <typename> needed)

Power wall: the final frontier...
• Current physical limits
  ▶ Power consumption
    Cannot power on all the transistors without melting (dark silicon)
    Accessing memory consumes orders of magnitude more energy than a simple computation
    Moving data inside a chip costs quite a bit more than a computation
  ▶ Speed of light
    Accessing memory takes the time of 10^4+ CPU instructions
    Even moving data across the chip (cache) is slow at 1+ GHz...
• Specialize architecture
• Use locality & hierarchy
• Massive parallelism
• NUMA & distributed memories
• Power on only what is required
• Use hardware reconfiguration

Xilinx Zynq UltraScale+ MPSoC Overview

What about C++ for heterogeneous computing???
• C++ std::thread is great...
• ...but it assumes shared unified memory (SMP)
  ▶ What if the accelerator has its own separate memory, not in the same address space?
  ▶ What if you use a distributed-memory multi-processor system (MPI...)?
• Extend the concepts...
  ▶ Replace raw unified memory with buffer objects
  ▶ Define with accessor objects which buffers are used and how
  ▶ Since accessors are already there to define dependencies, there is no longer any need for std::future/std::promise (see the contrast sketch below)
  ▶ Add the concept of a queue to express where to run the task
  ▶ Also add all the goodies for massively parallel accelerators (OpenCL/Vulkan/SPIR-V) in clean C++
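
For contrast with the std::future/std::promise bullet above, here is a minimal sketch of what the same dependency plumbing looks like in plain host C++ with std::async and std::future, which is exactly the boilerplate that SYCL accessors make implicit. This is not SYCL code; the task names init_a, init_b and matrix_add simply mirror the example on the following slides and are chosen for illustration.

#include <future>
#include <iostream>
#include <vector>

int main() {
  constexpr std::size_t N = 2, M = 3;
  std::vector<float> a(N*M), b(N*M), c(N*M);

  // With plain futures, every dependency has to be wired by hand
  auto init_a = std::async(std::launch::async, [&] {
    for (std::size_t i = 0; i < N*M; ++i) a[i] = i;
  });
  auto init_b = std::async(std::launch::async, [&] {
    for (std::size_t i = 0; i < N*M; ++i) b[i] = 2*i;
  });

  // matrix_add must explicitly wait for both producers...
  auto matrix_add = std::async(std::launch::async, [&] {
    init_a.get();
    init_b.get();
    for (std::size_t i = 0; i < N*M; ++i) c[i] = a[i] + b[i];
  });

  // ...and the consumer must explicitly wait for matrix_add before reading c
  matrix_add.get();
  std::cout << "c[5] = " << c[5] << std::endl; // prints c[5] = 15
  return 0;
}

In the SYCL version shown next, none of this wiring appears in user code: declaring read accessors on a and b and a write accessor on c inside each command group is enough for the runtime to derive the same ordering.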
std::cout <<"c[0][2] = " << c[0][2] << std::endl; {// Createa queue to work on default device r e t u r n 0; queue q ; } // Wrap some buffers around our data buffer A { &a[0][0], range { N, M } }; Khronos Group SYCL standard 7 / 27 • I Asynchronous task graph model • Change example with initialization kernels instead of host?... • Theoretical graph of an application described implicitly with kernel tasks using buffers through accessors cl::sycl::accessor<write> cl::sycl::accessor<read> init_a cl::sycl::buffer a cl::sycl::accessor<write> matrix_add cl::sycl::buffer c Display init_b cl::sycl::buffer b cl::sycl::accessor<read> cl::sycl::accessor<write> cl::sycl::accessor<read> • Possible schedule by SYCL runtime: init_b init_a matrix_add Display Automatic overlap of kernels & communications I Even better when looping around in an application ; I Assume it will be translated into pure OpenCL event graph I Runtime uses as many threads & OpenCL queues as necessary (GPU synchronous queues, AMD compute rings, AMD DMA rings...) Khronos Group SYCL standard 8 / 27 • I Task graph programming — the code #include <CL/sycl.hpp> [ = ] (auto index) { #include <iostream> B[index] = index[0]∗2014 + index[1]∗42; using namespace cl::sycl; }); // Size of the matrices }); constexpr size_t N = 2000; // Launch an asynchronous kernel to compute matrix additionc=a+b constexpr size_t M = 3000; q.submit([&](auto &cgh) { i n t main() { // In the kernela andb are read, butc is written {// By sticking all the SYCL work ina{} block, we ensure auto A = a.get_access<access::mode::read>(cgh); // all SYCL tasks must complete before exiting the block auto B = b.get_access<access::mode::read>(cgh); auto C = c.get_access<access::mode:: write >(cgh); // Createa queue to work on default device // From these accessors, the SYCL runtime will ensure that when queue q ; // this kernel is run, the kernels computinga andb completed // Create some2D buffers of float for our matrices b u f f e r <double, 2> a({ N, M }); // Enqueuea parallel kernel onaN ∗M2D iteration space b u f f e r <double, 2> b({ N, M }); cgh.parallel_for <class matrix_add>({ N, M }, b u f f e r <double, 2> c({ N, M }); [ = ] (auto index) { // Launcha first asynchronous kernel to initializea C[index] = A[index] + B[index]; q.submit([&](auto &cgh) { }); // The kernel writea, so geta write accessor on it }); auto A = a.get_access<access::mode:: write >(cgh); /∗ Request an access to readc from the host −side. The SYCL runtime ensures thatc is ready when the accessor is returned ∗/ // Enqueue parallel kernel onaN ∗M2D iteration space auto C = c.get_access<access::mode::read, access:: target :: host_buffer >(); cgh.parallel_for <class init_a>({ N, M }, std::cout << std::endl <<"Result:" << std::endl; [ = ] (auto index) { f o r(size_t i =0; i <N; i++) A[index] = index[0]∗2 + index[1]; f o r(size_t j = 0; j <M; j++) }); // Compare the result to the analytic value }); i f (C[i][j] != i ∗(2 + 2014) + j ∗(1 + 42)) { // Launch an asynchronous kernel to initializeb std::cout <<"Wrong value " << C[i][j] <<" on element " q.submit([&](auto &cgh) { << i <<’ ’ << j << std::endl; // The kernel writeb, so geta write accessor on it e x i t (−1); auto B = b.get_access<access::mode:: write >(cgh); } /∗ From the access pattern above, the SYCL runtime detect } t h i s command_group is independant from the first one std::cout <<"Good computation!" 
Task graph programming — the code

#include <CL/sycl.hpp>
#include <iostream>

using namespace cl::sycl;

// Size of the matrices
constexpr size_t N = 2000;
constexpr size_t M = 3000;

int main() {
  { // By sticking all the SYCL work in a {} block, we ensure
    // all SYCL tasks must complete before exiting the block

    // Create a queue to work on the default device
    queue q;

    // Create some 2D buffers of double for our matrices
    buffer<double, 2> a({ N, M });
    buffer<double, 2> b({ N, M });
    buffer<double, 2> c({ N, M });

    // Launch a first asynchronous kernel to initialize a
    q.submit([&](auto &cgh) {
      // The kernel writes a, so get a write accessor on it
      auto A = a.get_access<access::mode::write>(cgh);

      // Enqueue a parallel kernel on a N*M 2D iteration space
      cgh.parallel_for<class init_a>({ N, M },
        [=](auto index) {
          A[index] = index[0]*2 + index[1];
        });
    });

    // Launch an asynchronous kernel to initialize b
    q.submit([&](auto &cgh) {
      // The kernel writes b, so get a write accessor on it
      auto B = b.get_access<access::mode::write>(cgh);
      /* From the access pattern above, the SYCL runtime detects that
         this command_group is independent from the first one
         and can be scheduled independently */

      // Enqueue a parallel kernel on a N*M 2D iteration space
      cgh.parallel_for<class init_b>({ N, M },
        [=](auto index) {
          B[index] = index[0]*2014 + index[1]*42;
        });
    });

    // Launch an asynchronous kernel to compute matrix addition c = a + b
    q.submit([&](auto &cgh) {
      // In the kernel a and b are read, but c is written
      auto A = a.get_access<access::mode::read>(cgh);
      auto B = b.get_access<access::mode::read>(cgh);
      auto C = c.get_access<access::mode::write>(cgh);
      // From these accessors, the SYCL runtime will ensure that when
      // this kernel is run, the kernels computing a and b have completed

      // Enqueue a parallel kernel on a N*M 2D iteration space
      cgh.parallel_for<class matrix_add>({ N, M },
        [=](auto index) {
          C[index] = A[index] + B[index];
        });
    });

    /* Request an access to read c from the host side. The SYCL runtime
       ensures that c is ready when the accessor is returned */
    auto C = c.get_access<access::mode::read, access::target::host_buffer>();

    std::cout << std::endl << "Result:" << std::endl;
    for (size_t i = 0; i < N; i++)
      for (size_t j = 0; j < M; j++)
        // Compare the result to the analytic value
        if (C[i][j] != i*(2 + 2014) + j*(1 + 42)) {
          std::cout << "Wrong value " << C[i][j] << " on element "
                    << i << ' ' << j << std::endl;
          exit(-1);
        }
  }
  std::cout << "Good computation!" << std::endl;
  return 0;
}

Remember the OpenCL execution model?

(Figure: an ND-range split into work-groups; each work-group maps onto a compute unit of the GPU device and its work-items onto that compute unit's processing elements.)

From work-groups & work-items to hierarchical parallelism (I)

// Launch a 1D convolution filter
my_queue.submit([&](handler &cgh) {
  auto in_access = inputB.get_access<access::mode::read>(cgh);
  auto filter_access = filterB.get_access<access::mode::read>(cgh);
  auto out_access = outputB.get_access<access::mode::write>(cgh);
  // Iterate on all the work-groups
  cgh.parallel_for_work_group<class convolution>({ size, groupsize },
    [=](group<> group) {
      std::cout << "Group id = " << group.get(0) << std::endl;
      // These are OpenCL local variables used as a cache
      float filterLocal[2*radius + 1];
      float localData[blocksize + 2*radius];
      float convolutionResult[blocksize];
      range<1> filterRange { 2*radius + 1 };
      // Iterate on filterRange work-items
      group.parallel_for_work_item(filterRange, [&](item<1> tile) {
        filterLocal[tile] = filter_access[tile];
      });
      // There is an implicit barrier here
      range<1> inputRange { blocksize + 2*radius };
      // Iterate on inputRange work-items
      group.parallel_for_work_item(inputRange, [&](item<1> tile) {
        float val = 0.f;
        int readAddress = group*blocksize + tile - radius;
        if (readAddress >= 0 && readAddress < size)
          val = in_access[readAddress];
        localData[tile] = val;
      });
      // There is an implicit barrier here
      // Iterate on all the work-items
      group.parallel_for_work_item([&](item<1> tile) {
        float sum = 0.f;
        for (unsigned offset = 0; offset < radius; ++offset)
          sum += filterLocal[offset]*localData[tile + offset + radius];
        float result = sum/(2*radius + 1);
        convolutionResult[tile] = result;
      });
      // There is an implicit barrier here
      // Iterate on all the work-items
      group.parallel_for_work_item(group, [&](item<1> tile) {
        out_access[group*blocksize + tile] = convolutionResult[tile];
      });
      // There is an implicit barrier here
    });
});

From work-groups & work-items to hierarchical parallelism (II)
• Very close to OpenMP 4 style!
• Easy to understand the concept of work-groups
• Easy to write work-group-only code
• Replace code + barriers with several parallel_for_work_item()
  ▶ Performance-portable between CPU and device
  ▶ No need to think about barriers (automatically deduced)
  ▶ Easier to compose components & algorithms
  ▶ Ready for future devices with non-uniform work-group size

Pipes in OpenCL 2.x

(Figure: a pipe as a window over a stream between an input end and an output end: entries r_1 to r_4 already read, a readable region, and written entries w_5 to w_17 still queued.)

• Simple FIFO objects (a plain C++ sketch of the concept follows)
• Useful to create...
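
To make the FIFO idea concrete, here is a minimal plain-C++ sketch of a bounded queue with blocking write and read between a producer thread and a consumer thread. It is ordinary host C++ for illustration only, not the OpenCL 2.x pipe API nor a SYCL pipe extension; BoundedFifo and its write/read members are names invented for this sketch.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// A bounded FIFO with blocking write (when full) and blocking read (when
// empty), mimicking the behaviour of a pipe between a producer and a consumer
template <typename T>
class BoundedFifo {
  std::queue<T> storage;
  std::size_t capacity;
  std::mutex m;
  std::condition_variable not_full, not_empty;

public:
  explicit BoundedFifo(std::size_t capacity) : capacity { capacity } {}

  void write(T value) {
    std::unique_lock<std::mutex> lock { m };
    not_full.wait(lock, [&] { return storage.size() < capacity; });
    storage.push(std::move(value));
    not_empty.notify_one();
  }

  T read() {
    std::unique_lock<std::mutex> lock { m };
    not_empty.wait(lock, [&] { return !storage.empty(); });
    T value = std::move(storage.front());
    storage.pop();
    not_full.notify_one();
    return value;
  }
};

int main() {
  BoundedFifo<int> pipe { 4 }; // small capacity to exercise back-pressure

  // Producer: writes 10 values into the pipe
  std::thread producer { [&] {
    for (int i = 0; i < 10; ++i)
      pipe.write(i);
  } };

  // Consumer: reads the 10 values back in FIFO order
  std::thread consumer { [&] {
    for (int i = 0; i < 10; ++i)
      std::cout << pipe.read() << ' ';
    std::cout << std::endl;
  } };

  producer.join();
  consumer.join();
  return 0;
}

OpenCL 2.x pipes expose this same FIFO idea directly between kernels, so data can be streamed from one kernel to the next without round-tripping through global memory buffers.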
