Khronos Group SYCL standard: triSYCL Open Source Implementation
Ronan Keryell, Xilinx Research Labs
SC 2016

C++
• Direct mapping to hardware
• Zero-overhead abstraction

Even better with modern C++ (C++14, C++17)
• Huge library improvements
• Simpler syntax:

    std::vector<int> my_vector { 1, 2, 3, 4, 5 };
    for (auto &e : my_vector)
      e += 1;

• Automatic type inference for terse generic programming
  – Python 3.x (interpreted):

        def add(x, y):
            return x + y
        print(add(2, 3))       # 5
        print(add("2", "3"))   # 23
        print(add(2, "Boom"))  # Fails at run-time :-(

  – The same in C++14, but compiled, with static compile-time type-checking:

        auto add = [](auto x, auto y) { return x + y; };
        std::cout << add(2, 3) << std::endl;        // 5
        std::cout << add("2"s, "3"s) << std::endl;  // 23
        std::cout << add(2, "Boom"s) << std::endl;  // Does not compile :-)

  – All this without writing templated code (no explicit template <typename ...> needed)

Power wall: the final frontier...
• Current physical limits
  – Power consumption
    · Cannot power on all the transistors without melting (dark silicon)
    · Accessing memory consumes orders of magnitude more energy than a simple computation
    · Moving data inside a chip costs far more than a computation
  – Speed of light
    · Accessing memory takes the time of 10^4+ CPU instructions
    · Even moving data across the chip (cache) is slow at 1+ GHz...
• Consequences for architecture:
  – Specialize the architecture
  – Use locality & hierarchy
  – Massive parallelism
  – NUMA & distributed memories
  – Power on only what is required
  – Use hardware reconfiguration

Xilinx Zynq UltraScale+ MPSoC Overview

[Figure: block diagram of the Xilinx Zynq UltraScale+ MPSoC]

What about C++ for heterogeneous computing?
• C++ std::thread is great...
• ...but it assumes shared unified memory (SMP)
  – What if the accelerator has its own separate memory, or not the same address space?
  – What if we use a distributed-memory multi-processor system (MPI...)?
• Extend the concepts...
  – Replace raw unified memory with buffer objects
  – Define with accessor objects which buffers are used, and how
  – Since accessors are already there to define dependencies, no need for std::future/std::promise any more!
  – Add the concept of a queue to express where to run a task
  – Also add all the goodies for massively parallel accelerators (OpenCL/Vulkan/SPIR-V) in clean C++

Complete example of matrix addition in OpenCL SYCL

    #include <CL/sycl.hpp>
    #include <iostream>

    using namespace cl::sycl;

    // Size of the matrices
    constexpr size_t N = 2;
    constexpr size_t M = 3;
    using Matrix = float[N][M];

    // Compute the sum of matrices a and b into c
    int main() {
      Matrix a = { { 1, 2, 3 }, { 4, 5, 6 } };
      Matrix b = { { 2, 3, 4 }, { 5, 6, 7 } };
      Matrix c;

      { // Create a queue to work on the default device
        queue q;
        // Wrap some buffers around our data
        buffer A { &a[0][0], range { N, M } };
        buffer B { &b[0][0], range { N, M } };
        buffer C { &c[0][0], range { N, M } };
        // Enqueue some computation kernel task
        q.submit([&](handler &cgh) {
          // Define the data used/produced
          auto ka = A.get_access<access::mode::read>(cgh);
          auto kb = B.get_access<access::mode::read>(cgh);
          auto kc = C.get_access<access::mode::write>(cgh);
          // Create & call a kernel named "mat_add"
          cgh.parallel_for<class mat_add>(range { N, M },
                                          [=](id<2> i) { kc[i] = ka[i] + kb[i]; });
        }); // End of our commands for this queue
      } // End scope, so wait for the buffers to be released
      // Copy back the buffer data with RAII behaviour
      std::cout << "c[0][2] = " << c[0][2] << std::endl;
      return 0;
    }
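A note on getting the results back: the example above relies on the buffer destructors at the end of the block to copy c to the host. A minimal sketch of the alternative used later in this deck, a host accessor on the buffer C from the example (only valid inside the block, while C is still alive):

    // Sketch: request read access from the host side; the SYCL runtime
    // blocks until the kernels producing C have completed, then the data
    // can be read without waiting for buffer destruction
    auto hc = C.get_access<access::mode::read, access::target::host_buffer>();
    std::cout << "c[0][2] = " << hc[id<2> { 0, 2 }] << std::endl;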
std::cout <<"c[0][2] = " << c[0][2] << std::endl; {// Createa queue to work on default device r e t u r n 0; queue q ; } // Wrap some buffers around our data buffer A { &a[0][0], range { N, M } }; Khronos Group SYCL standard 7 / 27 • I Asynchronous task graph model • Change example with initialization kernels instead of host?... • Theoretical graph of an application described implicitly with kernel tasks using buffers through accessors cl::sycl::accessor<write> cl::sycl::accessor<read> init_a cl::sycl::buffer a cl::sycl::accessor<write> matrix_add cl::sycl::buffer c Display init_b cl::sycl::buffer b cl::sycl::accessor<read> cl::sycl::accessor<write> cl::sycl::accessor<read> • Possible schedule by SYCL runtime: init_b init_a matrix_add Display Automatic overlap of kernels & communications I Even better when looping around in an application ; I Assume it will be translated into pure OpenCL event graph I Runtime uses as many threads & OpenCL queues as necessary (GPU synchronous queues, AMD compute rings, AMD DMA rings...) Khronos Group SYCL standard 8 / 27 • I Task graph programming — the code #include <CL/sycl.hpp> [ = ] (auto index) { #include <iostream> B[index] = index[0]∗2014 + index[1]∗42; using namespace cl::sycl; }); // Size of the matrices }); constexpr size_t N = 2000; // Launch an asynchronous kernel to compute matrix additionc=a+b constexpr size_t M = 3000; q.submit([&](auto &cgh) { i n t main() { // In the kernela andb are read, butc is written {// By sticking all the SYCL work ina{} block, we ensure auto A = a.get_access<access::mode::read>(cgh); // all SYCL tasks must complete before exiting the block auto B = b.get_access<access::mode::read>(cgh); auto C = c.get_access<access::mode:: write >(cgh); // Createa queue to work on default device // From these accessors, the SYCL runtime will ensure that when queue q ; // this kernel is run, the kernels computinga andb completed // Create some2D buffers of float for our matrices b u f f e r <double, 2> a({ N, M }); // Enqueuea parallel kernel onaN ∗M2D iteration space b u f f e r <double, 2> b({ N, M }); cgh.parallel_for <class matrix_add>({ N, M }, b u f f e r <double, 2> c({ N, M }); [ = ] (auto index) { // Launcha first asynchronous kernel to initializea C[index] = A[index] + B[index]; q.submit([&](auto &cgh) { }); // The kernel writea, so geta write accessor on it }); auto A = a.get_access<access::mode:: write >(cgh); /∗ Request an access to readc from the host −side. The SYCL runtime ensures thatc is ready when the accessor is returned ∗/ // Enqueue parallel kernel onaN ∗M2D iteration space auto C = c.get_access<access::mode::read, access:: target :: host_buffer >(); cgh.parallel_for <class init_a>({ N, M }, std::cout << std::endl <<"Result:" << std::endl; [ = ] (auto index) { f o r(size_t i =0; i <N; i++) A[index] = index[0]∗2 + index[1]; f o r(size_t j = 0; j <M; j++) }); // Compare the result to the analytic value }); i f (C[i][j] != i ∗(2 + 2014) + j ∗(1 + 42)) { // Launch an asynchronous kernel to initializeb std::cout <<"Wrong value " << C[i][j] <<" on element " q.submit([&](auto &cgh) { << i <<’ ’ << j << std::endl; // The kernel writeb, so geta write accessor on it e x i t (−1); auto B = b.get_access<access::mode:: write >(cgh); } /∗ From the access pattern above, the SYCL runtime detect } t h i s command_group is independant from the first one std::cout <<"Good computation!" 
Remember the OpenCL execution model?

[Figure: OpenCL execution model; a GPU device contains compute units (CUs), each made of processing elements; the ND-range is divided into work-groups executing on compute units, and each work-group into work-items executing on processing elements]

From work-groups & work-items to hierarchical parallelism (I)

    // Launch a 1D convolution filter
    my_queue.submit([&](handler &cgh) {
      auto in_access = inputB.get_access<access::mode::read>(cgh);
      auto filter_access = filterB.get_access<access::mode::read>(cgh);
      auto out_access = outputB.get_access<access::mode::write>(cgh);
      // Iterate on all the work-groups
      cgh.parallel_for_work_group<class convolution>(
          { size, groupsize },
          [=](group<> group) {
        std::cout << "Group id = " << group.get(0) << std::endl;
        // These are OpenCL local variables used as a cache
        float filterLocal[2*radius + 1];
        float localData[blocksize + 2*radius];
        float convolutionResult[blocksize];
        range<1> filterRange { 2*radius + 1 };
        // Iterate on filterRange work-items
        group.parallel_for_work_item(filterRange, [&](item<1> tile) {
          filterLocal[tile] = filter_access[tile];
        });
        // There is an implicit barrier here
        range<1> inputRange { blocksize + 2*radius };
        // Iterate on inputRange work-items
        group.parallel_for_work_item(inputRange, [&](item<1> tile) {
          float val = 0.f;
          int readAddress = group*blocksize + tile - radius;
          if (readAddress >= 0 && readAddress < size)
            val = in_access[readAddress];
          localData[tile] = val;
        });
        // There is an implicit barrier here
        // Iterate on all the work-items
        group.parallel_for_work_item([&](item<1> tile) {
          float sum = 0.f;
          for (unsigned offset = 0; offset < radius; ++offset)
            sum += filterLocal[offset]*localData[tile + offset + radius];
          float result = sum/(2*radius + 1);
          convolutionResult[tile] = result;
        });
        // There is an implicit barrier here
        // Iterate on all the work-items
        group.parallel_for_work_item([&](item<1> tile) {
          out_access[group*blocksize + tile] = convolutionResult[tile];
        });
      });
    });

From work-groups & work-items to hierarchical parallelism (II)
• Very close to the OpenMP 4 style!
• Easy to understand the concept of work-groups
• Easy to write work-group-only code
• Replace code + barriers with several parallel_for_work_item() calls
  – Performance-portable between CPU and device
  – No need to think about barriers (they are deduced automatically)
  – Easier to compose components & algorithms
  – Ready for future devices with non-uniform work-group sizes

Pipes in OpenCL 2.x

[Figure: a pipe as a FIFO between an Input end and an Output end; packets r_1 to r_4 have already been read, some packets are currently readable, and packets w_5 to w_17 have been written and are waiting]

• Simple FIFO objects
• Useful to create producer-consumer dataflow between kernels
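To make the FIFO behaviour concrete, here is a hypothetical host-side analogy in plain C++ (not the OpenCL pipe API, and not from the slides): a bounded queue connecting a producer thread and a consumer thread, which is essentially the service a pipe provides between kernels on a device.

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    int main() {
      std::queue<int> fifo;            // the FIFO storage
      constexpr unsigned capacity = 4; // bounded, like a pipe
      std::mutex m;
      std::condition_variable cv;
      bool done = false;

      // Producer: blocks while the FIFO is full, as a kernel writing
      // to a full pipe would
      std::thread producer { [&] {
        for (int i = 0; i < 10; ++i) {
          std::unique_lock lock { m };
          cv.wait(lock, [&] { return fifo.size() < capacity; });
          fifo.push(i);
          cv.notify_all();
        }
        std::lock_guard lock { m };
        done = true;
        cv.notify_all();
      } };

      // Consumer: blocks while the FIFO is empty, as a kernel reading
      // from an empty pipe would
      std::thread consumer { [&] {
        for (;;) {
          std::unique_lock lock { m };
          cv.wait(lock, [&] { return !fifo.empty() || done; });
          if (fifo.empty())
            break;
          std::cout << fifo.front() << std::endl;
          fifo.pop();
          cv.notify_all();
        }
      } };

      producer.join();
      consumer.join();
    }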