SYCL in the OpenVX ecosystem

Andrew Richards, Embedded Vision Summit, May 2017

Khronos Promoter Members

Over 100 members worldwide. Any company is welcome to join.

Who am I?

• Chair of the SYCL group
• Chair of the HSA Software Group
• CEO of Codeplay
  - We built a C/C++ compiler for GPUs in 2002
  - 60 staff in Edinburgh, Scotland
  - We build programming tools for heterogeneous processors
  - OpenCL, SYCL, + others

How does SYCL fit into OpenVX, vision & AI?

1. You need a graph system for AI/vision → OpenVX

2. You need hand-coded kernels for common tasks → OpenVX

3. You need to be able to write custom operations → SYCL

What is SYCL for?

• Modern C++ lets us separate the what from the how:

- We want to separate what the user wants to do: science, computer vision, AI …

- And enable the how to be: run fast on an OpenCL device

• Modern C++ supports and encourages this separation

What we want to achieve

• We want to enable a C++ ecosystem for OpenCL:

- Must run on OpenCL devices: GPUs, CPUs, FPGAs, DSPs etc

- C++ template libraries

- Tools: compilers, debuggers, IDEs, optimizers

- Training, example programs

- Long-term support for current and future OpenCL features

Why a new standard?

• There are already well-established ways to map C++ to parallel processors
  - So we follow the established approaches

• There are specifics of OpenCL that we need to map to C++
  - We have worked hard to be an enabler for other C++ parallelism standards (see http://imgs.xkcd.com/comics/standards.png)

• We add no more than we need to

Where does SYCL fit in?

OpenCL / SYCL Stack

User application code

C++ template libraries

SYCL for OpenCL

OpenCL Devices

CPU FPGA GPU DSP

Philosophy

• With SYCL, we wanted to align with the direction the C++ standard is going
  - And we also need to future-proof for future OpenCL device capabilities

• Key decisions:

- We will not add any language extensions to C++

- We will work with existing C++ compilers

- We will provide the full OpenCL feature-set in C++

- Everything must compile and run on the host as well as an OpenCL device (see the sketch below)
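A minimal sketch of that last decision, not taken from the slides: the helper name make_queue and the fallback logic are illustrative, using standard SYCL 1.2 classes.

    #include <CL/sycl.hpp>

    // Pick an OpenCL GPU if one is present, otherwise fall back to the SYCL host
    // device, so the same code always compiles and runs somewhere.
    cl::sycl::queue make_queue() {
      try {
        return cl::sycl::queue(cl::sycl::gpu_selector{});   // OpenCL GPU device
      } catch (cl::sycl::exception const &) {
        return cl::sycl::queue(cl::sycl::host_selector{});  // host fallback device
      }
    }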

Where does SYCL fit in? – Language style

C++ embedded DSLs
- e.g.: Sh/RapidMind, Halide, .compute
- A C++ template library uses overloading to build up an expression tree to compile at runtime:
    Vector a, b;
    auto expr = a + b;
    Vector r = expr.eval();
- Pros: Works with existing C++ compilers
- Cons: compile-time compilation, control flow, composability

C++ kernel languages
- e.g.: GLSL, OpenCL C and C++ kernel languages
- Host (CPU) code loads and compiles the kernel for a specific device, sets args and runs:
    Kernel myKernel;
    myKernel.load("myKernel");
    myKernel.compile();
    myKernel.setArg(0, a);
    float r = myKernel.run();
  with the kernel written separately:
    float myKernel(float arg) { return arg * 456.7f; }
- Pros: Explicit offload; independent host/device code & compilers; run-time adaptation; popular in graphics
- Cons: Hard to compose cross-device

C++ single-source
- e.g.: SYCL, CUDA, OpenMP, C++ AMP
- A single source file contains code for both host & device:
    Vector a, b, r;
    parallel_for(a.range(), [&](int id) {
      r[id] = a[id] + b[id];
    });
- Pros: Composability; easy to use; offline compilation and validation
- Cons: host/device compiler conflict

Comparison of SYCL & OpenVX

• SYCL is a general programming model; OpenVX is a vision graph system

• SYCL makes you write your own graph system; OpenVX distributes a graph across an entire system

• SYCL makes you write your own nodes; OpenVX uses built-in nodes

In AI applications, we see:

• People need pre-optimized graph nodes
• People need to optimize whole graphs
• Developers/researchers need to write their own nodes

Comparison of SYCL & CUDA

CUDA:

    #include <iostream>
    #include <math.h>

    // Kernel function to add the elements of two arrays
    __global__
    void add(int n, float *x, float *y)
    {
      int index = threadIdx.x;
      int stride = blockDim.x;
      for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
    }
    ...
    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));
    ...
    // Run kernel on 1M elements on the GPU
    add<<<1, 256>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

SYCL:

    #include <CL/sycl.hpp>

    // function to add the elements of two arrays
    void add(cl::sycl::nd_item<1> item, int n,
             cl::sycl::global_ptr<float> x, cl::sycl::global_ptr<float> y)
    {
      int index = item.get_local(0);
      int stride = item.get_local_range(0);
      for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
    }
    ...
    // encapsulate data in SYCL buffers
    cl::sycl::buffer<float> x(N);
    cl::sycl::buffer<float> y(N);
    ...
    { // create a scope to define the lifetime of the SYCL objects
      // create a SYCL queue for a GPU
      cl::sycl::gpu_selector selectgpu;
      cl::sycl::device gpu_device(selectgpu);
      cl::sycl::queue gpu_queue(gpu_device);

      // submit this work to the SYCL queue
      gpu_queue.submit([&](cl::sycl::handler &cgh) {
        // request access to the data on the OpenCL GPU
        auto aX = x.get_access<cl::sycl::access::mode::read_write>(cgh);
        auto aY = y.get_access<cl::sycl::access::mode::read_write>(cgh);
        // Run kernel on 1M elements on the OpenCL GPU
        cgh.parallel_for<class add_kernel>(
            cl::sycl::nd_range<1>(cl::sycl::range<1>(256),
                                  cl::sycl::range<1>(256)),
            [=](cl::sycl::nd_item<1> it) {
              add(it, N, aX, aY);
            });
      });
    }

Why use C++ single-source programming?

• Widely used, especially in AI
• Runs on lots of platforms
• Kernel fusion
• Abstractions to enable performance-portability
• Composability
• Integrates with: OpenCL, OpenVX etc

Kernel fusion

Most parallel processors are bandwidth bound

a = b * c + d * f

If a, b, c, d and f are vectors: executing the operations as separate kernels is bandwidth-bound, but if we fuse them into just one kernel, performance is much better (see the sketch below).
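A minimal sketch of what that fused kernel looks like in SYCL, not taken from the slides; the function and kernel names are illustrative:

    #include <CL/sycl.hpp>

    // The whole expression a = b*c + d*f runs as one kernel: each element of
    // b, c, d and f is read once and a is written once, with no intermediate
    // vectors written back to memory.
    void fused_madd(cl::sycl::queue &q,
                    cl::sycl::buffer<float, 1> &a, cl::sycl::buffer<float, 1> &b,
                    cl::sycl::buffer<float, 1> &c, cl::sycl::buffer<float, 1> &d,
                    cl::sycl::buffer<float, 1> &f) {
      q.submit([&](cl::sycl::handler &cgh) {
        auto A = a.get_access<cl::sycl::access::mode::write>(cgh);
        auto B = b.get_access<cl::sycl::access::mode::read>(cgh);
        auto C = c.get_access<cl::sycl::access::mode::read>(cgh);
        auto D = d.get_access<cl::sycl::access::mode::read>(cgh);
        auto F = f.get_access<cl::sycl::access::mode::read>(cgh);
        cgh.parallel_for<class fused_madd_kernel>(a.get_range(),
            [=](cl::sycl::id<1> i) {
              A[i] = B[i] * C[i] + D[i] * F[i];   // one pass, one kernel
            });
      });
    }

Executing the multiply and add as separate kernels would instead write and re-read temporary vectors, which is exactly the extra bandwidth that fusion avoids.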

Graph programming: some numbers

In this example, we perform 3 image processing operations on an accelerator and compare 3 systems when executing individual nodes, or a whole graph. Halide and SYCL use kernel fusion, whereas OpenCV does not. For all 3 systems, the performance of the whole graph is significantly better than individual nodes executed on their own. The system is an AMD APU and the operations are: RGB->HSV, channel masking, HSV->RGB.

[Chart: "Effect of combining graph nodes on performance" – kernel time and overhead time (ms) for OpenCV, Halide and SYCL, each executed as individual nodes and as a whole graph.]

VisionCpp with SYCL (or OpenMP)

This graph is created in C++ at compile time, so it can be optimized at compile time. This allows fast start-up. (Source on GitHub.)

    #include <...>                          // VisionCpp header
    int main() {
      auto in = cv::imread("input.jpg");
      auto q  = get_queue();
      auto a  = Node(in.data);              // in: sRGB image data
      auto b  = Node(a);                    // sRGB -> lRGB
      auto c  = Node(b);                    // lRGB -> lHSV
      auto d  = Node(0.1);                  // scale coefficient
      auto e  = Node(c, d);                 // scale the lHSV values
      auto f  = Node(e);                    // lHSV -> lRGB
      auto g  = Node(f);                    // lRGB -> sRGB
      auto h  = execute(g, q);              // run the whole graph on the queue
      auto ptr = h.get_data();
      auto output = cv::Mat(512, 512, CV_8UC3, ptr.get());
      cv::imshow("Display Image", output);
      return 0;
    }

Expressing the execution for a device

SYCL:

    template <typename Expr, typename... Acc>
    void sycl(handler &cgh, Expr expr, Acc... acc) {
      // sycl accessor for accessing data on the device
      auto outPtr = expr.out->template get_accessor(cgh);
      // sycl range representing the valid range for accessing data
      auto rng = range<2>(Expr::Rows, Expr::Cols);
      // sycl parallel_for for parallelising execution across the range
      cgh.parallel_for(rng, [=](item<2> itemID) {
        // rebuild the accessor tuple on the device
        auto tuple = make_tuple(acc...);
        // call the eval function for each pixel
        outPtr[itemID] = expr.eval(itemID, tuple);
      });
    }

OpenMP:

    template <typename Expr, typename... Acc>
    void cpp(Expr expr, Acc... acc) {
      // output pointer for accessing data on the host
      auto outPtr = expr.out->get();
      // valid range for accessing data on the host
      auto rng = range(Expr::Rows, Expr::Cols);
      // rebuild the tuple of input pointers on the host
      auto tuple = make_tuple(acc...);
      // OpenMP directive for parallelising the for loop
      #pragma omp parallel for
      for (size_t i = 0; i < rng.rows; i++)
        for (size_t j = 0; j < rng.cols; j++)
          // call the eval function for each pixel
          outPtr[index(i, j)] = expr.eval(index(i, j), tuple);
    }

TensorFlow for OpenCL and SYCL

• Same source code supports CUDA and SYCL - via #ifdefs (see the sketch after this list)

• In branches and trunk, being merged into trunk

• Supported, continuously tested
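A rough sketch of that #ifdef pattern, not taken from the TensorFlow sources: the macro names TENSORFLOW_USE_SYCL and GOOGLE_CUDA and the exact device setup are assumptions; the Eigen device types match the Eigen Tensors slide later in this deck.

    // Back-end selection via the pre-processor (macro names are assumptions):
    #if defined(TENSORFLOW_USE_SYCL)
      #define EIGEN_USE_SYCL
      #include <CL/sycl.hpp>
    #elif defined(GOOGLE_CUDA)
      #define EIGEN_USE_GPU
    #endif
    #include <unsupported/Eigen/CXX11/Tensor>

    void run_on_selected_backend() {
    #if defined(TENSORFLOW_USE_SYCL)
      cl::sycl::gpu_selector s;                 // OpenCL GPU via SYCL
      cl::sycl::queue q(s);
      Eigen::SyclDevice device(q);              // as on the Eigen Tensors slide
    #elif defined(GOOGLE_CUDA)
      Eigen::CudaStreamDevice stream;           // CUDA stream
      Eigen::GpuDevice device(&stream);
    #endif
      // The same tensor expression then runs on whichever device was built, e.g.:
      //   out.device(device) = in1 * in1.constant(3.14f) + in2 * in2.constant(2.7f);
    }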

Applying fusion to TensorFlow Eigen

This is how TensorFlow uses Eigen to achieve kernel fusion. CUDA does this for GPUs; SYCL is used here for AMD GPUs. The total performance improvement delivered by SYCL is both of these graphs combined:

[Chart: "TensorFlow Eigen Kernel Fusion" – normalized time (lower is better) for Kernel1, Kernel2, Kernel3 and the fused kernel, showing the speedup from fusion at size 4,000.]

[Chart: "Unfused performance improvement: AMD GPU vs multi-core CPU" – improvement at sizes 10, 80, 640 and 4096 (fused).]

Continuous Integration Testing

C++ Expression trees: Eigen Tensors

This code is the standard Eigen approach to linear algebra, which TensorFlow has adapted to tensors and accelerated with CUDA. Here it is adapted to use SYCL:

    cl::sycl::gpu_selector s;             // use the OpenCL GPU
    cl::sycl::queue q(s);
    Eigen::SyclDevice sycl_device(q);     // this creates an Eigen device

    array<int, 3> tensorRange = {{100, 10, 20}};
    Tensor<DataType, 3> in1(tensorRange);
    Tensor<DataType, 3> in2(tensorRange);
    Tensor<DataType, 3> in3(tensorRange);
    Tensor<DataType, 3> out(tensorRange);
    in2 = in2.random();
    in3 = in3.random();

    DataType *gpu_in1_data = static_cast<DataType *>(sycl_device.allocate(in1.size() * sizeof(DataType)));
    DataType *gpu_in2_data = static_cast<DataType *>(sycl_device.allocate(in2.size() * sizeof(DataType)));
    DataType *gpu_in3_data = static_cast<DataType *>(sycl_device.allocate(in3.size() * sizeof(DataType)));
    DataType *gpu_out_data = static_cast<DataType *>(sycl_device.allocate(out.size() * sizeof(DataType)));

    TensorMap<Tensor<DataType, 3>> gpu_in1(gpu_in1_data, tensorRange);
    TensorMap<Tensor<DataType, 3>> gpu_in2(gpu_in2_data, tensorRange);
    TensorMap<Tensor<DataType, 3>> gpu_in3(gpu_in3_data, tensorRange);
    TensorMap<Tensor<DataType, 3>> gpu_out(gpu_out_data, tensorRange);

    // a*3.14f + b*2.7f – this expression is fused into a single kernel on the device
    gpu_out.device(sycl_device) =
        gpu_in1 * gpu_in1.constant(3.14f) + gpu_in2 * gpu_in2.constant(2.7f);

    sycl_device.memcpyDeviceToHost(out.data(), gpu_out_data, out.size() * sizeof(DataType));
    sycl_device.synchronize();

Tensor operation as functor on device

    // The tensor expression template is carried in the Expr type
    template <typename Expr, typename FunctorExpr, typename TupleType>
    struct ExecExprFunctorKernel {
      typedef typename internal::createPlaceHolderExpression<Expr>::Type PlaceHolderExpr;
      typedef typename Expr::Index Index;

      FunctorExpr functors;
      TupleType tuple_of_accessors;
      Index range;

      ExecExprFunctorKernel(Index range_, FunctorExpr functors_, TupleType tuple_of_accessors_)
          : functors(functors_), tuple_of_accessors(tuple_of_accessors_), range(range_) {}

      void operator()(cl::sycl::nd_item<1> itemID) {
        // Compile-time template magic to reconstruct the tensor kernel on the device
        typedef typename internal::ConvertToDeviceExpression<PlaceHolderExpr>::Type DevExpr;
        auto device_expr = internal::createDeviceExpression(functors, tuple_of_accessors);
        auto device_evaluator = Eigen::TensorEvaluator<decltype(device_expr.expr), Eigen::DefaultDevice>(
            device_expr.expr, Eigen::DefaultDevice());
        typename DevExpr::Index gId =
            static_cast<typename DevExpr::Index>(itemID.get_global_linear_id());
        // Evaluate one element of the tensor expression
        if (gId < range)
          device_evaluator.evalScalar(gId);
      }
    };

Enqueue tensor operation to device

    // The tensor expression is carried in the Expr type
    template <typename Expr, typename Dev>
    void run(Expr &expr, Dev &dev) {
      Eigen::TensorEvaluator<Expr, Dev> evaluator(expr, dev);
      const bool needs_assign = evaluator.evalSubExprsIfNeeded(NULL);
      if (needs_assign) {
        typedef decltype(internal::extractFunctors(evaluator)) FunctorExpr;
        FunctorExpr functors = internal::extractFunctors(evaluator);
        // Add work to the queue
        dev.sycl_queue().submit([&](cl::sycl::handler &cgh) {
          // Create a tuple of accessors from the evaluator: this packages up the data
          // references and the expression to be evaluated for sending to the device
          typedef decltype(internal::createTupleOfAccessors(cgh, evaluator)) TupleType;
          TupleType tuple_of_accessors = internal::createTupleOfAccessors(cgh, evaluator);
          typename Expr::Index range, GRange, tileSize;
          dev.parallel_for_setup(
              static_cast<typename Expr::Index>(evaluator.dimensions().TotalSize()),
              tileSize, range, GRange);
          // Enqueue data-parallel work to the device
          cgh.parallel_for(
              cl::sycl::nd_range<1>(cl::sycl::range<1>(GRange), cl::sycl::range<1>(tileSize)),
              ExecExprFunctorKernel<Expr, FunctorExpr, TupleType>(range, functors, tuple_of_accessors));
        });
        dev.asynchronousExec();
      }
      evaluator.cleanup();
    }

SYCL ecosystem

• SYCL.tech – http://sycl.tech

• SYCLBLAS – SYCL BLAS library that supports kernel fusion
• Eigen – used in TensorFlow for custom operations
• TensorFlow
• VisionCpp – demonstration of how to build C++ graphs for vision
• C++ 17 Parallel STL for SYCL – supports the new C++ 17 Parallel STL standard (see the sketch below)
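As a small illustration of that last item, a sketch assuming the headers and names used by the SYCL Parallel STL project (sycl/execution_policy, sycl::sycl_execution_policy); none of these names appear on the slide itself:

    #include <vector>
    #include <CL/sycl.hpp>
    #include <sycl/execution_policy>
    #include <experimental/algorithm>

    void sort_on_device(std::vector<int> &v) {
      cl::sycl::queue q;                                    // default SYCL device
      sycl::sycl_execution_policy<class sort_kernel> snp(q);
      // A standard algorithm, dispatched through SYCL to the chosen device
      std::experimental::parallel::sort(snp, v.begin(), v.end());
    }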

How do I get SYCL?

• ComputeCpp: from Codeplay (my company)
  - Available for free and works with OpenCL accelerators using SPIR

• triSYCL: open source
  - Doesn't (yet) work with OpenCL accelerators

What now?

• We are working on supporting OpenCL 2.2 with SYCL 2.2

• We are working on bringing heterogeneous acceleration into a future ISO C++

• We are building out the open standard ecosystem of C++ accelerated software

Questions?

Where does SYCL fit in? – Parallelism

Directive-based parallelism
- e.g.: OpenMP, OpenACC
- Annotate serial code with #pragmas highlighting where the parallelising compiler should transform the code:
    Vector a, b, r;
    #pragma parallel_for
    for (int i = 0; i < a.size(); i++) {
      r[i] = a[i] + b[i];
    }
- Pros: Original source code is annotated, not modified; well-understood
- Cons: Hard to compose; execution order separate from source code

Thread parallelism
- e.g.: tbb, C++11 threads, pthreads
- Create explicit threads to break up a task into parallel sections:
    Vector a, b, r;
    Thread t1 = createThread([&]() { sumFirstHalf(r, a, b); });
    Thread t2 = createThread([&]() { sumSecondHalf(r, a, b); });
    t1.wait();
    t2.wait();
- Pros: Well understood; works with a variety of algorithms
- Cons: Doesn't map to highly parallel architectures like GPUs & FPGAs

Explicit parallelism
- e.g.: SYCL, Parallel STL, CUDA, C++ AMP
- Parallelism is expressed explicitly in the program:
    Vector a, b;
    parallel_for(a.range(), [&](int id) {
      a[id] = a[id] + b[id];
    });
- Pros: Composable; works with a wide variety of processor architectures
- Cons: Requires the user to know the parallelism

What features of OpenCL do we need?

• We want to make it easy to write high-performance OpenCL code in C++
  - SYCL code in C++ must use memory and execute kernels efficiently
  - We must provide developers with all the optimization options they have in OpenCL

• We want to enable all OpenCL features in C++ with SYCL (see the sketch after this list)
  - Support a wide range of OpenCL devices: CPUs, GPUs, FPGAs, DSPs…
  - Data on host: images and buffers; mapping, DMA and copying
  - Data on device: global/constant/local/private memory; multiple pointer sizes
  - Parallelism: ND ranges, work-groups, work-items, barriers, queues, events
  - Multi-device: platforms, devices, contexts

• We want to enable OpenCL C code to interoperate with C++ SYCL code
  - Sharing of contexts, memory objects etc
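A minimal sketch, not from the slides, of a few of those features in SYCL code: an ND range with work-groups, work-group local memory and a barrier. The function, buffer and kernel names are illustrative, and the buffer size is assumed to be a multiple of the work-group size.

    #include <CL/sycl.hpp>
    using namespace cl::sycl;

    void reverse_within_groups(queue &q, buffer<float, 1> &data, size_t wgSize) {
      q.submit([&](handler &cgh) {
        auto d = data.get_access<access::mode::read_write>(cgh);
        // Work-group local memory: one float per work-item
        accessor<float, 1, access::mode::read_write, access::target::local>
            scratch(range<1>(wgSize), cgh);
        // ND range: global size split into work-groups of wgSize work-items
        cgh.parallel_for<class reverse_kernel>(
            nd_range<1>(data.get_range(), range<1>(wgSize)),
            [=](nd_item<1> it) {
              size_t l = it.get_local(0);
              size_t g = it.get_global(0);
              scratch[l] = d[g];
              it.barrier(access::fence_space::local_space);  // work-group barrier
              d[g] = scratch[wgSize - 1 - l];
            });
      });
    }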

Example SYCL Code

    #include <CL/sycl.hpp>   // include the SYCL header file

    void func(float *array_a, float *array_b, float *array_c,
              float *array_r, size_t count)
    {
      // Encapsulate data in SYCL buffers, which can be mapped or copied
      // to or from OpenCL devices
      buffer<float, 1> buf_a(array_a, range<1>(count));
      buffer<float, 1> buf_b(array_b, range<1>(count));
      buffer<float, 1> buf_c(array_c, range<1>(count));
      buffer<float, 1> buf_r(array_r, range<1>(count));

      // Create a queue, preferably on a GPU, which can execute kernels
      queue myQueue(gpu_selector{});

      // Submit to the queue all the work described in the handler lambda that follows
      myQueue.submit([&](handler &cgh) {
        // Create accessors, which encapsulate the type of access to data in the buffers
        auto a = buf_a.get_access<access::mode::read>(cgh);
        auto b = buf_b.get_access<access::mode::read>(cgh);
        auto c = buf_c.get_access<access::mode::read>(cgh);
        auto r = buf_r.get_access<access::mode::write>(cgh);

        // Execute in parallel the work over an ND range (in this case 'count')
        cgh.parallel_for<class add3>(range<1>(count), [=](id<1> i) {
          // This code is executed in parallel on the device
          r[i] = a[i] + b[i] + c[i];
        });
      });
    }

Task Graph Deduction

    const int n_items = 32;
    range<1> r(n_items);

    int array_a[n_items] = { 0 };
    int array_b[n_items] = { 0 };
    buffer<int, 1> buf_a(array_a, range<1>(r));
    buffer<int, 1> buf_b(array_b, range<1>(r));

    queue q;

    // Each submit creates a command group; the runtime deduces the task graph
    // from the accessors and schedules the groups efficiently.
    q.submit([&](handler &cgh) {
      auto acc_a = buf_a.get_access<access::mode::read_write>(cgh);
      algorithm_a s(acc_a);
      cgh.parallel_for(r, s);
    });

    q.submit([&](handler &cgh) {
      auto acc_b = buf_b.get_access<access::mode::read_write>(cgh);
      algorithm_b s(acc_b);
      cgh.parallel_for(r, s);
    });

    q.submit([&](handler &cgh) {
      auto acc_a = buf_a.get_access<access::mode::read_write>(cgh);
      algorithm_c s(acc_a);
      cgh.parallel_for(r, s);
    });

Data access with accessors

• Encapsulates the difference between data storage and data access (see the sketch below)
• Allows creation of a parallel task graph with schedule, synchronization and data movement
• Enables devices to use optimal access to data
  - Including having different pointer sizes on the device to those on the host
  - Allows usage of different address spaces for different data
• Enhanced with call-graph duplication (for C++ pointers) and explicit pointer classes
  - To enable direct pointer-like access to data on the device
• Portable, because accessors can be implemented as raw pointers
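A minimal sketch, not from the slides, of that storage/access separation; the function and kernel names are illustrative:

    #include <CL/sycl.hpp>
    using namespace cl::sycl;

    void scale(queue &q, buffer<float, 1> &data, float factor) {
      q.submit([&](handler &cgh) {
        // Device access: read-write access to the buffer's data on the device
        auto d = data.get_access<access::mode::read_write>(cgh);
        cgh.parallel_for<class scale_kernel>(data.get_range(), [=](id<1> i) {
          d[i] *= factor;
        });
      });
      // Host access: waits for the device work above, then maps or copies the
      // data back so the host can read it; the buffer itself owns the storage
      auto h = data.get_access<access::mode::read>();
      float first = h[0];   // use on the host
      (void)first;
    }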

'Shared source' approach to single-source

• This is not required for SYCL, but is designed as a possible implementation
• Have a different compiler for host and each device
  - Don't really need to implement a different front-end for each device, but can
• Benefits
  - Many developers are required to use a specific host compiler
  - Allows front-ends to optimize for specific devices: e.g. CPU, FPGA, GPU, DSP
  - Allows the pre-processor to be used by developers for portability and performance portability (see the sketch below)
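A small sketch of that last point, assuming the __SYCL_DEVICE_ONLY__ macro defined by SYCL device compilers; the function name is illustrative:

    #include <CL/sycl.hpp>

    // The pre-processor lets one source file carry host-only or device-only
    // specializations when a separate device compiler is used.
    template <typename T>
    T mul_add(T a, T b, T c) {
    #ifdef __SYCL_DEVICE_ONLY__
      return cl::sycl::fma(a, b, c);   // device built-in, seen only by the device compiler
    #else
      return a * b + c;                // plain code, seen only by the host compiler
    #endif
    }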

Example SYCL Code: Building the program

    #include <CL/sycl.hpp>
    // #include the SYCL header file. This can be implemented differently for host
    // and device. It can also #include the compiled device kernel binaries.

    int main()
    {
      // On the host, the accessors can represent the dependencies in the program.
      // On the device, they can be implemented as OpenCL pointers (whether global,
      // local or constant).
      buffer<float, 1> buf_a(array_a, range<1>(count));
      buffer<float, 1> buf_b(array_b, range<1>(count));
      buffer<float, 1> buf_c(array_c, range<1>(count));
      buffer<float, 1> buf_r(array_r, range<1>(count));

      queue myQueue(gpu_selector{});

      myQueue.submit([&](handler &cgh) {
        auto a = buf_a.get_access<access::mode::read>(cgh);
        auto b = buf_b.get_access<access::mode::read>(cgh);
        auto c = buf_c.get_access<access::mode::read>(cgh);
        auto r = buf_r.get_access<access::mode::write>(cgh);

        // This code is extracted by a device compiler and compiled for a device,
        // including any functions or methods called from here. All the code must
        // conform to the OpenCL kernel restrictions (e.g. no recursion). This code
        // can be compiled for different devices from the same source code.
        //
        // The class name given to parallel_for names the lambda; it is used to
        // enable the host to load the correct compiled device kernel into OpenCL.
        // C++ reflection may remove this requirement.
        cgh.parallel_for<class add3>(range<1>(count), [=](id<1> i) {
          r[i] = a[i] + b[i] + c[i];
        });
      });
    }

Where does SYCL fit in? – Memory model

Cache-coherent single-address space
- e.g.: multi-core CPUs, HSA, OpenCL 2.x system sharing
    float *a = new float[size];
    processCodeOnDevice(a, size);
- When parallelizing on a system with a cache-coherent single address space, you only need to pass around pointers. This makes communication and offloading very low-cost and easy. It requires all memory accesses to go through the virtual memory system, and caches to communicate ownership across all cores.
- Pros: Very easy to program – just pass around pointers (leave ownership issues to the user); low-latency offload; very little impact on the programming model
- Cons: Bandwidth limited; costs power; needs special virtual memory support

Non-coherent single-address space
- e.g.: HSA coarse-grained, OpenCL 2.x
    float *a = NewShared(size);
    a.passOwnershipToDevice(size);
    processCodeOnDevice(a, size);
- All data is still referred to via shared pointers, but the user must manage the memory ownership between different cores.
- Pros: Doesn't require (much) OS support or (much) hardware support
- Cons: Not supported on all processor cores; the user must manage ownership

Multi-address space
- e.g.: SYCL 1.2, C++ AMP, OpenCL 1.x
    Shared<float> a(size);
    processCodeOnDevice(a);
- Data needs to be encapsulated in new datatypes that are able to manage ownership between the host CPU and the different devices.
- Pros: High performance and efficiency of memory accesses; wide device support
- Cons: Impact on the programming model (pointers)
