SYCL in the OpenVX ecosystem

Andrew Richards, Embedded Vision Summit, May 2017

Khronos Promoter Members

Over 100 members worldwide. Any company is welcome to join.

Who am I?

• Chair of the SYCL group
• Chair of the HSA Software Group
• CEO of Codeplay
  - We built a C/C++ compiler for GPUs in 2002
  - 60 staff in Edinburgh, Scotland
  - We build programming tools for heterogeneous processors
  - OpenCL, SYCL, + others

How does SYCL fit into OpenVX, vision & AI?

1. You need a graph system for AI/vision → OpenVX

2. You need hand-coded kernels for common tasks → OpenVX

3. You need to be able to write custom operations → SYCL

What is SYCL for?

• Modern C++ lets us separate the what from the how:

- We want to separate what the user wants to do: science, computer vision, AI …

- And enable the how to be: run fast on an OpenCL device

• Modern C++ supports and encourages this separation

What we want to achieve

• We want to enable a C++ ecosystem for OpenCL:

- Must run on OpenCL devices: GPUs, CPUs, FPGAs, DSPs etc

- C++ template libraries

- Tools: compilers, debuggers, IDEs, optimizers

- Training, example programs

- Long-term support for current and future OpenCL features

Why a new standard?

• There are already well-established ways to map C++ to parallel processors
  - So we follow the established approaches

• There are specifics of OpenCL that we need to map to C++
  - We have worked hard to be an enabler for other C++ parallelism standards (see http://imgs.xkcd.com/comics/standards.png)

• We add no more than we need to

Where does SYCL fit in?

OpenCL / SYCL Stack

User application code

C++ template libraries

SYCL for OpenCL

OpenCL Devices

CPU FPGA GPU DSP

Philosophy

• With SYCL, we wanted to align with the direction the C++ standard is going
  - And we also need to future-proof for future OpenCL device capabilities

• Key decisions:

- We will not add any language extensions to C++

- We will work with existing C++ compilers

- We will provide the full OpenCL feature-set in C++

- Everything must compile and run on the host as well as an OpenCL device (see the sketch below)
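A minimal sketch of that last decision, not taken from the slides: the helper name make_queue and the fallback logic are illustrative, using standard SYCL 1.2 classes.

    #include <CL/sycl.hpp>

    // Pick an OpenCL GPU if one is present, otherwise fall back to the SYCL host
    // device, so the same code always compiles and runs somewhere.
    cl::sycl::queue make_queue() {
      try {
        return cl::sycl::queue(cl::sycl::gpu_selector{});   // OpenCL GPU device
      } catch (cl::sycl::exception const &) {
        return cl::sycl::queue(cl::sycl::host_selector{});  // host fallback device
      }
    }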

Where does SYCL fit in? – Language style

C++ embedded DSLs
- e.g.: Sh/RapidMind, Halide, .compute
- A C++ template library uses overloading to build up an expression tree to compile at runtime:
    Vector a, b;
    auto expr = a + b;
    Vector r = expr.eval();
- Pros: Works with existing C++ compilers
- Cons: compile-time compilation, control flow, composability

C++ kernel languages
- e.g.: GLSL, OpenCL C and C++ kernel languages
- Host (CPU) code loads and compiles the kernel for a specific device, sets args and runs:
    Kernel myKernel;
    myKernel.load("myKernel");
    myKernel.compile();
    myKernel.setArg(0, a);
    float r = myKernel.run();
  with the kernel written separately:
    float myKernel(float arg) { return arg * 456.7f; }
- Pros: Explicit offload; independent host/device code & compilers; run-time adaptation; popular in graphics
- Cons: Hard to compose cross-device

C++ single-source
- e.g.: SYCL, CUDA, OpenMP, C++ AMP
- A single source file contains code for both host & device:
    Vector a, b, r;
    parallel_for(a.range(), [&](int id) {
      r[id] = a[id] + b[id];
    });
- Pros: Composability; easy to use; offline compilation and validation
- Cons: host/device compiler conflict

Comparison of SYCL & OpenVX

• SYCL is a general programming model; OpenVX is a vision graph system

• SYCL makes you write your own graph system; OpenVX distributes a graph across an entire system

• SYCL makes you write your own nodes; OpenVX uses built-in nodes

In AI applications, we see:

• People need pre-optimized graph nodes
• People need to optimize whole graphs
• Developers/researchers need to write their own nodes

Comparison of SYCL & CUDA

CUDA:

    #include <iostream>
    #include <math.h>

    // Kernel function to add the elements of two arrays
    __global__
    void add(int n, float *x, float *y)
    {
      int index = threadIdx.x;
      int stride = blockDim.x;
      for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
    }
    ...
    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));
    ...
    // Run kernel on 1M elements on the GPU
    add<<<1, 256>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

SYCL:

    #include <CL/sycl.hpp>

    // function to add the elements of two arrays
    void add(cl::sycl::nd_item<1> item, int n,
             cl::sycl::global_ptr<float> x, cl::sycl::global_ptr<float> y)
    {
      int index = item.get_local(0);
      int stride = item.get_local_range(0);
      for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
    }
    ...
    // encapsulate data in SYCL buffers
    cl::sycl::buffer<float> x(N);
    cl::sycl::buffer<float> y(N);
    ...
    { // create a scope to define the lifetime of the SYCL objects
      // create a SYCL queue for a GPU
      cl::sycl::gpu_selector selectgpu;
      cl::sycl::device gpu_device(selectgpu);
      cl::sycl::queue gpu_queue(gpu_device);

      // submit this work to the SYCL queue
      gpu_queue.submit([&](cl::sycl::handler &cgh) {
        // request access to the data on the OpenCL GPU
        auto aX = x.get_access<cl::sycl::access::mode::read_write>(cgh);
        auto aY = y.get_access<cl::sycl::access::mode::read_write>(cgh);
        // Run kernel on 1M elements on the OpenCL GPU
        cgh.parallel_for<class add_kernel>(
            cl::sycl::nd_range<1>(cl::sycl::range<1>(256),
                                  cl::sycl::range<1>(256)),
            [=](cl::sycl::nd_item<1> it) {
              add(it, N, aX, aY);
            });
      });
    }

Why use C++ single-source programming?

• Widely used, especially in AI
• Runs on lots of platforms
• Kernel fusion
• Abstractions to enable performance-portability
• Composability
• Integrates with: OpenCL, OpenVX etc

Kernel fusion

Most parallel processors are bandwidth bound

a = b * c + d * f

If a, b, c, d and f are vectors: executing the operations as separate kernels is bandwidth-bound, but if we fuse them into just one kernel, performance is much better (see the sketch below).
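A minimal sketch of what that fused kernel looks like in SYCL, not taken from the slides; the function and kernel names are illustrative:

    #include <CL/sycl.hpp>

    // The whole expression a = b*c + d*f runs as one kernel: each element of
    // b, c, d and f is read once and a is written once, with no intermediate
    // vectors written back to memory.
    void fused_madd(cl::sycl::queue &q,
                    cl::sycl::buffer<float, 1> &a, cl::sycl::buffer<float, 1> &b,
                    cl::sycl::buffer<float, 1> &c, cl::sycl::buffer<float, 1> &d,
                    cl::sycl::buffer<float, 1> &f) {
      q.submit([&](cl::sycl::handler &cgh) {
        auto A = a.get_access<cl::sycl::access::mode::write>(cgh);
        auto B = b.get_access<cl::sycl::access::mode::read>(cgh);
        auto C = c.get_access<cl::sycl::access::mode::read>(cgh);
        auto D = d.get_access<cl::sycl::access::mode::read>(cgh);
        auto F = f.get_access<cl::sycl::access::mode::read>(cgh);
        cgh.parallel_for<class fused_madd_kernel>(a.get_range(),
            [=](cl::sycl::id<1> i) {
              A[i] = B[i] * C[i] + D[i] * F[i];   // one pass, one kernel
            });
      });
    }

Executing the multiply and add as separate kernels would instead write and re-read temporary vectors, which is exactly the extra bandwidth that fusion avoids.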

Graph programming: some numbers

In this example, we perform 3 image processing operations on an accelerator and compare 3 systems when executing individual nodes, or a whole graph. Halide and SYCL use kernel fusion, whereas OpenCV does not. For all 3 systems, the performance of the whole graph is significantly better than individual nodes executed on their own. The system is an AMD APU and the operations are: RGB->HSV, channel masking, HSV->RGB.

[Chart: "Effect of combining graph nodes on performance" – kernel time and overhead time (ms) for OpenCV, Halide and SYCL, each executed as individual nodes and as a whole graph.]

VisionCpp with SYCL (or OpenMP)

This graph is created in C++ at compile time, so it can be optimized at compile time. This allows fast start-up. (Source on GitHub.)

    #include <...>                          // VisionCpp header
    int main() {
      auto in = cv::imread("input.jpg");
      auto q  = get_queue();
      auto a  = Node(in.data);              // in: sRGB image data
      auto b  = Node(a);                    // sRGB -> lRGB
      auto c  = Node(b);                    // lRGB -> lHSV
      auto d  = Node(0.1);                  // scale coefficient
      auto e  = Node(c, d);                 // scale the lHSV values
      auto f  = Node(e);                    // lHSV -> lRGB
      auto g  = Node(f);                    // lRGB -> sRGB
      auto h  = execute(g, q);              // run the whole graph on the queue
      auto ptr = h.get_data();
      auto output = cv::Mat(512, 512, CV_8UC3, ptr.get());
      cv::imshow("Display Image", output);
      return 0;
    }

Expressing the execution for a device

SYCL:

    template <typename Expr, typename... Acc>
    void sycl(handler &cgh, Expr expr, Acc... acc) {
      // sycl accessor for accessing data on the device
      auto outPtr = expr.out->template get_accessor(cgh);
      // sycl range representing the valid range for accessing data
      auto rng = range<2>(Expr::Rows, Expr::Cols);
      // sycl parallel_for for parallelising execution across the range
      cgh.parallel_for(rng, [=](item<2> itemID) {
        // rebuild the accessor tuple on the device
        auto tuple = make_tuple(acc...);
        // call the eval function for each pixel
        outPtr[itemID] = expr.eval(itemID, tuple);
      });
    }

OpenMP:

    template <typename Expr, typename... Acc>
    void cpp(Expr expr, Acc... acc) {
      // output pointer for accessing data on the host
      auto outPtr = expr.out->get();
      // valid range for accessing data on the host
      auto rng = range(Expr::Rows, Expr::Cols);
      // rebuild the tuple of input pointers on the host
      auto tuple = make_tuple(acc...);
      // OpenMP directive for parallelising the for loop
      #pragma omp parallel for
      for (size_t i = 0; i < rng.rows; i++)
        for (size_t j = 0; j < rng.cols; j++)
          // call the eval function for each pixel
          outPtr[index(i, j)] = expr.eval(index(i, j), tuple);
    }

TensorFlow for OpenCL and SYCL

• Same source code supports CUDA and SYCL - via #ifdefs (see the sketch after this list)

• In branches and trunk, being merged into trunk

• Supported, continuously tested
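A rough sketch of that #ifdef pattern, not taken from the TensorFlow sources: the macro names TENSORFLOW_USE_SYCL and GOOGLE_CUDA and the exact device setup are assumptions; the Eigen device types match the Eigen Tensors slide later in this deck.

    // Back-end selection via the pre-processor (macro names are assumptions):
    #if defined(TENSORFLOW_USE_SYCL)
      #define EIGEN_USE_SYCL
      #include <CL/sycl.hpp>
    #elif defined(GOOGLE_CUDA)
      #define EIGEN_USE_GPU
    #endif
    #include <unsupported/Eigen/CXX11/Tensor>

    void run_on_selected_backend() {
    #if defined(TENSORFLOW_USE_SYCL)
      cl::sycl::gpu_selector s;                 // OpenCL GPU via SYCL
      cl::sycl::queue q(s);
      Eigen::SyclDevice device(q);              // as on the Eigen Tensors slide
    #elif defined(GOOGLE_CUDA)
      Eigen::CudaStreamDevice stream;           // CUDA stream
      Eigen::GpuDevice device(&stream);
    #endif
      // The same tensor expression then runs on whichever device was built, e.g.:
      //   out.device(device) = in1 * in1.constant(3.14f) + in2 * in2.constant(2.7f);
    }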

Applying fusion to TensorFlow Eigen

This is how TensorFlow uses Eigen to achieve kernel fusion. CUDA does this for GPUs; SYCL is used here for AMD GPUs. The total performance improvement delivered by SYCL is both of these graphs combined:

[Chart: "TensorFlow Eigen Kernel Fusion" – normalized time (lower is better) for Kernel1, Kernel2, Kernel3 and the fused kernel, showing the speedup from fusion at size 4,000.]

[Chart: "Unfused performance improvement: AMD GPU vs multi-core CPU" – improvement at sizes 10, 80, 640 and 4096 (fused).]

Continuous Integration Testing

C++ Expression trees: Eigen Tensors

This code is the standard Eigen approach to linear algebra, which TensorFlow has adapted to tensors and accelerated with CUDA. Here it is adapted to use SYCL:

    cl::sycl::gpu_selector s;             // use the OpenCL GPU
    cl::sycl::queue q(s);
    Eigen::SyclDevice sycl_device(q);     // this creates an Eigen device

    array<int, 3> tensorRange = {{100, 10, 20}};
    Tensor<DataType, 3> in1(tensorRange);
    Tensor<DataType, 3> in2(tensorRange);
    Tensor<DataType, 3> in3(tensorRange);
    Tensor<DataType, 3> out(tensorRange);
    in2 = in2.random();
    in3 = in3.random();

    DataType *gpu_in1_data = static_cast<DataType *>(sycl_device.allocate(in1.size() * sizeof(DataType)));
    DataType *gpu_in2_data = static_cast<DataType *>(sycl_device.allocate(in2.size() * sizeof(DataType)));
    DataType *gpu_in3_data = static_cast<DataType *>(sycl_device.allocate(in3.size() * sizeof(DataType)));
    DataType *gpu_out_data = static_cast<DataType *>(sycl_device.allocate(out.size() * sizeof(DataType)));

    TensorMap<Tensor<DataType, 3>> gpu_in1(gpu_in1_data, tensorRange);
    TensorMap<Tensor<DataType, 3>> gpu_in2(gpu_in2_data, tensorRange);
    TensorMap<Tensor<DataType, 3>> gpu_in3(gpu_in3_data, tensorRange);
    TensorMap<Tensor<DataType, 3>> gpu_out(gpu_out_data, tensorRange);

    // a*3.14f + b*2.7f – this expression is fused into a single kernel on the device
    gpu_out.device(sycl_device) =
        gpu_in1 * gpu_in1.constant(3.14f) + gpu_in2 * gpu_in2.constant(2.7f);

    sycl_device.memcpyDeviceToHost(out.data(), gpu_out_data, out.size() * sizeof(DataType));
    sycl_device.synchronize();

Tensor operation as functor on device

    // The tensor expression template is carried in the Expr type
    template <typename Expr, typename FunctorExpr, typename TupleType>
    struct ExecExprFunctorKernel {
      typedef typename internal::createPlaceHolderExpression<Expr>::Type PlaceHolderExpr;
      typedef typename Expr::Index Index;

      FunctorExpr functors;
      TupleType tuple_of_accessors;
      Index range;

      ExecExprFunctorKernel(Index range_, FunctorExpr functors_, TupleType tuple_of_accessors_)
          : functors(functors_), tuple_of_accessors(tuple_of_accessors_), range(range_) {}

      void operator()(cl::sycl::nd_item<1> itemID) {
        // Compile-time template magic to reconstruct the tensor kernel on the device
        typedef typename internal::ConvertToDeviceExpression<PlaceHolderExpr>::Type DevExpr;
        auto device_expr = internal::createDeviceExpression(functors, tuple_of_accessors);
        auto device_evaluator = Eigen::TensorEvaluator<decltype(device_expr.expr), Eigen::DefaultDevice>(
            device_expr.expr, Eigen::DefaultDevice());
        typename DevExpr::Index gId =
            static_cast<typename DevExpr::Index>(itemID.get_global_linear_id());
        // Evaluate one element of the tensor expression
        if (gId < range)
          device_evaluator.evalScalar(gId);
      }
    };

Enqueue tensor operation to device

    // The tensor expression is carried in the Expr type
    template <typename Expr, typename Dev>
    void run(Expr &expr, Dev &dev) {
      Eigen::TensorEvaluator<Expr, Dev> evaluator(expr, dev);
      const bool needs_assign = evaluator.evalSubExprsIfNeeded(NULL);
      if (needs_assign) {
        typedef decltype(internal::extractFunctors(evaluator)) FunctorExpr;
        FunctorExpr functors = internal::extractFunctors(evaluator);
        // Add work to the queue
        dev.sycl_queue().submit([&](cl::sycl::handler &cgh) {
          // Create a tuple of accessors from the evaluator: this packages up the data
          // references and the expression to be evaluated for sending to the device
          typedef decltype(internal::createTupleOfAccessors(cgh, evaluator)) TupleType;
          TupleType tuple_of_accessors = internal::createTupleOfAccessors(cgh, evaluator);
          typename Expr::Index range, GRange, tileSize;
          dev.parallel_for_setup(
              static_cast<typename Expr::Index>(evaluator.dimensions().TotalSize()),
              tileSize, range, GRange);
          // Enqueue data-parallel work to the device
          cgh.parallel_for(
              cl::sycl::nd_range<1>(cl::sycl::range<1>(GRange), cl::sycl::range<1>(tileSize)),
              ExecExprFunctorKernel<Expr, FunctorExpr, TupleType>(range, functors, tuple_of_accessors));
        });
        dev.asynchronousExec();
      }
      evaluator.cleanup();
    }

SYCL ecosystem

• SYCL.tech – http://sycl.tech

• SYCLBLAS – SYCL BLAS library that supports kernel fusion
• Eigen – used in TensorFlow for custom operations
• TensorFlow
• VisionCpp – demonstration of how to build C++ graphs for vision
• C++ 17 Parallel STL for SYCL – supports the new C++ 17 Parallel STL standard (see the sketch below)
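As a small illustration of that last item, a sketch assuming the headers and names used by the SYCL Parallel STL project (sycl/execution_policy, sycl::sycl_execution_policy); none of these names appear on the slide itself:

    #include <vector>
    #include <CL/sycl.hpp>
    #include <sycl/execution_policy>
    #include <experimental/algorithm>

    void sort_on_device(std::vector<int> &v) {
      cl::sycl::queue q;                                    // default SYCL device
      sycl::sycl_execution_policy<class sort_kernel> snp(q);
      // A standard algorithm, dispatched through SYCL to the chosen device
      std::experimental::parallel::sort(snp, v.begin(), v.end());
    }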

How do I get SYCL?

• ComputeCpp: from Codeplay (my company)
  - Available for free and works with OpenCL accelerators using SPIR

• triSYCL: open source
  - Doesn't (yet) work with OpenCL accelerators

What now?

• We are working on supporting OpenCL 2.2 with SYCL 2.2

• We are working on bringing heterogeneous acceleration into a future ISO C++

• We are building out the open standard ecosystem of C++ accelerated software

Questions?

Where does SYCL fit in? – Parallelism

Directive-based parallelism
- e.g.: OpenMP, OpenACC
- Annotate serial code with #pragmas highlighting where the parallelising compiler should transform the code:
    Vector a, b, r;
    #pragma parallel_for
    for (int i = 0; i < a.size(); i++) {
      r[i] = a[i] + b[i];
    }
- Pros: Original source code is annotated, not modified; well-understood
- Cons: Hard to compose; execution order separate from source code

Thread parallelism
- e.g.: tbb, C++11 threads, pthreads
- Create explicit threads to break up a task into parallel sections:
    Vector a, b, r;
    Thread t1 = createThread([&]() { sumFirstHalf(r, a, b); });
    Thread t2 = createThread([&]() { sumSecondHalf(r, a, b); });
    t1.wait();
    t2.wait();
- Pros: Well understood; works with a variety of algorithms
- Cons: Doesn't map to highly parallel architectures like GPUs & FPGAs

Explicit parallelism
- e.g.: SYCL, Parallel STL, CUDA, C++ AMP
- Parallelism is expressed explicitly in the program:
    Vector a, b;
    parallel_for(a.range(), [&](int id) {
      a[id] = a[id] + b[id];
    });
- Pros: Composable; works with a wide variety of processor architectures
- Cons: Requires the user to know the parallelism

What features of OpenCL do we need?

• We want to make it easy to write high-performance OpenCL code in C++
  - SYCL code in C++ must use memory and execute kernels efficiently
  - We must provide developers with all the optimization options they have in OpenCL

• We want to enable all OpenCL features in C++ with SYCL (see the sketch after this list)
  - Support a wide range of OpenCL devices: CPUs, GPUs, FPGAs, DSPs…
  - Data on host: images and buffers; mapping, DMA and copying
  - Data on device: global/constant/local/private memory; multiple pointer sizes
  - Parallelism: ND ranges, work-groups, work-items, barriers, queues, events
  - Multi-device: platforms, devices, contexts

• We want to enable OpenCL C code to interoperate with C++ SYCL code
  - Sharing of contexts, memory objects etc
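A minimal sketch, not from the slides, of a few of those features in SYCL code: an ND range with work-groups, work-group local memory and a barrier. The function, buffer and kernel names are illustrative, and the buffer size is assumed to be a multiple of the work-group size.

    #include <CL/sycl.hpp>
    using namespace cl::sycl;

    void reverse_within_groups(queue &q, buffer<float, 1> &data, size_t wgSize) {
      q.submit([&](handler &cgh) {
        auto d = data.get_access<access::mode::read_write>(cgh);
        // Work-group local memory: one float per work-item
        accessor<float, 1, access::mode::read_write, access::target::local>
            scratch(range<1>(wgSize), cgh);
        // ND range: global size split into work-groups of wgSize work-items
        cgh.parallel_for<class reverse_kernel>(
            nd_range<1>(data.get_range(), range<1>(wgSize)),
            [=](nd_item<1> it) {
              size_t l = it.get_local(0);
              size_t g = it.get_global(0);
              scratch[l] = d[g];
              it.barrier(access::fence_space::local_space);  // work-group barrier
              d[g] = scratch[wgSize - 1 - l];
            });
      });
    }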

Example SYCL Code

    #include <CL/sycl.hpp>   // include the SYCL header file

    void func(float *array_a, float *array_b, float *array_c,
              float *array_r, size_t count)
    {
      // Encapsulate data in SYCL buffers, which can be mapped or copied
      // to or from OpenCL devices
      buffer<float, 1> buf_a(array_a, range<1>(count));
      buffer<float, 1> buf_b(array_b, range<1>(count));
      buffer<float, 1> buf_c(array_c, range<1>(count));
      buffer<float, 1> buf_r(array_r, range<1>(count));

      // Create a queue, preferably on a GPU, which can execute kernels
      queue myQueue(gpu_selector{});

      // Submit to the queue all the work described in the handler lambda that follows
      myQueue.submit([&](handler &cgh) {
        // Create accessors, which encapsulate the type of access to data in the buffers
        auto a = buf_a.get_access<access::mode::read>(cgh);
        auto b = buf_b.get_access<access::mode::read>(cgh);
        auto c = buf_c.get_access<access::mode::read>(cgh);
        auto r = buf_r.get_access<access::mode::write>(cgh);

        // Execute in parallel the work over an ND range (in this case 'count')
        cgh.parallel_for<class add3>(range<1>(count), [=](id<1> i) {
          // This code is executed in parallel on the device
          r[i] = a[i] + b[i] + c[i];
        });
      });
    }

Task Graph Deduction

    const int n_items = 32;
    range<1> r(n_items);

    int array_a[n_items] = { 0 };
    int array_b[n_items] = { 0 };
    buffer<int, 1> buf_a(array_a, range<1>(r));
    buffer<int, 1> buf_b(array_b, range<1>(r));

    queue q;

    // Each submit creates a command group; the runtime deduces the task graph
    // from the accessors and schedules the groups efficiently.
    q.submit([&](handler &cgh) {
      auto acc_a = buf_a.get_access<access::mode::read_write>(cgh);
      algorithm_a s(acc_a);
      cgh.parallel_for(r, s);
    });

    q.submit([&](handler &cgh) {
      auto acc_b = buf_b.get_access<access::mode::read_write>(cgh);
      algorithm_b s(acc_b);
      cgh.parallel_for(r, s);
    });

    q.submit([&](handler &cgh) {
      auto acc_a = buf_a.get_access<access::mode::read_write>(cgh);
      algorithm_c s(acc_a);
      cgh.parallel_for(r, s);
    });

Data access with accessors

• Encapsulates the difference between data storage and data access (see the sketch below)
• Allows creation of a parallel task graph with schedule, synchronization and data movement
• Enables devices to use optimal access to data
  - Including having different pointer sizes on the device to those on the host
  - Allows usage of different address spaces for different data
• Enhanced with call-graph duplication (for C++ pointers) and explicit pointer classes
  - To enable direct pointer-like access to data on the device
• Portable, because accessors can be implemented as raw pointers
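A minimal sketch, not from the slides, of that storage/access separation; the function and kernel names are illustrative:

    #include <CL/sycl.hpp>
    using namespace cl::sycl;

    void scale(queue &q, buffer<float, 1> &data, float factor) {
      q.submit([&](handler &cgh) {
        // Device access: read-write access to the buffer's data on the device
        auto d = data.get_access<access::mode::read_write>(cgh);
        cgh.parallel_for<class scale_kernel>(data.get_range(), [=](id<1> i) {
          d[i] *= factor;
        });
      });
      // Host access: waits for the device work above, then maps or copies the
      // data back so the host can read it; the buffer itself owns the storage
      auto h = data.get_access<access::mode::read>();
      float first = h[0];   // use on the host
      (void)first;
    }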

'Shared source' approach to single-source

• This is not required for SYCL, but is designed as a possible implementation
• Have a different compiler for host and each device
  - Don't really need to implement a different front-end for each device, but can
• Benefits
  - Many developers are required to use a specific host compiler
  - Allows front-ends to optimize for specific devices: e.g. CPU, FPGA, GPU, DSP
  - Allows the pre-processor to be used by developers for portability and performance portability (see the sketch below)
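A small sketch of that last point, assuming the __SYCL_DEVICE_ONLY__ macro defined by SYCL device compilers; the function name is illustrative:

    #include <CL/sycl.hpp>

    // The pre-processor lets one source file carry host-only or device-only
    // specializations when a separate device compiler is used.
    template <typename T>
    T mul_add(T a, T b, T c) {
    #ifdef __SYCL_DEVICE_ONLY__
      return cl::sycl::fma(a, b, c);   // device built-in, seen only by the device compiler
    #else
      return a * b + c;                // plain code, seen only by the host compiler
    #endif
    }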

Example SYCL Code: Building the program

    #include <CL/sycl.hpp>
    // #include the SYCL header file. This can be implemented differently for host
    // and device. It can also #include the compiled device kernel binaries.

    int main()
    {
      // On the host, the accessors can represent the dependencies in the program.
      // On the device, they can be implemented as OpenCL pointers (whether global,
      // local or constant).
      buffer<float, 1> buf_a(array_a, range<1>(count));
      buffer<float, 1> buf_b(array_b, range<1>(count));
      buffer<float, 1> buf_c(array_c, range<1>(count));
      buffer<float, 1> buf_r(array_r, range<1>(count));

      queue myQueue(gpu_selector{});

      myQueue.submit([&](handler &cgh) {
        auto a = buf_a.get_access<access::mode::read>(cgh);
        auto b = buf_b.get_access<access::mode::read>(cgh);
        auto c = buf_c.get_access<access::mode::read>(cgh);
        auto r = buf_r.get_access<access::mode::write>(cgh);

        // This code is extracted by a device compiler and compiled for a device,
        // including any functions or methods called from here. All the code must
        // conform to the OpenCL kernel restrictions (e.g. no recursion). This code
        // can be compiled for different devices from the same source code.
        //
        // The class name given to parallel_for names the lambda; it is used to
        // enable the host to load the correct compiled device kernel into OpenCL.
        // C++ reflection may remove this requirement.
        cgh.parallel_for<class add3>(range<1>(count), [=](id<1> i) {
          r[i] = a[i] + b[i] + c[i];
        });
      });
    }

Where does SYCL fit in? – Memory model

Cache-coherent single-address space
- e.g.: multi-core CPUs, HSA, OpenCL 2.x system sharing
    float *a = new float[size];
    processCodeOnDevice(a, size);
- When parallelizing on a system with a cache-coherent single address space, you only need to pass around pointers. This makes communication and offloading very low-cost and easy. It requires all memory accesses to go through the virtual memory system, and caches to communicate ownership across all cores.
- Pros: Very easy to program – just pass around pointers (leave ownership issues to the user); low-latency offload; very little impact on the programming model
- Cons: Bandwidth limited; costs power; needs special virtual memory support

Non-coherent single-address space
- e.g.: HSA coarse-grained, OpenCL 2.x
    float *a = NewShared(size);
    a.passOwnershipToDevice(size);
    processCodeOnDevice(a, size);
- All data is still referred to via shared pointers, but the user must manage the memory ownership between different cores.
- Pros: Doesn't require (much) OS support or (much) hardware support
- Cons: Not supported on all processor cores; the user must manage ownership

Multi-address space
- e.g.: SYCL 1.2, C++ AMP, OpenCL 1.x
    Shared<float> a(size);
    processCodeOnDevice(a);
- Data needs to be encapsulated in new datatypes that are able to manage ownership between the host CPU and the different devices.
- Pros: High performance and efficiency of memory accesses; wide device support
- Cons: Impact on the programming model (pointers)
