C++ on Accelerators: Supporting Single-Source SYCL and HSA Programming Models Using Clang

Victor Lomüller, Ralph Potter, Uwe Dolinsky
{victor,ralph,uwe}@codeplay.com
Codeplay Software Ltd.

17th March, 2016

Outline

1 Single Source Programming Models

2 Offloading C++ Code to Accelerators

3 Performance Results

4 Conclusion

SINGLE SOURCE PROGRAMMING MODELS

A few definitions

- “Host” is your system CPU
- “Device” is any accelerator: GPU, CPU, DSP...
- “Work-item” is a thread in OpenCL
- “Work-group” is a group of threads that can cooperate in OpenCL (see the sketch below)
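As a rough sketch, the terms map onto a SYCL kernel launch as follows. The sizes are hypothetical and the nd_item method names follow SYCL 1.2.1; they vary slightly between SYCL versions:

cgh.parallel_for<class terms_demo>(
    nd_range<1>(range<1>(1024),   // total number of work-items
                range<1>(64)),    // work-items per work-group
    [=](nd_item<1> it) {
      size_t wi  = it.get_global_id(0);  // this work-item's global id
      size_t grp = it.get_group(0);      // its work-group's id; items in the
                                         // same group can cooperate through
                                         // barriers and local memory
    });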

SYCL: Single Source C++ ecosystem for OpenCL

- An open and royalty-free standard from the Khronos Group
- Cross-platform C++11 ecosystem for OpenCL 1.2
- Kernels and invocation code share the same source file
- Standard C++ layer around OpenCL
  - No language extension
  - Works with any C++11 compiler
- Eases OpenCL application development

Khronos and SYCL are trademarks of the Khronos Group Inc.

SYCL: Compilation

[Compilation flow diagram: the source file goes through ComputeCpp's SYCL compiler, which emits a SPIR binary plus an integration header that is added back into the source file; the host compiler then produces the binary object, and at runtime the executing binary performs the kernel load.]


SYCL Vector Addition

#include <CL/sycl.hpp>
using namespace cl::sycl;
template <typename T> class SimpleVadd;

template <typename T>
void simple_vadd(T *VA, T *VB, T *VC, unsigned ORDER) {
  queue q;
  buffer<T, 1> bA(VA, range<1>(ORDER));
  buffer<T, 1> bB(VB, range<1>(ORDER));
  buffer<T, 1> bC(VC, range<1>(ORDER));

  q.submit([&](handler &cgh) {
    auto pA = bA.template get_access<access::mode::read>(cgh);
    auto pB = bB.template get_access<access::mode::read>(cgh);
    auto pC = bC.template get_access<access::mode::write>(cgh);

    cgh.parallel_for<SimpleVadd<T>>(
        range<1>(ORDER),
        [=](id<1> it) {              // the SYCL kernel
          pC[it] = pA[it] + pB[it];
        });
  });
}

int main() {
  int A[4] = {1, 2, 3, 4}, B[4] = {1, 2, 3, 4}, C[4];
  simple_vadd(A, B, C, 4);
  return 0;
}


SYCL Vector Addition

cgh.parallel_for<SimpleVadd<T>>(
    range<1>(ORDER),
    [=](id<1> it) {
      pC[it] = pA[it] + pB[it];
    });

What is needed to run the kernel on the device?
- the lambda body
- accessor::operator[](id)
- the id copy constructor
- ... and all functions used by accessor::operator[] and the id copy constructor

SYCL Vector Addition: the addition in a separate function

#include <CL/sycl.hpp>
using namespace cl::sycl;
template <typename T> class SimpleVadd;

// Perform the addition in a separate function
template <typename T>
void do_add(T *pC, const T *pA, const T *pB, size_t idx) {
  pC[idx] = pA[idx] + pB[idx];
}

template <typename T>
void simple_vadd(T *VA, T *VB, T *VC, unsigned ORDER) {
  queue q;
  buffer<T, 1> bA(VA, range<1>(ORDER));
  buffer<T, 1> bB(VB, range<1>(ORDER));
  buffer<T, 1> bC(VC, range<1>(ORDER));

  q.submit([&](handler &cgh) {
    auto pA = bA.template get_access<access::mode::read>(cgh);
    auto pB = bB.template get_access<access::mode::read>(cgh);
    auto pC = bC.template get_access<access::mode::write>(cgh);

    cgh.parallel_for<SimpleVadd<T>>(
        range<1>(ORDER), [=](id<1> it) {
          do_add(pC.get_pointer(), pA.get_pointer(), pB.get_pointer(), it);
        });
  });
}

Host view: the host compiler sees do_add exactly as written:

template <typename T>
void do_add(T *pC, const T *pA, const T *pB, size_t idx);

Device view: the device compiler must derive an address-space-qualified version:

template <typename T>
void do_add(__global T *pC, const __global T *pA,
            const __global T *pB, size_t idx);

SYCL Hierarchical

#define LOCAL_RANGE ...
[...]
h.parallel_for_work_group(
    range<1>(std::max(length, LOCAL_RANGE) / LOCAL_RANGE),
    range<1>(LOCAL_RANGE),
    [=](group<1> grp) {
      int scratch[LOCAL_RANGE];  // allocated in the OpenCL local memory

      // Normal per-work-item execution
      parallel_for_work_item(grp, [=](item<1> it) {
        int globalId = grp.get() * it.get_range() + it.get();
        if (globalId < length)
          scratch[it.get()] = aIn[globalId];
      });

      int min = (length < local) ? length : local;
      for (int offset = min / 2; offset > 1; offset /= 2) {
        parallel_for_work_item(grp, range<1>(offset), [=](item<1> it) {
          scratch[it] += scratch[it + offset];
        });
      }
      // Once per-work-group execution
      aOut[grp.get()] = scratch[0] + scratch[1];
    });

SYCL Hierarchical: the addition in a separate function

#define LOCAL_RANGE ...
// Perform the addition in a separate function
void do_sum(int& dst, int a, int b) {
  dst = a + b;
}
[...]
h.parallel_for_work_group(
    range<1>(std::max(length, LOCAL_RANGE) / LOCAL_RANGE),
    range<1>(LOCAL_RANGE),
    [=](group<1> grp) {
      int scratch[LOCAL_RANGE];

      parallel_for_work_item(grp, [=](item<1> it) {
        int globalId = grp.get() * it.get_range() + it.get();
        if (globalId < length)
          scratch[it.get()] = aIn[globalId];
      });

      int min = (length < local) ? length : local;
      for (int offset = min / 2; offset > 1; offset /= 2) {
        parallel_for_work_item(grp, range<1>(offset), [=](item<1> it) {
          // scratch[it] += scratch[it + offset];
          do_sum(scratch[it], scratch[it], scratch[it + offset]);
        });
      }
      // aOut[grp.get()] = scratch[0] + scratch[1];
      do_sum(scratch[0], scratch[0], scratch[1]);
      do_sum(aOut[grp.get()], scratch[0], scratch[1]);
    });

Not (exactly) the same function! The host compiler sees three calls to the same prototype:

void do_sum(int& dst, int a, int b);

but the device compiler must derive three versions, one per calling context:

void do_sum(__local int &dst, int a, int b);   // scratch[it] inside the reduction loop
void do_sum(__local int &dst, int a, int b);   // scratch[0] at work-group scope
void do_sum(__global int &dst, int a, int b);  // aOut[grp.get()]

C++ Front-end for HSAIL

HSA Foundation
- Defines an intermediate language: HSAIL (HSA Intermediate Language)
- Defines a runtime for compute
- Primarily targets heterogeneous SoCs: CPU and accelerators on the same chip
- Aimed mainly at systems with Shared Virtual Memory (but not limited to them)
- Provides some features excluded by OpenCL's lowest-common-denominator model
  - e.g. function pointers, recursion, alloca (see the sketch below)

Ralph's work
- A C++14-based programming model
- Compiles to HSAIL
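As an illustration of those extra features, a kernel in this model can dispatch through a function pointer, which OpenCL C forbids. The sketch below reuses the talk's rt:: builtin and [[hsa::kernel]] attribute, but the kernel itself is hypothetical:

using BinOp = float (*)(float, float);

float add(float x, float y) { return x + y; }

[[hsa::kernel]]
void apply(float* out, const float* a, const float* b, BinOp op) {
  uint32_t i = rt::builtin::workitemabsid(0);
  out[i] = op(a[i], b[i]);  // indirect call: legal in HSAIL, not in OpenCL C
}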

C++ Front-end for HSAIL: Compilation

[Compilation flow diagram: the C++14-for-HSA source file goes through the C++-for-HSA front end, which emits HSAIL that is added to the binary object produced by the host compiler; at runtime the executing binary performs the kernel load.]

C++ Front-end for HSAIL: Example

[[hsa::kernel]]
void vector_add(float* a, float* b, float* c) {
  uint32_t i = rt::builtin::workitemabsid(0);
  a[i] = b[i] + c[i];
}

float *a, *b, *c;
// Asynchronous dispatch
auto future = rt::parallel_for(rt::throughput, grid_size,
                               vector_add, a, b, c);
future.wait();

// Synchronous dispatch
rt::parallel_for(rt::throughput, grid_size, vector_add, a, b, c);

- SYCL and the C++ HSA front end create similar challenges for the compiler
- Default address spaces are different:
  - The global address space is mapped to 0 (the clang default)
  - The private address space is mapped to 1
- These defaults need to be propagated through all called functions (see the sketch below)
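A small sketch of what those defaults mean in practice; the kernel is illustrative, and the address-space numbers are the HSAIL mapping listed above:

[[hsa::kernel]]
void scale(float* data) {  // plain pointer parameter: global memory,
                           // i.e. address space 0 (the clang default)
  float tmp = data[0];     // local variable: private memory, which this
                           // model maps to address space 1
  data[0] = tmp * 2.0f;    // the same defaults must be propagated into
}                          // every function this kernel calls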

SYCL vs C++ Front-end for HSAIL

What is common:
- The compiler must be able to change variable and argument types
- A single function in the source file can be derived into many forms
- The default address space must be configurable
- The compiler needs to maintain several versions of each device function
  - Required by SYCL's hierarchical API
- Context information is used to determine where a function will be executed

OFFLOADING C++ CODE TO ACCELERATORS

Codeplay's Offload Engine
- Call graph duplication algorithm
  - Cooper, Pete, et al. "Offload - Automating Code Migration to Heterogeneous Multicore Systems." High Performance Embedded Architectures and Compilers, 2010.
  - Successfully used in "Offload for PS3", integrated into PS3 games such as NASCAR The Game 2011
  - Work partially funded by the PEPPHER project (www.peppher.eu)
- Integrated into clang for heterogeneous systems: "OffloadCL for clang"
  - Similar work has been done at the LLVM IR level
  - Work partially funded by the CARP project (http://carp.doc.ic.ac.uk)

Codeplay's Offload Engine in a nutshell

- Some attributes:
  - __offload__: explicitly identifies a device function and its space
  - address_space_of_locals: selects address-space defaults and propagation options
- A set of hooks in clang to intercept:
  - Function calls
  - Variable declarations
- Extended overload resolution:
  - A host function is different from a device function
  - Support for "multi-space" device functions
  - The calling context is important
- The offloading core:
  - Clones host functions into device functions
  - Address space inference and promotion

OffloadCL: Simple Example

int* workload(int * arg) {
  int * end = arg + 4;
  //...
  return arg;
}

void host_function(int * arg) {
  arg = workload(arg);
}

#define __global \
  __attribute__((address_space(0xFFFF00)))

// Explicitly offload the function (space 1 by default)
__attribute__((__offload__))
void device_function(__global int * arg) {
  arg = workload(arg);
}

// Variant: explicitly offload the function in space 2
__attribute__((__offload__(2)))
void device_function(__global int * arg) {
  arg = workload(arg);
}

On the host side, workload keeps its original AST:

FunctionDecl 0x66a79d0 used workload 'int *(int *)'
|-ParmVarDecl 0x66a7900 used arg 'int *'
`-CompoundStmt 0x66a7be8
  |-DeclStmt 0x66a7b70
  | `-VarDecl 0x66a7a90 end 'int *' cinit
  |   `-BinaryOperator 0x66a7b48 'int *' '+'
  |     |-ImplicitCastExpr 0x66a7b30 'int *'
  |     | `-DeclRefExpr 0x66a7ae8 'int *' lvalue ParmVar 0x66a7900 'arg' 'int *'
  |     `-IntegerLiteral 0x66a7b10 'int' 4
  `-ReturnStmt 0x66a7bc8
    `-ImplicitCastExpr 0x66a7bb0 'int *'
      `-DeclRefExpr 0x66a7b88 'int *' lvalue ParmVar 0x66a7900 'arg' 'int *'

and the call in host_function refers to it:

| `-DeclRefExpr 0x66e7e70 'int *(int *)' lvalue Function 0x66a79d0 'workload' 'int *(int *)'

Resolving the call from device_function:
- We need to resolve the call to the "workload" function for an argument of type __global int*
- We make clang recognize the conversion from a non-default address space to the default one as a standard conversion
  - e.g. an argument of type int __global ** __global * converts to a parameter of type int*** as a standard conversion

Inferring the duplicate's signature:
- We go through the function's AST to infer the return type of our new function
- Conflicting return types raise an error

__global int * workload(__global int * arg);

Rebuilding the body:
- Using the TreeTransform class, we rebuild the function body for the duplicated function
- For each clang::VarDecl, we infer its type using its initialization
- We reinstantiate templates only if needed

__global int * workload(__global int * arg) {
  __global int * end = arg + 4;
  //...
  return arg;
}

The duplicated AST carries an implicit OffloadFnAttr and the promoted types:

FunctionDecl 0x66e8320 used workload '__global int *(__global int *)'
|-ParmVarDecl 0x66e83c0 used arg '__global int *'
|-CompoundStmt 0x66e8598
| |-DeclStmt 0x66e8520
| | `-VarDecl 0x66e8460 end '__global int *' cinit
| |   `-BinaryOperator 0x66e84f8 '__global int *' '+'
| |     |-ImplicitCastExpr 0x66e84e0 '__global int *'
| |     | `-DeclRefExpr 0x66e84b8 '__global int *' lvalue ParmVar 0x66e83c0 'arg' '__global int *'
| |     `-IntegerLiteral 0x66a7b10 'int' 4
| `-ReturnStmt 0x66e8578
|   `-ImplicitCastExpr 0x66e8560 '__global int *'
|     `-DeclRefExpr 0x66e8538 '__global int *' lvalue ParmVar 0x66e83c0 'arg' '__global int *'
`-OffloadFnAttr 0x66e8420 Implicit 1

and the device-side call now refers to the duplicate (see the conversion sketch below):

| `-DeclRefExpr 0x66e85c0 '__global int *(__global int *)' lvalue Function 0x66e8320 'workload' '__global int *(__global int *)'
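A minimal sketch of that standard-conversion rule at a call site (the sink function is hypothetical):

#define __global __attribute__((address_space(0xFFFF00)))

void sink(int* p);  // host function, default address space

__attribute__((__offload__))
void device_fn(__global int* p) {
  sink(p);  // __global int* -> int* is accepted as a standard conversion,
            // so the host 'sink' is found, and call graph duplication then
            // derives 'void sink(__global int*)' for the device
}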

OffloadCL: Overloading

1. Standard overload resolution
2. Try to select a function in the same space
3. If not possible, try to select host and then duplicate
4. If not possible, select an offloaded function in another space

#define __global \
  __attribute__((address_space(0xFFFF00)))

void call(int*);
__attribute__((__offload__(2)))
void call(__global int*);

__attribute__((__offload__(2)))
void caller(__global int* ptr) {
  call(ptr);
}

__attribute__((__offload__))
void caller(__global int* ptr) {
  // match against the host function
  // and duplicate
  call(ptr);
}

There is no hierarchy between spaces:

__attribute__((__offload__(2)))
void call(int*);
__attribute__((__offload__(3)))
void call(int*);

__attribute__((__offload__(1)))
void caller(int* i) {
  call(i);  // the call is ambiguous
}

OffloadCL: Overloading

The offloaded state is part of the signature, so:
- Function prototypes can differ only by their return types if they are in different spaces
- The distinction is made using the calling context (see the sketch below)

#define __global \
  __attribute__((address_space(0xFFFF00)))
#define __local \
  __attribute__((address_space(0xFFFF01)))

int* get();
__attribute__((__offload__))
__global float* get();

__attribute__((__offload__(2)))
__local int* get();
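A hypothetical set of call sites showing how the calling context picks among the three get() prototypes above:

void on_host() {
  int* p = get();             // host version
}

__attribute__((__offload__))
void in_space1() {
  __global float* q = get();  // space-1 version, despite the different
}                             // return type

__attribute__((__offload__(2)))
void in_space2() {
  __local int* r = get();     // space-2 version
}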

What is missing for full SYCL/HSA support?

OffloadCL creates the infrastructure to support offloading, but it does not handle language specifics:
- What a kernel entry point is (see the sketch below):
  - SYCL: identified by a sycl_kernel attribute (hidden behind parallel_for)
  - HSA FE: identified by the hsa::kernel attribute
- Language restrictions, e.g. no function pointers in SYCL
- The compilation product
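For illustration, the two entry-point markers side by side; the SYCL wrapper is a sketch of what a runtime hides behind parallel_for, not ComputeCpp's actual code:

// SYCL: the runtime marks its kernel-invoking wrapper; users never see it.
template <typename KernelName, typename KernelType>
__attribute__((sycl_kernel))
void wrap_parallel_for(KernelType kernelFunc);  // hypothetical wrapper

// HSA FE: the user marks the kernel directly.
[[hsa::kernel]]
void my_kernel(float* data);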

PERFORMANCE RESULTS

SYCL Performance

[Bar chart: duration (ms, 0 to 140) of the 8x8 Discrete Cosine Transform, comparing the OpenCL and SYCL implementations.]

- 8x8 Discrete Cosine Transform
- Measurements made on an AMD Radeon HD 5400 (Cedar)
- DCT is ALU bound

C++ Front End for HSA Performance

[Bar chart: duration (ms, 0 to 25) of the 8x8 Discrete Cosine Transform, with enqueue and copy time shown separately, comparing the HSA and OpenCL implementations.]

- 8x8 Discrete Cosine Transform
- Measurements made on an AMD A10-7850K APU
- DCT is ALU bound

CONCLUSION

Conclusion
- OffloadCL is our single-source enabler technology
- It offers flexibility via:
  - Call graph duplication and function space management
  - Extended overload resolution logic
  - Automatic address space inference and promotion
- It keeps a clear separation between host and device code
- No overhead on the generated code

ComputeCpp

- ComputeCpp is our SYCL implementation
- Offload is the core technology behind ComputeCpp's compiler
- An evaluation program is available
- Register your interest on our website!

ComputeCpp is a trademark of Codeplay Software Ltd.

@codeplaysoft [email protected] codeplay.com