How to Deploy AI Software to Self Driving Cars Rod Burns, Gordon Brown, Meenakshi Ravindran and Nicolas Miller

IWOCL '19 - May 2019. Codeplay - Connecting AI to Silicon

Company: established 2002 in Scotland, ~70 employees. High-performance software solutions for custom heterogeneous systems, enabling the toughest systems with tools and middleware based on open standards, working with customers and partners.

Products: a C++ platform via the SYCL™ open standard, enabling vision & machine learning, e.g. TensorFlow™.

Technologies: Vision Processing, Machine Learning, Artificial Intelligence and Big Data Compute; the heart of Codeplay's compute technology, enabling OpenCL™, SPIR™, HSA™ and Vulkan™.

Addressable Markets: Automotive (ISO 26262), IoT, Smartphones & Tablets, High Performance Compute (HPC), Medical & Industrial.

2 © 2019 Codeplay Software Ltd. Agenda

Emergent hardware for AI in automotive

Overview of OpenCL/SYCL programming model

Mapping typical hardware to the OpenCL/SYCL programming model

The Renesas R-Car architecture

Extending OpenCL & SYCL for R-Car

Optimising machine learning algorithms using R-Car

3 © 2019 Codeplay Software Ltd. Autonomous driving is one of the biggest challenges in technology

The automotive industry needs to deliver the latest AI technologies with safety, high performance and low power consumption

4 © 2019 Codeplay Software Ltd. Delivering an autonomous vehicle is a huge software and hardware challenge

It requires scaling up software development to very high levels of complexity, performance and risk

Whilst maintaining low power consumption

5 © 2019 Codeplay Software Ltd. Renesas R-Car architecture

● Embedded automotive architecture
● Optimized for computer vision processing and machine learning
● Designed for low latency, low power consumption and low cost

6 © 2019 Codeplay Software Ltd. Software stack: SYCL-BLAS and SYCL-DNN sit on top of SYCL, which in turn sits on top of OpenCL.

7 © 2019 Codeplay Software Ltd. Agenda

Emergent hardware for AI in automotive

Overview of OpenCL/SYCL programming model

Mapping typical hardware to the OpenCL/SYCL programming model

The Renesas R-Car architecture

Extending OpenCL & SYCL for R-Car

Optimising machine learning algorithms using R-Car

8 © 2019 Codeplay Software Ltd. The OpenCL execution and memory model:

1. A processing element executes a single work-item
2. Each work-item can access private memory, a dedicated memory region for each processing element
3. A compute unit executes a work-group, composed of multiple work-items, one for each processing element in the compute unit
4. Each work-item can access local memory, a dedicated memory region for each compute unit
5. A device can execute multiple work-groups
6. Each work-item can access global memory, a single memory region available to all processing elements

14 © 2019 Codeplay Software Ltd. Private memory < Local memory < Global memory

15 © 2019 Codeplay Software Ltd. How these map onto each other:

● Work-item: private memory
● Work-group: local memory, synchronised by a work-group barrier
● Kernel: global memory, synchronised by a kernel barrier
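To make these terms concrete, here is a minimal SYCL 1.2.1-style kernel, illustrative only and not from the slides (the buffer names, sizes and kernel are assumptions): each work-item holds a value in private memory, each work-group stages data in a local-memory accessor and synchronises with a work-group barrier, and results are written to global memory.

    #include <CL/sycl.hpp>
    #include <vector>

    int main() {
      using namespace cl::sycl;
      constexpr size_t N = 1024;   // total work-items (global range)
      constexpr size_t WG = 64;    // work-items per work-group (local range)

      std::vector<float> input(N, 1.0f), output(N, 0.0f);
      {
        queue q;
        buffer<float, 1> inBuf(input.data(), range<1>(N));
        buffer<float, 1> outBuf(output.data(), range<1>(N));

        q.submit([&](handler &cgh) {
          auto in  = inBuf.get_access<access::mode::read>(cgh);    // global memory
          auto out = outBuf.get_access<access::mode::write>(cgh);  // global memory
          // One local-memory allocation per work-group
          accessor<float, 1, access::mode::read_write, access::target::local>
              tile(range<1>(WG), cgh);

          cgh.parallel_for<class reverse_in_group>(
              nd_range<1>(range<1>(N), range<1>(WG)), [=](nd_item<1> it) {
                size_t lid = it.get_local_id(0);   // id within the work-group
                size_t gid = it.get_global_id(0);  // id within the kernel
                float priv = in[gid];              // private memory (registers)
                tile[lid] = priv;                  // stage into local memory
                it.barrier(access::fence_space::local_space);  // work-group barrier
                // Each work-item reads another work-item's value from local memory
                out[gid] = tile[WG - 1 - lid];
              });
        });
      }
    }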

23 © 2019 Codeplay Software Ltd. SYCL is a cross-platform, single-source, high-level C++ programming layer. It is built on top of OpenCL and based on standard C++11, delivering a heterogeneous programming solution for C++.

24 © 2019 Codeplay Software Ltd. Vector addition expressed in different heterogeneous programming models:

CUDA:
    __global__ vec_add(float *a, float *b, float *c) {
      return c[i] = a[i] + b[i];
    }
    float *a, *b, *c;
    vec_add<<<...>>>(a, b, c);

Pragma-based parallel loop:
    vector a, b, c;
    #pragma parallel_for
    for (int i = 0; i < a.size(); i++) {
      c[i] = a[i] + b[i];
    }

C++ AMP:
    array_view a, b, c;
    extent<2> e(64, 64);
    parallel_for_each(e, [=](index<2> idx) restrict(amp) {
      c[idx] = a[idx] + b[idx];
    });

SYCL:
    cgh.parallel_for(range, [=](cl::sycl::id<2> idx) {
      c[idx] = a[idx] + b[idx];
    });

25 © 2019 Codeplay Software Ltd. SYCL separates the storage and access of data through the use of buffers and accessors

SYCL provides data dependency tracking based on accessors to optimise the scheduling of tasks

26 © 2019 Codeplay Software Ltd. Buffers manage data across the host and one or more devices. Accessors are used to describe access requirements.

(Diagram: a single buffer accessed through one accessor by command group CG A and through another accessor by command group CG B.)

Buffers and accessors provide type-safe access across host and device

27 © 2019 Codeplay Software Ltd. (Diagram: an example task graph in which command groups CG A, CG B and CG C are linked through read and write accessors on buffers A, B, C and D; the accessors define the dependencies between the command groups.)

28 © 2019 Codeplay Software Ltd. Accessor types for a buffer within a command group (CG):

● host_buffer accessor: request access to a buffer immediately on the host
● global_buffer accessor: request access to a buffer in the global memory region
● constant_buffer accessor: request access to a buffer in the constant memory region
● local accessor: allocate memory in the local memory region
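A sketch of how these accessor types look in SYCL source (illustrative; the data, sizes and kernel are assumptions, not from the slides): the command group uses global_buffer, constant_buffer and local accessors, and the host then requests host access.

    #include <CL/sycl.hpp>
    #include <vector>

    int main() {
      using namespace cl::sycl;
      constexpr size_t N = 256;
      constexpr size_t WG = 64;
      std::vector<float> data(N, 2.0f);
      std::vector<float> scale(1, 3.0f);
      {
        queue q;
        buffer<float, 1> buf(data.data(), range<1>(N));
        buffer<float, 1> scaleBuf(scale.data(), range<1>(1));

        q.submit([&](handler &cgh) {
          // global_buffer accessor: access in the global memory region
          auto global = buf.get_access<access::mode::read_write,
                                       access::target::global_buffer>(cgh);
          // constant_buffer accessor: access in the constant memory region
          auto coeff = scaleBuf.get_access<access::mode::read,
                                           access::target::constant_buffer>(cgh);
          // local accessor: allocate WG floats of local memory per work-group
          accessor<float, 1, access::mode::read_write, access::target::local>
              scratch(range<1>(WG), cgh);

          cgh.parallel_for<class scale_kernel>(
              nd_range<1>(range<1>(N), range<1>(WG)), [=](nd_item<1> it) {
                scratch[it.get_local_id(0)] = global[it.get_global_id(0)];
                it.barrier(access::fence_space::local_space);
                global[it.get_global_id(0)] = scratch[it.get_local_id(0)] * coeff[0];
              });
        });

        // host_buffer accessor: request access immediately on the host;
        // this waits for the device work that uses the buffer to finish
        auto host = buf.get_access<access::mode::read>();
        // host[0] is now 6.0f
      }
    }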

29 © 2019 Codeplay Software Ltd. Benefits of data dependency task graphs

● Allows you to describe your tasks in terms of relationships
  ○ Removes the need to enqueue explicit copies
  ○ Removes the need for complex event handling
● Allows the runtime to make data movement optimizations
  ○ Preemptively copy data to a device before kernels are executed
  ○ Avoid unnecessarily copying data back to the host after execution on a device
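As a sketch of what this means in practice (an illustrative example, not from the slides; the buffer names and kernels are assumptions), two command groups below share a buffer; the second command group's accessor on that buffer creates the dependency, so no explicit copies or events appear in user code and the runtime orders the work itself.

    #include <CL/sycl.hpp>
    #include <vector>

    int main() {
      using namespace cl::sycl;
      constexpr size_t N = 1024;
      std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);
      {
        queue q;
        buffer<float, 1> bufA(a.data(), range<1>(N));
        buffer<float, 1> bufB(b.data(), range<1>(N));
        buffer<float, 1> bufC(c.data(), range<1>(N));

        // CG A: writes bufC from bufA and bufB
        q.submit([&](handler &cgh) {
          auto inA = bufA.get_access<access::mode::read>(cgh);
          auto inB = bufB.get_access<access::mode::read>(cgh);
          auto out = bufC.get_access<access::mode::write>(cgh);
          cgh.parallel_for<class add>(range<1>(N),
                                      [=](id<1> i) { out[i] = inA[i] + inB[i]; });
        });

        // CG B: reads and updates bufC; its accessor on bufC creates a
        // dependency on CG A, so the runtime schedules CG B after it with no
        // explicit events or copies in user code
        q.submit([&](handler &cgh) {
          auto data = bufC.get_access<access::mode::read_write>(cgh);
          cgh.parallel_for<class scale>(range<1>(N),
                                        [=](id<1> i) { data[i] *= 2.0f; });
        });
      } // buffers synchronise back to the host vectors on destruction
    }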

30 © 2019 Codeplay Software Ltd. Agenda

Emergent hardware for AI in automotive

Overview of OpenCL/SYCL programming model

Mapping typical hardware to the OpenCL/SYCL programming model

The Renesas R-Car architecture

Extending OpenCL & SYCL for R-Car

Optimising machine learning algorithms using R-Car

31 © 2019 Codeplay Software Ltd. CPU

32 © 2019 Codeplay Software Ltd. A typical CPU:

1. A CPU has a region of dedicated memory (DDR)
2. The CPU memory is connected to the CPU via a bus
3. A CPU has a number of cores
4. A CPU has a number of caches of different levels
5. Each CPU core has dedicated registers

38 © 2019 Codeplay Software Ltd. Mapping the CPU to the OpenCL/SYCL model:

1. Lanes of the CPU cores' SIMD instructions are mapped to work-items
2. CPU registers and their associated caches are mapped to private memory
3. A section of DDR is mapped to local memory
4. The rest of DDR is mapped to global memory
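How this mapping surfaces to a programmer can be seen through standard device info queries; a minimal sketch (assuming an OpenCL CPU device is available; the printed properties are standard OpenCL/SYCL device info and the values depend on the CPU):

    #include <CL/sycl.hpp>
    #include <iostream>

    int main() {
      using namespace cl::sycl;
      // Select an OpenCL CPU device; compute units typically correspond to
      // hardware threads, and SIMD lanes to work-items as described above
      queue q{cpu_selector{}};
      auto dev = q.get_device();

      std::cout << "Device:              "
                << dev.get_info<info::device::name>() << "\n";
      std::cout << "Compute units:       "
                << dev.get_info<info::device::max_compute_units>() << "\n";
      std::cout << "SIMD width (float):  "
                << dev.get_info<info::device::preferred_vector_width_float>() << "\n";
      std::cout << "Local memory (B):    "
                << dev.get_info<info::device::local_mem_size>() << "\n";
    }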

42 © 2019 Codeplay Software Ltd. GPU

43 © 2019 Codeplay Software Ltd. A typical GPU:

1. A GPU has a region of dedicated DDR memory which is connected to the CPU
2. A GPU is divided into a number of compute units
3. Each compute unit has dedicated shared memory
4. Each compute unit has a number of processing elements (PE)
5. Each processing element has dedicated processing element local memory (PM)

49 © 2019 Codeplay Software Ltd. Mapping the GPU to the OpenCL/SYCL model:

● Compute units are mapped to work-groups of the optimal work-group size
● Processing elements are mapped to work-items
● DDR memory is mapped to global memory
● Compute unit shared memory is mapped to local memory
● Processing element local memory is mapped to private memory
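The GPU mapping can be inspected in the same way; a minimal sketch using standard device info queries (illustrative, assuming a GPU device is present):

    #include <CL/sycl.hpp>
    #include <iostream>

    int main() {
      using namespace cl::sycl;
      queue q{gpu_selector{}};
      auto dev = q.get_device();

      std::cout << "Device:              "
                << dev.get_info<info::device::name>() << "\n";
      std::cout << "Compute units:       "
                << dev.get_info<info::device::max_compute_units>() << "\n";
      std::cout << "Max work-group size: "
                << dev.get_info<info::device::max_work_group_size>() << "\n";
      std::cout << "Local memory (B):    "
                << dev.get_info<info::device::local_mem_size>() << "\n";
      std::cout << "Global memory (B):   "
                << dev.get_info<info::device::global_mem_size>() << "\n";
    }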

54 © 2019 Codeplay Software Ltd. Agenda

Emergent hardware for AI in automotive

Overview of OpenCL/SYCL programming model

Mapping typical hardware to the OpenCL/SYCL programming model

The Renesas R-Car architecture

Extending OpenCL & SYCL for R-Car

Optimising machine learning algorithms using R-Car

55 © 2019 Codeplay Software Ltd. CVEngine

56 © 2019 Codeplay Software Ltd. The R-Car CVEngine:

1. The CVEngine is connected to off-chip DDR, which is also connected to the CPU
2. The CVEngine has a number of clusters
3. The CVEngine has a region of on-chip SRAM, also connected to the CPU
4. Each cluster has 4 cores, each with a number of processing elements
5. Each core has dedicated registers and can access DDR memory via caches
6. Each core also has dedicated local SRAM
7. The local SRAM is connected to the on-chip SRAM and DDR via DMA

63 © 2019 Codeplay Software Ltd. (Diagram: the CVEngine with on-chip SRAM and two clusters, each cluster containing cores with local SRAM, registers and multi-level caches, all connected to off-chip DDR.)

64 © 2019 Codeplay Software Ltd. Mapping the CVEngine to the OpenCL/SYCL model:

● Each cluster maps to the optimal work-group size
● Cores provide an extra level of subdivision within work-groups, so each core maps to a sub-group (of N work-items); sub-groups are available in OpenCL 2.x but not yet available in SYCL, so this will require an extension
● Each processing element within a core maps to a single work-item
● Off-chip DDR memory is mapped to global memory
● On-chip SRAM is mapped to local memory
● Registers are mapped to private memory
● Since the on-chip SRAM can be written to and read from the CPU and can be accessed by all work-groups, similar to global memory, it can also be used to allocate low-latency on-chip buffers
● Since each core has its own dedicated local SRAM, local SRAM can be mapped to a sub-group local memory; sub-group local memory is not yet available in OpenCL or SYCL, so this will require an extension
● Since there are DMA connections from the on-chip SRAM and DDR to the local SRAM, the CVEngine can support asynchronous memory copies from on-chip memory buffers and global memory buffers into sub-group local memory and vice versa; these asynchronous copies cannot be represented in OpenCL or SYCL, so they will require an extension

73 © 2019 Codeplay Software Ltd. The extended execution and memory model for the CVEngine:

● Work-item: private memory
● Sub-group: sub-group local memory, synchronised by a sub-group barrier
● Work-group: local memory, synchronised by a work-group barrier
● Kernel: global memory and on-chip memory, synchronised by a kernel barrier
● Asynchronous sub-group copies move data into and out of sub-group local memory

77 © 2019 Codeplay Software Ltd. Agenda

Emergent hardware for AI in automotive

Overview of OpenCL/SYCL programming model

Mapping typical hardware to the OpenCL/SYCL programming model

The Renesas R-Car architecture

Extending OpenCL & SYCL for R-Car

Optimising machine learning algorithms using R-Car

78 © 2019 Codeplay Software Ltd. Disclaimer

The features that I present here are Codeplay extensions and are not standard SYCL features

79 © 2019 Codeplay Software Ltd. (Diagram: the CVEngine memory hierarchy mapped to the OpenCL/SYCL model, as shown above.)

80 © 2019 Codeplay Software Ltd. ● On-chip memory
  ○ On-chip memory is allocated in OpenCL/SYCL similarly to regular buffers
    ■ ComputeAorta (OpenCL) provides an extension API
    ■ ComputeCpp (SYCL) provides an extension buffer property use_onchip_memory
  ○ On-chip memory buffers are accessed in OpenCL/SYCL kernels in the same way as regular buffers

81 © 2019 Codeplay Software Ltd. We construct a SYCL buffer as normal, but provide the use_onchip_memory buffer property. This property takes an enumeration: either require, which means the SYCL runtime has to use on-chip memory, or prefer, which means the SYCL runtime should try to use it.

    class kernel;
    using namespace cl::sycl;
    {
      queue deviceQueue;

      buffer onchipBuffer(hostData, size,
          {codeplay::property::buffer::use_onchip_memory(codeplay::property::require)});

      deviceQueue.submit([&](handler &cgh){
        auto onchipAcc = onchipBuffer.get_access(cgh);

        cgh.parallel_for<kernel>(range<1>(size), [=](id<1> idx){
          onchipAcc[idx] = onchipAcc[idx] * onchipAcc[idx];
        });
      });
    }

82 © 2019 Codeplay Software Ltd. ● Sub-groups
  ○ Sub-groups are exposed following the OpenCL 2.x feature and as a natural extension to the SYCL execution model
    ■ ComputeAorta (OpenCL) provides kernel builtins for querying sub-group info and invoking a sub-group barrier
    ■ ComputeCpp (SYCL) provides an extension to nd_item to expose a sub_group object, similar to group, which exposes member functions for querying sub-group info and invoking a sub-group barrier
  ○ The size of sub-groups cannot be specified explicitly by the user; it is determined by the implementation

83 © 2019 Codeplay Software Ltd. We query sub-group information in-kernel and invoke sub-group barriers via the sub_group class: nd_item has a member function called get_sub_group that returns a sub_group object. If an implementation does not support sub-groups, using sub_group is undefined.

    class kernel;
    using namespace cl::sycl;
    {
      queue deviceQueue;

      buffer deviceBuffer(hostData, size);

      deviceQueue.submit([&](handler &cgh){
        auto deviceAcc = deviceBuffer.get_access(cgh);

        cgh.parallel_for<kernel>(nd_range<1>(range<1>(size), range<1>(32)),
                                 [=](nd_item<1> ndItem){
          ...
          auto subGroup = ndItem.get_sub_group();
          auto subGroupRange = subGroup.get_group_range();
          auto subGroupId = subGroup.get_group_id();
          subGroup.barrier();
          ...
        });
      });
    }

84 © 2019 Codeplay Software Ltd. ● Sub-group local memory
  ○ Sub-group local memory is exposed with extensions which follow the OpenCL/SYCL memory model
    ■ ComputeAorta (OpenCL) provides a new address space which can be used to allocate sub-group local memory
    ■ ComputeCpp (SYCL) provides a new accessor access target, access::target::subgroup_local, that behaves similarly to access::target::local

85 © 2019 Codeplay Software Ltd. We allocate sub-group local memory by constructing an accessor with the subgroup_local access target.

    class kernel;
    using namespace cl::sycl;
    {
      queue deviceQueue;

      buffer deviceBuffer(hostData, size);

      deviceQueue.submit([&](handler &cgh){

        auto deviceAcc = deviceBuffer.get_access(cgh);

        auto subGroupLocalMem = accessor(cgh, range<1>(32));  // subgroup_local access target

        cgh.parallel_for<kernel>(nd_range<1>(range<1>(size), range<1>(32)),
                                 [=](nd_item<1> ndItem){
          ...
          subGroupLocalMem[idx] = ...;
          ...
        });
      });
    }

86 © 2019 Codeplay Software Ltd. ● Asynchronous sub-group copies
  ○ Asynchronous sub-group copies are exposed following the OpenCL/SYCL feature for asynchronous work-group copies
    ■ ComputeAorta (OpenCL) provides a plane_t type to represent a non-accessible buffer and kernel builtins for invoking asynchronous in-kernel copies between a plane_t and a sub-group local memory allocation
    ■ ComputeCpp (SYCL) provides a new accessor access target, access::target::plane, and a member function on the sub_group extension, async_sub_group_copy, to perform an asynchronous in-kernel copy from a plane accessor to a sub-group local accessor

87 © 2019 Codeplay Software Ltd. We construct a plane accessor using the plane access target, then perform an asynchronous in-kernel copy by calling the sub_group member function async_sub_group_copy. This returns a device_event that can be used to wait on the copy to complete.

    class kernel;
    using namespace cl::sycl;
    {
      queue deviceQueue;

      buffer deviceBuffer(hostData, size);

      deviceQueue.submit([&](handler &cgh){

        auto devicePlane = deviceBuffer.get_access(cgh);  // plane access target

        auto subGroupLocalMem = accessor(cgh, range<1>(32));  // subgroup_local access target

        cgh.parallel_for<kernel>(nd_range<1>(range<1>(size), range<1>(32)),
                                 [=](nd_item<1> ndItem){
          ...
          auto subGroup = ndItem.get_sub_group();
          auto event = subGroup.async_sub_group_copy(subGroupLocalMem, devicePlane, range<1>(32));
          ...
          event.wait();
        });
      });
    }

88 © 2019 Codeplay Software Ltd. Agenda

Emergent hardware for AI in automotive

Overview of OpenCL/SYCL programming model

Mapping typical hardware to the OpenCL/SYCL programming model

The Renesas R-Car architecture

Extending OpenCL & SYCL for R-Car

Optimising machine learning algorithms using R-Car

89 © 2019 Codeplay Software Ltd. 90 © 2019 Codeplay Software Ltd. Input Image

91 © 2019 Codeplay Software Ltd. Input Image

92 © 2019 Codeplay Software Ltd. Storing the input image in global memory:

✔ The entire image will fit into global memory
✘ Global memory has a high access latency

93 © 2019 Codeplay Software Ltd. Tiling the image into on-chip memory (tiles {0,0}, {1,0}, {0,1}, {1,1}):

✔ On-chip memory has a much lower access latency
✘ Only part of the image will fit into on-chip memory at once, so we have to tile it
✘ Executing a kernel per tile incurs host-side overhead

Note that because convolutions are gather operations the input data must include a halo
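To make the halo concrete, here is a small sketch of the tile arithmetic (the tile and filter sizes are illustrative assumptions, not from the slides): for a K x K filter, each tile of output pixels must be copied in with a border of K/2 input pixels on every side.

    #include <cstddef>

    // For an odd K x K filter, each output pixel reads a (K/2)-pixel border
    // around itself, so a tile of TILE x TILE output pixels needs an input
    // region of (TILE + K - 1) x (TILE + K - 1).
    constexpr std::size_t halo(std::size_t filterSize) { return filterSize / 2; }

    constexpr std::size_t inputTileExtent(std::size_t outputTileExtent,
                                          std::size_t filterSize) {
      return outputTileExtent + 2 * halo(filterSize);
    }

    static_assert(inputTileExtent(64, 3) == 66, "64x64 tile with a 3x3 filter");
    static_assert(inputTileExtent(64, 5) == 68, "64x64 tile with a 5x5 filter");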

94 © 2019 Codeplay Software Ltd. (Diagram: the copy and compute stages form a pipeline; while the convolution for tile {0, 0} executes, the copy for tile {1, 0} is in flight, and likewise for tiles {0, 1} and {1, 1}.)

98 © 2019 Codeplay Software Ltd. ✔ By double buffering copy and computation you can hide the latency of copying into on-chip memory

However, the R-Car CVEngine provides further sub-group local memory, which has an even lower access latency than on-chip memory
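For comparison, the same double-buffering idea can be expressed with standard SYCL command groups, without any of the R-Car extensions; this is a rough sketch under assumptions (tiles are laid out contiguously, and the kernel below is a stand-in for the real convolution). The copy for tile i+1 touches a different staging buffer than the compute for tile i, so the runtime's accessor-based dependency tracking is free to overlap them.

    #include <CL/sycl.hpp>
    #include <vector>

    void process_tiles(std::vector<float> &image, std::vector<float> &result,
                       size_t numTiles, size_t tileElems) {
      using namespace cl::sycl;
      queue q;
      buffer<float, 1> out(result.data(), range<1>(numTiles * tileElems));
      // Two staging buffers so the copy of tile i+1 can overlap the compute of tile i
      buffer<float, 1> stage[2] = {buffer<float, 1>(range<1>(tileElems)),
                                   buffer<float, 1>(range<1>(tileElems))};

      for (size_t tile = 0; tile < numTiles; ++tile) {
        buffer<float, 1> &cur = stage[tile % 2];

        // Copy command group: stage this tile's input data
        q.submit([&](handler &cgh) {
          auto dst = cur.get_access<access::mode::discard_write>(cgh);
          cgh.copy(image.data() + tile * tileElems, dst);
        });

        // Compute command group: depends on the copy above via its read
        // accessor, but is independent of the next tile's copy, so the two
        // can run concurrently
        q.submit([&](handler &cgh) {
          auto in = cur.get_access<access::mode::read>(cgh);
          auto res = out.get_access<access::mode::write>(cgh);
          size_t base = tile * tileElems;
          cgh.parallel_for<class tile_compute>(range<1>(tileElems), [=](id<1> i) {
            res[base + i[0]] = in[i] * 2.0f;  // stand-in for the real convolution
          });
        });
      }
    } // buffer destructors wait for completion and copy results back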

99 © 2019 Codeplay Software Ltd. Tiling from on-chip memory into sub-group local memory:

✔ Asynchronously copying each part of the input data that is associated with a sub-group to sub-group local memory will further lower access latency
✘ Again, only part of the image data that is associated with a sub-group will fit into sub-group local memory at once, so again we have to tile it

In this case the tiling is done in-kernel

100 © 2019 Codeplay Software Ltd. This kernel operates on a tile that is stored in on-chip memory. We want to perform the computation of the part of the input that each sub-group corresponds to in sub-group local memory, but all of the memory required may not fit into sub-group local memory at once, so we calculate how many tiles are required for a sub-group.

First we initiate and wait on the copy of the first tile so we can perform the computation on it; then we initiate, but don't wait for, the copy of the second tile, so that copy happens in parallel with the computation of the first tile. We then iterate over the tiles, performing the computation of the current tile and then waiting on the copy of the next tile. Finally, if there are further tiles to be processed, we swap the accessors and ranges so that the tile we just waited on becomes the current tile, and initiate the copy of the tile after it, ready for the next iteration of the loop.

    cgh.parallel_for(ndRange, [=](nd_item<1> ndItem){
      auto subGroup = ndItem.get_sub_group();

      // How many sub-group local memory tiles are needed for this sub-group's data
      auto numTiles = calculate_num_tiles(subGroup.get_group_range(), TILE_SIZE);

      auto currentTileRange = calculate_tile_range(subGroup, 0);
      auto nextTileRange = calculate_tile_range(subGroup, 1);

      // Copy the first tile and wait so we can start computing on it immediately
      subGroup.async_sub_group_copy(currentTileLocalMem, currentTilePlane, currentTileRange)
          .wait();
      // Start copying the second tile; this overlaps with computing the first tile
      auto copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlane,
                                                     nextTileRange);

      for (int tile = 0; tile < numTiles; ++tile) {
        compute_tile(subGroup, currentTileRange, output);

        copyEvent.wait();

        if (tile < (numTiles - 1)) {  // only prefetch while further tiles remain
          // The tile we just waited on becomes the current tile
          currentTileRange = nextTileRange;
          nextTileRange = calculate_tile_range(subGroup, tile + 2);

          swap(currentTileLocalMem, nextTileLocalMem);
          swap(currentTilePlane, nextTilePlane);

          // Prefetch the following tile while the current one is being computed
          copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlane,
                                                    nextTileRange);
        }
      }
    });

106 © 2019 Codeplay Software Ltd. ✔ By double buffering the asynchronous copies and the computation in each sub-group you can hide the latency of copying into sub-group local memory

107 © 2019 Codeplay Software Ltd. Conclusion

● The Renesas R-Car CVEngine is designed to efficiently accelerate complex machine learning algorithms in a low power environment

● The OpenCL/SYCL programming model can be efficiently applied, and extended where necessary, to support unique hardware architectures

● This allows automotive systems to take advantage of AI software stacks based on open standards

108 © 2019 Codeplay Software Ltd.

We’re Hiring!

codeplay.com/careers/

Thank you for listening

@codeplaysoft /codeplaysoft codeplay.com