How to Deploy AI Software to Self-Driving Cars
Rod Burns, Gordon Brown, Meenakshi Ravindran and Nicolas Miller
IWOCL '19, May 2019
Codeplay - Connecting AI to Silicon
Products: a C++ platform via the SYCL™ open standard, enabling vision & machine learning, e.g. TensorFlow™
Addressable markets: automotive (ISO 26262), IoT, smartphones & tablets, High Performance Compute (HPC), medical & industrial
Technologies: the heart of Codeplay's compute technology - vision processing, machine learning, artificial intelligence, big data compute - enabling OpenCL™, SPIR™, HSA™ and Vulkan™
Company: high-performance software solutions for custom heterogeneous systems; enabling the toughest processor systems with tools and middleware based on open standards; established 2002 in Scotland, ~70 employees, with customers and partners
© 2019 Codeplay Software Ltd.

Agenda
Emergent hardware for AI in automotive
Overview of OpenCL/SYCL programming model
Mapping typical hardware to the OpenCL/SYCL programming model
The Renesas R-Car architecture
Extending OpenCL & SYCL for R-Car
Optimising machine learning algorithms using R-Car
Autonomous driving is one of the biggest challenges in technology
The automotive industry needs to deliver the latest AI technologies with safety, high performance and low power consumption
Delivering an autonomous vehicle is a huge software and hardware challenge
It requires scaling software development up to very high levels of complexity, performance and risk, whilst maintaining low power consumption
Renesas R-Car architecture
● Embedded automotive architecture
● Optimized for computer vision processing and machine learning
● Designed for low latency, low power consumption and low cost
The software stack, from top to bottom:

SYCL-BLAS, SYCL-DNN
SYCL
OpenCL
Overview of OpenCL/SYCL programming model
The OpenCL/SYCL execution model:
1. A processing element executes a single work-item
2. Each work-item can access private memory, a dedicated memory region for each processing element
3. A compute unit executes a work-group, composed of multiple work-items, one for each processing element in the compute unit
4. Each work-item can access local memory, a dedicated memory region for each compute unit
5. A device can execute multiple work-groups
6. Each work-item can access global memory, a single memory region available to all processing elements
Memory regions grow in both size and access latency as you move outward:
Private memory < Local memory < Global memory
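To make the execution model above concrete, here is a plain C++ sketch (illustrative only, not SYCL code; the function name and values are invented) that emulates it with nested loops: the outer loop plays the role of the device executing work-groups, the inner loop the processing elements executing work-items.

```cpp
#include <vector>

// Emulate an ND-range: numGroups work-groups of groupSize work-items each.
std::vector<int> run_ndrange(int numGroups, int groupSize) {
  std::vector<int> globalMem(numGroups * groupSize);  // visible to all work-items
  for (int group = 0; group < numGroups; ++group) {   // one compute unit per work-group
    std::vector<int> localMem(groupSize);             // per-work-group local memory
    for (int item = 0; item < groupSize; ++item) {    // one processing element per work-item
      int privateVal = group * 1000 + item;           // per-work-item "private memory"
      localMem[item] = privateVal;
      globalMem[group * groupSize + item] = localMem[item];
    }
  }
  return globalMem;
}
```

For example, run_ndrange(4, 8) gives the work-item with group id 1 and local id 1 (global id 9) the value 1001.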
Each level of the execution model has a corresponding synchronisation mechanism and memory region:

Work-item                          Private memory
Work-group   Work-group barrier    Local memory
Kernel       Kernel barrier       Global memory
SYCL: a cross-platform, single-source, high-level C++ programming layer
Built on top of OpenCL and based on standard C++11
Delivering a heterogeneous programming solution for C++
The same vector-add kernel in different models:

CUDA:
__global__ void vec_add(float *a, float *b, float *c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  c[i] = a[i] + b[i];
}

C++ AMP:
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});

SYCL:
cgh.parallel_for<class vec_add>(range, [=](id<1> idx) {
  c[idx] = a[idx] + b[idx];
});
SYCL separates the storage and access of data through the use of buffers and accessors
SYCL provides data dependency tracking based on accessors to optimise the scheduling of tasks
Buffers manage data across the host and one or more devices; accessors are used to describe access requirements:

Buffer -> Accessor -> CG A
Buffer -> Accessor -> CG B

Buffers and accessors provide type-safe access across host and device
An example task graph built from buffers and accessors:
● CG A reads Buffer A through a read accessor and writes Buffer B through a write accessor
● CG B reads Buffer B and writes Buffer C
● CG C reads Buffer C and Buffer D through two read accessors and writes its output through a write accessor
The accessor targets:
● host_buffer accessor: request access to a buffer immediately on the host
● global_buffer accessor: request access to a buffer in the global memory region
● constant_buffer accessor: request access to a buffer in the constant memory region
● local accessor: allocate memory in the local memory region
Benefits of data dependency task graphs:
● Allows you to describe your tasks in terms of relationships
  ○ Removes the need to enqueue explicit copies
  ○ Removes the need for complex event handling
● Allows the runtime to make data movement optimizations
  ○ Preemptively copy data to a device before kernels are executed
  ○ Avoid unnecessarily copying data back to the host after execution on a device
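The dependency tracking described above can be sketched in plain C++ (a hypothetical mini-scheduler, not the SYCL runtime; all names are invented): a command group depends on the most recent writer of every buffer it reads.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// A command group declares which buffers it reads and writes via accessors.
struct CommandGroup {
  std::string name;
  std::set<std::string> reads, writes;
};

// Derive execution dependencies: each CG depends on the last writer of
// every buffer it reads, the way a SYCL scheduler can order tasks.
std::map<std::string, std::set<std::string>>
build_dependencies(const std::vector<CommandGroup>& groups) {
  std::map<std::string, std::string> lastWriter;  // buffer -> producing CG
  std::map<std::string, std::set<std::string>> deps;
  for (const auto& cg : groups) {
    deps[cg.name];  // ensure an entry even for independent CGs
    for (const auto& buf : cg.reads)
      if (lastWriter.count(buf)) deps[cg.name].insert(lastWriter[buf]);
    for (const auto& buf : cg.writes) lastWriter[buf] = cg.name;
  }
  return deps;
}
```

For a chain where CG A writes a buffer that CG B reads, and CG B writes a buffer that CG C reads, the derived graph orders A before B before C, with no explicit events or copies in user code.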
Mapping typical hardware to the OpenCL/SYCL programming model
A typical CPU:
1. A CPU has a region of dedicated memory (DDR)
2. The CPU memory is connected to the CPU via a bus
3. A CPU has a number of cores
4. A CPU has a number of caches of different levels
5. Each CPU core has dedicated registers
CPU
Core Core Core Core
Registers Registers Registers Registers
Cache (multiple levels)
DDR
Mapping the CPU to the OpenCL/SYCL model:
1. Lanes of the CPU core SIMD instructions are mapped to work-items
2. CPU registers and their associated caches are mapped to private memory
3. A section of DDR is mapped to local memory
4. The rest of DDR is mapped to global memory
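The SIMD-lane mapping in point 1 can be illustrated with a small plain C++ sketch (hypothetical helper names, not part of any OpenCL implementation): a work-group of work-items is executed as a series of vector iterations, one work-item per lane.

```cpp
#include <vector>

// Recover a work-item's id from the vector iteration and SIMD lane it runs on.
int work_item_id(int vectorIteration, int lane, int simdWidth) {
  return vectorIteration * simdWidth + lane;
}

// The work-items executed together by one vector iteration of one core.
std::vector<int> work_items_in_iteration(int vectorIteration, int simdWidth) {
  std::vector<int> ids;
  for (int lane = 0; lane < simdWidth; ++lane)  // one work-item per SIMD lane
    ids.push_back(work_item_id(vectorIteration, lane, simdWidth));
  return ids;
}
```

For example, with 4-wide SIMD, vector iteration 2 executes work-items 8, 9, 10 and 11 in lockstep.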
GPU
A typical GPU:
1. A GPU has a region of dedicated DDR memory which is connected to the CPU
2. A GPU is divided into a number of compute units
3. Each compute unit has dedicated shared memory
4. Each compute unit has a number of processing elements
5. Each processing element has dedicated processing element local memory
GPU
Compute unit Compute unit
PE PE PE PE PE PE
...... PM PM PM PM PM PM
Shared memory Shared memory
DDR
Mapping the GPU to the OpenCL/SYCL model:
1. Compute units are mapped to work-groups of the optimal size
2. Processing elements are mapped to work-items
3. DDR memory is mapped to global memory
4. Compute unit shared memory is mapped to local memory
5. Processing element local memory is mapped to private memory
The Renesas R-Car architecture
CVEngine
The CVEngine:
1. The CVEngine is connected to off-chip DDR, which is connected to the CPU
2. The CVEngine has a number of clusters
3. The CVEngine has a region of on-chip SRAM, also connected to the CPU
4. Each cluster has 4 cores, each with a number of processing elements
5. Each core has dedicated registers and can access DDR memory via caches
6. Each core also has dedicated local SRAM
7. The local SRAM is connected to the on-chip SRAM and DDR via DMA
CVEngine
SRAM
Cluster Cluster
Local Local Local Local Local Local Local Local SRAM SRAM SRAM SRAM SRAM SRAM SRAM SRAM
Core Core Core Core Core Core Core Core
Registers Registers Registers Registers Registers Registers Registers Registers
Cache (multiple levels) Cache (multiple levels)
DDR
Mapping the CVEngine to the OpenCL/SYCL model:
1. Each cluster maps to a work-group of the optimal size
2. Cores provide an extra level of subdivision within work-groups, so each core maps to a sub-group; sub-groups are available in OpenCL 2.x but not yet available in SYCL, so this will require an extension
3. Each processing element within a core maps to a single work-item, so each sub-group executes N work-items
4. Off-chip DDR memory is mapped to global memory
5. On-chip SRAM memory is mapped to local memory
6. Registers are mapped to private memory
7. Since the on-chip SRAM can be written to and read from the CPU, and can be accessed by all work-groups similarly to global memory, it can also be used to allocate low-latency on-chip buffers
8. Since each core has its own dedicated local SRAM, the local SRAM can be mapped to a sub-group local memory; sub-group local memory is not yet available in OpenCL or SYCL, so this will require an extension
9. Since there are DMA connections from SRAM and DDR to the local SRAM, the CVEngine can support asynchronous memory copies from on-chip memory buffers and global memory buffers into sub-group local memory and vice versa; these asynchronous copies cannot be represented in OpenCL or SYCL, so they will require an extension
The resulting execution and memory model:

Work-item                          Private memory
Sub-group    Sub-group barrier     Sub-group local memory
Work-group   Work-group barrier    Local memory
Kernel       Kernel barrier       Global memory / On-chip memory

Asynchronous sub-group copies move data between these regions
Extending OpenCL & SYCL for R-Car
Disclaimer
The features that I present here are Codeplay extensions and are not standard SYCL features
● On-chip memory
  ○ On-chip memory is allocated in OpenCL/SYCL similarly to regular buffers
    ■ ComputeAorta (OpenCL) provides an extension API
    ■ ComputeCpp (SYCL) provides an extension buffer property, use_onchip_memory
  ○ On-chip memory buffers are accessed in OpenCL/SYCL kernels in the same way as regular buffers
We construct a SYCL buffer as normal, but provide the use_onchip_memory buffer property (a sketch; the property's exact namespace and the constructor arguments are illustrative):

class kernel;
using namespace cl::sycl;
{
  queue deviceQueue;

  buffer<float, 1> deviceBuffer(hostPtr, range<1>(bufferSize),
      {property::buffer::use_onchip_memory{}});

  ...
}
● Sub-groups
  ○ Sub-groups are exposed following the OpenCL 2.x feature and as a natural extension to the SYCL execution model
    ■ ComputeAorta (OpenCL) provides kernel builtins for querying sub-group info and invoking a sub-group barrier
    ■ ComputeCpp (SYCL) provides an extension to nd_item to expose a sub_group object, similar to group, which exposes member functions for querying sub-group info and invoking a sub-group barrier
  ○ The size of sub-groups cannot be specified explicitly by the user; it is determined by the implementation
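Because the implementation picks the sub-group size, a work-item's position within its sub-group follows from simple arithmetic. A plain C++ sketch (hypothetical helper, not a Codeplay API):

```cpp
#include <utility>

// Given a work-item's local id within its work-group and the
// implementation-chosen sub-group size, compute which sub-group the
// work-item belongs to and its id within that sub-group.
std::pair<int, int> sub_group_of(int localId, int subGroupSize) {
  return {localId / subGroupSize,   // sub-group id within the work-group
          localId % subGroupSize};  // local id within the sub-group
}
```

For example, with a sub-group size of 8, the work-item with local id 11 is item 3 of sub-group 1.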
We query in-kernel sub-group information and invoke sub-group barriers via the sub_group class exposed by nd_item (a sketch; member-function names are illustrative):

class kernel;
using namespace cl::sycl;
{
  queue deviceQueue;
  buffer<float, 1> deviceBuffer(hostPtr, range<1>(bufferSize));

  deviceQueue.submit([&](handler &cgh) {
    auto deviceAcc = deviceBuffer.get_access<access::mode::read_write>(cgh);
    cgh.parallel_for<kernel>(ndRange, [=](nd_item<1> item) {
      auto subGroup = item.get_sub_group();
      auto subGroupRange = subGroup.get_group_range();
      // ...
      subGroup.barrier();
    });
  });
}
● Sub-group local memory
  ○ Sub-group local memory is exposed with extensions which follow the OpenCL/SYCL memory model
    ■ ComputeAorta (OpenCL) provides a new address space which can be used to allocate sub-group local memory
    ■ ComputeCpp (SYCL) provides a new accessor access target, access::target::subgroup_local, that behaves similarly to access::target::local
We allocate sub-group local memory by constructing an accessor with the subgroup_local access target (a sketch; template arguments are illustrative):

class kernel;
using namespace cl::sycl;
{
  queue deviceQueue;
  buffer<float, 1> deviceBuffer(hostPtr, range<1>(bufferSize));

  deviceQueue.submit([&](handler &cgh) {
    auto deviceAcc = deviceBuffer.get_access<access::mode::read_write>(cgh);
    auto subGroupLocalMem =
        accessor<float, 1, access::mode::read_write,
                 access::target::subgroup_local>(range<1>(tileSize), cgh);

    cgh.parallel_for<kernel>(ndRange, [=](nd_item<1> item) {
      // ... read and write subGroupLocalMem within each sub-group ...
    });
  });
}
● Asynchronous sub-group copies
  ○ Asynchronous sub-group copies are exposed following the OpenCL/SYCL feature for asynchronous work-group copies
    ■ ComputeAorta (OpenCL) provides a plane_t type to represent a non-accessible buffer, and kernel builtins for invoking asynchronous in-kernel copies between a plane_t and a sub-group local memory allocation
    ■ ComputeCpp (SYCL) provides a new accessor access target, access::target::plane, and a member function on the sub_group extension, async_sub_group_copy, to perform an asynchronous in-kernel copy from a plane accessor to a sub-group local accessor
We construct a plane accessor using the plane access target, then perform asynchronous in-kernel copies into sub-group local memory via async_sub_group_copy (a sketch; argument lists are illustrative):

class kernel;
using namespace cl::sycl;
{
  queue deviceQueue;
  buffer<float, 1> deviceBuffer(hostPtr, range<1>(bufferSize));

  deviceQueue.submit([&](handler &cgh) {
    auto devicePlane =
        deviceBuffer.get_access<access::mode::read, access::target::plane>(cgh);
    auto subGroupLocalMem =
        accessor<float, 1, access::mode::read_write,
                 access::target::subgroup_local>(range<1>(tileSize), cgh);

    cgh.parallel_for<kernel>(ndRange, [=](nd_item<1> item) {
      auto subGroup = item.get_sub_group();
      auto copyEvent =
          subGroup.async_sub_group_copy(subGroupLocalMem, devicePlane, tileRange);
      copyEvent.wait();
    });
  });
}
Optimising machine learning algorithms using R-Car
Input Image
Global memory:
✔ The entire image will fit into global memory
✘ Global memory has a high access latency
On-chip memory (the image tiled into {0,0}, {1,0}, {0,1}, {1,1}):
✔ On-chip memory has a much lower access latency
✘ Only part of the image will fit into on-chip memory at once, so we have to tile it
✘ Executing a kernel per tile incurs host-side overhead
Note that because convolutions are gather operations, the input data must include a halo
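The halo requirement above can be sketched in plain C++ (a hypothetical helper, not from the talk): a convolution gathers from neighbouring pixels, so each tile must load kernelRadius extra pixels on every side, clamped to the image bounds.

```cpp
#include <algorithm>

// A half-open [begin, end) range along one image dimension.
struct Range1D { int begin, end; };

// Input region a tile must load for a convolution, including its halo.
Range1D tile_with_halo(int tileIndex, int tileSize, int kernelRadius, int imageSize) {
  int begin = tileIndex * tileSize - kernelRadius;        // extend left/up by the halo
  int end   = (tileIndex + 1) * tileSize + kernelRadius;  // extend right/down by the halo
  return {std::max(begin, 0), std::min(end, imageSize)};  // clamp to the image
}
```

For a 3x3 kernel (radius 1), tiles of 64 pixels in a 128-pixel-wide image load [0, 65) and [63, 128): adjacent tiles overlap by the halo.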
The pipeline alternates copies and compute, so each copy overlaps with the convolution of the previous tile:

Copy:     Copy {0,0}   Copy {1,0}   Copy {0,1}   Copy {1,1}
Compute:               Conv {0,0}   Conv {1,0}   Conv {0,1}   Conv {1,1}
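The schedule in the pipeline diagram can be generated with a short plain C++ sketch (illustrative only; the function name and step strings are invented): at each step the pipeline computes tile i while the copy of tile i+1 is in flight, so the copy latency is hidden behind compute.

```cpp
#include <string>
#include <vector>

// Generate the double-buffered schedule for numTiles tiles.
std::vector<std::string> double_buffer_schedule(int numTiles) {
  std::vector<std::string> steps;
  steps.push_back("copy 0");                     // prologue: first tile's copy
  for (int i = 0; i < numTiles; ++i) {
    std::string step = "compute " + std::to_string(i);
    if (i + 1 < numTiles)                        // overlap the next copy with compute
      step += " | copy " + std::to_string(i + 1);
    steps.push_back(step);
  }
  return steps;
}
```

With 3 tiles the schedule is: "copy 0", "compute 0 | copy 1", "compute 1 | copy 2", "compute 2" - only the first copy is exposed.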
✔ By double buffering copy and computation you can hide the latency of copying into on-chip memory

However, the R-Car CVEngine also provides sub-group local memory, which has an even lower access latency than on-chip memory
Sub-group local memory:
✔ Asynchronously copying the part of the input data that is associated with each sub-group into sub-group local memory will further lower access latency
✘ Again, only part of the image data associated with a sub-group will fit into sub-group local memory at once, so again we have to tile it; in this case the tiling is done in-kernel
We tile and double buffer in-kernel (helpers such as calculate_num_tiles, calculate_tile_range and compute_tile are defined elsewhere; this assumes at least two tiles):

cgh.parallel_for<kernel>(ndRange, [=](nd_item<1> item) {
  auto subGroup = item.get_sub_group();
  auto numTiles = calculate_num_tiles(subGroup.get_group_range(), TILE_SIZE);

  auto currentTileRange = calculate_tile_range(subGroup, 0);
  auto nextTileRange = calculate_tile_range(subGroup, 1);

  // Copy the first tile and wait for it, then start the second tile's copy
  subGroup.async_sub_group_copy(currentTileLocalMem, currentTilePlane,
                                currentTileRange).wait();
  auto copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlane,
                                                 nextTileRange);

  for (int tile = 0; tile < numTiles; ++tile) {
    compute_tile(subGroup, currentTileRange, output);
    copyEvent.wait();
    if (tile != (numTiles - 1)) {
      // Rotate the double buffers and prefetch the tile after next
      currentTileRange = nextTileRange;
      swap(currentTileLocalMem, nextTileLocalMem);
      swap(currentTilePlane, nextTilePlane);
      if ((tile + 2) < numTiles) {
        nextTileRange = calculate_tile_range(subGroup, tile + 2);
        copyEvent = subGroup.async_sub_group_copy(nextTileLocalMem, nextTilePlane,
                                                  nextTileRange);
      }
    }
  }
});
Walking through the kernel:
● All the memory required may not fit into sub-group local memory at once, so we first calculate how many tiles are required for a sub-group
● We copy the first tile and wait for it, then initiate, but don't wait for, the copy of the second tile, so that copy happens in parallel with the computation of the first tile
● Each loop iteration computes the current tile, waits for the in-flight copy, then swaps the double buffers and starts the copy of the following tile
✔ By double buffering the asynchronous copies and the computation in each sub-group, you can hide the latency of copying into sub-group local memory
Conclusion
● The Renesas R-Car CVEngine is designed to efficiently accelerate complex machine learning algorithms in a low power environment
● The OpenCL/SYCL programming model can be efficiently applied, and extended where necessary, to support highly specialised hardware architectures
● This allows automotive systems to take advantage of AI software stacks based on open standards
We’re Hiring!
codeplay.com/careers/
Thank you for listening
@codeplaysoft /codeplaysoft codeplay.com