C++17: is it great or just OK? and Heterogeneous Computing in C++ Next for Self-Driving Cars. Michael Wong (Codeplay Software, VP of Research and Development) and Andrew Richards (CEO). ISOCPP.org Director and VP: http://isocpp.org/wiki/faq/wg21#michael-wong. Head of Delegation for the C++ Standard for Canada. Vice Chair of Programming Languages for the Standards Council of Canada.

Chair of WG21 SG5 Transactional Memory. Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded. Editor: C++ SG5 Transactional Memory Technical Specification. Editor: C++ SG1 Concurrency Technical Specification. http://wongmichael.com/about

Code::dive 2016 Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

2 © 2016 Codeplay Software Ltd. How do we get from here…

… to Level 5? Stages from here:
Level 0: Warnings
Level 1: Assist (adaptive)
Level 2: Execute automated manoeuvres
Level 3: Limited overall journey control
Level 4: Deep self-control, local to extensive
Level 5: Autonomous, all conditions
These are the SAE levels for autonomous vehicles. Similar challenges apply in other embedded intelligence industries.

3 © 2016 Codeplay Software Ltd. We have a mountain to climb

… without getting lost on our own … or climbing the wrong mountain … and we want to get there in safe, manageable, affordable steps. When we don't know what the top looks like, how do we get to the top?

4 © 2016 Codeplay Software Ltd. This presentation will focus on:

• The hardware and software platforms that will be able to deliver the results • The software tools to build up the solutions for those platforms • The open standards that will enable solutions to interoperate • How Codeplay can help deliver embedded intelligence

5 © 2016 Codeplay Software Ltd. Where do we need to go?

“On a 100 millimetre-squared chip, Google needs something like 50 teraflops of performance” - Daniel Rosenband (Google’s self-driving car project) at HotChips 2016

6 © 2016 Codeplay Software Ltd. Performance trends
[Chart: GFLOPS (log scale, 1 to 65,536) against year of introduction for desktop CPUs, smartphone CPUs, smartphone GPUs, integrated GPUs and desktop GPUs, with the Google target marked. These trend lines seem to violate the rules of physics…]

7 © 2016 Codeplay Software Ltd. How do we get there from here?

1. We need to write software today for platforms that cannot be built yet.
2. We need to validate the systems as safe.
3. We need to start with simpler systems that are not fully autonomous.

8 © 2016 Codeplay Software Ltd. Two models of software development

Hardware designer writing software: design a platform, implement the model on it, validate the platform, design the next version.
Software designer writing software: write software, select a platform, optimize for the platform, validate the whole platform.
Which method can get us all the way to full autonomy?

9 © 2016 Codeplay Software Ltd. Desirable Development

[Diagram: a desirable development loop across the stack (Software Application, Well-Defined Middleware, Hardware & Low-Level Software): write software, evaluate, optimize for the platform and validate the whole platform on the software side, while the platform side develops and evaluates the architecture and selects the platform.]

10 © 2016 Codeplay Software Ltd. The different levels of programming model

Device-specific programming: assembly language, VHDL, device-specific C-like programming models.
Higher-level language enablers: PTX, HSA, OpenCL SPIR, SPIR-V.
C-level programming: OpenCL C, DSP C, MCAPI/MTAPI.
C++-level programming: SYCL, CUDA, HCC, C++ AMP.
Graph programming: OpenCV, OpenVX, Halide, VisionCpp, TensorFlow, Caffe.

11 © 2016 Codeplay Software Ltd. Device-specific programming

Can… deliver quick results today; hand-optimize directly for the device.
Cannot… develop software today for future platforms; it is not a route to full autonomy and does not allow software developers to invest today.

12 © 2016 Codeplay Software Ltd. The route to full autonomy

• Graph programming • This is the most widely-adopted approach to machine vision and machine learning

• Open standards • This lets you develop today for future architectures

13 © 2016 Codeplay Software Ltd. Why graph programming?

When you scale the number of cores, you don't scale the number of memory ports: your compute performance increases, but your off-chip memory bandwidth does not. Therefore you need to reduce off-chip memory bandwidth by keeping everything on-chip, which is achieved by tiling. However, writing tiled image pipelines by hand is hard. If we build up a graph of operations (e.g. convolutions) and then have a runtime system split it into fused, tiled operations across an entire system-on-chip, we get great performance.
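As a rough illustration of the fusion idea (a minimal sketch in plain C++, not from the slides; the function names are made up): the unfused version writes an intermediate image back to memory between the two nodes, while the fused version keeps the intermediate value in a register, which is what a graph runtime tries to do automatically across a whole system-on-chip.

#include <vector>
#include <cstddef>

// Node 1 then node 2 as separate passes: 'tmp' round-trips through memory.
void brighten_then_threshold(const std::vector<float>& in, std::vector<float>& out) {
  std::vector<float> tmp(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) tmp[i] = in[i] * 1.5f;
  for (std::size_t i = 0; i < in.size(); ++i) out[i] = tmp[i] > 1.0f ? 1.0f : 0.0f;
}

// Fused version: the intermediate value never leaves a register.
void brighten_then_threshold_fused(const std::vector<float>& in, std::vector<float>& out) {
  for (std::size_t i = 0; i < in.size(); ++i) {
    float t = in[i] * 1.5f;
    out[i] = t > 1.0f ? 1.0f : 0.0f;
  }
}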

14 © 2016 Codeplay Software Ltd. Graph programming: some numbers

In this example we perform 3 image-processing operations on an accelerator (the system is an AMD APU; the operations are RGB->HSV, channel masking, HSV->RGB) and compare 3 systems when executing individual nodes or a whole graph. Halide and SYCL use kernel fusion, whereas OpenCV does not. For all 3 systems, the performance of the whole graph is significantly better than the individual nodes executed on their own.
[Chart: kernel time and overhead time (ms) for OpenCV (nodes), OpenCV (graph), Halide (nodes), Halide (graph), SYCL (nodes), SYCL (graph).]

15 © 2016 Codeplay Software Ltd. Graph programming • For both machine vision algorithms and machine learning, graph programming is the most widely-adopted approach • Two styles of graph programming that we commonly see:

C-style graph C++-style graph programming programming • OpenVX • Halide • OpenCV • RapidMind • Eigen (also in TensorFlow) • VisionCpp

16 © 2016 Codeplay Software Ltd. C-style graph programming

OpenVX: open standard • Can be implemented by vendors • Create a graph with C API, then map to an entire SoC

OpenCV: open source • Implemented on OpenCL • Implemented on device-specific accelerators • Create a graph with C API, then execute

17 © 2016 Codeplay Software Ltd. & Device-Specific Programming

Graph programming can… develop software today for future platforms, and runtime systems can automatically optimize the graphs. But what happens if we invent our own graph nodes? How do we adapt it for all the graph nodes we need?

18 © 2016 Codeplay Software Ltd. C++-style graph programming

Examples in machine vision/machine learning: Halide, RapidMind, Eigen (also in TensorFlow), VisionCpp.
C++ compilers that support this style: CUDA, C++ OpenMP, C++17 Parallel STL, SYCL.

19 © 2016 Codeplay Software Ltd. C++ single-source programming

• C++ lets us build up graphs at compile-time • This means we can map a graph to the processors offline • C++ lets us write custom nodes ourselves • This approach is called a C++ embedded domain-specific language (EDSL) • Very widely used, e.g. Eigen, TensorFlow, RapidMind, Halide (see the sketch below)
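A minimal sketch of the EDSL idea (illustrative only; this is not VisionCpp's or Eigen's actual API): expression templates encode the graph in the C++ type system at compile time, and a single evaluation pass walks the whole tree.

#include <cstddef>

// Leaf node: wraps an image buffer.
struct Image {
  const float* data;
  float operator[](std::size_t i) const { return data[i]; }
};

// Interior node: represents "lhs + rhs" without evaluating it yet.
template <class L, class R>
struct AddNode {
  L lhs; R rhs;
  float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

template <class L, class R>
AddNode<L, R> operator+(L lhs, R rhs) { return {lhs, rhs}; }

// One fused traversal of the whole expression tree per pixel.
template <class Expr>
void evaluate(const Expr& e, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = e[i];
}

// Usage: the type of (a + b) + c is AddNode<AddNode<Image, Image>, Image>,
// i.e. the graph is known at compile time and can be mapped offline.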

20 © 2016 Codeplay Software Ltd. C++ single-source programming lets us create customizable graph models, and single-source is the most widely-adopted machine learning programming model. OpenCL lets us run on a very wide range of accelerators, now and in the future. SYCL combines C++ single-source with OpenCL acceleration: combining open standards, C++ and graph programming.

21 © 2016 Codeplay Software Ltd. Putting it all together: building it

22 © 2016 Codeplay Software Ltd. Higher-level programming enablers

NVIDIA PTX HSA OpenCL SPIR SPIR-V

• NVIDIA PTX: NVIDIA CUDA-only.
• HSA: royalty-free open standard; HSAIL is the IR; provides a single address space with virtual memory and low-latency communication.
• OpenCL SPIR: defined for OpenCL v1.2; based on Clang/LLVM (the open-source compiler).
• SPIR-V: open standard defined by Khronos; supports compute and graphics (OpenCL, Vulkan and OpenGL); not tied to any compiler.

Open standard intermediate representations enable tools to be built on top and support a wide range of platforms

23 © 2016 Codeplay Software Ltd. Which model should we choose?

Device-specific Higher-level C-level C++-level Graph programming language programming programming programming • Assembly enabler • OpenCL C • SYCL • OpenCV language • NVIDIA PTX • DSP C • CUDA • OpenVX • VHDL • HSA • MCAPI/MTAPI • HCC • Halide • Device-specific C- • OpenCL SPIR • C++ AMP • VisionCpp like programming • SPIR-V • TensorFlow models • Caffe

24 © 2016 Codeplay Software Ltd. They are not alternatives, they are layers

Graph programming

OpenCV OpenVX Halide VisionCpp TensorFlow Caffe

C/C++-level programming

SYCL CUDA HCC C++ AMP OpenCL

Higher-level language enabler

NVIDIA PTX HSA OpenCL SPIR SPIR-V

Device-specific programming

Assembly language VHDL Device-specific C-like programming models

25 © 2016 Codeplay Software Ltd. Can specify, test and validate each layer

Graph programming

Validate graph models Validate the code using standard tools

C/C++-level programming

OpenCL/SYCL specs Clsmith testsuite Conformance testsuites Wide range of other testsuites

Higher-level language enabler

SPIR/SPIR-V/HSAIL specs Conformance testsuites

Device-specific programming

Device-specific specification Device-specific testing and validation

26 © 2016 Codeplay Software Ltd. Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

27 © 2016 Codeplay Software Ltd. C++ support for massively parallel heterogeneous devices
What is needed: memory allocation (near and far memory); better affinity for CPU and memory; templates (static, compile time); exceptions; polymorphism; task blocks; execution agents/contexts; progress guarantees; the current Technical Specifications (Concepts, Parallelism, Concurrency, TM).
Several candidates: SYCL, HPX, HCC, Agency, Kokkos, Raja, C++AMP.

28 © 2016 Codeplay Software Ltd. C++ Std Timeline/status https://wongmichael.com/2016/06/29/c17-all-final-features-from-oulu-in-a-few-slides/

29 © 2016 Codeplay Software Ltd. Pre-C++11 projects

• ISO/IEC TR 18015:2006, Technical Report on C++ Performance. Published 2006 (draft TR18015, 2006-02-15). A C++ performance report. In C++17? No.
• ISO/IEC TR 19768:2007, Technical Report on C++ Library Extensions. Published 2007-11-15 (draft n1745, 2005-01-17). Has 14 Boost libraries, 13 of which were added to C++11; TR 29124 split off, the rest merged into C++11. In C++17? N/A (mostly already included in C++11).
• ISO/IEC TR 29124:2010, mathematical special functions. Published 2010-09-03 (final draft n3060, 2010-03-06); under consideration to merge into C++17 by p0226 (2016-02-10). Extensions to the C++ library to support mathematical special functions; really ordinary math today, with Boost and Dinkumware implementations. In C++17? YES.
• ISO/IEC TR 24733:2011, decimal floating point (decimal32, decimal64, decimal128). Published 2011-10-25 (draft n2849, 2009-03-06); may be superseded by a future Decimal TS or merged into C++. In C++17? No; ongoing work in SG6.

Status after Nov Issaquah C++ Meeting

• ISO/IEC TS 18822:2015, File System TS. Published 2015-06-18 (final draft n4100, 2014-07-04). Standardizes a Linux and Windows file system interface. In C++17? YES.
• ISO/IEC TS 19570:2015, C++ Extensions for Parallelism. Published 2015-06-24 (final draft n4507, 2015-05-05). Parallel STL algorithms. In C++17? YES, but the dynamic execution policy and exception_lists were removed and some names changed.
• ISO/IEC TS 19841:2015, Transactional Memory TS. Published 2015-09-16 (final draft n4514, 2015-05-08). Composable lock-free programming that scales. In C++17? No; already in the GCC 6 release and waiting for subsequent usage experience.
• ISO/IEC TS 19568:2015, C++ Extensions for Library Fundamentals. Published 2015-09-30 (final draft n4480, 2015-04-07). optional, any, string_view and more. In C++17? YES, but invocation traits and polymorphic allocators were moved into LF TS2.
• ISO/IEC TS 19217:2015, C++ Extensions for Concepts. Published 2015-11-13 (final draft n4553, 2015-10-02). Constrained templates. In C++17? No; already in the GCC 6 release and waiting for subsequent usage experience.

31 © 2016 Codeplay Software Ltd. Status after Nov Issaquah C++ Meeting

• ISO/IEC TS 19571:2016, C++ Extensions for Concurrency. Published 2016-01-19 (final draft p0159r0, 2015-10-22). Improvements to future, latches and barriers, atomic smart pointers. In C++17? No; already in a Visual Studio release and waiting for subsequent usage experience.
• ISO/IEC DTS 19568:xxxx, C++ Extensions for Library Fundamentals, Version 2. DTS; draft n4564 (2015-11-05). Source code information capture and various utilities. In C++17? No; resolution of comments from national standards bodies in progress.
• ISO/IEC DTS 21425:xxxx, Ranges TS. In development; draft n4569 (2016-02-15). Range-based algorithms and views. In C++17? No; wording review of the spec in progress.
• ISO/IEC DTS 19216:xxxx, Networking TS. In development; draft n4575 (2016-02-15). A sockets library based on Boost.ASIO. In C++17? No; wording review of the spec in progress.
• Modules. In development; drafts p0142r0 and p0143r1 (2016-02-15). A component system to supersede the textual header file inclusion model. In C++17? No; the initial TS wording reflects Microsoft's design, and changes proposed by Clang implementers are expected.

32 © 2016 Codeplay Software Ltd. Status after Nov Issaquah C++ Meeting

• Numerics TS. Early development; draft p0101 (2015-09-27). Various numerical facilities. In C++17? No; under active development.
• ISO/IEC DTS 19571:xxxx, Concurrency TS 2. Early development. Exploring executors, synchronic types, lock-free, atomic views, concurrent data structures. In C++17? No; under active development.
• ISO/IEC DTS 19570:xxxx, Parallelism TS 2. Early development; draft n4578 (2016-02-22). Exploring task blocks, progress guarantees, SIMD. In C++17? No; under active development.
• ISO/IEC DTS 19841:xxxx, Transactional Memory TS 2. Early development. Exploring on_commit, in_transaction. In C++17? No; under active development.
• Graphics TS. Early development; draft p0267r0 (2016-02-12). A 2D drawing API. In C++17? No; wording review of the spec in progress.
• ISO/IEC DTS 19569:xxxx, Array Extensions TS. Under overhaul; abandoned draft n3820 (2013-10-10). Stack arrays whose size is not known at compile time. In C++17? No; withdrawn, and any future proposals will target a different vehicle.

33 © 2016 Codeplay Software Ltd. Status after Nov Issaquah C++ Meeting

• Coroutine TS. Resumable functions. Initial TS wording will reflect Microsoft's await design; changes proposed by others expected. In C++17? No; under active development.
• Reflection TS. Code introspection and (later) reification mechanisms. Design direction for introspection chosen; likely to target a future TS. In C++17? No; under active development.
• Contracts TS. Preconditions, postconditions, etc. Unified proposal reviewed favourably. In C++17? No; under active development.
• Massive Parallelism TS. Massive parallelism dispatch. Early development. In C++17? No; under active development.
• Heterogeneous Device TS. Support for heterogeneous devices. Early development. In C++17? No; under active development.
• C++17 itself. On track for 2017. Filesystem TS, Parallelism TS, Library Fundamentals TS I, if constexpr, and various other enhancements are in; see slides 44-47 for details. YES.

34 © 2016 Codeplay Software Ltd. Library Fundamental TS 2: being reviewed

• Source-code information capture (really a Reflection feature with a library interface) • A generalized callable negator • Uniform container erasure • GCD and LCM functions (GCD/LCM moved into C++17) • Delimited iterators • observer_ptr, the world’s dumbest smart pointer • A const-propagating wrapper class • make_array • A metaprogramming utility dubbed the “C++ detection idiom” • A replacement for std::rand() • Logical type traits.
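As one concrete taste of this list, here is a small sketch of the "C++ detection idiom" (assuming a standard library that ships the Library Fundamentals v2 header <experimental/type_traits>; the has_reserve_v name is made up for the example):

#include <experimental/type_traits>
#include <cstddef>
#include <utility>
#include <vector>

// Archetype: what "t.reserve(n)" would look like for a type T.
template <class T>
using reserve_t = decltype(std::declval<T&>().reserve(std::declval<std::size_t>()));

template <class T>
constexpr bool has_reserve_v = std::experimental::is_detected_v<reserve_t, T>;

static_assert(has_reserve_v<std::vector<int>>, "vector has reserve()");
static_assert(!has_reserve_v<int>, "int does not");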

35 © 2016 Codeplay Software Ltd. C++17 Language features already voted in
• static_assert(condition) without a message
• Allowing auto var{expr};
• Writing a template template parameter as template <…> typename Name
• Removing trigraphs
• Folding expressions
• Attributes for namespaces and enumerators
• Shorthand syntax for nested namespace definitions
• u8 character literals
• Allowing full constant expressions in non-type template parameters
• Removing the register keyword, while keeping it reserved for future use
• Removing operator++ for bool
• Making exception specifications part of the type system
• __has_include()
• Choosing an official name for what are commonly called "non-static data member initializers" or NSDMIs; the official name is "default member initializers"
• A minor change to the semantics of inheriting constructors
• The [[fallthrough]], [[nodiscard]] and [[maybe_unused]] attributes
• Extending aggregate initialization to allow initializing base subobjects
• Lambdas in constexpr contexts
• Disallowing unary folds of some operators over an empty parameter pack
• Generalizing the range-based for loop
• Lambda capture of *this by value
• Relaxing the initialization rules for scoped enum types
• Hexadecimal floating-point literals
A few of these are exercised in the snippet below.
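An illustrative snippet (not from the slides) exercising fold expressions, nested namespace definitions, a constexpr-usable lambda, static_assert without a message, a u8 character literal and [[fallthrough]]:

namespace demo::detail {                                   // shorthand nested namespace definition
  template <typename... Ts>
  constexpr auto sum(Ts... ts) { return (ts + ... + 0); }  // fold expression
}

int classify(int x) {
  int category = 0;
  switch (x) {
    case 2: category = 1; [[fallthrough]];                 // [[fallthrough]] attribute
    case 3: category += 1; break;
    default: break;
  }
  return category;
}

int main() {
  constexpr auto square = [](int n) { return n * n; };     // lambda usable in constant expressions
  static_assert(demo::detail::sum(1, 2, 3) == 6);          // static_assert without a message
  static_assert(square(4) == 16);
  char c = u8'a';                                          // u8 character literal
  return classify(2) + static_cast<int>(c != 'a');
}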

36 © 2016 Codeplay Software Ltd. C++17 Language features voted in Oulu Finland
• if constexpr (formerly known as constexpr_if, and before that static_if)
• Template parameter deduction for constructors
• Inline variables
• Guaranteed copy elision
• Guarantees on expression evaluation order
• Dynamic memory allocation for over-aligned data
• is_contiguous_layout (really a library feature, but it needs compiler support)
• Removing exception specifications
• Using attribute namespaces without repetition
• Replacement of class objects containing reference members
• Standard and non-standard attributes
• Forward progress guarantees: base definitions
• Forward progress guarantees for the Parallelism TS features
• Introducing the term 'templated entity'
• Proposed wording for structured bindings
• Selection statements with initializer
• Explicit default constructors and copy-list-initialization
Not in C++17 (yet!):
• Default comparisons (for/against/neutral: 16/31/20)
• Operator dot (not moved as CWG discovered a flaw)

37 © 2016 Codeplay Software Ltd. C++17 Library features already voted in
• Removing some legacy library components
• Contiguous iterators
• Re-enabling shared_from_this
• Safe conversions in unique_ptr
• not_fn
• Making std::reference_wrapper trivially copyable
• Cleaning up noexcept in containers
• constexpr atomic<T>::is_always_lock_free
• Improved insertion interface for unique-key maps
• Nothrow-swappable traits
• void_t alias template
• Fixing a design mistake in the searchers interface
• invoke function template
• Non-member size(), empty() and data() functions
• An algorithm to clamp a value between a pair of boundary values
• Improvements to pair and tuple
• bool_constant
• constexpr std::hardware_{constructive,destructive}_interference_size
• shared_mutex
• Incomplete type support for standard containers
• A 3-argument overload of std::hypot
• Type traits variable templates
• Adding constexpr modifiers
• as_const()
• Giving std::string a non-const data() member function
• Removing deprecated iostreams aliases
• Making std::owner_less more flexible
• is_callable, the missing INVOKE-related trait
• Polishing
• Variadic lock_guard
• Logical type traits
A few of these are exercised in the snippet below.
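A quick illustrative tour of a few of the library additions above (std::clamp, std::as_const, std::not_fn, non-member std::size() and std::invoke; again, not from the slides):

#include <algorithm>   // std::clamp
#include <functional>  // std::not_fn, std::invoke
#include <iterator>    // std::size
#include <utility>     // std::as_const
#include <vector>
#include <cassert>

int main() {
  std::vector<int> v{3, 1, 4};
  assert(std::size(v) == 3);                     // non-member size()
  assert(std::clamp(42, 0, 10) == 10);           // clamp a value between two bounds

  auto is_odd  = [](int x) { return x % 2 != 0; };
  auto is_even = std::not_fn(is_odd);            // not_fn
  assert(is_even(4));
  assert(std::invoke(is_odd, 3));                // invoke function template

  const auto& cv = std::as_const(v);             // as_const()
  assert(cv.size() == 3);
}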

38 © 2016 Codeplay Software Ltd. C++17 Library features voted in Oulu Finland

• Synopses for the C library
• High-performance, locale-independent number <-> string conversions
• make_from_tuple() (like apply(), but for constructors)
• Elementary string conversions
• Integrating std::string_view and std::string
• C11 libraries
• shared_ptr::weak_type
• gcd() and lcm() from LF TS 2
• Making Optional Greater Equal Again
• Making Variant Greater Equal
• Homogeneous interface for variant, any and optional
• Letting folks define a default_order<> without defining std::less<>
• Splicing between associative containers
• Relative paths
• has_unique_object_representations
• Extending memory management tools
• Emplace return type
• Removing allocator support in std::function
• Deprecating std::iterator, redundant members of std::allocator, and is_literal
• Reserve a namespace for STL v2
• std::variant<>
• Delete operator= for polymorphic_allocator
• Better names for parallel execution policies in C++17
• Fixes for not_fn
• Temporarily discourage memory_order_consume
• Adapting string_view by filesystem paths
• A nomenclature tweak
• "Hotel Parallelifornia": terminate() for parallel algorithms exception handling

39 © 2016 Codeplay Software Ltd.

What did not change from Issaquah: no Concepts, no Unified Call Syntax, no Default Comparisons, no operator dot. Inline variables stay.

40 © 2016 Codeplay Software Ltd. Changes voted in Issaquah
Fixes to C++17: removing deprecated exception specifications from C++17; added elementary string conversions; std::byte was not added.
Some new features for C++20: pack expansions in using-declarations; lifting restrictions on requires-expressions.

41 © 2016 Codeplay Software Ltd. Future C++ Standard schedules • After Nov, Issaquah • Address additional returned comments in February Kona • Likely Issue DIS after Kona, Feb 2017, send it to National Body for final approval ballot; this is just an up/down vote, no comments • Most likely approved, then celebrate in July 2017 Toronto Meeting • Then send it to ISO Geneva for publication, likely by EOY 2017 • After C++17 • Default is 3 yr cycle: C++20, 23

42 © 2016 Codeplay Software Ltd. Improve support for large-scale dependable software

• Modules • to improve locality and improve compile time; n4465 and n4466 • Contracts • for improved specification; n4378 and n4415 • A type-safe union • probably functional-programming style pattern matching; something based on my Urbana presentation, which relied on the Mach7 library: Yuriy Solodkyy, Gabriel Dos Reis and Bjarne Stroustrup: Open Pattern Matching for C++. ACM GPCE'13.

C++17 Lenexa 43 43 © 2016 Codeplay Software Ltd. Provide support for higher-level concurrency models

• Basic networking • asio n4478 • A SIMD vector • to better utilize modern high-performance hardware; e.g., n4454 but I’d like a real vector rather than just a way of writing parallelizable loops • Improved futures • e.g., n3857 and n3865 • Co-routines • finally, again for the first time since 1990; N4402, N4403, and n4398 • Transactional memory • n4302 • Parallel algorithms (incl. parallel versions of some of the STL • n4409

C++17 Lenexa 44 44 © 2016 Codeplay Software Ltd. Simplify core language use and address major sources of errors • Concepts (n3701 and n4361) • concepts in the standard library • based on the work done in Origin, The Palo Alto TR, and Ranges n4263, n4128 and n4382 • default comparisons May come back in • to complete the support for fundamental operations; n4475 and n4476 limited form with National Body • uniform call syntax comment • among other things: it helps concepts and STL style library use; n4474 May come back in operator dot • limited form with • to finally get proxies and smart references; n4477 National Body • array_view and string_view comment • better range checking, DMR wanted those: "fat pointers"; n4480 • arrays on the stack • "stack_array" anyone? But we need to find a safe way of dealing with stack overflow; n4294 • optional • unless it is subsumed by pattern matching, and I think not in time for C++17, n4480

C++17 Lenexa 45 © 2016 Codeplay Software Ltd. The Verdict on C++17? (from reddit)
• "You blew it" / Did a nice job
• "Not a major release" / But not minor either
• "No risk, no gain" / Safe and conservative wins
• "Nobody implements TSs" / TSs are implemented
• "Tethering tower of Babel of TSs" / Followed the rules of a bus/train model: how to get 110 people to work together
A Medium/OK release.

46 © 2016 Codeplay Software Ltd. The parallel and concurrency planets of C++ today: SG1 Parallelism/Concurrency (3 TSs), SG5 Transactional Memory TS, SG14 Low Latency.

47 © 2016 Codeplay Software Ltd. C++1Y (1Y = 17/20/22) SG1/SG5/SG14 plan (red = C++17, blue = C++20?, black = future?)
Parallelism: parallel algorithms; library vector types; vector loop algorithm/execution policy; task-based parallelism (task blocks, OpenMP, fork-join); execution agents; progress guarantees; MapReduce.
Concurrency: Future++ (then, wait_any, wait_all); latches and barriers; atomic smart pointers; osync_stream; atomic views, fp_atomics; counters/queues; executors; lock-free techniques/transactions; synchronics replacement/atomic flags; co-routines; concurrent vector/unordered associative containers; upgrade_lock; pipelines/channels.

48 © 2016 Codeplay Software Ltd.

Part 1: Parallel C++ Library In C++17

49 Execution Policies (Parallelism TS, published 2015)

using namespace std::experimental::parallelism;
std::vector<int> vec = ...

// previous standard sequential sort std::sort(vec.begin(), vec.end());

// explicitly sequential sort std::sort(std::seq, vec.begin(), vec.end());

// permitting parallel execution std::sort(std::par, vec.begin(), vec.end());

// permitting vectorization as well std::sort(std::par_unseq, vec.begin(), vec.end());

50 © 2016 Codeplay Software Ltd. What was changed from Parallelism TS in C++17 • Removed dynamic execution policy • Name change from par_vec to par_unseq • Removed exception_list (of exception_ptr) to terminate and don’t unwind

51 51 © 2016 Codeplay Software Ltd. Issaquah changes to Parallelism TS • Exceptions now part of execution policy • To enable future exception handling such as reduction • Inner product is now transform reduce • Input iterators can cause reversal to sequential • Default policies cannot copy predicates • Trying to enable that so NUMA systems can work well

52 52 © 2016 Codeplay Software Ltd. Part 2: Forward Progress guarantees in C++17

53 ParallelSTL
• C++17 execution policies require concurrent or parallel forward progress guarantees
• This means GPUs are not supported by the standard execution policies
• Executors intend to interface with execution policies

parallel_for_each(par.on(exec), vec.begin(), vec.end(), [=](int&e){ /* … */ });

54 © 2016 Codeplay Software Ltd. Forward Progress Guarantees
• The C++17 forward progress guarantees are:
• Concurrent forward progress guarantee: a thread of execution is required to make forward progress regardless of the forward progress of any other thread of execution.
• Parallel forward progress guarantee: a thread of execution is not required to make forward progress until an execution step has occurred, and from that point onward it is required to make forward progress regardless of the forward progress of any other thread of execution.
• Weakly parallel forward progress guarantee: a thread of execution is not required to make progress.

• These are not specific guarantees for GPUs 55 55 © 2016 Codeplay Software Ltd. Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

56 © 2016 Codeplay Software Ltd. Part 3: Futures++ (.then, wait_any, wait_all) in future C++ 20

5 7 Futures & Continuations • Extensions to C++11 futures • MS-style .then continuations • then() • Sequential and parallel composition • when_all() - join • when_any() - choice • Useful utilities: • make_ready_future() • is_ready() • unwrap()

58 © 2016 Codeplay Software Ltd. Summary Of Proposed Extensions (1)

template <typename F> auto then(F&& func) -> future<...>;
template <typename F> auto then(executor& ex, F&& func) -> future<...>;
template <typename F> auto then(launch policy, F&& func) -> future<...>;

59 © 2016 Codeplay Software Ltd. Summary Of Proposed Extensions (2)

template <typename T> future<typename decay<T>::type> make_ready_future(T&& value);
future<void> make_ready_future();
bool is_ready() const;
template <typename R> future<R> future<future<R>>::unwrap(); // R is a future or shared_future

60 © 2016 Codeplay Software Ltd. Summary Of Proposed Extensions (3) template <...> when_all(InputIterator first, InputIterator last); template <...> when_all(T&&... futures); template <...> when_any(InputIterator first, InputIterator last); template <...> when_any(T&&... futures);

template <...> when_any_swapped(InputIterator first, InputIterator last);
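A rough usage sketch of these extensions from the caller's side (illustrative; it assumes the Concurrency TS types in std::experimental and a user-supplied load() that returns such a future, and the header name varies between implementations):

#include <experimental/future>
#include <vector>

namespace stdx = std::experimental;

stdx::future<int> load(int id);   // assumed: launches asynchronous work somewhere

void example() {
  // Sequential composition: attach a continuation instead of blocking.
  auto doubled = load(1).then([](stdx::future<int> f) { return f.get() * 2; });

  // Parallel composition: join a whole set of futures...
  std::vector<stdx::future<int>> jobs;
  jobs.push_back(load(2));
  jobs.push_back(load(3));
  auto all = stdx::when_all(jobs.begin(), jobs.end());

  // ...or take whichever finishes first.
  auto any = stdx::when_any(load(4), load(5));

  // Utilities.
  auto ready = stdx::make_ready_future(42);
  bool done = ready.is_ready();
  (void)doubled; (void)all; (void)any; (void)done;
}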

61 © 2016 Codeplay Software Ltd. Part 4: Executors in future C++20

62 Executors
• Executors are to function execution what allocators are to memory allocation
• If a control structure such as std::async() or the parallel algorithms describes work that is to be executed, an executor describes where and when that work is to be executed
• http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0443r0.html

63 © 2016 Codeplay Software Ltd. The Idea Behind Executors

Unified Interface for Execution

64 © 2016 Codeplay Software Ltd. Several Competing Proposals • P0008r0 (Mysen): Minimal interface for fire-and-forget execution • P0058r1 (Hoberock et al.): Functionality needed for foundations of Parallelism TS • P0113r0 (Kohlhoff): Functionality needed for foundations of Networking TS • P0285r0 (Kohlhoff): Executor categories & customization points

65 © 2016 Codeplay Software Ltd. Telecon calls between Oulu to Issaquah meeting

Jared Hoberock (thanks for the slides!), Michael Garland, Chris Kohlhoff, Chris Mysen, Carter Edwards, Hans Boehm, Gordon Brown, Thomas Heller, Lee Howes, Hartmut Kaiser, Bryce Lelbach, Gor Nishanov, Thomas Rodgers, Michael Wong

66 © 2016 Codeplay Software Ltd. Current Progress of Executors • Closing in on minimal proposal • A foundation for later proposals (for heterogeneous computing) • Still work in progress

67 © 2016 Codeplay Software Ltd. Current Progress of Executors • An instruction stream is the function you want to execute • An executor is an interface that describes where and when to run an instruction stream • An executor has one or more execute functions • An execute function executes an instruction stream on light weight execution agents such as threads, SIMD units or GPU threads

68 © 2016 Codeplay Software Ltd. Current Progress of Executors • An execution platform is a target architecture such as linux • An execution resource is the hardware abstraction that is executing the work such as a thread pool • An execution context manages the light weight execution agents of an execution resource during the execution

69 © 2016 Codeplay Software Ltd. Executors: Bifurcation

• Bifurcation of one-way vs two-way
• One-way: does not return anything
• Two-way: returns a future type
• Bifurcation of blocking vs non-blocking (WIP)
• May block: the calling thread may block forward progress until the execution is complete
• Always block: the calling thread always blocks forward progress until the execution is complete
• Never block: the calling thread never blocks forward progress
• Bifurcation of hosted vs remote
• Hosted: execution is performed within threads of the device from which the execution is launched, with a minimum of parallel forward progress guarantee between threads
• Remote: execution is performed within threads of another, remote device, with a minimum of weakly parallel forward progress guarantee between threads

70 © 2016 Codeplay Software Ltd. Features of C++ Executors

• One-way non-blocking single execute executors • One-way non-blocking bulk execute executors • Remote executors with weakly parallel forward progress guarantees • Top down relationship between execution context and executor • Reference counting semantics in executors • A minimal execution resource which supports bulk execute • Nested execution contexts and executors • Executors block on destruction

71 © 2016 Codeplay Software Ltd. Executor Framework: abstract platform details of execution, create execution agents, manage the data they share, advertise semantics, mediate dependencies.

class sample_executor {
public:
  using execution_category = ...;
  using shape_type = tuple<...>;
  template <class T> using future = ...;
  template <class T> future<T> make_ready_future(T&& value);
  template <class Function, class Factory1, class Factory2>
  future<...> bulk_async_execute(Function f, shape_type shape,
                                 Factory1 result_factory, Factory2 shared_factory);
  ...
};

72 © 2016 Codeplay Software Ltd. Purpose 1 of executors:where/how execution

• Placement is, by default, at the discretion of the system.

• If the programmer wants to control placement:

73 © 2016 Codeplay Software Ltd. Purpose 2 of executors •Control relationship with Calling threads •async(launch_flags, function); •async(executor, function);

74 © 2016 Codeplay Software Ltd. Purpose 3 of executors •Uniform interface for scheduling semantics across control structures • for_each(P.on(executor), ...); • async(executor, ...); • future.then(executor, ...); • dispatch(executor, ...);

75 © 2016 Codeplay Software Ltd. SHORT TERM GOALS •Compose with existing control structures • In C++17: • async(), invoke(), for_each(), sort(), ... • In technical specifications: • define_task_block(), future.then(), Networking TS, asynchronous operations, Transactional memory

76 © 2016 Codeplay Software Ltd. UNIFIED DESIGN

• Distinguish executors from execution contexts • Categorize executors • Enable customization • Describe composition with existing control structures

77 © 2016 Codeplay Software Ltd. EXECUTORS & CONTEXTS Light-weight views on long-lived resources • Distinguish executors from execution contexts • Categorize executors • Enable customization • Describe composition with existing control structures

78 © 2016 Codeplay Software Ltd. EXECUTORS & CONTEXTS Light-weight views on long-lived resources • Executors are (potentially short-lived) objects that create execution agents on execution contexts. • Execution contexts are (potentially long-lived) objects that manage the lifetime of underlying execution resources.

79 © 2016 Codeplay Software Ltd. EXECUTORS & CONTEXTS Example: simple thread pool
• Context: my_thread_pool
• Executor: my_thread_pool::executor_t; its .execute() submits a task to the thread pool, and the executor is created by the context

struct my_thread_pool
{
  template <class Function>
  void submit(Function&& f);

  struct executor_t
  {
    my_thread_pool& ctx;

    template <class Function>
    void execute(Function&& f) const
    {
      // forward the function to the thread pool
      ctx.submit(std::forward<Function>(f));
    }

my_thread_pool& context() const noexcept {return ctx;}

bool operator==(const executor_t& rhs) const noexcept {return ctx == rhs.ctx;}

bool operator!=(const executor_t& rhs) const noexcept {return ctx != rhs.ctx;}

};

executor_t executor() { return executor_t{*this}; }

...

};
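A minimal sketch of how such an executor might be used together with the proposed customization points (illustrative; do_work() and do_other_work() are assumed user functions, and the execution:: namespace follows the P0443-era drafts quoted on these slides, not a shipping library):

void do_work();        // assumed user function
void do_other_work();  // assumed user function

void use_pool() {
  my_thread_pool pool;                 // execution context: owns the threads
  auto exec = pool.executor();         // light-weight executor view onto it

  exec.execute([] { do_work(); });     // native fire-and-forget operation

  // Uniform customization point: would also adapt executors that do not
  // natively provide the requested operation.
  execution::execute(exec, [] { do_other_work(); });
}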

80 © 2016 Codeplay Software Ltd. Executor Interface: semantic types exposed by executors
• execution_category: scheduling semantics amongst agents in a task (sequenced, vector-parallel, parallel, concurrent)
• shape_type: type for indexing a bulk launch of agents (typically n-dimensional integer indices)
• future<T>: type for synchronizing asynchronous activities (follows the interface of std::future)

81 © 2016 Codeplay Software Ltd. Executor Interface: core constructs for launching work

Single-agent tasks:
result sync_execute(Function f);
future<result> async_execute(Function f);
future<result> then_execute(Function f, Future& predecessor);

Multi-agent tasks:
result bulk_sync_execute(Function f, shape_type shape, Factory result_factory, Factory shared_factory);
future<result> bulk_async_execute(Function f, shape_type shape, Factory result_factory, Factory shared_factory);
future<result> bulk_then_execute(Function f, Future& predecessor, shape_type shape, Factory result_factory, Factory shared_factory);

82 © 2016 Codeplay Software Ltd. EXECUTOR CATEGORIES Name collections of use cases Each executor operation identifies a unique use case •execute(f) : “fire-and-forget f” •async_execute(f) : “asynchronously execute f and return a future” Categorize executor types by the uses cases they natively support •OneWayExecutor : executors that natively fire-and-forget •TwoWayExecutor : executors that natively provide a channel to the result

83 © 2016 Codeplay Software Ltd. EXECUTOR CATEGORIES Name collections of use cases HostBased* •As if the execution agent is running on a std::thread •Passes the “database test” •.execute(f,alloc) : “fire-and-forget f, use alloc for allocation” Bulk* Create multiple execution agents with a single operation .bulk_execute(f,n,sf) : “fire-and-forget f n times in bulk”

84 © 2016 Codeplay Software Ltd. EXECUTOR CATEGORIES

85 © 2016 Codeplay Software Ltd. CUSTOMIZATION POINTS Enable uniform use Free functions in namespace execution:: •execute(f) — execution::execute(exec, f) •async_execute(f) — execution::async_execute(exec, f) Adapt exec when operation not natively provided

86 © 2016 Codeplay Software Ltd. CUSTOMIZATION POINTS Enable uniform use
• Need to use executors in a variety of use cases
• Not all executors natively implement every use case
• Customization points enable uniform use of executors across use cases (*when semantically possible)

struct has_async_execute {
  template <class F> future<...> async_execute(F&&) const;
  ...
};
struct hasnt_async_execute { ... };

has_async_execute has;
hasnt_async_execute hasnt;

// calls has.async_execute(f)
auto fut1 = execution::async_execute(has, f);
// adapts hasnt to return a future
auto fut2 = execution::async_execute(hasnt, g);

87 © 2016 Codeplay Software Ltd. CUSTOMIZATION POINTS Traits enable introspection

template <class Executor, class F, class = enable_if_t<...>>
auto async_execute(Executor exec, F&& f)
  -> decltype(exec.async_execute(std::forward<F>(f)))
{
  // use .async_execute() directly
  return exec.async_execute(std::forward<F>(f));
}

template <class OneWayExecutor, class F, class = enable_if_t<...>>
executor_future_t<...> async_execute(OneWayExecutor exec, F&& f)
{
  // adapt exec to return a future
  executor_future_t<...> result_future = ...
  // call another customization point
  execution::execute(exec, ...);
  return std::move(result_future);
}

88 © 2016 Codeplay Software Ltd. EXECUTORS & THE STANDARD LIBRARY Composition with control structures
• Most programmers use higher-level control structures
• Need to compose with user-defined executors

my_executor exec = get_my_executor(...);
using namespace std;
auto fut1 = async(exec, task);
sort(execution::par.on(exec), vec.begin(), vec.end());
auto fut2 = fut1.then(exec, continuation);

89 © 2016 Codeplay Software Ltd. EXECUTORS & THE STANDARD LIBRARY Possible implementation of std::for_each

template <class Policy, class Iterator>
using __enable_if_bulk_sync_executable = enable_if_t<
  is_same_v<typename Policy::execution_category, parallel_execution_tag> &&
  is_convertible_v<typename iterator_traits<Iterator>::iterator_category,
                   random_access_iterator_tag>
>;

90 © 2016 Codeplay Software Ltd. POSSIBLE EXTENSIONS Out of scope of minimal proposal •Error handling • Higher-level variadic •Requirements on user- abstractions defined Future types • Remote execution •Heterogeneity • Additional thread pool • functionality •Additional abstractions for • System resources bulk execution • Syntactic sugar for contexts + control structures

91 © 2016 Codeplay Software Ltd. Summary Executors

Executors decouple control structures from work creation Short-term goal: compose with existing control structures P0443 is the minimal proposal to achieve short-term goal Provides a foundation for extensions to build on

92 © 2016 Codeplay Software Ltd. Vector SIMD Parallelism for Parallelism TS2

• No standard! • Boost.SIMD • Proposal N3571 by Mathias Gaunard et. al., based on the Boost.SIMD library. • Proposal N4184 by Matthias Kretz, based on Vc library. • Unifying efforts and expertise to provide an API to use SIMD portably • Within C++ (P0203, P0214) • P0193 status report • P0203 design considerations • Please see Pablo Halpern, Nicolas Guillemot’s and Joel Falcou’s talks on Vector SPMD, and SIMD. 93 © 2016 Codeplay Software Ltd. SIMD from Matthias Kretz and Mathias Gaunard

• std::datapar
• datapar<T, N>: a SIMD register holding N elements of type T
• datapar<T>: the same, with the optimal N for the currently targeted architecture
• Abi: defaulted ABI marker to make types with incompatible ABIs different
• Behaves like a value of type T, but applies each operation on the N values it contains, possibly in parallel
• Constraints: T must be an integral or floating-point type (tuples/structs of those once we get reflection); the N parameter is under discussion and will probably need to be a power of 2

94 © 2016 Codeplay Software Ltd. Operations on datapar
• Built-in operators: all usual binary operators are available, for datapar<T> op datapar<T>, datapar<T> op U and U op datapar<T>; compound binary operators and unary operators as well; datapar<T> is convertible to datapar<U>; datapar<T>(U) broadcasts the value
• No promotion: datapar<T>(255) + datapar<T>(1) == datapar<T>(0) (e.g. for an 8-bit element type)
• Comparisons and conditionals: ==, !=, <, <=, > and >= perform element-wise comparison and return a mask; if (cond) x = y is written as where(cond, x) = y; cond ? x : y is written as if_else(cond, x, y)
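A sketch of what code written against this proposal could look like (illustrative only; datapar was later renamed std::experimental::simd, so these names track the paper and the slide rather than a header you can include today, and the snippet needs an implementation of the proposal to compile):

// Assumes a P0203/P0214-style datapar implementation is available.
template <class T>
datapar<T> clamped_add(datapar<T> a, datapar<T> b, T hi) {
  datapar<T> sum = a + b;                 // element-wise addition across all lanes
  auto too_big = sum > datapar<T>(hi);    // element-wise comparison yields a mask
  where(too_big, sum) = datapar<T>(hi);   // masked assignment, as described above
  return sum;
}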

95 © 2016 Codeplay Software Ltd. The goal

• Great support for CPU latency computations through the Concurrency TS
• Great support for CPU throughput through the Parallelism TS
• Great support for heterogeneous throughput computation in the future

96 © 2016 Codeplay Software Ltd. Many alternatives for Massive dispatch/heterogeneous

• Programming-language usage experience: OpenGL, DirectX, OpenMP/OpenACC, CUDA, OpenCL, SYCL, C++ AMP, HSA, Vulkan
• HPC experience: OpenMP, OpenACC, CUDA, OpenCL, HPX, SYCL

97 © 2016 Codeplay Software Ltd.

Lots of experience now with Heterogeneous language design in C++ •Executors: P0058R1 An Interface for Abstracting Execution (one of them, there are 2 other) •AMD’s HCC/HSAIL: P00069r0 HCC: A C++ Compiler For Heterogeneous Computing •SYCL: P0236R0 Khronos's OpenCL SYCL to support Heterogeneous Devices for C++ •HPX: P0234R0 Towards Massive Parallelism support in C++ with HPX

98 © 2016 Codeplay Software Ltd. Not that far away from a Grand Unified Theory

•GUT is achievable •What we have is only missing 20% of where we want to be •It is just not designed with an integrated view in mind ... Yet •Need more focus direction on each proposal for GUT, whatever that is, and add a few elements

99 © 2016 Codeplay Software Ltd. What we want for Massive dispatch/Heterogeneous computing by 2020 •Integrated approach for 2020 for C++ – Marries concurrency/parallelism TS/co-routines •Heterogeneous Devices and/or just Massive Parallelism •Works for both HPC, consumer, games, embedded, fpga •Make asynchrony the core concept •Supports integrated (APU), but also discrete memory models •Supports High bandwidth memory •Support distributed architecture

100 © 2016 Codeplay Software Ltd. Better candidates

•Goal: Use standard C++ to express all intra-node parallelism 1. Khronos’ OpenCL SYCL extends Parallelism TS for embedded processors aiming to conform to ISO 26262 2. Agency extends Parallelism TS 3. HCC 4. HPX extends parallelism and concurrency TS 5. C++ AMP 6. KoKKos 7. Raja

101 © 2016 Codeplay Software Ltd. Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

102 © 2016 Codeplay Software Ltd. How do we offload code to a heterogeneous device?

103 © 2016 Codeplay Software Ltd. Compilation Model

[Diagram: a C++ source file goes through the CPU compiler to a CPU object, then through the linker to an x86 executable that runs on the CPU.]

104 © 2016 Codeplay Software Ltd. Compilation Model

[Same diagram.]

105 © 2016 Codeplay Software Ltd. Compilation Model

[Same diagram, with a GPU added to the system: the standard compilation model has no way to target it.]

106 © 2016 Codeplay Software Ltd. How can we compile source code for sub-architectures?

 Separate source

 Single source

107 © 2016 Codeplay Software Ltd. Separate Source Compilation Model

[Diagram: the host C++ source file goes through the CPU compiler and linker to an x86 executable; the device source is passed as a string to an online compiler and executed on the GPU.]

Here we're using OpenCL as an example.

Host code:
float *a, *b, *c;
…
kernel k = clCreateKernel(…, "my_kernel", …);
clEnqueueWriteBuffer(…, size, a, …);
clEnqueueWriteBuffer(…, size, b, …);
clEnqueueNDRange(…, k, 1, {size, 1, 1}, …);
clEnqueueReadBuffer(…, size, c, …);

Device code (OpenCL C):
__kernel void my_kernel(__global float *a, __global float *b, __global float *c) {
  int id = get_global_id(0);
  c[id] = a[id] + b[id];
}

108 © 2016 Codeplay Software Ltd. Single Source Compilation Model

[Diagram: a single source file contains both host and device code; the CPU compiler and linker produce the x86 executable, and the device code is dispatched to the GPU.]

Here we are using C++ AMP as an example:

array_view<float, 2> a, b, c;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});

109 © 2016 Codeplay Software Ltd. Single Source Compilation Model

[Diagram, built up over three slides: in the single-source model the compiler splits the file. Host code goes through the CPU compiler to a CPU object; device code goes through a device compiler to a device IR / object; the linker embeds the device IR / object in the x86 executable, which dispatches it to the GPU at runtime. The C++ AMP vector-add above is the running example.]

112 © 2016 Codeplay Software Ltd. Benefits of Single Source

• Device code is written in the same source file as the host CPU code

• Allows compile-time evaluation of device code

• Supports type safety across host CPU and device

• Supports generic programming

• Removes the need to distribute source code

113 © 2016 Codeplay Software Ltd. Describing Parallelism

114 © 2016 Codeplay Software Ltd. How do you represent the different forms of parallelism?

 Directive vs

 Task vs

 Queue vs stream execution

115 © 2016 Codeplay Software Ltd. Directive vs Explicit Parallelism

Directive-based parallelism. Examples: OpenMP, OpenACC. Implementation: the compiler transforms code to be parallel based on pragmas.
Explicit parallelism. Examples: SYCL, CUDA, TBB, Fibers, C++11 threads. Implementation: an API is used to explicitly enqueue one or more threads.

Here we're using OpenMP as an example:

vector<float> a, b, c;
#pragma omp parallel for
for (int i = 0; i < a.size(); i++) {
  c[i] = a[i] + b[i];
}

Here we're using C++ AMP as an example:

array_view<float, 2> a, b, c;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});

116 © 2016 Codeplay Software Ltd. Task vs Data Parallelism

Task parallelism. Examples: OpenMP, C++11 threads, TBB. Implementation: multiple (potentially different) tasks are performed in parallel.
Data parallelism. Examples: C++ AMP, CUDA, SYCL, C++17 Parallel STL. Implementation: the same task is performed across a large data set.

Here we're using TBB as an example:

vector<task> tasks = { … };
tbb::parallel_for_each(tasks.begin(), tasks.end(), [=](task &v) {
  v();
});

Here we're using CUDA as an example:

float *a, *b, *c;
cudaMalloc((void **)&a, size);
cudaMalloc((void **)&b, size);
cudaMalloc((void **)&c, size);
vec_add<<<64, 64>>>(a, b, c);

117 © 2016 Codeplay Software Ltd. Queue vs Stream Execution

Queue execution. Examples: C++ AMP, CUDA, SYCL, C++17 Parallel STL. Implementation: functions are placed in a queue and executed once per enqueue.
Stream execution. Examples: BOINC, BrookGPU. Implementation: a function is executed in a continuous loop over a stream of data.

Here we're using CUDA as an example:

float *a, *b, *c;
cudaMalloc((void **)&a, size);
cudaMalloc((void **)&b, size);
cudaMalloc((void **)&c, size);
vec_add<<<64, 64>>>(a, b, c);

Here we're using BrookGPU as an example:

reduce void sum(float a<>, reduce float r<>) {
  r += a;
}
float a<100>;
float r;
sum(a, r);

118 © 2016 Codeplay Software Ltd. Data Locality & Movement

119 © 2016 Codeplay Software Ltd. One of the biggest limiting factors in heterogeneous computing

 Cost of data movement in time and power consumption

120 © 2016 Codeplay Software Ltd. Cost of Data Movement

• It can take considerable time to move data to a device • This varies greatly depending on the architecture • The bandwidth of a device can impose bottlenecks • This reduces the amount of throughput you have on the device • Performance gain from computation > cost of moving data • If the gain is less than the cost of moving the data it’s not worth doing • Many devices have a hierarchy of memory regions • Global, read-only, group, private • Each region has different size, affinity and access latency • Having the data as close to the computation as possible reduces the cost

121 © 2016 Codeplay Software Ltd. Cost of Data Movement

• 64-bit DP op: 20 pJ
• 4x64-bit register read: 50 pJ
• 4x64-bit move 1 mm: 26 pJ
• 4x64-bit move 40 mm: 1 nJ
• 4x64-bit move to/from DRAM: 16 nJ

Credit: Bill Dally, Nvidia, 2010

122 © 2016 Codeplay Software Ltd. How do you move data from the host CPU to a device?

 Implicit vs explicit data movement

123 © 2016 Codeplay Software Ltd. Implicit vs Explicit Data Movement

Implicit data movement. Examples: SYCL, C++ AMP. Implementation: data is moved to the device implicitly via cross host-CPU/device data structures.
Explicit data movement. Examples: OpenCL, CUDA, OpenMP. Implementation: data is moved to the device via explicit copies.

Here we're using C++ AMP as an example:

array_view<float, 2> ptr;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  ptr[idx] *= 2.0f;
});

Here we're using CUDA as an example:

float *h_a = { … }, *d_a;
cudaMalloc((void **)&d_a, size);
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
vec_add<<<64, 64>>>(d_a, …);
cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);

124 © 2016 Codeplay Software Ltd. How do you address memory between host CPU and device?

 Multiple address space

 Non-coherent single address space

 Cache coherent single address space

125 © 2016 Codeplay Software Ltd. Comparison of Memory Models

• Multiple address space • SYCL 1.2, C++AMP, OpenCL 1.x, CUDA • Pointers have keywords or structures for representing different address spaces • Allows finer control over where data is stored, but needs to be defined explicitly • Non-coherent single address space • SYCL 2.2, HSA, OpenCL 2.x , CUDA 4, OpenMP • Pointers address a shared address space that is mapped between devices • Allows the host CPU and device to access the same address, but requires mapping • Cache coherent single address space • SYCL 2.2, HSA, OpenCL 2.x, CUDA 6, C++, • Pointers address a shared address space (hardware or cache coherent runtime) • Allows concurrent access on host CPU and device, but can be inefficient for large data
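To make the contrast concrete, a rough host-side sketch using the CUDA runtime API cited above (launch_scale() is an assumed helper that launches a kernel; error checking omitted): explicit copies between separate address spaces versus a cache-coherent shared allocation.

#include <cuda_runtime.h>

void launch_scale(float* p, int n);   // assumed: launches a kernel scaling n floats in place

void multiple_address_spaces(float* host, int n) {
  float* dev = nullptr;                                               // separate device address space
  cudaMalloc(reinterpret_cast<void**>(&dev), n * sizeof(float));
  cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);   // explicit copy in
  launch_scale(dev, n);
  cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);   // explicit copy out
  cudaFree(dev);
}

void single_address_space(int n) {
  float* shared = nullptr;
  cudaMallocManaged(reinterpret_cast<void**>(&shared), n * sizeof(float)); // one pointer, both sides
  for (int i = 0; i < n; ++i) shared[i] = float(i);                        // host writes directly
  launch_scale(shared, n);                                                 // device uses same pointer
  cudaDeviceSynchronize();
  cudaFree(shared);
}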

126 © 2016 Codeplay Software Ltd. SYCL: A New Approach to Heterogeneous Programming in C++

127 © 2016 Codeplay Software Ltd. SYCL for OpenCL

 Cross-platform, single-source, high-level, C++ programming layer  Built on top of OpenCL and based on standard C++14

128 © 2016 Codeplay Software Ltd. The SYCL Ecosystem

C++ Application

C++ Template Library C++ Template Library C++ Template Library

SYCL for OpenCL

OpenCL

CPU GPU APU Accelerator FPGA DSP

129 © 2016 Codeplay Software Ltd. How does SYCL improve heterogeneous offload and performance portability?

 SYCL is entirely standard C++

 SYCL compiles to SPIR

 SYCL supports a multi compilation single source model

130 © 2016 Codeplay Software Ltd. Single Compilation Model

[Diagram: the C++ source file goes through a combined CPU/device compiler; the linker produces an x86 executable with an embedded device object for the GPU.]

131 © 2016 Codeplay Software Ltd. Single Compilation Model

[Same diagram, with a single-source host & device compiler: you are tied to a single compiler chain.]

132 © 2016 Codeplay Software Ltd. Single Compilation Model

[Diagram: with single compilation, 3 different language extensions mean 3 different compilers and 3 different executables. The C++ AMP source goes through the AMD C++ AMP compiler (x86 ISA with embedded AMD ISA) for CPU + AMD GPU; the CUDA source goes through the CUDA compiler (x86 ISA with embedded NVIDIA ISA) for CPU + NVIDIA GPU; the OpenMP source goes through the OpenMP compiler (x86 ISA with embedded SIMD x86) for the CPU.]

133 © 2016 Codeplay Software Ltd. Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

134 © 2016 Codeplay Software Ltd. SYCL is Entirely Standard C++

CUDA:
__global__ void vec_add(float *a, float *b, float *c) {
  c[i] = a[i] + b[i];
}
float *a, *b, *c;
vec_add<<<64, 64>>>(a, b, c);

OpenMP:
vector<float> a, b, c;
#pragma omp parallel for
for (int i = 0; i < a.size(); i++) {
  c[i] = a[i] + b[i];
}

C++ AMP:
array_view<float, 2> a, b, c;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});

SYCL (entirely standard C++):
cgh.parallel_for<class vec_add>(range, [=](cl::sycl::id<2> idx) {
  c[idx] = a[idx] + b[idx];
});

135 © 2016 Codeplay Software Ltd. SYCL Targets a Wide Range of Devices with SPIR

CPU GPU APU Accelerator FPGA DSP

136 © 2016 Codeplay Software Ltd. Multi Compilation Model

[Diagram, built up over several slides: in the SYCL multi-compilation model the same C++ source file is consumed twice. A standard host compiler (GCC, Clang, VisualC++, Intel C++) produces the CPU object, while the SYCL device compiler extracts the device code and generates SPIR; the linker embeds the SPIR in the x86 executable. At runtime an OpenCL online finalizer lowers the SPIR for whichever device is selected: SIMD CPU, GPU, APU, FPGA or DSP. The standard IR allows for better performance, and SYCL does not mandate portability; the SPIR device can be selected at runtime, and a PTX binary can also be produced and selected for NVIDIA GPUs at runtime.]

142 © 2016 Codeplay Software Ltd. How does SYCL support different ways of representing parallelism?

 SYCL is an explicit parallelism model

 SYCL is a queue execution model

 SYCL supports both task and data parallelism

143 © 2016 Codeplay Software Ltd. Representing Parallelism

cgh.single_task([=](){

/* task parallel task executed once*/

});

cgh.parallel_for(range<2>(64, 64), [=](id<2> idx){

/* data parallel task executed across a range */

});

144 © 2016 Codeplay Software Ltd. How does SYCL make data movement more efficient?

 SYCL separates the storage and access of data

 SYCL can specify where data should be stored/allocated

 SYCL creates automatic data dependency graphs

145 © 2016 Codeplay Software Ltd. Separating Storage & Access

Buffers manage data across the host CPU and one or more devices; accessors are used to describe access to that data.

[Diagram: one buffer, with an accessor used by the CPU and another accessor used by the GPU.]

Buffers and accessors give type-safe access across host and device.

146 © 2016 Codeplay Software Ltd. Storing/Allocating Memory in Different Regions

[Diagram: a buffer feeding a kernel through three kinds of accessor.]
• Global accessor: memory stored in the global memory region
• Constant accessor: memory stored in the read-only memory region
• Local accessor: memory allocated in the group memory region
A sketch of requesting these regions follows below.
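A sketch of requesting those three regions from a command group (illustrative; it follows the SYCL 1.2-style spellings used elsewhere in this deck, and regions_kernel is just a made-up kernel name):

#include <CL/sycl.hpp>
namespace sycl = cl::sycl;

void regions_demo(sycl::queue& q, sycl::buffer<float, 1>& in, sycl::buffer<float, 1>& out) {
  q.submit([&](sycl::handler& cgh) {
    // Global region: the default target for buffer accessors.
    auto global = out.get_access<sycl::access::mode::write>(cgh);
    // Read-only (constant) region.
    auto constant = in.get_access<sycl::access::mode::read,
                                  sycl::access::target::constant_buffer>(cgh);
    // Work-group local region: allocated per group, no buffer behind it.
    sycl::accessor<float, 1, sycl::access::mode::read_write,
                   sycl::access::target::local> local(sycl::range<1>(64), cgh);

    cgh.parallel_for<class regions_kernel>(
        sycl::nd_range<1>(sycl::range<1>(256), sycl::range<1>(64)),
        [=](sycl::nd_item<1> item) {
          local[item.get_local_id(0)] = constant[item.get_global_id(0)];
          item.barrier(sycl::access::fence_space::local_space);
          global[item.get_global_id(0)] = local[item.get_local_id(0)];
        });
  });
}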

147 © 2016 Codeplay Software Ltd. Data Dependency Task Graphs

[Diagram: a data dependency task graph. Each kernel (Kernel A, Kernel B, Kernel C) declares read accessors on the buffers it consumes and write accessors on the buffers it produces (Buffers A to D); the edges between the kernels are derived from those accessor declarations.]

148 © 2016 Codeplay Software Ltd. Benefits of Data Dependency Graphs

• Allows you to describe your problems in terms of relationships • No need to enqueue explicit copies • Removes the need for complex event handling • Dependencies between kernels are automatically constructed • Allows the runtime to make data movement optimizations • Pre-emptively copy data to a device before the kernels that need it • Avoid unnecessarily copying data back to the host after execution on a device • Avoid copies of data that you don't need
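A minimal sketch of two command groups whose ordering is inferred from their accessors (kernel names and buffers are illustrative, not from the slides):

#include <CL/sycl.hpp>

void pipeline(cl::sycl::queue &q,
              cl::sycl::buffer<float, 1> &a,
              cl::sycl::buffer<float, 1> &b,
              cl::sycl::buffer<float, 1> &c) {
  // Kernel A: reads a, writes b.
  q.submit([&](cl::sycl::handler &cgh) {
    auto in  = a.get_access<cl::sycl::access::mode::read>(cgh);
    auto out = b.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<class kernel_a>(cl::sycl::range<1>(a.get_count()),
                                     [=](cl::sycl::id<1> i) { out[i] = in[i] * 2.0f; });
  });

  // Kernel B: reads b, writes c. The runtime sees the dependency on kernel A
  // through the accessors; no explicit events or copies are enqueued.
  q.submit([&](cl::sycl::handler &cgh) {
    auto in  = b.get_access<cl::sycl::access::mode::read>(cgh);
    auto out = c.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<class kernel_b>(cl::sycl::range<1>(b.get_count()),
                                     [=](cl::sycl::id<1> i) { out[i] = in[i] + 1.0f; });
  });
}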

149 © 2016 Codeplay Software Ltd. So what does SYCL look like?

• Here is a simple example SYCL application: a vector add

150 © 2016 Codeplay Software Ltd. Example: Vector Add

151 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {

}

152 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

The buffers synchronise upon destruction

}

153 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;

}

154 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  // Create a command group to define an asynchronous task
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {

  });
}

155 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

  });
}

156 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

// You must provide a name for the lambda
template <typename T> class kernel;

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);
    // Create a parallel_for to define the device code
    cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
                                [=](cl::sycl::id<1> idx) {

    });
  });
}

157 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

template <typename T> class kernel;

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
                                [=](cl::sycl::id<1> idx) {
      outputPtr[idx] = inputAPtr[idx] + inputBPtr[idx];
    });
  });
}

158 © 2016 Codeplay Software Ltd. Example: Vector Add

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out);

int main() {

  std::vector<float> inputA = { /* input a */ };
  std::vector<float> inputB = { /* input b */ };
  std::vector<float> output = { /* output */ };

  parallel_add(inputA, inputB, output);
}

159 © 2016 Codeplay Software Ltd. Single-source vs C++ kernel language

• Single-source: a single source file contains both host and device code • Type-checking between host and device • A single template instantiation can create all the code to kick off work, manage data and execute the kernel • e.g. sort(myData); • The approach taken by C++ 17 Parallel STL as well as SYCL

• C++ kernel language • Matches standard OpenCL C++ • Proposed for OpenCL v2.1 • Being considered as an addition for SYCL v2.1

160 © 2016 Codeplay Software Ltd. Why ‘name’ kernels?

• Enables implementers to have multiple, different compilers for host and different devices • With SYCL, software developers can choose to use the best compiler for CPU and the best compiler for each individual device they want to support • The resulting application will be highly optimized for CPU and OpenCL devices • Easy-to-integrate into existing build systems

• Only required for C++11 lambdas, not required for C++ functors • Required because lambdas don’t have a name to enable linking between different compilers
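A short sketch (not from the slides; names are illustrative) contrasting a functor kernel, which needs no explicit name, with a lambda kernel, which does:

#include <CL/sycl.hpp>

// The functor type itself gives the kernel a linkable name.
struct double_it {
  cl::sycl::accessor<int, 1, cl::sycl::access::mode::read_write,
                     cl::sycl::access::target::global_buffer> data;
  void operator()(cl::sycl::id<1> idx) const { data[idx] *= 2; }
};

void run(cl::sycl::queue &q, cl::sycl::buffer<int, 1> &buf) {
  q.submit([&](cl::sycl::handler &cgh) {
    auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
    // Functor: no explicit kernel name needed.
    cgh.parallel_for(cl::sycl::range<1>(buf.get_count()), double_it{acc});
  });

  q.submit([&](cl::sycl::handler &cgh) {
    auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
    // Lambda: an explicit name is required for linking between compilers.
    cgh.parallel_for<class double_it_lambda>(
        cl::sycl::range<1>(buf.get_count()),
        [=](cl::sycl::id<1> idx) { acc[idx] *= 2; });
  });
}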

161 © 2016 Codeplay Software Ltd. Buffers/images/accessors vs shared pointers

• OpenCL v1.2 supports a wide range of different devices and operating systems • All shared data must be encapsulated in OpenCL memory objects: buffers and images • To enable SYCL to achieve the maximum performance of OpenCL, we follow OpenCL's memory model approach • But we apply OpenCL's memory model to C++ with buffers, images and accessors • Separation of data storage and data access

162 © 2016 Codeplay Software Ltd. What can I do with SYCL?

Anything you can do with C++!

With the performance and portability of OpenCL

163 © 2016 Codeplay Software Ltd. Progress report on the SYCL vision

• Open, royalty-free standard: released
• Conformance testsuite: going into adopters package
• Open-source implementation (triSYCL): in progress
• Commercial, conformant implementation: in progress
• C++ 17 Parallel STL: open source, in progress
• Template libraries for important C++ algorithms: getting going
• Integration into existing parallel C++ libraries: getting going

164 © 2016 Codeplay Software Ltd. Building the SYCL for OpenCL ecosystem • To deliver on the full potential of high-performance heterogeneous systems • We need the libraries • We need integrated tools • We need implementations • We need training and examples

• An open standard makes it much easier for people to work together • SYCL is a group effort • We have designed SYCL for maximum ease of integration

165 © 2016 Codeplay Software Ltd. Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

166 © 2016 Codeplay Software Ltd.

Using SYCL to Develop Vision Tools

A high-level CV framework for OpenCL promoting:

Ease of use • Easy to write code • Unified front end to client code (API) • Easily able to add customisable operations • Compile-time graph validation • Predictable memory usage

Performance portability • Separation of concerns • Portable between different programming models and architectures • No modification to application computation

Cross-platform portability • OpenCL-enabled devices • CPU

167 © 2016 Codeplay Software Ltd. What is Parallelism TS v1 adding? • A set of execution policies and a collection of parallel algorithms • The Execution Policies • Paragraphs explaining the conditions for parallel algorithms • New parallel algorithms • But only on CPUs • Can we execute it on GPUs now?

168 © 2016 Codeplay Software Ltd. Sorting with the STL

A sequential sort:

std::vector<int> data = { 8, 9, 1, 4 };
std::sort(std::begin(data), std::end(data));
if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}

A parallel sort:

std::vector<int> data = { 8, 9, 1, 4 };
std::sort(std::par, std::begin(data), std::end(data));
if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}

• par is an object of an Execution Policy • The sort will be executed in parallel using an implementation-defined method

169 © 2016 Codeplay Software Ltd. The SYCL execution policy

template <class KernelName = DefaultKernelName>
class sycl_execution_policy {
public:
  using kernelName = KernelName;
  sycl_execution_policy() = default;
  sycl_execution_policy(cl::sycl::queue q);
  cl::sycl::queue get_queue() const;
};

std::vector<int> data = { 8, 9, 1, 4 };
std::sort(sycl_policy, std::begin(data), std::end(data));
if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}

• sycl_policy is an Execution Policy • data is a standard std::vector • Technically, it will use the device returned by default_selector
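A minimal usage sketch based on the open-source SyclParallelSTL project linked on a later slide; the header paths, the sycl namespace and the kernel name are assumptions taken from that project and may differ between versions:

#include <vector>
#include <algorithm>
#include <iostream>
// Headers from the KhronosGroup/SyclParallelSTL project (assumed layout):
#include <sycl/execution_policy>
#include <experimental/algorithm>

int main() {
  std::vector<int> data = { 8, 9, 1, 4 };
  cl::sycl::queue q;  // device picked by default_selector
  sycl::sycl_execution_policy<class sort_kernel> sycl_policy(q);
  std::experimental::parallel::sort(sycl_policy, std::begin(data), std::end(data));
  if (std::is_sorted(std::begin(data), std::end(data))) {
    std::cout << "Data is sorted!" << std::endl;
  }
}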

170 © 2016 Codeplay Software Ltd. Heterogeneous load balancing

171 © 2016 Codeplay Software Ltd. Future work

https://github.com/KhronosGroup/SyclParallelSTL

172 © 2016 Codeplay Software Ltd. Other projects – in progress

TensorFlow: Google’s machine learning library + others …

Eigen: C++ linear algebra template library

173 © 2016 Codeplay Software Ltd. Conclusion

• Heterogeneous programming has been supported through OpenCL for years • C++ is a prominent language for doing this, but is currently CPU-only • Graph programming languages enable neural networks and machine vision • SYCL lets you program heterogeneous devices with standard C++ today • ComputeCpp is available now for you to download and experiment with • For engineers/companies/consortiums producing embedded devices for automotive ADAS, machine vision, or neural networks • Who want to deliver artificially intelligent devices that are also low power, e.g. for self-driving cars and smart homes • But who are dissatisfied with current single-vendor, locked-in heterogeneous solutions, or designs/code with no reuse • We provide performance-portable, open-standard software across multiple platforms with long-term support • Unlike vertically locked-in solutions such as CUDA, C++AMP, or HCC • We have assembled a whole ecosystem of software, accelerated by your parallel hardware, enabling reuse with open standards

© 2016 Codeplay Software Ltd. 174 Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

175 © 2016 Codeplay Software Ltd. For Codeplay, these are our layer choices

We have chosen a layer of standards, based on current market adoption:
• TensorFlow and OpenCV
• SYCL
• OpenCL (with SPIR)
• LLVM as the standard compiler back-end

The actual choice of standards may change based on market dynamics, but by choosing widely adopted standards and a layering approach, it is easy to adapt.

• Graph programming: TensorFlow, OpenCV
• C/C++-level programming: SYCL
• Higher-level language enabler: OpenCL SPIR
• Device-specific programming: LLVM

176 © 2016 Codeplay Software Ltd. For Codeplay, these are our products

• Graph programming: TensorFlow, OpenCV
• C/C++-level programming: SYCL
• Higher-level language enabler: OpenCL SPIR
• Device-specific programming: LLVM

177 © 2016 Codeplay Software Ltd. Codeplay

Company • Based in Edinburgh, Scotland • 57 staff, mostly engineering • License and customize technologies for semiconductor companies • ComputeAorta and ComputeCpp: implementations of OpenCL, Vulkan and SYCL • 15+ years of experience in heterogeneous systems tools

Standards bodies • HSA Foundation: Chair of software group, spec editor of runtime and debugging • Khronos: chair & spec editor of SYCL; contributors to OpenCL, Safety Critical, Vulkan • ISO C++: Chair of Low Latency, Embedded WG; Editor of SG1 Concurrency TS • EEMBC: members

Research • Members of EU research consortiums: PEPPHER, LPGPU, LPGPU2, CARP • Sponsorship of PhDs and EngDs for heterogeneous programming: HSA, FPGAs, ray-tracing • Collaborations with academics • Members of HiPEAC

Open source • HSA LLDB Debugger • SPIR-V tools • RenderScript debugger in AOSP • LLDB for Qualcomm Hexagon • TensorFlow for OpenCL • C++ 17 Parallel STL for SYCL • VisionCpp: C++ performance-portable programming model for vision

Presentations • Building an LLVM back-end • Creating an SPMD Vectorizer for OpenCL with LLVM • Challenges of Mixed-Width Vector Code Gen & Scheduling in LLVM • C++ on Accelerators: Supporting Single-Source SYCL and HSA • LLDB Tutorial: Adding debugger support for your target

Codeplay build the software platforms that deliver massive performance

178 © 2016 Codeplay Software Ltd. What our ComputeCpp users say about us

Benoit Steiner – Google TensorFlow engineer: “We at Google have been working closely with Luke and his Codeplay colleagues on this project for almost 12 months now. Codeplay's contribution to this effort has been tremendous, so we felt that we should let them take the lead when it comes down to communicating updates related to OpenCL. … we are planning to merge the work that has been done so far… we want to put together a comprehensive test infrastructure”

ONERA: “We work with royalty-free SYCL because it is hardware vendor agnostic, single-source C++ programming model without platform specific keywords. This will allow us to easily work with any heterogeneous solutions using OpenCL to develop our complex algorithms and ensure future compatibility”

Hartmut Kaiser – HPX: “My team and I are working with Codeplay's ComputeCpp for almost a year now and they have resolved every issue in a timely manner, while demonstrating that this technology can work with the most complex C++ template code. I am happy to say that the combination of Codeplay's SYCL implementation with our HPX runtime system has turned out to be a very capable basis for Building a Heterogeneous Computing Model for the C++ Standard using high-level abstractions.”

WIGNER Research Centre for Physics: “It was a great pleasure this week for us that Codeplay released the ComputeCpp project for the wider audience. We've been waiting for this moment and keeping our colleagues and students in constant rally and excitement. We'd like to build on this opportunity to increase the awareness of this technology by providing sample codes and talks to potential users. We're going to give a lecture series on modern scientific programming providing field-specific examples.”

179 © 2016 Codeplay Software Ltd. Further information

• OpenCL https://www.khronos.org/opencl/ • OpenVX https://www.khronos.org/openvx/ • HSA http://www.hsafoundation.com/ • SYCL http://sycl.tech • OpenCV http://opencv.org/ • Halide http://halide-lang.org/ • VisionCpp https://github.com/codeplaysoftware/visioncpp

180 © 2016 Codeplay Software Ltd.

Community Edition Available now for free!

Visit: computecpp.codeplay.com

181 © 2016 Codeplay Software Ltd. • Open source SYCL projects: • ComputeCpp SDK - Collection of sample code and integration tools • SYCL ParallelSTL – SYCL based implementation of the parallel algorithms • VisionCpp – Compile-time embedded DSL for image processing • Eigen C++ Template Library – Compile-time library for machine learning All of this and more at: http://sycl.tech

182 © 2016 Codeplay Software Ltd. Questions ?

@codeplaysoft /codeplaysoft codeplay.com