C++17, is it great or just OK and Heterogeneous computing in C++ Next for Self-Driving Cars
Michael Wong (Codeplay Software, VP of Research and Development), Andrew Richards, CEO
ISOCPP.org Director, VP http://isocpp.org/wiki/faq/wg21#michael-wong
Head of Delegation for C++ Standard for Canada
Vice Chair of Programming Languages for Standards Council of Canada
Chair of WG21 SG5 Transactional Memory
Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded
Editor: C++ SG5 Transactional Memory Technical Specification
Editor: C++ SG1 Concurrency Technical Specification
http://wongmichael.com/about
Code::dive 2016 Agenda
• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download
2 © 2016 Codeplay Software Ltd. How do we get from here…
… to here?
[Figure: staircase of the SAE levels, from Level 0 (warnings) up to Level 5 (autonomous).]
These are the SAE levels for autonomous vehicles. Similar challenges apply in other embedded intelligence industries.
3 © 2016 Codeplay Software Ltd. We have a mountain to climb
How do we get to the top?
When we don’t know what the top looks like...
… and we want to get there in safe, manageable, affordable steps…
… without getting lost on our own…
… or climbing the wrong mountain
4 © 2016 Codeplay Software Ltd. This presentation will focus on:
• The hardware and software platforms that will be able to deliver the results • The software tools to build up the solutions for those platforms • The open standards that will enable solutions to interoperate • How Codeplay can help deliver embedded intelligence
5 © 2016 Codeplay Software Ltd. Where do we need to go?
“On a 100 millimetre-squared chip, Google needs something like 50 teraflops of performance” - Daniel Rosenband (Google’s self-driving car project) at HotChips 2016
Performance trends
[Chart: GFLOPS on a logarithmic axis from 1 to 65,536 against year of introduction, with trend lines for Desktop CPU, Smartphone CPU, Smartphone GPU, Integrated GPU and Desktop GPU, plus the Google target.]
These trend lines seem to violate the rules of physics…
7 © 2016 Codeplay Software Ltd. How do we get there from here?
• We need to write software today for platforms that cannot be built yet
• We need to validate the systems as safe
• We need to start with simpler systems that are not fully autonomous
8 © 2016 Codeplay Software Ltd. Two models of software development
[Diagram: two development cycles side by side, one for a hardware designer writing software and one for a software designer writing software. Boxes: design a model, validate model, select platform, implement on platform, design next version; write software, select platform, optimize for platform, validate whole platform.]
Which method can get us all the way to full autonomy?
9 © 2016 Codeplay Software Ltd. Desirable Development
[Diagram: a layered stack (Software Application, Well Defined Middleware, Hardware & Low-Level Software) with development cycles alongside it. Boxes: write software, evaluate software, optimize for platform, validate whole platform, develop platform, evaluate architecture, select platform.]
10 © 2016 Codeplay Software Ltd. The different levels of programming model
Device-specific programming: • Assembly language • VHDL • Device-specific C-like programming models
Higher-level language enabler: • NVIDIA PTX • HSA • OpenCL SPIR • SPIR-V
C-level programming: • OpenCL C • DSP C • MCAPI/MTAPI
C++-level programming: • SYCL • CUDA • HCC • C++ AMP
Graph programming: • OpenCV • OpenVX • Halide • VisionCpp • TensorFlow • Caffe
11 © 2016 Codeplay Software Ltd. Device-specific programming
Can…
• deliver quick results today
• hand-optimize directly for the device
Cannot…
• develop software today for future platforms
Not a route to full autonomy
Does not allow software developers to invest today
12 © 2016 Codeplay Software Ltd. The route to full autonomy
• Graph programming • This is the most widely-adopted approach to machine vision and machine learning
• Open standards • This lets you develop today for future architectures
13 © 2016 Codeplay Software Ltd. Why graph programming?
When you scale the number of cores:
• You don’t scale the number of memory ports
• Your compute performance increases
• But your off-chip memory bandwidth does not increase
Therefore:
• You need to reduce off-chip memory bandwidth by processing everything on-chip
• This is achieved by tiling
However, writing tiled image pipelines is hard.
If we build up a graph of operations (e.g. convolutions) and then have a runtime system split it into fused tiled operations across an entire system-on-chip, we get great performance.
14 © 2016 Codeplay Software Ltd. Graph programming: some numbers
In this example, we perform 3 image processing operations on an accelerator and compare 3 systems when executing individual nodes, or a whole graph.
Halide and SYCL use kernel fusion, whereas OpenCV does not. For all 3 systems, the performance of the whole graph is significantly better than individual nodes executed on their own.
The system is an AMD APU and the operations are: RGB->HSV, channel masking, HSV->RGB.
[Chart: Effect of combining graph nodes on performance. Kernel time (ms) and overhead time (ms), on a 0 to 100 axis, for OpenCV (nodes), OpenCV (graph), Halide (nodes), Halide (graph), SYCL (nodes), SYCL (graph).]
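The difference between running nodes one at a time and running a fused, tiled graph can be sketched on the CPU in plain C++. This is only an illustration under stated assumptions: `brighten`, `halve`, `run_nodes` and `run_fused` are hypothetical names, the two per-pixel operations stand in for real graph nodes, and the tile size is arbitrary. It is not how Halide, SYCL or OpenCV implement fusion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-pixel operations standing in for two graph nodes.
inline float brighten(float p) { return p + 0.25f; }
inline float halve(float p)    { return p * 0.5f; }

// Node-by-node execution: every node makes a full pass over the image and
// writes an intermediate buffer, which costs memory bandwidth.
inline std::vector<float> run_nodes(const std::vector<float>& in) {
    std::vector<float> tmp(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) tmp[i] = brighten(in[i]);
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = halve(tmp[i]);
    return out;
}

// Fused, tiled execution: both nodes are applied to one tile at a time, so the
// intermediate value never leaves a register and no temporary image is written.
inline std::vector<float> run_fused(const std::vector<float>& in, std::size_t tile = 4) {
    assert(tile > 0);
    std::vector<float> out(in.size());
    for (std::size_t begin = 0; begin < in.size(); begin += tile) {
        const std::size_t end = std::min(begin + tile, in.size());
        for (std::size_t i = begin; i < end; ++i) out[i] = halve(brighten(in[i]));
    }
    return out;
}
```

Both functions return the same pixels; only the traffic to and from the intermediate buffer differs, which is the cost the graph runtime is trying to remove.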
15 © 2016 Codeplay Software Ltd. Graph programming • For both machine vision algorithms and machine learning, graph programming is the most widely-adopted approach • Two styles of graph programming that we commonly see:
C-style graph programming: • OpenVX • OpenCV
C++-style graph programming: • Halide • RapidMind • Eigen (also in TensorFlow) • VisionCpp
16 © 2016 Codeplay Software Ltd. C-style graph programming
OpenVX: open standard • Can be implemented by vendors • Create a graph with C API, then map to an entire SoC
OpenCV: open source • Implemented on OpenCL • Implemented on device-specific accelerators • Create a graph with C API, then execute
17 © 2016 Codeplay Software Ltd. & Device-Specific Programming
• Runtime systems can automatically optimize the graphs
• Can … develop software today for future platforms
• What happens if we invent our own graph nodes?
• How do we adapt it for all the graph nodes we need?
18 © 2016 Codeplay Software Ltd. C++-style graph programming
Examples in machine vision/machine learning: • Halide • RapidMind • Eigen (also in TensorFlow) • VisionCpp
C++ compilers that support this style: • CUDA • C++ OpenMP • C++ 17 Parallel STL • SYCL
19 © 2016 Codeplay Software Ltd. C++ single-source programming
• C++ lets us build up graphs at compile-time • This means we can map a graph to the processors offline • C++ lets us write custom nodes ourselves • This approach is called a C++ Embedded Domain-Specific Language • Very widely used, eg Eigen, Boost, TensorFlow, RapidMind, Halide
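The idea of building a graph at compile time with an embedded domain-specific language can be sketched with expression templates. Everything below is a hypothetical toy (the names `Image`, `Add`, `Scale`, `Clamp01` and `execute` are assumptions, not the API of Eigen, Halide, VisionCpp or any other library named above). The type of the expression encodes the graph, and a custom node only needs an `eval(i)` member.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Leaf node: refers to existing pixel data.
struct Image {
    const std::vector<float>& data;
    float eval(std::size_t i) const { return data[i]; }
};

// Interior nodes: the template arguments record the shape of the graph.
template <class L, class R>
struct Add {
    L lhs; R rhs;
    float eval(std::size_t i) const { return lhs.eval(i) + rhs.eval(i); }
};

template <class E>
struct Scale {
    E expr; float factor;
    float eval(std::size_t i) const { return expr.eval(i) * factor; }
};

// A user-written custom node.
template <class E>
struct Clamp01 {
    E expr;
    float eval(std::size_t i) const {
        const float v = expr.eval(i);
        return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
    }
};

// Helper functions so that node types are deduced when the graph is written.
template <class L, class R> Add<L, R> add(L l, R r) { return {l, r}; }
template <class E> Scale<E> scale(E e, float f) { return {e, f}; }
template <class E> Clamp01<E> clamp01(E e) { return {e}; }

// The whole graph is evaluated in a single loop; the compiler sees every node
// and can inline them into one fused kernel.
template <class Graph>
std::vector<float> execute(const Graph& g, std::size_t n) {
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = g.eval(i);
    return out;
}
```

For example, `clamp01(scale(add(Image{a}, Image{b}), 0.5f))` has a type that spells out the full graph, so mapping it to a processor can happen before the program runs.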
Combining open standards, C++ and graph programming
• C++ single-source lets us create customizable graph models
• Single-source is the most widely-adopted machine learning programming model
• OpenCL lets us run on a very wide range of accelerators now and in the future
• SYCL combines C++ single-source with OpenCL acceleration
21 © 2016 Codeplay Software Ltd. Putting it all together: building it
22 © 2016 Codeplay Software Ltd. Higher-level programming enablers
NVIDIA PTX: • NVIDIA CUDA-only
HSA: • Royalty-free open standard • HSAIL is the IR • Provides a single address space, with virtual memory • Low-latency communication
OpenCL SPIR: • Defined for OpenCL v1.2 • Based on Clang/LLVM (the open-source compiler)
SPIR-V: • Open standard • Defined by Khronos • Supports compute and graphics (OpenCL, Vulkan and OpenGL) • Not tied to any compiler
Open standard intermediate representations enable tools to be built on top and support a wide range of platforms
23 © 2016 Codeplay Software Ltd. Which model should we choose?
Device-specific programming: • Assembly language • VHDL • Device-specific C-like programming models
Higher-level language enabler: • NVIDIA PTX • HSA • OpenCL SPIR • SPIR-V
C-level programming: • OpenCL C • DSP C • MCAPI/MTAPI
C++-level programming: • SYCL • CUDA • HCC • C++ AMP
Graph programming: • OpenCV • OpenVX • Halide • VisionCpp • TensorFlow • Caffe
24 © 2016 Codeplay Software Ltd. They are not alternatives, they are layers
Graph programming
OpenCV OpenVX Halide VisionCpp TensorFlow Caffe
C/C++-level programming
SYCL CUDA HCC C++ AMP OpenCL
Higher-level language enabler
NVIDIA PTX HSA OpenCL SPIR SPIR-V
Device-specific programming
Assembly language VHDL Device-specific C-like programming models
25 © 2016 Codeplay Software Ltd. Can specify, test and validate each layer
Graph programming
Validate graph models Validate the code using standard tools
C/C++-level programming
OpenCL/SYCL specs Clsmith testsuite Conformance testsuites Wide range of other testsuites
Higher-level language enabler
SPIR/SPIR-V/HSAIL specs Conformance testsuites
Device-specific programming
Device-specific specification Device-specific testing and validation
26 © 2016 Codeplay Software Ltd. Agenda
• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download
C++ support for massive parallel heterogeneous devices
• Memory allocation (near, far memory)
• Better affinity for CPU and memory
• Templates (static, compile time)
• Exceptions
• Polymorphism
• Task blocks
• Execution Agents/Context
• Progress Guarantees
• Current Technical Specifications – Concepts, Parallelism, Concurrency, TM
Several candidates: • SYCL • HPX • HCC • Agency • Kokkos • Raja • C++AMP
28 © 2016 Codeplay Software Ltd. C++ Std Timeline/status https://wongmichael.com/2016/06/29/c17-all-final-features-from-oulu-in-a-few-slides/
Pre-C++11 projects
• ISO/IEC TR 18015:2006, Technical Report on C++ Performance. Status: Published 2006 (ISO store). Draft: TR18015 (2006-02-15). What is it: C++ Performance report. C++17? No.
• ISO/IEC TR 19768:2007, Technical Report on C++ Library Extensions. Status: Published 2007-11-15 (ISO store). Draft: n1745 (2005-01-17). TR 29124 split off, the rest merged into C++11. What is it: Has 14 Boost libraries, 13 of which were added to C++11. C++17? N/A (mostly already included into C++11).
• ISO/IEC TR 29124:2010, Extensions to the C++ Library to support mathematical special functions. Status: Published 2010-09-03 (ISO Store). Final draft: n3060 (2010-03-06). Under consideration to merge into C++17 by p0226 (2016-02-10). What is it: Really, ORDINARY math today with a Boost and Dinkumware implementation. C++17? YES.
• ISO/IEC TR 24733:2011, Extensions for the programming language C++ to support decimal floating-point arithmetic. Status: Published 2011-10-25 (ISO Store). Draft: n2849 (2009-03-06). May be superseded by a future Decimal TS or merged into C++ b. What is it: Decimal Floating Point: decimal32, decimal64, decimal128. C++17? No. Ongoing work in SG6.
Status after Nov Issaquah C++ Meeting
• ISO/IEC TS 18822:2015, C++ File System Technical Specification. Status: Published 2015-06-18 (ISO store). Final draft: n4100 (2014-07-04). What is it: Standardize Linux and Windows file system interface. C++17? YES.
• ISO/IEC TS 19570:2015, C++ Extensions for Parallelism. Status: Published 2015-06-24 (ISO Store). Final draft: n4507 (2015-05-05). What is it: Parallel STL algorithms. C++17? YES, but removed dynamic execution policy and exception_lists, and changed some names.
• ISO/IEC TS 19841:2015, Transactional Memory TS. Status: Published 2015-09-16 (ISO Store). Final draft: n4514 (2015-05-08). What is it: Composable lock-free programming that scales. C++17? No. Already in GCC 6 release and waiting for subsequent usage experience.
• ISO/IEC TS 19568:2015, C++ Extensions for Library Fundamentals. Status: Published 2015-09-30 (ISO Store). Final draft: n4480 (2015-04-07). What is it: optional, any, string_view and more. C++17? YES, but moved Invocation Traits and Polymorphic allocators into LF TS2.
• ISO/IEC TS 19217:2015, C++ Extensions for Concepts. Status: Published 2015-11-13 (ISO Store). Final draft: n4553 (2015-10-02). What is it: Constrained templates. C++17? No. Already in GCC 6 release and waiting for subsequent usage experience.
Status after Nov Issaquah C++ Meeting
• ISO/IEC TS 19571:2016, C++ Extensions for Concurrency. Status: Published 2016-01-19 (ISO Store). Final draft: p0159r0 (2015-10-22). What is it: improvements to future, latches and barriers, atomic smart pointers. C++17? No. Already in Visual Studio release and waiting for subsequent usage experience.
• ISO/IEC DTS 19568:xxxx, C++ Extensions for Library Fundamentals, Version 2. Status: DTS. Draft: n4564 (2015-11-05). What is it: source code information capture and various utilities. C++17? No. Resolution of comments from national standards bodies in progress.
• ISO/IEC DTS 21425:xxxx, Ranges TS. Status: In development, Draft n4569 (2016-02-15). What is it: Range-based algorithms and views. C++17? No. Wording review of the spec in progress.
• ISO/IEC DTS 19216:xxxx, Networking TS. Status: In development, Draft n4575 (2016-02-15). What is it: Sockets library based on Boost.ASIO. C++17? No. Wording review of the spec in progress.
• Modules. Status: In development, Draft p0142r0 (2016-02-15) and p0143r1 (2016-02-15). What is it: A component system to supersede the textual header file inclusion model. C++17? No. Initial TS wording reflects Microsoft’s design; changes proposed by Clang implementers expected.
Status after Nov Issaquah C++ Meeting
• Numerics TS. Status: Early development. Draft p0101 (2015-09-27). What is it: Various numerical facilities. C++17? No. Under active development.
• ISO/IEC DTS 19571:xxxx, Concurrency TS 2. Status: Early development. What is it: Exploring executors, synchronic types, lock-free, atomic views, concurrent data structures. C++17? No. Under active development.
• ISO/IEC DTS 19570:xxxx, Parallelism TS 2. Status: Early development. Draft n4578 (2016-02-22). What is it: Exploring task blocks, progress guarantees, SIMD. C++17? No. Under active development.
• ISO/IEC DTS 19841:xxxx, Transactional Memory TS 2. Status: Early development. What is it: Exploring on_commit, in_transaction. C++17? No. Under active development.
• Graphics TS. Status: Early development. Draft p0267r0 (2016-02-12). What is it: 2D drawing API. C++17? No. Wording review of the spec in progress.
• ISO/IEC DTS 19569:xxxx, Array Extensions TS. Status: Under overhaul. Abandoned draft: n3820 (2013-10-10). What is it: Stack arrays whose size is not known at compile time. C++17? No. Withdrawn; any future proposals will target a different vehicle.
Status after Nov Issaquah C++ Meeting
• Coroutine TS. Status: Initial TS wording will reflect Microsoft’s await design; changes proposed by others expected. What is it: Resumable functions. C++17? No. Under active development.
• Reflection TS. Status: Design direction for introspection chosen; likely to target a future TS. What is it: Code introspection and (later) reification mechanisms. C++17? No. Under active development.
• Contracts TS. Status: Unified proposal reviewed favourably. What is it: Preconditions, postconditions, etc. C++17? No. Under active development.
• Massive Parallelism TS. Status: Early development. What is it: Massive parallelism dispatch. C++17? No. Under active development.
• Heterogeneous Device TS. Status: Early development. What is it: Support Heterogeneous Devices. C++17? No. Under active development.
• C++17. Status: On track for 2017. What is it: Filesystem TS, Parallelism TS, Library Fundamentals TS I, if constexpr, and various other enhancements are in. See slide 44-47 for details. C++17? YES.
34 © 2016 Codeplay Software Ltd. Library Fundamental TS 2: being reviewed
• Source-code information capture (really a Reflection feature with a library interface) • A generalized callable negator • Uniform container erasure • GCD and LCM functions (GCD/LCM moved into C++17) • Delimited iterators • observer_ptr, the world’s dumbest smart pointer • A const-propagating wrapper class • make_array • A metaprogramming utility dubbed the “C++ detection idiom” • A replacement for std::rand() • Logical type traits.
C++ 17 Language features already voted in
• static_assert(condition) without a message
• Allowing auto var{expr};
• Writing a template template parameter as template <…> typename Name
• Removing trigraphs
• Folding expressions
• std::uncaught_exceptions()
• Attributes for namespaces and enumerators
• Shorthand syntax for nested namespace definitions
• u8 character literals
• Allowing full constant expressions in non-type template parameters
• Removing the register keyword, while keeping it reserved for future use
• Removing operator++ for bool
• Making exception specifications part of the type system
• __has_include()
• Choosing an official name for what are commonly called “non-static data member initializers” or NSDMIs. The official name is “default member initializers”.
• A minor change to the semantics of inheriting constructors
• The [[fallthrough]] attribute
• The [[nodiscard]] attribute
• The [[maybe_unused]] attribute
• Extending aggregate initialization to allow initializing base subobjects
• Lambdas in constexpr contexts
• Disallowing unary folds of some operators over an empty parameter pack
• Generalizing the range-based for loop
• Lambda capture of *this by value
• Relaxing the initialization rules for scoped enum types
• Hexadecimal floating-point literals
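A few of the language features listed above can be seen together in a short sketch. The namespace `demo::features` and every function name in it are made up for illustration; the code assumes a compiler in C++17 mode.

```cpp
#include <cassert>
#include <type_traits>

// Shorthand syntax for nested namespace definitions.
namespace demo::features {

// static_assert without a message.
static_assert(sizeof(char) == 1);

// Fold expression: adds up every argument in the pack.
template <class... Ts>
constexpr auto sum(Ts... values) { return (values + ... + 0); }

// [[nodiscard]]: compilers are encouraged to warn if the result is ignored.
[[nodiscard]] constexpr int twice(int x) { return 2 * x; }

// [[fallthrough]] marks an intentional fall-through between case labels.
constexpr int classify(int level) {
    int score = 0;
    switch (level) {
        case 2: score += 10; [[fallthrough]];
        case 1: score += 1; break;
        default: break;
    }
    return score;
}

// Lambda usable in a constant expression.
constexpr auto square = [](int x) { return x * x; };
static_assert(square(4) == 16);

// Lambda capture of *this by value: the closure holds its own copy.
struct Counter {
    int value = 0;  // a default member initializer
    auto snapshot() const { return [*this] { return value; }; }
};

// Hexadecimal floating-point literal: 0x1.8p1 is 1.5 * 2^1.
constexpr double three = 0x1.8p1;

// auto var{expr} deduces the type of expr, not an initializer_list.
inline int braced_auto() {
    auto n{42};
    [[maybe_unused]] int spare = 0;  // no unused-variable warning expected
    static_assert(std::is_same_v<decltype(n), int>);
    return n;
}

}  // namespace demo::features
```

With these definitions, `sum(1, 2, 3)` is 6 and `classify(2)` is 11, because the second case falls through into the first.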
C++17 Language features voted in Oulu Finland
• if constexpr (formerly known as constexpr_if, and before that, static_if)
• Template parameter deduction for constructors
• Introducing the term 'templated entity'
• template
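The first two Oulu features can be sketched as follows. The function names `describe` and `deduced_size` are hypothetical and the example assumes a C++17 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// if constexpr: only the branch selected at compile time is instantiated, so
// std::to_string is never attempted for a type that is not a number.
template <class T>
std::string describe(const T& value) {
    if constexpr (std::is_arithmetic_v<T>) {
        return "number:" + std::to_string(value);
    } else {
        return "text:" + std::string(value);
    }
}

// Template parameter deduction for constructors: the template arguments of
// pair and vector are deduced from the initializers.
inline std::size_t deduced_size() {
    std::pair p{1, 2.5};
    std::vector v{1, 2, 3};
    static_assert(std::is_same_v<decltype(p), std::pair<int, double>>);
    static_assert(std::is_same_v<decltype(v), std::vector<int>>);
    return v.size();
}
```

Before C++17, `describe` would have needed two overloads or tag dispatch, and the pair would have been written with `std::make_pair` or with explicit template arguments.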
•Non-member size(), empty(), and data() functions • An algorithm to clamp a value between a pair of boundary values •Improvements to pair and tuple
•bool_constant • constexpr std::hardware_{constructive,destructive}_interference_size •shared_mutex •Incomplete type support for standard containers • A 3-argument overload of std::hypot •Type traits variable templates. • Adding constexpr modifiers •as_const() • Giving std::string a non-const data() member function •Removing deprecated iostreams aliases
•Making std::owner_less more flexible • is_callable, the missing INVOKE-related trait
•Polishing
•Variadic lock_guard
•Logical type traits.
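Several of the library additions above are small enough to show in a few lines. The wrapper names (`clamp_percent`, `count_of_builtin_array`, `is_small`, `as_const_adds_const`, `diagonal`) are hypothetical; the standard facilities they call are `std::clamp`, the non-member `std::size`, `std::bool_constant`, `std::as_const`, the `_v` type trait variable templates and the 3-argument `std::hypot`.

```cpp
#include <algorithm>    // std::clamp
#include <cassert>
#include <cmath>        // 3-argument std::hypot
#include <cstddef>
#include <iterator>     // non-member std::size
#include <type_traits>  // std::bool_constant and the _v variable templates
#include <utility>      // std::as_const

// Clamp a value between a pair of boundary values.
inline int clamp_percent(int v) { return std::clamp(v, 0, 100); }

// Non-member size() works for built-in arrays as well as containers.
inline std::size_t count_of_builtin_array() {
    int raw[] = {1, 2, 3, 4};
    return std::size(raw);
}

// bool_constant turns a compile-time condition into a type.
template <class T>
using is_small = std::bool_constant<(sizeof(T) <= 4)>;

// as_const() yields a const reference to its argument.
inline bool as_const_adds_const() {
    int x = 0;
    return std::is_const_v<std::remove_reference_t<decltype(std::as_const(x))>>;
}

// 3-argument hypot: length of the vector (2, 3, 6), mathematically 7.
inline double diagonal() { return std::hypot(2.0, 3.0, 6.0); }
```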
38 © 2016 Codeplay Software Ltd. C++17 Library features voted in Oulu Finland
• Synopses for the C library •High-performance, locale-independent number <-> string conversions • Making Optional Greater Equal Again •make_from_tuple() (like apply(), but for constructors) • Making Variant Greater Equal •Letting folks define a default_order<> without • Homogeneous interface for variant, any and optional defining std::less<> • Elementary string conversions •Splicing between associative containers • Integrating std::string_view and std::string •Relative paths •C11 libraries • has_unique_object_representations •shared_ptr::weak_type • Extending memory management tools •gcd() and lcm() from LF TS 2 • Emplace Return Type •Deprecating std::iterator, redundant members of std::allocator, and is_literal • Removing Allocator Support in std::function •Reserve a namespace for STL v2 • make_from_tuple: apply for construction •std::variant<> • Delete operator= for polymorphic_allocator •Better Names for Parallel Execution Policies in • Fixes for not_fn C++17 •Temporarily discourage memory_order_consume • Adapting string_view by filesystem paths •A
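As a rough illustration of some of these library features, the sketch below uses `std::optional`, `std::string_view`, `std::variant`, splicing between associative containers (`extract` and `insert` of a node handle) and `std::gcd`/`std::lcm`. The helper names `parse_digit`, `Value`, `type_name` and `splice_one` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>      // std::gcd, std::lcm
#include <optional>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>
#include <variant>

// optional + string_view: returns a digit, or nothing if text is not one digit.
inline std::optional<int> parse_digit(std::string_view text) {
    if (text.size() == 1 && text[0] >= '0' && text[0] <= '9') return text[0] - '0';
    return std::nullopt;
}

// variant: a type-safe union, inspected with std::visit.
using Value = std::variant<int, std::string>;

inline std::string type_name(const Value& v) {
    return std::visit([](const auto& x) -> std::string {
        using T = std::decay_t<decltype(x)>;
        if constexpr (std::is_same_v<T, int>) return "int";
        else return "string";
    }, v);
}

// Splicing: moves one node from one map to another without copying the
// key or the mapped value. Returns the size of the destination afterwards.
inline std::size_t splice_one(std::map<int, std::string>& from,
                              std::map<int, std::string>& to, int key) {
    auto node = from.extract(key);
    if (!node.empty()) to.insert(std::move(node));
    return to.size();
}
```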
What did not change from Issaquah
• No Concepts
• No Unified Call Syntax
• No Default Comparison
• No operator dot
• Inline variable stays
Changes voted in Issaquah
Fixes to C++17
• Removing Deprecated Exception Specifications from C++17
• Added Elementary string conversions
• std::byte was not added
Some New Features for C++20
• Pack expansions in using-declarations
• Lifting Restrictions on requires-Expressions
41 © 2016 Codeplay Software Ltd. Future C++ Standard schedules • After Nov, Issaquah • Address additional returned comments in February Kona • Likely Issue DIS after Kona, Feb 2017, send it to National Body for final approval ballot; this is just an up/down vote, no comments • Most likely approved, then celebrate in July 2017 Toronto Meeting • Then send it to ISO Geneva for publication, likely by EOY 2017 • After C++17 • Default is 3 yr cycle: C++20, 23
42 © 2016 Codeplay Software Ltd. Improve support for large-scale dependable software
• Modules • to improve locality and improve compile time; n4465 and n4466 • Contracts • for improved specification; n4378 and n4415 • A type-safe union • probably functional-programming style pattern matching; something based on my Urbana presentation, which relied on the Mach7 library: Yuriy Solodkyy, Gabriel Dos Reis and Bjarne Stroustrup: Open Pattern Matching for C++. ACM GPCE'13.
C++17 Lenexa 43 43 © 2016 Codeplay Software Ltd. Provide support for higher-level concurrency models
• Basic networking • asio n4478 • A SIMD vector • to better utilize modern high-performance hardware; e.g., n4454 but I’d like a real vector rather than just a way of writing parallelizable loops • Improved futures • e.g., n3857 and n3865 • Co-routines • finally, again for the first time since 1990; N4402, N4403, and n4398 • Transactional memory • n4302 • Parallel algorithms (incl. parallel versions of some of the STL • n4409
C++17 Lenexa 44 44 © 2016 Codeplay Software Ltd. Simplify core language use and address major sources of errors • Concepts (n3701 and n4361) • concepts in the standard library • based on the work done in Origin, The Palo Alto TR, and Ranges n4263, n4128 and n4382 • default comparisons May come back in • to complete the support for fundamental operations; n4475 and n4476 limited form with National Body • uniform call syntax comment • among other things: it helps concepts and STL style library use; n4474 May come back in operator dot • limited form with • to finally get proxies and smart references; n4477 National Body • array_view and string_view comment • better range checking, DMR wanted those: "fat pointers"; n4480 • arrays on the stack • "stack_array" anyone? But we need to find a safe way of dealing with stack overflow; n4294 • optional • unless it is subsumed by pattern matching, and I think not in time for C++17, n4480
The Verdict on C++17? (from reddit)
•You blew it • Did a nice job
•Not a Major release • But not Minor either
•No risk, no gain • Safe and conservative wins
•Nobody implements TSs • TSs are implemented
•Teetering tower of Babel of TSs • Followed the rules of a bus train model, how to get 110 people to work together
A Medium/OK Release
The Parallel and concurrency planets of C++ today
[Diagram: study groups shown as planets: SG1 Par/Con 3 TS, SG5 Transactional Memory TS, SG14 Low Latency.]
C++1Y(1Y=17/20/22) SG1/SG5/SG14 Plan red=C++17, blue=C++20? Black=future?
Parallelism
• Parallel Algorithms
• Library Vector Types
• Vector loop algorithm/exec policy
• Task-based parallelism (cilk, OpenMP, fork-join)
• Execution Agents
• Progress guarantees
• MapReduce
• Pipelines/channels
Concurrency
• Future++ (then, wait_any, wait_all)
• Latches and Barriers
• Atomic smart pointers
• osync_stream
• Atomic Views, fp_atomics
• Counters/Queues
• Executors
• Lock free techniques/Transactions
• Synchronics replacement/atomic flags
• Co-routines
• Concurrent Vector/Unordered Associative Containers
• upgrade_lock
Part 1: Parallel C++ Library In C++17
Execution Policies Published 2015
#include <algorithm>
#include <execution>  // C++17 spelling; the 2015 TS declared the policies in std::experimental::parallel
#include <vector>

std::vector<int> vec = ...;

// previous standard sequential sort
std::sort(vec.begin(), vec.end());

// explicitly sequential sort
std::sort(std::execution::seq, vec.begin(), vec.end());

// permitting parallel execution
std::sort(std::execution::par, vec.begin(), vec.end());

// permitting vectorization as well
std::sort(std::execution::par_unseq, vec.begin(), vec.end());
50 © 2016 Codeplay Software Ltd. What was changed from Parallelism TS in C++17 • Removed dynamic execution policy • Name change from par_vec to par_unseq • Removed exception_list (of exception_ptr) to terminate and don’t unwind
51 51 © 2016 Codeplay Software Ltd. Issaquah changes to Parallelism TS • Exceptions now part of execution policy • To enable future exception handling such as reduction • Inner product is now transform reduce • Input iterators can cause reversal to sequential • Default policies cannot copy predicates • Trying to enable that so NUMA systems can work well
52 52 © 2016 Codeplay Software Ltd. Part 2: Forward Progress guarantees in C++17
ParallelSTL
• C++17 execution policies require concurrent or parallel forward progress guarantees
  • This means GPUs are not supported by the standard execution policies
• Executors are intended to interface with execution policies
parallel_for_each(par.on(exec), vec.begin(), vec.end(), [=](int&e){ /* … */ });
54 54 © 2016 Codeplay Software Ltd. Forward Progress Guarantees • C++17 forward progress guarantees are: • Concurrent forward progress guarantees • a thread of execution is required to make forward progress regardless of the forward progress of any other thread of execution. • Parallel forward progress guarantees • a thread of execution is not required to make forward progress until an execution step has occurred and from that point onward a thread of execution is required to make forward progress regardless of the forward progress of any other thread of execution. • Weakly parallel forward progress guarantees • a thread of execution is not required to make progress.
• These are not specific guarantees for GPUs 55 55 © 2016 Codeplay Software Ltd. Agenda
• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download
56 © 2016 Codeplay Software Ltd. Part 3: Futures++ (.then, wait_any, wait_all) in future C++ 20
5 7 Futures & Continuations • Extensions to C++11 futures • MS-style .then continuations • then() • Sequential and parallel composition • when_all() - join • when_any() - choice • Useful utilities: • make_ready_future() • is_ready() • unwrap()
58 © 2016 Codeplay Software Ltd. Summary Of Proposed Extensions (1)
template
59 © 2016 Codeplay Software Ltd. Summary Of Proposed Extensions (2) template
60 © 2016 Codeplay Software Ltd. Summary Of Proposed Extensions (3) template
template
61 © 2016 Codeplay Software Ltd. Part 4: Executors in future C++20
Executors
• Executors are to function execution what allocators are to memory allocation
• A control structure such as std::async() or the parallel algorithms describes work that is to be executed
• An executor describes where and when that work is to be executed
• http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0443r0.html
63 © 2016 Codeplay Software Ltd. The Idea Behind Executors
Unified Interface for Execution
64 © 2016 Codeplay Software Ltd. Several Competing Proposals • P0008r0 (Mysen): Minimal interface for fire-and-forget execution • P0058r1 (Hoberock et al.): Functionality needed for foundations of Parallelism TS • P0113r0 (Kohlhoff): Functionality needed for foundations of Networking TS • P0285r0 (Kohlhoff): Executor categories & customization points
Telecon calls between the Oulu and Issaquah meetings
• Jared Hoberock (thanks for the slides!)
• Michael Garland
• Chris Kohlhoff
• Chris Mysen
• Carter Edwards
• Hans Boehm
• Gordon Brown
• Thomas Heller
• Lee Howes
• Hartmut Kaiser
• Bryce Lelbach
• Gor Nishanov
• Thomas Rodgers
• Michael Wong
66 © 2016 Codeplay Software Ltd. Current Progress of Executors • Closing in on minimal proposal • A foundation for later proposals (for heterogeneous computing) • Still work in progress
67 © 2016 Codeplay Software Ltd. Current Progress of Executors • An instruction stream is the function you want to execute • An executor is an interface that describes where and when to run an instruction stream • An executor has one or more execute functions • An execute function executes an instruction stream on light weight execution agents such as threads, SIMD units or GPU threads
68 © 2016 Codeplay Software Ltd. Current Progress of Executors • An execution platform is a target architecture such as linux x86 • An execution resource is the hardware abstraction that is executing the work such as a thread pool • An execution context manages the light weight execution agents of an execution resource during the execution
69 © 2016 Codeplay Software Ltd. Executors: Bifurcation
• Bifurcation of one-way vs two-way • One-way –does not return anything • Two-way –returns a future type • Bifurcation of blocking vs non-blocking (WIP) • May block –the calling thread may block forward progress until the execution is complete • Always block –the calling thread always blocks forward progress until the execution is complete • Never block –the calling thread never blocks forward progress. • Bifurcation of hosted vs remote • Hosted –Execution is performed within threads of the device which the execution is launched from, minimum of parallel forward progress guarantee between threads • Remote –Execution is performed within threads of another remote device, minimum
70 © 2016 Codeplay Software Ltd. Features of C++ Executors
• One-way non-blocking single execute executors • One-way non-blocking bulk execute executors • Remote executors with weakly parallel forward progress guarantees • Top down relationship between execution context and executor • Reference counting semantics in executors • A minimal execution resource which supports bulk execute • Nested execution contexts and executors • Executors block on destruction
71 © 2016 Codeplay Software Ltd. Executor Framework: Abstract Platform details of execution. class sample_executor { public: Create execution agents using execution_category = ...; using shape_type = tuple
72 © 2016 Codeplay Software Ltd. Purpose 1 of executors:where/how execution
• Placement is, by default, at discretion of the system.
• If the programmer wants to control placement:
73 © 2016 Codeplay Software Ltd. Purpose 2 of executors •Control relationship with Calling threads •async(launch_flags, function); •async(executor, function);
74 © 2016 Codeplay Software Ltd. Purpose 3 of executors •Uniform interface for scheduling semantics across control structures • for_each(P.on(executor), ...); • async(executor, ...); • future.then(executor, ...); • dispatch(executor, ...);
75 © 2016 Codeplay Software Ltd. SHORT TERM GOALS •Compose with existing control structures • In C++17: • async(), invoke(), for_each(), sort(), ... • In technical specifications: • define_task_block(), future.then(), Networking TS, asynchronous operations, Transactional memory
76 © 2016 Codeplay Software Ltd. UNIFIED DESIGN
• Distinguish executors from execution contexts • Categorize executors • Enable customization • Describe composition with existing control structures
77 © 2016 Codeplay Software Ltd. EXECUTORS & CONTEXTS Light-weight views on long-lived resources • Distinguish executors from execution contexts • Categorize executors • Enable customization • Describe composition with existing control structures
78 © 2016 Codeplay Software Ltd. EXECUTORS & CONTEXTS Light-weight views on long-lived resources • Executors are (potentially short-lived) objects that create execution agents on execution contexts. • Execution contexts are (potentially long-lived) objects that manage the lifetime of underlying execution resources.
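79 © 2016 Codeplay Software Ltd. EXECUTORS & CONTEXTS Example: simple thread pool
• Context: my_thread_pool
struct my_thread_pool
{
  // Enqueues f to run on one of the pool's threads.
  template<class Function>
  void submit(Function&& f);
  // Light-weight, copyable view on the long-lived pool.
  struct executor_t
  {
    my_thread_pool& ctx;
    // Assumed reconstruction: the slide's member template is cut off here;
    // forwarding the work to the context is the natural reading.
    template<class Function>
    void execute(Function&& f) const { ctx.submit(std::forward<Function>(f)); }
    my_thread_pool& context() const noexcept {return ctx;}
    // Two executors are equal when they view the same pool (compare addresses).
    bool operator==(const executor_t& rhs) const noexcept {return &ctx == &rhs.ctx;}
    bool operator!=(const executor_t& rhs) const noexcept {return &ctx != &rhs.ctx;}
  };
  executor_t executor() { return executor_t{*this}; }
  ...
};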
80 © 2016 Codeplay Software Ltd. Executor Interface: semantic types exposed by executors
Type: Meaning
• execution_category: Scheduling semantics amongst agents in a task (sequenced, vector-parallel, parallel, concurrent)
• shape_type: Type for indexing bulk launch of agents (typically n-dimensional integer indices)
• future
81 © 2016 Codeplay Software Ltd. Executor Interface: core constructs for launching work
Type of agent tasks: Constructs
• Single-agent tasks: result sync_execute(Function f); future
• Multi-agent tasks: result bulk_sync_execute(Function f, shape_type shape, Factory result_factory, Factory shared_factory); future
82 © 2016 Codeplay Software Ltd. EXECUTOR CATEGORIES Name collections of use cases
Each executor operation identifies a unique use case •execute(f) : “fire-and-forget f” •async_execute(f) : “asynchronously execute f and return a future”
Categorize executor types by the use cases they natively support •OneWayExecutor : executors that natively fire-and-forget •TwoWayExecutor : executors that natively provide a channel to the result
83 © 2016 Codeplay Software Ltd. EXECUTOR CATEGORIES Name collections of use cases HostBased* •As if the execution agent is running on a std::thread •Passes the “database test” •.execute(f,alloc) : “fire-and-forget f, use alloc for allocation” Bulk* Create multiple execution agents with a single operation .bulk_execute(f,n,sf) : “fire-and-forget f n times in bulk”
84 © 2016 Codeplay Software Ltd. EXECUTOR CATEGORIES
85 © 2016 Codeplay Software Ltd. CUSTOMIZATION POINTS Enable uniform use Free functions in namespace execution:: •execute(f) — execution::execute(exec, f) •async_execute(f) — execution::async_execute(exec, f) Adapt exec when operation not natively provided
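The adaptation idea on this slide can be sketched in standard C++17. This is a minimal illustration, not the P0443 wording: the namespace execution_sketch, the trait has_async_execute as written here, and the two executor types are hypothetical names chosen for the example. The free function uses the executor's native async_execute when it exists and otherwise adapts the one-way execute with a packaged task.

```cpp
#include <cassert>
#include <future>
#include <type_traits>
#include <utility>

namespace execution_sketch {  // hypothetical namespace, not the proposal's

// Detects whether Executor natively provides async_execute(f).
template <class Executor, class Function, class = void>
struct has_async_execute : std::false_type {};

template <class Executor, class Function>
struct has_async_execute<
    Executor, Function,
    std::void_t<decltype(std::declval<const Executor&>().async_execute(
        std::declval<Function>()))>> : std::true_type {};

// Customization point: call the native operation when present, otherwise
// adapt the one-way execute() so the caller still receives a future.
template <class Executor, class Function>
auto async_execute(const Executor& exec, Function f) {
  if constexpr (has_async_execute<Executor, Function>::value) {
    return exec.async_execute(std::move(f));
  } else {
    using result_t = std::invoke_result_t<Function&>;
    std::packaged_task<result_t()> task(std::move(f));
    std::future<result_t> result = task.get_future();
    exec.execute(std::move(task));
    return result;
  }
}

}  // namespace execution_sketch

// Hypothetical one-way executor: runs the task inline on the calling thread.
struct inline_executor {
  template <class Function>
  void execute(Function&& f) const { std::forward<Function>(f)(); }
};

// Hypothetical two-way executor: natively hands back a (deferred) future.
struct deferred_executor {
  template <class Function>
  auto async_execute(Function f) const {
    return std::async(std::launch::deferred, std::move(f));
  }
};
```

Calling execution_sketch::async_execute(inline_executor{}, f) and execution_sketch::async_execute(deferred_executor{}, f) both yield a std::future, which is the uniform use the slide asks for.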
86 © 2016 Codeplay Software Ltd. CUSTOMIZATION POINTS Enable uniform use struct has_async_execute { template
87 © 2016 Codeplay Software Ltd. CUSTOMIZATION POINTS Traits enable introspection template
88 © 2016 Codeplay Software Ltd. EXECUTORS & THE STANDARD LIBRARY Composition with control structures
• Most programmers use higher-level control structures • Need to compose with user-defined executors
my_executor exec = get_my_executor(...);
using namespace std;
auto fut1 = async(exec, task);
sort(execution::par.on(exec), vec.begin(), vec.end());
auto fut2 = fut1.then(exec, continuation);
89 © 2016 Codeplay Software Ltd. EXECUTORS & THE STANDARD LIBRARY Possible implementation of std::for_each template
90 © 2016 Codeplay Software Ltd. POSSIBLE EXTENSIONS Out of scope of minimal proposal •Error handling •Requirements on user-defined Future types •Heterogeneity •Distributed memory •Additional abstractions for bulk execution • Higher-level variadic abstractions • Remote execution • Additional thread pool functionality • System resources • Syntactic sugar for contexts + control structures
91 © 2016 Codeplay Software Ltd. Summary: Executors
• Executors decouple control structures from work creation • Short-term goal: compose with existing control structures • P0443 is the minimal proposal to achieve the short-term goal • Provides a foundation for extensions to build on
92 © 2016 Codeplay Software Ltd. Vector SIMD Parallelism for Parallelism TS2
• No standard! • Boost.SIMD • Proposal N3571 by Mathias Gaunard et al., based on the Boost.SIMD library. • Proposal N4184 by Matthias Kretz, based on the Vc library. • Unifying efforts and expertise to provide an API to use SIMD portably • Within C++ (P0203, P0214) • P0193 status report • P0203 design considerations • Please see Pablo Halpern’s, Nicolas Guillemot’s and Joel Falcou’s talks on Vector SPMD and SIMD.
93 © 2016 Codeplay Software Ltd. SIMD from Matthias Kretz and Mathias Gaunard
• std::datapar
94 © 2016 Codeplay Software Ltd. Operations on datapar • Built-in operators • No promotion: All usual binary operators are • datapar
95 © 2016 Codeplay Software Ltd. The goal
•Great support for CPU latency computations through the Concurrency TS •Great support for CPU throughput through the Parallelism TS •Great support for heterogeneous throughput computation in the future
96 © 2016 Codeplay Software Ltd. Many alternatives for Massive dispatch/heterogeneous
• Programming Languages Usage experience • OpenGL • DirectX • OpenMP/OpenACC • CUDA • HPC • OpenCL • SYCL • OpenMP • OpenCL • OpenACC • CUDA • C++ AMP • HPX • HSA • SYCL • Vulkan
97 © 2016 Codeplay Software Ltd.
Lots of experience now with heterogeneous language design in C++ •Executors: P0058R1 An Interface for Abstracting Execution (one of three executor proposals) •AMD’s HCC/HSAIL: P0069R0 HCC: A C++ Compiler For Heterogeneous Computing •SYCL: P0236R0 Khronos's OpenCL SYCL to support Heterogeneous Devices for C++ •HPX: P0234R0 Towards Massive Parallelism support in C++ with HPX
98 © 2016 Codeplay Software Ltd. Not that far away from a Grand Unified Theory
•GUT is achievable •What we have is missing only 20% of where we want to be •It is just not designed with an integrated view in mind ... yet •Each proposal needs a more focused direction towards a GUT, whatever that turns out to be, plus a few added elements
99 © 2016 Codeplay Software Ltd. What we want for Massive dispatch/Heterogeneous computing by 2020 •Integrated approach for 2020 for C++ – Marries the Concurrency TS, the Parallelism TS and coroutines •Heterogeneous devices and/or just massive parallelism •Works for HPC, consumer, games, embedded and FPGA •Make asynchrony the core concept •Supports integrated (APU) as well as discrete memory models •Supports high-bandwidth memory •Supports distributed architectures
100 © 2016 Codeplay Software Ltd. Better candidates
•Goal: Use standard C++ to express all intra-node parallelism 1. Khronos’ OpenCL SYCL extends Parallelism TS for embedded processors aiming to conform to ISO 26262 2. Agency extends Parallelism TS 3. HCC 4. HPX extends Parallelism and Concurrency TS 5. C++ AMP 6. Kokkos 7. RAJA
102 © 2016 Codeplay Software Ltd. How do we offload code to a heterogeneous device?
103 © 2016 Codeplay Software Ltd. Compilation Model
[Diagram, built up over three slides: CPU Source File → CPU Compiler → CPU Object → Linker → x86 ISA → x86 CPU; the last build adds a GPU.]
106 © 2016 Codeplay Software Ltd. How can we compile source code for sub-architectures?
Separate source
Single source
107 © 2016 Codeplay Software Ltd. Separate Source Compilation Model
[Diagram: CPU Source File → CPU Compiler → CPU Object → Linker → x86 ISA → x86 CPU; Device Source → Online Compiler → GPU]
Here we’re using OpenCL as an example
Host code:
float *a, *b, *c;
…
cl_kernel k = clCreateKernel(…, "my_kernel", …);
clEnqueueWriteBuffer(…, size, a, …);   // copy input a to the device
clEnqueueWriteBuffer(…, size, b, …);   // copy input b to the device
clEnqueueNDRangeKernel(…, k, 1, {size, 1, 1}, …);
clEnqueueReadBuffer(…, size, c, …);    // copy result c back to the host
Device source:
__kernel void my_kernel(__global float *a, __global float *b, __global float *c) {
  int id = get_global_id(0);
  c[id] = a[id] + b[id];
}
108 © 2016 Codeplay Software Ltd. Single Source Compilation Model
[Diagram, built up over four slides: one source file holds both CPU source and device source. CPU Source → CPU Compiler → CPU Object; Device Source → Device Compiler → Device IR / Object; the Linker combines both into x86 ISA (with embedded device IR / object), targeting the x86 CPU and the GPU. The accompanying code listing begins with array_view.]
112 © 2016 Codeplay Software Ltd. Benefits of Single Source
• Device code is written in the same source file as the host CPU code
• Allows compile-time evaluation of device code
• Supports type safety across host CPU and device
• Supports generic programming
• Removes the need to distribute source code
113 © 2016 Codeplay Software Ltd. Describing Parallelism
114 © 2016 Codeplay Software Ltd. How do you represent the different forms of parallelism?
Directive vs explicit parallelism
Task vs data parallelism
Queue vs stream execution
115 © 2016 Codeplay Software Ltd. Directive vs Explicit Parallelism
Directive parallelism: Examples: • OpenMP, OpenACC Implementation: • Compiler transforms code to be parallel based on pragmas
Explicit parallelism: Examples: • SYCL, CUDA, TBB, Fibers, C++11 Threads Implementation: • An API is used to explicitly enqueue one or more threads
Here we’re using OpenMP as an example Here we’re using C++ AMP as an example vector
116 © 2016 Codeplay Software Ltd. Task vs Data Parallelism
Task parallelism: Examples: • OpenMP, C++11 Threads, TBB Implementation: • Multiple (potentially different) tasks are performed in parallel
Data parallelism: Examples: • C++ AMP, CUDA, SYCL, C++17 ParallelSTL Implementation: • The same task is performed across a large data set
Here we’re using TBB as an example Here we’re using CUDA as an example vector
117 © 2016 Codeplay Software Ltd. Queue vs Stream Execution
Queue execution: Examples: • C++ AMP, CUDA, SYCL, C++17 ParallelSTL Implementation: • Functions are placed in a queue and executed once per enqueue
Stream execution: Examples: • BOINC, BrookGPU Implementation: • A function is executed on a continuous loop on a stream of data
Here we’re using CUDA as an example
float *a, *b, *c;
cudaMalloc((void **)&a, size);
cudaMalloc((void **)&b, size);
cudaMalloc((void **)&c, size);
vec_add<<<64, 64>>>(a, b, c);
Here we’re using BrookGPU as an example
reduce void sum (float a<>, reduce float r<>) {
  r += a;
}
float a<100>;
float r;
sum(a,r);
118 © 2016 Codeplay Software Ltd. Data Locality & Movement
119 © 2016 Codeplay Software Ltd. One of the biggest limiting factors in heterogeneous computing
Cost of data movement in time and power consumption
120 © 2016 Codeplay Software Ltd. Cost of Data Movement
• It can take considerable time to move data to a device • This varies greatly depending on the architecture • The bandwidth of a device can impose bottlenecks • This reduces the amount of throughput you have on the device • Performance gain from computation > cost of moving data • If the gain is less than the cost of moving the data it’s not worth doing • Many devices have a hierarchy of memory regions • Global, read-only, group, private • Each region has different size, affinity and access latency • Having the data as close to the computation as possible reduces the cost
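The break-even rule above (performance gain from computation must exceed the cost of moving data) can be written down as a small check. This is a rough sketch under simplifying assumptions: a single sustained bandwidth figure, no overlap of transfer and compute, and hypothetical names (offload_estimate, offload_is_worthwhile) that do not come from any library.

```cpp
#include <cassert>

// Illustrative inputs; every field is an estimate supplied by the caller.
struct offload_estimate {
  double host_seconds;      // time to run the computation on the host CPU
  double device_seconds;    // time to run it on the device, excluding copies
  double bytes_moved;       // total bytes copied to and from the device
  double bytes_per_second;  // sustained host <-> device bandwidth
};

// Time spent purely on data movement.
inline double transfer_seconds(const offload_estimate& e) {
  assert(e.bytes_per_second > 0.0);
  return e.bytes_moved / e.bytes_per_second;
}

// Offloading pays off only when the compute time saved on the device
// is larger than the time spent moving the data there and back.
inline bool offload_is_worthwhile(const offload_estimate& e) {
  const double gain = e.host_seconds - e.device_seconds;
  return gain > transfer_seconds(e);
}
```

For example, saving 0.9 s of compute while moving 4 GB at 8 GB/s (0.5 s) is worthwhile; saving 9 ms while moving 400 MB at the same bandwidth (50 ms) is not.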
121 © 2016 Codeplay Software Ltd. Cost of Data Movement
• 64-bit DP op: 20 pJ • 4x64-bit register read: 50 pJ • 4x64-bit move 1 mm: 26 pJ • 4x64-bit move 40 mm: 1 nJ • 4x64-bit move DRAM: 16 nJ
Credit: Bill Dally, Nvidia, 2010
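To make these figures easier to compare, the sketch below expresses each data movement cost as a multiple of one double-precision operation. The numbers are the ones quoted on the slide; the constant and function names are illustrative only.

```cpp
#include <cassert>

// Energy figures from the slide, converted to picojoules (1 nJ = 1000 pJ).
constexpr double dp_op_pj = 20.0;         // one 64-bit double-precision op
constexpr double register_read_pj = 50.0; // 4x64-bit register read
constexpr double move_1mm_pj = 26.0;      // 4x64-bit move over 1 mm
constexpr double move_40mm_pj = 1000.0;   // 4x64-bit move over 40 mm
constexpr double move_dram_pj = 16000.0;  // 4x64-bit move to or from DRAM

// How many double-precision operations cost the same energy.
constexpr double cost_in_dp_ops(double energy_pj) {
  return energy_pj / dp_op_pj;
}

// Moving 4x64 bits across 40 mm costs as much as 50 operations,
// and a DRAM access costs as much as 800 operations.
static_assert(cost_in_dp_ops(move_40mm_pj) == 50.0, "40 mm move vs DP op");
static_assert(cost_in_dp_ops(move_dram_pj) == 800.0, "DRAM move vs DP op");
```

On these figures a single DRAM transfer costs as much energy as 800 arithmetic operations, which is why keeping data close to the computation matters so much.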
122 © 2016 Codeplay Software Ltd. How do you move data from the host CPU to a device?
Implicit vs explicit data movement
123 © 2016 Codeplay Software Ltd. Implicit vs Explicit Data Movement
Implicit data movement: Examples: • SYCL, C++ AMP Implementation: • Data is moved to the device implicitly via cross host CPU / device data structures
Explicit data movement: Examples: • OpenCL, CUDA, OpenMP Implementation: • Data is moved to the device via explicit copy APIs
Here we’re using C++ AMP as an example Here we’re using CUDA as an example array_view
124 © 2016 Codeplay Software Ltd. How do you address memory between host CPU and device?
Multiple address space
Non-coherent single address space
Cache coherent single address space
125 © 2016 Codeplay Software Ltd. Comparison of Memory Models
• Multiple address space
  • SYCL 1.2, C++ AMP, OpenCL 1.x, CUDA
  • Pointers have keywords or structures for representing different address spaces
  • Allows finer control over where data is stored, but needs to be defined explicitly
• Non-coherent single address space
  • SYCL 2.2, HSA, OpenCL 2.x, CUDA 4, OpenMP
  • Pointers address a shared address space that is mapped between devices
  • Allows the host CPU and device to access the same address, but requires mapping
• Cache coherent single address space
  • SYCL 2.2, HSA, OpenCL 2.x, CUDA 6, C++
  • Pointers address a shared address space (hardware or cache coherent runtime)
  • Allows concurrent access on host CPU and device, but can be inefficient for large data
126 © 2016 Codeplay Software Ltd. SYCL: A New Approach to Heterogeneous Programming in C++
127 © 2016 Codeplay Software Ltd. SYCL for OpenCL
Cross-platform, single-source, high-level, C++ programming layer Built on top of OpenCL and based on standard C++14
128 © 2016 Codeplay Software Ltd. The SYCL Ecosystem
C++ Application
C++ Template Library C++ Template Library C++ Template Library
SYCL for OpenCL
OpenCL
CPU GPU APU Accelerator FPGA DSP
129 © 2016 Codeplay Software Ltd. How does SYCL improve heterogeneous offload and performance portability?
SYCL is entirely standard C++
SYCL compiles to SPIR
SYCL supports a multi compilation single source model
130 © 2016 Codeplay Software Ltd. Single Compilation Model
CPU CPU C++ x86 CPU Source Compiler Object File x86 ISA Linker (Embedded Device Device Object) Source Device Device GPU Compiler Object
131 © 2016 Codeplay Software Ltd. Single Compilation Model
C++ x86 CPU Source File x86 ISA Single Source Host & Device Compiler (Embedded Device Device Object) Source GPU
Tied to a single compiler chain
132 © 2016 Codeplay Software Ltd. Single Compilation Model
3 different language extensions, 3 different compilers, 3 different executables
• C++ AMP source → C++ AMP compiler → x86 ISA (embedded AMD ISA) → x86 CPU + AMD GPU
• CUDA source → CUDA compiler → x86 ISA (embedded NVidia ISA) → x86 CPU + NVidia GPU
• OpenMP source → OpenMP compiler → x86 ISA (embedded x86) → x86 CPU + SIMD CPU
134 © 2016 Codeplay Software Ltd. SYCL is Entirely Standard C++
__global__ vec_add(float *a, float *b, float *c) { return c[i] = a[i] + b[i]; vector
parallel_for_each(e, [=](index<2> idx) restrict(amp) { c[idx] = a[idx] + b[idx]; });
cgh.parallel_for
135 © 2016 Codeplay Software Ltd. SYCL Targets a Wide Range of Devices with SPIR
CPU GPU APU Accelerator FPGA DSP
136 © 2016 Codeplay Software Ltd. Multi Compilation Model
CPU CPU x86 ISA C++ (Embedded x86 CPU Compiler Object Source SPIR) File Linker Device Code SYCL Online SPIR GPU Compiler Finalizer
SYCL device compiler Generating SPIR
137 © 2016 Codeplay Software Ltd. Multi Compilation Model
GCC, Clang, VisualC++, Intel C++
CPU CPU x86 ISA C++ (Embedded x86 CPU Compiler Object Source SPIR) File Linker Device Code SYCL Online SPIR GPU Compiler Finalizer
138 © 2016 Codeplay Software Ltd. Multi Compilation Model
CPU CPU x86 ISA C++ (Embedded x86 CPU Source Compiler Object SPIR) File Linker Device Code SYCL Online SPIR GPU Compiler Finalizer
139 © 2016 Codeplay Software Ltd. Multi Compilation Model
[Diagram: C++ Source File → CPU Compiler → CPU Object → Linker → x86 ISA (embedded SPIR) → x86 CPU; Device Code → SYCL Compiler → SPIR → OpenCL Online Finalizer → SIMD CPU, GPU, APU, FPGA, DSP]
• Standard IR allows for better performance portability • SYCL does not mandate SPIR • Device can be selected at runtime
140 © 2016 Codeplay Software Ltd. Multi Compilation Model
CPU CPU x86 ISA (Embedded x86 CPU Compiler Object SPIR)
C++ Linker SIMD Source CPU File SYCL SPIR Compiler GPU
OpenCL Device Online APU Code SYCL Finalizer PTX Compiler FPGA
DSP
141 © 2016 Codeplay Software Ltd. Multi Compilation Model
[Diagram: C++ Source File → CPU Compiler → CPU Object → Linker → x86 ISA (embedded SPIR) → x86 CPU; Device Code → SYCL Compiler → SPIR → OpenCL Online Finalizer → SIMD CPU, GPU, APU, FPGA, DSP; Device Code → SYCL Compiler → PTX]
• PTX binary can be selected for NVidia GPUs at runtime
142 © 2016 Codeplay Software Ltd. How does SYCL support different ways of representing parallelism?
SYCL is an explicit parallelism model
SYCL is a queue execution model
SYCL supports both task and data parallelism
143 © 2016 Codeplay Software Ltd. Representing Parallelism
cgh.single_task([=](){
/* task parallel task executed once*/
});
cgh.parallel_for(range<2>(64, 64), [=](id<2> idx){
/* data parallel task executed across a range */
});
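The difference between the two calls can be mimicked in plain C++ to show how often each kernel body runs. The functions below are hypothetical stand-ins written for this illustration, not the SYCL API, and the loop is serial where a device would run the invocations in parallel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 2-D index or range; not the SYCL id/range types.
using id2 = std::array<std::size_t, 2>;

// Task parallelism: the kernel body is executed exactly once.
template <class Kernel>
void single_task(Kernel kernel) {
  kernel();
}

// Data parallelism: the kernel body is executed once per index in the range.
// A real device runs these invocations in parallel; this sketch is serial.
template <class Kernel>
void parallel_for(id2 range, Kernel kernel) {
  for (std::size_t i = 0; i < range[0]; ++i) {
    for (std::size_t j = 0; j < range[1]; ++j) {
      kernel(id2{i, j});
    }
  }
}
```

With a range of 64 x 64, as on the slide, the data-parallel body is invoked 4096 times, each time with a distinct index.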
144 © 2016 Codeplay Software Ltd. How does SYCL make data movement more efficient?
SYCL separates the storage and access of data
SYCL can specify where data should be stored/allocated
SYCL creates automatic data dependency graphs
145 © 2016 Codeplay Software Ltd. Separating Storage & Access
• Buffers manage data across the host CPU and one or more devices • Accessors are used to describe access
[Diagram: Buffer → Accessor → CPU; Buffer → Accessor → GPU]
Buffers and accessors provide type-safe access across host and device
146 © 2016 Codeplay Software Ltd. Storing/Allocating Memory in Different Regions
[Diagram: Buffer → Global Accessor / Constant Accessor → Kernel; Local Accessor → Kernel]
• Global Accessor: memory stored in global memory region • Constant Accessor: memory stored in read-only memory region • Local Accessor: memory allocated in group memory region
147 © 2016 Codeplay Software Ltd. Data Dependency Task Graphs
Buffer A Read Accessor Kernel A Write Accessor Kernel A Kernel B Buffer B Read Accessor Kernel B Write Accessor Buffer C Read Accessor Kernel C Read Accessor Kernel C Buffer D Write Accessor
148 © 2016 Codeplay Software Ltd. Benefits of Data Dependency Graphs
• Allows you to describe your problems in terms of relationships • Don’t need to enqueue explicit copies • Removes the need for complex event handling • Dependencies between kernels are automatically constructed • Allows the runtime to make data movement optimizations • Pre-emptively copy data to a device before kernels • Avoid unnecessarily copying data back to the host after execution on a device • Avoid copies of data that you don’t need
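How dependencies can be derived from declared accesses alone is sketched below in plain C++. It is a simplified model, not how any particular SYCL runtime is implemented: kernel_desc and build_dependencies are hypothetical names, buffers are identified by strings, and an edge is added whenever a later kernel touches a buffer that an earlier kernel wrote, or writes one that an earlier kernel read.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of one kernel: the buffers its accessors read and write.
struct kernel_desc {
  std::string name;
  std::set<std::string> reads;
  std::set<std::string> writes;
};

// True when the two sets share at least one buffer name.
static bool overlaps(const std::set<std::string>& a,
                     const std::set<std::string>& b) {
  for (const auto& x : a) {
    if (b.count(x) != 0) return true;
  }
  return false;
}

// Returns (earlier, later) index pairs, in submission order, where `later`
// must wait for `earlier`: read-after-write, write-after-read or
// write-after-write on a shared buffer.
std::vector<std::pair<std::size_t, std::size_t>> build_dependencies(
    const std::vector<kernel_desc>& kernels) {
  std::vector<std::pair<std::size_t, std::size_t>> edges;
  for (std::size_t later = 0; later < kernels.size(); ++later) {
    for (std::size_t earlier = 0; earlier < later; ++earlier) {
      const kernel_desc& e = kernels[earlier];
      const kernel_desc& l = kernels[later];
      if (overlaps(e.writes, l.reads) || overlaps(e.reads, l.writes) ||
          overlaps(e.writes, l.writes)) {
        edges.emplace_back(earlier, later);
      }
    }
  }
  return edges;
}
```

In an illustrative graph where kernel A writes buffer B, kernel B writes buffer C, and kernel C reads both, the function reports that C depends on A and on B, while A and B stay independent and could run concurrently. No explicit events or copies are described by the user.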
149 © 2016 Codeplay Software Ltd. So what does SYCL look like?
Here is a simple example SYCL application; a vector add
150 © 2016 Codeplay Software Ltd. Example: Vector Add
[Code listing built up over slides 150-158; only fragments survive. The listing defines a function template, annotated “The buffers synchronise upon destruction”, followed by a caller that declares std::vector objects and ends with:]
parallel_add(inputA, inputB, output, count); }
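The slide note that the buffers synchronise upon destruction can be illustrated with a toy model in plain C++. This is not the SYCL API and not the code from the slides: toy_buffer is a hypothetical class that copies host data in on construction and writes it back in its destructor, and the loop stands in for the kernel. Only the call shape parallel_add(inputA, inputB, output, count) is taken from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a buffer: owns a "device" copy of host data and writes it
// back to the host vector when destroyed. Hypothetical, not the SYCL buffer.
template <typename T>
class toy_buffer {
 public:
  explicit toy_buffer(std::vector<T>& host) : host_(host), device_copy_(host) {}
  toy_buffer(const toy_buffer&) = delete;
  toy_buffer& operator=(const toy_buffer&) = delete;
  ~toy_buffer() { host_ = device_copy_; }  // synchronise on destruction

  T& operator[](std::size_t i) { return device_copy_[i]; }
  const T& operator[](std::size_t i) const { return device_copy_[i]; }

 private:
  std::vector<T>& host_;
  std::vector<T> device_copy_;
};

template <typename T>
void parallel_add(std::vector<T>& inputA, std::vector<T>& inputB,
                  std::vector<T>& output, std::size_t count) {
  assert(inputA.size() >= count && inputB.size() >= count &&
         output.size() >= count);
  toy_buffer<T> a(inputA), b(inputB), out(output);
  // Stands in for the kernel: one addition per element.
  for (std::size_t i = 0; i < count; ++i) {
    out[i] = a[i] + b[i];
  }
}  // buffers go out of scope here, so `output` receives the results
```

The host vector output is only guaranteed to hold the sums after parallel_add returns, because that is when the toy buffers are destroyed and written back.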
159 © 2016 Codeplay Software Ltd. Single-source vs C++ kernel language
• Single-source: a single-source file contains both host and device code • Type-checking between host and device • A single template instantiation can create all the code to kick off work, manage data and execute the kernel • e.g. sort
• C++ kernel language • Matches standard OpenCL C • Proposed for OpenCL v2.1 • Being considered as an addition for SYCL v2.1
160 © 2016 Codeplay Software Ltd. Why ‘name’ kernels?
• Enables implementers to have multiple, different compilers for host and different devices • With SYCL, software developers can choose to use the best compiler for CPU and the best compiler for each individual device they want to support • The resulting application will be highly optimized for CPU and OpenCL devices • Easy-to-integrate into existing build systems
• Only required for C++11 lambdas, not required for C++ functors • Required because lambdas don’t have a name to enable linking between different compilers
161 © 2016 Codeplay Software Ltd. Buffers/images/accessors vs shared pointers
• OpenCL v1.2 supports a wide range of different devices and operating systems • All shared data must be encapsulated in OpenCL memory objects: buffers and images • To enable SYCL to achieve maximum performance of OpenCL, we follow OpenCL’s memory model approach • But, we apply OpenCL’s memory model to C++ with buffers, images and accessors • Separation of data storage and data access
162 © 2016 Codeplay Software Ltd. What can I do with SYCL?
Anything you can do with C++!
With the performance and portability of OpenCL
163 © 2016 Codeplay Software Ltd. Progress report on the SYCL vision
• Open, royalty-free standard: released • Conformance testsuite: going into adopters package • Open-source implementation: in progress (triSYCL) • Commercial, conformant implementation: in progress • C++ 17 Parallel STL: open-source in progress
• Template libraries for important C++ algorithms: getting going • Integration into existing parallel C++ libraries: getting going
164 © 2016 Codeplay Software Ltd. Building the SYCL for OpenCL ecosystem • To deliver on the full potential of high-performance heterogeneous systems • We need the libraries • We need integrated tools • We need implementations • We need training and examples
• An open standard makes it much easier for people to work together • SYCL is a group effort • We have designed SYCL for maximum ease of integration
166 © 2016 Codeplay Software Ltd.
Using SYCL to Develop Vision Tools
A high-level CV framework for OpenCL promoting:
Ease of use • Applications • Easy to write code • Unified front end to client code (API) • Easily able to add customisable operations • Compile time graph validation • Predictable memory usage
Performance portability • Separation of concerns • Portable between different programming models and architectures • No modification in application computation
Cross-platform portability • OpenCL enabled devices • CPU
167 © 2016 Codeplay Software Ltd. What is Parallelism TS v1 adding? • A set of execution policies and a collection of parallel algorithms • The Execution Policies • Paragraphs explaining the conditions for parallel algorithms • New parallel algorithms • But only on CPUs • Can we execute it on GPUs now?
168 © 2016 Codeplay Software Ltd. Sorting with the STL
A parallel sort A sequential sort std :: vector
• par is an object of an Execution Policy • The sort will be executed in parallel using an implementation-defined method
169 © 2016 Codeplay Software Ltd. The SYCL execution policy
template
• sycl_policy is an Execution Policy • data is a standard std::vector • Technically, will use the device returned by default_selector
170 © 2016 Codeplay Software Ltd. Heterogeneous load balancing
171 © 2016 Codeplay Software Ltd. Future work
https://github.com/KhronosGroup/SyclParallelSTL
172 © 2016 Codeplay Software Ltd. Other projects – in progress
TensorFlow: Google’s machine learning library + others …
Eigen: C++ linear algebra template library
173 © 2016 Codeplay Software Ltd. Conclusion
• Heterogeneous programming has been supported through OpenCL for years • C++ is a prominent language for doing this, but standard C++ is currently CPU-only • Graph programming languages enable neural networks and machine vision • SYCL allows you to program heterogeneous devices with standard C++ today • ComputeCpp is available now for you to download and experiment with • For engineers/companies/consortiums producing embedded devices for automotive ADAS, machine vision, or neural networks • Who want to deliver artificially intelligent devices that are also low power, e.g. for self-driving cars and smart homes • But who are dissatisfied with current single-vendor, locked-in heterogeneous solutions or with design/code that cannot be reused • We provide performance-portable open-standard software across multiple platforms with long-term support • Unlike vertical locked-in solutions such as CUDA, C++ AMP, or HCC • We have assembled a whole ecosystem of software accelerated by your parallel hardware, enabling reuse with open standards
175 © 2016 Codeplay Software Ltd. For Codeplay, these are our layer choices
We have chosen a layer of standards, based on current market adoption • TensorFlow and OpenCV • SYCL • OpenCL (with SPIR) • LLVM as the standard compiler back-end
Layers: • Graph programming: TensorFlow, OpenCV • C/C++-level programming: SYCL • Higher-level language enabler: OpenCL SPIR • Device-specific programming: LLVM
The actual choice of standards may change based on market dynamics, but by choosing widely adopted standards and a layering approach, it is easy to adapt
176 © 2016 Codeplay Software Ltd. For Codeplay, these are our products
• Graph programming: TensorFlow, OpenCV • C/C++-level programming: SYCL • Higher-level language enabler: OpenCL SPIR • Device-specific programming: LLVM
177 © 2016 Codeplay Software Ltd. Codeplay
Standards bodies •HSA Foundation: Chair of software group, spec editor of runtime and debugging •Khronos: chair & spec editor of SYCL. Contributors to OpenCL, Safety Critical, Vulkan •ISO C++: Chair of Low Latency, Embedded WG; Editor of SG1 Concurrency TS •EEMBC: members
Research •Members of EU research consortiums: PEPPHER, LPGPU, LPGPU2, CARP •Sponsorship of PhDs and EngDs for heterogeneous programming: HSA, FPGAs, ray-tracing •Collaborations with academics •Members of HiPEAC
Open source •HSA LLDB Debugger •SPIR-V tools •RenderScript debugger in AOSP •LLDB for Qualcomm Hexagon •TensorFlow for OpenCL •C++ 17 Parallel STL for SYCL •VisionCpp: C++ performance-portable programming model for vision
Presentations •Building an LLVM back-end •Creating an SPMD Vectorizer for OpenCL with LLVM •Challenges of Mixed-Width Vector Code Gen & Scheduling in LLVM •C++ on Accelerators: Supporting Single-Source SYCL and HSA •LLDB Tutorial: Adding debugger support for your target
Company •Based in Edinburgh, Scotland •57 staff, mostly engineering •License and customize technologies for semiconductor companies •ComputeAorta and ComputeCpp: implementations of OpenCL, Vulkan and SYCL •15+ years of experience in heterogeneous systems tools
Codeplay build the software platforms that deliver massive performance
178 © 2016 Codeplay Software Ltd. What our ComputeCpp users say about us
Benoit Steiner – Google TensorFlow engineer: “We at Google have been working closely with Luke and his Codeplay colleagues on this project for almost 12 months now. Codeplay's contribution to this effort has been tremendous, so we felt that we should let them take the lead when it comes down to communicating updates related to OpenCL. … we are planning to merge the work that has been done so far… we want to put together a comprehensive test infrastructure”
ONERA: “We work with royalty-free SYCL because it is hardware vendor agnostic, single-source C++ programming model without platform specific keywords. This will allow us to easily work with any heterogeneous processor solutions using OpenCL to develop our complex algorithms and ensure future compatibility”
Hartmut Kaiser – HPX: “My team and I are working with Codeplay's ComputeCpp for almost a year now and they have resolved every issue in a timely manner, while demonstrating that this technology can work with the most complex C++ template code. I am happy to say that the combination of Codeplay's SYCL implementation with our HPX runtime system has turned out to be a very capable basis for Building a Heterogeneous Computing Model for the C++ Standard using high-level abstractions.”
WIGNER Research Centre for Physics: “It was a great pleasure this week for us, that Codeplay released the ComputeCpp project for the wider audience. We've been waiting for this moment and keeping our colleagues and students in constant rally and excitement. We'd like to build on this opportunity to increase the awareness of this technology by providing sample codes and talks to potential users. We're going to give a lecture series on modern scientific programming providing field specific examples.”
179 © 2016 Codeplay Software Ltd. Further information
• OpenCL https://www.khronos.org/opencl/ • OpenVX https://www.khronos.org/openvx/ • HSA http://www.hsafoundation.com/ • SYCL http://sycl.tech • OpenCV http://opencv.org/ • Halide http://halide-lang.org/ • VisionCpp https://github.com/codeplaysoftware/visioncpp
180 © 2016 Codeplay Software Ltd.
Community Edition Available now for free!
Visit: computecpp.codeplay.com
181 © 2016 Codeplay Software Ltd. • Open source SYCL projects: • ComputeCpp SDK - Collection of sample code and integration tools • SYCL ParallelSTL – SYCL based implementation of the parallel algorithms • VisionCpp – Compile-time embedded DSL for image processing • Eigen C++ Template Library – Compile-time library for machine learning All of this and more at: http://sycl.tech
182 © 2016 Codeplay Software Ltd. Questions ?
@codeplaysoft /codeplaysoft codeplay.com