C++17: is it great or just OK? and Heterogeneous Computing in C++ Next for Self-Driving Cars. Michael Wong (Codeplay Software, VP of Research and Development) and Andrew Richards (CEO). ISOCPP.org Director and VP: http://isocpp.org/wiki/faq/wg21#michael-wong. Head of Delegation for the C++ Standard for Canada. Vice Chair of Programming Languages for the Standards Council of Canada.

Chair of WG21 SG5 Transactional Memory. Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded. Editor: C++ SG5 Transactional Memory Technical Specification. Editor: C++ SG1 Concurrency Technical Specification. http://wongmichael.com/about

Code::dive 2016 Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

2 © 2016 Codeplay Software Ltd. How do we get from here…

… to Level 5? Stages from here:
Level 0: Warnings
Level 1: Assist (adaptive)
Level 2: Execute automated manoeuvres
Level 3: Limited overall journey control
Level 4: Deep self-control, local to extensive
Level 5: Autonomous, all conditions
These are the SAE levels for autonomous vehicles. Similar challenges apply in other embedded intelligence industries.

3 © 2016 Codeplay Software Ltd. We have a mountain to climb

… without getting lost on our own … or climbing the wrong mountain … and we want to get there in safe, manageable, affordable steps. When we don't know what the top looks like, how do we get to the top?

4 © 2016 Codeplay Software Ltd. This presentation will focus on:

• The hardware and software platforms that will be able to deliver the results • The software tools to build up the solutions for those platforms • The open standards that will enable solutions to interoperate • How Codeplay can help deliver embedded intelligence

5 © 2016 Codeplay Software Ltd. Where do we need to go?

“On a 100 millimetre-squared chip, Google needs something like 50 teraflops of performance” - Daniel Rosenband (Google’s self-driving car project) at HotChips 2016

6 © 2016 Codeplay Software Ltd. Performance trends
[Chart: GFLOPS (log scale, 1 to 65,536) against year of introduction for desktop CPUs, smartphone CPUs, smartphone GPUs, integrated GPUs and desktop GPUs, with the Google target marked. These trend lines seem to violate the rules of physics…]

7 © 2016 Codeplay Software Ltd. How do we get there from here?

1. We need to write software today for platforms that cannot be built yet.
2. We need to validate the systems as safe.
3. We need to start with simpler systems that are not fully autonomous.

8 © 2016 Codeplay Software Ltd. Two models of software development

Hardware designer writing software: design a platform, implement the model on it, validate the platform, design the next version.
Software designer writing software: write software, select a platform, optimize for the platform, validate the whole platform.
Which method can get us all the way to full autonomy?

9 © 2016 Codeplay Software Ltd. Desirable Development

[Diagram: a desirable development loop across the stack (Software Application, Well-Defined Middleware, Hardware & Low-Level Software): write software, evaluate, optimize for the platform and validate the whole platform on the software side, while the platform side develops and evaluates the architecture and selects the platform.]

10 © 2016 Codeplay Software Ltd. The different levels of programming model

Device-specific programming: assembly language, VHDL, device-specific C-like programming models.
Higher-level language enablers: PTX, HSA, OpenCL SPIR, SPIR-V.
C-level programming: OpenCL C, DSP C, MCAPI/MTAPI.
C++-level programming: SYCL, CUDA, HCC, C++ AMP.
Graph programming: OpenCV, OpenVX, Halide, VisionCpp, TensorFlow, Caffe.

11 © 2016 Codeplay Software Ltd. Device-specific programming

Can… deliver quick results today; hand-optimize directly for the device.
Cannot… develop software today for future platforms; it is not a route to full autonomy and does not allow software developers to invest today.

12 © 2016 Codeplay Software Ltd. The route to full autonomy

• Graph programming • This is the most widely-adopted approach to machine vision and machine learning

• Open standards • This lets you develop today for future architectures

13 © 2016 Codeplay Software Ltd. Why graph programming?

When you scale the number of cores, you don't scale the number of memory ports: your compute performance increases, but your off-chip memory bandwidth does not. Therefore you need to reduce off-chip memory bandwidth by keeping everything on-chip, which is achieved by tiling. However, writing tiled image pipelines by hand is hard. If we build up a graph of operations (e.g. convolutions) and then have a runtime system split it into fused, tiled operations across an entire system-on-chip, we get great performance.
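As a rough illustration of the fusion idea (a minimal sketch in plain C++, not from the slides; the function names are made up): the unfused version writes an intermediate image back to memory between the two nodes, while the fused version keeps the intermediate value in a register, which is what a graph runtime tries to do automatically across a whole system-on-chip.

#include <vector>
#include <cstddef>

// Node 1 then node 2 as separate passes: 'tmp' round-trips through memory.
void brighten_then_threshold(const std::vector<float>& in, std::vector<float>& out) {
  std::vector<float> tmp(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) tmp[i] = in[i] * 1.5f;
  for (std::size_t i = 0; i < in.size(); ++i) out[i] = tmp[i] > 1.0f ? 1.0f : 0.0f;
}

// Fused version: the intermediate value never leaves a register.
void brighten_then_threshold_fused(const std::vector<float>& in, std::vector<float>& out) {
  for (std::size_t i = 0; i < in.size(); ++i) {
    float t = in[i] * 1.5f;
    out[i] = t > 1.0f ? 1.0f : 0.0f;
  }
}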

14 © 2016 Codeplay Software Ltd. Graph programming: some numbers

In this example we perform 3 image-processing operations on an accelerator (the system is an AMD APU; the operations are RGB->HSV, channel masking, HSV->RGB) and compare 3 systems when executing individual nodes or a whole graph. Halide and SYCL use kernel fusion, whereas OpenCV does not. For all 3 systems, the performance of the whole graph is significantly better than the individual nodes executed on their own.
[Chart: kernel time and overhead time (ms) for OpenCV (nodes), OpenCV (graph), Halide (nodes), Halide (graph), SYCL (nodes), SYCL (graph).]

15 © 2016 Codeplay Software Ltd. Graph programming • For both machine vision algorithms and machine learning, graph programming is the most widely-adopted approach • Two styles of graph programming that we commonly see:

C-style graph C++-style graph programming programming • OpenVX • Halide • OpenCV • RapidMind • Eigen (also in TensorFlow) • VisionCpp

16 © 2016 Codeplay Software Ltd. C-style graph programming

OpenVX: open standard • Can be implemented by vendors • Create a graph with C API, then map to an entire SoC

OpenCV: open source • Implemented on OpenCL • Implemented on device-specific accelerators • Create a graph with C API, then execute

17 © 2016 Codeplay Software Ltd. & Device-Specific Programming

Graph programming can… develop software today for future platforms, and runtime systems can automatically optimize the graphs. But what happens if we invent our own graph nodes? How do we adapt it for all the graph nodes we need?

18 © 2016 Codeplay Software Ltd. C++-style graph programming

Examples in machine vision/machine learning: Halide, RapidMind, Eigen (also in TensorFlow), VisionCpp.
C++ compilers that support this style: CUDA, C++ OpenMP, C++17 Parallel STL, SYCL.

19 © 2016 Codeplay Software Ltd. C++ single-source programming

• C++ lets us build up graphs at compile-time • This means we can map a graph to the processors offline • C++ lets us write custom nodes ourselves • This approach is called a C++ embedded domain-specific language (EDSL) • Very widely used, e.g. Eigen, TensorFlow, RapidMind, Halide (see the sketch below)
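A minimal sketch of the EDSL idea (illustrative only; this is not VisionCpp's or Eigen's actual API): expression templates encode the graph in the C++ type system at compile time, and a single evaluation pass walks the whole tree.

#include <cstddef>

// Leaf node: wraps an image buffer.
struct Image {
  const float* data;
  float operator[](std::size_t i) const { return data[i]; }
};

// Interior node: represents "lhs + rhs" without evaluating it yet.
template <class L, class R>
struct AddNode {
  L lhs; R rhs;
  float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

template <class L, class R>
AddNode<L, R> operator+(L lhs, R rhs) { return {lhs, rhs}; }

// One fused traversal of the whole expression tree per pixel.
template <class Expr>
void evaluate(const Expr& e, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = e[i];
}

// Usage: the type of (a + b) + c is AddNode<AddNode<Image, Image>, Image>,
// i.e. the graph is known at compile time and can be mapped offline.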

20 © 2016 Codeplay Software Ltd. C++ single-source programming lets us create customizable graph models, and single-source is the most widely-adopted machine learning programming model. OpenCL lets us run on a very wide range of accelerators, now and in the future. SYCL combines C++ single-source with OpenCL acceleration: combining open standards, C++ and graph programming.

21 © 2016 Codeplay Software Ltd. Putting it all together: building it

22 © 2016 Codeplay Software Ltd. Higher-level programming enablers

NVIDIA PTX HSA OpenCL SPIR SPIR-V

• NVIDIA PTX: NVIDIA CUDA-only.
• HSA: royalty-free open standard; HSAIL is the IR; provides a single address space with virtual memory and low-latency communication.
• OpenCL SPIR: defined for OpenCL v1.2; based on Clang/LLVM (the open-source compiler).
• SPIR-V: open standard defined by Khronos; supports compute and graphics (OpenCL, Vulkan and OpenGL); not tied to any compiler.

Open standard intermediate representations enable tools to be built on top and support a wide range of platforms

23 © 2016 Codeplay Software Ltd. Which model should we choose?

Device-specific Higher-level C-level C++-level Graph programming language programming programming programming • Assembly enabler • OpenCL C • SYCL • OpenCV language • NVIDIA PTX • DSP C • CUDA • OpenVX • VHDL • HSA • MCAPI/MTAPI • HCC • Halide • Device-specific C- • OpenCL SPIR • C++ AMP • VisionCpp like programming • SPIR-V • TensorFlow models • Caffe

24 © 2016 Codeplay Software Ltd. They are not alternatives, they are layers

Graph programming

OpenCV OpenVX Halide VisionCpp TensorFlow Caffe

C/C++-level programming

SYCL CUDA HCC C++ AMP OpenCL

Higher-level language enabler

NVIDIA PTX HSA OpenCL SPIR SPIR-V

Device-specific programming

Assembly language VHDL Device-specific C-like programming models

25 © 2016 Codeplay Software Ltd. Can specify, test and validate each layer

Graph programming

Validate graph models Validate the code using standard tools

C/C++-level programming

OpenCL/SYCL specs Clsmith testsuite Conformance testsuites Wide range of other testsuites

Higher-level language enabler

SPIR/SPIR-V/HSAIL specs Conformance testsuites

Device-specific programming

Device-specific specification Device-specific testing and validation

26 © 2016 Codeplay Software Ltd. Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

27 © 2016 Codeplay Software Ltd. C++ support for massively parallel heterogeneous devices
What is needed: memory allocation (near and far memory); better affinity for CPU and memory; templates (static, compile time); exceptions; polymorphism; task blocks; execution agents/contexts; progress guarantees; the current Technical Specifications (Concepts, Parallelism, Concurrency, TM).
Several candidates: SYCL, HPX, HCC, Agency, Kokkos, Raja, C++AMP.

28 © 2016 Codeplay Software Ltd. C++ Std Timeline/status https://wongmichael.com/2016/06/29/c17-all-final-features-from-oulu-in-a-few-slides/

29 © 2016 Codeplay Software Ltd. Pre-C++11 projects

• ISO/IEC TR 18015:2006, Technical Report on C++ Performance. Published 2006 (draft TR18015, 2006-02-15). A C++ performance report. In C++17? No.
• ISO/IEC TR 19768:2007, Technical Report on C++ Library Extensions. Published 2007-11-15 (draft n1745, 2005-01-17). Has 14 Boost libraries, 13 of which were added to C++11; TR 29124 split off, the rest merged into C++11. In C++17? N/A (mostly already included in C++11).
• ISO/IEC TR 29124:2010, mathematical special functions. Published 2010-09-03 (final draft n3060, 2010-03-06); under consideration to merge into C++17 by p0226 (2016-02-10). Extensions to the C++ library to support mathematical special functions; really ordinary math today, with Boost and Dinkumware implementations. In C++17? YES.
• ISO/IEC TR 24733:2011, decimal floating point (decimal32, decimal64, decimal128). Published 2011-10-25 (draft n2849, 2009-03-06); may be superseded by a future Decimal TS or merged into C++. In C++17? No; ongoing work in SG6.

Status after Nov Issaquah C++ Meeting

• ISO/IEC TS 18822:2015, File System TS. Published 2015-06-18 (final draft n4100, 2014-07-04). Standardizes a Linux and Windows file system interface. In C++17? YES.
• ISO/IEC TS 19570:2015, C++ Extensions for Parallelism. Published 2015-06-24 (final draft n4507, 2015-05-05). Parallel STL algorithms. In C++17? YES, but the dynamic execution policy and exception_lists were removed and some names changed.
• ISO/IEC TS 19841:2015, Transactional Memory TS. Published 2015-09-16 (final draft n4514, 2015-05-08). Composable lock-free programming that scales. In C++17? No; already in the GCC 6 release and waiting for subsequent usage experience.
• ISO/IEC TS 19568:2015, C++ Extensions for Library Fundamentals. Published 2015-09-30 (final draft n4480, 2015-04-07). optional, any, string_view and more. In C++17? YES, but invocation traits and polymorphic allocators were moved into LF TS2.
• ISO/IEC TS 19217:2015, C++ Extensions for Concepts. Published 2015-11-13 (final draft n4553, 2015-10-02). Constrained templates. In C++17? No; already in the GCC 6 release and waiting for subsequent usage experience.

31 © 2016 Codeplay Software Ltd. Status after Nov Issaquah C++ Meeting

• ISO/IEC TS 19571:2016, C++ Extensions for Concurrency. Published 2016-01-19 (final draft p0159r0, 2015-10-22). Improvements to future, latches and barriers, atomic smart pointers. In C++17? No; already in a Visual Studio release and waiting for subsequent usage experience.
• ISO/IEC DTS 19568:xxxx, C++ Extensions for Library Fundamentals, Version 2. DTS; draft n4564 (2015-11-05). Source code information capture and various utilities. In C++17? No; resolution of comments from national standards bodies in progress.
• ISO/IEC DTS 21425:xxxx, Ranges TS. In development; draft n4569 (2016-02-15). Range-based algorithms and views. In C++17? No; wording review of the spec in progress.
• ISO/IEC DTS 19216:xxxx, Networking TS. In development; draft n4575 (2016-02-15). A sockets library based on Boost.ASIO. In C++17? No; wording review of the spec in progress.
• Modules. In development; drafts p0142r0 and p0143r1 (2016-02-15). A component system to supersede the textual header file inclusion model. In C++17? No; the initial TS wording reflects Microsoft's design, and changes proposed by Clang implementers are expected.

32 © 2016 Codeplay Software Ltd. Status after Nov Issaquah C++ Meeting

• Numerics TS. Early development; draft p0101 (2015-09-27). Various numerical facilities. In C++17? No; under active development.
• ISO/IEC DTS 19571:xxxx, Concurrency TS 2. Early development. Exploring executors, synchronic types, lock-free, atomic views, concurrent data structures. In C++17? No; under active development.
• ISO/IEC DTS 19570:xxxx, Parallelism TS 2. Early development; draft n4578 (2016-02-22). Exploring task blocks, progress guarantees, SIMD. In C++17? No; under active development.
• ISO/IEC DTS 19841:xxxx, Transactional Memory TS 2. Early development. Exploring on_commit, in_transaction. In C++17? No; under active development.
• Graphics TS. Early development; draft p0267r0 (2016-02-12). A 2D drawing API. In C++17? No; wording review of the spec in progress.
• ISO/IEC DTS 19569:xxxx, Array Extensions TS. Under overhaul; abandoned draft n3820 (2013-10-10). Stack arrays whose size is not known at compile time. In C++17? No; withdrawn, and any future proposals will target a different vehicle.

33 © 2016 Codeplay Software Ltd. Status after Nov Issaquah C++ Meeting

• Coroutine TS. Resumable functions. Initial TS wording will reflect Microsoft's await design; changes proposed by others expected. In C++17? No; under active development.
• Reflection TS. Code introspection and (later) reification mechanisms. Design direction for introspection chosen; likely to target a future TS. In C++17? No; under active development.
• Contracts TS. Preconditions, postconditions, etc. Unified proposal reviewed favourably. In C++17? No; under active development.
• Massive Parallelism TS. Massive parallelism dispatch. Early development. In C++17? No; under active development.
• Heterogeneous Device TS. Support for heterogeneous devices. Early development. In C++17? No; under active development.
• C++17 itself. On track for 2017. Filesystem TS, Parallelism TS, Library Fundamentals TS I, if constexpr, and various other enhancements are in; see slides 44-47 for details. YES.

34 © 2016 Codeplay Software Ltd. Library Fundamental TS 2: being reviewed

• Source-code information capture (really a Reflection feature with a library interface) • A generalized callable negator • Uniform container erasure • GCD and LCM functions (GCD/LCM moved into C++17) • Delimited iterators • observer_ptr, the world’s dumbest smart pointer • A const-propagating wrapper class • make_array • A metaprogramming utility dubbed the “C++ detection idiom” • A replacement for std::rand() • Logical type traits.
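As one concrete taste of this list, here is a small sketch of the "C++ detection idiom" (assuming a standard library that ships the Library Fundamentals v2 header <experimental/type_traits>; the has_reserve_v name is made up for the example):

#include <experimental/type_traits>
#include <cstddef>
#include <utility>
#include <vector>

// Archetype: what "t.reserve(n)" would look like for a type T.
template <class T>
using reserve_t = decltype(std::declval<T&>().reserve(std::declval<std::size_t>()));

template <class T>
constexpr bool has_reserve_v = std::experimental::is_detected_v<reserve_t, T>;

static_assert(has_reserve_v<std::vector<int>>, "vector has reserve()");
static_assert(!has_reserve_v<int>, "int does not");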

35 © 2016 Codeplay Software Ltd. C++17 Language features already voted in
• static_assert(condition) without a message
• Allowing auto var{expr};
• Writing a template template parameter as template <…> typename Name
• Removing trigraphs
• Folding expressions
• Attributes for namespaces and enumerators
• Shorthand syntax for nested namespace definitions
• u8 character literals
• Allowing full constant expressions in non-type template parameters
• Removing the register keyword, while keeping it reserved for future use
• Removing operator++ for bool
• Making exception specifications part of the type system
• __has_include()
• Choosing an official name for what are commonly called "non-static data member initializers" or NSDMIs; the official name is "default member initializers"
• A minor change to the semantics of inheriting constructors
• The [[fallthrough]], [[nodiscard]] and [[maybe_unused]] attributes
• Extending aggregate initialization to allow initializing base subobjects
• Lambdas in constexpr contexts
• Disallowing unary folds of some operators over an empty parameter pack
• Generalizing the range-based for loop
• Lambda capture of *this by value
• Relaxing the initialization rules for scoped enum types
• Hexadecimal floating-point literals
A few of these are exercised in the snippet below.
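An illustrative snippet (not from the slides) exercising fold expressions, nested namespace definitions, a constexpr-usable lambda, static_assert without a message, a u8 character literal and [[fallthrough]]:

namespace demo::detail {                                   // shorthand nested namespace definition
  template <typename... Ts>
  constexpr auto sum(Ts... ts) { return (ts + ... + 0); }  // fold expression
}

int classify(int x) {
  int category = 0;
  switch (x) {
    case 2: category = 1; [[fallthrough]];                 // [[fallthrough]] attribute
    case 3: category += 1; break;
    default: break;
  }
  return category;
}

int main() {
  constexpr auto square = [](int n) { return n * n; };     // lambda usable in constant expressions
  static_assert(demo::detail::sum(1, 2, 3) == 6);          // static_assert without a message
  static_assert(square(4) == 16);
  char c = u8'a';                                          // u8 character literal
  return classify(2) + static_cast<int>(c != 'a');
}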

36 © 2016 Codeplay Software Ltd. C++17 Language features voted in Oulu Finland
• if constexpr (formerly known as constexpr_if, and before that static_if)
• Template parameter deduction for constructors
• Inline variables
• Guaranteed copy elision
• Guarantees on expression evaluation order
• Dynamic memory allocation for over-aligned data
• is_contiguous_layout (really a library feature, but it needs compiler support)
• Removing exception specifications
• Using attribute namespaces without repetition
• Replacement of class objects containing reference members
• Standard and non-standard attributes
• Forward progress guarantees: base definitions
• Forward progress guarantees for the Parallelism TS features
• Introducing the term 'templated entity'
• Proposed wording for structured bindings
• Selection statements with initializer
• Explicit default constructors and copy-list-initialization
Not in C++17 (yet!):
• Default comparisons (for/against/neutral: 16/31/20)
• Operator dot (not moved as CWG discovered a flaw)

37 © 2016 Codeplay Software Ltd. C++17 Library features already voted in
• Removing some legacy library components
• Contiguous iterators
• Re-enabling shared_from_this
• Safe conversions in unique_ptr
• not_fn
• Making std::reference_wrapper trivially copyable
• Cleaning up noexcept in containers
• constexpr atomic<T>::is_always_lock_free
• Improved insertion interface for unique-key maps
• Nothrow-swappable traits
• void_t alias template
• Fixing a design mistake in the searchers interface
• invoke function template
• Non-member size(), empty() and data() functions
• An algorithm to clamp a value between a pair of boundary values
• Improvements to pair and tuple
• bool_constant
• constexpr std::hardware_{constructive,destructive}_interference_size
• shared_mutex
• Incomplete type support for standard containers
• A 3-argument overload of std::hypot
• Type traits variable templates
• Adding constexpr modifiers
• as_const()
• Giving std::string a non-const data() member function
• Removing deprecated iostreams aliases
• Making std::owner_less more flexible
• is_callable, the missing INVOKE-related trait
• Polishing
• Variadic lock_guard
• Logical type traits
A few of these are exercised in the snippet below.
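A quick illustrative tour of a few of the library additions above (std::clamp, std::as_const, std::not_fn, non-member std::size() and std::invoke; again, not from the slides):

#include <algorithm>   // std::clamp
#include <functional>  // std::not_fn, std::invoke
#include <iterator>    // std::size
#include <utility>     // std::as_const
#include <vector>
#include <cassert>

int main() {
  std::vector<int> v{3, 1, 4};
  assert(std::size(v) == 3);                     // non-member size()
  assert(std::clamp(42, 0, 10) == 10);           // clamp a value between two bounds

  auto is_odd  = [](int x) { return x % 2 != 0; };
  auto is_even = std::not_fn(is_odd);            // not_fn
  assert(is_even(4));
  assert(std::invoke(is_odd, 3));                // invoke function template

  const auto& cv = std::as_const(v);             // as_const()
  assert(cv.size() == 3);
}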

38 © 2016 Codeplay Software Ltd. C++17 Library features voted in Oulu Finland

• Synopses for the C library
• High-performance, locale-independent number <-> string conversions
• make_from_tuple() (like apply(), but for constructors)
• Elementary string conversions
• Integrating std::string_view and std::string
• C11 libraries
• shared_ptr::weak_type
• gcd() and lcm() from LF TS 2
• Making Optional Greater Equal Again
• Making Variant Greater Equal
• Homogeneous interface for variant, any and optional
• Letting folks define a default_order<> without defining std::less<>
• Splicing between associative containers
• Relative paths
• has_unique_object_representations
• Extending memory management tools
• Emplace return type
• Removing allocator support in std::function
• Deprecating std::iterator, redundant members of std::allocator, and is_literal
• Reserve a namespace for STL v2
• std::variant<>
• Delete operator= for polymorphic_allocator
• Better names for parallel execution policies in C++17
• Fixes for not_fn
• Temporarily discourage memory_order_consume
• Adapting string_view by filesystem paths
• A nomenclature tweak
• "Hotel Parallelifornia": terminate() for parallel algorithms exception handling

39 © 2016 Codeplay Software Ltd.

What did not change from Issaquah: no Concepts, no Unified Call Syntax, no Default Comparisons, no operator dot. Inline variables stay.

40 © 2016 Codeplay Software Ltd. Changes voted in Issaquah
Fixes to C++17: removing deprecated exception specifications from C++17; added elementary string conversions; std::byte was not added.
Some new features for C++20: pack expansions in using-declarations; lifting restrictions on requires-expressions.

41 © 2016 Codeplay Software Ltd. Future C++ Standard schedules • After Nov, Issaquah • Address additional returned comments in February Kona • Likely Issue DIS after Kona, Feb 2017, send it to National Body for final approval ballot; this is just an up/down vote, no comments • Most likely approved, then celebrate in July 2017 Toronto Meeting • Then send it to ISO Geneva for publication, likely by EOY 2017 • After C++17 • Default is 3 yr cycle: C++20, 23

42 © 2016 Codeplay Software Ltd. Improve support for large-scale dependable software

• Modules • to improve locality and improve compile time; n4465 and n4466 • Contracts • for improved specification; n4378 and n4415 • A type-safe union • probably functional-programming style pattern matching; something based on my Urbana presentation, which relied on the Mach7 library: Yuriy Solodkyy, Gabriel Dos Reis and Bjarne Stroustrup: Open Pattern Matching for C++. ACM GPCE'13.

C++17 Lenexa 43 43 © 2016 Codeplay Software Ltd. Provide support for higher-level concurrency models

• Basic networking • asio n4478 • A SIMD vector • to better utilize modern high-performance hardware; e.g., n4454 but I’d like a real vector rather than just a way of writing parallelizable loops • Improved futures • e.g., n3857 and n3865 • Co-routines • finally, again for the first time since 1990; N4402, N4403, and n4398 • Transactional memory • n4302 • Parallel algorithms (incl. parallel versions of some of the STL • n4409

C++17 Lenexa 44 44 © 2016 Codeplay Software Ltd. Simplify core language use and address major sources of errors • Concepts (n3701 and n4361) • concepts in the standard library • based on the work done in Origin, The Palo Alto TR, and Ranges n4263, n4128 and n4382 • default comparisons May come back in • to complete the support for fundamental operations; n4475 and n4476 limited form with National Body • uniform call syntax comment • among other things: it helps concepts and STL style library use; n4474 May come back in operator dot • limited form with • to finally get proxies and smart references; n4477 National Body • array_view and string_view comment • better range checking, DMR wanted those: "fat pointers"; n4480 • arrays on the stack • "stack_array" anyone? But we need to find a safe way of dealing with stack overflow; n4294 • optional • unless it is subsumed by pattern matching, and I think not in time for C++17, n4480

C++17 Lenexa 45 © 2016 Codeplay Software Ltd. The Verdict on C++17? (from reddit)
• "You blew it" / Did a nice job
• "Not a major release" / But not minor either
• "No risk, no gain" / Safe and conservative wins
• "Nobody implements TSs" / TSs are implemented
• "Tethering tower of Babel of TSs" / Followed the rules of a bus/train model: how to get 110 people to work together
A Medium/OK release.

46 © 2016 Codeplay Software Ltd. The parallel and concurrency planets of C++ today: SG1 Parallelism/Concurrency (3 TSs), SG5 Transactional Memory TS, SG14 Low Latency.

47 © 2016 Codeplay Software Ltd. C++1Y (1Y = 17/20/22) SG1/SG5/SG14 plan (red = C++17, blue = C++20?, black = future?)
Parallelism: parallel algorithms; library vector types; vector loop algorithm/execution policy; task-based parallelism (task blocks, OpenMP, fork-join); execution agents; progress guarantees; MapReduce.
Concurrency: Future++ (then, wait_any, wait_all); latches and barriers; atomic smart pointers; osync_stream; atomic views, fp_atomics; counters/queues; executors; lock-free techniques/transactions; synchronics replacement/atomic flags; co-routines; concurrent vector/unordered associative containers; upgrade_lock; pipelines/channels.

48 © 2016 Codeplay Software Ltd.

Part 1: Parallel C++ Library In C++17

49 Execution Policies (Parallelism TS, published 2015)

using namespace std::experimental::parallelism;
std::vector<int> vec = ...

// previous standard sequential sort std::sort(vec.begin(), vec.end());

// explicitly sequential sort std::sort(std::seq, vec.begin(), vec.end());

// permitting parallel execution std::sort(std::par, vec.begin(), vec.end());

// permitting vectorization as well std::sort(std::par_unseq, vec.begin(), vec.end());

50 © 2016 Codeplay Software Ltd. What was changed from Parallelism TS in C++17 • Removed dynamic execution policy • Name change from par_vec to par_unseq • Removed exception_list (of exception_ptr) to terminate and don’t unwind

51 51 © 2016 Codeplay Software Ltd. Issaquah changes to Parallelism TS • Exceptions now part of execution policy • To enable future exception handling such as reduction • Inner product is now transform reduce • Input iterators can cause reversal to sequential • Default policies cannot copy predicates • Trying to enable that so NUMA systems can work well

52 52 © 2016 Codeplay Software Ltd. Part 2: Forward Progress guarantees in C++17

53 ParallelSTL
• C++17 execution policies require concurrent or parallel forward progress guarantees
• This means GPUs are not supported by the standard execution policies
• Executors intend to interface with execution policies

parallel_for_each(par.on(exec), vec.begin(), vec.end(), [=](int&e){ /* … */ });

54 © 2016 Codeplay Software Ltd. Forward Progress Guarantees
• The C++17 forward progress guarantees are:
• Concurrent forward progress guarantee: a thread of execution is required to make forward progress regardless of the forward progress of any other thread of execution.
• Parallel forward progress guarantee: a thread of execution is not required to make forward progress until an execution step has occurred, and from that point onward it is required to make forward progress regardless of the forward progress of any other thread of execution.
• Weakly parallel forward progress guarantee: a thread of execution is not required to make progress.

• These are not specific guarantees for GPUs 55 55 © 2016 Codeplay Software Ltd. Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

56 © 2016 Codeplay Software Ltd. Part 3: Futures++ (.then, wait_any, wait_all) in future C++ 20

5 7 Futures & Continuations • Extensions to C++11 futures • MS-style .then continuations • then() • Sequential and parallel composition • when_all() - join • when_any() - choice • Useful utilities: • make_ready_future() • is_ready() • unwrap()

58 © 2016 Codeplay Software Ltd. Summary Of Proposed Extensions (1)

template <typename F> auto then(F&& func) -> future<...>;
template <typename F> auto then(executor& ex, F&& func) -> future<...>;
template <typename F> auto then(launch policy, F&& func) -> future<...>;

59 © 2016 Codeplay Software Ltd. Summary Of Proposed Extensions (2)

template <typename T> future<typename decay<T>::type> make_ready_future(T&& value);
future<void> make_ready_future();
bool is_ready() const;
template <typename R> future<R> future<future<R>>::unwrap(); // R is a future or shared_future

60 © 2016 Codeplay Software Ltd. Summary Of Proposed Extensions (3) template <...> when_all(InputIterator first, InputIterator last); template <...> when_all(T&&... futures); template <...> when_any(InputIterator first, InputIterator last); template <...> when_any(T&&... futures);

template <...> when_any_swapped(InputIterator first, InputIterator last);
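A rough usage sketch of these extensions from the caller's side (illustrative; it assumes the Concurrency TS types in std::experimental and a user-supplied load() that returns such a future, and the header name varies between implementations):

#include <experimental/future>
#include <vector>

namespace stdx = std::experimental;

stdx::future<int> load(int id);   // assumed: launches asynchronous work somewhere

void example() {
  // Sequential composition: attach a continuation instead of blocking.
  auto doubled = load(1).then([](stdx::future<int> f) { return f.get() * 2; });

  // Parallel composition: join a whole set of futures...
  std::vector<stdx::future<int>> jobs;
  jobs.push_back(load(2));
  jobs.push_back(load(3));
  auto all = stdx::when_all(jobs.begin(), jobs.end());

  // ...or take whichever finishes first.
  auto any = stdx::when_any(load(4), load(5));

  // Utilities.
  auto ready = stdx::make_ready_future(42);
  bool done = ready.is_ready();
  (void)doubled; (void)all; (void)any; (void)done;
}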

61 © 2016 Codeplay Software Ltd. Part 4: Executors in future C++20

62 Executors
• Executors are to function execution what allocators are to memory allocation
• If a control structure such as std::async() or the parallel algorithms describes work that is to be executed, an executor describes where and when that work is to be executed
• http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0443r0.html

63 © 2016 Codeplay Software Ltd. The Idea Behind Executors

Unified Interface for Execution

64 © 2016 Codeplay Software Ltd. Several Competing Proposals • P0008r0 (Mysen): Minimal interface for fire-and-forget execution • P0058r1 (Hoberock et al.): Functionality needed for foundations of Parallelism TS • P0113r0 (Kohlhoff): Functionality needed for foundations of Networking TS • P0285r0 (Kohlhoff): Executor categories & customization points

65 © 2016 Codeplay Software Ltd. Telecon calls between Oulu to Issaquah meeting

Jared Hoberock (thanks for the slides!), Michael Garland, Chris Kohlhoff, Chris Mysen, Carter Edwards, Hans Boehm, Gordon Brown, Thomas Heller, Lee Howes, Hartmut Kaiser, Bryce Lelbach, Gor Nishanov, Thomas Rodgers, Michael Wong

66 © 2016 Codeplay Software Ltd. Current Progress of Executors • Closing in on minimal proposal • A foundation for later proposals (for heterogeneous computing) • Still work in progress

67 © 2016 Codeplay Software Ltd. Current Progress of Executors • An instruction stream is the function you want to execute • An executor is an interface that describes where and when to run an instruction stream • An executor has one or more execute functions • An execute function executes an instruction stream on light weight execution agents such as threads, SIMD units or GPU threads

68 © 2016 Codeplay Software Ltd. Current Progress of Executors • An execution platform is a target architecture such as linux • An execution resource is the hardware abstraction that is executing the work such as a thread pool • An execution context manages the light weight execution agents of an execution resource during the execution

69 © 2016 Codeplay Software Ltd. Executors: Bifurcation

• Bifurcation of one-way vs two-way
• One-way: does not return anything
• Two-way: returns a future type
• Bifurcation of blocking vs non-blocking (WIP)
• May block: the calling thread may block forward progress until the execution is complete
• Always block: the calling thread always blocks forward progress until the execution is complete
• Never block: the calling thread never blocks forward progress
• Bifurcation of hosted vs remote
• Hosted: execution is performed within threads of the device from which the execution is launched, with a minimum of parallel forward progress guarantee between threads
• Remote: execution is performed within threads of another, remote device, with a minimum of weakly parallel forward progress guarantee between threads

70 © 2016 Codeplay Software Ltd. Features of C++ Executors

• One-way non-blocking single execute executors • One-way non-blocking bulk execute executors • Remote executors with weakly parallel forward progress guarantees • Top down relationship between execution context and executor • Reference counting semantics in executors • A minimal execution resource which supports bulk execute • Nested execution contexts and executors • Executors block on destruction

71 © 2016 Codeplay Software Ltd. Executor Framework: abstract platform details of execution, create execution agents, manage the data they share, advertise semantics, mediate dependencies.

class sample_executor {
public:
  using execution_category = ...;
  using shape_type = tuple<...>;
  template <class T> using future = ...;
  template <class T> future<T> make_ready_future(T&& value);
  template <class Function, class Factory1, class Factory2>
  future<...> bulk_async_execute(Function f, shape_type shape,
                                 Factory1 result_factory, Factory2 shared_factory);
  ...
};

72 © 2016 Codeplay Software Ltd. Purpose 1 of executors:where/how execution

• Placement is, by default, at the discretion of the system.

• If the programmer wants to control placement:

73 © 2016 Codeplay Software Ltd. Purpose 2 of executors •Control relationship with Calling threads •async(launch_flags, function); •async(executor, function);

74 © 2016 Codeplay Software Ltd. Purpose 3 of executors •Uniform interface for scheduling semantics across control structures • for_each(P.on(executor), ...); • async(executor, ...); • future.then(executor, ...); • dispatch(executor, ...);

75 © 2016 Codeplay Software Ltd. SHORT TERM GOALS •Compose with existing control structures • In C++17: • async(), invoke(), for_each(), sort(), ... • In technical specifications: • define_task_block(), future.then(), Networking TS, asynchronous operations, Transactional memory

76 © 2016 Codeplay Software Ltd. UNIFIED DESIGN

• Distinguish executors from execution contexts • Categorize executors • Enable customization • Describe composition with existing control structures

77 © 2016 Codeplay Software Ltd. EXECUTORS & CONTEXTS Light-weight views on long-lived resources • Distinguish executors from execution contexts • Categorize executors • Enable customization • Describe composition with existing control structures

78 © 2016 Codeplay Software Ltd. EXECUTORS & CONTEXTS Light-weight views on long-lived resources • Executors are (potentially short-lived) objects that create execution agents on execution contexts. • Execution contexts are (potentially long-lived) objects that manage the lifetime of underlying execution resources.

79 © 2016 Codeplay Software Ltd. EXECUTORS & CONTEXTS Example: simple thread pool
• Context: my_thread_pool
• Executor: my_thread_pool::executor_t; its .execute() submits a task to the thread pool, and the executor is created by the context

struct my_thread_pool
{
  template <class Function>
  void submit(Function&& f);

  struct executor_t
  {
    my_thread_pool& ctx;

    template <class Function>
    void execute(Function&& f) const
    {
      // forward the function to the thread pool
      ctx.submit(std::forward<Function>(f));
    }

my_thread_pool& context() const noexcept {return ctx;}

bool operator==(const executor_t& rhs) const noexcept {return ctx == rhs.ctx;}

bool operator!=(const executor_t& rhs) const noexcept {return ctx != rhs.ctx;}

};

executor_t executor() { return executor_t{*this}; }

...

};
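A minimal sketch of how such an executor might be used together with the proposed customization points (illustrative; do_work() and do_other_work() are assumed user functions, and the execution:: namespace follows the P0443-era drafts quoted on these slides, not a shipping library):

void do_work();        // assumed user function
void do_other_work();  // assumed user function

void use_pool() {
  my_thread_pool pool;                 // execution context: owns the threads
  auto exec = pool.executor();         // light-weight executor view onto it

  exec.execute([] { do_work(); });     // native fire-and-forget operation

  // Uniform customization point: would also adapt executors that do not
  // natively provide the requested operation.
  execution::execute(exec, [] { do_other_work(); });
}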

80 © 2016 Codeplay Software Ltd. Executor Interface: semantic types exposed by executors
• execution_category: scheduling semantics amongst agents in a task (sequenced, vector-parallel, parallel, concurrent)
• shape_type: type for indexing a bulk launch of agents (typically n-dimensional integer indices)
• future<T>: type for synchronizing asynchronous activities (follows the interface of std::future)

81 © 2016 Codeplay Software Ltd. Executor Interface: core constructs for launching work

Single-agent tasks:
result sync_execute(Function f);
future<result> async_execute(Function f);
future<result> then_execute(Function f, Future& predecessor);

Multi-agent tasks:
result bulk_sync_execute(Function f, shape_type shape, Factory result_factory, Factory shared_factory);
future<result> bulk_async_execute(Function f, shape_type shape, Factory result_factory, Factory shared_factory);
future<result> bulk_then_execute(Function f, Future& predecessor, shape_type shape, Factory result_factory, Factory shared_factory);

82 © 2016 Codeplay Software Ltd. EXECUTOR CATEGORIES Name collections of use cases Each executor operation identifies a unique use case •execute(f) : “fire-and-forget f” •async_execute(f) : “asynchronously execute f and return a future” Categorize executor types by the uses cases they natively support •OneWayExecutor : executors that natively fire-and-forget •TwoWayExecutor : executors that natively provide a channel to the result

83 © 2016 Codeplay Software Ltd. EXECUTOR CATEGORIES Name collections of use cases HostBased* •As if the execution agent is running on a std::thread •Passes the “database test” •.execute(f,alloc) : “fire-and-forget f, use alloc for allocation” Bulk* Create multiple execution agents with a single operation .bulk_execute(f,n,sf) : “fire-and-forget f n times in bulk”

84 © 2016 Codeplay Software Ltd. EXECUTOR CATEGORIES

85 © 2016 Codeplay Software Ltd. CUSTOMIZATION POINTS Enable uniform use Free functions in namespace execution:: •execute(f) — execution::execute(exec, f) •async_execute(f) — execution::async_execute(exec, f) Adapt exec when operation not natively provided

86 © 2016 Codeplay Software Ltd. CUSTOMIZATION POINTS Enable uniform use
• Need to use executors in a variety of use cases
• Not all executors natively implement every use case
• Customization points enable uniform use of executors across use cases (*when semantically possible)

struct has_async_execute {
  template <class F> future<...> async_execute(F&&) const;
  ...
};
struct hasnt_async_execute { ... };

has_async_execute has;
hasnt_async_execute hasnt;

// calls has.async_execute(f)
auto fut1 = execution::async_execute(has, f);
// adapts hasnt to return a future
auto fut2 = execution::async_execute(hasnt, g);

87 © 2016 Codeplay Software Ltd. CUSTOMIZATION POINTS Traits enable introspection

template <class Executor, class F, class = enable_if_t<...>>
auto async_execute(Executor exec, F&& f)
  -> decltype(exec.async_execute(std::forward<F>(f)))
{
  // use .async_execute() directly
  return exec.async_execute(std::forward<F>(f));
}

template <class OneWayExecutor, class F, class = enable_if_t<...>>
executor_future_t<...> async_execute(OneWayExecutor exec, F&& f)
{
  // adapt exec to return a future
  executor_future_t<...> result_future = ...
  // call another customization point
  execution::execute(exec, ...);
  return std::move(result_future);
}

88 © 2016 Codeplay Software Ltd. EXECUTORS & THE STANDARD LIBRARY Composition with control structures
• Most programmers use higher-level control structures
• Need to compose with user-defined executors

my_executor exec = get_my_executor(...);
using namespace std;
auto fut1 = async(exec, task);
sort(execution::par.on(exec), vec.begin(), vec.end());
auto fut2 = fut1.then(exec, continuation);

89 © 2016 Codeplay Software Ltd. EXECUTORS & THE STANDARD LIBRARY Possible implementation of std::for_each

template <class Policy, class Iterator>
using __enable_if_bulk_sync_executable = enable_if_t<
  is_same_v<typename Policy::execution_category, parallel_execution_tag> &&
  is_convertible_v<typename iterator_traits<Iterator>::iterator_category,
                   random_access_iterator_tag>
>;

90 © 2016 Codeplay Software Ltd. POSSIBLE EXTENSIONS Out of scope of minimal proposal •Error handling • Higher-level variadic •Requirements on user- abstractions defined Future types • Remote execution •Heterogeneity • Additional thread pool • functionality •Additional abstractions for • System resources bulk execution • Syntactic sugar for contexts + control structures

91 © 2016 Codeplay Software Ltd. Summary Executors

Executors decouple control structures from work creation Short-term goal: compose with existing control structures P0443 is the minimal proposal to achieve short-term goal Provides a foundation for extensions to build on

92 © 2016 Codeplay Software Ltd. Vector SIMD Parallelism for Parallelism TS2

• No standard! • Boost.SIMD • Proposal N3571 by Mathias Gaunard et. al., based on the Boost.SIMD library. • Proposal N4184 by Matthias Kretz, based on Vc library. • Unifying efforts and expertise to provide an API to use SIMD portably • Within C++ (P0203, P0214) • P0193 status report • P0203 design considerations • Please see Pablo Halpern, Nicolas Guillemot’s and Joel Falcou’s talks on Vector SPMD, and SIMD. 93 © 2016 Codeplay Software Ltd. SIMD from Matthias Kretz and Mathias Gaunard

• std::datapar
• datapar<T, N>: a SIMD register holding N elements of type T
• datapar<T>: the same, with the optimal N for the currently targeted architecture
• Abi: defaulted ABI marker to make types with incompatible ABIs different
• Behaves like a value of type T, but applies each operation on the N values it contains, possibly in parallel
• Constraints: T must be an integral or floating-point type (tuples/structs of those once we get reflection); the N parameter is under discussion and will probably need to be a power of 2

94 © 2016 Codeplay Software Ltd. Operations on datapar
• Built-in operators: all usual binary operators are available, for datapar<T> op datapar<T>, datapar<T> op U and U op datapar<T>; compound binary operators and unary operators as well; datapar<T> is convertible to datapar<U>; datapar<T>(U) broadcasts the value
• No promotion: datapar<T>(255) + datapar<T>(1) == datapar<T>(0) (e.g. for an 8-bit element type)
• Comparisons and conditionals: ==, !=, <, <=, > and >= perform element-wise comparison and return a mask; if (cond) x = y is written as where(cond, x) = y; cond ? x : y is written as if_else(cond, x, y)
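A sketch of what code written against this proposal could look like (illustrative only; datapar was later renamed std::experimental::simd, so these names track the paper and the slide rather than a header you can include today, and the snippet needs an implementation of the proposal to compile):

// Assumes a P0203/P0214-style datapar implementation is available.
template <class T>
datapar<T> clamped_add(datapar<T> a, datapar<T> b, T hi) {
  datapar<T> sum = a + b;                 // element-wise addition across all lanes
  auto too_big = sum > datapar<T>(hi);    // element-wise comparison yields a mask
  where(too_big, sum) = datapar<T>(hi);   // masked assignment, as described above
  return sum;
}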

95 © 2016 Codeplay Software Ltd. The goal

• Great support for CPU latency computations through the Concurrency TS
• Great support for CPU throughput through the Parallelism TS
• Great support for heterogeneous throughput computation in the future

96 © 2016 Codeplay Software Ltd. Many alternatives for Massive dispatch/heterogeneous

• Programming-language usage experience: OpenGL, DirectX, OpenMP/OpenACC, CUDA, OpenCL, SYCL, C++ AMP, HSA, Vulkan
• HPC experience: OpenMP, OpenACC, CUDA, OpenCL, HPX, SYCL

97 © 2016 Codeplay Software Ltd.

Lots of experience now with Heterogeneous language design in C++ •Executors: P0058R1 An Interface for Abstracting Execution (one of them, there are 2 other) •AMD’s HCC/HSAIL: P00069r0 HCC: A C++ Compiler For Heterogeneous Computing •SYCL: P0236R0 Khronos's OpenCL SYCL to support Heterogeneous Devices for C++ •HPX: P0234R0 Towards Massive Parallelism support in C++ with HPX

98 © 2016 Codeplay Software Ltd. Not that far away from a Grand Unified Theory

•GUT is achievable •What we have is only missing 20% of where we want to be •It is just not designed with an integrated view in mind ... Yet •Need more focus direction on each proposal for GUT, whatever that is, and add a few elements

99 © 2016 Codeplay Software Ltd. What we want for Massive dispatch/Heterogeneous computing by 2020 •Integrated approach for 2020 for C++ – Marries concurrency/parallelism TS/co-routines •Heterogeneous Devices and/or just Massive Parallelism •Works for both HPC, consumer, games, embedded, fpga •Make asynchrony the core concept •Supports integrated (APU), but also discrete memory models •Supports High bandwidth memory •Support distributed architecture

100 © 2016 Codeplay Software Ltd. Better candidates

•Goal: Use standard C++ to express all intra-node parallelism 1. Khronos’ OpenCL SYCL extends Parallelism TS for embedded processors aiming to conform to ISO 26262 2. Agency extends Parallelism TS 3. HCC 4. HPX extends parallelism and concurrency TS 5. C++ AMP 6. KoKKos 7. Raja

101 © 2016 Codeplay Software Ltd. Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

102 © 2016 Codeplay Software Ltd. How do we offload code to a heterogeneous device?

103 © 2016 Codeplay Software Ltd. Compilation Model

[Diagram: a C++ source file goes through the CPU compiler to a CPU object, then through the linker to an x86 executable that runs on the CPU.]

104 © 2016 Codeplay Software Ltd. Compilation Model

[Same diagram.]

105 © 2016 Codeplay Software Ltd. Compilation Model

[Same diagram, with a GPU added to the system: the standard compilation model has no way to target it.]

106 © 2016 Codeplay Software Ltd. How can we compile source code for sub-architectures?

 Separate source

 Single source

107 © 2016 Codeplay Software Ltd. Separate Source Compilation Model

[Diagram: the host C++ source file goes through the CPU compiler and linker to an x86 executable; the device source is passed as a string to an online compiler and executed on the GPU.]

Here we're using OpenCL as an example.

Host code:
float *a, *b, *c;
…
kernel k = clCreateKernel(…, "my_kernel", …);
clEnqueueWriteBuffer(…, size, a, …);
clEnqueueWriteBuffer(…, size, b, …);
clEnqueueNDRange(…, k, 1, {size, 1, 1}, …);
clEnqueueReadBuffer(…, size, c, …);

Device code (OpenCL C):
__kernel void my_kernel(__global float *a, __global float *b, __global float *c) {
  int id = get_global_id(0);
  c[id] = a[id] + b[id];
}

108 © 2016 Codeplay Software Ltd. Single Source Compilation Model

[Diagram: a single source file contains both host and device code; the CPU compiler and linker produce the x86 executable, and the device code is dispatched to the GPU.]

Here we are using C++ AMP as an example:

array_view<float, 2> a, b, c;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});

109 © 2016 Codeplay Software Ltd. Single Source Compilation Model

[Diagram, built up over three slides: in the single-source model the compiler splits the file. Host code goes through the CPU compiler to a CPU object; device code goes through a device compiler to a device IR / object; the linker embeds the device IR / object in the x86 executable, which dispatches it to the GPU at runtime. The C++ AMP vector-add above is the running example.]

112 © 2016 Codeplay Software Ltd. Benefits of Single Source

• Device code is written in the same source file as the host CPU code

• Allows compile-time evaluation of device code

• Supports type safety across host CPU and device

• Supports generic programming

• Removes the need to distribute source code

113 © 2016 Codeplay Software Ltd. Describing Parallelism

114 © 2016 Codeplay Software Ltd. How do you represent the different forms of parallelism?

 Directive vs

 Task vs

 Queue vs stream execution

115 © 2016 Codeplay Software Ltd. Directive vs Explicit Parallelism

Directive-based parallelism. Examples: OpenMP, OpenACC. Implementation: the compiler transforms code to be parallel based on pragmas.
Explicit parallelism. Examples: SYCL, CUDA, TBB, Fibers, C++11 threads. Implementation: an API is used to explicitly enqueue one or more threads.

Here we're using OpenMP as an example:

vector<float> a, b, c;
#pragma omp parallel for
for (int i = 0; i < a.size(); i++) {
  c[i] = a[i] + b[i];
}

Here we're using C++ AMP as an example:

array_view<float, 2> a, b, c;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});

116 © 2016 Codeplay Software Ltd. Task vs Data Parallelism

Task parallelism. Examples: OpenMP, C++11 threads, TBB. Implementation: multiple (potentially different) tasks are performed in parallel.
Data parallelism. Examples: C++ AMP, CUDA, SYCL, C++17 Parallel STL. Implementation: the same task is performed across a large data set.

Here we're using TBB as an example:

vector<task> tasks = { … };
tbb::parallel_for_each(tasks.begin(), tasks.end(), [=](task &v) {
  v();
});

Here we're using CUDA as an example:

float *a, *b, *c;
cudaMalloc((void **)&a, size);
cudaMalloc((void **)&b, size);
cudaMalloc((void **)&c, size);
vec_add<<<64, 64>>>(a, b, c);

117 © 2016 Codeplay Software Ltd. Queue vs Stream Execution

Queue execution. Examples: C++ AMP, CUDA, SYCL, C++17 Parallel STL. Implementation: functions are placed in a queue and executed once per enqueue.
Stream execution. Examples: BOINC, BrookGPU. Implementation: a function is executed in a continuous loop over a stream of data.

Here we're using CUDA as an example:

float *a, *b, *c;
cudaMalloc((void **)&a, size);
cudaMalloc((void **)&b, size);
cudaMalloc((void **)&c, size);
vec_add<<<64, 64>>>(a, b, c);

Here we're using BrookGPU as an example:

reduce void sum(float a<>, reduce float r<>) {
  r += a;
}
float a<100>;
float r;
sum(a, r);

118 © 2016 Codeplay Software Ltd. Data Locality & Movement

119 © 2016 Codeplay Software Ltd. One of the biggest limiting factors in heterogeneous computing

 Cost of data movement in time and power consumption

120 © 2016 Codeplay Software Ltd. Cost of Data Movement

• It can take considerable time to move data to a device • This varies greatly depending on the architecture • The bandwidth of a device can impose bottlenecks • This reduces the amount of throughput you have on the device • Performance gain from computation > cost of moving data • If the gain is less than the cost of moving the data it’s not worth doing • Many devices have a hierarchy of memory regions • Global, read-only, group, private • Each region has different size, affinity and access latency • Having the data as close to the computation as possible reduces the cost

121 © 2016 Codeplay Software Ltd. Cost of Data Movement

• 64-bit DP op: 20 pJ
• 4x64-bit register read: 50 pJ
• 4x64-bit move 1 mm: 26 pJ
• 4x64-bit move 40 mm: 1 nJ
• 4x64-bit move to/from DRAM: 16 nJ

Credit: Bill Dally, Nvidia, 2010

122 © 2016 Codeplay Software Ltd. How do you move data from the host CPU to a device?

 Implicit vs explicit data movement

123 © 2016 Codeplay Software Ltd. Implicit vs Explicit Data Movement

Implicit data movement. Examples: SYCL, C++ AMP. Implementation: data is moved to the device implicitly via cross host-CPU/device data structures.
Explicit data movement. Examples: OpenCL, CUDA, OpenMP. Implementation: data is moved to the device via explicit copies.

Here we're using C++ AMP as an example:

array_view<float, 2> ptr;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  ptr[idx] *= 2.0f;
});

Here we're using CUDA as an example:

float *h_a = { … }, *d_a;
cudaMalloc((void **)&d_a, size);
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
vec_add<<<64, 64>>>(d_a, …);
cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);

124 © 2016 Codeplay Software Ltd. How do you address memory between host CPU and device?

 Multiple address space

 Non-coherent single address space

 Cache coherent single address space

125 © 2016 Codeplay Software Ltd. Comparison of Memory Models

• Multiple address space • SYCL 1.2, C++AMP, OpenCL 1.x, CUDA • Pointers have keywords or structures for representing different address spaces • Allows finer control over where data is stored, but needs to be defined explicitly • Non-coherent single address space • SYCL 2.2, HSA, OpenCL 2.x , CUDA 4, OpenMP • Pointers address a shared address space that is mapped between devices • Allows the host CPU and device to access the same address, but requires mapping • Cache coherent single address space • SYCL 2.2, HSA, OpenCL 2.x, CUDA 6, C++, • Pointers address a shared address space (hardware or cache coherent runtime) • Allows concurrent access on host CPU and device, but can be inefficient for large data
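To make the contrast concrete, a rough host-side sketch using the CUDA runtime API cited above (launch_scale() is an assumed helper that launches a kernel; error checking omitted): explicit copies between separate address spaces versus a cache-coherent shared allocation.

#include <cuda_runtime.h>

void launch_scale(float* p, int n);   // assumed: launches a kernel scaling n floats in place

void multiple_address_spaces(float* host, int n) {
  float* dev = nullptr;                                               // separate device address space
  cudaMalloc(reinterpret_cast<void**>(&dev), n * sizeof(float));
  cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);   // explicit copy in
  launch_scale(dev, n);
  cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);   // explicit copy out
  cudaFree(dev);
}

void single_address_space(int n) {
  float* shared = nullptr;
  cudaMallocManaged(reinterpret_cast<void**>(&shared), n * sizeof(float)); // one pointer, both sides
  for (int i = 0; i < n; ++i) shared[i] = float(i);                        // host writes directly
  launch_scale(shared, n);                                                 // device uses same pointer
  cudaDeviceSynchronize();
  cudaFree(shared);
}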

126 © 2016 Codeplay Software Ltd. SYCL: A New Approach to Heterogeneous Programming in C++

127 © 2016 Codeplay Software Ltd. SYCL for OpenCL

 Cross-platform, single-source, high-level, C++ programming layer  Built on top of OpenCL and based on standard C++14

128 © 2016 Codeplay Software Ltd. The SYCL Ecosystem

C++ Application

C++ Template Library C++ Template Library C++ Template Library

SYCL for OpenCL

OpenCL

CPU GPU APU Accelerator FPGA DSP

129 © 2016 Codeplay Software Ltd. How does SYCL improve heterogeneous offload and performance portability?

 SYCL is entirely standard C++

 SYCL compiles to SPIR

 SYCL supports a multi compilation single source model

130 © 2016 Codeplay Software Ltd. Single Compilation Model

[Diagram: the C++ source file goes through a combined CPU/device compiler; the linker produces an x86 executable with an embedded device object for the GPU.]

131 © 2016 Codeplay Software Ltd. Single Compilation Model

[Same diagram, with a single-source host & device compiler: you are tied to a single compiler chain.]

132 © 2016 Codeplay Software Ltd. Single Compilation Model

[Diagram: with single compilation, 3 different language extensions mean 3 different compilers and 3 different executables. The C++ AMP source goes through the AMD C++ AMP compiler (x86 ISA with embedded AMD ISA) for CPU + AMD GPU; the CUDA source goes through the CUDA compiler (x86 ISA with embedded NVIDIA ISA) for CPU + NVIDIA GPU; the OpenMP source goes through the OpenMP compiler (x86 ISA with embedded SIMD x86) for the CPU.]

133 © 2016 Codeplay Software Ltd. Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

134 © 2016 Codeplay Software Ltd. SYCL is Entirely Standard C++

CUDA:
__global__ void vec_add(float *a, float *b, float *c) {
  c[i] = a[i] + b[i];
}
float *a, *b, *c;
vec_add<<<64, 64>>>(a, b, c);

OpenMP:
vector<float> a, b, c;
#pragma omp parallel for
for (int i = 0; i < a.size(); i++) {
  c[i] = a[i] + b[i];
}

C++ AMP:
array_view<float, 2> a, b, c;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});

SYCL (entirely standard C++):
cgh.parallel_for<class vec_add>(range, [=](cl::sycl::id<2> idx) {
  c[idx] = a[idx] + b[idx];
});

135 © 2016 Codeplay Software Ltd. SYCL Targets a Wide Range of Devices with SPIR

CPU GPU APU Accelerator FPGA DSP

136 © 2016 Codeplay Software Ltd. Multi Compilation Model

[Diagram, built up over several slides: in the SYCL multi-compilation model the same C++ source file is consumed twice. A standard host compiler (GCC, Clang, VisualC++, Intel C++) produces the CPU object, while the SYCL device compiler extracts the device code and generates SPIR; the linker embeds the SPIR in the x86 executable. At runtime an OpenCL online finalizer lowers the SPIR for whichever device is selected: SIMD CPU, GPU, APU, FPGA or DSP. The standard IR allows for better performance, and SYCL does not mandate portability; the SPIR device can be selected at runtime, and a PTX binary can also be produced and selected for NVIDIA GPUs at runtime.]

142 © 2016 Codeplay Software Ltd. How does SYCL support different ways of representing parallelism?

 SYCL is an explicit parallelism model

 SYCL is a queue execution model

 SYCL supports both task and data parallelism

143 © 2016 Codeplay Software Ltd. Representing Parallelism

cgh.single_task([=](){

/* task parallel task executed once*/

});

cgh.parallel_for(range<2>(64, 64), [=](id<2> idx){

/* data parallel task executed across a range */

});

144 © 2016 Codeplay Software Ltd. How does SYCL make data movement more efficient?

 SYCL separates the storage and access of data

 SYCL can specify where data should be stored/allocated

 SYCL creates automatic data dependency graphs

145 © 2016 Codeplay Software Ltd. Separating Storage & Access

Buffers manage data across the host CPU and one or more devices; accessors are used to describe access to that data.

[Diagram: one buffer, with an accessor used by the CPU and another accessor used by the GPU.]

Buffers and accessors give type-safe access across host and device.

146 © 2016 Codeplay Software Ltd. Storing/Allocating Memory in Different Regions

[Diagram: a buffer feeding a kernel through three kinds of accessor.]
• Global accessor: memory stored in the global memory region
• Constant accessor: memory stored in the read-only memory region
• Local accessor: memory allocated in the group memory region
A sketch of requesting these regions follows below.
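A sketch of requesting those three regions from a command group (illustrative; it follows the SYCL 1.2-style spellings used elsewhere in this deck, and regions_kernel is just a made-up kernel name):

#include <CL/sycl.hpp>
namespace sycl = cl::sycl;

void regions_demo(sycl::queue& q, sycl::buffer<float, 1>& in, sycl::buffer<float, 1>& out) {
  q.submit([&](sycl::handler& cgh) {
    // Global region: the default target for buffer accessors.
    auto global = out.get_access<sycl::access::mode::write>(cgh);
    // Read-only (constant) region.
    auto constant = in.get_access<sycl::access::mode::read,
                                  sycl::access::target::constant_buffer>(cgh);
    // Work-group local region: allocated per group, no buffer behind it.
    sycl::accessor<float, 1, sycl::access::mode::read_write,
                   sycl::access::target::local> local(sycl::range<1>(64), cgh);

    cgh.parallel_for<class regions_kernel>(
        sycl::nd_range<1>(sycl::range<1>(256), sycl::range<1>(64)),
        [=](sycl::nd_item<1> item) {
          local[item.get_local_id(0)] = constant[item.get_global_id(0)];
          item.barrier(sycl::access::fence_space::local_space);
          global[item.get_global_id(0)] = local[item.get_local_id(0)];
        });
  });
}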

147 © 2016 Codeplay Software Ltd. Data Dependency Task Graphs

[Diagram: a data dependency task graph. Each kernel (Kernel A, Kernel B, Kernel C) declares read accessors on the buffers it consumes and write accessors on the buffers it produces (Buffers A to D); the edges between the kernels are derived from those accessor declarations.]

148 © 2016 Codeplay Software Ltd. Benefits of Data Dependency Graphs

• Allows you to describe your problems in terms of relationships • No need to enqueue explicit copies • Removes the need for complex event handling • Dependencies between kernels are automatically constructed • Allows the runtime to make data movement optimizations • Pre-emptively copy data to a device before the kernels that need it • Avoid unnecessarily copying data back to the host after execution on a device • Avoid copies of data that you don't need
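A minimal sketch of two command groups whose ordering is inferred from their accessors (kernel names and buffers are illustrative, not from the slides):

#include <CL/sycl.hpp>

void pipeline(cl::sycl::queue &q,
              cl::sycl::buffer<float, 1> &a,
              cl::sycl::buffer<float, 1> &b,
              cl::sycl::buffer<float, 1> &c) {
  // Kernel A: reads a, writes b.
  q.submit([&](cl::sycl::handler &cgh) {
    auto in  = a.get_access<cl::sycl::access::mode::read>(cgh);
    auto out = b.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<class kernel_a>(cl::sycl::range<1>(a.get_count()),
                                     [=](cl::sycl::id<1> i) { out[i] = in[i] * 2.0f; });
  });

  // Kernel B: reads b, writes c. The runtime sees the dependency on kernel A
  // through the accessors; no explicit events or copies are enqueued.
  q.submit([&](cl::sycl::handler &cgh) {
    auto in  = b.get_access<cl::sycl::access::mode::read>(cgh);
    auto out = c.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<class kernel_b>(cl::sycl::range<1>(b.get_count()),
                                     [=](cl::sycl::id<1> i) { out[i] = in[i] + 1.0f; });
  });
}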

149 © 2016 Codeplay Software Ltd. So what does SYCL look like?

• Here is a simple example SYCL application: a vector add

150 © 2016 Codeplay Software Ltd. Example: Vector Add

151 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {

}

152 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());

The buffers synchronise upon destruction

}

153 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;

}

154 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  // Create a command group to define an asynchronous task
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {

  });
}

155 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

  });
}

156 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

// You must provide a name for the lambda
template <typename T> class kernel;

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);
    // Create a parallel_for to define the device code
    cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
                                [=](cl::sycl::id<1> idx) {

    });
  });
}

157 © 2016 Codeplay Software Ltd. Example: Vector Add

#include <CL/sycl.hpp>

template <typename T> class kernel;

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out) {
  cl::sycl::buffer<T, 1> inputABuf(inA.data(), out.size());
  cl::sycl::buffer<T, 1> inputBBuf(inB.data(), out.size());
  cl::sycl::buffer<T, 1> outputBuf(out.data(), out.size());
  cl::sycl::queue defaultQueue;
  defaultQueue.submit([&] (cl::sycl::handler &cgh) {
    auto inputAPtr = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto inputBPtr = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
    auto outputPtr = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<kernel<T>>(cl::sycl::range<1>(out.size()),
                                [=](cl::sycl::id<1> idx) {
      outputPtr[idx] = inputAPtr[idx] + inputBPtr[idx];
    });
  });
}

158 © 2016 Codeplay Software Ltd. Example: Vector Add

template <typename T>
void parallel_add(std::vector<T> inA, std::vector<T> inB, std::vector<T> out);

int main() {

  std::vector<float> inputA = { /* input a */ };
  std::vector<float> inputB = { /* input b */ };
  std::vector<float> output = { /* output */ };

  parallel_add(inputA, inputB, output);
}

159 © 2016 Codeplay Software Ltd. Single-source vs C++ kernel language

• Single-source: a single source file contains both host and device code • Type-checking between host and device • A single template instantiation can create all the code to kick off work, manage data and execute the kernel • e.g. sort(myData); • The approach taken by C++ 17 Parallel STL as well as SYCL

• C++ kernel language • Matches standard OpenCL C++ • Proposed for OpenCL v2.1 • Being considered as an addition for SYCL v2.1

160 © 2016 Codeplay Software Ltd. Why ‘name’ kernels?

• Enables implementers to have multiple, different compilers for host and different devices • With SYCL, software developers can choose to use the best compiler for CPU and the best compiler for each individual device they want to support • The resulting application will be highly optimized for CPU and OpenCL devices • Easy-to-integrate into existing build systems

• Only required for C++11 lambdas, not required for C++ functors • Required because lambdas don’t have a name to enable linking between different compilers
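A short sketch (not from the slides; names are illustrative) contrasting a functor kernel, which needs no explicit name, with a lambda kernel, which does:

#include <CL/sycl.hpp>

// The functor type itself gives the kernel a linkable name.
struct double_it {
  cl::sycl::accessor<int, 1, cl::sycl::access::mode::read_write,
                     cl::sycl::access::target::global_buffer> data;
  void operator()(cl::sycl::id<1> idx) const { data[idx] *= 2; }
};

void run(cl::sycl::queue &q, cl::sycl::buffer<int, 1> &buf) {
  q.submit([&](cl::sycl::handler &cgh) {
    auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
    // Functor: no explicit kernel name needed.
    cgh.parallel_for(cl::sycl::range<1>(buf.get_count()), double_it{acc});
  });

  q.submit([&](cl::sycl::handler &cgh) {
    auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
    // Lambda: an explicit name is required for linking between compilers.
    cgh.parallel_for<class double_it_lambda>(
        cl::sycl::range<1>(buf.get_count()),
        [=](cl::sycl::id<1> idx) { acc[idx] *= 2; });
  });
}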

161 © 2016 Codeplay Software Ltd. Buffers/images/accessors vs shared pointers

• OpenCL v1.2 supports a wide range of different devices and operating systems • All shared data must be encapsulated in OpenCL memory objects: buffers and images • To enable SYCL to achieve the maximum performance of OpenCL, we follow OpenCL's memory model approach • But we apply OpenCL's memory model to C++ with buffers, images and accessors • Separation of data storage and data access

162 © 2016 Codeplay Software Ltd. What can I do with SYCL?

Anything you can do with C++!

With the performance and portability of OpenCL

163 © 2016 Codeplay Software Ltd. Progress report on the SYCL vision

• Open, royalty-free standard: released
• Conformance testsuite: going into adopters package
• Open-source implementation (triSYCL): in progress
• Commercial, conformant implementation: in progress
• C++ 17 Parallel STL: open source, in progress
• Template libraries for important C++ algorithms: getting going
• Integration into existing parallel C++ libraries: getting going

164 © 2016 Codeplay Software Ltd. Building the SYCL for OpenCL ecosystem • To deliver on the full potential of high-performance heterogeneous systems • We need the libraries • We need integrated tools • We need implementations • We need training and examples

• An open standard makes it much easier for people to work together • SYCL is a group effort • We have designed SYCL for maximum ease of integration

165 © 2016 Codeplay Software Ltd. Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

166 © 2016 Codeplay Software Ltd.

Using SYCL to Develop Vision Tools

A high-level CV framework for OpenCL promoting:

Ease of use • Easy to write code • Unified front end to client code (API) • Easily able to add customisable operations • Compile-time graph validation • Predictable memory usage

Performance portability • Separation of concerns • Portable between different programming models and architectures • No modification to application computation

Cross-platform portability • OpenCL-enabled devices • CPU

167 © 2016 Codeplay Software Ltd. What is Parallelism TS v1 adding? • A set of execution policies and a collection of parallel algorithms • The Execution Policies • Paragraphs explaining the conditions for parallel algorithms • New parallel algorithms • But only on CPUs • Can we execute it on GPUs now?

168 © 2016 Codeplay Software Ltd. Sorting with the STL

A sequential sort:

std::vector<int> data = { 8, 9, 1, 4 };
std::sort(std::begin(data), std::end(data));
if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}

A parallel sort:

std::vector<int> data = { 8, 9, 1, 4 };
std::sort(std::par, std::begin(data), std::end(data));
if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}

• par is an object of an Execution Policy • The sort will be executed in parallel using an implementation-defined method

169 © 2016 Codeplay Software Ltd. The SYCL execution policy

template <class KernelName = DefaultKernelName>
class sycl_execution_policy {
public:
  using kernelName = KernelName;
  sycl_execution_policy() = default;
  sycl_execution_policy(cl::sycl::queue q);
  cl::sycl::queue get_queue() const;
};

std::vector<int> data = { 8, 9, 1, 4 };
std::sort(sycl_policy, std::begin(data), std::end(data));
if (std::is_sorted(std::begin(data), std::end(data))) {
  std::cout << "Data is sorted!" << std::endl;
}

• sycl_policy is an Execution Policy • data is a standard std::vector • Technically, it will use the device returned by default_selector
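A minimal usage sketch based on the open-source SyclParallelSTL project linked on a later slide; the header paths, the sycl namespace and the kernel name are assumptions taken from that project and may differ between versions:

#include <vector>
#include <algorithm>
#include <iostream>
// Headers from the KhronosGroup/SyclParallelSTL project (assumed layout):
#include <sycl/execution_policy>
#include <experimental/algorithm>

int main() {
  std::vector<int> data = { 8, 9, 1, 4 };
  cl::sycl::queue q;  // device picked by default_selector
  sycl::sycl_execution_policy<class sort_kernel> sycl_policy(q);
  std::experimental::parallel::sort(sycl_policy, std::begin(data), std::end(data));
  if (std::is_sorted(std::begin(data), std::end(data))) {
    std::cout << "Data is sorted!" << std::endl;
  }
}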

170 © 2016 Codeplay Software Ltd. Heterogeneous load balancing

171 © 2016 Codeplay Software Ltd. Future work

https://github.com/KhronosGroup/SyclParallelSTL

172 © 2016 Codeplay Software Ltd. Other projects – in progress

TensorFlow: Google’s machine learning library + others …

Eigen: C++ linear algebra template library

173 © 2016 Codeplay Software Ltd. Conclusion

• Heterogeneous programming has been supported through OpenCL for years • C++ is a prominent language for doing this, but is currently CPU-only • Graph programming languages enable neural networks and machine vision • SYCL lets you program heterogeneous devices with standard C++ today • ComputeCpp is available now for you to download and experiment with • For engineers/companies/consortiums producing embedded devices for automotive ADAS, machine vision, or neural networks • Who want to deliver artificially intelligent devices that are also low power, e.g. for self-driving cars and smart homes • But who are dissatisfied with current single-vendor, locked-in heterogeneous solutions, or designs/code with no reuse • We provide performance-portable, open-standard software across multiple platforms with long-term support • Unlike vertically locked-in solutions such as CUDA, C++AMP, or HCC • We have assembled a whole ecosystem of software, accelerated by your parallel hardware, enabling reuse with open standards

© 2016 Codeplay Software Ltd. 174 Agenda

• How do we get to programming self-driving cars? • The C++ Standard • C++ 17: is it great or just OK • C++Future • A comparison of Heterogeneous Programming Models • SYCL Design Philosophy: C++ end to end model for HPC and consumers • The ecosystem: • VisionCpp • Parallel STL • TensorFlow, Machine Vision, Neural Networks, Self-Driving Cars • Codeplay ComputeCPP Community Edition: Free Download

175 © 2016 Codeplay Software Ltd. For Codeplay, these are our layer choices

We have chosen a layer of standards, based on current market adoption:
• TensorFlow and OpenCV
• SYCL
• OpenCL (with SPIR)
• LLVM as the standard compiler back-end

The actual choice of standards may change based on market dynamics, but by choosing widely adopted standards and a layering approach, it is easy to adapt.

• Graph programming: TensorFlow, OpenCV
• C/C++-level programming: SYCL
• Higher-level language enabler: OpenCL SPIR
• Device-specific programming: LLVM

176 © 2016 Codeplay Software Ltd. For Codeplay, these are our products

• Graph programming: TensorFlow, OpenCV
• C/C++-level programming: SYCL
• Higher-level language enabler: OpenCL SPIR
• Device-specific programming: LLVM

177 © 2016 Codeplay Software Ltd. Codeplay

Company • Based in Edinburgh, Scotland • 57 staff, mostly engineering • License and customize technologies for semiconductor companies • ComputeAorta and ComputeCpp: implementations of OpenCL, Vulkan and SYCL • 15+ years of experience in heterogeneous systems tools

Standards bodies • HSA Foundation: Chair of software group, spec editor of runtime and debugging • Khronos: chair & spec editor of SYCL; contributors to OpenCL, Safety Critical, Vulkan • ISO C++: Chair of Low Latency, Embedded WG; Editor of SG1 Concurrency TS • EEMBC: members

Research • Members of EU research consortiums: PEPPHER, LPGPU, LPGPU2, CARP • Sponsorship of PhDs and EngDs for heterogeneous programming: HSA, FPGAs, ray-tracing • Collaborations with academics • Members of HiPEAC

Open source • HSA LLDB Debugger • SPIR-V tools • RenderScript debugger in AOSP • LLDB for Qualcomm Hexagon • TensorFlow for OpenCL • C++ 17 Parallel STL for SYCL • VisionCpp: C++ performance-portable programming model for vision

Presentations • Building an LLVM back-end • Creating an SPMD Vectorizer for OpenCL with LLVM • Challenges of Mixed-Width Vector Code Gen & Scheduling in LLVM • C++ on Accelerators: Supporting Single-Source SYCL and HSA • LLDB Tutorial: Adding debugger support for your target

Codeplay build the software platforms that deliver massive performance

178 © 2016 Codeplay Software Ltd. What our ComputeCpp users say about us

Benoit Steiner – Google TensorFlow engineer: “We at Google have been working closely with Luke and his Codeplay colleagues on this project for almost 12 months now. Codeplay's contribution to this effort has been tremendous, so we felt that we should let them take the lead when it comes down to communicating updates related to OpenCL. … we are planning to merge the work that has been done so far… we want to put together a comprehensive test infrastructure”

ONERA: “We work with royalty-free SYCL because it is hardware vendor agnostic, single-source C++ programming model without platform specific keywords. This will allow us to easily work with any heterogeneous solutions using OpenCL to develop our complex algorithms and ensure future compatibility”

Hartmut Kaiser – HPX: “My team and I are working with Codeplay's ComputeCpp for almost a year now and they have resolved every issue in a timely manner, while demonstrating that this technology can work with the most complex C++ template code. I am happy to say that the combination of Codeplay's SYCL implementation with our HPX runtime system has turned out to be a very capable basis for Building a Heterogeneous Computing Model for the C++ Standard using high-level abstractions.”

WIGNER Research Centre for Physics: “It was a great pleasure this week for us that Codeplay released the ComputeCpp project for the wider audience. We've been waiting for this moment and keeping our colleagues and students in constant rally and excitement. We'd like to build on this opportunity to increase the awareness of this technology by providing sample codes and talks to potential users. We're going to give a lecture series on modern scientific programming providing field-specific examples.”

179 © 2016 Codeplay Software Ltd. Further information

• OpenCL https://www.khronos.org/opencl/ • OpenVX https://www.khronos.org/openvx/ • HSA http://www.hsafoundation.com/ • SYCL http://sycl.tech • OpenCV http://opencv.org/ • Halide http://halide-lang.org/ • VisionCpp https://github.com/codeplaysoftware/visioncpp

180 © 2016 Codeplay Software Ltd.

Community Edition Available now for free!

Visit: computecpp.codeplay.com

181 © 2016 Codeplay Software Ltd. • Open source SYCL projects: • ComputeCpp SDK - Collection of sample code and integration tools • SYCL ParallelSTL – SYCL based implementation of the parallel algorithms • VisionCpp – Compile-time embedded DSL for image processing • Eigen C++ Template Library – Compile-time library for machine learning All of this and more at: http://sycl.tech

182 © 2016 Codeplay Software Ltd. Questions ?

@codeplaysoft /codeplaysoft codeplay.com