SYCL in the OpenVX Ecosystem


Andrew Richards, Codeplay. Embedded Vision Summit, May 2017.

The Khronos Group has over 100 members worldwide, and any company is welcome to join.

Who am I?
• Chair of the SYCL group
• Chair of the HSA Software Group
• CEO of Codeplay
  - We built a C/C++ compiler for GPUs in 2002
  - 60 staff in Edinburgh, Scotland
  - We build programming tools for heterogeneous processors: OpenCL, SYCL and others

How does SYCL fit into OpenVX, vision and AI?
1. You need a graph system for AI/vision: OpenVX
2. You need hand-coded kernels for common tasks: OpenVX
3. You need to be able to write custom operations: SYCL

What is SYCL for?
• Modern C++ lets us separate the what from the how:
  - We want to separate what the user wants to do: science, computer vision, AI …
  - And enable the how to be: run fast on an OpenCL device
• Modern C++ supports and encourages this separation

What we want to achieve
• We want to enable a C++ ecosystem for OpenCL:
  - Must run on OpenCL devices: GPUs, CPUs, FPGAs, DSPs, etc.
  - C++ template libraries
  - Tools: compilers, debuggers, IDEs, optimizers
  - Training and example programs
  - Long-term support for current and future OpenCL features

Why a new standard?
• There are already well-established ways to map C++ to parallel processors, so we follow those approaches
• There are OpenCL specifics that we need to map to C++
  - We have worked hard to be an enabler for other C++ parallel standards (see http://imgs.xkcd.com/comics/standards.png)
• We add no more than we need to

Where does SYCL fit in? The OpenCL / SYCL stack:
User application code, C++ template libraries, SYCL for OpenCL, OpenCL, and underneath the OpenCL devices: CPU, FPGA, GPU, DSP.

Philosophy
• With SYCL, we wanted to align with the direction the C++ standard is going, and to future-proof for future OpenCL device capabilities
• Key decisions:
  - We will not add any language extensions to C++
  - We will work with existing C++ compilers
  - We will provide the full OpenCL feature set in C++
  - Everything must compile and run on the host as well as on an OpenCL device
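As a concrete illustration of the "separate the what from the how" idea and of the decision to add no language extensions, here is a small, SYCL-free sketch; it is not from the talk, and the names in it are purely illustrative. The computation is an ordinary C++ lambda, and the executor it is handed to decides how it runs; a SYCL runtime plays the executor role by forwarding the same kind of callable to an OpenCL device.

```cpp
// Hypothetical sketch (not from the talk): the "what" is plain C++, the
// "how" is whatever executor receives it. All names are illustrative.
#include <cstddef>
#include <iostream>
#include <vector>

// The "what": a per-element computation written as ordinary C++.
auto saxpy = [](float a, float x, float y) { return a * x + y; };

// One possible "how": a serial host executor. A SYCL implementation would
// instead hand the same callable to a device kernel via a command group.
template <typename F>
void for_each_index(std::size_t n, F f) {
  for (std::size_t i = 0; i < n; ++i) f(i);
}

int main() {
  std::vector<float> x(1024, 1.0f), y(1024, 2.0f);
  for_each_index(x.size(), [&](std::size_t i) { y[i] = saxpy(2.0f, x[i], y[i]); });
  std::cout << y[0] << "\n";  // prints 4
  return 0;
}
```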
Where does SYCL fit in? Language styles

C++ embedded DSLs (e.g. Sh/RapidMind, Halide, Boost.Compute): a C++ template library uses operator overloading to build up an expression tree that is compiled at runtime.

    Vector<float> a, b;
    auto expr = a + b;
    Vector<float> r = expr.eval();

  Pros: works with existing C++ compilers.
  Cons: compile-time compilation, control flow, composability.

C++ kernel languages (e.g. GLSL, the OpenCL C and C++ kernel languages): host (CPU) code loads and compiles the kernel for a specific device, sets its arguments and runs it.

    // host code
    Kernel myKernel;
    myKernel.load("myKernel");
    myKernel.compile();
    myKernel.setArg(0, a);
    float r = myKernel.run();

    // kernel code
    void myKernel(float *arg) {
      return arg * 456.7f;
    }

  Pros: explicit offload, independent host/device code and compilers, run-time adaptation, popular in graphics.
  Cons: hard to compose cross-device.

C++ single-source (e.g. SYCL, CUDA, OpenMP, C++ AMP): a single source file contains the code for both host and device.

    Vector<float> a, b, r;
    parallel_for(a.range(), [&](int id) {
      r[id] = a[id] + b[id];
    });

  Pros: composability, easy to use, offline compilation and validation.
  Cons: host/device compiler conflict.

Comparison of SYCL and OpenVX
• SYCL is a general programming model; OpenVX is a vision graph system
• SYCL makes you write your own graph system; OpenVX distributes a graph across an entire system
• SYCL makes you write your own nodes; OpenVX uses built-in nodes
In AI applications, we see:
• People needing pre-optimized graph nodes
• People needing to optimize whole graphs
• Developers and researchers needing to write their own nodes

Comparison of SYCL and CUDA
The slide shows the same vector-addition example written in CUDA and in SYCL.

CUDA version:

    #include <iostream>
    #include <math.h>

    // Kernel function to add the elements of two arrays
    __global__ void add(int n, float *x, float *y)
    {
      int index = threadIdx.x;
      int stride = blockDim.x;
      for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
    }
    ...
    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));
    ...
    // Run kernel on 1M elements on the GPU
    add<<<1, 256>>>(N, x, y);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

SYCL version:

    #include <CL/sycl.hpp>
    #include <iostream>
    #include <math.h>

    // function to add the elements of two arrays
    void add(cl::sycl::nd_item<1> item, int n,
             cl::sycl::global_ptr<float> x, cl::sycl::global_ptr<float> y)
    {
      int index = item.get_local(0);
      int stride = item.get_local_range(0);
      for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
    }
    ...
    // encapsulate data in SYCL buffers
    cl::sycl::buffer<float> x(N);
    cl::sycl::buffer<float> y(N);
    ...
    { // create a scope to define the lifetime of the SYCL objects
      // create a SYCL queue for a GPU
      cl::sycl::gpu_selector selectgpu;
      cl::sycl::device gpu_device(selectgpu);
      cl::sycl::queue gpu_queue(gpu_device);
      // submit this work to the SYCL queue
      gpu_queue.submit([&](cl::sycl::handler &cgh) {
        // request access to the data on the OpenCL GPU
        auto aX = x.get_access<cl::sycl::access::mode::read>(cgh);
        auto aY = y.get_access<cl::sycl::access::mode::read_write>(cgh);
        // Run kernel on 1M elements on the OpenCL GPU
        cgh.parallel_for<class add_functor>(
            cl::sycl::nd_range<1>(cl::sycl::range<1>(256), cl::sycl::range<1>(256)),
            [=](cl::sycl::nd_item<1> it) {
              add(it, N, aX, aY);
            });
      });
    }
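The SYCL excerpt above is cut off by the slide layout. Below is a complete, self-contained sketch of the same buffer/accessor pattern, assuming a SYCL 1.2-style implementation (cl::sycl namespace); it is illustrative rather than the exact sample from the talk, and the kernel name `vector_add` and the use of the default device selector are assumptions.

```cpp
// Minimal sketch of the SYCL pattern from the slide above, assuming a
// SYCL 1.2-style implementation. Kernel name and selector are illustrative.
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  const size_t N = 1 << 20;  // "1M elements", as in the slide
  std::vector<float> hx(N, 1.0f), hy(N, 2.0f);
  {
    // Buffers encapsulate the data; results are copied back to the host
    // vectors when the buffers go out of scope.
    cl::sycl::buffer<float, 1> x(hx.data(), cl::sycl::range<1>(N));
    cl::sycl::buffer<float, 1> y(hy.data(), cl::sycl::range<1>(N));

    cl::sycl::queue q{cl::sycl::default_selector{}};
    q.submit([&](cl::sycl::handler &cgh) {
      auto aX = x.get_access<cl::sycl::access::mode::read>(cgh);
      auto aY = y.get_access<cl::sycl::access::mode::read_write>(cgh);
      // 256 work-items, each striding over the N elements
      cgh.parallel_for<class vector_add>(
          cl::sycl::range<1>(256), [=](cl::sycl::item<1> it) {
            for (size_t i = it.get_id(0); i < N; i += it.get_range()[0])
              aY[i] = aX[i] + aY[i];
          });
    });
  }  // kernel completes and data is written back to hx/hy here
  std::cout << "hy[0] = " << hy[0] << std::endl;  // expect 3
  return 0;
}
```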
Why use C++ single-source programming?
• Widely used, especially in AI
• Runs on lots of platforms
• Kernel fusion
• Abstractions to enable performance portability
• Composability
• Integrates with OpenCL, OpenVX, etc.

Kernel fusion
Most parallel processors are bandwidth-bound. Consider a = b * c + d * f, where a, b, c, d and f are vectors: if we execute the operations separately we are bandwidth-bound, but if we fuse them into just one kernel, performance is much better (a plain-C++ sketch of this appears after the VisionCpp example below).

Graph programming: some numbers
In this example, we perform 3 image-processing operations on an accelerator and compare 3 systems when executing individual nodes or a whole graph. Halide and SYCL use kernel fusion, whereas OpenCV does not. For all 3 systems, the performance of the whole graph is significantly better than that of the individual nodes executed on their own. The system is an AMD APU and the operations are RGB->HSV, channel masking and HSV->RGB.
[Chart: "Effect of combining graph nodes on performance", showing kernel time and overhead time in ms for OpenCV, Halide and SYCL, each executing individual nodes vs a whole graph.]

VisionCpp with SYCL (or OpenMP)
This graph is created in C++ at compile time, so it can be optimized at compile time. This allows fast start-up. (Source is on GitHub.)

    #include <visioncpp.hpp>
    int main() {
      auto in = cv::imread("input.jpg");
      auto q = get_queue<gpu_selector>();
      auto a = Node<sRGB, 512, 512, Image>(in.data);
      auto b = Node<sRGB2lRGB>(a);
      auto c = Node<lRGB2lHSV>(b);
      auto d = Node<Constant>(0.1);
      auto e = Node<lHSV2Scale>(c, d);
      auto f = Node<lHSV2lRGB>(e);
      auto g = Node<lRGB2sRGB>(f);
      auto h = execute<fuse>(g, q);
      auto ptr = h.get_data();
      auto output = cv::Mat(512, 512, CV_8UC3, ptr.get());
      cv::imshow("Display Image", output);
      return 0;
    }
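To make the kernel-fusion point above concrete, here is a small, hypothetical plain-C++ sketch (not from the talk) of a = b * c + d * f evaluated unfused and fused; on a bandwidth-bound processor the fused form reads and writes far less memory.

```cpp
// Hypothetical illustration of kernel fusion (not from the talk): the
// unfused version makes several full passes over memory and materialises
// temporaries; the fused version touches each element exactly once.
#include <cstddef>
#include <vector>

// Unfused: three loops and two temporary vectors
std::vector<float> unfused(const std::vector<float>& b, const std::vector<float>& c,
                           const std::vector<float>& d, const std::vector<float>& f) {
  std::vector<float> t1(b.size()), t2(b.size()), a(b.size());
  for (std::size_t i = 0; i < b.size(); ++i) t1[i] = b[i] * c[i];
  for (std::size_t i = 0; i < b.size(); ++i) t2[i] = d[i] * f[i];
  for (std::size_t i = 0; i < b.size(); ++i) a[i]  = t1[i] + t2[i];
  return a;
}

// Fused: one loop and no temporaries; the shape a single fused kernel takes
std::vector<float> fused(const std::vector<float>& b, const std::vector<float>& c,
                         const std::vector<float>& d, const std::vector<float>& f) {
  std::vector<float> a(b.size());
  for (std::size_t i = 0; i < b.size(); ++i) a[i] = b[i] * c[i] + d[i] * f[i];
  return a;
}
```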
Expressing the execution for a device

SYCL version:

    template <typename Expr, typename... Acc>
    void sycl(handler& cgh, Expr expr, Acc... acc) {
      // sycl accessor for accessing data on the device
      auto outPtr = expr.out->template get_accessor<write>(cgh);
      // sycl range representing the valid range for accessing data
      auto rng = range<2>(Expr::Rows, Expr::Cols);
      // sycl parallel_for parallelising execution across the range
      cgh.parallel_for<Type>(rng, [=](item<2> itemID) {
        // rebuilding the accessor tuple on the device
        auto tuple = make_tuple(acc...);
        // calling the eval function for each pixel
        outPtr[itemID] = expr.eval(itemID, tuple);
      });
    }

OpenMP version:

    template <typename Expr, typename... Acc>
    void cpp(Expr expr, Acc... acc) {
      // output pointer for accessing data on the host
      auto outPtr = expr.out->get();
      // valid range for accessing data on the host
      auto rng = range(Expr::Rows, Expr::Cols);
      // rebuilding the tuple of input pointers on the host
      auto tuple = make_tuple(acc...);
      // OpenMP directive for parallelising the loop nest
      #pragma omp parallel for
      for (size_t i = 0; i < rng.rows; i++)
        for (size_t j = 0; j < rng.cols; j++)
          // calling the eval function for each pixel
          outPtr[index(i, j)] = expr.eval(index(i, j), tuple);
    }

TensorFlow for OpenCL and SYCL
• The same source code supports CUDA and SYCL, via #ifdefs
• In branches and trunk, being merged into trunk
• Supported and continuously tested

Applying fusion to TensorFlow Eigen
This is how TensorFlow uses Eigen to …
[Chart: "TensorFlow Eigen kernel fusion", performance improvement at size 4,000, with speed-ups ranging up to roughly 18x.]
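The fusion TensorFlow benefits from here comes from Eigen's expression templates. As a hypothetical illustration (not TensorFlow source), the Eigen snippet below evaluates an element-wise expression in a single fused pass rather than as separate operations with temporaries.

```cpp
// Hypothetical illustration (not TensorFlow source): Eigen array
// expressions are expression templates, so the right-hand side below is
// evaluated in one pass over the data when it is assigned, instead of as
// separate per-operation kernels with temporaries.
#include <Eigen/Core>
#include <iostream>

int main() {
  Eigen::ArrayXf b = Eigen::ArrayXf::Random(4000);
  Eigen::ArrayXf c = Eigen::ArrayXf::Random(4000);
  Eigen::ArrayXf d = Eigen::ArrayXf::Random(4000);
  Eigen::ArrayXf f = Eigen::ArrayXf::Random(4000);

  // b * c + d * f builds an expression tree; the single assignment fuses
  // the two multiplies and the add into one loop over the 4000 elements.
  Eigen::ArrayXf a = b * c + d * f;

  std::cout << a.sum() << std::endl;
  return 0;
}
```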