C++17, Is It Great Or Just OK and Heterogeneous Computing in C++


C++17, is it great or just OK? and Heterogeneous computing in C++: Next for Self-Driving Cars
Michael Wong (Codeplay Software, VP of Research and Development) and Andrew Richards (CEO)
ISOCPP.org Director, VP http://isocpp.org/wiki/faq/wg21#michael-wong
Head of Delegation for C++ Standard for Canada
Vice Chair of Programming Languages for Standards Council of Canada
Chair of WG21 SG5 Transactional Memory
Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded
Editor: C++ SG5 Transactional Memory Technical Specification
Editor: C++ SG1 Concurrency Technical Specification
http://wongmichael.com/about
Code::dive 2016

Agenda
• How do we get to programming self-driving cars?
• The C++ Standard
• C++17: is it great or just OK?
• C++ future
• A comparison of heterogeneous programming models
• SYCL design philosophy: a C++ end-to-end model for HPC and consumers
• The ecosystem: VisionCpp, Parallel STL, TensorFlow, machine vision, neural networks, self-driving cars
• Codeplay ComputeCpp Community Edition: free download

How do we get from here… to there? What are the stages?
These are the SAE levels for autonomous vehicles; similar challenges apply in other embedded-intelligence industries:
• Level 0: warnings only
• Level 1: assist (adaptive)
• Level 2: automated execution of manoeuvres
• Level 3: limited overall control
• Level 4: deep control, from very local to extensive journeys
• Level 5: fully autonomous, all conditions

We have a mountain to climb: how do we get to the top when we don't know what the top looks like, without getting lost or climbing the wrong mountain on our own, and while getting there in safe, manageable, affordable steps?

This presentation will focus on:
• The hardware and software platforms that will be able to deliver the results
• The software tools to build up the solutions for those platforms
• The open standards that will enable solutions to interoperate
• How Codeplay can help deliver embedded intelligence

Where do we need to go?
"On a 100 millimetre-squared chip, Google needs something like 50 teraflops of performance" - Daniel Rosenband (Google's self-driving car project) at HotChips 2016

Performance trends
[Chart: GFLOPS (log scale, 1 to 65,536) by year of introduction for desktop CPUs, smartphone CPUs, smartphone GPUs, integrated GPUs, desktop GPUs, and the Google target. These trend lines seem to violate the rules of physics…]

How do we get there from here?
• We need to write software today for platforms that cannot be built yet
• We need to validate the systems as safe
• We need to start with simpler systems that are not fully autonomous

Two models of software development
[Diagram contrasting a hardware designer writing software (design a model, write software on the model, design the next version, validate the model) with a software designer writing software (select a platform, implement, validate the whole platform, optimize for the platform). Which method can get us all the way to full autonomy?]

Desirable development
[Diagram: a stack of application software, well-defined middleware, and hardware & low-level software, with a software cycle (write software, evaluate the platform, optimize for the platform, validate the whole platform) running alongside a hardware cycle (develop the platform architecture, evaluate it, select a platform).]
The different levels of programming model
• Device-specific programming: assembly language, VHDL, device-specific C-like programming models
• Higher-level language enablers: NVIDIA PTX, HSA, OpenCL SPIR, SPIR-V
• C-level programming: OpenCL C, DSP C, MCAPI/MTAPI
• C++-level programming: SYCL, CUDA, HCC, C++ AMP
• Graph programming: OpenCV, OpenVX, Halide, VisionCpp, TensorFlow, Caffe

Device-specific programming
• Can deliver quick results today, and lets you hand-optimize directly for the device
• Cannot be used to develop software today for future platforms
• Does not allow software developers to invest today, so it is not a route to full autonomy

The route to full autonomy
• Graph programming: the most widely-adopted approach to machine vision and machine learning
• Open standards: these let you develop today for future architectures

Why graph programming?
• When you scale the number of cores, you don't scale the number of memory ports: compute performance increases, but off-chip memory bandwidth does not
• Therefore you need to reduce off-chip memory bandwidth by keeping everything on-chip, which is achieved by tiling
• However, writing tiled image-processing pipelines by hand is hard
• If we build up a graph of operations (e.g. convolutions) and then have a runtime system split it into fused, tiled operations across the entire system-on-chip, we get great performance

Graph programming: some numbers
In this example, we perform 3 image-processing operations on an accelerator and compare 3 systems when executing individual nodes versus a whole graph. The system is an AMD APU and the operations are RGB->HSV, channel masking, and HSV->RGB. Halide and SYCL use kernel fusion, whereas OpenCV does not. For all 3 systems, the performance of the whole graph is significantly better than that of the individual nodes executed on their own.
[Chart: "Effect of combining graph nodes on performance" — kernel time and overhead time (ms, 0-100) for OpenCV, Halide and SYCL, each executed as individual nodes and as a whole graph.]

Graph programming
For both machine vision algorithms and machine learning, graph programming is the most widely-adopted approach. Two styles of graph programming that we commonly see:
• C-style graph programming: OpenVX, OpenCV
• C++-style graph programming: Halide, RapidMind, Eigen (also in TensorFlow), VisionCpp

C-style graph programming
• OpenVX (open standard): can be implemented by vendors; create a graph with a C API, then map it to an entire SoC
• OpenCV (open source): implemented on OpenCL and on device-specific accelerators; create a graph with a C API, then execute it

C-style graph programming & device-specific programming
• Runtime systems can automatically optimize the graphs
• Can be used to develop software today for future platforms
• But what happens if we invent our own graph nodes? How do we adapt it for all the graph nodes we need?

C++-style graph programming
• Examples in machine vision/machine learning: Halide, RapidMind, Eigen (also in TensorFlow), VisionCpp
• C++ compilers that support this style: CUDA, C++ OpenMP, C++17 Parallel STL, SYCL
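To make the C++-style graph idea concrete, the sketch below shows the underlying expression-template technique in plain standard C++. It is not the VisionCpp, Eigen or Halide API; the names Leaf, Node, apply and run_tiled are illustrative assumptions. Point-wise operations are composed into a graph object whose type is known at compile time, and the whole chain is then executed in one fused, tiled loop so no intermediate image is written back to off-chip memory.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kSize = 1024;
using Image = std::array<float, kSize>;

// Leaf of the expression graph: a lightweight view of an input image.
struct Leaf {
  const Image* img;
  float eval(std::size_t i) const { return (*img)[i]; }
};

// Interior node: a point-wise operation applied to a sub-expression.
// Sub-expressions are stored by value, so building an expression builds a
// plain object whose type encodes the whole graph at compile time.
template <typename Op, typename Expr>
struct Node {
  Op op;
  Expr expr;
  float eval(std::size_t i) const { return op(expr.eval(i)); }
};

template <typename Op, typename Expr>
Node<Op, Expr> apply(Op op, Expr e) { return {op, e}; }

// Execute the graph tile by tile: every node in the chain is fused into a
// single loop per tile, so no intermediate image is materialized in memory.
template <typename Expr>
void run_tiled(Image& out, const Expr& graph, std::size_t tile) {
  for (std::size_t t = 0; t < kSize; t += tile)
    for (std::size_t i = t; i < t + tile && i < kSize; ++i)
      out[i] = graph.eval(i);
}

int main() {
  Image in{}, out{};
  auto brighten = [](float x) { return x + 0.1f; };
  auto clamp01  = [](float x) { return x > 1.0f ? 1.0f : x; };

  // clamp01(brighten(in)) built as a graph object, executed in one fused pass.
  auto graph = apply(clamp01, apply(brighten, Leaf{&in}));
  run_tiled(out, graph, 256);
  return 0;
}
```

A production library built on this principle (VisionCpp, Halide, Eigen) additionally generates device kernels for the composed expression and chooses tile sizes suited to the target accelerator, rather than running a simple host loop as in this sketch.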
C++ single-source programming
• C++ lets us build up graphs at compile time, which means we can map a graph to the processors offline
• C++ lets us write custom nodes ourselves
• This approach is called a C++ embedded domain-specific language
• It is very widely used, e.g. Eigen, Boost, TensorFlow, RapidMind, Halide

C++ single-source
• Single-source is the most widely-adopted machine learning programming model
• It lets us create customizable graph models
• OpenCL lets us run on a very wide range of accelerators, now and in the future
• SYCL combines C++ single-source with OpenCL acceleration, bringing together open standards, C++ and graph programming

Putting it all together: building it

Higher-level programming enablers
• NVIDIA PTX: NVIDIA CUDA-only
• HSA: royalty-free open standard; HSAIL is the IR; provides a single address space with virtual memory and low-latency communication
• OpenCL SPIR: defined for OpenCL v1.2; based on Clang/LLVM (the open-source compiler)
• SPIR-V: open standard defined by Khronos; supports compute and graphics (OpenCL, Vulkan and OpenGL); not tied to any compiler
Open-standard intermediate representations enable tools to be built on top of them and support a wide range of platforms.

Which model should we choose?
• Device-specific programming: assembly language, VHDL, device-specific C-like programming models
• Higher-level language enablers: NVIDIA PTX, HSA, OpenCL SPIR, SPIR-V
• C-level programming: OpenCL C, DSP C, MCAPI/MTAPI
• C++-level programming: SYCL, CUDA, HCC, C++ AMP
• Graph programming: OpenCV, OpenVX, Halide, VisionCpp, TensorFlow, Caffe

They are not alternatives, they are layers
• Graph programming: OpenCV, OpenVX, Halide, VisionCpp, TensorFlow, Caffe
• C/C++-level programming: SYCL, CUDA, HCC, C++ AMP, OpenCL
• Higher-level language enablers: NVIDIA PTX, HSA, OpenCL SPIR, SPIR-V
• Device-specific programming: assembly language, VHDL, device-specific C-like programming models

We can specify, test and validate each layer
• Graph programming: validate the graph models; validate the code using standard tools
• C/C++-level programming: OpenCL/SYCL specs, conformance testsuites, the CLSmith testsuite, and a wide range of other testsuites
• Higher-level language enablers: SPIR/SPIR-V/HSAIL specs, conformance testsuites
• Device-specific programming: device-specific specifications, device-specific testing and validation

Agenda
• How do we get to programming self-driving cars?
• The C++ Standard
• C++17: is it great or just OK?
• C++ future
• A comparison of heterogeneous programming models
• SYCL design philosophy: a C++ end-to-end model for HPC and consumers
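As a concrete illustration of the C++ single-source model discussed above, here is a minimal SYCL-style vector add in the SYCL 1.2 idiom of the time (the cl::sycl API that implementations such as ComputeCpp targeted). The kernel lambda is ordinary C++ in the same source file as the host code; buffer and kernel names are illustrative, and device selection and error handling are omitted for brevity.

```cpp
#include <CL/sycl.hpp>
#include <vector>

int main() {
  namespace sycl = cl::sycl;

  std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024, 0.0f);

  sycl::queue q;  // default selector picks an available OpenCL device

  {
    // Buffers manage data movement between host and device.
    sycl::buffer<float, 1> bufA(a.data(), sycl::range<1>(a.size()));
    sycl::buffer<float, 1> bufB(b.data(), sycl::range<1>(b.size()));
    sycl::buffer<float, 1> bufC(c.data(), sycl::range<1>(c.size()));

    // A command group: data requirements (accessors) plus one kernel.
    q.submit([&](sycl::handler& cgh) {
      auto A = bufA.get_access<sycl::access::mode::read>(cgh);
      auto B = bufB.get_access<sycl::access::mode::read>(cgh);
      auto C = bufC.get_access<sycl::access::mode::write>(cgh);

      // The kernel is plain C++ living in the same source file as the host code.
      cgh.parallel_for<class vector_add>(
          sycl::range<1>(a.size()),
          [=](sycl::id<1> i) { C[i] = A[i] + B[i]; });
    });
  }  // buffers go out of scope: results are copied back into c

  return c[0] == 3.0f ? 0 : 1;  // trivial host-side check
}
```

Because the kernel is just a C++ lambda over accessors, the same mechanism extends from this toy example to the compile-time graphs above: a fused graph node simply becomes the body of the parallel_for.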