Modern C++ Heterogeneous Programming with SYCL


Modern C++ Heterogeneous Programming with SYCL
Michael Wong
Distinguished Engineer
SYCL WG Chair, ISO C++ Directions Group
Chair of C++ Machine Learning, Low Latency, Games, Embedded, Finance
2021

Distinguished Engineer: Michael Wong
● Chair of SYCL Heterogeneous Programming Language
● C++ Directions Group
● Past CEO of OpenMP
● ISOCPP.org Director, VP: http://isocpp.org/wiki/faq/wg21#michael-wong
● [email protected], [email protected]
● Head of Delegation for C++ Standard for Canada
● Chair of Programming Languages for Standards Council of Canada
● Chair of WG21 SG19 Machine Learning
● Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded
● Editor: C++ SG5 Transactional Memory Technical Specification
● Editor: C++ SG1 Concurrency Technical Specification
● MISRA C++ and AUTOSAR
● Chair of Standards Council Canada TC22/SC32 Electrical and electronic components (SOTIF)
● Chair of UL4600 Object Tracking
● http://wongmichael.com/about
● C++11 book in Chinese: https://www.amazon.cn/dp/B00ETOV2OQ
Codeplay: ported TensorFlow to open standards using SYCL; builds LLVM-based compilers for accelerators; implements OpenCL and SYCL for accelerator processors; releases open-source, open-standards-based AI acceleration tools (SYCL-BLAS, SYCL-ML, VisionCpp); builds GPU compilers for semiconductor companies; now working to make AI/ML heterogeneous acceleration safe for autonomous vehicles.

Acknowledgement and Disclaimer
Numerous people internal and external to the original C++/Khronos group/OpenMP, in industry and academia, have made contributions, influenced ideas, written parts of this presentation, and offered feedback that forms part of this talk. These include Bjarne Stroustrup, Joe Hummel, Botond Ballo, Simon McIntosh-Smith, as well as many others. But I claim all credit for errors and stupid mistakes. These are mine, all mine! You can't have them.

Legal Disclaimer
THIS WORK REPRESENTS THE VIEW OF THE AUTHOR AND DOES NOT NECESSARILY REPRESENT THE VIEW OF CODEPLAY. OTHER COMPANY, PRODUCT, AND SERVICE NAMES MAY BE TRADEMARKS OR SERVICE MARKS OF OTHERS.

Disclaimers
NVIDIA, the NVIDIA logo and CUDA are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and/or other countries. Codeplay is not associated with NVIDIA for this work; it purely uses public documentation and widely available code.

SYCL 2020 is here!
Open Standard for Single Source C++ Parallel Heterogeneous Programming
SYCL 2020 was released after 3 years of intense work.
Significant adoption in embedded, desktop, and HPC markets.
Improved programmability, smaller code size, faster performance.
Based on C++17, backwards compatible with SYCL 1.2.1.
Simplifies porting of standard C++ applications to SYCL.
Closer alignment and integration with ISO C++.
Multiple backend acceleration, API independent.
SYCL 2020 increases expressiveness and simplicity for modern C++ heterogeneous programming.
This work is licensed under a Creative Commons Attribution 4.0 International License. © The Khronos® Group Inc. 2021
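To make the expressiveness and simplicity claims above concrete, the following is a minimal vector-add sketch in SYCL 2020 style, using unified shared memory and an unnamed kernel lambda submitted through the queue shortcut. It is not code from the talk; the problem size and variable names are illustrative assumptions.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  constexpr size_t N = 1024;          // illustrative problem size
  sycl::queue q;                      // default device selection

  // SYCL 2020 unified shared memory: pointers usable on host and device.
  float* a = sycl::malloc_shared<float>(N, q);
  float* b = sycl::malloc_shared<float>(N, q);
  float* c = sycl::malloc_shared<float>(N, q);
  for (size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  // Single source: the kernel is an ordinary C++ lambda in the same file.
  q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> idx) {
    c[idx] = a[idx] + b[idx];
  }).wait();

  std::cout << "c[0] = " << c[0] << "\n";   // expect 3
  sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
  return 0;
}

Compared with the buffer/accessor style of SYCL 1.2.1 shown later in the deck, USM lets device code index ordinary pointers, which is one of the programmability improvements SYCL 2020 highlights.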
SYCL 2020 Industry Momentum
SYCL support is growing from embedded systems through desktops to supercomputers:
https://www.alcf.anl.gov/support-center/aurora/sycl-and-dpc-aurora
https://www.embeddedcomputing.com/technology/open-source/risc-v-open-source-ip/nsitexe-kyoto-microcomputer-and-codeplay-software-are-bringing-open-standards-programming-to-risc-v-vector-processor-for-hpc-and-ai-systems
https://www.nextplatform.com/2021/02/03/can-sycl-slice-into-broader-supercomputing/
https://www.phoronix.com/scan.php?page=news_item&px=hipSYCL-New-Lite-Runtime
https://software.intel.com/content/www/us/en/develop/articles/interoperability-dpcpp-sycl-opencl.html
https://www.renesas.com/br/en/about/press-room/renesas-electronics-and-codeplay-collaborate-opencl-and-sycl-adas-solutions
https://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2021/nersc-alcf-codeplay-partner-on-sycl-for-next-generation-supercomputers/
https://research-portal.uws.ac.uk/en/publications/trisycl-for-xilinx-fpga
https://www.imaginationtech.com/news/press-release/tensorflow-gets-native-support-for-powervr-gpus-via-optimised-open-source-sycl-libraries/

Agenda
➢ Challenges of an Accelerator Programming Model
➢ SYCL 2020
➢ SYCL in HPC, Embedded, Safety, and Autonomous Vehicles

Understanding the Challenges of the Heterogeneous Era

So what are the biggest challenges for heterogeneous computing?
➢ Single source vs multiple source
➢ Heterogeneous offloading
➢ Expressing parallelism
➢ The four horsemen of heterogeneous computing: data locality, movement, layout, affinity
➢ The SPMD programming model

Heterogeneous Offloading

How do we offload code to a heterogeneous device?
➢ This can be answered by looking at the C++ compilation model.

How can we compile source code for sub-architectures?
➢ Separate source
➢ Single source

Separate Source Compilation Model
Host path: C++ source file -> CPU compiler -> object file -> linker -> x86 ISA running on the CPU.
Device path: device source -> online compiler -> GPU.
Here we're using OpenCL as an example. Host code:

float *a, *b, *c;
…
kernel k = clCreateKernel(…, "my_kernel", …);
clEnqueueWriteBuffer(…, size, a, …);
clEnqueueWriteBuffer(…, size, b, …);
clEnqueueNDRangeKernel(…, k, 1, {size, 1, 1}, …);
clEnqueueReadBuffer(…, size, c, …);

Device code, passed to the online compiler as a separate source string:

void my_kernel(__global float *a, __global float *b, __global float *c) {
  int id = get_global_id(0);
  c[id] = a[id] + b[id];
}

Std C++ compilation model
C++ source file -> CPU compiler -> CPU object -> linker -> CPU ISA running on the CPU. The GPU is left with a question mark: the standard compilation model provides no way to target it.
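For comparison with the single-source SYCL model that follows, here is a hedged, self-contained sketch of the separate-source host program that the OpenCL fragment above abbreviates. The OpenCL API calls are real entry points, but error handling is omitted, and the platform/device selection, problem size, and variable names are illustrative assumptions rather than anything shown on the slides.

#include <CL/cl.h>
#include <vector>
#include <cstdio>

// Device code shipped as a string and compiled at run time ("online").
static const char* kSource = R"(
__kernel void my_kernel(__global float *a, __global float *b, __global float *c) {
  int id = get_global_id(0);
  c[id] = a[id] + b[id];
})";

int main() {
  const size_t n = 1024, bytes = n * sizeof(float);
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

  // Pick the first platform/device and build a context and queue (sketch only).
  cl_platform_id platform; cl_device_id device;
  clGetPlatformIDs(1, &platform, nullptr);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
  cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);

  // Online compilation of the device source string.
  cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, nullptr);
  clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
  cl_kernel k = clCreateKernel(prog, "my_kernel", nullptr);

  // Explicit buffer management and data movement.
  cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, nullptr, nullptr);
  cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, nullptr, nullptr);
  cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, nullptr, nullptr);
  clEnqueueWriteBuffer(q, da, CL_TRUE, 0, bytes, a.data(), 0, nullptr, nullptr);
  clEnqueueWriteBuffer(q, db, CL_TRUE, 0, bytes, b.data(), 0, nullptr, nullptr);

  clSetKernelArg(k, 0, sizeof(cl_mem), &da);
  clSetKernelArg(k, 1, sizeof(cl_mem), &db);
  clSetKernelArg(k, 2, sizeof(cl_mem), &dc);
  clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
  clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c.data(), 0, nullptr, nullptr);

  std::printf("c[0] = %f\n", c[0]);  // expect 3.0
  return 0;
}

Note how much host boilerplate the separate-source model requires: run-time compilation of a string, buffer creation, explicit copies, and per-argument kernel setup. The single-source SYCL model described next removes most of it.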
SYCL single source compilation model
The C++ source file goes to the CPU compiler, producing a CPU object that the linker turns into CPU ISA, exactly as in the standard model. The same source file also contains the C++ device code, which a SYCL device compiler lowers to an intermediate representation such as SPIR (SYCL doesn't mandate SPIR) for the GPU; the device binary is embedded in the CPU object and linked into the final executable.

auto aAcc = aBuf.get_access<access::read>(cgh);
auto bAcc = bBuf.get_access<access::read>(cgh);
auto oAcc = oBuf.get_access<access::write>(cgh);
cgh.parallel_for<add>(range<1>(a.size()), [=](id<1> idx) {
  oAcc[idx] = aAcc[idx] + bAcc[idx];
});

(A complete sketch of this fragment appears after the Directive vs Explicit Parallelism examples below.)

Benefits of Single Source
• Device code is written in C++ in the same source file as the host CPU code
• Allows compile-time evaluation of device code
• Supports type safety across host CPU and device
• Supports generic programming
• Removes the need to distribute source code

Describing Parallelism

How do you represent the different forms of parallelism?
➢ Directive vs explicit parallelism
➢ Task vs data parallelism
➢ Queue vs stream execution

Directive vs Explicit Parallelism
Directive parallelism
• Examples: OpenMP, OpenACC
• Implementation: the compiler transforms code to be parallel based on pragmas
Here we're using OpenMP as an example:

vector<float> a, b, c;
#pragma omp parallel for
for (int i = 0; i < a.size(); i++) {
  c[i] = a[i] + b[i];
}

Explicit parallelism
• Examples: SYCL, CUDA, TBB, Fibers, C++11 threads
• Implementation: an API is used to explicitly enqueue one or more threads
Here we're using C++ AMP as an example:

array_view<float> a, b, c;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});
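As promised above, here is a hedged, complete sketch that places the command-group fragment from the single-source slides into a runnable SYCL program, using SYCL 1.2.1-style buffers and accessors (spelled here with the full access::mode enumerators). The surrounding queue, buffers, and host vectors are assumptions added for illustration; only the accessor and parallel_for lines come from the slides.

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>

using namespace sycl;

class add;  // kernel name type, as used in the slide fragment

int main() {
  std::vector<float> a(1024, 1.0f), b(1024, 2.0f), o(1024, 0.0f);
  {
    // Buffers take ownership of the host data for the duration of this scope.
    buffer<float, 1> aBuf{a.data(), range<1>{a.size()}};
    buffer<float, 1> bBuf{b.data(), range<1>{b.size()}};
    buffer<float, 1> oBuf{o.data(), range<1>{o.size()}};

    queue q;
    q.submit([&](handler& cgh) {
      // The fragment from the slides, inside its command group:
      auto aAcc = aBuf.get_access<access::mode::read>(cgh);
      auto bAcc = bBuf.get_access<access::mode::read>(cgh);
      auto oAcc = oBuf.get_access<access::mode::write>(cgh);
      cgh.parallel_for<add>(range<1>{a.size()}, [=](id<1> idx) {
        oAcc[idx] = aAcc[idx] + bAcc[idx];
      });
    });
  }  // buffer destruction waits for the kernel and copies results back

  std::cout << "o[0] = " << o[0] << "\n";  // expect 3
  return 0;
}

The device lambda, the host code around it, and the types they share all live in one translation unit, which is what enables the type safety and generic programming benefits listed above.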
Task vs Data Parallelism
Task parallelism
• Examples: OpenMP, C++11 threads, TBB
• Implementation: multiple (potentially different) tasks are performed in parallel
Here we're using TBB as an example:

vector<task> tasks = { … };
tbb::parallel_for_each(tasks.begin(), tasks.end(), [=](task &v) {

Data parallelism
• Examples: C++ AMP, CUDA, SYCL, C++17 Parallel STL
• Implementation: the same task is performed across a large data set
Here we're using CUDA as an example:

float *a, *b, *c;
cudaMalloc((void **)&a, size);
cudaMalloc((void **)&b, size);
cudaMalloc((void
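The CUDA data-parallel example above is cut off in this excerpt. To illustrate the same idea, one operation applied uniformly across a whole data set, here is a small hedged sketch using the C++17 parallel algorithms that the slide lists alongside CUDA and SYCL; the sizes and values are illustrative assumptions, not content from the talk.

#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>

int main() {
  const std::size_t n = 1024;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

  // Data parallelism: the same operation is applied to every element of the
  // data set, and the library may execute the iterations in parallel.
  std::transform(std::execution::par, a.begin(), a.end(), b.begin(), c.begin(),
                 [](float x, float y) { return x + y; });

  std::cout << "c[0] = " << c[0] << "\n";  // expect 3
  return 0;
}

With libstdc++ this typically needs to be linked against TBB (for example with -ltbb); other standard library implementations use their own parallel back ends.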