Dark Silicon and the End of Multicore Scaling

Dark Silicon and the End of Multicore Scaling
CTA01++: multicore and beyond
Johan Peltenburg [email protected]
Accelerated Big Data Systems group, Quantum & Computer Engineering, Delft University of Technology
26 October 2020 @ Hogeschool Rotterdam
Sources: Xilinx; https://cloud.google.com/tpu/

TU Delft (2017 figures)
● 23,461 students
● 2,799 PhD students
● 253 FTE professors
● 3,448 scientists
● 2,385 support staff
● 23 startups / year

Accelerated Big Data Systems
Faculty of Electrical Engineering, Mathematics and Computer Science
Department of Quantum & Computer Engineering:
– Accelerated Big Data Systems
– Computer Engineering
– Network Architecture & Services
– Applied Quantum Architectures, Quantum Communications Lab, Quantum Computer Architecture Lab, Quantum Computing, Quantum Information & Software, Quantum Integration Technology

Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing

Moore’s “Law” [1]
What to do with all these transistors?
[1] Moore, G. E. (1965). Cramming more components onto integrated circuits. Electronics, 38(8), 114–117.

What is a multicore processor?
A multicore processor is a single computing component with two or more independent processors (called “cores”), which are the units that read and execute program instructions.
Characteristics:
– 2 or more general-purpose processors
– A single component (chip or integrated circuit)
– Shared infrastructure (memory and communication resources)

What is a multicore processor?
Famous examples of multicore:
– Intel Core, Core 2, i3, i5, i7, etc.
– AMD Athlon, Phenom, Ryzen
– Sony CELL microprocessor
Many other multicore examples: Adapteva, Aeroflex, Ageia, Ambric, AMD, Analog Devices, ARM, ASOCS, Azul Systems, ...
(Pictured: Intel i7, Intel Core 2 Duo, Sony/IBM CELL)

What is a multicore processor?
● Is a processor with embedded acceleration circuitry considered a multicore? (Intel P4 with SSE)
● Are two fully independent processors on a single chip considered a multicore? (Intel i7 6950X)
● Is a GPU considered a multicore processor? (Nvidia Tesla V100)

The birth of the multicore
What reasons are there to go to multicore?

The Power Wall
Total power = dynamic power consumption + power due to leakage current:
P = A·C·V²·f + V·I_leak
● Reduce the supply voltage V?
– The maximum frequency drops: f_max ∝ (V − V_th)^α / V, with 1 < α < 2
● Reduce the threshold voltage V_th to keep f_max up?
– The leakage current grows exponentially: I_leak ∝ exp(−q·V_th / kT)

Overcoming the Power Wall
● Solution:
– Reduce the frequency (which also allows a lower voltage)
– Duplicate the CPU core
Source: Intel 2006
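To see why "reduce frequency and duplicate the core" beats simply clocking one core higher, the following sketch plugs numbers into the dynamic-power term from the Power Wall slide. The specific voltage and frequency factors are illustrative assumptions for the sake of the example, not the figures from the Intel 2006 source:

```python
# Back-of-envelope version of the "Overcoming the Power Wall" argument,
# using only the dynamic term P = A*C*V^2*f. Scaling factors are illustrative.

def dynamic_power(a, c, v, f):
    """Dynamic power consumption: P = A * C * V^2 * f (leakage ignored)."""
    return a * c * v**2 * f

base   = dynamic_power(1.0, 1.0, v=1.0, f=1.0)      # one core, normalized units
faster = dynamic_power(1.0, 1.0, v=1.2, f=1.2)      # ~20% higher clock needs more voltage
dual   = 2 * dynamic_power(1.0, 1.0, v=0.8, f=0.8)  # two slower, lower-voltage cores

print(f"single core      : perf ~1.0, power {base:.2f}")    # 1.00
print(f"overclocked core : perf ~1.2, power {faster:.2f}")  # 1.73
print(f"two slower cores : perf ~1.6, power {dual:.2f}")    # 1.02
# The dual-core "perf ~1.6" assumes the workload parallelizes well,
# which is exactly the assumption Amdahl's Law (next section) questions.
```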
ILP Wall
● The ’80s: superscalar expansion
– 50% per year improvement in performance
– Pipelined processors
– 10 CPI → 1 CPI
● The ’90s: the era of diminishing returns
– Squeezing out the last bit of implicit parallelism
– 2-way to 6-way issue, out-of-order issue, branch prediction
– 1 CPI → 0.5 CPI
● The ’00s: the multicore era
– The need for explicit parallelism
● The ’10s: my guess: the heterogeneous multicore era

CTA01++
● Very deep pipelines
– Intel once went up to 31 stages!
● Complex branch predictors
– Speculative execution
● Advanced memory hierarchy
– E.g. prefetching
– E.g. 3 levels of cache
● SIMD extensions
– E.g. SSE, AVX, AVX2, AVX512
Source: https://commons.wikimedia.org

CTA01++
● Out-of-order execution
● Superscalar execution
– Can launch execution of more than one instruction per cycle
● Simultaneous Multithreading (SMT)
– Can run multiple threads on a single core
– Intel calls this HyperThreading
What do all these improvements cost?
Source: Patterson & Hennessy

Pollack’s Rule [3]
● Pollack’s Rule: the performance increase is roughly proportional to the square root of the increase in complexity. REMEMBER THIS FOR LATER!
● Exercise: we double the chip area to make a circuit twice as complex. How much performance do we get?
● We only get a 1.4× (√2) improvement in performance.
[3] Borkar, S. (2007, June). Thousand core chips: a technology perspective. In Proceedings of the 44th Annual Design Automation Conference (pp. 746–749). ACM.

More reasons for multicore
● Memory Wall (see the book)
● Industry push

Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing

Amdahl’s Law
S(n) = 1 / ((1 − f) + f/n)
S = speedup w.r.t. a single core
n = number of cores
f = parallel portion of the workload

Amdahl’s Law in the Multicore Era [4]
● Resources to build and operate a computer circuit: area, capacitive load, frequency, power, money, etc.
● Let’s forget about the specific resource and call it a “Base Core Equivalent” or BCE.
– A “Base Core” is the set of resources required to implement the simplest core imaginable that can run our instruction set.
– n = 1 BCE → normalized performance = 1
[4] Hill, M. D., & Marty, M. R. (2008). Amdahl’s law in the multicore era. Computer, 41(7).

Exercise
● n: the total number of BCE resources available to our design
● r: the number of BCE resources we use for our single core
● We create an architecture of r = 4 BCE. How much performance do we get compared to r = 1 BCE?
● Performance of an r-BCE core = √r (remember Pollack’s Rule), so r = 4 gives a relative performance of 2×.

Symmetric Multicore [4] (1/2)
● Let’s build a multicore system with an n = 16 BCE budget.
● We give every core r = 4 BCE, so there are n/r cores, each with performance √r.
● What is the speedup over 1 BCE given a parallel portion f?
Starting from Amdahl’s Law, S(n) = 1 / ((1 − f) + f/n), and substituting the per-core performance √r and the number of cores n/r:
S(n) = 1 / ((1 − f)/√r + f·r / (√r·n))

Symmetric Multicore [4] (2/2)

Asymmetric Multicore [4] (1/2)
● A multicore system with an n = 16 BCE budget.
● We create one big core of r = 4 BCE.
● We create a small, simple core out of each remaining BCE.
● What is the speedup over 1 BCE given a parallel portion f?
S(n) = 1 / ((1 − f)/√r + f / (√r + n − r))

Asymmetric Multicore [4] (2/2)

Dynamic Multicore [4] (1/2)
● When performing the non-parallelizable part 1 − f...
– Use all BCEs to form one huge core
● When performing the parallelizable part f...
– Use all BCEs to form many tiny cores

Dynamic Multicore [4] (2/2)
What sort of magical device would have such dynamic properties?
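As a recap of the three configurations above, here is a small Python sketch that evaluates the symmetric and asymmetric speedup formulas from the previous slides for the n = 16 BCE example, plus the dynamic variant, whose formula is taken from Hill & Marty [4] rather than written out on the slides. The chosen values of f are only illustrative:

```python
from math import sqrt

def perf(r):
    """Pollack's Rule: performance of a core built from r BCEs."""
    return sqrt(r)

def symmetric(f, n, r):
    """n/r identical cores of r BCEs each (Hill & Marty, symmetric)."""
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric(f, n, r):
    """One r-BCE core plus (n - r) single-BCE cores (Hill & Marty, asymmetric)."""
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def dynamic(f, n):
    """All n BCEs fused for the serial part, n single-BCE cores for the parallel part.
    Formula from Hill & Marty [4]; it is not written out on the slides."""
    return 1.0 / ((1 - f) / perf(n) + f / n)

n, r = 16, 4
for f in (0.5, 0.9, 0.99):
    print(f"f={f:4}: symmetric {symmetric(f, n, r):5.2f}  "
          f"asymmetric {asymmetric(f, n, r):5.2f}  dynamic {dynamic(f, n):5.2f}")
```

Running this reproduces the qualitative picture of the speedup plots: the asymmetric design beats the symmetric one, and the (hypothetical) dynamic design beats both, with the gap growing as f approaches 1.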
Time for a break

Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing

Dennard Scaling [2]
● Once upon a time...
● ...as transistors became smaller...
● ...their power density stayed constant.
● Moore’s Law & Dennard Scaling lived happily ever after... The end?
● Leakage current and threshold voltage were not taken into account in Dennard Scaling.
● It broke down around 2006: the Power Wall!
[2] Dennard, R. H., Gaensslen, F. H., Rideout, V. L., Bassous, E., & LeBlanc, A. R. (1974). Design of ion-implanted MOSFET's with very small physical dimensions. IEEE Journal of Solid-State Circuits, 9(5), 256–268.

Dark Silicon [5]
● No matter what chip topology we use (CPU-like / GPU-like)...
● ...we must power off parts of the chip to stay within a power budget.
● At 8 nm, we must power off 50% of the chip continuously to stay within the power budget!
● Limits to speedup in 2024:
– Only 7.9× predicted when the paper appeared in 2011!
– Shouldn’t we get ~388× according to Moore’s Law?
● Don’t confuse Moore’s Law with performance!
[5] Esmaeilzadeh, H., Blem, E., Amant, R. S., Sankaralingam, K., & Burger, D. (2011, June). Dark silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th Annual International Symposium on (pp. 365–376). IEEE.

(Four figure slides; source: https://www.ibmbigdatahub.com)

Where do we go now?

Table of Contents
● Multicore processors
● Amdahl’s Law in the Multicore Era
● Dark Silicon and the End of Multicore Scaling
● Heterogeneous Computing

Heterogeneous Computing

General Purpose Graphics Processing Unit
● Most mainstream accelerator nowadays: the GPU
– Originally used to render 3D images
– Cores got less specialized and could now do any computation
● Now used in general-purpose computing: GPGPU
● Programmable using CUDA/OpenCL
– C/C++-like languages
● Widely used in scientific computing, AI, and machine learning
● Top supercomputers make use of GPGPUs.
– How many GPGPUs?

ORNL Summit supercomputer
● Processor: IBM POWER9™ (2/node)
● GPUs: 27,648 NVIDIA Volta V100s (6/node)
● Nodes: 4,608
● Node performance: 42 TFlop/s
● Memory/node: 512 GB DDR4 + 96 GB HBM2
● NV memory/node: 1,600 GB
● Total system memory: >10 PB DDR4 + HBM + non-volatile
● Peak power consumption: 13 MW
– ~ the power consumption of a reasonably sized town

FPGA accelerator trends (1/2)
https://newsroom.intel.com/editorials/intel-fpgas-accelerating-future/
https://www.xilinx.com/applications/high-performance-computing.html

FPGA accelerator trends (2/2)

FPGA advantages
● Great flexibility in solution trade-offs
– Can have many parallel ‘cores’ (until we run out of resources)
– Can completely tailor the circuit to the application
● Can work with numeric formats that are not supported by GPGPU / CPU
– Arbitrary integer, fixed-point & floating-point widths
– E.g. 5-bit integers, float16, posits, etc.
● Can interface with I/O directly
– Network controllers: can process data as it travels on the link (filtering, etc.)
– Non-volatile storage: can process data as it travels to disk (compression, error checking, etc.)
– Etc.
● Dataflow computing
– Many algorithms map naturally
– Don’t require load-store of intermediate values
– Minimal control overhead; no instruction set
– No operating system
Source: [6]

FPGA disadvantages
● Low clock speeds: ~10× lower than a CPU
● High area overhead: ~15× lower density than a fixed-function IC
● So the FPGA circuit itself should be ~150× more efficient than the CPU (the product of the two factors above) to come out ahead!
– Requires computer architecture & digital design knowledge
● Hard to program
– Need digital design knowledge
● The ratio of digital design engineers to Python programmers might underflow in float32.
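The break-even arithmetic on the last slide can be made explicit with a small sketch. It simply reproduces the slide's rough reasoning: the ~10× clock penalty and ~15× density penalty multiply to a combined factor of about 150, which the custom circuit must win back in useful work per cycle per unit of resources. The penalty figures are the slide's ballpark numbers; the example kernels are made up for illustration:

```python
# Rough break-even estimate for moving a kernel from a CPU to an FPGA,
# following the slide's back-of-envelope reasoning. Illustrative figures only.

CLOCK_PENALTY = 10.0    # FPGA clock is roughly 10x lower than a CPU's
DENSITY_PENALTY = 15.0  # reconfigurable fabric is roughly 15x less dense than fixed logic

REQUIRED_GAIN = CLOCK_PENALTY * DENSITY_PENALTY  # ~150x

def worthwhile(useful_ops_per_cycle_fpga, useful_ops_per_cycle_cpu):
    """True if the custom circuit's per-cycle advantage beats the combined penalties."""
    return useful_ops_per_cycle_fpga / useful_ops_per_cycle_cpu >= REQUIRED_GAIN

# A hypothetical deeply pipelined streaming filter vs. a scalar CPU loop:
print(worthwhile(1024, 4))   # True  (256x >= 150x)
# A hypothetical kernel with only modest spatial parallelism:
print(worthwhile(256, 4))    # False (64x < 150x)
```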