Memory Bandwidth and System Balance in HPC Systems

John D. McCalpin, PhD
[email protected]

Outline
1. Changes in TOP500 Systems
  – System Architectures & System Sizes
2. Technology Trends & System Balances
  – The STREAM Benchmark
  – Computation Rates vs Data Motion Latency and Bandwidth
  – Required Concurrency to Exploit Available Bandwidths
3. Comments & Speculations
  – Impact on HPC systems and on HPC applications
  – Potential disruptive technologies

Part 1: CHANGES IN TOP500 SYSTEMS

[Chart: TOP500 Rmax Contributions by System Architecture, 1993-2016. Stacked shares of aggregate Rmax for Vector, RISC, x86, MPP, and Accelerated systems, annotated with ASCI Red, Earth Simulator, IBM BG/L, RoadRunner, IBM BG/Q, the K Machine, Tianhe, and TaihuLight. Note from the original figure: a fraction of the "Accelerated" Rmax is drawn from x86 systems as well.]

[Chart: HPC "Typical" System Acquisition Price per "Processor", 1980-2020, log scale from roughly $100 to $10,000,000. Series: Vector, RISC, x86 single-core, x86 multi-core ($/socket), x86 multi-core ($/core).]

[Chart: TOP500 Rmax Contributions by System Architecture, annotated with typical system sizes: 4p, 16p, 100p in the Vector era; 32p, 250p, 1000p in the RISC era; 10,000p in the x86 era; about 48,000 processors by 2015.]

Heterogeneous Systems
• More sites are building "clusters of clusters", e.g.:
  – Sub-cluster 1: 2-socket nodes with small memory
  – Sub-cluster 2: 2-socket nodes with large memory
  – Sub-cluster 3: 4-socket nodes with very large memory
  – Sub-cluster 4: nodes with accelerators, etc.
• Consistent with the observed partial shift to accelerators
  – Peaking at ~30% of the aggregate Rmax of the list in 2013-2015
    • Under 20% for the November 2016 list (after excluding x86 contributions)
  – Peaking at ~20% of the systems in 2015, now under 17%
    • No more than 16% of systems have had >2/3 of their Rpeak in accelerators
  – Accelerators have been split between many-core and GPU

Part 2: TECHNOLOGY TRENDS & SYSTEM BALANCES

The STREAM Benchmark
• Created in 1991 while on the faculty of the University of Delaware College of Marine Studies
• Intended to be an extremely simplified representation of the low-compute-intensity, long-vector operations characteristic of ocean circulation models
• Widely used for research, testing, and marketing
• Almost 1100 results in the database at the main site
• Hosted at www.cs.virginia.edu/stream/

The STREAM Benchmark (2)
• Four kernels, separately timed:
    Copy:  C[i] = A[i];                 i = 1..N
    Scale: B[i] = scalar * C[i];        i = 1..N
    Add:   C[i] = A[i] + B[i];          i = 1..N
    Triad: A[i] = B[i] + scalar*C[i];   i = 1..N
• "N" is chosen to make each array >> cache size
• Repeated (typically) 10 times, with the first iteration ignored
• Min/Avg/Max timings are reported; the best time is used for bandwidth
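To make the kernel definitions and timing rules above concrete, here is a minimal sketch in C of a Triad-style measurement. It is an illustration only, not the official STREAM source: the array size, the scalar value, the single-threaded loop, and the single timed repetition are arbitrary choices for this example.

    #include <stdio.h>
    #include <time.h>

    #define N 20000000   /* illustrative size: each array is 160 MB, far larger than any cache */

    static double A[N], B[N], C[N];

    int main(void)
    {
        const double scalar = 3.0;
        for (long i = 0; i < N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.5; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* Triad: two reads and one write per index */
        for (long i = 0; i < N; i++)
            A[i] = B[i] + scalar * C[i];

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double seconds = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);

        /* STREAM's counting rules charge 24 Bytes of traffic per index for Triad
           (see the traffic table below); write-allocate traffic is not counted. */
        double gb = 24.0 * (double)N / 1e9;
        printf("Triad: %.2f GB/s (single run, no warm-up iteration)\n", gb / seconds);
        return 0;
    }

Compile with optimization (e.g., gcc -O2) and, as the real benchmark does, repeat the timed loop several times and keep the best time.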
The STREAM Benchmark (3)
• Assumed memory traffic per index:
    Copy:  C[i] = A[i];                 16 Bytes
    Scale: B[i] = scalar * C[i];        16 Bytes
    Add:   C[i] = A[i] + B[i];          24 Bytes
    Triad: A[i] = B[i] + scalar*C[i];   24 Bytes
• Many systems will read data before updating it ("write allocate" policy)
  – STREAM ignores this extra traffic if present

What are "System Balances"?
• "Performance" can be viewed as an N-dimensional vector of "mostly-orthogonal" components, e.g.:
  – Core performance (FLOPs): LINPACK
  – Memory Bandwidth: STREAM
  – Memory Latency: lmbench / lat_mem_rd
  – Interconnect Bandwidth: osu_bw, osu_bibw
  – Interconnect Latency: osu_latency
• System balances are the ratios of these components

Performance Component Trends
1. Peak FLOPS per socket increasing at 50%-60% per year
2. Memory Bandwidth increasing at ~23% per year
3. Memory Latency increasing at ~4% per year
4. Interconnect Bandwidth increasing at ~20% per year
5. Interconnect Latency decreasing at ~20% per year
• These ratios suggest that processors should be increasingly imbalanced with respect to data motion
  – Today's talk focuses on (1), (2), and a bit of (3)

[Chart: Memory Bandwidth is Falling Behind: (GFLOP/s) / (GWord/s). Balance ratio (FLOPS per memory access) versus date of introduction, 1990-2020, log scale from 1 to 10,000, for RISC systems (IBM, MIPS, Alpha) and x86-64 systems (AMD & Intel). Peak FLOPS per word of sustained memory bandwidth grows at +14.2%/year, doubling every 5.2 years.]

[Chart: Memory Latency is much worse: (GFLOP/s) / (Memory Latency). Adds peak FLOPS per idle memory latency, which grows at +24.5%/year, doubling every 3.2 years.]

[Chart: Interconnect Bandwidth is Falling Behind at a comparable rate. Adds peak FLOPS per word of sustained network bandwidth, which grows at +22.3%/year, doubling every 3.4 years.]

[Chart: Interconnect latency follows a similar trend. Adds peak FLOPS (core) * network latency and peak FLOPS (chip) * network latency, on a log scale reaching 10,000,000.]

[Chart: STREAM Balance (FLOPS/Word): Mainstream vs ManyCore vs GPGPU. Compares 2012 product releases (Xeon E5-2690 "Sandy Bridge EP", Xeon Phi SE10P "KNC, GDDR5", NVIDIA K20x "Kepler, GDDR5") with 2016 product releases (Xeon E5-2690 v4 "Broadwell", Xeon Phi 7250 "KNL, MCDRAM", NVIDIA P100 "Pascal, HBM2"); balance ratios span roughly 0 to 100 FLOPS/Word.]
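As a worked illustration of the balance ratios plotted above, the sketch below computes FLOPS per 64-bit word of sustained memory bandwidth. The peak GFLOP/s and STREAM GB/s inputs are assumed round numbers for a hypothetical socket, not values taken from the charts.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed example inputs; substitute measured values for a real system */
        double peak_gflops    = 500.0;   /* peak double-precision GFLOP/s per socket */
        double stream_gb_s    = 65.0;    /* sustained STREAM Triad bandwidth in GB/s */
        double bytes_per_word = 8.0;     /* 64-bit words */

        double gwords_per_s = stream_gb_s / bytes_per_word;
        double balance      = peak_gflops / gwords_per_s;   /* FLOPS per sustained memory access */

        printf("Machine balance: %.1f FLOPS per 64-bit word of memory traffic\n", balance);
        /* A loop performing fewer FLOPS per word than this ratio is limited by memory bandwidth */
        return 0;
    }

Comparing this ratio with the arithmetic intensity of a loop (FLOPS performed per word of memory traffic) indicates whether that loop will be compute-bound or bandwidth-bound.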
Why are FLOPS increasing so fast?
• Peak FLOPS per package is the product of several terms:
  – Frequency
  – FP operations per cycle per core
    • Product of the number of FP units, the SIMD width of each unit, and the complexity of the FP instructions (e.g., separate ADD & MUL vs FMA)
  – Number of cores per package
• Low-level semiconductor technology tends to drive these terms at different rates…

[Chart: Intel Processor GFLOPS/Package Contributions over time, 2003-2019, from Pentium 4 through Core 2, Nehalem/Westmere, Sandy Bridge/Ivy Bridge, Haswell/Broadwell, and Skylake Xeon. Stacked base-10 log contributions of GFLOPS from frequency, FP operations per cycle, and cores per socket: base frequencies drift down from roughly 2.8-3.3 GHz to roughly 1.8-2.1 GHz, FP operations per cycle grow from 2 to 32, and cores per socket grow from 2 to roughly 24.]

[Chart: Intel Processor GFLOPS/Package Contributions: Mainstream vs ManyCore. In 2012, Xeon Phi "Knights Corner" (60 cores at 1.05 GHz) delivers about 5.2x the peak GFLOPS of Xeon E5 v1 "Sandy Bridge" (8 cores at 3.0 GHz); in 2016, Xeon Phi "Knights Landing" (72 cores at 1.4 GHz) delivers about 6.8x the peak GFLOPS of Xeon E5 v4 "Broadwell" (14 cores at 2.1 GHz). The many-core parts trade frequency for core count and wider FP units.]

Why is Memory Bandwidth increasing slowly?
• Slow rate of pin speed improvements
  – Emphasis has been on increasing capacity, not increasing bandwidth
  – A shared-bus architecture (multiple DIMMs per channel) is very hard at high frequencies
• DRAM cell cycle time almost unchanged in 20 years
  – Speed increases require increasing transfer sizes
  – DDR3/DDR4 DIMMs have a minimum transfer size of 64 Bytes
• Slow rate of increase in interface width
  – Pins cost money!

Why is Memory Latency stagnant or growing?
• More levels in the cache hierarchy
  – Many lookups are serialized to save power
• More asynchronous clock-domain crossings
  – Many different clock domains to save power
  – Snoop (6 crossings): Core -> Ring -> QPI -> Ring -> QPI -> Ring -> Core
  – Local memory (4 crossings): Core -> Ring -> DDR -> Ring -> Core
  – Remote memory (8 crossings): Core -> Ring -> QPI -> Ring -> DDR -> Ring -> QPI -> Ring -> Core
• More cores to keep coherent
  – Challenging even on a single mainstream server chip
  – Two-socket system latency is typically dominated by coherence, not data
  – Many-core chips have much higher latency
• Decreasing frequencies!

Why is Interconnect Bandwidth growing slowly?
• Slow rate of pin speed improvements
  – About 20%/year
• Reluctance to increase interface width
  – Switch chips are typically pin-limited, so wider interfaces mean fewer ports
  – Parallel links require more switches: too expensive, and they do not always provide improved real-world bandwidth

Why is Interconnect Latency improving slowly?
• Legacy I/O architecture was designed around disks, not communications
  – Control operations use un-cached loads/stores: hundreds of ns per operation, with no concurrency
  – Interrupt-driven processing requires many thousands of cycles per transaction
• Mismatch between SW requirements and HW capabilities

A different implication of these technology trends:
LATENCY, BANDWIDTH, AND CONCURRENCY

Latency, Bandwidth, and Concurrency
• "Little's Law" from queuing theory describes the relationship between latency (or occupancy), bandwidth, and concurrency.
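Little's Law can be applied directly as Concurrency = Latency x Bandwidth. The short sketch below estimates how many 64-Byte cache lines must be in flight to sustain a target bandwidth; the 80 ns latency and 100 GB/s bandwidth are assumed round numbers for illustration, not figures from the talk.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed example values for illustration only */
        double latency_s     = 80e-9;   /* idle memory latency: 80 ns           */
        double bandwidth_Bps = 100e9;   /* target sustained bandwidth: 100 GB/s */
        double line_bytes    = 64.0;    /* cache line size                      */

        /* Little's Law: bytes in flight = latency * bandwidth */
        double bytes_in_flight = latency_s * bandwidth_Bps;
        double lines_in_flight = bytes_in_flight / line_bytes;

        printf("%.0f Bytes in flight = %.0f concurrent cache-line transfers\n",
               bytes_in_flight, lines_in_flight);
        return 0;
    }

With these assumed numbers the answer is 8000 Bytes, or 125 concurrent cache-line transfers. Since a single core can typically track only on the order of ten outstanding cache-line misses, reaching bandwidths of this magnitude requires many cores and aggressive hardware prefetching, which is the point of the concurrency discussion.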