
Computer Architecture and Performance Tuning

“Recent progress in Architectures and the 7 Dimensions of Performance”

Vincenzo Innocente

Original slides by Sverre Jarp, CERN Honorary Staff

ESC 2016 – Bertinoro, Italy – October 2016

Goal of these lectures

1. Give an understanding of modern computer architectures from a performance point-of-view § Processor, memory subsystem, caches § Use x86-64 as a de-facto standard § But keep an eye on ARM64, POWER(-PC), as well as GPUs/accelerators

2. Explain hardware factors that improve or degrade program execution speed § Prepare for writing well-performing software

Contents

§ Introduction: § Setting the Scene; Scaling “laws” § Complexity in Computing

§ Basic Architecture

§ Memory subsystem

§ Performance Dimensions: § Vectorisation § Instruction level parallelism § Multi-core parallelisation

§ Conclusion


The Big Issues (from an architectural viewpoint)

Where are we coming from?

§ Von Neumann architecture (since forever)

§ Memory: § Single, homogeneous memory § Low latency
[Portrait: John von Neumann (1903–57). Source: Wikipedia]

§ Primitive machine code (assembly)

§ CPU scaling: § Moore’s law (1965) § Dennard scaling (1974)

§ Little or no parallelism
[Portrait: Robert Dennard (IBM). Source: Wikipedia]

Von Neumann architecture
§ From Wikipedia: The von Neumann architecture is a computer design model that uses a processing unit and a single separate storage structure to hold both instructions and data.
§ It can be viewed as an entity into which one streams instructions and data in order to produce results.
[Diagram: algorithms and data structures → instructions and data → input → processing → results]

Von Neumann architecture (cont’d)

§ The goal is to produce results as fast as possible
§ But, lots of problems can occur: § Instructions or data don’t arrive in time (bandwidth issues? latency issues?) § Clashes between input data and output data § Other “complexity-based” problems inside an architecture with extreme processing parallelism
[Diagram: the same streaming model, now with caches between input and processing]
Many people think the architecture is out-dated. But nobody has managed to replace it (yet).

Moore’s “law”

§ A marching order established ~50 years ago § “Let’s continue to double the number of transistors every other year!”

§ First published as: § Moore, G.E.: Cramming more components onto integrated circuits. Electronics, 38(8), April 1965.

§ Accepted by all partners: § Semiconductor manufacturers § Hardware integrators § Software companies

§ Us, the consumers
[Plot: transistor counts vs year. Source: Wikipedia]

Semiconductor evolution (as of 2014)
§ Today’s silicon processes: § 28, 22 nm

§ Being introduced: § 14 nm (2013/14)

§ In research: § 10 nm (2015/16) § 7 nm (2017/18)

§ 5 nm (2019/20)
§ 2 nm (2028?, TSMC)
[Plot: process roadmap, from S. Borkar et al. (Intel), “Platform 2015: Intel Platform Evolution for the Next Decade”, 2005. Source: Intel]
§ By the end of this decade we will have chips with ~100’000’000’000 (10^11) transistors!
§ And, this will continue to drive innovation

Frequency scaling…
§ The 7 “fat” years of frequency scaling: § The Pentium Pro in 1996: 150 MHz (12 W) § The Pentium 4 in 2003: 3.8 GHz (~25x) (115 W)

§ Since then § Core 2 systems: § ~3 GHz § Multi-core

§ Recent CERN purchase: § Intel Xeon E5-2630 v3

§ “only” 2.40 GHz (85 W) § 8 cores
[Chart © 2009 Herb Sutter]

…vs Memory Latency

Where are we today?
§ Von Neumann architecture (unchanged)

§ Memory: § Multi-layered, complex layout § Non-uniform; even disjoint § High latency
§ Primitive machine code (unchanged)
§ CPU scaling: § Moore’s law: slowing down § Dennard scaling: practically gone
Things have become worse!

§ Extreme parallelism at all levels § Instruction, Chip, System

Real consequence of Moore’s law
§ We are being “snowed under” by “innovation”:

§ More (and more complex) execution units § Hundreds of new instructions § Longer SIMD/SSE hardware vectors § More and more cores § Specialised accelerators § Complex hierarchies

§ In order to profit we need to “think parallel”

§ “Data Oriented Design”

Let’s start with the basics!

For a review of one of the latest Intel architectures, see http://www.anandtech.com/show/6355/intels-haswell-architecture

Simple processor layout
§ A simple processor with four key components:
§ Control logic, with the instruction counter (IC) and the program status word (PSW)/flags (these keep the state of execution)
§ Functional unit (FU)
§ Data transfer unit
§ Registers (R0, R1, …, Rnn) and the data/address buses

Simple server diagram

§ Multiple components which interact during the execution of a program:
§ Processors/cores § w/private caches (I-cache, D-cache)
§ Shared caches § Instructions and Data
§ Memory controllers
§ Memory (non-uniform)
§ Interconnect
§ I/O subsystem § Network attachment § Disk subsystem
[Diagram: two sockets (Intel Nehalem), each with six cores (hardware threads C0T0, C0T1, …), a shared cache, a memory controller with local memory, an interconnect between sockets, and an I/O bus]

Memory Subsystem

Optimal Memory Programming
§ What needs to be understood:
§ The hardware: § Main memory (physical layout, latency, bandwidth) § Caches (physical layout, line sizes, levels/sharing, latency)
§ The programmer/software: § Data layout § Data locality
§ Execution environment: § Affinity

Cache/Memory Hierarchy

§ From CPU to main memory on a recent Haswell processor (c = cycle):
§ Processor core (registers)
§ L1I / L1D (32 KB each): 4c latency; (R: 64 B + W: 32 B)/1c
§ L2 (256 KB): 11c latency; R: 64 B/1c
§ Shared L3 (~20 MB): > 21c latency; 32 B/1c shared by all cores
§ Local/remote memory (large, but typically non-uniform): > 200c latency; ~24 B/c shared by all cores
§ With multicore, memory bandwidth is shared between cores in the same processor (socket)

Cache lines (1)

§ When a data element or an instruction is requested by the processor, a cache line is ALWAYS moved (as the minimum quantity), usually to Level-1


§ A cache line is a contiguous section of memory, typically 64B in size (8 * double) and 64B aligned § A 32KB Level-1 cache can hold 512 lines

§ When cache lines have to come from memory: § Latency is long (> 200 cycles) § It is even longer if the memory is remote § The memory bus stays busy (~8 cycles)
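To make the 64 B granularity concrete, here is a minimal sketch (not from the original slides; names are illustrative) that prints which 64 B-aligned line each array element falls into:

  #include <cstdint>
  #include <cstdio>

  int main() {
    constexpr std::uintptr_t kLineSize = 64;  // typical x86-64 cache-line size
    double a[16];
    for (int i = 0; i < 16; ++i) {
      auto addr = reinterpret_cast<std::uintptr_t>(&a[i]);
      // Clearing the low 6 bits gives the line base address:
      // consecutive doubles share it in groups of 8 (8 * 8 B = 64 B).
      std::printf("a[%2d] -> line 0x%llx\n", i,
                  (unsigned long long)(addr & ~(kLineSize - 1)));
    }
    return 0;
  }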

Cache lines (2)

§ Good utilisation is vital § When only one element (4 B or 8 B) is used inside the cache line: § A lot of bandwidth is wasted!


§ Multidimensional C arrays should be accessed with the last index changing fastest:

  for (i = 0; i < rows; ++i)
    for (j = 0; j < columns; ++j)
      mymatrix[i][j] += increment;

§ Pointer chasing (in linked lists) can easily lead to “cache thrashing” (too much memory traffic)
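A minimal sketch (not from the slides; illustrative only) contrasting the two access patterns just described: the array sweep uses all 8 doubles of every fetched line, while the linked list may pay a full line fetch per element:

  #include <list>
  #include <numeric>
  #include <vector>

  // Contiguous: one 64 B line fetched serves 8 consecutive doubles.
  double sum_array(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
  }

  // Pointer chasing: each node is a separate allocation, so every
  // element can cost its own cache-line fetch (and often a miss).
  double sum_list(const std::list<double>& l) {
    double s = 0.0;
    for (double x : l) s += x;
    return s;
  }

  int main() {
    std::vector<double> v(1 << 20, 1.0);
    std::list<double> l(v.begin(), v.end());
    return static_cast<int>(sum_array(v) - sum_list(l));  // 0
  }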

Cache lines (3)
§ Prefetching: § Fetch a cache line before it is requested § Hiding latency § Normally done by the hardware § Especially if the processor executes out-of-order § Also done by software instructions § Especially when in-order (IA-64, Xeon Phi, etc.)
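On the software side, a minimal sketch (assuming GCC or Clang, which provide the __builtin_prefetch intrinsic; the distance of 16 elements is a tuning guess, not a rule):

  #include <cstddef>
  #include <vector>

  double sum_with_prefetch(const std::vector<double>& v) {
    constexpr std::size_t kAhead = 16;  // two cache lines ahead
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
      // Ask for a line well before we need it, hiding the latency.
      if (i + kAhead < v.size())
        __builtin_prefetch(&v[i + kAhead], 0 /*read*/, 3 /*high locality*/);
      s += v[i];
    }
    return s;
  }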

§ Locality is vital: § Spatial locality – Use all elements in the line § Temporal locality – Complete the execution whilst the elements are certain to be in the cache

Programming the memory hierarchy is an art in itself.

Cache/Memory Trends

§ The trend is to deepen and diversify the cache/memory hierarchy:
§ Additional levels of cache (e.g. a shared L4)
§ Multiple kinds of large memories
§ Non-volatile memories (great for databases, etc.)
[Diagram: core → L1I/L1D → L2 → shared L3 → shared L4 → local/remote memory (1), (2), non-volatile memory (3); larger and slower going down, faster and smaller going up]

Latency Measurements (example)
§ Memory latency on Sandy Bridge-EP 2690 (dual socket) § 90 ns (local) versus 150 ns (remote)

[Diagram: dual-socket server; each socket has six cores, a shared cache and its own memory controller with local memory, connected by an interconnect and an I/O bus. A core in socket 0 reaching memory attached to socket 1 must cross the interconnect, hence the higher remote latency.]
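To reproduce such a measurement yourself, a sketch (assuming a Linux box with numactl installed; ./a.out stands for any memory-bound test program):

  numactl --cpunodebind=0 --membind=0 ./a.out   # CPU and memory on the same socket (local)
  numactl --cpunodebind=0 --membind=1 ./a.out   # memory on the other socket (remote)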

Recent architectures

[Figure: block diagrams of recent micro-architectures. Source: AnandTech]

Current GPU Memory Layout
§ CPU and GPU memories are separate

§ What everybody wants is a single unified view of memory
[Diagram: separate system memory and GPU memory vs a single unified memory]

§ One vision is “Heterogeneous Systems Architecture” (HSA), pushed by AMD, ARM, and others § AMD Kaveri APU

§ NVidia+IBM: Unified Memory over NVLink

§ Announced in 2016: OpenCAPI (everybody but Intel)


CPU Performance Dimensions

In the days of the Pentium

§ Life in the days of the Pentium was really simple:
§ Basically two dimensions: § The CPU and its frequency § The number of boxes
§ The semiconductor industry increased the frequency
§ We acquired the right number of (single-socket) boxes
[Diagram: the two dimensions (pipelining/superscalar inside the CPU; sockets/nodes across boxes)]

Now: Seven dimensions of performance

§ First three dimensions: § Hardware vectors/SIMD § Superscalar § Pipelining
§ Next dimension is a “pseudo” dimension: § Hardware multithreading
§ Last three dimensions: § Multiple cores § Multiple sockets § Multiple compute nodes
[Diagram axes: vector width, superscalar, pipelining, multithreading, multicore, sockets, nodes]

SIMD = Single Instruction Multiple Data

Seven multiplicative dimensions:
§ First three dimensions:

§ Hardware vectors/SIMD: 2x, 4x, 8x, 16x § Superscalar § Pipelining: together 1x – 10x
(Data and instruction level parallelism: vectors/matrices)

§ Next dimension is a “pseudo” dimension: § Hardware multithreading

§ Last three dimensions: § Multiple cores § Multiple sockets § Multiple compute nodes: 10x – 100x
(Task/process parallelism: events/tracks)

Simple, but illustrative example
§ Xeon Phi has ~64 cores @1.30GHz, 4-way hardware threading, hardware vectors of size 8 (Double Precision), 16GB of fast memory:

§ Program A: Threaded 64 x 4, vectorised 8x: § Performance potential: 2048

§ Program B: Not threaded: 1x, not vectorised: 1x § Performance potential: 1

Streaming Multiprocessor Architecture

NVIDIA Pascal: 32 CUDA cores x 4 x 5 x 4 = 2560 floating-point units @ 1.7 GHz; 8 GB fast memory

Part 1: Opportunities for scaling performance inside a core

§ Here are the first three dimensions
§ The resources: § HW vectors: fill the computational width § Superscalar: fill the ports § Pipelining: fill the stages
§ Best approach: Data Oriented Design
[Diagram: the three axes (HW vector width, superscalar, pipelining)]

First topic: Vector registers
§ Until recently, Streaming SIMD Extensions (SSE): § 16 “XMM” registers with 128 bits each (in 64-bit mode)

§ New (as of 2011): Advanced Vector eXtensions (AVX): § 16 “YMM” registers with 256 bits each

[Register layout: a 256-bit YMM register holds 32 byte elements, 16 words (E15 … E0), 8 dwords/singles (E7 … E0) or 4 qwords/doubles (E3 … E0). Bits 0–127 form the XMM register (SSE); bits 0–255 are used by AVX/AVX2; future: 512 bits (AVX-512)]

Four floating-point data flavours

§ Single precision: § Scalar single (SS): only E0 is used § Packed single (PS): E7 … E0 all used
§ Double precision: § Scalar double (SD): only E0 is used § Packed double (PD): E3 … E0 all used

§ Note: § Scalar mode (with AVX) means using only: § 1/8 of the width (single precision) § 1/4 of the width (double precision)
§ Even longer vectors have been announced! § Definitely 512 bits (already used in the Xeon Phi processors)

Single Instruction Multiple Data
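As a concrete illustration (a minimal sketch, not from the original slides; assumes an AVX-capable CPU and compilation with -mavx), a single packed-double instruction performs four additions at once:

  #include <immintrin.h>
  #include <cstdio>

  int main() {
    alignas(32) double a[4] = {1.0, 2.0, 3.0, 4.0};
    alignas(32) double b[4] = {10.0, 20.0, 30.0, 40.0};
    alignas(32) double c[4];
    __m256d va = _mm256_load_pd(a);             // fill one 256-bit YMM register
    __m256d vb = _mm256_load_pd(b);
    _mm256_store_pd(c, _mm256_add_pd(va, vb));  // one vaddpd = 4 adds (PD mode)
    std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
  }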

Second topic: Superscalar architecture

[Diagram: the back-end execution engine of a modern core]


§ Instructions are loaded into their own cache (the I-cache)
§ They are inserted in a queue and then dispatched to four decoding “stations”
§ A branch predictor chooses which branch to follow § If it is wrong, the queue is drained and needs to be refilled with the correct set of instructions
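A minimal sketch (not from the slides; this is the kind of contrast Branch.cpp in Exercise 4 explores) of code where the predictor matters: on random data the first version is mispredicted roughly half the time and drains the queue each time, while the second is branchless:

  #include <vector>

  // Branchy: every mispredicted 'if' costs a pipeline flush.
  long count_positive_branch(const std::vector<int>& v) {
    long n = 0;
    for (int x : v)
      if (x > 0) ++n;  // unpredictable on random input
    return n;
  }

  // Branchless: the comparison becomes data flow (setcc/cmov),
  // so there is nothing for the predictor to get wrong.
  long count_positive_branchless(const std::vector<int>& v) {
    long n = 0;
    for (int x : v) n += (x > 0);
    return n;
  }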

Second topic: Superscalar architecture

§ In this simplified design, instructions are decoded in sequence, but dispatched to two functional units (FUs)
§ The decoder and dispatcher must be able to handle two instructions per cycle
[Diagram: instruction stream → decode → dispatch → Port 0 (FU 0) and Port 1 (FU 1) → results]

§ The FUs can have identical or different execution capabilities

Enhanced superscalar architecture

§ A more realistic architecture will have multiple FUs hanging off the same port
§ An instruction can be dispatched to either matching execution unit on a given port, but not to both units on the same port in a given cycle
[Diagram: dispatch → Port 0 (FU 0: i-add, FU 2: i-shift) and Port 1 (FU 1: i-add, FU 3: i-mul) → results]


Intel’s Nehalem micro-architecture can execute four instructions in parallel (across six ports) in each cycle.

Latest superscalar architecture
[Diagram: Haswell execution ports.
Ports 0, 1, 5, 6: integer ALUs (with shifts and LEA); ports 0, 1 and 5 also carry the vector integer ALUs, vector logical, shift, shuffle, PSAD and string-compare units.
Ports 0, 1: vector FMA and FMul (plus x87 FP multiply/add); port 1 also has vector FAdd and integer MUL; port 0 has DIV and SQRT.
Ports 2, 3: load data and store address; port 4: store data; port 7: store address.
Ports 0 and 6: branch units.]
§ Intel’s Haswell micro-architecture can execute four instructions in parallel (across eight ports) in each cycle.

Apple A7/A8 (based on ARM A57)

[Diagram: nine execution ports. Two vector FP pipes (Vec FMA, FMul, FAdd) plus FDIV/FSQRT; four integer ALUs (with MUL, DIV and shifts); two load/store ports; two branch units.]
§ Nine ports
§ Six instructions (on average) decoded/executed/retired
§ 128-bit vectors
And, this is for (no more than) a phone?

Based on an article on “anandtech.com” and discussions with ARM

IBM Power8

Matrix multiply example

§ For a given algorithm, we can understand exactly which functional execution units are needed § For instance, in the innermost loop of matrix multiplication

  for ( int i = 0; i < N; ++i ) {
    for ( int j = 0; j < N; ++j ) {
      for ( int k = 0; k < N; ++k ) {
        c[ i * N + j ] += a[ i * N + k ] * b[ k * N + j ];
      }
    }
  }

Until Haswell (2012): Load + Load + Mult + Add + Store

As of Haswell (2013): Load + Load + FMA + Store
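Exercise 1 below asks you to exchange the loop order; as a preview, a sketch (names illustrative) of the common (i, k, j) interchange, which makes the innermost accesses to b and c contiguous, so whole cache lines are used and the loop vectorises:

  void matmul_ikj(const double* a, const double* b, double* c, int N) {
    for (int i = 0; i < N; ++i)
      for (int k = 0; k < N; ++k) {
        const double aik = a[i * N + k];  // invariant in the inner loop
        for (int j = 0; j < N; ++j)       // sweeps consecutive elements
          c[i * N + j] += aik * b[k * N + j];
      }
  }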

Third topic: Pipelining

§ Instructions are broken up into stages. § With a one-cycle execution latency (simplified):

I-fetch  I-decode  Execute  Write-back
    I-fetch  I-decode  Execute  Write-back
        I-fetch  I-decode  Execute  Write-back

§ With a three-cycle execution latency:

I-fetch  I-decode  Exec-1  Exec-2  Exec-3  Write-back
    I-fetch  I-decode  Exec-1  Exec-2  Exec-3  Write-back

Real-life latencies
§ Most integer/logic instructions have a one-cycle execution latency: § For example (on an Intel Xeon processor): § ADD, AND, SHL (shift left), ROR (rotate right)
§ Amongst the exceptions: § IMUL (integer multiply): 3 § IDIV (integer divide): 13 – 23

§ Floating-point latencies are typically multi-cycle § FADD (3), FMUL (5); as of Haswell: FMA (5 cycles) § Same for both x87 and SIMD double-precision variants § Exception: FABS (absolute value): 1 § Many-cycle, no pipeline: FDIV (20), FSQRT (27)

§ Other math functions: even more
(Latencies are for the Core micro-architecture, Intel Manual No. 248966-026 or later. AMD processor latencies are similar.)

Latencies and serial code (1)
§ In serial programs, we typically pay the penalty of a multi-cycle latency during execution
§ In this example: § Statement 2 cannot be started before statement 1 has finished § Statement 3 cannot be started before statement 2 has finished

  double a, b, c, d, e, f;
  b = 2.0; c = 3.0; e = 4.0;

  a = b * c;    // Statement 1
  d = a + e;    // Statement 2
  f = fabs(d);  // Statement 3

I-F  I-D  EX-1  EX-2  EX-3  EX-4  EX-5  W-B
     I-F  I-D   -     -     -     -     EX-1  EX-2  EX-3  W-B
          I-F   I-D   -     -     -     -     -     -     EX-1  W-B

Latencies and serial code (1, cont’d)
§ The same example, with an independent statement (f = b + e) slotted in: it can issue while statement 1 is still executing, so the waiting cycles do useful work

  double a, b, c, d, e, f;
  b = 2.0; c = 3.0; e = 4.0;

  a = b * c;    // Statement 1
  f = b + e;    // independent of Statement 1
  d = a + f;    // Statement 2
  f = fabs(d);  // Statement 3

I-F  I-D  EX-1  EX-2  EX-3  EX-4  EX-5  W-B
     I-F  I-D   EX-1  EX-2  EX-3  W-B
          I-F   I-D   -     -     -     EX-1  EX-2  EX-3  W-B
               I-F    I-D   -     -     -     -     -     EX-1  W-B

Latencies and serial code (2)

I-F  I-D  EX-1  EX-2  EX-3  EX-4  EX-5  W-B
     I-F  I-D   -     -     -     -     EX-1  EX-2  EX-3  W-B
          I-F   I-D   -     -     -     -     -     -     EX-1  W-B

§ Observations: § Even if the processor can fetch and decode a new instruction every cycle, it must wait for the previous result to be made available § Fortunately, the result takes a ‘bypass’, so that the write-back stage does not cause even further delays § The result: CPI is equal to 3 § 9 execution cycles are needed for 3 instructions!
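A minimal sketch (not from the slides; pipeline.cpp in Exercise 2 explores the same idea) of hiding latency by breaking one dependency chain into several independent ones:

  #include <cstddef>
  #include <vector>

  // One chain: each add waits ~3 cycles for the previous result.
  double sum_dependent(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
  }

  // Four chains: four adds are in flight at once, keeping the
  // pipeline stages filled instead of stalling on the latency.
  double sum_independent(const std::vector<double>& v) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
      s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];
    return (s0 + s1) + (s2 + s3);
  }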

§ A good way to hide latency is to [get the compiler to] unroll (vector) loops!

Mini-example of real-life scalar, serial code
§ Suffers long latencies. High-level C++ code:

  if (fabs(point[0] - origin[0]) > xhalfsz) return false;

  if (fabs(point[1] - origin[1]) > yhalfsz) return false;

Machine instructions:

  movsd  16(%rsi), %xmm0                  // load &
  subsd  48(%rdi), %xmm0                  // subtract
  andpd  _2il0floatpacket.1(%rip), %xmm0  // and with a mask
  comisd 24(%rdi), %xmm0                  // load and compare
  jbe    ..B5.3   # Prob 43%              // jump if FALSE

The same instructions laid out according to latencies on the Nehalem processor (NB: out-of-order scheduling not taken into account):

  Cycle  1: load point[0]
  Cycle  2: load origin[0]
  Cycle  6: subsd; load float-packet
  Cycle  8: load xhalfsz
  Cycle 10: andpd; load point[1]
  Cycle 11: load origin[1]
  Cycle 12: comisd
  Cycle 13: jbe

Sequential to parallel: Polynomial
https://godbolt.org/g/svmufZ

Horner Method:

  float r = p[0] +
            y * (p[1] +
            y * (p[2] +
            y * (p[3] +
            y * (p[4] +
            y * (p[5] +
            y * p[6])))));

Estrin Method:

  float p56 = p[5] + y * p[6];
  float p34 = p[3] + y * p[4];
  float y2  = y * y;
  float p12 = p[1] + y * p[2];
  float p36 = p34 + y2 * p56;
  float p16 = p12 + y2 * p36;
  float r   = p[0] + y * p16;

Estrin’s scheme costs a few extra multiplies but creates independent sub-expressions that can execute in parallel, whereas Horner is one long dependency chain.

Pipeline (and parallel, and vector):

  // the loop is truncated in the source; a plausible completion is
  // to apply the polynomial independently to each array element:
  for (int i = 0; i < N; ++i) r[i] = poly(y[i]);

Out-of-order (OOO) scheduling

§ Most modern processors use OOO scheduling § This means that they will speculatively execute instructions ahead of time (Xeon: inside a “window” of ~150 instructions) § In certain cases the results of such executed instructions must be discarded

§ At the end, there is a difference between “executed instructions” and “retired instructions” § One typical reason for this is mispredicted branches

§ Potential problem with OOO: § A lot of extra energy is needed!

§ Interestingly: ARM has two designs: § A53 (low power, in-order), A57 (high power, OOO)

Summary of Last Two Dimensions

§ Commonly referred to as: § Instruction level parallelism (ILP)

§ Very dependent on algorithms and/or data structures

§ Issues are equally valid for vector and scalar computing

§ Multiplies with what we get from all the other dimensions § Threading § Vectorisation

§ But, difficult to understand or manipulate § Both the micro-architecture and the compiler get in the way

Important performance measurements (that can tell you if things go wrong)

§ Related to what we have discussed:
§ The total cycle count (C)
§ The total instruction count (I)
§ Derived value: CPI
§ Total number of cache accesses § And their “location”
§ Total number of memory accesses

§ Plus:
§ The total number (and the type) of computational SSE/AVX instructions
§ The number of divisions
§ Resource stall count: cycles when no execution occurred
§ Total number of (last-level) cache misses
§ Total number of executed branch instructions
§ Total number of mispredicted branches

How to measure them?

§ Linux perf § https://perf.wiki.kernel.org/index.php/Tutorial § http://www.brendangregg.com/perf.html § https://github.com/andikleen/pmu-tools

§ Intel Vtune/Advisor
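For example (a sketch; exact event names can vary between kernels and CPUs), perf stat reports most of the counts from the previous slide directly:

  perf stat -e cycles,instructions,branches,branch-misses ./a.out
  perf stat -e cache-references,cache-misses ./a.out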

57 Sverre Jarp Computer Architecture and Performance Tuning Exercises

Exercise 1
§ Exchange the order of the loops in the matrix multiplication § Use matmul.cpp

§ Compile: § c++ -Wall -O2 -fopt-info-vec

§ Measure. What’s happening? § source doPerf

§ Recompile with: § -O3 (aggressive optimization and vectorization) § -Ofast (allow reordering of math operations) § Add -funroll-loops (force loop unrolling)

§ Change the product into a division

Exercise 2

§ Compare Horner Method with Estrin § Use PolyTest.cpp

§ Change compiler options

§ Compare independent vs dependent ops in loop § Use pipeline.cpp

§ Change compiler options

Exercise 3

§ Virtual function calls in OO code § Use Virtual.cpp

§ Measure in various conditions § Remove “random_shuffle” § Increase number of Derived Classes § Try to change the order in the vector of pointers

Exercise 4

§ Different forms of “branching” in conditional code § Use Branch.cpp

§ Compare different forms of conditional code

§ Compile with -O2, -O3, -Ofast

Seven multiplicative dimensions:
§ First three dimensions:

§ Hardware vectors/SIMD § Superscalar § Pipelining
(Data and instruction level parallelism: vectors/matrices)

§ Next dimension is a “pseudo” dimension: § Hardware multithreading (aka SMT)
§ Last three dimensions: § Multiple cores § Multiple sockets § Multiple compute nodes
(Task/process parallelism: events/tracks)

Definition of a hardware core/thread

§ Core: § A complete ensemble of execution logic and cache storage, as well as register files plus instruction counter (IC), for executing a software process or thread
§ Hardware thread (aka hyper-thread): § Addition of a set of register files plus IC
[Diagram: several states (registers, IC) sharing one set of execution logic and caches]
The sharing of the execution logic can be coarse-grained or fine-grained.

Definition of a software process and thread
§ Process (OS process): § An instance of a program that is being executed (sequentially). It typically runs as a program with its private set of resources, i.e. in its own “address space” with all the program code and data, its own file descriptors with the operating system permissions, its own heap and its own stack.

§ Thread: § A process may have multiple threads of execution. These threads run in the same address space and share the same program code and operating system resources as the process they belong to. Each thread gets its own stack.

Adapted from Wikipedia

The move to many-core systems

§ Examples of “CPU slots”: Sockets * Cores * HW-threads § Basically what you observe in “cat /proc/cpuinfo”
§ Conservative: § Dual-socket AMD six-core (Istanbul): 2 * 6 * 1 = 12 § Dual-socket Intel six-core (Westmere-EP): 2 * 6 * 2 = 24
§ More aggressive: § Quad-socket AMD Interlagos (16-core): 4 * 16 * 1 = 64 § Quad-socket Broadwell-EX (24-core): 4 * 24 * 2 = 192
§ Already now: hundreds (or thousands) of CPU slots! § Octo-socket Oracle/Sun Niagara (T5) processors w/16 cores and 8 threads (each): 8 * 16 * 8 = 1024
§ So, if you write new software, think: thousands!!
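To query the same number from a program, a minimal sketch using standard C++ (it reports what “cat /proc/cpuinfo” lists as the count of CPU slots):

  #include <iostream>
  #include <thread>

  int main() {
    // sockets * cores * hardware threads, as seen by the OS
    std::cout << std::thread::hardware_concurrency()
              << " hardware threads available\n";
    return 0;
  }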

“Current” generation: “Knights Landing”
§ Being prepared for (late?) 2016 using 14 nm technology. 3 Tflops peak.

§ Both as PCI-based and bootable single-socket system

§ New ATOM-based (out-of-order) core [up to 72 in total]

§ Memory: A combination of on-package MCDRAM (fast, small) and DDR4 (slow, large)

§ Mesh fabric interconnect § Rather than ring bus

§ Converged instruction set § AVX-512 [aka AVX3.1]

NVIDIA roadmap
§ A promise of continued growth:

Adapted from Nvidia

IBM roadmap

Fast “heterogeneous” interconnect


A supercomputer (planned CINECA upgrade)

Another planned supercomputer


Summing Up

So, what does it all mean?
§ Here is what we tried to say in these lectures:

§ You must make sure your data and instructions come from caches (not main memory)

§ You must parallelise across all “CPU slots” § Hardware threads, cores, sockets (10x – 100x)

§ You must get your code to use vectors (2x, 4x, 8x, 16x)

§ You must understand if your ILP is seriously limited by serial code, complex math functions, or other constructs (1x – 10x)

If you think that all of this is “crazy”

§ Please read:

§ “Optimizing matrix multiplication for a short-vector SIMD architecture – CELL processor” § J. Kurzak, W. Alvaro, J. Dongarra § Parallel Computing 35 (2009) 138–150

In this paper, single-precision matrix multiplication kernels are presented implementing the C = C – A x BT operation and the C = C – A x B operation for matrices of size 64x64 elements. For the latter case, the performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures.

Concluding remarks
§ The aim of these lectures was to help understand: § Changes in modern computer architecture § Impact on our programming methodologies
§ Keeping in mind that there is not always a straight path for our programming community to reach (all of) the available performance.
§ Will you be ready for 1000 cores and long vectors?
§ Are you thinking “parallel, parallel, parallel”?
§ It helps to have a good overview of the complexity of the hardware and the seven hardware dimensions in order to get close to the best software design!

Final Exercise

§ Read this paper about the architecture of the current “fastest supercomputer”: § http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf

Further reading:

§ “Designing and Building Parallel Programs”, I. Foster, Addison-Wesley, 1995

§ “Foundations of Multithreaded, Parallel and Distributed Programming”, G.R. Andrews, Addison-Wesley, 1999

§ “Computer Architecture: A Quantitative Approach”, J. Hennessy and D. Patterson, 3rd ed., Morgan Kaufmann, 2002

§ “Patterns for Parallel Programming”, T.G. Mattson, Addison Wesley, 2004

§ “Principles of Concurrent and Distributed Programming”, M. Ben-Ari, 2nd edition, Addison Wesley, 2006

§ “The Software Vectorization Handbook”, A.J.C. Bik, Intel Press, 2006

§ “The Software Optimization Cookbook”, R. Gerber, A.J.C. Bik, K.B. Smith and X. Tian; Intel Press, 2nd edition, 2006

§ “Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism”, J. Reinders, O’Reilly, 1st edition, 2007

§ “Inside the Machine”, J. Stokes, Ars Technica Library, 2007

Thank you!
