Parallel Architectures
Parallel architectures
Denis Barthou, [email protected]
Parallel architectures, 2014-2015, D. Barthou

1- Objectives of this lecture
● Analyze and understand how parallel machines work
● Study modern parallel architectures
● Use this knowledge to write better code

Outline
1. Introduction
2. Unicore architecture: pipeline, OoO, superscalar, VLIW, branch prediction, ILP limits
3. Vectors: definition, vectorization
4. Memory and caches: principle, caches, multicores and optimization
5. New architectures and accelerators

1- Parallelism
Many services and machines are already parallel:
● Internet and server infrastructures
● Databases
● Games
● Sensor networks (cars, embedded equipment, …)
● ...
What's new?
● Parallelism everywhere
● Dramatic increase of parallelism inside a compute node

1- Multicore/manycore
Many cores are already there:
● Nvidia Kepler: 192 cores, 7.1 billion transistors
● Intel Tera chip, 2007 (80 cores)
● Intel SCC, 2010 (48 cores)
● Many Integrated Core (MIC), or Xeon Phi (60 cores)

1- Why so many cores? Moore's law
Every 18 months, the number of transistors doubles, at the same cost (1965).
This exponential law applies to:
● Processor performance
● Memory & disk capacity
● Wire size
● Heat dissipated

1- Moore's law, limiting factor: W
Dynamic power: W = C·V²·f (capacitance × voltage² × frequency)

1- Impacts
● No more increase in frequency
● Increase in core number
● Dissipated heat goes down
● Performance per core stalls
1- Multicore strategy
● The technological choice by default
● Need for software improvements:
– Hide HW complexity
– Find parallelism, a lot of it, and efficiently!
● All applications will run on parallel machines:
– Parallel machines for HPC: worth hand-tuning application codes for performance
– End-user machines: tuning does not pay off; is parallelism worth it?

1- Don't forget: Amdahl's law
Measures:
● Speed-up: T1 / Tp
● Efficiency: T1 / (p·Tp)
Amdahl's law: if f is the fraction of the code that runs in parallel, the maximum speed-up on p processors is 1 / ((1 – f) + f/p).
In practice, scalability is not so good, due to higher synchronization & communication costs as p increases.

1- Don't forget: soft+hard interactions
● Performance is obtained through interactions between:
– the compiler,
– the OS,
– libraries and runtimes,
– the HW.

Looking at the future of performance
Performance for scientific codes on TOP500 machines (figures)

2- Unicore
● Current processors: ~1 billion transistors
– working in parallel
● How does the hardware organize/express this parallelism?
– Find parallelism between instructions (ILP): the goal of architects until 2002
– Hide this parallelism from everyone (user, compiler, runtime)

2- Unicore
● Mechanisms for finding ILP:
– Pipeline: slice execution into several steps
– Superscalar: execute multiple instructions in the same cycle
– VLIW: read bundles of instructions and execute them in the same cycle
– Vectors: one instruction, multiple data (SIMD)
– Out-of-order execution: independent instructions executed in parallel

2-a Pipeline
Washing machines (D. Patterson): 30 min washing, 40 min drying, 20 min folding (3 stages).
● Not pipelined: 90 min/person, bandwidth: 1 person / 90 min
● Pipelined: 120 min/person, bandwidth: 1 person / 40 min
Each step takes the time of the longest step (in sync).
Speed-up increases with the number of stages.

2-a Pipeline
● Pipeline with 5 stages, MIPS (IF/ID/Ex/Mem/WB)
● 1 cycle/instruction
● Superscalar: less than 1 cycle/instruction

2-a Pipeline
Issues for pipelines (hazards):
● Data dependences: a value computed by one instruction is used by another instruction
● Branches
Solutions:
● Forwarding: pass the value to the unit as soon as it is available in the pipeline
● Stall
● Speculation
Superscalar execution increases the probability of hazards.

2-b Superscalar
Scalar pipeline vs. superscalar pipeline (figure)

2-b Superscalar architecture
Key features:
● Many instructions issued in the same cycle
● Multiple functional units
Adaptations:
● High risk of dependences
● Everything is more complex
– Important penalty in case of a stall
● HW mechanisms:
– Register renaming, OoO
– Branch prediction

2-b Out of order
● Main idea: let instructions execute even when one instruction stalls
Issues for OoO:
● Interruptions? Completion order of instructions and side effects?

2-b Out of order: example of the MIPS 10k (figure)

2-b Out of order: Tomasulo algorithm
5 steps:
● Dispatch: take an instruction from the queue and put it in the ROB.
Update the registers to write.
● Issue: wait for the operands to be ready.
● Execute: the instruction runs in the pipeline.
● Write result (WB): write the result on the Common Data Bus to update its value and let the instructions that depend on it execute.
● Commit: update the register with the ROB value. When the first instruction of the ROB (a queue) has terminated, remove it.

2-b Out of order: implementation
HW needs:
● A buffer for pending instructions: the reorder buffer (ROB)
● Written registers are renamed (removes WAR and WAW dependences)

2-b Out of order
● Currently:
– Pipelines with more than 10 stages
– 6-8 instructions/cycle => many instructions in flight
● OoO & ILP: dependence computation, dynamic scheduling, register renaming, ROB and dispatch buffer: as if instructions were executed sequentially
● To avoid stalls:
– Speculation on memory dependences
– Speculative branches, delay slots
● Complexity of the OoO mechanism: quadratic in the number of instructions...

2-b Out of order: performance impact
● Does it pay off? (figures)
2-c Very Long Instruction Word (VLIW)
Key features:
● Instructions are packed statically into bundles, in the asm code; the instructions of a bundle are executed in the same cycle
● The number of instructions per bundle is fixed
● The compiler creates the bundles

2-c VLIW
● The compiler has to ensure that:
– instructions are only scheduled when their operands are ready
– Time