ECE/CS 757: Advanced Computer Architecture II
Instructor: Mikko H. Lipasti
Spring 2009
University of Wisconsin‐Madison
Lecture notes based on slides created by John Shen, Mark Hill, David Wood, Guri Sohi, Jim Smith, Natalie Enright Jerger, and probably others

Review of 752
• Iron law
• Beyond pipelining
• Superscalar challenges
• Instruction flow
• Register data flow
• Memory data flow
• Modern memory interface

Iron Law

Processor Performance = Time / Program
                      = Instructions/Program x Cycles/Instruction x Time/Cycle
                        (code size)            (CPI)                (cycle time)

Architecture --> Implementation --> Realization
(Compiler designer)  (Processor designer)  (Chip designer)

• Instructions/Program
  – Instructions executed, not static code size
  – Determined by algorithm, compiler, ISA
• Cycles/Instruction
  – Determined by ISA and CPU organization
  – Overlap among instructions reduces this term
• Time/Cycle
  – Determined by technology, organization, clever circuit design

Our Goal
• Minimize time, which is the product, NOT isolated terms
• A common error is to miss terms while devising optimizations
  – E.g. an ISA change that decreases instruction count BUT leads to a CPU organization that makes the clock slower
• Bottom line: the terms are inter‐related

Pipelined Design
• Motivation: increase throughput with little increase in hardware
• Bandwidth or throughput = performance
  – Bandwidth (BW) = no. of tasks/unit time
  – For a system that operates on one task at a time: BW = 1/delay (latency)
• BW can be increased by pipelining if many repetitions of the same task are to be performed (many operands need the same operation)
• Latency required for each task remains the same or may even increase slightly

Ideal Pipelining
• Combinational logic with n gate delays: BW = ~1/n
• Add a latch to split it into 2 stages of n/2 gate delays each: BW = ~2/n
• Split into 3 stages of n/3 gate delays each: BW = ~3/n
• Bandwidth increases linearly with pipeline depth
• Latency increases by the latch delays

Example: Integer Multiplier
• 16x16 combinational multiplier
• ISCAS‐85 C6288 standard benchmark [Source: J. Hayes, Univ. of Michigan]
• Tools: Synopsys DC / LSI Logic 110nm gflxp ASIC

Configuration | Delay  | MPS         | Area (FF/wiring)  | Area increase
Combinational | 3.52ns | 284         | 7535 (--/1759)    | --
2 stages      | 1.87ns | 534 (1.9x)  | 8725 (1078/1870)  | 16%
4 stages      | 1.17ns | 855 (3.0x)  | 11276 (3388/2112) | 50%
8 stages      | 0.80ns | 1250 (4.4x) | 17127 (8938/2612) | 127%

Pipeline efficiency:
• 2‐stage: nearly double the throughput; marginal area cost
• 4‐stage: 75% efficiency; area still reasonable
• 8‐stage: 55% efficiency; area more than doubles
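As a quick check of the numbers above, the throughput and efficiency arithmetic can be reproduced in a few lines of Python (a sketch of mine, not part of the original slides; throughput is taken as 1/cycle time, and efficiency as speedup divided by stage count):

```python
# Throughput/efficiency arithmetic for the pipelined C6288 multiplier.
# Delays (ns) are the post-synthesis critical paths from the table above.
configs = {1: 3.52, 2: 1.87, 4: 1.17, 8: 0.80}

base_mps = 1e3 / configs[1]            # 1/3.52 ns ~ 284 million mults/sec
for stages, delay in configs.items():
    mps = 1e3 / delay                  # throughput = 1 / cycle time
    speedup = mps / base_mps
    efficiency = speedup / stages      # ideal pipelining would give 100%
    print(f"{stages} stage(s): {mps:4.0f} MPS, "
          f"{speedup:.1f}x speedup, {efficiency:.0%} efficient")
```

The falloff in efficiency previews the idealisms below breaking down: latch overhead and imperfect stage balancing keep an n‐stage pipeline from delivering a full n‐fold speedup.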
Pipelining Idealisms
• Uniform subcomputations
  – Can pipeline into stages with equal delay
  – Balance pipeline stages
• Identical computations
  – Can fill pipeline with identical work
  – Unify instruction types (example in 752 notes)
• Independent computations
  – No relationships between work units
  – Minimize pipeline stalls
• Are these practical?
  – No, but we can get close enough to obtain significant speedup

Instruction Pipelining
• The "computation" to be pipelined:
  – Instruction Fetch (IF)
  – Instruction Decode (ID)
  – Operand(s) Fetch (OF)
  – Instruction Execution (EX)
  – Operand Store (OS)
  – Update Program Counter (PC)
• Based on "obvious" subcomputations

Generic Instruction Pipeline
1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Operand Fetch (OF)
4. Instruction Execute (EX)
5. Operand Store (OS)

Program Dependences
• A true dependence between two instructions may involve only one subcomputation of each instruction. (Figure: instructions i1, i2, i3 with dependence arcs connecting individual subcomputations.)
• The implied sequential precedences are an overspecification: sufficient, but not necessary, to ensure program correctness.

Program Data Dependences
(D(i) = locations written by instruction i; R(i) = locations read by i.)
• True dependence (RAW): D(i) ∩ R(j) ≠ ∅
  – j cannot execute until i produces its result
• Anti‐dependence (WAR): R(i) ∩ D(j) ≠ ∅
  – j cannot write its result until i has read its sources
• Output dependence (WAW): D(i) ∩ D(j) ≠ ∅
  – j cannot write its result until i has written its result

Control Dependences
• Conditional branches
  – A branch must execute to determine which instruction to fetch next
  – Instructions following a conditional branch are control dependent on the branch instruction

Resolution of Pipeline Hazards
• Pipeline hazards
  – Potential violations of program dependences
  – Must ensure program dependences are not violated
• Hazard resolution
  – Static: compiler/programmer guarantees correctness
  – Dynamic: hardware performs checks at runtime
• Pipeline interlock
  – Hardware mechanism for dynamic hazard resolution
  – Must detect and enforce dependences at runtime

IBM RISC Experience [Agerwala and Cocke 1987]
• Internal IBM study: limits of a scalar pipeline?
• Memory bandwidth
  – Fetch 1 instr/cycle from I‐cache
  – 40% of instructions are load/store (D‐cache)
• Code characteristics (dynamic)
  – Loads – 25%
  – Stores – 15%
  – ALU/RR – 40%
  – Branches – 20%
    • 1/3 unconditional (always taken)
    • 1/3 conditional taken, 1/3 conditional not taken

IBM Experience
• Cache performance
  – Assume 100% hit ratio (upper bound)
  – Cache latency: I = D = 1 cycle default
• No cache bypass of RF, no load/branch scheduling
  – Load penalty: 2 cycles: 0.25 x 2 = 0.5 CPI
  – Branch penalty: 2 cycles: 0.2 x 2/3 x 2 = 0.27 CPI
  – Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI
• Bypass, no load/branch scheduling
  – Load penalty: 1 cycle: 0.25 x 1 = 0.25 CPI
  – Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI

CPI Optimizations
• Goal and impediments
  – CPI = 1, prevented by pipeline stalls
• Load and branch scheduling
  – Loads
    • 25% cannot be scheduled (delay slot empty)
    • 65% can be moved back 1 or 2 instructions
    • 10% can be moved back 1 instruction
  – Branches
    • Unconditional – 100% schedulable (fill one delay slot)
    • Conditional – 50% schedulable (fill one delay slot)

More CPI Optimizations
• Bypass, scheduling of loads/branches
  – Load penalty:
    • 65% + 10% = 75% moved back, no penalty
    • 25% => 1 cycle penalty
    • 0.25 x 0.25 x 1 = 0.0625 CPI
  – Branch penalty:
    • 1/3 unconditional, 100% schedulable => 1 cycle
    • 1/3 conditional not‐taken => no penalty (predict not‐taken)
    • 1/3 conditional taken, 50% schedulable => 1 cycle
    • 1/3 conditional taken, 50% unschedulable => 2 cycles
    • 0.2 x [1/3 x 1 + 1/3 x 0.5 x 1 + 1/3 x 0.5 x 2] = 0.167 CPI
• Total CPI: 1 + 0.063 + 0.167 = 1.23 CPI
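The CPI accounting in the last three slides is easy to check mechanically. The short Python sketch below (my own restatement, not from the original slides) redoes the arithmetic for the stated instruction mix of 25% loads and 20% branches, with branches split 1/3 unconditional, 1/3 conditional taken, 1/3 conditional not taken:

```python
# Reproducing the scalar-pipeline CPI estimates from the slides.
LOADS, BRANCHES = 0.25, 0.20
TAKEN = 2 / 3        # unconditional + conditional-taken branches

# 1) No bypass, no scheduling: 2-cycle load and taken-branch penalties.
cpi_base = 1 + LOADS * 2 + BRANCHES * TAKEN * 2
print(f"{cpi_base:.2f}")                    # -> 1.77

# 2) Register-file bypass halves the load penalty to 1 cycle.
cpi_bypass = 1 + LOADS * 1 + BRANCHES * TAKEN * 2
print(f"{cpi_bypass:.2f}")                  # -> 1.52

# 3) Delay-slot scheduling: 75% of loads hide their penalty entirely;
#    branch penalties follow the per-category breakdown above.
load_pen = LOADS * 0.25 * 1
branch_pen = BRANCHES * (1/3 * 1 + 1/3 * 0.5 * 1 + 1/3 * 0.5 * 2)
print(f"{1 + load_pen + branch_pen:.2f}")   # -> 1.23
```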
Simplify Branches
• Assume 90% of branches can be PC‐relative
  – No register indirect, no register access
  – Separate branch target adder (like the MIPS R3000)
  – Branch penalty reduced

PC-relative | Schedulable | Penalty
Yes (90%)   | Yes (50%)   | 0 cycles
Yes (90%)   | No (50%)    | 1 cycle
No (10%)    | Yes (50%)   | 1 cycle
No (10%)    | No (50%)    | 2 cycles

• Total CPI: 1 + 0.063 + 0.085 = 1.15 CPI = 0.87 IPC
  – i.e., 15% overhead from program dependences

Limits of Pipelining
• IBM RISC Experience
  – Control and data dependences add 15%
  – Best case CPI of 1.15, IPC of 0.87
  – Deeper pipelines (higher frequency) magnify the dependence penalties
• This analysis assumes 100% cache hit rates
  – Hit rates approach 100% for some programs
  – Many important programs have much worse hit rates

Processor Performance

Processor Performance = Time / Program
                      = Instructions/Program x Cycles/Instruction x Time/Cycle
                        (code size)            (CPI)                (cycle time)

• In the 1980s (decade of pipelining): CPI 5.0 => 1.15
• In the 1990s (decade of superscalar): CPI 1.15 => 0.5 (best case)
• In the 2000s (decade of multicore): core CPI unchanged; chip CPI scales with the number of cores

Limits on Instruction Level Parallelism (ILP)

Study                      | Reported ILP
Weiss and Smith [1984]     | 1.58
Sohi and Vajapeyam [1987]  | 1.81
Tjaden and Flynn [1970]    | 1.86 (Flynn's bottleneck)
Tjaden and Flynn [1973]    | 1.96
Uht [1986]                 | 2.00
Smith et al. [1989]        | 2.00
Jouppi and Wall [1988]     | 2.40
Johnson [1991]             | 2.50
Acosta et al. [1986]       | 2.79
Wedig [1982]               | 3.00
Butler et al. [1991]       | 5.8
Melvin and Patt [1991]     | 6
Wall [1991]                | 7 (Jouppi disagreed)
Kuck et al. [1972]         | 8
Riseman and Foster [1972]  | 51 (no control dependences)
Nicolau and Fisher [1984]  | 90 (Fisher's optimism)

Superscalar Proposal
• Go beyond the single‐instruction pipeline; achieve IPC > 1
• Dispatch multiple instructions per cycle
• Provide a more generally applicable form of concurrency (not just vectors)
• Geared for sequential code that is hard to parallelize otherwise
• Exploit fine‐grained or instruction‐level parallelism (ILP)

Limitations of Scalar Pipelines
• Scalar upper bound on throughput
  – IPC <= 1 or CPI >= 1
• Inefficient unified pipeline
  – Long latency for each instruction
• Rigid pipeline stall policy
  – One stalled instruction stalls all newer instructions

Parallel Pipelines
(Figure: (a) no parallelism, (b) temporal parallelism, (c) spatial parallelism, (d) a parallel pipeline combining both.)

Power4 Diversified Pipelines
(Figure: the Power4 front end of PC, I‐cache, fetch queue, branch scan, branch predict, and decode feeds issue queues for the FP, FX/LD 1, FX/LD 2, and BR/CR pipelines, with a reorder buffer tracking in‐flight instructions.)

Rigid Pipeline Stall Policy
(Figure: a stalled instruction causes backward propagation of stalling to all younger instructions; bypassing of the stalled instruction is not permitted.)
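To make the rigid-stall limitation concrete, here is a minimal toy model (my own illustration, not from the slides; the latency values are hypothetical) comparing a rigid in-order pipeline, where one stalled instruction holds up everything behind it, against an idealized machine that lets independent younger instructions slip past:

```python
# Toy model: one long-latency instruction (say, a cache miss) in a
# stream of otherwise independent single-cycle ops.

def rigid(lats):
    # Rigid stall policy: each instruction occupies the issue stage for
    # its full latency, so every younger instruction waits behind it.
    # Total cycles = one issue slot + (lat - 1) stall cycles each.
    return sum(lats)

def bypassing(lats):
    # Idealized alternative: independent younger instructions bypass
    # the stalled one, so issue continues at one per cycle and the
    # long latency is overlapped with useful work.
    return max(issue + lat for issue, lat in enumerate(lats))

lats = [1, 1, 10, 1, 1, 1, 1, 1, 1, 1]   # hypothetical latency mix
print(rigid(lats))      # 19 cycles: the miss delays every younger op
print(bypassing(lats))  # 12 cycles: the miss overlaps with later ops
```

Even this crude model shows why the superscalar designs discussed next abandon the rigid single pipe in favor of parallel, diversified pipelines.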