ECE/CS 757: Advanced Computer Architecture II
Instructor: Mikko H. Lipasti
Spring 2009
University of Wisconsin-Madison
Lecture notes based on slides created by John Shen, Mark Hill, David Wood, Guri Sohi, Jim Smith, Natalie Enright Jerger, and probably others

Review of 752
• Iron law
• Beyond pipelining
• Superscalar challenges
• Instruction flow
• Register data flow
• Memory Dataflow
• Modern memory interface

Iron Law

Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle
               (code size)            (CPI)                (cycle time)

• Performance = Time/Program
• Instructions/Program
  – Instructions executed, not static code size
  – Determined by algorithm, compiler, ISA
• Cycles/Instruction
  – Determined by ISA and CPU organization
  – Overlap among instructions reduces this term
• Time/cycle
  – Determined by technology, organization, clever circuit design
• Architecture --> Implementation --> Realization
  (Compiler Designer --> Processor Designer --> Chip Designer)

Our Goal

• Minimize time, which is the product, NOT isolated terms
• Common error to miss terms while devising optimizations
  – E.g. ISA change to decrease instruction count
  – BUT leads to CPU organization which makes clock slower
• Bottom line: terms are inter-related

Pipelined Design

• Motivation: increase throughput with little increase in hardware
  – Bandwidth or Throughput = Performance
• Bandwidth (BW) = no. of tasks/unit time
• For a system that operates on one task at a time:
  – BW = 1/delay (latency)
• BW can be increased by pipelining if many operands exist which need the same operation, i.e. many repetitions of the same task are to be performed
• Latency required for each task remains the same or may even increase slightly
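A minimal sketch of the iron law in code; every number here is invented purely to illustrate the "product, not isolated terms" point above:

```python
# Iron-law sketch: execution time is the product of all three terms.
# The instruction counts and clock rates below are made-up examples.

def exec_time(insts, cpi, cycle_time_ns):
    """Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle."""
    return insts * cpi * cycle_time_ns  # nanoseconds

# Baseline machine: 1M dynamic instructions, CPI 1.5, 1 ns cycle (1 GHz).
base = exec_time(1_000_000, 1.5, 1.0)

# "Optimization" that cuts instruction count 10% but lengthens the
# clock 15%: the product, not any single term, decides the winner.
opt = exec_time(900_000, 1.5, 1.15)

print(f"baseline {base/1e6:.2f} ms, 'optimized' {opt/1e6:.2f} ms")
# baseline 1.50 ms, 'optimized' 1.55 ms -> a net slowdown
```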

Ideal Pipelining

[Figure: combinational logic of n gate delays, successively divided by latches into 2 and then 3 balanced stages]
• Unpipelined: n gate delays, BW = ~(1/n)
• 2 stages of n/2 gate delays each: BW = ~(2/n)
• 3 stages of n/3 gate delays each: BW = ~(3/n)
• Bandwidth increases linearly with pipeline depth
• Latency increases by latch delays
[Source: J. Hayes, Univ. of Michigan]

Example: Integer Multiplier

 16x16 combinational multiplier
 ISCAS-85 C6288 standard benchmark
 Tools: Synopsys DC/LSI Logic 110nm gflxp ASIC

Example: Integer Multiplier

Configuration   Delay    MPS           Area (FF/wiring)    Area Increase
Combinational   3.52ns    284          7535 (--/1759)      --
2 Stages        1.87ns    534 (1.9x)   8725 (1078/1870)    16%
4 Stages        1.17ns    855 (3.0x)   11276 (3388/2112)   50%
8 Stages        0.80ns   1250 (4.4x)   17127 (8938/2612)   127%

 Pipeline efficiency
 2-stage: nearly double throughput; marginal area cost
 4-stage: 75% efficiency; area still reasonable
 8-stage: 55% efficiency; area more than doubles
 Tools: Synopsys DC/LSI Logic 110nm gflxp ASIC

Pipelining Idealisms

• Uniform subcomputations
  – Can pipeline into stages with equal delay
  – Balance pipeline stages
• Identical computations
  – Can fill pipeline with identical work
  – Unify instruction types
• Independent computations
  – No relationships between work units
  – Minimize pipeline stalls
• Are these practical?
  – No, but can get close enough to get significant speedup
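The efficiency figures quoted for the multiplier follow directly from its table; a quick check (MPS values taken from the table; efficiency computed as speedup divided by stage count):

```python
# Pipeline speedup/efficiency computed from the multiplier table above.

combinational_mps = 284
staged_mps = {2: 534, 4: 855, 8: 1250}  # stages -> MPS from the table

for stages, mps in staged_mps.items():
    speedup = mps / combinational_mps
    efficiency = speedup / stages        # fraction of ideal linear speedup
    print(f"{stages}-stage: speedup {speedup:.1f}x, efficiency {efficiency:.0%}")

# 2-stage: speedup 1.9x, efficiency 94%
# 4-stage: speedup 3.0x, efficiency 75%
# 8-stage: speedup 4.4x, efficiency 55%
```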

Instruction Pipelining

• The "computation" to be pipelined:
  – Instruction Fetch (IF)
  – Instruction Decode (ID)
  – Operand(s) Fetch (OF)
  – Instruction Execution (EX)
  – Operand Store (OS)
  – Update Program Counter (PC)

Generic Instruction Pipeline

1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Operand Fetch (OF)
4. Instruction Execute (EX)
5. Operand Store (OS)
• Based on "obvious" subcomputations

Pipelining Idealisms

• Uniform subcomputations
  – Can pipeline into stages with equal delay
  – Balance pipeline stages
• Identical computations
  – Can fill pipeline with identical work
  – Unify instruction types (example in 752 notes)
• Independent computations
  – No relationships between work units
  – Minimize pipeline stalls

Program Dependences

• A true dependence between two instructions may only involve one subcomputation of each instruction.
[Figure: instructions i1, i2, i3 with dependence arrows between individual subcomputations]
• The implied sequential precedences are an overspecification. It is sufficient but not necessary to ensure program correctness.

Program Data Dependences

• True dependence (RAW): D(i) ∩ R(j) ≠ ∅
  – j cannot execute until i produces its result
• Anti-dependence (WAR): R(i) ∩ D(j) ≠ ∅
  – j cannot write its result until i has read its sources
• Output dependence (WAW): D(i) ∩ D(j) ≠ ∅
  – j cannot write its result until i has written its result
(Here D(i) is the set of registers instruction i writes and R(i) the set it reads.)

Control Dependences

• Conditional branches
  – Branch must execute to determine which instruction to fetch next
  – Instructions following a conditional branch are control dependent on the branch instruction
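A minimal sketch of the set check, assuming the notation above (D(i) = registers written, R(i) = registers read) and an invented (dest, srcs) instruction encoding:

```python
# Classify hazards between an earlier instruction i and a later one j.

def dependences(i, j):
    """i, j: (dest_register, (source_registers...)) tuples."""
    D_i, R_i = {i[0]}, set(i[1])
    D_j, R_j = {j[0]}, set(j[1])
    return {
        "RAW": bool(D_i & R_j),  # j reads what i writes
        "WAR": bool(R_i & D_j),  # j writes what i reads
        "WAW": bool(D_i & D_j),  # both write the same register
    }

add = ("r3", ("r2", "r1"))    # add r3 <- r2 + r1
sub = ("r4", ("r3", "r1"))    # sub r4 <- r3 + r1
print(dependences(add, sub))  # {'RAW': True, 'WAR': False, 'WAW': False}
```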

Resolution of Pipeline Hazards

• Pipeline hazards
  – Potential violations of program dependences
  – Must ensure program dependences are not violated
• Hazard resolution
  – Static: compiler/programmer guarantees correctness
  – Dynamic: hardware performs checks at runtime
• Pipeline interlock
  – Hardware mechanism for dynamic hazard resolution
  – Must detect and enforce dependences at runtime

IBM RISC Experience [Agerwala and Cocke 1987]

• Internal IBM study: Limits of a scalar pipeline?
• Memory bandwidth
  – Fetch 1 instr/cycle from I-cache
  – 40% of instructions are load/store (D-cache)
• Code characteristics (dynamic)
  – Loads – 25%
  – Stores – 15%
  – ALU/RR – 40%
  – Branches – 20%
    • 1/3 unconditional (always taken)
    • 1/3 conditional taken, 1/3 conditional not taken

IBM Experience

• Cache performance
  – Assume 100% hit ratio (upper bound)
  – Cache latency: I = D = 1 cycle default
• Load and branch scheduling
  – Loads
    • 25% cannot be scheduled (delay slot empty)
    • 65% can be moved back 1 or 2 instructions
    • 10% can be moved back 1 instruction
  – Branches
    • Unconditional – 100% schedulable (fill one delay slot)
    • Conditional – 50% schedulable (fill one delay slot)

CPI Optimizations

• Goal and impediments
  – CPI = 1, prevented by pipeline stalls
• No cache bypass of RF, no load/branch scheduling
  – Load penalty: 2 cycles: 0.25 x 2 = 0.5 CPI
  – Branch penalty: 2 cycles: 0.2 x 2/3 x 2 = 0.27 CPI
  – Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI
• Bypass, no load/branch scheduling
  – Load penalty: 1 cycle: 0.25 x 1 = 0.25 CPI
  – Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI
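The stall accounting above is easy to reproduce; a sketch using the quoted instruction mix (25% loads, 20% branches, 2/3 of branches taken):

```python
# CPI = 1 (base) + load stall contribution + branch stall contribution.

LOADS, BRANCHES = 0.25, 0.20
TAKEN_FRACTION = 2 / 3          # 1/3 unconditional + 1/3 cond. taken

def cpi(load_penalty, branch_penalty):
    load_stalls = LOADS * load_penalty
    branch_stalls = BRANCHES * TAKEN_FRACTION * branch_penalty
    return 1 + load_stalls + branch_stalls

print(f"no bypass:   {cpi(2, 2):.2f}")   # 1 + 0.50 + 0.27 = 1.77
print(f"with bypass: {cpi(1, 2):.2f}")   # 1 + 0.25 + 0.27 = 1.52
```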

More CPI Optimizations

• Bypass, scheduling of loads/branches
  – Load penalty:
    • 65% + 10% = 75% moved back, no penalty
    • 25% => 1 cycle penalty
    • 0.25 x 0.25 x 1 = 0.0625 CPI
  – Branch penalty:
    • 1/3 unconditional, 100% schedulable => 1 cycle
    • 1/3 cond. not-taken => no penalty (predict not-taken)
    • 1/3 cond. taken, 50% schedulable => 1 cycle
    • 1/3 cond. taken, 50% unschedulable => 2 cycles
    • 0.2 x [1/3 x 1 + 1/3 x 0.5 x 1 + 1/3 x 0.5 x 2] = 0.167 CPI
• Total CPI: 1 + 0.063 + 0.167 = 1.23 CPI

Simplify Branches

• Assume 90% of branches can be PC-relative
  – No register indirect, no register access
  – Separate adder (like MIPS)
  – Branch penalty reduced; 15% overhead from program dependences
• Total CPI: 1 + 0.063 + 0.085 = 1.15 CPI = 0.87 IPC

PC-relative   Schedulable   Penalty
Yes (90%)     Yes (50%)     0 cycles
Yes (90%)     No (50%)      1 cycle
No (10%)      Yes (50%)     1 cycle
No (10%)      No (50%)      2 cycles
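And the scheduled case, spelled out per branch category (fractions and penalties exactly as quoted above):

```python
# Each tuple: (fraction of branches, cycles of penalty).
cases = [
    (1/3, 1),        # unconditional, schedulable: 1 cycle
    (1/3, 0),        # conditional not-taken, predicted correctly
    (1/3 * 0.5, 1),  # conditional taken, schedulable
    (1/3 * 0.5, 2),  # conditional taken, unschedulable
]
branch_cpi = 0.20 * sum(frac * pen for frac, pen in cases)
load_cpi = 0.25 * 0.25 * 1     # 25% of loads keep a 1-cycle penalty

print(f"branch stalls {branch_cpi:.3f}, load stalls {load_cpi:.4f}")
print(f"total CPI {1 + load_cpi + branch_cpi:.2f}")   # ~1.23
```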

Limits of Pipelining

• IBM RISC Experience
  – Control and data dependences add 15%
  – Best case CPI of 1.15, IPC of 0.87
  – Deeper pipelines (higher frequency) magnify dependence penalties
• This analysis assumes 100% cache hit rates
  – Hit rates approach 100% for some programs
  – Many important programs have much worse hit rates

Processor Performance

Processor Performance = Time/Program
                      = Instructions/Program x Cycles/Instruction x Time/Cycle
                        (code size)            (CPI)                (cycle time)

• In the 1980's (decade of pipelining): CPI: 5.0 => 1.15
• In the 1990's (decade of superscalar): CPI: 1.15 => 0.5 (best case)
• In the 2000's (decade of multicore): core CPI unchanged; chip CPI scales with #cores

Limits on Instruction Level Parallelism (ILP)

Study                         Reported ILP
Weiss and Smith [1984]        1.58
Sohi and Vajapeyam [1987]     1.81
Tjaden and Flynn [1970]       1.86 (Flynn's bottleneck)
Tjaden and Flynn [1973]       1.96
Uht [1986]                    2.00
Smith et al. [1989]           2.00
Jouppi and Wall [1988]        2.40
Johnson [1991]                2.50
Acosta et al. [1986]          2.79
Wedig [1982]                  3.00
Butler et al. [1991]          5.8
Melvin and Patt [1991]        6
Wall [1991]                   7 (Jouppi disagreed)
Kuck et al. [1972]            8
Riseman and Foster [1972]     51 (no control dependences)
Nicolau and Fisher [1984]     90 (Fisher's optimism)

Superscalar Proposal

• Go beyond single instruction pipeline, achieve IPC > 1
• Dispatch multiple instructions per cycle
• Provide more generally applicable form of concurrency (not just vectors)
• Geared for sequential code that is hard to parallelize otherwise
• Exploit fine-grained or instruction-level parallelism (ILP)

Limitations of Scalar Pipelines

• Scalar upper bound on throughput
  – IPC <= 1 or CPI >= 1
• Inefficient unified pipeline
  – Long latency for each instruction
• Rigid pipeline stall policy
  – One stalled instruction stalls all newer instructions

Parallel Pipelines

[Figure: (a) No Parallelism, (b) Temporal Parallelism, (c) Spatial Parallelism, (d) Parallel Pipeline]

Power4 Diversified Pipelines

[Figure: Power4 pipeline — PC, I-Cache, Fetch Q, BR Scan, BR Predict, Decode, Reorder Buffer, issue queues (FP, FX/LD 1, FX/LD 2, BR/CR), execution units (FX1, FX2, LD1, LD2, FP1, FP2, CR, BR), StQ, D-Cache]

Rigid Pipeline Stall Policy

[Figure: backward propagation of stalling; bypassing of a stalled instruction is not allowed]

Dynamic Pipelines

[Figure: IF, ID, RD stages in order; dispatch buffer (out of order); diversified EX pipelines (ALU, MEM1-MEM2, FP1-FP3, BR) out of order; reorder buffer; WB in order]

Limitations of Scalar Pipelines

• Scalar upper bound on throughput
  – IPC <= 1 or CPI >= 1
  – Solution: wide (superscalar) pipeline
• Inefficient unified pipeline
  – Long latency for each instruction
  – Solution: diversified, specialized pipelines
• Rigid pipeline stall policy
  – One stalled instruction stalls all newer instructions
  – Solution: out-of-order execution, distributed execution pipelines

Superscalar Overview

• Instruction flow
  – Branches, jumps, calls: predict target, direction
  – Fetch alignment
  – Instruction cache misses
• Register data flow
  – Register dependences: RAW/WAR/WAW
• Memory data flow
  – In-order stores: WAR/WAW
  – Store queue: RAW
  – Data cache misses

Goal and Impediments

• Goal of instruction flow
  – Supply processor with maximum number of useful instructions every clock cycle
• Impediments
  – Branches and jumps
  – Finite I-cache
    • Capacity
    • Bandwidth restrictions

Limits on Instruction Level Parallelism (ILP)

(ILP limits table repeated from the Superscalar Proposal slide above.)
• Riseman & Foster showed potential
  – But no idea how to reap benefit
• 1979: Jim Smith patents branch prediction at Control Data
  – Predict current branch based on past history
• Today: virtually all processors use branch prediction

Improving I-Cache Performance

• Larger cache size
  – Code compression
  – Instruction registers
• Increased associativity
  – Conflict misses less of a problem than in data caches
• Larger line size
  – Spatial locality inherent in sequential program I-stream
• Code layout
  – Maximize instruction stream's spatial locality
• Cache prefetching
  – Next-line, streaming buffer
  – Branch target (even if not taken)
• Other types of I-cache organization
  – [Ch. 9]

Program Control Flow

• Implicit sequential control flow
• Static program representation
  – Control Flow Graph (CFG)
    • Nodes = basic blocks
    • Edges = control flow transfers
• Physical program layout
  – Mapping of CFG to linear program memory
  – Implied sequential control flow
• Dynamic program execution
  – Traversal of the CFG nodes and edges (e.g. loops)
  – Traversal dictated by branch conditions
• Dynamic control flow
  – Deviates from sequential control flow
  – Disrupts sequential fetching
  – Can stall IF stage and reduce I-fetch bandwidth

Program Control Flow

• Dynamic traversal of static CFG
• Mapping CFG to linear memory
[Figure: (a) control flow graph, (b) its linear layout in program memory]

Disruption of Sequential Control Flow

[Figure: superscalar pipeline — Fetch, Instruction/Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Branch/Execute, Finish, Reorder/Completion Buffer, Complete, Store Buffer, Retire — with a branch redirecting the fetch stage]

Branch Prediction

• Target address generation --> target speculation
  – Access register:
    • PC, general purpose register, link register
  – Perform calculation:
    • +/- offset, autoincrement, autodecrement
• Condition resolution --> condition speculation
  – Access register:
    • Condition code register, general purpose register
  – Perform calculation:
    • Comparison of data register(s)

Branch Instruction Speculation

[Figure: fetch stage with FA-mux selecting between PC(seq.) = FA (fetch address) and the speculative target from the branch predictor (using a BTB); the BTB is updated with target address and history when the branch executes; downstream stages — Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Branch Execute, Finish, Completion Buffer]

Branch/Jump Target Prediction

Branch inst. address | Information for predict. | Branch target address (most recent)
0x0348               | 0101 (NTNT)              | 0x0612

• Branch Target Buffer: small cache in fetch stage
  – Previously executed branches, address, taken history, target(s)
• Fetch stage compares current FA against BTB
  – If match, use prediction
  – If predict taken, use BTB target
• When branch executes, BTB is updated
• Optimization:
  – Size of BTB: increases hit rate
  – Prediction algorithm: increase accuracy of prediction

Branch Prediction: Condition Speculation

1. Biased not taken
  – Hardware prediction
  – Does not affect ISA
  – Not effective for loops
2. Software prediction
  – Extra bit in each branch instruction
    • Set to 0 for not taken
    • Set to 1 for taken
  – Bit set by compiler or user; can use profiling
  – Static prediction, same behavior every time
3. Prediction based on branch offset
  – Positive offset: predict not taken
  – Negative offset: predict taken
4. Prediction based on dynamic history
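A toy BTB sketch; the direct-mapped organization and sizes here are illustrative assumptions, not the details of any particular machine:

```python
# Minimal Branch Target Buffer: a small direct-mapped cache indexed
# by fetch address, holding (tag, most-recent-target) pairs.

class BTB:
    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}  # index -> (tag, target)

    def lookup(self, fetch_addr):
        """Return a predicted target, or None (fetch sequentially)."""
        idx = (fetch_addr >> 2) % self.entries
        entry = self.table.get(idx)
        if entry and entry[0] == fetch_addr:   # full-address tag match
            return entry[1]
        return None

    def update(self, branch_addr, target):
        """Called when a taken branch executes."""
        idx = (branch_addr >> 2) % self.entries
        self.table[idx] = (branch_addr, target)

btb = BTB()
btb.update(0x0348, 0x0612)       # the slide's example entry
print(hex(btb.lookup(0x0348)))   # 0x612 -> redirect fetch
print(btb.lookup(0x0400))        # None  -> fetch sequentially
```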

Branch Speculation

[Figure: speculation tree — each level of not-taken (NT) / taken (T) outcomes past an unresolved branch gets a speculation tag (TAG 1, TAG 2, TAG 3)]
• Leading speculation
  1. Tag speculative instructions
  2. Advance branch and following instructions
  3. Buffer addresses of speculated branch instructions
  – Typically done during the Fetch stage
  – Based on potential branch instruction(s) in the current fetch group
• Trailing confirmation
  1. When branch resolves, remove/deallocate speculation tag
  2. Permit completion of branch and following instructions
  – Typically done during the Branch Execute stage
  – Based on the next branch instruction to finish execution

Branch Speculation (cont.)

• Start new correct path
  – Must remember the alternate (non-predicted) path
• Eliminate incorrect path
  – Must ensure that the mis-speculated instructions produce no side effects

Mis-speculation Recovery

• Start new correct path
  1. Update PC with computed branch target (if predicted NT)
  2. Update PC with sequential instruction address (if predicted T)
  3. Can begin speculation again at next branch
• Eliminate incorrect path
  1. Use tag(s) to deallocate ROB entries occupied by speculative instructions
  2. Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations

Dynamic Branch Prediction

• Main advantages:
  – Learn branch behavior autonomously
    • No compiler analysis, heuristics, or profiling
  – Adapt to changing branch behavior
    • Program phase changes branch behavior
• First proposed in 1980
  – US Patent #4,370,711, branch predictor using random access memory, James E. Smith
• Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages 135-148, May 1981
• Continually refined since then
• Widely employed: Pentium, PowerPC 604, PowerPC 620, etc.

Smith Predictor Hardware

[Figure: the branch address indexes a table of 2^m k-bit saturating counters; the counter's most significant bit supplies the branch prediction; the branch outcome increments/decrements the counter, which saturates at its extremes]
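A sketch of the Smith predictor's counter logic, assuming m index bits and k-bit saturating counters as in the figure:

```python
# Smith predictor: 2^m saturating k-bit counters indexed by branch
# address; the prediction is the counter's most significant bit.

class SmithPredictor:
    def __init__(self, m=10, k=2):
        self.counters = [1 << (k - 1)] * (1 << m)  # init weakly taken
        self.m, self.max = m, (1 << k) - 1

    def _index(self, pc):
        return (pc >> 2) & ((1 << self.m) - 1)

    def predict(self, pc):
        """Taken iff the counter's most significant bit is set."""
        return self.counters[self._index(pc)] > self.max // 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i] + (1 if taken else -1)
        self.counters[i] = min(self.max, max(0, c))  # saturate

p = SmithPredictor()
for outcome in [True, True, True, False, True]:   # loop-like branch
    print(p.predict(0x0348), outcome)             # one mispredict only
    p.update(0x0348, outcome)
```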

The Big Picture

[Figure: taxonomy of instruction processing constraints — resource contention (structural dependences) and code dependences; code dependences split into control dependences and data dependences; data dependences split into true dependences (RAW) and storage conflicts, i.e. anti-dependences (WAR) and output dependences (WAW)]

Register Data Dependences

• Program data dependences cause hazards
  – True dependences (RAW)
  – Antidependences (WAR)
  – Output dependences (WAW)
• When are registers read and written?
  – Out of program order!
  – Hence, any/all of these can occur
• Solution to all three: register renaming

Register Renaming: WAR/WAW

• Widely employed (P-4, P-M, Athlon, …)
• Resolving WAR/WAW:
  – Each register write gets unique "rename register"
  – Writes are committed in program order at Writeback
  – WAR and WAW are not an issue
• All updates to "architected state" delayed till writeback
  – Writeback stage always later than read stage
  – Reorder Buffer (ROB) enforces in-order writeback

Add R3 <= R2 + R1    P32 <= P2 + P1
Sub R4 <= R3 + R1    P33 <= P32 + P1
And R3 <= R4 & R2    P35 <= P33 & P2

Register Renaming: RAW

• In order, at dispatch:
  – Source registers checked to see if "in flight"
    • Register map table keeps track of this
    • If not in flight, read from real register file
    • If in flight, look up "rename register" tag (IOU)
  – Then, allocate new register for register write

Add R3 <= …    P32 <= …
Sub R4 <= …    P33 <= …
And R3 <= …    P35 <= …
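A minimal rename sketch with a map table and free list; the register counts are illustrative, and it allocates P32, P33, P34 sequentially, whereas the slide's example happens to use P35 for the third write:

```python
# Map table (architected -> physical) plus a free list of physical regs.

class Renamer:
    def __init__(self, arch_regs=32, phys_regs=48):
        self.map = {f"R{i}": f"P{i}" for i in range(arch_regs)}
        self.free = [f"P{i}" for i in range(arch_regs, phys_regs)]

    def rename(self, dest, srcs):
        # RAW: sources read the current (possibly in-flight) mapping.
        phys_srcs = [self.map[s] for s in srcs]
        # WAR/WAW: the write gets a fresh physical register; the old
        # one is released when the next write to dest commits.
        old = self.map[dest]
        self.map[dest] = self.free.pop(0)
        return self.map[dest], phys_srcs, old

r = Renamer()
print(r.rename("R3", ["R2", "R1"]))  # ('P32', ['P2', 'P1'], 'P3')
print(r.rename("R4", ["R3", "R1"]))  # ('P33', ['P32', 'P1'], 'P4')
print(r.rename("R3", ["R4", "R2"]))  # ('P34', ['P33', 'P2'], 'P32')
```

The third call shows the release rule from the slides: once the second write to R3 commits, the previous mapping (P32) can be returned to the free list.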

ECE 752: Advanced Computer Architecture I 9 Register Renaming: RAW “Dataflow Engine” for Dynamic Execution Dispatch Buffer Reg. Write Back

• Advance instruction to Dispatch Reg. File Ren. Reg. Allocate – Wait for rename register tag to trigger issue Reorder Buffer • Reservation station enables out‐of‐order issue entries Reservation Stations – Newer iiinstructions can bypass stalle d iiinstructions Branch Integer Integer Float.- Load/ Point Store Forwarding results to Res. Sta. & rename registers

Reorder Buffer Managed as a queue; Maintains sequential order of all Instructions in flight Complete © Shen, Lipasti 55 © Shen, Lipasti 56

Instruction Processing Steps

• DISPATCH:
  – Read operands from Register File (RF) and/or Rename Buffers (RRB)
  – Rename destination register and allocate RRB entry
  – Allocate Reorder Buffer (ROB) entry
  – Advance instruction to appropriate Reservation Station (RS)
• EXECUTE:
  – RS entry monitors for register tag(s) to latch in pending operand(s)
  – When all operands ready, issue instruction into Functional Unit (FU) and deallocate RS entry (no further stalling in execution pipe)
  – When execution finishes, broadcast result to waiting RS entries, RRB entry, and ROB entry
• COMPLETE:
  – Update architected register from RRB entry, deallocate RRB entry, and if it is a store instruction, advance it to Store Buffer
  – Deallocate ROB entry and instruction is considered architecturally completed

Physical Register File

[Figure: pipeline — Fetch, Decode, Rename, Issue, RF Read, Execute (ALU / Agen-D$ / Load), RF Write — with a map table (R0 => P7, R1 => P3, …, R31 => P39) pointing into a single physical register file]
• Used in the MIPS R10000 pipeline
• All registers in one place
  – Always accessed right before EX stage
  – No copying to real register file

Managing Physical Registers

Map table: R0 => P7, R1 => P3, …, R31 => P39

Add R3 <= R2 + R1    P32 <= P2 + P1
Sub R4 <= R3 + R1    P33 <= P32 + P1
…
And R3 <= R4 & R2    P35 <= P33 & P2    Release P32 (previous R3) when this instruction completes execution

• What to do when all physical registers are in use?
  – Must release them somehow to avoid stalling
  – Maintain free list of "unused" physical registers
• Release when no more uses are possible
  – Sufficient: next write commits

Memory Data Dependences

• Besides branches, long memory latencies are one of the biggest performance challenges today.
• To preserve sequential (in-order) state in the data caches and external memory (so that recovery from exceptions is possible) stores are performed in order. This takes care of antidependences and output dependences to memory locations.
• However, loads can be issued out of order with respect to stores if the out-of-order loads check for data dependences with respect to previous, pending stores.

WAW          WAR          RAW
store X      load X       store X
:            :            :
store X      store X      load X

ECE 752: Advanced Computer Architecture I 10 Memory Data Dependences Memory Data Load/Store RS • “Memory Aliasing” = Two memory references involving the same memory location (collision of two memory addresses). Dependences Agen • “” = Determining whether two memory references will alias or not (whether there is a dependence or not). • WAR/WAW: stores commit in order Mem • Memory Dependency Detection: – Hazards not possible. – Must compute effective addresses of both memory references • Store RAW: loads must check pending stores Queue – Effective addresses can depend on run‐time data and other instructions – Store queue keeps track of pending store – Comparison of addresses require much wider comparators addresses Example code: – Loads check against these addresses Reorder Buffer (1) STORE V – Similar to register bypass logic (2) ADD – Comparators are 32 or 64 bits wide (address (3) LOAD W RAW size) (4) LOAD X • Major source of complexity in modern (5) LOADWAR V designs (6) ADD – Store queue lookup is position‐based (7) STORE W – What if store address is not yet known? Stall all trailing ops

© Shen, Lipasti 62

Optimizing Load/Store Disambiguation

• Non-speculative load/store disambiguation
  1. Loads wait for addresses of all prior stores
  2. Full address comparison
  3. Bypass if no match, forward if match
• (1) can limit performance:
  load r5, MEM[r3]    <-- cache miss
  store r7, MEM[r5]   <-- RAW for agen, stalled
  …
  load r8, MEM[r9]    <-- independent load stalled

Speculative Disambiguation

• What if aliases are rare?
  1. Loads don't wait for addresses of all prior stores
  2. Full address comparison of stores that are ready
  3. Bypass if no match, forward if match
  4. Check all store addresses when they commit
    – No matching loads: speculation was correct
    – Matching unbypassed load: incorrect speculation
  5. Replay starting from incorrect load
[Figure: load/store unit with separate Load Queue and Store Queue feeding Mem, backed by the Reorder Buffer]

Speculative Disambiguation: Load Bypass

i1: st R3, MEM[R8]: ??
i2: ld R9, MEM[R4]: ??
[Figure: in the queues — i2: ld R9, MEM[R4]: x400A; i1: st R3, MEM[R8]: x800A]
• i1 and i2 issue in program order
• i2 checks store queue (no match => bypass)

Speculative Disambiguation: Load Forward

i1: st R3, MEM[R8]: ??
i2: ld R9, MEM[R4]: ??
[Figure: in the queues — i2: ld R9, MEM[R4]: x800A; i1: st R3, MEM[R8]: x800A]
• i1 and i2 issue in program order
• i2 checks store queue (match => forward)

Speculative Disambiguation: Safe Speculation

i1: st R3, MEM[R8]: ??
i2: ld R9, MEM[R4]: ??
[Figure: in the queues — i2: ld R9, MEM[R4]: x400C; i1: st R3, MEM[R8]: x800A]
• i1 and i2 issue out of program order
• i1 checks load queue at commit (no match)

Speculative Disambiguation: Violation

i1: st R3, MEM[R8]: ??
i2: ld R9, MEM[R4]: ??
[Figure: in the queues — i2: ld R9, MEM[R4]: x800A; i1: st R3, MEM[R8]: x800A]
• i1 and i2 issue out of program order
• i1 checks load queue at commit (match)
  – i2 marked for replay
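A sketch of the store-queue check a load performs, using the example addresses from the figures above; the "stall on unknown store address" branch models the conservative, non-speculative policy:

```python
# Scan older stores from youngest to oldest: a match forwards the
# store's data, an unknown address conservatively stalls the load.

def disambiguate(load_addr, store_queue):
    """store_queue: older stores in program order, entries (addr, data);
    addr is None until the store's address generation completes."""
    for addr, data in reversed(store_queue):  # youngest older store first
        if addr is None:
            return ("stall", None)            # unknown address blocks load
        if addr == load_addr:
            return ("forward", data)          # RAW: take the store's data
    return ("bypass", None)                   # no match: access D-cache

sq = [(0x800A, 42)]
print(disambiguate(0x400A, sq))           # ('bypass', None)  - load bypass
print(disambiguate(0x800A, sq))           # ('forward', 42)   - load forward
print(disambiguate(0x800A, [(None, 7)]))  # ('stall', None)
```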

Use of Prediction

• If aliases are rare: static prediction
  – Predict no alias every time
    • Why even implement forwarding? PowerPC 620 doesn't
  – Pay misprediction penalty rarely
• If aliases are more frequent: dynamic prediction
  – Use PHT-like history table for loads
    • If alias predicted: delay load
    • If aliased pair predicted: forward from store to load
      – More difficult to predict pair [store sets]
  – Pay misprediction penalty rarely
• Memory cloaking [Moshovos, Sohi]
  – Predict load/store pair
  – Directly copy store data register to load target register
  – Reduce data transfer latency to absolute minimum

Load/Store Disambiguation Discussion

• RISC ISA:
  – Many registers, most variables allocated to registers
  – Aliases are rare
  – Most important to not delay loads (bypass)
  – Alias predictor may/may not be necessary
• CISC ISA:
  – Few registers, many operands from memory
  – Aliases much more common, forwarding necessary
  – Incorrect load speculation should be avoided
  – If load speculation allowed, predictor probably necessary
• Address translation:
  – Can't use virtual address (must use physical)
  – Wait till after TLB lookup is done
  – Or, use subset of untranslated bits (page offset)
    • Safe for proving inequality (bypassing OK)
    • Not sufficient for showing equality (forwarding not OK)

The Memory Bottleneck

[Figure: dataflow engine with a single load/store unit — effective address generation, address translation, D-cache access — feeding the Reorder Buffer and Store Buffer; missed loads queue at the Data Cache]

Increasing Memory Bandwidth

[Figure: the same dataflow engine with duplicated load/store units accessing a multi-ported Data Cache]

Coherent Memory Interface

• Load Queue
  – Tracks inflight loads for aliasing, coherence
• Store Queue
  – Defers stores until commit, tracks aliasing
• Storethrough Queue or Write Buffer or Store Buffer
  – Defers stores, coalesces writes, must handle RAW
• MSHR
  – Tracks outstanding misses, enables lockup-free caches [Kroft, ISCA 1981]
• Snoop Queue
  – Buffers, tracks incoming requests from coherent I/O, other processors
• Fill Buffer
  – Works with MSHR to hold incoming partial lines
• Writeback Buffer
  – Defers writeback of evicted line (demand miss handled first)

[Figure: out-of-order processor core with Load Q and Store Q above the Level 1 tag/data arrays (critical word bypass); Storethrough Q, WB buffers, Level 2 tag/data arrays, fill buffer, MSHR, and snoop queue connect to the system address/response bus and system data bus]

Split Transaction Bus

(a) Simple bus with atomic transactions:
    Req A | Rsp ~A | Read A from DRAM | Xmit A | Req B | Rsp ~B | Read B from DRAM | Xmit B
(b) Split-transaction bus with separate requests and responses:
    Req A, Req B, Req C, Req D issue back to back; DRAM reads and transmits overlap
• "Packet switched" vs. "circuit switched"
• Release bus after request issued
• Allow multiple concurrent requests to overlap memory latency
• Complicates control, arbitration, and coherence protocol
  – Transient states for pending blocks (e.g. "req. issued but not completed")

Memory Consistency

Proc0:                          Proc1:
st A=1                          st B=1
if (load B==0) {                if (load A==0) {
  ...critical section             ...critical section
}                               }

• How are memory references from different processors interleaved?
• If this is not well-specified, synchronization becomes difficult or even impossible
  – ISA must specify
• Common example using Dekker's algorithm for synchronization
  – If load reordered ahead of store (as we assume for a baseline OOO CPU)
  – Both Proc0 and Proc1 enter critical section, since both observe that other's lock variable (A/B) is not set
• If consistency model allows loads to execute ahead of stores, Dekker's algorithm no longer works
  – Common ISAs allow this: IA-32, PowerPC, SPARC, Alpha
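A small model (invented structure, not real hardware behavior) of why this breaks: give each processor a store buffer, let both loads execute before either buffered store drains, and both processors see the other's flag as 0:

```python
# Model load->store reordering via store buffers: a store becomes
# visible to the other CPU only when its buffer drains to memory.

memory = {"A": 0, "B": 0}
buffers = {0: [("A", 1)], 1: [("B", 1)]}   # each CPU's pending store

# Step 1: both CPUs execute their load while the stores sit buffered,
# so each load effectively passes the older store.
p0_sees = memory["B"]   # 0 -- Proc1's store not visible yet
p1_sees = memory["A"]   # 0 -- Proc0's store not visible yet

# Step 2: store buffers drain to memory (too late).
for cpu in (0, 1):
    for addr, val in buffers[cpu]:
        memory[addr] = val

print(p0_sees == 0 and p1_sees == 0)  # True: both enter the critical section
```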

Sequential Consistency [Lamport 1979]

[Figure: processors P1-P4 taking turns accessing a single shared memory]
• Processors treated as if they are interleaved processes on a single time-shared CPU
• All references must fit into a total global order or interleaving that does not violate any CPU's program order
  – Otherwise sequential consistency not maintained
• Now Dekker's algorithm will work
• Appears to preclude any OOO memory references
  – Hence precludes any real benefit from OOO CPUs

High-Performance Sequential Consistency

• Coherent caches isolate CPUs if no sharing is occurring
  – Absence of coherence activity means CPU is free to reorder references
• Still have to order references with respect to misses and other coherence activity (snoops)
• Key: use speculation
  – Reorder references speculatively
  – Track which addresses were touched speculatively
  – Force replay (in-order execution) of such references that collide with coherence activity (snoops)

ECE 752: Advanced Computer Architecture I 13 High‐Performance Sequential Consistency Issues in Completion/Retirement System • Out‐of‐order execution address bus Load queue Other – ALU instructions processors Out-of-order processor P – Load/store instructions core Bus writes Bus upgrades • In‐order completion/retirement P – Precise exceptions P – In-order commit Memory ordering in multiprocessors • Solutions • Load queue records all speculative loads – Reorder buffer retires instructions in order • Bus writes/upgrades are checked against LQ • Any matching load gets marked for replay – Store queue retires stores in order • At commit, loads are checked and replayed if necessary – Results in machine flush, since load‐dependent ops must also replay – Exceptions can be handled at any instruction boundary by • Practically, conflicts are rare, so expensive flush is OK reconstructing state out of ROB/SQ – Load queue monitors other processors’ references © Shen, Lipasti 80

A Dynamic Superscalar Processor

[Figure: Fetch, instruction/decode buffer, Decode, dispatch buffer, Dispatch, reservation stations, Issue, Execute, Finish, reorder/completion buffer, Complete, store buffer, Retire]

Superscalar Summary

[Figure: I-cache and Branch Predictor feed FETCH (instruction flow); Instruction Buffer and DECODE are in order; dispatch to reservation stations; integer, floating-point, media, and memory units EXECUTE out of order (register data flow); the Reorder Buffer (ROB) restores order at COMMIT; stores drain through the Store Queue to the D-cache via the store buffer, in order (memory data flow)]

Landscape of Microprocessor Families [John DeVale & Bryan Black, 2005]

[Figure: SPECint2000 per MHz vs. frequency (MHz) scatter plot for Intel x86 (PIII, P4, NWD, PSC), AMD x86 (Athlon, Opteron), Power (Power 3, Power4, Power5, DTN), and Itanium families. Data source: www.spec.org]

Performance_CPU = Frequency / (PathLength x CPI)

Review of 752

• Iron law
• Beyond pipelining
• Superscalar challenges
• Instruction flow
• Register data flow
• Memory Dataflow
• Modern memory interface
• What was not covered
  – Memory hierarchy (caches, DRAM)
  – Power
  – Many implementation/design details
  – Etc.
  – Multithreading (next week)
