CS 211: Branch Prediction for Superscalar Processors

Flow Path Model of Superscalars

[Figure: superscalar pipeline with three flow paths. Instruction flow: I-cache and branch predictor feed FETCH, then the instruction buffer and DECODE. Register data flow: integer, floating-point, media, and memory units in EXECUTE, then the reorder buffer (ROB) and COMMIT. Memory data flow: store queue and D-cache.]

Instruction Fetch Buffer

[Figure: fetch unit -> instruction fetch buffer -> out-of-order core]

‹ Fetch buffer smooths out the rate mismatch between fetch and execution
- Neither the fetch bandwidth nor the execution bandwidth is constant
‹ Fetch bandwidth should be higher than execution bandwidth
- We prefer to have a stockpile of instructions in the buffer to hide cache miss latencies
- This requires both raw cache bandwidth and control flow speculation

Instruction Cache Basics / Spatial Locality and Fetch Bandwidth

[Figure: two instruction cache arrays indexed by PC = ..xxRRRCC00, each with a row decoder (rows 000 to 111) and column select (00 to 11); in the second, a multiplexer delivers Inst0 to Inst3. Example: 4 instructions per cache line.]
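The address split PC = ..xxRRRCC00 from the figure can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the rule that one access cannot cross a cache line is an assumption, not something the slide states.

```python
LINE_INSTS = 4  # instructions per cache line, as in the slide example


def decode_pc(pc):
    """Split PC = ..xxRRRCC00 into (row, column)."""
    column = (pc >> 2) & 0b11   # CC bits: instruction slot within the line
    row = (pc >> 4) & 0b111     # RRR bits: input to the row decoder
    return row, column


def insts_fetched(pc):
    """Instructions delivered by one access, assuming a fetch cannot cross a line."""
    _, column = decode_pc(pc)
    return LINE_INSTS - column


if __name__ == "__main__":
    print(decode_pc(0b1011000), insts_fetched(0b1011000))
```

Under that assumption, a fetch that starts in the middle of a line delivers fewer than 4 instructions, which is one way branches into mid-line targets cost fetch bandwidth.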

Instruction Decoding Issues

‹ Primary tasks:
- Identify individual instructions
- Determine instruction types
- Detect inter-instruction dependences
‹ Two important factors:
- Instruction set architecture
- Width of parallel pipeline

Pentium Pro Fetch/Decode Unit

[Figure: macro-instruction bytes from the IFU enter a 16-byte instruction buffer (with a path to next address calc.). Decoder 0, backed by the uROM, produces up to 4 uops; Decoders 1 and 2 produce 1 uop each; a branch address calc. block sits beside the decoders. The uops enter a uop queue (6); up to 3 uops are issued to dispatch.]

Instruction Flow: Control Flow

‹ Throughput of early stages places an upper bound on the performance of subsequent stages
‹ Program control flow is represented by a Control Flow Graph (CFG)
- Nodes represent basic blocks of code
• Sequence of instructions with no incoming or outgoing branches
- Edges represent transfer of control flow from one block to another

IBM’s Experience on Pipelined Processors [Agerwala and Cocke 1987]

‹ Code Characteristics (dynamic)
- loads - 25%
- stores - 15%
- ALU/RR - 40%
- branches - 20%
• 1/3 unconditional (always taken)
  unconditional - 100% schedulable
• 1/3 conditional taken
• 1/3 conditional not taken
  conditional - 50% schedulable

Control Flow Graph

‹ Shows possible paths of control flow through basic blocks

main:  addi r2, r0, A          BB 1
       addi r3, r0, B
       addi r4, r0, C
       addi r5, r0, N
       add  r10,r0, r0
       bge  r10,r5, end
loop:  lw   r20, 0(r2)         BB 2
       lw   r21, 0(r3)
       bge  r20,r21,T1
       sw   r21, 0(r4)         BB 3
       b    T2
T1:    sw   r20, 0(r4)         BB 4
T2:    addi r10,r10,1          BB 5
       addi r2, r2, 4
       addi r3, r3, 4
       addi r4, r4, 4
       blt  r10,r5, loop
end:

[Figure: CFG over BB 1 to BB 5]

‹ Control Dependence
- Node X is control dependent on Node Y if the computation in Y determines whether X executes

CFG and Branches

‹ Basic blocks and their constituent instructions must be stored in sequential locations in memory
- In mapping a CFG to linear consecutive memory locations, additional unconditional branches must be added
‹ Encountering branches (cond. and uncond.) at run time induces deviations from the implied sequential control flow and consequent disruptions to sequential fetching of instructions
- These disruptions cause stalls in the Inst. Fetch (IF) stage and reduce overall IF bandwidth

Mapping CFG to Linear Instruction Sequence

[Figure: a CFG of blocks A, B, C, D mapped onto linear orderings of those blocks in memory]
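As a rough illustration of how basic blocks fall out of a linear instruction sequence, the sketch below splits the slide's example program at "leaders": the first instruction, every branch target, and every instruction that follows a branch. The `PROGRAM` encoding and the function name are assumptions made for this example only.

```python
# (label, instruction) pairs transcribed from the slide's example program.
PROGRAM = [
    ("main", "addi r2, r0, A"),
    (None, "addi r3, r0, B"),
    (None, "addi r4, r0, C"),
    (None, "addi r5, r0, N"),
    (None, "add r10, r0, r0"),
    (None, "bge r10, r5, end"),
    ("loop", "lw r20, 0(r2)"),
    (None, "lw r21, 0(r3)"),
    (None, "bge r20, r21, T1"),
    (None, "sw r21, 0(r4)"),
    (None, "b T2"),
    ("T1", "sw r20, 0(r4)"),
    ("T2", "addi r10, r10, 1"),
    (None, "addi r2, r2, 4"),
    (None, "addi r3, r3, 4"),
    (None, "addi r4, r4, 4"),
    (None, "blt r10, r5, loop"),
]
BRANCH_OPS = {"b", "bge", "blt"}


def basic_blocks(program):
    """Split a linear instruction list into basic blocks."""
    targets = set()
    leaders = {0}                       # the first instruction starts a block
    for i, (_, text) in enumerate(program):
        op, operands = text.split(None, 1)
        if op in BRANCH_OPS:
            targets.add(operands.split(",")[-1].strip())
            if i + 1 < len(program):
                leaders.add(i + 1)      # the instruction after a branch
    for i, (label, _) in enumerate(program):
        if label in targets:
            leaders.add(i)              # every branch target
    starts = sorted(leaders)
    ends = starts[1:] + [len(program)]
    return [program[s:e] for s, e in zip(starts, ends)]


if __name__ == "__main__":
    for n, block in enumerate(basic_blocks(PROGRAM), start=1):
        print("BB", n, [text for _, text in block])
```

The five blocks it produces correspond to BB 1 through BB 5 on the slide.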

Branch Types and Implementation

‹ Types of Branches
- Conditional or Unconditional?
- Subroutine Call (aka Link), needs to save PC?
- How is the branch target computed?
• Static targets, e.g. immediate, PC-relative
• Dynamic targets, e.g. register indirect
‹ Conditional Branch Architectures
- Condition Code ‘N-Z-C-V’, e.g. PowerPC
- General Purpose Register, e.g. Alpha, MIPS
- Special Purpose Register, e.g. Power’s Loop Count

Branch Actions

‹ When branches occur, disruption to IF occurs
‹ For unconditional branches
- Subsequent instruction cannot be fetched until target address determined
‹ For conditional branches
- Machine must wait for resolution of branch condition
- And if branch taken then wait till target address computed
‹ Branch inst executed by the branch functional unit
‹ Note: Cost in superscalar/ILP processors = width (parallelism) x stall cycles
- 3 stall cycles on a 4-wide machine = 12 lost instruction slots

Condition Resolution / Target Address Generation

[Figure: two copies of the pipeline (Fetch, Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Branch, Execute, Finish, Completion Buffer, Complete, Store Buffer, Retire). Condition Resolution is annotated with "CC Reg." near Decode and "GP Reg. value comp." near Dispatch. Target Address Generation is annotated with "PC-rel." near the Decode Buffer, "reg. ind." near Decode, and "reg. ind. with offset" near Dispatch.]

What’s So Bad About Branches?

‹ Performance Penalties
- Use up execution resources
- Fragmentation of I-Cache lines
- Disruption of sequential control flow
• Need to determine branch direction (conditional branches)
• Need to determine branch target
‹ Robs instruction fetch bandwidth and ILP

Branch Penalties

‹ When a branch occurs, two parts are needed:
- Branch target address (BTA) has to be computed
- Branch condition resolution
‹ Addressing modes will affect BTA delay
- For PC-relative, BTA can be generated during Fetch stage for 1 cycle penalty
- For register indirect, BTA generated after decode stage (to access register) = 2 cycle penalty
- For register indirect with offset = 3 cycle penalty
‹ For condition resolution, the penalty depends on the method
- If condition code registers used, then penalty = 2
- If ISA permits comparison of 2 registers then output of ALU => 3 cycles
‹ Penalty will be max of penalties for condition resolution and BTA

What to Do with Branches

‹ To maximize sustained instruction fetch bandwidth, number of stall cycles in fetch stage must be minimized
‹ The primary aim of instruction flow techniques (branch prediction) is to minimize stall cycles and/or make use of these cycles to do useful work
- Note that there must be a mechanism to validate prediction and to safely recover from misprediction

Riseman and Foster’s Study

‹ 7 benchmark programs on CDC-3600
‹ Assume infinite machine:
- Infinite memory and instruction stack, fxn units
- Consider only true dependency at data-flow limit
‹ If bounded to single basic block, i.e. no bypassing of branches ⇒ maximum speedup is 1.72
‹ Suppose one can bypass conditional branches and jumps (i.e. assume the actual branch path is always known such that branches do not impede instruction execution)

Br. Bypassed:   0      1      2      8      32     128
Max Speedup:    1.72   2.72   3.62   7.21   24.4   51.2
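The penalty rules from the Branch Penalties slide, combined with the earlier note that cost scales with machine width, reduce to a small calculation. The sketch below is only an illustration: the dictionary keys and function names are made up here, and treating an unconditional branch as having no condition penalty is an assumption.

```python
# Cycle penalties as listed on the Branch Penalties slide.
BTA_PENALTY = {
    "pc_relative": 1,
    "register_indirect": 2,
    "register_indirect_offset": 3,
}
COND_PENALTY = {
    "condition_code": 2,
    "register_compare": 3,
}


def branch_penalty(target_mode, cond_method=None):
    """Stall cycles = max(BTA penalty, condition-resolution penalty)."""
    bta = BTA_PENALTY[target_mode]
    # Assumption: an unconditional branch (cond_method=None) only pays for the BTA.
    cond = COND_PENALTY[cond_method] if cond_method else 0
    return max(bta, cond)


def lost_slots(width, stall_cycles):
    """Instruction slots lost on a machine that is `width` wide."""
    return width * stall_cycles


if __name__ == "__main__":
    stalls = branch_penalty("pc_relative", "register_compare")
    print(stalls, lost_slots(4, stalls))
```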

Determining Branch Direction

Problem: Cannot fetch subsequent instructions until branch direction is determined

‹ Minimize penalty
- Move the instruction that computes the branch condition away from branch (ISA&compiler)
‹ Make use of penalty
- Bias for not-taken
- Fill delay slots with useful/safe instructions (ISA&compiler)
- Follow both paths of execution (hardware)
- Predict branch direction (hardware)

Determining Branch Target

Problem: Cannot fetch subsequent instructions until branch target is determined

‹ Minimize delay
- Generate branch target early in the pipeline
‹ Make use of delay
- Bias for not taken
- Predict branch target
PC-relative vs Register Indirect targets

Keys to Branch Prediction

‹ Target Address Generation
- Access register
• PC, GP register, Link register
- Perform calculation
• +/- offset, auto incrementing/decrementing
⇒ Target Speculation
‹ Condition Resolution
- Access register
• Condition code register, data register, count register
- Perform calculation
• Comparison of data register(s)
⇒ Condition Speculation

Branch Target Speculation: Branch Target Buffer

‹ Use branch target buffer (BTB) to store previous branch target address
‹ BTB is a small fully associative cache
- Accessed during instruction fetch using PC
‹ BTB can have three fields
- Branch instruction address (BIA)
- Branch target address (BTA)
- History bits
‹ When PC matches a BIA, the access hits in the BTB
- A hit in BTB implies inst being fetched is branch inst
- The BTA field can be used to fetch next instruction if particular branch is predicted to be taken
- Note: br inst is still fetched and executed for validation/recovery

Branch Condition Speculation

‹ Biased For Not Taken
- Does not affect the instruction set architecture
- Not effective in loops
‹ Software Prediction
- Encode an extra bit in the branch instruction
• Predict not taken: set bit to 0
• Predict taken: set bit to 1
- Bit set by compiler or user; can use profiling
- Static prediction, same behavior every time
‹ Prediction Based on Branch Offsets
- Positive offset: predict not taken
- Negative offset: predict taken
‹ Prediction Based on History

History-Based Prediction

‹ Make prediction based on previous observation
- Assumption guiding history-based prediction is that historical info on direction taken by branch in previous executions can give helpful hints on direction it will take in future executions
‹ How much history? What prediction?
‹ Finite state machine algorithm
- N state variables encode direction taken by last n executions of branch
• Each state represents particular history pattern in terms of taken/not-taken (T/NT)
• Output logic generates prediction based on history
- When predicted branch is finally executed, use actual outcome to transition to next state
- Next state logic: chain state variables into shift register

Branch Instruction Speculation

[Figure: the fetch stage sends the current PC to the I-cache and to a branch predictor (using a BTB), nPC = BP(PC). The predictor supplies a speculative target and a speculative condition; an FA-mux selects the nPC sent to the I-cache from the speculative target and the sequential address nPC(seq.) = PC+4. The pipeline continues through Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Branch, Execute, Finish, Completion Buffer; the branch unit sends a BTB update (target addr. and history). Margin question: What should this really look like for a superscalar?]

Branch Target Buffer (BTB)

‹ A small “cache-like” memory in the instruction fetch stage

Entry fields:  Branch Inst. Address (tag) | Branch History | Branch Target (Most Recent)

‹ Remembers previously executed branches, their addresses, information to aid prediction, and most recent target addresses
‹ Instruction fetch stage compares current PC against those in BTB to “guess” nPC
- If matched then prediction is made else nPC=PC+4
- If predict taken then nPC=target address in BTB else nPC=PC+4
‹ When branch is actually resolved, BTB is updated
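The nPC selection rules above can be sketched as a tiny model. This is a minimal sketch, not the slide's hardware: the class and method names are hypothetical, a Python dict stands in for the fully associative lookup with no capacity limit, and the history field is reduced to a single "last outcome" bit.

```python
class BranchTargetBuffer:
    """Toy BTB: maps a branch instruction address (BIA) to [BTA, predict_taken]."""

    def __init__(self):
        self.entries = {}

    def next_pc(self, pc):
        """Guess nPC during fetch: BTB target on a predicted-taken hit, else PC+4."""
        entry = self.entries.get(pc)
        if entry is not None and entry[1]:
            return entry[0]
        return pc + 4

    def update(self, pc, taken, target):
        """Called when the branch resolves; records the most recent target and outcome."""
        self.entries[pc] = [target, taken]


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(hex(btb.next_pc(0x100)))      # miss: falls through to PC+4
    btb.update(0x100, True, 0x80)
    print(hex(btb.next_pc(0x100)))      # hit, predicted taken
```

The branch itself is still executed later, which is where `update` would be called and where a wrong guess would be detected.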

UCB Study [Lee and Smith, 1984]

‹ Benchmarks
- 26 programs (traces on IBM 370, DEC PDP-11, CDC 6400)
- Use trace-driven simulation with parameterized machine models
‹ Branch types
- Unconditional: always taken or always not taken
- Subroutine call: always taken
- Loop control: usually taken (loop back)
- Decision: either way, e.g. IF-THEN-ELSE
- Computed GOTO: always taken, with changing target
- Supervisor call: always taken
- “Execute”: always taken (IBM 370)
‹ Branch behavior: Taken vs Not Taken

      IBM1    IBM2    IBM3    IBM4    DEC     CDC     Average
T     0.640   0.657   0.704   0.540   0.738   0.778   0.676
NT    0.360   0.343   0.296   0.460   0.262   0.222   0.324

Branch Prediction Function

‹ Based on opcode only (%)

IBM1   IBM2   IBM3   IBM4   DEC   CDC
66     69     71     55     80    78

‹ Based on history of branch
- Branch prediction function F (X1, X2, .... )
- Use up to 5 previous branches for history (%)

History   IBM1   IBM2   IBM3   IBM4   DEC    CDC
0         64.1   64.4   70.4   54.0   73.8   77.8
1         91.9   95.2   86.6   79.7   96.5   82.3
2         93.3   96.5   90.8   83.4   97.5   90.6
3         93.7   96.7   91.2   83.5   97.7   93.5
4         94.5   97.0   92.0   83.7   98.1   95.3
5         94.7   97.1   92.2   83.9   98.2   95.7

2-bit predictors

‹ Prediction accuracy approaches maximum with as few as 2 preceding branch occurrences used as history

[Figure: four-state FSM; each state is labeled with the last two branches (TT, NT, TN, NN) and the next prediction, with T and N transition arcs]

Results (%)
IBM1   IBM2   IBM3   IBM4   DEC    CDC
93.3   96.5   90.8   83.4   97.5   90.6

Example Prediction Algorithm

‹ Use 2 history bits to track outcome of 2 previous executions of branch
- 2 bits are status of FSM
- NN, NT, TN, TT
‹ Each FSM represents a prediction algorithm
‹ To support history-based prediction, the BTB includes a history field for each branch
- Retrieve target address plus history bits
- Feed history bits to logic that generates next state and prediction

Other Prediction Algorithms

[Figure: state diagrams of two 2-bit prediction FSMs, a Saturation Counter and a Hysteresis Counter, each with states T, t, n, N and T/N transition arcs]

‹ Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide the net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.

IBM RS/6000 Study [Nair, 1992]

‹ Five different branch types
- b: unconditional branch
- bl: branch and link (subroutine calls)
- bc: conditional branch
- bcr: conditional branch using link register (subroutine returns)
- bcc: conditional branch using count register (system calls)
‹ Separate branch function unit to overlap branch instructions with other instructions
‹ Two causes for branch stalls
- Unresolved conditions
- Branches downstream too close to unresolved branches

Branch Instruction Distribution

            % of diff. types of branch instructions   % of bc inst. with penalty cycles
Benchmark   b      bl     bc      bcr                  3 cyc.   2 cyc.   1 cyc.
spice2g6    7.86   0.30   12.58   0.32                 13.82    3.12     0.76
doduc       1.00   0.94   8.22    1.01                 10.14    1.76     2.02
matrix300   0.00   0.00   14.50   0.00                 0.68     0.22     0.20
tomcatv     0.00   0.00   6.10    0.00                 0.24     0.02     0.01
gcc         2.30   1.32   15.50   1.81                 22.46    9.48     4.85
espresso    3.61   0.58   19.85   0.68                 37.37    1.77     0.31
li          2.41   1.92   14.36   1.91                 31.55    3.44     1.37
eqntott     0.91   0.47   32.87   0.51                 5.01     11.01    0.80

Number of Counter Bits Needed

            Prediction Accuracy (Overall CPI Overhead)
Benchmark   3-bit          2-bit          1-bit          0-bit
spice2g6    97.0 (0.009)   97.0 (0.009)   96.2 (0.013)   76.6 (0.031)
doduc       94.2 (0.003)   94.3 (0.003)   90.2 (0.004)   69.2 (0.022)
gcc         89.7 (0.025)   89.1 (0.026)   86.0 (0.033)   50.0 (0.128)
espresso    89.5 (0.045)   89.1 (0.047)   87.2 (0.054)   58.5 (0.176)
li          88.3 (0.042)   86.8 (0.048)   82.5 (0.063)   62.4 (0.142)
eqntott     89.3 (0.028)   87.2 (0.033)   82.9 (0.046)   78.4 (0.049)

‹ Branch history table size: Direct-mapped array of 2^k entries
‹ Programs, like gcc, can have over 7000 conditional branches
‹ In collisions, multiple branches share the same predictor
‹ Variation of branch penalty with branch history table size levels out at 1024
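A direct-mapped branch history table of 2-bit saturating counters, as discussed above, can be sketched as follows. Treat it as an illustration only: the class name, the choice of k = 10, the use of PC bits above the byte offset as the index, and the initial counter value (weakly not-taken) are all assumptions.

```python
class BranchHistoryTable:
    """Direct-mapped table of 2^k two-bit saturating counters (0..3)."""

    def __init__(self, k=10):
        self.mask = (1 << k) - 1
        self.counters = [1] * (1 << k)   # assumption: start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) & self.mask     # branches that collide share a counter

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)


def accuracy(bht, pc, outcomes):
    """Fraction of correct predictions for one branch over a list of outcomes."""
    hits = 0
    for taken in outcomes:
        hits += bht.predict(pc) == taken
        bht.update(pc, taken)
    return hits / len(outcomes)


if __name__ == "__main__":
    # A loop branch: taken 9 times, then falls through, repeated 10 times.
    loop = ([True] * 9 + [False]) * 10
    print(accuracy(BranchHistoryTable(), 0x400, loop))
```

On this loop pattern the counter mispredicts once per loop exit but, unlike a 1-bit scheme, does not also mispredict the first iteration of the next pass.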

Inter-relating Branches

‹ So far, not considered inter-dependent branches
- Branches whose outcome depends on other branches

If (a=0) then { s1};
If (a>0) then {s2};
If (a<0) then {s3};

Global Branch Prediction

‹ So far, the prediction of each static branch instruction is based solely on its own past behavior and not the behaviors of other neighboring static branch instructions

[Figure: a Branch History Register (shift left when update), shown holding 1 1 1 1 0, indexes the Pattern History Table (PHT, entries 00...00 to 11...11). The selected PHT bits (old) feed the FSM logic, which produces the prediction; on branch resolution the new PHT bits are written back.]

2-Level Adaptive Prediction [Yeh & Patt]

‹ Two-level adaptive branch prediction
- 1st level: History of last k (dynamic) branches encountered
- 2nd level: branch behavior of the last s occurrences of the specific pattern of these k branches
- Use a Branch History Register (BHR) in conjunction with a Pattern History Table (PHT)
‹ Example: (k=8, s=6)
- Last k branches with the behavior (11100101)
- s-bit History at the entry (11100101) is [101010]
- Using history, branch prediction algorithm predicts direction of the branch
‹ Effectiveness:
- Average 97% accuracy for SPEC
- Used in the Intel P6 and AMD K6

2-level predictors

‹ Relate branches
- Globally (G)
- Individual (P) (also known as per-branch)
‹ Global: single BHSR of k bits tracks branch directions of last k dynamic branches
‹ Individual: employ set of k-bit BHSR, one of which is selected for a branch
- Global shared by all branches, whereas individual has BHSR dedicated to each branch (or subset)
‹ PHT has options: global (g), individual (p), shared (s)
- Global has single table for all static branches
- Individual has PHT for each static branch
- Or subset of PHT shared by each branch
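A minimal sketch of the global variant, with one k-bit history register indexing one pattern history table, is shown below. The class name and k = 4 are arbitrary choices for the example, and a 2-bit saturating counter is assumed as the per-pattern state rather than the s-bit history used in the slide's example.

```python
class TwoLevelPredictor:
    """Global history register (level 1) indexing a pattern history table (level 2)."""

    def __init__(self, k=4):
        self.k_mask = (1 << k) - 1
        self.bhr = 0                     # last k branch outcomes, newest in bit 0
        self.pht = [1] * (1 << k)        # assumption: 2-bit counters, weakly not-taken

    def predict(self):
        return self.pht[self.bhr] >= 2   # True means predict taken

    def update(self, taken):
        c = self.pht[self.bhr]
        self.pht[self.bhr] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift left when updating, as in the slide's BHR.
        self.bhr = ((self.bhr << 1) | int(taken)) & self.k_mask


def mispredictions(predictor, outcomes):
    """Return one flag per branch: True where the prediction was wrong."""
    misses = []
    for taken in outcomes:
        misses.append(predictor.predict() != taken)
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    alternating = [True, False] * 50
    print(sum(mispredictions(TwoLevelPredictor(), alternating)))
```

A strictly alternating branch is hard for a single per-branch counter, but here each history pattern gets its own counter, so the predictor settles after a short warm-up.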

Global BHSR Scheme (GAs)

[Figure: j bits of the branch address and the k bits of a single Branch History Shift Register (BHSR) together index a BHT of 2^(j+k) x 2 bits, which drives the prediction]

Per-Branch BHSR Scheme (PAs)

[Figure: i bits of the branch address select one of the Branch History Shift Registers (k x 2^i), laid out like a standard BHT; the selected k bits and j bits of the branch address together index a BHT of 2^(j+k) x 2 bits, which drives the prediction]

Other Schemes

‹ Function Return Stack
- Register indirect targets are hard to predict from branch history
- Register indirect branches are mostly used for function returns
⇒ 1. Push the return address onto a stack on each function call
   2. On a reg. indirect branch, pop and return the top address as prediction
‹ Combining Branch Predictors
- Each type of branch prediction scheme tries to capture a particular program behavior
- May want to include multiple prediction schemes in hardware
- Use another history-based prediction scheme to “predict” which predictor should be used for a particular branch
You get the best of all worlds. This works quite well

BTB for Superscalar Fetch

[Figure: the PC (sequential increment +16) accesses the icache together with a Branch History Table and Branch Target Address Cache; the pipeline continues through Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue (units BRN, SFX, SFX, CFX, FPU, LS), Branch, Execute, Finish, Completion Buffer, with feedback from the branch unit]

BTB input: current cache line address
BTB output: what is the next cache line address and how many words of the current line are useful
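The two-step Function Return Stack rule under Other Schemes can be sketched like this. The class name, the fixed depth of 8, the 4-byte instruction size, and the drop-oldest overflow policy are assumptions made for the illustration.

```python
class ReturnAddressStack:
    """Predicts targets of register-indirect returns from a small hardware stack."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc):
        """Step 1: push the return address on each function call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # assumption: drop the oldest entry on overflow
        self.stack.append(call_pc + 4)     # assumption: 4-byte instructions

    def predict_return(self):
        """Step 2: on a register-indirect branch, pop the top address as the prediction."""
        return self.stack.pop() if self.stack else None


if __name__ == "__main__":
    ras = ReturnAddressStack()
    ras.on_call(0x100)        # outer call
    ras.on_call(0x200)        # nested call
    print(hex(ras.predict_return()), hex(ras.predict_return()))
```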

PPC 604 Fetch Address Generation

[Figure: the fetch address register (FAR) accesses the instruction cache, BHT, and BTAC, with +2 and +4 sequential increments. At each of the fetch, decode, and dispatch stages (4 instructions each), prediction logic selects between a target and a sequential address. At execute, the branch unit supplies a target and exception logic; the PC is updated at complete.]

Branch Mis-prediction Recovery

‹ Branch speculation involves predicting direction of branch and then proceeding to fetch along that path
- Fetching on that path may encounter more branch inst
‹ Must provide validation and recovery mechanisms
‹ To identify speculated instructions, tagging is used
- Tagged instruction indicates a speculative inst
- Tag value for each basic block (branch)
‹ Validation occurs when branch is executed and branch outcome known; correctness of prediction known
- Prediction correct = de-allocate spec. tag
- Incorrect prediction = terminate incorrect path and fetch from correct path

Control Flow Speculation

[Figure: binary tree of NT/T branch outcomes; successive levels of speculation carry tag1, tag2, tag3]

‹ Leading Speculation
- Tag speculative instructions
- Advance branch and following instructions
- Buffer addresses of speculated branch instructions

Mis-speculation Recovery

[Figure: the same NT/T tree with tag1, tag2, tag3]

‹ Eliminate Incorrect Path
- Must ensure that the mis-speculated instructions produce no side effects
‹ Start New Correct Path
- Must have remembered the alternate (non-predicted) path
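Tag-based validation and recovery, as outlined above, can be sketched with a small bookkeeping model. Everything here is hypothetical scaffolding for illustration: the class, its method names, and the choice to record on each instruction the set of unresolved branch tags it was fetched under.

```python
class SpeculationTracker:
    """Tracks which in-flight instructions depend on which unresolved branches."""

    def __init__(self):
        self.next_tag = 1
        self.active_tags = []   # unresolved branch tags, oldest first
        self.window = []        # (name, tags it is speculative on), in fetch order

    def fetch(self, name, is_branch=False):
        self.window.append((name, frozenset(self.active_tags)))
        if not is_branch:
            return None
        tag = self.next_tag     # instructions after this branch carry its tag
        self.next_tag += 1
        self.active_tags.append(tag)
        return tag

    def resolve(self, tag, correct):
        if correct:
            # Prediction correct: de-allocate the tag everywhere.
            self.active_tags.remove(tag)
            self.window = [(n, t - {tag}) for n, t in self.window]
        else:
            # Misprediction: squash everything fetched under this tag,
            # including younger speculated branches and their paths.
            self.window = [(n, t) for n, t in self.window if tag not in t]
            self.active_tags = self.active_tags[:self.active_tags.index(tag)]


if __name__ == "__main__":
    s = SpeculationTracker()
    s.fetch("i1")
    t1 = s.fetch("br1", is_branch=True)
    s.fetch("i2")
    s.fetch("br2", is_branch=True)
    s.fetch("i3")
    s.resolve(t1, correct=False)
    print([name for name, _ in s.window])
```

After a squash, fetch would restart from the remembered alternate path; that redirect is outside this sketch.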

Mis-speculation Recovery

‹ Eliminate Incorrect Path
- Use branch tag(s) to deallocate completion buffer entries occupied by speculative instructions (now determined to be mis-speculated).
- Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations
How expensive is a misprediction?
‹ Start New Correct Path
- Update PC with computed branch target (if it was predicted NT)
- Update PC with sequential instruction address (if it was predicted T)
- Can begin speculation once again when a new branch is encountered
How soon can you restart?

Trailing Confirmation

[Figure: NT/T tree with tag1, tag2, tag3]

‹ Trailing Confirmation
- When branch is resolved, remove/deallocate speculation tag
- Permit completion of branch and following instructions

Impediments to Parallel/Wide Fetching

‹ Average Basic Block Size
- integer code: 4-6 instructions
- floating-point code: 6-10 instructions
‹ Branch Prediction Mechanisms
- must make multiple branch predictions per cycle
- potentially multiple predicted taken branches
‹ Conventional I-Cache Organization
- must fetch from multiple predicted taken targets per cycle
- must align and collapse multiple fetch groups per cycle
…Trace Caching!!

HW Schemes: Instruction Parallelism

‹ Why in HW at run time?
- Works when can’t know real dependence at compile time
- Compiler simpler
- Code for one machine runs well on another
‹ Key idea: Allow instructions behind stall to proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
- Enables out-of-order execution => out-of-order completion
- ID stage checked both for structural and data hazards
- Scoreboard dates to CDC 6600 in 1963

Advantages of Dynamic Scheduling

‹ Handles cases when dependences unknown at compile time
- (e.g., because they may involve a memory reference)
‹ It simplifies the compiler
‹ Allows code that was compiled for one pipeline to run efficiently on a different pipeline
‹ Hardware speculation, a technique with significant performance advantages, that builds on dynamic scheduling

HW Schemes: Instruction Parallelism

‹ Out-of-order execution divides ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
‹ Two major schemes:
- Scoreboard
- Reservation station (Tomasulo algo)
‹ They allow an instruction to execute whenever 1 & 2 hold, not waiting for prior instructions
‹ In order issue, out of order execution, out of order commit (also called completion)

A Dynamic Algorithm: Tomasulo’s Algorithm

‹ For IBM 360/91 (before caches!)
‹ Goal: High Performance without special compilers
‹ Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
- This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
‹ Why Study 1966 Computer?
‹ The descendants of this have flourished!
- Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …

Tomasulo Method: Approach

‹ Tracks when operands for instructions are available
- Minimizes RAW hazards
‹ Renames registers
- Minimizes WAR and WAW hazards
‹ Many variations in use today but key remains
- Tracking instruction dependencies to allow execution as soon as operands available, and rename registers to avoid WAR and WAW
- Basically, it tries to follow the data-flow execution

Diversified Pipelined Inorder Issue, Out-of-order Complete

‹ Multiple functional units (FU’s)
- Floating-point add
- Floating-point multiply/divide
‹ Three register files (pseudo reg-reg machine in FP unit)
- (4) floating-point registers (FLR)
- (6) floating-point buffers (FLB)
- (3) store data buffers (SDB)
‹ Out of order instruction execution:
- After decode the instruction unit passes all floating point instructions (in order) to the floating-point operation stack (FLOS).
- In the floating point unit, instructions are then further decoded and issued from the FLOS to the two FU’s
‹ Variable operation latencies (not pipelined):
- Floating-point add: 2 cycles
- Floating-point multiply: 3 cycles
- Floating-point divide: 12 cycles

Tomasulo Algorithm

‹ Control & buffers distributed with Function Units (FU)
- FU buffers called “reservation stations”; have pending operands
‹ Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
- avoids WAR, WAW hazards
- More reservation stations than registers, so can do optimizations compilers can’t
‹ Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
‹ Load and Stores treated as FUs with RSs as well
‹ Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue

Tomasulo Organization

[Figure: the FP Op Queue and FP Registers feed reservation stations Add1 to Add3 in front of the FP adders and Mult1, Mult2 in front of the FP multipliers. Load buffers Load1 to Load6 receive data from memory; store buffers send data to memory. Results travel on the Common Data Bus (CDB).]

Reservation Station

‹ Buffers where instructions can wait for RAW hazard resolution and execution
‹ Associate more than one set of buffering registers (control, source, sink) with each FU ⇒ virtual FU’s.
- Add unit: three reservation stations
- Multiply/divide unit: two reservation stations
‹ Pending (not yet executing) instructions can have either value operands or pseudo operands (aka. tags).

[Figure: reservation stations RS1 and RS2 in front of one Mult unit ⇒ two virtual Mult units]

Rename Tags

‹ Register names are normally bound to FLR registers
‹ When an FLR register is stale, the register “name” is bound to the pending-update instruction
‹ Tags are names to refer to these pending-update instructions
‹ In Tomasulo, a “tag” is statically bound to the buffer where a pending-update instruction waits.
- 6 FLB’s
- 5 reservation stations (3 add RSs, 2 multiply/divide RSs)
⇒ 4-bit tag is needed to identify the 11 potential sources
‹ Instructions can be dispatched to RSs with either value operands or just tags.
- Tag operand ⇒ unfulfilled RAW dependence
- The instruction in the RS corresponding to the tag will produce the actual value eventually

Common Data Bus (CDB)

‹ CDB is driven by all units that can update FLR
- When an instruction finishes, it broadcasts both its “tag” and its result on the CDB.
- Why don’t we need the destination register name?
‹ Sources of CDB:
- Floating-point buffers (FLB)
- Two FU’s (add unit and the multiply/divide unit)
‹ The CDB is monitored by all units that were left holding a tag instead of a value operand
- Listens for tag broadcast on the CDB
- If a tag matches, grab the value
‹ Destinations of CDB:
- Reservation stations
- Store data buffers (SDB)
- Floating-point registers (FLR)

Superscalar Execution Check List

INSTRUCTION PROCESSING CONSTRAINTS
- Resource Contention (Structural Dependences)
- Code Dependences
  - Control Dependences
  - Data Dependences
    - (RAW) True Dependences
    - Storage Conflicts
      - (WAR) Anti-Dependences
      - (WAW) Output Dependences

Reservation Station Components

Op: Operation to perform in the unit (e.g., + or -)
Vj, Vk: Value of source operands
- Store buffers have V field, result to be stored
Qj, Qk: Reservation stations producing source registers (value to be written)
- Note: Qj,Qk=0 => ready
- Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busy
Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.

Structural Dependence Resolution

‹ Structural dependence: virtual FU’s
- FLOS can hold and decode up to 8 instructions.
- Instructions are dispatched to the 5 reservation stations (virtual FU’s) even though there are only two physical FU’s.
- Hence, structural dependence does not stall decoding
Why is this useful?

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
2. Execute: operate on operands (EX)
   When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting units; mark reservation station available
‹ Normal data bus: data + destination (“go to” bus)
‹ Common data bus: data + source (“come from” bus)
- 64 bits of data + 4 bits of Functional Unit source address
- Write if matches expected Functional Unit (produces result)
- Does the broadcast
‹ Example speed: 3 clocks for Fl. pt. +,-; 10 for *; 40 clks for /

Resolving True-Dependence

‹ True dependence: Tags + CDB
- If an operand is available in FLR, it is copied to RS
- If an operand is not available then a tag is copied to the RS instead. This tag identifies the source (RS/instruction) of the pending write
- Eventually the source instruction completes and broadcasts its tag and value on the CDB
- Any reservation station entry, FLR entry or SDB entry that holds a matching tag as operand will latch in the broadcasted value from the CDB.
RAW dependence does not block subsequent independent instructions and does not block an FU

Resolving Anti-Dependence

‹ Anti-dependence: Operand Copying
- If an operand is available in FLR, it is copied to RS with the issuing instruction
- By copying this operand to RS, all WAR dependencies due to future writes to this same register are resolved
Hence, the reading of an operand is not delayed, possibly due to other dependencies, and subsequent writes are also not delayed.
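The tag and CDB mechanics above (copy a value if the register is current, otherwise copy the producer's tag, then latch the value when that tag is broadcast) can be sketched as follows. This is a simplified model under stated assumptions: the class and method names are invented for the example, timing and execution are left out, and results are passed in by the caller.

```python
class Station:
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.op = None
        self.vj = self.vk = None   # operand values, once known
        self.qj = self.qk = None   # tags of the producing stations, while pending


class Tomasulo:
    def __init__(self, registers, station_names):
        self.value = dict(registers)               # FLR contents
        self.tag = {r: None for r in registers}    # pending writer of each register
        self.stations = {n: Station(n) for n in station_names}

    def _read(self, reg):
        pending = self.tag[reg]
        if pending is None:
            return self.value[reg], None           # operand copying (resolves WAR)
        return None, pending                       # tag stands in for the value (RAW)

    def issue(self, station, op, dest, src1, src2):
        rs = self.stations[station]
        if rs.busy:
            raise RuntimeError("structural hazard: station busy")
        rs.busy, rs.op = True, op
        rs.vj, rs.qj = self._read(src1)
        rs.vk, rs.qk = self._read(src2)
        self.tag[dest] = station                   # rename dest to this station
        return rs

    def broadcast(self, station, result):
        """Write result: put (tag, value) on the CDB; every matching holder latches it."""
        for rs in self.stations.values():
            if rs.busy and rs.qj == station:
                rs.vj, rs.qj = result, None
            if rs.busy and rs.qk == station:
                rs.vk, rs.qk = result, None
        for reg, pending in self.tag.items():
            if pending == station:                 # only the latest renamer writes the FLR
                self.value[reg], self.tag[reg] = result, None
        self.stations[station].busy = False


if __name__ == "__main__":
    regs = {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0, "F10": 0.0}
    m = Tomasulo(regs, ["Add1", "Add2", "Mult1", "Mult2"])
    m.issue("Mult1", "MULTD", "F0", "F2", "F4")
    div = m.issue("Mult2", "DIVD", "F10", "F0", "F6")   # waits on Mult1, copies F6
    m.issue("Add1", "ADDD", "F6", "F8", "F2")           # later write to F6
    m.broadcast("Add1", 10.0)
    print(div.qj, div.vk, m.value["F6"])
```

In the usage above, the later write to F6 does not disturb the F6 value DIVD already copied, and DIVD holds only the tag Mult1 for its other operand until that result is broadcast.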

Resolving Output-Dependence

‹ Output dependence: “register renaming” + result forwarding
- If a FLR is waiting for a pending write, its tag field will contain the tag of the source instruction
- If a 2nd instruction comes along and wants to write the same register
• the register can be renamed to the 2nd instruction (i.e. new tag)
• Any instruction that needs the value of the 1st pending write has the tag of the 1st instruction. Hence, the correct value will be forwarded from the 1st instruction directly
• any subsequent instruction that reads the register will get the tag, or eventually the result, of the 2nd instruction
WAW dependence is resolved without stalling a physical functional unit and does not require additional buffers to ensure sequential write back to the register file.

Tomasulo Example

(Slide annotations: instruction stream; 3 load buffers; 3 FP adder R.S.; 2 FP mult R.S.; Time = FU count down; Clock = clock cycle counter)

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result
LD     F6    34+   R2
LD     F2    45+   R3
MULTD  F0    F2    F4
SUBD   F8    F6    F2
DIVD   F10   F0    F6
ADDD   F6    F8    F2
Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No
Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No
Register result status:
Clock      F0     F2     F4   F6     F8     F10    F12  ...  F30
0     FU

Tomasulo Example Cycle 1

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result
LD     F6    34+   R2    1
LD     F2    45+   R3
MULTD  F0    F2    F4
SUBD   F8    F6    F2
DIVD   F10   F0    F6
ADDD   F6    F8    F2
Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  No
Load3  No
Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No
Register result status:
Clock      F0     F2     F4   F6     F8     F10    F12  ...  F30
1     FU                      Load1

Tomasulo Example Cycle 2

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result
LD     F6    34+   R2    1
LD     F2    45+   R3    2
MULTD  F0    F2    F4
SUBD   F8    F6    F2
DIVD   F10   F0    F6
ADDD   F6    F8    F2
Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No
Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No
Register result status:
Clock      F0     F2     F4   F6     F8     F10    F12  ...  F30
2     FU          Load2       Load1

Note: Can have multiple loads outstanding

Tomasulo Example Cycle 3

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result
LD     F6    34+   R2    1      3
LD     F2    45+   R3    2
MULTD  F0    F2    F4    3
SUBD   F8    F6    F2
DIVD   F10   F0    F6
ADDD   F6    F8    F2
Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No
Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F4)  Load2
      Mult2  No
Register result status:
Clock      F0     F2     F4   F6     F8     F10    F12  ...  F30
3     FU   Mult1  Load2       Load1

• Note: register names are removed (“renamed”) in Reservation Stations; MULT issued
• Load1 completing; what is waiting for Load1?

Tomasulo Example Cycle 4

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result
LD     F6    34+   R2    1      3          4
LD     F2    45+   R3    2      4
MULTD  F0    F2    F4    3
SUBD   F8    F6    F2    4
DIVD   F10   F0    F6
ADDD   F6    F8    F2
Load buffers:
Name   Busy  Address
Load1  No
Load2  Yes   45+R3
Load3  No
Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   Yes   SUBD   M(A1)                Load2
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F4)  Load2
      Mult2  No
Register result status:
Clock      F0     F2     F4   F6     F8     F10    F12  ...  F30
4     FU   Mult1  Load2       M(A1)  Add1

• Load2 completing; what is waiting for Load2?

Tomasulo Example Cycle 5

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result
LD     F6    34+   R2    1      3          4
LD     F2    45+   R3    2      4          5
MULTD  F0    F2    F4    3
SUBD   F8    F6    F2    4
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2
Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No
Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
2     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   No
      Add3   No
10    Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1
Register result status:
Clock      F0     F2     F4   F6     F8     F10    F12  ...  F30
5     FU   Mult1  M(A2)       M(A1)  Add1   Mult2

• Timer starts down for Add1, Mult1

Tomasulo Example Cycle 6

Instruction status:
Instruction  j     k     Issue  Exec Comp  Write Result
LD     F6    34+   R2    1      3          4
LD     F2    45+   R3    2      4          5
MULTD  F0    F2    F4    3
SUBD   F8    F6    F2    4
DIVD   F10   F0    F6    5
ADDD   F6    F8    F2    6
Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No
Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
1     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   Yes   ADDD          M(A2)  Add1
      Add3   No
9     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1
Register result status:
Clock      F0     F2     F4   F6     F8     F10    F12  ...  F30
6     FU   Mult1  M(A2)       Add2   Add1   Mult2

• Issue ADDD here despite name dependency on F6?

Tomasulo Example Cycle 7 Tomasulo Example Cycle 8 Instruction status: Exec Write Instruction status: Exec Write Instruction jkIssue Comp Result Busy Address Instruction jkIssue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 245 Load2No LD F2 45+ R3 245 Load2No MULTD F0 F2 F4 3Load3No MULTD F0 F2 F4 3Load3No SUBD F8 F6 F2 47 SUBD F8 F6 F2 478 DIVD F10 F0 F6 5 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 ADDD F6 F8 F2 6 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk 0Add1 Yes SUBD M(A1) M(A2) Add1 No Add2 Yes ADDD M(A2) Add1 2 Add2 Yes ADDD (M-M) M(A2) Add3 No Add3 No 8Mult1 Yes MULTD M(A2) R(F4) 7Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Mult2 Yes DIVD M(A1) Mult1 Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 7 FU Mult1 M(A2) Add2 Add1 Mult2 8 FU Mult1 M(A2) Add2 (M-M) Mult2

• Add1 (SUBD) completing; what is waiting for it? Tomasulo Example Cycle 9 Tomasulo Example Cycle 10 Instruction status: Exec Write Instruction status: Exec Write Instruction jkIssue Comp Result Busy Address Instruction jkIssue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 245 Load2No LD F2 45+ R3 245 Load2No MULTD F0 F2 F4 3Load3No MULTD F0 F2 F4 3Load3No SUBD F8 F6 F2 478 SUBD F8 F6 F2 478 DIVD F10 F0 F6 5 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 ADDD F6 F8 F2 6 10 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk Add1 No Add1 No 1 Add2 Yes ADDD (M-M) M(A2) 0 Add2 Yes ADDD (M-M) M(A2) Add3 No Add3 No 6Mult1 Yes MULTD M(A2) R(F4) 5Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Mult2 Yes DIVD M(A1) Mult1 Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 9 FU Mult1 M(A2) Add2 (M-M) Mult2 10 FU Mult1 M(A2) Add2 (M-M) Mult2

• Add2 (ADDD) completing; what is waiting for it?

Tomasulo Example Cycle 11 Tomasulo Example Cycle 12 Instruction status: Exec Write Instruction status: Exec Write Instruction jkIssue Comp Result Busy Address Instruction jkIssue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 245 Load2No LD F2 45+ R3 245 Load2No MULTD F0 F2 F4 3Load3No MULTD F0 F2 F4 3Load3No SUBD F8 F6 F2 478 SUBD F8 F6 F2 478 DIVD F10 F0 F6 5 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk Add1 No Add1 No Add2 No Add2 No Add3 No Add3 No 4Mult1 Yes MULTD M(A2) R(F4) 3Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Mult2 Yes DIVD M(A1) Mult1 Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 11 FU Mult1 M(A2) (M-M+M(M-M) Mult2 12 FU Mult1 M(A2) (M-M+M(M-M) Mult2

• Write result of ADDD here? • All quick instructions complete in this cycle! Tomasulo Example Cycle 13 Tomasulo Example Cycle 14 Instruction status: Exec Write Instruction status: Exec Write Instruction jkIssue Comp Result Busy Address Instruction jkIssue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 245 Load2No LD F2 45+ R3 245 Load2No MULTD F0 F2 F4 3Load3No MULTD F0 F2 F4 3Load3No SUBD F8 F6 F2 478 SUBD F8 F6 F2 478 DIVD F10 F0 F6 5 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk Add1 No Add1 No Add2 No Add2 No Add3 No Add3 No 2Mult1 Yes MULTD M(A2) R(F4) 1Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1 Mult2 Yes DIVD M(A1) Mult1 Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 13 FU Mult1 M(A2) (M-M+M(M-M) Mult2 14 FU Mult1 M(A2) (M-M+M(M-M) Mult2

Tomasulo Example Cycle 15 Tomasulo Example Cycle 16 Instruction status: Exec Write Instruction status: Exec Write Instruction jkIssue Comp Result Busy Address Instruction jkIssue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 245 Load2No LD F2 45+ R3 245 Load2No MULTD F0 F2 F4 315 Load3No MULTD F0 F2 F4 31516Load3No SUBD F8 F6 F2 478 SUBD F8 F6 F2 478 DIVD F10 F0 F6 5 DIVD F10 F0 F6 5 ADDD F6 F8 F2 6 10 11 ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Time Name Busy Op Vj Vk Qj Qk Add1 No Add1 No Add2 No Add2 No Add3 No Add3 No 0Mult1 Yes MULTD M(A2) R(F4) Mult1 No Mult2 Yes DIVD M(A1) Mult1 40 Mult2 Yes DIVD M*F4 M(A1) Register result status: Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 Clock F0 F2 F4 F6 F8 F10 F12 ... F30 15 FU Mult1 M(A2) (M-M+M(M-M) Mult2 16 FU M*F4 M(A2) (M-M+M(M-M) Mult2

• Mult1 (MULTD) completing; what is waiting for it? • Just waiting for Mult2 (DIVD) to complete Tomasulo Example Cycle 55 Instruction status: Exec Write Instruction jkIssue Comp Result Busy Address LD F6 34+ R2 1 3 4 Load1 No LD F2 45+ R3 245 Load2No MULTD F0 F2 F4 31516Load3No SUBD F8 F6 F2 478 Faster than light computation DIVD F10 F0 F6 5 (skip a couple of cycles) ADDD F6 F8 F2 6 10 11 Reservation Stations: S1 S2 RS RS Time Name Busy Op Vj Vk Qj Qk Add1 No Add2 No Add3 No Mult1 No 1Mult2 Yes DIVD M*F4 M(A1) Register result status: Clock F0 F2 F4 F6 F8 F10 F12 ... F30 55 FU M*F4 M(A2) (M-M+M(M-M) Mult2

Tomasulo Example Cycle 56

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD     F6     34+  R2     1       3           4
LD     F2     45+  R3     2       4           5
MULTD  F0     F2   F4     3      15          16
SUBD   F8     F6   F2     4       7           8
DIVD   F10    F0   F6     5      56
ADDD   F6     F8   F2     6      10          11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                         S1    S2     RS  RS
Time  Name   Busy  Op    Vj    Vk     Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   DIVD  M*F4  M(A1)

Register result status:
Clock      F0    F2     F4  F6       F8     F10    F12 ... F30
56     FU  M*F4  M(A2)      (M-M+M)  (M-M)  Mult2

Tomasulo Example Cycle 57

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD     F6     34+  R2     1       3           4
LD     F2     45+  R3     2       4           5
MULTD  F0     F2   F4     3      15          16
SUBD   F8     F6   F2     4       7           8
DIVD   F10    F0   F6     5      56          57
ADDD   F6     F8   F2     6      10          11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                         S1    S2     RS  RS
Time  Name   Busy  Op    Vj    Vk     Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  Yes   DIVD  M*F4  M(A1)

Register result status:
Clock      F0    F2     F4  F6       F8     F10     F12 ... F30
57     FU  M*F4  M(A2)      (M-M+M)  (M-M)  Result

• Mult2 (DIVD) is completing; what is waiting for it?
• Once again: In-order issue, out-of-order execution and out-of-order completion.

Tomasulo Drawbacks

‹ Complexity
  - delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!
‹ Many associative stores (CDB) at high speed
‹ Performance limited by Common Data Bus
  - Each CDB must go to multiple functional units ⇒ high capacitance, high wiring density
  - Number of functional units that can complete per cycle limited to one!
    • Multiple CDBs ⇒ more FU logic for parallel assoc stores
‹ Non-precise interrupts!

Tomasulo Loop Example

Loop: LD    F0 0  R1
      MULTD F4 F0 F2
      SD    F4 0  R1
      SUBI  R1 R1 #8
      BNEZ  R1 Loop

‹ This time assume Multiply takes 4 clocks
‹ Assume 1st load takes 8 clocks (L1 cache miss), 2nd load takes 1 clock (hit)
‹ To be clear, will show clocks for SUBI, BNEZ
  - Reality: integer instructions ahead of Fl. Pt. Instructions
‹ Show 2 iterations
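As a reading aid, the loop above can be mirrored in a few lines of Python. This is only a sketch of the loop's sequential behaviour, not of the Tomasulo hardware; it assumes memory is modelled as a dict keyed by byte address, and the name `run_loop` is illustrative rather than taken from the slides.

```python
def run_loop(mem, r1, f2):
    """Sequential mirror of the five-instruction loop.

    Returns the addresses touched, in program order.
    """
    touched = []
    while True:
        f0 = mem[r1]        # LD    F0, 0(R1)
        f4 = f0 * f2        # MULTD F4, F0, F2
        mem[r1] = f4        # SD    F4, 0(R1)
        touched.append(r1)
        r1 -= 8             # SUBI  R1, R1, #8
        if r1 == 0:         # BNEZ  R1, Loop (falls through when R1 == 0)
            break
    return touched


# Assumed initial state: doubles at addresses 8..80, R1 = 80, F2 = 2.0.
mem = {addr: float(addr) for addr in range(8, 88, 8)}
order = run_loop(mem, 80, 2.0)
```

The first addresses visited (80, 72, 64) are the same values that appear in the Addr column of the load and store buffers in the cycle-by-cycle tables.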

Loop Example Loop Example Cycle 1 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction jkIssueCompResult Busy Addr Fu 1LD F00 R1 Load1 No 1LDF00R11Load1Yes 80 1 MULTD F4 F0 F2 Load2 No Load2 No 1SD F40 R1 Load3 No Load3 No Iter- 2LD F00 R1 Store1 No Store1 No ation 2 MULTD F4 F0 F2 Store2 No Store2 No Count 2SD F40 R1 Store3 No Store3 No Reservation Stations: Reservation Stations: S1 S2 RS Added Store Buffers S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 No SUBI R1 R1 #8 Mult1 No SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Mult2 No BNEZ R1 Loop Register result status Instruction Loop Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 080Fu 180Fu Load1

Value of Register used for address, iteration control Loop Example Cycle 2 Loop Example Cycle 3 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction jkIssueCompResult Busy Addr Fu ITER Instruction jkIssueCompResult Busy Addr Fu 1LDF00R11Load1Yes 80 1LDF00R11Load1Yes 80 1 MULTD F4 F0 F2 2Load2No 1 MULTD F4 F0 F2 2Load2No Load3 No 1SDF40R13Load3No Store1 No Store1 Yes 80 Mult1 Store2 No Store2 No Store3 No Store3 No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Mult2 No BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 280Fu Load1 Mult1 380Fu Load1 Mult1

‹ Implicit renaming sets up data flow graph

Loop Example Cycle 4 Loop Example Cycle 5 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction jkIssueCompResult Busy Addr Fu ITER Instruction jkIssueCompResult Busy Addr Fu 1LDF00R11Load1Yes 80 1LDF00R11Load1Yes 80 1 MULTD F4 F0 F2 2Load2No 1 MULTD F4 F0 F2 2Load2No 1SDF40R13Load3No 1SDF40R13Load3No Store1 Yes 80 Mult1 Store1 Yes 80 Mult1 Store2 No Store2 No Store3 No Store3 No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Mult2 No BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 480Fu Load1 Mult1 572Fu Load1 Mult1

‹ Dispatching SUBI Instruction (not in FP queue) ‹ And, BNEZ instruction (not in FP queue) Loop Example Cycle 6 Loop Example Cycle 7 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction jkIssueCompResult Busy Addr Fu ITER Instruction jkIssueCompResult Busy Addr Fu 1LDF00R11Load1Yes 80 1LDF00R11Load1Yes 80 1 MULTD F4 F0 F2 2Load2Yes 72 1 MULTD F4 F0 F2 2Load2Yes 72 1SDF40R13Load3No 1SDF40R13Load3No 2LDF00R16Store1Yes 80 Mult1 2LDF00R16Store1Yes 80 Mult1 Store2 No 2 MULTD F4 F0 F2 7Store2No Store3 No Store3 No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 672Fu Load2 Mult1 772Fu Load2 Mult2

‹ Notice that F0 never sees Load from location 80
‹ Register file completely detached from computation
‹ First and Second iteration completely overlapped

Loop Example Cycle 8 Loop Example Cycle 9 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction jkIssueCompResult Busy Addr Fu 1LDF00R11Load1Yes 80 1LDF00R119 Load1Yes 80 1 MULTD F4 F0 F2 2Load2Yes 72 1 MULTD F4 F0 F2 2Load2Yes 72 1SDF40R13Load3No 1SDF40R13Load3No 2LDF00R16Store1Yes 80 Mult1 2LDF00R16Store1Yes 80 Mult1 2 MULTD F4 F0 F2 7Store2Yes 72 Mult2 2 MULTD F4 F0 F2 7Store2Yes 72 Mult2 2SDF40R18Store3No 2SDF40R18Store3No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 872Fu Load2 Mult2 972Fu Load2 Mult2

‹ Load1 completing: who is waiting? ‹ Note: Dispatching SUBI Loop Example Cycle 10 Loop Example Cycle 11 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1LDF00R11910Load1No 1LDF00R11910Load1No 1 MULTD F4 F0 F2 2Load2Yes 72 1 MULTD F4 F0 F2 2Load2No 1SDF40R13Load3No 1SDF40R13Load3Yes64 2LDF00R1610 Store1Yes 80 Mult1 2LDF00R161011Store1Yes 80 Mult1 2 MULTD F4 F0 F2 7Store2Yes 72 Mult2 2 MULTD F4 F0 F2 7Store2Yes 72 Mult2 2SDF40R18Store3No 2SDF40R18Store3No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 4Mult1Yes Multd M[80] R(F2) SUBI R1 R1 #8 3Mult1Yes Multd M[80] R(F2) SUBI R1 R1 #8 Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop 4Mult2Yes Multd M[72] R(F2) BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 10 64 Fu Load2 Mult2 11 64 Fu Load3 Mult2

‹ Load2 completing: who is waiting?
‹ Note: Dispatching BNEZ
‹ Next load in sequence

Loop Example Cycle 12 Loop Example Cycle 13 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1LDF00R11910Load1No 1LDF00R11910Load1No 1 MULTD F4 F0 F2 2Load2No 1 MULTD F4 F0 F2 2Load2No 1SDF40R13Load3Yes64 1SDF40R13Load3Yes64 2LDF00R161011Store1Yes 80 Mult1 2LDF00R161011Store1Yes 80 Mult1 2 MULTD F4 F0 F2 7Store2Yes 72 Mult2 2 MULTD F4 F0 F2 7Store2Yes 72 Mult2 2SDF40R18Store3No 2SDF40R18Store3No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 2Mult1Yes Multd M[80] R(F2) SUBI R1 R1 #8 1Mult1Yes Multd M[80] R(F2) SUBI R1 R1 #8 3Mult2Yes Multd M[72] R(F2) BNEZ R1 Loop 2Mult2Yes Multd M[72] R(F2) BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 12 64 Fu Load3 Mult2 13 64 Fu Load3 Mult2

‹ Why not issue third multiply? ‹ Why not issue third store? Loop Example Cycle 14 Loop Example Cycle 15 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction jkIssueCompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1LDF00R11910Load1No 1LDF00R11910Load1No 1 MULTD F4 F0 F2 214 Load2No 1 MULTD F4 F0 F2 21415Load2No 1SDF40R13Load3Yes64 1SDF40R13Load3Yes64 2LDF00R161011Store1Yes 80 Mult1 2LDF00R161011Store1Yes 80 [80]*R2 2 MULTD F4 F0 F2 7Store2Yes 72 Mult2 2 MULTD F4 F0 F2 715 Store2Yes 72 Mult2 2SDF40R18Store3No 2SDF40R18Store3No Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 0Mult1Yes Multd M[80] R(F2) SUBI R1 R1 #8 Mult1 No SUBI R1 R1 #8 1Mult2Yes Multd M[72] R(F2) BNEZ R1 Loop 0Mult2Yes Multd M[72] R(F2) BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 14 64 Fu Load3 Mult2 15 64 Fu Load3 Mult2

‹ Mult1 completing. Who is waiting?
‹ Mult2 completing. Who is waiting?

Loop Example Cycle 16 Loop Example Cycle 17 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction jkIssueCompResult Busy Addr Fu ITER Instruction j k Issue CompResult Busy Addr Fu 1LDF00R11910Load1No 1LDF00R11910Load1No 1 MULTD F4 F0 F2 21415Load2No 1 MULTD F4 F0 F2 21415Load2No 1SDF40R13Load3Yes64 1SDF40R13Load3Yes64 2LDF00R161011Store1Yes 80 [80]*R2 2LDF00R161011Store1Yes 80 [80]*R2 2 MULTD F4 F0 F2 71516Store2Yes 72 [72]*R2 2 MULTD F4 F0 F2 71516Store2Yes 72 [72]*R2 2SDF40R18Store3No 2SDF40R18Store3Yes64Mult1 Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 4 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Mult2 No BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... 
F30 16 64 Fu Load3 Mult1 17 64 Fu Load3 Mult1 Loop Example Cycle 18 Loop Example Cycle 19 Instruction status: Exec Write Instruction status: Exec Write ITER Instruction j k Issue CompResult Busy Addr Fu ITER Instruction jkIssueCompResult Busy Addr Fu 1LDF00R11910Load1No 1LDF00R11910Load1No 1 MULTD F4 F0 F2 21415Load2No 1 MULTD F4 F0 F2 21415Load2No 1SDF40R1318 Load3Yes64 1SDF40R131819Load3Yes64 2LDF00R161011Store1Yes 80 [80]*R2 2LDF00R161011Store1No 2 MULTD F4 F0 F2 71516Store2Yes 72 [72]*R2 2 MULTD F4 F0 F2 71516Store2Yes 72 [72]*R2 2SDF40R18Store3Yes64Mult1 2SDF40R1819 Store3Yes64Mult1 Reservation Stations: S1 S2 RS Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code: Time Name Busy Op Vj Vk Qj Qk Code: Add1 No LD F0 0 R1 Add1 No LD F0 0 R1 Add2 No MULTD F4 F0 F2 Add2 No MULTD F4 F0 F2 Add3 No SD F4 0 R1 Add3 No SD F4 0 R1 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8 Mult2 No BNEZ R1 Loop Mult2 No BNEZ R1 Loop Register result status Register result status Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F30 18 64 Fu Load3 Mult1 19 56 Fu Load3 Mult1

Loop Example Cycle 20

Instruction status:
ITER  Instruction  j   k   Issue  Exec Comp  Write Result
1     LD     F0    0   R1    1       9           10
1     MULTD  F4    F0  F2    2      14           15
1     SD     F4    0   R1    3      18           19
2     LD     F0    0   R1    6      10           11
2     MULTD  F4    F0  F2    7      15           16
2     SD     F4    0   R1    8      19           20

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   56
Load2   No
Load3   Yes   64
Store1  No
Store2  No
Store3  Yes   64    Mult1

Reservation Stations:
                          S1  S2     RS     RS
Time  Name   Busy  Op     Vj  Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd      R(F2)  Load3
      Mult2  No

Code:
      LD    F0 0  R1
      MULTD F4 F0 F2
      SD    F4 0  R1
      SUBI  R1 R1 #8
      BNEZ  R1 Loop

Register result status:
Clock  R1      F0     F2  F4     F6  F8  F10  F12 ... F30
20     56  Fu  Load1      Mult1

• Once again: In-order issue, out-of-order execution and out-of-order completion.

Why can Tomasulo overlap iterations of loops?

‹ Register renaming
  - Multiple iterations use different physical destinations for registers (dynamic loop unrolling).
‹ Reservation stations
  - Permit instruction issue to advance past integer control flow operations
  - Also buffer old values of registers - totally avoiding the WAR stall that we saw in the scoreboard.
‹ Other perspective: Tomasulo building data flow dependency graph on the fly.

Tomasulo's scheme offers 2 major advantages

(1) the distribution of the hazard detection logic
  - distributed reservation stations and the CDB
  - If multiple instructions waiting on single result, & each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB
  - If a centralized register file were used, the units would have to read their results from the registers when register buses are available.
(2) the elimination of stalls for WAW and WAR hazards

What about Precise Interrupts?

‹ Tomasulo had: In-order issue, out-of-order execution, and out-of-order completion
‹ Need to "fix" the out-of-order completion aspect so that we can find precise breakpoint in instruction stream.
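The first advantage, releasing several waiting instructions with one CDB broadcast, can be sketched as follows. This is a minimal model under stated assumptions: the class `ReservationStation`, the function `broadcast`, and the numeric operand values are illustrative and not from the slides; only the field names (Vj, Vk, Qj, Qk) and the cycle-4 state of the earlier example (Add1 and Mult1 both waiting on Load2) are taken from the tables.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReservationStation:
    name: str
    op: str
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the producers still being waited on
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.qj is None and self.qk is None


def broadcast(stations: List[ReservationStation], tag: str, value: float) -> List[str]:
    """One CDB broadcast: every station waiting on `tag` captures `value`.

    Returns the names of the stations that became ready because of it.
    """
    woken = []
    for rs in stations:
        was_ready = rs.ready()
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
        if rs.ready() and not was_ready:
            woken.append(rs.name)
    return woken


# Cycle-4 state: SUBD holds M(A1) and waits on Load2; MULTD holds R(F4) and waits on Load2.
stations = [
    ReservationStation("Add1", "SUBD", vj=1.0, qk="Load2"),
    ReservationStation("Mult1", "MULTD", vk=4.0, qj="Load2"),
]
woken = broadcast(stations, "Load2", 2.0)
```

Both stations are released by the single broadcast, without either one reading the register file.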

Relationship between precise interrupts and speculation:

‹ Speculation is a form of guessing.
‹ Important for branch prediction:
  - Need to "take our best shot" at predicting branch direction.
‹ If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:
  - This is exactly same as precise exceptions!
‹ Technique for both precise interrupts/exceptions and speculation: in-order completion or commit

Multiple Issue ILP Processors

‹ In statically scheduled superscalar instructions issue in order, and all pipeline hazards checked at issue time
  - Inst causing hazard will force subsequent inst to be stalled
‹ In statically scheduled VLIW, compiler generates multiple issue packets of instructions
‹ During instruction fetch, pipeline receives number of inst from IF stage – issue packet
  - Examine each inst in packet: if no hazard then issue else wait
  - Issue unit examines all inst in packet
    • Complexity implies further splitting of issue stage

Multiple Issue

‹ To issue multiple instructions per clock, key is assigning reservation station and updating pipeline control tables
‹ Two approaches
  - Run this step in half a clock cycle; two inst can be processed in one clock cycle
  - Build logic needed to handle two instructions at once

Extending Tomasulo

‹ Have to allow for speculative instructions
  - Separate bypassing of results among instructions from the actual completion of instruction
‹ Tag instructions as speculative until they are validated – and then commit the instruction
‹ Key idea: allow instructions to execute out of order but commit in order
  - Requires reorder buffer (ROB) to hold and pass results among speculated instructions
  - ROB is a source of operands
  - Register file is not updated until instruction commits
    • Therefore ROB supplies operands in interval between completion of inst and commit of inst
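The "execute out of order, commit in order" idea can be illustrated with a small reorder-buffer model. This is a sketch under simplifying assumptions (one destination register per entry, no exceptions, no branches); the class `ReorderBuffer` and its method names are hypothetical and chosen only for this example.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left (the head)
        self.regs = {}          # architectural register file

    def issue(self, dest):
        """Allocate an entry at the tail, in program order."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        """Execution finished (possibly out of order); the result waits in the ROB."""
        entry["value"] = value
        entry["done"] = True

    def read(self, reg):
        """Operand lookup: the newest in-flight producer wins, else the register file.

        Returns None if the newest producer has not finished yet.
        """
        for e in reversed(self.entries):
            if e["dest"] == reg:
                return e["value"] if e["done"] else None
        return self.regs.get(reg)

    def commit(self):
        """Retire finished instructions from the head only, updating registers."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.regs[e["dest"]] = e["value"]
            committed.append(e["dest"])
        return committed


rob = ReorderBuffer()
mul = rob.issue("F0")            # older, slow instruction
sub = rob.issue("F8")            # younger, fast instruction
rob.write_result(sub, 3.0)       # the younger one finishes first
early = rob.commit()             # nothing retires: the head is not done
forwarded = rob.read("F8")       # the ROB still supplies the operand
rob.write_result(mul, 8.0)
late = rob.commit()              # both retire, in program order
```

Between `write_result` and `commit` the value of F8 lives only in the ROB, which is why the ROB must act as an operand source during that interval.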

HW support for precise interrupts

‹ Need HW buffer for results of uncommitted instructions: reorder buffer
  - 3 fields: instr, destination, value
  - Use reorder buffer number instead of reservation station when execution completes
  - Supplies operands between execution complete & commit
  - (Reorder buffer can be operand source => more registers like RS)
  - Instructions commit
  - Once instruction commits, result is put into register
  - As a result, easy to undo speculated instructions on mispredicted branches or exceptions

[Figure: FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adder]

What are the hardware complexities with reorder buffer (ROB)?

[Figure: Reorder Buffer with compare network; entry fields Dest Reg, Result, Exceptions?, Valid; Reorder Table; FP Op Queue; FP Regs; Res Stations; FP Adder]

‹ How do you find the latest version of a register?
  - (As specified by Smith paper) need associative comparison network
  - Could use future file or just use the register result status buffer to track which specific reorder buffer has received the value
‹ Need as many ports on ROB as register file

Four Steps of Speculative Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called "dispatch")
2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called "issue")
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4. Commit: update register with reorder result
   When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called "graduation")

Summary

‹ Reservation stations: implicit register renaming to larger set of registers + buffering source operands
  - Prevents registers as bottleneck
  - Avoids WAR, WAW hazards of Scoreboard
  - Allows loop unrolling in HW
‹ Not limited to basic blocks (integer unit gets ahead, beyond branches)
‹ Lasting Contributions
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
‹ 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
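How renaming avoids WAR and WAW hazards can be shown on the last four instructions of the earlier example. The sketch below assumes the two loads have already written F6 and F2, and gives every destination a fresh tag `T0`, `T1`, ...; the function `rename` and the tag naming are illustrative, standing in for reservation-station or reorder-buffer numbers.

```python
def rename(instrs):
    """instrs: list of (op, dest, src1, src2) in program order.

    Returns the same list with each destination replaced by a fresh tag and
    each source replaced by the tag of its most recent producer, if any.
    """
    latest = {}  # architectural register -> tag of its newest producer
    renamed = []
    for i, (op, dest, s1, s2) in enumerate(instrs):
        r1 = latest.get(s1, s1)   # read the newest in-flight value, or the register
        r2 = latest.get(s2, s2)
        tag = f"T{i}"
        latest[dest] = tag        # later readers of `dest` now see this tag
        renamed.append((op, tag, r1, r2))
    return renamed


program = [
    ("MULTD", "F0", "F2", "F4"),
    ("SUBD", "F8", "F6", "F2"),
    ("DIVD", "F10", "F0", "F6"),
    ("ADDD", "F6", "F8", "F2"),
]
renamed = rename(program)
```

After renaming, DIVD still reads the old F6 while ADDD writes to its own tag `T3`, so the WAR dependence between them on F6 no longer forces ADDD to wait. This is consistent with ADDD completing at cycle 10 in the example, long before DIVD.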

Next . . . VLIW/EPIC

‹ Static ILP Processors: Very Long Instruction Word (VLIW) – Explicitly Parallel Instruction Computing (EPIC)
  - Compiler determines dependencies and resolves hazards
  - Hardware support needed for this?
‹ Do 'standard' compiler techniques work or do we need new compiler optimization methods to extract more ILP?
  - Overview of compiler optimization