ILP: Multiple Issue, and Recovery

Multiple issue, speculative execution, precise interrupt, reorder buffer

Multiple instruction issue

• Recap: Tomasulo’s algorithm handles RAW, WAR, WAW hazards via dynamic scheduling

• Three basic mechanisms:
  – Execute only when data is ready (RAW)
  – Buffer register values in reservation stations when the instruction is issued (WAR)
  – Register destination field (WAW)

Single-issue Tomasulo

• The previous mechanisms support a model where:
  – Instructions are issued one at a time, in program order
    » Into reservation stations
  – Multiple instructions can begin execution at a time, out of program order
    » Provided their operands and independent functional units are available
  – Instructions can complete out of order

Support for multiple issue

• There are several phases in the Tomasulo algorithm
  – Fetch into instruction queue
  – Decode and issue to reservation stations
  – Execute
  – Write results
• If any of these phases supports only one instruction per cycle
  – Multiple issue will not yield CPI < 1

Support for multiple issue (cont)
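A rough way to see the bottleneck argument for the phases listed above: if each phase handles at most some number of instructions per cycle, sustained throughput cannot exceed the narrowest phase. The sketch below is illustrative only; the per-phase widths are made-up numbers, and stalls and hazards are ignored.

```python
# Illustrative only: the phase names follow the slide; the widths are hypothetical.
def best_case_cpi(widths):
    """Lower bound on CPI when each phase handles widths[phase] instructions per cycle."""
    sustained_ipc = min(widths.values())  # the narrowest phase limits throughput
    return 1.0 / sustained_ipc

single_issue = {"fetch": 2, "issue": 1, "execute": 2, "write": 2}
dual_issue = {"fetch": 2, "issue": 2, "execute": 2, "write": 2}

print(best_case_cpi(single_issue))  # 1.0: the one-wide issue phase caps CPI at 1
print(best_case_cpi(dual_issue))    # 0.5
```

Widening execute and write alone therefore does not help; fetch and issue must be widened as well.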

• Our Tomasulo design so far:
  – Execution supports multiple instructions
  – Write result supports multiple instructions
    » Provided there are multiple functional units and CDBs

• Extensions:
  – Fetching multiple instructions per cycle
  – Issuing multiple instructions per cycle

Fetching multiple instructions

• Must be able to fill the instruction queue at a higher rate
  – To sustain “N” instructions per cycle, must fetch at least “N” instructions per cycle
• Must look ahead in the program-order stream and fetch from multiple PCs

Decoding multiple operations

• Once multiple instructions are in the fetch queue
  – Must be able to decode and issue more than one in a single cycle
  – An “issue packet”

Issue logic

• First, need to check whether reservation stations are available
  – For each instruction that may be issued
  – As previously, but now the check gets more complex
    » Must account not only for reservation stations that are already in use, but also for those that will be occupied by other instructions in the issue packet

Issue logic (cont)

• Second, need to resolve dependences within the issue packet
  – In addition to those with instructions already in flight

• Keep in mind: checks must preserve program order within the issue packet

Issue logic (cont)
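The two checks (reservation-station availability and dependences inside the packet) can be sketched as follows. This is a sequential software model, not the parallel hardware; the `Instr` record, the `"regfile"` marker, and the use of instruction names as tags are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical instruction record: rs_kind is the type of reservation station needed.
Instr = namedtuple("Instr", "name rs_kind dest srcs")

def issue_packet(packet, free_rs, reg_status):
    """Issue the instructions of a packet in program order.

    free_rs:    dict rs_kind -> number of free reservation stations (not mutated)
    reg_status: dict register -> tag of its in-flight producer (not mutated)
    Returns (issued, new_free_rs, new_reg_status).
    """
    free = dict(free_rs)
    status = dict(reg_status)
    issued = []
    for ins in packet:                      # walk the packet in program order
        if free.get(ins.rs_kind, 0) == 0:   # stations taken earlier in the packet count too
            break                           # in-order issue: later instructions wait
        free[ins.rs_kind] -= 1
        # A source produced earlier in this packet (or by an in-flight instruction)
        # is recorded as a tag; otherwise it is read from the register file.
        operands = {s: status.get(s, "regfile") for s in ins.srcs}
        issued.append((ins.name, operands))
        if ins.dest is not None:
            status[ins.dest] = ins.name     # later instructions now wait on this one
    return issued, free, status

packet = [Instr("L.D", "load", "F0", ("R1",)),
          Instr("ADD.D", "fp_add", "F4", ("F0", "F2"))]
issued, free, status = issue_packet(packet, {"load": 1, "fp_add": 1}, {})
print(issued)
```

In this run the ADD.D picks up a tag for F0 from the L.D issued in the same packet, which is the intra-packet dependence the hardware has to detect within one cycle (or across a pipelined issue stage).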

• Checks must be done in parallel
  – While preserving program order
• Possible approaches:
  – Hardware checks all possible combinations within a single cycle
  – Pipeline the issue stage

Simplifications

• If floating-point and integer registers are separate
  – May simplify issue logic by looking at one FP and one INT instruction at a time
• But some programs only have INT instructions
• General case: support issue of an arbitrary mix of INT and FP instructions
  – And loads, stores

Tomasulo: decoupling

• Fetch, issue and execution are decoupled
  – The number of instructions fetched in cycle “k” does not imply that the same number is issued in cycle “k+1”
  – The number of instructions issued does not imply that the same number is executed
• Once the fetch/issue stages are improved
  – Execution works just as before

Example

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,Loop

• 1 integer ALU, 1 FP ALU, 1 address unit, 1 memory unit, 2 CDBs
• Multiple reservation stations
• Issue packet: any two instructions (except branches, which are single-issue)
• Instructions after a predicted branch can be fetched and issued, but not executed

Cycle #0
  – L.D, ADD.D fetched into instruction queue

Cycle #1
  – L.D, ADD.D issued
    » What checks need to be made?
    » What are the Vj/Vk, Qj/Qk of the ADD.D RS set to?
  – S.D, DADDIU brought into instruction queue

Tomasulo Example Cycle 1

Instruction status
  Instruction  j   k   Issue  Execution complete  Write result
  LD   F0      0+  R1  1
  ADD  F4      F0  F2  1

Load/store buffers
  Name   Busy  Address
  Load1  Yes   0+R1

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS producing j, k)
  Time  Name  Busy  Op   Vj  Vk  Qj     Qk
        Add1  Yes   Add      F2  Load1

Register result status
  Clock  F0     F2  F4    F6  F8  F10  F12  ...  F30
  1      Load1      Add1

Cycle #2
  – S.D, DADDIU issued
  – BNE brought into instruction queue
    » Assumption: single-issue of branches

Tomasulo Example Cycle 2

Instruction status
  Instruction  j   k   Issue  Execution complete  Write result
  LD   F0      0+  R1  1
  ADD  F4      F0  F2  1
  SD   F4      0+  R1  2
  DAD  R1      R1  -8  2

Load/store buffers
  Name    Busy  Address
  Load1   Yes   0+R1
  Store1  Yes   0+R1

Reservation stations (Vj, Vk: source values S1, S2; Qj, Qk: RS producing j, k)
  Time  Name  Busy  Op   Vj  Vk  Qj     Qk
        Add1  Yes   Add      F2  Load1
        Int   Yes   DAD  R1  -8

Register result status
  Clock  F0     F2  F4    F6  F8  F10  F12  ...  F30
  2      Load1      Add1

Cycle #3
  – BNE issued
  – L.D accesses memory; DADDIU executes
  – L.D, ADD.D brought into instruction queue
    » Under what assumption?

Cycle #4

  – L.D, ADD.D issued
    » Can they execute?
  – L.D, DADDIU write results
    » Who needs their results?
  – S.D, DADDIU brought into instruction queue

Cycle #5

  – BNE issued
  – BNE executes
    » What happens after the branch is resolved?
  – L.D, ADD.D brought into instruction queue

Limitations/challenges

• Multiple issue relies on fetching a stream of instructions ahead of time
  – Branch/target prediction is very important
• What to do with instructions that depend on prediction results?
  – They may or may not be valid
• Can always fetch and issue them
  – No harm done until memory or register contents are written
  – But just fetching/issuing is not enough

Dealing with branches

• Predicting branches lets us:
  – Fetch appropriate instructions into queue
  – Issue instructions into reservation stations
  – If prediction is incorrect
    » Flush queue
    » Flush reservation station entries
• If prediction is correct
  – Some overlapping will come from being able to fetch/issue new instructions
  – If the branch outcome takes long to compute, the reservation stations will fill up

Limitations/challenges (cont)

• Single- and multi-issue Tomasulo schemes so far allow for out-of-order completion
  – Difficult to support precise exceptions

• Approach to handling both this and the previous issue
  – Speculative execution

Multiple-issue Tomasulo

• Extends fetch/issue to handle multiple instructions per cycle

• Branches pose limitations
  – Prediction lets fetch ahead
  – Can also issue
  – But cannot execute or complete until it is known whether the prediction is correct

Control Dependencies

• Every instruction is control dependent on some set of branches
    if p1 S1;
    if p2 S2;
• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1
• Control dependencies must be preserved to preserve program order

Speculative execution

• Let instructions that follow a predicted branch actually execute and complete
  – Speculate; “gamble” that a prediction is correct, then verify the prediction
  – Buffer all instructions that complete until they are safe to alter registers/memory
    » And commit these results in program order

• Hardware enhancement: Reorder Buffers (ROBs)

Example

     ADD  R1, R2, R3
     BNEZ R1, Foo
     MULT R4, R4, R5
Foo: ADD  R4, R4, R6

Assume R4=1, R5=10, R6=20, the branch is predicted taken, and it takes 5 cycles to resolve.
Correct results: R4=21 (taken), R4=30 (not taken)

The processor issues and executes the instructions on the predicted path (shown in red on the slide) -> R4=1+20=21
  – Correct if the branch is taken
  – What if it is not taken?

If the branch is not taken:
  – Cannot simply roll back and execute MULT and ADD: would result in R4=21*10+20=230
  – Cannot simply execute MULT and skip ADD: would result in R4=21*10=210
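The arithmetic of this example can be checked with a short sketch. The `run` helper is hypothetical; it models only the two instructions that write R4, with R5=10 and R6=20 as assumed on the slide.

```python
def run(path, r4, r5=10, r6=20):
    """Apply a sequence of the example's R4-writing instructions to an initial R4."""
    for op in path:
        if op == "MULT":      # MULT R4, R4, R5
            r4 = r4 * r5
        elif op == "ADD":     # Foo: ADD R4, R4, R6
            r4 = r4 + r6
    return r4

taken = run(["ADD"], 1)               # 21: branch taken, MULT skipped
not_taken = run(["MULT", "ADD"], 1)   # 30: fall through MULT, then ADD

speculated = run(["ADD"], 1)          # the speculative ADD left R4 = 21
# Two incorrect recoveries that start from the speculatively updated R4:
rerun_both = run(["MULT", "ADD"], speculated)   # 230
mult_only = run(["MULT"], speculated)           # 210
# Correct recovery: discard the speculative result, then run MULT and ADD in order.
recovered = run(["MULT", "ADD"], 1)             # 30
```

Both wrong answers come from starting at the speculative value 21 instead of the original R4=1, which is why the speculative result has to be discarded before re-executing.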

If the branch is not taken, two things must happen:
  1) ADD’s result must be discarded, and
  2) MULT and ADD must be executed in order (roll back to sequential execution from MULT)

Supporting discards/roll-back

• Do not let results go immediately to registers/memory
  – Stage them in a buffer

• Once branch condition is checked:
  – If prediction was correct, copy from buffer to register file
  – If incorrect, clear buffer entries and roll back

Re-order buffer

Key ideas

• Add another buffer to the design
  – Re-order buffer (ROB)

• Break the completion of an instruction into two stages:
  – Write result
    » From functional units to the ROB
  – Commit results
    » From ROB to registers, memory

Entries

• ROB:
  – Busy
  – Instruction opcode
  – Destination (register, memory)
  – Value
  – Ready (result has been written; used by the issue and commit steps)

• New reservation station entry:
  – “Dest” (index points to ROB entry)
  – Also, Qj, Qk now point to ROB entries (not to RS)

• Register status modifications
  – Index (“Reorder”) now also points to ROB entry (not to RS)

Execution stages

• (fetch)

• 1. Issue
• 2. Execute
• 3. Write
• 4. Commit

Re-order buffer

[Figures: “Re-order buffer” diagram repeated for each stage: Issue, Execute, Write, Commit]

FP Issue
• Condition: RS “r” and ROB “b” both available
• Actions:
    if (RegStat[rs].Busy) {               /* source operand comes from the ROB */
        h <- RegStat[rs].Reorder;         /* h points to producer ROB entry */
        if (ROB[h].Ready) {               /* operand has been computed */
            RS[r].Vj <- ROB[h].Value; RS[r].Qj <- 0;
        } else {
            RS[r].Qj <- h;                /* point to ROB entry */
        }
    } else {                              /* operand is in the register file */
        RS[r].Vj <- Regs[rs]; RS[r].Qj <- 0;
    }
    /* the second source operand is handled the same way, using Vk and Qk */

    RS[r].Busy <- yes; RS[r].Dest <- b;
    ROB[b].Instruction <- opcode; ROB[b].Ready <- no;
    RegStat[rd].Reorder <- b; RegStat[rd].Busy <- yes;
    ROB[b].Dest <- rd;

Example (single-issue)

Instructions, in program order:
    L.D   F6,34(R2)
    L.D   F2,45(R3)
    MUL.D F0,F2,F4
    SUB.D F8,F6,F2

State shown for each cycle: ROB entries #1 to #4 (Op; dest; rdy; val), reservation stations RS(load), RS(mult), RS(add), and register status Qi for F0, F2, F6, F8. Everything is empty initially; entries not listed in a cycle are empty.

Cycle 1
  ROB:
    #1  LD; F6; no; xxxx
  RS(load): Vj=34, Vk=R2, d=#1
  Qi: F6=#1

Cycle 2
  ROB:
    #1  LD; F6; no; xxxx
    #2  LD; F2; no; xxxx
  RS(load): Vj=34, Vk=R2, d=#1
            Vj=45, Vk=R3, d=#2
  Qi: F2=#2, F6=#1

Cycle 3
  ROB:
    #1  LD; F6; no; xxxx
    #2  LD; F2; no; xxxx
    #3  MU; F0; no; xxxx
  RS(load): Vj=34, Vk=R2, d=#1
            Vj=45, Vk=R3, d=#2
  RS(mult): Qj=#2, Vk=F4, d=#3
  Qi: F0=#3, F2=#2, F6=#1

Cycle 4
  ROB:
    #1  LD; F6; no; xxxx
    #2  LD; F2; no; xxxx
    #3  MU; F0; no; xxxx
    #4  SU; F8; no; xxxx
  RS(load): Vj=34, Vk=R2, d=#1
            Vj=45, Vk=R3, d=#2
  RS(mult): Qj=#2, Vk=F4, d=#3
  RS(add):  Qj=#1, Qk=#2, d=#4
  Qi: F0=#3, F2=#2, F6=#1, F8=#4

FP Execute
• Condition: RS[r].Qj == 0 and RS[r].Qk == 0
• Actions: compute results in functional units; operands are in Vj, Vk (just as before)

FP Write result
• Condition: execution done at r, CDB available
• Actions:
    b <- RS[r].Dest; RS[r].Busy <- no;      /* free RS */
    forall x {                              /* every RS waiting on ROB entry b */
        if (RS[x].Qj == b) { RS[x].Vj <- result; RS[x].Qj <- 0; }
        if (RS[x].Qk == b) { RS[x].Vk <- result; RS[x].Qk <- 0; }
    }
    ROB[b].Value <- result; ROB[b].Ready <- yes;

Result goes to CDB, tagged with RS[r].Dest

Example; assumptions

• First LD ready to write in cycle 5
  – Returns value = 1000
• Second LD ready to write in cycle 6
  – Returns value = 999
• SUB ready to write in cycle 7
• MUL takes many cycles to complete

Cycle 5
  ROB:
    #1  LD; F6; yes; 1000
    #2  LD; F2; no; xxxx
    #3  MU; F0; no; xxxx
    #4  SU; F8; no; xxxx
  RS(load): Vj=45, Vk=R3, d=#2
  RS(mult): Qj=#2, Vk=F4, d=#3
  RS(add):  Vj=1000, Qk=#2, d=#4
  Qi: F0=#3, F2=#2, F6=#1, F8=#4

Cycle 6
  ROB:
    #1  LD; F6; yes; 1000
    #2  LD; F2; yes; 999
    #3  MU; F0; no; xxxx
    #4  SU; F8; no; xxxx
  RS(mult): Vj=999, Vk=F4, d=#3 (begins exec)
  RS(add):  Vj=1000, Vk=999, d=#4 (begins exec)
  Qi: F0=#3, F2=#2, F6=#1, F8=#4

Cycle 7
  ROB:
    #1  LD; F6; yes; 1000
    #2  LD; F2; yes; 999
    #3  MU; F0; no; xxxx
    #4  SU; F8; yes; 1
  RS(mult): Vj=999, Vk=F4, d=#3 (executing)
  Qi: F0=#3, F2=#2, F6=#1, F8=#4

FP commit
• Condition: entry h at head of ROB is ready
• Actions:
    d <- ROB[h].Dest;
    Regs[d] <- ROB[h].Value;
    ROB[h].Busy <- no;
    if (RegStat[d].Reorder == h) RegStat[d].Busy <- no;

Cycle 6 revisited (commit with h=1, d=F6)
  ROB:
    #1  xx; xx; no; xxxx  (freed)
    #2  LD; F2; yes; 999
    #3  MU; F0; no; xxxx
    #4  SU; F8; no; xxxx
  RS(mult): Vj=999, Vk=F4, d=#3 (begins exec)
  RS(add):  Vj=1000, Vk=999, d=#4 (begins exec)
  Qi: F0=#3, F2=#2, F8=#4
  Registers: F6=1000

Cycle 7 revisited (commit with h=2, d=F2)
  ROB:
    #1  xx; xx; no; xxxx  (freed)
    #2  xx; xx; no; xxxx  (freed)
    #3  MU; F0; no; xxxx
    #4  SU; F8; yes; 1
  RS(mult): Vj=999, Vk=F4, d=#3 (executing)
  Qi: F0=#3, F8=#4
  Registers: F2=999, F6=1000

Cycle 10 (head entry #3 not ready)
  ROB:
    #1  xx; xx; no; xxxx  (freed)
    #2  xx; xx; no; xxxx  (freed)
    #3  MU; F0; no; xxxx  (not ready)
    #4  SU; F8; yes; 1
  RS(mult): Vj=999, Vk=F4, d=#3 (executing)
  Qi: F0=#3, F8=#4
  Registers: F2=999, F6=1000

Ordering
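The in-order commit behavior of the worked example can be sketched with a small model. The `ROB` class below is a hypothetical simplification (no reservation stations, no CDB, one commit per call); only the tags, destinations and the values 1000, 999 and 1 come from the example.

```python
from collections import deque

class ROB:
    """Minimal reorder buffer: entries leave only from the head, in program order."""
    def __init__(self):
        self.entries = deque()   # oldest entry on the left
        self.regs = {}           # committed (architectural) register values

    def issue(self, tag, dest):
        self.entries.append({"tag": tag, "dest": dest, "ready": False, "value": None})

    def write_result(self, tag, value):
        for e in self.entries:   # results may arrive in any order
            if e["tag"] == tag:
                e["ready"], e["value"] = True, value

    def commit(self):
        """Commit at most one instruction; return its tag, or None if the head is not ready."""
        if self.entries and self.entries[0]["ready"]:
            e = self.entries.popleft()
            self.regs[e["dest"]] = e["value"]
            return e["tag"]
        return None

rob = ROB()
for tag, dest in [(1, "F6"), (2, "F2"), (3, "F0"), (4, "F8")]:
    rob.issue(tag, dest)
rob.write_result(1, 1000)   # first L.D
rob.write_result(2, 999)    # second L.D
rob.write_result(4, 1)      # SUB.D finishes before MUL.D
committed = [rob.commit() for _ in range(3)]
print(committed)            # [1, 2, None]
print(rob.regs)             # {'F6': 1000, 'F2': 999}
```

The third commit attempt returns None because entry #3 (MUL.D) is at the head and not ready, so the finished SUB.D in entry #4 has to wait and F8 stays uncommitted.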

• The addition of an ROB:
  – Buffers speculative data until known to be valid
  – Also allows for in-order commits of instructions that execute out of order
• In our example:
  – SUB completes execution before MULT
  – But SUB cannot commit because the MULT entry is the ROB’s head
    » Will instructions that use SUB’s result be stalled?

Branches and the ROB

• When a branch reaches the ROB’s head
  – It is known whether or not the prediction was correct
    » Branch has finished execution…
  – If correct: remove the branch from the ROB, continue committing subsequent instructions
  – If incorrect: flush all ROB entries; start fetching from the correct branch target

Branch at head of ROB

         ADD  R1, R2, R3
         BNEZ R1, Foo
         MULT R4, R4, R5
    Foo: ADD  R4, R4, R6

  ROB (Op; dest; rdy; val):
    #1  BNEZ; x; yes; xxx
    #2  ADD; R4; yes; 21
  (ROB entries #3 and #4, RS(load), RS(mult), RS(add) and Qi are empty)

Correct prediction (taken)
  ROB:
    #1  xx; x; no; xxx   (branch removed)
    #2  ADD; R4; yes; 21

Incorrect prediction (not taken)
  ROB:
    #1  xx; x; no; xxx
    #2  xx; x; no; xx    (flushed)

The ADD with result = 21 gets removed from the ROB:
  – It was never written to R4
  – Any instruction using the result = 21 has to be in the ROB, and hence would also be flushed
  – Therefore, it is as if the ADD never executed

Memory operations

• Loads:
  – Only execute if:
    » Address A is available, and
    » All stores in earlier ROB entries have different A’s
• Stores:
  – Do not store the value immediately to memory when it becomes available
    » Also buffer it in the ROB entry
  – Only store to memory when the store commits
    » Store may be control-dependent on a mispredicted branch!

Exceptions

• Only handle an exception if the instruction that raised it ever reaches the ROB head
  – Mis-speculated instructions won’t

• Exceptions can then be handled in program order

Four Steps of Speculative Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called “dispatch”)

2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; checks RAW (sometimes called “issue”)

3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available.

4. Commit: update register with reorder result
   When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (sometimes called “graduation”)

Changes to Other Components

Use ROB index as tag

• Renaming table maps architectural registers to ROB index if the register is renamed

• Reservation stations now use ROB index for tracking dependence and for wakeup

• Again tag (now ROB index) and data are broadcast on CDB at writeback

• An instruction may receive values from registers/memory, from data broadcast on the CDB, or from the ROB

Summary

• Reservation stations: implicit register renaming to a larger set of registers + buffering of source operands
  – Prevents registers from being a bottleneck
  – Avoids the WAR, WAW hazards of the scoreboard
• Not limited to basic blocks, unlike static scheduling (integer unit gets ahead, beyond branches)
• Today, helps with cache misses as well
  – Don’t stall for L1 data cache miss (insufficient ILP for L2 miss?)
  – Can support memory-level parallelism
• Lasting contributions
  – Dynamic scheduling
  – Register renaming
  – Load/store disambiguation (discussed later)
• 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

Dynamic Scheduling: The Only Choice?

• Most high-performance processors today are dynamically scheduled superscalar processors
  – With deeper and n-way issue pipelines

• Other alternatives to exploit instruction-level parallelism
  – Statically scheduled superscalar
  – VLIW

• Mixed effort: EPIC
  – Explicit Parallel Instruction Computing
  – Example: processors

• Why is dynamic scheduling so popular today?
  – Technology trends: increasing transistor budget, deeper pipelines, wide issue