Multiple Issue, Speculative Execution and Recovery

ILP: Multiple Issue, Speculative Execution and Recovery
Multiple issue, speculative execution, precise interrupt, reorder buffer

Multiple instruction issue
• Recap: Tomasulo’s algorithm handles RAW, WAR, WAW hazards via dynamic scheduling
• Three basic mechanisms:
  – RAW: execute only when data is ready
  – WAR: buffer register values in reservation stations when the instruction is issued
  – WAW: register destination field

Single-issue Tomasulo
• The previous mechanisms support a model where:
  – Instructions are issued one at a time, in program order
    » Into reservation stations
  – Multiple instructions can begin execution at a time, out of program order
    » Provided their operands and independent functional units are available
  – Instructions can complete out of order

Support for multiple issue
• There are several phases in the Tomasulo algorithm
  – Fetch into instruction queue
  – Decode and issue to reservation stations
  – Execute
  – Write results
• If any of these phases supports only one instruction per cycle
  – Multiple issue will not yield CPI<1

Support for multiple issue (cont)
• Our Tomasulo design so far:
  – Execution supports multiple instructions
  – Write results supports multiple instructions
    » Provided there are multiple functional units and CDBs
• Extensions:
  – Fetching multiple instructions per cycle
  – Issuing multiple instructions per cycle

Fetching multiple instructions
• Must be able to fill the instruction queue at a higher rate
  – To sustain “N” instructions per cycle, must fetch at least “N” instructions per cycle
• Must look ahead in the program-order stream and fetch from multiple PCs

Decoding multiple operations
• Once multiple instructions are in the fetch queue
  – Must be able to decode and issue more than one in a single cycle
  – An “issue packet”

Issue logic
• First, need to check whether reservation stations are available
  – For each instruction that may be issued
  – As previously, but now the check gets more complex
    » Not only reservation stations that are already in use, but also those that will be occupied by other instructions in the issue packet

Issue logic (cont)
• Second, need to resolve dependences within the issue packet
  – In addition to those with instructions already in flight
• Keep in mind that checks must preserve program order within the issue packet

Issue logic (cont)
• Checks must be done in parallel
  – While preserving program order
• Possible approaches:
  – Hardware checks for all possible combinations within a single cycle
  – Pipeline the issue stage

Simplifications
• If floating-point and integer registers are separate
  – May simplify issue logic by looking at one FP and one INT instruction at a time
• But some programs only have INT instructions
• General case: support issue of an arbitrary mix of INT, FP
  – And loads, stores

Tomasulo: decoupling
• Fetch, issue and execution are decoupled
  – Number of instructions fetched in cycle “k” does not imply the same number of instructions issued in cycle “k+1”
  – Number of instructions issued does not imply the same number of instructions executed
• Once fetch/issue stages are improved
  – Execution works just as before

Example
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDIU R1,R1,#-8
      BNE    R1,R2,Loop
1 integer ALU, 1 FP ALU, 1 address adder, 1 memory unit, 2 CDBs
Multiple reservation stations
Issue packet: any two instructions (except branches; single issue)
Branch-predicted instructions can be fetched, issued but not executed

Cycle #0
• L.D, ADD.D fetched into instruction queue

Cycle #1
• L.D, ADD.D issued
  – What checks need to be made?
  – What are the Vj/Vk, Qj/Qk of the ADD.D RS set to?
• S.D, DADDIU brought into instruction queue

Tomasulo Example Cycle 1

Instruction status
Instruction  j    k    Issue  Execution complete  Write Result
LD   F0      0+   R1   1
ADD  F4      F0   F2   1

Buffer  Busy  Address
Load1   Yes   0+R1

Reservation Stations
                        S1  S2  RS for j  RS for k
Time  Name  Busy  Op    Vj  Vk  Qj        Qk
      Add1  Yes   Add       F2  Load1

Register result status
Clock      F0     F2  F4    F6  F8  F10  F12  ...  F30
1      FU  Load1      Add1

Cycle #2
• S.D, DADDIU issued
• BNE brought into instruction queue
  – Assumption: single-issue of branches

Tomasulo Example Cycle 2

Instruction status
Instruction  j    k    Issue  Execution complete  Write Result
LD   F0      0+   R1   1
ADD  F4      F0   F2   1
SD   F4      0+   R1   2
DAD  R1      R1   -8   2

Buffer  Busy  Address
Load1   Yes   0+R1
Store1  Yes   0+R1

Reservation Stations
                        S1  S2  RS for j  RS for k
Time  Name  Busy  Op    Vj  Vk  Qj        Qk
      Add1  Yes   Add       F2  Load1
      Int   Yes   DAD   R1  -8

Register result status
Clock      F0     F2  F4    F6  F8  F10  F12  ...  F30
2      FU  Load1      Add1

Cycle #3
• BNE issued
• L.D, DADDIU access memory, execute
• L.D, ADD.D brought into instruction queue
  – Under what assumption?

Cycle #4
• L.D, ADD.D issued
  – Can they execute?
• L.D, DADDIU write results
  – Who needs their results?
• S.D, DADDIU brought into instruction queue

Cycle #5
• BNE issued
• BNE executes
  – What happens after the branch is resolved?
• L.D, ADD.D brought into instruction queue

Limitations/challenges
• Multiple issue relies on fetching a stream of instructions ahead of time
  – Branch/target prediction is very important
• What to do with instructions that depend on prediction results?
  – They may or may not be valid
• Can always fetch and issue them
  – No harm done until memory or register contents are written
  – But just fetching/issuing is not enough

Dealing with branches
• Predicting branches lets us:
  – Fetch appropriate instructions into queue
  – Issue instructions into reservation stations
  – If prediction is incorrect
    » Flush queue
    » Flush reservation station entries
• If prediction is correct
  – Some overlapping will come from being able to fetch/issue new instructions
  – If branch outcome takes long to compute, will fill up reservation stations

Limitations/challenges (cont)
• Single- and multi-issue Tomasulo schemes so far allow for out-of-order completion
  – Difficult to support precise exceptions
• Approach to handling this and the previous issue
  – Speculative execution

Multiple-issue Tomasulo
• Extends fetch/issue to handle multiple instructions per cycle
• Branches pose limitations
  – Prediction lets processor fetch ahead
  – Can also issue
  – But cannot execute, complete until known if prediction is correct

Control Dependencies
• Every instruction is control dependent on some set of branches
    if p1 S1;
    if p2 S2;
• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
• Control dependencies must be preserved to preserve program order

Speculative execution
• Let instructions that follow a predicted branch actually execute and complete
  – Speculate; “gamble” that a prediction is correct, then verify prediction
  – Buffer all instructions that complete until they are safe to alter registers/memory
    » And commit these results in program order
• Hardware enhancement: Reorder Buffers (ROBs)

Example
     ADD  R1, R2, R3
     BNEZ R1, Foo
     MULT R4, R4, R5
Foo: ADD  R4, R4, R6
• Assume R4=1, R5=10, R6=20, branch predicted taken, 5 cycles to resolve
• Correct results: R4=21 (taken), R4=30 (not taken)

Example (cont)
• The processor issues and executes the instructions on the predicted (taken) path, shown in red on the slide -> R4=1+20=21
• Correct if branch taken; what if not taken?

Example (cont)
• If branch not taken:
  – Cannot simply roll back and execute MULT and ADD
    » Would result in R4=21*10+20=230
  – Cannot simply execute MULT and skip ADD
    » Would result in R4=21*10=210

Example (cont)
• If branch not taken, two things must happen:
  1) The speculative ADD’s result (R4=21) must be discarded, and
  2) MULT and ADD must be executed in order (roll back to sequential execution from MULT)

Supporting discards/roll-back
• Do not let results go immediately to register file
  – Stage them in a buffer
• Once branch condition is checked:
  – If prediction was correct, copy from buffer to register file
  – If incorrect, clear buffer entries and roll back

Re-order buffer

Key ideas
• Add another buffer to the design
  – Re-order buffer (ROB)
• Break the completion of an instruction into two stages:
  – Write result
    » From functional units to the ROB
  – Commit results
    » From ROB to registers, memory

Entries
• ROB:
  – Busy
  – Instruction opcode
  – Destination (register, memory)
  – Value
• New reservation station entry:
  – “Dest” (index points to ROB entry)
  – Also, Qj, Qk now point to ROB entries (not to RS)
• Register status modifications
  – Index (“Reorder”) now also points to ROB entry (not to RS)

Execution stages
• (fetch)
• 1.
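
The write-result/commit split can be illustrated with the four-instruction example from the speculative execution slides (R4=1, R5=10, R6=20, branch predicted taken). What follows is a minimal Python sketch, not a hardware description: `RobEntry`, `read_operand`, `commit` and `run_example` are hypothetical names, the ROB is modeled as a plain list, and the values of R2 and R3 (which the slides do not give) are passed in as assumptions to pick the branch outcome.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RobEntry:
    op: str                # instruction opcode
    dest: Optional[str]    # destination register (None for a branch)
    value: Optional[int]   # result written by the functional unit


def read_operand(reg: str, rob: List[RobEntry], regs: Dict[str, int]) -> int:
    """Return the newest in-flight value for reg, else the committed one."""
    for entry in reversed(rob):
        if entry.dest == reg:
            return entry.value
    return regs[reg]


def commit(rob: List[RobEntry], regs: Dict[str, int]) -> None:
    """Commit stage: copy ROB results to the register file in program order."""
    for entry in rob:
        if entry.dest is not None:
            regs[entry.dest] = entry.value
    rob.clear()


def run_example(r2: int, r3: int) -> int:
    """Run the example with the branch predicted taken; r2, r3 are assumed inputs."""
    regs = {"R1": 0, "R2": r2, "R3": r3, "R4": 1, "R5": 10, "R6": 20}
    rob: List[RobEntry] = []

    # Write-result stage: results go to the ROB, not to the registers.
    rob.append(RobEntry("ADD", "R1",
                        read_operand("R2", rob, regs) + read_operand("R3", rob, regs)))
    rob.append(RobEntry("BNEZ", None, None))
    branch_index = len(rob) - 1
    # Predicted taken: the ADD at Foo executes speculatively, MULT is skipped.
    rob.append(RobEntry("ADD", "R4",
                        read_operand("R4", rob, regs) + read_operand("R6", rob, regs)))

    # The branch resolves: BNEZ is taken when R1 != 0.
    actually_taken = read_operand("R1", rob, regs) != 0
    if not actually_taken:
        # Misprediction: discard every entry younger than the branch ...
        del rob[branch_index + 1:]
        # ... and restart in program order from MULT.
        rob.append(RobEntry("MULT", "R4",
                            read_operand("R4", rob, regs) * read_operand("R5", rob, regs)))
        rob.append(RobEntry("ADD", "R4",
                            read_operand("R4", rob, regs) + read_operand("R6", rob, regs)))

    commit(rob, regs)
    return regs["R4"]


print(run_example(1, 0))  # branch taken: prints 21
print(run_example(0, 0))  # branch not taken: prints 30
```

Because the speculative ADD only ever wrote to the ROB, discarding it leaves R4 at 1 in the register file, so the restarted MULT and ADD produce 30 rather than the incorrect 230 or 210.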

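
Returning to the issue-logic slides and the Cycle 1 table: the two checks (reservation station availability and dependences inside the issue packet) can be sketched behaviorally in Python. This is only an illustration under stated assumptions: `Instr`, `issue_packet`, and the unit names `"load"` and `"fp_add"` are hypothetical, and the loop walks the packet sequentially, whereas real hardware performs the same checks in parallel or pipelines the issue stage.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Instr:
    op: str
    dest: Optional[str]        # destination register, if any
    srcs: Tuple[str, ...]      # source registers
    unit: str                  # class of reservation station needed (assumed names)


def issue_packet(packet: List[Instr],
                 free_rs: Dict[str, List[str]],
                 reg_status: Dict[str, str]) -> List[tuple]:
    """Issue instructions in program order; stop at the first one that cannot issue."""
    issued = []
    for ins in packet:
        # Check 1: a station must be free, counting those just taken
        # by earlier instructions in the same packet.
        if not free_rs[ins.unit]:
            break
        tag = free_rs[ins.unit].pop(0)
        # Check 2: sources pick up tags set by in-flight instructions,
        # including earlier instructions in this packet (None means ready).
        q_tags = [reg_status.get(src) for src in ins.srcs]
        if ins.dest is not None:
            reg_status[ins.dest] = tag
        issued.append((tag, ins.op, q_tags))
    return issued


packet = [
    Instr("L.D", "F0", ("R1",), "load"),
    Instr("ADD.D", "F4", ("F0", "F2"), "fp_add"),
]
free_rs = {"load": ["Load1"], "fp_add": ["Add1"]}
reg_status: Dict[str, str] = {}

issued = issue_packet(packet, free_rs, reg_status)
for tag, op, q_tags in issued:
    print(tag, op, q_tags)
# Load1 L.D [None]
# Add1 ADD.D ['Load1', None]
```

The result matches the Cycle 1 table: ADD.D waits on Load1 for F0 (Qj), F2 is ready, and the register result status maps F0 to Load1 and F4 to Add1.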