Multiple Issue, Speculative Execution and Recovery

ILP: Multiple Issue, Speculative Execution and Recovery
Multiple issue, speculative execution, precise interrupt, reorder buffer

Multiple instruction issue
• Recap: Tomasulo’s algorithm handles RAW, WAR, WAW hazards via dynamic scheduling
• Three basic mechanisms:
  – Execute only when data ready: RAW
  – Buffer register values in reservation stations when instruction is issued: WAR
  – Register destination field: WAW

Single-issue Tomasulo
• The previous mechanisms support a model where:
  – Instructions are issued one at a time, in program order
    » Into reservation stations
  – Multiple instructions can begin execution at a time, out of program order
    » Provided their operands, and independent functional units are available
  – Instructions can complete out of order

Support for multiple issue
• There are several phases in the Tomasulo algorithm
  – Fetch into instruction queue
  – Decode and issue to reservation stations
  – Execute
  – Write results
• If any of these phases support only one instruction per cycle
  – Multiple issue will not yield CPI<1

Support for multiple issue (cont)
• Our Tomasulo design so far:
  – Execution supports multiple instructions
  – Write results support multiple instructions
    » Provided there are multiple functional units, and CDBs
• Extensions:
  – Fetching multiple instructions per cycle
  – Issuing multiple instructions per cycle

Fetching multiple instructions
• Must be able to fill in instruction queue at a higher rate
  – If want to sustain “N” instructions per cycle, must fetch at least “N” instructions per cycle
• Must look ahead the program order stream and fetch from multiple PCs

Decoding multiple operations
• Once multiple instructions are in the fetch queue
  – Must be able to decode and issue more than one in a single cycle
  – An “issue packet”

Issue logic
• First, need to check whether reservation stations are available
  – For each instruction that may be issued
  – As previously, but now the check gets more complex
    » Not only reservation stations that are already in use, but also those that will be occupied by other instructions in the issue packet

Issue logic (cont)
• Second, need to resolve dependences within issue packet
  – In addition to those with instructions already in flight
• Keep in mind checks must preserve program order within issue packet

Issue logic (cont)
• Check must be done in parallel
  – While preserving program order
• Possible approaches:
  – Hardware checks for all possible combinations within a single cycle
  – Pipeline the issue stage

Simplifications
• If floating point and integer registers are separate
  – May facilitate issue logic by looking at one FP and one INT instruction at a time
• But some programs only have INT instructions
• General case: support issue of arbitrary mix of INT, FP
  – And loads, stores

Tomasulo: decoupling
• Fetch, issue and execution are decoupled
  – Number of instructions fetched in cycle “k” does not imply same number of instructions issued in cycle “k+1”
  – Number of instructions issued does not imply same number of instructions executed
• Once fetch/issue stages are improved
  – Execution works just as before

Example
  Loop: L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)
        DADDIU R1,R1,#-8
        BNE    R1,R2,Loop
• 1 integer ALU, 1 FP ALU, 1 address adder, 1 memory unit, 2 CDBs
• Multiple reservation stations
• Issue packet: any two instructions (except branches; single issue)
• Branch-predicted instructions can be fetched, issued but not executed

Cycle #0
• L.D, ADD.D fetched into instruction queue

Cycle #1
• L.D, ADD.D issued
  – What checks need to be made?
  – What are the Vj/Vk, Qj/Qk of ADD.D RS set to?
• S.D, DADDIU brought into instruction queue

Tomasulo Example Cycle 1
Instruction status
  Instruction   j    k    Issue   Execution complete   Write Result
  LD   F0       0+   R1   1
  ADD  F4       F0   F2   1
Load/store buffers
  Name     Busy   Address
  Load1    Yes    0+R1
Reservation Stations
  Time   Name   Busy   Op    Vj (S1)   Vk (S2)   Qj (RS for j)   Qk (RS for k)
         Add1   Yes    Add             F2        Load1
Register result status
  Clock        F0      F2   F4     F6   F8   F10   F12   ...   F30
  1       FU   Load1        Add1

Cycle #2
• S.D, DADDIU issued
• BNE brought into instruction queue
  – Assumption: single-issue of branches

Tomasulo Example Cycle 2
Instruction status
  Instruction   j    k    Issue   Execution complete   Write Result
  LD   F0       0+   R1   1
  ADD  F4       F0   F2   1
  SD   F4       0+   R1   2
  DAD  R1       R1   -8   2
Load/store buffers
  Name     Busy   Address
  Load1    Yes    0+R1
  Store1   Yes    0+R1
Reservation Stations
  Time   Name   Busy   Op    Vj (S1)   Vk (S2)   Qj (RS for j)   Qk (RS for k)
         Add1   Yes    Add             F2        Load1
         Int    Yes    DAD   R1        -8
Register result status
  Clock        F0      F2   F4     F6   F8   F10   F12   ...   F30
  2       FU   Load1        Add1

Cycle #3
• BNE issued
• L.D, DADDIU access memory, execute
• L.D, ADD.D brought into instruction queue
  – Under what assumption?

Cycle #4
• L.D, ADD.D issued
  – Can they execute?
• L.D, DADDIU write results
  – Who needs their results?
• S.D, DADDIU brought into instruction queue

Cycle #5
• BNE issued
• BNE executes
  – What happens after branch is resolved?
• L.D, ADD.D brought into instruction queue

Limitations/challenges
• Multiple issue relies on fetching a stream of instructions ahead of time
  – Branch/target prediction is very important
• What to do with instructions that depend on prediction results?
  – They may or may not be valid
• Can always fetch and issue them
  – No harm done until memory or register contents are written
  – But just fetching/issuing not enough

Dealing with branches
• Predicting branches lets us:
  – Fetch appropriate instructions into queue
  – Issue instructions into reservation stations
  – If prediction is incorrect
    » Flush queue
    » Flush reservation station entries
• If prediction is correct
  – Some overlapping will come from being able to fetch/issue new instructions
  – If branch outcome takes long to compute, will fill up reservation stations

Limitations/challenges (cont)
• Single- and multi-issue Tomasulo schemes so far allow for out-of-order completion
  – Difficult to support precise exceptions
• Approach to handling this and previous issue
  – Speculative execution

Multiple-issue Tomasulo
• Extends fetch/issue to handle multiple instructions per cycle
• Branches pose limitations
  – Prediction lets processor fetch ahead
  – Can also issue
  – But cannot execute, complete until known if prediction is correct

Control Dependencies
• Every instruction is control dependent on some set of branches
    if p1 S1;
    if p2 S2;
• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
• Control dependencies must be preserved to preserve program order

Speculative execution
• Let instructions that follow a predicted branch actually execute and complete
  – Speculate; “gamble” that a prediction is correct, then verify prediction
  – Buffer all instructions that complete until they are safe to alter registers/memory
    » And commit these results in program order
• Hardware enhancement: Reorder Buffers (ROBs)

Example
        ADD  R1, R2, R3
        BNEZ R1, Foo
        MULT R4, R4, R5
  Foo:  ADD  R4, R4, R6
• Assume R4=1, R5=10, R6=20, branch predicted taken, 5 cycles to resolve
• Correct results: R4=21 (taken), R4=30 (not taken)

Example (cont)
• The processor issues and executes the instructions along the predicted-taken path (marked in red on the slide) -> R4=1+20
• Correct if branch taken
• What if not taken?

Example (cont)
• If branch not taken:
  – Cannot simply roll-back and execute MULT and ADD
    » Would result in R4=21*10+20=230
  – Cannot simply execute MULT and skip ADD
    » Would result in R4=21*10=210

Example (cont)
• If branch not taken, two things must happen:
  1) ADD’s result must be discarded, and
  2) MULT and ADD must be executed in order (roll-back sequential execution from MULT)

Supporting discards/roll-back
• Do not let results go immediately to register file
  – Stage them in a buffer
• Once branch condition is checked:
  – If prediction was correct, copy from buffer to register file
  – If incorrect, clear buffer entries and roll-back

Re-order buffer

Key ideas
• Add another buffer to the design
  – Re-order buffer (ROB)
• Break the completion of an instruction into two stages:
  – Write result
    » From functional units to the ROB
  – Commit results
    » From ROB to registers, memory

Entries
• ROB:
  – Busy
  – Instruction opcode
  – Destination (register, memory)
  – Value
• New reservation station entry:
  – “Dest” (index points to ROB entry)
  – Also, Qj, Qk now point to ROB entries (not to RS)
• Register status modifications
  – Index (“Reorder”) now also points to ROB entry (not to RS)

Execution stages
• (fetch)
• 1.
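
The write-result/commit split described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the slides: the names `RobEntry`, `ReorderBuffer` and `run` are hypothetical, operand forwarding between in-flight ROB entries is omitted, and the values chosen for R2 and R3 are made up so that the branch goes one way or the other. It replays the ADD/BNEZ/MULT/ADD example with the branch predicted taken.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    op: str                       # instruction opcode (for display only)
    dest: Optional[str] = None    # destination register; None for a branch
    value: Optional[int] = None   # result staged in the ROB, not in the register file
    done: bool = False            # True once "write result" has happened
    mispredicted: bool = False    # set on a branch that resolved against its prediction


class ReorderBuffer:
    """Hypothetical, simplified ROB: in-order issue, in-order commit."""

    def __init__(self, regs):
        self.regs = regs
        self.entries = deque()

    def issue(self, op, dest=None):
        entry = RobEntry(op, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        # Stage 1: functional unit -> ROB. The register file is untouched.
        entry.value, entry.done, entry.mispredicted = value, True, mispredicted

    def commit(self):
        # Stage 2: ROB -> registers, strictly from the head (program order).
        # Returns True if a mispredicted branch caused a flush.
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            if head.mispredicted:
                self.entries.clear()      # discard all speculative results
                return True
            if head.dest is not None:
                self.regs[head.dest] = head.value
        return False


def run(r2, r3):
    regs = {"R1": 0, "R2": r2, "R3": r3, "R4": 1, "R5": 10, "R6": 20}
    rob = ReorderBuffer(regs)
    add1 = rob.issue("ADD", "R1")
    branch = rob.issue("BNEZ")
    spec = rob.issue("ADD", "R4")             # predicted taken: fetch from Foo

    rob.write_result(spec, regs["R4"] + regs["R6"])   # speculative result: 21
    assert regs["R4"] == 1                    # still only in the ROB

    r1 = regs["R2"] + regs["R3"]
    rob.write_result(add1, r1)
    rob.write_result(branch, mispredicted=(r1 == 0))  # BNEZ not taken when R1 == 0

    if rob.commit():
        # Roll back: re-execute sequentially from MULT.
        mult = rob.issue("MULT", "R4")
        rob.write_result(mult, regs["R4"] * regs["R5"])
        rob.commit()
        add2 = rob.issue("ADD", "R4")
        rob.write_result(add2, regs["R4"] + regs["R6"])
        rob.commit()
    return regs["R4"]
```

With these assumptions, `run(1, 0)` (branch taken, prediction correct) leaves R4=21 and `run(0, 0)` (branch not taken, prediction wrong) leaves R4=30, matching the correct results stated in the example.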

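
Returning to the issue-logic slides: the checks for an issue packet (reservation station availability, and dependences on earlier instructions in the same packet) can be sketched as follows. Hardware performs these checks in parallel; this sketch walks the packet sequentially in program order purely for readability. The function name `issue_packet`, the unit names `"load"` and `"fp_add"`, and the data layout are all assumptions made for illustration.

```python
def issue_packet(packet, reg_status, free_rs):
    """Issue as many instructions from the packet as possible, in program order.

    packet     : list of (op, unit, dest, srcs) tuples in program order
    reg_status : dict register -> name of the RS that will produce it
                 (a register that is absent has its value in the register file)
    free_rs    : dict unit -> list of free reservation station names
    Returns a list of (rs_name, op, tags), where tags maps each source
    register to the producing RS (Q field) or None if the value is ready.
    """
    issued = []
    for op, unit, dest, srcs in packet:
        if not free_rs.get(unit):
            break                  # no free RS: stop here to keep issue in order
        rs = free_rs[unit].pop(0)  # an earlier packet member may have taken one
        # Sources see writers already in flight AND earlier packet members,
        # because reg_status is updated as each instruction is processed.
        tags = {src: reg_status.get(src) for src in srcs}
        if dest is not None:
            reg_status[dest] = rs
        issued.append((rs, op, tags))
    return issued


# Cycle #1 of the loop example: L.D and ADD.D issue together.
status = {}
issued = issue_packet(
    [("L.D", "load", "F0", ["R1"]), ("ADD.D", "fp_add", "F4", ["F0", "F2"])],
    status,
    {"load": ["Load1"], "fp_add": ["Add1"]},
)
```

Under these assumptions the ADD.D entry ends up waiting on Load1 for F0 with F2 ready, and the register status maps F0 to Load1 and F4 to Add1, which agrees with the Cycle 1 tables.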