Dynamic Scheduling II

This Unit: Dynamic Scheduling II Application • Previously: dynamic scheduling OS • Insn buffer + scheduling algorithms Compiler Firmware CIS 501 • Scoreboard: no register renaming • Tomasulo: register renaming Introduction to Computer Architecture CPU I/O Memory • Now: add speculation, precise state Digital Circuits • Re-order buffer Unit 9: Dynamic Scheduling II Gates & Transistors • PentiumPro vs. MIPS R10000 • Also: dynamic load scheduling • Out-of-order memory operations CIS 501 (Martin/Roth): Dynamic Scheduling II 1 CIS 501 (Martin/Roth): Dynamic Scheduling II 2 Readings Superscalar + Out-of-Order + Speculation • H+P • Three great tastes that taste great together • None (not happy with explanation of this topic) • CPI ! 1? • Go superscalar • Superscalar increases RAW hazards? • Go out-of-order (OoO) • RAW hazards still a problem? • Build a larger window • Branches a problem for filling large window? • Add control speculation CIS 501 (Martin/Roth): Dynamic Scheduling II 3 CIS 501 (Martin/Roth): Dynamic Scheduling II 4 Speculation and Precise Interrupts Precise State • Why are we discussing these together? • Speculative execution requires • Sequential (vN) semantics for interrupts • (Ability to) abort & restart at every branch • All insns before interrupt should be complete • Abort & restart at every load useful for load speculation (later) • All insns after interrupt should look as if never started (abort) • And for shared memory multiprocessing (much later) • Basically want same thing for mis-predicted branch • Precise synchronous (program-internal) interrupts require • What makes precise interrupts difficult? • Abort & restart at every load, store, ?? • OoO completion ! must undo post-interrupt writebacks • Precise asynchronous (external) interrupts require • Same thing for branches • Abort & restart at every ?? • In-order ! branches complete before younger insns writeback • OoO ! not necessarily • Bite the bullet • Implement abort & restart at every insn • Precise interrupts, mis-speculation recovery: same problem • Called “precise state” • Same problem ! same solution CIS 501 (Martin/Roth): Dynamic Scheduling II 5 CIS 501 (Martin/Roth): Dynamic Scheduling II 6 Precise State Options The Problem with Precise State • Imprecise state: ignore the problem! insn buffer – Makes page faults (any restartable exceptions) difficult – Makes speculative execution almost impossible regfile • IEEE standard strongly suggests precise state I$ D$ • Compromise: Alpha implemented precise state only for integer ops B P D S • Force in-order completion (W): stall pipe if necessary – Slow • Precise state in software: trap to recovery routine • Problem: writeback combines two separate functions – Implementation dependent • Forwards values to younger insns: OK for this to be out-of-order • Trap on every mis-predicted branch (you must be joking) • Write values to registers: would like this to be in-order • Precise state in hardware • Similar problem (decode) for OoO execution: solution? + Everything is better in hardware (except policy) • Split decode (D) ! in-order dispatch (D) + out-of-order issue (S) • Separate using insn buffer: scoreboard or reservation station CIS 501 (Martin/Roth): Dynamic Scheduling II 7 CIS 501 (Martin/Roth): Dynamic Scheduling II 8 Re-Order Buffer (ROB) Complete and Retire Reorder buffer (ROB) Reorder buffer (ROB) regfile regfile I$ I$ D$ D$ B B P W1 W2 P C R • Insn buffer ! re-order buffer (ROB) • Complete (C): second part of decode • Buffers completed results en route to register file • Completed insns write results into ROB • May be combined with RS or separate + Out-of-order: wait doesn’t back-propagate to younger insns • Combined in picture: register-update unit RUU (Sohi’s method) • Retire (R): aka commit, graduate • Separate (more common today): P6-style • ROB writes results to register file • Split writeback (W) into two stages • In order: stall back-propagates to younger insns • Why is there no latch between W1 and W2? CIS 501 (Martin/Roth): Dynamic Scheduling II 9 CIS 501 (Martin/Roth): Dynamic Scheduling II 10 Load/Store Queue (LSQ) ROB + LSQ • ROB makes register writes in-order, but what about stores? ROB regfile • As usual, i.e., to D$ in X stage? I$ • Not even close, imprecise memory worse than imprecise registers B P C R • Load/store queue (LSQ) load data • Completed stores write to LSQ store data addr • When store retires, head of LSQ written to D$ D$ • When loads execute, access LSQ and D$ in parallel load/store LSQ • Forward from LSQ if older store with matching address • More modern design: loads and stores in separate queues • More on this later • Modulo gross simplifications, this picture is almost realistic! CIS 501 (Martin/Roth): Dynamic Scheduling II 11 CIS 501 (Martin/Roth): Dynamic Scheduling II 12 P6 P6 Data Structures • P6: Start with Tomasulo’s algorithm… add ROB • Reservation Stations are same as before • Separate ROB and RS • ROB • head, tail: pointers maintain sequential order • Simple-P6 • R: insn output register, V: insn output value • Our old RS organization: 1 ALU, 1 load, 1 store, 2 3-cycle FP • Tags are different • Tomasulo: RS# ! P6: ROB# • Map Table is different • T+: tag + “ready-in-ROB” bit • T==0 ! Value is ready in regfile • T!=0 ! Value is not ready • T!=0+ ! Value is ready in the ROB CIS 501 (Martin/Roth): Dynamic Scheduling II 13 CIS 501 (Martin/Roth): Dynamic Scheduling II 14 P6 Data Structures P6 Data Structures Regfile ROB Map Table CDB Map Table T+ value R value ht # Insn R V S X C Reg T+ T V Head 1 ldf X(r1),f1 f0 Retire 2 mulf f0,f1,f2 f1 3 stf f2,Z(r1) f2 Tail 4 addi r1,4,r1 r1 Dispatch 5 ldf X(r1),f1 CDB.V op T T1 T2 CDB.T V1 V2 ROB 6 mulf f0,f1,f2 == == == == 7 stf f2,Z(r1) == == Dispatch == == RS Reservation Stations T FU # FU busy op T T1 T2 V1 V2 1 ALU no • Insn fields and status bits 2 LD no • Tags 3 ST no 4 FP1 no • Values 5 FP2 no CIS 501 (Martin/Roth): Dynamic Scheduling II 15 CIS 501 (Martin/Roth): Dynamic Scheduling II 16 P6 Pipeline P6 Pipeline • New pipeline structure: F, D, S, X, C, R • C (complete) • D (dispatch) • Structural hazard (CDB)? wait • Structural hazard (ROB/LSQ/RS) ? Stall • Write value into ROB entry indicated by RS tag • Allocate ROB/LSQ/RS • Mark ROB entry as complete • Set RS tag to ROB# • If not overwritten, mark Map Table entry “ready-in-ROB” bit (+) • Set Map Table entry to ROB# and clear “ready-in-ROB” bit • R (retire) • Read ready registers into RS (from either ROB or Regfile) • Insn at ROB head not complete ? stall • X (execute) • Handle any exceptions • Free RS entry • Write ROB head value to register file • Use to be at W, can be earlier because RS# are not tags • If store, write LSQ head to D$ • Free ROB/LSQ entries CIS 501 (Martin/Roth): Dynamic Scheduling II 17 CIS 501 (Martin/Roth): Dynamic Scheduling II 18 P6 Dispatch (D): Part I P6 Dispatch (D): Part II Regfile Regfile Map Table T+ value R value Map Table T+ value R value Head Head Retire Retire Tail Tail Dispatch Dispatch CDB.V CDB.V op T T1 T2 CDB.T V1 V2 ROB op T T1 T2 CDB.T V1 V2 ROB == == == == == == == == == == == == Dispatch == == Dispatch == == RS RS T FU T FU • RS/ROB full ? stall • Read tags for register inputs from Map Table • Allocate RS/ROB entries, assign ROB# to RS output tag • Tag==0 ! copy value from Regfile (not shown) • Set output register Map Table entry to ROB#, clear “ready-in-ROB” • Tag!=0 ! copy Map Table tag to RS • Tag!=0+ ! copy value from ROB CIS 501 (Martin/Roth): Dynamic Scheduling II 19 CIS 501 (Martin/Roth): Dynamic Scheduling II 20 P6 Complete (C) P6 Retire (R) Regfile Regfile Map Table T+ value R value Map Table T value R value Head Head Retire Retire Tail Tail Dispatch Dispatch CDB.V CDB.V op T T1 T2 CDB.T V1 V2 ROB op T T1 T2 CDB.T V1 V2 ROB == == == == == == == == == == == == Dispatch == == Dispatch == == RS RS T FU T FU • Structural hazard (CDB) ? Stall : broadcast <value,tag> on CDB • ROB head not complete ? stall : free ROB entry • Write result into ROB, if still valid clear MapTable “ready-in-ROB” bit • Write ROB head result to Regfile • Match tags, write CDB.V into RS slots of dependent insns • If still valid, clear Map Table entry CIS 501 (Martin/Roth): Dynamic Scheduling II 21 CIS 501 (Martin/Roth): Dynamic Scheduling II 22 P6: Cycle 1 P6: Cycle 2 ROB Map Table CDB ROB Map Table CDB ht # Insn R V S X C Reg T+ T V ht # Insn R V S X C Reg T+ T V ht 1 ldf X(r1),f1 f1 f0 h 1 ldf X(r1),f1 f1 c2 f0 2 mulf f0,f1,f2 f1 ROB#1 t 2 mulf f0,f1,f2 f2 f1 ROB#1 3 stf f2,Z(r1) f2 3 stf f2,Z(r1) f2 ROB#2 4 addi r1,4,r1 r1 4 addi r1,4,r1 r1 5 ldf X(r1),f1 5 ldf X(r1),f1 6 mulf f0,f1,f2 6 mulf f0,f1,f2 7 stf f2,Z(r1) 7 stf f2,Z(r1) Reservation Stations Reservation Stations # FU busy op T T1 T2 V1 V2 set ROB# tag # FU busy op T T1 T2 V1 V2 set ROB# tag 1 ALU no 1 ALU no 2 LD yes ldf ROB#1 [r1] allocate 2 LD yes ldf ROB#1 [r1] 3 ST no 3 ST no 4 FP1 no 4 FP1 yes mulf ROB#2 ROB#1 [f0] allocate 5 FP2 no 5 FP2 no CIS 501 (Martin/Roth): Dynamic Scheduling II 23 CIS 501 (Martin/Roth): Dynamic Scheduling II 24 P6: Cycle 3 P6: Cycle 4 ROB Map Table CDB ROB Map Table CDB ht # Insn R V S X C Reg T+ T V ht # Insn R V S X C Reg T+ T V h 1 ldf X(r1),f1 f1 c2 c3 f0 h 1 ldf X(r1),f1 f1 [f1] c2 c3 c4 f0 ROB#1 [f1] 2 mulf f0,f1,f2 f2 f1 ROB#1 2 mulf f0,f1,f2 f2 c4 f1 ROB#1+ t 3 stf f2,Z(r1) f2 ROB#2 3 stf f2,Z(r1) f2 ROB#2 4 addi r1,4,r1 r1 t 4 addi r1,4,r1 r1 r1 ROB#4 5 ldf X(r1),f1 5 ldf X(r1),f1 ldf finished 6 mulf f0,f1,f2 6 mulf f0,f1,f2 1.

Load more