Dynamic Scheduling I

This Unit: Dynamic Scheduling I Application • Dynamic scheduling OS • Out-of-order execution Compiler Firmware CIS 501 • Scoreboard Introduction to Computer Architecture CPU I/O • Dynamic scheduling with WAW/WAR Memory • Tomasulo’s algorithm • Add register renaming to fix WAW/WAR Digital Circuits Gates & Transistors Unit 8: Dynamic Scheduling I • Next unit • Adding speculation and precise state • Dynamic load scheduling CIS 501 (Martin/Roth): Dynamic Scheduling I 1 CIS 501 (Martin/Roth): Dynamic Scheduling I 2 Readings The Problem With In-Order Pipelines 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 • H+P addf f0,f1,f2 F D E+ E+ E+ W • None (not happy with explanation of this topic) mulf f2,f3,f2 F D d* d* E* E* E* E* E* W subf f0,f1,f4 F p* p* D E+ E+ E+ W • What’s happening in cycle 4? • mulf stalls due to RAW hazard • OK, this is a fundamental problem • subf stalls due to pipeline hazard • Why? subf can’t proceed into D because addf is there • That is the only reason, and it isn’t a fundamental one • Why can’t subf go into D in cycle 4 and E+ in cycle 5? CIS 501 (Martin/Roth): Dynamic Scheduling I 3 CIS 501 (Martin/Roth): Dynamic Scheduling I 4 Dynamic Scheduling: The Big Picture Register Renaming add p2,p3,p4 sub p2,p4,p5 • To eliminate WAW and WAR hazards mul p2,p5,p6 regfile • Example div p4,4,p7 • Names: r1,r2,r3 I$ insn buffer D$ • Locations: p1,p2,p3,p4,p5,p6,p7 B P D S • Original mapping: r1!p1, r2!p2, r3!p3, p4–p7 are “free” MapTable FreeList Raw insns Renamed insns Ready Table r1 r2 r3 P2 P3 P4 P5 P6 P7 p1 p2 p3 p4,p5,p6,p7 add r2,r3,r1 add p2,p3,p4 Yes Yes add p2,p3,p4 p4 p2 p3 p5,p6,p7 sub r2,r1,r3 sub p2,p4,p5 Yes Yes Yes Yes sub p2,p4,p5 and div p4,4,p7 p4 p2 p5 p6,p7 mul r2,r3,r3 mul p2,p5,p6 Yes Yes Yes Yes Yes mul p2,p5,p6 p4 p2 p6 p7 div r1,4,r1 div p4,4,p7 Yes Yes Yes Yes Yes Yes • Instructions fetch/decoded/renamed into Instruction Buffer • Renaming • Also called “instruction window” or “instruction scheduler” + Removes WAW and WAR dependences • Instructions (conceptually) check ready bits every cycle + Leaves RAW intact! • Execute when ready CIS 501 (Martin/Roth): Dynamic Scheduling I 5 CIS 501 (Martin/Roth): Dynamic Scheduling I 6 Dynamic Scheduling - OoO Execution Static Instruction Scheduling • Dynamic scheduling • Issue: time at which insns execute • Totally in the hardware • Schedule: order in which insns execute • Also called “out-of-order execution” (OoO) • Related to issue, but the distinction is important • Fetch many instructions into instruction window • Use branch prediction to speculate past (multiple) branches • Scheduling: re-arranging insns to enable rapid issue • Flush pipeline on branch misprediction • Static: by compiler • Rename to avoid false dependencies (WAW and WAR) • Requires knowledge of pipeline and program dependences • Execute instructions as soon as possible • Pipeline scheduling: the basics • Register dependencies are known • Requires large scheduling scope full of independent insns • Handling memory dependencies more tricky (much more later) • Loop unrolling, software pipelining: increase scope for loops • Commit instructions in order • Trace scheduling: increase scope for non-loops • Any strange happens before commit, just flush the pipeline • Current machines: 100+ instruction scheduling window Anything software can do … hardware can do better CIS 501 (Martin/Roth): Dynamic Scheduling I 7 CIS 501 (Martin/Roth): Dynamic Scheduling I 8 Motivation Dynamic Scheduling Before We Continue • Dynamic scheduling (out-of-order execution) • If we can do this in software… • Execute insns in non-sequential (non-VonNeumann) order… • …why build complex (slow-clock, high-power) hardware? + Reduce RAW stalls + Performance portability + Increase pipeline and functional unit (FU) utilization • Don’t want to recompile for new machines • Original motivation was to increase FP unit utilization + More information available + Expose more opportunities for parallel issue (ILP) • Memory addresses, branch directions, cache misses • Not in-order ! can be in parallel + More registers available (??) • …but make it appear like sequential execution • Compiler may not have enough to fix WAR/WAW hazards • Important + Easier to speculate and recover from mis-speculation – But difficult • Flush instead of recover code • Next unit – But compiler has a larger scope • Compiler does as much as it can (not much) • Hardware does the rest CIS 501 (Martin/Roth): Dynamic Scheduling I 9 CIS 501 (Martin/Roth): Dynamic Scheduling I 10 Going Forward: What’s Next Dynamic Scheduling as Loop Unrolling • We’ll build this up in steps over the next few weeks • Three steps of loop unrolling • “Scoreboarding” - first OoO, no register renaming • Step I: combine iterations • “Tomasulo’s algorithm” - adds register renaming • Increase scheduling scope for more flexibility • Handling precise state and speculation • Step II: pipeline schedule • P6-style execution (Intel Pentium Pro) • Reduce impact of RAW hazards • R10k-style execution (MIPS R10k) • Step III: rename registers • Handling memory dependencies • Remove WAR/WAW violations that result from scheduling • Conservative and speculative • Let’s get started! CIS 501 (Martin/Roth): Dynamic Scheduling I 11 CIS 501 (Martin/Roth): Dynamic Scheduling I 12 Loop Example: SAX (SAXPY – PY) New Pipeline Terminology • SAX (Single-precision A X) • Only because there won’t be room in the diagrams for SAXPY regfile for (i=0;i<N;i++) I$ D$ Z[i]=A*X[i]; B P 0: ldf X(r1),f1 // loop 1: mulf f0,f1,f2 // A in f0 2: stf f4,Z(r1) • In-order pipeline 3: addi r1,4,r1 // i in r1 • Often written as F,D,X,W (multi-cycle X includes M) 4: blt r1,r2,0 // N*4 in r2 • Example pipeline: 1-cycle int (including mem), 3-cycle pipelined FP • Consider two iterations, ignore branch ldf, mulf, stf, addi, ldf, mulf, stf CIS 501 (Martin/Roth): Dynamic Scheduling I 13 CIS 501 (Martin/Roth): Dynamic Scheduling I 14 New Pipeline Diagram The Problem With In-Order Pipelines Insn D X W ldf X(r1),f1 c1 c2 c3 mulf f0,f1,f2 c3 c4+ c7 stf f2,Z(r1) c7 c8 c9 regfile addi r1,4,r1 c8 c9 c10 I$ ldf X(r1),f1 c10 c11 c12 D$ B mulf f0,f1,f2 c12 c13+ c16 P stf f2,Z(r1) c16 c17 c18 • Alternative pipeline diagram • In-order pipeline • Down: insns • Structural hazard: 1 insn register (latch) per stage • Across: pipeline stages • 1 insn per stage per cycle (unless pipeline is replicated) • In boxes: cycles • Younger insn can’t “pass” older insn without “clobbering” it • Basically: stages " cycles • Out-of-order pipeline • Convenient for out-of-order • Implement “passing” functionality by removing structural hazard CIS 501 (Martin/Roth): Dynamic Scheduling I 15 CIS 501 (Martin/Roth): Dynamic Scheduling I 16 Instruction Buffer Dispatch and Issue insn buffer insn buffer regfile regfile I$ I$ D$ D$ B B P D1 D2 P D S • Trick: insn buffer (many names for this buffer) • Dispatch (D): first part of decode • Basically: a bunch of latches for holding insns • Allocate slot in insn buffer • Implements iteration fusing … here is your scheduling scope – New kind of structural hazard (insn buffer is full) • Split D into two pieces • In order: stall back-propagates to younger insns • Accumulate decoded insns in buffer in-order • Issue (S): second part of decode • Buffer sends insns down rest of pipeline out-of-order • Send insns from insn buffer to execution units + Out-of-order: wait doesn’t back-propagate to younger insns CIS 501 (Martin/Roth): Dynamic Scheduling I 17 CIS 501 (Martin/Roth): Dynamic Scheduling I 18 Dispatch and Issue with Floating-Point Dynamic Scheduling Algorithms insn buffer • Three parts to loop unrolling • Scheduling scope: insn buffer regfile • Pipeline scheduling and register renaming: scheduling algorithm I$ D$ B • Look at two register scheduling algorithms P D S • Register scheduler: scheduler based on register dependences • Scoreboard E* E* E* • No register renaming ! limited scheduling flexibility • Tomasulo E E • Register renaming ! more flexibility, better performance + + • Big simplification in this unit: memory scheduling E/ • Pretend register algorithm magically knows memory dependences • A little more realism next unit F-regfile CIS 501 (Martin/Roth): Dynamic Scheduling I 19 CIS 501 (Martin/Roth): Dynamic Scheduling I 20 Scheduling Algorithm I: Scoreboard Scoreboard Data Structures • Scoreboard • FU Status Table • Centralized control scheme: insn status explicitly tracked • FU, busy, op, R, R1, R2: destination/source register names • Insn buffer: Functional Unit Status Table (FUST) • T: destination register tag (FU producing the value) • T1,T2: source register tags (FU producing the values) • First implementation: CDC 6600 [1964] • Register Status Table • 16 separate non-pipelined functional units (7 int, 4 FP, 5 mem) • T: tag (FU that will write this register) • No bypassing • Tags interpreted as ready-bits • Tag == 0 ! Value is ready in register file • Our example: “Simple Scoreboard” • Tag != 0 ! Value is not ready, will be supplied by T • 5 FU: 1 ALU, 1 load, 1 store, 2 FP (3-cycle, pipelined) • Insn status table • S,X bits for all active insns CIS 501 (Martin/Roth): Dynamic Scheduling I 21 CIS 501 (Martin/Roth): Dynamic Scheduling I 22 Simple Scoreboard Data Structures Scoreboard Pipeline Regfile • New pipeline structure: F, D, S, X, W Insn Reg Status S X T value • F (fetch) • Same as it ever was • D (dispatch) • Structural or WAW hazard ? stall : allocate scoreboard entry Fetched R1 R2 R op T T1 T2 • S (issue) == == insns == == • RAW hazard ? wait : read registers, go to execute == == CAMs == == FU Status • X (execute)

Load more