Computer Architecture Spring 2016

Lecture 11: Dynamic Scheduling: Scoreboarding

Shuai Wang

Department of Computer Science and Technology, Nanjing University
[Slides adapted from CS252, UC Berkeley and CS 246, Harvard University]

Dynamic Scheduling
• Dynamic scheduling: rely on hardware to rearrange inst. execution to reduce stalls while maintaining data flow and exception behavior

• Advantages of dynamic scheduling:
  – Handles cases where dependences are unknown at compile time, e.g., those involving memory references
  – Simplifies the compiler
  – Allows code compiled for one pipeline to run efficiently on a different pipeline
  – Forms the basis of hardware speculation, a technique with significant performance advantages

Dynamic Scheduling: The Idea
• Key idea: allow instructions behind a stall to proceed
    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14
  – ADDD must wait for DIVD to produce F0 (RAW), but SUBD depends on neither, so it can execute while ADDD waits

• Enables out-of-order execution and allows out-of-order completion
• But:
  – Dynamic scheduling introduces the possibility of WAW and WAR hazards
  – Dynamic scheduling may generate imprecise exceptions

Precise vs. Imprecise Exceptions
• Precise exception: the pipeline can be stopped so that the inst. before the faulting inst. are completed and those after it can be restarted from scratch

• Imprecise exception: the processor state when an exception is raised does not look exactly as if the inst. had executed sequentially in strict program order, e.g.:
  – Inst. later in program order than the faulting inst. have already completed
  – Inst. earlier in program order than the faulting inst. have not yet completed

Dynamic Scheduling with a Scoreboard
• Scoreboard: allows inst. to execute out of order when there are sufficient resources and no data dependences

• Named after CDC6600 scoreboard

• The scoreboard takes full responsibility for instruction issue and execution, including all hazard detection

Supercomputer (CDC6600)
Definitions of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O-bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray

• CDC6600 (Cray, 1963) regarded as first supercomputer

Control Data Corporation (CDC), 1957-1992

CDC 6600
Seymour Cray, 1963
• A fast pipelined machine with 60-bit words
  – 128 Kword main memory capacity, 32 banks
• Ten functional units (parallel, unpipelined)
  – Floating point: adder, 2 multipliers, divider
  – Integer: adder, 2 incrementers, ...
• Hardwired control (no microcoding)
• Scoreboard for dynamic scheduling of instructions
• Ten peripheral processors for input/output
  – a fast multi-threaded 12-bit integer ALU
• Very fast clock, 10 MHz (FP add in 4 clocks)
• >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling
• Fastest machine in the world for 5 years (until the 7600)
  – over 100 sold ($7-10M each)

CDC 6600: A Load/Store Architecture (RISC)
• Separate instructions to manipulate three types of registers
  – 8 × 60-bit data registers (X)
  – 8 × 18-bit address registers (A)
  – 8 × 18-bit index registers (B)

• All arithmetic and logic instructions are register-to-register (15-bit format)
    opcode (6 bits) | i (3) | j (3) | k (3)        Ri ← Rj op Rk

• Only load and store instructions refer to memory! (30-bit format)
    opcode (6 bits) | i (3) | j (3) | disp (18)    Ri ← M[Rj + disp]
  – Writing address registers A1 to A5 initiates a load (into X1 to X5); writing A6 or A7 initiates a store (from X6 or X7)
  – Very useful for vector operations

CDC6600 Scoreboard
• Instructions are dispatched in order to functional units, provided there is no structural hazard or WAW hazard
  – Stall on a structural hazard (no functional unit available)
  – Only one pending write to any register
• Instructions wait for their input operands (RAW hazards) before execution
  – Can execute out of order
• Instructions wait for their output register to be read by preceding instructions (WAR)
  – Result is held in the functional unit until the register is free

Scoreboarding

• Centralized scheme
  – No bypassing
  – WAR/WAW hazards are problems

• Originally proposed in CDC6600 (S. Cray, 1963)

First Stage of Scoreboarding – Issue (or Dispatch)
• Issue replaces the first half of the ID stage
  – Check for structural hazards: are all functional units for this instruction busy?
  – Check for WAW hazards: does an active instruction want to write the same destination register?
  – If either hazard exists, stall issue of the current inst. and all later inst. until none remain (in-order issue)
  – Otherwise, issue the instruction and update the scoreboard

Second Stage of Scoreboarding – Read Operands (or Issue!)
• Read operands replaces the second half of ID
  – Check for RAW hazards: are all source operands available?
    • If NO, hold the instruction in a pre-execution buffer (I-Buffer)
    • If YES, the scoreboard tells the functional unit to read the operands from the registers and begin execution

Third Stage of Scoreboarding – Execution
• Execution replaces EX in MIPS
  – Begin execution upon receiving operands
  – Notify the scoreboard when execution completes

Fourth Stage of Scoreboarding – Write Result
• Write Result replaces WB in MIPS
  – Check for WAR hazards; a WAR hazard exists if
    • some preceding instruction has not yet read its operands, and
    • one of those operands is the same register as the destination register of the completing inst.
  – If a WAR hazard is detected, stall the completing inst.
  – Otherwise, the functional unit writes its result to the destination register

Scoreboard Implementation
• Instruction status: which of the four stages the inst. is in
• Functional unit status: state of each functional unit
  – Busy: whether the unit is busy
  – Op: operation to perform in the unit
  – Fi: destination register
  – Fj, Fk: source registers
  – Qj, Qk: functional units producing Fj, Fk
  – Rj, Rk: flags indicating that Fj, Fk are ready and not yet read; set to NO after the operands are read
• Register result status: indicates which functional unit will write each register

Scoreboard Example
(The per-cycle scoreboard tables are not reproduced; the question posed at each step is kept.)
• Cycle 1
• Cycle 2: Issue 2nd LD?
• Cycle 3: Issue MULT?
• Cycles 4–6
• Cycle 7: Read MULT operands?
• Cycle 8a (first half of clock cycle), Cycle 8b (second half of clock cycle)
• Cycle 9: Read operands for MULT & SUB? Issue ADD?
• Cycles 10–11
• Cycle 12: Read operands for DIV?
• Cycles 13–16
• Cycle 17: Why not write result of ADD?
• Cycles 18–20
• Cycle 21: WAR hazard is now gone…
• Cycle 22, then skip a couple of cycles
• Cycles 61–62
• Review: Cycle 62
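The four stages can be tied together in a small cycle-level model that reproduces a walkthrough like the one above. This is a sketch under stated assumptions, not the lecture's material. It assumes the following:

• A six-instruction sequence (LD, LD, MULTD, SUBD, DIVD, ADDD) chosen to match the questions listed above.
• One integer unit, one adder, two multipliers and one divider.
• Execution latencies of 1, 2, 10 and 40 cycles for load, add/subtract, multiply and divide.

The names `Instr`, `simulate`, `LATENCY`, `UNIT_CLASS` and `UNITS` are hypothetical. Each stage only sees events from earlier cycles. As a result, a value written in cycle t can be read in cycle t+1, and a WAR stall clears the cycle after the earlier instruction reads its operands.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed execution latencies (cycles spent in the Execution stage).
LATENCY = {"LD": 1, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}
# Assumed unit mix: one integer unit (loads), one adder, two multipliers, one divider.
UNIT_CLASS = {"LD": "Integer", "ADDD": "Add", "SUBD": "Add", "MULTD": "Mult", "DIVD": "Divide"}
UNITS = {"Integer": ["Integer"], "Add": ["Add"], "Mult": ["Mult1", "Mult2"], "Divide": ["Divide"]}


@dataclass
class Instr:
    op: str
    dest: str
    srcs: tuple
    unit: Optional[str] = None
    issue: Optional[int] = None      # cycle of Issue
    read: Optional[int] = None       # cycle of Read Operands
    exec_done: Optional[int] = None  # cycle in which Execution completes
    write: Optional[int] = None      # cycle of Write Result


def simulate(prog, max_cycles=1000):
    """Fill in the stage cycles of each Instr in place; return the last cycle used."""
    for t in range(1, max_cycles + 1):
        def before(c):
            # Decisions in cycle t only see events recorded in earlier cycles.
            return c is not None and c < t

        # Write Result: stall while an earlier inst. that reads our dest has not read it (WAR).
        for i, ins in enumerate(prog):
            if before(ins.exec_done) and ins.write is None:
                war = any(ins.dest in p.srcs and not before(p.read) for p in prog[:i])
                if not war:
                    ins.write = t

        # Read Operands: wait until the latest earlier writer of each source has written (RAW).
        for i, ins in enumerate(prog):
            if before(ins.issue) and ins.read is None:
                ready = True
                for s in ins.srcs:
                    writers = [p for p in prog[:i] if p.dest == s]
                    if writers and not before(writers[-1].write):
                        ready = False
                if ready:
                    ins.read = t
                    # Execution: completes LATENCY cycles after the operand read.
                    ins.exec_done = t + LATENCY[ins.op]

        # Issue: next inst. in program order, only if a unit is free and there is no WAW.
        nxt = next((ins for ins in prog if ins.issue is None), None)
        if nxt is not None:
            active = [p for p in prog if p.issue is not None and not before(p.write)]
            busy = {p.unit for p in active}
            free = [u for u in UNITS[UNIT_CLASS[nxt.op]] if u not in busy]
            waw = any(p.dest == nxt.dest for p in active)
            if free and not waw:
                nxt.unit, nxt.issue = free[0], t

        if all(ins.write is not None for ins in prog):
            return t
    raise RuntimeError("program did not finish within max_cycles")


# Assumed example sequence; integer base registers are listed as sources, displacements ignored.
program = [
    Instr("LD", "F6", ("R2",)),
    Instr("LD", "F2", ("R3",)),
    Instr("MULTD", "F0", ("F2", "F4")),
    Instr("SUBD", "F8", ("F6", "F2")),
    Instr("DIVD", "F10", ("F0", "F6")),
    Instr("ADDD", "F6", ("F8", "F2")),
]
last = simulate(program)
print("op     dest unit    issue read exec write")
for ins in program:
    print(f"{ins.op:<7}{ins.dest:<5}{ins.unit:<8}{ins.issue:>5}{ins.read:>5}{ins.exec_done:>5}{ins.write:>6}")
print("finished at cycle", last)
```

Under these assumptions, ADDD finishes execution at cycle 16 but cannot write F6 until cycle 22. That is one cycle after DIVD reads its copy of F6 at cycle 21. DIVD writes last, at cycle 62.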

In-order issue; out-of-order execute & commit

Scoreboarding Limitations
• Number and type of functional units

• Number of scoreboard entries

• Amount of application ILP (RAW hazards)

• The presence of anti- and output dependences
  – Lead to WAR and WAW stalls
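The last two limitations can be made concrete by classifying the register conflicts between instruction pairs. True (RAW) dependences bound the available ILP, while anti (WAR) and output (WAW) dependences cause the extra scoreboard stalls. Below is a minimal sketch, with the hypothetical names `classify`, `idea` and `hazards`. Each instruction is assumed to be given as (text, destination, sources).

The sketch checks every pair i < j in program order. It does not filter out pairs separated by an intervening write to the same register.

```python
from typing import List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (text, destination, sources)


def classify(prog: List[Instr]) -> List[Tuple[int, int, str, str]]:
    """Return (i, j, kind, register) for each pair i < j that conflicts on a register."""
    found = []
    for i, (_, dest_i, srcs_i) in enumerate(prog):
        for j in range(i + 1, len(prog)):
            _, dest_j, srcs_j = prog[j]
            if dest_i in srcs_j:
                found.append((i, j, "RAW", dest_i))  # true dependence
            if dest_j in srcs_i:
                found.append((i, j, "WAR", dest_j))  # antidependence
            if dest_i == dest_j:
                found.append((i, j, "WAW", dest_i))  # output dependence
    return found


# The sequence from "Dynamic Scheduling: The Idea".
idea = [
    ("DIVD F0,F2,F4", "F0", ("F2", "F4")),
    ("ADDD F10,F0,F8", "F10", ("F0", "F8")),
    ("SUBD F12,F8,F14", "F12", ("F8", "F14")),
]
# Hypothetical sequence with name dependences on F6.
hazards = [
    ("DIVD F10,F0,F6", "F10", ("F0", "F6")),
    ("ADDD F6,F8,F2", "F6", ("F8", "F2")),
    ("MULTD F6,F6,F4", "F6", ("F6", "F4")),
]
for prog in (idea, hazards):
    for i, j, kind, reg in classify(prog):
        print(f"{prog[j][0]} -> {kind} on {reg} with {prog[i][0]}")
```

For the first sequence, the only conflict is the RAW on F0 between DIVD and ADDD, so SUBD is free to run early. In the hypothetical sequence, the WAR and WAW conflicts on F6 are exactly the cases a scoreboard must stall on.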