Advantages of Dynamic Scheduling

COMP4211 05s1 Seminar 3: Dynamic Scheduling

Slides on Tomasulo's approach due to David A. Patterson, 2001
Scoreboarding slides due to Oliver F. Diessel, 2005

Advantages of Dynamic Scheduling

• Handles cases when dependences are unknown at compile time
  – (e.g., because they may involve a memory reference)
• It simplifies the compiler
• Allows code that was compiled for one pipeline to run efficiently on a different pipeline
• Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling

HW Schemes: Instruction Parallelism

• Key idea: Allow instructions behind a stall to proceed
      DIVD F0,F2,F4
      ADDD F10,F0,F8
      SUBD F12,F8,F14
• Enables out-of-order execution and allows out-of-order completion
• Will distinguish when an instruction begins execution and when it completes execution; between the 2 times, the instruction is in execution
• In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue)

Overview

• We'll look at two schemes for implementing dynamic scheduling
  – Scoreboarding from the 1964 CDC 6600 computer, and
  – Tomasulo's Algorithm, as implemented for the FP unit of the IBM 360/91 in 1966
• Since scoreboarding is a little closer to in-order execution, we'll look at it first

Dynamic Scheduling Step 1

• Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
• Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
  – Issue: decode instructions, check for structural hazards
  – Read operands: wait until no data hazards, then read operands

Scoreboarding

• Instructions pass through the issue stage in order
• Instructions can be stalled or bypass each other in the read operands stage and enter execution out of order
• Scoreboarding allows instructions to execute out of order when there are sufficient resources and no data dependencies
• Named after the CDC 6600 scoreboard, which developed this capability

Scoreboarding ideas

• Note that WAR and WAW hazards can occur with out-of-order execution
  – Scoreboarding deals with both of these by stalling the later instruction involved in the name dependence
• Scoreboarding aims to maintain an execution rate of one instruction per cycle when there are no structural hazards
  – Executes instructions as early as possible
  – When the next instruction to execute is stalled, other instructions can be issued and executed if they do not depend on any active or stalled instruction
• Taking advantage of out-of-order execution requires multiple instructions to be in the EX stage simultaneously
  – Achieved with multiple functional units, with pipelined functional units, or both
• All instructions go through the scoreboard; the scoreboard centralizes control of issue, operand reading, execution and writeback
  – All hazard resolution is centralized in the scoreboard as well

A Scoreboard for MIPS

[Figure: the registers connect over data buses (note: a source of structural hazards) to two FP multipliers, an FP divider, an FP adder and the integer unit; the scoreboard exchanges control/status signals with the registers and with the functional units.]

Steps in Execution with Scoreboarding

1. Issue if a f.u. for the instruction is free and no other active instruction has the same destination register
   • Thus avoids structural and WAW hazards
   • Stalls subsequent fetches when stalled
2. Read operands when all source operands are available
   • Note forwarding is not used
   • A source operand is available if no earlier issued active instruction is going to write it
   • Thus resolves RAW hazards dynamically
3. Execution begins when the f.u. receives its operands; the scoreboard is notified when execution completes
4. Write result after WAR hazards have been resolved
   • E.g., consider the code
         DIV.D F0, F2, F4
         ADD.D F10, F0, F8
         SUB.D F8, F8, F14
     The ADD.D cannot proceed to read operands until DIV.D completes; SUB.D can execute but not write back until ADD.D has read F8.

Scoreboarding details

3 parts to the scoreboard:
1. Instruction status
   – Indicates which of the 4 steps an instruction is in
2. Functional unit status (9 fields)
   Busy – is the f.u. busy or not
   Op – the operation to be performed
   Fi – destination register
   Fj, Fk – source register numbers
   Qj, Qk – f.u. producing source registers Fj, Fk
   Rj, Rk – flags indicating when Fj, Fk are ready; set to "No" after operands are read
3. Register result status
   – Indicates which functional unit will write each register
   – Left blank if not the destination of an active instruction

Scoreboard eg – partially progressed comp.
Scoreboard example continued

(Assume 2 cyc for +, 10 cyc for *, 40 cyc for /)
[Tables: scoreboard status snapshots for the two example slides; contents not recoverable.]

Scoreboard bookkeeping

Instruction status: Issue
  Wait until:  not Busy[FU] and not Result[D]
  Bookkeeping: Busy[FU] ← yes; Op[FU] ← op; Fi[FU] ← D;
               Fj[FU] ← S1; Fk[FU] ← S2;
               Qj ← Result[S1]; Qk ← Result[S2];
               Rj ← not Qj; Rk ← not Qk; Result[D] ← FU;

Instruction status: Read operands
  Wait until:  Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No; Qj ← 0; Qk ← 0;

Instruction status: Execution complete
  Wait until:  Functional unit done
  Bookkeeping: (none)

Instruction status: Write result
  Wait until:  ∀f((Fj[f] ≠ Fi[FU] or Rj[f] = No) & (Fk[f] ≠ Fi[FU] or Rk[f] = No))
  Bookkeeping: ∀f(if Qj[f] = FU then Rj[f] ← Yes);
               ∀f(if Qk[f] = FU then Rk[f] ← Yes);
               Result[Fi[FU]] ← 0; Busy[FU] ← No;

Scoreboarding assessment

• Improvement by a factor of 1.7 for FORTRAN and 2.5 for hand-coded assembly on the CDC 6600!
  – Before semiconductor main memory or caches…
• On the CDC 6600 the scoreboard required about as much logic as a functional unit, which is quite low
• Large number of buses needed; however, since we want to issue multiple instructions per clock, more wires are needed in any case

Limits to Scoreboarding

• A scoreboard uses available ILP to minimize the number of stalls due to true data dependencies.
• Scoreboarding is constrained in achieving this goal by:
  – Available parallelism – determines whether independent instructions can be found
  – The number of scoreboard entries – limits how far ahead we can look
  – The number and types of functional units – contributes to structural stalls
  – The presence of antidependences and output dependences, which lead to WAR and WAW hazards

A more sophisticated approach: Tomasulo's Algorithm

• For IBM 360/91 (before caches!)
• Goal: High performance without special compilers
• Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
  – This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
• Why study a 1966 computer?
• The descendants of this have flourished!
  – Alpha 21264, HP PA-8000, MIPS R10000, Pentium III, PowerPC 604, …

Tomasulo Organization

[Figure: instructions arrive from memory into the FP Op Queue, next to the FP Registers; six load buffers (Load1 to Load6) are fed from memory and store buffers drain to memory; reservation stations Add1 to Add3 feed the FP adders and Mult1, Mult2 feed the FP multipliers; all results return over the Common Data Bus (CDB).]

Tomasulo Algorithm

• Control & buffers distributed with Function Units (FU)
  – FU buffers called "reservation stations"; have pending operands
• Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
  – avoids WAR, WAW hazards
  – More reservation stations than registers, so can do optimizations compilers can't
• Results go to FUs from RSs, not through registers, over the Common Data Bus that broadcasts results to all FUs
• Loads and Stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue

Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: Values of the source operands
  – Store buffers have a V field, the result to be stored
Qj, Qk: Reservation stations producing the source registers (value to be written)
  – Note: Qj, Qk = 0 => ready
  – Store buffers only have Qi for the RS producing the result
Busy: Indicates the reservation station or FU is busy

Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   If a reservation station is free (no structural hazard), control issues the instr & sends operands (renames registers).
2. Execute: operate on operands (EX)
   When both operands are ready then execute; if not ready, watch the Common Data Bus for the result
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available
• Normal data bus: data + destination ("go to" bus)
• Common data bus: data + source ("come from" bus)
  – 64 bits of data + 4 bits of Functional Unit source address
  – Write if matches expected Functional Unit (produces result)
  – Does the broadcast
• Example speed: 2 clocks for fl. pt. +, -; 10 for *; 40 clks for /

Tomasulo Example

Configuration: 3 load buffers, 3 FP adder R.S., 2 FP mult R.S. The Time column of the reservation stations is the FU count down; Clock is the clock cycle counter.

Instruction status (the instruction stream):
Instruction   j    k     Issue  Exec Comp  Write Result
LD     F6     34+  R2
LD     F2     45+  R3
MULTD  F0     F2   F4
SUBD   F8     F6   F2
DIVD   F10    F0   F6
ADDD   F6     F8   F2

Load buffers:
Name    Busy  Address
Load1   No
Load2   No
Load3   No

Reservation stations (Vj, Vk hold sources S1, S2; Qj, Qk name the RS producing them):
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock      F0  F2  F4  F6  F8  F10  F12  ...  F30
0      FU

Tomasulo Example Cycle 1

Instruction status:
Instruction   j    k     Issue  Exec Comp  Write Result
LD     F6     34+  R2    1
LD     F2     45+  R3
MULTD  F0     F2   F4
SUBD   F8     F6   F2
DIVD   F10    F0   F6
ADDD   F6     F8   F2

Load buffers:
Name    Busy  Address
Load1   Yes   34+R2
Load2   No
Load3   No

Reservation stations:
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock      F0  F2  F4  F6     F8  F10  F12  ...  F30
1      FU              Load1
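
The issue stage of Tomasulo's algorithm can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the 360/91 logic: the names Station, Tomasulo, issue and reg_status are hypothetical, register values are kept symbolic as R(Fx), and only the issue stage is modelled, so no instruction ever executes or writes a result. Because nothing completes, the tags captured after the first instruction differ from the cycle-by-cycle tables of the slide example, where the loads finish before later instructions issue.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Station:
    """One reservation station or load buffer (field names follow the slides)."""
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[str] = None       # value of source operand j, if already available
    vk: Optional[str] = None       # value of source operand k, if already available
    qj: Optional[str] = None       # station that will produce operand j (None = ready)
    qk: Optional[str] = None       # station that will produce operand k (None = ready)
    address: Optional[str] = None  # used by load buffers only


class Tomasulo:
    """Issue-only model: 3 load buffers, 3 FP adder R.S., 2 FP mult R.S."""

    # Assumed mapping from opcode to the kind of station it needs.
    KIND = {"LD": "load", "ADDD": "add", "SUBD": "add", "MULTD": "mult", "DIVD": "mult"}

    def __init__(self) -> None:
        self.stations: Dict[str, List[Station]] = {
            "load": [Station(f"Load{i}") for i in (1, 2, 3)],
            "add": [Station(f"Add{i}") for i in (1, 2, 3)],
            "mult": [Station(f"Mult{i}") for i in (1, 2)],
        }
        # Register result status: register -> station that will write it.
        self.reg_status: Dict[str, str] = {}

    def _operand(self, reg: str) -> Tuple[Optional[str], Optional[str]]:
        """Return (value, producer); exactly one of the two is set."""
        producer = self.reg_status.get(reg)
        if producer is not None:
            return None, producer      # wait for this station's broadcast on the CDB
        return f"R({reg})", None       # symbolic stand-in for the register's value

    def issue(self, op: str, dest: str, src1: str, src2: str) -> Optional[str]:
        """Issue one instruction; return the station name, or None on a stall."""
        kind = self.KIND[op]
        free = next((s for s in self.stations[kind] if not s.busy), None)
        if free is None:
            return None                # structural hazard: no free station
        free.busy, free.op = True, op
        if kind == "load":
            free.address = f"{src1}+{src2}"
        else:
            free.vj, free.qj = self._operand(src1)
            free.vk, free.qk = self._operand(src2)
        self.reg_status[dest] = free.name  # rename: dest now points at this station
        return free.name


program = [
    ("LD", "F6", "34", "R2"),
    ("LD", "F2", "45", "R3"),
    ("MULTD", "F0", "F2", "F4"),
    ("SUBD", "F8", "F6", "F2"),
    ("DIVD", "F10", "F0", "F6"),
    ("ADDD", "F6", "F8", "F2"),
]
sim = Tomasulo()
issued = [sim.issue(*instr) for instr in program]
```

In this sketch DIVD captures the tag Load1 for its F6 operand at issue, so the later ADDD can rename F6 to Add2 without waiting: the WAR hazard on F6 disappears because DIVD no longer reads the register file for that operand.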
