CS252 Graduate Computer Architecture, Lecture 6: Tomasulo, Implicit Register Renaming, Loop-Level Parallelism Extraction, Explicit Register Renaming

John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
2/9/2009, CS252-S09, Lecture 6

Recall: Software Pipelining Example

Before: Unrolled 3 times

    LD    F0,0(R1)
    ADDD  F4,F0,F2
    SD    0(R1),F4
    LD    F6,-8(R1)
    ADDD  F8,F6,F2
    SD    -8(R1),F8
    LD    F10,-16(R1)
    ADDD  F12,F10,F2
    SD    -16(R1),F12
    SUBI  R1,R1,#24
    BNEZ  R1,LOOP

After: Software Pipelined

    SD    0(R1),F4     ; Stores M[i]
    ADDD  F4,F0,F2     ; Adds to M[i-1]
    LD    F0,-16(R1)   ; Loads M[i-2]
    SUBI  R1,R1,#8
    BNEZ  R1,LOOP

5 cycles per iteration

[Figure: overlapped ops vs. time, software pipeline compared with unrolled loop]

• Symbolic Loop Unrolling
  – Maximize result-use distance
  – Less code space than unrolling
  – Fill & drain pipe only once per loop vs. once per each unrolled iteration in loop unrolling

Review: Scoreboard (CDC 6600)

[Figure: scoreboard organization. Labels: Registers, Functional Units (FP Mult, FP Mult, FP Divide, FP Add, Integer), SCOREBOARD, Memory]

Review: Scoreboard Implications

• Scoreboard keeps track of dependencies between instructions that have already issued.
  – Scoreboard replaces ID, EX, WB with 4 stages
• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR:
  – Stall writeback until registers have been read
  – Read registers only during Read Operands stage
  – Greatly limits overlap of independent computations
• Solution for WAW:
  – Detect hazard and stall issue of new instruction until other instruction completes
  – Prevents overlapping of loop iterations!
• No register renaming!
  – We will fix this today
• Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units

Review: Scoreboard Example: Cycle 62

Instruction status:
Instruction j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6   34+  R2   1      2          3          4
LD     F2   45+  R3   5      6          7          8
MULTD  F0   F2   F4   6      9          19         20
SUBD   F8   F6   F2   7      9          11         12
DIVD   F10  F0   F6   8      21         61         62
ADDD   F6   F8   F2   13     14         16         22

Functional unit status:
                          dest  S1   S2   FU   FU   Fj?  Fk?
Time  Name     Busy  Op   Fi    Fj   Fk   Qj   Qk   Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status:
Clock        F0     F2     F4     F6     F8     F10    F12    ...  F30
62     FU

• In-order issue; out-of-order execute & commit

Another Dynamic Algorithm: Tomasulo Algorithm

• For IBM 360/91 about 3 years after CDC 6600 (1966)
• Goal: High performance without special compilers
• Differences between IBM 360 & CDC 6600 ISA
  – IBM has only 2 register specifiers/instr vs. 3 in CDC 6600
  – IBM has 4 FP registers vs. 8 in CDC 6600
  – IBM has memory-register ops
• Why study? Led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …

Tomasulo Organization

[Figure: Tomasulo organization. Labels: FP Op Queue, FP Registers, From Mem, Load Buffers (Load1 to Load6), Store Buffers, To Mem, Reservation Stations (Add1, Add2, Add3, Mult1, Mult2), FP adders, FP multipliers, Common Data Bus (CDB)]

Tomasulo Algorithm vs. Scoreboard

• Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard
  – FU buffers called "reservation stations"; have pending operands
• Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
  – Avoids WAR, WAW hazards
  – More reservation stations than registers, so can do optimizations compilers can't
• Results go to FUs from RSs, not through registers, over the Common Data Bus, which broadcasts results to all FUs
• Loads and stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue

Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: Values of source operands
  – Store buffers have a V field: the result to be stored
Qj, Qk: Reservation stations producing source registers (value to be written)
  – Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready
  – Store buffers only have Qi for the RS producing the result
Busy: Indicates reservation station or FU is busy

Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   If a reservation station is free (no structural hazard), control issues the instr & sends operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark reservation station available.

• Normal data bus: data + destination ("go to" bus)
• Common data bus: data + source ("come from" bus)
  – 64 bits of data + 4 bits of Functional Unit source address
  – Write if matches expected Functional Unit (produces result)
  – Does the broadcast

Tomasulo Example

Instruction status:
Instruction j    k    Issue  Exec Comp  Write Result
LD     F6   34+  R2
LD     F2   45+  R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock        F0     F2     F4     F6     F8     F10    F12    ...  F30
0      FU

Tomasulo Example Cycle 1

Instruction status:
Instruction j    k    Issue  Exec Comp  Write Result
LD     F6   34+  R2   1
LD     F2   45+  R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock        F0     F2     F4     F6     F8     F10    F12    ...  F30
1      FU                         Load1

Tomasulo Example Cycle 2

Instruction status:
Instruction j    k    Issue  Exec Comp  Write Result
LD     F6   34+  R2   1
LD     F2   45+  R3   2
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock        F0     F2     F4     F6     F8     F10    F12    ...  F30
2      FU           Load2         Load1

• Note: Unlike 6600, can have multiple loads outstanding

Tomasulo Example Cycle 3

Instruction status:
Instruction j    k    Issue  Exec Comp  Write Result
LD     F6   34+  R2   1      3
LD     F2   45+  R3   2
MULTD  F0   F2   F4   3
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F4)  Load2
      Mult2  No

Register result status:
Clock        F0     F2     F4     F6     F8     F10    F12    ...  F30
3      FU    Mult1  Load2         Load1

• Note: register names are removed ("renamed") in Reservation Stations; MULTD has already issued (vs. cycle 6 with the scoreboard)
• Load1 completing; what is waiting for Load1?

Tomasulo Example Cycle 4

Instruction status:
Instruction j    k    Issue  Exec Comp  Write Result
LD     F6   34+  R2   1      3          4
LD     F2   45+  R3   2      4
MULTD  F0   F2   F4   3
SUBD   F8   F6   F2   4
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  Yes   45+R3
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   Yes   SUBD   M(A1)                Load2
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F4)  Load2
      Mult2  No

Register result status:
Clock        F0     F2     F4     F6     F8     F10    F12    ...  F30
4      FU    Mult1  Load2         M(A1)  Add1

• Load2 completing; what is waiting for Load2?

Tomasulo Example Cycle 5

Instruction status:
Instruction j    k    Issue  Exec Comp  Write Result
LD     F6   34+  R2   1      3          4
LD     F2   45+  R3   2      4          5
MULTD  F0   F2   F4   3
SUBD   F8   F6   F2   4
DIVD   F10  F0   F6   5
ADDD   F6   F8   F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
2     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   No
      Add3   No
10    Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock        F0     F2     F4     F6     F8     F10    F12    ...  F30
5      FU    Mult1  M(A2)         M(A1)  Add1   Mult2
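The step from cycle 4 to cycle 5 answers the question "what is waiting for Load2?". The following is a minimal Python sketch of that one Common Data Bus broadcast, not the lecture's own code. The names `Station` and `broadcast` are hypothetical, an empty Q field is modeled as `None` rather than 0, and the register result status is simplified to a dictionary that holds either a producing tag or a value label, mirroring how the tables above mix the two.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Station:
    """One reservation station entry (hypothetical model of the Op/V/Q fields)."""
    name: str
    op: str
    vj: Optional[str] = None  # value of operand j, once captured
    vk: Optional[str] = None  # value of operand k, once captured
    qj: Optional[str] = None  # tag of the station that will produce operand j
    qk: Optional[str] = None  # tag of the station that will produce operand k

    def ready(self) -> bool:
        # No separate ready flags: an empty Q field means the value is present.
        return self.qj is None and self.qk is None


def broadcast(tag, value, stations, reg_status):
    """Put (source tag, value) on the CDB; every waiting consumer captures it."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, src in list(reg_status.items()):
        if src == tag:
            reg_status[reg] = value


# State at the end of cycle 4 in the example.
add1 = Station("Add1", "SUBD", vj="M(A1)", qk="Load2")
mult1 = Station("Mult1", "MULTD", vk="R(F4)", qj="Load2")
stations = [add1, mult1]
reg_status = {"F0": "Mult1", "F2": "Load2", "F6": "M(A1)", "F8": "Add1"}

# Cycle 5: Load2 writes its result on the CDB.
broadcast("Load2", "M(A2)", stations, reg_status)

for rs in stations:
    print(rs.name, rs.op, rs.vj, rs.vk, "ready" if rs.ready() else "waiting")
print(reg_status)
```

Because the bus carries the source tag rather than a destination, one broadcast updates Add1, Mult1 and the F2 status entry at once, which is why both timers start counting down in cycle 5.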
Tomasulo Example Cycle 6

Instruction status:
Instruction j    k    Issue  Exec Comp  Write Result
LD     F6   34+  R2   1      3          4
LD     F2   45+  R3   2      4          5
MULTD  F0   F2   F4   3
SUBD   F8   F6   F2   4
DIVD   F10  F0   F6   5
ADDD   F6   F8   F2   6

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
1     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   Yes   ADDD          M(A2)  Add1
      Add3   No
9     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Tomasulo Example Cycle 7

Instruction status:
Instruction j    k    Issue  Exec Comp  Write Result
LD     F6   34+  R2   1      3          4
LD     F2   45+  R3   2      4          5
MULTD  F0   F2   F4   3
SUBD   F8   F6   F2   4      7
DIVD   F10  F0   F6   5
ADDD   F6   F8   F2   6

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   Yes   ADDD          M(A2)  Add1
      Add3   No
8     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock        F0     F2     F4     F6     F8     F10    F12    ..
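The timing pattern in the cycle tables can be approximated with a short model. This is a sketch under stated assumptions, not the lecture's algorithm in full. It issues one instruction per cycle in order, assumes a reservation station and the Common Data Bus are always free, and starts an operation in the cycle its last operand is written. The load, add/subtract and multiply latencies are read off the timers in the tables; the divide latency of 40 is an assumption, because the excerpt ends before DIVD starts. The names `Instr`, `LATENCY` and `schedule` are hypothetical.

```python
from typing import NamedTuple

# Execution cycles per operation. LD, ADDD/SUBD and MULTD follow the timers
# shown in the example; DIVD = 40 is an assumption (not visible in the excerpt).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple = ()  # FP source registers; empty for loads


def schedule(program):
    """Return (issue, exec_complete, write_result) cycles for each instruction."""
    producer = {}  # register result status: register -> index of latest writer
    write_at = []  # write-result cycle of each instruction issued so far
    rows = []
    for i, ins in enumerate(program):
        issue = i + 1  # in-order issue, one per cycle (no structural stalls)
        # Renaming: each source is bound at issue to its current producer,
        # so later writers of the same register cannot disturb it.
        waits = [write_at[producer[r]] for r in ins.srcs if r in producer]
        start = max([issue] + waits)  # cycle in which the last operand arrives
        done = start + LATENCY[ins.op]
        write = done + 1
        producer[ins.dest] = i  # updated only after the sources were looked up
        write_at.append(write)
        rows.append((issue, done, write))
    return rows


PROGRAM = [
    Instr("LD", "F6"),
    Instr("LD", "F2"),
    Instr("MULTD", "F0", ("F2", "F4")),
    Instr("SUBD", "F8", ("F6", "F2")),
    Instr("DIVD", "F10", ("F0", "F6")),
    Instr("ADDD", "F6", ("F8", "F2")),
]

rows = schedule(PROGRAM)
for ins, (issue, done, write) in zip(PROGRAM, rows):
    print(f"{ins.op:<6}{ins.dest:<4} issue={issue:<3} exec_comp={done:<3} write={write}")
```

The two loads and SUBD come out with the same issue and completion cycles as the tables through cycle 7. Under the assumed divide latency, ADDD writes F6 well before DIVD finishes, which is safe only because DIVD bound its F6 operand to M(A1) at issue: the WAR hazard that forced the scoreboard to stall is gone.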
