Dynamic Instruction Scheduling

PD-MIRI Ramon Canal

Dynamic vs. Static Scheduling

• Data hazards in a program cause a pipeline to stall.
• With static scheduling, the compiler tries to reorder instructions at compile time to reduce pipeline stalls.
  – Uses less hardware
  – Can use more powerful algorithms
• With dynamic scheduling, the hardware tries to rearrange instructions at run time to reduce pipeline stalls.
  – Simpler compiler
  – Handles dependencies not known at compile time
  – Allows code compiled for a different machine to run efficiently

Out-Of-Order Execution

• In our previous model, all instructions executed in the order in which they appear in the program.
• This can lead to unnecessary stalls:

    DIVD F0, F2, F4
    ADDD F10, F0, F8
    SUBD F12, F8, F14

• SUBD stalls behind ADDD, which is waiting for F0 from DIVD, even though SUBD has no data dependency on either instruction.
• With out-of-order execution, SUBD is allowed to execute before ADDD.
  – This can lead to out-of-order completion, which can cause WAW and WAR hazards

Scoreboarding
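The three hazard classes named above (RAW, WAR, WAW) can be made concrete with a few lines of Python. This is only an illustrative sketch: the `Instr` record and the `hazards` helper are hypothetical names introduced here, not part of the slides, and the check compares register names only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: opcode, destination, source registers."""
    op: str
    dest: str
    srcs: tuple


def hazards(earlier, later):
    """Return the hazard types `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")   # later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")   # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")   # both write the same register
    return found


# The three instructions from the out-of-order example.
divd = Instr("DIVD", "F0", ("F2", "F4"))
addd = Instr("ADDD", "F10", ("F0", "F8"))
subd = Instr("SUBD", "F12", ("F8", "F14"))

print(hazards(divd, addd))  # ADDD must wait for F0
print(hazards(divd, subd))  # SUBD is independent of DIVD
print(hazards(addd, subd))  # SUBD is independent of ADDD
```

Because SUBD has no hazard with either earlier instruction, a dynamically scheduled pipeline is free to run it ahead of ADDD.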

• The scoreboard implements a centralized control scheme that
  – Detects all resource and data hazards
  – Allows instructions to execute out of order when there are no resource hazards or data dependencies
• First implemented in 1964 by the CDC 6600, which had 18 separate functional units
  – 4 FP units (2 multiply, 1 add, 1 divide)
  – 7 memory units (5 loads, 2 stores)
  – 7 integer units (add, shift, logical, compare, etc.)
• Our dynamic pipeline (much simpler)
  – 2 FP multiply (10 EX cycles)
  – 1 FP add (2 EX cycles)
  – 1 FP divide (40 EX cycles)
  – 1 integer unit (1 EX cycle)

Out-of-Order Execution

• Out-of-order execution divides the DR stage into:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands
• Scoreboards allow an instruction to execute whenever 1 and 2 hold, without waiting for prior instructions
• CDC 6600: in-order issue, out-of-order execution, out-of-order commit (also called completion)

Scoreboard Implications

• Out-of-order completion can lead to WAR and WAW hazards
• Solution for WAW
  – Detect WAW hazard before reading operands
  – Stall write until other instruction completes
• Solution for WAR
  – Detect WAR hazards before writing back to the register file and stall the write back
• This scoreboard does not take advantage of forwarding (i.e. bypasses), since it waits until both results are written back to the register file
• Scoreboard replaces DR, EX, WB with 4 stages

Four Stages of Scoreboard Control

• Decode+Issue (Issue)
  – decode instructions
  – check for structural and WAW hazards
  – stall until structural and WAW hazards are resolved
• Read operands (Read)
  – wait until no RAW hazards
  – then read operands
• Execution (EX)
  – operate on operands
  – may take multiple cycles
  – notify scoreboard when done
• Write result (WB)
  – finish execution
  – stall if WAR hazard

Three Parts of the Scoreboard

1. Instruction status: which of the 4 steps the instruction is in (Issue, Read, EX, or WB).

2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
   Busy: indicates whether the unit is busy or not
   Op: operation to perform in the unit (e.g., + or –)
   Fi: destination register
   Fj, Fk: source-register numbers
   Qj, Qk: functional units producing source registers Fj, Fk
   Rj, Rk: flags indicating when Fj, Fk are ready

3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

Scoreboarding Example Cycle 0
Instructions in the example:
    LD    F6, 34+R2
    LD    F2, 45+R3
    MULTD F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2
Instruction status (Issue, Read operands, Execution start--complete, Write result): nothing issued yet
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk): Integer, Mult1, Mult2, Add and Divide are all idle
Register result status (F0 ... F30): blank

Scoreboarding Example Cycle 1
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1
Functional unit status (busy units):
  Integer: Op Load, Fi F6, Fk R2, Rk Yes
Register result status: F6 Integer

Scoreboarding Example Cycle 2
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2
Functional unit status (busy units):
  Integer: Op Load, Fi F6, Fk R2, Rk Yes
Register result status: F6 Integer
Issue 2nd Load? (No: the Integer unit is still busy, a structural hazard.)

Scoreboarding Example Cycle 3
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--
Functional unit status (busy units):
  Integer: Op Load, Fi F6, Fk R2, Rk Yes
Register result status: F6 Integer
Issue 2nd Load? (Still no: the Integer unit is busy until the first load writes its result.)
Scoreboarding Example Cycle 4
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
Functional unit status (busy units):
  Integer: Op Load, Fi F6, Fk R2, Rk Yes
Register result status: F6 Integer

Scoreboarding Example Cycle 5
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5
Functional unit status (busy units):
  Integer: Op Load, Fi F2, Fk R3, Rk Yes
Register result status: F2 Integer

Scoreboarding Example Cycle 6
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6
  MULTD F0, F2, F4: 6
Functional unit status (busy units):
  Integer: Op Load, Fi F2, Fk R3, Rk Yes
  Mult1: Op Mult, Fi F0, Fj F2, Fk F4, Qj Integer, Rj No, Rk Yes
Register result status: F0 Mult1, F2 Integer

Scoreboarding Example Cycle 7
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7
  MULTD F0, F2, F4: 6
  SUBD F8, F6, F2: 7
Functional unit status (busy units):
  Integer: Op Load, Fi F2, Fk R3, Rk Yes
  Mult1: Op Mult, Fi F0, Fj F2, Fk F4, Qj Integer, Rj No, Rk Yes
  Add: Op Sub, Fi F8, Fj F6, Fk F2, Qk Integer, Rj Yes, Rk No
Register result status: F0 Mult1, F2 Integer, F8 Add

Scoreboarding Example Cycle 8
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6
  SUBD F8, F6, F2: 7
  DIVD F10, F0, F6: 8
Functional unit status (busy units):
  Integer: Op Load, Fi F2, Fk R3, Rk Yes
  Mult1: Op Mult, Fi F0, Fj F2, Fk F4, Qj Integer, Rj No, Rk Yes
  Add: Op Sub, Fi F8, Fj F6, Fk F2, Qk Integer, Rj Yes, Rk No
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F2 Integer, F8 Add, F10 Divide

Scoreboarding Example Cycle 9
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9
  SUBD F8, F6, F2: 7, 9
  DIVD F10, F0, F6: 8
Functional unit status (busy units):
  Mult1: Time 10, Op Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes
  Add: Time 2, Op Sub, Fi F8, Fj F6, Fk F2, Rj Yes, Rk Yes
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F8 Add, F10 Divide

Scoreboarding Example Cycle 10
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--
  SUBD F8, F6, F2: 7, 9, 10--
  DIVD F10, F0, F6: 8
Functional unit status (busy units):
  Mult1: Time 9, Op Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes
  Add: Time 1, Op Sub, Fi F8, Fj F6, Fk F2, Rj Yes, Rk Yes
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F8 Add, F10 Divide

Scoreboarding Example Cycle 11
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--
  SUBD F8, F6, F2: 7, 9, 10--11
  DIVD F10, F0, F6: 8
Functional unit status (busy units):
  Mult1: Time 8, Op Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes
  Add: Time 0, Op Sub, Fi F8, Fj F6, Fk F2, Rj Yes, Rk Yes
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F8 Add, F10 Divide

Scoreboarding Example Cycle 12
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8
Functional unit status (busy units):
  Mult1: Time 7, Op Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes
  Add: Time 0, Op Sub, Fi F8, Fj F6, Fk F2, Rj Yes, Rk Yes
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F8 Add, F10 Divide

Scoreboarding Example Cycle 13
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8
  ADDD F6, F8, F2: 13
Functional unit status (busy units):
  Mult1: Time 6, Op Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes
  Add: Op Add, Fi F6, Fj F8, Fk F2, Rj Yes, Rk Yes
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F6 Add, F10 Divide

Scoreboarding Example Cycle 14
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8
  ADDD F6, F8, F2: 13, 14
Functional unit status (busy units):
  Mult1: Time 5, Op Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes
  Add: Time 2, Op Add, Fi F6, Fj F8, Fk F2, Rj Yes, Rk Yes
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F6 Add, F10 Divide

Scoreboarding Example Cycle 15
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8
  ADDD F6, F8, F2: 13, 14, 15--
Functional unit status (busy units):
  Mult1: Time 4, Op Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes
  Add: Time 1, Op Add, Fi F6, Fj F8, Fk F2, Rj Yes, Rk Yes
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F6 Add, F10 Divide

Scoreboarding Example Cycle 16
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8
  ADDD F6, F8, F2: 13, 14, 15--16
Functional unit status (busy units):
  Mult1: Time 3, Op Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes
  Add: Time 0, Op Add, Fi F6, Fj F8, Fk F2, Rj Yes, Rk Yes
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F6 Add, F10 Divide

Scoreboarding Example Cycle 17
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8
  ADDD F6, F8, F2: 13, 14, 15--16
Functional unit status (busy units):
  Mult1: Time 2, Op Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes
  Add: Time 0, Op Add, Fi F6, Fj F8, Fk F2, Rj Yes, Rk Yes
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F6 Add, F10 Divide
Write result of ADDD? (No: DIVD has not yet read F6, a WAR hazard.)

Scoreboarding Example Cycle 18
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8
  ADDD F6, F8, F2: 13, 14, 15--16
Functional unit status (busy units):
  Mult1: Time 1, Op Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes
  Add: Time 0, Op Add, Fi F6, Fj F8, Fk F2, Rj Yes, Rk Yes
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F6 Add, F10 Divide

Scoreboarding Example Cycle 19
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--19
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8
  ADDD F6, F8, F2: 13, 14, 15--16
Functional unit status (busy units):
  Mult1: Time 0, Op Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes
  Add: Time 0, Op Add, Fi F6, Fj F8, Fk F2, Rj Yes, Rk Yes
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F6 Add, F10 Divide

Scoreboarding Example Cycle 20
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--19, 20
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8
  ADDD F6, F8, F2: 13, 14, 15--16
Functional unit status (busy units):
  Mult1: Time 0, Op Mult, Fi F0, Fj F2, Fk F4, Rj Yes, Rk Yes
  Add: Time 0, Op Add, Fi F6, Fj F8, Fk F2, Rj Yes, Rk Yes
  Divide: Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes
Register result status: F0 Mult1, F6 Add, F10 Divide

Scoreboarding Example Cycle 21
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--19, 20
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8, 21
  ADDD F6, F8, F2: 13, 14, 15--16
Functional unit status (busy units):
  Add: Time 0, Op Add, Fi F6, Fj F8, Fk F2, Rj Yes, Rk Yes
  Divide: Time 40, Op Div, Fi F10, Fj F0, Fk F6, Rj Yes, Rk Yes
Register result status: F6 Add, F10 Divide

Scoreboarding Example Cycle 22
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--19, 20
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8, 21, 22--
  ADDD F6, F8, F2: 13, 14, 15--16, 22
Functional unit status (busy units):
  Add: Time 0, Op Add, Fi F6, Fj F8, Fk F2, Rj Yes, Rk Yes
  Divide: Time 39, Op Div, Fi F10, Fj F0, Fk F6, Rj Yes, Rk Yes
Register result status: F6 Add, F10 Divide

Scoreboarding Example Cycle 23
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--19, 20
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8, 21, 22--
  ADDD F6, F8, F2: 13, 14, 15--16, 22
Functional unit status (busy units):
  Divide: Time 38, Op Div, Fi F10, Fj F0, Fk F6, Rj Yes, Rk Yes
Register result status: F10 Divide

Scoreboarding Example Cycle 61
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--19, 20
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8, 21, 22--61
  ADDD F6, F8, F2: 13, 14, 15--16, 22
Functional unit status (busy units):
  Divide: Time 0, Op Div, Fi F10, Fj F0, Fk F6, Rj Yes, Rk Yes
Register result status: F10 Divide

Scoreboarding Example Cycle 62
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--19, 20
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8, 21, 22--61, 62
  ADDD F6, F8, F2: 13, 14, 15--16, 22
Functional unit status (busy units):
  Divide: Time 0, Op Div, Fi F10, Fj F0, Fk F6, Rj Yes, Rk Yes
Register result status: F10 Divide

Scoreboarding Example Cycle 63
Instruction status (Issue, Read operands, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2, 3--3, 4
  LD F2, 45+R3: 5, 6, 7--7, 8
  MULTD F0, F2, F4: 6, 9, 10--19, 20
  SUBD F8, F6, F2: 7, 9, 10--11, 12
  DIVD F10, F0, F6: 8, 21, 22--61, 62
  ADDD F6, F8, F2: 13, 14, 15--16, 22
Functional unit status: all units idle
Register result status: blank

CDC 6600 Scoreboard Summary

• Speedup from scoreboard
  – 1.7 for FORTRAN programs
  – 2.5 for hand-coded assembly language programs
  – Effects of modern compilers?
• Hardware
  – Scoreboard hardware approximately the same as one FPU
  – Main cost was buses (4 times the normal amount)
  – Could be more severe for modern processors
• Limitations
  – No forwarding logic
  – Limited to instructions in the instruction window
  – Stalls for WAW hazards
  – Waits for WAR hazards before WB

Scoreboarding
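The stage conditions of the scoreboard (structural and WAW checks at issue, RAW check at read operands, WAR check at write result) can be sketched as three predicates over the functional unit status and register result status. This is a minimal sketch, not the CDC 6600 logic: `FUStatus`, `can_issue`, `can_read` and `can_write` are hypothetical names, and it assumes the textbook convention that Rj/Rk are reset to No once an instruction has read its operands, so a set flag means "ready but not yet read".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    """One row of the functional unit status table."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # FUs producing Fj, Fk (None if not pending)
    qk: Optional[str] = None
    rj: bool = False           # Fj, Fk ready and not yet read (assumed convention)
    rk: bool = False


def can_issue(fu_name, dest, fus, result):
    """Issue: the FU is free (no structural hazard) and nobody else will write dest (no WAW)."""
    return not fus[fu_name].busy and result.get(dest) is None


def can_read(fu_name, fus):
    """Read operands: both sources are available (no RAW hazard)."""
    fu = fus[fu_name]
    return fu.rj and fu.rk


def can_write(fu_name, fus):
    """Write result: no other busy FU still has to read our destination (no WAR hazard)."""
    fi = fus[fu_name].fi
    for name, other in fus.items():
        if name == fu_name or not other.busy:
            continue
        if (other.fj == fi and other.rj) or (other.fk == fi and other.rk):
            return False
    return True


# State corresponding to cycle 17 of the example: MULTD and ADDD have read
# their operands, DIVD is still waiting for F0 and has not read F6.
fus = {
    "Integer": FUStatus(),
    "Mult1": FUStatus(True, "Mult", "F0", "F2", "F4"),
    "Mult2": FUStatus(),
    "Add": FUStatus(True, "Add", "F6", "F8", "F2"),
    "Divide": FUStatus(True, "Div", "F10", "F0", "F6", qj="Mult1", rj=False, rk=True),
}
result = {"F0": "Mult1", "F6": "Add", "F10": "Divide"}

print(can_write("Add", fus))                    # ADDD must hold its result (WAR on F6)
print(can_read("Divide", fus))                  # DIVD still waits for F0 (RAW)
print(can_issue("Mult1", "F12", fus, result))   # Mult1 is busy (structural)
```

Once DIVD has read F6 (its Rk flag cleared under the assumed convention), `can_write("Add", fus)` becomes true, which corresponds to ADDD writing its result in cycle 22 of the example.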

[Figure: Scoreboard]

Tomasulo Algorithm for Dynamic Scheduling

• For the IBM 360/91 in 1967, about 3 years after the CDC 6600
• Goal: high performance without special compilers
• Differences between IBM 360 & CDC 6600
  – IBM has only 2 register specifiers/instr vs. 3 in CDC 6600
  – IBM has register-memory instructions
  – IBM has 4 FP registers vs. 8 in CDC 6600
  – IBM has pipelined functional units (3 adds, 2 multiplies)
• Tomasulo algorithm is designed to handle name dependencies (WAW and WAR hazards) efficiently

    SUB  F1, F2, F0
    DIVF F2, F3, F2
    ADDF F3, F0, F0
    MULF F3, F1, F1

Tomasulo Algorithm

• Differences from Scoreboarding
  – Distributed hazard detection and control (through reservation stations)
  – Results are bypassed to functional units
  – Common data bus (CDB) broadcasts results to all FUs
  – HW renaming of registers to avoid WAR, WAW hazards
  – Loads and stores treated as FUs as well
  – Registers in instructions replaced by pointers to reservation station buffers
• Led to concepts used in Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, PowerPC 604, …

Tomasulo Organization (FP + Load)

Reservation Station Components

Op: operation to perform in the unit (e.g., + or –)
Qj, Qk: reservation stations producing the source registers
Vj, Vk: values of the source operands
Rj, Rk: flags indicating when Vj, Vk are ready
Busy: indicates that the reservation station and FU are busy

Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   If a reservation station is free, issue the instruction & send operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result.
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.

Tomasulo Example Cycle 0
Instructions in the example:
    LD    F6, 34+R2
    LD    F2, 45+R3
    MULTD F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2
Instruction status (Issue, Execution start--complete, Write result): nothing issued yet
Load buffers (Busy, Address): Load1, Load2 and Load3 are free
Reservation stations (Time, Name, Busy, Op, Vj, Vk, Qj, Qk): Add1, Add2, Add3, Mult1 and Mult2 are free
Register result status (F0 ... F30): blank

Tomasulo Example Cycle 1
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1
Load buffers (busy): Load1 34+R2
Reservation stations (busy): none
Register result status: F6 Load1

Tomasulo Example Cycle 2
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--
  LD F2, 45+R3: 2
Load buffers (busy): Load1 34+R2, Load2 45+R3
Reservation stations (busy): none
Register result status: F2 Load2, F6 Load1

Tomasulo Example Cycle 3
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3
  LD F2, 45+R3: 2, 3--
  MULTD F0, F2, F4: 3
Load buffers (busy): Load1 34+R2, Load2 45+R3
Reservation stations (busy):
  Mult1: Op Mult, Vk F4, Qj Load2
Register result status: F0 Mult1, F2 Load2, F6 Load1

Tomasulo Example Cycle 4
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4
  MULTD F0, F2, F4: 3
  SUBD F8, F6, F2: 4
Load buffers (busy): Load2 45+R3
Reservation stations (busy):
  Add1: Op Sub, Vj F6, Qk Load2
  Mult1: Op Mult, Vk F4, Qj Load2
Register result status: F0 Mult1, F2 Load2, F8 Add1

Tomasulo Example Cycle 5
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4, 5
  MULTD F0, F2, F4: 3
  SUBD F8, F6, F2: 4
  DIVD F10, F0, F6: 5
Load buffers (busy): none
Reservation stations (busy):
  Add1: Time 2, Op Sub, Vj F6, Vk F2
  Mult1: Time 10, Op Mult, Vj F2, Vk F4
  Mult2: Op Div, Vk F6, Qj Mult1
Register result status: F0 Mult1, F8 Add1, F10 Mult2

Tomasulo Example Cycle 6
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4, 5
  MULTD F0, F2, F4: 3, 6--
  SUBD F8, F6, F2: 4, 6--
  DIVD F10, F0, F6: 5
  ADDD F6, F8, F2: 6
Load buffers (busy): none
Reservation stations (busy):
  Add1: Time 1, Op Sub, Vj F6, Vk F2
  Add2: Op Add, Vk F2, Qj Add1
  Mult1: Time 9, Op Mult, Vj F2, Vk F4
  Mult2: Op Div, Vk F6, Qj Mult1
Register result status: F0 Mult1, F6 Add2, F8 Add1, F10 Mult2

Tomasulo Example Cycle 7
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4, 5
  MULTD F0, F2, F4: 3, 6--
  SUBD F8, F6, F2: 4, 6--7
  DIVD F10, F0, F6: 5
  ADDD F6, F8, F2: 6
Load buffers (busy): none
Reservation stations (busy):
  Add1: Time 0, Op Sub, Vj F6, Vk F2
  Add2: Op Add, Vk F2, Qj Add1
  Mult1: Time 8, Op Mult, Vj F2, Vk F4
  Mult2: Op Div, Vk F6, Qj Mult1
Register result status: F0 Mult1, F6 Add2, F8 Add1, F10 Mult2

Tomasulo Example Cycle 8
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4, 5
  MULTD F0, F2, F4: 3, 6--
  SUBD F8, F6, F2: 4, 6--7, 8
  DIVD F10, F0, F6: 5
  ADDD F6, F8, F2: 6
Load buffers (busy): none
Reservation stations (busy):
  Add2: Time 2, Op Add, Vj F8, Vk F2
  Mult1: Time 7, Op Mult, Vj F2, Vk F4
  Mult2: Op Div, Vk F6, Qj Mult1
Register result status: F0 Mult1, F6 Add2, F10 Mult2

Tomasulo Example Cycle 9
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4, 5
  MULTD F0, F2, F4: 3, 6--
  SUBD F8, F6, F2: 4, 6--7, 8
  DIVD F10, F0, F6: 5
  ADDD F6, F8, F2: 6, 9--
Load buffers (busy): none
Reservation stations (busy):
  Add2: Time 1, Op Add, Vj F8, Vk F2
  Mult1: Time 6, Op Mult, Vj F2, Vk F4
  Mult2: Op Div, Vk F6, Qj Mult1
Register result status: F0 Mult1, F6 Add2, F10 Mult2

Tomasulo Example Cycle 10
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4, 5
  MULTD F0, F2, F4: 3, 6--
  SUBD F8, F6, F2: 4, 6--7, 8
  DIVD F10, F0, F6: 5
  ADDD F6, F8, F2: 6, 9--10
Load buffers (busy): none
Reservation stations (busy):
  Add2: Time 0, Op Add, Vj F8, Vk F2
  Mult1: Time 5, Op Mult, Vj F2, Vk F4
  Mult2: Op Div, Vk F6, Qj Mult1
Register result status: F0 Mult1, F6 Add2, F10 Mult2

Tomasulo Example Cycle 11
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4, 5
  MULTD F0, F2, F4: 3, 6--
  SUBD F8, F6, F2: 4, 6--7, 8
  DIVD F10, F0, F6: 5
  ADDD F6, F8, F2: 6, 9--10, 11
Load buffers (busy): none
Reservation stations (busy):
  Mult1: Time 4, Op Mult, Vj F2, Vk F4
  Mult2: Op Div, Vk F6, Qj Mult1
Register result status: F0 Mult1, F10 Mult2

Tomasulo Example Cycle 12
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4, 5
  MULTD F0, F2, F4: 3, 6--
  SUBD F8, F6, F2: 4, 6--7, 8
  DIVD F10, F0, F6: 5
  ADDD F6, F8, F2: 6, 9--10, 11
Load buffers (busy): none
Reservation stations (busy):
  Mult1: Time 3, Op Mult, Vj F2, Vk F4
  Mult2: Op Div, Vk F6, Qj Mult1
Register result status: F0 Mult1, F10 Mult2

Tomasulo Example Cycle 15
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4, 5
  MULTD F0, F2, F4: 3, 6--15
  SUBD F8, F6, F2: 4, 6--7, 8
  DIVD F10, F0, F6: 5
  ADDD F6, F8, F2: 6, 9--10, 11
Load buffers (busy): none
Reservation stations (busy):
  Mult1: Time 0, Op Mult, Vj F2, Vk F4
  Mult2: Op Div, Vk F6, Qj Mult1
Register result status: F0 Mult1, F10 Mult2

Tomasulo Example Cycle 16
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4, 5
  MULTD F0, F2, F4: 3, 6--15, 16
  SUBD F8, F6, F2: 4, 6--7, 8
  DIVD F10, F0, F6: 5
  ADDD F6, F8, F2: 6, 9--10, 11
Load buffers (busy): none
Reservation stations (busy):
  Mult2: Time 40, Op Div, Vj F0, Vk F6
Register result status: F10 Mult2

Tomasulo Example Cycle 56
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4, 5
  MULTD F0, F2, F4: 3, 6--15, 16
  SUBD F8, F6, F2: 4, 6--7, 8
  DIVD F10, F0, F6: 5, 17--56
  ADDD F6, F8, F2: 6, 9--10, 11
Load buffers (busy): none
Reservation stations (busy):
  Mult2: Time 0, Op Div, Vj F0, Vk F6
Register result status: F10 Mult2

Tomasulo Example Cycle 57
Instruction status (Issue, Execution start--complete, Write result):
  LD F6, 34+R2: 1, 2--3, 4
  LD F2, 45+R3: 2, 3--4, 5
  MULTD F0, F2, F4: 3, 6--15, 16
  SUBD F8, F6, F2: 4, 6--7, 8
  DIVD F10, F0, F6: 5, 17--56, 57
  ADDD F6, F8, F2: 6, 9--10, 11
Load buffers (busy): none
Reservation stations (busy): none
Register result status: blank

Tomasulo Summary
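As a compact recap of the mechanism traced in the example, the following sketch models reservation stations, renaming at issue, and the CDB broadcast at write result. It is an illustration under stated assumptions rather than the IBM 360/91 design: the `RS` and `Tomasulo` classes, their method names and the register values are hypothetical, and execution latency is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RS:
    """One reservation station: operand values (Vj, Vk) or producer tags (Qj, Qk)."""
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None
    qk: Optional[str] = None


class Tomasulo:
    def __init__(self, rs_names, regs):
        self.rs = {name: RS() for name in rs_names}
        self.regs = dict(regs)   # architectural register values
        self.status = {}         # register result status: register -> producing RS

    def _operand(self, reg):
        """Return (value, None) if the register is valid, else (None, producer tag)."""
        tag = self.status.get(reg)
        return (None, tag) if tag else (self.regs[reg], None)

    def issue(self, name, op, dest, src_j, src_k):
        """Issue into station `name`; sources are renamed to values or tags."""
        if self.rs[name].busy:
            return False         # structural stall: station occupied
        vj, qj = self._operand(src_j)
        vk, qk = self._operand(src_k)
        self.rs[name] = RS(True, op, vj, vk, qj, qk)
        self.status[dest] = name  # later readers of dest wait on this tag
        return True

    def ready(self, name):
        rs = self.rs[name]
        return rs.busy and rs.qj is None and rs.qk is None

    def broadcast(self, name, value):
        """Write result: put (tag, value) on the CDB and free the station."""
        for rs in self.rs.values():
            if rs.qj == name:
                rs.vj, rs.qj = value, None
            if rs.qk == name:
                rs.vk, rs.qk = value, None
        for reg, tag in list(self.status.items()):
            if tag == name:       # only the latest writer updates the register
                self.regs[reg] = value
                del self.status[reg]
        self.rs[name] = RS()


# Register values are made up for the illustration.
t = Tomasulo(["Add1", "Add2", "Add3", "Mult1", "Mult2"],
             {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0, "F10": 0.0})
t.issue("Mult1", "MULT", "F0", "F2", "F4")
t.issue("Mult2", "DIV", "F10", "F0", "F6")   # captures the old F6, waits on Mult1
t.issue("Add2", "ADD", "F6", "F8", "F2")
t.broadcast("Add2", 10.0)                    # ADDD writes F6 before DIVD executes
print(t.regs["F6"], t.rs["Mult2"].vk)        # new F6 in the register, old F6 in Mult2
t.broadcast("Mult1", 8.0)
print(t.ready("Mult2"))                      # DIVD now has both operands
```

Because DIVD copied the value of F6 into its reservation station at issue, the later ADDD can overwrite F6 without a WAR stall, which is the situation the scoreboard had to wait out.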

• Advantages
  – Prevents the registers from becoming the bottleneck
  – Eliminates WAR, WAW hazards
  – Allows loop unrolling in HW
• Common Data Bus
  – Broadcasts results to multiple instructions
  – Central bottleneck
• Lasting Contributions
  – Dynamic scheduling
  – Register renaming
  – Load/store disambiguation

Tomasulo Implementation

RS layout (1)

• How many RS exist for each FU type?
  – One single RS (centralized RS)
    » Pentium Pro (Pentium II, III)

[Figure: RS organization feeding five functional units]

RS layout (2)

  – One RS per FU
    » PowerPC 6xx

[Figure: RS organization feeding five functional units]

RS layout (3)

  – One RS per n FU
    » MIPS R10000, HP PA-8500, AMD Opteron, DEC/Compaq Alpha 21264, Intel Pentium IV, IBM Power 4

[Figure: RS organization feeding five functional units]

Register Renaming (1)

• What do we use to rename the registers (tags)?
  – Reservation station id
    » IBM 360/91
  – ROB entry id
    » Intel Pentium Pro, AMD K5, HP PA-8000
  – Future/architectural register file
    » IBM PowerPC 6xx
  – Merged future/architectural register file
    » MIPS R10000, DEC/Compaq Alpha, Intel Pentium IV (NetBurst), AMD Opteron

Register Renaming (2)

• Merged future/architectural register file
  – Two kinds of registers
    » Logical registers (the ones the compiler sees)
      • Typically R0, R1, ... , R31
    » Physical registers (the ones the processor has)
      • Typically R0, R1, ... , Rn (n>31, usually #Rlogic + #ROB entries)
  – Need a structure to know which physical register holds each logical value
    » Rename table
  – Need to know the free physical registers
    » Free list

[Figure: the free list answers "give me a free physical register" with a free physical reg; the rename table maps a logical reg to a physical reg]

Register Renaming (3)

• Merged future/architectural register file - STEPS
  – Decode
    » Keep a copy of the old mapping
      • Old_Dest_Physical_Reg = Rename Table [Dest_Logical_Reg]
    » Get a new physical reg for the destination register
      • Dest_Physical_Reg = Free-List()
      • Rename Table [Dest_Logical_Reg] = Dest_Physical_Reg
  – Writeback
    » In case of successful completion, free the old mapping
      • Free-List += Old_Dest_Physical_Reg
    » In case of interruption/exception
      • Restore the old mapping

Register Renaming (Example)
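The decode and writeback steps above can be sketched with a rename table and a free list. This is a minimal sketch, not any particular processor's implementation: the `Renamer` class and its method names are hypothetical, it assumes 32 logical registers F0..F31 initially mapped to RF0..RF31 and six spare physical registers RF32..RF37, and it treats the base registers of the loads as renamed sources, as the slides' sample code does.

```python
from collections import deque


class Renamer:
    def __init__(self, n_logical=32, n_physical=38):
        # Rename table: logical register -> physical register (identity at start).
        self.table = {f"F{i}": f"RF{i}" for i in range(n_logical)}
        # Free list: physical registers not holding any logical value.
        self.free = deque(f"RF{i}" for i in range(n_logical, n_physical))

    def rename(self, dest, srcs):
        """Decode step: returns (new physical dest, physical sources, old mapping)."""
        phys_srcs = [self.table[s] for s in srcs]  # read sources before remapping dest
        old = self.table[dest]                     # keep a copy of the old mapping
        new = self.free.popleft()                  # IndexError if empty; hardware would stall
        self.table[dest] = new
        return new, phys_srcs, old

    def commit(self, old):
        """Successful completion: the old physical register can be reused."""
        self.free.append(old)

    def squash(self, dest, new, old):
        """Exception: restore the old mapping (apply youngest instruction first)."""
        self.table[dest] = old
        self.free.appendleft(new)


program = [
    ("LD", "F6", ["F2"]),
    ("LD", "F2", ["F3"]),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]

r = Renamer()
renamed = [r.rename(dest, srcs) for _, dest, srcs in program]
for (op, _, _), (new, srcs, old) in zip(program, renamed):
    print(op, new, srcs, f"(Old: {old})")

# Commit the two loads in program order: their old registers return to the free list.
r.commit(renamed[0][2])
r.commit(renamed[1][2])
print(list(r.free))
```

Under these assumptions the third instruction comes out as MULTD RF34, RF33, RF4 with old mapping RF0, and committing the two loads returns RF6 and RF2 to the free list.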

Sample Code (shown on every slide of the example):
    LD    F6, 34(F2)
    LD    F2, 45(F3)
    MULTD F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2

Fetch, decode and rename are performed in order. The Reg. Map starts as the identity mapping (F0 -> RF0, F1 -> RF1, ..., F31 -> RF31); only the entries that differ from it are listed below. The free list is drawn as RF32, RF33, RF34, RF35, RF36, RF37, ... on every slide.

Cycle 0
  FETCH: LD F6, 34(F2)
  DECODE / Rename: (empty)
  ISSUE: (empty)
  Reg. Map: identity

Register Renaming (Example)

Cycle 1
  FETCH: LD F2, 45(F3)
  DECODE: LD F6, 34(F2)
  Rename: LD F6, 34(RF2) -> LD RF32, 34(RF2) (Old: RF6)
  ISSUE: (empty)
  Reg. Map: F6 -> RF32

Register Renaming (Example)

Cycle 2
  FETCH: MULTD F0, F2, F4
  DECODE: LD F2, 45(F3)
  Rename: LD F2, 45(RF3) -> LD RF33, 45(RF3) (Old: RF2)
  ISSUE: LD RF32, 34(RF2) (Old: RF6)
  Reg. Map: F2 -> RF33, F6 -> RF32

Register Renaming (Example)

Cycle 3
  FETCH: MULTD F0, F2, F4
  DECODE: MULTD F0, F2, F4
  Rename: MULTD F0, RF33, RF4 -> MULTD RF34, RF33, RF4 (Old: RF0)
  ISSUE: LD RF33, 45(RF3) (Old: RF2); LD RF32, 34(RF2) (Old: RF6)
  Reg. Map: F0 -> RF34, F2 -> RF33, F6 -> RF32

Register Renaming (Example)

Cycle N
  WB / COMMIT (in order, oldest to youngest): LD RF33, 45(RF3) (Old: RF2); LD RF32, 34(RF2) (Old: RF6)
  Reg. Map: F0 -> RF34, F2 -> RF33, F6 -> RF37, F8 -> RF35, F10 -> RF36

Register Renaming (Example)

Cycle N
  WB / COMMIT (in order, oldest to youngest): LD RF33, 45(RF3) (Old: RF2); LD RF32, 34(RF2) (Old: RF6)
  Reg. Map: F0 -> RF34, F2 -> RF33, F6 -> RF37, F8 -> RF35, F10 -> RF36
  Free list: RF6 and RF2 are added

Register Renaming (Example)

Cycle N
  WB / COMMIT: (empty)
  Reg. Map: F0 -> RF34, F2 -> RF33, F6 -> RF37, F8 -> RF35, F10 -> RF36
  Free list: includes RF6 and RF2

Issue Logic
• MIPS R10000

Kenneth C. Yeager, “The MIPS R10000 Superscalar Microprocessor”, IEEE Micro, Volume 16, Issue 2, April 1996, pages 28-41.

Issue Logic
• DEC/Compaq Alpha 21264

R.E. Kessler, E.J. McLellan, and D.A. Webb, “The Alpha 21264 Microprocessor Architecture,” Proc. 1998 IEEE Int’l Conf. Computer Design: VLSI in Computers and Processors, Oct. 1998, pp. 90–95.

Issue Logic
• Intel Pentium III

Issue Logic
• Intel Pentium IV

Glenn Hinton et al., “The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal, Q1 2001.

Issue Logic
• Intel Core 2

Lynnfield (2009)

Sandy Bridge (2011)

Issue Logic [1]

[1] Cache size and associativity, ROB size and RS number vary across generations

Issue Logic
• AMD Athlon

Issue Logic
• AMD Opteron

[Figure: AMD Opteron block diagram. Recoverable labels: Level 1 instruction cache and instruction TLB; 2k branch targets, 16k history counter, RAS & target address; Fetch 2 - transit, Pick, Decode 1, Decode 2, Pack and Decode stages (three parallel lanes); three 8-entry schedulers, each feeding an AGU and an ALU; one 36-entry scheduler feeding FADD, FMUL and FMISC; Level 1 data cache with ECC and data TLB; Level 2 cache with L2 tags, L2 ECC and L2 tag ECC; System Request Queue (SRQ); Cross Bar (XBAR); memory controller & HyperTransport™ (“Northbridge”).]

Issue Logic
• AMD Phenom II (bulldozer core)

AMD Bulldozer

Issue Logic
ARM Cortex-A15 MPCore
• Samsung Galaxy SIII, tablets?, servers?