ECE4750/CS4420 Computer Architecture L13: Wide Issue Processors
Edward Suh Computer Systems Laboratory [email protected]
Announcements
ECE4750/CS4420 — Computer Architecture, Fall 2008 2
Overview
. Review: techniques to run a single program fast
  • Control hazards → branch prediction & speculative execution
  • Data hazards → dynamic scheduling & register renaming
. Execute more than one instruction per cycle
  • Motivation
  • HW-based approach
    – Simple in-order version
    – Dynamically scheduled version
  • SW-based approach
. Reading: Chapter 2.7, 2.8
Limit of a Single-Issue Processor
. What is the CPI if all techniques described so far work perfectly?
. How can we do better?
. Possible approaches?
Simple MIPS Superscalar
. Dual-issue superscalar – two datapaths
  • int/branch/mem (including FP mem)
  • FP
IFint IDint EXint MEMint WBint
IFfp IDfp EXfp EXfp EXfp WBfp
IFint IDint EXint MEMint WBint
IFfp IDfp EXfp EXfp EXfp WBfp
IFint IDint EXint MEMint WBint
IFfp IDfp EXfp EXfp EXfp WBfp
IFint IDint EXint MEMint WBint
IFfp IDfp EXfp EXfp EXfp WBfp
Example Code
L: fld  f0,0($1)
   fadd f4,f0,f2
   fst  f4,0($1)
   addi $1,$1,-8
   bne  $1,$2,L
. Read an array and increment by f2
Simple MIPS Superscalar
. 17 instructions in 12 cycles, or IPC = 1.42
CLK  Int                FP
 1   fld f0,0($1)
 2   fld f6,-8($1)
 3   fld f10,-16($1)    fadd f4,f0,f2
 4   fld f14,-24($1)    fadd f8,f6,f2
 5   fld f18,-32($1)    fadd f12,f10,f2
 6   fst f4,0($1)       fadd f16,f14,f2
 7   fst f8,-8($1)      fadd f20,f18,f2
 8   fst f12,-16($1)
 9   fst f16,-24($1)
10   addi $1,$1,-40
11   bne $1,$2,L
12   fst f20,-32($1)
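The IPC figure can be checked directly from the slot table. The sketch below is only an illustration: the names `SCHEDULE` and `issue_stats` are made up here, and the table is transcribed by hand from the slide.

```python
# Dual-issue schedule transcribed from the table above.
# Each entry is (cycle, int/branch/mem slot, FP slot); None marks an empty slot.
SCHEDULE = [
    (1,  "fld f0,0($1)",    None),
    (2,  "fld f6,-8($1)",   None),
    (3,  "fld f10,-16($1)", "fadd f4,f0,f2"),
    (4,  "fld f14,-24($1)", "fadd f8,f6,f2"),
    (5,  "fld f18,-32($1)", "fadd f12,f10,f2"),
    (6,  "fst f4,0($1)",    "fadd f16,f14,f2"),
    (7,  "fst f8,-8($1)",   "fadd f20,f18,f2"),
    (8,  "fst f12,-16($1)", None),
    (9,  "fst f16,-24($1)", None),
    (10, "addi $1,$1,-40",  None),
    (11, "bne $1,$2,L",     None),
    (12, "fst f20,-32($1)", None),
]

def issue_stats(schedule):
    """Return (instructions issued, cycles used, IPC) for a slot table."""
    issued = sum(1 for _, int_op, fp_op in schedule
                 for op in (int_op, fp_op) if op)
    cycles = max(cycle for cycle, _, _ in schedule)
    return issued, cycles, issued / cycles

issued, cycles, ipc = issue_stats(SCHEDULE)
print(f"{issued} instructions in {cycles} cycles, IPC = {ipc:.2f}")
```

The FP slot is filled in only 5 of the 12 cycles, which is why the IPC stays well below the peak of 2.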
Dynamic Superscalar Processors
. Today’s processors – out-of-order speculative superscalar
. Example: Tomasulo’s algorithm with dual-issue
. Assumptions:
  • arbitrary issue mix (but correct dependence resolution)
  • perfect branch prediction, no delay slot, no speculative execution
  • two CDBs
  • latencies: 1 ALU = 1, 1 LD/ST = 2 (EX+MEM), 1 FPU = 3
Two-Iteration Schedule
Iteration  Instruction      ID  EX  MEM  WB  Hazards
1          fld f0,0($1)      1   2    3   4
1          fadd f4,f0,f2     1   5        8  FLD1
1          sd f4,0($1)
1          addi $1,$1,-8
1          bne $1,$2,L
2          fld f0,0($1)
2          fadd f4,f0,f2
2          sd f4,0($1)
2          addi $1,$1,-8
2          bne $1,$2,L
Two-Iteration Schedule
Iteration  Instruction      ID  EX  MEM  WB  Hazards
1          fld f0,0($1)      1   2    3   4
1          fadd f4,f0,f2     1   5        8  FLD1
1          sd f4,0($1)       2   3    9      FADD1
1          addi $1,$1,-8     2   4        5  ALU
1          bne $1,$2,L       3   6           ADDI1
2          fld f0,0($1)      4   7    8   9  BNE1
2          fadd f4,f0,f2     4  10       13  FLD2
2          sd f4,0($1)       5   8   14      BNE1, ALU, FADD2
2          addi $1,$1,-8     5   9       10  BNE1, ALU
2          bne $1,$2,L       6  11           ADDI2
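One way to read the completed table is to ask how many cycles each iteration costs. The sketch below assumes the stage numbers fall into the ID/EX/MEM/WB columns as transcribed here (fadd and addi skip MEM, sd skips WB, bne uses only ID and EX); `ROWS`, `finish_cycle`, and `cycles_per_iteration` are illustrative names. With only two iterations shown, treating the difference as a sustained rate is an extrapolation.

```python
# Rows transcribed from the table above:
# (iteration, instruction, ID, EX, MEM, WB); None means the stage is not used.
ROWS = [
    (1, "fld",  1, 2,  3,    4),
    (1, "fadd", 1, 5,  None, 8),
    (1, "sd",   2, 3,  9,    None),
    (1, "addi", 2, 4,  None, 5),
    (1, "bne",  3, 6,  None, None),
    (2, "fld",  4, 7,  8,    9),
    (2, "fadd", 4, 10, None, 13),
    (2, "sd",   5, 8,  14,   None),
    (2, "addi", 5, 9,  None, 10),
    (2, "bne",  6, 11, None, None),
]

def finish_cycle(rows, iteration):
    """Last cycle in which any instruction of the given iteration is active."""
    return max(c for r in rows if r[0] == iteration
               for c in r[2:] if c is not None)

def cycles_per_iteration(rows):
    """Distance between the finish cycles of iteration 1 and iteration 2."""
    return finish_cycle(rows, 2) - finish_cycle(rows, 1)

per_iter = cycles_per_iteration(ROWS)
instrs_per_iter = sum(1 for r in ROWS if r[0] == 1)
print(per_iter, instrs_per_iter / per_iter)
```

Under this reading, the second iteration finishes 5 cycles after the first, so the 5-instruction loop body runs at about one instruction per cycle even with dual issue. The BNE1 entries in the Hazards column show why: without speculative execution, iteration 2 cannot start executing until the branch of iteration 1 has executed.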
Challenges in Superscalar
Superscalar Instruction Fetch
Superscalar Register Renaming
• During decode, each instruction is allocated a new physical destination register
• Source operands are renamed to the physical register holding the newest value
• Execution units see only physical register numbers
[Figure: two-instruction rename datapath. Inputs: Inst 1 and Inst 2, each with Op, Dest, Src1, Src2. Labels: Read Addresses, Read Data, Rename Table, Write Ports, Update Mapping, Register Free List. Outputs: Op, PDest, PSrc1, PSrc2 for each instruction.]
Does this work?
Renaming Example
. Assume 2-way superscalar (rename 2 instructions per cycle)
ld  r1, (r3)                    ld
add r3, r1, #4                  add
sub r6, r7, r9                  sub
add r3, r3, r6     Rename →     add
ld  r6, (r1)                    ld
add r6, r6, r3                  add
st  r6, (r1)                    st
ld  r6, (r11)                   ld
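The exercise can be worked through with a small sequential renamer. This is a minimal sketch under assumptions the slide does not state: architectural register rN starts out mapped to physical register pN, the free list hands out p32, p33, ... in order, and registers are never freed. The names `rename` and `PROGRAM` are illustrative.

```python
def rename(instructions, num_arch_regs=32):
    """Rename (op, dest, sources) tuples one at a time, in program order.

    Assumptions for illustration only: rN initially maps to pN, and the
    free list supplies p32, p33, ... without ever reclaiming registers.
    """
    table = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_free = num_arch_regs
    renamed = []
    for op, dest, sources in instructions:
        # Sources read the newest mapping before the destination is updated.
        psrcs = tuple(table[s] for s in sources)
        pdest = None
        if dest is not None:
            pdest = f"p{next_free}"
            next_free += 1
            table[dest] = pdest
        renamed.append((op, pdest, psrcs))
    return renamed

PROGRAM = [
    ("ld",  "r1", ("r3",)),        # ld  r1, (r3)
    ("add", "r3", ("r1",)),        # add r3, r1, #4  (the immediate is not renamed)
    ("sub", "r6", ("r7", "r9")),   # sub r6, r7, r9
    ("add", "r3", ("r3", "r6")),   # add r3, r3, r6
    ("ld",  "r6", ("r1",)),        # ld  r6, (r1)
    ("add", "r6", ("r6", "r3")),   # add r6, r6, r3
    ("st",  None, ("r6", "r1")),   # st  r6, (r1)    (no destination register)
    ("ld",  "r6", ("r11",)),       # ld  r6, (r11)
]

for op, pdest, psrcs in rename(PROGRAM):
    print(op, pdest, psrcs)
```

This version renames one instruction at a time, so it shows the intended result but not the 2-way hardware, where both instructions of a pair look up the table in the same cycle.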
Superscalar Register Renaming
[Figure: two-instruction rename datapath with "=?" comparators. Inputs: Inst 1 and Inst 2, each with Op, Dest, Src1, Src2. Labels: Read Addresses, Read Data, Rename Table, Write Ports, Update Mapping, Register Free List. Outputs: Op, PDest, PSrc1, PSrc2 for each instruction.]
Must check for RAW hazards between instructions issuing in the same cycle. Can be done in parallel with rename lookup.
(MIPS R10K renames 4 serially-RAW-dependent insts/cycle)
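The same-cycle RAW check can be sketched in software as follows. This is an illustration under stated assumptions, not a description of any real rename stage: `rename_group`, the identity initial mapping (rN to pN), and the free-list contents are all hypothetical.

```python
def rename_group(group, table, free_list):
    """Rename one issue group of (op, dest, sources) tuples in a single step."""
    # Every table lookup sees the mapping as it was at the start of the cycle.
    snapshot = dict(table)
    # Allocate a physical destination for each instruction that writes a register.
    pdests = [free_list.pop(0) if dest is not None else None
              for _, dest, _ in group]
    renamed = []
    for i, (op, _, sources) in enumerate(group):
        psrcs = []
        for src in sources:
            preg = snapshot[src]
            # The "=?" comparators: if an earlier instruction in the same group
            # writes this source, use its newly allocated register instead.
            for j in range(i):
                if group[j][1] == src:
                    preg = pdests[j]
            psrcs.append(preg)
        renamed.append((op, pdests[i], tuple(psrcs)))
    # Write the new mappings in program order so the youngest writer wins.
    for (_, dest, _), preg in zip(group, pdests):
        if dest is not None:
            table[dest] = preg
    return renamed

table = {f"r{i}": f"p{i}" for i in range(32)}
free_list = [f"p{i}" for i in range(32, 64)]
pair = [("ld", "r1", ("r3",)), ("add", "r3", ("r1",))]  # ld r1,(r3); add r3,r1,#4
print(rename_group(pair, table, free_list))
```

Without the comparison loop, the add would read the stale mapping for r1 from the snapshot and miss the value produced by the ld in the same group.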
Superscalar Control Logic Scaling
[Figure: an issue group of width W shown against the previously issued instructions, each with lifetime L.]
Out-of-Order Control Complexity
[ R10000 SGI/MIPS Technologies Inc., 1995 ]
VLIW: Very Long Instruction Word
Int Op 1 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
One Integer Unit, Single Cycle Latency
Two Load/Store Units, Two Cycle Latency
Two Floating-Point Units, Three Cycle Latency
VLIW Compiler Responsibilities
VLIW Example
L: fld  f0,0($1)
   fadd f4,f0,f2
   fst  f4,0($1)
   addi $1,$1,-8
   bne  $1,$2,L
. Schedule the code for a VLIW processor whose packet holds:
  • 2 Mem (2 cycles)
  • 2 FP (3 cycles)
  • 1 (Int+Branch)
Trivial Solution
. Do nothing (only basic scheduling)
Mem 1           Mem 2   FP 1            FP 2   Int+Br
fld f0,0($1)
                        fadd f4,f0,f2
fst f4,0($1)                                   addi $1,$1,-8
                                               bne $1,$2,L
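To see what basic scheduling costs, the loop body can be packed in program order. The sketch below is one possible reading, with assumptions not taken from the slide: a result becomes usable the stated number of cycles after its packet issues, all operations in a packet read their registers before any of them writes, and no operation moves above an earlier one. `LATENCY`, `SLOTS`, `LOOP`, and `schedule_in_order` are illustrative names.

```python
LATENCY = {"mem": 2, "fp": 3, "int": 1}   # cycles until a result can be used
SLOTS = {"mem": 2, "fp": 2, "int": 1}     # operations of each class per packet

# (text, unit class, destination or None, source registers)
LOOP = [
    ("fld f0,0($1)",  "mem", "f0", ("$1",)),
    ("fadd f4,f0,f2", "fp",  "f4", ("f0", "f2")),
    ("fst f4,0($1)",  "mem", None, ("f4", "$1")),
    ("addi $1,$1,-8", "int", "$1", ("$1",)),
    ("bne $1,$2,L",   "int", None, ("$1", "$2")),
]

def schedule_in_order(ops):
    """Place ops in program order into the earliest legal packet (cycle)."""
    ready = {}   # register -> first cycle in which its new value can be read
    used = {}    # (cycle, unit class) -> slots already taken
    cycle = 1    # never place an op above an earlier one
    placed = []
    for text, unit, dest, sources in ops:
        # Wait until every source operand is available.
        cycle = max([cycle] + [ready.get(s, 1) for s in sources])
        # Wait for a free slot of the right class.
        while used.get((cycle, unit), 0) >= SLOTS[unit]:
            cycle += 1
        used[(cycle, unit)] = used.get((cycle, unit), 0) + 1
        if dest is not None:
            ready[dest] = cycle + LATENCY[unit]
        placed.append((cycle, text))
    return placed

for cycle, text in schedule_in_order(LOOP):
    print(cycle, text)
```

Under these assumptions the five operations spread over seven packets of five slots each, so most slots stay empty; that waste is the motivation for scheduling more aggressively than the trivial solution.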