Multiple Issue

ECE4750/CS4420 Computer Architecture, Fall 2008
L13: Wide Issue Processors
Edward Suh
Computer Systems Laboratory

Announcements

Overview
• Review: techniques to run a single program fast
  – Control hazards: branch prediction & speculative execution
  – Data hazards: dynamic scheduling & register renaming
• Execute more than one instruction per cycle
  – Motivation
  – HW-based approach
    – Simple in-order version
    – Dynamically scheduled version
  – SW-based approach
• Reading: Chapter 2.7, 2.8

Limit of a Single-Issue Processor
• What is the CPI if all techniques described so far work perfectly?
• How can we do better?
• Possible approaches?

Simple MIPS Superscalar
• Dual-issue superscalar: two datapaths
  – int/branch/mem (including FP mem)
  – FP

  IFint  IDint  EXint  MEMint WBint
  IFfp   IDfp   EXfp   EXfp   EXfp   WBfp
         IFint  IDint  EXint  MEMint WBint
         IFfp   IDfp   EXfp   EXfp   EXfp   WBfp
                IFint  IDint  EXint  MEMint WBint
                IFfp   IDfp   EXfp   EXfp   EXfp   WBfp
                       IFint  IDint  EXint  MEMint WBint
                       IFfp   IDfp   EXfp   EXfp   EXfp   WBfp

Simple MIPS Superscalar

Example Code

  L: fld  f0,0($1)
     fadd f4,f0,f2
     fst  f4,0($1)
     addi $1,$1,-8
     bne  $1,$2,L

• Reads an array and increments each element by f2

Simple MIPS Superscalar
• 17 instructions in 12 cycles, or IPC = 17/12 = 1.42

  CLK  Int                 FP
  1    fld  f0,0($1)
  2    fld  f6,-8($1)
  3    fld  f10,-16($1)    fadd f4,f0,f2
  4    fld  f14,-24($1)    fadd f8,f6,f2
  5    fld  f18,-32($1)    fadd f12,f10,f2
  6    fst  f4,0($1)       fadd f16,f14,f2
  7    fst  f8,-8($1)      fadd f20,f18,f2
  8    fst  f12,-16($1)
  9    fst  f16,-24($1)
  10   addi $1,$1,-40
  11   bne  $1,$2,L
  12   fst  f20,8($1)

• The last fst uses offset 8 rather than -32 because it executes after addi has already decremented $1 by 40

Dynamic Superscalar Processors
• Today's processors: out-of-order speculative superscalar
• Example: Tomasulo's algorithm with dual-issue
• Assumptions:
  – arbitrary issue mix (but correct dependence resolution)
  – perfect branch prediction, no delay slot, no speculative execution
  – two CDBs
  – functional units and latencies: one ALU (1 cycle), one LD/ST unit (2 cycles, EX+MEM), one FPU (3 cycles)

Two-Iteration Schedule

  Iter  Instruction      ID  EX  MEM  WB  Hazards
  1     fld  f0,0($1)     1   2    3   4
  1     fadd f4,f0,f2     1   5        8  FLD1
  1     fst  f4,0($1)     2   3    9      FADD1
  1     addi $1,$1,-8     2   4        5  ALU
  1     bne  $1,$2,L      3   6           ADDI1
  2     fld  f0,0($1)     4   7    8   9  BNE1
  2     fadd f4,f0,f2     4  10       13  FLD2
  2     fst  f4,0($1)     5   8   14      BNE1, ALU, FADD2
  2     addi $1,$1,-8     5   9       10  BNE1, ALU
  2     bne  $1,$2,L      6  11           ADDI2

Challenges in Superscalar

Superscalar Instruction Fetch

Superscalar Register Renaming
• During decode, each instruction is allocated a new physical destination register
• Source operands are renamed to the physical register holding the newest value
• Execution units see only physical register numbers
[Figure: Inst 1 and Inst 2 (Op, Dest, Src1, Src2) drive the read-address ports of the rename table; the free list feeds the write ports and updates the register mapping; the read data gives Op, PDest, PSrc1, PSrc2 for each instruction]
• Does this work?

Renaming Example
• Assume 2-way superscalar (rename 2 instructions per cycle)

  Before rename        After rename (operands to fill in)
  ld  r1, (r3)         ld
  add r3, r1, #4       add
  sub r6, r7, r9       sub
  add r3, r3, r6       add
  ld  r6, (r1)         ld
  add r6, r6, r3       add
  st  r6, (r1)         st
  ld  r6, (r11)        ld

Superscalar Register Renaming
[Figure: same rename datapath as before, with "=?" comparators added between Inst 1 and Inst 2]
• Must check for RAW hazards between instructions issuing in the same cycle
• The check can be done in parallel with the rename lookup
• (MIPS R10K renames 4 serially-RAW-dependent insts/cycle)

Superscalar Control Logic Scaling
[Figure labels: issue width W; issue group; previously issued instructions; lifetime L]

Out-of-Order Control Complexity
[Figure: R10000, SGI/MIPS Technologies Inc., 1995]

VLIW: Very Long Instruction Word

  Int Op 1 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2

• One integer unit, single-cycle latency
• Two load/store units, two-cycle latency
• Two floating-point units, three-cycle latency

VLIW Compiler Responsibilities

VLIW Example

  L: fld  f0,0($1)
     fadd f4,f0,f2
     fst  f4,0($1)
     addi $1,$1,-8
     bne  $1,$2,L

• Schedule the code for a VLIW processor whose instruction packet contains:
  – 2 Mem slots (2 cycles)
  – 2 FP slots (3 cycles)
  – 1 Int+Branch slot

Trivial Solution
• Do nothing (only basic scheduling)

  Mem 1           Mem 2   FP 1            FP 2   Int+Br
  fld f0,0($1)
                          fadd f4,f0,f2
  fst f4,0($1)
                                                 addi $1,$1,-8
                                                 bne $1,$2,L
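
The trivial VLIW solution can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lecture's official answer: instructions are placed in program order without reordering, a result becomes readable `latency` cycles after its producer issues (which matches the fld/fadd/fst spacing in the superscalar schedule earlier), and instructions with no dependence between them may share a packet. The names `Instr`, `SLOTS`, `LATENCY` and `schedule_in_order` are made up for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Slots per packet and result latencies, taken from the VLIW example.
SLOTS = {"mem": 2, "fp": 2, "int": 1}
LATENCY = {"mem": 2, "fp": 3, "int": 1}


@dataclass
class Instr:
    text: str
    unit: str                # "mem", "fp", or "int" (branches use the int slot)
    dest: Optional[str]      # register written, if any
    srcs: Tuple[str, ...]    # registers read


def schedule_in_order(program):
    """Return the 1-based issue cycle of each instruction, without reordering."""
    ready = {}   # register -> first cycle in which its newest value can be read
    used = {}    # (cycle, unit) -> number of slots already occupied
    cycles = []
    cycle = 1    # earliest cycle allowed for the next instruction
    for ins in program:
        # RAW: wait until every source operand has been produced.
        start = max([cycle] + [ready.get(r, 1) for r in ins.srcs])
        # Structural: find a packet that still has a free slot of this kind.
        while used.get((start, ins.unit), 0) >= SLOTS[ins.unit]:
            start += 1
        used[(start, ins.unit)] = used.get((start, ins.unit), 0) + 1
        if ins.dest is not None:
            ready[ins.dest] = start + LATENCY[ins.unit]
        cycles.append(start)
        cycle = start  # later instructions may share this packet, never precede it
    return cycles


LOOP = [
    Instr("fld  f0,0($1)", "mem", "f0", ("$1",)),
    Instr("fadd f4,f0,f2", "fp", "f4", ("f0", "f2")),
    Instr("fst  f4,0($1)", "mem", None, ("f4", "$1")),
    Instr("addi $1,$1,-8", "int", "$1", ("$1",)),
    Instr("bne  $1,$2,L", "int", None, ("$1", "$2")),
]

for ins, c in zip(LOOP, schedule_in_order(LOOP)):
    print(f"cycle {c}: {ins.text}")
```

Under these assumptions the five instructions issue in cycles 1, 3, 6, 6 and 7, so most of the five slots in each packet stay empty.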

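
Going back to the 2-way renaming example, the same-cycle RAW check ("=?" comparators) can be sketched as follows. This is a simplified model, not the R10K design: the rename table starts as an identity map r0..r31 to p0..p31, the free list hands out p32 onward and is never refilled, and immediates are ignored. The names `rename_group` and `rename_program` are hypothetical.

```python
from collections import deque


def rename_group(group, table, free_list):
    """Rename one issue group; `table` and `free_list` are updated in place."""
    # Parallel lookup: every source reads the table as it was at the start of the cycle.
    looked_up = [[table[s] for s in srcs] for _, _, srcs in group]
    renamed = []
    new_maps = []  # (arch dest, phys dest) of earlier instructions in this group
    for (op, dest, srcs), psrcs in zip(group, looked_up):
        # "=?" comparators: if an earlier instruction in the group writes one of
        # our sources, its new physical register overrides the stale table value.
        for arch, phys in new_maps:  # in program order, so the newest writer wins
            psrcs = [phys if s == arch else p for s, p in zip(srcs, psrcs)]
        pdest = None
        if dest is not None:
            pdest = free_list.popleft()
            new_maps.append((dest, pdest))
        renamed.append((op, pdest, psrcs))
    # Commit the new mappings once the whole group has been renamed.
    for arch, phys in new_maps:
        table[arch] = phys
    return renamed


def rename_program(program, width=2):
    table = {f"r{i}": f"p{i}" for i in range(32)}         # assumed initial mapping
    free_list = deque(f"p{i}" for i in range(32, 64))     # assumed free registers
    out = []
    for i in range(0, len(program), width):
        out.extend(rename_group(program[i:i + width], table, free_list))
    return out, table


# (op, architectural dest or None, architectural sources) from the renaming example.
PROGRAM = [
    ("ld",  "r1", ["r3"]),
    ("add", "r3", ["r1"]),          # the immediate #4 needs no renaming
    ("sub", "r6", ["r7", "r9"]),
    ("add", "r3", ["r3", "r6"]),
    ("ld",  "r6", ["r1"]),
    ("add", "r6", ["r6", "r3"]),
    ("st",  None, ["r6", "r1"]),
    ("ld",  "r6", ["r11"]),
]

renamed, final_table = rename_program(PROGRAM)
for op, pdest, psrcs in renamed:
    print(op, pdest, psrcs)
```

Without the comparator step, the second instruction of the first group would read r1 as p1 (the stale mapping) instead of p32, the register that the ld in the same group has just been given.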