ECE4750/CS4420 Computer Architecture L13: Wide Issue Processors

Edward Suh Computer Systems Laboratory [email protected]

Announcements

ECE4750/CS4420 — Computer Architecture, Fall 2008 2


Overview

. Review: techniques to run a single program fast
  • Control hazards → branch prediction &
  • Data hazards → dynamic scheduling &

. Execute more than one instruction per cycle
  • Motivation
  • HW-based approach
    – Simple in-order version
    – Dynamically scheduled version
  • SW-based approach

. Reading: Chapter 2.7, 2.8


Limit of a Single-Issue

. What is the CPI if all techniques described so far work perfectly?

. How can we do better?

. Possible approaches?
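One way to work through the first question is the standard CPI decomposition. This is only a sketch: the symbols $\mathrm{CPI}_{\text{ideal}}$ and $W$ (issue width) are notation introduced here, not taken from the slides.

```latex
\mathrm{CPI} = \mathrm{CPI}_{\text{ideal}} + \frac{\text{stall cycles}}{\text{instruction}},
\qquad
\mathrm{CPI}_{\text{ideal}} = \frac{1}{W}
```

For a single-issue pipeline $W = 1$, so even if every stall is removed the CPI cannot drop below 1. Issuing $W > 1$ instructions per cycle lowers that bound to $1/W$.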


Simple MIPS Superscalar

. Dual-issue superscalar – two pipelines
  • int/branch/mem (including FP mem)
  • FP

Pipe   Cycle 1  2    3    4    5    6    7    8    9
int    IF       ID   EX   MEM  WB
fp     IF       ID   EX   EX   EX   WB
int             IF   ID   EX   MEM  WB
fp              IF   ID   EX   EX   EX   WB
int                  IF   ID   EX   MEM  WB
fp                   IF   ID   EX   EX   EX   WB
int                       IF   ID   EX   MEM  WB
fp                        IF   ID   EX   EX   EX   WB


Simple MIPS Superscalar


Example Code

L:  fld   f0,0($1)
    fadd  f4,f0,f2
    fst   f4,0($1)
    addi  $1,$1,-8
    bne   $1,$2,L

. Read an array and increment by f2
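For readers who prefer a high-level view, the loop can be sketched in Python. The function name `increment_all` and the list-index bookkeeping are illustrative assumptions: the assembly works on byte addresses in `$1` and stops when `$1` equals `$2`, which is modeled here as the index running past the first element.

```python
def increment_all(x, s):
    """Add s to every element of x in place, walking from the last element down.

    Assumes x is non-empty: like the assembly, the test sits at the bottom
    of the loop, so the body always runs at least once.
    """
    i = len(x) - 1            # plays the role of $1 (index instead of byte address)
    while True:
        x[i] = x[i] + s       # fld f0,0($1) / fadd f4,f0,f2 / fst f4,0($1)
        i -= 1                # addi $1,$1,-8 (one 8-byte element per step)
        if i < 0:             # bne $1,$2,L falls through once $1 reaches $2
            break
    return x


print(increment_all([1.0, 2.0, 3.0], 0.5))
```

Each iteration does only one useful FP operation, so loop overhead (`addi`, `bne`) and the load/add/store dependence chain dominate.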


Simple MIPS Superscalar

. Loop unrolled five times: 17 instructions in 12 cycles, or IPC = 17/12 ≈ 1.42

CLK   Int                  FP
1     fld  f0,0($1)
2     fld  f6,-8($1)
3     fld  f10,-16($1)     fadd f4,f0,f2
4     fld  f14,-24($1)     fadd f8,f6,f2
5     fld  f18,-32($1)     fadd f12,f10,f2
6     fst  f4,0($1)        fadd f16,f14,f2
7     fst  f8,-8($1)       fadd f20,f18,f2
8     fst  f12,-16($1)
9     fst  f16,-24($1)
10    addi $1,$1,-40
11    bne  $1,$2,L
12    fst  f20,8($1)
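The IPC figure can be re-derived by counting occupied issue slots in the schedule. The sketch below simply transcribes the opcode in each slot per cycle; the names `INT_SLOT`, `FP_SLOT`, and `ipc` are made up for this illustration.

```python
# Opcode issued in each slot for cycles 1..12, read off the schedule.
# None marks an empty slot.
INT_SLOT = ["fld"] * 5 + ["fst"] * 4 + ["addi", "bne", "fst"]
FP_SLOT = [None] * 2 + ["fadd"] * 5 + [None] * 5


def ipc(int_slot, fp_slot):
    """Return (instructions issued, cycles, instructions per cycle)."""
    cycles = len(int_slot)
    issued = sum(op is not None for op in int_slot + fp_slot)
    return issued, cycles, issued / cycles


issued, cycles, rate = ipc(INT_SLOT, FP_SLOT)
print(issued, cycles, round(rate, 2))   # 17 12 1.42
```

The FP slot is busy in only 5 of the 12 cycles, which is why the IPC stays well below the dual-issue peak of 2.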


Dynamic Superscalar Processors

. Today’s processors – out-of-order speculative superscalar

. Example: Tomasulo’s algorithm with dual-issue

. Assumptions:
  • arbitrary issue mix (but correct dependence resolution)
  • perfect branch prediction, no delay slot, no speculative execution
  • two CDBs
  • latencies: 1 ALU = 1, 1 LD/ST = 2 (EX+MEM), 1 FPU = 3


Two-Iteration Schedule

Iter  Instruction      ID   EX   MEM  WB   Hazards
1     fld  f0,0($1)    1    2    3    4
1     fadd f4,f0,f2    1    5    -    8    FLD1
1     fst  f4,0($1)
1     addi $1,$1,-8
1     bne  $1,$2,L
2     fld  f0,0($1)
2     fadd f4,f0,f2
2     fst  f4,0($1)
2     addi $1,$1,-8
2     bne  $1,$2,L


Two-Iteration Schedule

Iter  Instruction      ID   EX   MEM  WB   Hazards
1     fld  f0,0($1)    1    2    3    4
1     fadd f4,f0,f2    1    5    -    8    FLD1
1     fst  f4,0($1)    2    3    9    -    FADD1
1     addi $1,$1,-8    2    4    -    5    ALU
1     bne  $1,$2,L     3    6    -    -    ADDI1
2     fld  f0,0($1)    4    7    8    9    BNE1
2     fadd f4,f0,f2    4    10   -    13   FLD2
2     fst  f4,0($1)    5    8    14   -    BNE1, ALU, FADD2
2     addi $1,$1,-8    5    9    -    10   BNE1, ALU
2     bne  $1,$2,L     6    11   -    -    ADDI2
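The cycle numbers in the completed schedule can be sanity-checked against the stated latencies. The sketch below transcribes the table (the store is labeled `store`) and checks a few dependence rules. The helper names are hypothetical, and the rule "a consumer starts the cycle after its producer writes the CDB" is an assumption read off the table rather than something the slides state.

```python
# (iteration, name, ID, EX, MEM, WB); None marks a stage the instruction skips.
SCHEDULE = [
    (1, "fld", 1, 2, 3, 4), (1, "fadd", 1, 5, None, 8),
    (1, "store", 2, 3, 9, None), (1, "addi", 2, 4, None, 5),
    (1, "bne", 3, 6, None, None),
    (2, "fld", 4, 7, 8, 9), (2, "fadd", 4, 10, None, 13),
    (2, "store", 5, 8, 14, None), (2, "addi", 5, 9, None, 10),
    (2, "bne", 6, 11, None, None),
]
ID, EX, MEM, WB = 2, 3, 4, 5          # column positions inside each tuple
NAMES = ("fld", "fadd", "store", "addi", "bne")


def row(iteration, name):
    """Look up one row of the schedule."""
    return next(r for r in SCHEDULE if r[0] == iteration and r[1] == name)


def dependences_hold(iteration):
    fld, fadd, store, addi, bne = (row(iteration, n) for n in NAMES)
    return (
        fadd[EX] == fld[WB] + 1          # fadd waits for the loaded f0
        and fadd[WB] == fadd[EX] + 3     # FPU latency of 3
        and store[MEM] == fadd[WB] + 1   # store waits for f4
        and bne[EX] == addi[WB] + 1      # branch waits for the new $1
    )


# Without speculation, iteration 2's load executes only after bne resolves.
no_speculation = row(2, "fld")[EX] == row(1, "bne")[EX] + 1
# Distance in cycles between the same instruction in consecutive iterations.
periods = {row(2, n)[EX] - row(1, n)[EX] for n in NAMES}

print(dependences_hold(1), dependences_hold(2), no_speculation, periods)
```

Every instruction executes exactly 5 cycles after its counterpart in the previous iteration. If that spacing persisted, the loop would sustain 5 instructions per 5 cycles (IPC = 1), because each iteration waits for the previous branch to resolve.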


Challenges in Superscalar


Superscalar Instruction Fetch


Superscalar Register Renaming

• During decode, each instruction is allocated a new physical destination register
• Source operands are renamed to the physical register holding the newest value
• Execution unit only sees physical register numbers

[Figure: Inst 1 and Inst 2 (each Op, Dest, Src1, Src2) drive the read-address ports of the Rename Table; the Register Free List and the write ports update the mapping; the read data yields the renamed instructions (Op, PDest, PSrc1, PSrc2).]

Does this work?


Renaming Example

. Assume 2-way superscalar (rename 2 instructions per cycle)

Original                 Renamed
ld  r1, (r3)             ld
add r3, r1, #4           add
sub r6, r7, r9           sub
add r3, r3, r6           add
ld  r6, (r1)             ld
add r6, r6, r3           add
st  r6, (r1)             st
ld  r6, (r11)            ld
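One way to check an answer to this exercise is a small renamer that processes the instructions one at a time. This is a sketch under stated assumptions, not the slide's solution: the initial mapping `rN -> pN`, the 32 architectural registers, and the free physical registers starting at `p32` are all hypothetical choices, so the physical register numbers it produces will differ from any other valid answer.

```python
# (opcode, destination register or None, source registers); immediates omitted.
PROGRAM = [
    ("ld", "r1", ["r3"]),
    ("add", "r3", ["r1"]),
    ("sub", "r6", ["r7", "r9"]),
    ("add", "r3", ["r3", "r6"]),
    ("ld", "r6", ["r1"]),
    ("add", "r6", ["r6", "r3"]),
    ("st", None, ["r6", "r1"]),
    ("ld", "r6", ["r11"]),
]


def rename(program, num_arch=32):
    """Rename sequentially: sources read the table, then the destination is remapped."""
    table = {f"r{i}": f"p{i}" for i in range(num_arch)}   # assumed initial mapping
    next_free = num_arch                                  # assumed free list: p32, p33, ...
    renamed = []
    for op, dest, srcs in program:
        psrcs = [table[s] for s in srcs]      # newest mapping of each source
        pdest = None
        if dest is not None:
            pdest = f"p{next_free}"           # fresh physical register
            next_free += 1
            table[dest] = pdest               # later readers see the new mapping
        renamed.append((op, pdest, psrcs))
    return renamed


for op, pdest, psrcs in rename(PROGRAM):
    print(op, pdest, psrcs)
```

Because every write gets a fresh physical register, the WAW and WAR hazards on `r3` and `r6` disappear; only the true RAW dependences remain in the renamed code.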


Superscalar Register Renaming

[Figure: the same rename datapath (Inst 1 and Inst 2 fields, Rename Table with read-address, read-data and write ports, Register Free List, renamed outputs Op, PDest, PSrc1, PSrc2), now with =? comparators added.]

Must check for RAW hazards between instructions issuing in the same cycle. Can be done in parallel with the rename lookup.

(MIPS R10K renames 4 serially-RAW-dependent insts/cycle)
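The same-cycle check can be sketched in software. Here all table lookups for a group use the mapping as it stood at the start of the cycle, and the `=?` comparison then overrides any source that matches the destination of an earlier instruction in the same group. The function name `rename_group`, the `rN -> pN` initial mapping, and the free-list contents are assumptions made for this illustration.

```python
def rename_group(group, table, free_list):
    """Rename all instructions of one issue group "in the same cycle"."""
    # 1. Parallel lookups: every source reads the table as of the start of the cycle.
    looked_up = [[table[s] for s in srcs] for _, _, srcs in group]
    # 2. Allocate one fresh physical register per destination.
    pdests = [free_list.pop(0) if dest is not None else None
              for _, dest, _ in group]
    # 3. "=?" comparators: a source matching an earlier destination in the
    #    group takes that instruction's new register instead of the stale lookup.
    renamed = []
    for i, (op, _, srcs) in enumerate(group):
        psrcs = list(looked_up[i])
        for k, src in enumerate(srcs):
            for j in range(i):                 # a later match overrides an earlier one
                if group[j][1] == src:
                    psrcs[k] = pdests[j]
        renamed.append((op, pdests[i], psrcs))
    # 4. Update the table in program order so the last writer wins.
    for (_, dest, _), pdest in zip(group, pdests):
        if dest is not None:
            table[dest] = pdest
    return renamed


table = {f"r{i}": f"p{i}" for i in range(32)}   # assumed initial mapping
free_list = ["p32", "p33"]                      # assumed free physical registers
group = [("ld", "r1", ["r3"]), ("add", "r3", ["r1"])]
result = rename_group(group, table, free_list)
print(result)
```

Without step 3, the `add` would read the stale mapping `p1` for `r1` instead of the `ld`'s new register `p32`, which is the same-cycle RAW hazard the comparators catch.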


Superscalar Control Logic Scaling

[Figure: an issue group (issue width W) shown against the previously issued instructions (lifetime L).]


Out-of-Order Control Complexity

[ R10000 SGI/MIPS Technologies Inc., 1995 ]


VLIW: Very Long Instruction Word

Int Op 1 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2

One Integer Unit, Single-Cycle Latency
Two Load/Store Units, Two-Cycle Latency
Two Floating-Point Units, Three-Cycle Latency


VLIW Compiler Responsibilities


VLIW Example

L:  fld   f0,0($1)
    fadd  f4,f0,f2
    fst   f4,0($1)
    addi  $1,$1,-8
    bne   $1,$2,L

. Schedule the code for a VLIW processor whose instruction packet has:
  • 2 Mem slots (2-cycle latency)
  • 2 FP slots (3-cycle latency)
  • 1 Int+Branch slot


Trivial Solution

. Do nothing (only basic scheduling)

Mem 1 | Mem 2 | FP 1 | FP 2 | Int+Br
fld f0,0($1)

fadd f4,f0,f2

fst f4,0($1) addi $1,$1,-8 bne $1,$2,L
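The packet-filling process can be sketched as a tiny in-order scheduler. It is one possible packing under stated assumptions, not necessarily the slide's exact table. The assumptions: a result produced in cycle c by a unit with latency n is usable from cycle c + n; instructions keep program order but may share a packet; and within a packet all registers are read before any is written. All names (`LATENCY`, `SLOTS`, `schedule_in_order`) are made up for this example.

```python
LATENCY = {"mem": 2, "fp": 3, "int": 1}   # cycles until a result is usable
SLOTS = {"mem": 2, "fp": 2, "int": 1}     # slots of each kind per packet

# (text, unit, destination register or None, source registers)
LOOP = [
    ("fld f0,0($1)", "mem", "f0", ["$1"]),
    ("fadd f4,f0,f2", "fp", "f4", ["f0", "f2"]),
    ("fst f4,0($1)", "mem", None, ["f4", "$1"]),
    ("addi $1,$1,-8", "int", "$1", ["$1"]),
    ("bne $1,$2,L", "int", None, ["$1", "$2"]),
]


def schedule_in_order(insts):
    """Place each instruction in the earliest legal packet without reordering."""
    ready = {}      # register -> first cycle its newest value can be read
    used = {}       # (cycle, unit) -> slots already taken in that packet
    cycle = 1       # packet of the previously placed instruction
    placed = []
    for text, unit, dest, srcs in insts:
        # Wait for operands, and never go earlier than the previous instruction.
        c = max([cycle] + [ready.get(s, 1) for s in srcs])
        while used.get((c, unit), 0) >= SLOTS[unit]:
            c += 1                               # structural hazard: slot full
        used[(c, unit)] = used.get((c, unit), 0) + 1
        if dest is not None:
            ready[dest] = c + LATENCY[unit]
        placed.append((c, unit, text))
        cycle = c
    return placed


for c, unit, text in schedule_in_order(LOOP):
    print(c, unit, text)
```

Under these assumptions the five instructions span 7 packets, with most slots empty. That low utilization is what loop unrolling and software pipelining aim to fix.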
