Computer Architecture ELE 475 / COS 475 Slide Deck 7: VLIW David Wentzlaff Department of Electrical Engineering Princeton University
1 Superscalar Control Logic Scaling Issue Width W
Issue Group
Previously Issued Lifetime L Instructions
• Each issued instruc on must somehow check against W*L instruc ons, i.e., growth in hardware ∝ W*(W*L) • For in-order machines, L is related to pipeline latencies and check is done during issue (scoreboard) • For out-of-order machines, L also includes me spent in IQ, SB, and check is done by broadcas ng tags to wai ng instruc ons at comple on • As W increases, larger instruc on window is needed to find enough parallelism to keep machine busy => greater L 2 3 => Out-of-order control logic grows faster than W (~W ) 2 Out-of-Order Control Complexity: MIPS R10000
[A. Ahi et al., MIPS R10000 Superscalar Microprocessor, Hot Chips, 1995 ] Image Credit: MIPS Technologies Inc. / Silicon Graphics Computer Systems 3 Out-of-Order Control Complexity: MIPS R10000
Control Logic
[A. Ahi et al., MIPS R10000 Superscalar Microprocessor, Hot Chips, 1995 ] Image Credit: MIPS Technologies Inc. / Silicon Graphics Computer Systems 4 Sequen al ISA Bo leneck Sequen al Superscalar compiler source code a = foo(b); for (i=0, i<
Find independent opera ons
5 Sequen al ISA Bo leneck Sequen al Superscalar compiler source code a = foo(b); for (i=0, i<
Find independent Schedule opera ons opera ons
6 Sequen al ISA Bo leneck Sequen al Superscalar compiler Sequen al source code machine code a = foo(b); for (i=0, i<
Find independent Schedule opera ons opera ons
Superscalar processor
Check instruc on dependencies 7 Sequen al ISA Bo leneck Sequen al Superscalar compiler Sequen al source code machine code a = foo(b); for (i=0, i<
Find independent Schedule opera ons opera ons
Superscalar processor
Check instruc on Schedule dependencies execu on 8 VLIW: Very Long Instruc on Word Int Op 1 Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2
Two Integer Units, Single Cycle Latency Two Load/Store Units, Three Cycle Latency Two Floa ng-Point Units, Four Cycle Latency • Mul ple opera ons packed into one instruc on • Each opera on slot is for a fixed func on • Constant opera on latencies are specified • Architecture requires guarantee of: – Parallelism within an instruc on => no cross-opera on RAW check – No data use before data ready => no data interlocks 9 VLIW Equals (EQ) Scheduling Model
• Each opera on takes exactly specified latency • Efficient register usage (Effec vely more registers) • No need for register renaming or buffering – Bypass from func onal unit output to inputs – Register writes whenever func onal unit completes • Compiler depends on not having registers visible early
10 VLIW Less-Than-or-Equals (LEQ) Scheduling Model • Each opera on may take less than or equal to its specified latency – Des na on can be wri en any me a er instruc on issue – Dependent instruc on s ll needs to be scheduled a er instruc on latency • Precise interrupts simplified • Binary compa bility preserved when latencies are reduced
11 Blackboard Example Equals Model
{MUL R1, R2, R3; ADD R7, R8, R9} {AND R10, R1, R5}
X0 Y0 Y1 Y2 W M0 M1
12 Extra Blackboard Example VLIW Code Sequence EQ Scheduling {MUL R1, R3, R4; ADD R7, R8, R9} {ADD R9, R1, R10} {ADD; SUB} {NOP} {LW R17, R1}
13 Early VLIW Machines • FPS AP120B (1976) – scien fic a ached array processor – first commercial wide instruc on machine – hand-coded vector math libraries using so ware pipelining and loop unrolling • Mul flow Trace (1987) – commercializa on of ideas from Fisher’s Yale group including “trace scheduling” – available in configura ons with 7, 14, or 28 opera ons/instruc on – 28 opera ons packed into a 1024-bit instruc on word • Cydrome Cydra-5 (1987) – 7 opera ons encoded in 256-bit instruc on word – rota ng register file 14 VLIW Compiler Responsibili es
• Schedule opera ons to maximize parallel execu on
• Guarantees intra-instruc on parallelism
• Schedule to avoid data hazards (no interlocks) – Typically separates opera ons with explicit NOPs
15 Loop Execu on for (i=0; i 16 Loop Execu on for (i=0; i 17 Loop Execu on for (i=0; i 18 Loop Execu on for (i=0; i 19 Loop Execu on for (i=0; i How many FP ops/cycle? 20 Loop Execu on for (i=0; i How many FP ops/cycle? 1 add.s / 8 cycles = 0.125 21 Loop Unrolling for (i=0; i 22 Loop Unrolling for (i=0; i Need to handle values of N that are not multiples of unrolling factor with final cleanup loop 23 Scheduling Loop Unrolled Code Unroll 4 ways loop: lw F1, 0(r1) Int1 Int 2 M1 M2 FP+ FPx lw F2, 4(r1) lw F3, 8(r1) loop: lw F4, 12(r1) addiu R1, R1, 16 add.s F5, F0, F1 add.s F6, F0, F2 Schedule add.s F7, F0, F3 add.s F8, F0, F4 sw F5, 0(R2) sw F6, 4(R2) sw F7, 8(R2) sw F8, 12(R2) addiu R2, R2, 16 bne R1, R3, loop 24 Scheduling Loop Unrolled Code Unroll 4 ways loop: lw F1, 0(r1) Int1 Int 2 M1 M2 FP+ FPx lw F2, 4(r1) lw F3, 8(r1) loop: lw F1 lw F4, 12(r1) lw F2 addiu R1, R1, 16 lw F3 add.s F5, F0, F1 add R1 lw F4 add.s F6, F0, F2 Schedule add.s F7, F0, F3 add.s F8, F0, F4 sw F5, 0(R2) sw F6, 4(R2) sw F7, 8(R2) sw F8, 12(R2) addiu R2, R2, 16 bne R1, R3, loop 25 Scheduling Loop Unrolled Code Unroll 4 ways loop: lw F1, 0(r1) Int1 Int 2 M1 M2 FP+ FPx lw F2, 4(r1) lw F3, 8(r1) loop: lw F1 lw F4, 12(r1) lw F2 addiu R1, R1, 16 lw F3 add.s F5, F0, F1 add R1 lw F4 add.s F5 add.s F6, F0, F2 Schedule add.s F6 add.s F7, F0, F3 add.s F7 add.s F8, F0, F4 add.s F8 sw F5, 0(R2) sw F5 sw F6, 4(R2) sw F6 sw F7, 8(R2) sw F7 sw F8, 12(R2) add R2 bne sw F8 addiu R2, R2, 16 bne R1, R3, loop 26 Scheduling Loop Unrolled Code Unroll 4 ways loop: lw F1, 0(r1) Int1 Int 2 M1 M2 FP+ FPx lw F2, 4(r1) lw F3, 8(r1) loop: lw F1 lw F4, 12(r1) lw F2 addiu R1, R1, 16 lw F3 add.s F5, F0, F1 add R1 lw F4 add.s F5 add.s F6, F0, F2 Schedule add.s F6 add.s F7, F0, F3 add.s F7 add.s F8, F0, F4 add.s F8 sw F5, 0(R2) sw F5 sw F6, 4(R2) sw F6 sw F7, 8(R2) sw F7 sw F8, 12(R2) add R2 bne sw F8 addiu R2, R2, 16 bne R1, R3, loop How many FLOPS/cycle? 27 Scheduling Loop Unrolled Code Unroll 4 ways loop: lw F1, 0(