A VLIW Processor with Hardware Functions: Increasing Performance While Reducing Power

Profs. Ray Hoare and Alex Jones
Dara Kusic, Joshua Fazekas, Gayatri Mehta, and John Foster

Objective

- High Performance
  - PC to Supercomputer Performance
- Low Power
  - PDA Power
- Rapid Design Cycle
  - “Compile Software into Hardware”
  - Unmodified C Source Code
  - No Hardware Design Experience Required
  - Automatic Integration with Processor

Target Technologies

For all devices, we want to maximize performance while minimizing power consumption and design time.

Our Approach
- VLIW Architecture
  - Up to 4x performance improvement for software alone
  - Full 32-bit Processor
    - Real applications
    - NIOS II Instruction Set Architecture
- “Hardware Functions”
  - Convert critical software kernels into hardware
  - 9x to 332x Performance improvement
  - 42x to 418x Power reduction
- Productivity through Design Automation
  - Source: Unmodified C programs
  - “Compile” software into hardware functions

VLIW Processor: Exploit Instruction Level Parallelism
- Full NIOS II 32-bit ISA
  - Real processor, real apps
- Expanded to 4 ALUs for up to 4x speedup

[Figure: VLIW-4 block diagram. Recoverable labels: Trimaran compiler, shared instruction stream, instruction RAM, instruction decoder, controller, and four ALUs, each paired with a custom instruction unit and a MUX.]

Hardware Functions: Exploit Hardware Acceleration
- Kernels converted to hardware: high performance and low power
- Data shared in register file: no data movement
- Control flow in software: programmable
- VLIW exploits instruction parallelism
- Hardware provides acceleration

Benchmark Codes

- Speech Coding
  - ADPCM Encoder (4:1 compression)
  - ADPCM Decoder
  - G.721 (cordless phones)
  - GSM (wireless phones, 8:1 compression)
- Communications
  - Spherical Decoder (M tx, N rx antennas)
    - (Written in vectorized Matlab)
- Multimedia Decoding
  - MPEG2 Decoder (IDCT)

Performance Results

- Fully Implemented Processors
  - FPGA comparison: soft-core processor vs. HW function in 90nm FPGA
  - pNIOS: our sequential processor
  - VLIW-4: our 4-wide VLIW
- Hardware Functions vs. Equivalent Software
  - Estimations based on a shared register file with HW functions
  - StrongARM 206MHz vs. 160nm ASIC HW functions
  - Intel XScale 733MHz vs. 160nm ASIC HW functions

Performance Results

Acceleration Perspective
- A speedup of 2x turns a 100MHz system into 200MHz of performance
- 10x provides 1GHz performance
- 100x provides 10GHz performance

Kernel Acceleration: 7x to 345x
Software vs. HW-Accelerated

[Figure: bar chart of Times Improvement (log scale, 1 to 1,000) per benchmark: ADPCM Encode, ADPCM Decode, G.721 Decode, GSM Decode, MPEG2 Decode. Series: pNiosII, VLIW-4, StrongARM@206MHz, XScale@733MHz.]

Application Acceleration: 1.8x to 154x
Software vs. HW-Accelerated

[Figure: bar chart of Times Improvement (log scale, 1 to 1,000) per benchmark: ADPCM Encode, ADPCM Decode, G.721 Decode, GSM Decode, MPEG2 Decode. Series: pNiosII + HW, VLIW-4 + HW, StrongARM@206MHz+HW, XScale@733MHz+HW.]

Application Acceleration: pNIOS

[Figure: bar chart of Times Improved (linear scale, 0 to 18) per benchmark: ADPCM Encode, ADPCM Decode, G.721 Decode, GSM Decode, MPEG2 Decode. Series: VLIW over pNIOS, pNIOS+HW over pNIOS, VLIW4+HW over pNIOS.]

Effective MHz for pNIOS
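The "effective MHz" figures follow the arithmetic of the Acceleration Perspective slide, where a 2x speedup turns a 100MHz system into 200MHz of performance. The sketch below assumes the charts use the same product of base clock and speedup; the helper name effective_mhz is hypothetical and does not come from the slides.

```c
#include <assert.h>

/* Effective clock rate, in MHz, of a system that runs `speedup` times
 * faster than software alone on a core clocked at `base_mhz`.
 * Hypothetical helper for illustration; not part of the authors' tools. */
double effective_mhz(double base_mhz, double speedup)
{
    assert(base_mhz > 0.0 && speedup > 0.0);
    return base_mhz * speedup;
}
```

With the slide's own numbers, effective_mhz(100.0, 10.0) gives 1000.0 (1GHz) and effective_mhz(100.0, 100.0) gives 10000.0 (10GHz).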

[Figure: bar chart of effective MHz (linear scale, 0 to 3000) per benchmark: ADPCM Encode, ADPCM Decode, G.721 Decode, GSM Decode, MPEG2 Decode. Series: pNIOS, VLIW4, pNIOS+HW, VLIW4+HW.]

Application Acceleration: StrongARM

[Figure: bar chart of Times Improved (log scale, 1 to 1,000) per benchmark: ADPCM Encode, ADPCM Decode, G.721 Decode, GSM Decode, MPEG2 Decode. Series: StrongARM+HW over StrongARM.]

Effective MHz for StrongARM

[Figure: bar chart of effective MHz (linear scale, 0 to 1000) per benchmark: ADPCM Encode, ADPCM Decode, G.721 Decode, GSM Decode, MPEG2 Decode. Series: StrongARM, StrongARM+HW. Two values are printed above the axis range: 27,226 and 31,750.]

Application Acceleration: XScale

[Figure: bar chart of Times Improved (log scale, 1 to 100) per benchmark: ADPCM Encode, ADPCM Decode, G.721 Decode, GSM Decode, MPEG2 Decode. Series: XScale+HW over XScale.]

Effective MHz for XScale

[Figure: bar chart of effective MHz (linear scale, 0 to 3000) per benchmark: ADPCM Encode, ADPCM Decode, G.721 Decode, GSM Decode, MPEG2 Decode. Series: XScale, XScale+HW. Two values are printed above the axis range: 24,906 and 27,700.]

Power/Energy Savings

- Software on 4-wide VLIW in 160nm ASIC vs. Hardware Functions in 160nm ASIC
- Power Reduction: 42x to 418x
- Energy Reduction: 1,700x to 60,000x

Automatic Conversion of Software Functions into Hardware Functions
- Specified hardware functions are converted automatically into hardware
- Part of the “compiler”
- Prototype compiler is working in laboratory

(a) Source fragment:

    if (bufferstep) {
        delta = inputbuffer & 0xf;
    } else {
        inputbuffer = *inp++;
        delta = (inputbuffer >> 4) & 0xf;
    }

(b) [Figure: the fragment drawn as three basic blocks. If block: bufferstep != 0x0. Then block: inputbuffer & 0xF gives delta. Else block: load *inp into inputbuffer, increment inp, then (inputbuffer >> 0x4) & 0xF gives delta.]

(c) [Figure: the three blocks merged into one data flow graph. The comparison bufferstep != 0x0 drives two 2:1 MUXes (T/1, F/0) whose outputs are inp and delta.]

ADPCM - CDFG

[Figure: CDFG for 'adpcm_coder'. The control flow graph links 26 basic blocks (BB3 through BB40); each block contains a data flow graph over values such as inp, delta, sign, step, diff, vpdiff, valpred, index, bufferstep, outputbuffer, indexTable, and stepsizeTable.]

ADPCM - CFG
- Control flow dominated
  - if-then-else
- ILP < 2
- Branch prediction is poor
- Poor inherent parallelism
- DFGs contain variable length critical paths
  - Cycle time for RTL conforms to longest DFG

[Figure: control flow for 'adpcm_coder' (BB3 through BB40), with the DFG of Basic Block 3 shown in detail. Recoverable node labels: inp, 2, +, load, convert, valpred.]
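The "ILP < 2" bullet can be made concrete with a small calculation. The sketch below is an illustration only: it assumes one-cycle operations listed in topological order, and the dependency graphs it is applied to are hypothetical, not extracted from adpcm_coder. The names critical_path, min_cycles, and MAX_OPS are invented for this example.

```c
#include <assert.h>

#define MAX_OPS 64

/* dep[i][j] != 0 means operation i consumes the result of operation j.
 * Assumes operations are listed in topological order (j < i) and that
 * each takes one cycle. Returns the longest dependency chain length. */
int critical_path(int n, unsigned char dep[][MAX_OPS])
{
    int depth[MAX_OPS];
    int longest = 0;

    assert(n > 0 && n <= MAX_OPS);
    for (int i = 0; i < n; i++) {
        depth[i] = 1;
        for (int j = 0; j < i; j++) {
            if (dep[i][j] && depth[j] + 1 > depth[i])
                depth[i] = depth[j] + 1;
        }
        if (depth[i] > longest)
            longest = depth[i];
    }
    return longest;
}

/* Lower bound on cycles for a machine that issues issue_width operations
 * per cycle: limited either by the dependency chain or by issue slots. */
int min_cycles(int n, int path, int issue_width)
{
    int slots;

    assert(n > 0 && path > 0 && issue_width > 0);
    slots = (n + issue_width - 1) / issue_width;
    return path > slots ? path : slots;
}
```

For a hypothetical block of six operations whose longest chain is four operations, ILP is 6/4 = 1.5. min_cycles gives 6 cycles at width 1 and 4 cycles at width 4, so going to four ALUs gains only 1.5x, which is the kind of limit the slide attributes to control-dominated code.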

ADPCM - SDFG
- Single SDFG
- Removal of clock
  - Combinational/Data Flow
  - Cycle compression
- Significant increase in parallelism

[Figure: the body of 'adpcm_coder' flattened into a single graph of arithmetic, comparison, shift, convert, load, and store nodes, with mux nodes in place of the branches. Recoverable labels include valpred, step, index, bufferstep, outputbuffer, outp, indexTable, stepsizeTable, and the constants 32767, -32768, 88, and 240.]
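The mux nodes in the SDFG come from replacing branches with selection, as in panel (c) of the Automatic Conversion slide. The sketch below shows that idea on the slide's own fragment, written as plain C rather than hardware. It is an illustration of if-conversion in general, not the output of the authors' compiler; the function names and the use of unsigned char for the input stream are assumptions.

```c
#include <assert.h>

/* Branching form, following the ADPCM source fragment. */
int next_delta_branch(int bufferstep, int *inputbuffer,
                      const unsigned char **inp)
{
    int delta;

    if (bufferstep) {
        delta = *inputbuffer & 0xf;
    } else {
        *inputbuffer = *(*inp)++;
        delta = (*inputbuffer >> 4) & 0xf;
    }
    return delta;
}

/* If-converted form: both arms are evaluated unconditionally and the
 * comparison only selects which results are kept. */
int next_delta_mux(int bufferstep, int *inputbuffer,
                   const unsigned char **inp)
{
    int sel = (bufferstep != 0x0);

    assert(inputbuffer != 0 && inp != 0 && *inp != 0);

    /* Then arm. */
    int delta_then = *inputbuffer & 0xf;

    /* Else arm. The load is speculative: it happens even when the Then
     * arm is selected, so **inp must be readable in both cases. */
    int loaded = **inp;
    int delta_else = (loaded >> 4) & 0xf;

    /* Each conditional expression below plays the role of a 2:1 MUX. */
    *inp = sel ? *inp : *inp + 1;
    *inputbuffer = sel ? *inputbuffer : loaded;
    return sel ? delta_then : delta_else;
}
```

Both functions return the same delta and leave inp and inputbuffer in the same state. The second has no control dependence between the two arms, which is what lets them run side by side in hardware.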

MPEG II (IDCT) - CDFG

- Two nested loops within fast_IDCT function
  - idct_row and idct_col
- In this example, control flow is segregated from most of the work
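The split between control flow and work can be sketched as below. This assumes the layout of the public MPEG-2 reference decoder, where one loop of eight iterations calls the row kernel and a second calls the column kernel; the slide describes the loops as nested, so treat the exact loop structure as an assumption. The bodies of idctrow and idctcol here are placeholders that only mark the coefficients they touch, not the real IDCT arithmetic.

```c
#include <assert.h>

/* Placeholder for the row kernel: touches 8 contiguous coefficients.
 * The increment is a stand-in for the real butterfly computation. */
static void idctrow(short *blk)
{
    for (int k = 0; k < 8; k++)
        blk[k] += 1;
}

/* Placeholder for the column kernel: touches 8 coefficients, stride 8. */
static void idctcol(short *blk)
{
    for (int k = 0; k < 8; k++)
        blk[8 * k] += 10;
}

/* Driver: all control flow lives here, all arithmetic in the kernels. */
void Fast_IDCT(short *block)
{
    assert(block != 0);

    for (int i = 0; i < 8; i++)
        idctrow(block + 8 * i);   /* row i starts at offset 8*i */

    for (int i = 0; i < 8; i++)
        idctcol(block + i);       /* column i starts at offset i */
}
```

With these placeholders, running Fast_IDCT on a zeroed 64-entry block leaves every entry at 11, showing that each coefficient is visited once by a row call and once by a column call. The kernels are straight-line code, which is why they are natural candidates for hardware functions while the two small loops stay in software.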

[Figure: control flow for 'Fast_IDCT' (basic blocks BB0, BB1, BB3, BB5, BB6) with per-block data flow. Recoverable labels: i, for_lb, for_ub, for_step, the constant 8, block, idctrow, idctcol.]

MPEG II (IDCT) - CDFG (row)

[Figure: data flow graph of the row pass (Basic Block 0). Eight loads from blk at offsets 0 through 7 feed multiplies by the constants 565, 799, 1108, 1568, 2276, 2408, 3406, 3784, 4017, and 181; adds and subtracts form the intermediates x0 through x8; results are shifted right by 8, converted, and written back by eight stores.]

MPEG II (IDCT) - CDFG (col)

[Figure: data flow graph of the column pass (Basic Block 0). Loads from blk at offsets 0, 8, 16, 24, 32, 40, 48, and 56 feed the same constant multiplies as the row pass; intermediates x0 through x8 are combined with adds, subtracts, and shifts by 3, 8, and 14; results pass through iclp lookups and are written back by eight stores.]

A VLIW Processor with Hardware Functions: Increasing Performance While Reducing Power
- For FPGA, Structured ASIC and ASIC technologies
- VLIW Architecture
  - Up to 4x performance improvement for software alone
  - Full 32-bit Processor
    - Real applications
    - Altera NIOS II Instruction Set Architecture
- “Hardware Functions”
  - Convert critical software kernels into hardware
  - 9x to 332x Performance improvement
  - 42x to 418x Power reduction
- Productivity through Design Automation
  - Source: Unmodified C programs
  - “Compile” software into hardware functions
- Contact
  - Prof. Ray Hoare [email protected]
  - Prof. Alex Jones [email protected]