A VLIW Processor with Hardware Functions: Increasing Performance While Reducing Power
Profs. Ray Hoare and Alex Jones Dara Kusic, Joshua Fazekas, Gayatri Mehta, and John Foster Objective
High Performance PC to Supercomputer Performance Low Power PDA Power Rapid Design Cycle “Compile Software into Hardware” Unmodified C Source Code No Hardware Design Experience Required Automatic Integration with Processor Target Technologies For all devices, we want to Maximize Performance while Minimizing Power consumption and Design Time. Our Approach VLIW Architecture Up to 4x performance improvement for software alone Full 32-bit Processor Real applications Altera NIOS II Instruction Set Architecture “Hardware Functions” Convert critical software kernels into hardware 9x to 332x Performance improvement 42x to 418x Power reduction Productivity through Design Automation Source: Unmodified C programs “Compile” software into hardware functions VLIW Processor: Exploit Instruction Level Parallelism Full NIOS II 32-bit ISA Real processor, real apps Expanded to 4 ALUs for up to 4x speedup Shared Register File Shared Instruction Stream Trimaran Compiler Instr RAM
Instruction Cust Cust Cust Cust Decoder ALU Instr ALU Instr ALU Instr ALU Instr
MUX MUX MUX MUX Controller Hardware Functions: Exploit Hardware Acceleration Kernels converted to Hardware Data shared in Reg. File Control flow in Software VLIW Exploits Instruction Parallelism Hardware Provides Acceleration
Kernels High Performance and converted Low Power to Hardware Data shared No Data movement in Reg. File Control flow Programmable in Software Benchmark Codes
Speech Coding ADPCM Encoder (4:1 compression) ADPCM Decoder G.721 (cordless phones) GSM (wireless phones, 8:1 compression) Communications Spherical Decoder (M tx, N rx antennas) (Written in vectorized Matlab) Multimedia Decoding MPEG2 Decoder (IDCT) Performance Results
Fully Implemented Processors FPGA Comparison: Soft-core processor vs. HW Function in 90nm FPGA pNIOS: our sequential processor VLIW-4: our 4-wide VLIW Hardware Functions vs. Equivalent Software Estimations based on a shared register file with HW Functions Intel Strong ARM 206MHz vs 160nm ASIC HW Functions Intel XScale 733MHz vs 160nm ASIC HW Functions Performance Results
Acceleration Perspective A speedup of 2x turns 100MHz system into 200MHz of performance 10x provides 1GHz performance 100x provides 10GHz performance Kernel Acceleration: 7x to 345x Software vs. HW-Accelerated
1,000.00
100.00
Times Improvement Times 10.00
1.00 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark
pNiosII VLIW-4 StrongARM@206MHz XScale@733MHz Application Acceleration: 1.8x to 154x Software vs. HW-Accelerated
1,000.00
100.00
Times Improvement 10.00
1.00 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark
pNiosII + HW VLIW-4 + HW StrongARM@206MHz+HW XScale@733MHz+HW Application Acceleration: pNIOS
18
16
14
12
10
8 Times Improved
6
4
2
0 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark
VLIW over pNIOS pNIOS+Hw over pNIOS VLIW4+HW over pNIOS Effective MHz for pNIOS
3000
2500
2000
1500 MHz
1000
500
0 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark
pNIOS VLIW4 pNIOS+HW VLIW4+HW Application Acceleration: StrongARM
1,000.00
100.00 Times Improved 10.00
1.00 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark
StrongARM+HW over StrongARM Effective MHz for StrongARM
27,226 31,750 1000
900
800
700
600
500 MHz
400
300
200
100
0 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark
StrongARM StongARM+HW Application Acceleration: XScale
100.00
10.00 Times Improved Times
1.00 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark
Xscale+HW over Xscale Effective MHz for XScale
24,906 27,700 3000
2500
2000
1500 MHz
1000
500
0 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark
Xscale Xscale+HW Power/Energy Savings
Software on 4 wide VLIW in 160nm ASIC vs. Hardware Functions in 160nm ASIC Power Reduction: 42x to 418x Energy Reduction: 1,700x to 60,000x If (bufferstep) { delta = inputbuffer & 0xf; (a) } else { inputbuffer = *inp++ Automatic Conversion delta = (inputbuffer >> 4) & 0xf; } of Software Functions bufferstep into Hardware != 0x0 If Basic Block Functions (b) *inp inp inputbuffer Specified hardware inputbuffer 0xF 0x4 0xF >> functions are converted & ++ & delta inp automatically into delta hardware Then Basic Block Else Basic Block
Part of the “compiler” inp *inp inputbuffer Prototype compiler is bufferstep inputbuffer 0xF 0xF >> 0x4 (c) working in laboratory ++ != 0x0 & &
T/1 F/0 T/ 1 F/ 0 2:1 MUX 2:1 MUX inp delta ADPCM - CDFG
Control Flow for 'adpcm_code r' Basic Block3 Basic Block5 Basic Block 6 Basic Block 9 Basic Block11 Basic Block 13 Basic Block 14 Basic Block16 Basic Block 17 Basic Block19 Basic Block 20 Basic Block 22 Basic Block 23 Basic Block 24 Basic Block 25 Basic Block 26 Basic Block 27 Basic Block 28 Basic Block 29 Basic Block31 Basic Block 34 Basic Block 35 Basic Block37 Basic Block 38 Basic Block 39 Basic Block 40
BB3 2 inp 8 0 0 suif_tmp0 di ff di ff st ep 3 0 vpdi ff step di ff 4 st ep 1 vpdi ff st ep di ff delta step 1 vpdi ff step delt a sign 0 valpre d vpdi ff valp re d vpdi ff 32 767 valp re d 32 767 valpre d -3 2768 delt a sign -3 2768 0 88 index 88 buf fe rstep 0 stepsizeT able inde x delta 4 delta bu ffer step 0
BB5 BB6 + load suif_tmp suif_tmp0 suif_tmp0 != sign *-1 <= >> delt a + - delta diff >> + - conver t 2 di ff >> + conver t 1 != - + < valp red < conv ert conver t valpre d inde x < inde x != + << conver t 15 != 0
BB9 valpre d in p conver t eval di ff eval vpdi ff vpdi ff di ff <= st ep vpdiff diff | <= st ep vpdiff | eval valp red valp re d eval eval | eval eval load conver t 240 & ==
BB11 0 - val eval conver t eval conver t indexT able conver t step & conver t outputbuf fe r buf fe rs tep
BB13 < di ff delt a delt a + delt a conver t conver t conver t
BB14 eval inde x load outputbuf fer |
BB16 0 + 1 out p conver t
BB17 < inde x + addr suif_tmp conver t
BB19 eval outp stor e
BB20
BB22
BB23 BB24
BB25
BB27
BB26 BB29
BB28
BB31
BB34
BB35
BB37
BB38 BB39
BB40 Control Flow for 'adpcm_code r '
BB3
ADPCM - CFG BB5 BB6
BB9
BB1 1
BB1 3 Control flow dominated BB1 4 BB1 6
if-then-else BB1 7
BB1 9 ILP < 2 BB2 0 Branch prediction is poor BB2 2 BB2 3 BB2 4 Poor inherent parallelism BB2 5 BB2 7 DFGs contain variable BB2 6 BB2 9 BB2 8
length critical paths BB3 1
BB3 4
Cycle time for RTL BB3 5
conforms to longest DFG BB3 7
BB3 8 BB3 9
BB4 0 Basic Block 3
2 in p
+ load
valpred in p convert
0 - val ADPCM - SDFG < 8 0 0 0 mux *-1
!= st e p 3 !=
>> mux convert
1 + <= - 4 0
>> mux mux mux
Single SDFG 1 + <= - convert 2
>> mux mux | Removal of clock + <= convert mux mux Combinational/Data - + convert 1 32767 -32768 mux |
Flow < < -32768 convert
32767 mux mux
Cycle compression mux convert
valpred | Significant increase in 4 convert indexTable << convert 15 + parallelism conver t 240 & index load 0 bufferstep 0 & outputbuffer convert 0 +
!= 0 != convert 1 outp convert convert < 0
== mux + mux | 88 mux
bufferstep outputbuffer mux suif_tm p convert < 88
outp addr convert st epsizeTable mux
store +
load
st e p MPEG II (IDCT) - CDFG
Two nested loops within fast_IDCT function idct_row and idct_col In this example, control flow is segregated from most of the work
Control Flow for 'Fast_IDC T ' Basic Bloc k 0 Basic Bloc k 1 Basic Bloc k 5 Basic Bloc k 6
BB0 1 0 8 1 0 8 8 i fo r_ st ep 2 i fo r_ st ep
BB5 fo r_ st ep i fo r _ lb < fo r_ub fo r_ st ep i fo r _ lb < fo r_ub fo r _ub * + 2 bloc k * fo r _ub +
BB1 ev a l ev a l < bloc k i * + < i
BB6 ev a l + idctco l ev a l
BB3 idctro w MPEG II (IDCT) - CDFG (row)
Basic Bloc k 0
0 blk
return 0 convert 3 convert 4 convert 2 convert 6 convert 0 convert 4 convert 2 convert 7 convert 3 convert 1 convert 1 convert 5 convert 5 convert 6 convert 7 convert
+ + + + + + + + + + + + + + + +
load load load load load load load load
1568 convert 11 convert convert 37 8 4 convert 3406 convert 4017 convert convert convert 2276 799
* + 11 0 8 << 12 8 * 11 * * + 565 + 2408 * *
addr addr addr * + << * * addr
+ + - - - - + -
+ - - + + + - -
8 + 8 x7 x8 + - 8 x0 18 1 x3 x6 - x1 x5 - 8 + 18 1
>> >> >> * 12 8 >> * 12 8
convert convert convert + 8 convert + 8
stor e store store >> store >>
+ 8 x4 + 8 - 8 - x2 8
>> addr >> addr >> >>
convert convert convert addr addr convert
store store store store MPEG II (IDCT) - CDFG (col)
Basic Bloc k 0
0 blk
return convert 16 convert 0 convert 24 convert 8 convert 32 convert 56 convert 16 convert 32 convert 56 convert 24 convert 40 convert 48 convert 8 convert 40 convert 48
+ + + + + + + + + + + + + + +
convert 0 load load load load load load load
1568 + convert 3406 convert 4017 convert convert convert convert 2276 799 37 8 4
* load * 11 0 8 + * 565 + + 2408 * * *
convert 8 * 4 * 4 * 4
addr << 819 2 convert 8 + + +
3 + + << - 3 - 3 + 3 - 3 -
>> addr + - >> >> >> >>
addr + - addr + + - -
14 + + 14 x7 iclp - 14 x8 - 14 x6 18 1 x1 x5 - + 18 1
>> >> convert convert conver t convert >> >> 3 12 8 * * 12 8
+ + + + >> 8 + + 8
load load load load - >> + >> convert
store + 14 store store store x0 + 14 - x3 14 x4 - 14 x2
convert >> addr convert >> convert >> >>
addr + + + + addr
load load addr load load
st ore st ore store st ore A VLIW Processor with Hardware Functions: Increasing Performance While Reducing Power For FPGA, Structured ASIC and ASIC technologies VLIW Architecture Up to 4x performance improvement for software alone Full 32-bit Processor Real applications Altera NIOS II Instruction Set Architecture “Hardware Functions” Convert critical software kernels into hardware 9x to 332x Performance improvement 42x to 418x Power reduction Productivity through Design Automation Source: Unmodified C programs “Compile” software into hardware functions Contact Prof. Ray Hoare [email protected] Prof. Alex Jones [email protected]