Presentation

A VLIW Processor with Hardware Functions: Increasing Performance While Reducing Power Profs. Ray Hoare and Alex Jones Dara Kusic, Joshua Fazekas, Gayatri Mehta, and John Foster Objective High Performance PC to Supercomputer Performance Low Power PDA Power Rapid Design Cycle “Compile Software into Hardware” Unmodified C Source Code No Hardware Design Experience Required Automatic Integration with Processor Target Technologies For all devices, we want to Maximize Performance while Minimizing Power consumption and Design Time. Our Approach VLIW Architecture Up to 4x performance improvement for software alone Full 32-bit Processor Real applications Altera NIOS II Instruction Set Architecture “Hardware Functions” Convert critical software kernels into hardware 9x to 332x Performance improvement 42x to 418x Power reduction Productivity through Design Automation Source: Unmodified C programs “Compile” software into hardware functions VLIW Processor: Exploit Instruction Level Parallelism Full NIOS II 32-bit ISA Real processor, real apps Expanded to 4 ALUs for up to 4x speedup Shared Register File Shared Instruction Stream Trimaran Compiler Instr RAM Instruction Cust Cust Cust Cust Decoder ALU Instr ALU Instr ALU Instr ALU Instr MUX MUX MUX MUX Controller Hardware Functions: Exploit Hardware Acceleration Kernels converted to Hardware Data shared in Reg. File Control flow in Software VLIW Exploits Instruction Parallelism Hardware Provides Acceleration Kernels High Performance and converted Low Power to Hardware Data shared No Data movement in Reg. File Control flow Programmable in Software Benchmark Codes Speech Coding ADPCM Encoder (4:1 compression) ADPCM Decoder G.721 (cordless phones) GSM (wireless phones, 8:1 compression) Communications Spherical Decoder (M tx, N rx antennas) (Written in vectorized Matlab) Multimedia Decoding MPEG2 Decoder (IDCT) Performance Results Fully Implemented Processors FPGA Comparison: Soft-core processor vs. HW Function in 90nm FPGA pNIOS: our sequential processor VLIW-4: our 4-wide VLIW Hardware Functions vs. Equivalent Software Estimations based on a shared register file with HW Functions Intel Strong ARM 206MHz vs 160nm ASIC HW Functions Intel XScale 733MHz vs 160nm ASIC HW Functions Performance Results Acceleration Perspective A speedup of 2x turns 100MHz system into 200MHz of performance 10x provides 1GHz performance 100x provides 10GHz performance Kernel Acceleration: 7x to 345x Software vs. HW-Accelerated 1,000.00 100.00 Times Improvement Times 10.00 1.00 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark pNiosII VLIW-4 StrongARM@206MHz XScale@733MHz Application Acceleration: 1.8x to 154x Software vs. HW-Accelerated 1,000.00 100.00 Times Improvement 10.00 1.00 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark pNiosII + HW VLIW-4 + HW StrongARM@206MHz+HW XScale@733MHz+HW Application Acceleration: pNIOS 18 16 14 12 10 8 Times Improved 6 4 2 0 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark VLIW over pNIOS pNIOS+Hw over pNIOS VLIW4+HW over pNIOS Effective MHz for pNIOS 3000 2500 2000 1500 MHz 1000 500 0 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark pNIOS VLIW4 pNIOS+HW VLIW4+HW Application Acceleration: StrongARM 1,000.00 100.00 Times Improved 10.00 1.00 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark StrongARM+HW over StrongARM Effective MHz for StrongARM 27,226 31,750 1000 900 800 700 600 500 MHz 400 300 200 100 0 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark StrongARM StongARM+HW Application Acceleration: XScale 100.00 10.00 Times Improved Times 1.00 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark Xscale+HW over Xscale Effective MHz for XScale 24,906 27,700 3000 2500 2000 1500 MHz 1000 500 0 ADPCM Encode ADPCM Decode G.721 Decode GSM Decode MPEG2 Decode Benchmark Xscale Xscale+HW Power/Energy Savings Software on 4 wide VLIW in 160nm ASIC vs. Hardware Functions in 160nm ASIC Power Reduction: 42x to 418x Energy Reduction: 1,700x to 60,000x If (bufferstep) { delta = inputbuffer & 0xf; (a) } else { inputbuffer = *inp++ Automatic Conversion delta = (inputbuffer >> 4) & 0xf; } of Software Functions bufferstep into Hardware != 0x0 If Basic Block Functions (b) *inp inp inputbuffer Specified hardware inputbuffer 0xF 0x4 0xF >> functions are converted & ++ & delta inp automatically into delta hardware Then Basic Block Else Basic Block Part of the “compiler” inp *inp inputbuffer Prototype compiler is bufferstep inputbuffer 0xF 0xF >> 0x4 (c) working in laboratory ++ != 0x0 & & T/1 F/0 T/ 1 F/ 0 2:1 MUX 2:1 MUX inp delta ADPCM - CDFG Control Flow for 'adpcm_code r' Basic Block 3 Basic Block 5 Basic Block 6 Basic Block 9 Basic Block 11 Basic Block 13 Basic Block 14 Basic Block 16 Basic Block 17 Basic Block 19 Basic Block 20 Basic Block 22 Basic Block 23 Basic Block 24 Basic Block 25 Basic Block 26 Basic Block 27 Basic Block 28 Basic Block 29 Basic Block 31 Basic Block 34 Basic Block 35 Basic Block 37 Basic Block 38 Basic Block 39 Basic Block 40 BB3 2 inp 8 0 0 suif_tmp0 di ff di ff st ep 3 0 vpdi ff step di ff 4 st ep 1 vpdi ff st ep di ff delta step 1 vpdi ff step delt a sign 0 valpre d vpdi ff valp re d vpdi ff 32 767 valp re d 32 767 valpre d -3 2768 delt a sign -3 2768 0 88 index 88 buf fe rstep 0 stepsizeT able inde x delta 4 delta bu ffer step 0 BB5 BB6 + load suif_tmp suif_tmp0 suif_tmp0 != sign *-1 <= >> delt a + - delta diff >> + - conver t 2 di ff >> + conver t 1 != - + < valp red < conv ert conver t valpre d inde x < inde x != + << conver t 15 != 0 BB9 valpre d in p conver t eval di ff eval vpdi ff vpdi ff di ff <= st ep vpdiff diff | <= st ep vpdiff | eval valp red valp re d eval eval | eval eval load conver t 240 & == BB11 0 - val eval conver t eval conver t indexT able conver t step & conver t outputbuf fe r buf fe rs tep BB13 < di ff delt a delt a + delt a conver t conver t conver t BB14 eval inde x load outputbuf fer | BB16 0 + 1 out p conver t BB17 < inde x + addr suif_tmp conver t BB19 eval outp stor e BB20 BB22 BB23 BB24 BB25 BB27 BB26 BB29 BB28 BB31 BB34 BB35 BB37 BB38 BB39 BB40 Control Flow for 'adpcm_code r ' BB3 ADPCM - CFG BB5 BB6 BB9 BB1 1 BB1 3 Control flow dominated BB1 4 BB1 6 if-then-else BB1 7 BB1 9 ILP < 2 BB2 0 Branch prediction is poor BB2 2 BB2 3 BB2 4 Poor inherent parallelism BB2 5 BB2 7 DFGs contain variable BB2 6 BB2 9 BB2 8 length critical paths BB3 1 BB3 4 Cycle time for RTL BB3 5 conforms to longest DFG BB3 7 BB3 8 BB3 9 BB4 0 Basic Block 3 2 in p + load valpred in p convert 0 - val ADPCM - SDFG < 8 0 0 0 mux *-1 != st e p 3 != >> mux convert 1 + <= - 4 0 >> mux mux mux Single SDFG 1 + <= - convert 2 >> mux mux | Removal of clock + <= convert mux mux Combinational/Data - + convert 1 32767 -32768 mux | Flow < < -32768 convert 32767 mux mux Cycle compression mux convert valpred | Significant increase in 4 convert indexTable << convert 15 + parallelism conver t 240 & index load 0 bufferstep 0 & outputbuffer convert 0 + != 0 != convert 1 outp convert convert < 0 == mux + mux | 88 mux bufferstep outputbuffer mux suif_tm p convert < 88 outp addr convert st epsizeTable mux store + load st e p MPEG II (IDCT) - CDFG Two nested loops within fast_IDCT function idct_row and idct_col In this example, control flow is segregated from most of the work Control Flow for 'Fast_IDC T ' Basic Bloc k 0 Basic Bloc k 1 Basic Bloc k 5 Basic Bloc k 6 BB0 1 0 8 1 0 8 8 i fo r_ st ep 2 i fo r_ st ep BB5 fo r_ st ep i fo r _ lb < fo r_ub fo r_ st ep i fo r _ lb < fo r_ub fo r _ub * + 2 bloc k * fo r _ub + BB1 ev a l ev a l < bloc k i * + < i BB6 ev a l + idctco l ev a l BB3 idctro w MPEG II (IDCT) - CDFG (row) Basic Bloc k 0 0 blk return 0 convert 3 convert 4 convert 2 convert 6 convert 0 convert 4 convert 2 convert 7 convert 3 convert 1 convert 1 convert 5 convert 5 convert 6 convert 7 convert + + + + + + + + + + + + + + + + load load load load load load load load 1568 convert 11 convert convert 37 8 4 convert 3406 convert 4017 convert convert convert 2276 799 * + 11 0 8 << 12 8 * 11 * * + 565 + 2408 * * addr addr addr * + << * * addr + + - - - - + - + - - + + + - - 8 + 8 x7 x8 + - 8 x0 18 1 x3 x6 - x1 x5 - 8 + 18 1 >> >> >> * 12 8 >> * 12 8 convert convert convert + 8 convert + 8 stor e store store >> store >> + 8 x4 + 8 - 8 - x2 8 >> addr >> addr >> >> convert convert convert addr addr convert store store store store MPEG II (IDCT) - CDFG (col) Basic Bloc k 0 0 blk return convert 16 convert 0 convert 24 convert 8 convert 32 convert 56 convert 16 convert 32 convert 56 convert 24 convert 40 convert 48 convert 8 convert 40 convert 48 + + + + + + + + + + + + + + + convert 0 load load load load load load load 1568 + convert 3406 convert 4017 convert convert convert convert 2276 799 37 8 4 * load * 11 0 8 + * 565 + + 2408 * * * convert 8 * 4 * 4 * 4 addr << 819 2 convert 8 + + + 3 + + << - 3 - 3 + 3 - 3 - >> addr + - >> >> >> >> addr + - addr + + - - 14 + + 14 x7 iclp - 14 x8 - 14 x6 18 1 x1 x5 - + 18 1 >> >> convert convert conver t convert >> >> 3 12 8 * * 12 8 + + + + >> 8 + + 8 load load load load - >> + >> convert store + 14 store store store x0 + 14 - x3 14 x4 - 14 x2 convert >> addr convert >> convert >> >> addr + + + + addr load load addr load load st ore st ore store st ore A VLIW Processor with Hardware Functions: Increasing Performance While Reducing Power For FPGA, Structured ASIC and ASIC technologies VLIW Architecture Up to 4x performance improvement for software alone Full 32-bit Processor Real applications Altera NIOS II Instruction Set Architecture “Hardware Functions” Convert critical software kernels into hardware 9x to 332x Performance improvement 42x to 418x Power reduction Productivity through Design Automation Source: Unmodified C programs “Compile” software into hardware functions Contact Prof.

Load more