Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Christos Kozyrakis, Stanford University
David Patterson, U.C. Berkeley

http://csl.stanford.edu/~christos

Motivation

• Ideal for embedded media processing
  – High performance for media tasks
  – Low cost
    • Small code size, low power consumption, highly integrated
  – Low power consumption (for portable applications)
  – Low design complexity
  – Easy to program with HLLs
  – Scalable
• This work
  – How efficient is a simple vector processor for embedded media processing?
    • No caches, no wide issue, no out-of-order execution
  – How does it compare to superscalar and VLIW embedded designs?

Outline

• Motivation
• Overview of the VIRAM architecture
  – Multimedia instruction set, processor organization, vectorizing compiler
• EEMBC benchmarks & alternative architectures
• Evaluation
  – Instruction set analysis & code size comparison
  – Performance comparison
  – VIRAM scalability study
• Conclusions

VIRAM Instruction Set

• Vector load-store instruction set for media processing
  – Extension to the MIPS architecture
• Architecture state
  – 32 general-purpose vector registers
  – 16 flag registers
  – Scalar registers for control, addresses, strides, etc.
• Vector instructions
  – Arithmetic: integer, floating-point, logical
  – Load-store: unit-stride, strided, indexed (illustrated in the sketch below)
  – Misc: vector & flag processing (pop count, insert/extract)
  – 90 unique instructions
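For readers less familiar with vector terminology, the three load-store classes correspond to common C loop idioms. A minimal sketch, with function names, array names, and sizes chosen only for illustration (not taken from the benchmarks):

```c
#include <stdint.h>

#define N 256

/* Unit-stride access: consecutive elements, a single vector load/store. */
void unit_stride(int16_t *dst, const int16_t *src) {
    for (int i = 0; i < N; i++)
        dst[i] = src[i];
}

/* Strided access: every k-th element, e.g. one channel of interleaved pixel data. */
void strided(int16_t *dst, const int16_t *interleaved, int stride) {
    for (int i = 0; i < N; i++)
        dst[i] = interleaved[i * stride];
}

/* Indexed (gather) access: element positions come from an index vector. */
void indexed(int16_t *dst, const int16_t *table, const int16_t *idx) {
    for (int i = 0; i < N; i++)
        dst[i] = table[idx[i]];
}
```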

VIRAM ISA Enhancements

• Multimedia processing
  – Support for multiple data types (64b/32b/16b)
    • Element & operation width specified with a control register
  – Saturated and fixed-point arithmetic (see the sketch below)
    • Flexible multiply-add model without accumulators
  – Simple element permutations for reductions and FFTs
  – Conditional execution using the flag registers
• General-purpose systems
  – TLB-based virtual memory
    • Separate TLB for vector loads & stores
  – Hardware support for reduced context-switch overhead
    • Valid/dirty bits for vector registers
    • Support for “lazy” save/restore of vector state
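To make the fixed-point support concrete, here is a rough idea of the source-level operation these features target: a Q15 FIR-style multiply-accumulate with saturation. This is an illustrative sketch, not code from VIRAM or EEMBC:

```c
#include <stdint.h>

/* Saturate a 32-bit intermediate result to the signed 16-bit range. */
static int16_t sat16(int32_t x) {
    if (x >  32767) return  32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}

/* Q15 multiply-accumulate with saturation: the kind of narrow, fixed-point
 * loop that the saturating arithmetic and multiply-add support address.
 * The input x must hold n + taps - 1 samples. */
void fir_q15(int16_t *y, const int16_t *x, const int16_t *h, int n, int taps) {
    for (int i = 0; i < n; i++) {
        int32_t acc = 0;
        for (int t = 0; t < taps; t++)
            acc += (int32_t)x[i + t] * h[t];   /* Q15 * Q15 -> Q30 product */
        y[i] = sat16(acc >> 15);               /* rescale to Q15 and saturate */
    }
}
```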

VIRAM Processor

[Figure: VIRAM processor organization]

VIRAM Chip

[Die floorplan: MIPS core, I/O & FPU, vector control, 4 vector lanes, and 8 DRAM macros]

• 0.18 µm CMOS process (IBM)
• Features
  – 8KB vector register file
  – 2 256-bit integer ALUs
  – 1 256-bit FPU
  – 13MB embedded DRAM
• 325 mm² die area
• 125M transistors
• 200 MHz, 2 Watts
• Peak vector performance
  – Integer: 1.6/3.2/6.4 Gop/s (64b/32b/16b)
  – Fixed-point: 2.4/4.8/9.6 Gop/s (64b/32b/16b)
  – FP: 1.6 Gflop/s (32b)

Vectorizing Compiler

[Compiler structure: C, C++, and Fortran95 frontends → Cray’s PDGCS optimizer → code generators for T3D/T3E, C90/T90/SV1, X1 (SV2), and VIRAM]

• Based on the Cray PDGCS compiler
  – Used with all vector and MPP Cray machines
• Extensive vectorization capabilities
  – Including outer-loop vectorization (see the sketch below)
• Vectorization of narrow operations and reductions
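A sketch of the kind of loop nest where outer-loop vectorization pays off: the inner loop (a few filter taps) is too short to fill a vector register, so the compiler vectorizes across the long outer loop instead, producing many output elements per vector instruction. Function and array names, and the Q15 scaling, are illustrative assumptions:

```c
#include <stdint.h>

/* Vertical filter over one image row: the inner loop over `taps` is short,
 * but each outer iteration is independent, so a vectorizer can process many
 * values of x per vector instruction. `in` must hold taps * width samples. */
void column_filter(int16_t *out, const int16_t *in,
                   const int16_t *coef, int width, int taps) {
    for (int x = 0; x < width; x++) {        /* long outer loop: vectorized */
        int32_t acc = 0;
        for (int t = 0; t < taps; t++)       /* short inner loop: kept scalar */
            acc += (int32_t)in[t * width + x] * coef[t];
        out[x] = (int16_t)(acc >> 15);
    }
}
```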

EEMBC Benchmarks

• The de facto industry-standard benchmark suite for embedded CPUs
• Used the consumer & telecommunications categories
  – Representative of the workload of multimedia devices with wireless/broadband capabilities
  – C code, EEMBC reference input data
• Consumer category
  – Image processing tasks for digital camera devices
  – Rgb2cmyk & Rgb2yiq conversions (sketched below), convolutional filter, JPEG encode & decode
• Telecommunications category
  – Encoding/decoding tasks for DSL/wireless
  – Autocorrelation, convolutional encoder, DSL bit allocation, FFT, Viterbi decoding
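As an example of a consumer kernel, here is a sketch of an RGB-to-YIQ conversion in the spirit of the Rgb2yiq benchmark. The Q15 coefficients approximate the standard NTSC conversion matrix and are not claimed to be EEMBC’s exact constants:

```c
#include <stdint.h>

/* RGB -> YIQ color-space conversion, roughly what the Rgb2yiq kernel computes.
 * Coefficients are the NTSC matrix scaled to Q15 (illustrative values). */
void rgb2yiq(int16_t *y, int16_t *i_out, int16_t *q,
             const uint8_t *r, const uint8_t *g, const uint8_t *b, int n) {
    for (int k = 0; k < n; k++) {
        int32_t R = r[k], G = g[k], B = b[k];
        y[k]     = (int16_t)(( 9798 * R + 19235 * G +  3735 * B) >> 15);
        i_out[k] = (int16_t)((19530 * R -  8979 * G - 10552 * B) >> 15);
        q[k]     = (int16_t)(( 6914 * R - 17138 * G + 10224 * B) >> 15);
    }
}
```

Every iteration is independent and the arrays are accessed with unit stride, which is why kernels of this kind vectorize almost completely.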

EEMBC Metrics

• Performance: repeats/second (throughput)
  – Geometric means are used to summarize scores
• Code size and static data size in bytes
  – Data sizes are the same for all the processors we discuss
• Pitfall: caching behavior
  – Fundamentally, most benchmarks have no temporal locality
  – Repeating the kernel on the same (small) data creates locality (see the sketch below)
  – An unfair advantage for cache-based architectures
    • VIRAM has no data cache for vector loads/stores
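A schematic of how a repeats/second score is measured, to illustrate the pitfall; this is not EEMBC’s actual harness code:

```c
#include <time.h>

/* The benchmark kernel under test (hypothetical declaration for the sketch). */
extern void kernel(void *input, void *output);

/* Because the same small input buffer is reprocessed on every iteration, it
 * stays resident in a data cache after the first pass. That creates temporal
 * locality which a single real-world pass over fresh data would not have,
 * inflating the score of cache-based designs. */
double repeats_per_second(void *input, void *output, int repeats) {
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        kernel(input, output);              /* same data every time */
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return repeats / seconds;
}
```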

Embedded Processors

Architecture           Chip          Issue Width   OOO   Caches L1I/L1D/L2 (KB)   Clock (MHz)   Power (W)
VIRAM (vector)         VIRAM         1             no    8 / –– / ––              200           2.0
x86 (CISC)             K6-IIIE+      3             yes   32 / 32 / 256            550           21.6
PowerPC (RISC)         MPC7455       4             yes   32 / 32 / 256            1000          21.3
MIPS (RISC)            VR5000        2             no    32 / 32 / ––             250           5.0
Trimedia (VLIW+SIMD)   TM1300        5             no    32 / 16 / ––             166           2.7
VelociTI (VLIW+DSP)    TMS320C6203   8             no    96 / 512 / ––            300           1.7

Degree of Vectorization

[Chart: % of dynamic operations executed as vector vs. scalar operations for Rgb2cmyk, Rgb2yiq, Filter, Cjpeg, Djpeg, Autocor, Convenc, Bital, Fft, and Viterbi]

• The typical degree of vectorization is 90%
  – Even for Viterbi decoding, which is only partially vectorizable
  – The ISA & compiler can capture most of the data-level parallelism in EEMBC
  – Great potential for vector hardware

Vector Length

[Chart: average and maximum vector length in elements per benchmark, for 16-bit and 32-bit operations, against the maximum supported vector length]

• Short vectors
  – Unavoidable with Viterbi; an artifact of the code with Cjpeg/Djpeg
• Even the shortest-vector operations have ~20 16-bit elements
  – More parallelism than 128-bit SSE/Altivec can capture

Code Size

[Chart: normalized code size for the consumer and telecommunications categories, with per-category bars for x86, MIPS, VIRAM, PowerPC, VLIW (cc), and VLIW (opt)]

• VIRAM’s high code density is due to
  – No loop unrolling or software pipelining, compact expression of strided/indexed/permutation patterns, no small loops
• Vectors: a “code compression” technique for RISC

Comparison: Performance

[Chart: performance per benchmark and geometric mean for VIRAM, PowerPC, VLIW (cc), and VLIW (opt)]

• The 200MHz, cache-less, single-issue VIRAM is
  – 100% faster than the 1-GHz, 4-way OOO processor
  – 40% faster than the manually-optimized VLIW with SIMD/DSP support
• VIRAM can sustain high computation throughput
  – Up to 16 (32-bit) or 32 (16-bit) arithmetic operations per cycle
• VIRAM can hide the latency of accesses to DRAM

Comparison: Performance/MHz

[Chart: performance per MHz per benchmark and geometric mean for VIRAM, PowerPC, VLIW (cc), and VLIW (opt)]

• VIRAM vs. the 4-way OOO processor & the VLIW with compiled code
  – 10x-50x better with long vectors, 2x-5x with short vectors
• VIRAM vs. the VLIW with manually optimized code
  – The VLIW comes within 50% of VIRAM
  – VIRAM would benefit from the same optimizations
• Similar results if normalized by power or complexity

VIRAM Scalability

[Chart: performance per benchmark and geometric mean for 1, 2, 4, and 8 vector lanes; geometric-mean speedups of 1.0, 1.7, 2.5, and 3.5 respectively]

• Same executable, clock frequency, and memory system across all configurations
• Decreased efficiency with 8 lanes
  – Short vector lengths and conflicts in the memory system
  – Difficult to hide overheads with short execution times
• Overall: 2.5x speedup with 4 lanes, 3.5x with 8 lanes
  – Could a superscalar processor be scaled as easily?

Conclusions

• Vector architectures are a great match for embedded multimedia processing
  – They combine high performance, low power, and low complexity
  – Add a vector unit to your media processor!
• VIRAM code density
  – Similar to x86, 5-10 times better than optimized VLIW
• VIRAM performance, with compiler vectorization and no hand-tuning
  – 2x the performance of the 4-way OOO superscalar
    • Even though the OOO processor runs at 5x the clock frequency
  – 50% faster than the manually-optimized 5- to 8-way VLIW designs
    • Even though the VLIW code has hand-inserted SIMD and DSP support
