Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Christos Kozyrakis David Patterson Stanford University U.C. Berkeley http://csl.stanford.edu/~christos Motivation • Ideal processor for embedded media processing – High performance for media tasks – Low cost • Small code size, low power consumption, highly integrated – Low power consumption (for portable applications) – Low design complexity – Easy to program with HLLs – Scalable • This work – How efficient is a simple vector processor for embedded media processing? • No cache, no wide issue, no out-of-order execution – How does it compare to superscalar and VLIW embedded designs? © C. Kozyrakis,11/ 2002 2 Outline • Motivation • Overview of VIRAM architecture – Multimedia instruction set, processor organization, vectorizing compiler • EEMBC benchmarks & alternative architectures • Evaluation – Instruction set analysis & code size comparison – Performance comparison – VIRAM scalability study • Conclusions © C. Kozyrakis,11/ 2002 3 VIRAM Instruction Set • Vector load-store instruction set for media processing – Coprocessor extension to MIPS architecture • Architecture state – 32 general-purpose vector registers – 16 flag registers – Scalar registers for control, addresses, strides, etc • Vector instructions – Arithmetic: integer, floating-point, logical – Load-store: unit-stride, strided, indexed – Misc: vector & flag processing (pop count, insert/extract) – 90 unique instructions © C. Kozyrakis,11/ 2002 4 VIRAM ISA Enhancements • Multimedia processing – Support for multiple data-types (64b/32b/16b) • Element & operation width specified with control register – Saturated and fixed-point arithmetic • Flexible multiply-add model without accumulators – Simple element permutations for reductions and FFTs – Conditional execution using the flag registers • General-purpose systems – TLB-based virtual memory • Separate TLB for vector loads & stores – Hardware support for reduced context switch overhead • Valid/dirty bits for vector registers • Support for “lazy” save/restore of vector state © C. Kozyrakis,11/ 2002 5 VIRAM Processor Microarchitecture © C. Kozyrakis,11/ 2002 6 VIRAM Chip • 0.18mm CMOS (IBM) • Features MIPS – 8KB vector register file Core – 2 256-bit integer ALUs DRAM DRAM DRAM DRAM – 1 256-bit FPU IO-FPU – 13MB DRAM • 325mm2 die area Vector LANE LANE LANE LANE • 125M transistors DRAM • 200 MHz, 2 Watts Control • Peak vector performance – Integer: 1.6/3.2/6.4 Gop/s (64b/32b/16b) – Fixed-point: 2.4/4.8/9.6 Gop/s (64b/32b/16b) DRAM DRAM DRAM DRAM – FP: 1.6 Gflop/s (32b) © C. Kozyrakis,11/ 2002 7 Vectorizing Compiler Frontends Optimizer Code Generators C T3D, T3E Cray’s C++ C90, T90, SV1 PDGCS Fortran95 X1(SV2), VIRAM • Based on Cray PDGCS compiler – Used with all vector and MPP Cray machines • Extensive vectorization capabilities – Including outer-loop vectorization • Vectorization of narrow operations and reductions © C. Kozyrakis,11/ 2002 8 EEMBC Benchmarks • The de-facto industrial standard for embedded CPUs • Used consumer & telecommunication categories – Representative of workload for multimedia devices with wireless/broadband capabilities – C code, EEMBC reference input data • Consumer category – Image processing tasks for digital camera devices – Rgb2cmyk & rgb2yiq conversions, convolutional filter, jpeg encode & decode • Telecommunication category – Encoding/decoding tasks for DSL/wireless – Autocorrelation compression, convolutional encoder, DSL bit allocation, FFT, Viterbi decoding © C. Kozyrakis,11/ 2002 9 EEMBC Metrics • Performance: repeats/second (throughput) – Use geometric means to summarize scores • Code size and static data size in bytes – Data sizes the same for the processors we discuss • Pitfall with caching behavior – Fundamentally, no temporal locality in most benchmarks – Repeating kernel on same (small) data creates locality – Unfair advantage for cache based architectures • VIRAM has no data cache for vector loads/stores © C. Kozyrakis,11/ 2002 10 Embedded Processors Issue Cache Architecture Chip Width OOO L1I/L1D/L2 Clock Power (KB) (MHz) (W) VIRAM Vector VIRAM 1 – 8/––/–– 200 2.0 x86 CISC K6-IIIE+ 3 Ö 32/32/256 550 21.6 PowerPC RISC MPC7455 4 Ö 32/32/256 1000 21.3 MIPS RISC VR5000 2 – 32/32/–– 250 5.0 Trimedia VLIW+ TM1300 5 – 32/16/–– 166 2.7 SIMD VelociTI VLIW+ TMS320C6 8 – 96/512/–– 300 1.7 DSP 203 © C. Kozyrakis,11/ 2002 11 Degree of Vectorization Vector Operations Scalar Operations 100% 80% 60% 40% 20% % of Dynamic Operations 0% Fft Filter Bital Cjpeg Djpeg Viterbi Rgb2yiq AutocorConvenc Rgb2cmyk • Typical degree of vectorization is 90% – Even for Viterbi decoding which is partially vectorizable – ISA & compiler can capture the data parallelism in EEMBC – Great potential for vector hardware © C. Kozyrakis,11/ 2002 12 Vector Length Average Maximum Supported 16-bit ops 32-bit ops 120 100 80 60 40 20 0 Vector Length (Elements) Fft Bital Filter Cjpeg Djpeg Cjpeg Djpeg Viterbi Rgb2yiq Convenc Autocor Rgb2cmyk • Short vectors – Unavoidable with Viterbi, artifact of code with Cjpeg/Djpeg • Even shortest length operations have ~20 16-bit elements – More parallelism than 128-bit SSE/Altivec can capture © C. Kozyrakis,11/ 2002 13 Code Size Consumer Telecommunications 12.3 5 4.1 4 3 2.5 2.0 1.9 1.9 2 1.6 1.1 1.0 1.1 1.0 1 0.6 Normalized Code Size 0 x86 x86 MIPS MIPS VIRAM VIRAM PowerPC VLIW (cc) PowerPC VLIW (cc) VLIW (opt) VLIW (opt) • VIRAM high code density due to – No loop unrolling/software pipelining, compact expression of strided/indexed/permutation patterns, no small loops • Vectors: a “code compression” technique for RISC © C. Kozyrakis,11/ 2002 14 Comparison: Performance VIRAM PowerPC VLIW (cc) VLIW (opt) 16 41 48 17 12 10 8 6 4 2 Performance 0 Fft Filter Bital Cjpeg Djpeg Viterbi Rgb2yiq Autocor Convenc Rgb2cmyk Geom. Mean • 200MHz, cache-less, single-issue VIRAM is – 100% faster than 1-GHz, 4-way OOO processor – 40% faster than manually-optimized VLIW with SIMD/DSP • VIRAM can sustain high computation throughput – Up to 16 (32-bit) to 32 (16-bit) arithmetic ops/cycle • VIRAM can hide latency of accesses to DRAM © C. Kozyrakis,11/ 2002 15 Comparison: Performance/MHz VIRAM PowerPC VLIW (cc) VLIW (opt) 43 112 131 31 33 25 20 15 10 5 Performance/MHz 0 Fft Filter Bital Cjpeg Djpeg Viterbi Rgb2yiq Autocor Convenc Rgb2cmyk Geom. Mean • VIRAM Vs 4-way OOO & VLIW with compiler code – 10x-50x with long vectors, 2x-5x with short vectors • VIRAM Vs VLIW with manually optimized code – VLIW within 50% of VIRAM – VIRAM would benefit from the same optimizations • Similar results if normalized by power or complexity © C. Kozyrakis,11/ 2002 16 VIRAM Scalability 1 Lane 2 Lanes 4 Lanes 8 Lanes 8 7 6 5 4 3.5 3 2.5 1.7 Performance 2 1.0 1 0 Rgbcmyk Rgbyiq Filter Cjpeg Djpeg Autocor Convenc Bital Fft Viterbi Geom. Mean • Same executable, frequency, and memory system • Decreased efficiency for 8 lanes – Short vector lengths, conflicts in the memory system – Difficult to hide overhead with short execution times • Overall: 2.5x with 4 lanes, 3.5x with 8 lanes – Can you scale similarly a superscalar processor? © C. Kozyrakis,11/ 2002 17 Conclusions • Vectors architectures are great match for embedded multimedia processing – Combined high performance, low power, low complexity – Add a vector unit to your media-processor! • VIRAM code density – Similar to x86, 5-10 times better than optimized VLIW • VIRAM performance • With compiler vectorization and no hand-tuning – 2x performance of 4-way OOO superscalar • Even if OOO runs at 5x the clock frequency – 50% faster than manually-optimized 5 to 8-way VLIW • Even if VLIW has hand-inserted SIMD and DSP support © C. Kozyrakis,11/ 2002 18.

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support