Evaluating Signal Processing and Multimedia Applications on SIMD, VLIW and Superscalar Architectures

Deependra Talla, Lizy K. John, Viktor Lapinskii, and Brian L. Evans
Department of Electrical and Computer Engineering
The University of Texas, Austin, TX 78712
{deepu, ljohn, lapinski, bevans}@ece.utexas.edu

Abstract

This paper aims to provide a quantitative understanding of the performance of DSP and multimedia applications on very long instruction word (VLIW), single instruction multiple data (SIMD), and superscalar processors. We evaluate the performance of the VLIW paradigm using Texas Instruments Inc.'s TMS320C62xx processor and the SIMD paradigm using Intel's Pentium II processor (with MMX) on a set of DSP and media benchmarks. Tradeoffs in superscalar performance are evaluated with a combination of measurements on the Pentium II and simulation experiments on the SimpleScalar simulator. Our benchmark suite includes kernels (filtering, autocorrelation, and dot product) and applications (audio effects, G.711 speech coding, and ADPCM speech compression). Optimized assembly libraries and compiler intrinsics were used to create the SIMD and VLIW code. We used the hardware performance counters on the Pentium II and the stand-alone simulator for the C62xx to obtain the execution cycle counts. In comparison to non-SIMD Pentium II performance, the SIMD version exhibits a speedup ranging from 1.0 to 5.5, while the speedup of the VLIW version ranges from 0.63 to 9.0. The benchmarks are seen to contain large amounts of available parallelism; however, most of it is inter-iteration parallelism. Out-of-order execution and branch prediction are observed to be extremely important to exploit such parallelism in media applications.

L. John is supported in part by the State of Texas Advanced Technology program grant #403, the National Science Foundation under grants CCR-9796098 and EIA-9807112, and by Dell, Intel, Microsoft, and IBM. B. Evans was supported under a US National Science Foundation CAREER Award under grant MIP-9702707.

1. Introduction

Digital signal processing (DSP) and multimedia applications are becoming increasingly important for computer systems as a dominant computing workload [1][2]. The importance of multimedia technology, services, and applications is widely recognized by designers.

Special-purpose multimedia processors such as the Trimedia from Philips, the Mpact from Chromatic Research, Mitsubishi's multimedia processor, and the Multimedia signal processor from Samsung usually have hardware assists in the form of peripherals for one or more of the multimedia decoding functions. The market for such special-purpose multimedia processors is primarily in low-cost embedded applications such as set-top boxes, wireless terminals, digital TVs, and stand-alone entertainment devices like DVD players. On the other hand, general-purpose CPUs accelerate audio and video processing through multimedia extensions.

The architecture of choice for general-purpose media extensions has been the Single Instruction Multiple Data (SIMD) paradigm. The Sun UltraSPARC processor enhanced with the "Visual Instruction Set" (VIS) [16], the "MultiMedia eXtensions" (MMX) and streaming-SIMD instructions from Intel [17], the 3DNow! extension from AMD [18], and AltiVec technology from Motorola [19] are examples of SIMD signal processing instruction set extensions on general-purpose processors. Such CPUs will likely take over multimedia functions like audio/video decoding and encoding, modem, telephony functions, and network access functions on a PC/workstation platform, along with the general-purpose computing they currently perform.

Another paradigm to exploit the fine- and coarse-grained parallelism of DSP applications is the very long instruction word (VLIW) architecture. VLIW processors rely on software to identify the parallelism and assemble wide instruction packets. VLIW architectures can exploit instruction-level parallelism (ILP) in programs even if vector-style data-level parallelism does not exist. A high-end DSP processor, the Texas Instruments TMS320C62xx, uses the VLIW approach.

DSP and media applications involve vectors, and SIMD-style processing is intuitively suited for these applications. Many DSP and multimedia applications can use vectors of packed 8-, 16-, and 32-bit integers and floating-point numbers, which allows them to benefit from SIMD architectures like MMX and VIS. Most of these applications are very structured and predictable, and their parallelism is potentially identifiable at compile time, favoring statically scheduled architectures over complex dynamically scheduled processors such as state-of-the-art superscalar processors. However, superscalar out-of-order processors have been commercially very successful, and are favored by many over architectures that depend heavily on efficient compilers.

Although SIMD and VLIW techniques present an opportunity for performance increase, to our knowledge no independent evaluations of applications comparing the two architectural paradigms have been reported in the literature. Are the aforementioned paradigms equivalent, or is one approach particularly favorable for DSP and media applications? This paper is an attempt to understand this issue based on a few media applications and several kernels. First, we evaluate the effectiveness of VLIW and SIMD processors for signal processing and multimedia applications, choosing one modern representative commodity processor from each category: Texas Instruments Inc.'s TMS320C62xx processor as the VLIW representative and Intel's Pentium II with MMX as the SIMD representative. The Pentium II evaluation utilized the on-chip performance monitoring counters, while the evaluation of the TMS320C62xx processor was done with a simulator from Texas Instruments. Although the C62xx processor is comparable to state-of-the-art general-purpose processors in machine-level parallelism, no performance comparison is available except for BDTI's rating, which portrays the C62xx's BDTImark to be twice that of the Pentium-MMX's score [9]. The BDTI rating is based on kernels only.

In addition to the major contribution of evaluating SIMD and VLIW architectures, we also analyze the performance of these applications on superscalar processors. The Pentium II is a superscalar processor and is used as the baseline (non-SIMD) for comparing the SIMD and VLIW architectures. However, since we are working with actual hardware and hardware monitoring counters, we cannot change the processor configuration or perform analysis of tradeoffs. Hence, we use the SimpleScalar simulator tools [29] to evaluate the impact of dynamic scheduling on media applications.

Compilers for the SIMD and VLIW paradigms are still in their infancy, and the burden of generating optimized code for these processors is still largely on the developers of an application. Achieving the largest performance increase would involve tailoring the source code for each specific kernel or application, often utilizing generic SIMD and VLIW libraries for common algorithms and kernels that can be accessed via function calls. Both Intel and TI provide a suite of optimized assembly libraries on their Web sites [20][21], and both Intel's C/C++ compiler [22] and TI's C62xx compiler [23] allow the use of 'intrinsics'. The MMX and VLIW intrinsics are coded with the syntax of the C language, but trigger the compiler to generate corresponding MMX instructions and optimized VLIW code, respectively. Using assembly language libraries and compiler intrinsics, we create SIMD and VLIW versions of our benchmark suite of kernels and applications and analyze the performance of media applications on these paradigms. We found that the SIMD versions of our benchmarks exhibited a speedup ranging from 1.0 to 5.5 (over non-SIMD), while the speedup of the VLIW version ranges from 0.63 to 9.0. We also observe that out-of-order execution techniques are extremely important to exploit parallelism in media applications.

Several efforts have analyzed the benefits of SIMD extensions on general-purpose processors [3][4][5][6][7][8]. An evaluation of MMX on a Pentium processor on kernels and applications was presented in [3]; however, such an analysis on a modern out-of-order speculative machine like the Pentium II has not been reported in the literature. Performance of image and video processing with VIS extensions was analyzed in [4], and the benefits of VIS were reported: conventional ILP techniques provided 2x to 4x performance improvements, and media extensions provided an additional 1.1x to 4.2x. Our work includes the VLIW paradigm as well. A performance increase from using AltiVec technology for DSP and multimedia kernels was reported in [6]. Performance analysis of MMX technology for an H.263 video encoder was presented in [8]. A number of commercial general-purpose and DSP processors have been benchmarked by BDTI [9] on a suite of 11 kernels; however, only a single performance metric denoting execution time is released in the public domain for all of the benchmarks together, the Pentium II has not been evaluated in their work, and their benchmark suite includes only kernels and no applications. A reasonable benchmark suite was presented in [10], but there are no SIMD or optimized VLIW versions of the benchmarks. Available parallelism in video workloads was measured in [11] with a VLIW architecture, but they assume an infinite number of functional units and a powerful compiler with 100% accurate prediction capabilities. Instead, our approach was to use state-of-the-art SIMD and VLIW commodity processors and realistic commercial compilers. An implementation of an MPEG-2 video decoder on a C62xx was presented in [12], which also compares the performance of the various MPEG components with MMX, HP MAX [7], and VIS. Several DSP processors and compilers were benchmarked in the DSPstone methodology [13], but only one application is benchmarked. A recent industrial consortium is the EEMBC effort [14] for benchmarking commercial processors for embedded applications; however, its source code is not available in the public domain. The effectiveness of Intel's Native Signal Processing (NSP) libraries was evaluated in [15], where the performance of a C62xx versus a Pentium II with MMX on DSP kernels is also reported, but no applications are incorporated. Jouppi and Wall [30] demonstrated the approximate equivalence of superscalar and superpipelined architectures; however, a decade and several ILP and SIMD processors later, we still do not have adequate quantitative studies comparing the performance of the various paradigms.

The rest of the paper is organized as follows. Section 2 describes the architectures modeled, Section 3 describes the benchmarks and the experimental methodology, and Section 4 analyzes the results. Section 5 concludes the paper.

2. Architectures

In this section, we describe the commodity processors chosen for this study: a Pentium II processor with MMX as the SIMD representative and the C62xx as the VLIW representative. The popularity of MMX and the availability of performance monitoring counters and tools, coupled with the availability of Intel media-specific libraries, made MMX a natural choice for the SIMD paradigm. Also, the major features of MMX technology are representative of other SIMD-style media processing extensions. The C62xx is the highest-performance VLIW DSP processor available. (Although it is a DSP processor, it is often classified as a general-purpose DSP processor.) The rest of this section describes the two architectures.

2.1 Pentium II (with MMX)

The Intel Pentium II processor is a three-way superscalar architecture (capable of retiring up to three micro-operations per cycle). It implements dynamic execution using an out-of-order, speculative execution engine, with register renaming of integer, floating-point, and flag variables, carefully controlled memory access reordering, and multiprocessing support [25]. Two integer units, two floating-point units, and one memory-interface unit allow up to five micro-ops to be scheduled per clock cycle. In addition, it provides the MMX instruction set extension. In our analysis, we used a 300 MHz Pentium II with 16 KB of L1 instruction and data caches and 512 KB of L2 cache.

MMX technology allows SIMD-style computations by packing many pieces of data into one 64-bit MMX register [26]. For example, image processing applications typically manipulate matrices of 8-bit data. Eight pieces of this data can be packed into an MMX register, arithmetic or logical operations can be performed on the pixels in parallel, and the results can be written to a register. Saturation and wrap-around arithmetic are also supported. Multiply-accumulate (MAC), a fundamental operation in the vector dot product, which is common in signal processing, graphics, and imaging applications, is part of the instruction set extension. Data widths of 8 and 16 bits are sufficient for speech, image, audio, and video processing applications as well as 3D graphics. MMX registers and state are aliased onto the floating-point registers and state, so no new registers or states are introduced by MMX. Maintaining this compatibility places limitations on MMX: the MMX registers are limited to the width of the floating-point registers (the MMX SIMD data type uses 64 of the 80 available bits), and mixing floating-point and MMX code becomes costly because of potential overhead when switching modes (up to a 50-cycle penalty for switching from MMX to floating-point).

2.2 TMS320C62xx

The Texas Instruments TMS320C62xx, the first general-purpose VLIW DSP processor, is a 32-bit fixed-point chip. It is capable of executing up to eight 32-bit instructions per cycle. The C62xx processor is designed around eight functional units that are grouped into two identical sets of four units each, and two register files, as shown in Fig. 1. The functional units are the D unit for memory load/store and add/subtract operations; the M unit for multiplication; the L unit for addition/subtraction, logical, and comparison operations; and the S unit for shifts in addition to add/subtract and logical operations. Each set of four functional units has its own register file, and a cross path is provided for accessing both register files by either set of functional units.

Figure 1. CPU core of C62xx (courtesy of TI)

The pipeline phases are divided into three stages: a fetch stage having four phases, a decode stage having two phases, and an execute stage that requires one to five phases. In the fetch stage of the pipeline, the program address is first generated, the address is sent to memory, the memory is read, and the fetch packet is received at the CPU. Instructions are always fetched eight at a time, and these eight instructions constitute a fetch packet.

A new fetch packet is not executed until the last instruction in the earlier fetch packet is executed. A fetch packet can take 1 cycle (fully parallel execution), 8 cycles (fully serial execution), or 2 to 7 cycles (partially serial execution).

Table 1. Summary of benchmark kernels and applications

Kernels
  Dot product (dotp): Dot product of a randomly initialized 1024-element array, repeated several times. Executes 362 million instructions.
  Autocorrelation (auto): Autocorrelation of a 4096-element vector with a lag of 256, repeated several times. Executes 444 million instructions.
  Finite Impulse Response Filter (fir): Low-pass filter of length 32 operating on a buffer of 256 elements, repeated several times. Executes 693 million instructions.

Applications
  Audio Effects (aud): Adding successive echo signals, signal mixing, and filtering on 128-block data, repeated several times. Executes 4 billion instructions.
  G.711 speech coding (g711): A-law to µ-law conversion and vice versa as specified by the ITU-T standard, on a block of 512 elements, repeated several times. Executes 163 million instructions.
  ADPCM speech compression (adpcm): 16-bit to 4-bit compression of a speech signal (obtained from Intel) on a 1024-element buffer, repeated several times. Executes 448 million instructions.
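The packed, saturating arithmetic central to MMX (Section 2.1) can be sketched with a small scalar emulation in plain C. This is an illustrative sketch only (the function name and loop are ours, not Intel's API); real MMX code performs all eight byte additions in a single instruction on one 64-bit register.

```c
#include <stdint.h>

/* Scalar emulation of a packed, unsigned-saturating byte add: the
 * kind of operation MMX applies to eight 8-bit pixels at once.
 * Saturation clamps at 255 instead of wrapping around, which is the
 * behavior wanted for image data. Illustrative only. */
static void packed_add_saturate_u8(const uint8_t *a, const uint8_t *b,
                                   uint8_t *out, int n)
{
    for (int i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        out[i] = (sum > 255u) ? 255u : (uint8_t)sum; /* clamp, don't wrap */
    }
}
```

With wrap-around arithmetic, 200 + 100 on bytes would produce 44; with saturation it produces 255, keeping a bright pixel bright.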

In the first phase of the decode stage of the pipeline, the fetch packets are split into execute packets. An execute packet consists of one instruction or of two to eight parallel instructions, and the instructions in an execute packet are assigned to the appropriate functional units. During the second phase of the decode stage, the source registers, destination registers, and associated paths are decoded for the execution of the instructions in the functional units. The execute stage of the pipeline is subdivided into five phases, and different types of instructions require different numbers of these phases to complete their execution. Delay slots are associated with several types of instructions: a multiply instruction has one delay slot, a load instruction has four delay slots, and a branch instruction has five delay slots. Most ALU instructions and store instructions have no delay slots. The interested reader is referred to [27] for a comprehensive description of the C62xx CPU and instruction set. The C62xx has a 4 KB direct-mapped L1 instruction cache and a 4 KB 2-way set-associative L1 data cache. It also has 64 KB of L2 memory that is divided into four banks, each of which can be configured as either cache or RAM.

3. Experimental methodology

3.1 Benchmarks

We profile the three signal processing and multimedia kernels and three applications described in Table 1. The kernels and applications in our benchmark suite form significant components of several real-world current and future workloads. All of the benchmarks are implemented using 16-bit fixed-point data except the g711 speech coding benchmark, which operates on 8-bit data. The instruction counts shown in Table 1 correspond to the non-SIMD Pentium II execution.

To study the effects of DSP and multimedia applications on VLIW and SIMD architectures, we created several versions of each of our benchmarks. The baseline non-SIMD versions of the benchmarks are written in the C language; optimized SIMD and VLIW versions were then created utilizing both assembly libraries and compiler intrinsics. We largely developed the non-SIMD code from public sources such as speech coding resources [24]. We compile the non-SIMD version of the benchmarks using the Intel C/C++ compiler [22] for the Pentium II (running Windows NT 4.0) with the maximum optimization setting, "Maximize for Speed".

In the process of creating our benchmark versions, we made efforts to ensure that we were comparing equivalent sections of code. Each of the three versions (non-SIMD, SIMD, and VLIW) was verified to be functionally equivalent, giving the same results after execution. In all cases, we buffer the data and monitor only the reading from the buffer and not the I/O. We do not monitor the initialization, setup routines, operating system work, or file I/O for any of the programs. The operating system component of these workloads has been shown to be negligible and is not expected to skew the results of this study.

Several other kernels such as convolution, Infinite Impulse Response (IIR) filtering, the Discrete Cosine Transform (DCT), and the Fast Fourier Transform (FFT) were also studied, but due to space constraints, we report results for only three kernels. In addition, we prefer to emphasize applications rather than kernels. The rest of this section describes the code development for the SIMD and VLIW versions, the tools used, and the metrics of performance evaluation.

3.2 Using Assembly Libraries

Intel's assembly libraries [20] provide versions of many common signal processing, vector arithmetic, and image processing kernels that can be called as C functions. However, some signal processing library calls require library-specific data structures to be created and initialized before calling kernels such as fir. Using the assembly libraries is thus restricted, and we used Intel's libraries only for the dot product and autocorrelation benchmarks (since only these two benchmarks have the same calling sequence for the C and library functions, and the library versions do not use any extra data structures). Unless the code developer can replace a complete function call in C with a call to the library function, the benefit of assembly libraries cannot be utilized completely. There is no loss of accuracy from using MMX because all versions of the benchmarks operate on 16-bit data (except g711 speech coding, which uses 8-bit data). Another issue with the use of Intel's libraries is that they are generally robust and intuitive, but employ a lot of error-checking code to guarantee functional correctness, which can potentially increase execution time. Also, the overhead of using MMX instructions (misalignment-related instructions and packing and unpacking instructions) should be less than the potential benefit of MMX.

TI provides optimized assembly code for the C62xx in [21]. These assembly libraries are C callable and have the same calling sequence as their C-code counterparts. We used assembly function calls in three of the six benchmarks: auto, fir, and aud. Several restrictions apply to these optimized C62xx VLIW assembly codes: the fir code requires that the number of filter coefficients be a multiple of 4, and the length of the auto vector must be a multiple of 8.

3.3 Using Compiler Intrinsics

Both the Intel and TI libraries are useful only if we can replace an entire function written in C with an equivalent C-callable assembly function call. In many applications such easily replaceable functions are difficult to find, especially for applications that do not use any of the kernels, such as the g711 speech coding and adpcm benchmarks in our case. Also, as mentioned above, using Intel's libraries for creating SIMD versions of the code introduces special data structures and overhead (fir). The Intel C/C++ compiler [22] and the C62xx compiler [23] provide intrinsics that inline MMX and C62xx assembly instructions, respectively. The compilers allow the use of C variables instead of hardware registers and also schedule the instructions to maximize performance.

For creating the SIMD versions of the benchmarks, we profiled the benchmarks to identify key procedures that can incorporate MMX instructions. The major computation was then replaced with an equivalent set of MMX instructions, with the original functionality maintained. We unrolled the loops manually to isolate multiple iterations of the loop body and then replaced each group of iterations with equivalent intrinsics. Both the non-SIMD and SIMD versions of the benchmarks have the same calling sequence and parameters. The SIMD versions that use intrinsics do an internal conversion to the required SIMD data type (the 64-bit MMX data type), operate on four data elements in parallel (each element being 16 bits wide, except g711, which uses bytes), and write the results as 16-bit words. For applications that use intrinsics, the overhead of using MMX should be less than the benefits gained from the MMX instructions. We incorporated the intrinsics into the fir, g711, and aud benchmarks. The adpcm benchmark could not use any intrinsics.

The C62xx compiler similarly provides intrinsics for inlining assembly instructions into C code. Some of the compiler intrinsics provided are "multiply two numbers, shift, and saturate", "approximate reciprocal square root", and "subtract lower and upper halves of two registers". All of the compiler intrinsics and their detailed descriptions can be obtained from [23]. We used the C62xx compiler intrinsics for creating the VLIW version of the dotp benchmark. We could not use intrinsics for the g711 speech coding and adpcm benchmarks and rely solely on the compiler to generate optimized code. While generating code with the C62xx compiler, the maximum optimization flag (-o3) was used. This optimization level performs various optimizations including software pipelining, function inlining, and loop unrolling.

3.4 Tools

We used VTune [20] and the on-chip performance counters for the Pentium II, and the stand-alone simulator for the C62xx processor. VTune is Intel's performance analysis tool; it can be used to obtain the complete instruction mix (assembly instructions) of the code and is designed for analyzing "hot spots" in the code and optimizing them. We used VTune to profile static instructions (for the non-SIMD and SIMD cases). The hardware performance counters present in the Pentium family of processors are used for gathering the execution characteristics of both the non-SIMD and SIMD versions of the benchmarks. Gathering information from the counters is simple and non-obtrusive (the benchmark is allowed to execute at normal speed). In addition to the execution clock cycles, the performance counters can be used to obtain the instruction mix and many runtime execution characteristics of the application. For the VLIW code, the execution cycle counts were obtained from the stand-alone simulator, which is especially useful for quick simulation of pieces of code [22]. The "clock( )" function provided in the simulator returns the execution times of the benchmarks.

3.5 Performance Measures

We use the execution time of the application as the primary performance measure for this study. Speedup is quantified as the ratio of the execution clock cycles of the SIMD and VLIW versions with respect to the non-SIMD C code.
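The speedup metric just defined can be written as a one-line helper. This is an illustrative sketch (our own function, not part of either toolchain):

```c
/* Speedup of a SIMD or VLIW version relative to the baseline
 * non-SIMD C code, both measured in execution clock cycles. */
static double speedup(double baseline_cycles, double optimized_cycles)
{
    return baseline_cycles / optimized_cycles;
}
```

A version that halves the cycle count has a speedup of 2.0, and a slowdown appears as a value below 1.0, as with the VLIW g711 result (0.63).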
Execution time is expressed in clock cycles and not in absolute units of time, both because the clock speeds of the two processors differ and because this paper is a comparison of SIMD and VLIW processing capabilities rather than a comparison of the Pentium II versus the C62xx processor. For SIMD processing, we also collected statistics such as the effect of SIMD on CPI, micro-operations, and branch frequencies. While measuring the execution cycle counts of each version of our benchmarks, we only monitor the processing of data already pre-loaded into memory. Input data for DSP and multimedia applications typically come from sources like sound cards, video cards, network cards, or analog-to-digital converters.

The amounts of L1 and L2 cache differ between the SIMD and the VLIW processor (16 KB L1 for SIMD and 4 KB L1 for VLIW; 512 KB L2 for SIMD and 64 KB for VLIW). We configured the 64 KB memory on the C62xx to be used as RAM and not as cache. We try to fit our data sets as much as possible in the L1 cache to eliminate the effects of memory latencies and to use the fastest possible memory for each processor.
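The manual unrolling step described in Section 3.3 — isolating several iterations of the loop body so each group can later be replaced by one packed intrinsic — can be sketched in plain C. The functions below are hypothetical illustrations (not the paper's benchmark code); the four independent 16-bit products in the unrolled body correspond to the four 16-bit lanes of a 64-bit MMX register.

```c
#include <stdint.h>

/* Reference loop: one 16-bit multiply-accumulate per iteration. */
static int32_t dot_ref(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * y[i];
    return acc;
}

/* The same loop unrolled four ways. Each group of four products is
 * what a single MMX multiply-accumulate intrinsic would consume.
 * Assumes n is a multiple of 4, mirroring the size constraints the
 * optimized library versions impose. */
static int32_t dot_unrolled4(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i += 4) {
        acc += (int32_t)x[i]     * y[i];
        acc += (int32_t)x[i + 1] * y[i + 1];
        acc += (int32_t)x[i + 2] * y[i + 2];
        acc += (int32_t)x[i + 3] * y[i + 3];
    }
    return acc;
}
```

The unrolled form computes exactly the same result; the transformation only exposes the independent operations that packed instructions can execute together.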

4. Analysis of Results

4.1 Comparison of SIMD and VLIW performance

Fig. 2 illustrates the performance of the SIMD and VLIW codes over the non-SIMD version, and the execution times are presented in Table 2. While interpreting the results, it should be remembered that the baseline (non-SIMD) performance is derived from a 3-way superscalar processor that performs dynamic scheduling to exploit ILP.

Figure 2. Ratios of execution times (speedups of the non-SIMD, SIMD, and VLIW versions for dotp, auto, fir, aud, adpcm, and g711)

Table 2. Execution clock cycles

  Benchmark        Non-SIMD (cycles)   SIMD (cycles)   VLIW (cycles)
  Dot Product      181242573           32804388        26600107
  Autocorrelation  222023315           44738100        24577801
  FIR filter       374628170           208238181       41370004
  Audio Effects    2191761094          1148164486      494700006
  ADPCM            381143255           381143255       281980004
  G.711            109593602           85404734        173190004

4.1.1 Performance of Kernels

Significant speedup is achieved by both the SIMD and VLIW versions over the non-SIMD code for the three kernels. The dotp kernel shows an improvement of approximately 5.5 times for the SIMD version over the non-SIMD version, despite using 16-bit data. Super-linear speedup is possible due to the presence of the pipelined multiply-accumulate instruction in MMX (latency of 3 cycles); for the non-SIMD case, the integer multiply operation takes 4 cycles. Over 80% of the dynamic instructions in the SIMD version were found to be MMX-related. The performance of the VLIW version of dotp is even better than the SIMD version, with a speedup close to 7. The VLIW code is capable of executing two data elements per clock cycle (on a 1-way machine it would take at least 5 clock cycles per data element: two for loads, one multiply, one add, and one store). Moreover, the C62xx code takes advantage of software pipelining to prefetch data three iterations before it is used. For the dotp benchmark, the VLIW code is able to utilize all eight functional units during kernel execution, as shown in Fig. 3.

  Opcode     Unit   Registers    Comment
  LOOP:
  LDW        .D1    *A4++, A0    Load 32-bit word from memory (two 16-bit words)
  LDW        .D2    *B4++, B0    Load 32-bit word from memory (two 16-bit words)
  MPY        .M1X   A0, B0, A1   Multiply two lower 16-bits
  MPYH       .M2X   A0, B0, B1   Multiply two upper 16-bits
  ADD        .L1    A1, A5, A5   Accumulate
  ADD        .L2    B1, B5, B5   Accumulate
  [A2] SUB   .S1    A2, 2, A2    Decrement loop counter
  [A2] B     .S2    LOOP         Branch to LOOP

Figure 3. Dot product kernel in one VLIW instruction packet (courtesy of TI)

The auto kernel shows a similar performance increase for both the SIMD and VLIW versions. As in the case of dotp, auto uses several multiply-accumulates. For the SIMD case, 88% of the dynamic instructions are MMX-related. In the case of the VLIW processor, over 90% of the fetch packets have only one execute packet (indicating that eight instructions are able to execute in parallel), and a majority of the remaining 10% of the fetch packets have only two execute packets (indicating an average of four instructions in parallel).

The fir benchmark shows a modest performance increase (1.8 speedup) for the SIMD version over the non-SIMD code when compared to the other two kernels. The proportion of MMX-related instructions in the overall dynamic stream is far less than in the other two kernels (29%). Also, the SIMD version needs four copies of the filter coefficients to avoid data misalignment. The Intel library version of the fir filter actually exhibited a speedup of only 1.6; this was due to additional data structures that had to be defined and error-checking code that can decrease performance while improving robustness. The VLIW version exhibits stronger performance than the SIMD version. Again, as in the case of dot product and autocorrelation, over 95% of the fetch packets had only one execute packet, with all eight instructions executing in parallel. The VLIW kernel codes were hand-optimized and provided as assembly libraries. Moreover, the VLIW code had constraints: the number of filter coefficients must be a multiple of 4, and the size of the auto vector must be a multiple of 8.

4.1.2 Results from Applications

Overall, the results of the VLIW versions of the applications are disappointing when compared to the performance improvements obtained on the kernels. Both g711 and adpcm involve significant control-dependent data dependencies, wherein execution is based either on table lookup or on conditional branch statements that depend on immediately preceding computations. The aud application was the only one where any appreciable parallelism could be exploited by the VLIW environment. The VLIW version of the aud application exhibits a speedup close to 4.5 over the non-SIMD version. The echo effects and signal mixing components of the VLIW version were unrolled manually eight times. The speedup achieved by the VLIW version of aud is nonetheless almost half that of the kernels. This is because the C62xx version was primarily developed in C, and only the filtering component utilized optimized assembly code; the compiler generates the echo effects and signal mixing components. The "interlist" utility of the C62xx compiler provides the user with the ability to interweave original C source code with compiler-generated assembly code. The compiler-generated assembly code for the echo effects and signal mixing components indicates that the compiler is unable to fill all the pipeline slots (several execute packets in each fetch packet). The compiler was unable to software pipeline the echo effects component, which effectively introduced 3 NOPs after every load and degraded performance. Moreover, even with a loop unrolling of 8, each of the eight computations produced the same pattern of 3 NOPs after every load. Since there is no out-of-order execution in the VLIW processor, loop unrolling in this instance contributes no performance increase but only increases code size.

The VLIW code for adpcm shows a speedup of 1.35 over the non-SIMD and SIMD cases. In this application, the C62xx compiler did not perform any loop unrolling or software pipelining. Since there is no parallelism to be exploited, unrolling would drastically increase the code size with little if any performance increase. Software pipelining was difficult because the loads in this application depended on the execution of the conditional branch statements. Thus the compiler-generated assembly code is non-optimal, with several branches followed by 5 NOPs and loads followed by four NOPs. Most of the fetch packets have eight execute packets (serial, as opposed to the desired parallel, execution).

The VLIW version of g711 shows a slowdown (0.63) relative to the non-SIMD code. Analysis showed that the base non-SIMD model, a 3-way dynamically scheduled superscalar processor, achieves an IPC of approximately 2.0, while the C62xx code for g711 has very few packets with more than one slot utilized. Branches are followed by NOPs for 5 cycles in the assembly code; there are also several loads due to the look-up table, with NOPs for 4 cycles inserted in the code. Because of static scheduling, the absence of branch prediction, and the control-dominated nature of the application, no parallelism could be exploited. Also, g711 operates on 8-bit data, so the remaining 24 bits (the C62xx data width is 32 bits internally) are wasted.

Even the speedup achieved by the applications from SIMD technology is not appreciable. The aud application shows a moderate speedup of around 2.0 for the SIMD code over the non-SIMD code. About 28% of the dynamic instructions are MMX-related. Loop unrolling of 4 was used for each of the echo effects, filtering, and signal mixing portions of this application. The adpcm benchmark does not use any MMX instructions because the algorithm is inherently sequential: each computation on a data sample depends on the result of the immediately preceding sample. The g711 SIMD version exhibited a speedup of 1.28 over the non-SIMD code; only around 4% of the instructions are MMX-related, and the performance increase is partly due to manual loop unrolling.

4.2 Available Parallelism in Media Applications

Signal processing and media applications typically contain a large amount of parallelism, and the low performance of adpcm and g711 prompted us to examine the benchmarks in detail. Using Tetra [28], a tool from Wisconsin, we analyzed the applications to find the available parallelism assuming infinite functional units, perfect branch prediction, perfect memory disambiguation, and register and memory renaming. Table 3 indicates the results. Ironically, g711 exhibits the highest parallelism. These results were obtained using Sun UltraSPARC code, compiled with gcc at the highest level of optimization possible.

Table 3. Available ILP

  Benchmark   ILP with infinite window   ILP with window size = 32
  dotp        3132                       3.053
  auto        17.53                      8.74
  fir         20.58                      4.13
  aud         21.16                      4.26
  adpcm       33.65                      6.12
  g711        5316                       9.63

4.3 Performance of a Superscalar Processor with a Similar Functional Unit Mix as the C62xx

The poor performance of g711 on the C62xx, combined with the observation that the available parallelism in g711 is extremely high, led us to examine the performance of a superscalar processor with functional units similar to the C62xx. We configured the SimpleScalar simulator [29] to a configuration approximately equivalent to the C62xx. Due to instruction set architecture and compiler differences, the comparison of execution times is not very meaningful; however, for the prudent experimentalist who will cautiously interpret the results, the comparison is illustrated in Table 4. Similarly, if the VLIW compiler succeeds in filling all eight slots of the packet, as it is able to do in the dotp, auto, and fir kernels, the VLIW processor's performance will be at least twice that of the equivalent superscalar. We could not obtain the absolute CPI for the C62xx because the simulator does not give the number of instructions executed. For the kernels, as mentioned earlier, almost all eight functional units are occupied in each cycle, but for the applications most of the code is serial.

Table 5. ILP obtained from a superscalar processor with different instruction issue widths

  (Issue width N with a C62xx-like functional unit mix; N=16 uses twice the C6x resources)
  Benchmark   IPC (N=8)   IPC (N=4)   IPC (N=2)   IPC (N=1)   IPC (N=16)
  dotp        4.980       3.325       1.665       0.833       5.531
  auto        3.997       3.347       1.714       0.857       4.995
  fir         3.707       3.221       1.665       0.847       4.677
  aud         3.597       3.086       1.623       0.836       4.490
  adpcm       1.591       1.532       1.192       0.782       1.785
  g711        4.073       3.027       1.562       0.789       4.081
Except for adpcm and g711, where the C6x compiler fails to exploit the parallelism, the VLIW architecture performance is better than superscalar 4.4 Other Observations architecture performance. We performed studies on 11 other kernels with optimized assembly, and on all of them Assembly-optimized code vs. Naïve C code Perform- the VLIW performance was superior. When we used na- ance: The C62xx processor performance is extremely ïve C compilation, C6x was still better than superscalar sensitive to compiler optimizations. Naïve compilation re- for one kernel, comparable for 3 kernels and worse for 7 sults in 2 to 14 times increased execution time in 11 dif- kernels, reaffirming the dependency of VLIW architec- ferent kernels that we studied including DCT, IIR, FFT, tures on compiler efficiency. etc. of superscalar performance for media ap- Table 4. C62x vs. approx. equivalent superscalar plications: As illustrated in Table 5, performance almost C62xx Execution doubles when issue width changes from 1 to 2 and then Superscalar Exe- time (million cycles) from 2 to 4, on a superscalar processor with C62xx-like Benchmark cution time (mil- (naïve C in configuration, but issue width of 4 achieves approxi- lion cycles) bracket) mately 85% of the performance of an equivalent 8-issue dotp 26 (53) 102.9 processor. A processor with twice the resources as C62xx auto 24.6 (49.5) 147.6 and twice the issue width (i. e. 16) and twice the register fir 41.3 (148) 271.5 aud 494 (1201) 1557.1 update unit (RUU) size only achieves an additional 15% adpcm 281 97.8 improvement in performance. g711 173 38.4 Performance of a Pentium II-like processor on Sim- pleScalar simulator: We thought it will be interesting to We also varied the instruction issue width from 1 to 8, observe the parallelism achieved by an approximately for the superscalar processor with C62xx-like functional Pentium II-like superscalar processor on the SimpleScalar unit mix and latencies. 
The interesting observation from simulator and compare it with the parallelism observed in this experiment (see Table 5) is that performance scales our hardware performance monitoring counter based up to issue width of 4 and after that return starts dimin- measurements on an actual Pentium-II. Although the ishing. It is also interesting to make an inference that if comparison of IPCs between a RISC-style ISA as in Sim- the C62xx VLIW compiler could generate packets with at pleScalar with the ISA may seem unfair, the number least 4 out of the 8 slots filled, the performance of the of micro-operations per x86 instruction was very close to VLIW processor would almost match the superscalar 1.0 (slightly more than 1.0) in these applications and, equivalent (85%). However, if the compiler is able to fill hence it is not totally unfair. The close correspondence in only 1 or 2 slots, the superscalar equivalent would excel IPCs except for g711 (in Table 6) is encouraging, consid- ering the widespread use of the SimpleScalar tool suite in Figure 4 also illustrates the percentage of MMX instruc- research. tions in the various SIMD style programs.
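A recurring theme above is that adpcm and g711 resist both SIMD and VLIW because each iteration consumes the result of the previous one. The contrast with a data-parallel kernel can be sketched in C (illustrative loops of our own, not code from the benchmark suite):

```c
#include <stddef.h>

/* FIR-style loop: each output depends only on the inputs, so iterations
 * are independent and can be unrolled, software-pipelined, or spread
 * across SIMD lanes and VLIW slots. */
static void fir_like(const int *x, const int *h, int *y,
                     size_t n, size_t taps)
{
    for (size_t i = 0; i + taps <= n; i++) {
        int acc = 0;
        for (size_t k = 0; k < taps; k++)
            acc += x[i + k] * h[k];
        y[i] = acc;
    }
}

/* ADPCM-style loop: the predictor state produced by one iteration is
 * consumed by the next, a loop-carried dependence that neither SIMD
 * lanes nor VLIW slots can break. */
static int adpcm_like(const int *sample, size_t n)
{
    int predicted = 0;
    for (size_t i = 0; i < n; i++)
        predicted = (predicted + sample[i]) / 2; /* serial chain */
    return predicted;
}
```

Unrolling the second loop, as the C62xx compiler was asked to do for adpcm, only replicates the serial chain; it cannot shorten it.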

Table 6. Measurement vs simulation (PII vs SS)
          dotp    auto    fir     aud     adpcm   g711
IPC-PII   2.0     2.0     1.85    1.85    1.176   1.49
IPC-SS    2.22    2.18    2.12    2.06    1.235   2.03
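MMX packs four 16-bit operands into one 64-bit register, so one instruction does the work of four scalar ones. A scalar model of this behavior, mimicking the semantics of the MMX paddsw (packed saturating signed 16-bit add) instruction, with function names of our own:

```c
#include <stdint.h>

/* One lane of a saturating signed 16-bit add. */
static int16_t add_sat16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Scalar model of an MMX paddsw-style operation: four independent
 * 16-bit saturating adds that the hardware performs in a single
 * instruction on 64-bit registers. */
static void paddsw_model(const int16_t a[4], const int16_t b[4],
                         int16_t out[4])
{
    for (int lane = 0; lane < 4; lane++)
        out[lane] = add_sat16(a[lane], b[lane]);
}
```

The saturation also removes the overflow-test branches that scalar audio code would otherwise need, which is one source of the branch-frequency reduction reported below for the MMX versions.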

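The observation that follows — less execution time yet higher CPI — is a matter of arithmetic: MMX removes more instructions than cycles. A minimal sketch with made-up counts (not measured values from our experiments):

```c
/* CPI = cycles / instructions; speedup = base cycles / new cycles.
 * The hypothetical counts in the test below are chosen to show that a
 * SIMD version can run faster (fewer cycles) while its CPI rises. */
static double cpi(double cycles, double instructions)
{
    return cycles / instructions;
}

static double speedup(double base_cycles, double new_cycles)
{
    return base_cycles / new_cycles;
}
```

With, say, 1000 instructions in 500 cycles for the scalar version (CPI 0.5) and 400 instructions in 300 cycles for the SIMD version (CPI 0.75), the SIMD code is about 1.67 times faster even though its CPI is 50% higher.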
Impact of MMX on CPI: Although the MMX (SIMD) versions of the benchmarks take less time to execute than the non-SIMD code, the CPI of the MMX version is higher than that of the non-MMX version (see Figure 4). The number of dynamic instructions executed when using MMX is significantly lower than in the non-MMX version. Although there is an increase in CPI, each SIMD instruction does four times more work than an ordinary instruction.

[Figure 4. Effect of MMX on CPI — CPI of the No-MMX and MMX versions, and percentage of MMX instructions, for dotp, auto, fir, aud, adpcm, and g711.]

Effect of MMX on branch frequencies: As expected (Figure 5), the branch frequency decreases in general, except in the case of g711, which shows a marginal increase (but it has only 4% MMX instructions). The decrease is primarily due to processing multiple elements in parallel and loop unrolling.

[Figure 5. Effect of MMX on branch frequencies — percentage of branches in the No-MMX and MMX versions of dotp, auto, fir, aud, adpcm, and g711.]

Cases where SIMD parallelism is non-applicable, but VLIW parallelism is applicable: ILP in applications with no data parallelism can still be exploited with a wider VLIW processor. adpcm is the only benchmark that even slightly demonstrates this phenomenon. In the version of adpcm that we used, neither compiler intrinsics nor assembly routines could exploit any SIMD-style data parallelism. The Pentium II is a superscalar processor with a machine-level parallelism of 3, while the C62xx processor has a machine-level parallelism of 8. Using the SimpleScalar simulator, we found that an approximate Pentium II configuration achieves an IPC of 1.23 for adpcm, while a superscalar with the approximate mix of functional units as in the C62xx can achieve an IPC of 1.59. This residual ILP is exploited by the wider C62xx, leading to the slightly superior performance of the C62xx over the Pentium II.

Cases where MMX performance exceeds C62xx performance: g711 is the only benchmark where MMX performance (even baseline Pentium II performance) exceeds the C62xx performance. This is due to the inability of the C62xx compiler to schedule multiple instructions in a packet, and the ability of the Pentium II to exploit ILP and achieve an IPC of approximately 2.0.

Impact of out-of-order execution: We simulated an in-order processor with the mix of C6x functional units. The IPCs improve approximately 15% as the issue width increases from 1 to 8; however, the IPC values did not exceed 1.0. Hence we concur with Ranganathan et al. [4] that conventional ILP techniques are important for media workloads. One characteristic feature of these workloads is that most of the parallelism in them is between iterations rather than within an iteration. In fact, most instructions inside an iteration are dependent on the preceding instructions inside the same iteration. Unless branch prediction and out-of-order execution are performed, it is difficult to extract the parallelism in these programs using an ILP processor.

5. Conclusion

This paper evaluated the effectiveness of SIMD, VLIW, and superscalar techniques for signal processing and multimedia applications, choosing a representative commodity processor from each category. We observed that:

• SIMD techniques provide a significant speedup for signal processing and multimedia applications. The observed speedups over a 3-way superscalar out-of-order execution machine range from 1.0 to 5.5.
• VLIW techniques provide significantly greater benefits than SIMD on kernels, but do not maintain the same ratio on applications. Observed speedup over the entire suite of benchmarks ranged from 0.63 to 9.0. For optimized kernels, a factor of 8 improvement was observed over a 3-way out-of-order execution superscalar processor.
• Assuming perfect branch prediction and infinite resources, the available parallelism in our benchmark programs is seen to be between 20 and 5300. Most of this parallelism is between iterations rather than within iterations.
• Out-of-order execution and branch prediction as in state-of-the-art superscalar processors are observed to be extremely important for media applications.
• Although the compiler dependency of SIMD and VLIW architectures favors the superscalar paradigm, this paradigm does not scale much beyond 4-issue for media applications. The increase in resources required to double the performance would be enormously high.
• Compiler intrinsics provide the user with ways to develop both SIMD and VLIW code at a higher level of code development, rather than resorting to hand-coded assembly or libraries that may not have the required functions or are slow due to overhead.
• Assembly libraries assist in the development of optimized applications; however, they often introduce new data structures and overhead, in addition to the fact that they may not fit into all applications.
• VLIW techniques (without branch prediction) do not provide any significant performance increase in applications that contain frequent control-dependent data dependencies, like g711 and adpcm.
• SIMD techniques have the potential to reduce the number of dynamic instructions and, more importantly, the branch frequency (up to half in some cases).
• Compiler technology still needs to improve for both SIMD and VLIW architectures (even though intrinsics provide some ease of programming when compared to hand-coded assembly, it is still up to the code developer to find the data and instruction parallelism).

In future work we would like to evaluate the newly announced Willamette (from Intel) and TMS320C64xx (from TI) processors. We also plan to implement more video and image processing applications, such as H.263, JPEG, and MPEG, on both SIMD and VLIW processors.

References

[1] K. Diefendorff and P.K. Dubey, "How multimedia workloads will change processor design", IEEE Computer, pp. 43-45, Sep. 1997.
[2] C.E. Kozyrakis and D.A. Patterson, "A new direction for computer architecture research", IEEE Computer, pp. 24-32, Nov. 1998.
[3] R. Bhargava, L. John, B. Evans and R. Radhakrishnan, "Evaluating MMX technology using DSP and multimedia applications", Proc. 31st IEEE Int. Sym. on Microarchitecture, pp. 37-46, Dec. 1998.
[4] P. Ranganathan, S. Adve and N. Jouppi, "Performance of image and video processing with general-purpose processors and media ISA extensions", Proc. 26th Int. Sym. on Computer Architecture, pp. 124-135, May 1999.
[5] W. Chen, H.J. Reekie, S. Bhave and E.A. Lee, "Native signal processing on the UltraSparc in the Ptolemy environment", Proc. IEEE Asilomar Conf. on Signals, Systems, and Computers, pp. 1368-1372, Nov. 1996.
[6] H. Nguyen and L. John, "Exploiting SIMD parallelism in DSP and multimedia algorithms using the AltiVec technology", Proc. 13th ACM Int. Conf. on Supercomputing, pp. 11-20, Jun. 1999.
[7] R.B. Lee, "Multimedia extensions for general-purpose processors", Proc. IEEE Workshop on Signal Processing Systems, pp. 9-23, Nov. 1997.
[8] V. Lappalainen, "Performance analysis of Intel MMX technology for an H.263 video encoder", Proc. 6th ACM Int. Conf. on Multimedia, pp. 309-314, Sep. 1998.
[9] J. Bier and J. Eyre, "Independent DSP benchmarking: methodologies and latest results", Proc. Int. Conf. on Signal Processing Applications and Technology, Sep. 1998.
[10] C. Lee, M. Potkonjak and W.H. Smith, "MediaBench: A tool for evaluating and synthesizing multimedia and communications systems", Proc. 30th Int. Sym. on Microarchitecture, pp. 330-335, Dec. 1997.
[11] H. Liao and A. Wolfe, "Available parallelism in video applications", Proc. 30th Int. Sym. on Microarchitecture, pp. 321-329, Dec. 1997.
[12] S. Sriram and C.Y. Hung, "MPEG-2 video decoding on the TMS320C6x DSP architecture", Proc. IEEE Asilomar Conf. on Signals, Systems, and Computers, Nov. 1998.
[13] V. Zivojnovic, J. Martinez, C. Schlager and H. Meyr, "DSPstone: A DSP-oriented benchmarking methodology", Proc. Int. Conf. on Signal Proc. Appl. and Tech., Oct. 1994.
[14] EDN Embedded Microprocessor Benchmark Consortium, http://www.eembc.org.
[15] D. Talla and L. John, "Performance evaluation and benchmarking of native signal processing", Proc. 5th European Parallel Processing Conference, pp. 266-270, Sep. 1999.
[16] L. Kohn et al., "The Visual Instruction Set (VIS) in UltraSPARC", COMPCON Digest of Papers, Mar. 1995.
[17] Intel, "Pentium III processor home", http://developer.intel.com/design/PentiumIII/prodbref/
[18] AMD, "Inside 3DNow! technology", http://www.amd.com/products/cpg/k623d/inside3d.html
[19] Motorola, "AltiVec Technology", http://www.mot.com/SPS/PowerPC/AltiVec/index.html
[20] Intel, "Performance Library Suite", http://developer.intel.com/vtune/perflibst/index.htm.
[21] Texas Instruments, "TMS320C6000 benchmarks", http://www.ti.com/sc/docs/products/dsp/c6000/62bench.htm
[22] Intel, "C/C++ compiler", http://developer.intel.com/vtune/compiler/cpp/index.htm.
[23] Texas Instruments, "TMS320C6x Optimizing C Compiler User's Guide", Lit. Num. SPRU187B.
[24] Speech Coding Resource, http://www-mobile.ecs.soton.ac.uk/speech_codecs/.
[25] D. Bhandarkar and J. Ding, "Performance characterization of the Pentium Pro processor", Proc. of 3rd Int. Sym. on High Performance Computer Architecture, pp. 288-297, Feb. 1997.
[26] A. Peleg and U. Weiser, "The MMX technology extension to the Intel architecture", IEEE Micro, vol. 16, no. 4, pp. 42-50, Aug. 1996.
[27] Texas Instruments, "TMS320C6000 CPU and instruction set reference guide", Lit. Num. SPRU189D.
[28] T. Austin and G. Sohi, "Tetra: Evaluation of serial program performance on fine-grain parallel processors", Technical Report, University of Wisconsin; also "Dynamic dependency analysis of ordinary programs", Proc. of the 19th Int. Sym. on Computer Architecture, pp. 342-351, 1992.
[29] T. Austin and D. Burger, "The SimpleScalar Tool Set, Version 2.0", TR-1342, Computer Sciences Department, University of Wisconsin, Madison.
[30] N.P. Jouppi and D.W. Wall, "Available instruction-level parallelism for superscalar and superpipelined machines", Proc. of Int. Sym. on Architectural Support for Programming Languages and Operating Systems, pp. 272-282, 1989.