Evaluating MMX Technology Using DSP and Multimedia Applications

Evaluating MMX Technology Using DSP and Multimedia Applications Ravi Bhargava, Lizy K. John, Brian L. Evans and Ramesh Radhakrishnan Electrical and Computer Engineering Department The University of Texas at Austin fravib,ljohn,bevans,radhakri g@ece .ute xas. edu Keywords: Digital Signal Pro cessing, Machine Mea- 1 Intro duction surement, MMX, Performance Monitoring, Workload Characterization. Demand for digital signal pro cessing DSP and multimedia capabilities on a p ersonal computer has b een increasing to accommo date 3-D graphics, video con- Abstract ferencing, and other applications. The PC industry's attempt to satisfy these demands resulted in the rst Many current general purpose processors are using addition to the X86 instruction set architecture ISA extensions to the instruction set architecture to enhance in almost a decade. This extension, intro duced in 1996, the performance of digital signal processing DSP and has b een dubb ed MMX for MultiMedia eXtension multimedia applications. In this paper, we evaluate [1,2] and can outp erform lower-end DSP pro cessors [3]. the X86 architecture's multimedia extension MMX This technology adds new assembly instructions and instruction set on a set of benchmarks. Our benchmark data typ es to the existing ISA to exploit the data par- suite includes kernels ltering, fast Fourier transforms, allelism that is often available in DSP and multimedia and vector arithmetic and applications JPEG com- applications. In this study, weinvestigate the p erfor- pression, Doppler radar processing, imaging, and G.722 mance of a suite of programs on Intel Pentium pro ces- speech encoding. Each benchmark has at least one sors with MMX technology. non-MMX version in C and an MMX version that MMX and native signal pro cessing NSP exten- makes cal ls to an MMX assembly library. The versions to general purp ose pro cessors [4, 5] are single- sions di er in the implementation of ltering, vector instruction multiple-data SIMD architectures. One arithmetic, and other relevant kernels. The observed SIMD data typ e which contains several pieces of data speedup for the MMX versions of the suite ranges from is sent to a pro cessing unit. By packing many pieces of less than 1.0 to 6.1. In addition to quantifying the data into one 64-bit MMX register, several calculations speedup, we perform detailed instruction level pro l- can take place simultaneously [6]. For example, image ing using Intel's VTune pro ling tool. Using VTune, pro cessing applications typically manipulate matrices we pro le static and dynamic instructions, microarchi- of 8-bit data. Eight pieces of this data could b e packed tecture operations, and data references to isolate the into an MMX register, arithmetic or logical op erations speci c reasons for speedup or lack thereof. This anal- could be p erformed on the pixels in parallel, and the ysis al lows one to understand which aspects of native results could b e written to a register. signal processing instruction sets are most useful, the Toachieve this functionality, MMX technology adds current limitations, and how they can be utilized most 57 new assembly instructions to the X86 instruction eciently. set. These instructions can op erate on any of the L. John is supported by the National Science Founda- packed data typ es and on unsigned or signed data. tion under Grants CCR-9796098 CAREER Award, and EIA- Saturation and wrap-around arithmetic are also sup- 9807112, and a grant from the Texas AdvancedTechnology Pro- p orted. Multiply-accumulate MAC, a frequent op- gram. B. Evans was supported by the US National Science Foundation CAREER Award under grant MIP-9702707, the eration in DSP applications, is also part of the ISA. US Defense AdvancedResearch Projects Agency under contract MMX can multiply 8-bit and 16-bit xed-p oint data, DAAB07-97-C-J007, Accelerix, and Rockwel l. but not 32-bit data. Data widths of 8 and 16 bits are sucient for sp eech, image, audio, and video pro cessing applications as well as 3-D graphics. could interface it with the Intel libraries. MMX maintains full compatibility with existing X86 Although MMX presents opp ortunity for op erating systems and applications. MMX registers p erformance increase, to our knowledge no indep en- and state are aliased onto the oating-p oint registers dent evaluations of applications on an X86 pro cessor and state, so no new registers or states are intro duced with MMX corrob orate and explain the p erformance by MMX. Maintaining compatibility places limitations increase. We quantify sp eedup for our b enchmark suite on MMX. The MMX registers are limited to the width of kernels and applications and o er insight into de- of the oating-p oint registers MMX uses 64 of the 80 velopment of DSP and multimedia applications on the available bits and mixing of oating p oint and MMX Pentium family of pro cessors. We analyze variations in co de b ecomes costly. the execution time, dynamic co de size, static co de size, In DSP applications, several algorithms surface more number of memory references, function calls, number frequently than others [3, 4, 7, 8, 9, 10, 11, 12]. The of MMX instructions, and mix of MMX instructions. most common DSP kernels are the nite impulse re- Sp eedup is achieved on some, but not all, DSP and sp onse FIR lter, in nite impulse resp onse I IR l- multimedia applications. The e ects of packing and ter, fast Fourier transform FFT, least mean square unpacking do not undermine the advantages of MMX, LMS adaptive lters, and matrix-vector arithmetic. and some applications do not require packing and un- Applications commonly b enchmarked are sp eech, au- packing of data b ecause of prop erly aligned data. On dio, image and video compression systems. applications such as JPEG image compression that ran Previous e orts have analyzed NSP on general pur- slower with MMX, we found that the core kernels of p ose pro cessors [5,13]. However, e orts to compare ap- the applications show sp eedup, but numerous calls to plications with MMX instructions versus applications NSP assembly libraries as well as the formatting of without MMX on the same X86 pro cessor have b een data for these libraries prove to be signi cant over- incomplete [8]. The results found in [8] are only antici- head. The most e ective MMX applications require pated results based on simulation. A b enchmarking of bu ered data and high data parallelism. The instruc- several applications on the UltraSPARC pro cessor [4] tion mix of the applications allows us to provide more using the Visual Instruction Set VIS showed a p er- insight into these sp eci c observations and elucidate formance sp eedup for some DSP applications over non- other concerns. VIS versions. Applications with FIR lters showed the In this pap er, Section 2 describ es the b enchmark most improvement while I IR lters and FFTs exhibited programs. Section 3 describ es the exp eriment metho d- little or no p erformance increase [4]. ology, and Section 4 analyzes the results. Section 5 To study the e ects of DSP and multimedia pro- concludes the pap er. grams, we need co de with and without MMX for these applications. It is imp ortant to note that there are no publicly available compilers that supp ort MMX in- 2 Benchmarks structions. This means that the burden of incorp orat- ing MMX is placed solely on the develop ers of the ap- For our study,we pro le four DSP and multimedia plication. Achieving the largest p erformance increase kernels and four applications. Table 1 summarizes the would involve tailoring MMX assembly co de for each implementations and general characteristics of the four sp eci c application or kernel and then in-lining this kernels and four applications that comprise our MMX assembly co de into the application source co de. A b enchmark suite. We provide more information ab out less time-consuming metho d would b e to write generic our b enchmark source co de at the following Web site: MMX libraries for common algorithms and kernels http://www.ece.utexas.edu/~ljohn/mmxdsp/. which can b e accessed via function calls. The rest of this section provides some details on the Intel provides a suite of optimized assembly libraries b enchmarks. on their Web site [14] and with VTune [15]. Recentver- sions, which include the Signal Pro cessing Library 4.0, 2.1 Kernels Recognition Primitives Library 3.1, and Image Pro cessing Library 2.0, supp ort xed-p oint functions that uti- Finite Impulse Resp onse FIR Filters allow lize MMX and oating-p oint functions. Not all DSP al- certain frequency comp onents of the input to pass un- gorithms have corresp onding MMX functions e.g. the changed to the output while blo cking other comp o- LMS algorithm. We develop ed some C co de and ac- nents. FIR lters are moving average lters. Their quired the remainder from several sources [10, 16, 17, resp onse to an impulse dies away in a nite number of 18]. We lo oked for co de that is fast and ecient, yet samples. The output y nisaweighted average of the somewhat mo dular and easy to interpret so that we Table 1: Summary of Benchmark Kernels and Applications Kernels Fast Fourier Transform fft 4096 p oint, in-place FFT Finite Impulse Resp onse Filter fir Low-pass lter of length 35 i.e. 35 co ecients and 35 entry history.

Load more