Evaluating MMX Technology
Using DSP and Multimedia Applications
Ravi Bhargava, Lizy K. John, Brian L. Evans and Ramesh Radhakrishnan
Electrical and Computer Engineering Department
The University of Texas at Austin
fravib,ljohn,bevans,radhakri g@ece .ute xas. edu
Keywords: Digital Signal Pro cessing, Machine Mea- 1 Intro duction
surement, MMX, Performance Monitoring, Workload
Characterization.
Demand for digital signal pro cessing DSP and mul-
timedia capabilities on a p ersonal computer has b een
increasing to accommo date 3-D graphics, video con-
Abstract
ferencing, and other applications. The PC industry's
attempt to satisfy these demands resulted in the rst
Many current general purpose processors are using
addition to the X86 instruction set architecture ISA
extensions to the instruction set architecture to enhance
in almost a decade. This extension, intro duced in 1996,
the performance of digital signal processing DSP and
has b een dubb ed MMX for MultiMedia eXtension
multimedia applications. In this paper, we evaluate
[1,2] and can outp erform lower-end DSP pro cessors [3].
the X86 architecture's multimedia extension MMX
This technology adds new assembly instructions and
instruction set on a set of benchmarks. Our benchmark
data typ es to the existing ISA to exploit the data par-
suite includes kernels ltering, fast Fourier transforms,
allelism that is often available in DSP and multimedia
and vector arithmetic and applications JPEG com-
applications. In this study, weinvestigate the p erfor-
pression, Doppler radar processing, imaging, and G.722
mance of a suite of programs on Intel Pentium pro ces-
speech encoding. Each benchmark has at least one
sors with MMX technology.
non-MMX version in C and an MMX version that
MMX and native signal pro cessing NSP exten-
makes cal ls to an MMX assembly library. The ver-
sions to general purp ose pro cessors [4, 5] are single-
sions di er in the implementation of ltering, vector
instruction multiple-data SIMD architectures. One
arithmetic, and other relevant kernels. The observed
SIMD data typ e which contains several pieces of data
speedup for the MMX versions of the suite ranges from
is sent to a pro cessing unit. By packing many pieces of
less than 1.0 to 6.1. In addition to quantifying the
data into one 64-bit MMX register, several calculations
speedup, we perform detailed instruction level pro l-
can take place simultaneously [6]. For example, image
ing using Intel's VTune pro ling tool. Using VTune,
pro cessing applications typically manipulate matrices
we pro le static and dynamic instructions, microarchi-
of 8-bit data. Eight pieces of this data could b e packed
tecture operations, and data references to isolate the
into an MMX register, arithmetic or logical op erations
speci c reasons for speedup or lack thereof. This anal-
could be p erformed on the pixels in parallel, and the
ysis al lows one to understand which aspects of native
results could b e written to a register.
signal processing instruction sets are most useful, the
Toachieve this functionality, MMX technology adds
current limitations, and how they can be utilized most
57 new assembly instructions to the X86 instruction
eciently.
set. These instructions can op erate on any of the
L. John is supported by the National Science Founda-
packed data typ es and on unsigned or signed data.
tion under Grants CCR-9796098 CAREER Award, and EIA-
Saturation and wrap-around arithmetic are also sup-
9807112, and a grant from the Texas AdvancedTechnology Pro-
p orted. Multiply-accumulate MAC, a frequent op-
gram. B. Evans was supported by the US National Science
Foundation CAREER Award under grant MIP-9702707, the
eration in DSP applications, is also part of the ISA.
US Defense AdvancedResearch Projects Agency under contract
MMX can multiply 8-bit and 16-bit xed-p oint data,
DAAB07-97-C-J007, Accelerix, and Rockwel l.
but not 32-bit data. Data widths of 8 and 16 bits are
sucient for sp eech, image, audio, and video pro cess-
ing applications as well as 3-D graphics.
could interface it with the Intel libraries. MMX maintains full compatibility with existing X86
Although MMX presents opp ortunity for op erating systems and applications. MMX registers
p erformance increase, to our knowledge no indep en- and state are aliased onto the oating-p oint registers
dent evaluations of applications on an X86 pro cessor and state, so no new registers or states are intro duced
with MMX corrob orate and explain the p erformance by MMX. Maintaining compatibility places limitations
increase. We quantify sp eedup for our b enchmark suite on MMX. The MMX registers are limited to the width
of kernels and applications and o er insight into de- of the oating-p oint registers MMX uses 64 of the 80
velopment of DSP and multimedia applications on the available bits and mixing of oating p oint and MMX
Pentium family of pro cessors. We analyze variations in co de b ecomes costly.
the execution time, dynamic co de size, static co de size, In DSP applications, several algorithms surface more
number of memory references, function calls, number frequently than others [3, 4, 7, 8, 9, 10, 11, 12]. The
of MMX instructions, and mix of MMX instructions. most common DSP kernels are the nite impulse re-
Sp eedup is achieved on some, but not all, DSP and sp onse FIR lter, in nite impulse resp onse I IR l-
multimedia applications. The e ects of packing and ter, fast Fourier transform FFT, least mean square
unpacking do not undermine the advantages of MMX, LMS adaptive lters, and matrix-vector arithmetic.
and some applications do not require packing and un- Applications commonly b enchmarked are sp eech, au-
packing of data b ecause of prop erly aligned data. On dio, image and video compression systems.
applications such as JPEG image compression that ran Previous e orts have analyzed NSP on general pur-
slower with MMX, we found that the core kernels of p ose pro cessors [5,13]. However, e orts to compare ap-
the applications show sp eedup, but numerous calls to plications with MMX instructions versus applications
NSP assembly libraries as well as the formatting of without MMX on the same X86 pro cessor have b een
data for these libraries prove to be signi cant over- incomplete [8]. The results found in [8] are only antici-
head. The most e ective MMX applications require pated results based on simulation. A b enchmarking of
bu ered data and high data parallelism. The instruc- several applications on the UltraSPARC pro cessor [4]
tion mix of the applications allows us to provide more using the Visual Instruction Set VIS showed a p er-
insight into these sp eci c observations and elucidate formance sp eedup for some DSP applications over non-
other concerns. VIS versions. Applications with FIR lters showed the
In this pap er, Section 2 describ es the b enchmark most improvement while I IR lters and FFTs exhibited
programs. Section 3 describ es the exp eriment metho d- little or no p erformance increase [4].
ology, and Section 4 analyzes the results. Section 5 To study the e ects of DSP and multimedia pro-
concludes the pap er. grams, we need co de with and without MMX for these
applications. It is imp ortant to note that there are
no publicly available compilers that supp ort MMX in-
2 Benchmarks
structions. This means that the burden of incorp orat-
ing MMX is placed solely on the develop ers of the ap-
For our study,we pro le four DSP and multimedia
plication. Achieving the largest p erformance increase
kernels and four applications. Table 1 summarizes the
would involve tailoring MMX assembly co de for each
implementations and general characteristics of the four
sp eci c application or kernel and then in-lining this
kernels and four applications that comprise our MMX
assembly co de into the application source co de. A
b enchmark suite. We provide more information ab out
less time-consuming metho d would b e to write generic
our b enchmark source co de at the following Web site:
MMX libraries for common algorithms and kernels
http://www.ece.utexas.edu/~ljohn/mmxdsp/.
which can b e accessed via function calls.
The rest of this section provides some details on the
Intel provides a suite of optimized assembly libraries
b enchmarks.
on their Web site [14] and with VTune [15]. Recentver-
sions, which include the Signal Pro cessing Library 4.0,
2.1 Kernels
Recognition Primitives Library 3.1, and Image Pro cess-
ing Library 2.0, supp ort xed-p oint functions that uti-
Finite Impulse Resp onse FIR Filters allow
lize MMX and oating-p oint functions. Not all DSP al-
certain frequency comp onents of the input to pass un-
gorithms have corresp onding MMX functions e.g. the
changed to the output while blo cking other comp o-
LMS algorithm. We develop ed some C co de and ac-
nents. FIR lters are moving average lters. Their
quired the remainder from several sources [10, 16, 17,
resp onse to an impulse dies away in a nite number of
18]. We lo oked for co de that is fast and ecient, yet
samples. The output y nisaweighted average of the
somewhat mo dular and easy to interpret so that we
Table 1: Summary of Benchmark Kernels and Applications
Kernels
Fast Fourier Transform fft 4096 p oint, in-place FFT
Finite Impulse Resp onse Filter fir Low-pass lter of length 35 i.e. 35 co ecients and 35 entry history.
In nite Impulse Resp onse Filter iir Butterworth, direct form, eighth-order bandpass lter. Filter length of eight with
17 co ecients.
Matrix and Vector Arithmetic matvec Matrix-vector multiplication of a 512 512 matrix with a vector of length 512. Dot
pro duct on twovectors of length 512.
Applications
JPEG Image compression jpeg Compresses an image into JPEG format. Converted an 118 kB Windows bitmap
image into a JPEG image. Primary kernels include vector arithmetic for imaging
and the discrete cosine transform DCT kernel.
Image Manipulation image Dimming and switching the colors of a Windows bitmap. 480 640 Red-Green-
Blue RGB image in which each pixel is represented by 24 bits. Essentially vector
addition and multiplication.
G.722 Sp eech Enco ding g722 Standard for digital enco ding and compression of sp eech and audio signals. Uses
adaptive di erential pulse co de mo dulation ADPCM. Enco ded a 6 kB sp eech le.
Doppler Radar Pro cessing radar Subtracts successive complex echo signals to remove stationary targets from a radar
signal and estimates the p ower sp ectrum of the resulting samples. The dominant
frequency is then estimated using the p eak of the FFT sp ectrum. The FFT is a
16-p oint, in-place, radix-2, decimation-in-time FFT.
input values xn. Fast Fourier Transform FFT is an ecient al-
gorithm for computing the discrete Fourier transform
M 1
X
DFT of a sequence. The DFT is represented as fol-
y n= c xn k : 1
k
lows:
N 1
k=0
X