Evaluating MMX Technology

Using DSP and Multimedia Applications



Ravi Bhargava, Lizy K. John, Brian L. Evans and Ramesh Radhakrishnan

Electrical and Computer Engineering Department

The University of Texas at Austin

fravib,ljohn,bevans,radhakri g@ece .ute xas. edu

Keywords: Digital Signal Pro cessing, Machine Mea- 1 Intro duction

surement, MMX, Performance Monitoring, Workload

Characterization.

Demand for digital signal pro cessing DSP and mul-

timedia capabilities on a p ersonal computer has b een

increasing to accommo date 3-D graphics, video con-

Abstract

ferencing, and other applications. The PC industry's

attempt to satisfy these demands resulted in the rst

Many current general purpose processors are using

addition to the instruction set architecture ISA

extensions to the instruction set architecture to enhance

in almost a decade. This extension, intro duced in 1996,

the performance of digital signal processing DSP and

has b een dubb ed MMX for MultiMedia eXtension

multimedia applications. In this paper, we evaluate

[1,2] and can outp erform lower-end DSP pro cessors [3].

the X86 architecture's multimedia extension MMX

This technology adds new assembly instructions and

instruction set on a set of benchmarks. Our benchmark

data typ es to the existing ISA to exploit the data par-

suite includes kernels  ltering, fast Fourier transforms,

allelism that is often available in DSP and multimedia

and vector arithmetic and applications JPEG com-

applications. In this study, weinvestigate the p erfor-

pression, Doppler radar processing, imaging, and G.722

mance of a suite of programs on pro ces-

speech encoding. Each benchmark has at least one

sors with MMX technology.

non-MMX version in C and an MMX version that

MMX and native signal pro cessing NSP exten-

makes cal ls to an MMX assembly library. The ver-

sions to general purp ose pro cessors [4, 5] are single-

sions di er in the implementation of ltering, vector

instruction multiple-data SIMD architectures. One

arithmetic, and other relevant kernels. The observed

SIMD data typ e which contains several pieces of data

speedup for the MMX versions of the suite ranges from

is sent to a pro cessing unit. By packing many pieces of

less than 1.0 to 6.1. In addition to quantifying the

data into one 64-bit MMX register, several calculations

speedup, we perform detailed instruction level pro l-

can take place simultaneously [6]. For example, image

ing using Intel's VTune pro ling tool. Using VTune,

pro cessing applications typically manipulate matrices

we pro le static and dynamic instructions, microarchi-

of 8-bit data. Eight pieces of this data could b e packed

tecture operations, and data references to isolate the

into an MMX register, arithmetic or logical op erations

speci c reasons for speedup or lack thereof. This anal-

could be p erformed on the pixels in parallel, and the

ysis al lows one to understand which aspects of native

results could b e written to a register.

signal processing instruction sets are most useful, the

Toachieve this functionality, MMX technology adds

current limitations, and how they can be utilized most

57 new assembly instructions to the X86 instruction

eciently.

set. These instructions can op erate on any of the



L. John is supported by the National Science Founda-

packed data typ es and on unsigned or signed data.

tion under Grants CCR-9796098 CAREER Award, and EIA-

Saturation and wrap-around arithmetic are also sup-

9807112, and a grant from the Texas AdvancedTechnology Pro-

p orted. Multiply-accumulate MAC, a frequent op-

gram. B. Evans was supported by the US National Science

Foundation CAREER Award under grant MIP-9702707, the

eration in DSP applications, is also part of the ISA.

US Defense AdvancedResearch Projects Agency under contract

MMX can multiply 8-bit and 16-bit xed-p oint data,

DAAB07-97-C-J007, Accelerix, and Rockwel l.

but not 32-bit data. Data widths of 8 and 16 bits are

sucient for sp eech, image, audio, and video pro cess-

ing applications as well as 3-D graphics.

could interface it with the Intel libraries. MMX maintains full compatibility with existing X86

Although MMX presents opp ortunity for op erating systems and applications. MMX registers

p erformance increase, to our knowledge no indep en- and state are aliased onto the oating-p oint registers

dent evaluations of applications on an X86 pro cessor and state, so no new registers or states are intro duced

with MMX corrob orate and explain the p erformance by MMX. Maintaining compatibility places limitations

increase. We quantify sp eedup for our b enchmark suite on MMX. The MMX registers are limited to the width

of kernels and applications and o er insight into de- of the oating-p oint registers MMX uses 64 of the 80

velopment of DSP and multimedia applications on the available bits and mixing of oating p oint and MMX

Pentium family of pro cessors. We analyze variations in co de b ecomes costly.

the execution time, dynamic co de size, static co de size, In DSP applications, several algorithms surface more

number of memory references, function calls, number frequently than others [3, 4, 7, 8, 9, 10, 11, 12]. The

of MMX instructions, and mix of MMX instructions. most common DSP kernels are the nite impulse re-

Sp eedup is achieved on some, but not all, DSP and sp onse FIR lter, in nite impulse resp onse I IR l-

multimedia applications. The e ects of packing and ter, fast Fourier transform FFT, least mean square

unpacking do not undermine the advantages of MMX, LMS adaptive lters, and matrix-vector arithmetic.

and some applications do not require packing and un- Applications commonly b enchmarked are sp eech, au-

packing of data b ecause of prop erly aligned data. On dio, image and video compression systems.

applications such as JPEG image compression that ran Previous e orts have analyzed NSP on general pur-

slower with MMX, we found that the core kernels of p ose pro cessors [5,13]. However, e orts to compare ap-

the applications show sp eedup, but numerous calls to plications with MMX instructions versus applications

NSP assembly libraries as well as the formatting of without MMX on the same X86 pro cessor have b een

data for these libraries prove to be signi cant over- incomplete [8]. The results found in [8] are only antici-

head. The most e ective MMX applications require pated results based on simulation. A b enchmarking of

bu ered data and high data parallelism. The instruc- several applications on the UltraSPARC pro cessor [4]

tion mix of the applications allows us to provide more using the VIS showed a p er-

insight into these sp eci c observations and elucidate formance sp eedup for some DSP applications over non-

other concerns. VIS versions. Applications with FIR lters showed the

In this pap er, Section 2 describ es the b enchmark most improvement while I IR lters and FFTs exhibited

programs. Section 3 describ es the exp eriment metho d- little or no p erformance increase [4].

ology, and Section 4 analyzes the results. Section 5 To study the e ects of DSP and multimedia pro-

concludes the pap er. grams, we need co de with and without MMX for these

applications. It is imp ortant to note that there are

no publicly available compilers that supp ort MMX in-

2 Benchmarks

structions. This means that the burden of incorp orat-

ing MMX is placed solely on the develop ers of the ap-

For our study,we pro le four DSP and multimedia

plication. Achieving the largest p erformance increase

kernels and four applications. Table 1 summarizes the

would involve tailoring MMX assembly co de for each

implementations and general characteristics of the four

sp eci c application or kernel and then in-lining this

kernels and four applications that comprise our MMX

assembly co de into the application source co de. A

b enchmark suite. We provide more information ab out

less time-consuming metho d would b e to write generic

our b enchmark source co de at the following Web site:

MMX libraries for common algorithms and kernels

http://www.ece.utexas.edu/~ljohn/mmxdsp/.

which can b e accessed via function calls.

The rest of this section provides some details on the

Intel provides a suite of optimized assembly libraries

b enchmarks.

on their Web site [14] and with VTune [15]. Recentver-

sions, which include the Signal Pro cessing Library 4.0,

2.1 Kernels

Recognition Primitives Library 3.1, and Image Pro cess-

ing Library 2.0, supp ort xed-p oint functions that uti-

Finite Impulse Resp onse FIR Filters allow

lize MMX and oating-p oint functions. Not all DSP al-

certain frequency comp onents of the input to pass un-

gorithms have corresp onding MMX functions e.g. the

changed to the output while blo cking other comp o-

LMS algorithm. We develop ed some C co de and ac-

nents. FIR lters are moving average lters. Their

quired the remainder from several sources [10, 16, 17,

resp onse to an impulse dies away in a nite number of

18]. We lo oked for co de that is fast and ecient, yet

samples. The output y nisaweighted average of the

somewhat mo dular and easy to interpret so that we

Table 1: Summary of Benchmark Kernels and Applications

Kernels

Fast Fourier Transform fft 4096 p oint, in-place FFT

Finite Impulse Resp onse Filter fir Low-pass lter of length 35 i.e. 35 co ecients and 35 entry history.

In nite Impulse Resp onse Filter iir Butterworth, direct form, eighth-order bandpass lter. Filter length of eight with

17 co ecients.

Matrix and Vector Arithmetic matvec Matrix-vector multiplication of a 512  512 matrix with a vector of length 512. Dot

pro duct on twovectors of length 512.

Applications

JPEG Image compression jpeg Compresses an image into JPEG format. Converted an 118 kB Windows bitmap

image into a JPEG image. Primary kernels include vector arithmetic for imaging

and the discrete cosine transform DCT kernel.

Image Manipulation image Dimming and switching the colors of a Windows bitmap. 480  640 Red-Green-

Blue RGB image in which each pixel is represented by 24 bits. Essentially vector

addition and multiplication.

G.722 Sp eech Enco ding g722 Standard for digital enco ding and compression of sp eech and audio signals. Uses

adaptive di erential pulse co de mo dulation ADPCM. Enco ded a 6 kB sp eech le.

Doppler Radar Pro cessing radar Subtracts successive complex echo signals to remove stationary targets from a radar

signal and estimates the p ower sp ectrum of the resulting samples. The dominant

frequency is then estimated using the p eak of the FFT sp ectrum. The FFT is a

16-p oint, in-place, radix-2, decimation-in-time FFT.

input values xn. Fast Fourier Transform FFT is an ecient al-

gorithm for computing the discrete Fourier transform

M1

X

DFT of a sequence. The DFT is represented as fol-

y n= c xn k : 1

k

lows:

N1

k=0

X

j 2kn=N

X k = xn e 3

On eachinvo cation, the fir lter takes one new in-

n=0

put value and returns one new output value. The

nk j 2kn=N

Letting W = e , 3 can b e simpli ed to

non-MMX versions p erform 32-bit oating-p oint arith-

metic, while the MMX version p erforms 16-bit xed-

k

X k = X n+W X n; 4

ev od

N=2

p oint calculations. Applications of an FIR lter in-

clude sp eech pro cessing, audio pro cessing mo dem chan-

where X represents the even-indexed elements and

ev

nel equalization, linear predictive co ding, and general

X represents the o dd-indexed elements. The DFT

od

ltering.

can be divided into even and odd halves rep eatedly

In nite Impulse Resp onse I IR Filters are

until only two-p oint DFTs remain to evaluate.

also frequency selective and include autoregressive l-

Our implementation of the FFT fft supplies all of

ters. I IR lters feed backaweighted sum of previous

the data at once to the functions that compute it. The

output values y n and add this to a weighted sum of

non-MMX version of the FFT p erform 32-bit, oating-

previous and current input values xn

p oint calculations. The MMX version uses 16-bit xed-

p oint data. FFT-based applications include radar pro-

Q1

P 1

X X

cessing, sonar pro cessing, MPEG audio compression,

y n= b xn q  a y n p: 2

q p

ADSL mo dems, and sp ectral analysis.

q=0 p=0

Matrix-Vector Arithmetic includes dot pro ducts

and matrix-vector multiplications matvec. All ver- A given order I IR lter can b e made more frequency

sions use 16-bit xed-p oint data. Anynumeric compu- selective than the same order FIR lter, making them

tation with some degree of data parallelism can take more computationally ecient for the same b ehavior.

advantage of vector arithmetic, esp ecially image appli- Unlike the FIR lter, our IIR lter iir p erforms

cations. blo ck ltering, passing eight samples to the IIR lter

The DSP kernels b enchmarked in our suite contain function p er invo cation. The non-MMX versions p er-

a C-only co de version, a version that uses Intel's MMX form 64-bit oating-p oint arithmetic, while the MMX

assembly library functions, and, if applicable, a version version p erforms 16-bit xed-p oint calculations. Appli-

that uses Intel's oating-p oint assembly library func- cations include audio equalization, sp eech compression,

tions. It is regular practice to optimize common DSP linear predictive co ding and general ltering.

kernels in assembly co de, so we compare the MMX gram takes an image stored as a long array of 8-bit

version to b oth a compiler-optimized version and a values and prop erly scales the values to pro duce a dim-

hand-optimized version. There is no hand-optimized ming e ect. This primarily consists of vector multiply

oating-p ointversion for the vector arithmetic matvec op erations. Second, the program increases or decreases

b ecause it uses only integer data. the value of certain pixels to pro duce a switch in colors.

This function primarily involves vector addition. This

application shows no loss of qualitybetween the MMX

2.2 Applications

and C-only versions when viewed using the Intel image

library viewing routine.

Doppler Radar Pro cessing radar subtracts suc-

All of the non-MMX versions of the applications

cessive complex echo signals to remove stationary tar-

generously use 32-bit xed-p oint, 32-bit oating-p oint,

gets from a radar signal and then estimates the p ower

and 64-bit oating-p oint numb ers to p erform arith-

sp ectrum of the resulting samples. The mean frequency

metic. This strategy is safe, increases precision, and is

is then estimated by nding the p eak of the FFT sp ec-

not very costly on a general-purp ose pro cessor. Since

trum. The input is complex and represents 12 range

ecient use of MMX requires either 16-bit or 8-bit

lo cations from each echo [16]. The MMX version of

data, MMX versions reduce some data to 16 or 8 bit

this application uses the vector arithmetic kernels and

values where appropriate. The applications run with-

FFT kernel. There is little measured change in the out-

6

out signi cant loss of precision. For larger applications,

put precision 10 between the MMX and non-MMX

we use VTune to pro le the C version of each appli-

versions.

cation, so that we can optimize the most frequently

JPEG is a standardized compression metho d for

called functions with MMX assembly functions if p os-

full-color and gray-scale images. We use JPEG to com-

sible. For each application, the functions that are op-

press images with loss of information | the output im-

timized account for 65-75 of the samples during the

age is not necessarily identical to the input image. For

non-MMX application's runtime.

natural images, medium compression ratios may pro-

In the pro cess of creating our b enchmark with and

duce no visible change, and high compression ratios will

without MMX, we made e orts to ensure wewere com-

pro duce low-quality images which may be tolerable.

paring equivalent sections of co de. In some programs,

Our JPEG b enchmark program jpeg [17] compresses

data is obtained from a le or written to a le. In these

but not decompresses images. It p erforms color con-

cases, we bu er the data and monitor only the reading

version, two-dimensional forward discrete cosine trans-

from the bu er and not the I/O. We do not monitor the

form DCT, and quantization of the DCT co ecients.

initialization, setup routines, op erating system work,

We use it to reduce an 118 kB Windows bitmap le into

or le I/O for any of the programs. We strive only to

a 7 kB JPEG le. The MMX version shows no visible

monitor the core of the kernels and applications and

di erence in quality than the non-MMX version, al-

measure the e ects of MMX p erforming useful DSP

though some precision is lost in the pixel calculations.

and multimedia work.

Wevalidated the JPEG results pro duced by b oth the

MMX and non-MMX versions of the JPEG enco der by

using the Imaging for Windows NT program.

3 Metho dology

G.722 Sp eech Enco ding g722 is a standard for

compressing and decompressing sp eech using adaptive

We compile the b enchmarks using Microsoft Visual

di erential pulse co de mo dulation ADPCM. The in-

C++ 5.0 on a Pentium I I pro cessor running Windows

put signal to the enco der is 16-bit data sampled at 16

NT 4.0. All programs are compiled with the optimiza-

kHz. Output from the enco der is 8 bits at an 8 kHz

tion \Maximize for Sp eed." Intel provides a separate

sample rate. The deco der op erates in exactly the opp o-

version of their libraries for di erentPentium versions.

site fashion. An 8-bit co ded input signal is deco ded by

For MMX, we cho ose the Pentium II MMX version

the ADPCM deco ders. The result is a 16 kHz sampled

since it is their most recent, and presumably most ef-

output [16] [19]. We enco deda6kBspeech le. Both

cient, version. We obtain p erformance data for the

versions of this application p erform real-time enco ding

b enchmarks by using the dynamic analysis utility in

and deco ding. Only one sample of sp eech is enco ded

VTune 2.5.1 [15].

and deco ded at a time. The quality of sp eech in the

MMX version is tolerable, but slightly inferior to that

3.1 Using Intel's Assembly Libraries

of the C-only version.

Imaging program image manipulates the pixels

We create the b enchmarks by taking ecient, re-

in a 640  480 bitmap image uniformly. First, the pro-

liable, C programs and then mo difying them to use

Pentium, e.g., cache misses and branch target bu er assembly libraries. Intel's assembly libraries [14] pro-

misses. The most recent version available while p er- vide versions of many common signal pro cessing, vector

forming this research VTune 2.5.1, May 1998 do es arithmetic, and image pro cessing kernels. While the

not yet have the full capabilities to simulate the dy- Intel library functions are generally robust, intuitive,

namic execution micro-core of the and and functionally correct, limitations exist in b oth the

Pentium I I pro cessors. libraries and existing C co de.

First, the C programs often use oating-p oint and

32-bit xed-p ointnumb ers, but the Intel MMX libraries

3.3 Metrics

do not provide functions that accept 32-bit xed-p oint

inputs. We rarely can simply quantize parameters e.g.,

Execution time and the resulting sp eedup are the

lter co ecients to 16-bit or 8-bit xed-p oint formats

primary metrics we use to evaluate MMX. We also

and achieve acceptable results. The library functions

analyze the number of times individual assembly in-

require that the output data b e the same length as the

structions, including MMX instructions, are executed

input data. A \scale factor" is provided to handle over-

instruction mix, numb er of instructions executed dur-

ow, but scaling often results in a loss of precision. In

ing actual runtime dynamic instructions, number of

their implementation, this scale factor must b e known

unique instructions executed static instructions, num-

a priori to use MMX and therefore must allow for the

b er of micro-op erations that are dynamically executed,

largest p ossible over ow.

and architecture related p enalties cache misses, branch

Second, C co de can present obstacles to eciently

target bu er misses, and other CPU-related p enalties.

using MMX. In order to exploit the data parallelism in

Not all of the obtained statistics are accurate for ev-

C lo ops that access vector elements sequentially, all of

ery pro cessor in the Pentium family. Instruction mix,

the temp orary variables in the lo op must b e placed in

dynamic instructions, and static instructions are valid

vectors as well. Signi cantoverhead may b e asso ciated

for any Pentium family pro cessor with MMX. Micro-

with this extra allo cation of memory, esp ecially if al-

op erations only apply to the Pentium Pro with MMX

lo cated dynamically. Similarly, some signal pro cessing

and Pentium I I pro cessors. Clo ck cycles and architec-

library calls require library-sp eci c data structures to

ture related p enalties are unique to the Pentium with

b e created and initialized b efore calling kernels suchas

MMX pro cessor.

FIR and I IR lters.

Table 2 provides the basic characteristics of the

Third, b ecause the same registers and state are used

b enchmarks as measured byVTune. Included are the

for b oth oating-p oint and MMX co de, p otential over-

number of dynamic X86 instructions, the number of

head exists when switching b etween mo des. The emms

static X86 instructions that pro duced the dynamic in-

Empty MMX State instruction that switches from

structions, the numb er of dynamic micro-op erations or

MMX to oating-p oint mo de can incur up to a 50-cycle

-ops Pentium Pro and Pentium I I, the p ercentage

p enalty [18].

of X86 instructions that use any memory referencing

mo de, and the p ercentage of X86 instructions that are

3.2 Using VTune

MMX instructions.

Figure 1a shows the p ercentage of MMX instruc-

VTune is a system p erformance pro ling to ol cre- tions within the MMX version of each program. The

ated by Intel [15]. The dynamic analysis utility pro- programs are arranged in ascending order of sp eedup.

vides a way to monitor which assembly instructions in The sp eedup obtained by each program is shown ab ove

a given application are b eing executed. VTune takes each bar. In addition, the MMX instructions are bro-

this stream of instructions and uses them to simulate ken down into their individual categories: packing and

timing and p erformance information. unpacking, MMX arithmetic, MMX moves 64-bit,

VTune is capable of simulating timing using the Pen- and the emms instruction. Figure 1b presents infor-

tium sup erscalar, two-pip eline, in-order architecture. mation on static and dynamic instruction countasthe

It traces the execution of the entire application. In or- ratio of C-only version to MMX version.

der to isolate co de, the user may sp ecify starting and

stopping execution p oints so that only results for that

4 Analysis of Results

section of the program are rep orted. When comput-

ing cache, branching, and other p enalties, however, the

Table 3 presents some of the results of the study as

entire application is considered. Clo ck cycles are cal-

ratios of non-MMX version totals to MMX version to-

culated from the known latency of each assembly in-

tals. Figure 2a presents C-only to MMX ratios for

struction and known latency of each p enalty on the

execution time sp eedup, dynamic instructions, and

Table 2: Benchmark Instruction Characteristics

Benchmark Static Dynamic Dynamic  Memory  MMX

Program Instructions -ops Instructions References Instructions

t.c 110 8,429,851 5,619,929 53.64

t.fp 1,446 3,285,827 2.389.118 54.61

t. 1,640 2,585,564 1,842,347 49.54 4.69

r.c 32 2,580,000 2,112,000 40.62

r.fp 78 2,922,288 2,190,000 42.46

r.mmx 218 2,040,889 1,332,051 31.98 20.27

iir.c 60 2,924,802 2,678,258 22.37

iir.fp 223 1,652,784 1,325,964 37.16

iir.mmx 227 1,299,588 1,010,568 28.33 71.23

matvec.c 35 2,106,409 2,105,355 25.04

matvec.mmx 159 1,085,055 395,125 45.83 91.6

radar.c 389 12,953,062 10,110,365 47.04

radar.mmx 1,105 11,193,249 7,190,019 36.36 8.64

g722.c 1,281 16,258,744 11,618,849 59.92

g722.mmx 1,752 25,898,326 17,582,880 43.44 1.58

jp eg.c 3,755 12,901,353 9,700,077 43.25

jp eg.mmx 4,434 25,343,001 16,294,772 44.29 6.52

image.c 68 37,934,090 26,870,550 27.47

image.mmx 175 5,063,817 2,707,314 38.29 85.10

Static Instructions are the static instructions that are executed. Dynamic -ops are dynamic micro-op erations that o ccur on a Pentium

II. Dynamic Instructions get executed during the running of the program. Memory References are assembly instructions that use any

memory referencing mo de. .c denotes the C-only co de version of the program. .fp denotes the C version with calls to the oating-p oint

library. .mmx denotes C version with calls to the MMX library.

memory references. Figure 2b do es the same for the such as lo op unrolling, MMX packing and unpack-

C with oating p oint library version to C with MMX ing of SIMD typ es, calling and returning from function

ratio. In these gures, the reduction of memory refer- calls, and handling precision problems i.e. scaling.

ences and dynamic instructions in the MMX versions The FIR lter kernel shows a decent sp eedup fac-

corresp ond closely with the decrease in execution time. tor of 1.57 against C-only co de and a factor of 1.34

The reduction in memory references is imp ortant to versus the oating-p oint library. Although this kernel

notice b ecause of the signi cant cost of accessing o - loses some data parallelism b ecause it only pro cesses

chip memory. We will further discuss these results one input at a time, the relatively high numb er of taps

and each application's p erformance in the forthcoming allows many parallel op erations to take place while cal-

paragraphs. culating the summation 20 of instructions are MMX

instructions. The MMX version rep orts zero pack-

ing and unpacking instructions as a result of prop erly

4.1 Results from Kernels

aligned stores and moves. The MMX version of this

kernel sp ends 11 of the rep orted clo ck cycles doing

As shown in Table 3, the kernels show reasonable

call and ret and twice as many total clo ck cycles as

sp eedup with MMX co de. The MMX sp eedup relative

the C-only version. Finally, the FIR lter su ers little

to the optimized oating-p oint library .fpversions is

loss of precision in the MMX xed-p ointversion order

less than the sp eedup relative to the C-only .ccode

4

10  b ecause the error loss is not cumulativeatany

versions in all cases. Additional sp eedup is achieved

p oint.

using MMX instead of hand-optimized oating-p oint

The IIR lter kernel shows a more impressive

assembly co de. In all cases, the static instruction size is

sp eedup factor of 2.55 compared to the C-only co de

increased when using MMX, even versus oating-p oint

and 1.71 compared to the oating-p oint library routine.

assembly co de. Although sp eedup is achieved and dy-

The static co de size increases in the MMX version, but

namic instructions are reduced signi cantly, the static

not as signi cantly as fir. The higher sp eedup results

co de size increases when using MMX. This is the result

from blo ck pro cessing of the input samples 71 MMX

of the combination of software optimization techniques

a Breakdown of MMX instructions. b C-only vs. MMX implementations.

Figure 1: Instruction mix analysis of MMX co de and instruction count comparison b etween MMX and non-MMX

versions of co de. For the b enchmarks given, sp eedup increases from left to right.

all matvec's instructions are MMX instructions. The instructions which increases the data parallelism and

execution time due to MMX is reduced by a factor of reduces the numb er of functions called. The output of

6.61 and the dynamic instructions are reduced by a the MMX version of this I IR lter b ecomes unstable.

factor of 5.3. Note that this kernel op erates on 16-bit Although the non-MMX versions pro duce reasonable

data, so four pieces of data can b e op erated on in par- results, the loss of precision in the MMX version com-

allel, yet the sp eedup is much greater than 4.0. The p ounds iteration after iteration and the output over-

sup erlinear sp eedup is largely due to the imul instruc- ows the 16-bit data typ e. In a previous study [20],

tion whichdoesinteger multiplication in 10 cycles ver- our implementation of an I IR lter yielded b etter pre-

sus the pmaddwd MMX instruction which can p erform cision but that version achieved a slowdown rather than

twomultiplications in 3 cycles. The dynamic instruc- a sp eedup.

tion size reduction is due in large part to more ecient The FFT kernel has sp eedup of 1.98 compared with

management of the lo op structure in the MMX co de. the C-only version, but only 1.25 compared with the

It may be observed that matvec has the largest p er- oating-p oint library version. Factors other than MMX

centage of packing and unpacking MMX instructions are sp eeding up this kernel. One factor is the large dif-

20.5, yet has a signi cant sp eedup. ference in the numb er of data memory references. Ac-

cessing memory can b e very exp ensive three cycles for

a data cache miss, 8 cycles for an L2 access, and 15 cy-

4.2 Application Results

cles for an L2 miss. Also, the MMX version of the FFT

only uses 4.6 MMX instructions, which is the fewest

Overall, the results of the applications are disap-

of all the kernels. Although the implementation of the

p ointing. Two of the four applications JPEG com-

FFT is not outlined in the Intel do cumentation, the as-

pression and G.722 sp eech enco ding did not yield any

sembly co de indicates that the samples are converted

sp eedup. The image application is clearly the b est

to oating-p oint, and then the FFT is computed in a

suited for MMX. It shows sp eedup of 5.5 and a dy-

similar manner to the oating-p oint library version. In

namic instruction reduction of 5.3. The most imp or-

the initial stages of our study,we used an earlier ver-

tant factor is that the use of 8-bit data allows twice the

sion of the Pentium MMX library. In this case, the

parallelism compared to the use of 16-bit data. Also,

FFT used 40 MMX instructions, but provided only

the images are stored in a large array of 8-bit data

1.49 sp eedup over a C-only version. This suggests that

and are prop erly aligned on 8-byte b oundaries. This

computing the FFT with MMX integer calculations is

allows \automatic" packing and unpacking of data by

not an ecient strategy. The limited use of MMX do es

simply loading and storing quad-words 64 bits from

provide a sp eedup over the oating-p ointversion with

memory. Also contributing to the sp eedup are a 9.92

2

little loss of precision order 10  using the 16-bit data

times reduction in dynamic instructions, a 7.12 times

in our implementation.

reduction in data memory references, and 85 usage

The matrix-vector kernel is predictably well-suited

of MMX instructions.

for an MMX implementation. Approximately 90 of

The radar application has a low factor of sp eedup

Table 3: Results as ratios of Non-MMX program to MMX program

Benchmark Static Dynamic Memory

Program Sp eedup Instructions Instructions Micro-ops References

t.c 1.98 0.067 3.05 3.26 3.30

t.fp 1.25 0.881 1.29 1.27 1.42

r.c 1.57 0.146 1.58 1.26 2.01

r.fp 1.34 0.357 1.64 1.43 2.18

iir.c 2.55 0.264 2.65 2.25 2.09

iir.fp 1.71 0.982 1.31 1.27 1.72

matvec.c 6.61 0.220 5.32 1.94 2.91

g722.c 0.77 0.731 0.66 0.62 0.91

image.c 5.50 0.388 9.92 7.49 7.12

jp eg.c 0.49 0.847 0.62 0.51 0.61

radar.c 1.21 0.352 1.40 1.15 1.81

Speedup is the ratio of clo ck cycles. Static instructions are the static instruction that are executed. Dynamic instructions get executed

during the running of the program. Micro-ops are dynamic micro-op erations that o ccur on a Pentium I I. Memory References are

assembly instructions that use any memory referencing mo de. .c in the Benchmark Program column denotes the C-only version of

the program and .fp denotes the C version with calls to the oating-p oint library.

computation of the 2-D DCT shows only 1.1 sp eedup. even though all of the arithmetic is accomplished us-

Function calling accounts for 8.3 times more clo ck cy- ing MMX vector and FFT routines. The execution

cles in the Intel MMX version than the C-only version. time sp eedup is 1.21 and the dynamic instructions are

In general, accessing the pixels in the JPEG program is reduced by 1.40. Although several MMX routines are

rarely p erformed in sequential order. This increases the called, only 9.58 of the instructions are MMX instruc-

diculty of p erforming MMX op erations at either the tions. One shortcoming of the MMX version is that

function call level or the assembly level b ecause ecient 27 times more function calls are made, many of which

MMX co ding requires data that is contiguous. Another are unseen to the user b ecause they are called within

problem that exists is the precision handling and typ e the libraries themselves. The ret and call functions

conversion that accompanies using the libraries in this themselves consume 23.88 of the total cycles without

program. The disruption that the ab ove factors cause including the p enalty for passing parameters.

outweighs the b ene t of using MMX in our implemen- We exp ect sp eedup for the JPEG compression pro-

tation. gram, but the nal results show the C-only version to

The G.722 sp eech compression and enco ding actu- be 1.92 times faster although 5 of jpeg uses MMX

ally shows a slow-down as well when implemented with instructions. The slowdown is the result of forcing

MMX function calls. Some of the same problems from the C application to use MMX library functions. The

the JPEG application apply here as well. The primary JPEG program's computation is dominated by two-

drawback of this particular program as an MMX pro- dimensional DCT calculations, quantization, and color

gram is that it only pro cesses one input at a time while conversion. These three functions account for 74 of

enco ding and deco ding. Op erating on blo cks of data at the C-only version clo ck cycles. We inserted MMX ver-

once would de nitely increase the opp ortunity to use sions for these functions. The core op erations of these

MMX co de. The MMX version uses 1.5 MMX in- three functions which contain 24 MMX instructions

structions and sp ends 7.7 of its clo ck cycles on func- actually show a 1.6 sp eedup.

tion call overhead and 3.3 times more cycles than the The JPEG program op erates only on 8  8 blo cks

C-only program. In addition, the quality of sp eech de- of image pixels. The MMX version would b ene t from

teriorates even with a low MMX instruction count. a larger blo ck size to minimize function calls. In addi-

The disapp ointing results are due, in large part, to tion, there is no two-dimensional DCT call in the MMX

our strategy of editing existing co de instead of co d- libraries. So, instead of one call to a MMX 2-D DCT

ing the applications from scratch. The programs we function, there are 16 calls to a one-dimensional DCT

obtained are all highly optimized, esp ecially the JPEG function. A highly optimized 2-D DCT written using

compression program, and ran very fast with high com- MMX assembly co de is found to have a sp eedup of 1.7

piler optimizations. Editing the optimized core lo ops compared to a C implementation [21] while our MMX

a Ratio of C-only program to C program with MMX library b Ratio of C program with optimized oating-p oint calls to

calls. C program with MMX library calls.

Figure 2: Comparison of sp eedup, dynamic instruction count, and memory references.

and trying to convert them to MMX often disrupts the as a whole application. Potential overhead and

continuity of the co de with function calls and initial- other eciency issues which can deteriorate p er-

izations. formance arise when using exible, robust library

functions. These issues include factors such as

excessive function calls, preformatting the data

5 Conclusion

to t the library format, and switching b etween

oating-p oint and MMX co de.

We analyzed the usage of MMX enhanced libraries

in implementing DSP and multimedia programs. We

 MMX has the p otential to reduce dynamic in-

found the various changes in execution time that o ccur

structions and micro-op erations, but can dramat-

when our b enchmarks are run on the Pentium with

ically increase the static co de size of the applica-

MMX. We used several parameters to evaluate native

tions.

signal pro cessing p erformance enhancement, including

 Although packing and unpacking of data gener-

execution time, dynamic instruction size, instruction

ates overhead, programs with high p ercentage of

mix, and numb er of data references. We observed that:

these instructions are shown to be faster than

their non-MMX counterparts due to high utiliza-

 MMX can provide signi cant sp eedup in certain

tion of SIMD parallelism.

DSP and multimedia applications, even over hand-

 MMX seems well-suited for imaging applications

optimized oating-p oint assembly co de. The

b ecause they often have plenty of contiguous, 8-

sp eedups range from 1.25 to 6.6 for the kernels

bit data available and rarely require precision b e-

and 0.49 to 5.5 for the applications.

yond 8 bits. Higher precision signal pro cessing

 E ectively using MMX to pro duce sp eedup is a

applications seem to cause large problems due to

dicult chore for large or complicated C applica-

their real-time and high-precision requirements.

tions, even with assembly libraries. It can require

 Although the Intel MMX library is useful, intu-

signi cant restructuring of existing co de, reco d-

itive, and time-saving, providing 32-bit outputs

ing with libraries in mind, or using exclusively

for 16-bit integer op erations would enable preser-

assembly co de.

vation of precision across function calls. The in-

 Function libraries are a viable option for obtain-

terleaving of high and lowwords during multipli-

ing sp eedup; however, the b est p erformance in-

cation is a signi cant problem. Image and video

crease will always b e obtained by tailoring MMX

compression programs would b ene t from a two-

assembly co de to t the application and refrain-

dimensional DCT function in the MMX library.

ing from hierarchical function calling.

 Reducing memory references is just as imp ortant

 Although jpeg and g722 exp erienced sp eedup in

as reducing the numb er of arithmetic op erations

their core kernels, they did not exp erience sp eedup

b ecause accessing o -chip cache can b e very ex- [11] M. A. Saghir, P. Chow, and C. G. Lee, \Exploiting

p ensive on a general purp ose pro cessor. Dual Data Memory Banks in Digital Signal Pro-

cessors.," Proc. Conf. Architectural Support for

Based on our observations, MMX assembly co de in-

Prog. Lang. and Operating Sys., pp. 234-243, Oct.

tersp ersed in C co de might not b e the b est use of the

1996.

MMX instructions set. Benchmarks that can truly ex-

ploit MMX will require more restructuring of C co de

[12] V. Zivo jnovic, H. Schraut, M. Willems, and

and hand-co ding some functions not available in the

R. Scho enen, \DSP's, GPP's, and Multimedia Ap-

Intel assembly libraries, such as the 2-D DCT. Future

plications - an Evaluation of DSPstone," Proc. Int.

work on the sub ject could include these mo di cations

Conf. on Signal Proc. Appl. and Tech., pp. 1779-

in addition to more b enchmarks, such as an MPEG

1783, Oct. 1995.

video co dec or commercial applications.

[13] R. B. Lee, \Multimedia Extensions For General-

Purp ose Pro cessors," Proc. IEEE Workshop on

References

Signal Processing Systems, pp. 9-23, Nov. 1997.

[1] A. Peleg and U. Weiser, \The MMX Technology

[14] Intel, \Performance Library Suite." http://

Extension to the Intel Architecture," IEEE Micro,

developer.intel.com/design/perftool/per ibst/.

vol. 16, no. 4, pp. 42-50, Aug. 1996.

[15] Intel, \VTune CD." http://developer.intel.com/

[2] S. Wilkie and U. Weiser, \MMX Technology

design/perftool/vtcd/.

for Personal Computers," Communications of the

ACM,vol. 40, pp. 24{38, Jan. 1997.

[16] P.M.Embree, CAlgorithms for Real-Time DSP.

NJ: Prentice Hall, 1995.

[3] G. Blalo ck, \Micropro cessors Outp erform DSPs

2:1," MicroProcesor Report, vol. 10, no. 17, pp.

[17] Indep endent JPEG Group. http://www.ijg.org/.

1-4, Dec. 1995.

[18] Intel, \Develop ers' Insight." http://developer.

[4] W. Chen, H. J. Reekie, S. Bhave, and E. A. Lee,

intel.com/drg/mmx/manuals/overview/.

\Native Signal Pro cessing on the UltraSparc in

[19] F. G. Stremler, Introduction to Communication

the PtolemyEnvironment," Proc. IEEE Asilomar

Systems. Reading, MA: Addison-Wesley Publish-

Conference on Signals, Systems, and Computers,

ing Company, 3rd ed., 1990.

pp. 1368-1372, Nov. 1996.

[5] R. B. Lee, \Accelerating Multimedia with En-

[20] R. Bhargava, R. Radhakrishnan, B. L. Evans, and

hanced Mircopro cessors," IEEE Micro,vol. 15, no.

L. K. John, \Characterization of MMX-enhanced

2, pp. 23-32, Apr. 1995.

DSP and Multimedia Applications on a General

Purp ose Pro cessor," Digest of the Workshop on

[6] D. Bistry, C. Dulong, M. Gutman, M. Julier, and

Performance Analysis and Its Impact on Design

M. Keith, The Complete Guide to MMX Technol-

held in conjunction with ISCA98, pp. 16{23,

ogy. McGraw-Hill, 1997.

June. 1998.

http://www.ece.utexas.edu/~ljohn/mmxdsp/.

[7] G. Blalo ck, \The BDTIMark: A Measure of DSP

Execution Sp eed"," 1997. White Pap er from

[21] B. Erol, F. Kossentini, and H. Alnuweiri, \Imple-

Berkeley Design Technology, Inc.

mentation of a Fast H.263+ Enco der/Deco der,"

http://www.b dti.com/articles/wtpap er.htm.

Proc. IEEE Asilomar Conf. on Signals, Systems,

[8] L. Gwennap, \Intel's MMX Sp eeds Multimedia,"

and Computers, Paci c Grove, CA, Nov. 1998.

MicroProcesor Report,vol. 10, no. 3, p. 1, 1996.

[9] P. Lapsley and G. Blalo ck, \Evaluating DSP Pro-

cessor Performance," 1996. White Pap er from

Berkeley Design Technology, Inc.

http://www.b dti.com/articles/wp eval.htm.

[10] C. Lee, M. Potkonjak, and W. Mangione-Smith,

\MediaBench: ATo ol for Evaluating and Synthe-

sizing Multimedia and Communications Systems,"

IEEE Micro,vol. 30, no. 1, pp. 330-335, Dec. 1997.