VLIW DSP VS. SUPERSCALAR IMPLEMENTATION OF A BASELINE H.263 VIDEO ENCODER

Serene Banerjee, Hamid R. Sheikh, Lizy K. John, Brian L. Evans, and Alan C. Bovik

Dept. of Electrical and Computer Engineering

The University of Texas at Austin, Austin, TX 78712-1084 USA

{serene, sheikh, ljohn, bevans, bovik}@ece.utexas.edu

ABSTRACT

A Very Long Instruction Word (VLIW) processor and a superscalar processor can execute multiple instructions simultaneously. A VLIW processor depends on the compiler and programmer to find the parallelism in the instructions, whereas a superscalar processor determines the parallelism at run time. This paper compares TI TMS320C6700 VLIW digital signal processor (DSP) and SimpleScalar superscalar implementations of a baseline H.263 video encoder in C. With level two C compiler optimization, a one-way issue superscalar processor is 7.5 times faster than the VLIW DSP for the same processor clock speed. The superscalar speedup from one-way to four-way issue is 2.88:1, and from four-way to 256-way issue is 2.43:1. To reduce the execution time on the C6700, we write assembly routines for sum-of-absolute-difference, interpolation, and reconstruction, and place frequently used code and data into on-chip memory. We use TI's discrete cosine transform assembly routines. The hand optimized VLIW DSP implementation is 61x faster than the C version compiled with level two optimization. Most of the improvement was due to the efficient placement of data and programs in memory. The hand optimized VLIW implementation is 14% faster than a 256-way superscalar implementation without hand optimizations.

1. INTRODUCTION

Two factors limit the use of real-time video communications: network bandwidth and processing resources. The ITU-T H.263 standard [1, 2, 3] for video communication over wireless and wireline networks has high computational complexity. An H.263 encoder includes the H.263 decoder minus the variable length decoding in a feedback path. The H.263 encoder is roughly three times more complex than an H.263 decoder.

In the H.263 encoder, the most computationally complex operation is motion estimation, even when using efficient search algorithms. The bottleneck is in the sum-of-absolute differences (SAD) calculations. SAD provides a measure of the closeness between a 16 × 16 macroblock in the current frame and a 16 × 16 macroblock in the previous frame. Other computationally complex operations are image interpolation, image reconstruction, and the forward discrete cosine transform (DCT).

In this paper, we evaluate the performance of the C source code for a research H.263 codec developed at the University of British Columbia (UBC) [4] on two processor architectures. The first architecture is a very long instruction word (VLIW) digital signal processor (DSP) represented by the TI TMS320C6701 [5, 6]. The second is an out-of-order superscalar architecture represented by the SimpleScalar simulator [7].

We evaluate the performance of the superscalar processor vs. the maximum number of simultaneous instructions executed. We compare the performance of the C source code on both processor architectures. For the VLIW DSP implementation, we demonstrate that manual placement of frequently used data and code into on-chip memory provides a 29x speedup, whereas hand coding five of the most computationally complex routines gives a 4x speedup. Our hand-coded SAD, clipping, interpolation, and fill data C6700 assembly routines and our SimpleScalar settings are available from

http://www.ece.utexas.edu/~sheikh/h263/

2. BACKGROUND

2.1. TMS320C6700 VLIW DSP Family

The TMS320C6700 is a floating-point subfamily within the C6000 VLIW DSP family. The C6000 family has 32-bit data words and 256-bit instructions. The 256-bit instruction consists of eight 32-bit instructions that may be executed at the same time. The C6000 depends on the compiler to find the parallelism in the code.

The C6000 family has two parallel 32-bit data paths. Each data path has 16 32-bit registers, and can only communicate one 32-bit word to the other data path per cycle. Each data path has four 32-bit RISC units: one arithmetic logic unit (ALU), one 16 × 16 multiplier, one shifter, and one load/store unit. Each 32-bit RISC unit has a throughput of one cycle, but its result is delayed by one cycle for 16 × 16 multiplication, zero cycles for logical and arithmetic operations, four cycles for load/store instructions, and five cycles for branch instructions. Every instruction may be conditionally executed. Conditional execution avoids the delay of a branch instruction and avoids interrupts from being disabled because a branch is in the pipeline.

The C6000 family has one bank of program memory and multiple banks of data memory. Each data memory bank is single-ported and two bytes wide, and has one-cycle data access throughput. Accessing one data memory bank twice in the same cycle results in a pipeline stall and increased cycle counts. The C6701 has 64 kbytes of program memory and 64 kbytes of data memory. The program memory bank has 2k of 256-bit instruction words, and may be operated as an instruction cache. Each of the 16 data memory banks has 2k of 16-bit halfwords.

On the C6700, the pipeline is 11–17 stages, depending on the instruction. The clock speed varies from 100 to 200 MHz. Power dissipation is typically 1.5–2.0 W, and the transistor count is less than 0.5 million. All simulation results for the C6700 were obtained by running the code on a 100-MHz C6701 Evaluation Module board. The C6700 simulator does not report pipeline stalls or memory bank conflicts, so running the code on the board gives the true performance.

2.2. SimpleScalar Superscalar Processor

The SimpleScalar has a superscalar architecture derived from the MIPS-IV. The architecture prefetches at least four times the number of instructions to be executed and finds the parallelism between instructions at run time. All instructions have a fixed length of 64 bits and fixed format, which promotes fast instruction decoding. It supports out-of-order issue and execution, based on the Register Update Unit (RUU) [8]. The RUU uses a reorder buffer for register renaming and holds the results of pending instructions. The memory system employs a load/store queue to support speculative execution. The results of a speculative store are placed in the store queue. If the addresses of all previous stores are known, then loads are dispatched to the memory. The six pipeline stages are issue, dispatch, fetch, commit, writeback, and load/store queue refresh. For the SimpleScalar simulator, the configuration of the processor core and branch predictor can be specified by using command-line arguments. The clock speed of typical superscalar processors is as high as 1 GHz. Power dissipation is a few tens of Watts, and the transistor count may vary from a few tens of millions to hundreds of millions.

2.3. UBC's H.263 Video Encoder

The H.263 Version 2 (H.263+) video encoder [1] contains 23,000 lines (720 kbytes) of C code. It was written for desktop PC applications and increases memory usage for faster execution. It incorporates the baseline H.263 encoder with many optional H.263+ modes. Our goal was to optimize the baseline H.263 encoder only for embedded applications, where the amount of available memory is much lower. This encoder irregularly uses floating point variables and arithmetic. So, we choose the floating-point TMS320C6701 instead of the fixed-point TMS320C6201 as the VLIW DSP representative. In the performance evaluation, we encode sub-QCIF (128 × 96) frames. A related paper optimizes the MPEG-2 decoder on the TMS320C6000 [9].

3. VLIW DSP IMPLEMENTATION

The most demanding operations in a typical video encoder are motion estimation, motion compensation, discrete cosine transform, half-pixel interpolation, and reconstruction, as shown in Table 1. We hand optimize these routines to speed up the encoder. The entire H.263+ encoder, which needs 340 kB of program memory and 2.2 MB of data memory, does not fit in internal RAM on the C6701. We used memory placement and code optimizations to reduce cycle count.

3.1. Memory Placement

Placing the code and data in slow external RAM wastes many cycles, especially for frequently accessed data and code. We therefore performed manual placement of code and data into fast on-chip program and data memory, respectively. Computationally intensive routines for motion estimation, interpolation, DCT, SAD, and reconstruction were placed in the on-chip program memory. Commonly used runtime functions from TI's libraries such as memcpy, memcmp and memset were linked into internal program memory. A total of 43 kB out of the 340 kB of needed program memory was placed in the internal program memory.

Of the 2.2 MB of data memory, the lookup table for quantization consumed 1.6 MB. The following were placed in internal data memory: (1) 16 × 16 macroblocks and their corresponding search area for the

Time %   Routine               -o2            -o2 and Memory  -o2 and Code   All
                               Optimization   Optimization    Optimization   Optimizations
 91.9    Motion Estimation        1356000          33400          325000         13000
  1.8    Motion Compensation        26200           4200            6400          3400
  1.6    2-D DCT                    17200            670            6000           130
  0.52   Quantization                7700            660            3300           660
  1.1    Interpolation              16900           4200            2200           750
  2.1    Reconstruction             31000           4000            9100          2260
  3.5    Rest                       52000           8300           31200          6270
100.0    Entire encoder           1476000          51400          374000         24200

Table 1: Cycle counts (x1000) of H.263 routines per frame on the TMS320C6700 VLIW DSP.

Program      Usage                    -o2 and Memory  -o2 and Code   All
                                      Optimization    Optimization   Optimizations
SAD          Motion Estimation        54x             4x             264x
Clip_MB      Reconstruction           7.5x            13x            228x
DCT          DCT of Frames            26x             3x             138x
Interpolate  Bilinear Interpolation   4x              8x             22x
FillMBData   Reconstruction           7x              8x             22x
Encoder      --                       29x             4x             61x

Table 2: Speedup of hand optimized H.263 routines on the TMS320C6700 VLIW DSP per call.
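The Interpolate routine listed in Table 2 performs the half-pixel bilinear interpolation used by H.263 motion compensation. As a point of reference, the diagonal half-pixel case amounts to the following scalar C sketch; the function name and interface are our own illustration, not the encoder's source, and rounding follows the H.263 divide-by-four-with-rounding convention:

```c
/* Half-pixel bilinear interpolation of a (w x h) block: each output
 * sample is the rounded average of the four neighboring full-pixel
 * samples.  The source must provide one extra row and column, i.e.
 * (w+1) x (h+1) valid samples.  Interface is illustrative only. */
void interp_half_xy(const unsigned char *src, int src_stride,
                    unsigned char *dst, int dst_stride, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int sum = src[x] + src[x + 1]
                    + src[x + src_stride] + src[x + src_stride + 1];
            dst[x] = (unsigned char)((sum + 2) >> 2); /* /4 with rounding */
        }
        src += src_stride;  /* next row of source samples */
        dst += dst_stride;  /* next row of output samples */
    }
}
```

The four loads per output pixel explain why this routine benefits from the packed data access described in Sec. 3.2.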

Program      1-way     4-way    8-way    16-way   32-way   256-way   VLIW
SAD            12723     7281     6775     6017     6031     6422      290
DCT            56409    22770    16381    14069    13193    12885      384
Clip_MB        24545     9604     7023     6250     6438     6438     1173
FillMBData     12692     4693     3082     3425     3438     3438     8740
Interpolate  2011397   640799   425703   456605   456480   456480   750000
Encoder        196 M     68 M     59 M     30 M     30 M     28 M     24 M

Table 3: SimpleScalar with increasing issue widths and hand optimized VLIW DSP cycle counts for H.263 routines.

motion estimation routines; (2) 16 × 16 macroblocks for the DCT, quantization, coding and reconstruction routines; (3) local data of computationally intensive routines; and (4) the stack. These choices alone give an overall speedup of 29x over the unoptimized version.

3.2. Hand Coding Assembly Language Routines

Compiler intrinsics give little performance improvement for critical routines. To reduce cycle counts further, we write parallel assembly routines for SAD and Clip_MB and linear assembly routines for InterpolateImage and FillMBData. We use TI's DCT assembly routines.

SAD consists of two read, one subtract, one absolute value, and one accumulate operation per pixel of both blocks. We wrote a parallelized pipelined assembly kernel to compute the SAD between two 16 × 16 macroblocks that avoided memory bank conflicts. Of the 16 instructions in the fully unrolled inner loop, 11 instructions use five of the six arithmetic units and five instructions use all six arithmetic units. The SAD is computed in 290 cycles per macroblock in the worst case. Since the SAD routine aborts computation if the accumulated difference exceeds the current minimum for the search window, the hand-optimized SAD routine takes on average 110 cycles per macroblock for full-search motion estimation. The speedup over the level two C compiler optimization is 264 times.

The Clip_MB routine clips overflowing values to 255 and underflowing values to zero for every pixel in a macroblock. We unroll the loops and pipeline the comparisons. The cycle counts reduce to 1173 per macroblock, which is an improvement of 228 times over the compiler's best optimization.

The TI DCT routine runs 138 times faster than the C version with level two optimization only. However, the encoder uses 32-bit integers for coefficients, whereas TI's DCT routine uses 16-bit short integers. The conversion is an additional overhead.

We write the FillMBData and Interpolate routines in linear assembly. Both these routines use packed data access to external memory. The speedup is about 22 times each over the corresponding C compiled version with level two optimization.

Table 1 summarizes the improving cycle counts and Table 2 summarizes the speedups of the optimized routines. Overall speedup for the encoder is 61 times over the version with C compiler optimizations enabled. The encoder requires 24 M cycles per frame, or equivalently, 4 sub-QCIF (128 × 96) frames/s on a 100-MHz C6701 EVM board. Additional improvement may be obtained by optimizing the variable length encoding. Our hand optimized VLIW routines can also be used in the MPEG standards.

4. SUPERSCALAR IMPLEMENTATION

For a fair comparison of the individual H.263 routines, code and data were placed in the internal RAM for both the VLIW and the superscalar processor. We vary issue widths of the superscalar processor from 1-way to 256-way. The other parameters, i.e. the fetch and decode widths and the load/store queue and RUU sizes, were chosen to minimize the cycle counts. Typically, the fetch queue and the decode queue were twice the issue bandwidth, and the load/store queue and the RUU size were four times the issue bandwidth. The 4 kbyte G-share branch predictor [10] was used, which gives good performance over a variety of applications. For memory latencies, the default values were chosen, which are equal to typical values in modern general-purpose processors.

For the SAD, DCT and Clip_MB routines on a 256-way issue superscalar processor, the best cycle counts were 6017, 12885 and 6250, respectively, which exceed the cycle counts for the corresponding VLIW DSP parallel assembly routines, as shown in Table 4. In Table 4, the superscalar cycle counts for the FillMBData and Interpolate routines are lower than the cycle counts for the corresponding VLIW DSP linear assembly routines.

For the entire encoder, the cycle count was 28 M cycles, or equivalently, 3.5 sub-QCIF (128 × 96) frames/s on a 100 MHz processor. The cycle counts for the subroutines and the entire encoder for a 1-, 4-, 8-, 16-, 32-, and 256-way issue processor are summarized in Table 3. With increasing issue widths, the speedup shows diminishing returns. Increased issue widths may sometimes lead to more missed speculations. This results in increased cycle counts, as observed in Table 3.

             VLIW DSP   SimpleScalar
Routine      cycles     cycles         Ratio
SAD             290        6017        1:20.7
DCT             384       12885        1:33.5
Clip_MB        1173        6250        1:5.3
FillMBData     8740        3082        2.8:1
Interpolate  750000      425703        1.8:1
Encoder        24 M        28 M        1.2:1

Table 4: Comparison of 256-way issue superscalar and hand optimized VLIW DSP implementations.
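For reference, the per-macroblock SAD computation described in Sec. 3.2, including the early abort once the accumulated difference exceeds the best SAD found so far, corresponds to this scalar C sketch. The function name, the `stride` parameter, and the interface are our own illustration, not the UBC encoder's source:

```c
#include <stdlib.h>

/* Sum of absolute differences between a 16x16 macroblock in the
 * current frame and a candidate 16x16 macroblock in the previous
 * frame.  `stride` is the distance in bytes between vertically
 * adjacent pixels.  Accumulation aborts early (here, per row) once
 * the running total exceeds `min_so_far`, the best SAD found so far
 * in the search window. */
unsigned sad_16x16(const unsigned char *cur, const unsigned char *prev,
                   int stride, unsigned min_so_far)
{
    unsigned sad = 0;
    for (int row = 0; row < 16; row++) {
        for (int col = 0; col < 16; col++)
            sad += abs(cur[col] - prev[col]); /* read, read, subtract,
                                                 absolute value, accumulate */
        if (sad > min_so_far)  /* this candidate cannot win: abort */
            return sad;
        cur += stride;   /* advance both blocks to the next row */
        prev += stride;
    }
    return sad;
}
```

The five operations per pixel in the inner loop are what the hand-written C6700 kernel spreads across the six arithmetic units of the two data paths.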

Measure            -o2            -o2 and Memory  -o2 and Code   All            Superscalar
                   Optimizations  Optimizations   Optimizations  Optimizations  Performance
Cycles per frame   1476 M         51 M            374 M          24 M           28 M
Frames per second  0.14           2               0.54           4              3.5
Speedup            --             29              4              61             53

Table 5: Comparison of a VLIW DSP and 256-way issue superscalar implementations at a clock speed of 100 MHz.
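The Clip_MB routine timed in Tables 2-4 saturates reconstructed pixel values into the 8-bit range. Its scalar behavior can be sketched in C as follows; the buffer layout and function interface are assumptions for illustration, not the encoder's own:

```c
/* Clip 32-bit intermediate pixel values into [0, 255]: overflowing
 * values saturate to 255 and underflowing values to zero, one output
 * byte per pixel.  A 16x16 macroblock corresponds to n = 256. */
void clip_mb(const int *src, unsigned char *dst, int n)
{
    for (int i = 0; i < n; i++) {
        int v = src[i];
        if (v < 0)
            v = 0;          /* underflow clips to zero */
        else if (v > 255)
            v = 255;        /* overflow clips to 255   */
        dst[i] = (unsigned char)v;
    }
}
```

A branch-free formulation of the two comparisons maps naturally onto the C6700's conditional execution, which is one reason the unrolled, pipelined assembly version reaches 1173 cycles per macroblock.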

5. CONCLUSION

This paper evaluates the performance of a baseline H.263 encoder in C on a VLIW DSP (TMS320C6700) processor and on an out-of-order SimpleScalar superscalar processor to encode 128 × 96 frames. The H.263 encoder was originally written for PC applications and trades off memory for speed. We validate that the VLIW DSP and superscalar implementations generated H.263 bitstreams identical to those of the PC version of the encoder. When using level two C compiler optimization only, we find that for the same processor clock speeds:

1. the one-way issue superscalar implementation is 7.5x faster than the VLIW DSP implementation,

2. the four-way to one-way issue speedup is 2.88:1, and

3. the 256-way to four-way issue speedup is 2.4:1.

In analyzing the performance of the VLIW DSP implementation, external memory access was the key bottleneck. The amount of on-chip memory was only a fraction of the total memory needed by the application. A less significant but still important problem was that the compiler cannot always find the full parallelism in C code to take advantage of the parallel functional units in a VLIW processor. Once we placed the data and code that were most often used into on-chip memory, we hand coded SAD, image interpolation, and image reconstruction routines on the C6700.

After all of the manual optimizations, the VLIW DSP implementation was 14% faster than the 256-way superscalar implementation with level two compiler optimization and with all data and code on-chip. This reflects a key difference in the architectures. On an out-of-order superscalar processor, the parallelism is detected and exploited at run time. On a VLIW DSP, the parallelism must be detected and exploited at compile time. Optimizing a VLIW DSP implementation requires careful manual placement of data and code into on-chip memory, which is not required on a superscalar processor. Optimizing both superscalar and VLIW DSP implementations generally requires hand coding the most frequently called routines.

6. REFERENCES

[1] G. Cote, B. Erol, M. Gallant, and F. Kossentini, "H.263+: Video coding at low bit rates," IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 849–866, Nov. 1998.

[2] T. Gardos, "H.263+: The new ITU-T recommendation for video coding at low bit rates," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 6, pp. 3793–3796, May 1998.

[3] ITU Telecom Standardization Sector, "Video coding for low bit rate communication," ITU-T Recommendation H.263 Version 2, Jan. 1998.

[4] B. Erol, F. Kossentini, and H. Alnuweiri, "Implementation of a fast H.263+ encoder/decoder," in Proc. IEEE Asilomar Conf. on Signals, Systems and Comp., vol. 1, pp. 462–466, Nov. 1998.

[5] TMS320C6000 Programmer's Guide. No. SPRU 198D, Texas Instruments, Inc., Mar. 2000.

[6] TMS320C6000 CPU and Instruction Set. No. SPRU 189E, Texas Instruments, Inc., Jan. 2000.

[7] T. Austin and D. Burger, "The SimpleScalar tool set, Version 2.0," Tech. Rep. TR-1342, Univ. of Wisconsin Comp. Sciences Dept., June 1997.

[8] G. S. Sohi, "Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers," IEEE Trans. on Computers, vol. 39, pp. 349–359, Mar. 1990.

[9] S. Sriram and C.-Y. Hung, "MPEG-2 video decoding on the TMS320C6x DSP architecture," in Proc. IEEE Asilomar Conf. on Signals, Systems and Comp., vol. 2, pp. 1735–1739, Nov. 1998.

[10] S. Onder, X. Jun, and R. Gupta, "Caching and predicting branch sequences for improved fetch effectiveness," in Proc. Int. Conf. on Parallel Architectures and Compilation Tech., vol. 1, pp. 294–302, Nov. 1999.