VLIW DSP VS. SUPERSCALAR IMPLEMENTATION OF
A BASELINE H.263 VIDEO ENCODER
Serene Banerjee, Hamid R. Sheikh, Lizy K. John, Brian L. Evans, and Alan C. Bovik
Dept. of Electrical and Computer Engineering
The University of Texas at Austin, Austin, TX 78712-1084 USA
{serene,sheikh,ljohn,bevans,bovik}@ece.utexas.edu
ABSTRACT

A Very Long Instruction Word (VLIW) processor and a superscalar processor can execute multiple instructions simultaneously. A VLIW processor depends on the compiler and programmer to find the parallelism in the instructions, whereas a superscalar processor determines the parallelism at run time. This paper compares TI TMS320C6700 VLIW digital signal processor (DSP) and SimpleScalar superscalar implementations of a baseline H.263 video encoder in C. With level-two C compiler optimization, a one-way issue superscalar processor is 7.5 times faster than the VLIW DSP for the same processor clock speed. The superscalar speedup from one-way to four-way issue is 2.88:1, and from four-way to 256-way issue is 2.43:1. To reduce the execution time on the C6700, we write assembly routines for sum-of-absolute-differences, interpolation, and reconstruction, and place frequently used code and data into on-chip memory. We use TI's discrete cosine transform assembly routines. The hand-optimized VLIW DSP implementation is 61x faster than the C version compiled with level-two optimization. Most of the improvement was due to the efficient placement of data and programs in memory. The hand-optimized VLIW implementation is 14% faster than a 256-way superscalar implementation without hand optimizations.

1. INTRODUCTION

Two factors limit the use of real-time video communications: network bandwidth and processing resources. The ITU-T H.263 standard [1, 2, 3] for video communication over wireless and wireline networks has high computational complexity. An H.263 encoder includes the H.263 decoder minus the variable length decoding in a feedback path. The H.263 encoder is roughly three times more complex than an H.263 decoder.

In the H.263 encoder, the most computationally complex operation is motion estimation, even when using efficient search algorithms. The bottleneck is in the sum-of-absolute-differences (SAD) calculations. SAD provides a measure of the closeness between a 16 x 16 macroblock in the current frame and a 16 x 16 macroblock in the previous frame. Other computationally complex operations are image interpolation, image reconstruction, and the forward discrete cosine transform (DCT).

In this paper, we evaluate the performance of the C source code for a research H.263 codec developed at the University of British Columbia (UBC) [4] on two processor architectures. The first architecture is a very long instruction word (VLIW) digital signal processor (DSP) represented by the TI TMS320C6701 [5, 6]. The second is an out-of-order superscalar architecture represented by the SimpleScalar simulator [7].

We evaluate the performance of the superscalar processor vs. the maximum number of simultaneous instructions executed. We compare the performance of the C source code on both processor architectures. For the VLIW DSP implementation, we demonstrate that manual placement of frequently used data and code into on-chip memory provides a 29x speedup, whereas hand coding five of the most computationally complex routines gives a 4x speedup. Our hand-coded SAD, clipping, interpolation, and fill-data C6700 assembly routines and our SimpleScalar settings are available from

http://www.ece.utexas.edu/~sheikh/h263/

2. BACKGROUND

2.1. TMS320C6700 VLIW DSP Family

The TMS320C6700 is a floating-point subfamily within the C6000 VLIW DSP family. The C6000 family has 32-bit data words and 256-bit instructions. The 256-bit instruction consists of eight 32-bit instructions that may be executed at the same time. The C6000 depends on the compiler to find the parallelism in the code.
The C6000 family has two parallel 32-bit data paths. Each data path has 16 32-bit registers, and can only communicate one 32-bit word to the other data path per instruction cycle. Each data path has four 32-bit RISC units: one adder, one 16 x 16 multiplier, one shifter, and one load/store unit. Each 32-bit RISC unit has a throughput of one cycle, but its result is delayed by one cycle for 16 x 16 multiplication, zero cycles for logical and arithmetic operations, four cycles for load/store instructions, and five cycles for branch instructions. Every instruction may be conditionally executed. Conditional execution avoids the delay of a branch instruction and prevents interrupts from being disabled because a branch is in the pipeline.

The C6000 family has one bank of program memory and multiple banks of data memory. Each data memory bank is single-ported and two bytes wide, and has one-cycle data access throughput. Accessing one data memory bank twice in the same cycle results in a pipeline stall and increased cycle counts. The C6701 has 64 kbytes of program memory and 64 kbytes of data memory. The program memory bank has 2k 256-bit instruction words, and may be operated as an instruction cache. Each of the 16 data memory banks has 2k 16-bit halfwords.

On the C6700, the pipeline is 11-17 stages, depending on the instruction. The clock speed varies from 100 to 200 MHz. Power dissipation is typically 1.5-2.0 W, and the transistor count is less than 0.5 million. All simulation results for the C6700 were obtained by running the code on a 100-MHz C6701 Evaluation Module board. The C6700 simulator does not report pipeline stalls or memory bank conflicts, so running the code on the board gives the true performance.

2.2. SimpleScalar Superscalar Processor

The SimpleScalar simulator models a superscalar architecture derived from the MIPS-IV. The architecture prefetches at least four times the number of instructions to be executed and finds the parallelism between instructions at run time. All instructions have a fixed length of 64 bits and a fixed format, which promotes fast instruction decoding. It supports out-of-order issue and execution, based on the Register Update Unit (RUU) [8]. The RUU uses a reorder buffer for register renaming and holds the results of pending instructions. The memory system employs a load/store queue to support speculative execution. The results of a speculative store are placed in the store queue. If the addresses of all previous stores are known, then loads are dispatched to the memory. The six pipeline stages are fetch, dispatch, issue, writeback, commit, and load/store queue refresh.

For the SimpleScalar simulator, the configuration of the processor core, memory hierarchy, and branch predictor can be specified by using command-line arguments. The clock speed of typical superscalar processors is as high as 1 GHz. Power dissipation is a few tens of Watts, and the transistor count may vary from a few tens of millions to hundreds of millions.

2.3. UBC's H.263 Video Encoder

The H.263 Version 2 (H.263+) video encoder [1] contains 23,000 lines (720 kbytes) of C code. It was written for desktop PC applications and increases memory usage for faster execution. It incorporates the baseline H.263 encoder with many optional H.263+ modes. Our goal was to optimize the baseline H.263 encoder only for embedded applications, where the amount of available memory is much lower. This encoder irregularly uses floating-point variables and arithmetic. So, we choose the floating-point TMS320C6701 instead of the fixed-point TMS320C6201 as the VLIW DSP representative. In the performance evaluation, we encode sub-QCIF (128 x 96) frames. A related paper optimizes the MPEG-2 decoder on the TMS320C6000 [9].

3. VLIW DSP IMPLEMENTATION

The most demanding operations in a typical video encoder are motion estimation, motion compensation, discrete cosine transform, half-pixel interpolation, and reconstruction, as shown in Table 1. We hand optimize these routines to speed up the encoder. The entire H.263+ encoder, which needs 340 kB of program memory and 2.2 MB of data memory, does not fit in internal RAM on the C6701. We used memory placement and code optimizations to reduce the cycle count.

3.1. Memory Placement

Placing the code and data in slow external RAM wastes many cycles, especially for frequently accessed data and code. We therefore performed manual placement of code and data into fast on-chip program and data memory, respectively. Computationally intensive routines for motion estimation, interpolation, DCT, SAD, and reconstruction were placed in the on-chip program memory. Commonly used runtime functions from TI's libraries such as memcpy, memcmp, and memset were linked into internal program memory. A total of 43 kB out of the 340 kB of needed program memory was placed in the internal program memory.

Of the 2.2 MB of data memory, the lookup table for quantization consumed 1.6 MB. The following were placed in internal data memory: (1) 16 x 16 macroblocks and their corresponding search area for the
Routine               % Time     -o2    -o2 + Memory   -o2 + Code       All
                                              Opt.          Opt.      Opt.
Motion Estimation       91.9   1356000        33400        325000     13000
Motion Compensation      1.8     26200         4200          6400      3400
DCT (2-D DCT)            1.6     17200          670          6000       130
Quantization             0.52     7700          660          3300       660
Interpolation            1.1     16900         4200          2200       750
Reconstruction           2.1     31000         4000          9100      2260
Rest                     3.5     52000         8300         31200      6270
Entire encoder         100.0   1476000        51400        374000     24200

Table 1: Cycle counts (x1000) of H.263 routines per frame on the TMS320C6700 VLIW DSP.
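Table 1 shows that motion estimation dominates the per-frame cycle count, and its inner kernel is the SAD measure of Section 1. As a point of reference, the computation, including the early-abort test described in Section 3.2, can be sketched in portable C. This is an illustrative version only, not the UBC encoder's routine or the hand-scheduled C6700 assembly:

```c
#include <stdlib.h>

/* Reference SAD between two 16 x 16 macroblocks, with the early-abort
 * test of Section 3.2: stop as soon as the running sum exceeds the
 * best (minimum) SAD seen so far in the search window.  `stride` is
 * the line pitch of each frame buffer. */
unsigned sad16x16(const unsigned char *cur, const unsigned char *ref,
                  int stride, unsigned best_so_far)
{
    unsigned sum = 0;
    for (int row = 0; row < 16; row++) {
        for (int col = 0; col < 16; col++)
            sum += (unsigned)abs(cur[col] - ref[col]);
        if (sum > best_so_far)   /* early abort: this candidate loses */
            return sum;
        cur += stride;
        ref += stride;
    }
    return sum;
}
```

The early abort is what drops the hand-optimized routine's average cost well below its worst case: losing candidates are rejected after only a few rows.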
Program      Usage                    -o2 + Memory   -o2 + Code     All
                                              Opt.         Opt.    Opt.
SAD          Motion Estimation                 54x           4x    264x
Clip MB      Reconstruction                   7.5x          13x    228x
DCT          DCT of Frames                     26x           3x    138x
Interpolate  Bilinear Interpolation             4x           8x     22x
FillMBData   Reconstruction                     7x           8x     22x
Encoder      --                                29x           4x     61x

Table 2: Speedup of hand-optimized H.263 routines on the TMS320C6700 VLIW DSP per call.
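The Clip MB entry in Table 2 refers to the saturation step of reconstruction. Functionally it is simple: every reconstructed value is clamped to the 8-bit pixel range. A minimal portable C sketch (ours, not the encoder's code, and without the loop unrolling and pipelined comparisons used in the C6700 assembly version) is:

```c
/* Reference version of the Clip MB step described in Section 3.2:
 * saturate every reconstructed value of a 16 x 16 macroblock to
 * [0, 255].  The hand-coded C6700 routine unrolls this loop and
 * pipelines the comparisons; this sketch shows only the arithmetic. */
void clip_mb(int *mb /* 256 reconstructed values, row-major */)
{
    for (int i = 0; i < 16 * 16; i++) {
        int v = mb[i];
        if (v < 0)
            v = 0;
        else if (v > 255)
            v = 255;
        mb[i] = v;
    }
}
```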
Program        1-way    4-way    8-way   16-way   32-way   256-way     VLIW
SAD            12723     7281     6775     6017     6031      6422      290
DCT            56409    22770    16381    14069    13193     12885      384
Clip MB        24545     9604     7023     6250     6438      6438     1173
FillMBData     12692     4693     3082     3425     3438      3438     8740
Interpolate  2011397   640799   425703   456605   456480    456480   750000
Encoder        196 M     68 M     59 M     30 M     30 M      28 M     24 M

Table 3: SimpleScalar cycle counts with increasing issue widths, and hand-optimized VLIW DSP cycle counts, for H.263 routines.
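The Interpolate row of Table 3 corresponds to bilinear half-pixel interpolation. A reference C sketch of the operation follows; it is illustrative only (it ignores H.263's rounding-control bit and is not the encoder's packed-data implementation). Each half-pel sample is the rounded average of its two (edge) or four (center) integer-pel neighbours:

```c
/* Bilinear half-pixel interpolation: expand a w x h integer-pel block
 * into a (2w-1) x (2h-1) half-pel grid.  Illustrative reference code,
 * without H.263's rounding-control bit. */
void interpolate_halfpel(const unsigned char *src, int w, int h,
                         unsigned char *dst /* (2w-1) * (2h-1) */)
{
    int dw = 2 * w - 1;                       /* half-pel grid width */
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int a = src[y * w + x];
            dst[2 * y * dw + 2 * x] = (unsigned char)a;   /* full pel */
            if (x + 1 < w)                                /* horizontal */
                dst[2 * y * dw + 2 * x + 1] =
                    (unsigned char)((a + src[y * w + x + 1] + 1) >> 1);
            if (y + 1 < h)                                /* vertical */
                dst[(2 * y + 1) * dw + 2 * x] =
                    (unsigned char)((a + src[(y + 1) * w + x] + 1) >> 1);
            if (x + 1 < w && y + 1 < h)                   /* center */
                dst[(2 * y + 1) * dw + 2 * x + 1] =
                    (unsigned char)((a + src[y * w + x + 1]
                                       + src[(y + 1) * w + x]
                                       + src[(y + 1) * w + x + 1] + 2) >> 2);
        }
}
```

The per-sample cost (two or four loads plus shifts and adds) explains why this routine benefits from the packed external-memory accesses described in Section 3.2.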
motion estimation routines; (2) 16 x 16 macroblocks for the DCT, quantization, coding, and reconstruction routines; (3) local data of computationally intensive routines; and (4) the stack. These choices alone give an overall speedup of 29x over the unoptimized version.

3.2. Hand Coding Assembly Language Routines

Compiler intrinsics give little performance improvement for critical routines. To reduce cycle counts further, we write parallel assembly routines for SAD and Clip MB and linear assembly routines for InterpolateImage and FillMBData. We use TI's DCT assembly routines.

SAD consists of two reads, one subtract, one absolute value, and one accumulate operation per pixel of both blocks. We wrote a parallelized, pipelined assembly kernel to compute the SAD between two 16 x 16 macroblocks that avoids memory bank conflicts. Of the 16 instructions in the fully unrolled inner loop, 11 instructions use five of the six arithmetic units and five instructions use all six arithmetic units. The SAD is computed in 290 cycles per macroblock in the worst case. Since the SAD routine aborts computation if the accumulated difference exceeds the current minimum for the search window, the hand-optimized SAD routine takes on average 110 cycles per macroblock for full-search motion estimation. The speedup over the level-two C compiler optimization is 264 times.

The Clip MB routine clips overflowing values to 255 and underflowing values to zero for every pixel in a macroblock. We unroll the loops and pipeline the comparisons. The cycle counts reduce to 1173 per macroblock, which is an improvement of 228 times over the compiler's best optimization.

The TI DCT routine runs 138 times faster than the C version with level-two optimization only. However, the encoder uses 32-bit integers for coefficients, whereas TI's DCT routine uses 16-bit short integers. The conversion is an additional overhead.

We write the FillMBData and Interpolate routines in linear assembly. Both these routines use packed data access to external memory. The speedup is about 22 times each over the corresponding C compiled version with level-two optimization.

Table 1 summarizes the improving cycle counts and Table 2 summarizes the speedups of the optimized routines. The overall speedup for the encoder is 61 times over the version with C compiler optimizations enabled. The encoder requires 24 M cycles per frame, or equivalently, 4 sub-QCIF (128 x 96) frames/s on a 100-MHz C6701 EVM board. Additional improvement may be obtained by optimizing the variable length encoding. Our hand-optimized VLIW routines can also be used in the MPEG standards.

4. SUPERSCALAR IMPLEMENTATION

For a fair comparison of the individual H.263 routines, code and data were placed in the internal RAM for both the VLIW and the superscalar processor. We vary the issue width of the superscalar processor from 1-way to 256-way. The other parameters, i.e., the fetch and decode widths and the load/store queue and RUU sizes, were chosen to minimize the cycle counts. Typically, the fetch queue and the decode queue were twice the issue bandwidth, and the load/store queue and the RUU size were four times the issue bandwidth. The 4-kbyte G-share branch predictor [10] was used, which gives good performance over a variety of applications. For memory latencies, the default values were chosen, which are equal to typical values in modern general-purpose processors.

For the SAD, DCT, and Clip MB routines on a 256-way issue superscalar processor, the best cycle counts were 6017, 12885, and 6250, respectively, which exceed the cycle counts for the corresponding VLIW DSP parallel assembly routines, as shown in Table 4. In Table 4, the superscalar cycle counts for the FillMBData and Interpolate routines are lower than the cycle counts for the corresponding VLIW DSP linear assembly routines.

For the entire encoder, the cycle count was 28 M cycles, or equivalently, 3.5 sub-QCIF (128 x 96) frames/s on a 100-MHz processor. The cycle counts for the subroutines and the entire encoder for 1-, 4-, 8-, 16-, 32-, and 256-way issue processors are summarized in Table 3. With increasing issue widths, the speedup shows diminishing returns. Increased issue widths may sometimes lead to more missed speculations, which results in increased cycle counts, as observed in Table 3.

             VLIW DSP   SimpleScalar
Routine        cycles         cycles    Ratio
SAD               290           6017   1:20.7
DCT               384          12885   1:33.5
Clip MB          1173           6250    1:5.3
FillMBData       8740           3082    2.8:1
Interpolate    750000         425703    1.8:1
Encoder          24 M           28 M    1.2:1

Table 4: Comparison of 256-way issue superscalar and hand-optimized VLIW DSP implementations.
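The interaction of full-search motion estimation with the early-terminating SAD discussed in Section 3.2 can be sketched in portable C. All names here are ours, the routine is illustrative rather than the UBC encoder's, and it omits refinements such as sub-sampled or spiral-order search that a production encoder might use:

```c
#include <stdlib.h>
#include <limits.h>

/* Early-terminating 16 x 16 SAD: abort once the running sum exceeds
 * the best candidate found so far (see Section 3.2). */
static unsigned sad16(const unsigned char *a, const unsigned char *b,
                      int stride, unsigned best)
{
    unsigned sum = 0;
    for (int r = 0; r < 16; r++, a += stride, b += stride) {
        for (int c = 0; c < 16; c++)
            sum += (unsigned)abs(a[c] - b[c]);
        if (sum > best)          /* early abort */
            return sum;
    }
    return sum;
}

/* Illustrative full search: find the displacement (dx, dy), within
 * +/- range, that minimizes the SAD between the macroblock at
 * (mbx, mby) in the current frame and the previous frame. */
void full_search(const unsigned char *cur, const unsigned char *prev,
                 int w, int h, int mbx, int mby, int range,
                 int *best_dx, int *best_dy)
{
    unsigned best = UINT_MAX;
    *best_dx = 0;
    *best_dy = 0;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            int x = mbx + dx, y = mby + dy;
            if (x < 0 || y < 0 || x + 16 > w || y + 16 > h)
                continue;        /* keep candidate inside the frame */
            unsigned s = sad16(cur + mby * w + mbx,
                               prev + y * w + x, w, best);
            if (s < best) {
                best = s;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
}
```

Because most candidates are rejected after a few rows, the average SAD cost is far below the worst case, which is consistent with the 110-cycle average vs. 290-cycle worst case reported for the hand-optimized C6700 kernel.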
Measure              -o2   -o2 + Memory   -o2 + Code    All   Superscalar
                                   Opt.         Opt.   Opt.
Cycles per frame  1476 M           51 M        374 M   24 M          28 M
Frames per second   0.14              2         0.54      4           3.5
Speedup               --             29            4     61            53

Table 5: Comparison of VLIW DSP and 256-way issue superscalar implementations at a clock speed of 100 MHz.
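The memory-optimization column of Table 5 reflects the manual placement described in Section 3.1. On the TI C6000 toolchain, such placement is typically expressed with the compiler's DATA_SECTION and CODE_SECTION pragmas together with a linker command file; the fragment below is a hedged illustration with invented symbol and section names, not the encoder's actual memory map (on non-TI compilers these pragmas are simply ignored):

```c
/* Illustrative only: TI C6000 section pragmas.  Symbol and section
 * names here are ours; the section-to-RAM mapping is completed in the
 * linker command file, e.g.:
 *     SECTIONS { .intdata > IDRAM   .intprog > IPRAM }           */
#pragma DATA_SECTION(search_area, ".intdata")   /* on-chip data RAM    */
unsigned char search_area[48 * 48];             /* ME search window    */

#pragma CODE_SECTION(clear_search_area, ".intprog") /* on-chip program RAM */
void clear_search_area(void)
{
    for (int i = 0; i < 48 * 48; i++)
        search_area[i] = 0;
}
```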
5. CONCLUSION

This paper evaluates the performance of a baseline H.263 encoder in C on a VLIW DSP (TMS320C6700) processor and on an out-of-order SimpleScalar superscalar processor to encode 128 x 96 frames. The H.263 encoder was originally written for PC applications and trades memory off for speed. We validate that the VLIW DSP and superscalar implementations generated H.263 bitstreams identical to those of the PC version of the encoder. When using level-two C compiler optimization only, we find that, for the same processor clock speeds,

1. a one-way issue superscalar implementation is 7.5x faster than the VLIW DSP implementation,

2. the four-way to one-way issue speedup is 2.88:1, and

3. the 256-way to four-way issue speedup is 2.4:1.

In analyzing the performance of the VLIW DSP implementation, external memory access was the key bottleneck. The amount of on-chip memory was only a fraction of the total memory needed by the application. A less significant but still important problem was that the compiler cannot always find the full parallelism in C code to take advantage of the parallel functional units in a VLIW processor. Once we placed the data and code that were most often used into on-chip memory, we hand coded the SAD, image interpolation, and image reconstruction routines on the C6700.

After all of the manual optimizations, the VLIW DSP implementation was 14% faster than the 256-way superscalar implementation with level-two compiler optimization and with all data and code on-chip. This reflects a key difference in the architectures. On an out-of-order superscalar processor, the parallelism is detected and exploited at run time. On a VLIW DSP, the parallelism must be detected and exploited at compile time. Optimizing a VLIW DSP implementation requires careful manual placement of data and code into on-chip memory, which is not required on a superscalar processor. Optimizing both superscalar and VLIW DSP implementations generally requires hand coding the most frequently called routines.

6. REFERENCES

[1] G. Cote, B. Erol, M. Gallant, and F. Kossentini, "H.263+: Video coding at low bit rates," IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, pp. 849-866, Nov. 1998.

[2] T. Gardos, "H.263+: The new ITU-T recommendation for video coding at low bit rates," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 6, pp. 3793-3796, May 1998.

[3] ITU Telecom Standardization Sector, "Video coding for low bit rate communication," ITU-T Recommendation H.263 Version 2, Jan. 1998.

[4] B. Erol, F. Kossentini, and H. Alnuweiri, "Implementation of a fast H.263+ encoder/decoder," in Proc. IEEE Asilomar Conf. on Signals, Systems and Comp., vol. 1, pp. 462-466, Nov. 1998.

[5] TMS320C6000 Programmer's Guide. No. SPRU 198D, Texas Instruments, Inc., Mar. 2000.

[6] TMS320C6000 CPU and Instruction Set. No. SPRU 189E, Texas Instruments, Inc., Jan. 2000.

[7] T. Austin and D. Burger, "The SimpleScalar tool set, Version 2.0," Tech. Rep. TR-1342, Univ. of Wisconsin Computer Sciences Dept., June 1997.

[8] G. S. Sohi, "Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers," IEEE Trans. on Computers, vol. 39, pp. 349-359, Mar. 1990.

[9] S. Sriram and C.-Y. Hung, "MPEG-2 video decoding on the TMS320C6x DSP architecture," in Proc. IEEE Asilomar Conf. on Signals, Systems and Comp., vol. 2, pp. 1735-1739, Nov. 1998.

[10] S. Onder, X. Jun, and R. Gupta, "Caching and predicting branch sequences for improved fetch effectiveness," in Proc. Int. Conf. on Parallel Architectures and Compilation Tech., vol. 1, pp. 294-302, Nov. 1999.