Real-Time Layered Video Compression using SIMD Computation

Morten Vadskær Jensen and Brian Nielsen

Aalborg University

Department of Computer Science

Fredrik Bajers Vej 7E

DK-9220 Aalborg, Denmark

{mvj, bnielsen}@cs.auc.dk

Abstract. We present the design and implementation of a high performance software layered video codec designed for deployment in bandwidth heterogeneous networks. The codec facilitates layered spatial and SNR (signal-to-noise ratio) coding for bit-rate adaption to a wide range of receiver capabilities. The codec uses a wavelet subband decomposition for spatial layering and a discrete cosine transform combined with repeated quantization for SNR layering.

Through the use of the Visual Instruction Set on Sun's UltraSPARC platform we demonstrate how SIMD parallel image processing enables layered real-time software encoding and decoding. The codec partitions our test video stream into 21 layers and reconstructs it, with both encoding and decoding running faster than the stream's real-time frame rate. The Visual Instruction Set accelerated encoder stages are several times as fast as an optimized C version. We find that this speedup is well worth the extra implementation effort.

1 Introduction

Smart technology for multicasting live digital video across wide area networks is necessary for future applications like video conferencing, distance learning, and telecommuting. The Internet Multicast Backbone (MBone) [1] is already popular and allows people from anywhere on the planet to exchange modest quality video and audio signals. However, a fundamental problem with nearly all large computer networks is that network capacity (bandwidth) varies extremely from one network segment to another. This variation is the source of a serious video multicast problem: all users wish to participate in the video conference with the highest quality video that their connection capacity (varying from kbps-range ISDN dial-up connections to Mbit/s-range LAN connections) and host computational resources allow. High capacity receivers are capable of processing good quality video with bit rates of several Mbps or more; this outperforms the low capacity receivers by more than an order of magnitude. Even a modest bit rate may flood low capacity receivers.

Conventional video compression techniques code the video signal to a fixed target bit rate; the coded stream is then multicast to all receivers. A low target bit rate is typically chosen to enable as many receivers as possible to participate. However, the resulting quality is unacceptable for the high capacity receivers. Alternatively, the video signal can be encoded repeatedly to a multitude of data rates. This is, however, very inefficient. Firstly, it involves extra computational overhead from compressing the same frame repeatedly. Secondly, it introduces bandwidth redundancy, as each high bit-rate video stream contains all the information also included in the lower bit-rate ones. This translates to inefficient bandwidth utilization from the root of the multicast tree.

We propose layered coding coupled with multicasting, where the video stream is coded and compressed into several distinct layers of significance, as shown in Figure 1. In principle, each layer is transmitted on its own multicast channel. The receivers may subscribe to these channels according to their capabilities and preferences. The more layers received, the higher the video quality, but also the higher the bandwidth and processing power requirements. The most significant layer constitutes the base layer, and the following layers are enhancement layers.

Fig. 1. Conventional single bit-rate video coding (a) vs. layered coding (b).

Live video requires that video frames are encoded and decoded in real time. This is even more challenging for layered coding than for conventional coding, because layer construction and reassembly require additional image filter functions and repeated processing of image data, and hence more CPU processing. We believe it is important that this coding is possible in real time on modern general purpose processors without dedicated external codec hardware, partly because no current hardware unit has the functionality we advertise, but more importantly because applications in the near future will integrate video as a normal data type and manipulate it in an application dependent manner. This requires significant flexibility. Fortunately, most modern CPU architectures have been extended with SIMD instructions to speed up digital signal processing. Examples include VIS in Sun's UltraSPARC [7, 10], MMX in Intel's Pentium II, MAX in Hewlett-Packard's PA-RISC, MDMX in Silicon Graphics' MIPS, MVI in Digital's Alpha, and recently AltiVec in Motorola's PowerPC CPUs.

In this paper we show a high performance implementation of a layered codec capable of constructing a large set of layers from reasonably sized test video in real time. We demonstrate how SIMD parallelism and careful consideration of superscalar processing can speed up image processing significantly. Our encoder implementation exists both in an optimized C version and in a version almost exclusively using Sun Microsystems' Visual Instruction Set (VIS), available on the Sun UltraSPARC platform. The decoder exists only in a VIS accelerated version.

The remainder of the paper is structured as follows. Section 2 presents our layered coding scheme and the codec model which incorporates it. Section 3 presents the implementation of an instance of our codec model along with our performance optimization efforts. Section 4 evaluates the coding scheme through a series of measurements. Finally, Section 5 discusses related work and Section 6 concludes.

2 The Codec Model

Our codec combines two independent layering facilities: spatial layering and signal-to-noise ratio (SNR) layering. Spatial layering controls the resolution of the image; subscription to more spatial layers means higher image resolution. SNR layering controls the number of bits used to represent image pixels; the more bits, the higher the pixel precision and thus a sharper and more detailed image.

The wavelet filter performs a subband decomposition of the image and decomposes it into one low (L) and one high frequency (H) subband per dimension. The effect of subband decomposition on the Lena image is depicted in Figure 2. The LL subband is a subsampled and filtered version of the original image. The three subbands HL, LH, and HH contain the high frequency information necessary to reconstruct the horizontal, vertical, and diagonal resolution, respectively. Therefore the HL, LH, and HH layers act as refinement or enhancement layers for the LL layer. Further, the LL subband can itself be subject to further subband decompositions, or can be subjected to the familiar block-based DCT coding scheme known from other codecs for further, psycho-visually enhanced compression. We use a 2-D version of the 4-tap spline wavelet [5], because it achieves good pixel value approximation and because it is highly symmetric and consequently can be implemented very efficiently.
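To make the decomposition step concrete, the sketch below computes one 2-D subband decomposition over 4x4 input blocks stepped by two pixels, in plain C. The tap values {1,3,3,1} and {1,-3,-3,1} and the divide-by-64 normalization are read off the coefficient matrix in Figure 4; this is a reference illustration under those assumptions, not the paper's VIS implementation, and border handling is omitted.

#include <stdint.h>

static const int LP[4] = { 1, 3, 3, 1 };   /* lowpass taps (assumed)  */
static const int HP[4] = { 1, -3, -3, 1 }; /* highpass taps (assumed) */

/* src: w x h 8-bit image; ll/hl/lh/hh: (w/2 x h/2) 16-bit outputs. */
void decompose(const uint8_t *src, int w, int h,
               int16_t *ll, int16_t *hl, int16_t *lh, int16_t *hh)
{
    int ow = w / 2;
    for (int y = 0; y + 4 <= h; y += 2)       /* two-pixel steps, so   */
        for (int x = 0; x + 4 <= w; x += 2) { /* blocks overlap by two */
            int sll = 0, shl = 0, slh = 0, shh = 0;
            for (int j = 0; j < 4; j++)
                for (int i = 0; i < 4; i++) {
                    int p = src[(y + j) * w + (x + i)];
                    sll += LP[j] * LP[i] * p; /* low  vert, low  horiz */
                    shl += LP[j] * HP[i] * p; /* low  vert, high horiz */
                    slh += HP[j] * LP[i] * p; /* high vert, low  horiz */
                    shh += HP[j] * HP[i] * p; /* high vert, high horiz */
                }
            int o = (y / 2) * ow + (x / 2);
            ll[o] = (int16_t)(sll / 64);      /* normalize by 8 * 8    */
            hl[o] = (int16_t)(shl / 64);
            lh[o] = (int16_t)(slh / 64);
            hh[o] = (int16_t)(shh / 64);
        }
}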

SNR layering divides the incoming coefficients into several levels of significance, whereby the lower levels include the most significant information, resulting in a coarse image representation. This is then progressively refined with the number of layers. SNR layering is a flexible and computationally efficient way of reducing the video stream bit rate, and it allows fine-grain control of image quality at different layers. In our codec, SNR layering is achieved through a DCT transform combined with repeated uniform quantization.
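A minimal sketch of the repeated quantization, in C, may help; the remainder of each pass re-enters the quantizer, as the arrow in Figure 3 depicts. The three levels match the layer design used later for testing, but the concrete step sizes are invented for illustration, since the paper does not list them.

#define NLEVELS 3
static const int STEP[NLEVELS] = { 64, 16, 4 }; /* coarse to fine (assumed) */

/* Split one transform coefficient into NLEVELS layer symbols. */
void snr_split(int coeff, int out[NLEVELS])
{
    int rem = coeff;
    for (int l = 0; l < NLEVELS; l++) {
        out[l] = rem / STEP[l];      /* quantize the current remainder    */
        rem   -= out[l] * STEP[l];   /* remainder re-enters the quantizer */
    }
}

/* Dequantize by merging however many layers were received. */
int snr_merge(const int out[NLEVELS], int nreceived)
{
    int c = 0;
    for (int l = 0; l < nreceived && l < NLEVELS; l++)
        c += out[l] * STEP[l];
    return c;
}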

Figure 3 shows our codec model. Encoding is performed by the following components. The Wavelet Transformer transforms YUV images into LL, HL, LH, and HH layers, corresponding to four spatial layers. The enhancement layers HL, LH, and HH are sent directly to the quantizer, while the LL layer is passed on to the DCT stage. The DCT Transformer performs the Discrete Cosine Transform on the LL image from above. The resulting coefficient blocks are passed on to the quantization stage.

Fig. 2. Wavelet transformation of the Lena image into LL, HL, LH, and HH layers. The LL layer is a downsampled version of the original image, and the HL, LH, and HH layers contain the information required to restore horizontal, vertical, and diagonal resolution, respectively.

The Quantizer splits incoming coefficients into several layers of significance. The base layer contains a coarse version of the coefficients, and each layer adds precision to the coefficient resolution. The Coder takes the quantized coefficients from each separate layer and Huffman compresses them to produce a single bit stream for each layer.

Fig. 3. Our codec model. In the encoding stages the wavelet transformer sends the LL layer to the DCT stage before all layers go through the quantizer and on to transmission. The quantizer reuses coefficient remainders, as depicted by the arrow re-entering the quantizer. The decoding stages run from receiver through decoder, dequantizer, inverse DCT transformer, and inverse wavelet transformer.

The decoding stages essentially perform the inverse operations. The Decoder uncompresses the received Huffman codes. The Dequantizer reconstructs coefficients based on the decoded layers; the dequantizer works both as coefficient reconstructor and as layer merger. Like the encoder, there are two different dequantizers: one for DCT coefficients, which are passed on to the IDCT stage, and one for wavelet coefficients, which are passed directly on to the wavelet reconstruction stage. The Inverse DCT Transformer performs IDCT on incoming DCT coefficient blocks, which reconstructs the lowest resolution spatial layer, LL. The LL layer is used in the Inverse Wavelet Transformer, which performs wavelet reconstruction.

The Sender and Receiver modules, which are outside the actual codec model, must support multicasting for efficient bandwidth utilization. The layers are transmitted to different multicast groups, so receivers can adjust their reception rate by joining and leaving multicast groups, as sketched below. Further details on this design can be found in [4].
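As an illustration of the receiver side, the following sketch subscribes to and drops one layer's multicast group using the standard IP multicast socket options; the group address per layer and the surrounding session management are placeholders, not part of the codec itself.

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Join the multicast group carrying one layer (start receiving it). */
int subscribe_layer(int sock, const char *group_addr)
{
    struct ip_mreq mreq;
    memset(&mreq, 0, sizeof mreq);
    mreq.imr_multiaddr.s_addr = inet_addr(group_addr); /* layer's group */
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    return setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                      &mreq, sizeof mreq);
}

/* Leave the group (drop the layer, e.g. to reduce congestion). */
int unsubscribe_layer(int sock, const char *group_addr)
{
    struct ip_mreq mreq;
    memset(&mreq, 0, sizeof mreq);
    mreq.imr_multiaddr.s_addr = inet_addr(group_addr);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    return setsockopt(sock, IPPROTO_IP, IP_DROP_MEMBERSHIP,
                      &mreq, sizeof mreq);
}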

Our design does not offer temporal layering besides simply dropping frames. The most effective temporal compression schemes use motion compensation, but this does not in itself add any layering, only a lower bit rate; our goal is to provide high bandwidth versatility. Moreover, it is nontrivial to obtain scalability from a motion compensated stream, because frames are interdependent: both the desired frame and the frames it depends on must be received. We have therefore decided to postpone temporal, motion compensated layering to future work.

3 Implementation

The Visual Instruction Set found on Sun's UltraSPARC CPUs is capable of processing 8-, 16-, or 32-bit partitioned data elements in parallel. In addition to the usual SIMD multiplication and addition instructions, VIS contains various special instructions dedicated to video compression and to manipulation of 2-dimensional data. Furthermore, the UltraSPARC CPU has two pipelines and is therefore able to execute pairs of VIS instructions in parallel.

The VIS instructions are available to the application programmer through a set of macros which can be used from C programs with a reasonable amount of effort, although it must be realized that even with these C macros, programming VIS is essentially programming at the assembler level. The programmer is, however, relieved of certain aspects of register allocation and instruction scheduling.

Our implementation is optimized at the architecture level as well as at the algorithmic level. The architecture level optimizations fall in three categories: (1) using VIS to achieve SIMD parallel computation of four data elements per instruction, (2) using superscalar processing to execute two independent instructions in parallel per clock cycle, and (3) reducing memory accesses by keeping constants and input data in registers and by using the 64-bit load/store capabilities. These optimization principles are applied in all stages of the codec. Below we exemplify them on the layer decomposition stage.

The 4x4 decomposition filter is applied by multiplying each pixel in a 4x4 input block with the corresponding element in the filter coefficient matrix. The results are then summed into a single value and divided by 64 to produce an average, which is the final output. This is done for each layer type. The filter dictates a two-pixel overlap between the current and the next input block, which therefore becomes the image data starting two pixels to the right relative to the current block. The implementation of the decomposition routine operates on 8x4 pixel blocks at a time (see Figure 4), producing 16-bit outputs, one for each of the LL, HL, LH, and HH layers. It uses 64-bit loads to fetch input data, and since all data fits in registers, it requires only a few 64-bit loads and stores per pixel block. The routine spills overlapping pixels from one block to the next, so horizontal traversal causes no redundant memory accesses. Vertically there is a two-pixel overlap dictated by the filter, so each pair of vertical lines is read twice.

By utilizing the VIS fmul8x16 instruction, which multiplies four 8-bit values with four 16-bit values and produces four 16-bit results, the four pixels in one row of a block can be multiplied with the corresponding row of the filter coefficients in parallel. Similarly, the next row of the block can be SIMD multiplied with its row of the filter coefficients. Moreover, both of these multiplications can execute in parallel: the superscalar processing in the UltraSPARC CPU allows execution of two independent VIS instructions in parallel per clock cycle. Two instructions are independent if the destination of one instruction is different from the source of the next instruction, as is indeed the case here.
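A hedged sketch of this inner step using the VIS C macros might look as follows; vis_fmul8x16, vis_fpadd16, and the vis_proto.h header are from Sun's VIS SDK, while the helper function and its argument layout are our own illustration, not the paper's actual routine.

#include <vis_proto.h>  /* VIS C macro interface (Sun VIS SDK) */

/* pix0/pix1: four 8-bit pixels each (two rows); c0/c1: four 16-bit
 * filter taps each; acc0/acc1: four parallel 16-bit partial sums.  */
void mul_two_rows(float pix0, float pix1, double c0, double c1,
                  double *acc0, double *acc1)
{
    double p0 = vis_fmul8x16(pix0, c0); /* 4 products for row 0       */
    double p1 = vis_fmul8x16(pix1, c1); /* 4 products for row 1; no
                                           data dependence on p0, so
                                           the two multiplies can pair */
    *acc0 = vis_fpadd16(*acc0, p0);     /* 4-lane 16-bit accumulate   */
    *acc1 = vis_fpadd16(*acc1, p1);
}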

The UltraSPARC CPU is incapable of out-of-order execution, so instructions must be carefully scheduled, either manually or by the compiler, to fully exploit instruction independence. However, we found that the compiler supplied with the platform did not do a satisfactory job on this point, and we have therefore manually organized the VIS code to pair independent instructions and thereby maximize the benefit from superscalar processing. Thus SIMD parallelism reduces the number of calculations by three quarters, and superscalar parallelism further halves this number, leaving roughly one eighth of the original work.

Fig. 4. Utilizing SIMD and superscalar processing on an 8x4 pixel block (1 byte per pixel). SIMD instructions allow elements in the same row to be multiplied with the corresponding filter coefficient row in parallel. Superscalar processing enables two rows in the same block to be computed in parallel. Thus the enclosed pixels are processed in parallel. The filter coefficient matrix shown has rows (1, -3, -3, 1), (-3, 9, 9, -3), (-3, 9, 9, -3), (1, -3, -3, 1).

At the algorithmic level, we exploit the fact that the filter coefficients for the four filters are highly symmetric. This permits computation of the LH block, after the LL block has been computed, through simple sums and differences of data already cached in CPU registers, that is, without the need for multiplication with a new set of filter coefficients. Similarly, the HH block can be derived from the HL block. Therefore all four types of data blocks are constructed from a single scan of each of the YUV picture components (Y is the luminance component; U and V are the chrominance components).
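For the vertical direction the reuse can be sketched as the following arithmetic identity in C, assuming lowpass taps {1,3,3,1} and highpass taps {1,-3,-3,1} as in Figure 4; the register-level code differs, but the point is that LH costs only sums and differences once LL's partial sums exist.

#include <stdint.h>

/* s[j]: row j of a block after horizontal lowpass filtering. */
void ll_and_lh(const int32_t s[4], int32_t *ll, int32_t *lh)
{
    int32_t outer = s[0] + s[3];       /* taps +1, +1                */
    int32_t inner = 3 * (s[1] + s[2]); /* taps +3,+3 (or -3,-3)      */
    *ll = (outer + inner) / 8;         /* vertical lowpass           */
    *lh = (outer - inner) / 8;         /* vertical highpass for free */
}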

With these optimizations, the implementation needs only a few memory accesses per pixel to create all wavelet layers, and at most a few memory accesses per pixel per reconstructed layer.

4 Performance Measurements

The input is a test video stream of fixed resolution and colour depth, digitized at a constant frame rate. The stream is a typical talking-head sequence, with one person talking and two persons entering and leaving the image as background movement. The CPU usage was measured on a Sun Ultra Creator machine with a single UltraSPARC CPU. The test video stream is stored in YUV format, so the colour space conversion stage is not included in the encoding CPU usage measurements. Likewise, the colour space conversion and display times for the decoder are not included, as together they take nearly constant time per frame.

Our codec configuration uses one level of subband decomposition and has a total of 21 layers, as seen in Figure 5. Layer 0 is the most significant, down to layer 20 as the least significant. The first 12 layers, DCT0 through DCT11, are DCT coded layers constructed from the wavelet LL layer. The remaining 9 layers, HL0-HL2, LH0-LH2, and HH0-HH2, are constructed from the corresponding wavelet enhancement layers, with 3 SNR layers from each. The most significant SNR layers (level 0) consist of the upper, most significant bits of each coefficient from each of the wavelet layers. Layers from levels 1 and 2 consist of additional bits per coefficient per layer; all three levels together thus add further bits of precision to each coefficient. The most significant layers, i.e. those on level 0, are Huffman coded. The layers on levels 1 and 2 are sent verbatim, because their contents are very random and Huffman coding would add very little, if any, compression.

Fig. 5. The layer definitions used for testing, along with their significance ordering: first the SNR layers DCT0-DCT11 from the DCT coded LL layer, then the SNR layers from the wavelet enhancement layers, grouped as level 0 (HL0, LH0, HH0), level 1 (HL1, LH1, HH1), and level 2 (HL2, LH2, HH2). HL0, LH0, and HH0 contain the most significant bits from the corresponding wavelet enhancement layers; levels 1 and 2 are refinement layers to these. The DCT layers and the level 0 layers are Huffman coded.

The resulting accumulated bit rates grow evenly from the lowest rate with one layer to the full rate with all 21 layers, as shown in Figure 6. The lowest number of layers necessary to produce clearly recognizable persons in the test movie was found to be the first few DCT layers, corresponding to a small average frame size. This amounts to a large factor of difference between the lowest capacity receiver and the highest capacity receiver, a significant variation. Sample images can be found on the World Wide Web at http://www.cs.auc.dk/~bnielsen/codec.

Fig. 6. Bandwidth distribution on layers of the 21 layer codec. (a) Average size of the individual layers (bits). (b) Accumulated average frame size (kbit) versus number of added layers.

The CPU usage measurements show average encoding times for both the C and VIS versions and average decoding times for the VIS version, and they identify the cost distribution between all codec stages. The results are summarized in Table 1. The average time the encoder needs to construct and compress all 21 layers lets it process frames at more than the real-time rate, fast enough for real-time software encoding. Similarly, the total average decoding time corresponds to a frame rate above that of the test stream. Additional time is required for colour space conversion and display drawing, but this still allows real-time decoding and display of the test video stream. Further, in most real-life applications one would rarely reconstruct all layers, meaning even faster decoding times.

Table 1. Average encoding and decoding times. The Wavelet Transform produces all wavelet layers; the DCT transforms the LL layer; the Quant+Huff (wavelet) and Quant+Huff (DCT) stages quantize and code the wavelet enhancement layers and the DCT SNR layers, respectively. The DeQuant+UnHuff (DCT) stage decodes all DCT SNR layers; DeQuant+UnHuff (wavelet) decodes the wavelet enhancement layers. Wavelet reconstruction upscales the LL image and adds resolution by reconstructing the HL, LH, and HH layers.

Encoding time (ms)                  Decoding time (ms)
Stage                   VIS   C     Stage                         VIS
Wavelet Transform                   DeQuant+UnHuff (DCT) ‡
Quant+Huff (wavelet)                IDCT †
DCT †                               DeQuant+UnHuff (wavelet) ‡
Quant+Huff (DCT)                    Wavelet reconstruction
Total                               Total

† The C and VIS versions of the DCT and IDCT functions are from Sun's MediaLib graphics library [9].
‡ Only the C version exists of the dequantization and Huffman decoding stages.

Decoding appears more expensive than encoding, which is unusual. The reason is that the decoder is designed to be very flexible with respect to which frames it decodes and in what order it does so. It permits separate decoding of each layer, which requires a separate iteration across the image per layer. This makes it easier to change the set of layers subscribed to dynamically, e.g. to reduce congestion. Also, decoding can begin as soon as a layer is received. Finally, note that two of the stages are not VIS accelerated.
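The layer-at-a-time organization can be sketched as follows in C; the types, the layer representation, and the step sizes are illustrative stand-ins for the codec's actual data structures.

#include <stdint.h>

/* Accumulate each received layer in its own pass over the plane, so
 * decoding can start as soon as any layer arrives and the number of
 * layers can change between frames.                                 */
void merge_layers(int32_t *coeff, int ncoeff,
                  int32_t **layers, const int *steps, int nreceived)
{
    for (int i = 0; i < ncoeff; i++)
        coeff[i] = 0;
    for (int l = 0; l < nreceived; l++)          /* one pass per layer  */
        for (int i = 0; i < ncoeff; i++)
            coeff[i] += layers[l][i] * steps[l]; /* add layer precision */
}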

The SIMD implementation provides a significant performance improvement. For comparison, an otherwise identical, efficient, and compiler optimized non-VIS accelerated C implementation of the encoder encodes at only a fraction of the VIS version's frame rate. A more detailed inspection of the individual stages reveals that VIS provides severalfold speedups for these particular algorithms: the wavelet transformation, the quantization and Huffman coding of the DCT coefficients, and Sun's MediaLib implementation of the DCT are each several times faster in their VIS versions than in their C versions. The overall effect of VIS acceleration is that real-time coding becomes possible, with time to spare for network communication, display updates, and other application processing.

5 Related Work

Layered video coding and related network design issues are active and recent research areas. Related work on layered video codecs exists, notably [3, 6, 11]. The codec in [6] resembles our codec in that it uses wavelets for subband decomposition and DCT coding for the LL wavelet layer. Unlike ours, their design does not allow for several levels of subband decomposition. They include temporal layering in the codec using conditional block replenishment. The authors stress the need for error resilient coding mechanisms for error prone networks such as the Internet. Their receiver-driven layered multicast (RLM) scheme, where receivers adjust their reception rate by joining and leaving multicast groups, was developed to work in environments like the MBone using IP multicasting. Although suggestions for implementation optimizations are presented in the paper, they present very little information about run-time performance.

MPEG-2 was the first standard to incorporate a form of layering, called scalability, as part of the standard [2]. But MPEG-2 is intended for higher bit-rate video streams and therefore only allows for three enhancements to the base layer, one from each scalability category: spatial, SNR, and temporal scalability. Also, no MPEG-2 implementation exists that includes the scalabilities. A combination of MPEG-2 and wavelet based spatial layering for very high bandwidth video is proposed in [11]. The video is repeatedly downsampled until the resolution reaches that of the common intermediate format (CIF). The CIF-sized LLn layer (the LL layer after n decompositions) is then further processed by an MPEG-2 based codec. Their proposal also offers hierarchical motion compensation of the high frequency subbands, but not temporal scalability.

Compression performance of our codec may be improved by using other methods than Huffman compression. One of the most efficient methods is embedded zerotree wavelet coding [8], but it is most effective when using several levels of decomposition.

6 Conclusions

This paper addresses the problem of efficiently distributing a video stream to a number of receivers on bandwidth heterogeneous networks. We propose layered coding and multicast distribution of the video. We design a proprietary video codec incorporating wavelet filtering for spatial layering and repeated quantization for SNR layering. We contribute a high performance implementation using the SIMD capabilities of our platform. Our measurements show that the codec is capable of real-time software encoding and decoding of a test movie. We found that the Visual Instruction Set accelerated stages in the codec were several times as fast as an otherwise identical, efficient C implementation; we therefore find SIMD acceleration to be worth the extra implementation effort.

As this work was carried out as part of a master's thesis, our experience suggests that the new media instructions can be successfully applied by programmers without years of signal processing experience. Indeed, in several cases the implementation was in fact simplified by using SIMD, although it requires a different line of thought. Since SIMD is integrated into virtually all modern CPU architectures, we think that developers should consider using it, given the possible speedups. Unfortunately, the SIMD engines are not compatible between CPU architectures.

References

1. Hans Eriksson. MBONE: The Multicast Backbone. Communications of the ACM, 37(8), August 1994.
2. Barry G. Haskel, Atul Puri, and Arun N. Netravali. Digital Video: An Introduction to MPEG-2. Chapman and Hall, 1997.
3. Klaus Illgner and Frank Muller. Spatially Scalable Video Compression Employing Resolution Pyramids. IEEE Journal on Selected Areas in Communications, December.
4. Morten Vadskær Jensen and Brian Nielsen. Design and Implementation of an Efficient Layered Video Codec. Technical report, Aalborg University, Dept. of Computer Science, Aalborg, Denmark, August.
5. Martin Vetterli and Jelena Kovacevic. Wavelets and Subband Coding. Prentice Hall, 1995.
6. Steven McCanne, Martin Vetterli, and Van Jacobson. Low-Complexity Video Coding for Receiver-Driven Layered Multicast. IEEE Journal on Selected Areas in Communications, August 1997.
7. Sun Microsystems. UltraSPARC and New Media Support. Whitepaper WPR.
8. Jerome M. Shapiro. Embedded Image Coding Using Zerotrees of Wavelet Coefficients. IEEE Transactions on Signal Processing, December 1993.
9. Sun Microsystems. mediaLib User's Guide, June. http://www.sun.com/microelectronics/vis/mlib_guide.pdf
10. Marc Tremblay, J. Michael O'Connor, V. Narayanan, and Liang He. VIS Speeds New Media Processing. IEEE Micro, August 1996.
11. Qi Wang and Mohammed Ghanbari. Scalable Coding of Very High Resolution Video Using the Virtual Zerotree. IEEE Transactions on Circuits and Systems for Video Technology, October 1997.