TABLE OF CONTENTS

PART I BACKGROUND AND STANDARDS

Video Communications

Importance of Video Compression
Advances in Video Coding
Waveform-Based Video Coding
Model-Based Video Coding

Motion-Compensated DCT Video Coding
Basic Principles of Motion-Compensated Transform Coding
Picture Formats
Color Spaces and Sample Positions
Layers in the Video Stream
Intraframe Block-Based Coding
Spatial Decorrelation through DCT
Exploitation of Visual Insensitivity through Quantization
Lossless Compression through Entropy Coding
Interframe Block-Based Coding
Block-Based Motion Estimation Algorithms
Block-Based Motion Compensation
Coding DCT Coefficients in Interframes
Motion-Compensated DCT Video Encoder and Decoder
Fully DCT-Based Motion-Compensated Video Coder Structure

Video Coding Standards
Overview of Video Coding Standards
JPEG Standards
ITU H Series
MPEG Standards


Video Coding Standards
H.261
H.263
MPEG-1
MPEG-2, H.262, and HDTV
MPEG-4

PART II ALGORITHMS

DCT-Based Motion Estimation
DCT Pseudo-Phase Techniques
2-D Translational Motion Model
The DXT-ME Algorithm
Unitary Property of the System Matrix
Motion Estimation in a Uniformly Bright Background
Computational Issues and Complexity
Simulation for Application to Image Registration
DCT-Based Motion Estimation Approach
Preprocessing
Adaptive Overlapping Approach
Simulation Results
Rough Count of Computations

Interpolation-Free Subpixel Motion Estimation
Pseudo Phases at the Subpixel Level
One-Dimensional Signal Model
Two-Dimensional Image Model
Subpel Sinusoidal Orthogonality Principles
DCT-Based Subpixel Motion Estimation
DCT-Based Half-Pel Motion Estimation Algorithm (HDXT-ME)
DCT-Based Quarter-Pel Motion Estimation Algorithms (QDXT-ME and Q4DXT-ME)
Simulation Results

DCT-Based Motion Compensation
Integer-Pel DCT-Based Motion Compensation
Subpixel DCT-Based Motion Compensation
Interpolation Filter
Bilinear Interpolated Subpixel Motion Compensation
Cubic Interpolated Subpixel Motion Compensation
Simulation Results
Interpolation by DCT/DST
DCT-I Interpolated Sequence
DCT-II of the DCT-I Interpolated Half-Pel Motion-Compensated Block

Matching Encoders with Decoders
Matching SE with SD
Matching TE with TD
Matching TE with SD
Matching SE with TD

MPEG-4 and Content-Based Video Coding
Overview of the MPEG-4 Standard
MPEG-4 Architecture
MPEG-4 Video Coding
Overview of MPEG-4 Video Coding
Arbitrarily Shaped Region Texture Coding
Motion Estimation and Compensation
Arbitrary Shape Coding
Advanced Coding Techniques
Delivering Video Bitstreams over Networks
Rate Control
Error Resilience
Universal Accessibility
DCT-Domain Content-Based Video Coding
Transform-Domain Motion Estimation/Compensation
Simulation Results

PART III ARCHITECTURES AND IMPLEMENTATION

Dual Generation of DCT and DST
Discrete Sinusoidal Transforms
Evolution of the Algorithms and Architectures
What Is Unique in Our Design
One-Dimensional DCT Lattice Structures
Dual Generation of DCT and DST
Inverse Transforms
Multiplier Reduction of the Lattice Structure
Comparisons of Architectures

Two-Dimensional DCT Lattice Structures
Evolution of the Algorithms and Architectures
Dual Generation of 2-D DCT and 2-D DSCT
Architectures of Frame-Recursive Lattice 2-D DCT and 2-D DSCT
Comparisons
Applications to HDTV Systems

Efficient Design of Video Coding Engine
Overview of the Embedded Video Coding Engine
Evolution of the Algorithms and Architectures
Overview of an Embedded Video Coder Design
Efficient Architecture of a Video Coding Engine
Why Should We Use CORDIC-Based Design?
2-D DXT/IDXT-II Programmable Module
Type Transformation Module
Pseudo-Phase Computation
Peak Searching
Half-Pel Motion Estimator Design
Simulation Results

VLSI Design of Video Coding Engine
Design Criteria
VLSI Implementation

Low-Power and High-Performance Design
Low-Power Design
Low-Power Design Approaches
Algorithm/Architecture-Based Low-Power, High-Performance Approaches
Lookahead and Multirate Computing Concepts
Low-Power and High-Performance Architectures
Two-Stage Lookahead Type-II DCT/IDCT Coder
Pipelining Design for DCT Coefficient Conversion
Multirate Design for Pseudo-Phase Computation
Pipelining Design for Peak Search
Two-Stage Lookahead Half-Pel Motion Estimator
Simulation Results and Hardware Cost

PART IV APPLICATIONS

End-to-End Video over IP Delivery
Overview of Our Design
A SONET Network Adapter Design
Joint Source-Channel Multistream Coding
A SONET Network Adapter Design
A Brief Overview of SONET
Packet over SONET or Directly over Fiber
Design and Implementation of a SONET Network Adapter
The Performance of the SONET Device
Multistream Video Coding
What Is Unique in the Multistream Video Coding
The Design of Multistream Video Coding
Simulation Results

Bibliography

Index

Preface

The hybrid DCT motion-compensated approach to video coding has been the core of almost all recent multimedia standards, such as MPEG-1, MPEG-2, H.261, H.263, and even MPEG-4. Therefore, an efficient, high-performance, cost-effective design of a digital video encoder and decoder relies on a good design of the hybrid DCT motion-compensated codec.

The concept of a hybrid DCT motion-compensated codec comes mainly from two parts. One is to employ the discrete cosine transform (DCT), similar to the famous still image standard JPEG, as a means to remove spatial redundancy within an image frame through transform coding. The other is to perform motion estimation and compensation to remove temporal redundancy among image frames through some kind of prediction. Naturally, such a concept leads to an encoder architecture in which the temporal redundancy is first removed by taking the difference between the current image frame and the prediction of the current frame obtained from motion estimation and compensation of the previous frame. Then the difference is further processed by the DCT to remove the spatial redundancy. Such an architecture, commonly used nowadays, has a performance-critical feedback loop consisting of a DCT, a quantization unit and a dequantization unit, an inverse DCT, and a spatial-domain motion estimation/compensation unit. Note that the DCT and motion estimation/compensation together consume most of the computational resources of a digital video encoder. Such a heavily loaded feedback loop not only increases the overall complexity of the encoder but also limits the throughput, becoming the bottleneck in designing a real-time, high-performance, cost-effective digital video system.

Is there a better way to design the video encoder? This is the question we have been trying to answer. In this monograph, we present an encoder structure that, by combining transform coding with motion estimation and compensation completely in the DCT domain, can reduce the complexity inside the loop significantly. The question is: can we perform motion estimation and compensation in the DCT domain efficiently, i.e., with lower overall complexity and a higher data throughput rate? We have developed a motion estimation scheme that operates completely in the DCT domain. At first look, it may seem that such a scheme, because of its need for other transforms of a similar family, may require higher computational complexity from an algorithmic point of view. Nevertheless, we can show that, with an efficient design of a signal processing architecture, those transforms can be generated altogether naturally, with almost no or little hardware penalty compared with the basic hardware cost of the DCT. In fact, through the generation of those transforms, the operations of motion estimation have been inherently performed. As such, both the DCT and motion estimation are combined into a single unified component. Therefore, to answer the question of finding a better way of designing a digital video encoder, the solution comes not from the algorithms domain alone but also from its interactions with our understanding of architecture/hardware issues.

In fact, given today's optical technology, the repeated computation of those required transforms can easily be handled by an optical engine in almost negligible time. Therefore, the proposed complete transform-domain approach can gain incredible advantages over conventional electronic designs in areas such as fiber-optical multimedia communications, where speed is the essence of everything. If the optical engine can be made cost-effective, then the proposed approach can even be employed to deliver low-cost, real-time personal video encoders everywhere.

This book contains part of the research we have been conducting in search of a better implementation of the digital video encoder. The scope of the entire view, as it relates to the interactions and evolution of algorithms and architectures, cannot easily be presented and understood through various technical publications of limited scope, given the constraint of page limitations. Thus we are motivated to devote this book to readers who are interested in designing a new class of high-performance, low-power digital video encoders. This is just the starting point of the journey, as readers may find that there are many possibilities and unanswered questions. We hope this book can serve as a seed that leads readers to think that perhaps there is a better way to design and implement digital video encoders.

In order to prepare readers with different backgrounds to understand the materials, there are four parts in this book. Part I covers fundamental material on the background and standards of digital video. In Part II, the algorithmic aspects are considered, followed by the discussion of design and implementation in Part III. Finally, in Part IV, an application to a SONET optical transcoder is presented.

Part I contains Chapters 1, 2, and 3. We devote Chapter 2 to the basics of the motion-compensated DCT video coding approach (MC-DCT). Then various MC-DCT based video coding standards, such as H.261, H.263, MPEG-1, and MPEG-2, are presented in Chapter 3. After the introduction of the commonly used MC-DCT approach in Chapter 2, the disadvantages of the conventional block-based motion estimation and compensation video coder structure, used in all the coding standards, are also pointed out. To overcome those disadvantages, the idea of a fully DCT-based coder design is presented.


Part II runs from Chapter 4 to Chapter 8. To be able to realize transform-domain motion estimation, the DCT pseudo-phase techniques are developed in Chapter 4 to estimate the motion directly from the DCT coefficients of two consecutive blocks. Such techniques serve as the basic foundation of the DCT-based motion estimation algorithm. The interpolation-free subpixel DCT-based motion estimation algorithms are discussed in Chapter 5; they estimate displacements at half-pel and even quarter-pel accuracy without image interpolation. In Chapter 6, the integer-pel and subpixel DCT-based motion compensation algorithms are devised to complete the fully DCT-based video coder structure. To allow conventional video codecs to be freely matched with compressed-domain codecs for the sake of interoperability, a set of rules on the requirements of DCT/IDCT and motion compensation algorithms is developed in Chapter 7. In order to perform motion estimation of an arbitrarily shaped video object plane in MPEG-4 video, in Chapter 8 we present a content-based transform-domain motion estimation scheme (EDXT-ME) based on the DCT pseudo-phase techniques. Notice that if the original input image sequences are not decomposed into several video object layers of arbitrary shape, the EDXT-ME scheme simply degenerates into a single-layer representation that supports conventional image sequences of rectangular shape.

In Part III, Chapters 9 through 13 are presented. Unlike many architectures for computing the 2-D DCT, the time-recursive lattice structures presented in Chapters 9 and 10 can generate all the required transforms extremely effectively with low overhead. Those transform operations will be used repeatedly for the DCT pseudo-phase computation. The compressed-domain video coding algorithm calls for a larger set of elementary operations (square roots, divisions, trigonometric functions, and, somewhat less often, hyperbolic transformations) which cannot be evaluated efficiently with conventional multiplication-and-accumulation based arithmetic units. On the other hand, CORDIC, involving only simple elements such as adders, shifters, and registers, offers an efficient way to evaluate each of those elementary functions. A fully pipelined parallel CORDIC-based architecture is therefore presented in Chapter 11 to estimate motion with both integer-pel and half-pel accuracy. Furthermore, this multiplier-free structure is regular and modular and has solely local connections, making it suitable for VLSI implementation. We therefore present our single-chip implementation to demonstrate the design performance in Chapter 12. With the advent of personal communications services (PCS) and personal digital assistants (PDAs), the future trend is to run MPEG-4 applications on such portable devices. The need for high-speed data/signal processing will lead to much higher power consumption than in traditional portable applications. To meet the needs of portable, high-quality, high-bit-rate picture transmission, we extend our compressed-domain design to low-power and high-speed applications. An algorithm-based low-power and high-speed video coder design is presented in Chapter 13, where techniques such as lookahead, multirate processing, pipelining, and folding have been combined and used in the design to obtain substantial power savings.

After having discussed the complete compressed-domain video codec design, Part IV contains the last chapter, which portrays a panoramic picture by addressing applications under current communication environments. Anticipating that packet video over SONET, or directly over optical fiber, is a leading expedient solution for providing high-capacity interconnection between end users, we present in Chapter 14 a flexible way to design and implement a SONET transcoder, or network adapter, serving as a Layer-3 IP router. Although optical networks are ideal for video transmission, their cost is still beyond the reach of average users, and last-mile services (wireline or wireless connections) are most likely needed before reaching the optical networks. We therefore present a joint source-channel multistream video coding scheme to combat transmission errors over access networks. On top of the conventional error control and concealment techniques, this multistream design provides another layer of error protection by taking advantage of the content-based video coding presented in the previous chapters.

The results presented in this book have been supported in part by the National Science Foundation and the Office of Naval Research. We would like to take this opportunity to thank John Cozzens of the National Science Foundation and Cliff Lau of the Office of Naval Research for their research support.

Jie Chen
Ut-Va Koc
K. J. Ray Liu

Chapter 1

Video Communications

The demands for multimedia services are rapidly increasing, while the expectation of quality for these services is becoming higher and higher. To attain the highest possible quality, analog signals such as speech, audio, image, and video are sampled and digitized as digital data for transmission/recording and are reconstructed at the receiving end, in order to be free from the noise and waveform distortion induced in transmission and storage. However, these digitized data are usually voluminous. Even though technology is continuously pushing up the bandwidth limit and reducing the transmission/storage cost, channel bandwidths and storage capacities, as tabulated in Table 1.1, are still limited and relatively expensive in comparison with the volume of these raw digital signals. To make all the digital services feasible and cost-effective, data/signal compression is essential. As depicted by the Schouten diagram in Fig. 1.1, all digital signals carry redundant and irrelevant information and are therefore subject to compression, by removing the redundancy and irrelevancy, for efficient use of bandwidths at the lowest possible cost.

Figure 1.1: The Schouten diagram shows signal compression through the removal of redundancy and irrelevancy in digital signals.

CHANNEL              BANDWIDTH / BIT RATE                        MEDIUM
POTS modem           up to 56 kbps                               copper
DS-0                 64 kbps                                     copper
T-1 (DS-1)           1.544 Mbps                                  copper
T-3 (DS-3)           44.736 Mbps                                 copper
Cable modem          up to 30 Mbps                               copper/coaxial
Ethernet             10 Mbps                                     copper/coaxial
Fast Ethernet        100 Mbps (1 Gbps for Gigabit Ethernet)      copper/coaxial
ISDN                 p x 64 kbps                                 copper/coaxial
ADSL                 1.5-9 Mb/s to user, 16-640 kb/s to net      copper/coaxial
VDSL                 13-52 Mb/s to user, 1.5-2.3 Mb/s to net     copper/coaxial
FDDI (X3T9.5)        100 Mbps                                    fiber
SONET / SDH          p x 51.84 Mbps                              fiber
CDPD                 19.2 kbps                                   wireless
GSM / DCS-1800       9.6 kbps                                    wireless
IS-54                9.6 kbps                                    wireless
IS-95                9.6-14.4 kbps                               wireless
WCDMA                up to 384 kb/s                              wireless
EDGE                 up to 384 kb/s                              wireless
PDC                  9.6 kbps                                    wireless
TETRA                28.8 kbps                                   wireless
APCO Project 25      9.6 kbps                                    wireless

STORAGE              CAPACITY                                    MEDIUM
Floppy disk          1.44 Mbytes                                 magnetic
CD / CD-ROM          650 Mbytes                                  laser
DVD                  4.7 GB (single-layer discs),                laser
                     8.5 GB (double-layer discs)
DAT                  about 1.5 Mbit/s, 2 hours                   magnetic
DRAM                 64 Mbits                                    semiconductor

POTS = Plain Old Telephone System; ADSL = Asymmetric Digital Subscriber Line;
VDSL = Very-high-speed Digital Subscriber Line; SONET = Synchronous Optical NETwork;
SDH = Synchronous Digital Hierarchy; FDDI = Fiber Distributed Data Interface;
CDPD = Cellular Digital Packet Data; DCS = Digital Cellular System;
GSM = Global System for Mobile communications; PDC = Personal Digital Cellular;
WCDMA = Wideband Code-Division Multiple Access; EDGE = Enhanced Data rates for GSM Evolution;
TETRA = Trans-European Trunked Radio;
APCO = Associated Public Safety Communications Officers (Project 25);
CD = Compact Disc; DAT = Digital Audio Tape; DVD = Digital Versatile Disc;
DRAM = Dynamic Random Access Memory; ROM = Read-Only Memory.

Table 1.1: List of channel bandwidths and storage capacities.


In view of the compressibility of digital signals and its importance in digital communication (including transmission and storage), extensive research on data/signal compression, also called source coding in the area of digital communication, has been vigorously pursued over the decades to reduce the data size and, at the same time, improve the perceived quality of the compressed signal. As illustrated in Fig. 1.2, the performance of signal compression, or of a source coder, can be measured in four dimensions:

Figure 1.2: Dimensions of the performance of signal compression, or source coding: signal quality (MOS), compression efficiency (bits/sec, bits), complexity (MIPS, mW, ops), and delay (ms).

- Signal quality is measured on the five-point mean opinion score (MOS) scale, associated with a set of standardized adjectival descriptions: bad, poor, fair, good, and excellent.

- Compression efficiency indicates the number of bits per second required to transmit the compressed signal, or the total number of bits for storage. An alternative indicator is the compression rate, defined as the ratio of the raw to the compressed bit rate.

- The computational complexity of a compression/decompression algorithm refers to the computational requirements of the compression/decompression process, typically measured in terms of the number of arithmetic operations (ops), the memory requirement, the computing power requirement (millions of instructions per second, or MIPS), the power consumption, the chip area required, and the cost to implement.

- Communication delay is critical to the performance of a signal compression algorithm only when two-way interactive communication is involved, such as in the videophone application.

Some regions in this four-dimensional space are theoretically unallowable or practically unreachable. However, there always exist tradeoffs among these four performance criteria. Depending on the specific communication application, certain tradeoffs may be more preferable than others.

In the arena of data/signal compression, image compression and image sequence compression (video coding) have attracted a great deal of attention from the technical community, due to many challenging research topics and immediate or potential applications such as video conferencing, videophony, multimedia, high-definition television (HDTV), interactive TV, telemedicine, etc. Because of the emergence of various international video coding standards, advances in VLSI technology, and the widespread availability of digital computers and telecommunication networks, research efforts in video coding have become directly applicable to product development and increasingly important in industry. In this merging trend, research in video coding/compression plays an increasingly important role.

Importance of Video Compression

The volume of digital video data is notoriously huge. It is implausible to transmit raw video data over communication channels of limited transmission bandwidth, or to save it on storage devices. For convenience of discussion, a number of commonly used source image/video formats are listed in Table 1.2.

For a high-quality HDTV picture that has a spatial resolution of 1920 × 1080 square pixels and is digitized as 8-bit pixels in 3 color components at a 60 Hz interlaced scan, the uncompressed bit rate is about 1.5 Gbit/sec. To compress such high-volume video data, the video processor must have high throughput, to handle such a high bit rate, and low complexity, to reduce the cost and increase the speed. Besides the requirements of high throughput and low complexity for video codecs, a high compression rate is also crucial for any possible application. For a 6 MHz HDTV simulcast transmission channel bandwidth, the channel capacity is limited to about 20 Mbit/sec, requiring a compression rate of around 75.

Consider also the Common Intermediate Format (CIF), the standard for video conferencing recommended by the CCITT, which contains 352 pixels per line and 288 lines per picture for the luminance signal (i.e., 352 × 288 resolution), and 176 pels per line and 144 lines per picture for the two color-difference components (chrominance). At a frame rate of 30 frames per sec (fps) and 8 bits per pixel (bpp), the uncompressed bit rate for CIF is about 36.5 Mbit/sec.

FORMAT     RESOLUTION                       RAW BIT RATE   REMARK
CCIR 601   720 ppl x 485 lpf, 30 fps        167 Mbps       digital video
CIF        352 ppl x 288 lpf, 30 fps        36.5 Mbps      digital video
QCIF       176 ppl x 144 lpf, 30 fps        9.1 Mbps       digital video
SIF        352 ppl x 240 lpf, 30 fps        30.4 Mbps      digital video
HDTV       1280 ppl x 720 lpf, 60 fps       663 Mbps       digital TV
           1920 ppl x 1080 lpf, 30 fps*     1.5 Gbps       digital TV
NTSC       525 lpf, 30 fps*                                analog TV
PAL        625 lpf, 25 fps*                                analog TV
SECAM      625 lpf, 25 fps*                                analog TV
VGA        640 ppl x 480 lines                             computer
SVGA       800 ppl x 600 lines                             computer

CIF = Common Intermediate Format; QCIF = Quarter CIF;
SIF = Source Input Format; HDTV = High-Definition TV;
NTSC = National Television System Committee; PAL = Phase Alternating Line;
SECAM = Sequentiel Couleur avec Memoire; VGA = Video Graphics Adapter.
* interlaced scan (a frame consists of 2 fields);
ppl = pixels per line; lpf = lines per frame; fps = frames per second.

Table 1.2: List of source image/video formats.

Even if we use a smaller format, the Quarter CIF (QCIF), having half the number of pels and half the number of lines stated above, the bit rate of the raw video data is still huge, reaching about 9.1 Mbit/sec. POTS (Plain Old Telephone System), the channel most accessible to the general public, currently offers a bandwidth of only about 56 kbit/sec; even a dedicated ISDN channel offers only 128 kbit/sec. Without compression, most of the applications listed in Table 1.3 would not be feasible or economically realistic, since one could neither transmit over the network nor store such high-volume video data.
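The arithmetic behind these raw-rate figures is simple enough to check directly. The short sketch below is ours, not from any standard: it assumes 8-bit samples with 4:2:0 chroma subsampling (12 bits per pixel on average) for CIF and QCIF, and three full-resolution 8-bit components at a 60 Hz interlaced scan (30 full frames per second) for the HDTV case quoted earlier.

```python
# Raw (uncompressed) bit-rate arithmetic for the formats discussed above.

def raw_bitrate(width, height, fps, bits_per_pixel):
    """Uncompressed bit rate in bits per second."""
    return width * height * bits_per_pixel * fps

# 4:2:0 sampling with 8-bit samples averages 12 bits per pixel:
print(f"CIF : {raw_bitrate(352, 288, 30, 12) / 1e6:.1f} Mbit/s")    # ~36.5
print(f"QCIF: {raw_bitrate(176, 144, 30, 12) / 1e6:.1f} Mbit/s")    # ~9.1

# HDTV with three full-resolution 8-bit components (24 bits/pixel),
# 60 Hz interlaced scan, i.e., 30 full frames per second:
print(f"HDTV: {raw_bitrate(1920, 1080, 30, 24) / 1e9:.2f} Gbit/s")  # ~1.49
```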

APPLICATION                              FRAME SIZE    UNCOMPRESSED   COMPRESSED
Slow-motion video (10 frames/s)          176 x 120     5.07 Mbps      8 kbps
Video conference (15 frames/s)           352 x 240     30.4 Mbps      64 kbps
Digital video on CD-ROM (30 frames/s)    352 x 240     60.8 Mbps      1.5 Mbps
HDTV (60 frames/s)                       1280 x 720    1.33 Gbps      20 Mbps

Table 1.3: Applications for image and video compression.

In the past decades, there have been significant advancements in algorithms and architectures for processing image and video signals. These advancements have proceeded along several directions. On the algorithm front, new techniques have led to the development of robust methods for compressing image and video data. Such methods are extremely vital in many applications that manipulate and store digital data. On the architecture front, it is now feasible to put sophisticated compression processes on relatively low-cost and low-power hardware; this has spurred a great deal of activity in developing multimedia systems for the large consumer market.

Advances in Video Coding

The research in image sequence compression, or video coding, is a natural extension of the research in image compression/coding that has been active over several decades. Beyond the removal of spatial and spectral redundancy in response to our human visual system (HVS), video coding further exploits the temporal correlation between consecutive frames. In image coding, the first generation of research focuses on pixel-to-pixel correlation (waveform-based, building on statistical image models), while the second generation utilizes knowledge of more complicated structural image models and the properties of the human visual system to achieve compression efficiency above the theoretical limit predicted by classical source coding theory. The second-generation coding techniques can be further divided into two groups:

- Local-operator-based techniques are based on models of the HVS and include pyramidal and subband coding and anisotropic nonstationary predictive coding.

- Contour-texture oriented techniques describe an image in terms of structural primitives such as contours and textures. Two approaches were developed: a region-growing based coding approach and a directional decomposition coding approach.

In video coding, recent research can be categorized roughly into two main groups: waveform-based coding and model-based (or knowledge-based) coding.

Waveform-Based Video Coding

In waveform-based coding, compression is achieved directly on a two-dimensional discrete distribution of light intensities. Although the distribution is a projection of three-dimensional scenes onto the 2-D image plane, what is visible is the 2-D waveform of sampling points. A basic problem in waveform-based compression is to achieve:

1. the minimum possible waveform distortion for a given encoding rate, or

2. a given acceptable level of waveform distortion with the least possible encoding rate,

by eliminating three types of redundancy: spatial, temporal, and spectral (chromatic).

Since high spectral correlation exists among the three primary colors (red, green, and blue), and since the HVS is not as sensitive to the chrominance components as to the luminance component of a color image, reduction of spectral redundancy is attained by linearly transforming the color space from RGB (red-green-blue) to YUV or YCrCb (luma-chroma) and then subsampling the chrominance components (so-called 4:2:0 subsampling). Spatial and temporal compression can be achieved either separately (spatial/temporal) or jointly (spatiotemporal), as shown in Fig. 1.3(a) and (b), respectively. A video compression system should smartly combine spatial, temporal, and spectral redundancy reduction techniques. To achieve high temporal compression, waveform-based coding usually requires motion estimation and compensation.
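As a concrete illustration of the spectral step, here is a minimal sketch (the function names are ours, for illustration only) of the ITU-R BT.601 RGB-to-YCbCr transform followed by 2x2 averaging of a chroma plane, which is the 4:2:0 subsampling referred to above:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """RGB -> YCbCr with ITU-R BT.601 coefficients (rgb: H x W x 3 floats)."""
    m = np.array([[ 0.299,     0.587,     0.114    ],
                  [-0.168736, -0.331264,  0.5      ],
                  [ 0.5,      -0.418688, -0.081312 ]])
    ycbcr = rgb @ m.T
    ycbcr[..., 1:] += 128.0          # offset the two chroma components
    return ycbcr

def subsample_420(chroma):
    """2x2 averaging of a chroma plane (the '4:2:0' step)."""
    h, w = chroma.shape
    return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```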

Figure 1.3: Compression systems: (a) spatial/temporal (hybrid) compression; (b) spatiotemporal (joint) compression.

Spatial/Temporal Compression (Hybrid Approach)

This hybrid approach treats temporal compression and spatial compression separately, so that the benefits of both schemes can be retained. Temporal compression is often achieved through temporal prediction, or motion estimation and compensation, while spatial compression is usually accomplished via transform coding, subband coding, or vector quantization. Combining the two coding blocks in different orders specifies two types of hybrid schemes.

Figure 1.4: Hybrid compression: (a) hybrid spatial and temporal compression; (b) hybrid temporal and spatial compression. Here T denotes transformation and Q quantization, while Q⁻¹ denotes the inverse quantization operation (reconstruction); VLC denotes entropy coding, normally implemented as variable-length coding.

- Hybrid spatial and temporal compression: A transform coder is followed by a Differential Pulse Code Modulation (DPCM) coder (temporal predictor), as shown in Fig. 1.4(a). It is probably the first hybrid coder.

- Hybrid temporal and spatial compression (also called vector-predictive coding): The transform coder is put inside the feedback loop of the predictive coder, as shown in Fig. 1.4(b). This hybrid scheme was presented by Forchheimer and Ericson and by Jain, independently. The advantages recognized at that time are: (a) errors arising from the transform coder can be handled by the feedback control loop; (b) the probability distribution of the error vectors may be more easily modeled and coded than the image itself; (c) the feedback loop works in the image domain, so good predictors (e.g., predictors considering motion information) can be fully utilized. A theoretical study showed that this scheme is essentially optimum for the case of a stationary Gaussian source and a mean-square-error distortion measure. After the efforts of several years, it evolved into the motion-compensated hybrid approaches, as shown in Fig. 1.5.

Figure 1.5: Motion-compensated hybrid compression.

- Motion-compensated hybrid approaches: The temporal predictor is assisted by motion estimation and compensation to further reduce temporal redundancy, resulting in much smaller motion-compensated residuals. This scheme today has governed the efforts on video coding standards, ranging from terrestrial broadcasting of HDTV and digital video (such as MPEG-1, MPEG-2, and MPEG-4) to the videophone standards H.261 and H.263, as we will discuss in Chapter 3. The main reasons behind the popularity of this scheme, besides those already mentioned in the vector-predictive coding design above, are:

  (a) High coding efficiency: The temporal redundancy is exploited by motion-compensated prediction, and the spatial correlation existing in the motion-compensated difference signal is further removed using transform coding such as the DCT.

  (b) Matured coding schemes and techniques: Transform coding such as the DCT and block-based motion-compensated prediction are all matured coding techniques.

  (c) Short coding delay: In many situations (e.g., visual communication), the coding delay must be strictly limited. Having no frame delay, this motion-compensated hybrid approach is especially well suited for bidirectional visual communication applications. (A minimal sketch of one pass around this loop follows below.)
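The following sketch traces one block's trip around the motion-compensated hybrid loop of Fig. 1.5. It is schematic only, not any standard's normative process: entropy coding (VLC) is omitted, the motion vector is assumed to have been found already, and the matched block is assumed to lie inside the frame. The comments use the T and Q names from Fig. 1.4.

```python
import numpy as np
from scipy.fft import dctn, idctn   # separable 2-D DCT / inverse DCT

def code_block(current, recon_prev, pos, qstep):
    """One pass of the motion-compensated hybrid (MC-DCT) loop for a block.

    pos is the top-left (row, col) of the matched block in the reconstructed
    previous frame, as delivered by the motion estimator."""
    r, c = pos
    n, m = current.shape
    pred = recon_prev[r:r + n, c:c + m]        # motion-compensated prediction
    residual = current - pred                  # temporal redundancy removed
    coeff = dctn(residual, norm='ortho')       # spatial redundancy removed (T)
    q = np.round(coeff / qstep)                # quantization (Q), the lossy step
    # Feedback path (Q^-1, T^-1): the encoder reconstructs exactly what the
    # decoder will see, so predictions never drift out of sync.
    recon = pred + idctn(q * qstep, norm='ortho')
    return q, recon
```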

Depending on which spatial compression method is used, these hybrid approaches can be further categorized as follows:

- Motion-compensated transform coding: Overlapping block-based motion estimation/compensation techniques are employed to achieve temporal compression, and the resulting motion-compensated prediction errors are further compressed by a transform coder. If the transform coder adopts the Discrete Cosine Transform (DCT), we call this hybrid approach the motion-compensated DCT scheme (MC-DCT). When the images are generated by a first-order Markov process, the DCT is equivalent to the optimum Karhunen-Loeve transform (KLT), which packs most of the energy into as few transform coefficients as possible. Moreover, for images of real-world scenery, the DCT is also a very efficient transform coder. MC-DCT is the basis of many international video coding standards and will be discussed in detail in a later chapter.

- Motion-compensated subband coding: The motion-compensated frame differences are decomposed into 2-D subbands, and compression is facilitated by truncating some subbands.

- Motion-compensated vector quantization: Vector quantization is applied either to the motion-compensated residuals, if motion is detected, or to the intra frames, if no significant motion is found.

In addition to the difference in approaches, the shape of the basic encoding unit can also vary: a square block, an irregular block, or the full frame.

- Block-based approach: A whole frame is divided into many square blocks, each of which is coded by a different approach, such as the above motion-compensated hybrid approaches or a fractal approach.

- Region-based approach: Unlike the block-based approach, a frame is segmented into blocks of different irregular shapes according to some criteria, such as motion vector fields. Usually a patch (mesh) is built, and the motion of the grid points is tracked instead of that of every pixel. Depending on how many grid points determine one segment, or on whether the mesh is adjusted frame by frame, it can be:

  - either quadrangle-based (four grid points for each segment) or triangle-based (three grid points);

  - either a fixed mesh or an adaptive mesh. For the adaptive mesh approach, grid points are tracked based on an energy criterion, and pixels inside a segment are interpolated by a simple function found by curve fitting. This technique has also been found to apply to coding mouth motion, which is difficult for conventional block-based approaches.

- Full-frame approach: Treat each frame as a point in a subspace and track the slow change of this subspace.

Spatiotemporal Compression (Joint Approach)

A video sequence is considered as a 3-dimensional signal with two spatial dimensions and one temporal dimension, and therefore spatial and temporal compression is achieved in a uniform manner. The joint approach is considered to be an extension of 2-D image coding and may incorporate motion compensation within the 3-D coder. Different techniques can be applied, classified roughly as follows:

- 3-D transform coding (including 3-D DCT coding): Several consecutive frames are grouped together and divided into 3-D blocks, which are then coded with transform coding (usually the DCT). Some approaches also use motion compensation to align moving objects across frames. In this way, temporal correlation as well as spatial correlation can be compacted with the well-understood transform techniques such as the DCT. This approach can also be mixed with other techniques, such as wavelet or subband coding (a small sketch follows this list).

- 3-D wavelet coding and 3-D subband coding: Basically, these approaches divide the whole sequence into different frequency bands, or decompose it with different wavelet bases. Each band is then encoded separately, according to its characteristics.

- 3-D fractal coding: Instead of intra- and interframe coding of individual frames, three-dimensional regions of the sequence are coded simultaneously. The principle of 3-D fractal video coding is similar to that of 2-D fractal coding. In essence, 2-D fractal coding partitions a 2-D image into non-overlapping range blocks and finds a larger block of the same image (the domain block) for every range block, such that a transformation (a combination of a geometrical transformation and a luminance transformation) of the domain block is a good approximation of the range block. In the 3-D case, range cubes are approximated through transformations of domain cubes. Motion compensation can also be incorporated before the 3-D fractal coding process.
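As a small illustration of the 3-D transform coding item above, the following sketch applies a separable 3-D DCT to a group of frames; the 8x8x8 cube and the keep-the-top-5% rule are arbitrary illustrative choices of ours, not a published scheme:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Stack 8 consecutive 8x8 blocks from 8 frames into one 8x8x8 cube.
cube = np.random.rand(8, 8, 8)            # stand-in for real video data

coeff = dctn(cube, norm='ortho')          # DCT along t, y, and x at once

# Crude compression: keep only the largest 5% of coefficients by magnitude.
threshold = np.quantile(np.abs(coeff), 0.95)
coeff[np.abs(coeff) < threshold] = 0.0

approx = idctn(coeff, norm='ortho')       # reconstructed cube
```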

The spatiotemporal approaches suffer from substantially higher computational complexity and require much more memory space, since a group of frames must be stored. Though the spatiotemporal approaches regard the temporal dimension as one of the spatial dimensions, and are thus able to achieve a higher coding gain, it is sometimes hard to justify such an increase in the complexity and hardware costs of spatiotemporal approaches against their modest, or even marginal, gains in compression ratio.

Motion Estimation and Compensation

As can be seen in the above discussion, motion estimation and compensation belongs to the waveform-based video coding approach. Due to its simplicity of design and implementation compared with the model-based video coding approach, it is widely used for video compression, and its block-based motion estimation and compensation scheme has been adopted in the H.261, H.263, and MPEG video coding standards.

Motion estimation is effective in removing temporal redundancy for video coding. Unlike DPCM (linear prediction), it belongs to the class of nonlinear predictive coding techniques. For video compression, motion estimation techniques estimate the field associated with the spatiotemporal variation of intensity, called the optical flow, instead of the true motion field of objects as required in the field of computer vision. In other words, estimation of the true motion is not the ultimate goal, but it is desirable to obtain the true motion information in order to avoid artificial discontinuities in the predicted error. As a result, the terms "motion field" and "optical flow" are usually used interchangeably, without distinction. Motion compensation techniques are then employed to predict the current frame based on the motion information and the previous frames. The purpose of video compression is to minimize the overall amount of information, including motion information and prediction error information, to be sent or stored for decoding. Therefore, a tradeoff in bit allocation exists between motion parameters and prediction errors. Furthermore, to limit the coding delay, motion estimation in video coding usually utilizes either the previous frame or the next future frame as the reference, even though all the other frames in a video sequence could ideally be referenced in an accurate motion estimation procedure.

For the consideration of motion estimation in the context of video coding, three main causes give rise to the spatiotemporal intensity variation:

Figure 1.6: The full-search block-matching approach (BKM-ME): (a) CIF frame (352 pels x 288 lines); (b) reference block and search area.

1. Global motion, or camera motion (such as pan or zoom), causing apparent motion of the objects in the scene.

2. Local motion of objects with respect to each other and the background.

3. A change of the illumination condition, which is generally not taken into account by motion estimation techniques.

The problem of motion estimation can be approached in a deterministic framework or a stochastic (Bayesian) one. In the stochastic framework, the motion is usually modeled as a Markov random field with a joint distribution characterized as a Gibbs distribution, and techniques such as maximum a posteriori (MAP) and minimum expected cost (MEC) estimation can be applied to motion estimation. In the deterministic approach, however, the motion is considered an unknown quantity and can be modeled as either a perspective projection or an orthographic projection from the 3-D coordinates to the 2-D image coordinates on the camera plane. In this framework, motion estimation techniques can be classified into four main groups: block-matching techniques, gradient (optical flow) techniques, pel-recursive techniques, and frequency-domain techniques.

Block-matching techniques

By assuming only translational motion of rigid objects on the 2-D image plane, the entire image is partitioned into N x N blocks, as shown in Fig. 1.6. Each block in the current frame is measured against all the possible blocks in the search area of the previous frame, based on some optimization criterion. Precisely, the block-matching methods try to find the best motion vector $\hat{d} = (\hat{u}, \hat{v})$ satisfying

$$\hat{d} = (\hat{u}, \hat{v}) = \arg\min_{(u,v)\in S} \frac{1}{N^2} \sum_{(m,n)\in W} \left\| x_t(m,n) - x_{t-1}(m+u, n+v) \right\|,$$

where $\|x\|$ is the metric distance, defined as $\|x\| = x^2$ for the Mean-Square-Error (MSE) criterion or $\|x\| = |x|$ for the Mean-Absolute-Difference (MAD) criterion, and $S$ and $W$ denote the set of allowable displacements and the measurement window, respectively, depending on which block-matching approach is in use. For the full exhaustive-search block-matching approach, $W = \{(m,n) : 0 \le m, n \le N-1\}$ and $S$ includes all the possible block positions in the search area.

The block-matching approaches enjoy certain advantages, such as conceptual simplicity, direct minimization of the motion-compensated residuals (in terms of MAD or MSE), and little overhead motion information. However, there are some major drawbacks: unreliable motion fields, blocking artifacts, and poor prediction along moving edges.
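For concreteness, a minimal implementation of the full exhaustive search under the MAD criterion might look as follows (the function and variable names are ours; swap the absolute difference for a squared difference to obtain the MSE criterion):

```python
import numpy as np

def full_search(cur_block, prev_frame, top, left, p):
    """Exhaustive block matching: compare the N x N block at (top, left) in
    the current frame against every candidate within +/- p pels in the
    previous frame, under the MAD criterion."""
    n = cur_block.shape[0]
    best, best_cost = (0, 0), np.inf
    for u in range(-p, p + 1):
        for v in range(-p, p + 1):
            r, c = top + u, left + v
            if r < 0 or c < 0 or r + n > prev_frame.shape[0] \
                              or c + n > prev_frame.shape[1]:
                continue                      # candidate outside the frame
            cost = np.abs(cur_block - prev_frame[r:r + n, c:c + n]).mean()
            if cost < best_cost:
                best_cost, best = cost, (u, v)
    return best, best_cost
```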

Gradient techniques

Assuming an invariant illumination condition, it can be shown that

$$\frac{\partial I(x,y,t)}{\partial t} + v \cdot \nabla I(x,y,t) = 0,$$

where $v = (v_x, v_y)^T = (\frac{dx}{dt}, \frac{dy}{dt})^T$ as $dt \to 0$. This equation is known as the optical flow constraint equation. Since the solution for $v$ at $(x,y,t)$ is not unique, an additional smoothness constraint, $\min\{(\frac{\partial v_x}{\partial x})^2 + (\frac{\partial v_x}{\partial y})^2\}$ and $\min\{(\frac{\partial v_y}{\partial x})^2 + (\frac{\partial v_y}{\partial y})^2\}$, is introduced to limit the solution $v(x,y,t)$ to vary smoothly in the spatial domain $(x,y)$. Consequently, the optical flow is obtained by minimizing the following error term:

$$\int\!\!\int \left\{ \left[ \frac{\partial I(x,y,t)}{\partial t} + v \cdot \nabla I(x,y,t) \right]^2 + \alpha^2 \left[ \left(\frac{\partial v_x}{\partial x}\right)^2 + \left(\frac{\partial v_x}{\partial y}\right)^2 + \left(\frac{\partial v_y}{\partial x}\right)^2 + \left(\frac{\partial v_y}{\partial y}\right)^2 \right] \right\} dx\, dy,$$

where $\alpha$ is a weighting factor. This minimization problem is solved by an iterative Gauss-Seidel procedure.

The gradient techniques provide an accurate, dense motion field, but have two major drawbacks: (1) the dense motion field requires many bits to encode; (2) the prediction error is large on moving object boundaries, due to the smoothness constraint.
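A compact sketch of this iterative procedure is given below. It is our simplification: derivatives come from np.gradient, the neighborhood averages wrap around at the borders, and a fixed iteration count stands in for a convergence test.

```python
import numpy as np

def optical_flow(I1, I2, alpha=10.0, n_iter=100):
    """Iteratively minimize the smoothness-constrained optical-flow error."""
    Iy, Ix = np.gradient(I1)          # spatial gradients
    It = I2 - I1                      # temporal derivative (frame difference)
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)
    avg = lambda f: (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                     np.roll(f, 1, 1) + np.roll(f, -1, 1)) / 4.0
    for _ in range(n_iter):
        u_bar, v_bar = avg(u), avg(v)
        # Common term derived from the optical-flow constraint equation.
        t = (Ix * u_bar + Iy * v_bar + It) / (alpha**2 + Ix**2 + Iy**2)
        u = u_bar - Ix * t
        v = v_bar - Iy * t
    return u, v
```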


Pel-recursive techniques

Pel-recursive techniques can be considered as a subset of the gradient techniques. Given the intensity profiles in two consecutive frames $I_{t-1}$ and $I_t$, they iteratively minimize the Displaced Frame Difference (DFD) value, defined as

$$DFD(k, l; d) = \left| I_t(k,l) - I_{t-1}(k-u, l-v) \right|,$$

by the steepest-descent optimization algorithm, where $d = (u, v)^T$. The resulting iterative equation to estimate the motion $\hat{d}(k,l) = (u(k,l), v(k,l))^T$ at the $i$-th iteration is given as follows:

$$\hat{d}_{i+1}(k,l) = \hat{d}_i(k,l) - \epsilon \, DFD(k, l; \hat{d}_i) \, \nabla I_{t-1}(k-u_i, l-v_i),$$

where $\epsilon$ is a convergence factor and $\nabla = (\frac{\partial}{\partial x}, \frac{\partial}{\partial y})^T$.

The pel-recursive techniques can update the motion vectors based only on previously transmitted data, and thus no overhead motion information is required, because motion can be estimated at the decoder as well. However, the drawbacks include convergence that depends on the choice of $\epsilon$, susceptibility to noise, and an incapability to handle large displacements and motion discontinuities.
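A single-pel sketch of the recursion above (nearest-pel sampling stands in for interpolation, and eps and the iteration count are illustrative choices of ours):

```python
import numpy as np

def pel_recursive(I_prev, I_cur, k, l, eps=0.01, n_iter=20):
    """Refine the displacement d = (u, v) at pel (k, l) by steepest descent
    on the displaced frame difference."""
    gy, gx = np.gradient(I_prev)              # gradients of the previous frame
    d = np.zeros(2)                           # displacement estimate (u, v)
    for _ in range(n_iter):
        # Evaluate at the nearest pel of the displaced position.
        r = int(np.clip(round(k - d[0]), 0, I_prev.shape[0] - 1))
        c = int(np.clip(round(l - d[1]), 0, I_prev.shape[1] - 1))
        dfd = I_cur[k, l] - I_prev[r, c]      # signed frame difference
        d -= eps * dfd * np.array([gy[r, c], gx[r, c]])
    return d
```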

Frequency-domain techniques

Frequency-domain techniques are based on the relationship between the transformed coefficients of shifted images. Several methods are available in this category: the Complex Lapped Transform (CLT) motion estimation method, the Fourier transform (DFT or FFT) phase correlation method, and 3-D spatiotemporal frequency-domain analysis using Wigner distributions or Gabor filters.

The CLT approach estimates the motion by finding, over all possible values of $(k,l)$ within the search area, the minimum of the 2-dimensional cross-correlation function $y(k,l)$, defined as follows:

$$y(k,l) = \sum_{m,n=-N}^{N-1} x_1(m,n)\, x_2(m+k, n+l) \cos\frac{\pi m}{2N} \cos\frac{\pi n}{2N} = \sum_{u=-N}^{N-1} \sum_{v=-N}^{N-1} \left\{ X_1(u,v)\, X_2^*(u,v; k,l) \right\},$$

where $\{x_1(m,n) = I_{t-1}(m,n);\ -N \le m,n \le N-1\}$ is the search area from the previous frame $I_{t-1}$, and $\{x_2(m,n) = I_t(m,n);\ -N/2 \le m,n \le N/2-1\}$ is the reference block from the current frame $I_t$. Here $X_1(u,v)$ and $X_2(u,v)$ are the 2-dimensional CLTs of $x_1(m,n)$ and $x_2(m,n)$, respectively. The 2-dimensional CLT $X(k,l)$ of $x(m,n)$ is defined as

$$X(k,l) = \frac{1}{\sqrt{N}} \sum_{m,n=-N}^{N-1} x(m,n)\, e^{-j\frac{\pi}{N}(mk+nl)} \cos\frac{\pi m}{2N} \cos\frac{\pi n}{2N}, \qquad k, l = -N, \ldots, N-1.$$

The phase correlation method is based on the principle that a relative shift in the spatial domain results in a linear phase shift in the Fourier domain. It estimates the translational motion $(u, v)$ between two $N \times N$ image matrices $x_1$ and $x_2$, whose $(m,n)$ elements are $I_{t-1}(m,n)$ and $I_t(m,n)$, respectively, for $m, n = 0, \ldots, N-1$. If these two image matrices differ by a translational displacement, then the displacement can be found by locating the peak of the inverse 2-D Fourier transform of the normalized cross-correlation function of the Fourier transforms of these two blocks:

$$\hat{c} = \mathrm{IDFT}\left\{ \frac{X_1 \circ X_2^*}{\left| X_1 \circ X_2^* \right|} \right\},$$

where IDFT denotes the inverse DFT (Discrete Fourier Transform), $X_1 = \mathrm{DFT}\{x_1\}$, $X_2 = \mathrm{DFT}\{x_2\}$, and $\circ$ denotes the element-wise product.
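Because it reduces to a pair of FFTs, the method is only a few lines in practice. In the sketch below (our function name), the sign of the recovered shift depends on which block is conjugated:

```python
import numpy as np

def phase_correlation(x1, x2):
    """Estimate the translational shift between two equal-sized blocks."""
    X1, X2 = np.fft.fft2(x1), np.fft.fft2(x2)
    cross = X1 * np.conj(X2)
    # Normalize to unit magnitude, keeping only the phase information.
    corr = np.fft.ifft2(cross / (np.abs(cross) + 1e-12)).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Fold peaks in the upper half of the range back to negative shifts.
    return tuple(p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape))
```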

Model-Based Video Coding

In general signal processing, the term model-based has also been used to imply an underlying signal source model, such as a simple or composite Markov model. In video coding, there has been a gradual historical development from simple Markov models through to object-related models, as we will discuss later in this section. Underlying most video coding implementations is a conception in the human mind of a signal model. However, while there is an obvious implication in model-based coding that the image sequence is being modeled as a composite collection of moving objects, rather than as a Markov source, the term model-based coding (MBC) further implies that software models of objects are actually implemented and animated as part of the coding process. Most of the experimental work in model-based coding has been concerned with modeling and coding the human head and shoulders, because of possible applications in videotelephony and video conferencing. The head is in some respects an easy object to model in these applications, because there is usually not much lateral movement and not much rotation. On the other hand, the human face has a flexible, rather than a rigid, shape with a complex set of controlling muscles; this makes accurate analysis and synthesis fairly difficult. Some researchers have begun to think about and experiment with objects other than faces, with broader applications of MBC in mind.

Because of the slight variations in terminology used by researchers in the field, let us define and distinguish the terms used in the literature. Model-based coding will be used as a generic term covering all coding systems in which a model of an object is used in the coding and decoding process, as in Fig. 1.7.

Figure 1.7: General model-based coding system. In this case, the encoder and decoder have an object model; the coder analyzes input images, and the decoder generates output images using the model.

Knowledge-based coding

is the subset of model-based systems in which specific knowledge about the form of the object (e.g., a human face) is available to the coder in the coding process. After carefully reading the literature, we find that there are very few, if any, model-based systems which do not make some assumptions, consciously or unconsciously, about the form of the object. A more useful working distinction is therefore between systems that acquire knowledge about the object during the coding process and those that rely on prior knowledge. The system might, for example, be given prior knowledge that the object in front of the camera is a person; it might then acquire knowledge of that person's 3-D shape. Future systems may be able to acquire the knowledge that the person is a named individual. A system with a large amount of operator-given prior knowledge may code a particular sequence very efficiently, but it may turn out to be less efficient, when coding a wider variety of scenes, than another coder with less prior knowledge but with a better ability to acquire it.

The focus of this section is limited to a review of past and current efforts on model-based video coding, especially its application to very low bit-rate image sequence coding.

Video coding at very low bit rates is motivated by its potential applications for videophones, multimedia electronic mail, remote sensing, electronic newspapers, interactive multimedia databases, etc. Due to practical medium-capacity limitations, the main problem in introducing these applications lies in how to compress a huge amount of visual information into a very low bit-rate stream for transmission or storage purposes. This is typically reflected in the videophone problem, that is, the transmission of videophone scenes through the available narrowband networks, such as the public switched telephone network (PSTN). The available PSTN network is mainly used for the transmission of speech; however, visual data is considerably larger than speech data. For instance, if we adopt the CIF video format, then the bit rate of the CIF video sequence is approximately 36.5 million b/s. Suppose we try to transmit such a color video signal via the PSTN under the assumption that the channel capacity is extended to 64 Kb/s, with 48 Kb/s and 16 Kb/s used for video and voice, respectively. The compression ratio must then be as high as about 760. Achieving such a high compression ratio indeed poses a serious challenge to researchers in the image coding field.

Introduction of Model-based Coding

In general, it is impossible to compress a full TV signal at such high compression ratios while still keeping high quality in the decoded images. Fortunately, certain restrictions are implied in these applications. Let us take videophone signal compression as an example. Typical videophone scenes have the following three characteristics:

1. Fixed scene content: The typical scene is a head-and-shoulder image of the speaker. Since the objects of the scene are known a priori, some knowledge about them can be used (e.g., the 3-D shape of the face).

2. Limited motion: The interframe motion is mainly caused by the movement of the speaker, and the camera is generally fixed. (This situation does not hold for a mobile videophone, but even in that case the camera undergoes limited motion, such as zoom, pan, and vibration.) The movement of the speaker mainly consists of the global movement of the shoulders and head and the local motion of facial expression changes. Due to the inertia of the human body, the global motion is relatively slow and can be described using only a few bits per frame. In this way, more bits can be spent on facial expressions.

3. Special requirements for visual information: Interpersonal video communication does not usually require the full resolution provided by broadcast television or CIF. The key in visual communication is to provide the emotional dimensions. Therefore, a lower-resolution image format is often used, especially for applications at rates of tens of kb/s. One commonly used format is QCIF; that is, the resolution is reduced to 176 x 144 for luminance and 88 x 72 for chrominance. The frame rate is reduced to 10 Hz or even 7.5 Hz. The combination of knowledge of the scene, the spatiotemporal redundancy, the lower resolution, etc., allows the visual information to be compressed at a very high compression ratio. Similar limitations exist in the other applications as well.

Different from the various conventional waveform coding methods mentioned previously, in these model-based schemes some sense of the 3-D properties of the scenes is taken into consideration. Images are viewed as 2-D projections of a 3-D real scene. The concept is to construct a model with a priori knowledge of the images and to find the model parameters. In this way, only the model parameters need to be sent, and thus a very high compression rate is achieved. The term model-based coding denotes a scheme of the kind shown in Fig. 1.7. A video sequence containing one or more moving objects is analyzed using computer vision techniques to yield information about the size, location, and motion of the objects. This information is employed to synthesize, by computer-graphics methods, a model of each object. Tracking techniques are used to make the model mimic the movements of the object it represents. The parameters needed to animate the model are then coded and transmitted to the receiver, which reconstructs the model. For low-quality reproduction, the animation data are sufficient to give an approximation to the appearance of the original image sequence. For higher quality and higher bit rates, a residual pixel signal is transmitted that typically comprises the coded frame differences between the original video sequence and the one derived from the animated model.

Model-based video coding has three key elements: modeling, analysis, and synthesis. According to the different modeling steps, the two major categories are:

Figure 1.8: A popular 3-D wireframe model.

- Object-based approach: No explicit object model is given. Every frame is composed of objects, and each object is associated with sets of parameters: motion, shape, and color. This is the MPEG-4 approach, which we will discuss in more detail in Chapter 8.

- Semantic-based approach: This approach is sometimes called "compression through animation", because it uses explicit object models. It is usually limited to coding a talking human face. A human facial model must first be constructed, by means of different geometric models:

  - Surface-based parametric models (spline, harmonic surface), for relatively regular geometric shapes.

  - Surface-based nonparametric models (wireframe): a 3-D wireframe model with planar polygonal patches of adjustable size. This is the most popular model, as shown in Fig. 1.8.

  - Volume-based parametric models (generalized cylinder (GC), superquadrics). These models are capable of modeling nonrigid motion.

  - Volume-based nonparametric models (voxels).

  The problem with these approaches is the time-consuming analysis step, which finds the model parameters that best fit the images.

Good reviews have been given by Li, Lundmark, and Forchheimer; by Aizawa and Huang; by Buck and Diehl; and by Pearson.

Model-based video coding promises potentially large reductions in bit rate compared with the hybrid interframe coders represented by the H.263, MPEG-1, and MPEG-2 standards. It is interesting to note that the original target set for MPEG-4 was low-bit-rate coding, rather than arbitrarily shaped video coding; MBC was therefore considered as a contender for MPEG-4. It has been shown that simple animated faces require only a few hundred b/s, with more realistic facial representations needing perhaps a few Kb/s. Good reproductions of CIF or QCIF color head-and-shoulder sequences using significant residual pixel data can be obtained at a few tens of Kb/s, and a range of coding results has been reported for head-and-shoulder images, from tens of Kb/s down to a few Kb/s, with CIF or QCIF image sequences. Overall, MBC is a technique which shows promise of achieving very large bit-rate reductions for moving images.

Evolution of Model-based Coding

The MBC techniques are constantly improving with time, though there are still problems to be solved.

In 1851, a comic electric telegraph, consisting of iron bars attached to a flexible model of a face, was demonstrated at the Great Exhibition in London. The demonstration, by G. R. Smith, consisted in distorting the face using magnets which could be operated electrically at a distance. In 1961, Gabor and Hill proposed that common objects, such as grass or crowds of people, could be recognized and, after recognition, a standard form substituted in the encoding process; we now call this codebook coding. In his book Signals, Systems and Noise, J. R. Pierce imagined a receiver in which there was stored a model of the human face. The transmitter would follow the movements of the eyes, lips, and jaws of a real face and transmit these to the receiver.

During the 1970s there was groundbreaking work in the fields of computer graphics, computer vision, and psychology. Parke developed parameterized models of the human face as a tool for computer-assisted animation, as shown in Fig. 1.9(a). These used polygonal facets of varying sizes to construct a wireframe representation of the face, with Phong shading being used to produce the appearance of smooth (though plastic-looking) skin. Parke speculated that parameterized models might be useful in fields such as medicine and the data compression of image sequences, but did not suggest ways of extracting the parameters from images of real faces.

Figure 1.9: Some coding models for MBC: (a) Parke's parameterized model of the human face; (b) the CANDIDE wireframe model of the human head; (c) Aizawa's model for tracking and mimicking the 3-D motion of a real face.

Another strand in the evolution of model-based video coding is traceable to the field of social psychology. As part of their research into understanding the link between facial expression and emotion, Ekman and Friesen developed a scoring system for measuring facial expressions, which they termed the Facial Action Coding System (FACS). FACS provides over 40 different facial actions which can be combined to give various expressions. The use of anatomically based facial modeling gives improved verisimilitude and economy of specification.


The third strand in the development of ideas in model-based video coding was the tremendous growth of interest, during the 1980s, in image understanding and computer vision. While this was to find its more immediate applications in robotics, it raised expectations that the considerable problems of analysis associated with model-based coding might be solved. Though experts in computer vision and image coding have since met to discuss the common ground between their subjects, the two communities still have much to learn from each other. Results obtained in the field of robotics are not always transportable into the field of coding, because the assumptions and constraints in coding differ in key respects.

Early model-based coding proposals. At the International Picture Coding Symposium (PCS) in Montreal, Canada, a system called Speechmaker, with an intended use in video conferencing, was described by Lippman. This employed facial images stored on video disc with different primitive lip positions; the selections were driven by speech and thus required, in principle, no additional bits for the video. At a subsequent PCS, the achievement of very low bit rates through the use of animation was presented. That work describes most of the elements currently associated with model-based coding: analysis of an input image sequence, synthesis of a 3-D model of the object at the encoder, transmission of information to the receiver to allow it to reconstruct the 3-D model, and the formation of the decoded image sequence as the projection of the 3-D model onto the image plane. Subsequently, at PCS, Forchheimer presented a method for tracking real head movement and using the results to cause a model head to mimic the movements. Forchheimer's group later developed a wire-frame model of the human head known as CANDIDE, as shown in Fig. (b), which is widely used today in model-based coding research.

Texture mapping of the face. A startling leap forward in the realism of synthesized models came with the introduction of texture mapping, which originated in the field of computer graphics. In this procedure an image from the first (or another selected) frame in a sequence is projected, or mapped, onto a wire-frame model of the head. When the model is animated, the skin and hair texture, which the initial projection has effectively glued to its surface, moves, stretching and transforming the facial appearance into smiles, frowns, etc. The technique is similar in principle to the optical projection of a movie of someone talking onto the head of a tailor's dummy, which is also known to be remarkably realistic. Welsh, Yau and Duffy showed how amazingly convincing texture mapping can be in fooling the eye into believing that a rough triangular-mesh structure is a smooth face. Welsh also mapped the


image of the interior of the mouth onto a second, concave wire-frame model, so that when the person opened his mouth there was no customary black hole. Aizawa, Harashima and Saito succeeded in tracking and mimicking the 3-D motion of a real face onto which white locating dots had been fixed; Aizawa's model of the face is shown in Fig. (c). They demonstrated the remarkable power of texture mapping by projecting four different still images of faces, including a monkey's face and the Mona Lisa, in turn onto the animated model. The texture-mapping work of Aizawa, Welsh and others in the late 1980s had a very great impact on the international picture coding community. It opened up the possibility of realistic reproduction of complex moving objects (there are not many objects of common interest more complex than the human face) at extremely low bit rates. Texture mapping was a technique which had been developed in another field for another purpose, but when transported and modified it became inspirational in the drive to accomplish a fully operational coding system. What remained, however, were the formidable tasks of analysis at the transmitting end and the incorporation of such analysis into a complete coding system.

Integration of MBC with traditional image coding techniques. The next step in the development of MBC was a most significant one on its path to ultimate practicality. Musmann, Hötter and Ostermann suggested that the process whereby a model of an object is created, texture-mapped and modified over time could be incorporated in a feedback loop (please refer to the figure below). They called this system object-oriented analysis-synthesis coding (OOASC).

Figure: Object-oriented analysis-synthesis coding. At the transmitter, image analysis extracts motion, shape and color parameters, which are coded and sent over the transmission channel; at the receiver, the decoded parameters drive image synthesis, with the object parameters held in a stored memory at both ends.

Coding loops have been used in most practical coding schemes, from early DPCM to interframe predictive coding schemes such as H.261; their effect is to prevent the accumulation of quantization distortion by incorporating this distortion in the prediction. In OOASC the video is coded into motion, shape and color; the animation signal thus consists of information about the shape and motion of the model, which allows the receiver to displace and rotate it as well as to modify its shape. The residual, or color, signal is transmitted in areas where the projection of the modified model onto the image plane fails adequately to predict the next frame. Object-oriented analysis-synthesis coding is thus a natural progression from the hybrid interframe coders, which code into motion and pixel data only. In principle it is capable of adapting to objects other than the human head, though most of the published results have been obtained with the videophone application in mind.

Harashima, Aizawa and Saito have summarized model-based coding in a table of coding development, as shown below. In this classification:

Generation   Coding scheme                             Examples of coding
0th          Direct waveform coding                    PCM
1st          Statistical redundancy reduction coding   Predictive coding, transform coding
2nd          Structure/feature extraction coding       Contour coding, segmentation coding
3rd          Analysis-synthesis coding                 Model-based (synthetic/parameter) coding
4th          Recognition-reconstruction coding         Knowledge-based (synthetic/command) coding
5th          Intelligent coding                        Semantic coding

Table: Generations of model-based coding development, as suggested by Harashima et al. and Musmann.

the 0th-generation waveform coding methods assume that the video is devoid of any structure;

the 1st-generation statistical redundancy reduction methods assume that there is statistical correlation between pixels;

the 2nd-generation structure/feature extraction methods (the MPEG-4 design) assume that images are projections of objects and can be segmented into features such as motion, contours and texture; this generation roughly corresponds to the coding schemes summarized by Kunt et al.

Model-based coding is described as a third-generation technique, in which there is both analysis and synthesis. Increasing amounts of intelligence are evident in each generation of coder.


By the fourth generation it is supposed that objects are not only located and segmented but also recognized. Such coding methods, if realized, would permit symbolic data to be transmitted for the reconstruction of objects. In the fifth generation of coders, speech and video are processed together.

Li, Lundmark and Forchheimer have proposed that coding techniques for very low bit rates can be divided into waveform-based and model-based coding. They divide model-based coding into semantic coding, which uses specific object models such as the human head, and object-oriented coding, in which no explicit object model is specified.

Besides Harashima's suggestion, Musmann has proposed a six-level classification based on the model used to represent the video source: a pixel model, leading to PCM coding; a model of statistically dependent pixels, parameterized as blocks of pixels and their colors, leading to predictive and transform coding; a model of translating blocks of pixels, leading to motion vectors and color being coded as in motion-compensated hybrid coding; a model of moving unknown objects, in which shape, motion and color are the parameters and analysis-synthesis coding is used; predefined object models, when we know in advance that the scene contains human faces, cars, etc., the method of coding being termed knowledge-based coding; and, finally, the coding of facial expressions themselves using action units, termed semantic coding.

On the image analysis side, considerable progress has been made recently in very low bit-rate video coding and in 3-D motion estimation techniques for complex semi-rigid objects. As part of both the analysis and the synthesis of facial movements, muscle-based models with synthetic skin texture have been employed. Synthesis techniques have similarly improved, with realistic facial animation now being possible at very low bit rates. A range of coding results has been reported for head-and-shoulder images, from tens of kilobits per second down to a few kilobits per second, with CIF or QCIF image sequences, the lower bit rates generally requiring more complex object-specific models and resulting in some visible coding imperfections. Among the problems with current model-based coder prototypes are their tendency to be too rigidly object-specific and their need, in some cases, for a degree of operator assistance to help them track and mimic object movements. Little has been reported to date concerning the performance of model-based coders on long video sequences. The concern is that, with such sequences, severe or sustained failure of tracking or modeling may occur, with the consequence that either the bit rate will increase dramatically (VBR mode) or the picture quality will deteriorate markedly (CBR mode). Encouraging thinking is emerging about how to overcome such problems and to generalize model-based coding to a wider range of objects; further attention needs to be given to implementation issues and to the use of parameterized coding at higher image resolutions.

Chapter 2

Motion-Compensated DCT Video Coding

Most of the current video coding standards are based on the motion-compensated DCT video coding approach (MC-DCT). As mentioned in Chapter 1, the motion-compensated DCT approach belongs to the class of hybrid block-based lossy motion-compensated transform coding, in the category of waveform-based video coding methods. Due to its importance in the widely used video coding standards, we devote this chapter to a detailed discussion of this approach; the MC-DCT video coding standards themselves will be discussed in the next chapter. More information can be found in the literature.

The MC-DCT approach is a hybrid approach in the sense that it achieves spatial and temporal compression through two very different means:

Spatial compression. Digitized picture elements (pixels) in the same frame are decorrelated through the Discrete Cosine Transform (DCT), which packs most of the energy into as few DCT coefficients as possible, usually in the low-frequency region. Except for the loss of precision in computation via finite-precision DCT implementations, DCT coefficients can be exactly converted back to pixels by the inverse DCT (IDCT); DCT-IDCT is therefore a lossless transform pair. After spatial decorrelation, most DCT coefficients are very small and close to zero. Hence spatial compression can be efficiently achieved by quantization and entropy coding.

Temporal compression. When object movement can be modeled or approximated by linear motion, it is possible to predict the next frame based on the current frame and motion information. This is the basic idea behind temporal compression. The motion information must first be estimated from two consecutive frames (motion estimation). The predicted frame is then generated from the previous frame through the process called motion compensation, with the motion information carried in the form of motion vectors. The difference between the current frame and the predicted one, called the prediction residual, is usually small and can be further compressed spatially as described above.

DCT and motion estimation/compensation are block-based in nature, though they have also been successfully modified for region-based or object-based video coding methods, as in the case of MPEG-4. It is also easier to implement if we divide the whole frame into blocks of pixels for processing rather than into irregular shapes. In the following sections we will discuss different picture formats and how each frame is divided into blocks.

Basic Principles of Motion-Compensated Transform Coding

Motion-compensated transform coding is a well-known approach to video coding. At the encoder side, video coding can be described as three-stage processing, as shown in the figure below, which is common to all video coding standards.

Figure: Three-stage processing for video coding (signal processing, quantization and entropy coding produce the video bitstream).

The first stage is signal processing, which includes motion estimation and compensation (ME/MC) and a 2-D spatial transformation. The objective of ME/MC and the spatial transformation is to take advantage of the temporal and spatial correlation, respectively, in a video sequence, in order to optimize the rate-distortion performance of quantization and entropy coding under a complexity constraint. The most popular technique for ME/MC has been block matching, and the most popular spatial transformation has been the discrete cosine transform (DCT). At the decoder, all the above steps are reversed one by one. Note that all the steps can be exactly reversed except for the quantization step, which is where the loss of information arises. The compression of video data is thus typically based on two principles: the reduction of spatial redundancy and the reduction of temporal redundancy. For instance, all video coding standards use the DCT to remove spatial redundancy and motion compensation to remove temporal redundancy.


Picture Formats

Analog video is sampled in the temporal domain at a fixed rate (e.g., 30 frames/second for NTSC signals) to generate a sequence of pictures, which are then digitized and scanned row-wise at the resolution specified in the standards. After digitization, each picture consists of pictorial elements, commonly called pixels.

Some of the commonly used picture formats are listed in the table below. Let us take the CIF and QCIF picture formats as examples. Basically, CIF is close to the format commonly used by the computer industry. At such a resolution the picture quality is not expected to be very high: it is close to the quality of a typical video cassette recorder and much less than the quality of broadcast television. This is understandable given its usage in H.261, because H.261 is designed for video telephony and video conferencing, in which the typical source material is composed of scenes of talking persons (so-called head-and-shoulder sequences) rather than general TV programs that contain a lot of motion and scene changes. Therefore H.261 is designed to deal with two picture formats: CIF and quarter-CIF (QCIF).

Image format      Y resolution  YCbCr  Scan type    Mbytes/s  Use
ITU-R 601 NTSC    720 x 480     4:2:2  Interlaced   ~20.7     JPEG
ITU-R 601 PAL     720 x 576     4:2:2  Interlaced   ~20.7     JPEG
SIF NTSC          352 x 240     4:2:0  Progressive  ~3.8      MPEG-1, JPEG
SIF PAL           352 x 288     4:2:0  Progressive  ~3.8      MPEG-1, JPEG
CIF               352 x 288     4:2:0  Progressive  ~4.6      ITU-T H.261/H.263
QCIF              176 x 144     4:2:0  Progressive  ~1.1      ITU-T H.261/H.263

Table: Digital video formats specified by ITU-R 601, JPEG, H.261/H.263 and MPEG (resolutions and approximate raw data rates per the usual definitions of these formats).

Color Spaces and Sample Positions

As we know, any color can be represented by the three basic colors RGB (red, green, blue). The RGB color system can be rotated to form a different color coordinate system such as YUV or YCrCb. In most MC-DCT standards the color space of each frame is composed of one luminance component (Y) and two chrominance components (Cr and Cb), which are related to RGB in the following way:

Y = 0.299 R + 0.587 G + 0.114 B,

Cb = 0.564 (B - Y),

Cr = 0.713 (R - Y).
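As a minimal sketch, assuming the standard ITU-R 601 coefficients given above (the function name rgb_to_ycbcr is ours), the conversion can be written directly in Python:

    def rgb_to_ycbcr(r, g, b):
        # Luminance: weighted sum of R, G, B (ITU-R 601 weights).
        y = 0.299 * r + 0.587 * g + 0.114 * b
        # Chrominance: scaled color differences against the luminance.
        cb = 0.564 * (b - y)
        cr = 0.713 * (r - y)
        return y, cb, cr

    print(rgb_to_ycbcr(255, 0, 0))  # pure red: low Y, negative Cb, large Cr

Gray pixels (r = g = b) map to zero chrominance, which is one reason subsampling Cb and Cr costs so little visually.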

While the picture formats define the size of the image, that is, the resolution of the Y pixels (or pels), the resolution and positions of the Cb and Cr pels are yet to be specified. Typically the chrominance pels are designed to have less resolution than the luminance pels, because human eyes are less sensitive to the chrominance components than to the luminance part. The positions of the Y/Cb/Cr samples (pixels) for the 4:4:4, 4:2:2, 4:1:1 and 4:2:0 formats are illustrated in the figure below. For instance, in H.261

Figure: Orthogonal sampling on the scan lines of an interlaced system, showing the Y samples and the Cb/Cr samples for (a) 4:4:4, (b) 4:2:2, (c) 4:1:1 and (d) 4:2:0.

the Cb and Cr pels are specified to have half the resolution, both horizontally and vertically, of that of the Y pels, which is commonly referred to as the 4:2:0 format. Chrominance subsampling and the relative positions of chrominance pels in H.263 are the same as those defined in H.261. In the figure, the chrominance components are subsampled by a fixed factor (commonly 2:1 in each dimension, as defined in the widely used 4:2:0 format).

Layers in Video Stream

After digitization of the video signals, an encoder encodes the raw video stream to create the compressed video stream, based on the syntax and semantics specified in the video coding standards. A decoder likewise follows this video stream syntax to parse and decode the stream, producing a decompressed video signal. In order to be flexible enough to support the variety of applications envisaged, the video stream syntax is constructed in a hierarchy of several generic layers:


Figure: Sequence of pictures and picture formats (image sequence, group of pictures, picture, GOB, and macroblock with Y, Cr and Cb blocks).

Sequence: the entire video sequence.

Group of Pictures: the basic unit allowing for random access.

Picture: the primary coding unit, with three color components and different picture formats (progressive or interlaced scanning modes).

Slice or Group of Blocks: the basic unit for resynchronization, refresh and error recovery.

Macroblock: the motion compensation unit.

Block: the transform and compression unit.

The names and functions of each layer may vary between standards, but the basic concept behind each standard remains the same.

A fixed number of consecutive pictures are grouped together to form a group of pictures (GOP), as shown in the figure above. The exact number of pictures in a GOP varies between the different hybrid MC-DCT video coding standards. For the MC-DCT approach, the first picture in the group is coded as an intraframe, whereas the rest of the frames are coded in the interframe format. Basically, intraframe coding exploits only spatial correlation, while interframe coding takes advantage of both spatial and temporal correlation, as will be explained in detail later.

In MPEG-2, two scanning modes are specified:

Progressive mode. This scanning mode is commonly used in most other standards. It scans through the same lines in each picture, resulting in frames. In progressive sequences each picture in the sequence shall be a frame picture, and the sequence at the output of the decoding process consists of a series of reconstructed frames separated in time by a frame period.

Interlaced mode. Interlaced scanning is common in the analog TV standards such as NTSC, PAL and SECAM. In the interlaced mode, alternate lines of pixels are scanned in each field; thus a field consists of every other line of samples in the three rectangular matrices of integers representing a frame. A frame is the union of a top field and a bottom field. The top field is the field that contains the topmost line of each of the three matrices; the bottom field is the other one.

Figure: The resulting GOB structures for a CIF frame (352 x 288 pixels) in H.261: twelve groups of blocks (GOB1-GOB12), each containing 33 macroblocks; a macroblock (MB) comprises four 8 x 8 Y blocks (Y1-Y4) plus one 8 x 8 Cb block and one 8 x 8 Cr block.

Usually the whole frame of a picture is not coded as one unit. Due to the nature of block-based transform coding and motion estimation/compensation, each frame is divided into blocks of pixels. The basic unit for transform coding is usually an 8 x 8 block. The smallest coding unit for representing a block of pixels in color is called a macroblock (MB). Without chrominance subsampling, a macroblock would consist of one Cr block, one Cb block and one Y block; in view of chrominance subsampling, however, we need one block of Cr pels, one block of Cb pels and four blocks of Y pels for the case of the 2:1 subsampling factor, i.e., the 4:2:0 format, to form one macroblock. To further exploit spatial correlation, a number of macroblocks are grouped together to form a group of blocks (GOB), as depicted in the figure above. In H.261 a GOB contains 33 MBs, and the resulting GOB structures for a frame are shown in the figure. H.263, however, uses different GOB structures: unlike H.261, a GOB in H.263 always contains at least one full row of MBs.

Intraframe Block-Based Coding

Figure: Intraframe block-based coding and decoding in the MC-DCT approach: original blocks of pixels pass through DCT, quantization and entropy coding to form the encoded intraframe; decoding applies entropy decoding, inverse quantization and IDCT to reconstruct the blocks of pixels.

In the intraframe coding mode, the first frame of each GOP is spatially compressed through transform coding and quantization and then coded losslessly by entropy coding. The transform coding used in the MC-DCT standards is the Discrete Cosine Transform (DCT), which packs most of the energy into as few coefficients as possible. The resulting DCT coefficients are then scalar-quantized to remove visual irrelevancy. Further savings can be achieved by running entropy coding over the bit stream of these quantized coefficients. The following describes each of these components in more detail.

Spatial Decorrelation through DCT

Except at edges, pixels in the texture regions of each frame are spatially correlated. It is well known that the optimum Karhunen-Loeve transform (KLT) can efficiently decorrelate pixels spatially and pack most of the energy into the fewest coefficients. However, the KLT is not a fixed transform and can only be determined on the basis of the statistical ensemble of texture regions, which is not known a priori. Therefore the KLT must be calculated at the encoder and sent to the decoder along with the coefficients associated with the transform. This is not an efficient way to proceed; hence a fixed transform with efficiency as close to the KLT as possible is a better choice. It is known that the DCT is equivalent to the optimum KLT when the images are generated by a first-order Markov process, which is a good model for textures. Furthermore, unlike the FFT, the DCT involves only operations on real numbers.

There are four DCT variants (DCT-I, DCT-II, DCT-III and DCT-IV) as defined in the literature. DCT-II is widely used in the MC-DCT standards and commonly referred to simply as the DCT. It is defined as

X_t^{cc}(k,l) = \frac{2}{N} C(k) C(l) \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x_t(m,n) \cos\frac{(2m+1)k\pi}{2N} \cos\frac{(2n+1)l\pi}{2N},

for k, l \in \{0, \ldots, N-1\},

where C(k) = \frac{1}{\sqrt{2}} for k = 0 or N, and C(k) = 1 otherwise.
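The following sketch computes the 2-D DCT-II directly from this definition; it is an O(N^4) reference implementation for illustration only, since practical coders use fast factorizations:

    import math

    def dct2(block):
        # 2-D DCT-II of an N x N block, straight from the definition above.
        n = len(block)
        c = lambda k: 1 / math.sqrt(2) if k == 0 else 1.0
        coeff = [[0.0] * n for _ in range(n)]
        for k in range(n):
            for l in range(n):
                s = sum(block[m][i]
                        * math.cos((2 * m + 1) * k * math.pi / (2 * n))
                        * math.cos((2 * i + 1) * l * math.pi / (2 * n))
                        for m in range(n) for i in range(n))
                coeff[k][l] = (2.0 / n) * c(k) * c(l) * s
        return coeff

    # A flat 8 x 8 block: all energy ends up in the DC coefficient.
    flat = [[100.0] * 8 for _ in range(8)]
    print(round(dct2(flat)[0][0], 2))  # 800.0; every AC coefficient is ~0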

The other types of DCT will be defined and used in later chapters. As shown in the figure below, each 8 x 8 block is passed through a DCT with N = 8 as defined above. The DCT coefficients thus generated contain one DC component and 63 AC coefficients. The DC component is basically the average value of all the pixels within a block, and the AC components represent the variation of the texture from this average value. For a flat texture block, the DCT packs all the energy into the DC component, with all AC coefficients zero. Further compression can be achieved through perceptually weighted quantization, by exploiting human visual insensitivity to high-frequency components, as discussed next.

Exploitation of Visual Insensitivity Through Quantization

The DCT decorrelates pixels and efficiently packs most of the energy into a few DCT coefficients at different frequencies. Since human eyes are acute on the DC coefficient but relatively insensitive to the high-frequency components, we can exploit this visual property for higher compression through perceptually weighted quantization. Furthermore, proper quantization can help the later entropy coding stage, as discussed below. The MC-DCT approach adopts scalar quantization instead of vector quantization. Furthermore, unlike the nonuniform quantization (the A-law or mu-law compandor/compressor widely used for speech coding), uniform quantization is commonly used in the MC-DCT standards. However, the quantization step size can be tailored for each DCT coefficient in an 8 x 8 block, as in JPEG, and can also be adjusted adaptively for every macroblock or group of blocks, as in H.261.


Figure: Spatial decorrelation through DCT: an 8 x 8 block of pixels is transformed into one DC coefficient and 63 AC coefficients, which are then read out in a zigzag scan.

Usually the following rules are applied at the quantization stage:

More aggressive quantization is allowed for the high-frequency components than for the DC or low-frequency coefficients.

Chrominance can be quantized more coarsely than the luminance coefficients.

Flickering noise can be avoided by introducing a dead zone around zero during quantization.

Quantization can be adjusted to help the later entropy coding stage. For example, an adaptive threshold can be used to increase the run of zeros, as in RM8 of H.261.

between transform co ding and entropy co ding As opp osed to reversibilityofDCT

IDCT quantization is irreversible in the sense that inverse quantization can not

recover the original value of a quantized co ecient Loss of information results from

the quantization pro cess and intro duces as little degradation of the picture quality

as p ossible that information is irrelevant to our p erception of picture quality

Therefore for any recursive co ding scheme such as interframe co ding based on

previous deco ded frames an enco der must b e able to keep track of what a deco der can see


In the case of JPEG, an 8 x 8 quantization table Q(i,j) is specified in the bitstream that an encoder sends to a decoder. On the other hand, in H.261 the quantization step sizes for all AC coefficients are the same but differ from that for the DC components. In MPEG-1/MPEG-2, two sets of default quantization tables are specified in the standard, without the need to transmit the tables from the encoder to the decoder:

Table: Default quantization tables for (a) intra and (b) non-intra coding, for both luminance and chrominance.

Quantization table for intra coding. Used for quantizing the DCT coefficients of the luminance and chrominance components of intraframes, this default table, shown in part (a) of the table, has a distribution of quantizing values that roughly matches the frequency response of the human eye at a viewing distance of approximately six times the screen width.

Quantization table for non-intra (inter) coding. This table, shown in part (b), is used primarily for quantizing the DCT coefficients of the motion-compensated residues of interframes. Basically it is flat, with a fixed value of 16 for all coefficients, including the DC terms. Even with this flat quantization table, without fully exploiting the properties of the human visual system (HVS), DCT quantization is still an effective way to reduce the bit rate.

In addition to the different quantization tables for intra and non-intra coding, the quantization functions are also different, as shown in the figure below.

In intra coding, fractional values are rounded to the nearest integer:

Q[X_t^{cc}(k,l)] = Q\{X_t^{cc}(k,l)\} = \mathrm{sign}(X_t^{cc}(k,l)) \cdot RND\left\{\frac{8\,|X_t^{cc}(k,l)|}{quantizer\_scale \cdot Q(k,l)}\right\},

where X_t^{cc}(k,l) is the (k,l)-th DCT coefficient of the image block, Q(k,l) is the intra quantization table value for the coefficient (k,l), and quantizer_scale is defined in the bit stream as a tool for improving picture quality and controlling the bit rate. Thus the quantization step size is quantizer_scale \cdot Q(k,l). Here RND{.} means rounding to the nearest integer, RND_\downarrow{.} means truncating fractional values, and

\mathrm{sign}(X) = \begin{cases} 1 & \text{for } X > 0 \\ 0 & \text{for } X = 0 \\ -1 & \text{for } X < 0 \end{cases}

The inverse quantization for intra DCT coefficients is

Q^{-1}\{Q[X_t^{cc}(k,l)]\} = \frac{Q[X_t^{cc}(k,l)] \cdot quantizer\_scale \cdot Q(k,l)}{8}.

Figure: Quantization characteristics for (a) intra and (b) non-intra DCT coefficients (the non-intra quantizer has a dead zone around zero).

In non-intra (inter) coding, fractional values are always rounded down to the smaller magnitude, thus creating a dead zone around zero:

Q[Y_t^{cc}(k,l)] = Q\{Y_t^{cc}(k,l)\} = RND_\downarrow\left\{\frac{8\,Y_t^{cc}(k,l)}{quantizer\_scale \cdot Q'(k,l)}\right\},

where Y_t^{cc}(k,l) is the (k,l)-th DCT coefficient of the motion-compensated residue and Q'(k,l) is the non-intra quantization table value for the coefficient (k,l). The inverse quantization for non-intra DCT coefficients is

Q^{-1}\{Q[Y_t^{cc}(k,l)]\} = \frac{\left(2\,Q[Y_t^{cc}(k,l)] + \mathrm{sign}(Q[Y_t^{cc}(k,l)])\right) \cdot quantizer\_scale \cdot Q'(k,l)}{16}.
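A small sketch of this quantizer pair, following the formulas as reconstructed above (integer arithmetic; the helper names are ours, and exact rounding conventions vary between standards, so treat this as illustrative rather than normative):

    def sign(x):
        return 1 if x > 0 else -1 if x < 0 else 0

    def quant_intra(x, q, scale):
        # Round to the nearest integer: no dead zone.
        step = scale * q
        return sign(x) * ((8 * abs(x) + step // 2) // step)

    def dequant_intra(level, q, scale):
        return level * scale * q // 8

    def quant_inter(x, q, scale):
        # Truncate toward zero: this creates the dead zone around zero.
        return sign(x) * ((8 * abs(x)) // (scale * q))

    def dequant_inter(level, q, scale):
        return 0 if level == 0 else (2 * level + sign(level)) * scale * q // 16

    x = 200                          # a DCT coefficient
    print(quant_intra(x, 16, 8))     # 13
    print(dequant_intra(13, 16, 8))  # 208
    print(quant_inter(x, 16, 8))     # 12
    print(dequant_inter(12, 16, 8))  # 200

Note how the inter path reconstructs with a half-step offset, so small coefficients quantize to zero and stay there.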

Lossless Compression Through Entropy Coding

As mentioned previously, the DCT concentrates most of the energy into a few coefficients, usually the DC and low-frequency components, and quantization exploits the fact that human eyes are sensitive to the DC and low-frequency coefficients. As a result, the quantized DCT coefficients of the same frequency (the sum of the horizontal and vertical frequencies) tend to have similar values, zero or nonzero, unless the texture has strong directional patterns. Therefore zeros tend to cluster along the spectral diagonals in the Cartesian coordinates of the quantized DCT coefficients, with the DC component as the origin and the horizontal and vertical frequencies as the x and y axes, respectively. In the manner shown in the figure above, a zigzag scan of the AC coefficients thus tends to produce long runs of zeros, enabling efficient entropy coding, which is either Huffman coding (variable-length coding) or arithmetic coding (non-integer-length coding). In view of the fact that the DC and AC coefficients have different characteristics in intraframes, they are coded differently, as described below.

Entropy Coding

Entropy coding, also known as noiseless coding, lossless coding or data compaction coding, is a class of coding techniques that reduce the average number of symbols sent without suffering any loss of fidelity or information. Since entropy coding relies on the statistical nature of the source, it requires a statistically optimum code length for each symbol in order to transmit the stream of symbols from the source in as few bits as possible. As a result, entropy coding generally creates a code book containing variable-length codes (VLC) instead of fixed-length codes (FLC). Entropy coding has been extensively studied, and detailed discussions can be found in the literature. A classical example of entropy coding is the Morse code, where short binary codewords are used for the more likely letters and long codewords for the less probable ones. The Morse code can therefore encode English text efficiently, resulting in fewer bits on average than the one-byte ASCII code.

We will discuss three of the most popular techniques for entropy coding, especially useful for motion-compensated DCT video coding:

Huffman Coding (HC): provides optimum codes of integer code lengths with unique prefixes, according to a priori knowledge of the statistical profile of the source symbols.

Arithmetic Coding (AC): provides optimum codes of non-integer code lengths, according to a priori knowledge of the statistical profile of the source symbols.

Run-Length or Run-Level Coding (RLC): efficient for coding sources which tend to repeat symbols for long periods of time.

Before discussing each technique, let us introduce the concept of entropy and Shannon's coding theory. For the sake of brevity, we state the fundamental theorems without proof; proofs can be found in the various references cited above.

Given a random symbol source X having a symbol set A and described by the probability mass function (pmf) p, its entropy H(X), or equivalently H(p), is defined as the average number of bits per symbol:

H(X) = H(p) = -\sum_{a \in A} p(a) \log_2 p(a).

It has been shown that, given a uniquely decodable scalar lossless variable-length code with an encoder operating on a source X_n with marginal pmf p, the resulting average codeword length satisfies

E\{l\} \geq H(p).

The equality holds if and only if

l(a) = -\log_2 p(a) \quad \text{for all } a \in A.

Furthermore, there exists a uniquely decodable scalar lossless code for a source with marginal pmf p for which the average codeword length satisfies

E\{l\} < H(p) + 1.

From the first inequality, the entropy of the source provides the lower bound on the average codeword length; therefore, optimum uniquely decodable variable-length codes are often called entropy codes. To achieve the lower bound, the equality condition must be satisfied, which holds only when the symbol probabilities p are powers of 1/2; otherwise the lower bound is not achievable. However, the last inequality states that there always exists a code that is not too far above the lower bound.

Huffman Coding (HC)

To code a source with a set of symbols A and a known pmf p (not necessarily in powers of 1/2), Huffman developed an optimum code assignment procedure to create a compact code of integer code lengths. This Huffman code assignment procedure is outlined as follows:

1. Let P be the list of probabilities of the source symbols: P = {p(a) for all a in A}. It is useful to list the symbols and their probabilities from top to bottom in decreasing order of probability.

2. Take the two smallest probabilities in P and make the corresponding nodes siblings, generating an intermediate node as their parent. Label the branch from the parent to one of the child nodes with 1 and the other branch with 0.

3. Replace the two probabilities and their associated nodes in the list by the single new intermediate parent node with the sum of the two probabilities (i.e., the number of elements in the new list is reduced by one). If the list still has more than one element, go to Step 2; otherwise the Huffman code for this source is created.

An example is shown in the figure below for an 8-symbol source with

P = {0.25, 0.2, 0.2, 0.16, 0.11, 0.04, 0.03, 0.01}.

The final Huffman code has an average code length E(l) = \sum_i l(i) p(i) = 2.66 bits/symbol, whereas the entropy of this source is H(p) \approx 2.61 bits/symbol.

Figure: Illustration of the Huffman code assignment procedure for the 8-symbol source, yielding codewords 11, 10, 01, 001, 0001, 00001, 000001 and 000000 (lengths 2, 2, 2, 3, 4, 5, 6 and 6) for the symbols with probabilities 0.25, 0.2, 0.2, 0.16, 0.11, 0.04, 0.03 and 0.01, respectively.
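The merging procedure above is easy to reproduce with a priority queue; the following sketch (our own helper, built on Python's heapq) regenerates the code lengths of the 8-symbol example:

    import heapq, itertools

    def huffman(probs):
        # Repeatedly merge the two least probable nodes, prefixing their
        # codewords with 0 and 1, until a single tree remains.
        tie = itertools.count()  # tie-breaker for equal probabilities
        heap = [(p, next(tie), {s: ''}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p0, _, c0 = heapq.heappop(heap)
            p1, _, c1 = heapq.heappop(heap)
            merged = {s: '0' + w for s, w in c0.items()}
            merged.update({s: '1' + w for s, w in c1.items()})
            heapq.heappush(heap, (p0 + p1, next(tie), merged))
        return heap[0][2]

    probs = {7: .25, 6: .2, 5: .2, 4: .16, 3: .11, 2: .04, 1: .03, 0: .01}
    code = huffman(probs)
    avg = sum(probs[s] * len(w) for s, w in code.items())
    print(round(avg, 2))  # 2.66 bits/symbol, vs. an entropy of about 2.61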

Since the Huffman code assignment procedure relies on a priori knowledge of the source statistics, it cannot adapt to a source whose statistical properties are not known in advance or may change with time because of nonstationarity. Adaptive Huffman coding is available by modifying the above procedure, but falls outside the scope of this chapter.

In JPEG, Huffman tables are generated at the encoder to match the statistics of the particular picture to be encoded. However, most video coding standards, such as H.261, MPEG-1/MPEG-2 and H.263, use predetermined fixed variable-length-code (VLC) tables instead of adjusting the tables for each picture (or even each video stream), for the sake of reduced complexity. This is an example of the tradeoff between coding efficiency and coding complexity.

Arithmetic Coding (AC)

Arithmetic coding can be viewed as Elias codes with finite-precision arithmetic. It was developed by Pasco and Rissanen and improved subsequently; a good overview and reference list can be found in the literature.

Recall from the discussion above that Huffman coding can achieve the entropy (minimum redundancy) only when all symbol probabilities are integral powers of 1/2. The worst case happens for a source with one symbol having probability approaching 1: symbols emanating from such a source convey negligible information on average, but still require at least one bit each to transmit. Arithmetic coding removes the restriction that each symbol must translate into an integral number of bits, thereby coding more efficiently; it essentially achieves the theoretical entropy bound on compression efficiency for any source.

The basic idea of arithmetic coding is to cascade a sequence of messages together so as to approach the lower-bound condition, namely a source with all symbol probabilities being integral powers of 1/2. To facilitate the discussion, we assume a binary source with p(x = 0) = q. The idea is described below:

1. Map the input symbol sequences into a subinterval I_n = [a_n, b_n) of [0, 1). At time n = 1, if the first symbol x_1 = 0 then I_1 = [0, q); otherwise, if it is x_1 = 1, set I_1 = [q, 1). At time n, the subinterval I_n = [a_n, b_n) is determined based on the input symbol x_n and the past history:

a_n = \begin{cases} a_{n-1} & \text{if } x_n = 0 \\ a_{n-1} + q\,(b_{n-1} - a_{n-1}) & \text{if } x_n = 1 \end{cases}

b_n = \begin{cases} a_{n-1} + q\,(b_{n-1} - a_{n-1}) & \text{if } x_n = 0 \\ b_{n-1} & \text{if } x_n = 1 \end{cases}

This recursion generates a subinterval I_n of length equal to the probability of the input sequence {x_1, ..., x_n} produced up to time n. Knowing this subinterval I_n is sufficient to determine completely the original input sequence.
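A sketch of this interval recursion for the binary source (q and the bit sequence are illustrative inputs):

    def encode_interval(bits, q):
        # Shrink [a, b) symbol by symbol; each 0 keeps the lower
        # q-fraction of the interval, each 1 keeps the upper part.
        a, b = 0.0, 1.0
        for x in bits:
            split = a + q * (b - a)
            if x == 0:
                b = split
            else:
                a = split
        return a, b

    a, b = encode_interval([0, 1, 0], q=0.7)
    print(a, b, b - a)  # interval length 0.147 = 0.7 * 0.3 * 0.7

The interval length equals the probability of the sequence, so any number inside the final interval identifies the sequence with about -log2(0.147), roughly 2.8, bits.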


2. Generate the code sequence from the sequence of subintervals I_n. The subinterval endpoints a_n and b_n of I_n have binary expansions. At time n = 1, check whether the first terms in the binary expansions of a_1 and b_1 agree. For example, if I_1 \subseteq [0, 1/2), then u_1 = 0 is output from the encoder; if I_1 \subseteq [1/2, 1), u_1 = 1 is the output symbol. If further binary symbols (second, third, and so on) agree, then they are encoded and output too. If the first symbols in the binary expansions of the interval endpoints do not agree, then no encoder symbol is output; the encoder repeats the test for I_2 for the first symbol, and if the first symbols agree, it is output and the encoder tests whether the second symbols of the binary expansions of a_2 and b_2 agree. In general, at time n the encoder finds the largest k such that the first k symbols in the binary expansions of a_n and b_n agree, and generates the sequence of output symbols u_1 u_2 \cdots u_k. This is equivalent to

I_n \subseteq J_k = \left[\sum_{i=1}^{k} u_i 2^{-i},\; \sum_{i=1}^{k} u_i 2^{-i} + 2^{-k}\right).

In other words, the output symbol sequence is used to denote the subinterval that it belongs to.

3. From the symbol sequence, and thus J_k, the decoder can then check whether J_k \subseteq [0, q) or J_k \subseteq [q, 1) to determine x_1. The decoder continues in this way, checking which of the possible subintervals J_k belongs to.

From the above description, it may happen that the precision required to specify the interval endpoints grows without bound. Modification of the above procedure, with occasional rescaling of the subintervals and with finite-precision arithmetic, leads to practical arithmetic coding. For more detail, the literature on arithmetic coding should be consulted.

H.263 uses syntax-based arithmetic coding (SAC) as an alternative to the VLC. Arithmetic coding is more efficient but also more complex than the VLC, and typically results in modest bit-rate savings for intraframes and somewhat larger savings for interframes.

Run-Level Coding (RLC)

If the source tends to produce long series of repetitive symbols, one means of compression is to send a symbol followed by the number of its repetitions (the run length). For example, a binary source, such as a facsimile of a text document, may produce long runs of zeros with occasional ones. Since most quantized DCT coefficients in either an intraframe or an inter (motion-compensated residual) frame are zeros, especially after rearrangement in the zigzag order, run-length coding (RLC) is effective and has become widely used in most motion-compensated DCT video coding standards, such as H.261, MPEG and H.263. In video coding, however, run-length coding is modified into run-level coding to encode the zigzag-ordered DCT coefficients more effectively: a series of zeros is represented by its run, and the magnitude of the following nonzero quantized coefficient is represented by its level. Further compression is possible by coding the run-level pairs with a VLC. A detailed description can be found in the later sections of this chapter.

Coding DC coefficients in intraframes

After the DCT, the DC (i.e., first) coefficients represent the block averages. Neighboring blocks sharing the same scene object usually have highly correlated DC components. For this reason, the DC components of neighboring blocks are grouped together and coded by the predictive DPCM (differential PCM) technique.

The differential DC value DDC(i) of the i-th block in a slice is the difference between the DC coefficient DC(i) and the predictor value PDC(i), as follows:

DDC(i) = DC(i) - PDC(i),

where PDC(i) = DC(i-1). Three predictors are maintained, one for each of the three color components (Y, Cr, Cb). These predictors are reset to the reset value 2^{precision-1} (e.g., 128 for 8 bits of precision) at the following times:

at the start of a slice;

whenever a non-intra macroblock is encoded/decoded;

whenever a macroblock is skipped.

At the decoder side, each time a DC coefficient in a block of an intra macroblock is decoded, the predictor is added to the differential to recover the actual coefficient. The predictor is then set to the value of the coefficient just decoded, or reset to the reset value according to the above rules.

The differential DC value DDC(i) is decomposed into two parts, concatenated together in the bitstream:

1. Size category dct_dc_size. This determines the number of bits required to fully specify the magnitude and sign of the DC difference:

2^{dct\_dc\_size - 1} \leq |DDC(i)| < 2^{dct\_dc\_size}.

The corresponding magnitude range for each size category can be found in the table below. Then dct_dc_size is coded with a VLC, as shown in the table, depending on whether the DC difference is for luminance or chrominance.

2. Sign and magnitude. After the size category is determined, the DC difference is converted to dct_dc_diff, which is represented in binary with a fixed number of bits, i.e., a fixed-length code of dct_dc_size bits:

dct\_dc\_diff = \begin{cases} DDC & \text{if } DDC > 0 \\ DDC + 2^{dct\_dc\_size} - 1 & \text{if } DDC < 0 \end{cases}

At the decoder, the following procedure is adopted to recover the DC coefficients:

    if (dct_dc_size == 0) {
        dct_diff = 0;
    } else {
        half_range = 2^(dct_dc_size - 1);
        if (dct_dc_diff >= half_range)
            dct_diff = dct_dc_diff;
        else
            dct_diff = dct_dc_diff - (2 * half_range - 1);
    }
    DDC = dct_diff;
    DC = dct_dc_pred + dct_diff;
    dct_dc_pred = DC;

As an example, given a list of differential DC values for luminance, their size categories dct_dc_size are first determined from the magnitude ranges above; the VLC for each size category is then followed by the additional dct_dc_size bits that represent the magnitude and sign in binary. The concatenated VLC/FLC codewords form the binary code for the list.
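Since the original example values are not reproduced in this copy, the following hypothetical sketch (our own values and helper names) shows how the size category and the fixed-length part are derived:

    def dc_size(ddc):
        # Smallest s with |ddc| < 2**s; a zero difference needs no extra bits.
        return abs(ddc).bit_length()

    def dc_diff_bits(ddc):
        # Fixed-length magnitude/sign code of dc_size(ddc) bits.
        s = dc_size(ddc)
        if s == 0:
            return ''
        val = ddc if ddc > 0 else ddc + (1 << s) - 1  # negative: offset binary
        return format(val, '0{}b'.format(s))

    for ddc in (10, -2, 0):
        print(ddc, dc_size(ddc), dc_diff_bits(ddc))
    # 10 -> size 4, bits 1010;  -2 -> size 2, bits 01;  0 -> size 0, no bits

Each size category would then be replaced by its VLC from the table, followed by these additional bits.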

Coding AC coefficients in intraframes

Table: VLC code table for the differential DC size category dct_dc_size (separate codes for luminance and chrominance differences, with the corresponding magnitude ranges). MPEG-1 allows only dct_dc_size values up to 8, but MPEG-2 allows values up to 11.

All 63 AC coefficients (i.e., all DCT coefficients except the DC one) within a block are first arranged in a zigzag order, as shown in the scan-order table below, which also shows an alternate scan order

defined in MPEG-2 for more efficiently rearranging the DCT coefficients obtained from fields. Each nonzero AC coefficient is then coded using a composite run-level symbol: run refers to the number of zero coefficients before a nonzero coefficient, and level means the amplitude of the nonzero coefficient. The table below tabulates the variable-length codes for the run-level symbols used to code AC coefficients in MPEG-1/MPEG-2. In MPEG-2 there are two sets of tables: one as used in MPEG-1, the other appropriate for the higher bit rates and resolutions addressed in some MPEG-2 profiles. The trailing bit s of each run-level code in the table denotes the sign of the nonzero coefficient: if s is 0, it is positive; otherwise it is negative. When there are no more nonzero AC coefficients, an end-of-block (EOB) symbol is inserted to code all the trailing zero coefficients of the zigzag-ordered DCT with a single codeword; EOB is so likely that it is assigned a two-bit code, as shown in the table. Combinations of run lengths and levels not found in the table are considered to occur rarely and are thus coded by the escape code, followed by a six-bit code for the run length and an 8- or 16-bit (MPEG-1) or 12-bit (MPEG-2) code for the signed level. In MPEG-1, 8-bit codes are used for levels satisfying |level| <= 127, whereas 16-bit codes are used for the larger levels. Therefore the total number of bits for each escaped run-level is 20 or 28 for MPEG-1 and 24 for MPEG-2, depending on the signed level.

Readers may notice that there are two codes for run-level (0, 1) in the table, and that the first such code, '1s', is the same as the EOB code when s = 0 (i.e., level = +1). As a matter of fact, this table results from two code tables folded into one and thus has a dual purpose by design. The first run-level (0, 1) code is used only in non-intra coding (discussed later in the chapter), where a completely zero DCT block is coded at a higher layer, so the EOB symbol cannot occur before the first run-level symbol is coded. For intra coding, it is possible that the DC coefficient is coded and immediately followed by an EOB symbol without any nonzero AC coefficients; there would then be no way to distinguish the EOB code from this run-level code. As a result, the first run-level (0, 1) code is not used in intra coding.

As an example, consider an 8 x 8 DCT block. Listing its quantized coefficients in the zigzag-scanned order without the trailing zero coefficients, the first entry is the DC coefficient, which is coded separately; each remaining nonzero AC coefficient is then converted into a run-level pair (the run counting the zeros that precede it), and the list is terminated by an EOB symbol. The corresponding VLC codes for these run-level pairs form the code for the AC coefficients of the block.
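A sketch of the zigzag scan plus run-level conversion for an 8 x 8 block of quantized coefficients (the VLC lookup and the EOB codeword themselves are omitted; function names are ours):

    def zigzag_order(n=8):
        # Visit anti-diagonals; even diagonals run bottom-left to top-right.
        return [pos for _, pos in sorted(
            ((m + k, (m, k)) for m in range(n) for k in range(n)),
            key=lambda t: (t[0], t[1][1] if t[0] % 2 == 0 else -t[1][1]))]

    def run_level(block):
        coeffs = [block[m][k] for m, k in zigzag_order(len(block))]
        pairs, run = [], 0
        for c in coeffs[1:]:        # skip the DC term, coded separately
            if c == 0:
                run += 1
            else:
                pairs.append((run, c))
                run = 0
        return pairs                 # trailing zeros collapse into EOB

    blk = [[0] * 8 for _ in range(8)]
    blk[0][0], blk[0][1], blk[2][0], blk[3][3] = 50, -3, 4, 1
    print(run_level(blk))            # [(0, -3), (1, 4), (20, 1)]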

Interframe Block-Based Coding

The first frame of a group of pictures (GOP) is coded as an intraframe (I-frame), but the rest of the frames are coded as interframes. Intraframe coding exploits only spatial correlation, whereas interframe (non-intra) coding exploits both spatial and temporal correlation. The MC-DCT approach considers temporal correlation over only two frames, instead of multiple frames, which need not be consecutive but usually are. The frame to be compressed is the current frame,

Table: Scan orders for the DCT coefficients in a block: (a) zigzag scan order (MPEG-1/MPEG-2); (b) alternate scan order (MPEG-2 only).

but the reference frame can be either a previous frame or a future frame. If the reference frame is the previous frame, then this interframe is called a P-frame. If the reference frame is chosen from the best match or average of the previous and/or future I/P-frames against the current frame, this interframe is called a bidirectional frame (B-frame).

The interframe block-based coding and decoding processes in the MC-DCT approach are depicted in the figure below. The MC-DCT approach models translational motion on a block-by-block basis. Each macroblock in the current frame is compared against the reconstructed reference frame, and the best displacement match is picked as the estimated motion vector, usually in terms of one of two matching criteria:

Minimum Mean Squared Error (MSE):

MSE = \min_{(u,v) \in S} \frac{1}{N^2} \sum_{m,n=0}^{N-1} \left[ x_t(m,n) - x_{t-1}(m+u, n+v) \right]^2,

where S is the search range of the current block over the reference frame and usually depends on the chosen search strategy, such as the Full Search or Fast Search approaches described below. The MSE measures the energy remaining in the motion-compensated residual.

Minimum Absolute Difference (MAD):

MAD = \min_{(u,v) \in S} \frac{1}{N^2} \sum_{m,n=0}^{N-1} \left| x_t(m,n) - x_{t-1}(m+u, n+v) \right|.

For the sake of implementational simplicity, MAD is usually preferred, since it does not need to calculate squares but is comparable to MSE in terms of coding gain.
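For one candidate displacement (u, v), the two criteria can be sketched as follows (frames are plain 2-D lists; the indices are assumed to stay inside the reference frame):

    def mad(cur, ref, u, v):
        n = len(cur)
        return sum(abs(cur[m][k] - ref[m + u][k + v])
                   for m in range(n) for k in range(n)) / (n * n)

    def mse(cur, ref, u, v):
        n = len(cur)
        return sum((cur[m][k] - ref[m + u][k + v]) ** 2
                   for m in range(n) for k in range(n)) / (n * n)

Minimizing either over all (u, v) in the search range gives the motion vector; MAD replaces the squaring of MSE with an absolute value, which is why it is cheaper in hardware.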


Table: VLC for run-level symbols in coding DCT coefficients. NOTE: the last bit, s, denotes the sign of the level: 0 for positive, 1 for negative. End-of-Block shall not be the only code of the block. The first run-level (0, 1) code is used for the first (DC) coefficient in the block in non-intra coding only.

Table (continued): VLC for run-level symbols in coding DCT coefficients.


Table (continued): VLC for run-level symbols in coding DCT coefficients.


Figure: Interframe block-based coding and decoding in the MC-DCT approach. At the encoder, the motion-compensated macroblocks are subtracted from the original macroblocks of pixels to form the motion-compensated residual, which passes through DCT, quantization and entropy coding; an internal loop of inverse quantization, IDCT, frame memory, motion estimation and motion compensation reconstructs the reference frame and produces the estimated motion vectors. The decoder applies entropy decoding, inverse quantization, IDCT and motion compensation to reconstruct the macroblocks of pixels.

The estimated motion vectors are then used, through motion compensation, to form the motion-compensated reference frame, which is subtracted from the current frame on a block-by-block basis to produce the motion-compensated frame residual. This prediction residual is treated in a similar manner to intraframes: it is encoded through DCT, quantization and entropy coding, along with the estimated motion vectors, as the encoded interframe bit stream. The decoding process is basically the reverse of the encoding process, as drawn in the figure. The two main components of this interframe coding process are motion estimation and motion compensation, which are usually performed in (but not restricted to) the spatial domain and are discussed next.

Block-Based Motion Estimation Algorithms

The current frame is divided into contiguous blocks and macroblocks, as described earlier. One motion vector is associated with either one block or one macroblock, depending on how the standard defines it. Remember that each macroblock contains four Y blocks, one Cr block and one Cb block for the case of 4:2:0 color subsampling. For ease of discussion, we call each unit associated with one motion vector a block.

Figure: Block-based motion estimation: a candidate block at the original block position in the current frame is matched against reference blocks within the search range of the reference frame, yielding the displacement (u, v).

Each candidate block in the current frame is compared against all the possible block positions within the search range over the reference frame, as shown in the figure above. The search range is the set of all displacements allowed for the candidate block on the reference frame. Depending on the search strategy chosen, the search range can include all possible displacements (for the Full Search approach) or only selected ones (for the Fast Search approaches) within a reference block which is larger than the candidate block. Different search strategies lead to different block-matching motion estimation approaches and result in different cost/performance tradeoffs. Three types of search strategies are commonly used:

Coarse-to-fine approach. A large step size is used around the original block position in the first step to find the best match. At each subsequent step, the step size is reduced and the new search is centered around the best match of the previous search. This strategy is most commonly used for suboptimal fast search approaches such as the Three-Step Search (TSS), the Logarithmic Search (LOG), etc.

Aggressive approach. A large step size is used around the original block position in the first step to find the best match. At each subsequent step, the center of the new search is moved in the direction of the best match found at the previous stage, with the same step size. The step size is reduced only when the MAD/MSE value of the current stage is larger than that of the previous stage (an overshoot in MAD/MSE is encountered). This approach tries to move the search center to the global minimum in as few steps as possible. An example of this approach is the Four-Step Search (FSS).

Hierarchical approach. The search space is divided into regions, for each of which a center point is chosen. The best match is picked among all the center points, and a full search is performed over the entire region having the best-match center point. This approach is the basis for the Hierarchical Search approaches.

Full-Search Block-Matching Motion Estimation Algorithms

The Full-Search Block-Matching motion estimation algorithm (BKM-ME) minimizes the MAD (Minimum Absolute Difference) criterion of the candidate block of block size N over the search area (reference block), such that

(u_{FS}, v_{FS}) = \arg\min_{(u,v) \in S_{BKM}} \frac{1}{N^2} \sum_{m,n=0}^{N-1} \left| x_t(m,n) - x_{t-1}(m+u, n+v) \right|,

where S_{BKM} = \{(k,l) : -N_{BKM} \leq k, l \leq N_{BKM}\}, \{x_t(m,n) : 0 \leq m, n \leq N-1\} is the candidate block, and the reference block is \{x_{t-1}(m,n) : -N_{BKM} \leq m, n \leq N-1+N_{BKM}\}. In this case the search region contains all the possible displacements of the candidate block x_t within the reference block x_{t-1}.
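A direct sketch of the full search (the block position, block size and search range are parameters; the border test simply skips displacements that would fall outside the reference frame):

    def full_search(cur, ref, top, left, n, rng):
        best, best_uv = float('inf'), (0, 0)
        for u in range(-rng, rng + 1):
            for v in range(-rng, rng + 1):
                if not (0 <= top + u <= len(ref) - n and
                        0 <= left + v <= len(ref[0]) - n):
                    continue
                mad = sum(abs(cur[top + m][left + k]
                              - ref[top + u + m][left + v + k])
                          for m in range(n) for k in range(n)) / (n * n)
                if mad < best:
                    best, best_uv = mad, (u, v)
        return best_uv, best

    # The current block copies the reference shifted by (2, 1):
    ref = [[(r * 13 + c * 7) % 256 for c in range(16)] for r in range(16)]
    cur = [[ref[r + 2][c + 1] if r + 2 < 16 and c + 1 < 16 else 0
            for c in range(16)] for r in range(16)]
    print(full_search(cur, ref, 0, 0, 8, 4))  # ((2, 1), 0.0)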

Sub-Optimal Fast Search Approaches

The Full Search approach takes into account all the possible displacements within the reference block and thus guarantees the best resulting motion-compensated residual, which needs to be coded and sent to the decoder. However, it requires (2 N_{BKM} + 1)^2 searches and adds a heavy burden on a real-time video encoder. In cases where suboptimal performance is acceptable, performance is sacrificed for a reduction of the number of computations required (i.e., the cost). The suboptimal fast search approaches take advantage of a reduced search space with the right search strategies described earlier. A number of fast search approaches, among many more available in the literature, are widely used and discussed in this section.

Three-Step Search Algorithm (TSS)

The Three-Step Search algorithm (TSS) considers a 9-point grid as its search template, with the step size reduced at each stage. The TSS algorithm can be described as follows:

1. Set the grid distance d_1 (e.g., d_1 = 4 for a maximum displacement of +/-7) and the initial center point (u_0, v_0) = (0, 0).

2. For the i-th iteration, find the best match (u_i, v_i) over the nine block positions \{(u_{i-1} + m d_i, v_{i-1} + n d_i) : m, n \in \{-1, 0, 1\}\} of the reference frame x_{t-1}(m+u, n+v) for the candidate block \{x_t(m,n) : 0 \leq m, n \leq N-1\} of block size N in terms of MAD:

(u_i, v_i) = \arg\min_{(u,v)} \frac{1}{N^2} \sum_{m,n=0}^{N-1} \left| x_t(m,n) - x_{t-1}(m+u, n+v) \right|.

3. Reduce the grid distance by half, d_{i+1} = d_i / 2, and repeat Step 2 until d_i = 1.

For an initial grid distance d_1 = 4, the search iterates 3 times (thus the name), as shown in the figure below. Variations of this approach can be found in the literature.

Figure: A sample search path of the Three-Step Search (TSS) block-based motion estimation algorithm.
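A sketch of TSS over an arbitrary cost function (cost(u, v) would be the MAD between the candidate block and the reference block displaced by (u, v)):

    def three_step_search(cost, d=4):
        # 9-point grid around the current center; halve the spacing each step.
        u = v = 0
        while d >= 1:
            u, v = min(((u + i * d, v + j * d)
                        for i in (-1, 0, 1) for j in (-1, 0, 1)),
                       key=lambda p: cost(*p))
            d //= 2
        return u, v

    # Toy cost surface with a unique minimum at (5, -3):
    print(three_step_search(lambda u, v: (u - 5) ** 2 + (v + 3) ** 2))

With d = 4, 2, 1 the algorithm examines at most 25 distinct positions, versus the 225 positions of a full search over +/-7.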

Logarithmic Search Algorithm (LOG)

The Logarithmic Search algorithm (LOG) is very similar to TSS, except that the search pattern is a '+' (five points) instead of a 9-point grid, as shown in the figure below. The LOG algorithm is listed as follows:

1. Set the grid distance d_1 and the initial center point (u_0, v_0) = (0, 0).

2. For the i-th iteration, find the best match (u_i, v_i) over the five block positions \{(u_{i-1} \pm d_i, v_{i-1}), (u_{i-1}, v_{i-1}), (u_{i-1}, v_{i-1} \pm d_i)\} of the reference frame x_{t-1}(m+u, n+v) for the candidate block \{x_t(m,n) : 0 \leq m, n \leq N-1\} of block size N in terms of MAD:

(u_i, v_i) = \arg\min_{(u,v)} \frac{1}{N^2} \sum_{m,n=0}^{N-1} \left| x_t(m,n) - x_{t-1}(m+u, n+v) \right|.

3. Reduce the grid distance by half, d_{i+1} = d_i / 2, and repeat Step 2 until d_i = 1.

For an initial grid distance d_1 = 4, the search iterates 3 times.

Figure: A sample search path of the 2-D Logarithmic Search (LOG) block-based motion estimation algorithm.

Cross Search Algorithm (CRS)

The Cross Search algorithm (CRS) also adopts the coarse-to-fine search strategy, but with a cross ('x') search pattern, as depicted in the figure below. The CRS algorithm is listed as follows:

1. Set the grid distance d_1 and the initial center point (u_0, v_0) = (0, 0).

2. For the i-th iteration, find the best match (u_i, v_i) over the five block positions \{(u_{i-1} \pm d_i, v_{i-1} \pm d_i), (u_{i-1}, v_{i-1})\} of the reference frame x_{t-1}(m+u, n+v) for the candidate block \{x_t(m,n) : 0 \leq m, n \leq N-1\} of block size N in terms of MAD:

(u_i, v_i) = \arg\min_{(u,v)} \frac{1}{N^2} \sum_{m,n=0}^{N-1} \left| x_t(m,n) - x_{t-1}(m+u, n+v) \right|.

3. Reduce the grid distance by half, d_{i+1} = d_i / 2, and repeat Step 2 until d_i = 1.

At the final stage, a full search is performed on all the search points around the best match. A variant of this algorithm may also be found in the literature.

Figure: A sample search path of the Cross Search (CRS) block-based motion estimation algorithm.

Four-Step Search Algorithm (FSS)

The Four-Step Search Algorithm (FSS) starts with a small step size and tries to move the search center close to the global minimum of the matching criterion function (MAD/MSE) as soon as possible, without refining the step size at each iteration, as shown in the figure below. Once the search center is close to the optimum point, it reduces the search step size to refine its search. The FSS algorithm can be described as follows for the case of maximum displacements of $\pm 7$:

[Figure: A sample search path of the Four-Step Search (FSS) block-based motion estimation algorithm.]

1. Set the grid distance $d_1 = 2$ and the initial center point $(u_0, v_0)$.

2. For the $i$-th iteration, find the best match $(\hat{u}_i, \hat{v}_i)$ over the nine block positions $\{(u,v): u \in \{u_{i-1}-d_i,\ u_{i-1},\ u_{i-1}+d_i\},\ v \in \{v_{i-1}-d_i,\ v_{i-1},\ v_{i-1}+d_i\}\}$ of the reference frame $x_{t-1}(m+u, n+v)$ for the candidate block $\{x_t(m,n);\ m,n = 0,\ldots,N-1\}$ of block size $N$ in terms of the MAD:

$$(\hat{u}_i, \hat{v}_i) = \arg\min_{(u,v)} \frac{1}{N^2} \sum_{m,n=0}^{N-1} \bigl| x_t(m,n) - x_{t-1}(m+u,\, n+v) \bigr|.$$

3. If the best match is located at the center of the search window, go to the final step (Step 5).

4. Move the search center to the best match and repeat Step 2 at most two more times with the same step size ($d_i = d_1$ for $i = 2$ and $i = 3$).

5. Reduce the grid distance by half ($d = 1$, i.e., a 3x3 search window). Repeat Step 2 once with $d = 1$. The final best match is the estimated motion vector.

For a search range $\{(u,v): -7 \le u, v \le 7\}$, the search iterates at most three times with the same step size $d_1 = 2$. At the final step, the step size is reduced by half to $d = 1$, as shown in the figure. In this way, the search center moves to the vicinity of the optimum point faster than in the coarse-to-fine approach.

Multiresolution Search Approach (MRS)

[Figure: A sample search path of the Multiresolution (MRS) block-based motion estimation algorithm: a full search over the search area at the lowest resolution locates a best match, which defines the corresponding search regions at successively higher resolutions, up to the original resolution where the estimated motion vector is obtained.]

Exploiting the fact that the image at a lower resolution represents a coarse approximation of itself at a higher resolution, the Multiresolution Search Approach performs a full search on the down-sampled version of the original frames. The best match at the lowest resolution is used as the starting center of the initial search region at the next higher resolution, as shown in the figure above. The final estimated motion vector is obtained by searching at the original resolution around the best match found in the lower-resolution image.
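A minimal sketch of the multiresolution idea, assuming 2:1 down-sampling by 2x2 pixel averaging and reusing the illustrative `full_search_bkm` and `mad` helpers from the earlier sketches:

```python
import numpy as np

def downsample2(frame):
    """2:1 down-sampling by averaging 2x2 pixel groups."""
    h, w = frame.shape[0] // 2 * 2, frame.shape[1] // 2 * 2
    f = frame[:h, :w].astype(np.float64)
    return (f[0::2, 0::2] + f[0::2, 1::2] + f[1::2, 0::2] + f[1::2, 1::2]) / 4.0

def multires_search(cur, ref, top, left, N=16, R=8, refine=1):
    """Full search at half resolution, then a small refinement at full resolution."""
    (u2, v2), _ = full_search_bkm(downsample2(cur), downsample2(ref),
                                  top // 2, left // 2, N // 2, R // 2)
    u0, v0 = 2 * u2, 2 * v2               # scale the coarse vector back up
    blk = cur[top:top + N, left:left + N].astype(np.float64)
    cands = [(u0 + du, v0 + dv)
             for du in range(-refine, refine + 1)
             for dv in range(-refine, refine + 1)]
    return min(cands, key=lambda uv: mad(blk, ref, top + uv[0], left + uv[1], N))
```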

Subpixel Block-Based Motion Estimation

In all the above-mentioned motion estimation schemes, integer-pel displacements are assumed. However, the motion of objects in the real world is continuous and does not necessarily match the sampling grid points after digitization of the analog images. As a result, object motion requires a higher-resolution (subpixel) displacement than integer-pel movements (multiples of the sampling grid distance in the rectangular sampling grid of a camera). Commonly used subpixel accuracy is half-pel, whereas quarter-pel is considered to be the limit of any possible incremental coding gain.

Usually, subpixel motion estimation adopts one of two possible ways:

1. Full search approach: The original images are bilinearly interpolated at the resolution required for subpixel motion estimation. For example, for half-pel motion estimation, the original images need to be interpolated to four times their size (twice in both the vertical and horizontal directions). One of the integer-pel motion estimation methods can then be applied to these interpolated images. The drawback of this approach is the large image size to be handled.

2. Multiresolution search approach: The estimated integer-pel displacement is obtained first through one of the integer-pel motion estimation methods. Then the images are interpolated, and a finer search at subpel accuracy around this estimated integer-pel displacement is performed on these interpolated images. For the case of half-pel accuracy, only the eight points around the integer-pel estimate are considered. This is a significant saving in the number of operations over the full search approach.

More detail on the topic of subpixel motion estimation can be found in a later chapter; a minimal sketch of the refinement step follows.
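The refinement step can be sketched as follows, assuming bilinear interpolation of the reference frame to half-pel resolution; all names are illustrative:

```python
import numpy as np

def halfpel_refine(cur, ref, top, left, u, v, N=16):
    """Refine an integer-pel vector (u, v) over the 8 surrounding half-pel points."""
    blk = cur[top:top + N, left:left + N].astype(np.float64)
    # Bilinearly interpolate the reference frame to double resolution:
    # ref2[2i, 2j] are the original pels; the other samples are half-pel values.
    H, W = ref.shape
    r = ref.astype(np.float64)
    ref2 = np.zeros((2 * H - 1, 2 * W - 1))
    ref2[0::2, 0::2] = r
    ref2[1::2, 0::2] = (r[:-1, :] + r[1:, :]) / 2.0
    ref2[0::2, 1::2] = (r[:, :-1] + r[:, 1:]) / 2.0
    ref2[1::2, 1::2] = (r[:-1, :-1] + r[1:, :-1] + r[:-1, 1:] + r[1:, 1:]) / 4.0

    def sad_half(uu, vv):                 # (uu, vv) in half-pel units
        r0, c0 = 2 * top + uu, 2 * left + vv
        if (r0 < 0 or c0 < 0 or
                r0 + 2 * N - 1 >= ref2.shape[0] or c0 + 2 * N - 1 >= ref2.shape[1]):
            return np.inf
        patch = ref2[r0:r0 + 2 * N:2, c0:c0 + 2 * N:2]  # sample back at pel spacing
        return np.abs(blk - patch).sum()

    cands = [(2 * u + du, 2 * v + dv) for du in (-1, 0, 1) for dv in (-1, 0, 1)]
    best = min(cands, key=lambda uv: sad_half(*uv))
    return best[0] / 2.0, best[1] / 2.0   # motion vector in pel units
```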

Co ding Motion Vectors

Dep ending on whether a macroblo ck b elongs to a Pframe or Bframe eachmac

th

roblo ck the i macroblo ck MBi mayhave a set of asso ciated motion vectors

CHAPTER MOTIONCOMPENSATED DCT VIDEO CODING

MV stiwithtwo temp oral directions forward and backward in time twospa

tial directions horizontal and vertical comp onents and two displacement precisions

full p el and half p el

s for forward motion vector and s for backward motion vector

t for the horizontal comp onent and t for the vertical comp onent

In MPEG each macroblo ckmay also havetwo sets of motion vectors each for one

eld of a picture

Motion vectors tend to be highly correlated with those in neighboring macroblocks; for example, in a pan, all vectors would be roughly the same. To make use of this correlation, motion vectors are coded using a DPCM technique. In other words, they are coded differentially with respect to previously decoded motion vectors in order to reduce the number of bits required in the coded video stream. The motion vector predictor is defined as

$$PMV(s,t,i) = \begin{cases} MV(s,t,i-1), & \text{otherwise,} \\ 0, & \text{at the start of a slice.} \end{cases}$$

Different standards specify different rules on when $PMV(s,t,i)$ should be reset to zero. For example, in P-pictures of MPEG-1, the motion vector used for DPCM (the prediction vector) is set to zero at the start of each slice and at each intra-coded macroblock. Note that macroblocks which are coded as predictive but which have no motion vector also set the prediction vector to zero. In B-pictures of MPEG-1, there are two motion vectors, forward and backward, and each vector is coded relative to the predicted vector of the same type. Both motion vectors are set to zero at the start of each slice and at each intra-coded macroblock. Note that predictive macroblocks which have only a forward vector do not affect the value of the predicted backward vector; similarly, predictive macroblocks which have only a backward vector do not affect the value of the predicted forward vector.

The motion vector difference for the $i$-th macroblock is

$$MVD(s,t,i) = MV(s,t,i) - PMV(s,t,i).$$

In H.261, the horizontal and vertical components of this motion vector difference are then coded as variable-length codes (VLC) according to the table below. Notice that, for bandwidth efficiency, there are two MVD values mapped to the same VLC, except for two special cases. This wrap-around representation of motion vector differences is made possible by the fact that the range of motion vector values is constrained: only one of the pair will yield a motion vector falling within the permitted range, since the difference between the two values in a pair is larger than the span of the permitted motion vector range.

[Table: VLC table for MVD (motion vector difference) in H.261, listing each MVD value (and its wrap-around pair) with the corresponding variable-length code.]

In MPEG-1/MPEG-2, each component of a motion vector difference ($MVD$) is coded using three parameters:

- $f\_code(s,t)$ specifies the range. It is represented in 3 (in MPEG-1) or 4 (in MPEG-2) unsigned bits and can be picked up at the picture header; thus it can be changed only once per frame/field. $f\_code(s,t)$ cannot be zero. For MPEG-1, $f\_code(s,t)$ is chosen such that, for the largest positive or negative $MVD$ in the picture, it is the minimum of all $f_c$ which can satisfy the decodable range, i.e.,

$$f\_code(s,t) = \min\{\, f_c : |MVD| \le 16 \cdot 2^{f_c - 1} \ \text{for all } MVD \,\}.$$

- $motion\_code(s,t,i)$ is the principal part, which is coded in a variable-length code (VLC) in accordance with the same table for MPEG-1/MPEG-2, or even H.261. As a matter of fact, when there is no motion residual, the coding of motion vectors in MPEG-1/MPEG-2 is compatible with H.261.

- $motion\_residual(s,t,i)$ is the residual part, which is represented in a fixed-length code (FLC) of $f\_code - 1$ bits and concatenated with the $motion\_code(s,t,i)$. Both the VLC of $motion\_code$ and the FLC of $motion\_residual$ may change from macroblock to macroblock.

In MPEG-1, each $MVD$ component is first wrapped around to fit into the range $[-16f, 16f)$, where $f = 2^{f\_code(s,t)-1}$:

$$NMVD(s,t,i) = \bigl( (MVD(s,t,i) + 16f) \bmod 32f \bigr) - 16f,$$

where $\bmod$ denotes the modulo operation. For the sake of simplicity, we will drop the index notation $(s,t,i)$ in all equations below. $NMVD$ is then decomposed into

$$NMVD = motion\_code \times f - \mathrm{sign}(motion\_code) \times motion\_residual,$$

where

$$motion\_code = \mathrm{INT}\Bigl\{ \frac{NMVD + \mathrm{sign}(NMVD)\,(f-1)}{f} \Bigr\}, \qquad
motion\_residual = |motion\_code \times f| - |NMVD|,$$

and $\mathrm{INT}\{\cdot\}$ means rounding towards zero (taking the integer part). The residual part $motion\_residual$ is coded in its one's-complement value as a fixed-length code of $f\_code - 1$ bits. At the decoder side, the motion vector can be obtained after wrapping the sum of the reconstructed $PMV$ and $MVD$:

$$MV = \bigl( (PMV + MVD + 16f) \bmod 32f \bigr) - 16f,$$

where

$$MVD = motion\_code \times f - \mathrm{sign}(motion\_code) \times motion\_residual.$$


As an example taken from the MPEG-1 standard, assume that $f\_code = 2$ (so $f = 2$) and that the initial prediction is set to zero. Each full-pel motion vector in a slice then yields a differential value, which is wrapped into the range $[-32, 32)$ and decomposed into the corresponding $motion\_code$ and $motion\_residual$, which are finally mapped to their VLC/FLC codes.
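The following minimal sketch walks through the same wrap/decompose arithmetic for sample values (the values here are illustrative, not those of the original example, and the names are for exposition only):

```python
def wrap(val, f):
    """Wrap a differential value into the range [-16f, 16f)."""
    return (val + 16 * f) % (32 * f) - 16 * f

def mpeg1_split(mvd, f):
    """Decompose a wrapped differential into (motion_code, motion_residual)."""
    nmvd = wrap(mvd, f)
    sign = (nmvd > 0) - (nmvd < 0)
    motion_code = int((nmvd + sign * (f - 1)) / f)   # INT{} truncates toward zero
    motion_residual = abs(motion_code * f) - abs(nmvd)
    return motion_code, motion_residual

def mpeg1_join(motion_code, motion_residual, f):
    """Decoder-side reconstruction of the differential."""
    sign = (motion_code > 0) - (motion_code < 0)
    return motion_code * f - sign * motion_residual

# Example: f_code = 2, so f = 2 and the wrap range is [-32, 32).
f = 2
for mvd in (3, -7, 30, -33):
    code, res = mpeg1_split(mvd, f)
    assert wrap(mpeg1_join(code, res, f), f) == wrap(mvd, f)
    print(mvd, '->', (code, res))
```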

However, in MPEG-2 the calculation of $motion\_code$ and $motion\_residual$ is a little different from MPEG-1. In decoding the motion vectors from the compressed bitstream, the decoded differential vector $MVD$ is calculated as follows:

$$MVD = \begin{cases} motion\_code, & \text{if } f = 1 \text{ or } motion\_code = 0, \\ \mathrm{sign}(motion\_code) \bigl[ (|motion\_code| - 1)\, f + motion\_residual + 1 \bigr], & \text{otherwise,} \end{cases}$$

where the residual component is coded as simple binary using $f\_code - 1$ bits per code word, whereas the principal part is coded in VLC. The final reconstructed motion vector $MV$ is obtained by wrapping the sum of the motion predictor $PMV$ (usually the previously reconstructed motion vector) and $MVD$ in the same way as in MPEG-1, so that $MV$ falls in the range $[-16f, 16f)$:

$$MV = \bigl( (PMV + MVD + 16f) \bmod 32f \bigr) - 16f.$$

Block-Based Motion Compensation

In the spatial domain, block-based motion compensation in the MC-DCT approach becomes trivial. As shown in part (a) of the figure below, for the case of motion-compensated prediction, with the estimated motion vector $(\hat{u}_t, \hat{v}_t)$ obtained in the motion estimation stage, the reconstructed reference block is translationally displaced and cut to fit the candidate block size $N$ to form the predicted current frame:

$$\hat{x}_t^{MC}(m,n) = MC(\hat{x}_{t-1};\, \hat{u}_t, \hat{v}_t) = \hat{x}_{t-1}(m + \hat{u}_t,\, n + \hat{v}_t), \quad \text{for } m, n = 0, \ldots, N-1,$$

where $\hat{x}_{t-1}$ is the reconstructed reference block before motion compensation. For bidirectional frames (B-frames), it is sometimes better to use motion-compensated interpolation, shown in part (b) of the figure, where the predicted current frame is actually the average of the forward and backward predictions:

$$\hat{x}_t^{MC}(m,n) = \tfrac{1}{2}\bigl\{ MC(\hat{x}_{t-1};\, \hat{u}_t^F, \hat{v}_t^F) + MC(\hat{x}_{t+1};\, \hat{u}_t^B, \hat{v}_t^B) \bigr\}
= \tfrac{1}{2}\bigl\{ \hat{x}_{t-1}(m + \hat{u}_t^F,\, n + \hat{v}_t^F) + \hat{x}_{t+1}(m + \hat{u}_t^B,\, n + \hat{v}_t^B) \bigr\},$$

for $m, n = 0, \ldots, N-1$, where $\hat{x}_{t-1}$ and $\hat{x}_{t+1}$ are the reconstructed previous and future reference blocks, respectively, and $(\hat{u}_t^F, \hat{v}_t^F)$ and $(\hat{u}_t^B, \hat{v}_t^B)$ are the forward and backward motion vectors, respectively, for the current block.

For the case of subpixel motion vectors, the reconstructed reference frame must be interpolated, usually through the bilinear interpolation function in the spatial domain, before subpixel motion compensation can be performed. A more detailed description can be found in a later chapter.
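The integer-pel case reduces to simple array indexing, as the following minimal NumPy sketch of the two prediction modes shows (names and conventions are illustrative, and boundary handling is omitted):

```python
import numpy as np

def mc_predict(ref, top, left, u, v, N=16):
    """Forward/backward prediction: displace and crop an N x N block."""
    return ref[top + u:top + u + N, left + v:left + v + N]

def mc_interpolate(prev, nxt, top, left, fwd, bwd, N=16):
    """B-frame style prediction: average of forward and backward predictions."""
    p_f = mc_predict(prev, top, left, fwd[0], fwd[1], N).astype(np.float64)
    p_b = mc_predict(nxt, top, left, bwd[0], bwd[1], N).astype(np.float64)
    return (p_f + p_b) / 2.0
```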

The reason for using a reconstructed reference frame instead of the original reference frame is that the decoder state (the contents of the frame memory) must tightly track the encoder state, and the decoder only has knowledge of the reconstructed frames, with no access to the original images. Any divergence between the encoder and decoder states will result in progressively worse image quality when the reference frame of a P-frame is itself a P-frame. A detailed treatment of this topic can be found in a later chapter.

Coding DCT Coefficients in Interframes

A motion-compensated residual frame is obtained by subtracting its motion-compensated prediction from a non-intra picture. Unlike for intra-coded pictures, it has been shown that the DCT cannot optimally decorrelate non-intra pictures. However, because the correlation in a motion-compensated residual frame is already small, any loss in coding efficiency due to this non-optimal coefficient decorrelation will also be small. In fact, relatively coarse quantization of the DCT coefficients of motion-compensated residual blocks is effective in reducing the bit rate, even with a flat default quantization table.

[Figure: Motion compensation methods. (a) Motion-compensated prediction from the previous reference frame: the current block of the P-frame is predicted from a displaced block in the reference frame via the forward motion vector (du, dv). (b) Motion-compensated interpolation: the current block is predicted from both the previous and future reference frames via forward and backward motion vectors.]

[Figure: Numbering of the Y/Cb/Cr blocks in a macroblock (luminance blocks 0-3, chrominance blocks 4-5) for interpreting the coded block pattern in each macroblock header.]

Unlike in intra-frame coding, the DC and AC coefficients in the quantized DCT coefficients of motion-compensated residual blocks are treated equally, since the DC coefficients of residuals are differential values similar to their AC counterparts. The coding of non-intra quantized DCT coefficients follows a hierarchical coding scheme:

- Completely zero macroblocks, coded at the macroblock layer: the macroblock_address_increment in the macroblock header indicates when to skip a macroblock and can efficiently code a run of one or more completely zero macroblocks.

- Zero blocks in a nonzero macroblock, coded at the macroblock layer: the coded block pattern (cbp) in the macroblock header is a 6-bit variable, coded in VLC, that indicates which blocks in a macroblock are zero and can be skipped (a one-line sketch of this computation follows the list):

$$cbp = \sum_{i=0}^{5} P_i \, 2^{5-i},$$

where $P_i = 1$ when the $i$-th block in a macroblock is nonzero and $P_i = 0$ when it is a zero block. The Y/Cb/Cr blocks within a macroblock are numbered as shown in the figure above.

- Nonzero blocks, coded at the block layer: all the quantized DCT coefficients of nonzero motion-compensated residual blocks are coded with the run-level VLC in the same way as in intra coding, except that DC coefficients are treated in the same way as AC coefficients.
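For concreteness, a one-line sketch of this computation (block ordering as in the figure above; the name is illustrative):

```python
def coded_block_pattern(blocks):
    """blocks: six quantized 8x8 NumPy coefficient arrays, ordered Y0..Y3, Cb, Cr."""
    return sum((1 if b.any() else 0) << (5 - i) for i, b in enumerate(blocks))
```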

[Figure: Conventional hybrid motion-compensated DCT video codec. (a) Encoder: intraframe coding applies DCT, quantization (Q), and VLC; interframe coding adds a feedback loop with Q^{-1}, IDCT, frame memory, and a spatial-domain motion estimation/compensation unit (SD-ME). (b) Decoder: VLD, Q^{-1}, and IDCT, with motion compensation from the frame memory.]

[Figure: Conceptual data flow for coding a video sequence using the MC-DCT approach, from the picture layer (intraframe I vs. interframe P/B), through the GOB/slice and macroblock layers (motion estimation/compensation against the reconstructed frame memory, DPCM of motion vectors into f_code, motion_code, and motion_residual), down to the block layer (DCT, quantization, zigzag scan, run-length coding, and VLC/FLC), ending at the output buffer for the compressed bit stream.]

Motion-Compensated DCT Video Encoder and Decoder

The MC-DCT approach provides two different paths for encoding a frame:

- Intraframe coding encodes the current frame without the knowledge of any reference frame, by exploiting spatial correlation.

- Interframe coding encodes the current frame with the knowledge of one reference frame, by exploiting both spatial and temporal correlation.

However, both intraframe and interframe coding use a number of common building blocks: DCT, quantizer, and entropy coder. As a result, the MC-DCT encoder typically has a switch to control the coding mode (interframe or intraframe), as shown in part (a) of the codec figure above. Similarly, the MC-DCT decoder is also capable of switching between the intraframe coding mode and the interframe coding mode, as depicted in part (b). This conventional hybrid MC-DCT codec structure is the basis of the video coder and decoder architectures used in all the MC-DCT-based video coding standards.

To summarize the MC-DCT approach, the conceptual data flow is depicted in the figure above. Based on the GOP structure, the encoder needs to determine whether the incoming picture is coded as an intraframe (I) or an interframe (P/B). Each I-frame is then divided into macroblocks, each of which contains luminance blocks (usually four 8x8 blocks) and two chrominance blocks, depending on the picture format (usually 4:2:0). Each block of pixels is converted through the DCT into DCT coefficients, which are then quantized according to a quantization table matched to the human visual system. The DC coefficients of neighboring blocks are then arranged in a group or slice and processed by means of DPCM to generate differential DC values (DDC), which are coded in two components: the size category (dct_dc_size), coded in VLC, and the sign/magnitude part, coded in FLC. The AC coefficients are reordered in a zigzag manner and then translated into run-length codes, which are finally coded in VLC.

For P/B-frames, motion estimation must be performed on a macroblock-by-macroblock basis, on the current frame and the reconstructed reference (previous and/or future) frame, to produce forward and/or backward motion vectors, respectively. Then, with these motion vectors, predictive (forward/backward) or interpolated motion compensation (i.e., the average of forward and backward predictions) is applied to the reconstructed reference frame to generate the motion-compensated frame residual. This residual is then passed to a DCT unit and a quantization unit with a flat quantization table to produce a set of quantized DCT coefficients. If all the coefficients in a macroblock are zero, then this macroblock is skipped through macroblock_address_increment. If only some blocks in a nonzero macroblock have all-zero quantized coefficients, then the coded block pattern (cbp) is used to skip those zero blocks. All the coefficients (including DC) of all nonzero blocks are encoded in the same way as in intra blocks: the DCT coefficients are rearranged in a zigzag order and then translated into run-length codes, which are finally coded in VLC. The motion vectors are coded with DPCM to produce differential motion vectors (DMV), which are coded in three components:

- f_code: the size component, represented in 3 or 4 unsigned bits (MPEG-1/MPEG-2) and stuffed in the picture header;

- motion_code: the principal component, coded in VLC;

- motion_residual: the residual component, coded in an FLC of f_code - 1 bits.

All these encoded bits are placed in the output buffer and encapsulated with the layering information in accordance with the video stream syntax and semantics defined in each video coding standard. The decoder will receive these bits in order and recover the frames in the reverse of the order described above.

In addition to the coding schemes described in this chapter, different video standards may also include additional features or advanced methods to improve the compression ratio or to fit different application targets. The following is a short list of the features adopted in the standards, which will be discussed in detail in the next chapter:

- Scalability. Scalability allows a single compressed video stream to be decoded at different quality levels. This requires partitioning the pictures into several layers: the base layer (lowest layer) and the enhancement layers. One layer of video (the base layer) is coded independently, whereas the other layers are coded dependently with respect to the previous layer. This facilitates the integration of multiple video services. There are various types of scalable coding techniques, especially in MPEG-2:
  - Signal-to-noise ratio (SNR) scalability: each layer offers an incremental quality improvement by increasing the number of quantization levels, while the spatial resolution remains the same.
  - Spatial scalability: each layer has a different spatial resolution (larger pictures).
  - Temporal scalability: each layer has a different temporal resolution (more frames per second).

- Rate control. It is possible to vary the quantization (through quantizer_scale) to improve picture quality or to control the bit rate; there is a trade-off between maintaining constant picture quality and maintaining a constant bit rate. The encoding procedure described above usually generates a compressed video bit stream at a variable bit rate if we try to maintain the same picture quality throughout the encoding process. However, we can also vary the quantization so as to keep the output bit rate constant. This can be done by feeding back the output buffer level to control the quantization step sizes. More detail can be found in the literature.

- Error concealment/resilience. Error concealment means that, whenever an error is found in the bit stream, a decoder tries not to recover the error itself but to make the effect of the error on the decoded picture less noticeable to viewers. When a decoder detects errors, through external means or internally, it will replace the part in error with skipped macroblocks until the next slice is received. MPEG-2 has an error concealment feature: I-frames may contain coded motion vectors used only for error concealment, and a slice in error may be replaced with motion-compensated pixels from previous I/P frames with the help of the motion vectors enclosed in the I-frames.

- Advanced or enhancement modes. These are new approaches to improve the compression ratio or the picture quality, such as the unrestricted motion vector (UMV) mode, the advanced prediction mode (including four motion vectors per MB and overlapped block motion compensation, OBMC), the advanced intra coding mode, the modified quantization mode, the deblocking filter mode, the improved PB-frame mode, etc.

Fully DCT-Based Motion-Compensated Video Coder Structure

In most international video coding standards, such as CCITT H.261, MPEG-1, and MPEG-2, as well as the proposed HDTV standard, the Discrete Cosine Transform (DCT) and block-based motion estimation are the essential elements used to achieve spatial and temporal compression, respectively. Most implementations of a standard-compliant coder adopt the conventional motion-compensated DCT video coder structure, as shown in part (a) of the figure below. The feedback loop for temporal prediction consists of a DCT, an inverse DCT (IDCT), and a spatial-domain motion estimator (SD-ME), which is usually the full-search block-matching approach (BKM). This is undesirable: in addition to the complexity added to the overall architecture, this feedback loop limits the throughput of the coder and becomes the bottleneck of a real-time high-end video codec. A compromise is to remove the loop and perform open-loop motion estimation based upon the original images instead of the reconstructed images, at the sacrifice of coder performance.

The presence of the IDCT block inside the feedback loop of the conventional video coder design comes from the fact that currently available motion estimation algorithms can only estimate motion in the spatial domain, rather than directly in the DCT domain. Therefore, developing a transform-domain motion estimation algorithm makes it possible to eliminate this IDCT. Furthermore, the DCT block in the feedback loop is used to compute the DCT coefficients of the motion-compensated residuals; if motion compensation is performed in the DCT domain, this DCT block can be moved out of the feedback loop. From these two observations, an alternative solution without degradation of performance is to develop motion estimation and compensation algorithms which can work in the DCT domain. In this way, the DCT can be moved out of the loop, as depicted in part (b) of the figure below, and thus the operating speed of this DCT can be reduced to the data rate of the incoming stream. Moreover, the IDCT is removed from the feedback loop, which now has only two simple components, Q and Q^{-1} (the quantization pair), in addition to the transform-domain motion estimator (TD-ME). This not only reduces the complexity of the coder but also resolves the bottleneck problem without any trade-off in performance. Furthermore, different components can be jointly optimized if they operate in the same transform domain. It should be stressed that, by using DCT-based estimation and compensation methods, standard-compliant bit streams can be formed in accordance with the specification of any standard, such as MPEG, without any need to change the structure of any standard-compliant decoder.

[Figure: Different motion-compensated DCT video coder structures. (a) Conventional hybrid motion-compensated DCT video coder: motion estimation/compensation is performed in the spatial domain (SD-ME) inside a feedback loop containing DCT, Q, Q^{-1}, and IDCT. (b) Fully DCT-based motion-compensated video coder: motion estimation/compensation is completed in the transform (DCT) domain (TD-ME), with the DCT moved out of the loop.]

Attempts have recently been made to realize DCT-based coders on a limited basis. In this book, we present completely DCT-based motion estimation and compensation algorithms, which perform motion estimation and compensation directly on the DCT coefficients of video frames rather than on pixels. In this way, a fully DCT-based video coder architecture can be realized to boost the system throughput and reduce the total number of components.

In summary, the resultant fully DCT-based motion-compensated video coder structure enjoys several advantages over the conventional hybrid motion-compensated DCT video coder structure:

- Fewer coder components and lower complexity. Removal of the DCT/IDCT pair from the feedback loop of the fully DCT-based structure reduces the total number of components required in the feedback loop, and thus the complexity of the complete coder.

- Higher throughput rate. The feedback loop of a video coder requires processing at the frame rate so that the previous frame data can be stored in the frame memory and be available for coding the next incoming frame. Traditionally, this loop has four components plus the spatial-domain motion estimation and compensation unit, and thus creates the bottleneck for encoding large frame sizes in real time. In the conventional coder, the whole frame must be processed by both the DCT/IDCT pair and the Q/Q^{-1} pair before the next incoming frame arrives; in the DCT-based structure, the whole frame must be processed by only the Q/Q^{-1} pair. This results in a less stringent requirement on the processing speed of the feedback loop components. Alternatively, this may increase the throughput rate of the coder and thus allow processing of larger frame sizes as technology keeps improving the processing speed of these components. This high-throughput advantage becomes increasingly important as advances in optical networking technology permit transmission of high-quality, production-grade video signals over broadband networks in real time at affordable costs.

- Compatibility with existing standards. The fully DCT-based structure encodes the intraframes and motion-compensated residuals in the DCT domain in the same way as the hybrid structure does. The encoded bit stream can be made fully compatible with the existing video coding standards. More detail on matching different coder/decoder structures can be found in a later chapter.

- Lower computational complexity of DCT-based motion estimation and compensation approaches. As demonstrated later in the book, the DCT-based motion estimation and compensation approaches have lower computational complexity. Furthermore, due to the decorrelation property of the DCT exploited in most video standards, most of the energy tends to cluster in a few DCT coefficients (especially the DC terms), with the rest being zero after quantization. This characteristic is particularly beneficial to the DCT-based approach, since no computation is needed for the majority of DCT coefficients, which are zero.

- Joint optimization of DCT-based components. A fast lattice-structured DCT coder generates dual outputs (DCT and DST), which can be utilized by the DCT-based motion estimation algorithms.

- Extendibility to a transcoder structure. An optimal transcoder modifies the encoded video bit stream directly in the DCT domain to fit usage requirements (such as frame rate conversion, frame size conversion, bit rate conversion, etc.) different from the usage requirement originally planned for. The fully DCT-based structure handles video data completely in the DCT domain and therefore can easily be extended to provide a transcoder function by cascading a DCT-based coder with a DCT-based decoder, with certain simplifications and modifications required by the end usage. For example, the DCT at the front of a DCT-based coder and the IDCT of a DCT-based decoder can be removed.

- Additional information processing. DCT coefficients carry certain information which can be utilized, for example, for image segmentation in the DCT domain. The DCT-based coder structure facilitates such uses of DCT coefficients.

Chapter

Video Coding Standards

With the advances in technologies such as video compression, telecommunication, and consumer electronics, the era of digital video has arrived. One of the exciting prospects of the advancements in video compression is that multimedia information, comprising image, video, and audio, has the potential to become just another data type. This usually implies that multimedia information will be digitally encoded so that it can be manipulated, stored, and transmitted along with other digital data types. This new technology accelerates the availability of video applications such as digital laserdisc, electronic cameras, videophone, videoconferencing, image and interactive video tools on computers, HDTV, and multimedia systems. Unlike with the digital audio technology of the past few decades, the data involved with still or motion pictures are so voluminous that data compression is inevitable, as we have discussed in the previous chapter. In principle, compression methods are based on the nonlinearity of human vision, which is more sensitive to energy at lower spatial frequencies. Hence, pictures can be lossily encoded with much less data than the original image, without significantly decreasing the quality of the reconstructed image.

In addition, when we develop high data compression schemes to reduce transmission/storage capacity, we also require sophisticated picture coding technology to integrate the whole system performance. For such data usage to be pervasive, it is essential that the data encoding be standard across different platforms and applications. This will foster widespread development of applications and will also promote interoperability among systems from different vendors. Thus, standards for picture coding are strongly required. Furthermore, standardization can lead to the development of cost-effective implementations, which in turn will promote the widespread use of multimedia information.


Overview of Video Coding Standards

A number of existing or evolving international video coding standards, made by the ITU (International Telecommunication Union, formerly called CCITT) and the ISO (International Standard Organization), are listed in the table below.

Organization    Standard    Bit Rate         Applications                          Glue Standard
ISO/CCITT       JPEG        --               still images only                     --
ITU-T           H.261       p x 64 kbit/s    ISDN video phone                      H.320
ISO             MPEG-1      ~1.5 Mbit/s      video on CD-ROM                       pt. 1
ITU-T           H.262       Mbit/s           video over ATM                        H.310
ISO             MPEG-2      Mbit/s           generic high-bit-rate apps, HDTV      pt. 1
ITU-T           H.263       kbit/s           PSTN video phone                      H.324
ISO             MPEG-4      kbit/s           coding of objects                     MSDL
ITU-T           H.263+      kbit/s           PSTN video phone                      --
ITU-T           H.26L       kbit/s           --                                    --

Table: Different video coding standards.

For still images, data compression exploits correlation in space; for video signals, it exploits correlation in both space and time. It is hard to distinguish a reconstructed image that was encoded with a moderate compression ratio from the original, and video data, even after compression at higher ratios, can be decompressed with close to analog-videotape quality.

The motion-compensated DCT video compression scheme (MC-DCT) is the basis of several of the international video coding standards tabulated in the table above, ranging from the low-bit-rate, high-compression videophone application to the high-end, high-bit-rate, high-quality High-Definition Television (HDTV) application requiring a modest compression rate. As mentioned in the previous chapter, the MC-DCT scheme belongs to the class of hybrid spatial-temporal waveform-based video compression approaches. As illustrated in the figure below, the MC-DCT scheme employs motion estimation and compensation to reduce or remove temporal redundancy, and then uses the DCT to exploit the spatial correlation among the pixels of the motion-compensated predicted frame errors (residuals). Efficient coding is accomplished by adding the quantization and variable-length coding steps after the DCT block. Basically, all the standards in the table follow this procedure, with modifications of each step to reach different targeted bit rates and application goals. In the following, the DCT-based video coding standards are discussed to provide the reader with the background necessary to understand the rest of the material in this book.

[Figure: The Motion-Compensated DCT (MC-DCT) scheme: the encoder chain of motion estimation and compensation, DCT, quantizer, coding model, and entropy coder is mirrored at the decoder by the entropy decoder, decoding model, inverse quantizer, inverse DCT, and motion compensation, connected through the transmission channel.]


JPEG Standards

Since the mid-1980s, a joint ISO/CCITT committee group known as JPEG (Joint Photographic Experts Group) has been working to study efficient coding schemes for continuous-tone still images. The standard may be applied in various fields, such as image storage, color facsimile, newspaper photo transmission, desktop publishing, medical imaging, electronic digital cameras, and so forth. The JPEG standard provides several coding modes, from basic to sophisticated, according to the application field.

The JPEG baseline algorithm is based on the transform coding approach. The source image is divided into non-overlapping blocks of 8x8 pixels, each of which is transformed using the two-dimensional DCT. The resulting 2-D DCT coefficients represent the frequency content of the given block, where most of the energy concentrates near the zero-frequency (direct-current) term. Next, the DCT coefficients are quantized. Following quantization, the coefficients are zigzag scanned to arrange them in order of ascending frequency. Then the DC and low-frequency coefficients are encoded using Huffman-style coding schemes. There is another DCT-based JPEG algorithm, called the extended system, which provides higher compression performance through arithmetic coding. The third mode of JPEG coding is the independent function, which utilizes a 2-D Differential Pulse Code Modulation (DPCM) technique; this spatial prediction algorithm has lower compression performance compared with the DCT-based algorithms.

Overall, the DCT-based algorithms can achieve higher compression ratios but are lossy. Because our focus in this book is video coding, we are not going to discuss JPEG in detail; please refer to the literature for details about JPEG.

ITU H Series

In parallel to ISO MPEG, ITU-T has made several H-series standards for multimedia communication; these include, among others, H.261 and H.263. H.261 is a video coding standard defined by the ITU-T Study Group XV (SG15) for video telephony and video conferencing applications, and it emphasizes low bit rates and low coding delay. It originated in 1984 and was intended to be used for audiovisual services at bit rates around m x 384 kbit/s, where m is between 1 and 5. In 1988, the focus shifted, and it was decided to aim at bit rates around p x 64 kbit/s, where p is from 1 to 30; therefore, H.261 also has an informal name, "p x 64". For small p (e.g., p = 1), a low-quality video signal for use in picture phones can be transmitted over a 64 kbit/s line; for large p, a high-quality video signal for teleconferencing can be transmitted over a Mbit/s-class line. H.261 was approved in December 1990. Because of its bidirectional communication nature, the maximum coding delay is specified to be 150 ms. The input formats used and defined in H.261 are CIF (Common Intermediate Format) and QCIF. The H.261 encoder is a hybrid coder which combines motion-compensated interframe prediction with the DCT, and the coding algorithm used in H.261 is basically block-matching motion compensation with transform coding. This framework forms the basis of all video coding standards that were developed later; therefore, H.261 has had a very significant influence on many other existing and evolving video coding standards.

H.263 was defined by ITU-T SG15, the same group that defined H.261. The H.263 effort started around November 1993. The main goal of this endeavor was to design a video coding standard suitable for applications with bit rates below 64 kbit/s, the so-called very-low-bit-rate applications. For example, sending video data over the public service telephone network (PSTN) or the mobile network implies video rates on the order of 10 to 20 kbit/s. During the development of H.263, it was identified that the near-term goal would be to enhance H.261 for very-low-bit-rate applications, and the long-term goal would be to design a video coding standard fundamentally different from H.261 to achieve even better quality. As the standardization activities moved along, the near-term effort became H.263 and H.263+, and the long-term effort is now referred to as H.26L.

After the standardization of H.263 was finished, the continued interest in very-low-bit-rate video coding made it clear that further enhancements to H.263 were possible, in addition to its four optional modes (the unrestricted motion vector mode, the syntax-based arithmetic coding mode, the advanced prediction mode, and the PB-frame mode, which will be discussed later). ITU-T SG15 therefore established the H.263+ effort to meet the need for standardization of such enhancements to H.263. Similar to H.263, H.263+ is supposed to provide a near-term standardization for the applications of real-time telecommunication and related non-conversational services. These enhancements can be either improved quality of functionalities provided by H.263 or additional capabilities to broaden the range of applications; for example, they can be improvements in perceptual compression efficiency, reduction in video delay, or greater error resilience. Since H.263+ was a near-term solution to the standardization of enhancements to H.263, it considered only well-developed proposed enhancements that fit into the framework of H.263 (i.e., motion compensation and DCT-based transform coding). On the other hand, H.26L is a parallel activity in ITU-T SG15 that is intended to be a long-term effort; it considers more radical algorithms that do not need to fit the H.263 framework. It is expected that H.26L will be aligned with the work in MPEG.

Other H-series standards related to H.263 are H.223 for the multiplexing protocol, H.245 for communication procedures, and H.324, which defines the telephone terminal equipment. The collection of these, together with G.723.1 for speech coding at 5.3/6.3 kbit/s and V.34 for the modem interface, forms the ITU-T recommendations for very-low-bit-rate audiovisual telephony; H.324 is often used to refer to the whole set of standards.

MPEG Standards

In 1988, in response to the growing need for a common format for coding and storing digital video, the International Organization for Standardization (ISO) established the Moving Picture Experts Group (MPEG), with the mission to develop standards for the coded representation of moving pictures and associated audio information on digital storage media. MPEG completed the first phase of its work in 1992 with the development of the ISO 11172 standard, "Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s". This standard is also known as MPEG-1. A hybrid coding scheme, known as motion-compensated interframe prediction plus DCT, is used in MPEG-1. The prediction scheme predicts not only from the past but also from the future; as a result, there are three functions associated with prediction: forward motion compensation, backward motion compensation, and interpolative motion compensation.

In 1991, MPEG started the second phase of its work, namely to develop extensions to MPEG-1 that would allow greater input-format flexibility, higher data rates (as needed by high-definition TV, HDTV), and better error resilience. That work led to the ISO 13818 standard, also adopted as ITU-T Recommendation H.262, "Generic coding of moving pictures and associated audio". This standard is also known as MPEG-2. The MPEG-1 and MPEG-2 standards are now widely used in commercial products, such as CD-interactive, digital video cameras, and digital audio and video broadcasting. MPEG-1 and MPEG-2 deal with frame-based video and audio, and in many applications they have offered the solution to replace previously existing analog systems with digital ones. The most important goal of these standards has been to make storage and transmission very efficient.

As digital media become widely used, there is a blurring of the borders between three distinct services (communications, interactivity, and broadcasting) and their corresponding industry sectors, namely TV/film, computers, and telecommunications. While the convergence among these sectors may take a long time, the distinction between the services is disappearing. For instance, in recent years, interactivity has been added to broadcast services, and many communication/interactive applications have appeared on the Internet. In anticipation of this trend, in July of 1993 the MPEG group initiated a new standardization phase referred to as MPEG-4. Unlike MPEG-1 and MPEG-2, wherein the emphasis was primarily on coding efficiency, the objective of MPEG-4 was to standardize algorithms for audiovisual coding in multimedia applications, allowing for interactivity, high compression, scalability of video and audio content, and support for natural and synthetic audio and video content. In contrast to the frame-based paradigm of MPEG-1 and MPEG-2, MPEG-4 proposes an object-based paradigm for scene representation. The first version of the MPEG-4 international standard was released in the spring of 1999; we will briefly introduce the standard in a later section, and refer the reader to a later chapter for the detailed discussion.

MPEG is also working on the development of MPEG-7, called the "Multimedia Content Description Interface", for today's information retrieval systems, which support accessing multimedia data anytime and anywhere. MPEG-7 specifies a standardized description of various types of multimedia information. This description is to be associated with the content itself to allow fast and efficient searching for material that a user may be interested in. However, MPEG-7 is beyond the scope of this book and will not be covered.

Video Coding Standards

H.261

The video data structure in H.261 is hierarchical. Each frame has a picture layer, leading with a picture header specifying the picture format and frame number, followed by 12 GOBs for the CIF format and 3 GOBs for QCIF. Each GOB in turn has 33 macroblocks (MBs) for both CIF and QCIF, as shown in the figure below. A macroblock is the basic data unit for compression mode selection.

[Figure: The resulting GOB structures for a frame of pictures in H.261: a 352x288 CIF picture is divided into 12 groups of blocks (GOB1-GOB12), each GOB contains 33 macroblocks numbered 1-33, a macroblock (MB) consists of four 8x8 luminance blocks Y1-Y4 plus one Cb and one Cr block, and a block is an 8x8 array of pixels.]

It consists of a macroblock header, the compression mode, four Y blocks (the luminance component), one U block, and one V block (the chrominance components), due to subsampling of the chrominance components.

There are two major modes in H.261: intra mode and inter mode. In the intra mode, only the DCT is used for compression, in a similar way to JPEG image compression; in the inter mode, the motion-compensated DCT approach is applied, i.e., motion estimation and compensation are used to exploit temporal correlation in addition to DCT compression. DCT coefficients are thresholded and then uniformly quantized, with a step size of 8 for the DC component (the DCT coefficient indexed as (0,0)) or with the specified MQUANT value for the AC components (all DCT coefficients other than the DC component). For the AC components, a central dead zone around zero is used to avoid the ringing effect. The quantized DC and AC coefficients are zigzag scanned and then encoded with the run-length code, which specifies a series of events, each containing a run length of zero coefficients preceding a nonzero coefficient, together with the value of that nonzero coefficient. Under constant visual quality, the encoded bit stream has a variable bit rate; for ISDN transmission, a fixed bit rate is desired. Therefore, a rate-buffer control mechanism is recommended, but not specified, in H.261.

Because compressed image transmission is more sensitive to channel errors, error resilience (including synchronization and concealment techniques) is required in the transmission coder, shown in the figure below.

[Figure: A block diagram of the H.261 coder: the video signal passes through the source coder, video multiplex coder, transmission buffer, and transmission coder, under a common coding control, to produce the coded bit stream; the mirrored operations can be applied at the decoder side.]

A. Texture Coding Using the Discrete Cosine Transform (DCT)

Transform coding has been widely used to remove redundancy between data samples. In transform coding, a set of data samples is first linearly transformed into a set of transform coefficients; these coefficients are then quantized and entropy coded. A proper linear transform can decorrelate the input samples and hence remove the redundancy. Another way to look at this is that a properly chosen transform can concentrate the energy of the input samples into a few transform coefficients, so that the resulting coefficients are easier to code than the original samples. Among the many available transforms, the DCT is the most widely used in speech and image processing for data compression. This is due to its better energy-compaction property and its near-optimal performance, which is closest to that of the Karhunen-Loeve Transform (KLT) among many discrete transforms for highly correlated signals, especially for first-order Markov processes. Thus, the DCT is a very important technique in video signal processing.

A separable 2-dimensional (2-D) NxN DCT is defined as follows:

$$X(k,l) = \frac{2}{N}\, C(k)\, C(l) \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x(m,n)\, \cos\frac{(2m+1)k\pi}{2N}\, \cos\frac{(2n+1)l\pi}{2N},$$

where

$$C(k) = \begin{cases} 1/\sqrt{2}, & \text{if } k = 0, \\ 1, & \text{otherwise.} \end{cases}$$

The inverse DCT (IDCT) is defined as follows:

$$x(m,n) = \frac{2}{N} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} C(k)\, C(l)\, X(k,l)\, \cos\frac{(2m+1)k\pi}{2N}\, \cos\frac{(2n+1)l\pi}{2N},$$

where $x(m,n)$ is a real number as defined in the IDCT equation. In terms of both objective coding gain and subjective quality, the DCT performs well for typical image data. After the transform, the DCT coefficients are quantized, which is mainly where the compression comes from.
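The following is a direct NumPy transcription of these two equations, useful as a reference implementation (it is O(N^4) if expanded naively; practical codecs use fast separable algorithms, and library routines such as SciPy's `dct(..., norm='ortho')` applied along each axis give the same result):

```python
import numpy as np

def dct_basis(N):
    """basis[k, m] = cos((2m + 1) k pi / (2N))."""
    k = np.arange(N)[:, None]
    m = np.arange(N)[None, :]
    return np.cos((2 * m + 1) * k * np.pi / (2 * N))

def dct2(x):
    """2-D DCT: X(k,l) = (2/N) C(k) C(l) sum_m sum_n x(m,n) cos(.) cos(.)."""
    N = x.shape[0]
    B = dct_basis(N)
    C = np.where(np.arange(N) == 0, 1.0 / np.sqrt(2.0), 1.0)
    return (2.0 / N) * np.outer(C, C) * (B @ x @ B.T)

def idct2(X):
    """Inverse: x(m,n) = (2/N) sum_k sum_l C(k) C(l) X(k,l) cos(.) cos(.)."""
    N = X.shape[0]
    B = dct_basis(N)
    C = np.where(np.arange(N) == 0, 1.0 / np.sqrt(2.0), 1.0)
    return (2.0 / N) * (B.T @ (np.outer(C, C) * X) @ B)

# Round-trip check on a random 8x8 block:
x = np.random.rand(8, 8)
assert np.allclose(idct2(dct2(x)), x)
```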

Due to the high data rates in video communication systems, special-purpose DCT chip sets are sometimes required to perform real-time computation and match the required computation speed. For example, the HDTV system proposed by General Instrument Corporation requires a very high video data rate. In a later chapter, we will present a promising DCT architecture which can achieve the high speed required by HDTV systems.

B. Motion Estimation/Compensation

The transform coding described previously removes spatial redundancy within each frame of a picture; it is therefore referred to as intra coding. However, for video material, inter coding is also very useful. Typical video material is composed of moving objects and thus contains a large amount of redundancy along the temporal axis: video frames that are close in time usually have a large amount of similarity. Therefore, it is possible to improve the prediction process by first estimating the motion of each region in the scene, since transmitting the difference between frames is more efficient than transmitting the original frames. This is often achieved by matching each block in the current frame with the previous frame to find the best-matching area. More specifically, the encoder can estimate the motion (i.e., displacement) between the previous frame and the current frame for each block, as illustrated earlier. Such an area in the previous frame is then offset properly to form the estimate of the corresponding block in the current frame. This is similar to the concept of differential coding and is also a special case of predictive coding: the previous frame is used as an estimate of the current frame, and the residual (the difference between the estimate and the true value) is coded. The residual has much less energy and is therefore much easier to code. This process is called motion compensation (MC), or more precisely, motion-compensated prediction. The residual is then coded using the same process as in intra coding. When the estimate is good, it is more efficient to code the residual than to code the original frame.

Frames that are coded without any reference to previously coded frames are called intra frames (simply I-frames or I-pictures). Frames that are coded using a previous frame as a reference for prediction are called prediction frames (simply P-frames or P-pictures). Note, however, that a P-frame may contain not only inter-coded blocks but also intra-coded blocks. The reason is as follows: for a certain block, it may be impossible to find a well-matching area in the reference frame to be used as a prediction, in which case direct intra coding of such a block is more efficient. This situation happens often when there is occlusion in the scene or when the motion is very heavy. Motion compensation saves the bits for coding the DCT coefficients; however, it does imply that extra bits are required to carry information about the motion vectors. Efficient coding of motion vectors is therefore also an important part of H.261. Because the motion vectors of neighboring blocks tend to be similar, differential coding of motion vectors is used: that is, instead of coding motion vectors directly, the previous motion vector is used as a prediction for the current motion vector, and the residual is then coded using a VLC table.

The basic principle of block-matching ME is, for every block in the current frame (called the current block), to find the best-matched block within a range in the previous frame (called the predicted block). The displacement of the predicted block relative to the current block is called a motion vector (MV). A motion-compensated difference block is formed by subtracting the pixel values of the predicted block from those of the current block, point by point. Texture coding is then performed on the difference block. The coded MV and the coded texture information of the difference block are transmitted to the decoder. Utilizing this information, the decoder can then reconstruct an approximated current block by adding the quantized difference block to the predicted block identified by the MV. In ME/MC, several basic issues need to be considered; the criteria for making decisions on these issues are:

1. to result in a small dynamic range for the difference block;
2. to use few bits for the motion vectors; and
3. to have a low computational complexity.

Due to its lower computational complexity compared with other difference measures, the sum of absolute differences (SAD) is used. Let $\{c(i,j);\ i,j = 1,\ldots,N\}$ be the pixels of the current block and $\{p(m,n);\ -R+1 \le m,n \le R+N\}$ be the pixels in the search range of the previous frame. Then

$$\mathrm{SAD}(x,y) = \begin{cases} \displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N} |c(i,j) - p(i,j)| - C, & \text{for } (x,y) = (0,0), \\[2ex] \displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N} |c(i,j) - p(i+x,\, j+y)|, & \text{otherwise,} \end{cases}$$

where $-R \le x, y \le R$ and $C$ is a positive constant (e.g., 100 for a 16x16 block). The $(x,y)$ pair resulting in the minimum SAD value is called the motion vector $(MV_x, MV_y)$. Because the positive constant $C$ is subtracted from the sum of absolute differences when $(x,y) = (0,0)$, the zero motion vector is favored; the purpose is to concentrate the distribution of MVs around $(0,0)$, so that entropy coding of the MVs is more efficient.

The motion-compensated difference between blocks is defined as

$$d(i,j) = c(i,j) - p(i + MV_x,\, j + MV_y), \qquad i,j = 1,\ldots,N.$$

The difference block $\{d(i,j)\}$ is then transformed, quantized, and entropy coded. At the decoding end, the motion vector $(MV_x, MV_y)$ and the quantized difference block $\{\tilde{d}(i,j)\}$ are available to reconstruct the current frame as follows:

$$\tilde{c}(i,j) = \tilde{d}(i,j) + p(i + MV_x,\, j + MV_y), \qquad i,j = 1,\ldots,N.$$
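A minimal sketch of this biased search (the block size, search range, and the value of C are illustrative choices):

```python
import numpy as np

def biased_sad_search(cur, prev, top, left, N=16, R=15, C=100):
    """Find (MVx, MVy) minimizing SAD, with the zero vector favored by C."""
    cblk = cur[top:top + N, left:left + N].astype(np.float64)
    best, best_sad = (0, 0), np.inf
    for x in range(-R, R + 1):
        for y in range(-R, R + 1):
            r0, c0 = top + x, left + y
            if r0 < 0 or c0 < 0 or r0 + N > prev.shape[0] or c0 + N > prev.shape[1]:
                continue
            sad = np.abs(cblk - prev[r0:r0 + N, c0:c0 + N]).sum()
            if (x, y) == (0, 0):
                sad -= C                 # favor the zero motion vector
            if sad < best_sad:
                best_sad, best = sad, (x, y)
    return best
```

The difference block d = c - p(. + MVx, . + MVy) is then transformed and quantized exactly as in intra coding.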

C. Quantization

Quantization implies loss of information, so transform coding is a lossy coding process. The quantization step size depends on the available bit rate and also on the coding mode. Two types of quantizers are applied for quantizing the DCT coefficients in the H.261 encoder/decoder. The intra DC coefficient is uniformly quantized with a step size of 8 and no dead zone. Each of the other quantizers, for AC and inter DC coefficients, is nearly uniform but with a central dead zone around zero, as shown in the figure below. The step size Q is an even integer in the range of 2 to 62, which represents 31 quantizers. Since many AC coefficients have near-zero levels, a midtread quantizer, with zero as one of the output values, is used. H.261 specifies only the reconstruction levels; how to perform quantization is left to the designer. For the other coefficients (intra AC, or inter DC and AC), a nearly uniform midtread quantizer is employed: the input between -Th and Th is quantized to level zero, and except for the dead zone, the step size Q is uniform. The dead zone is used to remove noise around zero. All the coefficients in a macroblock, except for the intra DC coefficients, go through the same quantizer.

[Figure: Quantizers in H.261 (Th is the threshold). (a) Uniform quantizer for the intra DC coefficient only. (b) Nearly uniform midtread quantizer for the inter DC and all AC coefficients (dead zone = 2Th; the step size Q can be changed from MB to MB, in increments of 2, from 2 to 62, adaptively).]

The quantization step size represents the distance between possible values of the quantized signal. By varying the step size, the amount of information used to describe a particular pixel or block of pixels can be changed: larger step sizes result in less information being required, while the accuracy of the representation is reduced; smaller step sizes result in better quality, but also in an increase in the amount of information to be transmitted. In the H.261 coder, the length of the coded bit stream depends on the image properties (complexity, motion, or scene changes), so an easy way to control the output bit rate is through the quantizer step sizes. In each MB, the possible compression modes specify the major mode, the quantization step size (MQUANT), the motion vector data (MVD), the coded block pattern (CBP), and whether a spatial filter is applied to the motion-compensated residuals.

The quantized DCT coefficients are then converted into a one-dimensional array for entropy coding; the figure below shows the zigzag scan order used in H.261 for this conversion. Most of the energy concentrates in the low-frequency coefficients, and the high-frequency coefficients are usually very small and quantized to zero before the scanning process. Therefore, this scan order is chosen to create long runs of zero coefficients, which is important for entropy coding.

[Figure: Zigzag scan order of the DCT coefficients, starting at the DC coefficient (0,0) and traversing the 8x8 block diagonally from low to high horizontal/vertical frequencies.]

The resulting 1-D array is then decomposed into segments, with each segment containing one or more (or no) zeros followed by a nonzero coefficient. With an event representing the pair of the number of zeros (the run) and the level of the nonzero coefficient, a Huffman coding table can be built to represent each event by a specific codeword, i.e., a sequence of bits. This coding process is sometimes called run-length coding, and the table is called a variable length coding (VLC) table. In H.261, this table is often referred to as a two-dimensional (2-D) VLC table because of its 2-D nature, i.e., the event as (run, level).
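To make the (run, level) event construction concrete, here is a minimal sketch (illustrative only; the zigzag order is the standard 8×8 scan, but the event format is simplified and trailing zeros are left to an EOB code):

    # Decompose a quantized 8x8 block into (run, level) events for a
    # 2-D VLC table, scanning the coefficients in zigzag order.

    def zigzag_order(n=8):
        """Return the (row, col) visiting order of an n x n zigzag scan."""
        return sorted(((r, c) for r in range(n) for c in range(n)),
                      key=lambda rc: (rc[0] + rc[1],
                                      rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

    def run_level_events(block):
        """List the (run, level) pairs of a quantized block."""
        events, run = [], 0
        for r, c in zigzag_order():
            coeff = block[r][c]
            if coeff == 0:
                run += 1
            else:
                events.append((run, coeff))
                run = 0
        return events   # trailing zeros produce no event (EOB is sent)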

H.263

H.263 is the video coding standard for low bit-rate communication over the ordinary telephone line (POTS). As in other standards, the video bit rate may be variable and controlled by the terminal or the network. In addition to CIF and QCIF, as supported by H.261, H.263 also supports sub-QCIF, 4CIF, and 16CIF, with the color space YCrCb sampled in the 4:2:0 format. Resolutions of those picture formats can be found in the table below, and the frame rate is 29.97 frames/sec (exactly 30000 frames per 1001 seconds) in the progressive mode.

                                Sub-QCIF   QCIF    CIF     4CIF    16CIF
    No. of pixels per line         128      176     352     704     1408
    No. of lines                    96      144     288     576     1152
    Uncompressed rate (Mb/s)       4.4      9.1    36.5    145.8   583.3

Table: Picture Formats (uncompressed rates assume 4:2:0 sampling, 8 bits per sample, and 29.97 frames/sec).
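The uncompressed rates in the table follow directly from the resolutions; a small sketch of the arithmetic, under the same assumptions stated in the table caption (4:2:0 sampling, 8 bits per sample, 29.97 frames/s):

    # Uncompressed bit rate of a 4:2:0 picture format: the two chroma
    # planes together add half a luma plane's worth of samples.

    def uncompressed_mbps(width, height, fps=30000 / 1001, bits=8):
        samples_per_frame = width * height * 1.5   # Y + Cb/4 + Cr/4
        return samples_per_frame * bits * fps / 1e6

    # uncompressed_mbps(352, 288) -> about 36.5 Mb/s for CIF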

A. Hierarchical Structure in H.263

H.263 also has a hierarchical data structure with four layers.

Picture Layer: Each frame consists of 6 GOBs for sub-QCIF, 9 GOBs for QCIF, and 18 GOBs for CIF, 4CIF, and 16CIF, respectively. Data for each picture starts with a byte-aligned Picture Start Code (PSC) and other picture header information, followed by the GOBs, an end-of-sequence (EOS) code, and stuffing bits for byte alignment. The type information in the header specifies the picture type (INTRA frame: I-picture; INTER frame: P-picture) and the optional modes (Unrestricted Motion Vector mode, Syntax-based Arithmetic Coding mode, Advanced Prediction mode, PB-frames mode, etc.).

Group of Blocks Layer: Each GOB has one macroblock row for sub-QCIF, QCIF, and CIF, two macroblock rows for 4CIF, and four for 16CIF. Data for each GOB starts with stuffing bits plus a byte-aligned GOB Start Code (GBSC) and other header information, and ends with the macroblock data.

Macroblock Layer: Each macroblock in turn has four 8×8 luminance blocks and two spatially corresponding 8×8 color-difference blocks in the default 4:2:0 mode. The macroblock data include the macroblock header (specifying the macroblock type), motion vector data, and block data.

Block Layer: DC components of DCT coefficients are encoded as INTRADC with a fixed-length code, separately from the rest of the coefficients, which are encoded as TCOEF with a run-length code.

B. H.263 Quantization Method

Let COF be a DCT coefficient to be quantized, LEVEL be the absolute value of the quantized version of the DCT coefficient, and COF′ be the reconstructed DCT coefficient. Quantization and dequantization of the DC coefficient of an INTRA block are performed as follows:

    LEVEL = COF / 8,    COF′ = 8 × LEVEL.

Quantization of the AC coefficients is specified by a quantization parameter QP that may take integer values from 1 to 31; the quantization step size is 2·QP. The equations for quantization are given as follows:

    For INTRA:  LEVEL = |COF| / (2·QP)
    For INTER:  LEVEL = (|COF| − QP/2) / (2·QP)

Clipping to [−127, 127] is performed for all coefficients except intra DC. Dequantization is defined as follows:

    |COF′| = 0                      if LEVEL = 0
    |COF′| = 2·QP·LEVEL + QP        if LEVEL ≠ 0 and QP is odd
    |COF′| = 2·QP·LEVEL + QP − 1    if LEVEL ≠ 0 and QP is even

COF′ is then obtained as COF′ = Sign(COF) · |COF′|. Clipping to [−2048, 2047] is performed before the IDCT.
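A minimal sketch of this quantizer/dequantizer pair, assuming integer divisions that truncate toward zero (as in the test models); this illustrates the equations above rather than reproducing the normative text:

    # H.263-style AC quantization/dequantization for one coefficient.

    def quantize_ac(cof, qp, intra):
        """Quantize one AC coefficient with parameter qp in 1..31."""
        mag = abs(cof) if intra else max(abs(cof) - qp // 2, 0)
        level = min(mag // (2 * qp), 127)   # clip to [-127, 127]
        return level if cof >= 0 else -level

    def dequantize_ac(level, qp):
        """Reconstruct the coefficient from its quantized level."""
        if level == 0:
            return 0
        mag = 2 * qp * abs(level) + (qp if qp % 2 else qp - 1)
        cof = mag if level > 0 else -mag
        return max(-2048, min(2047, cof))   # clip before the IDCT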

C. H.263 vs. H.261

Since H.263 was built on top of H.261, the main structures of the two standards are essentially the same. A few major differences are summarized in the table below.

    Features in H.263 but not in H.261:
    Half-pel motion compensation.
    3-D VLC tables.
    The quantization step size can change at each MB (H.261 changes the step size less frequently).
    Four options that are negotiable between the encoder and the decoder. At the beginning of each communication session, the decoder signals the encoder which of these options it has the capability to decode; if the encoder supports any of these options, it enables them. The four options are the unrestricted-motion-vector mode, the syntax-based arithmetic coding mode, the advanced-prediction mode, and the PB-frame mode.

Table: H.263 vs. H.261.

Next we will discuss those differences in detail.

Half-Pel Prediction and Motion Vector Coding

A major difference between H.263 and H.261 is the half-pel prediction in the motion compensation; as a result, the predictive coding of motion vectors in H.263 is more sophisticated than that in H.261. In H.263, the accuracy of a motion vector (MVx, MVy) can be either integer-pel or half-pel, so motion vectors with half-integer components are possible. When a motion vector has noninteger values, bilinear interpolation is used to find the corresponding pel values for prediction: for half-pixel accuracy, bilinear interpolation has to be used on the previous frame so that p(i − x, j − y) is defined when x or y is half of an integer pel. Interpolation is performed as shown in the figure below.

Figure: Bilinear interpolation for half-pixel motion estimation and compensation. {A, B, C, D} are the available integer pixel values and {a, b, c, d} are the interpolated half-pixel values: a = A, b = (A + B) / 2, c = (A + C) / 2, d = (A + B + C + D) / 4, where "/" denotes division with round-off.
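A sketch of this interpolation, with "/" implemented as division with round-off (adding half the divisor before truncating, an assumption consistent with the figure):

    # Bilinear half-pel interpolation. frame is a 2-D array of pels;
    # (y2, x2) are coordinates in half-pel units.

    def half_pel(frame, y2, x2):
        y, x = y2 // 2, x2 // 2
        dy, dx = y2 % 2, x2 % 2
        A = frame[y][x]
        if dx and dy:                              # position d
            return (A + frame[y][x + 1] + frame[y + 1][x]
                    + frame[y + 1][x + 1] + 2) // 4
        if dx:                                     # position b
            return (A + frame[y][x + 1] + 1) // 2
        if dy:                                     # position c
            return (A + frame[y + 1][x] + 1) // 2
        return A                                   # integer position a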

The predictive coding of motion vectors in H.263 is more sophisticated than that in H.261. The motion vectors of three neighboring MBs (the left MB, the above MB, and the above-right MB) are used to form the prediction of the motion vector of the current block. However, around a picture or GOB border, special cases are needed: either a zero motion vector is used for a neighboring MB that is outside the picture/GOB, or the motion vector of the only neighboring MB that is inside is used to replace the predictors that are outside, as shown in the figure below.

Figure: Motion vector prediction at picture/GOB boundaries. MV is the current motion vector; MV1, MV2, and MV3 are the previous (left), above, and above-right motion vectors. In (a), a zero motion vector is used for a neighboring MB outside the picture; in (b), the motion vector of the only inside MB is used to replace the predictors that are outside.
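The predictor itself is formed as the component-wise median of the three candidates (the median operation is referred to again under the advanced prediction mode below); a minimal sketch, with out-of-picture candidates already replaced as described:

    # Median motion-vector predictor from three neighboring candidates.

    def median3(a, b, c):
        return sorted((a, b, c))[1]

    def predict_mv(mv1, mv2, mv3):
        """mv1..mv3 are (x, y) candidates; returns the predictor."""
        return (median3(mv1[0], mv2[0], mv3[0]),
                median3(mv1[1], mv2[1], mv3[1]))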

Run-Length Coding of DCT Coefficients

H.263 improves the run-length coding used in H.261 by adding an extra term, Last, to indicate whether the current coefficient is the last nonzero coefficient of the block. A set (run, level, last) therefore represents an event and is mapped to a codeword in the VLC table, hence the name 3-D VLC. With this scheme, the end-of-block (EOB) code used in H.261 is not needed anymore. A Huffman coding table is built to represent each event by a specific codeword, i.e., a sequence of bits: events that occur more often are represented by shorter codewords, and less frequent events are represented by longer codewords.

The table below is an example of the VLC type of table. The transform coefficients in this table correspond to input samples chosen as the residues after motion compensation; the third column represents the magnitude of the level.

    Last   Run   Level   VLC Code
    ...    ...   ...     ... s

Table: Partial VLC table for DCT coefficients (each codeword ends with a sign bit s; the numeric entries are omitted here).

The sign bit s added at the end of each VLC code takes care of the sign of the level. It can be seen from the table that more frequently occurring symbols, e.g., symbols with smaller magnitudes, are assigned fewer bits than the less frequently occurring symbols. It is reasonable to assume that symbols of smaller magnitude occur more frequently than large-magnitude symbols, because most of the time we code the residue found after motion compensation, and this residue does tend to have small magnitudes.

Unrestricted Motion Vector Mode

This is the first of the four negotiable options defined in H.263. In the default mode, motion vectors are restricted to the range [−16, 15.5], such that all referenced pixels are confined to the coded picture area. In the Unrestricted Motion Vector mode, however, motion vectors are allowed to point outside the picture boundary, with a maximum range of [−31.5, 31.5]; when this happens, edge pels are repeated to extend the picture to the pels outside, so that prediction can still be performed. Significant coding gain can be achieved with unrestricted motion vectors if there is movement around the picture edges, especially for smaller picture formats like QCIF and sub-QCIF. In addition, this mode allows a wider range of motion vectors than H.261; large motion vectors can be very effective when the motion in the scene is caused by camera movement.

Syntax-Based Arithmetic Coding

In the Syntax-based Arithmetic Coding (SAC) mode, all the variable length coding/decoding operations are replaced with arithmetic coding/decoding. The use of a variable length codec (VLC/VLD), usually accomplished by Huffman coding, implies that each symbol must be encoded into a fixed, integral number of bits; an arithmetic coder removes this restriction and allows a variable, noninteger number of bits per coded symbol, so significantly fewer bits may be produced. Experiments show a modest average gain for inter frames and a larger gain for intra blocks and frames.

Advanced Prediction Mode

This mode contains two basic features.

Four motion vectors per macroblock: Each 8×8 luminance block is associated with its own motion vector. In general, using four motion vectors gives better predictions, since one motion vector is used to represent the movement of an 8×8 block instead of a 16×16 MB. This allows greater flexibility in obtaining a best match for each part of the MB, and when these parts are put together, a much better prediction for the MB is obtained. Of course, this implies more motion vectors and hence requires more bits to code them; if the savings in the residue for the MB are offset by the extra bits needed to send the four motion vectors, there is no point in sending four motion vectors, so the encoder has to decide when to use four motion vectors and when to use only one. With four motion vectors per MB, it is no longer possible to use the same scheme as in the baseline case to code the motion vector difference, so the prediction of motion vectors has to be redefined: the locations of the three neighboring blocks whose motion vectors are used as predictors now depend on the position of the current block in the MB. A new set of predictors is defined in the standard, as shown in the figure below.

Figure: Advanced prediction mode. MV is the current motion vector, and MV1, MV2, and MV3 are its predictors; the locations of the predictor blocks depend on the position of the current 8×8 block with respect to the macroblock boundary.

The predictors are chosen such that none of them is redundant. For instance, two predictors taken from within the same MB would be largely redundant: motion vectors coming from the same MB are quite likely to be close together, so the information carried by one of them would be suppressed by the median operation. Such redundancy is avoided by picking a motion vector from another MB.

Overlapped block motion compensation (OBMC): The second feature is support for overlapped block motion compensation. OBMC uses the motion vectors of neighboring blocks, in addition to that of the current block, to reconstruct a block, thereby leading to an overall smoothing of the image and removal of blocking artifacts; it also leads to better predictions, which results in a smaller bitstream. In overlapped motion compensation, every pixel of the final 8×8 luminance prediction block p(i, j) is obtained as a weighted sum of three prediction values, created as follows:

    p(i, j) = Σ_{k=0..2} p̃(i + u_k, j + v_k) · H_k(i, j),

where (u_0, v_0) is the motion vector of the current block (k = 0), (u_1, v_1) is the motion vector of the block either above or below (k = 1), and (u_2, v_2) is the motion vector of the block either to the left or to the right (k = 2) of the current block; p̃(i, j) denotes the reference (previous) frame, and {H_k(i, j), k = 0, 1, 2} are weighting matrices predefined in the standard.

Here the weights are predefined in the standard. For instance, a pixel in the top-left part of the block picks as its remote vectors the motion vectors of the blocks above and to the left of the current block, as shown in the figure below.

Figure: OBMC for the upper-left part of a block. Here (u_0, v_0) is the motion vector of the current block, (u_1, v_1) is the motion vector of the block above, and (u_2, v_2) is the motion vector of the block to the left.
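A sketch of the OBMC blend, assuming the three candidate predictions have already been fetched with the current, above/below, and left/right motion vectors, and assuming the integer-weight convention in which the three weights sum to 8 at every pixel (the actual H_k matrices are defined in the standard and are placeholders here):

    # Blend three 8x8 motion-compensated predictions pixel by pixel.

    def obmc_block(pred_cur, pred_vert, pred_horiz, H0, H1, H2):
        out = [[0] * 8 for _ in range(8)]
        for i in range(8):
            for j in range(8):
                acc = (pred_cur[i][j] * H0[i][j] +
                       pred_vert[i][j] * H1[i][j] +
                       pred_horiz[i][j] * H2[i][j])
                out[i][j] = (acc + 4) // 8   # weights sum to 8; round
        return out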

As in the unrestricted motion vector (UMV) mode, this advanced prediction mode allows motion vectors to cross picture boundaries; pixels outside the coded area are obtained by extrapolation, as in the UMV case.

PB-Frame Mode

In the PB-frames mode, a PB-frame consists of two pictures coded as one unit, as shown in the figure below: one P-picture, predicted from the previously decoded P-picture, and one B-picture, predicted from both the previously decoded P-picture and the P-picture currently being decoded.

Figure: PB-frame (P, B, P). The prediction of the B-block requires forward and backward predictions; pixels not predicted bidirectionally are predicted with forward prediction only.

A macroblock in PB-frame mode comprises 12 blocks instead of 6, of which 6 are for the P-picture and the other 6 are for the B-picture. This is different from MPEG-1, as we will discuss later, where B-frames are encoded separately from P-frames and can be predicted from I-frames. Compared with MPEG-1 B-frames, PB-frames do not need separate bidirectional vectors: instead, the forward vectors for the P-picture are scaled and added to a small delta-vector to obtain the vectors for the B-picture. Coding the PB-frame as one unit reduces the decoding delay caused by the bidirectional prediction, which is critical to the bidirectional interactive videophone application, and thus results in less overhead for the B-picture part. For relatively simple sequences at low bit rates, the picture rate can be doubled with this mode without increasing the bit rate much; however, for sequences with heavy motion, PB-frames do not work as well as B-pictures.

D. Advanced Coding Modes in H.263+

After the standardization of H.263 was finished, the continued interest in very low bit-rate video coding made it clear that further enhancements to H.263 were possible beyond the four optional modes (the unrestricted motion vector mode, the syntax-based arithmetic coding mode, the advanced prediction mode, and the PB-frame mode) mentioned above. ITU-T SG15 therefore established the H.263+ effort to meet the need for standardization of such enhancements of H.263. In addition to the previously mentioned modes supported in H.263, H.263+ (H.263 version 2) supports other advanced coding modes, as listed in the table below; for the details, please refer to the standard.

    Coding modes for efficiency or       Advanced Intra Coding Mode
    improved picture quality             Alternate Inter VLC Mode
                                         Modified Quantization Mode
                                         Deblocking Filter Mode
                                         Improved PB-Frame Mode
    Enhancements for error robustness    Slice-Structured Mode
                                         Reference Picture Selection Mode
                                         Independent Segment Decoding Mode
    Other enhancement modes              Reference Picture Resampling
                                         Reduced-Resolution Update Mode

Table: Enhanced coding modes used in H.263+.

Advanced intra coding mode: This mode attempts to improve the efficiency of coding intra MBs in a given frame by using (a) intra-block prediction from neighboring intra blocks, (b) a separate VLC for the intra coefficients, and (c) a modified inverse quantization for intra coefficients.

Alternate inter VLC mode: During the process of inter coding, it is assumed that the residues have significantly less energy than the original blocks. To take advantage of this property, the VLC tables used to code these residue (inter) blocks are different from the tables for the intra blocks.

Modified quantization mode: This mode, which modifies the quantizer operation, has four key features. The first allows the encoder greater flexibility in controlling the quantization step size, and hence greater bit-rate controlling ability. The second allows the encoder to use different quantization parameters for the chrominance and luminance components, so the encoder can use a much finer step size for the chrominance components, thereby reducing chrominance artifacts. The third extends the DCT coefficient range so that any possible true coefficient can be represented. The fourth eliminates coefficient levels that are very unlikely, thereby improving the ability to detect errors and also reducing the coding complexity.

Deblocking filter mode: The deblocking filter is an optional block-edge filter that is applied to I and P pictures inside the coding loop. The filter is applied to 8×8 block edges to reduce blocking artifacts and improve the perceived picture quality.

Improved PB-frame mode: PB-frames constrain the motion vectors of the B-picture to be derived from the motion vectors of the P-picture part of the same frame. Such a scheme performs poorly for large or complex motion, when the corresponding prediction obtained for the B-picture is not good. The improved mode allows distinct forward and backward motion vectors, unlike the basic PB-frame case, where the forward and backward motion vectors are both derived from the motion vector of the P-picture and hence are closely related to each other. This allows a better prediction for the B-pictures, as is the case with the B-frames in MPEG-1, which we will discuss later in this chapter.

Slice-structured mode: This is one of the enhancements in the standard that help improve error resilience. When this mode is turned on, the frames are subdivided into slices instead of the regular GOBs. A slice is a group of consecutive MBs in scanning order; the only constraints are that a slice must start at an MB boundary and that an MB can belong to exactly one slice. Grouping MBs into slices instead of GOBs gives the encoder much flexibility, among other advantages.

Reference picture selection mode: This option allows the selection of any of the recently decoded frames, within a bounded window of the current frame (a larger window when a custom picture clock frequency is used), as the reference frame for generating the prediction of the current frame. This is very different from the baseline case, where only the frame decoded immediately before the current frame may be used as a reference. The capability of selecting different frames as reference is most useful when the decoder can send information back to the encoder: the decoder informs the encoder which frames it received correctly and which frames were corrupted during the transfer, and the encoder can then select as reference only the frames the decoder has acknowledged as received correctly.

Independent segment decoding mode: This mode is another enhancement for improved error resilience. It removes data dependencies across video picture segment boundaries, where a video picture segment may be a slice or a number of consecutive GOBs. When this mode is turned on, the segment boundaries are treated as picture boundaries, so if this option is selected together with the baseline options, each picture segment is decoded independently, as if it were the whole picture; no operation can reference data outside the segment. The option may also be used with options like UMV or advanced prediction, in which case reference is made to data outside the segment boundaries; in such cases, data outside the segment boundaries are derived through extrapolation of the segment, as was done for the entire frame.

Reference picture resampling: This option allows the resampling of the previously decoded reference picture to create a warped picture that can be used for predicting future frames. This is very useful if the current picture has a source format different from that of the previously decoded reference picture. Resampling defines the relation between the current frame and the previously decoded reference frame; in essence, it specifies the alteration in shape, size, and location between the two.

Reduced-resolution update mode: This mode allows the encoder to send the residue (update) information for a coded frame at reduced resolution while keeping the finer detail in the higher-resolution reference image; the final frame can be reconstructed at the higher resolution from these two parts without significant loss of detail. Such an option is very useful when coding a very active, high-motion scene. In this mode, MBs are assigned a 32×32 size, and blocks are correspondingly 16×16, so there are one-quarter the number of MBs per picture as before. All motion vectors are estimated with respect to these new, larger MBs or blocks, depending on whether one or four motion vectors per MB are desired, and the decoder uses the motion vectors corresponding to these large MBs and blocks for motion compensation.

MPEG-1

To meet the special requirements of digital storage media, as stated earlier, additional techniques beyond the H.261 video coder have been incorporated into the MPEG-1 coder. For easy reference, the major differences between H.261 and MPEG-1 are summarized in the table below. MPEG-1 was approved as an ISO standard by late 1992. It is intended for video storage and playback on storage devices such as CD-ROMs, magnetic tapes, and hard drives, at a quality comparable to VHS analog video; accordingly, the maximum coding delay in MPEG-1 is 1 second, enough for the purpose of unidirectional video access and much larger than the maximum delay specified in H.261. The typical bit rate is 1.5 Mbps at a 1x CD-ROM playback speed for the combination of the video, audio, and system bitstreams, allocated approximately as follows:

    video:  about 1.15 Mbps
    audio:  about 0.25 Mbps
    system: about 0.1 Mbps

    H.261                                      MPEG-1
    Sequential access                          Random access
    Only one basic frame rate                  Flexible frame rate
    Only CIF/QCIF formats                      Flexible frame size
    Only I and P frames                        I, P, and B frames
    Full-pel motion estimation accuracy        Half-pel motion estimation accuracy
    Loop filter on motion-compensated          No loop filter
      residuals
    Variable-threshold uniform quantization    Quantization matrix
    No GOP                                     GOP
    GOB structure                              Slice layer

Table: Comparison of H.261 and MPEG-1.

The basic input format is SIF in the progressive (noninterlaced) mode. The YCrCb color space with 4:2:0 sampling (i.e., 2:1 subsampling of the color frames in both the horizontal and vertical directions) is used. The MPEG-1 standard does not specify an encoding process; it only specifies the syntax and semantics of the bit stream and the signal processing in the decoder, as shown in the figure below. Hence many options are left open to the encoders to trade off cost and speed against picture quality and coding efficiency.

Figure: A simplified block diagram of the MPEG-1 video decoder. The demultiplexer separates the bit stream into the quantizer step size, side information, and motion vectors; the coefficient data pass through the dequantizer and the 2-D (8×8) IDCT, are combined with the motion-compensated prediction from the frame memory, and the decoded pictures are reordered for output.

Unlike H.261, many video parameters, such as the picture size and the frame rate, are changeable and can be specified in the MPEG-1 bitstream syntax. However, the maximum frame rate and size are limited to 768 pixels/line, 576 lines/frame, and 30 frames/sec.

    Horizontal picture size         ≤ 768 pels
    Vertical picture size           ≤ 576 lines
    Picture area                    ≤ 396 macroblocks
    Pixel rate                      ≤ 9900 macroblocks/s
    Picture rate                    ≤ 30 Hz
    Motion vector range             −64 to +63.5 pels (using half-pel vectors)
    Input buffer size (VBV mode)    327680 bits
    Bit rate                        ≤ 1.856 Mbits/s (constant)

Table: Summary of the constrained parameters in MPEG-1. Here VBV stands for video buffering verifier.

Different from H.261, the DC components of all the blocks in each frame are grouped together and encoded separately, based on the observation that the DC components usually have statistical characteristics different from those of the rest of the DCT coefficients. In MPEG-1 the set of coding parameters is flexible; however, in order to guarantee interoperability of codecs, a special subset of the parameter space is defined as the Constrained Parameter Bitstream (CPB), representing a reasonable compromise well within the primary target of MPEG-1 and serving as an optimal point for cost-effective VLSI implementation in the technology of the time.

A. MPEG-1 Bit Stream Hierarchy

MPEG-1 is a generic standard, defining the syntax and semantics of the encoded bitstream and implying the decoding process without limiting which algorithms or methods are used for compression. The layered structure of the MPEG-1 bit stream is shown in the figure below: each layer consists of the appropriate header and the following lower layers, in a manner similar to that of the H.261 coder.

Figure: The layered structure of the MPEG-1 bit stream: video sequence, group of pictures, picture, slice, macroblock, and 8×8-pixel block.

The layered structure supports flexibility and efficiency in the coder/decoder: coding processes can be logically distinct, and layers can be decoded systematically. The MPEG-1 bitstream has a hierarchical data structure composed of six layers, as tabulated below, and each layer supports a specific function.

    MPEG-1 Layer              Purpose
    Sequence layer            Random access unit: context.
                              It contains one or more groups of pictures.
    Group of Pictures layer   Random access unit: video coding.
                              It is used for random access into the sequence.
    Picture layer             Primary coding unit.
    Slice layer               Resynchronization unit.
    Macroblock layer          Motion compensation unit.
    Block layer               DCT unit.

Table: Six layers in an MPEG-1 bitstream.

The number of blocks or pels in a layer is defined by the input format identified in the sequence header.

The sequence layer consists of a sequence header, one or more groups of pictures, and an end-of-sequence code. A sequence header, starting with the start code 000001B3 (hex), consists of several entities: the horizontal and vertical size, pel aspect ratio, picture rate, bit rate, etc.

The group-of-pictures (GOP) layer is a set of pictures that are in a continuous display order; it begins with an I or a B picture and ends with an I or a P picture. The smallest GOP is a single I picture, and the largest size is not specified in the standard. The GOP header starts with the start code 000001B8 (hex). A 25-bit time code refers to the first picture in the group as a unit of hours, minutes, and seconds. A closed-GOP code indicates closed prediction (within the group only), as opposed to open prediction, which requires decoded pictures of the previous group for motion compensation.

The picture layer is a primary coding unit that consists of the luminance and two chrominance components. The layer starts with a picture header. Some of the other header entities are the temporal reference (picture number in display order), the picture coding type (I, P, B, and D picture types), and the forward/backward f-codes (the maximum range of the forward/backward motion vectors).

The slice layer is important in the handling of errors: if the bit stream is corrupted by noise, the decoder can skip the corrupted slice and go to the start of the next slice. The number of slices in a picture can range from one to the number of macroblocks in the picture, depending upon the error environment. The principal entities in the slice header are a slice start code, an 8-bit vertical position code locating the slice in the picture, and a quantizer scale index in the range 1 to 31 that can be changed by the next slice or at the macroblock layer.

The macroblock layer is composed of a 16×16 luminance block and the corresponding chrominance blocks, as in the H.261 coder. The header contains information such as MB stuffing, MB type (for I, P, and B pictures), quantizer scale, motion vector, and coded block pattern.

The block layer is composed of 8×8 pels that are transformed by the 2-D DCT. The coded block layer contains the size of the DC coefficient, the DC differences, the AC coefficients, and an end-of-block (EOB) code; the EOB signifies that all DCT coefficients along the zigzag scan beyond the EOB code are zero. An MB is a skipped MB when its MV and all its quantized DCT coefficients are zero.

B. Different Picture Types in MPEG-1

Each upper-level layer contains several lower-level layers. At the Sequence Layer, a video sequence header is inserted to specify the picture width and height, pel aspect ratio, frame rate, bit rate, and buffer size. At the Group of Pictures (GOP) Layer, several frames are grouped together to form a random access unit, as shown in the figure below; in other words, the decoding process must start at the first frame of a GOP. At the Picture Layer, frames are classified into four types.

Intraframe (I): An I picture uses only transform coding. Coded without reference to any other frames, an I frame serves as an access point into the sequence. In an I picture, all the blocks are coded using the DCT, quantization, and VLC. I pictures can be used for predicting P and B pictures. The compression rate for I frames is modest.

Forward-predicted frame (P): A P picture is coded using motion-compensated prediction from a previous I or P picture; this technique is called forward prediction (from I/P to P), as shown in the figure below. The mode is similar to interframe coding in the H.261 coder. P pictures can accumulate coding errors, since they go through the feedback loop, and they can be used for predicting P and B pictures. The compression rate for P frames is higher than that for I frames.

Bidirectionally predicted frame (B): A B picture is coded using a past and/or a future picture as a reference; thus it is called bidirectional prediction.

Figure: Groups of pictures. The motion video is coded as intra frames, forward-predicted frames, and bidirectionally predicted frames linked by motion compensation (priority: intra > forward > bidirectional).

In the case of bidirectionally predicted pictures (B pictures), the prediction is based on a forward (previous I or P) picture, a backward (future I or P) picture, or both motion estimates, as shown in the figure above. The B picture is an example of multihypothesis motion compensation, in which two motion-compensated signals are superimposed to reduce the bit rate of a video codec; a theory of multihypothesis motion-compensated prediction explaining why B pictures work is given in the literature. The overall advantages of B pictures are:

    An uncovered area can be predicted from the next picture; as a result, a very high compression rate can be achieved.
    Better signal-to-noise ratio results from motion compensation based on two pictures separately.
    B pictures are not used for reference by any other pictures; therefore, there is no error propagation.
    The number of B pictures in a GOP is adjustable.

However, B pictures require more frame buffers and cause a larger coding delay.

Now let us explain how to perform bidirectional prediction in more detail. B pictures can be coded using forward, backward, or both motion compensations, and in the MPEG-1 video coding standard, macroblocks within a single B picture can actually be coded differently (for the details about the decision trees for coding macroblocks within I, P, and B pictures, please refer to the literature). The following decisions, organized as a decision tree, need to be made when selecting the coding macroblock types in B pictures.

Figure: Decision tree for coding macroblocks in a B picture: choose intra-coded or inter-coded; for inter coding, choose forward, backward, or interpolated motion compensation; choose coded or not coded (a macroblock may be skipped); and choose whether to change MQUANT.

    Decide whether to use forward, backward, or interpolated motion compensation.
    Decide whether to code the macroblock as an intra-type or as an inter-type macroblock.
    Decide whether the macroblock needs to be coded or not. The entire macroblock can be skipped only if the previous macroblock was an inter-type macroblock and its motion compensation is good enough.
    Decide whether the quantizer scale (MQUANT) needs to be changed or not.

Some of the blocks within a B picture may be skipped if all the quantized DCT values within these blocks are zero; the coded block pattern indicates which of the blocks within a macroblock are coded. As an example, the table below shows a distribution of I, P, and B macroblocks for the pictures of an MPEG-1-coded video sequence. Here, "zero MV" refers to macroblocks that are coded using a zero motion vector. Note that for B pictures there is a considerable number of macroblocks that are coded as predictive (P) macroblocks; this usually occurs when there is a scene change, or when objects present (absent) in the P picture before a B picture disappear (appear) in the P picture following that B picture. The MPEG-1 encoder also allows for skipped blocks within a macroblock.

    Picture                  Macroblock Type
    Type       I      P      B      Zero MV    Skipped
    I         ...    ...    ...       ...        ...
    P         ...    ...    ...       ...        ...
    B         ...    ...    ...       ...        ...

Table: An example of the distribution of different macroblock types in a video sequence (numeric entries omitted). The encoder uses a GOP structure with two B pictures for every P picture.

    Macroblock Type           Predictor                                  Prediction Error
    Intra                     none                                       F(x)
    Forward predicted         F̂(x) = F1(x + mv1)                         F(x) − F̂(x)
    Backward predicted        F̂(x) = F2(x + mv2)                         F(x) − F̂(x)
    Interpolated prediction   F̂(x) = (F1(x + mv1) + F2(x + mv2)) / 2     F(x) − F̂(x)

Table: Prediction modes for a macroblock in a B picture. Here x is the coordinate of the picture element, mv1 is the motion vector relative to the past reference frame F1, and mv2 is the motion vector relative to the future reference frame F2.
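A sketch of these predictors (integer-pel accuracy is assumed for brevity; MPEG-1 also allows half-pel vectors):

    # B-picture predictors: f1/f2 are the past/future reference frames,
    # mv1/mv2 the motion vectors, (y, x) a pel coordinate.

    def predict(mode, f1, f2, mv1, mv2, y, x):
        if mode == "forward":
            return f1[y + mv1[0]][x + mv1[1]]
        if mode == "backward":
            return f2[y + mv2[0]][x + mv2[1]]
        if mode == "interpolated":    # rounded average of the two
            return (f1[y + mv1[0]][x + mv1[1]] +
                    f2[y + mv2[0]][x + mv2[1]] + 1) // 2
        raise ValueError("intra macroblocks use no predictor")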

In the more general case of a bidirectionally coded picture, each 16×16 macroblock can be of type Intra, Forward-predicted, Backward-predicted, or Interpolated. As expressed in the table above, the predictor for a given macroblock depends on the reference frames (past and future) as well as on the motion vectors. Motion compensation (MC) is based on the previous and the next I or P pictures. The prediction mode for a macroblock in a B picture can be one of the following, depending on which mode produces the smallest number of bits: intra (no MC); forward-predicted (MC based on the previous I/P picture); backward-predicted (MC based on the next I/P picture); or interpolated prediction (MC based on both the previous and the next I/P pictures). The motion information consists of one vector for forward-predicted and backward-predicted macroblocks, and of two vectors for bidirectionally predicted macroblocks. The motion information associated with each 16×16 block is coded differentially with respect to the motion information present in the previous adjacent block. The range of the differential motion vector can be selected on a picture-by-picture basis to match the spatial resolution, the temporal resolution, and the nature of the motion in a particular sequence; the maximal allowable range has been chosen large enough to accommodate even the most demanding situations. The differential motion information is further coded by means of a variable-length code to provide greater efficiency, taking advantage of the strong spatial correlation of the motion vector field: the differential motion vector is likely to be very small, except at object boundaries. More accurate motion vectors are required in the coder when motion interpolation is introduced.

A D picture is a special case of intra coding in which only the DC coefficient of each block is coded. D pictures provide a simple and fast forward mode, but yield limited image quality.

In view of the nature of B frames, frame reordering is required for encoding/decoding:

    Display order:  I1 B2 B3 P4 B5 B6 P7 B8 B9 I10   (frame type and number)
    Coding order:   I1 P4 B2 B3 P7 B5 B6 I10 B8 B9

At the Slice Layer, each slice is formed from several macroblocks and is used mainly for error recovery. At the Macroblock Layer, a macroblock serves as the basic compression unit, similar to the case of H.261. The basic DCT unit is an 8×8 block at the Block Layer.
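The reordering rule can be stated compactly: each reference picture (I or P) is emitted before the B pictures that precede it in display order. A sketch:

    # Display-to-coding order reordering for an MPEG-1 GOP pattern.

    def coding_order(display):
        """display: list of (frame_no, 'I'|'P'|'B') in display order."""
        out, pending_b = [], []
        for frame in display:
            if frame[1] in ("I", "P"):   # reference picture first
                out.append(frame)
                out.extend(pending_b)    # then the B pictures it anchors
                pending_b = []
            else:
                pending_b.append(frame)
        return out + pending_b

    # For the ten-frame pattern above this yields
    # I1 P4 B2 B3 P7 B5 B6 I10 B8 B9.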

MPEG-2 (H.262) and HDTV

The compression schemes discussed previously serve specific applications at corresponding coding efficiencies. ITU-T H.261 was designed for the coding and transmission of slow-moving videophone and video-conferencing signals, and MPEG-1 is aimed at systems working with roughly 1.5 Mbps digital storage media (DSM) and low-resolution SIF displays. Consequently, a full-motion video coding standard, MPEG-2, was developed to meet a broader set of requirements. For instance, it supports digital video transmission in the range of about 2 to 20 Mbps, including applications for digital storage media and HDTV; computer graphics, multimedia, and video games are also included as new application areas. Hence MPEG-2 provides a generic solution for video/audio coding, storage, and/or transmission worldwide. The standard is flexible enough to allow both high-performance/high-complexity and low-performance/low-complexity codec systems: the generic MPEG-2 coding standard was designed to meet a wide spectrum of bit rates, resolutions (both spatial and temporal), quality levels, and services.

Unlike MPEG-1 or H.261, MPEG-2 supports interlaced video input, whose images are scanned as even and odd fields to form frames. Therefore, there are two new picture types for interlaced video, in addition to the picture types of the progressive video mode:

Frame pictures are obtained by interleaving the lines of an odd field and its corresponding even field.

Field pictures are formed from a field of pixels alone.

All these pictures can be I, P, or B frames, as in the case of progressive video. The differences of MPEG-2 from MPEG-1 are summarized in the table below.

    Layer        How MPEG-2 differs from MPEG-1
    Sequence     More aspect ratios; larger allowable frame sizes and numbers of macroblocks;
                 indication of source video types, color primaries, etc.
    Picture      User-selectable DC precision; concealment of motion vectors for I-pictures to
                 increase robustness; nonlinear macroblock quantization factor; signaling of
                 source composite video characteristics.
    Macroblock   No more macroblock stuffing.

Table: Differences of MPEG-2 from MPEG-1 in the hierarchical layered structure.

MPEG-2 is designed to be the extension of MPEG-1 that accommodates different visual quality requirements at various bit rates and resolutions. Compared to MPEG-1, some of the prominent features of MPEG-2 are compatibility and scalability, as listed in the table below. Because MPEG-2 makes high-quality, high-resolution TV applications feasible, the HDTV protocol adopts MPEG-2.

    Feature                MPEG-1                        MPEG-2
    Video format           SIF, progressive              SIF and larger, progressive/interlaced
    Bit rate               Variable, about 1.5 Mbps      Variable, up to tens of Mbps
    Low-delay mode         Not provided                  Provided (no B pictures)
    Scalability            Simulcast only                SNR, spatial, temporal, data partitioning
    Transmission error     Error protection              Error resilience
    DCT                    Noninterlaced                 Field (progressive) or frame (interlaced)
    Motion estimation      Noninterlaced                 Field, frame, and dual-prime based
    Motion vectors         MVs for P, B pictures only    Concealment MVs for I pictures, besides
                                                         MVs for P, B
    Scanning of DCT        Zigzag scan                   Zigzag scan; alternate scan for
    coefficients                                         interlaced video

Table: Functional comparison between MPEG-1 and MPEG-2 video.

A. MPEG-2 Profiles

In order to fit one standard to a variety of applications without causing unreasonable implementation difficulties, MPEG-2 uses the concept of a profile, which is a subset of the full possible range of algorithmic tools (called the limit syntax in MPEG-2 terms) for a particular application. There are five profiles with a hierarchical relationship. Within each profile, a number of levels are defined to limit the range of parameter values to what is reasonable to implement and practically useful (called the limit parameters). The syntax supported by a higher profile includes all the syntactic elements of the lower profiles; in other words, for a given level, a Main profile decoder should be able to decode a bitstream conforming to the Simple profile restrictions. For a given profile, the same syntax set is supported regardless of level.

Simple Profile: It does not allow the use of B frames or scalable coding. The maximum bit rate is 15 Mbps. This profile is intended for videotape recording.

Main Profile: No scalability is allowed. The intended use is the studio TV application, and it is expected that the vast majority of MPEG-2 users will use this profile.

SNR Scalable Profile: It is the same as the Main Profile, but with SNR scalability added, allowing two layers of coding (the lower layer and the enhancement layer) that use different quantizer step sizes for the DCT coefficients. The combined layers can create a sharper image than that obtainable from one layer alone.

Spatially Scalable Profile: It allows the decoder to choose different resolutions by employing a pyramidal coding approach. This profile supports only the High-1440 level and is intended for consumer HDTV.

High Profile: It is basically a scalable profile with either 4:2:0 or 4:2:2 macroblocks (i.e., chrominance subsampling in the horizontal direction but not in the vertical direction), designed for film production (the Society of Motion Picture and Television Engineers, SMPTE, standard).

There are four possible levels: Low Level, Main Level, High-1440 Level, and High Level. The allowable combinations of levels and profiles, and the corresponding maximum sampling densities, are shown in the table below; a scalable MPEG-2 video stream can be broken into different layers: the base or lower layers (the high-priority bitstream) and the enhancement layers (the low-priority bitstream).

    Level        Maximum sampling density     Allowed profiles
                 (pixels/line × lines/frame
                  × frames/s)
    High         1920 × 1152 × 60             Main, High
    High-1440    1440 × 1152 × 60             Main, Spatially Scalable, High
    Main         720 × 576 × 30               Simple, Main, SNR Scalable, High
    Low          352 × 288 × 30               Main, SNR Scalable

Table: Maximum sampling density in different combinations of levels and profiles; x × y × t means x pixels/line, y lines/frame, and t frames/sec. For the scalable profiles, each allowed combination is split into an enhancement layer and a lower layer.

B. Scalable Coding Techniques

Scalable video coding is useful for a number of applications in which video needs to be decoded and displayed at a variety of resolutions (temporal or spatial) and quality levels, for example, multipoint video conferencing, windowed display on workstations, video communications over asynchronous transfer mode (ATM) networks, and HDTV with embedded standard TV. The easiest way to cope with these demands is the simulcast technique, which is based on transmitting several differently encoded versions of a video sequence simultaneously; in this case, channel bandwidth is not used efficiently, since the bandwidth must be shared among the various resolutions.

An efficient alternative to simulcast is scalable video coding, in which the bandwidth allocated to a given scale can be partially reused for coding other scales; multiple use of the bandwidth is a defining condition for scalable video coders. In practice, the enhancement layer is coded by utilizing information from the base (lower) layer encoder. In MPEG-2, if there are several scales (layers), the lowest layer is called the base or lower layer, and the others are called enhancement layers. Scalable video coding can be achieved in four forms: spatial, temporal, SNR, and data partitioning.

Spatial scalability: It deals with different spatial resolutions, which can be varied by decimation and interpolation techniques. The enhancement layer is predicted not only from previously decoded pictures (temporal prediction) but also from decoded and upsampled pictures of the lower layer.

Temporal scalability: It deals with different frame rates (temporal resolution). It involves at least two layers; both the lower and the enhancement layers process pictures of the same spatial resolution, but at different temporal resolutions. The enhancement layer restores the full temporal picture rate.

SNR scalability: It provides two or more video layers of the same spatial resolution but at different qualities. The enhancement layers contain only coded refinement data for the DCT coefficients of the base layer.

Data partitioning: It splits a video bit stream into two or more layers called partitions. A priority breakpoint in the slice-layer header indicates which syntax elements are placed in partition 0 or partition 1. Partition 0 is called the base partition or high-priority partition, and partition 1 is called the low-priority partition. For the DCT coefficients, the lower the frequency of a coefficient, the more important it is to picture quality, and the higher its priority.
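The mechanics of data partitioning are simple to sketch (the breakpoint value and the event format here are simplified placeholders, not the exact slice syntax):

    # Split one block's zigzag-ordered DCT data at the priority
    # breakpoint: low-frequency coefficients go to the high-priority
    # partition 0, the rest to the low-priority partition 1.

    def partition_coefficients(zigzag_coeffs, breakpoint):
        part0 = zigzag_coeffs[:breakpoint]   # high priority
        part1 = zigzag_coeffs[breakpoint:]   # low priority
        return part0, part1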

MPEG-4

MPEG-4 was historically supposed to be low bit-rate coding, that is, audio and video coding at data rates below 64 kbit/s. For video coding, the goal was to develop algorithms that outperform the then state-of-the-art coding standard, H.263, by a factor of ten in terms of compression. Later, MPEG shifted its focus away from this very ambitious low bit-rate goal, and its activities since then have resulted in extended applications, tools, algorithms, profiles, and bit rates for arbitrarily shaped audiovisual natural and synthetic objects. The objective is to develop a flexible and extensible coding standard that facilitates the user's ability to achieve various forms of interactivity and to mix synthetic and natural audiovisual information in a seamless way. MPEG-4 has six parts, as shown in the table below, and its first version became an international standard in the spring of 1999.

    ISO Number   MPEG-4 Part Name
    14496-1      Systems
    14496-2      Visual
    14496-3      Audio
    14496-4      Conformance Testing
    14496-5      Technical Report
    14496-6      DSM-CC Multimedia Integration Framework

Table: Six parts of the MPEG-4 standard. Here DSM-CC stands for Digital Storage Media Command and Control.

As for the visual part, MPEG-4 focuses on:

A set of coding tools for audiovisual objects: These objects can be video objects of arbitrary shape, audio objects, or combined audiovisual natural and synthetic objects. The new functionalities in MPEG-4 are outlined in the table at the end of this chapter.

A syntactic language, the MPEG-4 Syntactic Description Language (MSDL), to describe both the coding tools and the coded objects: MSDL is used not only for the description of the bit-stream structure but also for the configuration and programming of the decoder; it is a flexible and extensible description language that allows the selection, description, and downloading of tools, algorithms, and profiles.

We will discuss MPEG-4 in more detail in the next chapter.

    Functionality    Description
    Interactive      The user should be able to influence the presentation of audiovisual content.
    Content-based    An object-based data representation should allow content-based access to
                     multimedia data.
    Universal        Access to MPEG-4 data and communications should be possible using any
    accessibility    communications network.
    Flexible         MPEG-4 data streams should be scalable, so that they can be processed by
                     receivers with different levels of computational power.
    Extensible       The transmitter should be able to configure the receiver in order to download
                     new applications and algorithms.

Table: Summary of the functionalities that MPEG-4 supports.

Chapter

MPEG-4 and Content-based Video Coding

Prior to MPEG-4, the MPEG group of ISO (the International Standards Organization) and the ITU-T (International Telecommunication Union) had developed several video compression standards:

MPEG-1 was developed for use in the CD-ROM and PC industries and has a target bit rate of 1.5 Mbps.

MPEG-2 was developed for use in the home entertainment market, such as HDTV (High-Definition TV), with target bit rates from several Mbit/s up to tens of Mbit/s.

H.261 and H.263 were developed for the application of full-duplex video conferencing over ISDN and POTS transmission lines, respectively.

In those cases, a fixed set of techniques is included in the standards, targeting a limited set of applications. The transmission channels associated with each application are well known and considered to be very reliable, i.e., the probability of a residual bit error corrupting the video data is extremely low, and this a priori knowledge of the transmission channels was utilized during the design and development of the algorithms. In the case of MPEG-2, additional profiles were later added to the standard to allow its use in different applications.

Anticipating the rapid convergence of the telecommunications, computer, and TV/film industries, the MPEG group officially initiated a new MPEG-4 standardization phase in 1994, with the mandate to standardize algorithms for audiovisual coding in multimedia applications, allowing for interactivity, high compression, and/or universal accessibility and portability of audio and video content. The first version of MPEG-4 became an international standard in the spring of 1999. In this chapter, we first briefly overview the new standard and then discuss how to extend our DCT-domain motion estimation/compensation to MPEG-4 applications.

    Functionality    Description
    Interactive      The user should be able to influence the presentation of audiovisual content.
    Content-based    An object-based data representation should allow content-based access to
                     multimedia data.
    Universal        Access to MPEG-4 data and communications should be possible using any
    accessibility    communications network.
    Flexible         MPEG-4 data streams should be scalable, so that they can be processed by
                     receivers with different levels of computational power.
    Extensible       The transmitter should be able to configure the receiver in order to download
                     new applications and algorithms.

Table: Summary of the functionalities that MPEG-4 supports.

Overview of the MPEG-4 Standard

MPEG-4 was historically supposed to be low bit-rate coding, that is, audio and video coding at data rates below 64 kbit/s. For video coding, the goal was to develop algorithms that outperform the then state-of-the-art coding standard, H.263, by a factor of ten in terms of compression. Later, MPEG shifted the focus of this very ambitious low bit-rate coding effort towards new functionalities, as outlined in the table above. Let us explain in more detail some of the new functionalities introduced in MPEG-4 that are not covered by the previous multimedia standards.

Universal accessibility is the ability to access audiovisual data over a wide variety of storage and transmission channels. It covers robustness in error-prone environments and content-based scalability. Truly supporting this functionality implies that a user can access video information over different kinds of transmission channels, wired or wireless. Obviously, these channels will not have the same error characteristics or bandwidth; therefore, the error resilience and scalability tools discussed later in this chapter are extremely important when attempting to support this universal accessibility.

Object-based interactivity is a functionality that provides the user with the ability to interact with the objects of an audiovisual scene in a meaningful way. It covers content-based manipulation and bit-stream editing, content-based multimedia data access tools, hybrid natural and synthetic data coding, and improved temporal access. Supporting this functionality implies that the video scene is coded such that a particular video object is distinguishable from the other objects of the scene. By utilizing tools such as object-based scalability, shape coding, and sprite coding, in combination with the shape-adaptive DCT, MPEG-4 is able to support this object-based interactivity.

MPEG-4 Architecture

A general MPEG-4 video coding system is depicted in the figure below. At the encoder, the video objects and their spatio-temporal relationships needed by the decoder are encoded into bit streams. These bit streams, after optional error protection, are multiplexed with stored objects and then transmitted downstream to the decoder. The bit streams can be transmitted across multiple channels, where each channel offers a different quality of service; this permits different objects to be reconstructed at the decoder at different qualities. The multiplexer in the MPEG-4 system combines the elementary data streams into one output data stream. The multiplexer also provides the functions needed to recover the system clock, synchronize multiple streams, interleave multiple streams used by the compositor at the decoder side, etc.

At the decoder, the compositor uses the spatio-temporal relationships and user interactions to render the scene. The decoder can use the interaction information locally, or it can transmit it upstream to the encoder so that the encoder can generate the scene as desired by the user (note that support for decoder-encoder interactivity is not explicit in the MPEG-1 and MPEG-2 coding standards). Before video objects are transmitted, the source coder and decoder exchange configuration information. This allows the source to determine which classes of algorithms, tools, and other objects are needed by the decoder to process the video objects; the definitions of any missing classes are then downloaded to the MPEG-4 decoder. This is distinct from the hardwired "push" model of MPEG-1 and MPEG-2, where the model and capabilities of the decoder are assumed a priori by the encoder.

To envision the MPEG-4 object-based interactivity, let us take a look at a simple example, as shown in the figure below. In this example, an image scene contains a number of video objects, A and B. The aim is to encode the sequence in such a way that the objects can be decoded and reconstructed separately and the original scene can be manipulated by simple operations on the bit stream. The bit stream will be object layered (A and B object layers): the shape and the spatial coordinates, as well as other additional parameters (e.g., object scaling, rotation, or related parameters), are described in the bit stream of each object layer. The receiver can reconstruct the entire original sequence by decoding all object layers and displaying the objects at their original sizes and locations. Beyond such object-based video decoding, it is also possible to manipulate the video scene with simple operations: for example, a new object C from a local image library can be added and mixed into the original video scene, and the scene can be rearranged by rotating object A. Since the bit stream of the sequence is organized in object-layered form, the manipulation is performed at the bit-stream level, without the need for further transcoding.

Figure: Schematic overview of an MPEG-4 video coding system. Video objects A and B are encoded, multiplexed with stored local objects (e.g., C), sent over the channel, demultiplexed and decoded, and finally composed for display under the control of the user interface.

Unlike the previous MPEG-1 and MPEG-2 approaches, the MPEG-4 standard supports a rich set of data types: natural and synthetic 2-D/3-D audiovisual objects, and a syntax for describing complete animated scenes. Furthermore, MPEG-4 images, as well as image sequences, are in general considered to be arbitrarily shaped, in contrast to the standard MPEG-1 and MPEG-2 rectangular definitions. Because it does not always make sense to specify a rigid standard addressing just one application, the MPEG-4 standard concentrates on supporting those functionalities common to clusters of applications in the computer, telecommunication, and entertainment (i.e., TV and/or film) industries. Basically, MPEG-4 is a new coding standard intended to provide a flexible framework and an open set of coding tools for the communication, access, and manipulation of digital audiovisual data. Through the flexible framework of MPEG-4, various combinations of these tools and their corresponding functionalities will be utilized to support particular applications required by those industries. The MPEG-4 Syntactic Description Language (MSDL) is designed to glue those functionalities together.

Although the MPEG-4 standard includes video, audio, graphics, synthetic and natural hybrid coding (SNHC), and systems, we will only discuss the visual portion of MPEG-4 in this chapter, which provides the core technologies allowing efficient storage, transmission, and manipulation of video data in multimedia environments. During the process of developing the MPEG-4 video standard, the expert group focused on the development of Video Verification Models (VMs). A VM is a common platform with a precise definition of encoding and decoding algorithms that can be presented as tools addressing specific functionalities; new algorithms/tools are added to the VM, and old algorithms/tools are replaced in the VM, by successful core experiments. As we have mentioned previously, the MPEG-4 video coding standard focuses on providing solutions, in the form of tools and algorithms, that enable common functionalities such as efficient compression, object scalability, spatial and temporal scalability, and error resilience. Each VM addresses an increasing number of desired functionalities, such as the following.

Efficient compression: For most applications involving digital video, such as video conferencing, Internet video games, or digital TV, coding efficiency is essential. Therefore, many different video coding algorithms have been proposed to reduce the bandwidth required for transmission and storage of video information; MPEG-4 evaluated a large number of those methods intended to improve the coding efficiency of existing standards. The target of MPEG-4 is to provide flexible multimedia communications within a range spanning from a few kbit/s up to a few Mbit/s.

Shape and alpha map coding: The shape of a 2D object is described by alpha maps. Multilevel alpha maps are frequently used to blend different layers of image sequences for the final film. Other applications that benefit from associating binary alpha maps with images are content-based image representations for image databases, interactive games, surveillance, and animation.

Arbitrarily shaped region texture coding: Coding of texture for arbitrarily shaped regions is required for achieving an efficient texture representation for arbitrarily shaped objects. Hence, these algorithms are used for objects whose shape is described with an alpha map.

Error resilience: Error resilience addresses the problem of accessing video information over a wide range of storage and transmission media. In particular, due to the rapid growth of mobile communications, it is extremely important that access to audio and video information be available via wireless networks. This implies a need for the useful operation of audio and video compression algorithms in error-prone environments at low bit rates (i.e., less than 64 kbit/s). The MPEG-4 Video Group evaluated tools for video compression which address both the band-limited nature and the error resilience aspects of the problem of providing access over wireless networks.

Multifunctional coding tools and algorithms: Multifunctional coding aims to provide tools to support a number of content-based and other functionalities. For instance, for Internet and database applications, object-based spatial and temporal scalability are provided for content-based access. Likewise, for mobile multimedia applications, spatial and temporal scalability are essential for channel bandwidth scaling for robust delivery. Multifunctional coding also addresses multiview and stereoscopic applications, as well as representations that enable simultaneous coding and tracking of objects for surveillance and other applications. Besides the aforementioned applications, a number of tools were developed for segmentation of a video scene into objects and for coding noise suppression.

MPEG-4 Video Coding

The motion and texture coding techniques in MPEG-4 are direct extensions of those used in traditional video coding. Thus, block matching and the Discrete Cosine Transform (DCT) are still the basic techniques. This ensures that MPEG-4 video coding is as efficient as traditional video coding for traditional rectangular frames of image sequences, while providing object-based functionalities for new applications. Since an important feature of MPEG-4 is its flexibility in configuring various tools for a given application, additional motion and texture coding tools that differ considerably from those in traditional video coding are included to further improve coding efficiency.

In MPEG-4, each video frame is segmented into a number of arbitrarily shaped image regions called video object planes (VOPs). The word segmentation has a meaning that depends to a large extent on the application and the context in which it is used. The basic goal of any segmentation algorithm is to define a partition of the space. In the context of image and video, the space can be temporal (one-dimensional, 1D), spatial (2D), or spatiotemporal (3D).

Segmentation can be an extremely easy task if one has access to the production process that created the discontinuities. For example, the generation of a synthetic image or of a synthetic video implies the modeling of the 3D world and of its temporal evolution. During the creation itself, it is very easy to recover and store the 3D boundaries of the various objects. Another example is video editing, which creates a large number of discontinuities either in space or in time. Spatial discontinuities are created by combining foreground objects that have been filmed over a blue screen with a background sequence that has been shot independently. Temporal transitions are produced by cutting and concatenating rushes. In both cases, discontinuity detection is trivial if one has access to the information at this level of production.

Segmentation can also be an extremely difficult task if the segmentation intends to estimate what has been done during the production or online process. We have to recognize that the state of the art still has to improve before robust segmentation algorithms able to deal with generic images and video sequences are available (see the references for an overview).

In this chapter, we assume that the video source either already exists in terms of separate entities (i.e., is generated with chroma-key technology) or is generated by means of online or offline segmentation algorithms. Notice that the process of segmentation is outside the scope of the MPEG-4 standard. Similar to MPEG-1 and MPEG-2, MPEG-4 specifies only the minimum set of functions that are needed for interoperability. Successive VOPs belonging to the same physical object in a scene are referred to as a video object (VO). A VO in MPEG-4 is equivalent to a GOP (group of pictures) in the MPEG-1 and MPEG-2 standards. The shape, motion, and texture information of the VOPs belonging to the same VO is encoded into a separate video object layer (VOL). This information is then multiplexed into a VOL bit stream, as shown in Fig., in the order of the coded shape information followed by the motion and texture coded data. Here motion vectors and DCT

Figure: MPEG-4 video coding. Based on the VOP shape information, each VOP in a VO is separated by the VOP definition block; the video objects are encoded separately, multiplexed into the bit stream, and demultiplexed, decoded, and composited at the receiver.

coefficients can be coded either jointly, as in H.263, or separately. In addition, relevant information needed to identify each of the VOLs, and how the various VOLs are composed, is also encoded. This allows for selective decoding of VOPs and also provides object-level scalability at the decoder.


Figure: Envisioning the concept of VOPs using the News test sequence as an example: (a) one frame taken from the original scene before segmentation; (b) the background VOP; (c)-(e) the foreground VOPs; (f) the binary alpha plane of a foreground VOP.


Overview of MPEG-4 Video Coding

The notion of VOPs and their use in video coding in MPEG-4 is illustrated in Fig. Here we use the actual MPEG-4 video test sequence News in CIF format (352x288 frame size) as an example for illustration. This sequence belongs to the class of medium spatial detail and low amount of movement. We can code the video sequence in two ways:

1. The entire frame comprising the background and foreground can be classified as a single VOP. Then the VOP coding becomes a straightforward application of MPEG-1 and MPEG-2 coding techniques.

2. Alternatively, by applying segmentation, we can decompose the scene into four VOPs: say, VOP0 for the background object in Fig. (b), and VOP1 in Fig. (c), VOP2 in Fig. (d), as well as VOP3 in Fig. (e), for the foreground objects. A binary alpha plane, as depicted in Fig. (f), is coded in this example to indicate to the decoder the shape of a foreground VOP and its location with respect to the background VOP. The shape information is hereafter also referred to as the alpha plane. In general, MPEG-4 may support the coding of grayscale alpha planes to allow the decoder to compose the VOPs with various levels of transparency.

We can encode the VOPs using different coding schemes, either non-overlap or overlap coding. To envision those coding schemes, we consider a simple example, coding the background VOP and a foreground VOP as shown in Fig. (a) and (b), respectively. Note that the two regions covered by these VOPs are non-overlapping. Furthermore, the union of the pixels covered by the two VOPs is identical to the image sequence, as shown in Fig. (c). Since each VOP is coded separately, based on the decoded information from the alpha channel, for non-overlap coding the decoder can either decode and display each VOP separately or reconstruct the entire original sequence by decoding and compositing both VOPs. Besides non-overlap coding, MPEG-4 also supports the overlapping configuration for VOPs. For instance, if the entire background frame as shown in Fig. (d) is known a priori at the encoder, the foreground VOP can then be as shown in Fig. (b). Since the background is stationary, only one frame needs to be coded for the background. Thus, the foreground and the background can have different display rates at the decoder, which is called temporal scalability in MPEG-4. In principle, we can select either the non-overlap or the overlap coding scheme based on the character of the input image sequences.

The VOP coding process for this example is summarized in Fig.


Figure: Different coding schemes (non-overlapping vs. overlapping coding): (a) non-overlapping background VOP; (b) foreground VOP; (c) scene after non-overlapping coding, combining the two VOPs; (d) the background VOP, in this case a stationary rectangular image coded only once.


Because MPEG-4 supports content-based scalability, the compositor at the decoder side can either choose to decode only certain VOPs of interest, or even edit the scene by deleting VOPs from the original scene and adding new VOPs from a local database, as shown in Fig. Let us take the News video coding again as an example to illustrate content-based scalability. Suppose the compositor at the decoder decodes only the VOP in Fig. (d) as the foreground and the VOP in Fig. (b) as the background. As a result, the final reconstructed video scene is as shown in Fig. (c), instead of the complete scene shown in Fig. (a).

For each VO, the shape, motion, and texture information of the VOPs comprising the VO is coded. Having introduced the overall video coding schemes, we will now focus on how to code this information for an individual VOP. Object-based temporal scalability and spatial scalability can be achieved by means of VOLs, which correspond to either the base layer or the enhancement layers of a VOP. One important feature of MPEG-4 video coding is its ability to code an arbitrarily shaped VO: special care has to be taken for motion estimation and compensation (ME/MC), as well as for the DCT of the boundary blocks of an arbitrarily shaped VOP. In addition, MPEG-4 supports three more advanced techniques, namely unrestricted motion vectors, advanced prediction, and bidirectional ME/MC. There are many ways in which the shape, motion, and texture information can be coded. We will restrict our discussion to the baseline scheme as adopted by MPEG-4.

Arbitrarily Shaped Region Texture Coding

The intra VOPs and the residual errors after motion-compensated prediction are coded using the DCT on 8x8 blocks, in a manner similar to that employed in MPEG-1, MPEG-2, H.261, and H.263. After computing the DCT, zigzag scanning and quantization are applied, the same as in the previous standards. Here two scalar quantization methods, namely the H.263 and MPEG quantizations, are used. In addition, variable length coding (VLC) of the DC and AC coefficients is applied for entropy coding. Moreover, MPEG-4 supports the texture coding of arbitrarily shaped VOPs. Macroblocks can be classified as either standard or contour macroblocks; the transparent blocks are skipped and not coded.

For a standard macroblock, where all of its pixels are inside the active VOP area as shown in Fig., techniques identical to those described in MPEG-1 and MPEG-2 can be used. The macroblocks that do not belong to the arbitrary shape but lie inside the bounding box of a VOP, as shown in Fig., are not coded at all. For each macroblock there can be four 8x8 luminance blocks and two 8x8 chrominance blocks. As in the motion-estimation step, blocks well within the active VOP area can be coded in a straightforward manner.


For a contour macroblock, some of its pixels may be outside the active VOP area (see Fig.). The 8x8 blocks belonging to macroblocks on the border of the VOP shape may be coded by two different techniques, namely low-pass extrapolation (LPE) padding and the shape-adaptive DCT (SA-DCT). The SA-DCT is more complex but has a higher coding efficiency for the boundary blocks. For the coding of motion-compensated prediction-error blocks (in P-VOPs) that straddle the VOP boundary, pixels outside the active area are set to a value of zero prior to DCT coding.

Motion Estimation and Compensation

Temporal redundancies between the video content in separate VOPs within a VO are exploited using block-based motion estimation and compensation. In general, these techniques can be viewed as extensions of the standard block-matching techniques used in MPEG-1, MPEG-2, H.261, and H.263 to image sequences of arbitrary shape. To perform block-based motion estimation and compensation between VOPs of varying location, size, and shape, a shape-adaptive macroblock approach, shown in Fig., is used. The reference window is the original image's border.

Figure: Macroblock grid for coding a VOP. The VOP window is shifted within the reference window; macroblocks can be classified as either standard or contour macroblocks.


A shift parameter is coded to indicate the location of the VOP with respect to the borders of the reference window. A VOP window surrounding the foreground video object is restricted to be a multiple of 16 pixels in both the horizontal and vertical directions. Furthermore, it is positioned such that it contains the minimum number of 16x16 blocks of pixels which are not transparent. Pixels outside the bounding box are treated as transparent pixels.

As with arbitrarily shaped region texture coding, any of the motion estimation and compensation techniques of MPEG-1 and MPEG-2 can be used for a standard macroblock. However, the motion estimation of a contour macroblock has to be modified from block matching to polygon matching. Furthermore, a special padding technique, i.e., macroblock-based repetitive padding, is required for the reference VOP, as shown in Fig. The details of these techniques are described as follows.

A. Macroblock-based Padding of the VOP

The macroblock-based padding process allows the decoder to pad a macroblock as soon as it is reconstructed, as depicted in Fig. The padded VOP is then used for motion compensation. At the encoder, a reference VOP is padded in a similar manner for motion estimation prior to motion compensation.

Figure: Macroblock-based padding in the MPEG-4 decoder (inverse quantization, inverse DCT, motion compensation, macroblock-based padding, and frame memory), corresponding to the video object decoder shown in Fig. For this illustration we use a simplified view of the MPEG-4 decoder, which does not include the shape decoding, VOP compositor, etc.

The padding process is as follows. The frame memory (see Fig.) is first initialized with the value 128 for the luminance and chrominance components. Then the contour blocks are padded using the repetitive padding described next. To cope with VOPs with large motions, the padding is further extended to blocks which are completely outside the VOP but immediately next to boundary blocks, as shown

in Fig. These blocks are padded by replicating the samples of the padded adjacent boundary blocks, as also shown in Fig.

Figure: Extended padding for VOPs with large motions; the border samples of a padded adjacent boundary block are repeated outward.

If a block is next to two or more boundary blocks, the block is padded by replicating the samples at the border of one of those boundary blocks, determined according to the following convention:

Number the boundary blocks around a target block in the order bottom, top, right, left (say, 0, 1, 2, 3). The target block is then padded by replicating the samples at the border of the boundary block with the largest number. Fig. shows an example of a VOP after extended padding. Note that the padded area covers the region outside the tightest bounding blocks.

Figure: A VOP after repetitive padding and extended padding.


B. Repetitive Padding Technique

Step 1. Consider each undefined pixel outside the object boundary a zero pixel.

Step 2. Scan each horizontal line of a block (a block can be 16x16 or 8x8 pels). Each scan line is possibly composed of two kinds of line segments: (a) zero segments, consisting entirely of zero pixels, and (b) nonzero segments, consisting entirely of nonzero pixels. If there are no nonzero segments, do nothing. Otherwise, there are two situations for a particular zero segment: (a) it is positioned between an end point of the scan line and the end point of a nonzero segment — then fill all of the pixels in the zero segment with the pixel value of the end point of the nonzero segment; (b) it is positioned between the end points of two different nonzero segments — then fill all of the pixels in the zero segment with the average pixel value of the two end points.

Step 3. Scan each vertical line of the block and perform the identical procedure described in Step 2 on each vertical line.

Step 4. If a zero pixel can be filled by both Steps 2 and 3, the final value takes the average of the two possible values.

Step 5. For each remaining zero pixel: scan it horizontally to find the closest nonzero pixel on the same horizontal scan line (if there is a tie, the nonzero pixel to the left of the current pixel is selected); scan it vertically to find the closest nonzero pixel on the same vertical scan line (if there is a tie, the nonzero pixel above the current pixel is selected). Replace the zero pixel by the average of these two (horizontally and vertically) closest nonzero pixels.

Table: Macroblock-based repetitive padding procedure.

The macroblock-based repetitive padding process, illustrated in Fig., consists of the five steps listed in Table. As an example, the background VOP of the News sequence after macroblock-based repetitive padding and extended padding is shown in Fig. (a) and Fig. (b), respectively.
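To make the procedure concrete, the following sketch implements Steps 1-4 of the table in Python/NumPy (Step 5, the exterior-pixel padding, is omitted for brevity). The function names and array conventions are ours, not part of the standard:

```python
import numpy as np

def repetitive_pad(block, alpha):
    """Illustrative sketch of Steps 1-4 of the repetitive padding table.
    `block` holds pixel values; `alpha` is the binary shape mask
    (nonzero = object pixel). Pixels outside the object are the "zero"
    pixels to be filled from the object border."""
    def pad_line(vals, mask):
        # Fill zero segments from the end points of adjacent nonzero segments.
        out = vals.astype(float).copy()
        idx = np.flatnonzero(mask)
        if idx.size == 0:
            return out, False               # no nonzero segment: do nothing
        out[: idx[0]] = vals[idx[0]]        # zero segment at the line start
        out[idx[-1] + 1 :] = vals[idx[-1]]  # zero segment at the line end
        for a, b in zip(idx[:-1], idx[1:]): # gaps between nonzero segments
            out[a + 1 : b] = (float(vals[a]) + float(vals[b])) / 2.0
        return out, True

    n_rows, n_cols = block.shape
    h = np.zeros_like(block, dtype=float); h_ok = np.zeros(n_rows, bool)
    v = np.zeros_like(block, dtype=float); v_ok = np.zeros(n_cols, bool)
    for i in range(n_rows):                 # Step 2: horizontal scans
        h[i], h_ok[i] = pad_line(block[i], alpha[i] != 0)
    for j in range(n_cols):                 # Step 3: vertical scans
        v[:, j], v_ok[j] = pad_line(block[:, j], alpha[:, j] != 0)

    out = block.astype(float).copy()
    outside = alpha == 0
    hd = outside & h_ok[:, None]            # filled by the horizontal pass
    vd = outside & v_ok[None, :]            # filled by the vertical pass
    out[hd & vd] = (h[hd & vd] + v[hd & vd]) / 2.0   # Step 4: average both
    out[hd & ~vd] = h[hd & ~vd]
    out[~hd & vd] = v[~hd & vd]
    return np.rint(out).astype(block.dtype)
```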

C. Modified Block (Polygon) Matching

After padding the reference VOP, the motion estimation and compensation process for the contour macroblocks is the same as in the case of standard macroblocks, except that during block matching only pixels belonging to the active area of the VOP are used in the motion estimation process. Here the alpha plane of the VOP is used to exclude the pixels of the macroblock that are outside the VOP. This forms a polygon from the macroblock pixels that lie on the VOP boundary, as shown in Fig. as an example.

Figure: Illustration of the repetitive padding of Table: (a) horizontal padding; (b) vertical padding; (c) averaging of the horizontal and vertical padding; (d) exterior-pixel padding.

Figure: The padding technique employed on a VOP of the News test sequence: (a) repetitive padding of the VOP; (b) extended padding of the VOP.


Figure: Polygon matching for an arbitrarily shaped VOP; the transparent pixels of the macroblock are excluded, and only the polygon pixels inside the VOP are matched.

Due to its lower computational complexity as compared to other difference measures, the sum of absolute differences (SAD) is used in MPEG-4 as the error measure and is computed only for the pixels with nonzero alpha values. It is defined as

$$
SAD_N(x, y) =
\begin{cases}
\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N} \big|c(i,j) - p(i,j)\big|\,\alpha(i,j) \;-\; C, & (x, y) = (0, 0), \\[6pt]
\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N} \big|c(i,j) - p(i+x, j+y)\big|\,\alpha(i,j), & \text{otherwise},
\end{cases}
$$

where $\{c(i,j),\ 1 \le i, j \le N\}$ are the pixels of the current VOP, $\{p(m,n),\ -R \le m, n \le N + R\}$ are the pixels in the search range $R$ of the reference VOP, the motion vector satisfies $(x, y) \in \{-R, \dots, R\}^2$, and $\alpha(i,j)$ is the alpha component specifying the shape information. The factor $\alpha(i,j)$ in the sums implies that we compute the SAD only for the pixels (and hence macroblocks) containing the video object. The SAD computation is further divided into two cases, with or without significant motion. For macroblocks without significant motion, i.e., $(x, y) = (0, 0)$, the SAD is reduced by the constant $C = N_B/2 + 1$, where $N_B$ is the number of object pixels inside the block. The purpose of this reduction is to concentrate the distribution of motion vectors toward the $(0, 0)$ coordinate, so that entropy coding of the zero-difference motion vectors is more efficient.
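As an illustration, the alpha-masked SAD above can be computed as in the following sketch. The function and argument names are ours; `search` is assumed to be the padded reference window around the current block position:

```python
import numpy as np

def sad_polygon(cur, search, alpha, x, y, R):
    """Illustrative alpha-masked SAD for polygon matching. `cur` is the
    N x N current block, `alpha` its binary shape mask, `search` the padded
    (N + 2R) x (N + 2R) reference window centered on the block position,
    and (x, y) the candidate motion vector with |x|, |y| <= R."""
    N = cur.shape[0]
    disp = search[R + x : R + x + N, R + y : R + y + N]
    mask = alpha != 0                       # only object pixels contribute
    sad = int(np.abs(cur[mask].astype(int) - disp[mask].astype(int)).sum())
    if x == 0 and y == 0:                   # favor the zero motion vector
        n_b = int(mask.sum())               # object pixels in the block
        sad -= n_b // 2 + 1                 # the bias C = N_B / 2 + 1
    return sad

# Brute-force full search over the range [-R, R] x [-R, R]:
# mv = min(((u, v) for u in range(-R, R + 1) for v in range(-R, R + 1)),
#          key=lambda d: sad_polygon(cur, search, alpha, d[0], d[1], R))
```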

Arbitrary Shape Coding

The representation of an object's shape has been shown to be very useful in many fields of image and video processing. Specifically, the utilization of shape information in the areas of image analysis, image compression, computer vision, and graphics has been thoroughly investigated. These investigations have led to the development of several techniques for the efficient representation of shape information. MPEG-4 is the first attempt at providing a standardized approach to the representation of an object's shape within a video bitstream.

In MPEG-4, it is assumed that each video object is provided with its corresponding shape information. This shape information is provided in one of two formats: binary format or grey scale format. The binary format consists of a pixel map, which is generally the same size as the bounding box of the corresponding VOP; each pixel takes on one of two possible values, indicating whether it is contained within the video object or not. The grey scale format is similar to the binary format, with the additional feature that each pixel can take on a range of values (usually between 0 and 255). These values represent the transparency of that pixel. A value of 0 corresponds to a video object which is completely transparent, while a completely opaque video object would be represented by pixel values of 255. Video objects whose shapes are represented by values between 0 and 255 correspond to intermediate levels of transparency. This approach to representing the shape of a video object along with its transparency is very similar to the alpha plane approach used in computer graphics.

Both the binary and grey scale formats represent the shape of a video object as a matrix of binary or grey values, respectively. This matrix of values is referred to as a bitmap. Suitable shape coding methods include the following.

Contour-based methods extract and code the contour residing on the boundary of the video objects. The contour-based methods transform the source binary image into another binary image in which contour pixels are distinguished from all other pixels. Vertex-based coding and chain coding are prominent contour-based approaches. The disadvantage is that these approaches do not work within the conventional block-based video coding framework.

Bitmap-based methods are applied directly to the source binary images. The modified READ (MR) method and context-based arithmetic encoding (CAE) are commonly used bitmap-based approaches.

Chroma keying is an implicit method for shape coding, whereby the binary alpha component of the object is actually merged into the YUV components. The YUV components are then encoded by the texture encoder. Chroma keying allows arbitrarily shaped objects to be coded without an explicit shape encoder. However, DCT quantization noise may bleed into the reconstructed video object at its edges.

Bitmap-based compression for shape coding, such as the block-based method of CAE, was adopted by MPEG-4 because it offers good compression efficiency with relatively low computational complexity compared to other approaches such as vertex-based shape coding.

The shape coding techniques adopted by the standard support both lossless coding of alpha planes and lossy coding of shape and transparency information, thus allowing tradeoffs between bit rate and the accuracy of the shape representation. Furthermore, intra- and inter-shape coding functionalities employing motion-compensated shape prediction are envisioned, so as to allow both efficient random-access operations and efficient compression of shape and transparency information for diverse applications. In MPEG-4, the shape of every VOP is encoded along with its other characteristics (i.e., luminance, chrominance, etc.). Therefore, the shape of each VOP is bounded by a rectangular window. The bounding box is then partitioned into blocks of 16x16 pixels, called shape blocks, which are the same as the contour blocks. The selection and partitioning of the bounding box into shape blocks for a particular VOP is demonstrated in Fig. It is upon these shape blocks that the encoding and decoding process is performed.

A. Binary Shape Coding

Like the texture coding in MPEG-1 and MPEG-2, the bitmap-based coding method for the binary format contains both an intra and an inter mode. The major difference between these two modes is the addition of motion compensation in the inter mode, in order to achieve greater compression efficiency by first removing the temporal redundancies.

B. Grey Scale Coding

The grey scale format is encoded using a block-based DCT, where motion compensation can again be used to reduce the temporal redundancies. This method is very similar to that used to compress the texture information and is strictly a lossy compression technique. The grey scale bitmap is encoded by separately encoding the shape and transparency information. The shape information is encoded by the same binary shape coding method described above. The transparency values are treated as luminance values and encoded using the same 8x8 block DCT approach used to encode the texture information of a VOP.

As discussed above, the grey scale format is utilized for compositing a scene using several different video objects. A feature of the grey scale format is that each pixel can take on a range of values (usually between 0 and 255) representing the transparency of that pixel. When different objects occupy the same spatial location, they are blended together based on their grey (alpha) values, normalized by the maximum value of 255. This approach to representing the shape of a video object and its transparency is very similar to the alpha plane approach used in computer graphics.
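A minimal sketch of this blending rule (our helper names; alpha normalized by the maximum value 255) might look as follows:

```python
import numpy as np

def composite(fg, bg, alpha):
    """Blend a foreground object over a background using its grey-scale
    alpha plane: 0 = fully transparent, 255 = fully opaque.
    Illustrative sketch only, not the normative compositing process."""
    a = alpha.astype(float) / 255.0          # normalize by the maximum value
    return np.rint(a * fg + (1.0 - a) * bg).astype(np.uint8)
```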

C. Sprite Coding

A sprite is an image composed of pixels belonging to a video object that are visible throughout an entire video segment. For example, a sprite generated from a panning sequence may contain all the visible pixels of the background throughout the sequence, as shown in Fig. (a).

Figure: Sprite coding: (a) the panning image sprite containing all the visible pixels of the background; (b) the foreground VOP (Stefan); (c) the reconstructed frame.


In this particular case, the video object used to generate the sprite is the background. Portions of this background may not be visible in certain frames due to the occlusion of the foreground objects, as shown in Fig. (b), or due to camera motion. Since the sprite contains all parts of the background that were visible at least once, the sprite can be used for direct reconstruction of the background VOPs or for predictive coding of the background VOPs; the reconstructed frame is shown in Fig. (c).

In MPEG-4 sprite-based coding, two main types of sprites are distinguished: static and dynamic. Static sprites are those that are directly copied to generate a particular rendition of the sprite object at a particular time instant, namely a VOP; this copying, however, also includes the appropriate warping and cropping. In contrast, a dynamic sprite is used as a reference in predictive coding, where motion is compensated using the warping parameters for the sprite object. A dynamic sprite can in turn be generated either online or offline. An offline sprite is built, coded, and transmitted as an I-VOP prior to coding the video itself. An online sprite is dynamically built during coding in both the encoder and the decoder. An online sprite is always dynamic. On the other hand, an offline sprite can be static or dynamic depending on its usage. Offline static sprites are well suited for synthetic objects and objects that mostly undergo rigid motion. Online dynamic sprites provide a no-latency solution in the case of natural motion, and they provide an enhanced predictive coding environment. One of the major components of sprite-based coding is the generation of the sprite. This assumes that the sprite is not known in advance, which may not be the case for synthetic video objects. The sprite is built in a similar way in both the offline and online cases; in particular, the same global motion estimation algorithm is used. In the offline case, a sprite is built before starting the encoding process. It is constructed using every original VOP available for the video sequence. In the online case, both the encoder and the decoder build the same sprite from reconstructed VOPs. In sprite coding, the chroma components are processed in the same way as the luminance components, with properly scaled parameters. For offline static sprites, temporal scalability is implicit, since the transmission of the trajectories of each VOP is independent.

Advanced Coding Techniques

A. Unrestricted Motion Vectors

In the basic ME/MC technique, the predicted block has to be a block in the previous frame. If the current block is at a corner or is a border block of the current frame, the motion vector (MV) is restricted to a smaller range. One of the advanced techniques in MPEG-4 is to allow unrestricted MVs for such border blocks. Fig. illustrates this technique.


Figure: Illustration of the unrestricted motion vector technique: motion estimation with a restricted motion vector (e.g., (8, 0)) clips the search range at the frame boundary, while motion estimation with an unrestricted motion vector (e.g., (8, -8)) searches over the extended previous frame.

The previous frame is extended in all four directions by repeating the border pixels a number of times, based on the search range. The difference block is generated by applying ME/MC against the extended previous frame and taking the difference of the current block and the predicted block, which may be partially outside the frame boundary. This technique improves the coding efficiency of the boundary blocks.

B. Advanced Prediction

There are two aspects of advanced prediction:

Adaptive method: it decides whether a current block of 16x16 pixels is divided into four blocks of 8x8 pixels each for ME/MC. The decision is made by comparing

$$\sum_{i=1}^{4} SAD_{8\times 8}(i) \quad \text{with} \quad SAD_{16\times 16},$$

where SAD stands for the sum of absolute differences. If the sum of the four 8x8 SADs is smaller than the 16x16 SAD by more than a small bias, 8x8 prediction is chosen; otherwise 16x16 prediction is chosen. If we choose 8x8 prediction, there are four MVs for the four 8x8 luminance blocks (a small decision sketch follows after this list).

Overlapped MC: each pixel in an 8x8 luminance predicted block is a weighted sum of three prediction values, as specified in the following equation:

$$s(i,j) = \big( W_0(i,j)\,P_0(i,j) + W_1(i,j)\,P_1(i,j) + W_2(i,j)\,P_2(i,j) \big) / 8,$$

where the division by 8 is with round-off. $W_0(i,j)$, $W_1(i,j)$, and $W_2(i,j)$ are the weighting matrices, which can be found in the MPEG-4 standard. The values of $P_0(i,j)$, $P_1(i,j)$, and $P_2(i,j)$ are pixels of the previous frame.
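As an illustration of both aspects, the sketch below implements the 16x16 vs. 8x8 decision and the overlapped weighted sum. The helper names are ours; the decision bias and the weighting matrices W0-W2 are the standard's values, left as parameters or placeholders here rather than reproduced:

```python
import numpy as np

def choose_prediction(sad_16, sad_8, bias=129):
    """Adaptive method: choose four 8x8 vectors only if their summed SAD
    beats the single 16x16 SAD by more than `bias` (a placeholder value,
    not the normative one)."""
    return '8x8' if sum(sad_8) < sad_16 - bias else '16x16'

def obmc_block(P0, P1, P2, W0, W1, W2):
    """Overlapped MC for one 8x8 luminance block: a per-pixel weighted sum
    of three predictions, with weights W0..W2 taken from the standard's
    tables (assumed to sum to 8 at every pixel); division by 8 with
    round-off."""
    acc = W0 * P0.astype(int) + W1 * P1.astype(int) + W2 * P2.astype(int)
    return (acc + 4) // 8
```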

C. Bidirectional Motion Estimation and Compensation

There are four modes in bidirectional motion estimation and compensation. They differ in how the predicted block is formed:

Direct mode: this is the only mode in which it is possible to use MVs of 8x8 blocks. For each 8x8 block of the B-frame, the forward and backward motion vectors are derived from the MVs of the next P-frame that follows the B-frame.

Interpolate mode, backward mode, and forward mode: these perform ME/MC on 16x16 blocks. The MVs are obtained by forward ME and backward ME. Selection among these modes is based on a comparison of the SAD values generated by the four modes, and the mode with the minimum SAD value is chosen. In this comparison, the direct mode is favored by subtracting a bias from its SAD value before the comparison.

Deliver Video Bitstream over Networks

MPEG-4 provides the syntax and methods necessary to efficiently represent the shape information of an object within the coded bitstream, as we have mentioned above. Now the problem becomes how to deliver those bitstreams. Due to the large variety of existing network technologies, it is most likely that hybrid networks will be used to support video services. However, different networks have different characteristics. To optimize the performance of multimedia systems under given QoS (Quality of Service) requirements, rate control is used in MPEG-4 by jointly considering the video compression and delivery schemes based on the network alternatives, capacities, and characteristics. In addition, other techniques proposed in the MPEG-4 standard are used to meet the challenge of delivering video over networks in a bandwidth-efficient, universally accessible, and error-resilient manner.


Rate Control

Rate control and buffer regulation is an important issue for both variable bit rate (VBR) and constant bit rate (CBR) applications. In the case of VBR encoding, the rate controller attempts to achieve optimum QoS for a given target average rate. In the case of CBR encoding and real-time applications, the rate control scheme has to satisfy the low-latency and video-buffer-verifier constraints. In addition, the rate control scheme has to be applicable to a wide variety of sequences and bit rates. The scalable rate control scheme is designed to meet the requirements of both VBR without delay constraints and CBR with low-latency and buffer constraints.

The number of bits used for a frame depends on the quantization step size and the signal dynamic range. The scalable rate control (SRC) scheme models the bits assigned to a P-frame as

$$N_{bit} = R \left( \frac{a_1}{Q} + \frac{a_2}{Q^2} \right),$$

where $N_{bit}$ is the number of bits used for the frame, $R$ is the dynamic range of the frame, $Q$ is the quantization step size used for the frame, and $a_1$ and $a_2$ are two modeling parameters. Since the SRC scheme is used for inter frames, the motion-compensated SAD value of the frame is used as the dynamic range $R$ of the frame.
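Given the two model parameters, the quadratic model can be inverted to pick the quantization step for a target bit budget, as in this sketch (the parameter estimation from past frames performed by the rate controller is omitted; names are ours):

```python
import math

def choose_q(target_bits, dyn_range, a1, a2, q_min=1, q_max=31):
    """Illustrative inversion of the quadratic SRC model above:
    solve target_bits = dyn_range * (a1 / Q + a2 / Q**2) for Q.
    Multiplying through by Q**2 gives target*Q^2 - R*a1*Q - R*a2 = 0."""
    A, B, C = float(target_bits), -dyn_range * a1, -dyn_range * a2
    disc = B * B - 4.0 * A * C               # non-negative for positive inputs
    q = (-B + math.sqrt(disc)) / (2.0 * A)   # the positive root
    return int(max(q_min, min(q_max, round(q))))
```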

Error resilience

Error resilience provides an error-robustness capability to allow access to applications over a variety of wireless and wired networks as well as storage media. The error resilience tools basically cover resynchronization, data recovery, and error concealment.

A. Resynchronization

Resynchronization tools, as the name implies, attempt to enable resynchronization between the decoder and the bitstream after a residual error or errors have been detected. Generally speaking, the data between the synchronization point prior to the error and the first point where synchronization is reestablished is discarded, as shown in Fig., because it is usually not possible to detect the error at its exact occurrence location at the decoder. Such errors typically occur in bursts on wireless channels, which corrupt many bits when the channel fades. If the resynchronization approach is effective at localizing the amount of data discarded by the decoder, then the ability of other types of tools which recover data and/or conceal the effects of errors is greatly enhanced.

The resynchronization approach adopted by MPEG-4 is similar to the Group of Blocks (GOB) structure utilized by the ITU-T H.261 and H.263 standards.

Figure: All the data between the two resynchronization points surrounding an error (from the error location to the point where the error is detected) may need to be discarded.

In these standards, a GOB is defined as one or more rows of macroblocks (MBs). At the start of a new GOB, information called a GOB header is placed within the bitstream. This header information contains a GOB start code, which is different from a picture start code and allows the decoder to locate this GOB. Furthermore, the GOB header contains information which allows the decoding process to be restarted (i.e., to resynchronize the decoder to the bitstream and reset all predictively coded data).

The GOB approach to resynchronization is based on spatial resynchronization. That is, once a particular macroblock location is reached in the encoding process, a resynchronization marker is inserted into the bitstream. A potential problem with this approach is that, since the encoding process is variable rate, these resynchronization markers will most likely be unevenly spaced throughout the bitstream. Therefore, certain portions of the scene, such as high-motion areas, will be more susceptible to errors, which will also be more difficult to conceal.

The video packet approach adopted by MPEG-4, as shown in Fig., is based on providing periodic resynchronization markers throughout the bitstream.

Figure: Resynchronization markers help localize the effect of errors to an MPEG-4 video packet. The header of each video packet (resync marker, MB number, quantization parameter, and HEC) contains all the necessary information to decode the macroblock data in the packet. Here HEC stands for header extension code; the macroblock data comprise DC/mode and texture information for an I-VOP, and motion/mode and texture information for a P-VOP, respectively.

In other words, the length of the video packets is not based on the number of macroblocks, but instead on the number of bits contained in the packet. If the number of bits contained in the current video packet exceeds a predetermined threshold, then a new video packet is created at the start of the next macroblock.

A resynchronization marker is used to distinguish the start of a new video packet. This marker is distinguishable from all possible variable-length-coding codewords as well as from the VOP start code. Header information is also provided at the start of a video packet. Contained in this header is the information necessary to restart the decoding process: the macroblock number of the first macroblock contained in the packet, which provides the necessary spatial resynchronization, and the quantization parameter necessary to decode that first macroblock, which allows the differential decoding process to be resynchronized.

Important information that remains constant over a video frame, such as the spatial dimensions of the video data, the time stamps associated with the decoding and presentation of this video data, and the type of the current frame (INTER-coded/INTRA-coded), is transmitted in the header at the beginning of the video frame data. If some of this information is corrupted due to channel errors, the decoder has no recourse but to discard all the information belonging to the current video frame. In order to reduce the sensitivity of this data, a bit field called HEC is introduced in the video packet header. When HEC is set, the important header information that describes the video frame is repeated in the bits following the HEC. This duplicate information can be used to verify and correct the header information of the video frame. The use of HEC significantly reduces the number of discarded video frames and helps achieve a higher overall decoded video quality.

B. Data Recovery

After synchronization has been reestablished, data recovery tools attempt to recover data that would otherwise, in general, be lost. These tools are not simply error-correcting codes, but rather techniques which encode the data in an error-resilient manner. For instance, one particular tool under consideration is Reversible Variable Length Codes (RVLCs), as shown in Fig. In this approach, the variable-length codewords are designed such that they can be read both in the forward and in the reverse direction; codewords that cannot be uniquely parsed in both directions are not used. Obviously, this approach reduces the compression efficiency achievable by the entropy encoder. However, the improvement in error resilience is substantial.
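The defining property — decodability in both directions — amounts to requiring the codebook to be prefix-free when read forward and when read backward. A small sketch with a hypothetical codebook (not the standard's tables):

```python
def prefix_free(codes):
    # Sorted adjacency check: any prefix collision shows up between neighbors.
    s = sorted(codes)
    return all(not b.startswith(a) for a, b in zip(s, s[1:]))

def reversible(codes):
    """A VLC table is reversible iff it is prefix-free both forward and
    in reverse (i.e., also suffix-free)."""
    return prefix_free(codes) and prefix_free([c[::-1] for c in codes])

assert reversible(["00", "01", "10", "111"])   # hypothetical RVLC table
assert not reversible(["0", "10"])             # "0" is a suffix of "10"
```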

C. Error Concealment

Figure: Reversible VLCs can be parsed in both the forward and backward directions, making it possible to recover more DCT data from a corrupted texture partition instead of discarding all the data between two consecutive resynchronization markers.

Error concealment is an extremely important component of any error-robust video codec. Similar to the error resilience tools, the effectiveness of an error concealment strategy is highly dependent on the performance of the resynchronization scheme. Basically, if the resynchronization method can effectively localize the error, then the error concealment problem becomes much more tractable.

We can achieve error concealment by taking advantage of the separate motion/texture coding mode of MPEG-4. Specifically, this approach utilizes the data partitioning capability of separating the motion and the texture information. It requires that a second resynchronization marker be inserted between the motion and the texture information. If the texture information is lost, the approach utilizes the motion information to conceal the errors: the texture information is discarded, while the motion vectors are used to motion-compensate the previously decoded VOP.

This approach can be extended through the transmission of a mean motion vector and shape information for each object (i.e., side information). When both the texture and motion information, or just the motion information, is corrupted, this side information can be utilized to motion-compensate the object. This is in contrast to the typical error concealment strategy, which generally works at the macroblock level.

Universal Accessibility

Content-based scalability enables the user to achieve scalability with a fine granularity in content quality (e.g., spatial resolution, temporal resolution, and complexity). This allows for the manual or automated selection of decoded video quality based on the available bandwidth of a particular network. For example, a user can browse a database at different qualities, scales, and/or resolutions based on the bandwidth resources of a particular network. In general, scalability of video means the ability to achieve video of more than one resolution and/or quality simultaneously. Scalable video coding involves generating a coded representation in a manner that allows the derivation of video of more than one resolution/quality by scalable decoding. Bitstream scalability is the property of a bitstream that allows decoding of appropriate subsets of the bitstream to generate complete pictures of resolution and quality commensurate with the proportion of the bitstream decoded. A truly scalable bitstream allows both low- and high-performance decoders to coexist: a low-performance decoder may decode small portions of the bitstream, producing basic quality, while a high-performance decoder may decode the entire bitstream and produce significantly higher quality.

The abilities to provide content-based spatial and temporal scalability are two very important functionalities that have been proposed in MPEG-4. The concept of spatial and temporal scalability can also be extended to VOPs of arbitrary shape, which is referred to as generalized scalability. Each type of scalability involves more than one layer, such as a lower layer and a higher layer.

Object-based scalability: it is important to keep in mind that MPEG-4 offers the ability to perform object-based scalability. This unique functionality is a result of MPEG-4's ability to resolve a scene into different VOPs. Utilizing the multiple-VOP structure, different resolution enhancements can be applied to different portions of a video scene. Therefore, within MPEG-4 the following two enhancement mechanisms are allowed: Enhancement Type 1 and Enhancement Type 2. In Enhancement Type 1, the enhancement layer increases the resolution of a particular object or region of the base layer. In Enhancement Type 2, the enhancement layer increases the resolution of the entire base layer.

Spatial scalability: the lower layer is referred to as the base layer, and the higher layer is called the enhancement layer, as shown in Fig. Traditionally, these scalabilities are applied to frames of video such that, in the case of spatial scalability, the enhancement-layer frames enhance the spatial resolution of the base-layer frames. If needed, a downsampling process is performed by the scalability processor.

Temporal scalability: the enhancement-layer frames are temporally multiplexed with the base-layer frames to provide high temporal resolution video. In temporal scalability, as shown in Fig., the frame rate of a selected object is enhanced such that it has smoother motion than the remaining area. In other words, the frame rate of the selected object is higher than that of the remaining area.


Figure: Spatial scalability (a hierarchy of coder/decoder pairs with 2x down- and upsampling between layers).

Figure: Temporal scalability (coder/decoder pairs operating at 60, 30, and 15 Hz).


DCT-domain Content-based Video Coding

As stated in the previous sections, the motion and texture coding techniques in MPEG-4 are direct extensions of those used in traditional video coding. Thus, block matching and the Discrete Cosine Transform (DCT) are still the basic techniques. However, the difference between the MPEG-4 approach and traditional video coding is that in MPEG-4 the motion estimation of the blocks on the VOP borders has to be modified from block matching to polygon matching. To accomplish polygon matching, macroblock-based repetitive padding is required in order to estimate motion for the contour macroblocks, which reside on the boundary of the video object and contain partial video information, as shown in Fig. The padding procedures are listed in Table. To cope with VOPs with large motions, the padding is further extended to blocks which are completely outside the VOP but immediately next to the boundary blocks. In addition to the high computational complexity, $O(N^4)$, of modified block (polygon) matching motion estimation (MBKM-ME), the macroblock-based repetitive padding increases the overall complexity and system data flow and makes a real-time video codec implementation even harder. The resulting video coding structure is shown in Fig. (a). Therefore, how to handle the demanding computational tasks in real time, and how to implement them in a cost-effective way, have become big challenges. The following question can now logically be posed: are there disadvantages in this MPEG-4 video coder design? The answer is yes. For instance, in order to support spatial-domain motion estimation/compensation, the IDCT is used to restore the compressed video object back to the spatial domain. With such a design, however, the throughput of the coder is limited by the processing speed of the four major components in the feedback loop: the DCT, the IDCT, spatial-domain motion estimation (SD-ME), and the macroblock-based repetitive padding. The feedback loop therefore becomes the major bottleneck of the entire digital video system.

Is there a better low-complexity design that achieves MPEG-4-compatible performance? The answer is positive, and we will provide such a solution by working at the algorithmic level in the next section.

Transform Domain Motion Estimation/Compensation

Besides spatial-domain motion estimation/compensation, we can also estimate and compensate motion for arbitrarily shaped video objects in the DCT domain. With such a DCT-domain design, we can move the DCT unit out of the feedback loop to realize the fully DCT-based coder structure shown in Fig. (b). The performance-critical feedback loop of the DCT-based coder then contains only a transform-domain motion estimation unit (TD-ME), instead of the four major components (DCT, IDCT, spatial-domain motion estimation (SD-ME), and macroblock-based repetitive padding).


Figure: Comparison of different coder structures: (a) the video coding structure in the MPEG-4 encoder — the commonly used motion-compensated DCT hybrid coder performs motion estimation in the spatial domain (SD-ME), with the IDCT and macroblock-based padding inside the feedback loop; (b) the video coding structure in the fully DCT-based encoder — motion is estimated in the transform domain (TD-ME), and the macroblock-based padding is moved out of the feedback loop.


This not only reduces the complexity of the coder but also achieves higher system throughput. Most importantly, the DCT-based nature enables the combination of the DCT and motion estimation units — which together consume the dominant share of the computing power of a video coder — into one component, saving chip area. In addition, we can also move the macroblock-based repetitive padding unit out of the feedback loop. The performance-critical feedback loop then contains only one transform-domain motion estimation unit (TD-ME), instead of the four major computation-intensive units of Fig. (a). As a result, this not only reduces the system complexity of the coder but also achieves higher data throughput.

In this section, we extend the DCT-domain motion estimation/compensation scheme discussed in previous chapters to arbitrarily shaped video. In principle, we modify and extend the DCT pseudo-phase techniques to estimate motion at integer- and half-pixel accuracy for the arbitrarily shaped VOPs of MPEG-4 video. Unlike the modified block (polygon) matching motion estimation algorithm adopted in MPEG-4, the presented motion estimation scheme (EDXT-ME) works solely in the DCT transform domain instead of the spatial domain. In other words, if we consider the conventional block-based motion estimation/compensation a time-domain approach, our design is a frequency-domain approach. Thus, it enables us to extract the motion displacement directly from the current and previous VOPs, even without macroblock-based repetitive padding. In addition, we can perform the motion compensation in the DCT domain without converting back to the spatial domain. However, for the sake of MPEG-4 compatibility, we will still keep the padding procedure, as we will explain later. In other words, to make our design work with an MPEG-4 decoder, the VOP has to be padded.

The EDXT-ME algorithm is summarized in Table. For arbitrary-shape motion estimation, we can treat the contour macroblocks the same as the regular ones, except that pixels outside the object are padded based on the video content inside the object boundary, following the procedures listed in Table. Next, we will discuss each step of the EDXT-ME algorithm in more detail.

A. VOP Formation

The VOP is represented by means of a bounding rectangle, as described next. The phase between the luminance and chrominance samples of the bounding rectangle has to be set correctly according to the 4:2:0 format, as shown in Fig. Specifically, the top-left coordinate of the bounding rectangle should be rounded to the nearest even number not greater than the top-left coordinate of the tightest rectangle. Accordingly, the top-left coordinate of the bounding rectangle in the chrominance component is the top-left coordinate of the luminance component divided by two.


Input: the video object planes (VOPs).
Output: motion vectors and prediction errors.

1. (VOP formation and padding) Based on the shape information of the VOP, generate the tightest rectangle (the bounded VOP window) that contains the video object, to achieve high coding efficiency. The bounded window has the minimum number of macroblocks, each of size 16x16 pels. A shift parameter, encoded as the horizontal/vertical spatial reference (spat_ref) in MPEG-4, indicates the location of the bounded VOP window with respect to the borders of a reference VOP. The VOP is then macroblock-based repetitively padded.

2. (Content-based motion estimation/compensation) The motion vector is computed only for each 16x16 macroblock (or 8x8 block, for advanced motion compensation) which contains the video object; otherwise, jump to Step 3.

(a) Compute the 2D DCT coefficients of the second kind (2D-DCT-II), $X_t^{cc}(k,l)$, $X_t^{cs}(k,l)$, $X_t^{sc}(k,l)$, and $X_t^{ss}(k,l)$, of a macroblock of pixels $\{x_t\}$ in the current VOP. Meanwhile, the DCT coefficients of the corresponding macroblock of pixels $\{x_{t-1}\}$ in the reference VOP are converted to 2D DCT coefficients of the first kind (2D-DCT-I), $Z_{t-1}^{cc}(k,l)$, $Z_{t-1}^{cs}(k,l)$, $Z_{t-1}^{sc}(k,l)$, and $Z_{t-1}^{ss}(k,l)$, through a plane rotation.

(b) Determine the normalized pseudo phases $f(k,l)$ and $g(k,l)$ from the system equation, which contains the type-I and type-II DCT coefficients obtained in step (a).

(c) Compute $F(m,n)$ and $G(m,n)$, the inverse DCT (2D-IDCT-II) of $f(k,l)$ and $g(k,l)$, which are composed of impulse functions whose peak positions indicate the integer-pel motion vector $(m_u, m_v)$ and whose peak signs reveal the direction of the movement.

(d) The half-pel motion vector is then determined by considering only the nine possible positions around the integer-pel displacement $(m_u, m_v)$, without interpolation. Basically, the half-pel motion vector is determined by computing $DCS(u,v)$ and $DSC(u,v)$ for $u \in \{m_u - \tfrac{1}{2}, m_u, m_u + \tfrac{1}{2}\}$ and $v \in \{m_v - \tfrac{1}{2}, m_v, m_v + \tfrac{1}{2}\}$. The peak values of $DCS(u,v)$ and $DSC(u,v)$ indicate the half-pel motion and its direction.

(e) Based on the derived motion estimate, the DCT of the motion-compensated residual (DBD, displaced block difference) between the current block $B_{curr}$ and the displaced reference block $B_{ref}$ is computed as $DCT\{DBD\} = DCT\{B_{curr}\} - DCT\{B_{ref}\}$. The prediction errors are then quantized and sent to the receiver along with the coded macroblock motion vectors.

3. Go to Step 2 until the whole video object is estimated. The process (a while loop) starts from the top-left macroblock in the bounded VOP window, proceeds to the top-right one, then to the next row, and so on, for every macroblock in the bounded VOP window.

Table: Summary of the EDXT-ME algorithm.


Figure: Luminance vs. chrominance sample positions in the 4:2:0 format within the bounded VOP window.

Here the shape information is used to form a VOP. The following procedure (sketched in code after the figure below) attains the minimum number of macroblocks that contain the object, yielding a higher coding efficiency:

1. Generate the tightest rectangle with an even-numbered top-left position, as shown in Fig. (a).

2. If the top-left position of this rectangle is the same as the origin of the image frame, skip the formation procedure.

3. Form a control macroblock at the top-left corner of the tightest rectangle, as shown in Fig. (b).

4. Count the number of macroblocks that completely contain the object, starting at each even-numbered point of the control macroblock. Details are as follows:

(a) Generate a bounding rectangle from the control point to the right-bottom side of the object, consisting of multiples of 16x16 blocks.

(b) Count the number of macroblocks in this bounding rectangle which contain at least one object pel. (It is sufficient to take into account only the boundary pels of a macroblock.)

5. Select the control point that results in the smallest number of macroblocks for the given object.

Extend the top left co ordinate of the tightest rectangle generated in

Fig b to the selected control co ordinate This will create a rectan

gle that completely contains the ob ject but with the minimum number

DCTDOMAIN CONTENTBASED VIDEO CODING

Bounded VOP window (Tightest rectangle)

Shift

Current VOP a Control Macro- Bounded VOP window block (Tightest rectangle) . ... .

: Control point

Extended bound

Intelligently generated VOP

Current VOP

b

Figure Intelligent VOP formation a generate the tightest rectangle b

extended the VOP window

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

of macroblo cks in it The VOP horizontal and vertical spatial references

are taken directly from the mo died topleft co ordinate
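As an illustration of the procedure above, the following Python sketch searches the candidate control points on a binary alpha mask and keeps the grid anchor that yields the fewest object macroblocks. For simplicity it tests every pel of a macroblock rather than only the boundary pels, and the function name is ours.

```python
import numpy as np

MB = 16  # macroblock size

def intelligent_vop_formation(alpha):
    """Pick the control point that minimizes the number of macroblocks
    containing at least one object pel (alpha is a binary object mask)."""
    ys, xs = np.nonzero(alpha)
    # Step 1: tightest rectangle with an even-numbered top-left position
    top, left = (ys.min() // 2) * 2, (xs.min() // 2) * 2
    bottom, right = ys.max() + 1, xs.max() + 1
    best_n, best_anchor = None, (top, left)
    # Step 4: try each even-numbered point of the control macroblock
    for dy in range(0, MB, 2):
        for dx in range(0, MB, 2):
            t, l = top - dy, left - dx
            if t < 0 or l < 0:
                continue
            # count macroblocks of this candidate grid touching the object
            n = sum(
                alpha[r:r + MB, c:c + MB].any()
                for r in range(t, bottom, MB)
                for c in range(l, right, MB)
            )
            if best_n is None or n < best_n:  # Step 5: keep smallest count
                best_n, best_anchor = n, (t, l)
    return best_anchor  # Step 6: modified top-left coordinate of the window
```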

B. Content-based Motion Estimation/Compensation

The reason we call it content-based video coding is that motion estimation/compensation is performed only for those macroblocks containing video information. This is the kernel of the EDXT-ME algorithm and is computationally intensive; the whole system performance depends on the design of this core process. We will describe the detailed cost-effective architectures and the corresponding VLSI implementation in the following chapters. Basically, we can view our approach as a logical extension of the DCT-based motion estimation schemes toward coding video sequences of arbitrary shape.

After motion estimation, the current block $B_{cur}$ of size $N \times N$ in the current frame can be best predicted by the block $B_{ref}$ displaced from the previous block position by the estimated motion vector $(m_u, m_v)$. Based on the earlier derivation, the DCT of the motion-compensated residual (displaced block difference, DBD) is given by
$$DCT\{DBD\} = DCT\{B_{cur} - B_{ref}\} = DCT\{B_{cur}\} - DCT\{B_{ref}\}.$$
In other words, the DCT of the motion-compensated residual can be expressed as the difference between the DCT of the current block and the DCT of the displaced block. As a result, we can perform motion compensation in the DCT domain, as shown in Fig. (b), which serves the purpose of building a fully DCT-based motion-compensated video coder without converting back to the spatial domain before motion compensation.
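This linearity is easy to verify numerically; the following minimal sketch uses SciPy's orthonormal 2-D DCT as a stand-in for the block transform.

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
b_cur = rng.standard_normal((8, 8))  # current block
b_ref = rng.standard_normal((8, 8))  # displaced reference block

# DCT{B_cur - B_ref} equals DCT{B_cur} - DCT{B_ref}: the residual can be
# formed entirely in the transform domain.
lhs = dctn(b_cur - b_ref, norm='ortho')
rhs = dctn(b_cur, norm='ortho') - dctn(b_ref, norm='ortho')
assert np.allclose(lhs, rhs)
```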

Now the question becomes: how do we extract the displaced DCT block in the DCT domain, or how do we compute $DCT\{B_{ref}\}$? Let us illustrate the solution with a simple example. As illustrated in Fig. (a), after motion estimation, the current block $B_{cur}$ of size $N \times N$ in the current frame can be best predicted from the block displaced from the previous block position by the estimated motion vector $(m_u, m_v)$ in the spatial domain. This motion estimation determines which four contiguous predefined DCT blocks are chosen for the prediction of the current block, out of the eight surrounding DCT blocks and the block at the current block position. To extract the displaced DCT block in the DCT domain, a direct method is to obtain four subblocks separately from these four contiguous blocks, which can then be combined together to form the final displaced DCT block, as shown in Fig. (b), with the upper-left, lower-left, upper-right, and lower-right blocks labeled as $B_1$, $B_2$, $B_3$, and $B_4$, respectively. Subblocks $S_k$ are extracted from these four blocks by


[Figure: DCT-based motion compensation. (a) DCT-based motion estimation; (b) pixelwise translated DCT block; (c) motion compensation.]

premultiplication and postmultiplication with the shifting matrices $H_k$ and $V_k$:
$$S_k = H_k B_k V_k, \qquad k = 1, \ldots, 4,$$
where the shift amounts are determined by the estimated motion vector, and $H_k$, $V_k$ are defined as
$$H_1 = H_3 = \begin{bmatrix} 0 & I_{h_1} \\ 0 & 0 \end{bmatrix}, \qquad H_2 = H_4 = \begin{bmatrix} 0 & 0 \\ I_{h_2} & 0 \end{bmatrix},$$
$$V_1 = V_2 = \begin{bmatrix} 0 & 0 \\ I_{v_1} & 0 \end{bmatrix}, \qquad V_3 = V_4 = \begin{bmatrix} 0 & I_{v_2} \\ 0 & 0 \end{bmatrix}.$$
Here $I_n$ is the $n \times n$ identity matrix, i.e., $I_n = \mathrm{diag}\{1, \ldots, 1\}$, and $n$ is determined by the height/width of the corresponding subblock, as shown in Fig. (b). These premultiplication and postmultiplication matrix operations can be visualized in Fig. (c), where the overlapped grey areas represent the extracted subblock. These four subblocks are then summed to form the desired translated block $B_{ref}$. The DCT coefficients of the four subblocks can then be combined together to form the final displaced DCT block:
$$DCT\{B_{ref}\} = \sum_{k=1}^{4} DCT\{H_k\}\, DCT\{B_k\}\, DCT\{V_k\}.$$
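As a concrete check of both identities, here is a small numpy sketch under the conventions above, with the displaced block's top-left corner at offset (u, v) inside B1; the helper name shift_matrices and the random test frame are ours, for illustration only.

```python
import numpy as np
from scipy.fft import dct

def shift_matrices(N, u, v):
    """Shifting matrices for a displaced block whose top-left corner sits
    at offset (u, v) inside the upper-left reference block B1."""
    H13 = np.zeros((N, N)); H13[np.arange(N - u), np.arange(u, N)] = 1.0
    H24 = np.zeros((N, N)); H24[np.arange(N - u, N), np.arange(u)] = 1.0
    V12 = np.zeros((N, N)); V12[np.arange(v, N), np.arange(N - v)] = 1.0
    V34 = np.zeros((N, N)); V34[np.arange(v), np.arange(N - v, N)] = 1.0
    return (H13, H24, H13, H24), (V12, V12, V34, V34)

N, u, v = 8, 3, 5
rng = np.random.default_rng(1)
frame = rng.standard_normal((2 * N, 2 * N))
B1, B3 = frame[:N, :N], frame[:N, N:]   # upper-left, upper-right
B2, B4 = frame[N:, :N], frame[N:, N:]   # lower-left, lower-right
Hs, Vs = shift_matrices(N, u, v)
blocks = (B1, B2, B3, B4)

# Spatial domain: the four windowed subblocks S_k = H_k B_k V_k sum to the
# block extracted directly at displacement (u, v).
B_ref = sum(H @ B @ V for H, B, V in zip(Hs, blocks, Vs))
assert np.allclose(B_ref, frame[u:u + N, v:v + N])

# DCT domain: DCT{B_ref} = sum_k DCT{H_k} DCT{B_k} DCT{V_k}, writing the
# 2-D DCT as T X T^T for an orthonormal DCT-II matrix T.
T = dct(np.eye(N), axis=0, norm='ortho')
dct2 = lambda X: T @ X @ T.T
rhs = sum(dct2(H) @ dct2(B) @ dct2(V) for H, B, V in zip(Hs, blocks, Vs))
assert np.allclose(dct2(B_ref), rhs)
```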

C. An Example to Illustrate Our Design

To facilitate the explanation of the EDXT-ME algorithm, let us use the MPEG-4 video test sequence News in CIF format (352 x 288) as the input image sequence. The panorama scene is shown below.

[Figure: Panorama scene of News in CIF format.]

The News sequence consists of four VOPs and three corresponding binary alpha planes (the background VOP has no alpha plane). Here we apply our design only to the third video object plane, shown in Fig. (a), as an example to illustrate our design, because it is the foreground VOP; most importantly, its location and shape, as shown in Fig. (b), vary with time. After taking the first step of EDXT-ME (VOP formation), the VOP is bounded by the tightest rectangle containing the video object. However, this tightest rectangle may not consist of multiples of 16 x 16 macroblocks; therefore, we need to extend the bottom-right coordinate of the VOP window in Fig. (a) to satisfy that requirement. The final bounded VOP and its corresponding alpha plane are shown in Fig. (c) and (d), respectively. The reason to introduce VOP formation is to achieve a high data compression rate, because we do not need to estimate motion for those macroblocks containing no video information. After VOP formation, the bounded VOP is padded, as shown in Fig. (e) and (f). The bounded VOP window is then further divided into non-overlapping macroblocks. The content-based motion estimation, the second step of EDXT-ME, is performed on the corresponding texture of the padded VOP in the bounded window.
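The repetitive padding pass itself can be sketched in a few lines. The following simplified Python version fills each transparent pel from the nearest opaque pel in its row (averaging when enclosed by two) and then repeats the process column-wise; the normative MPEG-4 procedure additionally defines extended padding for wholly transparent macroblocks, which is omitted here.

```python
import numpy as np

def repetitive_pad(texture, alpha):
    """Fill transparent pels of a bounded VOP by repetitive padding:
    a horizontal pass, then a vertical pass on the lines left untouched."""
    tex = texture.astype(np.float64).copy()
    filled = alpha.astype(bool).copy()

    def pad_lines(tex, filled):
        for i in range(tex.shape[0]):
            idx = np.nonzero(filled[i])[0]
            if idx.size == 0:
                continue  # no opaque pel in this line; next pass handles it
            for j in np.nonzero(~filled[i])[0]:
                left, right = idx[idx < j], idx[idx > j]
                if left.size and right.size:  # enclosed: average both sides
                    tex[i, j] = (tex[i, left[-1]] + tex[i, right[0]]) / 2
                elif left.size:
                    tex[i, j] = tex[i, left[-1]]
                else:
                    tex[i, j] = tex[i, right[0]]
            filled[i] = True

    pad_lines(tex, filled)      # horizontal pass
    pad_lines(tex.T, filled.T)  # vertical pass via transposed views
    return tex
```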

The binary alpha plane, as shown in Fig. (d), is coded by modified context-based arithmetic encoding (CAE). The adopted block-based syntax allows compressed binary alpha blocks (BABs) to be blended seamlessly into the video syntax, which in turn eases the task of supporting important features such as error-resilient, bit-allocation, and rate-controlled operations. Just as with YUV encoding, a BAB may be intra-coded using context-based arithmetic encoding, it may be inter-coded using motion compensation and CAE, or it may merely be reconstructed by motion compensation without CAE, which is analogous to the not-coded macroblock mode in the MPEG-4 video standard. Both YUV and binary alpha decoding require the use of motion estimation/compensation to exploit temporal redundancy.

[Figure: Explanation of VOP formation using the News video test sequence. (a) a VOP of News; (b) the alpha plane of the VOP; (c) the VOP in the bounded window; (d) the alpha plane of the bounded VOP; (e) the VOP after repetitive padding; (f) the VOP after extended padding.]

To envision how our presented DCT-domain scheme works, here we present an example of estimating the motion of a contour macroblock by following Steps 2.1-2.4 of our design, as shown in the figure at the end of this section. The peak position among $F(m,n)$ and $G(m,n)$ indicates an integer-pel motion vector of $(3, 2)$. The peak position among $DSC(u,v)$ and $DCS(u,v)$ implies a half-pel motion vector of $(3.5, 2.5)$.
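In code, the peak extraction of Steps 2.3 and 2.4 amounts to little more than an argmax over the impulse field; the sketch below assumes F(m,n), G(m,n) (and, at the half-pel stage, DCS/DSC) have already been computed by the earlier steps.

```python
import numpy as np

def find_peak(X):
    """Peak position and sign of a field composed of impulse functions,
    e.g. F(m, n) or G(m, n) from Step 2.3, or DCS/DSC from Step 2.4."""
    idx = np.unravel_index(np.argmax(np.abs(X)), X.shape)
    return idx, float(np.sign(X[idx]))

# (m, n), sign = find_peak(F)  # peak at (3, 2) in the example above;
#                              # the sign resolves the movement direction
```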

D. Computational Complexity

Now let us take a look at the overall computational complexity. To process each video object plane, Step 1 of the proposed approach (VOP formation and padding) needs to be executed only once. Therefore, the overall computational complexity of the design is determined by the complexity of Steps 2 and 3, which serve as the computing engine of the whole design; it is listed in the table below.

Step   Operation                        Computational complexity
2.1    2-D DCT (type-II) computation    O(N^2)
2.1    Rotation (type-I DCT)            O(N^2)
2.2    Pseudo-phase calculation         O(N^2)
2.3    F, G computation                 O(N^2)
2.4    Half-pel motion estimation       O(N^2)
3      Prediction-error computation     O(N^2)
       Total                            O(N^2)

Table: Computational complexity of Steps 2 and 3 in our design for a macroblock of size N x N. The size N is adjustable; here we use N = 16 as an example.

Overall, the scheme requires a computational complexity of O(N^2), where N stands for the macroblock size and is adjustable. For large motions going beyond the block boundary, we use a default motion vector instead. Notice that if the original input image sequences are not decomposed into several video object layers (VOLs) of arbitrary shape, the EDXT-ME scheme simply degenerates into a single-layer representation, which supports conventional image sequences of rectangular shape. The EDXT-ME approach can thus be seen as a logical extension of the MPEG-compatible motion estimation algorithms in the transform domain toward input image sequences of arbitrary shape.
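To put the O(N^2) figure in perspective, here is a back-of-the-envelope comparison with an exhaustive block-matching search, which costs on the order of N^4 operations for an N x N block and a search range of roughly +/-N; this is a growth-rate comparison only, with constant factors omitted.

```python
N = 16                               # macroblock size
edxt_ops = N * N                     # EDXT-ME core: O(N^2)
bkm_ops = (2 * N + 1) ** 2 * N * N   # exhaustive block matching: O(N^4)
print(edxt_ops, bkm_ops, bkm_ops // edxt_ops)   # 256 278784 1089
```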


[Figure: Illustration of estimating the motion of a contour macroblock by following Steps 2.1-2.4 in our compressed-domain design: the type-I and type-II DCT coefficients of the contour macroblock (Step 2.1), the pseudo phases f(k,l) and g(k,l) (Step 2.2), their inverse transforms F(m,n) and G(m,n) peaking at the integer-pel displacement (3, 2) (Step 2.3), and DSC(u,v)/DCS(u,v) refining it to the half-pel estimate (3.5, 2.5) (Step 2.4).]


Simulation Results

Simulations have been performed on the News sequence in CIF format. The bounded previous and current VOPs are shown in Fig. (a) and (b), respectively. The reconstructed VOP using our presented compressed-domain coding scheme is shown in Fig. (d). The simulation results demonstrate comparable video quality between the reconstructed and current VOPs.

[Figure: Performance of our content-based video coding. (a) bounded previous VOP; (b) bounded current VOP; (c) bounded alpha plane for the previous VOP; (d) reconstructed VOP using our presented design.]

Due to its lower computational complexity compared with other difference measures, the sum of absolute differences (SAD), as defined earlier, is adopted in the MPEG standards to measure the prediction errors. Simulations have also been performed to compare our design with the modified block matching (or polygon matching) method used in MPEG-4 in terms of prediction errors (SAD). Here the MPEG-4 video reference software (MoMuSys) is used as the reference in simulating the performance of the modified block matching approach. The results are shown in the figure below. The simulation results demonstrate the comparable performance of our design and the one used in MPEG-4 in terms of prediction errors. Compared to the conventional arbitrarily shaped video coding design, we optimize the hardware complexity by minimizing the computational units along the data path more cost-effectively.
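For reference, the SAD of a block is simply the sum of elementwise absolute differences, and the per-frame totals plotted below sum it over all macroblocks; a minimal version:

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

def frame_sad(cur, pred, mb=16):
    """Total SAD over all macroblocks of a frame (the quantity plotted)."""
    return sum(sad(cur[r:r + mb, c:c + mb], pred[r:r + mb, c:c + mb])
               for r in range(0, cur.shape[0], mb)
               for c in range(0, cur.shape[1], mb))
```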

[Figure: Comparing the performance of different video coding approaches (our design vs. the MPEG-4 design) in terms of prediction errors using the News test sequence. Here the total sum of absolute differences is the summation of SAD over all macroblocks within each frame.]

Other than the News test sequence, simulations have also been performed on the Foreman and Mother and Daughter sequences, among others. In order to show that our

design is also backward compatible and can handle rectangular video frames, we treat the Mother and Daughter sequence as a regular rectangular frame of pixels. The simulation results, as shown in the figure below, demonstrate comparable video quality between our compressed-domain design and the conventional MPEG-4 approach used in the video standards. In other words, considering that motion-compensated video coding of a rectangular frame is a special case of our arbitrarily shaped video coding, it is easy to see that our presented design is backward compatible for coding regular images.


[Figure: Comparing the video quality of different video coding approaches. (a) Foreman using MPEG-4 (MoMuSys reference software); (b) Foreman using our design; (c) Mother and Daughter using MPEG-4; (d) Mother and Daughter using our design.]