MACROBLOCK MACROBLOCK (MB=4Y+Cr+Cb) (MB=4Y+Cr+Cb) MOTION MV ESTIMATION MV AND COMPENSATION DPCM MOTION (PREDICTION/ BLOCK INTERPOLATION) COMPENSATED (8x8) RESIDUAL Ref f_code motion_residual RECONSTRUCTED motion_code FRAME MEMORY VLC DCT DCT 3/4 unsigned bit FLC(f_code-1 bits) in picture header RECONSTRUCT FRAME (IQ + IDCT) OUTPUT BUFFER Quantization Quantization for compressed bit stream (with q-table) (with flat table) DC AC

DPCM ZigZag Order nonzero BLOCK zero MACROBLOCK zero BLOCK run-length code size category sign/magnitude run-length code (dct_dc_size)

VLC for Y VLC for Cr/Cb coded block pattern VLC VLC FLC macroblock address increment (skip block)

(skip MB)

Figure Conceptual data ow for co ding a video sequence using MCDCT

approach

MotionComp ensated DCT Video Enco der and

Deco der

The MCDCT approachprovides two dierent paths for enco ding a frame

The intraframe co ding enco des the current frame without the knowledge of

any reference frame by exploiting spatial correlation

The interframe co ding enco des the current frame with the knowledge of one

reference frame by exploiting b oth spatial and temp oral correlation

However b oth intraframe and interframe co ding use a numb er of common building

blo cks DCT quantizer and entropy co der As a result the MCDCT enco der typ

ically has a switchtocontrol the co ding mo de interframe or intraframe as shown

in Figure a Similarly the MCDCT deco der is also capable of switching b e

MOTIONCOMPENSATED DCT VIDEO ENCODER AND DECODER

tween the intraframe co ding mo de and the interframe co ding mo de as depicted in

Figure b The conventional hybrid MCDCT co dec structure in Figure

is the basis of the video co der or deco der architectures used for all the MCDCT

based video co ding standards

To summarize the MCDCT approach the conceptual data ow is depicted in

Figure Based on the GOP structure the enco der needs to determine whether

the incoming picture is co ded as an intraframe I or an interframe PB Each

Iframe is then divided into macroblo cks eachofwhichcontains luminance blo cks

usually x blo cks and twochrominance blo cks dep ending on the picture format

usually Eachblock of pixels is converted through DCT to DCT co ecients

which are then quantized according to an HSVmatched quantization table The

DC co ecients of neighb oring blo cks are then arranged in a group or slice and

pro cessed by means of DPCM to generate dierential DC values DDC which

are co ded in two comp onents size category dct dc siz e co ded in VLC and the

signmagnitude part co ded in FLC The AC co ecients are reordered in a zigzag

way and then translated to the runlength co des which are nally co ded in VLC

For PBframes motion estimation must be p erformed on a macroblo ckby

macroblo ck basis on the current frame and the reconstructed reference previous

andor future frame to pro duce forward andor backward motion vectors resp ec

tively Then with these motion vectors predictive forwardbackward or inter

p olated motion comp ensation ie average of forward and backward predictions

is used on the reconstructed reference frame to generate the motion comp ensated

frame residual This residual is then passed to a DCT unit and a quantization unit

with a at quantization table to pro duce a set of quantized DCT co ecients If all

the co ecients are zero in a macroblo ck then this macroblo ck is skipp ed through

addr ess incr ement If only some blo cks in a nonzero macroblo ckhave macr obl ock

all zero quantized co ecients then the co ded blo ck pattern cbp is used to skip

those zero blo cks All the co ecients including DC of all nonzero blo cks are

enco ded in the same way as in intra blo cks DCT co ecients are rearranged in

a zigzag order and then translated into runlength co des which are co ded nally

in VLC The motion vectors are co ded with DPCM to pro duce dierential motion

vectors DM V These motion vectors are co ded in three comp onents

f code size comp onent represented in unsigned bits is stued in the

picture header

motion code principal comp onent is co ded in VLC

motion r esidual residual comp onent is co ded in FLC of f code bits

All these enco ded bits are placed in the output buer and encapsulated with the

layering information in accordance with the video stream syntax or semantics

CHAPTER MOTIONCOMPENSATED DCT VIDEO CODING

dened in each video co ding standard The deco der will receive these bits in order

and recover the frames in the reverse order as describ ed ab ove

In addition to the co ding schemes describ ed in this chapter dierent video

standards may also include additional features or advanced metho ds to improve the

compression ratio or t dierent application targets The following is a small list of

the features adopted in the standards which will b e discussed in detail in the next

chapter

Scalability Scalability allows a single compressed video stream to be de

co ded at dierent qualitylevels This requires partitioning the pictures into

several layers base layer lowest layer and the enhancement layers One

layer of video base layer is co ded indep endently whereas other layers are

co ded dep endently with resp ect to the previous layer This will facilitate the

integration of multiple video services There are various typ es of scalable

co ding techniques esp ecially in MPEG

Signaltonoise ratio SNR scalability Eachlayer has incremental qual

ity improvement through increasing the number of quantization levels

but the spatial resolution remains the same

Spatial scalability Eachlayer has dierent spatial resolution larger pic

tures

Temp oral scalability Eachlayer has dierent temp oral resolution more

frames p er second

Rate control It is p ossible to vary the quantization through quantizer scal e

to improve picture quality or control the bitrate It is a tradeo to either

maintain constant picture quality or constant bitrate The enco ding pro ce

dure describ ed ab ove usually generates a compressed video bit stream at a

variable bit rate if we try to maintain the same picture quality throughout

the enco ding pro cess However we can also vary the quantization so as to

keep the output bitrate constant This can b e done by feeding back the output

buer level to control the quantization step sizes More detail can b e found

in the literature such as

Error concealmentresilence Error concealment means that whenever an

error is found in the bit stream a deco der tries not to recover the error but to

make the deco ded picture less noticeable to viewers due to the error When a

deco der detects errors through external means or internally it will replace the

part in error with skipp ed macroblo cks until the next slice is received MPEG

has an error concealment feature Iframes may contain co ded motion vectors

used only for error concealment A slice in error may b e replaced with motion

FULLY DCTBASED MOTIONCOMPENSATED VIDEO CODER STRUCTURE

comp ensated pixels from previous IP frames with the help of the motion

vectors enclosed in Iframes

Advanced or enhancement mo des are new approaches to improve the compres

sion ratio or the picture quality such as unrestricted motion vector mo des

UMV advanced prediction mo de including four motion vectors per MB

and Overlapp ed blo ck motion comp ensation OBMC advanced intra co d

ing mo de mo died quantization mo de deblo cking lter mo de improved PB

frame mo de etc

Fully DCTBased MotionComp ensated Video

Co der Structure

In most international video co ding standards such as CCITT H MPEG

MPEG as well as the prop osed HDTV standard Discrete Cosine

Transform DCT and blo ckbased motion estimation are the essential elements to

achieve spatial and temp oral compression resp ectively Most implementations of a

standardcompliant co der adopt the conventional motioncomp ensated DCT video

co der structure as shown in Fig a The feedback lo op for temp oral prediction

consists of a DCT an Inverse DCT IDCT and a spatialdomain motion estimator

SDME which is usually the full search blo ck matching approach BKM This

is undesirable In addition to the additional complexity added to the overall ar

chitecture this feedback lo op limits the throughput of the co der and b ecomes the

b ottleneck of a realtime highend video co dec A compromise is to remove the lo op

and perform op enlo op motion estimation based up on original images instead of

reconstructed images in sacrice of the p erformance of the co der

The presence of the IDCT blo ck inside the feedback lo op of the conventional

video co der design comes from the fact that currently available motion estimation

algorithms can only estimate motion in the spatial domain rather than directly in

the DCT domain Therefore developing a transformdomain motion estimation

algorithm will b e able to eliminate this IDCT Furthermore the DCT blo ckinthe

feedback lo op is used to compute the DCT co ecients of motion comp ensated resid

uals However for motion comp ensation in the DCT domain this DCT blo ckcan

be moved out of the feedback lo op From these two observations an alternative

solution without degradation of the p erformance is to develop motion estimation

and comp ensation algorithms whichcanwork in the DCT domain In this waythe

DCT can b e moved out of the lo op as depicted in Fig b and thus the op er

ating sp eed of this DCT can b e reduced to the data rate of the incoming stream

Moreover the IDCT is removed from the feedback lo op which now has only two

CHAPTER MOTIONCOMPENSATED DCT VIDEO CODING

+ video in + DCT Q VLC channel

- Q-1

IDCT

SD-ME +

a Conventional hybrid motioncomp ensated DCT video co der

+ video in DCT + Q VLC channel

Q-1

TD-ME +

b Fully DCTbased motioncomp ensated video co der

Figure Dierent motioncomp ensated DCT video co der structures a motion

estimationcomp ensation are p erformed in the spatial domain b motion estima tioncomp ensation are completed in the transform DCT domain

FULLY DCTBASED MOTIONCOMPENSATED VIDEO CODER STRUCTURE

simple comp onents Q and Q the quantization pair in addition to the transform

domain motion estimator TDME This not only reduces the complexity of the

co der but also resolves the b ottleneck problem without any tradeo of p erformance

Furthermore dierent comp onents can b e jointly optimized if they op erate in the

same transform domain It should b e stressed that by using DCTbased estimation

and comp ensation metho ds standardcompliant bit streams can b e formed in ac

cordance to the sp ecication of any standard such as MPEG without any need to

change the structure of any standardcompliant deco der

Attempts have b een made recently on realizing DCTbased co ders on a limited

basis In this b o ok we present completely DCTbased motion estima

tion and comp ensation algorithms which p erform motion estimation and

comp ensation directly on the DCT co ecients of video frames rather than on pixels

In this way this fully DCTbased video co der architecture can b e realized to b o ost

the system throughput and reduce the total numb er of comp onents

In summary the resultant fully DCTbased motioncomp ensated video co der

structure enjoys several advantages over the conventional hybrid motioncomp ensated

DCT video co der structure

Less co der comp onents and complexity Removal of the DCTIDCT pair in the

feedback lo op of the fully DCTbased reduces the total numb er of comp onents

required in the feedback lo op and thus the complexity of the complete co der

Higher throughput rate The feedback lo op of a video co der requires pro cess

ing at the frame rate so that the previous frame data can be stored in the

frame memory and need to b e available for co ding the next incoming frame

Traditionally this lo op has four comp onents plus the spatialdomain motion

estimation and comp ensation unit and thus creates the b ottleneck for enco d

ing large frame sizes in real time In the conventional co der the whole frame

must b e pro cessed by b oth the DCTIDCT pair and the QQ pair b efore

the next incoming frame In the DCTbased structure the whole frame must

b e pro cessed by only the QQ pair This results in a less stringent require

ment on the pro cessing sp eed of the feedback lo op comp onents Alternatively

this may increase the throughput rate of the co der and thus allow pro cess

ing larger frame sizes when the technology keeps on improving the pro cessing

sp eed of these comp onents This high throughput advantage b ecomes increas

ingly imp ortant when the advances in optical networking technology p ermit

transmission of highquality pro ductiongrade video signals over broadband

networks in real time at aordable costs

Compatibility with existing standards The fully DCTbased structure en

co des the intraframes and motion comp ensated residuals in DCT in the same

CHAPTER MOTIONCOMPENSATED DCT VIDEO CODING

way as the hybrid structure do es The enco ded bit stream can b e made fully

compatible with the existing video co ding standards More detail on matching

dierent co derdeco der structures can b e found in Chapter

Lower computational complexity of DCTbased motion estimation and com

p ensation approaches As demonstrated later in the book the DCTbased

motion estimation and comp ensation approaches have lower computational

complexity Furthermore due to the decorrelation of DCT exploited in most

video standards most energy tend to cluster in a few DCT co ecients es

p ecially the DC terms with the rest b eing zeros after quantization This

characteristic is particularly b enecial to the DCTbased approach since no

computation is needed for the ma jority of DCT co ecients b eing zero

Joint optimization of DCTbased comp onents A fast latticestructured DCT

co der generates dual outputs DCT and DST which can b e utilized by the

DCTbased motion estimation algorithms

Extendibility to a transco der structure An optimal transco der mo dies the

enco ded video bit stream in the DCT domain directly to t dierent usage

requirements such as frame rate conversion frame size conversion bit rate

conversion etc dierent from the usage requirement originally planned for

The fully DCTbased structure handles video data completely in the DCT

domain and therefore can b e easily extended to provide a transco der function

by cascading a DCTbased deco der with certain simplication and mo dica

tion required by the end usage For example the DCT co der at the frontof

a DCTbased co der and an IDCT deco der of a DCTbased deco der can be

removed

Additional information pro cessing DCT co ecients carry certain information

which can be utilized as an example for image segmentation in the DCT

domain The DCTbased co der structure facilitates such use of DCT

co ecients

Chapter

Video Co ding Standards

With the advances in technologies such as video compression telecommunication

and consumer electronics the era of digital video has arrived One of the exciting

prosp ects of the advancements in video compression is that multimedia information

comprising image video and audio has the p otential to b ecome just another data

typ e This usually implies that multimedia information will b e digitally enco ded so

that it can b e manipulated stored and transmitted along with other digital data

typ es This new technology accelerates the availability of video applications such

as digital laserdisc electronic camera videophone videoconferencing image and

interactive video to ols on computers HDTV and multimedia systems Unlike

the digital audio technology of the past few decades the data involved with still

or motion pictures are so huge that data compression is inevitable as wehave dis

cussed in the previous chapter In principle compression metho ds are based on the

nonlinearityofhuman vision which is more sensitive to energy with lower spatial

frequency Hence pictures can b e lossily enco ded with much less data than the orig

inal image without signicantly decreasing the quality of the reconstructed image

In addition when we develop high data compression schemes to reduce trans

missionstorage capacity we also require sophisticated picture co ding technology

to integrate the whole system p erformance For suchdatausagetobe p ervasive

it is essential that the data enco ding be standard across dierent platforms and

applications This will foster widespread development of applications and will also

promote interop erability among systems from dierent vendors Thus standards

for picture co ding are strongly required Furthermore standardization can lead to

the development of costeective implementations which in turn will promote the

widespread use of multimedia information

CHAPTER VIDEO CODING STANDARDS

Overview of Video Co ding Standards

A number of existing or evolving international video co ding standards made by

ITU International Telecommunication Union formerly called CCITT and ISO

International Standard Organization are listed in Table For still images

Standards Video Co ding BitRate Applications Glue

Organization Standard Standard

ISOCCITT JPEG For still image only

ITUT H p kbitss ISDN Video Phone H

ISO MPEG Mbitss Video on CDROM pt

ITUT H Mbitss Video over ATM H

ISO MPEG Mbitss generic high bitrate pt

applications HDTV

ITUT H kbitss PSTN Video Phone H

ISO MPEG kbitss Co ding of Ob jects MSDL

ITUT H kbitss PSTN Video Phone

ITUT HL kbitss

Table Dierent video co ding standards

data compression exploits correlation in space and for video signals in b oth space

and time It is hard to distinguish a reconstructed image that was enco ded with

a compression ratio from the original Video data even after compression at

ratios at can b e decompressed with close to analog videotap e quality

The motioncomp ensated DCT video compression scheme MCDCT is the ba

sisofseveral international video co ding standards which are tabulated in Table

ranging from the low bitrate and high compressionrate videophone application

to the highend high bitrate and high quality HighDenition Television HDTV

application requiring a mo dest compression rate As mentioned in the previous

chapter the MCDCT scheme b elongs to the class of the hybrid spatialtemp oral

waveformbased video compression approaches As illustrated in Fig the MC

DCT scheme employs motion estimation and comp ensation to reduceremovetem

p oral redundancy and then uses DCT to exploit spatial correlation among the pixels

of the motioncomp ensated predicted frame errors residuals Ecient co ding

is accomplished by adding the quantization and variable length co ding steps after

the DCT blo ck Basically all the standards in Table follows this pro cedure with

mo dications of each step to reach dierent targeted bitrate and application goals

In the following the DCTbased video co ding standards will b e discussed to provide

the readers the necessary background to understand the rest of the materials in this b o ok

OVERVIEW OF VIDEO CODING STANDARDS

Input Output

Encoder Decoder

Motion Estimation and Motion Compensation Compensation

Discrete Cosine Inverse Transform DCT

Quantizer Inverse Quantizer

Coding Decoding Model Model

Entropy Entropy Coder Decoder

Transmission

Figure MotionComp ensated DCT MCDCT Scheme

CHAPTER VIDEO CODING STANDARDS

JPEG Standards

Since the mid s a joint ISOCCITT committee group know as JPEG Joint

Photographic Exp erts Group has b een working to study an ecient co ding scheme

for continuoustone still images The standard may b e applied to various elds such

as image storage color facsimile newspap er photo transmission desktop publishing

medical imaging electronic digital cameras and so forth The JPEG standard pro

vides several co ding mo des from basic to sophisticated according to the application

elds

The JPEG baseline algorithm is based on the transform co ding approach The

source image is divided into nonoverlapping blo cks of pixels which is then

transformed using the twodimensional DCT The resulting D DCT co e

cients represent the frequency content of the given blo ck where most of the energy

concentrate near the zerofrequency or direct current term Next the DCT co ef

cients are quantized Following quantization the co ecients are zigzag scanned

to arrange in the order of ascending frequency Then the DC and low frequency

co ecients are enco ded by using Humanstyle co ding schemes There is another

DCTbased JPEG algorithm called the extended system whichprovides higher com

pression p erformance through arithmetic co ding The third mo de of JPEG co ding is

the indep endent function which utilizes a D Dierential Pulse Co de Mo dulation

technique The spatial prediction algorithm has lower compression p erformance

compared with the DCTbased algorithm

Overall the DCTbased algorithms can achieve higher compression ratio but are

lossy Because our fo cus in this b o ok is video co ding we are not going to discuss

the JPEG in detail please refer to for the detail ab out JPEG

ITU H series

In parallel to ISO MPEG ITUT makeseveral H series standards for multimedia

communication These includes H H and H H is a video co ding

standard dened by the ITUT Study Group XV SG for video telephonyand

video conferencing applications It emphasizes low bit rates and the low

co ding delay It was originated in and intended to be used for audiovisual

services at bit rates around m kbitss where m is b etween and In

the fo cus shifted and it was decided to aim at bit rates around p kbitss

where p is from to Therefore H also has an informal name called p

For p lowquality video signal for use in picture phones can be transmitted

over a kbs line If p a high quality video signal for teleconferencing can

be transmitted over a Mbs line H was approved in December

Because of its bidirectional communication nature the maximum co ding delayis

OVERVIEW OF VIDEO CODING STANDARDS

sp ecied to be ms The input formats used and dened in H are CIF

Common Intermediate Format and QCIF The H enco der is a hybrid co der

which combines motion comp ensated interframe prediction with the DCT And

the co ding algorithm used in H is basically blockmatching compensation with

transform coding Such a framework forms the basis of all video co ding standards

that were develop ed later Therefore H has very signicant inuence on many

other existing and evolving video co ding standards

H was dened by ITUT SG the same group that dened H The

eort of H started around Novemb er The main goal of this endeavor was

to design a video co ding standard suitable for applications with bit rate b elow

kbitss the socalled very lowbit rate applications For example sending video

data over the public service telephone network PSTN and the mobile network

implies video rates from to kbitss During the development of H it

was identied that the nearterm goal would be to enhance H for very low

bit rate applications and the longterm goal would be to design a video co ding

standard fundamentally dierent from H to achieve even b etter quality As

the standardization activities move along the nearterm eort b ecame H and

H and the longterm eort is now referred to as HL

After the standardization of H was nished the continued interest in very

low bit rate video co ding made it clear that further enhancements to H are p os

sible in addition to the four optional mo des the unrestricted motion vector mode

the syntaxbased arithmetic coding mode the advancedprediction mode and the PB

frame mode which will b e discussed later ITUT SG therefore established the

H eort to meet the need for standardization of such enhancements of H

Similar to H H is supp osed to provide a nearterm standardization for the

applications of realtime telecommunication and related nonconversational services

These enhancements can b e either improved quality of functionalities provided by

H or additional capabilities to broaden the range of applications For example

these can be improvement of p erceptual compression eciency reduction in the

video delay or greater error resilience Since H was a near term solution to

the standardization of enhancements to H it is considered only welldevelop ed

prop osed enhancements that t into the framework of H ie motion comp en

sation and DCTbased transform co ding On the other hand HL is a parallel

activity in ITUT SG that is intended to b e a longterm eort It considers more

radical algorithms that do not need to t the H framework It is exp ected that

HL will b e aligned with the work in MPEG

Other H series standards related to H are H for multiplexing proto col

H for communication pro cedures and H for dene the telephone terminal

equipment The collection of these together with G for sp eechcodingat

CHAPTER VIDEO CODING STANDARDS

kbitss and V for the mo dem interface from the ITUT recommendations for

very lowbit rate audiovisual telephony H is often used to refer to the whole

set of standards

MPEG Standards

In in resp onse to the growing need for a common format for co ding and stor

ing digital video International Organization for Standardization ISO established

the Moving Picture Exp ert group MPEG with the mission to develop standards

for the co ded representation of moving picture and asso ciated audio information on

digital storage media MPEG completed the rst phase of its work in with the

development of ISO standard Coding of moving pictures and associatedaudio

for digital storage mediaatuptoabout Mbitss This standard is also known

as MPEG Ahybrid co ding scheme known as the motion comp ensated in

terframe prediction and DCT is used in MPEG The prediction scheme not only

predicts from the past but also from the future As a result there are three func

tions asso ciated with prediction forward motion comp ensation backward motion

comp ensation and interp olative motion comp ensation

In MPEG started the second phase of its work namelytodevelop exten

sions to MPEG that would allow for greater inputformat exibility higher data

rates as needed by highdenition TV HDTV and b etter error resilience

That work led to the ISO or ITUT Recommendation H Generic coding

of moving pictures and associated audio This standard is also known as MPEG

The MPEG and MPEG standards are now widely used in commercial

pro ducts such as CDinteractive digital video cameras and digital audio and video

broadcasting MPEG and MPEG deal with framebased video and audio and

in many applications they have oered the solution to replace analog systems that

existed b efore with a digital system The most imp ortant goal of these standards

has b een to make storage and transmission very ecient

As digital media b ecomes widely used there is a blurring of b orders between

three distinct services communications interactivity and broadcasting and their

corresp onding industry sectors namely TVlm computers and telecommunica

tions While the convergence among these sectors may take a long time the

distinction b etween the services is disapp earing For instance in recentyears inter

activity is b eing added to broadcast services and manycommunicationinteractive

applications are app earing on the Internet In anticipation of this trend in

July of The MPEG group initiated a new standardization phase referred to

as MPEG Unlike MPEG and MPEG wherein the emphasis was

primarily on co ding eciency the ob jective of MPEG was to standardize algo

rithms for audiovisual co ding in multimedia applications allowing for interactivity

VIDEO CODING STANDARDS

high compression scalabilityof video and audio content and supp ort for natural

and synthetic audio and video content In contrast to the framebased paradigm

of MPEG and MPEG MPEG prop oses an ob jectbased paradigm for scene

representation The rst version of the MPEG international standard was re

leased in Spring and we will briey intro duce the standard in Section

Please refer to Chapter for the detailed discussion

MPEG is also working on the development of MPEG called Multimedia Con

tent Description Interface for to days information retrieval systems which supp orts

accessing multimedia data anytime and anywhere MPEG sp ecies a standard

ized description of various typ es of multimedia information This description

shall b e asso ciated with the content itself to allow fast and ecient searching for

material that a user maybeinterested in However MPEG is b eyond the scop e

of this b o ok and will not b e covered

Video Co ding Standards

The video data structure in H is a hierarchical structure Each frame has a

picture layer leading with a picture header sp ecifying the picture format and frame

numb er and followed by GOBs for CIF format and GOBs for QCIF EachGOB

in turn have macroblo cks MB for b oth CIF and QCIF as shown in Fig

A macroblo ckis the basic data unit for compression mo de selection and consists

352 pixels

GOB1 GOB2 1 2 3 4 5 6 7 8 9 10 11 GOB3 GOB4 12 13 14 15 16 17 18 19 20 21 22 GOB5 GOB6 23 24 25 26 27 28 29 30 31 32 33

GOB7 GOB8 288 pixels A group of the blocks (GOB) GOB9 GOB10

GOB11 GOB12

CIF format in GOBs

8 pixel Y1 Y2 1 8 Cb Cr Y3 Y4

A Macroblock (MB) 8 lines

57 64

A block

Figure The resulting GOB structures for a frame of pictures in H

CHAPTER VIDEO CODING STANDARDS

of a macroblo ck header the compression mo de four Y blo cks luminance

comp onent one U blo ck and one V blo ckchrominance comp onent

due to subsampling the chrominance comp onents

There are two ma jor mo des in H intra mo de and inter mo de In the intra

mo de only DCT is used for compression in a similar way to JPEG image compres

sion but in inter mo de the motioncomp ensated DCT approachisappliedmotion

estimation and comp ensation is used to exploit temp oral correlation in addition to

DCT compression DCT co ecients are thresholded and then uniformly quantized

with a stepsize for the DC comp onent the DCT co ecient indexed as or

the sp ecied MQUANT value for the AC comp onents all DCT co ecients other

than the DC comp onent For AC comp onents a central dead zone around zero is

used to avoid the ringing eect The quantized DC and AC co ecients are zigzag

scanned and then enco ded with the runlength co de which sp ecies a series of events

containing a run length of zero co ecients preceding a nonzero co ecient and the

value of the nonzero co ecient Under a constant visual quality the enco ded bit

stream has a variable bit rate For ISDN transmission a xed bit rate is desired

Therefore the ratebuer control mechanism is recommended but not sp ecied in

Because datacompressed image transmission is more sensitivetochannel errors

error resilience including synchronization and concealmenttechnique is required

in the transmission co der shown in Fig

Coding control

Coded Source Video multiplex Transmission Transmission bit stream Video coder coder buffer coder

signal

Figure A blo ck diagram of the H co der and the mirrored op erations can

b e applied at the deco der side

A TextureCoding using Discrete Cosine Transform DCT

Transform co ding has b een widely used to remove redundancy between data

samples In transform co ding a set of data samples are rst linearly transformed

into a set of transform coecients These co ecients are then quantized and

entropy co ded A prop er linear transform can decorrelate the input samples and

hence remove the redundancy Another way to lo ok at this is that a prop erly chosen

transform can concentrate the energy of input samples into a few transform co ef

VIDEO CODING STANDARDS

cients so that resulting co ecients are easier to co de than the original samples

Among many transforms the DCT is most widely used in sp eech and image pro

cessing for data compression This is due to its b etter energy compaction prop erty

and its near optimal p erformance whichis closest to that of the KarhunenLo eve

Transform KLT among many discrete transforms for highly correlated signals

esp ecially for the rst order Markov pro cess Thus the DCT is a very imp ortant

technique in video signal pro cessing

A separable dimensional D N N DCT is dened as follows

N N

X X

n l m k

X k l C k C l cos xm n cos

N N N

m n

where

if k

C k

otherwise

The inverse DCT IDCT is dened as follows

N N

X X

m k n l

C k C l X k lcos cos xm n

N N N

k l

where xm n is a real numb er as dened in the IDCT equation In terms

of b oth ob jective co ding gain and sub jective quality DCT p erforms well for typical

image data After the transform the DCT co ecients are quantized which is

mainly where the compression comes from

Due to the high data rate in the video communication systems sp ecialpurp ose

DCT chip sets sometimes are required to p erform realtime computation and match

the computation sp eed For example the HDTV system prop osed by General

Instrument Corp oration requires a video data rate at Mbs In Chapter

we will present a promising DCT architecture which can achieve the high sp eed

requirement of the HDTV systems

B Motion EstimationCompensation

The transform co ding describ ed previously removes spatial redundancy in each

frame of picture It is therefore referred to as intra co ding However for video

material inter co ding is also very useful Consider the fact that typical materials

is comp osed of moving ob jects Typical video material contains a large amountof

redundancy along the temp oral axis Video frames that are close in time usually

have a large amount of similarity Therefore it is p ossible to improve the prediction

pro cess by rst estimating the motion of each regions in the scene And transmit

CHAPTER VIDEO CODING STANDARDS

ting the dierence between frame is more ecient than transmitting the original

frames This is often achieved by matching each blo ckin the currentframe with

the previous frame to nd the b est matching area More sp ecically the enco der

can estimate the motion ie displacement between the previous frame and the

current frame for each blo ck as illustrated in Fig Such area in the previous

frames is then oset prop erly to form the estimate of the corresp onding blo ck in

the current frame This is similar to the concept of dierential co ding and is also

a sp ecial case of predictive co ding The previous frame is used as an estimate of

the current frame and the residual the dierence between the estimate and the

true value is co ded Now the residue has much less energy and therefore is much

easer to co de This pro cess is called motion compensation MC or more precisely

motioncomp ensation prediction The residue is then co ded using the same pro

cess as in intra co ding When the estimate is go o d it is more ecient to co de the

residual than to co de the original frame

Frames that are co ded without any reference to previously co ded frames are

called intra frames simply Iframes or I pictures Frames that are co ded using a

previous frame as a reference for prediction are called prediction frames simply P

frames or P pictures However note that Pframe may contain not only inter co ded

blo cks but also intra co ded blo cks The reason is as follows For a certain blo ck it

may b e imp ossible to nd a well matching area in the reference frame to b e used as

prediction in which case direct intra co ding of such a blo ck is more ecient This

situation happ ens often when there is o cclusion in the scene or when the motion

is very heavy Motion comp ensation saves the bits for co ding the DCT co ecients

However it do es imply that extra bits are required to carry information ab out the

motion vectors Ecient co ding of motion vector is therefore also an imp ortant

part of H Because motion vectors of neighb oring blo cks tend to be similar

dierential co ding of motion vectors should be used This is instead of co ding

motion vectors directly the previous motion vector is used as a prediction for the

current motion vector and the residue is then co ded using VLC table

The basic principle of blo ckmatching ME is for every blo ck in the currentframe

called current blo ck to nd the b est matched blo ck within a range in the previous

frame called predicted blo ck The displacement of the predicted blo ck relativeto

the current blo ck is called a motion vector MV A motion comp ensated dierence

blo ck is formed by subtracting the pixel values of the predicted blo ck from that of

the current blo ck p ointby p oint Texture co ding is then p erformed on the dierence

blo ck The co ded MV and the co ded texture information of the dierence blo ck

are transmitted to the deco der Utilizing this information the deco der can then

reconstruct an approximated currentblockby adding the quantized dierence blo ck

to the predicted blo ck according to the MV In MEMC several basic issues need

VIDEO CODING STANDARDS

to b e considered The criteria for making decision on these issues are

to result in a small dynamic range for the dierence blo ck

to use few bits for the motion vectors and

to havea low computational complexity

Due to its lower computational complexity as compared to other dierence measures

the sum of absolute dierence SAD is used Let fci j i j N g be

the pixels of the currentblock and fpm n m n R R R

N g b e the pixels in the search range of the previous frame Then

N N

X X

jci j pi j jC for x y

i j

SAD x y

N N

X X

jci j pi x j y j otherwise

i j

where x y R R R R and C is a p ositive constant

The x y pair resulting in the minimum SAD value is called a motion vector

MV MV Because a p ositive constant C for a blo ck C is

x y

subtracted from the sum of absolute dierence when x y the MV

is favored The purp ose is to concentrate the distribution of MVs to so that

entropy co ding of the MVs is more ecient

Motion comp ensated dierence b etween blo cks is dened as

di j ci j pi MV j MV ij N

x y

The dierence blo ck fdi j g is then transformed quantized and entropy co ded

At the deco ding end the motion vector MV MV and quantized dierence blo ck

x y

fdi j g are available to reconstruct the current frame as follows

ci j di j pi MV j MV ij N

x y

C Quantization

Quantization implies loss of information so transform co ding is a lossy co ding

pro cess The quantization step size dep ends on the available bit rate and also

dep ends on the co ding mo de Two typ es of quantizers are applied for quantiz

ing DCT co ecients in the H enco derdeco der The intra DC co ecient is

CHAPTER VIDEO CODING STANDARDS

uniformly quantized with a step size of and no dead zone Each of the other

quantizers for AC and inter DC co ecients is nearly uniform but with a central dead

zone around zero as shown in Fig The step size Q is an even integer in the

Out Out

3/2Q Th+Q/2

1/2Q

-Th Th+Q -2Q -Q Q2QIn -Th-Q Th In -1/2Q

-Th-Q/2 -3/2Q

(a) unifrom quantizer for the intra DC (b) nearly uniform midtread quantizer for coefficient only the inter DC and all AC coefficients (dead zone = 2 Th; the step size Q can be changed from MB to MB in increments of 2 from 2

to 62 adaptively).

Figure Quantizers in H Th is the threshold

range of to which represents quantizers Since manyACcoecients have

nearzero levels the midtread quantizer with zero as one of the output values is

used H sp ecies only the reconstruction levels How to p erform quantization is

left to the designer For the co ecients intra ACorinter DCAC a nearly uniform

midtread quantizer is employed The input between Th and Th is quantized

to level zero Except for the dead zone the step size Q is uniform Here the dead

zone is used to quantize all other co ecients in order to remove noise around zero

All the co ecients in a macroblo ck except for the intra DC co ecients go through

the same quantizer

The quantization step size represents the distance between p ossible values of

the quantized signal By varying the step size the amount of information used to

describ e a particular pixel or blo ck of pixels can b e changed Larger step sizes result

in less information b eing required while accuracy is reduced in the representation

Smaller step sizes result in b etter quality but also in an increase in the amount

of information to b e transmitted In the H co der the length of the co ded bit

stream is dep ended on the image prop erties complexity motion or scene changes

The easy way to control the output bit rate may b e found in quantizer step sizes In

each MB p ossible compression mo des sp ecify the ma jor mo de the quantization

step size MQUANT the motion vector data MVD the co ded blo ck pattern

VIDEO CODING STANDARDS

CBP or whether a spatial lter is applied to the motioncomp ensated residuals

The quantized DCT co ecients are then converted into a onedimensional array

for entropycoding Fig shows the scan order used in H for suchconversion

Most of the energy concentrates on the low frequency co ecients and the high

AC Coefficient Start Horizontal Frequency DC Coefficient 0123 456 7

Vertical 3 Frequency 4 AC Coefficient End 5

Figure Zigzag scan order of the DCT co ecients

frequency co ecients are usually very small and are quantized to zero b efore the

scanning pro cess Therefore this scan order in Fig is chosen to create long runs

of zero ecient which is imp ortant for entropy co ding

The resulting D array is then decomp osed into segments with each segment

containing one or more or none zeros followed by a nonzero co ecient With

an event representing the pair of the number of zeros and the level of the nonzero

coecients a Human co ding table can b e built to represent eacheventbyaspecic

co deword ie a sequence of bits This co ding pro cess is sometimes called runlength

coding and the table is called a variable length co ding VLC table In H this

table is often referred to as a twodimensional D VLC table b ecause of its D

nature ie the event as run level

H is the video co ding standard for low bitrate communication over the ordinary

telephone line POTS As in other standards the video bitrate may be

variable and controlled by the terminal or the network In addition to CIF and

QCIF as supp orted by H H also supp orts subQCIF CIF and CIF with

the color space Y C rCb sampled in the format Resolutions of those picture

formats can b e found in Table and the frame rate is framessec exactly

CHAPTER VIDEO CODING STANDARDS

frames for seconds in the progressivemode

SubCIF QCIF CIF CIF CIF

No of Pixels p er Line

No of Lines

Uncompressed Bit Rate Mbs Mbs Mbs Mbs Mbs

Table Picture Formats

A Hierarchical Structure in H

H also has a hierarchical data structure with four layers

Picture Layer Each frame consists of GOBs for subQCIF GOBs for

QCIF GOBs for CIF CIF and CIF resp ectively Data for eachpicture

starts with byte aligned Picture Start Co de PSC and other picture header

information followed by GOBs an endofsequence EOS co de and stung

bits for byte alignment The typ e information in the header sp ecies the

picture typ e INTRA frame Ipicture and INTER frame Ppicture and

optional mo des Unrestricted Motion Vector mode SyntaxbasedArithmetic

Coding mode AdvancedPrediction modeandPBframes modeetc

Group of Blocks Layer Each GOB has one macroblo ck row for subQCIF

QCIF and CIF two macroblo ckrows for CIF and four for CIF Data for

GOB starts with stung bits plus byte aligned GOB Start Co de GBSC

and other header information and ends with macroblo ck data

Macroblock Layer Each macroblo ck in turn has four luminance blo cks

and two spatially corresp onding color dierence blo cks in the default

mo de The macroblo ck data include the macroblo ck header sp ecifying the

macroblo cktyp e motion vector data and blo ck data

Block Layer DC comp onents of DCT co ecients are enco ded as INTRADC

with a xedlength co de separately from the restofthe co ecients enco ded

as TCOEF with a runlength co de

B H Quantization Method

Let COF b e a DCT co ecienttobequantized LEVEL b e the absolute value

of the quantized version of the DCT co ecient and COF b e reconstructed DCT

co ecient Quantization and dequantization of the DC co ecient of an INTRA

VIDEO CODING STANDARDS

blo ck are p erformed as follows

LEVEL COF COF LEVEL

Quantization of the AC co ecients is sp ecied by a quantization parameter QP

that maytakeinteger values from to The quantization stepsize is QP

The equations for quantization are given as follows

For IN T RA LEVEL jCOF j QP

For IN T ER LEVEL jCOF jQP QP

Clipping to is p erformed for all co ecients except intra DC Dequanti

zation is dened as follows

if LE V E L

jCOF j

QP LE V E L QP if LE V E L and QP is odd

QP LE V E L QP if LE V E L and QP is ev en

COF is then obtained as COF SignCOF jCOF j Clipping to

is p erformed b efore the IDCT

C H vs H

Since H was built on top of H the main structures of the two standards

are essentially the same A few ma jor dierences are summarized in Table

Features in H but not in H

HalfPel motion comp ensation

D VLC tables

Quantization step size can change at each MB as opp osed to H that can change the

quantization step size only every MBs

Four options that are negotiable b etween the enco der and the deco der At the b eginning

of each communication session the deco der signals the enco der which of these options the

deco der has the capability to deco de If the enco der supp orts any of these options it enables

them These four options are unrestrictedmotionvector mode the syntaxbased arithmetic

coding mode the advancedprediction modeand the PBframe mode

Table H vs H

Next we will discuss those dierences in detail

HalfPel Prediction and Motion Vector Co ding

A ma jor dierence between H and H is the halfp el prediction in the

CHAPTER VIDEO CODING STANDARDS

motion comp ensation As a result the predictive co ding of motion vectors in H

is more sophisticated than that in H In H accuracy of motion vector

MV MV can be either intp el or halfp el Therefore it is p ossible to have a

x y

motion vector such as When a motion vector has noninteger values

bilinear interp olation is used to nd the corresp onding p el values for prediction In

other words for accuracy of half pixel the bilinear interp olation has to b e used on

the previous frame so that pi x j y in is dened for x or y b eing half of

an integer p els Interp olation is p erformed as shown in Fig

A B

a b

c d

C D

Integer pixel position {a, b, c, d}: Interpolated half pixel values

Half pixel postion {A, B, C, D}: Available integer pixel values

Figure Bilinear interp olation for half pixel motion estimation and comp ensa

tion a A b A B c A C d A B C D here

denotes division and roundo

The predictivecodingofmotionvectors in H is more sophisticated than that

in H The motion vectors of three neighb oring MBs the left MB the ab ove

MB and the ab overight MB are used to form the prediction of the motion vector

of the current blo ck However around a picture or GOB b order sp ecial cases are

needed Either a zero motion vector is used for a neighb oring MB that is outside

the pictureGOB or the motion vector of the only neighb oring MB that is inside

is used to replace predictors that are outside as shown in Fig

Run Length Co ding of DCT Co ecients

H improves the runlength co ding used in H by given an extra term

Last to indicate whether the current co ecient is the last nonzero co ecientof

the blo ck Therefore a set of run level last represents an event and is mapp ed

to a co deword in the VLC table hence the name D VLC With this scheme the

end of blo ck EOB co de used in H is not needed anymore A Human co ding

table is built to represent each event by a sp ecic co deword ie a sequence of

VIDEO CODING STANDARDS

Picture or GOB border Picture or GOB border

MV2 MV3 MV1 MV1

(0,0) MV MV1 MV

(a) (b)

MV: Current motion Vector, MV1: Previous motion vector

MV2: Above motion vector, MV3: Above right motion vector

Figure Motion vector prediction at pictureGOB b oundaries In a zero

motion vector is used for a neighb oring MB outside the picture In b the motion

of only inside MB is used to replace the predictors that are outside

bits Events that o ccur more often are represented by shorter co dewords and less

frequentevents are represented by longer co dewords

Table is an example of the VLC typ e of table The transform co ecients

in this table corresp ond to input samples chosen as the residues after motion com

p ensation In Table the third column represents the magnitude of the level

Last Run Level VLC Co de

Table Partial VLC table for DCT co ecients

The sign bits added at the end of VLC co de takes care of the sign of the level It

can b e seen from the table that more frequently o ccurring symb ols eg symbols

with smaller magnitudes are assigned fewer bits than the less frequently o ccurring

symbols It is reasonable to assume that symb ols of smaller magnitude o ccur more

frequently than the large magnitude symb ols b ecause most of the time wecodethe

residue found after motion comp ensation and this residue do es tend to have small

CHAPTER VIDEO CODING STANDARDS

magnitudes

Unrestricted Motion Vector Mo de

This is the rst one of the four negotiable options dened in H In the default

mo de motion vectors are restricted to the range such that all referenced

pixels are conned in the co ded picture area However in the Unrestricted Motion

Vector mo de motion vectors are allowed to point outside the picture b oundary

with a maximum range When this happ ens edge pels are rep eated

to extend to the p els outside so that prediction can be done Signicant co ding

gain can b e achieved with unrestricted motion vectors if there is movement around

picture edges esp ecially for smaller picture formats like QCIF and subQCIF In

addition this mo de allows a wider range of motion vectors than H Large

motion vectors can be very eective when the motion in the scene is caused by

camera movement

SyntaxBased Arithmetic Co ding

In the Syntaxbased Arithmetic Co ding SAC mo de all the variable length

co dingdeco ding op erations are replaced with arithmetic co dingdeco ding The use

of the variable length co dec VLCVLD usually accomplished by Human co ding

implies that eachsymbol must b e enco ded into a xed integral numb er of bits but

an arithmetic co der can remove this restriction and allow a variable nonintegral

numb er of bits of the co de length Thus signicantly fewer bits are pro duced

Exp eriments show that the average gain is ab out for inter frames and ab out

for intra blo cks and frames

Advanced Prediction Mo de

This mo de contains two basic features

Four Motion Vectors per Macroblock it implies that each blockisas

so ciated with one motion vector In general using four motion vectors gives

b etter predictions since one motion vector is used to represent the movement

of a blo ck instead of a MB In other words each luminance

blo ck in the MB is allowed to have its own motion vector This allows greater

exibility in obtaining a b est match for the MB hence when these parts are

put together a much b etter prediction for the MB is obtained Of course

this implies more motion vectors and hence requires more bits to co de the

motion vectors If the savings in the residue for the MB are oset by the

extra bits needed to send the four motion vectors there is no p oint in sending

four motion vectors Therefore the enco der has to decide when to use four

motion vectors and when to use only one With four motion vectors p er MB

VIDEO CODING STANDARDS

it is no longer p ossible to use the same scheme as in the baseline case to co de

the motion vector dierence Under such a circumstance the prediction of

motion vectors has to b e redened In particular the lo cations of the three

neighb oring blo cks of which the motion vectors are to b e used as predictors

now dep end on the p osition of the current blo ck in the MB A new set of

predictors is dened in the standards as shown in Fig

MV2 MV MV3 MV2 MV3

MV1 MV MV1 MV

Macroblock boundary

MV2 MV3 MV2 MV3

MV1 MV MV1 MV

MV: Current motion vector, MV1, MV2, MV3: Predictors

Figure Advanced prediction mo de

The predictors are chosen such that none of them are redundant For instance

MV as a predictor for the upp er left blo ck of the MB probably the choice of

would makethe choice of MV redundant This is b ecause it is quite likely

that MV and MV which come from the same MB are close together Hence

the information obtained from MV is suppressed by the median op eration

Such redundancy is avoided bypicking a motion vector from another MB

Overlappedblock motion compensation OBMC it supp orts overlapp ed blo ck

motion comp ensation OBMC involves using motion vectors of neighb oring

blo cks to reconstruct a blo ck thereby leading to an overall smo othing of

the image and removal of blo cking artifacts In addition it leads to b etter

predictions which results in a smaller bitstream

In overlapp ed blo ck motion comp ensation it allows four motion vector mo des

p er macroblo ck Every pixel in the nal prediction region for a blo ckisob

tained as a weighted sum of three values In other words for overlapp ed

motion comp ensation an luminance blo ck pi j isaweighted sum of

CHAPTER VIDEO CODING STANDARDS

three prediction values created as follows

pi j pi u j v H i j

k k k

where u v is the motion vector of the current blo ckk the blo ck

k k

either ab ove or b elowk or the blo ck either to the left or rightk of

the current blo ck pi j is the reference previous frame and fH i j k

g are dened as follows

H H H H

Here the weights are predened in the standard For instance a pixel in the

top left part of the blo ckpicks the remote vectors as the motion vectors of

the blo cks ab ove and to the left of the current blo ckasshown in Fig

(u^^ , v ) ^ ^ 1 1 (u 00 , v )

^ ^

(u 2, v 2)

Figure OBMC for upp er left half of blo ck Here u v is the motion vector

of the current blo ck u v is the motion vector of blo ckabove and u v is

the motion vector of blo ck to the left

As in the unrestricted motion vector UMV mo de this advanced prediction

mo de allows motion vectors to cross picture b oundaries Pixels outside the co ded

area are obtained by extrap olating as in the UMV case

PBFrame Mo de

In the PBframes mo de a PBframe consists of two pictures one Ppicture

predicted from the previous deco ded Ppicture and one Bpicture predicted from

VIDEO CODING STANDARDS

both the previous deco ded Ppicture and the Ppicture currently being deco ded

co ded as one unit as shown in Fig A macroblo ck in PBframe mo de

PB frame

P B P

Figure PBFrame Here the prediction of the Bblo ck requires forward and

backward predictions Pixels not predicted bidirectionally are predicted with for

ward prediction only

comprises blo cks instead of of which are for the Ppicture and the other

blo cks for the Bpicture This is dierent from MPEG as we will discuss later

where Bframes are enco ded separately from Pframes and can b e predicted from

Iframes Compared with MPEG Bframes PB frames do not need separate bi

directional vectors Instead forward vectors for the Ppicture is scaled and added

to a small deltavector to obtain vectors for the Bpicture Co ding the PBframe

as one unit will reduce the deco ding delay caused by the bidirectional prediction

which is critical to the bidirectional interactive videophone application Thus this

results in less overhead for the Bpicture part For relatively simple sequences at

low bit rates the picture rate can be doubled with this mo de without increasing

the bit rate much However for sequences with heavy motion PBframes do not

work as well as Bpictures

C Advancedcoding modes in H

After the standardization of H was nished the continued interest in very

low bit rate video co ding made it clear that further enhancements to H are

p ossible in addition to the four optional mo des the unrestricted motion vector

mode the syntaxbased arithmetic coding mode the advancedprediction mode and

the PBframe modeaswehave mentioned ab ove ITUT SG therefore established

the H eort to meet the need for standardization of such enhancements of

H In addition to the previously mentioned mo des supp orted in H H

CHAPTER VIDEO CODING STANDARDS

or H version supp orts other advanced co ding mo des as listed in Table for

the detail please refer to

Co ding Mo des Advanced Intra Coding Mode

for Eciency Alternate Inter VLC Mo de

or Improved Mo died Quantization Mo de

Picture Deblo cking Filter Mo de

Quality Improved PB Frame Mo de

Enhancements Slicestructured mo de

for Error Reference Picture Selection Mo de

Robustness Indep endent Segment Deco ding Mo de

Other Enhance Reference Picture Resampling

ment Mo des ReducedResolution Up date Mo de

Table Enhanced co ding mo des used in H

Advanced intra coding mode This mo de attempts to improve the eciency

while co ding intra MBs in a given frame using a intra blo ck prediction using

neighb oring intra blo cks b a separate VLC for the Intra co ecients and

c mo died inverse quantization for intra co ecients

Alternate Inter VLC Mode During the pro cess of inter co ding it is assumed

that the residues have signicantly less energy than the original blo cks To

take advantage of this prop erty the VLC tables used to convert these residue

blo cks or inter blo cks are dierent from the tables for the intra blo cks

Modied Quantization Mode This mo de which mo dies the quantizer op er

ation has four key features The rst feature allows the enco der a greater

exibility in controlling the quantization step size This allows the enco der

greater bitrate controlling ability The second feature allows the enco der to

use dierent quantization parameters for chrominance and luminance com

p onents Thus the enco der can use a muchner step size for chrominance

comp onents therefore reducing anychrominance artifacts The third feature

extends the DCT co ecient range so that any p ossible true co ecientcan

b e represented The fourth feature tries to eliminate co ecientlevels that are

very unlikely therefore improving the ability to detect any errors and also to

reduce the co ding complexity

Deblocking Filter Mode The deblo cking lter is an optional blo ck edge lter

that is applied to I and P pictures in the co ding lo op This lter is applied to

VIDEO CODING STANDARDS

block edges to reduce blo cking artifacts and improve p erceptible picture

quality

ImprovedPBFrame Mode The PB frames constrain the motion vectors of the

B picture to b e estimated from the motion vectors of the P picture part of the

same frame This kind of scheme p erforms p o orly for large or complex motion

scenarios when the corresp onding prediction obtained for the B pictures is

not good This improved mo de allows for distinct forward and backward

motion vectors This is dierent from the PB frame case when the forward

and backward motion vectors were b oth derived from the motion vector of

the P picture hence were closely related to each other This allows a b etter

prediction for the B pictures as in the case with the B frames in MPEG as

we will discuss later in this chapter

Slicestructuredmode this is one of the enhancements in the standard to help

in improving error resilience When this mo de is turned on the frames are

sub divided into many slices instead of the regular GOBs A slice is a group

of consecutive MBs in scanning order the only constraints are that the slice

must start at an MB b oundary and an MB can b elong to exactly one slice

This grouping of MBs into slices instead of GOBs allows the enco der much

exibility and other advantages

Reference Picture Selection Mode This option allows for selection of anyof

the previously deco ded frames within frames of the current frame or

when used with custom picture clo ck frequency within frames of the

current frame as a reference frame to generate the prediction for the current

frame This is very dierent from the baseline case when only the frame

deco ded immediately b efore the current frame here the frame may b e used

as a reference This capability of selecting dierent frames as reference is

most useful when the deco der can send information back to the enco der So

the deco der needs to b e able to inform the enco der which frames it received

correctly and which frames were corrupted during the transfer The enco der

can then use this information to select as reference only the frames the deco der

has acknowledged as received correctly

Independent Segment Decoding Mode This mo de is another enhancementfor

improved error resilience It tries to remove data dep endencies across the

video picture segment b oundaries A video picture segmentmay b e a slice or

anumb er of consecutive GOBs When this mo de is turned on the segment

b oundaries are treated as the picture b oundaries So if this option is selected

with the baseline options each picture segment is deco ded indep endentlyas

CHAPTER VIDEO CODING STANDARDS

if it were the whole picture This means that no op eration can reference data

outside the segment This option may also b e used with options like UMV or

advanced prediction in which case reference is to data outside the segment

b oundaries In such cases one derives data outside the segment b oundaries

through extrap olation of the segment as was done in the case of the entire

frame

Reference PictureResampling This option allows the resampling of the pre

viously deco ded reference picture to create a warp ed picture that can b e used

for predicting the future frames This is very useful if the current picture has a

source format dierent from that of the previously deco ded reference picture

Resampling denes the relation b etween the current frame and the previously

deco ded reference frame In essence resampling sp ecies the alternation in

shap e size and lo cation between the current frame and the previously de

co ded reference frame

ReducedResolution Update Mode This mo de allows the enco der to send the

residue or up date information for a co ded frame with reduced resolution while

keeping the ner detail information in the higher resolution reference image

The nal frame can b e reconstructed as the higher resolution from these tow

parts without signicant loss of detail Such an option is very useful when

co ding a very active or highmotion scene In this mo de MBs are assigned a

size corresp ondingly blo cks are hence there are onequarter

the numb er of MBs p er picture as b efore All motion vectors are estimated

corresp onding to these new larger MBs or blo cks dep ending on whether we

desire one or four motion vectors for the MB The deco der uses the motion

vectors corresp onding to these large MBs and blo cks for motion comp ensation

MPEG

To meet the sp ecial requirements of the digital storage media as stated earlier

additional techniques beyond the H video co der have been incorp orated in

the MPEG co der For easy reference the ma jor dierences b etween H and

MPEG are summarized in Table MPEG was approved as an ISO

standard bylate MPEG is intended for video storage and playback

on the storage devices such as CDROM magnetic tap es and hard drives at a qual

ity comparable to the VHS analog video Therefore the maximum co ding delay

in MPEG is second enough for the purp ose of unidirectional video access and

much larger than the maximum delay sp ecied in H The typical bit rate is

VIDEO CODING STANDARDS

H MPEG

Sequential access Random access

Only one basic frame rate Flexible frame rate

Only CIFQCIF format Flexible frame size

Only I and P frames I P B frames

Fullp el motion estimation accuracy Halfp el motion estimation accuracy

Filter of motioncomp ensated residuals No lter

Variable threshold uniform quantization Quantization matrix

No GOP GOP

GOB structure Slice Layer

Table Comparison of H and MPEG

Mbps at a CDROM playbackspeed for the combination of video audio and

system bitstreams allo cated as follows

video Mbps

audio Mbps

system Mbps

The basic input format is SIF in the progressive noninterlaced mo de The YCrCb

color space and the line sampling ie subsampling color frames in b oth the

horizontal and vertical directions are used The MPEG standard do es not sp ecify

an enco ding pro cess It only sp ecies the syntax and semantics of the bit stream

and signal pro cessing in the deco der as shown in Fig Hence many options

Quantizer stepsize Motion Compensation Bit Stream Output Buffer Demultiplexer 2D (8*8) Frame Dequantizer pictures IDCT reorder

Side infromation Frame memory

Motion vectors

Figure A simplied blo ck diagram of the MPEG video deco der

are left op en to the enco ders to tradeo cost and sp eed against picture quality and

co ding eciency Unlike H many video parameters such as the picture size

and the frame rate as listed in Table are changeable and can b e sp ecied in

the MPEG bitstream syntax However the maximum frame rate and size

is limited to pixelsline linesframe framessec

CHAPTER VIDEO CODING STANDARDS

Horizontal picture size p els

Vertical picture size lines

Picture area macroblo cks

Pixel rate macroblo ckss

Picture rate Hz

Motion vector range p els using halfp el vectors

Input buer size VBV mo de bitss constant

Table Summary of the constraint parameters in MPEG Here VBV stands

for video buering verier

Dierent from H the DC comp onents of all the blo cks in each frame are

group ed together and enco ded separately due to the observation that the DC com

p onents usually have dierent statistical characteristics from the rest of DCT co

ecients In MPEG the set of co ding parameters is exible However in order

to guarantee interop erability of co decs a sp ecial subset of the parameter space is

dened as Constrained Parameter Bitstream CPB to represent a reasonable com

promise well within the primary target of MPEG and serve as an optimal p oint for

cost eective VLSI implementation in technology

A MPEG Bit Stream Hierarchy

MPEG is a generic standard dening the syntax and semantics of the enco ded

bitstream and implying the deco ding pro cess without limiting which algorithms or

metho ds to use for compression The layered structure in the MPEG bit stream is

as shown in Fig Eachlayer consists of the appropriate header and following

Video sequence

Picture Group of pictures Block Macroblock 8 pixel Slice

8 pixel

Figure The layered structure in MPEG bit stream

lower layers in a manner similar to that of the H co der The layered structure

supp orts exibility and eciency in the co derdeco der Co ding pro cesses can be

VIDEO CODING STANDARDS

logically distinct and layers can b e deco ded systematically The MPEG bitstream

has a hierarchical data structure comp osed of six layers as tabulated in Table

And eachlayer supp orts a sp ecic function

MPEG Layer Purp ose

Sequence Layer Random access unit context

It contains one or more group of pictures

Group of Pictures Layer Random access unit video co ding

It is used as random access into the sequence

Picture Layer Primary co ding unit

Slice Layer Resynchronization unit

Macroblo ckLayer Motion comp ensation unit

Blo ckLayer DCT unit

Table Six layers in a MPEG bitstream

Anumber of blocks or p els in a layer are dened by an input format identied

in the sequence header

The sequence layer consists of a sequence header one or more groups of

pictures and an endofsequence co de A sequence header starting with the

header B consists of several entities Horizontal and vertical size

Pel asp ect ratio Picture rate Bit rate etc

The groupofpictures layer GOP is a set of pictures that are in a con

tinuous display order It b egins with an I or a B picture and ends with an

I or a P picture The smallest size is a single I picture and the largest is not

sp ecied in the standard The GOP header starts with the start co de

B A time co de of bits refers to the rst picture in rst picture

in the group as a unit of hours minutes and seconds A closed GOP co de

represents closed prediction in the group only or op en prediction that requires

deco ded pictures of the previous group for motion comp ensation

The picture layer is a primary co ding unit that consists of the luminance and

twochrominance comp onents The layer starts with a picture header

Some of the other header entities are temp oral reference picture

in display order picture co ding typ e I P B and D picture typ es and

forwardbackward frame co de maximum size of the forwardbackward

motion vectors up to frames

The slice layer is imp ortant in the handling of errors The deco der can skip

the corrupted slice and go to the start of the next slice if the bit stream is

CHAPTER VIDEO CODING STANDARDS

corrupted by noise The numb er of slices in a picture can range from one to

the numb er of macroblo cks in a picture dep ending up on error environments

The principal entities in the slice header are a slice start co de

an bit vertical p osition co de in a picture and a quantizer scale index in the

range that can b e changed by the next slice or the macroblo cklayer

The macroblo cklayer is comp osed of a luminance blo ck and the

corresp onding chrominance blo cks as in the H co der Fig The header

contains information suchas MB stung MB typ es I P and B pictures

quantizer scale motion vector and co ded blo ck pattern

The blo cklayer is comp osed of p els that are transformed byD

DCT The co ded blo cklayer contains the size of the DC co ecient the DC

dierences the AC co ecients and an endofblo ckEOBcode The EOB

signies that all DCT co ecients along the scan zigzag beyond the EOB

co de are zero An MB is a skipp ed MB when its MV and all the quantized

DCT co ecients are zero

B Dierent PictureTypes in MPEG

The upp erlevel layer contains several lowerlevel layers AttheSequenceLayer

a video sequence header is inserted to sp ecify the picture width and height pel

asp ect ratio frame rate bit rate and buer size At the Group of Pictures GOP

Layer several frames are group ed together to form a random access unit as shown

in Fig In other words the deco ding pro cess must start at the rst frame of

the GOP Layer At the Picture Layer frames are classied in four typ es

Intraframe I An I picture uses only transform co ding Without any ref

erence to any other frames an I frame is used to serve as the access p ointto

the sequence In an I picture all the blo cks are co ded using DCT

quantization and VLC I pictures can b e used for predicting P and B pictures

The compression rate for I frames is mo dest

Forward predicted frame P A P picture is co ded using motion comp ensated

prediction from a previous I or P picture This technique is called forward

prediction from IP to P as shown in Fig This mo de is similar to

interframe co ding in the H co der P pictures can accumulate co ding errors

P pictures go through the feedback lo op and can b e used for predicting P and

B pictures The compression rate for P frames is higher than that of I frames

Bidirectionally predicted frame B A B picture is co ded using b oth a past

andor future picture as a reference Thus it is called bidirectional prediction

VIDEO CODING STANDARDS

Motion Video

Intra−frame (Priority: Intra > Forward > Bidirectional) Forward predicted frame Motion Compensation

Bidirectionally predicted frame

Figure Groups of pictures

In other words in the case of bidirectional predicted pictures B pictures

the prediction is based on forward previous I or P picture backward future

I or P picture or b oth motion estimations as shown in Fig B picture

is an example of multihyp othesis motion comp ensation where two motion

comp ensated signals are sup erimp osed to reduce the bitrate of a video co dec

Why B pictures work atheoryofmultihyp othesis motioncomp ensated pre

diction is given in The overall advantages of B pictures are

The uncovered area can b e predicted from the next picture As a result

avery high compression rate can b e achieved

Better signaltonoise reduction is resulted from motion comp ensation on

two pictures separately

B pictures are not used for reference byany other pictures Therefore

there is no error propagation

The numb er of B pictures in a GOP is adjustable

However B pictures require more frame buers and cause a larger co ding

delay

CHAPTER VIDEO CODING STANDARDS

Now let us explain how to p erform bidirectional prediction in more detail B

pictures can b e co ded using forward or backward or b oth motion comp ensa

tion In MPEG video co ding standards within a single B picture macroblo cks

can actually b e co ded dierently For the detail ab out decision trees for co ding

macroblo cks within I P and B pictures please refer to The following are

decisions as shown in Fig needed to b e made for selecting the co ding

macroblo cktyp es in B pictures

next B picture step

Intra coded Interpolted motion compensation

Inter coded Not Coded

Coded

No change to MQUANT No change to MQUANT (May be skipped)

Forward motion compensation

next Backward motion compensation next step step next

step Change MQUANT Change MQUANT

Figure Decision tree for co ding macroblo cks in B picture

Decide if you want to use forward backward or interp olated motion

comp ensation

Decide if you want to co de the macroblo ck as an intratyp e or as an

intertyp e macroblo ck

Decide if a macroblo ck needs to b e co ded or not The entire macroblo ck

can b e skipp ed only if the previous macroblo ckwas an intertyp e mac

roblo ck and its motion comp ensation is go o d enough

Decide if the quantizer scale MQUANT needs to b e changed or not

Some of the blo cks within B picture may b e skipp ed if all the quantized DCT

values within these blo cks are zero The co ded blo ck pattern will indicate

whichoftheblocks within a macroblo ck are co ded As an example let us show

a distribution of I P and B macroblo cks for pictures of an MPEG co ded

video sequence as shown in Table Here zero MV refers to macroblo cks

that are co ded using a zero motion vector Note that for B pictures there

is a considerable numb er of macroblo cks that are co ded using predictive P

macroblo cks This usually o ccurs when there is a scene change or when ob jects

present absent in a P picture b efore B picture disapp ear app ear in the P

picture following that B picture And MPEG enco der allows for skipp ed

blo cks within a macroblo ck

VIDEO CODING STANDARDS

Picture Macroblo ckTyp e

Typ e I P B Zero MV Skipp ed

Table An example of the distribution of dierent macroblo cktyp es in a video

sequence The enco der uses a GOP of pictures with two B pictures for every P

picture and the co ded bit rate is Mbitss

Macroblo ckTyp e Predictor Prediction Error

Intra F xF x F x

F x F x Forward Predicted mv F xF x

F x F x Backward Predicted mv F xF x

Interp olated Prediction F xF x mv F x mv F xF x

Table Prediction mo des for macroblo ck in B picture Herex is the co ordinate

of the picture element mv is the motion vector relative to the reference frame F

mv is the motion vector relative to the reference frame F

CHAPTER VIDEO CODING STANDARDS

In the more general case of a bidirectionally co ded picture each x mac

roblo ckcan be of typ e Intra Forwardpredicted Backwardpredicted or In

terp olated prediction As expressed in Table the expression for the

predictor for a given macroblo ck dep ends on reference frames past and fu

ture as well as the motion vectors Motion Comp ensation MC is based on

the previous and the next I or P pictures The prediction mo de for a mac

roblo ckin B picture can be one of the following dep ending on which mo de

pro duces the smallest number of bits intra no MC forward predicted MC

based on the next IP picture backward predicted MC based on the pre

vious IP picture interp olated prediction MC based on the previous and

the next IP pictures The motion information consists of one vector for

forwardpredicted macroblo cks and backward predicted macroblo cks and of

twovectors for bidirectionally predicted macroblo cks The motion informa

tion asso ciated with each blo ck is co ded dierentially with resp ect to

the motion information present in the previous adjacentblock The range of

the dierential motion vector can b e selected on a picturebypicture basis to

match the spatial resolution the temp oral resolution and the nature of the

motion in a particular sequence the maximal allowable range has b een cho

sen large enough to accommo date even the most demanding situations The

dierential motion information is further co ded by means of a variablelength

co de to provide greater eciency by taking advantage of the strong spatial

correlation of the motion vector eld the dierential motion vector is likely

to b e very small except at ob ject b oundaries More accurate motion vectors

are required in the co der when motion interp olation is intro duced

A D picture is a sp ecial case of intra in which only the DC co ecientofeach

blo ck is co ded D pictures provide simple and fast forward mo de but

yield limited image quality

In view of the nature of B frames the frame reordering is required for enco d

ingdeco ding

Display order IBBPBBPBBI frame typ e

A frame number

At the Slice Layer

Co ding order IPBBPBBIBB frame typ e

A frame number

each slice is formed from several macroblo cks and used mainly for error recovery

At the Macroblo ckLayer a macroblo ck serves as the basic compression unit similar

to the case of H The basic DCT unit is a blo ckattheBlockLayer

VIDEO CODING STANDARDS

MPEG H and HDTV

The compression schemes discussed previously provide sp ecic applications at corre

sp onding co ding eciencies ITUT H has b een designed for video co ding and

transmission of slowmoving videophone and video conferencing signals MPEG

is aimed at systems working with Mbps Digital Storage Media DSM and

lowresolution SIF displays Consequently a fullmotion video co ding standard

MPEG has b een develop ed to meet a numb er of requirements For instance it

supp orts digital video transmission in the range of to Mbps including applica

tions for Digital Storage Media and HDTV Computer graphics multimedia and

video games are also included as the new application areas Hence MPEG provides

a generic solution for videoaudio co ding storage andor transmission worldwide

This standard is exible enough to allow b oth highp erformancehighcomplexity

and lowp erformancelow complexity co dec systems The generic MPEG co ding

standard thus was designed to meet a wide sp ectrum of bit rates resolutions b oth

spatial and temp oral quality levels and services

Dierent from MPEG or H MPEG supp orts interlaced video input im

ages which are scanned as even and o dd elds to form frames Therefore there are

two new picture typ es for interlaced video in addition to the picture typ es in the

progressive video mo de

Frame pictures are obtained byinterleaving the lines of an o dd eld and its

corresp onding even eld

Field pictures are formed from a eld of pixels alone

All these pictures can b e either I P or B frames as in the case of progressive video

The dierences of MPEG from MPEG are summarized as follows in Table

Layer MPEG diers from MPEG

Sequence more asp ect ratios and larger allowable frame size

and macroblo cks

indication of source video typ es color primaries etc

Picture userselectable DC precision

concealmentofmotionvectors for Ipictures to increase robustness

nonlinear macroblo ck quantization factor

signal source comp osite video characteristics

Macroblo ck no more macroblo ckstung

Table Dierences of MPEG from MPEG in the hierarchical layered struc ture

CHAPTER VIDEO CODING STANDARDS

MPEG is designed to b e the extension of MPEG to accommo date dierent

visual quality requirements at various bitrates and resolutions Compared

to MPEG some of the prominent features in MPEG are compatibility and

scalabilityaslistedinTable Because of the feasibility of MPEG to realize

MPEG MPEG

Video format SIF SIF

progressive progressiveinterlaced

Bit rate Variable Mbps Variable up to Mbps

Low delaymode ms ms no B pictures

Scalability SNR spatial temp oral simulcast

data partitioning

Transmission error Error protection Error resilience

DCT Noninterlaced Field progressive or frame interlaced

Motion estimation Noninterlaced Field frame and dualprime based

Motion vectors Motion vectors for PB Concealment motion vectors for I pictures

pictures only b eside MV for P B

Scanning of DCT Zigzag scan Zigzag scan alternate scan for interlaced

co ecients video

Table Functional comparison b etween MPEG and MPEG video

highquality highresolution TV applications the HDTV proto col adopts MPEG

A MPEG Proles

In order to t one standard to a variety of applications without causing unrea

sonable implementation diculties MPEG uses the concept of a prole whichis

a subset of the full p ossible range of algorithmic to ols called limit syntax in the

MPEG term for a particular application There are ve proles with a hierar

chical relationship Within each prole a numberoflevels are dened to limit the

range of parameter values reasonable to implement and practically useful called

limit parameters Therefore the syntax supp orted by a higher prole includes all

the syntactic elements of lower proles In other words for a given level aMain

prole deco der should b e able to deco de a bitstream conforming to Simple prole

restrictions For a given prole the same syntax set is supp orted regardless of level

Simple Prole It do es not allow use of B frames and scalable co ding The

maximum bitrate is Mbps This prole is intended for videotap e recording

Main Prole No scalabilityisallowed The intended use is the Studio TV

application It is exp ected that of MPEG users will use this prole

VIDEO CODING STANDARDS

SNR Scalable Prole It is the same as the Main Prole but added with

SNR Scalability is added allowing twolayers of co ding the lower layer and

the enhancementlayer using dierentquantizer step sizes for the DCT co ef

cients It can create a sharp er image when combined than that obtainable

from one layer alone

Spatially Scalable Prole It allows the deco der to cho ose dierent resolu

tions by employing a pyramidal co ding approach This prole supp orts only

the high level intended for the consumer HDTV

High Prole It is basically a scalable prole with either or

macroblo cks ie chrominance subsampling in the horizontal direction but not

in the vertical direction designed for the lm pro duction So ciety of Motion

Picture and Television Engineers SMPTE M standard

There are four p ossible levels Low Level Main Level High Level and High

Level The allowable combinations of levels and proles and the corresp onding

sampling density are shown in Table where a scalable MPEG video stream

can b e broken into dierentlayers base or lower layers high priority bitstream

and enhancementlayers low priority bitstream

Spatial Prole

resolution

layer

Level Simple Main SNR Spatial High

High Enhancement

Lower

High Enhancement

Lower

Main Enhancement

Lower

Low Enhancement

Lower

Table Maximum sampling density in dierent combinations of levels and

proles x y t means x pixelsline y linesframe and t framessec

B Scalable coding techniques

CHAPTER VIDEO CODING STANDARDS

Scalable video co ding is useful for a numb er of application in which video needs

to be deco ded and displayed at a variety of resolutions temp oral or spatial and

quality levels for example multip oint video conferencing windowed display on

workstations video communications on asynchronous transfer mo de ATM net

works and HDTV with embedded standard TV The easiest way to cop e with

these demands is the simulcast techniques which is based on transmitting a set of

various contents of an enco ded video sequence simultaneously In this case

channel bandwidth is not used eciently since the bandwidth should be shared

among various resolutions

An ecient alternative to simulcast is scalable video co ding in which the band

width allo cated to a given scale can b e reused through partially for co ding other

scales Multiuse of the bandwidth is a condition for scalable video co ders Actu

ally the enhancementlayer is co ded by utilizing information from the baselower

layer enco der In MPEG if there are several scales layers the lowest layer is

called baselower layer and the others are called enhancementlayers Scalable video

co ding can b e achieved in the spatial temporal SNR and data partitioning

Spatial scalability It deals with dierent resolutions which can b e varied

by the decimation and interp olation techniques The enhancement layer is

predicted not only from previously deco ded pictures temp oral prediction

but also from deco ded and upsampled pictures of the lower layer

Temp oral scalability It deals with dierent frame rate temp oral resolu

tion It involves at least two layers Both the lower and the enhancement

layers pro cess the same spatial resolution pictures but at dierent temp oral

resolutions The enhancementlayer has a full temp oral picture rate

SNR scalability It provides two or more video layers of the same spatial

resolution but at dierent qualities The enhancement layers contain only

co ded renement data for the DCT co ecients of the base layer

Data partitioning It splits a video bit stream into two or more layers

called partitions A priority breakp oint in the slice layer header indicates

which syntax elements are placed in partition or partition Partition is

called the base partition or highpriority partition Partition is called the

lowpriority partition For the DCT co ecients the lower the frequency of

the co ecients the more imp ortant is the picture quality and the higher the

priority

VIDEO CODING STANDARDS

MPEG

The MPEG was historically supp osed to b e low bitrate co ding in that is

audio and video co ding at data rates below kbits For video co ding

the goal was to develop algorithms to outp erform the then stateoftheart co ding

standard H by a factor of ten in terms of compression Later MPEG shifted

its fo cus of this very ambitious lowbit rate co ding Its activities since have

resulted in extended applications to ols algorithms proles and bit rates for arbi

trary shap e audiovisual natural and synthetic ob jects The ob jectiveistodevelop

a exible and extensible co ding standard And it facilitates the users ability to

achievevarious forms of interactivity and to mix synthetic and natural audiovisual

information in a seamless way The MPEG have six parts as shown in Table

And its rst version b ecame international standard in the spring of

ISO Number MPEG Part Name

Systems

Visual

Audio

Conformance Testing

Technical Rep ort

DSMCC Multimedia Integration Framework

Table Six parts of MPEG standard Here DSMCC stands for Digital Stor

age Media Command and Control

As of the visual part MPEG fo cuses on

A set of coding tools for audiovisual objects These ob jects can be video

ob ject of arbitrary shap e audio ob jects or combined audiovisual natural and

synthetic ob jects New functionalities in MPEG are outlined in Table

A syntactic language MPEG Syntactic Description Language MSDL as

listed in Table to describ e b oth the co ding to ols and the co ded ob jects It

is used not only for description of the bit stream structure but also for cong

uration and programming of the deco der MSDL is a exible and extensible

description language that allows selection description and downloading of

to ols algorithms and proles

We will discuss the MPEG in more detail in Chapter

CHAPTER VIDEO CODING STANDARDS

Functionalities Detail description

Interactive The user should b e able to inuence the presentation of audiovisual content

Contentbased An ob jectbased data representation should allowcontentbased access to

multimedia data

Universal Access to MPEG data and communications should b e p ossible using any

Accessibility communications network

Flexible MPEG data streams should b e scalable such that they can b e pro cessed by

receivers with dierent levels of computational p ower

Extensible The transmitter should b e able to congure the receiver in order to download

new applications and algorithms

Table The summary of functionalities which MPEG supp orts

Chapter

MPEG and Contentbased Video Co ding

Prior to MPEG the MPEG of ISO International Standards Organization and

the ITUT International Telecommunication Union Telecommunications have

develop ed video compression standards

MPEG was develop ed for use in CDROM and PC industries and has

a target bitrate of Mbps

MPEG was develop ed for use in the home entertainmentmarket suchas

HDTV High Denition TV with a target bitrate of Mbitss Mbitss

H and H were develop ed for the application of full duplex

video conferencing over ISDN and POTS transmission lines resp ectively

In those cases a xed set of techniques is included in the standards targeting a lim

ited set of applications The transmission channels asso ciated with each application

are well known and considered to b e very reliable ie the probability of a residual

bit error corrupting the video data is extremely low This a priori knowledge of

the transmission channels was utilized during the design and development of the

algorithms In the case of MPEG additional proles were later added to the

standard to allow its use in dierent applications

Anticipating the rapid convergence of telecommunications computer and TVlm

industries the MPEG group ocially initiated a new MPEG standardization

phase in with the mandate to standardize algorithms for audiovisual co ding

in multimedia applications allowing for interactivity high compression andor uni

versal accessibility and p ortability of audio and video content The MPEG rst

version b ecame international standard in the spring of In this chapter we are

going to briey overview the new standard in Section Section Then we

will discuss how to extend our DCT domain motion estimationcomp ensation for

MPEG applications in Section

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

Functionalities Detail description

Interactive The user should b e able to inuence the presentation of audiovisual content

Contentbased An ob jectbased data representation should allowcontentbased access to

multimedia data

Universal Access to MPEG data and communications should b e p ossible using any

Accessibility communications network

Flexible MPEG data streams should b e scalable such that they can b e pro cessed by

receivers with dierent levels of computational p ower

Extensible The transmitter should b e able to congure the receiver in order to download

new applications and algorithms

Table The summary of functionalities which MPEG supp orts

Overview of MPEG Standard

The MPEG was historically supp osed to b e low bitrate co ding in that is

audio and video co ding at data rates below kbits For video co ding

the goal was to develop algorithms to outp erform the then stateoftheart co ding

standard H by a factor of ten in terms of compression Later MPEG shifted

its fo cus of this very ambitious low bitrate co ding towards new functionalities as

outlined in Table Let us explain some of new functionalities intro duced in

MPEG which are not covered in the previous multimedia standards in more

detail

Universal accessibility is the ability to access audiovisual data over a wide

variety of storage and transmission channels It covers Robustness in error

prone environments and Contentbased scalability To truly supp ort this

functionality it implies that a user can access video information over dierent

kinds of transmission channels wired or wireless channels Obviouslythese

channels will not have the same error characteristics or bandwidth Therefore

the error resilience and scalability to ols discussed later in this chapter are

extremely imp ortant when attempting to supp ort this universal accessibility

Ob jectbased interactivity is a functionality that provides the user with

the ability to interact with ob jects of an audiovisual scene in a meaningful

way It covers Contentbased manipulation and bit stream editing Content

based multimedia data access tools Hybrid natural and synthetic data coding

and Improved temporal access To supp ort this functionality it

implies that the video scene is co ded such that a particular video ob ject is

distinguishable from the other ob jects of the scene By utilizing to ols such

as ob jectbased scalability shap e co ding and sprite co ding in combination

OVERVIEW OF MPEG STANDARD

with the shap e adaptive DCT MPEG is able to supp ort this ob jectbased

interactivity

MPEG Architecture

A general MPEG video co ding system is depicted in Fig At the enco der

the video ob jects and their spatiotemp oral relationships needed by the deco der

are enco ded into bit streams These bit streams after optional errorprotection

are multiplexed with stored ob jects and then transmitted downstream to the de

co der The bit streams can be transmitted across multiple channels where each

channel oers a dierent quality of service This p ermits dierent ob jects to be

reconstructed at the deco der at dierent qualities The multiplexer in the MPEG

system combines the elementary data streams into one output data stream The

multiplexor also provides functions needed to recover the system clo ck synchronize

multiple streams interleavemultiple streams used by the comp ositor at the deco der

side etc

At the deco der the comp ositor uses the spatiotemp oral relationships and user

interactions to render the scene The deco der can use the interaction information

lo cally or it can transmit it upstream to the enco der so that the enco der can gen

erate the scene as desired by the user Note that supp ort for deco derenco der

interactivity is not explicit in the MPEG and MPEG co ding standards Before

video ob jects are transmitted the source co der and deco der exchange conguration

information This allows the source to determine which class of algorithms to ols

and other ob jects are needed by the deco der to pro cess the video ob jects Then the

denitions of any missing classes are downloaded to the MPEG deco der This is

distinct from the MPEG and MPEG hardwired push mo del where the mo del

and capabilities of the deco der are assumed aprioriby the enco der

Toenvision the MPEG ob jectbased interactivity let us take a lo ok at a simple

example as shown in Fig In this simple example an image scene contains a

numb er of video ob jects A and B It is attempted to enco de the sequence in such

away that it will allow the separate deco ding and reconstruction of the ob jects and

allow the manipulation of the original scene by simple op erations on the bit stream

The bit stream will be object layered A and B ob ject layers The shap e and

the spatial co ordinates as well as other additional parameters ie ob jects scaling

rotation or related parameters are describ ed in the bit stream of each ob ject layer

The receiver can reconstruct the entire original sequence by deco ding all object layers

and display the ob jects with original size at the original lo cation Other than the

previous ob jectbased video deco ding it is also p ossible to manipulate the video

scene with some simple op erations For example the new ob ject C from lo cal

image library can b e added and mixed with the original video scene In addition

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

Stored Local Video Objects Objects C Objects B B B Encoder MUX DE- Decoder Compositor A Channel MUX A A

Display User C Interface

Figure Schematic overview of an MPEG video co ding system

we can rearrange the scene by rotating the ob ject A Since the bit stream of the

sequence is organized in ob ject layered form the manipulation is p erformed on the

bit stream level without the need for further transco ding

Unlike the previous MPEG and MPEG approaches the MPEG standard

supp orts a rich set of data typ es natural and synthetic D D audiovisual ob

jects and a syntax for describing complete animatedscenes Furthermore MPEG

images as well as image sequences are in general considered to be arbitrarily

shap ed in contrast to the standard MPEG and MPEG rectangular denitions

Because it do es not always make sense to sp ecify a rigid standard addressing just

one application MPEG standard concentrates on supp orting those functionali

ties common to clusters of applications in the computer telecommunication and

entertainment ie TV andor lm industries Basically MPEG is a new co ding

standard intendedtoprovide a exible framework and an op en set of co ding to ols

for communication access and manipulation of digital audiovisual data Through

the exible framework of MPEG various combinations of these to ols and their

corresp onding functionalities will b e utilized to supp ort particular applications re

quired by those industries The MPEG System Description Language MSDL is

designed to glue those functionalities together

Although MPEG standard includes video audio graphics synthetic and nat

ural hybrid co ding SNHC and systems we will only discuss the visual p ortion

of MPEG in this chapter whichprovides the core technologies allowing ecient

storage transmission and manipulation of video data in multimedia environments

During the pro cess of developing the MPEG video standard the exp ert group

fo cuses on development of Video Verication Mo dels VMs The VM is a common

platform with a precise denition of enco ding and deco ding algorithms whichcan

OVERVIEW OF MPEG STANDARD

b e presented as to ols addressing sp ecic functionalities New algorithmsto ols are

added to the VM and old algorithmsto ols are replaced in the VM by successful

core exp eriments As we have mentioned previously the MPEG video co ding

standard fo cuses on providing solutions in the form of to ols and algorithms which

enable common functionalities suchasecient compression ob ject scalabilityspa

tial and temp oral scalability and error resilience Each VM addresses the increasing

numb er of desired functionalities suchas

Ecient compression For most applications involving digital video such

as video conferencing Internet video games or digital TV co ding eciency

is essential Therefore many dierent video co ding algorithms have b een

prop osed to reduce the bandwidth requirement for transmission and storage

of video information MPEG evaluated over those metho ds

intended to improve the co ding eciency of existing standards The target of

MPEG is to provide exible multimedia communications within the range

of kbitss Mbitss

Shape and alpha map coding The shap e of a D ob ject is describ ed byal

pha maps Multilevel alpha maps are frequently used to blend dierentlayers

of image sequences for the nal lm Other applications that b enet from

asso ciating binary alpha maps with images are content based image represen

tations for image databases interactive games surveillance and animation

Arbitrarily shaped region texture coding Co ding of texture for arbitrarily

shap ed regions is required for achieving an ecient texture representation for

arbitrarily shap ed ob jects Hence these algorithms are used for ob jects whose

shap e is describ ed with an alpha map

Error resilience The error resilience addresses the problem of accessing

video information over a wide range of storage and transmission media In

particular due to the rapid growth of mobile communications it is extremely

imp ortant that access is available to audio and video information via wireless

networks This implies a need for the useful op eration of audio and video

compression algorithms in errorprone environments at low bitrates ie less

than kbps MPEG Video Group evaluated to ols for video compression

which address b oth the band limited nature and error resilience asp ects of the

problem in providing access over wireless networks

Multifunctional coding tools and algorithms Multifunctional co ding is aim

ing to provide to ols to supp ort a number of content based and other function

alities For instance for Internet and database applications ob ject based

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

spatial and temp oral scalability are provided for content based access Like

wise for mobile multimedia applications spatial and temp oral scalabilityare

essential for channel bandwidth scaling for robust delivery Multifunctional

co ding also addresses multiview and stereoscopic applications as well as repre

sentations that enable simultaneous co ding and tracking of ob jects for surveil

lance and other applications Besides the aforementioned

applications a number of to ols were develop ed for segmentation of a video

scene into ob jects and for co ding noise suppression

MPEG Video Co ding

The motion and texture co ding techniques in the MPEG are direct extensions

of those used in traditional video co ding Thus the blo ck matching and the

Discrete Cosine Transform DCT are still the basic techniques This ensures

that MPEG video co ding is as ecient as traditional video co ding for traditional

rectangular frames of image sequences and provides ob jectbased functionalities

for new applications Since an imp ortant feature of MPEG is its exibility of

conguring various to ols for a given application more motion and texture co ding

to ols that are very dierent from those in traditional video co ding are included to

further improve co ding eciency

In MPEG each video frame is segmented into a numb er of arbitrary shap ed

image regions called video ob ject planes VOP The word segmentation has a

meaning that dep ends to a large extent on the application and the context in which

it is used The basic goal of any segmentation algorithm is to dene a partition

of the space In the context of image and video the space can be temp oral

onedimensional D spatialD or spatiotemp oral D

Segmentation can b e an extremely easy task if one has access to the pro duction

pro cess that has created the discontinuities For example the generation of a

synthetic image or of a synthetic video implies the mo deling of the D world

and of its temp oral evolution During the creation itself it is very easy to

recover and store the D b oundaries of the various ob jects Another example

is video editing which creates a large numb er of discontinuities either in space

or in time Spatial discontinuities are created by combining foreground

ob jects that have b een lmed over a blue screen with a background sequence

that has been taken indep endently Temp oral transition are pro duced by

cutting and concatenation rushes In b oth cases the discontinuities detection

is trivial if one has access to the information at this level of pro duction

Segmentation can also b e an extremely dicult task if the segmentation in

MPEG VIDEO CODING

tends to estimate what has b een done during the pro duction or online pro cess

Wehave to recognize that the state of the art has still to b e improved to lead

to robust segmentation algorithms able to deal with generic images and video

sequences please refer to for an overview

In this chapter we assumed that the video source either already exists in terms of

separate entities ie is generated with chromakey technology or is generated

by means of online or oline segmentation algorithms Notice that the pro cess of

segmentation is outside the scop e of the MPEG standard Similar to MPEG

and MPEG MPEG sp ecies only the minimum set of functions that are needed

for interop erability Successive VOPs b elonging to the same physical ob ject in

a scene are referred to as video ob jects VO A VO in MPEG is equivalentto

a GOP group of pictures in the MPEG and MPEG standards The shap e

motion and texture information of the VOPs b elonging to the same VO is enco ded

into a separate video ob ject layer VOL Then this information is multiplexed into

a VOL bit stream as shown in Fig in the order of the co ded shap e informa

tion followed by motion and texture co ded data Here motion vectors and DCT

Video Object0 Video Object0 Encoder Decoder System Demultiplexer System Multiplexer Video Objects Video Video Object1 Video Object1 Video Objects Video Segmenter/ Encoder Decoder in Formatter Compositer out

Video Object2 Video Object2

Encoder Decoder

Figure MPEG video co ding Based on the VOP shap e information each

VOP in a VO is separated by the VOP denition blo ck

co ecients can be co ded either jointly as in H or separately In addition

relevant information needed to identify each of VOLs and how various VOLs are

comp osed is also enco ded This allows for selective deco ding of VOPs and also

provides ob jectlevel scalability at the deco der

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

a b

c d

e f

Figure Envision the concept of VOPs using News test sequence as an example

a One frame taken from the original scene b efore segmentation b background

VOP c foreground VOP d foreground VOP e foreground VOP f

the binary alpha plane of foreground VOP

MPEG VIDEO CODING

Overview of MPEG Video Co ding

The notion of VOPs and their use in video co ding in MPEG is illustrated in

Fig Here we use the actual MPEG video test sequence News in CIF

format with frame size as an example for illustration This sequence

b elongs to the class of medium spatial detail and low amountofmovement We

can co de the video sequence in twoways

The entire frame comprising the background and foreground can b e classied

as a single VOPThentheVOP co ding b ecomes a straightforward application

of MPEG and MPEG co ding techniques

Alternatively by applying segmentations we can decomp ose the scene into

four VOPs say VOP for the background ob ject in Fig b VOP in

Fig c and VOP in Fig d as well as VOP in Fig e for the

foreground ob jects A binary alphaplane as depicted in Fig f is co ded

in this example to indicate to the deco der the shap e of the foreground ob ject

VOP and its lo cation with resp ect to background VOP The shap e infor

mation hereafter is also referred to as alpha plane In general MPEG may

supp ort the co ding of grayscale alpha planes to allow the deco der to comp ose

the VOPs with various levels of transparency

We can enco de the VOPs using dierentcodingschemes either nonoverlap or

overlap co ding Toenvision those co ding schemes we consider a simple exam

ple by co ding VOP and VOP as shown in Fig a and b resp ectively

Note that the two regions covered by VOP and VOP are nonoverlapping

Furthermore the sum of pixels covered by these twoVOPs is identicaltothe

image sequence as shown in Fig c Since eachVOP is co ded separately

based on the deco ded information from the alpha channel for nonoverlap

co ding the deco der can either deco de and display each VOP separately or

reconstruct the entire original sequence by deco ding and comp ositing b oth

VOPs Other than the nonoverlap co ding MPEG also supp orts the over

lapping conguration for VOPs For instance if the entire background frame

as shown in Fig d is known aprioriat the enco der the foreground VOP

can then b e as shown in Fig b Since the background is stationary only

one frame needs to b e co ded for the background Thus the foreground and the

background can have dierent display rates at the deco der which is called

temporal scalability in MPEG In principle we can select either nonoverlap

or overlap co ding scheme based on the character of input image sequences

The VOP co ding pro cess for this example is summarized in Fig Because

MPEG supp orts contentbased scalability the comp ositor at the deco der side can

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

a b

c d

Figure Dierentcodingschemes nonoverlapping vs overlapping coding a

nonoverlapping background VOP b foreground VOP c scene after non

overlapping co ding bycombining VOP and VOP d background VOP in this

case is a stationary rectangular image only co ded once

MPEG VIDEO CODING

either cho ose to only deco de certain VOPs of interest or even edit the scene by

deleting the VOPs from the original scene and adding the new VOPs from lo cal

database as shown in Fig Let us take News video co ding again as an

example to illustrate contentbased scalability The comp ositor at the deco der

sayonlydecodes VOP in Fig d as the foreground and VOP in Fig b

as the background As a result the nal reconstructed video scene is as shown in

Fig c instead of the complete scene as shown in Fig a

For eachVO the shap e motion and texture information of VOPs comprising

the VO are co ded After intro ducing the overall video co ding schemes we will fo

cus on howto co de this information for individual VOP Ob ject based temp oral

scalability and spatial scalability can b e achieved by means of VOLs which corre

sp ond to either the base layers or enhancement layersofaVOP One imp ortant

feature of MPEG video co ding is its ability to co de an arbitrarily shap ed VO

sp ecial care has to b e taken for motion estimation and comp ensation MEMC as

well as the DCT of the boundary blocks of an arbitrarily shap ed VOP Besides the

MPEG supp orts three more advanced techniques namely unrestricted motion

vector advanced prediction and bidirectional MEMC There are many ways in

which the shap e motion and texture information will b e co ded We will restrict

our discussion to the baseline scheme as adopted by the MPEG

Arbitrarily Shap ed Region Texture Co ding

The intra VOPs and the residual errors after motion comp ensated prediction are

co ded using DCT on blo cks in a manner similar to that employed in MPEG

MPEG H and H After computing the DCT zigzag scanning and

quantization are applied same as those in the previous standards Here two

scalar quantization metho ds namely H and MPEG quantizations are used In

addition variable length co de VLC of DC and AC is applied for entropy co ding

In addition the MPEG supp orts the texture co ding of arbitrarily shap ed

VOP Macroblo cks can b e classied as either standard or contour macroblo cks the

transparent blo cks are skipp ed and not co ded

For a standard macroblo ck where all of its pixels are inside the activeVOP

area as shown in Fig techniques identical to that describ ed in MPEG

and MPEG can b e used The macroblo cks that do not b elong to the arbi

trary shap e but inside the b ounding b oxofaVOP as shown in Fig are not

co ded at all For each macroblo ck there could b e four luminance blo cks

and two chrominance blo cks As in the motionestimation step

blo cks well within the VOP active area can be co ded in a straight forward manner

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

For a contour macroblo ck some of its pixels may b e outside the activeVOP

area see Fig The blocks that b elong to the macroblo cks on the

b order of the VOP shap e may b e co ded bytwo dierenttechniques namely

low pass extrapolation LPE padding and shape adaptive DCT SADCT

SADCT is more complex but has a higher co ding eciency for the b oundary

blo cks For the co ding of motioncomp ensated prediction error blo cks P

VOPs that straddle the VOP b oundary pixels outside the active area are set

to a value of prior to DCT co ding

Motion Estimation and Comp ensation

Temp oral redundancies b etween video content in separate VOPs within a VOare

exploited using blo ckbased motion estimation and comp ensation In general these

techniques can b e viewed as extensions of the standard blo ckmatching techniques

used in MPEG MPEG H and H to image sequences of arbitrary shap e

To p erform blo ckbased motion estimation and comp ensation b etween VOPs of

varying lo cation size and shap e a shap eadaptive macroblo ck approachshown in

Fig is used The reference window is the original images b order A shift

Reference window Shift VOP window

contour macroblock

Standard

macroblock

Figure Macroblo ck grid for co ding VOP Here macroblo cks can b e classied as

either standard or contour macroblo cks

MPEG VIDEO CODING

parameter is co ded to indicate the lo cation of VOP with resp ect to the b orders of

the reference window AVOP window surrounding the foreground video ob ject is

restricted to b e a multiple of pixels in b oth the horizontal and vertical directions

Furthermore it is p ositioned such that it contains the minimum numberof

blo cks of pixels which are not transparent Pixels which are outside the b ounding

box are treated as transparent pixels

Like the arbitrarily shap ed region texture co ding any of the motion estimation

and comp ensation techniques for MPEG and MPEG can be used for a stan

dard macroblo ck However the motion estimation of a contour macroblo ckhasto

be mo died from blo ck matching to p olygon matching Furthermore a sp ecial

padding technique ie the macroblo ckbased rep etitive padding is required for the

reference VOP as shown in Fig The details of these techniques are describ ed

as following

A Macroblockbased Padding of VOP

The macroblo ckbased padding pro cess allows the deco der to pad a macroblo ck

as so on as it is reconstructed as depicted in Fig The padded VOP is then

used for motion comp ensation At the enco der a reference VOP is padded in a similar manner for motion estimation prior to motion comp ensation

Inverse Inverse Quantization DCT

Motion Macroblock-based Compensation padding

Frame

Memory

Figure Macroblo ckbased padding in MPEG deco der corresp onding to the

video ob ject deco der shown in Fig For this illustration we use the the sim

plied view of MPEG deco der which do es not include the shap e deco ding VOP

comp ositor etc

The padding pro cess is as follows The frame memory see Fig is rst

initialized with the value of for the luminance and chrominance comp onents

Then the contour blo cks are padded using rep etitive padding describ ed next To

cop e with VOP of big motions the padding is further extended to blo cks which are

completely outside the VOP but immediately next to b oundary blo cks as shown

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

in Fig These blo ck are padded by replicating the samples of padded adjacent

b oundary blo cks as shown in Fig If a blo ck is next to two or more b oundary

adjacent padded boundary block block

repeat

Figure Extended padding for VOP with big motions

blo cks the blo ck is padded by replicating the samples at the b order of one of the

b oundary blo cks determined according to the following convention

Let the b oundary blo ck at the b ottom of a target blo ckbenumb er the one

on top the one on its right and the one on its left The target blo ckis

then padded by replicating the samples at the b order of the b oundary blo ck

with the largest numb er Fig shows an example of a VOP after extended

padding Note that the padded area covers the region outside the tightest

b ounding blo cks

Repetitive Padding

Extended

Padding

Figure VOP after rep etitive padding and extended padding

MPEG VIDEO CODING

B Repetitive Padding Technique

Steps Detail Pro cedures

Consider each undened pixel outside the ob ject b oundary a zero pixel

Scan each horizontal line of a blo ckablock could b e or Each scan

line is p ossibly comp osed of two kinds of line segments

zero segments that have all zero pixels within each segment

nonzero segments that have all nonzero pixels within each segment

If there are no nonzero segments do nothing

Otherwise there are two situations for a particular zero segment

it can b e p ositioned b etween an end p oint of the scan line and the end p ointof

a nonzero segment Then ll all of the pixels in the zero segments with the

pixel value of the end p oint of the nonzero segment

it can b e p ositioned b etween the end p oints of two dierent nonzero segments

Then ll all of the pixels in the zero segments with the average pixel value of

the two end p oints

Scan eachvertical line of the blo ck and p erform the identical pro cedure as describ ed

in Step to eachvertical line

If a zero pixel can b e lled in by b oth Steps and the nal value takes the

average of the two p ossible values

Consider the rest of zero pixels

scan any one of them horizontally to nd the closest nonzero pixel on the same

horizontal scan if there is a tie the nonzero pixel to the left of the current

pixel is selected

scan any one of them vertically to nd the closest nonzero pixels on the same

vertical scan if there is a tie the nonzero pixel on the top of the current

pixel is selected

Replace the zero pixel by the average of these two horizontally and vertically closest

nonzero pixels

Table Macroblo ckbased rep etitive padding pro cedures

The macroblo ckbased rep etitive padding pro cess illustrated in Fig consists

of ve steps as listed in Table As an example the VOP of News sequence

after the macroblockbasedrepetitive and extendedpadding are shown in Fig a

and Fig b resp ectively

C ModiedBlock Polygon Matching

After padding the reference VOP the motionestimation and comp ensation pro

cess of the contour macroblo cks is the same as in the case of standard macroblo cks

except that during blo ckmatching only pixels b elonging to the active area of the

VOP are used in the motion estimation pro cess Here the alpha plane for the VOP

is used to exclude the pixels of the macroblo ck that are outside the VOP This forms

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

a b

c d

Figure Illustration of rep etitive padding in Table a horizontal padding

b vertical padding c average the horizontal and vertical padding d exterior

pixels padding

a rep etitive padding of VOP b extended padding of VOP

Figure Padding technique is employed on VOP of News test sequence

MPEG VIDEO CODING

a p olygon of the macroblo ck that are on the VOP b oundary as shown in Fig

as an example Due to its lower computational complexityascomparedtoother

Transparent pixels

Pixels of polygon Macroblock

VOP

Figure Polygon matching for an arbitrary shap e VOP

dierence measures the sum of absolute dierence SAD is used in MPEG as

error measure and is computed only for the pixels with nonzero alpha value It is

dened as

N N

X X

jci j pi j ji j C

i j

for x y

SAD x y

N N

X X

jci j pi x j y ji j

i j

otherwise

where fci j i j N g are the pixels of the current VOP and

fpm n m n R R R N g b e the pixels in the search

range R of the reference VOP Motion vector x y fR R R

Rg and i j is the alpha comp onent sp ecifying the shap e information Further

more i j in implies that we only compute SAD for those macroblo ck

containing video ob ject The SAD computation in is further divided into two

cases with or without signicantmotion For those macroblo ck without signicant

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

motion ie x y SAD is reduced by constant C C

where N is the numb er of pixels inside the blo ck The purp ose of this reduction is

to concentrate the distribution of motion vectors to co ordinate so that entropy

co ding of the zero dierence motion vectors is more ecient

Arbitrary Shap e Co ding

The representation of an ob jects shap e has b een shown to b e very useful in many

elds of image and video pro cessing Sp ecically the utilization of shap e informa

tion in the areas of image analysis image compression computer vision and graphics

has b een throughly investigated These investigations have led to the development

of several techniques for ecient representation of shap e information MPEG is

the rst attempt at providing a standardized approach to the representation of an

ob jects shap e within a video bitstream

In MPEG it is assumed that each video ob ject is provided with its corresp ond

ing shap e information This shap e information is provided in one of two formats

binary format or grey scale format The binary format for the shap e information

consists of a pixel map which is generally the same size as the b ounding box of

the corresp onding VOP Each pixel takes on one of two p ossible values indicating

whether it is contained within the video ob ject or not The grey scale format is

similar to the binary format with the additional feature that each pixel can take

on a range of values ie usually b etween and These values represent the

transparency of that pixel Avalue of corresp onds to a video ob ject which is com

pletely transparent while a completely opaque video ob ject would b e represented

by pixel values of Video ob jects whose shap e are represented byvalues b etween

and corresp ond to an intermediate level of transparency This approachto

representing the shap e of a video ob ject along with its transparency is very similar

to the alpha plane approach used in computer graphics

Both the binary and grey scale formats represent the shap e of video ob ject as a

matrix of binary or grey values resp ectively This matrix of values is referred to as

a bitmap The suitable shap e co ding metho ds include

Contourbased methods extract and co de the contour residing on the b ound

ary of the video ob jects The contourbased metho ds transform the source

binary image to another binary image where contour pixels are distinguished

from all other pixels The vertexbased coding andchain coding

are prominentcontourbased approaches The disadvantage is that these ap

proaches dont work within the conventional blo ck based video co ding frame

work

Bitmapbased methods are applied directly to the source binary images

MPEG VIDEO CODING

The modiedREAD MR method andcontextbased arithmetic encoding

CAE are commonly used bitmapbased approaches

Chromakeying is an implicit metho d for shap e co ding wherebythe binary

alpha comp onent of the ob ject is actually merged into the YUV comp onents

The YUV comp onents are then enco ded by the texture enco der Chroma

keying allows arbitrarily shap ed ob jects to b e co ded without an explicit shap e

enco der However the DCT quantization noise may bleed into the recon

structed video ob ject at its edges

Bitmap based compression for shap e co ding such as the blo ckbased metho ds of

CAE was adopted by MPEG b ecause it oers good compression eciency with

relatively reduced computational complexity as compared to other approaches such

as vertex based shap e co ding

The shap e co ding techniques adopted by the standard supp ort b oth lossless co d

ing of alpha planes and lossy co ding of shap es and transparency information thus

allowing tradeos b etween bit rate and accuracy of shap e representation Further

more intra and intershap e co ding functionalities employing motioncomp ensated

shap e prediction is envisioned so as to allow b oth ecient random access op erations

as well as ecient compression of shap e and transparency information for diverse

applications In the MPEG the shap e of every VOP is enco ded along with its

other characteristics ie luminance chrominance etc Therefore the shap e of each

VOP is b ounded by a rectangular window The b ounding b ox is then partitioned

into blo ckof pixels called shapeblockswhich are the same as those contour

blo cks The selection and partitioning of the b ounding boxinto shap e blo cks for

a particular VOP is demonstrated in Fig It is these shap e blo cks up on which

the enco ding and deco ding pro cess is p erformed

A Binary ShapeCoding

Like the texture co ding in MPEG and MPEG the bitmap based co ding

metho d for the binary format contains b oth an intra and inter mo de The ma jor

dierence b etween these two mo des is the addition of motion comp ensation to the

inter mo de in order to achieve greater compression eciency by rst removing the

temp oral redundancies

B Grey Scale Coding

The grey scale format is enco ded using a blo ck based DCT where motion com

p ensation again can b e used to reduce the temp oral redundancies This metho d is

very similar to that used to compress the texture information and is strictly a lossy

compression technique The grey scale bitmap is enco ded by separately enco ding

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

the shap e and transparency information The shap e information is enco ded by the

same binary shap e co ding metho d describ ed ab ove The transparency values are

treated as luminance values and enco ded using the same blo ck DCT transform

approach used to enco de the texture information of a VOP

As discussed ab ove the grey scale format is utilized for comp ositing a scene using

several dierent video ob jects Since a feature of the grey scale format is that each

pixel can take on a range of values ie usually b etween and These values

represent the transparency of that pixel When dierent ob jects o ccupy the same

spatial lo cation they are blended together based on the value of their grey value

format and are normalized based on the maximum value of This approachto

represent the shap e of a video ob ject and its transparency is very similar to the

alpha plane approach used in computer graphics

C Sprite Coding

A sprite is an image comp osed of pixels b elonging to a video ob ject that are

visible throughout an entire video segment For example a sprite generated from a

panning sequence maycontain all the visible pixels of the background throughout

the sequence as shown in Fig a In this particular case the video ob ject

(a)

(b) (c)

Figure Sprite co ding a the panning image sprite containing all the visible

pixels of the background b the foreground VOP Stefan c the reconstructed frame

MPEG VIDEO CODING

used to generate the sprite is the background Portions of this background may

not be visible in certain frames due to the o cclusion of the foreground ob jects as

shown in Fig b or the camera motion Since the sprite contains all parts

of the background that were at least visible once the sprite can b e used for direct

reconstruction of the background VOPs or the predictivecodingofthebackground

VOPs and the reconstructed frame is shown in Fig c

In MPEG spritebased co ding two main typ es of sprites have b een distin

guished static and dynamic Static sprites are those that are directly copied

to generate a particular rendition of the sprite ob ject at a particular time instant

namely a VOP This copying however also includes the appropriate warping and

cropping In contrast a dynamic sprite is used as reference in predictive co ding

where motion is comp ensated using the warping parameters for the sprite ob ject

A dynamic sprite can in turn b e generated either online or oline An oline

sprite is built co ded and transmitted as an IVOP prior to co ding the video itself

An online sprite is dynamically built during co ding in b oth the enco der and the

deco der An online sprite is always dynamic On the other hand an oline sprite

can be static or dynamic dep ending on its usage Oline static sprite is well

suited for synthetic ob jects and ob jects that mostly undergo rigid motion Online

dynamic sprite provides a nolatency solution in the case of natural motion and it

provides an enhanced predictive co ding environment One of the ma jor comp onents

of spritebased co ding is the generation of the sprite This assumes that the sprite

is not known in advance which may be the case for synthetic video ob ject The

sprite is built in a similar way in b oth oline and online cases In particular the

same global motion estimation algorithm is used In the oline case a sprite is

built b efore starting the enco ding pro cess It is constructed using every original

VOP available for the video sequence In the online case both the enco der and

the deco der build the same sprite from reconstructed VOPs In sprite co ding the

chroma comp onents are pro cessed in the same way as the luminance comp onents

with the prop erly scaled parameters For oline static sprites temp oral scalability

is implicit since the transmission of tra jectories of eachVOP is indep endent

Advanced Co ding Techniques

A Unrestricted Motion Vector

In the basic MEMC technique the predicted blo ck has to be a blo ck in the

previous frame If the current blo ck is at a corner or a b order blo ck of the current

frame the motion vector MV is then restricted into a smaller range One of

the advanced techniques in MPEG is to allow unrestricted MVs for such b order

blo cks Fig illustrates this technique The previous frame is extended in all

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

Predicted block Predicted block (motion vector =(8,0)) (Motion vector = (8, -8))

Search range Search range

Previous frame Previous frame Extended previous frame

Current block Current block

Current frame

Motion Estimation with Restricted Motion Vector Motion Estimation with Unresetricted Motion Vector

Figure Illustration of unrestricted motion vector technique

four directions by rep eating the b order pixels a numb er of times based on the search

range The dierence blo ck is generated by applying MEMC against the extended

previous frame and taking the dierence of the current blo ck and the predicted

blo ck that maybe partially out of the frame b oundary This technique improves

the co ding eciency of the b oundary blo cks

B AdvancedPrediction

There are two asp ects of advanced prediction

Adaptive method it decides whether a currentblock of x pixels is divided

into four blo cks of pixels each for MEMC The decision is made based

SAD SAD

here SAD stands for the sum of absolute dierence If the dierence less

than prediction is chosen otherwise is chosen If wecho ose

prediction there are four MVs for the four luminance blo cks

OverlappedMC Each pixel in an luminance predicted blo ckisaweighted

DELIVER VIDEO BITSTREAM OVER NETWORKS

sum of three prediction values sp ecied in the following equation

W i j P i j W i j P i j W i j P i j

s i j

where division by is with roundo W i j W i j and W i j are the

weighting matrixes which can b e found in the MPEG standards

The values of P i j P i j and P i j are the pixels of the previous

frame

C Bidirectional Motion Estimation and Compensation

There are four mo des in bidirectional motion estimation and comp ensation

They are dierent in forming the predicted blo ck

Direct mode it is the only mo de in which it is p ossible to use MVs of x

blo cks For each x blo ck of the Bframe the forward and backward

motion vectors are derived from the MVs of the next Pframe that follows the

Bframe

Interpolate mode backwardmode and forwardmode They p erform MEMC

on blo cks The MVs are obtained by forward ME and backward

ME Selection of these mo des is based on a comparison of the SAD values

generated by the four mo des and the mo des with the minimum SAD value

In this comparison the direct mo de is favored by subtracting from its

SAD value b efore the comparison

Deliver Video Bitstream over Networks

In MPEG it provides the syntax and metho ds necessary to eciently represent

the shap e information of an ob ject as wehave mentioned ab ove within the co ded

bitstream Now the problem b ecomes how to deliver those bitstreams Due to the

large variety of existing network technologies it is most likely that hybrid networks

will be used to supp ort video services However dierent networks have dierent

characteristics To optimize the p erformance of those multimedia systems with

the given QoS Quality of Service requirements rate control is used in MPEG by

jointly considering video compression and delivery schemes based on the network

alternatives capacities and characteristics In addition other techniques prop osed

in MPEG standard is used to meet the challenge of delivering video over networks

in bandwidth ecient universal accessibility and error resilient manner

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

Rate Control

Rate control and buer regulation is an imp ortant issue for b oth variable bit rate

VBR and constant bit rate CBR applications In the case of VBR enco ding the

rate controller attempts to achieve optimum QoS for a given target average rate

In the case of CBR enco ding and realtime applications the rate control scheme has

to satisfy the lowlatency and video buer verier constraints In addition the rate

control scheme has to be applicable to a wide variety of sequences and bit rates

The scalable rate control scheme is designed to meet the requirementofbothVBR

without delay constraints and CBR with lowlatency and buer constraints

The numb er of bits used for a frame dep ends on the quantization stepsize and

the signal dynamic range The scalable rate control SRC scheme controls the bits

assigned to the Pframe

N R Q Q

bit

where N is the number of bits used for a frame R is the dynamic range of

bit

the frame Qis the quantization stepsize used for the frame and are two

mo deling parameters Since the SRCscheme is used for inter frames the motion

comp ensated SAD value of the frame is used for the dynamic range R of the frame

Error resilience

Error resilience provides an error robustness capability to allow access to applica

tions over a variety of wireless and wired networks as well as storage media The

error resilience to ol basically covers resynchronization data recovery and error

concealment

A Resynchronization

Resynchronization to ols as the name implies attempt to enable resynchroniza

tion b etween the deco der and the bitstream after a residual error or errors have b een

detected Generally sp eaking the data b etween the synchronization p oints prior to

the error and the rst point where synchronization is reestablished is discarded

as shown in Fig b ecause it is usually not p ossible to detect the error at the

exact error o ccurrence lo cation at the deco der This kind of errors typically o ccur

in bursts on wireless channels which corrupt many bits when the channel fades

If the resynchronization approach is eective at lo calizing the amountofdata

discarded by the deco der then the abilityofothertyp es of to ols which recover data

andor conceal the eects of errors is greatly enhanced

The resynchronization approach adopted by MPEG is similar to the Group

of Blo ck GOBs structure utilized by the ITUT H and H In

DELIVER VIDEO BITSTREAM OVER NETWORKS

Discarded data Resync Resync point point Error Error

location detected

Figure All the data b etween the two resynchronization p oints may need to b e

discarded

these standards a GOB is dened as one or more rows of macroblo cks MB At

the start of a new GOB information called a GOB header is placed within the

bitstream This header information contains a GOB start co de which is dierent

from a picture start co de and allows the deco der to lo cate this GOB Furthermore

the GOB header contains information which allows the deco ding pro cess to be

restarted ie resynchronize the deco der to the bitstream and reset all predictively

co ded data

The GOB approach to resynchronization is based on spatial resynchronization

That is once a particular macroblo ck lo cation is reached in the enco ding pro cess

a resynchronization marker is inserted into the bitstream A potential problem

with this approach is that since the enco ding pro cess is variable rate these resyn

chronization markers will most likely b e unevenly spaced throughout the bitstream

Therefore certain p ortions of the scene such as high motion areas will be more

susceptible to errors which will also b e more dicult to conceal

The video packet approach adopted by MPEG as shown in Fig is based on

providing p erio dic resynchronization markers throughout the bitstream In other

Resync MB Quantiza- Macroblock Data HEC (I-VOP: DCT DC/Mode Information) Texture information marker Number tion (P-VOP: Motion/Mode Information)

Header

Figure Resynchronization markers help in lo calizing the eect of errors to an

MPEG video packet The header of each video packet contains all the necessary

information to deco de the macroblo ck data in the packet Here HEC stands for

header extension co des Macroblo ck data include DCT and texture information for

IVOP and motion and texture information for PVOP resp ectively

words the length of the video packets are not based on the numb er of macroblo cks

but instead on the numberofbitscontained in that packet If the numb er of bits

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

contained in the current video packet exceeds a predetermined threshold then a

new video packet is created at the start of the next macroblo ck

A resynchronization marker is used to distinguish the start of a new video packet

This marker is distinguishable from all p ossible variable length co ding co dewords

as well as the VOP start co de Header information is also provided at the start of

a video packet Contained in this header is the information necessary to restart the

deco ding pro cess and includes the macroblo cknumb er of the rst macroblo ck

contained in this packet It provides the necessary spatial resynchronization while

the quantization parameter allows the dierential deco ding pro cess to b e resynchro

nized the quantization parameter necessary to deco de that rst macroblo ck

Imp ortant information that remains constantover a video frame suchasthe

spatial dimensions of the video data the time stamps asso ciated with the deco ding

and the presentation of this video data and the typ e of the current frame INTER

co dedINTRAco ded are transmitted in the header at the b eginning of the video

frame data If some of this information is corrupted due to channel errors the

deco der has no other recourse but to discard all the information b elonging to the

current video frame In order to reduce the sensitivity of this data a bit eld

called HEC is intro duced in the video packet header When HEC is set the im

p ortant header information that describ es the video frame is rep eated in the bits

following the HEC This duplicate information can be used to verify and correct

the header information of the video frame The use of HEC signicantly reduces

the number of discarded video frames and helps achievea higher overall deco ded

video quality

B Data Recovery

After synchronization has been reestablished data recovery to ol attempts to

recover data that in general would be lost These to ols are not simply error

correcting co des but instead techniques which enco de the data in an error resilient

manner For instance one particular to ol under consideration is Reversible Variable

Length Co des RVLC as shown in Fig In this approach the variable length

co dewords are designed such that they can b e read b oth in the forward as well as

the reverse direction Examples of suchcodewords are Co dewords

suchaswould not b e used Obviously this approach reduces the compression

eciency achievable by the entropy enco der However the improvementin error

resilience is substantial

C Error Concealment

Error concealment is an extremely imp ortant comp onent of any error robust

video co dec Similar to the error resilience to ols the eectiveness of a error con

DELIVER VIDEO BITSTREAM OVER NETWORKS

2 Error dectected goto next resync marker

Resync Motion/Mode Texture data reversible VLC Resync Header information marker marker Errors 3 Backward decoding 1 Forward decoding 4 Errors localized and

discarded

Figure Reversible VLCs can be parsed in both the forward and backward

direction making it p ossible to recover more DCT data from a corrupted texture

partition instead of discarding the data b etween the consecutive resynchronization

markers

cealment strategy is highly dep endent on the p erformance of the resynchronization

scheme Basically if the resynchronization metho d can eectively lo calize the error

then the error concealment problem b ecomes much more tractable

We can achieve error concealment by taking advantage of the separate mo

tion texture co ding mo de of MPEG Sp ecically this approach utilizes the data

partitioning capabilities of separating the motion and the texture This approach

requires that a second resynchronization marker b e inserted between motion and

texture information If the texture information is lost the approach utilizes the

motion information to conceal these errors That is due to the errors the tex

ture information is discarded while the motion is used to motion comp ensate the

previous deco ded VOP

This approach can be extended through the transmission of a mean motion

vector and shap e information for each ob ject ie side information For the case

when b oth texture and motion information or just motion information is corrupted

this side information can b e utilized to motion comp ensate the ob ject This is as

opp osed to the typical error concealment strategy which generally works at the

macroblo cklevel

Universal Accessibility

Contentbased scalability enables the user to achieve scalability with a ne granular

ityincontent quality eg spatial resolution temp oral resolution and complexity

This allows for the manual or automated selection of deco ded video quality based on

the available bandwidth in a particular network For example a user can browse

a database at dierent qualities scales andor resolutions based on the bandwidth

resources of a particular network In general scalabilityof video means the abil

ity to achieve video of more than one resolution qualitysimultaneously Scalable

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

video co ding involves generating a co ded representation in a manner that allows the

derivation of video of more than one resolution qualityby scalable deco ding Bit

stream scalability is the prop erty of a bitstream that allows deco ding of appropriate

subsets of a bitstream to generate complete pictures of resolution quality commen

surate with the prop ortion of the bitstream deco ded A truly scalable bitstream

allows b oth low and high p erformance deco ders to co exist That is alow p erfor

mance deco der may deco de small p ortions of the bitstream pro ducing basic quality

while a high p erformance deco der may deco de the entire bitstream and pro duce

signicantly higher quality

The ability to provide contentbased spatial and temp oral scalability are two

very imp ortant functionalities whichhave b een prop osed in MPEG MPEG

The concept of spatial and temp oral scalability can also extended to VOPs of arbi

trary shap e which is referred to as generalized scalability Eachtyp e of scalability

involves more than one layer suchasalower layer and a higher layer

ObjectBasedscalability It is imp ortanttokeep in mind that MPEG oers

the ability to do ob jectbased scalability This unique functionality is a re

sult of MPEGs ability to resolve ob jects into dierentVOPs Utilizing the

multiple VOP structure dierent resolution enhancements can b e applied to

dierent p ortions of a video scene Therefore within MPEG the following

two enhancement mechanism are allowed Enhancement Typeand En

hancement Type In Enhancement Type the enhancementlayer increases

the resolution of a particular ob ject or region of the base layer In Enhance

ment Type the enhancementlayer increases the resolution of the entire base

layer

Spatial scalability the lower layer is referred to as the base layer and the higher

layer is called the enhancement layer as shown in Fig Traditionally

these scalability are applied to frames of video such that in case of spatial

scalability the enhancement layer frames enhance the spatial resolution of

base layer frames If needed adownsampling pro cess is p erformed by the

scalability pro cessor

Temporal scalability the enhancement layer frames are temp orally multi

plexed with the base layer frames to provide high temp oral resolution video

In temp oral scalability as shown in Fig the frame rate of a selected

ob ject is enhanced such that it has a smo other motion than the remaining

area In other words the frame rate of the selected ob ject is higher than that of the remaining area

DELIVER VIDEO BITSTREAM OVER NETWORKS

Coder/Decoder Pair

2 2

Coder/Decoder Pair

2 2

Coder/Decoder

Pair

Figure Spatial scalability

Coder/Decoder Pair

60 Hz

60 Hz 2 2

Coder/Decoder Pair

2 2 30 Hz

Coder/Decoder Pair

15 Hz

Figure Temp oral scalability

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

DCTdomain Contentbased Video Co ding

As stated in the previous sections the motion and texture co ding techniques in the

MPEG are the direct extension of those used in traditional video co ding Thus

the blo ck matching and the Discrete Cosine Transform DCT are still the basic

techniques However the dierence b etween the approaches of MPEG and tradi

tional video co ding is that in MPEG the motion estimation of the blo cks on the

VOP b orders has to b e mo died from blo ck matching to p olygon matching To

accomplish p olygon matching the macroblo ckbased rep etitive padding is required

to estimate motion for those contour macroblo cks which reside on the b oundary of

the video ob ject and contain partial video information as shown in Fig The

pro cedures of padding are listed in Table To cop e with VOP of big motions

the padding is further extended to blo cks which are completely outside the VOP

but immediately next to the b oundary blo cks In addition to the high computa

tional complexity O N of mo died blo ck p olygon matching motion estimation

MBKMME the macroblo ckbased rep etitive padding increases the overall com

plexity and system data ow and makes the realtime video co dec implementation

even harder The resulting design of video co ding structure is shown in Fig a

Therefore how to handle the demanding computational tasks in real time and how

to implement them in a costeectivewayhave b ecome big challenges to us Now

the following questions can be logically p osed Are there some disadvantages in

MPEG video co der design The answer is yes For instance in order to sup

p ort spatial domain motion estimationcomp ensation the IDCT is used to restore

the compressed video ob ject back to the spatial domain However with such a

design the throughput of the co der is limited by the pro cessing sp eed of four ma

jor comp onents DCT IDCT Spatial Domain Motion Estimation SDME and

Macroblockbased repetitive padding in the feedback lo op The feedback lo op

therefore b ecomes the ma jor b ottleneck of the entire digital video system

Is there any b etter lowcomplexity design to achieve MPEG compatible p er

formance The answer is p ositiveandwe will provide such a solution byworking

at the algorithmic level in next section

Transform Domain Motion EstimationComp ensation

Besides the spatial domain motion estimationcomp ensation we can also esti

matecomp ensate motion for arbitrarily shap ed video ob ject in DCT domain With

such a DCTdomain design wecan movetheDCT unit out of the feedbackloop

to realize the fully DCTbased co der structure as shown in Fig b Now the

p erformancecritical feedback lo op of the DCTbased co der contains only transform

domain motion estimation unit TDME instead of four ma jor comp onents DCT

DCTDOMAIN CONTENTBASED VIDEO CODING

VOP Input + DCT Q Bitstream Formation + VLC - Q-1

Shape IDCT Information Macroblock-

SD-ME based Padding + a The video co ding structure in MPEG enco der

VOP Macroblock- Input DCT + Formation based Padding + Q VLC Bitstream -

Q-1 Shape Information

TD-ME +

b The video co ding structure in fully DCTBased enco der

Figure Comparison of dierent co der structures a Commonly used motion

comp ensated DCT hybrid co der p erforms motion estimation in the spatial domain

SDME b Fully DCTbased co der estimates motion in the transform domain TDME

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

IDCT Spatial Domain Motion Estimation SDME Macroblockbased repetitive

padding This not only reduces the complexity of the co der but also achieves

higher system throughput Most imp ortantly the DCTbased nature enables the

combination of b oth DCT and motion estimation units which consume more than

of computing p ower for a video co der into one comp onenttosavechip area

In addition we can also move the macroblo ckbased rep etitive padding unit out of

the feedbackloop Nowthe p erformancecritical feedbackloopcontains only one

transformdomain motion estimation unit TDME instead of four ma jor compu

tational involved units as in Fig a As a result this not only reduces the

system complexity of the co der but also achieves higher data throughput

In this section we will extend our DCTdomain motion estimationcomp ensation

scheme discussed in previous chapters for arbitrarily shap ed video In principle we

mo dify and extend the DCT PseudoPhase Technique to estimate motions at inte

ger and halfpixel accuracy for arbitrarily shap ed VOP of MPEG video Unlike

the mo died blo ck matching p olygon motion estimation algorithm adopted in

MPEG the presented motion estimation scheme EDXTME works solely in the

DCT transform domain instead of in the spatial domain In other words if we

consider the conventional blo ckbased motion estimationcomp ensation as a time

domain approach our design is then a frequency domain approach Thus it en

ables us to extract motion displacement directly from the current and previous

VOPs even without the macroblo ckbased rep etitive padding In addition wecan

p erform the motion comp ensation in the DCT domain without converting backto

the spatial domain However for the sake of MPEG compatibilitywe will still

keep the padding pro cedure as we will explain later In other words to make our

design work with the MPEG deco der the VOP has to b e padded

The EDXTME algorithm is summarized in Table In terms of arbitrary

shap e motion estimation we can treat the contour macroblo cks the same as the

regular ones except for pixels outside are padded based on the video content inside

the video b oundary by following the pro cedures as listed in Table Next we will

discuss eachstepofEDXTME algorithm in more detail

A VOP Formation

The VOP is represented by means of a b ounding rectangle as describ ed next

The phase between the luminance and chrominance samples of the b ounding

rectangle has to b e correctly set according to the format as shown in Fig

Sp ecically the top left co ordinate of the b ounding rectangle should b e rounded to

the nearest even number not greater than the top left co ordinates of the tightest

rectangle Accordingly the top left co ordinate of the bounding rectangle in the

chrominance comp onent is the top left co ordinate of the luminance divided bytwo

DCTDOMAIN CONTENTBASED VIDEO CODING

Input The video ob ject planes VOPs

Motion vectors and prediction errors Output

VOP formation and padding Based on the shap e information of the VOP we can

generate the tightest rectangle boundedVOP window that contains the video ob ject

to achieve high co ding eciency The b ounded windowhas the minimum number of

macroblo cks with eachofsize p els A shift parameter is hereafter enco ded as

spat ref in MPEG to indicate the lo cation of the b ounded VOP window with horver

resp ect to the b orders of a reference VOPTheVOP is then macroblo ckbased rep etitive

padded

The motion vector is computed Contentbased motion estimationcomp ensation

only for each of the macroblo ck or the blo ck for advanced motion comp ensation

which contains the video ob ject Otherwise jump to Step

k l Compute the D DCT co ecients of second kind DDCTI I X

cs sc ss

X k l X k land X k l of a macroblo ck of pixels in the currentVOP

t t t

fx g

Meanwhile those DCT co ecients in the corresp onding macroblo ck of pixels in

the reference VOP x are converted to D DCT co ecients of rst kind

cc cs sc ss

DDCTI Z k l Z k l Z k l and Z k l through a plane

t t t t

rotation

Determine the normalized pseudo phases f k landg k l from the system equation

which contains those typeI and typeII DCT co ecients obtained from step

Compute F m n and Gm n the inverse DCT DIDCTI I of f k l and

g k l which are comp osed of impulse functions whose p eak p ositions indicate

the integerp el motion vector m m and p eak signs reveal the direction of the

u v

movement

The halfp el motion vector is then determined by only considering the nine p ossible

p ositions around the integerp el displacement m m without interp olation

u v

Basically the halfp el motion vector is determined by computing DC S u v and

DS C u v for u fm m m g and v fm m m g

u u u v v v

The p eak value of DC S u v and DS C u v indicate the halfp el motion and its

direction

Based on the previously derived motion estimation DCT of the motion

comp ensated residual DBD Displaced Blo ck Dierence between current blo ck

B and displaced reference blo ck B is computed as

cur r

ref

DC T fDBD g DC T fB gDC T fB g

cur r

ref

The prediction errors are then quantized and send to the receiver along with co ded

macroblo ck motion vectors

Go to step until the whole video ob ject is estimated The pro cess While lo op

starts from the top left macroblo ckintheboundedVOP window to the top right

one and then to the next row and so on for every macroblo ck in the b ounded

VOP window

Table Summary of the EDXTME algorithm

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

0 1 2 3 Bouned 0 VOP window

Luminance

Chrominance

Figure Luminance vs chrominance sample p ositions in format

Here the shap e information is used to form a VOP By using the following pro cedure

the minimum number of macroblo cks that contain the ob ject will be attained to

get a higher co ding eciency

Generate the tightest rectangle with even numb ered top left p osition as shown

in Fig a

If the top left p osition of this rectangle is the same as the origin of the image

frame skip the formation pro cedure

Form a control macroblo ck at the top left corner of the tightest rectangle

as shown in Fig b

Count the number of macroblo cks that completely contain the ob ject

starting at eacheven numb ered p ointofthecontrol macroblo ck Details

are as follows

Generate a b ounding rectangle from the control p oint to the right

b ottom side of the ob ject which consists of multiples of blo cks

Count the number of macroblo cks in this b ounding rectangle

which contain at least one ob ject p el It is sucient to take into

account only the b oundary p els of a macroblo ck

Select that control point that results in the smallest number of mac

roblo cks for the given ob ject

Extend the top left co ordinate of the tightest rectangle generated in

Fig b to the selected control co ordinate This will create a rectan

gle that completely contains the ob ject but with the minimum number

DCTDOMAIN CONTENTBASED VIDEO CODING

Bounded VOP window (Tightest rectangle)

Shift

Current VOP a Control Macro- Bounded VOP window block (Tightest rectangle) . ... .

: Control point

Extended bound

Intelligently generated VOP

Current VOP

Figure Intelligent VOP formation a generate the tightest rectangle b

extended the VOP window

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

of macroblo cks in it The VOP horizontal and vertical spatial references

are taken directly from the mo died topleft co ordinate

B Contentbased Motion EstimationCompensation

The reason we call it contentbased video co ding is that the motion estima

tioncomp ensation is p erformed only for those macroblo cks containing the video

information It is the kernel of the EDXTME algorithm and is computational in

tensive The whole system p erformance dep ends on this core pro cess design we

will describ e the detailed costeective architectures and its corresp onding VLSI

implementation in the following chapters Basicallywe can view our approachas

a logical extension of those DCTbased motion estimation schemes toward co ding

video sequences of arbitrary shap e

After motion estimation the current blo ck B of size N N in the current

cur r

frame can b e b est predicted by the blo ck B displaced from the previous blo ck

ref

p osition with the estimated motion vector m m Based on the derivation in

u v

the DCT of the motioncomp ensated residual displaced blo ck dierence

DBD is given by

DC T fDBDg DC T fB B g DC T fB gDC T fB g

ref cur r ref cur r

In other words the DCT of the motioncomp ensated residual can b e expressed as

the dierence b etween the DCT of the displaced blo ck and the DCT of the current

blo ck As a result we can p erform motion comp ensation in the DCT domain as

shown in Fig b which serves the purp ose of building a fully DCTbased

motion comp ensated video without converting back to the spatial domain b efore

motion comp ensation

Now the question b ecomes How to extract the displaced DCT blo ck in the

DCT domain or howtocompute DC T fB g Let us illustrate the solution by

ref

taking a simple example As illustrated in Fig a after motion estimation

the current blo ck B of size N N in the current frame can b e b est predicted

cur r

from the blo ck displaced from the previous blo ck p osition by the estimated motion

vector m m in the spatial domain This motion estimation determines which

u v

four contiguous predened DCT blo cks are chosen for the prediction of the current

blo ck out of eight surrounding DCT blo cks and the blo ck at the current blo ck

p osition To extract the displaced DCT blo ck in DCT domain a direct metho d is

to obtain four subblo cks separately from these four contiguous blo cks which can then

be combined together to form the nal displaced DCT blo ckasshown in Fig

b with the upp erleft lowerleft upp erright and lowerright blo cks lab eled as

B B B B resp ectively Subblo cks S are extracted from these four blo cks by

DCTDOMAIN CONTENTBASED VIDEO CODING

h 1 h3 B1 B3 h1 h (m u , m v ) 3 v v v 1 1 1 B1 B3 v 2 v v 2 2

B2 B4 B2 h B4

h1 3

a DCTbased motion estimation b Pixelwise translated DCT blo ck c Motion comp ensation

Figure DCTbased motion comp ensation

premultiplication and p ostmultiplication of the shifting matrices H and V

k k

S H B V for k

k k k k

where the shift amount is determined by the estimated motion vectors and H

V are dened as

I I

h h

H H H H

I I

h h

I I

v v

V V V V

I I

v v

Here I is the n n identity matrix ie I diag fg and n is deter

n n

mined by the heightwidth of the corresp onding subblo ckasshown in Fig b

These premultiplication and p ostmultiplication matrix op erations can be vi

sualized in Fig c where the overlapp ed grey areas represent the extracted

subblo ck These four subblo cks are then summed to form the desired translated

blo ck B The DCT co ecients of four subblo cks can then b e combined together

ref

to form the nal displaced DCT blo ck

DC T fH gDC T fB gDC T fV g DC T fB g

k k k ref

C An Example to Il lustrate Our Design

To facilitate explanation of EDXTME algorithm let us use the MPEG video

test sequence News in CIF format with frame size as input image

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

sequence The panorama scene is shown in Fig The News sequence consists

Figure Panorama scene of News in CIF format

of four VOPs and three corresp onding binary alpha planes the background VOP

has no alpha plane as shown in Fig Here we apply our design only to VOP

the third video ob ject plane as shown in Fig a as an example to illustrate our

design b ecause VOP is the foreground VOP Most imp ortantlythelo cation and

shap e of VOP as shown in Fig b vary with time After taking the rst step

of EDXTME VOP formationtheVOP now is b ounded by the tightest rectangle

containing the video ob ject However this tightest rectangle may not consist of

multiples of macroblo ckofsize Therefore we need to extend the b ottom

right co ordinate of VOP window in Fig a to satisfy that requirement The

nal b ounded VOP with the window size of and its corresp onding alpha

plane are shown in Fig c and d resp ectively The reason to intro duce

VOP formation is to achieve high data compression rate b ecause we dont need

to estimate motions for those macroblo cks containing no video information After

VOP formation the b ounded VOP is padded as shown in Fig e and f The

b ounded VOP window is then further divided into nonoverlapp ed macroblo cks

The contentbased motion estimation the second step of EDXTME is p erformed

on the corresp onding texture of padded VOP in the b ounded window

The binary alpha plane as shown in Fig d is co ded by mo died CAE

Contentbased Arithmetic Enco ding The adopted blo ckbased syntax has

allowed compressed binary alpha blo cks BABs to b e blended seamlessly into the

video syntax This in turn eases the task of supp orting the imp ortant features such

as errorresilient bitallo cation and ratecontrolled op erations Just as the YUV

enco ding a BAB maybeintraco ded using contextbased arithmetic enco ding it

may be interco ded using motion comp ensation and CAE or it may merely be

DCTDOMAIN CONTENTBASED VIDEO CODING

a b

c d

e f

Figure Explanation of VOP formation using News video test sequence a

VOP of New b the alpha plane of VOP c VOP in b ounded window d

the alpha plane of b ounded VOP e VOP after rep etitive padding f VOP

after extended padding

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

reconstructed by motion comp ensation without CAE which is analogous to the

notco ded macroblo ck mo de in MPEG video standard Both YUV and binary

alpha deco ding require the use of motion estimationcomp ensation to exploit spatial

redundancy

To envision how our presented DCTdomain scheme works here we present

an example to estimate motion of a contour macroblo ck by following Steps

in our design as shown in Fig The p eak p osition among F m n and

Gm n indicates the integerp el motion vector of The p eak p osition among

DS C u v andDC S u v implies the halfp el motion vector of

D Computational Complexity

Now let us take a lo ok at the overall computational complexity To pro cess

each video ob ject plane Step of the prop osed approach VOP formation and

padding is only needed to b e executed once Therefore the overall computational

complexity of design is determined by the complexityof Steps and is listed

in Table which serves as the computing engine of the whole design Overall

Step Op eration Multiplication Additions Computational

Subtractions complexity

DDCT typ eI I computation O N

Rotation typ eI DCT O N

Pseudo phases calculation O N

F G computation O N

Halfp el motion estimation O N

Prediction errors computation O N

Total O N

Table Computational complexity of Step in our design for a macroblo ck

size of N N And the size N is adjustable Here weuse N as an example

the scheme requires the computational complexityofO N Here N stands for the

macroblo ck size which is adjustable For the large motions going b eyond the blo ck

b oundary we will use motion vector instead Notice that if the original

input image sequences are not decomp osed into several video ob ject layers VOLs

of arbitrary shap e the EDXTME scheme simply degenerates into a single layer

representation which supp orts conventional image sequences of rectangular shap e

The EDXTME approach can thus be seen as a logical extension of MPEG

and MPEG compatible motion estimation algorithm in transform domain toward image input sequences of arbitrary shap e

DCTDOMAIN CONTENTBASED VIDEO CODING

Contour Contour macroblock macroblock

Reference Frame after Padding Current Frame after Padding

DSCT−I DCCT−I DSCT−II DCCT−II

0.05 0.05 0.05 0.04

0 0 0.02 0 0 −0.05 −0.05 20 20 −0.05 −0.02 20 20 20 20 20 20 10 10 10 10 10 10 10 10 0 0 0 0 0 0 0 0

DCST−I DSST−I DCST−II DSSC−II

(Step 2.1) 0.05 0.05 (Step 2.1) 0.05 0.05

0 0 0 0

−0.05 −0.05 −0.05 −0.05 20 20 20 20 20 20 20 20 10 10 10 10 10 10 10 10 0 0 0 0 0 0 0 0 Type-I DCT Coefficients Type-II DCT Coefficients of contour macroblock of contour macroblock

Pseudo−phase f(k,l) in EDXT−ME (SNR=40) F(m,n) in EDXT−ME (SNR=40) DSC(m,n) in EDXT−ME (SNR=40)

1.5 1.2 1.2

1 1 1 0.8 0.8 0.5 0.6 0.6

0 0.4 0.4

0.2 0.2 −0.5 0 0 −1 −0.2 −0.2 20 20 40 15 20 15 20 30 40 15 10 15 30 10 10 20 5 10 20 5 5 10 5 10 LAV = 0 0 0 D=(3,−1) 0 0 N=16 0 0 N=16 D=(3,2) D=(3.5,2.5)

f(k,l) F(m,n) DSC(m.n) (Step 2.3) (Step 2.4) (Step 2.2) Pseudo−phase g(k,l) in EDXT−ME (SNR=40) G(m,n) in EDXT−ME (SNR=40) DCS(m,n) in EDXT−ME (SNR=40)

1.5 1.2 1.2

1 1 1 0.8 0.8 0.5 0.6 0.6 0 0.4 0.4 −0.5 0.2 0.2

−1 0 0

−1.5 −0.2 −0.2 20 20 40 15 20 15 20 30 40 15 30 10 15 10 20 10 10 20 5 5 10 5 5 10 0 0 N=16 0 0 N=16 0 0 LAV = 0 D=(3,−1) D=(3,2) D=(3.5,2.5)

g(k,l) G(m.n) DCS(m,n)

Figure Illustration of estimating motion of a contour macroblo ckby following step in our compressed domain design

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

Simulation Results

Simulations have b een p erformed on the News sequence in CIF format The

b ounded previous and currentVOPs are shown in Fig a and b resp ectively

The reconstructed VOP using our presented compressed domain co ding scheme is

a b

c d

Figure Illustrate the p erformance of our contentbased video co ding a

b ounded previous VOP b bounded currentVOP c bounded alpha plane for

previous VOP d reconstructed VOP using our presented design

shown in Fig d The simulation results demonstrate the comparable video

qualitybetween the reconstructed and currentVOPs

Due to its lower computational complexity as compared to other dierence mea

sures the sum of absolute dierence SAD as dened in is adopted in the

MPEG standards to measure the prediction errors Simulations have also b een

p erformed to compare our design with the mo died blo ck matching or p olygon

matching metho d used in MPEG in terms of prediction errors SAD Here the

MPEG video reference software MoMuSys is used as reference in simulating

the p erformance of mo died blo ck matching approach The results are shown in

Fig The simulation results demonstrate the comparable p erformance of b oth

our design and the one used in MPEG in terms of prediction errors Compared

to the conventional arbitrarily shap ed video co ding design we optimize the hard

ware complexityby minimizing the computational units along the data path more

costeectively

Other than the News test sequence the simulations are also p erformed for

Foreman and Mother and Daughter sequences etc In order to show that our

DCTDOMAIN CONTENTBASED VIDEO CODING

4 x 10 5

Our design MPEG−4 design

4.5

3.5 total sum of absolute differences

2.5 0 10 20 30 40 50 60

frame number

Figure Comparing the p erformance of dierent video co ding approaches in

terms of prediction errors using News testing sequence Here total sum of absolute

dierences is the summation of SAD for all macroblo cks within each frame

design is also backward compatible to handle the rectangular frame of video here

we treat Mother and Daughter sequence as the regular frame of pixels The

simulation results as shown in Fig demonstrate the comparable video quality

between our compressed domain design and the conventional MPEG approach

used in video standards In other words considering that the motion comp ensated

video co ding of the rectangular frame is the sp ecial case of our arbitrarily shap ed

video co ding it is easy to see that our presented design is backward compatible to co de regular images

CHAPTER MPEG AND CONTENTBASED VIDEO CODING

a Foreman using MPEG b Foreman using our design

MoMuSys reference software

c Mother and Daughter d Mother and Daughter

using MPEG using our design

Figure Comparing the video qualityofdierent video co ding approaches in

terms of video quality

Table of Contents