RATE DISTORTION OPTIMIZATION FOR INTERPREDICTION IN H.264/AVC CODING

Thesis

Submitted to

The School of Engineering of the

UNIVERSITY OF DAYTON

In Partial Fulfillment of the Requirements for

The Degree of

Master of Science in Electrical Engineering

By

Jonathan Patrick Skeans

UNIVERSITY OF DAYTON

Dayton, Ohio

August, 2013

RATE DISTORTION OPTIMIZATION FOR INTERPREDICTION IN H.264/AVC

VIDEO CODING

Name: Skeans, Jonathan Patrick

APPROVED BY:

Eric Balster, Ph.D.
Advisor Committee Chairman
Assistant Professor, Department of Electrical and Computer Engineering

Frank Scarpino, Ph.D.
Committee Member
Professor Emeritus, Department of Electrical and Computer Engineering

Vijayan Asari, Ph.D.
Committee Member
Professor, Department of Electrical and Computer Engineering

John G. Weber, Ph.D.
Associate Dean
School of Engineering

Tony E. Saliba, Ph.D.
Dean, School of Engineering
& Wilke Distinguished Professor

ABSTRACT

RATE DISTORTION OPTIMIZATION FOR INTERPREDICTION IN H.264/AVC VIDEO CODING

Name: Skeans, Jonathan Patrick
University of Dayton

Advisor: Dr. Eric Balster

Part 10 of MPEG-4 describes the Advanced Video Coding (AVC) method widely known as H.264. H.264 is the product of a collaborative effort known as the Joint Video Team (JVT). The final draft of the standard was completed in May of 2003, and since then H.264 has become one of the most commonly used formats for video compression [1]. H.264, unlike previous standards, describes a myriad of coding options that include variable block size inter prediction methods, nine different intra prediction modes, multi frame prediction, and B frame prediction. The many coding combinations tend to generate different numbers of coded bits and different reconstruction quality. A video encoder is challenged to minimize coded bitrate and maximize quality. However, choosing the coding mode of a macroblock to achieve this is a difficult problem due to the large number of coding combinations and parameters. Rate Distortion Optimization is an effective technique for choosing the 'best' coding mode for a macroblock. This thesis presents two features of an H.264 encoder, multi frame prediction and B frame prediction. Additionally, a Rate Distortion Optimization scheme is implemented with these features to improve overall performance of the encoder.

For my friends and family

ACKNOWLEDGMENTS

I would like to thank my family for their support during my time as a college student. I would also like to thank the following people for making the experience more rewarding:

• Thank you to Chris McGuinness for helping me learn H.264 and always being available to answer any questions I had. You have been an excellent role model, colleague, and friend.

• Thank you to William Turri and the rest of the ADDA lab for all the assistance you have given me throughout my time with UDRI.

• Thank you to Kerry Hill, Al Scarpelli, and the Air Force Research Laboratory for enabling the experience.

• Thank you to Dr. Frank Scarpino and Dr. Vijayan Asari for serving on my thesis committee.

• Thank you to Mike Ratterman and Chris Direnzi for putting up with me during undergrad.

• Special thanks to Dr. Eric Balster for taking a chance on me and serving as my advisor.

TABLE OF CONTENTS

ABSTRACT ...... iii

DEDICATION ...... iv

ACKNOWLEDGMENTS ...... v

LIST OF FIGURES ...... viii

LIST OF TABLES ...... x

I. Introduction ...... 1

1.1 Video Coding Overview ...... 1
1.2 Video Coding Standards ...... 2
1.2.1 H.264 Standard ...... 3
1.3 H.264 Overview ...... 3
1.4 Prediction ...... 4
1.4.1 Intra Prediction ...... 5
1.4.2 Inter Prediction ...... 6
1.5 Transform, Scaling, and Quantization ...... 10
1.5.1 Hadamard Transform ...... 11
1.5.2 Quantization ...... 13
1.6 Entropy Coding ...... 14
1.6.1 Exp-Golomb Coding ...... 14
1.6.2 CAVLC ...... 15
1.6.3 CABAC ...... 15
1.7 Profiles and Levels ...... 15
1.8 Mode Selection ...... 16
1.8.1 Rate Distortion Optimized Mode Selection ...... 19
1.9 Motivation and Organization ...... 19

II. Multi Frame Prediction ...... 20

2.1 Interprediction Overview ...... 20
2.2 Syntax Overview ...... 21
2.3 Picture Ordering ...... 22
2.4 Reference Picture Lists ...... 25
2.5 Exp-Golomb Coding ...... 26
2.6 Motion Vector Prediction ...... 28
2.7 Multi Frame Encoding ...... 29
2.8 Conclusions ...... 33

III. B Frame Inter Prediction ...... 38

3.1 B Frame Inter Prediction Overview ...... 38
3.2 B Frame Reference Picture Lists ...... 39
3.3 B Frame Coding ...... 41
3.3.1 SPS and PPS ...... 41
3.3.2 Decoded Picture Buffer ...... 43
3.3.3 Create Search Window ...... 43
3.3.4 Block Match, Transform and Quantize ...... 43
3.3.5 Motion Vector Prediction ...... 44
3.4 B Frame Implementation ...... 44
3.5 B Frame Conclusions ...... 45

IV. Mode Selection ...... 48

4.1 Introduction ...... 48
4.2 Proposed Low Complexity RDO Method ...... 49
4.3 Proposed RDO Method Implementation ...... 50
4.4 Conclusions ...... 52

V. Conclusions and Future Work ...... 56

5.1 Conclusions ...... 56
5.2 Future Work ...... 57

BIBLIOGRAPHY ...... 58

LIST OF FIGURES

1.1 Subdivision of Picture into Slices ...... 5

1.2 Prediction Samples for luma 4x4 prediction ...... 6

1.3 Macroblock Partitioning for Inter prediction ...... 8

1.4 Multiframe ...... 9

1.5 Extracting DC Coefficients ...... 12

1.6 H.264 Syntax Layers ...... 14

1.7 H.264 Profiles ...... 16

1.8 Available Prediction Modes ...... 18

2.1 Macroblock Layer Overview: Baseline ...... 23

2.2 mb pred syntax overview ...... 24

2.3 sub mb pred syntax overview ...... 25

2.4 Display Order Example, Type 0 ...... 26

2.5 Reference Picture Order Example: P Slices ...... 27

2.6 Current and neighboring partitions: 16x16 partitions ...... 29

2.7 Current and neighboring partitions: different partitions sizes ...... 30

2.8 Multi Frame Prediction Foreman ...... 32

2.9 Multi Frame Prediction Flower ...... 33

2.10 Multi Frame Prediction Flyby ...... 34

2.11 Multi Frame Prediction Foreman Complexity ...... 35

2.12 Multi Frame Prediction Flower Complexity ...... 36

2.13 Multi Frame Prediction Flyby Complexity ...... 37

3.1 IPBB Display Order ...... 40

3.2 List0 and List1 Ordering Example ...... 41

3.3 B MB Prediction Block Diagram ...... 42

3.4 B MB Motion Vector Prediction ...... 44

3.5 Rate Distortion Curve Foreman using B Frame Interprediction ...... 45

3.6 Rate Distortion Curve Foreman using B Frame Interprediction for QPs 24 through 28 46

3.7 Complexity using B Frame Interprediction ...... 47

4.1 Traditional RDO ...... 50

4.2 Proposed RDO Method ...... 51

4.3 Proposed RDO Results: Foreman ...... 52

4.4 Proposed RDO Complexity: Foreman ...... 53

4.5 Proposed RDO Results: Flower ...... 54

4.6 Proposed RDO Complexity: Flower ...... 55

LIST OF TABLES

1.1 Video Compression Standards ...... 2

1.2 Luma Prediction Modes, 4x4 prediction ...... 7

2.1 Exp-Golomb Codewords ...... 27

2.2 Mappings to codeNum ...... 28

3.1 Display Order Example ...... 40

CHAPTER I

Introduction

1.1 Video Coding Overview

Digital media has gone through a significant change over the past 10 years [1]. Most consumers now receive digital television, which offers a greater choice of channels, electronic guides, and high definition programming. DVDs and Blu-ray Discs are the primary medium for playing prerecorded movies and television programs. An alternative to this technology is Internet downloading and streaming. Many other changes in digital media include increased functionality of cellular telephones, increases in home internet speeds, and video calling via the internet.

Many factors have contributed to the shift towards digital video, including commercial factors, legislation, social changes, and technological advances [1]. One technical aspect that is key to the widespread adoption of digital video technology is video compression. Video compression is the process of reducing the amount of data required to represent a digital video signal, prior to transmission or storage [1]. The complementary operation, video decompression, recovers a digital signal from a compressed representation, prior to display. This entire process, known as video coding, is essential for any video application in which storage capacity or transmission bandwidth is limited.

1.2 Video Coding Standards

Standards exist to simplify interoperability between encoders and decoders from different manufacturers. Table 1.1 shows a partial history of video compression standards.

Table 1.1: Video Compression Standards

Year   Standard               Publisher
1984   H.120                  ITU-T
1988   H.261                  ITU-T
1993   MPEG-1 Part 2          ISO, IEC
1995   H.262/MPEG-2 Part 2    ISO, IEC, ITU-T
1996   H.263                  ITU-T
1999   MPEG-4 Part 2          ISO, IEC
2003   H.264/MPEG-4 AVC       ISO, IEC, ITU-T
2013   H.265                  ISO, IEC, ITU-T (under development at time of writing)

The requirements for a successful video coding standard include:

• Interoperability: should ensure that encoders and decoders from different manufacturers work together seamlessly.

• Innovation: should perform significantly better than the previous standard.

• Competition: should be flexible enough to allow competition between manufacturers based on technical merit. Only the bit-stream syntax and reference decoder are standardized.

• Independence from transmission and storage media: should be flexible enough to be used for a range of applications.

• Forward compatibility: should decode bit-streams from prior standards.

• Backward compatibility: prior generation decoders should be able to partially decode new bit-streams.

1.2.1 H.264 Standard

The most recent video compression standard, H.264, was finalized in May 2003 by the International Telecommunication Union (ITU) and the International Organization for Standardization (ISO) [1]. H.264, also known as MPEG-4 Part 10 and Advanced Video Coding, describes and defines a method of coding video that can give better performance than any of the preceding standards. Using H.264, it is possible to compress video into a smaller space, which means less transmission bandwidth and/or less storage space is required. H.264 has more options and parameters than any of the previous standards [1]. Tuning these parameters properly delivers high compression performance; tuning them improperly leads to poor-quality pictures and/or poor bandwidth efficiency. A standout feature of H.264 is its flexibility, coming with several tools for reducing redundant video information. This allows H.264 to vary from being highly complex to a rather simple algorithm, depending on the quality requirements. In addition to these tools, the H.264 standard defines 17 sets of capabilities, or profiles, that allow for this flexibility [1]. Each profile is targeted at a specific application, ranging from high quality 3D stereoscopic video compression to relatively low quality 2D video streaming.

1.3 H.264 Overview

The H.264 encoding process consists of three steps to produce a compressed H.264 bitstream: prediction, transformation, and encoding. The H.264 decoding process performs the complementary steps to produce a decoded video sequence. The decoded version is, in general, not identical to the original sequence because H.264 is a lossy format.

The structure of a typical encoder is shown below. Data is segmented in units of a macroblock (MB), which corresponds to a 16x16 group of displayed pixels. A prediction MB is generated and subtracted from the current MB to form a residual MB. The residual MB is transformed, quantized, and encoded. Meanwhile, the quantized data is re-scaled, inverse transformed, and added to the prediction MB to form a reconstructed MB, which is stored for later predictions.

1.4 Prediction

H.264 supports a wide range of prediction options that include

• Intra prediction: Prediction formed from previously encoded data within the current frame.

• Inter prediction: Prediction formed using motion compensation from previously coded frames.

• Multiple prediction block sizes: Used in both intra and inter prediction in order to form more

accurate predictions.

• Multi Frame Prediction: Used in inter prediction in order to form more accurate predictions.

• Skip Mode: no macroblock data or residual data is coded.

Intra and inter prediction are discussed in more detail in the following sections. Macroblocks are grouped into slices, which are processed in raster scan order. A picture, an array of luma samples and two corresponding arrays of chroma samples, may be split into one or several slices as shown in Figure 1.1.

A slice may be coded using one of several coding types, which include:

• I Slice: A slice in which all MBs of the slice are coded using intra prediction.

Figure 1.1: Picture divided into three slices

• P Slice: In addition to the coding types of the I slice, some MBs of the P slice can also be coded using inter prediction with at most one motion-compensated prediction signal per prediction block.

• B Slice: In addition to the coding types available in a P slice, some MBs of the B slice can also be coded using inter prediction with two motion-compensated prediction signals per prediction block.

The three slice types mentioned above are very similar to those in previous standards, with the exception of the use of reference pictures as discussed later on. Two new coding types for slices are SP and SI slices. For more information, refer to [2].

1.4.1 Intra Prediction

A prediction that is formed based on spatial data is known as intra prediction. MBs formed using intra prediction are known as I MBs. Intra prediction uses samples from adjacent, previously coded blocks to predict values in the current MB. H.264 supports three choices of intra prediction block sizes for the luma component: 16x16, 8x8, and 4x4. Figure 1.2 shows the prediction samples for a 4x4 luma block. A single prediction block is generated for each chroma component. Each prediction block is generated using one of many possible prediction modes. Table 1.2 summarizes these modes.

Figure 1.2: Prediction Samples for luma 4x4 prediction

The choice of intra prediction block size for the luma component tends to be a trade-off between prediction efficiency and the number of bits required to code the prediction mode. Smaller blocks tend to give more accurate predictions but more bits are required to code the prediction choices.

Larger blocks tend to give less accurate predictions but require fewer bits to code the prediction choice.

H.264 supports a number of intra prediction modes for different block sizes. Block sizes of

16x16, 8x8, and chroma blocks each use a subset of the nine prediction modes for 4x4 luma blocks.

Refer to [3] for a complete description of prediction modes for all block sizes.
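To make two of the modes in Table 1.2 concrete, the following sketch (illustrative Python; the function names are ours, and the standard's integer rounding for the DC mean, (sum + 4) >> 3, is approximated here with round()) implements the vertical and DC modes for a single 4x4 luma block:

```python
# Sketch of two intra 4x4 luma prediction modes from Table 1.2.
# Mode 0 (vertical) copies the neighbor samples above the block downward;
# Mode 2 (DC) fills the block with the mean of the upper and left neighbors.

def predict_vertical(above):
    """Mode 0: above holds samples A, B, C, D directly above the 4x4 block."""
    return [list(above) for _ in range(4)]

def predict_dc(above, left):
    """Mode 2: mean of samples A..D (above) and I..L (left)."""
    dc = round((sum(above) + sum(left)) / 8)
    return [[dc] * 4 for _ in range(4)]

above = [100, 104, 108, 112]   # A, B, C, D
left = [98, 99, 101, 102]      # I, J, K, L

print(predict_vertical(above)[3])   # every row repeats the upper neighbors
print(predict_dc(above, left)[0])
```

An encoder would generate each candidate prediction block this way, subtract it from the current block, and keep the mode with the smallest residual cost.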

1.4.2 Inter Prediction

A prediction that is formed based on temporal data is known as inter prediction. This involves two processes: motion estimation and motion compensation. Motion estimation consists of locating a search region and forming a prediction MB. Motion compensation consists of subtracting the predicted block from the current MB to form the residual that is to be coded and transmitted. The

Table 1.2: Luma Prediction Modes, 4x4 prediction

Mode 0 (Vertical): The upper samples A, B, C, D are extrapolated vertically.
Mode 1 (Horizontal): The left samples I, J, K, L are extrapolated horizontally.
Mode 2 (DC): All samples in P are predicted by the mean of samples A..D and I..L.
Mode 3 (Diagonal Down-Left): The samples are interpolated at a 45° angle between lower-left and upper-right.
Mode 4 (Diagonal Down-Right): The samples are extrapolated at a 45° angle down and to the right.
Mode 5 (Vertical-Right): Extrapolation at an angle of approximately 26.6° to the right of vertical.
Mode 6 (Horizontal-Down): Extrapolation at an angle of approximately 26.6° below horizontal.
Mode 7 (Vertical-Left): Extrapolation at an angle of approximately 26.6° to the left of vertical.
Mode 8 (Horizontal-Up): Interpolation at an angle of approximately 26.6° above horizontal.

block of samples to be predicted can be predicted using a range of block sizes. The macroblock can be split into one, two, or four macroblock partitions:

• one 16x16 partition (the entire MB)

• two 8x16 partitions

• two 16x8 partitions

• four 8x8 partitions

If an 8x8 partition is chosen, then each 8x8 block of samples may be further divided into a sub- macroblock consisting of one, two, or four sub-macroblock partitions:

• one 8x8 partition

• two 8x4 partitions

• two 4x8 partitions

• four 4x4 partitions

Figure 1.3 illustrates MB partitioning.

Figure 1.3: Macroblock Partitioning for Inter prediction. Top: Segmentation of MBs, Bottom: segmentation of 8x8 partitions

The prediction signal for each MB is obtained by displacing an area of the corresponding reference picture, which is specified by a translational motion vector and a reference picture index. Each partition requires a reference picture index and motion vector. If an 8x8 partition is used, each sub-partition requires a motion vector as well. It is possible that 16 motion vectors are required for a single P MB. The motion vector components are differentially coded using either median or directional prediction from neighboring blocks.

H.264 has the capability of interpolating reference pictures. Each partition in an inter-coded MB is predicted from an area of the same size in a reference picture. The motion vector has 1/4 pixel resolution for the luma component and 1/8 pixel resolution for the chroma components. The sub-pixel positions do not exist in the reference picture, so it is necessary to create them using interpolation from nearby image samples. Interpolating the reference picture at the half and quarter pixel locations leads to more accurate motion representation. More detailed information on fractional sample accuracy is presented in [4].

H.264 supports multipicture motion-compensated prediction [5] [6]. This means that more than one prior coded picture can be used as reference for the motion-compensation process. Figure 1.4 illustrates this concept. Multipicture motion-compensated prediction requires both the encoder and decoder to store the reference pictures used for inter prediction in the decoded picture buffer (DPB).

Unless the size of the DPB is set to one picture, the index at which the reference picture is located inside the DPB must be signalled. The reference index parameter is transmitted for each motion-compensated 16x16, 16x8, 8x16, or 8x8 luma block. Motion compensation for regions smaller than 8x8 uses the same reference index for prediction of all blocks within the 8x8 region.

Figure 1.4: Multiframe Motion Compensation. Motion vector and reference index are transmitted

An additional MB mode for a P MB is called skip, or P Skip. For this coding type, neither a quantized prediction error signal, nor a motion vector or reference index parameter, is transmitted. P Skips are useful for temporally homogeneous regions, which can be represented with very few bits.

The concept of inter prediction using B slices is generalized in H.264/AVC [7]. This extension refers back to [8] and is further investigated in [9]. Unlike previous standards, pictures that contain B slices can be referenced by other pictures for motion-compensated prediction, depending on the memory management control operation of the DPB. B slices utilize two distinct lists of reference pictures, list 0 and list 1. The reference pictures may include pictures before and after the current picture in display order. A prediction block for a B MB is generated from two prediction regions in reference pictures. Optionally, the prediction block may be weighted according to the temporal distance between the current and reference picture(s), known as weighted prediction.

In B slices, four different types of inter-picture prediction are supported: list 0, list 1, bi-predictive, and direct prediction. For list 0 and list 1 prediction, a prediction block is generated from either a picture in list 0 or list 1. The respective motion vector and reference picture index (if necessary) are coded and transmitted. Bi-prediction consists of forming a prediction block from a weighted average of motion-compensated list 0 and list 1 prediction signals. Direct prediction mode is inferred from previously coded syntax elements and can be either list 0 or list 1 prediction or bi-predictive. Direct mode is similar to the P skip prediction mode, the difference being that direct mode encodes and transmits an error signal for the MB. If no prediction error is encoded and transmitted for direct mode, this is known as B Skip mode, which is coded very similarly to a P skip.

1.5 Transform, Scaling, and Quantization

Similar to previous video coding standards, H.264 utilizes transform coding of the residual data. However, H.264 performs the transformation on 4x4 blocks, and instead of a 4x4 discrete cosine transform (DCT), an integer transform is used with similar properties to the 4x4 DCT. To ease the memory requirement of H.264, the integer transform was developed such that there would be zero mismatch between the forward and inverse transforms [1]. Also, because it is an integer transform, there is no loss of decoding accuracy due to rounding. The DCT operates on X, a block of NxN samples, typically image samples or residual values after prediction, to create Y, an NxN block of coefficients.

Equation 1.1 shows the 4x4 DCT in matrix form:

Y = AXA^T = \begin{bmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{bmatrix} [X] \begin{bmatrix} a & b & a & c \\ a & c & -a & -b \\ a & -c & -a & b \\ a & -b & a & -c \end{bmatrix} \quad (1.1)

where a = \frac{1}{2}, b = \sqrt{\frac{1}{2}} \cos\left(\frac{\pi}{8}\right), c = \sqrt{\frac{1}{2}} \cos\left(\frac{3\pi}{8}\right).

The DCT shown in Equation 1.1 can be factorized to form the H.264 integer transform shown in Equation 1.2:

Y = (CXC^T) \otimes E = \left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix} [X] \begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{bmatrix} \right) \otimes \begin{bmatrix} a^2 & \frac{ab}{2} & a^2 & \frac{ab}{2} \\ \frac{ab}{2} & \frac{b^2}{4} & \frac{ab}{2} & \frac{b^2}{4} \\ a^2 & \frac{ab}{2} & a^2 & \frac{ab}{2} \\ \frac{ab}{2} & \frac{b^2}{4} & \frac{ab}{2} & \frac{b^2}{4} \end{bmatrix} \quad (1.2)

where a and b are defined as in Equation 1.1.

The element-wise scaling \otimes E performed at the end of the integer transform is absorbed into the quantization process. Thus, the core portion of the integer transform can be performed using only addition, subtraction, and shifts. The same is true for the inverse transform shown in Equation 1.3. Further discussion on the specifics of the H.264 integer transform can be found in [10].

X' = C_i^T (Y \otimes E_i) C_i = \begin{bmatrix} 1 & 1 & 1 & \frac{1}{2} \\ 1 & \frac{1}{2} & -1 & -1 \\ 1 & -\frac{1}{2} & -1 & 1 \\ 1 & -1 & 1 & -\frac{1}{2} \end{bmatrix} \left( [Y] \otimes \begin{bmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{bmatrix} \right) \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \frac{1}{2} & -\frac{1}{2} & -1 \\ 1 & -1 & -1 & 1 \\ \frac{1}{2} & -1 & 1 & -\frac{1}{2} \end{bmatrix} \quad (1.3)

where a and b are defined as in Equation 1.1.
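As a concrete check on the factorization, the sketch below (illustrative pure Python, not the reference implementation) computes the core transform CXC^T using only additions, subtractions, and shifts, and verifies it against a direct matrix multiplication; the ⊗E scaling is omitted since it is absorbed into quantization:

```python
# Core 4x4 H.264 forward integer transform: Y = C X C^T (scaling by E is
# deferred to the quantizer). Illustrative sketch only.

C = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]

def matmul(A, B):
    """Plain 4x4 integer matrix multiply, used here only for verification."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def forward_core(X):
    """Y = C X C^T using butterfly additions and shifts only."""
    def transform_rows(M):
        out = []
        for r in M:
            s0, s1 = r[0] + r[3], r[1] + r[2]   # butterfly sums
            d0, d1 = r[0] - r[3], r[1] - r[2]   # butterfly differences
            out.append([s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)])
        return out
    # 1-D transform on rows, then on columns (via transposes).
    return transpose(transform_rows(transpose(transform_rows(X))))

# Example residual block
X = [[5, 11, 8, 10],
     [9, 8, 4, 12],
     [1, 10, 11, 4],
     [19, 6, 15, 7]]

assert forward_core(X) == matmul(matmul(C, X), transpose(C))
print(forward_core(X)[0][0])  # DC term equals the sum of all 16 samples
```

The butterfly form is why the core transform needs no multiplications: every row of C contains only ±1 and ±2.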

1.5.1 Hadamard Transform

If the macroblock is predicted using 16x16 intra prediction or if the data represents chrominance, the Hadamard transform is performed on the lowest frequency, or 'DC', coefficients [3]. The DC coefficients are extracted and stored as a separate 4x4 block of data if working with a 16x16 luma macroblock, as shown in Figure 1.5. In the case of chroma data, a 2x2 Hadamard transform is performed on the

2x2 block of chroma DC coefficients.

Figure 1.5: Extracting DC Coefficients

The 4x4 block of DC coefficients (X_DC) is then transformed to obtain (Y_DC) using Equation 1.4. If working with a 2x2 block of data, the coefficients are transformed using Equation 1.5. The Hadamard transform provides additional data reduction in cases where prediction methods create an abundance of DC data. Further information on the motivation for the Hadamard transform can be found in [7].

Y_{DC} = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix} [X_{DC}] \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix} \quad (1.4)

Y_{DC} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} [X_{DC}] \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \quad (1.5)
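The 4x4 case of Equation 1.4 can be sketched as follows (illustrative Python with a hypothetical DC block; the 1/2 scaling is kept in floating point here, whereas the real encoder folds scaling and rounding into the quantization stage):

```python
# 4x4 Hadamard transform of a block of DC coefficients (Equation 1.4),
# as extracted from a 16x16 intra-predicted luma MB. Illustrative sketch.

H = [[1, 1, 1, 1],
     [1, 1, -1, -1],
     [1, -1, -1, 1],
     [1, -1, 1, -1]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def hadamard4(x_dc):
    """Y_DC = (1/2) * H * X_DC * H, per Equation 1.4."""
    y = matmul(matmul(H, x_dc), H)
    return [[v / 2 for v in row] for row in y]

# Hypothetical 4x4 block of DC coefficients (smooth region: similar values)
x_dc = [[40, 42, 41, 39],
        [38, 40, 43, 41],
        [39, 41, 40, 42],
        [40, 39, 38, 41]]

y_dc = hadamard4(x_dc)
print(y_dc[0][0])   # top-left output: half the sum of all 16 DC values
```

For a smooth region like this one, most of the energy concentrates in the single top-left coefficient, which is exactly the data reduction the text describes.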

1.5.2 Quantization

A quantizer maps a signal with a range of values X to a quantized signal with a reduced range of values Y. It should be possible to represent the quantized signal with fewer bits than the original since the range of possible values is smaller. H.264 uses a scalar quantizer that maps one input value to one quantized output value.

A simple scalar quantization process consists of rounding a fractional number to the nearest integer.

This process is lossy since it is not possible to determine the exact value of the original signal. A quantization parameter is used for determining the quantization of transform coefficients in H.264.

The parameter can take 52 values. The values are arranged so that an increase of 1 in quantization parameter means an increase of quantization step size by approximately 12% [7]. Typically, the result is a block in which most or all of the coefficients are zero, with few non-zero coefficients.

Setting a QP to a high value means that more coefficients are set to zero, resulting in high com- pression at the expense of poor decoded image quality. Setting QP to a low value means that more non-zero coefficients remain after quantization, resulting in better image quality at the decoder but also in lower compression.

The quantization process performs both the scaling required by the integer transform and the actual data quantization. H.264 uses a scalar quantizer, which is also integer based [11]. The basic forward quantization process is shown in Equation 1.6:

Z = \mathrm{round}\left( \frac{Y}{Q_{step}} \right) \quad (1.6)
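A minimal sketch of Equation 1.6, assuming the standard relationship Qstep = 0.625 · 2^(QP/6), so the step size doubles every 6 QP increments (roughly the 12% per step noted above); a real encoder replaces the division with integer multiplier and shift tables:

```python
# Scalar quantization sketch (Equation 1.6) with the QP-to-step mapping
# Qstep = 0.625 * 2 ** (QP / 6). Illustrative floating-point version only.

def q_step(qp):
    return 0.625 * 2 ** (qp / 6)

def quantize(y, qp):
    """Z = round(Y / Qstep), per Equation 1.6."""
    return round(y / q_step(qp))

def dequantize(z, qp):
    return z * q_step(qp)

coeffs = [140, -37, 12, 5, -2, 1]    # hypothetical transform coefficients
for qp in (10, 28, 40):
    print(qp, [quantize(y, qp) for y in coeffs])
# A higher QP zeroes out more coefficients: more compression, lower quality.
```

Running this shows the behavior described in the text: at high QP only the largest coefficients survive, while at low QP most coefficients remain non-zero.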

1.6 Entropy Coding

The entropy coding portion of the H.264 encoder is where all the data necessary to recreate the video is converted to binary. An H.264 file is organized into syntax layers as shown in Figure 1.6. The H.264/AVC standard supports two methods of entropy coding.

Figure 1.6: H.264 Syntax Layers

1.6.1 Exp-Golomb Coding

The simpler method uses a single codeword table for all syntax elements except the quantized transform coefficients. This method is called exp-Golomb coding and has very simple and regular decoding properties. This coding method has the advantage of avoiding the creation of a variable length code table for each syntax element. Instead, only the mapping to the single codeword table is customized according to the data statistics.
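The single codeword table can be sketched as follows (illustrative Python): codeNum k is coded as M leading zeros, where M = floor(log2(k + 1)), followed by the binary representation of k + 1; signed syntax elements are first mapped to a codeNum:

```python
# Exp-Golomb coding sketch. An unsigned value (codeNum) k is written as
# [M zeros][binary of k + 1], where M = floor(log2(k + 1)). Signed syntax
# elements map to codeNum first: v > 0 -> 2v - 1, v <= 0 -> -2v.

def ue(code_num):
    """Unsigned Exp-Golomb codeword as a bit string."""
    x = code_num + 1
    m = x.bit_length() - 1          # number of leading zeros
    return '0' * m + format(x, 'b')

def se(v):
    """Signed Exp-Golomb codeword (used e.g. for motion vector differences)."""
    return ue(2 * v - 1 if v > 0 else -2 * v)

for k in range(5):
    print(k, ue(k))   # 0 -> '1', 1 -> '010', 2 -> '011', 3 -> '00100', ...
```

Note how small codeNum values, which the mapping arranges to be the most probable, get the shortest codewords.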

1.6.2 CAVLC

For transmitting the quantized coefficients, a more efficient method called Context-Adaptive Variable Length Coding (CAVLC) is employed. In CAVLC, the VLC tables for various syntax elements are switched depending on already transmitted syntax elements. Since the VLC tables are designed to match the corresponding conditional statistics, the entropy coding performance is improved in comparison to a single VLC table [7]. CAVLC is required for all profiles of the H.264 standard. Further details on CAVLC can be found in [7].

1.6.3 CABAC

The efficiency of entropy coding can be improved further with Context-Adaptive Binary Arithmetic Coding (CABAC) [12]. In CABAC, statistics of already coded syntax elements are used to estimate conditional probabilities. These conditional probabilities are used for switching between several estimated probability models. Compared to CAVLC, CABAC typically provides a bitrate reduction of 5%-15% [7]. More detailed information about CABAC can be found in [12].

1.7 Profiles and Levels

The H.264 standard specifies profiles, each of which contains a subset of the coding tools available in the H.264 standard. A profile defines a set of coding tools or algorithms that can be used in generating a conforming bitstream [1]. A level places constraints on certain key parameters of the bitstream, such as the maximum number of frames per second for a given resolution. All decoders that conform to a specific profile must support all the features of that particular profile. Encoders are not required to make use of any particular set of features supported in a profile but must produce conforming bitstreams. H.264 defines three profiles: baseline, main, and extended. Figure 1.7 illustrates these profiles and their respective features. A subset of baseline, constrained baseline, is also defined.

The baseline profile is intended for low delay applications such as mobile transmission [1]. The extended profile is a superset of the baseline profile, adding further tools that may be beneficial for efficient network streaming of H.264 data. The main profile is a superset of the constrained baseline profile and adds tools that may be suitable for broadcast and entertainment applications.

Figure 1.7: H.264 Profiles

1.8 Mode Selection

Many coding methods have been presented in the previous sections. Figure 1.8 illustrates all the prediction possibilities for a macroblock. These include:

• Skip Mode: no information sent for MB

• Four intra-16x16 modes

• Nine intra-4x4 modes, with a different choice possible for each 4x4 block

• 16x16 inter mode: prediction from reference picture(s), using one reference list (P or B MB) or two lists (B MB).

• 8x16 inter mode: prediction from multiple reference pictures as above, with the option of different reference picture(s) for each partition.

• 16x8 inter mode: same prediction choices as above.

• 8x8 inter mode: reference picture choices as above, with further sub-division of each 8x8 partition into 8x4, 4x8, or 4x4 sub MB partitions.

In addition to the prediction mode, the encoder can choose to change the QP, and within each inter mode the encoder has a wide choice of possible motion vectors (discussed in Chapter 2). There is an enormous number of options for coding each MB. Each combination of coding parameters will tend to generate a different number of coded bits, which can range from very low (skips) to high (intra), and a different distortion.

A video encoder aims to minimize coded bitrate and maximize decoded quality. This is a very difficult goal due to the large number of coding options as well as the need to decide the "best" tradeoff between minimizing bitrate and minimizing distortion. The three measurements that are taken into account when coding a MB are:

• Header bits: The number of bits required to signal the MB mode, plus any prediction parameters such as intra mode, reference choices, and/or motion vector difference.

• Coefficient bits: The number of bits required to code the quantized transform coefficients.

Figure 1.8: Available Prediction Modes

• SSD (Sum of Squared Differences): Distortion of the decoded, reconstructed MB, measured as the sum of squared differences (Equation 1.7), where x, y are the sample positions in a block, b(x, y) are the original sample values, and b'(x, y) are the decoded sample values at each sample position.

SSD = \sum_{x,y} \left( b'(x, y) - b(x, y) \right)^2 \quad (1.7)

1.8.1 Rate Distortion Optimized Mode Selection

Rate Distortion Optimization (RDO) mode selection is a technique for choosing the coding mode of a MB based on the rate and distortion cost [1]. The bitrate cost R and distortion cost D are combined in a single cost function J, shown in Equation 1.8. The RDO mode selection algorithm attempts to find a mode that minimizes the joint cost J. The trade-off between rate and distortion is controlled by the Lagrange multiplier λ.

J = D + λR (1.8)
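The selection process of Equation 1.8 can be sketched as follows (illustrative Python; the candidate modes and their distortion/rate numbers are hypothetical, not measurements from the encoder described in this thesis):

```python
# Rate-distortion optimized mode selection sketch: for each candidate mode,
# combine distortion D (SSD, Equation 1.7) and rate R (header + coefficient
# bits) into J = D + lambda * R, and keep the mode with minimum J.
# All numeric costs below are hypothetical, for illustration only.

def ssd(original, decoded):
    """Sum of squared differences (Equation 1.7) over paired samples."""
    return sum((b2 - b1) ** 2 for b1, b2 in zip(original, decoded))

def best_mode(candidates, lam):
    """candidates: iterable of (mode_name, distortion, rate_bits) tuples."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])

# Hypothetical costs for one macroblock
candidates = [
    ("skip",        4000,   1),   # almost no bits, high distortion
    ("inter_16x16", 2500,  90),
    ("inter_8x8",   1800, 240),   # better prediction, more motion vectors
    ("intra_4x4",   1600, 400),
]

print(best_mode(candidates, lam=1.0)[0])    # low lambda favors low distortion
print(best_mode(candidates, lam=60.0)[0])   # high lambda favors low rate
```

Sweeping λ moves the winner from the high-quality, expensive modes toward skip, which is exactly the rate/distortion trade-off the Lagrange multiplier controls.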

1.9 Motivation and Organization

Because of the many encoding options, this thesis presents a rate distortion optimization technique that incorporates multiple block sizes, multiple reference frames, and B frames.

This thesis is divided into four additional chapters. Chapter 2 presents interprediction in terms of multi-frame coding. Chapter 3 presents B frame coding in more detail. Chapter 4 discusses RDO implemented on multi-frame coding and B frames and compares the compression performance to a purely error-based mode decision process. Finally, Chapter 5 provides conclusions and details on future work concerning the implementation and performance of the proposed RDO scheme.

CHAPTER II

Multi Frame Prediction

2.1 Interprediction Overview

A prediction formed temporally is referred to as interprediction. Interprediction requires the use of at least one picture that has previously been encoded, a reference picture. The basic steps of interprediction are listed below. Given the current state of the encoder, a logical stepping stone before adding B frames is to add multi frame prediction capability, since B frames require more than a single reference frame in order to be useful. The following sections discuss the process of adding multi frame prediction capabilities.

• Interpolate the picture(s) in the DPB to generate 1/4 sample positions in the luma component and 1/8 sample positions in the chroma component.

• Choose an inter prediction mode. This involves the choice of reference picture, choice of macroblock partition and sub-partition, and the choice of prediction types, which include list 0 and list 1 depending on the MB type.

• Choose motion vectors.

• Predict the motion vectors from previously-transmitted vectors and generate motion vector differences.

• Code the MB type, choice of prediction references, motion vector differences, and residual data.

• Apply a deblocking filter.

Each partition in an inter-coded MB is predicted from an area of the same size in a reference picture. The offset between the two areas, the motion vector, has 1/4 pixel resolution for the luma component and 1/8 pixel resolution for the chroma components. Sub-pixel motion estimation is possible in H.264. Refer to [1] for more information.

2.2 Syntax Overview

Figure 1.6 shows an overview of the H.264 syntax elements. It is worthwhile to discuss some of these syntax elements for the discussion that follows. Coded H.264 data is stored or transmitted in a series of packets called Network Abstraction Layer (NAL) units. The first NAL unit to be discussed is the

Sequence Parameter Set (SPS). This particular NAL unit carries coding parameters common to an entire video sequence. These parameters include the profile, constraints on the profile, the number of reference frames, and the size of the image. Picture Parameter Sets (PPS) contain parameters common to a sequence of coded frames, such as the entropy encoding type, the number of active reference pictures, and the initial QP. A complete list of SPS and PPS parameters can be found in [3].

Another NAL unit of interest is the slice layer. The slice layer is made up of the slice header, which conveys information common to all macroblocks in the slice, such as the slice type, which determines which macroblock types are allowed. For example, I MBs are permitted in P and B slices.

The slice header also provides the frame number to which the slice corresponds, the reference picture settings, and the default QP. A full list of the syntax elements found in the slice layer can be found in [3].

The slice data section contains the series of MBs that make up that particular slice. The proposed implementation contains one slice per frame, and all MBs within that slice are equivalent to the slice type. This means that if a slice type is set as P type, then only P MBs are contained within that slice.

Each MB in the slice contains what is known as the macroblock layer. An overview of the MB layer for the Baseline profile is shown in figure 2.1. If the mb pred or sub mb pred process yields a coded block pattern of zero, indicating that the transform and quantize process returned zero coefficients, the rest of the encoding process can be skipped.

Each MB contains a mb pred process or a sub mb pred process if sub-partitions are present.

A simple overview of each of these processes is shown in figure 2.2 and figure 2.3. As will be shown later, the reference indices and motion vector differences may not be present under certain conditions and thus need not be encoded. Figure 2.1 also indicates the residual process. This process involves implementing CAVLC or CABAC on the transformed and quantized coefficients.

The CAVLC process does not change for multi frame prediction and so will not be discussed any further. CABAC, on the other hand, does change, because MB data is encoded within the CABAC process itself rather than through the look-up tables used by CAVLC. CABAC has not been modified in the proposed encoder to support multi frame prediction.

2.3 Picture Ordering

Before discussing the topic of multi frame encoding, it is important to distinguish three different orderings of pictures:

• Decoding order - the order in which pictures are decoded from the bitstream.

• Display order - the order in which pictures are output for display.

• Reference order - the order in which pictures are arranged for inter prediction of other pictures.

Figure 2.1: Macroblock Layer Overview: Baseline

The parameter frame num, decoded from the slice header, determines the decoding order of pictures. In most cases, frame num for each decoded picture increases by one compared with the previous reference frame in decoding order. Display order is determined by the parameters TopFieldOrderCount and BottomFieldOrderCount, collectively known as Picture Order Count (POC).

The display order is derived from the slice header using one of three methods:

• Type 0: The least significant bits of POC are sent in every slice header. This allows maximum flexibility but requires the most bits compared to the other methods.

• Type 1: A 'cycle' of POC increments is set up in the sequence parameter set, and POC changes according to this cycle unless otherwise signalled in the slice header using a delta offset. The cycle defines the interval between frames used for reference, plus a POC offset for frames not used for reference. Only the delta offset value is encoded and transmitted.

Figure 2.2: mb pred syntax overview

• Type 2: POC is derived directly from frame num, and display order is the same as decoding order.

The proposed encoder uses Type 0 for display order. An example using only P frames is shown in figure 2.4. Starting with the I slice, frame num starts at a value of zero. If the previous picture is used for reference, frame num increments by a value of one. Since all frames in this example are used for reference, frame num is incremented by one for each frame. POC is incremented by two for every complete frame, i.e. every two fields. Display order is then determined by the values of POC. In this particular sequence, decoding order and display order are the same.

Figure 2.3: sub mb pred syntax overview

2.4 Reference Picture Lists

A picture that is coded and available for reference is stored in the DPB and marked as one of the following:

• short term reference picture, indexed according to frame num (P slice) or POC (B slice)

• long term reference picture

Figure 2.4: Display Order Example, Type 0

The proposed encoder focuses on short term reference pictures. More information on long term reference pictures can be found in [3]. Reference pictures are ordered in one or two lists prior to encoding or decoding a frame. In the case of P slices, a single list, list0, is used. The default order of list0 depends on decoding order. The list orders are important, since indices to reference pictures earlier in the list require fewer bits to signal. Hence, the default orders are organized so that the reference pictures temporally 'closer' to the current picture occur early in the list, since these are most likely to be the best prediction match for the current picture. Figure 2.5 shows an example of list0 ordering for P slices.

The default reference picture list order can be changed in order to place a particularly useful reference frame earlier in the list than its usual position. This feature is not implemented in the proposed encoder.

2.5 Exp-Golomb Coding

Exponential Golomb coding is utilized for most of the syntax elements outside of residual data. An Exp-Golomb codeword consists of a prefix of 0s followed by a stop bit '1' and a suffix containing a binary number related to the value being encoded, codeNum. The length of the suffix is identical to the number of zeros found in the prefix. The codeNum derived from the suffix is interpreted differently based on the type of Exp-Golomb code used. By assigning short codewords to frequently-occurring data symbols and long codewords to less common data symbols, the data may be represented in a compressed form. Table 2.1 illustrates the first few Exp-Golomb codes.

Figure 2.5: Reference Picture Order Example: P Slices

Table 2.1: Exp-Golomb Codewords

codeNum    Codeword
0          1
1          010
2          011
3          00100
4          00101
5          00110
6          00111
7          0001000
8          0001001
...        ...
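The construction above can be sketched in a few lines: exp_golomb_ue reproduces the codewords of Table 2.1, and exp_golomb_se applies the signed (se) mapping (function names are illustrative):

```python
def exp_golomb_ue(code_num):
    """ue: prefix of zeros, stop bit, suffix -- the binary form of codeNum + 1."""
    bits = bin(code_num + 1)[2:]          # e.g. codeNum 3 -> '100'
    return "0" * (len(bits) - 1) + bits   # prefix length equals suffix length

def exp_golomb_se(value):
    """se: map a signed value to codeNum (positive -> odd, non-positive -> even)."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    return exp_golomb_ue(code_num)
```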

The variable codeNum can either directly represent the syntax element value to be coded or be used in a mapping process defined in [3]. Multiple mapping methods are necessary since Exp-Golomb by itself cannot support signed values. An additional reason is that a syntax element may belong to a set of elements in which the number of elements is less than the value of the syntax element itself.

Exp-Golomb then maps that element to an index in the set, reducing the length of the codeword.

Table 2.2 gives an overview of the mapping processes. The te mapping process is called upon very little throughout the coding process. One of the few places it is used is the reference index syntax element. Caution must be used when encoding this particular syntax element, since there are multiple factors that determine whether the reference index is part of the bitstream, as will be discussed in a later section.

Table 2.2: Mappings to codeNum

Mapping type   Description
ue             Unsigned direct mapping: used for macroblock type and others.
te             Truncated mapping: if the largest possible value of the syntax element is 1, a single bit is sent; otherwise ue mapping is used.
se             Signed mapping: used for motion vector difference, delta QP, and others.
me             Mapped symbols: the syntax element is mapped according to a table specified in [3].

2.6 Motion Vector Prediction

In order to save on the number of bits coded, a predicted motion vector is determined in the inter prediction process. The predicted motion vector is then subtracted from the actual motion vector to form the motion vector difference: syntax element mvd lX, or sub mvd lX if sub-partitions are present, where X = 0 or 1 indicates the proper list. Encoding a motion vector can cost a significant number of bits, especially if small partition sizes are chosen. Motion vectors for neighboring partitions are generally highly correlated, and so each motion vector is predicted from the surrounding, previously encoded partitions. These partitions may lie within the current MB or a neighboring MB.

The method of forming the predicted motion vector, MVP, depends on the motion compensation partition size and on the availability of nearby vectors. Figure 2.6 shows neighboring blocks A, B, and C for 16x16 MBs only. Figure 2.7 shows neighboring blocks A, B, and C for a 16x16 MB with different partition sizes as the neighboring blocks. The MVP is the median of the motion vectors for partitions A, B, and C.
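The median rule can be sketched per component, with motion vectors represented as (x, y) tuples (the function name is illustrative):

```python
def predict_mv(mv_a, mv_b, mv_c):
    """MVP: component-wise median of the neighboring motion vectors A, B, C."""
    median = lambda a, b, c: sorted((a, b, c))[1]
    return (median(mv_a[0], mv_b[0], mv_c[0]),
            median(mv_a[1], mv_b[1], mv_c[1]))
```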

Figure 2.6: Current and neighboring partitions: 16x16 partitions

If one or more of the previously transmitted blocks shown in figures 2.6 or 2.7 is unavailable, the choice of MVP is modified accordingly. Full details of these modifications can be found in [3]. A partition may be unavailable because it lies outside the current picture or the current slice, or because it was coded as a skipped MB.

Figure 2.7: Current and neighboring partitions: different partition sizes

2.7 Multi Frame Encoding

A simple block match algorithm for motion estimation can be described as:

• create a search window consisting of N search points

• calculate the SSE for each search point

• select the block with the least SSE as the best candidate

• determine the corresponding motion vector for the best candidate
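The steps above amount to an exhaustive search over candidate offsets; a minimal sketch (the ±4 search range and frame dimensions are illustrative):

```python
import numpy as np

def full_search(cur_block, ref_frame, top, left, search_range=4):
    """Exhaustive block match: return the motion vector minimizing SSE."""
    n = cur_block.shape[0]
    best_mv, best_sse = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > ref_frame.shape[0] or x + n > ref_frame.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref_frame[y:y + n, x:x + n].astype(np.int64)
            sse = int(np.sum((cur_block.astype(np.int64) - cand) ** 2))
            if sse < best_sse:
                best_mv, best_sse = (dy, dx), sse
    return best_mv, best_sse
```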

For the addition of multi frame coding, several changes must be made to the encoding process. The first of these changes is in the sequence parameter set (SPS). This is where the maximum number of reference frames is determined by the parameter num ref frames. The second change is in the picture parameter sets (PPS). The PPS contains a parameter called num ref idx l0 minus1. This parameter tells the decoder the number of reference pictures contained in list0. This is not necessarily the maximum number of reference frames, as will be shown for B frames. For this discussion, the maximum number of reference frames will also be the number of frames contained in list0. For example, if the number of reference frames is chosen as 3, the value of num ref idx l0 minus1 can be 0, 1, or 2 but cannot exceed the maximum number of reference frames minus one.

The third syntax element that must be set with caution falls within the slice header: the parameter num ref idx active override flag. The decoder determines the number of active reference frames from num ref idx l0 minus1 in the PPS. However, the number of active reference frames is not always at this value. If this flag is set, it means that the DPB is not yet full and the number of reference pictures must be overridden by the value of num ref idx l0 minus1 sent in the slice header. For instance, if num ref frames = 3, frames two and three will have only one and two reference frames available for coding, respectively. These numbers of reference frames must be recognized by the decoder. Therefore, in the slice header of each of these frames, num ref idx active override flag is set to one and num ref idx l0 minus1 is set to zero and one, respectively. When the picture buffer becomes full, num ref idx active override flag is set to zero and the number of active reference frames is inferred from the PPS in order to save bits.

The final parameter that must be adjusted from single frame coding exists in the macroblock/sub-macroblock layer. A reference index, syntax element ref idx l0, must be coded and transmitted, indicating which reference frame in list0 is being used for inter prediction. In the case of single reference frame prediction, ref idx l0 is inferred to be zero by the decoder. Multi frame prediction requires that ref idx l0 be coded and transmitted. Of course, if only one reference frame is active, indicated by num ref idx l0 minus1 in the slice header, it is not necessary to encode and transmit ref idx l0.

After applying the discussed changes to the syntax elements, the following tests are generated using the proposed encoder on the foreman, flower, and flyby video files [13]. The tests consist of coding up to five reference frames for 16x16 MBs only. QPs range from 10 (upper right) to 40 (lower left). The PSNR for the luma component is calculated for the entire sequence. As expected, in each case there is a slight improvement in bitrate for all QP values.

[Plot: Foreman Multi Frame Coding: Performance, 16x16, 100 frames at 25 FPS; PSNR (Y) versus Rate (kbps) for 1 to 5 reference frames]

Figure 2.8: Multi Frame Prediction: Foreman

[Plot: Flower Multi Frame Coding: Performance, 16x16, 100 frames at 25 FPS; PSNR (Y) versus Rate (kbps) for 1 to 5 reference frames]

Figure 2.9: Multi Frame Prediction: Flower

The encoding time for each sequence in seconds is shown in Figures 2.11 through 2.13. With the exception of Foreman, the noticeable difference for all three tests is the transition from a single reference frame to two reference frames. This appears to be most beneficial for slow moving images such as flower and flyby. Foreman has relatively high motion, particularly around the face and mouth regions. Its results show more of an increase in performance when transitioning from two reference frames to three reference frames. Flower also follows this trend to a degree due to the motion in that particular sequence.

2.8 Conclusions

In this chapter, a detailed description was given of the syntax elements of the proposed encoder that needed to be modified in order to support multiple reference frames. This functionality is imperative for the implementation of B frames later on. In addition, reference list management was also introduced for P slices. Chapter 3 will expand on this idea to incorporate B slices. A rate distortion curve was generated for three video sequences of different resolutions. The results showed an increase in performance for each additional reference frame. The negative effects of multi frame coding lie in the complexity of the encoder. More memory must be allocated for reference frame storage, as well as additional processing time for the block search algorithm. In applications where speed is a high priority, multi frame encoding must be exercised with caution. However, if quality is the primary goal, multi frame prediction certainly is useful.

[Plot: FlyBy Multi Frame Coding: Performance, 16x16, 100 frames at 25 FPS; PSNR (Y) versus Rate (kbps) for 1 to 3 reference frames]

Figure 2.10: Multi Frame Prediction: Flyby

[Plot: Foreman Multi Frame Coding: Complexity, 16x16, 100 frames at 25 FPS; Time (s) versus QP for 1 to 5 reference frames]

Figure 2.11: Multi Frame Prediction Complexity: Foreman

[Plot: Flower Multi Frame Coding: Complexity, 16x16, 100 frames at 25 FPS; Time (s) versus QP for 1 to 5 reference frames]

Figure 2.12: Multi Frame Prediction Complexity: Flower

[Plot: FlyBy Multi Frame Coding: Complexity, 100 frames at 25 FPS; Time (s) versus QP for 1 to 2 reference frames]

Figure 2.13: Multi Frame Prediction Complexity: Flyby

CHAPTER III

B Frame Inter Prediction

3.1 B Frame Inter Prediction Overview

As mentioned in Chapter 1, B frame prediction consists of predicting MB data from two prediction regions in reference pictures. The reference pictures are ordered into two lists, list0 and list1. If the specified slice in the PPS is defined to be a B slice, a MB contained within that slice can be an I, P, or B MB. As was the case with P slices, if a B slice is specified in the PPS, then all MBs in that slice will be coded as B MBs in the proposed encoder. B MBs can be coded in a variety of ways, such as:

• Single prediction from list0 or list1: MB partitions can be predicted from either list.

• Biprediction (two references): requires two motion vectors, each pointing to a region of the same size in a reference picture, one from list0 and one from list1. Each sample of the prediction is calculated as an average of the samples in the list0 and list1 regions.

• Weighted prediction: a method of scaling the samples of motion-compensated prediction data (also applies to P MBs).

– Explicit: weighting factors are determined by the encoder and are transmitted in the slice header.

– Implicit: weighting factors are calculated based on the relative temporal positions of the list0 and list1 reference frames (useful in 'fade' transitions where one scene fades into another).

• Direct mode: no motion vector is transmitted for a B MB. Instead, the motion vectors are inferred from list0 and list1 based on previously coded vectors, and these are used to carry out bipredicted motion compensation of the decoded residual samples.

• Skip mode: no MB data or residual data is coded. All MB data is inferred by the decoder.

3.2 B Frame Reference Picture Lists

Similar to P slices, any previously coded picture that is marked for reference is stored in the DPB and is marked as a short term reference or long term reference. Again, the proposed encoder focuses on short term reference frames. Recall that when coding a P slice, reference pictures are ordered in list0 such that the most recently coded picture available for reference is the first entry in the list. This default order is based on decoding order, or the parameter frame num.

If a B slice is to be coded, it must have at least two reference pictures available in the DPB. One of these pictures will occur before the current picture in display order (past picture) and one will occur after the current picture in display order (future picture). The display order and transmission order for an IPBB sequence is shown in figure 3.1. The arrows indicate the reference frames used for inter prediction of each particular frame. In this example, the B slices are not used for reference prediction of any other pictures. POC increments by two for every complete frame, which consists of two fields. Also note that frame num increments only after a reference frame has been transmitted.

Figure 3.1: IPBB Display Order

Table 3.1: Display Order Example

Slice  Type  Used for reference  frame num  POC LSBs  Display Order
1st    I     yes                 0          0         0
2nd    P     yes                 1          6         3
3rd    B     no                  2          2         1
4th    B     no                  2          4         2
5th    P     yes                 2          12        6
6th    B     no                  3          8         4
7th    B     no                  3          10        5
8th    P     yes                 3          14        8

Unlike list ordering for P slices, B slices order their lists according to display order, or the variable POC. The default orderings of the lists stay consistent with the idea that the most recently coded pictures are the first pictures in each list. The default order for list0 and list1 is as follows:

• List0: the default order is (1) decreasing order of POC for pictures with POC earlier than the current picture, then (2) increasing order of POC for pictures with POC later than the current picture.

• List1: the default order is (1) increasing order of POC for pictures with POC later than the current picture, then (2) decreasing order of POC for pictures with POC earlier than the current picture.
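The two default orderings can be sketched as sorts on POC (the function name and POC values are illustrative):

```python
def default_b_lists(ref_pocs, current_poc):
    """Default reference list ordering for a B slice, derived from POC."""
    past = sorted((p for p in ref_pocs if p < current_poc), reverse=True)
    future = sorted(p for p in ref_pocs if p > current_poc)
    return past + future, future + past  # (list0, list1)

list0, list1 = default_b_lists([0, 2, 4, 8, 10], current_poc=6)
# list0 = [4, 2, 0, 8, 10], list1 = [8, 10, 4, 2, 0]
```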

An example of picture ordering for a B slice is shown in figure 3.2.

Figure 3.2: List0 and List1 Ordering Example

3.3 B Frame Coding

Performing a prediction of a B MB is not much different from a P MB. The reference list data must be handled differently, but the overall process is the same. Figure 3.3 shows a block diagram of the proposed encoder with the addition of B frame capabilities.

3.3.1 SPS and PPS

For the addition of B slices, there is no change in the sequence parameter set from the previous version of the encoder. The picture parameter sets require that the variable num ref idx l1 active minus1 be set. This parameter is the same as num ref idx l0 active minus1, discussed in chapter 2, except that it applies to list1. Setting these parameters improperly will cause some redundancy in the block match algorithm. For example, in figure 3.2, both list0 and list1 contain the same reference frames, only stored in a different order. The block match module (discussed later) will search all frames in each list, which is not necessary in this case. One solution is to set num ref idx l0 active minus1 = 1, meaning there are at most two reference frames contained in list0: 10 and 12. Likewise, set num ref idx l1 active minus1 = 2, meaning there are at most three reference frames in list1: 20, 22, and 28.

Figure 3.3: B MB Prediction Block Diagram

3.3.2 Decoded Picture Buffer

This module reads from the SPS and PPS to set up and handle the correct number of reference frames. The outputs of this module are list0 and list1. The DPB module must ensure that the maximum number of reference frames in each list, specified in the PPS, is not exceeded. The number of active reference frames must also be determined so that the mb pred function in the MB layer of encoding knows whether or not to transmit a reference index. If the decoded picture buffer exceeds the number of reference frames it can hold, this module removes the least recently coded picture, shifts the buffer, and finally inserts the most recent picture at the front of the buffer.

Once list0 and list1 are created, they are passed to the search window function.
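The sliding-window behavior described above can be sketched as follows (the function name, max_refs, and POC values are illustrative):

```python
from collections import deque

def push_reference(dpb, picture, max_refs):
    """Sliding-window DPB update: drop the oldest picture when the buffer is full."""
    if len(dpb) == max_refs:
        dpb.pop()            # remove the least recently coded picture
    dpb.appendleft(picture)  # insert the most recent picture at the front

dpb = deque()
for poc in (0, 2, 4, 6, 8):
    push_reference(dpb, poc, max_refs=3)
# the buffer now holds the three most recently coded pictures, newest first
```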

3.3.3 Create Search Window

This function takes in the number of references in list0 and list1 and creates a search window for each reference frame. By default, the search window is set to 24x24 samples for a 16x16 MB, meaning 81 search points. The user has the option to increase or decrease the search window size.

3.3.4 Block Match, Transform and Quantize

This function is responsible for selecting the best match for the current MB. Each search window is searched until the 16x16 block with the least amount of error is found. To increase the efficiency of the search, list0 and list1 are searched concurrently. Once the block with the least amount of error is determined, the motion vector, reference index, and the list from which the prediction was made are stored. Using the prediction data, a prediction block is formed that is passed to the transformation and quantization process described in chapter 1.

3.3.5 Motion Vector Prediction

Motion vector prediction is carried out in the same manner as for P MBs, discussed in chapter 2.

Recall that a neighboring partition may be marked as unavailable if it is outside the current frame or outside the current slice. If this is the case, the process adjusts accordingly to carry out motion vector prediction. When coding a B MB, one more condition can cause a neighboring partition to be unavailable: the neighboring partition was predicted from the opposite list. Partition A in figure 3.4 would be marked as unavailable since it was predicted using list1 while the current partition, partition X, was predicted using list0. After the transformation, quantization, and motion vector prediction are complete, the MB is ready to be coded. The coding process is the same process described in chapter 2 for P MBs.

Figure 3.4: B MB Motion Vector Prediction

3.4 B Frame Implementation

Once the changes discussed above are applied, the following tests were generated using the proposed encoder on foreman (176x144 luma resolution). The tests consist of coding an IPB sequence with a single reference frame used for P frame prediction and two reference frames for B frame prediction, using 16x16 MBs only and then once again using 8x8 MBs only. QPs range from 10 (upper right) to 40 (lower left). The PSNR for the luma component is calculated for the entire sequence. The results are compared against 16x16 interprediction using only P frames with a single reference frame. Figure 3.5 shows the rate distortion plot for the simulation. Figure 3.6 shows a close-up between QP values of 24 through 28.

[Plot: Foreman B Slices, 100 frames at 25 FPS; curves for B 16x16 only, B 8x8 only, and P 16x16 only]

Figure 3.5: Rate Distortion Curve Foreman using B Frame Interprediction

3.5 B Frame Conclusions

[Plot: Foreman B Slices, 100 frames at 25 FPS; PSNR (Y) versus Rate (kbps) for B 16x16 only, B 8x8 only, and P 16x16 only]

Figure 3.6: Rate Distortion Curve Foreman using B Frame Interprediction for QPs 24 through 28

The results show a gain in performance while providing only a slight increase in complexity. The search algorithm is implemented so that both reference frames for a B MB are searched in the same iteration. This removes some of the additional computational requirements involved in searching multiple reference frames. Improved performance can be achieved with additional reference frames using B MBs at less of a cost in complexity than multi-frame coding with P MBs. The proposed implementation uses only single reference picture prediction for B MBs, meaning that it chooses only one picture for reference from the available reference pictures. This is effectively the same process as for a P MB, since P MBs use a single reference as well. Performance could potentially be further increased with the addition of bi-prediction (two pictures used for reference).

[Plot: Foreman B Slices, 100 frames at 25 FPS; Coding Time (seconds) versus QP for B 16x16 only, B 8x8 only, and P 16x16 only]

Figure 3.7: Complexity using B Frame Interprediction

CHAPTER IV

Mode Selection

4.1 Introduction

As mentioned in chapter 1, an H.264 encoder can choose from many different options or modes when it codes a macroblock. Therefore, the more features that are added to the encoder, the more necessary a mode selection algorithm becomes. Rate Distortion Optimized (RDO) mode selection is a technique for choosing the coding mode of a macroblock based on the rate and distortion cost. The bitrate cost R and distortion cost D are combined into a single cost J, as shown in equation 4.1. The RDO mode selection algorithm attempts to find a mode that minimizes the joint cost J. The tradeoff between bitrate and distortion is controlled by the Lagrange multiplier λ. Smaller values of λ favor quality over bitrate, and larger values of λ favor bitrate over quality. This technique has gained importance due to its effectiveness, conceptual simplicity, and its ability to evaluate a large number of choices in an optimized fashion [14]. More information on Lagrangian optimization can be found in [15]. RDO attempts to answer the question: "What part of the video signal should be coded using what method and parameter settings?"

J = D + λR (4.1)

This is clearly a computationally intensive process due to the large number of coding options. The proposed encoder, even with 16x16 prediction only, has many coding options, and the addition of multi frame prediction and B frame interprediction compounds the problem. A low complexity RDO method is proposed in order to increase encoder performance while keeping the complexity as minimal as possible.

4.2 Proposed Low Complexity RDO Method

Due to the computational complexity of the traditional RDO method, the following method is proposed. The traditional RDO method involves:

• Obtain a prediction block P

• Code block P to obtain R (the number of bits)

• Reconstruct the block to obtain B’

• Calculate distortion between B and B’

This means that a MB must go through the prediction, transform, quantization, reconstruction, and coding process for each mode, as shown in figure 4.1.

The proposed RDO method is as follows:

• Obtain prediction block P

• Determine Rpred.

• Calculate distortion between P and original MB.

Figure 4.1: Traditional RDO

The proposed method eliminates the transform, quantization, and MB coding processes. Observe that Rcode (the number of bits to code the residual data) is removed from the cost calculation. By doing this, it is assumed that the distortion between the predicted MB and the original MB and the number of bits required to code that MB have a linear relationship. In other words, the smaller the error, the fewer bits it takes to code the residual. While this is a plausible assumption, it may not hold in all circumstances. For the purposes of low complexity, this assumption is made. A block diagram of the proposed method is shown in figure 4.2.
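The proposed cost can be sketched as follows, with macroblocks flattened to lists of samples (the function name and sample values are illustrative):

```python
def proposed_cost(orig_mb, pred_mb, r_pred, lam):
    """Low-complexity cost: SSD between the original MB and its prediction,
    plus the rate of the prediction parameters only (Rcode is omitted)."""
    d = sum((o - p) ** 2 for o, p in zip(orig_mb, pred_mb))
    return d + lam * r_pred
```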

4.3 Proposed RDO Method Implementation

Figure 4.2: Proposed RDO Method

The goal of the encoder is to provide the best bitrate at a given quality. In order to test the proposed RDO method, 8x8 prediction for P slices is added to the encoder. Figure 4.3 shows the rate distortion curve for the current encoder in terms of 16x16 prediction only and 8x8 prediction only for Foreman. At low QP values, 16x16 outperforms 8x8, and the opposite is true at higher QP values. The proposed RDO method is compared against the previous two prediction modes. Figure 4.4 shows the complexity for the proposed RDO implementation compared with coding with 16x16 only and 8x8 only. With the exception of high QP values, the proposed method increases encoding performance. The value of λ for each QP is calculated using equation 4.2 [14].

λ = 0.85 × 2^((QP − 12)/3)   (4.2)
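Equation 4.2 in code form (the function name is illustrative):

```python
def lagrange_multiplier(qp):
    """Lambda from equation 4.2: 0.85 * 2^((QP - 12) / 3)."""
    return 0.85 * 2 ** ((qp - 12) / 3)
```

At QP = 12 this yields λ = 0.85, and λ doubles with every increase of 3 in QP.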

[Plot: Foreman: RDO, 100 frames at 25 FPS; PSNR (Y) versus Rate (kbps) for 16x16, 8x8, and RDO]

Figure 4.3: Proposed RDO Results: Foreman

Figures 4.5 and 4.6 show the results of the same test for the flower sequence.

4.4 Conclusions

A software RDO scheme is introduced in this chapter. Due to the extensive computation time of the full RDO algorithm, a less complex version of the scheme is proposed that removes the burden of fully encoding a macroblock twice. The proposed scheme utilizes 16x16 and 8x8 block sizes in order to increase compression performance.

[Plot: Foreman: RDO complexity; Time (seconds) versus QP for 16x16, 8x8, and RDO]

Figure 4.4: Proposed RDO Complexity: Foreman

The results in Figures 4.3 and 4.5 show that the proposed method achieves better compression for most QP values. The cost of implementing this scheme comes in the form of complexity, as shown in Figures 4.4 and 4.6. The proposed encoder is a simpler version of the full RDO method because it does not perform transformation, quantization, or entropy encoding for each tested mode. However, most of the coding time is spent in the motion compensation process, which is performed for each tested mode and is the source of the additional complexity.

[Plot: Flower, RDO, 100 frames at 25 FPS; PSNR (Y) vs. rate (kbps) for 16x16, 8x8, and RDO.]

Figure 4.5: Proposed RDO Results: Flower

The results are significant because the encoder now has a "best case" algorithm. This means that less complex mode selection algorithms can now be developed and tested with the goal of coming as close as possible to the rate-distortion curve of the proposed method.

[Plot: Flower, RDO complexity; time (seconds) vs. QP for 16x16, 8x8, and RDO.]

Figure 4.6: Proposed RDO Complexity: Flower

CHAPTER V

Conclusions and Future Work

5.1 Conclusions

Multi-frame prediction is introduced so that B frame interprediction can be implemented. For the proposed encoder, the results show that adding an additional reference frame can increase performance, but due to the added complexity, using more than two reference frames makes the coding time less desirable for the gain in compression.
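The reference selection described above amounts to running one motion search per candidate frame and keeping the best. A minimal sketch, assuming a `motion_search` callback that returns a motion vector and a cost (the names and interface are illustrative, not the thesis encoder's):

```python
def best_reference(mb, ref_frames, motion_search):
    """Search each reference frame and keep the one with the lowest cost.

    Each extra reference adds a full motion search, which is why using
    more than two references trades little compression gain for a large
    increase in encoding time.
    """
    best = None
    for idx, ref in enumerate(ref_frames):
        mv, cost = motion_search(mb, ref)
        if best is None or cost < best[2]:
            best = (idx, mv, cost)
    return best  # (reference index, motion vector, cost)
```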

B frame interprediction is introduced in order to improve coding efficiency and performance.

Simulations show that adding a B frame in between the P frames increases performance on the rate distortion curve. The complexity of B frames implemented with two reference frames is less than that of P frames with two references, since both B frame reference lists can be searched concurrently during motion estimation.
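Once a prediction is obtained from each list, the B frame prediction itself can be illustrated with H.264's default bi-prediction, which averages the list 0 and list 1 samples with rounding. This is a simplified sketch that ignores weighted prediction:

```python
def bipredict(pred_l0, pred_l1):
    # Average the list 0 and list 1 predictions with upward rounding,
    # as in H.264 default bi-prediction: (p0 + p1 + 1) >> 1 per sample.
    return [(p0 + p1 + 1) >> 1 for p0, p1 in zip(pred_l0, pred_l1)]
```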

The Rate Distortion Optimization method is necessary in H.264 due to the variety of coding options available that were not seen in previous standards. Due to the computational demands, fast implementations of RDO may be called upon for practical applications. The proposed RDO method provides better performance than using 16x16 prediction blocks only, but at the cost of complexity.

The proposed encoder now has a "best case" RDO method that can be used to test less complex methods in the future.

5.2 Future Work

The proposed RDO method has opened up new areas of research for the H.264 project. As additional features are added to the encoder, the RDO algorithm must be taken into account. A low complexity RDO module will be necessary for the hardware encoder as well. RDO is not limited to inter prediction: multiple intra prediction modes already exist in the hardware encoder that use distortion alone as the determining factor for which intra mode is chosen. Based on the research done in this thesis, a simple RDO module can be written to improve intra prediction performance. There are many existing alternatives to an RDO algorithm, such as [16] and [17].

These methods set out to reduce complexity even further while achieving performance equivalent to RDO. Looking ahead to the next coding standard, H.265 or High Efficiency Video Coding, a wide variety of coding options will be available as well. This necessitates an RDO algorithm.

BIBLIOGRAPHY

[1] I. E. Richardson, The H.264 Advanced Video Compression Standard, 2nd Edition. John Wiley and Sons, 2003.

[2] ISO/IEC 14496-2 (MPEG-4 Visual Version 1.0), "Coding of audio-visual objects - Part 2: Visual," April 1999.

[3] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-G050, "Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264 / ISO/IEC 14496-10 AVC)," May 2003.

[4] T. Wedi, "Motion compensation in H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, pp. 577–586, July 2003.

[5] T. Wiegand, X. Zhang, and B. Girod, "Long-term memory motion-compensated prediction," IEEE Trans. Circuits Syst. Video Technol., vol. 9, pp. 70–84, Feb. 1999.

[6] T. Wiegand and B. Girod, Multi-Frame Motion-Compensated Prediction for Video Transmission. Norwell, MA: Kluwer, 2001.

[7] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the h.264/avc video coding standard,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 13, no. 7, pp. 560–576, 2003.

[8] M. Flierl, T. Wiegand, and B. Girod, "A locally optimal design algorithm for block-based multi-hypothesis motion-compensated prediction," in Data Compression Conference, 1998. DCC '98. Proceedings, 1998, pp. 239–248.

[9] M. Flierl and B. Girod, “Generalized b pictures and the draft h.264/avc video-compression standard,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 13, no. 7, pp. 587–597, 2003.

[10] H. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low-complexity transform and quantization in h.264/avc,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 13, no. 7, pp. 598–603, 2003.

[11] Y.-L. S. Lin, C.-Y. Kao, H.-C. Kuo, and J.-W. Chen, VLSI Design for Video Coding. New York: Springer, 1st Edition, 2010.

[12] D. Marpe, H. Schwarz, and T. Wiegand, "Context-based adaptive binary arithmetic coding in the h.264/avc video compression standard," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 13, no. 7, pp. 620–636, 2003.

[13] "YUV video sequences," http://trace.eas.asu.edu/yuv/, April 2012.

[14] G. Sullivan and T. Wiegand, “Rate-distortion optimization for video compression,” Signal Processing Magazine, IEEE, vol. 15, no. 6, pp. 74–90, 1998.

[15] A. Ortega and K. Ramchandran, “Rate-distortion methods for image and video compression,” Signal Processing Magazine, IEEE, vol. 15, no. 6, pp. 23–50, 1998.

[16] F. Pan, X. Lin, S. Rahardja, K. Lim, Z. Li, D. Wu, and S. Wu, “Fast mode decision algorithm for intraprediction in h.264/avc video coding,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 15, no. 7, pp. 813–822, 2005.

[17] B.-G. Kim and C.-S. Cho, “Fast inter-mode decision algorithm for p slices in h.264/avc video standard,” in Consumer Electronics, 2007. ISCE 2007. IEEE International Symposium on, 2007, pp. 1–6.
