Design and Implementation of a Fast HEVC Random Access Encoder

ALFREDO SCACCIALEPRE

Master’s Degree Project Stockholm, Sweden March 2014

XR-EE-KT 2014:003

Contents

1 Introduction
  1.1 Background
  1.2 Thesis work
    1.2.1 Factors to consider
  1.3 The problem
    1.3.1 C65
  1.4 Methods and thesis outline
    1.4.1 Methods
    1.4.2 Objective measurement
    1.4.3 Subjective measurement
    1.4.4 Test sequences
    1.4.5 Thesis outline
    1.4.6 Abbreviations

2 General concepts
  2.1 Color spaces
  2.2 Frames, slices and tiles
    2.2.1 Frames
    2.2.2 Slices and Tiles
  2.3 Predictions
    2.3.1 Intra
    2.3.2 Inter
  2.4 Merge mode
    2.4.1 Skip mode
  2.5 AMVP mode
    2.5.1 I, P and B frames
  2.6 CTU, CU, CTB, CB, PB, and TB
  2.7 Transforms
  2.8 Quantization
  2.9 Coding
  2.10 Reference picture lists
  2.11 Gop structure
  2.12 Temporal scalability
  2.13 Hierarchical B pictures
  2.14 Decoded picture buffer (DPB)
  2.15 Low delay and random access configurations
  2.16 H.264 and its encoders
    2.16.1 H.264


3 Preliminary tests
  3.1 Speed - quality considerations
    3.1.1 Interactive applications
    3.1.2 Entertainment applications
  3.2 Compression efficiency tests
    3.2.1 C65 and HM12
    3.2.2 C65 and x264
  3.3 Conclusions

4 Implementing B pictures
  4.1 Gop structure
    4.1.1 C65’s original gop structure
    4.1.2 Hierarchical gop structure
    4.1.3 Choice of gop size
    4.1.4 C65 optional parameter -hgopsize
    4.1.5 Considerations on coding delay
    4.1.6 Memory requirements for DPB
    4.1.7 Syntax elements to set
  4.2 B slices - details about implementation
    4.2.1 Gop structure and references
    4.2.2 B slices - building list 0 and list 1
    4.2.3 Generation of merge candidates
    4.2.4 Motion Vector Prediction
    4.2.5 B slices - syntax fixes
    4.2.6 Other modifications
  4.3
    4.3.1 Interpolation
    4.3.2 Averaging
    4.3.3 SIMD code
  4.4

5 Rate distortion optimization
  5.1 Motion estimation
  5.2 Motion estimation in C65 B
    5.2.1 Step zero: generate the merge candidates list
    5.2.2 Step one: choose best merge index
    5.2.3 Step two: test the skip mode
    5.2.4 Step three: test the intra mode
    5.2.5 Step four: test uni-prediction
    5.2.6 Step five: motion vector local search
    5.2.7 Step six: mode selection
  5.3 Quantization parameter choice

6 Results and future works
  6.1 Methodology
  6.2 Results
    6.2.1 hgopsize 8 - various qp values
    6.2.2 hgopsize 4 - various qp values
    6.2.3 hgopsize 2 - various qp values
    6.2.4 hgopsize 2 vs hgopsize 4 vs hgopsize 8
    6.2.5 C65 against C65 B
    6.2.6 Subjective test
    6.2.7 C65 B against x264
    6.2.8 C65 B against HM
    6.2.9 Encoding speed
    6.2.10 Final considerations
  6.3 Future works
    6.3.1 Gop selection
    6.3.2 Combined prediction signal for motion vector search
    6.3.3 32 x 32 and 64 x 64 mode
    6.3.4 Deblocking filter and TMVP
    6.3.5 I frames to improve with more directions

7 Conclusions

A Listings and result data
  A.1 Listings
  A.2 Result data

List of Figures

1.1 Video sequences under test
2.1 CTU and CTB
2.2 CTB split in CB
2.3 CB and PB
2.4 CB and TB
2.5 Temporal scalability
2.6 Hierarchical coding structure with 4 temporal levels
3.1 Interactive applications, low-delay mode, C65 comparison against other codecs
3.2 Entertainment applications, C65 comparison against other codecs
4.1 C65 original encoding order
4.2 Dyadic hierarchical gop structures implemented in C65
4.3 Hierarchical non-dyadic gop structure size 8, -hgopsize 8n
4.4 Sample DPB content
4.5 Position of spatial candidates of motion information
4.6 Fractional positions in motion compensation
4.7 Averaging of input signal in H.264
4.8 Averaging of input signal in HEVC
5.1 Scene cut
5.2 Video sequence with scene cut
6.1 Subjective test
6.2 Hierarchical gop structure

List of Tables

3.1 C65 vs HM 12.0 random access configuration
3.2 HM 12.0 random access configuration vs same software using only one reference picture
3.3 C65 vs x264 with settings: --psnr --threads 1 --profile high --preset veryslow --tune psnr -I 48
3.4 x264 with settings: --psnr --threads 1 --profile high --preset veryslow --tune psnr -I 48 vs x264 with settings: --psnr --threads 1 --profile high --preset veryslow --ref 1 --tune psnr -I 48
4.1 Luma interpolation filter in HEVC
6.1 Qp toggling possibilities
6.2 Summary qp toggling configurations, gop size 8
6.3 Summary qp toggling configurations, gop size 4
6.4 Summary qp toggling configurations, gop size 2
6.5 Summary gop structures comparison
6.6 C65 vs C65 B
6.7 Encoding speed
6.8 C65 B vs HM 12.0 for one Intra picture
A.1 No qp toggle vs configuration -4 -1 0 1 2
A.2 No qp toggle vs configuration -4 -2 0 2 4
A.3 No qp toggle vs configuration -4 -2 2 3 4
A.4 No qp toggle vs configuration -4 -3 0 4 8
A.5 No qp toggle vs configuration -4 -2 2 4 6
A.6 No qp toggle vs configuration -4 -3 2 4 8
A.7 No qp toggle vs configuration -4 -3 -3 4 8
A.8 No qp toggle vs configuration -4 -1 0 1
A.9 No qp toggle vs configuration -4 -2 0 2
A.10 No qp toggle vs configuration -4 -3 0 3
A.11 No qp toggle vs configuration -4 -3 2 6
A.12 No qp toggle vs configuration -4 -2 2
A.13 No qp toggle vs configuration -4 -3 3
A.14 No qp toggle vs configuration -4 -3 6
A.15 Configuration -4 -3 3 vs configuration -4 -3 2 6
A.16 Configuration -4 -3 2 6 vs configuration -4 -3 -3 4 8
A.17 C65 B vs x264


A.18 C65 B vs HM 12

Abstract

This master thesis report studies the possibility of modifying a fast video encoder specialized for video conferencing applications (C65) in order to make it more suitable for encoding general video content. The modifications include coding with hierarchical B pictures and various other operations such as rate distortion optimization and improved motion estimation. Results show that the hierarchical coding and the other modifications contribute positively to the coding efficiency of the encoder. Considerations about the results are given regarding the new encoding scheme (in terms of coding delay and memory requirements), together with a comparison of coding efficiency for different coding structures. Finally, ideas for future work are presented. This thesis was done in cooperation with Ericsson Research.

Chapter 1

Introduction

Video content generates huge amounts of data, which would not be practical to handle if uncompressed. For instance, current Internet throughput rates cannot handle raw uncompressed video in real time (even at low frame rates and/or small frame sizes). A DVD can only store a few minutes of raw video at television-quality resolution.

For this reason video content is usually compressed by an encoder and decoded by a decoder that recreates the video data. With the increased use of digital video with higher resolution and quality, more and more sophisticated codecs have been developed to achieve efficient compression while minimizing the distortion introduced by the compression process.

This field of research has been very active in the last few decades, and many standards have been produced in order to allow products from different manufacturers to interoperate effectively. Among the most important codecs is the H.26x family, which includes the new standard HEVC.

1.1 Background

Digital video compression is not a recent concept: it has been around for the last few decades. For example, almost 20 years ago the best technology was represented by the MPEG-2 video coding standard, which is still extensively used today. A major milestone was the H.264 standard, which was published in 2003. In 2004, the Video Coding Experts Group (VCEG) started preliminary but in-depth studies in order to lay the foundations for the next video coding standard. The goal was, already at the time, a 50% saving of bitrate while keeping the same subjective quality as H.264 High Profile. In 2007 the Moving Picture Experts Group (MPEG) also started a similar kind of study, building on the work from VCEG. Finally, the two groups decided to merge into the Joint Collaborative Team on Video Coding (JCT-VC). The first official meeting was in April 2010, and the name High Efficiency Video Coding (HEVC) was formally chosen for the new standard. The first version of HEVC was completed and published in early 2013, but studies continued after 2013 to extend it with various features like scalable coding and support for 3D video.

Still today, many companies like Ericsson participate in standardization meetings and contribute to the project. In particular, Ericsson Research has also developed its own encoder, whose output is compliant with the HEVC standard and which is optimized for video-conferencing applications. The name of that Ericsson encoder is C65.

1.2 Thesis work

1.2.1 Factors to consider

The two main factors to be considered in video coding are compression efficiency and computational complexity. The designer's goal is to find a good trade-off between these. Assessing the characteristics of an encoded sequence can be challenging. For example, quality is difficult to measure. There are objective and subjective measures. Objective ones are easy to compute (the measure can be automated) but they do not necessarily correspond exactly to visual quality as perceived by the human eye. On the other hand, subjective tests are slow and difficult to perform, as they require a group of people spending time rating video test clips.

Compression (the number of bits required to code a sequence) and computational complexity (usually measured by the time needed to encode a number of pictures, or equivalently by the number of frames encoded per unit of time) are objective values that can be expressed by numbers.

In real-time applications, complexity is a strong constraint that must be taken into consideration; moreover, the encoding must not introduce excessive delay. In general we can say that pictures need to be processed in a timely manner with respect to the frame rate of the video.

1.3 The problem

HEVC is a video coding standard that was published in April 2013 and is expected to be widely deployed for various video services. Throughout the standardization of HEVC, a reference software codec, called HM, was used. The aim of this software is to provide a reference implementation of the HEVC standard. It was extensively used during the standardization phase of HEVC to give a basis upon which to conduct experiments in order to determine which coding tools provide desired coding performances. The code is freely available but it is not meant to be a computationally efficient implementation and for this reason can be unsuitable for particular usages[1].

1.3.1 C65

Ericsson Research implemented the world’s first multiparty HD video conference [2] using HEVC[3]. The HEVC codec that was used for video conferencing is written in C/C++ by Ericsson Research and is capable of encoding and decoding HD video in real time on a personal computer. The codec compresses video conference scenes very efficiently but was not optimized for general video input when this thesis work was started.

Objective of the thesis: C65 B

The objective of the thesis is to improve the compression efficiency of the HEVC codec for general video input such as TV series, movies and sport. At the same time, the complexity must be kept low enough that real-time encoding remains feasible. The work includes learning about video compression, implementing and evaluating well-known video encoding algorithms, designing and testing new encoding algorithms and writing a report. For this purpose, the existing codec C65 is modified and adapted, and the new version of the software takes the name C65 B.

The thesis has been done at the Visual Technology unit at Ericsson Research in Kista outside Stockholm. The unit is responsible for research and standardization related to video coding for multimedia communication and distribution within Ericsson.

1.4 Methods and thesis outline

1.4.1 Methods

The main work of this thesis was to improve the compression efficiency of C65 on general-content video sequences. The work focused only on the encoder: for this reason the chosen sequences are encoded with C65 and decoded with the reference software decoder HM, version 12. The match between encoded and decoded content is easily assessed using MD5 hashes (the hash of each reconstructed picture is calculated by the encoder, and the hash of each decoded picture by the decoder). Moreover, the reconstructed sequence (output by the encoder) is compared bit by bit with the decoded sequence, in order to check the output order of the pictures.

The tests are performed on a set of video sequences (presented later in the dissertation) using the four qp values 22, 27, 32, 37 as specified in the JCT-VC’s common test conditions[4].

For further analysis, charts will be presented showing typical curves representing the relationship between bitrate and video quality (in terms of objective measurements) at given qp values.

In order to verify the correctness of the encoded bitstream, log files were implemented in both C65 and HM to trace the encoding and decoding processes during development. In particular, the logs contain motion vector decisions, the CABAC trace and the generated merge candidates. By comparing log files from both encoder and decoder it is possible to detect where discrepancies lie and to correct bugs. Moreover, a bitstream analyzer with a graphical interface proved to be very useful in some contexts. In particular, the software Elecard HEVC Analyzer (trial version) was used.
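The per-picture hash comparison described above can be illustrated with a minimal sketch. It is not the code used in C65 or HM: it assumes that both the reconstructed and the decoded sequences are available as raw planar 8-bit 4:2:0 data, and the function names (`frame_size_420`, `frame_md5s`, `first_mismatch`) are hypothetical.

```python
import hashlib


def frame_size_420(width, height):
    """Bytes in one 8-bit planar 4:2:0 frame (luma plus two quarter-size chroma planes)."""
    return width * height * 3 // 2


def frame_md5s(yuv_bytes, width, height):
    """Return one MD5 hex digest per complete frame found in the raw data."""
    size = frame_size_420(width, height)
    return [hashlib.md5(yuv_bytes[i:i + size]).hexdigest()
            for i in range(0, len(yuv_bytes) - size + 1, size)]


def first_mismatch(recon, decoded, width, height):
    """Index of the first frame whose hashes differ, or None if all frames match."""
    a = frame_md5s(recon, width, height)
    b = frame_md5s(decoded, width, height)
    for index, (hash_a, hash_b) in enumerate(zip(a, b)):
        if hash_a != hash_b:
            return index
    # Same content so far: report a mismatch only if the frame counts differ.
    return None if len(a) == len(b) else min(len(a), len(b))
```

Because the frames are compared in file order, a picture written at the wrong output position also shows up as a mismatch, which corresponds to the output order check mentioned above.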

PSNR

Peak signal-to-noise ratio (PSNR) is the method used within this thesis to measure quality in an objective manner. It is computed according to the following steps:

• calculate the Mean Squared Error (MSE) as in equation 1.1. $M$ and $N$ are width and height of a picture in pixels and $P_{anchor}$ and $P_{test}$ are single pixel values for the original (anchor) picture and the decoded (test) picture respectively.

• compute PSNR as in equation 1.2. $MAX$ is the highest value a pixel can have ($2^n - 1$ for an $n$-bit representation).

$$\mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left[ P_{anchor}(i,j) - P_{test}(i,j) \right]^2 \tag{1.1}$$

$$\mathrm{PSNR} = 10 \log_{10} \frac{MAX^2}{\mathrm{MSE}} \tag{1.2}$$
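Equations 1.1 and 1.2 translate directly into a few lines of code. The following is a minimal sketch for a single plane stored as a list of rows; the function names and the `bit_depth` parameter are our own choices for illustration.

```python
import math


def mse(anchor, test):
    """Mean squared error between two equally sized planes (equation 1.1)."""
    sample_count = len(anchor) * len(anchor[0])
    total = sum((a - t) ** 2
                for row_a, row_t in zip(anchor, test)
                for a, t in zip(row_a, row_t))
    return total / sample_count


def psnr(anchor, test, bit_depth=8):
    """PSNR in dB (equation 1.2); infinite when the planes are identical."""
    error = mse(anchor, test)
    if error == 0:
        return math.inf
    max_value = 2 ** bit_depth - 1
    return 10 * math.log10(max_value ** 2 / error)


if __name__ == "__main__":
    original = [[10, 20], [30, 40]]
    decoded = [[11, 21], [31, 41]]   # every sample is off by one, so MSE = 1
    print(round(psnr(original, decoded), 2))
```

With an MSE of exactly 1 and 8-bit samples, the result is $10 \log_{10}(255^2) \approx 48.13$ dB.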

1.4.2 Objective measurement

The tests and results shown in this dissertation present objective measurements in order to assess the quality of a decoded video sequence compared to the original source. In particular we use BD-PSNR and BD-Rate averages. BD-PSNR/Rate is a way to compute a comparable average value from a larger set of test results[5]. The algorithm takes as input four PSNR/bitrate pairs for each of two encodings, one called "anchor" and one called "test", and returns two values: the bit rate reduction for equivalent PSNR (BD-Rate) and the average PSNR improvement for equivalent bitrate (BD-PSNR). A negative BD-Rate means a better performing algorithm: in other words, equal quality is reached at a lower bit rate. In the same way, a positive BD-PSNR indicates better performance since, at the same bit rate, the quality is higher.
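As an illustration of how a BD-Rate figure can be obtained, the sketch below follows the commonly described procedure: interpolate the logarithm of the bitrate as a cubic function of PSNR for each encoding, average both curves over the common PSNR interval, and convert the difference back to a percentage. It is not the script used for the results in this thesis. It assumes exactly four rate/PSNR points per encoding with distinct, overlapping PSNR values, and all names are hypothetical.

```python
import math


def lagrange(xs, ys, x):
    """Evaluate the polynomial through the points (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def mean_over(xs, ys, lo, hi):
    """Average of the interpolating cubic on [lo, hi].

    Simpson's rule is exact for polynomials up to degree three, so one
    application over the whole interval is enough for four points.
    """
    mid = 0.5 * (lo + hi)
    integral = (hi - lo) / 6.0 * (lagrange(xs, ys, lo)
                                  + 4.0 * lagrange(xs, ys, mid)
                                  + lagrange(xs, ys, hi))
    return integral / (hi - lo)


def bd_rate(anchor, test):
    """BD-Rate in percent from two lists of four (bitrate, psnr) pairs."""
    psnr_a = [p for _, p in anchor]
    psnr_t = [p for _, p in test]
    log_a = [math.log10(r) for r, _ in anchor]
    log_t = [math.log10(r) for r, _ in test]
    lo = max(min(psnr_a), min(psnr_t))   # common PSNR interval
    hi = min(max(psnr_a), max(psnr_t))
    diff = mean_over(psnr_t, log_t, lo, hi) - mean_over(psnr_a, log_a, lo, hi)
    return (10.0 ** diff - 1.0) * 100.0
```

A test encoding that reaches the same PSNR values at half the bitrate of the anchor yields a BD-Rate of -50%, which matches the sign convention described above.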

1.4.3 Subjective measurement

In addition to objective measurements, some subjective evaluations are present in the thesis report. Estimation of subjective video quality is a very time-consuming process, and many ways of conducting such experiments are possible (a few standardized, like in [6]). Moreover, the results can be influenced by many parameters, such as room illumination, display type, brightness, contrast, resolution, viewing distance and the age and experience of the viewers. For this reason, in this dissertation subjective tests will be performed in an informal way, by the author only, comparing sequences side by side.

1.4.4 Test sequences

The tests were executed on a set of thirteen video sequences (Figure 1.1). Some were chosen from the eighteen sequences used by JCT-VC [4] during the development of HEVC. All these video sequences have a resolution of 1280x720 and they are chosen to be heterogeneous and to exercise the encoder in different contexts. None of them has scene cuts and they are either 10 or 3 seconds in length.

(a) American Football, 720p, 60 Hz, 789 frames
(b) ABC Sitcom, 720p, 50 Hz, 500 frames
(c) Le Match, 720p, 50 Hz, 500 frames
(d) Basketball drive, 720p, 50 Hz, 500 frames
(e) Cactus, 720p, 50 Hz, 500 frames
(f) Christmas tree, 720p, 50 Hz, 500 frames
(g) Crowd run, 720p, 50 Hz, 500 frames
(h) Ducks take off, 720p, 50 Hz, 500 frames
(i) Into tree, 720p, 50 Hz, 500 frames
(j) Park Joy, 720p, 50 Hz, 500 frames
(k) Big Ships, 720p, 60 Hz, 150 frames
(l) City, 720p, 60 Hz, 150 frames
(m) Crew, 720p, 60 Hz, 150 frames

Figure 1.1: Video sequences under test

1.4.5 Thesis outline

Here the outline of the thesis is presented.

Chapter 2 gives a general introduction to the HEVC standard. It also provides an explanation of general video coding concepts that will be useful for the reader during the course of the dissertation.

Chapter 3 presents the preliminary work, consisting of a series of tests comparing the C65 encoder against other well-known encoders such as HM, JM, and x264.

Chapter 4 explains the work that was done in order to implement B slices in the C65 encoder. The choice of the gop structure is analyzed here, as well as the necessity to change the output order of the frames. Moreover, a quick overview is given of the steps taken to make the encoder compliant with the standard, so that bitstreams can be correctly decoded by the HM decoder. Finally there is an overview of the modifications to motion estimation and motion compensation that were needed when B slices were implemented.

Chapter 5 presents some rate distortion optimization work that has been done in order to obtain an efficient encoder under the new conditions. In particular, the motion estimation process is analyzed, as well as the choice of qp values assigned to each picture depending on its position in the gop structure.

Chapter 6, finally, shows the overall results of the work in terms of bitrate saving, PSNR enhancement and encoding time. A quick overview of possible future work is also given.

1.4.6 Abbreviations

BD-PSNR/Rate - Bjøntegaard Delta-PSNR/Rate
CABAC - Context Adaptive Binary Arithmetic Coder
CB - Coding Block
CTB - Coding Tree Block
CTU - Coding Tree Unit
CU - Coding Unit
dB - Decibel
HEVC - High Efficiency Video Coding
HM - HEVC Test Model
ITU-T - International Telecommunication Union, Telecommunication Standardization Sector
ISO/IEC - International Organization for Standardization/International Electrotechnical Commission
MPEG - Moving Picture Experts Group
MSE - Mean Squared Error
PB - Prediction Block
PSNR - Peak Signal to Noise Ratio
PU - Prediction Unit
QP - Quantization Parameter
SAD - Sum of Absolute Differences
SPS - Sequence Parameter Set
SSD - Sum of Squared Differences
RPS - Reference Picture Set
TB - Transform Block
TU - Transform Unit
VCEG - Video Coding Experts Group

Chapter 2

General concepts

This chapter briefly covers some basic concepts of video coding followed by a very quick overview about HEVC, stressing those aspects that will be necessary for the understanding of the rest of the dissertation. Many of the concepts of the HEVC standard originate from the H.264 codec, so for more details literature about both standards can be consulted.

2.1 Color spaces

The YUV color space is used to represent pixel values. It is composed of three channels: Y is the luminance (luma), U or Cb is chroma blue and V or Cr is chroma red. The YUV color space was designed taking human perception into account. Since the human visual system is less sensitive to color than to luminance, the chrominance bandwidth can be reduced without visible degradation. Compared with a direct RGB representation, this allows better masking of possible transmission errors or compression artifacts, and it gives a gain in coding efficiency. The subsampling used is often 4:2:0; this means that for each block of 2 × 2 luma samples there is one sample of each chroma component.
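To make the 4:2:0 ratio concrete, the short sketch below counts the samples in each plane of one picture. It assumes even picture dimensions, and the function name is hypothetical.

```python
def plane_sizes_420(width, height):
    """Sample counts (Y, Cb, Cr) for one 4:2:0 picture with even dimensions."""
    luma = width * height
    chroma = (width // 2) * (height // 2)   # one chroma sample per 2 x 2 luma block
    return luma, chroma, chroma


if __name__ == "__main__":
    y, cb, cr = plane_sizes_420(1280, 720)
    print(y, cb, cr, y + cb + cr)
```

For the 1280x720 sequences used in this thesis, the two chroma planes together hold half as many samples as the luma plane, so a 4:2:0 picture needs 1.5 samples per pixel instead of the 3 required by a full-resolution three-channel representation.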

2.2 Frames, slices and tiles

2.2.1 Frames

A frame or picture contains the data for a specific time within a video sequence. For our purposes the terms “picture” and “frame” are equivalent (they differ for interlaced video, which is not considered in this thesis work).

2.2.2 Slices and Tiles

Slices are independently decodable picture areas. Within a frame, every slice can be decoded without the need for any data from any other slice of the same picture. They introduce an overhead because of the presence of slice headers and also because every slice boundary breaks the prediction between slices.

Tiles are also portions of the frame, but they can span multiple slices. Their main purpose is to enable parallelism, while slices can be used to control the packet size if, for instance, the stream is sent over a network.

2.3 Predictions

2.3.1 Intra

Intra prediction exploits spatial redundancies within a picture, and the source data is transmitted in transformed form. An intra coded slice has no dependencies on other pictures and can be decoded autonomously, since it does not rely on any reference frame. The bit cost of an intra coded block is usually relatively high.

2.3.2 Inter

An inter predicted block takes advantage of the temporal redundancy between subsequent frames. This usually leads to a higher compression rate compared to intra prediction. The concept of motion compensation is used: a displacement is given for blocks of pixels in the current frame with respect to some previous reference, which makes it possible to represent a moving portion of the picture. The displacement is given by so-called motion vectors. The motion can also be at sub-pixel level, so interpolation filters must be used to estimate the prediction values. Moreover, a residual is transmitted: it represents the difference between the prediction and the source. The decoder adds this residual to the prediction in order to recreate the original data.
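The relationship between prediction, residual and reconstruction can be sketched as follows. This is a simplified illustration, not the C65 implementation: it handles only integer-pel motion vectors (no interpolation filter), keeps the residual lossless (no transform or quantization), and uses hypothetical function names.

```python
def predict_block(reference, x, y, mv, size):
    """Copy a size x size block from the reference picture, displaced by mv = (dx, dy)."""
    dx, dy = mv
    return [[reference[y + dy + r][x + dx + c] for c in range(size)]
            for r in range(size)]


def residual(source_block, prediction):
    """Encoder side: sample-wise difference between source and prediction."""
    return [[s - p for s, p in zip(row_s, row_p)]
            for row_s, row_p in zip(source_block, prediction)]


def reconstruct(prediction, res):
    """Decoder side: add the transmitted residual back onto the prediction."""
    return [[p + d for p, d in zip(row_p, row_d)]
            for row_p, row_d in zip(prediction, res)]
```

In a real encoder the residual is transformed and quantized before transmission, so the decoder recovers an approximation of the source block rather than the exact samples.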

2.4 Merge mode

The merge mode of HEVC is a new concept that is similar to the direct and skip modes of H.264[7]. It is one of the modes for inter prediction and does not require transmitting explicit motion vectors. A list of motion vector candidates is built by both the encoder and the decoder following the same standardized process. The encoder selects one of the candidates (the one it considers optimal) and transmits an index to refer to it. Moreover, a residual is transmitted.

2.4.1 Skip mode

Skip mode is a special case of merge mode, in which the residual is not transmitted.

2.5 AMVP mode

In HEVC the other way to encode an inter-coded block, when merge is not used, is advanced motion vector prediction (AMVP). The motion vector is differentially coded with respect to a predictor. The predictor is chosen from a list of candidates and only its index is transmitted, together with the difference between the chosen motion vector and the predictor.
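The differential coding used by AMVP can be sketched in a few lines. The candidate list is taken as given here (its derivation is defined by the standard and is not reproduced), and the selection rule, the smallest sum of absolute difference components, is only an assumed stand-in for the bit cost that an encoder would actually evaluate. Function names are hypothetical.

```python
def encode_mv(mv, candidates):
    """Pick the predictor closest to mv; return (candidate index, motion vector difference)."""
    def cost(index):
        px, py = candidates[index]
        return abs(mv[0] - px) + abs(mv[1] - py)

    best = min(range(len(candidates)), key=cost)
    px, py = candidates[best]
    return best, (mv[0] - px, mv[1] - py)


def decode_mv(index, mvd, candidates):
    """Rebuild the motion vector from the transmitted index and difference."""
    px, py = candidates[index]
    return (px + mvd[0], py + mvd[1])
```

Merge mode can be seen as the limiting case in which only the index is sent and the motion data of the selected candidate is reused unchanged.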

2.5.1 I, P and B frames

While the HEVC specification[8] talks about I, P and B as slice types, it is common to refer to these as picture (or frame) types.

Intra slice

As reported in the HEVC standard[8], an intra coded slice (i.e. one that only uses intra prediction) is called an I slice. The standard also states that intra prediction is a prediction derived from only data elements (e.g., sample values) of the same decoded slice. The prediction is therefore not relative to any other picture within the video sequence. The principle it is based on is spatial redundancy: it exploits the similarity of the values of pixels that are spatial neighbors. Intra frames first appeared in the H.261 standard[9]. Apart from some implementation details, this technique is similar to that of the JPEG still image codec.

Predictive slice

Predictive (P) slices can use data from another frame for prediction and are usually more compressible than I slices. A P frame can still have intra coded blocks, but it may also use uni-prediction. The principle exploited here is temporal redundancy: two frames that are temporally close in a video sequence often differ only slightly in content. For the decoding of each block, a reference index (to indicate which single frame is the reference) and a motion vector (that points to a specific spatial region in that frame) must be provided to the decoder. A residual can also be provided, depending on the mode.

Bi-predictive (B) slice

Bi-predictive (B) slices can use data from two frames for prediction and are usually more compressible than I and P slices. A B slice can still have intra coded and/or uni-predicted blocks, but it may also use bi-prediction: blocks can be coded with a prediction interpolated from more than one reference. For the decoding of each bi-predicted block, two reference indices (to indicate the two reference pictures) and two motion vectors (that point to specific spatial regions in those pictures) must be provided to the decoder.
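A minimal sketch of how two prediction blocks can be combined into one bi-predicted block is given below. It uses a plain rounded average of the two blocks at sample precision, which is a simplification: the HEVC averaging process works on intermediate values of higher precision. The function name is hypothetical.

```python
def bi_predict(pred0, pred1):
    """Rounded sample-wise average of two equally sized prediction blocks."""
    return [[(a + b + 1) >> 1 for a, b in zip(row0, row1)]
            for row0, row1 in zip(pred0, pred1)]


if __name__ == "__main__":
    from_list0 = [[10, 20], [30, 40]]
    from_list1 = [[13, 20], [31, 44]]
    print(bi_predict(from_list0, from_list1))
```

Here `from_list0` and `from_list1` stand for the blocks fetched with the two motion vectors, one from each reference picture.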

2.6 CTU, CU, CTB, CB, PB, and TB

In the HEVC standard, each picture is divided into Coding Tree Units (CTUs). The possible sizes for a CTU are 64 × 64 (usually employed), 32 × 32 or 16 × 16, and this information is contained in the Sequence Parameter Set (SPS), so it is signalled just once in each sequence. For this reason, all the Coding Tree Units in a video stream have the same size. CTUs are composed of one luma (Y) and two chroma blocks (Cb and Cr), called Coding Tree Blocks (CTBs). CTBs have the same size as the corresponding CTU.

Figure 2.1: CTU and CTB

These entities can be divided into Coding Blocks (CBs). A luma CB and the corresponding chroma blocks form a Coding Unit (CU). The decision about the prediction type (intra, inter) is made for each CU.

Figure 2.2: CTB split in CB

Coding units can still be too large, for example for very small moving objects. For this reason, CBs can be divided into Prediction Blocks (PBs), each of which is assigned, for instance, a motion vector.

Figure 2.3: CB and PB

Finally, the residual coding operation is performed on Transform Blocks, that do not necessarily correspond to PBs.

Figure 2.4: CB and TB
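As a small numerical illustration of the CTU partitioning described in this section, the sketch below counts how many CTUs are needed to cover a picture. CTUs at the right and bottom edges may extend past the picture boundary, which is why the division is rounded up. The function name is hypothetical.

```python
def ctu_grid(width, height, ctu_size=64):
    """Number of CTU columns and rows needed to cover a width x height picture."""
    columns = -(-width // ctu_size)   # integer ceiling division
    rows = -(-height // ctu_size)
    return columns, rows


if __name__ == "__main__":
    cols, rows = ctu_grid(1280, 720)
    print(cols, rows, cols * rows)
```

For a 1280x720 picture and 64 × 64 CTUs the grid is 20 columns by 12 rows, and the last row is only partly inside the picture since 720 is not a multiple of 64.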

2.7 Transforms

In order to increase compression, a transform operation is applied to the intra coded pixels and to the residual after motion compensation. The transform is applied to each block and exploits spatial correlation. DCT-based transforms are used in the HEVC standard: after applying the transform, most of the signal information tends to be concentrated in a few low-frequency components of the DCT[10], approaching the Karhunen-Loève transform (which is optimal in the decorrelation sense)[11].
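The energy compaction property can be observed with a one-dimensional floating-point DCT-II, sketched below. This is only an illustration of the principle: HEVC specifies integer approximations of the transform applied in two dimensions, which are not reproduced here. The function name is hypothetical.

```python
import math


def dct_ii(samples):
    """Orthonormal 1-D DCT-II of a list of samples."""
    n = len(samples)
    coefficients = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(s * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, s in enumerate(samples))
        coefficients.append(scale * total)
    return coefficients


if __name__ == "__main__":
    smooth_row = [100, 102, 104, 106, 108, 110, 112, 114]
    print([round(c, 1) for c in dct_ii(smooth_row)])
```

For a slowly varying row such as `smooth_row`, almost all of the energy ends up in the first coefficients, so the remaining ones can be coded with very few bits after quantization.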

2.8 Quantization

Quantization is a process that limits the number of bits allowed to express values[12]. This is done by mapping a set of values into a smaller set of values (this is a lossy operation). In video encoding, quantization is applied to transformed coefficients in order to reduce the bitrate required to transmit the information. The level of quantization is defined by a Quantization Parameter (QP). Higher values mean a larger quantization step, and therefore a lower bitrate and a greater quality loss.
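The effect of the QP can be sketched with a plain scalar quantizer. The sketch relies on the commonly quoted relationship for H.264 and HEVC in which the quantization step doubles for every increase of 6 in QP and equals 1 at QP 4; the integer scaling tables and rounding offsets of a real encoder are not modelled, and the function names are hypothetical.

```python
def qstep(qp):
    """Approximate quantization step size for a given QP."""
    return 2.0 ** ((qp - 4) / 6.0)


def quantize(coefficient, qp):
    """Map a transform coefficient to an integer level (round half away from zero)."""
    level = int(abs(coefficient) / qstep(qp) + 0.5)
    return level if coefficient >= 0 else -level


def dequantize(level, qp):
    """Reconstruct the coefficient value from its level."""
    return level * qstep(qp)


if __name__ == "__main__":
    for qp in (22, 34):
        level = quantize(90, qp)
        print(qp, qstep(qp), level, dequantize(level, qp))
```

With a coefficient of 90, QP 22 (step 8) reconstructs 88, while QP 34 (step 32) reconstructs 96: the level is smaller and cheaper to code, but the reconstruction error is larger.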

2.9 Coding

In order to losslessly compress syntax elements such as prediction modes, motion vectors and transformed coefficients, entropy coding is used. CABAC is the form of entropy coding used in HEVC. It gives 9-14% better compression efficiency than CAVLC[13], the other entropy coder available in H.264, and it also performs better than the CABAC version implemented in AVC[14]. A look-up table, called a codebook, is used for coding and decoding. CABAC uses “context-adaptive” coding: a context model stores statistics of codewords and known probabilities in order to adaptively make the codebook more efficient.

2.10 Reference picture lists

Focusing on Inter predictions, we explained how the coding of a frame can be done taking as reference another (preceding in decoding order) frame. These reference pictures are ordered in one or two lists prior to encoding or decoding a slice. P slices use only list 0 and B slices use list 0 and list 1. The two lists are filled with reference frames, following a precise schema. For more details, refer to subsection 4.2.2.

2.11 Gop structure

A gop (group of pictures) defines the order in which intra and inter frames are arranged. A video sequence is usually composed of successive gops. The gop structure is often referred to by a number that gives the distance between two anchor frames (I or P). A typical gop structure is, for instance, IBBPBBP... With this structure, the I-frame can be used to predict the first P-frame, and these two frames can also be used to predict the first and the second B-frame. The second P-frame can be predicted using the first P-frame, and together they can be used to predict the third and fourth B-frames.

2.12 Temporal scalability

“A video bit stream is called temporal scalable when parts of the stream can be removed in a way that the resulting substream forms another valid bit stream for some target decoder, and the substream represents the source content with a frame rate that is smaller than the frame rate of the complete original bit stream”[15]. This is achieved by partitioning the video pictures into a base layer and one or more temporal enhancement layers, as in Fig. 2.5

Figure 2.5: Temporal scalability

2.13 Hierarchical B pictures

Hierarchical B picture coding was first introduced in H.264/AVC[16]. A typical hierarchical prediction structure is shown in Fig. 2.6. It has 4 hierarchy stages. The Picture Order Count (poc) is shown under each picture: it represents the number of the picture in display order. Pictures I0 and P8 are called key pictures. They are at the lowest temporal layer. The other pictures of the gop are hierarchically coded: the pictures at a lower temporal level are the reference for the pictures at the next temporal level. It is important to notice that the pictures at lower temporal levels contribute more to the overall coding efficiency, since they are directly or indirectly used as reference for all the pictures of the higher layers.
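The temporal levels of a dyadic hierarchical structure like the one in Fig. 2.6 can be derived from the poc alone, as the sketch below shows. It assumes a gop size that is a power of two. The coding order it produces (each interval is split at its midpoint, depth first) is one possible order that satisfies the reference dependencies, not necessarily the one used by C65 B or HM. Function names are hypothetical.

```python
def temporal_layer(poc, gop_size):
    """Temporal layer of a picture in a dyadic hierarchical gop (0 = key picture)."""
    offset = poc % gop_size
    if offset == 0:
        return 0
    layer = gop_size.bit_length() - 1   # log2(gop_size) for a power of two
    while offset % 2 == 0:
        offset //= 2
        layer -= 1
    return layer


def coding_order(gop_size):
    """Pocs of one gop in a valid coding order, starting with the next key picture."""
    order = [gop_size]

    def split(low, high):
        if high - low < 2:
            return
        middle = (low + high) // 2
        order.append(middle)            # coded after both of its references
        split(low, middle)
        split(middle, high)

    split(0, gop_size)
    return order


if __name__ == "__main__":
    print(coding_order(8))
    print([temporal_layer(poc, 8) for poc in range(9)])
```

For a gop size of 8 this gives four temporal layers (0 to 3). Dropping all pictures of layer 3 leaves pocs 0, 2, 4, 6 and 8, that is, a valid sequence at half the frame rate, which is the temporal scalability property of section 2.12.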

2.14 Decoded picture buffer (DPB)

During the course of the encoding and of the decoding processes, previously encoded or decoded pictures are stored in a DPB. They can then be used by the encoder to form predictions for subsequent pictures[17]. Depending on the video layer1 and the size of the picture, the DPB can have a capacity of pictures that ranges from 6 to 16[8][18].

1 A layer defines a set of constraints in the bitstream[8].

Figure 2.6: Hierarchical coding structure with 4 temporal levels

2.15 Low delay and random access configurations

The three types of coding structures are intra-only, low delay and random access. In the first structure, all the pictures are encoded as intra. In the low delay one, the first picture is an intra frame while the others are encoded as generalized P or B pictures (GPB)[8]. This structure is conceived for interactive real-time communication. Finally, the random access structure is similar to the hierarchical structure, with intra pictures inserted periodically, at the rate of about one per second. It is designed to enable relatively frequent random access points in the coded video data. This coding order has an impact on latency, since it requires frame reordering: the decoder might have to decode several frames before it can send them to output.

2.16 H.264 and its encoders

2.16.1 H.264

Since the standardization of H.264, several compliant encoders have been developed by organizations and individuals. The software developed by the Joint Video Team, in particular, is known as the Joint Model (JM)[19]. It has been used by developers in order to improve and test existing algorithms. It has a high compression efficiency, but its low computational efficiency makes it unsuitable in many cases. Among the encoders developed for H.264 we find x264[20], which has been used in many popular and successful applications like ffmpeg[21]. In [22] we can see that x264 performs better than several commercial H.264 encoders.

Chapter 3

Preliminary tests

The preliminary work consisted of a series of tests in order to precisely evaluate the performance of C65 with respect to other codecs: the C64 encoder is the predecessor of C65, and it was compliant with the H.264 standard. H.264 (limited and max) refers to the JM software. HEVC (limited and max) refers instead to the HM software.

3.1 Speed - quality considerations

3.1.1 Interactive applications

Figure 3.1: Interactive applications, low-delay mode c65 comparison against other codecs

Figure 3.1, which comes from an internal Ericsson test, shows the performance of various video encoders at equal PSNR value1 for interactive applications (the difference between interactive and entertainment applications is mentioned in [23]). The test is on 1080p video conference sequences, which have little motion between consecutive pictures. This test simulates a scenario where encoding delay is an important constraint, so the low-delay mode is selected for the encoders: only one picture is an I picture (the first in the video sequence) and there is no reordering of the pictures (encoding and output order coincide). The figure shows the relationship between encoding speed (expressed in frames encoded per second) and resulting bitrate. The comparison shows how C65 manages to outperform x264 both in terms of bitrate and encoding speed: it encodes 3.75 times faster and has a slightly lower bitrate. The chart also shows that C65 is much faster than the HM reference encoder (300 times), at the cost, of course, of a reduced compression efficiency. Overall, from this chart we deduce that C65 is a better encoder than x264 for videoconference sequences.

3.1.2 Entertainment applications

Figure 3.2: Entertainment applications, c65 comparison against other codecs

Figure 3.2 shows the performance of the same encoders as Figure 3.1, but for entertainment applications[23]. The test is on 1080p general content video sequences, composed of frames whose content can also change rapidly. The random-access mode of the encoders is used in this case, so an intra picture is inserted roughly every second and picture reordering is used. Here we can see that C65 is still faster than x264 but its compression efficiency is lower. For this specific application, the trade-off between speed and compression efficiency is better for x264. This chart gives the starting point for the work. One of the original targets for the thesis work was to improve the compression efficiency of C65. This will most likely result in some slowdown of C65, whose point on the chart will move down and left.

1 PSNR value of 35 dB.

Sequence        Anchor        Anchor     Test          Test       BD rate Luma [%]
                avg bitrate   avg PSNR   avg bitrate   avg PSNR
                [kbps]        [dB]       [kbps]        [dB]       avg      low      high
Amer football   13269.04      32.38      7117.76       32.38      -51.36   -56.18   -42.78
Sitcom          7745.88       33.49      3721.14       33.82      -57.89   -61.79   -50.17
LeMatch         11968.53      33.37      5513.68       33.49      -58.51   -61.19   -53.27
BBDrive         5555.90       35.58      2613.36       36.00      -59.33   -59.78   -58.65
Cactus          5346.80       33.90      2496.80       35.03      -64.65   -63.79   -63.87
XmasTree        11153.30      31.68      5231.31       32.46      -61.99   -62.13   -59.40
CrowdRun        18593.20      30.85      9475.68       31.22      -54.80   -52.79   -53.67
Ducks           23317.90      30.80      12234.70      30.82      -53.76   -61.92   -45.10
InToTree        4925.10       33.38      2627.93       33.95      -55.80   -58.94   -57.29
ParkJoy         20534.97      30.24      10966.51      30.59      -54.15   -58.23   -49.15
BigShips        4763.85       34.25      2203.93       35.21      -66.78   -66.06   -65.05
City            6927.97       33.64      3234.16       34.14      -59.39   -58.21   -57.05
Crew            4434.25       36.78      1814.92       37.28      -68.02   -71.46   -62.35
Average                                                           -58.96   -60.96   -55.22

Table 3.1: C65 vs HM 12.0 random access configuration

3.2 Compression efficiency tests

As said, encoding speed is a very important factor to take into account for interactive applications. In general it is well known that multimedia applications with real-time requirements need high computing power and high network bandwidth. Indeed, the latency when transmitting video content can be subdivided into encoder latency, transmission latency and decoder latency[24].

This is not a crucial factor for random access: in general the encoding happens once (imagine a DVD or Blu-ray video) and the generated bitstream must then be decoded every time the video sequence is played. Given that encoding latency is not a constraint for entertainment applications, picture reordering together with B-pictures is used, as it provides coding gains. On the other hand, it is important to keep the latency to a minimum for low-delay applications, so the encoding order should coincide with the output order of the frames.

In the following graphs, the results of the tests in terms of PSNR quality and bitrate are shown for the encoders HM12, JM18.5, x264 and C65.

3.2.1 C65 and HM12

Test 1

As we can see, reaching the same quality, the HM reference software generates files whose size is less than half compared to those created by C65. Table 3.1 shows the result of the test using C65 as anchor, HM as test. The random access configuration[25] for HM is used. In this table, as in the next ones that will be presented, we can read, from left to right:

• Short name of the sequence.

• Bit rate (average over the 4 qp) of the encoding done by the anchor encoder.

• PSNR (average over the 4 qp) of the encoding done by the anchor encoder.

• Bit rate (average over the 4 qp) of the encoding done by the test encoder.

• PSNR (average over the 4 qp) of the encoding done by the test encoder.

• BD rate % (negative numbers mean saving, positive mean loss). We have the average value as well as values for low and high bit rates, corresponding to highest and lowest qp values (37 and 22 respectively).

The values presented as average are the result of an arithmetic mean operation among all the results obtained encoding with the different qp values. For instance, the field “Anchor avg PSNR” represents the arithmetic mean among the four PSNR values of a specific video sequence, when encoded with qp 37, 32, 27 and 22. See section 1.4 about test conditions for more details.
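To make the BD rate column more concrete, the following is a minimal sketch of how such a figure can be computed from four (bitrate, PSNR) points per encoder. It follows the cubic-fit formulation commonly attributed to Bjøntegaard; the thesis text here does not state which tool produced the tables, so the function names and the rate points below are assumptions made for illustration only.

```python
import math


def _lagrange(xs, ys, x):
    """Evaluate the polynomial through the points (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _integral(xs, ys, lo, hi):
    """Integrate the cubic through four points over [lo, hi].

    Two-point Gauss-Legendre quadrature is exact for cubic polynomials.
    """
    mid, half = (lo + hi) / 2.0, (hi - lo) / 2.0
    g = half / math.sqrt(3.0)
    return half * (_lagrange(xs, ys, mid - g) + _lagrange(xs, ys, mid + g))


def bd_rate(anchor, test):
    """BD rate in percent; anchor and test hold four (bitrate_kbps, psnr_db) pairs."""
    pa = [p for _, p in anchor]
    pt = [p for _, p in test]
    la = [math.log10(r) for r, _ in anchor]
    lt = [math.log10(r) for r, _ in test]
    # Only the PSNR interval covered by both curves is compared.
    lo, hi = max(min(pa), min(pt)), min(max(pa), max(pt))
    avg_diff = (_integral(pt, lt, lo, hi) - _integral(pa, la, lo, hi)) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0


# Made-up rate points: the test encoder needs half the bitrate of the anchor
# at every PSNR, so the result is a saving of 50 %.
anchor_points = [(1000.0, 30.0), (2000.0, 33.0), (4000.0, 36.0), (8000.0, 39.0)]
test_points = [(r / 2.0, p) for r, p in anchor_points]
saving = bd_rate(anchor_points, test_points)
```

A negative result means that the test encoder needs less bitrate than the anchor for the same PSNR, which matches the sign convention used in the tables.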

Test 2

A very interesting test is the one presented in table 3.2. It shows a comparison between two runs of the HM 12 encoder with different parameters. The anchor is the same as the test encoder used in the previous table. The test is still HM12 but configured to use only 1 reference picture for each list (list 0 and list 1, see sections 4.2.2 and 2.10 for details). This test is of particular importance because it shows that there is only a rather marginal loss when using one reference picture per list. For this reason, the work on C65 will use simply one or two entries in each list (depending on the type of picture to encode, P or B), so as to simplify the implementation and speed up the processing at run time.

3.2.2 C65 and x264

Test 3

In test 3 we compare x264 and C65; the results are shown in table 3.3. x264 is known for being highly customizable: many parameters can be adjusted to best fit the needs of the user. For our test we used the “veryslow” preset, which gives a very high compression efficiency at the cost of a reduced encoding speed. The results show that the overall compression performance of x264 is still much better than that of C65 (a BD rate of -36.94% on average). Moreover, for all these tests there is not a big difference between the values for high and low bitrate.

Sequence        Anchor        Anchor     Test          Test       BD rate Luma [%]
                avg bitrate   avg PSNR   avg bitrate   avg PSNR
                [kbps]        [dB]       [kbps]        [dB]       avg      low      high
Am foot         7117.76       32.38      7190.31       32.36      1.15     0.16     2.20
Sitcom          3721.14       33.82      3856.20       33.80      2.34     0.31     7.18
LeMatch         5513.68       33.49      5563.06       33.48      1.01     0.74     1.36
BBDrive         2613.36       36.00      2630.12       35.98      1.02     0.65     1.41
Cactus          2496.80       35.03      2552.01       35.01      2.31     1.58     3.33
XmasTree        5231.31       32.46      5352.62       32.41      3.05     2.30     3.89
CrowdRun        9475.68       31.22      9507.54       31.22      0.49     0.79     0.32
Ducks           12234.70      30.82      12233.38      30.82      0.02     0.24     -0.11
InToTree        2627.93       33.95      2626.06       33.94      -0.42    -0.72    0.13
ParkJoy         10966.51      30.59      11008.24      30.58      0.65     0.92     0.47
BigShips        2203.93       35.21      2255.08       35.19      1.88     0.10     4.52
City            3234.16       34.14      3339.78       34.09      2.54     -0.54    9.24
Crew            1814.92       37.28      1893.52       37.24      5.14     3.96     8.04
Average                                                           1.63     0.81     3.23

Table 3.2: HM 12.0 random access configuration vs same software using only one reference picture.

Test 4

In analogy with Test 2, table 3.4 evaluates the performance of the preset “veryslow” of x264, compared to the case where the same preset is used but only one reference picture is used in each list. The option to set here is --ref 1. Here we see a still moderate and even slightly smaller loss (about 1 %).

3.3 Conclusions

The purpose of all these tests was to have a preliminary investigation in order to make decisions on the implementation of C65. The results show that using only one reference picture in each list is an acceptable compromise. For this reason, C65 B will build lists of the minimum size (1 for P pictures and 2 for B pictures).

Sequence        Anchor        Anchor     Test          Test       BD rate Luma [%]
                avg bitrate   avg PSNR   avg bitrate   avg PSNR
                [kbps]        [dB]       [kbps]        [dB]       avg      low      high
Am foot         13269.04      32.38      12267.81      33.32      -31.94   -29.63   -30.56
Sitcom          7745.88       33.49      6869.16       34.17      -34.11   -32.35   -35.73
LeMatch         11968.53      33.37      10213.84      34.34      -38.66   -34.99   -39.63
BBDrive         5555.90       35.58      4699.76       36.39      -28.86   -22.78   -34.36
Cactus          5346.80       33.90      4175.54       35.33      -43.19   -38.10   -44.81
XmasTree        11153.30      31.68      8535.12       32.73      -40.09   -37.72   -40.16
CrowdRun        18593.20      30.85      14932.90      32.25      -41.58   -35.93   -41.82
Ducks           23317.90      30.80      21239.24      32.16      -39.22   -42.72   -34.43
InToTree        4925.10       33.38      3981.62       33.84      -25.32   -10.93   -35.77
ParkJoy         20534.97      30.24      17471.34      31.53      -38.74   -38.66   -35.62
BigShips        4763.85       34.25      3520.90       35.11      -43.17   -39.77   -43.57
City            6927.97       33.64      5604.14       34.20      -32.13   -21.31   -35.95
Crew            4434.25       36.78      3382.31       37.41      -43.17   -43.64   -41.72
Average                                                           -36.94   -32.96   -38.01

Table 3.3: C65 vs x264 with settings: --psnr --threads 1 --profile high --preset veryslow --tune psnr -I 48

Sequence        Anchor        Anchor     Test          Test       BD rate Luma [%]
                avg bitrate   avg PSNR   avg bitrate   avg PSNR
                [kbps]        [dB]       [kbps]        [dB]       avg      low      high
Am foot         12267.81      33.32      12358.19      33.30      1.11     0.52     1.26
Sitcom          6869.16       34.17      7052.27       34.16      1.71     0.44     3.44
LeMatch         10213.84      34.34      10365.57      34.33      1.79     2.15     1.30
BBDrive         4699.76       36.39      4741.10       36.35      1.25     0.86     1.57
Cactus          4175.54       35.33      4244.01       35.29      1.90     0.80     3.04
XmasTree        8535.12       32.73      8685.27       32.69      2.61     2.30     2.49
CrowdRun        14932.90      32.25      14923.03      32.24      0.33     0.49     0.29
Ducks           21239.24      32.16      21266.50      32.16      0.10     0.33     -0.01
InToTree        3981.62       33.84      4017.25       33.83      0.12     -1.13    1.00
ParkJoy         17471.34      31.53      17481.15      31.52      0.45     0.76     0.36
BigShips        3520.90       35.11      3568.26       35.08      1.63     1.18     2.84
City            5604.14       34.20      5678.13       34.15      1.48     -0.68    3.80
Crew            3382.31       37.41      3428.05       37.40      1.28     0.42     2.00
Average                                                           1.21     0.65     1.80

Table 3.4: x264 with settings: --psnr --threads 1 --profile high --preset veryslow --tune psnr -I 48 vs x264 with settings: --psnr --threads 1 --profile high --preset veryslow --ref 1 --tune psnr -I 48

Chapter 4

Implementing B pictures

There are three main picture types: intra (I), uni-predictive (P), and bi-predictive (B) pictures. In particular, B pictures can take two frames as reference: any two previously encoded reference pictures can be used, even the same picture twice. In this case, the prediction is obtained by combining the motion-compensated predictions from the forward reference picture and the backward reference picture. B pictures with QP toggling are a cheap way of increasing the frame rate at a low bit cost, and they are useful in cases of occlusion, uncovering caused by zooming, and non-linear motion[26].

B frames were not implemented in the C65 encoder as its target was mostly video-conferencing video. The reason for not implementing B slices in C65 is that they give gains when backward and forward references are used. Since this would require picture reordering, which would in turn introduce encoding delay (see section 2.13), it was chosen not to support B slices in C65. Indeed, when not using backward and forward prediction, the difference in compression efficiency between forward-only B and P is small and the complexity is higher both in encoder and decoder.

In the following sections we give an introduction to gop structures when introducing B frames and we explain which modifications were made to C65 to support a set of possible gop structures. Moreover, we explain what this entails in terms of coding delay and memory requirements. Finally, considerations are given about the details of implementing B slices in C65: constructing reference lists, changing the candidate generation for merge and AMVP mode, and the modifications required by the motion estimation and motion compensation units.

4.1 Gop structure

4.1.1 C65’s original gop structure

In order to achieve minimal delay and to keep computational complexity low, C65 originally uses a flat gop structure like the one presented in Figure 4.1. The gop size, which is the distance between two I frames, is set by a command line parameter. We can notice that only I (red) and P (blue) frames are used.


Figure 4.1: C65 original encoding order: here, like in the rest of the dissertation, red is used for I pictures, blue for P and green for B pictures. Moreover, the black numbers below the frames represent the number of the picture in display order, i.e. as they would normally be played. Instead the red numbers above represent the picture number in encoding order, i.e. as they are processed by the encoder.

The delay is the minimum possible: coding order and output order coincide, so pictures are sent to output as soon as they are encoded or decoded. The memory requirement is also as low as possible: only one frame at a time is required in the decoded picture buffer, plus the current picture.

4.1.2 Hierarchical gop structure

In MPEG-2 a gop was defined as a set of one I and many P and B frames, where each B picture used the nearest past and future I or P picture for prediction[27]. Hierarchical B-picture coding was first introduced in H.264/AVC in order to improve coding performance and provide strong temporal scalability as well. H.264 allows the flexibility to choose any picture (also B) to be used as reference for prediction[28]. It is possible to use only one I picture and several B pictures in each gop, where some B frames act as reference.

4.1.3 Choice of gop size

The first step to support hierarchical gops in C65 was to decide on a gop structure. The choice can be made among several options: even staying with a pure hierarchical gop structure, the gop size can vary. Usual values are powers of two and range between four and thirty-two. The problem is that the choice of the optimal gop size is highly dependent on the content of the video sequence to encode. As a general rule, the longer the gop, the higher the achievable compression ratio[16]. It has been shown that sequences with low variation of the content throughout the frames are encoded more efficiently when the gop is relatively large[29]. In the special case of very rapidly changing sequences, studies show that they are encoded with fewer bits if the gop size is small[30][31].

Additionally, the choice is not restricted to dyadic1 cases: the gop structure can form something different from a pyramid. We can have gop sizes that are not powers of two and even a lack of symmetry in the gop structure[32]. These gop structures are also characterized by a specific encoder/decoder delay.

1 A gop structure is called dyadic if every layer x contains 2 · num_of_slices(x − 1) slices, where num_of_slices(x − 1) is the number of slices in layer x − 1.

4.1.4 C65 optional parameter -hgopsize

With so many possibilities for the choice of a gop structure, it seemed reasonable to provide a mechanism to choose. The possible options implemented during the course of this work in the C65 B encoder are:

• flat gop structure, as shown in Fig. 4.1

• dyadic hierarchical gop structure

– dyadic gop structure, size 2, as shown in Fig. 4.2(a)

– dyadic gop structure, size 4, as shown in Fig. 4.2(b)

– dyadic gop structure, size 8, as shown in Fig. 4.2(c)

• non-dyadic gop structure, size 8, as shown in Fig. 4.3

Dyadic hierarchical gop structures

It is possible to choose among 3 different hierarchical gop structures through the command -hgopsize followed by the number 2, 4 or 8. The options are presented in figure 4.2.

Non-dyadic hierarchical gop structure

Tests 3 and 4 presented in the previous chapter were also useful because we could analyze the gop structure chosen by x264 to encode the selected video sequences. We noticed different configurations, but the most common structure was still of the kind IBBBBBBBP, like those implemented in C65 B but with different entries in list 0 and list 1, and so with different pictures taken as reference. For this reason we implemented this gop structure in C65 B too. The schema is shown in figure 4.3; it is still a hierarchical structure, but with one temporal layer less than the configuration of figure 4.2(c).

Flat gop structure

One last gop structure is implemented in C65 B, corresponding to the original one implemented in C65 for video conferencing. It is the same as shown in figure 4.1 and it is obtained simply by setting the gop size to 1, with the command -hgopsize 1. Although this rarely provides the best performance in terms of video compression (as we will see in the Results chapter), it was used for tests in order to compare compression efficiency and computational complexity of the other solutions.

4.1.5 Considerations on coding delay

The main constraint on the coding order of the frames of a video sequence is that the pictures used as reference must be coded before they are used. This constraint still leaves a lot of design space and different methods can be employed: these methods have different characteristics in terms of decoding delay and memory requirements[16][33].

(a) Hierarchical gop structure size 2, -hgopsize 2

(b) Hierarchical gop structure size 4, -hgopsize 4

(c) Hierarchical gop structure size 8, -hgopsize 8

Figure 4.2: Dyadic hierarchical gop structures implemented in C65

Figure 4.3: Hierarchical non-dyadic gop structure size 8, -hgopsize 8n

It is interesting to comment on the coding order and delay given by the configurations presented in the previous section and implemented in C65 B. As already stated, the encoder latency given by the flat structure (Fig. 4.1) is minimum, as every picture can be encoded right after being received by the encoder from the source (for example, from the camera). For the dyadic structure of size 2, it is straightforward to notice that it introduces an encoder latency of one picture at most: the frame with poc 0 is received from the source and encoded, then frame 1 is received. Finally, frame 2 is received, encoded and used as reference to encode frame 1. For the dyadic structure of size 4, the encoder latency introduced is 3. Indeed, picture 1 (in display order) is received by the encoder from the source but it cannot be encoded until pictures 2 and 4 are received, according to the order of encoding 0 4 2 1 3... For the hierarchical structure of size 8, the encoding latency increases again to 7, following the order of encoding 0 8 4 2 1 3 6 5 7... Finally, the non-dyadic hierarchical gop of size 8 (Fig. 4.3) has a coding delay of 7 frames, according to the encoding order 0 8 4 1 2 3 5 6 7.

The choice of the gop size then also depends on other factors, like the DPB size and the delay we can afford to tolerate: although delay is not a strict constraint for several applications, it could be for others, and large gop sizes can introduce unacceptable waiting times.
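The encoding orders and latencies quoted above can be reproduced with a short sketch. It is not code from C65 B: the function names are hypothetical, and the latency is simply taken as the largest gap between the newest picture received and the picture being encoded.

```python
def dyadic_order(gop_size):
    """Encoding order (as pocs) of the first dyadic gop, starting from poc 0."""
    order = [0, gop_size]

    def split(lo, hi):
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        order.append(mid)          # the middle picture is coded first
        split(lo, mid)
        split(mid, hi)

    split(0, gop_size)
    return order


def encoder_latency(order):
    """Largest number of later pictures the encoder must wait for."""
    latency = 0
    newest = 0                     # highest poc received from the source so far
    for poc in order:
        newest = max(newest, poc)
        latency = max(latency, newest - poc)
    return latency


orders = {n: dyadic_order(n) for n in (2, 4, 8)}
latencies = {n: encoder_latency(o) for n, o in orders.items()}
non_dyadic = [0, 8, 4, 1, 2, 3, 5, 6, 7]   # order quoted in the text for Fig. 4.3
```

For the dyadic structures the latency is always the gop size minus one, because the picture with poc 1 has to wait for the key picture that closes the gop.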

4.1.6 Memory requirements for DPB

Another constraint to take into account when defining the encoding order, other than the coding delay, is the memory requirement. It is meaningful to analyze the occupation of the DPB depending on the various gop structures. The flat structure (Fig. 4.1) always keeps in the buffer just the current and the previous frame, so it represents the lower bound for memory utilization. The dyadic structure of size 2 has up to three frames in the buffer. The B pictures require two references, so both the previous and the following picture (in display order) must be kept in the buffer, in addition to the current one. The dyadic structure of size 4 consumes up to four memory slots in the buffer. For example, the frame with poc 1 needs frames 0 and 2 as reference. Moreover, frame 4 cannot be removed from the buffer either, as it will be used as reference for frames 3 and 8. In a similar manner we can deduce that the dyadic structure of size 8 needs up to 5 frames to be kept in the buffer, including the current one (see the case of poc 1). For clarity, the content of the DPB when processing the frame with poc 1 is presented in figure 4.4.

Figure 4.4: The content of the DPB when processing frame with poc 1, using the dyadic gop structure of size 8. The first number (before the “/”) represents the number of the picture in coding order. The number after the “/” is instead the poc. The current frame is here omitted, but it is also part of the DPB.
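The buffer occupancies given above can be checked with a small simulation. The sketch assumes that each key picture references the previous key picture and that each B picture references the two pictures that bound its interval, which agrees with the examples in the text; it counts only pictures still needed as reference plus the current one, and all names are hypothetical.

```python
def dyadic_gops(gop_size, num_gops=2):
    """Coding order and reference pocs for consecutive dyadic gops (assumed references)."""
    order, refs = [0], {0: []}

    def split(lo, hi):
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        order.append(mid)
        refs[mid] = [lo, hi]                  # B picture uses both interval ends
        split(lo, mid)
        split(mid, hi)

    for g in range(num_gops):
        lo, hi = g * gop_size, (g + 1) * gop_size
        order.append(hi)
        refs[hi] = [lo]                       # key picture predicted from previous key
        split(lo, hi)
    return order, refs


def max_dpb_pictures(order, refs):
    """Peak number of pictures held, counting the current one."""
    last_use = {}
    for idx, poc in enumerate(order):
        for r in refs[poc]:
            last_use[r] = idx                 # last coding step that still needs r
    peak, stored = 0, set()
    for idx, poc in enumerate(order):
        stored = {p for p in stored if last_use.get(p, -1) >= idx}
        peak = max(peak, len(stored) + 1)     # +1 for the current picture
        stored.add(poc)
    return peak


peaks = {n: max_dpb_pictures(*dyadic_gops(n)) for n in (1, 2, 4, 8)}
```

Under these assumptions each extra temporal layer costs one more slot in the buffer, which is why the size grows from 2 (flat) to 5 (dyadic, size 8).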

4.1.7 Syntax elements to set

The bitstream generated by the encoder must signal the display order of the encoded frames: this is not a problem as every nal unit contains the pic_order_cnt_lsb that represents the least significant bits of the picture order count (the number of the frame in display order)[8]. Moreover, sps_max_num_reorder_pics must be changed accordingly. It indicates the maximum allowed number of pictures that can precede any picture in decoding order and follow that picture in output order[8]. The correct value depends on the gop size, and it is 1, 2 and 3 respectively for sizes of 2, 4 and 8. It must anyway be in the range of 0 to sps_max_dec_pic_buffering_minus1 (here we see the link between coding delay and memory requirement). Finally, the value sps_max_latency_increase_plus1 must be correctly set. It specifies the maximum number of pictures that can precede any picture in output order and follow that picture in decoding order, so it gives a measure of the maximum allowed latency. In our case we set it to 0, meaning that arbitrary latency is allowed.
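The values 1, 2 and 3 quoted above for the reorder syntax element follow directly from its definition and from the encoding orders of section 4.1.5. A minimal sketch, with a hypothetical function name:

```python
def max_num_reorder_pics(order):
    """Largest number of pictures that precede a picture in decoding order
    and follow it in output order; order lists pocs in decoding order."""
    worst = 0
    for idx, poc in enumerate(order):
        earlier_but_later = sum(1 for p in order[:idx] if p > poc)
        worst = max(worst, earlier_but_later)
    return worst


reorder = {
    "flat": max_num_reorder_pics([0, 1, 2, 3]),
    "dyadic 2": max_num_reorder_pics([0, 2, 1]),
    "dyadic 4": max_num_reorder_pics([0, 4, 2, 1, 3]),
    "dyadic 8": max_num_reorder_pics([0, 8, 4, 2, 1, 3, 6, 5, 7]),
}
```

For the dyadic structures the worst case is the picture with poc 1, which is decoded after one picture from each lower temporal layer.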

4.2 B slices - details about implementation

This section will briefly describe the modifications that were necessary to implement B slices in C65. This work was done step by step: the path followed was to implement uni-prediction with B slices using first only list 0, then only list 1. Finally, bi-prediction was implemented too.

4.2.1 Gop structure and references

The first step was to give an explicit description of the reference picture set (RPS) via a configuration file. Details about the RPS can be found in [34]. To summarize, the RPS contains a list of values representing the “deltaPOC” of all reference pictures that the decoder must keep and not discard. The picture order count of the reference picture is calculated through the formula

\[ \mathit{POC}_{\mathit{reference}} = \mathit{POC}_{\mathit{current}} + \mathit{deltaPOC} \tag{4.1} \]

An example of a configuration file that gives an explicit reference picture set description, for the hierarchical gop structure of size 4 (Fig. 4.2(b)), can be found in A.1. In principle, this configuration file must contain, for each reference picture set, a list of deltaPOC elements, in order to indicate the reference pictures for each frame. Moreover, a list of frames with their associated explicit description must also be contained in the configuration file. The information needed for each frame is the poc, the temporal id and a link to the corresponding RPS. In the specific case of the configuration file for the software C65 B, this list is given in encoding order.

Using an explicit description contained in a configuration file, it is possible to change the encoding order of the frames in the sequence and to set the references correctly. Once the encoder worked properly using the configuration file as input, the gop structure was implemented directly in the software, without the need for an explicit description. Moreover, changes to the software were necessary because the encoding order changes but the frames output to the reconstruction file must always be in display order.

After implementing all the mentioned gop structures, a command line parameter was added in order to select which one to use, as described in section 4.1.4.

4.2.2 B slices - building list 0 and list 1

In order to prepare the ground for bi-prediction, list 0 and list 1 must be filled with the right pictures that will be taken as reference. The process consists of several steps:

• calculate the poc of the pictures used as reference, using the current poc and the delta poc, as in equation 4.1;

• keep track of the number of negative and positive values for delta poc (pictures coming respectively before and after the current frame in display order);

• build list 0 adding to it:

– the pictures in the DPB with poc corresponding to the values calculated and that come from negative deltaPOC values;

– the pictures in the DPB with poc corresponding to the values calculated and that come from positive deltaPOC values;

– the pictures in the DPB that are long term pictures;

• build list 1 in a similar but symmetrical way to list 0.

It is worth noticing that when employing B slices, even if bi-prediction is never used, the decoder builds the two reference lists.

Figure 4.5: Position of spatial candidates of motion information
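The list construction steps above can be sketched as follows. This is an illustration only, not the C65 B code: the function and parameter names are hypothetical, pictures are identified by their poc, and ordering each group by distance from the current picture is an assumption.

```python
def build_ref_lists(current_poc, delta_pocs, dpb_pocs, long_term=(), num_active=1):
    """Return (list0, list1) as lists of pocs, following the steps in the text."""
    ref_pocs = [current_poc + d for d in delta_pocs]          # equation 4.1
    available = [p for p in ref_pocs if p in dpb_pocs]
    # Negative deltas give pictures before the current one, positive deltas after.
    before = sorted((p for p in available if p < current_poc), reverse=True)
    after = sorted(p for p in available if p > current_poc)
    lt = [p for p in long_term if p in dpb_pocs]
    list0 = (before + after + lt)[:num_active]
    list1 = (after + before + lt)[:num_active]                # mirrored order
    return list0, list1


# Picture with poc 2 in the dyadic gop of size 4: references are poc 0 and poc 4.
l0, l1 = build_ref_lists(current_poc=2, delta_pocs=[-2, 2], dpb_pocs={0, 4})
l0_two, l1_two = build_ref_lists(2, [-2, 2], {0, 4}, num_active=2)
```

With one active entry per list, list 0 holds the past picture and list 1 the future one; with two entries each list also contains the picture from the opposite side.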

4.2.3 Generation of merge candidates

In merge mode, the process to generate the merge candidates must be, as said, exactly the same on the encoder and decoder side. This process varies depending on whether the slice type of the CU to be encoded is P or B. Those changes needed to be implemented in C65 B in order to achieve an encoder-decoder match.

Zero motion vectors

There are cases in which the number of merge candidates does not reach the maximum number (which is signaled in the slice_segment_header() structure[8]). To understand this, we have to know that merge candidates are generated by copying the motion vectors from neighboring blocks (for the moment we assume that the concept of neighbor is only spatial). The neighbors are represented in figure 4.5. If we consider, for example, the first block of every frame (position x=0 and y=0), we can notice that it has none of the candidates available. In this case, the merge candidates are generated as “zero motion vectors”, with horizontal and vertical displacement equal to zero. In case of P slices, these motion vectors refer to the first picture in list 0. In case of B slices, bi-directional zero motion vectors are generated instead (one zero motion vector for each list is generated until all the merge candidates have been created). For this reason, if merge mode is chosen by the encoder and the merge index transmitted points to one of those zero motion vectors, the encoder is choosing bi-prediction. In an early stage of the development, when bi-prediction was not yet implemented on the encoder side, a workaround was found by disabling bi-prediction in the decoder’s source code.
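The padding with zero motion vectors can be sketched as below. The candidate representation (a dict with one entry per list) and all names are hypothetical, and the way the reference index advances is a reading of the standard that should be checked against [8].

```python
def fill_zero_candidates(candidates, max_cands, is_b_slice, num_ref_l0=1, num_ref_l1=1):
    """Pad the merge list with zero motion vectors until it holds max_cands entries.

    A candidate is {"l0": motion, "l1": motion}, where motion is either None
    or ((mvx, mvy), ref_idx).
    """
    num_ref = min(num_ref_l0, num_ref_l1) if is_b_slice else num_ref_l0
    filled = list(candidates)
    zero_idx = 0
    while len(filled) < max_cands:
        ref_idx = zero_idx if zero_idx < num_ref else 0
        cand = {"l0": ((0, 0), ref_idx), "l1": None}
        if is_b_slice:
            cand["l1"] = ((0, 0), ref_idx)     # bi-directional zero candidate
        filled.append(cand)
        zero_idx += 1
    return filled


# First block of a frame: no spatial neighbor is available.
p_list = fill_zero_candidates([], max_cands=5, is_b_slice=False)
b_list = fill_zero_candidates([], max_cands=5, is_b_slice=True)
```

In the B slice case every padded candidate carries motion for both lists, which is why selecting one of them amounts to choosing bi-prediction.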

Additional merge candidates for B slices

As we read in [17], the generation of merge candidates when using B slices requires another change: for B frames, additional merge candidates are generated, starting from already calculated candidates for list 0 and list 1. Each pair of motion vectors is built following one of the twelve predefined pairs, defined as (0,1), (1,0), (0,2), (2,0), (1,2), (2,1), (0,3), (3,0), (1,3), (3,1), (2,3), (3,2). For example, the first merge candidate is built using the first motion vector candidate for list 0 and the second for list 1. Up to 5 merge candidates can be included, after removing duplicates. This process of course does not apply to P frames, where only list 0 is used. For this reason the encoder needed to be changed in order to conform to the valid sequence of operations to create merge candidates.
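A minimal sketch of how the twelve predefined pairs could be applied is given below. The candidate representation and the helper poc_of are hypothetical; the sketch only rejects a combination whose two halves describe the same motion, which is an assumption about what "removing duplicates" covers.

```python
PAIRS = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1),
         (0, 3), (3, 0), (1, 3), (3, 1), (2, 3), (3, 2)]


def add_combined_candidates(candidates, max_cands, poc_of):
    """Append combined bi-predictive candidates (B slices only).

    Each candidate is {"l0": motion, "l1": motion}; motion is None or
    ((mvx, mvy), ref_idx). poc_of(list_name, ref_idx) gives the poc of the
    referenced picture.
    """
    result = list(candidates)
    original = len(result)
    for i0, i1 in PAIRS:
        # The pairs are ordered so that the first out-of-range index ends the search.
        if len(result) >= max_cands or i0 >= original or i1 >= original:
            break
        m0, m1 = result[i0]["l0"], result[i1]["l1"]
        if m0 is None or m1 is None:
            continue
        same_motion = (m0[0] == m1[0]
                       and poc_of("l0", m0[1]) == poc_of("l1", m1[1]))
        if not same_motion:
            result.append({"l0": m0, "l1": m1})
    return result


# Two uni-predicted spatial candidates, one per list; list 0 points to poc 0
# and list 1 to poc 8 (values chosen for illustration).
spatial = [{"l0": ((1, 0), 0), "l1": None}, {"l0": None, "l1": ((2, 0), 0)}]
merged = add_combined_candidates(
    spatial, max_cands=5, poc_of=lambda name, idx: 0 if name == "l0" else 8)
```

Here the pair (0,1) yields one new bi-predictive candidate, while the pair (1,0) is skipped because the second candidate has no list 0 motion.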

Direction of the prediction

As a last consideration, the encoder also needs to internally store (and signal) the direction of the prediction. This is always 1 (corresponding to PRED_L0) for P slices, while it can also be 2 (PRED_L1) or 3 (PRED_BI) for B slices. This operation is done during the generation of the merge candidates.

4.2.4 Motion Vector Prediction

The process to generate the candidates for AMVP mode is performed once for each MV, so once for uni-predicted PUs, or twice for a bidirectional PU. The candidates are chosen from both list 0 and list 1. In effect, even when using only uni-prediction, B slices still require building the two lists. This is certainly a difference from P slices, where the candidates can instead come only from list 0. The standard[8] describes the candidate generation in detail, as it must be the same on the encoder and decoder side. It can be summarized as follows:

• Try to obtain the left predictor using the following criteria:

– try A0, if not available try A1, as in figure 4.5;

– if the previously chosen candidate is available for both list 0 and list 1, prefer the same list as the current motion vector to code over the opposite list. Here lies the change: two lists are possible with B slices;

– prefer a motion vector with the same reference index; if none of the neighbors satisfies this condition, the motion vector must be scaled according to the distance of the picture. This point in particular required changes, as the original C65 allowed only one reference picture per list, so the reference index would always be 0.

If any of the previous points resulted in a candidate, this is added to the list.

• Try to obtain the above predictor using the following criteria:

– prefer candidates in this order: B0, B1 and finally B2, as in figure 4.5;

– prefer the same list as the current motion vector to code over the opposite list;

– prefer a motion vector with the same reference index; if none of the neighbors satisfies this condition, the motion vector must be scaled according to the distance of the picture. The scaling only occurs if it was not done previously, so every prediction unit can have at most one scaling operation.

Again, if any of the previous points resulted in an acceptable candidate, this is added to the list.

• if the list has fewer than 2 candidates and the option TMVP is not disabled, try to add the temporal motion vector. This topic is not treated in detail because TMVP was disabled in our implementation.

In the implementation of B slices, the generation of the predictors for AMVP needed to be completely re-designed and re-implemented, also adding operations like scaling. During this complex process, it was very important to have a debug trace of the motion vectors chosen for every block, as well as the predictors, the index of the selected predictor and the difference. This same trace was added to the HM decoder in order to verify the match between the two log files.
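The scaling step mentioned above can be illustrated with a short sketch. It follows the fixed-point scaling of the standard[8] as understood here and should be verified against the specification; the function names are hypothetical and the sketch handles one vector component at a time.

```python
def clip3(lo, hi, v):
    """Clamp v to the range [lo, hi]."""
    return max(lo, min(hi, v))


def scale_mv(mv, cur_poc, target_ref_poc, neighbor_ref_poc):
    """Scale one motion vector component taken from a spatial neighbor.

    td is the poc distance covered by the neighbor's vector, tb the distance
    to the reference picture of the block being coded.
    """
    td = clip3(-128, 127, cur_poc - neighbor_ref_poc)
    tb = clip3(-128, 127, cur_poc - target_ref_poc)
    if td == 0 or td == tb:
        return mv                                    # nothing to scale
    num = 16384 + abs(td) // 2
    tx = num // td if td > 0 else -(num // -td)      # division truncating toward zero
    factor = clip3(-4096, 4095, (tb * tx + 32) >> 6)
    prod = factor * mv
    mag = (abs(prod) + 127) >> 8
    return clip3(-32768, 32767, mag if prod >= 0 else -mag)


# Current picture poc 4; the neighbor points to poc 0, the current block to poc 2.
halved = scale_mv(8, cur_poc=4, target_ref_poc=2, neighbor_ref_poc=0)
# Opposite temporal direction: the scaled vector changes sign.
mirrored = scale_mv(8, cur_poc=4, target_ref_poc=8, neighbor_ref_poc=0)
```

The vector is stretched or shrunk in proportion to the ratio of the two poc distances, so a neighbor that points twice as far back yields a predictor half as long.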

4.2.5 B slices - syntax fixes

The slice header for B slices is slightly different from the one for P slices, so these changes need to be implemented in C65. In particular, according to the H.265 standard [8], B slices need:

• num_ref_idx_l1_active_minus1, indicating the number of reference pictures for list 1;
• ref_pic_list_modification_flag_l1;
• mvd_l1_zero_flag.

The last two syntax elements were not relevant for our purposes; their meaning can be found in [8].

Moreover, we need the essential syntax element inter_pred_idc, which indicates the direction of the prediction and is signaled for each prediction unit. The values it can assume are:

• PRED_L0 if only a reference on list 0 is being used;
• PRED_L1 if only a reference on list 1 is being used;
• PRED_BI for bi-prediction.

This syntax element must be CABAC encoded; for this reason it is mandatory to follow the binarization process indicated in section 9.3.3.7 of the standard [8]. During this operation, it was essential to print a trace of the CABAC engine, including the symbol to encode and the probability state [13]. The same CABAC debug trace has been implemented on the decoder side, so that the two traces can be compared until they match.
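As an illustration of what the binarization step produces, the sketch below maps each prediction direction to its bin string before arithmetic coding. It is a simplified reading of the binarization in [8], under the assumption that 8x4 and 4x8 prediction blocks cannot use bi-prediction and therefore need a single bin; the numeric values of the constants and the function name inter_pred_idc_bins are illustrative, and context selection is left out.

```python
PRED_L0, PRED_L1, PRED_BI = 0, 1, 2


def inter_pred_idc_bins(idc, pb_width, pb_height):
    """Return the list of bins that represent inter_pred_idc for one prediction block."""
    if pb_width + pb_height == 12:
        # 8x4 and 4x8 blocks: only uni-prediction, one bin selects the list.
        if idc == PRED_BI:
            raise ValueError("bi-prediction is not available for 8x4 and 4x8 blocks")
        return [idc]
    if idc == PRED_BI:
        return [1]          # first bin set: bi-prediction
    return [0, idc]         # first bin clear, second bin selects list 0 or list 1


bins_bi = inter_pred_idc_bins(PRED_BI, 16, 16)
bins_l1 = inter_pred_idc_bins(PRED_L1, 16, 16)
```

Printing these bins next to the CABAC state on both the encoder and the decoder side is one way to obtain the kind of trace described above.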

Finally, to support Advanced Motion Vector Prediction mode, motion vectors pointing to list 1 need the same mechanism that already exists for those referring to list 0. In particular, the mvd_coding() structure must be output. This represents the motion vector difference syntax and is defined in section 7.3.8.9 of the standard [8].

4.2.6 Other modifications

Deblocking filter

Implementing B slices requires some other modifications to the encoder software. Mainly, as described in the standard [8], some changes are needed in the deblocking filter [35]. The operation involved is the boundary strength check, which must be performed differently for P and B slices: with bi-prediction we must take into account two references on two lists, which can also correspond to the same picture, whereas P slices always have only one reference and one motion vector (these factors are used to determine the strength of the filter).
The deblocking filter has therefore been disabled, as enabling it would require the changes described.

TMVP

A temporal motion vector candidate can be generated and used among the merge candidates and, when merge mode is not used, among the motion vector predictor candidates. The process to generate the temporal motion vector for a given prediction unit takes the motion vector from the bottom-right position of the collocated PU in the reference picture. The collocated PU is the one that spatially corresponds to the current PU, but in the reference frame rather than in the current frame. The bottom-right position may not be available (consider, for example, PUs at the bottom or right border of the frame). In that case, the center position is used instead [17]. HEVC allows flexibility by transmitting an index to specify which reference picture list is used. In particular we can examine the following syntax elements:

• collocated_from_l0_flag specifies the reference picture list: 1 is transmitted for list 0 and 0 for list 1.

• collocated_ref_idx specifies the reference index of the collocated picture used for temporal motion vector prediction.

As we can see, HEVC also requires the reference index of the reference picture to be transmitted. This was not a problem in the old implementation of C65, where the gop structure was flat and the reference picture was always the previous one (in display and coding order): the reference index was always 0. The element collocated_from_l0_flag was also always set to 1, indicating list 0.
The modifications made in C65 B to support B slices changed this scenario: for B slices, two reference pictures are available for each list, and there are two lists. For this reason the right values should be assigned to the syntax elements listed above. In addition, scaling operations are necessary in case of reference index greater than 1.
Finally, one main concern related to the use of the temporal candidate is the need to store the motion information of the reference picture. With n reference pictures, we would potentially need to save the motion vector information of all the prediction units of at least n pictures in the DPB. In the old C65 implementation this was limited to a single picture.
For these reasons, we considered that for a first implementation the temporal motion vector predictors can be disabled. This is accomplished by setting the syntax element slice_temporal_mvp_enabled_flag to 0 [8].

Figure 4.6: Fractional positions in Luma motion compensation with 1/4 pel accuracy

4.3 Motion compensation

4.3.1 Interpolation

Like H.264/AVC, HEVC supports motion vectors with quarter-pel accuracy for the luma component and one-eighth pel accuracy for the chroma components, for video in the 4:2:0 color format. In figure 4.6 we can see that $A_{0,0}$, $A_{1,0}$, ... correspond to actual pixel values, whereas $a_{0,0}$, $c_{0,0}$, ... correspond to quarter-pel positions and $b_{0,0}$, $h_{0,0}$ are half-pel positions. The values must be derived according to an interpolation process, for which a filter must be defined. In HEVC, the luma interpolation process uses a symmetric 8-tap filter for half-pel positions and an asymmetric 7-tap filter for quarter-pel positions. The filter coefficients are shown in table 4.1.

α      filter(α)
1/4    {-1, 4, -10, 58, 17, -5, 1}
1/2    {-1, 4, -11, 40, 40, -11, 4, -1}

Table 4.1: Luma interpolation filter in HEVC

For example, since the filter coefficients sum to 64, the value of $b_{0,0}$ rounded back to the input bit depth is given by the formula:

$$b_{0,0} = (-A_{-3,0} + 4 A_{-2,0} - 11 A_{-1,0} + 40 A_{0,0} + 40 A_{1,0} - 11 A_{2,0} + 4 A_{3,0} - A_{4,0} + 32) \gg 6$$

Figure 4.7: Averaging of input signal in H.264. The precision is the same as input video depth, for example 8 bit

B slices require roughly double the amount of filtering operations. Indeed this has to be done for both reference pictures, before the averaging. This has an impact in performance terms and it also requires additional buffers to store predictions.
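The half-pel filtering described in this section can be sketched on a single row of samples as follows. This is only an illustration of the 8-tap filter of table 4.1, not the C65 B code: the function name half_pel is hypothetical, and the sketch assumes that the result is rounded back to the input bit depth, so it normalizes by the filter gain of 64 (a rounded right shift by 6).

```python
HALF_PEL = (-1, 4, -11, 40, 40, -11, 4, -1)   # coefficients sum to 64


def half_pel(row, x, bit_depth=8):
    """Half-pel sample between row[x] and row[x + 1].

    Needs three samples to the left of x and four to the right.
    """
    acc = sum(c * row[x - 3 + k] for k, c in enumerate(HALF_PEL))
    value = (acc + 32) >> 6                    # rounded division by 64
    return max(0, min((1 << bit_depth) - 1, value))


flat = half_pel([100] * 8, 3)                      # constant area stays constant
ramp = half_pel(list(range(0, 120, 10)), 5)        # between 50 and 60
```

On a constant area the filter returns the input value, and on a linear ramp it returns the midpoint of the two neighboring samples (rounded down by the integer arithmetic). For a bi-predicted block this filtering is run once per reference picture, which is the doubling of work mentioned above.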

4.3.2 Averaging

The averaging operation had to be implemented from scratch in C65 B. The old software supports only P slices, so motion compensation simply consists in copying the (possibly interpolated) values from the reference picture to the current frame. Since B slices have more than one reference picture, the prediction signal of a bi-predictively coded block is obtained by averaging the prediction signals of the two reference pictures, coming from list 0 and list 1 respectively.
In older standards like H.264, the values being averaged are directly pixel values, so they lie in the range 0...255 inclusive for 8-bit video. As explained in the previous section, the value has to be obtained by interpolation in case of a non-integer pel position, and the rounding happens right after the interpolation, following a scheme like the one in figure 4.7. The final value is given by the formula:

S = (S1 + S2 + 1) >> 1

On the other hand, in HEVC prediction signals are averaged at a higher precision. We can notice this in case of motion vectors with sub-pixel accuracy. The rounding operation does not take place right after the interpolation; instead, the averaging happens first. For this reason, the operations are performed at a higher precision and the rounding happens only at the end. The scheme is represented in figure 4.8.

Figure 4.8: Averaging of input signal in HEVC. The operations are performed at a higher precision than the one of input signal.

In this case, S1 and S2 can be obtained as the result of interpolation, and the shift is performed after the averaging. In case of integer pel motion vectors, the bit depth is first increased by shifting left by 6 positions. The final value is given by the formula:

S = (S1 + S2 + 64) >> 7

Finally, considerations about memory requirements are necessary. In addition to a possibly increased memory footprint (since up to two reference frames must be kept in memory), averaging also requires an intermediate buffer in which the averaging operations are performed and the result is stored. In this implementation, the buffer size is minimized: we simply use a 2-D array of size 64x64, corresponding to the maximum size of a CU.
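The difference between the two averaging schemes can be made concrete with a small numerical sketch. It assumes 8-bit video and prediction signals carried with 6 extra fractional bits, as in the formula above; the function names are illustrative and the sample values are chosen by hand to expose the effect of rounding twice.

```python
def round_to_pixel(s):
    """Round an intermediate-precision value (6 extra bits) to an 8-bit pixel."""
    return max(0, min(255, (s + 32) >> 6))


def average_early_rounding(s1, s2):
    """H.264-like scheme: round each prediction first, then S = (S1 + S2 + 1) >> 1."""
    return (round_to_pixel(s1) + round_to_pixel(s2) + 1) >> 1


def average_late_rounding(s1, s2):
    """HEVC-like scheme: average at high precision, S = (S1 + S2 + 64) >> 7."""
    return max(0, min(255, (s1 + s2 + 64) >> 7))


s1 = 6432        # 100.5 in pixel units (for example an interpolated sample)
s2 = 100 << 6    # 100.0, an integer-pel sample lifted to the same precision

early = average_early_rounding(s1, s2)
late = average_late_rounding(s1, s2)
```

The exact average of the two inputs is 100.25. Rounding each input first turns 100.5 into 101 and the final result into 101, whereas averaging at the higher precision yields 100.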

4.3.3 SIMD code

Video encoding is one of the main areas where a great speedup can be achieved by parallelizing operations. In particular, data-parallel optimizations are possible and useful where a time-expensive function has to operate on a large array of data. All major microprocessor vendors have developed multimedia instruction set extensions for their architectures [36]. A typical example on common desktop platforms is the Intel SSE technology [37], which is based on the prevalent single-instruction-multiple-data (SIMD) execution model [38]. By enabling several computations to be executed simultaneously with a single instruction, SIMD technologies provide a series of effective approaches for fast algorithm design and implementation, which is essential for most multimedia applications where large amounts of data are frequently processed.
Motion compensation is a good candidate for parallelization with SIMD instructions: the same filtering and averaging operations must be done for every single pixel of a frame. This is even more important in HEVC, whose high complexity burdens the processor. Finally, since HEVC is designed for high resolution formats (up to 8K [17]), the array of data can become very large.

For the reasons above, an optional SIMD version of the motion compensation functions was written for C65 B. This is different in several aspects from the SIMD optimization present in the old version of C65 that supported only P slices. In particular we can notice that:

• The filtering operations must be done for each reference list, so twice in case of bi-prediction. In the old version of the encoder, the interpolation was instead performed only on the reference picture from list 0.

• The averaging operation is required for bi-prediction. It was not required for uni-prediction, where a simple copy is performed. While averaging, a number of factors must be taken into account, such as the dimensioning of the data that can be processed in a single vector: since averaging requires sums, overflow must be taken into account.
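The overflow concern in the last point can be illustrated without any SIMD intrinsics by emulating a signed 16-bit lane in plain Python. The sketch assumes 8-bit video, unnormalized 8-tap filter outputs (6 extra bits) and no offset subtraction on the intermediate values; the extreme input is synthetic and all names are illustrative, so this only shows why the lane width matters and is not how C65 B lays out its vectors.

```python
HALF_PEL = (-1, 4, -11, 40, 40, -11, 4, -1)


def wrap_int16(x):
    """Emulate two's complement wraparound of a signed 16-bit lane."""
    return (x + 0x8000) % 0x10000 - 0x8000


def clip_pixel(x):
    return max(0, min(255, x))


# Largest filter output for 8-bit input: 255 under the positive taps, 0 under the negative ones.
MAX_INTERMEDIATE = 255 * sum(c for c in HALF_PEL if c > 0)


def average_16bit_lanes(s1, s2):
    """Sum kept in 16-bit lanes: the addition can wrap around."""
    return clip_pixel(wrap_int16(s1 + s2 + 64) >> 7)


def average_32bit_lanes(s1, s2):
    """Operands widened before the sum: no wraparound."""
    return clip_pixel((s1 + s2 + 64) >> 7)


narrow = average_16bit_lanes(MAX_INTERMEDIATE, MAX_INTERMEDIATE)
wide = average_32bit_lanes(MAX_INTERMEDIATE, MAX_INTERMEDIATE)
```

Each intermediate value fits in 16 bits, but the sum of two of them plus the rounding offset does not, so the narrow version turns a saturated white sample into black. Widening the lanes halves the number of samples processed per instruction, which is the dimensioning trade-off mentioned above.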

4.4 Motion estimation

Motion estimation of a block consists in finding a sample region in a reference frame that closely matches the current block. The reference frame is a previously encoded frame from the sequence. Several algorithms have been developed to perform motion estimation: from the very effective but computationally prohibitive full search to sophisticated methods that dramatically reduce the complexity while slightly compromising the quality of the prediction.
The match between the current block and a candidate block can be evaluated using different measures. In C65 B the SAD and SSD measures are implemented.

Sum of Absolute Differences (SAD) is one of the simplest similarity measures. The SAD between the current block C and the reference block R can be expressed as

$$\mathrm{SAD}(C, R) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} |c_{i,j} - r_{i,j}|$$

where $c_{i,j}$ and $r_{i,j}$ are the (i, j)-th elements of the blocks C and R respectively and N is the block size. The Sum of Squared Differences SSD(C, R) between the current block C and the reference block R can be expressed as

$$\mathrm{SSD}(C, R) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} (c_{i,j} - r_{i,j})^2$$

This measure has a higher computational complexity than SAD, as it involves numerous multiplications. By default the latter measure (SSD) is used.
In uni-prediction blocks, the algorithm scans the blocks of the reference frame within the search space. For each block, the SSD between that block and the current one is computed, and the block with the lowest SSD is finally selected as the best result.
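The two measures and the exhaustive scan of the search space can be sketched in a few lines. This is a plain, unoptimized illustration on lists of rows, not the search used in C65 B; the names sad, ssd, crop and full_search, as well as the square search window, are assumptions made for the example.

```python
def sad(cur, ref):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(c - r) for cr, rr in zip(cur, ref) for c, r in zip(cr, rr))


def ssd(cur, ref):
    """Sum of squared differences between two equally sized blocks."""
    return sum((c - r) ** 2 for cr, rr in zip(cur, ref) for c, r in zip(cr, rr))


def crop(frame, x, y, n):
    """n x n block of frame whose top-left corner is at column x, row y."""
    return [row[x:x + n] for row in frame[y:y + n]]


def full_search(cur_block, ref_frame, x0, y0, search_range):
    """Test every displacement in the window and return (lowest SSD, (dx, dy))."""
    n = len(cur_block)
    height, width = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = x0 + dx, y0 + dy
            if x < 0 or y < 0 or x + n > width or y + n > height:
                continue                      # candidate falls outside the picture
            cost = ssd(cur_block, crop(ref_frame, x, y, n))
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


ref_frame = [[y * 8 + x for x in range(8)] for y in range(8)]
cur_block = crop(ref_frame, 3, 2, 2)          # content that sits one sample to the right
result = full_search(cur_block, ref_frame, x0=2, y0=2, search_range=2)
```

Here the block located at (2, 2) in the current picture is found again one sample to the right in the reference, so the search returns a zero SSD with displacement (1, 0).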

The output of this process is a matching area in the reference picture, indicated by a motion vector that is characterized by horizontal and vertical displacement, reference index and list index. When implementing bi-prediction, the computation of the distance measure must change. A simple approach would be to apply the algorithm presented above twice, once for each list, and average the two SSD values so computed. In the next chapter we will see a better way of computing the SSD in case of bi-prediction.

Chapter 5

Rate distortion optimization

5.1 Motion estimation

Motion estimation is one of the key elements of many video compression schemes, in particular the MPEG-like ones and the H.26x family. Through motion estimation we can take advantage of the temporal redundancy between adjacent frames of a video: usually the n-th frame of a video sequence looks very similar to the frame at position n − 1. If, for instance, only one object moved in the frame and the rest of the content did not change at all, it makes sense to transmit just the information about the movement of that object. For this reason, as in inter prediction in general, a frame is selected as a reference, and subsequent frames are predicted from the reference. The output of the motion estimation process on a given block is a motion vector.
It is well known that motion estimation is a very costly operation in terms of complexity: about 60% of the computational complexity at the encoder side is due to this function. At the same time, it is important to perform this process accurately, as efficient motion estimation reduces the energy in the motion compensated residual and, for this reason, can dramatically improve compression performance.

5.2 Motion estimation in C65 B

The motion estimation process implemented in C65 B must take into account the modifications explained in the previous chapter: B slices require motion estimation to be performed differently from P slices. In this section, the motion estimation algorithm of C65 B is explained.

5.2.1 Step zero: generate the merge candidates list

The first step of the motion estimation is to evaluate the merge motion vector candidates, which are generated as explained in the HEVC standard [8]. It is important to notice that, when implementing B slices, it was necessary to modify this merge candidate generation process as well, as explained in section 4.2.3.

5.2.2 Step one: choose best merge index

Evaluating the candidates requires evaluating the error introduced when using a specific motion vector, that is, the difference between the current block and the block in the reference frame pointed to by the motion vector. The operation consists in computing the SSD value between these two blocks. For P slices (still supported and used in C65 B, as shown for example in Fig. 4.2), no modifications to the encoder were needed. For B slices instead, the scenario is more complicated. The merge candidate generation process can return different kinds of motion vectors:

1. The merge candidate can be composed of only one motion vector, pointing to list 0. In this case, SSD is computed exactly like in the case of P slices, so no modification is needed.

2. The merge candidate can be composed of only one motion vector, pointing to list 1.

3. The merge candidate can be bi-predictive, so composed of two motion vectors, one pointing to list 0 and the other referring to list 1.

Cases 2 and 3 have been fully implemented in C65 B to support B slices.

Motion vector pointing to list 1

In this case, the SSD measure must be computed between the region of the current frame to encode (taken from the source) and the region pointed to by the motion vector. While the internal steps of the SSD function are exactly the same as when using P slices (the SSD function takes two picture portions as input), we need to use list 1 this time: the reference block must be taken from the picture on list 1 that corresponds to the reference index provided by the motion vector.

Bi-directional motion estimation

The function that computes the SSD distance between the two regions must be adapted in case of bi-prediction. It is possible that the merge candidate generation process returns two motion vectors, one pointing to list 0 and the other to list 1. In this case we need to compute the similarity between the following two picture regions:

• The first (A) is still the current frame to encode (taken from the source);

• The second (B) is the result of a merging (averaging) operation between the picture areas pointed to by the two motion vectors, referring to the two different lists.

For this reason, we need to compute B as a first step. The averaging operation is similar to the one performed for motion compensation and described in section 4.3.2. It is performed at a higher precision than the input bit depth. The current value of the pixel to encode is then subtracted from each pixel of the averaged block, and the difference is squared. The results for all the pixels in a block are summed up to obtain our measure of similarity, the SSD (see section 4.4). Of course, the lower the SSD, the more similar the blocks, and so the better the motion vectors.
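The computation described above can be sketched for the simplest case of two integer-pel reference blocks of 8-bit samples. The sketch lifts each sample by 6 bits and averages with S = (S1 + S2 + 64) >> 7 as in section 4.3.2; the function names and the sample values are illustrative and do not come from C65 B.

```python
def ssd(cur, ref):
    """Sum of squared differences between two equally sized blocks."""
    return sum((c - r) ** 2 for cr, rr in zip(cur, ref) for c, r in zip(cr, rr))


def bipred_block(ref0, ref1):
    """Region B: average of the list 0 and list 1 blocks at the higher precision."""
    return [[((a << 6) + (b << 6) + 64) >> 7 for a, b in zip(r0, r1)]
            for r0, r1 in zip(ref0, ref1)]


def ssd_bipred(cur, ref0, ref1):
    """Distance between the source block (A) and the averaged prediction (B)."""
    return ssd(cur, bipred_block(ref0, ref1))


cur = [[10, 20], [30, 40]]
ref0 = [[8, 20], [30, 44]]     # block pointed to by the list 0 motion vector
ref1 = [[12, 22], [30, 40]]    # block pointed to by the list 1 motion vector

cost_l0 = ssd(cur, ref0)
cost_l1 = ssd(cur, ref1)
cost_bi = ssd_bipred(cur, ref0, ref1)
```

In this hand-made example the bi-predictive cost (5) is lower than either uni-predictive cost (20 and 8), and it also differs from the plain average of the two SSD values (14), which is why the SSD is computed on the averaged block.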

Loop through all the merge candidates

The SSD cost is calculated for all the merge candidates, taking into account that they can be uni- or bi-directional. In order to choose the best merge candidate, we need to remember that this choice has to be signaled to the decoder. In particular, the merge index is transmitted. The maximum number of merge candidates is signaled in the bitstream, in the slice segment header, through the syntax element five_minus_max_num_merge_cand. In all cases, the merge index is a number between 0 and 4. The cost of transmitting the merge index is therefore not always the same, as the value 4 requires more bits to be encoded than 0. For this reason, the cost of using the merge mode depends both on how accurate the motion vectors are and on the merge index needed to signal them. This is a classical rate-distortion problem, which can be viewed as the minimization of the Lagrangian function

$$J_{MOTION} = D + \lambda_{MOTION} \cdot R_{MOTION} \qquad (5.1)$$

where the distortion D represents the prediction error measured in SSD [39]. This is weighted against the number of bits $R_{MOTION}$ associated with the merge index, using a Lagrangian multiplier $\lambda_{MOTION}$. For simplicity, we will give J the name of “merge cost”. The λ multiplier is usually (and in our case too) a function of the Quantization Parameter, as in [39].
For this reason, in order to choose the best merge candidate to use, the merge cost is computed for all the candidates and the one with the lowest value is chosen. The index of the merge candidate is stored in a best merge index variable.
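The selection of the best merge index can be sketched as a direct minimization of equation 5.1. The rate model used here is an assumption: it counts the bins of a truncated unary code for the merge index (1 bin for index 0, up to 4 bins for indices 3 and 4 when five candidates are allowed) and ignores the savings of arithmetic coding; the λ values are arbitrary and the function names are illustrative.

```python
def merge_index_bins(index, max_candidates=5):
    """Approximate rate of a merge index: length of its truncated unary code."""
    return min(index + 1, max_candidates - 1)


def best_merge_index(ssd_values, lam, max_candidates=5):
    """Return (index, cost) minimizing J = D + lambda * R over the merge candidates."""
    costs = [d + lam * merge_index_bins(i, max_candidates)
             for i, d in enumerate(ssd_values)]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]


ssd_values = [1000, 900, 1200]                 # distortion of candidates 0, 1 and 2
low_lambda = best_merge_index(ssd_values, lam=60)
high_lambda = best_merge_index(ssd_values, lam=150)
```

With a small λ the more accurate candidate 1 wins despite its longer index, while with a larger λ (higher qp) the cheaper index 0 is preferred.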

5.2.3 Step two: test the skip mode

The next step, after having determined the best merge candidate, is testing whether the skip mode can be used. The skip mode (see introduction) saves bits, as it does not require sending any transform information. On the other hand, it has to be chosen only when the prediction is very good and leads to little distortion, so as not to compromise the final video quality. For this reason, if the SSD associated with the best merge candidate is lower than a certain constant value, the skip mode is selected: the motion estimation process terminates, and the motion vectors returned are those indicated by the best merge index value.

5.2.4 Step three: test the intra mode

At this point, the SSD of the best merge candidate is compared to the SSD associated with the intra prediction of the block (this value was previously computed). Intra prediction is used in case it provides better quality (i.e. if the SSD associated with intra prediction is lower than the one associated with inter prediction using the merge candidate selected previously). In this case, the motion estimation process terminates.
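Steps two and three both act as early exits of the motion estimation process, which can be summarized in a few lines. The threshold value and the returned labels are placeholders chosen for this sketch; the constant actually used in C65 B is not stated here.

```python
def early_decision(best_merge_ssd, intra_ssd, skip_threshold):
    """Return 'skip', 'intra', or 'continue' (go on with steps four to six)."""
    if best_merge_ssd < skip_threshold:
        return "skip"        # step two: prediction is good enough, send no residual
    if intra_ssd < best_merge_ssd:
        return "intra"       # step three: intra prediction gives lower distortion
    return "continue"


decision = early_decision(best_merge_ssd=350, intra_ssd=900, skip_threshold=100)
```

Only blocks for which neither exit applies reach the more expensive uni-prediction test and local search.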

5.2.5 Step four: test uni-prediction

The purpose of this step is to test whether, for the current block, uni-prediction can give better results than bi-prediction. Given the list of candidates (step 0), only the bi-predictive merge candidates are selected. As explained, those are made of two motion vectors, one for each list. For each of the candidates in the list, the two motion vectors are tested separately (only if there was no merge candidate with the exact same values), so an SSD is associated with each of them.
This value, as explained, is not enough to perform a rate-distortion optimization: we also need the cost of these motion vectors. Since these motion vectors do not belong to the merge candidates list, it is not possible to use the merge mode to choose them; the AMVP mode must be used. This has a different (usually higher) cost, as it requires transmitting an index to select the motion vector predictor (see section 2.5) as well as the motion vector to use, differentially coded with respect to this predictor.
The function to minimize is still the one presented in equation 5.1, and the value D still represents the SSD associated with the prediction for a given motion vector. The R component, in this case, has to take into account both the predictor index and the motion vector difference that must be transmitted (whose cost increases as the absolute value of its components becomes larger).
Once the cost of each of the uni-prediction motion vectors that use the AMVP mode is computed, it is compared to the “merge cost” of the best merge candidate. The option with the lower cost is selected, and it becomes the current best candidate.
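The AMVP cost used in this comparison can be approximated as follows. The rate model is a rough bin count and not the estimator of C65 B: it assumes one bin for the predictor index and, for each motion vector difference component, a greater-than-0 flag, a greater-than-1 flag, a first order Exp-Golomb code for the remainder and a sign bin, following the structure of mvd_coding() in [8]. Reference index and prediction direction bits are left out, and all names are illustrative.

```python
def eg1_bins(value):
    """Length of the first order Exp-Golomb code of a non-negative integer."""
    return 2 * (value + 2).bit_length() - 2


def mvd_component_bins(d):
    """Approximate number of bins needed for one motion vector difference component."""
    a = abs(d)
    if a == 0:
        return 1                      # greater-than-0 flag only
    if a == 1:
        return 3                      # two flags and the sign
    return 3 + eg1_bins(a - 2)        # two flags, sign and the remainder


def amvp_cost(ssd_value, mv, predictor, lam):
    """J = D + lambda * R, with R = predictor index bin + bins of both components."""
    rate = 1 + sum(mvd_component_bins(m - p) for m, p in zip(mv, predictor))
    return ssd_value + lam * rate


cost = amvp_cost(500, mv=(5, -1), predictor=(4, 1), lam=10)
```

The resulting value is then compared with the merge cost of the best merge candidate. The count grows with the magnitude of the difference, matching the remark above that larger components are more expensive to transmit.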

Why testing uni-prediction is effective

Predicting a block from two references is not necessarily a better rate-distortion choice than predicting it from only one reference block. In some cases this is even more evident, as in the following case study, which we will call “scene-cut”.
A characteristic of video content (specific to non-videoconferencing videos) is the so-called scene cut or scene change. This happens when the content of a certain picture within the sequence is completely unrelated to the content of the immediately preceding pictures. We find this very often in movies, for example, when a scene finishes. Let us assume that the gop structure used by the encoder is the dyadic hierarchical one of size 8 (see Fig. 4.2(c)). Let us also assume that the scene cut happens between the pictures with poc 5 and 6, as shown in Fig. 5.1.

Figure 5.1: Scene cut occurring between pictures with poc 5 and 6

In this example, picture 6 is likely not to be a good reference for picture 5, so it would be better if picture 5 always used uni-prediction, taking picture 4 as reference. In the same way, picture 6 should only use uni-prediction from picture 8, and picture 4 should only be predicted from picture 0.
In order to demonstrate the proper working of this uni-prediction feature implemented in C65 B, we have built a video sequence by concatenating two sequences named “Akiyo” and “Foreman”. Both are in QCIF format (resolution 176x144). The first 6 frames of the generated video sequence are taken from the first 6 frames of “Akiyo” and the remaining ones are taken from the first frames of “Foreman”. In Fig. 5.2 some mode decisions (uni- or bi-prediction) are shown for several pictures in the generated sequence. As usual, green means bi-prediction and blue means uni-prediction. The transparent CUs are skipped. Moreover, a green motion vector means list 1 and a red one means list 0.

• Fig. 5.2(a) shows picture with poc 2. As we notice, most of the time skip is used (reasonable for a slow moving sequence like Akiyo), and most of the blocks are bi-predicted.

• Fig. 5.2(b) shows picture with poc 4. Uni-prediction is always used, and the motion vectors point to list 0, so the reference picture is, as expected, the one with poc 0.

• Fig. 5.2(c) shows picture with poc 6. Uni-prediction is always used, and the motion vectors point to list 1, so the reference picture is, as expected, the one with poc 8.

• Fig. 5.2(d) shows picture with poc 8. This picture is a P slice, but inter prediction is not used: almost all the blocks are intra coded (as expected, since the reference frame has a completely unrelated content).

Figure 5.2: Some pictures taken from a video sequence created concatenating frames from two different video sequences: (a) Poc 2, (b) Poc 4, (c) Poc 6, (d) Poc 8.

5.2.6 Step five: motion vector local search

The current best candidate consists of one or two motion vectors. The purpose of this step is to find a local optimum for the values of those motion vectors, by searching the area of the reference picture near the one pointed to by the current motion vectors. The algorithm is applied to each motion vector independently, so no change was needed with respect to the original version of C65, which only supports P slices. For this reason, the specific motion vector search implemented in C65 is not described in this report, as it was not part of the thesis work.

5.2.7 Step six: mode selection

The last step consists in selecting the mode: if the current best candidate corresponds to any of the merge candidates generated in step 0, merge mode is used, so only the merge index must be transmitted to indicate the choice of the motion vectors. If instead no merge candidate corresponds to the current best candidate, Advanced Motion Vector Prediction must be used. In this case the predictor index and the motion vector difference (structure mvd_coding() in the HEVC standard [8]) must be signaled.

5.3 Quantization parameter choice

One of the main parameters controlling the number of bits used to compress a video sequence (and therefore also its quality) is the quantization parameter, as we read in section 2.8. As explained in section 2.13, in a hierarchical gop structure like those adopted in C65 (see Fig. 2.6) the choice of the qp is a strategic decision for the overall coding efficiency. The key pictures are those at the base layer, that is, those satisfying the equation

poc mod hgopsize = 0

where poc is the picture order count and hgopsize is the parameter defined in section 4.1.4. These key pictures should be encoded with the highest fidelity, because they are used, directly or indirectly, as reference for all the pictures of the layers above. For the pictures one layer above the base one, a larger qp should be used, as they are taken as reference by fewer pictures. Following this rule, the qp should be increased for each subsequent level of the hierarchy.
The optimal choice of the quantization parameter can be determined by using a very expensive rate distortion analysis, for example following the strategy presented in [40]. Such a complex operation for the encoder can be avoided at the cost of a sub-optimal choice. In the “Results” chapter, the tests of various qp values for each hierarchy level are analyzed. As an anticipation, good results for the C65 encoder are obtained with very large variations of the qp among the various levels.

Chapter 6

Results and future works

6.1 Methodology

As explained in section 5.3, the choice of the quantization parameter used to encode each picture is a crucial factor for the final compression efficiency that the encoder can achieve, and the optimal value can be determined through an expensive rate distortion analysis.
A reasonably good solution consists in running a set of tests with different qp values for each picture within a gop, determining which qp “configuration” provides the best compression efficiency. This method implies that the qp values are assigned to pictures without considering the specific characteristics of the video sequence to encode, and it requires no rate distortion analysis (so no computational time), as the qp values for each picture are embedded in the code (or better, given in a configuration file).
Exhaustive tests would in principle be possible: given a base qp value (an integer) provided as a command line parameter to C65 B and given a predetermined range, we could try all the possible configurations of qp values in the range for each picture. This problem is equivalent to trying all the arrangements with repetition of n elements in groups of k, where n is our range and k is our gop size. For instance, considering a gop structure of size k = 2, a range n = 3 and a base qp of 25, we enumerate all the possible tests we could in principle run in table 6.1. In general, the number of tests to run for each gop structure of size k and range n would be D(n, k) = n^k. For the reasonable case of k = 8 and n = 11 alone, 11^8 = 214358881 configurations would have to be tested.
This approach, besides being infeasible, is also useless, as a great part of those tests can be avoided with simple considerations: we should spend more bits on the pictures at lower temporal layers, as they are directly or indirectly used as reference for all the pictures at higher temporal layers.
On the other hand, we can spend few bits on the pictures at the highest temporal layer, since they are never used as references. Moreover, there is no specific reason why, at least with an a priori approach, we should assign different qp values to pictures belonging to the same temporal layer.
The methodology adopted in the following tests is for this reason to apply a qp value lower than the base qp to the pictures in the low temporal layers, and to give a qp value greater than the base qp to the higher temporal layers. Moreover, tests start from a gentle qp toggling (small qp delta values) and gradually apply a heavier qp toggling, as long as the compression efficiency keeps improving. Finally, small variations are applied to some qp values, to see if a further improvement is possible.

Test #   qp picture 1   delta qp   qp picture 2   delta qp
1        24             -1         24             -1
2        24             -1         25              0
3        24             -1         26             +1
4        25              0         24             -1
5        25              0         25              0
6        25              0         26             +1
7        26             +1         24             -1
8        26             +1         25              0
9        26             +1         26             +1

Table 6.1: Possible qp values to assign to pictures in a gop structure of size 2, given a base qp value equal to 25 and a range of 3 qp values
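The exhaustive enumeration discussed in this section, and the entries of table 6.1, can be reproduced with a few lines. This is only an illustration of the counting argument; the function name qp_configurations is hypothetical.

```python
from itertools import product


def qp_configurations(base_qp, deltas, gop_size):
    """All assignments of one qp per picture, each qp taken from base_qp + delta."""
    return [tuple(base_qp + d for d in combo)
            for combo in product(deltas, repeat=gop_size)]


table_6_1 = qp_configurations(25, (-1, 0, 1), 2)   # k = 2 pictures, n = 3 values
count_small = len(table_6_1)                        # n ** k = 9
count_large = 11 ** 8                               # k = 8, n = 11
```

The list for k = 2 and n = 3 contains the nine rows of table 6.1 in the same order, while the count for k = 8 and n = 11 shows why the full enumeration is never actually run.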

6.2 Results

A base qp value (represented by an integer number) is provided as a command line parameter to the software C65 B. A series of “configurations” is presented and tested here. Each configuration assigns specific qp values to the pictures within a gop.

6.2.1 hgopsize 8 - various qp values

The tests presented in this section refer to the gop structure presented in Fig. 4.2(c). Table 6.2 presents a summary of the results: the configuration with no qp toggle (qp value equal to base qp for all the pictures) is compared to the various qp configurations under test (see following sections for details).

Configuration     BD Rate %
-4 -1 0 1 2       -15.02
-4 -2 0 2 4       -21.04
-4 -2 2 3 4       -21.81
-4 -3 0 4 8       -23.99
-4 -2 2 4 6       -24.09
-4 -3 2 4 8       -24.85
-4 -3 -3 4 8      -19.66

Table 6.2: Summary of savings (in BD Rate percentage) when using qp toggling compared to using the same qp value for all the pictures. Gop structure of size 8.

No qp toggle vs configuration -4 -1 0 1 2

The configuration named -4 -1 0 1 2 assigns the following values to the pictures presented in Fig. 4.2(c):

• picture 0 (I slice): qp = base qp − 4;

• picture 8 (P slice): qp = base qp − 1;

• picture 4 (B slice): qp = base qp;

• picture 2 (B slice): qp = base qp + 1;

• picture 6 (B slice): qp = base qp + 1;

• picture 1 (B slice): qp = base qp + 2;

• picture 3 (B slice): qp = base qp + 2;

• picture 5 (B slice): qp = base qp + 2;

• picture 7 (B slice): qp = base qp + 2;

Table A.1 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -1 0 1 2.
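The mapping from a configuration name to per-picture qp values can be sketched as follows. The code assumes a dyadic gop (size equal to a power of two), a single I slice at poc 0 and a configuration written as (I, P, B layer 1, B layer 2, ...); the function names are hypothetical and do not come from C65 B.

```python
def temporal_layer(poc, gop_size):
    """Temporal layer of a picture in a dyadic hierarchical gop.

    Key pictures (poc multiple of gop_size) are on layer 0; every halving
    of the distance between pictures adds one layer.
    """
    pos = poc % gop_size
    if pos == 0:
        return 0
    layer = gop_size.bit_length() - 1   # log2(gop_size): the highest layer
    while pos % 2 == 0:
        pos //= 2
        layer -= 1
    return layer


def picture_qp(poc, base_qp, deltas, gop_size):
    """qp of a picture, with deltas given as (I, P, B layer 1, B layer 2, ...)."""
    if poc == 0:                         # assumption: only the first picture is an I slice
        return base_qp + deltas[0]
    return base_qp + deltas[1 + temporal_layer(poc, gop_size)]


if __name__ == "__main__":
    config = (-4, -1, 0, 1, 2)           # the configuration named -4 -1 0 1 2
    for poc in range(9):
        print(poc, picture_qp(poc, 25, config, 8))
```

With a base qp of 25 this prints 21 for picture 0, 24 for picture 8, 25 for picture 4, 26 for pictures 2 and 6, and 27 for the odd pictures, in agreement with the list above.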

No qp toggle vs configuration -4 -2 0 2 4

The configuration named -4 -2 0 2 4 assigns the following values to the pictures presented in Fig. 4.2(c):

• picture 0 (I slice): qp = base qp − 4;

• picture 8 (P slice): qp = base qp − 2;

• picture 4 (B slice): qp = base qp;

• picture 2 (B slice): qp = base qp + 2;

• picture 6 (B slice): qp = base qp + 2;

• picture 1 (B slice): qp = base qp + 4;

• picture 3 (B slice): qp = base qp + 4;

• picture 5 (B slice): qp = base qp + 4;

• picture 7 (B slice): qp = base qp + 4;

Table A.2 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -2 0 2 4.

No qp toggle vs configuration -4 -2 2 3 4

The configuration named -4 -2 2 3 4 assigns the following values to the pictures presented in Fig. 4.2(c):

• picture 0 (I slice): qp = base qp − 4;

• picture 8 (P slice): qp = base qp − 2;

• picture 4 (B slice): qp = base qp + 2;

• picture 2 (B slice): qp = base qp + 3;

• picture 6 (B slice): qp = base qp + 3;

• picture 1 (B slice): qp = base qp + 4;

• picture 3 (B slice): qp = base qp + 4;

• picture 5 (B slice): qp = base qp + 4;

• picture 7 (B slice): qp = base qp + 4;

Table A.3 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -2 2 3 4.

No qp toggle vs configuration -4 -3 0 4 8

The configuration named -4 -3 0 4 8 assigns the following values to the pictures presented in Fig. 4.2(c):

• picture 0 (I slice): qp = base qp − 4;

• picture 8 (P slice): qp = base qp − 3;

• picture 4 (B slice): qp = base qp;

• picture 2 (B slice): qp = base qp + 4;

• picture 6 (B slice): qp = base qp + 4;

• picture 1 (B slice): qp = base qp + 8;

• picture 3 (B slice): qp = base qp + 8;

• picture 5 (B slice): qp = base qp + 8;

• picture 7 (B slice): qp = base qp + 8;

Table A.4 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -3 0 4 8.

No qp toggle vs configuration -4 -2 2 4 6

The configuration named -4 -2 2 4 6 assigns the following values to the pictures presented in Fig. 4.2(c):

• picture 0 (I slice): qp = base qp − 4;

• picture 8 (P slice): qp = base qp − 2;

• picture 4 (B slice): qp = base qp + 2;

• picture 2 (B slice): qp = base qp + 4;

• picture 6 (B slice): qp = base qp + 4;

• picture 1 (B slice): qp = base qp + 6;

• picture 3 (B slice): qp = base qp + 6;

• picture 5 (B slice): qp = base qp + 6;

• picture 7 (B slice): qp = base qp + 6;

Table A.5 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -2 2 4 6.

No qp toggle vs configuration -4 -3 2 4 8

The configuration named -4 -3 2 4 8 assigns the following values to the pictures presented in Fig. 4.2(c):

• picture 0 (I slice): qp = base qp − 4;

• picture 8 (P slice): qp = base qp − 3;

• picture 4 (B slice): qp = base qp + 2;

• picture 2 (B slice): qp = base qp + 4;

• picture 6 (B slice): qp = base qp + 4;

• picture 1 (B slice): qp = base qp + 8;

• picture 3 (B slice): qp = base qp + 8;

• picture 5 (B slice): qp = base qp + 8;

• picture 7 (B slice): qp = base qp + 8;

Table A.6 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -3 2 4 8.

No qp toggle vs configuration -4 -3 -3 4 8

The configuration named -4 -3 -3 4 8 assigns the following values to the pictures presented in Fig. 4.2(c):

• picture 0 (I slice): qp = base qp − 4;

• picture 8 (P slice): qp = base qp − 3;

• picture 4 (B slice): qp = base qp − 3;

• picture 2 (B slice): qp = base qp + 4;

• picture 6 (B slice): qp = base qp + 4;

• picture 1 (B slice): qp = base qp + 8;

• picture 3 (B slice): qp = base qp + 8;

• picture 5 (B slice): qp = base qp + 8;

• picture 7 (B slice): qp = base qp + 8;

Table A.7 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -3 -3 4 8.

Results

As these tests show, all the configurations with qp toggling (which assign different qp values to the pictures depending on their temporal layer) perform much better than the configuration with no qp toggle (same qp value for all the pictures in the gop). Out of all the configurations tested, the best performing in terms of compression efficiency is the one with the most aggressive qp variations within the gop: the one named -4 -3 2 4 8. In this configuration there is a difference of 11 qp values between the key picture and the pictures at the highest temporal layer.

This result is surprising, since both HM 12 and x264 apply a much gentler qp toggling, increasing the qp value by one for each higher temporal layer. It suggests that the rate distortion model of C65 B should probably be revised. In particular, analyzing the result tables we notice a rather big gain in compression efficiency at low bitrates (up to 30% bitrate saving with respect to the no qp toggling configuration), while the gain is more limited at high bitrates (up to only 15%).
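To put a difference of 11 qp values in perspective, recall that in H.264 and HEVC the quantization step size approximately doubles for every increase of 6 in the qp. The sketch below (the helper name is hypothetical) computes the step size ratio implied by a qp difference; it only illustrates the scale of the toggling and says nothing about the rate distortion model of C65 B.

```python
def qstep_ratio(delta_qp):
    """Ratio between the quantization step sizes of two pictures whose
    qp values differ by delta_qp (the step doubles every 6 qp values)."""
    return 2.0 ** (delta_qp / 6.0)


if __name__ == "__main__":
    for delta in (1, 6, 9, 11):
        print(f"delta qp = {delta:2d} -> step size ratio {qstep_ratio(delta):.2f}")
```

A difference of 11 qp values therefore corresponds to a quantization step roughly 3.5 times coarser at the highest temporal layer than in the key pictures, while a difference of 1 qp value corresponds to a step only about 12% coarser.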

6.2.2 hgopsize 4 - various qp values

The tests presented in this section refer to the gop structure presented in Fig. 4.2(b). Table 6.3 presents a summary of the results: the configuration with no qp toggle (qp value equal to the base qp for all the pictures) is compared to the various qp configurations under test (see the following sections for details).

Configuration    BD Rate %
-4 -1 0 1        -10.74
-4 -2 0 2        -15.56
-4 -3 0 3        -17.47
-4 -3 2 6        -19.69

Table 6.3: Summary of savings (in BD Rate percentage) when using qp toggling compared to using the same qp value for all the pictures. Gop structure of size 4.

No qp toggle vs configuration -4 -1 0 1

The configuration named -4 -1 0 1 assigns the following values to the pictures presented in Fig. 4.2(b):

• picture 0 (I slice): qp = base qp − 4;
• picture 4 (P slice): qp = base qp − 1;
• picture 2 (B slice): qp = base qp;
• picture 1 (B slice): qp = base qp + 1;
• picture 3 (B slice): qp = base qp + 1;

Table A.8 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -1 0 1.

No qp toggle vs configuration -4 -2 0 2

The configuration named -4 -2 0 2 assigns the following values to the pictures presented in Fig. 4.2(b):

• picture 0 (I slice): qp = base qp − 4;
• picture 4 (P slice): qp = base qp − 2;
• picture 2 (B slice): qp = base qp;
• picture 1 (B slice): qp = base qp + 2;
• picture 3 (B slice): qp = base qp + 2;

Table A.9 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -2 0 2.

No qp toggle vs configuration -4 -3 0 3

The configuration named -4 -3 0 3 assigns the following values to the pictures presented in Fig. 4.2(b):

• picture 0 (I slice): qp = base qp − 4;
• picture 4 (P slice): qp = base qp − 3;
• picture 2 (B slice): qp = base qp;
• picture 1 (B slice): qp = base qp + 3;
• picture 3 (B slice): qp = base qp + 3;

Table A.10 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -3 0 3.

No qp toggle vs configuration -4 -3 2 6

The configuration named -4 -3 2 6 assigns the following values to the pictures presented in Fig. 4.2(b):

• picture 0 (I slice): qp = base qp − 4;
• picture 4 (P slice): qp = base qp − 3;
• picture 2 (B slice): qp = base qp + 2;
• picture 1 (B slice): qp = base qp + 6;
• picture 3 (B slice): qp = base qp + 6;

Table A.11 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -3 2 6.

Results

Among the selected configurations, the one with the best compression efficiency is the one named -4 -3 2 6. As with the gop structure of size 8, the best performing configuration is the one with the most aggressive qp toggling, with a variation of 9 qp values between the key pictures and the highest temporal layer. Also in this case there are quite big differences in coding efficiency between low and high bitrates: this suggests that the rate distortion model should be investigated, especially for low base qp values.

6.2.3 hgopsize 2 - various qp values

The tests presented in this section refer to the gop structure presented in Fig. 4.2(a). Table 6.4 presents a summary of the results: the configuration with no qp toggle (qp value equal to the base qp for all the pictures) is compared to the various qp configurations under test (see the following sections for details).

Configuration    BD Rate %
-4 -2 2          -11.62
-4 -3 3          -11.98
-4 -3 6          -11.40

Table 6.4: Summary of savings (in BD Rate percentage) when using qp toggling compared to using the same qp value for all the pictures. Gop structure of size 2.

No qp toggle vs configuration -4 -2 2

The configuration named -4 -2 2 assigns the following values to the pictures presented in Fig. 4.2(a):

• picture 0 (I slice): qp = base qp − 4;
• picture 2 (P slice): qp = base qp − 2;
• picture 1 (B slice): qp = base qp + 2;

Table A.12 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -2 2.

No qp toggle vs configuration -4 -3 3

The configuration named -4 -3 3 assigns the following values to the pictures presented in Fig. 4.2(a):

• picture 0 (I slice): qp = base qp − 4;
• picture 2 (P slice): qp = base qp − 3;
• picture 1 (B slice): qp = base qp + 3;

Table A.13 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -3 3.

No qp toggle vs configuration -4 -3 6

The configuration named -4 -3 6 assigns the following values to the pictures presented in Fig. 4.2(a):

• picture 0 (I slice): qp = base qp − 4;
• picture 2 (P slice): qp = base qp − 3;
• picture 1 (B slice): qp = base qp + 6;

Table A.14 shows the results of the comparison between the version of C65 with no qp toggle (qp value equal to the base qp for all the pictures) and the configuration -4 -3 6.

Results

Among the selected configurations, the one with the best compression efficiency is the one named -4 -3 3. In this case, again, the best performing configuration uses an aggressive qp toggling (but not the most aggressive one): it presents a variation of 6 qp values between the key pictures and the highest temporal layer (just one layer above). Again we observe that coding efficiency is better at low bitrates.

6.2.4 hgopsize 2 vs hgopsize 4 vs hgopsize 8

In this section, the compression efficiency of the encoder C65 B is compared across the gop structures indicated in Fig. 4.2(a), 4.2(b) and 4.2(c). In particular, the best performing configurations are taken into consideration, namely -4 -3 3, -4 -3 2 6 and -4 -3 -3 4 8 respectively.

Table A.15 shows the comparison between the configurations -4 -3 3 and -4 -3 2 6, corresponding respectively to a gop size of 2 (Fig. 4.2(a)), taken as anchor, and a gop size of 4 (Fig. 4.2(b)), taken as test. Table A.16 shows the comparison between the configurations -4 -3 2 6 and -4 -3 -3 4 8, corresponding respectively to a gop size of 4 (Fig. 4.2(b)), taken as anchor, and a gop size of 8 (Fig. 4.2(c)), taken as test. A summary of these tests is shown in table 6.5, where the savings in terms of BD Rate percentage are reported.

Configuration (anchor vs test)    BD Rate %
-4 -3 3 vs -4 -3 2 6              -3.25
-4 -3 2 6 vs -4 -3 -3 4 8         +0.40

Table 6.5: Comparison of the best performing qp configurations for each gop structure. The first row compares the best configuration of the gop structure indicated in Fig. 4.2(a) with the best configuration of the gop structure indicated in Fig. 4.2(b). The second row compares the best configuration of the gop structure indicated in Fig. 4.2(b) with the best configuration of the gop structure indicated in Fig. 4.2(c).

As we can see, the gop structure of size 4 represented in Fig. 4.2(b) is the best performing of all the gop structures implemented in C65 B. However, this statement is not true for each of the sequences under test. Some sequences like “American Football” are more efficiently encoded with the gop structure of size 2. Others instead require a much larger gop size, like “Big Ships”. Choosing the appropriate gop structure for each test sequence would give big gains in terms of compression efficiency.

6.2.5 C65 against C65 B

This section presents the results in terms of BD-Rate for C65 B, when compared to the initial version of the C65 software. For a fair comparison, TMVP and the deblocking filter are disabled in C65, since they are not implemented in C65 B yet.

Table 6.6 uses as test the optimized version of C65 B, with the parameter hgopsize set to 4 and the best qp toggling setting presented in the previous section (-4 -3 2 6). The anchor is the initial version of C65, with only P slices and a flat gop structure, as in Fig. 4.1.

As we can see in the table, the overall average saving is about 15% in terms of BD-Rate. For this reason we can state that C65 B is a more efficient encoder than C65. Moreover, some sequences are encoded very efficiently (for instance “BigShips”), while others only present small gains (like “American Football”). The characteristics of the video sequences should be studied in order to understand the reason for this behavior of the C65 B encoder.

In addition, some sequences (like “ABC Sitcom”) are encoded efficiently at low bitrates, but show efficiency losses at high bitrates (low qp). Other sequences show the opposite behavior (see “Into Tree”). This aspect should be explored as well.

Sequence   Anchor      Anchor   Test        Test     BD rate Luma [%]
           avgbitrate  avgPSNR  avgbitrate  avgPSNR
           [kbps]      [dB]     [kbps]      [dB]     avg     low     high
Am foot    16576.78    33.58    14142.57    33.25    -0.97   -1.33   0.30
Sitcom     9588.20     34.59    7639.25     34.73    -10.04  -14.80  0.58
LeMatch    14788.39    34.45    12088.33    34.24    -8.49   -13.35  -2.07
BBDrive    7043.21     36.60    6157.89     36.63    -9.72   -10.01  -8.85
Cactus     6247.09     35.32    5712.21     35.87    -15.37  -14.49  -14.95
XmasTree   13097.30    33.25    10505.22    33.55    -21.75  -23.11  -18.97
CrowdRun   22643.06    32.24    18377.84    32.02    -13.14  -16.71  -8.99
Ducks      28788.33    32.00    21743.21    31.75    -18.95  -27.45  -10.02
InToTree   5831.93     34.54    5554.85     35.13    -12.12  -5.03   -17.52
ParkJoy    24987.45    31.76    19827.77    31.58    -16.74  -21.42  -11.72
BigShips   5656.37     35.56    4943.52     36.56    -33.89  -32.68  -32.69
City       8024.43     35.12    6443.43     35.55    -19.04  -14.52  -17.83
Crew       5504.95     37.75    4611.19     38.10    -18.51  -23.05  -12.65
Average                                              -15.29  -16.77  -11.95

Table 6.6: C65 vs C65 B. Overall, with the same objective video quality, the saving in terms of bitrate is higher than 15% with the new version of C65.

6.2.6 Subjective test

In this subjective test we analyze the configuration -4 -3 -3 4 8. This configuration presents very large variations of qp values, for example between the first picture of the gop (the key picture) and the second picture of the gop (one of the pictures at the highest temporal layer). This particular encoding of the sequence “ABC Sitcom” is performed with a base qp value equal to 27: this means that the key pictures are encoded with a qp value of 27 − 3 = 24 and the pictures at the highest temporal layer have a qp value of 27 + 8 = 35. The qp variation between the lowest and the highest temporal layer is therefore 11 qp values.

We can observe the effect of this large difference in Fig. 6.1. Fig. 6.1(a) is a detail of the picture with poc 352, a picture on the lowest temporal layer. Fig. 6.1(b) is the same detail taken from the picture with poc 353, a picture on the highest temporal layer. As we can notice, the large difference of qp between the two pictures results in a rather different quality: the fingers look smooth and detailed in the first picture, while the second shows evident artifacts.

The important thing to notice here is that these differences are not visible when the video is played at 50 frames per second: the alternation of pictures with different quality is too fast to be recognized by the human eye, and the overall quality looks very good.

(a) Picture 352 of ABC Sitcom. QP = 24

(b) Picture 353 of ABC Sitcom. QP = 35

Figure 6.1: Detail of pictures from “ABC Sitcom”

6.2.7 C65 B against x264

Table A.17 presents the BD-rate comparison between C65 B as anchor and x264 as test. C65 B uses the gop structure presented in Fig. 4.2(b), with the qp configuration named -4 -3 2 6 (see section 6.2.2), which is the one that provided the best results. x264 uses the “veryslow” preset.

As we can notice, x264 still performs better than C65 B. Comparing this result to the one shown in table 3.3, however, we can see that the compression efficiency has considerably increased: x264 used to perform about 37% better than C65 in terms of BD-Rate, while it performs only 25% better than C65 B.

6.2.8 C65 B against HM

Table A.18 presents the BD-rate comparison between C65 B as anchor and HM12 as test. C65 B uses the gop structure presented in Fig. 4.2(b), with the qp configuration named -4 -3 2 6 (see section 6.2.2), which is the one that provided the best results. HM12 uses the standard random access configuration.

As we can notice, HM still performs better than C65 B. Comparing this result to the one shown in table 3.1, however, we can see again that the compression efficiency has considerably increased: HM used to perform about 59% better than C65 in terms of BD-Rate, while it performs only 51% better than C65 B.

6.2.9 Encoding speed

Table 6.7 reports the encoding speed in frames per second for C65 B (using the gop structure of size 8) compared to the original version of C65. The quantization parameter used here is 27.

Sequence                        Encoding speed C65   Encoding speed C65 B
American Football 720p50 RSM    11.27                4.05
ABC Sitcom 720p50               13.15                4.88
LeMatch02                       10.98                4.01
BasketBallDrive 720p50          13.65                4.99
Cactus 720p50                   13.96                5.12
ChristmasTree 720p50            10.59                4.05
CrowdRun 720p50                 10.01                3.96
DucksTakeOff 720p50             9.26                 3.15
InToTree 720p50                 15.61                5.46
ParkJoy 720p50                  9.99                 3.16
BigShips720p60                  12.04                4.95
City720p60                      13.54                4.57
Crew720p60                      13.88                4.83
Normalized values               2.76                 1

Table 6.7: C65 and C65 B encoding speeds in frames per second. C65 encoded on average 2.76 times faster than the new software C65 B.
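The normalized value in the last row of Table 6.7 is consistent with the ratio between the average speeds of the two encoders. The thesis does not state how the value was computed, so the sketch below is an assumption that happens to reproduce the reported figure; the variable and function names are hypothetical.

```python
# Encoding speeds in frames per second, copied from Table 6.7: (C65, C65 B).
SPEEDS_FPS = {
    "American Football": (11.27, 4.05),
    "ABC Sitcom": (13.15, 4.88),
    "LeMatch02": (10.98, 4.01),
    "BasketBallDrive": (13.65, 4.99),
    "Cactus": (13.96, 5.12),
    "ChristmasTree": (10.59, 4.05),
    "CrowdRun": (10.01, 3.96),
    "DucksTakeOff": (9.26, 3.15),
    "InToTree": (15.61, 5.46),
    "ParkJoy": (9.99, 3.16),
    "BigShips": (12.04, 4.95),
    "City": (13.54, 4.57),
    "Crew": (13.88, 4.83),
}


def normalized_speed(speeds):
    """Ratio between the average speed of C65 and the average speed of C65 B."""
    c65_total = sum(c65 for c65, _ in speeds.values())
    c65b_total = sum(c65b for _, c65b in speeds.values())
    # Both averages divide by the same number of sequences, so it cancels out.
    return c65_total / c65b_total


if __name__ == "__main__":
    print(f"C65 is {normalized_speed(SPEEDS_FPS):.2f} times faster than C65 B")
```

Averaging the per-sequence ratios instead would give a slightly different figure, since slow sequences would weigh as much as fast ones.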

As we can see, the encoder has become considerably slower after the introduction of B slices. This is mainly due to the additional operations that B slices require:

• Averaging of the two frames used as prediction.
• Motion vector search must be performed for two motion vectors.
• Filtering operations must be done for the two reference frames.

In addition, during the thesis work we implemented the SIMD optimization for the motion compensation operations, but not for the motion estimation ones.

6.2.10 Final considerations

The results presented in this chapter give a measure of the improvement of C65 B with respect to its predecessor C65, but they also show some unexpected behavior. In particular, we can notice that:

• The best qp toggling strategy presents a very large difference in qp values between the lowest and the highest temporal layer (up to 11 qp values in the dyadic gop structure of size 8). This result is in contrast with what happens, for example, in encoders like x264 and HM12, where the qp variations between successive layers are rarely larger than 1 qp point¹.
• While as a general rule a larger gop structure gives better compression efficiency, in our tests we verified that for several sequences the best gop structure size is 2.

These two anomalies suggest that the Rate-Distortion model should be revised and corrected, so as to determine a better λ value (see equation 5.1) as a function of the qp. Also, verifications and tests must be performed in order to determine whether the R_MOTION component of equation 5.1 correctly takes into account all the components that contribute to the cost, in terms of bits required, of the selected encoding mode.

¹ These considerations come from tests performed and explained in the “Preliminary tests” chapter.

6.3 Future works

The following sections present ideas to improve the encoder that should probably be taken into account in the near future, in order to further increase the compression efficiency of C65 B.

6.3.1 Gop selection

In the work presented here, the gop structure is chosen by the user through a command line parameter. One possible improvement would be to let the encoder evaluate the optimal gop structure and size for the sequence being encoded. We observed this feature, for example, in the encoder x264. The output of the decoding (with JM18.5) of two sequences, DucksTakeOff and IntoTree, both encoded with x264 using the preset “veryslow”, is partially reported below. The decoder software was modified so that it prints the poc, the picture number in encoding order and the content of list 0 and list 1.

    DucksTakeOff
    ---------------------------------------------
    Frame        POC  Pic#  LIST0  LIST1
    ---------------------------------------------
    00000(IDR)   0    0
    00004(P)     4    1     0
    00002(B)     2    2     04     40
    00001(b)     1    3     024    240
    00003(b)     3    4     204    420

    IntoTree
    ---------------------------------------------
    Frame        POC  Pic#  LIST0  LIST1
    ---------------------------------------------
    00000(IDR)   0    0
    00006(P)     6    1     0
    00003(B)     3    2     03     30
    00001(b)     1    3     036    360
    00002(b)     2    4     036    360
    00004(b)     4    5     306    630
    00005(b)     5    6     306    630

As an improvement, the encoder could analyze the first frames of a sequence and choose the appropriate gop structure and size, to maximize compression efficiency.

As a second step, we also have to take into account that the content of a sequence can change considerably over time. For example, fixed GOP structures cannot deal with scene cuts, where uncorrelated pictures prevent good motion prediction. A possible solution is to adapt the GOP structure when the content to be encoded changes, and to select Intra picture coding when a scene change has been detected, so that the temporal correlation is preserved within each GOP. Tests on H.264 show that the bitrate saving with certain implementations can be up to 20% [41] for some sequences.

As a last consideration, we must take into account the frequent case in which the number of frames to encode is not a multiple of the gop size. In this case the last group of pictures can be encoded in a different way, for example by adopting a smaller gop size for the remaining (number of frames mod gop size) frames. For example, a sequence of 300 frames encoded with a gop size of 8 could switch to a gop size of 4 for the last frames, as represented in figure 6.2.

Figure 6.2: Hierarchical gop structure

6.3.2 Combined prediction signal for motion vector search

As explained in section 5.2.6, during motion estimation a local motion vector search algorithm is applied in order to find a local optimum for the best matching area in the reference picture. In case of bi-prediction, this search algorithm is applied to each motion vector separately: first the motion vector pointing to list 0 is optimized, then the one referring to list 1 is considered. The coding efficiency can be increased when the combined prediction signal is considered during the motion estimation process, so the weighted sum of the list 0 and list 1 predictions should be used to calculate the distance measure (SAD or SSD) that evaluates the match between the picture areas. For this reason, a future version of C65 B should perform an averaging operation between the two reference picture signals to evaluate how good a candidate motion vector is.
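The difference between evaluating a list 1 candidate on its own and evaluating it through the combined prediction signal can be illustrated with a small example. The sketch uses plain lists of sample values, a simple rounded average in place of the higher-precision weighted prediction of HEVC, and hypothetical function names; it is not the search algorithm of C65 B.

```python
def bi_average(pred0, pred1):
    """Rounded average of two prediction blocks (lists of rows of samples)."""
    return [
        [(a + b + 1) >> 1 for a, b in zip(row0, row1)]
        for row0, row1 in zip(pred0, pred1)
    ]


def sad(block_a, block_b):
    """Sum of absolute differences between two blocks of equal size."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_separate(orig, candidates1):
    """Index of the list 1 candidate that best matches the original alone."""
    return min(range(len(candidates1)), key=lambda i: sad(orig, candidates1[i]))


def best_combined(orig, pred0, candidates1):
    """Index of the list 1 candidate whose average with the list 0
    prediction best matches the original."""
    return min(
        range(len(candidates1)),
        key=lambda i: sad(orig, bi_average(pred0, candidates1[i])),
    )


if __name__ == "__main__":
    orig = [[10, 20], [30, 40]]
    pred0 = [[8, 18], [28, 38]]               # list 0 prediction, slightly too dark
    candidates1 = [
        [[10, 20], [30, 40]],                  # perfect match on its own
        [[12, 22], [32, 42]],                  # too bright, but compensates pred0
    ]
    print("separate search picks", best_separate(orig, candidates1))
    print("combined search picks", best_combined(orig, pred0, candidates1))
```

In this constructed example the separate search prefers the first candidate, while the combined search prefers the second one, because its error cancels the error of the list 0 prediction once the two signals are averaged.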

6.3.3 32 x 32 and 64 x 64 mode

For computational complexity reasons, the CUs in C65 B have dimensions of 16x16 and 8x8 only. Since B slices lead to a better prediction, it is reasonable to believe that B pictures could employ CUs of larger size, like 32x32 or 64x64, without significant quality loss while saving several bits. A future version of C65 B should add these modes.

6.3.4 Deblocking filter and TMVP

The deblocking filter and TMVP are disabled in C65 B (as explained in section 4.2.6). These functionalities could be implemented with rather little effort in a future version of C65 B.

6.3.5 I frames to improve with more directions

In C65 B, the intra frame encoding process is implemented as a simplified version of the one used in the HM 12.0 reference encoder. In particular, only a few of the 34 directions tested in the reference software are actually evaluated by C65 B. Table 6.8 shows a comparison between C65 B and HM 12.0, intra coding the first picture of the video sequence “DucksTakeOff”. A qp value of 26 is chosen for HM and 21 for C65 B. As we can see, the C65 B encoding has a lower PSNR value and a much higher bitrate: for Intra coding, C65 B has very poor compression efficiency when compared to HM12. In the future, the Intra prediction of C65 B should be improved: a good quality first Intra frame will considerably improve the prediction for the first P frame, so the key pictures (which are the basis for the prediction of all the other pictures) will contribute positively to the overall encoding efficiency.

           PSNR    Bitrate
C65        38.26   64841
HM 12.0    38.49   47102

Table 6.8: C65 B vs HM 12.0 for one Intra picture

Chapter 7

Conclusions

This master's thesis has presented the work carried out at the Ericsson Visual Technology unit to improve the compression efficiency of the encoder C65. This encoder produces output compliant with the HEVC standard and was well optimized for real-time video conferencing applications. Throughout the work, an effort was made to make this encoder more suitable for the compression of general video content that, unlike video conference sequences, does not necessarily have little motion between consecutive pictures. This goal was reached through the implementation of B slices in the original encoder, together with some rate distortion optimizations.

The work started with a series of tests, carried out in order to evaluate the starting point and to explore the weak points of C65. Moreover, studies and tests on existing encoders (also for other standards) helped to get a clear view of the situation.

Next, the implementation of B slices consumed a great part of the time spent on the entire thesis, since a radical structural change was needed. In addition, some other improvements to the software were needed in order to enhance the rate distortion decisions and make them coherent with the new structure of the encoder as a whole.

Finally, a series of tests was conducted in order to check the compression efficiency of the encoder, exercising as many configurations as possible. A “best” configuration was found (at least for the set of video sequences under test). Results show that the compression efficiency has improved by 15% in terms of BD-Rate. In contrast, the encoding speed of the original software was about 2.76 times higher than that of the new version.

In conclusion, recommendations for future work and enhancements are given: video compression is a field with never ending possibilities for improvement. Moreover, the modified structure and target of the encoder under development open up a series of new requirements that will give work for months or more to the researchers and developers of the Ericsson Visual Technology unit.

Appendix A

Listings and result data

A.1 Listings

Listing A.1: Configuration file example

    # RPS in SPS description
    NumShortTermRefPicSets : 4
    % RPS 0
    ShortTermCurr : -4
    % RPS 1
    ShortTermCurr : 2,-2
    % RPS 2
    ShortTermCurr : 1,-1
    ShortTermFoll : 3
    % RPS 3
    ShortTermCurr : 1,-1

    # Each picture to be coded is explicitly described
    ExplicitPictureDescription : 1
    % Picture 0
    Frame : 0
    POC : 0
    NalUnitType : 20

    % Picture 1
    Frame : 4
    POC : 4
    TemporalID : 0
    NalUnitType : 1
    ShortTermRefPicSetIdx : 0

    % Picture 2
    Frame : 2
    POC : 2
    TemporalID : 1
    NalUnitType : 1
    ShortTermRefPicSetIdx : 1

    % Picture 3
    Frame : 1
    POC : 1
    TemporalID : 2
    NalUnitType : 1
    ShortTermRefPicSetIdx : 2

    % Picture 4
    Frame : 3
    POC : 3
    TemporalID : 2
    NalUnitType : 1
    ShortTermRefPicSetIdx : 3

A.2 Result data

Sequence   Anchor      Anchor   Test        Test     BD rate Luma [%]
           avgbitrate  avgPSNR  avgbitrate  avgPSNR
           [kbps]      [dB]     [kbps]      [dB]     avg     low     high
Am foot    17729.20    33.55    15226.83    33.17    -7.34   -11.20  -2.88
Sitcom     9669.67     34.50    7627.64     34.39    -15.43  -21.06  -5.38
LeMatch    15000.78    34.38    12644.20    34.06    -8.48   -12.01  -3.87
BBDrive    7624.98     36.60    6528.78     36.41    -11.04  -12.84  -8.84
Cactus     7059.65     35.14    5835.94     35.25    -18.86  -21.43  -15.99
XmasTree   13567.67    33.09    11048.87    33.00    -18.10  -22.41  -13.07
CrowdRun   22676.92    32.26    19307.33    31.91    -9.61   -12.74  -6.52
Ducks      25866.09    32.31    22148.94    31.85    -5.70   -9.74   -1.96
InToTree   7059.88     34.26    5408.18     34.27    -23.20  -25.14  -20.21
ParkJoy    25573.29    31.73    21568.60    31.36    -11.10  -15.49  -6.91
BigShips   5677.56     35.57    4397.85     35.75    -26.51  -31.20  -20.58
City       8836.01     34.67    6856.21     34.72    -24.25  -27.75  -17.29
Crew       5655.10     37.78    4615.04     37.73    -15.67  -20.92  -8.16
Average                                              -15.02  -18.76  -10.13

Table A.1: No qp toggle vs configuration -4 -1 0 1 2

Sequence   Anchor      Anchor   Test        Test     BD rate Luma [%]
           avgbitrate  avgPSNR  avgbitrate  avgPSNR
           [kbps]      [dB]     [kbps]      [dB]     avg     low     high
Am foot    17729.20    33.55    13651.39    32.89    -10.83  -16.40  -3.44
Sitcom     9669.67     34.50    6627.28     34.34    -21.72  -29.24  -6.19
LeMatch    15000.78    34.38    11310.50    33.85    -12.20  -17.08  -5.09
BBDrive    7624.98     36.60    5949.49     36.27    -16.14  -18.23  -13.05
Cactus     7059.65     35.14    5334.82     35.31    -26.01  -28.18  -23.13
XmasTree   13567.67    33.09    9834.51     33.00    -26.57  -31.68  -19.99
CrowdRun   22676.92    32.26    17327.23    31.67    -14.19  -18.36  -9.69
Ducks      25866.09    32.31    19814.83    31.51    -7.58   -13.35  -1.75
InToTree   7059.88     34.26    4906.99     34.43    -31.86  -33.82  -29.01
ParkJoy    25573.29    31.73    19225.27    31.13    -16.72  -21.93  -11.13
BigShips   5677.56     35.57    4005.24     35.94    -35.12  -38.77  -29.46
City       8836.01     34.67    6015.26     34.81    -32.46  -35.16  -25.46
Crew       5655.10     37.78    4117.38     37.70    -22.13  -27.81  -12.44
Average                                              -21.04  -25.38  -14.60

Table A.2: No qp toggle vs configuration -4 -2 0 2 4

Sequence Anchor Anchor Test Test BD rate avgbitrate avgPSNR avgbitrate avgPSNR Luma [%] [kbps] [dB] [kbps] [dB] avg low high Am foot 17729.20 33.55 12672.75 32.57 -10.91 -16.55 -3.42 Sitcom 9669.67 34.50 5904.97 34.17 -23.24 -30.00 -6.85 LeMatch 15000.78 34.38 10469.28 33.56 -11.89 -17.02 -4.56 BBDrive 7624.98 36.60 5592.44 35.99 -16.40 -18.45 -13.29 Cactus 7059.65 35.14 4957.48 35.09 -27.44 -29.71 -24.29 XmasTree 13567.67 33.09 9087.55 32.73 -28.28 -33.84 -20.65 CrowdRun 22676.92 32.26 16141.76 31.32 -13.99 -18.50 -9.31 Ducks 25866.09 32.31 18391.98 31.14 -6.12 -11.59 -0.92 InToTree 7059.88 34.26 4455.23 34.25 -33.52 -35.22 -30.54 ParkJoy 25573.29 31.73 17906.39 30.77 -16.67 -22.53 -10.55 BigShips 5677.56 35.57 3631.79 35.78 -37.22 -40.70 -31.29 City 8836.01 34.67 5440.63 34.63 -34.52 -36.53 -27.70 Crew 5655.10 37.78 3759.35 37.51 -23.36 -28.64 -12.98 Average -21.81 -26.10 -15.10

Table A.3: No qp toggle vs configuration -4 -2 2 3 4

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      17729.20     33.55     11705.80     32.32   -11.00   -18.56    -0.06
Sitcom        9669.67     34.50      5842.27     34.19   -24.53   -35.38     0.60
LeMatch      15000.78     34.38      9855.44     33.41   -12.77   -20.28    -1.10
BBDrive       7624.98     36.60      5262.21     35.81   -17.95   -21.56   -12.57
Cactus        7059.65     35.14      4884.04     35.24   -30.78   -34.43   -26.18
XmasTree     13567.67     33.09      8714.10     32.82   -31.26   -37.82   -22.47
CrowdRun     22676.92     32.26     15043.68     31.16   -15.55   -21.42    -8.89
Ducks        25866.09     32.31     16982.31     30.94    -6.82   -15.33     2.74
InToTree      7059.88     34.26      4763.93     34.58   -36.03   -38.77   -32.36
ParkJoy      25573.29     31.73     16654.94     30.67   -19.37   -25.57   -11.74
BigShips      5677.56     35.57      3848.87     36.13   -40.61   -46.55   -33.44
City          8836.01     34.67      5322.28     34.89   -39.24   -41.90   -31.07
Crew          5655.10     37.78      3672.71     37.58   -25.94   -32.58   -13.97
Average                                                  -23.99   -30.01   -14.66

Table A.4: No qp toggle vs configuration -4 -3 0 4 8

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      17729.20     33.55     11272.03     32.18   -12.38   -18.33    -3.55
Sitcom        9669.67     34.50      5172.17     33.99   -25.58   -32.56    -6.57
LeMatch      15000.78     34.38      9326.46     33.25   -13.74   -19.02    -5.17
BBDrive       7624.98     36.60      5071.39     35.67   -18.31   -20.42   -14.93
Cactus        7059.65     35.14      4537.07     34.90   -30.22   -32.51   -26.90
XmasTree     13567.67     33.09      8167.23     32.48   -31.42   -36.76   -23.45
CrowdRun     22676.92     32.26     14412.68     30.94   -15.92   -20.59   -10.66
Ducks        25866.09     32.31     16267.77     30.74    -7.27   -13.58    -0.58
InToTree      7059.88     34.26      4124.23     34.17   -35.69   -37.36   -33.05
ParkJoy      25573.29     31.73     15927.88     30.39   -19.17   -24.96   -12.42
BigShips      5677.56     35.57      3341.14     35.72   -40.02   -43.82   -34.40
City          8836.01     34.67      4846.67     34.50   -37.22   -38.79   -31.24
Crew          5655.10     37.78      3362.24     37.36   -26.27   -31.31   -15.10
Average                                                  -24.09   -28.46   -16.77

Table A.5: No qp toggle vs configuration -4 -2 2 4 6

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      17729.20     33.55     11088.28     32.12   -11.49   -18.98    -0.84
Sitcom        9669.67     34.50      5390.18     34.08   -26.17   -36.42    -0.46
LeMatch      15000.78     34.38      9300.07     33.23   -13.27   -20.95    -1.64
BBDrive       7624.98     36.60      5030.46     35.64   -18.62   -22.13   -13.41
Cactus        7059.65     35.14      4625.08     35.07   -31.96   -35.74   -26.95
XmasTree     13567.67     33.09      8221.14     32.64   -32.73   -39.40   -23.25
CrowdRun     22676.92     32.26     14252.90     30.93   -16.08   -21.98    -9.40
Ducks        25866.09     32.31     16045.99     30.70    -6.55   -14.75     2.30
InToTree      7059.88     34.26      4447.31     34.45   -37.07   -39.60   -33.39
ParkJoy      25573.29     31.73     15792.88     30.43   -19.68   -26.35   -11.76
BigShips      5677.56     35.57      3587.76     36.01   -41.91   -48.13   -34.14
City          8836.01     34.67      4959.22     34.75   -40.58   -42.89   -32.89
Crew          5655.10     37.78      3442.55     37.45   -26.90   -33.26   -14.41
Average                                                  -24.85   -30.81   -15.40

Table A.6: No qp toggle vs configuration -4 -3 2 4 8

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      17729.20     33.55     13055.85     32.65    -7.43   -15.39     3.49
Sitcom        9669.67     34.50      7001.18     34.41   -18.12   -31.12     6.96
LeMatch      15000.78     34.38     11108.19     33.72    -9.70   -18.10     2.74
BBDrive       7624.98     36.60      5832.44     36.08   -14.23   -18.31    -8.45
Cactus        7059.65     35.14      5568.60     35.53   -26.22   -30.42   -21.20
XmasTree     13567.67     33.09      9990.42     33.16   -26.06   -32.71   -18.16
CrowdRun     22676.92     32.26     16753.94     31.53   -12.44   -18.49    -5.90
Ducks        25866.09     32.31     18909.19     31.31    -4.74   -14.14     5.67
InToTree      7059.88     34.26      5762.57     34.89   -30.74   -33.14   -27.67
ParkJoy      25573.29     31.73     18590.44     31.07   -16.12   -22.22    -9.07
BigShips      5677.56     35.57      4655.95     36.40   -35.09   -40.99   -28.53
City          8836.01     34.67      6323.36     35.16   -33.24   -36.55   -24.29
Crew          5655.10     37.78      4282.77     37.81   -21.44   -29.39    -9.73
Average                                                  -19.66   -26.23   -10.32

Table A.7: No qp toggle vs configuration -4 -3 -3 4 8

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      17500.30     33.63     16842.65     33.68    -5.31    -8.41    -2.06
Sitcom        9576.35     34.58      9054.08     34.71   -10.31   -15.63    -2.39
LeMatch      14871.57     34.46     14215.65     34.52    -6.10    -9.04    -2.79
BBDrive       7446.35     36.63      7097.63     36.77    -7.82    -8.84    -6.22
Cactus        6778.52     35.19      6347.64     35.56   -13.84   -15.88   -11.33
XmasTree     13197.32     33.15     12248.87     33.40   -12.85   -16.42    -8.94
CrowdRun     22604.37     32.27     21570.93     32.36    -6.83    -9.31    -4.60
Ducks        26702.91     32.19     25728.52     32.20    -4.30    -7.47    -1.61
InToTree      6611.42     34.35      5984.90     34.60   -17.73   -18.95   -15.06
ParkJoy      24904.80     31.76     23589.41     31.87    -8.10   -11.82    -4.96
BigShips      5694.49     35.61      5252.87     35.95   -18.68   -21.71   -14.05
City          8310.97     34.85      7649.78     35.15   -17.64   -20.83   -10.84
Crew          5598.58     37.77      5327.31     37.92   -10.16   -14.03    -5.13
Average                                                  -10.74   -13.72    -6.92

Table A.8: No qp toggle vs configuration -4 -1 0 1

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      17500.30     33.63     16661.38     33.76    -7.91   -13.24    -2.65
Sitcom        9576.35     34.58      9053.06     34.85   -14.08   -23.21    -1.41
LeMatch      14871.57     34.46     14068.76     34.61    -8.92   -13.38    -4.09
BBDrive       7446.35     36.63      7003.35     36.93   -12.04   -14.11    -9.82
Cactus        6778.52     35.19      6301.82     35.84   -19.61   -21.87   -17.18
XmasTree     13197.32     33.15     12007.58     33.67   -19.30   -23.88   -14.24
CrowdRun     22604.37     32.27     21266.14     32.50   -10.45   -14.01    -7.25
Ducks        26702.91     32.19     25475.01     32.25    -6.10   -10.92    -2.15
InToTree      6611.42     34.35      5936.98     34.92   -26.22   -28.45   -23.55
ParkJoy      24904.80     31.76     23165.09     32.03   -12.49   -17.42    -8.26
BigShips      5694.49     35.61      5281.66     36.28   -25.64   -29.58   -21.21
City          8310.97     34.85      7528.24     35.42   -24.52   -29.30   -16.16
Crew          5598.58     37.77      5307.64     38.08   -15.03   -20.85    -7.76
Average                                                  -15.56   -20.02   -10.44

Table A.9: No qp toggle vs configuration -4 -2 0 2

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      17500.30     33.63     16814.69     33.87    -8.40   -15.29    -1.90
Sitcom        9576.35     34.58      9418.88     35.00   -14.33   -26.43     2.29
LeMatch      14871.57     34.46     14296.93     34.74    -9.73   -16.12    -3.42
BBDrive       7446.35     36.63      7101.45     37.10   -13.97   -16.74   -11.37
Cactus        6778.52     35.19      6519.06     36.15   -22.63   -25.20   -19.82
XmasTree     13197.32     33.15     12221.49     33.96   -22.45   -27.95   -16.52
CrowdRun     22604.37     32.27     21458.86     32.66   -11.91   -16.67    -7.86
Ducks        26702.91     32.19     25752.31     32.36    -6.27   -12.69    -1.26
InToTree      6611.42     34.35      6328.87     35.25   -29.41   -31.48   -26.85
ParkJoy      24904.80     31.76     23292.64     32.22   -14.48   -20.53    -9.46
BigShips      5694.49     35.61      5643.46     36.61   -28.74   -32.45   -25.06
City          8310.97     34.85      7758.08     35.71   -27.93   -32.93   -18.08
Crew          5598.58     37.77      5501.64     38.26   -16.89   -23.77    -9.49
Average                                                  -17.47   -22.94   -11.45

Table A.10: No qp toggle vs configuration -4 -3 0 3

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      17500.30     33.63     14198.49     33.20    -9.03   -16.77     0.34
Sitcom        9576.35     34.58      7709.48     34.68   -16.87   -30.46     6.55
LeMatch      14871.57     34.46     12107.65     34.20   -10.68   -18.05    -0.94
BBDrive       7446.35     36.63      6179.38     36.56   -15.27   -18.66   -11.02
Cactus        6778.52     35.19      5740.16     35.83   -25.71   -29.68   -21.37
XmasTree     13197.32     33.15     10528.77     33.52   -25.61   -31.92   -18.54
CrowdRun     22604.37     32.27     18405.74     32.00   -12.91   -18.15    -7.67
Ducks        26702.91     32.19     21864.10     31.68    -6.09   -13.88     1.59
InToTree      6611.42     34.35      5610.77     35.10   -32.24   -35.11   -28.97
ParkJoy      24904.80     31.76     19846.76     31.57   -16.68   -22.86   -10.50
BigShips      5694.49     35.61      5009.81     36.48   -32.62   -38.26   -26.97
City          8310.97     34.85      6500.39     35.46   -32.06   -37.61   -20.70
Crew          5598.58     37.77      4666.36     37.99   -20.17   -28.17    -9.12
Average                                                  -19.69   -26.12   -11.33

Table A.11: No qp toggle vs configuration -4 -3 2 6

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      16761.48     33.63     17330.73     34.06    -6.43   -11.72    -1.47
Sitcom        9498.48     34.61     10383.00     35.06    -8.68   -18.83     4.15
LeMatch      14758.51     34.46     15416.98     34.88    -6.43   -10.51    -2.30
BBDrive       7200.87     36.71      7378.40     37.25    -9.21   -11.41    -7.31
Cactus        6481.65     35.29      6723.91     36.14   -14.89   -17.41   -12.68
XmasTree     13011.42     33.25     13230.01     34.02   -14.35   -17.74   -10.68
CrowdRun     22387.42     32.30     22934.58     32.87    -8.30   -11.22    -5.83
Ducks        28016.07     32.12     29051.90     32.55    -5.09    -9.29    -1.68
InToTree      6175.92     34.49      6484.88     35.24   -20.02   -21.67   -18.42
ParkJoy      24756.79     31.79     25113.55     32.41    -9.89   -13.55    -6.83
BigShips      5780.83     35.63      6188.99     36.45   -19.05   -22.23   -15.63
City          8079.43     35.01      8459.92     35.74   -17.88   -21.82   -10.59
Crew          5580.74     37.77      5997.30     38.24   -10.80   -16.09    -4.36
Average                                                  -11.62   -15.65    -7.20

Table A.12: No qp toggle vs configuration -4 -2 2

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      16761.48     33.63     18391.43     34.31    -5.76   -12.38     0.22
Sitcom        9498.48     34.61     11685.77     35.31    -6.82   -20.32    10.31
LeMatch      14758.51     34.46     16567.67     35.14    -6.13   -12.76    -0.44
BBDrive       7200.87     36.71      7857.56     37.53    -9.85   -13.01    -7.22
Cactus        6481.65     35.29      7380.87     36.54   -16.19   -19.25   -13.35
XmasTree     13011.42     33.25     14292.33     34.44   -15.50   -19.85   -10.98
CrowdRun     22387.42     32.30     24279.78     33.21    -8.70   -12.76    -5.31
Ducks        28016.07     32.12     30774.08     32.85    -4.80   -10.92    -0.23
InToTree      6175.92     34.49      7473.69     35.66   -21.15   -23.78   -19.05
ParkJoy      24756.79     31.79     26507.76     32.79   -10.65   -15.27    -6.87
BigShips      5780.83     35.63      7116.38     36.87   -20.13   -24.07   -17.24
City          8079.43     35.01      9400.90     36.10   -18.87   -23.66   -10.36
Crew          5580.74     37.77      6657.34     38.52   -11.26   -18.36    -5.05
Average                                                  -11.98   -17.42    -6.58

Table A.13: No qp toggle vs configuration -4 -3 3

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      16761.48     33.63     16900.37     33.91    -3.46   -11.81     4.16
Sitcom        9498.48     34.61     10823.87     35.14    -4.70   -21.68    16.59
LeMatch      14758.51     34.46     15434.66     34.84    -3.88   -12.29     4.00
BBDrive       7200.87     36.71      7361.26     37.20    -8.47   -12.88    -4.18
Cactus        6481.65     35.29      7025.28     36.37   -16.74   -20.98   -12.47
XmasTree     13011.42     33.25     13488.86     34.20   -15.15   -20.66    -9.59
CrowdRun     22387.42     32.30     22711.73     32.86    -7.35   -12.60    -3.17
Ducks        28016.07     32.12     28693.87     32.52    -3.63   -11.45     2.63
InToTree      6175.92     34.49      7247.00     35.61   -21.76   -25.30   -19.01
ParkJoy      24756.79     31.79     24730.43     32.45   -10.05   -15.10    -5.92
BigShips      5780.83     35.63      6899.52     36.83   -21.76   -27.15   -17.59
City          8079.43     35.01      8801.71     35.99   -19.74   -26.10    -9.07
Crew          5580.74     37.77      6242.82     38.39   -11.51   -20.25    -2.39
Average                                                  -11.40   -18.33    -4.31

Table A.14: No qp toggle vs configuration -4 -3 6

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      18391.43     34.31     14198.49     33.20     6.27     7.74     4.07
Sitcom       11685.77     35.31      7709.48     34.68    -0.06    -2.16    -1.26
LeMatch      16567.67     35.14     12107.65     34.20    -0.91    -2.29     1.20
BBDrive       7857.56     37.53      6179.38     36.56     0.97     1.54     1.37
Cactus        7380.87     36.54      5740.16     35.83    -3.78    -3.84    -2.20
XmasTree     14292.33     34.44     10528.77     33.52    -6.90    -8.12    -4.38
CrowdRun     24279.78     33.21     18405.74     32.00    -1.18    -2.52     0.33
Ducks        30774.08     32.85     21864.10     31.68    -5.87    -9.74    -2.70
InToTree      7473.69     35.66      5610.77     35.10     0.32     3.91    -2.51
ParkJoy      26507.76     32.79     19846.76     31.57    -3.95    -5.78    -1.67
BigShips      7116.38     36.87      5009.81     36.48   -15.00   -15.30   -12.53
City          9400.90     36.10      6500.39     35.46    -5.50    -4.80    -4.09
Crew          6657.34     38.52      4666.36     37.99    -6.60    -8.65    -1.13
Average                                                   -3.25    -3.85    -1.96

Table A.15: Configuration -4 -3 3 vs configuration -4 -3 2 6

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg      low     high
Am foot      14198.49     33.20     11088.28     32.12     4.99     5.28     3.93
Sitcom        7709.48     34.68      5390.18     34.08     2.75     0.07     3.17
LeMatch      12107.65     34.20      9300.07     33.23     3.94     2.10     5.23
BBDrive       6179.38     36.56      5030.46     35.64     0.81     1.04     0.91
Cactus        5740.16     35.83      4625.08     35.07    -1.20    -1.54    -0.05
XmasTree     10528.77     33.52      8221.14     32.64    -2.01    -3.78    -0.11
CrowdRun     18405.74     32.00     14252.90     30.93    -1.04    -1.51    -0.62
Ducks        21864.10     31.68     16045.99     30.70    -5.27    -6.87    -4.10
InToTree      5610.77     35.10      4447.31     34.45     6.28     9.53     3.63
ParkJoy      19846.76     31.57     15792.88     30.43     3.06     2.07     3.40
BigShips      5009.81     36.48      3587.76     36.01   -10.72   -11.64    -7.75
City          6500.39     35.46      4959.22     34.75     6.39    13.95     0.56
Crew          4666.36     37.99      3442.55     37.45    -2.84    -4.25    -0.54
Average                                                    0.40     0.34     0.59

Table A.16: Configuration -4 -3 2 6 vs configuration -4 -3 -3 4 8

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg
Am foot      14142.57     33.25      6443.36     31.70   -37.92
Sitcom        7639.25     34.73      2378.30     33.07   -22.52
LeMatch      12088.33     34.24      5584.43     33.03   -37.27
BBDrive       6157.89     36.63      3006.73     35.04   -32.13
Cactus        5712.21     35.87      2579.40     33.93   -29.64
XmasTree     10505.22     33.55      5006.95     31.08   -17.07
CrowdRun     18377.84     32.02      9111.95     30.52   -34.08
Ducks        21743.21     31.75     12132.92     30.45   -29.25
InToTree      5554.85     35.13      1871.16     32.56   -10.61
ParkJoy      19827.77     31.58     10234.01     29.63   -25.92
BigShips      4943.52     36.56      1648.14     33.82    -9.13
City          6443.43     35.55      2333.00     32.78    -6.84
Crew          4611.19     38.10      1557.71     36.37   -34.17
Average                                                  -25.12

Table A.17: C65 B vs x264

Sequence       Anchor    Anchor         Test      Test  BD rate Luma [%]
          avg bitrate  avg PSNR  avg bitrate  avg PSNR
               [kbps]      [dB]       [kbps]      [dB]      avg
Am foot      14142.57     33.25      3490.77     30.97   -57.46
Sitcom        7639.25     34.73      1375.82     32.88   -52.45
LeMatch      12088.33     34.24      2914.98     32.27   -58.29
BBDrive       6157.89     36.63      1568.87     34.66   -61.33
Cactus        5712.21     35.87      1495.24     33.64   -56.06
XmasTree     10505.22     33.55      3001.53     30.89   -47.73
CrowdRun     18377.84     32.02      5531.47     29.54   -48.80
Ducks        21743.21     31.75      6327.46     29.26   -46.12
InToTree      5554.85     35.13      1212.68     32.73   -44.78
ParkJoy      19827.77     31.58      6110.52     28.82   -44.17
BigShips      4943.52     36.56      1043.43     33.99   -46.92
City          6443.43     35.55      1461.91     32.88   -44.12
Crew          4611.19     38.10       846.23     36.31   -62.33
Average                                                  -51.58

Table A.18: C65 B vs HM 12