
Software implementation of an MPEG-4 Part 10 (H.264/AVC) video encoder for embedded systems

António Manuel da Costa Rodrigues

Dissertação para obtenção do Grau de Mestre em Engenharia Electrotécnica e de Computadores

Júri
Presidente: Doutor Marcelino Bicho Santos
Orientador: Doutor Nuno Filipe Valentim Roma
Co-orientador: Doutor Leonel Augusto Pires Seabra de Sousa
Vogal: Doutor Paulo Luís Serras Lobato Correia

Outubro de 2010

Acknowledgments

I would like to thank my supervisor, Dr. Nuno Roma, for his assistance and guidance throughout this project. On a personal note, I would like to thank all the people around me for their support and patience.

Abstract

In the last few years there has been a general proliferation of advanced video services and multimedia applications, where video compression standards, such as MPEG-x or H.26x, have been developed to store and broadcast video information in digital form. Among such video standards, the MPEG-4 Part 10 (also known as H.264/AVC) was recently released and has proved to provide particular advantages, when compared with the previous video standards (such as MPEG-1, MPEG-2, etc.), in what concerns the obtained encoding efficiency and video quality. On the other hand, to achieve such encoding efficiency and quality, the computational complexity of the encoder has increased exponentially. At the same time, processor manufacturers have been adopting multi-core solutions (multiple processor units on a die) to answer a growing speedup demand. This dissertation presents a multi-core solution of the H.264 encoder software (programmed in C - http://iphome.hhi.de/suehring/tml/download/old_jm/ version 14.0) to speed up the processing time, while keeping the encoding efficiency and quality. Some restrictions have been imposed to reduce the processing dependencies and increase the level of parallelism at the frame level. For example, the picture and macroblock interlace, rate distortion optimization and rate control modules were disabled. Through several simulations, the results indicate that the developed architectures are capable of speeding up the encoding process.

Keywords

Multicore, H.264/AVC, Parallelism, Video encoding, Speedup


Resumo

Nos últimos anos tem-se assistido à proliferação de serviços avançados de vídeo e aplicações multimédia, onde os standards de compressão de vídeo, tais como MPEG-x ou H.26x, têm sido desenvolvidos para armazenamento e difusão de vídeo em formato digital. Entre essas normas, o MPEG-4 Part 10 (também conhecido como H.264/AVC) foi lançado recentemente e provou ter vantagens, comparadas com as versões anteriores (tais como MPEG-1, MPEG-2, etc.), no que diz respeito à obtenção de uma codificação eficiente e uma boa qualidade de vídeo. Por outro lado, para atingir tal eficiência e qualidade, a complexidade computacional do codificador aumentou exponencialmente. Ao mesmo tempo, as indústrias de processadores começaram a optar por soluções multicore (múltiplos processadores no mesmo chip) para responder à crescente necessidade de processadores mais rápidos. Esta dissertação apresenta uma solução multi-core para o software do codificador H.264 (programado em C - http://iphome.hhi.de/suehring/tml/download/old_jm/ version 14.0) que visa o aumento da velocidade de processamento mantendo a eficiência e a qualidade. Algumas restrições foram impostas para reduzir as dependências a nível do processamento e aumentar o paralelismo a nível da frame. Por exemplo, os módulos de picture and macroblock interlace, rate distortion optimization e rate control foram desactivados. Através de várias simulações, os resultados indicaram que as arquitecturas desenvolvidas são capazes de aumentar a velocidade de processamento.

Palavras Chave

Multicore, H.264/AVC, Paralelismo, Codificação de Vídeo, Aumento da Velocidade


Contents

1 Introduction
1.1 Motivation
1.2 Requirements
1.3 Objectives
1.4 Main contributions
1.5 Dissertation Outline

2 Video Coding
2.1 Principles of Video Compression
2.1.1 Reduction of Irrelevancies
2.1.2 Spatial Redundancy
2.1.3 Temporal Redundancy
2.1.4 Comparison Metrics
2.2 H.264 Standard
2.2.1 H.264 Profiles and Levels
2.2.2 H.264 Encoding Loop
2.2.2.A Flexible Macroblock Order and Arbitrary Slice Order
2.2.2.B Intra Prediction
2.2.2.C Inter Prediction
2.2.2.D Transform & Quantization
2.2.2.E Entropy Coding
2.2.2.F In-loop Deblocking Filter
2.2.2.G Interpolation Block
2.2.2.H Network Abstraction Layer
2.2.3 H.264 Profiling

3 Multiprocessor Architectures
3.1 Classification of Parallel Processor Systems
3.2 Symmetric Shared-Memory Multiprocessor
3.2.1 Cache Coherency
3.2.1.A Software Solution
3.2.1.B Hardware Solution
3.3 Non-Uniform Memory Access
3.4 Communication Mechanisms
3.5 Synchronization
3.5.1 Software Locks
3.5.2 Hardware Locks
3.6 Data Parallelism and Programming Languages
3.6.1 MPI - Message Passing Interface
3.6.2 POSIX Threads
3.6.3 OpenMP
3.6.4 CUDA and OpenCL

4 Related work
4.1 Parallelizing the H.264 Decoder
4.2 Parallelizing the H.264 Encoder
4.3 Other Proposed Parallel Solutions

5 Open Platform for Parallel H.264 Video Encoding
5.1 Assumptions
5.2 Structures Redesign and Code Improvements
5.2.1 Transform & Quantization
5.2.2 Intra Prediction
5.2.3 Inter Prediction
5.2.4 Deblocking Filter
5.2.5 Interpolation
5.2.6 Data Parallelization - SIMD Instructions
5.3 Levels of Parallelism in H.264 Encoding
5.4 Heterogeneous vs Homogeneous Multi-core Platform
5.5 Slice Definition
5.6 Data Partitioning
5.7 Parallel Programming of H.264 Video Encoders
5.7.1 Parallel Encoding of Intra Frames
5.7.2 Parallel Encoding of Inter Type Frames
5.7.3 Increasing the Slice-level Scalability

6 Results
6.1 Frame Partition and Memory Consumption
6.2 Architectures Comparison
6.2.1 Full Architecture
6.2.2 Mixed Parallel Architecture
6.2.3 Full Parallel Architecture
6.2.4 Discussion
6.3 Comparison Between Slice-level and Macroblock-level Parallelism

7 Conclusions
7.1 Summary
7.2 Results Analysis
7.3 Future Work

Bibliography

A Architecture Results
A.1 Full pipeline architecture
A.2 Mixed parallel architecture
A.3 Full parallel architecture

List of Figures

2.1 Different sampling patterns.
2.2 Difference between two consecutive frames. Image (c) was enhanced so that the differences can be noticed.
2.3 Example of an interframe prediction using BMA.
2.4 Main features supported by each of the H.264 profiles.
2.5 H.264 encoder block diagram.
2.6 Different slice group maps supported in the H.264.
2.7 Directions available for Intra prediction in H.264.
2.8 Supported modes in INTRA-4x4 prediction.
2.9 INTRA-16 × 16 supported modes.
2.10 Possible partitions of a macroblock in H.264 Inter prediction.
2.11 Prediction types: (a) Type P; (b) Type P and B - presentation order; (c) Type P and B - codification order.
2.12 Usage of multiple reference frames in Inter prediction.
2.13 Frequency domain representations of a 4 × 4 block.
2.14 Block transmission order.
2.15 Transform & Quantization block diagram.
2.16 Zig-zag scan of a transformed and quantized block's coefficients.
2.17 AC coefficients probability distribution function.
2.18 Simplified CABAC block diagram.
2.19 Frame 34 of QCIF Foreman sequence encoded at 40Kbps (a) without filter; (b) with loop filter.
2.20 Boundary strength computation flowchart.
2.21 Pixels adjacent to vertical and horizontal boundaries.
2.22 Macroblock vertical and horizontal edges.
2.23 Half-pixel directions.
2.24 Quarter-pixel directions.
2.25 NAL interface between the encoder/decoder and the transport layer.
3.1 A taxonomy of parallel processor architectures.
3.2 Basic structure of a centralized shared-memory multiprocessor. Multiple processor-cache subsystems share the same physical memory and the memory access time is uniform to all the processors.
3.3 Example of a NUMA structure, composed of individual nodes containing a processor, some memory, typically some I/O, and an interface to an interconnection network that connects all the nodes.
3.4 Simplified block diagram of the Intel quad-core Xeon processor.
3.5 Directory protocol: a) both CPUs get a shared block from memory to cache; b) both CPUs read the information from the respective caches, releasing the bus; c) CPU1 needs to update the block: an exclusive request is sent to the Directory, which invalidates the block in cache 2; d) when CPU2 tries to access the shared and invalidated block, it sends a miss signal to the Directory; a write back signal is sent to the cache controller of CPU1; e) the block is updated in the main memory; f) the block is transferred from memory to cache 2.
3.6 Snoopy protocol: a) all caches read a shared block from memory; b) CPU2 updates the block and broadcasts that action; caches 1 and 3 snoop the bus looking for changes in the shared block.
3.7 Static Network Topologies: a) Completely Connected; b) Star; c) Linear and Ring; d) Mesh and Mesh Ring; e) Tree; f) Three-Dimensional Hypercube.
3.8 A Bus-Based Network.
3.9 Example of a Crossbar Network.
3.10 Possible states of the 2 × 2 Interchange Switch: a) Through; b) Cross; c) Upper Broadcast; d) Lower Broadcast.
3.11 A Two-Stage Omega Network.
4.1 3D-Wave scheme: frames can be decoded in parallel, because inter-frame dependencies have a limited spatial range [34].
4.2 Hierarchical H.264 parallel encoder [35].
4.3 Cell solution encoding flow with macroblock-level parallelism exploitation.
4.4 Cell solution encoding flow with macroblock-level and slice-level parallelism exploitation.
5.1 Example of a main task divided in smaller sub-tasks running in a multi-core system.
5.2 used in probable mode's computation.
5.3 Top and Right vectors used to store the probable modes that will be used in the following predictions.
5.4 Frame-level parallel architecture.
5.5 Application of slice-level parallelism in a multi-core system.
5.6 Application of macroblock-level parallelism in a multi-core system.
5.7 Slice definition supported by the platform.
5.8 Adopted techniques to solve slice scalability: in a) the parallelism is performed using four slices and a 2D-Wave front for macroblock parallelism; in b) the image is divided in four slices (color labels), each one with four sub-slices (numbers).
5.9 Generic pipeline architecture considering N stages.
5.10 Block diagram of the full parallel architecture.
5.11 Mixed pipeline architecture for four cores: Stage 0 - one core; Stage 1 - two cores; Stage n-1 - one core.
5.12 Organization of the several encoder modules in the proposed parallel framework.
5.13 Proposed architecture to simultaneously exploit slice and macroblock parallelism levels.
5.14 Simplified synchronization used for each team.
5.15 Slice-scattering distribution in a multi-core architecture.
6.1 Eight-core NUMA system architecture used in all simulations.
6.2 Video sequences snapshots.
6.3 Results obtained in a pipeline architecture using 4CIF format.
6.4 Results obtained in a mixed parallel architecture using 4CIF format.
6.5 Provided speedup using macroblock-level parallelism.
6.6 Provided speedup using slice-level parallelism.
A.1 Frame rate obtained for all video formats.
A.2 Speedup obtained for all video formats.
A.3 Frame rate obtained for all video formats.
A.4 Speedup obtained for all video formats.
A.5 Frame rate obtained for all video formats.
A.6 Speedup obtained for all video formats.

List of Tables

2.1 Macroblock size for each video format.
2.2 VLC Mapping table.
2.3 Relative complexity analysis of the encoder blocks for each GOP type.
2.4 Relation between the GOP types and the resulting bit rate, frame rate and PSNR.
3.1 Properties of some Interconnection Networks.
5.1 Memory consumption for reference and optimized software versions. These results were acquired considering 3 backward and 1 forward reference frames.
5.2 Processing time of each prediction mode for every CIF video sequence [ms].
5.3 Pipeline architecture: modes distributions among cores. The maximum delay in the pipeline was computed considering the delays acquired in Table 5.2.
5.4 Mixed pipeline architecture: distribution of the prediction modes among the pipeline stages and processor cores. #C denotes the number of cores used in each stage.
6.1 Specifications of the computational system used for the platform simulations.
6.2 Bit-rate increment for frame division in N slices for 4CIF format.
6.3 PSNR results for frame division in N slices for 4CIF format.
6.4 Comparison between the reference and the optimized software for 4CIF format.
6.5 Full pipeline speedup for 4CIF video sequences.
6.6 Mixed parallel speedup for 4CIF video sequences.
6.7 Full parallel speedup for 4CIF video sequences.
6.8 Full parallel architecture efficiency for 4CIF video sequences (Speedup/Number of Threads).
6.9 Possible configurations of the optimized platform.
6.10 Average speedup of all four video sequences for SSE and Low Memory configurations using macroblock-level parallelism.
6.11 Average speedup of all four video sequences for SSE and Low Memory configurations using slice-level parallelism.
6.12 Provided speedup for standard configuration using Slice and Macroblock levels of parallelism.
A.1 Comparison between the reference and the optimized software.
A.2 Frame rate (fps) for full pipeline architecture.
A.3 Relative speedup for full pipeline architecture.
A.4 Frame rate (fps) for mixed parallel architecture.
A.5 Relative speedup for mixed parallel architecture.
A.6 Frame rate (fps) for full parallel architecture.
A.7 Full parallel performance for QCIF and CIF video sequence formats.
A.8 Efficiency for full parallel architecture.

List of Acronyms

4CIF 4 times Common Intermediate Format

API Application Programming Interface

ASO Arbitrary Slice Order

AVC Advanced Video Coding

BMA block-matching algorithm

CABAC Context Adaptive Binary Arithmetic Coding

CAVLC Context Adaptive Variable Length Coding

CCF Cross Correlation Function

CIF Common Intermediate Format

CPU Central Processing Unit

CRT Cathode Ray Tube

CUDA Compute Unified Device Architecture

DCT Discrete Cosine Transform

DPB Decoded Picture Buffer

DVD Digital Video Disk

FIFO First In First Out

FMO Flexible Macroblock Order

FRExt Fidelity Range Extension

GOP Group Of Pictures

GPU Graphic Processing Unit

HD High Definition


HIDCT Hadamard Integer Discrete Cosine Transform

HVS Human Visual System

IDCT Integer Discrete Cosine Transform

ILP Instruction Level of Parallelism

MBAmp Macroblock Allocation Map

MISD Multiple Instruction Single Data

MIMD Multiple Instruction Multiple Data

MPEG Moving Picture Experts Group

MPI Message Passing Interface

MSE Mean Square Error

NAL Network Abstraction Layer

NUMA Non-Uniform Memory Access

OS Operating System

PAFF Picture Adaptive Frame Field

PDF Probability Distribution Function

POSIX Portable Operating System Interface (for Unix)

PPE PowerPC Processor Elements

PSNR Peak Signal-to-Noise Ratio

QCIF Quarter Common Intermediate Format

QP Quantization Parameter

RGB Red Green Blue (color space)

SAD Sum of Absolute Differences

SATD Sum of Absolute Transformed Differences

SIMD Single Instruction Multiple Data

SISD Single Instruction Single Data

SMP Symmetric Multiprocessor

SPE Synergistic Processor Elements

SSD Sum of Square Differences

SSE Streaming SIMD Extensions

TLP Thread Level of Parallelism

TSS Three Step Search

UMA Uniform Memory Access

UMHexagonS Simplified Uneven Multi Hexagon Search

UMTS Universal Mobile Telecommunications Service

VLC Variable Length Code

VCL Video Coding Layer

xDSL x Digital Subscriber Line (of any type)

YUV Luminance and chrominance color space


1 Introduction

Contents
1.1 Motivation
1.2 Requirements
1.3 Objectives
1.4 Main contributions
1.5 Dissertation Outline


In the last decades, digital video applications and services have gradually grown to answer the public's need for devices capable of storing huge amounts of information with high quality. One of the most important answers of the industry was the DVD, based on the MPEG-2 video standard. However, with an increasing number of services and the growing popularity of high definition TV, the usage of more efficient encoding schemes rapidly became a strong requirement.

Meanwhile, other transmission media have emerged in the market, such as coaxial cable, xDSL or UMTS; these technologies offer much lower data rates than the traditional analog broadcast channels. As a consequence, enhanced coding efficiency became highly required, to enable the transmission of more video channels or higher quality video representations within the existing digital transmission capacities. The need for a new video standard, capable of higher coding efficiency at lower data rates, thus became evident. The new H.264/AVC standard reached such an objective [1][2][3][4].

When compared with the previous standards, H.264/AVC is capable of encoding a video sequence with higher quality using a lower data rate, thanks to the integration of the most recent advances in motion prediction and entropy coding. To achieve this, the computational complexity of the encoder has grown significantly when compared with older standards. As a consequence, real-time video encoding, at 30 frames per second and using high quality options, has become very difficult to achieve with traditional software solutions.

1.1 Motivation

Moore's law, a 40-year-old prediction, describes a long-term tendency in the history of computing hardware, in which the chip transistor density doubles approximately every two years. Moore's law became a synonym for performance and speed: chip complexity grew, making chips capable of implementing more complex functions; through miniaturization, clocks became faster, leading CPUs to the gigahertz range.

In 2004, microprocessor manufacturers were faced with an extra difficulty, concerned with heat issues due to the usage of higher clock speeds [5]. By 2006, the popular expectation of doubling the performance per processing core was not being met: increasing chip performance through clock speed had become impossible, because chips ran too hot. Therefore, microprocessor manufacturers saw the necessity to find new alternatives to improve performance while keeping the same clock frequencies. The first multi-core solution, composed of two cores on a chip, was introduced in the end-user market in 2006.

However, with multi-core microprocessors, the automatic performance benefit due to Moore's law was only apparent. In practice, many applications did not benefit, or only little improvement was achieved, since most software had been developed to run on a single core. As a consequence, developers would need to rewrite their software in order to exploit all hardware capabilities and truly gain the performance benefits of multi-core processors.


1.2 Requirements

Advances in technology and the revolution in chip manufacturing have allowed the proliferation of multimedia applications and services in embedded systems. Due to their high requirements in terms of memory consumption and computational power, video encoder applications are usually deployed on platforms with dedicated hardware, since software implementations of the encoder are often unable to answer the most demanding applications and services. In this project, an optimized and efficient software implementation was developed in order to target a larger number of platforms, such as homogeneous multi-core embedded and general purpose systems. Therefore, some requirements were imposed on our implementation:

• Low memory consumption;

• Static data structures allocation;

• Exploit data parallelism to increase the encoder performance;

• Ability to run in embedded systems and homogeneous multi-core platforms.

1.3 Objectives

In the last few years, the number of advanced video services and multimedia applications has grown. Several video compression standards, such as MPEG-x or H.26x, have been developed to store and broadcast video information in digital form. Among them, the H.264/AVC was recently released and has proved to provide particular advantages when compared with the previous video standards. At the same time, processor manufacturers have been adopting multi-core solutions to answer a growing speedup demand.

The main objective of this dissertation is to increase the performance of the H.264/AVC encoder's reference software (programmed in C - http://iphome.hhi.de/suehring/tml/download/old_jm/ version 14.0) by adapting it to the new generation of multi-core platforms. In a first approach, the encoder was exhaustively optimized: the data structures were redesigned in order to exploit cache efficiency and some code improvements were made to obtain a modular and flexible architecture. Data parallelism was then added to the optimized encoder to enhance its performance even further. The main steps taken to achieve the final architecture were:

• Study the several levels of data parallelism in H.264;

• Evaluate the most suitable architecture to exploit data parallelism - full parallel and pipeline concurrency architectures. This study was performed using only a single level of data parallelism;


• Increase the parallelism levels and redesign the data structures to avoid system bottlenecks due to data shared among multiple cores;

• Study the performance of the architecture with multiple parallelism levels.

1.4 Main contributions

To fulfill the objectives that were stated for this dissertation, the H.264/AVC encoder's reference software was largely rewritten, in order to adapt it to implementations based on parallel multi-core systems and to fulfill the memory requirements. The resulting software can therefore be configured to be used in general purpose systems or in embedded systems. From the author's point of view, the main contributions presented in this work are:

• The new optimized H.264 platform is suitable for execution in embedded systems without an Operating System as support, due to its low memory usage;

• A parallel software implementation of the encoder in order to allow its usage in the most demanding applications;

• A highly scalable and configurable platform.

1.5 Dissertation Outline

This dissertation is organized as follows:

• Chapter 2 provides the necessary theoretical background on the main principles of video compression and the H.264 standard. An H.264 profiling analysis is also presented and discussed.

• Chapter 3 briefly presents current multiprocessor architecture solutions. The main problems presented by these systems are analyzed and discussed.

• Chapter 4 presents the state of the art concerning the parallelization of the H.264 decoder and encoder.

• Chapter 5 explains, in detail, the several steps that were taken to achieve the proposed software solution.

• Chapter 6 presents and discusses the performance results of the developed multi-core so- lution.

• Chapter 7 concludes this thesis with a summary of the accomplished objectives, and pro- vides possible directions for relevant future research.

2 Video Coding

Contents
2.1 Principles of Video Compression
2.2 H.264 Standard


2.1 Principles of Video Compression

The principal goal in the design of a video coding system is to reduce the number of bits that are necessary to represent the video source, subject to eventual video quality losses. There are two main factors that make video compression possible: the psychophysical redundancy of the Human Visual System (HVS) and the statistical structure of the video data, which includes spatial and temporal redundancy [6].

Human observers are subject to perceptual limitations in amplitude, spatial resolution, and temporal acuity, which allows part of the information to be discarded without affecting the perceived image quality. The psychophysical redundancy, whose exploitation is also denoted as reduction of the image irrelevancies, is mainly exploited through color subsampling and quantization (when discarding high frequency coefficients).

The statistical analysis of video signals indicates that there is a strong correlation both between successive picture frames (inter-frame) and within the picture elements of a given frame (intra-frame) [6]. Theoretically, the decorrelation of these signals can lead to significant bandwidth compression without affecting the image quality. Hence, lossy compression techniques can be used to reduce video bit rates while maintaining an acceptable image quality.

2.1.1 Reduction of Irrelevancies

A color space is a mathematical model used to specify, create and visualize color. The most well-known color space is RGB, an additive color system based on three primary colors: Red (R), Green (G) and Blue (B) light. RGB is relatively easy to implement, although the HVS is not equally sensitive to variations in the three color components and is non-linear with respect to visual perception. Moreover, it is a device-dependent color space, since the R, G and B levels may vary from manufacturer to manufacturer. Nevertheless, this is the color system that is mainly used in systems that rely on a CRT (Cathode Ray Tube) to display images.

Another important color system is YCbCr (sometimes referred to as YUV), which represents the image color information by a luminance component (Y) and two color differences (chrominance components). It is well known that the HVS is less sensitive to color (chrominance) than to luminance. This fact makes the YCbCr color system more suitable than RGB to exploit the color redundancy reduction, since in the latter all three color components are processed with equal importance. The conversions between the RGB and the YCbCr systems are given by the following equations:

Y = 0.299R + 0.587G + 0.114B

Cb = 0.564(B − Y ) (2.1)

Cr = 0.713(R − Y )


R = Y + 1.402Cr

G = Y − 0.344Cb − 0.714Cr (2.2)

B = Y + 1.772Cb
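As a minimal illustration of Equations (2.1) and (2.2), the following C sketch converts a single pixel between the two color spaces (the function names are illustrative and not part of the reference software):

```c
#include <stdio.h>

/* Forward conversion, Equation (2.1): RGB -> YCbCr. */
static void rgb_to_ycbcr(double r, double g, double b,
                         double *y, double *cb, double *cr)
{
    *y  = 0.299 * r + 0.587 * g + 0.114 * b;
    *cb = 0.564 * (b - *y);
    *cr = 0.713 * (r - *y);
}

/* Inverse conversion, Equation (2.2): YCbCr -> RGB. */
static void ycbcr_to_rgb(double y, double cb, double cr,
                         double *r, double *g, double *b)
{
    *r = y + 1.402 * cr;
    *g = y - 0.344 * cb - 0.714 * cr;
    *b = y + 1.772 * cb;
}

int main(void)
{
    double y, cb, cr, r, g, b;
    rgb_to_ycbcr(255.0, 128.0, 0.0, &y, &cb, &cr);
    ycbcr_to_rgb(y, cb, cr, &r, &g, &b);
    printf("Y=%.1f Cb=%.1f Cr=%.1f -> R=%.1f G=%.1f B=%.1f\n",
           y, cb, cr, r, g, b);
    return 0;
}
```

Note that the inverse conversion only recovers the original RGB values approximately, due to the rounding of the published coefficients.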

To further exploit the HVS using the YCbCr color space, digital video applications also sub- sample the two chrominance components [6]. Figure 2.1 shows three sampling patterns for Y, Cb and Cr that are supported by the H.264 video standard [3]. The 4:4:4 subsampling preserves the full fidelity of the chrominance components, mainly used for lossless applications. In the 4:2:2 subsampling pattern, the chrominance components have the same vertical resolution as the luminance but half the horizontal resolution. This pattern is often used for high-quality color reproduction. In the 4:2:0 subsampling format, the chrominance components have half the vertical and horizontal resolutions of Y. This format is usually used for consumer applications, such as video conferencing, digital television and DVDs, and requires exactly half as many samples as 4:4:4 or RGB video.

(a) 4:4:4 subsampling. (b) 4:2:2 subsampling. (c) 4:2:0 subsampling.

Figure 2.1: Different sampling patterns.

It is also known that the HVS is less sensitive to high frequency color components. Pre-processing techniques, such as frequency transformation and quantization, are used to reduce perceptually unimportant information in a frame.

In the original spatial domain, all pixels are equally important. However, in the transform domain, the transformed coefficients no longer have the same importance, since the low-order coefficients usually contain more energy than the high-order coefficients [6]. In other words, the original video data is decorrelated by the transformation. Therefore, video can be encoded more efficiently in the transform domain than in the spatial pixel domain. As a consequence, most video standards adopt the Discrete Cosine Transform (DCT) to represent an image in the frequency domain [7].

It is important to note that the domain transformation of the pixels does not actually yield any compression. In fact, it is the quantization that removes the least important coefficients, by forcing them to zero. Thus, the number of coefficients to be encoded by run-length mechanisms is reduced, leading to a compression gain. During quantization, a weighted quantization matrix is used. The function of such a quantization matrix is to quantize the higher frequencies with coarser quantization steps, in order to suppress most of the high frequency components. Due to human visual perception characteristics, such suppression usually results in smaller subjective degradation levels.

2.1.2 Spatial Redundancy

Spatial redundancy techniques use Intra prediction in order to exploit the high correlation between adjacent pixels. In Intra prediction, similarities between adjacent pixels of the same frame are used for prediction: a certain block is predicted considering the adjacent top and left pixels. The residual differences between the prediction and the block are then encoded. With this technique, it is possible to reduce the spatial correlation of the video, reducing the bit rate without losing video quality.

2.1.3 Temporal Redundancy

Temporal redundancy techniques are based on temporal prediction, which is used to exploit the similarities that exist between consecutive frames, by only encoding the differences between successive images [6][7]. This is called inter-frame coding. For static parts of the image sequence, temporal differences will be close to zero, and so they are not coded. On the other hand, changes between frames, either due to illumination variation or to motion of the objects, result in significant image differences, which need to be encoded. These differences, after the transformation and quantization processing, are then used by the motion compensation to generate the reconstructed frame.

(a) Frame at time t − 1. (b) Frame at time t. (c) Frame differences.

Figure 2.2: Difference between two consecutive frames. Image (c) was enhanced so that the differences can be noticed.

Figure 2.2(c) shows the inter-frame differences between successive frames of the Foreman video sequence.


Motion estimation

To carry out motion compensation, the motion of the moving objects has to be characterized first. This process is called motion estimation. The most commonly used motion estimation technique is the Block Matching Algorithm (BMA) [7]. In a typical BMA, the video frames are firstly partitioned into non-overlapping blocks of M × N pixels or, more usually, square blocks of N² pixels. For a maximum motion displacement of w pixels, the current block is matched against a corresponding block at the same coordinates in a past or future reference frame, within a square window of width N + 2w pixels (search window). This process is depicted in Figure 2.3. The best match, on the basis of a given matching criterion, yields the displacement vector, or motion vector.

In general, more accurate matching can be achieved by using smaller block sizes, at the cost of a significant increase of the number of motion vectors. As a compromise, most video standards, such as MPEG-1/2, adopted a 16 × 16 pixels block size for motion estimation and compensation [7]. The advanced mode of MPEG-2 uses both 16 × 16 and 8 × 8 block sizes for finer estimation. H.264/AVC has extended this refinement further to 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4 pixel blocks for motion estimation [6].

In what concerns the search window, larger search areas provide more opportunities to obtain the best match. However, the resulting increase in the number of candidate prediction blocks imposes a rapid and very significant increase of the computational complexity.

Various measures, such as the Cross Correlation Function (CCF), the Sum of Squared Differences (SSD) and the Sum of Absolute Differences (SAD), can be used as the matching criterion. For these criteria, a block is perfectly matched when the CCF reaches 1 or when the SSD or SAD is 0. In practical encoders, both SSD and SAD are usually preferred, since it has been observed that the CCF does not give good motion tracking, especially when the displacement is not large [8].

Figure 2.3: Example of an interframe prediction using BMA.

The location of the best match by full search requires (2w + 1)² evaluations of the matching criterion. To reduce the processing cost, the SAD has been preferred over the SSD in most video codecs, since it avoids the usage of multiplication operations [8]. However, for each block of N² pixels, all these tests still have to be computed, each one requiring N² additions and subtractions. This is still far from being suitable for the implementation of the BMA in software-based codecs. In fact, measurements of the video encoder's complexity show that the motion estimation operation comprises more than 70% of the overall encoder's complexity [9]. As a consequence, other search methods, such as the Three Step Search (TSS) or the Hexagonal Search, have been developed to rapidly estimate the best match and to reduce the computational complexity of the BMA.
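As an illustration of the complexity figures above, the following C sketch implements a full-search BMA with the SAD criterion for 16 × 16 blocks (illustrative code with hypothetical names; as discussed, practical encoders rely on much faster search strategies, such as TSS or UMHexagonS):

```c
#include <limits.h>
#include <stdlib.h>

#define BS 16  /* block size N */

/* SAD between the current block at (cx,cy) and a candidate block of the
 * reference frame at (rx,ry); 'stride' is the frame width in pixels. */
static int sad_block(const unsigned char *cur, const unsigned char *ref,
                     int stride, int cx, int cy, int rx, int ry)
{
    int sad = 0;
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            sad += abs(cur[(cy + i) * stride + (cx + j)] -
                       ref[(ry + i) * stride + (rx + j)]);
    return sad;
}

/* Exhaustive search over a +/- w window: (2w+1)^2 SAD evaluations.
 * Boundary checks are omitted for brevity. Returns the best SAD and
 * the motion vector in (*mvx, *mvy). */
static int full_search(const unsigned char *cur, const unsigned char *ref,
                       int stride, int cx, int cy, int w,
                       int *mvx, int *mvy)
{
    int best = INT_MAX;
    for (int dy = -w; dy <= w; dy++)
        for (int dx = -w; dx <= w; dx++) {
            int sad = sad_block(cur, ref, stride, cx, cy, cx + dx, cy + dy);
            if (sad < best) { best = sad; *mvx = dx; *mvy = dy; }
        }
    return best;
}
```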

2.1.4 Comparison Metrics

Current video coding standards use several techniques and algorithms to maximize the compression level of the video data. To compare the results that are achieved with the application of these techniques, comparison criteria have to be defined.

The first criterion is the data compression ratio, which assesses the achieved coding efficiency. In other words, it corresponds to the percentage of data saved after the video is compressed. For example, a compression efficiency of 70% means that only 30% of the original amount of data is used after compression.

The video quality is also important to compare the applied encoder algorithms. In fact a good compression ratio could trivially be achieved by simply lowering the video’s quality. To conduct such comparison, each compressed frame is reconstructed and then compared with the original one. This criterion is called Peak Signal-to-Noise Ratio (PSNR) and is defined by Equation (2.3).

$$PSNR = 20 \times \log_{10}\left(\frac{MAX}{\sqrt{MSE}}\right) \qquad (2.3)$$

where MAX is the maximum pixel value (usually 255). The Mean Square Error (MSE) is the accumulated square difference between the original frame (I) and the reconstructed one (K), as defined in Equation (2.4).

$$MSE = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left\|I(i,j) - K(i,j)\right\|^{2} \qquad (2.4)$$

Since the PSNR is considered a good metric for the general quality perceived by the human eye, it is used to compare the video quality achieved by this platform [6].

During the prediction process (intra-frame or inter-frame), the best prediction is found by looking for the candidate block that minimizes one of three metrics: the Sum of Absolute Differences (SAD), the Sum of Square Differences (SSD), or the Sum of Absolute Transformed Differences (SATD).


They are defined by the following equations:

$$SAD = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left\|I(i,j) - K(i,j)\right\| \qquad (2.5)$$

$$SSD = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left\|I(i,j) - K(i,j)\right\|^{2} \qquad (2.6)$$

$$SATD = \sum_{i=0}^{3}\sum_{j=0}^{3}\left\|C_{i,j}\right\|, \quad \text{with } C = H_4 \times (I - K) \times H_4^{T} \qquad (2.7)$$

where $H_4$ is the kernel matrix of the Hadamard Integer Discrete Cosine Transform (HIDCT) and $H_4^T$ its transpose. For faster encoders, the usage of the SAD is preferable, due to its simplicity [8]. The SSD and SATD are more complex metrics and hence they are not used in this work.
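A direct C implementation of Equations (2.3) and (2.4) for 8-bit frames could look as follows (an illustrative helper, not taken from the reference software):

```c
#include <math.h>

/* PSNR, in dB, between an original frame I and a reconstructed frame K
 * of m x n 8-bit pixels (Equations (2.3) and (2.4)). */
static double compute_psnr(const unsigned char *I, const unsigned char *K,
                           int m, int n)
{
    double mse = 0.0;
    for (int i = 0; i < m * n; i++) {
        double d = (double)I[i] - (double)K[i];
        mse += d * d;
    }
    mse /= (double)(m * n);
    if (mse == 0.0)
        return INFINITY;  /* identical frames */
    return 20.0 * log10(255.0 / sqrt(mse));
}
```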

2.2 H.264 Standard

2.2.1 H.264 Profiles and Levels

The first version of the standard defined three profiles and fifteen levels. These three profiles are the Baseline, Main and Extended profiles [1][3][4]. The Extended profile is an expansion of the Baseline profile. Since these initial three profiles did not include the necessary tools to support high quality video (the Fidelity Range Extension - FRExt), a fourth profile was later added: the High profile [3].

All encoders are required to produce conforming bit streams, consistent with their declared profile and level. A decoder that conforms to a specific profile must be able to support all of its features. Hence, the capability of exchanging information between individual encoders and decoders is defined through this profile and level compliance. Figure 2.4 summarizes the relationship between the profiles.


Figure 2.4: Main features supported by each of the H.264 profiles.

Although it is difficult to establish a strong relation between profiles and applications, it is possible to say that conversational services will typically adopt the Baseline profile; entertainment services use the Main profile; streaming services are based on the Baseline or Extended profiles; and applications in which high quality is required adopt the High profile [1][2][3]. The Main profile was adopted in our platform implementation.

In H.264, fifteen levels are specified for each profile. Each level defines upper bounds for the bit stream or lower bounds for the decoder capabilities: decoder processing rate, size of the memory allocated for multipicture buffers, video frame rate, and motion vector range [2].

2.2.2 H.264 Encoding Loop

Figure 2.5 shows the block diagram of a general H.264 video encoder.


Figure 2.5: H.264 encoder block diagram.

The input frame is divided into macroblocks, where each macroblock is formed by one 16 × 16 block of pixels of the luminance component and two 8 × 8 blocks of pixels of the chrominance components. The block sizes depend on the chosen subsampling format and are listed in Table 2.1.

Table 2.1: Macroblock size for each video format.

Video Format | Luminance      | Chrominance
4:4:4        | 16 × 16 pixels | 16 × 16 pixels
4:2:2        | 16 × 16 pixels | 16 × 8 pixels
4:2:0        | 16 × 16 pixels | 8 × 8 pixels

Each macroblock is encoded either in Intra or Inter mode. In Inter mode, each macroblock is predicted from past or future reference frames using motion compensation, and a motion vector is estimated and transmitted for each block. In Intra mode, a macroblock is predicted using only the information from the current frame, meaning that previous or future reference frames are not required. Then, the prediction error, obtained as the difference between the original and the predicted block, is transformed, quantized and entropy encoded. Hence, in order to reconstruct the same image on the decoder side, the coefficients must be dequantized, inverse transformed and added to the considered prediction. The result is post-filtered using a deblocking filter and stored in a Decoded Picture Buffer (DPB), to be used as a reference by subsequent frames. The H.264 standard allows the storage of multiple video frames in the DPB for future predictions [1][2][3][4].

One of the new features of the H.264 standard is the possibility of forming several groups of macroblocks. Each macroblock is assigned to a group by the Flexible Macroblock Order tool. These groups, so-called slice groups, are classified by their macroblocks' prediction and can be decoded independently [1][4]. Five different slice types are supported:

• I-slices - all macroblocks are encoded in Intra mode;

• P-slices - all macroblocks are predicted using either motion compensation prediction, with previous reference frames, or Intra mode;

• B-slices - just like in P-slices, the macroblocks of a B (bi-predictive) slice are predicted using motion compensated prediction, but with both previous and future reference frames;

• SI-slices and SP-slices - specific slices that are used for an efficient switching between two different bit streams. These particular slices are not considered in this work.

2.2.2.A Flexible Macroblock Order and Arbitrary Slice Order

The Flexible Macroblock Order is one of the new features included in H.264 and has the purpose of dividing an image into regions - slice groups. A video sequence is divided into Groups Of Pictures (GOP), formed by several frames. A frame can also be divided into several parts, called slices. Each slice is assigned to a group of macroblocks. This assignment is stored in a table called the Macroblock Allocation Map (MBAmp). The maximum number of slice groups in an image is limited to 8, in order to prevent complex allocation schemes. Each slice group is sent in individual packets in a transmission. This feature can be very useful for error resilience since, in an environment with a certain packet loss rate, the loss of a slice does not compromise the decoding of the whole image [10].

H.264 allows macroblocks within pictures to use a variety of slice mapping patterns. These include interleaving, dispersed, foreground groups, box-out, raster scan, wipe and explicit patterns. Slice groups may also be mapped in inverse raster scan, wipe left and counter-clockwise box-out directions. Figure 2.6 represents the available slice group maps (except the explicit one). The main issue of FMO is the minor overhead imposed by each slice group: a slice header has to be sent, which will increase the bit rate [11].


[Figure 2.6 panels: Type 0: Interleaving; Type 1: Dispersed; Type 2: Foreground; Type 3: Box-Out; Type 4: Raster Scan; Type 5: Wipe]

Figure 2.6: Different slice group maps, supported in the H.264.

Alternatively, the Arbitrary Slice Ordering (ASO) allows the slices of an image to be sent in any order, avoiding the need to wait for a complete ordered set of slices before decoding can start. ASO is typically considered an error/loss robustness feature. This tool is particularly useful for real-time applications or for networks which may deliver data packets out of order [1][2].

2.2.2.B Intra Prediction

In Intra prediction, each macroblock is predicted using only the information of already transmitted macroblocks of the same image. The main purpose of this technique is to reduce the spatial redundancy in the image. In H.264, two different types of Intra prediction are possible for the luminance component: INTRA-4 × 4 and INTRA-16 × 16 [2].

In the INTRA-4 × 4 prediction mode, each 16 × 16 pixels macroblock is divided into sixteen 4 × 4 blocks, and a different prediction may be applied individually to each block. Nine different prediction modes are supported [12]. In each prediction, neighboring samples of already reconstructed pixels are used. One prediction mode is the DC mode, where all pixels of the current 4 × 4 block are predicted by the mean of all pixels of the top and left neighbors. In addition to the DC prediction mode, eight further prediction modes, each for a specific prediction direction, are supported. All possible directions are shown in Figure 2.7. The nine prediction modes of INTRA-4 × 4 are illustrated in Figure 2.8.

In the INTRA-16 × 16 prediction mode, the entire macroblock is predicted at once and only four different modes are supported: vertical, horizontal, DC and plane predictions [2][12]. In particular, the plane prediction mode uses a linear function of the left and top neighboring pixels in order to predict the current samples. This mode works very well in areas with a gently changing luminance. The four INTRA-16 × 16 prediction modes are depicted in Figure 2.9.


The Intra prediction applied to the chrominance components of a macroblock is similar to the INTRA-16 × 16, since the chrominance signals are very smooth in most cases. Just like in the INTRA-16 × 16, the whole chrominance macroblock is predicted at once [2].
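As an illustration of the simplest of these modes, the following C sketch computes the DC prediction of a 4 × 4 luminance block from its top and left neighbors (illustrative code that assumes all neighbors are available; the standard defines fallbacks when they are not):

```c
/* INTRA-4x4 DC mode: every pixel of the block is predicted by the
 * rounded mean of the 4 reconstructed pixels above and the 4 to the
 * left. 'rec' is the reconstructed frame, 'stride' its width and
 * (x,y) the top-left corner of the current 4x4 block. */
static void intra4x4_dc(const unsigned char *rec, int stride,
                        int x, int y, unsigned char pred[4][4])
{
    int sum = 0;
    for (int k = 0; k < 4; k++) {
        sum += rec[(y - 1) * stride + (x + k)];  /* top neighbors A..D  */
        sum += rec[(y + k) * stride + (x - 1)];  /* left neighbors I..L */
    }
    unsigned char dc = (unsigned char)((sum + 4) >> 3);  /* mean of 8 */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            pred[i][j] = dc;
}
```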


Figure 2.7: Directions available for Intra prediction in H.264.

[Figure 2.8 panels: 0 - vertical; 1 - horizontal; 2 - DC; 3 - diagonal down-left; 4 - diagonal down-right; 5 - vertical-right; 6 - horizontal-down; 7 - vertical-left; 8 - horizontal-up]

Figure 2.8: Supported modes in INTRA-4x4 prediction.

2.2.2.C Inter Prediction

Inter prediction creates a motion compensated prediction model using one or more previously encoded video frames. H.264 introduces new techniques and features for more accurate predictions: new macroblock partitions; motion vectors with quarter-pixel precision; support for bidirectional frames; and weighted predictions with multiple references [1].

In H.264, the luminance block can be partitioned into subblocks of 16 × 16, 16 × 8, 8 × 16 and 8 × 8 pixels, and the last one can be further divided into sub-partitions of 8 × 4, 4 × 8 or 4 × 4 pixels. The possible partitions of a macroblock are illustrated in Figure 2.10. A displacement vector is estimated and transmitted for each block partition.

The accuracy of the displacement vectors is one quarter-pixel. Such fractional-pixel resolution motion vectors may refer to fractional positions in the reference image.



Figure 2.9: INTRA-16 × 16 supported modes.


Figure 2.10: Possible partitions of a macroblock in H.264 Inter prediction.

In order to estimate and compensate fractional-pixel displacements, the reference image has to be conveniently interpolated, using a set of well-defined filters [1].

A type of prediction supporting both backward and forward references is also included in the standard [2]. These predicted frames, denoted as B-frames (bi-predictive frames), provide even more data compression, since their displacement vectors result from a combination of past and future predictions. Consequently, these frames have to be encoded in a different order than the one in which they are displayed, due to their prediction dependencies. Figure 2.11 depicts some examples of how P frames and B frames are referenced.

Another technique included in H.264 is the weighted prediction, which is particularly useful to encode fade effects. This feature allows the encoder to specify multiple reference frames for the same partitioned block when performing motion compensation, by weighting them using scalar factors [1]. Figure 2.12 depicts an example of a B frame encoding using multiple references.



Figure 2.11: Prediction types: (a) Type P; (b) Type P and B - presentation order; (c) Type P and B - codification order.


Figure 2.12: Usage of multiple reference frames in Inter prediction.

2.2.2.D Transform & Quantization

It is well known that the Human Visual System is much more sensitive to lower spatial frequencies than to higher ones. This fact is widely used in video compression, since some information at high spatial frequencies can be discarded to get a higher compression rate [6]. Previous video standards used the classical Discrete Cosine Transform (DCT) [1][6] to represent an image in the frequency domain (see Equation (2.8)).

$$F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{i=0}^{7}\sum_{j=0}^{7} f(i,j)\, \cos\frac{\pi u(2i+1)}{16}\, \cos\frac{\pi v(2j+1)}{16}, \qquad (2.8)$$

with

$$C(x) = \begin{cases} \frac{1}{\sqrt{2}} & \text{if } x = 0 \\ 1 & \text{otherwise} \end{cases}$$

Usually, this transformation is computed in matrix form using Equation (2.9)

$$Y = F X F^{T} \qquad (2.9)$$

where X is the block of pixels, F the DCT kernel matrix and Y the transformed coefficients. After the transformation, the top-left value represents the DC level, which can be regarded as the block's brightness average. The value immediately to its right represents the lower frequency horizontal information, and the value in the top-right corner represents the highest frequency horizontal information. The same pattern applies to the vertical direction. The bottom-right corner value represents the transform coefficient that corresponds to the greatest mutual variation in the horizontal and vertical directions.


Figure 2.13: Frequency domain representations of a 4 × 4 block.

Integer Discrete Cosine Transform

The major issue with the DCT is that the transformed coefficients are often irrational numbers that have to be rounded for digital representation. Accuracy and rounding are device-dependent, so the resulting coefficients might differ from device to device. Moreover, a floating-point unit is also needed to process irrational numbers, meaning that more complex hardware has to be used [1]. Instead of the classical DCT approach, H.264 uses integer transforms with properties similar to the DCT. The kernel size of these transforms is mainly 4 × 4 and, in special cases, 2 × 2. This type of transform has the following properties [13]:

• Inverse-transform mismatches are avoided since inverse transform is defined by exact inte- ger operations;

• IDCT uses integer 4x4 matrix elements, resulting in less noise around edges and less com- putations, since it only involves adds and shifts;

• It is almost as efficient in removing statistical correlation as the conventional cosine trans- form [1].

Three different types of transforms are used in the encoding process, as shown in Equation (2.10). The first type, defined by matrix H1, is applied to all prediction error blocks of the luminance and chrominance components, regardless of whether they result from Inter or Intra prediction. If the macroblock's prediction is INTRA-16 × 16, a second transform (Hadamard transform) is applied to all DC coefficients. This transform is defined by the kernel matrix H2 (for 4 × 4 luminance blocks) or H3 (for 2 × 2 chrominance blocks) [3].

$$H_1 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix} \qquad H_2 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix} \qquad H_3 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \qquad (2.10)$$

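A direct (non-optimized) C sketch of the core transform Y = H1 · X · H1ᵀ of Equation (2.10) is given below; as noted above, production implementations replace these matrix products by equivalent add-and-shift butterflies (illustrative code):

```c
/* Forward 4x4 integer transform of H.264: Y = H1 * X * H1^T, with H1
 * as defined in Equation (2.10). */
static const int H1[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 }
};

static void forward_transform_4x4(const int X[4][4], int Y[4][4])
{
    int T[4][4];

    /* T = H1 * X */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            T[i][j] = 0;
            for (int k = 0; k < 4; k++)
                T[i][j] += H1[i][k] * X[k][j];
        }

    /* Y = T * H1^T (note the transposed index on H1) */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            Y[i][j] = 0;
            for (int k = 0; k < 4; k++)
                Y[i][j] += T[i][k] * H1[j][k];
        }
}
```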

The transmission order of the 4 × 4 subblocks of each macroblock is shown in Figure 2.14. If the macroblock is predicted using the Intra prediction type INTRA-16 × 16, the DC coefficients are transmitted before their corresponding luminance or chrominance coefficients. For other prediction types, the DC coefficient transform is not computed and, therefore, no such coefficients are sent.

[Figure 2.14: the sixteen 4 × 4 luminance blocks are sent in order 0-15, preceded by the luminance DC block (-1, INTRA-16 × 16 only); the chrominance DC blocks (16, 17) are sent before the chrominance AC blocks (18-25).]

Figure 2.14: Block transmission order.

Scalar Quantizer

After being transformed, all coefficients are quantized by a scalar quantizer [13]. The quantization step size is defined by a quantization parameter QP, which supports 52 different values. These values are arranged so that an increase of 1 in QP represents an increase of approximately 12% in the quantization step size.
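This QP-to-step-size relation can be sketched as follows, with the step size doubling for every 6 QP units (a sketch assuming the commonly cited base table; the actual encoder uses integer multiplication and shift tables instead of floating point):

```c
/* Quantization step size for QP in [0, 51]: Qstep doubles for every
 * increment of 6 in QP, i.e. roughly +12% per QP unit. The base
 * values are the ones commonly quoted in the H.264 literature. */
static double qstep(int qp)
{
    static const double base[6] = { 0.625, 0.6875, 0.8125,
                                    0.875, 1.0, 1.125 };
    return base[qp % 6] * (double)(1 << (qp / 6));
}
```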

The transformed blocks are processed by a quantization block, in which most of the high frequency coefficients are suppressed. Before entropy coding can take place, the 4 × 4 quantized coefficients are serialized using a zig-zag scan pattern, in which the coefficients are ordered from low frequency to high frequency (Figure 2.16). Then, since the higher frequency quantized coefficients tend to be zero, run-length encoding is used to group trailing zeros, resulting in a more efficient entropy coding. For the 2 × 2 DC chrominance coefficients, a raster-scan pattern is used instead [6].

[Figure 2.15 legend: IDCT - Integer DCT; IHT - Integer Hadamard Transform; Q - Quantization]

Figure 2.15: Transform & Quantization block diagram.

Figure 2.15 summarizes the described Transform & Quantization process of the H.264 standard. All transform coefficients may be computed using only 16-bit additions and bit-shifting operations, and the scalar quantizer only requires 16-bit arithmetic [13].

[Figure 2.16 example - residual data block (raster order):
 7  6  0  0
-2 -1  1  0
 0  0  0  0
 0  0  0  0
Result of the zig-zag scan: 7, 6, -2, 0, -1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

Figure 2.16: Zig-zag scan of a transformed and quantized block coefficients.
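A C sketch of this 4 × 4 zig-zag serialization is shown below; applying it to the residual block of Figure 2.16 reproduces the sequence 7, 6, -2, 0, -1, 0, 0, 1, 0, ... (illustrative code):

```c
/* Zig-zag scan order of a 4x4 block, expressed as raster-scan
 * indices 0..15. */
static const int zigzag4x4[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* Serialize a 4x4 quantized block (raster order) into scan order,
 * so that run-length encoding can group the trailing zeros. */
static void zigzag_scan(const int block[16], int out[16])
{
    for (int k = 0; k < 16; k++)
        out[k] = block[zigzag4x4[k]];
}
```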

2.2.2.E Entropy Coding

Entropy coding, the final processing block of most video coding chains, is a lossless technique used for further data compression. The entropy coding stage maps symbols representing motion vectors, quantized coefficients, and macroblock headers into actual bits, which improves the coding efficiency by assigning a smaller number of bits to frequently used symbols and a greater number of bits to less frequently used symbols.

In previous standards, such as MPEG-1, -2, -4, H.261, and H.263, entropy coding was based on fixed tables of Variable Length Codes (VLCs). These predefined codes are based on the probability distributions of generic video sequences. However, H.264 uses different VLCs to match a symbol to a code, based on the context characteristics.

All syntax elements, such as the slice header information, the picture parameter sets, etc., are encoded by a VLC code [11]. The residual data, corresponding to the quantized transform coefficients of each macroblock, is encoded using one of two methods: a low-complexity technique based on the usage of context-adaptively switched sets of variable length codes (so-called CAVLC) [14], or a more efficient but computationally more demanding algorithm denoted context-based adaptive binary arithmetic coding (CABAC) [15], as described in the following paragraphs.

CAVLC and VLC

The knowledge of the statistical behavior of the quantized DCT coefficients is important for the design of efficient codes. Several statistical distribution studies have been proposed, in which the AC coefficients were conjectured to have Gaussian or Laplacian Probability Distribution Functions (PDF) [16]. Therefore, to maximize the coding efficiency, previous video encoding standards used Variable Length Codes (VLC) based on the Elias gamma code, due to its high efficiency and compression rate for all kinds of data with Laplacian distribution functions. The code for an integer coefficient X is constructed by a prefix of ⌊log₂(X)⌋ zeros, followed by the binary codification of X.

(a) PDF for unsigned values. (b) PDF for signed values.

Figure 2.17: AC coefficients probability distribution function.

Since the residual data are represented by signed values, the following equation is used as the mapping function:

$$Number = \begin{cases} 1 & \text{if } Value = 0 \\ 2|Value| & \text{if } Value > 0 \\ 2|Value| + 1 & \text{if } Value < 0 \end{cases}$$

In particular cases, only unsigned values are required. Therefore, the previous mapping function is reduced to

$$Number = Value + 1$$

When a quantized coefficient or a parameter value reaches the entropy coding block, its value is mapped into a VLC codeword through one of the previous functions. A few examples of VLC code words are shown in Table 2.2.

Table 2.2: VLC Mapping table.

Unsigned Value | Signed Value | Mapped Number | Cγ(x)
0   | 0   | 1   | 1
1   | 1   | 2   | 010
2   | -1  | 3   | 011
3   | 2   | 4   | 00100
4   | -2  | 5   | 00101
5   | 3   | 6   | 00110
6   | -3  | 7   | 00111
7   | 4   | 8   | 0001000
8   | -4  | 9   | 0001001
... | ... | ... | ...
18  | -9  | 19  | 000010011
... | ... | ... | ...
146 | -73 | 147 | 000000010010011
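The mapping functions and the Elias-gamma-style codes of Table 2.2 can be sketched in C as follows (illustrative code; in the H.264 syntax this family of codes is known as Exp-Golomb):

```c
#include <stdio.h>

/* Map a signed value to a positive code number, as described above. */
static unsigned map_signed(int value)
{
    if (value == 0)
        return 1u;
    return value > 0 ? 2u * (unsigned)value
                     : 2u * (unsigned)(-value) + 1u;
}

/* Print the VLC of a mapped number n: floor(log2(n)) leading zeros
 * followed by the binary representation of n. */
static void print_vlc(unsigned n)
{
    int bits = 0;
    for (unsigned t = n; t > 1u; t >>= 1)
        bits++;                                /* floor(log2(n)) */
    for (int i = 0; i < bits; i++)
        putchar('0');                          /* zero prefix */
    for (int i = bits; i >= 0; i--)            /* binary of n */
        putchar((n >> i) & 1u ? '1' : '0');
    putchar('\n');
}

int main(void)
{
    print_vlc(map_signed(-9));  /* prints 000010011, as in Table 2.2 */
    return 0;
}
```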


CABAC

CABAC uses an adaptive arithmetic coding method to encode each symbol [15]. The good compression performance provided by CABAC is mainly achieved through (a) the selection of good probability models for each syntax element, according to the element's context, (b) adaptive probability estimates, based on local statistics, and (c) the usage of arithmetic coding. A simplified CABAC block diagram is depicted in Figure 2.18.

Figure 2.18: Simplified CABAC block diagram.

The CABAC encoding process involves the following steps:

• Context model selection - a context or probability model for one or more elements of the binary symbols is selected from the available models, depending on the statistics of previously encoded syntax elements;

• Binarization - a given non-binary-valued symbol is uniquely mapped to a binary sequence prior to arithmetic coding;

• Binary arithmetic coding - an arithmetic coder encodes each element according to the selected probability model;

• Probability update - the selected context model is updated, based on the actual coded value.

2.2.2.F In-loop Deblocking Filter

H.264 is a block-based standard, meaning that during the coding process a frame is divided into macroblocks, and sometimes even subdivided into blocks, which are processed separately by each of the encoder's processing blocks. As a consequence, some blocking artifacts are often introduced. These blocking artifacts usually decrease the video quality significantly, since the frames that are stored in the DPB are subsequently used in the prediction of the following frames [17]. Two main techniques are usually used to reduce the blocking artifacts: post-filters and deblocking filters. In post-filter techniques, the video frame is filtered after the decoding process. This filter cannot prevent noise propagation and is not a normative part of the standard. Another solution is to add a deblocking filter in the coding loop, after the Transform & Quantization blocks.


The filter, applied in both the encoder and the decoder, smoothes the blocks' edges, in order to improve the decoded image appearance and reduce the residual error of the motion-compensated prediction of subsequent frames. As expected, with this technique the encoded videos have better quality (Figure 2.19).


Figure 2.19: Frame 34 of QCIF Foreman sequence encoded at 40Kbps (a) without filter; (b) with loop filter.

Deblocking Filter's reconstruction The deblocking filter is applied in the encoder and in the decoder after the computation of the inverse transform. The filter is applied to each macroblock to reduce the blocking artifacts without reducing the picture's sharpness, and its output is used for the motion compensated prediction of subsequent frames. The filtering operation can be divided into three main steps: filter strength computation, filtering decision and filter implementation.

Filter strength computation The deblocking filter is an adaptive filter that adjusts its strength according to the compression mode of the macroblock (inter or intra), the quantization parameter, the motion vector, the frame or field coding decision and the gradient of the pixels that cross the boundary. The boundary strength is computed for each edge that separates neighbouring 4 × 4 luminance blocks and, for each edge, a value between 0 (no filtering) and 4 (strong filtering) is assigned. The rules for selecting this integer value are illustrated in the flow chart of Figure 2.20. The computed luminance strength is also applied to the chrominance components. The application of these rules results in strong filtering in areas where there is significant blocking distortion, such as the boundaries of intra coded macroblocks or the boundaries between blocks containing coded coefficients.

Figure 2.20: Boundary strength computation flowchart.

Filtering decision At this point, it is decided, for each block edge, whether filtering is applied or not. That decision does not depend only on a non-zero boundary strength, i.e., deblocking filtering may not be needed even in the case of a non-zero boundary strength. This is especially true when the image includes sharp transitions across the edges and unwanted blurry areas should be avoided. Hence, filtering should only be applied if the set of pixels of the two adjacent blocks (Figure 2.21) meets the following conditions:

• Boundary strength is greater than zero;

• |p0 − q0| < α ∧ |p1 − p0| < β ∧ |q1 − q0| < β, where α and β are threshold values defined in the standard.

Figure 2.21: Pixels adjacent to vertical and horizontal boundaries.

These threshold values increase with the average of the blocks' quantization parameters (QP). In the presence of a small QP, small transitions across boundaries are likely due to image features rather than to blocking effects, and so α and β are low. But if QP is large, the blocking distortion is likely to be significant, so α and β are higher and more of the boundary's pixels are filtered. If a significant change across the boundary occurs in the original image, the filter is switched off, since such a change is very unlikely to have been created by blocking artifacts.


Filter Implementation The filtering is applied to the picture's luminance and chrominance components, for vertical and horizontal block edges, as shown in Figure 2.22.


Figure 2.22: Macroblock vertical and horizontal edges.

The computation of the output filtered pixels involves a 4- or 5-tap filter, depending on the boundary strength, and it starts with the vertical edges, followed by the horizontal edges.

2.2.2.G Interpolation Block

One of the greatest innovations introduced in the most recent standards is the support of quarter-pixel accurate motion vectors [1]. With this, the encoder can predict more accurately the movement in inter predicted frames and increase the encoding performance. Nevertheless, this improvement usually imposes a significant computational cost. The quarter-pixel image is computed in the interpolation module: first, a 6-tap Wiener filter is applied in order to compute the half-pixel images; then, the quarter-pixel images are computed by applying a bilinear filter to the previously computed pictures. This image processing block represents the second most computationally complex module in the encoder [18].

Interpolation algorithm In the H.264 standard, luminance half-pixel samples are computed through a 6-tap Wiener filter, in both the horizontal and vertical directions, as depicted in Figure 2.23. The interpolation algorithm is implemented through the following steps: compute all horizontal and vertical half-pixel values, using the integer pixel values; compute all diagonal half-pixel values; compute the quarter-pixel values. To compute the horizontal or vertical half-pixel values, 6 adjacent pixels are needed to implement the 6-tap Wiener filter - Equations (2.11).

hp′ = (i_{−2} − 5 i_{−1} + 20 i_{0} + 20 i_{1} − 5 i_{2} + i_{3} + 16) >> 5 (2.11)

hp = Clip_{0}^{255}(hp′)

where i_{j} are the pixels, and hp′ and hp are the half-pixel values computed before and after truncation.



Figure 2.23: Half-pixel directions.

For example, to compute the value of the half-pixel b, Equation (2.11) is applied, yielding:

b′ = (E − 5F + 20G + 20H − 5I + J + 16) >> 5

b = Clip_{0}^{255}(b′)

For the half-pixel values in the diagonal direction, like j, either the vertical or the horizontal half-pixels can be used. However, this requires that the vertical or the horizontal half-pixels have already been computed. The luminance quarter-pixel values are generated by a much simpler bilinear filter, as shown in Figure 2.24. Equations (2.12) define the particular filtering equations that should be applied to each type of quarter-pixel.

aa = (G + b + 1) >> 1, for horizontal quarter-pixel; (2.12a)

gg = (G + h + 1) >> 1, for vertical quarter-pixel; (2.12b)

uu = (h + b + 1) >> 1, for diagonal quarter-pixel; (2.12c)


Figure 2.24: Quarter-pixel directions.

The chrominance half-pixel and quarter-pixel values are computed using the bilinear filter.

2.2.2.H Network Abstraction Layer

For efficient transmission in different environments, the standard is implemented over a network protocol called Network Abstraction Layer (NAL), which provides the interface between the encoder/decoder and the outside world (see Figure 2.25). The Video Coding Layer (VCL) specifies an efficient representation for the coded video stream, while the NAL defines the transport protocol.


Packet-based protocols are supported, in most networks, through NAL units. In addition to the NAL concept, the VCL itself also includes several features to provide network friendliness and error robustness for real-time services, such as streaming, multicasting and conferencing applications [19].


Figure 2.25: NAL interface between the encoder/decoder and the transport layer.

2.2.3 H.264 Profiling

To achieve a better compression rate without losing quality, the computational complexity of the H.264 encoder is much higher than that of the previous standards [9][20][19]. Throughout the research work that was conducted in the scope of this dissertation, the relative computational cost of each module of the reference encoder was measured. These measures were taken with the GNU profiler - gprof - considering three GOP types:

• GOPs including only I frames - to assess the relative computational cost of the main Intra blocks;

• GOPs including I and P frames - to assess the Inter prediction cost over the Intra predic- tion, considering only past frames as references;

• GOPs including I, P and B frames - where both backward and forward references are considered.

Each of the considered cases was tested using 4 test video sequences in CIF format, with different motion and detail properties. The results were acquired using gprof, which uses time markers to estimate the time consumption of each function. Therefore, for more accurate results, 268 frames from the video sequences Soccer, Harbour, Crew and City were set to be encoded. The results are shown in Table 2.3. From Table 2.3, it is possible to conclude that for an I type GOP the interpolation block represents approximately 50% of the encoder's complexity. Although Inter prediction is not used in an I type GOP, the interpolation cost is still present in the processing chain of the reference software. Note that this software version is not optimized. Nevertheless, it was also observed that such a weight becomes meaningless when introducing P, or P and B, frames. In fact, for the other types of GOP, the motion estimation and motion compensation blocks assume the role of the heaviest block, with more than 80% of the encoder's computational complexity [9][20][19].

Table 2.3: Relative complexity analysis of the encoder blocks for each GOP type.

Video     GOP   Interpolation   Intra   Inter   DCT    In-loop Filter   Entropy Coding   Others
Soccer    I     50.9%           16.0%   −       8.5%   5.7%             7.6%             11.3%
          IP    8.2%            1.0%    84.9%   1.5%   0.5%             0.6%             3.3%
          IBP   2.3%            0.9%    89.1%   1.4%   0.4%             0.4%             6.9%
Harbour   I     49.7%           15.1%   −       8.4%   5.5%             11.7%            9.6%
          IP    10.2%           1.3%    80.6%   2.0%   1.1%             1.0%             3.8%
          IBP   2.9%            1.1%    86.0%   1.7%   0.7%             0.8%             8.5%
Crew      I     51.6%           15.2%   −       8.5%   5.4%             7.9%             11.4%
          IP    8.7%            1.1%    83.5%   1.8%   0.8%             0.9%             3.2%
          IBP   2.4%            0.9%    88.4%   1.6%   0.5%             0.6%             7.2%
City      I     49.8%           17.4%   −       7.7%   5.0%             9.8%             10.3%
          IP    9.7%            1.2%    82.3%   2.1%   0.4%             0.6%             3.7%
          IBP   2.7%            1.1%    87.5%   1.7%   0.5%             0.3%             7.8%

Table 2.4: Relation between the GOP types and the resulting frame rate, bit rate and PSNR, for the CIF video sequences.

Parameter          GOP       Soccer   Harbour   Crew     City
Frame Rate (fps)   GOP I     29.90    27.07     29.77    28.35
                   GOP IP    2.34     2.81      2.50     2.72
                   GOP IBP   1.86     2.25      1.99     2.13
Bit Rate (kbps)    GOP I     2057.3   4112.8    1946.2   3095.7
                   GOP IP    646.2    1258.5    834.5    371.1
                   GOP IBP   598.0    974.2     749.8    289.5
PSNR (dB)          GOP I     36.77    35.36     38.01    35.85
                   GOP IP    35.59    34.03     36.55    34.71
                   GOP IBP   35.06    33.27     35.96    34.37

Table 2.4 shows how the frame rate, the bit rate and the PSNR change with these three types of GOP. As expected, the frame rate and the PSNR reach their maximum in I type GOPs, meaning that the encoder's throughput is higher and the best video quality is achieved. As a side effect, the resulting bit rate also reaches its maximum, which implies a lower compression rate. On the other hand, when considering the IBP type GOP, the compression rate increases by more than 70%. However, the frame rate is much lower, due to the high computational cost imposed by the inter prediction, which leads to a significant reduction of the encoder's performance.

3 Multiprocessor Architectures

Contents 3.1 Classification of Parallel Processor Systems ...... 30 3.2 Symmetric Shared-Memory Multiprocessor ...... 31 3.3 Non-Uniform Memory Access ...... 37 3.4 Communication Mechanisms ...... 38 3.5 Synchronization ...... 42 3.6 Data Parallelism and Programming Languages ...... 43


Since the beginning of computing, scientists have endeavoured to make machines solve problems better and faster. Miniaturization technology has resulted in improved circuitry, and more of it on a chip. Clocks have become faster, leading to CPUs in the gigahertz range.

However, it is known that there are physical barriers that limit how much a single processor's performance can be improved. Chip transistor density is limited by heat and electromagnetic interference and, even when these problems are solved, processor speeds will always be constrained by the speed of light. There are also economic limitations: at some point, the cost of making a processor incrementally faster will exceed the price that anyone is willing to pay for it [5].

These issues have left only one feasible way to improve the processor performance: to distribute the computational load among several processors. For these reasons, parallelism is becoming increasingly popular. To make an application benefit from parallelism, programmers have to adapt their algorithms to the multiprocessor architectures, by developing synchronization procedures and other important aspects of process administration. When implemented correctly, parallelism results in higher throughput, better fault tolerance, and a more attractive price/performance ratio. Nevertheless, although parallelism can result in significant speedup levels, it is often very hard to obtain the maximum benefit from it: no matter how well an application is parallelized, there will always be some part of the processing to be done serially, leaving the remaining processors blocked [21].

3.1 Classification of Parallel Processor Systems

The taxonomy that was first introduced by Flynn, in 1972, is still the most common way of categorizing systems with parallel processing capability [21][22][23]. Flynn's taxonomy considers two factors: the number of instruction streams and the number of data streams that flow into the processor. It proposes the following categories of computer systems:

• Single Instruction, Single Data (SISD) stream – a single processor executes a single instruction stream to operate on data stored in a single memory. Uniprocessors fall into this category.

• Single Instruction, Multiple Data (SIMD) stream – each processing element has an as- sociated data memory, so that each instruction is executed on a different set of data by the different processors. Vector and array processors fall into this category.

• Multiple Instruction, Single Data (MISD) stream – a sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence. This structure has not been commercially implemented.

• Multiple Instruction, Multiple Data (MIMD) stream – a set of processors simultaneously execute different instruction sequences on different data sets. UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access) systems fit into this category.

• Single Program, Multiple Data (SPMD) stream – a recent addition to Flynn's taxonomy, which consists of multiple processors, each one running the same program but operating on different data sets. Supercomputers fit into this category.

Flynn's taxonomy is depicted in Figure 3.1. Since the MIMD organization is a generalization of the other classes, it has been adopted by most commercial general-purpose processors to exploit thread-level parallelism.

Figure 3.1: A taxonomy of parallel processor architectures.

MIMDs can be further classified by the adopted strategy to share information among the processors [21][22]. In Uniform Memory Access (UMA) architectures, the main memory, or pool of memory devices, is physically and uniformly shared by all processors. In a UMA architecture, the access time to an individual memory location is independent of which processor makes the request or of which memory module contains the requested data. The most common form of such systems is denominated symmetric multiprocessor (SMP). These architectures contrast with Non-Uniform Memory Access (NUMA) architectures. In such structures, each processor has its own local memory and, therefore, the access time depends on the intended memory's physical location. A cluster, a collection of independent uniprocessors or SMPs, is an ordinary example of a NUMA architecture. Examples of UMA and NUMA architectures are depicted in Figures 3.2 and 3.3. In the following sections, a special attention will be devoted to the MIMD class of architectures, with a particular emphasis on symmetric shared-memory architectures.

3.2 Symmetric Shared-Memory Multiprocessor

Figure 3.2: Basic structure of a centralized shared-memory multiprocessor. Multiple processor-cache subsystems share the same physical memory and the memory access time is uniform to all the processors.

In the last decade, new techniques and manufacturing processes were introduced in the production line of modern processors. With them, the required silicon area was also reduced, leaving free space to integrate larger caches and thus relieving the main memory bandwidth demands of a single processor. Soon after, it was observed that if the main memory bandwidth demand of a single processor is reduced, multiple processors may be able to share the same memory [21]. As a result, new computer architectures, based on the SMP paradigm, were developed. A processing system is considered an SMP structure if it presents the following characteristics:

1. There are two or more similar processors of comparable capability;

2. The main memory and the I/O facilities are shared by all processors, through an interconnection bus or some other internal connection scheme, such that the memory access time is approximately the same for each processor;

3. All processors can perform the same operations (hence the term symmetric);

4. The system is controlled by an integrated operating system that provides the interaction between the processors and their programs.

An example of an SMP architecture is depicted in Figure 3.4. In terms of the data stream, the processed data can be classified as shared or private. Private data is only used by a single processor, while shared data may be used by multiple processors. The latter kind of data essentially provides the communication among processors. When a private item is cached, its contents are kept in the cache, reducing the average access time as well as the requested memory bandwidth. Since no other processor uses that data, the program behaviour is similar to an execution in a uniprocessor. On the other hand, when shared data is cached, the shared value may be replicated in multiple caches, to reduce the access latency and the memory bandwidth [22]. This replication also provides a reduction in the contention that may occur for shared data items that are being simultaneously read by multiple processors. Caching of shared data, however, introduces a new problem: cache coherence. This important issue will be discussed in the following subsection.

Figure 3.3: Example of a NUMA structure, composed of individual nodes containing a processor, some memory, typically some I/O, and an interface to an interconnection network that connects all the nodes.

3.2.1 Cache Coherency

The essence of the cache coherency problem is the following: multiple copies of the same data can simultaneously exist in different caches and, if processors are allowed to freely update their own copies, the result can be an inconsistent view of the memory. The two common write policies used in caches are:

• Write back – write operations are usually made only in the cache. Main memory is only updated when the corresponding cache line is flushed from the caches;

• Write through – all write operations are made in the cache and in the main memory, ensuring that the data in the main memory is always valid.

From the previous description, it is clear that a write-back policy can result in inconsistency: if two caches contain the same data, and such data is updated in one cache, the other cache will unknowingly hold an invalid value. Subsequently, invalid reads will produce invalid results. Inconsistency can occur even with the write-through policy, unless the other caches monitor the memory traffic or receive some direct notification of the update [21][22]. The cache coherence protocols that have been proposed to mitigate these problems have generally been divided into software and hardware approaches, although some implementations adopt a hybrid strategy that involves both software and hardware elements.



Figure 3.4: Simplified block diagram of the Intel quad-core Xeon processor.

3.2.1.A Software Solution

Software cache coherence schemes attempt to avoid the need for additional hardware circuitry and logic, by relying on the compiler and on the operating system to deal with the problem [22]. Software approaches are attractive because the overhead of detecting potential problems is paid at compile time instead of at run time, and the design complexity is transferred from the hardware to the software. On the other hand, compile-time software approaches usually make very conservative decisions, thus frequently leading to an inefficient cache utilization. One of the simplest approaches is to prevent any shared data variables from being cached. This is usually too conservative, because a shared data structure may be exclusively used during some periods and may be effectively read-only during other periods. Only during certain periods, when at least one process may update the variable and at least one other process may access it, is cache coherence an issue. More efficient approaches analyze the code to determine safe periods for shared variable accesses. The compiler then inserts specific instructions into the generated code to enforce cache coherence during the critical periods. A number of techniques have been developed to perform this analysis and to enforce its results [22].

3.2.1.B Hardware Solution

These solutions provide a run-time recognition of potential inconsistency conditions. Because the problem is only dealt with when it actually arises, there is a more efficient cache usage, leading to an improved performance over a software approach [22]. In addition, these approaches are transparent to the programmer and to the compiler, reducing the software development burden. In general, hardware schemes can be divided into two categories: directory and snoopy protocols [21][22].

Directory protocols Directory protocols collect and maintain the information about where copies of shared data reside in one location, called the directory. Typically, the directory is managed and manipulated by a centralized controller, integrated in the main memory controller. When an individual cache controller makes a request, the directory controller checks it and issues the necessary commands for data transfer between the memory and the caches, or between the caches themselves. It is also responsible for keeping up to date the state of the information that is being shared. Therefore, every local action that can affect the global state of a shared data element must be reported to the central controller [21][22].


Figure 3.5: Directory protocol: a) both CPUs get a shared block from memory to cache; b) both CPUs read the information from their respective caches, releasing the bus; c) CPU1 needs to update the block: an exclusive access request is sent to the Directory, which invalidates the block in cache 2; d) when CPU2 tries to access the shared and invalidated block, it sends a miss signal to the Directory, and a write back signal is sent to the cache controller of CPU1; e) the block is updated in the main memory; f) the block is transferred from memory to cache 2.

Typically, the controller maintains the information about which processors have a copy of each memory block (Figure 3.5(a)). Provided that it is not changed, the shared data may be read freely by each processor (Figure 3.5(b)). During an update process, before the writing processor can update a given block, it must request an exclusive access to such block from the controller. Before granting this exclusive access, the controller sends a message to all processors with a cached copy of that block, forcing each processor to invalidate its copy. Only after receiving the acknowledgements back from each processor is the writing process granted exclusive access (Figure 3.5(c)). On the other hand, when another processor tries to read a memory block that is exclusively granted to another processor, a miss notification is sent to the controller (Figure 3.5(d)). The controller then issues a write-back request to the processor holding that block, requiring that processor to write the block back to the main memory (Figure 3.5(e)). The block may then be shared for reading by the original processor and the requesting processor (Figure 3.5(f)).

Directory schemes suffer from the usual drawbacks of a central bottleneck and of the communication overhead between the cache controllers and the central controller. However, they are effective in large-scale systems that involve multiple buses or some other complex interconnection schemes [22].

Snoopy Protocols

In snoopy protocols, the responsibility for maintaining cache coherence is distributed among all cache controllers. A cache must recognize when a held memory block is shared with other caches (Figure 3.6(a)). When an update action is performed on a shared block, it must be announced to all other caches by a broadcast mechanism. Each cache controller is able to "snoop" on the network to observe these broadcasted notifications, and react accordingly (Figure 3.6(b)) [21][22].

Snoopy protocols are ideally suited to a bus-based multiprocessor, since the shared bus pro- vides a simple way of broadcasting and snooping. However, because one of the objectives of the use of local caches is to avoid bus accesses, care must be taken so that the increased bus traffic required for broadcasting and snooping does not cancel out the gains from the use of local caches [22].

Two basic approaches to implement the snoopy protocol have been explored: write invalidate and write update (or write broadcast) [21][22]. With a write-invalidate protocol, there can be multiple readers but only one writer at a time. Initially, a memory block may be shared among several caches for reading purposes. When one of the cache controllers wants to perform a write to that block, it firstly issues a message that invalidates that line in the other caches, making it exclusive to the writing cache. Once the block is exclusive, the owning processor can make fast local writes until some other processor requires the same memory block.

On the other hand, with a write-update protocol, multiple writers, as well as multiple readers, may coexist. When a processor wishes to update a shared block, the word to be updated is broadcasted to all other cache controllers, so that all caches containing that block can update it.

Neither one of these two approaches is superior to the other under all circumstances. The achieved performance depends on the number of local caches and on the memory read and write pattern. Some systems even implement adaptive protocols that employ both write-invalidate and write-update mechanisms.

Figure 3.6: Snoopy protocol: a) all caches read a shared block from memory; b) CPU2 updates the block and broadcasts that action; caches 1 and 3 snoop the bus, looking for changes in the shared block.

3.3 Non-Uniform Memory Access

With an SMP computational system, there is usually a practical limit on the number of processors that can be used in parallel. An effective cache scheme reduces the bus traffic between any processor and the main memory but, as the number of processors increases, this bus traffic also increases. Moreover, since the bus is also used to exchange cache-coherence signals, the bus traffic tends to increase even further. At some point, the bus becomes a performance bottleneck. According to reported results, the performance degradation seems to limit the number of processors in an SMP configuration to somewhere between 16 and 64 processors [21][22].

The processor limit in an SMP system is one of the driving motivations behind the development of cluster systems. However, each node in a cluster has its own private main memory, and applications no longer see a large global memory system. In effect, coherency must be maintained by software rather than by hardware [22]. This memory granularity affects performance, and software applications must be particularly customized to this environment. One possible approach to achieve large-scale multiprocessing is the NUMA architecture. The main objective is to maintain a transparent system-wide memory system, while allowing the coexistence of multiple multiprocessor nodes, each one with its own bus or other internal interconnect system.

The main advantage of a NUMA system is that it can deliver effective performance at higher levels of parallelism than SMP, without requiring major software changes. With multiple NUMA nodes, the bus traffic on each node is limited by each bus's speed. However, if many of the memory accesses are addressed to remote nodes, performance begins to break down [22]. This performance breakdown can be avoided by adopting the following measures. First, the L1 and L2 caches should be designed in order to minimize all memory accesses, including the remote ones. If much of the running software has good temporal locality, then the remote memory accesses should not be excessive. Second, if the software has good spatial locality, and if virtual memory is in use, then the data that is needed by an application will reside on a limited number of frequently used pages, which can be initially loaded into the local memory of the running application. Finally, the virtual memory scheme can be enhanced by including, in the operating system, a migration mechanism that moves a virtual memory page to the node that is frequently using it.

NUMA parallel processing structures also present some disadvantages. First, NUMA structures are hardly as transparent as SMP: software changes are often required to adapt the SMP operating system and applications to a NUMA system. These include page allocation, process allocation and load balancing by the operating system. Also, if the memory accesses to remote nodes increase, the performance begins to break down [22].

3.4 Communication Mechanisms

Communication is essential for synchronized processing and data sharing in parallel MIMD systems [23]. The manner in which messages are passed among the system components defines the overall system design. The most common solutions are the usage of a shared memory or of an interconnection network. Shared memory systems include one large memory that is accessed identically by all processors. On the other hand, in interconnected systems each processor has its own memory, but the processors are allowed to access the other processors' memories via the network. Both solutions have, of course, their strengths and weaknesses. Interconnection networks are often categorized according to the adopted topology, routing strategy, and switching technique. The network topology, i.e. the way in which the components are interconnected, is a major factor in the overhead cost of message passing. The efficiency of message passing is limited by:

• Bandwidth – the network capacity to carry the information;

• Message latency – the time required for the first bit of a message to reach its destination;

• Transport latency – the time the message spends in the network;

• Overhead – combination of excess computational time, memory, bandwidth, or other re- sources that are required to attain a particular goal.

Accordingly, most network designs attempt to minimize both the number of required messages and the distances over which they must travel.


Interconnection networks can be either static or dynamic. Dynamic networks allow the connection between two entities (either two processors or a processor and a memory device) to change from one communication instance to the next, whereas static networks do not. Interconnection networks can also be blocking or non-blocking. Non-blocking networks allow the establishment of new connections in the presence of other simultaneous connections, whereas blocking networks do not.

Static interconnection networks are mainly used for message passing and include a variety of types. Processors are typically interconnected using static networks, whereas processor-memory pairs usually employ dynamic networks.

Completely connected networks are those where all components are connected to all other components. These are very expensive to build and, as new entities are added, they become difficult to manage (see Figure 3.7(a)). Star-connected networks have a central hub through which all messages must pass (see Figure 3.7(b)). Although the hub can be a central bottleneck, it provides excellent connectivity. Linear array or ring networks allow any entity to directly communicate with its two neighbors, but any other communication has to go through multiple entities to arrive at its destination (see Figure 3.7(c)). Mesh and Mesh Ring networks are expansions of the previous networks (see Figure 3.7(d)). In these networks, the communication between two nodes is assured by more than one path, allowing a better distribution of the communication traffic. Tree networks arrange the entities in non-cyclic structures, meaning that communications between two non-adjacent nodes are done through the top nodes, which leads to potential communication bottlenecks at the root (see Figure 3.7(e)). Hypercube networks are multidimensional extensions of mesh networks, in which each dimension has two processors (see Figure 3.7(f)).


Figure 3.7: Static Network Topologies: a) Completely Connected; b) Star; c) Linear and Ring; d) Mesh and Mesh Ring; e) Tree; f) Three-Dimensional Hypercube.


Dynamic networks allow for a dynamic configuration of the network in one of two ways: either by using a bus or by using a switch that can change the routes in the network. Bus-based networks, illustrated in Figure 3.8, are the simplest and the most efficient when cost is a concern and the number of entities is moderate. Clearly, the main disadvantage is the bottleneck that can result from bus contention as the number of entities grows large. Parallel buses can alleviate this problem, but their cost is considerable.


Figure 3.8: A Bus-Based Network.

Switching networks use switches to dynamically change the routing. Two types of switches are usually adopted: cross switches and multi-stage switches. Cross switches are elementary switches that are either open or closed. Any entity (processor or memory) can be connected to any other entity by closing the required switches (establishing a connection) between them. Networks consisting of crossbar switches are fully connected, since any entity can communicate directly with any other entity, and simultaneous communications between different processor/memory pairs are allowed. Because of this property, crossbar networks are considered non-blocking networks. However, if a single switch exists at each cross-point, N entities will require N² switches. Thus, managing so many switches quickly becomes difficult and costly. Because of this, the usage of crossbar switches is usually restricted to high-speed multiprocessor vector computers. An example of a crossbar switch configuration is shown in Figure 3.9. The blue switches indicate closed switches. Each processor can be connected to only one memory device at a time, so there will be at most one closed switch per column.

The second type of switch is the multi-stage switch, generally formed by 2 × 2 switches. It is similar to a crossbar switch, except that it is capable of routing its inputs to different destinations, whereas the crossbar simply opens or closes the communications channel. A 2 × 2 interchange switch has two inputs and two outputs. At any given moment, the switch can be in one of four states: through, cross, upper broadcast and lower broadcast, as shown in Figure 3.10. In the through state, the upper input is directed to the upper output and the lower input is directed to the lower output. In the cross state, the upper input is directed to the lower output, and the lower input is directed to the upper output. In the upper broadcast state, the upper input is broadcasted to both the upper and lower outputs. In the lower broadcast state, the lower input is broadcasted to both the upper and lower outputs. The through and cross states are the ones that are most relevant to interconnection networks.



Figure 3.9: Example of a Crossbar Network.


Figure 3.10: Possible states of the 2 × 2 Interchange Switch: a) Through; b) Cross; c) Upper Broadcast; d) Lower Broadcast.

One of the most advanced classes of networks, the multistage interconnection networks, is built using 2 × 2 switches. The idea is to incorporate stages of switches, typically with the processors on one side and the memories on the other, with a series of switching elements in the middle. These switches dynamically configure themselves in order to form a path from any given processor to any given memory. The number of switches and the number of stages contribute to the path length of each communication channel. A slight delay may occur, as a consequence of the several switches through which the message must pass from the specified source to the desired destination. These multistage networks are often called shuffle networks, alluding to the pattern of the connections between the switches.

Many topologies have been suggested for multistage switching networks. These networks can be used to inter-connect processors in loosely coupled distributed systems, or they can be used in tightly coupled systems to control the processor-to-memory communications. A switch can be in only one state at a time, so it is clear that blocking can occur. Non-blocking multistage networks can be built by adding more switches and more stages. In general, an N-node Omega network requires log2(N) stages with N/2 switches per stage.



Figure 3.11: A Two-Stage Omega Network.

To conclude, it should be remarked that each method for interconnecting multiprocessors has its advantages and disadvantages. For example, bus-based networks are the simplest and the most efficient solutions when a moderate number of processors is involved. However, the bus becomes a bottleneck if many processors simultaneously make memory requests. All these topologies' characteristics are summarized in Table 3.1.

Table 3.1: Properties of some Interconnection Networks.

Property          Bus    Crossbar   Multistage
Speed             Low    High       Moderate
Cost              Low    High       Moderate
Reliability       Low    High       High
Configurability   High   Low        Moderate
Complexity        Low    High       Moderate

3.5 Synchronization

In the first multiprocessor systems, synchronization primitives such as locks, barriers, and condition variables were implemented in software. Most of such designs were based on simple bus-based shared memory architectures. With the recent advent of scalable shared-memory designs, increasingly sophisticated and flexible synchronization mechanisms and node controllers have been adopted. To maintain data consistency, these architectures usually require the implementation of such mechanisms in hardware.


3.5.1 Software Locks

Many software synchronization algorithms have been proposed for shared memory multiprocessors. Most of the existing software synchronization algorithms are built on simple atomic operations provided by the hardware. To achieve a good overall synchronization performance on a multiprocessor system, great care has to be taken in the way the lock data is laid out in memory, and the access to this data has to be carefully managed to avoid undue contention.

3.5.2 Hardware Locks

In a directory-based distributed shared memory system, the basic atomic operations are implemented at the directory controller, to avoid a significant increase of the hardware complexity. Both hardware and software locks must communicate with the directory controllers, in order to properly manage the locked cache block. Several hardware lock mechanisms are available in today's multi-core systems:

• Simple Centralized Hardware Lock Mechanism - when a directory controller receives a lock request, it either immediately replies with a lock grant (if the lock is free) or it marks the node as waiting for the lock. When a node releases a lock, it sends a lock relinquish to the corresponding directory controller, which either marks the lock as free (if no other node is waiting for it), or selects a random waiting processor and forwards the lock to that processor. This implementation is extremely simple, but starvation is possible and FIFO (First In First Out) ordering is not maintained.

• Centralized Order Hardware Lock Mechanism - this architecture implements the centralized scheme described above, with FIFO access guarantees.

• Distributed Hardware Lock Mechanism - in this protocol, each directory controller only records the next request in the distributed queue. This mechanism is preferable during periods of heavy load, since it reduces the number of messages required to acquire/release a lock.

• Adaptive Hardware Lock Mechanism - because centralized and distributed lock mechanisms perform quite differently during periods of heavy and light contention, adaptive hardware locks allow the adoption of a centralized scheme during periods of low contention and a switch to a distributed protocol during periods of heavy contention.

3.6 Data Parallelism and Programming Languages

In what concerns the processed data, it can be classified as private or shared. While private data may only be accessed by a single processor, shared data can be used by all cores, providing an easy communication means among the processors. However, with the generalized adoption of cache subsystems, several coherency problems often arise in the manipulation of shared data, since multiple copies can simultaneously exist in different caches, thus creating inconsistent data views. To solve these problems, several hardware protocols and control mechanisms have been proposed. Furthermore, proper synchronization schemes must also be introduced in order to control the access to shared data and resources. Although these mechanisms assure the correct program behaviour, they do so at the cost of processing time.

To take full advantage of multi-core systems, parallel programming software is usually developed using at least one of these models: the process model, where independent subroutines are assigned to processes that are subsequently distributed among the several cores; and the thread fork model, where subroutines are organized in threads. The first model is mainly used in cluster systems, while the second has been widely adopted in multi-core computers, due to its lightweight nature. As a consequence, the developed parallel platform was implemented using the thread fork model, due to its easier implementation and established support by most multi-core systems. Several APIs (Application Programming Interfaces) have been developed to efficiently exploit the processing capability offered by the current multi-core systems, mainly MPI, POSIX, OpenMP, CUDA and OpenCL.

3.6.1 MPI - Message Passing Interface

To achieve the highest performance, thousands of computers have been inter-connected with high-speed networks. Programming these clusters is usually done through MPI (Message Passing Interface), a low-level API which provides a standard communication mechanism and which is implemented on top of high-performance network drivers. MPI is based on a process fork model. Its main advantages are that it runs on both distributed and shared-memory systems and that, for suitable problems, it scales well to very large numbers of processors. However, application development in MPI is often rather difficult, since each node has to be separately programmed [24][25][26].

3.6.2 POSIX Threads

POSIX (Portable Operating System Interface for Unix), introduced in the early 80s, is the name of a group of standards specified by the IEEE to define the Unix Application Programming Interface (API). In this standard, parallel programming is exploited by using processes and threads. Processes are defined as independent execution units that contain their own state information, have individual address spaces, and only interact with each other via interprocess mechanisms, which are generally managed by the operating system. On the other hand, threads within a process share the same state and memory space, which leads to a lightweight processing flow with a low-latency context switching. Communication between threads is usually assured through shared variables.


When POSIX threads are used to introduce parallelism in an application, a careful development has to be conducted, since the threads' creation has to be manually done, using special and dedicated functions, and the parallel regions should be taken out-of-line and put into a separate subroutine. Moreover, data structures containing private information must be explicitly defined separately. Also, access to shared data is often assured through a pointer that is passed to each thread. As a result, huge code transformations frequently have to be done when parallelizing a fully sequential program. The following code is an example of programming multiple threads using POSIX threads [27].

Display "Hello World" using POSIX Threads.

#include <stdio.h>
#include <pthread.h>

/* Thread body: prints a message and terminates. */
void* print_message(void* arg) {
    printf("Hello World.\n");
    return NULL;
}

int main(int argc, char* argv[]) {
    pthread_t thread1, thread2;
    /* Create two threads running the same routine. */
    pthread_create(&thread1, NULL, print_message, NULL);
    pthread_create(&thread2, NULL, print_message, NULL);
    /* Wait for both threads to finish. */
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    return 0;
}

3.6.3 OpenMP

OpenMP is a standard defining a set of compiler directives for the C/C++ and Fortran languages, based on the thread fork model. These directives provide an easy and explicit way to define the code that can be executed in parallel. When OpenMP is used, parallel regions are simply defined using the #pragma keyword, reducing the parallelization complexity. By default, all variables other than those specifically declared in a private clause are shared, providing more transparency and an easier implementation. These expedite methods can then be used to distribute the work among the processors by using a schedule clause [27][28][26]. Despite its advantages, the scalability is limited by the number of cores of the target platform.

Display "Hello World" using OpenMP.

#include <stdio.h>

int main(int argc, char* argv[]) {
    /* The parallel directive makes every thread in the team execute
     * the following statement. */
    #pragma omp parallel
    printf("Hello World.\n");
    return 0;
}


3.6.4 CUDA and OpenCL

In the last years, General Purpose Graphic Processor Units (GPGPU) have emerged as a powerful alternative solution for computer graphics manipulation, and their highly parallel structure makes them very suitable to run complex algorithms. In the past, GPUs used to be programmed using the programmable shaders found in graphics APIs (OpenGL, DirectX), which required the adaptation of general purpose code to the graphics API. In 2007, NVIDIA released CUDA (Compute Unified Device Architecture), allowing code written in the C language to be executed on a GPU. CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs [29][30]. Meanwhile, a more generic framework, denoted by OpenCL (Open Computing Language), has also been used for GPU programming. Code written in OpenCL can be executed across heterogeneous platforms, mainly formed by CPUs and GPUs. Just like CUDA, OpenCL allows any application to use the GPU for non-graphical computing [31][30]. Nevertheless, despite their efficiency in parallel computation, the usage of CUDA or OpenCL often requires a high bus bandwidth between the CPU and the GPU, leading to a bus bottleneck.

4 Related work

Contents 4.1 Parallelizing the H.264 Decoder ...... 48 4.2 Parallelizing the H.264 Encoder ...... 49 4.3 Other Proposed Parallel Solutions ...... 52


The computational complexity of the H.264 standard became an important disadvantage when compared with the previous standards. Although it is a revolutionary standard, since it provides an excellent compression ratio and some further improvements in video quality, the lower frame rates that are usually achieved with conventional computation platforms reduce its application areas. To circumvent this problem, several approaches based on the usage of dedicated hardware have been proposed [32][33]. Meanwhile, a generation of multi-core processors was made available to the end-user consumer market and was rapidly introduced in consumer PCs. These new processors, capable of truly processing several tasks in parallel, opened the doors for new improvements and techniques in the software area. New versions of the H.264 decoder began to appear, providing excellent results. The next step, already initiated by some researchers, is to apply these multi-core processors to achieve efficient implementations of H.264 software encoders. This challenge proved to be hard: software solutions based on multi-core systems are hard to develop, and the H.264 encoder proved to have a lot of data dependencies in its processing mesh. In order to parallelize the algorithm, these dependencies have to be broken without sacrificing the video quality or the compression rate. At the same time, new achievements have been made to develop new and less complex, but not less efficient, motion estimation algorithms. As it was seen in Chapter 2, that block represents more than 50% of the overall computational time.

4.1 Parallelizing the H.264 Decoder

The demand for computational power increases continuously in the consumer market. In the past, this demand was mainly satisfied by increasing the clock frequency and by exploiting Instruction-Level Parallelism (ILP). Due to thermal constraints and high power consumption, in the last years processor designers have been facing serious restrictions to further increases of the clock frequencies. Moreover, current processor architectures have achieved a complexity level that makes it really hard to further increase the ILP. As a solution, new multi-core architectures have appeared on the market. These new architectures rely on the existence of sufficient Thread-Level Parallelism (TLP) to exploit the large number of cores, in order to satisfy the consumers' speedup demands.

Meanwhile, some H.264 decoder and encoder multi-core architectures have been proposed. In [34], a parallel H.264 decoder based on the 3D-Wave scheme is proposed. This solution is capable of dynamically detecting and exploiting the dependencies and the parallelism. The 3D-Wave processing scheme is based on the 2D-Wave architecture, which is explained in Chapter 5 and depicted in Figure 5.6. In the 2D-Wave scheme, parallelism at the macroblock level is exploited, by processing multiple macroblocks after assuring that all neighbouring dependencies are satisfied. Nevertheless, this solution presents, as its main disadvantage, a scalability problem: in a many-core system, this architecture cannot always use all the available cores, since the macroblocks' dependencies cannot be broken. The 3D-Wave scheme is obtained by extending this concept to the following frames (Figure 4.1), in which case the scalability problem is minimized.

Figure 4.1: 3D-Wave scheme: frames can be decoded in parallel, because inter-frame depen- dencies have a limited spatial range [34].

Concerning the results of this architecture, simulations of the 3D-Wave implementation proved that this scheme can leverage a multi-core system with up to 64 cores. While, for high definition video sequences, the 2D-Wave presents a speedup of about 8 for 16 cores or more, the 3D-Wave presents a performance gain of almost 45 on 64 cores. These results were achieved using only macroblock-level parallelism. Despite these great results, the application of these schemes in an encoder will certainly lead to lower performance results if a high performance motion search algorithm is not used.

4.2 Parallelizing the H.264 Encoder

In [35], a hierarchical parallel architecture of an H.264 encoder is proposed. Such a solution exploits the facility offered by H.264 to divide the video sequence into several groups (Groups of Pictures, or GOPs), each one containing the same number of frames. Each frame can be further divided into frame slices. Hence, the main idea is to use a cluster platform for the encoder's architecture, where a GOP is assigned to each cluster node. Each cluster node is usually a multi-core system, allowing further parallelization. Therefore, it is possible to parallelize the encoding procedure of each frame: for a certain node, the frame slices are executed as concurrent tasks (Figure 4.2). The resources available on clusters may vary from a single to multiple CPUs per node, and in every node one can usually find CPU multimedia extensions (like SIMD instructions) and a powerful graphics coprocessor. To make efficient use of all these computation resources, the proposed solution combines different programming approaches:

• Message Passing Parallelism - the usage of the MPI API allows the communication between cluster nodes;

• Multithread Parallelism - the OpenMP API is used to develop the slice-level parallelism;


• Optimized libraries - sequential code can be optimized by using additional resources, like SIMD extensions and GPUs (Graphics Processing Units), to perform complex operations.

These techniques are combined hierarchically. The analysis and implementation of every level is done independently, and the individual improvements of each level add up to improve the overall application performance.


Figure 4.2: Hierarchical H.264 parallel encoder [35].

The main issue of this architecture lies in the synchronization: in each node, the frames are processed at the same time, so a more computationally demanding GOP may lead to a higher waiting time. Nevertheless, this solution presents an increase in performance and a high level of scalability.

Concerning economic factors, the investment needed to afford a cluster system is very high. Moreover, this kind of solution can hardly be applied in embedded systems or in Personal Computers (PCs).

In [36], a parallel encoder architecture based on the exploitation of thread-level parallelism is proposed for Intel Xeon processors with Hyper-Threading Technology. As a first approach, the most computationally demanding modules were accelerated using SIMD instructions, which increased the encoder performance by 2-3×. Furthermore, to take full advantage of the Hyper-Threading Technology and of the multi-core capabilities, slice-level parallelism is exploited: frames are divided into several slices and an independent thread is assigned to each slice. The data independency is assured by the slice properties (macroblocks from two different slices are independent). On the other hand, the division of frames into slices increases the bit-rate, and this increment may be significant for smaller video formats.

This solution presents a performance increase of 4× when compared with the optimized sequential version.



Figure 4.3: Cell solution encoding flow with macroblock-level parallelism exploitation.

Another parallel High Definition (HD) encoding solution was implemented on the Cell processor [37]. The Cell Broadband Engine is a heterogeneous multi-core processor consisting of two kinds of processors on a single die: PowerPC Processor Elements (PPEs), for general processing, and Synergistic Processor Elements (SPEs), for multimedia processing. The PPE is a general-purpose processor of the PowerPC architecture. It runs the operating system and manages the SPE tasks. The PPE is equipped with 32 KByte L1 data and instruction caches and a 512 KByte L2 cache. It also supports thread concurrency, being able to execute two threads at the same time. The SPEs are mainly used to process multimedia contents, such as video and audio. These processors are equipped with a SIMD instruction set.


Figure 4.4: Cell solution encoding flow with macroblock-level and slice-level parallelism exploitation.

This solution proposes a pipelined encoding algorithm for the Cell multi-core processor, based on macroblock-level and slice-level parallelism. The SPEs are used in a pipeline architecture to exploit macroblock-level parallelism: the frame is divided into bars (groups of adjacent macroblock columns) and each bar is processed by an SPE (see Figure 4.3). The main issue of this solution lies in the

pipeline delay at the end of the encoding process. This issue can be overcome through slice-level parallelism: the frame is divided into independent slices and a concurrent pipeline chain is added for each slice (see Figure 4.4). Concerning the results of the proposed solution, the achieved performance is almost ideal for 8 and 16 SPUs.

4.3 Other Proposed Parallel Solutions

In the past few years, GPU performance has increased exponentially. With multiple cores and a large memory bandwidth, the computational capability of today's GPUs has surpassed that of many CPUs. Due to these high computational capabilities, many GPUs nowadays serve not only to accelerate the graphics display, but also to speed up non-graphics applications, such as linear algebra computations, scientific simulations and complex algorithms. Accordingly, in [29] motion estimation is implemented with GPU support. In this solution, a macroblock is firstly divided into sixteen 4 × 4 blocks, and the SAD value is computed in parallel for all the candidate motion vectors of each 4 × 4 block. Then, the 4 × 4 blocks are merged to form 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8 and 16 × 16 blocks, and the corresponding SADs are computed. For each block size, the SADs of all candidate motion vectors are compared and the motion vector with the least SAD is selected as the integer-pixel motion vector. The fractional-pixel motion vector is then computed: the reference frame is interpolated using the six-tap and bilinear filters defined in the H.264 standard, and the SADs at the 24 fractional pixel positions adjacent to the best integer motion vector are computed. The candidate motion vector with the least SAD is the fractional motion vector. This algorithm is exhaustively repeated for all the block sizes in a macroblock. The solution exploits this block-based algorithm in parallel, using the CUDA platform. With this architecture, the motion estimation achieves a maximum speedup of 12 for high-resolution video sequences.

5 Open Platform for Parallel H.264 Video Encoding

Contents
5.1 Assumptions
5.2 Structures Redesign and Code Improvements
5.3 Levels of Parallelism in H.264 Encoding
5.4 Heterogeneous vs Homogeneous Multi-core Platform
5.5 Slice Definition
5.6 Data Partitioning
5.7 Parallel Programming of H.264 Video Encoders


One of the most important contributions of the H.264 video standard was the introduction of many powerful tools and encoding techniques that are capable of producing a lower bit-rate stream with a higher video quality [19]. On the other hand, the computational weight imposed by these tools [9][20][19] restricted the areas where software versions of this encoder can be used, since acceptable frame rates in a moderate resolution format (e.g. CIF) are only achieved by sacrificing the output video quality.

Meanwhile, microprocessor manufacturers have changed the offered architectural paradigm and started the current dissemination of multi-processor systems integrated in a single die. Since then, software designers have the means to optimize and speed up their algorithms by using parallel approaches, implemented on off-the-shelf computers.

In the presented work, the H.264 software encoder was adapted in order to suit the capabilities of these new microprocessors. To do so, the reference software was modified to take advantage of the multi-core capabilities provided by current processors. The main goal was to divide the encoding task into smaller and independent tasks, which can be processed in parallel, in order to decrease the processing time (Figure 5.1).


Figure 5.1: Example of a main task divided in smaller sub-tasks running in a multi-core system.

In this section, all the assumptions that were made during the development of the implemented platform are presented. Then, the several steps that were taken to achieve the final solution are fully described, covering the following points: redesign of the data structures; parallelism exploitation at the slice level; parallelization of the intra prediction procedure; and parallelization of the inter prediction procedure.


5.1 Assumptions

To confine the complexity level involved in the implementation of the current project, some restrictions on the possible encoding parametrization had to be imposed. Such design tradeoffs were adopted by taking the following aspects in consideration:

• Dependency - some parameters were disabled in order to define independent data structures that can be processed in parallel, for example the rate control;

• Compression rate - it is important to keep the high compression rate achieved by the reference encoder;

• PSNR and bit rate - the encoder’s configuration should produce a video stream with an acceptable quality and bit rate;

• Complexity - since some algorithms of the H.264 encoder impose a heavy computational cost, some optional techniques whose improvement results in a PSNR gain lower than 1 dB were disabled.

In the remainder of this subsection, a brief description of the considered tradeoffs and of the adopted design parametrization is provided.

Video Format It is well-known that the HVS is less sensitive to color (chrominance) than to brightness (luminance). Therefore, only the 4:2:0 subsampling format is considered. This assumption lowers the output bit rate and the required processing time, since fewer blocks are used for the chrominance components [6]. (YUVFormat=1)

Motion Estimation Search Range This parameter delimits the motion estimation search window: a higher range might provide more precise motion vectors, but it lowers the frame rate; a lower range increases the frame rate, but decreases the resulting PSNR [7]. Hence, as a tradeoff, a maximum search range of 32 pixels was defined. (SearchRange=32)

Error Metric As mentioned before, three error metrics are usually adopted: SAD, SSE and SATD. The SAD metric was chosen, due to its lower computational complexity [8]. (MDDistortion=0)
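For illustration purposes, a minimal C sketch of the SAD computation for a 16 × 16 macroblock is shown below (the function and parameter names are assumptions, not the reference software's):

#include <stdlib.h>

/* Sum of Absolute Differences between the current 16x16 macroblock and a
 * candidate block of the reference frame; 'stride' is the frame width. */
static int sad_16x16(const unsigned char *cur, const unsigned char *ref,
                     int stride)
{
    int sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += abs(cur[x] - ref[x]);
        cur += stride;
        ref += stride;
    }
    return sad;
}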

Chrominance in Motion Estimation In order to increase the speed of the motion estimation module, the chrominance components are not considered, resulting in an insignificant decrease of the obtained PSNR due to the HVS properties [6]. (ChromaMEEnable=0)


Number of Reference Frames

A larger number of reference frames usually improves the compression efficiency and the output video quality. However, it considerably reduces the encoding throughput, since each decision has to be made for each reference frame. On the other hand, the temporal correlation is usually much higher for the adjacent references. Since each reference frame must be stored in the encoder's frame memory, the involved storage cost significantly increases when a larger number of references is considered [7]. Therefore, a maximum of three backward references and one forward reference are used.

Bi-prediction Motion Estimation

This option is currently only available for 16 × 16 blocks and defines the computation of the motion vector as a linear combination of a pair of forward and backward prediction signals [38]. Due to its complexity and to its availability only for the 16 × 16 block size, this motion estimation mode was disabled. (BiPredMotionEstimation=0)

Picture and Macroblock Interlace

The techniques denoted as Picture Adaptive Frame Field (PAFF) and Macroblock Adaptive Frame Field were added to H.264/AVC to achieve a higher compression efficiency for interlaced video contents [39]. However, due to the nature of these algorithms, the introduced computational complexity increases significantly. For that reason, these techniques were disabled. (PicInterlace=0 & MbInterlace=0)

Direct Mode Prediction

In direct mode prediction, either spatial or temporal correlation can be used. Due to scene transitions, prediction based on temporal correlation (backward and forward frames) may diverge from the best computed motion vector [40]. As a consequence, spatial correlation was used for direct mode prediction. (DirectModeType=1)

Partition Mode

This feature provides the ability to separate the most important from the least important syntax elements in different data packages, enabling the application of Unequal Error Protection (UEP) and other types of error/loss robustness improvements [3]. Considering a low noise channel, it is assumed that all syntax elements are sent in a single data package. (PartitionMode=0)

Rate Distortion Optimization

H.264 introduced three models for macroblock mode encoding: simplified, complex and lossy. In the simplified mode, the best mode for a given macroblock is selected based only on an error metric:

J_{Motion}^{R,D} = D_{DFD}    (5.1)

where D_{DFD} is the prediction error between the current and the reference block [41]. The other modes are based on Lagrangian multipliers [42], which are applied in two stages: motion compensation and residue coding. In the motion compensation stage, for each block B with a fixed block mode M, the motion vector associated with the block is selected through a joint rate-distortion (RD) cost function:

J_{Motion}^{R,D} = D_{DFD} + \lambda_{Motion} R_{Motion}    (5.2)

where R_{Motion} is the estimated bit rate to encode the motion vector and J_{Motion}^{R,D} is the joint R-D cost, comprising R_{Motion} and D_{DFD}; \lambda_{Motion} is the Lagrange multiplier that controls the weight of the bit rate cost. J_{Motion}^{R,D} is widely used to determine the optimal displacement vector. In a similar way, the joint cost of distortion and block mode selection in the residual coding stage can be written as:

J_{Mode}^{R,D} = D_{Rec} + \lambda_{Mode} R_{Rec}    (5.3)

where R_{Rec} is the estimated bit rate associated with mode M, D_{Rec} is the difference between the reconstructed macroblock and the reference one, and \lambda_{Mode} is the Lagrange multiplier. Due to its lower complexity and to the higher level of independency between the encoder's processing blocks, the simplified mode was selected. (RDOpt=0)

Motion Estimation Algorithm Due to its lower complexity when compared with the full search algorithm [43], as well as to the absence of dependencies on the previously encoded macroblocks, the Simplified Uneven Multi-Hexagon Search (UMHexagonS) algorithm was selected.
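For illustration, the core refinement idea behind hexagon-based search can be sketched in C as follows. This is a strong simplification of UMHexagonS (which adds prediction of the starting point, early termination and several additional search patterns), and sad_at() is a hypothetical cost callback; search-range clipping is omitted:

/* Large-hexagon search followed by a small diamond refinement: the centre
 * moves along the hexagon while a better point exists, and the diamond
 * refines the final position. (*mx, *my) holds the starting/best vector. */
void hexagon_search(int *mx, int *my, int (*sad_at)(int, int))
{
    static const int hex_dx[6] = { -2, -1, 1, 2,  1, -1 };
    static const int hex_dy[6] = {  0, -2, -2, 0, 2,  2 };
    static const int dia_dx[4] = { -1, 0, 1, 0 };
    static const int dia_dy[4] = {  0, -1, 0, 1 };
    int best = sad_at(*mx, *my);
    int moved = 1;

    while (moved) {                       /* follow the large hexagon */
        moved = 0;
        for (int k = 0; k < 6; k++) {
            int cost = sad_at(*mx + hex_dx[k], *my + hex_dy[k]);
            if (cost < best) {
                best = cost;
                *mx += hex_dx[k];
                *my += hex_dy[k];
                moved = 1;
                break;                    /* re-centre and search again */
            }
        }
    }
    {                                     /* small diamond refinement */
        int cx = *mx, cy = *my;
        for (int k = 0; k < 4; k++) {
            int cost = sad_at(cx + dia_dx[k], cy + dia_dy[k]);
            if (cost < best) {
                best = cost;
                *mx = cx + dia_dx[k];
                *my = cy + dia_dy[k];
            }
        }
    }
}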

5.2 Structures Redesign and Code Improvements

An important step to improve the software's performance is to conveniently redesign and adapt the data structures that are involved in the processing. Good data structures provide efficient ways to manipulate and relate the information without spending too much time acquiring it. Hence, the main goal of the redesign was to increase the cache efficiency by exploiting its two main principles: spatial locality and temporal locality. Spatial locality is exploited by reducing the size of the data structures, since the probability of accommodating the whole data structure in the cache becomes higher. As a result, after a compulsory miss, a hit becomes much more likely than a miss. The temporal locality exploitation is accomplished by two main factors: the size reduction, which decreases the conflict misses, and the joining together of all the information needed to process the block under consideration.


The structure resizing was mainly done by eliminating unused or replicated parameters and by adjusting their size to the set of possible values: for example, since the range of the RDOpt parameter is confined between zero and three, a (minimum size) byte is used to store it, instead of an integer. After rebuilding the data structures, the memory usage became the main concern, since most embedded systems are characterized by having small RAM memories. For the optimized version of the encoder, a new low memory option was introduced, in order to reduce the memory usage even further. A deep analysis of the encoder had shown that the DPB was the most memory consuming block, since the interpolation results were stored in it, along with the frame under processing. Changes in the interpolation procedure had to be made in order to meet the memory requirements of embedded systems. Besides these modifications, all data structures are statically allocated in memory, allowing the usage of this platform in embedded systems that do not use an Operating System (OS). Moreover, in order to support parallel video encoding, some new data structures were also added. Table 5.1 shows the memory usage results for the reference and for the optimized version of the developed software. According to the presented results, the considered data structure redesign allowed a reduction of the memory usage of up to 85% for the normal setup and up to 93% for the low memory setup. A further reduction of the number of reference frames could decrease this memory consumption even more, leading to a light platform that can be easily used in embedded system design.

Table 5.1: Memory consumption for the reference and optimized software versions. These results were acquired considering 3 backward and 1 forward reference frames.

Structure                 Reference Software   Optimized (High Memory)   Optimized (Low Memory)
ImageParameters           82433 B              248 B                     248 B
InputParameters           5.9 KB               6.1 KB                    6.1 KB
Picture Parameter Set     248 B                152 B                     152 B
Sequence Parameter Set    2.1 KB               1.7 KB                    1.7 KB
Slice                     2.3 MB               2.4 MB                    2.4 MB
Macroblock                171.7 KB             74.3 KB                   74.3 KB
Decoded Picture Buffer    203.1 MB             22.4 MB                   6.6 MB
Intra Processing          −                    2.0 MB                    2.0 MB
Inter Processing          −                    2.8 MB                    2.8 MB
Total                     205.7 MB             29.7 MB                   13.9 MB
Memory Saved              −                    85.5%                     93.2%

Two main structures were developed to support the encoding of frames in this platform: MBData and EntropicData. In the MBData structure, all the information needed to encode a given macroblock is gathered, in order to avoid the usage of multiple structures for a single task. MBData includes another structure, called IntProc, corresponding to a buffer where all the required values, such as displacement vectors, reference frames, prediction direction or the chosen intra mode, are stored during prediction. During the Transform & Quantization processing, the quantization results (also known as run/level) are stored in the entropicMB field of the EntropicData structure. This structure gathers all the information needed by the CABAC/CAVLC block. The field EntropicMBData collects all the information about the macroblocks: run and level coefficients, predicted and computed displacement vectors, intra modes used for the luminance and chrominance components, etc.

Main structures developed for frame encoding

typedef struct MBData {
    PicData CurPic;    // Macroblock's pixels to be predicted
    PicData EncPic;    // Reconstructed macroblock
    IntProc PrdPic;
    (...)
} MBData;

typedef struct EntropicData {
    byte SliceID;        // Number used for slice ID
    byte qp;             // Quantization parameter (field name assumed)
    word mb_index;       // Number of macroblocks stored
    word mb_curr_idx;    // Current index to be processed
    (...)
    TextureInfoContexts *tex_ctx;    // ptr to the texture context structure
    MotionInfoContexts  *mot_ctx;    // ptr to the motion context structure
    EntropicMBData entropicMB[PIC_MB_SIZE];
} EntropicData;

After the redesign of the data structures, some code improvements were performed: some modules of the encoder were re-organized and optimized, through a careful cleaning of the code and the reutilization of functions. Moreover, suitable initialization functions were added to each module, so that all the parameters needed by each block are gathered, pre-computed and stored in local structures, in order to avoid wasting time during the video encoding. As a simple example, during the encoding some dedicated functions are used to find all the neighbours of a certain block. Since the macroblocks' inter-relations do not change during the whole video encoding, it is possible to store this pattern in the form of a set of addresses, instead of executing these operations for the processing of every frame. The main optimizations that were performed in each functional module are described in the following subsections.
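A minimal sketch of such a precomputation is shown below, assuming a raster-scan macroblock order; the structure and function names are hypothetical:

/* Neighbour table: for each macroblock, the indices of its left, up,
 * up-left and up-right neighbours (-1 if unavailable) are computed once,
 * before the encoding starts, instead of for every frame. */
typedef struct {
    int left, up, up_left, up_right;
} MBNeighbours;

static void init_mb_neighbours(MBNeighbours *tab, int mb_cols, int mb_rows)
{
    for (int r = 0; r < mb_rows; r++) {
        for (int c = 0; c < mb_cols; c++) {
            int i = r * mb_cols + c;
            tab[i].left     = (c > 0)                    ? i - 1           : -1;
            tab[i].up       = (r > 0)                    ? i - mb_cols     : -1;
            tab[i].up_left  = (r > 0 && c > 0)           ? i - mb_cols - 1 : -1;
            tab[i].up_right = (r > 0 && c < mb_cols - 1) ? i - mb_cols + 1 : -1;
        }
    }
}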

5.2.1 Transform & Quantization

To optimize the Transform & Quantization functional block, some changes were made at the data structure and algorithm levels. To compute the quantization procedure, a matrix of quantization coefficients is used. In the reference encoder, these coefficients are stored in huge lookup tables. Caching these tables could evict other important cached blocks needed for the frame encoding, which would lead to higher miss rates. After a deep analysis, it was noted that, for a fixed


Qstep (Quantization Step), all the quantization coefficients used in this process remain constant during the frame encoding. In the optimized version, before a certain frame is encoded, all the quantization coefficients used during the quantization procedure are picked up from the lookup tables and stored in a smaller data structure. In what concerns the algorithm improvements, the computational process was adapted to only use 16-bit operations, instead of the 32-bit operations used in the reference version. Also, new transform functions using SSE2 operations were added, in order to further increase the algorithm's performance.
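For illustration, the 4 × 4 forward integer transform of H.264 can be computed in its usual butterfly form with 16-bit additions and shifts only. The sketch below follows that textbook formulation and is not the platform's actual code:

#include <stdint.h>

/* 4x4 forward integer transform (butterfly form), using only 16-bit
 * additions and shifts; 'blk' holds the prediction residues. */
static void forward_transform_4x4(int16_t blk[4][4])
{
    int16_t s0, s1, s2, s3;

    for (int i = 0; i < 4; i++) {            /* horizontal pass */
        s0 = blk[i][0] + blk[i][3];
        s1 = blk[i][1] + blk[i][2];
        s2 = blk[i][1] - blk[i][2];
        s3 = blk[i][0] - blk[i][3];
        blk[i][0] = s0 + s1;
        blk[i][2] = s0 - s1;
        blk[i][1] = (int16_t)(s3 << 1) + s2;
        blk[i][3] = s3 - (int16_t)(s2 << 1);
    }
    for (int j = 0; j < 4; j++) {            /* vertical pass */
        s0 = blk[0][j] + blk[3][j];
        s1 = blk[1][j] + blk[2][j];
        s2 = blk[1][j] - blk[2][j];
        s3 = blk[0][j] - blk[3][j];
        blk[0][j] = s0 + s1;
        blk[2][j] = s0 - s1;
        blk[1][j] = (int16_t)(s3 << 1) + s2;
        blk[3][j] = s3 - (int16_t)(s2 << 1);
    }
}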

5.2.2 Intra Prediction

To implement the Intra prediction of a given block, nine intra modes have to be computed. The residues (differences) between them and the current macroblock are then processed by the Transform & Quantization module, returning the prediction costs and the minimum residue. To find the effective prediction cost, the probable prediction mode is used, which, in the original software implementation, is stored in a matrix after the computation of every prediction (Figure 5.2).

[Flowchart of Figure 5.2: when the left and up neighbours are unavailable, the probable prediction mode is the DC mode; otherwise, it is the minimum of upMode and leftMode.]

Figure 5.2: Flowchart used in probable mode’s computation.
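A direct C transcription of this flowchart could be written as follows (the function name is an assumption; mode 2 is the DC prediction in H.264):

#define DC_PRED 2    /* mode number of the DC prediction in H.264 */

/* Probable intra mode, as in the flowchart of Figure 5.2: DC when one of
 * the neighbours is unavailable (signalled here by a negative mode value),
 * otherwise the minimum of the up and left neighbours' modes. */
static int probable_intra_mode(int up_mode, int left_mode)
{
    if (up_mode < 0 || left_mode < 0)
        return DC_PRED;
    return (up_mode < left_mode) ? up_mode : left_mode;
}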

To optimize this module, lattice algorithms were designed to compute all nine intra prediction modes, and the multiplication operations were replaced by shifts. These modifications allow the computation of all Intra predictions using only 16-bit additions and shift operations. To further increase the performance of this module, in the optimized version of the encoder the probable prediction mode matrix was replaced by two vectors: a top vector, where the probable modes needed by the macroblocks of the row below are stored, and a right vector, which stores the probable modes needed by the macroblocks located in the next column (Figure 5.3). As in the previous module, all the parameters used to Intra predict a macroblock are stored in small buffers, in order to avoid caching huge structures. These parameters are all pre-computed before the frame encoding.



Figure 5.3: Top and Right vectors used to store the probable modes that will be used in the following predictions.

5.2.3 Inter Prediction

To obtain the best possible prediction of each macroblock, all the inter prediction modes are applied with quarter-pixel precision: 16 × 16, 16 × 8, 8 × 16, 8 × 8, DIR, 8 × 4, 4 × 8 and 4 × 4. The main optimizations that were done in this module were:

• Code cleaning - due to all the assumptions previously made, some pieces of the code could be removed and others simplified. As an example, since all P and B frames are predicted using only inter modes, the intra prediction code was deleted;

• Division of the code - the code was carefully analyzed and divided into several functions, each one containing one prediction mode. After this division, each mode could be further and independently optimized;

• Introduction of new structures - before the inter prediction is applied to a certain macroblock, a preliminary prediction of the displacement vector is obtained using the displacement vectors of the neighbouring macroblocks, which are stored in matrix form in the reference software. Since this problem is very similar to the probable mode computation in intra prediction, the same solution was applied. Also, a new structure comprising all the data that is used in each macroblock prediction was developed, in order to increase the computation efficiency.

5.2.4 Deblocking Filter

The deblocking filter performs the following steps in the processing of each macroblock: filter strength computation, filtering decision and filtering computation. The source code of the reference encoder was cleaned and these steps were divided into separate functions, for a better code comprehension. Also, minor optimizations were applied to them.

5.2.5 Interpolation

The interpolation module is the second most complex module of the encoder, due to the amount of data involved in the computation of the half and quarter pixel resolutions of the reconstructed frame. To reduce this weight, some optimizations were applied to its functions and to the Wiener filter equation (Equation 2.11), which is now computed by Equation 5.4 to avoid multiplications.

h''_p = ((i_0 + i_1) << 2) - (i_{-1} + i_2)
h'_p = (h''_p + (h''_p << 2) + (i_{-2} + i_3) + 16) >> 5        (5.4)
h_p = Clip_0^{255}(h'_p)

To conform to the memory requirements typically imposed by most embedded systems, deep changes were made in the interpolation module. In the reference version, all the pixels resulting from the interpolation were stored in the DPB, leading to a huge memory consumption - the buffer dimensions are 3.6 MByte and 13.7 MByte for a single CIF and 4CIF frame, respectively. This problem has no simple solution: to reduce the buffer dimensions, the interpolation has to be done during the inter prediction; however, each encoded frame can be used as reference in multiple predictions and, for a single macroblock, several steps have to be carried out to find the optimal prediction. Therefore, interpolating during the motion estimation process compromises the performance. Nevertheless, it shall be noted that, during motion estimation, only the luminance component is considered to compute the motion vectors. Therefore, in the adopted solution, the interpolation module was divided into two sub-blocks:

• Chrominance interpolation - in the inter prediction procedure, chrominance pixels are only used in the motion compensation module, which allows performing the interpolation during the motion compensation phase. This technique reduces the buffer’s size by 50%;

• Luminance interpolation - in order to reduce the memory consumption, the luminance interpolation is computed in two steps: the half-pixel and the quarter-pixel computations. The half-pixel interpolation, which requires the more complex computation, is processed after the reconstructed frame has been filtered, in order to exploit the cache properties. On the other hand, the quarter-pixel interpolation is locally computed: due to its moderate complexity, this interpolation is only processed during motion estimation, avoiding the usage of buffers to store the result but, at the same time, reducing the performance.

These techniques allow the memory usage to be reduced at the cost of a slight decrease (5%) in performance. This tradeoff is fully configurable, allowing the full performance to be extracted in systems with abundant memory and, at the same time, making the software adjustable to embedded systems.
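As an illustration, Equation 5.4 can be transcribed almost directly into C; the pixel indexing convention below (p[0..5] holding i_{-2} to i_3) and the function names are assumptions:

#include <stdint.h>

static inline int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Multiplication-free half-pixel interpolation of Equation 5.4: the
 * six-tap filter (1, -5, 20, 20, -5, 1) applied to six horizontally
 * adjacent integer pixels, with p[2] = i_0. */
static uint8_t half_pel(const uint8_t p[6])
{
    int hpp = ((p[2] + p[3]) << 2) - (p[1] + p[4]);          /* h''_p */
    int hp  = (hpp + (hpp << 2) + (p[0] + p[5]) + 16) >> 5;  /* h'_p  */
    return (uint8_t)clip255(hp);                             /* h_p   */
}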

5.2.6 Data Parallelization - SIMD Instructions

Many signal processing applications can be decomposed into a set of vector operations. In this approach, the same operation is simultaneously applied to several data elements, therefore providing an acceleration of the application. Most current processor families and embedded systems include multimedia extensions to the instruction set (MMX, SSE1, SSE2, SSE3, etc.). These instructions offer parallelization at the data level by SIMD: one instruction simultaneously operates on several vector elements, which can result in a large application speedup. In the developed platform, the use of the SSE2 SIMD instructions can be activated through a compilation option. The usage of these instructions was mainly exploited in the algorithms that are more often used in a macroblock prediction: the residual computation algorithms (SAD), the Transform & Quantization block and the Interpolation block [18][44]. To also target systems without multimedia instructions, it is possible to disable this parallelism, by choosing the most suitable configuration for the target platform.

5.3 Levels of Parallelism in H.264 Encoding

Over the past few years, several parallelism models have been presented in the literature to increase the performance of H.264/AVC encoders [35–37]. Due to the encoder's nature, many of these parallelization approaches exploit concurrent execution at the frame level, slice level or macroblock level. However, careful design methodologies, in what concerns the parametrization and modularity, have to be considered, in order to avoid the introduction of performance losses in terms of the final bit-rate and PSNR. In the H.264 standard, there are three types of data dependencies: the dependencies between frames, present in Inter prediction, in which a macroblock can only be processed after its reference frames have been processed; the dependencies between macroblock rows of the same frame, which impose that a macroblock can only be processed after its upper neighbours have been encoded and reconstructed; and the dependencies within the same macroblock row, in which a macroblock cannot be processed until its left neighbour has been processed. These three types of data dependencies must be dealt with in order to exploit the data parallelism in the H.264 encoder.

At the frame level, the input video stream is divided into GOPs. Since the GOPs are usually independent from each other, it is possible to develop a parallel architecture where a controller is in charge of distributing the GOPs among the several available cores (Fig. 5.4). The advantages of this architecture are clear: the PSNR and the bit-rate do not change and it is easy to implement, since the independency of the GOPs is assured with minimal changes in the code. However, the memory consumption significantly increases, since each encoder must have its own DPB, where all the GOP references are stored. Moreover, real-time encoding is hard to implement with this approach, making it more suitable for video storage purposes. Consequently, this solution has been mainly used in clusters, where further parallelism levels have to be exploited in order to improve the performance [35].



Figure 5.4: Frame-level parallel architecture.

In slice-level parallelism (Fig. 5.5), frames are divided into several independent slices, making the processing of macroblocks from different slices completely independent. In the H.264 standard, a maximum of eight slices is allowed in each frame. This approach allows the exploitation of parallelism at a finer granularity, which is suitable, for example, for multiprocessor systems: in multi-core architectures, the parallel encoding of the several defined slices may be concurrently executed by multiple threads for each individual frame. Moreover, the memory consumption is usually smaller, because only one DPB is required. The main issues of this solution are: the limited number of slices per frame (eight in the H.264 standard); the greater effort that has to be made in order to guarantee a good parallel performance; and the redesign of structures and simplification of algorithms that are often required in order to avoid the caching of unnecessary data.


Figure 5.5: Application of slice-level parallelism in a multi-core system.

Parallelism at the macroblock level implies that independent macroblocks can be encoded at the same time [34]. According to the standard, a macroblock (i, j) is predicted using its left and upper neighbours, so the encoding can follow a wave-front approach, as depicted in Figure 5.6. The elimination of these dependencies allows the exploitation of macroblock-level parallelism. The main design issues of this approach are: a centralized control is usually needed, to guarantee that only independent macroblocks are processed in parallel; and the computational weight may not be uniformly distributed among the cores. Moreover, in middle and high resolution video sequences, the maximum number of macroblocks that can be processed in parallel is limited by ⌈N/2⌉, where N denotes the number of macroblocks along the frame diagonal. Figure 5.6 illustrates this approach. By avoiding the previous data dependencies, it is possible to distribute different macroblock rows among different cores; this distribution has to be synchronized to avoid breaking data dependencies.



Figure 5.6: Application of macroblock-level parallelism in a multi-core system.
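A minimal C/OpenMP sketch of this wave-front scheduling is shown below; encode_mb() is a hypothetical per-macroblock encoding routine:

#include <omp.h>

extern void encode_mb(int row, int col);   /* hypothetical */

/* 2D-Wave sketch: macroblocks for which 2*row + col is constant lie on
 * the same anti-diagonal and have no mutual dependencies, since their
 * left, up and up-right neighbours all belong to earlier diagonals. */
void encode_frame_wavefront(int mb_rows, int mb_cols)
{
    int last = 2 * (mb_rows - 1) + (mb_cols - 1);
    for (int d = 0; d <= last; d++) {
        #pragma omp parallel for schedule(dynamic)
        for (int r = 0; r <= d / 2; r++) {
            int c = d - 2 * r;                 /* column on diagonal d */
            if (r < mb_rows && c < mb_cols)
                encode_mb(r, c);
        }
        /* the implicit barrier of the parallel for guarantees that the
         * next diagonal only starts when the current one is finished */
    }
}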

5.4 Heterogeneous vs Homogeneous Multi-core Platform

The majority of the heterogeneous multi-core systems are mainly formed by multi-core CPUs and multi-core GPUs. In such platforms, the encoder parallelism may be exploited by using the multi-core CPUs for slice-level parallelism and the multi-core GPUs to increase the performance of the motion estimation algorithm. Despite the complexity of this solution, the expected performance gain is much higher when compared with a homogeneous solution. On the other hand, homogeneous multi-core systems have the following advantages:

• All cores are exactly the same: equivalent frequencies, cache sizes, functions and instruction set;

• Parallelization is easier to exploit: the different instruction sets and operating frequencies are the major difficulties in the adaptation of software to heterogeneous systems, where the programming and the synchronization between the cores have to be carefully analyzed and developed;

• Homogeneous cores are easier to produce since the same instruction set is used across all cores and each core contains the same hardware;

Most of the software developed for desktops, laptops and servers is mainly based on homogeneous parallelization architectures, whereas in embedded systems heterogeneity is more common. In order to target a larger number of platforms, the developed platform was designed for a homogeneous architecture. Concerning the API to be used, the OpenMP characteristics presented in Section 3.6 make this API the most suitable one to exploit parallelism at the slice and macroblock levels.

5.5 Slice Definition

Slice-level parallelism is used to achieve a greater performance in the developed platform. To correctly exploit this parallelism, the main issue was carefully analyzed: how to divide the frame into several slices.


To extract the maximum performance from multi-core architectures in parallel sections, the load has to be equally distributed among the cores, in order to avoid idle times. However, the load imposed by two macroblocks may be significantly different when considering Inter prediction: in high motion areas, the motion search algorithm has to perform a much more exhaustive computation than in low motion areas. Moreover, the computational cost of a certain macroblock will certainly vary along the coding of the video sequence. To perfectly distribute the load among the cores, the computational cost of each macroblock would have to be previously estimated and the macroblocks would have to be organized in slices according to their computational cost, meaning that a dynamic slice partition would have to be developed in order to support this solution. Statistically, for high resolution video sequences, the load among fixed slice partitions is approximately uniform during the video coding. Therefore, the developed platform automatically divides the frame into 2, 4, 8, 16 or 32 slices with the same number of macroblocks. Concerning the shape of the slices, rectangular configurations minimize the number of broken data dependencies. With such configurations, the variations of the bit-rate, compression-rate and PSNR tend to be minimized, since the number of broken dependencies is minimized. In order to solve the scalability problem of the slice-level parallelization, slice scattering is used; this feature is described in the following sections. The several slice partitions supported by this platform are illustrated in Figure 5.7.


Figure 5.7: Slice definition supported by the platform.

5.6 Data Partitioning

To achieve a high performance in parallel computation, the amount of data that needs to be exchanged between cores has to be minimized. In the H.264 standard, the information that needs to be shared is mainly the one involved in data dependencies, like the adjacent pixels/motion vectors around a certain macroblock for Intra/Inter prediction. Two data dependency types are exploited in this platform: slice and macroblock dependencies.

Dividing a frame into several slices grants a certain independency, since two adjacent macroblocks belonging to different slices can be encoded independently. This data independency is used to exploit slice-level parallelism: independent data structures are used to fetch and store the information in the encoding of each slice. However, this type of independency can also lead to a decrease of the video quality and to an increase of the bit-rate. As for the data dependencies between macroblocks from adjacent rows, because the three neighbouring macroblocks above are all involved, it is impossible to have complete data independency between the cores. However, it is still possible to exploit some data parallelism, provided that each macroblock region comprises at least two columns of macroblocks. The data exchange between cores is supported by the new data structures added to the platform: IntraPrdBuffer and InterPrdBuffer. The first buffer is used to store the reconstructed and unfiltered data used in Intra prediction, while the InterPrdBuffer is used to store motion vectors and reference frames. Due to the considerable amount of data involved in this dependency, introducing macroblock-level parallelism may reduce the performance. The developed platform can be configured to support slice-level parallelism, macroblock-level parallelism, or both.

5.7 Parallel Programming of H.264 Video Encoders

As seen before, and despite its complexity, there are several parallelization models that can be used to exploit parallelism in H.264/AVC: at the frame level, at the slice level and at the macroblock level. Solutions based on frame-level parallelism have proved to be perfectly scalable. However, the video delay is proportional to the exploited scalability and its full performance can only be achieved in cluster systems. On the other hand, the scalability provided by slice-level parallelism is limited, in H.264, to eight slices. Slice-level scalability can be increased by using the following two techniques:

• Scattered slices - a slice is defined by a group of macroblocks, but these macroblocks do not need to be adjacent to each other. Therefore, a certain slice is allowed to be divided into several parts (sub-slices), in which each part can be located anywhere in the frame. If two sub-slices are not adjacent, their macroblocks can be processed in parallel, since there are no dependencies between them. Exploring this idea, it is easy to see that, although the number of slices is limited to eight, this slice partitioning makes it possible to increase the scalability;

• Macroblock-level parallelism - in this solution, the macroblock inter-dependency within each slice is preserved. Therefore, bus constraints can lead to a lower performance, and a careful analysis and architecture design are needed to effectively extract a reasonable efficiency.

The adopted techniques to solve the slice scalability problem are depicted in Figure 5.8.


Figure 5.8: Adopted techniques to solve the slice scalability problem: in (a), the parallelism is performed using four slices and a 2D-Wave front for macroblock-level parallelism; in (b), the image is divided into four slices (color labels), each one with four sub-slices (numbers).

To efficiently exploit the execution of parallel tasks in H.264/AVC encoders, three parallel models are now proposed:

• Full parallel - each task is executed concurrently;

• Full pipeline - the encoder is divided into several stages and each stage is computed concurrently;

• Mixed parallel - parallel and pipeline models are mixed together, in order to balance the stages' load.

The software implementation of these parallel models was developed using the OpenMP API, due to its particular suitability for the considered target architecture. To comply with the real-time requirements, only slice-level and macroblock-level parallelism were exploited in these models. In particular, the usage of scattered slices proved to be particularly suited to achieve a good efficiency in the parallelization procedure. The following sections describe how the encoding of Intra and Inter type frames was implemented in order to support these parallelization models.

5.7.1 Parallel Encoding of Intra Type Frames

It is known that the computational complexity of Intra prediction is lower than that of Inter prediction. Nevertheless, it was still decided to develop a parallel implementation of this procedure, in order to account for the possibility of supporting GOPs composed only of Intra type frames (GOP=1). Only the full parallel model was adopted in the developed parallel version of the Intra prediction. The Intra prediction procedure can be divided into the following steps:

1. Initialization of all slices;


2. Usage of Intra4 × 4 and Intra16 × 16, followed by mode decision, for all macroblocks of each slice;

3. Usage of CABAC/CAVLC encoding in all macroblocks;

4. Frame package construction (RTP or stream mode).

By using several slices, the dependencies between macroblocks from different slices are broken. As a consequence, all the above steps, except for the last one, can be concurrently executed for each slice. An independent thread is assigned to each slice and each thread is responsible for the encoding of all its macroblocks. When the entire prediction is concluded, the macroblocks are encoded using CABAC or CAVLC and the output package is constructed using the Real-Time Protocol (RTP) or the stream format. It is worth noting that, at the end of each parallel section, there is an implicit barrier that synchronizes all the threads, meaning that no thread is able to leave that phase until all tasks have been completed. The following pseudo-code illustrates how the encoding of Intra type frames was parallelized.

Pseudo-code of the parallel encoding of Intra type frames using OpenMP

Init_Slice_Function();
#pragma omp parallel for schedule(static, 1)
for each Slice
    for each macroblock
        Encode_Macroblock_Intra_Function();
#pragma omp parallel for schedule(static, 1)
for each Slice
    for each macroblock
        Entropic_Codification_Function();
Create_RTP_Pack();

5.7.2 Parallel Encoding of Inter Type Frames

The Inter prediction mode in H.264 allows multi-reference prediction and eight prediction modes. To improve the accuracy of this prediction, quarter-pixel precision is used. Therefore, a better video quality is often achieved at the cost of a lower frame rate. To ensure the maximum flexibility of the developed parallel software framework, the Inter prediction module was divided into several sub-functions, each one corresponding to one of the eight prediction modes. The dependencies between these sub-functions were kept unchanged, since it is impossible to break them without sacrificing the output video quality. Therefore, the original processing order had to be preserved. Despite this restriction, three architectures were developed:

• Full pipeline - the Inter prediction is divided into several stages, depending on the number of available cores; the processing of a given slice is only finished after it has passed through all the pipeline stages;


• Full parallel - the Inter prediction is simultaneously applied to each slice in a parallel way;

• Mixed parallel - the Inter prediction is divided into several stages, but some stages are applied to different slices in parallel.

In the full pipeline architecture, the inter prediction block is divided into pipeline stages and each stage computes one or more of the inter prediction sub-functions. The number of stages and of concurrent slices is determined by the number of cores available in the system. In this architecture, three main issues were carefully analyzed: how the stages communicate, how they are synchronized and how the sub-functions are distributed among the stages. The communication issue is solved by making use of appropriate buffers (MBData structures), with a different slice assigned to each buffer. Firstly, a macroblock belonging to a certain slice is provided by FMO to be processed and all the data needed for its encoding is stored in the corresponding buffer. Next, this buffer goes through all the inter prediction sub-functions (pipeline stages), in order to find the best prediction mode. In the last step, the results are stored in an EntropicData structure, so that they can be encoded by the CABAC/CAVLC module. Since it is impossible to guarantee that all the stages have the same processing time, it is important to ensure that a buffer only advances in the pipeline after all the remaining buffers have been processed. To achieve this synchronization, OpenMP barrier instances were extensively adopted: as soon as all threads reach these barriers, they resume executing in parallel the code that follows the barrier. In particular, this mechanism has to be used in two situations: after each inter prediction task has been completed and after all the buffers have been assigned to the following stage (advancing forward in the pipeline). The following pseudo-code illustrates how the pipeline architecture was developed and how the synchronization issues were solved.

Pseudo-code corresponding to the parallelization of the Inter prediction

#pragma omp parallel sections
{
    #pragma omp section  // section 0
    for (; LAST_SLICE_ID != TERMINATE; ) {
        #pragma omp barrier  // wait until all threads have completed their tasks
        Advance_Buffers_Forward();
        Get_Free_Slice();
        #pragma omp barrier  // wait until all buffers are assigned to threads
        if (Acquire_Macroblock(SLICE_ID) != STALL)
            Stage1_Processing();
    }
    #pragma omp section  // section 1
    for (; LAST_SLICE_ID != TERMINATE; ) {
        #pragma omp barrier  // wait until all threads have completed their tasks
        #pragma omp barrier  // wait until all buffers are assigned to threads
        if (Buffer_Data != STALL)
            Stage2_Processing();
    }
    ...
    #pragma omp section  // section n
    for (; LAST_SLICE_ID != TERMINATE; ) {
        #pragma omp barrier  // wait until all threads have completed their tasks
        #pragma omp barrier  // wait until all buffers are assigned to threads
        if (Buffer_Data != STALL) {
            StageN_Processing();
            Store_Results();
            TQ();
            CABAC_CAVLC_Block();
        }
    }
}
#pragma omp parallel for schedule(static, 1)
for each Slice
    for each macroblock
        Entropic_Codification_Function();
Create_RTP_Pack();

To better balance the pipeline stages, the processing time corresponding to each sub-function was previously measured, in order to provide an estimate of its computational complexity, and only then were the sub-functions grouped into pipeline stages. As mentioned before, in inter prediction the order in which the prediction modes are evaluated is important, since it is impossible to remove all the dependencies without changing the motion estimation result. Therefore, only adjacent modes could be grouped in the same stage. The macroblock acquisition and the CABAC/CAVLC module are computed in the first and last pipeline stages, respectively. Since the macroblock acquisition and the CABAC/CAVLC module have a much lower computational complexity than motion estimation, only the times corresponding to the remaining sub-functions are presented in the following table:

Table 5.2: Processing time of each prediction mode for several CIF video sequences [ms].

Inter prediction sub-function   Soccer   Harbour   Crew   City   Average
16 × 16                          77.3     71.6     67.8   73.3    72.3
16 × 8                           25.8     22.9     21.1   22.0    22.5
8 × 16                           26.5     24.2     22.6   23.2    23.7
DIR                               1.0      1.1      1.0    1.0     1.0
8 × 8                            29.6     29.7     25.4   27.6    28.6
8 × 4                            29.2     29.4     25.6   27.7    28.5
4 × 8                            29.4     29.6     25.3   28.0    28.7
4 × 4                            29.2     29.3     25.5   27.9    28.6

The results presented in Table 5.2 show that the processing time corresponding to the several modes changes only slightly from video to video. As expected, the direct mode is the least time consuming mode, since its prediction vectors are estimated based on the average prediction vectors of the neighbours. On the other hand, the 16 × 16 mode needs significantly more time, since the displacement error is computed with more pixels. Based on this preliminary profiling, a general architecture was developed: a generic pipeline architecture is depicted in Figure 5.9, while Table 5.3 shows how the inter prediction sub-functions were distributed among the several pipeline stages of this architecture, considering the usage of 2, 3, 4, 6 and 8 cores.



Figure 5.9: Generic pipeline architecture considering N stages.

Table 5.3: Pipeline architecture: distribution of the prediction modes among the pipeline stages. Each braced group corresponds to one pipeline stage; the maximum pipeline delay was computed from the measurements of Table 5.2.

2 cores (max. delay 119.6 ms): {16×16, 16×8, 8×16, DIR} | {8×8, 8×4, 4×8, 4×4}
3 cores (max. delay 94.9 ms):  {16×16, 16×8} | {8×16, DIR, 8×8, 8×4} | {4×8, 4×4}
4 cores (max. delay 72.5 ms):  {16×16} | {16×8, 8×16, DIR} | {8×8, 8×4} | {4×8, 4×4}
6 cores (max. delay 72.5 ms):  {16×16} | {16×8} | {8×16} | {DIR, 8×8} | {8×4, 4×8} | {4×4}
8 cores (max. delay 72.5 ms):  {16×16} | {16×8} | {8×16} | {DIR} | {8×8} | {8×4} | {4×8} | {4×4}

The full parallel architecture followed the same design principle that was applied to the Intra prediction (Figure 5.10): the slices are assigned to different concurrent threads and run independently. Since each slice has approximately the same number of macroblocks, in a homogeneous system it is expected that all the threads finish their tasks at the same time. After leaving the CABAC/CAVLC module, all the slices' buffers are joined together to form the output package. Due to its nature, this type of architecture does not need explicit synchronization, relying on implicit barriers entirely similar to those described for the Intra prediction.


Figure 5.10: Block diagram of the full parallel architecture.

The mixed parallel solution is very similar to the previous architecture: the inter prediction is divided into several pipeline stages, but now several slices are processed at the same time, since some stages are executed by more than one thread.



Figure 5.11: Mixed pipeline architecture for four cores: Stage 0 - one core; Stage 1 - two cores; Stage n-1 - one core.

Table 5.4: Mixed pipeline architecture: distribution of the prediction modes among the pipeline stages and processor cores. Each braced group is one pipeline stage and ×n denotes the number of cores (#C) assigned to it.

4 cores:  {16×16, 16×8} ×1 | {8×16, DIR, 8×8, 8×4, 4×8, 4×4} ×2 | {CABAC} ×1
6 cores:  {16×16, 16×8} ×2 | {8×16, DIR, 8×8, 8×4, 4×8} ×2 | {4×4} ×1 | {CABAC} ×1
8 cores:  {16×16, 16×8} ×2 | {8×16, DIR, 8×8, 8×4, 4×8, 4×4} ×4 | {CABAC} ×2
12 cores: {16×16, 16×8} ×4 | {8×16, DIR, 8×8, 8×4, 4×8, 4×4} ×4 | {CABAC} ×4
16 cores: {16×16, 16×8} ×4 | {8×16, DIR, 8×8, 8×4, 4×8, 4×4} ×8 | {CABAC} ×4

In the example of Fig. 5.11, two slices are processed at the same time in each stage. To reduce the computational time of the most critical stage, two processor cores were assigned to that part of the encoding algorithm. Nevertheless, within each of these stages, every slice is processed by an independent thread.

5.7.3 Increasing the Slice-level Scalability

According to the preliminary experimental results, the full parallel architecture proved to be the best architecture to fully exploit the parallelism in H.264 encoders. The organization of the several encoder modules that integrate the proposed framework is depicted in Figure 5.12. The set of processing modules that were chosen to be executed in parallel at the macroblock level are: Intra/Inter prediction, transform and quantization (T.Q.) and inverse quantization and transform (Q−1.T−1.). The entropy encoder modules are executed at the slice level, to avoid the usage of shared data and memory access bottlenecks. To further increase the exploited concurrency, macroblock-level parallelization was also added to the platform.



Figure 5.12: Organization of the several encoder modules in the proposed parallel framework.


Figure 5.13: Proposed architecture to simultaneously exploit the slice and macroblock parallelism levels.

A team of threads is created for each slice. For each team, the independent macroblocks are distributed among the set of cores that were assigned to the processing of that slice. At the end, the entropy coding is performed separately, at the slice level. The synchronization between the several concurrent threads is guaranteed using appropriate synchronization barriers; only threads belonging to the same team are blocked in these barriers. This mechanism is used before and after the execution of the control mechanism that is responsible for assigning and distributing the macroblocks among the set of threads belonging to a certain team (Fig. 5.14): one barrier before and one barrier after its execution ensure that all the macroblocks to be processed are assigned before the encoding starts. No further synchronization is needed, since all the threads will be blocked at a synchronization barrier at the end of the #pragma construct. Another option considered to increase the concurrency is the adoption of slice scattering, where the slices are divided into sub-slices. By definition, a slice is a group of macroblocks; therefore, two adjacent macroblocks belonging to the same slice share some data used during their encoding. To increase the scalability of the slice-level parallelism, each slice is divided into several sub-slices, distributed over non-adjacent areas of the frame. This kind of distribution forces the data independency among the sub-slices, allowing them to be processed in different threads.



Figure 5.14: Simplified thread synchronization used for each team.
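A minimal OpenMP sketch of the synchronization of Figure 5.14 is given below; slice_finished(), assign_macroblocks() and encode_assigned_mbs() are hypothetical stand-ins for the platform's actual control and encoding routines, and it is assumed that all the threads of the team evaluate the same loop condition:

#include <omp.h>

extern int  slice_finished(void);          /* hypothetical */
extern void assign_macroblocks(void);      /* Mb assignment control */
extern void encode_assigned_mbs(int tid);  /* hypothetical */

void encode_slice_team(void)
{
    #pragma omp parallel
    while (!slice_finished()) {
        #pragma omp barrier    /* all threads of the team stop here     */
        #pragma omp master
        assign_macroblocks();  /* only the master assigns macroblocks   */
        #pragma omp barrier    /* assignments visible before encoding   */
        encode_assigned_mbs(omp_get_thread_num());
    }
}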

Slice scattering is illustrated in Figure 5.15. When compared with the previous solution, the greater parallelization level provided by scattered slices is offered at the cost of an eventual slight degradation of the resulting bit-rate and PSNR levels (due to the presence of more blocking effect). As a consequence, this solution should only be used to exploit the spatial redundancy between macroblocks in higher resolution videos that require the most efficient parallelization architectures.


Figure 5.15: Slice-scattering distribution in a multi-core architecture.
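As a minimal sketch of how such a scattered distribution could be computed, the function below interleaves macroblock rows among the sub-slices of a slice, so that adjacent rows never fall in the same sub-slice; the actual partition used by the platform may differ.

    /* Round-robin (row-interleaved) sub-slice assignment: macroblocks in
       adjacent rows are mapped to different sub-slices, enforcing the data
       independence described above (naming is illustrative). */
    int sub_slice_of_mb(int mb_addr, int mbs_per_row, int num_sub_slices)
    {
        int mb_row = mb_addr / mbs_per_row;
        return mb_row % num_sub_slices;
    }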


6 Results

Contents

6.1 Frame Partition and Memory Consumption
6.2 Architectures Comparison
6.3 Comparison Between Slice-level and Macroblock-level Parallelism


In this section, the proposed architectures are compared with the reference software. All simulations were performed considering the following parameters, which are also summarized in the illustrative structure after this list:

• IBP type GOP - this GOP structure provides lower bit-rates at the cost of an increased computational complexity. The chosen GOP is formed by one I frame followed by thirty frames arranged in a B-B-P pattern;

• Intra Prediction - since Intra prediction does not represent a great computational effort, all possible modes were considered for Intra 4x4 and Intra 16x16;

• Inter Prediction - all prediction modes were enabled, in order to increase the compression efficiency; note that the encoding time could be reduced by disabling some of the optional modes;

• Number of references - each inter frame uses a maximum of 3 backward references and, for B type frames only, 1 forward reference. By decreasing the number of reference frames it would be possible to further increase the frame-rate;

• Search Mode - the Simplified Hexagon motion estimation mode was adopted;

• Entropy Coding - all simulations were performed using CABAC in order to achieve higher compression-rates;

• In-loop deblocking filter - enabled, to increase the video quality;

• Weighted prediction - disabled, due to its high computational complexity;

• Error metric - to reduce the computational effort, SAD was preferred over SSD or SATD;

• Sub-pixel motion estimation - quarter-pixel precision was adopted, to obtain more accurate motion vectors.
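For illustration only, the parameter set listed above can be condensed into a C structure; the field names below are hypothetical and are not the configuration keys of the JM reference software.

    /* Illustrative snapshot of the adopted simulation parameters. */
    typedef struct {
        int gop_length;           /* 1 I frame + 30 frames in a B-B-P pattern */
        int num_backward_refs;    /* 3 past reference frames                  */
        int num_forward_refs;     /* 1 future reference (B frames only)       */
        int simplified_hexagon;   /* 1: simplified hexagon motion estimation  */
        int use_cabac;            /* 1: CABAC entropy coding                  */
        int deblocking_filter;    /* 1: in-loop deblocking filter enabled     */
        int weighted_prediction;  /* 0: disabled (high complexity)            */
        int use_sad_metric;       /* 1: SAD instead of SSD or SATD            */
        int subpel_precision;     /* 4: quarter-pixel motion estimation       */
    } SimConfig;

    static const SimConfig sim_cfg = { 31, 3, 1, 1, 1, 1, 0, 1, 4 };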

The developed platform was simulated on an eight-core NUMA system, composed of two quad-core chip processors with support for Hyper-Threading technology. Each core can emulate two parallel threads, but only one thread is executed at a time. The system contains two memory pools, one for each chip processor (Figure 6.1(b)). Concerning the cache distribution in each chip processor, all individual cores have their own L1 and L2 caches and share a common L3 cache with the other cores in the same chip (Figure 6.1(a)). To solve cache coherency problems when shared data is used, an invalidate snooping protocol is implemented. The main system specifications are summarized in Table 6.1. The division of video frames into several slices was used to ensure data independence. Therefore, its usage may lead to slight changes in bit-rate and PSNR. To properly understand the behavior of these parameters when multiple slices are used, the reference software was simulated for several slice partitions.

Table 6.1: Specifications of the computational system used for the platform simulations.

CPU                Two general purpose Intel Xeon Quad-Core E5530 processors running at 2.40GHz, with individual L1 and L2 caches of 128KB and 256KB, respectively, and a shared L3 cache of 8MB.
Memory             24GB available (12GB in each memory pool), running at 1066MHz.
Operating System   64-bit, with OpenMP support.
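A small OpenMP probe, sketched below, can confirm the thread resources exposed by such a machine; on the described system the runtime reports the 16 logical processors provided by Hyper-Threading, even though only 8 physical cores execute threads at a given moment.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Logical processors seen by OpenMP (16 with Hyper-Threading on). */
        printf("logical processors: %d\n", omp_get_num_procs());
        /* Default upper bound on the threads of a parallel region. */
        printf("max threads:        %d\n", omp_get_max_threads());
        return 0;
    }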

(a) Intel Xeon Quad-Core E5530 processor. (b) NUMA system architecture.

Figure 6.1: Eight-core NUMA system architecture used in all simulations.

It is important to notice that these results are independent of the software version used (reference or optimized). During the development, three architectures were proposed: full pipeline, mixed parallel and full parallel. Each architecture was extensively tested for slice-level parallelism and for multiple slices, in order to determine the most efficient setup. After that, the chosen architecture was extended with the macroblock-level parallelism technique. The results are presented and discussed in the following sections. The performed evaluations considered four video sequences - Soccer, Harbour, City and Crew - each with different properties concerning motion and detail: while Soccer and Crew are characterized by higher amounts of movement, Harbour and City can be classified as sequences with a higher spatial detail. Their snapshots are presented in Figure 6.2. In order to have a considerable amount of data to process, the encoder was tested with videos in the 4CIF format.

(a) Soccer video (b) Harbour video (c) City video (d) Crew video

Figure 6.2: Video sequences snapshots.


6.1 Frame Partition and Memory Consumption

Frame partition into several slices may cause a slight change in the bit-rate, compression rate and PSNR. When frames are divided into slices, the dependencies between macroblocks located at slice boundaries are broken. This forced independence reduces the information that can be used to exploit redundancy, causing an increase in the bit-rate and lower compression rates and PSNR. The results acquired from the simulation are presented in Table 6.2. The bit-rate and PSNR results acquired for a single-slice encoding are taken as reference.

Table 6.2: Bit-rate increment for frame division in N slices (4CIF format).

           Ref. [kb/s]           Number of Slices
Video      (1 Slice)        2        4        8        16
Soccer       2107         +0.6%    +2.3%    +4.0%    +6.6%
Harbour      3541         +0.1%    +0.7%    +1.4%    +2.1%
City         1266         +0.4%    +1.8%    +3.4%    +4.7%
Crew         2381         +0.2%    +1.4%    +2.2%    +3.2%

Table 6.3: PSNR results for frame division in N slices (4CIF format).

           Ref. [dB]             Number of Slices
Video      (1 Slice)        2        4        8        16
Soccer       35.66         0.0%     0.0%     0.0%    -0.1%
Harbour      34.40         0.0%     0.0%     0.0%     0.0%
City         34.68         0.0%     0.0%     0.0%     0.0%
Crew         36.51         0.0%     0.0%     0.0%     0.0%

From the results presented in Table 6.2 it is possible to observe that the bit-rate increases, as expected, with the number of slices. By splitting a frame into slices, the macroblocks located at slice boundaries need more bits to be coded, since their neighbours' predictions can no longer be used for encoding. Therefore, by increasing the number of slices, the bit-rate also increases. Concerning the PSNR, the results prove that for the considered video formats the losses in PSNR are approximately zero (see Table 6.3). Concerning the memory consumption results (Table 5.1), the optimized data structures allowed the memory usage to be reduced by more than 85%.

6.2 Architectures Comparison

The resulting performance of the three conceived parallel architectures is presented in this section. The viability of these architectures can be measured by the following parameters:

• Real speedup (S_R) - this parameter quantifies the speedup gain of the considered architecture;


• Optimal speedup (S_O) - this result represents the theoretical limit of the real speedup. It is computed by using Amdahl's law;

• Efficiency (Ef) - measures how effectively the available cores are used by a certain parallel architecture.

To estimate these parameters, the encoding time obtained with the conducted parallelization ($T_{Parallel}$) and the original (sequential) encoding time ($T_{Seq}$), measured with C language time functions, are used in the following equations:

$$S_R = \frac{T_{Seq}}{T_{Parallel}} \tag{6.1}$$

$$S_O = \frac{1}{(1-F) + \frac{F}{S_p}} \tag{6.2}$$

$$Ef = \frac{S_R}{\text{Number of Cores}} \tag{6.3}$$

where $F$ is the parallelized fraction of the code and $S_p$ is the partial speedup achieved in that parallel fraction.
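Equations (6.1) to (6.3) translate directly into C, as sketched below; the numeric values are illustrative placeholders only.

    #include <stdio.h>

    int main(void)
    {
        double t_seq = 237.1, t_parallel = 39.6;  /* encoding times [s]      */
        double F = 0.95, Sp = 8.0;                /* parallel fraction, gain */
        int    cores = 8;

        double S_R = t_seq / t_parallel;          /* real speedup,  Eq. (6.1) */
        double S_O = 1.0 / ((1.0 - F) + F / Sp);  /* Amdahl's law,  Eq. (6.2) */
        double Ef  = S_R / cores;                 /* efficiency,    Eq. (6.3) */

        printf("S_R = %.2f  S_O = %.2f  Ef = %.0f%%\n", S_R, S_O, Ef * 100.0);
        return 0;
    }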

A first simulation considering the fully sequential optimized encoder was performed. From its results, it is possible to estimate, through comparison with the reference, the gain achieved just by optimizing the data structures and the encoder's code. Table 6.4 presents the results obtained with the reference and the optimized software.

Table 6.4: Comparison between the reference and the optimized software for the 4CIF format.

                 Reference               Optimized
Video            T_Tot[s]   T_Opt[s]     T_Tot[s]   T_Opt[s]    SU_Real
Soccer            417.3      411.5        237.1      235.0       1.8
Harbour           368.1      361.8        187.1      183.5       2.0
Crew              393.9      388.7        213.7      211.0       1.8
City              383.4      378.1        203.2      201.0       1.9

From Table 6.4 and Table A.1 (Appendix A), we can conclude that, by simply optimizing the reference software, it was possible to achieve a real speedup by a factor of 2, meaning that the adopted optimizations make the optimized encoder twice as fast as the reference software. It is important to remember that all optimizations were done at a high programming level, to ensure the software's portability. All the performance results presented in the following sections were estimated considering the optimized encoder.

6.2.1 Full Pipeline Architecture

The experimental results obtained with the full pipeline architecture are presented in this section. The obtained performance levels are computed for all video sequences and presented in


Table 6.5 and in the chart of Figure 6.3. The previously obtained optimized single core results were taken as reference.

Table 6.5: Full pipeline speedup for 4CIF video sequences.

           Single             Number of Threads
Video      Core          2       3       4       6       8
Soccer     1.00         1.29    1.45    1.51    1.22    0.44
Harbour    1.00         1.28    1.46    1.44    1.12    0.44
Crew       1.00         1.29    1.48    1.48    1.19    0.50
City       1.00         1.10    1.51    1.50    1.21    0.47


Figure 6.3: Results obtained with the full pipeline architecture for the 4CIF format.

From Table 6.5 it is possible to conclude that the maximum performance gain is achieved with a 3-thread configuration. Up to that configuration, the speedup grows approximately linearly. By adding more stages to the pipeline, the amount of macroblock encoding information exchanged between the several stages (threads) increases, leading to a consequent reduction of the achieved performance. Such reduction can be justified by the fact that these data exchanges are performed sequentially, by using the main memory to accommodate the data that is exchanged between two consecutive stages. Since only one single thread can access the main memory at a time, such restriction seriously compromises the performance when the number of concurrent threads increases. This communication further decreases the performance and may lead to memory bottlenecks and higher latency times. Moreover, it is very difficult to achieve a perfect balance of the amount of work conducted by the several pipeline stages in order to achieve the minimum delay. Therefore, the performance decreases for configurations with more than three stages (threads). As a consequence, this type of architecture proved to be unsuitable to exploit further parallelization by adding macroblock-level parallelism.


6.2.2 Mixed Parallel Architecture

This architecture tries to further increase the concurrency in the encoder by considering a pipeline chain with a heterogeneous distribution of threads among the several stages. Hence, concurrent threads were added to the most computationally demanding stages, in order to increase the pipeline throughput. The results of the mixed parallel architecture are presented in Table 6.6. To compute the performance, the optimized single-core results were taken as reference.

Table 6.6: Mixed parallel speedup for 4CIF video sequences.

           Single             Number of Threads
Video      Core          4       6       8       12      16
Soccer     1.00         1.57    1.65    1.40    1.03    1.38
Harbour    1.00         1.59    1.62    1.30    0.87    1.22
Crew       1.00         1.62    1.61    1.30    0.97    1.33
City       1.00         1.60    1.67    1.40    1.12    1.29


Figure 6.4: Results obtained with the mixed parallel architecture for the 4CIF format.

The obtained results show that the performance achieved with this architecture is still very far from the optimal performance (Figure 6.4). Just as in the full pipeline architecture, the amount of data that has to be serially transferred through the primary memory seriously limits the overall advantage provided by the several concurrent cores available in this computational system.

6.2.3 Full Parallel Architecture

The performance of the first approach to the full parallel architecture is evaluated through an isolated exploitation of the slice level of parallelism. Table 6.7 presents the performance gain achieved by an N-thread full parallel architecture, in comparison with the optimized version without parallelism. The efficiency of this architecture is presented in Table 6.8. The results for the QCIF and CIF video sequence formats are presented in Appendix A.3.


Table 6.7: Full parallel speedup for 4CIF video sequences.

           Single           Number of Threads
Video      Core         2       4       8       16
Soccer     1.00        1.87    3.42    5.53    6.29
Harbour    1.00        1.91    3.56    5.99    6.25
City       1.00        1.97    3.78    6.41    6.15
Crew       1.00        1.89    3.57    5.97    6.39

Table 6.8: Full parallel architecture efficiency for 4CIF video sequences (speedup/number of threads).

                 Number of Threads
Video        2       4       8       16
Soccer      93%     86%     69%     39%
Harbour     95%     89%     75%     39%
City        99%     94%     80%     38%
Crew        95%     89%     75%     40%

According to the results presented in Table 6.7, it can be observed that the achieved speedup values are very close to the theoretical optimum speedup, leading to a high processing efficiency. This is mainly because the independence between slices allows a parallelization level without any data sharing between them. Therefore, the segmentation and data sharing overheads are avoided and the obtained encoder speedup is increased. However, for a 16-thread configuration the results are far from what was expected. The Hyper-Threading technology emulates two threads in a single core, but only one thread can be executed at a time. Therefore, in a 16-thread architecture the two slices assigned to each core are computed sequentially, leading to an equivalent performance very close to that of an 8-thread configuration.

Concerning the efficiency, the results are satisfactory. As expected, the lowest efficiency values are obtained for the 16-thread configuration.

6.2.4 Discussion

The previous analyses have proven that the full parallel architecture is the most suitable for the considered platform. Its good performance levels and higher efficiency make it the preferred choice to further exploit macroblock-level parallelism. The lower speedups obtained with the pipeline and mixed architectures can be justified by the exchange of shared information among threads through the common primary memory.

Once the architecture was chosen, the macroblock-level and slice-scattering techniques were developed on top of the full parallel architecture. The results are presented in the following section.


6.3 Comparison Between Slice-level and Macroblock-level Parallelism

The set of results presented in Section 6.2.3 depicts the evaluation of the performance provided by an isolated exploitation of the slice level of parallelization on a full parallel architecture. This section presents and discusses the set of results that were obtained for the macroblock-level parallelism mode implemented in the conceived full parallel architecture. The developed platform can be easily configured to better suit the capabilities of the target system. In addition to the configuration of the set of threads to be used, four configurations are available. Table 6.9 summarizes these four possible configurations.

Table 6.9: Possible configurations of the optimized platform.

SSE2    Low Memory    Target Systems
NO      NO            Standard systems
YES     NO            Standard systems with support for SSE2 instructions
NO      YES           Embedded systems
YES     YES           Embedded systems with support for SSE2 instructions
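As an example of what the SSE2 configurations enable, the SAD error metric adopted in the simulations can be vectorized with the _mm_sad_epu8 intrinsic, as in the sketch below for a 16x16 luma macroblock; this is only an illustration, not necessarily the platform's actual implementation.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>

    /* SAD of a 16x16 luma block: each _mm_sad_epu8 call produces two
       partial sums (one per 64-bit half), accumulated across the rows. */
    static int sad_16x16_sse2(const uint8_t *cur, int cur_stride,
                              const uint8_t *ref, int ref_stride)
    {
        __m128i acc = _mm_setzero_si128();
        for (int y = 0; y < 16; y++) {
            __m128i c = _mm_loadu_si128((const __m128i *)(cur + y * cur_stride));
            __m128i r = _mm_loadu_si128((const __m128i *)(ref + y * ref_stride));
            acc = _mm_add_epi64(acc, _mm_sad_epu8(c, r));
        }
        /* Combine the two 64-bit partial sums (each fits in 16 bits). */
        return _mm_cvtsi128_si32(acc) + _mm_extract_epi16(acc, 4);
    }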

Figure 6.5 illustrates the performance provided by the isolated exploitation of the macroblock level of parallelism. The obtained speedup values are somewhat more modest, achieving a maximum value of about 4 (see Table 6.10). The referred exchange of dependent data between the several threads of this particular NUMA multiprocessor is probably the main reason for these results: since each chip processor has its own memory pool, data has to be shared between the pools, leading to a memory bottleneck and higher latency times. Nevertheless, this model offers a somewhat wider scalability space, entirely compatible with a mutual and simultaneous exploitation of the slice-level model. Had a UMA parallel architecture been adopted, the obtained speedups might have been greater. In a UMA architecture, the communication time between each core and the main memory is the same; this property may reduce the time spent in data exchanges between threads that do not share the same memory pool.
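One common way to mitigate this kind of NUMA penalty, sketched below, relies on the usual first-touch page placement policy: if each thread initializes the data it will later process, the corresponding pages are allocated in its local memory pool. This is an illustration of the general technique, not part of the developed platform.

    #include <omp.h>
    #include <stddef.h>

    /* First-touch initialization: with the same static schedule as the
       processing loops, each thread's pages end up in the memory pool of
       the chip that will actually use them. */
    void first_touch_init(unsigned char *buf, size_t n)
    {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long)n; i++)
            buf[i] = 0;
    }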

Table 6.10: Average speedup of all four video sequences for the SSE and Low Memory configurations using macroblock-level parallelism.

    Configuration             Number of Threads
SSE2    Low Memory       2       4       8       16
NO      NO              1.73    2.96    4.38    3.91
NO      YES             1.54    2.65    3.96    3.24
YES     NO              1.84    2.98    4.33    3.99
YES     YES             1.44    2.53    3.99    3.94

Table 6.11 illustrates the performance provided by the isolated exploitation of the slice level of parallelism. The obtained results, illustrated in Figure 6.6, are very close to the theoretical optimum speedup in what concerns the achieved frame-rate, leading to a high processing efficiency.



Figure 6.5: Provided speedup using macroblock-level parallelism.

This is mainly due to the independence between slices, which allows a parallelization level without data sharing between them. Therefore, the segmentation and data scheduling overheads are avoided and the obtained encoder speedup is increased. However, since frames can be divided into a maximum of eight slices in the H.264 standard, slice scattering or other levels of parallelization have to be applied, in order to overcome this constraint and make the solution, as well as the platform, more scalable. Nevertheless, it should be noted that the results for slice scattering are not conclusive, since the system where the simulations were performed can only execute eight threads simultaneously. This fact explains the similarity between the values acquired for eight and sixteen slices.

Table 6.11: Average speedup of all four video sequences for the SSE and Low Memory configurations using slice-level parallelism.

    Configuration             Number of Threads
SSE2    Low Memory       2       4       8       16
NO      NO              1.90    3.56    5.98    6.27
NO      YES             1.65    3.11    5.44    5.65
YES     NO              2.05    3.72    6.10    6.28
YES     YES             1.72    3.19    5.01    5.46

Furthermore, when compared with the results obtained with the macroblock-level model, the slice-level parallelization solution presents the highest performance. The simultaneous usage of the slice and macroblock parallelism levels was also validated. Table 6.12 summarizes all the results for the three adopted solutions. From them, it is possible to conclude about the relationship between the data dependencies and the speedup: the speedup is higher in architectures with lower data dependencies (slice-level parallelism).



Figure 6.6: Provided speedup using slice-level parallelism.

Table 6.12: Provided speedup for the standard configuration using the slice and macroblock levels of parallelism.

#Cores    Macroblock    Slice    Slice+Macroblock
1           1.00         1.00    -
2           1.73         1.90    -
4           2.96         3.56    3.25 (#Slices 2 x #MBs 2)
8           4.38         5.98    5.00 (#Slices 2 x #MBs 4); 5.32 (#Slices 4 x #MBs 2)
16          3.91         6.27    4.63 (#Slices 2 x #MBs 8); 6.27 (#Slices 4 x #MBs 4); 4.63 (#Slices 8 x #MBs 2)

Despite being more modest, the results obtained with macroblock-level parallelism allow us to conclude that this solution can still be used in highly scalable systems.


7 Conclusions

Contents

7.1 Summary
7.2 Results Analysis
7.3 Future Work


7.1 Summary

The main objective of this thesis was to improve the performance of a reference video encoder through parallelization, for usage in multi-core systems based on general purpose cores and/or embedded systems. Therefore, the following characteristics had to be satisfied: the software encoder has to be flexible and modular, in order to make it easier to scale; the memory has to be statically allocated, allowing its usage on embedded systems without operating systems. Several steps were taken to achieve this goal. Firstly, the computational cost of each block of the encoder had to be measured; Gprof was used for this purpose. After concluding that Inter prediction is the heaviest block, the encoder's structures were redesigned in order to exploit the cache properties and lower the memory consumption. This step was followed by code optimization: the code was divided into the encoder's blocks and each block was optimized. The last step was to use OpenMP tools to introduce slice-level and macroblock-level parallelism in the code. Three architectures were developed: full parallel, where inter prediction is applied to each slice at the same time; full pipeline, where inter prediction is divided into several pipeline stages and applied to each slice in a pipelined way; and mixed parallel, a combination of the previous architectures.

7.2 Result’s Analysis

It was proved that greater performance levels can be obtained through parallelization. In theory, in a system with N cores it is possible to increase the speedup by a factor of N. In reality, this factor is limited by numerous reasons: cache coherency, bus traffic, non-ideal task division among cores, etc. Also, there are always some tasks which have to be processed by a single core (serial tasks). In a multi-core based system, a large number of different architectures can be developed. In this work, three architectures were developed: full parallel, full pipeline and mixed parallel. Four video sequences were encoded with these architectures and the presented results show that the maximum performance is achieved using a full parallel architecture. In a pipeline based architecture, the bus bandwidth demand is much higher, due to data sharing after each stage's computation. The division of the pipeline stages is also an important factor: despite the careful division of Inter prediction into several pipeline stages, it is impossible to guarantee that all stages have the same computational time, since the Inter prediction time changes from macroblock to macroblock and from video sequence to video sequence. Two parallelism levels were exploited in the full parallel architecture: slice level and macroblock level. The H.264 standard only allows the partition of a frame into eight different slices. Therefore, this limitation can be overcome with slice scattering (an extension of slice-level parallelism) or macroblock-level parallelism. The acquired results illustrate that the maximum performance is

achieved for slice-level parallelism. Nevertheless, the platform was tested for all four configuration options. From the presented results, it is possible to conclude that the additional memory restrictions imposed by the Low Memory configuration do not significantly affect the achieved performance levels.

7.3 Future Work

Although the H.264 standard was formally released in May 2003 and commercial decoders are now widely available in the market, there is still a considerable amount of future work worth conducting. Parallel heterogeneous architectures based on CPUs and GPUs are being used in order to exploit further parallelism. In such platforms, the Motion Search, Motion Compensation, Filtering and Interpolation modules can be adapted and redesigned, in order to further exploit the parallelism by using the GPU. Concerning the encoder profile, it is important to recall that during the development of this platform a profile targeting lower bit-rates and higher PSNR was used. Therefore, several changes in the profile can be made in order to achieve higher performances. The definition of several parallel profiles targeting different encoding objectives and system architectures could also be considered in future H.264 studies. Still under development, the future H.265 standard aims to substantially improve the coding efficiency, in order to reduce the bit-rate requirements by half when compared with the H.264/AVC High Profile. This goal will probably be achieved at the expense of an increased computational complexity. Slice partitioning may still be supported in the future standard, meaning that higher performances may be achieved by joining slice-level and macroblock-level parallelism.



A Architecture Results

Contents

A.1 Full pipeline architecture
A.2 Mixed parallel architecture
A.3 Full parallel architecture


In this section, a comparison between the original reference software and the optimized version is presented in Table A.1, in what concerns the time spent to encode 300 frames and the corresponding frame rate.

Table A.1: Comparison between the reference and the optimized software.

                     Reference                 Optimized
Video     Format     T_Tot[ms]   T_Opt[ms]     T_Tot[ms]   T_Opt[ms]    SU_Real
Soccer    QCIF        14560       14390          8661        8626        1.7
          CIF        102249      100852         58271       57567        1.8
          4CIF       417279      411450        237043      235007        1.8
Harbour   QCIF        11986       11818          5849        5849        2.0
          CIF         88804       86821         43166       43166        2.1
          4CIF       368106      361795        187057      183499        2.0
Crew      QCIF        14248       14072          8272        8214        1.7
          CIF         97952       96224         53184       52473        1.8
          4CIF       393944      388676        213682      210963        1.8
City      QCIF        13045       12635          6755        6713        1.9
          CIF         91382       90340         47751       47019        1.9
          4CIF       383420      378076        203217      201034        1.9

A.1 Full pipeline architecture

The results obtained for the frame rate and the speedup achieved with the full pipeline architecture are summarized in Tables A.2 and A.3 and in Figures A.1 and A.2. In these simulations, 300 frames and three video formats were used: QCIF, CIF and 4CIF.

Table A.2: Frame rate (fps) for the full pipeline architecture.

                     1-Core              Number of Cores
Video     Format     Ref.    Opt.     2       3       4       6       8
Soccer    QCIF      10.16   17.09   20.29   24.80   23.30   18.56    9.02
          CIF        2.62    4.60    6.10    6.78    6.87    5.58    2.09
          4CIF       0.64    1.13    1.46    1.64    1.70    1.38    0.58
Harbour   QCIF      12.35   25.11   29.18   35.14   28.58   24.54    5.36
          CIF        3.02    6.12    7.81    8.92    8.52    6.27    2.40
          4CIF       0.73    1.43    1.84    2.10    2.06    1.60    0.64
Crew      QCIF      10.39   17.89   23.15   28.14   23.53   20.64    9.44
          CIF        2.74    5.04    6.68    7.57    7.43    5.94    1.88
          4CIF       0.68    1.25    1.62    1.86    1.86    1.49    0.63
City      QCIF      11.35   21.91   26.97   32.24   27.50   20.92   10.02
          CIF        2.93    5.61    7.77    8.47    8.30    6.29    1.97
          4CIF       0.70    1.32    1.46    2.00    1.97    1.59    0.62


Table A.3: Relative speedup for the full pipeline architecture.

                     1-Core              Number of Cores
Video     Format     Ref.    Opt.     2       3       4       6       8
Soccer    QCIF       0.59    1.00    1.19    1.45    1.36    1.09    0.53
          CIF        0.57    1.00    1.33    1.47    1.49    1.21    0.45
          4CIF       0.57    1.00    1.29    1.45    1.51    1.22    0.44
Harbour   QCIF       0.49    1.00    1.16    1.40    1.14    0.98    0.21
          CIF        0.49    1.00    1.28    1.46    1.39    1.03    0.39
          4CIF       0.51    1.00    1.28    1.46    1.44    1.12    0.44
Crew      QCIF       0.58    1.00    1.29    1.57    1.32    1.15    0.53
          CIF        0.54    1.00    1.33    1.50    1.47    1.18    0.37
          4CIF       0.54    1.00    1.29    1.48    1.48    1.19    0.50
City      QCIF       0.52    1.00    1.23    1.47    1.26    0.95    0.47
          CIF        0.52    1.00    1.38    1.51    1.48    1.12    0.35
          4CIF       0.53    1.00    1.10    1.51    1.50    1.21    0.47



(a) QCIF Video sequences frame rate(fps).


(b) CIF Video sequences frame rate(fps).


(c) 4CIF Video sequences frame rate(fps).

Figure A.1: Frame rate obtained for all video formats.



(a) QCIF Video sequences Speedup.


(b) CIF Video sequences Speedup.


(c) 4CIF Video sequences Speedup.

Figure A.2: Speedup obtained for all video formats.


A.2 Mixed parallel architecture

The results obtained for the frame rate and the speedup achieved with the mixed parallel architecture are summarized in Tables A.4 and A.5 and in Figures A.3 and A.4. In these simulations, 300 frames and three video formats were used: QCIF, CIF and 4CIF.

Table A.4: Frame rate (fps) for the mixed parallel architecture.

                     1-Core              Number of Cores
Video     Format     Ref.    Opt.     4       6       8       12      16
Soccer    QCIF      10.16   17.09   20.29   24.80   23.30   18.56    9.48
          CIF        2.62    4.60    6.10    6.78    6.87    5.58    1.88
          4CIF       0.64    1.13    1.46    1.64    1.70    1.38    0.61
Harbour   QCIF      12.35   25.11   29.18   35.14   28.58   24.54    9.77
          CIF        3.02    6.12    7.81    8.92    8.52    6.27    2.09
          4CIF       0.73    1.43    1.84    2.10    2.06    1.60    0.77
Crew      QCIF      10.39   17.89   23.15   28.14   23.53   20.64    6.07
          CIF        2.74    5.04    6.68    7.57    7.43    5.94    2.22
          4CIF       0.68    1.25    1.62    1.86    1.86    1.49    0.62
City      QCIF      11.35   21.91   26.97   32.24   27.50   20.92   10.02
          CIF        2.93    5.61    7.77    8.47    8.30    6.29    4.07
          4CIF       0.70    1.32    1.46    2.00    1.97    1.59    0.63

Table A.5: Relative speedup for the mixed parallel architecture.

                     1-Core              Number of Cores
Video     Format     Ref.    Opt.     4       6       8       12      16
Soccer    QCIF       0.59    1.00    1.19    1.45    1.36    1.09    0.55
          CIF        0.57    1.00    1.33    1.47    1.49    1.21    0.41
          4CIF       0.57    1.00    1.29    1.45    1.51    1.22    0.54
Harbour   QCIF       0.49    1.00    1.16    1.40    1.14    0.98    0.39
          CIF        0.49    1.00    1.28    1.46    1.39    1.03    0.34
          4CIF       0.51    1.00    1.28    1.46    1.44    1.12    0.54
Crew      QCIF       0.58    1.00    1.29    1.57    1.32    1.15    0.34
          CIF        0.54    1.00    1.33    1.50    1.47    1.18    0.44
          4CIF       0.54    1.00    1.29    1.48    1.48    1.19    0.49
City      QCIF       0.52    1.00    1.23    1.47    1.26    0.95    0.46
          CIF        0.52    1.00    1.38    1.51    1.48    1.12    0.73
          4CIF       0.53    1.00    1.10    1.51    1.50    1.21    0.48



(a) QCIF Video sequences frame rate(fps).


(b) CIF Video sequences frame rate(fps).


(c) 4CIF Video sequences frame rate(fps).

Figure A.3: Frame rate obtained for all video formats.



(a) QCIF Video sequences Speedup.


(b) CIF Video sequences Speedup.


(c) 4CIF Video sequences Speedup.

Figure A.4: Speedup obtained for all video formats.


A.3 Full parallel architecture

The results obtained for the frame rate and the speedup achieved with the full parallel architecture are summarized in Tables A.6 and A.7 and in Figures A.5 and A.6. In these simulations, 300 frames and three video formats were used: QCIF, CIF and 4CIF.

Table A.6: Frame rate (fps) for the full parallel architecture.

                    1-Core                       Number of Cores
Video    Format     Ref.    Opt.     2       3       4       6        8       12      16
Soccer   QCIF      10.16   17.09   29.44   38.68   50.58   59.70    45.36   57.32   59.87
         CIF        2.62    4.60    8.40   11.39   14.75   19.38    23.19   20.42   23.91
         4CIF       0.64    1.13    2.05    2.82    3.66    4.81     6.17    5.69    6.43
Harbour  QCIF      12.35   25.11   44.89   66.55   77.65  102.14    80.96   78.68   73.82
         CIF        3.02    6.12   11.32   15.65   21.02   26.30    23.27   29.06   30.68
         4CIF       0.73    1.43    2.62    3.70    4.84    6.14     7.51    7.38    8.62
Crew     QCIF      10.39   17.89   31.18   48.27   55.00   77.12    48.24   62.03   60.88
         CIF        2.74    5.04    9.36   13.78   17.75   21.44    21.23   22.03   26.36
         4CIF       0.68    1.25    2.32    3.27    4.38    5.56     7.05    6.74    6.60
City     QCIF      11.35   21.91   36.92   60.61   64.91   89.37    56.55   75.13   67.49
         CIF        2.93    5.61   10.88   16.03   20.54   26.04    27.42   27.54   25.49
         4CIF       0.70    1.32    2.05    3.55    4.60    6.21     6.25    7.65    8.68

Table A.7: Full parallel performance for the QCIF and CIF video sequence formats.

                                  Number of Cores
Video    Format   Optimized    2      3      4      6      8      12     16
Soccer   QCIF       1.00      1.72   2.26   2.96   3.49   2.65   3.35   3.50
         CIF        1.00      1.83   2.48   3.21   4.21   5.04   4.44   5.20
Harbour  QCIF       1.00      1.79   2.65   3.09   4.07   3.22   3.13   2.94
         CIF        1.00      1.85   2.56   3.44   4.30   3.81   4.75   5.02
Crew     QCIF       1.00      1.74   2.70   3.07   4.31   2.70   3.47   3.40
         CIF        1.00      1.86   2.73   3.52   4.26   4.21   4.37   5.23
City     QCIF       1.00      1.68   2.77   2.96   4.08   2.58   3.43   3.08
         CIF        1.00      1.94   2.86   3.66   4.64   4.89   4.91   4.54


Table A.8: Efficiency for the full parallel architecture.

                          Number of Cores
Video    Format     2      3      4      6      8      12     16
Soccer   QCIF      86%    75%    74%    58%    33%    28%    22%
         CIF       91%    83%    80%    70%    63%    37%    32%
Harbour  QCIF      89%    88%    77%    68%    40%    26%    18%
         CIF       93%    85%    86%    72%    48%    40%    31%
Crew     QCIF      87%    90%    77%    72%    34%    29%    21%
         CIF       93%    91%    88%    71%    53%    36%    33%
City     QCIF      84%    92%    74%    68%    32%    29%    19%
         CIF       97%    95%    91%    77%    61%    41%    28%


(a) QCIF Video sequences frame rate(fps).


(b) CIF Video sequences frame rate(fps).

Figure A.5: Frame-rate obtained for all video formats.



(a) QCIF Video sequences Speedup.


(b) CIF Video sequences Speedup.

Figure A.6: Speedup obtained for all video formats.
