Software Implementation of an MPEG-4 Part 10 (H.264/AVC) Video Encoder for Embedded Systems

Software implementation of an MPEG-4 Part 10 (H.264/AVC) video encoder for embedded systems Antonio´ Manuel da Costa Rodrigues Dissertaçao˜ para obtençao˜ do Grau de Mestre em Engenharia Electrotecnica´ e de Computadores J úri Presidente: Doutor Marcelino Bicho dos Santos Orientador: Doutor Nuno Filipe Valentim Roma Co-orientador: Doutor Leonel Augusto Pires Seabra de Sousa Vogal: Doutor Paulo Luis Serras Lobato Correia Outubro de 2010 Acknowledgments I would like to thank my supervisor, Dr Nuno Roma for his assistance and guidance throughout this project. On a personal note I would thank to all people around me for the support and patience. Abstract In the last few years there has been a general proliferation of advanced video services and multimedia applications, where video compression standards, such as MPEG-x or H.26x, have been developed to store and broadcast video information in the digital form. Among such video standards, the MPEG-4 Part 10 (also known as H.264/AVC) was recently released and has proved to provide particular advantages, when compared with the previous video standards (such as MPEG-1, MPEG-2, etc.), in what concerns the obtained encoding efficiency and video quality performances. In the other hand, to achieve such encoding efficiency and quality, the computa- tional complexity of the encoder has increased exponentially. At the same time, processor industries have already been taking multi-core solutions (multiple processors units on a die) to answer to a growing speedup demand. This dissertation presents a multi-core solution of the encoder H.264 software (programmed in C - http://iphome.hhi.de/suehring/tml/download/old jm/ version 14.0) to speedup the processing time but keeping the efficiency and quality performances. Some restrictions have been imposed to reduce the processing dependencies and increase the level of parallelism at frame level. For example, the picture and macroblock interlace, rate distortion optimization and rate control modules were disabled. Through several simulations, the results indicate that the developed architectures are capable to speed up the elapsed time in the processing. Keywords Multicore, H.264/AVC, Parallelism, Video encoding, Speedup iii Resumo Nos ultimos´ anos tem-se assistido a` proliferaçao˜ de serviços avançados de v´ıdeo e aplicaçoes˜ multimedia,´ onde os standards de compressao˜ de v´ıdeo, tais como MPEG-x ou H.26x, temˆ sido desenvolvidos para armazenamento e difusao˜ de v´ıdeo em formato digital. Entre essas normas, o MPEG-4 Part 10 (tambem´ conhecido como H.264/AVC) foi lançado recentemente e provou ter vantagens, comparadas com as versoes˜ anteriores (tais como MPEG-1, MPEG-2, etc.), no que diz respeito a` obtençao˜ de uma codificaçao˜ eficiente e uma boa qualidade de v´ıdeo. Por outro lado, para atingir tal eficienciaˆ e qualidade, a complexidade computacional do codificador aumentou exponencialmente. Ao mesmo tempo, as industrias´ de processadores começaram a optar por soluçoes˜ multicore (multiplos´ processadores no mesmo chip) para responder a` crescente necessidade de processadores mais rapidos.´ Esta dissertaçao˜ apresenta uma soluçao˜ multi-core para o software do codificador H.264 (programado em C - http://iphome.hhi.de/suehring/tml/download/old jm/ version 14.0) que visa o aumento da velocidade de processamento mantendo a eficienciaˆ e a qualidade. Algumas restriçoes˜ foram impostas para reduzir as dependenciasˆ a n´ıvel do processamento e aumentar o paralelismo a n´ıvel da frame. Por exemplo, os modulos´ de picture and macroblock interlace, rate distortion optimization e rate control foram desactivados. Atraves´ de varias´ simulaçoes,˜ os resultados indicaram que as arquitecturas desenvolvidas sao˜ capazes de aumentar a velocidade de processamento. Palavras Chave Multicore, H.264/AVC, Paralelismo, Codificaçao˜ de V´ıdeo, Aumento da Velocidade v Contents 1 Introduction 1 1.1 Motivation ......................................... 2 1.2 Requirements ....................................... 3 1.3 Objectives ......................................... 3 1.4 Main contributions .................................... 4 1.5 Dissertation Outline ................................... 4 2 Video Coding 5 2.1 Principles of Video Compression ............................ 6 2.1.1 Reduction of Irrelevancies ............................ 6 2.1.2 Spatial Redundancy ............................... 8 2.1.3 Temporal Redundancy .............................. 8 2.1.4 Comparison Metrics ............................... 10 2.2 H.264 Standard ...................................... 11 2.2.1 H.264 Profiles and Levels ............................ 11 2.2.2 H.264 Encoding Loop .............................. 12 2.2.2.A Flexible Macroblock Order and Arbitrary Slice Order ........ 13 2.2.2.B Intra Prediction ............................. 14 2.2.2.C Inter Prediction ............................. 15 2.2.2.D Transform & Quantization ....................... 17 2.2.2.E Entropy Coding ............................ 20 2.2.2.F In-loop Deblocking Filter ....................... 22 2.2.2.G Interpolation Block ........................... 25 2.2.2.H Network Abstraction Layer ...................... 26 2.2.3 H.264 Profiling .................................. 27 3 Multiprocessor Architectures 29 3.1 Classification of Parallel Processor Systems ...................... 30 3.2 Symmetric Shared-Memory Multiprocessor ...................... 31 3.2.1 Cache Coherency ................................ 33 vii Contents 3.2.1.A Software Solution ........................... 34 3.2.1.B Hardware Solution ........................... 34 3.3 Non-Uniform Memory Access .............................. 37 3.4 Communication Mechanisms .............................. 38 3.5 Synchronization ..................................... 42 3.5.1 Software Locks .................................. 43 3.5.2 Hardware Locks ................................. 43 3.6 Data Parallelism and Programming Languages .................... 43 3.6.1 MPI - Message Programming Interface .................... 44 3.6.2 POSIX Threads .................................. 44 3.6.3 OpenMP ..................................... 45 3.6.4 CUDA and OpenCL ............................... 46 4 Related work 47 4.1 Parallelizing the H.264 Decoder ............................. 48 4.2 Parallelizing the H.264 Encoder ............................. 49 4.3 Other Proposed Parallel Solutions ........................... 52 5 Open Platform for Parallel H.264 Video Encoding 53 5.1 Assumptions ....................................... 55 5.2 Structures Redesign and Code Improvements ..................... 57 5.2.1 Transform & Quantization ............................ 59 5.2.2 Intra Prediction .................................. 60 5.2.3 Inter Prediction .................................. 61 5.2.4 Deblocking Filter ................................. 61 5.2.5 Interpolation ................................... 61 5.2.6 Data Parallelization - SIMD Instructions .................... 62 5.3 Levels of Parallelism in H.264 Encoding ........................ 63 5.4 Heterogeneous vs Homogeneous Multi-core Platform ................ 65 5.5 Slice Definition ...................................... 65 5.6 Data Partitioning ..................................... 66 5.7 Parallel Programming of H.264 Video Encoders .................... 67 5.7.1 Parallel Encoding of Intra Type Frames ..................... 68 5.7.2 Parallel the Encoding of Inter Type Frames .................. 69 5.7.3 Increasing the Slice-level Scalability ...................... 73 6 Results 77 6.1 Frame Partition and Memory Consumption ...................... 80 viii Contents 6.2 Architectures Comparison ................................ 80 6.2.1 Full Pipeline Architecture ............................ 81 6.2.2 Mixed Parallel Architecture ........................... 83 6.2.3 Full Parallel Architecture ............................. 83 6.2.4 Discussion .................................... 84 6.3 Comparison Between Slice-level and Macroblock-level Parallelism ......... 85 7 Conclusions 89 7.1 Summary ......................................... 90 7.2 Result’s Analysis ..................................... 90 7.3 Future Work ........................................ 91 8 Bibliography 93 A Architecture Results 99 A.1 Full pipeline architecture ................................. 100 A.2 Mixed parallel architecture ................................ 104 A.3 Full parallel architecture ................................. 107 ix Contents x List of Figures 2.1 Different sampling patterns. ............................... 7 2.2 Difference between two consecutive frames. Figure c was enhanced so the differ- ences can be noticed. .................................. 8 2.3 Example of an interframe prediction using BMA. ................... 9 2.4 Main features supported by each of the H.264 profiles. ................ 11 2.5 H.264 encoder block diagram. .............................. 12 2.6 Different slice group maps, supported in the H.264. .................. 14 2.7 Directions available for Intra prediction in H.264. ................... 15 2.8 Supported modes in INTRA-4x4 prediction. ...................... 15 2.9 INTRA-16 × 16 supported modes. ............................ 16 2.10 Possible partitions of a macroblock in H.264 Inter prediction. ............ 16 2.11 Prediction types: (a) Type P; (b) Type P and B - presentation order; (c) Type P and B - codification order. ................................... 17 2.12 Usage of multiple reference frames in Inter prediction. ...............

Load more