Optimization of multimedia applications on embedded multicore processors

Elias Michel Baaklini

To cite this version:

Elias Michel Baaklini. Optimization of multimedia applications on embedded multicore processors. Other. Université de Valenciennes et du Hainaut-Cambresis; Arab open university. Lebanon branch (Antelias, Liban), 2014. English. NNT: 2014VALE0004. tel-01127384

HAL Id: tel-01127384 https://tel.archives-ouvertes.fr/tel-01127384 Submitted on 7 Mar 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Thèse de doctorat

Pour obtenir le grade de Docteur de l’Université de VALENCIENNES ET DU HAINAUT-CAMBRESIS

et de l’ARAB OPEN UNIVERSITY, LIBAN

Discipline : Informatique

Présentée et soutenue par Elias Michel Baaklini.

Le 12/02/2014, à Valenciennes, France.

Ecole doctorale : Sciences Pour l’Ingénieur (SPI), Collège Doctoral Lille Nord de France

Equipe de recherche, laboratoire : Laboratoire d’Automatique, de Mécanique et d’Informatique Industrielles et Humaines (LAMIH)

Optimisation des Applications Multimédia sur des Processeurs Multicoeurs Embarqués

JURY

Président du jury - Artiba, Abdelhakim, Professeur, Université de Valenciennes, France.

Rapporteurs - Goossens, Bernard, Professeur, Université de Perpignan, France. - Diguet, Jean-Philippe, Docteur, Directeur de recherche CNRS, LAB-STICC, Lorient, France.

Examinateurs - Yurdakul, Arda, Professeur, Bogazici University, Istanbul, Turquie. - Artiba, Abdelhakim, Professeur, Université de Valenciennes, France.

Directeur de thèse - Niar, Smail, Professeur, LAMIH, Université de Valenciennes, Valenciennes, France.

Co-encadrant - Sbeity, Hassan, Assistant Professor, Arab Open University, Beyrouth, Liban.

Optimization of Multimedia Applications on Embedded Multicore Processors

Elias Michel Baaklini

University of Valenciennes France

A thesis submitted for the degree of Doctor of Philosophy

February 2014

Acknowledgments

The work presented in this thesis was carried out in the computer science departments at the University of Valenciennes in France and the Arab Open University in Lebanon. I would like to thank all the people who made this research successful.

In the first place, I would like to show my greatest gratitude to my thesis director Smail Niar, professor at University of Valenciennes, France, and to my supervisor Hassan Sbeity, assistant professor at Arab Open University in Beirut, Lebanon, for their continuous guidance, inspiration, and motivation.

Moreover, I would like to deeply thank my reviewers, Bernard Goossens, professor at University of Perpignan, and Jean-Philippe Diguet, research director at LAB-STICC in Lorient, for reviewing, correcting, and enhancing my thesis dissertation with their detailed comments and constructive feedback.

In addition, I would like to thank and to show my appreciation to my examiners, Arda Yurdakul, professor at Bogazici University in Istanbul, Turkey, and Abdelhakim Artiba, professor at University of Valenciennes, France.

Furthermore, I cannot forget all the support of my friends and colleagues who helped me during my long journey. Thank you all for your company and assistance.

Finally, I am very thankful for the valuable support of my family. Special gratitude goes to my mother, Fairouz, for her permanent love and encouragement.

I would like to dedicate this PhD thesis to my loving mother Fairouz and to my late father Michel.

Elias Baaklini

Résumé

L'utilisation de plusieurs cœurs pour l'exécution des applications mobiles sera l'approche dominante dans les systèmes embarqués pour les prochaines années. Cette approche permet en général d'augmenter les performances du système sans augmenter la vitesse de l'horloge. Grâce à cela, la consommation d'énergie reste modérée. Toutefois, la concurrence entre les tâches doit être exploitée afin d'améliorer les performances du système dans les différentes situations où l'application peut s'exécuter.

Les applications multimédias, comme la vidéoconférence ou la vidéo haute définition, ont de nombreuses nouvelles fonctionnalités qui nécessitent des calculs complexes par rapport aux normes précédentes de codage vidéo. Ces applications créent une charge de travail très importante sur les systèmes multiprocesseurs. L'exploitation du parallélisme pour les applications multimédia, comme le codec vidéo H.264/AVC, peut se faire à différents niveaux : au niveau des données ou bien au niveau des tâches.

Dans le cadre de cette thèse de doctorat, nous proposons de nouvelles solutions pour une meilleure exploitation du parallélisme dans les applications multimédia sur des systèmes embarqués ayant une architecture parallèle symétrique (ou SMP pour Symmetric Multi-Processor). Des approches innovantes pour le décodeur H.264/AVC, qui traitent des composantes de couleur et des blocs de l'image en parallèle, sont proposées et expérimentées.

Mots Clés : Multimédia, Standard H.264/AVC, Compression Vidéo, Optimisation, Calcul Parallèle, Systèmes Embarqués, Processeurs Multicœurs.

Abstract

Parallel computing is currently the dominant architecture in embedded systems. Concurrency improves the performance of the system without increasing the clock speed, which would raise the power consumption of the system. However, concurrency needs to be exploited in order to improve system performance in different application environments.

Multimedia applications (real-time conversational services such as video conferencing, video phone, etc.) have many new features that require complex computations compared to previous video coding standards. These applications present a challenging workload for future multiprocessors. Exploiting parallelism in multimedia applications can be done at the data and functional levels, or using different instruction sets and architectures.

In this research, we design new parallel algorithms and mapping methodologies in order to exploit the natural parallelism present in multimedia applications, specifically the H.264/AVC video decoder. We mainly target symmetric shared-memory multiprocessors (SMPs) for embedded devices, such as ARM Cortex-A9 multicore chips. We evaluate our novel parallel algorithms of the H.264/AVC video decoder on different levels: memory load, energy consumption, and execution time.

Keywords: Multimedia, H.264/AVC Standard, Video Compression, Optimization, Parallel Computing, Embedded Systems, Multicore Processors.

Contents

Contents

List of Figures

Nomenclature

1 Introduction
  1.1 Background and Motivation
  1.2 Problem Statement
  1.3 Existing Solutions
  1.4 Contributions
  1.5 Outline

2 Parallel Computing
  2.1 Introduction
  2.2 Types of Parallelism
    2.2.1 Classification of Processors
    2.2.2 Instruction-Level Parallelism
    2.2.3 Data-Level Parallelism
    2.2.4 Thread-Level Parallelism
  2.3 Memory Architecture for Parallel Systems
    2.3.1 Main Memory
    2.3.2 Processor Communication
    2.3.3 Memory Access
    2.3.4 Caches
    2.3.5 Cache Coherency
  2.4 Parallel Applications
    2.4.1 Amdahl's Law
    2.4.2 Challenges of Parallel Processing
    2.4.3 Programming Languages
  2.5 Conclusion

3 H.264/AVC Standard Overview
  3.1 Introduction
  3.2 Video Coding Review
    3.2.1 Digital Video
    3.2.2 Block Based Video Coding
    3.2.3 Video Coding Standards
  3.3 Standard Development
  3.4 Features and Tools
    3.4.1 Layer Structure
    3.4.2 Profiles and Levels
    3.4.3 Picture Format
  3.5 Video Coding
    3.5.1 Encoder
    3.5.2 Decoder
  3.6 Coding Tools and Functions
    3.6.1 Intra Prediction
    3.6.2 Inter Prediction
    3.6.3 Transform and Quantization
    3.6.4 Skipped Macroblocks
    3.6.5 Deblocking Filter
  3.7 Parallel Implementations
    3.7.1 Slice-Level
    3.7.2 Macroblock-Level
    3.7.3 Deblocking Filter
    3.7.4 Discussion
  3.8 Summary and Conclusion

4 H.264 Color Components Parallel Decoding
  4.1 Introduction
  4.2 Parallel Decoding
    4.2.1 Stages Decomposition
    4.2.2 Color Components Processing
    4.2.3 Parallel Execution and Synchronization
    4.2.4 Pipeline Execution
  4.3 Experiments with JM H.264 Reference Software
    4.3.1 MPARM simulator and H.264 porting
    4.3.2 Profiling H.264 Stages
    4.3.3 Discussion
    4.3.4 Speedup using Parallelism
  4.4 Experiments with FFmpeg H.264 Decoder
    4.4.1 Multi2Sim Simulator
    4.4.2 FFmpeg H.264 Implementation
    4.4.3 Speedup using Parallelism
    4.4.4 Power Efficiency
    4.4.5 FFmpeg Multi-Threaded Version
  4.5 Conclusion

5 H.264 Macroblocks Rows Parallel Decoding
  5.1 Introduction
  5.2 Decoder Decomposition
    5.2.1 Decoding Stages
    5.2.2 Macroblocks
  5.3 Parallel Implementation
    5.3.1 Parallel Motion Compensation
    5.3.2 Macroblocks Dependencies
    5.3.3 IDR Frame Frequency
    5.3.4 Macroblock Dependency Check Algorithm
    5.3.5 Macroblocks Partitioning
    5.3.6 Scalability of Parallel Motion Compensation
    5.3.7 Parallel Deblocking Filter
  5.4 Experimental Results on Multicore Systems
    5.4.1 Parallel Execution
    5.4.2 Environment and Configurations
    5.4.3 Results for Parallel Motion Compensation
    5.4.4 Comparison with Related Work
    5.4.5 Results for Parallel Deblocking Filter
    5.4.6 Results for Overall Execution
    5.4.7 Simulated Execution
    5.4.8 Theoretical Speedup
  5.5 Parallel Execution on Graphics Processor
    5.5.1 General-Purpose Graphical Processing Unit
    5.5.2 OpenCL C Programming Language
    5.5.3 Experimental Results
  5.6 Conclusion

6 Parallel Cache Efficiency
  6.1 Introduction
  6.2 Parallel Environment
    6.2.1 Processor Architecture
    6.2.2 Parallel Algorithms
  6.3 Multicore Cache Memory
    6.3.1 L1 Cache Misses Statistics
    6.3.2 Common L1 Cache Misses among Cores
    6.3.3 Parallel Cache Efficiency
  6.4 Cache Optimization
    6.4.1 Prefetching Algorithm
    6.4.2 Performance Efficiency
  6.5 Instructions and Cycles Statistics
  6.6 Conclusion

7 Conclusion

8 Résumé en Français
  8.1 Introduction
    8.1.1 Contexte
    8.1.2 Déclaration du Problème
    8.1.3 Solutions Existantes
    8.1.4 Contributions
    8.1.5 Plan
  8.2 Programmation Parallèle
    8.2.1 Processeurs Multicœurs
  8.3 Standard H.264
    8.3.1 Décodeur H.264
  8.4 Décodage des Couleurs en Parallèle
    8.4.1 Composants de Couleurs
    8.4.2 Exécution en Parallèle et Synchronisation
    8.4.3 Exécution en Mode Pipeline
    8.4.4 Résultats
  8.5 Décodage de Macroblocks en Parallèle
    8.5.1 Compensation des Mouvements en Parallèle
    8.5.2 Algorithme de Vérification des Dépendances entre les Macroblocks
    8.5.3 Résultats de Compensation des Mouvements en Parallèle
    8.5.4 Résultats pour l'Exécution Complète
  8.6 Conclusion

References

List of Figures

2.1 Basic structure of a vector architecture (VMIPS). The scalar architecture is similar to MIPS. The processor contains eight 64-element vector registers, and all functional units are vector functional units.
2.2 Basic structure of a Symmetric MultiProcessor (SMP) with a single address space on the same physical memory shared by all processors. This multiprocessor architecture is also called Uniform Memory Access (UMA).
2.3 Multicore architecture of the ARM Cortex-A9 multiprocessor with an L2 cache shared by 4 cores, each having a private L1 cache.
2.4 Basic structure of a Distributed MultiProcessor (DSM) with a single address space composed of several physical memories shared by all processors. This multiprocessor architecture is also called Non-Uniform Memory Access (NUMA).
2.5 Graphical representation of Amdahl's law. The number of threads is displayed horizontally and the corresponding speedup vertically. The values depend on the percentage of the sequential program that can be executed in parallel.

3.1 Digital Video Sampling.
3.2 Progressive and Interlaced Video.
3.3 Sub-sampling patterns for chrominance components.
3.4 Block based encoder diagram.
3.5 Motion estimation and compensation of an n x n block.
3.6 H.264 Encoder.
3.7 H.264 Decoder.
3.8 Macroblock and sub-macroblock partitions.
3.9 Current and neighboring blocks (macroblock partition) used for motion vector prediction.

4.1 Simplified H.264 decoding process.
4.2 H.264 decoding stages workload percentages of the baseline profile.
4.3 YUV 4:2:0 color components sampling format.
4.4 H.264 parallel color components decoding on a dual-core processor.
4.5 H.264 parallel color components decoding on a quad-core processor.
4.6 H.264 pipeline execution on a 4-core multiprocessor.
4.7 Execution speedup per benchmark and resolution.
4.8 Speedup for parallel luma and chroma decoding, pipelined entropy decoder, and combined pipeline and parallel decoding.
4.9 Energy consumption decrease with parallel-pipeline decoding.
4.10 Speedup increase for FFmpeg multithread version.

5.1 H.264 decoding process.
5.2 H.264 decoding stages workload percentages.
5.3 Decoding groups of macroblock rows in parallel using N threads.
5.4 Dependencies between macroblocks.
5.5 Macroblock row-based parallel algorithm. In step 1, all the macroblocks are scanned and I-MBs are identified. In step 2, rows of P-MBs and B-MBs are processed simultaneously. Finally, in step 3, the remaining I-MBs are decoded sequentially.
5.6 Parallel decoding of macroblocks mapped to (a) 4 cores and (b) 8 cores.
5.7 Sequential and parallel deblocking filter of macroblocks in the H.264 decoder.
5.8 Speedup of H.264 parallel execution of the motion compensation stage.
5.9 Speedup of H.264 parallel execution of the deblocking filter.
5.10 Overall H.264 decoding stages with parallel algorithms.
5.11 Total speedup for the complete decoding process on multicore processor.
5.12 Total energy saving for the complete decoding process on multicore processor.
5.13 Speedup of H.264 parallel execution using the Multi2Sim simulator.
5.14 Theoretical speedup of H.264 parallel execution.
5.15 Architecture of the graphical processor AMD Radeon HD 6850.
5.16 OpenCL parallel model.
5.17 Complete H.264 decoding stages with parallel motion compensation on graphics processors (GPU).
5.18 Speedup of H.264 parallel execution of motion compensation on graphics processor using CIF and HD video sequences.

6.1 Multicore architecture of the ARM Cortex-A9 multiprocessor with an L2 cache shared by 4 cores, each having a private L1 cache.
6.2 Total L1 cache misses for sequential, row-based, and wavefront parallel implementations.
6.3 L1 cache misses per core for row-based and wavefront parallel implementations of the shields video sequence.
6.4 Common L1 cache misses between each 2 cores for row-based and wavefront parallel algorithms.
6.5 Percentage distribution depending on cycles difference of common L1 cache misses between each 2 cores for the row-based parallel implementation.
6.6 Percentage distribution depending on cycles difference of common L1 cache misses between each 2 cores for the wavefront parallel implementation.
6.7 Prefetching algorithm for the two parallel motion compensation techniques.
6.8 Speedup percentage for the row-based and wavefront parallel implementations depending on the success rate of data prefetching.
6.9 Total number of cycles for row-based and wavefront parallel implementations.
6.10 Average Instructions per Cycle (IPC) for sequential, row-based, and wavefront parallel implementations.

8.1 Architecture du multiprocesseur ARM Cortex-A9 avec 4 noyaux.
8.2 Processus du décodeur H.264.
8.3 Charges moyennes des étapes du décodeur H.264 sur le processeur ARM Cortex-A9.
8.4 Format 4:2:0 des échantillons de couleurs.
8.5 Décodage H.264 des composantes des couleurs en parallèle sur un processeur dual-core.
8.6 Décodage H.264 des composantes des couleurs en parallèle sur un processeur quad-core.
8.7 Exécution en pipeline H.264 sur un processeur quad-core.
8.8 Décodage des lignes de macroblocks en parallèle.
8.9 Algorithme parallèle de la compensation des mouvements.
8.10 Accélération de l'exécution parallèle de l'étape Motion Compensation sur la plateforme ARM Cortex-A9 avec 4 cores.
8.11 Exécution parallèle globale du H.264/AVC.
8.12 Accélération totale du décodeur H.264 sur ARM Cortex-A9 avec 4 cores.
8.13 Économies Globales de Consommation d'Énergie.


Chapter 1

Introduction

1.1 Background and Motivation

Nowadays, mobile devices supporting multimedia applications are pervasive in our modern world. Most hand-held devices are equipped with high-resolution screens and fast multicore embedded processors. Dual and quad cores are found in recent smartphones and tablets, like the high-end devices offered by Samsung and Apple [5, 53]. ARM Cortex-A9 processors can have up to 4 cores per chip [6]. Cortex-A15 processors can have up to 8 cores per chip (each chip can contain 2 clusters, where each cluster can have up to 4 cores) [7]. On the other hand, applications do not automatically benefit from these powerful top-of-the-line processors. Even with new cutting-edge processors, video resolutions are increasing rapidly, which requires more processing time and consequently more energy consumption. Operating systems simply map independent applications, or multiple threads within an application, onto different cores. Therefore, one application alone may not benefit from the additional resources available unless it is designed to execute in parallel. Thus, sequential applications need to be redesigned and recompiled in order to support parallelism. The process of parallelization faces many challenges like dependencies, synchronization, data coherency, etc.

Video players, digital cameras, televisions, and phones support complex video codecs with high resolutions. However, few multimedia applications benefit from the computational potential that multicore processors offer in these emerging powerful embedded devices. Video coding standards, like H.264/AVC [26] and HEVC [63], adopted complex algorithms in order to achieve better compression and to lower transmission bitrates. The additional complexity of these algorithms has negative impacts on execution time and energy consumption.

H.264/AVC [26] is currently the most widely used video compression standard for recording, compressing, and distributing high-definition (HD) videos. The standard's first version was released in 2003 and its latest version in 2012 [26]. Most HD video streaming websites, like YouTube, currently support H.264 as their default video codec [71]. H.264 is a computationally intensive video compression standard that emerged as a result of the joint effort of the Moving Picture Experts Group (MPEG) and the Video Coding Experts Group (VCEG). The H.264 standard offers better compression and higher quality compared to other standards like MPEG-2 [65]. This improvement in compression comes at the cost of highly computational blocks like the Deblocking Filter (DF) and complex entropy decoding techniques.

In our research, we choose the H.264/AVC video decoder [26] as a computationally intensive multimedia application to be parallelized. We address the high complexity of the H.264 decoder using parallel execution on multicore embedded processors.

1.2 Problem Statement

H.264/AVC [26] is a computationally intensive video coding standard. The codec achieves good compression at the expense of slow performance. Even with new cutting-edge processors, video resolutions are increasing rapidly, which requires more processing time and consequently more energy consumption. One of the best time and energy optimization strategies is to execute an application on parallel cores. Converting or redesigning an application so that it can be executed in parallel presents many challenges like dependencies, synchronization, data coherency, shared memory, etc. In our research, we use the H.264 video decoder as a complex multimedia application to be parallelized. We solve the problem of the high complexity of the H.264 decoder using parallel execution on multicore embedded processors in order to decrease execution time and energy consumption.


1.3 Existing Solutions

Many parallel implementations of the H.264 decoder exist, ranging from parallel decoding of macroblocks (fine-grain implementations) to parallel decoding of groups of pictures (coarse-grain implementations). A macroblock is a 16x16-pixel square component of an image in a video sequence. A macroblock can also be divided into sub-blocks of smaller size. Macroblock parallel decoding is highly scalable, since many independent macroblocks can be processed in parallel. However, dependencies and large overheads arise from the memory communication and execution synchronization between macroblocks. On the other hand, parallel decoding of groups of pictures requires large memory, especially for high-definition video sequences. In addition, it has lower scalability than parallel macroblock decoding because of the small number of groups of frames that can be decoded in parallel.

1.4 Contributions

Our approaches to decoding H.264 videos in parallel range from the single-macroblock level to rows of macroblocks. Additional techniques are used to minimize the overhead of the sequential part, such as the entropy decoder.

At first, we separate the decoding process between color components for each data sample of every macroblock. Then we apply a pipeline in order to minimize the stall time caused by the synchronization of parallel cores. The parallel implementation is evaluated on a dual- and quad-core embedded processor simulator. In addition to execution time and memory usage statistics, power consumption results are presented using an advanced power estimation tool.

For our second approach, we process rows of independent macroblocks in parallel using an innovative algorithm that minimizes synchronization overhead without adding additional steps to the decoder. The motion compensation stage is processed using a proposed row-based algorithm, and the deblocking filter stage using the so-called wavefront algorithm. This level of parallel execution, based on macroblock rows, may be considered between the coarse-grain and the fine-grain parallel approaches, offering a balance between the large overheads and high

scalability of previous solutions. The proposed parallel algorithm is evaluated on a multicore simulator and on real-board platforms with multicore and graphics processors.

Finally, a detailed study is provided of the impact of parallel algorithms on cache misses in symmetric multiprocessors. Two parallel algorithms for the motion compensation stage of the H.264 decoder are evaluated and analyzed. A prefetching algorithm is proposed in order to minimize the cache misses caused by sharing data between cores.

In the following section, the outline of the thesis with a brief description of every chapter is provided.

1.5 Outline

In chapter 2, we present parallel computing concepts in terms of parallelism, memory architecture, and applications. In chapter 3, an overview of the H.264 standard is presented. The standard's coding process with its features and tools is explained in detail. In addition, existing parallel implementations and related work for the H.264 standard are also presented. Our first H.264 parallel implementation, based on parallel decoding of color components, is explained and evaluated in chapter 4. Speedup in execution time and energy saving statistics are illustrated. In chapter 5, another parallel algorithm for the H.264 decoder is described and evaluated. Groups of macroblocks are processed in parallel on different cores. The algorithm is evaluated on real-board multicore platforms and on graphics processors. Simulation results with a high number of parallel cores are also presented and discussed. In chapter 6, a cache optimization technique is proposed, based on prefetching the data of parallel macroblocks. Finally, a conclusion summarizes our contributions in the last chapter.

Chapter 2

Parallel Computing

2.1 Introduction

Parallel computing is a form of computer processing in which tasks are executed concurrently [2]. There are different levels of parallel computing: instruction-level parallelism, data-level parallelism, and thread-level parallelism. Parallelism was mainly used in high-performance computing servers and supercomputers. A decade ago, parallel computing emerged as a solution to the physical constraints that limit frequency scaling [48]. As the energy consumed by computers has become an important factor in computer systems, parallel computing became the dominant model in computer architecture, mainly in the form of multicore processors. [8]

Several types of parallel computers exist, like multicore and multiprocessor computers, which have multiple processors in a single machine. Clusters and grids use multiple computers to work on the same task simultaneously. Specialized parallel computer architectures like GPUs are also used alongside traditional processors in order to accelerate specific tasks like graphics calculations.

Parallel programs are much more difficult to write than sequential programs [21]. Concurrency usually introduces several potential software bugs like race conditions. Communication and synchronization between parallel tasks are typically among the biggest drawbacks, and they significantly affect performance. The maximum possible speedup of a program as a result of parallel processing is given by Amdahl's law. [4]


From the mid-1980s until 2004, frequency scaling was the dominant reason for improvements in computer performance. Each generation of processors offered an increased frequency compared to previous versions while keeping the remaining components almost the same. The runtime of a program is equal to the number of instructions multiplied by the average time per instruction. So, keeping everything else constant, increasing the clock frequency decreases the average time it takes to execute an instruction [21]. However, a higher frequency increases the amount of power used in a processor. Increasing processor power consumption caused the cancellation of Intel's Tejas and Jayhawk processors in May 2004. This date is generally cited as the end of frequency scaling as the dominant computer architecture paradigm. [18]

Moore's law states that transistor density in a microprocessor doubles every 18 to 24 months [40]. Moore's law is still in effect despite repeated predictions of its end. With the end of frequency scaling, the additional transistors are used to support parallel computing.

In section 2.2, we classify the types of processors and describe three levels of parallelism: instruction, data, and thread. In section 2.3, the memory architecture of parallel systems is explained. In addition, shared-memory communication and cache coherency are also discussed. At last, section 2.4 presents Amdahl's law, the challenges of parallel computing, and the most common programming languages for parallel development.

2.2 Types of Parallelism

Parallel computers can be roughly classified according to the level at which the hardware supports parallelism. This classification is broadly analogous to the distance between main computing nodes. The classes are not mutually exclusive; for example, clusters of symmetric multiprocessors are relatively common.

2.2.1 Classification of Processors

In 1972, Michael Flynn created one of the earliest and most commonly used classification systems for parallel and sequential computers, known as Flynn's taxonomy [21, 59]. Flynn classifies programs and computers by whether they operate using a single set or multiple sets of instructions, and by whether or not those instructions use a single set or multiple sets of data. Table 2.1 lists Flynn's taxonomy in tabular form.

Table 2.1: Flynn's classification scheme

                  Single Instruction    Multiple Instructions
Single Data       SISD                  MISD
Multiple Data     SIMD                  MIMD

Single-Instruction-Single-Data (SISD)
The SISD classification is equivalent to sequential program execution. The processor has a single memory and executes one instruction at a time. Uniprocessors, like the Intel Pentium 4, fall into this category.

Single-Instruction-Multiple-Data (SIMD)
The SIMD classification corresponds to repeating the same operation over a large data set. This is usually found in signal and image processing applications. Another example is matrix multiplication, where the same operation is performed on different data. SIMD instructions are currently available in most general-purpose processors. The x86 instruction set, for example, includes hundreds of SSE instructions aimed at improving the performance of multimedia applications.

Multiple-Instruction-Single-Data (MISD)
MISD is a rarely used classification. No machine has been classified in this category, mainly because few applications would fit in this class.

Multiple-Instruction-Multiple-Data (MIMD)
MIMD programs are by far the most common type of parallel programs, where a set of processors simultaneously executes different instruction sequences on different data sets. The MIMD organization is a generalization of the other categories. It has been adopted by most general-purpose processors, which has allowed the exploitation of thread-level parallelism.


2.2.2 Instruction-Level Parallelism

Basically, a computer program is a stream of instructions that are executed by a processor. These instructions can be re-ordered and assigned into groups, which are then executed in parallel without changing the outcome of the program. Since the mid-80s, all processors have used pipelining to overlap the execution of instructions and improve performance. This potential overlap of instructions is known as instruction-level parallelism (ILP). [21]

Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action performed by the processor. Thus, a processor with an n-stage pipeline can have up to n different instructions at different stages of completion. A simple example of a pipelined processor is a RISC processor with five stages: instruction fetch, decode, execute, memory access, and write back. [48]

There are two main approaches to exploiting ILP. The first approach relies on hardware to help discover and exploit parallelism. The second approach relies on software technology to find parallelism statically at compile time. Processors that use the dynamic, hardware-based approach dominate the server, desktop, and mobile markets. The recent Intel Core [24] and ARM Cortex-A9 [6] processor families use this dynamic technology. Compiler-based approaches have not been successful except for a small range of scientific applications. [21]

In addition to the instruction-level parallelism obtained from pipelining, some processors can issue more than one instruction at a time. These are known as superscalar processors. Instructions can be grouped together only if there is no data dependency between them. Scoreboarding and the Tomasulo algorithm are two of the most common techniques for implementing out-of-order execution and instruction-level parallelism. Speculative execution and branch prediction are also used to avoid stalling between instructions due to data dependencies. [27]

2.2.3 Data-Level Parallelism

For many years, single-instruction-multiple-data (SIMD) architectures were mainly used for matrix-oriented scientific applications. Nowadays, multimedia applications, like image and sound compression, exploit data-level parallelism (DLP) on SIMD architectures. In an SIMD processor, a single instruction can be applied to many data operations. Thus, SIMD is potentially more energy-efficient than the multiple-instruction-multiple-data (MIMD) architecture, which needs to fetch and execute one instruction per data operation. These two reasons make SIMD attractive for personal mobile devices. There are three main variations of SIMD: vector architectures, multimedia SIMD instruction set extensions, and graphics processing units (GPUs). [21]

Figure 2.1: Basic structure of a vector architecture (VMIPS). The scalar architecture is similar to MIPS. The processor contains eight 64-element vector registers, and all functional units are vector functional units.

Vector Architectures
Vector processors, which have been available for more than 30 years, support the pipelined execution of many data operations. They were very expensive until recently. They are considered a generalized architecture for SIMD compared to the other variations. They require relatively more transistors and higher DRAM bandwidth in comparison to conventional computers [21]. Figure 2.1 shows the basic architecture of a vector processor with the VMIPS instruction set architecture, which is a logical extension of MIPS.

Multimedia SIMD Instruction Set Extensions
SIMD instruction set extensions are currently available in most instruction set architectures that support multimedia applications. These additional instructions are mainly used to perform simultaneous parallel data operations. For x86 architectures, the SIMD instruction extensions started with MMX (Multimedia Extensions) in 1996. They were followed by several SSE (Streaming SIMD Extensions) versions and, lately, by the AVX (Advanced Vector Extensions) instructions. Programmers need to use these SIMD instructions, especially for floating-point operations, in order to get the most out of an x86 computer. [21]

General-Purpose Graphics Processing Units (GPGPU)
Traditional multicore computers today include graphics processing unit (GPU) hardware. Together with the main processor (CPU), it forms a heterogeneous architecture that is suitable for multimedia-intensive applications [21]. General-purpose computing on graphics processing units (GPGPU) is a fairly recent trend in computer engineering research. GPUs are co-processors that have been heavily optimized for computer graphics processing. Computer graphics processing is a field dominated by data-parallel operations, particularly linear algebra matrix operations. In the early days, GPGPU programs used the normal graphics APIs for executing programs. However, several new programming languages and platforms have been built to do general-purpose computation on GPUs, with Nvidia and AMD releasing the programming environments CUDA and Stream SDK respectively [3, 43]. The technology consortium Khronos Group has released the OpenCL specification, which is a framework for writing programs that execute across platforms consisting of CPUs and GPUs. AMD, Apple, Intel, Nvidia, and others support OpenCL. [30]
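To illustrate the multimedia SIMD extensions described above, the following minimal C sketch (function and variable names are ours, not from the original text) uses SSE intrinsics so that a single instruction adds four floats at a time; it assumes an x86 processor with SSE support and an array length that is a multiple of 4.

#include <xmmintrin.h>  /* SSE intrinsics */

/* Element-wise addition of two float arrays, four lanes per instruction.
   A minimal sketch: n is assumed to be a multiple of 4. */
void add_arrays_sse(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);    /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vsum = _mm_add_ps(va, vb);   /* one SIMD add covers 4 data */
        _mm_storeu_ps(&out[i], vsum);       /* store 4 results */
    }
}

The loop body is the SIMD idea in miniature: one instruction stream, multiple data elements per operation.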

2.2.4 Thread-Level Parallelism

In this section, we focus on exploiting thread-level parallelism (TLP) through multiple-instruction-multiple-data (MIMD) architectures. Thread-level parallelism has become available relatively recently in high-end servers, embedded systems, and general-purpose applications. Computers that share the same memory address space and run a single operating system are called multiprocessors. Typically, the number of processors in a multiprocessor system ranges from two to dozens of processors. Multiprocessors that have their shared memory address space on a single chip are called multicores. Multiprocessors may also consist of several multicore chips.

In order to benefit from an MIMD multiprocessor with n processors, we must have n threads or processes that run concurrently. These independent threads are typically identified by the programmer. The granularity, or grain size, of each thread, which is the amount of computation assigned to it, usually consists of hundreds of millions of instructions that will be executed in parallel. Threads can also exploit data-level parallelism (DLP). However, the overhead of data communication is relatively higher than on single-instruction-multiple-data (SIMD) architectures. The grain size of parallel threads should therefore be large enough to exploit parallelism efficiently.

Shared-memory multiprocessors are divided into two classes depending on the memory organization and the communication protocol. Each memory organization model is suitable for a system with a specific number of processors. For example, a system with 32 processors is unlikely to have, at least for now, all the processors and the shared memory on the same chip.
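To make the idea of programmer-identified threads concrete, here is a minimal POSIX Threads sketch (all names are ours) in which each thread is handed a coarse-grained chunk of an array, large enough to amortize the thread-management overhead.

#include <pthread.h>

#define NUM_THREADS 4
#define N 1000000

static double data[N];

struct chunk { int begin; int end; };

/* Each thread scales its own contiguous chunk of the array. */
static void *worker(void *arg)
{
    struct chunk *c = (struct chunk *)arg;
    for (int i = c->begin; i < c->end; i++)
        data[i] *= 2.0;
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    struct chunk chunks[NUM_THREADS];
    int step = N / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        chunks[t].begin = t * step;
        chunks[t].end = (t == NUM_THREADS - 1) ? N : (t + 1) * step;
        pthread_create(&tid[t], NULL, worker, &chunks[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);   /* wait for all chunks to finish */
    return 0;
}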

2.2.4.1 Symmetric Shared-Memory Multiprocessors (SMP)

For a small number of processors, typically 16 or fewer, the processors may share a single centralized memory to which they all have equal access. In other terms, all the processors are symmetric in terms of memory access. This group of multiprocessors is called symmetric shared-memory multiprocessors (SMPs). All existing multicore chips are SMPs in the sense that they all have symmetric access, or a uniform latency, to a centralized shared memory. Hence, SMP architectures are also called uniform memory access (UMA) multiprocessors. Figure 2.2 shows the basic architecture of an SMP with 4 processors. These processors communicate with each other through shared variables in memory. Access to the shared variables must be coordinated via synchronization primitives, called locks, that prevent multiple processors from accessing the same data at the same time. SMPs share memory and connect via a bus. However, bus contention prevents bus architectures from scaling. As a result, SMPs generally do not comprise more than 32 processors. Because of the small size of the processors and the significant reduction in bus bandwidth requirements achieved by large caches, symmetric multiprocessors became cost-effective. [21]

Figure 2.2: Basic structure of a Symmetric MultiProcessor (SMP) with a single address space on the same physical memory shared by all processors. This multiprocessor architecture is also called Uniform Memory Access (UMA).

Multicore Processors
The most common SMP chips are the multicore processors that are nowadays available in most desktop and portable computer devices. A multicore processor is a processor that includes multiple execution units, called cores, on the same chip. A multicore processor can issue multiple instructions per cycle from multiple instruction streams. Each core in a multicore processor can itself be superscalar, issuing multiple instructions on every cycle from one instruction stream. Communication between the cores is usually maintained through shared memory access. Multicore processors dominate the consumer market for personal computers with the Intel Core processor family [24] and hand-held devices with the ARM Cortex-A9 processors [6]. Figure 2.3 illustrates the simplified architecture of the ARM Cortex-A9 multicore processor with four 32-bit cores and a shared level-2 cache.

Figure 2.3: Multicore architecture of the ARM Cortex-A9 multiprocessor with an L2 cache shared by 4 cores, each having a private L1 cache.

Reconfigurable Computing with Field-Programmable Gate Arrays
Reconfigurable computing is the use of a field-programmable gate array (FPGA) as a co-processor to a general-purpose computer. An FPGA is, in essence, a computer chip that can rewire itself for a given task. FPGAs can be programmed with hardware description languages such as VHDL or Verilog. [45]

Application-Specific Integrated Circuits (ASIC)
Several application-specific integrated circuit (ASIC) approaches have been devised for dealing with parallel applications [1, 37, 56]. By definition, an ASIC is specific to a given application and can be fully optimized for it. As a result, for a given application, an ASIC tends to outperform a general-purpose computer. However, ASICs can be extremely expensive, which has rendered them unfeasible for most parallel computing applications.

2.2.4.2 Distributed Shared-Memory Multiprocessors (DSM)

Multiprocessors with physically distributed memory are called distributed shared-memory multiprocessors (DSMs). Figure 2.4 illustrates the basic architecture of a DSM with 8 nodes, where each node can be a multicore processor. A centralized memory would cause dramatically long access latencies when supporting the bandwidth of a large number of processors; hence the need for a distributed shared memory to connect many processors together. Distributing the memory among the nodes of a DSM both increases the bandwidth and reduces the latency to local memory. Hence, a DSM multiprocessor is also called a non-uniform memory access (NUMA) architecture, for the reason that the access latency depends on the location of the data being accessed. A major disadvantage of a DSM is the complex communication among processors. Additional effort at the application level must be performed by programmers in order to manipulate the distributed shared data.

Figure 2.4: Basic structure of a Distributed MultiProcessor (DSM) with a single address space composed of several physical memories shared by all processors. This multiprocessor architecture is also called Non-Uniform Memory Access (NUMA).

The term shared memory in both SMP and DSM architectures refers to the fact that both architectures have an address space which is shared. Any processor can issue a memory reference to any memory location. In contrast, clusters and warehouse-scale computers are individual computers connected by a network. In these architectures, the memory of one computer cannot be accessed by another without the assistance of the message-passing protocols that are used for data communication among processors.

A distributed computer system is one in which the processing elements are connected by a network. Distributed computers are highly scalable. The network communication may have different types of topologies like stars, rings, trees, and meshes. The following architectures are the most common types of distributed systems.

Clusters
A cluster is a group of loosely coupled computers that work together closely so that they can be regarded as a single computer. Clusters are composed of multiple standalone machines connected by a network. While the machines in a cluster do not have to be symmetric, load balancing is more difficult if they are not. A typical cluster is implemented on multiple identical commercial off-the-shelf computers connected with a local area network.

Massively Parallel Processors
A massively parallel processor (MPP) is a single computer with many connected processors. MPPs have many of the same characteristics as clusters, but MPPs have specialized interconnect networks (whereas clusters use commodity hardware for networking). MPPs also tend to be larger than clusters, typically having more than 100 processors [21]. In an MPP, each CPU contains its own memory and copy of the operating system and application. Each subsystem communicates with the others via a high-speed interconnect.

Grids
Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the Internet to work on a given problem. Because of the low bandwidth and extremely high latency available on the Internet, grid computing typically deals only with embarrassingly parallel problems. Most grid computing applications use middleware, software that sits between the operating system and the application to manage network resources and standardize the software interface. The most common distributed computing middleware is the Berkeley Open Infrastructure for Network Computing (BOINC).

2.3 Memory Architecture for Parallel Systems

2.3.1 Main Memory

Main memory in a parallel computer is either shared or distributed. When all processing elements have a single address space, then the memory is shared among those elements. When each processing element has its own private memory, a local address space, then the memory is distributed. Accesses to local memory are typically faster than accesses to non-local memory. [48]


2.3.2 Processor Communication

Processor-processor and processor-memory communication can be implemented in hardware in several ways. Networks with different types of topologies can be used, like stars, rings, trees, hypercubes, and meshes. Interconnect networks usually provide message-passing routines for communication. Lower-level communication can be done using a shared multiplexed memory, a crossbar switch, or a shared bus. Large multiprocessor machines usually use hierarchical architectures for communication between processors.

2.3.3 Memory Access

Computer architectures in which each element of main memory can be accessed with equal latency and bandwidth are known as Uniform Memory Access (UMA) systems. Typically, this can be achieved only by a shared memory system in which the memory is not physically distributed. A system that does not have this property is known as a Non-Uniform Memory Access (NUMA) architecture. Distributed memory systems have non-uniform memory access. Figure 2.4 illustrates 8 processors classified into 2 groups. When a processor accesses the memory, its memory latency depends on the location of the accessed data relative to the processor.

2.3.4 Caches

Most computer systems use caches which are small, fast memories located close to the processor that store temporary copies of memory values. These mem- ory blocks are close to the processor physically and logically. Parallel computer systems have difficulties with caches that may store the same value in more than one location. These inconsistencies may result in the possibility of incorrect pro- gram execution. These computers require a cache coherency system, which keeps track of cached values and ensures correct program execution. Designing large, high-performance cache coherence systems is a very difficult problem in computer architecture. For this reason, shared-memory computer architectures do not scale as well as distributed memory systems do. [48]


2.3.5 Cache Coherency

The main purpose of cache coherency is to keep data consistent across multiple cache memories. The two common write policies used in caches are:

Write back - write operations are made only in the cache. Main memory is updated only when the corresponding cache line is flushed from the cache.

Write through - all write operations are made in the cache and in the main memory, ensuring that data in the main memory is always valid.

The write-back approach can result in data inconsistency. If two caches contain the same data, and this data is updated in one cache, the other cache will unknowingly hold an invalid value. Subsequently, invalid reads will produce invalid results. Inconsistency can occur even with the write-through policy, unless the other caches monitor the memory traffic or receive some direct notification of the update [21, 48]. The cache coherence protocols that have been proposed to solve these problems have generally been divided into software and hardware approaches. Some implementations adopt a hybrid strategy that involves both software and hardware approaches.

2.3.5.1 Software Solution

Software cache coherence schemes attempt to avoid the need for additional hardware circuitry and logic by relying on the compiler and operating system to deal with the problem [48]. Software approaches are attractive because the overhead of detecting potential problems is paid at compile time instead of run time, and the design complexity is transferred from hardware to software. On the other hand, compile-time software approaches usually make very conservative decisions, frequently leading to inefficient cache utilization. One of the simplest approaches is to prevent any shared data variables from being cached. This is usually too conservative, because a shared data structure may be exclusively used during some periods and may be effectively read-only during others. Cache coherence is an issue only during certain periods, when at least one process may update a variable and at least one other process may access it. More efficient approaches analyze the code to determine safe periods for shared variable access. The compiler then inserts specific instructions into the generated code to enforce cache coherence during the critical periods. [48]

2.3.5.2 Hardware Solution

These solutions provide runtime recognition of potential inconsistency conditions. Because this approach works on the fly, cache coherency is handled more efficiently than in the software approach. In addition, the hardware approach is transparent to the programmer and to the compiler, thus reducing the software development responsibilities. Hardware approaches are mainly divided into two categories: directory and snoopy protocols. [21, 48]

Directory protocols
Directory protocols collect and maintain information about where copies of shared data reside in one location, called the directory. Typically, the directory is managed and manipulated by a centralized controller integrated in the main memory controller. When an individual cache controller makes a request, the directory controller checks and issues the necessary commands for data transfer between memory and caches or between the caches themselves. [21, 48]

Snoopy Protocols
In snoopy protocols, the responsibility for maintaining cache coherence is distributed among all cache controllers. A cache must recognize when a memory block is shared with other caches. When an update action is performed on a shared block, it must be announced to all other caches by a broadcast mechanism. Each cache controller is able to snoop on the network to observe these notifications and react accordingly [21, 48]. Snoopy protocols are ideally suited to a bus-based multiprocessor, since the shared bus provides a simple way of broadcasting and snooping. However, care must be taken so that the increased bus traffic required for broadcasting and snooping does not cancel out the gains from the use of local caches. [21]


2.4 Parallel Applications

Originally, computer applications were written for sequential execution, where only one thread handles the entire processing of the program. The central processing unit (CPU) executes the instructions of an algorithm one instruction at a time. Parallel computing, on the other hand, solves a problem by using multiple processing elements simultaneously. This is mainly accomplished by breaking the problem into independent parts that can be executed simultaneously. Such problems are usually complex, time-consuming algorithms with heavy workloads. Another domain where parallel computing is widely used is gaming and multimedia applications, like image and video compression. The processing elements that support parallel computing are diverse, and they may include several resources such as a single computer with multiple processors, several networked computers, specialized hardware, or any combination of the above. [47]

2.4.1 Amdahl’s Law

Theoretically, the runtime of a parallel computer program should be divided by the number of processing elements that execute the parts of the program concurrently. However, very few parallel algorithms achieve this optimal speedup. Most parallel algorithms can achieve near-linear speedup using small numbers of processing elements. When the number of parallel processors becomes high, the speedup flattens out as the maximum theoretical value is reached. The potential speedup of an algorithm on a parallel computing platform is given by Amdahl's law [4]. The law states that the overall speedup of a program is limited by the part of the program which cannot be executed in parallel. A program typically consists of several parallel parts and several sequential parts. If n is the number of available processors and p is the proportion of the program that can be executed in parallel, then the speedup, according to Amdahl's law, is calculated using the equation s(n) = 1 / ((1 - p) + p/n). Figure 2.5 illustrates the speedup of a parallel algorithm when 25%, 50%, 75%, 90%, or 95% of the overall program is executed in parallel. A threshold is reached when a certain number of processors is used. For example, when 50% of a program is executed in parallel, the optimal speedup is 2. A maximum speedup of 10 is attained when 90% of the program is executed in parallel, no matter how many processors are added. In addition to the limitation caused by the serial part of a program, the speedup of parallel algorithms is also affected by several factors like dependencies, scheduling, load balancing, synchronization, and communication overhead.

Figure 2.5: Graphical representation of Amdahl's law. The number of threads is displayed horizontally and the corresponding speedup vertically. The values depend on the percentage of the sequential program that can be executed in parallel.
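As a quick numerical check of the formula (a minimal sketch; the function name is ours), the following C snippet evaluates Amdahl's speedup: with p = 0.9 and n = 100 it prints roughly 9.17, approaching the limit of 10 as n grows.

#include <stdio.h>

/* Amdahl's law: s(n) = 1 / ((1 - p) + p / n),
   where p is the parallel fraction and n the processor count. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void)
{
    printf("p = 0.50, n = 1000: %.2f\n", amdahl_speedup(0.50, 1000)); /* ~2.00 */
    printf("p = 0.90, n = 100:  %.2f\n", amdahl_speedup(0.90, 100));  /* ~9.17 */
    return 0;
}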

2.4.2 Challenges of Parallel Processing

A parallel application may have independent threads that require no communication to complete the task, or it may have dependent threads where communication between the threads is essential to complete the required execution. The speedup of a parallel program, as explained by Amdahl's law [4], mainly faces two important challenges, whose difficulty depends on the application and the multiprocessor architecture. The first challenge is the limited amount of parallelism available in a program, and the second is the high cost of communication. For example, in order to achieve a speedup of 80 using 100 processors, the sequential part should not be more than 0.25% of the overall application. Parallel applications with such high parallelism are currently rare, and thus they are hard to find among typical algorithms. [21]

The second challenge concerns the large latency of remote access in a parallel processor. Communication of data between different cores in existing shared-memory multiprocessors may cost from 35 to 50 clock cycles, while communication of data between separate chips in large-scale multiprocessors may cost from 100 to 500 clock cycles. The high memory latency may have a dramatic effect on the overall execution of parallel applications, which can end up slower than the sequential execution of the application.

The two problems described above, insufficient parallelism and long-latency remote communication, deeply affect the usability of multiprocessors. Parallel algorithms should be designed in a way that maximizes the size of parallel tasks and reduces as much as possible the amount of communication. Some techniques include caching shared data, prefetching, reducing remote access frequency, load balancing, etc.

In our research, we explore the possibilities of parallelizing the H.264/AVC [26] decoder as a computationally intensive application. As stated above, we intend to maximize the size of parallel tasks and to decrease the amount of data communication. In addition, we implement parallel algorithms with high scalability and good load balance in order to improve the overall performance of the video decoder.

In this section, we describe the main characteristics of parallel programs. These features, which affect the overall performance of parallel programs, determine the efficiency of the parallelism applied to parallel applications.

2.4.2.1 Granularity

Applications are often classified according to how often their parallel subtasks need to synchronize or communicate with each other. Fine-grain parallelism occurs when the parallel threads of a program communicate heavily with each other, many times per second. In contrast, coarse-grain parallelism occurs when parallel threads communicate only a few times per second. If parallel threads rarely or never have to communicate with each other, the application is called embarrassingly parallel, which is considered the easiest case to implement [48].


2.4.2.2 Synchronization

A computer program in execution is usually called a process. A process can create multiple lightweight sub-processes that consider the main process as their parent; these lightweight processes are usually called threads. All communication and synchronization between parallel threads is an overhead with respect to the original serial version of the program. Consequently, an excessive number of locks around mutual exclusion sections and synchronization barriers will increase the runtime of the program. This increase in execution time is known as parallel slowdown or overhead [58].
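As a minimal sketch of this effect, the OpenMP fragment below (our own illustration) contrasts a loop that takes a lock on every iteration with the idiomatic reduction form that synchronizes only once per thread:

```c
#include <omp.h>

/* Guarding every single addition with a critical section serializes
 * the loop and adds lock overhead on top -- a typical source of
 * parallel slowdown. */
double sum_slow(const double *a, int n)
{
    double sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical      /* one lock per element: pure overhead */
        sum += a[i];
    }
    return sum;
}

/* The reduction clause keeps a private partial sum per thread and
 * synchronizes only once, when the partial sums are combined. */
double sum_fast(const double *a, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```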

2.4.2.3 Dependencies

Whenever a parallel algorithm is implemented, the dependencies among its data must be respected. Some calculations need to be completed before subsequent calculations can be performed, and such dependencies limit the portion of code that can be processed in parallel. Thus, the parallel code segments of a program must not have any data dependencies between them in order to produce the same output as the original sequential program. In addition, the overall execution time of the parallel code is limited by the largest parallel chunk of code: some parts may finish before others, and the program execution is completed only when all parts of the code have been processed [48].
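A hypothetical C example of the difference: the first loop below carries a dependency from one iteration to the next and cannot be parallelized as written, while the second has fully independent iterations.

```c
/* Loop-carried dependency: each iteration reads the value written
 * by the previous one, so the iterations cannot run in parallel. */
void prefix_sum(double *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];   /* depends on iteration i-1 */
}

/* Independent iterations: each element is computed from the inputs
 * only, so the loop can safely be split across processors. */
void scale(double *out, const double *in, double k, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = k * in[i];       /* no inter-iteration dependency */
}
```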

2.4.2.4 Scalability

In order to achieve a good speedup, parallel programs should scale well as the number of processors increases; the speedup should remain close to the theoretical speedup given by Amdahl’s law [4]. A parallel program is said to exhibit strong scaling when the problem size remains almost the same as in the original serial program while the number of processors grows. On the other hand, weak scaling is achieved when the problem size increases proportionally with the number of processors [48].


2.4.2.5 Load Balance

In a parallel algorithm, the total execution time of the whole program is limited by the largest piece of code among the parallel parts. Some processors executing parts of a program in parallel may finish before or after others [48]. If the offset between the execution times of the parallel codes on different processors is large, the speedup suffers, and the speedup predicted by Amdahl’s law no longer applies. A parallel program with good load balance therefore has all the parts of code that will be executed in parallel on different processors of almost equal size.
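As an illustrative sketch (assuming OpenMP and a hypothetical process_unit function whose cost varies per unit), dynamic scheduling is one common way to even out parallel parts of unequal size:

```c
#include <omp.h>

extern void process_unit(int id);   /* hypothetical: cost varies per unit */

void decode_units(int n_units)
{
    /* A static schedule gives each thread a fixed slice of iterations;
     * if some units are far more expensive than others, threads finish
     * at different times and the slowest one limits the speedup.
     * A dynamic schedule hands out small chunks on demand instead,
     * evening out the load across the available cores. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n_units; i++)
        process_unit(i);
}
```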

2.4.3 Programming Languages

Concurrent programming languages, libraries, APIs, and parallel programming models have been created for programming parallel computers. These can generally be divided into classes based on the memory architecture: shared memory and distributed memory. Shared memory programming languages communicate by manipulating shared memory variables; POSIX Threads and OpenMP are two of the most widely used shared memory APIs [58]. Distributed memory programming uses message passing, and the Message Passing Interface (MPI) is the most widely used message-passing API. CUDA and OpenCL are used to write C-like code for GPGPU kernels [30, 44].

Message Passing Interface (MPI)
MPI is a low-level API which provides a standard communication mechanism and which is implemented on top of high-performance network drivers. MPI is based on a process fork model and runs on both distributed and shared-memory systems. It scales well to very large numbers of processors. However, application development in MPI is often rather difficult, since each node has to be programmed separately [52].

Portable Operating System Interface for Unix (POSIX)
In the POSIX standard, parallelism is exploited using processes and threads. Processes are defined as independent execution units that contain their own state information, have individual address spaces, and only interact with each other via interprocess mechanisms managed by the operating system. On the other hand, the threads within a process share the same state and memory space, which leads to lightweight processing flows with low-latency context switching. Communication between threads is usually performed using shared variables [13].

OpenMP
OpenMP is a standard defining a set of compiler directives for the C/C++ and Fortran languages, based on the thread-fork model. These directives provide an easy and explicit way to mark the code that can be executed in parallel. When OpenMP is used, parallel regions are simply defined using #pragma directives, reducing the parallelization complexity. Despite these advantages, scalability is limited by the number of cores of the given platform [13].

CUDA and OpenCL
Recently, General Purpose Graphics Processing Units (GPGPU) have emerged as a powerful alternative solution beyond computer graphics manipulation. Their highly parallel structure makes them very suitable for running complex algorithms. In the past, GPUs were programmed using the programmable shaders found in graphics APIs (OpenGL, DirectX), which required adapting general purpose code to the graphics API. In 2007, NVIDIA released CUDA (Compute Unified Device Architecture), allowing code written in the C language to be executed on a GPU. Meanwhile, the generic framework OpenCL (Open Computing Language) has also been used for GPU programming; code written in OpenCL can be executed across heterogeneous platforms, mainly composed of CPUs and GPUs [60]. Nevertheless, despite their efficiency for parallel computation, the use of CUDA or OpenCL often requires a high bus bandwidth between the CPU and the GPU, which can lead to a bus bottleneck [30].
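The following minimal POSIX threads sketch (ours, for illustration; all names are hypothetical) shows the fork/join pattern described above, with threads communicating through shared variables:

```c
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N 1000

/* Threads of one process share the same address space, so they can
 * communicate through shared variables such as these arrays. */
static double data[N];
static double partial[N_THREADS];

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    int chunk = N / N_THREADS;
    double s = 0.0;
    for (int i = id * chunk; i < (id + 1) * chunk; i++)
        s += data[i];
    partial[id] = s;              /* result published via shared memory */
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t tid[N_THREADS];
    for (long i = 0; i < N_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    double sum = 0.0;
    for (int i = 0; i < N_THREADS; i++) {
        pthread_join(tid[i], NULL);   /* join: wait for each thread */
        sum += partial[i];
    }
    printf("sum = %.1f\n", sum);      /* expected: 1000.0 */
    return 0;
}
```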


2.5 Conclusion

In this chapter, we described parallel computing at the hardware and software levels. At the beginning, we listed Flynn’s classification scheme for processors. Then, brief explanations of instruction-level and data-level parallelism were presented. Afterwards, thread-level parallelism, with its different hardware implementations, was explained in detail. Moreover, different memory architectures for parallel processors were explained at the main memory and cache levels, and cache coherency constraints and solutions were described. In addition, the characteristics and challenges of parallel applications were listed, and Amdahl’s law was briefly described. Finally, the most common programming languages for parallel software implementations were presented.

Chapter 3

H.264/AVC Standard Overview

3.1 Introduction

The Moving Picture Experts Group (MPEG) and the Video Coding Experts Group (VCEG) jointly developed in 2003 the Advanced Video Coding (AVC) standard, published as ITU-T Recommendation H.264 and as Part 10 of MPEG-4 [26]. Since the first commercial implementations, several multimedia device manufacturers have adopted the new video codec. Ten years after the first release of the final draft of the standard, H.264 is currently the most widely used video compression standard in multimedia devices, according to many articles and magazines such as PCWorld.com [23]. Cameras, smartphones, PDAs, CCTV recorders, Blu-ray disc players and many other devices use H.264 for encoding and decoding videos. H.264 achieves better compression and higher quality at the expense of more complex algorithms. Thus, more computation resources are used and more energy is consumed as the compression ratio of video files increases.

This chapter provides an overview of the main features of the standard. The following chapters will use this content as a reference for the concepts and features of the H.264 standard; in particular, the proposed parallel algorithms and optimizations are based on the H.264 decoding process explained in this chapter. A review of video coding is presented in section 3.2. The development history of the standard is briefly discussed in section 3.3. Next, a high level overview of H.264/AVC is provided in section 3.4. Section 3.5 discusses the functional stages of the encoder and the decoder. Section 3.6 focuses on some of the specific coding tools available in the video coding layer. Section 3.8 summarizes the H.264/AVC profiles and concludes the chapter.

3.2 Video Coding Review

This section provides essential background information on video coding. A brief description is presented for digital video sampling, color spaces and common picture formats. Then, block based video coding is explained with the different stages for encoding a video sequence. Common video compression standards are also listed with short descriptions.

3.2.1 Digital Video

Digital video is a stream of fixed size images captured at regular time intervals. The images are represented as digitized samples containing visual information (color and light) at each spatial and temporal location.

3.2.1.1 Sampling and Resolution

Figure 3.1 shows the sampling process of digital video in time and space. The resolution of the image is determined by the number of horizontal and vertical samples, or pixels. The frequency at which each image is captured (temporal sampling) affects the motion smoothness of the video: motion appears smoother at higher frame rates. Typical temporal sampling frequencies (frame rates) are 25 Hz and 30 Hz for low resolutions; for high definition resolutions, typical frequencies are 50 Hz and 60 Hz. In digital video processing, different spatial resolutions are used depending on the target application. For example, for Blu-ray movies and gaming consoles, the typical resolutions are 1280 x 720 pixels (HD 720) and 1920 x 1080 pixels (HD 1080). On the other hand, in video conferencing and web content applications, video sequences typically have low resolutions such as CIF (Common Intermediate Format), which is composed of 352 x 288 pixels, and VGA (Video Graphics Array), which is composed of 640 x 480 pixels. Some of the most widely used formats are listed in Table 3.1.


Figure 3.1: Digital Video Sampling.

Table 3.1: Common video formats and resolutions

Format name    Pixel resolution (Horizontal x Vertical)
CIF            352 x 288
VGA            640 x 480
SVGA           800 x 600
XVGA           1024 x 768
HD 720         1280 x 720
HD 1080        1920 x 1080

3.2.1.2 Frames and Fields

A video signal can be sampled as either frames (progressive) or fields (interlaced). In progressive video, a complete frame is sampled at each time instant. In interlaced video, only half of the frame, either the odd or the even rows of samples, is captured at a particular time instant; these half-frames are called fields. The field containing the odd rows of samples, including the first row, is called the top field, and the field containing the even rows is called the bottom field. Figure 3.2 illustrates the concept of progressive and interlaced video sampling, in other words, frames and fields.


Figure 3.2: Progressive and Interlaced Video.

3.2.1.3 Color Spaces

Visual information at each sample point may be represented by the values of three basic color components: Red (R), Green (G) and Blue (B). This representation is called the RGB color space. Each value is stored in an n-bit number, where n is also called the color depth. For example, an 8-bit color depth can store 256 levels for each color component.

The YCrCb color space is widely used to represent digital video. The luminance component Y (also called luma) is extracted as a weighted average of the three color components R, G and B. The components Cr and Cb, called the chrominance (or chroma) components, are the color differences with respect to Y: Cr is the red chrominance component (Cr = R - Y) and Cb is the blue chrominance component (Cb = B - Y). The exact computation of the YCrCb color space from the RGB color space can be found in Richardson’s book on video codec design [50]. The human visual system is less sensitive to color information than to luminance (light intensity) information [70]. Therefore, by separating the luminance information from the color information, it is possible to represent the color information at a lower resolution than the luminance information.

Figure 3.3 shows the three widely used formats for representing chroma and luma samples: the 4:4:4, 4:2:2, and 4:2:0 formats. In the 4:4:4 format, each pixel position has both luma and chroma samples. In the 4:2:2 format, the chroma components are sub-sampled (every other pixel) in the horizontal direction. In the 4:2:0 format, the chroma samples are sub-sampled in both the vertical and horizontal directions. This is the most popular format in entertainment-quality applications such as DVD video, because the human eye does not easily notice the missing color information. All video sequences used in our research have the 4:2:0 color sampling format.

Figure 3.3: Sub-sampling patterns for chrominance components.
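As an illustration of the conversion sketched above, the following C function uses one common set of weights (ITU-R BT.601, 8-bit samples). The exact constants are a standard choice made for this sketch, not taken from this thesis, and other variants exist [50]:

```c
/* One common form of the RGB -> YCbCr conversion (BT.601, 8-bit).
 * Y is a weighted average of R, G, B; the chroma components are
 * scaled color differences (Cr ~ R - Y, Cb ~ B - Y), offset to the
 * middle of the 8-bit range so they stay non-negative. */
static void rgb_to_ycbcr(unsigned char r, unsigned char g, unsigned char b,
                         unsigned char *y, unsigned char *cb, unsigned char *cr)
{
    double yf = 0.299 * r + 0.587 * g + 0.114 * b;   /* luma */
    *y  = (unsigned char)(yf + 0.5);
    *cb = (unsigned char)(0.564 * (b - yf) + 128.5); /* blue difference */
    *cr = (unsigned char)(0.713 * (r - yf) + 128.5); /* red difference */
}
```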

3.2.2 Block Based Video Coding

In block based video coding, the basic unit of coding is a block containing an n x n array of luma samples and their corresponding chroma samples. The image is divided into a number of fixed-size blocks. These blocks are processed in raster scan order, from left to right within each row and row by row from top to bottom. Figure 3.4 shows a block diagram of a typical block based video encoder. The encoder has two data flow paths: the forward path represents the encoding process for coding the blocks, and the reverse path (grey lines) shows the decoding reconstruction path within the encoder. The major elements of block based encoding are the inter and intra prediction processes, the transform, quantization and entropy coding.

3.2.2.1 Intra Prediction

Block based video encoders use prediction as a tool for removing redundant information. A prediction signal is obtained from previously coded samples for the coding unit, and it is subtracted from the original coding unit to create a residual signal that contains much less data than the original coding unit. It is this residual signal that is encoded and transmitted to the decoder (node A in figure 3.4). The decoder obtains the same prediction signal using previously decoded samples, decodes the residual signal and adds the two together to reconstruct the coding unit.

Figure 3.4: Block based encoder diagram.

In intra prediction, each coding unit is predicted using the surrounding pixels of the same image which have already been coded. Intra coding is always used in the first image of a sequence. It is also very useful for coding uniform regions, where the pixels surrounding a block have values similar to the pixels inside the block. Intra prediction is used in relatively recent video coding standards such as H.263 [19] and H.264/AVC [26].

3.2.2.2 Inter Prediction

In general, consecutive video images are very similar to each other. Differences between images mostly arise from the movement of objects in the video scene. Inter prediction is used to remove this temporal redundancy between video images. The prediction signal of a coding unit is obtained from a previously encoded and reconstructed image. The aim is to find a good match for the current block in the previously coded image. This can be done by following the motion of an object over time between the two images. It is usually very difficult to find an exact match by precisely following the motion. However, a reasonably accurate match can be found by searching for a similar block within a restricted region of the image. This process is illustrated in figure 3.5.

Figure 3.5: Motion estimation and compensation of an n x n block.

Common terms related to the inter prediction process are introduced below:
– Reference Image: the previously encoded and reconstructed image that is used for the prediction of blocks in the current image.
– Motion Estimation: the process of searching for and finding the block (B) in the reference image that most closely matches the current block (A).
– Motion Compensation: selecting the best matching block as the prediction and obtaining the residual by subtracting the prediction from the original block.
– Motion Vector (MV): the vector representing the displacement (horizontal and vertical) of the matching block from the position of the original block.

For inter predicted coding units, the residual signal is encoded and transmitted to the decoder along with the motion vector values. The decoder uses the motion vector values to find the correct prediction block, and the decoded residual is added to reconstruct the coding unit.
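A hypothetical full-search motion estimation sketch in C is shown below: it scans a restricted window and returns the motion vector minimizing the sum of absolute differences (SAD), one common matching criterion. All names and the image layout are assumptions of this sketch, and the caller must keep the search window inside the image.

```c
#include <stdlib.h>
#include <limits.h>

typedef struct { int dx, dy; } mv_t;

/* Full search over a window of +/- 'range' pixels around the block at
 * (bx, by) in the current picture; 'stride' is the row length of both
 * images.  Returns the displacement of the best-matching n x n block. */
static mv_t full_search(const unsigned char *cur, const unsigned char *ref,
                        int stride, int bx, int by, int n, int range)
{
    mv_t best = { 0, 0 };
    int best_sad = INT_MAX;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int sad = 0;
            for (int y = 0; y < n; y++)
                for (int x = 0; x < n; x++)
                    sad += abs(cur[(by + y) * stride + bx + x] -
                               ref[(by + dy + y) * stride + bx + dx + x]);
            if (sad < best_sad) { best_sad = sad; best = (mv_t){ dx, dy }; }
        }
    }
    return best;   /* motion vector of the closest matching block */
}
```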

3.2.2.3 I, P and B Frames

In I-Frames, all the coding units are predicted using intra prediction only. These frames are used as the first frames of a sequence and as random access points for rewinding and fast-forwarding without the need to decode all the pictures. P-Frames are inter predicted pictures whose reference is the nearest previously coded picture. They cannot be used for random access, because of their dependency on previously coded pictures, but they can serve as reference pictures. B-Frames are bidirectionally predicted frames which require two reference frames for inter prediction, one from the past and one from the future in display order. They typically have high compression efficiency; however, they are not used as references and cannot be used for random access.

3.2.2.4 Transform Coding

The residual block obtained after the prediction stages is transformed from the spatial domain into the transform domain using a two-dimensional block transform, resulting in a block of transform coefficients. These transform coefficients represent the residual image block. The transform needs to be reversible (inverse transform) in order to recover the image residuals from the transform coefficients. The transform in itself does not achieve any compression. However, it serves energy compaction, concentrating most of the energy into a small number of large coefficients. The transform also de-correlates the data, as its coefficients have minimal inter-dependency with each other. The most widely used block based transform in image and video compression is the Discrete Cosine Transform (DCT) [19].

3.2.2.5 Quantization

Quantization is the process of converting a continuous range of values into a finite range of discrete levels. For example, in digital video an 8-bit color sample is obtained by approximating the signal level of the color component to one of a finite set of discrete levels, 256 levels in this case. Some of the color information is lost and cannot be recovered due to this approximation, and therefore more levels are needed to retain more information.

In video compression, lossy compression is achieved by quantization. The quantization process consists of two stages: forward quantization is carried out during encoding, and rescaling during decoding. The two stages are also referred to as quantization and de-quantization. In forward quantization, the original transform coefficient value is typically divided by the quantization step and rounded to the nearest integer; information is lost during the rounding process. These integer values are transmitted to the decoder along with the quantization parameters used. Rescaling is carried out at the decoder, where the received integer is multiplied by the quantization step in order to obtain the quantized transform coefficient. Lower bit rates can be achieved with higher quantization levels, at the expense of a larger approximation error and therefore higher image distortion.
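The two stages can be sketched in C as follows; this is an illustrative scalar quantizer, not the exact H.264 procedure, which additionally folds transform scaling factors into these steps:

```c
/* Forward quantization at the encoder: divide the transform
 * coefficient by the quantization step and round to the nearest
 * integer -- this rounding is where information is lost. */
static int quantize(int coeff, int qstep)
{
    return (coeff >= 0) ? (coeff + qstep / 2) / qstep
                        : -((-coeff + qstep / 2) / qstep);
}

/* Rescaling (de-quantization) at the decoder: multiply the received
 * integer by the same step; the approximation error remains. */
static int rescale(int level, int qstep)
{
    return level * qstep;
}
```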

3.2.2.6 Entropy Coding

The encoder needs to transmit data such as residual quantized transform coefficients, quantization values, motion vectors and other overhead information such as coding parameters to the decoder. Entropy coding is carried out to reduce the statistical redundancy of the transmitted data. This is a lossless compression technique where data with high probability of occurrence is coded with a smaller number of bits and data with lower probability of occurrence is coded with a larger number of bits. Commonly used entropy coding methods are Huffman coding and Arithmetic coding [26].

3.2.2.7 Decoder

The decoding process is identical to the reverse path of the encoder in figure 3.4. The bit stream received from the encoder is first entropy decoded, then inverse quantized and inverse transformed to create the residual. This residual is added to the prediction signal to reconstruct the image. Due to lossy coding (quantization), the reconstructed image is not identical to the original image; however, the reconstructed images at the encoder and decoder are identical to each other, because the decoding process at the decoder and the processing carried out by the reverse path of the encoder are identical.

3.2.3 Video Coding Standards

Standardization of video coding technology has played a major role in the advancement of digital video communication technologies over recent years [54]. Standardization enables interoperability between equipment from different manufacturers and is a major requirement for the communications industry. The two international standardization bodies are the Video Coding Experts Group (VCEG) of the International Telecommunication Union – Telecommunication Standardization Sector (ITU-T) and the Moving Picture Experts Group (MPEG) of the International Organization for Standardization – International Electrotechnical Commission (ISO/IEC). The standards released by the ITU-T have been named the H.26x series, and the ISO/IEC has released the MPEG series of standards. The MPEG standards have mainly been aimed at media storage and distribution, while the H.26x standards have been aimed at real-time video communication applications [50, 54]. Some of the popular standards are described below:

MPEG-1
The draft MPEG-1 standard was released in 1993. Although it is a generic video coding standard, it was primarily designed for storage on digital media such as CD-ROM, supporting bit rates of up to 1.5 Mbit/s. The standard employs a block based coding algorithm similar to the one described in the previous section. It supports flexible picture types (I-frames, P-frames and B-frames) in order to provide good compression efficiency and added functionality such as fast-forwarding.

MPEG-2
The MPEG-2 standard, released in 1995, was aimed at a broad variety of applications such as media storage and satellite and terrestrial TV broadcasting. It builds on the MPEG-1 algorithm, adding new tools for better quality and functionality, such as interlaced video and scalable video coding, for applications such as digital TV and HDTV. It is the first standard to introduce the concept of profiles and levels as a means of implementing compliant decoders that support only a subset of the syntax, with restrictions on capabilities such as the maximum supported bit rate.

MPEG-4 Visual
MPEG-4 Part 2: Visual, released in 1998, supports a wide variety of applications, including internet video streaming and digital TV broadcasting, as well as applications combining real-world video scenes and computer generated graphics. The standard can support lower bit rates than MPEG-1 and MPEG-2. MPEG-4 Visual supports object-based video coding, where a video scene is divided into different video objects that can be coded independently of each other.

H.261
This standard, approved in 1993, was widely used for videophone and video conferencing applications over the Internet. The H.261 standard uses a hybrid coding algorithm, similar to MPEG-1, for efficient coding at lower bit rates using a relatively simple algorithm. The H.261 standard only supports non-interlaced video at QCIF and CIF resolutions.

H.263
This standard was originally aimed at low bit rate video communications. The core algorithm is based on the H.261 standard. However, it supports a broad range of video formats and advanced coding tools such as half-pixel-precision motion compensation, unrestricted motion vectors, where the motion vector may point to a region outside the picture boundary, and advanced prediction, where the macroblock (the basic coding unit of 16 x 16 pixels) is divided into four blocks. Each block is motion compensated using an individual motion vector, resulting in a higher degree of compression efficiency and flexibility. The baseline profile of H.263 and the simple profile of MPEG-4 are functionally identical [19].

H.264 / MPEG-4 Part 10: Advanced Video Coding
The relatively recent video coding standard commonly known as H.264/AVC was jointly developed by the ITU-T VCEG and the ISO/IEC MPEG. H.264/AVC achieves significantly improved compression efficiency and flexibility compared with all previous video coding standards. The increase in performance is due to the variety of coding tools and options available in the standard, which, on the other hand, increases the computational complexity significantly.

H.265 / HEVC: High Efficiency Video Coding
High Efficiency Video Coding (HEVC) is currently the newest video coding standard of the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. The main goal of the HEVC standardization effort is to significantly improve compression performance relative to existing standards, in the range of a 50% bit-rate reduction for equal perceptual video quality. HEVC has been designed to address essentially all existing applications of H.264/AVC and to focus particularly on two key issues: increased video resolution and increased use of parallel processing architectures [63].

This research is aimed at optimizing the performance of the H.264/AVC decoder; therefore, a good understanding of the H.264/AVC standard is required. The following sections provide an overview of the features and coding tools available in the H.264/AVC standard [26].

3.3 Standard Development

In 1998, a call for proposals was issued by the ITU-T Video Coding Experts Group (VCEG) for a new video coding standard, with the objective of doubling the compression efficiency compared with any video coding standard available at the time. The new proposal was referred to as H.26L. As a result of similar interest from the ISO/IEC, the Joint Video Team (JVT), consisting of the ITU-T VCEG and the ISO/IEC Moving Picture Experts Group (MPEG), was formed in 2001 to make the development of the new standard a combined effort. The standard was finalized and the draft was approved in May 2003 [26].

The H.264/AVC standard was originally developed for web quality video, where the sampling format is limited to 4:2:0 with 8-bit sample accuracy. An amendment called the Fidelity Range Extensions (FRExt) [26, 36, 62] was added to the standard in July 2004; it introduced the so-called ‘High’ profiles in order to address professional applications and to enhance compression performance. The high profiles support up to the 4:4:4 sampling format and 12-bit sample accuracy.

The H.264/AVC standard was designed for high compression efficiency, error resilience and flexibility, so that it can support a wide variety of applications and different transport environments such as wired and wireless networks. The standard is intended to support a wide range of applications such as:
– Video conferencing and video telephony services over networks such as LAN, DSL, wireless and mobile networks
– Video on demand and multimedia streaming services
– Digital broadcasting services
– Video storage on media
– Multimedia messaging services

Similarly to previous standards, H.264 only specifies the syntax of the bit stream and the decoding process of that syntax. This ensures high flexibility in encoder implementations as long as the generated bit stream conforms to the syntax, while guaranteeing interoperability and correct decoding of the content. However, the decoder is also flexible to some extent, since it is allowed to decode the syntax in any way, as long as the decoding process produces numerically identical results to the process specified in the standard. This flexibility enables the optimization of the encoding process to suit different applications. For example, in a video storage and reproduction application such as DVD, more emphasis can be given to maximizing video quality, whereas in a video telephony application, more emphasis can be given to complexity and implementation costs.

3.4 Features and Tools

The H.264 standard consists of various features and coding tools that con- tribute to the high compression efficiency, flexibility and robustness. This section describes the structure and some of the high level features of the standard.


3.4.1 Layer Structure

H.264 is designed to be flexible and customizable, in order to handle a variety of applications and transport methods. To achieve this flexibility, the standard is designed as two layers.

1. The Video Coding Layer (VCL) represents the video encoding process; it carries out the actual video compression, and its data consists of the coded bits.
2. The Network Abstraction Layer (NAL) handles the transportation of the VCL data and other header information by encapsulating them in NAL units.

The separation of video coding and transportation into two layers ensures that the video coding layer provides an efficient representation of video content. The network abstraction layer transports the coded data and other header information in a flexible manner by adapting to a variety of delivery frameworks.

3.4.2 Profiles and Levels

Profiles and levels specify the tools and capabilities a decoder needs in order to support different applications, and they provide interoperability points between different decoder implementations. Each profile is designed to include particular coding tools to support various coding requirements. The H.264/AVC standard originally specified the following three ‘basic’ profiles.

1. Baseline: low latency, low complexity, error resilience and robustness. Applications: video conferencing.
2. Main: high compression efficiency. Applications: video storage and broadcasting.
3. Extended: a superset of the baseline profile with enhanced error resilience and video stream switching capabilities. Applications: internet video streaming.

The fidelity range extensions introduced a new set of profiles called the ‘High’ profiles, intended for high quality applications (e.g. HD-DVD, HDTV) and professional applications like studio editing.


Levels are defined as performance limits for decoders supporting each profile. These limits generally apply to processor load, memory capabilities and maximum bit rates, which in turn constrain the frame sizes, frame rates and number of reference frames supported by a compliant decoder.

3.4.3 Picture Format

The source video is coded as a stream of pictures. The color spaces and the sampling formats of the pictures and the process of dividing a picture into coding units comprised of slices and macroblocks are discussed in this section.

3.4.3.1 Color Space and Sampling

The basic profiles support YCrCb 4:2:0 sampling format with 8-bit sample accuracy while the high profiles support 4:2:2 and 4:4:4 with up to 10 and 12 bit sample accuracy. The width and height of the luma sample array of a picture should be a multiple of 16, while the width and height of the chroma sample array is a multiple of 8 or 16 depending on the sampling format, so that the picture includes all the chroma samples associated with the luma samples. Both progressive and interlaced video are supported.

3.4.3.2 Macroblocks

The smallest coding unit in a picture is a Macroblock (MB). A macroblock contains data belonging to a region of 16x16 luma samples along with the associ- ated Cr and Cb component samples. A picture should contain an integral number of macroblocks.

3.4.3.3 Slices

A picture consists of one or more slices. Each slice contains an integral number of macroblocks, which are processed in raster scan order. H.264 has the following slice types:
– I-Slices: All the macroblocks are coded using intra prediction, i.e. using data already coded within the same slice (Intra).


– P-Slices: Contains inter coded macroblocks using one reference picture and/or intra coded macroblocks (Predictive). – B-Slices: Contains inter coded macroblocks using two reference pictures as well as macroblock types in P-slices (Bi-predictive). – SP and SI-Slices: Special types of slices, Switching Predictive (SP) and Switching Intra (SI), for efficient switching between different video streams, random access and error resilience [29].

3.5 Video Coding

The core video compression is carried out by the Video Coding Layer (VCL). The VCL compresses the pictures by dividing them into one or more slices, which contain an integral number of macroblocks. Macroblocks are the basic coding units of the H.264/AVC encoder. The basic architecture of H.264/AVC is similar to that of previous coding standards such as MPEG-2 and H.263, where a motion compensated, block based transform is used to achieve compression. Functional block diagrams of the H.264 encoder and decoder are shown in figure 3.6 and figure 3.7 respectively. These figures show the high level functional elements that should be present in an encoder and a decoder complying with H.264/AVC. The operation of the H.264/AVC encoder and decoder is briefly described here.

3.5.1 Encoder

The macroblocks in the current picture are processed as either intra or inter coded macroblocks. The encoder consists of a forward path (represented by thick black arrows in figure 3.6) for encoding and a reverse path (grey arrows) for decoding and reconstructing the current picture. A prediction signal for the macroblock is calculated using either intra prediction or inter prediction. In intra prediction, the current macroblock is predicted from neighboring samples in the current slice which have already been encoded, decoded and reconstructed by the encoder. In inter prediction, the prediction signal is obtained through motion estimation and compensation using one or two reference pictures from the reference picture buffer. The reference picture buffer contains previously coded and decoded pictures that can be selected for inter prediction.

Figure 3.6: H.264 Encoder.

The prediction macroblock is then subtracted from the original macroblock to create a residual macroblock at node A. The residual macroblock is transformed, quantized and reordered before entropy coding. Entropy coding removes the statistical redundancy of the data. The entropy coder also processes other information necessary for correct decoding of the residual data, such as the quantization parameter, the macroblock partition modes, the reference frames used, motion vector information for inter coded macroblocks and intra mode information for intra coded macroblocks. The output of the entropy coder is the compressed video bits, which are encapsulated in NAL units before transmission or storage.

The objective of the reverse path (grey arrows) is to reconstruct the lossy coded picture exactly as the decoder does. The reconstructed samples of the neighboring macroblocks in the current slice may be used for intra prediction of macroblocks, and the current reconstructed picture may be used for inter prediction of future pictures. The picture is reconstructed after applying a deblocking filter (DF) in order to reduce the blocking artifacts that appear due to quantization of the block transforms.

Figure 3.7: H.264 Decoder.

3.5.2 Decoder

The decoder block diagram is shown in figure 3.7. Starting from the right-hand side, NAL units are the input of the decoder. The NAL units are first entropy decoded to obtain the quantized coefficients and the other information necessary to reconstruct the macroblocks. Inverse quantization and the inverse transform are applied to the coefficients to produce the residual macroblock. For inter coded macroblocks, a prediction is obtained by carrying out motion compensation using the decoded information such as macroblock partition modes, reference pictures and motion vectors. Intra coded macroblocks are predicted using the decoded intra mode information and previously decoded pixels of neighboring macroblocks. The macroblock is reconstructed by adding the prediction to the residual at node B. The deblocking filter is then applied to reconstruct the current picture. The reconstructed picture is displayed and may also be used as a reference picture for decoding future pictures.

3.6 Coding Tools and Functions

The H.264/AVC standard offers a wide range of coding tools to achieve a high level of compression efficiency. Some of the important coding tools and functions of the H.264/AVC standard are discussed in this section.

3.6.1 Intra Prediction

In intra prediction, the prediction signal is produced using the neighboring samples of previously encoded and reconstructed blocks located to the left of and above the current block. Intra prediction therefore exploits the spatial correlation of image pixels. The following intra coding modes are supported in all slice types. Note that intra prediction is not carried out across slice boundaries; therefore, slices can be decoded independently of each other, which limits error propagation.

3.6.1.1 Intra 4x4 prediction for luma samples

Intra prediction is carried out for each individual 4x4 block of the macroblock. The small prediction block sizes are particularly useful for areas which have high detail. The pixel values of each 4x4 block are predicted from the neighboring pixel values.

3.6.1.2 Intra 16x16 prediction for luma samples

The samples of the macroblock are predicted without partitioning. This is useful for homogeneous areas that do not contain much detail. A block of 16x16 samples and the corresponding left and above samples are used in the prediction process. There are four intra 16x16 prediction modes, similar to the corresponding intra 4x4 modes:
1. Intra 16x16 Vertical: pixel values of the macroblock are predicted from the pixels just above the macroblock.
2. Intra 16x16 Horizontal: pixel values are predicted from the pixels to the left of the macroblock.
3. Intra 16x16 DC: every pixel of the macroblock is predicted from the mean of the upper and left neighboring samples of the macroblock.
4. Intra 16x16 Plane: pixels of the macroblock are predicted using a linear equation that uses both the above and left pixels.
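As an example, mode 3 above can be sketched in C as follows (assuming both neighboring sample rows are available; the standard defines fallbacks when they are not):

```c
/* Intra 16x16 DC mode: every pixel of the macroblock is predicted
 * from the mean of the 16 reconstructed samples just above it and
 * the 16 samples to its left. */
static void intra16x16_dc(unsigned char pred[16][16],
                          const unsigned char *above, /* 16 samples */
                          const unsigned char *left)  /* 16 samples */
{
    int sum = 0;
    for (int i = 0; i < 16; i++)
        sum += above[i] + left[i];
    unsigned char dc = (unsigned char)((sum + 16) >> 5); /* rounded mean of 32 */
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            pred[y][x] = dc;     /* flat prediction over the macroblock */
}
```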


3.6.1.3 Intra prediction for chroma samples

The chroma samples are considered to be more homogeneous than luma sam- ples and therefore, chroma intra prediction is always done on macroblocks without partitioning. The same prediction mode is used for both Cr and Cb components. There are four chroma prediction modes which are similar to intra 16x16 modes. However, the exact prediction process is specified for different chroma formats due to the difference in chroma macroblock size. For 4:2:0 sampling format the chroma macroblock size is 8x8.

3.6.1.4 I_PCM

This is a lossless coding mode where the image sample values are transmitted directly without prediction, transform or quantization. Although this is a very inefficient method of coding, this method is useful to represent image regions without any loss.

3.6.2 Inter Prediction

Inter prediction is carried out to exploit the temporal redundancy between pictures. Block based motion estimation and compensation is carried out in order to create the inter prediction signal. The inter prediction tools contribute significantly to the improved compression efficiency of the H.264/AVC standard over previous coding standards. Some of these tools are discussed here.

3.6.2.1 Variable block size motion compensation

Motion compensation is carried out for macroblocks by dividing the macroblocks into partitions and sub-macroblock partitions. Figure 3.8 shows how the luma component of a macroblock can be partitioned for motion compensation. Each macroblock can be partitioned into one 16x16 (the whole macroblock), two 8x16, two 16x8 or four 8x8 partitions. Each partition is individually motion compensated using a separate motion vector.


Figure 3.8: Macroblock and sub-macroblock partitions.

3.6.2.2 Quarter pixel accurate motion vectors

Motion estimation and compensation are carried out by generating a prediction signal for each macroblock or sub-macroblock partition from the reference picture. Motion vectors indicate the relative position of the matching area in the reference picture. In H.264/AVC, motion vectors have quarter-pixel accuracy for luma. Therefore, the reference picture is interpolated to provide half-sample and quarter-sample pixel positions: half-sample positions are obtained with a six-tap interpolation filter, and quarter-sample positions by linear interpolation between neighboring integer- and half-sample values.

3.6.2.3 Motion vector prediction

H.264/AVC allows motion vectors to point to regions outside the picture boundary. The pixels of the outside region are obtained by extrapolating the pixel values at the picture boundary. This allows for effective motion compensation of objects moving into or out of the picture. Encoding the motion vectors may require a large number of bits, in particular because there can be many motion vectors corresponding to the many small partitions used for motion estimation. Therefore, motion vector prediction is used to reduce the number of bits needed to transmit the motion vectors. A Motion Vector Prediction (MVP) for the current partition is derived from the motion vectors of the neighboring partitions, if they are available. Figure 3.9 shows the neighboring blocks that are used for motion vector prediction: the shaded block E is the current partition, and the blocks A, B and C are the neighboring partitions. Only the Motion Vector Difference (MVD), which is the difference between the actual motion vector of the current partition and the MVP, is transmitted. The number of bits needed for the motion vectors can be reduced thanks to the high correlation between the motion vectors of neighboring blocks.

Figure 3.9: Current and neighboring blocks (macroblock partition) used for motion vector prediction.
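A simplified sketch of the predictor: in the common case, each MVP component is the median of the corresponding components of the neighbors A, B and C; the standard’s special cases (unavailable neighbors, 16x8 and 8x16 partitions) are omitted from this illustration.

```c
typedef struct { int x, y; } mv;

/* Median of three values, used component-wise for the predictor. */
static int median3(int a, int b, int c)
{
    int max = a > b ? (a > c ? a : c) : (b > c ? b : c);
    int min = a < b ? (a < c ? a : c) : (b < c ? b : c);
    return a + b + c - max - min;
}

/* MVP from neighbors A (left), B (above), C (above right). */
static mv predict_mv(mv a, mv b, mv c)
{
    mv p = { median3(a.x, b.x, c.x), median3(a.y, b.y, c.y) };
    return p;
}

/* Encoder side: mvd.x = mv.x - p.x;  mvd.y = mv.y - p.y;
 * Decoder side: mv.x  = mvd.x + p.x; mv.y  = mvd.y + p.y. */
```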

3.6.2.4 P and B slices

Macroblocks in P-slices are inter predicted using one reference prediction, with a reference picture selected from the reference picture ‘list0’. Macroblocks in B-slices can have one or two motion vectors: macroblock partitions can be predicted from a reference picture in ‘list0’ or in ‘list1’, in which case only one motion vector and one reference index are used, or they can be bi-predicted from two reference pictures, one from ‘list0’ and one from ‘list1’, in which case two motion vectors and two reference indexes are used. When weighted prediction is not used, the average of the pixel values of the two reference predictions is used as the bi-prediction signal. If weighted prediction is used, the bi-prediction pixel values are obtained as a weighted average of the two reference predictions. There is also a special mode for the 16x16 partition size, called the direct mode, where no motion vectors or reference picture indexes are sent; they are derived from macroblocks that have already been decoded.


3.6.3 Transform and Quantization

A residual macroblock is generated by subtracting the prediction from the original macroblock. The residual macroblock is transformed to remove the spatial correlation. The Baseline, Main and Extended profiles only use a 4x4 integer transform [35, 50], based on the DCT, to transform the residual data of the macroblock by dividing it into 4x4 blocks. The integer transform can be carried out using integer arithmetic and is less complex than the DCT [36, 62]. Since no floating point arithmetic is used, any possible mismatch between the forward and inverse transforms is eliminated. Lossy compression is achieved by quantizing the transformed residual data. The Quantization Parameter (QP) specifies the quantization step size. Each macroblock can be encoded with a different quantization parameter value. The QP is differentially coded, so only the change in QP is transmitted to the decoder.
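The forward 4x4 core transform can be written as Y = Cf X Cf^T with a small integer matrix, as in the sketch below; the scaling factors that complete the transform are folded into quantization and omitted here, so this is an illustration rather than a complete codec routine.

```c
/* Integer core transform matrix of the 4x4 forward transform. */
static const int Cf[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 },
};

/* Y = Cf * X * Cf^T, computed with integer arithmetic only, so the
 * encoder and decoder cannot drift apart through rounding. */
static void forward_transform_4x4(const int X[4][4], int Y[4][4])
{
    int T[4][4];
    for (int i = 0; i < 4; i++)          /* T = Cf * X */
        for (int j = 0; j < 4; j++) {
            T[i][j] = 0;
            for (int k = 0; k < 4; k++)
                T[i][j] += Cf[i][k] * X[k][j];
        }
    for (int i = 0; i < 4; i++)          /* Y = T * Cf^T */
        for (int j = 0; j < 4; j++) {
            Y[i][j] = 0;
            for (int k = 0; k < 4; k++)
                Y[i][j] += T[i][k] * Cf[j][k];
        }
}
```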

3.6.4 Skipped Macroblocks

H.264/AVC specifies a special type of macroblock called the skipped macroblock, for which no coded information is sent to the decoder; a syntax element in the slice data indicates the skipped macroblocks to the decoder. Skipped macroblocks in P-Slices and B-Slices are called P-Skip and B-Skip respectively. The decoder does not receive any motion information or residual data for a skipped macroblock. Since the motion vector differences are zero, the motion vector prediction becomes the actual motion vector used to obtain the predicted macroblock, and the prediction macroblock is simply copied as the reconstructed macroblock. Typically, skipped macroblocks occur in regions with little movement, where the predicted macroblock is very similar to the original macroblock; the residual data of such macroblocks is small, resulting in all-zero coefficients after transform and quantization.


3.6.5 Deblocking Filter

The quantization of block transform coefficients can lead to visible block edges in the reconstructed picture. The H.264/AVC standard specifies an in-loop deblocking filter (DF) to minimize these blocking artifacts. The filter is applied in-loop, meaning that the reconstructed and filtered pictures are used as reference pictures for inter prediction; the same filter parameters are used at both the encoder and the decoder to avoid any prediction mismatch. Typically, a filtered picture provides a closer match to the original picture than the unfiltered reconstruction, so a better prediction can be obtained from the filtered reference picture, resulting in higher objective and subjective quality. The filter is applied over the 4x4 block boundaries of macroblocks, and the filter strength depends on the quantization parameters, the prediction modes of the neighboring blocks and the actual pixel values across the boundary [34]. In addition, the filter strength can be changed explicitly, or the filter can be turned off completely by the encoder.

3.7 Parallel Implementations

Ever since the H.264/AVC standard [26] was published in 2003, researchers have tried to address the high complexity of the new standard, mainly through parallelism. Several modifications have been suggested for H.264 encoders and decoders to improve performance in terms of execution time and memory usage. Parallel decoding techniques for H.264 range from the coarsest granularity, the group of frames or pictures (GOP), down to the finest granularity, the block inside a macroblock.

3.7.1 Slice-Level

Gurhanli et al. [20] suggested a parallel approach that decodes independent groups of frames on different cores. The speedup is conditioned on a modification of the encoder that omits the start-code scanner process. Any modification to the encoder excludes previously encoded video sequences, which would need to be re-encoded in order to benefit from the proposed approach. In our parallel implementation, we only modify the decoder, which therefore supports all previously encoded video sequences. Nishihara et al. [42] proposed a load balancing mechanism among cores in which partition sizes are adjusted at runtime; the authors also reduced memory access contention based on execution time prediction. Horowitz et al. [22] compared different H.264 implementations, including FFmpeg [17] and the H.264 reference software JM [61], and analyzed the complexity of the H.264 decoder subsystems.

3.7.2 Macroblock-Level

Kannangara et al. [28] reduced the complexity of the H.264 decoder by 19-65% by predicting SKIP macroblocks using an estimation based on a Lagrangian rate-distortion cost function. Our experimental results show a better overall speedup (230%) and better parallel scalability with respect to the number of cores in a multicore processor. Zhao et al. [72] proposed a wavefront algorithm for processing independent macroblocks within the same frame and across different frames. This method does not distribute the workload equally across cores, as the number of independent macroblocks varies over time. Mesa et al. [39] proposed a similar approach, the 2D-Wave, which decodes independent macroblocks in parallel on different cores; good scalability is demonstrated for high resolutions. Moreover, an advanced parallel technique based on the 2D-Wave, the dynamic 3D-Wave, was proposed by Meenderinck et al. [38]. The dynamic 3D-Wave algorithm, which combines spatial and temporal MB-level parallelism, uses a dynamic scheduler that assigns independent macroblocks to parallel threads. The scheduler minimizes the differences in workload between threads, and thus optimizes the parallel execution of independent macroblocks. Chong et al. [14] added a pre-parsing stage in order to resolve control dependencies for macroblock-level parallelism. Van der Tol et al. [67] mapped video sequence data onto multiple processors, providing better performance than functional parallel algorithms; the authors group the macroblocks of a frame so as to minimize the dependencies between cores.


3.7.3 Deblocking Filter

Among the existing literature on parallel deblocking filters, Sihn et al. [57] proposed a multicore pipeline for the deblocking filter based on group-of-pictures data-level partitioning. They also suggested software memory throttling and fair load balancing techniques in order to improve performance when several cores are used. Wang et al. [68] partition a slice into independent rectangles of arbitrary granularity. These independent regions are identified by examining the influence of vertical and horizontal lines of pixels. Parallel deblocking of these regions has good scalability, minimal synchronization overhead, and good cache utilization. However, a small number of pixels will have erroneous output, without affecting the overall deblocking filter process, through what the authors call the Limited Error Propagation Effect. For an optimized deblocking filter, speedups of 95% and 224% are achieved on 2 and 4 cores respectively; for a complete H.264 decoder, the overall speedups are 21% on 2 cores and 34% on 4 cores. Pieters et al. [49] proposed a macroblock partitioning algorithm based on the parallel version described by Wang et al. [68] that avoids the Limited Error Propagation Effect. The proposed algorithm filters the pixels of macroblocks concurrently and was also tested on GPU platforms, where the parallel implementation outperforms CPU-based and GPU-based implementations by factors of up to 10.2 and 19.5 respectively.

3.7.4 Discussion

Many optimization techniques have been proposed to increase the efficiency of the H.264 video standard. A straightforward comparison between the different studies is not possible because of the many different criteria and assumptions they adopt. In addition to the diversity of software implementations, the hardware platforms are rarely similar, which makes direct comparisons unreliable, and some assumptions cannot be applied to both our work and the related work. In the experimental results of the following chapters, we compare our values with related work using, as far as possible, similar units and hardware platforms.


3.8 Summary and Conclusion

The design of the H.264/AVC standard targets a wide range of applications, from video conferencing to HDTV and professional studio editing. Therefore, as mentioned earlier in this chapter, the standard defines a set of ‘profiles’ that include subsets of the available coding tools and features, targeted at different application scenarios. Table 3.2 indicates the features and coding tools contained in the baseline, main and high profiles.

The H.264/AVC video coding standard delivers significantly improved compression efficiency compared with previous standards, supporting higher quality video over lower bit rate channels. Due to its improved compression efficiency and increased flexibility of coding and transmission, H.264 has the potential to enable new video services such as mobile video telephony and multimedia streaming over mobile networks. The standard supports a wide range of applications, from consumer applications like video conferencing to professional applications like video editing, and it provides a range of coding tools contributing to its high compression performance, flexibility and robustness. However, these performance improvements come at the cost of significantly higher computational complexity. Therefore, decoder implementations should use the available coding tools effectively to achieve the desired compression performance with the available processing resources. In the following chapters, we describe in detail efficient and scalable parallel implementations of the H.264 decoder, aimed at reducing execution time and energy consumption.


Table 3.2: Features of the Baseline, Main, and High profiles

Feature                                Baseline            Main               High
Bit depth                              8                   8                  8
Chroma formats                         4:2:0               4:2:0              4:2:0
Flexible macroblock ordering (FMO)     Yes                 No                 No
Arbitrary slice ordering (ASO)         Yes                 No                 No
Redundant slices (RS)                  Yes                 No                 No
Interlaced coding (MBAFF)              No                  Yes                Yes
B slices                               No                  Yes                Yes
CABAC entropy coding                   No                  Yes                Yes
4:0:0 Monochrome                       No                  No                 Yes
8x8 vs. 4x4 transform adaptivity       No                  No                 Yes
Quantization scaling matrices          No                  No                 Yes
Separate Cb and Cr QP control          No                  No                 Yes
Remarks                                low complexity,     high compression   high compression
                                       high robustness,    efficiency         efficiency
                                       error resilience
Applications                           video conference,   digital TV,        high resolution,
                                       web videos          media storage      HD TV & DVD

Chapter 4

H.264 Color Components Parallel Decoding

4.1 Introduction

Multimedia applications are found in almost every device of our modern technological world. Mobile devices like cell phones and PDAs have become essential to our daily needs. Their growing feature sets require better hardware performance, larger storage and smoother display, resulting in higher power consumption. As system performance and workload increase remarkably, so do network communications: massive amounts of data are exchanged daily over various wireless networks and infrastructures. Consequently, improved compression algorithms are needed to minimize the size of video files, thereby benefiting from the growing performance of embedded processors. Enhanced compression will in turn require more processing power, which may affect the overall performance of the embedded device.

Nowadays, most processors for desktop computers and mobile devices are multicore chips. The adoption of multicore processors has improved the multitasking user experience of all applications. Nevertheless, the majority of existing multimedia applications do not benefit from multicore processors, because they are designed to be executed sequentially, like the open-source H.264 reference software [61] and the FFmpeg codec [17]. Parallelizing the H.264 decoding process offers a great benefit to existing embedded systems, enabling them to decode video sequences with higher resolutions more efficiently by exploiting existing and unused hardware resources such as idle cores and private and shared memories.

In this research, we propose an approach that processes each color component (luma and chroma) on a separate core of a multicore processor in order to increase the overall performance of the H.264 decoder. Our novel idea is based on the fact that the H.264 decoder processes the color components of every frame sequentially; simultaneous processing of the color components is possible because luma and chroma processing are independent. In addition, a pipelined version is designed in order to improve load balancing and to hide synchronization overhead. Experiments are conducted with video sequence benchmarks on dual-core and quad-core processor simulators in order to collect execution time and energy consumption statistics [9].

The rest of the chapter is organized as follows. Section 4.2 presents our technique for decoding the color components of video frames in parallel. Section 4.3 presents and analyzes experimental results using the H.264 reference software, JM [61]. Section 4.4 shows the results for a parallel H.264 decoder based on the FFmpeg codec library [17]. Finally, section 4.5 concludes the chapter.

4.2 Parallel Decoding

In this section, we decompose the H.264 decoder process and then explain our algorithm for parallel processing of color components. The H.264/AVC [26] video decoder and its features are explained in depth in chapter 3. We also describe our proposed pipeline implementation of the H.264 decoder.

4.2.1 Stages Decomposition

A thorough inspection of the reference implementation of the H.264 codec [61] shows that the decoder can be divided into five main functional parts: entropy decoding (ED), de-quantization and inverse transform (IQT), motion compensation (MC), intra-prediction (IP), and deblocking filter (DF). Figure 4.1 illustrates a simplified representation of the H.264 decoder's stages.

Figure 4.1: Simplified H.264 decoding process

Entropy decoding and motion compensation are applied to every macroblock of size 16x16 pixels. The deblocking filter is executed at the end of the decoding process. The average workload of every stage using the baseline profile is shown in figure 4.2, where the entropy decoding and the de-quantization and inverse transform stages are merged into one stage. The workload of this merged stage is 14% on average, consumed mainly by the context-adaptive variable-length coding (CAVLC). The CAVLC algorithm is adopted by the baseline profile and has a lower complexity than the CABAC algorithm explained in the previous chapter. The prediction stage, composed of intra-prediction and motion compensation, has a high impact on the overall decoding process: 41% on average. Finally, the deblocking filter stage is also a heavy process that consumes around 45% of the overall decoding time. These workload statistics were profiled using several video benchmarks with low and high resolutions, listed in the experiments section.

4.2.2 Color Components Processing

The frames of a video sequence are represented as bit streams. Pixels are sampled using three color components: YUV (or YCrCb). Y stands for the luminance sample (luma), which carries the light information. U and V (Cr and Cb) stand for the red-difference and blue-difference chroma samples respectively. In the 4:2:0 format, every four luma samples share one red sample and one blue sample, as shown in figure 4.3.

Figure 4.2: H.264 decoding stages workload percentages of the baseline profile

The 4:2:0 sampling format is the most widely used. Other formats, 4:2:2 and 4:4:4, where more color samples are available, show no significant difference to human vision, because vision is more sensitive to the light than to the colors of a video sequence.

In our research, we exploit an independence pattern found in the decoding of color components. As described above, each frame of a video sequence is represented in YCrCb color samples, and a pixel is formed by these three color components. The decoder reconstructs each frame picture one color at a time, starting with luma and then the chroma components. The color information in each frame, and thus in each macroblock, is independent across components: the H.264 standard [26] introduces no dependency between the color components during the motion compensation stage.
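The resulting memory layout is easy to quantify. The helper below is a minimal sketch (illustrative code, not part of the decoder) that computes the three plane sizes of one 8-bit 4:2:0 frame; it shows that luma accounts for two thirds of the samples, which is why luma processing dominates chroma in the profiles reported later.

```c
#include <stdio.h>

/* Hypothetical helper: byte sizes of the three planes of one
 * 4:2:0 frame at 8 bits per sample (as in the Baseline profile). */
static void yuv420_plane_sizes(int width, int height,
                               size_t *y, size_t *cb, size_t *cr)
{
    *y  = (size_t)width * height;               /* one luma sample per pixel */
    *cb = (size_t)(width / 2) * (height / 2);   /* chroma subsampled 2x2     */
    *cr = *cb;
}

int main(void)
{
    size_t y, cb, cr;
    yuv420_plane_sizes(352, 288, &y, &cb, &cr); /* one CIF frame */
    /* Prints: Y=101376 Cb=25344 Cr=25344 -> luma is 2/3 of the samples */
    printf("Y=%zu Cb=%zu Cr=%zu\n", y, cb, cr);
    return 0;
}
```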

4.2.3 Parallel Execution and Synchronization

Parallel execution is considered a major potential solution for complex algorithms, where available resources like parallel cores are used without any modification to hardware components. These additional cores are available in most devices nowadays, yet sequential applications do not effectively benefit from multiprocessor systems.

Figure 4.3: YUV 4:2:0 color components sampling format

Performance optimization using parallel execution can be applied to many processing-intensive applications; even optimized implementations can still take advantage of multicore processors. The H.264 decoding process, like all major video decoding algorithms, is generally designed for sequential execution. The H.264 standard [26] does not mandate parallelism and thus does not benefit from the multicore processors available in today's market. Several approaches have been studied to parallelize the decoding process. Most of them are based on parallel processing of slices and macroblocks (explained in sections 3.4.3.3 and 3.4.3.2 of chapter 3) [14, 51, 67]. Similar approaches for parallel processing of H.264 are described in detail in section 3.7 of chapter 3.

In our research, we modified the H.264 source code to decode the luma and chroma components of macroblocks in parallel. The parallelization uses the PThread library with mutex-protected critical sections and condition variables. One core handles all the stages except the chroma motion compensation and intra-prediction, which are executed on the second core. As shown in figure 4.4, the second core handles the motion compensation and intra-prediction for chroma samples only, while the first core handles the luma samples in addition to all the remaining decoding stages. As stated above, the color components are independent of each other; therefore, decoding different color components in parallel is correct in theory as well as in experiments, and parallel processing of color components increases the performance of the decoder when it is executed on a multicore system.

Figure 4.4: H.264 parallel color components decoding on a dual-core processor

The intra-prediction and motion compensation (inter-prediction) must be completed before applying the deblocking filter; thus, a synchronization barrier is needed before the deblocking filter stage. With this configuration, the synchronization is performed at the end of the luma and chroma decoding using condition variables. At the end of the entropy decoding stage, the parallel step of the decoding process is initiated. Once intra-prediction and motion compensation are completed, all parallel threads wait at the synchronization barrier before starting the deblocking filter stage.

A complex structure variable, which contains all the information needed to represent a picture frame, is kept in shared memory. This picture variable can be accessed by all cores, so communication between the cores is implemented through memory shared by them. On a dual-core processor, core 2 uses the chroma data to decode the color components while core 1 decodes the luma data and executes the remaining sequential algorithms, as illustrated in figure 4.4. The average workload of the second core using the Akiyo and Container benchmarks (CIF and QCIF) is 18%, which is in turn the average performance gain on dual-core processors.

Figure 4.5 illustrates the workload partitioning over 4 cores. The first core reads the data from the NAL units (explained in section 3.4 of chapter 3) and performs the entropy decoding and transformation stages. The second core processes luma data while the third core processes chroma data; both perform the same work, intra-prediction and motion compensation, on different color components at the same time. The fourth core executes the remaining part of the deblocking filter and then outputs the decoded frame picture.
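The structure of the dual-core synchronization described above can be sketched as follows. This is a simplified illustration, not the JM code: the stage functions and the Picture structure are hypothetical stand-ins, and a POSIX barrier is used for brevity where our implementation relies on mutexes and condition variables.

```c
#include <pthread.h>

/* Simplified stand-in for the shared picture structure; the JM
 * decoder's real structure holds far more state. */
typedef struct { unsigned char *y, *cb, *cr; } Picture;

static Picture pic;                /* lives in shared memory            */
static pthread_barrier_t df_sync;  /* barrier before the deblocking filter */

/* Stubs standing in for the actual decoder stages. */
static void entropy_decode_and_iqt(Picture *p) { (void)p; }
static void luma_mc_ip(Picture *p)             { (void)p; } /* core 1 */
static void chroma_mc_ip(Picture *p)           { (void)p; } /* core 2 */
static void deblocking_filter(Picture *p)      { (void)p; }

static void *chroma_worker(void *arg)
{
    (void)arg;
    chroma_mc_ip(&pic);              /* chroma MC + intra-prediction  */
    pthread_barrier_wait(&df_sync);  /* meet the luma thread here     */
    return NULL;
}

/* Decode one frame with the luma/chroma split of figure 4.4. */
static void decode_frame(void)
{
    pthread_t t;
    entropy_decode_and_iqt(&pic);             /* sequential front end */
    pthread_barrier_init(&df_sync, NULL, 2);
    pthread_create(&t, NULL, chroma_worker, NULL);
    luma_mc_ip(&pic);                         /* runs on the first core */
    pthread_barrier_wait(&df_sync);           /* both components done   */
    pthread_join(t, NULL);
    deblocking_filter(&pic);                  /* sequential back end    */
    pthread_barrier_destroy(&df_sync);
}

int main(void) { decode_frame(); return 0; }
```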

Figure 4.5: H.264 parallel color components decoding on a quad-core processor

The performance gain using quad-core processors is almost the same as with dual-core processors, due to the sequential characteristics of the H.264 decoder. Thus, in order to benefit from quad-core architectures, a pipelined execution is proposed and discussed in the following section.

4.2.4 Pipeline Execution

In order to minimize the waiting time between parallel cores, the H.264 decoding stages are executed in pipeline mode over four cores when motion compensation is used. Theoretically, the execution time can be reduced to the time needed by the core that executes the biggest chunk of code. Processor idle time is reduced, leading to higher efficiency through better use of available resources. The pipeline is illustrated in figure 4.6.

A shared memory storing the blocks of information is used so that the four processors access consistent data. The data variables and their manipulation in the current H.264 implementation went through extensive modification and testing in order to validate the proposed pipeline execution. Figure 4.6 illustrates the best-case scenario, in which the four decoding stages are completely independent, allowing parallel execution on the four cores. This case applies when the current frame depends on a previously decoded frame (P-frames).


Figure 4.6: H.264 pipeline execution on a 4-core multiprocessor

P-frames contain intra and inter macroblocks. Intra-predicted frames (I-frames) do not allow pipelined decoding because their macroblocks depend on other macroblocks in the same frame; I-frames contain only intra macroblocks (I-MBs). Performance gain depends on the number of I-frames among the encoded video frames, where the first decoded frame is always an I-frame. Subsequent frames encoded using the Baseline or Main profiles are mostly P-frames, given that there is motion in the video sequence. I-MBs are useful when adjacent frames are very similar and only minor motion has taken place. For the Akiyo and Container benchmarks, only the first of 300 frames was decoded using intra-prediction, which leads us to conclude that at least 99% of the decoded frames are P-frames. The maximum load of the pipelined H.264 version over 4 cores is 41%, executed by the fourth core (P4) to perform the deblocking filter. So the maximum performance speedup is limited by the largest load, which is the deblocking filter.
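This bound can be made explicit. In steady state, a pipeline's throughput is set by its slowest stage, so with the deblocking filter at 41% of the sequential work the achievable speedup is capped as

\[
S_{\max} \;=\; \frac{1}{\max_i w_i} \;=\; \frac{1}{0.41} \;\approx\; 2.4,
\]

where \(w_i\) denotes the fraction of the sequential execution time spent in pipeline stage \(i\).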

4.3 Experiments with JM H.264 Reference Software

In order to demonstrate the feasibility of our approach, we have performed experiments on our parallel version of the H.264 reference decoder [61] using the MPARM simulator [31]. Runtime statistics of each stage are collected in addition to overall execution results.


Table 4.1: H.264 decoder profiling based on luma and chroma

Benchmark      | Total (ms) | Luma (%) | Chroma (%)
Akiyo CIF      | 754234     | 29.48    | 20.06
Akiyo QCIF     | 379928     | 24.56    | 15.93
Container CIF  | 763781     | 29.45    | 20.45
Container QCIF | 397664     | 24.55    | 15.91
Average        |            | 27.01    | 18.08

4.3.1 MPARM Simulator and H.264 Porting

MPARM is a cycle-accurate multiprocessor architectural simulator [31], and RTEMS is a real-time operating system for embedded multiprocessor systems [15]. A ported version of the H.264 decoder runs on the RTEMS operating system, which in turn runs on MPARM simulating an ARMv6 embedded multicore processor [12]. The H.264 reference software [61] is designed to run on desktop systems, so a preliminary step was to port the reference software so that it can be executed by RTEMS on the MPARM simulator.

4.3.2 Profiling H.264 Stages

The parallel H.264 decoder is executed using the MPARM simulator. Two benchmarks are used: Akiyo (a news presenter) and Container (a slow-moving cargo ship). The encoded video sequences come in two formats, QCIF (176 x 144) and CIF (352 x 288). The simulator profiles each stage of the decoder, separating the luma and chroma components within each phase. Outputs of the simulator include the number of cycles and the execution time for each region of code that we marked for every stage of the H.264 decoding process. Table 4.1 lists the total execution time in milliseconds for each video sequence, together with the respective percentages for luma and chroma processing.

4.3.3 Discussion

The test results in table 4.1 show a difference between the CIF and QCIF formats. The motion compensation percentages of the total execution time are quite similar, with a 1% difference between video sequences of the same format.


Figure 4.7: Execution speedup per benchmark and resolution

The Akiyo and Container benchmarks show a 5% difference between CIF and QCIF. The deblocking filter, the last stage of the decoder, is divided into three parts: strength calculation, macroblock deblocking, and edge filtering. The strength, which determines the amount of filtering, depends on the differences across macroblock boundaries and on the gradient of the image samples across each boundary: the wider the difference between pixel values across macroblock boundaries, the more complex the strength calculation becomes. Color component differences between the two video sequences are similar across all stages, as shown in table 4.1. With 4:2:0 sampling, every four luma samples are grouped with two chroma samples, so luma processing needs at least twice as much time as chroma. Summed across all stages, luma processing accounts for 27% of the total execution time on average, while chroma processing accounts for 18%.

4.3.4 Speedup using Parallelism

Careful inspection of the source code and algorithms of the H.264 implementation allows us to conclude that luminance and chrominance processing are completely independent within each slice. Thus, executing the chroma decoding code on a different core, simultaneously with the luma decoding, eliminates 16% of the total execution time for QCIF resolution and 20% for CIF resolution. The portion of the execution that can be parallelized across color components is limited by the chroma execution time, and this portion varies between video formats. The total performance gain achieved using parallel color decoding ranges between 15% and 21%, as shown in figure 4.7.

Our proposed parallel H.264 implementation enables embedded devices available in today's market to benefit from multicore processors, increasing video decoding performance and decreasing power consumption. No additional or specific hardware is needed to use our algorithm. The H.264 codec is widely adopted by most manufacturers, with the baseline profile used for low-resolution devices and the Main and High profiles mainly used for high-definition resolutions. Decreasing the decoding time has benefits at both the user-experience and hardware-performance levels. In addition, lowering the execution and processing time extends the battery life of the mobile device: the user enjoys higher-quality video while the hardware consumes fewer resources in terms of processing time and power.
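These figures are consistent with a simple Amdahl-style bound: offloading a fraction \(p\) of the work to a second core removes it from the critical path at best, so with chroma at roughly 18% of the total execution time,

\[
S \;=\; \frac{1}{1-p} \;=\; \frac{1}{1-0.18} \;\approx\; 1.22,
\]

an upper bound of about 22%, which the measured 15-21% approaches once synchronization overhead is taken into account.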

4.4 Experiments with FFmpeg H.264 Decoder

The simulator used in the previous section is limited to a small number of parallel cores. In this section, we evaluate our proposed parallel color component algorithm using the Multi2Sim multicore simulator [66] with the FFmpeg H.264 implementation [17].

4.4.1 Multi2Sim Simulator

The Multi2Sim simulator supports multithreaded and multicore processors. The cache and memory configurations match an ARM Cortex-A9 MP processor. The FFmpeg H.264 decoder [17] is the application under study on these multicore configurations. The simulator collects several statistics, including the total numbers of instructions and cycles, cache reads and misses, and memory usage. We used 3 video benchmarks with CIF resolution (352x288) and 3 benchmarks with HD resolution (1280x720), and simulated the decoding of 30 frames for each benchmark.

The use of a different simulator is due to MPARM's inability to simulate more than 4 cores, and to Multi2Sim's straightforward execution of C programs. Multi2Sim's comprehensive pipeline and memory statistics are also easily mapped to the McPAT power estimation tool [33].

4.4.2 FFmpeg H.264 Implementation

FFmpeg [17] is an optimized implementation that supports most common video and audio formats, and it continues to evolve through expert open-source developers. Much research has been carried out on different H.264 implementations, which vary widely in performance and reliability; FFmpeg is considered one of the fastest video codec implementations on both counts. The library is open source and licensed under the GNU Lesser General Public License (LGPL). FFmpeg decodes most existing open and proprietary multimedia formats. The source code follows a modular design in the C language and includes many hardware-specific optimizations for particular processors. Several video codecs with similar functionality can share the same code without rewriting or copying it; for example, the H.264 codec uses many functions implemented for the MPEG-2 codec. Modifying existing code is generally not easy; however, adding new standards to the library is much easier than rewriting the whole implementation.

4.4.3 Speedup using Parallelism

Comparing sequential execution on 1 core with parallel execution on 2 cores shows a performance increase of around 18%. Video sequences with fast-moving objects usually show a lower performance increase. This difference in speedup is mainly due to the large number of macroblocks that depend on previous reference frames, resulting in more data communication between cores. In addition, macroblocks may be divided into sub-blocks as small as 4x4 pixels. Synchronization overhead is consequently added to the whole execution: an average of 3% more instructions is needed to handle synchronization between cores. Accounting for this overhead, the net performance speedup of parallel execution averages 12%, as displayed in figure 4.8. An important similarity of speedup is noticed between CIF and HD resolutions: the parallel implementation seems unaffected by the video sequence resolution. It is, however, strongly affected by the complexity of moving objects in the frames. For example, the Shields, Into Tree, and Foreman benchmarks all feature a moving object in the middle of the frame with a slowly moving background. Park Run has the lowest performance gain, while News achieves the highest speedup.

Figure 4.8: Speedup for parallel luma and chroma decoding, pipelined entropy decoder, and combined pipeline and parallel decoding.

Applying the pipeline structure illustrated in figure 4.6 yields an important increase in performance by significantly reducing the execution time of entropy decoding. Figure 4.8 displays the speedup gained with the combined structure, ranging from 24% to 32% with an average of 29%. Experiments are performed using both CIF and HD formats. We also notice that the speedups of the parallel and pipeline approaches are inversely related: as the speedup from parallel luma-chroma execution increases, the speedup from pipelining decreases. This is mainly due to the complexity of the video sequences: when the objects are more complex, the entropy coding compression efficiency is lower, so the time needed for entropy decoding becomes higher.


Figure 4.9: Energy consumption decrease with parallel-pipeline decoding.

4.4.4 Power Efficiency

One of the most important factors in computer processing nowadays is power efficiency. New chips aim at lower power consumption as display screens become larger and programs require more processing. Multicore processors consume more energy in exchange for more processing performance. In our parallel H.264 implementation, overlapping instructions are executed at the same time: the total number of instructions and the numbers of loads and stores increase, while the total number of cycles and the total execution time decrease. Figure 4.9 plots the percentage decrease in energy consumption of the parallel H.264 implementation on 2 cores relative to the original sequential implementation on 1 core. The average saving is 19%. High-definition video sequences have higher power consumption. Power measurements are generated using the McPAT tool [33].
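This trade-off can be summarized with the basic energy relation below; the two scaling factors are illustrative numbers, not measurements from our experiments:

\[
E \;=\; \int P\,dt \;\approx\; P_{\text{avg}} \cdot t,
\qquad
\frac{E_2}{E_1} \;=\; \frac{P_2}{P_1}\cdot\frac{t_2}{t_1} \;\approx\; 1.4 \times 0.58 \;\approx\; 0.81 .
\]

For example, a dual-core run that draws 1.4x the average power but finishes in 0.58x the time consumes about 19% less energy, matching the average saving reported above.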


Figure 4.10: Speedup increase for FFmpeg multithread version.

4.4.5 FFmpeg Multi-Threaded Version

During the 2008 Google Summer of Code, the multi-threaded decoding branch FFmpeg-mt was created. This version can decode multiple frames in parallel, and work has recently started to merge the multi-threaded branch into the main FFmpeg source code. In our research, we evaluate the multi-threaded FFmpeg version on multiple cores and further integrate our luma-chroma parallel decoding, together with entropy decoder pipelining, into its source code. The new implementation, which decodes frames in parallel, requires doubling the number of cores for parallel color decoding. Synchronization is only required when a frame depends on a reference frame that is still being decoded. To decode luma and chroma in parallel, we decode each frame using 2 cores: one core executes all the frame decoding work except the chroma-related tasks, and the other executes the motion compensation and intra-prediction of the chroma samples. The entropy decoder pipeline illustrated in figure 4.6 is applied. Our approach is thus fully integrated into the parallel H.264 source code of FFmpeg [17].

As a result of this combined parallel implementation, we show that our luma-chroma parallel technique can be applied on top of existing coarse-grain and fine-grain parallelization methods. Coarse-grain methods decode groups of frames, individual frames, or slices in parallel; fine-grain methods decode multiple macroblocks or blocks in parallel. Luma-chroma parallel decoding is applied once the entropy decoding, inverse transform, and de-quantization processes are completed. Depending on the H.264 decoder implementation, these processes can be performed for a whole slice before moving on to inter- or intra-prediction, or they can be executed partially for each macroblock. FFmpeg uses the latter, macroblock-level technique, and the deblocking filter is executed when a row of macroblocks is completely decoded. We therefore decode the color components in parallel for each line of macroblocks: one core decodes the luma samples of all macroblocks in a row while another core decodes the chroma samples of the same row. This level of parallelism may be considered midway between coarse-grain and fine-grain methods.

Executing the H.264 decoder on multicore processors shows an increase in performance over multi-threaded execution on one core. Adding color component parallelization with entropy decoder pipelining shows a relatively similar performance gain. The performance gain varies between 10% and 60%, depending on the number of cores and the number of frames decoded in parallel. The speedup percentages are similar to the numbers shown earlier for both high-definition (HD) and low-definition (CIF) resolutions. However, 8 cores show the highest gain compared to 4 and 2 cores, as displayed in figure 4.10; with 16 cores and above, the gain remains almost the same. We thus conclude that a saturation point is reached with 8 cores. The number of reference frames in a video sequence depends mainly on the encoder. In fact, using more cores does not always increase performance, due to factors such as data communication, synchronization, and frame dependencies in the video sequence.

4.5 Conclusion

H.264 is widely adopted in multimedia applications on general-purpose and embedded systems. The high complexity of the H.264 decoder calls for enhancements that increase efficiency and lower power consumption. The proposed parallel decoding of luma and chroma offers substantial and realistic gains for video decoding on dual- and quad-core processors. The execution time speedup of our parallel H.264 decoder implementation is around 18%; with our proposed pipeline implementation, the speedup reaches 32% with an energy saving of 24%.

In the following chapter, an advanced parallel algorithm is proposed to execute motion compensation on a large number of parallel cores. That approach processes groups of independent macroblock rows in parallel and shows higher scalability than the color component approach described in this chapter, achieving good speedup and energy savings on multicore processors with more than 8 cores.


Chapter 5

H.264 Macroblock Rows Parallel Decoding

5.1 Introduction

Many parallel implementations of the H.264 codec exist, ranging from parallel decoding of macroblocks (fine-grain implementations) to parallel decoding of groups of pictures (coarse-grain implementations). A macroblock is a 16x16-pixel square component of an image in a video sequence; it can also be divided into sub-blocks of smaller size. Macroblock parallel decoding is highly scalable, since many independent macroblocks can be processed in parallel; however, memory communication and execution synchronization between macroblocks create dependencies and large overheads. On the other hand, parallel decoding of groups of pictures requires large memory, especially for high-definition video sequences, and has lower scalability than parallel macroblock decoding because only a small number of groups of frames can be decoded in parallel. In our approach, we process rows of independent macroblocks in parallel using a new algorithm that eliminates dependencies between macroblocks and minimizes synchronization overhead. This level of parallel execution sits between the coarse-grain and fine-grain approaches, offering a balance between large overheads and high scalability.

Our main contribution in this research is the design and implementation of a new algorithm for processing the macroblock rows of the H.264 decoder in parallel. In addition, a small-footprint data dependency detection algorithm that isolates intra-prediction macroblocks (I-MBs) is implemented and executed on the macroblocks of each slice of a video frame. Experiments are conducted by executing our scalable parallel decoder on a Cuda Development Kit platform [43] with a 4-core ARM Cortex-A9 processor [6]. Execution time and energy consumption statistics are collected by running the application on the real-board platform. For HD and Full-HD resolutions, the video sequence benchmarks reach their maximum throughput using 4 threads on 4 cores, with a speedup of 3.3x for motion compensation, an overall execution time speedup of 2.3x, and an energy saving of 63%. Moreover, the parallel algorithm has a very high theoretical speedup that is applicable to manycore and vector processors. [10, 11]

In our research, we improve the execution time of the H.264 decoder, noting that our approach is also applicable to the H.264 encoder. We focus on improving the efficiency of the H.264 decoder on multicore processors: we decode groups of macroblock rows in parallel, with each group mapped to one core. Dependencies between macroblocks are avoided by decoding intra-prediction macroblocks sequentially at the end of the decoding stage. We show that our approach has better load balancing on multiple cores and lower synchronization overhead than other approaches, and with these advantages we reach higher theoretical and realistic speedups. We evaluate our approach on a real platform equipped with a quad-core processor, gathering and analyzing execution time and energy consumption statistics.

The remainder of the chapter is organized as follows. In section 5.2, we briefly describe the H.264 decoding process and the macroblocks that form a slice of a picture. In section 5.3, we describe our approach for parallelizing the motion compensation phase and the deblocking filter. In section 5.4, we present our real-board experimental results for execution time and energy consumption; we also discuss and analyze simulated executions and the theoretical scalability of our algorithm. Section 5.5 presents experiments and results for our parallel algorithm on graphics processors. Conclusions and future work are given in section 5.6.


Figure 5.1: H.264 decoding process

5.2 Decoder Decomposition

In this section, we provide a brief description of the H.264 video decoding process and the macroblocks in a frame. We also profile the different H.264 decoder stages.

5.2.1 Decoding Stages

The H.264 decoder can be divided into five main functional phases: Entropy Decoder (ED), De-Quantization and Inverse Transform (IQT), Motion Compensation (MC), Intra-Prediction (IP), and Deblocking Filter (DF). The H.264 decoder stages are illustrated in Figure 5.1. The decoding process starts by entropy decoding the input bitstream; then de-quantization and inverse transformation are applied to the resulting data. Afterwards, in every slice of a frame, macroblocks are processed in raster order. Each macroblock is intra-predicted or inter-predicted (motion compensated) using the reference frames. The deblocking filter is applied at the end to make the edges between macroblocks smooth and invisible to human vision.

Figure 5.2 illustrates the average execution percentage of each main phase using the baseline and main profiles. The de-quantization and inverse transform phase is grouped with the entropy decoder phase because it has a small footprint in the overall execution; the two prediction phases, motion compensation and intra-prediction, are also merged into one phase.

Figure 5.2: H.264 decoding stages workload percentages

Our parallel algorithm is applied to the prediction phase, which accounts for 41% to 45% of the overall decoding process. The entropy decoder with de-quantization and inverse transform is executed sequentially, with a share ranging from 14% to 19%. We use the wavefront algorithm [72] for the deblocking filter of the H.264 decoder. The deblocking filter has a huge impact on the overall performance of the decoder: 45% for the baseline profile and 36% for the main profile.

5.2.2 Macroblocks

Each slice of a picture frame is partitioned into square blocks of 16 x 16 pixels called macroblocks (MBs). The numbers of horizontal and vertical macroblocks vary with the resolution of the frame. A macroblock can be divided into sub-blocks of 16 x 8, 8 x 8, 8 x 4, and 4 x 4 pixels. The encoder chooses the sub-block sizes depending on the amount of detail (complexity) in specific parts of an image frame; an image, or part of an image, is considered complex when it contains objects with tiny details. For example, in a video of a flying bird against a uniform blue background, the encoder will divide the macroblocks in the region displaying the bird into sub-blocks smaller than 16 x 16, while the blue-sky macroblocks remain at the full 16 x 16 size.

The motion compensation stage uses a reference buffer, containing a list of previously decoded frames, to calculate the values of macroblocks in the current frame. Macroblocks that are inter-predicted and motion compensated from previously decoded frames are of type P or B (P-MBs and B-MBs): P-MBs depend on macroblocks in one reference frame, while B-MBs are calculated using macroblocks in two reference frames. Macroblocks that depend on other macroblocks in the current frame (I-MBs) are intra-predicted. Finally, deblocking filtering is applied at the end of the decoding process to reduce the edging effect between macroblock borders. In the following section, we describe our parallel implementation of the H.264 decoder in detail.

Figure 5.3: Decoding groups of macroblock rows in parallel using N threads
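The macroblock grid dimensions quoted in this chapter follow directly from the 16 x 16 macroblock size. The sketch below (illustrative code, not decoder internals) computes them, rounding up for frame dimensions that are not multiples of 16:

```c
#include <stdio.h>

#define MB_SIZE 16  /* macroblock edge length in pixels */

/* Numbers of macroblocks per row and per column, rounding up for
 * frame sizes that are not multiples of 16. */
static int mb_cols(int width)  { return (width  + MB_SIZE - 1) / MB_SIZE; }
static int mb_rows(int height) { return (height + MB_SIZE - 1) / MB_SIZE; }

int main(void)
{
    /* HD: 80 x 45 = 3600 MBs; FHD (1920x1088): 120 x 68 = 8160 MBs */
    printf("HD : %d x %d = %d MBs\n", mb_cols(1280), mb_rows(720),
           mb_cols(1280) * mb_rows(720));
    printf("FHD: %d x %d = %d MBs\n", mb_cols(1920), mb_rows(1088),
           mb_cols(1920) * mb_rows(1088));
    return 0;
}
```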

5.3 Parallel Implementation

In this section, we elaborate on our parallel implementation of the H.264 video decoder. We explain how we apply parallelism to the motion compensation and deblocking filter stages, and we discuss macroblock partitioning and dependencies.


5.3.1 Parallel Motion Compensation

The H.264 reference implementation, JM [61], is an open source implementa- tion used as a reference implementation for the H.264 standard. In our research, we modified the JM [61] source code of the H.264 decoder in order to decode rows of macroblocks in parallel using the PThread library in C programming language. A thread is created for every group of macroblock rows. Each thread is mapped to one core. The number of thread is specified by the user or the application. If the number of threads is greater than the number of cores, then the scheduler will assign more than one thread for one core. As shown in figure 5.3, each thread handles the motion compensation stage for a group of macroblocks rows. All threads should complete their task before moving on to the next phase which is intra-predication for I-MBs. The maximum numbers of parallel decoding blocks is equal to the number of macroblock rows. This level of parallel decoding of macroblock rows may be considered in between coarse-grain and fine-grain approaches. Coarse-grain ap- proaches process multiple slices or frames in parallel. These high level methods, like [20][28][42][57], need high memory usage in order to decode multiple frames in parallel because of the required size to store and to transfer data of several frames. Fine-grain approaches decode macroblocks or blocks inside a macroblock in parallel. These low-level methods, like [14][67][72], cause an enormous syn- chronization overhead affecting deeply the speedup for the reason of large number of macroblocks in every frame. The balance between both approaches is also re- flected on synchronization overheads and data communication requirements. Our approach is aimed to benefit from the balance between both advantages and disadvantages. Macroblock rows require less memory than a frame and more than one macroblock. In fact, our approach is scalable up to the macroblock level. Such granularity will create a huge overhead of parallelism on current multicore architectures. On the other hand, the number of macroblock rows is much less than the total number of macroblocks. For example, in HD resolution (1280 x 720), each frame has 3600 macroblocks, 80 horizontal MBs and 45 vertical MBs. Thus, the number of macroblocks rows is less by a factor of 80 than the total number of macroblocks. As a result, the overhead for synchronization and com-

79 5. H.264 MACROBLOCKS ROWS PARALLEL DECODING

Figure 5.4: Dependencies between macroblocks munications between cores is also reduced by a factor of 80.
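As a quick check of the factor quoted above, using only the HD numbers already given:

\[
\text{MBs/frame} \;=\; \frac{1280}{16}\cdot\frac{720}{16} \;=\; 80 \cdot 45 \;=\; 3600,
\qquad
\frac{\text{MBs/frame}}{\text{rows/frame}} \;=\; \frac{3600}{45} \;=\; 80 .
\]

Synchronizing once per row instead of once per macroblock therefore cuts the number of synchronization points by a factor of 80.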

5.3.2 Macroblocks Dependencies

In H.264 there are 4 types of macroblocks: I, P, B, and SKIP. Figure 5.4 illustrates the dependencies between macroblocks of types I and P. I-MBs depend on other macroblocks in the same slice of a frame, as shown in figure 5.4-a, where the macroblock pointed at by the arrows may depend on one or more macroblocks. P-MBs depend on macroblocks from previously decoded frames, as shown in figure 5.4-b, where the origin of the arrow is a macroblock in a previously decoded frame; motion vector information is required for P-MBs in order to reconstruct the coded macroblocks. B-MBs depend on past and future reference frames; they occur in B-frames and can have one or two motion vectors. A SKIP macroblock is unchanged relative to a macroblock in a previously decoded frame: the motion vector differences are zero, so the prediction macroblock is simply copied as the reconstructed macroblock.

In a frame, all macroblocks can be processed in parallel except I-MBs, because they depend on macroblocks being decoded in the same slice. A dependency identification procedure is therefore needed to satisfy intra-prediction dependencies. To overcome this constraint, we start by decoding all macroblocks of type P, B, and SKIP in parallel. During this step, we skip all I-MBs and save references to the skipped macroblocks for later processing. When this stage is completed, the remaining I-MBs in the current slice are decoded sequentially, as illustrated in figure 5.3. Among the remaining I-MBs, independent macroblocks could be processed in parallel, since they depend only on macroblocks of the same slice that are already processed; for simplicity, and because of their small number in each frame (except I-frames), we process I-MBs sequentially in our algorithm. With this ordering mechanism, dependencies between macroblocks in the same slice are satisfied.

Table 5.1 lists the percentages of I-MBs, P-MBs, and SKIP-MBs in the video sequences used in our experiments. The average share of I-MBs across all video sequences is about 2%. I-MBs also exist in P-frames and B-frames; their number depends on highly detailed objects and on the rate of movement in the video sequence. P-frames and B-frames are mostly composed of P-MBs and SKIP-MBs with a small number of I-MBs, so the small number of I-MBs in P-frames and B-frames does not significantly affect the overall speedup of parallel macroblock decoding.

5.3.3 IDR Frame Frequency

An encoded video always starts with an I-frame (IDR), which is composed entirely of I-MBs. This type of frame typically appears every second in a video sequence in order to stop the propagation of communication errors when data is lost during transmission. A high number of IDR frames significantly impacts the parallel efficiency and scalability of our algorithm. The interval between IDR frames is typically equal to the frame rate (as in the default settings of the x264 encoder [46]); for example, an HD video sequence at 60 frames per second (fps) will have an IDR frame every 60 frames (one per second). The frequency of IDR frames can be increased or decreased in the encoder configuration; however, a high frequency, for example one I-frame every 10 frames, decreases the compression efficiency of the encoder without any improvement noticeable to human vision. The recommended configuration for the IDR period in the x264 [46] and the JM


Table 5.1: Percentages of different types of macroblocks per video sequence

Name       | Resol.    | Fr. | I    | P     | SKIP
bus        | 352x288   | 150 | 1.70 | 79.20 | 19.10
foreman    | 352x288   | 300 | 1.80 | 70.95 | 27.25
waterfall  | 352x288   | 260 | 0.25 | 70.05 | 29.70
johnny     | 854x480   | 600 | 0.10 | 22.35 | 77.55
basketball | 854x480   | 500 | 3.40 | 62.25 | 34.35
cactus     | 854x480   | 500 | 1.50 | 42.30 | 56.20
johnny     | 1280x720  | 600 | 0.15 | 22.50 | 77.35
basketball | 1280x720  | 500 | 3.95 | 58.50 | 37.55
cactus     | 1280x720  | 500 | 1.90 | 42.50 | 55.60
basketball | 1920x1088 | 500 | 4.95 | 55.30 | 39.75
cactus     | 1920x1088 | 500 | 3.15 | 44.05 | 52.80
terrace    | 1920x1088 | 600 | 0.80 | 56.50 | 42.70
Average    |           |     | 1.97 | 52.20 | 45.83

[61] H.264 encoders is an adaptive decision that inserts an IDR frame whenever the scene changes. We use this feature in our experiments to encode the video benchmarks. The numbers of I-, P-, and B-frames are listed in table 5.2. The IDR period for low frame rates (25 fps) is around 150 frames (6 seconds), and for high frame rates (50-60 fps) it is 200-250 frames (3-5 seconds).

5.3.4 Macroblock Dependency Check Algorithm

The macroblock dependency check algorithm is straightforward and simple to implement; figure 5.5 illustrates it. Given a list containing all the macroblocks in a slice, a loop iterating over all macroblocks flags the intra-prediction macroblocks (I-MBs) and assigns each remaining macroblock to a group associated with an available core. These groups of macroblocks are then decoded in parallel. When all macroblock groups are processed, a second loop iterates over the initially flagged I-MBs and decodes them sequentially. I-MBs that are not neighbors, and hence have no dependencies between them, could be processed in parallel; however, the number of I-MBs in P-frames and B-frames is small, as shown in table 5.1, so assigning individual I-MBs to different cores would add synchronization overhead for very little work. We therefore execute them sequentially in our experiments, for simplicity and lower communication overhead.

Figure 5.5: Macroblock row-based parallel algorithm. In step 1, all the macroblocks are scanned and I-MBs are identified. In step 2, rows of P-MBs and B-MBs are processed simultaneously. Finally, in step 3, the remaining I-MBs are decoded sequentially.

With the previously mentioned steps, the inter-prediction and intra-prediction stages are completed. The output of this stage complies fully with the H.264 standard [26]: it is exactly the same as that of sequential execution. Decoded macroblocks are then submitted to the deblocking filter, which makes the edges between macroblocks smooth and nearly invisible.

The asymptotic worst-case complexity of the proposed parallel algorithm remains essentially the same as that of the sequential algorithm. All macroblocks are processed once, as in sequential execution. One additional iteration with constant per-element cost is added before parallel execution: I-MBs are identified and pointers to them are stored in a list for the later pass. The runtime cost of this additional loop is linear and negligible relative to the total complexity of the algorithm. After the parallel decoding of the independent macroblocks, the macroblocks remaining in the list are processed sequentially and in order. Thus, the worst-case complexity matches that of the original sequential algorithm, with or without the gain of parallel computing.
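As a concrete illustration of the two-pass procedure of figure 5.5, the following sketch uses hypothetical types and stub routines (the actual JM data structures are far richer); it defers I-MBs during a linear scan and decodes them after the parallel pass:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical macroblock record; the real JM structures differ. */
typedef enum { MB_I, MB_P, MB_B, MB_SKIP } MbType;
typedef struct { MbType type; /* ... coded data ... */ } Macroblock;

/* Stubs standing in for the actual JM decoding routines. */
static void decode_inter_mb(Macroblock *mb) { (void)mb; }
static void decode_intra_mb(Macroblock *mb) { (void)mb; }

/* Two-pass slice decode. Pass 1 scans all n macroblocks in raster
 * order, defers I-MBs, and decodes P/B/SKIP MBs (in the full decoder
 * this pass runs as parallel row groups). Pass 2 decodes the deferred
 * I-MBs sequentially, so their intra dependencies are satisfied. */
static void decode_slice(Macroblock *mbs, size_t n)
{
    Macroblock **deferred = malloc(n * sizeof *deferred);
    size_t n_deferred = 0;

    for (size_t i = 0; i < n; i++) {            /* pass 1: O(n) scan  */
        if (mbs[i].type == MB_I)
            deferred[n_deferred++] = &mbs[i];   /* flag for later     */
        else
            decode_inter_mb(&mbs[i]);           /* parallel-safe work */
    }
    for (size_t i = 0; i < n_deferred; i++)     /* pass 2: ~2% of MBs */
        decode_intra_mb(deferred[i]);

    free(deferred);
}

int main(void)
{
    Macroblock mbs[4] = { {MB_I}, {MB_P}, {MB_SKIP}, {MB_I} };
    decode_slice(mbs, 4);
    return 0;
}
```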


Figure 5.6: Parallel decoding of macroblocks mapped to (a) 4 cores and (b) 8 cores

5.3.5 Macroblocks Partitioning

In the parallel decoding algorithm described above, groups of macroblocks are decoded in parallel. In this part, we explain why we chose this grouping. As explained above, while iterating over the macroblocks of a frame slice, we skip intra-prediction macroblocks (I-MBs) and decode inter-prediction macroblocks (P-MBs and SKIP-MBs) in parallel on multiple cores. Depending on the number of available cores, we group rows of macroblocks to be decoded in parallel: the slice is divided horizontally by the number of cores.

Seitner et al. [55] compare 6 parallel representations in terms of stall time and core usage. Among the presented data partitioning approaches, our partition is similar to the slice-parallel splitting approach described in [55]. As shown by the authors, that approach has significant stall-time overhead caused by the synchronization procedures needed to satisfy macroblock dependencies; with our method of satisfying dependencies between macroblocks, however, this stall-time overhead does not apply. We chose this partitioning because of data locality and minimal data transfer overhead. For example, to execute a slice of 80 rows of macroblocks on a 4-core processor, each core decodes a chunk of 20 rows (see the sketch below). With this partitioning, one frame requires four transfers to send data to each core's local memory; this number of transfers is minimal because it equals the number of available cores. Communication between the caches of different cores is required only when I-MBs depend on macroblocks processed by another core.

In figure 5.6, we show an example of a frame of size 8 x 8 MBs (128 x 128 pixels) mapped on 4 cores in 5.6-a and on 8 cores in 5.6-b. The numbers inside the squares are the numbers of the corresponding cores. The macroblocks in figure 5.6 are assumed to be all P-MBs or B-MBs; I-MBs are not displayed, for illustration purposes. In a sequential implementation, macroblocks are processed in raster-scan order, row by row from top to bottom and, within each row, from left to right. All independent macroblocks in a slice could be processed at the same time, but the level of parallelism is limited by the number of available cores. In our parallel implementation, we group macroblocks into rows because this offers good load balance across cores, has low synchronization overhead, and is simple to implement and manage. Moreover, decoding independent macroblocks vertically or diagonally showed no significant difference from horizontal decoding, because all these macroblocks depend on previously decoded macroblocks. Further studies will group macroblocks based on their dependencies on previously decoded macroblocks; in this chapter, we limit our study to the row-based algorithm tested on an embedded multicore processor.
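A minimal sketch of this partitioning follows, assuming a hypothetical mc_decode_row() hook into the decoder and a fixed 4-core mapping; the remainder handling generalizes the 80-rows / 4-cores example to row counts that do not divide evenly:

```c
#include <pthread.h>

#define NUM_CORES 4   /* one thread per core, as in our experiments */

typedef struct { int first_row, last_row; } RowGroup;

/* Hypothetical hook into the decoder: motion-compensate every
 * P/B/SKIP macroblock of one row (I-MBs were deferred earlier). */
static void mc_decode_row(int mb_row) { (void)mb_row; /* stub */ }

static void *mc_worker(void *arg)
{
    const RowGroup *g = arg;
    for (int r = g->first_row; r <= g->last_row; r++)
        mc_decode_row(r);                 /* rows of one contiguous group */
    return NULL;
}

/* Split total_rows macroblock rows into NUM_CORES contiguous groups
 * (e.g. 80 rows -> 4 chunks of 20) and decode them in parallel. */
static void parallel_motion_compensation(int total_rows)
{
    pthread_t tid[NUM_CORES];
    RowGroup  grp[NUM_CORES];
    int base = total_rows / NUM_CORES, rem = total_rows % NUM_CORES;

    for (int i = 0, row = 0; i < NUM_CORES; i++) {
        int rows = base + (i < rem ? 1 : 0); /* first `rem` groups get
                                                one extra row          */
        grp[i].first_row = row;
        grp[i].last_row  = row + rows - 1;
        row += rows;
        pthread_create(&tid[i], NULL, mc_worker, &grp[i]);
    }
    for (int i = 0; i < NUM_CORES; i++)      /* join = barrier before
                                                the sequential I-MB pass */
        pthread_join(tid[i], NULL);
}

int main(void) { parallel_motion_compensation(45); return 0; }
```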

5.3.6 Scalability of Parallel Motion Compensation

In our approach, the highest scalability level is the maximum number of independent macroblocks in a frame slice. Once the dependency detection algorithm isolates the I-MBs, all remaining macroblocks can be processed at the same time; the level of parallelism is limited only by the cores available in the multiprocessor chip. The optimal speedup occurs when all groups of macroblocks are assigned to available parallel cores, which eliminates the context-switching overhead that generally degrades performance. For manycore processors, an important limitation that remains unsolved is the huge data communication overhead between cores. For vector processors or general-purpose graphics processing units (GPGPUs), which offer a very high level of parallelism, there is great potential to benefit from the high scalability of our approach.

Figure 5.7: Sequential and parallel deblocking filter of macroblocks in the H.264 decoder

5.3.7 Parallel Deblocking Filter

The deblocking filter, the last stage of the H.264 decoder, makes the edges between macroblocks smoother. This process decreases the artifacts that appear when a slice is partitioned into macroblocks. This final stage, which accounts for 41% to 45% of the total decoding time as illustrated in figure 5.2, is also modified to be executed in parallel on different cores. However, the macroblock dependencies in this stage differ from those of motion compensation and intra-prediction: during the deblocking filter stage, each macroblock requires that its top and left macroblocks are already processed. Figure 5.7 illustrates the sequential (a) and parallel (b) filtering modes applied to the macroblocks of a slice; both scanning modes satisfy the dependency requirements of the deblocking filter stage. In figure 5.7-a, one macroblock is filtered at a time. In figure 5.7-b, the macroblocks colored in dark gray are processed on different cores in parallel. This method, known as wavefront scheduling, is a commonly used approach for processing independent macroblocks; it can be applied at the intra-prediction, motion compensation, and deblocking filter stages, as proposed and explained by Zhao et al. [72].

Table 5.2: Video sequences resolution and frame type information

Name       | Resol.    | fps | I | P   | B   | Total
bus        | 352x288   | 25  | 1 | 75  | 74  | 150
foreman    | 352x288   | 25  | 2 | 161 | 137 | 300
waterfall  | 352x288   | 25  | 2 | 116 | 142 | 260
johnny     | 854x480   | 60  | 3 | 151 | 446 | 600
basketball | 854x480   | 50  | 2 | 250 | 248 | 500
cactus     | 854x480   | 50  | 2 | 249 | 249 | 500
johnny     | 1280x720  | 60  | 3 | 151 | 446 | 600
basketball | 1280x720  | 50  | 2 | 247 | 251 | 500
cactus     | 1280x720  | 50  | 2 | 244 | 254 | 500
basketball | 1920x1088 | 50  | 2 | 236 | 262 | 500
cactus     | 1920x1088 | 50  | 2 | 181 | 317 | 500
terrace    | 1920x1088 | 60  | 3 | 231 | 367 | 600

We implement the wavefront parallel method for the deblocking filter stage only. This method satisfies the dependency requirements of the deblocking filter process, as illustrated in figure 5.7-b, and complements our proposed parallel motion compensation algorithm: both stages process independent macroblocks in parallel. In the following section, experimental results are provided for the complete parallel implementation of the motion compensation and deblocking filter stages.
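The scan order of figure 5.7-b can be captured in a few lines. The sketch below (with a hypothetical filter_mb callback; the JM internals differ) enumerates the anti-diagonals x + y = d of the macroblock grid; within one diagonal, every macroblock's top and left neighbors lie on earlier diagonals, so all iterations of the inner loop are independent and could be dispatched to a thread pool.

```c
#include <stdio.h>

/* Stub for the per-macroblock edge filtering work. */
static void filter_mb(int x, int y) { printf("filter (%d,%d)\n", x, y); }

/* Wavefront order for the deblocking filter: an MB at (x, y) needs
 * (x-1, y) and (x, y-1) already filtered, both of which sit on the
 * previous anti-diagonal, so each diagonal is a parallel batch. */
static void deblock_wavefront(int mb_cols, int mb_rows,
                              void (*filter)(int x, int y))
{
    for (int d = 0; d <= mb_cols + mb_rows - 2; d++) {
        /* all iterations of this inner loop are mutually independent */
        for (int y = 0; y < mb_rows; y++) {
            int x = d - y;
            if (x >= 0 && x < mb_cols)
                filter(x, y);
        }
    }
}

int main(void)
{
    deblock_wavefront(8, 8, filter_mb);  /* the 8 x 8 MB frame of fig. 5.6 */
    return 0;
}
```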

5.4 Experimental Results on Multicore Systems

In this section, we evaluate our parallel H.264 implementation on an embedded multicore processor. We describe the configuration environment for the real-time execution and the tools used to collect execution information. We gather real-board execution time and energy consumption statistics and compare our results with similar parallel H.264 implementations from the literature. Moreover, we collect runtime statistics for our parallel implementation using a multicore simulator and a graphics processor.


5.4.1 Parallel Execution

Parallel execution is considered a major potential solution for complex applications whose performance is bounded by sequential execution. Most processors currently on the market have multiple cores, and applications with high computational complexity may obtain significant speedups from multiple cores when data or functional parallelism is applicable; even optimized implementations can still take advantage of parallel techniques. In our research, we choose the H.264/AVC video decoder as our multimedia application benchmark, for which we provide an innovative parallel approach; we gather execution statistics and compare results with relatively similar implementations.

In our parallel H.264 approach, the motion compensation (MC) stage for each row of inter-prediction macroblocks (P-MBs) is executed in parallel on different cores. We evaluate our parallel implementation using video sequences with CIF (352x288), WVGA (854x480), HD (1280x720), and FHD (1920x1080) resolutions on an embedded multicore processor. Macroblock dependencies within the same picture slice are avoided by decoding intra-prediction macroblocks (I-MBs) only after all other macroblocks of the same slice are decoded. Overheads emerge from shared-memory communication and synchronization between cores. We collect execution time and energy consumption statistics through experiments on a real board with an embedded multicore processor. A saturation threshold in the ratio of speedup to number of cores is identified when large numbers of threads are used. Parallel execution is also tested on a multicore simulator and a graphics simulator [66].

5.4.2 Environment and Configurations

Our parallel H.264 implementation described in section 5.3 is executed and tested on a Cuda Development Kit platform [43] with a 4-core ARM Cortex-A9 processor [6]. The platform has 2 GB of memory and an L2 cache of 1 MB; the L1 instruction and data caches are 32 KB each. The maximum frequency is 1.3 GHz when 4 cores are used. This high-end, low-power processor is currently found in many portable devices such as smartphones, tablets, and notebooks. We execute our parallel H.264 decoder using 2, 4, 6, 8, 12, and 16 threads. Each thread is mapped automatically by the operating system (Ubuntu in our case) to a different core; when the number of threads exceeds 4, context switching is required to run all the threads created by the application.

We gather statistics for 4 different resolutions: CIF, WVGA, HD, and FHD. For each resolution, we use 3 video sequences with different image complexities in terms of movement speed and number of objects. Table 5.2 lists all the video benchmarks used in our experiments, including the resolution, the frame rate in frames per second, the numbers of I-, P-, and B-frames, and the total number of frames. Real-time execution is performed for all the above video sequences; execution time is measured by the application and the operating system. Energy statistics are collected by a power measuring instrument, the Agilent LXI digitizer [64], which accurately measures static and dynamic power consumption across sense resistors placed on the board. The Agilent Technologies L4532A [64] is a high-resolution, standalone LXI digitizer. It offers 2 channels of simultaneous sampling at up to 20 megasamples per second (MSa/s) with 16 bits of resolution; its inputs are isolated and can measure up to 250 volts to handle the most demanding applications. Time and energy results are illustrated and analyzed in the following subsections.

5.4.3 Results for Parallel Motion Compensation

Experiments are performed on the video sequences listed in table 5.2. The number of parallel macroblock rows increases with the video resolution; thus, high resolutions scale better with the number of cores than low resolutions, owing to the higher number of macroblocks per frame. Experiments are conducted using 2, 4, 6, 8, 12, and 16 threads on a 4-core ARM Cortex-A9 [6]. Figure 5.8 shows the average speedup of the motion compensation stage for every resolution and number of threads. For the CIF resolution, the maximum speedup of 1.8 is attained with 4 threads, and the speedup decreases as the number of threads increases, due to large data communication overhead. HD and FHD video sequences reach a speedup above 3.3 with 4 threads, each thread being mapped to a different core.

Figure 5.8: Speedup of H.264 parallel execution of the motion compensation stage.

The best ratio of speedup to number of threads occurs with 4 threads: for the high-definition resolutions it is around 0.8. Doubling the number of threads drops the ratio to 0.6, which cannot be considered efficient for a parallel application on a multicore processor. Using more threads than cores causes the scheduler to assign several threads to one core, and the resulting context switching does not increase the efficiency of the application, as our results show.

Results for high resolutions generally show better speedups, mainly because each core receives a greater workload than with smaller resolutions; a larger workload reduces the relative impact of synchronization and data transfer between cores. One reason is fewer dependencies between macroblocks processed on different cores; another is the data transfer overhead required for sending data to the different cores. Synchronization also adds an overhead that is independent of the video resolution. Speedup is therefore much more efficient at higher resolutions.


Table 5.3: Comparison of macroblock parallelism scalability with Dynamic 3D-Wave in [38].

    Resolution           Total MBs   3D-MBs   Par-MBs    Diff.
    SD  (720 x 576)           1620     1288      1592   +23.6%
    HD  (1280 x 720)          3600     2886      3528   +22.3%
    FHD (1920 x 1088)         8160     5819      7917   +36.1%

5.4.4 Comparison with Related Work

For the 2D-Wave approach described in [39], the speedup using 4 cores is 2.6 and the highest speedup is around 9.5 using 24 cores. These results assume minimal data communication and dependencies between cores. Our results have a better ratio between the speedup and the number of cores; however, we can only compare the speedup up to 4 cores. In addition, our approach has a higher theoretical speedup, as the number of independent macroblocks that can be processed at the same time is higher. When processing macroblocks simultaneously, the workload on the different cores is almost equal. On the other hand, when applying the wavefront approach in [39] and [72], the number of independent macroblocks reaches its maximum only when almost half of all macroblocks of the current slice are already decoded. Furthermore, the experimental environments are not the same: we test our parallel implementation on a real platform, whereas most results in other works, such as [55] and [39], are obtained with simulators. In the following sections, we show simulated results for the overall execution of the parallel H.264 decoder. Exact comparisons with related work cannot be accurate for several reasons, such as differences in decoder implementation, processor configuration, video resolutions, and data communication between parallel cores. However, a comparison of the macroblock scalability between our approach and the Dynamic 3D-Wave [38] is shown in table 5.3. The 3D-Wave paper [38] performed a detailed analysis of the parallel scalability of macroblocks. We compare the maximum number of macroblocks that can be processed in parallel between our approach and the Dynamic 3D-Wave approach. Three video resolutions are compared. The SD resolution (720 x 576) is compared to WVGA (854 x 480) because it has the same total number of macroblocks per frame. The remaining resolutions being compared are HD and FHD.


Figure 5.9: Speedup of H.264 parallel execution of the deblocking filter.

The second column lists the total number of macroblocks per frame for each video resolution. The third column displays the average of the maximum number of parallel macroblocks over the four video benchmarks listed in table 4 of [38]. The fourth column shows the total number of macroblocks per frame that can be processed in parallel using our parallel motion compensation algorithm. Finally, the last column is the difference in the level of parallel macroblock scalability between the two approaches. A difference between 22% and 36% is calculated in favor of our approach. In addition, all parallel macroblocks using our approach are in the same frame, whereas in the 3D-Wave approach [38] the parallel macroblocks come from several frames that are processed concurrently. We note that the numbers in table 5.3 are maximum values which, in practice, cannot be effectively executed in parallel using today's manycore systems. We choose to group the parallel macroblocks into groups of rows depending on the number of available cores in a multicore architecture.


Figure 5.10: Overall H.264 decoding stages with parallel algorithms.

5.4.5 Results for Parallel Deblocking Filter

Similar to the motion compensation experiments, we gather statistics for our parallel implementation of the deblocking filter using the wavefront algorithm. For the deblocking filter, the wavefront algorithm is the best known parallel algorithm that satisfies the dependency constraints of this stage. The same video sequences listed in table 5.2 are used. As described previously, the wavefront algorithm reaches the highest number of independent macroblocks that can be filtered in parallel when the diagonal divides the slice into two almost equal partitions. Parallel deblocking achieves a speedup of 1.44 using 4 threads for the CIF resolution and a speedup of 2.6 using 4 threads for the Full-HD resolution. Figure 5.9 displays the average speedup results for different resolutions and different numbers of threads. As mentioned earlier, the scalability of the wavefront algorithm is not as high as that of our parallel decoding algorithm for the motion compensation and intra-prediction stages. In addition, the workload of each core using the wavefront algorithm is only one macroblock, whereas the workload of the motion compensation algorithm is composed of many macroblocks, depending on the number of available cores. A smaller workload also adds more synchronization overhead. Thus, the speedups of the parallel deblocking filter are lower than the motion compensation speedups displayed in the previous subsection.
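A sketch of the wavefront traversal: in the deblocking filter, macroblock (x, y) depends on its left and top neighbours, so all macroblocks on one anti-diagonal x + y = d can be filtered concurrently, with a barrier between diagonals. The OpenMP pragma and filter_mb() below are illustrative stand-ins, not the thesis's actual thread code:

    extern void filter_mb(int x, int y);     /* hypothetical per-MB filter */

    /* mb_w and mb_h are the frame dimensions in macroblocks; the number of
       independent macroblocks peaks when the diagonal cuts the slice in half. */
    void deblock_wavefront(int mb_w, int mb_h)
    {
        for (int d = 0; d < mb_w + mb_h - 1; d++) {
            #pragma omp parallel for         /* illustrative parallel construct */
            for (int y = 0; y < mb_h; y++) {
                int x = d - y;
                if (x >= 0 && x < mb_w)
                    filter_mb(x, y);
            }
            /* implicit barrier: diagonal d completes before d + 1 starts */
        }
    }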

5.4.6 Results for Overall Execution

Our main goal is to optimize all the stages of the H.264 decoder. We apply parallel techniques to the motion compensation and deblocking filter stages. The entropy decoder stage, on the other hand, is inherently sequential; parallel techniques for the entropy decoder are therefore very hard, and sometimes impossible, to apply due to its specification requirements.


Table 5.4: Overall speedup of video sequences executed with multiple threads on multicore processors.

    Seq/Threads        2     4     6     8    12    16
    CIF-Bus         0.94  1.40  1.34  1.23  1.12  1.09
    CIF-Foreman     0.85  1.30  1.35  1.10  1.13  1.01
    CIF-Waterfall   0.82  1.58  1.40  1.27  1.12  1.08
    WVGA-John.      1.06  2.15  1.74  1.65  1.26  1.05
    WVGA-Bask.      1.13  1.92  1.68  1.62  1.35  1.14
    WVGA-Cact.      1.07  1.81  1.62  1.56  1.29  1.10
    HD-Johnny       1.27  2.42  1.91  1.93  1.60  1.36
    HD-Basket       1.28  2.14  1.81  1.81  1.58  1.39
    HD-Cactus       1.26  2.09  1.78  1.78  1.55  1.34
    FHD-Basket      1.40  2.26  1.93  1.89  1.68  1.52
    FHD-Cactus      1.42  2.28  1.93  1.86  1.68  1.51
    FHD-Terrace     1.44  2.33  1.97  1.93  1.72  1.53

Figure 5.10 depicts the stages of the H.264 decoder that use parallel algorithms. We collect execution time and energy consumption statistics for the proposed H.264 parallel implementation. The fractions of the different stages vary among video sequences. As a result, the overall performance is computed as a weighted average of all speedups based on the average percentage of each phase. Figure 5.11 illustrates the overall speedups attained for the complete execution of the decoder with the described optimization techniques. Total speedups of 1.4, 2.0, 2.2, and 2.3 are reached using 4 threads on 4 cores for the CIF, WVGA, HD, and FHD resolutions respectively. The detailed results for every video sequence are listed in table 5.4. The sequential execution of the entropy decoding stage, which accounts for about 14-19% of the overall decoding, significantly scales down the overall speedup. This stage could be improved by implementing a hardware version of the entropy decoder. FHD resolutions have the highest speedup because of their large frame sizes. All maximum speedups are attained using 4 threads on 4 cores, mainly due to the absence of context switching when each thread is mapped to one core. Using more than 4 threads requires the operating system to assign more than one thread to a core, causing context switching and, as a result, adding more overhead and stall time to the overall execution.


Figure 5.11: Total speedup for the complete decoding process on a multicore processor.
Only the CIF video sequences have speedups below 2 when 4 threads are mapped onto 4 cores; for the remaining resolutions, the ratio of speedup to the number of cores is around 0.6. This leads us to conclude that high resolutions benefit more from multicore processors than lower resolutions, so Full-HD resolutions show the best speedup with a higher number of cores. The 4K resolution has recently appeared in high-end TVs and movie theaters. Energy measurements for the complete execution are displayed in figure 5.12. The best energy saving corresponds to the FHD resolutions using 4 threads, reaching 63%. These results are also measured over the complete execution of the optimized decoder. With 12 and 16 threads, energy consumption increases compared to sequential execution. Thus, we conclude that energy saving does not scale linearly with the number of threads or cores.
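The cap imposed by the sequential entropy decoding can be illustrated with Amdahl's law. Taking a sequential fraction f = 0.15, in the middle of the 14-19% range quoted above, gives an upper bound that the measured FHD speedup of 2.3 approaches once synchronization and data transfer overheads are accounted for (a back-of-the-envelope sketch, not a measured result):

    S(p) = \frac{1}{f + (1 - f)/p}, \qquad S(4) = \frac{1}{0.15 + 0.85/4} \approx 2.8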


Figure 5.12: Total energy saving for the complete decoding process on a multicore processor.

Figure 5.13: Speedup of H.264 parallel execution using the Multi2Sim simulator.


5.4.7 Simulated Execution

As a complementary step in evaluating our parallel H.264 algorithm, we execute our implementation on the multicore simulator Multi2Sim [66]. Figure 5.13 shows the speedup of our parallel H.264 implementation on 2, 4, 8, 16, and 32 cores. HD and FHD resolutions are used with the Baseline and the Main profiles. These results are the average over the three video sequences listed in table 5.2. On 2 cores, the speedup is 1.7 for the Baseline profile and 1.5 for the Main profile. The speedup increases with the number of cores; however, the increase is not linear. Using 32 cores, the speedup reaches 5.2 for the Baseline profile and 3.2 for the Main profile. The difference between the two profiles becomes more significant as the number of cores increases. The time needed for motion compensation and deblocking filtering in the Baseline profile is higher than in the Main profile. The entropy decoding execution time is lower for the Baseline profile than for the Main profile, mainly because of the CABAC entropy decoding algorithm used in the Main profile. CABAC achieves better compression at the expense of more complexity. Thus, our parallel method is better exploited with the Baseline profile, where the entropy decoder, which is executed sequentially, has less impact on the overall speedup. The parallel scalability of our H.264 decoder is significantly affected by data communication between cores. The results shown in figure 5.13 for 8 cores and more are inefficient compared to the theoretical speedup. The ratio of speedup to the number of cores is 0.85 on 2 cores and 0.65 on 4 cores. For higher numbers of cores, the ratio is below 0.5, which is considered inefficient and not worth parallel execution. Manycore processors with 16 or more cores require a different memory architecture from dual- and quad-core processors. Thus, parallel algorithms like our H.264 parallel decoder should be adapted to benefit from manycore processors and to minimize the data communication overhead imposed by a large number of parallel cores.

5.4.8 Theoretical Speedup

Figure 5.14 shows the theoretical speedup that can be reached for the overall execution of the H.264 parallel decoder. The differences from the simulated execution results displayed in figure 5.13 are relatively small up to 8 cores. For 16 cores, the speedup of our parallel H.264 algorithm should be around 12.


Figure 5.14: Theoretical speedup of H.264 parallel execution.

The speedups keep increasing up to 64 cores for HD resolutions and 128 cores for FHD resolutions. This threshold appears when the number of cores exceeds the number of macroblock rows. However, using our algorithm for parallel motion compensation, the granularity can be made smaller so that additional cores can be exploited. If the number of cores approaches the number of parallel macroblocks listed in table 5.3, the speedup would become much higher. On real manycore architectures, however, this speedup comes with a huge memory communication overhead that reduces it dramatically. New parallel processing architectures should be used for such high levels of parallelism. This issue is still a major bottleneck in the computing industry. In our research, we also aim to explore and experiment with new parallel architectures in order to show the full benefits of parallel computing.


Figure 5.15: Architecture of the graphics processor AMD Radeon HD 6850.

5.5 Parallel Execution on Graphics Processor

In this section, we evaluate our parallel algorithm for motion compensation on a general-purpose graphics processing unit (GPGPU). Brief overviews of GPUs and OpenCL are provided. Experimental results are then listed and analyzed.

5.5.1 General-Purpose Graphics Processing Unit

Originally, a GPU was a specialized hardware unit limited to rendering graphics on screen. Modern GPUs are massively parallel processors, special types of stream computing processors or SIMD machines, as explained in chapter 2. These parallel processors can compute a large number of values concurrently. Early generations of GPUs had a fixed pipeline with limited programming capabilities. Nowadays, modern GPUs enable general-purpose programming through C-like languages such as nVidia CUDA [44] and OpenCL by the Khronos Group [30]. Hence, many computation-heavy algorithms that are not related to graphics processing can now be executed on GPUs, which is known as general-purpose computing on graphics processing units (GPGPU). GPGPUs have much less control logic, freeing up more die space for arithmetic logic units (ALUs). This low architectural complexity gives GPUs more calculation capability at the cost of programming complexity. So, in order to reach good performance, the programmer must explicitly design the application for the target GPU. Figure 5.15 illustrates the architecture of the AMD Radeon HD 6850 GPGPU.


Table 5.5: Specifications for AMD Radeon HD 6850 (Barts PRO).

    Description        Value
    Engine Speed       775 MHz
    Compute Units      12
    Stream Cores       192
    L1 Cache Size      8 kB
    L2 Cache Size      512 kB
    Bus Width          256 bits
    Frame Buffer       1 GB

Figure 5.16: OpenCL parallel model.

This particular GPGPU has 12 compute units, and each compute unit has 16 stream cores. A compute unit is similar to a core with a private cache in a multicore processor. Stream cores can be viewed as hardware lightweight threads that share a local memory. Table 5.5 lists some of the specifications of the AMD GPGPU. We use this hardware device to evaluate our parallel algorithm on modern massively parallel processors.

5.5.2 OpenCL C Programming Language

The OpenCL C programming language [30] is used to create programs that describe data-parallel kernels and tasks that can be executed on one or more heterogeneous devices such as CPUs, GPUs, and other processors referred to as

accelerators, such as DSPs and the Cell Broadband Engine processor. An OpenCL program is formed of two parts: the host code that executes on the CPU and the kernel code that executes on the GPU. Applications cannot call an OpenCL kernel directly; instead, they queue the execution of the kernel on a command-queue created for a device. The kernel is executed asynchronously with the application code running on the host CPU. OpenCL C is based on the ISO/IEC 9899:1999 C language specification (referred to as C99) [25], with some restrictions and specific extensions to the language for parallelism. Figure 5.16 illustrates the parallel model from the OpenCL perspective. Work-groups correspond to the compute units in AMD GPUs. Work-items, which execute OpenCL kernel functions, correspond to the stream cores.
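A minimal host-side sketch of this flow (error handling omitted; the kernel name mc_kernel, its source string, and the buffer sizes are hypothetical placeholders, not the thesis's actual code):

    #include <CL/cl.h>

    extern const char *src;          /* OpenCL C source containing mc_kernel */
    extern size_t frame_bytes;       /* size of one decoded frame buffer */

    void run_mc_on_gpu(void)
    {
        cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
        cl_device_id   dev;   clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context       ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q   = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "mc_kernel", NULL);

        cl_mem frame = clCreateBuffer(ctx, CL_MEM_READ_WRITE, frame_bytes, NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &frame);
        /* ... remaining kernel arguments would be set the same way ... */

        /* 12 work-groups of 16 work-items: one group per compute unit and one
           work-item per stream core of the Radeon HD 6850 */
        size_t global = 12 * 16, local = 16;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
        clFinish(q);    /* the kernel ran asynchronously until this point */
    }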

5.5.3 Experimental Results

In order to run our parallel algorithm on graphics processors, we write the code for parallel motion compensation of macroblocks following the OpenCL C language specification. The GPU kernel processes independent macroblock data in parallel. The frame is divided into 12 groups of macroblocks to be executed by the compute units; each compute unit then processes 16 macroblocks concurrently. The source code is compiled and debugged using the AMD Accelerated Parallel Processing (APP) SDK [3]. Figure 5.17 shows the part of the H.264 decoding process that is executed in the GPU kernel (the shaded rectangle). The remaining stages are executed on the host CPU. Figure 5.18 shows the speedups attained with parallel motion compensation on the AMD Radeon HD 6850 GPGPU device. HD resolutions have a speedup of 12.1 and CIF resolutions a speedup of 7.4. These results exclude the data transfer time between the main processor and the graphics processor. In fact, data transfer overhead is still an important limitation in the usage of GPGPUs, especially when the level of parallelism is low. In our case, the ratio of the speedup to the number of work-groups is around 0.75, which is considered efficient. Speedup results for CIF and HD parallel motion compensation on the GPU are listed in table 5.6.
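The device side might look like the following OpenCL C sketch. The kernel name, buffer layout, and the one-motion-vector-per-macroblock simplification are illustrative assumptions, and sub-pixel interpolation and edge clamping are omitted; this is not the exact thesis kernel:

    /* Each work-item performs motion compensation for one independent 16x16
       luma macroblock: it copies the reference block displaced by the MB's
       motion vector into the frame being reconstructed. */
    __kernel void mc_kernel(__global const uchar *ref,   /* reference frame   */
                            __global uchar *cur,         /* frame being built */
                            __global const short2 *mv,   /* one MV per MB     */
                            int mb_w)                    /* frame width in MBs */
    {
        int mb     = get_global_id(0);
        int mb_x   = (mb % mb_w) * 16;
        int mb_y   = (mb / mb_w) * 16;
        int stride = mb_w * 16;
        short2 v   = mv[mb];

        for (int y = 0; y < 16; y++)
            for (int x = 0; x < 16; x++)
                cur[(mb_y + y) * stride + mb_x + x] =
                    ref[(mb_y + y + v.y) * stride + mb_x + x + v.x];
    }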


Figure 5.17: Complete H.264 decoding stages with parallel motion compensation on graphics processors (GPU).

Figure 5.18: Speedup of H.264 parallel execution of motion compensation on graphics processor using CIF and HD video sequences.

Table 5.6: Speedup of H.264 parallel execution of motion compensation on graphics processor (columns: number of work-groups).

    Resolution         MB Rows       2       4       8      16
    HD (1280 x 720)         45   1.983   3.884   7.301  12.095
    CIF (352 x 288)         18   1.928   3.560   5.839   7.417

Graphics processors have high potential for parallel optimization. The number of stream cores is increasing significantly in new devices. These large numbers of parallel cores, combined with fast transfer rates to and from the CPU, have a huge impact on applications with highly parallel data processing, such as compression algorithms, video games, and rendering.


5.6 Conclusion

We have introduced a novel parallel technique for the H.264 video decoding standard. Our approach decodes groups of macroblock rows in parallel with an algorithm that detects dependencies on-the-fly based on isolating intra-prediction macroblocks (I-MBs). Low and high definition video sequences are used in our experiments. The most efficient speedup of the parallel motion compensation implementation, with the highest ratio to the number of cores, is 3.3 using 4 threads on 4 cores. A parallel macroblock-based implementation of the deblocking filter is also provided. An overall speedup of 2.3 is attained for the complete H.264 parallel implementation. Our optimized decoder is tested on a real device with a 4-core ARM Cortex-A9 processor. The proposed parallel algorithm is also tested on a multicore simulator in order to explore the scalability of our algorithm on multiprocessors with up to 32 cores. Additional experiments performed on a graphics processor show great enhancement, with speedups up to 12.1, and high scalability of the proposed parallel motion compensation algorithm.

Chapter 6

Parallel Cache Efficiency

6.1 Introduction

Parallel execution of multi-threaded applications has a great impact on cache memories. We believe that part of the cache misses is due to the distribution of macroblock data over different cores (data parallelism). Macroblock data samples (YCbCr) are stored in a contiguous memory area. When decoding takes place on a single core, prefetching of a single cache line (L1, 64 bytes) works very well; however, the wrong data are prefetched when different cores decode different macroblocks located in separate memory regions. Another cause of these misses is the processing of the same data on different cores in different decoding phases. These additional data dependencies are not present in sequential implementations because only one core executes the complete sequence of instructions in a program. In this chapter, we show the level 1 data cache bottlenecks of two parallel motion compensation algorithms for the H.264 decoder. Results are compared, showing the difference between the row-based and the wavefront parallel processing algorithms. We also present customized software prefetching methodologies for both parallel algorithms. Moreover, we show the impact of data prefetching on the speedup of these parallel applications.


Figure 6.1: Multicore architecture of the ARM Cortex-A9 multiprocessor with an L2 cache shared by 4 cores, each core having a private L1 cache.

6.2 Parallel Environment

6.2.1 Processor Architecture

In order to collect cache statistics for a multicore architecture, we use the Multi2Sim simulator [66]. The processor and memory configurations follow the ARM Cortex-A9 4-core processor specification [6]. The processor is composed of 4 cores that all have access to a shared memory, the L2 cache, which has 2048 sets with an associativity of 8 and a block size of 64 bytes. The private L1 cache of each core has a geometry composed of 128 sets of blocks of 64 bytes each. The block replacement policy is the Least Recently Used (LRU) algorithm. Figure 6.1 illustrates the basic components of the ARM Cortex-A9 architecture.
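Multi2Sim reads the memory hierarchy from an INI-style configuration file; the sketch below shows how the geometries above might be written. The section and key names follow Multi2Sim's memory configuration format as we recall it, and the L1 associativity and the latencies are assumptions (the text above does not fix the former; the latter are taken from the load latencies quoted in section 6.4.2), so the file should be checked against the simulator version in use:

    ; L1: 128 sets x 64-byte blocks, LRU (associativity assumed 4-way,
    ; matching the Cortex-A9)
    [CacheGeometry geo-l1]
    Sets = 128
    Assoc = 4
    BlockSize = 64
    Latency = 5
    Policy = LRU

    ; L2 shared by the 4 cores: 2048 sets, 8-way, 64-byte blocks
    [CacheGeometry geo-l2]
    Sets = 2048
    Assoc = 8
    BlockSize = 64
    Latency = 30
    Policy = LRU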

6.2.2 Parallel Algorithms

In the experiments described and evaluated in this chapter, we use the row-based and the wavefront parallel processing of macroblocks for the motion compensation stage of the H.264 decoder.


Figure 6.2: Total L1 cache misses for sequential, row-based and wavefront parallel implementations.
For the row-based algorithm, groups of rows of macroblocks are processed in parallel, excluding the intra-prediction macroblocks (I-MBs), during the motion compensation stage. These I-MBs are then processed sequentially at the end of the motion compensation stage. For the wavefront algorithm, macroblocks are processed along diagonals, where independent macroblocks are processed concurrently. Both the row-based and the wavefront algorithms are explained in more detail in chapter 5, section 5.3.

6.3 Multicore Cache Memory

6.3.1 L1 Cache Misses Statistics

Using the configuration described above for the Multi2Sim simulator [66], we evaluate the row-based and the wavefront parallel algorithms using three video sequence benchmarks: waterfall (352 x 288), four people (854 x 480), and shields (1280 x 720). We show in figure 6.2 the factor of increase in the total number of L1 cache misses when executing the parallel implementations for each video sequence, compared with the sequential execution. The total number of misses for both parallel versions is much higher than for the original sequential version. The row-based version has about 1.5 times the number of misses of the sequential version. For the wavefront implementation, the number of misses reaches 3.2 times the number of misses of the sequential execution.


Figure 6.3: L1 cache misses per core for row-based and wavefront parallel implementations of the shields video sequence.
The increase factors for the total misses displayed in figure 6.2 are considered high. The sequential implementation is executed on only one core, whereas the two parallel versions are executed on a quad-core processor. The total number of misses increases mainly due to data dependencies between cores during the decoding process: different cores may require access to the data of a macroblock that was processed by another core. These dependencies are higher for the wavefront parallel version, where nearby macroblocks are processed on different cores. As for the row-based version, several rows of macroblocks are processed on the same core, which decreases the data dependencies between cores.

6.3.2 Common L1 Cache Misses among Cores

In this study, we show detailed statistics for L1 cache misses and their distribution among the cores of the parallel processor. Figure 6.3 illustrates the level 1 cache misses for each of the 4 cores of the simulated processor. Similarly to figure 6.2, the results in figure 6.3 show that the number of misses per core for the wavefront parallel version is almost double the number of misses of the row-based implementation for HD resolution.


Figure 6.4: Common L1 cache misses between each pair of cores for row-based and wavefront parallel algorithms.
One exception is the first core, where the number of misses is only about 20% lower than for the wavefront algorithm. Moreover, the total number of misses for the first core is higher than for the other cores. The motion compensation stage always starts with the first core for both parallel versions. Furthermore, in the wavefront parallel version, the first core processes the largest number of macroblocks, because at the beginning and at the end of the slice the number of macroblocks that can be processed in parallel is low; the maximum number of independent macroblocks that can be processed in parallel is reached when the diagonal divides the frame into two almost equal partitions. As for the row-based version, the rows of macroblocks are divided equally among the available cores. However, the remaining rows of macroblocks at the end of the frame are assigned to the first core. For these reasons, the first core has a higher number of misses, while the numbers of misses in the remaining cores are almost equal within the same parallel implementation. Figure 6.4 shows the number of common misses among the L1 caches of each core. These misses are counted when one core accesses the same data that was already used by another core. The common misses number around 8.4 million for the wavefront version and 4.1 million for the row-based version; the difference between the two versions is almost a factor of two. These statistics give the number of common misses between each pair of cores.


Figure 6.5: Percentage distribution, by cycle difference, of common L1 cache misses between each pair of cores for the row-based parallel implementation.

Figure 6.6: Percentage distribution, by cycle difference, of common L1 cache misses between each pair of cores for the wavefront parallel implementation.

These common misses are almost equal for every pair of cores within the same parallel version, and they are close to the total number of misses of each pair. These results mainly show that, in both implementations, most misses are the same for all cores. However, these misses may not occur within short time differences. Figures 6.5 and 6.6 show the percentages of the level 1 common misses among cores as a function of the cycle difference. As we notice, the percentage of misses drops significantly when the difference is below 100,000 cycles. These common misses are almost negligible below 100 cycles. In general, the numbers for the wavefront version are much lower than for the row-based version, which basically means that the wavefront algorithm has higher dependencies between cores, as the cycle offset between the common misses is much higher than for the row-based parallel algorithm. The statistics shown in figures 6.5 and 6.6 reveal the potential of applying cache optimization techniques such as prefetching. Based on these statistics, prefetching can significantly decrease the misses that are common between cores, which represent a high percentage of the total number of misses.

6.3.3 Parallel Cache Efficiency

The cache miss statistics displayed above reveal several issues related to shared memory architectures in multicore systems. First, the number of misses that are common among different cores is relatively high: a percentage of 25% or 30% can significantly affect the overall performance of the parallel application. Second, the time difference between common misses in the L1 caches of parallel cores is large. Most common misses occur more than 100K cycles apart, eliminating time correlation. With such a large cycle difference, a typical hardware prefetcher will assume the data is no longer needed, allowing it to be replaced. Therefore, we cannot apply aggressive hardware prefetching because of cache thrashing: keeping prefetched data for a long time would make the cache inefficient, as part of the cache would be blocked with data that will be requested only after a long period. Third, any prefetching algorithm that can increase the performance of a parallel algorithm will be suitable only for this specific algorithm.


Figure 6.7: Prefetching algorithm for the two parallel motion compensation techniques.
In our case, different software prefetching algorithms need to be implemented for each of the row-based and wavefront parallel algorithms. For this reason, hardware optimization is not applicable, as it would only be suitable for one specific application. Thus, a software optimization technique is proposed for each algorithm: the software programmer implements hardware-dependent source code targeting parallel architectures at compile time [21, 32]. In the following section, we describe our software prefetching algorithms for both the row-based and the wavefront parallel applications. Their performance efficiency is also tested and analyzed.

6.4 Cache Optimization

6.4.1 Prefetching Algorithm

For any video resolution, the size of each macroblock, which is 16 x 16 pixels in the H.264 standard, is fixed. So the approximate number of cycles needed to process each macroblock can be calculated in advance. In our experiments, the average number of cycles that a macroblock requires for motion compensation is about 44 thousand cycles. Figures 6.5 and 6.6 on page 109 show the fraction or the number of macroblocks that is equivalent to the number of cycles. Macroblocks in both parallel algorithms of the H.264 decoder are processed in parallel, and their dependencies on other macroblocks are known in advance.

A smart software prefetching algorithm loads the data of the specific macroblocks that will be used during the decoding stages. In our example, we load the data of 23 macroblocks in order to get the most out of data prefetching. So, a software prefetching algorithm is implemented in order to load the data of the macroblocks that will be accessed during motion compensation. As shown in our cache statistics, most common cache misses lie within 1 million cycles of each other; hence, there is no need to load the data of more than 23 macroblocks. For the row-based and the wavefront algorithms, the next macroblocks that will be processed in parallel are prefetched into the cache. Figure 6.7 shows which macroblocks are prefetched for the row-based and the wavefront algorithms. In this figure, a small frame of a video is depicted. The squares in dark gray are the macroblocks being prefetched. The squares in light gray are macroblocks that have already been processed or are currently being processed. The example in figure 6.7 shows that data continuity exists only within the same row in the row-based and wavefront algorithms. The segmentation of the data in the memory access pattern, caused by its distribution over different cores, misleads the built-in hardware prefetcher. Any enhancement should address this problem by providing a global view of the data that needs to be prefetched by the cache memories of the different cores. Therefore, our proposed software prefetching algorithms are only applicable to the parallel algorithms used in our experiments. A software prefetching algorithm specific to MPEG-4 was proposed by Cucchiara et al. [16]. On the other hand, Wang et al. [69] implemented private caches with dynamic reconfiguration, where the shared cache is partitioned for multicore systems with real-time tasks. In addition, Nesbit et al. [41] proposed a FIFO history buffer which can improve the accuracy of correlation prefetching by eliminating stale data. In the following section, we show the performance of our proposed prefetching technique.
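A sketch of how such a prefetch can be issued from the decoder source using the GCC builtin. The accessor mb_addr() and the buffer layout are hypothetical; the builtin's third argument hints at low temporal locality so prefetched lines can be evicted sooner:

    #define PREFETCH_MBS 23                 /* ~23 MBs x ~44k cycles ~= 1M cycles ahead */
    #define MB_BYTES     (16 * 16 * 3 / 2)  /* 4:2:0 samples of one macroblock */
    #define LINE         64                 /* L1 cache line size in bytes */

    extern const unsigned char *mb_addr(int mb);  /* hypothetical accessor */

    static void prefetch_next_mbs(int next_mb, int total_mbs)
    {
        int end = next_mb + PREFETCH_MBS;
        if (end > total_mbs)
            end = total_mbs;
        for (int mb = next_mb; mb < end; mb++) {
            const unsigned char *p = mb_addr(mb);
            for (int off = 0; off < MB_BYTES; off += LINE)
                __builtin_prefetch(p + off, 0, 1);  /* read access, low locality */
        }
    }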


Figure 6.8: Speedup percentage for the row-based and wavefront parallel implementations depending on the success rate of data prefetching.

6.4.2 Performance Efficiency

Using our proposed data prefetching algorithm for H.264 parallel motion compensation, we minimize the data dependencies and memory stalls between the local private cache blocks of each core in a multicore processor. This decrease in data dependencies eventually decreases the number of cache misses, leading to better performance. A bigger buffer has no impact on performance; however, a smaller buffer reduces the speedup achieved by data prefetching. We illustrate in figure 6.8 the impact of data prefetching on the row-based and wavefront parallel implementations. The average latency of a load instruction is 5 cycles for the L1 cache and 30 cycles for the L2 cache. Figure 6.8 displays the projected percentage of the number of cycles that will be eliminated by data prefetching. The results also depend on the success rate of data prefetching. For example, with a 60% success rate, the total number of cycles needed to complete the program decreases by 21% for the wavefront parallel implementation and by 13% for the row-based parallel implementation. In the following section, we discuss the overall performance in terms of cycles and instructions.
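A first-order model of these projections (a sketch under the latencies above, not necessarily the exact formula used to produce figure 6.8): if a fraction s of the N common misses is successfully prefetched, each converted miss saves roughly the difference between an L2 access and an L1 hit, so the eliminated share of the total cycle count C is approximately

    \Delta \approx \frac{s \cdot N \cdot (t_{L2} - t_{L1})}{C} = \frac{s \cdot N \cdot 25}{C}

The wavefront version benefits more because its N is about twice that of the row-based version, matching the 21% versus 13% figures at s = 0.6.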


Figure 6.9: Total number of cycles for row-based and wavefront parallel implementations.

6.5 Instructions and Cycles Statistics

Having displayed statistics related to misses, we now focus on the number of cycles and the average instructions per cycle in order to compare the overall execution of the parallel implementations. Figure 6.9 illustrates the total number of cycles for the sequential and the parallel versions. We notice that the number of cycles for the row-based and the wavefront parallel implementations is almost 1.7 and 1.5 times lower, respectively, than for the sequential execution. In addition, the average number of instructions per cycle is shown in figure 6.10: the row-based algorithm reaches almost 3.0, the wavefront version about 2.3, and the sequential version 1.5. These numbers show that an average speedup close to 1.6 is attained by the two parallel algorithms in comparison with the sequential implementation. In addition, these speedup results show that the data prefetch algorithm discussed above achieves close to 60% of the maximum speedup that can be reached. These overall values include overheads in terms of additional cycles, data transfer, shared memory, etc. After gathering all the above statistics, we conclude that the row-based parallel algorithm of the H.264 decoder is efficient in terms of parallel execution among cores. It has a low number of data dependencies between cores, which makes the execution of each core as independent of the others as possible. This low dependency is also very important because it minimizes memory bottlenecks and the need to wait for the memory coherency protocols between the private caches of multiple cores.


Figure 6.10: Average Instructions per Cycle (IPC) for sequential, row-based and wavefront parallel implementations.
On the other hand, the wavefront algorithm requires enhancement in order to become more efficient. As shown in our results, minimizing the number of common misses between cores has a huge effect on the overall speedup of the application and on its scalability with the number of parallel cores.

6.6 Conclusion

In this chapter, we presented detailed memory statistics of two parallel algorithms for the H.264 motion compensation stage. The experiments are conducted on a multicore simulator in order to collect and analyze level 1 cache memory misses. Common misses among cores to the same address are also collected and discussed. The impact of the cache memory on the overall execution is also evaluated. Furthermore, a software prefetching algorithm is proposed for the row-based and wavefront parallel algorithms. The impact of prefetching is also calculated for both parallel algorithms.

Chapter 7

Conclusion

H.264 is widely adopted in multimedia applications on general-purpose and embedded systems. The high complexity of the H.264 decoder requires enhancements in order to increase efficiency and lower power consumption. We proposed and evaluated a parallel algorithm for the H.264 decoder. Luma and chroma color components for the motion compensation stage are processed in parallel, providing high and realistic potential for video decoding on dual- and quad-core processors. The execution time speedup of our parallel implementation of the H.264 decoder is around 18%. Moreover, the speedup reaches 32% with our proposed pipelined implementation, with an energy saving of 24%. Furthermore, an advanced parallel algorithm is proposed that executes motion compensation on a large number of parallel cores. This parallel approach is based on processing groups of independent macroblock rows in parallel. The proposed parallel algorithm shows higher scalability than the color components approach; thus, good speedup and energy savings are reached on multicore processors with more than 8 cores. In this approach, groups of macroblock rows are decoded in parallel with an algorithm that detects dependencies on-the-fly based on isolating intra-prediction macroblocks (I-MBs). Low and high definition video sequences are used in our experiments. The most efficient speedup of the parallel motion compensation implementation, with the highest ratio to the number of cores, is 3.3 using 4 threads on 4 cores. A parallel macroblock-based implementation of the deblocking filter is also provided. An overall speedup of 2.3 is attained for the complete H.264 parallel implementation.

Our optimized decoder is tested on a real device with a 4-core ARM Cortex-A9 processor. The proposed parallel algorithm is also tested on a multicore simulator in order to explore the scalability of our algorithm on multiprocessors with up to 32 cores. Additional experiments performed on a graphics processor show great enhancement, with speedups up to 12.1, and high scalability of the proposed parallel motion compensation algorithm. In addition, we evaluated and discussed the impact of cache misses on the overall performance of the row-based and wavefront parallel algorithms of the H.264 decoder. Detailed memory statistics of the two parallel algorithms for the motion compensation stage are presented and analyzed. The experiments are conducted on a multicore simulator collecting level 1 cache memory misses. Common misses among cores to the same address are also collected and discussed. Furthermore, customized software prefetching algorithms are proposed for the two parallel algorithms. The impact of prefetching is also calculated for both parallel algorithms. The work in this research presents solutions to the high complexity of the H.264 decoder using parallel computing. These algorithms can be applied to most block-based video compression standards. Further experiments and enhancements need to be performed in order to apply our parallel algorithms in commercial products. Our intention is to continue this research in order to increase the efficiency and lower the energy consumption of recent video coding standards.

Chapter 8

Summary in French

8.1 Introduction

8.1.1 Context

Today, mobile devices that support multimedia applications are ubiquitous in our modern world. Most handheld devices are equipped with high-resolution screens and embedded multicore processors. Dual- and quad-core processors are found in smartphones and tablets such as the devices offered by Samsung and Apple [5, 53]. The ARM Cortex-A9 processor can have up to 4 cores per chip [6], and the Cortex-A15 processor up to 8 cores per chip [7]. Nevertheless, applications do not automatically benefit from these powerful high-end processors. Even with the newest processors, video resolutions are growing rapidly, which requires more processing time and, consequently, more energy consumption. Operating systems simply assign independent applications, or the threads of a single application, to different cores. For this reason, a single application can only benefit from the additional resources if it is designed for parallel execution. Thus, sequential applications must be modified and recompiled in order to support parallelism. The parallelization process faces many challenges, such as dependencies, synchronization, data coherence, etc. Video players, digital cameras, televisions, and telephones use complex video codecs for high resolutions.

However, few multimedia applications exploit the parallel computing potential offered by embedded multicore processors. Recent video codecs, such as H.264/AVC [26] and HEVC [63], have adopted complex algorithms in order to optimize compression and reduce transmission bit rates. The additional complexity of these algorithms has a negative impact on execution time and energy consumption.

8.1.2 Problem Statement

H.264/AVC [26] is a powerful video coding standard with complex algorithms. The codec achieves good compression but slows down performance. Even with the newest processors, video resolutions are growing rapidly, which requires more processing time and consequently more energy consumption. One of the best strategies for optimizing time and energy is to execute an application on multiple cores in parallel. Converting or modifying an application for parallel execution presents many challenges, such as dependencies, synchronization, data coherence, shared memory, etc. In our research, we use the H.264 video decoder as a complex multimedia application to which parallelism is applied. We address the high complexity of the H.264 decoder through parallel execution on embedded multicore processors in order to reduce execution time and energy consumption.

8.1.3 Existing Solutions

Many parallel implementations of the H.264 decoder exist, ranging from parallel decoding of macroblocks (fine grain) to parallel decoding of groups of pictures (coarse grain). A macroblock is a square 16x16-pixel component of a picture in a video sequence; it can also be divided into sub-blocks. Parallel macroblock decoding is highly scalable because of the many independent macroblocks that can be processed in parallel. However, dependencies are created as a result of the memory communication and the execution synchronization between macroblocks.

In addition, parallel decoding of groups of pictures requires a large memory capacity, especially for high-definition video sequences. Moreover, it scales less well than the macroblock approach because of the small number of groups of frames (pictures) that can be decoded in parallel.

8.1.4 Contributions

Our approaches to decoding H.264 videos in parallel at the macroblock level are original and unique. Other techniques are used to reduce the load of the sequential part, such as the entropy decoder. First, we separate the decoding process between the color components for each data sample of each macroblock. Then, a pipeline is applied in order to minimize the stall time caused by the synchronization of the parallel tasks. The parallel implementation is evaluated on a simulator of embedded dual- and quad-core processors. In addition to the execution time and memory usage statistics, energy consumption results are presented using a powerful estimation tool. For our second approach, we process independent rows of macroblocks in parallel using an innovative algorithm that minimizes synchronization without adding extra steps to the decoder. The motion compensation stage is handled using an algorithm over rows of macroblocks. The deblocking filter stage uses an algorithm called wavefront. This level of parallel execution, based on rows of macroblocks, can be considered to lie between coarse-grain and fine-grain parallel approaches, offering a balance between the two solutions. The proposed parallel algorithm is evaluated on a multicore simulator, on multicore platforms, and on graphics processors.

8.1.5 Outline

In section 8.2, we present the concepts of parallel computing in terms of algorithms, memory architectures, and applications. In section 8.3, an overview of the H.264 standard is given. Existing parallel implementations and research related to the H.264 standard are also presented.

Our first parallel H.264 implementation, based on decoding the color components in parallel, is explained and evaluated in section 8.4. The execution time speedup and the energy saving statistics are illustrated. In section 8.5, the second parallel algorithm for the H.264 decoder is described and evaluated. Groups of macroblocks are processed in parallel on different cores. The algorithm is evaluated on real multicore platforms. Simulation results with a high number of parallel cores are also presented and discussed. Finally, section 8.6 summarizes our contribution.

8.2 Parallel Programming

Parallel computing is a form of computation in which tasks are executed simultaneously [2]. Different levels of parallel computing exist: instruction-level parallelism, data-level parallelism, and task-level parallelism. Parallelism was mainly used in high-performance servers and supercomputers. A decade ago, parallel computing emerged as a response to the limits of frequency scaling caused by physical constraints [48]. As the power consumed by computers became an important factor in computing systems, parallel computing became the dominant model in computer architectures, mainly in the form of multicore processors [8]. Several types of parallel computers exist, such as multicore and multiprocessor systems equipped with several processors in a single machine. Clusters and grids use several computers to work on the same task simultaneously. Specialized parallel architectures such as GPUs are also used alongside traditional processors in order to accelerate specific tasks such as graphics computations. Parallel programs are much harder to write than sequential programs [21]. Concurrency typically introduces several classes of software bugs, such as race conditions. Communication and synchronization between parallel tasks are generally the biggest drawbacks that considerably affect performance. Theoretically, the maximum possible speedup of a program as a result of parallelization is given by Amdahl's law [4].


Figure 8.1: Architecture of the ARM Cortex-A9 multiprocessor with 4 cores

8.2.1 Multicore Processors

The most common general-purpose chips are the multicore processors found today in most desktop and laptop computers. A multicore processor is a processor that includes multiple execution units, called cores, on the same chip, and it can issue several instructions per cycle. Each core in a multicore processor can potentially be superscalar, i.e., each core can issue several instructions per cycle from a single instruction stream. Communication between the cores is usually maintained through shared memory access. Multicore processors dominate the consumer market, for personal computers with the Intel Core processor family [24] and for handheld devices with the ARM Cortex processor family [6]. Figure 8.1 illustrates the simplified architecture of the ARM Cortex-A9 multicore processor with four 32-bit cores and a shared 2-level cache.


Figure 8.2: The H.264 decoder process

8.3 The H.264 Standard

The Moving Picture Experts Group (MPEG) and the Video Coding Experts Group (VCEG) jointly developed the Advanced Video Coding (AVC) standard in 2003, published as ITU-T Recommendation H.264 and as Part 10 of MPEG-4 [26]. From the first commercial applications onward, several multimedia device manufacturers adopted the new video codec. Ten years after the first publication of the final version, the H.264 standard is currently the most widely used for video compression in multimedia devices, according to many articles and reviews such as PCWorld.com [23]. Cameras, smartphones, PDAs, video surveillance systems, Blu-ray disc players, and many other devices use H.264 for video encoding and decoding. H.264 achieves better compression and better quality at the cost of more complex algorithms. Thus, more computing resources are used and more energy is consumed as the compression rate of video files increases. This section gives an overview of some of the main features of the standard.


8.3.1 The H.264 Decoder

The decoder process is shown in figure 8.2. The decoder can be divided into five main functional parts: entropy decoding (ED), de-quantization and inverse transform (IQT), motion compensation (MC) and intra-prediction (IP), and the deblocking filter (DF). Figure 8.2 gives a simplified representation of the H.264 decoder stages. Entropy decoding (ED) and motion compensation (MC) are applied to each 16x16-pixel macroblock. The deblocking filter (DF) is executed at the end of the decoding process. The average workload of each stage using the baseline profile is illustrated in figure 8.3. The entropy decoding stage and the de-quantization and inverse transform (IQT) stage are merged into a single stage in the statistics of figure 8.3. The workload of this stage is 14% on average, consumed mainly by the context-adaptive variable length coding (CAVLC) algorithm. The CAVLC algorithm is adopted by the baseline profile and has a lower complexity than the CABAC algorithm. The prediction stage, consisting of intra-prediction and motion compensation, has an important impact on the whole decoding process, 41% on average. Finally, the deblocking filter stage is also a heavy process that consumes about 45% of the whole process. These workload statistics are profiled using several video benchmarks at low and high resolutions.

8.4 Parallel Color Decoding

We propose an approach that processes each color component (luminance and chrominance) on a separate core of a multicore processor in order to increase the overall performance of the H.264 decoder. Our new idea is based on the fact that the H.264 decoder processes the color components of each picture serially; thus, simultaneous processing of the color components is possible thanks to the data independence of luma and chroma. In addition, a pipelined version is designed to improve load balancing and hide the effects of synchronization. Simulations are carried out using video benchmarks on embedded multicore processors. Experiments are conducted on a simulator of dual- and quad-core processors in order to collect execution time and energy consumption statistics [9].


Figure 8.3: Average workloads of the H.264 decoder stages on the ARM Cortex-A9 processor

Figure 8.4: The 4:2:0 video color sampling format

8.4.1 Color Components

The frames (pictures) of a video sequence are represented as bit streams. Pixels are sampled using three color components: YUV (or YCrCb). Y represents a luminance (luma) color sample, which carries the light information. UV (or CrCb) represent the red and blue color samples respectively (chrominance). In a 4:2:0 format,

every four luminance samples share one red sample and one blue sample, as shown in figure 8.4. The 4:2:0 sampling format is the most widely used. Other formats, 4:2:2 or 4:4:4, where more color samples are available, do not show a significant difference to human vision, because vision is more affected by the light than by the colors of video sequences. In our research, we exploit an independence pattern found in the decoding process of the color components. As described above, each picture of a video sequence is represented by YCrCb color samples. A pixel is formed by these three color components. The decoder reconstructs the color data of each picture separately from the luminance and chrominance samples. The color information in each frame, and therefore in each macroblock, is mutually independent. The H.264 standard [26] shows no dependency between the color information data in the decoding algorithm during the motion compensation stage.

8.4.2 Parallel Execution and Synchronization

The H.264 decoding process, like all the main video decoding algorithms, is generally designed to be executed sequentially. The H.264 standard [26] does not support parallelism and therefore does not benefit from the multicore processors available in today's market. Several approaches have been studied in order to parallelize the execution of the decoding process. Most of these approaches are based on slices (a picture is composed of one or a few slices) and on macroblocks (explained in sections 3.4.3.3 and 3.4.3.2 of chapter 3). Similar approaches to parallel H.264 processing are described in detail in section 3.7 of chapter 3. In our research, we modified the H.264 source code in order to decode the luma and chroma components of each macroblock in parallel. One core handles all the stages except the chrominance motion compensation and intra-prediction, which are executed on a second core as shown in figure 8.5.


Figure 8.5: Parallel H.264 decoding of the color components on a dual-core processor
The first core executes the luminance color samples in addition to all the remaining decoding stages. As indicated above, the color components are independent of one another. Therefore, decoding the different color components in parallel is correct in theory as well as in the experiments. Intra-prediction and motion compensation (inter-prediction) must be completed before the deblocking filter is applied. Thus, a synchronization barrier is required before the deblocking filter stage starts. With this configuration, synchronization is performed at the end of the luminance and chrominance decoding. At the end of the entropy decoding stage, the parallel part of the decoding process is triggered. Once intra-prediction and motion compensation are finished, all the parallel tasks wait at the synchronization barrier before starting the deblocking filter stage. When using a dual-core processor, the second core uses the chrominance data in order to decode those color components. The first core decodes the luminance data and executes the remaining sequential algorithms, as illustrated in figure 8.5. The average workload of the second core using the Akiyo and Container videos (CIF and QCIF) is 18%, which is in turn the average performance gain using dual-core processors. Figure 8.6 illustrates the distribution of the workload over 4 cores. The first core reads the data from the NAL units (explained in section 3.4 of chapter 3) and performs the entropy decoding and the transform.


Figure 8.6: Parallel H.264 decoding of the colour components on a quad-core processor

The second core processes the luma data while the third core processes the chroma data; both perform the same work, intra-prediction and motion compensation, on the different colour components at the same time. The fourth core executes the remaining stage, the deblocking filter. The performance gain on quad-core processors is almost the same as on dual-core processors because of the sequential characteristics of the H.264 decoder. Therefore, to benefit from quad-core architectures, a pipelined execution is proposed and discussed in the next part.
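As a minimal sketch of the synchronization described above, the following C fragment places a POSIX barrier before the deblocking filter. The worker functions are hypothetical stand-ins for the decoder's luma and chroma stages, not the actual JM code.

```c
#include <pthread.h>

/* Hypothetical stand-ins for the decoder's per-component stages. */
static void decode_luma(void *frame)       { (void)frame; /* luma MC + intra */ }
static void decode_chroma(void *frame)     { (void)frame; /* chroma MC + intra */ }
static void deblocking_filter(void *frame) { (void)frame; }

static pthread_barrier_t before_deblock;

static void *luma_worker(void *frame)
{
    decode_luma(frame);
    pthread_barrier_wait(&before_deblock);   /* wait for the chroma core */
    return NULL;
}

static void *chroma_worker(void *frame)
{
    decode_chroma(frame);
    pthread_barrier_wait(&before_deblock);   /* wait for the luma core */
    return NULL;
}

void decode_colour_components(void *frame)
{
    pthread_t luma, chroma;
    pthread_barrier_init(&before_deblock, NULL, 2);
    pthread_create(&luma, NULL, luma_worker, frame);
    pthread_create(&chroma, NULL, chroma_worker, frame);
    pthread_join(luma, NULL);
    pthread_join(chroma, NULL);
    pthread_barrier_destroy(&before_deblock);
    deblocking_filter(frame);   /* starts only after both components finish */
}
```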

8.4.3 Pipelined Execution

To minimize the waiting time between the parallel cores, the H.264 decoding stages are executed in pipelined mode on four cores when motion compensation is applied. Theoretically, the execution time can be reduced considerably, down to the time required by the core that runs the largest piece of code. Processor idle time is reduced, making better use of the available resources. The pipeline is illustrated in Figure 8.7.

A shared memory storing the data blocks is used so that the four processors access coherent data. The data variables and their handling in the current H.264 implementation went through substantial modifications and testing to validate the proposed pipelined execution. Figure 8.7 illustrates the best case, in which the four decoding stages are totally independent, allowing parallel execution on the four cores. This case applies when the current frame depends on a previously decoded frame (P-frames); P-frames contain both intra and inter macroblocks. Intra-predicted frames (I-frames) do not allow pipelined decoding because their macroblocks depend on other macroblocks within the same frame; I-frames contain only intra macroblocks (I-MBs). The performance gain depends on the number of I-frames in the coded video, where the first decoded frame is always an I-frame. The following frames coded with the main profiles are mostly P-frames. I-MBs are useful when adjacent frames are quite similar, with only minor motion. For the Akiyo and Container video benchmarks, only the first of 300 frames was decoded using intra-prediction, which means that at least 99% of the decoded frames are P-frames. The maximum load of the pipelined H.264 version on 4 cores is 41%, carried by the fourth core (P4), which performs the deblocking filter process. The maximum performance speedup is therefore limited by the largest load, the deblocking filter.
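As a back-of-the-envelope reading of this 41% figure (ours, not a computation from the experiments), the steady-state speedup of a pipeline is bounded by its slowest stage:

```latex
S_{\text{pipeline}} \le \frac{T_{\text{seq}}}{\max_i T_i} = \frac{1}{0.41} \approx 2.4
```

so even with the other three stages perfectly overlapped, the deblocking core caps the pipelined speedup at roughly 2.4x.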

8.4.4 Results

To demonstrate the feasibility of our approach, we ran experiments on our parallel version of the H.264 reference decoder [61] using the MPARM simulator [31]. Execution statistics for each stage are collected in addition to the overall execution results. The differences between the colour components are similar for the two video sequences and across all stages, as shown in Table 8.1. With 4:2:0 sampling, every group of 4 luma samples is paired with 2 chroma samples.


Figure 8.7: Pipelined H.264 execution on a quad-core processor

Table 8.1: Parallel execution results for luma and chroma

Benchmark        Total (ms)   Luma (%)   Chroma (%)
Akiyo CIF           754234      29.48       20.06
Akiyo QCIF          379928      24.56       15.93
Container CIF       763781      29.45       20.45
Container QCIF      397664      24.55       15.91
Average                         27.01       18.08

Luma processing therefore needs roughly double the time of chroma processing. The luma colour information accounts for 27% of the total execution time on average, and the chroma for 18% on average. In the next section, a more advanced parallel algorithm is proposed for executing motion compensation on a large number of parallel cores. The parallel approach is based on processing groups of independent macroblock rows in parallel. The proposed parallel algorithm shows better scalability than the colour-component approach described in this section; good speedups and energy savings are thus achieved on multicore processors with more than 8 cores.


8.5 Parallel Macroblock Decoding

Many parallel implementations of the H.264 codec exist, ranging from decoding macroblocks in parallel (fine-grained implementations) to decoding groups of frames in parallel (coarse-grained implementations). A macroblock is a 16x16-pixel component of a frame in a video sequence; it can also be divided into smaller sub-blocks. Parallel macroblock decoding is highly scalable because many independent macroblocks can be processed in parallel. However, dependencies and synchronization points arise from the memory communication and execution synchronization between macroblocks. Decoding groups of frames in parallel, on the other hand, requires a large memory capacity, especially for high-definition video sequences, and scales less well than the macroblock level because only a small number of frame groups can be decoded in parallel. In our approach, we process independent macroblock rows in parallel using a new algorithm that eliminates the dependencies between macroblocks and minimizes the synchronization overhead. This level of parallel execution can be considered to lie between the coarse-grained and fine-grained parallel approaches, offering a balance between availability and scalability. The main contribution of this research is the design and implementation of a new parallel algorithm that processes the macroblock rows of the H.264 decoder. In addition, a data-dependency detection algorithm that isolates the intra-predicted macroblocks (I-MBs) is developed. Experiments are conducted by running our scalable parallel decoder on a CUDA development platform [43] with a quad-core ARM Cortex-A9 processor [6]. Real execution-time and energy-consumption statistics are collected by running the application on a real platform. For the HD and Full-HD resolutions, the reference video sequences reached their maximum throughput using 4 threads on 4 cores, with a 3.3x speedup for motion compensation, an overall 2.3x speedup in execution time, and 63% energy savings. In addition, the parallel algorithm has a very high theoretical speedup that is applicable to many-core and vector processors [10, 11].


Figure 8.8: Parallel decoding of macroblock rows

8.5.1 Parallel Motion Compensation

In our research, we modified the H.264 reference implementation, the JM H.264 decoder source code [61], to decode macroblock rows in parallel using the pthread library in the C programming language. One thread is created for each group of macroblock rows, and each thread is mapped to one core. The number of threads is specified by the user or by the application; if it exceeds the number of cores, the scheduler assigns more than one thread to a core. As shown in Figure 8.8, each thread handles the motion compensation stage for one group of macroblock rows. All threads must complete their task before moving on to the next phase, intra-prediction for the I-MBs. The maximum number of blocks that can be decoded in parallel is equal to the number of macroblock rows. This level of parallel decoding can be considered to lie between the coarse-grained and fine-grained approaches. Other approaches process several slices or frames in parallel.
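The following C sketch shows the thread-per-row-group scheme under stated assumptions: motion_compensate_row and the fixed thread count are illustrative placeholders, not the JM decoder's actual interface.

```c
#include <pthread.h>

#define NUM_THREADS 4   /* illustrative; in the decoder this is user-specified */

/* Hypothetical stand-in for the decoder's per-row motion compensation. */
static void motion_compensate_row(void *frame, int mb_row)
{
    (void)frame; (void)mb_row;
}

struct mc_task {
    void *frame;
    int first_row, last_row;   /* this thread's group of macroblock rows */
};

static void *mc_worker(void *arg)
{
    struct mc_task *t = arg;
    for (int r = t->first_row; r <= t->last_row; r++)
        motion_compensate_row(t->frame, r);
    return NULL;
}

/* Split the macroblock rows of one frame into NUM_THREADS groups and
 * motion-compensate the groups in parallel; joining all threads acts as
 * the synchronization point before the I-MB intra-prediction phase. */
void parallel_motion_compensation(void *frame, int mb_rows)
{
    pthread_t tid[NUM_THREADS];
    struct mc_task task[NUM_THREADS];
    int rows_per_thread = (mb_rows + NUM_THREADS - 1) / NUM_THREADS;

    for (int i = 0; i < NUM_THREADS; i++) {
        task[i].frame = frame;
        task[i].first_row = i * rows_per_thread;
        task[i].last_row  = (i + 1) * rows_per_thread - 1;
        if (task[i].last_row >= mb_rows)
            task[i].last_row = mb_rows - 1;
        pthread_create(&tid[i], NULL, mc_worker, &task[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
}
```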


Figure 8.9: The parallel motion compensation algorithm

These high-level methods, such as [20][28][42][57], require a large amount of memory in order to decode several frames in parallel, because of the size needed to store and transfer the data of several frames. Fine-grained decoding approaches work at the level of macroblocks, or of blocks inside a macroblock. These low-level methods, such as [14][67][72], cause a huge synchronization overhead that deeply affects the speedup because of the large number of macroblocks in each frame. The trade-off between the two approaches is also reflected in the synchronization overheads and the data-communication requirements. Our approach aims to benefit from the balance between these advantages and drawbacks. Macroblock rows require less memory than a frame and more memory than a macroblock. At macroblock-level granularity, the parallelism created would far exceed what current multicore architectures can exploit. The number of macroblock rows, on the other hand, is much smaller than the total number of macroblocks. For example, at HD resolution (1280 x 720) each frame has 3600 macroblocks, arranged as 80 macroblocks per row over 45 rows. The number of macroblock rows is thus lower than the total number of macroblocks by a factor of 80, and the synchronization and communication overheads between the cores are reduced by the same factor.
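A small self-contained C program (ours, for illustration) works this grid out for the resolutions used in our experiments, assuming 16x16 macroblocks and rounding dimensions up, since H.264 pads frames whose width or height is not a multiple of 16.

```c
#include <stdio.h>

/* Macroblock grid for the benchmark resolutions: 16x16-pixel macroblocks,
 * dimensions rounded up (e.g. 1080 luma lines -> 68 macroblock rows). */
int main(void)
{
    const struct { const char *name; int w, h; } res[] = {
        { "CIF",  352,  288 },
        { "WVGA", 800,  480 },
        { "HD",   1280, 720 },
        { "FHD",  1920, 1080 },
    };

    for (int i = 0; i < 4; i++) {
        int cols = (res[i].w + 15) / 16;   /* macroblocks per row */
        int rows = (res[i].h + 15) / 16;   /* macroblock rows = parallel units */
        printf("%-4s %4dx%-4d: %3d x %2d = %4d MBs, %2d parallel rows\n",
               res[i].name, res[i].w, res[i].h, cols, rows, cols * rows, rows);
    }
    return 0;
}
```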


8.5.2 Macroblock Dependency-Checking Algorithm

The macroblock dependency-checking algorithm is relatively simple; Figure 8.9 gives a simple illustration of it. Given a list containing all the macroblocks of a frame, a loop over all the macroblocks flags every intra-predicted macroblock (I-MB) and assigns each remaining macroblock to a specific group for an available core. These groups of macroblocks are then decoded in parallel. When all the groups have been processed, a loop goes over the I-MBs that were flagged at the start, and all the macroblocks in the I-MB list are decoded sequentially. The I-MBs could be processed in parallel when they are not neighbours, which means they have no dependencies between them. However, the number of I-MBs in P-frames and B-frames is only 2% on average; assigning each one to a different core would add little useful work while also adding synchronization overhead, so we execute the I-MBs sequentially for simplicity and lower communication cost. The result of this stage is fully conformant with the H.264 standard [26], meaning the output is exactly the same as when the execution is sequential. The decoded macroblocks are then passed through the deblocking filter, which makes the edges between macroblocks smooth and almost invisible. The asymptotic worst-case complexity of the proposed parallel algorithm remains practically the same as that of the sequential algorithm. All macroblocks are processed exactly once, as in sequential execution. One extra iteration with a constant per-item cost is added before the parallel execution; during this pass, the I-MBs are identified and pointers to them are appended to a list for later processing. The cost of this extra loop is linear and can be considered negligible in the total complexity of the algorithm. After the parallel decoding of the independent macroblocks, the remaining macroblocks in the list described above are processed sequentially and in order. Thus, the worst-case complexity of the algorithm compared with the original sequential algorithm remains the same, with or without the parallel computation gain.
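A minimal C sketch of the flagging pass follows; the types and helper functions (is_intra, the decode functions, the caller-provided list) are our illustrative assumptions, not the JM decoder's data structures.

```c
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    bool is_intra;   /* true for intra-predicted macroblocks (I-MBs) */
    /* ... prediction data, residuals, etc. ... */
} Macroblock;

/* Hypothetical stand-ins for the decoder's work functions. */
static void decode_groups_in_parallel(Macroblock *mb, size_t n)
{
    (void)mb; (void)n;   /* non-intra MBs, grouped per available core */
}
static void decode_one_mb(Macroblock *mb) { (void)mb; }

/* One linear pass flags the I-MBs; the remaining macroblocks are decoded
 * in parallel, then the flagged I-MBs are decoded sequentially, in order,
 * so the output matches the sequential decoder exactly. The caller supplies
 * imb_list with room for at least n pointers. */
void decode_frame(Macroblock *mbs, size_t n, Macroblock **imb_list)
{
    size_t n_imb = 0;

    for (size_t i = 0; i < n; i++)          /* extra O(n) pass, constant work */
        if (mbs[i].is_intra)
            imb_list[n_imb++] = &mbs[i];    /* defer: ~2% of MBs in P/B-frames */

    decode_groups_in_parallel(mbs, n);      /* independent macroblocks */

    for (size_t i = 0; i < n_imb; i++)      /* sequential, dependency-safe */
        decode_one_mb(imb_list[i]);
}
```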


Figure 8.10: Speedup of the parallel execution of the motion compensation stage on the ARM Cortex-A9 platform with 4 cores

8.5.3 Parallel Motion Compensation Results

The experiments are performed on the benchmark video sequences. The number of parallel macroblock rows grows with the resolution, so high resolutions scale better than low resolutions with the number of cores, thanks to the larger number of macroblocks in each frame. Experiments are carried out with 2, 4, 6, 8, 12, and 16 threads on a quad-core ARM Cortex-A9 processor [6]. Figure 8.10 shows the average speedup of the motion compensation stage for each resolution and for different numbers of threads.


Figure 8.11: Overall parallel execution of the H.264/AVC decoder

For the CIF resolution, the maximum speedup of 1.8 is reached using 4 threads. The speedup decreases as the number of threads grows because of the heavy data-communication overhead. The HD and FHD video sequences achieve speedups above 3.3 with 4 threads, where each thread is mapped to a different core. The best speedup-to-thread-count ratio is obtained with four threads: for the high-definition resolution it is about 0.8, and doubling the number of threads drops the ratio to 0.6, which cannot be considered as efficient as one expects when running a parallel application on a multicore processor. Using more threads than cores causes the scheduler to assign more than one thread to a core, and the resulting context switching does not improve the application's efficiency, as the results show. The results for the high resolutions are in general better in terms of speedup. This is mainly due to the larger workload per core: a larger volume of work reduces the impact of data transfer and synchronization between the cores. One reason is the reduced number of dependencies between the macroblocks being processed on the different cores; another is the data-transfer overhead required to send the data to the different cores. Synchronization also adds costs that are independent of the video resolution. The speedup is therefore much more effective at higher resolutions.


Figure 8.12: Total speedup of the H.264 decoder on the ARM Cortex-A9 with 4 cores

8.5.4 Results for the Complete Execution

Our main objective is to optimize all the stages of the H.264 decoder. We apply parallel techniques to the motion compensation and deblocking filter stages. The entropy decoding stage, on the other hand, is mainly sequential; parallel techniques for the entropy decoder are very difficult, and sometimes impossible, to apply because of its specification. Figure 8.11 depicts the stages of the H.264 decoder that have parallel algorithms. We collect execution-time and energy-consumption statistics for the proposed parallel implementation of H.264. The fractions of the different stages vary between video sequences; accordingly, the overall efficiency is taken as the total average of all the speedups weighted by the average share of each phase. Figure 8.12 shows the overall speedups obtained by the complete execution of the decoder with the optimization techniques described. Total speedups of 1.4, 2.0, 2.2, and 2.3 are obtained using 4 threads on 4 cores for the CIF, WVGA, HD, and FHD resolutions, respectively. The sequential execution of the entropy decoding stage, which accounts for about 14% to 19% of the overall decoding, limits the scaling of the overall speedup.
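This limit is consistent with an Amdahl-style estimate [4] (ours, using the stated sequential fraction, not a figure from the experiments): with a sequential share of roughly s = 0.15 for entropy decoding and n = 4 cores,

```latex
S(n) = \frac{1}{s + (1 - s)/n} = \frac{1}{0.15 + 0.85/4} \approx 2.8
```

which sits just above the measured 2.2x to 2.3x once synchronization and communication overheads are accounted for.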


Figure 8.13: Overall energy-consumption savings

This stage could be improved by implementing a hardware version of the entropy decoder. The FHD resolutions have the highest speedup because of their large frame sizes. All the maximum speedups are reached using 4 threads on 4 cores, mainly because there is no context switching when each thread is mapped to its own core. With more than 4 threads, the operating system has to assign more than one thread per core, causing context switching and therefore adding more overhead and idle time to the overall execution. Only the CIF video sequences have speedups below 2 when 4 threads are mapped onto 4 cores; their speedup-to-core-count ratio is therefore about 0.6. This leads us to conclude that high resolutions benefit more from multicore processors than lower resolutions, and that the Full-HD resolutions achieve the best speedup with a high number of cores. The energy measurements for the complete execution are shown in Figure 8.13.


The best energy-saving results correspond to the FHD resolutions using 4 threads, reaching 63%. These results are also evaluated for the complete execution of the optimized decoder. With 12 and 16 threads, the energy consumption increases compared with the sequential execution. We therefore conclude that the energy savings do not scale linearly with the number of threads or cores.

8.6 Conclusion

H.264 is widely adopted in multimedia applications on embedded systems. The high complexity imposed by the H.264 decoder calls for increased efficiency and reduced power consumption. We proposed and evaluated a parallel algorithm for the H.264 decoder. The colour components, luma and chroma, are processed in parallel in the motion compensation stages, providing a high and realistic potential for video decoding on dual-core and quad-core processors. The parallel execution of the H.264 decoder is improved by about 18%; the speedup reaches 32% with a pipeline, together with 24% energy savings. In addition, we presented an innovative parallel technique for the H.264 video decoding standard. Our approach decodes groups of macroblock rows in parallel, with an algorithm that detects dependencies on the fly by isolating the intra-predicted macroblocks (I-MBs). High- and low-definition video sequences are used in our experiments. The most effective speedup, with the best ratio to the number of parallel cores, is 3.3 using 4 threads on 4 cores. A macroblock-based parallel implementation of the deblocking filter is also provided. An overall speedup of 2.3 is reached for the complete parallel optimization of the H.264 standard. Our optimized decoder is tested on a real device with a quad-core ARM Cortex-A9 processor. In addition, the proposed parallel algorithm is tested on a multicore simulator to explore its scalability on multiprocessors with up to 32 cores. The work in this research presents solutions to the high complexity of the H.264 decoder using parallel computing. These algorithms can be applied to most macroblock-based video compression standards.


Further experiments and improvements must be carried out before our parallel algorithms can be applied in commercial products. Our intention is to continue this research in order to increase the efficiency and reduce the energy consumption of recent video coding standards.


References

[1] J. VLSI Signal Process. Syst., 19(2), 1998. ISSN 0922-5773.

[2] G. S. Almasi and A. Gottlieb. Highly Parallel Computing. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1989. ISBN 0-8053-0177-1.

[3] AMD. Accelerated Parallel Processing (APP) SDK. http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/, 2013.

[4] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. Reprinted from the AFIPS Conference Proceedings, Vol. 30 (Atlantic City, N.J., Apr. 18–20, 1967), AFIPS Press, Reston, Va., pp. 483–485. IEEE Solid-State Circuits Society Newsletter, 12(3):19–20, 2007. ISSN 1098-4232. doi: 10.1109/N-SSC.2007.4785615.

[5] Apple. iPhone. http://www.apple.com/iphone/, 2013.

[6] ARM Ltd. Cortex-A9 processor. http://www.arm.com/products/processors/cortex-a/cortex-a9.php, 2012.

[7] ARM Ltd. Cortex-A15 processor. http://www.arm.com/products/processors/cortex-a/cortex-a15.php, 2013.

[8] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. A view of the parallel computing landscape. Commun. ACM, 52(10):56–67, October 2009. ISSN 0001-0782. doi: 10.1145/1562764.1562783. URL http://doi.acm.org/10.1145/1562764.1562783.

[9] Elias Baaklini, Hassan Sbeity, Smail Niar, and Nouhad Amaneddine. H.264 color components video decoding parallelization on multi-core processors. In Proceedings of the 2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, DSD '10, pages 785–790, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4171-6. doi: 10.1109/DSD.2010.76. URL http://dx.doi.org/10.1109/DSD.2010.76.

[10] Elias Baaklini, Hassan Sbeity, and Smail Niar. H.264 macroblock line level parallel video decoding on embedded multicore processors. In Proceedings of the 2012 15th Euromicro Conference on Digital System Design, DSD '12, pages 902–906, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-0-7695-4798-5. doi: 10.1109/DSD.2012.67. URL http://dx.doi.org/10.1109/DSD.2012.67.

[11] Elias Baaklini, Hassan Sbeity, and Smail Niar. H.264 parallel optimization on graphics processors. In The Fifth International Conferences on Advances in Multimedia, MMEDIA 2013, pages 109–114, Delaware, USA, 2013. IARIA. ISBN 978-1-61208-265-3. URL http://www.thinkmind.org/index.php?view=article&articleid=mmedia_2013_5_30_40117.

[12] David Brash. The ARM architecture version 6 (ARMv6). Architecture Program Manager, ARM Ltd, 2002.

[13] Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon. Parallel Programming in OpenMP. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001. ISBN 1-55860-671-8, 9781558606715.

[14] Jike Chong, N. Satish, B. Catanzaro, K. Ravindran, and K. Keutzer. Efficient parallelization of H.264 decoding with macro block level scheduling. In Multimedia and Expo, 2007 IEEE International Conference on, pages 1874–1877, July 2007. doi: 10.1109/ICME.2007.4285040.

[15] OAR Corporation. RTEMS real-time operating system. http://www.rtems.com, 2010.

[16] R. Cucchiara, M. Piccardi, and A. Prati. Neighbor cache prefetching for multimedia image and video processing. IEEE Trans. Multimedia, 6(4):539–552, August 2004. ISSN 1520-9210. doi: 10.1109/TMM.2004.830806. URL http://dx.doi.org/10.1109/TMM.2004.830806.

[17] FFmpeg. FFmpeg project. http://www.ffmpeg.org/, 2012.

[18] Laurie J. Flynn. Intel halts development of 2 new microprocessors. http://www.nytimes.com/2004/05/08/business/intel-halts-development-of-2-new-microprocessors.html, 2004.

[19] T. R. Gardos. H.263+: the new ITU-T recommendation for video coding at low bit rates. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, volume 6, pages 3793–3796, 1998. doi: 10.1109/ICASSP.1998.679710.

[20] A. Gurhanli, C. C.-P. Chen, and Shih-Hao Hung. GOP-level parallelization of the H.264 decoder without a start-code scanner. In Signal Processing Systems (ICSPS), 2010 2nd International Conference on, volume 3, pages V3-627–V3-630, 2010. doi: 10.1109/ICSPS.2010.5555416.

[21] John L. Hennessy and David A. Patterson. Computer Architecture, Fifth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 5th edition, 2011. ISBN 012383872X, 9780123838728.

[22] M. Horowitz, A. Joch, F. Kossentini, and A. Hallapuro. H.264/AVC baseline profile decoder complexity analysis. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):704–716, July 2003. ISSN 1051-8215. doi: 10.1109/TCSVT.2003.814967.


[23] IDG Consumer & SMB. PCWorld magazine. http://www.pcworld.com/, 2012.

[24] Intel. Intel Core processor family. http://www.intel.com/, 2011.

[25] ISO. ISO C standard 1999. Technical report, 1999. URL http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf. ISO/IEC 9899:1999 draft.

[26] ITU-T and ISO/IEC. Advanced video coding for generic audiovisual services. ITU-T Rec. H.264, January 2012.

[27] Mike Johnson. Superscalar Microprocessor Design. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1991. ISBN 0-13-875634-1.

[28] C. S. Kannangara, I. E. G. Richardson, M. Bystrom, J. Solera, Y. Zhao, and A. Maclennan. Complexity reduction of H.264 using Lagrange optimization methods. In IEE VIE 2005, Glasgow, UK, 2005.

[29] Marta Karczewicz and Ragip Kurceren. The SP- and SI-frames design for H.264/AVC. IEEE Trans. Circuits Syst. Video Techn., 13(7):637–644, 2003. URL http://dblp.uni-trier.de/db/journals/tcsv/tcsv13.html.

[30] Khronos. OpenCL: The open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl, 2012.

[31] Micrel Lab. MPARM simulator. http://www-micrel.deis.unibo.it/sitonew/research/mparm.html, 2005.

[32] Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. When prefetching works, when it doesn't, and why. ACM Trans. Archit. Code Optim., 9(1):2:1–2:29, March 2012. ISSN 1544-3566. doi: 10.1145/2133382.2133384. URL http://doi.acm.org/10.1145/2133382.2133384.

[33] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 469–480, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-798-1. doi: 10.1145/1669112.1669172. URL http://doi.acm.org/10.1145/1669112.1669172.

[34] Peter List, Anthony Joch, Jani Lainema, Gisle Bjøntegaard, and Marta Karczewicz. Adaptive deblocking filter. IEEE Trans. Circuits Syst. Video Techn., 13(7):614–619, 2003.

[35] Henrique S. Malvar, Antti Hallapuro, Marta Karczewicz, and Louis Kerofsky. Low-complexity transform and quantization in H.264/AVC. IEEE Trans. Circuits Syst. Video Technol., pages 598–603, 2003.

[36] Detlev Marpe, Thomas Wiegand, and Stephen Gordon. H.264/MPEG4-AVC fidelity range extensions: tools, profiles, performance, and application areas. In ICIP, pages 593–596. IEEE, 2005. URL http://dblp.uni-trier.de/db/conf/icip/icip2005-1.html.

[37] Oleg Maslennikov. Systematic generation of executing programs for processor elements in parallel ASIC or FPGA-based systems and their transformation into VHDL-descriptions of processor element control units. In Proceedings of the International Conference on Parallel Processing and Applied Mathematics-Revised Papers, PPAM '01, pages 272–279, London, UK, 2002. Springer-Verlag. ISBN 3-540-43792-4. URL http://dl.acm.org/citation.cfm?id=645813.668540.

[38] Cor Meenderinck, Arnaldo Azevedo, Ben Juurlink, Mauricio Alvarez Mesa, and Alex Ramirez. Parallel scalability of video decoders. J. Signal Process. Syst., 57(2):173–194, November 2009. ISSN 1939-8018. doi: 10.1007/s11265-008-0256-9.

[39] Mauricio Alvarez Mesa, Alex Ramirez, Arnaldo Azevedo, Cor Meenderinck, Ben Juurlink, and Mateo Valero. Scalability of macroblock-level parallelism for H.264 decoding. In Proceedings of the 2009 15th International Conference on Parallel and Distributed Systems, ICPADS '09, pages 236–243, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-0-7695-3900-3. doi: 10.1109/ICPADS.2009.124.

[40] Gordon E. Moore. Cramming more components onto integrated circuits. In Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi, editors, Readings in Computer Architecture, pages 56–59. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000. ISBN 1-55860-539-8. URL http://dl.acm.org/citation.cfm?id=333067.333074.

[41] Kyle J. Nesbit and James E. Smith. Data cache prefetching using a global history buffer. In Proceedings of the 10th International Symposium on High Performance Computer Architecture, HPCA '04, pages 96–, Washington, DC, USA, 2004. IEEE Computer Society. ISBN 0-7695-2053-7. doi: 10.1109/HPCA.2004.10030. URL http://dx.doi.org/10.1109/HPCA.2004.10030.

[42] K. Nishihara, A. Hatabu, and T. Moriyoshi. Parallelization of H.264 video decoder for embedded multicore processor. In Multimedia and Expo, 2008 IEEE International Conference on, pages 329–332, April 2008. doi: 10.1109/ICME.2008.4607438.

[43] nVIDIA. The CUDA development kit from SECO. http://www.nvidia.com/object/seco-dev-kit.html, 2012.

[44] nVIDIA Developer Zone. CUDA documents. http://docs.nvidia.com/cuda/, 2013.

[45] IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002). IEEE standard VHDL language reference manual. pages c1–626, 2009. doi: 10.1109/IEEESTD.2009.4772740.

[46] VideoLAN Organization. x264 encoder. http://www.videolan.org/developers/x264.html, 2013.

[47] Peter Pacheco. An Introduction to Parallel Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2011. ISBN 9780123742605.


[48] David A. Patterson and John L. Hennessy. Computer Organization and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 4th edition, 2008. ISBN 0123744938, 9780123744937.

[49] B. Pieters, C.-F. J. Hollemeersch, J. De Cock, P. Lambert, W. De Neve, and R. Van De Walle. Parallel deblocking filtering in MPEG-4 AVC/H.264 on massively parallel architectures. IEEE Transactions on Circuits and Systems for Video Technology, 21(1):96–100, 2011. ISSN 1051-8215. doi: 10.1109/TCSVT.2011.2105553.

[50] Iain E. Richardson. Video Codec Design: Developing Image and Video Compression Systems. John Wiley & Sons, Inc., New York, NY, USA, 2002. ISBN 0471485535.

[51] Michael Roitzsch. Slice-balancing H.264 video encoding for improved scalability of multicore decoding. In Proceedings of the 7th ACM & IEEE International Conference on Embedded Software, EMSOFT '07, pages 269–278, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-825-1. doi: 10.1145/1289927.1289969. URL http://doi.acm.org/10.1145/1289927.1289969.

[52] Priscila Tiemi Maeda Saito, Denis Fernando Wolf, Bruno Alexandre Mendonça, Kalinka R. L. J. C. Branco, and Ricardo José Sabatine. A parallel approach for mobile robotic self-localization. In Proceedings of the 2009 Fourth International Conference on Computer Sciences and Convergence Information Technology, ICCIT '09, pages 762–767, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-0-7695-3896-9. doi: 10.1109/ICCIT.2009.284. URL http://dx.doi.org/10.1109/ICCIT.2009.284.

[53] Samsung. Samsung smartphones. http://www.samsung.com/uk/consumer/mobile-phones/smartphones/, 2013.

[54] R. Schafer and T. Sikora. Digital video coding standards and their role in video communications. Proceedings of the IEEE, 83(6):907–924, 1995. ISSN 0018-9219. doi: 10.1109/5.387092.

[55] Florian H. Seitner, Michael Bleyer, Margrit Gelautz, and Ralf M. Beuschel. Evaluation of data-parallel H.264 decoding approaches for strongly resource-restricted architectures. Multimedia Tools Appl., 53(2):431–457, June 2011. ISSN 1380-7501. doi: 10.1007/s11042-010-0501-7.

[56] Y. Shimokawa, Y. Fuwa, and N. Aramaki. A parallel ASIC VLSI neurocomputer for a large number of neurons and billion connections per second speed. In Neural Networks, 1991 IEEE International Joint Conference on, pages 2162–2167 vol. 3, 1991. doi: 10.1109/IJCNN.1991.170708.

[57] Kue-Hwan Sihn, Hyunki Baik, Jong-Tae Kim, Sehyun Bae, and Hyo Jung Song. Novel approaches to parallel H.264 decoder on symmetric multicore systems. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '09, pages 2017–2020, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-1-4244-2353-8. doi: 10.1109/ICASSP.2009.4960009.

[58] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. Operating System Concepts. Wiley Publishing, 8th edition, 2008. ISBN 0470128720.

[59] William Stallings. Computer Organization and Architecture: Designing for Performance. Prentice Hall Press, Upper Saddle River, NJ, USA, 8th edition, 2009. ISBN 9780136073734.

[60] John E. Stone, David Gohara, and Guochun Shi. OpenCL: A parallel programming standard for heterogeneous computing systems. IEEE Des. Test, 12(3):66–73, May 2010. ISSN 0740-7475. doi: 10.1109/MCSE.2010.69. URL http://dx.doi.org/10.1109/MCSE.2010.69.

[61] K. Suehring. H.264 reference software. http://bs.hhi.de/suehring/tml/, 2012.


[62] Gary J. Sullivan, Pankaj Topiwala, and Ajay Luthra. The H.264/AVC advanced video coding standard: Overview and introduction to the fidelity range extensions. In SPIE Conference on Applications of Digital Image Processing XXVII, pages 454–474, 2004.

[63] G. J. Sullivan, J. Ohm, Woo-Jin Han, and T. Wiegand. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, 2012. ISSN 1051-8215. doi: 10.1109/TCSVT.2012.2221191.

[64] Agilent Technologies. High-resolution LXI digitizers. http://www.home.agilent.com/en/pd-1445167-pn-L4532A/, 2012.

[65] P. N. Tudor. MPEG-2 video compression. Electronics Communication Engineering Journal, 7(6):257–264, 1995. ISSN 0954-0695. doi: 10.1049/ecej:19950606.

[66] Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, and David Kaeli. Multi2Sim: a simulation framework for CPU-GPU computing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 335–344, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1182-3. doi: 10.1145/2370816.2370865.

[67] Erik VanDerTol, Egbert Jaspers, and Rob Gelderblom. Mapping of H.264 decoding on a multiprocessor architecture. In Proc. SPIE Conf. on Image and Video Communications and Processing, pages 707–718, 2003.

[68] Sung-Wen Wang, Shu-Sian Yang, Hong-Ming Chen, Chia-Lin Yang, and Ja-Ling Wu. A multi-core architecture based parallel framework for H.264/AVC deblocking filters. J. Signal Process. Syst., 57(2):195–211, November 2009. ISSN 1939-8018. doi: 10.1007/s11265-008-0321-4.

[69] Weixun Wang, Prabhat Mishra, and Sanjay Ranka. Dynamic cache reconfiguration and partitioning for energy optimization in real-time multicore systems. In Proceedings of the 48th Design Automation Conference, DAC '11, pages 948–953, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0636-2. doi: 10.1145/2024724.2024935. URL http://doi.acm.org/10.1145/2024724.2024935.

[70] Stephen Westland, Huw Owens, Vien Cheung, and Iain Paterson-Stephens. Model of luminance contrast-sensitivity function for application to image assessment. Color Research & Application, 31(4):315–319, 2006. ISSN 1520-6378. doi: 10.1002/col.20230. URL http://dx.doi.org/10.1002/col.20230.

[71] YouTube. YouTube advanced encoding settings. https://support.google.com/youtube/answer/1722171, 2013.

[72] Zhuo Zhao and Ping Liang. Data partition for wavefront parallelization of H.264 video encoder. In Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on, pages 4 pp.–2672, 2006. doi: 10.1109/ISCAS.2006.1693173.
