Improving the Scalability of Multicore Systems With a Focus on H.264 Video Decoding

Improving the Scalability of Multicore Systems With a Focus on H.264 Video Decoding

PROEFSCHRIFT

ter verkrijging van de graad van doctor aan de Technische Universiteit Delft, op gezag van de Rector Magnificus prof.ir. K.Ch.A.M Luyben, voorzitter van het College voor Promoties, in het openbaar te verdedigen

op vrijdag 9 juli 2010 om 15:00 uur

door

Cornelis Hermanus MEENDERINCK

elektrotechnisch ingenieur geboren te Amsterdam, Nederland. Dit proefschrift is goedgekeurd door de promotoren: Prof. dr. B.H.H. Juurlink Prof. dr. K.G.W. Goossens

Samenstelling promotiecommissie:

Rector Magnificus, voorzitter Technische Universiteit Delft Prof. dr. B.H.H. Juurlink, promotor Technische Universitat¨ Berlin Prof. dr. K.G.W. Goossens, promotor Technische Universiteit Delft Prof. dr. H. Corporaal Technische Universiteit Eindhoven Prof. dr. ir. A.J.C. van Gemund Technische Universiteit Delft Dr. H.P. Hofstee IBM Systems and Technology Group Prof. dr. K.G. Langendoen Technische Universiteit Delft Dr. A. Ramirez Universitat Politecnica de Catalunya

Prof. dr. B.H.H. Juurlink was werkzaam aan de Technische Universiteit van Delft tot eind december 2009, en heeft als begeleider in belangrijke mate bij- gedragen aan de totstandkoming van dit proefschrift.

ISBN: 978-90-72298-08-9

Keywords: multicore, Chip MultiProcessor (CMP), scalability, power efficiency, thread level-parallelism (TLP), H.264, video decoding, instruction set architecture, domain specific accelerator, task management

Acknowledgements: This work was supported by the European Com- mission in the context of the SARC integrated project #27648 (FP6).

Copyright °c 2010 C.H. Meenderinck All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without permission of the author.

Printed in the Netherlands Dedicated to my wife for loving and supporting me.

Improving the Scalability of Multicore Systems With a Focus on H.264 Video Decoding Abstract

n pursuit of ever increasing performance, more and more processor archi- tectures have become multicore processors. As clock frequency was no I longer increasing rapidly and ILP techniques showed diminishing results, increasing the number of cores per chip was the natural choice. The transis- tor budget is still increasing and thus it is expected that within ten years chips can contain hundreds of high performance cores. Scaling the number of cores, however, does not necessarily translate into an equal scaling of performance. In this thesis, we propose several techniques to improve the performance scal- ability of multicore systems. With those techniques we address several key challenges of the multicore area. First, we investigate the effect of the power wall on future multicore architec- ture. Our model includes predictions of technology improvements, analysis of symmetric and asymmetric multicores, as well as the influence of Amdahl’s Law. Second, we investigate the parallelization of the H.264 video decoding ap- plication, thereby addressing application scalability. Existing parallelization strategies are discussed and a novel strategy is proposed. Analysis shows that using the new parallelization strategy the amount of available parallelism is in the order of thousands. Several implementations of the strategy are discussed, which show the difficulty and the possibility of actually exploiting the avail- able parallelism. Third, we propose an Application Specific Instruction Set (ASIP) processor for H.264 decoding, based on the SPE. ASIPs are energy efficient and allow performance scaling in systems that are limited by the power budget. Finally, we propose hardware support for task management, of which the ben- efits are two-fold. First, it supports the SARC programming model, which is a task-based dataflow programming model based on StarSS. By providing hardware support for the most time-consuming part of the runtime system, it improves the scalability. Second, it reduces the parallelization overhead, such as synchronization, by providing fast hardware primitives.

i

Acknowledgments

Many people contributed to this work in some way and I owe them a debt of gratitude. To reach this point it takes professors who teach you, colleagues who assist you, and family and friends who support you. First of all, I thank Ben Juurlink for supervising me. You helped focusing this work, while at the same time pointing me to numerous potential ideas. You showed me what others had done in our research area and how we might ei- ther build on that or fill a gap. You suggested me to visit BSC and collaborate with them closely. Also when I wanted to visit HiPEAC meetings you sup- ported me. Al those trips, where I met other researchers, other ideas, and other scientific work, have been of great value. I’m also thankful to Stamatis Vassiliadis for giving me the opportunity to work on the SARC project. I profited from the synergy effect within such a large project. The shift to the SARC project also meant the end of the collaboration with Sorin Cotofana. Sorin, I thank you for the time we worked together. You were the one who introduced me into academia, you taught me how to write papers, and I still remember the conversations we had about life, religion, gas- tronomy, and jazz. I also thank Kees Goossens for being one of my promoters. Over the years I have worked together with several other PhD students and some MSc students. I thank Arnaldo and Mauricio for all the work we did together on parallelizing H.264 and porting it to the Cell processor. I thank Alejandro, Felipe, Augusto, and David for all the effort they put into the sim- ulator and helping me to use and extend it. I thank Efren for implementing the Nexus system in the simulator. I thank Carsten, Martijn, and Chi for doing their MSc project with us, by which you helped me and others. I thank Pepijn for helping me with Linux issues and being a pleasant room mate. I thank Alex Ramirez and Xavi Martorell for warmly welcoming me in Barcelona, the collaboration, and your insight. My visits to Barcelona car- ried a lot of fruit. It helped me to put my research in a larger perspective; it

iii both broadened and deepened my knowledge, and taught me how to enjoy life in an intense way. I thank Jan Hoogerbrugge for the valuable discussions on parallelizing H.264. I thank Stefanos Kaxiras for his input on the methodology used to estimate the power consumption of future multicores. I thank Peter Hofstee and Brian Flachs for commenting on the SPE specialization. I thank Ayal Zaks for help- ing me with compiler issues. I thank Yoav Etsion for the discussions and col- laboration on hardware dependency resolution. Secretaries and coffee; they always seem to go together. Whether I was in Delft or in Barcelona, the secretary’s office was always my source of coffee and a chat. Lidwina and Monique, thank you for the coffee, the chats, and helping me with the administrative stuff. Lourdes, thank you for the warm welcome in BSC, the coffee, and the conversations. I thank the European Network of Excellence on High Performance and Em- bedded Architecture and Compilation (HiPEAC) for financing the collabora- tion with BSC and some trips to the HiPEAC workshops and conferences. I’m very grateful to my parents who raised me, who loved me, and who sup- ported me continuously throughout the years. I thank God for giving me the capabilities of doing a PhD and being my guidance through life. Many, many thanks to my dearest wife; the princess of my life. Thank you for loving me, for supporting me, and for taking this wonderful journey of life together with me.

C.H. Meenderinck Delft, The Netherlands, 2010

iv Contents

Abstract i

Acknowledgments iii

List of Figures xii

List of Tables xiv

Abbreviations xv

1 Introduction 1 1.1 Motivation ...... 2 1.2 Objectives ...... 7 1.3 Organization and Contributions ...... 9

2 (When) Will Multicores hit the Power Wall? 11 2.1 Methodology ...... 13 2.2 Scaling of the Alpha 21264 ...... 13 2.3 Power and Performance Assuming Perfect Parallelization . . . 18 2.4 Intermezzo: Amdahl vs. Gustafson ...... 20 2.5 Performance Assuming Non-Perfect Parallelization ...... 25 2.6 Conclusions ...... 32

3 Parallel Scalability of Video Decoders 35

v 3.1 Overview of the H.264 Standard ...... 36 3.2 Benchmarks ...... 41 3.3 Parallelization Strategies for H.264 ...... 42 3.3.1 GOP-level Parallelism ...... 43 3.3.2 Frame-level Parallelism ...... 44 3.3.3 Slice-level Parallelism ...... 44 3.3.4 Macroblock-level Parallelism ...... 45 3.3.5 Block-level Parallelism ...... 48 3.4 Scalable MB-level Parallelism: The 3D-Wave ...... 49 3.5 Parallel Scalability of the Dynamic 3D-Wave ...... 52 3.6 Case Study: Mobile Video ...... 62 3.7 Experiences with Parallel Implementations ...... 63 3.7.1 3D-Wave on a TriMedia-based Multicore ...... 65 3.7.2 2D-Wave on a Multiprocessor System ...... 67 3.7.3 2D-Wave on the Cell Processor ...... 68 3.8 Conclusions ...... 70

4 The SARC Media Accelerator 73 4.1 Related Work ...... 74 4.2 The SPE Architecture ...... 76 4.3 Experimental Setup ...... 79 4.3.1 Benchmarks ...... 79 4.3.2 Compiler ...... 80 4.3.3 Simulator ...... 81 4.4 Enhancements to the SPE Architecture ...... 84 4.4.1 Accelerating Scalar Operations ...... 85 4.4.2 Accelerating Saturation and Packing ...... 92 4.4.3 Accelerating Matrix Transposition ...... 99 4.4.4 Accelerating Arithmetic Operations ...... 106 4.4.5 Accelerating Unaligned Memory Accesses ...... 124 4.5 Performance Evaluation ...... 127

vi 4.5.1 Results for the IDCT8 Kernel ...... 127 4.5.2 Results for the IDCT4 kernel ...... 131 4.5.3 Results for the Deblocking Filter Kernel ...... 133 4.5.4 Results for the Luma/Chroma Interpolation Kernel . . 138 4.5.5 Summary and Discussion ...... 142 4.6 Conclusions ...... 145

5 Intra-Vector Instructions 147 5.1 Experimental Setup ...... 148 5.1.1 Simulator ...... 148 5.1.2 Compiler ...... 149 5.1.3 Benchmark ...... 149 5.2 The Intra-Vector SIMD Instructions ...... 150 5.2.1 IDCT8 Kernel ...... 151 5.2.2 IDCT4 Kernel ...... 153 5.2.3 Interpolation Kernel (IPol) ...... 153 5.2.4 Deblocking Filter Kernel (DF) ...... 157 5.2.5 Matrix Multiply Kernel (MatMul) ...... 165 5.2.6 Discrete Wavelet Transform Kernel (DWT) ...... 170 5.3 Discussion ...... 175 5.4 Conclusions ...... 176

6 Hardware Task Management Support for the StarSS Program- ming Model: the Nexus System 179 6.1 Background ...... 181 6.2 Benchmarks ...... 183 6.3 Scalability of the StarSS Runtime System ...... 185 6.3.1 CD Benchmark ...... 185 6.3.2 SD Benchmark ...... 192 6.3.3 ND Benchmark ...... 193 6.4 A Case for Hardware Support ...... 195

vii 6.4.1 Comparison with Manually Parallelized Benchmarks . 195 6.4.2 Requirements for Hardware Support ...... 198 6.5 Nexus: a Hardware Task Management Support System . . . . 200 6.5.1 Nexus System Overview ...... 201 6.5.2 The Task Life Cycle ...... 202 6.5.3 Design of the Task Pool Unit ...... 203 6.5.4 The Nexus API ...... 209 6.6 Evaluation of the Nexus System ...... 213 6.6.1 Experimental Setup ...... 214 6.6.2 Performance Evaluation ...... 217 6.7 Future Work ...... 225 6.7.1 Dependency Resolution ...... 225 6.7.2 Task Controller ...... 227 6.7.3 Other Features ...... 227 6.7.4 Evaluation of Nexus ...... 228 6.8 Conclusions ...... 228

7 Conclusions 231 7.1 Summary and Contributions ...... 231 7.2 Possible Directions for Future Work ...... 235

Bibliography 237

List of Publications 253

Samenvatting 259

Curriculum Vitae 261

viii List of Figures

1.1 Power density trend...... 3 1.2 Road to performance growth...... 4 1.3 Example of the SARC multicore architecture...... 8

2.1 The power wall problem (taken from [114])...... 12 2.2 Area of one core...... 15 2.3 Number of cores per die...... 16 2.4 Power of one core running at the maximum possible frequency. 17 2.5 Total power consumption of the scaled multicore over the years. 19 2.6 Power-constrained performance growth...... 19 2.7 Wrong comparison of Amdahl’s and Gustafson’s Laws. . . . . 21 2.8 Gustafson’s explanation of the difference...... 22 2.9 Prediction with Amdahl’s and Gustafson’s assumptions. . . . . 24 2.10 Prediction of speedup with proposed equation...... 25 2.11 Amdahl’s prediction for prediction for symmetric multicores. . 26 2.12 Amdahl’s prediction for asymmetric multicores...... 27 2.13 Amdahl’s prediction for heterogeneous multicores...... 28 2.14 Optimal area factor of serial processor...... 29 2.15 Area of the optimal serial processor...... 30 2.16 Amdahl’s prediction for heterogeneous multicores...... 30 2.17 Gustafson’s prediction for symmetric multicores...... 31

3.1 Block diagram of the decoding process...... 37

ix 3.2 Block diagram of the encoding process...... 37 3.3 A typical frame sequence and dependencies between frames. . 38 3.4 H.264 data structure...... 44 3.5 Dependencies between adjacent MBs in H.264...... 45 3.6 2D-Wave approach for exploiting MB parallelism...... 46 3.7 Amount of parallelism using the 2D-Wave approach...... 46 3.8 3D-Wave strategy: dependencies between MBs...... 50 3.9 3D-Wave strategy...... 50 3.10 Dynamic 3D-Wave: number of parallel MBs...... 54 3.11 Dynamic 3D-Wave: number of frames in flight...... 54 3.12 Dynamic 3D-Wave: number of parallel MBs (esa encoding). . 57 3.13 Dynamic 3D-Wave: number of frames in flight (esa encoding). 57 3.14 Dynamic 3D-Wave: limited MBs in flight...... 59 3.15 Dynamic 3D-Wave: limited MBs in flight...... 59 3.16 Dynamic 3D-Wave: limited frames in flight...... 61 3.17 Dynamic 3D-Wave: limited frames in flight...... 61 3.18 Dynamic 3D-Wave: parallelism for case study...... 64 3.19 Dynamic 3D-Wave: frames in flight for case study...... 64 3.20 Scalability of the 3D-Wave on the TriMedia multicore system. 66 3.21 Scalability of the 2D-Wave on the multiprocessor system. . . . 67 3.22 Scalability of the 2D-Wave on the Cell processor...... 69

4.1 Overview of the Cell architecture...... 77 4.2 Overview of the SPE architecture...... 77 4.3 Execution time breakdown of the FFmpeg H.264 decoder. . . . 80 4.4 Overview of the as2ve instructions...... 89 4.5 Eklundh’s matrix transpose algorithm...... 101 4.6 Overview of the swapoe instructions...... 102 4.7 Graphical representation of the swapoehw instruction. . . . . 102 4.8 Overview of the sfxsh instructions...... 107 4.9 Using the RRR format to mimic the new RRI3 format. . . . . 112

x 4.10 Computational structure of the one-dimensional IDCT8. . . . 119 4.11 Computational structure of the one-dimensional IDCT4. . . . 121 4.12 The process of deblocking one macroblock...... 134 4.13 Overview speedups and instruction count reductions...... 143

5.1 Example of vertical processing of a matrix...... 150 5.2 Example of horizontal processing of a matrix...... 151 5.3 Speedup and instruction count reduction...... 152 5.4 Breakdown of the execution cycles...... 152 5.5 Explanation interpolation filter...... 154 5.6 The ipol instruction computes eight FIR filters in parallel. . 154 5.7 The computational structure of the FIR filter...... 155 5.8 The process of deblocking one macroblock...... 157 5.9 Computational structure of the df y1 filter function...... 160 5.10 Data layout of the baseline 4 × 4 matrix multiply...... 166 5.11 Data layout of the enhanced 4 × 4 matrix...... 167 5.12 The three phases of the forward DWT lifting scheme...... 171 5.13 Computational structure of the inverse DWT...... 172 5.14 Computational structure of the idwt intra-vector instruction. . 173 5.15 Speedup and instruction count reduction of DWT kernel. . . . 175 5.16 The speedup as a function of the relative instruction latencies. 177

6.1 Dependency patterns of the benchmarks...... 185 6.2 StarSS scalability - CD benchmark - default configuration. . . 186 6.3 Legend of the StarSS Paraver traces...... 186 6.4 Trace of CD benchmark with default configuration...... 187 6.5 StarSS scalability - CD benchmark - (1, 1, 1) configuration. . 188 6.6 StarSS trace - CD benchmark - (1, 1, 1) conf. - large task size. 189 6.7 StarSS trace - CD benchmark - (1, 1, 1) conf. - small task size. 190 6.8 StarSS scalability - CD benchmark - (1, 1, 4) configuration. . . 192 6.9 StarSS scalability - SD benchmark - (1, 1, 8) configuration. . . 193

xi 6.10 StarSS trace - SD benchmark - (1, 1, 8) configuration...... 194 6.11 StarSS scalability - ND benchmark - (1, 1, 8) configuration. . . 195 6.12 Scalability of the manually parallelized CD benchmark. . . . . 196 6.13 Scalability of the manually parallelized SD benchmark. . . . . 196 6.14 Scalability of the manually parallelized ND benchmark. . . . . 196 6.15 The iso-efficiency lines of the StarSS and Manual systems. . . 197 6.16 Overview of the Nexus system...... 201 6.17 The task life cycle and the corresponding hardware units. . . . 202 6.18 Block diagram of the Nexus TPU...... 204 6.19 An example of multiple reads/write to the same address. . . . 207 6.20 Example of dependency resolution process...... 208 6.21 Legend of the StarSS + Nexus Paraver traces...... 218 6.22 Nexus trace - CD benchmark...... 219 6.23 Scalability limit of the Nexus system...... 221 6.24 Impact of the TPU latency on the performance ...... 222 6.25 Nexus scalability - CD benchmark...... 223 6.26 The iso-efficiency of the three systems...... 224 6.27 Task dependencies through overlapping memory regions. . . . 226

xii List of Tables

2.1 Parameters of the Alpha 21264 chip...... 14 2.2 Technology parameters of the ITRS roadmap...... 15

3.1 Comparison of video coding standards and profiles...... 39 3.2 Maximum parallel MBs using the 2D-Wave approach...... 47 3.3 Static 3D-Wave: MB parallelism and frames in flight...... 51 3.4 Dynamic 3D-Wave: MB-level parallelism and frames in flight. 53 3.5 Dynamic 3D-Wave: MB parallelism and frames in flight (esa). 55 3.6 Comparison parallelism using hex and esa encoding...... 56 3.7 Compression improvement of esa encoding relative to hex. . . 56

4.1 SPU instruction latencies...... 82 4.2 Comparison CellSim and SystemSim...... 83 4.3 Profiling of the baseline IDCT8 kernel...... 128 4.4 Speedup and instruction count reduction IDCT8 kernel. . . . . 130 4.5 Profiling of the baseline IDCT4 kernel...... 132 4.6 Speedup and instruction count reduction IDCT4 kernel. . . . . 133 4.7 Top-level profiling of the DF kernel for one macroblock. . . . 135 4.8 Profiling of the functions of the DF kernel...... 135 4.9 Speedup and instruction count reduction DF kernel ...... 138 4.10 Profiling results of the baseline Luma kernel...... 139 4.11 Speedup and instruction count reduction Luma kernel. . . . . 142 4.12 Overview speedups and instruction count reductions...... 144

xiii 5.1 Validation of CellSim...... 149 5.2 Results ipol instruction...... 156 5.3 Results complex DF instruction...... 163 5.4 Results simple DF instruction...... 165 5.5 Results fmatmul instruction...... 169 5.6 Results fmma instruction...... 169 5.7 Results idwt instruction...... 174

6.1 The iso-efficiency values of the StarSS and Manual systems. . 198 6.2 Performance comparison of the StarSS and Manual system. . . 199 6.3 CellSim configuration parameters...... 215 6.4 Validation of CellSim...... 216 6.5 Execution time of the CD benchmark using Nexus...... 217 6.6 The task throughput of the PPE and the TPU...... 220 6.7 The iso-efficiency of the three systems...... 224

xiv Abbreviations

ABT Adaptive Block size Transform ALU Arithmetic Logic Unit API Application Programming Interface ASIC Application Specific Integrated Circuit ASIP Application Specific Instruction set Processors AVC Advanced Video Coding B Bi-directional predicted frame, slice, or macroblock BP Baseline Profile CABAC Context Adaptive Binary Arithmetic Coding CAVLC Context Adaptive Variable Length Coding ccNUMA cache coherent Non-Uniform Memory Access CIF Common Intermediate Formate (352×288 pixels) CMP Chip MultiProcessor DCT Discrete Cosine Transform DF Deblocking Filter DLP Data Level Parallelism DMA Direct Memory Access DWT Discrete Wavelet Transform EC Entropy Coding ED Entropy Decoding EP Extended Profile esa exhaustive search motion estimation algorithm FHD Full High Definition (1920×1088 pixels) FIR Finite Impulse Response FMO Flexible Macroblock Ordering FU Functional Unit GOP Group of Pictures GPP General Purpose Processor GPU Graphics Processing Unit

xv HD High Definition (1280×720 pixels) HD-DVD High Definition Digital Video Disc hex hexagonal motion estimation algorithm HiP High Profile I Intra predicted frame, slice, or macroblock IDCT Inverse Discrete Cosine Transform IDWT Inverse Discrete Wavelet Transform ILP Instruction Level Parallelism IPC Instructions Per Cycles Ipol InterPolation IQ Inverse Quantization ISA Instruction Set Architecture ITRS International Technology Roadmap for Semiconductors ITU International Telecommunication Union LS Local Store MatMul Matrix Multiply MB MacroBlock MC Motion Compensation ME Motion Estimation MFC Memory Flow Controller MP Main Profile MPEG Moving Picture Experts Group MT Matrix Transposition MV Motion Vector OpenMP Open MultiProcessing P Predicted frame, slice, or macroblock POSIX Portable Operating System Interface (for uniX) PPE PPU Power Processing Unit PSNR Peak Signal-to-Noise Ratio QCIF Quarter Common Intermediate Format (176×144 pixels) RISC Reduced Instruction Set Computer SAT Saturate SD Standard Definition (720×576 pixels) SDK Software Development Kit SIMD Single Instruction Multiple Data SMT Simultaneous MultiThreading SoC System on Chip SPE Synergistic Processing Element

xvi SPMD Single Program Multiple Data SPU Synergistic Processing Unit TC Task Controller TLP Thread Level Parallelism TPU Task Pool Unit UVLC Universal Variable Length Coding VCEG Video Coding Experts Group VHDL VHSIC Hardware Description Language VHSIC Very High Speed Integrated Circuit VLC Variable Length Coding VLIW Very Large Instruction Word

xvii

Chapter 1

Introduction

omputer architecture has reached a very interesting era where scien- tific research is of great importance to tackle the problems currently C faced. Up to the beginning of this millennium processor performance improved mainly by techniques to exploit Instruction-Level Parallelism (ILP) and increased clock frequency. The latter caused a memory gap, as memory latency decreased at a much slower pace. This problem was relieved by using the ever growing transistor budget to build larger and larger caches. Currently, however, the landscape has completely changed. Multicore is the new way to obtain performance improvement. From high-end servers to em- bedded systems, multicores have found their way to the market. For servers, chips with 6 or 8 cores are currently available but this number is increasing rapidly. At the recently held Hot Chips 2009, IBM announced its Power7 con- taining 8 cores with 4 threads each [54]. Moreover, 32 of those chips can be connected directly to form a single server. Sun Microsystems described the Rainbow Falls chip; 16 cores with 8 threads each [55]. AMD started ship- ping their 6-core Opteron this year [13] and Intel presented an 8-core Nehalem version [25]. Desktop computing is mostly based on quad cores currently. Intel sells the Core i7 [11] and AMD its Phenom X4 [1]. Higher core counts are found in special purpose architectures. The most well-known of these is the 9-core Cell processor developed by IBM, , and Sony for the gaming industry [127]. It was groundbreaking as it is heterogeneous and uses scratchpad memory in- stead of coherent shared memory. The ClearSpeed CSX700 has 192 cores and is a low-power accelerator for high performance technical computing [47]. The Cisco CRS-1, also called Metro or SPP, has 188 RISC cores [14]. This

1 2 CHAPTER 1. INTRODUCTION massively many-core custom network processor is used in Cisco’s highest end routers. The picoChip products for wireless communication go up to 273 cores currently [128]. That brings us to the embedded market where we find, among others, the Freescale 8641D (dual core) [61], the ARM Cortex-A9 MPCore (supporting 2 to 4 cores) [3], Pluralitys Hypercore processor (capable of supporting 16 to 256 cores) [130], and Tileras Tile-Gx processor (up to 100 cores) [156]. Also among SoCs (System on Chips) multicore platforms are found, such as the Cavium Octeon (communication) [36], the Freescale QorIQ (communica- tion) [62], and the Texas Instruments OMAP (multimedia) [155]. The list is far from complete (we have not even mentioned Graphical Process- ing Units (GPUs) yet), but the picture is clear: multicores are everywhere, and “they are here to stay”. We have not yet explained why the shift towards mul- ticores was made, neither why they bring along new challenges for computer architects. Those issues are addressed in the next section as we present the motivation for this work.

1.1 Motivation

The increase of processor performance is so steady that Moore’s Law is ap- plicable to it, although it was initially posed to express the doubling of the number of transistors every two years [116]. Initially the growth in perfor- mance and transistor count were directly related. The number of transistors per chip increased because their size shrunk. At the same time, smaller transistors enabled higher clock frequencies which translated into higher performance. The additional transistors also allowed techniques that exploit ILP, such as deep pipelining, branch prediction, multiple issue, out-of-order execution, etc. These two factors have for long been the main source of performance growth, but not anymore, although the transistor budget is still following Moore’s Law. The exploitation of ILP seems to have reached its limits. Some improvements are still made, but the costs are enormously high in terms of area and power consumption. In aggressive ILP cores many transistors are deployed to keep the functional units busy. The transistor budget allows this additional hard- ware, but the power consumption is a problem. The power consumption of processors grew steadily over the years. Figure 1.1 depicts the power density trend of Intel processors. The chart is more than 10 years old, but it shows the trend up to that time as well as the problem. 1.1. MOTIVATION 3

Figure 1.1: Power density trend. Source: Fred Pollack, Intel. Keynote speech Micro32, 1999.

The power consumed by the chip produces heat that should be removed to prevent the chip from burning. Packaging and cooling, however, can handle a limited amount of heat and thus the power consumption can maximally be around 200W , assuming conventional cooling technology. That limit has been reached and thus power has become one of the dominant design constraints. 2 The dynamic power consumption is given by Pdyn = αCfV , where α is the transistor activity factor, C is the gate capacitance, f is the clock frequency, and V is the power supply voltage. Limits on V and C, and diminishing returns on α forces to turn to frequency to keep the power consumption within the budget. That is the main reason why frequency is hardly increasing anymore. Without ILP improvements and clock frequency increases as sources for per- formance improvements, turning towards the exploitation of Thread-Level Par- allelism (TLP) was the natural choice. Multithreading already found its way into processors to increase the utilization of the functional units. Putting mul- tiple cores on a chip allows to exploit TLP to a higher degree. And the good news is: the nominal performance of a multicore is proportional to the number of cores, which in turn is proportional to the transistor budget. Therefore, due to the multicore paradigm, performance can potential keep up with Moore’s Law. 4 CHAPTER 1. INTRODUCTION

Special purpose more active transistors higher frequency Heterogeneous more active transistors higher frequency

Multicores more active transistors, higher frequency Performance

Single thread performance more active transistors, higher frequency

1980 2005 2015? 2025?? 2035??? Figure 1.2: Peter Hofstee’s (IBM Austin) view on how performance growth will be obtained over the years.

We write ‘potentially’ because performance does not automatically scale with the number of cores. Scalability is the goal, preferably into the many-core domain where hundreds or thousands of cores in a single chip efficiently and cooperatively are at work. Computer architects and even computer engineers in general are facing many challenges in the pursuit of this goal. The first challenge is the power wall. A naive multicore architecture would have replicated cores as many as would fit on the die and use a clock frequency such that the power budget is just met. Such an approach, however, is not optimal in terms of performance. Moreover, for many-core architectures it is expected not to provide performance improvements at all. Figure 1.2 shows Peter Hofstee’s view on how performance growth will be obtained over the years. We are currently in the phase where we have more transistors but not higher frequency. As explained before, in such a situation using multiple cores instead of a single one provides performance improve- ments. In a next phase, there will be more transistors, but not all of them can be active concurrently (unless the frequency is reduced). In this phase a hetero- geneous architecture is required to obtain performance improvements. Cores, specialized for application domains, are activated depending on the workload. Cores that are not used are shut down such that the power budget is divided 1.1. MOTIVATION 5 among the active cores. In the final phase, it might not be possible to put more transistors on a die. The only way to improve performance in such a situation, is to provide special purpose hardware. Indeed, Application Specific Integrated Circuits (ASICs) show us that power efficiency can be achieved by being application specific. But in those situa- tions where programmability, flexibility, and/or wide applicability are needed, a programmable processor is required. Thus we formulate the first challenge as follows:

Challenge 1 To get as much as possible performance out of the power budget without significantly compromising wide applicability, programmability, and flexibility.

Providing TLP, by putting many cores on a chip, does not automatically im- prove performance improvement. There are many challenges in exploiting TLP, the two most important are mentioned below. On the algorithm or software level, we have to find TLP. Most algorithms and applications were developed for serial execution. For the near future, TLP has to be discovered within those existing algorithms and applications. For the far future, preferably new algorithms are developed, exhibiting more par- allelism. Automated parallelization has been studied for a while now, but not much progress has been reported. Thus, we will have to learn how to extract parallelism there where it is not obvious.

Challenge 2 To find parallelism in algorithms and applications, especially in applications where it is not obviously available.

Probably the most debated multicore related topic is programmability. Mul- ticores add another dimension to programming, making it very difficult for the average programmer to obtain good performance within reasonable time. He or she has to consider partitioning, scheduling, synchronization, maintain- ing dependencies, etc. For the Cell processor also Direct Memory Accesses (DMAs) have to be programmed, SPU (Synergistic Processing Unit) specific intrinsics have to be used, SIMDimization (Single Instruction Multiple Data) should be applied, and branches should be avoided. Many of those issues require a thorough understanding of the architecture. Moreover, careful analysis of the execution is necessary to understand the be- havior of the program. Only then scalability bottlenecks can be removed in order to actually benefit from the parallelization. 6 CHAPTER 1. INTRODUCTION

Therefore much research is put in this area. New programming models are proposed to ease the job of the programmer. In some cases runtime systems are added to handle scheduling and dependency resolution dynamically. Fur- thermore, tools are developed to assist the programmer in creating and under- standing parallel programs.

Challenge 3 To increase programmability, enabling a significant number of programmers to create efficient parallel code within reasonable time.

A challenge that is older than multicore, but so far only briefly mentioned in this introduction, is the memory wall. The memory latency, measured in clock cycles, is growing. A processor that has to wait for data wastes resources and has lower performance. Two factors mainly cause this increased gap. First, the off-chip bandwidth grows slower than the performance of processors. Second, as feature sizes shrink the latency to reach the other side of the chip or to go off-chip, is increasing. A cache hierarchy can be used to mitigate the effect of the power wall, but in the multicore era this poses another issue. To obtain low latency, typically the L1 cache is private to a core. Thus coherency has to be maintained among shared data, either in hardware or software. The overhead of maintaining co- herency increases rapidly with the number of cores, posing a scalability prob- lem [34]. This is exacerbated by the fact that interconnection speeds are not scaling well with technology. Another direction is to use scratchpad memory with DMA controllers as was done in the Cell processor. It allows complete control of the data and thus per- formance can be optimized. Programmability, however, suffers a lot and most likely not every application domain is suitable for this type of data manage- ment. This brings about the fourth challenge:

Challenge 4 To continuously provide each core with sufficient data to operate on.

Having identified parallelism in the application, being able to program it, and running it on a parallel architecture does not necessarily mean that scalability has been achieved. Similar to exploiting ILP and DLP, the exploitation of TLP brings along overhead, reducing the scalability. Thus, the fifth challenge is the following:

Challenge 5 To minimize the impact of TLP exploitation overhead on scala- bility and performance. 1.2. OBJECTIVES 7

These challenges are among the main challenges recognized by the commu- nity, although there are many more that are also important. There are chal- lenges in the area of reliability, caused by entering the nano-scale area. Also serial performance remains an issue as shown by Hill and Marty [74]. They based their results on Amdahl’s Law and its corresponding assumptions, which might turn out to be too pessimistic, though. More challenges could be men- tioned here, but we will focus on how the work presented in this thesis con- tributes to these five important challenges.

1.2 Objectives

The research presented in this thesis was conducted in the context of the EU FP6 project SARC (Scalable computer ARChitecture). The goal of this project was to develop long-term sustainable approaches to advanced computer archi- tecture in Europe. It focused on multicore technology for a broad range of sys- tems, ranging from small energy critical embedded systems right up to large scale networked data servers. The project brought together research in archi- tecture, interconnect, design space exploration, low power design, compiler construction, programming models, and runtime systems. Our role in the project was to improve or obtain scalability in the many-core area based on architectural enhancements. Similar to Peter Hofstee’s view, the SARC project assumes that heterogeneous architectures are a necessity in the many-core domain. The power budget prevents all cores being active concurrently. Thus, the transistor budget is best spent on providing cores that are specialized for an application domain. Figure 1.3 shows an example of such an architecture, as considered in the SARC project. It contains general purpose processors (P) and domain specific accelerators (A), connected in a clustered fashion. Within a cluster, the memory hierarchy is also specialized for the application domain it targets. The clusters show similarity with the Cell processor. At the start of the project, the Cell processor was the only heterogeneous multicore processor available and thus it was chosen as the baseline architecture. Moreover, its DMA-style of accessing memory has proven to enable data optimizations for locality and thus enhance scalability. The overall objective of this thesis, i.e., to improve scalability in the many- core area, is very general. In this section we define more concrete and specific objectives, based on the challenges mentioned in the previous section, and the context of the project. Several application domains are targeted within the 8 CHAPTER 1. INTRODUCTION

MIC IO

L2 Cache L2 Cache L2 Cache

P $ P $ P $

A DMA A $ LS LS

A DMA A $ LS LS

A DMA A $ LS LS A $

A DMA A $ LS LS DMA A A $ LS LS DMA A A $ LS LS

Figure 1.3: Example of the multicore architecture considered within the SARC project, in which context this research was performed. project, of which this work focuses on the media domain. As main benchmark application, the H.264 video decoder was chosen. Media applications remain an important workload in the future and video codecs are considered to be important benchmarks for all kind of systems, ranging from low-power mobile devices to high performance systems. H.264 is the most advanced video coded currently. It is highly complex and hard to parallelize.

Objective 1 Analyze the H.264 decoding application. Find new ways to par- allelize the code in a scalable fashion. Analyze the parallelization strategies and determine bottlenecks in current parallel systems. This objective corre- sponds to Challenge 2.

Objective 2 Specialize the cores of the Cell processor for H.264 in order to obtain performance and power benefits. This objective corresponds to Chal- lenge 1. 1.3. ORGANIZATION AND CONTRIBUTIONS 9

Objective 3 Design hardware mechanism to reduce the overhead of managing the parallel execution. This objective corresponds to Challenge 5.

Objective 4 Improve programmability by providing hardware support. This objective corresponds to Challenge 3.

Challenge 4 is only implicitly addressed in this thesis by using scratchpad memory in combination with DMAs, and by applying double buffering.

1.3 Organization and Contributions

This work focuses on a single goal: improving/obtaining scalability into the many-core area. As scalability depends on several issues, as indicated by the challenges and objectives, this work, however, is not focused on a single issue, but rather encompasses a few. Each issue is described in a separate chapter, which are mostly self containing. Introduction to the specific issues are pro- vided within the chapters as well as discussion of related work. An exception is Chapter 5 which extends the work described in Chapter 4. In Chapter 2 the power wall problem is investigated in more detail. That is, Challenge 1 is investigated in more detail. Specifically, the impact on mul- ticore architecture is investigated. In this chapter we developed a model that includes predictions of technology improvements, analysis of symmetric and asymmetric multicores, as well as the influence of Amdahl’s Law. Objective 1 is addressed in Chapter 3. The H.264 standard is reviewed, ex- isting parallelization strategies are discussed, and a new strategy is proposed. Analysis of real video sequences reveals that the amount of parallelism avail- able is in the order of thousands. Several parallel implementations, mainly developed by colleagues of the author, are discussed. They show the difficulty and the possibility of actually exploiting the available parallelism. The thesis continues in Chapter 4 with the design of the SARC Media Accel- erator. Based on a thorough analysis of the execution of the H.264 kernels on the Cell processor, architectural enhancements to the SPE core are proposed. Specifically, a handful of application specific instructions are added to improve performance. Kernel speedups between 1.84 and 2.37 are obtained. One of the application specific instructions proposed in the Chapter 4 per- forms intra-vector operations. In Chapter 5 we investigate the applicability of such instructions to other applications. Our analysis shows that mainly two- 10 CHAPTER 1. INTRODUCTION dimensional signal processing kernels benefit from this type of instructions. Speedups up to 2.06, with an average of 1.45 are measured. Finally, in Chapter 6 we propose a hardware support system for task manage- ment. We analyze the scalability of manually coded task management as well as that of StarSS [129, 30]. StarSS includes a programming model, compiler, and runtime system. Although easy to program, the scalability of StarSS is limited. The proposed hardware support alleviates this problem by performing task dependency resolution in hardware. Chapter 7 concludes this thesis. The main contributions are summarized and open issues are discussed. Chapter 2

(When) Will Multicores hit the Power Wall?

t is commonly accepted that we have reached the power wall, meaning that uniprocessor performance improvements has stagnated due to power con- I straints. The main causes for the increased power consumption are higher clock frequencies and power inefficient techniques to exploit more Instruction- Level Parallelism (ILP), such as wide-issue superscalar execution. Hitting the power wall is also one of the reasons why industry has shifted towards multi- cores. Because multicores exploit explicit Thread-Level Parallelism (TLP), their cores can be simpler and do not need additional hardware to extract ILP. In other words, multicores allow exploiting parallelism in a more power- efficient way. Figure 2.1 illustrates the power wall. Uniprocessors have basically reached the power wall. As argued above, multicores can postpone hitting the power wall but they are also expected to hit the power wall. Several questions arise such as when will multicores hit the power wall, what limitation will cause this to happen, what can computer architects do to avoid or alleviate the problem, etc. It is generally believed that power efficiency of multicores can be improved by designing asymmetric or heterogeneous multicores [74]. For example, several domain specific accelerators could be employed which are turned on and shut down according to the actual workload. But, is the power saving it provides worth the area cost? In this chapter we provide an answer to those questions. Specifically, in this chapter we focus on technology improvements as they have been one of the main reasons for performance growth in the past. According to the ITRS roadmap [84, 85], technology improvements are expected to con-

11 12 CHAPTER 2. (WHEN)WILL MULTICORES HIT THE POWER WALL?

Figure 2.1: The power wall problem (taken from [114]). tinue. The 2009 roadmap predicts a clock frequency of 14 GHz for the year 2022 and an astonishing number of available transistors. Because of power constraints, however, it might not be possible to exploit those technology im- provements, e.g., it might not be possible to turn on all transistors concurrently. In this chapter we analyze the limits of performance growth due to technology improvements with respect to power constraints. We assume perfectly paral- lelizable applications but also performance growth for non-perfect paralleliza- tion is analyzed using both Amdahl’s [24] and Gustafson’s Law [70]. Based on the results of our experiments, we conclude that multicores can offer sig- nificant performance improvements in the next decade provided a number of principles are followed. Of course our model is necessarily rudimentary. For example, it does not con- sider bandwidth constraints nor static power dissipation due to leakage. Nev- ertheless, the power wall has been predicted, multicores are expected to be a remedy, asymmetric multicores have been envisioned, but to the best of our knowledge this has never been quantified. From our results we hope to derive some principles which can be the base for future work as well as the rest of this thesis. Complementary to this work is the extension of Amdahl’s Law for energy ef- ficiency in the many-core era by Woo and Lee [167]. For three systems (sym- metric superscalar, symmetric simple, asymmetric combination) they provide equations to compute power efficiency. Their results show that energy effi- ciency can be improved when asymmetric multicores are used. Their method- ology is based on architectural parameters, is very general, and provides rela- 2.1. METHODOLOGY 13 tive numbers. Our work is based on technology parameters, is more specific, and provides absolute numbers. Most importantly, in our analysis the multi- core is actually power constrained while in the former the power budget is not considered. This chapter is organized as follows. The methodology used throughout this chapter is described in Section 2.1 while in Section 2.2 we present the modeled multicore architecture by scaling an existing core. The power and performance analysis of the modeled multicore is presented in Section 2.3. In Section 2.4 Amdahl’s and Gustafson’s Law are reviewed, which are used in Section 2.5 to analyze the effect of non-perfect application parallelization on the performance growth. Finally, Section 2.6 concludes this chapter.

2.1 Methodology

To analyze the effects of technology improvements on the performance of fu- ture multicores, and to investigate the power consumption trend, the follow- ing experiment was performed. We assume an Alpha 21264 chip (processing core and caches), scale it to future technology nodes according to the ITRS roadmap, create a hypothetical multicore consisting of the scaled cores, and derive the power numbers. Specifically, we calculate the power consumption of a multicore when all cores are active and run at the maximum possible fre- quency. Furthermore, we analyze the performance growth over time if the power consumption is restricted to the power budget allowed by packaging. The Alpha 21264 [89] core was chosen as subject of this experiment for two reasons. First, the Alpha 21264 has been well documented in literature, pro- viding the required data for the experiment. Second, it is a moderately sized core lacking the aggressive ILP techniques of current high performance cores. Thus, it is a good representative of what is generally expected to be the pro- cessing element in future many-cores. Table 2.1 provides an overview of the key parameters of the 21264 relevant for this analysis.

2.2 Scaling of the Alpha 21264

The 21264 is scaled according to data in the 2007 edition of the International Technology Roadmap for Semiconductors (ITRS) [84]. At the time this analy- sis was performed that was the latest version available. At time of writing this thesis though, the 2009 edition has been published. The relevant data has only 14 CHAPTER 2. (WHEN)WILL MULTICORES HIT THE POWER WALL?

Table 2.1: Parameters of the Alpha 21264 chip.

year 1998 technology node 350 nm supply voltage 2.2 V die area 314 mm2 dynamic power (400 MHz) 48 W dynamic power (600 MHz) 70 W

slightly changed compared to the 2007 edition, resulting in moderately more optimistic results in terms of performance growth. As the general trend is not affected, in this chapter we present the original analysis using the 2007 edition of the roadmap. The relevant parameters are given in Table 2.2. The time frame considered is 2007-2022, which is the exact time span of the roadmap. The values of the technology node and the on-chip frequency were taken from Page 79 of the Ex- ecutive Summary Chapter. The on-chip frequency is based on the fundamental transistor delay, and an assumed maximum number of 12 inverter delays. The die area values were taken from the table on Page 81 of the Executive Sum- mary. Finally, the values of the supply voltage and the gate capacitance (per micron device) were taken from the table starting at Page 11 of the Process Integration, Devices, and Structures Chapter of the roadmap. To model the experimental multicore for future technology nodes, we scale all required parameters of the 21264 core. The values that are available in the ITRS, we use as such. The others we scale using the available parameters by taking the ratio between the original 21264 parameter values and the predic- tions of the roadmap. Below we describe in detail for each scaled parameter how this was done. The gate capacitance of the 21264 was not found in litera- ture, thus we extrapolated the values reported in the roadmap and calculated a value of 1.1 × 10−15F/µm for 1998. First, the area of one core was scaled. Let L(t) be the process technology size for year t and let Lorig be the process technology size of the original core. The area of one 21264 core in year t will be µ ¶ L(t) 2 A1(t) = Aorig × . (2.1) Lorig

Figure 2.2 depicts the results for the time frame considered. The area of one 2.2. SCALINGOFTHE ALPHA 21264 15

Table 2.2: Technology parameters of the ITRS roadmap.

2007 2008 2009 2010 2011 2012 technology (nm) 68 57 50 45 40 36 frequency (MHz) 4700 5063 5454 5875 6329 6817 die area (mm2) 310 310 310 310 310 310 supply voltage (V ) 1.1 1 1 1 1 0.9 Cg,total (F/µm) 7.10E-16 8.40E-16 8.43E-16 8.08E-16 6.5E-16 6.29E-16 2013 2014 2015 2016 2017 2018 technology (nm) 32 28 25 22 20 18 frequency (MHz) 7344 7911 8522 9180 9889 10652 die area (mm2) 310 310 310 310 310 310 supply voltage (V ) 0.9 0.9 0.8 0.8 0.7 0.7 Cg,total (F/µm) 6.28E-16 5.59E-16 5.25E-16 5.07E-16 4.81E-16 4.58E-16 2019 2020 2021 2022 technology (nm) 16 14 13 11 frequency (MHz) 11475 12361 13351 14343 die area (mm2) 310 310 310 310 supply voltage (V ) 0.7 0.65 0.65 0.65 Cg,total (F/µm) 4.1E-16 3.91E-16 3.62E-16 3.42E-16

12

10

8 ) 2

6 Area (mm

4

2

0 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year Figure 2.2: Area of one core. 16 CHAPTER 2. (WHEN)WILL MULTICORES HIT THE POWER WALL?

1000

900

800

700

600

500

Cores per die 400

300

200

100

0 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year Figure 2.3: Number of cores per die. core decreases more or less quadratically over time, and is predicted to be about one third of a square millimeter in 2022. Second, using the scaled area of one core, the number of cores that could fit on a single die was calculated. The ITRS roadmap assumes a die area of 310mm2 for the entire time frame. Thus, the total number of cores per die in year t is given by 310 N(t) = , (2.2) A1(t) which is depicted in Figure 2.3. According to those results, in 2010 a 60-core chip would be possible. This seems reasonable considering the state-of-the-art multicores in 2010. The Tilera Tile-Gx100 [156] contains up to 100 cores, each having their own L1 and L2 cache and can run a full operating system autonomously. These cores are similar to the Alpha 21264 from an architectural point of view. The Clear- Speed CSX700 [47] contains 192 cores, which are simpler and targeted at exploiting data-level parallelism. Intel has created an experimental many-core called ’Single-chip Cloud Computer’, which contains 48 x86 cores [82]. From 2007 on the graph shows a doubling of cores roughly every three years which is in line with the expected increase [148]. In 2022 our calculations predict a multicore with nearly one thousand cores. 2.2. SCALINGOFTHE ALPHA 21264 17

18

16

14

12

10

8 Power one core (W) 6

4

2

0 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year Figure 2.4: Power of one core when running at the maximum possible frequency.

Finally, the power of one core was scaled. Power consumption consists of a dynamic and a static part, of which the latter is dominated by leakage. The data required to scale the static power is not available to us and thus we restrict this power analysis to dynamic power. It is expected that leakage remains a problem and thus our estimations are conservative. The dynamic power is given by

2 Pdyn = αCfV , (2.3) where α is the transistor activity factor, C is the gate capacitance, f is the clock frequency, and V is the power supply voltage. The activity factor α of the 21264 processor is unknown and also depends on the application, but since this does not change with scaling, it can be assumed to be constant. The capac- itance C (F ) in the equation is different from capacitance Cg,total (F/µm) in Table 2.2, but they relate to each other according to C ∝ Cg,total × L. Thus, the dynamic power at year t is calculated as:

µ ¶2 Cg,total(t) × L(t) f(t) V (t) P (t) = Porig × × × . (2.4) Cg,total,orig × Lorig forig Vorig Figure 2.4 depicts the power of one core over time, assuming it runs at the max- imum possible frequency. As the curve shows, it roughly decreases linearly, 18 CHAPTER 2. (WHEN)WILL MULTICORES HIT THE POWER WALL? resulting in 1.5 W in 2022. The power density, on the other hand, increases. For example, in 2022 one core is modeled to be 0.3 mm2 large, and thus the power density would be 5 W atts/mm2. Therefore, the maximum frequency might not be obtainable.

2.3 Power and Performance Assuming Perfect Paral- lelization

Now that all parameters have been scaled, it is possible to calculate the power consumption of the entire multicore. It is assumed that all cores are active and thus Ptotal(t) = N(t) × P (t). (2.5) Figure 2.5 depicts the total power over time and shows that for the assumptions of this analysis the total power consumption in 2008 is 600 W , gradually in- creases, and reaches about 1.5 kW in 2022. The roadmap also predicts that the power budget allowed due to packaging constraints is 198 W . It is clear that for the entire calculated time span the power consumption of our hypothetical multicore exceeds the power budget. This is why power has become one of the main design constraints. The figure also shows that the difference between the power budget and the power consumption of the full blown hypothetical multicore is increasing over time. That means that a large part of the technology improvement cannot be put into effect for performance growth. For example, between 2011 and 2020 the roadmap predicts that technology allows doubling the on-chip frequency. The power consumption, however, would increase with a factor 1.5. Thus, for equal power only a small frequency improvement would be possible. Next we analyze the performance improvement that can be achieved by this multicore. Assuming that the application is perfectly parallelizable, the pa- rameters that influence performance are frequency and the number of cores. In this case the speedup S(t) in year t, relative to the original 21264 core, is given by f(t) N(t) S(t) = × . (2.6) forig 1 We refer to this speedup as the non-constrained speedup. It is depicted in Figure 2.6. We are interested in the speedup achieved by multicores that meet the power budget of 198 W . As the results show the power budget is exceeded when 2.3. POWERAND PERFORMANCE ASSUMING PERFECT PARALLELIZATION19

1400

1200

1000

800

Total power (W) 600

400

200

0 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year Figure 2.5: Total power consumption of the scaled multicore over the years.

1000 non-constrained speedup power-constrained speedup

100

Speedup (rel. to 2007) 10

1 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year Figure 2.6: Power-constrained performance growth. 20 CHAPTER 2. (WHEN)WILL MULTICORES HIT THE POWER WALL? all cores are used concurrently at the maximum frequency. Thus, the non- constrained speedup is not expected to be achievable in practice. To meet the power budget, either the frequency could be scaled down or a number of cores could be shut down. Since both measures decrease the speedup linearly, the power-constrained speedup can be defined as:

f(t) N(t) Pbudget Spower constr.(t) = × × , (2.7) forig 1 Ptotal(t) where Pbudget is the power budget allowed by packaging. Figure 2.6 depicts the power-constrained performance of the case study mul- ticore over time. To increase the readability we have normalized the result to 2007. The curve shows a performance growth of 27% per year. The non- constrained performance growth is also depicted. The latter is growing with 37% per year, and thus the gap between the two is increasing. To put these re- sults in historical perspective we compare them to Figure 2.1 of Hennessey and Patterson [73]. From the mid-1980s to 2002 the latter graph shows an annual performance growth of 52%. Then from 2002 the annual performance growth dropped to 20%. Considering this historical perspective, the predicted annual performance growth of 27% is slightly higher, but not far off. The main con- clusion from these results is that although power severely limits performance, substantial performance growth is still possible using the multicore paradigm.

2.4 Intermezzo: Amdahl vs. Gustafson

So far we assumed a perfectly parallelizable application. In practice this is not always the case as there might be code that needs to be executed serially. If this is the case we can apply either Amdahl’s Law [24] or Gustafson’s Law [70]. In this intermezzo we review them both briefly. Amdahl’s Law is the most well known of the two. In general terms it states that the performance improvement achieved by some feature is limited by the fraction of time the feature can be used. Originally Amdahl formulated his law to argue against parallel computers. He observed that typically 40% of the instructions were overhead which consisted of serial code that cannot be par- allelized. Therefore, the speedup that can be achieved by a parallel computer 2.4. INTERMEZZO:AMDAHLVS.GUSTAFSON 21

Amdahl Gustafson 1000

800

600 Speedup

400

200

0 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% Serial fraction Figure 2.7: Wrong comparison of speedup under Amdahl’s and Gustafson’s Law for N = 1024. would be very limited. In a formal way, Amdahl’s Law is given by: (s + p) Speedup = (s + p/N) 1 = , (2.8) (s + p/N) where N is the number of cores, s is the fraction of time spent on serial parts of a program, and p = 1−s is the fraction of time spent (by a single processor) on parts of the program that can be done in parallel. Amdahl’s Law assumes firstly that the serial and the parallel part both take a constant number of calculation steps independent of N, and secondly that the input size remains constant, independent of N. In other words, Amdahl does not consider scaling the problem size, while multiprocessors are generally used to solve larger problems. Figure 2.7 depicts Amdahl’s Law as a function of s for N = 1024. It shows that the function is very steep for s close to zero. On the other hand, Gustafson noticed that on a 1024-processor machine they achieved speedups between 1016 and 1021 for s = 0.4 − 0.8 percent. For s = 0.4% Amdahl’s Law predicts a maximum speedup of 201 and thus Gustafson’s findings seemingly 22 CHAPTER 2. (WHEN)WILL MULTICORES HIT THE POWER WALL?

Time = 1 Time = s'+Np' s p Serial

1 N 1 N

Parallel s' p' Time = 1 Time = s + p/N

Figure 2.8: Gustafson’s explanation of the difference between his speedup (right) and Amdahl’s speedup (left). conflicted with Amdahl’s Law. From these results Gustafson concluded that Amdahl’s Law was inappropriate to massive parallel systems and he proposed a new law to calculate the speedup of massive parallel systems. He started reasoning from the parallel machine and defined s0 and p0 as the serial and parallel fractions of time spent on a parallel system∗. The time on the parallel system is given by s0 + p0 = 1 and a serial processor would require time s0 + p0 × N to perform the same task. Gustafson explained the difference between his reasoning and that of Amdahl using Figure 2.8. The reasoning of Gustafson leads to the following equation, now known as Gustafson’s Law:

Scaled speedup = (s0 + p0 × N)/(s0 + p0) = (s0 + p0 × N) = N + (1 − N) × s0 (2.9)

Gustafson called the results of his equation the ‘scaled speedup’ assuming he calculated a different speedup. Figure 2.7 shows, beside the speedup un- der Amdahl’s Law, the speedup using Gustafson’s equation. The latter corre- sponds to the experimental findings of Gustafson. Gustafson, however, used a different definition of the serial fraction than Am- dahl did. The first measured s0 in the parallel program while the latter mea- sured s in the serial program (see Figure 2.8). Yuan Chi proved that actually the two equations are identical [139], which is not difficult to understand. Take the experiment of Gustafson with s0 = 0.4% (measured on a parallel machine) and N = 1024, which results in speedup of 1020 using Gustafson’s Law. As- sume that on a parallel machine the total execution time is 1. Then the serial

∗We distinguish between s and p from Amdahl and s0 and p0 from Gustafson. As we show later, those are not the same although apparently Gustafson did not realize that. 2.4. INTERMEZZO:AMDAHLVS.GUSTAFSON 23 part takes 0.004 time units. The same task on a serial machine would take s0 + Np0 = 1019.908 time units. The fraction of the serial part on this serial run is 0.004/1019.908 = 3.92 × 10−6. Filling this value of s into Amdahl’s Law results in a speedup of 1020, exactly as Gustafson’s Law predicted and the experiments showed. This also explains why Figure 2.7 provides a wrong comparison of the two laws. That is, however, not all there is to say about the two laws. Gustafson showed that Amdahl’s pessimistic prediction of the future was wrong and he also pointed out why. Amdahl assumed that the large s measured in his time would remain large in the future, i.e., he considered the fraction s to be a constant, be- cause he assumed a fixed program and input size. Gustafson correctly observed that over time the problem size had grown and also that the serial fraction is dependent on the problem size. Therefore, in predicting the future Gustafson assumed that the problem size scales with the number of processors. Furthermore, he assumed that the exe- cution time of the serial part is independent on the problem size. That is, he assumed the absolute value of the serial fraction to be constant over the num- ber of processors in contrast to Amdahl who assumed the relative value of the serial fraction to be constant over the number of processors. The different assumptions are hidden in the two equations in the way the serial fraction is defined. Usually, when predicting the future, people simply take a fixed value s and use either of the two equations to calculate the speedup for a number of processors (see Figure 2.9). Taking a fixed value for s, however, means something different in the two laws and thus the user (unfortunately of- ten unconsciously) agrees with the assumptions corresponding to the chosen equation. Because the assumptions are different, the equations produce dif- ferent results although they are identical. It is not the different equation that produces the different results; it is the different set of assumptions that produce the different result. For example, it is possible to use Amdahl’s equation with Gustafson’s assumptions and find Gustafson’s prediction. To do that, for every N you have to recalculate the value of s0 to match Amdahl’s definition of s. Therefore, when predicting the future one has to consider the assumptions carefully. Otherwise, one is amenable to making the same mistake as Am- dahl who in 1967 pessimistically predicted that there was no future for mas- sive parallel systems. On the other hand, using Gustafson’s assumptions one might obtain a too optimistic prediction. The best way is to thoroughly ana- lyze the application domain under consideration and classify it as ‘Amdahl’ or ‘Gustafson’. But this has to be done careful as the following example shows. 24 CHAPTER 2. (WHEN)WILL MULTICORES HIT THE POWER WALL?

500 Amdahl's prediction Gustafson's prediction

400

300 Speedup

200

100

0 0 100 200 300 400 500 Number of processors (N) Figure 2.9: Prediction of speedup for s = s0 = 1% with Amdahl’s and Gustafson’s assumptions.

H.264 video decoding contains an entropy decoding stage. It is parallelizable on the slice and frame level but any strategy exploiting this was generally as- sumed to be impractical or not scalable. Thus, often Amdahl’s assumptions were used to predict a pessimistic future for parallelizing H.264. In Chapter 3, however, a new parallelization strategy is proposed where exploiting frame- level parallelism can effectively be deployed. With this parallelization strategy the ‘serial’ entropy decoding can be parallelized to a certain extent resulting in a much more optimistic prediction. In this case the difference is not in the assumption of a fixed problem size but that of the fixed program. Not all application domains can be strictly classified as Amdahl or Gustafson, but something in between would be more appropriate. We will use the follow- ing equation to model that: N S = . (2.10) s(N−1) √ + 1 s+ n N(1−s) This equation is equivalent to Amdahl’s and Gustafson’s equations but models the assumption on the serial fraction s through the variable n. A value of n = 1 equals Gustafson’s assumption while n = ∞ equals Amdahl’s assumption. Figure 2.10 depicts the speedups predicted for different values of n. 2.5. PERFORMANCE ASSUMING NON-PERFECT PARALLELIZATION 25

500 n = 1 n = 1.5 n = 2 n = 3 n = 5 n = 10 400 n = ∞

300 Speedup

200

100

0 0 100 200 300 400 500 Number of processors (N) Figure 2.10: Prediction of speedup for s = 1% using Equation 2.10; n = 1 equals Gustafson’s assumptions while n = ∞ equals Amdahl’s assumptions.

An interesting research question is what value of n should be assigned to what application (domain). If this would be possible it would resolve the issue of which assumptions to use in predicting the future and hopefully the confusion about the two laws would diminish. Such research should go beyond a scala- bility analysis at the algorithm levels and include historical trends in data size and advances in algorithms.

2.5 Performance Assuming Non-Perfect Paralleliza- tion

In the previous section we showed how Amdahl’s and Gustafson’s assump- tions and equations can be applied in predicting the future for non-perfect parallelization. The choice between the two showed to be dependent on the (assumed) characteristics of the application domain. In this section we ana- lyze the impact of non-perfect parallelization on the power-constrained perfor- mance growth using both Amdahl’s and Gustafson’s assumptions. Depending on the application domain, it should be determined which one is more appro- priate. 26 CHAPTER 2. (WHEN)WILL MULTICORES HIT THE POWER WALL?

100 s=0% s=0.1% s=0.5% s=1% s=5% s=10%

10

Speedup (rel. to 2007) 1

0.1 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year Figure 2.11: Prediction of power-constrained performance growth for several frac- tions of serial code s assuming a symmetric multicore and with Am- dahl’s assumptions.

First, we take Amdahl’s assumptions to predict the power-constrained perfor- mance growth. We assume a symmetric multicore where all cores are being used during the parallel part. The clock frequency of all cores is equal and has been scaled down to meet the power budget. The power-constrained speedup that this symmetric multicore can achieve is given by

1 f(t) P S (t) = × × budget (2.11) Amdahl power−constr. sym. 1−s f P (t) s + N(t) orig total and is depicted in Figure 2.11. We used a serial fraction s ranging from 0.1% to 10%. The figure shows that for s = 0.1% there is a slight performance drop, compared to ideal, going up to a factor of 2 for 2022. For s = 1%, however, there is a performance drop of 10x for 2022 and for s = 10% there is no performance growth at all. Indeed we see that Amdahl’s prediction is pessimistic, which is an argument for asymmetric or heterogeneous multicores. If the serial part can be accel- erated by deploying one faster core, more speedup through parallelism could be achieved. This serial processor could be an aggressive superscalar core, a domain specific accelerator, or a core that runs at a higher clock frequency 2.5. PERFORMANCE ASSUMING NON-PERFECT PARALLELIZATION 27

100 s=0% s=0.1% s=0.5% s=1% s=5% s=10%

10

Speedup (rel. to 2007) 1

0.1 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year Figure 2.12: Prediction of power-constrained performance growth for several frac- tions of serial code s assuming an asymmetric multicore and with Am- dahl’s assumptions. than the others. Using a serial processor with a higher clock frequency than the other cores, but with the same architecture, we refer to this multicore as asymmetric. When a different architecture is used for the serial processor, we refer to this multicore as heterogeneous. We analyze the effect of both type of multicores on the power constrained performance increase with Amdahl’s assumptions. In the asymmetric multicore we assume identical cores but increase the clock frequency of one core during the serial part. Note that during this time the other cores are inactive and thus the power budget is not exceeded. Limitations due to power density are not considered, though. The speedup this asymmetric multicore can achieve is given by 1 SAmdahl power−constr. asym.(t) = s × forig + 1−s × forig × Ptotal(t) f(t) N(t) f(t) Pbudget (2.12) and is depicted in Figure 2.12. Note that both this equation and Equation (2.11) become identical to Equation (2.7) if s = 0%. The results show that for s = 0.1% there is only a very small performance drop compared to ideal parallelization. For s = 1% the performance drop is 2.3x in 2022 while for 28 CHAPTER 2. (WHEN)WILL MULTICORES HIT THE POWER WALL?

100 s=0% s=0.1% s=0.5% s=1% s=5% s=10%

10

1 Speedup (rel. to homogeneous 2007)

0.1 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year Figure 2.13: Prediction of power-constrained performance growth for several frac- tions of serial code s assuming a heterogeneous multicore (serial pro- cessor is 8x larger than others) and with Amdahl’s assumptions. s = 10% the performance drops 14 times compared to ideal parallelization but considerable performance growth is predicted over time. These results show that asymmetric multicores are a good choice to improve performance, if Amdahl’s assumptions are correct and if the serial fraction is larger than approximately 0.5%. To model the heterogeneous multicore we use Pollack’s rule, as was done in [32, 75]. The rule states that performance increase is roughly proportional to the square root of the increase in complexity, where complexity refers to processor logic and thus equals area. A processor that is four times larger therefore is expected to have twice the performance. To start, we model a multicore with one core that is 8 times larger than the other cores. This serial processor runs at the maximum clock frequency. The frequency of the other cores is reduced to meet the power budget of the chip. The serial part runs on the serial core, while the parallel part runs on all cores. The power constrained performance improvement for this heterogeneous mul- ticore is depicted in Figure 2.13. Two effects are visible. First, for fully or highly parallel applications (low s), the performance is lower compared to the asymmetric multicore. This performance drop is significant for the early years 2.5. PERFORMANCE ASSUMING NON-PERFECT PARALLELIZATION 29

60 s=10% s=5% s=1% s=0.5% 50 s=0.1% s=0%

40

30

20 Optimal area factor of serial processor

10

0 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year Figure 2.14: The optimal area factor of the serial processor for several fractions of serial code s.

(up to 2x) while for the later years it is negligible. Second, applications with a larger serial fraction have a better performance on the heterogeneous mul- ticore compared to the asymmetric multicore. This performance increase is negligible for the early years, but significant (up to 2.5x) for the later years. This example shows that a serial processor with a 8x size is not beneficial for all cases. The optimal area of the serial processor is depending on the serial fraction as well as on the technology parameters. Figure 2.14 shows the optimal area factors for several serial fractions. Figure 2.15 shows the percentage of die area such a serial processor would cost. For s = 0% the optimal area factor is always 1, which is the lowest we consider. For larger values of s, the optimal area factor increases over time. For s = 10%, in 2022 the optimal area factor is 58x, which equals approximately 880 million transistors. The power constrained performance increase for heterogeneous multicores containing the optimally sized serial core, is depicted in Figure 2.16. It dif- fers from performance obtained with the asymmetric multicore, mainly for larger values of s. In 2022, for s = 10% the heterogeneous multicore has a 4.5 times higher performance that the asymmetric multicore. Even for smaller values of s, the performance is higher than that of the asymmetric multicore. 30 CHAPTER 2. (WHEN)WILL MULTICORES HIT THE POWER WALL?

1 s=10% s=5% s=1% s=0.5% s=0.1% s=0%

0.1

0.01 Area of optimal serial processor (%)

0.001 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year Figure 2.15: The area of the optimal serial processor (in percentage of the total die) for several fractions of serial code s.

100 s=0% s=0.1% s=0.5% s=1% s=5% s=10%

10

1 Speedup (rel. to homogeneous 2007)

0.1 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year Figure 2.16: Prediction of power-constrained performance growth for several frac- tions of serial code s assuming a heterogeneous multicore (with optimal serial processor) and with Amdahl’s assumptions. 2.5. PERFORMANCE ASSUMING NON-PERFECT PARALLELIZATION 31

100 s=0% s=0.1% s=0.5% s=1% s=5% s=10%

10

Speedup (rel. to 2007) 1

0.1 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year Figure 2.17: Prediction of power-constrained performance growth for several frac- tions of serial code s0 assuming a symmetric multicore with Gustafson’s assumptions.

Therefore, if Amdahl’s assumptions are correct and the serial fraction is larger than 0.5%, a heterogeneous architecture is a better choice than an asymmetric architecture.

Second, we predict the power-constrained performance growth using Gustafson’s assumptions. Again, we assume a symmetric multicore where all cores are being used during the parallel part. The clock frequency of all cores is equal and has been scaled down to meet the power budget. The power- constrained speedup that this symmetric multicore can achieve is given by

¡ 0¢ f(t) Pbudget SGustafson power−constr. sym.(t) = N + (1 − N) × s × × forig Ptotal(t) (2.13) and is depicted in Figure 2.17. The figure shows that for any value s0 be- tween 0% and 10% there will be no significant performance loss compared to ideal parallelization. For s0 = 10% in 2022 the performance is 11% less than the ideal parallelization s0 = 0% case. Thus we can conclude that homoge- neous multicores are the best choice if the applications are mainly classified as ‘Gustafson’. In case both application types should be covered, there is benefit 32 CHAPTER 2. (WHEN)WILL MULTICORES HIT THE POWER WALL? in using per-core voltage-frequency control. For ‘Gustafson’ type applications, a symmetric multicore is created by assigning each core an equal frequency. For ‘Amdahl’ type applications, an asymmetric multicore is created by assign- ing a higher clock frequency to the core(s) running the serial part.

2.6 Conclusions

In this chapter we analyzed the impact of the power wall on multicore design. Specifically, we investigated the limits to performance growth of multicores due to technology improvements with respect to power constraints. As a case study we modeled a multicore consisting of Alpha 21264 cores, scaled to fu- ture technology nodes according to the ITRS roadmap. Analysis showed that in 2022 such a multicore could contain nearly one thousand cores, each con- suming 1.5 W when running at the maximum possible frequency of 14 GHz. The total multicore, at full blown operation, would consume 1.5 kW while the power budget predicted by the ITRS is 198 W . From these figures it is clear that power has become a major design con- straint, and will remain a major bottleneck for performance growth. It does not mean, however, that the power wall has been hit for multicores. We calcu- lated the power-constrained performance growth and showed that technology improvements enable a doubling of performance every three years for multi- cores, which is in line with the number of cores per die. There is another threat, however, for multicores which is Amdahl’s Law. The speedup achieved by parallelism is limited by the size of the serial fraction s. Amdahl assumes a fixed program and input size and thus uses a constant frac- tion s over time. Using Amdahl’s equation and assumptions we predicted the power-constrained performance growth for several fractions s. For a symmet- ric multicore, a serial fraction of 1% decreases the speedup by a factor of up to 10. For an asymmetric multicore, where the serial code is executed on a core that runs at a higher clock frequency, 1% of serial code reduces the achievable speedup by a factor of up to 2. This can be further reduced to 1.4 by using a heterogeneous architecture, containing one larger (and thus faster) core. On the other hand, there is Gustafson’s Law which assumes that the input size grows equally with increasing number of processors and that the serial part has a constant execution time independent of the input size. We also predicted the power-constrained performance increase using Gustafson’s assumptions. The results are much more optimistic as even for s0 = 10% in 2022 the perfor- mance loss is only 11% compared to ideal parallelization. 2.6. CONCLUSIONS 33

Whether Amdahl’s or Gustafson’s assumptions are valid for a certain appli- cation domain remains to be determined. Most likely some domains can be characterized as ‘Amdahl’, some as ‘Gustafson’ and other as something in be- tween. This confirms that serial performance remains a challenge, as was also indicated in [74]. From the results of this chapter we conclude that in order to avoid hitting the power wall the following two principles should be followed. First, at the ar- chitectural level power efficiency has to become the main design constraint and evaluation criterion. The transistor budget is no longer the limit but the power those transistors consume. Thus performance alone should no longer be the metric but performance per watt (or a similar power efficiency metric like performance per transistor [78], BIPS3/W [95], and others [77]). Second, for application domains that follow Amdahl’s assumptions asymmetric or het- erogeneous designs are necessary. For those the need to speedup serial code remains. A challenge for computer architects is to combine speedup of serial code with power efficiency. A multicore that follows these principles could for example consists of a few general purpose high speed cores (e.g., aggressive superscalar), many gen- eral purpose power-efficient cores (not wide issue, no out-of-order, no deep pipelines, etc.), and domain specific accelerators. Note that the architectural framework assumed throughout this work (see Figure 1.3) perfectly matches such a design. The accelerators provide the most power-efficient solution and also allow fast execution of serial code. For example, entropy coding in video codecs is a perfect candidate to be implemented by an accelerator. It is highly serial and benefits from high clock frequencies, but has mainly bit-level oper- ations and thus a simple 8- or 16-bit core with a small instruction set would be best (see also Chapter 3). Furthermore, dynamic voltage/frequency scaling can be applied to optimize the performance-power balance, while hardware support for thread and task management reduces the energy of the overhead introduced by parallelism. A lot more techniques and architectural directions are possible. Summarizing, from this study we conclude that for the next decade multicores can provide significant performance improvements without hitting the power wall, even though power severely limits performance growth. Technology im- provements will provide the means. To achieve the possible performance im- provements, however, power efficiency should be the main design criterion at the architectural level.

Chapter 3

Parallel Scalability of Video Decoders

ith the doubling of the number of cores roughly every three years, within a decade tens to a hundred of high performance cores can W fit on a chip. For power efficiency reasons those multicores might as well contain up to thousands of simple and small cores [27, 167]. A cen- tral question is whether applications scale to such large number of cores. If applications are not extensively parallelizable, cores will be left unused and performance suffers. Also, achieving power efficiency on many-cores relies on the amount of TLP available in applications. In this chapter we investigate this issue for video decoding workloads by an- alyzing their parallel scalability. Multimedia applications remain important workloads in the future and video coders/decoders (codecs) are expected to be important benchmarks for all kind of systems, ranging from low power mo- bile devices to high performance systems. Specifically, we analyze the parallel scalability of the FFmpeg [58] H.264 decoder. We briefly review existing par- allelization techniques, propose a more scalable strategy, analyze the available amount of TLP, and discuss actual implementations of the parallel H.264 de- coder. This work was performed in cooperation with two other PhD students (Arnaldo Azevedo and Mauricio Alvarez) and one MSc student (Chi Ching Chi). This chapter mainly describes the work performed by the author. For completeness though, certain work done by others is shortly presented and discussed. We will clearly indicate those sections.

35 36 CHAPTER 3. PARALLEL SCALABILITYOF VIDEO DECODERS

This chapter is organized as follows. In Section 3.1 a brief overview of the H.264 standard is provided. Next, in Section 3.2 we describe the benchmarks used throughout this chapter. In Section 3.3 existing parallelization strategies are discussed and related work is reviewed. In Section 3.4 we propose the Dy- namic 3D-Wave parallelization strategy. Using this new parallelization strat- egy we investigate the parallel scalability of H.264 in Section 3.5 and demon- strate that it exhibits huge amounts of parallelism. Section 3.6 describes a case study to assess the practical value of a highly parallelized decoder. Our experi- ences with actual implementations of the parallel H.264 decoder are discussed in Section 3.7. Section 3.8 concludes this chapter.

3.1 Overview of the H.264 Standard

Currently, the best video coding standard, in terms of compression and quality is H.264 [119]. It is used in HD-DVD and blu-ray Disc, and many coun- tries are using or will use it for terrestrial television broadcast, satellite broad- cast, and mobile television services. It has a compression improvement of over two times compared to previous standards such as MPEG-4 ASP [87], H.262/MPEG-2 [86], etc. The H.264 standard [120] was designed to serve a broad range of application domains ranging from low to high bitrates, from low to high resolutions, and a variety of networks and systems, e.g., Inter- net streams, mobile streams, disc storage, and broadcast. The H.264 standard was jointly developed by ITU-T Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG). It is also called MPEG-4 part 10 or AVC (advanced video coding). Figures 3.1 and 3.2 depict block diagrams of the decoding and the encoding process of H.264, respectively. The main kernels are Prediction (intra predic- tion or motion estimation), Discrete Cosine Transform (DCT), Quantization, Deblocking filter, and Entropy Coding. These kernels process on macroblocks (MBs), which are blocks of 16 × 16 pixels, although the standard allows some kernels to process smaller blocks, down to 4 × 4. H.264 uses the YCbCr color space with mainly a 4:2:0 subsampling scheme. Each group of 2 × 2 pixels consists of four luma samples (Y), and two chroma samples (one Cb and one Cr). Thus, Cb and Cr are subsampled. Y is the luminance of the pixel, while Cb and Cr indicate the color as the difference from blue and red, respectively. A movie is a sequence of images called frames, which can consist of several slices or slice groups. A slice is a partition of a frame such that all comprised MBs are in scan order (from left to right, from top to bottom). The Flexible 3.1. OVERVIEW OF THE H.264 STANDARD 37

Compressed Uncompressed video Entropy Inverse Deblocking video decoding quantization IDCT filter

Intra Frame prediction buffer

MC prediction

Figure 3.1: Block diagram of the decoding process.

Uncompressed Compressed video + Entropy video DCT Quantization coding

ME prediction

MC prediction

Frame Intra buffer prediction

Deblocking Inverse IDCT filter quantization

Decoder

Figure 3.2: Block diagram of the encoding process.

Macroblock Ordering (FMO) feature allows slice groups to be defined, which consist of an arbitrary set of MBs. Each slice group can be partitioned into slices in the same way a frame can. Slices are self contained and can be de- coded independently. Throughout this chapter the terms slice and frame are interchangeable, unless indicated differently, as we assume each frame con- sists of one slice. H.264 defines three main types of slices/frames: I, P, and B-slices. An I-slice uses intra prediction and is independent of other slices. In intra prediction a MB is predicted based on adjacent blocks. A P-slice uses motion estimation and intra prediction and depends on one or more previous slices, either I, P or B. Motion estimation is used to exploit temporal correlation between slices. Finally, B-slices use bidirectional motion estimation and depend on slices from the past and future [60]. Figure 3.3 shows a typical slice order and the depen- dencies, assuming each frame consist of one slice only. The standard also 38 CHAPTER 3. PARALLEL SCALABILITYOF VIDEO DECODERS

IBBPBBP

Figure 3.3: A typical frame sequence and dependencies between frames. defines SI and SP slices that are slightly different from the ones mentioned before and which are targeted at mobile and Internet streaming applications. The H.264 standard has many options. In the next sections we briefly mention the key features per kernel and compare them to previous standards. Table 3.1 provides an overview of the discussed features for MPEG-2, MPEG-4 ASP, and H.264. The different columns for H.264 represent profiles which are ex- plained later. For more details the interested reader is referred to [165, 153].

Motion Estimation

Advances in motion estimation is one of the major contributors to the compres- sion improvement of H.264. The standard allows variable block sizes ranging from 16 × 16 down to 4 × 4, and each block has its own motion vector(s). The motion vector is quarter sample accurate. Multiple reference frames can be used in a weighted fashion. This significantly improves the compression rate of coding occlusion areas, where an accurate prediction can only be made from a frame further in the past.

Intra Prediction

Three types of intra coding are supported, which are denoted as Intra 4x4, Intra 8x8 and Intra 16x16. The first type uses spatial prediction on each 4 × 4 luminance block. Eight modes of directional prediction are available, among them horizontal, vertical, and diagonal. This mode is well suited for MBs with small details. For smooth image areas the Intra 16x16 type is more suitable, 3.1. OVERVIEW OF THE H.264 STANDARD 39 4 8, 8, × × × 8 inte- 8, 4 × × 16, 16 16,4, 4 8 4, 8 × × × × H.264 HiP I, P, B No 16 8 8 Yes 1, 1/2, 1/4 Yes 4 ger DCT Yes CAVLC, UVLC, CABAC 4 8, 8, × × × 8, 4 × 16, 16 4 integer 16, 8 4, 4 × × × × H.264 XP I, P, B, SI, SP Yes 16 Yes 1, 1/2, 1/4 Yes 4 DCT Yes CAVLC, UVLC 8 8 4 8, 8, × × × 8, 4 × 16, 16 4 integer 16, 8 4, 4 × × × × H.264 MP I, P, B No 16 Yes 1, 1/2, 1/4 Yes 4 DCT Yes CAVLC, UVLC, CABAC 8 8 4 8, 8, × × × 8, 4 × 16, 16 4 integer 16, 8 4, 4 × × × × H.264 BP I, P Yes 16 Yes 1, 1/2, 1/4 No 4 DCT Yes CAVLC, UVLC 8 8 8 × Comparison of video coding standards and profiles. 16, 8, 8 8 DCT × × × MPEG-4 ASP I, P, B No 16 16 No 1, 1/2, 1/4 No 8 No VLC Table 3.1: 16 16 × × MPEG-2 I, P, B No 16 No 1, 1/2 No 16 No VLC Picture types Flexibleroblock ordering Motion mac- block size Multiple reference frames Motion pelracy accu- Weightedtion predic- Transform In-loop deblocking filter Entropy coding 40 CHAPTER 3. PARALLEL SCALABILITYOF VIDEO DECODERS for which four prediction modes are available. The high profile of H.264 also supports intra coding on 8×8 luma blocks. Chroma components are estimated for complete MBs using one specialized prediction mode.

Discrete Cosine Transform

MPEG-2 and MPEG-4 part 2 employ an 8 × 8 floating point transform, al- though often implemented using fixed point operations. Due to the decreased granularity of the motion estimation, however, there is less spatial correlation in the residual signal. Thus, standard a 4×4 (that means 2×2 for chrominance) transform is used, which is as efficient as a larger transform [109]. Moreover, a smaller block size reduces artifacts known as ringing and blocking. An op- tional feature of H.264 is Adaptive Block size Transform (ABT), which adapts the block size used for DCT to the size used in the motion estimation [166]. Furthermore, to prevent rounding errors that occur in floating point implemen- tations, an integer transform was chosen.

Deblocking Filter

Processing a frame in MBs can produce blocking artifacts, generally consid- ered the most visible artifact in prior standards. This effect can be resolved by applying a deblocking filter around the edges of a block. The strength of the filter is adaptable through several syntax elements [102]. While in H.263+ this feature was optional, in H.264 it is standard and it is placed within the motion compensated prediction loop (see Figure 3.1) to improve the motion estimation.

Entropy Coding

Two classes of entropy coding are available in H.264: Variable Length Cod- ing (VLC) and Context Adaptive Binary Arithmetic Coding (CABAC). The latter achieves up to 10% higher compression but at the cost of high compu- tational complexity [110]. The VLC class consists of Context Adaptive VLC (CAVLC) for the transform coefficients, and Universal VLC (UVLC) for the small remaining part. CAVLC achieves large improvements over simple VLC, used in prior standards, without the full computational cost of CABAC. The standard was designed to suite a broad range of video application do- mains. Each domain, however, is expected to use only a subset of the available 3.2. BENCHMARKS 41 options. For this reason profiles and levels were specified to mark confor- mance points. Encoders and decoders that conform to the same profile are guaranteed to interoperate correctly. Profiles define sets of coding tools and algorithms that can be used, while levels place constraints on the parameters of the bitstream. The standard initially defined two profiles, but was since then extended to a total of 11 profiles, including three main profiles, four high profiles, and four all-intra profiles. The three main profiles and the most important high profile are:

• Baseline Profile (BP): This is the simplest profile mainly used for video conferencing and mobile video.

• Main Profile (MP): This profile is intended to be used for consumer broadcast and storage applications, but is overtaken by the high profile.

• Extended Profile (XP): This profile is intended for streaming video and includes special capabilities to improve robustness.

• High Profile (HiP) This profile is intended for high definition broadcast and disc storage, and is used in HD-DVD and Blu-ray.

Besides HiP there are three other high profiles that support up to 14 bits per sample, 4:2:2 and 4:4:4 sampling, and other features [149]. The all-intra pro- files are similar to the high profiles and are mainly used in professional camera and editing systems. In addition, 16 levels are currently defined which are used for all profiles. A level specifies, for example, the upper limit for the picture size, the decoder processing rate, the size of the multi-picture buffers, and the video bitrate. Levels have profile independent parameters as well as profile specific ones.

3.2 Benchmarks

Throughout this chapter we use the HD-VideoBench [22], which consists of movie test sequences, an encoder (X264 [168]), and a decoder (FFmpeg [58]). The benchmark contains the following test sequences:

• rush hour: This sequence shows a rush-hour in the city of Munich. It has a static background and slowly moving objects on the forefront. 42 CHAPTER 3. PARALLEL SCALABILITYOF VIDEO DECODERS

• riverbed: This sequence shows a river bed seen through waving water. It has abrupt and stochastic changes. • pedestrian: This sequence is a shot of a pedestrian area in a city center. It has a static background and fast moving objects on the forefront. • blue sky: This sequence shows the top of two trees against a blue sky. The objects are static but the camera is sliding.

All movies are available in three formats: 720×576 (SD), 1280×720 (HD), 1920×1088 (FHD). Each movie has a frame rate of 25 frames per second and a length of 100 frames. For some experiments longer sequences were required, which we created by replicating the sequences. Unless specified otherwise, for SD and HD we used sequences of 300 frames while for FHD we used sequences of 400 frames. The benchmark provides the test sequences in raw format. Encoding is done with the X264 encoder using the following options: 2 B-frames between I and P frames, 16 reference frames, weighted prediction, hexagonal motion estima- tion algorithm (hex) with maximum search range 24, one slice per frame, and adaptive block size transform. Movies encoded with this set of options repre- sent the typical case. To test the worst case scenario we also created movies with large motion vectors by encoding the movies with the exhaustive search motion estimation algorithm (esa) and a maximum search range of 512 pixels. The first movies are marked with the suffix ‘hex’ while for the latter we use ‘esa’.

3.3 Parallelization Strategies for H.264 ∗

In this section we briefly review existing parallelization strategies of H.264. An in-depth analysis of those, performed by Mauricio Alvarez, was presented in [112]. Our proposal to obtain higher scalability is described in the next section. Parallelization strategies for H.264 can be divided in two categories: function- level and data-level decomposition. In function-level decomposition paral- lelism is obtained by pipelining the decoding process and assigning the differ- ent functions to different processors. In data-level decomposition parallelism is obtained by dividing the data among several processors, each running the same code. ∗This section is based on text written mainly by Mauricio Alvarez. 3.3. PARALLELIZATION STRATEGIES FOR H.264 43

Most parallelization techniques use data-decomposition as it is better in terms of load balancing and scalability. Using function-level decomposition, balanc- ing the load is difficult because the time to execute each function is not known a priori and depends on the data being processed. Therefore, cores will often be idle waiting for a producer or consumer core, which reduces the efficiency of the system. Scalability is also difficult to achieve. The number of kernels in the H.264 decoding process is limited and therefore the scalability. Moreover, changing the number of cores used in this parallelization requires changing the mapping of kernels to cores. Function-level decomposition was applied to H.264 in a few works though [68, 135]. Data-level decomposition can be applied at different levels of the H.264 data structure (see Figure 3.4). A video sequence is composed out of Group Of Pictures (GOPs) which are independent sections of the video sequence. GOPs are used for synchronization purposes because there are no temporal depen- dencies between them. Each GOP is composed of a set of frames that can have temporal dependencies among them if motion prediction is used. Each frame consists of one or more slices, depending on the choice of the encoder. Each slice contains a set of macroblocks and there are no temporal or spatial dependencies between slices. Macroblocks are composed of luma and chroma blocks of variable size. Finally, each block is composed of picture samples. Data-level parallelism can be exploited at each level of the data structure, each one having different constraints and requiring different parallelization method- ologies. The next sections discus those parallelization briefly, except for the macroblock-level. The latter is discussed in more detail as the parallelization proposed in this chapter is based on macroblock-level parallelism.

3.3.1 GOP-level Parallelism

H.264 can be parallelized at the GOP level by defining a GOP size of N frames and assigning each GOP to a processor. GOP-level parallelism requires a lot of memory for storing all the frames, and therefore this technique maps well to multicomputers in which each processing node has a lot of memory resources. In multicores with a shared cache, cache pollution prevents this parallelization from being efficient. Moreover, parallelization at the GOP-level results in a very high latency which is generally only acceptable in encoding. GOP-level parallelism was applied to H.264 by [132]. 44 CHAPTER 3. PARALLEL SCALABILITYOF VIDEO DECODERS

Figure 3.4: H.264 data structure.

3.3.2 Frame-level Parallelism

As shown in Figure 3.3, in a sequence of I-B-B-P frames inside a GOP, some frames are used as reference for other frames (like I and P frames) but some frames (the B frames in this case) are not. Therefore, in this case the B frames can be decoded in parallel. Frame-level parallelism has scalability problems, however, due to the fact that usually there are no more than two or three B frames between P frames. This limits the amount of TLP to a few threads. The main disadvantage of frame-level parallelism is that, unlike pre- vious video standards, in H.264 B frames can be used as reference [60]. As the encoding parameters are typically a given fact, this type of parallelism is uncertain. Frame-level parallelism has been applied to an H.264 encoder by Chen et al. [41] and has been proposed for multi-view H.264 encoding by Pang et al. [123].

3.3.3 Slice-level Parallelism

Although support for slices have been designed for error resilience, it can be used for exploiting TLP because slices in a frame can be encoded or decoded in parallel. The main advantage of slices is that they can be processed in parallel without dependency or ordering constraints. This allows exploitation of slice- 3.3. PARALLELIZATION STRATEGIES FOR H.264 45

Intra Pred. Intra Pred. Intra Pred. DF

Intra Pred. Current DF MB

Figure 3.5: Dependencies between adjacent MBs in H.264. level parallelism without making significant changes to the code. The disadvantages of exploiting slice-level parallelism are the following. First, the number of slices per frame is determined by the encoder, and thus the amount of parallelism is uncertain. Second, although slices are completely in- dependent, the deblocking filter possibly has to be applied across slice bound- aries, reducing scalability. Third, there might be imbalance in the load of the cores if the decoding times of the slices are not equal. Finally, and most im- portantly, the bitrate increases proportional with the number of slices [113], at a rate which is unacceptable in most situations. Slice-level parallelism was applied to H.264 by [41, 133, 132]

3.3.4 Macroblock-level Parallelism

Data-level decomposition can also be applied at the MB-level. Usually MBs in a frame or slice are processed in scan order, which means starting from the top left corner of the frame and moving to the right, row after row. To exploit parallelism between MBs inside a frame it is necessary to take into account the dependencies between them. In H.264, motion vector prediction, intra predic- tion, and the deblocking filter use data from neighboring MBs defining a struc- tured set of dependencies. These dependencies are shown in Figure 3.5. MBs can be processed out of scan order provided these dependencies are satisfied. Processing MBs in a diagonal wavefront manner satisfies all the dependencies and at the same time allows exploiting parallelism between MBs. We refer to this parallelization technique as 2D-Wave. Figure 3.6 depicts an example of the 2D-Wave for a 5×5 MBs image (80×80 pixels). Each MB is labeled with a time slot that indicates the time slot at which it could be processed, assuming the MB decoding time is constant. At time slot T7 three independent MBs can be processed: MB (4,1), MB (2,2) and MB (0,3). The figure also shows the dependencies that need to be satisfied 46 CHAPTER 3. PARALLEL SCALABILITYOF VIDEO DECODERS

MB (0,0) MB (1,0) MB (2,0) MB (3,0) MB (4,0) MBs processed T1 T2 T3 T4 T5 MBs in flight MB (0,1) MB (1,1) MB (2,1) MB (3,1) MB (4,1) T3 T4 T5 T6 T7 MBs to be processed MB (0,2) MB (1,2) MB (2,2) MB (3,2) MB (4,2) T5 T6 T7 T8 T9

MB (0,3) MB (1,3) MB (2,3) MB (3,3) MB (4,3) T7 T8 T9 T10 T11

MB (0,4) MB (1,4) MB (2,4) MB (3,4) MB (4,4) T9 T10 T11 T12 T13

Figure 3.6: 2D-Wave approach for exploiting MB parallelism within a frame. The arrows indicate dependencies.

60

50

40

30 Parallel macroblocks

20

10

0 0 50 100 150 200 250 Time

Figure 3.7: The amount of MB parallelism for a single FHD frame using the 2D-Wave approach. in order to process each of these MBs. The maximum number of independent MBs in each frame depends on the resolution. In Table 3.2 these maximum number of independent MBs for different resolutions is stated. For a low res- olution such as QCIF there are only 6 independent MBs during 4 time slots. For High Definition (1920x1088) there are 60 independent MBs during 9 slots of time. Figure 3.7 depicts the available MB parallelism over time for an FHD resolution frame, assuming that the time to decode a MB is constant. MB-level parallelism has many advantages over other H.264 parallelization schemes. First, this scheme can have a good scalability. As shown in Table 3.2 the number of independent MBs increases with the resolution of the image. 3.3. PARALLELIZATION STRATEGIES FOR H.264 47

Table 3.2: Maximum parallel MBs for several resolutions using the 2D-Wave ap- proach. Also the number of times slots this maximum is available is stated. Resolution MBs Slots QCIF 176×144 6 4 CIF 352×288 11 8 SD 720×576 23 14 HD 1280×720 40 6 FHD 1920×1088 60 9

Second, it is possible to achieve a good load balancing if a dynamic scheduling system is used. A static scheduling approach incurs dependency stalls, as the MB decoding time varies. This kind of MB-level parallelism, however, also has some disadvantages. The first one is that the entropy decoding cannot be parallelized using data decom- position, due to the fact that H.264 entropy coding adapts its probabilities us- ing data from consecutive MBs. As a result the lowest level of data that can be parsed from the bitstream are slices. Individual MBs cannot be identified without performing entropy decoding. That means that in order to decode independent MBs, they should be entropy decoded first and in sequential or- der. Only after entropy decoding has been performed the parallel processing of MBs can start. This disadvantage can be mitigated by using special pur- pose instructions [161] or hardware accelerators [121, 39, 99] for the entropy decoding process. The second disadvantage is that the number of independent MBs does not re- main constant during the decoding of a frame, as shown in Figure 3.7. At the start of the frame there are few independent MBs, increasing though until the maximum is reached. Similarly, at the end the number of parallel MBs de- creases to zero. Therefore, it is not possible to sustain a certain processing rate during the decoding of a frame. Using the parallelization strategy we propose in Section 3.4, this problem is overcome. MB-level parallelism has been proposed in several works. Van der Tol et al. [162] have proposed H.264 MB-level parallelism for a multiprocessor system consisting of 8 TriMedia processors. No speedups were reported. Chong et al [45] have proposed another technique for exploiting MB-level parallelism in the H.264 decoder by adding a prepass stage. MBs are assigned to processors dynamically according to the estimated decoding time. By using this scheme a speedup of 3.5X has been reported on 6 processors. Moreover, the paper suggests the combination of MB-level parallelism within 48 CHAPTER 3. PARALLEL SCALABILITYOF VIDEO DECODERS frames and among frames to increase the availability of independent MBs. The use of inter-frame MB-level parallelism is determined statically by the length of the motion vectors. Chen et al. [40] have evaluated a similar approach for the H.264 encoder on Pentium machines with SMT and multicore capabilities. In those two works the exploitation of inter-frame MB-level parallelism is limited to two consecutive frames and the identification of independent MBs is done statically by taking into account the limits of the motion vectors. Although this combination of intra-frame and inter-frame MB-level parallelism increases the amount of independent MBs, it requires that the encoder puts some limits on the length of the motion vectors in order to define statically which MBs of the next frame can start execution while decoding the current frame. So far we only discussed MB-level parallelism within a frame. As suggested by [162], MB-level parallelism can also be exploited between frames, consid- ering the dependencies through the motion compensation kernel are met. For H.264 encoding this has been implemented in X264 [168] and in the H.264 Codec that is part of the Intel Integrated Performance Primitives [83]. The first obtains a maximum speedup of approximately 16.4 using 32 threads. Zhao et al. [170] combined inter-frame and intra-frame MB-level parallelism for H.264 encoding. 
This scheme is a variation of what we refer to in this thesis as static 3D-Wave. They obtained a speedup of approximately 3 using 4 cores and discarded an analysis for a high number of cores as an impractical case. The parallelization strategy that we propose in this thesis is also based on a combination of inter-frame and intra-frame MB-level parallelism. We consider H.264 decoding, which is more difficult to parallelize as encoding. Furthermore, by determining dynamically the actual dependencies between MBs, our approach is much more scalable as explained in Section 3.4.

3.3.5 Block-level Parallelism

Finally, the finest grained data-level parallelism is at the block level. Most of the computations of the H.264 kernels are performed at the block level. This applies, for example, to the interpolations that are performed in the motion compensation stage, to the IDCT, and to the deblocking filter. This level of data parallelism maps well to SIMD instructions [171, 140, 21, 96]. SIMD parallelism is orthogonal to the other levels of parallelism described above and because of that it can be mixed, for example, with MB-level parallelization to increase the performance of each thread. 3.4. SCALABLE MB-LEVEL PARALLELISM:THE 3D-WAVE 49

3.4 Scalable MB-level Parallelism: The 3D-Wave

None of the approaches described in the previous section scales to future many- core architectures containing 100 cores or more. In the 2D-Wave there is a considerable amount of MB-level parallelism, but at the start and end of a frame there are only a few independent MBs. In order to overcome this limitation, it is possible to exploit both intra-frame and inter-frame MB-level parallelism. Inside a frame, spatial MB-level paral- lelism can be exploited using the 2D-Wave scheme mentioned previously. Si- multaneously, among frames temporal MB-level parallelism can be exploited. Temporal MB-level parallelism can be exploited when considering the depen- dencies due to Motion Compensation (MC). MC can be regarded as copying an area, called the reference area, from the reference frame, and then to add this predicted area to the residual MB to reconstruct the MB in the current frame. The reference area is pointed to by a Motion Vector (MV). The MV length is limited by two factors: the H.264 level and the motion estimation algorithm. The H.264 level defines, amongst others, the maximum length of the vertical component of the MV [120]. Not all encoders, however, follow the guidelines of the levels, but use the absolute limit of the standard which is 512 pixels vertical and 2048 pixels horizontal. In practice the motion estima- tion algorithm defines a maximum search range of dozens of pixels, because a larger search range is very computationally demanding and provides only a small benefit [104]. When the reference area has been decoded it can be used by the referencing frame. Thus it is not necessary to wait until a frame is completely decoded before decoding the next frame. The decoding process of the next frame can start after the reference areas of the reference frames are decoded. Figure 3.8 shows an example of two frames where the second depends on the first. The figure shows that MB (2, 0) of frame i + 1 depends on MB (2, 1) of frame i which has been decoded. Thus this MB can be decoded even though frame i is not completely decoded. Similarly, MB (0, 1) of frame i + 1 depends on MB (0, 2) of frame i. This dependency is resolved, as well as the intra-frame dependencies. Therefore, also this MB can be decoded. Figure 3.8 shows a 2D-Wave decoding within each frame of the two frames. This technique can be extended to much more than two frames, adding a third dimension to the parallelization, as illustrated in Figure 3.9. Hence, we refer to this parallelization strategy as 3D-Wave. The scheduling of MBs for decoding can be performed in a static or a dy- 50 CHAPTER 3. PARALLEL SCALABILITYOF VIDEO DECODERS

frame i frame i+1

MB (0,0) MB (1,0) MB (2,0) MB (3,0) MB (4,0) MB (0,0) MB (1,0) MB (2,0) MB (3,0) MB (4,0) T1 T2 T3 T4 T5 T1 T2 T3 T4 T5

MB (0,1) MB (1,1) MB (2,1) MB (3,1) MB (4,1) MB (0,1) MB (1,1) MB (2,1) MB (3,1) MB (4,1) T3 T4 T5 T6 T7 T3 T4 T5 T6 T7

MB (0,2) MB (1,2) MB (2,2) MB (3,2) MB (4,2) MB (0,2) MB (1,2)MB (2,2) MB (3,2) MB (4,2) T5 T6 T7 T8 T9 T5 T6T7 T8 T9

MB (0,3) MB (1,3) MB (2,3) MB (3,3) MB (4,3) MB (0,3) MB (1,3) MB (2,3) MB (3,3) MB (4,3) T7 T8 T9 T10 T11 T7 T8 T9 T10 T11

MB (0,4) MB (1,4) MB (2,4) MB (3,4) MB (4,4) MB (0,4) MB (1,4) MB (2,4) MB (3,4) MB (4,4) T9 T10 T11 T12 T13 T9 T10 T11 T12 T13

MBs processed MBs in flight MBs to be processed

Figure 3.8: 3D-Wave strategy: intra-frame and inter-frame dependencies between MBs.

Figure 3.9: 3D-Wave strategy: frames can be decoded in parallel because inter-frame dependencies have limited spatial range. namic way. Static 3D-Wave exploits temporal parallelism by assuming that the motion vectors have a restricted length. Based on that, it uses a fixed offset between the 2D-Waves of consecutive frames. This static approach has been implemented in previous work for the H.264 encoder [170]. It is more suitable for the encoder because it has the flexibility of selecting the size of the motion search area. The static 3D-Wave for H.264 decoding has been investigated by Azevedo (see [113]). The amount of MB parallelism it exhibits is depending on the motion vector range (see Table 3.3). Motion vectors are typically short (< 16) and thus potentially a lot of parallelism is available. For FHD resolution even maximally 1360 MBs in parallel are available. The problem is the worst case 3.4. SCALABLE MB-LEVEL PARALLELISM:THE 3D-WAVE 51

Table 3.3: Static 3D-Wave: maximum MB parallelism and frames in flight for several MV ranges and three resolutions. MV range Max # parallel MBs Max # frames in flight SD HD FHD SD HD FHD 512 - 36 89 - 2 3 256 33 74 163 3 4 5 128 60 134 303 5 7 10 64 108 240 544 8 12 17 32 180 400 907 13 19 29 16 276 600 1360 20 28 43 scenario, which for FHD is a motion vector of length 512. As the static 3D- Wave always has to account for the worst case, the amount of parallelism that effectively can be exploited in such case is maximally 89. Our proposal is the Dynamic 3D-Wave approach. It is not based on restricted values of the motion vectors. Instead, it uses a dynamic scheduling system in which MBs are scheduled for decoding when all the dependencies (intra-frame and inter-frame) have been satisfied. As illustrated in Figure 3.8, as soon as the search area in the reference frame has been decoded, the depending MB in a different frame can be processed. The dynamic system is best suited for the decoder taking into account that the length of motion vectors is not known a priori and that the decoding time of each MB is input-dependent. The Dynamic 3D-Wave system results in better scalability as we will show. The dynamic scheduling can be implemented in various ways. Several issues play a role in this matter such as thread management overhead, data commu- nication overhead, memory latency, centralized vs. distributed control, etc. In Section 3.7, while discussing our implementations of parallel H.264 decoding, we comment in more detail on the dynamic scheduling implementations. Another issue is entropy decoding which does not exhibit TLP within a slice. Thus, entropy decoding is decoupled from the MB-decoding process using a function-level decomposition. That is, one core performs entropy decoding, which output is fed into the parallel MB-decoding process using several cores. If all cores are equal, one entropy decoding core can sustain approximately eight MB-decoding cores. Entropy decoding, however, can be parallelized at the slice level. Thus, multiple entropy decoding cores can be deployed, each processing a different slice or frame. Furthermore, entropy decoding can be accelerated by using specialized cores or ASICs. Assuming four entropy decoding cores are utilized, each twice as fast as a general purpose core, than a 52 CHAPTER 3. PARALLEL SCALABILITYOF VIDEO DECODERS total of 64 MB-decoding cores can be used efficiently. In the remainder of this chapter we focus on the parallelism in the MB decoding process, unless stated otherwise. In the next section the parallel scalability of the Dynamic 3D-Wave is analyzed. The scalability obtained for actual implementations of the Dynamic 2D-Wave and 3D-Wave are discussed in Section 3.7.

3.5 Parallel Scalability of the Dynamic 3D-Wave

In this section we analyze the parallel scalability of the H.264 decoder by exploring the limits of the amount of TLP it exhibits. This research is similar to the ILP limit study for video applications presented in [100]. Investigating the limits to TLP, however, requires putting a boundary on the granularity of the threads. Without this boundary a single instruction could be a thread, which is clearly not feasible. Thus, using emerging techniques such as light-weight micro-threading [92], possibly even more TLP can be exploited. We show, however, that macroblock-level granularity is sufficient to sustain a many-core system.

To determine the amount of parallelism, we modified the FFmpeg H.264 decoder to analyze the dependencies in real movies, which we took from the HD-VideoBench. We analyzed the dependencies of each MB and assigned each MB a timestamp as follows. The timestamp of a MB is simply the maximum of the timestamps of all MBs upon which it depends (in the same frame as well as in the reference frames) plus one. Because the frames are processed in decoding order†, and within a frame the MBs are processed from left to right and from top to bottom, the MB dependencies are observed and it is assured that the MBs on which a MB B depends have been assigned their correct timestamps by the time the timestamp of MB B is calculated. As before, we assume that it takes one time unit to decode a MB.

The way inter-frame dependencies are handled needs further explanation. The pixels required for motion compensation are not only those of the sub-block the motion vector is pointing to. If the motion vector is fractional, the pixels used for MC are computed by interpolation, and for this the pixels surrounding the sub-block are also required. Further, the standard defines that deblocking-filtered pixels should be used for MC.

†The decoding order of frames is not equal to the display order. The sequence I-B-B-P as in Figure 3.3 is decoded as an I-P-B-B sequence.

Table 3.4: Maximum available MB parallelism and frames in flight for normally encoded movies. The maximum and the average motion vectors (in square pixels) are also stated.

                           SD                               HD
                MBs  frames  max mv  avg mv     MBs  frames  max mv  avg mv
 rush hour     1199      92   380.5     1.3    2911     142     435     1.8
 riverbed      1511     119   228.6     2      3738     185     496     2.2
 pedestrian    1076      95   509.7     9.2    2208     131   553.5    11
 blue sky      1365      99     116     4.4    2687     135     298     5.1

                          FHD
                MBs  frames  max mv  avg mv
 rush hour     5741     211     441     2.2
 riverbed      7297     245   568.9     2.6
 pedestrian    4196     205   554.2     9.9
 blue sky      6041     214     498     5.6

The bottom and right four pixels of each MB are filtered only when the bottom or right MB is decoded. Thus, although a MV points to a sub-block that lies entirely in a MB A, the actual dependencies might reach as far as the right and/or bottom neighbor of MB A. We took this into account by adding to each MV the pixels needed for interpolation and by checking whether the DF of the neighboring blocks would affect the required pixels.

To start, the maximum available MB-level parallelism is analyzed. This experiment does not consider any practical or implementation issues, but simply explores the limits of the amount of parallelism available in the application. We use the modified FFmpeg as described before and for each time slot we determine the number of MBs that can be processed in parallel during that time slot. In addition, we keep track of the number of frames in flight. Finally, we keep track of the motion vector lengths. All experiments are performed for both normally encoded and esa encoded movies. The first uses a hexagonal (hex) motion estimation algorithm with a search range of 24 pixels, while the latter uses an exhaustive search algorithm and results in worst-case motion vectors.

Table 3.4 summarizes the results for normally encoded movies and shows that a huge amount of MB-level parallelism is available. Figure 3.10 depicts the MB-level parallelism time curve for FHD (i.e., the number of MBs that can be processed in parallel in each time slot), while Figure 3.11 depicts the number of frames in flight in each time slot. For the other resolutions the time curves have a similar shape.
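The per-slot parallelism numbers reported here follow directly from the timestamp rule described above. The stand-alone C sketch below restates that rule: it assigns each MB a timestamp and builds a histogram of MBs per time slot, which is exactly the available MB-level parallelism. The helpers num_deps and dep are hypothetical; the real analysis was performed inside the modified FFmpeg decoder.

    #define MAX_SLOTS 16384

    /* parallel_mbs[t] = number of MBs that receive timestamp t, i.e. the
     * MB-level parallelism available in time slot t.                     */
    static int parallel_mbs[MAX_SLOTS];

    extern int num_deps(int mb);    /* hypothetical: number of deps of MB */
    extern int dep(int mb, int i);  /* hypothetical: i-th MB it depends on */

    void assign_timestamps(int num_mbs, int *timestamp)
    {
        /* MBs are visited in decoding order, so every dependency already
         * carries its final timestamp when MB 'mb' is reached.           */
        for (int mb = 0; mb < num_mbs; mb++) {
            int t = 0;
            for (int i = 0; i < num_deps(mb); i++)
                if (timestamp[dep(mb, i)] > t)
                    t = timestamp[dep(mb, i)];
            timestamp[mb] = t + 1;              /* 1 time unit per MB      */
            parallel_mbs[timestamp[mb]]++;      /* histogram = parallelism */
        }
    }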


Figure 3.10: Number of parallel MBs in each time slot, using the 3D-Wave parallelization, for FHD resolution using a 400-frame sequence.


Figure 3.11: Number of frames in flight in each time slot, using the 3D-Wave parallelization, for FHD resolution using a 400-frame sequence.

Table 3.5: Maximum MB parallelism and frames in flight for esa encoded movies. Also the maximum and the average motion vectors (in square pixels) are stated.

                           SD                               HD
                MBs  frames  max mv  avg mv     MBs  frames  max mv  avg mv
 rush hour      505      63   712.0     6.9     708      84   699.3     9
 riverbed       150      27   704.9    62.3     251      25   716.2    63.1
 pedestrian     363      50   691.5    23.5     702      70   709.9    27.7
 blue sky       138      19   682.0    10.5     257      23   711.3    13

                          FHD
                MBs  frames  max mv  avg mv
 rush hour     1277      89   706.4     9.4
 riverbed       563      37   716.2    83.9
 pedestrian    1298      88   708.5    23.6
 blue sky       500      24   705.7    15.7

The MB parallelism time curve shows a ramp up, a plateau, and a ramp down. For example, for blue sky the plateau starts at time slot 200 and lasts until time slot 570. Riverbed exhibits such a huge amount of MB-level parallelism that it has only a small plateau. Due to the stochastically changing content of the movie, the encoder mostly uses intra-coding, resulting in very few dependencies between frames. Pedestrian exhibits the least amount of parallelism. The fast-moving objects in the movie result in many large motion vectors. Especially objects moving from right to left on the screen cause large offsets between MBs in consecutive frames.

Rather surprising is the maximum MV length found in the movies (see Table 3.4). The search range of the motion estimation algorithm is limited to 24 pixels, but still lengths of more than 500 square pixels are reported. According to the developers of the x264 encoder this is caused by drifting [169]. For each MB the starting MV of the algorithm is predicted using the result of the surrounding MBs. If a number of MBs in a row use the motion vector of the previous MB and add to it, the values accumulate and reach large lengths. This drifting happens only occasionally and does not significantly affect the parallelism in the dynamic approach, but it would force a static approach to use a large offset, resulting in little parallelism.

To evaluate the worst case, the same experiment was performed for movies encoded using the exhaustive search algorithm (esa). Table 3.5 shows that the average MV is substantially longer than for normally encoded movies. The exhaustive search decreases the amount of parallelism significantly.

Table 3.6: Comparison of the maximum and average amount of MB-level parallelism for hex and esa encoded movies (FHD only).

                        max MB                        avg MB
 FHD            hex     esa       ∆          hex      esa       ∆
 rush hour     5741    1277   -77.8%      4121.2    892.3   -78.3%
 riverbed      7297     563   -92.3%      5181.0    447.7   -91.4%
 pedestrian    4196    1298   -69.1%      3108.6    720.8   -76.8%
 blue sky      6041     500   -91.7%      4571.4    442.6   -90.3%

Table 3.7: Compression improvement of esa encoding relative to hex.

                  SD      HD     FHD   average
 rush hour     -0.7%    0.0%   -0.3%    -0.3%
 riverbed       1.4%    1.9%    3.2%     2.1%
 pedestrian     0.6%    1.6%    1.2%     1.1%
 blue sky      -1.4%   -1.5%    1.4%    -0.5%
 average        0.0%    0.5%    1.4%     0.6%

The time curves in Figures 3.12 and 3.13 have a very jagged shape with large peaks, and thus the average on the plateau is much lower than the peak. In Table 3.6 the maximum and the average parallelism for normally (hex) and esa encoded movies are compared. The average is taken over all 400 frames, including ramp up and ramp down. Although the drop in parallelism is significant, the amount of parallelism is still large for this worst-case scenario.

As we are comparing two encoding methods, it is important to look at some other aspects as well. In Table 3.7 we compare the compression ratios of hex and esa encoding. Only for FHD is a considerable compression improvement achieved when using esa encoding. The cost of this improvement in terms of encoding time, however, is large. A 300-frame FHD sequence takes less than 10 minutes to encode using hexagonal motion estimation and running on a modern desktop computer (Pentium D, 3.4 GHz, 2GB memory). The same sequence takes about two days to encode when using the esa option, which is almost 300 times slower. Thus we can assume that this worst-case scenario is unlikely to occur in practice.


Figure 3.12: Number of parallel MBs in each time slot, using the 3D-Wave parallelization, for FHD using esa encoding.


Figure 3.13: Number of frames in flight in each time slot, using the 3D-Wave parallelization, for FHD using esa encoding.

The analysis above reveals that H.264 exhibits significant amounts of MB-level parallelism. To exploit this type of parallelism on a multicore, the MB decoding tasks need to be assigned to the available cores. Even in future many-cores, however, the hardware resources, e.g., the number of cores, the size and bandwidth of memory, and the NoC bandwidth, will be limited. We now investigate the impact of such resource limitations.

We model limited resources as follows. A limited number of cores is modeled by limiting the number of MBs in flight, as each MB in flight requires one core to execute on. Memory requirements are mainly proportional to the number of frames in flight, as each frame in flight has to be stored in main memory. Thus, limited memory is modeled by restricting the number of frames that can be in flight concurrently. The experiment has been performed for all four movies of the benchmark, for all three resolutions, and for both types of motion estimation. The effect of the limitations is similar for all movies, and thus only the results for the normally encoded blue sky movie at FHD resolution are presented.

First, the impact of limiting the number of MBs in flight is analyzed. Figure 3.14 depicts the available MB-level parallelism for several limits on the number of MBs in flight. As can be expected, for lower limits the height of the plateau is lower and the ramp up and ramp down are on average slightly shorter. More important is that for lower limits the plateau becomes very flat. This translates to a high utilization rate of the available cores. Furthermore, the figure shows that the total decoding time is approximately inversely proportional to the limit on the number of MBs in flight, i.e., if the limit is decreased by a factor of 2, the decoding time increases approximately by a factor of 2 as well. This is especially true for lower limits, where the core utilization is almost 100%. For higher limits, the core utilization decreases and the relation no longer holds.

Figure 3.15 depicts the number of frames in flight as a function of time when the number of MBs in flight is limited. Not surprisingly, because limiting the number of MBs in flight also limits the number of frames that can be in flight, the shape of the curves is similar to that of the available MB parallelism, although with a small periodic fluctuation.
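The inverse proportionality mentioned above can be checked with a back-of-the-envelope calculation: at (near) 100% core utilization, the decoding time is roughly the total number of MBs divided by the limit on the MBs in flight. The small program below assumes FHD is coded as 1920×1088 (8160 MBs per frame) and a 400-frame sequence; it is an approximation, not output of the simulation.

    #include <stdio.h>

    int main(void)
    {
        /* FHD coded as 1920x1088 -> 120 x 68 = 8160 MBs per frame. */
        const long total_mbs = 8160L * 400;            /* 400 frames */
        const int  limits[]  = { 1000, 2000, 3000, 4000, 5000 };

        for (int i = 0; i < 5; i++)     /* assumes ~100% core utilization */
            printf("limit %4d -> ~%ld time slots\n",
                   limits[i], total_mbs / limits[i]);
        return 0;
    }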


Figure 3.14: Available amount of MB-level parallelism, in each time slot, for FHD blue sky and several limits on the number of MBs in flight.


Figure 3.15: Number of frames in flight in each time slot, for FHD blue sky and several limits on the number of MBs in flight.

Next, we analyze the impact of restricting the number of frames concurrently in flight. The MB-level parallelism curves are depicted in Figure 3.16 and show large fluctuations, possibly resulting in under-utilization of the available cores. The fluctuations are caused by the ramping effect illustrated in Figure 3.7 for one frame. It shows that at the start and end of a frame less parallelism is available. The same ramping effect occurs in this case. As a limited number of frames are in flight concurrently, and frames start decoding quickly one after another, they are approximately in the same phase. Thus, if all frames in flight are about halfway through the frame, many parallel MBs are available. On the other hand, if all frames are close to the end, only very little parallelism is available. Figure 3.17 shows the actual frames in flight for several limits. The plateaus in the curves translate to full memory utilization.

From this experiment we conclude that for systems with a limited number of cores, dynamic scheduling is able to achieve near-optimal performance by (almost) fully utilizing the available computational power. In contrast, for systems where the memory size is the limitation, additional performance losses might occur because of temporary under-utilization of the available cores. In Section 3.6 we perform a case study, indicating that memory will most likely not be the limitation.

We have shown that using the Dynamic 3D-Wave strategy huge amounts of parallelism are available in H.264 decoding. Analysis of real movies has revealed that the average number of parallel MBs ranges from 3000 to 7000. This amount of MB parallelism might be larger than the number of cores available. Thus, we have evaluated the parallelism for limited resources and found that limiting the number of MBs in flight results in a correspondingly larger decoding time, but a near-optimal utilization of the cores.


Figure 3.16: Available MB parallelism in FHD blue sky for several limits on the number of frames in flight.


Figure 3.17: Number of actual frames in flight for FHD blue sky, for several limits on the number of frames in flight.

3.6 Case Study: Mobile Video

So far we have mainly focused on high resolutions and explored the potentially available MB-level parallelism. The 3D-Wave strategy discloses an enormous amount of parallelism, possibly more than what a high-performance multicore in the near future could provide. Therefore, in this section we perform a case study to assess the practical value and possibilities of a highly parallelized H.264 decoding application.

For this case study we assume a mobile device such as the iPhone, but in the year 2015. We take a resolution of 480×320, just as the screen of the current iPhone. It is reasonable to expect that the size of mobile devices will not grow significantly, and the limited accuracy of the human eye implies that the resolution does not have to increase.

Multicores are power efficient, and thus we expect mobile devices to adopt them quickly; therefore we project a 100-core chip in 2015. Two examples of state-of-the-art embedded multicores are the Tilera Tile64 [157] and the Clearspeed CSX600 [46], containing 64 and 96 cores respectively. Both are good examples of what is already possible today and of the eagerness of the embedded market to adopt the multicore paradigm. A doubling of cores every three years is expected [148]. Even if this growth were slower, the assumption of 100 cores in 2015 still seems reasonable.

At the time of performing this analysis (2007), the iPhone 2G was available with 8GB of memory. Using Moore's Law for memory (doubling every two years), in 2015 this mobile device would contain 128 GB. It is unclear how much of the flash drive memory is used as main memory, but for this case study we assume it is only 1%. In 2015 that would mean 1.28GB, and if only half of it is available for video decoding, then still almost 1400 frames of video would fit in main memory. Thus, it seems that memory size is not going to be a limitation.

We place a limit of 1 second on the latency of the decoding process. That is a reasonable time to wait for the movie to appear on the screen, and much longer would not be acceptable. The iPhone uses a frame rate of 30 frames/second, and thus we limit the number of frames in flight to 30, which causes a latency of 1 s. This can be explained as follows. Let us assume that the clock frequency is scaled down as much as possible to save power. When the decoding of frame x is started, frame x − 30 has just finished and is currently displayed. But, since the frame rate is 30, frame x will be displayed 1 second later. Of course this is a rough method and does not consider variations in frame decoding time, but it is a good estimate.

Putting all assumptions together, for the future device the Dynamic 3D-Wave strategy has a limit of 100 parallel MBs and up to 30 frames can be in flight concurrently. For this case study we use the four normally encoded movies with a length of 200 frames and a resolution of 480×320.

Figure 3.18 presents the available MB-level parallelism under these assumptions as well as for unlimited resources. The figure shows that even for this small resolution, all 100 cores are utilized nearly all the time. The curves for unrestricted resources show that there is much more parallelism available than the hardware of this future mobile device has to offer. This provides opportunities for scheduling algorithms to reduce communication overhead. Figure 3.19 depicts the number of frames in flight for this case study. The average is approximately 12 frames with peaks of up to 19.
From the graphs we can also conclude that the decoding latency is approximately 0.5 second, since around 15 frames are in flight. From this case study we draw the following conclusions:

• Even a low-resolution movie exhibits sufficient parallelism to utilize 100 cores efficiently.

• For mobile devices, the memory size will most likely not be a limitation.

• For mobile devices with a 100-core chip, the frame latency will be short (around 0.5s).

3.7 Experiences with Parallel Implementations

Motivated by the results of the previous sections, our team has implemented several parallel H.264 decoders. Azevedo [28] has implemented the Dynamic 3D-Wave on a TriMedia multicore platform during an internship at NXP. Alvarez [20] has implemented the 2D-Wave on a multiprocessor system. Finally, Chi [43, 44] has implemented the 2D-Wave on the Cell processor. The last two works are based on an unpublished initial parallelization by Paradis and Alvarez. Although the author was not the prime developer, the experiences are very interesting and for completeness should be presented briefly in this chapter. The full experimental results and analysis can be found in [28, 20, 43, 44].


Figure 3.18: Number of parallel MBs for the mobile video case study. Also depicted is the available MB-level parallelism for the same resolution but with unrestricted resources.


Figure 3.19: Number of frames in flight for the mobile video case study.

The scalability analysis of Section 3.5 is theoretical in nature and reveals the parallelism exhibited by the application. Exploiting this parallelism also involves the hardware architecture, the runtime system, and the operating system, whose characteristics affect the actually obtainable scalability. Therefore, these implementations are briefly reviewed in the following sections. We discuss which obstacles were encountered in achieving scalability. Results are also shown, and the differences with the theoretical study are analyzed.

3.7.1 3D-Wave on a TriMedia-based Multicore

For this 3D-Wave implementation [28] the NXP H.264 decoder is used, which is highly optimized for NXP's TriMedia processor. As a starting point, Hoogerbrugge [79] had already implemented the 2D-Wave, on which this implementation is based. The cycle-accurate NXP proprietary simulator was used to evaluate the 3D-Wave on a TriMedia-based multicore system. It is capable of simulating up to 64 TM3270 cores [161], each with a 64KB private L1 cache, and a shared L2 cache.

The programming model follows the task pool model. The task pool library implements submission and requests of tasks to/from the task pool, synchronization mechanisms, and the task pool itself. The system contains one control processor and several worker processors. The latter request tasks from the task pool and execute them. The task pool library is very efficient, as the time to request a task is only 2% of the MB decoding time.

The implementation of the 3D-Wave required considerable modifications to the code. In order to decode multiple frames concurrently, some additional software buffers were required. Further, the task pool library was extended with a subscription mechanism (kick-off lists), sketched below. The dependencies through the motion vectors are not known a priori. Thus, the MB decoding is split in two parts. In the first part the motion vectors are predicted. In case the reference area has been decoded, the core continues with the second part, which contains the rest of the MB decoding process. In case an unresolved dependency is met, the current MB subscribes itself to the MB it depends on. The latter starts the execution of the former MB once it has been decoded.

Due to the additional administrative overhead, the 3D-Wave is 8% slower than the 2D-Wave on a single core. The scalability of the 3D-Wave, however, is much better than that of the 2D-Wave, as shown in Figure 3.20. Whereas the scalability of the 2D-Wave saturates at around 15, the 3D-Wave provides a scalability of almost 55.
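The two-phase decode with the subscription (kick-off list) mechanism can be sketched as follows. It simplifies the real implementation in several ways (a single reference MB per task, no locking of the shared state) and all names are illustrative; this is not the NXP task-pool API.

    #define MAX_SUBSCRIBERS 8                  /* illustrative bound */

    typedef struct mb {
        struct mb *kick_off_list[MAX_SUBSCRIBERS];  /* subscribed MBs */
        int        num_subscribers;
        int        decoded;                         /* 0 = not yet    */
    } mb_t;

    extern mb_t *predict_motion_vectors(mb_t *mb); /* phase 1: returns the
                                                      reference MB, or NULL */
    extern void  decode_rest_of_mb(mb_t *mb);      /* phase 2               */
    extern void  task_pool_submit(mb_t *mb);       /* hypothetical API      */

    /* Two-phase MB decode with subscription; locking of the kick-off
     * list and of the 'decoded' flag is omitted for brevity.          */
    void decode_mb_task(mb_t *mb)
    {
        mb_t *ref = predict_motion_vectors(mb);
        if (ref && !ref->decoded) {
            /* Reference area not decoded yet: subscribe and bail out;
             * 'ref' will resubmit this MB once it has been decoded.   */
            ref->kick_off_list[ref->num_subscribers++] = mb;
            return;
        }
        decode_rest_of_mb(mb);
        mb->decoded = 1;
        for (int i = 0; i < mb->num_subscribers; i++)
            task_pool_submit(mb->kick_off_list[i]);
    }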


Figure 3.20: Scalability of the 3D-Wave and the 2D-Wave implementations on the TriMedia-based multicore system. A 25-frame FHD rush hour sequence was used. Longer sequences further improve the scalability of the 3D-Wave.

Note that the scalability of a case is always relative to the execution of that case on one core. The 3D-Wave implementation achieves a scalability efficiency of 85%. There are two reasons why the efficiency is not higher. First, when using a large number of cores, a large number of frames are also processed simultaneously. As a consequence, cache thrashing occurs, and mainly the memory accesses of the motion compensation kernel suffer from cache misses. Second, at the start and the end of the sequence there is little parallelism. As the sequence used is only 25 frames long, the influence of this ramp up and ramp down is relatively large. For longer sequences the speedup would be better. This is confirmed by simulations of a 100-frame SD sequence. A similar simulation for FHD resolution could not be performed due to the limited memory available in the simulator.

The scalability of the 2D-Wave is rather low compared to the available parallelism as shown in Figure 3.7. Theoretically, a speedup of 32 should be possible using at least 60 cores. The theoretical analysis, however, assumes that the decoding of each MB takes equal time. In reality the decoding time heavily depends on the input data and ranges from approximately 8 to 50 µs (measured with FFmpeg on a 1.6 GHz Itanium2 processor). Alvarez [19] analyzed the effect of the varying decoding time on the scalability using the benchmark movies. On average the maximum obtainable speedup dropped from 32 to 22.


Figure 3.21: Scalability of the 2D-Wave implementation on a 128-core SGI Altix system.

It was not further investigated why the scalability of the 2D-Wave saturates at around 15. A plausible explanation is that it is caused by synchronization and communication overhead.

3.7.2 2D-Wave on a Multiprocessor System

This parallel H.264 decoder [20] was implemented and tested on an SGI Altix machine with 128 cores and a cc-NUMA (cache-coherent Non-Uniform Memory Access) memory design. Blades with two dual-core Intel Itanium-II processors (1.6 GHz) and 8 GB of RAM were used. The blades are connected through a high-performance interconnect (NUMAlink-4) with a 6.4 GB/s peak bandwidth.

The FFmpeg H.264 decoder was parallelized using a task model similar to the one described in the previous section. The worker threads are responsible for updating the dependency table. This allowed the implementation of several scheduling strategies such as tail-submit. POSIX threads were used, as well as semaphores and mutexes for synchronization.

The scalability of the 2D-Wave implementation was measured with all four benchmark movies (see Section 3.2). In this case sequences of 100 frames were used. Figure 3.21 depicts the scalability obtained using up to 32 cores and for several scheduling policies. Both the static and the dynamic scheduling scale very poorly. The static scheduling does not use the task pool, but uses a fixed decoding order instead, i.e., a zigzag scan order following the 2D-Wave pattern.

Due to the variable decoding time, this scheduling strategy fails to exploit the available parallelism. The dynamic scheduling implementation does use the task pool, but the overhead of submitting tasks to and retrieving tasks from the pool is a bottleneck of the system. Using tail-submit scheduling, the task pool is accessed less frequently and thus the scalability improves. The right-first policy (the next MB is the right neighboring MB) has the best performance, as it exploits more data locality than the down-left-first policy (the next MB is the down-left neighboring MB); a sketch of the right-first policy is given below.

The limited scalability of this implementation is mainly caused by synchronization. Using one core, getting a MB from the task pool takes 6.1 µs, while using 8 cores this time increases to 132 µs. Using more cores the time increases further, but then the limited amount of available parallelism starts playing a role as well. The execution time of the get_mb function, which retrieves a MB from the task pool, is dominated by sem_wait (semaphore wait), which in turn spends most of its time in OS intervention. Apparently this OS intervention increases with the number of participating cores. From this study we conclude that synchronization overhead radically affects the scalability if it is not implemented efficiently. Standard libraries such as the pthread library do not seem to provide this efficiency.

The 3D-Wave was not implemented on this multiprocessor system. The amount of available parallelism was not the limitation, and thus there was no need to use the 3D-Wave strategy. The conclusions from this study also apply to the 3D-Wave, though, which is the reason why this section was included.
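The right-first tail-submit policy mentioned above can be sketched as follows. The names are illustrative and not FFmpeg's actual code, and the dependency counters would need atomic updates in a real implementation.

    typedef struct mb {
        int deps_left;                    /* unresolved intra-frame deps */
    } mb_t;

    extern void  decode_mb(mb_t *mb);
    extern mb_t *right_neighbour(mb_t *mb);       /* NULL at frame edge */
    extern mb_t *down_left_neighbour(mb_t *mb);   /* NULL at frame edge */
    extern void  task_pool_submit(mb_t *mb);      /* hypothetical API   */

    /* Right-first tail submit: after decoding a MB, prefer to continue
     * with its right neighbour (better data locality) and only hand the
     * down-left neighbour back to the task pool.                        */
    void worker_decode(mb_t *mb)
    {
        while (mb) {
            decode_mb(mb);
            mb_t *next = NULL;

            mb_t *dl = down_left_neighbour(mb);
            if (dl && --dl->deps_left == 0)
                next = dl;

            mb_t *r = right_neighbour(mb);
            if (r && --r->deps_left == 0) {
                if (next)
                    task_pool_submit(next);   /* both ready: keep right */
                next = r;
            }
            mb = next;   /* tail submit: no task-pool access needed */
        }
    }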

3.7.3 2D-Wave on the Cell Processor

Chi [43, 44], an MSc student from Delft University of Technology, developed a very scalable implementation of the 2D-Wave on the Cell processor. He observed that the mailbox API offered by the standard libraries has much larger latencies than the actual hardware. Therefore, he instead used a mailbox API that accesses the problem state of the SPEs directly.

Another major improvement he made concerns the data communication. The initial implementation contained many DMAs, each transferring a part of the information required to decode the macroblock. Chi reduced the number of DMAs to a minimum by creating a new data structure containing all information in one place.


Figure 3.22: Scalability of the 2D-Wave implementations on the Cell processor. At 16 SPEs the difference in scalability is 20%. The Ring-Line (RL) based implementation, however, has twice the performance of the Task Pool (TP) based implementation.

Figure 3.22 shows that this optimized Task Pool (TP) based 2D-Wave implementation achieves a scalability of 10.2 on 16 SPEs, which is approximately the same as the 2D-Wave implementation on the TriMedia with an equal number of cores. In this implementation, scalability is mainly limited by contention on the off-chip bus to main memory.

The TP 2D-Wave implementation does not yet fully utilize the capabilities of the Cell processor. The SPEs are connected to each other in a ring, which is exploited by the Ring-Line (RL) implementation of the 2D-Wave strategy. It uses a distributed and static scheduling instead of a dynamic task pool. Each line of macroblocks is assigned to one SPE, such that adjacent lines are processed on adjacent SPEs (a sketch of this mapping is given below). Synchronization among SPEs is done by sending the output of one SPE to its neighbor. Besides removing the bottleneck of centralized control, much more locality is exploited as well, resulting in much lower bandwidth requirements.

Figure 3.22 also shows the scalability of the ring-line implementation. It scales better than the task-pool-based 2D-Wave from 10 SPEs on, and achieves a scalability of 12.0 using 16 SPEs. The theoretical scalability limit of this implementation is lower than that of the task pool implementation due to implementation issues not discussed here. The obtained scalability is close to this theoretical limit, and therefore a much higher scalability is not possible. The actual performance of the ring-line implementation, however, is much higher than that of the task pool version. The latter achieves a throughput of 76 frames per second, while the former achieves 149 frames per second at FHD resolution. There are two reasons for this difference. First, the lower bandwidth requirements of the ring-line 2D-Wave implementation allow it to scale to a much higher frame throughput. Second, in the ring-line implementation, communication and computation are performed concurrently. In the task pool 2D-Wave this could be implemented as well, for example by applying tail-submit. It is not easy to implement, however, and it would stress the control processor even more.

These two implementations also show that the 2D-Wave is appropriate for small-scale systems and can achieve high performance. Scalability is limited, though, to about 16 to 20 cores. For larger-scale systems the 3D-Wave will have to be used. The latter has not yet been implemented on the Cell processor. The 2D-Wave on Cell has shown that low-latency synchronization is of vital importance to obtain scalability. Furthermore, it has shown that smart scheduling of the data movements can significantly increase performance. Finally, we conclude from this study that a centralized control processor easily becomes a bottleneck for fine-grained thread-level parallelism, as used in our implementation of H.264 decoding.
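The core idea of the Ring-Line scheduling can be sketched as follows. The mailbox/signal primitives are abstracted into two hypothetical functions, edge cases at the ends of a line are ignored, and this is not the actual code of [43, 44].

    #define NUM_SPES 16

    extern void decode_mb(int x, int y);
    extern void wait_for_upstream(int x);   /* blocks until MB (x+1, y-1)
                                               on the previous line is done */
    extern void signal_downstream(int x);   /* notifies SPE (y+1) % NUM_SPES */

    /* Line y of macroblocks is processed by SPE (y % NUM_SPES).  Within a
     * line, MB (x, y) may start once the upstream SPE has finished
     * MB (x+1, y-1); that single signal covers all intra-frame
     * dependencies because lines are decoded left to right.               */
    void decode_line(int y, int mbs_per_line)
    {
        for (int x = 0; x < mbs_per_line; x++) {
            if (y > 0)
                wait_for_upstream(x);
            decode_mb(x, y);
            signal_downstream(x);
        }
    }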

3.8 Conclusions

In this chapter we have investigated whether applications exhibit sufficient parallelism to exploit the large number of cores expected in future multicores. This is important for performance as well as for power efficiency. As a case study, we have analyzed the parallel scalability of an H.264 decoder, currently the best video coding standard.

First, we discussed possible parallelization strategies. Most promising is MB-level parallelism, which can be exploited inside a frame in a diagonal wavefront manner. As MBs in different frames are only dependent through motion vectors, intra-frame and inter-frame MB-level parallelism can be combined. Such a parallelization approach is referred to as the 3D-Wave. It has been applied to encoding before, but not to decoding as we have done. In this chapter we have observed that motion vectors typically have a limited range. Therefore, by applying dynamic scheduling a much higher scalability can be obtained. This novel parallelization strategy is called the Dynamic 3D-Wave.

Using this new parallelization strategy we have performed a limit study of the amount of MB-level parallelism available in H.264. We modified FFmpeg and analyzed real movies. The results have revealed that motion vectors are typically small on average. As a result a large amount of parallelism is available. For FHD resolution the total number of parallel MBs ranges from 4000 to 7000.

The amount of MB parallelism disclosed by the Dynamic 3D-Wave could be larger than the number of cores available. Moreover, the number of frames in flight needed to achieve this parallelism might require more memory than available. Thus, we have evaluated the parallelism for limited resources and found that limiting the number of MBs in flight results in a larger decoding time, but a near-optimal utilization of cores. Limiting the number of frames in flight, on the other hand, results in large fluctuations of the available parallelism, likely under-utilization of the available cores, and thus performance loss.

We have also performed a case study to assess the practical value and possibilities of the highly parallel H.264 decoder. In general, the results show that our strategy provides sufficient parallelism to efficiently exploit the capabilities of future many-cores, even in future mobile devices.

Finally, we have discussed a few implementations of the 2D-Wave and 3D-Wave strategies. The 3D-Wave was successfully implemented on a TriMedia-based multicore platform. With 64 cores a scalability of 55 was achieved, whereas the scalability of the 2D-Wave saturates around 15. On a single core, the 3D-Wave is 8% slower than the 2D-Wave, due to management overhead. From 4 cores on, however, the 3D-Wave outperforms the 2D-Wave. Moreover, the implementation proved that in order to scale to a large number of cores, the 3D-Wave is required. Experiences with the 2D-Wave on a shared-memory multiprocessor and on the Cell processor have shown that low-latency synchronization is essential for fine-grained parallelism as exploited in the 3D-Wave as well as in the 2D-Wave.

Chapter 4

The SARC Media Accelerator

As has been shown in Chapter 2, the impact of the power wall is increasing over time. That means that the power budget, and neither technology nor the transistor budget, will be the main limitation to obtaining higher performance. Clock frequencies will have to be throttled down and eventually it will not be possible to use all available cores concurrently. In such a situation a performance increase can only be obtained by improving the power efficiency of the system. This will lead to specialization of cores (referred to as domain-specific accelerators) in such a way that they can run a specific domain of applications in a short time using relatively little area, and thus are power efficient.

In this chapter the specialization of an existing core for H.264 [120] video decoding is presented. As discussed in Chapter 3, H.264 is the best video coding standard in terms of compression ratio and picture quality currently available and is a good representative of future media workloads. As the baseline core for specialization we take the Cell SPE. The Cell Broadband Engine has eight SPEs that operate as multimedia accelerators for the PowerPC (PPE) core. The Cell processor has been used in several Sony products for professional video production [144, 145]. IBM has shown the utilization of the Cell processor for video surveillance systems [103]. The SPE architecture has also been used in the SpursEngine [76] from Toshiba. It is a media-oriented coprocessor, designed for video processing in consumer electronics and set-top boxes. Therefore, the SPE architecture is an excellent starting point for specialization. As the work presented in this thesis is part of the SARC project, the specialized SPE proposed in this chapter is referred to as the SARC Media Accelerator.

It is part of the SARC architecture, which is a futuristic many-core platform containing several domain-specific accelerators, e.g., bioinformatics accelerators, vector accelerators, and media accelerators.

This chapter is organized as follows. First, related work is reviewed in Section 4.1. Section 4.2 provides a brief overview of the SPE architecture. In Section 4.3 the experimental setup is described. The architectural enhancements of and modifications to the SPE are described in Section 4.4. The resulting media accelerator is evaluated in Section 4.5. Finally, Section 4.6 concludes this chapter.

4.1 Related Work

There are many architectures that aim at speeding up media applications. First of all, there are General Purpose Processors (GPPs) with (multi)media SIMD extensions. The Intel IA-32 architecture was extended with MMX [125] and later with SSE [131]. AMD has its 3DNow! [52] technology, IBM's PowerPC was extended with Altivec/VMX [53], VIS [159] was used in Sun's SPARC architecture, the MIPS architecture was extended with MDMX [71], and MVI [71] was added to the Alpha processor. A very small media extension (16 instructions only) was the MAX-2 [97] used in HP's PA-RISC core. GPPs provide high flexibility and programmability, but are power inefficient and cannot always meet real-time constraints [113].

The second class of architectures is the multimedia processor, which aims at accelerating a broad range of multimedia applications. Especially for embedded devices such a broad range of applications has to be supported. Most of the multimedia processors available in the market are actually SoCs, such as the Intel CE 2110 [10] and the Texas Instruments DaVinci Digital Media Processors [9]. Both are capable of H.264 video coding, but employ a hardware codec for it. Such platforms lack the flexibility of enabling new codecs or codec profiles.

Academia has also proposed several multimedia architectures. Among those are stream/vector processors such as Imagine [90], VIRAM [93], and ALP [134]. Prerequisites to using these architectures efficiently are certain application characteristics such as little data reuse (pixels are read once from memory and are not revisited) and high data parallelism (the same set of operations independently computes all pixels in the output image) [90]. Many media applications exhibit these characteristics. Modern video codecs such as H.264, however, are so complex that they do not have those characteristics. Motion compensation causes data to be accessed multiple times in an irregular fashion. DLP is limited by the size of the sub-blocks, which can be as small as 4 × 4. Because macroblocks can be partitioned into sub-blocks in various ways and because each sub-block can be coded with different parameters, there is no regular loop of the kind that architectures such as VIRAM require to exploit DLP. Moreover, due to the many data dependencies, parallelization is not trivial. Therefore, we expect that stream processors are less suitable for a modern codec such as H.264.

The CSI [42], MOM [48], and MediaBreeze [152] architectures optimize SIMD extensions by reducing the overhead of data accesses using techniques such as hardware address generation, hardware looping, data sectioning, alignment, reorganization, packing/unpacking, etc. The MOM architecture is matrix oriented and supports efficient memory access for small matrices as well as some matrix operations like matrix transpose. All of these architectures provide two or more dimensions of parallelism, yet recognize the need to exploit fine-grained parallelism. Some of these techniques could be effective for modern video coding. Issues like alignment, reorganization, and packing/unpacking are also addressed in the SARC Media Accelerator.

The third class of architectures is the Application-Specific Instruction set Processor (ASIP). Such an architecture is optimized for one or a few applications in the multimedia domain. ASIPs provide good power efficiency while maintaining flexibility and programmability and can therefore support a wide variety of standards.
The SARC Media Accelerator is an ASIP based on the Cell SPE architecture. The TriMedia is a VLIW ASIP for audio and video processing. The latest version (TM3270) has been specialized for H.264 encoding and decoding [161]. Our architecture is specialized for decoding only but takes specialization a step further. Tensilica's Diamond 388VDO Video Engine has two cores and a DMA unit [15]. One core is a SIMD architecture while the other is a stream architecture. The latter is used for entropy coding and assists the SIMD core in motion prediction. Further architectural details of this processor could not be found.

An early ASIP for video coding is found in [150] and contains an instruction for motion estimation, some instructions for packing and interleaving, a specialized datapath, and I/O for audio/video. Another early ASIP for video coding is found in [98] and has an interesting system architecture. A RISC core is used as control processor, a SIMD processor is used for the main decoding, and a sequence processor is used to perform entropy decoding. Moreover, the system has a local memory with a DMA unit. The cores themselves, though, are very simple and not specialized. In [91] an ASIP was proposed to enhance pixel processing. It uses SIMD vectors with 8 bits for the luma component and two times 4 bits for the chroma components. This kind of SIMD can exploit only a limited amount of DLP. Furthermore, the H.264 colorspace mainly uses a 4:2:0 subsampling scheme. That means that the number of chroma samples is reduced instead of the number of bits per chroma component. Thus this architecture cannot be used for modern video codecs. Two other ASIPs are found in [126, 50]. Both processors target motion estimation only.

Not a complete ASIP, but rather an ISA specialization for video coding was presented in [31] in 1999. The authors suggested some simple instructions for the upcoming MPEG4 standard. Many of the proposed instructions, e.g., arbitrary permute, MAC, round, and average, are used in today's architectures. In [138] an ISA specialization is presented for H.263. The instructions can be used for H.264 as well but are not as powerful as the ones we propose.

Finally, the last class of architectures is the ASIC. Some combinations of ASICs with GPPs for multimedia in general are mentioned above. There are also many examples of GPPs extended with an ASIC for a very specific target [118, 124, 151, 63]. ASICs are the most power-efficient solution and have found their way into many embedded systems. In systems where a large number of standards have to be supported, however, they might not be the most cost-efficient solution. Furthermore, when programmability is required, for example to support future standards or profiles, ASICs are not an option.

So far we have discussed architectures at a high level. There is also some work that is more specifically related to particular instructions that we propose. For clarity of exposition, we discuss these works after presenting the application-specific instructions.

4.2 The SPE Architecture

The baseline architecture for this research is the Cell SPE (Synergistic Processing Element) [59]. At the time this research was performed (2008), the Cell processor was one of the most advanced chip multiprocessors available. It consists of one PPE and eight SPEs, as depicted in Figure 4.1. The PPE is a general-purpose PowerPC that runs the operating system and controls the SPEs. The SPEs function as accelerators; they operate autonomously but depend on the PPE to receive tasks to execute.


Figure 4.1: Overview of the Cell architecture.


Figure 4.2: Overview of the SPE architecture (based on [80]).

The SPEs consist of a Synergistic Processing Unit (SPU), a Local Store (LS), and a Memory Flow Controller (MFC), as depicted in Figure 4.2. The SPU has direct access to the LS, but global memory can only be accessed through the MFC by DMA commands. As the MFC is autonomous, a double-buffering strategy can be used to hide the latency of global memory accesses. While the SPU is processing a task, the MFC is loading the data needed for the next task into the LS.

The SPU has 128 registers, each of which is 128 bits wide. All data transfers between the SPU and the LS are 128 bits wide. Also, the LS accesses are quadword aligned. The ISA of the SPU is completely SIMD and the 128-bit vectors can be treated as one quadword (128-bit), two doublewords (64-bit), four words (32-bit), eight halfwords (16-bit), or 16 bytes. There is no scalar datapath in the SPU; instead, scalar operations are performed using the SIMD registers and pipelines. For each data type, a preferred slot is defined in which the scalar value is maintained. Most of the issues of scalar computations are handled by the compiler. The SPU has six functional units, each assigned to either the even or the odd pipeline. The SPU can issue two instructions concurrently if they are located in the same doubleword and if they execute in different pipelines. Instructions within the same pipeline retire in order.

The Cell SPE was chosen as a starting point for specialization because of its properties. Specifically, the architectural features discussed next were the main reasons to choose the SPE.

The fundamental idea behind the DMA style of accessing global memory is the shopping list model. In this model, before processing a task all the required data is brought into the local store through DMA. This model suits video decoding well, as almost all required data is known a priori. The only exception is in the motion compensation, where the motion vector can point to an arbitrary location in a previously decoded frame.

The register file of the SPE is rather large, as it has 128 registers of 128 bits each. As a result, the data processed by a kernel fits entirely in the register file in most cases. Generating code that does so, however, is not straightforward. Basically all pointers have to be replaced by arrays that can be declared register. Often kernels consist of multiple functions. To maintain data in the register file and have all the functions operate on these registers, function calls have to be avoided, i.e., the functions have to be replaced by macros. For large kernels such as motion compensation this is not straightforward. Moreover, the effect of this on the code size will not be insignificant and might constitute a problem.

The Local Store (LS) of the SPE is 256KB large and contains both data and instructions. The entire H.264 decoding application does not fit in this LS, but it does not have to. The media accelerator is meant to function in cooperation with a general-purpose core. The latter would run the main application and spawn threads to one or more accelerators. Using the 2D-Wave or 3D-Wave parallelization strategy (see Chapter 3), the accelerators are responsible for decoding single macroblocks. Thus, only the code that decodes one macroblock has to fit in the LS. The code from FFmpeg compiled for the SPE without optimizations is 256KB large. Compilation with the -O3 optimizations, however, results in a code size of 155KB, while using -Os results in 118KB.
Note that 24KB is also required for data, as well as some additional memory for intermediate values, a stack, etc. Thus the LS is sufficiently large, but it leaves little room to implement optimizations such as those mentioned in the previous paragraph.

The SPU is completely SIMD with a vector width of 128 bits and has no scalar data path. Scalar operations are performed using the SIMD path, and thus a scalar path is not necessary. The width of the SIMD vectors is well suited for H.264. Although the pixel components are encoded in 8 bits, all computation is performed using 16-bit values in order not to lose precision. That means that eight elements can be stored in one SIMD vector. Since the kernels operate mostly on 4 × 4 and 8 × 8 sub-blocks, they would not benefit very much from longer SIMD vectors. In the deblocking filter the functions for the luma components could use a larger vector size. In the motion compensation only the functions for the Luma 16 × 16, 16 × 8, and 8 × 16 modes could benefit from this. Using a longer vector size, however, would increase the overhead for all 8 × 8 based functions.
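To make the shopping-list model and the double-buffering strategy described earlier in this section concrete, the sketch below overlaps the DMA transfer of the next work unit with the decoding of the current one. The buffer size and the decode_macroblock entry point are assumptions; only the MFC calls (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all) are part of the standard SPE library interface.

    #include <spu_mfcio.h>

    #define WU_SIZE 4096   /* illustrative size of one per-MB work unit */

    static char buf[2][WU_SIZE] __attribute__((aligned(128)));

    extern void decode_macroblock(void *work_unit);  /* hypothetical kernel */

    /* Double buffering: while the SPU decodes the work unit in one buffer,
     * the MFC fetches the next one into the other buffer (DMA tag = buffer
     * index).  ea_list holds the effective addresses of the work units.   */
    void process_work_units(unsigned long long *ea_list, int n)
    {
        int cur = 0;
        mfc_get(buf[cur], ea_list[0], WU_SIZE, cur, 0, 0);

        for (int i = 0; i < n; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < n)
                mfc_get(buf[nxt], ea_list[i + 1], WU_SIZE, nxt, 0, 0);

            mfc_write_tag_mask(1 << cur);     /* wait for current buffer */
            mfc_read_tag_status_all();

            decode_macroblock(buf[cur]);
            cur = nxt;
        }
    }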

4.3 Experimental Setup

In this section we describe the experimental setup, consisting of the benchmarks, the compiler, and the simulator. The target baseline architecture, the Cell SPE, was reviewed in the previous section.

4.3.1 Benchmarks

As benchmarks we used the kernels of an H.264 video decoding application. One of the best publicly available H.264 decoders is FFmpeg [58]. Colleagues of the author ported the kernels to the Cell SPE and vectorized them using the SPE SIMD instructions [29]. We used the following H.264 kernels for the analysis in this work: Inverse Discrete Cosine Transform (IDCT), Deblocking Filter (DF), and luma interpolation (Luma). The latter is one of the two elements of the motion compensation kernel. The other is the chroma interpolation kernel, which is very similar to the Luma kernel and was therefore omitted. There is also an entropy decoding kernel in H.264. We did not use this kernel in this research as it is highly bit-serial and requires a different type of accelerator.

Figure 4.3 shows the execution time breakdown of the entire FFmpeg application. This analysis was performed on a PowerPC with Mac OS X, running at 1.6 GHz.


Figure 4.3: Execution time breakdown of the FFmpeg H.264 decoder for the scalar and Altivec implementations.

Both the scalar and the Altivec implementations are depicted. CABAC is one of the entropy decoding algorithms of the H.264 standard. Suppose a General Purpose Processor (GPP) performs the operating system (OS) calls, the video output, and the 'others' parts. The entropy decoding can best be performed using a special CABAC accelerator, while the other kernels run on the media accelerator. Using this approach the media accelerator has the most time-consuming workload: assuming Altivec, it has 55% of the workload, compared to 26% for the GPP and 19% for the CABAC accelerator. Therefore, any speedup achieved in the media core, up to a factor of approximately two, directly translates into a speedup of the total application. A detailed analysis of the kernels is presented in Section 4.5.
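The claim that media-core speedups pay off only up to roughly a factor of two can be verified with a small back-of-the-envelope program. It assumes that the GPP part (26%), the CABAC part (19%), and the media-accelerator part (55%) operate as concurrent pipeline stages, so that application throughput is determined by the slowest stage; this pipelining assumption is ours and is not stated explicitly in the text.

    #include <stdio.h>

    /* Accelerating the media stage helps until it drops below the 26%
     * GPP stage, i.e. up to roughly 0.55 / 0.26 = ~2.1x.              */
    int main(void)
    {
        const double gpp = 0.26, cabac = 0.19, media = 0.55;

        for (double s = 1.0; s <= 3.01; s += 0.5) {
            double stage = media / s;               /* accelerated media */
            double bottleneck = stage > gpp ? stage : gpp;
            if (bottleneck < cabac) bottleneck = cabac;
            printf("media speedup %.1f -> application speedup %.2f\n",
                   s, media / bottleneck);
        }
        return 0;
    }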

4.3.2 Compiler

We used spu-gcc version 4.1.1. Most of the architectural enhancements we implemented were made available to the programmer through intrinsics. In some cases implementing the intrinsic would have required complex modifications to the compiler; those new instructions are only available through inline assembly. We modified the gcc toolchain to support the new intrinsics and instructions. The procedure to modify the gcc compiler is described in [111]. The modifications ensure that the compiler can handle the new instructions, but not necessarily in an optimized way. Optimizing the compiler for the new instructions is beyond the scope of this thesis.

4.3.3 Simulator

Analysis of the kernels on the Cell SPE could be done using the real processor. Analyzing architectural enhancements, however, requires an experimental environment that can be modified. Therefore we used CellSim [7], a simulator of the Cell processor developed in the Unisim environment.

Profiling in CellSim

First, we implemented a profiling tool for CellSim, as such a tool was not available. The profiling tool allows many statistics (such as cycles, instructions, IPC, etc.) to be gathered, either for the entire application or for a specific piece of selected code. The profiler is controllable from the native code (the code running on the simulated processor). Furthermore, the profiler can generate a trace output for detailed analysis of the execution of the kernels.

Profiling the code, however, can influence its execution, especially if the size of the profiled code is small. Each call to the profiler inserts one line of volatile inline assembly. This ensures that only code between the start and stop commands of the profiler is measured. Because instructions are not allowed to cross this volatile boundary, however, there is less opportunity for the compiler to optimally schedule the instructions. Section 4.5 presents the profiling analysis of the kernels, which shows the amount of time the different execution phases take. The execution times reported in those particular analyses are off by approximately 25%. The purpose, however, is to show the relative size of the phases rather than the actual cycles spent on them. Thus, the influence of the profiling itself is not problematic.

When reporting the total execution time of the kernels, the influence of the profiling can be neglected. For this analysis only a start and a stop command have to be inserted at the beginning and the end of the kernel, respectively. No volatile boundaries are included inside the kernel and thus all compiler optimizations are unaltered.
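The start and stop markers could look like the sketch below: a volatile inline-assembly statement with a memory clobber acts as a compiler barrier, so instructions are not scheduled across the measured region. The nop merely stands in for whatever marker the simulator traps on; the actual CellSim hook is not shown in this thesis.

    extern void idct8x8(short *block);   /* hypothetical kernel under test */

    /* Volatile inline assembly with a "memory" clobber: the compiler may
     * not move instructions across these markers.                        */
    #define PROFILE_START() __asm__ __volatile__("nop" ::: "memory")
    #define PROFILE_STOP()  __asm__ __volatile__("nop" ::: "memory")

    void idct_profiled(short *block)
    {
        PROFILE_START();
        idct8x8(block);
        PROFILE_STOP();
    }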

Configuration of CellSim

CellSim has many configuration parameters, most of which have a default value that matches the actual Cell processor. The execution latencies of all instruction classes, however, were set by default to 6 cycles. We adjusted these to match the ones reported in [80], as listed in Table 4.1.

Table 4.1: SPU instruction latencies.

 Instruction   Latency                                       CellSim
 class         (cycles)   Description                        pipe
 SP                6      Single-precision floating point    FP6
 FI                7      Floating-point integer             FP7
 DP               13      Double-precision floating point    FPD
 FX                2      Simple fixed point                 FX2
 WS                4      Word rotate and shift              FX3
 BO                4      Byte operations                    FXB
 SH                4      Shuffle                            SHUF
 LS                6      Loads and stores                   LSP

The table mentions both the name of the instruction class as used in the IBM documentation and the pipeline name as used in CellSim and in spu-gcc. Each instruction class maps to one of the units depicted in Figure 4.2 (see [80] for more details). The execution latency of an instruction class is the time it takes the corresponding execution unit to produce the result. More precisely, it is the number of clock cycles between the time the operands enter the execution unit and the time the result is available (by forwarding) as input for another computation. Thus, the instruction latency does not include instruction fetch, decode, dependency check, issue, routing, register file access, etc. For a detailed description of the SPU pipeline, the reader is referred to [59].

The table does not include all latencies. The latencies of channel instructions are modeled in a different way in CellSim and were not adjusted. The latency of branches, i.e., the branch miss penalty, was not modeled in CellSim; M. Briejer implemented this while implementing the branch hint instructions [33].

In addition, we adjusted the fetch parameters. The fetch width was set to one line of 32 instructions. The fetch buffer size was set to 78, which corresponds to 2.5 lines of 32 instructions. Finally, we adjusted the instruction fetch rate of CellSim. The real processor can fetch two instructions per cycle, but can execute them simultaneously only if one executes in the odd pipeline and the other in the even pipeline. CellSim does not distinguish between the odd and even pipelines and therefore initially fetched too many instructions per cycle on average. We determined empirically that an instruction fetch rate of 1.4 in CellSim corresponds closely to the behavior of the real processor.

Table 4.2: Comparison of execution times and instruction count in CellSim and SystemSim.

                      Cycles                             Instructions
Kernel    SystemSim    CellSim    Error     SystemSim    CellSim    Error
IDCT8           917        896    -2.3%          1099       1103     0.4%
IDCT4          2571       2524    -1.9%          1853       1858     0.3%
DF           121465     122122     0.5%        101111     101480     0.4%
Luma           2534       2598     2.5%          2557       2572     0.6%

To improve simulation speed, some adjustments were made to the PPE configuration parameters. The issue width was set to 1000 and the L1 cache latency was set to zero cycles. Both measures increase the IPC of the PPE, which is required to spawn the SPE threads, and thus speed up simulation. These modifications have no effect on the simulation results, however, as the kernels run on the SPE.

Validation of CellSim

To validate the configuration of CellSim, several tests were performed. The execution times measured in CellSim were checked by comparing them to those of the IBM full system simulator SystemSim. We did not use the real processor as SystemSim provides more statistics and is accurate as well. These additional statistics allowed us to check the correctness of CellSim on more aspects than just execution time. Table 4.2 shows the execution times and the dynamic instruction counts of all the kernels using both SystemSim and CellSim. It shows that the difference in execution time is at most 2.5%. The number of executed instructions differs slightly. As the profiling was started and stopped at exactly the same point in the C code, we expect this difference in instruction count to be caused by the different implementations of the profiling tool. The results presented here are measurements of entire kernel execution times. As discussed above, when performing a detailed profiling analysis of the kernel phases, the profiling itself affects the execution time, resulting in a larger error. Altogether, from these results we conclude that CellSim is sufficiently accurate to evaluate the architectural enhancements described in this chapter. Comparison of other statistics generated by the two simulators also showed that CellSim is well suited for its purpose. Of course, when evaluating architectural enhancements we always compare results that are all obtained with CellSim.

4.4 Enhancements to the SPE Architecture

In this section we present the architectural enhancements and modifications to the SPE that create the media accelerator. The enhancements and modifications are based on the identified deficiencies in the execution of the H.264 kernels. These deficiencies are discussed in detail in Section 4.5, together with the results for the kernels.

To identify the deficiencies in the execution of the H.264 kernels, we performed a thorough analysis down to the assembly level. Among others, this analysis comprised the following. We compared the C code with the generated assembly code to determine which simple operations require many overhead instructions. We performed extensive profiling to determine the most time-consuming parts. Analyzing trace files allowed us to determine why certain parts were time consuming. We analyzed both C and assembly code to find opportunities for instruction collapsing.

The newly added instructions are presented in detail in this section. One of the details we show is the chosen opcode for each instruction. We do not know the structure of the opcode space and thus we chose available opcodes; we show them to demonstrate whether the added instructions fit in the ISA.

The notation used in this thesis to describe the instruction set architecture follows the notation used in the IBM documentation with a few extensions. First, temporary values w are 64 bits wide. Second, temporary values p are 16 bits wide. Third, temporary values d are 8 bits wide. An extensive description of the notation is provided in [81], for example. In the code depicted in this thesis we use the following notation for vector types: a vector signed short type is denoted as vsint16_t, while a vector unsigned char type is denoted as vuint8_t. In general the integer types are denoted as v[s/u]int[8/16/32]_t (a sketch of the corresponding typedefs is given below).

This section is organized as follows. First, we present four categories of new instructions that we added to the ISA. Finally, we propose a modification to the architecture that allows unaligned memory accesses to the local store.
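The sketch below shows how these shorthand type names could be defined, assuming they are plain typedefs of the SPU C language-extension vector types; the exact definitions used in the implementation are not shown in this thesis.

#include <spu_intrinsics.h>

typedef vector signed char     vsint8_t;    /* 16 x  8-bit signed   */
typedef vector unsigned char   vuint8_t;    /* 16 x  8-bit unsigned */
typedef vector signed short    vsint16_t;   /*  8 x 16-bit signed   */
typedef vector unsigned short  vuint16_t;   /*  8 x 16-bit unsigned */
typedef vector signed int      vsint32_t;   /*  4 x 32-bit signed   */
typedef vector unsigned int    vuint32_t;   /*  4 x 32-bit unsigned */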

4.4.1 Accelerating Scalar Operations

Since the SPU has a SIMD-only architecture, scalar operations are performed by placing the scalar in the so-called preferred slot and using the SIMD FUs. For operations that involve only scalar variables this works well; the compiler places the scalar variable in a memory location congruent with the preferred slot even if this wastes memory. But if the operand is a vector element, a lot of rearrangement overhead is required to move the desired element to the preferred slot. Rather than providing a full scalar datapath, which would increase the silicon area significantly, we propose adding a few scalar instructions.

Lds Instructions

The lds (LoaD Scalar) instructions load a scalar from a quadword-unaligned address inside the local store and place it in the preferred slot of a register. The original SPU architecture allows only quadword-aligned accesses with a size of 128 bits. To load a scalar from a non-quadword-aligned location, a lot of additional instructions are required. Listing 4.1 shows an example of an array access. The left part of the listing shows the C code, including the declaration of the used array and index variable. The middle of the listing shows the assembly code generated by the compiler for the original SPU architecture. The first instruction (shli) shifts the index variable i, stored in $5, left by 2 bits. That is, the index is multiplied by 4 and the result is the distance of the desired element from the start of the array in bytes. The second instruction (ila) loads the start address of the array into a register. The third instruction (lqx) loads the quadword (128 bits) that contains the desired element. It adds the index in bytes to the start address of the array, truncates it to a multiple of 16, and accesses the quadword at that address. At this point the desired element of the array is in register $4 but not necessarily in the preferred slot. The fourth instruction (a) adds the index in bytes to the start address of the array. The four least significant bits of this result are used by the last instruction (rotqby) to rotate the loaded value such that the desired element is in the preferred slot.

Using the ldsw (LoaD Scalar Word) instruction the same array access can be done in two instructions, as shown on the right side of Listing 4.1. The first instruction (ila) stores the base address of the array in a register. Next, the ldsw instruction takes as input the index ($5), the start address of the array ($6), and the destination register ($2). It loads the corresponding quadword from the local store, then selects the correct word and puts it in the preferred

Listing 4.1: Example of how the lds instructions speed up code. Left is an array access in plain C code, in the middle is the original assembly code, while the right depicts the assembly code using the ldsw instruction.

C code                 original assembly            with ldsw
int array[8];          shli   $2,$5,2               ila    $6,array
int i;                 ila    $6,array              ldsw   $2,$6,$5
int a = array[i];      lqx    $4,$2,$6
                       a      $2,$2,$6
                       rotqby $2,$4,$2

slot, and finally stores the result in the destination register. The addressing mode is base plus scaled index, i.e., the effective address of ldsw $2,$6,$5 is $6 + 4 × $5. The ldsw instruction loads a word (four bytes) and thus its index is multiplied by four. For the other lds instructions the index is scaled in a similar way. The scalars that are loaded should be aligned according to their size; for example, words should be word aligned. The base address provided to the instruction does not have to be properly aligned, as the effective address is truncated according to the data size (see also the instruction details below). Note that the base plus scaled index addressing mode is supported in several ISAs (e.g., x86), but is not considered very useful there because the compiler can often eliminate the index variables. In this case, however, the datapath is completely SIMD and the index is used to select a specific element of the loaded quadword.

Loads and stores are handled by the SLS (SPU Load and Store) execution unit. The SLS has direct access to the local store and to the register file unit. Besides loads and stores, the SLS also handles DMA requests to the LS and load branch-target-buffer instructions. In order to implement the ldsw instruction a few modifications to the SLS unit are required. First, computing the target address is slightly different than for a normal load. Instead of adding the two operands, the index should be multiplied by four before adding it to the start address of the array; this multiplication can be done by a simple shift operation. Second, for each outstanding load (probably maintained in a buffer) two more select bits are required to store the SIMD slot in which the element will be located. These two bits are bits 2 and 3 of the target address. For normal loads, these two bits should be set to zero. Finally, when the quadword is received from the local store, the correct word should be shifted to the preferred slot. This can be done by inserting a 4-to-1 mux in the datapath which is controlled by the select bits.

The lds instructions fit the RR format. The instruction details are depicted below. Although for the benchmarks used in this work only the ldsw instruction is required, for completeness we defined the lds instructions for halfwords and bytes as well. These element sizes are used frequently in media applications and thus it is reasonable to provide them. Doublewords are not considered in this work as neither long integers nor single precision floating point are used in the media applications targeted. Furthermore, the SPU ISA does not contain any long integer operations.

ldsw rt,ra,rb

00 0 1 10 1 0 10 1 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

LSA ← (RA0:3 + RB0:3 << 2) & LSLR & 0xFFFFFFF0 b ← (RA0:3 + RB0:3 << 2) & 0x0000000C Q ← LocStor(LSA,16) RT 0:3 ← Qb::4

ldshw rt,ra,rb

00 1 0 0 1 0 1 1 0 1 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

LSA ← (RA0:3 + RB0:3 << 1) & LSLR & 0xFFFFFFF0 b ← (RA0:3 + RB0:3 << 1) & 0x0000000E Q ← LocStor(LSA,16) RT 0:1 ← Qb::2

ldsb rt,ra,rb

00 1 0 0 1 0 1 1 1 0 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

LSA ← (RA0:3 + RB0:3) & LSLR & 0xFFFFFFF0 b ← (RA0:3 + RB0:3) & 0x0000000F Q ← LocStor(LSA,16) RT 0 ← Qb

The lds instructions were made available to the programmer through intrinsics of the form d = spu_ldsx(array,index), where x ∈ {b, hw, w}, array is a pointer to a word, and index is a word. The compiler generates both the ldsx and the ila instruction required to load the desired element of the array. The lds instructions were added to the simulator and their latency was set to seven cycles for the following reason: a normal load or store operation takes six cycles, and we added one cycle to account for the extra mux in the datapath.
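As a usage sketch (using the intrinsic form given above), the array access of Listing 4.1 then simply becomes:

static int load_element(int *array, int i)
{
    return spu_ldsw(array, i);   /* compiles to ila + ldsw instead of the
                                    shli / ila / lqx / a / rotqby sequence */
}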

As2ve Instructions

As2ve (Add Scalar to Vector element) instructions add a scalar to a specific element of a vector. As the SPU architecture is completely SIMD, scalar operations often require a considerable amount of time. Listing 4.2 shows an example from the IDCT8 kernel where the constant value 32 has to be added to the first element of a vector of halfwords. The middle of the listing depicts how the normal SPU ISA handles this code. The pointer *block is stored in $4. First, it is calculated (ai) by how many bytes the vector must be rotated to put the desired element in the preferred slot. This shift count is stored in $2. Second, the vector is loaded (lqd) into register $8. Third, a mask is created (chd) and stored in $7, which is needed by the shuffle later on. Fourth, the vector is rotated (rotqby) based on the shift count. Fifth, the value 32 is added (ahi) to the desired element, which now is in the preferred slot. Finally, the result of the addition is shuffled (shufb) back into its original position in the vector. Note that this compiler-generated code is not optimal: the shift count is known a priori as pointer *block is quadword aligned, but apparently the compiler is not aware of that.

Listing 4.2: Example of how the as2ve instructions speed up code. Left are two lines of code from the IDCT8 kernel, in the middle is the normal assembly code, while the right depicts the assembly code using as2ve instructions.

C code                  normal assembly           with as2ve
vsint16_t *block;       ai       $2,$4,14         lqd       $8,0($4)
block[0] += 32;         lqd      $8,0($4)         as2vehwi  $8,32,0
                        chd      $7,0($4)
                        rotqby   $2,$8,$2
                        ahi      $2,$2,32
                        shufb    $6,$2,$8,$7

as2veb   rt,ra,immb     \\ rt[immb] += ra    \  for vector
as2vebi  rt,imma,immb   \\ rt[immb] += imma  /  of bytes.
as2vehw  rt,ra,immb     \\ rt[immb] += ra    \  for vector
as2vehwi rt,imma,immb   \\ rt[immb] += imma  /  of halfwords.
as2vew   rt,ra,immb     \\ rt[immb] += ra    \  for vector
as2vewi  rt,imma,immb   \\ rt[immb] += imma  /  of words.

Figure 4.4: Overview of the as2ve instructions.

Using the as2ve instructions the same addition of the scalar to the vector can be done in one instruction (or two, if the load is included as in the example). In the example of Listing 4.2 we used the as2vehwi instruction, which adds a halfword immediate to a vector of halfwords. Figure 4.4 shows all the as2ve instructions. Only the instructions for bytes and halfwords are used in the benchmarks, but we provide the word versions as well, as media applications sometimes use this data type. For each data type there are two instructions: the first assumes the scalar is held in the preferred slot of a register, the second takes the scalar as an immediate value in the instruction. One might consider the lds instructions, augmented with an sts (store scalar) instruction, to perform this operation. This would, however, incur a lot of overhead. Suppose the as2ve operation is in the middle of a series of instructions that operate on variables in the register file. In that case, first the vector has to be stored in memory, second the vector element is loaded into the register file using an lds instruction, third the addition is performed, fourth an sts instruction has to be used to store the vector element back into the vector, and finally the entire vector has to be loaded again for further operation. This method requires four overhead instructions that are dependent and have a large latency.
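At the C level, the operation of Listing 4.2 could then be written as sketched below, using the intrinsic form described at the end of this subsection (the store back to memory is only needed if the result must leave the register file):

static void add_bias(vsint16_t *block)
{
    vsint16_t v = *block;           /* lqd                                    */
    v = spu_as2vehwi(v, 32, 0);     /* block[0] += 32 in a single instruction */
    *block = v;                     /* only if the vector must go back to memory */
}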

The as2ve instructions fit the RI7 format except the as2vebi, as2vehwi, and as2vewi instructions. The latter have two immediate values and one register value, instead of one immediate value and two register values. The instruction details are depicted below.

as2veb rt,ra,imm

00 0 1 10 1 1 0 1 1 I7 RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

b ← RA3 c ← RepLeftBit(I7,8) & 0x0F d ← RT c RT c ← d + b

as2vebi rt,imma,immb

00 0 1 10 1 1 00 1 I7b I7a RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

b ← RepLeftBit(I7a,8) c ← RepLeftBit(I7b,8) & 0x0F d ← RT c RT c ← d + b

as2vehw rt,ra,imm

00 0 1 10 1 11 0 0 I7 RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

s ← RA2:3 b ← RepLeftBit(I7,8) & 0x07 r ← RT 2b::2 RT 2b::2 ← r + s

as2vehwi rt,imma,immb

00 0 1 10 1 1 0 1 0 I7b I7a RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

s ← RepLeftBit(I7a,16) b ← RepLeftBit(I7b,8) & 0x07 r ← RT 2b::2 RT 2b::2 ← r + s

as2vew rt,ra,imm

00 0 0 10 1 0 1 0 1 I7 RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

t ← RA0:3 b ← RepLeftBit(I7,8) & 0x03 u ← RT 4b::4 RT 4b::4 ← t + u

as2vewi rt,imma,immb

00 0 0 10 1 0 10 0 I7b I7a RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

t ← RepLeftBit(I7a,32) b ← RepLeftBit(I7b,8) & 0x03 u ← RT 4b::4 RT 4b::4 ← t + u

The as2ve instructions were made available to the programmer through intrinsics of the form a = spu_as2vebi(a,c,x), where a is the vector operand. The latter is both input and output, which is a case the compiler was not built for. Therefore, the intrinsics did not work properly in all cases and we had to use inline assembly as follows:

asm ("as2vebi %0,c,x" \ : "=r" (a) \ : "0" (a) \ );

The instructions were added to the simulator and the latency was set to three cycles. A possible implementation uses the FX2 (fixed point) pipeline and replicates the scalar to all SIMD slots of input a. The vector goes to input b of the ALU. Some additional control logic selects the proper SIMD slot in which to perform the addition, while the other slots pass input b through. The latency of the FX2 pipeline is two cycles; adding another cycle for the increased complexity results in three cycles.

4.4.2 Accelerating Saturation and Packing

The elementary data type to store pixel information is a byte. Computation, however, is done using halfwords (16 bits) in order not to lose precision. This introduces some overhead known as pack/unpack. In this section we propose three instruction classes that reduce the cost of this overhead. They also accelerate saturation, which is often performed in conjunction with a pack operation.

Clip Instructions

Clip instructions saturate the elements of a vector between two values. A clip instruction can be used to emulate saturating arithmetic, but also saturation between arbitrary boundaries. For example, in the deblocking filter kernel the index of a table lookup has to be clipped between 0 and 51. In the normal SPU architecture this operation requires four instructions (see Listing 4.3). First, two masks are created by comparing the input vector with the lower and upper boundaries. Second, using the masks the result is selected from the input and the boundary vectors. The clip operation is mainly used in integer and short computations. The instruction could be implemented for bytes as well, but we do not expect any

Listing 4.3: C code of a clip function using the normal SPU ISA. The clip instructions perform this operation in one step.

vsint16_t clip(vsint16_t a, vsint16_t min, vsint16_t max) {
    vuint16_t min_mask, max_mask;
    min_mask = spu_cmpgt(min, a);
    max_mask = spu_cmpgt(a, max);
    vsint16_t temp = spu_sel(a, min, min_mask);
    return spu_sel(temp, max, max_mask);
}

application to use it. The clip instructions fit the RR format if the source and destination register (of the vector to clip) are the same. The instruction details are depicted below.

cliphw rt,ra,rb

00 0 1 10 11 1 1 0 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 if RT 2j::2 > RB2j::2 then RT 2j::2 ← RB2j::2 if RT 2j::2 < RA2j::2 then RT 2j::2 ← RA2j::2 end

clipw rt,ra,rb

00 0 1 10 11 1 1 1 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 3 if RT 2j::4 > RB2j::4 then RT 2j::4 ← RB2j::4 if RT 2j::4 < RA2j::4 then RT 2j::4 ← RA2j::4 end

The clip instructions were made available to the programmer through intrinsics of the form a = spu_cliphw(a, min, max), where the operands all have the same data type. We created intrinsics for both vectors and scalars; note that the same instruction can be used for both. Using these intrinsics to perform a saturation to 8-bit or 16-bit values requires the programmer to define two variables containing the minimum and maximum value. To ease programming, intrinsics of the form a = spu_satub(a) (saturate to unsigned byte value) can be created. This requires some small modifications to the machine description in the back-end of the compiler, such that it creates the two vectors containing the minimum and maximum values. In certain cases the clip intrinsics suffered from the same problem as the as2ve intrinsics, namely that the compiler did not understand that the source and destination registers are the same. The solution is to use inline assembly as follows:

asm ("clipw %0,%1,%2" \
     : "=r" (a), "=r" (amin), "=r" (amax) \
     : "0" (a), "1" (amin), "2" (amax) \
     );

The clip instructions were added to the simulator and the latency was set to four cycles. A normal compare operation takes two cycles. The two compares can be done in parallel, but that requires that three quadwords are read from the register file into the functional unit. Another approach would be to serialize the operation: first two operands are loaded; while these are compared and the result is selected, the third operand is loaded; finally, the latter is compared to the result of the first compare. We added one cycle to account for loading three quadwords and one more cycle to account for the extra mux in the datapath that selects the correct value.
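A usage sketch of the intrinsic described above, saturating halfword results to the unsigned-byte range (spu_splats is a standard SPU intrinsic; spu_cliphw is the intrinsic proposed here):

static vsint16_t clip_to_byte_range(vsint16_t v)
{
    vsint16_t lo = spu_splats((short) 0);
    vsint16_t hi = spu_splats((short) 255);
    return spu_cliphw(v, lo, hi);   /* one instruction instead of the four of Listing 4.3 */
}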

Sat pack Instructions

The sat pack instructions take two vectors of a larger data type, and saturate and pack them into one vector of a smaller data type. This operation is useful, for example, whenever data has been processed as signed 16-bit values but has to be stored to memory as unsigned 8-bit values. Listing 4.4 shows such a piece of code from the Luma kernel that can be replaced entirely by one sat pack instruction. The kernel operates on 16 × 16 matrices (entire macroblocks); the vectors va and vb together contain one row of the output matrix. The sat pack instructions fit the RR instruction format. Sat pack instructions could be implemented for each type of operands, but for H.264 decoding,

Listing 4.4: Example code from the Luma kernel that can be replaced by one sat pack instruction. The code saturates the vectors va and vb between 0 and 255, and packs all 16 elements into one vector of unsigned bytes.

vsint16_t va, vb, vt, sat;
vuint8_t  vt_u8;
vuint8_t  mask  = { 0x01,0x03,0x05,0x07,0x09,0x0B,0x0D,0x0F,
                    0x11,0x13,0x15,0x17,0x19,0x1B,0x1D,0x1F };
vsint16_t vzero = spu_splats(0);
vsint16_t vmax  = spu_splats(255);
sat = spu_cmpgt(va, vzero);
va  = spu_and(va, sat);
sat = spu_cmpgt(va, vmax);
va  = spu_sel(va, vmax, sat);
sat = spu_cmpgt(vb, vzero);
vb  = spu_and(vb, sat);
sat = spu_cmpgt(vb, vmax);
vb  = spu_sel(vb, vmax, sat);
vt_u8 = spu_shuffle(va, vb, mask);

only packing from 16-bit signed to 8-bit unsigned is required. For audio processing, packing from 32-bit signed to 16-bit signed can be useful. We implemented two different ways of packing. The sp2x (x = ub, hw) instructions pack the elements of the first and second operand into the first and second doubleword, respectively. The spil2x instructions interleave the elements of the two operands, i.e., the elements of the first operand go to the odd elements and those of the second operand to the even elements. The latter instruction is useful when the unpack was performed by a multiply instruction. The instruction details are depicted below. The sat pack instructions were made available to the programmer through intrinsics and were added to the simulator. The latency was set to two cycles as the complexity of the instruction is similar to that of fixed point operations. A combination of pack and saturation is also found in the pack.sss and pack.uss instructions of the IA-64 architecture [16]. MMX also has instructions that combine pack and saturate, and the TriMedia has a few instructions that include a clip operation.
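As a usage sketch, the whole sequence of Listing 4.4 then reduces to a single call (the intrinsic name is assumed here to follow the instruction mnemonic):

static vuint8_t saturate_and_pack(vsint16_t va, vsint16_t vb)
{
    return spu_sp2ub(va, vb);   /* saturate va and vb to 0..255 and pack them
                                   into one vector of 16 unsigned bytes       */
}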

sp2ub rt,ra,rb

00 0 1 1 0 0 00 1 1 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 r ← RA2j::2 if r > 255 then r ← 255 if r < 0 then r ← 0 j RT ← r0:7 r ← RB2j::2 if r > 255 then r ← 255 if r < 0 then r ← 0 j+8 RT ← r0:7 end

spil2ub rt,ra,rb

00 0 1 1 0 0 0 1 0 0 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 r ← RA2j::2 if r > 255 then r ← 255 if r < 0 then r ← 0 2∗j+1 RT ← r0:7 r ← RB2j::2 if r > 255 then r ← 255 if r < 0 then r ← 0 2∗j RT ← r0:7 end

sp2hw rt,ra,rb

00 0 1 1 0 0 0 1 0 1 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 3 t ← RA4j::4 if t > 32767 then t ← 32767 if t < −32768 then t ← −32768 j::2 RT ← t0:15 t ← RB4j::4 if t > 32767 then t ← 32767 if t < −32768 then t ← −32768 j+8::2 RT ← t0:15 end

spil2hw rt,ra,rb

00 0 1 1 0 0 0 1 1 0 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 3 t ← RA4j::4 if t > 32767 then t ← 32767 if t < −32768 then t ← −32768 4∗j+2::2 RT ← t0:15 t ← RB4j::4 if t > 32767 then t ← 32767 if t < −32768 then t ← −32768 4∗j::2 RT ← t0:15 end

Listing 4.5: Example code from the IDCT8 kernel that can be replaced by one asp instruction. The code adds the vectors va and vb, saturates them between 0 and 255, and packs the elements to 8-bit values.

vsint16_t va, vb, vt, sat;
vuint8_t  vt_u8;
const vuint8_t packu16 =
    { 0x01,0x03,0x05,0x07,0x09,0x0B,0x0D,0x0F,
      0x11,0x13,0x15,0x17,0x19,0x1B,0x1D,0x1F };
vsint16_t vzero = spu_splats(0);
vsint16_t vmax  = spu_splats(255);
vt  = spu_add(va, vb);
sat = spu_cmpgt(vt, vzero);
vt  = spu_and(vt, sat);
sat = spu_cmpgt(vt, vmax);
vt  = spu_sel(vt, vmax, sat);
vt_u8 = spu_shuffle(vt, vzero, packu16);

Asp Instructions

Although the IDCT kernels require some packing operations, they cannot utilize the sat pack instructions due to the way the data is stored in memory. Especially for these kernels, the asp (Add, Saturate, and Pack) instructions were developed, which combine an addition, a saturation, and a pack. Listing 4.5 shows a piece of code from the IDCT kernel that can be replaced entirely by one asp instruction. The code adds the result of the IDCT to the predicted picture. First, it adds two vectors of signed halfwords. Second, the elements of the vector are saturated between 0 and 255. Finally, the elements are packed to bytes and stored in the first half of the vector. Using the normal SPU ISA many instructions and auxiliary variables are needed. The asp instructions are tailored for this kind of operation and can perform it at once. Note that if only a saturate and a pack are required, one of the operands can be assigned a zero vector. The asp instructions fit the RR format. An asp instruction could be implemented for each type of operands, but we found it only necessary to implement one: asp2ub (add, saturate, and pack to unsigned bytes). It takes as input two vectors of signed halfwords and stores the output in a vector of unsigned bytes. The eight elements are stored in the first eight slots while the last eight slots are assigned the value zero. The instruction details are depicted below.

asp2ub rt,ra,rb

00 0 1 10 1 1 00 1 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 r ← RA2j::2 + RB2j::2 if r > 255 then r ← 255 if r < 0 then r ← 0 j RT ← r0:7 end

The asp instruction was made available to the programmer through an intrinsic. The instruction was added to the simulator and the latency was set to four cycles: a normal addition costs two cycles, and we added two more cycles for the saturation and the data rearrangement.
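A usage sketch (intrinsic name assumed from the mnemonic) of how Listing 4.5 collapses to a single call:

static vuint8_t add_and_pack(vsint16_t va, vsint16_t vb)
{
    /* adds va and vb, saturates each halfword to 0..255, and packs the eight
       results into the first eight bytes of the output (the rest is zero)    */
    return spu_asp2ub(va, vb);
}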

4.4.3 Accelerating Matrix Transposition

Matrix Transposition (MT) is at the heart of many media algorithms and can take up a large part of the execution time. Typically, multimedia data is organized in blocks (matrices) of 16 × 16, 8 × 8, and sometimes smaller. This allows utilization of SIMD instructions to exploit DLP. Computations, however, often have to be performed both row-wise and column-wise. Thus matrix transposes are required to arrange the data for SIMD processing. In this section we show how fast matrix transposition can be achieved at low cost using the swapoe (swap odd-even) instructions. But first we review existing methodologies for matrix transposition using SIMD.

Conventional SIMD Matrix Transposition

The conventional way of performing an MT with SIMD instructions is to perform a series of permutations. We review an 8 × 8 transpose in AltiVec as an example. AltiVec has 128-bit SIMD registers and thus we assume a matrix of halfwords, stored in eight registers (a0 through a7). The code in Listing 4.6 performs the MT and stores the result in registers b0 through b7.

Listing 4.6: An 8×8 matrix transpose using AltiVec instructions.

TRANSPOSE_8_ALTIVEC(a0,a1,a2,a3,a4,a5,a6,a7,
                    b0,b1,b2,b3,b4,b5,b6,b7) {
    b0 = vec_mergeh( a0, a4 );
    b1 = vec_mergel( a0, a4 );
    b2 = vec_mergeh( a1, a5 );
    b3 = vec_mergel( a1, a5 );
    b4 = vec_mergeh( a2, a6 );
    b5 = vec_mergel( a2, a6 );
    b6 = vec_mergeh( a3, a7 );
    b7 = vec_mergel( a3, a7 );
    a0 = vec_mergeh( b0, b4 );
    a1 = vec_mergel( b0, b4 );
    a2 = vec_mergeh( b1, b5 );
    a3 = vec_mergel( b1, b5 );
    a4 = vec_mergeh( b2, b6 );
    a5 = vec_mergel( b2, b6 );
    a6 = vec_mergeh( b3, b7 );
    a7 = vec_mergel( b3, b7 );
    b0 = vec_mergeh( a0, a4 );
    b1 = vec_mergel( a0, a4 );
    b2 = vec_mergeh( a1, a5 );
    b3 = vec_mergel( a1, a5 );
    b4 = vec_mergeh( a2, a6 );
    b5 = vec_mergel( a2, a6 );
    b6 = vec_mergeh( a3, a7 );
    b7 = vec_mergel( a3, a7 );
}

The instruction vec_mergeh(ra, rb) functions as follows. Assume that the elements of each vector are numbered beginning with 0. The even-numbered elements of the result are taken, in order, from the elements in the most significant 8 bytes of ra. The odd-numbered elements of the result are taken, in order, from the elements in the most significant 8 bytes of rb. The instruction vec_mergel(ra, rb) functions similarly but takes the elements from the least significant 8 bytes. The AltiVec MT requires 24 instructions for an 8 × 8 matrix. In general this approach requires log n steps of n instructions, so in total n log n instructions are required. For n = 16, 64 AltiVec or SPU instructions are required.

The figure shows the 8 × 8 matrix contents at four stages of the algorithm: initial, after the first step, after the second step, and final.

Figure 4.5: Eklundh’s matrix transpose algorithm.

Matrix Transposition using Swapoe Instructions

In this section we present the swapoe instructions and explain how they speed up matrix transposition. The swapoe instructions are similar to the mix instructions introduced by Lee [97] and used in HP's MAX-2 multimedia extension to the PA-RISC architecture. Although the mix instructions found their way into the IA-64 architecture because of the collaboration between HP and Intel, no modern multimedia extension incorporates them. We rediscover the swapoe instructions and show how they are valuable for contemporary media applications. The swapoe instructions are inspired by Eklundh's recursive matrix transposition algorithm [56], which is illustrated in Figure 4.5 for an 8 × 8 matrix. The new instructions have two source registers and two destination registers. In this chapter we assume that the source and the destination registers are the same, but if allowed by the instruction format, they can also be different. Using the same registers for source and destination has the advantage that the transpose is performed in-situ. The complete set of swapoe instructions is depicted in Figure 4.6.

swapoedw  ra,rb    \\ Swap doublewords
swapoew   ra,rb    \\ Swap words
swapoehw  ra,rb    \\ Swap halfwords
swapoeb   ra,rb    \\ Swap bytes

Figure 4.6: Overview of the swapoe instructions.

Before: a = (a0 a1 a2 a3 a4 a5 a6 a7), b = (b0 b1 b2 b3 b4 b5 b6 b7); after: a = (a0 b0 a2 b2 a4 b4 a6 b6), b = (a1 b1 a3 b3 a5 b5 a7 b7).

Figure 4.7: Graphical representation of the swapoehw instruction. The odd elements of the a register are swapped with the even elements of the b register.

The instructions swap the odd elements of register ra with the even elements of rb. Figure 4.7 shows the swapoehw instruction as an example: the top of the figure depicts the two registers before the instruction is executed, and the bottom shows their contents after the swapoehw. The other instructions operate similarly but on different element sizes. The swapoe instructions directly implement the steps of Eklundh's matrix transpose algorithm. The 8 × 8 MT function using swapoe is shown in Listing 4.7. This implementation requires 12 instructions. Note that this method performs the MT in-situ, i.e., no additional registers are required for temporary storage. Depending on the size of the register file and the kernel the MT is used in, this might save a lot of loads and stores and further increase performance. In general, for an n × n matrix each step can be accomplished using n/2 instructions. Since there are log n stages, the total number of instructions is (n log n)/2, assuming the matrix is already loaded in the register file. Thus, for n = 16, 32 instructions are required. Indeed, our approach requires half the number of instructions of the AltiVec implementation.
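To make the data movement of a single swapoehw concrete, the sketch below emulates it with the existing spu_shuffle intrinsic, assuming the element ordering of Figure 4.7; the proposed instruction replaces the two shuffles, the two pattern constants, and the temporaries with one in-situ operation.

static void swapoehw_emulated(vsint16_t *a, vsint16_t *b)
{
    const vuint8_t pat_a = { 0x00,0x01, 0x10,0x11, 0x04,0x05, 0x14,0x15,
                             0x08,0x09, 0x18,0x19, 0x0C,0x0D, 0x1C,0x1D };
    const vuint8_t pat_b = { 0x02,0x03, 0x12,0x13, 0x06,0x07, 0x16,0x17,
                             0x0A,0x0B, 0x1A,0x1B, 0x0E,0x0F, 0x1E,0x1F };
    vsint16_t new_a = spu_shuffle(*a, *b, pat_a);   /* a0 b0 a2 b2 a4 b4 a6 b6 */
    vsint16_t new_b = spu_shuffle(*a, *b, pat_b);   /* a1 b1 a3 b3 a5 b5 a7 b7 */
    *a = new_a;
    *b = new_b;
}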

Listing 4.7: An 8 × 8 matrix transpose using the swapoe instructions.

TRANSPOSE_8_SWAPOE(a0,a1,a2,a3,a4,a5,a6,a7) {
    swapoedw(a0, a4);
    swapoedw(a1, a5);
    swapoedw(a2, a6);
    swapoedw(a3, a7);
    swapoew(a0, a2);
    swapoew(a1, a3);
    swapoew(a4, a6);
    swapoew(a5, a7);
    swapoehw(a0, a1);
    swapoehw(a2, a3);
    swapoehw(a4, a5);
    swapoehw(a6, a7);
}

Implementation Details

The swapoe instructions have two register operands and thus fit the RR instruction format of the SPU [81], where the RB slot is not used. Below, the details of all swapoe instructions are provided.

swapoedw rt,ra

00 0 1 10 1 0 0 0 0 RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

w ← RT 8:15 RT 8:15 ← RA0:7 RA0:7 ← w

swapoew rt,ra

00 0 1 10 1 0 0 0 1 RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

t ← RT 4:7 RT 4:7 ← RA0:3 RA0:3 ← t t ← RT 12:15 RT 12:15 ← RA8:11 RA8:11 ← t

swapoehw rt,ra

00 0 1 10 1 0 0 1 0 RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 15 by 4 r ← RT j+2::2 RT j+2::2 ← RAj::2 RAj::2 ← r end

swapoeb rt,ra

00 0 1 10 10 1 0 0 RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 15 by 2 b ← RT j+1 RT j+1 ← RAj RAj ← b end

All swapoe instructions were made accessible from C code through intrinsics. Using the intrinsics, however, can cause problems for the following reason. The SPU compiler assumes that for an instruction mnemonic rt, ra, [rb/rc/imm] only register rt is modified. As a consequence, the compiler assumes the variable located in register ra is also available in memory, and might decide to re-use the register for another variable. To avoid this problem inline assembly was used, which allows the programmer to specify the mode of each register:

asm volatile ("swapoedw %0,%1" \ : "=r" (a0), "=r" (a4) \ : "0" (a0), "1" (a4) \ );

Note that we used the keyword volatile here. It was not required to maintain the correct order of execution; we used it to force the compiler to keep the hand-optimized order of instructions. Simulations confirmed that the hand-optimized order of these swapoe instructions indeed resulted in faster execution than compiler-scheduled swapoe instructions. The swapoe instructions were added to the simulator and to the compiler. We assume a latency of three clock cycles for the predefined permutations of the swapoe instructions. The instructions, however, write to two registers instead of one. Writing to two registers can be serialized if only one write port is available. The SPU register file probably has two write ports as it has an odd and an even pipeline. Thus, the swapoe instructions could use both these write ports. This might, however, cause a stall in the other pipeline. We account for the second write, no matter how it is implemented, by adding one extra cycle. In total, the latency of the swapoe instructions was thus set to four clock cycles.

Related Work

A linear-time matrix transposition method was proposed in [72], which utilizes a special vector register file that allows diagonal access. The in-situ matrix transpose (IPMT) algorithm allows a matrix, stored in the register file, to be transposed by performing a number of rotations on the diagonal registers. The number of instructions required by this approach, assuming the buffered implementation, is 2n − 2. For the common cases n = 16 and n = 8 this approach requires 30 and 14 instructions, respectively, which is roughly twice as fast as the AltiVec matrix transpose. Our implementation using the swapoe instructions takes 32 and 12 instructions for n = 16 and n = 8,

Listing 4.8: Example of how the sfxsh instructions speed up code of the IDCT kernel. Left are two lines of C code, in the middle is the normal assembly code, while the right depicts the assembly code using sfxsh instructions.

C code                     normal assembly            with sfxsh
b1 = (a7 >> 2) + a1;       rotmahi $11,$9,-2          shaddhw $11,$11,$9,2
b3 = a3 + (a5 >> 2);       ah      $11,$11,$4         shaddhw $8,$8,$15,2
                           rotmahi $8,$15,-2
                           ah      $8,$8,$7

respectively. Thus our approach is as fast as IPMT but much simpler, as we use a conventional register file. As mentioned in Section 4.1, the MOM architecture provides a matrix transpose instruction. According to [48] this transpose takes eight cycles for an 8 × 8 matrix. As the architecture is only described at the ISA level, it is unclear how this is implemented and what the impact on the performance is.

4.4.4 Accelerating Arithmetic Operations

In this section we propose several instruction classes that enhance the execution of regular arithmetic expressions in video decoding.

Sfxsh Instructions

Sfxsh (Simple FiXed point & SHift) instructions combine a simple fixed point operation with a shift operation. Listing 4.8 shows an example from the IDCT8 kernel where one operand is shifted and added to the second. Using the normal ISA, these two lines of code require four instructions. This kind of code is found frequently in the kernels we investigated. More generally, it can be characterized as follows: there are two register operands, a simple arithmetic operation (such as addition or subtraction), and one shift right by a small constant number of bits. Our analysis showed that the instructions depicted in Figure 4.8 are beneficial. Our benchmarks only used the instructions that operate on vectors of halfwords, but we provide the same operations for vectors of words as well. To prevent overflow, computation is generally not performed using bytes but halfwords; thus, sfxsh instructions for bytes do not seem useful and are omitted.

shaddhw rt,ra,rb,imm   \\ rt = ra + rb>>imm    \
shsubhw rt,ra,rb,imm   \\ rt = ra - rb>>imm    |  for vector
addshhw rt,ra,rb,imm   \\ rt = (ra + rb)>>imm  |  of halfwords
subshhw rt,ra,rb,imm   \\ rt = (ra - rb)>>imm  /
shaddw  rt,ra,rb,imm   \\ rt = ra + rb>>imm    \
shsubw  rt,ra,rb,imm   \\ rt = ra - rb>>imm    |  for vector
addshw  rt,ra,rb,imm   \\ rt = (ra + rb)>>imm  |  of words
subshw  rt,ra,rb,imm   \\ rt = (ra - rb)>>imm  /

Figure 4.8: Overview of the sfxsh instructions.

The sfxsh instructions could fit the RRR format of the SPU ISA if the immediate value were stored in a register. This approach, however, would cause an unnecessary load to a register. Furthermore, there are not enough opcodes of the RRR format available to implement all instructions. The shift operation, though, is only performed by a few bits, i.e., the shift count is between 0 and 7. Thus, the immediate value can be represented by 3 bits only. Using this observation we defined the RRI3 instruction format, which uses 8 bits for the opcode, 3 bits for the immediate value, and 7 bits for each of the three register operands. The instruction details, depicted below, show the RRI3 format graphically.

shaddhw rt,ra,rb,imm

100 1 00 0 0 I3 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

s ← RepLeftBit(I3,16) for j = 0 to 15 by 2 p ← RT j::2 for b = 0 to 15 if (b − s) > 0 then rb ← pb−s else rb ← 0 end RBj::2 ← RAj::2 + r end

shsubhw rt,ra,rb,imm

100 1 00 0 1 I3 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

s ← RepLeftBit(I3,16) for j = 0 to 15 by 2 p ← RT j::2 for b = 0 to 15 if (b − s) > 0 then rb ← pb−s else rb ← 0 end RBj::2 ← RAj::2 − r end

addshhw rt,ra,rb,imm

100 1 0 0 1 0 I3 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

s ← RepLeftBit(I3,16) for j = 0 to 15 by 2 p ← RAj::2 + RT j::2 for b = 0 to 15 if (b − s) > 0 then rb ← pb−s else rb ← 0 end RBj::2 ← r end

subshhw rt,ra,rb,imm

100 1 0 0 1 1 I3 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

s ← RepLeftBit(I3,16) for j = 0 to 15 by 2 p ← RAj::2 − RT j::2 for b = 0 to 15 if (b − s) > 0 then rb ← pb−s else rb ← 0 end RBj::2 ← r end

shaddw rt,ra,rb,imm

100 1 01 0 0 I3 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

s ← RepLeftBit(I3,16) for j = 0 to 15 by 4 t ← RT j::4 for b = 0 to 31 if (b − s) > 0 then ub ← tb−s else ub ← 0 end RBj::4 ← RAj::4 + u end

shsubw rt,ra,rb,imm

100 1 01 0 1 I3 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

s ← RepLeftBit(I3,16) for j = 0 to 15 by 4 t ← RT j::4 for b = 0 to 31 if (b − s) > 0 then ub ← tb−s else ub ← 0 end RBj::4 ← RAj::4 − u end

addshw rt,ra,rb,imm

100 1 0 1 1 0 I3 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

s ← RepLeftBit(I3,16) for j = 0 to 15 by 4 t ← RAj::4 + RT j::4 for b = 0 to 31 if (b − s) > 0 then ub ← tb−s else ub ← 0 end RBj::4 ← u end

subshw rt,ra,rb,imm

100 1 0 1 1 1 I3 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

s ← RepLeftBit(I3,16) for j = 0 to 15 by 4 t ← RAj::4 − RT j::4 for b = 0 to 31 if (b − s) > 0 then ub ← tb−s else ub ← 0 end RBj::4 ← u end

rrrx rt,ra,rb,imm Compiler view

1 00 1 I7 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

opcode I3 Programmer and simulator view

Figure 4.9: The rrrx instruction of format RRR was used inside the compiler to handle the sfxsh instructions of format RRI3. At the programmer's side an offset was added to the immediate value to set bits 4 to 7.

Listing 4.9: Defines that translate the sfxsh intrinsics to the rrrx intrinsic and add the offset to the immediate value.

#define spu_shaddhw(a,b,c)  spu_rrrx(a,b,c)
#define spu_shsubhw(a,b,c)  spu_rrrx(a,b,c+8)
#define spu_addshhw(a,b,c)  spu_rrrx(a,b,c+16)
#define spu_subshhw(a,b,c)  spu_rrrx(a,b,c+24)
#define spu_shaddw(a,b,c)   spu_rrrx(a,b,c+32)
#define spu_shsubw(a,b,c)   spu_rrrx(a,b,c+40)
#define spu_addshw(a,b,c)   spu_rrrx(a,b,c+48)
#define spu_subshw(a,b,c)   spu_rrrx(a,b,c+56)

The sfxsh instructions were made available to the programmer through intrinsics. We did not modify the compiler to handle the new instruction format. Instead we used a little trick at the programmer's side, such that the compiler could handle the sfxsh instructions as if they were of the RRR format. Inside the compiler we created the instruction rrrx rt,ra,rb,imm using the RRR format, except that we replaced rc with an immediate (see Figure 4.9). The opcode is 4 bits and has value 0x480 (1001). This instruction is available through the intrinsic d = spu_rrrx(a,b,imm). To create the instruction words corresponding to the sfxsh instructions we created the defines of Listing 4.9, which add an offset to the immediate value. The offset changes bits 4 to 7 of the instruction word such that the right RRI3 opcode is created. The simulator was modified to recognize the new RRI3 instruction format and the new instructions. The latency of the sfxsh instructions was set to five cycles: normal shift and rotate instructions take four cycles, and we added one cycle for the extra addition or subtraction.
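A usage sketch of these defines (the operand order is assumed from Figure 4.8, i.e., the first operand is the one added and the second the one shifted):

static vsint16_t butterfly_step(vsint16_t a1, vsint16_t a7)
{
    return spu_shaddhw(a1, a7, 2);   /* b1 = a1 + (a7 >> 2) in one instruction,
                                        instead of the rotmahi/ah pair of Listing 4.8 */
}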

Listing 4.10: Example code from the Luma kernel that can be replaced by one mpytr instruction. In the code, first the odd elements of input are multiplied by constant 5, followed by the even elements. A shuffle packs the two vectors of words into one vector of halfwords.

vsint16_t v5   = spu_splats(5);
vuint8_t  mask = { 0x02,0x03,0x12,0x13,0x06,0x07,0x16,0x17,
                   0x0A,0x0B,0x1A,0x1B,0x0E,0x0F,0x1E,0x1F };
vsint32_t A = spu_mulo(input, 5);
vsint32_t B = spu_mule(input, v5);
vsint16_t result = spu_shuffle(B, A, mask);

Mpytr Instructions

The Mpytr (multiply and truncate) instructions perform a multiply or multiply-add on vectors of halfwords and store the results as halfwords as well, by truncating the most significant bits. The normal multiply instructions in the SPU ISA take halfword operands as input and produce word values. The reason is of course not to lose precision, but the side effect is that only four SIMD slots are used while the input vector consists of eight elements. In video decoding the primary input and output data type is the unsigned byte. In order not to lose precision, computations are mostly performed on signed halfword values. In the Luma and Chroma interpolation filters several multiplications are performed. Using the normal ISA the intermediate results are words (32 bits), while we know from the algorithm that the maximum possible value can be represented by 15 bits. Using the mpytr instructions, the conversion to 32-bit values is avoided.

Listing 4.10 shows a piece of code from the Luma kernel that multiplies the vector input, containing eight signed halfwords, by the constant 5. First the odd elements are multiplied (spu_mulo), followed by the even elements (spu_mule). Using a shuffle the two vectors of words are packed into one vector of halfwords. Furthermore, there is some overhead involved to create a vector of constants and to load the mask. This overhead, though, is performed once per function call and thus shared among a few multiplications. The code of Listing 4.10 can be replaced by one mpytrhwi instruction, which multiplies a vector of signed halfwords by an unsigned immediate value. The other instructions in this class are mpytrhw (multiply two vectors of signed halfwords) and maddtrhw (multiply-add two vectors of signed halfwords). We do not expect mpytr instructions for unsigned halfwords to be useful, as most media computations are done using signed values. Mpytr instructions for bytes do not seem useful as unpacking to 16-bit is generally necessary and thus normal byte multiplication instructions can be used (see the next section). Mpytr instructions for words might be useful in certain cases; as we are not aware of any, we did not implement those instructions. Further investigation of applications, for example audio processing, would be useful in this respect.

The mpytrhw instruction has the RR format and the mpytrhwi has the RI7 format. Thus the latter has seven bits for the immediate value, which is enough for video decoding. The RI10 format could be used as well in case more bits are necessary for the immediate value. The maddtrhw instruction has the RRR format. The instruction details are depicted below.

mpytrhw rt,ra,rb

01 1 0 00 0 1 0 0 1 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 t ← RA2j::2 ∗ RB2j::2 2j::2 RT ← t0:15 end

mpytrhwi rt,ra,imm

01 1 0 00 0 1 0 0 0 I7 RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 t ← RA2j::2 ∗ I7 2j::2 RT ← t0:15 end

maddtrhw rt,ra,rb,rc

1 0 1 0 RC RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 t ← RA2j::2 ∗ RB2j::2 + RC2j::2 2j::2 RT ← t0:15 end

The mpytr instructions were made available to the programmer through intrinsics and were added to the simulator. All regular multiply instructions are executed in the FP7 pipeline, which has a latency of seven cycles. The mpytr instructions are no more complex than the other multiply instructions and thus their latency was set to seven cycles.
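As a usage sketch (intrinsic name assumed from the mnemonic), the whole sequence of Listing 4.10 then becomes:

static vsint16_t scale_by_5(vsint16_t input)
{
    /* each signed halfword times 5, keeping only the low 16 bits: no widening
       to words and no repacking shuffle are needed                            */
    return spu_mpytrhwi(input, 5);
}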

Mpy byte Instructions

The mpy byte instructions perform a multiply on vectors of unsigned byte elements and produce vectors of unsigned halfwords. These instructions are not novel at all, but the SPU ISA did not contain any byte multiplication instructions, although they are useful for many media applications. In video decoding, typically the byte input data is first unpacked to halfwords, after which the arithmetic computations are performed. In the previous section we showed how the mpytr instructions can prevent the overhead of going to words when performing a multiplication. Another way to resolve this issue is to do the multiplications on the byte values directly and get the unpack to halfwords for free. As byte multiplication instructions are not part of the normal SPU ISA, we implemented them. The mpyubi and mpyhhubi instructions multiply a vector of unsigned bytes with an immediate value and store the results as unsigned halfwords. The first instruction operates on the odd elements of the input vector while the latter operates on the even elements. These instructions fit the RI7 format. Although not used in the benchmarks, we also added the mpyub and mpyhhub instructions to the ISA. These instructions are the same as the previous ones, but take two register operands as input.

The maddub and msubub instructions perform a multiply-add and multiply-subtract operation on vectors of unsigned bytes and store the result as a vector of unsigned halfwords. These instructions operate on the odd elements of the input vectors and fit the RRR format. Only a few opcodes of the RRR format are available and in the normal SPU ISA only two were left unused. We had already used these opcodes, however, for the rrrx and maddtrhw instructions. To be able to use the maddub and msubub anyway, we removed two floating point instructions from the ISA. For a media accelerator these floating point instructions are not needed and thus this choice is justified. If a more general purpose core is targeted, the msubub is most likely not worth removing another instruction for; simulation results showed that this instruction provides very limited benefit compared to the others. We also implemented the mpyhhaub instruction, which multiply-adds the even elements of vectors of bytes. Similar to the existing mpyhha instruction it has the RR format and thus writes the result to the register of the first operand. We expected this instruction to be a good addition to the maddub instruction. Simulation results, however, showed no speedup using this instruction compared to using a shuffle and a normal multiply-add. Therefore we did not include it in the accelerator architecture. For the same reason we omitted the mpyhhsub (multiply high high and subtract unsigned bytes) instruction. We did not implement any multiplication instructions for signed bytes, as we are not aware of any media application using this data type. The details of the new byte multiplication instructions that we implemented are depicted below.

mpyub rt,ra,rb

01 1 0 00 0 1 1 0 1 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 RT 2j::2 ← RA2j+1 | ∗ | RB2j+1 end

mpyhhub rt,ra,rb

01 1 0 00 0 1 1 1 0 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 RT 2j::2 ← RA2j | ∗ | RB2j end

mpyubi rt,ra,imm

0 1 1 0 00 0 1 0 1 0 I7 RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 RT 2j::2 ← RA2j+1 | ∗ | I7 end

mpyhhubi rt,ra,imm

0 1 1 0 00 0 1 0 1 1 I7 RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 RT 2j::2 ← RA2j | ∗ | I7 end

maddub rt,ra,rb,rc

1 1 1 1 RT RB RA RC

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 RT 2j::2 ← RA2j+1 | ∗ | RB2j+1 + RC2j+1 end

msubub rt,ra,rb,rc

1 1 0 1 RT RB RA RC

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 7 RT 2j::2 ← RA2j+1 | ∗ | RB2j+1 − RC2j+1 end

The mpy byte instructions were made available to the programmer through intrinsics and were added to the simulator. Just as for the mpytr instructions, the mpy byte instructions are no more complex than the other multiply instructions and thus their latency was set to seven cycles.
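A usage sketch of the byte-multiply intrinsics (intrinsic names assumed from the mnemonics); the unpack to halfwords comes for free with the multiply:

static void scale_pixels(vuint8_t pixels, vuint16_t *odd, vuint16_t *even)
{
    *odd  = spu_mpyubi(pixels, 5);     /* odd byte elements  * 5 -> unsigned halfwords */
    *even = spu_mpyhhubi(pixels, 5);   /* even byte elements * 5 -> unsigned halfwords */
}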

IDCT8 Instruction

The Inverse Discrete Cosine Transform (IDCT) is an operation often found in still picture and video processing. The IDCT is typically performed on 8 × 8 or 4 × 4 pixel blocks. Performing the IDCT comprises a one-dimensional IDCT row-wise followed by a one-dimensional IDCT on each column. FFmpeg uses the LLM IDCT algorithm [105]. The computational structure of the one-dimensional IDCT8 is depicted in Figure 4.10. If elements 0 through 7 are a row of the matrix, then this diagram represents a row-wise one-dimensional IDCT8. The diagram also shows that the IDCT consists of simple additions, subtractions, and some shifts. The typical way to SIMDimize this one-dimensional IDCT is to perform the operations on vectors of elements. Thus, for the row-wise 1D IDCT a vector would be a column of the matrix, while for the column-wise 1D IDCT a vector would be a row of the matrix.


Figure 4.10: Computational structure of the one-dimensional IDCT8. Arrows indicate a shift right by one bit. A dotted line means that the two input operands are the same.

Assuming the sub-block is stored in memory in row-major order (row elements at consecutive addresses), the whole IDCT is performed in four steps: matrix transpose (MT); row-wise 1D IDCT; MT; column-wise 1D IDCT. We propose the IDCT8 instruction, which performs the row-wise 1D IDCT8. This intra-vector instruction takes one row of the matrix as input, performs the arithmetic operations depicted in Figure 4.10, and stores the result in a register. Thus, for an 8 × 8 matrix the total row-wise 1D IDCT8 takes eight instructions. Besides the reduction in the number of executed instructions, the two MTs are also avoided, because using the IDCT8 instruction the whole IDCT can be performed in two steps: a row-wise 1D IDCT using the IDCT8 instruction, followed by a conventional column-wise 1D IDCT. The IDCT8 instruction fits the RR format, where the RB slot is not used. Register RA is the input register while RT is the output register. Usually for the IDCT8 these two would be the same, but the instruction allows them to be different. The IDCT8 instruction assumes its operand to be a vector of signed halfwords. The instruction details are depicted below.

idct8 rt,ra

Encoding (bit positions 0-31): 0 0 0 1 1 0 1 0 1 1 0 | RA | RT

r0 ← RA[0:1] + RA[8:9]
r2 ← RA[0:1] − RA[8:9]
r4 ← RA[2:3]/2 − RA[12:13]
r6 ← RA[2:3] + RA[12:13]/2

s0 ← r0 + r6
s2 ← r2 + r4
s4 ← r2 − r4
s6 ← r0 − r6

r1 ← RA[10:11] − RA[6:7] − (RA[14:15] + RA[14:15]/2)
r3 ← RA[2:3] + RA[14:15] − (RA[6:7] + RA[6:7]/2)
r5 ← RA[14:15] − RA[2:3] + (RA[10:11] + RA[10:11]/2)
r7 ← RA[6:7] + RA[10:11] + (RA[2:3] + RA[2:3]/2)

s1 ← r7/4 + r1
s3 ← r3 + r5/4
s5 ← r3/4 − r5
s7 ← r7 − r1/4

RT[0:1] ← s0 + s7
RT[2:3] ← s2 + s5
RT[4:5] ← s4 + s3
RT[6:7] ← s6 + s1
RT[8:9] ← s6 − s1
RT[10:11] ← s4 − s3
RT[12:13] ← s2 − s5
RT[14:15] ← s0 − s7


Figure 4.11: Computational structure of the one-dimensional IDCT4. An arrow indicates a shift right by one bit.

The IDCT8 instruction was made available to the programmer through an intrinsic. The instruction was added to the simulator and its latency was set to eight cycles for the following reason. Fixed point operations take two cycles in the SPU architecture. The IDCT8 instruction performs four steps of simple fixed point operations, and thus the latency of the new instruction is at most four times two cycles, for a total of eight cycles.
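
To make the instruction semantics given above concrete, the following scalar C model computes the same row-wise 1D IDCT8 on one row of eight signed halfwords (a sketch for illustration only; the function name and the 32-bit temporaries are ours, and /2 and /4 in the instruction description correspond to arithmetic right shifts):

#include <stdint.h>

/* Scalar sketch of the row-wise 1-D IDCT8 computed by the proposed idct8
   instruction on one row e[0..7] of signed halfwords (in place). */
static void idct8_row_model(int16_t e[8])
{
    int r0 = e[0] + e[4];
    int r2 = e[0] - e[4];
    int r4 = (e[1] >> 1) - e[6];
    int r6 = e[1] + (e[6] >> 1);

    int s0 = r0 + r6, s2 = r2 + r4, s4 = r2 - r4, s6 = r0 - r6;

    int r1 = e[5] - e[3] - (e[7] + (e[7] >> 1));
    int r3 = e[1] + e[7] - (e[3] + (e[3] >> 1));
    int r5 = e[7] - e[1] + (e[5] + (e[5] >> 1));
    int r7 = e[3] + e[5] + (e[1] + (e[1] >> 1));

    int s1 = (r7 >> 2) + r1, s3 = r3 + (r5 >> 2);
    int s5 = (r3 >> 2) - r5, s7 = r7 - (r1 >> 2);

    e[0] = (int16_t)(s0 + s7);  e[1] = (int16_t)(s2 + s5);
    e[2] = (int16_t)(s4 + s3);  e[3] = (int16_t)(s6 + s1);
    e[4] = (int16_t)(s6 - s1);  e[5] = (int16_t)(s4 - s3);
    e[6] = (int16_t)(s2 - s5);  e[7] = (int16_t)(s0 - s7);
}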

IDCT4 Instruction

The IDCT4 is similar to the IDCT8 but operates on a 4 × 4 pixel block. The computational structure of the one-dimensional IDCT4 is simpler than that of the IDCT8, as shown in Figure 4.11. For the IDCT4 kernel a similar approach could be used as for the IDCT8. As a 4 × 4 matrix of halfwords completely fits in two 128-bit registers, however, the entire two-dimensional IDCT4 can be performed using one instruction with two operands. Such an instruction reads the matrix from its two operand registers, performs the arithmetic operations, and stores the result back in the two registers. Storing the result can be done in one cycle if the register file has two write ports; otherwise two cycles are required. The IDCT4 instruction fits the RR format where the RB slot is not used. Register RA is the input and output register that contains the first two rows of the matrix, while RT holds the last two rows. The IDCT4 instruction assumes its operands to be vectors of signed halfwords. The instruction details are depicted below.

idct4 rt,ra

Encoding (bit positions 0-31): 0 0 0 1 1 0 1 0 1 1 1 | RA | RT

for i = 0 to 8 by 8
    r0 ← RA[i::2] + RA[(i+4)::2]
    r1 ← RA[i::2] − RA[(i+4)::2]
    r2 ← (RA[(i+2)::2] >> 1) − RA[(i+6)::2]
    r3 ← RA[(i+2)::2] + (RA[(i+6)::2] >> 1)

    Q[i::2] ← r0 + r3
    Q[(i+2)::2] ← r1 + r2
    Q[(i+4)::2] ← r1 − r2
    Q[(i+6)::2] ← r0 − r3

    r0 ← RT[i::2] + RT[(i+4)::2]
    r1 ← RT[i::2] − RT[(i+4)::2]
    r2 ← (RT[(i+2)::2] >> 1) − RT[(i+6)::2]
    r3 ← RT[(i+2)::2] + (RT[(i+6)::2] >> 1)

    R[i::2] ← r0 + r3
    R[(i+2)::2] ← r1 + r2
    R[(i+4)::2] ← r1 − r2
    R[(i+6)::2] ← r0 − r3
end

for i = 0 to 3
    r0 ← Q[2i::2] + R[2i::2]
    r1 ← Q[2i::2] − R[2i::2]
    r2 ← (Q[2(i+4)::2] >> 1) − R[2(i+4)::2]
    r3 ← Q[2(i+4)::2] + (R[2(i+4)::2] >> 1)

    RA[2i::2] ← (r0 + r3) >> 6
    RA[2(i+4)::2] ← (r1 + r2) >> 6
    RT[2i::2] ← (r1 − r2) >> 6
    RT[2(i+4)::2] ← (r0 − r3) >> 6
end

As the instruction details show, the result is shifted right by six bits before it is written to the registers. This operation is not depicted in Figure 4.11, but it is always performed after the two one-dimensional IDCTs have been computed. Therefore we added this operation to the IDCT4 instruction.

The IDCT4 instruction was made available to the programmer through inline assembly. Using inline assembly it is possible to specify that both operands of the instruction are read and written. The following code shows how to do this:

asm volatile ("idct4 %0,%1"
              : "=r" (v0), "=r" (v1)
              : "0" (v0), "1" (v1));

The instruction was added to the simulator and the latency was set to ten cycles for the following reason. Fixed point operations take two cycles in the SPU architecture. The IDCT4 instruction performs four steps of simple fixed point operations, making a total of eight cycles. We also count one cycle for the final shift (although this could be hardwired) and one cycle for writing the second result to the register file.

Related Work

The sfxsh instructions collapse two operations into one instruction. The operation pattern of the sfxsh instructions is frequently found in the H.264 kernels and is therefore useful to provide in hardware. In [26] it is described how such recurring operation patterns can be determined automatically.

The (inverse) discrete cosine transform has been used in many applications that compress image data. Several algorithms exist to compute the (I)DCT and many hardware implementations have been proposed. Some of the more recent proposals can be found in [69] and [66], both of which provide a good overview of other implementations in the past. Simai et al. [141] proposed an implementation for a programmable media processor. They extended the TriMedia processor with a reconfigurable execution unit and demonstrated how to implement the IDCT on it.

The (I)DCT algorithm used in the H.264 standard is a simplified version that is much less complex than previous algorithms. For example, it uses only additions, subtractions, and shifts where other algorithms required many multiplications. This algorithm was proposed in [109, 108] for the 4 × 4 transform, while the algorithm for the 8 × 8 transform was proposed in [67]. To the best of our knowledge, the only hardware implementation of this algorithm was proposed in [49].

Listing 4.11: Example code from the Luma kernel that loads one quadword from an unaligned memory address (src).

int perm_src = (unsigned int) src & 15;
vuint8_t srcA = *(vuint8_t *) (src);
vuint8_t srcB = *(vuint8_t *) (src + 16);
vuint8_t srcv = spu_or(spu_slqwbyte(srcA, perm_src),
                       spu_rlmaskqwbyte(srcB, perm_src - 16));

It performs the 8 × 8 2D IDCT in a sequential manner, i.e., it consumes and produces one sample per cycle. Their hardware unit was designed to be used in a streaming environment and is therefore not very suitable for a programmable media accelerator. Another difference with our approach is that we exploit DLP to reduce the latency of the IDCT operation. Similar to our work, in [137] the authors proposed some techniques to speed up the Discrete Wavelet Transform (DWT) on SIMD architectures. The DWT has a higher compression ratio but is computationally more complex than the DCT. The authors proposed to add a MAC instruction to the ISA, use the extended subwords technique, and use a special matrix register file.

4.4.5 Accelerating Unaligned Memory Accesses

In the SPE architecture, accesses to the local store are quadword aligned. The four least significant bits of a memory address used in a load or store instruction are truncated. Thus, byte-aligned memory addresses can be used, but the local store always returns an aligned quadword. If a quadword has to be loaded from or stored to an unaligned address, this has to be done in several steps.

Listing 4.11 shows an example of an unaligned load taken from the Luma kernel. First, two aligned quadwords are loaded, each containing a part of the targeted quadword. Next, the two quadwords are shifted left and right, respectively, to put the elements in the correct slots. Finally, the two words are combined using an OR instruction. The procedure for an unaligned store is even more complicated and involves two loads and two stores.

To overcome this limitation of the SPE architecture we implemented unaligned loads and stores. To ensure backward compatibility of existing code, the existing load and store operations were not altered. Instead, we added new instructions that perform the unaligned loads and stores. The details of these instructions are depicted below. We did not implement the unaligned version of the a-form Load and Store (L/S) instructions as we did not find those necessary.

lqdu rt,symbol(ra)

Encoding (bit positions 0-31): 0 0 1 0 1 0 0 1 | I10 | RA | RT

LSA ← (RepLeftBit(I10 || 0b0000, 32) + RA[0:3]) & LSLR
RT ← LocStor(LSA, 16)

lqxu rt,ra,rb

Encoding (bit positions 0-31): 0 0 1 1 1 0 0 0 1 0 1 | RB | RA | RT

LSA ← (RA[0:3] + RB[0:3]) & LSLR
RT ← LocStor(LSA, 16)

stqdu rt,symbol(ra)

Encoding (bit positions 0-31): 0 0 1 0 1 0 1 1 | I10 | RA | RT

LSA ← (RepLeftBit(I10 || 0b0000, 32) + RA[0:3]) & LSLR
LocStor(LSA, 16) ← RT

stqxu rt,ra,rb

Encoding (bit positions 0-31): 0 0 1 1 1 0 0 0 1 1 0 | RB | RA | RT

LSA ← (RA[0:3] + RB[0:3]) & LSLR
LocStor(LSA, 16) ← RT

To allow byte-aligned accesses, some modifications to the local store are required. A quadword may span two memory lines, so additional hardware is required to load or store such a misaligned quadword from memory. To account for this we added one cycle to the latency of the local store.

The unaligned L/S instructions were made available to the programmer through inline assembly only. Modifying the compiler to automatically use the unaligned L/S instructions is a complex task beyond the scope of this thesis. We tried to use intrinsics but could not make them work properly within reasonable time. Analysis of the assembly code revealed that the compiler was not able to schedule the unaligned L/S instructions as optimally as it does normal L/S instructions. Therefore, we optimized the scheduling of the unaligned L/S instructions manually. We did not hand-optimize the scheduling of other instructions, to keep the comparison fair.

Unaligned memory access has been applied before. In vector processing, of which SIMD is a subclass, there is often a mismatch between the data order used in memory and the data order required for computation. Therefore, most vector and DSP processors provide complex memory access modes that can be divided into three groups: unaligned access, strided access, and irregular access. In SIMD processors, due to the short vectors, mainly unaligned access is an issue. Multimedia SIMD extensions such as MMX and SSE support unaligned accesses in hardware at the cost of some additional latency. In other SIMD architectures, such as AltiVec/VMX, VIS, and 3DNow!, alignment has to be performed in software, which is very costly in terms of latency.

Alignment issues on different multimedia SIMD architectures have been investigated in [136]. The paper describes several software approaches to minimize the overhead costs. The benefits of supporting unaligned access in hardware are studied in detail in [23] for the AltiVec/VMX architecture. Although the research is general, the authors give an example of how unaligned access could be implemented. Similar work is presented in [38, 37], but these focus more on the actual memory access architecture and compiler optimizations. The latest TriMedia processor, which was specialized for H.264, supports unaligned memory access.

4.5 Performance Evaluation

In the previous section we presented several ISA enhancements to accelerate media applications, especially H.264 decoding. This section evaluates the performance improvement obtained with the proposed instructions on the kernels of our benchmark suite. Each kernel is analyzed by means of profiling, the identified deficiencies are described, and it is shown how the proposed architectural enhancements resolve these deficiencies.

4.5.1 Results for the IDCT8 Kernel

The first benchmark considered is the IDCT kernel, which is the smallest of all. It takes as input a macroblock of 16 × 16 16-bit elements and divides it into sub-blocks of 8 × 8 or 4 × 4 elements. On each sub-block it performs a one-dimensional transformation, first row-wise and then column-wise, thus a two-dimensional transformation in total. The size of the sub-blocks depends on the mode of the macroblock. We will refer to these modes as IDCT8 and IDCT4, respectively, and treat them as different kernels. This section describes the IDCT8 kernel while the next section describes the IDCT4 kernel.

The IDCT8 kernel was analyzed by splitting the kernel into parts and measuring the execution time taken by each part. The kernel consists of four 8 × 8 IDCTs, which are identical. In this analysis of the kernel we measured one 8 × 8 IDCT only. Table 4.3 presents the results. We remark again that this low-level profiling influences the compiler optimizations and thus the measured execution times. For the purpose of this analysis, however, it suffices to know the relative execution times.

We briefly describe the parts of the IDCT8 function. Although we profiled only one 8 × 8 IDCT, we included the overhead caused by the loop that starts the four 8 × 8 IDCTs needed to process the entire macroblock. This overhead includes calculating the pointers to the source and destination addresses of the data and the stride. The initialization of variables mainly consists of creating or loading constants in the register file.

Table 4.3: Profiling of the baseline IDCT8 kernel. Both the original and the optimized version are depicted. No architectural enhancements are included.

                         Original           Optimized
                       cycles     %       cycles     %
 Loop overhead             33     6%          36    11%
 Initialize variables      11     2%          11     3%
 Load data & offset        18     4%          30     9%
 Control                   62    12%          77    24%
 First transpose           28     5%          27     8%
 First 1D IDCT             38     7%          38    12%
 Second transpose          33     6%          32    10%
 Second 1D IDCT            37     7%          37    11%
 Shift right 6             12     2%          12     4%
 IDCT                     148    28%         146    45%
 Calculate dst shift        8     2%          14     4%
 Addition and store       300    58%          89    27%
 Addition                 308    60%         103    31%
 Total                    518   100%         326   100%

Loading the data includes address computation but also adding an offset to the first element of the 8 × 8 matrix for rounding purposes. The real computation of the IDCT involves two transposes, two sequences of one-dimensional IDCTs, and finally a division by 64 (i.e., a shift right by 6 bits).

Another part of the kernel adds the result of the IDCT to the result of the prediction stage. In the original code this part consumes most of the execution time. Analysis of the trace files revealed that it has a very low IPC because the compiler is not able to schedule that part of the code in an efficient way. The addition and store part consists basically of one macro for each matrix row. The macro loads one row of the matrix containing the result of the prediction stage, aligns the data, unpacks it, adds the result of the IDCT to it, packs the result, and stores it in memory. This macro expands into twelve instructions that are completely serial and have many dependencies. The compiler schedules these blocks of twelve instructions sequentially, resulting in an IPC of only 0.36 for this part. In other words, instructions are mainly stalled, waiting for the previous instruction to finish.
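
In scalar form, what the addition and store macro computes per row is simply the following (a sketch for clarity; the names are ours, and the kernel implements this with SIMD align, unpack, add, pack, and store sequences):

#include <stdint.h>

/* Add the IDCT residual of one row to the predicted pixels and saturate
   the result back to the unsigned byte range. */
static void add_residual_row(uint8_t dst[8], const int16_t residual[8])
{
    for (int i = 0; i < 8; i++) {
        int v = dst[i] + residual[i];
        if (v < 0)   v = 0;
        if (v > 255) v = 255;
        dst[i] = (uint8_t) v;
    }
}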

We optimized this part of the code by manually interleaving the instructions corresponding to the different blocks. This is possible because processing the different rows of the matrix is completely independent. Furthermore, we removed one unnecessary operation and combined a shift and a shuffle into one shuffle. These optimizations speed up this part of the kernel by a factor of three, as depicted in the right side of Table 4.3.

We chose to use the optimized version of the kernel as our baseline for evaluating architectural enhancements. Had the non-optimized kernel been taken, it would have been too easy to achieve speedup through architectural enhancements, resulting in biased results.

Besides profiling the kernel, we also analyzed the assembly code of the kernel and the trace files generated by the simulator. By doing so we identified a number of deficiencies in the execution of the IDCT8 kernel. Below we present the identified deficiencies and show how the proposed architectural enhancements of Section 4.4 resolve these issues.

• Matrix transposes: In order to utilize the SIMD capabilities, each matrix has to be transposed twice. This takes a lot of time, about as much as the actual IDCT computation does. Matrix transposes can be accelerated using the swapoe instructions. Using the IDCT8 instruction, however, the matrix transposes are completely avoided.

• Scalar operations: Due to the lack of a scalar data path, a scalar computation that involves a vector element requires a lot of instructions. Scalars are stored in the preferred slot of a vector. If one of the operands is a vector element, it has to be shifted to the preferred slot, the computation has to be performed, and using a shuffle the result has to be stored back in the original vector (a small example follows this list). The as2ve instructions allow a scalar to be added to one vector element in a single instruction.

• Lack of saturating arithmetic: The SPUs do not have saturating arith- metic and thus this has to be performed in software. This involves un- packing, adding, comparing, selecting, and packing. The asp instruction contains saturating arithmetic and can be used to avoid software satura- tion.

• Pack/unpack & alignment: Packing and unpacking has to be performed explicitly and is done using shuffle instructions. Moreover, most packing and unpacking require alignment which has to be performed explicitly

as well. Consider an unpack from eight byte elements to halfword elements. If the programmer cannot be sure whether the eight bytes are in the lower or upper part of the vector, he or she has to check the address, compute the shift, and rotate the data. After that the unpack can be done. In the IDCT kernel this is the case. A part of this problem (mainly packing) is solved by the asp instruction. Alignment issues remain a deficiency.
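
The scalar-operation overhead mentioned above looks, for example, as follows on the baseline SPE (a sketch; the function name is ours, and spu_extract/spu_insert are composite intrinsics that the compiler expands into a rotate to the preferred slot, an add, and a shuffle back):

#include <spu_intrinsics.h>

/* Baseline SPE: add the scalar s to element 3 of vector v. The proposed
   as2ve instructions perform the same operation in a single instruction. */
static vec_short8 add_scalar_to_element3(vec_short8 v, short s)
{
    return spu_insert((short)(spu_extract(v, 3) + s), v, 3);
}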

Besides the enhancements mentioned above, the sfxsh instructions are also beneficial to the IDCT8 kernel. Table 4.4 shows the effect of the enhancements on the execution time and the instruction count of the IDCT8 kernel. First, it shows the effect of each enhancement separately. Finally, the combination of all enhancements was also analyzed. This set of enhancements excluded the swapoe instructions, as they become obsolete when using the IDCT8 instruction, which eliminates the matrix transposes. For this analysis we ran the IDCT kernel on one complete macroblock. We remark again that this kernel-level profiling does not influence the compiler optimizations and thus these results are accurate.

The speedup achieved by each enhancement separately is between 0.99 and 1.39. Not surprisingly, the IDCT8 instruction achieves the largest speedup, mostly due to avoiding matrix transposes. The swapoe instructions do not provide speedup in this kernel. The amount of ILP in the kernel is low and thus the critical path in the dependency graph determines the latency of the kernel. Although using swapoe instructions reduces the number of executed instructions, the critical path of the matrix transpose does not change. Therefore, the critical path of the entire kernel does not change and neither does the latency.

Table 4.4: Speedup and reduction in instruction count of the IDCT8 kernel using the proposed architectural enhancements. The effect of each separate enhance- ment is listed as well as the aggregate effect.

                            Cycles                          Instructions
 Enhancement        Baseline  Enhanced  Speedup     Baseline  Enhanced  Reduction
 Swapoe                  896       908     0.99         1103       988       1.12
 IDCT8                   896       645     1.39         1103       756       1.46
 Sfxsh                   896       840     1.07         1103       984       1.12
 Asp                     896       781     1.15         1103       928       1.19
 As2ve                   896       865     1.04         1103      1072       1.03
 All (but swapoe)        896       488     1.84         1103       488       2.26

The speedup achieved when combining all architectural enhancements is 1.84. Note that a reduction in the number of instructions of 2.26 is also achieved. The latter is important for power efficiency, which has become a design constraint in contemporary processor design (see also Chapter 2). The achieved speedup is directly related to the reduction in instruction count, but does not necessarily have the same magnitude. For most kernels the speedup is lower than the instruction count reduction. This is expected, since multiple shorter-latency instructions have been replaced by a single instruction with a longer latency; in other words, the latency of the arithmetic operations stays the same. For example, the latency of the IDCT8 instruction is assumed to be 8 cycles, because there are 4 simple fixed point operations on the critical path, each taking two cycles. The application-specific instructions we have introduced mainly reduce the overhead and eliminate stalls due to data dependencies.

4.5.2 Results for the IDCT4 Kernel

Besides the IDCT8 kernel we also investigated the IDCT4 kernel. The IDCT4 kernel splits a MB into 16 sub-blocks of 4 × 4 pixels and performs the transform on these sub-blocks. We analyzed this kernel by profiling the execution time spent on the different parts. The IDCT4 kernel has the same execution steps as the IDCT8, and the explanation of those can be found in the previous section. As for the IDCT8, we measured the execution of one 4 × 4 IDCT only. Table 4.5 presents the results.

We optimized the code of the IDCT4 kernel in the same way we optimized the IDCT8 kernel. In the addition and store part we interleaved intrinsics in order to achieve a higher ILP. Furthermore, we removed one unnecessary operation and combined a shift and a shuffle into one shuffle. Again, these optimizations speed up this part of the kernel by a factor of three, as depicted in the right side of Table 4.5. For the same reasons as mentioned in the previous section, we chose the optimized version as our baseline for evaluating architectural enhancements.

We identified the deficiencies in the execution of the IDCT4 kernel by analyzing the profiling results and a trace output. The identified deficiencies are similar to those of the IDCT8 kernel, but there are some differences. Below we explain the deficiencies and show how the proposed architectural enhancements of Section 4.4 resolve these issues.

• Matrix transpositions: Because of the small matrix size the transpositions are less costly, but more of them are required, so they still take up about 20% of the execution time.

Table 4.5: Profiling of the baseline IDCT4 kernel. Both the original and the optimized version are depicted. No architectural enhancements are included.

                         Original           Optimized
                       cycles     %       cycles     %
 Loop overhead             33    10%          33    15%
 Initialize variables      11     3%          11     5%
 Load data & offset        29     9%          30    13%
 Control                   73    23%          74    33%
 First transpose           22     7%          22    10%
 First 1D IDCT             15     5%          16     7%
 Second transpose          25     8%          24    11%
 Second 1D IDCT            15     5%          16     7%
 Shift right 6             11     3%          12     5%
 IDCT                      88    27%          90    40%
 Calculate dst shift        7     2%           9     4%
 Addition and store       156    48%          53    24%
 Addition                 163    50%          62    27%
 Total                    324   100%         226   100%

Swapoe instructions are of no value for a 4 × 4 matrix as it fits entirely in two registers, and thus shuffle operations are sufficient. The IDCT4 instruction, on the other hand, completely removes all matrix transpositions.

• Unused SIMD slots: All SIMD computations are performed row- or column-wise. As the matrix is 4 × 4 and the elements are 16-bit, only half of the SIMD slots are used. The IDCT4 instruction does utilize all SIMD slots by exploiting the fact that an entire 4 × 4 matrix can be stored in two registers. This increases the efficiency of the computational part of the kernel.

• Scalar operations: Same as for IDCT8 kernel; see page 129 for a full description. The as2ve instructions were used to mitigate the effect of this deficiency.

• Lack of saturating arithmetic: Same as for IDCT8 kernel; see page 129 for a full description. The asp instruction was used to mitigate the effect of this deficiency.

• Pack, unpack & alignment: Same as for IDCT8 kernel; see page 130 for a full description. The asp instruction was used to resolve the pack/unpack deficiency. Alignment issues remain a deficiency.

Besides the enhancements mentioned above, we also used the unaligned L/S instructions. By doing this we avoided some branches that caused many stalls. Table 4.6 shows the effect of the enhancements on the execution time and the instruction count of the IDCT4 kernel. First, it shows the effect of each enhancement separately. Finally, the combination of all enhancements was also analyzed. For this analysis we ran the IDCT4 kernel on one complete macroblock.

The speedup achieved by the enhancements is between 1.05 and 1.23. A synergistic effect occurs when combining all enhancements, such that the total speedup is larger than the product of the separate speedups. The source of this synergistic effect seems to be the compiler optimization process. The speedup achieved when combining all architectural enhancements is 2.37, and the number of instructions is reduced by a factor of 2.67.

4.5.3 Results for the Deblocking Filter Kernel

The Deblocking Filter (DF) kernel is applied to remove an artifact introduced by the IDCT. The IDCT operates on 4 × 4 or 8 × 8 sub-blocks and as a result square areas might become visible in the decoded picture, which is called the 'blocking' artifact. The DF kernel smoothes these block edges and thus improves the appearance of the decoded picture. The process of deblocking a macroblock is depicted in Figure 4.12.

Table 4.6: Speedup and reduction in instruction count of the IDCT4 kernel using the proposed architectural enhancements. The effect of each separate enhance- ment is listed as well as the aggregate effect.

                         Cycles                          Instructions
 Enhancement     Baseline  Enhanced  Speedup     Baseline  Enhanced  Reduction
 IDCT4               2524      2059     1.23         1858      1304       1.42
 Asp                 2524      2277     1.11         1858      1512       1.23
 As2ve               2524      2407     1.05         1858      1756       1.06
 Unaligned L/S       2524      2086     1.21         1858      1720       1.08
 All                 2524      1063     2.37         1858       696       2.67

Figure 4.12: The process of deblocking one macroblock. Each square is a 4 × 4 sub-block. Also the left and top neighboring sub-blocks are involved in the process.

Each square represents a 4 × 4 sub-block. Note that also the left and top neighboring sub-blocks are depicted, as they are involved in the filtering. The filter is applied on all edges, starting with the vertical ones from left to right, followed by the horizontal ones from top to bottom. The filter is applied to each line of pixels perpendicular to the edge, with a range of four pixels on each side of the edge. The strength of the filter is determined dynamically, based on certain properties of the macroblock but, more importantly, on the pixel gradient across the edge. This adaptiveness decreases the available DLP, but an efficient SIMD implementation for the Cell SPU is still possible [29].

We analyzed the code of the DF kernel by profiling. Because the code of this kernel is not as small as that of the IDCT kernels, we profiled the code on several levels, of which we present here the most important results. Table 4.7 shows the profiling results of the top level, i.e., the function 'filter_mb_spu'. This function performs the filtering of one macroblock. Packing and unpacking is necessary because the data is stored in memory as bytes but processing is performed using halfwords. The matrix transposes are needed to rearrange the data such that it is amenable to SIMD processing. Analyzing the assembly code revealed that the loaded data was too large to be maintained in the register file during the entire execution. The compiler therefore generated a lot of loads and stores before and after each processing stage.

The total DF kernel consists of several functions. To find the most time-consuming parts we analyzed the number of function calls as well as the total amount of time spent in each function (including the functions that it calls).

Table 4.7: Top-level profiling of the DF kernel for one macroblock. Both the original and the optimized version are depicted. No architectural enhancements are included.

                         Original           Optimized
                       cycles     %       cycles     %
 Initialize variables      22     0%          22     0%
 Load & unpack data       761    10%         761    10%
 First transpose          451     6%         451     6%
 Filter horizontal       2741    34%        2539    34%
 Second transpose         448     6%         448     6%
 Filter vertical         2748    34%        2546    34%
 Pack and store data      806    10%         806    11%
 Total                   7977   100%        7573   100%

Table 4.8: Profiling of the functions of the DF kernel when filtering 4 × 4 macroblocks. The number of function calls and the time spent in each function are listed. Functions of level x only call functions of level x + 1.

 Function                         Calls    Time   Level
 filter_mb_spu                       16    100%       1
 filter_mb_edgeh                    240     53%       2
 filter_mb_edgecv                   112     18%       2
 h264_v_loop_filter_luma_c          192     19%       3
 h264_loop_filter_chroma             64      5%       3
 h264_loop_filter_chroma_intra       48      1%       3
 clip                               704      5%       3
 clip                               640      4%       4
 clip_uint8_altivec                 512      3%       4

The results are depicted in Table 4.8, which also states the level of each function. Functions of level x only call functions of level x + 1. For this analysis we filtered an area of 4 × 4 macroblocks in the top-left corner of a frame. A large amount of time is spent in the 'filter_mb_edgeh' function, which mainly calls 'h264_v_loop_filter_luma_c'. The latter, however, accounts for only 19% of the total execution time, while the former accounts for 53%. Further profiling revealed that in the former a lot of time is taken by table lookups, which are time-consuming on the SPU because of alignment issues.

Table 4.8 also shows that 12% of the time is spent in clip functions. The 'clip' and 'clip_altivec' functions are used to keep the index of table lookups within the range of the table (mainly between 0 and 51). The 'clip_uint8_altivec' function is used to saturate a halfword between 0 and 255. This is necessary because the SPU does not support saturating arithmetic.

While analyzing the kernel we also looked for non-optimal code. In this kernel we did not find much to optimize. We did change some code that generated constants: the code was written in plain C and the compiler generated non-optimal assembly code, so we rewrote it using SPU intrinsics. The right side of Table 4.7 shows the profiling results of the optimized kernel. As with the previous kernels, we chose the optimized version as the baseline for evaluating architectural enhancements.

We identified a number of deficiencies in the execution of the DF kernel. Below they are presented and we show how the architectural enhancements proposed in Section 4.4 resolve these issues.

• Matrix transpositions: In order to utilize the SIMD capabilities each data matrix has to be transposed twice. This rearrangement overhead takes up about 12% of the total execution time of the DF kernel. As argued before, matrix transposition can be accelerated using the swapoe instructions.

• Data size: The total data size of the baseline kernel exceeds the size of the register file and therefore a lot of intermediate loads and stores are required. The main cause of the large data size is that the matrix transposes require twice as many registers: one set of registers for the original matrix and one for the transposed matrix. The swapoe instructions perform a matrix transpose in place and thus require half as many registers. Simulation showed that using the swapoe instructions the entire data set fits in the register file and thus many loads and stores are prevented.

• Scalar operations: As with the previous kernels, scalar operations turn out to be very costly. In the DF kernel this manifests in the table lookups. Listing 4.1 of Section 4.4.1 shows that a table lookup costs five instructions because of alignment issues. Using the lds instructions a table lookup takes only two instructions.

• Lack of saturating function/arithmetic: The output of the DF kernel consists of unsigned bytes. Thus, the result of the filtering has to be saturated between 0 and 255. Many SIMD architectures offer saturating

arithmetic, which performs saturation automatically after each arithmetic operation. In the DF kernel, however, halfwords are needed for the intermediate results anyway, and thus one saturate at the end suffices. Implementing many saturating instructions is costly in terms of opcodes. The clip instruction is in that sense a good solution, as it provides a fast saturation with only one instruction per data type. Moreover, the clip instruction can also be used for other purposes, such as limiting the index of a table lookup to the size of the table.

• Pack & unpack: In the SPE architecture, packing and unpacking has to be performed explicitly and is done using shuffle instructions. The asp or sat pack instructions are of no help in the DF kernel as the saturate and the pack are not coupled together. Automatic (un)pack at load and store would be beneficial. As others have investigated this before [143, 42], we do not include this technique in our architectural enhancements.

• SIMDimization: Because of the adaptiveness of the filter, limited DLP is available. Five different filter strengths are applied depending on the smoothness of the edge. That means that different filters might be required across the SIMD slots. The SIMDimization strategy applied is to compute all five filters and to select the right result for each SIMD slot. Of course this requires extra computation. None of the proposed architectural enhancements resolves this issue.

Table 4.9 presents the speedup and instruction count reduction of the DF kernel using the proposed architectural enhancements. For this analysis we again applied the DF to the top-left 4 × 4 macroblocks of a frame from a real video sequence. In contrast to the IDCT, for this kernel the swapoe instructions have a large impact on the execution time, mainly due to avoiding loads and stores. To achieve this, some modifications to the code were required such that the compiler could ascertain all data accesses at compile time. These modifications included loop unrolling, substituting pointers for array indices, and inlining of functions.

Both the lds and clip instructions provide a moderate speedup separately, but a significant one when all enhancements are applied concurrently. When combining all three enhancements, again a synergistic effect occurs, as the total speedup is larger than the product of the separate speedups. The reduction in the number of instructions is proportional to the speedups. Combining all enhancements, the number of instructions is reduced by a factor of 1.75.

Table 4.9: Speedup and reduction in instruction count of the DF kernel using the pro- posed architectural enhancements. The effect of each separate enhance- ment is listed as well as the aggregate effect.

                     Cycles                          Instructions
 Enhancement  Baseline  Enhanced  Speedup     Baseline  Enhanced  Reduction
 Swapoe         122135     90518     1.35       101480     74504       1.36
 Lds            122135    108995     1.12       101480     89692       1.13
 Clip           122135    114755     1.06       101480     96312       1.05
 All            122135     66334     1.84       101480     58088       1.75

For the DF kernel the speedup is larger than the reduction in instructions for the following reason. When using the swapoe instructions it is possible to keep the entire macroblock in the register file. To achieve this some modifications to the code were required, such as the replacement of functions by macros. This reduces the number of branches significantly and with that the number of branch miss stall cycles. Specifically, in the baseline kernel there were 15,341 branch miss stall cycles, while for the enhanced kernel this was only 4,529.

4.5.4 Results for the Luma/Chroma Interpolation Kernel

The Luma and Chroma interpolation kernels together perform the motion compensation. The kernels are applied to blocks of 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, or 4 × 4 pixels. Note that there is only one chroma sample per 2 × 2 pixels; otherwise the two kernels are identical, and thus we focus on the Luma kernel only.

The accuracy of the motion compensation is a quarter of the distance between luma samples. If the motion vector points to an integer position, the predicted block consists of the samples the motion vector points to in the reference frame. If the motion vector points to a half-sample position, the predicted block is obtained by interpolating the surrounding pixels. This is done by applying a one-dimensional 6-tap FIR filter, which is given by:

p'[0] = (p[−2] − 5p[−1] + 20p[0] + 20p[1] − 5p[2] + p[3] + 16) >> 5.    (4.1)

This filter is applied horizontally and/or vertically. Prediction values at quarter-sample positions are generated by averaging samples at integer- and half-sample positions. For more information on the motion compensation of H.264 the reader is referred to [165].
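
A scalar C sketch of this half-sample interpolation is shown below (names are ours; the kernel implements the same filter with SIMD intrinsics):

#include <stdint.h>

/* Half-sample luma interpolation of Equation (4.1): 6-tap FIR filter
   (1, -5, 20, 20, -5, 1), rounding offset 16, shift right by 5, and a
   final clip to the 8-bit pixel range. p points at p[0]; p[-2]..p[3] valid. */
static uint8_t luma_halfpel(const uint8_t *p)
{
    int v = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3];
    v = (v + 16) >> 5;
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (uint8_t) v;
}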

Table 4.10: Profiling results of the baseline Luma kernel. Both the original and the optimized version are depicted. No architectural enhancements are in- cluded.

                                     Original           Optimized
                                   cycles     %       cycles     %
 Initialization of variables           44     1%          27     1%
 Loop overhead                        752    16%         896    21%
 Loading, aligning, and unpacking    1630    36%        1546    37%
 Computing interpolation filter      1024    23%         976    23%
 Saturating and pack                  512    11%         416    10%
 Storing result                       592    13%         368     9%
 Total                               4554   100%        4229   100%

The code of the Luma kernel contains many functions, as there are many possible combinations of block sizes and quarter-sample positions. For the analysis in this work we chose an input block size of 16 × 16 and a quarter-sample prediction position that requires both horizontal and vertical interpolation. This filter function is one of the most compute-intensive ones. The name of this filter function in the FFmpeg code is put_h264_qpel16_mc11_spu. The input for this kernel is one macroblock and thus the filter function is run once.

We analyzed the code of the Luma kernel by profiling it using CellSim. The filter function calls three subfunctions: put_h264_qpel_v_lowpass_spu performs vertical interpolation and consumes 33% of the total time, put_h264_qpel_h_lowpass_spu (52%) performs horizontal interpolation, and put_pixels_16_l2_spu (16%) averages the results of the first two functions and stores them at the destination address. We divided the functions into the six parts shown in Table 4.10 and measured how many cycles were spent in each part, combined for all three functions. Only 22.5% of the cycles are spent on actual computation of the interpolation. Most of the time is taken by loading/storing and the corresponding rearrangement of data to make it amenable to SIMD processing.

While analyzing the kernel we also looked for code optimization possibilities. We performed three optimizations. The first optimization is in the horizontal and vertical interpolation functions. The original code used only the spu_madd intrinsic, which multiplies the odd elements only, and used shuffles to multiply the even elements. We used spu_mule and spu_mulo in places where they are beneficial and thereby avoided some of the shuffles. We also developed a version that employed the spu_mhhadd intrinsic, which multiply-adds the even elements; it turned out not to provide any speedup.

The second code optimization was performed in the put_pixels_16_l2_spu function. This function computes the average of two 16 × 16 matrices, which it assumes are stored in memory at non-quadword-aligned addresses. Therefore this function contains a lot of alignment overhead. In only a few cases, however, are the input matrices indeed stored non-quadword aligned; in most cases the alignment overhead can be avoided. For the filter function that we chose as the subject of this analysis this is always the case. Thus we created another version of this function, called put_pixels_16_l2_spu_aligned, and used that one in our filter function.

The third optimization was performed on all three functions mentioned. All these functions contain a loop that iterates over all the rows of the block being processed. The height of the block is a parameter of the function. Therefore, the loop count cannot be determined at compile time and thus the compiler does not perform loop unrolling optimizations. The loop bodies, however, are rather small and some contain very little ILP. Thus, without loop unrolling a lot of dependency stalls occur. The height of the block can take only two values, however. Thus, we created two versions of each function, one for each block height. We applied this enhancement to the three functions only when beneficial. For example, for the baseline kernel we applied this optimization to the smallest subfunction only. The two rightmost columns of Table 4.10 show the profiling results of the optimized kernel.
The third code optimization, however, is not reflected in these results: the ILP increase it obtained was destroyed by the insertion of the volatile profiler commands. In total the optimized version is about 33% faster than the original one. As with the previous kernels, we chose the optimized version as the baseline for evaluating architectural enhancements. We identified a number of deficiencies in the execution of the Luma kernel. Below we present them and show how the proposed architectural enhancements of Section 4.4 can resolve these issues.

• Unaligned loads: The Luma kernel loads the sub-block the motion vector points to. In the vast majority of cases this motion vector points to a non-quadword-aligned memory address. Loading such a misaligned matrix into the register file requires a lot of overhead. This main deficiency of the Luma kernel was resolved by using the unaligned load instructions.

• Multiplication overhead: The input and output of the Luma kernel is a matrix of bytes, while computations are performed on halfwords in order not to lose precision. When a multiplication is performed, the instructions produce 32-bit results although the actual value never exceeds 15 bits. Thus a multiplication of one vector is performed as a multiplication of the odd elements, a multiplication of the even elements, and a shuffle to merge the odd and even elements back into one vector (a sketch of this sequence follows this list). This overhead can be avoided by using either the mpytr instructions or the mpy byte instructions. The mpytr instructions multiply vectors of halfwords and produce vectors of halfwords. The mpy byte instructions multiply vectors of bytes and produce vectors of halfwords. The mpy byte instructions therefore allow performing the multiplication directly on the input data, which is in bytes. In general that will not always be possible, but for the Luma kernel it is. The additional benefit of the mpy byte instructions is that they unpack the input data automatically, making the explicit unpacks superfluous. In the horizontal interpolation function the mpy byte instructions provided the highest speedup, while in the vertical interpolation function the mpytr instructions did. Due to the way these functions are SIMDimized, the horizontal interpolation function contains many more unpack operations, which are performed automatically by the mpy byte instructions. The vertical interpolation function requires few unpack operations and does not benefit that much from the automatic unpacking of the mpy byte instructions. Moreover, using the mpy byte instructions, more instructions are required for the vertical interpolation.

• Lack of saturation function/arithmetic: Just as for the other kernels, the result of the Luma kernel has to be saturated between 0 and 255 and packed to bytes. The clip instruction can be used for this purpose, but the sat pack instruction provides more benefit to the overall performance.

• Pack & unpack: Packing and unpacking has to be performed explicitly and is implemented using shuffle instructions. Automatic (un)pack during load and store would be beneficial, but is not investigated in this work, as mentioned before. The sat pack instruction provides speedup as it combines a saturate and a pack.
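
The multiplication overhead described in the list above looks roughly as follows on the baseline SPE (a sketch; the function name is ours, and the shuffle pattern selects the low halfword of each 32-bit product):

#include <spu_intrinsics.h>

/* Baseline SPE: element-wise 16-bit multiply of two vectors of halfwords.
   spu_mule/spu_mulo produce 32-bit products of the even/odd elements, so a
   shuffle is needed to merge the low halfwords back into one vector; the
   proposed mpytr and mpy byte instructions produce halfword results directly. */
static vec_short8 mul16(vec_short8 a, vec_short8 b)
{
    vec_int4 even = spu_mule(a, b);   /* products of elements 0, 2, 4, 6 */
    vec_int4 odd  = spu_mulo(a, b);   /* products of elements 1, 3, 5, 7 */
    vec_uchar16 merge_lo = (vec_uchar16) {
        0x02, 0x03, 0x12, 0x13, 0x06, 0x07, 0x16, 0x17,
        0x0A, 0x0B, 0x1A, 0x1B, 0x0E, 0x0F, 0x1E, 0x1F };
    return (vec_short8) spu_shuffle(even, odd, merge_lo);
}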

Besides the architectural enhancements mentioned above, the sfxsh instructions are also beneficial to the Luma kernel, although only slightly. Table 4.11 presents the speedup and instruction count reduction of the Luma kernel using the proposed architectural enhancements.

Table 4.11: Speedup and instruction count reduction of the Luma kernel using the proposed architectural enhancements. The effect of each separate en- hancement is listed as well as the aggregate effect.

                        Cycles                          Instructions
 Enhancement     Baseline  Enhanced  Speedup     Baseline  Enhanced  Reduction
 Mpytr               2461      2238     1.10         2473      2163       1.14
 Mpy byte            2461      2175     1.13         2473      2056       1.20
 Clip                2461      2333     1.05         2473      2311       1.07
 Sat pack            2461      2065     1.19         2473      2212       1.12
 Sfxsh               2461      2375     1.04         2473      2409       1.03
 Unaligned L/S       2461      1915     1.29         2473      1936       1.28
 All (but Clip)      2461      1068     2.30         2473      1059       2.34

For this analysis we ran the Luma kernel on one entire macroblock. The mpytr, mpy byte, and sat pack instructions provide a moderate speedup. The clip and sfxsh instructions are used only a few times in the Luma kernel and therefore also provide only a small speedup. The unaligned L/S instructions provide the highest speedup, as loading, storing, and alignment consume a large part of the total execution time of the baseline kernel. When combining all enhancements the clip instruction is not used, as the saturation it performs is done by the sat pack instructions. The total speedup achieved is 2.30, while the instruction count is reduced by a factor of 2.34.

4.5.5 Summary and Discussion

To summarize, Figure 4.13 and Table 4.12 provide an overview of all architectural enhancements and modifications and their effect on the execution time and instruction count of all the kernels. For each kernel a significant speedup was obtained using only a handful of specialized instructions.

Figure 4.13: Speedup and instruction count reduction achieved on the H.264 kernels.

The performance improvements come at the cost of additional area. Most instructions require a few additions to the data path of existing pipelines and the corresponding control logic. For example, the as2ve (Add Scalar to Vector Element) instructions require the scalar to be replicated among the SIMD slots, while the control logic selects which slot adds the scalar input to the vector input. The sat pack (Saturate & Pack) and asp (Add, Saturate & Pack) instructions require the pipeline to be extended with logic for saturation and packing.

The swapoe (Swap Odd-Even) instructions have two output operands. These could be written to the register file serially if one port is available. To write the results to the register file in one cycle, two ports are needed: either the port of the other pipeline is used, or an additional port is implemented. In any case, the forwarding network has to be extended, requiring significant area. The IDCT8 and IDCT4 instructions are most likely implemented in a special intra-vector unit; this is described in more detail in the next chapter. Finally, two instructions require some modifications to the local store. First, the unaligned L/S instructions require the local store to shift the data into the right position. Furthermore, the required data might be stored across two memory lines, which requires additional logic to handle. The lds (LoaD Scalar) instruction requires similar shift operations. Thus, the logic required for these two instructions can possibly be combined.

Table 4.12: Overview of the effect of all architectural enhancements and modifications on the execution time and instruction count of the kernels.

                            Speedup                      Reduction instructions
 Enhancement      IDCT8  IDCT4    DF  Luma         IDCT8  IDCT4    DF  Luma
 As2ve             1.04   1.05     -     -          1.03   1.06     -     -
 Asp               1.15   1.11     -     -          1.19   1.23     -     -
 Clip                 -      -  1.06  1.05             -      -  1.05  1.07
 IDCT4                -   1.23     -     -             -   1.42     -     -
 IDCT8             1.39      -     -     -          1.46      -     -     -
 Lds                  -      -  1.12     -             -      -  1.13     -
 Mpytr                -      -     -  1.10             -      -     -  1.14
 Mpy byte             -      -     -  1.13             -      -     -  1.20
 Sat pack             -      -     -  1.19             -      -     -  1.12
 Sfxsh             1.07      -     -  1.04          1.12      -     -  1.03
 Swapoe            0.99      -  1.35     -          1.12      -  1.36     -
 Unaligned L/S        -   1.21     -  1.29             -   1.08     -  1.28
 Total             1.84   2.37  1.84  2.30          2.26   2.67  1.75  2.34

4.6 Conclusions

In this chapter we described the specialization of the Cell SPE for H.264 decoding. The SPE architecture is an excellent starting point for this specialization as it has a large register file, a local store, and a DMA style of accessing global memory. This is very suitable for H.264 decoding because most of the data accesses it performs are known in advance. We enhanced the architecture with twelve instruction classes, most of which have not been reported before, and which can be divided into five groups. First, we added instructions to enable efficient scalar-vector operations, because the SPE has a SIMD-only architecture. Second, we added several instructions that perform saturation and packing as needed by the H.264 kernels. Third, we added the swapoe instructions to perform fast matrix transposition, which is needed by many media kernels. Fourth, we added specialized arithmetic instructions that collapse frequently used simple arithmetic operations. Finally, we added support for non-quadword-aligned access to the local store, which is especially needed for motion compensation. The speedup achieved on the H.264 kernels is between 1.84x and 2.37x, while the dynamic instruction count is reduced by a factor between 1.75x and 2.67x. The largest performance improvements are achieved by the swapoe instructions because they allow more efficient usage of the register file, by the IDCT instructions because they completely avoid matrix transposes, and by the unaligned load/store instructions because of the large overhead reduction.

The IDCT instructions proposed in this chapter perform intra-vector operations. In general, SIMD instructions perform inter-vector operations, i.e., they perform an operation on elements that are in the same slot but in different vectors. The next chapter investigates whether such intra-vector SIMD instructions are beneficial for other kernels and other application domains as well.

Chapter 5

Intra-Vector Instructions

In the previous chapter we proposed the SARC Media accelerator to achieve performance scalability by specialization. Among the newly proposed instructions were two intra-vector instructions. Those instructions operate on data elements within a vector, in contrast to normal SIMD operations that perform operations among elements that are in the same SIMD slot of different vectors. Those intra-vector instructions were designed because we noticed that two-dimensional kernels often cannot fully exploit the DLP offered by the SIMD unit. For vertical processing, the data is aligned in a fashion that fits SIMD processing. Horizontal processing, though, requires significant data rearrangement before SIMD processing can be applied. This overhead was reduced by using intra-vector instructions that break with the strict separation of SIMD lanes.

Intra-vector instructions can be found in several commercial processors. Most general purpose SIMD extensions, such as SSE and AltiVec, have some limited form of intra-vector instructions. They range from non-arithmetic operations such as permute, pack, unpack, and merge, to arithmetic operations such as dot product, sum across, and extract minimum. Intra-vector instructions are more common in the DSP area. NXP's EVP [160, 115] has intra-vector processing capabilities that allow, for example, arbitrary reordering of data within a vector as needed for FFT butterflies, pilot channel removal, and other computations common in communication signal processing. The Sandblaster architecture [117] has several intra-vector reduction operations, such as the multiply and sum operations. Furthermore, there are intra-vector instructions for Viterbi and Turbo decoding. More general intra-vector permutations are provided in the CEVA-XC [8], Lucent's 16210 [163], and SODA [101] architectures.


Most of the works above target the communication domain. The only intra-vector instructions that can be applied to other application domains are the general purpose ones, which contain only a limited form of intra-vector operations. In this chapter we analyze the applicability of intra-vector instructions to kernels in the media domain. Furthermore, we apply intra-vector instructions to solve the data rearrangement problem in two-dimensional kernels, which, to the best of our knowledge, has not been done before. This chapter is organized as follows. The experimental setup is described in Section 5.1. Next, in Section 5.2, the application and evaluation of intra-vector instructions in several kernels is presented. Section 5.3 discusses the results while Section 5.4 concludes this chapter.

5.1 Experimental Setup

As in the previous chapter, the Cell SPE core was used as baseline architecture and simulation was used to evaluate the performance of the enhanced kernels. For more details on the SPE architecture, the reader is referred to Section 4.2. The simulator and compiler used throughout this chapter are also equivalent to those used in the previous chapter. We describe them briefly here, but focus on issues related to this work. More details on the simulator and the compiler can be found in Section 4.3.

5.1.1 Simulator

CellSim [7] was used as the simulation platform. Although it models the architecture rather accurately, it does not distinguish between the odd and even pipelines. Instead, the simulator provides a parameter (IPC) that controls the instruction fetch rate. By default, this parameter is set to two, which is the maximum fetch rate. In most applications the fetch rate is lower, and therefore this parameter should be adjusted accordingly.

To determine the correct parameter value, we compared the execution times measured in CellSim with those obtained with the IBM full system simulator SystemSim. We did not use the real processor because SystemSim provides more statistics and is accurate as well. Those additional statistics allowed us to check the correctness of CellSim on more aspects than just execution time. We observed that an instruction fetch rate of 1.4 in CellSim corresponds best with the behavior of the real processor; for one highly optimized kernel, however, a value of 1.7 was used. Table 5.1 shows the execution times of all the kernels using both SystemSim and CellSim.

Table 5.1: Validation of CellSim.

                    IPC                 Cycles
 Kernel      (fetch rate)     SystemSim      CellSim     Error
 IDCT8                1.4           917          896     -2.3%
 IDCT4                1.4          2571         2524     -1.9%
 DF                   1.4        121465       122122      0.5%
 IPol                 1.4          2534         2598      2.5%
 MatMul               1.7         84180        85894      2.0%
 DWT                  1.4        962640       994541      3.2%

It shows that the difference in execution time is at most 3.2%. Comparison of other statistics generated by the two simulators also showed that CellSim is well suited to its purpose.

5.1.2 Compiler

The compiler used throughout this work is spu-gcc version 4.1.1. The implemented intra-vector instructions were made available to the programmer through intrinsics, or through inline assembly in cases where intrinsics would have required complex modifications. The modifications ensured that the compiler can handle the new instructions, but not necessarily in an optimized way. Optimizing the compiler for the new instructions is beyond the scope of this project.

5.1.3 Benchmark

We used six different kernels to demonstrate the effectiveness of intra-vector instructions. The kernels we used were either developed in-house before or taken from CellBench [5], which is a collection of publicly available Cell kernels. We examined all available kernels, searching for opportunities to apply intra-vector instructions. That is, we looked at inefficiencies of the SIMDimization and inefficiencies in the implementation of the algorithm. Devising intra-vector instructions for a kernel requires a thorough understanding of the algorithm the kernel implements. Six kernels seemed amenable to the intra-vector paradigm, but we expect experts to be able to implement intra-vector instructions in many more kernels. The kernels used throughout this work are IDCT8, IDCT4, IPol (InterPolation), and DF (Deblocking Filter), originating from the FFmpeg H.264 video decoder [58]. Furthermore, we used the matrix multiply (MatMul) kernel from the Cell SDK. The last kernel is DWT (Discrete Wavelet Transform), taken from JasPer JPEG-2000 for Cell [6].


Figure 5.1: Example of vertical processing of a matrix. Adding the elements of each column (a) can efficiently be performed with traditional SIMD instructions, such as the vector-add instruction (b).

5.2 The Intra-Vector SIMD Instructions

The principle of intra-vector instructions is well explained by the simple example of Figures 5.1 and 5.2. In both figures the elements of a matrix are processed. The matrix is stored in row-major order and, in this case but not necessarily, one row fits in one SIMD vector. Figure 5.1(a) shows vertical processing, i.e., the elements of each column are added. Such a computation can efficiently be performed by traditional SIMD instructions, such as the vector-add depicted in Figure 5.1(b).

In Figure 5.2(a), the matrix is processed in horizontal mode, i.e., the elements of each row are added. To perform this computation using traditional SIMD instructions, the data has to be rearranged, for example by a matrix transposition. To avoid this overhead, intra-vector instructions can be used, such as the AltiVec sum-across instruction depicted in Figure 5.2(b).

Intra-vector instructions are not limited to matrix oriented kernels. The example shows, however, that kernels that operate on matrices and perform both vertical and horizontal processing are most likely to be amenable to the intra-vector paradigm. Also, as we will show in this section, the benefit of intra-vector instructions is not only the avoidance of rearrangement overhead. Finally, we remark that the intra-vector instructions presented in this work are purposely application specific and thus more complex than the sum-across instruction of the AltiVec ISA.

Figure 5.2: Example of horizontal processing of a matrix. Adding the elements of each row (a) cannot efficiently be performed with traditional SIMD instructions. Intra-vector instructions, such as the AltiVec sum-across instruction (b), do allow efficient computation.

For six kernels intra-vector instructions were developed and evaluated. The next sections describe the intra-vector instructions for the kernels in detail and evaluate their performance. As the intra-vector IDCT8 and IDCT4 instructions were presented in detail in the previous chapter, for those only the results are discussed briefly. Figure 5.3 shows the speedup and instruction count reduction of all the kernels. Figure 5.4 shows a breakdown of the execution time of the kernels. The following sections refer to those graphs.

5.2.1 IDCT8 Kernel

The Inverse Discrete Cosine Transform (IDCT) is an operation often found in picture and video processing. The IDCT is typically performed on 8 × 8 or 4 × 4 pixel blocks. Performing the IDCT comprises a one-dimensional IDCT row-wise followed by a one-dimensional IDCT column-wise. FFmpeg uses the LLM IDCT algorithm [105], which consists of simple additions, subtractions, and some shifts. Using the intra-vector IDCT8 instruction, presented in detail in Section 4.4.4, the row-wise IDCT can be performed without applying a matrix transposition. The column-wise IDCT is done with the normal SIMD instructions.

Figure 5.3 shows that a speedup of 1.39 was achieved by using the IDCT8 instruction. Avoiding the matrix transpose was the main contributor to the speedup, followed by instruction collapsing. The number of executed instructions was reduced by 1.46x.

Figure 5.3: Speedup and instruction count reduction achieved using intra-vector instructions.

Figure 5.4: Breakdown of the execution cycles normalized to the baseline kernel, divided into dependency stall, branch stall, fetch stall, single issue, and dual issue cycles. For each kernel, the left bar is the baseline while the right bar is the enhanced version using intra-vector instructions.

5.2.2 IDCT4 Kernel

The IDCT4 kernel is similar to the IDCT8 kernel but operates on a 4 × 4 pixel block. The computational structure of the one-dimensional IDCT4 is even simpler than that of the IDCT8. Moreover, as a 4 × 4 matrix of halfwords completely fits in two 128-bit registers, the entire two-dimensional IDCT4 could be performed using one instruction with two operands. This intra-vector IDCT4 instruction reads the matrix from its two operand registers, performs the arithmetic operations, and stores the result back in the two registers. Storing the result can be done in one cycle if the register file has two write ports; otherwise two cycles are required. The IDCT4 instruction is described in detail in Section 4.4.4.

Figure 5.3 shows that a speedup of 1.22 was achieved. Again, avoiding the matrix transposition was the main contributor to the speedup, followed by instruction collapsing. The number of executed instructions was reduced by 1.42x. Compared to the IDCT8 instruction, a significantly lower speedup is obtained, although the instruction count reduction has only decreased a little. This is because the speedup of the IDCT4 kernel was limited by dependencies. The inner loop of the kernel is small and, by replacing multiple instructions with the idct4 intra-vector instruction, the few instructions left were highly dependent on each other. Figure 5.4 shows that the dependency stall cycles nearly doubled.

5.2.3 Interpolation Kernel (IPol)

Motion compensation in H.264 has quarter pixel accuracy. In case a motion vector points to an integer position, the corresponding pixel is the predicted value. Otherwise, the prediction is obtained by interpolation of the neighboring pixels. Figure 5.5 illustrates this interpolation process. Pixels at half sample position are obtained by applying a 6-tap FIR filter as follows:

b = (20(G + H) − 5(F + I) + E + J + 16) >> 5    (5.1)
h = (20(G + M) − 5(C + R) + A + T + 16) >> 5    (5.2)

The result of the filter is clipped between 0 and 255. Pixels at quarter sample positions are obtained by averaging the nearest half and full sample pixels. Thus, for every non-integer motion vector, at least one 6-tap FIR filter has to be applied. This filter is one of the most time-consuming parts of the motion compensation kernel. Both the horizontal and the vertical filter can be SIMDimized to exploit DLP; the horizontal filter, however, incurs considerably more overhead.
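For reference, Equation (5.1) translates directly into the following scalar C fragment; the clip() helper is our own name for the 0..255 saturation described above.

    /* Saturate a filter result to the valid pixel range. */
    static inline int clip(int v)
    {
        return v < 0 ? 0 : (v > 255 ? 255 : v);
    }

    /* Half-sample pixel 'b' of Figure 5.5, as given by Equation (5.1). */
    static inline int half_sample(int E, int F, int G, int H, int I, int J)
    {
        return clip((20 * (G + H) - 5 * (F + I) + E + J + 16) >> 5);
    }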

Figure 5.5: Pixels at fractional sample positions (lower case letters) are determined by applying an interpolation filter on the full sample pixels (upper case). Picture taken from [165].

Figure 5.6: The ipol instruction computes eight FIR filters in parallel.

When applying the horizontal filter, the six input elements are located consecutively in memory, and thus an intra-vector instruction can compute the entire filter at once. Therefore, we propose the ipol instruction. It computes eight 6-tap FIR filters, as indicated in Figure 5.6. Both the input and output operands are bytes. The computational structure of each FIR filter is depicted in Figure 5.7. The multiplications are exchanged for shifts and additions to decrease the latency. The result of the filter is also saturated between 0 and 255 in order to be able to store the results as bytes. Note that although the input and output operands of the instruction are bytes, internally halfwords have to be used in order not to lose precision (the intermediate value 20(G + H), for example, can reach 10200).

Figure 5.7: The computational structure of the FIR filter.

The ipol instruction fits the RR format, where the RB slot is not used. Register RA is the input register while RT is the output register. The instruction details are depicted below.

ipol rt,ra

00 0 1 1 0 0 0 1 1 1 RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

for j = 0 to 5
    r ← 20 ∗ (RA(j+2)::2 + RA(j+3)::2) − 5 ∗ (RA(j+1)::2 + RA(j+4)::2) + RA(j)::2 + RA(j+5)::2 + 16
    r ← r >> 5
    if r > 255 then r ← 255
    if r < 0 then r ← 0
    RT(j)::2 ← r
end

Table 5.2: Speedup and instruction count reduction obtained with the ipol instruction in the IPol kernel.

                Baseline   Enhanced   Improvement
Cycles          2390       1523        867   (1.57x)
Instructions    2473       1402       1071   (1.76x)

The ipol instruction was made available to the programmer through an intrinsic. The instruction was added to the simulator and its latency was set to ten cycles for the following reason. Fixed point operations take two cycles in the SPU architecture. The ipol instruction performs four steps of simple fixed point operations. We added another two cycles for the final shift and saturation, making a total of ten.

We applied the ipol instruction to the IPol kernel. This kernel consists of many functions, one for each combination of block size and quarter sample position. The ipol instruction can be used in any of those functions where a horizontal filter is applied, irrespective of the block size. We evaluated the ipol instruction with the function that computes the predicted value for a block of 16 × 16 pixels with a motion vector at quarter sample position for both the x and y coordinates. This function computes both a horizontal and a vertical filter and is a good representative.

The speedup and instruction count reduction for the IPol kernel are shown in Table 5.2. A speedup of 1.57 is achieved on the complete two-dimensional kernel. This speedup is mainly achieved by reducing the number of instructions by 1.76 times. Each ipol instruction replaces 67 instructions from the baseline kernel, including the following:

• Creation of auxiliary variables, such as vectors of constants and masks for shuffling.

• Aligning 6 input vectors (requiring some branches), instead of one.

• Unpack from bytes to halfwords and pack vice versa.

• Computation: 8 multiply-adds (each requiring a pack from words to halfwords), 8 additions, 2 subtractions, 2 shifts, and 8 instructions for saturation.

Figure 5.8: The process of deblocking one macroblock. Each square is a 4 × 4 sub-block. Also the left and top neighboring sub-blocks are involved in the process.

Note that the ipol instruction is only applied in the horizontal mode; therefore, replacing 67 instructions with one results in an instruction count reduction of 1.76 and not higher.

The speedup of the IPol kernel is lower than the instruction count reduction, which is the case for all but one kernel. The reason is the following. Multiple instructions are replaced by one. This decreases the code size, but also the average distance between dependent instructions. Thus, there is less opportunity to schedule two instructions in the same cycle, or dependency stalls might even occur. This is reflected in Figure 5.4, which shows a breakdown of the execution cycles. It shows that the enhanced IPol kernel has far fewer dual issue cycles than the baseline.

5.2.4 Deblocking Filter Kernel (DF)

The deblocking filter kernel is applied to remove an artifact introduced by the IDCT. The latter operates on 4 × 4 or 8 × 8 sub-blocks and, as a result, square areas might become visible in the decoded picture, which is called the 'blocking' artifact. The DF kernel smoothes these block edges and thus improves the appearance of the decoded picture.

The process of deblocking a macroblock is depicted in Figure 5.8. Each square represents a 4 × 4 sub-block. Also the left and top neighboring sub-blocks are depicted, as they are involved in the filtering. The filter is applied on all edges, starting with the vertical ones from left to right, followed by the horizontal ones from top to bottom.

Listing 5.1: Serial code of the df y1 filter function of the DF kernel.

if (abs(p0 - q0) < alpha &&
    abs(p1 - p0) < beta &&
    abs(q1 - q0) < beta) {

    if (abs(p0 - q0) < ((alpha >> 2) + 2)) {
        if (abs(p2 - p0) < beta) {
            p0 = (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3;
            p1 = (p2 + p1 + p0 + q0 + 2) >> 2;
            p2 = (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3;
        } else {
            p0 = (2*p1 + p0 + q1 + 2) >> 2;
        }
        if (abs(q2 - q0) < beta) {
            q0 = (p1 + 2*p0 + 2*q0 + 2*q1 + q2 + 4) >> 3;
            q1 = (p0 + q0 + q1 + q2 + 2) >> 2;
            q2 = (2*q3 + 3*q2 + q1 + q0 + p0 + 4) >> 3;
        } else {
            q0 = (2*q1 + q0 + p1 + 2) >> 2;
        }
    } else {
        p0 = (2*p1 + p0 + q1 + 2) >> 2;
        q0 = (2*q1 + q0 + p1 + 2) >> 2;
    }
}

The filter is applied to each line of pixels perpendicular to the edge, with a range of four pixels to each side of the edge. The strength of the filter is determined dynamically, based on certain properties of the macroblock but, more importantly, on the pixel gradient across the edge.

Listing 5.1 depicts the serial code of one of the two filter functions (referred to as df y1) for the luma components. The DF kernel also contains two filter functions for the chroma components. The function whose code is depicted is the most complex one. Loading and storing of data is not included. The variables alpha and beta are inputs of the filter function. The pixels are denoted as {p3,p2,p1,p0,q0,q1,q2,q3}, where px is a pixel on the left (top) side of the edge, while qx is a pixel on the right (bottom) side of the edge, assuming vertical (horizontal) filtering.

An efficient SIMD implementation for the Cell SPU was created by Azevedo [29]. The SIMDimized version only performs filtering in horizontal mode; thus, the macroblock had to be transposed in order to perform vertical filtering. The transposition overhead is approximately 12% of the execution time of the kernel. The procedure using the SIMD implementation is as follows: transpose all sub-blocks involved, perform horizontal filtering, transpose all sub-blocks back, perform horizontal filtering again.

Using intra-vector instructions for the vertical filtering, this transposition overhead can be avoided. Furthermore, instruction collapsing can be applied, reducing the time spent in computation. The procedure for the DF kernel using intra-vector instructions is the following: perform vertical filtering using intra-vector instructions, then perform horizontal filtering using the SIMDimized filter functions.

We implemented several intra-vector instructions for the DF kernel. As said before, the kernel contains two filter functions for the luma components and two for the chroma components. All functions are similar, although those for the luma components are a bit more complex. We started implementing intra-vector instructions for the luma filter functions and, as we show later, found out that those do not provide any speedup. Therefore, we did not implement intra-vector instructions for the chroma functions. We do report on the instructions for the luma functions, as this provides insight into the conditions for effective usage of intra-vector instructions.

The two luma filter functions are referred to as df y1 and df y2. The corresponding FFmpeg function names are filter_mb_edgeh and h264_loop_filter_luma_c, respectively. The functions differ too much to utilize the same intra-vector instructions. Separate instructions were implemented instead, with the prefix of each mnemonic indicating for which function it can be used.

The computational structure of the df y1 function is depicted in Figure 5.9. This entire set of operations could be implemented in one instruction, but at the cost of area and latency. Moreover, the latency might be larger than the depth of the current SPU pipeline. In that case, the number of forwarding stages would have to be increased, again at the cost of area. Instead of using one instruction for the entire filter function, the set of operations could be split into parts and implemented in different instructions. We decided to explore both options, referring to the first as 'complex' and the latter as 'simple'.

Using the complex intra-vector instructions, the code of the filter functions becomes rather simple and short, although some overhead is introduced to rearrange the input data to make it amenable to the new instructions.

Figure 5.9: Computational structure of the df y1 filter function. The df y1c instruction implements this entire computation. An arrow down means a shift right by one bit, while arrows up represent shifts left. A dotted line means the two input operands are the same.

As a clarification, the code of the df y1 function is presented in Listing 5.2. Pixels are stored, in this case, as vectors of halfwords. As each vector contains eight pixels, data is stored in memory as 8 × 8 matrices and the functions filter eight rows or columns. In half of the cases, the pixels needed are stored within one 8 × 8 matrix (pointed to by *pix1). In the other cases, however, the pixels are stored across two matrices (pointed to by *pix1 and *pix2) and shuffles are required to combine the input pixels for each row into one vector and to split the output pixels in order to store them in their original position. As the evaluation will show, this overhead is large.

The df_y1c and df_y2c instructions both fit the RR instruction format. Operand RA contains the pixels, while RB contains the parameters alpha and beta. The details of the df_y1c instruction are depicted below.

df_y1c rt,ra,rb

00 1 1 1 0 0 1 0 1 0 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

t ← RB0::4
u ← RB5::4
if (abs(RA6::2 − RA8::2) < t && abs(RA4::2 − RA6::2) < u && abs(RA10::2 − RA8::2) < u) then
    if (abs(RA6::2 − RA8::2) < ((t >> 2) + 2)) then
        if (abs(RA2::2 − RA6::2) < u) then
            RT6::2  ← (RA2::2 + 2 ∗ RA4::2 + 2 ∗ RA6::2 + 2 ∗ RA8::2 + RA10::2 + 4) >> 3
            RT4::2  ← (RA2::2 + RA4::2 + RA6::2 + RA8::2 + 2) >> 2
            RT2::2  ← (2 ∗ RA0::2 + 3 ∗ RA2::2 + RA4::2 + RA6::2 + RA8::2 + 4) >> 3
        else
            RT6::2  ← (2 ∗ RA4::2 + RA6::2 + RA10::2 + 2) >> 2
        if (abs(RA12::2 − RA8::2) < u) then
            RT8::2  ← (RA4::2 + 2 ∗ RA6::2 + 2 ∗ RA8::2 + 2 ∗ RA10::2 + RA12::2 + 4) >> 3
            RT10::2 ← (RA6::2 + RA8::2 + RA10::2 + RA12::2 + 2) >> 2
            RT12::2 ← (2 ∗ RA14::2 + 3 ∗ RA12::2 + RA10::2 + RA8::2 + RA6::2 + 4) >> 3
        else
            RT8::2  ← (2 ∗ RA10::2 + RA8::2 + RA4::2 + 2) >> 2
    else
        RT6::2 ← (2 ∗ RA4::2 + RA6::2 + RA10::2 + 2) >> 2
        RT8::2 ← (2 ∗ RA10::2 + RA8::2 + RA4::2 + 2) >> 2

Listing 5.2: Code of the df y1 filter function of the DF kernel using the complex DF intra-vector instruction.

const vuint8_t combine =
    {0x08,0x09,0x0a,0x0b,0x0c,0x0d,0x0e,0x0f,
     0x10,0x11,0x12,0x13,0x14,0x15,0x16,0x17};
const vuint8_t split_low =
    {0x10,0x11,0x12,0x13,0x14,0x15,0x16,0x17,
     0x00,0x01,0x02,0x03,0x04,0x05,0x06,0x07};
const vuint8_t split_high =
    {0x08,0x09,0x0a,0x0b,0x0c,0x0d,0x0e,0x0f,
     0x18,0x19,0x1a,0x1b,0x1c,0x1d,0x1e,0x1f};
vsint32_t params = (vsint32_t) alpha;
params = spu_insert(beta, params, 1);

for (i = 0; i < 8; i++) {
    vuint16_t pixels;
    if (do_combine) {
        // load & rearrange input data
        pixels = spu_shuffle(*(pix1+i), *(pix2+i), combine);

        // compute filter
        vuint16_t result1 = spu_df_y1c(pixels, params);

        // rearrange and store output data
        *(pix1+i) = spu_shuffle(result1, *(pix1+i), split_low);
        *(pix2+i) = spu_shuffle(result1, *(pix2+i), split_high);
    } else {
        // load data
        pixels = *(pix1+i);

        // compute filter
        vuint16_t result1 = spu_df_y1c(pixels, params);

        // store data
        *(pix1+i) = (vsint16_t) result1;
    }
}

Table 5.3: Speedup and instruction count reduction obtained with the complex DF instructions in the DF kernel.

                Baseline   Enhanced   Improvement
Cycles          122135     120876      1259   (1.01x)
Instructions    101480      86376     15104   (1.17x)

The new instructions were made available to the programmer through intrinsics (see also Listing 5.2). The instructions were added to the simulator and the latency of both was set to 14 cycles. There are five stages of fixed point operations, each accounted for by two cycles. The 3-bit AND operation and the three stages of muxes are accounted for by another two cycles. Finally, we added two more cycles for the complex routing in the first two stages (see Figure 5.9). The complexity of the df_y2c instruction is similar and thus it has the same latency.

The simple DF intra-vector instructions implement only part of the computational structure of Figure 5.9. The df_y1_p1 instruction computes the values {p2_1,p1_1,p0_1,q0_1,q1_1,q2_1}, while df_y1_p2 computes the values {p0_2,q0_2}. To select the proper value for each pixel, two masks are created by the instructions df_y1_m1 and df_y1_m2. Listing 5.3 shows how these new instructions work together to compute the df y1 filter function. For the df y2 function a similar approach was used.

All the simple DF intra-vector instructions fit the RR instruction format. They were added to the simulator and made available to the programmer through intrinsics. Considering the latency, the instructions were divided into two groups. Those that create a mask were assigned a latency of four cycles, just like any other instruction in the SPU ISA that creates a mask. The other instructions were assigned a latency of seven cycles: they mostly have three stages of fixed point operations, each accounted for by two cycles, and we added one cycle for the complex routing of the input values.

Tables 5.3 and 5.4 present the speedup and the reduction of the instruction count using the complex and simple DF instructions, respectively. In both cases the enhanced kernel performs hardly any better than the baseline kernel. To understand this, we summarize all the effects of using intra-vector instructions.

First, two transpositions of the pixel data are avoided, having a positive impact on the performance. On the other hand, new data rearrangement has been introduced, as in half of the cases the input data is not aligned within one vector.

Listing 5.3: Code of the df y1 filter function of the DF kernel using the simple DF intra-vector instructions.

// initialization omitted due to space constraints

for (i = 0; i < 8; i++) {
    vuint16_t pixels;
    if (do_combine) {
        // load & rearrange input data
        pixels = spu_shuffle(*(pix1+i), *(pix2+i), combine);

        // compute the filter
        vuint8_t mask1, mask2;
        mask1 = spu_df_y1_m1(pixels, params);
        mask2 = spu_df_y1_m2(pixels, params);
        vuint16_t result1 = spu_df_y1_p1(pixels);
        vuint16_t result2 = spu_df_y1_p2(pixels);
        result1 = spu_shuffle(result1, result2, mask2);
        result1 = spu_shuffle(result1, pixels, mask1);

        // rearrange & store output data
        *(pix1+i) = spu_shuffle(result1, *(pix1+i), split_low);
        *(pix2+i) = spu_shuffle(result1, *(pix2+i), split_high);
    } else {
        // load data
        pixels = *(pix1+i);

        // compute the filter
        vuint8_t mask1, mask2;
        mask1 = spu_df_y1_m1(pixels, params);
        mask2 = spu_df_y1_m2(pixels, params);
        vuint16_t result1 = spu_df_y1_p1(pixels);
        vuint16_t result2 = spu_df_y1_p2(pixels);
        result1 = spu_shuffle(result1, result2, mask2);
        result1 = spu_shuffle(result1, pixels, mask1);

        // store data
        *(pix1+i) = (vsint16_t) result1;
    }
}

Table 5.4: Speedup and instruction count reduction obtained with the simple DF instructions in the DF kernel.

                Baseline   Enhanced   Improvement
Cycles          122135     120596      1539   (1.01x)
Instructions    101480      89704     11776   (1.13x)

The latter rearrangement requires some shuffles, but also additional loads and stores. For example, the code of Listing 5.2 performs 33 loads and 16 stores. That is an overhead of four loads and two stores per intra-vector instruction.

Second, the resulting code has many dependencies within few instructions. For example, the code of Listing 5.2 has the following instruction order: loads (2), shuffle, df y1c, loads (2), shuffles (2), stores (2). Only the instructions that are grouped within this list are independent. Consecutive groups are dependent and cause pipeline stalls. Even when applying loop unrolling, this function takes more than three times the number of cycles compared to the SIMD version. The latter takes 77 cycles and 97 instructions, while the enhanced version takes 275 cycles and 107 instructions. In total 131 cycles are dependency stalls, while most other cycles are single issue. From these numbers we conclude that dependencies are the main cause of the loss of performance.

Third, the DLP exploited by the DF intra-vector instructions is smaller than in the SIMD version. Some of the intra-vector instructions compute only two outputs, while the SIMD version always computes eight outputs. This loss of DLP is compensated by the instruction collapsing, especially in the complex intra-vector instructions.

Finally, applying intra-vector instructions to the DF kernel introduced some branches, which are costly in the SPE architecture. For example, the code of Listing 5.2 incurs 30 branch stall cycles in half of the cases.

We conclude the following from the experiments with this kernel. First, intra-vector instructions are not likely to be efficient if the data is not naturally aligned within vectors. Second, the performance gains obtained by intra-vector instructions are likely to be proportional to the number of their output elements.

5.2.5 Matrix Multiply Kernel (MatMul)

Dense matrix-matrix multiplication is used in many scientific applications. As the matrices are often large (M = 1024 or larger) and the algorithm has a time complexity of O(M^3), fast execution of the matrix multiply (in short MatMul) kernel is of vital importance.

Figure 5.10: Data layout of the baseline 4 × 4 matrix multiply. The vectors of matrix A have to be broken down in order to compute the matrix multiply.

An efficient implementation using the multicore and SIMD capabilities of the Cell processor was provided with SDK 3.0. The matrix multiplication is broken down into multiplications of 64 × 64 matrices, which fit in the SPE local store. Within the SPE this multiplication is broken down into smaller parts in two steps, such that in the end 4 × 4 matrices are multiplied. The SIMD width of the SPU is four slots of single precision floats, and thus one row of the 4 × 4 matrix is contained within one SIMD vector, as depicted in Figure 5.10. A 4 × 4 matrix multiply is part of a larger matrix multiplication, and thus the result has to be added to the intermediate value in the output matrix, i.e., C += A × B. In order to utilize the SIMD capabilities, the elements of matrix A have to be extracted and splatted. For example, the first row of matrix C is computed as:

    C1 += a1_1 ∗ B1 + a1_2 ∗ B2 + a1_3 ∗ B3 + a1_4 ∗ B4,    (5.3)

where a1_x is the x-th element of vector A1. Listing 5.4 shows the SPU code that performs this computation. The masks splatx are used multiple times and do not constitute a large overhead. Extracting and splatting the elements of matrix A does constitute a significant overhead, although in the optimized implementation of the 64 × 64 matrix multiplication the result is reused four times, decreasing the total overhead.

Using the intra-vector paradigm, the matrix multiplication can be broken down into blocks of 2 × 2. A 2 × 2 block contains four floats, which fit in one SIMD vector. Thus a 2 × 2 matrix multiply can be done in one instruction. We propose the fmatmul rt,ra,rb (Floating MATrix MULtiply) and fmma rt,ra,rb,rc (Floating Matrix Multiply Add) instructions that multiply the 2 × 2 matrices contained in ra and rb and store the result in register rt. The fmma instruction also adds the matrix contained in rc to the result.

Listing 5.4: Example code that computes the first output row of a 4 × 4 matrix multiply.

vector uchar splat0 = { 0,1,2,3, 16,17,18,19,
                        0,1,2,3, 16,17,18,19 };
vector uchar splat1, splat2, splat3;

splat1 = spu_or(splat0, 4);
splat2 = spu_or(splat0, 8);
splat3 = spu_or(splat0, 12);

a1_1 = spu_shuffle(A1, A1, splat0);
a1_2 = spu_shuffle(A1, A1, splat1);
a1_3 = spu_shuffle(A1, A1, splat2);
a1_4 = spu_shuffle(A1, A1, splat3);

C1 = spu_madd(a1_1, B1, C1);
C1 = spu_madd(a1_2, B2, C1);
C1 = spu_madd(a1_3, B3, C1);
C1 = spu_madd(a1_4, B4, C1);

Figure 5.11: Data layout of the enhanced 4 × 4 matrix multiply using 2 × 2 sub-blocks. Each 2 × 2 block is located within one SIMD vector.

Assuming the data of a 4 × 4 matrix is rearranged such that the four 2 × 2 blocks are located within SIMD vectors (as indicated in Figure 5.11), one block of the output can be computed with only two instructions, as shown in Listing 5.5. Usually the input data is not arranged in such a way, and therefore additional conversion overhead is required. As we show later, this overhead does not outweigh the benefits.

The fmatmul instruction fits the RR instruction format. The instruction details are depicted below. The fmma instruction fits the RRR instruction format. Only a few opcodes for this instruction format are available in the SPU. Optionally, the RR format could be used if register rt is used for both input and output. The instruction details as we implemented them are depicted below.

Listing 5.5: The code of Listing 5.4 can be replaced by the following when using the intra-vector fmma instruction.

C1 = spu_fmma(A1, B1, C1);
C1 = spu_fmma(A2, B3, C1);
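The 2 × 2-block layout of Figure 5.11 that these instructions assume can be produced with ordinary shuffles. The fragment below is our own illustrative sketch (not part of the SDK kernel): A1 and A2 are assumed to hold rows 1 and 2 of a 4 × 4 single precision matrix, and one shuffle yields each 2 × 2 block.

    #include <spu_intrinsics.h>

    /* Shuffle patterns: bytes 0-15 select from the first operand, bytes 16-31
     * from the second, so each pattern picks the left or right half of two rows. */
    static const vector unsigned char left_halves  = {  0, 1, 2, 3,  4, 5, 6, 7,
                                                        16,17,18,19, 20,21,22,23 };
    static const vector unsigned char right_halves = {  8, 9,10,11, 12,13,14,15,
                                                        24,25,26,27, 28,29,30,31 };

    /* Build the upper two 2x2 blocks from rows A1 = {a11..a14}, A2 = {a21..a24}. */
    void make_upper_blocks(vector float A1, vector float A2,
                           vector float *block00, vector float *block01)
    {
        *block00 = spu_shuffle(A1, A2, left_halves);   /* a11 a12 a21 a22 */
        *block01 = spu_shuffle(A1, A2, right_halves);  /* a13 a14 a23 a24 */
    }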

fmatmul rt,ra,rb

00 1 1 0 1 1 00 1 1 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

RT0::4  ← RA0::4 ∗ RB0::4 + RA4::4 ∗ RB8::4
RT4::4  ← RA0::4 ∗ RB4::4 + RA4::4 ∗ RB12::4
RT8::4  ← RA8::4 ∗ RB0::4 + RA12::4 ∗ RB8::4
RT12::4 ← RA8::4 ∗ RB4::4 + RA12::4 ∗ RB12::4

fmma rt,ra,rb,rc

1 0 0 1 RC RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

RT0::4  ← RA0::4 ∗ RB0::4 + RA4::4 ∗ RB8::4 + RC0::4
RT4::4  ← RA0::4 ∗ RB4::4 + RA4::4 ∗ RB12::4 + RC4::4
RT8::4  ← RA8::4 ∗ RB0::4 + RA12::4 ∗ RB8::4 + RC8::4
RT12::4 ← RA8::4 ∗ RB4::4 + RA12::4 ∗ RB12::4 + RC12::4
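In scalar terms, the fmma semantics above amount to a 2 × 2 matrix multiply-accumulate. The following reference model is our own restatement of the pseudocode (not SDK or simulator code); each array holds one 2 × 2 block in row-major order, matching the layout of Figure 5.11.

    /* fmma reference: rt = ra * rb + rc for 2x2 blocks stored row-major as
     * {x00, x01, x10, x11}. */
    void fmma_ref(float rt[4], const float ra[4], const float rb[4],
                  const float rc[4])
    {
        rt[0] = ra[0] * rb[0] + ra[1] * rb[2] + rc[0];
        rt[1] = ra[0] * rb[1] + ra[1] * rb[3] + rc[1];
        rt[2] = ra[2] * rb[0] + ra[3] * rb[2] + rc[2];
        rt[3] = ra[2] * rb[1] + ra[3] * rb[3] + rc[3];
    }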

The fmatmul and fmma instructions were made available to the programmer through the intrinsics d = spu_fmma(a,b,c); and d = spu_fmatmul(a,b);. The instructions were added to the simulator and the compiler. The latency of the fmatmul instruction was set to six cycles. A normal floating point multiply-add takes six cycles. The fmatmul instruction performs twice the number of multiplies, but those can be done in parallel, and it performs an addition just like the normal multiply-add; thus the latencies of the two are the same. The final addition of the fmma instruction, though, has three inputs instead of two, and we account for this by adding two cycles to the latency. This is rather pessimistic, but on the safe side.

Table 5.5: Speedup and instruction count reduction obtained with the fmatmul instruction in the MatMul kernel.

                Baseline   Enhanced   Improvement
Cycles           85894      69094     16800   (1.24x)
Instructions    136585     111048     25537   (1.23x)

Table 5.6: Speedup and instruction count reduction obtained with the fmma instruction in the MatMul kernel.

                Baseline   Enhanced   Improvement
Cycles           85894      41735     44159   (2.06x)
Instructions    136585      63239     73346   (2.16x)

We applied the intra-vector instructions to the MatMul kernel taken from the Cell SDK. The kernel contains several versions of the SPU code that multiplies the 64 × 64 blocks, each with a higher degree of optimization. We used the optimized SIMD version, which is written in C. There is also an even more optimized version written in assembly; we will indicate the expected performance of our enhanced kernel when written in assembly. We did not change the general strategy of the 64 × 64 multiply function, but only changed the code at the sub-block level.

We measured the time of one 64 × 64 matrix multiply-add only. The MatMul kernel uses double buffering and is computation bound, and thus this measure is representative of the entire kernel. The measurement results presented here do not include the overhead cost of rearranging the data. This rearrangement has a time complexity of O(M^2) and is fully data parallel. The matrix multiplication itself has a time complexity of O(M^3), and thus the relative overhead decreases with increasing matrix size. For M = 64 the overhead is 19.1%, but for M = 1024 (which is a more realistic matrix size) the overhead is only 1.2%.

Table 5.5 shows the results obtained with the fmatmul instruction. A speedup of 1.24 is achieved and a similar reduction in the number of executed instructions. The speedup is significant but limited: every fmatmul instruction is followed by an addition, which offsets much of the benefit. This is confirmed by the results of the fmma instruction, depicted in Table 5.6. A speedup of 2.06 is achieved, which is much larger than the former because many add instructions are collapsed into the fmma instruction.

The speedup is also much larger than that of the previous kernels because the intra-vector instructions affect the entire kernel, whereas for the previous kernels they affected the horizontal mode only. Figure 5.4 shows that both the baseline and the enhanced code are highly optimized, as there are only very few dependency stalls and a high rate of dual issue cycles.

As mentioned above, the MatMul kernel can be further optimized when written in assembly. The SDK assembly version takes 65665 cycles and executes 105801 instructions. The improvement in performance, compared to the version we used, is due to optimizations in register mapping and instruction scheduling. The first results in a decreased number of loads and stores, while the latter improves the dual issue rate. Our enhanced kernel, written in C, outperforms the assembly code of the SDK. Furthermore, the same optimization techniques used in the SDK assembly version can be applied to our enhanced kernel. Our enhanced kernel will benefit less from the register mapping, as it is already more efficient in terms of loads and stores. Instruction scheduling, however, will increase performance more than was the case for the SDK code. The SDK assembly code has an IPC of 1.61 because there are many more even pipeline instructions than odd pipeline instructions. Our enhanced kernel has a more balanced number of even and odd pipeline instructions and could achieve an IPC of at most 1.81. Thus, scheduling alone could decrease the execution time of the enhanced kernel to 34902 cycles. From this we conclude that an assembly optimized version of our enhanced kernel would outperform the SDK assembly version by a factor of nearly 2x.

This kernel shows that, with some understanding of the algorithm, the implementation can be modified to use intra-vector instructions. By rearranging the data, overhead can be avoided and operations can be collapsed into one instruction.

5.2.6 Discrete Wavelet Transform Kernel (DWT)

The Discrete Wavelet Transform (DWT) generates a compact representation of a signal, taking advantage of data correlation in space and frequency [107, 51]. It has found use in many fields, ranging from signal processing, image compression, data analysis, and partial differential equations, to denoising. There exist several implementations of the DWT, such as convolution-based and lifting scheme methods. The latter allows the most efficient implementation of the 2D DWT and is used in the JPEG-2000 standard.

Figure 5.12: The three phases of the forward DWT lifting scheme (split, predict, and update), producing the high-pass and low-pass bands. The inverse DWT applies the inverse operations in reverse order.

A Cell implementation of JPEG-2000 is available from the CellBuzz project [6, 88]. The inverse DWT kernel from the lossless decoder formed the baseline kernel for this experiment. The forward DWT is very similar to the inverse DWT and would provide similar results. Also, the DWT in the lossy decoder is similar to the one in the lossless decoder. The baseline inverse DWT kernel is SIMDimized.

The DWT kernel applies five levels of decomposition. In each decomposition step the image is processed horizontally and vertically, constituting a 2D transform. In each of the 1D transforms a vector of pixels is split into two parts, as indicated in Figure 5.12 for the forward DWT. After splitting, the high-pass and low-pass bands are generated in two lifting steps. The inverse DWT performs the inverse operations in the reverse order.

Figure 5.13 shows schematically the operations performed by the inverse DWT. The example is simplified and does not show the processing of the samples at the beginning and end of the vector; those are handled in a slightly different manner. The input vector consists of two parts: the low-pass band on the left and the high-pass band on the right. First, one lifting step is applied on the low-pass band, followed by a lifting step on the high-pass band. Finally, the two are combined (inverse split) and the original samples are obtained.

In Figure 5.14 the same process is depicted in a different way. It shows that four adjacent output samples depend on three samples from the low-pass band and four samples from the high-pass band. Each sample is a 32-bit integer, and thus four samples fit in one SIMD vector of the SPE. Consequently, the computational structure of the figure can be computed by one intra-vector instruction, the idwt instruction.

Figure 5.13: Schematic overview of the computational structure of the inverse DWT on an example vector of eight samples. An arrow indicates a shift right by one bit (two bits for a double arrow). The edges of the two sub-bands are processed slightly differently and are not shown in this diagram.

The idwt instruction is efficient in the horizontal inverse DWT, as both the input and output vectors have their elements in consecutive memory locations. To apply the idwt instruction in the vertical DWT, a lot of rearrangement overhead would be required. Moreover, conventional SIMDimization is efficient in the vertical DWT, and therefore there is no need to apply the intra-vector instruction there.

As mentioned above, the first and last samples of a vector are processed slightly differently. Therefore, the idwt instruction can only be applied to the body of the vector. The head, consisting of four output samples, as well as the tail, consisting of at least one sample, is processed using scalar operations. The size of the vectors is usually large: assuming an image of 512 × 512 pixels, the vector sizes for the DWT are 32, 64, 128, 256, and 512.

Figure 5.14: A different representation of the computational structure of the inverse DWT. This time a larger vector is assumed and only partially displayed. All stages are combined in one step. The computations depicted can be performed with the idwt intra-vector instruction.

The idwt instruction fits the RR instruction format. The instruction details are depicted below. The corresponding intrinsic d = spu_idwt(a,b) was added to the compiler. The latency of the instruction in the simulator was set to eight cycles: the critical path contains four simple fixed point operations, each taking two cycles, adding up to eight. We assume the latency of the shifts can be neglected.

We evaluated the idwt intra-vector instruction using the Cell implementation of JPEG-2000. The DWT kernel in this application is a cooperation between the PPU and the SPUs. The former divides the work, while the SPUs DMA in the data and perform the actual DWT. We decoded a picture of 512 × 512 pixels and measured the actual time spent on the DWT in the SPUs. The results are depicted in Table 5.7 and show that a significant speedup of 1.59 is obtained.

Interestingly, the speedup is larger than the instruction count reduction. Figure 5.4 shows that the baseline kernel suffered severely from dependency and branch stalls. The new method of processing the array, enabled by intra-vector instructions, proves to be more efficient, as both types of stall cycles are substantially reduced. The DWT can also be applied to a one-dimensional data array, in which case the speedup achieved is even larger: we obtained a speedup of 2.53 on an array of 512 pixels.

Table 5.7: Speedup and instruction count reduction obtained with the idwt instruction in the inverse DWT kernel.

                Baseline    Enhanced   Improvement
Cycles          10948039    6893905    4054134   (1.59x)
Instructions     5117073    3685579    1431494   (1.39x)

idwt rt,ra,rb

00 0 1 1 0 0 1 1 1 0 RB RA RT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

t0 ← RA0::4 − ((RB0::4 + RB4::4 + 2) >> 2)
t2 ← RA4::4 − ((RB4::4 + RB8::4 + 2) >> 2)
t  ← RA8::4 − ((RB8::4 + RB12::4 + 2) >> 2)
t1 ← RB4::4 + ((t0 + t2) >> 1)
t3 ← RB8::4 + ((t2 + t) >> 1)

RT0::4  ← t0
RT4::4  ← t1
RT8::4  ← t2
RT12::4 ← t3
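The pseudocode above can be restated as the following scalar reference model. It is our own restatement, not code from the JasPer port: a[] holds the three low-pass samples, b[] the four high-pass samples, and out[] receives the four reconstructed samples.

    /* Scalar model of the idwt instruction semantics: element k of RA maps to
     * a[k] and element k of RB maps to b[k]. */
    void idwt_ref(const int a[3], const int b[4], int out[4])
    {
        int t0 = a[0] - ((b[0] + b[1] + 2) >> 2);
        int t2 = a[1] - ((b[1] + b[2] + 2) >> 2);
        int t  = a[2] - ((b[2] + b[3] + 2) >> 2);

        out[0] = t0;
        out[1] = b[1] + ((t0 + t2) >> 1);
        out[2] = t2;
        out[3] = b[2] + ((t2 + t) >> 1);
    }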

Whereas the inner loops of all other kernels operate on small sub-blocks, the DWT kernel processes entire rows and columns. Therefore, the speedup of the kernel depends on the input size. So far, an input image of 512 × 512 pixels was assumed. Figure 5.15 shows that for smaller input sizes, the speedup and instruction count reductions are actually larger. For smaller input sizes, the startup overhead (processing the edges, initializing variables, etc.) dominates. This overhead is larger in the baseline kernel than in the enhanced kernel. Therefore, the speedup is larger for small input sizes and flattens out for larger sizes.

The DWT kernel is another example of how large speedups can be achieved with intra-vector instructions. It shows that they can enable a more efficient processing of the data. The DWT kernel also shows that intra-vector instructions can be applied to one-dimensional kernels.

Figure 5.15: Speedup and instruction count reduction obtained for the DWT kernel with different input sizes.

5.3 Discussion

The results obtained are good, but some questions arise about implementability and cost. In this section we discuss those issues.

The first question is what the influence of the new instructions is on the latency of the existing instructions. Adding the intra-vector instructions to the existing functional units increases their complexity and latency. It seems best, though, to put all intra-vector instructions in one new functional unit, except those for the MatMul kernel, as they perform floating point operations in contrast to the others. This functional unit can consist of a series of ALUs and interconnect such that each intra-vector instruction can be mapped onto it. A similar structure of ALUs and interconnect has been used in [158, 172] for collapsing automatically generated instruction set extensions. Most likely, adding such a unit increases the latency of the total pipeline, meaning that the write-back is delayed. Due to the existence of the forwarding network, however, the other instructions are not affected. The forwarding network becomes larger though, and thus there is some additional area cost here. Future work is to investigate a smart implementation of such an intra-vector unit. Another approach would be to map the intra-vector instructions onto extensible processors such as the ARC700 [2], Tensilica's XTensa [154], or Altera's NioSII [18]. It has yet to be investigated whether this would be possible, though. The instructions for the MatMul kernel are probably best added to the existing single precision unit. Similarly, this would increase the latency of the pipeline.

A second issue is the area cost of the intra-vector instructions. Without an actual hardware implementation, some back-of-the-envelope calculations are the best we can do. The intra-vector unit needs five levels of simple fixed point operations (no multiplication) and interconnect in between. The fixed point unit of the SPE performs these operations, and from the floor plan we determined that the area of the fixed point unit is 3% of the total SPE. This fixed point unit can perform much more diverse operations than needed by the intra-vector unit, and can also operate on 32-bit values, while the intra-vector unit needs to operate on 8 and 16-bit values only. On the other hand, the intra-vector unit requires more ALUs per level and additional interconnect. We estimate that the area taken per level is 1.5 times that of the fixed point unit. Therefore, adding the intra-vector unit to the SPE requires approximately an additional 5 × 3% × 1.5 = 22.5% of area. Addition of the MatMul instructions to the single precision pipeline is expected to double its size, requiring 10% of additional area.

Another question might be whether the latencies of the instructions are correct. Discussion with experts showed us that our methodology of determining the latencies is on the safe side. We also analyzed the speedups for larger latencies, however. Figure 5.16 shows that the average speedup only drops from 1.45 to 1.31 in the unlikely case that the latencies would be twice as large.

The speedups achieved seem to be worth the area cost. The average speedup is obtained on kernels only, though. The speedup on complete applications will be lower, although in the signal processing domain the kernels generally dominate the execution time. Application specific processors targeted at signal processing therefore seem to be most suitable for specialized intra-vector instructions as we presented them.

5.4 Conclusions

In this chapter we have shown that intra-vector SIMD instructions are an effective technique for the specialization of cores, especially in the signal processing domain. For six kernels intra-vector instructions have been developed, and simulations have shown that a speedup of up to 2.06 was achieved, with an average of 1.45.

Figure 5.16: The average speedup as a function of the relative instruction latencies.

We have shown how intra-vector instructions reduce the data rearrangement overhead in two-dimensional kernels. The efficiency depends on several factors. The ability to collapse a large number of instructions increases the efficiency. On the other hand, efficiency is limited if a lot of rearrangement overhead is required for intra-vector processing, if many dependency stalls occur because the distance between dependent instructions has decreased, or if there is a low amount of DLP in the intra-vector instructions.

The evaluation was performed with the Cell SPE as baseline architecture. The results on different architectures will be slightly different. The general conclusion of this chapter, however, remains.

Chapter 6

Hardware Task Management Support for the StarSS Programming Model: the Nexus System

Successful deployment of many-core chips depends on the availability of parallel applications. Creating efficient parallel applications, though, is currently cumbersome and complicated as, besides parallelization, it requires task scheduling, synchronization, programming or optimizing data transfers, etc. Our experiences with the implementation of parallel H.264 decoding for several platforms, as described in Chapter 3, confirm this statement.

Currently there are several efforts to solve the programmability issues of multicores. Among these is the task-based StarSS programming model [129]. It requires the programmer only to indicate functions, referred to as tasks, that are amenable to parallel execution and to specify their input and output operands. All the other issues, such as determining task dependencies, scheduling, and programming and issuing data transfers, are automatically handled by StarSS.

The task management performed by the StarSS runtime system is rather laborious. This overhead affects the scalability and performance, and potentially limits the applicability of StarSS applications. In this chapter we study the benefits of accelerating task management with hardware support. We show that for fine-grained tasks, the runtime overhead greatly reduces performance and severely limits scalability. Through extensive analysis of the current system's bottlenecks, we specify the requirements of a hardware support system.

As an example of such a system, we present Nexus: a hardware task management support system compliant with the StarSS programming model. We present the design of the hardware, as well as the software interface. The Nexus system is evaluated by means of simulation, which shows that the requirements are met. By using the Nexus system, StarSS can be used efficiently with much smaller task sizes.

Some other works have proposed hardware support for task management. Most are limited, however, to the scheduling of independent tasks, leaving it up to the programmer to deliver tasks at the appropriate time. Carbon [94] provides hardware task queuing enabling low latency task retrieval. In [79] a TriMedia-based multicore system is proposed containing a centralized task scheduling unit based on Carbon. For a multicore media SoC, Ou et al. [122] propose a hardware interface to manage tasks on the DSP, mainly to avoid OS intervention. Architectural support for the Cilk programming model has been proposed in [106], mainly focusing on cache coherency.

A few works include hardware task dependency resolution. In [142] a look-ahead task management unit has been proposed, reducing the task retrieval latency. It uses task dependencies in a reversed way to predict tasks that are about to become available. Because of its programmable nature, its latency is rather large. The latter issue has been resolved in [17] at the cost of generality: the proposed hardware is specific to H.264 decoding. For Systems-on-Chip (SoCs), an Application Specific Instruction set Processor (ASIP) has been proposed to speed up software task management [35]. Independently, Etsion et al. [57] have also developed hardware support for the StarSS programming model. They observe a similarity between the input and output dependencies of tasks and those of instructions, and have proposed task management hardware that works similarly to an out-of-order instruction scheduler.

A different approach, called Decoupled Thread Architecture (DTA), has been proposed by Giorgi et al. [64] and is based on thread level data-flow. Threads can be as small as functions and thus their approach is applicable to fine-grained parallelization. Data is sent from a producer to a consumer thread directly by distributed hardware units. This data communication also functions as synchronization. In [65] an implementation of DTA for the Cell processor has been described. Also based on the data-flow principle is the approach proposed by Stavrou et al. [146]. In contrast to DTA, this approach is coarse-grained, especially the scheduling among cores. A programming model for these two data-flow approaches has been presented in [147].

Listing 6.1: Basics of the StarSS programming model.

int *A[N][N];

#pragma css task input(base[16][16]) output(this[16][16])
void foo(int *base, int *this) {
    ...
}

void main() {
    int i, j;
    ...
    #pragma css start
    foo(A[0][0], A[i][j]);
    ...
    #pragma css finish
    ...
}

This chapter is organized as follows. The StarSS programming model and the target processor platform are briefly reviewed in Section 6.1. The benchmarks used throughout this chapter are described in Section 6.2. In Section 6.3 the StarSS runtime system is analyzed. Section 6.4 compares the performance of the current StarSS system to that of a manually parallelized implementation and illustrates the need for hardware task management support. The proposed Nexus system is described in Section 6.5, while its performance is evaluated in Section 6.6. Future work is described in Section 6.7 and Section 6.8 concludes the chapter.

6.1 Background

The StarSS programming model is based on simple annotations of serial code in the form of pragmas. The main pragmas are illustrated in Listing 6.1. The pragmas css start and css finish indicate the initialization and finalization of the StarSS runtime system. Tasks are functions that are annotated with the css task pragma, including a specification of the input and output operands. In addition to the pragmas illustrated in this example, StarSS provides several synchronization pragmas.

The current implementation of StarSS uses one core with two threads to control the application, while the others (worker cores) execute the parallel tasks. A source-to-source compiler processes the annotated serial code and generates the code for the control and worker cores. The first thread of the control core runs the main program code that adds tasks, together with the corresponding part of the runtime system. The second thread handles the communication with the worker cores and performs the scheduling of tasks.

Tasks are added dynamically to the runtime system, which builds the task dependency graph based on the addresses of the input and output operands. Tasks whose dependencies are met are scheduled for execution on the worker cores. Furthermore, the runtime system manages data transfers between main memory and local scratchpad memories, if applicable. It tries to minimize the execution time by applying a few optimizations. First, it creates groups of tasks, referred to as bundles within StarSS. Using bundles reduces the per-task scheduling overhead. In addition, the runtime system optimizes for data locality by assigning chains within the task graph to bundles. Within such bundles, data produced by one task is used by the next, and thus locality is exploited.

Throughout this chapter, we use the StarSS instantiation for the Cell Broadband Engine, which is called CellSS. We use CellSS version 2.2 [4], which is the latest release at the time of writing. It has an improved runtime system with significantly lower overhead than prior releases.

CellSS has several configuration parameters that can be used to tune the behavior of the runtime system. Three of these significantly affect the execution of the benchmarks we use: scheduler.min tasks, scheduler.initial tasks, and scheduler.max strand size, whose default values are 16, 128, and 8, respectively. All three relate to assigning ready tasks to the worker cores, which is referred to as scheduling in CellSS. Scheduler.min tasks defines the minimum number of ready tasks that should be available to start the scheduling process. Scheduler.initial tasks is a similar threshold, but is used only the first time scheduling is applied. Scheduler.max strand size defines the maximum bundle size.

Measuring the execution time of the benchmarks using CellSS includes only the parallel section. The startup and finish phases of CellSS are relatively large compared to the small benchmark programs and are therefore excluded from the measurements. In practice this means that the timing is started directly after the #pragma css start and stopped just before the #pragma css finish, but after a #pragma css barrier (see Listing 6.2 in the next section). Putting the measurement points outside these pragmas would include the startup and finish phases.

Our experimental platform is the Cell processor, which was briefly described in Section 4.2. It contains one PowerPC Processing Element (PPE) and eight Synergistic Processing Elements (SPEs). The CellSS runtime system runs on two threads of the PPE. The SPEs wait to receive a signal from the helper thread indicating what task to perform. Once they are ready, they signal back and wait for another task. The SPEs have scratchpad memory only and use DMA (Direct Memory Access) commands to load data from main memory into their local store. When using bundles, CellSS applies double buffering, which hides the memory latency by performing data transfers concurrently with task execution. The measurements are performed on a Cell blade con- taining two Cell processors, running at 3.2GHz. Thus, in total 16 SPEs are available.

6.2 Benchmarks

To analyze the scalability of StarSS under different circumstances, we used three synthetic benchmarks. We used synthetic benchmarks rather than real applications because it allows controlling the application parameters such as the dependency pattern, the task execution time (and, hence, the computa- tion to communication ratio), etc. The three benchmarks are referred to as ND (No Dependencies), SD (Simple Dependencies), and CD (Complex De- pendencies). The dependency pattern of the CD benchmark is similar to the dependency pattern in H.264 decoding (see Figure 3.6). All three benchmarks process a matrix of 1024 × 1024 integers in blocks of 16 × 16 (see Listing 6.2). Note that A is a matrix of pointers to the 16 × 16 blocks and thus has size 64 × 64 and not 1024 × 1024. Each block operation is one task, adding up to a total of 4096 tasks. The function spend_some_time() allows us to modify the task execution time. The task execution time is varied from approximately 2µs to over 2ms. In the ND benchmark all tasks are independent of each other and thus the application is fully parallel. In the SD benchmark, each task depends on its left neighbor if it exists, as depicted in Figure 6.1(a). Thus, all 64 rows are independent and can be processed in parallel. Finally, in the CD benchmark, each task depends on its left and top-right neighbor if it exists, as depicted in Figure 6.1(b). Listing 6.2 shows the simplified code of the CD benchmark, including the annotations. Due to the task dependencies, the parallelism in the CD benchmark suffers from ramping, similarly as depicted in Figure 3.7. That is, during the execu- 184 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

Listing 6.2: The simplified code of the CD benchmark including StarSS annotations.

1 i n t ∗A[ 6 4 ] [ 6 4 ] ; 2 3 typedef struct params { 4 i n t y ; 5 i n t x ; 6 } p a r a m s t ; 7 8 #pragma c s s task input(prms, left[16][16], 9 topright [16][16]) 10 inout(this [16][16]) 11 void task(params t ∗ prms , i n t ∗ l e f t , i n t ∗ t o p r i g h t , 12 i n t ∗ t h i s ){ 13 ... 14 / / compute 15 .. 16 spend some time ( ) ; 17 return ; 18 } 19 20 void main ( ) { 21 i n t i , j ; 22 p a r a m s t my params ; 23 24 i n i t m a t r i x (A ) ; 25 26 #pragma c s s s t a r t 27 // start measurement 28 f o r ( i = 0 ; i <64; i ++){ 29 f o r ( j = 0 ; j <64; j ++){ 30 my params.y = i; 31 my params.x = j; 32 t a s k (&my params, A[i][j − 1 ] , A[ i −1][j+1], A[i][j]); 33 } 34 } 35 #pragma c s s b a r r i e r 36 // stop measurement 37 #pragma c s s f i n i s h 38 } 6.3. SCALABILITYOFTHE STARSSRUNTIME SYSTEM 185

(a) SD (b) CD

Figure 6.1: The task dependency patterns of the SD and CD benchmarks. The ND benchmark has no task dependencies. tion the number of available parallel tasks gradually increases, until halfway, after which the available parallelism similarly decreases back to one. There- fore, the amount of available parallelism at the start and end of the benchmark is lower than the number of cores. Due to this ramping effect in the CD bench- mark, the maximum obtainable parallelism is 14.5 using 16 cores. The other benchmarks do not suffer from this ramping effect and have a maximum ob- tainable parallelism of 16 when using 16 cores.

6.3 Scalability of the StarSS Runtime System

The main purpose of this section is to quantify the task management overhead of the StarSS runtime system. One of the main metrics is scalability, which is defined as the speedup of a parallel execution on N SPEs compared to execu- tion using one SPE. Besides measuring performance and scalability, we have also generated Paraver traces [12] to study the runtime behavior in more detail. First, we study the CD benchmark in detail, especially for small task sizes, as it resembles H.264 decoding, which has an average task size of approximately 20µs.

6.3.1 CD Benchmark

Using the default (16, 128, 8) configuration of StarSS, which represents (scheduler.min tasks, scheduler.initial tasks, scheduler.max strand size), a maximum scalability of 7.3 is obtained for the CD benchmark using 16 SPEs. 186 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

16 16 2373 us 16 SPEs 636 us 8 SPEs 14 241 us 14 4 SPEs 66.7 us 2 SPEs 19.1 us 1 SPE 12 11.2 us 12 1.92 us 10 10

8 8 Scalability Scalability 6 6

4 4

2 2

0 0 2 4 6 8 10 12 14 16 1 10 100 1000 Number of SPEs Task size (us) (a) (b)

Figure 6.2: Scalability of StarSS with the CD benchmark and using the default con- figuration.

Figure 6.2 shows the scalability of the CD benchmark with the default config- uration for several task sizes. Even when very large tasks of 2373µs are used, the SPEs are mostly idle and waiting for tasks, as illustrated in Figure 6.4, which shows a part of the trace. The colored parts indicate the execution phases, of which the legend is depicted in Figure 6.3. The first and second row of the trace corresponds to the main and helper thread, respectively. The other rows correspond to the 16 SPEs. The yellow markers indicate communication events, where the direction of the arrow indicates whether it is a receive or a send. In Figure 6.4 only the receives are depicted.

Figure 6.3: Legend of the StarSS Paraver traces. 6.3. SCALABILITYOFTHE STARSSRUNTIME SYSTEM 187 , and the default configuration. µs 2373 Partial Paraver trace of the CD benchmark with 16 SPEs, a task size of Figure 6.4: 188 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

16 16 2373 us 16 SPEs 636 us 8 SPEs 14 241 us 14 4 SPEs 66.7 us 2 SPEs 19.1 us 1 SPE 12 11.2 us 12 1.92 us 10 10

8 8 Scalability Scalability 6 6

4 4

2 2

0 0 2 4 6 8 10 12 14 16 1 10 100 1000 Number of SPEs Task size (us) (a) (b)

Figure 6.5: Scalability of StarSS with the CD benchmark and the (1, 1, 1) configura- tion.

The reason for the low scalability is found in the configuration. The runtime system waits for 16 tasks to be ready before it schedules them to SPEs. As there is a limited amount of parallelism available, cores are left idle, waiting for the runtime system to schedule tasks. Therefore, immediate scheduling of ready tasks is required, which is attained with a (1, y, z) configuration. Furthermore, the runtime system tries to group up to 8 tasks to a single SPE. Again, because of the limited amount of available parallelism, cores are left idle because the load is not properly balanced. Disallowing large groups, by using an (x, y, 1) configuration, forces the runtime system to balance the load among the SPEs. Furthermore, the absolute performance is improved by im- mediate start of scheduling at the beginning of the application, enforced by using a (x, 1, z) configuration. Scalability is not affected by this modification. Summarizing, the optimal configuration for this benchmark and with large tasks is (1, 1, 1). The default configuration is only beneficial for situations with ample parallelism. With the optimal settings, finally, a scalability of 14.2 is obtained for large tasks, as shown in the graphs of Figure 6.5. This scala- bility is very close to the maximum obtainable parallelism of 14.5. Figure 6.6 depicts a part of the Paraver trace. It shows that indeed all SPEs are almost continuously busy. This configuration, however, stresses the runtime system, especially for small tasks, as every single task is treated separately. Figure 6.7 depicts a part of the Paraver trace of the CD benchmark with a task size of 19µs, for which a scalability of 4.2 is obtained. The yellow lines connecting the rows indicate communication. The helper thread signals to the SPEs what task to execute. The SPEs, in turn, signal back when the task execution is finished. 6.3. SCALABILITYOFTHE STARSSRUNTIME SYSTEM 189 configuration. 1) , 1 , (1 , and the µs 2373 Partial Paraver trace of the CD benchmark with 16 SPEs, a task size of Figure 6.6: 190 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT iue6.7: Figure ata aae rc fteC ecmr ih1 Ps aksz of size task a SPEs, 16 with benchmark CD the of trace Paraver Partial 19 µs n h 111 configuration. (1,1,1) the and , 6.3. SCALABILITYOFTHE STARSSRUNTIME SYSTEM 191

The second row of the trace shows that scheduling (yellow), preparing (red), and submitting a bundle (green) takes approximately 6µs, while removing a task (magenta) takes approximately 5.8µs. As the total task size, including task stage in, wait for DMA, task execution, task stage out, and task finished notification, is 22.5µs, the helper thread can simply not keep up with the SPEs. Before the helper thread has launched the sixth task, the first one has already finished. However, its ready signal (yellow line) is processed much later, and before this SPE receives a new task, even more time is spent. For such small tasks, the performance can be improved by using the (1, 1, 4) configuration. Grouping 4 tasks in a bundle reduces the scheduling time with 45%, the preparation and submission of tasks with 73%, and the removal of tasks with 50%. In total, the per-task overhead is reduced from 11.6µs to 5.5µs resulting in an overall performance improvement of 20%. Despite this improvement in the runtime system’s performance, the scalabil- ity is only 5.2. Both the helper thread and main thread remain bottlenecks. Although using the (1, 1, 4) configuration the per-task overhead in the helper thread is reduced to 5.5µs, it is far from sufficient to leverage 16 SPEs effi- ciently. For that, at most a per-task overhead of 22.5/16 = 1.4µs is required. The main reason for the bottleneck in the main thread is very subtle. Although the benchmark exhibits sufficient parallelism, due to the order in which the main thread spawns tasks (from left to right and top to bottom), at least 16 “rows of tasks” (corresponding to 1/4th of the matrix) have to be added to the dependency graph before 16 independent tasks are available that can keep all cores busy. Thus, to exploit the parallelism available in the benchmark, the addition of tasks in the main thread must therefore run ahead of the parallel execution. The trace reveals that the main thread can only keep up with the parallel execution, and thus it limits the scalability of the system. Figure 6.8 depicts the scalability of StarSS with the CD benchmark using the (1, 1, 4) configuration. Although this configuration outperforms the (1, 1, 1) configuration for small task sizes, for large task sizes it limits the scalability to less than 10. To obtain a high scalability, the (1, 1, 1) configuration must be used in combination with a medium task size. More specifically, Figure 6.5 shows that in order to obtain a scalability efficiency of 80% when using 16 SPEs, which is a scalability of 12.8, at least a task size of approximately 100µs is required. Similarly, for 8 SPEs a task size of 50µs is required. 192 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

16 16 2373 us 16 SPEs 636 us 8 SPEs 14 241 us 14 4 SPEs 66.7 us 2 SPEs 19.1 us 1 SPE 12 11.2 us 12 1.92 us 10 10

8 8 Scalability Scalability 6 6

4 4

2 2

0 0 2 4 6 8 10 12 14 16 1 10 100 1000 Number of SPEs Task size (us) (a) (b)

Figure 6.8: Scalability of StarSS with the CD benchmark and the (1, 1, 4) configura- tion.

6.3.2 SD Benchmark

In the SD benchmark, tasks corresponding to different rows are completely independent of each other, because tasks only depend on their left neighbor. The difference in dependency pattern has several effects on the runtime system. First, this greatly simplifies the task dependency graph, which in turn might lead to decreased runtime overhead. Second, the dependency pattern enables the runtime system to create efficient bundles, further reducing the overhead. Our measurements, however, show only a small improvement in scalability compared to the CD benchmark. Analysis of the scalability of the SD benchmark with the (1, 1, 1) configuration shows hardly any difference with the CD benchmark. The creation of bundles improves scalability a little. The (1, 1, 8) configuration is optimal in terms of performance and scalability and provides a maximum scalability of 14.5. Allowing larger bundles does not improve performance. Figure 6.9 depicts the scalability of the SD benchmark for different task sizes. For a task size of 19µs a scalability of 5.6 is obtained. Analysis of the Paraver trace, depicted in Figure 6.10, shows that the main thread is the bottleneck. It is continuously busy with user code (26.6%), adding tasks (50.9%), and removing tasks (20.6%). On average per task these phases take 2.0µs, 3.8µs, and 1.5µs, respectively. This indicates that the time taken by building and maintaining the task dependency graph is not much affected by the dependency pattern, unlike expected. The stress on the helper thread, on the other hand, is significantly reduced by the use of bundles. On average, the time spent on scheduling, preparation of 6.3. SCALABILITYOFTHE STARSSRUNTIME SYSTEM 193

16 16 2372 us 16 SPEs 634 us 8 SPEs 14 240 us 14 4 SPEs 66.4 us 2 SPEs 19.1 us 1 SPE 12 11.3 us 12 1.85 us 10 10

8 8 Scalability Scalability 6 6

4 4

2 2

0 0 2 4 6 8 10 12 14 16 1 10 100 1000 Number of SPEs Task size (us) (a) (b)

Figure 6.9: Scalability of StarSS with the SD benchmark and the (1, 1, 8) configura- tion. the bundles, and submission of the bundles is 1.86µs, 0.10µs, and 0.20µs, respectively. From this, one might conclude that the helper thread to SPE communication is sufficiently efficient to be scalable. This is not the case, however, due to the round-robin polling mechanism used by the helper thread. The Paraver trace depicted in Figure 6.7 shows that the helper thread checks the status of the SPEs in a round-robin fashion. The SPE’s finish signal is not directly handled by the helper thread, but only when it is its turn to be checked. Even if the helper thread spends only 2.16µs per SPE, in the worst case situation it takes 34.56µs before the finish signal is handled.

6.3.3 ND Benchmark

The scalability of the ND benchmark is depicted in Figure 6.11. Similar to the SD benchmark, the (1, 1, 8) configuration is optimal as it allows the runtime system to create bundles. The maximum obtained scalability is 15.8, which is significantly higher than with the two other benchmarks. This is because all tasks are independent. Thus, first, there is no ramping effect and second, as soon as 16 tasks have been added, the runtime system detects 16 independent tasks to execute. The improvement for smaller task sizes is also due to the increased available parallelism. The time spent on building and maintaining the task graph, how- ever, is approximately equal to the time spent on these for the SD benchmark, limiting scalability. We conclude that not the dependencies by itself are time- consuming, but the process of checking for dependencies. 194 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT iue6.10: Figure ata aae rc fteS ecmr ihats ieof size task a with benchmark SD system. the the bottlenecks of trace Paraver Partial 19 µs n the and (1 , 1 , 8) ofiuain h anthread main The configuration. 6.4. A CASEFOR HARDWARE SUPPORT 195

16 16 1586 us 16 SPEs 425 us 8 SPEs 14 162 us 14 4 SPEs 45.5 us 2 SPEs 13.8 us 1 SPE 12 8.6 us 12 1.75 us 10 10

8 8 Scalability Scalability 6 6

4 4

2 2

0 0 2 4 6 8 10 12 14 16 1 10 100 1000 Number of SPEs Task size (us) (a) (b)

Figure 6.11: Scalability of StarSS with the ND benchmark and the (1, 1, 8) configu- ration.

6.4 A Case for Hardware Support

In the previous section we analyzed the StarSS system and showed that the current runtime system has a low scalability for fine-grained tasks. To show the potential gains of accelerating the runtime system, in this section we compare StarSS to manually parallelized implementations of the benchmarks using a dynamic task pool. Based on the analysis presented in this and the previous section, we identify the bottlenecks in the StarSS runtime system. Those lead to a set of requirements that hardware support for task management should comply with.

6.4.1 Comparison with Manually Parallelized Benchmarks

The benchmarks are also manually parallelized. The task functions were off- loaded to SPE code. On the PPE, tasks are added to a task pool, which dy- namically keeps track of task dependencies. An interface between the SPEs and the PPE was implemented, such that the task pool, running on the PPE, assigns tasks to SPEs. As for each task the dependencies are predetermined, the process of dependency resolution has become very simple and fast. A sim- ple Last-In-First-Out (LIFO) buffer is used to minimize the scheduling latency. The implementation of these manually parallelized benchmarks took a lot of time compared to the implementations of the StarSS benchmarks. Especially, building and optimizing the task pool took significant effort [43]. The scalability of the benchmarks with the manually coded task pool is shown in Figures 6.12, 6.13, and 6.14. The first figure shows that for a task size of 196 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

16 16 2373 us 16 SPEs 636 us 8 SPEs 14 241 us 14 4 SPEs 66.7 us 2 SPEs 19.1 us 1 SPE 12 11.2 us 12 1.92 us 10 10

8 8 Scalability Scalability 6 6

4 4

2 2

0 0 2 4 6 8 10 12 14 16 1 10 100 1000 Number of SPEs Task size (us) (a) (b)

Figure 6.12: Scalability of the manually parallelized CD benchmark.

16 16 2372 us 16 SPEs 634 us 8 SPEs 14 240 us 14 4 SPEs 66.4 us 2 SPEs 19.1 us 1 SPE 12 11.3 us 12 1.85 us 10 10

8 8 Scalability Scalability 6 6

4 4

2 2

0 0 2 4 6 8 10 12 14 16 1 10 100 1000 Number of SPEs Task size (us) (a) (b)

Figure 6.13: Scalability of the manually parallelized SD benchmark.

16 16 1586 us 16 SPEs 425 us 8 SPEs 14 162 us 14 4 SPEs 45.5 us 2 SPEs 13.8 us 1 SPE 12 8.6 us 12 1.75 us 10 10

8 8 Scalability Scalability 6 6

4 4

2 2

0 0 2 4 6 8 10 12 14 16 1 10 100 1000 Number of SPEs Task size (us) (a) (b)

Figure 6.14: Scalability of the manually parallelized ND benchmark. 6.4. A CASEFOR HARDWARE SUPPORT 197

StarSS (CD) StarSS (SD) 100 StarSS (ND) Manual (CD) Manual (SD) Manual (ND)

10 Task size (us)

1 2 4 8 16 Number of SPEs Figure 6.15: The iso-efficiency lines of the StarSS and Manual systems (80% effi- ciency).

19µs, the scalability of the CD benchmark using 16 SPEs is almost 12. This is much higher than the scalability of 4.8 obtained with StarSS. Similarly, the scalability of the other two benchmarks for small task sizes is significantly higher compared to StarSS. Although the scalability of the manually parallelized benchmarks is much bet- ter than the scalability obtained with StarSS, for small task sizes it is still lim- ited. Similar to StarSS, these benchmarks suffer from the round-robin polling mechanism used by the PPE to handle mailbox messages from the SPEs. The PPE spends 3.5µs per task on updating the dependency table, submitting tasks to the ready queue, and submitting a new task to the SPE. As SPEs have to wait their turn, typically they are waiting 16µs before receiving the new task. A significant performance improvement can be made when the hardware would provide faster synchronization primitives. Figure 6.15 compares the scalability of the StarSS system to that of the man- ually parallelized (Manual) system. It displays the iso-efficiency lines, which represent the minimal task size required for a certain system in order to obtain a specific scalability efficiency. Scalability efficiency is defined as S/N, where S is the scalability and N is the number of cores. As scalability is defined rel- ative to one SPE, in this case we define N as the number of SPEs, ignoring the PPE. In Figure 6.15 the iso-efficiency graph for an efficiency of 80% is de- picted. Table 6.1 lists the values of the iso-efficiency function. For the Manual 198 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

Table 6.1: The iso-efficiency values of the StarSS and Manual systems (80% effi- ciency).

Task size (µs) no. SPEs StarSS Manual ND SD CD ND SD CD 2 6.3 7.4 6.3 4 9.7 9 10.8 3.9 8 32.8 38.6 46.9 6.6 7 8.1 16 59 62.6 134.4 26 32.9 50.1 system and a small number of SPEs, an efficiency of 80% is always obtained, irrespective the task size. Therefore, the corresponding iso-efficiency points do not exist and are not displayed in the graph and table. From the iso-efficiency graph it is clear that the Manual system achieves an efficiency of 80% for much smaller task sizes than StarSS does. For eight SPEs the difference in task size is between a factor of 5 and 6, while for 16 SPEs the difference is between 2 and 3 times. The main cause for this dif- ference is the dependency resolution process of the StarSS runtime system. The iso-efficiency lines of both system increase with the number of SPEs. This is expected as the overhead increases with the number of SPEs, and thus longer tasks are required to hide the task management latency on the PPE. Especially the polling mechanism used to synchronize incurs increasing over- head for larger core counts. For the Manual system this is the main reason for the line increase. Furthermore, increasing the core count beyond eight requires off-chip communication, increasing the synchronization latency and reducing scalability. Besides a lower scalability, also the absolute performance of StarSS is lower, up to four times, than that of the Manual System for fine-grained tasks. Ta- ble 6.2 depicts the execution times of the three benchmarks. For larger task sizes the difference diminishes as the task management overhead becomes negligible compared to the task execution time. The task sizes used in the ND benchmark are slightly different from those used in the other benchmarks. This is due to the way the spend_some_time() function is implemented.

6.4.2 Requirements for Hardware Support

From the prior analysis, we conclude that the StarSS runtime system currently does not scale well for fine-grained task parallelism. The manually parallelized 6.4. A CASEFOR HARDWARE SUPPORT 199

Table 6.2: Absolute execution times, in ms, for different task sizes, in µs, of the three benchmarks for the StarSS and Manual systems, and using 8 SPEs.

Benchmark System Execution time per task size 1.92 11.2 19.1 66.7 241 636 2373 CD StarSS 23 23 23 41 132 340 1250 Manual 6.0 10 13 39 130 337 1250 1.85 11.3 19.1 66.4 240 634 2372 SD StarSS 17 17 18 39 128 330 1230 Manual 4.1 9.4 13 37 126 327 1210 1.75 8.6 13.8 45.5 162 425 1586 ND StarSS 18 19 18 26 86 221 818 Manual 4.5 7.0 10 26 86 221 813 benchmarks, though, showed that it is possible to achieve a much higher scal- ability, while using dynamic task management. The main bottleneck of the StarSS runtime system is the dependency resolution process. Irrespective of the actual dependency pattern, determining whether there are task dependencies is a laborious process that cannot keep up with the parallel task execution. In order to efficiently utilize 8 or 16 SPEs for H.264 decoding, roughly corresponding to the CD benchmark with a task size of 20µs, the process of building and maintaining the task dependency graph should be reduced from 9.1µs to 2.5µs and 1.3µs, respectively. Therefore, we describe the first requirement for task management hardware support as follows:

Requirement 1 The hardware support system should accelerate the task de- pendency resolution process by at least a factor of 3.6 and 7.3, when using 8 and 16 SPEs, respectively.

Preferably, this process is accelerated even more, such that it can run ahead of execution in order to reveal more parallelism. As such a large speedup is re- quired we expect that some hardware acceleration, such as special instructions, will not be sufficient. Instead, we propose to perform this process completely in hardware. The second bottleneck is the scheduling, preparation, and submission of tasks, performed by the helper thread in StarSS. In total this per-task overhead is 5.5µs, where at most a latency of 2.8µs and 1.4µs is required, for 8 and 16 SPEs, respectively. 200 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

Requirement 2 The hardware support system should accelerate scheduling, preparation, and submission of tasks by at least a factor of 2.0 and 3.9, when using 8 and 16 SPEs, respectively.

The scheduler tries to minimize the overhead by creating bundles of tasks, and tries to exploit data locality by grouping a dependency chain within one bundle. For small tasks, however, the scheduling costs more than the benefits it provides. A rudimentary, yet fast, scheduling approach might therefore be more efficient than a sophisticated one. The latency of synchronization on the Cell processor is too long due to the polling mechanism. Interrupts are more scalable than the polling mechanism, but still need to execute a software routine. Instead, SPEs should be able to retrieve a new task themselves instead of waiting for one to be assigned. Hard- ware task queues, such as Carbon [94], is an example of such an approach. Each core can read a task from the queue autonomously, avoiding off-chip communication and execution of a software routine. This approach does not only provide benefit for the Cell processor, but also using a more conventional multicore [94]. Assuming a 5% overhead for synchronization is acceptable, the synchronization latency should be at most 1µs, irrespective of the number of cores.

Requirement 3 The hardware support system should provide synchronization primitives for retrieving tasks, with a latency of at most 1µs, irrespective of the number of cores.

When compliant with these three requirements, a hardware support system will effectively improve the scalability of StarSS, while maintaining its ease of programming. In the next section we propose the Nexus system, which is developed based on this set of requirements.

6.5 Nexus: a Hardware Task Management Support System

In this section we present the design of the Nexus system, which provides the required hardware support for task management in order to efficiently exploit fine-grained task parallelism with StarSS. Nexus can be incorporated in any multicore architecture. In this section, as an example, we present a Nexus design for the Cell processor. 6.5. NEXUS: A HARDWARE TASK MANAGEMENT SUPPORT SYSTEM 201

MEM SPE SPE TPU

ELEMENT INTERCONNECT BUS

MFC TC PPE SPE LS SPU SPE

Figure 6.16: Overview of the Nexus system (in bold) implemented in the Cell pro- cessor. The depicted architecture could be one tile of a larger system.

First, we provide an overview of the Nexus system and explain its main oper- ation based on the life cycle of a task. Second, we present the design of the Task Pool Unit (TPU), which is the main hardware unit of the Nexus system. Finally, we present the Nexus API.

6.5.1 Nexus System Overview

The Nexus system contains two types of hardware units (see Figure 6.16). The first and main unit is the Task Pool Unit (TPU). It receives tasks descriptors from the PPE, which contain the meta data of the tasks, such as the function to perform and the location of the operands. It resolves dependencies, enqueues ready tasks to a memory mapped hardware queue, and updates the internal task pool for every task that finishes. The TPU is designed for high throughput and therefore it is deeply pipelined. It is directly connected to the bus to allow fast access from any core. The second unit in the system is the Task Controller (TC), which can be placed at individual cores. The TC fetches tasks from the TPU, issues the DMAs of input and output operands, and implements double buffering. While the first task is executed, the inputs of the next are loaded. TCs are optional as the TPU is memory mapped and thus can be accessed by any core directly. No design of the TC has been developed as of yet. In the future work section of this chapter we discuss some implementation details, though. 202 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

TPU TPU TPU TPU PPE descriptor descriptor ready SPE finish loader handler queue handler

Phase 1 Phase 2 Phase 3Phase 4 Phase 5 Phase 6 Create task Load task Process task Add ready tasks Read ready Handle finish descriptor; descriptor to descriptor; to ready queue queue; buffer; write to task storage fill the tables load task process kick−off in buffer descriptor; lists execute task; write task id to finish buffer

Figure 6.17: The phases of the task life cycle and the corresponding hardware units.

Storing all tasks in a central unit provides low latency dependency resolution and scheduling. The hardware queues ensure fast synchronization. Such a centralized approach, however, eventually creates a scalability bottleneck for increasing core count. In such a case, a clustered approach can be used. Tasks are distributed among TPUs such that inter-TPU communication is low while task stealing allows system-wide load balancing.

6.5.2 The Task Life Cycle

The main operation of the Nexus system is described by the task life cycle il- lustrated in Figure 6.17. The life cycle of tasks starts at the PPE which prepares the task descriptors dynamically. It sends a task descriptor pointer, as well as the size of the task descriptor, to the TPU. In the second phase, the TPU loads the descriptor into a task storage, where it remains until the task is completely finished. The task storage can be thought of as the local store of the TPU. Once the task descriptor is loaded, in the third phase, the information in the descriptor is processed. The new task is added to the task graph, by checking for dependencies on other tasks. The TPU performs this dependency resolution by using three tables, as explained in the next section. This phase, therefore, consists of filling the tables with the required information, which consumes only little time as will be shown. The task resides in the tables until its dependencies are resolved. This process is handled by the TPU in the fourth phase. It maintains the dependency graph throughout the execution, detects when the task can be executed, and then adds the task to a ready queue. 6.5. NEXUS: A HARDWARE TASK MANAGEMENT SUPPORT SYSTEM 203

In the fifth phase, the task is executed. First, the SPE reads the task id from the ready queue. This process can be performed by either its TC or by software running on the SPU. The task descriptor is loaded from the task storage after which the task operands are loaded into the local store using DMA commands. When execution is finished and the task output is written back to main memory, the task id is signaled to the TPU. Finally, in the sixth phase, the TPU handles the finish signal sent by the SPE. Based on the id of the task, it updates the tables, which thereafter indicate which tasks have no dependencies left. It then adds these tasks to the ready queue. Thus, the sixth phase of the current task equals the fourth phase of the tasks that are added to the ready queue.

6.5.3 Design of the Task Pool Unit

The block diagram of the Nexus TPU is depicted in Figure 6.18. Its main fea- tures are the following. The task life cycle is pipelined to increase throughput. Dependency resolution consists of table lookups only, and therefore has low latency. All task descriptors are stored in the task storage to avoid off-chip communication. These three features are described in detail below.

Pipelining

The task life cycle is pipelined using the phases described in the previous sec- tion. For example, while the TPU is loading the descriptor of the first task, the PPE prepares the descriptor of the second task. Similarly, the other phases are pipelined. As a consequence, the TPU contains separate units for phases 2, 3, and 4/6 that operate concurrently. As explained in the previous section, phases 4 and 6 belong to the same operation and therefore are mapped to a single unit. Phase 2 is performed by the descriptor loader. It reads the in buffer and issues a DMA to load the task descriptor into the task storage. All ports are memory mapped and can be accessed from anywhere in the system. The descriptor handler is responsible for phase 3. It reads the task descriptor from the task storage and performs the dependency resolution process, as described below. Finally, phases 4 and 6 are performed by the finish handler. It reads task ids from the finish buffer and updates the tables. The three tables, the two buffers, the ready queue, and the task storage all have a fixed size. Thus, they can be full in which case the pipeline stalls. For example, if the task storage is full, the descriptor loader stalls. If no entry 204 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

task storage ptr size status register descriptor 1 descriptor 2 descriptor .... loader in buffer id *descriptor status #deps descriptor handler

task table address kick−off list address #deps kick−off list

producers table consumers table ready queue finish buffer finish handler

id *descriptor id

Figure 6.18: Block diagram of the Nexus TPU. The ports are memory mapped and can be accessed anywhere from the system. of the task storage is removed, the in buffer will quickly be full too, which stalls the process of adding tasks by the PPE. Deadlock cannot occur, however, because in the StarSS programming model tasks are added in serial execution order. That means that if tasks are processed in this order sequentially, the execution will be correct.

Dependency Resolution

Generally, task dependencies are expressed by using a task dependency graph. Building such a task dependency graph, however, is very time-consuming. In the TPU, dependencies are resolved by using three tables. Thus, the depen- dency resolution process can be performed fast, as it consists of simple lookups 6.5. NEXUS: A HARDWARE TASK MANAGEMENT SUPPORT SYSTEM 205 only. There is no need to search through the tables to find the correct item. The task table contains all tasks in the system and records their status and the number of tasks it depends on. Tasks with a dependency count of zero are ready for execution and added to the ready queue. The task table is indexed by the task id. The consumers table is used to prevent write-after-read hazards. It contains the addresses of data that is going to be produced by a pending task. Any task that requires that data can subscribe itself to that entry by adding its id to the kick-off list and increasing its own dependency count in the task table. When a task finishes, the finish handler processes the kick-off lists of the entries in the consumers table corresponding to the addresses of the output operands of the task. That is, for each task id in the kick-off lists, it decreases the dependency count of that task in the task table. As example, assume task T 1 writes to address A, and task T 2 reads from address A. Thus, task T 2 depends on task T 1. When task T 1 is added by the descriptor handler, an entry for address A is generated in the consumers table, with an empty kick-off list. Task T 1 is also added to the task table. When task T 2 is added, the consumers table is checked for address A. As it is present, id T 2 is added to the kick-off list of that entry. Further, task T 2 is added to the task table with a dependency count of one. Once task T 1 finishes, the finish handler reads the entry of address A in the consumers table. In the kick-off list, it finds the id of task T 2 and decreases the dependency count of task T 2 in the task table. As the dependency count is zero, task T 2 is added to the ready queue. Similarly, the consumers table prevents read-after-write hazards. This table contains the addresses of data that are going to be read by pending tasks. Any new task that will write to these locations can subscribe itself to the kick-off list. As example, assume task T 1 reads from address A while task T 2 writes to address A. In this case task T 2 is indirectly depending on task T 1. Task T 2 does not need the output of task T 1, but it should not write to address A before task T 1 has read from it. This read-after-write hazard is handles as follows. When task T 1 is added by the descriptor handler, an entry for address A is generated in the consumers table. The dependency count of that entry is set to one, while the kick-off list remains empty. In case an entry for address A already exists only the dependency count of that entry is increased by one. Further, task T 1 is added to the task table. When task T 2 is added by the descriptor handler, the consumers table is checked for address A. As the entry 206 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT is found, the id of task T 2 is added to the kick-off list of that entry. Further, task T 2 is added to the task table with a dependency count of one. When task T 1 finishes, the finish handler decreases the dependency count of entry A in the consumers table. If it reaches zero, all consumers of address A have read the data, and thus the kick-off list is processed. 
The id of task T 2 is found in the kick-off list and thus its dependency count in the task table is decreased by one. It reaches zero, and thus it is added to the ready queue. Finally, write-after-write hazards are handled by the producers table. A task writing to a particular address that is already in the producers table, subscribes itself to the corresponding entry. This subscription, however, is performed in a different manner, as explained in Section 6.5.3. The lookups in the producers and consumers tables are address-based. As the lists are much smaller than the address space, a hashing function is used to generate the index.

Task Storage

The task storage contains only task descriptors. The amount of data maintained within a task descriptor depends on the number of task operands. Thus variable sized task descriptors might be used to optimally utilize the available memory. That requires complicated memory management, however. Instead, fixed-size task descriptors are used. Typically, tasks have a few operands only. Assuming a maximum of five operands, the task descriptor is 64 bytes large. The task storage is divided in slots of 64 bytes. For each slot one bit is required to indicate if it is occupied. Task ids correspond to the slot of its task descriptor in the task storage and are assigned by the descriptor loader. Note that the source-to-source compiler can detect tasks with more than five operands. It can warn the programmer or schedule the tasks to the PPU without intervention of the TPU.

Kick-off List Barriers

The consumers table contains one kick-off list per address. This is sufficient as long as there is only one producer per address. If multiple tasks write to the same address, multiple kick-off lists for that address are required. 6.5. NEXUS: A HARDWARE TASK MANAGEMENT SUPPORT SYSTEM 207

1 1 war A 2 waw 2 raw

3 3 A war 4 waw 4 raw 5 A 5 (a) (b)

Figure 6.19: (a) An example of tasks writing to and reading from the same address A and (b) the corresponding task dependency graph.

Consider the example depicted in Figure 6.19. The left side of the figure de- picts the data that is produced or consumed by tasks T 1 through T 5. The right side of the figure illustrates the dependencies. First, task T 1 writes to location A, after which task T 2 reads from location A. Adding task T 1 by the descrip- tor handler, adds the address of A to the consumers table. Adding task T 2 adds id 2 to kick-off list of address A in the consumers table. Next, task T 3 also writes to location A, after which task T 4 reads the new value from location A. Addition of task T 3 does not change the producer table. Addition of task T 4, however, adds id 4 to the kick-off list of address A in the producer table, assuming one kick-off list per address is maintained. If now task T 1 finishes, the kick-off list of address A in the producer table is processed and both task T 2 and T 4 are put in the ready queue. Task T 4, however, is depending on task T 3 which has not been executed. Similarly, task T 5 is kicked-off from the consumers table before task T 4 is executed. We solved this issue by using barriers in the kick-off lists. Barriers divide kick- off lists in multiple parts, as if they are separate lists. A task that writes to an address that is already in the producers table, adds a barrier to the kick-off list of the corresponding entry. Similarly, a task that reads from an address already in the consumers table, and which entry has a non-empty kick-off list, adds a barrier to that kick-off list. This process is illustrated in Figure 6.20 for the example of Figure 6.19. Each row shows the changes in the relevant entries of the three tables for a particular step. In the top five rows, the tasks 1 through 5 are added to the system, which is done by the descriptor handler. The bottom five rows show how the finish handler updates the tables when it processes ids 1 through 5. 208 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

Producers table Consumers table Task table

adding task address kick−off list address #deps kick−off list id *descr. status #deps 1 A 1 ...... 0

2 A2 A 12 ...... 1

3 A2 −3 A 13 3 ...... 2

4 A2 −3 4 A 1 3 −1 4 ...... 2

5 A2 −3 4 −5 A1 3 −1 5 5 ...... 2

finishing task 1 A4 −5 A1 3 −1 5 2 ...... 0 time 3...... 1

2 A4 −5 A 1 5 3 ...... 0

3 A A 15 4 ...... 0 5 ...... 1

4 A 5 ...... 0

5

Figure 6.20: The dependency resolution process for the example of Figure 6.19.

In the first step, address A is added to the consumers table and task T 1 is added to the task table with a dependency count of 0. In the second step, an entry for address A is created in the consumers table, task T 2 subscribes to address A in the consumers table, and task T 2 is added to the task table with a dependency count of 1. In the third step, task T 3 has to subscribe to address A in the consumers table to prevent a write-after-write hazard. It also has to add a barrier. These two are combined by writing the negative task id in the kick-off list. Task T 3 also subscribes to address A in the consumers list, because of the dependency on task T 2. Task T 3 is added to the task table with a dependency count of 2. In the fourth step task T 4 is added. It subscribes to address A in the consumers table as normal due to its dependency on task T 3. Further, it places a barrier in the consumers table by adding a value of -1 to the kick-off list. In the consumers table the barriers are indicated with negative values too. The 6.5. NEXUS: A HARDWARE TASK MANAGEMENT SUPPORT SYSTEM 209 absolute value of the barrier indicates the number of tasks that need to read the data in the corresponding address. Thus, it is similar to the dependency count value in the entries of the consumer table. If another task that reads from address A would be added now, the barrier would be decremented instead of adding another barrier. The addition of task T 5 in the fifth step is similar to the third step. When task T 1 finishes, the finish handler processes the kick-off list in the pro- ducer table for address A up to and including the barrier. Thus, the dependency counts of both task T 2 and T 3 are decremented. The dependency count of the first reaches zero, and thus task T 2 is added to the ready queue. Further, task T 1 is removed from the task table. The consumers table is unaltered. When task T 2 finishes, the entry for address A in the consumers table is processed. The dependency count in that entry is decremented and becomes zero. Thus the kick-off list is processed up to the barrier, which adds task T 3 to the ready queue. The barrier itself is negated and becomes the new dependency count of that entry. Similarly, in the final three steps the tables are updated until all tasks have completed and the tables are empty.

6.5.4 The Nexus API

In this section the software interface to the Nexus system is described. It shows the relation between the programming model, the runtime system, and the TPU.

Initialization

The Cell processor works with virtual addresses and physical addresses, re- ferred to as effective and real addresses, respectively. The real address space is the set of all addressable bytes in physical memory and on memory mapped devices, such as the TPU. Programs, though, use effective addresses only, and thus a translation is always necessary. The real address of the TPU is fixed and known a priori. Its mapping to the effective address space is a runtime issue, and requires an mmap system call. Therefore, before any program can use the TPU, an initialization must take place, in which the effective address of the TPU is retrieved. Listing 6.3 shows how the TPU can be initialized. The function init_tpu() is part of libtpu, which is the runtime system of Nexus. 210 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

Listing 6.3: Initialization retrieves the effective address of the TPU from the OS.

1 # i n c l u d e < l i b t p u . h> 2 3 t p u h a n d l e t t p u h a n d l e ; 4 5 i n t main ( ) { 6 t p u handle = init t p u ( ) ; 7 ... 8 }

Adding Tasks

The user specifies tasks with pragmas as shown in Listing 6.2, while the TPU requires a task descriptor. The source-to-source compiler analyzes the user code and transforms it into code that creates and initializes the task descriptor. For example, the code of Listing 6.2 would be translated into that of List- ing 6.4. The task descriptor has a header and body, which contains the parameter infor- mation. The header specifies the task function and the number of parameters it takes. In this case the function task was assigned number 1 by the source-to- source compiler. This number is used by the SPEs to call the correct function. These functions are extracted from the source code by the source-to-source compiler and put in the SPE code with an automatic generated main function. For each task parameter several fields are specified in the task descriptor. The io type can be IN, OUT, or INOUT, which are defined in libtpu. The other four entries specify the location and size of the parameter, which can be stored in a strided fashion. Once the task descriptor is filled, the tpu_add_task runtime system func- tion is called. It requires as arguments the tpu handle, a pointer to the task descriptor, and its size, although the last parameter is not necessary for the current TPU implementation. The runtime system copies the pointer and the size to the in buffer of the TPU, and not the descriptor itself. If the in buffer is full, it overwrites the last item. Thus, the function first checks the status register of the TPU. The implementation of the tpu_add_task function is shown in Listing 6.5. The status register is a 32-bit value, of which the first byte contains the number of empty elements in the in buffer, and the second byte contains the state of the TPU. The other two bytes are currently not used. 6.5. NEXUS: A HARDWARE TASK MANAGEMENT SUPPORT SYSTEM 211

Listing 6.4: Preparation of the task descriptor and submission of it to the TPU.

1 t a s k d e s c r i p t o r t ∗ t a s k d list = malloc(....); 2 3 t a s k d e s c r i p t o r t ∗ t d p t r = t a s k d l i s t ; 4 f o r ( i = 0 ; i <64; i ++){ 5 f o r ( j = 0 ; j <64; j ++){ 6 t a s k d e s c r i p t o r t t d ; 7 t d p t r−>t a s k f u n c = 1 ; 8 t d p t r−>no params = 4 ; 9 t d p t r−>p 1 i o t y p e = IN ; 10 t d p t r−>p 1 s t a r t p t r = & my params ; 11 t d p t r−>p 1 x l e n g t h = s i z e o f ( p a r a m s t ) ; 12 t d p t r−>p 1 y l e n g t h = 1 ; 13 t d p t r−>p 1 y s t r i d e = 0 ; 14 t d p t r−>p 2 i o t y p e = IN ; 15 t d p t r−>p 2 s t a r t p t r = A[ i ] [ j −1]; 16 t d p t r−>p 2 x l e n g t h = 1 6 ∗ s i z e o f ( i n t ); 17 t d p t r−>p 2 y l e n g t h = 1 6 ; 18 t d p t r−>p 2 y s t r i d e = 1 6 ∗ s i z e o f ( i n t ); 19 t d p t r−>p 3 i o t y p e = IN ; 20 t d p t r−>p 3 s t a r t p t r = A[ i −1][ j + 1 ] ; 21 t d p t r−>p 3 x l e n g t h = 1 6 ∗ s i z e o f ( i n t ); 22 t d p t r−>p 3 y l e n g t h = 1 6 ; 23 t d p t r−>p 3 y s t r i d e = 1 6 ∗ s i z e o f ( i n t ); 24 t d p t r−>p 4 i o t y p e = INOUT ; 25 t d p t r−>p 4 s t a r t ptr = A[i][j]; 26 t d p t r−>p 4 x l e n g t h = 1 6 ∗ s i z e o f ( i n t ); 27 t d p t r−>p 4 y l e n g t h = 1 6 ; 28 t d p t r−>p 4 y s t r i d e = 1 6 ∗ s i z e o f ( i n t ); 29 30 t p u a d d t a s k ( t p u h a n d l e , t d p t r , 31 s i z e o f ( t a s k d e s c r i p t o r t ) ) ; 32 t d p t r ++; 33 }}

The Nexus library also provides the tpu_add_task_do_while. It tries to add a task to the TPU. If the in buffer is full, instead of waiting it retrieves tasks from the ready queue to execute in the mean time. Once there is a empty place in the in buffer, it adds the task and returns. In this way, the PPE is utilized optimally and performance can be improved. 212 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

Listing 6.5: The tpu add task runtime system function implementation.

1 void t p u a d d t a s k ( t p u h a n d l e t t p u , void ∗ descriptor, 2 s i z e t s i z e ){ 3 t p u s t a t u s t s t a t u s ; 4 5 assert(tpu); 6 7 do { 8 memcpy(&status , ( tpu + STATUS REGISTER), 4); 9 } while ( s t a t u s . i n < = 0 ) ; 10 11 memcpy((tpu + IN BUFFER), &descriptor , 4); 12 memcpy((tpu + IN BUFFER + 4), &size, 4); 13 }

Listing 6.6: The runtime system provides a wait function which returns once all tasks have finished. 1 i n t main ( ) { 2 3 ... 4 t p u w a i t ( t p u h a n d l e ) ; 5 return ; 6 }

Finalization

After the PPE has added all tasks to the TPU it has to wait for them to finish before continuing with the shutdown or next phase of the program. The run- time system provides a blocking wait, which continuously checks the status of the TPU and returns once it indicates that all tasks are finished. Listing 6.6 shows the utilization of this function. The Nexus runtime system also provides the functionality for the PPE to ex- ecute tasks while it is waiting for the parallel execution to finish, as shown in Listing 6.7. The function tpu_finished returns true if all tasks are fin- ished. The function tpu_try_get_task returns the task id if there was a task in the ready queue, else it returns zero. The function execute_task is provided by the source-to-source compiler. It calls the task function, indicated by the function number in the descriptor, with the operands, which are also 6.6. EVALUATION OF THE NEXUS SYSTEM 213

Listing 6.7: The PPE can also obtain tasks from the TPU to execute them.

1 i n t main ( ) { 2 3 ... 4 while ( ! t p u finished(tpu h a n d l e ) ) { 5 t a s k d e s c r i p t o r t t d ; 6 i n t t a s k i d ; 7 i f ( t p u t r y g e t t a s k ( t p u handle, &td, &task i d ) ) { 8 e x e c u t e t a s k (& t d ) ; 9 t p u f i n i s h t a s k ( t p u handle, task i d ) ; 10 } 11 } 12 return ; 13 }

indicated in the descriptor. Finally, with tpu_finish_task the task id is sent to the finish buffer of the TPU. So far the API concerned the PPE side of the code. The SPE side is depending on whether a TC is included in the system or not. For now, we assume there is no TC. Instead, runtime system functions are used to obtain and finish a task. Listing 6.8 shows an example of the SPEs main function. With the tpu_get_task function, the SPE receives a task descriptor. The function first reads from the ready queue, until it finds a task there. Next, it DMAs in the task descriptor and returns. The tpu_finish_task is equal to that on the PPE, except for its internal implementation. The SPE code never returns. Instead, the SPEs are stopped by the PPE once all tasks have finished. The function load_and_execute_task is provided by the source-to- source compiler. It reads the task descriptor, DMAs in all the input parameters, calls the proper task function, and finally DMAs back the output operands. Ex- cept for the task functions, all SPE code is generated automatically.

6.6 Evaluation of the Nexus System

In the previous section we have presented the design of the Nexus system. In this section we evaluate the implementation of Nexus for the Cell processor. First, we describe the experimental setup. Second, we present the performance and scalability analysis of the Nexus system. 214 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

Listing 6.8: The SPE main code retrieving tasks from the TPU.

1 # i n c l u d e < t p u s p u . h> 2 3 void main ( unsigned long long i d , unsigned long long argp ) { 4 mm ptr tpu = argp; 5 6 t a s k d e s c r i p t o r t t d a t t r i b u t e ((aligned (16))); 7 i n t t a s k i d ; 8 while ( 1 ) { 9 i f ( t p u g e t task(tpu, &td, &task i d ) ) { 10 l o a d a n d e x e c u t e t a s k (& t d ) ; 11 t p u f i n i s h task(tpu, task i d ) ; 12 } 13 } 14 }

6.6.1 Experimental Setup

Similar to Chapters 4 and 5, we use CellSim as evaluation platform (see also Section 4.3.3). In these previous chapters, however, we were only interested in the SPE, while in this chapter, the entire Cell processor is of importance. Therefore, the configuration of the PPE, cache, memory, and bus modules are validated, to ensure these simulator models are accurate. In contrast to previous chapters, we validated CellSim against a real Cell pro- cessor, instead of validating against SystemSim, for the following reason. The additional information provided by SystemSim was not needed for this vali- dation, while execution on the simulator is much slower than using the real processor. Although the Cell blade provides 16 SPEs, we used only 8 for this validation. CellSim can simulate 16 SPEs, but connects all of them to the same on-chip bus. As on the Cell blade these 16 SPEs are divided between two chips, the two architectures would be different. There is one architectural difference between the simulator and the real Cell processor that was not re- solved. The real processor contains both an L1 and L2 cache on the chip. In the simulator only the L1 cache is available. We account for this difference in the configuration of CellSim. We adjusted the configuration parameters to match the real processor. As Cell- Sim is not cycle-accurate, an empirical methodology was used to determine the correct parameter values. More precisely, we measured several properties 6.6. EVALUATION OF THE NEXUS SYSTEM 215

Table 6.3: CellSim configuration parameters.

L1-latency 10 mem-latency 45 dma-latency 900 bus-latency 8 bus-width 16 ppu-fetch-rate 1.4 spu-fetch-rate 1.4 on both the real Cell processor and on the simulator. The corresponding pa- rameters were adjusted until the properties measured on the simulator matched these of the real processor. For example, the latency of DMA transfers was ad- justed with the parameters ‘dma-latency’, ‘bus-latency’, ‘bus-bandwidth’, and ‘mem-latency’. The properties that we measured are the following:

• Mailbox round-trip time: The round-trip time of a mailbox signal from the PPE to the SPE and back.

• DMA latency: The time spent on issuing and waiting for DMAs.

• SPU execution time: The execution time of one task on the SPE. This does not include any DMAs, mailbox signals, or other communications.

• PPE execution time: The execution time of the serial CD benchmark running on the PPE.

• Cell execution time: The execution time of the parallel CD benchmark, when it is executed on the PPE and eight SPEs.

• Cell scalability: The scalability of the parallel CD benchmark, when it is executed on the PPE and eight SPEs. The scalability is relative to the parallel execution time on one SPE.

Most of the configuration parameters influence several properties resulting in a multi-dimensional matching problem. The most accurate configuration, pre- sented in Table 6.3, was determined by performing many measurements. The memory access latency in the simulator is 45 cycles only, while using the real processor, off-chip memory accesses take in the order of a thousand cycles. As mentioned before, the simulator does not model the L2 cache. Thus a miss in the L1 cache goes directly to main memory while in the real processor it 216 CHAPTER 6. HARDWARE TASK MANAGEMENT SUPPORT

Table 6.4: Validation of CellSim by comparing several properties of the simulator against those measured on the real Cell processor. All measures are in cycles, except scalability.

Cell CellSim Error Mailbox round-trip 1,248 1,216 -2.6% DMA latency 7,048 6,524 -7.4% SPU ex. time 61,446 61,898 0.7% PPE ex. time 559,184 546,097 -2.3% Cell ex. time 41,996,703 41,471,514 -1.3% Cell scalability 7.4 7.0 -5.5% goes to the L2 cache first. A memory access latency of 45 cycles corresponds to the average L1 miss penalty of the real processor. The memory access la- tency parameter, however, also affects the memory access time of DMAs. This was compensated by increasing the dma-latency parameter (an MFC module parameter) to 900 cycles, whereas the default value is 50 cycles. Table 6.4 compares the properties of CellSim to that of the real Cell processor. Most properties are modeled within 3% accuracy. The DMA latency, however, has an error of 7.4%. This error can be lowered by increasing the dma-latency parameter. This would, however, negatively affect the scalability, which is 5.5% off in the current configuration. The current trade-off seems reasonable. Overall, the simulator accuracy is fairly well and sufficient for its purpose.

The TPU is implemented as one module in CellSim. Although the size of the tables and the storage is still subject to research, we use the following values in the experiments. The task table and the task storage have 1024 entries. This is the minimum size needed for the CD benchmark to detect 16 independent tasks. It is much smaller, though, than the total number of tasks (4096) in the benchmarks. Thus, the process of adding tasks is stalled once the task table and task storage are full. The producers and consumers tables each have 4096 entries. This is sufficient to contain all inputs and outputs of the 1024 tasks in the task table. A significant number of hash collisions, however, occur. A coalesced hash algorithm [164] is used to resolve these collisions. It allows the tables to be filled efficiently to a high density, while maintaining low-latency table lookups. The latencies of the TPU are determined as follows. The TPU functionality consists mainly of accessing the tables and the storage. In the real processor, accessing the local store, which is 256 kB large, takes six cycles. Thus, it is reasonable to assume that the task storage, which is 64 kB large, can be accessed in two cycles. Similarly, the three tables, which are smaller than the task storage, were assigned a latency of one cycle. The producers and consumers tables are accessed several times per lookup, due to hash collisions. Measurements showed that on average one lookup takes six cycles. Per task several lookups are required, depending on the number of parameters. The latencies of the handlers are mainly based on the latencies of the tables and storages. Cycles for control logic and arithmetic operations are added to the handler latencies. The total latency of the handlers, for each operation they perform, depends on the actual input and the state of the tables. In the next section, which presents the performance evaluation, these latencies are analyzed for the benchmark.
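The coalesced hashing mentioned above can be illustrated with the following minimal sketch: colliding entries are chained through otherwise free slots of the same table, so the table can be filled to a high density while a lookup only follows a short chain. The table size, field names, and hash function are illustrative only and do not reflect the actual TPU implementation.

```c
#include <stdint.h>

#define TABLE_SIZE 4096
#define EMPTY      0   /* address 0 is treated as "no entry" in this sketch */

typedef struct {
    uint64_t addr;     /* start address of a data block (the key)           */
    int32_t  task_id;  /* producing or consuming task (the value)           */
    int32_t  next;     /* index of the next entry in the chain, or -1       */
} entry_t;

static entry_t table[TABLE_SIZE];          /* zero-initialized: all EMPTY   */
static int     free_ptr = TABLE_SIZE - 1;  /* search start for free slots   */

static int hash(uint64_t addr) { return (int)(addr % TABLE_SIZE); }

/* Returns the slot holding addr, or -1 if it is not present. */
int lookup(uint64_t addr)
{
    int i = hash(addr);
    if (table[i].addr == EMPTY)     /* empty home slot: key cannot be present */
        return -1;
    for (; i != -1; i = table[i].next)
        if (table[i].addr == addr)
            return i;
    return -1;
}

/* Inserts addr -> task_id (duplicates are not checked for brevity);
 * returns the slot used, or -1 if the table is full. */
int insert(uint64_t addr, int32_t task_id)
{
    int i = hash(addr);
    if (table[i].addr == EMPTY) {                       /* home slot is free */
        table[i] = (entry_t){ addr, task_id, -1 };
        return i;
    }
    while (table[i].next != -1)                         /* walk to chain end */
        i = table[i].next;
    while (free_ptr >= 0 && table[free_ptr].addr != EMPTY)
        free_ptr--;                                     /* claim a free slot */
    if (free_ptr < 0)
        return -1;
    table[free_ptr] = (entry_t){ addr, task_id, -1 };
    table[i].next   = free_ptr;                         /* coalesce chains   */
    return free_ptr;
}
```

With 4096 entries for the operands of at most 1024 tasks, the chains remain short on average, which is consistent with the six cycles per lookup measured in the simulator.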

Table 6.5: Execution time of the CD benchmark, with a task size of 19µs, for the three systems. Three configurations of the Nexus runtime system are analyzed.

Nexus configuration    Execution time (ms)                 Speedup of StarSS + Nexus
                       StarSS + Nexus   StarSS   Manual    vs StarSS   vs Manual
8 SPEs                 11.1             22.3     13.0      2.02        1.17
8 SPEs - PPE           10.8             22.3     13.0      2.06        1.20
8 SPEs - PPE+          10.4             22.3     13.0      2.14        1.24

6.6.2 Performance Evaluation

In this section we analyze the performance and the scalability of the Nexus system. We present results for the CD benchmark; the analysis for the other two benchmarks has yet to be performed. When comparing to the StarSS system that uses software task management, we refer to that system as 'StarSS', while the system using the Nexus hardware task management is referred to as 'StarSS + Nexus'. The execution time of the CD benchmark, with a task size of 19µs, is shown in Table 6.5. Three configurations of the Nexus runtime system are analyzed. The first (8 SPEs) uses only the eight SPEs to execute tasks. In the second configuration (8 SPEs - PPE), the PPE also executes tasks, but only after it has added all tasks to the TPU. Finally, in the third configuration (8 SPEs - PPE+), the PPE also executes tasks while the addition of tasks to the TPU is stalled because the in-buffer is full.

The table compares the StarSS + Nexus system to the StarSS and the Manual systems. The execution of the benchmark using the StarSS system is measured on the Cell blade, as the code generated by the StarSS compiler cannot be executed on the simulator. Thus, for this particular analysis we used the hardware configuration for which the simulator was validated, i.e., we used eight SPEs. The table shows that using the Nexus hardware instead of software task management, a speedup of more than 2x is obtained when using eight SPEs. As shown in Figure 6.8, for this task size and number of cores, the scalability of the StarSS system is 5.2. Section 6.3 showed that task management overhead limits the scalability and performance. As the performance has been improved by a factor of over 2x, the Nexus system has successfully removed the performance bottleneck for this case. More details on the parallel execution of the benchmark are provided in Figure 6.22, which shows a part of the Paraver trace. The legend for this trace is depicted in Figure 6.21. The x-scale of the trace is in cycles, although the figure states ns; this is due to a mismatch in the units used by the simulator and the visualization tool. The trace shows that the SPEs are almost continuously executing tasks. The time it takes for an SPE to obtain a new task is extremely short when compared to the trace of the StarSS system (Figure 6.7). It takes 985 cycles, which is 0.3µs, for the SPEs to read from the ready queue of the TPU.

Figure 6.21: Legend of the StarSS + Nexus Paraver traces.

Figure 6.22: Partial Paraver trace of the CD benchmark with 8 SPEs, a task size of 19µs, and using the Nexus system.

Table 6.6: The task throughput of the PPE and the TPU.

Unit                                         Cycles per task   Throughput
PPE (preparation and submission of tasks)    1116              2.9 tasks/µs
Descriptor loader                            6                 533 tasks/µs
Descriptor handler                           81.8              39.1 tasks/µs
Finish handler                               45.1              71.0 tasks/µs

Furthermore, the trace shows that the PPE needs very little time to prepare and add tasks to the TPU, when compared to the StarSS traces. On average, the PPE spends 1116 cycles on preparing and submitting one task. The trace also shows that the PPE executes a task when the in-buffer is full. After finishing that task, it continues adding new tasks to the TPU. The in-buffer runs full because the task table is full, and not because the PPE generates tasks faster than the TPU can process them, as we show next. Table 6.6 shows the throughput of the PPE and of the TPU units. The first was measured from the trace, while the others were extracted from the simulator. The PPE has the lowest throughput with 2.9 tasks/µs. The descriptor loader only has to issue a DMA, and thus has a very high throughput. The descriptor handler needs, on average, 81.8 cycles to process a task descriptor and has the lowest throughput of the TPU units, as it performs the most complex operation of all the units. The finish handler needs, on average, 45.1 cycles per task id it processes. These numbers show that the addition of tasks is slower than the finishing of tasks. Furthermore, the throughput of the addition process is limited by the PPE. The memory bandwidth utilization has not been considered so far, but it does not limit the throughput. Assume the PPE writes one task descriptor, of 64 B, to main memory every 1116 cycles, and assume the TPU reads them at the same pace. Then the total bandwidth utilization is 2 × 64 B × 3200 MHz / 1116 cycles = 367 MB/s, while the available bandwidth to main memory is 25.6 GB/s. Thus, only 1.4% of the bandwidth is taken by the process of adding tasks. We conclude that the throughput of the Nexus system is limited by the PPE preparing and submitting tasks, and is at most 2.9 tasks/µs.
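The bandwidth figure quoted above can be reproduced with the following small calculation, a worked-out version of the numbers in the text (3.2 GHz clock, 64-byte descriptors, 1116 cycles per submitted task, 25.6 GB/s memory bandwidth):

```c
#include <stdio.h>

int main(void)
{
    const double clock_hz        = 3.2e9;    /* Cell clock frequency          */
    const double cycles_per_task = 1116.0;   /* PPE: prepare + submit a task  */
    const double descriptor_b    = 64.0;     /* task descriptor size in bytes */
    const double mem_bw_bps      = 25.6e9;   /* available memory bandwidth    */

    /* one descriptor write by the PPE plus one read by the TPU per task */
    double used_bps = 2.0 * descriptor_b * clock_hz / cycles_per_task;

    printf("descriptor traffic: %.0f MB/s, %.1f%% of the available bandwidth\n",
           used_bps / 1e6, 100.0 * used_bps / mem_bw_bps);
    /* prints: descriptor traffic: 367 MB/s, 1.4% of the available bandwidth */
    return 0;
}
```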

The throughput of the Nexus system determines the maximum scalability that can be obtained. In other words, it determines how many cores can be used efficiently within the system. This scalability, however, also depends on the task size. Figure 6.23 shows the resulting limit on the obtainable scalability for several task sizes. For example, for tasks of 20µs, the scalability limit is 20 × 2.9 = 58. The figure shows that the scalability limit increases with the task size.

Figure 6.23: The limit to the obtainable scalability using the Nexus system, based on the maximum throughput (scalability limit versus task size).

The TPU latencies are modeled as accurately as possible. As no real hardware has been developed, however, they might be incorrect to some extent. Therefore, we analyze the impact of increased TPU latencies on the performance. We use the CD benchmark with a task size of 19µs, and only the SPEs execute tasks. The result is shown in Figure 6.24. The performance is normalized to that obtained with the default latencies. The x-axis shows the normalized TPU latency; a value of 10 indicates that all latencies of the TPU are multiplied by a factor of 10. The figure shows that the performance is hardly affected up to a normalized latency of about 100. This is explained as follows. Eight SPEs are used and the task size is approximately 20µs, so the execution rate is 0.4 tasks/µs. Within the TPU, the descriptor handler has the lowest throughput, 39.1 tasks/µs, which is directly related to the TPU latencies. When the latencies are increased by a factor of 100, this throughput decreases by the same factor and becomes equal to the execution rate. When the latencies increase further, the throughput of the descriptor handler limits the throughput of the entire system and the performance decreases, as shown in the figure.

Figure 6.24: Impact of the TPU latency on the performance for a task size of 20µs and using eight SPEs (normalized performance versus normalized TPU latency).

The point in Figure 6.24 at which the performance starts to drop therefore depends on the task size and the number of SPEs used. The minimum value, however, is 13.5, because the performance is then limited by the throughput of the PPE, which is 13.5 times smaller than that of the descriptor handler. From this experiment we conclude that the claimed performance improvements, which can be obtained with the Nexus system, are reliable. So far we have considered the single case of the CD benchmark with a task size of 19µs and using 8 SPEs. We have shown that the Nexus hardware provides large performance improvements and that the throughput of the Nexus system potentially allows scaling to a large number of cores. Figure 6.25 shows the measured scalability of the StarSS + Nexus system, for several task sizes and up to 16 SPEs. The maximum scalability obtained is 14.3, which is very close to the theoretical scalability limit of 14.5 for the benchmark. The figure shows that for task sizes of 11.2µs and larger, the scalability is almost perfect. Therefore, using the Nexus system, large speedups can be obtained compared to StarSS without hardware task management support. For example, for a task size of 11.2µs and using 16 SPEs, the StarSS + Nexus system is 5.3 times faster than the StarSS system (measured on real hardware). If 32 SPEs were used and the application exhibited sufficient parallelism, the speedup would approximately double, as for that task size the StarSS system does not scale beyond 4 cores, while the StarSS + Nexus system theoretically scales up to approximately 30 cores.

Figure 6.25: Scalability of the StarSS + Nexus system with the CD benchmark: (a) scalability versus the number of SPEs for task sizes of 66.7µs, 19.1µs, 11.2µs, and 1.92µs; (b) scalability versus task size for 1, 2, 4, 8, and 16 SPEs.

For a task size of 1.92µs and using 16 SPEs, the StarSS + Nexus system is 7.4 times faster than the StarSS system (measured on real hardware), although both systems have a very limited scalability of 5.7 and 1.9, respectively. The scalability of the StarSS + Nexus system is compared to that of the StarSS and the Manual systems in Figure 6.26 and Table 6.7. They show the iso-efficiency function of the three systems, for a scalability efficiency of 80%. The StarSS + Nexus system obtains 80% efficiency for much smaller task sizes than the StarSS system. The scalability of StarSS was limited by the task management overhead, which we successfully accelerated with the Nexus hardware. The task size required to obtain 80% efficiency for the StarSS + Nexus system is even lower than that required for the Manual system. The scalability of the Manual system was limited by the polling mechanism used to synchronize, which scaled very poorly. The difference between the line labeled StarSS + Nexus in Figure 6.26 and the line labeled Manual shows the benefits of fast hardware synchronization primitives; the Manual line is much steeper, indicating the scalability difference. To conclude this evaluation, we compare the measurements to the requirements formulated in Section 6.4.2. The first requirement states that the task dependency resolution process must be accelerated by factors of 3.6 and 7.3 when using 8 and 16 SPEs, respectively. In the StarSS system, the dependency resolution process takes 9.1µs. The most time-consuming part of the TPU is the descriptor handler, which takes on average 81.81 cycles, i.e., 0.026µs, per task. Therefore, the dependency resolution process is accelerated by a factor of 356.

Figure 6.26: The iso-efficiency lines of the three systems (StarSS, Manual, and StarSS + Nexus) for the CD benchmark (80% efficiency); task size versus number of SPEs.

Table 6.7: The iso-efficiency values of the three systems for the CD benchmark (80% efficiency).

                 Task size (µs)
no. SPEs    StarSS    Manual    StarSS + Nexus
2           6.3       –         –
4           10.8      –         –
8           46.9      8.1       5.1
16          134.4     50.1      10.0

The second requirement is that the scheduling, preparation, and submission of tasks to the SPEs should be accelerated by factors of 2.0 and 3.9 when using 8 and 16 SPEs, respectively. These processes, which can be viewed as the back-end of the task pool, were executed by the helper thread of the StarSS software task management system. Using the Nexus hardware, this back-end process of retrieving tasks from the task pool and bringing them to the SPEs for execution is performed differently. Scheduling is currently performed by simply writing to a FIFO queue. Preparation of tasks is not needed anymore, as all the necessary information is already in the task storage. Submission of tasks to SPEs is replaced by the SPEs reading tasks from the ready queue themselves. The closest counterpart of the back-end of the StarSS software task management system is therefore the finish handler. It takes 0.014µs per task, which is 390 times faster than the back-end process of the StarSS system. Finally, the third requirement is that the hardware should provide synchronization primitives for retrieving tasks, with a latency of at most 1µs, irrespective of the number of cores. As mentioned before, it takes 985 cycles (0.3µs) for an SPE to read an item from the ready queue. This latency is independent of the number of cores used, neglecting contention on the bus and the queue; in our experiments, however, we have not detected any contention effects. Another effect not considered here is the increasing bus (or NoC) latency for an increasing size of the on-chip system. All three requirements are met by the Nexus system. Therefore, by using the Nexus system, the StarSS programming model can be used efficiently for much smaller tasks and more cores. Provided the application exhibits sufficient parallelism, task sizes as small as 10µs can be used efficiently on a multicore containing approximately 30 cores.
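The arithmetic behind the first and third requirement checks can be spelled out as follows, using only the numbers given above (a 3.2 GHz clock, 9.1µs for software dependency resolution, 81.81 cycles per task in the descriptor handler, and 985 cycles to read the ready queue):

```c
#include <stdio.h>

int main(void)
{
    const double clock_mhz = 3200.0;               /* cycles per microsecond  */

    double hw_dep_us = 81.81 / clock_mhz;          /* descriptor handler/task */
    double sw_dep_us = 9.1;                        /* StarSS dep. resolution  */
    double queue_us  = 985.0 / clock_mhz;          /* SPE ready-queue read    */

    printf("hardware dependency resolution: %.3f us per task\n", hw_dep_us);
    printf("acceleration factor: %.0fx (required: 3.6x / 7.3x)\n",
           sw_dep_us / hw_dep_us);
    printf("ready-queue read latency: %.2f us (required: at most 1 us)\n",
           queue_us);
    return 0;
}
```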

6.7 Future Work

The research described in this chapter is work in progress at the time of writing. In this section we describe the future work, consisting of features that have yet to be implemented, issues that have to be resolved, analyses that have to be performed, and bottlenecks that have to be removed.

6.7.1 Dependency Resolution

The TPU resolves dependencies by means of three tables. This system, however, is based on the start address of a piece of data. This puts certain restrictions on the application, which are not always met. As long as data is divided into fixed elementary blocks, the system works well. For example, if a matrix is divided into blocks of 16 × 16 elements, the address of the top-left element is all that is needed to refer to the entire block. This assumes the system is aware of the block width, height, and stride. In the benchmark programs this is the case. In some applications or kernels, such as motion compensation, the data blocks have an unknown alignment. Thus, blocks can partially overlap, as depicted on the left side of Figure 6.27. Computing the overlap is very laborious and easily annihilates the gains of task-level parallelism. Imagine that the input operand of a task has to be compared against a thousand output operands of blocks inside the TPU.

Figure 6.27: Calculating task dependencies through overlapping memory ranges (left) is laborious. If all blocks are translated into several blocks that are aligned with the grid (right), only the addresses of the top-left corners have to be checked.

There is no clear solution to this issue as of yet, but we are investigating a few directions. The first solution direction is to map, at runtime, blocks that are not aligned with a grid onto several aligned blocks, and to maintain the current hardware dependency resolution scheme. The source-to-source compiler can extract certain information from the annotated code and pass it on to the runtime system. The runtime system checks the operands for their alignment by computing the start addresses of the blocks each operand overlaps and adds those to the task descriptor. If the data is maintained within matrices, this computation can be performed quickly by checking the matrix indices. Take for example the right-hand side of Figure 6.27. All blocks are 16 × 16 elements large; assume the matrix is named frame. In this case the block frame[24:39][24:39] is accessed, which overlaps four aligned blocks. The system knows the block size, and thus the alignment can be checked by computing the x and y index modulo the block width and height, respectively. If the modulus is not zero, it computes the start addresses of the blocks the operand overlaps, i.e., frame[16:31][16:31], frame[32:47][16:31], frame[16:31][32:47], and frame[32:47][32:47]. These computations are rather simple and have to be performed only once per misaligned block. The TPU can then check the dependencies among tasks using the lookup mechanism described before. This method, however, stresses the runtime system, and its impact on the performance and scalability has to be investigated.
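A minimal sketch of the index computation described above, assuming square 16 × 16 blocks and using the frame[y0:y1][x0:x1] notation of the example; the function simply enumerates the aligned grid blocks that a misaligned block overlaps:

```c
#include <stdio.h>

#define BLOCK 16   /* width and height of the elementary blocks */

/* Prints the start coordinates of every aligned block overlapped by the
 * BLOCK x BLOCK region whose top-left element is (y, x). */
void aligned_blocks(int y, int x)
{
    int y0 = y - (y % BLOCK);          /* snap the start down to the grid    */
    int x0 = x - (x % BLOCK);
    int y1 = y + BLOCK - 1;            /* last row/column that is covered    */
    int x1 = x + BLOCK - 1;

    for (int by = y0; by <= y1; by += BLOCK)
        for (int bx = x0; bx <= x1; bx += BLOCK)
            printf("frame[%d:%d][%d:%d]\n",
                   by, by + BLOCK - 1, bx, bx + BLOCK - 1);
}

int main(void)
{
    /* the example from the text: frame[24:39][24:39] overlaps four blocks */
    aligned_blocks(24, 24);
    return 0;
}
```

Only a few integer operations per operand are needed, which is why the computation itself is cheap; the concern raised above is the extra pressure the additional block addresses put on the runtime system and the TPU tables.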

A second solution direction is to accelerate the overlap detection in hardware. Such a solution is not as tied to a particular runtime system and programming model, and thus can be more generally applicable. We are collaborating with Yoav Etsion and Alex Ramirez from the Barcelona Supercomputing Center (BSC) to investigate such hardware.

6.7.2 Task Controller

In the current implementation, the SPE has to read a task from the TPU, load the descriptor, issue DMAs to fetch the input operands, execute the task, issue DMAs to write back the output operands, and finally signal the TPU. Performance can be improved by exploiting concurrency on a single SPE. Possibly, a software double-buffering strategy can be deployed. A hardware unit, however, could increase concurrency even more. Therefore, we will investigate the benefits of adding a Task Controller (TC) to each SPE. Exploiting double buffering allows hiding the latency of accesses to main memory. This is done through the MFC, which works autonomously and concurrently with the SPU. Similarly, the TC could perform local task management in an autonomous and concurrent way. For example, while the SPU is executing a task, the TC could read a new task id from the ready queue of the TPU, load the task descriptor, and program the MFC to load the input operands into the local store. Similarly, the TC could issue DMAs to store the output of a task back to main memory and write the task id to the finish buffer. All information needed to perform these operations is contained within the task descriptor. To apply such a double-buffering scheme, twice as many tasks should be available. For applications with limited available parallelism, the double-buffering scheme might provide little performance improvement. For such situations, the Nexus system can be extended with task scheduling that exploits data locality. For example, a tail-submit strategy (see Section 3.7.2) can be deployed. We will investigate how the TC and TPU have to be modified to support such task scheduling.
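The double-buffering idea can be sketched as follows. This is only a control-flow outline: the helper functions (reading the ready queue, loading a descriptor, the DMA wrappers, and signalling the finish buffer) are hypothetical placeholders for the operations described above, not an existing Nexus or libspe API.

```c
typedef struct {
    int id;
    /* operand addresses and sizes from the task descriptor would go here */
} task_t;

/* hypothetical wrappers around the ready queue, finish buffer, and MFC */
extern int  read_ready_queue(void);                  /* blocks for a task id */
extern void load_descriptor(int id, task_t *t);
extern void fetch_inputs(const task_t *t, int buf);  /* async DMA into LS    */
extern void wait_dma(int buf);
extern void execute(const task_t *t, int buf);
extern void write_outputs(const task_t *t, int buf); /* async DMA to memory  */
extern void signal_finish(int id);

void spe_worker(void)
{
    task_t task[2];
    int cur = 0;

    /* prime the pipeline with the first task */
    load_descriptor(read_ready_queue(), &task[cur]);
    fetch_inputs(&task[cur], cur);

    for (;;) {
        int nxt = cur ^ 1;
        /* start obtaining the next task while the current inputs are in flight */
        load_descriptor(read_ready_queue(), &task[nxt]);
        fetch_inputs(&task[nxt], nxt);

        wait_dma(cur);                    /* current task's inputs are in LS  */
        execute(&task[cur], cur);
        write_outputs(&task[cur], cur);   /* a real implementation must also  */
        signal_finish(task[cur].id);      /* fence this output DMA before the */
                                          /* buffer is reused later           */
        cur = nxt;
    }
}
```

With a TC, the same overlap would be obtained without occupying the SPU: the TC would perform the queue read, descriptor load, and MFC programming autonomously, as described above.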

6.7.3 Other Features

Currently, the Nexus system uses a fixed-size task descriptor. Although this eases the implementation of the TPU, a considerable amount of task storage memory is wasted. In future work we intend to investigate the trade-offs of using variable-sized task descriptors instead.

To scale the Nexus system to many-core architectures, a clustered approach can be used. We will investigate how to best implement such clustering. A TPU-to-TPU interface has to be designed, allowing efficient task stealing. Application scheduling is typically managed by the OS; we will investigate an OS interface that allows such management. StarSS is one of several task-based programming models currently being developed. We will investigate the compatibility of the Nexus system with other programming models.

6.7.4 Evaluation of Nexus

An important evaluation that has yet to be performed concerns the area and power consumption of the Nexus system. For both the TPU and the TC, the costs have to be compared against the benefits. The TPU consists mainly of memory, and might consume a significant amount of area and power. On the other hand, if additional cores cannot be deployed efficiently because of limited application scalability, the area and power are better spent on the Nexus system. A related question is what the proper sizes of the TPU storage and tables are. These depend on the amount of available parallelism in the applications, the dependency patterns, and the number of cores in the system. We will investigate the relation between these sizes, the performance, and the area and power consumption. Finally, we will evaluate the benefits of the Nexus system for several other benchmarks, taken from several domains and with varying task sizes.

6.8 Conclusions

The task-based StarSS programming model is a promising approach to simplify parallel programming. In this chapter we evaluated the StarSS runtime system. It was shown that for fine-grained task parallelism, the task management overhead significantly limits the scalability. Specifically, for a task size of approximately 20µs, as used in task-based parallel H.264 decoding, the scalability is limited to five cores. In order to efficiently utilize 8 or 16 cores for such a task size, the runtime system should be accelerated by at least factors of 3.6 and 7.3, respectively. In particular, the task dependency resolution process is too laborious to perform in software. Furthermore, the scheduling, preparation, and submission of tasks to cores take too much time to supply more than four cores continuously with tasks. Finally, we observed that the lack of fast synchronization primitives causes many unnecessary stalls. We proposed the Nexus system for hardware task management support. It performs dependency resolution in hardware using simple table lookups. Hardware queues are used for fast scheduling and synchronization. The Nexus system was evaluated by means of simulation. It was shown that the dependency resolution process is accelerated by a factor of 356. Furthermore, the bottleneck of scheduling, preparation, and submission of tasks to the SPEs was removed. Scheduling is performed by writing to a hardware FIFO queue, preparation of tasks is not needed anymore, and the submission of tasks was replaced with a distributed mechanism, in which SPEs read tasks from the queue themselves. This mechanism using the hardware queue also provides the fast synchronization primitive that was necessary. Simulations showed that when using 16 SPEs, near-optimal scalability is obtained for the CD benchmark for task sizes of approximately 10µs and larger. For such a task size, the scalability limit, based on the task throughput, is 29 and increases with the task size. The Nexus system is currently still work in progress. The optimal parameters of the Nexus hardware, such as the sizes of the tables, the queues, and the storage, have yet to be determined. Also, additional features, such as dependency detection of potentially overlapping memory regions, will be developed. Furthermore, the area and power consumption of the Nexus hardware will be analyzed. Finally, several other benchmarks will be used to evaluate the performance and scalability of the Nexus system.

Chapter 7

Conclusions

In this dissertation, we have addressed several challenges in the multicore area. The overall aim of this work was to improve the scalability of multicore systems, with a focus on H.264 video decoding. More specifically, the following four objectives were targeted. First, to analyze the H.264 decoding application, to find new ways to parallelize the code in a scalable fashion, and to analyze the parallelization strategies and determine bottlenecks in current parallel systems. Second, to specialize the cores of the Cell processor for H.264 in order to obtain performance and power benefits. Third, to design hardware mechanisms to reduce the overhead of managing the parallel execution. Finally, to enhance efforts in improving programmability by providing hardware support. In this final chapter, we first summarize the research presented in this thesis, show how the objectives have been met, and specify the main contributions of this work. Second, we discuss possible directions for future research.

7.1 Summary and Contributions

This dissertation started in Chapter 2 with an analysis of the power wall problem for multicores. Specifically, we investigated the limits to the performance growth of multicores due to technology improvements with respect to power constraints. As a case study we modeled a multicore consisting of Alpha 21264 cores, scaled to future technology nodes according to the ITRS roadmap. The analysis showed that the power budget significantly limits the performance growth. Despite this, the power-constrained performance can double every three years due to technology improvements. When taking Amdahl's Law into account, a serial fraction of 1% decreases the power-constrained performance by up to a factor of 10, assuming a symmetric multicore. For an asymmetric multicore, where the serial code is executed on a core that runs at a higher clock frequency, 1% of serial code reduces the achievable speedup by a factor of up to 2. This confirms that serial performance remains a challenge. On the other hand, there is Gustafson's Law, which assumes that the input size grows with the number of processors. Using Gustafson's Law, for a serial fraction of 10% the performance loss is at most 11% (both laws are restated after the contribution list below). Whether Amdahl's or Gustafson's Law is applicable to a certain application domain remains to be determined. The contributions made in Chapter 2 can be summarized as follows:

• It was shown, based on the ITRS roadmap predictions, that technology improvements allow the performance of multicores to double every three years, despite being limited by the power budget.

• Our model confirms that, in case Amdahl's Law is applicable, the power-constrained multicore performance can be improved significantly by using an asymmetric architecture.
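For reference, a restatement of the two laws behind these numbers, with f the serial fraction and N the number of cores; the limit on the right shows where the 11% figure for Gustafson's Law comes from.

```latex
S_{\mathrm{Amdahl}}(N) = \frac{1}{f + \frac{1-f}{N}}, \qquad
S_{\mathrm{Gustafson}}(N) = f + (1-f)\,N, \qquad
\frac{N - S_{\mathrm{Gustafson}}(N)}{S_{\mathrm{Gustafson}}(N)}
  = \frac{f\,(N-1)}{f + (1-f)\,N}
  \;\xrightarrow{\;N\to\infty\;}\; \frac{f}{1-f}
```

For f = 0.1 the last expression is about 11%, matching the loss quoted above for Gustafson's Law; under Amdahl's Law the attainable speedup is bounded by 1/f, which is why even 1% of serial code matters.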

The first objective of this dissertation is met in Chapter 3, which investigates the parallelization of H.264 video decoding. Most promising is macroblock-level parallelism, of which it has been proven before that it can be exploited inside a frame in a diagonal wavefront manner. We have observed that macroblocks (MBs) in different frames are only dependent through motion vectors. Therefore, we have proposed a novel parallelization strategy, called the Dynamic 3D-Wave, which combines intra-frame and inter-frame MB-level parallelism. Using this new parallelization strategy we have performed a limit study of the amount of MB-level parallelism available in H.264. We modified the FFmpeg H.264 decoder and analyzed real movies. The results have revealed that motion vectors are typically small on average. As a result, a large amount of parallelism is available. For FHD resolution the total number of parallel MBs ranges from 4000 to 7000. Finally, we have discussed a few implementations of the 2D-Wave and 3D-Wave strategies. Using the 3D-Wave parallelization strategy, a scalability of 55 was obtained on a 64-core system, whereas the scalability of the 2D-Wave saturates around 15. Experience with these implementations has shown that low-latency synchronization is essential for fine-grained parallelism as exploited in these macroblock-level parallelizations of H.264. The contributions made in Chapter 3 can be summarized as follows:

• A novel parallelization strategy for H.264 has been proposed, which is more scalable than any previous parallelization.

• We have performed a limit study of the amount of macroblock-level parallelism available in H.264, which has shown that for FHD resolution the total number of parallel macroblocks ranges from 4000 to 7000.

• Analysis of parallel implementations of H.264 has shown that low-latency synchronization is essential for efficient fine-grained parallelization.

• Although H.264 was assumed to be hard to parallelize, we were able to create a scalable implementation. This suggests that many other applications can be parallelized in a scalable fashion.

A specialization of the Cell SPE architecture was presented in Chapter 4, with which the second objective was met. We enhanced the architecture with twelve instruction classes, most of which have not been proposed before. These instruction classes can be divided into five groups. First, we added instructions to enable efficient scalar-vector operations, because the SPE has a SIMD-only architecture. Second, we added several instructions that perform saturation and packing as needed by the H.264 kernels. Third, we added the swapoe instructions to perform fast matrix transposition, which is needed by many media kernels. Fourth, we added specialized arithmetic instructions that collapse frequently used simple arithmetic operations. Finally, we added support for non-quadword-aligned access to the local store, which is especially needed for motion compensation. The speedup achieved on the H.264 kernels is between 1.84x and 2.37x, while the dynamic instruction count is reduced by a factor between 1.75x and 2.67x. The contributions made in Chapter 4 can be summarized as follows:

• We have identified the bottlenecks in the execution of the H.264 kernels on the Cell SPE architecture.

• We have proposed novel application-specific instructions for H.264 video decoding, of which several are applicable to other media applications as well.

• We have shown that a speedup of approximately 2x can be obtained on the H.264 kernels by adding about a dozen application-specific instructions.

One of the proposed instruction classes performs intra-vector operations. By using these intra-vector instructions, the matrix transpositions generally required in two-dimensional kernels were avoided. Chapter 5 investigated the benefits of such intra-vector SIMD instructions for other kernels. For six kernels intra-vector instructions have been developed, and simulations have shown that a speedup of up to 2.06 was achieved. We have discussed the results and have shown how the characteristics of the kernels affect the applicability of intra-vector instructions. The contributions made in Chapter 5 can be summarized as follows:

• We have proposed intra-vector SIMD instructions to avoid matrix transposition in two-dimensional kernels.

• We have shown that intra-vector SIMD instructions can speed up kernels by a factor of up to 2.06.

The last two objectives are met in Chapter 6. In this chapter we have first evaluated the runtime system of the task-based StarSS programming model. It was shown that for fine-grained task parallelism, the task management overhead significantly limits the scalability of StarSS applications. Specifically, for a task size of approximately 20µs, as used in task-based parallel H.264 decoding, the scalability is limited to five cores. We have contributed by proposing the Nexus system for hardware task management support. It performs dependency resolution in hardware using simple table lookups. Hardware queues are used for fast scheduling and synchronization. The Nexus system was evaluated by means of simulation. It was shown that the dependency resolution process is accelerated by a factor of 356. As a result, the StarSS programming model can be used efficiently for much smaller task sizes than was possible before. Simulations have shown that when using 16 SPEs, near-optimal scalability is obtained for task sizes of approximately 10µs and larger. For such a task size, the scalability limit, based on the task throughput, is 29 and increases with the task size. The contributions made in Chapter 6 can be summarized as follows:

• The StarSS runtime system was analyzed, and it was shown that the task management overhead is too large to efficiently exploit fine-grained parallelism for more than five cores.

• We have proposed the Nexus system for hardware task management sup- port.

• We have proposed a novel dependency resolution scheme, based on low latency table lookups.

• We have shown that the Nexus system can greatly reduce the task management overhead, as a result of which the StarSS programming model can be used efficiently for much smaller task sizes (down to 10µs).

7.2 Possible Directions for Future Work

In Chapter 3, the limit study has shown that thousands of parallel macroblocks are available when using the 3D-Wave parallelization strategy. The implementation based on the TriMedia multicore showed a speedup of 55 using 64 cores. The scalability curve, however, indicates that this implementation scales to higher core counts. Future work could investigate the scalability limit of the 3D-Wave implementation. Furthermore, it would be interesting to study the performance of the 3D-Wave on different architectures. Especially the impact of the memory hierarchy would be an interesting research topic. Such research, however, requires a large-scale system. Potentially, such a system could be based on Tilera's Tile-GX processor, or on a second-generation Cell processor. Finally, the integration of slice-level parallel entropy decoding into a 3D-Wave implementation could be analyzed.

The work presented in Chapter 4 could be extended with a thorough analysis of the power consumption. It was shown that the specialization of the Cell SPEs improves performance and decreases the dynamic instruction count for the H.264 kernels. This indicates that the power efficiency has been improved, although it has not been quantified. To perform such a power analysis, tools and/or power models are needed. Such tools and models exist, but they require a lower-level implementation than we provide. With modern large-scale systems on chip containing billions of transistors, it should be possible to estimate the power consumption in an early design stage, where a low-level implementation is not yet available. Therefore, new power models and tools are needed. Such power tools could also be used to evaluate the power consumption of the hardware proposed in Chapters 5 and 6.

It was suggested in Chapter 5 that the proposed intra-vector instructions can be implemented by mapping them to a special functional unit. Such a functional unit could consist of a series of ALUs and an interconnect, as has been used in [158, 172]. Future work can include the design of such a functional unit. It can also be investigated whether the proposed intra-vector instructions can be mapped to extensible processors, such as the ARC700 [2], Tensilica's XTensa [154], or Altera's Nios II [18].

The Nexus hardware support for task management, proposed in Chapter 6, is ongoing work at the time of writing and many issues remain to be investigated. The most interesting are dependency detection of overlapping memory regions, the implementation of the task controller (including task scheduling), analysis of the area and power consumption, evaluation with several other benchmarks, and investigation of the applicability of the Nexus hardware to other emerging programming models.

Throughout Chapters 4 to 6, we used the CellSim simulator. The simulator is not very fast, but for the benchmarks that were used this was not a large problem; simulations took at most 10 hours. One of the limitations that we encountered was that the simulator cannot execute normal Cell binaries. As the simulator does not run an OS, the code has to be recompiled with special libraries. For applications such as FFmpeg this is not feasible within reasonable time, and individual kernels or synthetic benchmarks had to be used instead. For future multicore research, the current development of multicore simulators is of great importance.

Bibliography

[1] AMD Phenom Processors. http://www.amd.com/uk/products/desktop/ processors/phenom/Pages/AMD-phenom-processor-X4-X3-at-home. aspx.

[2] ARC 700 Core Family. http://www.arc.com/configurablecores/arc700/.

[3] ARM Cortex-A9 MPCore. http://www.arm.com/products/CPUs/ ARMCortex-A9 MPCore.html.

[4] Cell Superscalar. http://www.bsc.es/plantillaG.php?cat id=179.

[5] CellBench. https://gso.ac.upc.edu/projects/cellbench/.

[6] CellBuzz: Georgia Tech Cell BE Software. http://sourceforge.net/ projects/cellbuzz.

[7] CellSim: Modular Simulator for Heterogeneous Multiprocessor Archi- tectures. http://pcsostres.ac.upc.edu/cellsim/doku.php/.

[8] CEVA-XC Communication Processor. www.ceva-dsp.com/products/ cores/ceva-xc.php.

[9] DaVinci Digital Media Processors - Texas Instruments. http://www.ti. com/.

[10] Intel CE 2110 Media Processor. http://www.intel.com/design/celect/ 2110/.

[11] Intel Core i7 Processor. http://www.intel.com/cd/products/services/ emea/eng/processors/corei7/overview/405796.htm.

[12] Paraver - trace visualization tool. http://www.bsc.es/plantillaA.php?cat id=485.


[13] Six-Core AMD Opteron Processor. http://www.amd.com/uk/products/ server/processors/six-core-opteron/Pages/six-core-opteron.aspx.

[14] Tensilica Technology Helps Power Worlds Fastest Router. http://www.tensilica.com/news/110/330/ Tensilica-Technology-Helps-Power-World-s-Fastest-Router.htm.

[15] Tensilica’s Diamond 388VDO Video Engine. http://www.tensilica.com/ products/video/video arch.htm.

[16] Intel Itanium Architecture Software Developer’s Manual - Revision 2.2, 2006. http://download.intel.com/design/Itanium/manuals/24531905. pdf.

[17] Ghiath Al-Kadi and Andrei Terechko. A Hardware Task Scheduler for Embedded Video Processing. In Proc. High Performance Embedded Architectures and Compilers (HiPEAC) Conference, 2009.

[18] Altera. Nios II Processor: The World's Most Versatile Embedded Processor. http://www.altera.com/products/ip/processors/nios2/ni2-index.html.

[19] M. Alvarez, A. Ramirez, A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, and M. Valero. Scalability of Macroblock-level Parallelism for H.264 Decoding. In Proc. Int. Conf. on Parallel and Distributed Systems (ICPADS), 2009.

[20] M. Alvarez, A. Ramirez, M. Valero, A. Azevedo, C.H. Meenderinck, and B.H.H. Juurlink. Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture. Avances en Sistemas e Informática, 2009.

[21] M. Alvarez, E. Salami, A. Ramirez, and M. Valero. A Performance Characterization of High Definition Digital Video Decoding using H.264/AVC. In Proc. IEEE Int. Workload Characterization Symp., 2005.

[22] M. Alvarez, E. Salami, A. Ramirez, and M. Valero. HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications. In Proc. IEEE Int. Workload Characterization Symp., 2007. http://personals.ac.upc.edu/alvarez/hdvideobench/index.html.

[23] M. Alvarez, E. Salami, A. Ramirez, and M. Valero. Performance Impact of Unaligned Memory Operations in SIMD Extensions for Video Codec Applications. In Proc. Int. Symp. on Performance Analysis of Systems & Software, 2007.

[24] G.M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In Proc. AFIPS Conf., 1967.

[25] AnandTech. Lab Update: SSDs, Nehalem Mac Pro, Nehalem EX, Bench and More. http://anandtech.com/weblog/showpost.aspx?i=603.

[26] Marnix Arnold and Henk Corporaal. Automatic Detection of Recurring Operation Patterns. In Proc. Int. Workshop on Hardware/Software Codesign (CODES), 1999.

[27] K. Asanovic, R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, D.A. Patterson, W.L. Plishker, J. Shalf, S.W. Williams, et al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.

[28] A. Azevedo, B.H.H. Juurlink, C.H. Meenderinck, A. Terechko, J. Hoogerbrugge, M. Alvarez, A. Ramirez, and M. Valero. A Highly Scalable Parallel Implementation of H.264. Trans. on High-Performance Embedded Architectures and Compilers (HiPEAC), 2009.

[29] A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, M. Alvarez, and A. Ramirez. Analysis of Video Filtering on the Cell Processor. In Proc. Int. Symp. on Circuits and Systems, 2008.

[30] P. Bellens, J.M. Perez, R.M. Badia, and J. Labarta. CellSs: a Programming Model for the Cell BE Architecture. In Proc. Supercomputing Conf. (SC), 2006.

[31] M. Berekovic, H.J. Stolberg, M.B. Kulaczewski, P. Pirsch, H. Möller, H. Runge, J. Kneip, and B. Stabernack. Instruction Set Extensions for MPEG-4 Video. Journal of VLSI Signal Processing, 23(1), 1999.

[32] S. Borkar. Thousand core chips: a technology perspective. In Proc. Design Automation Conference, 2007.

[33] M. Briejer, C.H. Meenderinck, and B.H.H. Juurlink. Extending the Cell SPE with Energy Efficient Branch Prediction. In Proc. Int. Euro-Par Conf., 2010.

[34] Jeffery A. Brown, Rakesh Kumar, and Dean Tullsen. Proximity-Aware Directory-Based Coherence for Multi-Core Processor Architectures. In Proc. Symp. on Parallel Algorithms and Architectures, 2007.

[35] J. Castrillon, D. Zhang, T. Kempf, B. Vanthournout, R. Leupers, and G. Ascheid. Task Management in MPSoCs: an ASIP Approach. In Proc. Int. Conf. on Computer-Aided Design, 2009.

[36] Cavium Networks. OCTEON Multi-Core Processor Family. http: //www.caviumnetworks.com/OCTEON MIPS64.html.

[37] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture. In Proc. IEEE Int. Symp. on High-Performance Computer Architecture (HPCA), 2005.

[38] H. Chang and W. Sung. Efficient Vectorization of SIMD Programs with Non-aligned and Irregular Data Access Hardware. In Proc. Int. Conf. on Compilers, Architectures and Synthesis for Embedded Systems (CASES), 2008.

[39] J.W. Chen, C.R. Chang, and Y.L. Lin. A Hardware Accelerator for Context-Based Adaptive Binary Arithmetic Decoding in H. 264/AVC. In Proc. Int. Symp. Circuits and Systems (ISCAS), 2005.

[40] Y.K. Chen, E.Q. Li, X. Zhou, and S. Ge. Implementation of H. 264 Encoder and Decoder on Personal Computers. Journal of Visual Com- munications and Image Representation, 17, 2006.

[41] Y.K. Chen, X. Tian, S. Ge, and M. Girkar. Towards Efficient Multi- Level Threading of H.264 Encoder on Intel Hyper-Threading Architec- tures. In Proc. Int. Parallel and Distributed Processing Symp., 2004.

[42] D. Cheresiz, B.H.H. Juurlink, S. Vassiliadis, and H. A. G. Wijshoff. The CSI Multimedia Architecture. IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 2005.

[43] C.C. Chi. Parallel H.264 Decoding Strategies for Cell Broadband En- gine, 2009. Msc thesis - ce.et.tudelft.nl.

[44] C.C. Chi, B.H.H. Juurlink, and C.H. Meenderinck. Evaluation of Parallel H.264 Decoding Strategies for the Cell Broadband Engine. In Proc. Int. Conf. on Supercomputing (ICS), 2010.

[45] J. Chong, N.R. Satish, B. Catanzaro, K. Ravindran, and K. Keutzer. Efficient Parallelization of H.264 Decoding with Macro Block Level Scheduling. In Proc. IEEE Int. Conf. on Multimedia and Expo, 2007.

[46] ClearSpeed. The CSX600 Processor. http://www.clearspeed.com.

[47] ClearSpeed. The CSX700 Processor. http://www.clearspeed.com/ products/csx700.php.

[48] J. Corbal, R. Espasa, and M. Valero. MOM: a Matrix SIMD Instruc- tion Set Architecture for Multimedia Applications. In Proc. ACM/IEEE Conf. on Supercomputing (ICS), 1999.

[49] T.L. da Silva, C.M. Diniz, J.A. Vortmann, L.V. Agostini, A.A. Susin, S. Bampi, and B. Pelotas-RS. A Pipelined 8x8 2-D Forward DCT Hard- ware Architecture for H. 264/AVC High Profile Encoder. In Proc. Sec- ond Pacific Rim Symp. on Advances in Image and Video Technology (PSIVT), 2007.

[50] M.A. Daigneault, J.M. Pierre Langlois, and J.-P. David. Application Specific Instruction Set Processor Specialized for Block Motion Esti- mation. In Proc. IEEE Int. Conf. on Computer Design, 2008.

[51] I. Daubechies and W. Sweldens. Factoring Wavelet Transforms into Lifting Steps. Journal of Fourier Analysis and Applications, 4(3), 1998.

[52] Advanced Micro Devices. 3DNow! Technology Manual. Advanced Micro Devices, Inc, 2000.

[53] K. Diefendorff, PK Dubey, R. Hochsprung, and H. Scale. AltiVec Ex- tension to PowerPC Accelerates Media Processing. IEEE Micro, 20(2), 2000.

[54] EE Times. IBM’s Power7 Heats up Server Competition at Hot Chips. http://www.eetimes.com/news/semi/showArticle.jhtml? articleID=219400955.

[55] EE Times. Sun, IBM Push Multicore Boundaries. http://www.eetimes. com/news/semi/showArticle.jhtml?articleID=219500130.

[56] J.O. Eklundh. A Fast Computer Method for Matrix Transposing. IEEE Trans. on Computers, 100(21), 1972.

[57] Y. Etsion, A. Ramirez, R.M. Badia, and J. Labarta. Cores as Functional Units: A Task-Based, Out-of-Order, Dataflow Pipeline. In Proc. Int. Summer School on Advanced Computer Architecture and Compilation for Embedded Systems.

[58] The FFmpeg Libavcoded. http://ffmpeg.mplayerhq.hu/.

[59] B. Flachs, S. Asano, S.H. Dhong, H.P. Hofstee, G. Gervais, R. Kim, T. Le, P. Liu, J. Leenstra, J. Liberty, et al. The Microarchitecture of the Synergistic Processor for a CELL Processor. In Proc. IEEE Int. Solid-State Circuits Conference, 2005.

[60] M. Flierl and B. Girod. Generalized B Pictures and the Draft H.264/AVC Video-Compression Standard. IEEE Trans. on Circuits and Systems for Video Technology, 13(7), July 2003.

[61] Freescale. MPC8641D: High-performance Dual Core Proces- sor. http://www.freescale.com/webapp/sps/site/prod summary.jsp? code=MPC8641D.

[62] Freescale. QorIQ Communications Platforms. http://www.freescale. /webapp/sps/site/overview.jsp?nodeId=02VS0l25E4.

[63] T. Fujiyoshi, S. Shiratake, S. Nomura, T. Nishikawa, Y. Kitasho, H. Arakida, Y. Okuda, Y. Tsuboi, M. Hamada, H. Hara, et al. A 63-mW H. 264/MPEG-4 Audio/Visual Codec LSI with Module-Wise Dynamic Voltage/Frequency Scaling. IEEE Journal of Solid-State Circuits, 41(1), 2006.

[64] R. Giorgi, Z. Popovic, and N. Puzovic. DTA-C: A Decoupled Multi- Threaded Architecture for CMP Systems. In proc. Int. Symp. on Com- puter Architecture and High Performance Computing, 2007.

[65] R. Giorgi, Z. Popovic, and N. Puzovic. Introducing Hardware TLP Sup- port in the Cell Processor. In Int. Workshop on Multi-Core Computing Systems (MuCoCoS), 2009.

[66] D. Gong, Y. He, and Z. Cao. New Cost-Effective VLSI Implementation of a 2-D Discrete Cosine Transform and its Inverse. IEEE Trans. on Circuits and Systems for Video Technology, 14(4), 2004.

[67] S. Gordon, D. Marpe, and T. Wiegand. Simplified use of 8x8 Transforms – Updated Proposal & Results. Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, doc. JVT-K028, Munich, Germany, March 2004.

[68] A. Gulati and G. Campbell. Efficient Mapping of the H.264 Encoding Algorithm onto Multiprocessor DSPs. In Proc. Embedded Processors for Multimedia and Communications II, 2005.

[69] J.I. Guo, R.C. Ju, and J.W. Chen. An Efficient 2-D DCT/IDCT Core Design Using Cyclic Convolution and Adder-Based Realization. IEEE Trans. on Circuits and Systems for Video Technology, 14(4), 2004.

[70] J.L. Gustafson. Reevaluating Amdahl’s Law. Communications of the ACM, 31(5), 1988.

[71] L. Gwennap. Digital, MIPS add Multimedia Extensions. Microproces- sor Report, 10(15), 1996.

[72] B. Hanounik and X.S. Hu. Linear-time Matrix Transpose Algorithms Using Vector Register File With Diagonal Registers. In Proc. Int. Par- allel & Distributed Processing Symp., 2001.

[73] J.L. Hennessy and D.A. Patterson. Computer Architecture - A Quanta- tive Approach. Morgan Kaufman Publishers, 4th edition, 2007. page 3.

[74] M.D. Hill and M.R. Marty. Amdahl’s Law in the Multicore Era. IEEE Computer, 41(7), 2008.

[75] M.D. Hill and M.R. Marty. Amdahl’s law in the multicore era. Com- puter, 41(7), 2008.

[76] Hiroo Hayashi. SpursEngine - Architecture Overview & Implementation, 2008. http://www.ksrp.or.jp/fais/sec/cell/fais/news/pdf/kenkyutheme/hayashi workshop2008 v2.pdf.

[77] H.P. Hofstee. Power-Constrained Microprocessor Design. In Proc. IEEE Int. Conf. on Computer Design: VLSI in Computers and Proces- sors, 2002.

[78] H.P. Hofstee. Power Efficient Processor Architecture and the Cell Processor. In Proc. Int. Symp. on High-Performance Computer Architecture (HPCA), 2005.

[79] J. Hoogerbrugge and A. Terechko. A Multithreaded Multicore System for Embedded Media Processing. Trans. on High-Performance Embed- ded Architectures and Compilers, 3(2), 2008.

[80] IBM. Cell Broadband Engine Programming Handbook, 2006. http: //www.bsc.es/plantillaH.php?cat id=326.

[81] IBM. Synergistic Processor Unit Instruction Set Architecture, 2007. http://www.bsc.es/plantillaH.php?cat\ id=326.

[82] Intel. Single-chip Cloud Computer . http://techresearch.intel.com/ articles/Tera-Scale/1826.htm.

[83] Intel Integrated Performance Primitives. http://www.intel.com/cd/ software/products/asmo-na/eng/perflib/ipp/302910.htm.

[84] ITRS. Int. Technology Roadmap for Semiconductors, 2007 Edition. http://www.itrs.net.

[85] ITRS. Int. Technology Roadmap for Semiconductors, 2009 Edition. http://www.itrs.net.

[86] ITU-T and ISO/IEC JTC 1. Generic Coding of Moving Pictures and Associated Audio Information - Part 2: Video (ITU-T Recommendation H.262 - ISO/IEC 13818-2 (MPEG-2)), 1994.

[87] ISO/IEC JTC1. ISO/IEC 14496-2 - Information Technology - Coding of Audio-Visual Objects - Part 2: Visual, 1999.

[88] S. Kang and D.A. Bader. Optimizing JPEG2000 Still Image Encoding on the Cell Broadband Engine. In Proc. Int. Conf. on Parallel Process- ing, 2008.

[89] R.E. Kessler. The Alpha 21264 Microprocessor. IEEE Micro, 19(2), 1999.

[90] B. Khailany, W.J. Dally, U.J. Kapasi, P. Mattson, J. Namkoong, J.D. Owens, B. Towles, A. Chang, and S. Rixner. Imagine: Media Process- ing with Streams. IEEE Micro, 21(2), 2001.

[91] J. Kim and D.S. Wills. Quantized Color Instruction Set for Media-on-Demand Applications. In Proc. Int. Conf. on Multimedia and Expo (ICME), 2003.

[92] K.D. Kissell. MIPS MT: A Multithreaded RISC Architecture for Em- bedded Real-Time Processing. In ”Proc. High Performance Embedded Architectures and Compilers (HiPEAC) Conf.”, 2008.

[93] C.E. Kozyrakis and D.A. Patterson. Scalable, Vector Processors for Embedded Systems. IEEE Micro, 23(6), 2003.

[94] S. Kumar, C.J. Hughes, and A. Nguyen. Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors. In Proc. Int. Symp. on Computer Architecture (ISCA), 2007.

[95] B. Lee and D. Brooks. Effects of Pipeline Complexity on SMT/CMP Power-Performance Efficiency. In Proc. Workshop on Complexity Ef- fective Design, 2005.

[96] J. Lee, S. Moon, and W. Sung. H.264 Decoder Optimization Exploiting SIMD Instructions. In Proc. Asia-Pacific Conf. on Circuits and Systems, 2004.

[97] R. Lee and J. Huck. 64-bit and Multimedia Extensions in the PA-RISC 2.0 Architecture. In Proc. Compcon Conf., 1996.

[98] S.M. Lee, J.H. Chung, and M.M.O. Lee. High-Speed and Low-Power Real-Time Programmable Videomulti-Processor for MPEG-2 Multime- dia Chip on 0.6 µm TLM CMOS Technology. In Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), 1999.

[99] L. Li, Y. Song, S. Li, T. Ikenaga, and S. Goto. A Hardware Architec- ture of CABAC Encoding and Decoding with Dynamic Pipeline for H. 264/AVC. Journal of Signal Processing Systems, 50(1), 2008.

[100] H. Liao and A. Wolfe. Available Parallelism in Video Applications. In Proc. Int. Symp. on Microarchitecture (Micro), 1997.

[101] Y. Lin, H. Lee, M. Woh, Y. Harel, S. Mahlke, T. Mudge, C. Chakrabarti, and K. Flautner. SODA: A High-Performance DSP Architecture for Software-Defined Radio. IEEE Micro, 27(1), 2007.

[102] P. List, A. Joch, J. Lainema, G. Bjøntegaard, and M. Karczewicz. Adaptive Deblocking Filter. IEEE Trans. on Circuits and Systems for Video Technology, 13(7), 2003.

[103] L. Liu, S. Kesavarapu, J. Connell, A. Jagmohan, L. Leem, B. Paulovicks, V. Sheinin, L. Tang, and H. Yeo. Video Analysis and Compression on the STI Cell Broadband Engine Processor. Technical report, IBM T.J. Watson Research Center, 2006.

[104] Y. Liu and S. Oraintara. Complexity Comparison of fast Block- Matching Motion Estimation Algorithms. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2004.

[105] C. Loeffler, A. Ligtenberg, and G. S. Moschytz. Practical Fast 1-D DCT Algorithms With 11 Multiplications. In Proc. Int. Conf. on Acoustical and Speech and Signal Processing, 1989.

[106] G. Long, D. Fan, and J. Zhang. Architectural Support for Cilk Com- putations on Many-Core Architectures. ACM SIGPLAN Notices, 44(4), 2009.

[107] S.G. Mallat. A Theory for Multiresolution Signal Decomposition: the Wavelet Representation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(7), 1989.

[108] H. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky. Low- complexity Transform and Quantization with 16-bit Arithmetic for H. 26L. In Proc. IEEE Int. Conf. on Image Processing (ICIP), 2002.

[109] H.S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky. Low- Complexity Transform and Quantization in H. 264/AVC. IEEE Trans. on Circuits and Systems for Video Technology, 13(7), 2003.

[110] D. Marpe, H. Schwarz, and T. Wiegand. Context-Based Adaptive Bi- nary Arithmetic Coding in the H.264/AVC Video Compression Stan- dard. IEEE Trans. on Circuits and Systems for Video Technology, 13(7), 2003.

[111] C.H. Meenderinck. Implementing SPU instructions in CellSim. http: //pcsostres.ac.upc.edu/cellsim/doku.php/.

[112] C.H. Meenderinck, A. Azevedo, B.H.H. Juurlink, M. Alvarez, and A. Ramirez. Parallel Scalability of Video Decoders. Technical report, Delft University of Technology, April 2008. http://ce.et.tudelft.nl/publications.php.

[113] C.H. Meenderinck, A. Azevedo, B.H.H. Juurlink, M. Alvarez, and A. Ramirez. Parallel Scalability of Video Decoders. Journal of Sig- nal Processing Systems, 2008.

[114] A. Mendelson. How Many Cores are too Many Cores? Presentation at 3rd HiPEAC Industrial Workshop.

[115] K. Moerman. NXP’s Embedded can Tune into Software-Defined Radio. 2007. http://www.eetimes.eu/germany/ 202403704.

[116] G.E. Moore. Cramming More Components onto Integrated Circuits. Electronics, 1965.

[117] M. Moudgill, J. Glossner, S. Agrawal, and G. Nacer. The Sandblaster 2.0 Architecture and SB3500 Implementation. In Proc. Software Defined Radio Technical Forum, 2008.

[118] T. Nishikawa, M. Takahashi, M. Hamada, T. Takayanagi, H. Arakida, N. Machida, H. Yamamoto, T. Fujiyoshi, Y. Matsumoto, O. Yamagishi, et al. A 60 MHz 240 mW MPEG-4 Video-Phone LSI with 16 Mb Embedded DRAM. In Proc. IEEE Int. Solid-State Circuits Conference (ISSCC), 2000.

[119] T. Oelbaum, V. Baroncini, T.K. Tan, and C. Fenimore. Subjective Quality Assessment of the Emerging AVC/H.264 Video Coding Standard. In Int. Broadcast Conference (IBC), 2004.

[120] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Int. Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), 2005.

[121] R.R. Osorio and J.D. Bruguera. An FPGA Architecture for CABAC Decoding in Many-core Systems. In Proc. IEEE Int. Conf. on Application-Specific Systems, Architectures and Processors (ASAP), 2008.

[122] S.H. Ou, T.J. Lin, X.S. Deng, Z.H. Zhuo, and C.W. Liu. Multithreaded Coprocessor Interface for Multi-Core Multimedia SoC. In Proc. Asia and South Pacific Design Automation Conference (ASPDAC), 2008.

[123] Y. Pang, L. Sun, S. Guo, and S. Yang. Spatial and Temporal Data Parallelization of Multi-view Video Encoding Algorithm. In Proc. IEEE Workshop on Multimedia Signal Processing (MMSP), 2007.

[124] S. Park, S. Kim, K. Byeon, J. Cha, and H. Cho. An Area Efficient Video/Audio Codec for Portable Multimedia Applications. In Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS), 2000.

[125] A. Peleg, S. Wilkie, and U. Weiser. Intel MMX for Multimedia PCs. Communications of the ACM, 40(1), 1997.

[126] H. Peters, R. Sethuraman, A. Beric, P. Meuwissen, S. Balakrishnan, C.A.A. Pinto, W. Kruijtzer, F. Ernst, G. Alkadi, J. van Meerbergen, et al. Application Specific Instruction-Set Processor Template for Motion Estimation in Video Applications. IEEE Trans. on Circuits and Systems for Video Technology, 15(4), 2005.

[127] D. Pham, S. Asano, M. Bolliger, M.N. Day, H.P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, et al. The Design and Implementation of a First-Generation CELL Processor. In Proc. IEEE Int. Solid-State Circuits Conference (ISSCC), 2005.

[128] PicoChip. PC205 High Performance Signal Processor. http://www.picochip.com/page/76/.

[129] J. Planas, R.M. Badia, E. Ayguadé, and J. Labarta. Hierarchical Task-Based Programming With StarSs. Int. Journal of High Performance Computing Applications, 23(3), 2009.

[130] Plurality. Hypercore Processor. http://www.plurality.com/hypercore.html.

[131] S.K. Raman, V. Pentkovski, and J. Keshava. Implementing Streaming SIMD Extensions on the Pentium III Processor. IEEE Micro, 2000.

[132] A. Rodriguez, A. Gonzalez, and M.P. Malumbres. Hierarchical Parallelization of an H.264/AVC Video Encoder. In Proc. Int. Symp. on Parallel Computing in Electrical Engineering, 2006.

[133] M. Roitzsch. Slice-Balancing H.264 Video Encoding for Improved Scalability of Multicore Decoding. In Proc. IEEE Real-Time Systems Symp., 2006.

[134] R. Sasanka, M.L. Li, S.V. Adve, Y.K. Chen, and E. Debes. ALP: Efficient Support for all Levels of Parallelism for Complex Media Applications. ACM Trans. on Architecture and Code Optimization (TACO), 4(1), 2007.

[135] K. Schöffmann, M. Fauster, O. Lampl, and L. Böszörményi. An Evaluation of Parallelization Concepts for Baseline-Profile Compliant H.264/AVC Decoders. In Proc. Euro-Par Conf., 2007.

[136] A. Shahbahrami, B.H.H. Juurlink, and S. Vassiliadis. Performance Impact of Misaligned Accesses in SIMD Extensions. In Proc. Workshop on Circuits, Systems, and Signal Processing (ProRISC), 2006.

[137] A. Shahbahrami, B.H.H. Juurlink, and S. Vassiliadis. Implementing the 2D Wavelet Transform on SIMD-Enhanced General-Purpose Processors. IEEE Trans. on Multimedia, 10(1), 2008.

[138] Z. Shen, H. He, Y. Zhang, and Y. Sun. VS-ISA: A Video Specific Instruction Set Architecture for ASIP Design. In Proc. Int. Conf. on Intelligent Information Hiding and Multimedia Signal Processing, 2006.

[139] Y. Shi. Reevaluating Amdahl's Law and Gustafson's Law, 1996. http://www.cis.temple.edu/~shi/docs/amdahl/amdahl.html.

[140] H. Shojania, S. Sudharsanan, and Chan Wai-Yip. Performance Improvement of the H.264/AVC Deblocking Filter Using SIMD Instructions. In Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS), 2006.

[141] M. Sima, S. D. Cotofana, J. T. J. van Eijndhoven, S. Vassiliadis, and K. A. Vissers. An 8x8 IDCT Implementation on an FPGA-augmented TriMedia. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), 2001.

[142] M. Själander, A. Terechko, and M. Duranton. A Look-Ahead Task Management Unit for Embedded Multi-Core Architectures. In Proc. EUROMICRO Conf. on Digital System Design Architectures, Methods and Tools (DSD), 2008.

[143] N. Slingerland and A.J. Smith. Measuring the Performance of Multi- media Instruction Sets. IEEE Trans. on Computers, 2002.

[144] Sony. BCU-100 Computing Unit with Cell/B.E. and RSX. http://pro.sony.com/bbsccms/ext/ZEGO/files/BCU-100 Whitepaper.pdf.

[145] Sony. Media Backbone. http://pro.sony.com/bbsccms/ext/BroadcastandBusiness/minisites/NAB2010/docs/prodbroch medbacksol.pdf.

[146] K. Stavrou, C. Kyriacou, P. Evripidou, and P. Trancoso. Chip Multiprocessor Based on Data-Driven Multithreading Model. Int. Journal of High Performance Systems Architecture, 1(1), 2007.

[147] K. Stavrou, D. Pavlou, M. Nikolaides, P. Petrides, P. Evripidou, P. Trancoso, Z. Popovic, and R. Giorgi. Programming Abstractions and Toolchain for Dataflow Multithreading Architectures. In Proc. IEEE Int. Symp. on Parallel and Distributed Computing (ISPDC), 2009.

[148] P. Stenström. Chip-Multiprocessing and Beyond. In Proc. Int. Symp. on High-Performance Computer Architecture (HPCA), 2006.

[149] G.J. Sullivan, P.N. Topiwala, and A. Luthra. The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions. In Proc. SPIE Conference on Applications of Digital Image Processing, 2004.

[150] K. Suzuki, M. Daito, T. Inoue, K. Nadehara, M. Nomura, M. Mizuno, T. Iima, S. Sato, T. Fukuda, T. Arai, et al. A 2000-MOPS Embedded RISC Processor with a Rambus DRAM Controller. IEEE Journal of Solid-State Circuits, 34(7), 1999.

[151] H. Takata, T. Watanabe, T. Nakajima, T. Takagaki, H. Sato, A. Mohri, A. Yamada, T. Kanamoto, Y. Matsuda, S. Iwade, et al. The D30V/MPEG Multimedia Processor. IEEE Micro, 19(4), 1999.

[152] D. Talla and L.K. John. Cost-Effective Hardware Acceleration of Multimedia Applications. In Proc. Int. Conf. on Computer Design (ICCD), 2001.

[153] A. Tamhankar and K.R. Rao. An Overview of H.264/MPEG-4 Part 10. In Proc. EURASIP Conference focused on Video/Image Processing and Multimedia Communications, 2003.

[154] Tensilica Inc. Product Brief - Xtensa LX3 Customizable DPU. http://www.tensilica.com/uploads/pdf/LX3.pdf.

[155] Texas Instruments. OMAP Applications Processors. http://focus.ti.com/paramsearch/docs/parametricsearch.tsp?family=dsp&sectionId=2&tabId=2218&familyId=1525&paramCriteria=no.

[156] Tilera. TILE-Gx Processors Family. http://www.tilera.com/products/TILE-Gx.php.

[157] Tilera. TILE64 Processor Family. http://www.tilera.com.

[158] N. Topham. EnCore: A Low-power Extensible Embedded Processor. Presentation at HiPEAC Industrial Workshop, 2009. http://groups.inf.ed.ac.uk/pasta/pub/hipeac-wroclaw.pdf.

[159] M. Tremblay, J.M. O'Connor, and V. Narayanan. VIS Speeds New Media Processing. IEEE Micro, 16(4), 1996.

[160] K. van Berkel, P. Meuwissen, N. Engin, and S. Balakrishnan. CVP: A Programmable Co Vector Processor for 3G Mobile Baseband Processing. In Proc. World Wireless Congress, 2003.

[161] J.W. van de Waerdt, K. Kalra, P. Rodriguez, H. van Antwerpen, S. Vassiliadis, S. Das, S. Mirolo, C. Yen, B. Zhong, C. Basto, et al. The TM3270 Media-Processor. In Proc. IEEE/ACM Int. Symp. on Microarchitecture, 2005.

[162] E.B. van der Tol, E.G. Jaspers, and R.H. Gelderblom. Mapping of H.264 Decoding on a Multiprocessor Architecture. In Proc. SPIE Conf. on Image and Video Communications and Processing, 2003.

[163] I. Verbauwhede and C. Nicol. Low Power DSP's for Wireless Communications. In Proc. Int. Symp. on Low Power Electronics and Design, 2000.

[164] J.S. Vitter. Implementations for Coalesced Hashing. Communications of the ACM, 25(12), 1982.

[165] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra. Overview of the H.264/AVC Video Coding Standard. IEEE Trans. on Circuits and Systems for Video Technology, 13(7), 2003.

[166] M. Wien. Variable Block-Size Transforms for H.264/AVC. IEEE Trans. on Circuits and Systems for Video Technology, 13(7), 2003.

[167] D.H. Woo and H.H.S. Lee. Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era. IEEE Computer, 41(12), 2008.

[168] X264. A Free H.264/AVC Encoder. http://developers.videolan.org/x264.html.

[169] X264-devel – Mailing list for x264 developers, July-August 2007. Subject: Out-of-Range Motion Vectors.

[170] Z. Zhao and P. Liang. Data Partition for Wavefront Parallelization of H.264 Video Encoder. In IEEE Int. Symp. on Circuits and Systems (ISCAS), 2006.

[171] X. Zhou, E. Q. Li, and Y.K. Chen. Implementation of H.264 Decoder on General-Purpose Processors with Media Instructions. In Proc. SPIE Conf. on Image and Video Communications and Processing, 2003.

[172] M. Zuluaga and N. Topham. Resource Sharing in Custom Instruction Set Extensions. In Proc. IEEE Symp. on Application Specific Processors (SASP), 2008.

List of Publications

International Journals

1. C.H. Meenderinck, A. Azevedo, B.H.H. Juurlink, M. Alvarez, A. Ramirez, Parallel Scalability of Video Decoders, Journal of Signal Processing Systems, Volume 57, Issue 2, 2009.

2. A. Azevedo, B.H.H. Juurlink, C.H. Meenderinck, A. Terechko, J. Hoogerbrugge, M. Alvarez, A. Ramirez, M. Valero, A Highly Scalable Parallel Implementation of H.264, Transactions on High-Performance Embedded Architectures and Compilers (HiPEAC), Volume 4, Issue 2, 2009.

3. M. Alvarez, A. Ramirez, M. Valero, A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture, Avances en Sistemas e Informática, Volume 6, Number 1, 2009, ISSN 1657-7663.

International Conferences

4. C.H. Meenderinck, B.H.H. Juurlink, A Case for Hardware Task Management Support for the StarSS Programming Model, Proceedings of the Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD), 2010.

5. M. Briejer, C.H. Meenderinck, B.H.H. Juurlink, Extending the Cell SPE with Energy Efficient Branch Prediction, Proceedings of the International Euro-Par Conference, 2010.

6. C.C. Chi, B.H.H. Juurlink, C.H. Meenderinck, Evaluation of Parallel H.264 Decoding Strategies for the Cell Broadband Engine, Proceedings of the International Conference on Supercomputing (ICS), 2010.


7. C.H. Meenderinck, B.H.H. Juurlink, Intra-Vector SIMD Instructions for Core Specialization, Proceedings of the IEEE International Conference on Computer Design (ICCD), 2009.

8. C.H. Meenderinck, B.H.H. Juurlink, Specialization of the Cell SPE for Media Applications, Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2009.

9. M. Alvarez, A. Ramirez, M. Valero, C.H. Meenderinck, A. Azevedo, B.H.H. Juurlink, Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture, Proceedings of the 4th Colombian Computing Conference (4CCC), 2009.

10. A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, A. Terechko, J. Hoogerbrugge, M. Alvarez, A. Ramirez, M. Valero, Parallel H.264 Decoding on an Embedded Multicore Processor, Proceedings of High-Performance Embedded Architectures and Compilers Conference (HiPEAC), 2009.

11. M. Alvarez, A. Ramirez, A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, M. Valero, Scalability of Macroblock-level Parallelism for H.264 Decoding, Proceedings of International Conference on Parallel and Distributed Systems (ICPADS), 2009.

12. C.H. Meenderinck, B.H.H. Juurlink, (When) Will CMPs hit the Power Wall?, Proceedings of Euro-Par Workshops - Highly Parallel Processing on a Chip (HPPC), 2008.

13. C.H. Meenderinck, A. Azevedo, M. Alvarez, B.H.H. Juurlink, A. Ramirez, Parallel Scalability of H.264, Proceedings of Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG), 2008.

14. A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, M. Alvarez, A. Ramirez, Analysis of Video Filtering on the Cell Processor, Proceedings of International Symposium on Circuits and Systems (ISCAS), 2008.

15. M. Alvarez, A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, A. Terechko, J. Hoogerbrugge, A. Ramirez, Analyzing Scalability Limits of H.264 Decoding due to TLP Overhead, HiPEAC Industrial Workshop, 2008.

Local Conferences

16. M. Briejer, C.H. Meenderinck, B.H.H. Juurlink, Energy Efficient Branch Prediction on the Cell SPE, Proceedings Workshop on Circuits, Systems and Signal Processing (ProRISC), 2009.

17. C.H. Meenderinck, B.H.H. Juurlink, A Chip MultiProcessor Accelerator for Video Decoding, Proceedings Workshop on Circuits, Systems and Signal Processing (ProRISC), 2008.

18. C.H. Meenderinck, B.H.H. Juurlink, (When) Will CMPs hit the Power Wall?, Proceedings Workshop on Circuits, Systems and Signal Processing (ProRISC), 2007.

19. A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, M. Alvarez, A. Ramirez, Analysis of Video Filtering on the Cell Processor, Proceedings Workshop on Circuits, Systems and Signal Processing (ProRISC), 2007.

Reports

20. C.H. Meenderinck, B.H.H. Juurlink, The SARC Media Accelerator - Specialization of the Cell SPE for Media Acceleration, CE technical report, 2009, CE-TR-2009-01.

21. C.H. Meenderinck, B.H.H. Juurlink, (When) Will CMPs hit the Power Wall?, CE technical report, 2008, CE-TR-2008-04.

22. C.H. Meenderinck, A. Azevedo, B.H.H. Juurlink, M. Alvarez, A. Ramirez, Parallel Scalability of Video Decoders, CE technical report, 2008, CE-TR-2008-03.

Tutorials & Manuals

23. C.H. Meenderinck, Implementing SPU instructions in CellSim, http://pcsostres.ac.upc.edu/cellsim/doku.php/, 2007.

24. C.H. Meenderinck, CellSim/SarcSim Profiling Tool Manual, 2008.

International Journals (not directly related to this thesis)

25. C.H. Meenderinck, S. D. Cotofana, Computing Division Using Single-Electron Tunneling Technology, IEEE Transactions on Nanotechnology, Volume 6, Number 4, 2007.

26. C.H. Meenderinck, S. D. Cotofana, An Analysis of Basic Structures for Effective Computation in Single Electron Tunneling Technology, Romanian Journal of Information Science and Technology, Volume 10, Number 1, 2007.

Conferences (not directly related to this thesis)

27. C.H. Meenderinck, S. D. Cotofana, Basic Building Blocks for Effective Single Electron Tunneling Technology Based Computation, Proceedings of International Semiconductor Conference (CAS), 2006.

28. C.H. Meenderinck, S. D. Cotofana, Computing Division in the Electron Counting Paradigm using Single Electron Tunneling Technology, Proceedings of IEEE Conference on Nanotechnology, 2006.

29. C.H. Meenderinck, S. D. Cotofana, High-Radix Addition and Multiplication in the Electron Counting Paradigm Using Single Electron Tunneling Technology, Proceedings of International Workshop on Computer Systems: Architectures, Modelling, and Simulation (SAMOS), 2006.

30. C.H. Meenderinck, S. D. Cotofana, Electron Counting based High-Radix Multiplication in Single Electron Tunneling Technology, Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), 2006.

31. C.H. Meenderinck, C. R. Lageweg, S. D. Cotofana, Design Methodology for Single Electron Based Building Blocks, Proceedings of IEEE Conference on Nanotechnology, 2005.

32. C.H. Meenderinck, S. D. Cotofana, C. R. Lageweg, High Radix Addition Via Conditional Charge Transport in Single Electron Tunneling Technology, Proceedings of International Conference on Application-Specific Systems, Architectures and Processors, 2005.

33. C.H. Meenderinck, S. D. Cotofana, Periodic Symmetric Functions and Addition Related Arithmetic Operations in Single Electron Tunneling Technology, Proceedings of Workshop on Circuits, Systems and Signal Processing (ProRisc), 2005.

34. C.H. Meenderinck, S. D. Cotofana, Computing Periodic Symmetric Functions in Single Electron Tunneling Technology, Proceedings of International Semiconductor Conference (CAS), 2005, Best paper award.

Samenvatting

De zoektocht naar steeds krachtigere processoren heeft ertoe geleid dat steeds vaker multicore architecturen worden toegepast. Omdat de klokfrequentie niet meer hard steeg, en omdat ILP technieken steeds minder opleverden, was het vermeerderen van het aantal cores per chip een natuurlijke keuze. Het transistorbudget groeit nog steeds, en het is daarom te verwachten dat binnen tien jaar een chip honderden krachtige cores kan bevatten. Maar het opschalen van het aantal cores vertaalt zich niet noodzakelijkerwijs in een gelijke opschaling van de prestaties. In dit proefschrift presenteren wij een aantal technieken om de schaalbaarheid van de prestaties van multicore systemen te verbeteren. Met deze technieken gaan wij in op een aantal belangrijke uitdagingen op het gebied van multicores.

Ten eerste onderzoeken wij het effect van de power wall op de architectuur van toekomstige multicores. Ons model bevat een voorspelling van technologische verbeteringen, een analyse van symmetrische en asymmetrische multicores, alsmede de invloed van de wet van Amdahl.

Ten tweede onderzoeken wij de parallellisatie van een H.264 videodecoderingsapplicatie, waarmee we ingaan op de schaalbaarheid van applicaties. Bestaande parallellisatiemethodes worden besproken en een nieuwe strategie wordt gepresenteerd. Analyse laat zien dat met het gebruik van de nieuwe parallellisatiestrategie de hoeveelheid beschikbaar parallellisme in de orde van duizenden is. Verscheidene implementaties van de strategie worden besproken, welke de moeilijkheden en de mogelijkheden laten zien van het benutten van het beschikbare parallellisme.

Ten derde presenteren wij een Applicatie Specifieke Instructie Set (ASIP) processor voor H.264 decodering, welke gebaseerd is op de Cell SPE. ASIPs zijn energiezuinig en maken het mogelijk om de prestatie op te schalen in systemen waar energieverbruik een beperkende factor is.

Ten slotte presenteren wij hardware-ondersteuning voor taakmanagement, waarvan de voordelen tweevoudig zijn. In de eerste plaats ondersteunt het het SARC programmeermodel, welke een taakgebaseerd dataflow programmeermodel is. Door te voorzien in hardwarematige ondersteuning voor de meest tijdrovende onderdelen van het runtime systeem, wordt de schaalbaarheid verbeterd. In de tweede plaats vermindert het de parallellisatieoverhead, zoals synchronisatie, door te voorzien in snelle hardware primitieven.

Curriculum Vitae

Cor Meenderinck was born in Amsterdam, the Netherlands, on January 29th 1978. From 1990 he attended secondary school at the Herbert Vissers College in Nieuw-Vennep, where he graduated in 1996. In the same year he became a student at the Faculty of Electrical Engineering, Mathematics, and Computer Science of Delft University of Technology. He obtained a BSc degree in Electrical Engineering and later an MSc degree, with honors, in Computer Engineering.

In 2005, Cor started working as a PhD student with the Computer Engineering Laboratory of Delft University of Technology. During the first year he worked on single electron tunneling based arithmetic operations, under the supervision of Sorin Cotofana. From 2006 on he worked on the SARC project under the supervision of Ben Juurlink. He was actively involved in this European project (16 partners) as well as in research clusters of the HiPEAC Network of Excellence. He has worked closely with researchers from several universities and twice paid a long-term visit to the Barcelona Supercomputing Center (BSC).

His research interests include computer architecture, multicores, media accelerators, parallelization, design for power efficiency, task management, programming models, and runtime systems.
