Improving the Scalability of Multicore Systems with a Focus on H.264 Video Decoding

PROEFSCHRIFT (Dissertation)

for the purpose of obtaining the degree of doctor at the Technische Universiteit Delft, by authority of the Rector Magnificus prof.ir. K.Ch.A.M. Luyben, chairman of the Board for Doctorates, to be defended in public on Friday 9 July 2010 at 15:00 by Cornelis Hermanus MEENDERINCK, electrical engineer, born in Amsterdam, the Netherlands.

This dissertation has been approved by the promotors:
Prof. dr. B.H.H. Juurlink
Prof. dr. K.G.W. Goossens

Composition of the doctoral committee:
Rector Magnificus, chairman, Technische Universiteit Delft
Prof. dr. B.H.H. Juurlink, promotor, Technische Universität Berlin
Prof. dr. K.G.W. Goossens, promotor, Technische Universiteit Delft
Prof. dr. H. Corporaal, Technische Universiteit Eindhoven
Prof. dr. ir. A.J.C. van Gemund, Technische Universiteit Delft
Dr. H.P. Hofstee, IBM Systems and Technology Group
Prof. dr. K.G. Langendoen, Technische Universiteit Delft
Dr. A. Ramirez, Universitat Politècnica de Catalunya

Prof. dr. B.H.H. Juurlink was employed at the Technische Universiteit Delft until the end of December 2009 and, as supervisor, contributed substantially to the realization of this dissertation.

ISBN: 978-90-72298-08-9

Keywords: multicore, Chip MultiProcessor (CMP), scalability, power efficiency, thread-level parallelism (TLP), H.264, video decoding, instruction set architecture, domain specific accelerator, task management

Acknowledgements: This work was supported by the European Commission in the context of the SARC integrated project #27648 (FP6).

Copyright © 2010 C.H. Meenderinck
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without permission of the author.
Printed in the Netherlands

Dedicated to my wife for loving and supporting me.

Abstract

In pursuit of ever increasing performance, more and more processor architectures have become multicore processors. As clock frequency was no longer increasing rapidly and ILP techniques showed diminishing results, increasing the number of cores per chip was the natural choice. The transistor budget is still increasing and thus it is expected that within ten years chips can contain hundreds of high performance cores. Scaling the number of cores, however, does not necessarily translate into an equal scaling of performance.

In this thesis, we propose several techniques to improve the performance scalability of multicore systems. With those techniques we address several key challenges of the multicore area.

First, we investigate the effect of the power wall on future multicore architectures. Our model includes predictions of technology improvements, analysis of symmetric and asymmetric multicores, as well as the influence of Amdahl's Law.

Second, we investigate the parallelization of the H.264 video decoding application, thereby addressing application scalability. Existing parallelization strategies are discussed and a novel strategy is proposed. Analysis shows that with the new parallelization strategy the amount of available parallelism is in the order of thousands. Several implementations of the strategy are discussed, which show both the difficulty and the possibility of actually exploiting the available parallelism.

Third, we propose an Application-Specific Instruction-set Processor (ASIP) for H.264 decoding, based on the Cell SPE. ASIPs are energy efficient and allow performance scaling in systems that are limited by the power budget.

Finally, we propose hardware support for task management, of which the benefits are two-fold.
First, it supports the SARC programming model, which is a task-based dataflow programming model based on StarSS. By providing hardware support for the most time-consuming part of the runtime system, it improves the scalability. Second, it reduces the parallelization overhead, such as synchronization, by providing fast hardware primitives.

Acknowledgments

Many people contributed to this work in some way and I owe them a debt of gratitude. To reach this point it takes professors who teach you, colleagues who assist you, and family and friends who support you.

First of all, I thank Ben Juurlink for supervising me. You helped focus this work, while at the same time pointing me to numerous potential ideas. You showed me what others had done in our research area and how we might either build on that or fill a gap. You suggested that I visit BSC and collaborate with them closely. Also when I wanted to visit HiPEAC meetings you supported me. All those trips, where I met other researchers, other ideas, and other scientific work, have been of great value.

I'm also thankful to Stamatis Vassiliadis for giving me the opportunity to work on the SARC project. I profited from the synergy effect within such a large project. The shift to the SARC project also meant the end of the collaboration with Sorin Cotofana. Sorin, I thank you for the time we worked together. You were the one who introduced me to academia, you taught me how to write papers, and I still remember the conversations we had about life, religion, gastronomy, and jazz. I also thank Kees Goossens for being one of my promoters.

Over the years I have worked together with several other PhD students and some MSc students. I thank Arnaldo and Mauricio for all the work we did together on parallelizing H.264 and porting it to the Cell processor. I thank Alejandro, Felipe, Augusto, and David for all the effort they put into the simulator and helping me to use and extend it.
I thank Efren for implementing the Nexus system in the simulator. I thank Carsten, Martijn, and Chi for doing their MSc project with us, by which you helped me and others. I thank Pepijn for helping me with Linux issues and being a pleasant roommate.

I thank Alex Ramirez and Xavi Martorell for warmly welcoming me in Barcelona, the collaboration, and your insight. My visits to Barcelona bore a lot of fruit. They helped me to put my research in a larger perspective; they both broadened and deepened my knowledge, and taught me how to enjoy life in an intense way.

I thank Jan Hoogerbrugge for the valuable discussions on parallelizing H.264. I thank Stefanos Kaxiras for his input on the methodology used to estimate the power consumption of future multicores. I thank Peter Hofstee and Brian Flachs for commenting on the SPE specialization. I thank Ayal Zaks for helping me with compiler issues. I thank Yoav Etsion for the discussions and collaboration on hardware dependency resolution.

Secretaries and coffee; they always seem to go together. Whether I was in Delft or in Barcelona, the secretary's office was always my source of coffee and a chat. Lidwina and Monique, thank you for the coffee, the chats, and helping me with the administrative stuff. Lourdes, thank you for the warm welcome in BSC, the coffee, and the conversations.

I thank the European Network of Excellence on High Performance and Embedded Architecture and Compilation (HiPEAC) for financing the collaboration with BSC and some trips to the HiPEAC workshops and conferences.

I'm very grateful to my parents who raised me, who loved me, and who supported me continuously throughout the years. I thank God for giving me the capabilities of doing a PhD and being my guidance through life.

Many, many thanks to my dearest wife; the princess of my life. Thank you for loving me, for supporting me, and for taking this wonderful journey of life together with me.

C.H. Meenderinck
Delft, The Netherlands, 2010

Contents

Abstract
Acknowledgments
List of Figures
List of Tables
Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Organization and Contributions

2 (When) Will Multicores hit the Power Wall?
  2.1 Methodology
  2.2 Scaling of the Alpha 21264
  2.3 Power and Performance Assuming Perfect Parallelization
  2.4 Intermezzo: Amdahl vs. Gustafson
  2.5 Performance Assuming Non-Perfect Parallelization
  2.6 Conclusions

3 Parallel Scalability of Video Decoders
  3.1 Overview of the H.264 Standard
  3.2 Benchmarks
  3.3 Parallelization Strategies for H.264
    3.3.1 GOP-level Parallelism
    3.3.2 Frame-level Parallelism
    3.3.3 Slice-level Parallelism
    3.3.4 Macroblock-level Parallelism
    3.3.5 Block-level Parallelism
  3.4 Scalable MB-level Parallelism: The 3D-Wave
  3.5 Parallel Scalability of the Dynamic 3D-Wave
  3.6 Case Study: Mobile Video
  3.7 Experiences with Parallel Implementations
    3.7.1 3D-Wave on a TriMedia-based Multicore
    3.7.2 2D-Wave on a Multiprocessor System
    3.7.3 2D-Wave on the Cell Processor
  3.8 Conclusions

4 The SARC Media Accelerator
  4.1 Related Work
  4.2 The SPE Architecture
  4.3 Experimental Setup
    4.3.1 Benchmarks
    4.3.2 Compiler
    4.3.3 Simulator
  4.4 Enhancements to the SPE Architecture
    4.4.1 Accelerating Scalar Operations
    4.4.2 Accelerating Saturation and Packing
    4.4.3 Accelerating Matrix Transposition
    4.4.4 Accelerating Arithmetic Operations
    4.4.5 Accelerating Unaligned Memory Accesses
  4.5 Performance Evaluation
    4.5.1 Results for the IDCT8 Kernel
    4.5.2 Results for the IDCT4 Kernel
