Tmam1de1.Pdf (3076Mb)
Total Page:16
File Type:pdf, Size:1020Kb
ADVERTIMENT . La consulta d’aquesta tesi queda condicionada a l’acceptació de les següents condicions d'ús: La difusió d’aquesta tesi per mitjà del servei TDX ( www.tesisenxarxa.net ) ha estat autoritzada pels titulars dels drets de propietat intel·lectual únicament per a usos privats emmarcats en activitats d’investigació i docència. No s’autoritza la seva reproducció amb finalitats de lucre ni la seva difusió i posada a disposició des d’un lloc aliè al servei TDX. No s’autoritza la presentació del seu contingut en una finestra o marc aliè a TDX (framing). Aquesta reserva de drets afecta tant al resum de presentació de la tesi com als seus continguts. En la utilització o cita de parts de la tesi és obligat indicar el nom de la persona autora. ADVERTENCIA . La consulta de esta tesis queda condicionada a la aceptación de las siguientes condiciones de uso: La difusión de esta tesis por medio del servicio TDR ( www.tesisenred.net ) ha sido autorizada por los titulares de los derechos de propiedad intelectual únicamente para usos privados enmarcados en actividades de investigación y docencia. No se autoriza su reproducción con finalidades de lucro ni su difusión y puesta a disposición desde un sitio ajeno al servicio TDR. No se autoriza la presentación de su contenido en una ventana o marco ajeno a TDR (framing). Esta reserva de derechos afecta tanto al resumen de presentación de la tesis como a sus contenidos. En la utilización o cita de partes de la tesis es obligado indicar el nombre de la persona autora. WARNING . On having consulted this thesis you’re accepting the following use conditions: Spreading this thesis by the TDX ( www.tesisenxarxa.net ) service has been authorized by the titular of the intellectual property rights only for private uses placed in investigation and teaching activities. Reproduction with lucrative aims is not authorized neither its spreading and availability from a site foreign to the TDX service. Introducing its content in a window or frame foreign to the TDX service is not authorized (framing). This rights affect to the presentation summary of the thesis as well as to its contents. In the using or citation of parts of the thesis it’s obliged to indicate the name of the author Parallel Video Decoding Mauricio Alvarez´ Mesa A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy in Computing Engineering Advisors: Alex Ram´ırezand Mateo Valero Department of Computer Architecture Universitat Polit`ecnicade Catalunya (UPC) June, 2011 To Luna, Claudia, Carolina and Mariela. The women of my life. Without their support and love this work would have been meaningless. Abstract Digital video is a popular technology used in many different applications. The quality of video, expressed in the spatial and temporal resolution, has been increasing continuously in the last years. In order to reduce the bitrate required for its storage and transmis- sion, a new generation of video encoders and decoders (codecs) have been developed. The latest video codec standard, known as H.264/AVC, includes sophisticated com- pression tools that require more computing resources than any previous video codec. The combination of high quality video and the advanced compression tools found in H.264/AVC has resulted in a significant increase in the computational requirements of video decoding applications. The main objective of this thesis is to provide the performance required for real- time operation of high quality video decoding using programmable architectures. Our solution has been the simultaneous exploitation of multiple levels of parallelism. On the one hand, video decoders have been modified in order to extract as much parallelism as possible. And, on the other hand, general purpose architectures has been enhanced for exploiting the type of parallelism that is present in video codec applications. First, we made a scalability analysis of two different Single Instruction Multiple Data (SIMD) extensions: a 1-dimensional (1D) extension and a 2-d matrix extension (2D). We have shown that scaling the 2D extension results in higher performance and lower complexity than the 1D extension for MPEG-2 video coding and decoding. We performed a workload characterization of H.264/AVC for High Definition (HD) applications. We identified the main kernels and compared them with the kernels of previous video codecs. Due to the lack of a proper benchmark for HD video decoding we developed our own one, called HD-VideoBench. This benchmark includes complete applications for video coding and decoding together with a set in of input videos in HD resolutions. After that, we optimized the most relevant kernels of the H.264/AVC decoder using SIMD instructions. However, we were not able to reach the maximum performance due to the overhead of data re-alignment. As a solution, we implemented and evaluated the required hardware and software for supporting unaligned accesses in SIMD extensions. This support resulted in significant performance gains both for kernels and the complete application. Because SIMD extensions were not enough to provide all the required performance, we developed an investigation on how to extract Thread-Level-Parallelism. We found that none of the existing mechanisms could scale to massive parallel systems. As a solution, we developed a new algorithm, called the dynamic 3D-Wave, that is able to reveal thousands of independent tasks exploiting macroblock-level parallelism. We implemented intra-frame macroblock-level parallelism on a Distributed Shared Memory (DSM) parallel machine but this implementation was not able to reach the maximum performance due to the negative impact of thread synchronization and the effect of the entropy decoding kernel. iii In order to eliminate these bottlenecks we proposed a parallelization of the entropy decoding stage at the frame-level combined with a parallelization of the other kernels at the macroblock-level. A further performance increase was obtained by using different type of processors for each type of kernel. The overhead of thread synchronization was almost eliminated with a special hardware support for synchronization operations. With all the presented enhancements we were able to process, in real-time, video decoding applications at high definition and at high frame-rate. We created a scal- able solution that is able to use the growing number of cores in emerging multicore architectures. iv Acknowledgments This thesis has been the result of a long effort during many years of my life. During that time I received the support of many people in many different ways. First, I would like to express my gratitude to my thesis advisors Mateo Valero and Alex Ram´ırez. I want to thank Mateo for giving me the opportunity of making the PhD at the High Performance Computing Group at UPC under his direction. You also suggested me the idea of working on architectures for multimedia applications and, specifically, the idea of doing my research on high definition video processing using H.264/AVC. This field now constitutes my main area of interest and, it seems that, I am going to continue working on that for some time. I only regret for not being able to have more meetings like those of the first period of the doctorate. Also, I want to express my gratitude to Alex Ram´ırezfor all his efforts and dedication during these years. You suggested me to cooperate with other research groups that were working on similar topics. These collaborations (with researchers from TU-Delft, NXP and others) gave me the opportunity to work in an international collaborative environment. Some of the main contributions of this thesis have been the result of these collaborations. I am also thankful to Esther Salam´ıfor working with me during the first years of the PhD. I learned a lot working with you, especially in those practical issues that are difficult on the day-by-day research. Also, I want to recognize your words of support which helped me a lot when things were not clear. I am specially grateful to my friend and colleague Friman S´anchez. During these years we have supported each other and we have forged a great friendship. Apart from the technical discussions I have enjoyed very much the shared readings and conversations about literature, philosophy, politics and history. They helped me to understand what are the important things in life. I thank Cor Meenderinck and Arnaldo Azevedo from TU-Delft for the collaboration that we developed about parallelization of video codecs. I learned a lot from you and enjoyed our meetings at different places of Europe. I am specially grateful with Ben Juurlink (now in TU-Berlin) for the fruitful col- laboration we have had over the years about parallel H.264 decoding. I thank your welcome in Berlin during the spring of 2011 and your dedication and help with my the- sis. I also want to thank Chi Ching Chi for the good time working together on HEVC parallelization during my visit to TU-Berlin. I thank Ayal Zaks and Uzi Shvadron for helping me during my visit to IBM Haifa Labs. It was a really good experience to work with you during these months. I am very grateful for your kind welcome in Israel. I also want to express my gratitude to my colleagues of the PhD program at DAC UPC. Stefan Bieschewski, Nikolaos Galanis, Miquel Peric`as,Germ´anRodr´ıguezand others. My experience doing the PhD and living in Barcelona would not have been the same without our shared moments in the C6 room and other places. I thank Felipe Cabarcas and Alejandro Rico for their support with the TaskSim v simulator. A part of this thesis has been possible thanks to your work. I also want to thank the people from System Administration at DAC (LCAC) for their professionalism, responsiveness and for their commitment to free-software.