Parallel Heterogeneous Computing: A Case Study on Accelerating JPEG2000 Coder
Parallel Heterogeneous Computing: A Case Study On Accelerating JPEG2000 Coder

by Ro-To Le
M.Sc., Brown University; Providence, RI, USA, 2009
B.Sc., Hanoi University of Technology; Hanoi, Vietnam, 2007

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The School of Engineering at Brown University

PROVIDENCE, RHODE ISLAND
May 2013

© Copyright 2013 by Ro-To Le

This dissertation by Ro-To Le is accepted in its present form by The School of Engineering as satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date R. Iris Bahar, Ph.D., Advisor
Date Joseph L. Mundy, Ph.D., Advisor

Recommended to the Graduate Council

Date Vishal Jain, Ph.D., Reader

Approved by the Graduate Council

Date Peter M. Weber, Dean of the Graduate School

Vitae

Roto Le was born in Duc-Tho, Ha-Tinh, a countryside area in the Midland of Vietnam. He received his B.Sc., with Excellent Classification, in Electronics and Telecommunications from Hanoi University of Technology in 2007. Soon after receiving his B.Sc., Roto came to Brown University to start a Ph.D. program in Computer Engineering in Fall 2007. His Ph.D. program was sponsored by a fellowship from the Vietnam Education Foundation, for which he was selectively nominated by scientists of the National Academies. During his Ph.D. program, he earned an M.Sc. degree in Computer Engineering in 2009.

Roto has studied several aspects of modern computing systems, from hardware architecture and VLSI system design to high-performance software design. He has published several articles on designing a parallel JPEG2000 coder based on heterogeneous CPU-GPGPU systems and on designing novel Three-Dimensional (3D) FPGA architectures. His current research interests concern the exploration of heterogeneous parallel computing, graphics computing, GPGPU architectures, the OpenCL programming platform and drivers for GPGPU-based systems.
Brown University, Providence, RI 02912, USA

Acknowledgements

I would like to express my sincere gratitude to my advisors, Professor Iris Bahar and Professor Joseph Mundy, for their patience, encouragement and guidance during this work. Without them, this project would not have been possible. I am also grateful to Dr. Vishal Jain, my thesis committee member, for his invaluable feedback and the effort he has put into reading this dissertation manuscript.

I would like to thank the Vietnam Education Foundation and Brown University for jointly sponsoring the invaluable fellowship that allowed me to start my Ph.D. program. Without it, I would not have been able to make such a big leap in coming to Brown.

I would also like to thank all my friends and colleagues, who have made my years at Brown even more pleasant. With their friendship and our cultural and intellectual exchanges, my time at Brown has been a great exploration in my life. I would like to thank Mai Tran, Ngoc Le, Thuy Nguyen, Dung Han, Van Nghiem, Atilgan Yilmaz, Gokhan Dimirkan, Kat Dimirkan, Ahmet Eken, Ozer Selcuk, Cesare Ferri, Elejdis Kulla, Octavian Biris, Kumud Nepal, Marco Donato, Andrea Marongiu, Fabio Cremona, Andrea Calimera, Nuno Alves, Elif Alpaslan, Yiwen Shi, Stella Hu, James Kelley, Satrio Wicaksono, my friends in VietPlusPVD and my friends in VEFFA.

Last but not least, I deeply thank my family and my senior friends. I thank my parents, Le Van Dau and Pham Thi Lan, and my brothers and sisters, Le Thanh-A, Le Y-Von, Le Puya, Le Tec-Nen and Le Diem Y, and their families, for their unwavering support. I thank my senior friends, Mr. Richard Nguyen, Mrs. Chi Nguyen, Mr. Hoang Nhu, Mr. Linh Chau, Dr. Nam Pham and Dr. Thang Nguyen, for their words of wisdom, which have delighted me and helped me overcome challenging moments in my life.

Preface

Through the evolution of modern computing platforms, in both hardware and software, processing heavy multimedia content has never been so exciting.
While modern processing units such as CPUs, GPGPUs, FPGAs, DSP processors and ASICs offer ever greater computing capabilities, the demands from the application domain remain very high. These demands cannot be met simply by utilizing massive numbers of processing units. Rather, they require more complex and efficient hardware-software co-design that takes into account power and economic budgets, integration form factors and similar constraints.

This study conducts an exhaustive exploration of both the hardware and software domains of modern parallel computing platforms with two primary objectives: (1) to understand the performance of the platforms on complex multimedia processing applications; (2) to seek efficient design approaches to accelerate these applications based on parallel computing. The study focuses on the two most popular general purpose parallel computing platforms, which are based on multicore CPUs and manycore GPGPUs. The platforms' characteristics are observed from their performance in accelerating the case study of the JPEG2000 image compression standard, a very interesting and challenging media coding application.

The exploration can be divided into two main phases. First, it analyzes the case study of the JPEG2000 application and the computing platforms. The analysis provides insights into the key operations within the JPEG2000 coding flow and into the architectures and execution models of the CPUs and GPGPUs. The first phase is therefore a crucial foundation for the second phase, which proposes novel design approaches to accelerate the JPEG2000 coder and also draws several conclusions about the performance of modern parallel computing platforms.

The design process starts with a GPGPU-only approach to accelerate the JPEG2000
In order to leverage the massively parallel processing capability of GPG- PUs’ SIMD architecture, novel parallel processing methods are proposed to expose very fine-grained parallelism. Significant performance speedups are achieved for the JPEG2000 flow. The GPGPU-based parallel bitplane coder runs more than 30× faster than a well-known single-threaded implementation. The Tier-1 encoding stage gains more than 17× speedup. However, despite the great capabilities in parallel processing, GPGPU-based platforms reveal several critical drawbacks. First, GPGPU-based software imple- mentations require heavy optimization efforts. Secondly, GPGPUs lack flexibility to completely handle a complex computational flow by themselves. To overcome the shortcomings of GPGPUs, this study discovers a more efficient approach for general purpose parallel computing: the heterogeneous parallel computing approach. In par- ticular, the heterogeneous approach makes use of collaborating heterogeneous pro- cessing units (i.e. GPGPUs and CPUs) to accelerate different stages of a complex computing flow. In accelerating the JPEG2000 decoder, the experimental results show that the heterogeneous approach is significantly more efficient. While GPG- PUs can provide great parallel arithmetic throughput, flexible CPUs can act very well in orchestrating the whole computational flow. Additionally, modern multicore CPUs show great capabilities in parallel computing too. The multicore CPUs can even outperform manycore GPGPUs in solving complex task-level parallel programs. Moreover, not being limited to exploiting just the heterogeneity of different hard- ware devices, the design process also exploits soft-heterogeneity of different software runtime environments. vii Contents Vitae iv Acknowledgments v Preface vi 1 Introduction 1 1.1 Exploiting Modern Computing Platforms To Accelerate Multimedia ProcessingApplications . 2 1.2 Choosing JPEG2000 Standard As A Case Study . 
    1.2.1 The Motivation
    1.2.2 JPEG2000 Application Domains
    1.2.3 Demand For High-Performance JPEG2000 Coder
  1.3 Thesis Contributions
  1.4 Organization Of The Thesis
2 Background On The JPEG2000 Image Compression Standard
  2.1 Introduction To The JPEG2000 Image Compression Standard
    2.1.1 JPEG2000 Standard History
    2.1.2 JPEG2000 Advanced Features
  2.2 JPEG2000 Coding Flow
  2.3 JPEG2000 Wavelet Transform
  2.4 JPEG2000 Bitplane Coding
    2.4.1 JPEG2000 Data Structure
    2.4.2 JPEG2000 Bitplane Coding
    2.4.3 Bitplane Coding Passes
  2.5 JPEG2000 Entropy Coding
    2.5.1 Understanding The Legacy Arithmetic Coder
    2.5.2 The MQ Coder In JPEG2000
  2.6 Conclusion
3 Modern Hardware Architectures For Parallel Computing
  3.1 Modern Hardware Architectures For Parallel Computing
    3.1.1 Understanding Computing Performance
    3.1.2 Exploiting Parallelism To Increase Performance
    3.1.3 Exploiting Instruction Level Parallelism
    3.1.4 Exploiting Data Level Parallelism
    3.1.5 Exploiting Thread Level Parallelism With Multicore
  3.2 Mainstream General Purpose Processing Units
  3.3 General Purpose Graphics Processing Units
    3.3.1 General Purpose Graphics Processing Units Overview
    3.3.2 GPGPU Core Processing Architecture
  3.4 Conclusion
4 OpenCL Programming Model
  4.1 Open Computing Language For Parallel Heterogeneous Computing
  4.2 Anatomy Of OpenCL
    4.2.1 OpenCL Platform Model
    4.2.2 OpenCL Execution Model: Working Threads
    4.2.3 OpenCL Execution Model: Context And Command Queues
    4.2.4 Memory Model: Memory Abstraction
    4.2.5 Synchronization In OpenCL
  4.3