
Parallel Heterogeneous Computing: A Case Study On Accelerating JPEG2000 Coder

by Ro-To Le

M.Sc., Brown University; Providence, RI, USA, 2009

B.Sc., Hanoi University of Technology; Hanoi, Vietnam, 2007

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

in The School of Engineering at Brown University

PROVIDENCE, RHODE ISLAND

May 2013

Copyright 2013 by Ro-To Le

This dissertation by Ro-To Le is accepted in its present form

by The School of Engineering as satisfying the

dissertation requirement for the degree of Doctor of Philosophy.

Date

R. Iris Bahar, Ph.D., Advisor

Date

Joseph L. Mundy, Ph.D., Advisor

Recommended to the Graduate Council

Date

Vishal Jain, Ph.D., Reader

Approved by the Graduate Council

Date

Peter M. Weber, Dean of the Graduate School

Vitae

Roto Le was born in Duc-Tho, Ha-Tinh, a countryside area in the midland region of Vietnam. He received his B.Sc., with Excellent Classification, in Electronics and Telecommunications from Hanoi University of Technology in 2007. Soon after receiving his B.Sc., Roto came to Brown University to start a Ph.D. program in Computer Engineering in Fall 2007. His Ph.D. program was sponsored by a fellowship from the Vietnam Education Foundation, for which he was selectively nominated by the National Academies’ scientists. During his Ph.D. program, he earned an M.Sc. degree in Computer Engineering in 2009.

Roto has been studying several aspects of modern computing systems, from hardware architecture and VLSI system design to high-performance software design.

He has published several articles on designing a parallel JPEG2000 coder based on heterogeneous CPU-GPGPU systems and on designing novel Three-Dimensional (3D) FPGA architectures. His current research interests concern the exploration of heterogeneous parallel computing, graphics computing, GPGPU architectures, the OpenCL programming platform and drivers for GPGPU-based systems.

Ro-To [email protected]

Brown University, Providence, RI 02912, USA

Acknowledgements

I would like to express my sincere gratitude to my advisors, Professor Iris Bahar and Professor Joseph Mundy, for their patience, encouragement and guidance during this work. Without them, this project would not have been possible.

I am also grateful to Dr. Vishal Jain, my thesis committee member, for his invaluable feedback and the effort he has put in reading this dissertation manuscript.

I would like to thank the Vietnam Education Foundation and Brown University for jointly sponsoring the invaluable fellowship that allowed me to start my Ph.D. program. Without it, I would not have been able to make such a big leap in coming to Brown.

I also would like to thank all my friends and colleagues, who have made my years at Brown even more pleasant. With their friendship and our cultural and intellectual exchanges, the time at Brown has been a great exploration in my life. I would like to thank Mai Tran, Ngoc Le, Thuy Nguyen, Dung Han, Van Nghiem, Atilgan Yilmaz, Gokhan Dimirkan, Kat Dimirkan, Ahmet Eken, Ozer Selcuk, Cesare Ferri, Elejdis Kulla, Octavian Biris, Kumud Nepal, Marco Donato, Andrea Marongiu, Fabio Cremona, Andrea Calimera, Nuno Alves, Elif Alpaslan, Yiwen Shi, Stella Hu, James Kelley, Satrio Wicaksono, my friends in VietPlusPVD, and my friends in VEFFA.

Last but not least, I deeply thank my family and my senior friends: my parents Le Van Dau and Pham Thi Lan, my brothers and sisters Le Thanh-A, Le Y-Von, Le Puya, Le Tec-Nen and Le Diem Y, and their families, for their unwavering support; and my senior friends Mr. Richard Nguyen, Mrs. Chi Nguyen, Mr. Hoang Nhu, Mr. Linh Chau, Dr. Nam Pham and Dr. Thang Nguyen, for their words of wisdom, which have delighted me and helped me overcome challenging moments in my life.

Preface

Through the evolution of modern computing platforms, in both hardware and software, processing heavy multimedia content has never been so exciting. While in the hardware domain modern processing units such as CPUs, GPGPUs, FPGAs, DSP processors and ASICs are increasingly offering tremendous computing capabilities, the demands from the application domain are still very high. These demands cannot be met by simply utilizing massive numbers of processing units. Rather, they require more complex and efficient hardware-software co-design with considerations towards power and economic budgets, integration form factors, etc.

This study conducts an exhaustive exploration of both the hardware domain and the software domain of modern parallel computing platforms with two primary objectives: (1) to understand the performance of the platforms on complex multimedia processing applications; (2) to search for efficient design approaches to accelerate the applications based on parallel computing. The study focuses on the two most popular general purpose parallel computing platforms, which are based on multicore CPUs and manycore GPGPUs. The platforms’ characteristics are observed from their performance in accelerating the case study of the JPEG2000 image compression standard, a very interesting and challenging media coding application.

The exploration can be divided into two main phases. First, it analyzes the case study of the JPEG2000 application and the computing platforms. The analysis process provides insights into the key operations within the JPEG2000 coding flow and into the architectures and execution models of the CPUs and GPGPUs. The first phase therefore is a crucial foundation for the second phase, which proposes novel design approaches to accelerate the JPEG2000 coder and also draws many interesting conclusions about the performance of modern parallel computing platforms.

The design process starts with a GPGPU-only approach to accelerate the JPEG2000 encoder. In order to leverage the massively parallel processing capability of GPGPUs’ SIMD architecture, novel parallel processing methods are proposed to expose very fine-grained parallelism. Significant performance speedups are achieved for the JPEG2000 flow. The GPGPU-based parallel bitplane coder runs more than 30× faster than a well-known single-threaded implementation. The Tier-1 encoding stage gains more than 17× speedup.

However, despite the great capabilities in parallel processing, GPGPU-based platforms reveal several critical drawbacks. First, GPGPU-based software implementations require heavy optimization efforts. Secondly, GPGPUs lack the flexibility to completely handle a complex computational flow by themselves. To overcome the shortcomings of GPGPUs, this study discovers a more efficient approach for general purpose parallel computing: the heterogeneous parallel computing approach. In particular, the heterogeneous approach makes use of collaborating heterogeneous processing units (i.e. GPGPUs and CPUs) to accelerate different stages of a complex computing flow. In accelerating the JPEG2000 decoder, the experimental results show that the heterogeneous approach is significantly more efficient. While GPGPUs can provide great parallel arithmetic throughput, flexible CPUs can act very well in orchestrating the whole computational flow. Additionally, modern multicore CPUs show great capabilities in parallel computing too. The multicore CPUs can even outperform manycore GPGPUs in solving complex task-level parallel programs. Moreover, not being limited to exploiting just the heterogeneity of different hardware devices, the design process also exploits the soft-heterogeneity of different software runtime environments.

Contents

Vitae iv

Acknowledgments v

Preface vi

1 Introduction 1
   1.1 Exploiting Modern Computing Platforms To Accelerate Multimedia Processing Applications 2
   1.2 Choosing JPEG2000 Standard As A Case Study 6
       1.2.1 The Motivation 6
       1.2.2 JPEG2000 Application Domains 7
       1.2.3 Demand For High-Performance JPEG2000 Coder 9
   1.3 Thesis Contributions 11
   1.4 Organization Of The Thesis 13

2 Background On The JPEG2000 Standard 14
   2.1 Introduction To The JPEG2000 Image Compression Standard 15
       2.1.1 JPEG2000 Standard History 15
       2.1.2 JPEG2000 Advanced Features 16
   2.2 JPEG2000 Coding Flow 21
   2.3 JPEG2000 Transform 22
   2.4 JPEG2000 Bitplane Coding 26
       2.4.1 JPEG2000 Data Structure 26
       2.4.2 JPEG2000 Bitplane Coding 28
       2.4.3 Bitplane Coding Passes 29
   2.5 JPEG2000 Entropy Coding 37
       2.5.1 Understanding The Legacy Arithmetic Coder 37
       2.5.2 The MQ Coder In JPEG2000 39
   2.6 Conclusion 45

3 Modern Hardware Architectures For Parallel Computing 46
   3.1 Modern Hardware Architectures For Parallel Computing 47
       3.1.1 Understanding Computing Performance 48
       3.1.2 Exploiting Parallelism To Increase Performance 50
       3.1.3 Exploiting Instruction Level Parallelism 53
       3.1.4 Exploiting Data Level Parallelism 56
       3.1.5 Exploiting Thread Level Parallelism With Multicore 58
   3.2 Mainstream General Purpose Processing Units 62
   3.3 General Purpose Graphics Processing Units 65
       3.3.1 General Purpose Graphics Processing Units Overview 66
       3.3.2 GPGPU Core Processing Architecture 68
   3.4 Conclusion 72

4 OpenCL Programming Model 74
   4.1 Open Computing Language For Parallel Heterogeneous Computing 75
   4.2 Anatomy Of OpenCL 76
       4.2.1 OpenCL Platform Model 77
       4.2.2 OpenCL Execution Model: Working Threads 78
       4.2.3 OpenCL Execution Model: Context And Command Queues 81
       4.2.4 Memory Model: Memory Abstraction 83
       4.2.5 Synchronization In OpenCL 86
   4.3 Mapping OpenCL Model To Hardware Devices 88
       4.3.1 Mapping OpenCL Model To Multicore CPUs 89
       4.3.2 Mapping OpenCL Model To Manycore GPGPUs 90
   4.4 Conclusion 92

5 Design Novel Parallel Processing Methods For JPEG2000 Targeting Manycore GPGPUs 94
   5.1 Introduction 94
   5.2 Previous Work On Accelerating JPEG2000 96
   5.3 JPEG2000 Encoder Performance Profile 99
   5.4 JPEG2000 Bitplane Coder Analysis 101
   5.5 Parallel Bitplane Coder Design 106
       5.5.1 Fine-grained Parallel Approach For Bitplane Coder 107
       5.5.2 Removing Inter-Sample Dependency 109
       5.5.3 Predicting State Variables 113
   5.6 Designing A Parallel Arithmetic Coder 117
       5.6.1 Entropy Coding In JPEG2000 Revision 118
       5.6.2 Parallelizing The Classic Arithmetic Coder 119
       5.6.3 Handling Renormalization Points 122
       5.6.4 Parallel Context Adaptive AC 124
   5.7 Implementing The Tier-1 Coder On GPUs 126
       5.7.1 Mapping Parallel Tier-1 Coder To GPGPU 126
       5.7.2 Reducing Flow Control Operations 127
       5.7.3 Optimizing Memory Allocation 129
       5.7.4 Optimizing Memory Utilization 131
   5.8 Experimental Results 134
   5.9 Summary And Discussion 136

6 Accelerate JPEG2000 Decoder Using Heterogeneous Parallel Computing Approach 139
   6.1 Introduction 139
   6.2 Previous Work On JPEG2000 Decoder 142
   6.3 JPEG2000 Decoder Runtime Analysis 144
   6.4 Parallelizing Tier-1 Decoder 147
       6.4.1 Parallel Model Selection And Experimental Setup 147
       6.4.2 Parallelizing The Tier-1 Decoder On GPGPUs 150
       6.4.3 Optimizing the Parallel Tier-1 Decoder For GPGPUs 153
       6.4.4 Parallelizing The Tier-1 Decoder On CPUs 157
   6.5 Parallelizing The Inverse Discrete Wavelet Transform 160
   6.6 Pipelining The Decoding Stages On Heterogeneous Computing Units 163
       6.6.1 Extending Amdahl’s Law For Heterogeneous Systems 164
       6.6.2 Pipelining the JPEG2000 Decoding Flow 165
       6.6.3 Exploiting Soft-heterogeneity In Multiple Software Runtimes On Multicore CPUs For Pipelining 169
       6.6.4 Performance Results On A Heterogeneous System 171
   6.7 Summary And Discussion 173

7 Conclusion And Future Work 175

List of Tables

2.1 Reference table for Zero Coding on LL and LH subbands 35
2.2 Reference table for Zero Coding on HL subband 35
2.3 Reference table for Zero Coding on HH subband 35
2.4 Reference table for Sign Coding 36
2.5 Reference table for Magnitude Refinement Coding 37
2.6 Probability estimation look-up-table for MQ coder [60] 41
5.1 Runtime profile of JasPer [11] JPEG2000 encoder in lossless mode (running on an Intel Core i7 930 processor) 100
5.2 Runtime comparison for the BitPlane Coder (BPC) in different implementations: serial BPC of CPU-based JasPer [11] vs. our GPU-based parallel BPC vs. GPU-based parallel BPC in [61]. JasPer runs on an Intel Core i7 930 CPU. Both parallel BPC coders run on an Nvidia GTX 480 GPU. 135
5.3 Runtime comparison for Tier-1 coder of JasPer running on an Intel Core i7 930 CPU vs. our parallel Tier-1 running on an Nvidia GTX 480 GPU. 136

List of Figures

1.1 Showcase of many complex applications significantly accelerated on Nvidia GPGPUs using the CUDA computing platform. Performance gainsareuptotenstohundredstimes[36]...... 4 1.2 Relative runtime performance of JasPer [11] JPEG2000 encoder vs. IJG JPEG [50] encoder on a set of test images (lena, gold hill, ultra- sound[59])[68]...... 10 2.1 Lena image compressed with JPEG with a compression ratio of 1:80 . 18 2.2 Lena image compressed with JPEG2000 with a compression ratio of 1:80 ...... 18 2.3 Lena image decomposed into three wavelet resolution levels...... 19 2.4 JPEG2000EncodingFlow ...... 21 2.5 1Dlifting-based(5,3)filterbank...... 23 2.6 2DDWTfor2Dimage ...... 24 2.7 Lenaoriginalimage...... 25 2.8 Lenawaveletdecomposition ...... 25 2.9 PyramidDataStructureinJPEG2000 ...... 27 2.10 Viewing of codeblock, bitplane, sample stripe and context window in JPEG2000...... 28 2.11 Different data and state arrays that JPEG2000 maintains during the codingprocess...... 32 2.12 The Context formation flow of JPEG2000 Bitplane Coder ...... 33 2.13 Context Windows for different coding primitives ...... 34 2.14 Interval scaling in the ArithmeticCoder...... 38 2.15 Multiplier-freeMQcodermodel ...... 39 2.16 MainencodingflowoftheMQcoder ...... 42 2.17 FlowofencodingaLPSsymbolinMQcoder...... 43 2.18 FlowofencodingaMPSsymbolinMQcoder ...... 43 2.19 Renormalizationprocess inthe MQcoder...... 44 3.1 Frequency scaling and transistor integration in Intel’s CPU genera- tions[82]...... 51 3.2 RISC5-Stage InstructionPipelining...... 54 3.3 Exploiting Instruction-Level Parallelism using Superscalar technique [17] 55 3.4 Exploiting Data-Level Parallelism using SIMD technique ...... 57 3.5 Simultaneous Multithreading (SMT) with two concurrent threads . . 59 3.6 Conceptual Viewing of Singlecore vs. Multicore architecture . .. .. 60

xii 3.7 Power consumption in singlecore vs. multicore processor with the same compute througput. The rate at which instructions are retired is the same in these two cases, but the power is much less with two cores running at half the frequency of a single core [9]...... 61 3.8 A Conceptual SymmetricMulticore Processor ...... 63 3.9 Silicon die of an Intel 4-core Nehalem processor. The size of a silicon area is respective to the fabricated logic resource thus it can provide anestimationonlogicutilization[15] ...... 65 3.10 GPGPUs devote most of the logic resources to arithmetic units to target computationally-intensive applications while CPUs keep a bal- ance between logic resources for arithmetic units and control units to target a wide range of general purpose applications [39] ...... 67 3.11 The NVIDIA Fermi architecture [39]. This device has 16 streaming multiprocessor (SM), two SIMD arrays per SM. The SMs share a common GDDR5 memory system and L2 cache ...... 69 3.12 The NVIDIA Fermi architecture [39]. Each SM consists of two SIMD CUDA core arrays. Each SM includes a shared memory/L1 cache and a separate array of special function units(SFU). The SIMD CUDA cores share instruction cache, scheduler, dispatcher and register set . 70 3.13 Fermi Thread Scheduler and Instruction Dispatcher Unit [39] . . . . . 71 4.1 OpenCl platform model [17]. The plaform consist of a single host and other heterogeneous compute devices. The host interacts with computedevicesthroughOpenCLprogram ...... 78 4.2 A realistic hardware platform for parallel heterogeneous computing with OpenCL. The platform consists of three different processing units: an Intel Core i7 processor works as the host, a Nvidia GTX 480 GPGPU and an AMD HD 5800 GPGPU work as compute devices 79 4.3 OpenCL NDRange index space showing work-items, their global IDs and their mapping onto the pair of work-group and local IDs [43]. Each work-group may consist of multiple wavefronts(AMD term) or warps (Nvidiaterm)...... 80 4.4 The abstract memorymodeldefinedbyOpenCL...... 84 4.5 OpenCL Synchronization: local vs. global synchronization [17]. . . . . 87 4.6 MappingOpenCLmodeltoaphysicalCPU ...... 89 4.7 Assign multiple workgroups with a large number of workitems to each cpu thread [17]. a CPU hardware thread processes an entire work- group one work-item at a time before moving to the next work-group. 91 4.8 Mapping OpenCL model to a physical GPGPU ...... 92 5.1 Functional block diagram of Analog Device’s ADV 212 JPEG2000 ASIC chip [48]. ADV 212 employs three Tier-1 encoders (EC blocks) to encode three wavelet subbands of input image in parallel...... 97 5.2 Column-based processing architecture proposed by Lian et al. [66]. . . 98 5.3 JPEG2000EncodingFlow ...... 100 5.4 Viewing of codeblock, bitplane, sample stripe and context window in JPEG2000...... 102 5.5 Conditions for a bit to be encoded by Zero Coding(ZC) in Significance Propagation Pass (SPP) ...... 105 5.6 updating values of significance state bit (σ) of encoding bit-sample (in dataarray)acrossZCsteps ...... 106 5.7 PyramidDataStructureinJPEG2000 ...... 107

xiii 5.8 Performance of GPGPU-based CUJ2K parallel JPEG2000 coder on EBCOT Tier-1 stage [88]. CUJ2K’s coarse-grained approach achieve only about 3× speedup over serial JPEG2000 coder running on CPU. 109 5.9 An example of racing condition happens in parallel Zero coding process110 5.10 Bit-sample level parallel approach for JPEG2000. This very fine- grained approach allows numerous parallel threads to process not only codeblocks concurrently but also bit-samples within each codeblock concurrently...... 111 5.11 Scanning pattern in bit plane. Within each context window, a sample always has 4 visited neighbors and 4 non-visited neighbors ...... 111 5.12 Reduction steps on predicting significance state (SIG) variables. . . . 115 5.13 Interval scaling in the Arithmetic Coder. It recursively creates a se- quence of nested intervals of the form Φk(S) = hbk,lki based on the probabilityofinputsymbols ...... 120 5.14 Parallel prefix scanning algorithm [20]: compute an associative func- tion F (S) on a n − element S array using log(n) reduction steps . . . 122 5.15 An example of prefix scanning algorithm: it takes 8 parallel threads and log(8) = 3 scanning steps to finish computing F (a0,...,a7) . . . 122 5.16 Processing a stream using multiple parallel windows...... 123 5.17 32-bit length register with 24-bit waiting buffer ...... 124 5.18 19 separate buffers required for 19 contexts ...... 125 5.19 Mapping parallel BPC algorithm to existing graphics hardware. Each codeblock is assigned to be encoded by a streaming multiprocessor (SM). SMs directly work on the fast on-chip shared memory, where each codeblock’s sample and associative state variables and outputs are temporarily located. After each codeblock is completely processed, outputs are written back to the global memory. Constant look-up- tablesarelocatedintheconstantmemory...... 127 5.20 Context Windows for Zero Coding (ZC) and Sign Coding (SC) primi- tives. When a coding primitive codes a sample, it has to refer to state information of the sample’s neighbors defined in its respective context window...... 130 5.21 The sate variables of a sample and its neighbors (within its context window) are packed into a 16-bit state flag. The index of ZC or SC LUT can be extracted from the flag by applying a binary mask. . . . 130 5.22 Impact of codeblock size on compression efficiency and streaming mul- tiprocessor (SM) occupancy (in coding BIKE image). As codeblock size increases, compression efficiency increases but SM occupancy de- creases...... 131 5.23 Impact of codeblock size and processing method on the runtime speedup of the bitplane coder (in coding BIKE image). In cblk-based process- ing, increasing in cblk size leads to decreasing SM occupancy, which causes notable drops in speedup. The stripe-based processing signifi- cantlyimprovesthisissue...... 133 6.1 JPEG2000DecodingFlow ...... 145 6.2 Runtime Profile of JasPer JPEG2000 Decoder ...... 145 6.3 Pyramid Structure of the JPEG2000 Codestream...... 146 6.4 The process of reconstructing the sample values in Tier-1...... 148 6.5 Mapping the parallel Tier-1 decoder to existing graphics hardware. . 151

xiv 6.6 Context Windows for Zero Coding (ZC) and Sign Coding (SC) prim- itives. When a coding primitive codes a sample, it has to refer to the state information of the sample’s neighbors defined in its respective contextwindow...... 154 6.7 The state variables of a sample and its neighbors (within its context window) are packed into a 16-bit state flag. The index of ZC or SC LUT can be extracted from the flag by applying a binary mask. . . . 154 6.8 Runtime speedup of different versions of the GPGPU-based Parallel Tier-1 decoder compared to the JasPer JPEG2000...... 156 6.9 mapping task-level parallel Tier-1 decoder to a multicore CPU. . . . . 157 6.10 Runtime speedup of the CPU-based Parallel Tier-1 decoder kernel compiled by two different versions of Intel OpenCL compilers. By updating the compiler from V1.0 to V1.5, the performance increases from 3× to 6× forthesamekernel...... 158 6.11 Runtime speedup of different Parallel Tier-1 decoder implementations on CPU and GPGPU compared to the JasPer JPEG2000...... 159 6.12 1Dlifting-based(5,3)filterbank ...... 161 6.13 2DDWTfor2Dimage ...... 161 6.14 Runtime speedup of the parallel IDWT running on GTX480 GPU compared to the JasPer JPEG2000 in decoding the Aerial image . . . 163 6.15 Decoding a single JPEG2000 image frame on a GPGPU-CPU system 166 6.16 A simple pipeline scheme for JPEG2000 decoding flow ...... 167 6.17 The runtime speedups of the parallel JPEG2000 decoder using the simple pipeline scheme. The overhead and unbalance in stages make this scheme even slower than the un-pipeline scheme...... 168 6.18 A more complex pipeline scheme for the parallel JPEG2000 decoder. It runs two different programs: parallel Tier-1 and single-threaded Tier-2 on the same CPU simultaneously...... 169 6.19 Thread scheduling of the master thread and OpenCL threads on a multicore CPU. This execution scheme make it feasible to run both the C runtime’s thread and the OpenCL runtime’s threads simultaneously. 170 6.20 The runtime speedups of the parallel JPEG2000 decoder using differ- ent scheduling methods compared to the JasPer JPEG2000 decoder. The pipeline scheme that exploits soft-heterogeneity gains a significant boostinspeedups...... 172

Chapter 1

Introduction

Through the evolution of modern computing platforms, in both hardware and software, processing heavy multimedia content has never been so exciting. While in the hardware domain modern processing units such as CPUs, GPGPUs, FPGAs, DSP processors and ASICs are increasingly offering tremendous computing capabilities, the demands from the application domain are still very high. It is not only the multimedia content that becomes much heavier, but also the demands on functionality and processing speed that are significantly increased in the era of interactive multimedia. These demands cannot be met by simply utilizing a massive number of processing units. Rather, they require more complex and efficient hardware-software co-design with considerations towards power and economic budgets, integration form factor, etc. As a result, considerable efforts will be directed towards recasting existing software algorithms in order to leverage the newly available hardware processing power.

This chapter will first introduce the motivation behind this study and the effort in both the hardware and software domains in exploring the capabilities of modern hardware architectures in order to solve complex multimedia processing applications. Next, the exciting parallel processing capabilities of modern multicore CPUs and manycore GPGPUs will be described in an image and video processing context.

Finally, the motivation behind selecting the JPEG2000 image coding application will be presented as a case study in exploring modern parallel computing hardware.

1.1 Exploiting Modern Computing Platforms To Accelerate Multimedia Processing Applications

It is interesting to point out that despite the significant growth of parallel processing in the multimedia domain during the last several years, the well-known and widely adopted multimedia processing algorithms are still the ones that were developed during the serial-processing era. For example, in the image compression domain, the most widely adopted standards in the consumer segment and the professional segment are the JPEG standard [21] and the JPEG2000 standard [60], respectively. In the video compression domain, MPEG [53] and H.264/AVC [87] are currently the most commonly used standards for the recording, compression, and distribution of high definition video. All of these standards were developed from the 1990s to the early 2000s. The reason behind this stems from the fact that developing and adopting a new multimedia processing standard requires tremendous amounts of effort and time. Moreover, the multimedia processing industry still favors those well-known standards since their performance in terms of quality still meets the demands.

However, the well-known and widely-adopted serial-based multimedia flows are hitting a glass ceiling in processing speed since the size of media content (i.e. the resolution of images and videos) has significantly increased since the standards were released. In the early 2000s, when JPEG2000 and H.264/AVC were released, the sensor of an advanced consumer digital camera (such as the Nikon Coolpix 995 [56]) had a resolution of only 3 megapixels. Nowadays, such a camera (Nikon Coolpix P310 [57]) has a resolution of 16 megapixels. Back then, general purpose computing platforms had very limited capabilities of handling a complex algorithm such as JPEG2000 and H.264/AVC, which requires intensive amounts of arithmetic and memory operations. One of the best general purpose processors around that period was the Intel Pentium 4 processor [8], having only a single core operating at 2.4 GHz, with a slow main memory bus of 400 MHz. With this limited hardware platform, it was nearly impossible to achieve the desired performance for these computational flows despite any effort in algorithm design. Moreover, one of the well-known strategies in high-performance algorithm design is exploiting parallelism, but this was nearly impossible with just a single core on a single system. Therefore, during the early 2000s, the most feasible way to accelerate media processing flows was employing many-processor systems such as a distributed server cluster in [73]. However, this approach also could not achieve much speedup.

However, since then, high density VLSI integration technology as well as rapid growth in modern computer architecture designs have significantly improved the capability of general purpose hardware platforms. In the general purpose processing domain, modern CPUs have been significantly improved in arithmetic capability and memory bandwidth. In particular, there are often multiple processing cores with many parallel hardware techniques integrated in a single modern CPU, which can exploit parallelism effectively. For example, a modern Intel Core i7 processor now has up to 6 cores running at 3.2 GHz with 12 concurrent computing threads and a peak memory bandwidth of 17 GB/s, compared to the Pentium 4 processor of the 2000s era, which had only one core running at 2.4 GHz with a single thread and a peak memory bandwidth of 3 GB/s [8].

Figure 1.1: Showcase of many complex applications significantly accelerated on Nvidia GPGPUs using the CUDA computing platform. Performance gains are up to tens to hundreds of times [36].

Moreover, the high-performance computing domain has just introduced an entirely new class of general purpose processors, the manycore general purpose graphics processing units (GPGPUs) [37, 42]. With new parallel programming models and instruction set architectures, GPGPUs from prominent vendors such as Nvidia and AMD leverage parallel computation engines to solve many complex computational problems with massive arithmetic and memory capabilities. For example, a modern GPGPU can run tens of thousands of concurrent threads to provide a peak arithmetic performance of up to 2000 GFLOP/s with a peak memory bandwidth of up to 200 GB/s. GPGPUs have proved their great potential in accelerating many complex computational problems, particularly multimedia processing applications. Figure 1.1 shows many applications greatly accelerated by GPGPUs, with performance gains on the order of tens to hundreds of times.

In order to enable developers to exploit the great capability of modern processors in the hardware domain (e.g. multicore CPUs, manycore GPGPUs), the software domain has also been introducing numerous programming models. Programming models for exploiting parallelism have greatly evolved, some notable examples being Open MultiProcessing (OpenMP) [5], Intel Threading Building Blocks [31], Message Passing Interface (MPI) [4], Apple’s Grand Central Dispatch [27], Nvidia CUDA [35], and Khronos OpenCL [51]. In particular, the Open Computing Language (OpenCL) platform [51] has recently been introduced to support parallel computing across heterogeneous hardware processors such as CPUs, GPGPUs, Cell processors and even FPGAs and DSP processors.

Altogether, modern parallel hardware and software development platforms have shifted high-performance computing to a completely new paradigm. They introduce a great opportunity to reconsider the performance problem in multimedia processing applications. On the other hand, multimedia processing applications are good case studies for exploring the behavior of modern parallel computing platforms in solving realistic applications. Indeed, parallelism can be found on several levels, from coarse-grained to fine-grained, in multimedia applications and hence can allow testing the performance of parallel platforms on different levels of parallelism. Moreover, as mentioned earlier, the well-known multimedia processing algorithms were developed with a uniprocessor model and serial processing in mind. Consequently they may bring exciting new challenges in software-hardware co-design to adapt them to modern parallel platforms.

This study therefore aims at developing new computing approaches that use modern OpenCL development platforms to exploit the great capabilities of powerful multicore CPUs and manycore GPGPUs in order to accelerate a well-known and complex multimedia processing application in a computer system. Among many popular multimedia processing standards, the JPEG2000 standard is selected as the case study for this endeavor. More details will be introduced in the next section.

1.2 Choosing JPEG2000 Standard As A Case Study

1.2.1 The Motivation

As discussed earlier, modern parallel computing platforms introduce a great opportunity to reconsider the performance problem in multimedia processing applications. However, there are still many open questions about the behavior of those computing platforms on realistic applications, particularly on applications developed during the uniprocessor-based serial computing era. Therefore, it is crucial to select a realistic application as a case study for exploring modern parallel computing platforms. The case study should be able to allow the computing platform to expose its most beneficial characteristics. On the other hand, reconsidering the performance problem is one of the major goals in applying parallel computing. Hence, the selected case study should be an application that has a great demand for high performance so that the effort of this study can yield a dual benefit. Putting all of these factors together, this study will select the JPEG2000 image compression standard as its case study with the following motivations.

Digital imagery is pervasive in our world today. It appears in all market segments, from entry consumer to industrial applications to scientific and defense missions. This very wide variety of digital imaging is present not only in the application domain but also in the hardware domain. Nowadays, a digital camera can vary from integrated cameras on smart phones [6, 7] with resolutions on the order of megapixels, to high-end professional cameras for cinematography with resolutions of hundreds of megapixels [26], all the way to custom cameras for surveillance with enormous resolutions on the order of gigapixels [46]. Therefore, putting effort into improving the performance of an image compression standard always attracts the community. Particularly, improving a crucial standard such as JPEG2000 (discussed in Section 1.2.2) will be most suitable, primarily due to its great demand in performance (discussed in Section 1.2.3).

Moreover, as will be discussed in detail in Chapter 2, JPEG2000 has many special computational characteristics. On one hand, JPEG2000 processes image pixels at a very fine-grained level, coding every single bit of a pixel coefficient and thus exposing a very high degree of parallelism. On the other hand, JPEG2000’s processing flow is inherently serial and contains many complex control operations. This brings exciting design challenges not only in adapting its flow to parallel computing but also in efficiently utilizing the right parallel computing architectures (i.e. multicore vs. manycore architectures) for each processing stage of the flow.

1.2.2 JPEG2000 Application Domains

With excellent coding performance and many attractive features (introduced in Section 2.1.2 of Chapter 2), JPEG2000 has a very large potential application base. The potential applications [24] of JPEG2000 include:

• Internet multimedia

• Digital photography

• Digital cinema

• Medical imaging

• Wireless imaging

• Document imaging

• Remote sensing and GIS

• Scientific and industrial

• Surveillance and security

• Image archives and databases

In real world applications, JPEG2000 has been widely adopted in many domains. In the consumer software domain, JPEG2000 is now supported by many image processing applications and internet browsers such as ACDSee, Adobe Photoshop, Apple iPhoto, Firefox, Safari, etc. [91]. JPEG2000 is increasingly adopted in digital cinema for postproduction tasks such as editing, compression, packaging, and encryption [49]. JPEG2000 is also featured in the newly introduced 4K videography.

JPEG2000 is particularly used in the domain of surveillance and space imaging, with support from big industry companies such as BAE Systems and ERDAS IMAGINE and government agencies such as DARPA and NASA. The JPEG2000 standard is the core image coding engine of DARPA’s sophisticated Autonomous Real-time Ground Ubiquitous Surveillance-Imaging System (ARGUS-IS) [46]. ARGUS-IS is a real-time, high-resolution, wide-area video surveillance system that provides the warfighter with super-high-resolution images. ARGUS-IS can provide a minimum of 65 VGA video windows across the field of view and a maximum frame of 1.8 gigapixels, which can cover an area up to 40 km². ERDAS IMAGINE develops its own JPEG2000 SDK, named ERDAS ECW/JP2 SDK, targeted specifically at geospatial imaging applications. Even ICER, the file format used by the NASA Mars Rovers [72], has many design commonalities with JPEG2000 [63].

JPEG2000 is also supported in the hardware domain, mostly in ASICs such as Analog Devices’ ADV 212 chip [48] and proprietary custom FPGA-based implementations. Although it has been widely adopted, most JPEG2000 implementations are in proprietary software domains, with only a few open source projects available to the public, including JasPer [11], recommended by the JPEG committee, and the OpenJPEG JPEG2000 Codec [75].

1.2.3 Demand For High-Performance JPEG2000 Coder

The superior performance of JPEG2000 (introduced in Section 2.1.2) comes at a cost in computational complexity. The main drawback of the JPEG2000 standard compared to the current JPEG is that the coding algorithm is much more complex and the computational needs are much higher. In particular, JPEG2000 encodes images at a very fine level of granularity: it processes the image’s wavelet coefficients bitplane-wise and encodes every single bit using very complex context formation and entropy coding (presented in Chapter 2). Moreover, bitplane-wise processing may restrict good computational performance on a general-purpose computing platform. Consequently, analysis [68] shows that the JPEG2000 compression and decompression processes on general purpose processors, using single-threaded programs, are more than 10× slower than JPEG, as shown in Figure 1.2.

Slow JPEG2000 runtime performance is very critical in realistic applications where JPEG2000 often is not a stand-alone image coder but serves as a stage of a more elaborate image analysis/processing flow. For example, in surveillance applications, JPEG2000 frames are encoded in remote stations (such as remote cameras, drones, etc.) and the compressed frames are transmitted in realtime to base stations where they are decoded for viewing or for performing other tasks such as object recognition, object tracking, etc. Another example is in medical imaging, where high resolution medical images are often stored in JPEG2000 format and need to be decoded for analysis multiple times. Consequently, a slow JPEG2000 codec can slow down the whole process. Sometimes this slow performance is unacceptable in applications such as real-time surveillance and security. Moreover, as image sensors rapidly grow in resolution, higher demands on performance for image coding and processing are introduced, making the slow performance of JPEG2000 even more pronounced.

Figure 1.2: Relative runtime performance of JasPer [11] JPEG2000 encoder vs. IJG JPEG [50] encoder on a set of test images (lena, gold hill, ultrasound [59]) [68].

As a result, there has been a tremendous need to develop high-performance implementations of the JPEG2000 codec, in both software and hardware architectures. Since its first release in 2000, JPEG2000 has gathered a lot of attention from the community in accelerating its performance, with various software implementations [83, 69, 73, 88, 78] and hardware architectures for both special-purpose custom VLSI chips and FPGA-based designs [16, 48, 66, 86, 54]. However, due to limitations in both supporting platforms (e.g. computing model, processors, etc.) and algorithmic approaches, JPEG2000 codecs’ performance is still far from application demands.

In the hardware domain, the performance of one of the most popular ASIC chips for JPEG2000, the Analog Devices ADV 212, is still slower than needed in typical applications. Two ADV 212 chips are needed just to compress a real-time stream of SMPTE 274M (1080i) video [48]. Moreover, JPEG2000 performance is even slower in software implementations. For example, it takes a modern Intel Core i7 processor up to 17 seconds just to compress a typical high resolution image, 4096×4096 24-bit RGB, with the reference JasPer JPEG2000 software [11]. Therefore this study aims at exploiting the capabilities of powerful multicore CPUs and manycore GPGPUs in order to accelerate the JPEG2000 codec in a single computer system.

1.3 Thesis Contributions

To summarize, this thesis makes the following contributions:

• Explore new approaches to accelerate JPEG2000: New approaches are explored that exploit modern multicore CPUs’ and manycore GPGPUs’ parallel processing capabilities in order to accelerate JPEG2000. Different from previous studies, which often rely on the serial legacy JPEG2000 algorithm, this approach is based on novel algorithms and processing methods designed to expose fine-grained parallelism in the JPEG2000 coding flow and to exploit parallel hardware capabilities.

• Design novel parallel processing methods for JPEG2000: Novel processing methods are designed for the coding stages in the JPEG2000 flow, including bitplane coding, entropy coding and wavelet transformation, in both the JPEG2000 encoder and decoder. The parallel processing methods are optimized to target special characteristics of multicore CPUs and manycore GPGPUs. Significant performance speedups for the coding stages on both CPUs and GPGPUs have been achieved.

• Exploration of Heterogeneous Parallel Computing: This study started when GPGPUs had just been introduced and the parallel computing community had very high expectations of using GPGPUs to accelerate performance-demanding computing problems. However, through the study process, the necessity and advantages of a heterogeneous computing model, which exploits the unique capabilities of both CPUs and GPGPUs to solve realistic computing problems, became more and more obvious. Through the learning curve, trade-offs of different parallel design approaches have been pointed out, in both hardware and software, when solving realistic applications such as JPEG2000. As a result, task-level as well as data-level parallelism were exploited in different coding stages and have been efficiently mapped to both the CPU and GPU to achieve the best performance gain.

• Propose extensions for JPEG2000 targeting parallel processing: Although this study achieves significant performance gains with design approaches that are fully compatible with the JPEG2000 standard, there is still great potential to further increase the performance by altering coding stages of JPEG2000 to exploit parallel hardware more efficiently. JPEG2000 was designed in the serial processing era, therefore the coding stages in JPEG2000 are inherently serial. However, it is sometimes not necessary to follow those serial paths. Therefore, this thesis proposes extensions that slightly modify two major coding stages in JPEG2000, the context formation and entropy coding, to better expose parallelism for massively parallel hardware such as GPGPUs.

1.4 Organization Of The Thesis

The rest of this thesis is organized as follows. Chapter 2 provides background and related work on the JPEG2000 coding flow. Details on major coding stages including context formation, entropy coding and wavelet transform are introduced. Chapter 2 also discusses different approaches previously used to accelerate JPEG2000, in both the hardware and software domains. Chapter 3 and Chapter 4 present the background of modern parallel computing platforms in both the hardware and software domains. Chapter 3 introduces various hardware techniques to exploit parallelism. It also discusses two realistic parallel hardware platforms, the multicore CPU and the manycore GPGPU, which are used as the hardware backbone for this study. Chapter 4 introduces key concepts in the OpenCL software development platform, which is used in this study to program the CPUs and GPGPUs. In addition, it analyzes the OpenCL platform in three main categories: the OpenCL platform model, the OpenCL execution model and the OpenCL memory model. Next, Chapter 5 and Chapter 6 present various novel design techniques proposed by this study to accelerate the JPEG2000 coding flow on GPGPUs and CPUs. The design process introduces an exciting exploration of the capabilities of each type of computing unit in solving such a complex computational flow. It not only achieves significant speedups for the JPEG2000 coding flow but also leads to interesting approaches using heterogeneous parallel computing. Finally, Chapter 7 presents the conclusions and future directions.

Chapter 2

Background On The JPEG2000 Image Compression Standard

This chapter presents the background of the JPEG2000 image compression standard, the case study of this thesis. The JPEG2000 standard is very complex, having numerous extended features in addition to the basic compression functionality. Therefore, this chapter does not cover all the details of the standard. Rather, it will cover the core algorithms targeted for performance improvement, which will be further explored in the next chapters. Section 2.1 briefly introduces the history of JPEG2000’s development and its advanced features. Section 2.2 presents an overview of the JPEG2000 coding flow. Next, Sections 2.3, 2.4 and 2.5 present technical details of three major stages in the JPEG2000 coding flow: the wavelet transform, the bitplane coding and the entropy coding.


2.1 Introduction To The JPEG2000 Image Compression Standard

Digital imagery is pervasive in our world today. It appears in all market segments, from the entry-level consumer to industrial applications to scientific and defense missions. This very wide variety of digital imaging is present not only in the application domain but also in the hardware domain. Nowadays, a digital camera can vary from integrated cameras on smart phones [6, 7] with resolutions on the order of megapixels, to high-end professional cameras for cinematography with resolutions of hundreds of megapixels [26], all the way to custom cameras for surveillance with enormous resolutions on the order of gigapixels [46]. Consequently, standards for the efficient representation and interchange of digital images are essential. Therefore, this section introduces the JPEG2000 image standard, one of the newest international standards for image compression methods and file formats.

2.1.1 JPEG2000 Standard History

JPEG2000 is the international standard for image compression [13, 60, 84] developed by the Joint Photographic Experts Group (JPEG) [21] of the International Organization for Standardization (ISO) [2] and the International Electrotechnical Commission (IEC) [1], and also recommended by the International Telecommunication Union (ITU) [3]. This standard is designed to be the successor of the well-known and widely-accepted JPEG image compression standard [25].

JPEG2000 is motivated primarily by the need for compressed image representations offering features demanded by modern applications as well as superior compression performance [84]. The system architecture of this new standard has been defined in such a unified manner that it offers a single unified algorithmic framework and a single syntax definition of the code-stream organization. Therefore, different modes of operation can be handled by the same algorithm, and the same syntax definition offers the aforementioned desirable functionalities. This unique capability of JPEG2000 is referred to as compress once, decompress many ways.

The heart of the JPEG2000 algorithm, in contrast to JPEG technology which uses the Discrete Cosine Transform (DCT) [89], is based on the Discrete Wavelet Transform (DWT) [90], which provides high image compression with image quality superior to all existing standard encoding techniques even at low bit rates. Moreover, the wavelet-based compression method provides JPEG2000 a number of advantages over the JPEG format, which will be introduced shortly in Section 2.1.2.

Work on the JPEG2000 standard commenced with an initial call for contributions [71] in March 1997. The JPEG2000 core coding system (Part 1) [60] was released in 2000, and since then the development of the JPEG2000 standard has continued to provide new features in order to meet increasingly higher demands from field applications. To date, JPEG2000 consists of 14 parts [22, 91], with Part 14 [23] still under development and focusing on adding the capability of XML structural representation and reference.

2.1.2 JPEG2000 Advanced Features

As mentioned earlier, the JPEG2000 standard was developed to become the successor of JPEG, with a desire for rich features not only to overcome the limitations of the JPEG standard but also to be effective in vast areas of applications in the era of digital imaging. As a result, besides the basic compression functionality, numerous other features are provided in JPEG2000, including:

• Support for large image files with superior low bit-rate performance

• Progressive transmission by pixel accuracy and resolution

• Region of interest coding (ROI)

• Random access code stream

• Robust error resilience

Next, an overview of these advanced features of JPEG2000 and their impact on realistic applications will be presented.

JPEG2000 superior low bit-rate performance

JPEG2000 offers superior performance in terms of visual quality and PSNR (peak signal-to-noise ratio) at very low bit-rates (below 0.25 bit/pixel) compared to the baseline JPEG. JPEG2000 implementations outperform JPEG by up to 3 dB in PSNR [68]. The difference is relatively consistent across various bit rates. For the same compression ratio, a JPEG2000 image shows superior visual quality compared to a JPEG image, as shown in Figure 2.1 and Figure 2.2. At the compression ratio of 1:80, the visual quality of a JPEG-compressed Lena image is significantly decreased due to artifacts, while the JPEG2000-compressed image still maintains a good visual quality. This feature is very useful for the transmission of compressed images through a low-bandwidth channel.

Figure 2.1: Lena image compressed with JPEG with a compression ratio of 1:80
Figure 2.2: Lena image compressed with JPEG2000 with a compression ratio of 1:80
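For reference, the PSNR figures quoted above follow the usual definition for B-bit images (a standard formula, not specific to this thesis):

    PSNR = 10 log10( (2^B - 1)^2 / MSE )   [dB]

where MSE is the mean squared error between the original and the reconstructed image; a 3 dB gain therefore corresponds to roughly halving the MSE.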

Progressive transmission by pixel accuracy and resolution

The quality scalability and resolution scalability in JPEG2000 provide a unique progressive transmission capability. Images saved in the JPEG2000 format can be coded such that the image gradually increases in resolution, starting with a thumbnail, or gradually increases in quality as data is transmitted. A combination of these (and other) quality measures can also be achieved. The user can stop the image transmission once they have enough detail to make their next choice, as the data is ordered in the file in such a way that its delivery is simplified by image servers. This is possible by progressively decoding from the most significant bit-planes to the less significant bit-planes until all the bit-planes are reconstructed. The code-stream can also be organized as progressive in resolution, such that higher-resolution images are generated as more compressed data is received and decoded. This is possible by decoding and applying the inverse Discrete Wavelet Transform (iDWT) to more and more higher-level subbands that were generated by the multiresolution decomposition of the image via the forward DWT, as shown in Figure 2.3. These features are extremely effective for real-time applications such as web imaging and surveillance imaging, particularly in a system with a limited memory buffer, or for transmission through limited-bandwidth channels.

Region Of Interest coding (ROI)

In certain applications, it might be desired to code certain parts of a picture at a higher quality than the other parts . For example, in surveillance applications, a camera once might be interested in a particular region in a field (i.e. a truck on a road). This part needs to be given higher priority for quality and transmission. During decompression, the quality of the image can also be adjusted depending on the degree of interest in each ROI. JPEG2000’s ROI capability derives from the fact that code-blocks can be encoded independently with different precision. This capability of JPEG2000 is very useful in surveillance, scientific and medical imaging. 20

Random access code stream

JPEG2000 allows the decoder to randomly extract code-blocks from the compressed codestream, enabling the manipulation of certain areas of the image. This capability is useful to compressed-domain processing such as cropping, flipping, rotation, trans- lation, scaling, feature extraction etc. One might want to replace one object in the image with another, sometimes even with a synthetically generated image object. It is possible to extract the compressed code-blocks representing the object and replace them with compressed code-blocks of the desired object.

Robust error resilience

Robustness to bit-errors is highly desirable for transmission of images over noisy communications channels, particularly in wireless transmission channels such as mo- bile communications, remote surveillance etc. The JPEG2000 standard facilitates this by coding small size independent code-blocks and including resynchronization markers in the syntax of the compressed codestream. Since code-blocks are encoded and decoded independently, if some of them are distorted due to transmission error, the decoded image still preserves a large part of its information. There are also provisions to detect and correct errors within each code-block. The robust error resilience makes JPEG2000 applicable in emerging mobile multimedia applications. 21

2.2 JPEG2000 Coding Flow

In order to provide a big picture of the roles of the different coding stages in

JPEG2000, the overview of the entire JPEG2000 encoding and decoding flow is

first presented in this section.

The JPEG2000 encoding flow is shown in Figure 2.4. A multicolor transfor-

mation(MCT) is first applied to the target image if necessary. The MCT step is optional and is often used to transform the image color space from the RGB space

to the YCbCr color space. Since JPEG2000 encodes the color components of an image independently, it’s best to have a minimum correlation among them. It has

been proven that the YCbCr color space can yield a higher JPEG2000 encoding

efficiency [84].

9 : ; <

= > ? ; @ > A

t u v w x

i j k l

B > C @D < ¡ ¢ £ ¤ ¥ ¦ ¢ ¦ §        # $ % &' ( . / 0 1 2 3

m n n n

EF : G < ¨ § © ¦ §          ) * + , - % & 4 5 6 7 8 0 1     !  " 

o p q r s

H I J KL M L N S T U VW X V ^ _ ` a b c d a ` e

OM PI QI R P Y T Z[ \ V]T U f g h d _

Figure 2.4: JPEG2000 Encoding Flow

The MCT-transformed image components are then fed to a discrete wavelet transform engine, where the image pixel values are transformed into wavelet coefficients. The adoption of the wavelet transform in the JPEG2000 standard is one of the most important changes in JPEG2000 compared to its predecessor, the JPEG image coding standard [25]. Rather than incrementally improving the original JPEG standard, JPEG2000 implements an entirely new way of compressing images based on the discrete wavelet transform (DWT), which brings to JPEG2000 many advantages compared to the JPEG standard. JPEG2000’s discrete wavelet transform engine and its advantages will be further discussed in Section 2.3.

The DWT coefficients, which may be scalar quantized in lossy mode, are then par- titioned into individual code blocks and coded by the core coding engine of JPEG2000 called embedded block coding with optimized truncation (EBCOT). EBCOT is a two-tiered coder: Tier-1 is responsible for bit plane coding (BPC) and context adap- tive arithmetic encoding (AE); Tier-2 handles rate-distortion optimization and bit- stream layer formation. The EBCOT engine of JPEG2000 is the core of the standard and consists of a lot of complex technical details. Therefore each individual stages of the EBCOT engine will be further discussed in the next sections.

2.3 JPEG2000 Wavelet Transform

JPEG2000 implements an entirely new compression method compared to its prede- cessor, the JPEG standard. JPEG2000 compression is based on the discrete wavelet transform (DWT), in contrast to JPEG compression which uses the discrete co- sine transform (DCT). The adoption of the DWT in JPEG2000 is one of the most important changes, which brings many advantages to the standard. 23

The Discrete Wavelet Transform (DWT) is a multiresolution representation of

signals based on wavelet decomposition. Image processing applications adopt the

discrete version of the wavelet transform since images are often represented in an array of discrete pixels [12]. The DWT has traditionally been implemented by con-

volution with a FIR filter bank [67, 47]. However, the core coding system of the JPEG2000 standard [60] recommends a lifting-based DWT scheme [47] instead. The

lifting-based DWT is more suitable compared to the convolution-based scheme since

it can not only reduce the number of computations but also allow in-place compu- tation of wavelet coefficients leading to a reduced memory requirement. Moreover,

the lifting scheme exposes a high degree of parallelism [65]. Figure 6.12 illustrates the one dimensional(1D) lifting-based (5,3) filter bank used in .

The input samples X[i] are first split into odd and even samples (S0[i]= X[2i] and

D0[i]= X[2i +1]) . Afterwards, the predict and update steps are applied to the S0[i] and D0[i] samples using the following rules:

D1[i]= D0[i]+ a(S0[i]+ S0[i + 1]) (2.1)

S1[i]= S0[i]+ b(D1[i − 1] + D1[i]) (2.2)

y z { | } ~  €  ‚ ƒ „ †

¡ ¢ £ ¤ ¥ ¦ ¦

• – — ˜ ™ š

¡ ¡ ¦

§ ¨ © ª ª § © « ¬

‡ ˆ ‰ Š ‹ Œ

›

œ  ž Ÿ

­ ¤ ¥ ¦ ¦

¨ ®

¡ ¡ ¦

§ ¨ © ª ª § © « ¬

 Ž   ‘ ’ “ ”

Figure 2.5: 1D lifting-based (5,3) filter bank

−1 1 1 The filter coefficient values are a = 2 and b = 4. The outputs of D [i] are the 24

high-pass outputs of the filter and the outputs of S1[i] are the low-pass outputs of

the filter. Thus the result of the 1D DWT is a signal divided into a low-pass and

a high-pass subband. The 2D signals (e.g., images) are usually transformed in both dimensions by first applying a 1D DWT to the rows and then to the columns of the

previous output. The 2D DWT output consists of four subbands (labeled LL, HL, LH and HH) shown in Figure 6.13. The Inverse DWT (iDWT) can be derived by

following the DWT steps in reverse direction.

´ ´ ² ³

µ ¶ · ¸ ¹ º » ¼

Ë Ì Ì ½ ¾ ¾

À Á ÂÃ Ä Å ÆÇ ÈÉ Ê ¸ ¹ ¿

ÎÏ Ð Ñ Ò Ó Ô

Æ Ç Í

¯ ° ± ±

Figure 2.6: 2D DWT for 2D image

The wavelet decomposition helps achieve high compression efficiency because many images have most of their energy in the low-frequency subband, despite the fact that there is still some correlation between bands. Figure 2.8 shows an example of the wavelet decomposition of the well-known Lena image [74]. It is clear that most of the detail in the wavelet decomposition is concentrated in the LL subband. The wavelet subbands therefore have different dynamic ranges, the coefficients in the

LL band being often significantly larger than those in the LH, HL and HH bands. Consequently, the coefficients in the LH, HL and HH bands can be represented with significantly fewer bits, which leads to better compression efficiency.

In contrast to the DCT which applies the DCT filter on a small 8 × 8 block, the DWT analyzes the whole image. The sharp-edged 8 × 8 DCT blocks lead to very noticeable artifacts in JPEG-compressed images, especially when encoded at low bitrates. In JPEG2000 this problem is significantly reduced as a result of using 25

Figure 2.7: Lena original image Figure 2.8: Lena wavelet decomposition the DWT. This key difference allows JPEG2000-encoded images to obtain a superior spatial quality at the same compression rate compared to JPEG-encoded images, as shown in Figure 2.1 and Figure 2.2.

Moreover, the multiresolution nature of the wavelet decomposition enables a very special capability of JPEG2000, namely the multiresolutional scalability of images.

The pyramid structure of the wavelet decomposition allows a JPEG2000 image to be reconstructed at a certain resolution without having to decode the full-resolution code stream. This capability is very crucial since different clients often have different resolution requirements on the same frame. For example, an editing or analysis client often prefers the full-resolution of the frame while a web-view client needs only a relatively smaller resolution. 26

2.4 JPEG2000 Bitplane Coding

Section 2.3 has discussed the wavelet transform and its benefits towards efficient compression. However, the DWT is just one of the very first stages that prepares the preliminary data for the core coding engine of JPEG2000, namely EBCOT. The EBCOT-Embedded Block Coding with Optimized Truncation coding engine is a two-tiered coder: Tier-1 is responsible for bit plane coding (BPC) and context adaptive arithmetic encoding (AE); Tier-2 handles rate-distortion optimization and bitstream layer formation. This section will start presenting the EBCOT’s major technical details by discussing the bitplane coding stage (BPC).

2.4.1 JPEG2000 Data Structure

Before going into the details of the BPC stage, it is important to understand the organization of its input data. As shown in Figure 2.4, the wavelet coefficients obtained in the DWT stage are fed to the Bit Plane Coder in a very well structured way as illustrated in Figure 2.9.

Figure 2.9 illustrates a pyramid of different layers in the organization of the

JPEG2000 data structure. Consider an input image of size M × N pixels having

K(1 ≤ K ≤ 214) color components [13]. For example, a typical color image often has K =3or K = 4 color components (RGB, YUV, RGBA, etc.) while a gray scale image has only one color component, thus K = 1. As mentioned earlier, JPEG2000 encodes each color component independently, treating them as independent M × N arrays of coefficients. In some situations where the image is too large to be fit entirely in the system’s memory or due to design efficiency concerns, the image may

27

ÕÖ × Ø Ù

Ö Ü Ù Ö Ü Ù Ö Ü Ù

ÚÛ Û Ý Ý Þ ÚÛ Û Ý Ý Þ ÚÛ Û Ý Ý Þ

ß

è ç

× × × ×

à à á â ã ã ë à ì â ã ã à ë ì â ã ã ë ë ì â ã ã

Ý ä Ý ä Ý ä Ý ä

å Ù å Ù å Ù å Ù

ã æ ã æ ã æ ã æ

Û ä Û Ú ç è Û ä Û Ú ç é Û ä Û Ú ç ê Û ä Û Ú ç Ý

Figure 2.9: Pyramid Data Structure in JPEG2000 be partitioned into smaller tiles which will be encoded independently. Regardless of the tiling operation, each component is transformed via the DWT independently thus producing the wavelet subbands. Each subband is scaled down according to its resolution level as shown in Figure 6.13. JPEG2000 also encodes wavelet subbands independently. Note that the wavelet coefficients may be quantized if the JPEG2000 coder works in lossy mode.

Each wavelet subband is further partitioned into smaller subarrays called code blocks. Code blocks are the smallest data structure that the JPEG2000 coding engine can process independently. They are rectangular arrays having a nominal width

(Cbw) and nominal height (Cbh). These values are mutable and can be changed for different coding processes. However, the Cbw and Cbh values are subjected to certain constraints which are described in detail in [60, 84]. The notable constrains include, but are not limited to: 1) the nominal width and height of a code block must be an integer which is a power of two, and 2) the product of the nominal width and height cannot exceed 4096. The nominal width and height of a code block is typically 64 or 128. 28

2.4.2 JPEG2000 Bitplane Coding

As mentioned in Section 2.4.1, the (quantized) wavelet coefficients are partitioned into smaller, independent arrays called code blocks which are fed to the bitplane coder. The name of the bitplane coder is derived from the way that the coder encodes the code blocks, namely bitplane by bitplane. The bitplane organization of a code block can be illustrated by Figure 2.10.

--" +-" +-!

-+" ++" ! !" ! !!% (

-+! ++! ! !#! !"& !$ ! 23C7;7A 1B443<6@ !

!'' !& !$ !&" !

*=67 *=67 *=67 !#$ !( !& !% );=5: );=5: );=5:

*=67 );=5: )9<3?F 07>?7@7

49A >;3<7

.1) 1A?9>7

-1)

)9A-/;3<7@ ,< 3 *=67-);=5: 153<<9<8 =?67? 9< 3 49A >;3<7 *=

Figure 2.10: Viewing of codeblock, bitplane, sample stripe and context window in JPEG2000

The top of Figure 2.10 shows a snapshot of a code block having 4 × 4 samples.

Each code block sample is just an unsigned integer, which is represented using 0 and 1 bits. For example the sample 172 (highlighted in orange in the code block) is represented as 10101100, where the most significant bit (MSB) is bit 1 and the least 29 significant bit is bit 0. The coder processes the samples in the current codeblock bit by bit starting from the MSB, going one bit level at a time. The collection of bits in a certain level of all samples in a code block is called a bitplane and is shown in the bottom part of Figure 2.10. In other words, instead of encoding all coefficients of a code block at a time, JPEG2000 encodes a code block by encoding every single bit, going bitplane by bitplane, starting from the MSB bitplane. The bitplane processing scheme therefore allows JPEG2000 to process data at a very fine granularity. The disadvantage, however, is that it would take a massive amount of computation to finish the processing job in efficient time. The impact of fine-grained data processing on JPEG2000’s performance will be further discussed in the next chapters.

Although the bitplane-based processing scheme has already introduced a signifi- cant degree of complexity in JPEG2000 processing, the rules on processing bitplanes are even more complex, as shown in the bottom right corner of Figure 2.10. Within a certain bitplane of a code block, the bitplane coder divides the bit-samples into groups of 4 bit lines called stripes and the bitplane coder processes the bit-samples of a bitplane stripe by stripe. Furthermore, the bitplane coder must follow a restricted scanning order within a stripe, specifically, the coder has to scan bit-samples from top to bottom, left to right.

2.4.3 Bitplane Coding Passes

The previous section discussed the data structure of JPEG2000, the scanning rules and the processing order of the bitplane coder. However, the core of the coding engine and the coding algorithm have not yet been presented. This section will discuss the coding passes and the context formation process, one of the most important stages of the EBCOT coding engine. 30

The JPEG2000 context formation process is mainly adopted from the fractional coding scheme firstly developed by David Taubman in [83]. The core concept of the fractional coding scheme is that for each bitplane, p, the coding proceeds in a number of distinct coding passes, called fractional bitplane coding passes. JPEG2000 em- ploys three distinct coding passes namely Significant Propagation Pass, Magenitude Refinement Pass and Clean Up Pass. The reason for introducing multiple coding passes is to ensure that each code-block has a finely embedded bit-stream support- ing efficient truncation and rate-optimization in lossy encoding mode. Additionally, fractional coding passes allow JPEG2000 to efficiently form contexts for input bits in order to increase the coding efficiency of the upcoming stage. These aspects of the fractional coding passes will be further discussed in the next sections. For now, the coding primitives using the coding passes of JPEG2000 will be analyzed first.

The JPEG2000 bitplane coding process has four different primitive coding opera- tions which form the foundation of the embedded block coding strategy. Specifically these operations are “Zero Coding” (ZC), “Sign Coding” (SC), “Run-Length Cod- ing”(RLC), and ”Magnitude Refinement Coding” (MRC). The primitives are used to code new information for a single bit-sample in a certain bitplane, p. A specific coding pass in the bitplane coding may use one or more coding primitives.

Before going into the details of the operations of the coding primitives, it is important to first understand the preliminary terms and the concepts used in these primitives. One of the most important concepts is the context of a bit-sample, which is determined based on the value and the state of the current coding bit as well as its neighbors. To code a single bit-sample, JPEG2000 must gather information from all its neighbors in a 3 × 3 window shown in Figure 2.10. 31

It is important to note that JPEG2000 has to maintain some state information for

every code block sample besides the array holding the absolute values of the samples

themselves (υ) and the sign array(χ). Different types of data and state arrays that JPEG2000 maintains during the coding process are shown in Figure 2.11. A sign bit of χ[x,y] = 1, indicates that the sample at location(x,y) in the code block is negative.

0 There are three types of state information arrays labeled σ, σ , and η that JPEG2000 maintains for each bit-sample υ[x,y]. This state information is represented in bits called state bits. The state bits of each bitplane of size of m × n are also organized in a two dimensional array of the same size as the bitplane. Initially all state bits in

0 σ, σ , and η are set to zero.

• State bit σ is called significant state. σ[x,y] = 1 indicates that the first nonzero bit of υ[x,y] at row x and column y has been coded; otherwise it is equal to

0. Significant states are maintained until JPEG2000 finishes coding the last

bitplane of the code block (the LSB bitplane).

0 0 • State bit σ is called the refinement state. σ [x,y] = 1 indicates that a magni- tude refinement coding operation (defined in the next section) has been applied

to υ[x,y]; otherwise, it is equal to zero. Similar to the significant states, refine- ment states are maintained until JPEG2000 finishes coding the last bitplane

of the code block (the LSB bitplane).

• State bit η is called zero state. η[x,y] = 1 indicates that a zero coding operation

(defined in the next section) has been applied to υ[x,y] in the significant prop- agation pass; otherwise, it is equal to 0. Zero states are reset after JPEG2000

finishes coding every bitplane.

As mentioned earlier, JPEG2000 uses four different coding primitives(ZC, SC,

RLC, MRC) to encode bit-samples on every bitplane based on the 3 × 3 context 32

"#! "!! ""& )! ! ! ! !

"$" "#' "%! "!! ! " ! !

"(( "'! "%! "'# ! ! " "

"$% ")! "'! "&! ! ! ! "

-16:7>?34 *==1@ (" /76: *==1@ (!

! ! ! ! ! ! ! !

! ! " ! ! ! ! !

" " " " ! ! ! !

" " " " ! ! ! !

-/+ +7><81:4 ,:7>718 /76:75721:> *==1@ (#

! ! ! ! ! ! ! !

! ! ! ! ! ! ! !

! ! ! ! ! ! ! !

! ! ! ! ! ! ! !

,:7>718 .457:494:> *==1@ (#A ,:7>718 04=; />1>4 *==1@ ( !

Figure 2.11: Different data and state arrays that JPEG2000 maintains during the coding process 33

window as illustrated in Figure 2.11. In order to code a single bit-sample (i.e. the

bit at location (x,y)=(3,1)), the coding primitives may have to refer to up to five

different context windows in the magnitude array (υ) sign array (χ), significant state

0 array (σ), refinement state array (σ ), and zero state array (η). For this reason, it should be noted that the process of deciding context for bit-samples is called context formation process.

Although the rules of each coding primitives are different, their outputs are in the same form as shown in Figure 2.12. All coding primitives produce a pair of context index (Cx) and decision bit (D) for every single bit-sample, which is often refered to as a (Cx,D) pair. The context index is an integer number (0 ≤ Cx ≤ 18) that is associated with one of the 19 different contexts that JPEG2000 defines based on the state information it gathers in the 3 × 3 context window. The decision bit is a binary bit (D = {0, 1}) that will be actually coded by the arithmetic coder (which will be discussed in the next sections).

%*/2067,- (0/20.0+*26 '-.02-1-26 )-43 (6*6- (0/2 !065 !065 (6*6- !065 (6*6- !065 !065

"3,02/ &4010608-5

"326-96 $# #-+05032 !06 ("9 (#

! Figure 2.12: The Context formation flow of JPEG2000 Bitplane Coder 34

"0 $0 "1 $0

#0 % #1 #0 % #1

"! $1 " $1

"C C'&($*( !%&#') C C'&($*( !%&#') !

Figure 2.13: Context Windows for different coding primitives

Next, the rules that determine the value of the context and decision bit of an input bit-sample based on its state information [60, 84] will be discussed.

• Zero Coding (ZC): The decision bit (D) of a bit-sample in the ZC primitive is

identical to the value of that bit-sample. For example if υX = 1 then DX = 1

else DX = 0. The ZC primitive produces the context of a bit sample based on all significant state bits (σ) of its 8 neighbors. The context window and associated annotation for the neighboring significance states of the coding bit-

sample (X) is shown in Figure 2.13. The context formation rules are based on

the ”Zero Coding Tables” that are specified in [60] by the JPEG2000 standard. There are different tables, as shown in table 2.1, table 2.2, and table 2.3, for

different wavelet subbands (e.g. LL, LH, HL, HH) since the characteristics of the wavelet coefficients are different for each subband. 35

LL and LH Subbands Context H V D Cx P2 Px Px 8 1 ≥ 1 x 7 1 0 ≥ 1 6 1 0 0 5 0 2 x 4 0 1 x 3 0 1 ≥ 2 2 0 0 1 1 0 0 0 0

Table 2.1: Reference table for Zero Coding on LL and LH subbands

HL Subbands Context H V D Cx Px P2 Px 8 ≥ 1 1 x 7 0 1 ≥ 1 6 0 1 0 5 2 0 x 4 1 0 x 3 0 1 ≥ 2 2 0 0 1 1 0 0 0 0

Table 2.2: Reference table for Zero Coding on HL subband

HL Subbands Context (H + V ) D Cx P x P≥ 3 8 ≥ 1 2 7 0 2 6 ≥ 2 1 5 1 1 4 0 1 3 ≥ 2 1 2 0 1 1 0 0 0

Table 2.3: Reference table for Zero Coding on HH subband 36

• Sign Coding (SC): The D and Cx value for the sign coding primitive are deter- mined by references to horizontal neighbors’ (H) and vertical neighbors’ (V) significance and sign bits as shown, in Figure 2.13. Suppose that the current coding bit is υ[x,y], the values of H and V are computed by the following equations:

H = min[1,max(−1,σ[x, y − 1] × (1 − 2χ[x, y − 1]) + σ[x, y +1] × (1 − 2χ[x, y + l]))] (2.3)

V = min[1,max(−1,σ[x − 1, y] × (1 − 2χ[x − 1, y]) + σ[x +1, y] × (1 − 2χ[x +1, y]))] (2.4)

After the values of H and V are computed, the SC primitive uses both of

them and refers to the rule in Table 2.4 to determine the context (Cx) and the

decision bit value (D). The annotation χ is the binary reverse of the sign bit χ.

H V χ Cx 1 1 1 13 1 0 0 12 1 -1 0 11 0 1 0 10 0 0 0 9 0 -1 1 10 -1 1 1 11 -1 0 1 12 -1 -1 1 13

Table 2.4: Reference table for Sign Coding

• Magnitude Refinement Coding (MRC): Similar to the ZC primitive, the MRC

primitive just assigns the value of the decision bit (D) to the value of the magnitude bit (υ). In other words, D[x,y] = υ[x,y]. The context Cx for the

0 MRC is determined based on the refinement state (σ [x,y]) and the significance

state of all 8 neighbors. The rule for MRC context formation is summarized in Table 2.5. 37

0 0 σ [x − 1,y]+ σ [x + 1,y]+ 0 0 0 σ [x,y] σ [x − 1,y − 1] + σ [x − 1,y + 1]+ Cx 0 0 σ [x + 1,y − 1] + σ [x + 1,y + 1]+ 1 x 16 0 ≥ 1 15 0 0 14

Table 2.5: Reference table for Magnitude Refinement Coding

2.5 JPEG2000 Entropy Coding

The second stage of Tier-1 coding in JPEG2000 is entropy coding using context adap-

tive arithmetic encoding (AE) which eliminates redundant information to further compress the JPEG2000 bit stream. The entropy coding stage employs a multiplier- free context adaptive arithmetic coder named the MQ coder, which is a descendant of the multiplier-free Q coder [19]. The MQ coder takes a stream of context and de- cision (Cx,D) pairs from the bitplane coder and encodes each decision bit (D) based on its corresponding context (Cx). This is the reason why the MQ coder is referred to as a context adaptive coder. This section presents the core operations of the MQ coder in JPEG2000.

2.5.1 Understanding The Legacy Arithmetic Coder

As recently mentioned, JPEG2000 employs a multiplier-free arithmetic coder named

the MQ coder for the entropy coding stage. Therefore it is necessary to understand the legacy arithmetic coder model before explaining the MQ coder’s operations.

Arithmetic coding is a statistical coding method that encodes data symbols based

on their probability with variable-size code.

The arithmetic coding method is shown in Figure 2.14. The method starts with

38

¦ ¥ §

 £ ¤ ¡

¢

ö ¨

= - ©

õ + +

ù ÷

ø

ÿ

ý ü þ û

= - ú

ô ó

ï ñ ð î -í

Figure 2.14: Interval scaling in the Arithmetic Coder. a certain interval, reads the input file symbol by symbol, and uses the probability of each symbol to narrow the interval. It recursively creates a sequence of nested intervals of the form Φk(S)= hbk,lki where bk,lk are called the base and the length of the interval Φk respectively. In binary arithmetic coding, for example, bk and lk are computed based on values of bk−1, lk−1 and the input sample sk (sk = {0; 1}) using the set of equations (2.5):

(0) xk = lk−1C  k−1    lk = lk−1 − xk   if sk = 1   (2.5)  bk = bk−1 + xk     lk = xk   if sk = 0    bk = bk−1    

(0) (0) where Ck is the cumulative distribution of symbol ‘0’ up to step k. Ck is computed using equation (2.6) based on the number of times bit 0 and bit 1 appeared in the

(0) (1) input sequence up to step k (i.e., Nk and Nk ).

(0) (0) Nk + 1 Ck = (0) (1) (2.6) Nk + Nk + 2 39

2.5.2 The MQ Coder In JPEG2000

The MQ coder in JPEG2000 eliminates the multiplication operation in the legacy

Arithmetic coder (presented in Section 2.5.1) by using integer approximation for the scaling operations in Equation (2.5) and a look-up-table (LUT) for probability

estimation. The model of the MQ coder is shown in Figure 2.15.

& ' ( ) *  "  

7 8 9 : ; <

= > ?@ A 9 :

     *  "   5 6 6  

+

             "     

1 2 1 3 1 4 1 %

          ! #     $  %

*  )   ,      

!

,

B

@ > C 9 D C : 8 C I F 9 C 8 ?

!

B

E@ :F G C 8 @ > H @ A 9 :

-  )   ,  .  )   

/!     &  * 0  ) 

Figure 2.15: Multiplier-free MQ coder model

The MQ coder is a binary coder which codes each binary decision bit (D) based on its context (Cx). The MQ coder follows the recursive probability interval subdivision scheme of the legacy Arithmetic Coder. With each binary decision, the current probability interval is subdivided into two sub-intervals. The code string is modified (if necessary) such that it points to the base (the lower bound) of the probability sub-interval assigned to the symbol which occurred.

When partitioning the current interval (named A) into two sub-intervals, the sub-interval of the more probable symbol (MPS) is ordered above the sub-interval of the less probable symbol (LPS). Therefore, when the MPS is coded, the LPS subinterval is added to the code string. This coding convention requires that symbols be recognized as either MPS or LPS, rather than ‘0’ or ‘1’. The value of MPS (or

LPS) therefore changes frequently between ‘0’ and ‘1’. Consequently, the size of the LPS interval and the sense of the MPS for each decision must be known in order to code that decision. 40

Conventionally, following the division rule of arithmetic coding, given the current

interval (A) of the MQ coder and the current estimate of the LPS probability Qe, a precise calculation of the sub-intervals would be expressed in Equation (2.7).

A - Qe ∗ A = sub − interval for the MPS   (2.7) Qe ∗ A = sub − interval for the LPS   However, if A is always kept near 1 then the subintervals can be approximated by Equation (2.8). JPEG2000’s MQ coder keeps A in the range of [0.75 : 1.5).

A - Qe ∗ A ≈ A-Qe = sub − interval for the MPS  if A ≈ 1  (2.8) Qe ∗ A ≈ Qe = sub − interval for the LPS  

The above equation presented a method for scaling the coding interval A without the use of multiplication. However, the MQ coder may still require multiplication to compute the probability of the input symbols. For example, Equation (5.12) requires division/multiplication to compute the probability of symbols based on the counting method. To overcome this, the MQ coder employs a probability estimation based on a look-up-table [60, 84](Table 2.6), which is built from a complex Markov model, proposed by Langdon et. al [19]. Table 2.6 can be regarded as a finite state machine (FSM) which takes input bit D and the current state (Index) to decide its probability (Qe) and the next state. The operation of this complex FSM will be explained together with the operations of the MQ encoding flow in Figure 2.16,

Figure 2.17, and Figure 2.18. 41

Index Qe Value NMPS NLPS SWITCH (hexa) (binary) (decimal) 0 0x5601 0101 0110 0000 0001 0.503 937 1 1 1 1 0x3401 0011 0100 0000 0001 0.304 715 2 6 0 2 0x1801 0001 1000 0000 0001 0.140 650 3 9 0 3 0x0AC1 0000 1010 1100 0001 0.063 012 4 12 0 4 0x0521 0000 0101 0010 0001 0.030 053 5 29 0 5 0x0221 0000 0010 0010 0001 0.012 474 38 33 0 6 0x5601 0101 0110 0000 0001 0.503 937 7 6 1 7 0x5401 0101 0100 0000 0001 0.492 218 8 14 0 8 0x4801 0100 1000 0000 0001 0.421 904 9 14 0 9 0x3801 0011 1000 0000 0001 0.328 153 10 14 0 10 0x3001 0011 0000 0000 0001 0.281 277 11 17 0 11 0x2401 0010 0100 0000 0001 0.210 964 12 18 0 12 0x1C01 0001 1100 0000 0001 0.164 088 13 20 0 13 0x1601 0001 0110 0000 0001 0.128 931 29 21 0 14 0x5601 0101 0110 0000 0001 0.503 937 15 14 1 15 0x5401 0101 0100 0000 0001 0.492 218 16 14 0 16 0x5101 0101 0001 0000 0001 0.474 640 17 15 0 17 0x4801 0100 1000 0000 0001 0.421 904 18 16 0 18 0x3801 0011 1000 0000 0001 0.328 153 19 17 0 19 0x3401 0011 0100 0000 0001 0.304 715 20 18 0 20 0x3001 0011 0000 0000 0001 0.281 277 21 19 0 21 0x2801 0010 1000 0000 0001 0.234 401 22 19 0 22 0x2401 0010 0100 0000 0001 0.210 964 23 20 0 23 0x2201 0010 0010 0000 0001 0.199 245 24 21 0 24 0x1C01 0001 1100 0000 0001 0.164 088 25 22 0 25 0x1801 0001 1000 0000 0001 0.140 650 26 23 0 26 0x1601 0001 0110 0000 0001 0.128 931 27 24 0 27 0x1401 0001 0100 0000 0001 0.117 212 28 25 0 28 0x1201 0001 0010 0000 0001 0.105 493 29 26 0 29 0x1101 0001 0001 0000 0001 0.099 634 30 27 0 30 0x0AC1 0000 1010 1100 0001 0.063 012 31 28 0 31 0x09C1 0000 1001 1100 0001 0.057 153 32 29 0 32 0x08A1 0000 1000 1010 0001 0.050 561 33 30 0 33 0x0521 0000 0101 0010 0001 0.030 053 34 31 0 34 0x0441 0000 0100 0100 0001 0.024 926 35 32 0 35 0x02Al 0000 0010 1010 0001 0.015 404 36 33 0 36 0x0221 0000 0010 0010 0001 0.012 474 37 34 0 37 0x0141 0000 0001 0100 0001 0.007 347 38 35 0 38 0x0111 0000 0001 0001 0001 0.006 249 39 36 0 39 0x0085 0000 0000 1000 0101 0.003 044 40 37 0 40 0x0049 0000 0000 0100 1001 0.001 671 41 38 0 b41 0x0025 0000 0000 0010 0101 0.000 847 42 39 0 42 0x0015 0000 0000 0001 0101 0.000 481 43 40 0 43 0x0009 0000 0000 0000 1001 0.000 206 44 41 0 44 0x0005 0000 0000 0000 0101 0.000 114 45 42 0 45 0x0001 0000 0000 0000 0001 0.000 023 45 43 0 46 0x5601 0101 0110 0000 0001 0.503 937 46 46 0

Table 2.6: Probability estimation look-up-table for MQ coder [60]

42

T U V W XY Z [ \

 €  ‚  ƒ „  † ‡ ˆ ‰ † Š

J K L M NO P QR S

© ª ¦ § ¨

œ  ž Ÿ ¡ ¢ £ ¤ ¥

Ž   ‘ ’ “ ” • – — ˜ ™ š ›

] ^ _ ` a

© ª

b c d e f g

h i j k l m n

¦ § ¨

o p q r s t u o s v t

w x y z { | } x { ~

‹ Œ 

Figure 2.16: Main encoding flow of the MQ coder

Figure 2.16 shows the main encoding flow of the MQ coder. The flow starts with the initialization procedure which initializes starting values of various param- eters of the MQ encoding process including: the value of the interval (A) register, the value of code (C) register, initial MPS symbol, initial state of contexts (I(Cx)). After finishing the initialization, the MQ encoder starts reading the (Cx,D) pairs, one by one, from the (Cx,D) stream (produced by the bitplane coder as presented in Section 2.4) and encodes the decision bits (D) based on their context (Cx). Note that the MQ coder is a context adaptive coder, which means that the coding oper-

ations highly depend on the context of the input bit, which can be one of the 19 contexts defined in the bitplane coder in Section 2.4. Consequently, the MQ coder 43 maintains different state I(Cx) and MPS(Cx) values for every single context Cx.

The MQ coder then decides to code D as a MPS or LPS based on whether the con- dition MPS(Cx) = D is satisfied or not. The MQ coder encodes LPS and MPS

symbols differently as shown in Figure 2.17 and Figure 2.18. The LPS and MPS

« ¬ ­ ® ¯ ° ±

  

² ³ ² ´ µ ¶ · ¸ · ¹ º »»

            

Û Ü

4 5

Î Ï Ð Ñ ÒÓÒÔ Õ Ö Ö × Ø Ù Ú

[ \ ]

- . / 0 1 / / / 2 / 3

4 5

¼ ½ ¾ ¿ ÀÁÀÂ Ã ÄÄ Å Æ Å Ç È É ÊËÊÅ Ì ÍÍ

H I H J K L MNMH O PP Q R S T UVUW X YYZ [ \ ]

     !  " ## $ % & ' ()(* + ,,

¥ ¦ §

Ý Þ ß à á â ã ß ã á ä å å æ ç è

¨ ©

é ê ë ì í î ï ð ñ ò

é ê ë ì í î ï

6 7 8 9 :; < = > ? 7 6 7 8 9 ::

ó ô õ ö ÷ ø ù ú û ü ô ó ô õ ö ÷ ÷

@ A B C @ D A

ý þ ÿ ý ¡ þ

¢ £ ¤

E F G

Figure 2.17: Flow of encoding a LPS Figure 2.18: Flow of encoding a MPS symbol in symbol in MQ coder MQ coder encoding flows show how the probability estimation look-up-table (LUT) provided in Table 2.6 works. The probability Qe of an input bit D is predicted based on the state of D’s context, I(Cx). I(Cx) is actually the index which the MQ coder refers to in the LUT to get the estimated value Qe(I(Cx)). No explicit symbol counts 44

are needed for the estimation. It is important to remember that both the state I

and MPS value of a certain context Cx, change frequently. MPS(Cx) is inverted to

1 − MPS(I(Cx)) if SWITCH(I(Cx)) = 1. I(Cx) is changed based on the rule: I(Cx) = NLPS(I(Cx)), where NLPS stands for next LPS symbol of the context

Cx, its value also being provided in the LUT. The change in I(Cx) allows the MQ coder to jump between different states in the FSM (i.e. the LUT) based on the

current state and the input pair (Cx,D).

^ _

` a b c ` ^ a

d e d f f g

h e h f f g

h i e h i j g

ƒ „

k l m n o

€  ‚

p q r s t u r

€  ‚

v w x y z x x x { x |

ƒ „

} ~ 

Figure 2.19: Renormalization process in the MQ coder

So far, the main coding flow of the MQ coder and the way that the probability estimation model works has been presented. However, it hasn’t been mentioned where the code bits are output. Instead of outputting each bit after coding each decision (D), the code bits are being temporarily stored in a code register (C reg) and will only be output in bytes during the normalization process (shown in Figure 2.19).

The renormalization process happens when the LPS sub-interval is larger than the MPS sub-interval. This scenario is possible and happens quite frequently in the MQ

1 coder. For example, if Qe = 0.5 and A = 0.75, the approximate scaling gives 3 of 45

2 the interval to the MPS and 3 to the LPS. To avoid this size inversion, the MPS and LPS intervals are exchanged whenever the LPS interval is larger than the MPS interval.

2.6 Conclusion

This chapter presented the core algorithms employed in the major coding stages of JPEG2000, including the wavelet transform stage, bitplane coding stage and entropy coding stage. Understanding these major coding stages is crucial to follow further development of this thesis since various design techniques for improving JPEG2000 performance that will be proposed in next chapters are based on the background presented in this chapter. However, due to the complexity of the JPEG2000 standard, not all the details were covered in this chapter. The interested reader can refer to the standard specification in [60] and well-documented literatures such as [10, 13, 84] for further reference. Chapter 3

Modern Hardware Architectures For Parallel Computing

Chapter 2 discssssussed the aims of this study and the background of the main computational application, the JPEG2000 image coder. Before solving the com- putational problem, the next chapters will first discuss the software and hardware platforms used to support the computations.

A computer system can be divided roughly into four components: the hardware, the , the application programs and the users. Therefore, in order to solve any computational problems, it is important to understand the underlying hardware that executes the computational operations and the programming model used to create the application programs. For this reason, before presenting the proposal for accelerating the JPEG2000 image coder, Chapter 3 and Chapter 4 will first focus on discussing the modern hardware architecture and software programming model that support the high performance computing.

46 47

The rest of this chapter will be organized as follows: Section 3.1 will first present the background for modern parallel computer organization, which includes various techniques to exploit parallelism such as instruction-level parallelism, data-level par- allelism and thread-level parallelism. Next, several realistic parallel hardware archi- tectures that are based on these fundamental techniques will be discussed. Section 3.2 will discuss the organization and operation model of modern multicore general pur- pose processing units (CPUs). Finally, Section 3.3 will discuss the organization and operation model of modern manycore general purpose graphics processing units (GPGPUs).

3.1 Modern Hardware Architectures For Parallel

Computing

During the past few years, heterogeneous computers composed of CPUs and GPUs have revolutionized computing. By matching different parts of a workload to the most suitable processor, tremendous performance gains have been achieved. Much of this revolution has been driven by the emergence of many-core processors with very high parallel processing capability such as GPGPUs and multicore CPUs. For example, it is now possible to buy a graphics card that can execute more than a trillion floating point operations per second (teraflops) [39]. These GPUs were de- signed to render beautiful images, but for the right workloads, they can also be used as high-performance computing engines for applications from scientific computing to augmented reality [9]. However, the development of high-performance programs for parallel hardware still requires a deep understanding of the underlying hardware platform. Consequently, in order to understand the implementations and optimiza- 48 tions of the high-performance software solutions in this study, this section will discuss the hardware architectures for parallel computing.

The discussion will focus on the two most common modern hardware architec- tures for parallel computing using OpenCL: the multicore X86 general purpose cen- tral processing unit (CPU) architecture and the manycore general purpose graphics processing unit (GPGPU) architecture. The following sections will not manage to present these architectures in detail, but rather focus on underlining the key per- formance characteristics of CPUs and GPGPUs and the advantages of these two different hardware architectures. However, before expatiating any further on hard- ware performance, it is important to first address the performance evaluation metric for the hardware system in Section 3.1.1.

3.1.1 Understanding Computing Performance

Because modern architectures for parallel computing are designed to increase the efficiency of computational flows, it is important to first understand the performance metric for computer hardware. Therefore, this section discusses the performance metric for hardware systems and the key elements that impact the performance of a hardware architecture. This section is very crucial to understanding the roots of the design behind the architectures that will be discussed in the next sections shortly.

The performance of a computer system is a wide-range term which highly de- pends on the adopted perspective during evaluation. Specifically, the performance of a computer system can be measured as the total runtime of a program (response time) or as the power it consumes during the entire execution of a program. In addition, it can also be measured as the utilization of the system’s resources (i.e. in 49

a server cluster), etc. However, measure response time is one of the most common

ways to assess the performance of a computer system. This section will be primarily

concerned with response time as the performance metric, which is also referred to as runtime performance. To asses the performance of computer system C, the re-

sponse time of C on a (set of) benchmark program(s) is evaluated P as expressed in Equation (3.1):

C 1 PerformanceP = (3.1) Execution TimeP

Execution time in Equation (3.1) can be measured as the elapsed time, which is the

total time to complete a task, including disk accesses, memory accesses, input/output

activities, operating system overhead etc. However, the aim is to understand the performance of the microprocessor, which is the core of a computer system, therefore,

for consistency, the execution time will be considered as the time that the processor is spending on the program–the CPU execution time. The CPU execution time over

a program P is computed by Equation (3.2):

CPUExecutionTimeP = CPU clock cycles spent on P × Clock cycle time CPU clock cycles spent on P = (3.2) Clock rate

The total number of CPU clock cycles that the CPU spent on program P highly depends on the number of instruction in P and the average clock cycles per instruc- tion (CPI) as expressed in Equation (3.3). Equation (3.3) uses the term of CPI since different instructions may take different amounts of time depending on what they do, CPI being an average of all the instructions executed in the program. CPI pro- vides one way of comparing two different implementations of the same instruction set architecture, since the instruction count required for a program will, of course, 50

be the same.

CPU Clock CyclesP = Instruction Count of P × CPI (3.3)

Combining equations (3.1), (3.2) and (3.3), the performance of a CPU C on a pro- gram P can be expressed by Equation (3.4).

1 1 Performance = Clock rate × × (3.4) Instruction count CPI

It shows that the performance of a processor on a program depends not only on the

clock rate of the computer, but also on a combination of the characteristics of the program and the efficiency of the CPU architecture, in terms of instruction count and

average clock cycles per instruction (CPI). The three parameters of Equation (3.4)

are therefore the primary factors that designers need to consider when wanting to boost computing performance. These factors will be further discussed shortly in the

next sections in the context of different design strategies in parallel computing to increase computing performance.

3.1.2 Exploiting Parallelism To Increase Performance

As recently discussed in Section 3.1.1, the operating frequency (F ) of a processor is one of the primary factors that decide the processor’s performance. Hence, increasing the operating frequency is one of the simplest and most straight forward ways of increasing the processor’s performance. In fact, the strategy of scaling up frequency in order to boost up the processor’s performance had been applied in processor design for a long time [82]. However, during the past decade, it has become obvious that continued scaling of clock frequencies of CPUs is not practical, largely due to power 51

Figure 3.1: Frequency scaling and transistor integration in Intel’s CPU generations [82] and heat dissipation constraints as shown in Figure 3.1, for example. The reason is that power consumption varies non-linearly with frequency. CMOS dynamic power consumption is approximated by the combination of dynamic and static power:

P = Pdynamic + Pstatic

= Ptransient + Pshort circuit + Pstatic (3.5)

where Ptransient is the power that the circuit dissipates due to signal transitions including logic activities (i.e. transitions between logic ‘0’ and logic ‘1’ levels);

Pshort circuit is the power that the circuit dissipates due to the short circuit phe- nomenon in CMOS when it changes state [55]; and Pstatic is the power that the circuit dissipates due to the leakage current [18].

The transient power (Ptransient) dominates total power consumption and it is 52

computed by Equation (3.6):

2 Ptransient = αCV F (3.6)

where α is the activity factor, or fraction of the number of transistors in the circuit that are switching; C is the capacitance of the circuit; V is the voltage applied across the circuit; F is the switching frequency. It appears from this equation that power varies linearly with frequency. In reality, to increase the frequency, one has to increase the rate of charge flow into and out of the capacitors in the circuit. This requires a comparable increase in voltage, which scales both the dynamic and static terms in Equation (3.5). For a long time, based on Moore’s law [70], voltages could be reduced with each processor generation such that frequency scaling would not increase the power consumption uncontrollably [55]. However, as processor technology reached smaller and smaller sizes (e.g. 22 nm today), one can no longer aggressively scale the voltage down without increasing the error rate of transistor switching, thus the linearity in voltage/frequency scaling is no longer maintained in modern CMOS design. The increase in power from any increase in frequency is then substantial. As a second problem, increasing on-chip clock frequency requires either increasing off-chip memory bandwidth to provide data fast enough to prevent stalling the linear workload running through the processor or increasing the amount of caching in the system.

Since increasing frequency with the goal of obtaining higher performance is no longer a viable option, other solutions need to be explored. Equation (3.4) suggests that reducing average clock cycles per instruction (CPI), will increase the number of operations performed during a given clock cycle. In fact, reducing CPI is one of the most well-known directions in modern hardware computing system design to increase performance. The primary strategy to reduce CPI is to move to parallelism where 53 the processor can launch multiple instructions per stage/cycle allowing the instruc- tion execution rate to exceed the clock rate or, stated alternatively, for the CPI to be less than 1. There have been many designs based on parallelism to increase perfor- mance using various approaches. However, those designs often fall in three classes of exploiting parallelism including instruction-level parallelism; data-level parallelism; and thread-level parallelism. Hence, this section will focus on discussing these three major approaches.

3.1.3 Exploiting Instruction Level Parallelism

Instruction level parallelism (ILP) is a technique that speeds up execution time by increasing the instruction throughput per clock cycle of the processor. Instruction level parallelism is obtained primarily in two ways in uniprocessors: through in- struction pipelining (pipeline) and through keeping multiple functional units busy executing multiple instructions at the same time (multiple-issue).

Instruction pipelining is a technique whereby multiple instructions are overlap- ping in execution. The pipelining technique is enabled by the stage-based designs of instruction architecture and the circuitry that executes the instructions. For exam- ple, Figure 3.2 shows the reduced instruction set computer (RISC) [77] pipeline that has five stages including Instruction Fetch (IF), Instruction Decode/Register Fetch

(ID), Execute/Address Calculation (EX), Memory Access (MEM), and Write Back (WB). Since the staged circuitry allows different stages (i.e. IF, ID, EX, MEM, WB) from different instructions to run concurrently as long as there are no dependencies among them, it is possible to run multiple instructions in an overlapping manner to increase the instruction throughput.

54

á â ã ä å æ ç ä è é â ê ë ê ç æ ä è é â è â ì í ã ä î ï ê ð è ð ê ñ è â ê

Ð ÑÒÑÓÔ

Õ Ö × ØÙÚÛÜ Ý Þ ß à

ÁÂÃÄÅÆ ÇÄÈÉ Â Ê † ‡ ˆ ‰ Š ‹ Œ ‹  Ž

ÁÂÃÄÅÆ ÇÄÈÉ Â Ë   ‘ ’ “ ” • – • — ˜

ÁÂÃÄÅÆÇÄÈÉ Â Ì ™ š › œ  ž Ÿ Ÿ ¡ ¢

ÁÂÃÄÅÆ ÇÄÈÉ Â Í £ ¤ ¥ ¦ § ¨ © ª © « ¬

ÁÂÃÄÅÆ ÇÄÈÉ Â Ï ­ ® ¯ ° ± ² ³ ´ ³ µ ¶

ÁÂÃÄÅÆ ÇÄÈÉ Â Î · ¸ ¹ º » ¼ ½ ¾ ½ ¿ À

Figure 3.2: RISC 5-Stage Instruction Pipelining

Pipelining has been one of the most efficient techniques to increase the perfor- mance of processors, particularly in the era of single core processors (uniprocessors).

However, the upper bound of instruction throughput(Instruction Per Clock cycle– IPC) on the pipeline is limited to 1. Moreover, in practice it is impossible to get

IPC = 1 due to the stalls in the pipeline [77]. Hence, in order to achieve an

IPC larger than 1 in ILP, the multiple-issue technique takes place. Multiple-issue technique is based on an idea that multiple instructions are executed at once by uti- lizing multiple functional units running concurrently. Multiple-issue technique can be implemented in hardware, which is referred to as Superscalar and implemented in software, which is referred to as Very Long Instruction Word (VLWI).

In superscalar designs, the CPU maintains dependence information between in- structions in the instruction stream and schedules work onto unused functional units when possible. Figure 3.3 shows an example of exploiting ILP in an instruction stream by using the superscalar technique. The independent instructions of the stream are dispatched and executed concurrently, in a out-of-order manner, and the results of the instructions are then reordered again to assure the correct output of the instruction stream. All of the ordering and dispatching are managed automatically 55 by hardware. Therefore, in superscalar, the software programmer does not have to worry about the execution of his instruction flow, scheduling decisions being made dynamically by hardware. By extracting parallelism from the programmer’s code automatically within the hardware, serial code performs faster without any extra developer effort. Indeed, superscalar designs predate frequency scaling limitations by a decade or more, even in popular mass-produced devices, as a way to increase overall performance superlinearly. However, it is not without disadvantages.

Figure 3.3: Exploiting Instruction-Level Parallelism using Superscalar technique [17] 56

Out-of-order scheduling logic requires a substantial area of the CPU due to main- taining dependence information and queues of instructions to deal with dynamic schedules throughout the hardware. In addition, speculative instruction execution necessary to expand the window of out-of-order instructions to execute in parallel results in inefficient execution of throwaway work. As a result, out-of-order execution in a CPU has shown diminishing returns; the industry has taken other approaches to increasing performance as transistor size has decreased, even on the high-performance devices in which superscalar logic was formerly feasible. On embedded and special- purpose devices, extraction of parallelism from serial code has never been as much of a goal, and such designs have historically been less common in these areas.

3.1.4 Exploiting Data Level Parallelism

The previous section went over techniques in exploiting instruction level parallelism, which are different in approach but come from a common basis - the independence of instructions in an instruction stream. Another technique in exploiting parallelism is discussed, coming from a different angle than ILP, called Data Level Parallelism

(DLP). There are several different approaches to exploit DLP, however the focus will be on the Single Instruction Multiple Data (SIMD) technique, which is the core of

GPGPU architectures. Different from ILP, SIMD directly allows the hardware units to target data parallel execution. Rather than specifying a scalar operation, a single

SIMD instruction encapsulates a request that the same operation be performed on multiple data elements using multiple hardware execution units (i.e. ALUs). The SIMD technique has proven to be very efficient in applications that can expose a high degree of data parallelism such as image processing where the same set of operations are often applied repeatedly on independent pixels.

57

W XY Z [\] ^ ] \ _ ` a Y ` b c ‚ ƒ „ † ‡ ˆ ˆ ‰ Š ‹ Œ

H I J KL M N KO P N KQ Q

w x x y z { | } w z { | } ~ z { |

KTL U KTQ V KT

w x x y z  | } w z  | } ~ z  |

R S S S

w x x y z € | } w z € | } ~ z € |

w x x y z  | } w z  | } ~ z  |

” • – — ˜ ™ ™ š › œ  ž

 Ž Ž   ‘ ’  ’ “

ò ó ô õ ö ÷ ø õ ù ú ó û ü õ ø ý

d e f g h i j k l

k m n o p q r s t u v v s r

þ ÿ ¡ ¢ £ ¤ ¡ ¥ ¦ ÿ § ¨ ¤ ¦ © ¨ ¢

            

     0 1 2 3 4

$ %& '( < = > ?@      !"  !#  0 5 5 6 74 89: 74 89 ; 74 8

$ )) * +(,-. +(,- / +(, < AA BC@DEFC@DEGC@D

Figure 3.4: Exploiting Data-Level Parallelism using SIMD technique

Figure 3.4 shows an example of exploiting parallelism from an instruction stream using a four-way SIMD engine. The instruction stream is now issued linearly rather than out of order as in the case of superscalar in Figure 3.3. However, each of these instructions now executes over a group of four ALUs at the same time. The integer instructions are issued one by one through the four-way integer vector ALU on the left, and the floating point instructions are issued similarly through the four-way floating point ALU on the right.

The advantage of SIMD execution is that relative to ALU work, the amount of scheduling and instruction decode logic can both be decreased. Four operations are now performed with a single instruction and a single point in the dependence schedule. Consequently, more arithmetic capability can be added to the processor given the same power/silicon budget. This advantage is the primary motivation 58 behind GPGPU architectures, which will be discussed shortly in the next sections.

Of course, as with the previous techniques, there are trade-offs. It is often not very easy to expose data-level parallelism, particularly in the legacy algorithms that are designed serially during the uniprocessor era. On the other hand, it is simply too difficult for the compiler to extract data parallelism from code. Consequently, it is up to the developer to design algorithms and write code that explicitly exposes data-level parallelism. If data parallelism cannot be exposed the instructions will be executed on single data element thus ending up with unused ALUs and transistor wastage.

3.1.5 Exploiting Thread Level Parallelism With Multicore

In the former techniques, computing throughput was increased by fine-grained ap- proaches including ILP and DLP. However, there is a simpler, coarse-grained, ap- proach to improve throughput by exploiting thread parallelism. The concept of thread parallelism is quite simple the parallelism is extracted from the independent instruction stream (i.e. computing thread). Clearly, this form is heavily used on large, parallel machines, but it is also useful within a single CPU core.

As previously discussed, exposing data-level parallelism or extracting indepen- dent instructions from an instruction stream is difficult, in terms of both hardware and compiler work, and it is sometimes impossible. However, the ILP approach can benefit from multithreading technology, which allows processors to run multiple threads concurrently since the threads are already independent and extracting in- struction parallelism from the threads is trivial. However, challenges still exist in the use of multithreading, particularly the challenge in hardware architecture to manage 59

multiple instruction streams , their state information and cache.

One of the most well-known techniques to enable mltithreading is simultaneous

multithreading (SMT), which utilizes one copy of the architecture state for each logical processor, and the logical processors share a single set of physical execution

resources. Figure 3.5 illustrates the concept of SMT with two concurrent threads. Instructions from the threads are interleaved on the execution resources by an ex-

tension to the superscalar scheduling logic that tracks both instruction dependencies

and source threads. The goal is for the execution resources to be more effectively utilized, as is the case in the figure. A higher proportion of execution slots are occu-

pied with useful work. The cost of this approach is that the instruction dependence

and scheduling logic becomes more complicated as it manages two distinct sets of

dependencies, resources, and execution queues.

° ± ² ³ ´ µ ¶ · ÀºÁ ´µ¶·¸¹ºµ»³³º ¹³ ° ±²³´µ¶·¸¹ºµ»³³º ¹ ¹»³º ¼¹µ»

§ ¨

¸ ¹ º µ » ³ ³ º ¹ ´ ³ ´ Ã · » ½ º ÄÅ ¶ · · º µ ¶ ½ » ¾ ¿ º ¹ ½ ± ¹ » ¶ ¾ ³ ÆÇÈÉÊËÌÊÍ

Â

¦ ÿ

ÿ

© ª « ¬ ­ ® ª ¯ ¬ ­

þ ¥

¤

Ÿ ¡ ¢ £ ¤ ¥ ¦

ý £

¢

Õ Ö × Ø Ù Ú Û ÎÏÐÑÒÓ Ô Ÿ ¡ ¢ £ ¤ ¥ §

¡

ý

Ÿ ¡ ¢ £ ¤ ¥ ¨

þ ÿ

ý

ü

û

§ ¨

æ ç è é ê ë ç ì é ê

¦ ÿ

Ü Ý Þ ß à á â Ý ã

ÿ

í î ï ð ñ ò ó

þ ¥

¤

Ü Ý Þ ß à á â Ý ä

ý £

¢

Ü Ý Þ ß à á â Ý å

ô õ ö ÷ ø ù ú

¡

þ

ý

ü û

Figure 3.5: Simultaneous Multithreading (SMT) with two concurrent threads

The architecture state of each logical processor consists of registers including the general-purpose registers, the control registers, the advanced programmable in- terrupt controller (APIC) registers and some machine state registers. Logical pro- cessors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses [29]. The number of concurrent threads per physical processor can be decided by the chip designers, but 60

practical restrictions on chip complexity have limited the number to two for most

SMT implementations such as Intel Hyperthreading technology [29].

Multithreading technology can efficiently enable thread-level parallelism in pro- cessors, particularly in single-core processors, however the number of concurrent

threads is limited due to design complexity. In order to archieve a higher degree of thread-level parallelism, conceptually at least, the simplest approach to increas-

ing the amount of work performed per clock cycle is to simply clone a single core

multiple times on a chip as shown in Figure 3.6. Moreover, the multicore approach provides a great advantage in power as multicore processors can operate with moder-

ated frequency while still maintaining high computational throughput (Figure 3.7).

In the simplest case, each of these cores executes largely independently, sharing data through the memory system, usually through a cache coherency protocol. Each

physical core can be a single-threaded core or a multi-threaded one.

{ ‰  —

ž t ‚

 s z  ˆ  –

œ r y € ‡ Ž •

› q x  †  ”

™ š o p v w } ~ „ ‹ Œ ’ “

˜ n u | ƒ Š ‘

S TUV W ©    . / 0 1 2 A BCD E

² ¦ § ¤ ³ ² ¦ § ¤ ³ ² ¦ § ¤ ´ ² ¦ § ¤ ³ ² ¦ § ¤ ´

X YZ[\      ! " # $ % 3 4 5 6 7 F GHIJ

f g h i j k l m h & ' ( ) * + ,-( KLMNOP QRM

] ^ _ ` a b c a b d e ^             8 9 : ; < = > < = ? @ 9

Ÿ ¡ ¢ £ ¤ ¥ ¦ § ¤ ¨ § ¦ ¥ ¤ © © ¦ § ± ­ ¯ £ ¥ ¦ § ¤ ¨ § ¦ ¥ ¤ © © ¦ § ± ­ ¯ £ ¥ ¦ § ¤ ¨ § ¦ ¥ ¤ © © ¦ §

ª « ¬ ¦ ­ « ® ­ £« «¬ §¤ ¯ ° ¡ ¢ ª « ¬ ¦ ­ « ® ­ £« «¬ §¤ ¯ ° ¡ ¢ ª « ¬ ® ­ £« «¬ §¤ ¯ ° ¡ ¢

Figure 3.6: Conceptual Viewing of Singlecore vs. Multicore architecture

The multicore design approach has been widely adopted in modern processors as increasing performance of singlecore processors has become very challenging. This is partly because most of the low-hanging fruit has already been picked, and partly 61

Processor Input Processor Output

Input f/2 Output f

Capacitance = C Processor f Voltage = V Frequency = f Power = CV 2f f/2 Capacitance = 2.2 C Voltage = 0.6 V Frequency = 0.5 f Power = 0.396 CV 2f

Figure 3.7: Power consumption in singlecore vs. multicore processor with the same compute througput. The rate at which instructions are retired is the same in these two cases, but the power is much less with two cores running at half the frequency of a single core [9]. because processors are starting to run up against power budgets, and both out-of- order instruction execution and higher clock frequency are power-intensive. The multicore approach has the great advantage that, given software that can parallelize across many such cores, performance can scale nearly linearly as more and more cores get packed onto chips in the future. Performance of multicore processors has increased significantly during the last several years as a result of increased integration density.

To conclude this section about different techniques for exploiting parallelism, it should be noted that although various techniques including pipeline, superscalar,

SIMD, SMT, Multicore have been discussed separately, it does not mean that they cannot sit together in the same processor. In fact, modern processors often combine some of these techniques to exploit different advantages of each technique in order to maximize computing throughput. For example, Intel and AMD CPUs employ pipeline, superscalar and even SIMD, SMT in their multicore architecture [30]. In realistic designs, these technologies are adopted selectively depending on goals and 62 tradeoffs of their targeted applications. The next sections will discuss the adoptions of these technologies in maintream desktop CPUs and GPGPUs from the perspective of parallel computing.

3.2 Mainstream General Purpose Processing Units

Mainstream General Purpose Processing Units (CPUs) have been widely adopted as the most common processing unit for mainstream computing. Modern CPUs, sup- porting advanced integration technologies and micro architecture evolutions, have been transformed from singlecore/singlethreading to multicore/multithreading pro- cessing units with great parallel processing capability. This section, will discuss several real architectures of modern desktop CPUs from Intel and AMD and where they fit in the design space, trading off some of the features discussed previously. Hereafter, mainstream desktop CPUs will be shortly referred to as CPUs.

Nowadays, the two prominent CPU vendors, AMD and Intel, offer hundreds of CPU models targeting a wide range of segments, from high-end to entry-level computing.

These CPUs differ in processing capability, architecture and implementation details, number of cores, operating frequency, etc. However, they are based on a common symmetric multiprocessor architecture model that is conceptually illustrated in Figure 3.8. A modern multicore processor often consists of several processor cores, each core being able to operate independently with its own cache memory, control and scheduling unit, state register set(s) and execution units. All the cores within a processor share a system memory and communicate with it, as well as with each other, through system buses.


Figure 3.8: A Conceptual Symmetric Multicore Processor

The number of cores integrated in a multicore processor depends on the integration technology, power budget and targeted market segment. For example, the AMD FX multicore processor family has three different versions including AMD FX 8-core, AMD FX 6-core and AMD FX 4-core [41], while the Intel Core family offers versions ranging from 2 to 6 cores. The architectures of the cores in different versions of the same processor family are often very similar or even identical.

Each processor core often employs some of the parallel techniques presented in

Section 3.1.2 (i.e. pipeline, superscalar, SIMD, SMT, etc.) to exploit parallelism in order to maximize computational throughput. While pipeline and superscalar techniques have been adopted in both AMD and Intel CPUs as standard techniques for a long time, SIMD and SMT techniques are only selectively adopted on some CPU versions. For example, Intel processors often employ the SMT technique (with the trademark of Hyper-Threading Technology [29]) whereas AMD often does not

equip SMT on their processors. The combination of multicore architecture and multithreading can provide modern CPUs a high degree of hardware parallelism.

A modern Intel Core i7 has 6 cores that can run up to 12 concurrent threads.

Most modern processor cores also employ the SIMD technique to exploit data parallelism in Streaming SIMD Extension units (SSE) [33]. Each core often carries a full 128-bit SSE unit that can issue add, multiply, and miscellaneous instructions simultaneously. A 128-bit SSE unit has a wide L1 cache interface (128 bits/cycle) and decodes, fetches, and issues 4 integer operations in parallel. The utilization of SSE units often requires extensive support from the compiler, which detects the instructions that operate on independent data elements and then translates them to vector instructions. This process is often referred to as code vectorization [34].
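As an illustration of what such vectorized code looks like (a hypothetical C fragment using SSE2 intrinsics, not code taken from this dissertation), the loop below adds two arrays of 32-bit integers four elements per instruction, which is exactly the kind of vector instruction a vectorizing compiler emits for a scalar loop:

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical example: add two int32 arrays four elements at a time.
   For brevity, n is assumed to be a multiple of 4.                     */
void add_vec4(const int *a, const int *b, int *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i)); /* load 4 ints  */
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vc = _mm_add_epi32(va, vb);                     /* 4 adds at once */
        _mm_storeu_si128((__m128i *)(c + i), vc);               /* store 4 ints */
    }
}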

SSE was originally designed to accelerate particular applications that have a high degree of data parallelism, such as image processing and/or rendering. However, general computational flows also benefit greatly from SSE [34]. To further exploit the advantage of SIMD, modern Intel Sandy Bridge processors [32] extend the SIMD operations to 256 bits, called Advanced Vector Extensions (AVX) [28]. An Intel AVX unit can execute up to 8 single-precision floating-point operations concurrently.

It should also be noted that CPUs are designed for general purpose computing, and therefore a large part of their logic resources is dedicated to control units (e.g. the out-of-order scheduling and retirement unit, branch prediction unit, instruction paging unit, etc.). Logic resource utilization of different units can be roughly translated to the silicon die areas where the circuits are fabricated, as shown in Figure 3.9. Figure 3.9 clearly shows that a large portion of the silicon die is designated to control units.

The price paid for large and complex control units in CPUs is rewarded with a high degree of flexibility in the processor cores. This flexibility is really important when the cores have to handle complex computational threads, as will be discussed later.

Figure 3.9: Silicon die of an Intel 4-core Nehalem processor. The area of each region is proportional to the logic resources fabricated there, so the die photo provides an estimate of logic utilization [15]

3.3 General Purpose Graphics Processing Units

This section will discuss another device that has become a great candidate for parallel computing in recent years: the general purpose graphics processing unit (GPGPU). Although GPGPU computing is an emerging trend in parallel computing, it has gathered much attention from the community, and it provides a whole new paradigm for parallel computing, with massive parallel processing capability arising from a very different design approach compared to CPUs. Like CPUs, GPU architectures come in a wide variety of flavors. Several will be discussed briefly before going in depth on the Nvidia GPGPU architecture.

3.3.1 General Purpose Graphics Processing Units Overview

The graphics processing unit (GPU), first introduced by NVIDIA in 1999 [40], is the most pervasive parallel processor to date. However, the GPU was originally an application-specific processor, designed specifically to accelerate graphics

rendering pipelines [20]. Efforts to exploit the GPU for non-graphical applications

have been underway since 2003. By using high-level shading languages such as DirectX, OpenGL and Cg, various data parallel algorithms have been ported to the

GPU. Problems such as protein folding, stock options pricing, SQL queries, and MRI reconstruction achieved remarkable performance speedups on the GPU. These

early efforts that used graphics APIs for general purpose computing were known

as GPGPU programs [39]. General purpose computing on GPUs is now strongly supported by graphics hardware vendors, particularly by Nvidia and AMD (formerly ATI).

More recently, GPU architecture design has seen significant shifts to make the GPU a more general purpose processing unit. GPU architectures have been completely redesigned to support general purpose computing rather than serving only as fixed-function units dedicated to graphics rendering pipelines; this transformation turned the GPU into the general purpose GPU, or GPGPU. Although GPGPUs are redesigned for general purpose computing, the GPGPU approach is very different from the CPU approach. Unlike the CPU, the GPGPU is specialized for computationally-intensive, highly parallel computation − exactly

what graphics rendering is all about − and therefore is designed such that more transistors are devoted to data processing rather than data caching and flow control,

as schematically illustrated in Figure 3.10. More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations.


Figure 3.10: GPGPUs devote most of their logic resources to arithmetic units to target computationally-intensive applications, while CPUs keep a balance between logic resources for arithmetic units and control units to target a wide range of general purpose applications [39]

The same program is executed on many data elements in parallel with high arithmetic intensity (the ratio of arithmetic operations to memory operations). Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control. Also, because it is executed on many data elements and has high arithmetic intensity, memory access latency can be hidden with calculations instead of large data caches.

GPGPUs inherit massive multithreading capabilities from previous generations of GPU architectures designed for graphics rendering pipelines. GPUs are well known for sophisticated hardware task management, because the graphics workloads they are designed to process consist of complex vertex, geometry, and pixel processing task graphs. These tasks and the pixels they process are highly parallel, which gives a substantial amount of independent work to devices with multiple cores and highly latency-tolerant multithreading. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing and physics simulation to computational finance and computational biology.

As discussed, GPGPUs are designed with most logic resources devoted to arithmetic units to target computationally-intensive applications that can expose a high degree of data-level parallelism. GPGPUs therefore employ the SIMD technique as their core engine to reach this goal. The details of the design approaches for the SIMD arithmetic units and the sophisticated hardware scheduler will be discussed in the next section with a case study on the Nvidia Fermi [39] GPGPU architecture.

3.3.2 GPGPU Core Processing Architecture

To further explore GPGPU architecture, it is necessary to analyze the design of a realistic GPGPU device. Although Nvidia and AMD offer different versions of

GPGPU devices, the core concepts in architecture organization are similar. One can select a specific model from either AMD or Nvidia as a case study without losing generality. The experiments in this study are based on the NVIDIA GTX580

GPGPU (code name Fermi) [39]. This section therefore will discuss the architecture of the Nvidia Fermi GPGPU.

Figure 3.11 shows an abstract view of the organization of the Nvidia Fermi GPGPU. The first Fermi-based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores [39]. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 streaming multiprocessors (SM) of 32 cores each. The SMs share a common GDDR5 DRAM memory system with six 64-bit memory partitions, for a 384-bit memory interface. The memory is cached by an L2 cache that is shared among the SMs. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to the SM thread schedulers.

The organization of a streaming multiprocessor (SM) which consists of arrays of

CUDA cores is shown in Figure 3.12. Each SM consists of two CUDA core arrays,


Figure 3.11: The NVIDIA Fermi architecture [39]. This device has 16 streaming multiprocessors (SM), two SIMD arrays per SM. The SMs share a common GDDR5 memory system and L2 cache.

a shared memory/L1 cache and a separate array of special function units (SFU). It is important to note that an Nvidia CUDA core (or an AMD streaming processor) is not a core with the full capability of a CPU core. Rather, a CUDA core is just a simple processing element (PE) that executes arithmetic operations. The SIMD CUDA cores share an instruction cache, scheduler, dispatcher and register set. Constructing SMs from arrays of simple PEs that share all other resources (e.g. cache, registers, scheduler, etc.) is the key technique that allows GPGPU designs to devote logic resources primarily to arithmetic throughput.

The SM schedules threads in groups of 32 parallel threads called warps. Each

SM features two warp schedulers and two instruction dispatch units (Figure 3.13), allowing two warps to be issued and executed concurrently on the two SIMD CUDA core arrays. Fermi's dual warp scheduler selects two warps and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs.

(Diagram: each SM contains an instruction cache, two warp schedulers and two dispatch units, a 32,768 x 32-bit register file, the CUDA core array with load/store units and SFUs, an interconnect network, 64 KB of shared memory / L1 cache, and a uniform cache. Each CUDA core contains a dispatch port, operand collector, FP and INT units, and a result queue.)
Figure 3.12: The NVIDIA Fermi architecture [39]. Each SM consists of two SIMD CUDA core arrays. Each SM includes a shared memory/L1 cache and a separate array of special function units (SFU). The SIMD CUDA cores share the instruction cache, scheduler, dispatcher and register set.

Because warps execute independently, Fermi's scheduler does not need to check for dependencies from within the instruction stream. Using this elegant model of dual-issue, Fermi achieves near peak hardware performance.

To obtain high performance in GPGPU computing, scheduling must be very efficient. Thread scheduling overhead needs to remain low because the chunks of work assembled into a thread may be very small, which results in very frequent thread context switching. For this reason, the GPGPU manages thread scheduling entirely in a hardware scheduler rather than relying on the operating system as in a CPU-based system. The Fermi pipeline is optimized to reduce the cost of an application context switch to below 25 microseconds [39]. Also, since the amount

Figure 3.13: Fermi Thread Scheduler and Instruction Dispatcher Unit [39].

of work in each thread is very small, the latency of a memory access can sometimes be greater than the time needed to execute the arithmetic operations in the thread.

Thus it is necessary to provide a sufficient amount of work to GPGPUs so that this latency can be hidden efficiently. In essence, a large number of threads need to be created to occupy the machine: as discussed previously, the GPU is a throughput machine.

The SIMD-based streaming multiprocessor design brings great computing throughput and massive parallel capability on a moderate budget of logic resources and power. However, this approach also faces several crucial drawbacks. One of the biggest concerns about the GPGPU architecture is its flexibility. In an SM, because a massive number of PEs operate in parallel while sharing all other resources, the degree of flexibility within a single PE is very low. Moreover, the SIMD-based approach with shared control units requires that all threads within a warp operate on the same path of execution. Flow control, such as branching, is handled by combining all necessary paths within a warp. Consequently, if a thread within a warp diverges, all paths are executed serially. For example, if a thread contains a branch with two paths, the warp first executes one path, then the second path; the total time to execute the branch is the sum of the times of the two paths. This issue makes control operations very expensive on GPGPUs. On the other hand, as discussed above, the GPU is a throughput machine and one needs to create a large number of threads to occupy the machine in order to reach its high performance capability. This goal is not easy to achieve in realistic applications where it is difficult to expose data-level parallelism.
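To make the cost of divergence concrete, the following OpenCL C kernel (an illustrative sketch, not code from this dissertation) contains a data-dependent branch; when the work-items of one warp take both sides of the if/else, the hardware executes the two paths serially for that warp.

/* Illustrative kernel: odd and even work-items take different paths, so
   every 32-thread warp diverges and the two branches are serialized.    */
__kernel void divergent(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    if (i % 2 == 0)
        out[i] = in[i] * 2.0f;   /* path A */
    else
        out[i] = in[i] + 1.0f;   /* path B */
}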

3.4 Conclusion

In this chapter, various techniques for exploiting parallelism to increase computing throughput have been discussed, from superscalar execution to SIMD to multithreading. As the traditional method of increasing processor operating frequency to increase performance no longer holds, exploiting parallelism has become more important.

After discussing the fundamental techniques for exploiting parallelism, two processor families that are widely adopted as the processing units for mainstream computing have been analyzed: CPUs and GPGPUs. Major advantages and disadvantages of each type of processing unit have been pointed out. While CPUs have a limited number of cores, these cores can operate independently with a very high degree of flexibility. Moreover, the adoption of multithreading and Streaming SIMD Extension technology brings modern multicore CPUs a significant improvement in parallel processing capability. GPGPUs approach the modern computing paradigm from a very different angle than CPUs. The GPGPU is designed around the SIMD technique to become a throughput engine, which trades flexibility for massive parallel arithmetic capability. Although the GPGPU has emerged as a great candidate for speeding up computing performance, its low flexibility and complex programming model remain critical drawbacks.

Chapter 4

OpenCL Programming Model

Chapter 3 discussed different modern hardware architectures for parallel computing. However, in order to command the underlying hardware to solve a computational problem, a program and an associated execution model are required. For that reason, in order to provide a complete picture of the hardware-software interface used to accelerate computational applications, this chapter will discuss the software side: the operating system and the programming model. The rest of the chapter is organized as follows: Section 4.1 will first review the history of parallel software development platforms and introduce the Open Computing Language (OpenCL) platform for parallel heterogeneous computing. Next, Section 4.2 will discuss the key concepts of the OpenCL software development platform. The last section, Section 4.3, will present the mapping of OpenCL abstraction models to targeted hardware devices, specifically the mapping of the OpenCL model to the two families of hardware devices introduced earlier in Chapter 3, namely the multicore CPU and the manycore GPGPU.


4.1 Open Computing Language For Parallel Heterogeneous Computing

As discussed in Chapter 3, during the past few years, tremendous performance gains in computing applications have been achieved with modern processors such as multicore CPUs and manycore GPGPUs. Modern processor architectures have embraced parallelism as an important pathway to increased performance. However, it is not feasible to benefit from the massive computational capability of modern parallel processors without efficient parallel software. It is crucial to enable software developers to take full advantage of these capable parallel processors in a heterogeneous approach with a suitable parallel software development platform.

Parallel software development has existed for a long time, with much effort coming from industry as well as academia [76]. Many parallel programming frameworks, APIs, and standards, such as Open MultiProcessing (OpenMP) [5], Intel Threading Building Blocks [31], the Message Passing Interface (MPI) [4], and Apple's Grand Central Dispatch [27], were developed to support parallel computing. However, the parallel computing community is still putting massive effort into developing new parallel programming models, as these well-known platforms still do not meet the demands of parallel computing. Moreover, as a matter of history, these platforms were designed to target general purpose CPUs, whereas the parallel computing industry has changed rapidly during the last several years.

The trend has been shifting to heterogeneous computing, which is not only dominated by general purpose CPUs but also by emerging manycore processors such as GPUs [37, 42], Cell processors [44] and even FPGAs and DSP processors [58]. The parallel computing industry therefore demands a new parallel software development platform that can support a wide range of applications across heterogeneous hardware. The Open Computing Language (OpenCL) platform [51] has recently been introduced to meet this demand.

OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors, giving software developers portable and efficient access to the power of these heterogeneous processing platforms [51]. OpenCL is jointly developed by a very large group of experts from processor vendors, system OEMs, middleware vendors, application developers, academia, research labs and FPGA vendors [51]. The standard is managed by the nonprofit technology consortium Khronos Group. OpenCL supports a wide range of applications, ranging from embedded and consumer software to high-performance computing solutions, through a low-level, high-performance, portable abstraction. By creating an efficient, close-to-the-metal programming interface, OpenCL forms the foundation layer of a parallel computing ecosystem of platform-independent tools, middleware and applications [51].

OpenCL consists of an application programming interface (API) for coordinating parallel computation across heterogeneous processors. The target audience of OpenCL is expert programmers who want to write portable yet efficient code. This includes library writers, middleware vendors, and performance-oriented application programmers. Therefore OpenCL provides a low-level hardware abstraction plus a framework to support programming, while exposing many details of the underlying hardware.

4.2 Anatomy Of OpenCL

OpenCL can be viewed as a hierarchy of the following models [52, 17]:

• Platform model: a high-level description of the heterogeneous system

• Execution model: an abstract representation of how streams of instructions execute on the heterogeneous platform

– The organization of working threads, which do the actual computational work for the OpenCL application.

– Execution context: the environment within which the working threads execute.

• Memory model: the collection of memory regions within OpenCL and how they interact during an OpenCL computation

The key concepts in these OpenCL models will be discussed in Sections 4.2.1 through 4.2.5, before elaborating on how to map the logical models in OpenCL to physical hardware, in particular the mapping to the manycore GPGPUs and multicore CPUs introduced earlier in Chapter 3.

4.2.1 OpenCL Platform Model

The platform model for OpenCL is illustrated in Figure 4.1. The OpenCL platform model defines a high-level representation of any heterogeneous platform used with

OpenCL. An OpenCL platform always includes a single host. The host interacts with the environment external to the OpenCL program, including I/O or interaction with a program’s user.

The host is connected to one or more OpenCL devices. The device is where the streams of instructions (or kernels) execute; thus an OpenCL device is often referred

Figure 4.1: OpenCL platform model [17]. The platform consists of a single host and one or more heterogeneous compute devices. The host interacts with the compute devices through the OpenCL program.

to as a compute device. An OpenCL compute device is further divided into one or

more compute units (CUs) which are further divided into one or more processing elements (PEs). Computations on a device occur within the processing elements.

A device can be a CPU, a GPU, a DSP, or any other processor provided by the hardware and supported by the OpenCL vendor.

The OpenCL platform device model closely corresponds to realistic hardware

platforms, which can consist of a CPU and several compute devices such as GPGPUs and FPGAs. For example, a realistic hardware platform used in this study has three

different processing units: an Intel Core i7 processor works as the host, while an Nvidia

GTX 580 GPGPU and an AMD HD 5800 GPGPU work as compute devices.
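As a minimal host-side sketch of how such a platform is discovered (error handling omitted; this is illustrative code, not taken from the dissertation's implementation), the standard OpenCL API calls below enumerate a platform and its GPU compute devices:

#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[4];
    cl_uint num_devices = 0;

    /* One OpenCL platform (e.g. the Nvidia or AMD driver) ...            */
    clGetPlatformIDs(1, &platform, NULL);
    /* ... exposes one or more compute devices; here we ask for GPUs only. */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 4, devices, &num_devices);
    printf("found %u GPU device(s)\n", num_devices);
    return 0;
}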

4.2.2 OpenCL Execution Model: Working Threads

Execution of an OpenCL program occurs in two parts: kernels that execute on one or more OpenCL devices and a host program that executes on the host. The host

program defines the context for the kernels and manages their execution. The kernels

execute on the OpenCL devices. They do the real work of an OpenCL application.

Figure 4.2: A realistic hardware platform for parallel heterogeneous computing with OpenCL. The platform consists of three different processing units: an Intel Core i7 processor works as the host, while an Nvidia GTX 480 GPGPU and an AMD HD 5800 GPGPU work as compute devices

Kernels are typically simple functions that transform input memory objects into output memory objects [17, 52].

The core of the OpenCL execution model is defined by how the kernels execute.

When a kernel is submitted for execution by the host, an index space is defined. An instance of the kernel executes for each point in this index space. The execution of the OpenCL kernel occurs in a parallel fashion, hence there may be multiple instances of the kernel running simultaneously. Each kernel instance is called a work-item and is identified by its point in the index space, which provides a global ID for the work-item. Within a work-group, each work-item executes the same code, but the specific execution pathway through the code and the data operated upon can vary. Recall from the OpenCL platform model that an OpenCL compute device is further divided into one or more compute units (CUs), which are further divided into one or more processing elements (PEs). Computations on a device occur within the processing elements, and each PE executes a work-item.

Figure 4.3: OpenCL NDRange index space showing work-items, their global IDs and their mapping onto the pair of work-group and local IDs [43]. Each work-group may consist of multiple wavefronts (AMD term) or warps (Nvidia term).

Figure 4.3 shows the organization of work-items and the index space. Work-items are organized into work-groups. The work-groups provide a more coarse-grained decomposition of the index space. Work-groups are assigned a unique work-group ID with the same dimensionality as the index space used for the work-items. Work-items are assigned a unique local ID within a work-group, so that a single work-item can be uniquely identified either by its global ID or by a combination of its local ID and work-group ID. The work-items in a given work-group execute concurrently on the processing elements of a single compute unit. The index space supported in OpenCL is called an NDRange. An NDRange is an N-dimensional index space, where N is one, two or three. An NDRange is defined by an integer array of length N specifying the extent of the index space in each dimension, starting at an offset index F (zero by default). Each work-item's global ID and local ID are N-dimensional tuples. The global ID components are values in the range from F to F plus the number of elements in that dimension minus one.
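A short OpenCL C sketch (illustrative only, not code from this dissertation) shows how a work-item queries these IDs for a one-dimensional NDRange and how they relate to one another:

/* Illustrative 1-D kernel: every work-item can query where it sits in the NDRange. */
__kernel void show_ids(__global int *out)
{
    size_t gid = get_global_id(0);    /* global ID: unique across the whole NDRange */
    size_t lid = get_local_id(0);     /* local ID: position within the work-group   */
    size_t wg  = get_group_id(0);     /* work-group ID                              */
    size_t wgs = get_local_size(0);   /* work-group size                            */
    /* For a zero offset, gid == wg * wgs + lid ties the three IDs together. */
    out[gid] = (int)(wg * wgs + lid);
}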

Typically, the number of work-items in a work-group is much larger than the number of processing elements in a compute unit, because the PEs hide the latency of the work-items by executing multiple work-items in a pipelined, streaming fashion. For this reason, the work-items in a work-group are divided into teams of work-items that are executed simultaneously on the PEs. A team of 32 to 64 work-items is called a warp on the Nvidia platform [40] and a wavefront on the AMD platform [43]. Figure 4.3 also illustrates wavefronts in work-groups.

4.2.3 OpenCL Execution Model: Context And Command Queues

The organization of the computational threads, or work-items, of an OpenCL application, which execute on compute devices, has been discussed. However, more details are needed to understand how these work-items come to exist on compute devices. In fact, these work-items are submitted to compute devices from the host device. The host plays a very important role in an OpenCL application, as it prepares the execution context: the environment within which work-items execute. The host defines the NDRange and the queues that control the details of how and when the kernels execute. All of these important functions are contained in the APIs within OpenCL's definition.

An execution context includes the following resources:

• Devices: The collection of OpenCL devices to be used by the host.

• Kernels: The OpenCL functions that run on OpenCL devices.

• Program Objects: The program source that implements the kernels.

• Memory Objects: A set of memory objects visible to the host and the OpenCL

devices. Memory objects contain values that can be operated on by instances of a kernel.

The program objects in a context contain the program source and executable that implement the kernel. It is important to note that, in contrast to a traditional computer program, which is typically built once and loaded from a storage device upon user request, the program object is built at runtime within the host program.

The reason behind this special way of building the OpenCL program objects comes from the aim of the OpenCL platform, which targets various families of computing devices. Thus, the application programmer has no control over which GPUs or CPUs or other chips the end user may run the application on. All the OpenCL programmer knows is that the target platform will be conformant to the OpenCL specification.

Consequently, the OpenCL API forces program objects to be built from source at runtime specifically for the computing devices in the execution context. Only at that point is it possible to know how to compile the program source code in order to create the binary code for the kernels.
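A minimal host-side sketch of this runtime build step follows (illustrative only; it assumes a cl_context context, a cl_device_id device and a cl_int err already exist, and the kernel name vec_add is hypothetical):

/* The OpenCL C source is carried as a plain string and compiled at runtime
   for whichever devices are attached to the context.                        */
const char *src =
    "__kernel void vec_add(__global const float *a,"
    "                      __global const float *b,"
    "                      __global float *c)"
    "{ size_t i = get_global_id(0); c[i] = a[i] + b[i]; }";

cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);  /* compile for this device */
cl_kernel kernel = clCreateKernel(program, "vec_add", &err);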

The host creates a data structure called a command-queue to coordinate execution of the kernels on the devices. The host places commands into the command-queue, which are then scheduled onto the devices within the context. These include:

• Kernel execution commands: Execute a kernel on the PEs of a device.

• Memory commands: Transfer data to, from, or between memory objects, or

map and unmap memory objects from the host address space.

• Synchronization commands: Constrain the order of execution of commands.

The command-queue schedules commands for execution on a device. Commands execute asynchronously between the host and the device. The host submits a command queue to the device but does not manage the execution of the queue on it. Right after submitting the command-queue, the host lets the OpenCL runtime, together with the compute device driver, take care of the queue, thus permitting the host to schedule other work. It is possible to associate multiple queues with a single context. These queues run concurrently and independently, with no explicit mechanisms within OpenCL to synchronize them.
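Continuing the hypothetical vec_add sketch above (a sketch only: error handling is omitted and the buffers a_buf, b_buf, c_buf, the host arrays and the element count n are assumed to have been created already), a typical host sequence enqueues memory and kernel-execution commands as follows:

cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

/* Memory commands: copy the two input arrays into device buffers. */
clEnqueueWriteBuffer(queue, a_buf, CL_TRUE, 0, n * sizeof(float), a_host, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, b_buf, CL_TRUE, 0, n * sizeof(float), b_host, 0, NULL, NULL);

/* Kernel execution command: launch one work-item per element. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_buf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &c_buf);
size_t global_size = n;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);

/* Memory command: a blocking read copies the result back to host memory. */
clEnqueueReadBuffer(queue, c_buf, CL_TRUE, 0, n * sizeof(float), c_host, 0, NULL, NULL);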

4.2.4 Memory Model: Memory Abstraction

The execution model describes how the kernels execute, how they interact with the host, and how they interact with other kernels. To complete the OpenCL programming environment, it is necessary to discuss the OpenCL memory model. More importantly, as OpenCL is a platform designed for parallel computing, where memory organization is vital, it is crucial for the developer to understand the OpenCL memory model well. First, the hierarchical organization of the OpenCL memory model will be discussed.

In general, memory subsystems vary greatly between hardware platforms. For

example, all modern CPUs support automatic caching, while many GPGPUs do not.

Figure 4.4: The abstract memory model defined by OpenCL.

To support code portability, OpenCL’s approach is to define an abstract memory model that programmers can target when writing code and vendors can map to their actual memory hardware. The memory spaces defined by OpenCL are shown in Figure 4.4. As shown in Figure 4.4, the OpenCL memory abstraction has five memory domains: private, local, global, constant and host memory [52].

• Private Memory: specific to a work-item; it is not visible to other work-items.

• Local Memory: specific to a work-group; accessible only by work-items belonging to that work-group.

• Global Memory: accessible to all work-items executing in a context, as well as to the host (read, write, and map commands). Global memory is visible to all compute units on the device. Global memory is also the central station through which data is communicated between host and device.

• Constant Memory: a region for host-allocated and -initialized objects that are not changed during kernel execution.

• Host Memory: a region for an application's data structures and program data.

These memory spaces are relevant within OpenCL programs. The keywords associated with each space can be used to specify where a variable should be created or where the data that it points to resides. As mentioned above, the different memory types in OpenCL are abstract definitions; they are mapped to physical memory regions of a specific compute device. This memory mapping depends heavily on the hardware memory hierarchy of the compute device, which often varies across devices. Memory mapping schemes for CPUs and GPGPUs will be discussed shortly.
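The address-space keywords appear directly in kernel code, as in the illustrative sketch below (not code from this dissertation); each qualifier corresponds to one of the memory domains listed above.

/* Illustrative kernel showing the OpenCL address-space qualifiers. */
__constant float scale = 0.5f;                  /* constant memory: read-only during kernel execution */

__kernel void qualify(__global const float *in, /* global memory: visible to all work-items           */
                      __global float *out,
                      __local float *tile)      /* local memory: shared by one work-group             */
{
    size_t g = get_global_id(0);
    size_t l = get_local_id(0);
    float x = in[g];                            /* x lives in private memory (typically registers)    */
    tile[l] = x * scale;
    out[g] = tile[l];
}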

Memory Model: Memory Objects

In the execution model section, two types of objects have already been presented, namely the program object and the command-queue object. However, the details of memory objects, their types, and the rules on how to use them safely have not yet been explained.

In OpenCL there are two types of memory objects: buffer objects and image objects [52]. A buffer object stores a one-dimensional collection of elements and is allocated as a contiguous block of memory. Buffer objects can be accessed through pointers, similar to traditional memory pointers in legacy C programming. An image object is used to store a two- or three-dimensional texture, frame-buffer or image. The elements of an image object are selected from a list of predefined image formats [52]. Different from the buffer object, an image object cannot be accessed directly through pointers, as it requires special functions provided by the OpenCL API.

OpenCL memory buffers are defined on top of the OpenCL memory domains introduced above (i.e. private, local, global, constant and host memory). For example, a buffer object or an image object can be allocated in either local memory or global memory.
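For instance (a hypothetical sketch, assuming an existing context, err, and sizes n, width and height), the two kinds of memory objects are created on the host as follows; clCreateImage2D is the OpenCL 1.1-era call used with devices of this document's generation:

/* Buffer object: a linear block of n floats in global memory, usable through pointers. */
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);

/* Image object: a 2D texture accessed in kernels through read_imagef/write_imagef
   rather than through pointers.                                                     */
cl_image_format fmt = { CL_RGBA, CL_UNORM_INT8 };
cl_mem img = clCreateImage2D(context, CL_MEM_READ_ONLY, &fmt, width, height, 0, NULL, &err);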

4.2.5 Synchronization In OpenCL

As discussed above, OpenCL has been developed by a wide range of industry groups to satisfy the need for a standardized programming model that can achieve good performance across a wide range of hardware devices. A key goal of OpenCL is to follow a relaxed synchronization and memory consistency model, in order to maintain concurrency while assuring the correctness of OpenCL programs across different hardware. For example, threading is managed by hardware in GPGPUs, and to reduce the overhead of synchronizing thousands of threads, GPGPU hardware does not guarantee thread synchronization on a global scope. On the other hand, in mainstream CPUs, threading is managed by the operating system with a much tighter synchronization model, i.e. semaphores and locks.

OpenCL only defines global synchronization at kernel boundaries. Within a kernel dispatch, each work-item is independent. In OpenCL, synchronization between work-items is not defined. A write performed in one work-item has no ordering guarantee with a read performed in another work-item. To support sharing of data between work-items within the same work-group in local memory, OpenCL specifies the barrier operation within the work-group. A call to barrier within a work-item prevents that work-item from continuing past the barrier until all work-items in the group have also reached the barrier.
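A classic use of the work-group barrier is a partial reduction in local memory, sketched below (illustrative only, not code from this dissertation; it assumes a power-of-two work-group size and a scratch buffer with one float per work-item). No work-item may read values written by its neighbours until all of them have passed the barrier.

__kernel void wg_sum(__global const float *in, __global float *out, __local float *scratch)
{
    size_t l = get_local_id(0);
    size_t n = get_local_size(0);
    scratch[l] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);              /* wait for the whole work-group      */

    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        if (l < stride)
            scratch[l] += scratch[l + stride];
        barrier(CLK_LOCAL_MEM_FENCE);          /* synchronize after every step       */
    }
    if (l == 0)
        out[get_group_id(0)] = scratch[0];     /* one partial sum per work-group     */
}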

Figure 4.5 shows an example of local synchronization barriers: local barriers in different work-groups. Within a single kernel dispatch, synchronization is only guaranteed within work-groups. The barriers in different work-groups are independent; that is, a given barrier affects only the work-items within its own work-group. Global synchronization, on the other hand, is provided by the completion of the kernel and the guarantee that, on a completion event, all work is complete and memory content is as expected.

Figure 4.5: OpenCL Synchronization: local vs. global synchronization [17].

Under OpenCL's relaxed memory consistency, any memory object that is shared between multiple enqueued commands is guaranteed to be consistent only at synchronization points. This means that between two commands, consistency, and hence correctness of communication, is guaranteed at minimum between elements in an in-order queue, or on a communicated event from one command that generates the event to another that waits on it. Even in this case, memory object consistency is maintained only within the runtime, remaining invisible to the host API. To achieve host API correctness, the user must use one of the blocking operations discussed earlier. For example, clFinish will block until all operations in the specified queue have completed, thus guaranteeing the memory consistency of any buffers used by operations in the queue.
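For example (a sketch reusing the hypothetical queue and buffers from Section 4.2.3), a non-blocking read followed by clFinish guarantees that the result buffer is consistent before the host reads it:

clEnqueueReadBuffer(queue, c_buf, CL_FALSE, 0, n * sizeof(float), c_host, 0, NULL, NULL);
clFinish(queue);               /* blocks until every command in the queue has completed */
/* c_host now safely holds the kernel's output. */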

4.3 Mapping OpenCL Model to Hardware Devices

Key concepts in the OpenCL models have been discussed, including the platform model, execution model and memory model. It has been noted that OpenCL is designed to support a wide range of hardware devices, so its model definitions are abstractions rather than being specified for any particular hardware device. Consequently, one of the first steps in developing OpenCL programs is understanding the mapping of the OpenCL abstraction models to the targeted hardware devices. This section will present this mapping of the OpenCL model onto the two families of hardware devices introduced earlier in Chapter 3: the multicore CPU and the manycore GPGPU.

4.3.1 Mapping OpenCL Model To Multicore CPUs

Figure 4.6 shows the mapping of different components in the OpenCL model to corresponding hardware resources in a physical multicore processor, whose architecture has been discussed in Chapter 3.
Figure 4.6: Mapping OpenCL model to a physical CPU

The CPU's system memory (i.e. the DRAM memory in a mainstream computer) serves as global memory for OpenCL programs, while the CPU's cache memory serves as local memory. A CPU core does not have its own private memory, but it has a set of registers, which OpenCL treats as the private memory of the core. Technically, when a local memory buffer is explicitly declared in an OpenCL kernel, the OpenCL runtime will use the CPU's cache to allocate this buffer. However, this implementation is not recommended since it decreases performance: if a local memory buffer is used, data needs to be copied explicitly between global memory and the local buffer, which is actually the CPU's cache memory. CPUs, on the other hand, are equipped with very sophisticated caching hardware, which is often able to transfer data between system memory and cache much more efficiently than an explicit copy.

Another important consideration in the OpenCL-to-CPU mapping is the assignment of work-items to processing cores. An OpenCL kernel often exploits data-level parallelism with a massive number of concurrent work-items to achieve high computational throughput. Multicore CPUs, on the other hand, have a limited number of cores (the newest Intel Core i7 CPU has 6 cores), which implies a much smaller number of concurrent threads. Therefore, when running an OpenCL kernel on a CPU, the runtime will likely assign multiple work-groups, each with a large number of work-items, to each CPU hardware thread, as illustrated in Figure 4.7. When this situation occurs, a CPU hardware thread processes an entire work-group one work-item at a time before moving to the next work-group.

4.3.2 Mapping OpenCL Model To Manycore GPGPUs

Figure 4.8 shows the mapping of different components in the OpenCL model to corresponding hardware resources in a physical manycore GPGPU, whose architecture has been discussed in Chapter 3. Architecturally, the GPGPU model is very close to the OpenCL abstraction model, so there is a nearly one-to-one mapping from OpenCL to the GPGPU.

The GPGPU's memory hierarchy is nearly identical to the OpenCL memory hierarchy. When running an OpenCL program, the GPGPU's off-chip GDDR system memory serves as global memory. The characteristics of the GPGPU's off-chip system memory and on-chip shared memory, discussed in Section 3.3 of Chapter 3, are very close to the definitions of the global memory and local memory abstractions in OpenCL, respectively. As in the CPU, each processing element (PE) in a GPGPU does not have dedicated memory, so each PE uses its own register set as private memory for the OpenCL program.

Figure 4.7: Assigning multiple work-groups, each with a large number of work-items, to each CPU thread [17]. A CPU hardware thread processes an entire work-group one work-item at a time before moving to the next work-group.

The GPGPU model and the OpenCL model are analogous not only in memory hierarchy but also in the organization of processing elements. As discussed in Chapter 3, GPGPUs consist of hundreds of processing elements (PEs), and these PEs are organized into clusters called streaming multiprocessors (SM), each SM often containing 16 to 32 PEs. It is clear that each OpenCL work-group can be directly mapped to an SM. The work-items within a work-group are further grouped into teams, the work-items within each team being executed simultaneously in SIMD fashion on the PEs of the SM. This is the way that a warp in an Nvidia GPGPU or a wavefront in an AMD GPGPU works.


Figure 4.8: Mapping OpenCL model to a physical GPGPU

4.4 Conclusion

This chapter introduced the Open Computing Language (OpenCL), which is designed for modern heterogeneous parallel computing. OpenCL is jointly developed by a very large group of experts from processor vendors, system OEMs, middleware vendors, application developers, academia, research labs and FPGA vendors [51]. OpenCL is currently supported on a very wide range of hardware platforms from many different vendors, such as GPGPUs from Nvidia, AMD and Qualcomm, CPUs from Intel and AMD, and even FPGAs from Altera and Xilinx. As such, OpenCL is a very promising candidate for the future of parallel heterogeneous computing.

In order to help the reader understand OpenCL, the key concepts of the OpenCL models were introduced in Section 4.2, including the OpenCL platform model, the OpenCL execution model and the OpenCL memory model. While many model details have been omitted, this thesis focuses instead on the key concepts of the operation of the OpenCL platform. In particular, understanding the OpenCL execution and memory models, which are uniquely designed to exploit massive parallelism, is very important to writing efficient OpenCL programs.

After discussing the key concepts of the OpenCL models, Section 4.3 presented the mapping of the OpenCL abstraction models to targeted hardware devices, specifically the mapping of the OpenCL model to the two families of hardware devices introduced earlier in Chapter 3: the multicore CPU and the manycore GPGPU. It has been noted that OpenCL is designed to support a wide range of hardware devices, so its model definitions are abstractions rather than being specified for any particular hardware device. Consequently, understanding the mapping of OpenCL to hardware devices is crucial and is one of the first major steps in developing an OpenCL program.

This chapter, together with Chapter 3, which discussed various concepts in modern parallel hardware, has provided a complete view of parallel algorithm development, from software to hardware. This foundation will be key to helping the reader understand the solutions proposed by this thesis, which will be discussed in the next chapters.

Chapter 5

Design Novel Parallel Processing Methods For JPEG2000 Targeting Manycore GPGPUs

5.1 Introduction

As introduced earlier in Chapter 2, JPEG2000 provides many superior features compared to JPEG, including superior visual quality at very low bit-rates, wavelet-based compression, progressive transmission and random access to the code stream. JPEG2000 also provides great scalability in both quality and resolution and can work in both lossy and lossless mode on very large images. With all of these features, JPEG2000 is arguably one of the most advanced still image compression standards today. It is an ideal image standard both for mobile applications and for high-quality applications such as medical and scientific imaging, earth imaging, and post-processing for cinema. However, with the extra features come slower processing speeds. Slow performance has long been noted as a major drawback of JPEG2000, particularly in software implementations. Despite over a decade of research, the computation time for serial JPEG2000 coding on a single processing unit is still relatively high. For example, it takes an Intel Core i7 processor up to 17 seconds just to compress a typical high resolution image, 4096 × 4096 24-bit RGB, with the reference JasPer JPEG2000 Compression Software [11]. Therefore developing a high-performance solution for JPEG2000 is still a very attractive challenge.

This chapter therefore aims at proposing novel processing techniques to achieve a high speedup in accelerating JPEG2000. Different from previous JPEG2000 acceleration techniques, which only extract small amounts of coarse-grained parallelism, this study proposes techniques aimed at exploiting very fine-grained parallelism in JPEG2000. The design efforts primarily focus on the JPEG2000 Tier-1 encoding stage, the major bottleneck of the JPEG2000 flow, which occupies more than 80% of the total running time (discussed in Section 5.3). Moreover, the design techniques proposed in this chapter are highly optimized for massively parallel general purpose graphics processing units (GPGPUs). The GPGPU-based parallel approach for the Tier-1 encoding stage is conducted with two objectives in mind. First, the Tier-1 encoding stage has great potential for exposing very fine-grained parallelism, which fits very well to the manycore architecture of GPGPUs. Secondly, it offers insightful views on GPGPU performance characteristics in accelerating this exciting computational challenge. As a result, this study expects to gain crucial lessons on exploring modern parallel computing.

The rest of this chapter is organized as follows: Section 5.2 first discusses previous design approaches for accelerating the JPEG2000 coder. Next, before presenting the proposed methods, Section 5.3 and Section 5.4 analyze in depth the runtime profile and computational characteristics of the JPEG2000 encoding flow, particularly to identify its major bottleneck. Section 5.5 and Section 5.6 present novel processing methods to parallelize the bitplane coding and entropy coding stages of the Tier-1 coder, respectively. The GPGPU implementations of the proposed processing methods are discussed in Section 5.7, which focuses on presenting various GPGPU-specific optimization techniques. Finally, the performance gains of the proposed methods and the main conclusions are presented in Section 5.8 and Section 5.9.

5.2 Previous Work On Accelerating JPEG2000

As mentioned earlier, there have been numerous studies on accelerating the speed of the JPEG2000 coder in both hardware and software [83, 16, 69, 73, 48, 66, 86, 88, 54, 78]. However, their performance gains are still moderate due to the limitations on design approaches and supporting platforms. Most of those efforts have focused on parallelizing the Tier-1 coder in coarse-grained levels, such as processing code blocks or image tiles in parallel. The main reason behind this approach is that the code blocks, sub-bands and tiles can be processed using the original serial bit-level algorithms running on multiple threads. However, performance benefits are limited since coarse granularity is not fine enough to impact the root problem of the BPC, which requires massive operations at sample-level. There were also several studies trying to optimize the Tier-1 coder at sample-level but they are not complete parallel solutions and thus have very limited performance gains.

In the hardware domain, a common approach to accelerate the JPEG2000 coder is replicating multiple function units to code multiple image tiles or wavelet subbands

concurrently [48, 16]. This coarse-grained approach is easy to implement but its

performance gain is often limited. Due to the limitation in silicon die area and

power/economic budgets, it is impractical to replicate many function blocks on a single chip. For example, Analog Devices’s ADV 212 JPEG2000 ASIC chip uses an

architecture (shown in Figure 5.1) that employs one wavelet engine and three Tier-1 coders (EC) to encode three wavelet sub-bands concurrently [48]. Despite being one

of the most popular ASIC chips for JPEG2000, ADV 212’s performance is still slower

than needed for typical demands on coding high resolution image/video. Two ADV 212 chips are needed just to compress a real-time stream of SMPTE 274M (1080i)

video.
Figure 5.1: Functional block diagram of Analog Device’s ADV 212 JPEG2000 ASIC chip [48]. ADV 212 employs three Tier-1 encoders (EC blocks) to encode three wavelet subbands of input image in parallel.

There have also been several hardware architectures that approached the problem at a fairly fine-grained level, such as [66, 54, 86]. Lian et al. [66] proposed an architecture that processes all the samples in a stripe column concurrently (shown in Figure 5.2). This is one of the most fine-grained approaches among hardware-based implementations of JPEG2000. However, this architecture is difficult to parallelize massively. Although the method manages to remove the dependencies among the four bit-samples within a column, inter-column dependency is still present. Moreover, this type of fine-grained processing in hardware would require complex synchronization circuitry. Overall, this column-based processing gains about 40% speedup compared to a conventional serial implementation. In another direction, Varma et al. [86] and Gupta et al. [54] proposed a pipeline architecture that gave a 33% improvement in throughput compared to a standard serial implementation.

Figure 5.2: Column-based processing architecture proposed by Lian et al. [66].

In the software domain, increased parallelism can easily be obtained by running multi-threaded programs on multiprocessors or multicore processors to code independent image tiles or code blocks. However, this solution does not provide a great speedup due to the limited number of cores, which is typically less than 8 on a modern general purpose processor. For example, Norcen et al. [69, 73] only gain a 3.34× speedup with a code-block-level parallelization using OpenMP on a 10-processor platform.

More recently, JPEG2000 has been parallelized using manycore Graphics Processing Units (GPUs) [88, 78], but the performance still remains slow due to the coarse-grained parallelization strategy and the lack of optimizations specific to GPU hardware. Weiss [88] proposed a codeblock-level parallelization using CUDA, but this coarse-grained parallelization, with a large amount of branching operations, only allowed a very modest 3× speedup over the reference JasPer software [11]. In another work, Rusnak and Matela [78, 61] proposed a GPU-based parallel BPC working at sample granularity. Although they obtained a good performance improvement for the fine-grained parallel BPC, the bottleneck due to arithmetic coding still remains, thus leading to an overall speedup for the Tier-1 coder in JasPer of only 2.7× [61].

To summarize, while prior designs have shown several-fold improvements in performance, their gains are still far from the demands of realistic applications, especially in high-resolution imaging. Moreover, there have been very few efforts on exploiting modern parallel computing platforms, particularly GPGPUs, to accelerate the JPEG2000 coder in software. Hence this chapter aims at developing novel processing techniques to achieve a significant speedup for the JPEG2000 coder based on a GPGPU approach. To realize this goal, the next section first dissects the JPEG2000 encoding flow to find its performance bottlenecks.

5.3 JPEG2000 Encoder Performance Profile

Any computational problem must be analyzed carefully, in order to understand its characteristics and bottlenecks, before trying to improve its performance. This section dissects the JPEG2000 encoding flow to find the roots of its slow performance. The fundamental operations of the major stages of the flow have already been presented in Section 2.4.2 of Chapter 2. Therefore this section does not repeat them, but instead analyzes the computational complexity of those operations.

To better characterize the performance bottlenecks of JPEG2000, its encoding

flow is decomposed into main processes shown in Figure 5.3. The flow includes the discrete wavelet transform (DWT) and embedded block coding with optimized truncation (EBCOT). The DWT is a sub-band transform which transforms images from the spatial domain to frequency domain. Therefore, the DWT can efficiently exploit the spatial correlation between pixels in an image. Since JPEG2000 processes color components independently, a multi-color transformation (MCT) step may be needed to eliminate dependency between the components before the DWT. The

DWT coefficients, which may be scalar quantized in lossy mode, are then partitioned into individual code blocks and coded by the EBCOT independently. EBCOT is a two-tiered coder: Tier-1 is responsible for bit plane coding (BPC) and context adaptive arithmetic encoding (AE); Tier-2 handles rate-distortion optimization and

bitstream layer formation.


Figure 5.3: JPEG2000 Encoding Flow

                         Runtime (ms)
Operations    bike       cafe       wash-ir    woman
MCT           17.22      17.24      14.23      17.09
DWT           308.30     308.36     224.22     309.38
Tier-1        4069.76    4888.55    4881.09    3877.50
Tier-2+I/O    350.38     362.43     314.02     345.69

Table 5.1: Runtime profile of the JasPer [11] JPEG2000 encoder in lossless mode (running on an Intel Core i7 930 processor)

The major coding stages are profiled using the JasPer JPEG2000 encoder with a popular image test set including the bike, woman, cafe and wash-ir images obtained from [59, 74]. The results are shown in Table 5.1. The runtime is divided into four parts: MCT is the runtime of the multi-color transformation; DWT is the runtime of the wavelet transform; Tier-1 is the total runtime of BPC and AE; and Tier-2+I/O is the total runtime of Tier-2 coding plus the remaining image I/O operations.

The table clearly shows that the Tier-1 coder is the major bottleneck of the flow, consuming more than 80% of the total processing time. This is the reason why Tier-1 encoding optimization has been the focus of much research [66, 54, 16, 88].

Consequently, in order to accelerate the JPEG2000 encoding flow, this study also tries to break the Tier-1 coder bottleneck first.

5.4 JPEG2000 Bitplane Coder Analysis

In order to break the Tier-1 coder bottleneck, it is important to understand the characteristics of its coding stages. This section dissects the first stage of the Tier-1 coder: the bitplane coding stage. The entropy coding stage will also be analyzed in Section 5.6.

Recalling from Section 2.4.2 of Chapter 2, the bitplane coder, shown in Figure 5.4, processes code blocks bitplane by bitplane, from the most significant bit (MSB) to the least significant bit (LSB). In each bit plane, the bits are further grouped into stripes of four lines and the BPC scans stripes column-wise, from left to right, using three non-overlapping coding passes: the significance propagation pass, the magnitude refinement pass and the cleanup pass [84, 60]. The BPC generates (Cx,D) pairs, where the contexts (Cx) are defined in [60] and take one of 19 different values, while the decision bits (D) are determined by the context window (a 3 × 3 neighborhood). The very fine-grained processing at bit level by the BPC efficiently classifies the data samples to help the context adaptive arithmetic coder (introduced in Section 2.5.2) obtain better coding efficiency. Additionally, the BPC fractional coding scheme with three non-overlapping coding passes provides the Tier-2 coder with the best truncation points in the code stream for rate-distortion optimization [83]. However, the fine-grained coding scheme introduces a very high level of complexity in both arithmetic and memory operations.


Figure 5.4: Viewing of codeblock, bitplane, sample stripe and context window in JPEG2000

Consider a 2-D RGB color image to be coded with JPEG2000, having a size of M × N pixels in three color components. The BPC has to form contexts for K × M × N bit-samples in three separate coding passes (i.e., SPP, MRP, CUP), where K is the maximum number of bits required to represent the wavelet coefficients. Furthermore, when forming the context of a single sample, the BPC has to refer to information from a 3 × 3 neighborhood centered around the specific sample (the context window).

The bit plane coder (BPC) maintains three arrays of state variables called σ, σ' and η during the coding process. When the BPC codes a sample s[x,y], it assesses the set of state variables (σ, σ', η) at each of the neighbors within the context window centered around (x,y). As a result, the complexity of the BPC stage in both arithmetic operations and memory operations is O(9 × 3 × 3 × K × M × N), which is tremendous.
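To make the cost of context formation concrete, the following C sketch (illustrative only; the function and array names are not taken from JasPer or the author's implementation) shows the neighbor gathering implied by a single 3 × 3 context window. This inner loop is repeated for every bit-sample, every coding pass and every bit plane.

#include <stddef.h>

/* Sum of significance states in the 3x3 window around (x, y) for a
 * W x H code block stored row-major; out-of-range neighbors are treated
 * as insignificant. */
static int bpc_context_sum(const unsigned char *sigma, int W, int H, int x, int y)
{
    int sum = 0;
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            if (dx == 0 && dy == 0) continue;          /* skip the sample itself */
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= W || ny >= H) continue;
            sum += sigma[(size_t)ny * W + nx];         /* one memory access per neighbor */
        }
    }
    return sum;   /* 8 reads per sample, repeated for every bit plane and pass */
}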

Moreover, the BPC stage not only requires an enormous number of memory access operations, but its memory access pattern is also very inefficient on general-purpose computing platforms, whose memory hierarchies rely on locality in the application's memory reference stream to deliver performance. In a typical memory hierarchy on a general computing platform, the processing unit actually works with data cached in cache memory instead of working directly with data stored in main memory. Accessing cache memory (taking several clock cycles) is much faster than accessing main memory (taking several hundred clock cycles) [77]. Therefore cache misses result in significant performance loss. One of the widely adopted techniques in memory hierarchy design to reduce cache misses is exploiting the spatial locality of data elements. Specifically, modern cache designs exploit spatial locality by fetching large blocks of data called cache lines on a cache miss. Subsequent references to words within the same cache line (spatially local) result in cache hits.

On the other hand, the samples in a 3 × 3 context window in the BPC stage are often spatially far apart from each other in main memory, although, visually, one would think that the elements in a context window are close. Data in an array is often stored row-wise in main memory, which causes the data elements of different array rows to be located in different DRAM rows. Consequently, the elements are cached in different cache lines. Overall, the context-window-based access pattern in the BPC significantly reduces cache spatial locality. The same problem would occur if the arrays were stored column-wise.
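As a concrete illustration (the numbers are assumed for the example, not measured), consider a code block 512 samples wide whose 32-bit coefficients are stored row-major, so that sample (x, y) sits at byte offset 4·(512·y + x). A sample's horizontal neighbors are only 4 bytes away and usually share its 64-byte cache line, but its vertical neighbors are 4·512 = 2048 bytes away; the three rows touched by one 3 × 3 context window therefore occupy at least three different cache lines, and the column-oriented stripe scan keeps alternating between them.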

Finally, the fine-grained BPC introduces inter-sample dependency, making parallel processing in JPEG2000 at bit-sample granularity very difficult. Indeed, the processing challenges in the BPC stage are not only reflected in the high degree of complexity in arithmetic and memory operations but also in the inherently serial nature of its processing scheme. In particular, the state variables' updating process is highly serial. The states (σ, σ', η) are updated on the fly during the coding process whenever the BPC forms a new context for a bit-sample. In other words, when a bit-sample is coded and its state variables change, the coding conditions of all the neighbors within its context window are affected. For example, in the Significance Propagation Pass (SPP), Figure 5.5 shows the conditions to encode a bit-sample S(x,y,j) at row x and column y on bitplane j. After the bit-sample is encoded, its significance state bit is also updated to σ(x,y) = 1. Note that the significance state bit at a certain location is updated from '0' to '1' only once and retained across bitplanes during the entire BPC process. This is also the reason why σ(x,y) does not require a bitplane sub-index (j).

The coding conditions and on-the-fly updating of state bits shown in Figure 5.5 cause the coding process in the BPC to be very complex. Figure 5.6 shows the coding

steps that the BPC follows when conducting Zero Coding on a 4×4 bit-sample array.

Recall from Section 2.4.2 of Chapter 2 that the coding process must follow a restricted scanning order from top-to-bottom and left-to-right. Following the scanning order

on the 4 × 4 data array and state array in Figure 5.6, the sample S(2, 0) is the first



Figure 5.5: Conditions for a bit to be encoded by Zero Coding (ZC) in the Significance Propagation Pass (SPP)

bit-sample that meets the conditions in Figure 5.5. After S(2, 0) is encoded, σ(2, 0) is also updated to '1'. The BPC strictly requires that all coding conditions be satisfied; for example, when coding sample S(1, 1), two conditions are satisfied: sample S(1, 1) = 1 and S(1, 1) has a significant neighbor (σ(2, 0) = 1). However, σ(1, 1) = 1, therefore this bit-sample is not encoded and no update happens in the state array. The on-the-fly update process also changes the conditions of bit-samples rapidly. For example, sample S(3, 1) initially has no significant neighbors (shown in the initial SIG array); however, when the coding process reaches S(3, 1) (following the scanning order), S(3, 1) already has a significant neighbor, as σ(2, 0) has just been updated on-the-fly. Consequently, the coding conditions of bit-samples are non-deterministic, being changed on-the-fly and depending on the characteristics of the input data. This issue makes the BPC process inherently serial and it may require



Figure 5.6: Updating of the significance state bits (σ) of encoded bit-samples (in the data array) across ZC steps

special processing techniques and data organization to be able to expose parallelism at the bit-sample level. This study solves this problem by proposing novel processing techniques in the next sections.

5.5 Parallel Bitplane Coder Design

This section proposes various processing techniques to accelerate the BPC stage based on parallel processing on manycore GPGPUs. However, in designing any parallel program it is very important to decide the level of granularity at which parallelism occurs during processing. Consequently, Section 5.5.1 first discusses the fine-grained parallel approach for the bitplane coder on manycore GPGPUs that this study aims to achieve.

5.5.1 Fine-grained Parallel Approach For Bitplane Coder

As discussed earlier in Section 3.3 of Chapter 3, although GPGPUs are re-designed for general-purpose computing, the GPGPU approach is very different from the CPU approach toward general-purpose computing. Specifically, the GPGPU is specialized for computationally intensive, highly parallel code. Moreover, in GPGPUs, more transistors are devoted to data processing rather than data caching and flow control. Therefore, designing parallel programs for GPGPUs requires specific attention to data granularity and also to the complexity of the program. In order to exploit GPGPUs' massive parallel capability, data granularity must often be fine enough to unlock SIMD parallel threads. Also, GPGPUs often favor programs that have simple execution paths to avoid branching divergence as much as possible.


Figure 5.7: Pyramid Data Structure in JPEG2000

However, the JPEG2000 flow is very complex, with numerous control operations and inherently serial processing paths, especially in the BPC stage, as discussed in Section 5.4. To the best of the author's knowledge, there had been no design approach to date that can expose bit-sample-level parallelism in the JPEG2000 flow, particularly in the BPC stage. Instead, previous designs often approach parallel processing in JPEG2000 with a coarse granularity, either in CPU-based solutions [73] or GPGPU-based solutions [88]. In particular, a coarse-grained parallel approach for JPEG2000 is very inefficient on the GPGPU since it is unable to expose enough concurrency for the massively parallel SIMD threads of GPGPUs, while introducing very complex execution paths with many control operations. For example, the CUJ2K project, one of the first GPGPU-based parallel solutions for JPEG2000, exploits parallelism in JPEG2000 at the codeblock level. As shown in Figure 5.7, JPEG2000 codeblocks are processed independently; therefore CUJ2K encodes each codeblock with a GPGPU streaming multiprocessor (SM) using the legacy serial JPEG2000 flow. This approach is indeed simple to implement, its parallel approach being similar to replicating multiple serial coders on multiple SMs. However, CUJ2K's performance gain is very limited, achieving merely about 3× speedup over the serial JPEG2000 coder in the EBCOT Tier-1 stage, as shown in Figure 5.8.

After in-depth analysis of the JPEG2000 coding flow and GPGPUs' characteristics, this study recognized that fine-grained parallelism is mandatory to exploit the massively parallel capability of GPGPUs. Therefore, it was decided to design a parallel JPEG2000 coder that can operate at a very fine granularity, specifically at the bit-sample level, as shown in Figure 5.10. The study aims at allowing numerous parallel threads of the GPGPU to process all the bit-samples concurrently, and the design approach starts from the most important stage, namely the BPC stage. However,

Figure 5.8: Performance of the GPGPU-based CUJ2K parallel JPEG2000 coder on the EBCOT Tier-1 stage [88]. CUJ2K's coarse-grained approach achieves only about 3× speedup over the serial JPEG2000 coder running on a CPU.

as analyzed in Section 5.4, the BPC stage is inherently serial and very complex. It may require novel processing techniques and data structures to expose bit-sample

parallelism to GPGPUs. Section 5.5.2 will first tackle the problem by proposing a

novel processing method to remove inter-sample dependency in the BPC stage.

5.5.2 Removing Inter-Sample Dependency

It is the sequential dependence between the state variables that prevents all samples from being processed concurrently in the context formation stage. Indeed, at each location in a stripe, the BPC needs state information for 3 × 3 samples, but the state variables of each sample are not predetermined since they are updated on the fly. This situation can be considered a read-before-update race condition. Figure 5.9 illustrates an example of such a race condition occurring in the parallel Zero Coding process.



Figure 5.9: An example of a race condition occurring in the parallel Zero Coding process


Figure 5.10: Bit-sample level parallel approach for JPEG2000. This very fine-grained approach allows numerous parallel threads to process not only codeblocks concurrently but also bit-samples within each codeblock concurrently.

The race condition causes some significance bits to be read before being updated. Consequently, the output wrongly differs from that of the standard serial scheme. Therefore, a prediction method is needed to pre-determine the state information before the BPC starts coding so that it can process all the samples concurrently.


Figure 5.11: Scanning pattern in bit plane. Within each context window, a sample always has 4 visited neighbors and 4 non-visited neighbors

The scanning pattern of the BPC on the bit plane is shown in Figure 5.11. In this scanning pattern, assume the BPC is processing a certain sample (i.e., the current sample); the rest of the samples are always divided into two sub-arrays: the visited samples and the non-visited samples. The visited samples are the ones that were already scanned by the BPC from top-down and left-right and are referred to as past samples. The non-visited samples are the ones that the BPC hasn't reached yet and are referred to as future samples. When the BPC starts coding the bit plane p, the state arrays are called the pre arrays (i.e., σ_{pre,p}, σ'_{pre,p}, η_{pre,p}), while after the BPC finishes coding they are called the post arrays (i.e., σ_{post,p}, σ'_{post,p}, η_{post,p}). Note that in a bit plane the state variables are only updated once and remain unchanged until the BPC finishes coding that bit plane. Therefore, when the BPC codes a certain sample S[x,y], the context window is formed from four visited neighbors with the state information from the σ_{post,p}, σ'_{post,p}, η_{post,p} arrays and four non-visited neighbors with the state information from σ_{pre,p}, σ'_{pre,p}, η_{pre,p}. For example, the summation of the significance (SIG) states of the horizontal and vertical neighbors of sample S[x,y] in Figure 5.11 can be computed as:

Σ_{H,p}[x,y] = σ_{post,p}[x−1,y] + σ_{pre,p}[x+1,y]    (5.1)

Σ_{V,p}[x,y] = σ_{post,p}[x,y−1] + σ_{pre,p}[x,y+1]    (5.2)

This observation enables the removal of the inter-sample dependency by combining the pre and post state arrays. Specifically, if the BPC knows the state arrays before it starts the coding process, it can concurrently form context windows for all the samples. This can be accomplished by the state information prediction methods introduced in the next section.
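As a small illustration of Equations (5.1) and (5.2), the following C fragment (with assumed array names, x = row and y = column as in the text, W columns per row, and boundary checks omitted) combines the two precomputed state arrays: the already-visited neighbor is read from the post array and the not-yet-visited neighbor from the pre array.

/* Horizontal and vertical significance sums of Eqs. (5.1)-(5.2). */
static int sig_sum_h(const unsigned char *sigma_post, const unsigned char *sigma_pre,
                     int W, int x, int y)
{
    return sigma_post[(x - 1) * W + y] + sigma_pre[(x + 1) * W + y];
}

static int sig_sum_v(const unsigned char *sigma_post, const unsigned char *sigma_pre,
                     int W, int x, int y)
{
    return sigma_post[x * W + (y - 1)] + sigma_pre[x * W + (y + 1)];
}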

5.5.3 Predicting State Variables

As discussed in Section 5.5.2, the three state variables σ, σ', η need to be pre-determined to support the concurrent bitplane coding of all samples. However, the BPC only refers to the η state of a sample itself; therefore, there is no inter-sample dependency in the η state, so no prediction is needed. By definition, at bit plane p the post refinement state σ'_{post,p}[x,y] = 1 indicates that the MRP has already been applied to the sample s[x,y], which is equivalent to saying the sample s[x,y] was already significant at bit plane p − 1. Therefore the refinement state can be computed as:

σ'_{post,p}[x,y] = σ_{post,p−1}[x,y]    (5.3)

The remaining task is to predict the significance array σ_{post,p}. The SIG state of a sample is updated using a zero coding primitive that appears in two coding passes, SPP and CUP. The significant samples that are updated by SPP are predicted first. In SPP, if an insignificant sample has a value of 1 and is in a preferred neighborhood (i.e., has at least one significant neighbor in its 3 × 3 context window), its SIG state will switch from 0 to 1. Mathematically, the conditions for updating the significance state σ[x,y] of sample s_p[x,y] in SPP can be expressed by the following equations:

σ[x,y] = 0    (5.4)

s_p[x,y] = 1    (5.5)

\bigvee_{x−1 ≤ m ≤ x+1, y−1 ≤ n ≤ y+1, (m,n) ≠ (x,y)}  σ[m,n] = 1    (5.6)

The conditions (5.4) and (5.5) are easy to check. However, for condition (5.6),

the BPC needs to check the SIG states of the 8 neighbors. On the other hand, if the BPC processes all the samples concurrently, it is possible that the SIG states of the neighbors are being updated as well and the read-before-update situation will occur. Once this occurs, the SIG state of a sample may not be updated from 0 to 1 as intended. For example, if the BPC checks the neighbors of s_p[x,y] before their SIG states are updated, the BPC may incorrectly determine whether or not s_p[x,y] is in a preferred neighborhood. However, the BPC can properly predict σ[x,y] if condition (5.6) is satisfied before it processes s_p[x,y]. Also note that after the state σ[x,y] is updated from 0 to 1, all neighbors of s_p[x,y] will be in a preferred neighborhood.

The above observations suggest that the σ array must be predicted through several parallel reduction steps, with each step k processing a subset n_{k,p} ⊂ N_p. Notice that at each successive step the number of samples with significant neighbors increases monotonically. Figure 5.12 shows an example of this strategy. Given a 4×4

bit-sample array and a corresponding 4×4 significance state (SIG) array (the same set

of input data as in Figure 5.6). Initially, only sample S(1, 1) is significant with σ(1, 1) = 1. In the first reduction step, 16 concurrent computing threads blindly check all 16

context windows corresponding to all 16 bit-samples. The threads check all context windows in a blind way because they do not know yet if those context windows will

all satisfy the coding condition. However it is expected that there would be some.

Indeed, there are two context windows (W ) satisfying the condition in the first step, W (2, 0) and W (2, 2) (W (x,y) is the context window centered at the bit-sample at

row x, column y). Consequently, after the first reduction step, σ(2, 0) and σ(2, 2)

are updated from ‘0’ to ‘1’. In the next reduction step, the second step, W (1, 3) and W (3, 1) are now satisfied and σ(1, 3) and σ(3, 1) are updated respectively. It is

important to note that, in the first step, the conditions for W (1, 3) and W (3, 1) were



Figure 5.12: Reduction steps in predicting the significance state (SIG) variables.

not satisfied; however, the newly updated σ(2, 0) and σ(2, 2) cause the conditions in W(1, 3) and W(3, 1) to be satisfied (having at least one significant neighbor within the window). This phenomenon is crucial in helping the reduction strategy converge. Note that even after σ(1, 3) is updated, W(0, 3) is still not satisfied since

S(1, 3) is a post location of S(0, 3); therefore the updated value of σ(1, 3) does not have any impact on W(0, 3). Overall, after just two reduction steps, the prediction process finishes and the predicted SIG array is identical to the SIG array obtained from legacy serial processing shown in Figure 5.6. Another crucial issue is determining how many iterations it takes to finish the reduction process. Recall from Equation (5.2) that a recently updated state σ_{post,p}[x,y] only affects four future neighbor samples.

The reduction process therefore takes at most four iterations to finish because after

four steps the BPC has already updated all the samples.
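The following C sketch renders this reduction concretely. All names are illustrative, and the stripe-oriented scan of the real BPC is simplified to the column-major order of the 4×4 example; sigma_post must be initialized from sigma_pre (the significance array after plane p−1) before the call, and the serial loops stand in for the one-thread-per-sample execution on the GPU. Following Equations (5.1)-(5.2), already-visited neighbors are read from sigma_post and future neighbors from sigma_pre.

static int visited_before(int m, int n, int x, int y)
{
    /* Is neighbor (m, n) scanned before (x, y)? Column-major order:
     * columns left to right, rows top to bottom within a column. */
    return (n < y) || (n == y && m < x);
}

/* Predict which samples the SPP makes significant in bit plane p.
 * sp[] holds the magnitude bits of plane p; arrays are W x H, indexed
 * row*W + col with x = row and y = column, as in the text. */
static void predict_spp_sig(unsigned char *sigma_post, const unsigned char *sigma_pre,
                            const unsigned char *sp, int W, int H)
{
    for (int step = 0; step < 4; step++) {           /* converges in at most four steps */
        int changed = 0;
        for (int x = 0; x < H; x++) {
            for (int y = 0; y < W; y++) {
                int idx = x * W + y;
                if (sigma_post[idx] || !sp[idx])      /* conditions (5.4) and (5.5) */
                    continue;
                int hit = 0;                          /* condition (5.6) */
                for (int m = x - 1; m <= x + 1 && !hit; m++)
                    for (int n = y - 1; n <= y + 1 && !hit; n++) {
                        if ((m == x && n == y) || m < 0 || n < 0 || m >= H || n >= W)
                            continue;
                        hit = visited_before(m, n, x, y) ? sigma_post[m * W + n]
                                                         : sigma_pre[m * W + n];
                    }
                if (hit) { sigma_post[idx] = 1; changed = 1; }
            }
        }
        if (!changed) break;                          /* stop early once the array is stable */
    }
}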

After updating the set of significant samples in bit plane p (denoted N_{p,SPP}), an exclusion method can be used to predict the set of significant samples updated by the CUP (denoted N_{p,CUP}). Only SPP and CUP can update the significant samples, so the set of significant samples updated in bit plane p (denoted N_p) can be computed using Equation (5.8).

N_p = N_{p,SPP} ∪ N_{p,CUP}    (5.7)

⇒ N_{p,CUP} = N_p \ N_{p,SPP}    (5.8)

On the other hand, all insignificant samples that have a value of one at bit plane p will become significant after the BPC codes that bit plane. N_p can therefore be explicitly computed using Equation (5.9).

N_p = {s[x,y] : s_p[x,y] = 1, σ_{p−1}[x,y] = 0}    (5.9)
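A minimal continuation of the same sketch shows the exclusion step of Equations (5.7)-(5.9); cup_mask is a hypothetical output array marking the samples whose significance update is attributed to the cleanup pass.

/* sigma_pre:       significance after plane p-1 (members of sigma_{p-1})
 * sigma_post_spp:  significance predicted by the SPP reduction above
 * sp:              magnitude bits of plane p */
static void predict_cup_sig(unsigned char *cup_mask, const unsigned char *sigma_pre,
                            const unsigned char *sigma_post_spp, const unsigned char *sp,
                            int W, int H)
{
    for (int i = 0; i < W * H; i++) {
        int in_Np     = (sp[i] == 1 && sigma_pre[i] == 0);        /* Eq. (5.9) */
        int in_Np_spp = (sigma_post_spp[i] && !sigma_pre[i]);     /* updated by SPP */
        cup_mask[i]   = (unsigned char)(in_Np && !in_Np_spp);     /* N_p \ N_{p,SPP} */
    }
}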

The prediction strategy efficiently eliminates the inter-sample dependency using two state arrays (called pre and post) precomputed by the reduction and exclusion methods in Section 5.5.3. By eliminating inter-sample dependencies we can exploit a very high level of parallelism, as we can partition our workload into a high number of very fine-grained concurrent threads without any conflict. Our technique can be generalized and applied to other parallel applications that need to eliminate inter-sample dependency, which often arises in many applications from the image processing domain. It should also be noted that the technique is optimized for SIMT parallel computing platforms that have massively parallel arithmetic capability provided by a very large array of simple ALUs, such as GPGPUs. Therefore, although the technique can run on multicore CPUs, these CPUs may not fully exploit the advantages of the technique due to the lack of massively parallel arithmetic capability.

5.6 Designing A Parallel Arithmetic Coder

As discussed earlier, Tier-1 coding in JPEG2000, which consists of two major stages, bitplane coding and entropy coding, is the major bottleneck in the coding flow. Therefore it is crucial to speed up Tier-1 coding in order to accelerate the JPEG2000 flow. Section 5.5 introduced novel techniques to parallelize the bitplane coder, hence the remaining task in speeding up Tier-1 coding is accelerating the entropy coding stage. In this section, an effort to accelerate the entropy coding stage is made through a novel parallel implementation of the classic arithmetic coder (AC).

The design process of the parallel arithmetic coder starts by first reviewing the key challenges in the MQ coder of JPEG2000 and in classic arithmetic coding. The two major roots that cause arithmetic coding to require intensive serial computing operations will be analyzed. Next, a novel method will be proposed to efficiently solve the key challenges in arithmetic coding. In particular, a parallel AC is introduced.

5.6.1 Entropy Coding In JPEG2000 Revision

Recall from Section 2.5.2 of Chapter 2 that JPEG2000 entropy coding uses context adaptive arithmetic encoding (AE), which eliminates redundant information to further compress the JPEG2000 bit stream. The entropy coding stage employs a multiplier-free context adaptive arithmetic coder named the MQ coder, which is a descendant of the multiplier-free Q coder [19]. The MQ coder eliminates multiplication operations by using an approximation, expressed in Equation (5.10), together with a look-up table for estimating symbols' probabilities.

If A ≈ 1:   A − Qe·A ≈ A − Qe = sub-interval for the MPS,
            Qe·A ≈ Qe = sub-interval for the LPS.    (5.10)

To keep the approximation condition in Equation (5.10) satisfied, the MQ coder must always keep the length register A in the range [0.75, 1.5] [84]. This condition therefore requires that the A register be renormalized (RENORM) very frequently.

The LUT (Table 2.6) is actually a finite state machine (FSM) built from a complex

Markov model [19].

Consequently, both the FSM and the RENORM processes make the MQ coder very difficult to parallelize. In particular, because the FSM is built from a very large probability space, there is a high computational cost in predicting the state of the FSM [19, 84]. Moreover, frequent RENORM operations break the recursive relationship between the steps in the MQ coder; the discontinuity of a recursive track at a RENORM point will be discussed further in Section 5.6.3. It is therefore almost impossible to build a prediction function for the symbol probability and the length registers required for a parallel solution of the MQ coder.

On the other hand, Taubman, who invented the EBCOT coding engine for

JPEG2000 [83], initially used the classic arithmetic coder (AC) rather than the

MQ coder [80, 81] in JPEG2000. The main reason for eventually adopting the MQ coder was to avoid the expensive computational cost of the multiplier. However, this

comes at the cost of decreased coding efficiency. Moreover, the original motivation of trying to avoid multiplication is obsolete since this operation is now as fast as the

shift and add unit. In fact, in recent CPU implementations, the speed of an optimized classic AC is similar to that of the MQ coder [79].

It is clear that there is little motivation to stick with the MQ coder for JPEG2000,

particularly in enabling a parallel solution. After investigating various types of en-

tropy coding, it was determined that the classic AC is a good starting point. Like the MQ coder, the classic AC also consists of a chain of highly serial steps and renormalizations. However, these steps are well optimized using a recursive approach, which makes it more amenable to parallelization. The next section describes how these steps are parallelized.

5.6.2 Parallelizing The Classic Arithmetic Coder

Recall from Section 2.5.1 of Chapter 2 that the arithmetic coding method (shown in Figure 5.13) recursively creates a sequence of nested intervals of the form Φ_k(S) = ⟨b_k, l_k⟩, where b_k and l_k are called the base and the length of the interval Φ_k, respectively. In binary arithmetic coding, for example, b_k and l_k are computed based on the values of b_{k−1}, l_{k−1} and the input sample s_k (s_k ∈ {0, 1}) using the set of equations (5.11) [80]:


Figure 5.13: Interval scaling in the Arithmetic Coder. It recursively creates a sequence of nested intervals of the form Φ_k(S) = ⟨b_k, l_k⟩ based on the probability of the input symbols.

x_k = l_{k−1} C_{k−1}^{(0)}

if s_k = 1:   l_k = l_{k−1} − x_k,   b_k = b_{k−1} + x_k
if s_k = 0:   l_k = x_k,             b_k = b_{k−1}                    (5.11)

where C_k^{(0)} is the cumulative distribution of the symbol '0' up to step k. C_k^{(0)} is computed by Equation (5.12), previously seen in (2.6), based on the number of times bit 0 and bit 1 appeared in the input sequence up to step k (i.e., N_k^{(0)} and N_k^{(1)}).

C_k^{(0)} = (N_k^{(0)} + 1) / (N_k^{(0)} + N_k^{(1)} + 2)             (5.12)
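For reference, a minimal floating-point C sketch of one step of this classic binary coder is given below; it follows Equations (5.11) and (5.12) directly, leaves out renormalization and code-bit output (the subject of Section 5.6.3), and uses illustrative names only. Starting from b = 0 and l = 1, any number in the final interval [b, b + l) identifies the encoded sequence.

typedef struct {
    double b, l;       /* base and length of the current interval */
    long   n0, n1;     /* counts of 0s and 1s seen so far */
} ac_state;

static void ac_encode_bit(ac_state *st, int s_k)
{
    double c0 = (st->n0 + 1.0) / (st->n0 + st->n1 + 2.0);  /* Eq. (5.12) */
    double x  = st->l * c0;                                 /* Eq. (5.11) */
    if (s_k) {
        st->l -= x;        /* keep the upper sub-interval for bit 1 */
        st->b += x;
        st->n1++;
    } else {
        st->l  = x;        /* keep the lower sub-interval for bit 0 */
        st->n0++;
    }
}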

To adapt the classic arithmetic coding method for parallel processing, the scaling equations can be expressed in a different way, as in Equation (5.13):

l_k = l_{k−1} α_k,    b_k = b_{k−1} + β_k,
where  α_k = (1 − 2 s_k) C_{k−1}^{(0)} + s_k   and   β_k = l_{k−1} C_{k−1}^{(0)} s_k    (5.13)

From the set of equations in (5.13), the recursive equations (5.14) and (5.15) can be derived to compute the base (b_k) and the length (l_k) of the interval Φ_k:

l_k = l_0 ∏_{i=1}^{k} α_i    (5.14)

b_k = b_0 + Σ_{i=1}^{k} β_i    (5.15)

Equations (5.14) and (5.15) contain a prefix sum and a prefix product. Fortunately this class of prefix operations can be computed using the Prefix Scan algorithm in Figure 5.14 [20]. The prefix scanning algorithm is sufficient to conduct many types of associative functions, such as prefix sum, prefix product, maximum and minimum, on a set of elements using parallel threads. It calculates in parallel, in log n steps, an associative function F(S) on all prefixes of an n-element array S using n parallel threads. Figure 5.15 shows an example of 8 parallel threads finishing the computation of F(a_0, ..., a_7) in log(8) = 3 scanning steps.

Algorithm 1 Parallel Prefix Scan
Input: S = s[0] ... s[n]
Output: Scan F(S) = s[0], F(s[0], s[1]), ..., F(s[0] ... s[n])
1: for j := 0 to lg(n)−1 do
2:   for i := 2^j to n−1 do
3:     s[i] := F(s[i−2^j], s[i])   in parallel

Figure 5.14: Parallel prefix scanning algorithm [20]: compute an associative function F(S) on an n-element array S using log(n) reduction steps


Figure 5.15: An example of the prefix scanning algorithm: it takes 8 parallel threads and log(8) = 3 scanning steps to finish computing F(a_0, ..., a_7)
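A compact C rendering of Algorithm 1, specialized to the prefix product needed by Equation (5.14), is shown below; the descending inner loop emulates the read-before-write behavior that the n parallel threads obtain for free, and the example values are arbitrary.

#include <stdio.h>

/* Serial rendering of the prefix scan of Figure 5.14 for F(a, b) = a * b. */
static void prefix_product_scan(double *s, int n)
{
    for (int j = 1; j < n; j <<= 1)            /* j = 2^0, 2^1, ... < n */
        for (int i = n - 1; i >= j; i--)       /* downwards, so s[i-j] is still old */
            s[i] = s[i - j] * s[i];
}

int main(void)
{
    double a[8] = {2, 1, 3, 1, 2, 1, 1, 2};    /* e.g. the alpha_i factors of Eq. (5.14) */
    prefix_product_scan(a, 8);                 /* a[i] now holds the product a[0]*...*a[i] */
    for (int i = 0; i < 8; i++)
        printf("%g ", a[i]);                   /* prints: 2 2 6 6 12 12 12 24 */
    printf("\n");
    return 0;
}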

5.6.3 Handling Renormalization Points

The parallelization method proposed in Section 5.6.2 can be realized if the recursion continuity is not broken. However, in arithmetic coding the RENORM step might break the recursion. Indeed, each time the base and length are renormalized, their values will be out of the recursive track. Specifically, Equations (5.14) and (5.15) are developed based on the relationship l_k = α_k l_{k−1}, where α_k is known in advance, but if there is a RENORM at step j then the length at that step will be computed as l_j = R_j α_j l_{j−1}, where R_j is a nondeterministic RENORM factor. Hence the recursion track in Equation (2.5) is broken at step j and the track must restart with new initial values b_0 = b_j = R_j β_j b_{j−1} and l_0 = l_j = R_j α_j l_{j−1}. As a result, only a small group of samples between two adjacent RENORMs can be processed in parallel using a parallel scanning strategy for Equations (5.14) and (5.15). Such a group of samples processed in parallel is referred to as a parallel window, and the whole input stream must be processed through multiple parallel windows, as illustrated in Figure 5.16.

The speedup of this processing method highly depends on the size of the parallel windows, i.e., the number of samples between two adjacent RENORMs.

In addition, in classic arithmetic coding with short registers, RENORM is called every time each symbol is coded to avoid losing accuracy and compression efficiency.

However, this RENORM strategy is inefficient on a modern hardware platform with

very long registers exceeding 128 bits.


Figure 5.16: Processing a stream using multiple parallel windows.

In order to effectively parallelize arithmetic coding, the RENORM strategy must be changed such that it does not occur very frequently, thus keeping the window size large. This can be accomplished by applying the RENORM technique developed by Said [80]. Said proposed the use of long registers (16 bits or more), shown in Figure 5.17, with a waiting buffer for code bits in order to reduce the number of times RENORM is invoked, hence extending the size of the parallel window. The waiting buffer works as follows. In a P-bit precision arithmetic coder, the length (l) must be kept in the range 2^{P−1} ≤ l ≤ 2^P, with the RENORM condition l ≤ 2^{P−1}. However, if the l register has (P + B) bits, then B bits can be used as a waiting buffer for code bits. Each time l is scaled, the induced code bit is first stored in the waiting buffer and l is not renormalized immediately. When the waiting buffer is full, which is equivalent to l ≤ 2^{P−1}, B code bits are output to the code buffer and l is finally renormalized. So a B-bit waiting buffer can reduce the number of RENORM events by a factor of B.
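The following C sketch illustrates only the length-register bookkeeping of this deferred-RENORM idea, with P = 8 active bits and B = 24 buffer bits as in Figure 5.17; the base register, carry handling and the actual code-bit output are omitted, and the names and constants are illustrative rather than taken from Said's or the author's code.

#include <stdint.h>

enum { P_BITS = 8, B_BITS = 24 };

typedef struct {
    uint32_t l;        /* length register, e.g. initialized to 0xFFFFFFFFu */
    int      renorms;  /* number of RENORM events that actually occurred */
} len_reg;

static void scale_length(len_reg *r, uint32_t prob /* 1..255, probability * 2^P */)
{
    r->l = (uint32_t)(((uint64_t)r->l * prob) >> P_BITS);   /* l *= prob / 2^P */
    if (r->l == 0)
        r->l = 1;                        /* a real coder never lets the interval collapse */
    if (r->l <= (1u << (P_BITS - 1))) {  /* waiting buffer full: l has drifted B bits down */
        int shift = 0;
        while ((r->l << shift) < (1u << (P_BITS + B_BITS - 1)))
            shift++;
        /* here the 'shift' pending code bits would be flushed to the code buffer */
        r->l <<= shift;
        r->renorms++;                    /* one RENORM for a whole batch of symbols */
    }
}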

Initial l register:    1aaa...a 000000...00   (8 active bits | 24 trailing bits)
Coding l register:     cccccc...cc aaaa...a   (24 code bits  | 8 active bits)
Renormed l register:   aaaa...a 000000...00   (8 active bits | 24 trailing bits)

Figure 5.17: 32-bit length register with 24-bit waiting buffer

5.6.4 Parallel Context Adaptive AC

In Section 5.6.2 an efficient parallel arithmetic encoder for a single context binary stream was proposed. However, it needs to be further optimized to handle the context

adaptive arithmetic coding of JPEG2000. In this scheme, the binary symbols are classified into 19 different streams based on their context, and the probability of each

symbol is estimated independently for each context. Therefore, 19 different context

buffers need to be reserved and the parallel prefix sum is applied on each buffer to count the number of ‘1’ and ‘0’ bits of each context.

Note that the context output stream must be in the strict order of the scanning

pattern. On the other hand, a sample at a certain location can generate an unpredictable context, as illustrated in Figure 5.18. Hence, to satisfy the order restriction

and allow all threads to output concurrently, each context buffer must be the exact

same size as the input stream. For example, if we assume the k-th input sample to have context j (j ∈ [0 : 18]), the k-th element of buffer Cx[j] will have a value of 1. Note that the parallel prefix sum will be applied directly to the Cx buffers to count the cumulative number of '1' and '0' bits up to a certain location. So the number of bits to represent each Cx buffer element must also be large enough to store the cumulative value. Specifically, the size of one Cx element is log2(N) bits, thus the Cx buffer size of one bit plane is 19·N·log2(N) bits, where N is the size of the processing window.
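For illustration (the window sizes are assumed, not taken from the implementation), if the processing window is a full 64 × 64 code block, then N = 4096 and each Cx element needs log2(4096) = 12 bits, so one bit plane requires 19 · 4096 · 12 bits ≈ 114 KB of Cx buffer; shrinking the window to a single 4 × 64 stripe (N = 256, 8 bits per element) cuts this to 19 · 256 · 8 bits ≈ 4.75 KB, small enough to sit comfortably in fast on-chip memory.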


Figure 5.18: 19 separate buffers required for 19 contexts

To the best of the author's knowledge, the parallel arithmetic coder introduced in this section is the first completely parallel arithmetic coder in the published domain. This version of the arithmetic coder is optimized for running on the GPU, but its parallelization strategy is applicable to other multicore or manycore platforms. Moreover, the benefit of this parallel arithmetic coder is not limited to the parallel Tier-1 coder; it can also be employed by a wide range of applications based on arithmetic coding.

5.7 Implementing The Tier-1 Coder On GPUs

Section 5.5 and Section 5.6 presented designs of a novel parallel bitplane coder and arithmetic coder, the two stages in Tier-1 coding which cause the major bottleneck of the JPEG2000 runtime. The proposed parallel processing methods are designed to expose very fine-grained parallelism to target the massively parallel manycore GPGPU architecture. However, it is not simple to deploy the proposed parallel processing methods straight onto GPGPUs. As discussed in Chapter 3, unlike CPUs, which often provide developers quite a high-level abstraction of the underlying hardware, GPGPUs still require a lot of low-level hardware-based considerations in software implementations. They often require careful resource utilization, particularly in assigning tasks to streaming multiprocessors, data organization and memory utilization. This is mainly due to the peculiarities of the SIMD architecture model, as discussed earlier in Chapter 3. This section presents such efforts, specifically in implementing the proposed parallel Tier-1 coder on GPGPUs. It first presents the mapping of the logical structure of the algorithms to the physical architecture of the GPGPUs. Next, various optimization techniques will be presented to exploit the capabilities of GPGPUs.

5.7.1 Mapping Parallel Tier-1 Coder To GPGPU

One of the first tasks in implementing a computational flow is mapping the logical structures of the flow to the physical resources of the underlying hardware. The software-to-hardware mapping is significantly more important in a GPGPU-based implementation due to the peculiarities of the GPGPU's SIMD model and memory hierarchy. Moreover, current compilers for GPGPU-based parallel programs cannot help much in making resource assignment decisions.

Architecturally, the parallel algorithm for the bitplane coder is optimized to match the existing graphics hardware, as shown in Figure 5.19. A number of code blocks are processed independently and each code block sample should be processed in parallel using the state prediction method proposed in Section 5.5. Hence, each work group is in charge of handling one code block, and the multiple processing elements (PE) of the work group can process the samples in parallel. However, the GPU implementation should be optimized to fully exploit hardware capabilities. In particular, the usage of memory resources has to be carefully optimized, control flow operations must be minimized as they result in costly processor stalls, and the workload must be distributed so as to maximize the occupancy of hardware resources and hide memory latency when stalls are unavoidable.


Figure 5.19: Mapping parallel BPC algorithm to existing graphics hardware. Each codeblock is assigned to be encoded by a streaming multiprocessor (SM). SMs directly work on the fast on-chip shared memory, where each codeblock’s sample and associative state variables and outputs are temporarily located. After each codeblock is completely processed, outputs are written back to the global memory. Constant look-up-tables are located in the constant memory.

5.7.2 Reducing Flow Control Operations

In GPGPUs any flow control instruction (e.g., if, switch) can significantly affect the instruction throughput by causing threads within a parallel thread block to diverge; that is, to follow different execution paths. As discussed in Chapter 3, GPGPUs provide enormous arithmetic capability at a low hardware cost, but to achieve this goal, the cores in each multiprocessor often share only one instruction decoder and one small branch control unit. Therefore a single instruction is executed over N threads in parallel, where N is specific to the hardware chip and often has a value of 32 or 64. A group of N parallel threads is called a warp (Nvidia [40]) or a wavefront (AMD [43]). If the threads within a warp (or wavefront) diverge into P paths, those paths will be executed serially and the execution time is theoretically increased by P times. As a result, control instructions and branch divergence on GPGPUs tend to be very expensive [40, 43].

Unfortunately, the context formation process in JPEG2000 requires many control operations. For example, the Zero Coding (ZC) primitive of the bitplane coder (BPC) forms the context of a sample based on the significance states (σ) of its 8 neighbors (shown in Figure 5.20). Consequently, the ZC primitive may take 2^8 = 256 different execution paths. Additionally, the BPC needs to select one out of 19 contexts based on that information. If the BPC were implemented with a standard switch/case construct, its performance would be bound to be very low.

However, the context formation rules are predefined by the JPEG2000 standard, as introduced in Chapter 2 (in Table 2.1 and Table 2.4). Therefore, look-up tables (LUTs) for context formation can be constructed to avoid branching in the control flow. In particular, a LUT should have 256 entries, where the indices of the entries are formed from the 8 neighbors' state bits and the value is selected based on the predefined context rules. For example, the index of the ZC context formation LUT can be formed from a set of 8 significance state bits (σ[V0], σ[V1], σ[H0], σ[H1], σ[D0], σ[D1], σ[D2], σ[D3]). However, it is still inefficient if 8 different memory locations have to be accessed every time in order to retrieve the set of 8 state bits which in turn will

yield the LUT index via concatenation. Instead, by employing the state broadcasting strategy from [84], a state flag data structure is designed to pack all state bits (within each context window) into a single package, as shown in Figure 5.21. Each sample (X) has one corresponding state flag that stores the state information of the sample itself (η[X], σ[X], σ'[X]) and the state bits (sign bits χ and significance bits σ) of its 8 neighbors. The state flag organization allows the BPC to easily retrieve a LUT index by applying a bit mask. For example, the index for accessing the ZC LUT, called ID_ZCLUT, can be retrieved using the bit masking in Equation (5.16).

ID_ZCLUT = StateFlag & 0xFF
         = σ[V0] | σ[V1] | σ[H0] | σ[H1] | σ[D0] | σ[D1] | σ[D2] | σ[D3]    (5.16)
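The following C sketch illustrates the state-flag idea: the 8 neighbor significance bits sit in the low byte of a 16-bit flag so that the ZC LUT index is a single mask, per Equation (5.16). The exact bit layout of Figure 5.21 (including the placement of the sign bits χ and the sample's own η/σ/σ' bits in the upper byte) is not reproduced here, and zc_lut[] is assumed to be precomputed from the context rules of Tables 2.1/2.4.

#include <stdint.h>

extern const uint8_t zc_lut[256];          /* precomputed ZC context table (assumed) */

static inline uint8_t zc_context(uint16_t state_flag)
{
    uint8_t id = state_flag & 0xFF;        /* Eq. (5.16): 8 neighbor sigma bits */
    return zc_lut[id];                     /* branch-free context selection */
}

/* Packing helper: OR each neighbor's significance bit into its slot. */
static inline uint16_t pack_neighbor_sigma(const uint8_t sigma_bits[8])
{
    uint16_t flag = 0;
    for (int i = 0; i < 8; i++)
        flag |= (uint16_t)(sigma_bits[i] & 1u) << i;
    return flag;
}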

5.7.3 Optimizing Memory Allocation

It is critical to optimize memory usage for the BPC, where the context formation process executes massive numbers of memory and arithmetic operations. Particularly on GPGPUs, efficient memory utilization can not only significantly reduce latency but can also increase the occupancy of computational resources to speed up computation.

The first optimization considered is to efficiently allocate different sets of data

to the most suitable memory blocks (local shared, global, constant) based on the application’s demand in order to reduce latency and conflict. When the BPC pro-

cesses one sample, it not only refers to the sample itself but also to its 8 neighbors.

Consequently, there is a high degree of memory conflicts in the BPC, particularly

Figure 5.20: Context Windows for Zero Coding (ZC) and Sign Coding (SC) primitives. When a coding primitive codes a sample, it has to refer to state information of the sample’s neighbors defined in its respective context window.


Figure 5.21: The state variables of a sample and its neighbors (within its context window) are packed into a 16-bit state flag. The index of the ZC or SC LUT can be extracted from the flag by applying a binary mask.

in the parallel BPC, where multiple threads concurrently access different samples. However, both the memory conflict rate and the memory latency can be dramatically reduced with the very fast, multi-way shared memory that resides locally on-chip. The shared memory on modern graphics cards has from 16 to 32 banks which can be accessed independently with a latency of only one clock cycle. It is also very important to optimize the allocation of the context output buffer, since it is the intermediate buffer between context formation and the arithmetic coder. Storing this buffer in global memory would be very inefficient, since the BPC would write to global memory and then the arithmetic coder would have to read back from it immediately after. Therefore this output buffer should also be stored in on-chip shared memory. Additionally, since the BPC refers to the LUTs and the state flags very frequently, these data structures should be placed in fast on-chip memory as well: the state flags reside in shared memory, while the LUTs are read-only and small enough to reside in the fast constant cache memory. The memory hierarchy for the parallel Tier-1 coder is shown in Figure 5.19. The code blocks are initially stored in the off-chip global memory, then each multiprocessor copies its respective code block into its shared memory. The LUTs are stored in constant memory and fetched to the multiprocessors' constant cache at runtime.



Figure 5.22: Impact of codeblock size on compression efficiency and streaming multiprocessor (SM) occupancy (in coding BIKE image). As codeblock size increases, compression efficiency increases but SM occupancy decreases.

5.7.4 Optimizing Memory Utilization

After the data sets are efficiently allocated into selected memory regions, the utilization of memory, especially shared memory, should be minimized to increase the multiprocessor occupancy of the GPGPUs.

The multiprocessor occupancy is defined as the ratio of the number of resident warps to the maximum number of warps supported on a multiprocessor of a GPU.

Typically, the higher the occupancy, the better multiprocessors can hide the warps'

latency and increase ALU utilization, which yields better speedup [40, 43]. Each

multiprocessor on a GPU has a set of registers and a small amount of on-chip shared

memory. For example, the Nvidia GTX 480 GPU has 32768 registers, 48 KB shared memory and 16 KB L1 cache per multiprocessor. These resources are shared among

the active thread warps. Therefore, the lower the shared resources utilized by a particular warp, the higher the number of warps that can reside in a multiprocessor.

The compiler can attempt to minimize register usage but the utilization of shared

memory must be optimized by the programmer. Additionally, the number of threads per work group should be large enough, at least 4× the warp size, to achieve the

best performance.

However, it is not simple to reduce shared memory utilization in the parallel Tier-1 coder since it depends on the code block size (CblkSize) which is one of the

key parameters that determine JPEG2000 compression efficiency. The CblkSize is often varied from 8 × 8 to 64 × 64 samples, but the largest allowed configuration is preferred [84, 85] since a larger CblkSize yields better compression efficiency. On the other hand, a large CblkSize requires a large shared memory buffer, which may reduce the multiprocessor occupancy and hence reduce the performance speedup. In a naive implementation, at least 24 Kbytes of shared memory are needed to store a 64 × 64 code block and the two respective flag_pre and flag_post state flag arrays proposed in Section 5.5.2, while an 8 × 8 code block requires only a small 256-byte buffer. The impact of CblkSize on compression efficiency and multiprocessor occupancy is shown in Figure 5.22. The results show that by changing CblkSize from 64 × 64 to 8 × 8, the compression efficiency drops about 25% on the image Bike. But by decreasing CblkSize the multiprocessor occupancy can be significantly improved, from 17% to 50%. There is an unusual result with the 8 × 8 code block, where the occupancy no longer increases. This result can be explained by the fact that the GTX 480 GPU only uses one

The impact of code block size on the overall speedup of the parallel BPC kernel is shown in the cblk-based processing graph in Figure 5.23. It clearly shows that larger code blocks significantly decrease the multiprocessor occupancy which results in a significant drop in speedup. To overcome this problem, previous studies had to compromise compression efficiency to use a GPU-affordable code block size such as

8 × 8. However, this is an impractical size [85]. This study therefore manages to design a special strategy that can handle large code blocks with low utilization of

shared memory.


Figure 5.23: Impact of codeblock size and processing method on the runtime speedup of the bitplane coder (in coding BIKE image). In cblk-based processing, increasing in cblk size leads to decreasing SM occupancy, which causes notable drops in speedup. The stripe-based processing significantly improves this issue.

Fortunately, by integrating the TERMALL and CAUSAL operation modes discussed in [84], the parallel Tier-1 coder can handle a 64×64 code block using a small shared memory buffer. The TERMALL and CAUSAL modes allow the JPEG2000 coder to eliminate the inter-stripe and inter-bit-plane dependencies. Specifically, TERMALL allows one stripe to be coded without having to refer to the incoming stripes below it; CAUSAL allows the arithmetic coder to reset symbol states after each bit plane. Using these two modes, a large code block can be processed through multiple, smaller loads of one stripe (4 × 64 samples) in shared memory. The arithmetic coder immediately codes the (Cx,D) stream of each stripe generated by the BPC. The speedup improvement of this stripe-based processing method is shown in the stripe-based processing graph in Figure 5.23. The speedup when processing a 64 × 64 code block increases from 13× to 31×. Note that the 16 × 16 and 8 × 8 code blocks can easily fit into shared memory, so it is not necessary to apply the stripe processing method.

5.8 Experimental Results

This section presents the results for the parallel bit plane coder and parallel arithmetic coder proposed in Section 5.5 and Section 5.6. The major steps of the JPEG2000 coding flow are illustrated in Figure 5.3. The parallel JPEG2000 Tier-1 coder is implemented using OpenCL 1.0 on the Nvidia SDK 3.2, running on an Nvidia GTX 480 graphics card. While Nvidia GPUs are widely programmed using the proprietary CUDA programming model and tools, this study chooses OpenCL instead, for it ensures portability across other platforms such as AMD GPUs, Cell processors, etc. The reference CPU platform uses an Intel Core i7 with 12 GB RAM running at 2.8 GHz.

There are several popular versions of JPEG2000 compression software running on CPUs, including JasPer [11], [62], and OpenJPEG [75]. JasPer is chosen for comparison against the GPU implementation since it is an open source program, with fully accessible source code, and very good performance. The image test set includes the most popular JPEG2000 test images from [59, 74] (e.g., bike, cafe, lena).

           JasPer BPC (ms)   Our Parallel BPC (ms)              Average Speedup of
IMAGE      Core i7 930       GTX 480                 Speedup    Parallel BPC [61] on GTX 480
lena       183.23            6.01                    30.5×
aerial2    1125.33           30.17                   37.3×
bike       3296.50           106.23                  31.0×      12×
cafe       3950.73           117.92                  33.5×
woman      3146.77           100.90                  31.2×

Table 5.2: Runtime comparison for the BitPlane Coder (BPC) in different implementations: the serial BPC of CPU-based JasPer [11] vs. our GPU-based parallel BPC vs. the GPU-based parallel BPC in [61]. JasPer runs on an Intel Core i7 930 CPU. Both parallel BPC coders run on an Nvidia GTX 480 GPU.

Table 5.2 compares the runtime of JasPer and the GPU implementation for just the bitplane coding (BPC) portion of the JPEG2000 Tier-1 coder. One can see that the proposed parallel BPC running on the GTX 480 consistently obtains more than 30× speedup compared to JasPer. This result is also a significant improvement compared to the 12× speedup of the parallel BPC proposed in [61], which runs on the GTX 480 GPU as well. The high speedup of the parallel BPC primarily comes from the massively parallel arithmetic capability of the GPU and its very fast multi-way shared memory. To efficiently realize the benefits of these features, this implementation successfully eliminates inter-sample dependency to allow multiple parallel threads to process the samples at a very fine granularity. The multi-way shared memory significantly reduces conflicts when multiple threads are processing the context windows. Additionally, this implementation develops efficient optimization strategies to increase the computational resource occupancy in order to gain more speedup. Note that the speedup gains vary slightly since the BPC's runtime also depends on the images' characteristics.

Table 5.3 shows more comprehensive results of all processing stages implemented on the GPU. Runtime results are combined for the DWT plus MCT stages (columns 2 and 3) and the BPC plus AC stages (columns 4 and 5). The GPU-based DWT + MCT implementations are more than 100× faster than the JasPer implementation.

From columns 4 and 5 we see that our GPU-based implementation of the Tier-1 coder (BPC+AC) is roughly 16× faster than the JasPer implementation. This result is significantly better than previously published parallel solutions for the Tier-1 coder, such as 2.7× in [61] and 1.4× in [88]. However, the overall performance improvements are not as great as those of the BPC reported in Table 5.2. The main reason for this drop in improvement is the relatively slow runtime of the arithmetic coder stage, despite the fact that the parallel solution proposed in Section 5.6 has been implemented. The current GTX 480 graphics hardware does not include full 64-bit support, so its double precision performance is reduced to 1/8 of its single precision performance. The parallel arithmetic coder presented in this study is expected to run much faster on any platform which fully supports double precision.

IMAGE      MCT+DWT (ms)           BPC+AC (ms)             Average
           JasPer     GPU         JasPer       GPU        Speedup
lena         10.74    0.13          228.68     15.1        15.7×
aerial2      70.89    0.64         1393.32     80.1        18.1×
bike        328.53    2.03         4069.76    272.91       16.1×
cafe        325.61    2.01         4888.55    300.105      17.4×
woman       326.47    2.01         3877.50    261.27       16.1×

Table 5.3: Runtime comparison for the Tier-1 coder of JasPer running on an Intel Core i7 930 CPU vs. our parallel Tier-1 coder running on an Nvidia GTX 480 GPU.

5.9 Summary And Discussion

In this chapter, the design and development of a novel JPEG2000 parallel Tier-1 coder are presented. The parallel algorithm can process data at the sample level (i.e., possibly the finest granularity of data). In particular, this chapter is the first to present a fully parallel solution for the arithmetic coder in JPEG2000. The implementation of the Tier-1 coder leverages widely available and massively parallel GPGPU hardware and provides a 16× performance speedup compared to the JasPer software implementation. It is believed that even greater speedup is possible with full 64-bit hardware support. Additionally, the proposed parallel algorithms are potentially applicable to a wide range of image processing and data compression applications.

The proposed parallel Tier-1 implementation deviates from the JPEG2000 standard since it uses a classic arithmetic coder for entropy coding instead of the MQ coder used in JasPer. However, it is still fair to compare the speed of the GPU implementation with that of JasPer since, in recent CPU implementations, exchanging the arithmetic coder with the MQ coder does not really alter the coding speed [79]. Indeed, the classic arithmetic coder actually yields higher coding efficiency than the MQ coder, and it was initially used in the JPEG2000 draft by Taubman [83], the inventor of JPEG2000's core coding algorithms. The hardware limitations at that time were the only reason for its replacement with the MQ coder. Therefore, it is suggested that, given its advantages as well as its potential for parallelization, the classic arithmetic coder should be employed again in JPEG2000.

Further, this chapter also provides interesting observations on the performance characteristics of GPGPUs on realistic computational flows. The analysis conducted earlier in Chapter 3 is confirmed and corroborated. First, it is shown that GPGPUs perform very well on parallel processing flows that can expose fine-grained parallelism. However, exposing this sort of parallelism is often not easy, sometimes making it necessary to completely replace one flow with an entirely new one (as in the MQ coder case). GPGPU-based programs still require a lot of optimization effort, mostly from developers, to tune programs for GPGPUs' special SIMD model and memory hierarchy. The lack of flexibility in the SIMD execution model is also pronounced in realistic flows, which are often very complex. Overall, despite its great parallel processing capabilities, the GPGPU-based approach by itself may not be a one-size-fits-all solution for accelerating realistic applications. Within a realistic application, GPGPUs may speed up several throughput-intensive stages well, but a more flexible device is still needed to orchestrate the whole flow. This is an important design realization which leads to an interesting proposal of a heterogeneous parallel computing approach, which will be presented shortly in Chapter 6.

For future work on the JPEG2000 coding flow, the emphasis will be on further improving the implementation of the arithmetic coder. In addition, the Tier-2 routines can be parallelized in order to have a complete JPEG2000 encoding flow running on a GPU platform. Another research direction is to implement the proposed parallel solutions on different parallel hardware platforms in order to compare the architectures in terms of performance and optimization strategies. In particular, the capabilities of embedded GPGPUs in accelerating the coding flow should be explored. To the best of the author's knowledge, there has not been any GPGPU-based parallel solution for JPEG2000 on mobile and embedded systems.

Chapter 6

Accelerating the JPEG2000 Decoder Using a Heterogeneous Parallel Computing Approach

6.1 Introduction

Chapter 5 has presented novel parallel processing techniques for the JPEG2000 Tier-1 encoder. The proposed techniques achieve significant performance gains. For example, the parallel bitplane coder on the GPGPU gains a speedup of more than 30×.

However, going through the design process of the GPGPU-based parallel processing techniques clearly shows that, despite being very capable, the GPGPU-based approach still has several crucial drawbacks.

First, GPGPU-based parallel programs require very fine-grained parallelism in order to achieve good performance. However, it is often not easy to expose this type of parallelism in realistic computational flows, especially in those developed during the uniprocessor era with serial processing in mind. Consequently, it often takes remarkable effort to design parallel programs for GPGPUs, as discussed in Chapter 5. In some cases, to be able to expose fine-grained parallelism in a flow, it is necessary to modify or replace the flow altogether with a different one. The parallelization process for the entropy coder in Chapter 5 is a clear example of this issue.

Secondly, observing GPGPU performance when executing realistic computational flows confirms once again the low flexibility of the GPGPU's SIMD architecture (discussed earlier in Chapter 3). The design process in Chapter 5 provides concrete evidence that GPGPUs are primarily suitable for arithmetic-intensive but simple computational flows. However, realistic computational flows are often complex, requiring a lot of control operations which are better suited to flexible general-purpose CPUs.

The observed drawbacks of GPGPUs are critical, but they do not lead to a dead end. Instead, those observations are the essential elements that helped this study discover the important role of another approach to parallel computing: heterogeneous parallel computing. Indeed, while realistic applications often have complex characteristics, hardware processors are not usually designed to cover such a wide range of demands. Due to limitations in development costs and time to market, as well as power budget and integration technology, each family of processors is often designed to target a certain segment of computing problems. For example, CPUs are designed with the flexibility to handle complex control-intensive flows, while GPGPUs are designed with massive arithmetic throughput to handle arithmetic-intensive flows.

Therefore, this study discovered early that it would be more efficient to partition a realistic computational flow into groups of stages having similar characteristics and to map them onto suitable computing devices. Particularly in parallel computing, modern flexible multicore CPUs and high-throughput manycore GPGPUs can cooperate to accelerate computational flows very efficiently. Within a system, a CPU can act as a command station to handle complex control operations and also use task-level parallel threads to execute stages that cannot be parallelized at the data level on GPGPUs. The simultaneous execution of CPUs and GPGPUs within a system also provides an opportunity to pipeline different stages on the devices. All of these advantages led this study to take a further step in exploring the capability of parallel heterogeneous computing. Following the same strategy of picking a case study, the acceleration of the JPEG2000 decoder is targeted, a crucial objective given the decoder's relatively slow performance.

As discussed earlier, the JPEG2000 encoder and decoder demand high performance from the computing platform. On the other hand, while there have been numerous efforts focused on accelerating the JPEG2000 encoder [83, 16, 69, 73, 48, 66, 86, 88, 54, 78, 64], there have been relatively few efforts focused on accelerating the performance of the JPEG2000 decoder. Consequently, it is vital to invest more effort in accelerating it, since its performance is just as critical as that of the encoder. Moreover, previous acceleration efforts are often based on multi-threaded programming using OpenMP-based task-parallelism running only on CPUs, which only allows extracting limited amounts of coarse-grained parallelism with limited performance gain (e.g., [73]).

In this chapter, novel techniques are developed to exploit the recent improvements of both parallel programming models and hardware architectures of CPUs and GPGPUs to speed up the JPEG2000 decoding engine. Specifically, this chapter makes the following contributions:

• A parallel streaming decoder running on a GPGPU-CPU heterogeneous system is developed to fully exploit both the flexibility of high-performance multicore CPUs and the massively parallel capability of GPGPUs.

• In order to leverage the advantages of GPGPUs and CPUs, an exhaustive analysis of the JPEG2000 decoding flow is conducted to determine how to optimally parallelize the decoding algorithm and how to best distribute the computational tasks across the heterogeneous computing units.

• Efficient scheduling methods are developed to keep the loads of the computing units optimally balanced. Specifically, a new task scheduling strategy is developed that exploits the soft-heterogeneity of the OpenCL and C/C++ runtimes in order to gain a significant performance boost.

The rest of this chapter is organized as follows. Section 6.2 discusses previous work on accelerating the JPEG2000 decoder in more detail. Section 6.3 conducts an analysis of the JPEG2000 decoding flow and is followed by Sections 6.4 and 6.5, which present different acceleration methods for various stages of the flow. Section 6.6 presents a novel pipelining scheme for improving the performance of the JPEG2000 streaming decoder. Conclusions and ideas for future work are discussed in Section 6.7.

6.2 Previous Work On JPEG2000 Decoder

As mentioned earlier, there are relatively few studies that focus on accelerating the JPEG2000 decoder, despite the crucial role its performance plays in JPEG2000-based image processing systems such as video tracking and object recognition. In the hardware domain, one of the best ASIC chips implementing the JPEG2000 standard is the ADV 212 JPEG2000 coder chip by Analog Devices. The ADV 212 chip uses an architecture that employs one wavelet engine and three entropy coders to encode/decode three wavelet sub-bands concurrently [48]. Similar solutions are also proposed in [16]. Although the ADV 212 is the most popular ASIC chip for JPEG2000, its performance is still slower than needed for typical applications; two ADV 212 chips are needed just to compress a real-time stream of SMPTE 274M (1080i) video. In addition, ASIC chips have very high development and manufacturing costs, which limits rapid release of new and improved designs.

In the software domain, most of the acceleration efforts for the JPEG2000 decoder have focused on coarse-grained parallelization using CPUs. For example, code blocks or image tiles may be processed in parallel based on the task-parallelism model using multiple OpenMP-based threads [73, 45]. The main motivation for this approach is that the code blocks, sub-bands and tiles can be processed independently, using the original sequential algorithms running on concurrent threads. However, these OpenMP-based approaches have several critical drawbacks. First, OpenMP can only exploit coarse-grained parallelism, which may not address the root problems preventing higher performance in JPEG2000. The more critical drawback is that OpenMP-based approaches are designed to decode single frames, one by one, while field image processing applications often have high potential for pipelining on streaming inputs. For instance, in systems that do video tracking, the analysis process often works in streaming mode, where the same set of operations including decoding, recognition, and tracking is applied to a sequence of frames. Therefore, different processing stages on a sequence of images can be pipelined.

Another drawback of the OpenMP-based JPEG2000 decoders is that they often use CPUs as the only computing unit. Recent improvements in both programming models and hardware architectures, such as general-purpose graphics processing units (GPGPUs) that support parallel computing, can bring new opportunities to speed up the JPEG2000 coding engine. For instance, prior work has proposed parallelizing the Tier-1 encoder in JPEG2000 using GPGPUs [64]. Results show that specific components of the encoder can be greatly accelerated using a GPU-based implementation. However, to the best of the author's knowledge, there have been no previous GPU-based solutions for the JPEG2000 decoder. Even the popular Nvidia Performance Primitives library (NPP) [38] currently supports only the JPEG standard.

Depending on the specific task, either a thread-level parallelism approach on multicore CPUs or a data-level parallelism approach on GPGPUs may be more advantageous in terms of performance. To take advantage of each type of parallelism, approaches that can make effective use of a heterogeneous system based on GPGPUs and multicore CPUs seem very promising for optimizing JPEG2000's parallel performance. This chapter describes the work on implementing such a parallel solution targeting a heterogeneous GPGPU-CPU system. Specifically, this work explores how a GPGPU-CPU heterogeneous system may be exploited to accelerate the JPEG2000 decoder.

6.3 JPEG2000 Decoder Runtime Analysis

Given that runtime profiling is often a critical step towards effectively accelerating an application, this section conducts an exhaustive analysis on the performance profile of the single-threaded JasPer JPEG2000 decoder. JasPer is chosen since it is a reference implementation recommended by the JPEG2000 committee [22].


Figure 6.1: JPEG2000 Decoding Flow


Figure 6.2: Runtime Profile of JasPer JPEG2000 Decoder

The JPEG2000 decoding flow (shown in Figure 6.1) can be decomposed into five major stages [60]. A JPEG2000 code stream is first decoded by the Tier-2 decoder, where the header of the code stream is parsed to provide the Tier-1 decoder with the necessary information about the structure of the code stream and its parameters. Based on the header information, the Tier-2 decoder extracts the code segments and forwards them to the Tier-1 decoder. The pyramid structure of the JPEG2000 code stream is shown in Figure 6.3. The Tier-1 decoder is responsible for taking the binary code bits and decoding them into wavelet coefficients. These wavelet coefficients are then fed to the Inverse Discrete Wavelet Transform (IDWT), which maps them back to the actual values of the image pixels. The last stage performs an Inverse Multi-color Transformation (IMCT) to transform the decoded channels. It should be noted that the De-Quantizing and IMCT stages are optional and also consume a very small portion of the total runtime. Therefore, to focus on understanding the major bottlenecks of the decoding process, the analysis considers only three major stages: Tier-2 decoding, Tier-1 decoding and the IDWT.


Figure 6.3: Pyramid Structure of the JPEG2000 Codestream.

Similar to the encoder side, the Tier-1 stage on the decoder side is the major bottleneck. JasPer's runtime profile (shown in Figure 6.2) shows that the Tier-1 decoder takes more than 80% of the total runtime, followed by the Tier-2 decoder and the IDWT stage. The Tier-1 decoder consumes most of the total running time since it includes two major decoding tasks: the Bitplane Decoder (BPD) and the MQ Decoder. The decomposition in Figure 6.1 provides more details on the Tier-1 operations.

As mentioned earlier, the code stream is organized as streams of multiple code segments, where each code segment is the compressed information of one code block. The code block represents a spatial region of m × n pixels in the original image. A typical code block has a size of 64 × 64 samples. These code segments are decoded by the Tier-1 decoder to reconstruct the samples of the respective code blocks. However, the process of reconstructing the sample values is very complex; only one bit of a sample is reconstructed at a time, starting from the most significant bit down to the least significant bit, as shown in Figure 2.10. For example, an 8-bit sample with binary value 11000100 must be reconstructed through 8 decoding steps starting from the most significant bit. Therefore the complexity of Tier-1 decoding operations on an M × N image is on the order of O(K × M × N), where K is the maximum number of bits needed to represent the samples in a code block.
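As a small illustration of this cost, the sketch below reconstructs a single sample magnitude bitplane by bitplane, from the most significant bit down; decode_next_bit is a hypothetical stand-in for the context formation and MQ decoding performed for each bit, not a routine from JasPer or this implementation.

#include <cstdint>

// Hypothetical stand-in for one Tier-1 bit decision (context formation + MQ decoding).
bool decode_next_bit(int bitplane, int sample_index);

// Reconstruct one K-bit sample magnitude, most significant bitplane first.
// Every bit of every sample requires a pass like this, which is why Tier-1
// decoding of an M x N image costs on the order of O(K x M x N) operations.
uint32_t reconstruct_sample(int sample_index, int K) {
    uint32_t value = 0;
    for (int plane = K - 1; plane >= 0; --plane) {
        value = (value << 1) | (decode_next_bit(plane, sample_index) ? 1u : 0u);
    }
    return value;
}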

To process each bitplane, the Tier-1 decoder must scan from left to right, line by line, from top to bottom. Then, to reconstruct the value of a single bit of a sample at a certain location, the Tier-1 decoder first has to go through a context formation process where it gathers information about the sample's 8 adjacent neighbors. The neighboring operations therefore initiate a very large number of memory operations, on the order of O(8 × K × M × N). The neighboring operations help determine the context of the bits for the MQ decoder.

6.4 Parallelizing Tier-1 Decoder

6.4.1 Parallel Model Selection And Experimental Setup

The runtime profile of the single-threaded JPEG2000 decoder described in Section 6.3 clearly identifies the Tier-1 decoder as the most critical bottleneck in the JPEG2000 decoding flow. From this analysis it makes sense that most of the acceleration effort should be placed on the Tier-1 decoding stage. However, two major stages of the Tier-1 decoder (i.e., the context formation and MQ decoding stages shown in Figure 6.4) contain a high degree of dependency, down to the bit level. It is therefore very difficult (though perhaps not impossible) to expose data-level parallelism at each individual stage. In particular, similar to the case of the MQ encoder, there has not been any successful fine-grained parallel solution for the MQ decoder to date. Moreover, parallelizing the Tier-1 decoder is even more difficult since the bitplane decoding and MQ decoding are interlaced.


Figure 6.4: The process of reconstructing the sample values in Tier-1

Alternatively, the chosen strategy of this work is to focus on task-level parallelism for the Tier-1 decoder, which turns out to be relatively easy to exploit thanks to the pyramid structure of the JPEG2000 code stream (shown in Figure 6.3). Working with the pyramid structure, the Tier-1 decoder can independently decode the code segments of different image tiles, color components, resolution levels or code blocks. Hence, multiple threads can run the same sequence of operations on different code segments independently.

Although the IDWT and Tier-2 decoder account for less than 20% of the total runtime, these two stages should also be carefully considered since they could become bottlenecks after the Tier-1 stage is successfully accelerated. The data structure of the IDWT stage already exposes data-level parallelism without any modification; therefore, it is very suitable for GPGPU-based acceleration (details will be discussed in Section 6.5). On the other hand, the Tier-2 decoder is actually a recursive process that parses the pyramid JPEG2000 code stream and is therefore difficult to implement using a multi-threaded approach. Instead, a smart pipelining scheme is proposed that overlaps the Tier-2 decoder runtime with that of the Tier-1 decoder (details are presented in Section 6.6).

As mentioned earlier, both the CPU and GPGPU have their own advantages for different tasks in the JPEG2000 flow. While it may be simple to run task-level threads for the Tier-1 decoder on the CPU, the massively parallel resources of the GPGPU are very efficient for the data-level parallelism in the IDWT stage. Moreover, since the Tier-1 stage is very time consuming, it is much more efficient to distribute the Tier-1 decoding load across both the CPU and GPGPU. Consequently, a GPGPU-CPU heterogeneous system would be an ideal configuration for accelerating the JPEG2000 decoder and could also allow better pipelining capability for the JPEG2000 flow working in streaming mode.

To efficiently accelerate the Tier-1 stage on the heterogeneous system, it is very important to design a task scheduler that can efficiently distribute the computational operations across the heterogeneous computing units. Consider a heterogeneous system that contains N = {P1, ..., Pn} different processing units that can run concurrently; the total running time of the heterogeneous system is computed as:

T = max{T1, T2, ..., Tn}    (6.1)

where Ti is the time that processing unit Pi takes to finish the operations in the queue assigned to it. Consequently, it is crucial to balance the runtime of each unit. This aim can only be achieved by exhaustively assessing the capabilities of each unit and its performance on the operations. Such efforts are presented in Sections 6.4.2 and 6.4.4, which conduct assessments on different versions of the task-level parallel Tier-1 decoder running on the GPGPU and CPU.

In the experimental setup, two different computing units are considered: an Intel Core i7 930 CPU and an Nvidia GTX 480 GPGPU. In addition, OpenCL is used as the parallel programming model since it supports various parallel hardware platforms, including CPUs and GPGPUs from different vendors. Hereafter, the proposed heterogeneous JPEG2000 decoder platform will be referred to as the OCL-JPEG2000 decoder. Similar to the assessment process in Chapter 5, this setup uses the JasPer JPEG2000 decoder as the reference single-threaded software platform. Although there are several popular versions of single-threaded JPEG2000 software, including JasPer [11], Kakadu [62], and OpenJPEG [75], JasPer has very good performance with fully accessible source code and is the reference implementation recommended by the JPEG2000 committee [22]. The test images are obtained from the ITU-T and USC-SIPI databases [59, 74] (i.e., Aerial, DC, Bigtree, and Haiti) as well as from field surveillance operations.

6.4.2 Parallelizing The Tier-1 Decoder On GPGPUs

In order to understand the capabilities of GPGPUs for task-level parallel programs, this section presents a parallel Tier-1 decoder exploiting task-level parallelism. In addition, exhaustive assessments of performance gains and optimization strategies are conducted.

As already discussed in Section 6.4.1, the parallel Tier-1 decoder on both the GPGPU and the CPU can exploit task-level parallelism from the pyramid structure of the JPEG2000 code stream, as shown in Figure 6.3. The data granularity should be at the finest possible level to provide the task scheduler with the flexibility needed to achieve optimal load balancing across processing units. Therefore the GPGPU will work at the code block level, which is indeed the finest possible level in the code stream pyramid that can expose task-level parallelism. Each code block's code segment is assigned to be decoded by one streaming multiprocessor (SM), as shown in Figure 6.5.


Figure 6.5: Mapping the parallel Tier-1 decoder to existing graphics hardware.

However, since GPGPUs are architecturally different from CPUs, a straight porting of the source code of the single-threaded program running on a CPU to threads running on a GPGPU may lead to very limited performance gain. The columns labeled Unoptimized OCL-T1 on GPU in Figure 6.8 show that the unoptimized multi-threaded Tier-1 decoder running on the GPGPU is merely 1.5× faster than the single-threaded JasPer decoder running on the CPU. It therefore requires a lot of optimization effort specifically targeted at the GPGPU architecture to improve the performance of the multi-threaded Tier-1 decoder on a GPGPU platform.

Before going into the details of optimization strategies for the parallel Tier-1 decoder on GPGPUs, it is necessary to review some of the special characteristics of GPGPUs (discussed in detail in Chapter 3). As shown in Figure 6.5, a GPGPU consists of a number of streaming multiprocessors (SMs), where each SM is composed of an array of processing units (PEs) that are actually simple ALUs running on a Single Instruction Multiple Data (SIMD) model. For example, the Nvidia GTX 480 GPGPU has 15 SMs and each SM is composed of 32 PEs. The SIMD term is derived from the fact that the PEs within a SM have to execute the same instruction, but on different data elements, in a clock cycle directed by the shared instruction decoder, instruction dispatcher and flow controller. The SIMD model is one of the key differences between GPGPUs and CPUs. By contrast, the processor cores in CPUs can each work independently, executing their own sets of instructions.

Consequently, the SIMD model can be a limitation of GPGPUs when they have to handle threads exploiting coarse-grained task-level parallelism. Specifically, multiple task-level threads cannot run concurrently on different PEs within a single SM, since each thread often executes its own instruction queue while the SM has only one shared instruction decoder and dispatcher. Hence, each SM can only run one task-level thread on only one PE, while the rest of the PEs within the SM remain idle. The idling of a dominant number of PEs in a SM when the SMs have to handle task-level threads leads to a massive waste of GPGPU processing capability.

Additionally, running flow control operations on GPGPUs can significantly decrease the instruction throughput by causing threads of the same group to diverge and follow different execution paths which must be serialized [40, 43]. Therefore, the threads running on GPGPUs should be optimized to reduce flow control operations.

Besides the different execution models, CPUs have only one type of main memory, whereas GPGPUs often have three different types of memory, each with its own purpose. The first memory type is the off-chip memory, or global memory, which is shared among the SMs. The global memory often has a large capacity and functions similarly to the DRAM of CPUs. However, the global memory on the GPGPU has a special coalesced burst access capability that can significantly increase memory throughput [40]. For example, on the Nvidia GTX 480 GPGPU, the PEs within a SM can together request a global memory access in a burst of 128 bytes. The second memory type is the multi-way on-chip memory called local memory. Each SM has its own local memory space which is shared among its PEs only. The local memory is much faster than the global memory, but its capacity is significantly smaller; the local memory size of each SM is often in the range of only 32–64 Kbytes [40, 43]. The third memory type is the constant memory, which stores read-only constants and is often located on-chip. Due to this very special memory architecture, developing programs for GPGPUs requires detailed consideration of memory utilization and access.

6.4.3 Optimizing the Parallel Tier-1 Decoder For GPGPUs

Given the special characteristics of the GPGPU execution model and memory architecture, this study aims at optimizing the task-level Tier-1 decoder kernel for GPGPUs by reducing flow control operations, leveraging the advantages of the different memory types, and maximizing the GPGPU occupancy.

Similar to the context formation process in the Tier-1 encoder, the context formation process in the Tier-1 decoder requires many flow control operations. For instance, a sample's context is formed by checking the status of its 8 neighbors (shown in Figure 6.6), which may result in 256 different cases. Therefore the implementation of the GPGPU-based parallel Tier-1 decoder employs the same strategy used in the GPGPU-based parallel Tier-1 encoder. Specifically, look-up tables (LUTs) are constructed for the context formation decisions based on the rules predefined in [60].

Figure 6.6: Context Windows for Zero Coding (ZC) and Sign Coding (SC) primitives. When a coding primitive codes a sample, it has to refer to the state information of the sample’s neighbors defined in its respective context window.

Figure 6.7: The state variables of a sample and its neighbors (within its context window) are packed into a 16-bit state flag. The index of the ZC or SC LUT can be extracted from the flag by applying a binary mask.

The state variables of a sample and its neighbors (within its context window) are packed into a 16-bit state flag (shown in Figure 6.7). The index of the ZC or SC LUT is extracted from the flag by applying a binary mask.
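The sketch below shows how such a lookup replaces branch-heavy context formation: the packed 16-bit state flag is masked to form an index into a constant-memory LUT. The mask values, shift, and table sizes here are illustrative assumptions; the actual bit layout and table contents are defined by the rules in [60].

#include <cstdint>

// Illustrative masks and LUTs only: the real bit positions and table contents
// follow the context formation rules in [60] and are not reproduced here.
constexpr uint16_t kZCMask  = 0x00FF;  // bits of the flag used by Zero Coding
constexpr uint16_t kSCMask  = 0x0F00;  // bits of the flag used by Sign Coding
constexpr int      kSCShift = 8;

extern const uint8_t ZC_LUT[256];      // ZC context label per neighborhood pattern
extern const uint8_t SC_LUT[16];       // SC context label / sign-flip bit per pattern

inline uint8_t zc_context(uint16_t state_flag) {
    return ZC_LUT[state_flag & kZCMask];             // mask, then constant-memory lookup
}

inline uint8_t sc_context(uint16_t state_flag) {
    return SC_LUT[(state_flag & kSCMask) >> kSCShift];
}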

It is also critical to optimize memory utilization and access for the Tier-1 decoder kernel on GPGPUs, since the context formation process initiates a massive number of memory operations. Moreover, efficient memory utilization on GPGPUs not only significantly reduces latency but also increases the occupancy of computational resources, which in turn increases computing throughput [40, 43].

One of the first memory optimizations is to efficiently allocate the different sets of data into the most suitable memory blocks (e.g., local shared, global, constant) based on the application's demands, in order to reduce latency and conflicts. There are four major data sets in the Tier-1 decoder: the input code segments, the lookup tables that define the context formation rules, the state flags that store the neighboring information of the samples, and the decoded outputs.

While it is clear that the LUTs should be located in constant memory, the allocation of the three remaining data sets must be carefully considered since their size is large compared to the local memory size. There are two straightforward options for allocating these buffers, namely either in shared or global memory. However, neither of these options works well. This is mainly because access to global memory is much slower than to local memory, especially for task-level parallel kernels where coalesced memory access cannot be established. Moreover, when all SMs concurrently access the global memory there would be heavy memory conflicts. On the other hand, it is also very inefficient if all three data sets are located in local memory, since heavy use of local memory can significantly reduce SM occupancy.

To deal with the limited local memory capacity and the high latency of global memory access, a hybrid solution is developed that allows the decoder to work on fast local memory while moderating local memory use so that SM occupancy is not compromised.

In the hybrid solution, only a small portion of the buffers for input code segments, decoded output data, and state flags is allocated in local memory for each decoding session. Accordingly, a SM only decodes a part of a code block in a local session on local memory, and exchanges the data sets between global memory and local memory after each session. Specifically, in a local session, a SM only decodes a group of 8 × 64 samples instead of a whole block of 64 × 64 samples. Although only one task-level thread does the decoding job, the exchange of data sets between local buffers and global buffers for each local decoding session is done by all threads within a warp to facilitate coalesced burst memory access. Moreover, the latency of memory communication is hidden by pipelining different tasks across multiple workgroups in the SM: memory accesses and arithmetic operations can be executed concurrently.


Figure 6.8: Runtime speedup of different versions of the GPGPU-based parallel Tier-1 decoder compared to the JasPer JPEG2000 decoder.

Overall, the optimizations applied to the Tier-1 decoder to run on a GPGPU platform significantly improve its performance, as shown in the Optimized OCL-T1 on GPU columns of Figure 6.8. The speedup of the Tier-1 decoder on the GPGPU compared to single-threaded JasPer running on a CPU is increased from 1.5× for the unoptimized version to more than 3× for the optimized version. However, a 3× speedup is still far below expectations, since a simple CPU-based implementation on a 4-core CPU would gain the same speedup. The drawback of the low-flexibility GPGPU architecture in handling complex task-level parallel programs is clearly confirmed. Next, the potential of multicore CPUs, which are highly flexible but have a limited number of cores, in handling the parallel Tier-1 decoder will be examined.

6.4.4 Parallelizing The Tier-1 Decoder On CPUs

In contrast to the design process for the GPGPU-based parallel Tier-1 decoder kernel, which required a lot of optimization effort, the development of the multi-threaded Tier-1 decoder targeted at CPUs is significantly easier. It is a quite straightforward port of the single-threaded Tier-1 decoder, with a few simple optimization strategies introduced in [34]. While several multi-threaded programming models are available (e.g., Java, OpenMP, OpenCL, CUDA), OpenCL was chosen for this specific task.


Figure 6.9: Mapping the task-level parallel Tier-1 decoder to a multicore CPU.

Figure 6.9 shows the mapping of the parallel Tier-1 decoder to a multicore CPU. All the code segments of the code blocks, the output buffers, as well as the state variable arrays are stored in the CPU's DRAM (considered as global memory in OpenCL). It is important to note that the CPU's hardware can move data between off-chip DRAM and on-chip cache memory very efficiently; therefore, explicit utilization of local memory in the OpenCL model is not required. As a result, the CPU-based parallel Tier-1 decoder requires nearly no memory optimization effort. The CPU's cores run parallel threads that execute the legacy serial flow of the JPEG2000 decoder to decode the code blocks independently. The CPU's cores have a great capability in handling complex control operations and branch prediction, which significantly reduces the need for explicit code optimizations.

It is also important to note the crucial role of the compiler in significantly reducing developers' optimization efforts for CPU-based implementations. During development, the same CPU-based parallel Tier-1 decoder kernel compiled with a newer Intel compiler version (v1.5) yields performance gains which are noticeably different from those of the previous version, as shown in Figure 6.10. More specifically, the speedup increased from 3× to 6×. This significant improvement is credited to the compiler's better capabilities in vectorizing code and exploiting hyper-threading.

   ¢ ¤     § ¤  ! " # $ ' &    ¢ ¤ ¡

þ

ý







ü







û



 

ú



 

ù



ø

÷

ÿ ¡ ¢ £ ¤ ¥ ¢ ¦ § ¡ ¨ £ ¢ § ¢ ©

Figure 6.10: Runtime speedup of the CPU-based parallel Tier-1 decoder kernel compiled by two different versions of the Intel OpenCL compiler. By updating the compiler from V1.0 to V1.5, the performance increases from 3× to 6× for the same kernel.


Figure 6.11: Runtime speedup of different parallel Tier-1 decoder implementations on the CPU and GPGPU compared to the JasPer JPEG2000 decoder.

The comparison of the speedups of the CPU-based and GPGPU-based parallel Tier-1 decoders is shown in Figure 6.11. It shows that the CPU-based parallel Tier-1 decoder gains a 6× speedup while the GPGPU-based version gains only 3×. It is interesting that, in the task-level approach for the parallel Tier-1 decoder, the multicore CPU outperforms the GPGPU by a factor of 2×. This outcome can be explained by the advantages of the CPU in handling heavy and complex kernels exploiting task-level parallelism. Specifically, the Intel Core i7 CPU has four cores, but its hyper-threading technology allows eight threads [34] to run concurrently. Therefore there are virtually eight CPU cores executing the Tier-1 decoder kernel concurrently, with a much higher degree of flexibility compared to the SMs of GPGPUs. Although the 8 virtual cores in a Core i7 CPU are fewer than the 15 SMs in an Nvidia GTX 480 GPGPU, the Core i7 CPU cores operate at more than 2× higher frequency along with a more efficient capability of handling control operations. Moreover, each CPU core has a four-way SIMD engine, which can theoretically increase the computing throughput by up to 4× if the compiler can completely vectorize the code [34].

The development process of the parallel Tier-1 decoder also provides very helpful observations on optimizing parallel programs targeted at GPGPUs and CPUs, which may also benefit other studies. Multi-threaded kernels that exploit task-level parallelism benefit more from the flexibility of CPUs than from the massively parallel capability of GPGPUs. Moreover, the optimization effort for CPU kernels is often much less than that for GPGPU kernels. Therefore this study suggests that heavy and complex kernels, which cannot easily expose data-level parallelism, should be targeted to CPUs. This observation is key guidance for obtaining good performance speedup per unit of optimization effort.

6.5 Parallelizing The Inverse Discrete Wavelet Transform

This section discusses the parallelization of the Inverse Discrete Wavelet Transform (IDWT) process, which is a major stage in the JPEG2000 decoding flow, as shown in Figure 6.1. The Discrete Wavelet Transform (DWT) has traditionally been implemented by convolution or FIR filter bank structures. However, the core coding system of the JPEG2000 standard recommends the lifting-based DWT instead. Figure 6.12 illustrates the one-dimensional (1D) lifting-based (5,3) filter bank for lossless compression. The input samples X[i] are first split into odd and even samples (S0[i] = X[2i] and D0[i] = X[2i + 1]) and then the predict and update steps are applied to the S0[i] and D0[i] samples using the following rules:

D1[i] = D0[i] + a(S0[i] + S0[i + 1])    (6.2)

S1[i] = S0[i] + b(D1[i − 1] + D1[i])    (6.3)


Figure 6.12: 1D lifting-based (5,3) filter bank


Figure 6.13: 2D DWT for 2D image

The filter coefficient values are a = −1/2 and b = 1/4. The outputs D1[i] are the high-pass outputs of the filter and the outputs S1[i] are the low-pass outputs of the filter. So the result of the 1D DWT is a signal divided into low-pass and high-pass subbands. 2D signals (e.g., images) are usually transformed in both dimensions by first applying the 1D DWT to all rows and then to all columns. The 2D DWT results in four subbands (labeled LL, HL, LH and HH) shown in Figure 6.13. The Inverse DWT (IDWT) can be derived by traversing the DWT steps in the reverse direction.
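As a CPU-side reference for the lifting steps above, the following sketch applies the forward (5,3) predict/update of Equations (6.2) and (6.3) to a 1D signal and then inverts it by running the steps in reverse. The symmetric clamping at the signal boundaries and the even-length assumption are simplifications made for this sketch; it is not the JasPer or GPU kernel code.

#include <cstddef>
#include <vector>

// Illustrative 1D (5,3) lifting along the lines of Eqs. (6.2)-(6.3), with
// a = -1/2 and b = 1/4.
struct Subbands { std::vector<double> low, high; };  // S1 (low-pass), D1 (high-pass)

Subbands forward_53(const std::vector<double>& x) {
    const std::size_t n = x.size() / 2;              // assume even length for brevity
    std::vector<double> s(n), d(n);
    for (std::size_t i = 0; i < n; ++i) { s[i] = x[2 * i]; d[i] = x[2 * i + 1]; }

    const double a = -0.5, b = 0.25;
    for (std::size_t i = 0; i < n; ++i) {            // predict: D1[i] = D0[i] + a(S0[i] + S0[i+1])
        double s_next = (i + 1 < n) ? s[i + 1] : s[i];
        d[i] += a * (s[i] + s_next);
    }
    for (std::size_t i = 0; i < n; ++i) {            // update: S1[i] = S0[i] + b(D1[i-1] + D1[i])
        double d_prev = (i > 0) ? d[i - 1] : d[i];
        s[i] += b * (d_prev + d[i]);
    }
    return {s, d};
}

std::vector<double> inverse_53(Subbands sb) {        // traverse the lifting steps in reverse
    const std::size_t n = sb.low.size();
    const double a = -0.5, b = 0.25;
    for (std::size_t i = 0; i < n; ++i) {            // undo update
        double d_prev = (i > 0) ? sb.high[i - 1] : sb.high[i];
        sb.low[i] -= b * (d_prev + sb.high[i]);
    }
    for (std::size_t i = 0; i < n; ++i) {            // undo predict
        double s_next = (i + 1 < n) ? sb.low[i + 1] : sb.low[i];
        sb.high[i] -= a * (sb.low[i] + s_next);
    }
    std::vector<double> x(2 * n);
    for (std::size_t i = 0; i < n; ++i) { x[2 * i] = sb.low[i]; x[2 * i + 1] = sb.high[i]; }
    return x;
}

Because the intermediate predict and update results depend only on already-stored buffers, every sample position in these loops can be processed by an independent GPU thread, which is what makes the lifting formulation attractive for GPGPU implementation.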

If the intermediate results from the predict and update steps are stored in temporary buffers, there is no data dependency between different samples. This allows the lifting-based filter bank to be implemented very efficiently on a GPGPU-based platform. Moreover, the large number of DWT samples that need to be inverse transformed can be sufficiently supported by the massively parallel computational resources of GPGPUs. However, there is still a big concern about memory access conflicts in a 2D IDWT implementation on GPGPUs. If the samples are stored in global memory, there is a high possibility of bank conflicts when the PEs access samples in a column. This problem is very critical in the 2D IDWT, where lifting is done on all columns, one at a time.

The shared memory in GPGPUs can significantly reduce bank conflicts since the shared memory often has a multi-bank organization that can be accessed concurrently. For instance, the Nvidia GTX 480 GPGPU's shared memory has 32 banks that are organized such that successive 32-bit words are assigned to successive banks, i.e., interleaved. However, as mentioned earlier, the major drawback of shared memory in GPGPUs is its very limited capacity, while the subbands of images are often relatively large. Clustering samples helps overcome this limitation. Each GPGPU workgroup copies a cluster of 64 × 8 samples, plus extra boundary rows and columns, into the shared memory and performs row-wise and column-wise lifting. This processing method not only significantly reduces memory conflicts but also increases memory bandwidth. The communication latency between global memory and shared memory can also be efficiently hidden by pipelining between different workgroups. For example, while one workgroup is doing computational tasks, another workgroup can concurrently transfer data between global memory and local memory.

Combining all the optimization techniques discussed above, the parallel IDWT kernel executed on the GPGPU is capable of remarkable speedups. Figure 6.14 shows that the parallel IDWT on the Nvidia GTX 480 GPGPU runs more than 25× faster than the IDWT routine in the single-threaded JasPer JPEG2000 decoder running on an Intel Core i7 processor.


Figure 6.14: Runtime speedup of the parallel IDWT running on the GTX 480 GPU compared to the JasPer JPEG2000 decoder in decoding the Aerial image.

6.6 Pipelining The Decoding Stages On Heterogeneous Computing Units

While the previous sections describe how to parallelize the Tier-1 decoder and IDWT stages of the JPEG2000 decoder, it is important to note that the Tier-2 decoding stage still remains un-parallelized. As mentioned earlier in Section 6.3, the Tier-2 decoding stage contains a recursive process for parsing the pyramid code stream, which makes it difficult to parallelize. To eliminate the bottleneck of the Tier-2 decoding stage, this section presents a pipelining scheme that efficiently hides the Tier-2 decoding runtime within the Tier-1 decoding runtime to significantly boost the performance of the JPEG2000 decoder running in streaming mode, a very common mode in field applications.

6.6.1 Extending Amdahl’s Law For Heterogeneous Systems

Before going into the details of the pipelining technique for the JPEG2000 decoding flow, this section also develops an extension of Amdahl's law [14] to find the maximum expected improvement of computational flows accelerated using heterogeneous systems. Amdahl's law has been widely used for estimating the performance of computational flows; however, the original model of the law has some limitations and needs to be extended for applicability to heterogeneous systems. This extension of Amdahl's law will not only provide a quantitative assessment of the impact of the Tier-2 bottleneck but also serve as a general performance estimation model for heterogeneous computing.

Amdahl's law estimates the achievable speedup of a computation flow when a technique can be applied to a fraction P of the total execution time and is responsible for accelerating that portion of the execution by a factor of S. Amdahl's law states that the maximum overall speedup of the whole computation flow will be:

Speedup = 1 / [(1 − P) + P/S]    (6.4)

The correctness of Equation (6.4) is maintained if the computation flow is accelerated on a system that yields uniform speedups for the different stages of the flow. However,

in heterogeneous systems, performance gains on different computing units can vary.

Consider a heterogeneous system consisting of N different computing units (e.g., CPU, GPGPU, FPGA, ASIC, etc.) and a computation flow F. Assume F has a set of tasks with runtime proportions P = {p1, p2, ..., pk} that are accelerated by K (K ≤ N) computing units with speedups S = {s1, s2, ..., sk}, respectively. The maximum overall speedup of the whole flow on the heterogeneous system is computed by the following equation, derived as an extension of Amdahl's law:

Speedup = 1 / ( [1 − (p1 + ··· + pk)] + p1/s1 + ··· + pk/sk )    (6.5)

Applying Equation (6.5) to a heterogeneous system with N = 2 computing units (CPU and GPGPU), the two stages (Tier-1 decoding and IDWT) can be accelerated by exploiting parallelism. The runtime proportions are p1 = 0.83, p2 = 0.08 and the speedups are s1 = 7, s2 = 25, respectively. The maximum speedup of the JPEG2000 decoding flow on this system is computed by Equation (6.6), which is close to the empirical results shown in the Un-Pipelined OCL-J2K columns of Figure 6.20.

Speedup = 1 / ( [1 − (0.83 + 0.08)] + 0.83/7 + 0.08/25 ) ≈ 4.8×    (6.6)
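A small helper like the one below (a sketch, not code from this work) evaluates Equation (6.5); plugging in the measured proportions and speedups reproduces the bound of Equation (6.6).

#include <cstddef>
#include <cstdio>
#include <vector>

// Extended Amdahl's law (Eq. 6.5): fraction p[i] of the flow is accelerated by a
// factor s[i] on some computing unit; the remaining fraction runs unaccelerated.
double heterogeneous_speedup(const std::vector<double>& p, const std::vector<double>& s) {
    double accelerated = 0.0, scaled = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        accelerated += p[i];
        scaled      += p[i] / s[i];
    }
    return 1.0 / ((1.0 - accelerated) + scaled);
}

int main() {
    // Tier-1 decoding: 83% of runtime, ~7x faster; IDWT: 8% of runtime, ~25x faster.
    double bound = heterogeneous_speedup({0.83, 0.08}, {7.0, 25.0});
    std::printf("Maximum expected speedup: %.1fx\n", bound);  // roughly the ~4.8x of Eq. (6.6)
    return 0;
}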

6.6.2 Pipelining the JPEG2000 Decoding Flow

Extending Amdahl's law to the expression shown in Equation (6.6) points out that the Tier-2 bottleneck bounds the maximum speedup of the JPEG2000 decoding flow to less than 5×. Therefore the Tier-2 bottleneck must be resolved in order to further accelerate the flow. This section presents a pipelining approach for the JPEG2000 streaming decoder that hides the Tier-2 decoder runtime within that of the Tier-1 decoder.

Figure 6.15 shows the JPEG2000 decoder using a heterogeneous GPGPU-CPU system to decode a single frame. In the flow, the Tier-1 computational loads are shared between the GPGPU and the CPU to achieve the best performance; however, the drawbacks are evident.


Figure 6.15: Decoding a single JPEG2000 image frame on a GPGPU-CPU system.

During the decoding process, the computing units are idle in several time slots. For example, when the CPU is doing the Tier-2 decoding, the GPGPU must wait because it cannot start the Tier-1 decoding until the Tier-2 decoder extracts the code segments and headers from the compressed codestream.

The waiting time of the CPU and the GPGPU makes the flow very inefficient, since the two units are fully independent and could run concurrently. Therefore the JPEG2000 decoder should be designed to exploit this concurrency advantage. Moreover, in field applications such as surveillance imaging, earth imaging, etc., the JPEG2000 decoder often works in a streaming mode that decodes a sequence of compressed frames rather than a single frame.

Based on the above observations, a pipelining scheme is introduced next that exploits the advantages of streaming inputs in the heterogeneous system to pipeline the Tier-1 and Tier-2 decoding stages on different frames. It should also be noted that the IDWT stage is left un-pipelined, since the runtime of the parallel IDWT is very small compared to the runtime of the Tier-1 and Tier-2 stages. For example, in decoding the Aerial image, the runtime of the parallel IDWT is only 3.5 ms compared to 120 ms for the Tier-2 stage. Pipelining the IDWT may cause unnecessary overhead rather than save a very small amount of runtime.


Figure 6.16: A simple pipeline scheme for the JPEG2000 decoding flow

Recall again that the JPEG2000 decoding flow can be completely split into independent stages, thus making the pipelining process easier. However, there are still problems that need to be resolved. The first challenge is the unbalanced runtime of the different stages. Consider the simple pipelining scheme shown in Figure 6.16, where the whole Tier-1 decoding stage of frame[k] is executed only on the GPGPU and the Tier-2 decoding stage of frame[k+1] is executed on the CPU concurrently. This simple scheme can pipeline the Tier-2 decoding and Tier-1 decoding stages for a sequence of frames, but it is inefficient due to the unbalanced runtimes of the Tier-2 and Tier-1 stages. For example, in decoding the Aerial image, the runtime of the Tier-2 decoding stage is 120 ms whereas the runtime of the Tier-1 decoding stage on the GPGPU is 429 ms. Therefore the runtime of the pipeline scheme is dominated by the runtime of the Tier-1 stage on the GPGPU, and this problem actually makes the pipelined scheme significantly slower than the un-pipelined scheme, as shown in Figure 6.17.

Further, the serial Tier-2 decoding process utilizes only a single thread, while a modern multicore CPU often supports multiple threads running concurrently.


Figure 6.17: The runtime speedups of the parallel JPEG2000 decoder using the simple pipeline scheme. The overhead and unbalanced stages make this scheme even slower than the un-pipelined scheme.

For instance, an Intel Core i7 CPU has four cores which can run eight threads concurrently. Therefore, using a powerful CPU solely to run a single-threaded Tier-2 decoder wastes resources. This observation suggests that while only a single CPU thread is busy in the Tier-2 stage, the remaining free CPU threads should be utilized for some Tier-1 tasks. Specifically, the CPU should run both the Tier-2 stage and a part of the Tier-1 stage concurrently, as shown in Figure 6.18. Note that another part of the Tier-1 stage is still handled by the GPU to share the Tier-1 load with the CPU. This solution not only increases CPU utilization significantly but also keeps the runtimes of the CPU and GPU balanced.

However, it is not simple to simultaneously execute both the parallel Tier-1 decoder and the Tier-2 decoder on a single CPU as proposed in Figure 6.18. The parallel Tier-1 decoder is a multi-threaded routine running on the OpenCL runtime environment, while the Tier-2 decoder is a single-threaded routine running on the conventional C/C++ runtime environment. To make this pipeline scheme possible, it is necessary to revisit how an OpenCL program is executed. Such an effort is conducted in the next section, Section 6.6.3.


Figure 6.18: A more complex pipeline scheme for the parallel JPEG2000 decoder. It runs two different programs, the parallel Tier-1 and the single-threaded Tier-2, on the same CPU simultaneously.

6.6.3 Exploiting Soft-heterogeneity In Multiple Software Runtimes On Multicore CPUs For Pipelining

Recall from Chapter 4 that the OpenCL programming model always consists of a host device and multiple compute devices from a logical point of view. The host device is in charge of calling and setting up the OpenCL runtime as well as synchronizing the compute devices within the system. The host managing program is often executed by a single thread, hereafter referred to as the master thread (often a C thread). The master thread sets up the runtime environment by setting up the execution context for each compute device and its associated buffers. After that, the master thread launches an OpenCL kernel by calling the OpenCL runtime function clEnqueueNDRangeKernel() and then lets the OpenCL runtime manager schedule threads to execute the OpenCL command queue. Now, consider the process happening on a multicore CPU. It is interesting that a multicore CPU can take on the two roles simultaneously, as a host and as a compute device. Within a CPU-based system (shown in Figure 6.19), the master thread and the OpenCL threads are actually user-level threads. These threads are mapped to the operating system's kernel threads before being scheduled and executed by the CPU hardware.


Figure 6.19: Thread scheduling of the master thread and OpenCL threads on a multicore CPU. This execution scheme makes it feasible to run both the C runtime's thread and the OpenCL runtime's threads simultaneously.

While exploiting the special characteristics of multicore CPUs in running OpenCL, it was discovered that it is possible to decouple the OpenCL threads from the master thread. The decoupling works as follows: after successfully launching the OpenCL kernel, the master thread is free to run another task while waiting for the OpenCL threads to finish execution. Note that the master thread does not have to be involved during the execution of the OpenCL threads, since the OpenCL runtime takes control of all the tasks initiated by the clEnqueueNDRangeKernel() command. This capability of OpenCL is referred to here as soft-heterogeneity, since two different sets of threads running in two different runtime environments can be executed concurrently on a single hardware device.

With the above soft-heterogeneity considerations in mind, the proposal of running a single-threaded Tier-2 decoder and a multithreaded OpenCL-based Tier-1 decoder simultaneously on a multicore CPU becomes feasible. The flow in Figure 6.19 shows more details on exploiting soft-heterogeneity. The master thread first performs the OpenCL setup for executing the OpenCL parallel Tier-1 decoder kernel (for frame[k+1]). Afterwards, the OpenCL runtime takes control of the kernel, leaving the master thread free to perform the Tier-2 decoding tasks (for frame[k]). When the master thread finishes executing the Tier-2 decoder, it queries the OpenCL runtime (using the clFinish() command) to check whether the Tier-1 decoder has finished.
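As a minimal sketch of this flow, and assuming two hypothetical helper routines that are not part of the OpenCL API or of the actual implementation (enqueue_tier1_kernel(), standing in for the clEnqueueNDRangeKernel() call that launches the parallel Tier-1 decoder, and tier2_decode_cpu(), standing in for the single-threaded Tier-2 decoder), the pipelined loop could be organized as follows:

    #include <stdio.h>
    #include <CL/cl.h>

    /* Hypothetical placeholders for this sketch only. */
    static void enqueue_tier1_kernel(cl_command_queue q, int frame) {
        (void)q;
        printf("enqueue Tier-1 kernel for frame %d (returns immediately)\n", frame);
    }
    static void tier2_decode_cpu(int frame) {
        printf("Tier-2 decoding of frame %d on the master thread\n", frame);
    }

    /* The queue would be created by the host setup shown in the previous sketch. */
    void decode_stream(cl_command_queue queue, int num_frames) {
        /* Prime the pipeline: the first frame's Tier-1 pass has no Tier-2 work to overlap with. */
        enqueue_tier1_kernel(queue, 0);
        clFinish(queue);

        for (int k = 0; k < num_frames; ++k) {
            if (k + 1 < num_frames) {
                /* Launch Tier-1 for frame[k+1]; the OpenCL runtime schedules its
                 * worker threads on the free CPU cores (and the GPU takes its share). */
                enqueue_tier1_kernel(queue, k + 1);
            }

            /* Meanwhile the master (C) thread decodes Tier-2 for frame[k]. */
            tier2_decode_cpu(k);

            /* Only now does the master thread block until Tier-1 for frame[k+1] is done. */
            clFinish(queue);
        }
    }

With this organization, the runtime of the Tier-2 stage is hidden within the runtime of the Tier-1 stage, which is the balance discussed in the performance results below.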

To the best of the author's knowledge, this study is one of the first to efficiently exploit the soft-heterogeneity of OpenCL and C runtime scheduling to enable concurrent execution of OpenCL kernels and the master thread, which perform completely different tasks. Moreover, exploiting soft-heterogeneity is not limited to OpenCL and C; it can also be applied to other programming languages.

6.6.4 Performance Results On A Heterogeneous System

The performance speedups of the JPEG2000 decoder using different scheduling methods, compared to the JasPer JPEG2000 decoder, are shown in Figure 6.20. The Un-Pipelined OCL-J2K columns show the speedup of the JPEG2000 decoder that does not employ any pipelining scheme; the Simply Pipelined OCL-J2K columns show the speedup of the JPEG2000 decoder that employs the simple pipeline scheme mentioned in the previous sections; and the Pipelined OCL-J2K columns show the speedup of the JPEG2000 streaming decoder that employs the pipeline scheme in Figure 6.18, which exploits the advantages of soft-heterogeneity.

The performance speedup of the pipelined streaming decoder is more than 8×, a significant gain compared to the performance of previous works.


Figure 6.20: The runtime speedups of the parallel JPEG2000 decoder using different scheduling methods, compared to the JasPer JPEG2000 decoder. The pipeline scheme that exploits soft-heterogeneity gains a significant boost in speedup.

For instance, compared with solutions that use only CPUs, the performance of [73] is reportedly 3.34× on a four-processor system. The Intel IPP [45] is one of the most widely respected high-performance, multicore-ready image processing suites; its performance is about 5× that of JasPer on a four-core Intel Core i7 processor, with no capability of pipelining streaming inputs. Since this work is one of the first to propose a GPU-CPU based solution, there is no direct comparison with prior works.

Overall, the significant improvement of the streaming decoder comes mainly from efficient resource utilization, given the characteristics of the computational tasks, and from efficient scheduling methods that distribute the computations across the heterogeneous processing units. In particular, exploiting the soft-heterogeneity of the OpenCL and C/C++ runtimes maintains a balance between the pipeline stages and completely hides the runtime of the Tier-2 stage within the runtime of the Tier-1 stage.

6.7 Summary And Discussion

In this chapter, the design and development of a high-performance parallel JPEG2000 streaming decoder using GPGPU-CPU heterogeneous systems were presented. The parallel programs achieve very significant speedups compared to the reference JasPer JPEG2000 software. The parallel IDWT engine gains more than a 25× speedup compared to the IDWT engine of JasPer, and the speedup of the parallel streaming decoder is more than 8× compared to the JasPer JPEG2000 decoder.

Based on the crucial lessons from the GPGPU-only approach in Chapter 5, this chapter introduces a more efficient approach based on heterogeneous parallel computing to accelerate the JPEG2000 coding flow. It was discovered that it is significantly more efficient to have heterogeneous processing units (i.e., GPGPUs and CPUs) collaborate in accelerating complex field applications like JPEG2000. While GPGPUs can provide great parallel arithmetic throughput, flexible CPUs excel at orchestrating the whole computational flow. Additionally, modern multicore CPUs show great capabilities in parallel computing as well; the experiments show that multicore CPUs can outperform manycore GPGPUs in solving complex task-level parallel programs.

Moreover, in addition to exploiting the heterogeneity of different hardware devices, this study also exploits the soft-heterogeneity of different software runtime environments. In particular, it successfully exploits the soft-heterogeneity of the OpenCL and C runtimes to gain crucial speedups for the JPEG2000 flow. Although newly discovered, soft-heterogeneity shows great potential for increasing the performance of multicore CPUs. To the best of the author's knowledge, this study is one of the first to introduce the concept of exploiting soft-heterogeneity across different software runtime environments.

Supporting the heterogeneous approach, an efficient pipelining scheme that utilizes both hard-heterogeneity and soft-heterogeneity has been successfully implemented. The pipelining scheme not only helps the system gain extra speedup but also provides an efficient processing scheme for streaming input, which is common in field multimedia applications. Together, exploiting heterogeneity in both hardware and software, along with the pipelining strategy, provides a very efficient approach to parallel computing for accelerating complex field applications.

For future work, emphasis will be on further exploiting the heterogeneous GPGPU-CPU system to accelerate complex computational flows like JPEG2000. In addition, focus will be maintained on developing efficient pipeline and scheduling techniques that allow GPGPUs and CPUs to cooperate in tandem. Efforts will also be directed at developing an automated workflow for the task distribution and pipelining process. Finally, the concept of soft-heterogeneity across different software runtime environments will be explored further to exploit the advantages of modern multicore processors.

Chapter 7

Conclusion And Future Work

This thesis has presented an exploration of modern parallel computing platforms, with a case study on accelerating the JPEG2000 image compression standard.

The exploration started in Chapter 2 with a presentation of the background of the case study, namely the JPEG2000 coding flow. Although JPEG2000 consists of many complex details, Chapter 2 focused only on the major stages of the JPEG2000 coding flow, including the wavelet transform, bitplane coding and entropy coding, which are its most critical bottlenecks. The chapter presented key concepts and operations within each stage in order to help readers understand the complex algorithms that form the underpinnings of the development of this thesis.

In the next step, Chapter 3 presented various approaches to parallel hardware architectures. The parallel architectures were explored selectively and deliberately: exploiting parallelism in hardware is one of the primary directions in modern computing for achieving high performance now that frequency scaling is hitting its limits. The chapter first presented three major hardware techniques for exploiting parallelism: instruction-level parallelism, data-level parallelism and thread-level parallelism. Afterwards, it presented realistic processors that use those parallel hardware techniques, namely the two most common families of modern parallel processors: general-purpose CPUs and general-purpose graphics processing units (GPGPUs). Overall, this chapter provided the elements crucial to understanding modern parallel hardware architectures, an important step toward proposing optimization techniques in software design, particularly in accelerating JPEG2000.

Complementing Chapter 3, Chapter 4 presented a modern programming model for parallel computing, the OpenCL programming model. It is necessary to understand the way in which developers can command the underlying hardware to execute their computational flows. The chapter first presented the anatomy of the OpenCL platform, which consists of the OpenCL platform model, execution model and memory model. It also discussed how the OpenCL logical model maps to the physical model of realistic hardware devices, namely the CPUs and GPGPUs elaborated upon earlier in Chapter 3. In summary, Chapter 2, Chapter 3 and Chapter 4 together provided a crucial foundation for the novel design techniques proposed in Chapter 5 and Chapter 6.

Chapter 5 continued the exploration by trying to understand the potential and characteristics of modern manycore GPGPUs in solving realistic computational flows. It proposed novel design techniques targeting the SIMD architecture of GPGPUs to accelerate JPEG2000's major bottleneck, the Tier-1 encoding stage. Significant speedups were achieved for the coding flow and, more importantly, crucial insights were gained into the GPGPUs' characteristics. The implementation of the parallel Tier-1 coder provides a significant 16× performance speedup compared to the JasPer software implementation. However, GPGPUs only perform well on parallel processing flows that can expose fine-grained parallelism. Further, GPGPU-based programs still require a lot of optimization effort, mostly from developers, to fine-tune programs for the GPGPUs' special SIMD model and memory hierarchy. The lack of flexibility in the SIMD execution model is also pronounced in realistic flows, which are often very complex. Overall, the findings in this chapter led to the conclusion that, despite their great capabilities in parallel processing, GPGPUs on their own may not be a one-size-fits-all solution for accelerating realistic applications. This was an important design guideline, which led to the heterogeneous parallel computing approach proposed in Chapter 6.

Based on the crucial lessons from Chapter 5, Chapter 6 introduced a more efficient approach based on heterogeneous parallel computing to accelerate the JPEG2000 coding flow. It established the necessity of having heterogeneous processing units (i.e., GPGPUs and CPUs) collaborate to accelerate complex field applications like JPEG2000. While GPGPUs can provide great parallel arithmetic throughput, flexible CPUs excel at orchestrating the whole computational flow. Additionally, modern multicore CPUs showed great capabilities in parallel computing as well; the experiments showed that multicore CPUs can outperform manycore GPGPUs in solving complex task-level parallel programs. Moreover, this study also exploited the soft-heterogeneity of different software runtime environments, which shows great potential for increasing multicore CPUs' performance. In support of the heterogeneous approach, this chapter also proposed an efficient pipelining scheme that utilizes both hard-heterogeneity and soft-heterogeneity. The pipelining scheme not only helped the system gain extra speedup but also provided an efficient processing scheme for streaming input, which is common in field multimedia applications.

In summary, the introduction of the efficient parallel heterogeneous computing approach in Chapter 6 successfully concluded this thesis's exploration of modern parallel computing platforms. Not only did this endeavour achieve significant speedups on the JPEG2000 coding flow, it also provided crucial insights into the behaviors of modern computing platforms. However, this study can still be extended in several directions. For accelerating the JPEG2000 coding flow, a better version of the entropy coding stage could improve the overall performance of the flow. The Tier-2 stage could also be parallelized to allow the JPEG2000 flow to run completely in parallel. More importantly, as demands in mobile computing are growing rapidly, an effort to accelerate JPEG2000 for embedded/mobile systems, using embedded CPUs and GPGPUs, should also be considered. To the best of the author's knowledge, there has not been any GPGPU-based parallel solution for JPEG2000 on mobile and embedded systems. For heterogeneous parallel computing, pipeline and scheduling techniques can be further improved and exploited to accelerate computing flows. In particular, the aim should be to develop an automated workflow for task distribution and pipeline processing. The concept of soft-heterogeneity can also be explored further to exploit the advantages of modern multicore processors.

Bibliography

[1] International Electrotechnical Commission (IEC). Home page http://www.iec.ch/.
[2] International Organization for Standardization (ISO). Home page http://www.iso.org/iso/home.html.
[3] International Telecommunication Union (ITU). Home page http://www.itu.int/en/Pages/default.aspx.
[4] Message Passing Interface (MPI). Available at http://www.mpi-forum.org/.
[5] OpenMP API specification for parallel programming. Available at http://openmp.org/wp/.
[6] Apple Corporation 2012. Apple iPhone. Available at http://www.apple.com/iphone/.
[7] Google Corporation 2012. Android smartphones. Available at http://www.android.com/about/.
[8] Intel Corporation 2012. Intel microprocessor quick reference guide. Available at http://www.intel.com/pressroom/kits/quickreffam.htm#core.
[9] Timothy G. Mattson, James Fung, Aaftab Munshi, Benedict R. Gaster, and Dan Ginsburg. OpenCL Programming Guide. Addison-Wesley, 2009.
[10] Tinku Acharya and Ping-Sing Tsai. JPEG2000 Standard for Image Compression. John Wiley & Sons, Inc., 2005.
[11] Michael Adams. JasPer JPEG 2000 compression software. Available at http://www.ece.uvic.ca/~mdadams/jasper/.
[12] M.D. Adams and R. Ward. Wavelet transforms in the JPEG 2000 standard. In Communications, Computers and Signal Processing, 2001. PACRIM. 2001 IEEE Pacific Rim Conference on, volume 1, pages 160–163, 2001.
[13] Michael D. Adams. The JPEG 2000 still image compression standard, 2005.


[14] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, AFIPS '67 (Spring), pages 483–485, New York, NY, USA, 1967. ACM.
[15] AnandTech. Intel Nehalem architecture review. Available at http://www.anandtech.com/show/2658.
[16] K. Andra, C. Chakrabarti, and T. Acharya. A high-performance JPEG2000 architecture. Circuits and Systems for Video Technology, IEEE Transactions on, 13(3):209–218, March 2003.
[17] L. Howes, B. R. Gaster, and D. Kaeli. Heterogeneous Computing with OpenCL. Morgan Kaufmann, 2012.
[18] R. Iris Bahar. Lectures on low power design. Brown University.
[19] W. B. Pennebaker, J. L. Mitchell, G. G. Langdon, Jr., and R. B. Arps. An overview of the basic principles of the Q-coder adaptive binary arithmetic coder. IBM Journal of Research and Development, 32(6):717, November 1988.
[20] Hubert Nguyen, editor. GPU Gems 3. Addison-Wesley Professional, 2007.
[21] JPEG Committee. Joint Photographic Experts Group (JPEG). Available at http://www.jpeg.org/committee.html.
[22] JPEG Committee. JPEG 2000 image compression standard. Available at http://www.jpeg.org/jpeg2000/index.html.
[23] ISO. ISO/IEC AWI 15444-14 - Information technology – JPEG 2000 image coding system – Part 14: XML structural representation and reference.
[24] JPEG Committee. JPEG 2000 potential applications. Available at http://www.jpeg.org/apps/index.html.
[25] JPEG Committee. JPEG image compression standard. Available at http://www.jpeg.org/jpeg/index.html.
[26] Red Digital Cinema Camera Company. RED Scarlet digital cinematography camera. Available at http://www.red.com/store/scarlet/.
[27] Apple Corporation. Grand Central Dispatch technology. Available at https://developer.apple.com/technologies/mac/core.html#grand-central.
[28] Intel Corporation. Intel Advanced Vector Extensions (AVX). Available at http://software.intel.com/en-us/avx/.
[29] Intel Corporation. Hyper-Threading Technology architecture and microarchitecture. Available at http://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01..

[30] Intel Corporation. Intel Core i7-800 processor series and the Intel Core i5-700 processor series based on Intel microarchitecture (Nehalem). Available at download.intel.com/products/processor/corei7/319724.pdf.
[31] Intel Corporation. Intel Parallel Building Blocks. Available at http://software.intel.com/en-us/articles/intel-parallel-building-blocks/.
[32] Intel Corporation. Intel Sandy Bridge processor family. Available at http://ark.intel.com/products/codename/29900/Sandy-Bridge.
[33] Intel Corporation. Intel SSE4 programming reference. Available at software.intel.com/file/18187/.
[34] Intel Corporation. Writing optimal OpenCL code with Intel OpenCL SDK. Intel OpenCL SDK 1.5, available at http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk/.
[35] Nvidia Corporation. Nvidia CUDA parallel computing platform. Available at http://www.nvidia.com/object/cuda_home_new.html.
[36] Nvidia Corporation. Nvidia CUDA showcases. Available at http://www.nvidia.com/object/cuda-apps-flash-new.html#.
[37] Nvidia Corporation. Nvidia GeForce GPUs. Available at http://www.nvidia.com/object/geforce_family.html.
[38] Nvidia Corporation. Nvidia Performance Primitives (NPP). Available at http://developer.nvidia.com/npp.
[39] Nvidia Corporation. Nvidia's next generation CUDA compute architecture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
[40] Nvidia Corporation. OpenCL programming guide for the CUDA architecture V4.2. http://developer.download.nvidia.com/compute/DevZone/docs/html/OpenCL/doc/OpenCL_Programming_Guide.pdf.
[41] AMD Corporation. AMD FX processors. http://www.amd.com/us/products/desktop/processors/amdfx/Pages/amdfx.aspx.
[42] AMD Corporation. AMD Radeon GPUs. http://www.amd.com/us/products/desktop/graphics/Pages/desktop-graphics.aspx.
[43] AMD Corporation. AMD Accelerated Parallel Processing OpenCL programming guide. http://developer.amd.com/sdks/amdappsdk/assets/amd_accelerated_parallel_processing_opencl_programming_guide.pdf.
[44] IBM Corporation. IBM Cell architecture. Available at http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.html.
[45] Intel Corporation. Intel Integrated Performance Primitives (IPP). Available at http://software.intel.com/en-us/articles/intel-ipp/.

[46] DARPA. The Autonomous Real-time Ground Ubiquitous Surveillance-Imaging System (ARGUS-IS). Available at http://www.darpa.mil/Our_Work/I2O/Programs/Autonomous_Real-time_Ground_Ubiquitous_Surveillance-Imaging_System_(ARGUS-IS).aspx.
[47] Ingrid Daubechies and Wim Sweldens. Factoring wavelet transforms into lifting steps. J. Fourier Anal. Appl., 4:247–269, 1998.
[48] Analog Devices. ADV212: JPEG 2000 video codec. Technical documents. http://www.analog.com/en/audiovideo-products/video-compression/adv212/products/product.html.
[49] Dolby. SCC2000 secure content creator. Available at http://www.dolby.com/us/en/professional/hardware/cinema/postproduction-products/scc2000.html.
[50] Independent JPEG Group (IJG). IJG JPEG codec. Available at http://www.ijg.org/.
[51] Khronos Group. OpenCL - the open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl/.
[52] Khronos Group. The OpenCL specification 1.1. http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf.
[53] Moving Picture Experts Group. MPEG-2: generic coding of moving pictures and associated audio information. http://mpeg.chiariglione.org/standards/mpeg-2/mpeg-2.htm.
[54] A.K. Gupta, D. Taubman, and S. Nooshabadi. High speed VLSI architecture for bit plane encoder of JPEG2000. In Circuits and Systems, 2004. MWSCAS '04. The 2004 47th Midwest Symposium on, volume 2, pages II-233–II-236, July 2004.
[55] David M. Harris and Neil Weste. CMOS VLSI Design: A Circuits and Systems Perspective. Addison-Wesley, 2010.
[56] Nikon Inc. Nikon Coolpix 995 camera: product information. Available at http://imaging.nikon.com/lineup/coolpix/others/995/spec.htm.
[57] Nikon Inc. Nikon Coolpix P310 camera: product information. Available at http://imaging.nikon.com/lineup/coolpix/performance/p310/spec.htm.
[58] Texas Instruments. Digital signal processors and ARM microprocessors. http://www.ti.com/lsds/ti/dsp/home.page?DCMP=TIHeaderTracking&HQS=Other+OT+hdr_p_dsp.
[59] Telecommunication Standardization Sector (ITU-T). T.24: Standardized digitized image set. http://www.itu.int/rec/T-REC-T.24/en.
[60] Telecommunication Standardization Sector (ITU-T). T.800: Information technology - JPEG 2000 image coding system: Core coding system, 2001.

[61] V. Rusnak, J. Matela, and P. Holub. GPU-based sample-parallel context modeling for EBCOT in JPEG2000. 2010 Annual Doctoral Workshop on Mathematical and Engineering Methods in Computer Science (MEMICS '10). Available at http://www.memics.cz/2010/pres/palava/Saturday/matela.pdf.
[62] Kakadu. Kakadu JPEG 2000 compression software. Available at http://www.kakadusoftware.com/.
[63] A. Kiely and M. Klimesh. The ICER progressive wavelet image compressor. In IPN Progress Report, pages 1–46, 2003.
[64] Roto Le, Iris R. Bahar, and Joseph L. Mundy. A novel parallel tier-1 coder for JPEG2000 using GPUs. In Application Specific Processors (SASP), 2011 IEEE 9th Symposium on, pages 129–136, June 2011.
[65] Roto Le, Joseph L. Mundy, and Iris R. Bahar. High performance parallel JPEG2000 streaming decoder using GPGPU-CPU heterogeneous system. In Application-specific Systems, Architectures and Processors, 2012 IEEE 23rd International Conference on, July 2012.
[66] Chung-Jr Lian, Kuan-Fu Chen, Hong-Hui Chen, and Liang-Gee Chen. Analysis and architecture design of block-coding engine for EBCOT in JPEG 2000. Circuits and Systems for Video Technology, IEEE Transactions on, 13(3):219–230, March 2003.
[67] S.G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 11(7):674–693, July 1989.
[68] Hong Man, Alen Docef, and Faouzi Kossentini. Performance analysis of the JPEG 2000 image coding standard. Multimedia Tools Appl., 26(1):27–57, May 2005.
[69] P. Meerwald, R. Norcen, and A. Uhl. Parallel JPEG2000 image coding on multiprocessors. In Parallel and Distributed Processing Symposium, Proceedings International, IPDPS 2002, pages 2–7, 2002.
[70] G.E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85, January 1998.
[71] ISO/IEC JTC 1/SC 29/WG1 N505. ISO/IEC call for contributions for JPEG 2000, March 1997.
[72] NASA. Mars Exploration Rovers. http://marsrovers.jpl.nasa.gov/home/index.html.
[73] Roland Norcen and Andreas Uhl. High performance JPEG 2000 and MPEG-4 VTC on SMPs using OpenMP. Parallel Comput., 31:1082–1098, October 2005.
[74] University of Southern California (USC). The USC-SIPI image database. http://sipi.usc.edu/database/.
[75] OpenJPEG. OpenJPEG JPEG 2000 compression library. Available at http://www.openjpeg.org/.

[76] Oracle. Developing parallel programs - a discussion of popular models. http://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/oss-parallel-programs-170709.pdf.
[77] David Patterson and John Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 2011.
[78] V. Rusnak. Design and implementation of arithmetic coder for CUDA platform. Diploma thesis, Masarykova Univerzita, Fakulta Informatiky, 2010.
[79] Amir Said. Comparative analysis of arithmetic coding computational complexity. Technical report, Imaging Systems Laboratory, HP Labs Palo Alto, 2004. http://www.hpl.hp.com/techreports/2004/HPL-2004-75.pdf.
[80] Amir Said. Introduction to arithmetic coding - theory and practice. Technical report, Imaging Systems Laboratory, HP Labs Palo Alto, 2004. http://www.hpl.hp.com/techreports/2004/HPL-2004-75.pdf.
[81] Khalid Sayood. Lossless Data Compression. Academic Press, 2005.
[82] Herb Sutter. The free lunch is over: a fundamental turn toward concurrency in software. Available at http://www.gotw.ca/publications/concurrency-ddj.htm.
[83] D. Taubman. High performance scalable image compression with EBCOT. Image Processing, IEEE Transactions on, 9(7):1158–1170, July 2000.
[84] David S. Taubman and Michael W. Marcellin. JPEG2000 Image Compression Fundamentals, Standards and Practice. Kluwer Academic Publishers, 2002.
[85] K. Varma and A. Bell. JPEG2000 - choices and tradeoffs for encoders. Signal Processing Magazine, IEEE, 21(6):70–75, November 2004.
[86] K. Varma, H.B. Damecharla, A.E. Bell, J.E. Carletta, and G.V. Back. A fast JPEG2000 encoder that preserves coding efficiency: The split arithmetic encoder. Circuits and Systems I: Regular Papers, IEEE Transactions on, 55(11):3711–3722, December 2008.
[87] Vcodex. H.264. http://www.vcodex.com/h264.html.
[88] A. Weiss, Martin Heide, Simon Papandreou, and Norbert Fürst. CUJ2K: a JPEG2000 encoder in CUDA. 2009.
[89] Wikipedia. Discrete Cosine Transform (DCT). Overview at http://en.wikipedia.org/wiki/Discrete_cosine_transform.
[90] Wikipedia. Discrete Wavelet Transform (DWT). Overview at http://en.wikipedia.org/wiki/Discrete_wavelet_transform.
[91] Wikipedia. JPEG2000 standard. Overview at http://en.wikipedia.org/wiki/JPEG_2000.