Video Processing Acceleration using Reconfigurable Logic and Graphics Processors

Benjamin Thomas Cope MEng ACGI

A thesis submitted for the degree of Doctor of Philosophy of Imperial College London and for the Diploma of Membership of Imperial College London

Circuits and Systems Group
Department of Electrical and Electronic Engineering
Imperial College London
February 2008

Abstract

A vexing question is ‘which architecture will prevail as the core feature of the next state-of-the-art video processing system?’ This thesis examines the substitutive and collaborative use of two alternatives: the reconfigurable logic and graphics processor architectures.

A structured approach to executing architecture comparison is presented. This includes a proposed ‘Three Axes of Algorithm Characterisation’ scheme and a formulation of performance drivers. The approach is an appealing platform for clearly defining the problem, assumptions and results of a comparison. In this work it is used to resolve the advantageous factors of the graphics processor and reconfigurable logic for video processing, and the conditions determining which one is superior.

The comparison results prompt the exploration of the customisable options for the graphics processor architecture. To clearly define the architectural design space, the graphics processor is first identified as part of a wider scope of homogeneous multi-processing element (HoMPE) architectures. A novel exploration tool is described which is suited to the investigation of the customisable options of HoMPE architectures. The tool adopts a systematic exploration approach and a high-level parameterisable system model, and is used to explore pre- and post-fabrication customisable options for the graphics processor.

A positive result of the exploration is the proposal of a reconfigurable engine for data access (REDA) to optimise graphics processor performance for video processing-specific memory access patterns. REDA demonstrates the viability of the use of reconfigurable logic as collaborative ‘glue logic’ in the graphics processor architecture.

Acknowledgement

I gratefully acknowledge the support of my co-supervisors Professors Peter Y.K. Cheung and Wayne Luk. Peter, for providing an inspirational working environment, insightful guidance both within and outside of my research studies, and the opportunity to attend international conferences. To Wayne I am indebted for his relentless support and for always making time for an inspirational discussion.

The financial support from the Donal Morphy and Engineering and Physical Sciences Research Council (EP/C549481/1) scholarships has made my studies feasible. My gratitude goes to the awarding bodies for providing this unique opportunity. I wish to thank my project sponsors Sony Broadcast and Professional Research Labs Europe for their financial assistance and two industrial placements which were both academically and personally rewarding. Of particular note are John Stone, Simon Haynes and Sarah Witt, for their support in guiding my research activities.

From within the Circuits and Systems Group at Imperial College London I acknowledge the academic and moral enlightenment from, notably and amongst others, Pete Sedcole, Christos Bouganis, Terrance Mak, Sutjipto Arifin, Laurence Hey, Jon Clarke and Alastair Smith. The experiences we have shared will stay with me forever. Discussions with Lee Howes and Jay Cornwall, from the Department of Computing, have contributed significantly to my knowledge of graphics processors. In particular, Jay’s GLSLProgram C++ class library has been an invaluable asset to my work. The technical comment and guidance from Julien Lamoureux and David Thomas, also from the Department of Computing, is gratefully acknowledged.

Outside of college my life has been touched and inspired by friends from the Serpentine running club and the Imperial College swim and water polo club. Prominently by William Bradlow and Eoin O Colgain in moral guidance and comradeship. Thank you all for providing many happy memories and sharing my sporting passions.

I would achieve nothing without the encouragement and compassion I receive from all of my family. Foremost, I am blessed with two loving parents who continue to provide unquestioning support and guidance. This thesis is dedicated to them. Finally, to my girlfriend Barbie Bressolles, who is always a sanctuary of calm and clarity: for your time spent and diligence in proof-reading my thesis, I owe you!

Contents

Abstract

Acknowledgement

Contents

List of Figures

List of Tables

Chapter 1. Introduction
  1.1 Motivation
  1.2 Research Questions
  1.3 Original Contributions
  1.4 Organisation of the Thesis

Chapter 2. Background
  2.1 Definition: Video Processing Systems and Accelerators
  2.2 The Architectures of Suitable Accelerator Devices
    2.2.1 Reconfigurable Logic and Reconfigurable Computing
    2.2.2 Graphics Processors
    2.2.3 Homogeneous Multi-Processing Element Architectures
    2.2.4 Other Relevant Accelerator Devices
  2.3 The Comparison Challenge
    2.3.1 Die Area
    2.3.2 Development Tools: Design Time
    2.3.3 Power Minimisation Techniques
    2.3.4 Numerical Precision
    2.3.5 Computational Density: A Comparison
    2.3.6 Comparison Setup
  2.4 Related Literature on Implementation Studies
    2.4.1 Comparative Studies
    2.4.2 Video Processing on the Graphics Processor
  2.5 Design Space Exploration
    2.5.1 Motivating Literature
    2.5.2 Models for Design Space Exploration
  2.6 Synthesis of the Background with the Thesis
    2.6.1 Algorithm Characteristics and Implementation Challenges
    2.6.2 Reconfigurable Logic and Graphics Processor Comparison
    2.6.3 Design Space Exploration of HoMPE Architectures

    2.6.4 Customising a Graphics Processor with Reconfigurable Logic
  2.7 Summary

Chapter 3. Algorithm Characteristics and Implementation Challenges
  3.1 Video Processing Terminology
    3.1.1 Video, Throughput Rate and Realtime Performance
    3.1.2 Colour Spaces
    3.1.3 Algorithm Acceleration and Amdahl’s Law
  3.2 Case Study Algorithms
    3.2.1 Primary Colour Correction
    3.2.2 2D Convolution Filter
    3.2.3 Video Frame Resizing
    3.2.4 Histogram Equalisation
    3.2.5 Motion Vector Estimation
  3.3 The Three Axes of Algorithm Characterisation
    3.3.1 Axis 1: Arithmetic Complexity
    3.3.2 Axis 2: Memory Access Requirements
    3.3.3 Axis 3: Data Dependence
    3.3.4 Case Study Algorithm Characterisation
  3.4 Implementation Challenges
    3.4.1 Efficient Graphics Processor Implementations
    3.4.2 Reconfigurable Logic Toolbox
    3.4.3 The Implications for Video Processing Acceleration
  3.5 Summary

Chapter 4. Reconfigurable Logic and Graphics Processor Comparison
  4.1 Setting the Scene: A Performance Summary
  4.2 Formulating the Performance Comparison
    4.2.1 Throughput Rate Drivers
    4.2.2 Analysis of the Factors which Affect Speed-up
    4.2.3 Instruction Disassembly for the Graphics Processor
  4.3 Experimental Setup
  4.4 Quantifying the Throughput Rate Drivers
    4.4.1 Clock Speed Comparison
    4.4.2 Cycles per Output and Iteration Level Parallelism
  4.5 Axis 1: Arithmetic Complexity
    4.5.1 Deterministic Graphics Processor Performance
    4.5.2 Intrinsic Benefits over the General Purpose Processor ISA
    4.5.3 The Implications for non-Arithmetic Intensive Algorithms
  4.6 Axis 2: Memory Access Requirements
    4.6.1 On-Chip Memory Accesses
    4.6.2 Off-Chip Memory Accesses
    4.6.3 Case Study: Graphics Processor Input Bandwidth
    4.6.4 Case Study: The Effect of Variable Data Reuse Potential
  4.7 Axis 3: Data Dependence
    4.7.1 A Data Dependence Strategy for the Graphics Processor
    4.7.2 Case Study: Optimising For Data Dependence
    4.7.3 A Comparison to the Reconfigurable Logic ‘Strategy’
  4.8 Limitations of the Performance Comparison
    4.8.1 Reconfigurable Logic Resource Usage
    4.8.2 Numerical Precision
    4.8.3 Application Scalability
  4.9 Summary

    4.9.1 Findings from the Structured Comparison
    4.9.2 Future Scalability Implications
    4.9.3 Answers to Performance Question to Motivate Chapters 5 and 6

Chapter 5. Design Space Exploration of HoMPE Architectures
  5.1 Design Space Exploration: The Big Picture
    5.1.1 The Challenges
    5.1.2 A Systematic Approach
    5.1.3 Tool Flow
  5.2 The Design Space: Architecture Feature Set
    5.2.1 Architecture Model
    5.2.2 Optimisation Goals
  5.3 Architectural Trends for the Graphics Processor
    5.3.1 System Model with a Single Processing Element
    5.3.2 System Model with Multiple Processing Elements
    5.3.3 The Distribution of Memory Access Patterns
    5.3.4 Limitations of the Model
    5.3.5 Case Study: Motion Vector Estimation
    5.3.6 Implications for other HoMPE Architectures
  5.4 Post-Fabrication Customisable Options
    5.4.1 Memory System
    5.4.2 Processing Group Flexibility
    5.4.3 Flexible Processing Pattern
  5.5 Summary

Chapter 6. Customising a Graphics Processor with Reconfigurable Logic
  6.1 Reconfigurable Logic in a HoMPE Architecture
    6.1.1 The Proposed Classification
    6.1.2 Prior Reconfigurable Logic Customisations
    6.1.3 Computation using Reconfigurable Logic
  6.2 Problem Definition: Memory System Performance
  6.3 A Reconfigurable Engine for Data Access
    6.3.1 The Graphics Processor Memory System
    6.3.2 Outline of the REDA Architecture
    6.3.3 Memory Address Mapping
    6.3.4 Reconfigurable Logic Proof of Concept
    6.3.5 The Requirements for the Embedded Core
  6.4 The Performance Gain from the REDA Module
    6.4.1 Analysis using the System Model
    6.4.2 Enhanced Graphics Processor Performance
  6.5 Summary

Chapter 7. Conclusion
  7.1 Summary of the Thesis
  7.2 Answers to the Thesis Question
  7.3 Future Work

Appendix A. Glossary

Bibliography

List of Figures

1.1 Process Technology Sizes for Graphics Processor Devices, Xilinx FPGA Devices and the ITRS Technology Node Trend
2.1 A High-Level Representation of a Graphics Rendering Pipeline
2.2 A Diagram of the nVidia GeForce 6 Series Graphics Processor
2.3 A Block Diagram of the HoMPE Class of Architectures
2.4 Tool Flows for a Graphics Processor and an FPGA for an Increasing Level of Abstraction
2.5 The Kahn Process Network Application Model
3.1 A Conceptual Representation of the Relationship Between Colour Spaces
3.2 Pseudo-code for 3-Step Non-Full-Search Motion Vector Estimation
3.3 The Three Axes of Algorithm Characterisation
3.4 A Population of The Three Axes of Algorithm Characterisation with the Case Study Algorithms
3.5 The Fragment Processing Pipelines in a Graphics Processor
3.6 A Multiply-Addition Tree and a Rearrangement Utilising XtremeDSP Slices
3.7 An Example Application of Time Division Multiplexing for 2D Convolution
3.8 Data Reuse Strategies for Reconfigurable Logic Implementations
4.1 A Summary of Throughput Performance for Sample Graphics Processor and Reconfigurable Logic Devices
4.2 Pseudo-code for Timing Measurement on the Graphics Processor
4.3 A Breakdown of the Peak Throughput for the Primary Colour Correction Case Study Algorithm Implemented on Sample Graphics Processors, Reconfigurable Logic Devices and a General Purpose Processor
4.4 Clock Cycles per Output for 2D Convolution Implementations on the Graphics Processor
4.5 Graphical 1D Representation of the Memory Access Requirement for the Motion Vector Estimation Case Study for Methods 1 and 2
4.6 Throughput for Case Study Algorithms and Changing Video Frame Sizes
5.1 A Systematic Approach to Design Space Exploration
5.2 Tool Flow for the Systematic Design Space Exploration Approach
5.3 The Architectural Design Space for the HoMPE Class of Architectures
5.4 High Level Representation of the Design Space Exploration Model
5.5 The Z-Pattern Pixel Processing Order and Memory Storage Pattern for the Graphics Processor Memory System
5.6 Estimation of the Number of Execution Threads for Sample Graphics Processors
5.7 A Model of the Behaviour of a Processing Element
5.8 Memory System Analysis for the GeForce 7900 GTX Graphics Processor for Varying Memory Access Stride Length and Number of Reads

5.9 Processing Pattern for Decimation and 2D Convolution Case Studies with a Setup of One Processing Element and a Single Execution Thread
5.10 Processing Pattern for Decimation and 2D Convolution Case Studies with a Setup of Four Processing Elements and Tb = 1500 Computation Threads (Equivalent to the nVidia GeForce 6800 GT)
5.11 Performance in Number of Off-Chip Memory Accesses and Clock Cycles per Output for Varying Cache Size, Number of Processing Elements and Case Studies
5.12 Performance in Number of Off-Chip Memory Accesses and Clock Cycles per Output for Varying Number of Computation Threads per Processing Element, Model Parameters and the Decimation Case Study
5.13 Histogram Distribution of the Off-Chip Memory Access Patterns for Sample Case Study Algorithms
6.1 Architectural Classes for the use of Reconfigurable Logic in an Homogeneous Multi-Processing Element Architecture
6.2 The Memory Access Requirements of the Window Function Kernel
6.3 Graphics Processor Memory System Performance Analysis for a Generalised Window Kernel Function
6.4 A Modified Graphics Processor Memory System: Illustrating Processing, Cache Request and Memory Access Patterns
6.5 REDA Architecture Description
6.6 A Representation of the Address Space for the Proposed Mapping Scheme
6.7 Component Level Diagram of the REDA Module with Non-Trivial Components Labelled A–H
6.8 Number of Memory Reads and Optimal Number of Memory Reads for Window Size versus Overlap
6.9 Number of Off-Chip Memory Reads for Threading Factor versus Window Size for a Memory Access Granularity of One
6.10 The Memory Access Behaviour of the System Model for Fixed Threading Factor and Window Size. Results Show Variations in Distribution, Pattern and Memory Access Granularity
6.11 The Memory Access Behaviour of the System Model for Fixed Memory Access Granularity and Window Size. Results Show Variations in Distribution, Pattern and Threading Factor
6.12 Graphics Processor Performance for Threading Factor versus Window Size for Fixed Memory Access Granularity
6.13 Graphics Processor Performance for Threading Factor versus Overlap for Fixed Memory Access Granularity and Window Size

List of Tables

2.1 Empirical Computational Density for Sample Graphics Processor and Reconfigurable Logic Devices
2.2 A Summary of Speed-up for FPGA, Graphics Processor and Cell/B.E. Devices over GPP Implementations
3.1 Sample Broadcast Video Formats and Throughput Requirements
3.2 A Summary of Colour Space Acronyms and Applications
3.3 A Categorisation of Arithmetic Operations
3.4 The Characterisation of Case Study Algorithms
3.5 A Summary of the Number of Inputs and Outputs for each Case Study
3.6 Techniques for Efficient Implementations on the Graphics Processor and Reconfigurable Logic
4.1 The Relative Strengths of Target Devices for Throughput Rate Drivers
4.2 A Summary of Terms and Definitions for a Qualitative Analysis of the Relative Speed-up of Devices
4.3 The Benefits of a Graphics Processor over a General Purpose Processor
4.4 Instruction Disassembly for Graphics Processor Implementations of each Case Study Algorithm
4.5 Clock Speed in Megahertz for Reconfigurable Logic Implementations of Case Study Implementations on Three Target FPGA Devices
4.6 A Speed-up Comparison Between the GeForce 7900 GTX Graphics Processor and the Virtex-4 Reconfigurable Logic Device
4.7 Example Clock Cycle per Pixel Budgets for the GeForce 7900 GTX and GeForce 6800 GT Graphics Processors over a Subset of Video Formats
4.8 Instruction Set Architecture Efficiency of the nVidia GeForce 7900 GTX Graphics Processor relative to a 3.0 GHz Pentium 4 Processor
4.9 Number of Macro-Block Operations versus Processor Instruction Count for Implementations of the Primary Colour Correction Algorithm
4.10 Speed-up for Primary Colour Correction
4.11 A Summary of Number of Cycles per Operational Instruction for the Graphics Processor for all Case Study Algorithms
4.12 Input Bandwidth, in Gigabytes per Second, for the Graphics Processor
4.13 On-Chip Memory Access Requirements for Graphics Processor and Reconfigurable Logic Implementations of the Case Study Algorithms
4.14 Off-Chip Memory Access Requirements for Graphics Processor and Reconfigurable Logic Implementations of the Case Study Algorithms
4.15 Achieved Input Bandwidth for 2D and 1D Separable Resizing Implementation Methods on a GeForce 7900 GTX Graphics Processor
4.16 Speed-up of Reconfigurable Logic (Virtex-4) versus the Graphics Processor (GeForce 7900 GTX) for Varying Bi-cubic Resizing Ratios (1D Method)
4.17 A Performance Summary for the Alternative Reduction Options for Implementing Histogram Equalisation and Non-Full-Search MVE on the GeForce 7900 GTX Graphics Processor

4.18 Slice Count, XtremeDSP and Block Select RAM Usage for Selected Case Study Algorithms on a Virtex-4 Reconfigurable Logic Device
5.1 Verification of the Model Quoting Cycles per Output for the nVidia GeForce 7900 GTX, nVidia GeForce 6800 GT and the System Model
5.2 Performance Results for the 3-Step Non-Full-Search Motion Vector Estimation Algorithm when Implemented on the System Model
5.3 The Number of Off-Chip Memory Accesses and (Clock Cycles per Output) for Raster Scan and Z-Pattern Processing Patterns and the Decimation Case Study
6.1 A Summary of the Roles for Reconfigurable Logic within a Homogeneous Multi-Processing Element Architecture
6.2 Clock Cycle Count for the System Model and an nVidia GeForce 6800 GT for an Improved Histogram Equalisation Implementation
6.3 Resource Usage and Latency for Variants of the REDA Block when Implemented on a Xilinx Virtex-5 Device
6.4 Input Bit-Widths for Key REDA Block Components
6.5 Number of Memory Reads for Number of Threads versus Cache Size for a Fixed Window Size
6.6 Number of Off-Chip Memory Reads for Threading Factor versus Memory Access Granularity for a Fixed Window Size
6.7 Graphics Processor Performance for Threading Factor versus Memory Access Granularity for a Fixed Window Size

Chapter 1

Introduction

1.1 Motivation

The impressive and continuing enhancements to the visual experience of broadcast and cinematic media are testimony to the progress of video processing techniques. To support the increased computational demands of improved processing techniques, a periodic and meticulous re-design of video processing systems is required.

One choice of video processing system is a general purpose processor (GPP) connected to an application-specific integrated circuit (ASIC) device. An ASIC device is beneficial over a GPP for implementing algorithms that exhibit regular and parallel computation, as is common in video processing. The ASIC is termed an accelerator to the GPP. Two popular alternatives to the ASIC accelerator are the field programmable gate array (FPGA) and the digital signal processor (DSP). The ASIC, FPGA and DSP each have their own niche benefits. However, for video processing systems an FPGA is observed to present an attractive trade-off between the speed to market of a DSP and the custom hardware benefits of an ASIC. The market for FPGAs is vibrant and growing.

Despite the persistent efforts of FPGA vendors, there remain a number of disadvantages to FPGA devices. These include a large design effort for applications, in comparison to that for the processor, and a higher area and power consumption than a full-custom ASIC solution. This is just one motivation to search for new devices for future video processing systems.

One rapidly developing technology, which presents a viable alternative ‘accelerator’, is the graphics processor. The development of the graphics processor is driven by the video games industry, estimated in 2006 to be worth $31.6bn annually [125]. This has resulted in a massively parallel architecture, with currently up to 346 GFLOPs of compute power and 103.7 GB/s of off-chip memory bandwidth [100].

Unsurprisingly, this growth has attracted interest from outside the intended application of graphics rendering [30].

The graphics processor is designed for algorithms which require: an identical function per output; highly localised memory accesses; and a large proportion of compute instructions to memory accesses. A video processing algorithm often exhibits a number of these features, but in most cases not all of them. Therefore, whilst it is rational to implement video processing algorithms on a graphics processor, the process is not always straightforward and, perhaps most importantly, high performance is not guaranteed.

As a currently attractive technology, an FPGA device can be used as a benchmark by which to assess the suitability of a graphics processor for video processing. More generally, a comparison can be made between the graphics processor and reconfigurable logic, since an FPGA is fundamentally a reconfigurable logic architecture.

To set the scene for the comparison, consider the progress in process technology sizes for each device, as shown in Figure 1.1. The technology node trend from the semiconductor roadmap [67] is plotted alongside for comparison. Note that the published release dates of FPGA devices often precede device availability by three to six months. In Figure 1.1, the process technology of the graphics processor is consistently one technology node generation behind reconfigurable logic. This in turn approximately tracks the technology node trend, reducing at a rate of 0.7× over each four year period.
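The design bias towards ‘a large proportion of compute instructions to memory accesses’ can be made concrete from the headline figures quoted above. An illustrative balance calculation (not taken from the cited sources), assuming 4-byte words, gives:

\[ \frac{346\ \text{GFLOP/s}}{103.7\ \text{GB/s} \,/\, 4\ \text{B per word}} \approx 13\ \text{floating-point operations per word fetched}. \]

An algorithm must therefore perform on the order of thirteen arithmetic operations for every off-chip word it reads to keep the processing elements busy; the memory access axis of the characterisation scheme in Chapter 3 captures exactly this pressure.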


Figure 1.1: Process Technology Sizes for nVidia Graphics Processor Devices, Xilinx FPGA Devices and the ITRS Technology Node Trend [67, 100, 146]

A potential justification for the larger process technology size of a graphics processor, in addition to release date discrepancies, is the regularity of the reconfigurable logic fabric. In contrast to a relatively complex graphics processor, only a small number of reconfigurable logic components need to be re-implemented for each technology node advancement.

The vibrancy of the graphics processor industry is also demonstrated in Figure 1.1, with the interval between device releases shrinking to only six months. For FPGA devices, new releases occur approximately every 18 months. A healthy rate of development is observed for each device. For the comparative work in this thesis the Xilinx Virtex-4 FPGA and nVidia GeForce 7900 GTX (G71) graphics processor are chosen as devices with equivalent process technology size and availability dates¹.

To achieve a high graphics rendering performance, graphics processor vendors make many application-specific choices when designing their devices. It is interesting to ask which different choices would make a graphics processor more suited to video processing. As motivation for exploring this question, consider Makimoto’s wave of ten year cycles for leading reconfigurable technologies. In this projection Makimoto predicts a new era, from the present time, of ten years of customisation [82]. In favour of Makimoto’s prediction is the drive to customise architectures for optimum performance in a given application domain. The graphics processor, for the graphics rendering application domain, is a pertinent example.

One trend which runs contrary to Makimoto’s prediction is the spiralling cost associated with device fabrication. This results in a desire to broaden the application domain of new and current architectures for maximal reuse. An example is the combined interest of Sony, Toshiba and IBM in the Cell Broadband Engine project.

A novel architecture concept, which favours Makimoto’s prediction, is found in the origins of the use of reconfigurable logic as system glue logic, for example, to facilitate customisable system interconnects. In the context of today’s system challenges, one may consider, at a lower level, the pre-fabrication choice of embedded reconfigurable logic blocks for the purpose of facilitating equivalent post-fabrication customisations. This concept also partially standardises the modified architecture to a broadened, but still limited, application domain.

It is important to note that the target application of this thesis is high-end broadcast video processing systems, for example, the Sony AWS-G500 Anycast Station. A property of high-end products is that material price is negligible in comparison to man-hour development costs and throughput performance. This is in contrast to high-volume consumer products, on which unit price has a greater impact. Unit price is thus disregarded in this work.

To summarise the thesis motivations, five questions are now posed in Section 1.2.

¹This is taken at a sample time instance (02/06) at which the ‘available’ state-of-the-art FPGA device and graphics processor are fabricated on the same process technology.

1.2 Research Questions

The research questions which motivate this thesis and form the objectives are as follows:

Q. 1. Which architectural features make a graphics processor suitable for providing real-time performance for current and future video processing systems?

Q. 2. Under what conditions is the graphics processor a superior technology to reconfigurable logic for the task of video processing acceleration? Conversely, when is a graphics processor an inferior technology?

Q. 3. Given the desired freedom, how could one improve the graphics processor architecture to better its performance for video processing algorithms?

Q. 4. What are the options for, and is there a viable use for, an architecture which includes both reconfigurable logic and a graphics processor?

Q. 5. What positive and negative features of the graphics processor architecture should system architects include and avoid in future video processing systems?

These questions are answered in this work and related work as follows. In answer to Q. 1, graphics processor implementations of video processing algorithms are presented in Chapters 3 and 4. These provide novel additions to the literature, which includes related implementation studies [27, 44, 47, 52, 72, 103, 127, 130]. A structured approach for comparing graphics processors and reconfigurable logic is promoted in Chapter 4. This has contributing factors for Q. 2. Related comparison work [15, 60, 92, 95, 152] provides motivating analysis but no structured approach.

To explore questions Q. 3 and Q. 4, an exploration tool is presented in Chapter 5. Whilst related work explores the low [38, 118] and high [52, 79, 99] level architectural detail of the graphics processor, this work provides a trade-off between these extremes. In Chapter 5, the exploration tool is used to consider the pre-fabrication customisable options for a graphics processor. This exploration provides answers to Q. 3. The exploration of a post-fabrication customisable option in Chapter 6 presents a novel use of reconfigurable logic as glue logic in a graphics processor architecture. This is one example answer to Q. 4. Related work [137, 153] has considered alternative glue logic options for interconnect and self-tuning cache customisations.

From the answers to questions Q. 1 to Q. 4, which result from the above contributions, suggestions in answer to Q. 5 are provided in Section 7.2. Section 7.2 also summarises the answers to all the thesis questions.

1.3 Original Contributions

The primary contributions of this thesis are presented below, with a summary of further contributions provided at the start of each Chapter. Peer-reviewed publications, on which this thesis is in part based, are listed overleaf.

Contributions

• The comparison methodology in Chapter 4 is identified as the first structured approach to comparing graphics processors and reconfigurable logic. In addition, the results presented in Publication III and included in Chapter 4 show, to the author’s knowledge, the first comparison of the graphics processor and reconfigurable logic in any application domain. The comparison approach presented in Chapter 4 builds on two elements: first, the algorithm characterisation presented in Chapter 3; second, an application of work by Guo et al. [53] to formulate the comparison challenge for the graphics processor and reconfigurable logic. The comparison approach and its application are summarised in Publication VII.

• A novel high-level and parameterisable system model is implemented in the IEEE 1666-2005 SystemC class library. In conjunction with a design space exploration approach, also proposed in Chapter 5, the system model forms an exploration tool. This exploration tool is a significant contribution to the study of the pre- and post-fabrication customisable options for an homogeneous multi-processing element architecture. The model is the subject of Publication II and is part of the exploration tool presented in Publication IV. An additional contribution of the model is its use to evaluate the proposal made in Chapter 6, as described below.

• In Chapter 6, a reconfigurable engine for data access (REDA) is proposed as a novel use of a small reconfigurable logic block in a graphics processor architecture. It is shown, through using the system model and through open-loop tests on a graphics processor, that the REDA improves graphics processor memory access performance, and also overall system performance, by up to one order of magnitude. The proposed REDA is a significant contribution to the investigation of the combined use of reconfigurable logic and graphics processors for video processing acceleration. This work is the subject of Publication I.

Publications

I. Ben Cope, Peter Y.K. Cheung and Wayne Luk, “Using reconfigurable logic to optimise GPU memory accesses”, In Proceedings of the 11th ACM/SIGDA Design, Automation and Test in Europe Conference, pp. 44–49, March 2008.

II. Ben Cope, Peter Y.K. Cheung and Wayne Luk, “Bridging the gap between FPGAs and multi-processor architectures: A video processing perspective”, In Proceedings of the 18th IEEE International Conference on Application Specific Systems, Architectures and Processors, pp. 308–313, July 2007.

III. Ben Cope, Peter Y.K. Cheung, Wayne Luk and Sarah Witt, “Have GPUs made FPGAs redundant in the field of video processing?”, In Proceedings of the IEEE International Conference on Field-Programmable Technology, pp. 111–118, December 2005.

IV. Ben Cope, Peter Y.K. Cheung and Wayne Luk, “Systematic Design Space Exploration for Customisable Multi-Processor Architectures”, SAMOS VIII: Embedded Computer Systems: Architectures, Modeling, and Simulation, To Appear.

The thesis questions have been published as the subject of a PhD forum, as follows:

V. Ben Cope, “Can GPUs be used to improve video processing systems?”, In Proceedings of the 16th IEEE International Conference on Field Programmable Logic and Applications, pp. 947–948, August 2006, PhD Forum Poster.

Chapters 2 and 5 cite the author’s contribution² to the following publication:

VI. Qiwei Jin, David Thomas, Wayne Luk and Benjamin Cope, “Exploring reconfigurable architectures for financial computation”, In Proceedings of the 4th International Workshop on Applied Reconfigurable Computing, LNCS 4943, March 2008, To Appear.

The following literature is pending review and available on request from the author:

VII. Ben Cope, Peter Y.K. Cheung and Wayne Luk, “Is a Graphics Processor or Reconfigurable Logic the Superior Video Processing Accelerator?”, Journal of Computers, Awaiting Review.

²The implementation of the American option pricing model on a graphics processor and the subsequent analysis of its performance in comparison to FPGA implementations.

1.4 Organisation of the Thesis

This thesis contains a further six Chapters, which are arranged as follows. Significant associations to thesis questions are included in brackets.

In Chapter 2, the work is further motivated with a review of: architectures, the comparison challenge, related comparison studies, work on graphics processor modelling and design space exploration techniques. Two important steps are the definition of an homogeneous multi-processing element (HoMPE) class of architectures, and a survey of related comparison studies. The classification identifies the broader architecture design space into which the comparisons and explorations made in this thesis fit.

The video processing application domain is generalised in Chapter 3, through the proposal of ‘The Three Axes of Algorithm Characterisation’ scheme. This scheme is used to characterise five video processing algorithms which are referenced throughout this thesis. It is also used to summarise the challenges for efficient implementation of video processing algorithms on the graphics processor and in reconfigurable logic.

At the start of Chapter 4, the comparison challenge for the graphics processor and reconfigurable logic is formulated based on work presented by Guo et al. [53]. This formulation is used alongside ‘The Three Axes of Algorithm Characterisation’ scheme to execute a structured approach to the comparison of these two technologies (Q. 1 and Q. 2).

An exploration tool, comprising a system model and a systematic design space exploration approach, is described in Chapter 5. This tool is suitable for studying the pre- and post-fabrication customisable options for HoMPE architectures. The findings from Chapter 4 are used to prompt the exploration of current graphics processor architectures. As a result of this exploration, suggestions are made for post-fabrication customisable options. One proposal is explored further in Chapter 6 (Q. 3).

In Chapter 6, options for combining the reconfigurable logic and graphics processor technologies are presented. The focus is to demonstrate a reconfigurable engine for data access (REDA) module which optimises graphics processor performance for poorly performing memory access patterns. This presents a case study of the exploration tool from Chapter 5. Of note, the system model is used to explore the off-chip memory access requirements of the graphics processor with and without the REDA (Q. 4).

The work is concluded in Chapter 7 with a summary of the key contributions in answering thesis questions Q. 1 to Q. 5. Finally, the options for future work are discussed.

Chapter 2

Background

The definition of ‘background’ from the Oxford English Dictionary is “n. exploratory or contributory information or circumstances”. In this Chapter, literature which has provided a significant technical contribution to, or exploratory studies in support of, ‘video processing acceleration using reconfigurable logic and graphics processors’ is surveyed. The reviewed literature outlines the circumstances under which this thesis is presented.

This Chapter is arranged as follows. In Section 2.1, system architecture terminology is clarified. Alternatives for video processing accelerator architectures, with a focus on reconfigurable logic and the graphics processor, are discussed in Section 2.2. In Section 2.3, the comparison challenge for accelerator architectures is presented. Related implementation studies are examined in Section 2.4. In Section 2.5, motivating literature on design space exploration and system modelling is considered. The background is synthesised with the work in this thesis in Section 2.6. In Section 2.7 the findings are summarised.

The following exploratory and survey contributions are made in this Chapter:

• A class of homogeneous multi-processing element (HoMPE) architectures is defined. This is inclusive of new and exciting accelerator technologies which include the graphics processor, Cell/B.E. and ClearSpeed CSX processors (Section 2.2.3).

• An analysis of the multifarious metrics for the comparison of the graphics processor and reconfigurable logic. The following noteworthy comparisons are made: die area estimates, tool flows and computational density (Section 2.3).

• A survey of comparison studies for the graphics processor, FPGA and Cell/B.E. processor accelerators. The performance for ten case study implementations on these accelerators is collated from related literature in Table 2.2 (Section 2.4.1).

• A review of analytical and physical models of the graphics processor (Section 2.5.2).

2.1 Definition: Video Processing Systems and Accelerators

The video image processing application domain includes subsets of video conferencing, surveillance and broadcast functions. To facilitate the requirements of one of these subsets, a suitable video system infrastructure must be chosen. A typical video system comprises stages of video capture, processing and transmission.

For a broadcast video system, the scenario is as follows. The video capture process is to sample a real world scene through a video camera. Processing includes the diverse tasks of video encoding and decoding (known collectively as CODECs), enhancement and detection. A video is transmitted over a chosen medium, for example in electromagnetic waves via an antenna or over internet protocol. The goal is to present an audience with an aesthetically pleasing representation of the original scene, via the transmission medium.

To satisfy the insatiable demand of consumers for increased video quality, the hardware requirements for each stage of a video system are increasingly demanding. The focus of this work is hardware associated with the processing stage for broadcast video.

A popular hardware setup for a video system is a general purpose processor (GPP) coupled with an accelerator resource. The accelerator is a slave device to the host GPP and is capable of executing a subset of processing functions with superior performance (based on a given metric) to the GPP. The term GPP refers here to high-end CPU devices. A minimal sketch of this host–accelerator arrangement is given at the end of this Section.

There is a profusion of suitable options for an accelerator device, each option presenting a different degree of flexibility and performance. At the extremes, of no to high flexibility and high to low speed, one may choose an application-specific integrated circuit (ASIC) or an additional GPP. The graphics processor and a field programmable gate array (FPGA) can also be considered as alternative accelerators. These devices present superior trade-offs in flexibility and performance, between the two extremes of an ASIC or a GPP. As an extension, the choice of accelerator resource is not limited to a single device. In the general case, one may select a number of different graphics processors, FPGAs, DSPs, ASICs etc. as the accelerator resource.

The challenge for multiple, and often single, accelerator device architectures is to provide a high-bandwidth interconnect between the host GPP, accelerator device(s), input/output interfaces and any shared memory resource components. This challenge is being addressed, in part, by system miniaturisation. The trend is from multi-board to single-board and, in turn, to chip-level (multi-core) integration of the listed components.

Current examples of this trend are as follows. Intel QuickAssist Technology will enable front side bus access to FPGA devices [65]. XtremeData Inc. and Altera Corporation are developing a Stratix-III based co-processor module to connect to the Intel front side bus, with applications including medical imaging [6]. At chip level, Fusion from AMD Inc. is an initiative to combine a GPP and a graphics processor on a single die [10].
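To make the host–accelerator relationship concrete, the following C++ sketch shows the control flow a host GPP typically drives. The `Accelerator` interface and all names within it are hypothetical, standing in for whichever slave device (graphics processor, FPGA, DSP) is chosen; it illustrates the division of labour only, and is not an API from this thesis.

```cpp
#include <cstdint>
#include <vector>

// A video frame held in host memory.
struct Frame { int width = 0, height = 0; std::vector<uint8_t> pixels; };

// Hypothetical slave-device interface; the names are illustrative only.
class Accelerator {
public:
    virtual ~Accelerator() = default;
    virtual void upload(const Frame& in) = 0;  // host-to-device transfer
    virtual void run() = 0;                    // the offloaded processing subset
    virtual Frame download() = 0;              // device-to-host transfer
};

// The host GPP orchestrates capture and transmission, and delegates the
// regular, highly parallel processing stage to the accelerator.
Frame process_one_frame(Accelerator& device, const Frame& captured) {
    device.upload(captured);
    device.run();
    return device.download();
}
```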

2.2 The Architectures of Suitable Accelerator Devices

To present the alternatives for accelerator devices, this Section is arranged as follows. First, the two core technologies of reconfigurable logic and graphics processors, which are to be investigated in this thesis, are discussed in Sections 2.2.1 and 2.2.2 respectively. In Section 2.2.3, a generic class of homogeneous multi-processing element architectures is defined. Other relevant video processing accelerator devices are discussed in Section 2.2.4.

2.2.1 Reconfigurable Logic and Reconfigurable Computing

The computing paradigm of reconfigurable computing was first proposed by Gerald Estrin in 1960, as a processor coupled to reconfigurable hardware [41]. Despite this early proposal, the first reconfigurable computer, the Algotronix CHS2X4 (the IP for which was later bought by Xilinx Inc.), was not released until 1991. The key developments in reconfigurable hardware (referred to herein as reconfigurable logic) are summarised below.

Monolithic Memories Inc. introduced the first programmable array logic (PAL) device in 1978. This was a ‘one-time programmable’ device. To make PAL technology reconfigurable, EEPROM and SRAM based cells, amongst others, were later introduced. A significant development after PAL was the larger complex programmable logic device (CPLD). The CPLD device includes many PAL-like blocks connected through a switch matrix. Its programming complexity is the CPLD’s main disadvantage.

Field programmable gate arrays (FPGAs) were the next step after the CPLD. A key move was to substitute switch matrix interconnects with programmable routing channels. This separates the programming of interconnects and computation. The invention of the FPGA is attributed to Ross Freeman (a co-founder of Xilinx Inc.) in 1984 [143].

Today, Xilinx Inc. and Altera Corporation are the market leaders in fine-grained SRAM-based FPGA devices. A number of smaller corporations focus on niche markets which include non-volatile FPGA memory (Lattice Semiconductor), flash-based FPGA memory (Actel), embedded microcontrollers (Atmel), handheld applications (QuickLogic) and very fast, up to 2 GHz, FPGA devices (Achronix).

An alternative FPGA-like architecture targeted specifically at video image processing is the Arrix field programmable object array (FPOA) from MathStar [86]. An FPOA device is a heterogeneous architecture of coarse-grained arithmetic logic units (ALUs), multiply accumulators (MACs) and register files (RFs). The FPOA is one commercial example of a coarse-grained FPGA. Other coarse-grained FPGAs of varying granularity, including RaPiD, Raw and PipeRench, are surveyed by Sedcole [114]. The core advantage of a coarse-grained FPGA is to accelerate functions which would otherwise perform poorly in a fine-grained reconfigurable logic implementation [114].

This means that coarse-grained FPGAs have a limited scope; for example, MathStar’s Arrix FPOA device is targeted at video image processing applications [86]. It is noteworthy that today’s fine-grained FPGA devices also contain coarse-grained heterogeneous components [7, 144]. The inclusion of heterogeneous components in fine-grained FPGA devices has so far dampened the commercial impact of coarse-grained FPGAs.

Two technologies which populate the design space between ASICs and FPGAs are the Altera HardCopy [5] ‘structured ASIC’ device and the Xilinx EasyPath [148] solution. The design flow for HardCopy is to prototype a design on a Stratix FPGA device, and then to fabricate an ASIC using Stratix hard-core IP blocks. A structured ASIC requires a lower engineering cost and a shorter time to market than a full custom ASIC device. Xilinx’s EasyPath solution exploits the high verification costs for an FPGA device. Instead of performing full device verification, only the customer’s design file is verified before shipping. It is observed that this solution may also increase yield.

HardCopy and EasyPath are both exciting trade-offs between FPGA and ASIC devices. HardCopy improves power, speed and area performance over equivalent FPGA designs, whilst EasyPath reduces FPGA verification time for an improved time to market.

The devices surveyed above present solutions where reconfigurable logic provides a computational resource as a separate device. Prior work also includes the use of reconfigurable logic as a small core within a larger system [139, 140, 154]. Of note, Wilton proposes a synthesisable reconfigurable fabric suitable for signal processing applications [139]. Wilton’s fabric is arranged to be optimal for datapath based applications, with bus interconnects switched at a granularity of N bits. This limits the flexibility of the architecture for bitwise operations but improves logic density by up to 426 times over previously reported architectures [139]. In Section 6.3.5, a mapping of the proposed reconfigurable engine for data access (REDA) onto the fabric proposed by Wilton is explored.

A commercial application of the ‘small-scale’ use of reconfigurable logic is presented by DAFCA Inc. The proposal, called ClearBlue, is to include reconfigurable instruments within a system-on-chip to facilitate post-fabrication debugging [1].

The case study implementations in Chapter 3, and also the performance results in Chapter 4, target Xilinx FPGA devices. These devices are representative of frequently used reconfigurable logic technology in the video processing industry. The similarity of the design challenges for Xilinx Inc. and Altera Corporation products is sufficient to make this a fair setup for a comparison with the ‘significantly different’ graphics processor.

The architectural features of Xilinx FPGA devices relevant to this thesis are detailed below. Where appropriate, Altera device technology is contrasted. For further technical documentation the reader is directed to [7, 149], and for a comprehensive review of reconfigurable computing, to the survey papers by Compton [28] and Todman [133].

The Core Reconfigurable Fabric

In the first Xilinx FPGAs, the core components, which remain the same to date, included configurable logic blocks (CLBs) and switch-box based routing channel interconnects. For Altera devices, the synonymous component to a CLB is the logic array block (LAB).

A CLB contains lookup tables (LUTs) and registers. Each LUT can be used to implement product term expressions; a minimal model of this mechanism is sketched at the end of this subsection. A proportion (one half in the Xilinx Virtex-4 device [144]) of LUTs can also be used to implement shift registers, or distributed RAM memory. In the most recent Xilinx and Altera devices [7, 149], 6-input LUTs are used. For the Virtex-4 device [144] a 4-input LUT based CLB was chosen.

The choice of the nature of a logic block (CLB or LAB), including the LUT size, is a trade-off between minimising the granularity of computation and the required interconnects. For the coarse-grained architectures mentioned above, for example in the MathStar FPOA, a CLB is replaced with coarse-grained ALU, multiplier or RAM blocks.

Switch-box interconnects are a collection of transistors which allow interconnect lines to be multiplexed in a predetermined number of arrangements. The latest technologies are Altera’s MultiTrack [7] and Xilinx’s Express Fabric [149]. In each technology, a CLB (or LAB) can be connected locally to its neighbouring CLBs and globally to a predetermined number of other CLBs. The trade-off for the choice of interconnects is an increased flexibility, from more CLB interconnects, versus the area requirement.
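To make the LUT mechanism concrete, the following C++ model treats a K-input LUT as a 2^K-entry truth table held in SRAM cells, which is how an arbitrary Boolean function of K inputs, including any product term, is realised. This is an illustrative software model, not vendor RTL; all names are invented for the example.

```cpp
#include <bitset>
#include <cstdint>

// Model of a K-input SRAM lookup table: the 2^K configuration bits are
// the truth table of the implemented Boolean function. Programming the
// device (loading the bitstream) amounts to setting `config`.
template <unsigned K>
struct Lut {
    std::bitset<(1u << K)> config;  // one SRAM cell per input combination

    bool eval(uint32_t inputs) const {
        return config[inputs & ((1u << K) - 1)];  // K input bits index the table
    }
};

// Example: configure a 4-input LUT as the product term (a AND b AND NOT c AND d).
Lut<4> make_product_term() {
    Lut<4> lut;
    for (uint32_t in = 0; in < 16; ++in) {
        const bool a = in & 1u, b = in & 2u, c = in & 4u, d = in & 8u;
        lut.config[in] = (a && b && !c && d);
    }
    return lut;
}
```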

Embedded Components

As previously mentioned, FPGA vendors embed coarse-grained components into the core reconfigurable fabric. Embedded components include Block Select RAM memory, PowerPC processors, high-speed serial interfaces, multipliers and DSP blocks. By including a hard core component, one sacrifices the flexibility of an equivalent number (in area usage) of CLBs for an improved performance and computation density for a target operation. If the hard core is not used, the area it consumes is wasted.

In video applications, the on-chip storage of rows of video frames is a necessary feature of data reuse. One option is to use CLB LUTs to store video frame data. Two issues are that only half of the LUTs (in a Virtex-4 device) can be used for this purpose, and that a CLB contains other hardware (registers, multiplexers etc.). Both issues make this storage mechanism inefficient. For the large memory requirements of, for example, video processing (3.75 KBytes per row of a 720p format video) another solution is required.

The solution in the Xilinx Virtex generation of FPGAs is to use hard core Block Select RAM (BRAM) modules [144, 145]. In physical layout, a column of BRAM modules replaces a column of CLB blocks.
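As a check on the quoted row size (an illustrative calculation, assuming the 24-bit RGB pixels implied by the figure): a 720p row holds 1280 pixels, so

\[ 1280\ \text{pixels} \times 3\ \text{B/pixel} = 3840\ \text{B} = 3.75\ \text{KBytes}. \]

Buffering one such row in the 18 Kbit (16 Kbit data) BRAMs of a Virtex-4 device therefore requires two BRAM modules per buffered row.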

The Virtex-II generation of BRAMs included functionality to form memory blocks with varying aspect ratios, from 16K×1 bit through 512×36 bit, and full dual port configurations [145]. In addition, the Virtex-4 and Virtex-5 devices have built-in FIFO support in memory modules [144, 149]. This eliminates the need for external counter and status signals. Altera’s equivalent is the TriMatrix memory block [7]. Melis proposed an autonomous memory block (AMB) which is capable of generating its own access patterns, for example a striped access pattern [88]. The AMB is an extended case of the FIFO logic found in Xilinx’s BRAM [144] and Altera’s TriMatrix [7] memory.

The computational requirements of video processing often include a large number of multiply and addition operations. An implementation of binary multiplication in CLBs is often too large and slow [56]. An alternative is to use embedded memory as a lookup table. This solution is suitable for small word length constant multiplication. However, the memory requirement is restrictive for long word length and non-constant multiplication.

Embedded multiplier blocks are one solution to these limitations. A Xilinx Virtex-II Pro device contains up to 444 18×18 bit signed multiplier blocks [145]. The multipliers are distributed in columns, placed next to the Block Select RAMs. This placement is important to maximise performance for, for example, filtering operations.

Altera took a different approach to Xilinx, and included embedded DSP blocks in their Stratix devices [3]. An embedded DSP block from Altera includes multipliers with increased flexibility over Xilinx multipliers: each can alternately be implemented as a 9×9, 18×18 or 36×36 multiplier. An additional significant aspect of a DSP block is dedicated addition logic. In a single DSP block a maximum of four 18×18 multipliers and a four-input adder tree can be implemented [3]. The hard core adder tree is beneficial because ‘soft’ implemented adder trees can be a significant bottleneck for DSP applications. The Xilinx Virtex-4 and Virtex-5 devices also contain embedded DSP blocks [144]. Xilinx DSP blocks contain only one multiplier and one accumulator; however, these can be chained together to implement an adder tree, as presented in Section 3.4.2. A minimal sketch of such a multiply-add tree follows at the end of this subsection. Whilst more flexible multiplier arrangements are suggested in the literature [56, 58], the arrangements from Altera and Xilinx as described above have prevailed.

Of coarser granularity, the Xilinx Virtex-II Pro [145] and the Virtex-4 FX [144] family devices include up to four PowerPC 405-D5 RISC cores. These devices are an example of a reconfigurable computing system as proposed by Estrin [41]. To utilise such a reconfigurable computing system, for example the Virtex-II Pro or Virtex-4 plus PowerPC core, a new field of hardware-software co-design emerged. The most significant work is presented by Wiangtong [138]. The comparison of graphics processor and reconfigurable logic performance in Chapter 4, and the proposed cooperative architectures in Chapter 6, are steps towards a new hardware-software co-design process for a graphics processor and reconfigurable logic.
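The following C++ sketch illustrates the four-input multiply-add structure described above, as one Altera DSP block (or a chain of Xilinx multiply-accumulate blocks) would realise it. Operand widths and names are illustrative assumptions, not vendor specifications.

```cpp
#include <array>
#include <cstdint>

// Four 18x18-bit signed multiplications reduced by a two-level adder
// tree: the structure a single Altera DSP block can implement, and
// which Xilinx DSP blocks form by chaining multiplier-plus-accumulator
// slices through their dedicated cascade paths (see Section 3.4.2).
int64_t multiply_add_tree(const std::array<int32_t, 4>& coeff,
                          const std::array<int32_t, 4>& sample) {
    // Each product of two 18-bit operands fits within 36 bits.
    const int64_t p0 = int64_t(coeff[0]) * sample[0];
    const int64_t p1 = int64_t(coeff[1]) * sample[1];
    const int64_t p2 = int64_t(coeff[2]) * sample[2];
    const int64_t p3 = int64_t(coeff[3]) * sample[3];
    // Two-level reduction: the 'hard' adder tree of the DSP block.
    return (p0 + p1) + (p2 + p3);
}
```

In a 2D convolution filter, for example, `coeff` would hold four filter taps and `sample` four window pixels, with several such trees summed to form one output.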

2.2.2 Graphics Processors

Over two decades graphics processors have developed from a simple personal computer display technology into a formidable general purpose computing resource. The first use of graphics hardware for a general-purpose application was on the Ikonas machine in 1978 [40], a well-known application of which is the Genesis planet sequence from the ‘Star Trek: The Wrath of Khan’ motion picture. This highlighted, early on, the potential of the computation power of graphics hardware. Today, the huge processing power of a graphics processor is recognised in applications in domains including super-computing [73].

nVidia Corporation is the market leader in high-end graphics processor technology, closely followed by AMD Inc. (who acquired ATI in 2006). The flagship commodity technologies are the nVidia GeForce 8800 Ultra and AMD ATI HD 3800 devices. Intel Corporation also produces graphics processors, with models including the 915 and 945 chipsets. These devices have a significantly lower performance than commodity nVidia and AMD-ATI devices found on dedicated accelerator cards. Predictably, industry grade devices, for example the nVidia Quadro, have a higher performance than commodity devices. The concern of this thesis is high-end commodity graphics processors from nVidia, which have a superior price-performance ratio over industry grade devices.

A standard setup for a high-performance graphics processor in a modern PC is as part of an accelerator video card. The graphics processor is connected as a slave device to a GPP through, for example, a PCI Express bus. Although PCI Express 16× (the current superior standard) offers a peak bandwidth of 8 GB/s [66], the data transfer time between GPP and graphics processor is an application bottleneck [33]; an illustrative estimate of its cost is given below. This currently limits the use of commodity graphics cards as general purpose accelerators.

Alternative setups for a graphics processor overcome the PCI Express bottleneck. In the Sony PlayStation 3 video games console, the nVidia-Sony RSX graphics processor connects ‘on-board’ to a host Cell/B.E. processor. The AMD Fusion project [10] targets a lower ‘die level’ integration of the graphics processor and CPU. Due to the trend towards tighter device integration, the results in Chapter 4 disregard off-board data transfer time.

The architecture of the graphics processor is now described. Intentionally, the focus is on the GeForce 6 and 7 generations of nVidia devices. These are the target technologies for the case study implementations in Section 3.4.1, and the performance results in Chapter 4. Where relevant, comment is made on GeForce 8 series graphics processors and AMD-ATI Radeon devices. The description covers the architectural features necessary for understanding this work; it is not an exhaustive review of graphics processor technology. Further detail is available on GeForce 6 and 7 series devices in [108], on GeForce 8 devices in [98], and on modern AMD-ATI devices in [8].
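To gauge the severity of the transfer bottleneck, consider an illustrative estimate (the frame format is an assumption, not taken from the cited sources). A 1080p RGBA frame occupies

\[ 1920 \times 1080 \times 4\ \text{B} \approx 8.3\ \text{MB}, \]

so even at the 8 GB/s peak a round trip to the card and back costs roughly 2 ms. Against the 40 ms budget of a 25 frames/s video stream this is already some 5% of the frame time before any computation, and achieved bus bandwidth typically falls well short of the peak.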

Figure 2.1: A High-Level Representation of a Graphics Rendering Pipeline [108] (pipeline blocks, left to right: host, vertex buffer, vertex processing, rasterizer, fragment processing, frame buffer(s); textures are read during fragment processing)

The Graphics Rendering Pipeline

An overview of the graphics rendering pipeline is shown in Figure 2.1. The host processor, amongst other initialisation tasks, submits a list of vertices and arcs to the vertex buffer at the start of the pipeline. Processing tasks progress left to right in Figure 2.1.

The first processing stage in Figure 2.1 is vertex processing, in which each vertex is arbitrarily transformed. After vertex processing, a rasterizer stage interpolates vertex and arc information to produce output fragments. Each fragment represents a 'potential' pixel in the target frame buffer. The qualifier 'potential' indicates that the fragment's ultimate output, to a frame buffer, is subject to further culling operations. Fragment processing is the second processing stage in Figure 2.1; this stage transforms each fragment. In graphics rendering, this stage may apply texture to a scene in a video game. The processed fragments are written to frame buffers before display update. Each block which resides in off-chip memory (accelerator card DDR RAM) is indicated with shading: these blocks are exclusively the vertex buffer, textures and frame buffer(s).

For 2D video image processing algorithms, the process in relation to Figure 2.1 is as follows. The vertices of the frame (corner coordinates) are input from a host. Trivial (scale, translate, rotate) or no vertex transformation is then performed. The outputs from the rasterizer are target output frame pixel locations. Fragment processing performs the core video processing functionality, with input frame data read from 'textures.' A frame buffer (in practice, an off-screen render target) is used to store output frame data.

The story of the development of the graphics processor architecture can be summarised as the movement of rendering tasks, taken from right to left of the pipeline in Figure 2.1, from a GPP to the graphics processor. One of the most significant steps was to enable user programmability of the fragment and vertex processing stages on the graphics processor. The code used to program these stages is referred to as a shader. All fragment, and similarly all vertex, processing pipelines execute the same shader program.
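To make the video processing mapping concrete, the following minimal C++ sketch (an analogue, not graphics API code) mimics what the rasterizer and fragment processing stages compute in one full-frame pass: a 'shader', here an arbitrary 3 × 3 box blur standing in for any video processing kernel, is invoked once per output pixel location and gathers input frame data as a 'texture fetch'. All names and the single-channel 8-bit format are illustrative assumptions.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

struct Frame {                            // an illustrative stand-in for a texture
    int width, height;
    std::vector<uint8_t> pixels;          // 8-bit single channel for simplicity
    uint8_t read(int x, int y) const {    // clamped 'texture fetch'
        x = x < 0 ? 0 : (x >= width  ? width  - 1 : x);
        y = y < 0 ? 0 : (y >= height ? height - 1 : y);
        return pixels[static_cast<size_t>(y) * width + x];
    }
};

// One rendering pass: the rasterizer enumerates every output pixel location,
// and the same 'shader' (here a 3x3 box blur) runs for each fragment.
Frame render_pass(const Frame& in) {
    Frame out{in.width, in.height, std::vector<uint8_t>(in.pixels.size())};
    for (int y = 0; y < in.height; ++y)
        for (int x = 0; x < in.width; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += in.read(x + dx, y + dy);   // gather from the 'texture'
            out.pixels[static_cast<size_t>(y) * in.width + x] =
                static_cast<uint8_t>(sum / 9);
        }
    return out;                            // the off-screen render target
}

int main() {
    Frame in{4, 4, std::vector<uint8_t>(16, 128)};   // a constant grey frame
    Frame out = render_pass(in);
    std::cout << "out(0,0) = " << +out.read(0, 0) << '\n';  // 128: blur of a constant
    return 0;
}
```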

Figure 2.2: A Diagram of the nVidia GeForce 6 Series Graphics Processor (taken from [108] and appended with detail from [77]). Blocks include the host (CPU), vertex processing, cull/clip/setup, Z-cull and rasterization stages; quads of fragment processing elements with per-quad texture filter, request queue FIFOs and L1 texture cache; a shared on-chip cache space and data bus; the fragment crossbar; Z-compare and blend; and four memory partitions to off-chip DRAMs.

The Graphics Processor Hardware

The architecture which implements the pipeline in Figure 2.1, for the GeForce 6 series graphics processor, is shown in Figure 2.2. The GeForce 7 series graphics processor has a similar architecture, with the addition of extra fragment and vertex processing pipelines. Figure 2.2 includes a number of additional stages, omitted from Figure 2.1, which include pre- and post-fragment processing culling operations. These operations cull, from a scene, objects (or fractions of objects) which are off-screen, occluded by other objects, or otherwise outside a predefined viewable region. For the video processing algorithm implementations presented in Chapter 3, these stages can be disregarded.

A complete graphics rendering flow through the architecture in Figure 2.2, from a set of input vertices to the update of a frame buffer, is termed a rendering pass. For video processing applications this is the processing of a single output frame, or for multi-pass algorithms (discussed below) an array of intermediate computation results. Iteration level parallelism is exploited in both fragment and vertex processing pipelines. The fragment processing pipelines are explained in further detail below.

Fragment pipelines are organised in groups of four (known as quads). The number of quads varies with the graphics processor model; for example, the GeForce 7900 GTX and GeForce 6800 GT have six and four quads respectively. A graphics processor is scalable to multiple quads because of two constraints: no result sharing between fragment processors, and a feed-through pipeline. Without these constraints scalability would be restricted by the synchronisation required to satisfy data dependencies between fragment pipelines. To minimise the cost of off-chip memory accesses, and the effect of latency in the pipeline, a large number of threads (estimated to be 1300–1500 in Section 5.2.1) is executed over all fragment pipelines in single program multiple data (SPMD) fashion.

The graphics processor is a feed-through pipeline architecture. In this setup an off-chip memory location can be either read from or written to, but not both, during a rendering pass. To share intermediately processed data between neighbouring outputs, a multi-pass implementation is required [103]. Three multi-pass algorithms are described in Section 3.4.1.

In an accelerator card setup for a commodity PC, on-board RAM is tightly coupled to system memory. Processing commands and textures are transferred to on-board RAM through direct memory access (DMA), by a GPP. In practice this can occur at any point during graphics processor computation. To minimise the effect on the performance figures in Chapter 4, a single 'dummy' rendering pass is performed before timing is commenced.

The graphics processor memory system components are: a shared on-chip L2 cache; 'uniform' constant registers, which are global to all computation; intermediate buffering stages throughout the pipeline [38]; a small texture cache local to a 'texture filter' block for each quad; and local registers for each fragment pipeline. A 'texture filter' block handles the texture fetches and resizing for each thread. From an analysis of United States patent applications it is predicted that memory requests are made through a register-based queueing method [77]. This is also shown in Figure 2.2. Each fragment pipeline queues texture requests in a buffer. The 'texture filter' block then processes requests from each buffer in a round-robin polling scheme. The 'texture filter' block checks a local L1 cache for texture data before a request is made to the global L2 cache. Requests are pipelined in the 'texture filter' to minimise memory access overhead. This setup is approximated in the system model in Chapter 5.

An additional important feature of the graphics processor is that of texture pre-fetching [62]. In this process, texture addresses are distributed from the rasterizer block to a pre-fetch block (as well as to the fragment pipelines). Each address is used to predict the next texture locations, which are then requested from off-chip memory.

For the nVidia GeForce 8 series graphics processor, enhancements are made to the above architecture. The most significant difference is that the vertex and fragment processing pipelines are combined in a scalar unified processor [98]. The unified processor performs an additional third function, called geometry shading. This is applied after vertex processing, and before the rasterizer, and enables the processing of object 'arcs' as well as vertices. Through the compute unified device architecture (CUDA) programming model [98], the programmer has greater control of thread batches on the GeForce 8 architecture. An equivalent programming model for AMD-ATI devices is Close To Metal (CTM) [8], where similar unified shader processing is performed with 'stream processors'. Advantages and differences of the nVidia GeForce 8 series and AMD-ATI Radeon devices are explained further in Section 2.2.3.
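As a concrete illustration of the register-based queueing scheme described above, the following C++ sketch models per-pipeline request FIFOs drained by round-robin polling, with an L1 lookup before forwarding a miss to the shared L2. It is a behavioural sketch only: the hit test, queue contents and addresses are invented for illustration, not details of the nVidia design.

```cpp
#include <array>
#include <cstdint>
#include <iostream>
#include <queue>

// Four fragment pipelines per quad queue texture requests; the 'texture
// filter' block polls the queues round-robin and checks a local L1 cache
// before forwarding a miss to the shared L2.
struct TextureFilter {
    static constexpr int kPipes = 4;
    std::array<std::queue<uint32_t>, kPipes> requests; // per-pipeline FIFOs
    int next = 0;                                      // round-robin pointer

    bool l1_lookup(uint32_t addr) { return (addr & 0x7) != 0; } // toy hit test

    void service_one() {
        for (int i = 0; i < kPipes; ++i) {             // find next non-empty queue
            int p = (next + i) % kPipes;
            if (requests[p].empty()) continue;
            uint32_t addr = requests[p].front();
            requests[p].pop();
            if (l1_lookup(addr))
                std::cout << "pipe " << p << ": L1 hit  @" << addr << '\n';
            else
                std::cout << "pipe " << p << ": L1 miss @" << addr
                          << " -> forward to shared L2\n";
            next = (p + 1) % kPipes;                   // advance round-robin pointer
            return;
        }
    }
};

int main() {
    TextureFilter tf;
    tf.requests[0].push(16); tf.requests[2].push(21); tf.requests[2].push(24);
    for (int i = 0; i < 3; ++i) tf.service_one();
    return 0;
}
```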

2.2.3 Homogeneous Multi-Processing Element Architectures

To bound the problem domain into which the design space exploration and system model in Chapter 5 falls, a class of homogeneous multi-processing element (HoMPE) architectures is identified, as shown in Figure 2.3. The term processing element is used so as not to restrict the choice to only a processor for performing computation. In general, the processing element may be a reconfigurable datapath, a processor or a fixed function ASIC. Similarly the choice of interconnect, control and memory system is arbitrary. The exploration in Chapter 5 focuses on GeForce 6 and 7 series graphics processors; however, the processes and model can be applied to other HoMPE architectures.

Under different architectural constraints, the HoMPE class is inclusive of nVidia GeForce 6, 7 and 8 series graphics processors [100], AMD ATI Radeon graphics processors [9], the Cell/B.E. processor [107] and the poly execution unit of the Clearspeed CSX processor [26]. The conformity of each architecture to the HoMPE class is explained below.

The processing elements in each architecture have a parallel processing computation model. In the terminology of Flynn's taxonomy [48], the programming model for each architecture is termed single instruction multiple data (SIMD). Alternate terminology for the graphics processor programming model is a single program multiple data (SPMD) model. In all cases, architectural flexibility is limited to facilitate reduced programming complexity, and to minimise program code overheads (program counters, cache etc.). For the nVidia GeForce 6 and 7 series graphics processors, a processing group can be considered as a quad of four fragment pipelines (each of which is a processing element).

A HoMPE architecture has both local and global memory. The Cell/B.E. has local memory, called a 'local store', which has greater flexibility than a cache. For global memory, the Cell/B.E. has a shared 512 KByte on-chip memory space [107]. On-chip memory on the Clearspeed CSX processor is SRAM which is global to all execution units [26]. On a graphics processor, local memory is the small L1 cache within a 'texture filter'. The global memory for a graphics processor is the shared L2 cache. In the GeForce 8 series architecture the on-chip global address space is user controllable. This enables read and write, in different time slices, to the same on-chip memory location [98], which supports multi-pass operation without intermediate writes to off-chip memory.

Figure 2.3: A Block Diagram of the Homogeneous Multi-Processing Element (HoMPE) Class of Architectures (processing groups of PEs, each with local memory, connected through a global interconnect to global memories and a global controller, with an off-chip 'host' processor and off-chip DRAM)
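As an illustrative summary of the class in Figure 2.3, the following C++ sketch parameterises a HoMPE instance. The field choices mirror the figure; the GeForce-like local memory size is an invented placeholder, while the quad organisation, the 16 KByte cache estimate (see Section 2.5.2) and the Cell/B.E. memory sizes follow figures quoted in this chapter.

```cpp
#include <iostream>
#include <string>

// An illustrative parameterisation of the HoMPE class of Figure 2.3:
// processing groups of PEs with local memory, plus shared global memory.
struct HoMPE {
    std::string name;
    int groups;            // processing groups (e.g. quads)
    int pes_per_group;     // processing elements per group
    int local_mem_kb;      // per-group local memory (e.g. L1 cache, local store)
    int global_mem_kb;     // shared on-chip global memory (e.g. L2 cache)
};

int main() {
    // GeForce 6800 GT-like instance: four quads of four fragment pipelines,
    // a small per-quad L1 (size assumed here), and a ~16 KByte shared cache.
    HoMPE gf6{"GeForce 6-like", 4, 4, 2, 16};
    // Cell/B.E.-like instance: eight SPEs, each with a 256 KByte local store,
    // and a shared 512 KByte on-chip memory space [107].
    HoMPE cell{"Cell/B.E.-like", 8, 1, 256, 512};

    for (const HoMPE& a : {gf6, cell})
        std::cout << a.name << ": " << a.groups * a.pes_per_group
                  << " PEs, " << a.local_mem_kb << " KB local, "
                  << a.global_mem_kb << " KB global\n";
    return 0;
}
```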

The off-chip memory choice is dependent on the target system; however, each architecture boasts impressive off-chip bandwidth. The GeForce 8 series is superior, with up to 103.7 GB/s of off-chip memory bandwidth [100]. High-speed, high-latency DDR RAM is a common choice for off-chip memory, for example a GDDR3 device [111].

Consider the differing complexity of the global controller for each architecture. For a graphics processor, the controller includes pipeline features, up to and including the rasterizer, which feed the fragment pipelines. In a Cell/B.E. processor, the controller is a 64-bit Power Processor Element (PPE) [107]. Poly execution units in the Clearspeed CSX processor are controlled by a dedicated 'poly controller' [26].

The interconnect of each architecture is also somewhat different. Synergistic processing elements (PEs) in a Cell/B.E. are interconnected using a specially designed bus, namely the element interconnect bus (EIB), with up to 96 GBytes per second of interconnect bandwidth [107]. For the Clearspeed CSX, processing elements are connected to global memory through the ClearConnect network-on-chip [26]. Interconnect details for graphics processors are not provided by manufacturers. For the model in Chapter 5, the interconnect is modelled as a 'round-robin' arbitration scheme, as alluded to in [77].

Each architecture is constrained, or can be bounded, to implement a homogeneous stream processing programming model across all processing elements. This scenario is assumed for the architectures in this work. For a Cell/B.E., an extension to this programming model is to pipeline computation, heterogeneously, across SPE elements. This is advantageous for algorithms whose data dependence properties restrict large degrees of iteration level parallelism, but which are amenable to pipelining. Component colours in Figure 2.3 reflect the design space taxonomy in Figure 5.3(b).

2.2.4 Other Relevant Accelerator Devices

The devices described in Sections 2.2.1, 2.2.2 and 2.2.3 are just a subset of the available accelerator technologies. A number of other relevant technologies are also available for application as a video processing accelerator. Interesting examples are described below.

A digital signal processor (DSP) is the most significant alternative accelerator technology to an FPGA for video processing applications. Texas Instruments (TI) currently dominate the market for DSPs. Their current state of the art device for video processing is the DaVinci (TMS320DM6467) Digital Media System-on-Chip (DMSoC) [126]. One advantage of a DSP over a GPP is the switch from the von Neumann to the Harvard architecture. In addition, the DSP is a very long instruction word (VLIW) architecture capable of issuing multiple instructions to many functional units per cycle. The TI DMSoC DSP core has eight functional units, capable collectively of a total of 32 8-bit operations per clock cycle [126]. At a clock rate approaching 600 MHz, this equals 19 giga-operations per second (32 operations per cycle × 0.6 GHz). In comparison, an nVidia GeForce 8800 GTX graphics processor performs an order of magnitude higher, with a peak of 346 GFLOPS [117]. Its precision in this case (32-bit floating point) is also superior to the DSP's (8-bit). In addition to the DSP core, the TI DMSoC has additional resources including an ARM processor and two high definition video image co-processors (HDVICPs). The HDVICPs are used to perform high definition CODEC operations. Despite the large peak performance difference to the graphics processor, a DSP is inherently more flexible. Note that a DSP has a shorter pipeline depth and greater memory access control. This is beneficial for algorithms with a high number of data dependencies.

An application-specific integrated circuit (ASIC) solution is the optimal architecture in speed, power and size [80]. However, an ASIC has a high monetary cost (for low to medium unit production), is inflexible to post-manufacture bug fixing, and has a long lead time. These limitations are in part responsible for the widespread adoption of FPGA and multi-processor architectures as more flexible accelerator architectures.

The Tilera TILE64 processor is an interesting and novel network-on-chip architecture for, amongst others, multimedia applications [132]. A TILE64 processor has an 8 × 8 array of identical processing cores. Its most striking feature is up to 27 Tbps of on-chip mesh interconnect: up to two orders of magnitude higher than the interconnect bandwidth for the Cell/B.E. quoted above. In addition, the TILE64 has a 192 GFLOPS peak performance, of the same order as that of the GeForce 8800 GTX. An impressive benchmark result for the TILE64 is the performance of two simultaneous H.264 encodes of 720p video frames at 30 frames per second. Further benchmarks are not available at this time.

2.3 The Comparison Challenge

To choose between alternative accelerators requires the consideration of multifarious comparison metrics. Common metrics include throughput (speed), power and area. For a target application, the metrics are often placed in a qualitative order of 'importance'. Consider the alternative application domains of an embedded processor in a mobile telephony device and a mainframe super computer. For the embedded processor, one may consider the power and area requirements as the most 'important' metrics. Conversely, for a super computer, the peak throughput performance is often the most 'important' metric. For two video processing applications, the 'importance' of metrics may similarly differ.

The comparison in Chapter 4 focuses on the quantifiable factor of throughput performance, which is a universal metric for the comparison of architectures. Alternative factors which influence the selection of an accelerator are summarised in this section. In Section 2.3.6, the setup for the comparison in Chapter 4 is explained.

2.3.1 Die Area

The trends of system miniaturisation, and fabrication yield issues, amplify the requirement for dense functionality. Kuon and Rose show that the die area requirement of a LUT-based reconfigurable logic implementation, with respect to an equivalent ASIC implementation, is of the order of 35 times higher [75]. This ratio can be reduced through the addition of coarse-grained macro-blocks; however, a significant gap remains. Graphics processor die utilisation may also be considered to be inefficient with respect to an ASIC. For example, in the nVidia GeForce 7800 GTX (G70), hardware which is directly related to fragment and vertex shader functionality consumes only half of the die area [54].

Estimates of graphics processor and FPGA die areas are reasoned below. These estimates are later used in Sections 2.3.5 and 4.8.1. In the ITRS System Drivers report the die area of 'state of the art gaming processors' is stated to be 220mm², and this is predicted to remain constant up to the year 2020 [68]. For reconfigurable logic, Xilinx FPGA devices are considered. Following assumptions made by Campregher [22], the largest FPGA device is taken to have a die size of 1450mm²; in addition, within a device family (for example the Virtex-4 LX family), die size is assumed to scale approximately with the number of CLBs. As an example, the Xilinx Virtex-4 XC4VLX25 device is considered. The largest device in the same family is the XC4VLX200 model, which has 22272 CLBs [144]. An XC4VLX25 device has 2688 CLBs, which is approximately 12% of the number of CLBs in the XC4VLX200. The area estimate for the XC4VLX25 device is therefore 175mm².
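For clarity, the XC4VLX25 estimate follows directly from the scaling assumption above:

\[
\text{Area}_{\text{XC4VLX25}} \approx \frac{2688\ \text{CLBs}}{22272\ \text{CLBs}} \times 1450\,\text{mm}^2 \approx 0.12 \times 1450\,\text{mm}^2 \approx 175\,\text{mm}^2 .
\]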

Figure 2.4: Tool Flows for a Graphics Processor and an FPGA for an Increasing Level of Abstraction [4, 14, 24, 33, 43, 60, 63, 85, 87, 89, 101, 122, 130, 146, 147, 150] (graphics processor flows range from machine code, through Cg, GLSL and HLSL with the OpenGL and DirectX APIs and the GLSL Program library, to GPUCV, Brook, Accelerator, ASC and RapidMind, which also targets the Cell B.E. and multi-core CPUs; reconfigurable logic flows range from the bitstream, through VHDL, Verilog and SystemVerilog, EDK and Core Generator, to HandelC (Celoxica), SystemC, ImpulseC, AutoPilot, System Generator/Simulink and ASC)

The above assumptions are considered reasonable for providing die size estimates within an acceptable margin of error. The graphics processor has a die area which is approximately 15% of the size of the largest FPGA device. For comparison, a single core Pentium 4 has a die area of 112mm² [36] and the Cell/B.E. a die area of 217mm² (estimates from [46,107]). Manufacturers of reconfigurable logic devices can produce chips with a larger die area than a processor due to the repeatable nature of the architecture.

2.3.2 Development Tools: Design Time

Design time refers to the 'effort' required to generate an optimal bitstream (for reconfigurable logic) or machine code instructions (for the graphics processor). To quantify design time accurately is challenging, because it is sensitive to a designer's relative skills in hardware or software tool flows. However, improvements in programming tools, which target the reduction of design time, can be analysed. A representative set of the available tool flows, for each target platform, is summarised in Figure 2.4.

For reconfigurable logic implementations, the VHSIC Hardware Description Language (VHDL) and Verilog are used to specify architectural layout. The VHDL and Verilog languages provide a level of abstraction from the choice of mapping, placement and routing [4,150]. For the graphics processor, C for graphics (Cg) or the OpenGL Shading Language (GLSL), used in conjunction with the OpenGL or DirectX APIs, is the most common programming setup [85, 89, 122]. In both Cg and GLSL, a designer specifies an algorithm's kernel function on object vertices and fragments. API commands are used to control data movement, memory allocation and graphics pipeline setup [89, 122]. As design size increases, the level of detail in each tool flow presents a daunting design effort. The options for enhanced levels of abstraction, for each device, are explained below.

The GLSL Program class library provides a level of graphics processor abstraction from API function calls [33]. Alternative programming tools automatically generate Cg (or GLSL) kernel code [43, 60, 87]. This provides a further level of abstraction as follows:

GPUCV generates code for computer vision applications from pre-coded algorithm constructs [43]; Accelerator is a data parallel programming model for the compilation of graphics processor code [130]; and for Brook the graphics processor programming model is adapted to a stream and kernel approach [21]. The choice between these tools is a trade-off between abstraction from, and control over, the graphics pipeline.

For reconfigurable logic, the trade-off between programming tools is similarly between the level of abstraction from and control over the architecture. Vendor-provided tools, for example EDK and Core Generator from Xilinx [146], provide an initial level of abstraction with standardised design modules. High level C-like description languages, the most frequently recognised being HandelC [24], provide a programming model abstraction. These approaches reduce implementation complexity; however, a designer must still consider low level architectural features to achieve high performance. System level approaches, for example SystemC [101] and SystemVerilog [102], provide a further level of abstraction, for reconfigurable logic, through a programming model based on system level behaviour. ImpulseC [63] and AutoPilot [14] represent tools with a further level of abstraction; these compile from a high level C or C++ description to VHDL code. Xilinx System Generator is an alternative high-level tool, which supports automated code generation from the MATLAB (Mathworks) and Simulink environments [147].

RapidMind [87] and 'A Stream Compiler' (ASC) [60] are tool flows which use high level descriptions, with sufficient abstraction to simultaneously target different devices. RapidMind code can be targeted to the graphics processor, Cell/B.E. and multi-core CPUs [87]. ASC can be targeted at graphics processors, reconfigurable logic and the Sony PlayStation 2 video games console [60]. The quality of each compiler is as yet unproven.

In addition to implementation cost, the debugging process for each device can be costly. For reconfigurable logic, timing issues supplement a long multi-level design verification and debugging process. The debug environment for a graphics processor is also challenging, due to a lack of transparency in the architecture [81]. With increased design size, these factors can also have a significant impact on the design time. For state of the art graphics processors, the CUDA [98] and CTM [8] tool flows are available; these provide greater architecture transparency than was previously available.

2.3.3 Power Minimisation Techniques

The current trend in CMOS technology power consumption growth, from the semiconductor road map, shows an ominous prospect for future system architects [68]. A result is that power is today one of the most important metrics for a system designer. A prior comparison, based on accelerator card wattage ratings, shows a marginal 1.4 times benefit, in performance per Watt, for the FPGA over a graphics processor [15].

Detailed device information from graphics processor vendors is sparse. This makes the collection of more accurate power consumption estimates, based on device power consumption, challenging. A number of model-based works are as follows.

Sheaffer presents a model-based thermal management technique for a graphics processor [119], which uses the QSilver low level graphics processor model [116]. Whilst dynamic voltage scaling had the lowest impact on performance of five tested schemes, the use of multiple clock domains was shown to provide lower temperatures. For nVidia GeForce 7 series graphics processors, the core and memory clock rates are varied subject to computation load, which minimises power and heat dissipation. Equivalent power minimisation techniques are possible for Xilinx Virtex-5 FPGAs, through System Monitor (for analysis of ambient conditions) and digital clock managers [149].

In later work, the QSilver simulator is used with PowerTimer [19] to estimate power, and to explore redundancy schemes for graphics processors [117]. An estimated 10% reduction in the power consumption of on-chip memory was achieved. Xilinx provide a power model for their FPGAs, namely the XPower Estimator (XPE), which can be used to optimise power consumption at an early stage of the design flow. For an FPGA implementation, additional steps can minimise power consumption, for example through the astute choice of arithmetic operator bit-widths. In work by Gaffar, up to a 20% power improvement is observed over a uniform bit-width solution [49].

2.3.4 Numerical Precision

The computation number format for the nVidia GeForce 6 and 7 series graphics processors is single (fp32) or half (fp16) precision floating point [45]. A performance improvement can be achieved if the fp16 numerical format is used, since this reduces internal storage and datapath requirements [108]. Graphics processors also support numerous data storage formats [108]. These include 32-bit (floating point) and 8-bit 1-, 3- or 4-vector representations. A careful choice of number format is necessary for optimal use of memory bandwidth and cache.

For reconfigurable logic, an arbitrary numerical format can be used for computation and data storage. The precision requirement for video processing is often such that the result must have a high aesthetic quality. Due to this, a fixed point number representation for computation (variable bit-width) and storage (8-bit) is frequently adopted.

Substantial work on bit-width optimisations for FPGA implementations was performed by Constantinides [29]. An analytical approach to bit-width optimisation is shown to provide area and speed improvements of up to 45% over optimal uniform bit-width representations [29]. The precision of embedded coarse-grained blocks is an additional factor to consider for modern FPGA devices [18].

If one optimises number representation based on 'aesthetic' image qualities, a large spread of test data is required to ensure stability. This setup is adopted in Chapter 3. The flexibility of reconfigurable logic to implement arbitrary numerical precision provides another optimisation dimension, over the fixed numerical representation of the graphics processor, in which speed, area and power metrics can be improved.
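To illustrate the fixed-point approach described above, the following C++ sketch applies a three-tap filter with 8-bit pixel storage and an assumed Q1.7 coefficient format. The coefficients, bit-widths and rounding scheme are illustrative choices only, not those of the implementations in Chapter 3.

```cpp
#include <cstdint>
#include <iostream>

// Fixed-point video filter datapath sketch: pixels are stored as 8-bit
// values; coefficients use an assumed Q1.7 format (7 fractional bits),
// so internal products require wider bit-widths than the storage format.
int main() {
    const uint8_t pixel[3] = {52, 120, 200};     // 8-bit storage format
    const int8_t  coeff[3] = {32, 64, 32};       // 0.25, 0.5, 0.25 in Q1.7

    int32_t acc = 0;                             // widened accumulator
    for (int i = 0; i < 3; ++i)
        acc += static_cast<int32_t>(pixel[i]) * coeff[i];  // 8x8-bit products

    // Rescale: shift right by 7 to remove the Q1.7 fractional bits, with
    // rounding; the result fits back into the 8-bit storage format.
    uint8_t result = static_cast<uint8_t>((acc + (1 << 6)) >> 7);
    std::cout << "filtered pixel = " << +result << '\n';  // (52+2*120+200)/4 = 123
    return 0;
}
```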

Implementing Floating Point Computation on FPGAs

Whilst a fixed point representation is often sufficient, scenarios arise where one may require a floating point FPGA implementation, for example in medical imaging. Improvements in device density, and the addition of coarse-grained multipliers, have overcome many initial area and speed barriers to implementing floating point calculations on FPGAs. Over the six year period from 1997 to 2003, FPGAs improved in floating point performance at a rate of two to five times per year [134]. This exceeded Moore's Law of an improvement of 1.5 times per year for CPUs [83]. In 2005, Hemmert observed that FPGAs could outperform CPUs in FLOPs for addition and multiplication operations [59].

However, performing floating point calculations on FPGAs has a large area cost. In [59,134] the largest Virtex-II Pro FPGA devices are used. These are observed above to be approximately 13 times larger than a CPU. In Section 4.8.2, it is shown that for floating point computation on each device, an nVidia GeForce 7900 GTX graphics processor has at least a twelve times higher density-performance product than a Virtex-4 FPGA. In this case, the FPGA design over-maps the largest Virtex-4 part by at least two times.

2.3.5 Computational Density: A Comparison

Computational density is the maximum theoretical computational performance achievable from a target platform. DeHon explores the computational density of the FPGA and a GPP in [37], and shows empirically that the FPGA has an order of magnitude higher computational density than a GPP. The difference is due to the additional hardware required in a GPP to support the instruction overheads of temporal computation.

There are challenges involved in any device achieving the peak computational density. A hardware (e.g. an FPGA) implementation requires the careful use of pipelining, whilst a processor (e.g. a GPP) implementation must be efficiently scheduled.

DeHon's empirical analysis is now applied to modern FPGA devices and graphics processors. The analysis is based on the number of 'bit' operations per nanosecond for each device. For example, a single precision floating point add is a 32-bit operation. Graphics processors are advantageous over the GPPs considered by DeHon because a larger portion of the die area is devoted to computation than to cache storage [81].

This results in a fundamentally higher computational density. A further development since DeHon's work is that modern FPGAs contain hard core embedded modules. The fragment processing stage of the nVidia GeForce 6800 GT and 7900 GTX graphics processors, and the Xilinx Virtex-II Pro, Spartan-3 and Virtex-4 FPGAs, are analysed below.

At peak performance, a GeForce 6 fragment pipeline is capable of two 4-vector dot product instructions per clock cycle [108]. A dot product function comprises four multiply and three addition 32-bit floating point operations. The GeForce 7 architecture has an extra two 'mini ALU' units in each pipeline; it is therefore capable of a peak of four dot product operations per clock cycle. The peak clock rate, for a GeForce 7900 GTX and a GeForce 6800 GT, is 650MHz and 430MHz respectively.

DeHon makes a very conservative estimate that each CLB is capable of an illustrative one ALU bit operation, with two 4-input lookup tables per CLB [37]. In a Virtex-4 FPGA, each CLB contains four slices, and each slice has two 4-input LUTs. A similarly very conservative estimate of four ALU bit operations per modern CLB is made. The embedded multipliers in Virtex-II Pro devices, and within the XtremeDSP blocks in Virtex-4 devices, take 18-bit inputs, and therefore each represents 18-bit operations. In addition, the accumulator in an XtremeDSP slice is a 48-bit operator [151]. A further computational resource on the FPGA is to map logic into Block Select RAM (BRAM) memory [93]. For an 18 Kbit BRAM a 2-operand 7-bit function can be performed (log2(18K) = 14 address inputs). In the best case scenario, with no alternative use for block memory, all BRAMs could be utilised for this purpose. PowerPC blocks in the Virtex-II Pro are discounted in this analysis because their computational density is minimal in comparison to the remainder of the device. The peak clock rate for the Virtex-II Pro and Virtex-4 is 500MHz and 550MHz respectively.

Device                  | Operations per Time Unit | Time Unit (ns) | Operations per ns
GeForce 7 7900 GTX      | 21504                    | 1.54           | 13963
GeForce 6 6800 GT       | 7168                     | 2.86           | 2506
Virtex-4 XC4VLX200      | 81048                    | 1.82           | 44532
Virtex-II Pro XC2VP100  | 55196                    | 2.00           | 27598
Spartan-3 XC3S5000      | 35880                    | 3.07           | 11687
Virtex-4 XC4VLX25       | 14484                    | 1.82           | 7958
Virtex-II Pro XC2VP7    | 6028                     | 2.00           | 3014
Spartan-3 XC3S1000      | 8280                     | 3.07           | 2697

Table 2.1: Empirical Computational Density for Sample Graphics Processor (Rows 2–3) and Reconfigurable Logic (Rows 4–9) Devices
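As a worked example of how the Table 2.1 entries are derived, consider the GeForce 7900 GTX using only the figures quoted above (six quads of four fragment pipelines, four dot products per pipeline per cycle, seven 32-bit floating point operations per dot product, and a 650MHz clock):

\[
24 \times 4 \times 7 \times 32 = 21504\ \text{bit operations per cycle},
\]
\[
\frac{21504\ \text{bit operations}}{1.54\,\text{ns}} \approx 13963\ \text{bit operations per nanosecond},
\]

where 1.54ns is the cycle time of a 650MHz clock.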

A summary of the operations per nanosecond for each device is given in Table 2.1. It is observed that the largest Virtex FPGA devices, on rows 4–5 of Table 2.1, have an order of magnitude higher number of operations per nanosecond than the GeForce 6800 GT. This is equivalent to the difference observed by DeHon between an FPGA and a GPP [37]. However, the GeForce 7900 GTX has 3 times higher operations per time unit, and 5.6 times higher operations per nanosecond, than the GeForce 6800 GT. This shows the large step in computational density between graphics processor generations. A GeForce 7900 GTX is only a factor of 2 to 4 behind the largest Virtex FPGA devices. The largest 'budget' Spartan-3 device exceeds both graphics processors in operations per time unit; however, its inferior clock speed results in fewer operations per nanosecond than the GeForce 7.

'Relatively sized' FPGA devices (from the area analysis above) are included for area comparison. It is observed that whilst all three of these FPGA devices exceed the GeForce 6800 GT in operations per nanosecond, the GeForce 7900 GTX outperforms them by a factor of 2–5 times. Per unit area, the GeForce 7900 GTX has the highest computational density.

Whilst these results are interesting to show the theoretical computational capabilities of both reconfigurable logic and the graphics processor, conclusions must be drawn with care. The ability of each device to exploit the computational densities shown in Table 2.1, for a particular algorithm, must be assessed, as noted at the start of this section.

2.3.6 Comparison Setup

The astute reader will notice that unit price is omitted from the above metrics. This is because of its high volatility to market effects. Also, for broadcast video in particular, the unit price factor is often insignificant when compared to the man-hour design time cost. The comparison in Chapter 4 assumes the following for the above metrics.

1. Area: FPGA resource usage is analysed in Section 4.8.1 to consider the performance per unit area of each device, for each case study algorithm.

2. Design Time: The lowest level of abstraction tool flow is used for each device. Despite this, the design time varies significantly. For example, for video frame resizing the graphics processor implementation was straightforward, but the FPGA design required many iterations. This scenario was reversed for histogram equalisation.

3. Power: No power optimisations are performed for either device.

4. Number format: An 8-bit per RGB component storage format is used throughout, and FPGA implementations are optimised to a uniform bit-width. As an aside, a comparison based on floating point arithmetic is performed in Section 4.8.2.

2.4 Related Literature on Implementation Studies

The related literature on implementation studies is grouped as comparative studies in Section 2.4.1, and relevant graphics processor implementation studies in Section 2.4.2.

2.4.1 Comparative Studies

Quantitative comparisons of the FPGA, graphics processor and Cell/B.E. devices are presented in [15, 60, 92, 95, 152]. The performance results for each comparison are summarised in Table 2.2; speed-up is quoted relative to a GPP implementation. Alongside the related work, speed-up figures for the fast Fourier transform (FFT) [52, 129] and discrete wavelet transform (DWT) [13, 142] algorithms are presented. Performance results are taken from separate sources for each device. The sources are chosen to be implementations of the same algorithm from a similar time period, as indicated by the device choice in each scenario.

For the DWT implementations, the type of transformation used on each device is different. On the FPGA, a 5/3 Lifting 2D DWT [13] is chosen, and on the graphics processor a Daubechies 9/7 2D DWT [142]. The core difference between these schemes is the number of filter taps for low (l) and high pass (h) filtering, as indicated in the name (l/h). For the FPGA implementation, the area resource is consumed by the datapath and control logic. It is therefore reasonable to assume that an extension to a larger tap filter is unlikely to have a significant impact on the performance figures quoted here. Each of the implementations from Table 2.2 is discussed below.

The results for the discrete wavelet transform show the most significant difference between graphics processor and FPGA performance. A 30 times difference is observed. The performance figure for the graphics processor in Table 2.2 also includes video frame upload time. If only graphics processor 'execution time' is taken, as in the setup in Section 4.3, the FPGA advantage is still a significant 24 times. For this case study, the FPGA provides a superior performance over the graphics processor from the combined advantages of a specialised datapath implementation and an optimised memory reuse strategy.

Xue [152] compares the implementation of a Fluoro-CT reconstruction algorithm on an FPGA and a graphics processor. Performance and RMS error are shown to be approximately equal for 32-bit fixed point FPGA and single precision floating point graphics processor implementations. For half precision floating point, on the graphics processor, a two times performance improvement is observed for a 2% increase in RMS error. Later work on Computed Tomography by Klaus shows a 4–5 times speed-up for a GeForce 8800 GTX over a spectrum of FPGAs [95]. However, the most modern FPGA compared is the Virtex-II, which is two device generations behind the GeForce 8800 GTX.

Table 2.2: A Summary of Speed-up for Graphics Processor (GPU), Cell/B.E. and FPGA Implementations over General Purpose Processor (GPP) Implementations (Unless Otherwise Stated, Graphics Processors are nVidia Co. Devices and FPGAs are Xilinx Inc. Devices). For each case study the table lists the algorithm, source, FPGA numerical precision, the device chosen for each implementation, and the speed-up of the GPU, Cell/B.E. and FPGA relative to a GPP. Case studies are grouped as video image processing transforms (2D DWT [13,142] and FFT [52,129]), medical imaging (Fluoro-CT reconstruction [152] and computed tomography [95]), financial applications (Monte-Carlo, European and American pricing options [60,69,92]) and other applications (weighted sum [60] and matched filter computation [15]).

Results for a GeForce 7800 GTX are also presented by Xue; in this case the graphics processor performance drops by over two times. The GeForce 7800 GTX is one device generation newer than the Virtex-II. A like-for-like hardware comparison is absent. In the same work the Cell/B.E. is shown to have a two times performance benefit over an FPGA solution; however, it is two times inferior to the similarly aged GeForce 8800 GTX [152].

In work by Howes et al. [60], results from an automated compilation tool, targeted at each architecture concurrently, show the following. The graphics processor is superior for Monte-Carlo simulation, by up to an order of magnitude. For the FFT and weighted sum functions, Howes observed smaller speed-ups of 4 and 1.5 times respectively for a graphics processor over an FPGA. These performance figures are, however, highly sensitive to the quality of the compilation tool. Hand-optimised FPGA FFT implementations are observed below to match the performance of a graphics processor.

Baker [15] analyses performance based on system cost and power. In comparison to the FPGA, the graphics processor is observed to be superior on price per unit speed-up and inferior on speed-up per unit of energy¹. These results suggest that the graphics processor is beneficial for super computing and the FPGA for embedded systems. The Cell/B.E. instantiation used by Baker is an $8000 IBM Cell processor board [15]. A Cell/B.E. is advantageous to both the FPGA and graphics processor devices in speed-up per Watt, but inferior on a per unit cost basis. Out of the three devices the Cell/B.E. may be considered to be the superior embedded system device. A further interesting result, from [15], is that the Cell/B.E. performance exceeds both an FPGA device and a GeForce 7900 GTX device by two times. This is consistent with work by Klaus [95]. The FPGA used in each case is a comparably old Virtex-II.

¹ The choice of the maximum power rating in [15], for the energy consumption figures, is somewhat imprecise: each device may not operate at its maximum power consumption. These results should therefore be interpreted with care.

Morris [92] compares the graphics processor, Cell/B.E. and FPGA for financial applications. For a moderate FPGA clock speed, of one fifth the speed of a graphics processor, the FPGA solution provides a superior trade-off in performance and accuracy over graphics processor generations similar to the FPGA's, and over a Cell/B.E. As an example, performance is up to 4.5 times higher than a graphics processor for a 50% drop in precision. For equal precision the FPGA has marginally higher performance. A comparison of the Virtex-4 device with the Cell/B.E. shows the FPGA to be superior by up to five times. This is a more equal comparison setup than those mentioned above from [15, 95].

In work co-authored by the author [69], an American pricing option implementation is analysed. A graphics processor is observed in this setup to provide a superior performance of 1.27 times over an FPGA implementation for a 'like-for-like' numerical precision. If a fixed point number representation is used, the FPGA implementation is superior by over two times. Both the FPGA and graphics processor implementations provide two orders of magnitude improvement over an Intel Core 2 Duo processor.

Whilst a large speed-up is observed over a GPP for the majority of case study examples in Table 2.2, Howes [60] observed two examples, the FFT and weighted sum functions, where the GPP has superior performance. Although this difference may be conjectured to be in part due to inefficient mapping by the automated compilation tool, it is interesting to analyse the difference further. Moreland, who presented the first literature on the use of a graphics processor for video image processing [91], in the same work also targeted the FFT algorithm. Due to the restrictions of the graphics processor memory system, Moreland observed a four times higher performance for a 1.7GHz Intel Xeon processor over the then state of the art GeForce 5 series graphics processors. Later, Buck [21] observed a similarly low performance, relative to a GPP, for FFT computation on an nVidia GeForce 6800 GT. In 2006, Govindaraju presented the GPU-FFT library [52], with a performance four times higher for a GeForce 7900 GTX than for an Intel Opteron or Xeon. Although the improvement is in part due to advances in graphics processor technology, an analysis of graphics processor memory system behaviour, using the 3C's (compulsory, capacity and conflict misses) cache model, was shown to improve performance. This example demonstrates the large improvements in graphics processor hardware and development skills over the four year period from 2003 [91] to 2006 [52], which continue to date.

The results in [52] present only relative performance figures. For the results in Table 2.2, the author benchmarked the GPU-FFT library on a GeForce 7900 GTX. The run time for a one million sample FFT was 19ms. An equivalent FPGA implementation to the GPU-FFT implementation is taken from the Sundance floating point FFT core [129]. For an implementation on a Virtex-4 running at a clock speed of 200 MHz, a one million sample FFT is performed in 21ms. The graphics processor and FPGA performance is approximately equal.

In addition to the quantitative work, two substantial qualitative comparisons of the relative merits of the FPGA, graphics processor and Cell/B.E. can be found in [73, 95]. A number of works propose the cooperative use of a graphics processor and an FPGA device in a single system [73,84,90,152]. These architectures are reviewed in Section 6.1.2.

To the author's knowledge, the work presented in this thesis is the first detailed analysis of the performance drivers for the graphics processor and FPGA. The related work above provides motivating comment and analysis, but no systematic approach to the performance comparison. In addition, the comparison of the graphics processor and FPGA by the author in [32] pre-dates all of the related work in Table 2.2.

2.4.2 Video Processing on the Graphics Processor

Section 2.4.1 provides a comprehensive study of prior comparison work. In addition, there are a number of solely graphics processor implementation studies which are of relevance to this work [27, 44, 47, 52, 72, 103, 127, 130]. These are summarised below.

Colantoni [27] analyses five case studies based on four features: math, branching, memory and vector operations. This categorisation is adopted, in part, in Chapter 3.

A number of related motion vector estimation graphics processor implementations have been presented in [72, 127, 130]. Speed-ups of up to four times over a GPP are observed. These examples demonstrate the potential of the graphics processor for motion vector estimation. A 3-step non-full-search method is presented in Chapters 3 and 4, which exploits a novel optimisation scheme for data dependence, and is capable of processing a 720p video frame at over 30 frames per second. It is unfair to make comparison with older approaches, but this speed is sufficient for real time 'high definition' video processing.

A histogram computation implementation has been presented by Fluck [47]. For the same hardware, the author's implementation performs approximately four times faster than Fluck's method. The key differences of the work presented in Chapter 3 are: an optimisation which changes the order in which input frame data is read; and the direct generation of a cumulative histogram. Neither is possible nor achieved in [47].

Tarditi presents a benchmark set of algorithms which includes video image processing algorithms [130]. The most significant performance improvement over a GPP is a factor of ten times, for rotation and for convolution with a window size of 5 × 5 pixels.

Software-based graphics processor memory access optimisations balance compute power and cache bandwidth [44, 52]. Performance improvements of up to five times have been achieved through cache blocking [52] or through subdividing a matrix-matrix multiply problem [44] (a generic sketch of the cache blocking idea is shown at the end of this section). In both cases, multiple execution passes are required, which results in increased pipeline setup and computation overheads. In Chapter 6, a hardware-based method is proposed which provides an order of magnitude performance improvement.

A comprehensive survey of the use of graphics processors for general purpose application (GPGPU) is presented by Owens [103]. Memory access reduction operations, as analysed in Section 4.7, and the lack of memory scatter capability, are cited in [103] as weaknesses of the graphics processor. These concur with observations made by the author in Chapter 4. A further resource, for more information and literature on the general purpose use of graphics processors, is the GPGPU forum pages [30].

A graphics processor is also, as would be expected from its graphics rendering origins, favourable for 3D video effects [87]. Although not explored in this thesis, this is an interesting, almost conventional, application of a graphics processor. Indeed the rendering pipeline is designed for 3D visual effects.
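As referenced above, the following generic C++ sketch shows the cache blocking (loop tiling) principle that underlies the graphics processor optimisations in [44, 52]: computation is reorganised into tiles so that a small working set is reused while it remains cache-resident. The matrix-multiply setting, tile size and dimensions are illustrative only.

```cpp
#include <iostream>
#include <vector>

constexpr int N = 256;   // matrix dimension (illustrative)
constexpr int T = 32;    // tile ('cache block') dimension; divides N

// Blocked C += A * B: each T x T tile of B is reused across a whole tile of
// rows of A before moving on, keeping the working set cache-resident.
void blocked_matmul(const std::vector<float>& A,
                    const std::vector<float>& B,
                    std::vector<float>& C) {
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; ++i)
                    for (int k = kk; k < kk + T; ++k) {
                        float a = A[i * N + k];          // reused T times
                        for (int j = jj; j < jj + T; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}

int main() {
    std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f), C(N * N, 0.0f);
    blocked_matmul(A, B, C);
    std::cout << "C[0] = " << C[0] << '\n';   // 256 for all-ones inputs
    return 0;
}
```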

2.5 Design Space Exploration

The motivating and related work which can be associated with the design space exploration approach in Chapter 5 is presented in Section 2.5.1. In Section 2.5.2, a summary of related system models, both analytical and physical, is presented.

2.5.1 Motivating Literature

A popular representation of the design space exploration process is the Y-Chart. Kienhuis [71] defined the approach in 1999; it was named the 'Y-Chart' in 2001 by Lieverse [76]. The Y-Chart is an intuitive process whereby architecture and application models are combined in 'mapper' and 'simulator' stages to produce performance results. In turn, performance results motivate modifications to the application and architecture models. A number of example uses of the Y-Chart approach are found in [23, 71, 76].

Lieverse [76] and Kienhuis [71] both cite uses of the approach in the design of datapaths and processing elements for heterogeneous video processing systems. In his doctoral work, Erbaş uses the Y-Chart approach for the exploration of heterogeneous multi-processor system-on-chip architectures [23]. The framework is called Sesame, and is initialised by first culling the design space with empirical analysis. Next, the design progresses as a process of iterative model refinement. The mapping stage in the Y-Chart facilitates this iterative refinement process. Erbaş's work is an excellent overview of the entire design space exploration process. In Chapter 5, the focus is on a specific example of a template homogeneous architecture at a high level of abstraction. A number of other applications of the Y-Chart flow, which promote its use, are surveyed in [55]. The strength of the Y-Chart approach is observed to be in the iterative update of application and architecture choice based on performance evaluation results.

For the approach which is presented in Chapter 5, the standard Y-Chart method for design space exploration is alone insufficient. Two issues are identified. First, the Y-Chart is too high level and abstract to provide a useful insight into the exploration process. For the constrained design space of graphics processor-like architectures, a more detailed description is possible. Second, although the Y-Chart approach facilitates the redesign of application and architecture, the architectural choices which support the mapping between application and architecture should be made explicit.

To overcome the second issue, a third design space variable, physical mapping, is introduced. This is represented as an overlap of the application and architecture design spaces, and captures the architectural design choices which support the programming model. For a homogeneous multi-processing element architecture, this is a vital part of the design process. The approach in Figure 5.1(a) is thus an amendment to the Y-Chart approach.

Other relevant design space exploration work in [16, 55, 104] is summarised below.

Haubelt [55] presents the SysteMoC model design space exploration methodology. The stated goal is hardware/software co-design of an FPGA-based platform for the application domain of digital signal processing. A multi-level approach supports the model's application from the initial exploration stage to the automated generation of a system.

Panainte [104] presents a design space exploration of the options for a reconfigurable micro-coded coprocessor to a general purpose processor, to accelerate multimedia applications. For MPEG-2 encode, up to a two times performance improvement is observed.

A multi-processor system named MPARM is presented by Benini [16]. An advantage of Benini's approach is that it is high level. This means that the speed of the associated system model is high: the simulation speed is 10,000 cycles per second for a six processor system. A limitation is that the model targets only the ARM AMBA bus protocol interconnect, which currently limits its scope. In the system model presented in Chapter 5, a high-level approach is taken to the modelling of interconnects, which is abstracted from the choice of bus protocol.

In similarity to [16, 55, 76, 104], it is desired in this work to take a step back from low level architectural features and build a high level model for systematic design space exploration. The advantage of this setup is that architectural modifications can be rapidly prototyped, whilst still enabling the modelling of (non-cycle accurate) performance trends. Motivations for the choice of model are described below.

2.5.2 Models for Design Space Exploration

When creating models for design space exploration one is presented with a trade-off between providing a suitable level of abstraction to broaden the design space, and providing enough detail to make the results meaningful. To explore related work in this area, the SystemC tool, architecture models and Kahn Process Networks are discussed below.

SystemC: A Platform for Architecture Exploration

Related work in [16, 55, 104, 105, 110] demonstrates the strength of the SystemC class library [101] as a flexible prototyping platform for design space exploration. Specifically, in [110] Rissa presents the advantages of SystemC over register transfer level (RTL) description, as performed in VHDL or similar. A 360× to 10,000× speed-up in simulation time is achieved for SystemC models over RTL descriptions. The above motivates the use of the IEEE 1666-2005 SystemC class library in Chapter 5, for the implementation of an abstract transaction level architecture model.
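For readers unfamiliar with SystemC, the following minimal transaction-level sketch shows the style of modelling adopted: concurrent modules communicate through channels, and timing is annotated with wait() calls. The module structure, latencies and FIFO depth here are invented for illustration, and are not parameters of the Chapter 5 model.

```cpp
#include <systemc.h>

// A processing element issues memory requests into a FIFO channel, and a
// memory model services them after a fixed (illustrative) latency.
SC_MODULE(ProcessingElement) {
    sc_fifo_out<int> request;            // outgoing memory request addresses
    SC_CTOR(ProcessingElement) { SC_THREAD(run); }
    void run() {
        for (int addr = 0; addr < 4; ++addr) {
            request.write(addr * 64);    // blocks if the channel is full
            wait(10, SC_NS);             // model computation between requests
        }
    }
};

SC_MODULE(MemoryModel) {
    sc_fifo_in<int> request;
    SC_CTOR(MemoryModel) { SC_THREAD(run); }
    void run() {
        for (;;) {
            int addr = request.read();   // blocks until a request arrives
            wait(40, SC_NS);             // fixed service latency
            cout << sc_time_stamp() << ": serviced request @" << addr << endl;
        }
    }
};

int sc_main(int, char*[]) {
    sc_fifo<int> channel(2);             // bounded FIFO models back-pressure
    ProcessingElement pe("pe");
    MemoryModel mem("mem");
    pe.request(channel);
    mem.request(channel);
    sc_start(500, SC_NS);
    return 0;
}
```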

For a comprehensive description of the SystemC language and transaction level modelling, the reader is directed to [39].

At a higher level of abstraction, Shen [121] proposes the automated generation of SystemC transaction level models from high level descriptors. This is achieved through a database of parameterised system components. It is claimed that this approach speeds up, and improves the accuracy of, the generation of architecture models. Shen's work [121] is one approach which could be applied to automate part of the design space exploration in Chapter 5, as discussed in Section 7.3.

Architecture Models for the Graphics Processor

A number of models have been created, which can be used to explore low and high level aspects of the graphics processor architecture. These are summarised below. Moya [38] created a cycle-level model of a graphics processor architecture named ATTILA. The model is very low-level providing detailed simulation of all features of the graphics pipeline. Its simulation time, is one factor which limits the use of this model for design space exploration. Further, the meticulous detail in the ATTILA model limits its application to prototyping only very small architectural modifications. The choice of ‘pure’ C++ code to implement the ATTILA model means that it is challenging to support the modelling of the concurrent operation of system blocks. This support is provided by default with the SystemC library [101]. QSilver [118] is another fine-grained graphics processor architectural model. This uses Chromium [61] to intercept graphics processor pipeline commands and create a trace of operations for the QSilver simulator. One application is to explore thermal manage- ment [119]. QSilver is, similarly to [38], too low-level for rapid design space exploration. An analytical performance model, of the entire graphics accelerator card is developed by Liu [79]. The model includes data transfer to a graphics card, pipeline setup and rendering, and is applied to a specific example of bioinformatics. Estimations matched both the nVidia and AMD ATI devices. A limitation is that the model must be tuned to the application. This limits the scope of the model for performance analysis. A graphics shader performance model is provided by nVidia [99], namely nvshader- perf. This accurately profiles the computational performance of a kernel, but provides no information on memory system performance. In the system model presented in Section 5.2, the nvshaderperf performance tool is used to provide an estimate of the computational cycle count for fragment pipeline models. Govindaraju provides a useful estimate for graphics processor memory system para- meters. Cache block size is estimated to be 8 × 8 pixels, and cache size to be 128 KBytes, for the nVidia 7800 GTX [52]. This device is the predecessor to the GeForce 7900 GTX, 2.5 Design Space Exploration 46

and successor to the GeForce 6800 GT. The results mirror estimates by Moya [94] of a 16 KByte cache with 8 × 8 pixel cache lines for devices in the GeForce 6800 GT generation. A small cache size is optimal for graphics rendering. Estimations from [52, 94] are used for initial parametrisation and verification of the model in Section 5.1. In addition to the models above, Schaumont presents a classification of configurable design elements in [112]. The classification is separated on two axes: communication, storage and processing classes on one axis; and implementation, micro-architecture, ISA and processor/system architecture on the second. This provides a general view of the configurable options for an arbitrary architecture, in contrast to the specialised classification of customisable options for HoMPE architectures presented in Section 6.1. The model presented in Section 5.2 provides a trade-off between the detailed architectural descriptions in [38, 118], and the high-level or feature-specific models in [52, 79, 99].

Kahn Process Networks: Application Modelling

Whilst not an explicit design decision, the components of the system model and their interconnects, as presented in Chapter 5, can be interpreted conceptually as a Kahn Process Network (KPN) [70]. In Figure 2.5, a simple example of the KPN, for the case of two process nodes and two communication channels, is shown. A representation from work by Lieverse [76], who amongst others [23, 71] champions the use of KPNs, is used in Figure 2.5. In the system model in Chapter 5, each processing group (a set of four PEs) can be considered to be a KPN 'Process' from Figure 2.5. The buffer implemented to queue memory accesses between a PE and the memory management unit (described in Chapter 5) is equivalent to an unbounded KPN 'channel'. To ensure that the appropriate latencies are simulated in the system model, flags are passed between process nodes (PEs). These indicate when a process node must stall or continue 'execution'.

[Figure: two KPN process nodes connected by communication channels.]

Figure 2.5: The Kahn Process Network Application Model (for the Example of Two Interacting Processes [70, 76])
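To make the KPN interpretation concrete, the following minimal SystemC sketch models two process nodes connected by a single FIFO channel. It is an illustrative sketch only: the module names, token type and channel depth are assumptions, and the bounded sc_fifo stands in for the conceptually unbounded KPN channel, in the same way that the bounded memory access queue does in the Chapter 5 model.

    #include <systemc.h>

    // A two-node Kahn Process Network: Producer -> channel -> Consumer.
    // Blocking reads model a KPN process stalling until data is available.
    SC_MODULE(Producer) {
      sc_fifo_out<int> out;                 // write port onto the KPN 'channel'
      void run() {
        for (int i = 0; i < 16; ++i)
          out.write(i);                     // emit one token per iteration
      }
      SC_CTOR(Producer) { SC_THREAD(run); }
    };

    SC_MODULE(Consumer) {
      sc_fifo_in<int> in;                   // read port from the KPN 'channel'
      void run() {
        for (int i = 0; i < 16; ++i) {
          int token = in.read();            // blocking read: stalls when empty
          (void)token;                      // ... process the token here ...
        }
      }
      SC_CTOR(Consumer) { SC_THREAD(run); }
    };

    int sc_main(int, char*[]) {
      sc_fifo<int> channel(8);              // bounded stand-in for an unbounded channel
      Producer p("producer");
      Consumer c("consumer");
      p.out(channel);
      c.in(channel);
      sc_start();
      return 0;
    }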

2.6 Synthesis of the Background with the Thesis

This section summarises how the related literature, presented in Sections 2.1 to 2.5, promotes the contributory work in Chapters 3 to 6. To achieve this aim, Sections 2.6.1 to 2.6.4 cover associations with the topics of Chapters 3 to 6 respectively.

2.6.1 Algorithm Characteristics and Implementation Challenges

The architectural descriptions of reconfigurable logic in Section 2.2.1 and graphics processors in Section 2.2.2 describe the two technologies explored throughout this thesis. Sections 2.2.1 and 2.2.2 also highlight the key architectural features necessary to understand the implementations in Chapter 3, with the following prominent observations:

For Xilinx FPGA Devices:

Lookup tables (LUTs), arranged in groups as configurable logic blocks (CLBs), form the fundamental reconfigurable fabric onto which the implementations in Section 3.4.2 are mapped. CLBs are connected by programmable switch-box interconnects. Specific to Xilinx FPGAs are coarse-grained Block Select RAMs (BRAMs), 18 × 18-bit multipliers and XtremeDSP slices. BRAMs are used to store fields (rows) of video frame data. XtremeDSP slices, or 18 × 18-bit multipliers, are used to implement multiplication operations. Each coarse-grained component is promoted in Section 2.2.1 as an area-efficient alternative to LUTs and switch-boxes for the stated target operations.

For nVidia Graphics Processors:

The fragment processing element of the graphics pipeline is the core computational component for the implementations in Section 3.4.1. Graphics card texture memory, in the form of DDR DRAM, is used to store video frames. The remainder of the pipeline supports the rendering of a rectangle, sized to represent the dimensions of a video frame. Related work from Owens [103] champions the use of the graphics processor for general purpose computation. This motivates the use of a graphics processor for implementing algorithms outside of the intended application domain of graphics rendering, for example, for image or video processing applications.

Colantoni's [27] four-feature algorithm characterisation for graphics processors is adapted in Section 3.3.1. The development of a comprehensive pre-implementation characterisation scheme in Section 3.3 is motivated by Colantoni's [27] work.

2.6.2 Reconfigurable Logic and Graphics Processor Comparison

The two key roles of the literature relating to the comparison study in Chapter 4 are to promote a fair and transparent study and to summarise the key results of other comparisons.

A Fair and Transparent Comparison Study

The survey of literature relating to comparison metrics in Sections 2.3.1 to 2.3.4 and the summary of the setup in Section 2.3.6 endorse the validity and transparency of the comparison in Chapter 4. In particular the following two observations are made. First, the die area estimations in Section 2.3.1 demonstrate the relative die footprint of each device. This is used to assess FPGA resource usage in Section 4.8.1. Second, a study of development tools in Section 2.3.2 enhances the understanding of the design time costs associated with each device. This is a significant factor in the chosen market segment of high-end broadcast video accelerator devices. A comparison of computational density in Section 2.3.5 demonstrates the peak compute power of each device. In similarity to DeHon's motivating work for a comparison between general purpose processors and FPGAs [37], this prompts further comparison. The requirement for a dummy rendering pass, to remove the uncertainty of when data is transferred between general purpose processor and graphics processor memory, is observed in Section 2.2.2. This enables a fair comparison of execution time.

Related Comparison Studies

The comparison results presented in Section 2.4.1 highlight that the graphics processor and FPGA are both favourable accelerator devices, each beneficial in different scenarios. A DWT is one scenario where the FPGA is superior, due to the implementation of a customised datapath to support data dependencies which result in a complex memory access requirement. This is consistent with the equivalent scenarios for histogram equalisation and motion vector estimation in Section 4.7. For Monte Carlo simulation, a graphics processor is favourable because of the compute intensive nature of the algorithm. This is equivalent to observations made in Section 4.5 for the primary colour correction algorithm. Altogether, the comparison studies summarised in Table 2.2 present no formalised comparison. The work presented in Chapter 4 is motivated by this observation and presents a structured approach to the comparison challenge.

2.6.3 Design Space Exploration of HoMPE Architectures

The 'Y-chart' [71] motivates the design space exploration process in Chapter 5. It is, however, high-level and does not represent architecture decisions to support application mapping. These limitations provoke the enhancements described in Section 2.5.1. In similarity to [16, 55, 76, 104], it is desired in this work to take a step back from low-level architectural features and build a high-level model for systematic design space exploration. The model presented in Section 5.2 provides a trade-off between the detailed architectural descriptions in [38, 118], and the high-level or feature-specific models in [52, 79, 99]. Estimations from [52, 94] are used for initial parametrisation and verification of this model. The components of the system model and their interconnects, as presented in Chapter 5, can be interpreted conceptually as a Kahn Process Network (KPN) [70]. Work by Shen [121] is one approach which could be applied to automate part of the design space exploration approach, as described in Section 7.3.

2.6.4 Customising a Graphics Processor with Reconfigurable Logic

The detailed description of the texture filter block in Section 2.2.2 explains how the graphics processor memory system works. This explanation justifies the choices for the model in Chapter 5 and aids the understanding of REDA in Chapter 6. Wilton's synthesisable reconfigurable fabric for signal processing applications [139] is an example of how REDA may be implemented, as demonstrated in Section 6.3.5. It also reinforces the point that reconfigurable logic has a broader application than FPGAs. In one interpretation, reconfigurable logic is memory connected by crossbar switches. ClearBlue [1] is an example of a commercial IP core which promotes the use of reconfigurable logic for support or 'glue logic' functions. The small performance improvement and multi-rendering-pass overhead of software-based memory access optimisations, cited in Section 2.4.2, motivate the use of REDA (presented in Chapter 6) as an alternative hardware-based optimisation. Schaumont's [112] classification of configurable design elements on two axes, discussed in Section 2.5.2, provides a very general view of the configurable options for an arbitrary architecture. Also, Todman classifies the uses of reconfigurable logic with a general purpose processor in [133]. Both works inspire the specialised classification of the customisable options for HoMPE architectures presented in Section 6.1.

As a general comment, Sections 2.2.3 and 2.2.4 detail alternative architectures to which the approaches and processes in Chapters 3 to 6 may be applied.


2.7 Summary

It is evident that today's system designer is presented with an abundance of architectural options for video processing acceleration. The legacy alone of the emergence of the graphics processor and reconfigurable logic as accelerator technologies demonstrates their exciting potential. A 'healthy' one to two orders of magnitude speed-up over a general purpose processor is common for each. Although the literature contains examples which show up to an order of magnitude difference between these two devices, no systematic approach is presented to comprehensively answer the question, "why?" In Chapters 3 and 4, an algorithm characterisation scheme and a quantitative exploration approach are presented to explore and provide comprehensive answers to this elusive question. An empirical study shows that the graphics processor has a superior computational density over Xilinx FPGA devices. It is interesting to explore the impact of varying memory access requirements and algorithm data dependencies on the results of this study. For accurate architecture comparison, alternative metrics to throughput must also be considered. To substantiate the architectural comparison in Chapter 4, the following verifications are proposed: a comparison of equally sized FPGA and graphics processor devices; and a case study floating point implementation on an FPGA device. These studies supplement and enhance the core comparison of device throughput. In addition, to avoid bias in design effort, the minimum level of abstraction tool flow is used in each case. From a survey of the available tools, it is observed that their continued development is vital to the survival of either device as a future accelerator technology. To explore modifications to current architectures, and to explore customisable options for future architectures, a systematic approach is required. It is evident through prior acclaim that the 'Y-Chart' exploration concept and the SystemC implementation platform are vital tools as entry points to the approach presented in Chapter 5. A survey of related analytical and physical models of the graphics processor identifies a void between the extremes of low-level detail and high-level performance analysis. The system model in Chapter 5 presents an optimal trade-off between these extremes. It is important to note that the homogeneous multi-processing element (HoMPE) classification provides a unified description of the graphics processor, Cell/B.E. and ClearSpeed CSX processors. This classification identifies the domain into which the analysis and systematic processes presented in this thesis are extendable. In limited related literature, the Cell/B.E. processor is observed to rival the performance of a graphics processor, with performance differences of the order of two times. This demonstrates, for approximately equal die area, the importance of the trade-off between a rigid and 'fast' architecture, and enhanced flexibility, for a given 'cost'.

Chapter 3

Algorithm Characteristics and Implementation Challenges

The relationship between an algorithm and the implementation challenges of a target platform can be explored in a generalisable manner through algorithm characterisation. A scenario of video processing algorithms, with alternative target platforms of the graphics processor and reconfigurable logic, is considered in this Chapter. The following novel contributions are made:

• Proposal of 'The Three Axes of Algorithm Characterisation' scheme which inclusively represents the arithmetic complexity, memory access requirements and data dependence requirements of a video processing algorithm. This scheme is exemplified using the five case study algorithms and is applied pre-implementation.

• A study of frequently occurring implementation challenges and optimisation techniques for the graphics processor and reconfigurable logic target platforms. The approach to this study is generalised using 'The Three Axes of Algorithm Characterisation' scheme. Examples are provided from efficient implementations of the five case study algorithms on each target platform.

To expand upon these contributions, this Chapter is arranged as follows. Section 3.1 contains a description of relevant terminology. Important features of five case study algorithms are introduced in Section 3.2. In Section 3.3, prominent algorithm characteristics are identified. The challenges faced when mapping these characteristics to graphics processor and reconfigurable logic implementations are presented in Section 3.4. In Section 3.5, the key findings are summarised.

3.1 Video Processing Terminology

3.1.1 Video, Throughput Rate and Realtime Performance

A video is a sequence of image frames sampled from a moving environment at a regular time interval. The defining properties of a video are its resolution and refresh rate. Resolution is the number of rows and columns of pixels contained in each frame. A higher resolution equates to a greater density of pixels with which to resolve objects in a video. The rate at which video frames (or fields for interlaced video) are sampled is a video's refresh rate. This is quantified in Hertz (Hz) or frames per second (fps) and determines the temporal information which can be resolved from a video. The peak throughput rate of a video processing system is the maximum number of pixels that can be input, output and processed per second. A system is capable of realtime performance if the peak throughput rate is equal to or greater than the product of the video resolution and refresh rate. This product is known as the throughput requirement. In all cases throughput is measured in millions of pixels per second (MP/s). A selection of digital broadcast video formats and their throughput requirements is shown in Table 3.1. A video format is referred to by an abbreviation comprising the number of rows, whether the video is progressively scanned (p) or interlaced (i), and the refresh rate. For example, a video with a resolution of 720 rows by 1280 columns, progressively scanned at a refresh rate of 24 frames per second, is abbreviated to 720p24. It can be observed that the throughput requirement varies significantly across video formats. The current peak throughput requirement is 62.2 million pixels per second, for the 1080p30 and 1080i60 video formats. The throughput requirement is of the order of two to six times higher for the current generation of high definition TV (HDTV) broadcast video formats than for the prior generation of standard definition TV (SDTV) formats.

Standard   Format    Resolution (Rows × Columns)   Throughput Requirement (MP/s)
SDTV       480p24    480 × 640                     7.4
           480i60    480 × 640                     9.2
           576p25    576 × 768                     11.1
           576i50    576 × 768                     11.1
HDTV       720p24    720 × 1280                    22.1
           720p60    720 × 1280                    55.3
           1080p30   1080 × 1920                   62.2
           1080i60   1080 × 1920                   62.2

Table 3.1: Sample Broadcast Video Formats and Throughput Requirements
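As a worked check of the figures in Table 3.1, the following sketch evaluates the throughput requirement as the product of resolution and refresh rate; the type and function names are illustrative only.

    #include <cstdio>

    struct VideoFormat {
      const char* name;
      int rows, cols;
      double refresh_hz;  // full frames per second (half the field rate for interlaced video)
    };

    // Throughput requirement in millions of pixels per second (Section 3.1.1).
    double throughput_mps(const VideoFormat& f) {
      return f.rows * static_cast<double>(f.cols) * f.refresh_hz / 1e6;
    }

    int main() {
      const VideoFormat formats[] = { {"720p24", 720, 1280, 24.0},
                                      {"1080p30", 1080, 1920, 30.0} };
      for (const VideoFormat& f : formats)
        std::printf("%s: %.1f MP/s\n", f.name, throughput_mps(f));
      return 0;  // prints 22.1 and 62.2 MP/s, matching Table 3.1
    }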

[Figure: (a) the RGB colour cube, with White, Yellow, Cyan, Green, Magenta, Red, Blue and Black vertices, mapped to HSL [135]; (b) the RGB colour cube rotated into Y, Cb, Cr axes [115].]

Figure 3.1: A Conceptual Representation of the Relationship Between Colour Spaces

Colour Space   Components                           Example Applications
RGB            Red, Green, Blue                     Capture, Display
HSL            Hue, Saturation, Luminance           Colour Enhancement
YCbCr          Luminance, Chroma Blue, Chroma Red   Colour Encoding

Table 3.2: A Summary of Colour Space Acronyms and Applications

3.1.2 Colour Spaces

A colour space is the format in which pixels are represented when captured, stored or transmitted in a video processing system. Three colour spaces which occur regularly in broadcast video are RGB, HSL and YCbCr. A summary of these colour space acronyms is given in Table 3.2 and the relationship between them is presented in Figure 3.1. The HSL colour space is a transformation of the conceptual RGB colour cube into a polar representation, as shown in Figure 3.1(a). Luminance (L) represents a point along the vector from Black to White in the RGB colour cube. Hue (H) and Saturation (S) are the polar angular rotation (H) and distance (S) from the Luminance vector respectively. For colour enhancement algorithms the HSL colour space is used because of the conceptual connotations attributed to altering the Hue, Saturation or Luminance of a video frame or still image. Figure 3.1(b) shows the YCbCr colour space. This is a rotation of the RGB colour cube. The YCbCr colour space is used in colour encoding. A high perceived visual quality

can be maintained whilst compressing pixel data by storing more luminance than colour information. The reasoning is that the human visual system is more sensitive to a scene's luminance than to its colour properties. For example, in the ITU-R BT.601 4–2–2 encoding standard four luminance components are stored to every two chroma blue and two chroma red components, hence the name 4–2–2. The YCbCr colour space is also an intermediate step in the transformation of the RGB to HSL colour space. Chroma red (Cr) and chroma blue (Cb) components are a shifted and scaled version of the Cartesian coordinate form of the polar (H, S). The Cartesian form (c_x, c_y) of the polar (H, S) is presented in Equations 3.1 and 3.2 in terms of C_b and C_r. Luminance components L and Y are synonymous in meaning and are related by Equation 3.3. A new c_x c_y L colour space is coined which is the direct Cartesian form of the polar HSL colour space. This is used in Section 3.2.1.

c_x = \frac{C_b - 128}{112} \qquad (3.1)

c_y = \frac{C_r - 128}{112} \qquad (3.2)

L = \frac{Y - 16}{235 - 16} \qquad (3.3)
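A direct transcription of Equations 3.1 to 3.3 is sketched below; the structure and function names are illustrative assumptions, and the inputs are assumed to be 8-bit-range YCbCr values.

    // Maps YCbCr components into the c_x, c_y, L form of Equations 3.1-3.3.
    struct CxCyL { double cx, cy, L; };

    CxCyL ycbcr_to_cxcyl(double Y, double Cb, double Cr) {
      CxCyL out;
      out.cx = (Cb - 128.0) / 112.0;         // Equation 3.1
      out.cy = (Cr - 128.0) / 112.0;         // Equation 3.2
      out.L  = (Y - 16.0) / (235.0 - 16.0);  // Equation 3.3
      return out;
    }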

3.1.3 Algorithm Acceleration and Amdahl’s Law

In the video processing application domain a plethora of algorithms is required to enhance, compress, and add special effects to a video signal. For a typical video processing system a subset of algorithms is chosen to apply the required video transformations. The goal of video processing acceleration is to minimise the time taken to compute one or more of this subset of algorithms, to improve the overall system performance. The degree to which an individual algorithm's performance can be accelerated, the speed-up, is governed by Amdahl's Law [11]. This is formulated as follows. An algorithm can be divided into a fraction of sequential (r_s) and parallel (r_p) portions. Sequential portions by definition cannot be accelerated. If the parallel portions are accelerated by n times, the speed-up is as shown in Equation 3.4. Maximum speed-up is achieved as n approaches infinity; the limit is the reciprocal of the sequential portion of the algorithm.

\text{Speedup} = \frac{1}{r_s + \frac{r_p}{n}} \qquad (3.4)
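As a minimal sketch of Equation 3.4 (the function name is illustrative):

    // Speed-up from Amdahl's Law: rs is the sequential fraction,
    // n the acceleration applied to the parallel fraction rp = 1 - rs.
    double amdahl_speedup(double rs, double n) {
      const double rp = 1.0 - rs;
      return 1.0 / (rs + rp / n);
    }
    // Example: amdahl_speedup(0.1, 1e9) is approximately 10,
    // the reciprocal of the sequential portion.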

The speed-up which can be achieved for a target video processing system is the weighted sum of speed-up contributions from its constituent algorithms. It is desirable to first accelerate the algorithms which consume the largest fraction of the system's functionality, and which have a large parallel portion, to maximally impact the overall performance.

3.2 Case Study Algorithms

3.2.1 Primary Colour Correction

Atmospheric and environmental effects at video capture can introduce undesired degradation to the colour of a video signal. The process of enhancing video to minimise the visual effect of degradation is called 'colour grading'. This can also be used to apply an aesthetic video effect through the alteration of the colour spectrum. Primary colour correction is one prominent colour grading technique which is applied uniformly and independently to each pixel in a video frame. There is no single algorithm to implement primary colour correction; however, the general approach is to alter the hue, saturation, luminance, mid-tones and black–white levels. The algorithm presented here, taken from [123], is a three-stage approach. In order of processing, the stages are: Input Correct, Histogram Correct¹ and Colour Balance. The colour space modifications applied at each stage are summarised below. The Input Correction (IC) and Colour Balance (CB) stages modify a frame's HSL colour space, as follows: the Luminance component is shifted and scaled in the IC stage and only shifted in the CB stage; a Hue shift is applied in both IC and CB; and the Saturation component is scaled in IC and shifted in CB. These modifications are applied over the entire colour space for IC and to a limited range of luminance values for CB. The Histogram Correct (HC) stage is optionally performed on one or more of the components of the RGB colour space. The key transformations are gamma correction and black–white level adjustment. Gamma correction non-linearly scales the RGB component(s) through exponentiation by a uniform value. Level adjustment shifts, scales and then saturates each of the RGB component(s) to an upper and lower bound. An interesting observation is that the HSL colour space modifications in the input correct and colour balance blocks may be applied in the c_x c_y L colour space which was defined in Equations 3.1 to 3.3. The hue and saturation modifications can now be summarised as shown in Equation 3.5, where S_g is the saturation multiplicand (gain), ∆S is the saturation shift and ∆H is the hue shift. Saturation S is by definition the Euclidean distance between c_x and c_y. This is a useful optimisation step for implementing primary colour correction on the graphics processor and reconfigurable logic.

\begin{bmatrix} c'_x \\ c'_y \end{bmatrix} = \left( S_g + \frac{\Delta S}{S} \right) \begin{bmatrix} \cos \Delta H & -\sin \Delta H \\ \sin \Delta H & \cos \Delta H \end{bmatrix} \begin{bmatrix} c_x \\ c_y \end{bmatrix} \qquad (3.5)

¹Note the distinction between Histogram Correct here and Histogram Equalisation in Section 3.2.4.
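The following sketch applies the combined rotation and scaling of Equation 3.5 to a single pixel. It is illustrative only: the function name is an assumption, and grey pixels (S = 0) are left unmodified here to avoid division by zero, a boundary case the equation itself does not address.

    #include <cmath>

    // Hue shift dH (radians), saturation gain sg and saturation shift dS,
    // applied in the c_x c_y plane as in Equation 3.5.
    void hue_sat_modify(double& cx, double& cy,
                        double sg, double dS, double dH) {
      const double S = std::sqrt(cx * cx + cy * cy);  // saturation: Euclidean distance
      if (S == 0.0) return;                           // grey pixel: nothing to rotate
      const double gain = sg + dS / S;                // shift folded into a single gain
      const double c = std::cos(dH), s = std::sin(dH);
      const double nx = gain * (c * cx - s * cy);     // rotate by dH, then scale
      const double ny = gain * (s * cx + c * cy);
      cx = nx;
      cy = ny;
    }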

3.2.2 2D Convolution Filter

The 2D convolution filter is a readily recognised construct of video and image processing. One representation of the 2D convolution filter is shown in Equation 3.6, where F_{in} and F_{out} are the input and output frames respectively. Each output frame pixel F_{out}(x, y) is equal to the sum of the product of a window of input frame pixels and a kernel h. The kernel presented here is equally and symmetrically sized in the x and y dimensions.

F_{out}(x, y) = \sum_{i=-\frac{N}{2}}^{\frac{N}{2}} \sum_{j=-\frac{N}{2}}^{\frac{N}{2}} h(i, j) \, F_{in}(x + i, y + j) \qquad (3.6)
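A scalar reference implementation of Equation 3.6 for one output pixel is sketched below, assuming an odd window size N; the container types and border clamping are illustrative assumptions, since the text does not specify border handling.

    #include <algorithm>
    #include <vector>

    // One output pixel of the 2D convolution in Equation 3.6.
    // Fin is a row-indexed frame, h an N x N kernel with odd N.
    double convolve_pixel(const std::vector<std::vector<double>>& Fin,
                          const std::vector<std::vector<double>>& h,
                          int x, int y) {
      const int N = static_cast<int>(h.size());
      const int H = static_cast<int>(Fin.size());
      const int W = static_cast<int>(Fin[0].size());
      double acc = 0.0;
      for (int i = -N / 2; i <= N / 2; ++i) {
        for (int j = -N / 2; j <= N / 2; ++j) {
          const int px = std::clamp(x + i, 0, W - 1);  // clamp at the frame border
          const int py = std::clamp(y + j, 0, H - 1);
          acc += h[i + N / 2][j + N / 2] * Fin[py][px];
        }
      }
      return acc;
    }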

3.2.3 Video Frame Resizing

An abundance of display medium resolutions, from portable handsets to plasma panel displays, necessitates robust video frame resizing algorithms. The 2D bi-cubic resizing algorithm is an example of a robust method which presents a suitable trade-off between avoiding excessive image blurring and removing unwanted image sharpening artefacts. The 2D bi-cubic resizing algorithm can be presented in the form of two 1D kernels as shown in Equation 3.7, where \lfloor . \rfloor is the floor operator. In this setup the resizing factors are s_x and s_y in the x and y dimensions respectively. The calculation of the \sigma_{dx}(i) parameters is shown in Equations 3.8 to 3.10. The \sigma_{dy}(j) parameters are calculated similarly, substituting x with y and s_x with s_y.

F_{out}(x, y) = \sum_{j=-1}^{2} \sigma_{dy}(j) \left( \sum_{i=-1}^{2} \sigma_{dx}(i) \, F_{in}\!\left( \left\lfloor \frac{x}{s_x} \right\rfloor + i, \left\lfloor \frac{y}{s_y} \right\rfloor + j \right) \right) \qquad (3.7)

d_x = \frac{x}{s_x} - \left\lfloor \frac{x}{s_x} \right\rfloor \qquad (3.8)

\sigma_{dx}(-1) = -d_x^3 + 3d_x^2 - 2d_x, \qquad \sigma_{dx}(0) = 3d_x^3 - 6d_x^2 - 3d_x + 6 \qquad (3.9)

\sigma_{dx}(1) = -3d_x^3 + 3d_x^2 + 6d_x, \qquad \sigma_{dx}(2) = d_x^3 - d_x \qquad (3.10)

3.2.4 Histogram Equalisation

Grey level histogram equalisation is used to enhance images where a bias of contrast makes the image appear excessively dark or bright. This technique can be applied to video intensity, or to RGB components, to increase the visual quality of a frame, or as a pre-processing step in recognition algorithms. A summary of the theory and a useful rearrangement of the algorithm are outlined below for grey level histogram equalisation.

A video frame can be considered a set of pixels which fall into sets of L distinct grey levels. For each grey level r_k, its probability of occurrence p_r(r_k) within an input frame is defined in Equation 3.11, where n_k is the number of occurrences of r_k and n the total number of pixels in the frame. It is observed that a video frame F is perceptively clearer² when, for each r_k, the probability of occurrence p_r(r_k) is approximately equal for all values of k [51]. This is known as an equalised histogram. The equalised histogram of an input frame can be calculated by substituting each grey level r_k with s_k, which is the cumulative histogram distribution T(r_k) [51]. The formula for s_k is shown in Equation 3.12. Note that the 1/n term in Equation 3.12 is independent of the summation and thus is brought outside.

p_r(r_k) = \frac{n_k}{n}, \qquad k = 0, 1, 2, \ldots, L-1 \qquad (3.11)

s_k = T(r_k) = \sum_{j=0}^{k} p_r(r_j) = \sum_{j=0}^{k} \frac{n_j}{n} = \frac{1}{n} \sum_{j=0}^{k} n_j \qquad (3.12)

s_k = \frac{1}{n} \sum_{j=0}^{k} n_j = \frac{1}{n} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \psi(F_{in}(x, y), k), \qquad k = 0, 1, 2, \ldots, L-1 \qquad (3.13)

\psi(a, b) = \begin{cases} 1, & a \le b \\ 0, & a > b \end{cases} \qquad (3.14)

Equation 3.12 can be rearranged and expressed in terms of individual input frame pixels F_{in}(x, y), as shown in Equations 3.13 and 3.14. The function ψ replaces the summation in Equation 3.12 to generate the cumulative histogram bins s_k, and the product of the constants Y (rows) and X (columns) is the resolution of the frame F_{in}. In Section 3.4 this rearrangement is shown to be desirable when implementing histogram equalisation on both the graphics processor and reconfigurable logic.
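A scalar reference implementation of the two-pass algorithm (Equations 3.11 to 3.14) is sketched below; the 8-bit grey levels (L = 256), in-place remapping and rounding are illustrative assumptions.

    #include <array>
    #include <cstdint>
    #include <vector>

    // Grey level histogram equalisation: build the cumulative distribution
    // s_k (Equation 3.12), then remap each pixel intensity through it.
    void histogram_equalise(std::vector<uint8_t>& frame) {
      constexpr int L = 256;
      std::array<double, L> count{};                   // n_k
      for (uint8_t p : frame) count[p] += 1.0;
      const double n = static_cast<double>(frame.size());
      std::array<double, L> s{};                       // s_k
      double acc = 0.0;
      for (int k = 0; k < L; ++k) {
        acc += count[k];
        s[k] = acc / n;                                // cumulative histogram
      }
      for (uint8_t& p : frame)                         // second pass over the frame
        p = static_cast<uint8_t>(s[p] * (L - 1) + 0.5);
    }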

3.2.5 Motion Vector Estimation

Information about the direction of motion in a video sequence is used in applications of de-interlacing and video encoding. For de-interlacing, the interpolation scheme compensates for motion between consecutive fields to minimise the 'saw tooth' effect. Video encoding uses the correlation of temporal information to further improve the compression ratio over techniques which use only spatial correlation. Non-full-search (NFS) motion vector estimation (MVE) is a trade-off between accuracy and computation time with respect to full-search MVE. A 3-step non-full-search MVE technique [120], shown in Figure 3.2, is considered in this case study.

²The term approximately is used due to the discrete nature of the formulation.

0.  For all search windows in the search frame (F_S)
1.    motion vector = (0, 0)
2.    (L_Sx, L_Sy), (L_Rx, L_Ry) = top left of the search and reference windows of the search (F_S) and reference (F_R) frames respectively
3.    for (step = 1:3)
4.      for (j = 0:2)
5.        L'_Ry = L_Ry + j 2^{4-step}
6.        for (i = 0:2)
7.          L'_Rx = L_Rx + i 2^{4-step}
8.          SAD(i, j) = \sum_{k=0}^{15} \sum_{l=0}^{15} | F_S(L_Sx + k, L_Sy + l) - F_R(L'_Rx + k, L'_Ry + l) |
            end loop
          end loop
9.      (i_min, j_min) = location (i, j) of minimum SAD
10.     (L_Rx, L_Ry) += 2^{4-step} × (i_min, j_min)
11.     motion vector += 2^{4-step} × (i_min, j_min)
      end loop

Figure 3.2: Pseudo-code for 3-Step Non-Full-Search Motion Vector Estimation [120]

The acronym SAD denotes the sum of absolute differences, and | . | is the absolute value operator.

In Figure 3.2 a search frame (F_S) is divided into search windows of size 16 × 16 pixels. For each step (from line 3) a search window is compared with nine neighbouring locations in a reference frame (F_R) to determine which is the best match (lines 4–9). The location of the 'best match' (the lowest SAD) guides the search for the next step (line 10). For each successive step the difference between neighbouring locations is reduced by a factor of two. It is observed that for steps two and three, the SAD calculation at the offset of i = 0 and j = 0 does not need to be computed, because it has already been computed as the 'best match' location in the prior step. The memory access requirements in Figure 3.2 are observed to be complex and numerous. These properties make motion vector estimation a good algorithm with which to explore the flexibility of a target platform.
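The SAD comparison at the heart of line 8 of Figure 3.2 is sketched below for one candidate offset; the row-major frame layout and function name are illustrative assumptions.

    #include <cstdint>
    #include <cstdlib>

    // Sum of absolute differences between one 16x16 search window (FS)
    // and one candidate reference window (FR); stride is the frame width.
    int sad_16x16(const uint8_t* FS, const uint8_t* FR, int stride,
                  int LSx, int LSy, int LRx, int LRy) {
      int sad = 0;
      for (int l = 0; l < 16; ++l)
        for (int k = 0; k < 16; ++k)
          sad += std::abs(static_cast<int>(FS[(LSy + l) * stride + (LSx + k)]) -
                          static_cast<int>(FR[(LRy + l) * stride + (LRx + k)]));
      return sad;
    }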

3.3 The Three Axes of Algorithm Characterisation

The algorithm characterisation which is used to highlight the implementation challenges (in Section 3.4) and to compare the performance (in Chapter 4) of the graphics processor and reconfigurable logic is defined in this Section. To this end, there are four challenges to consider in choosing the algorithm characterisation:

1. Applicability to both graphics processor (software centric) and reconfigurable logic (hardware centric) implementations;

2. Video processing algorithms must be inclusively represented;

3. Restrictions on the parallelisation of an algorithm must be considered; and

4. The algorithm characterisation must be applicable pre-implementation.

To satisfy these four challenges a set of three characteristics is used: arithmetic complexity, memory access requirements and data dependence. In Figure 3.3 these characteristics are posed in a three-axes representation, namely 'The Three Axes of Algorithm Characterisation'. Any given algorithm can be depicted as a point in the 3D space between the axes. Conceptually, this represents a combination of contributing factors from each characteristic. The relevance of each axis is described in Sections 3.3.1 to 3.3.3, and in Section 3.3.4 the case study algorithms are used to exemplify the characterisation.

[Figure: three orthogonal axes labelled Arithmetic Complexity, Memory Access Requirements and Data Dependence.]

Figure 3.3: The Three Axes of Algorithm Characterisation

3.3.1 Axis 1: Arithmetic Complexity

An algorithm's arithmetic complexity represents the challenge of mapping the non-memory-access operations of an algorithm to a target platform. The arithmetic complexity is used here to isolate the prominent computational features of video processing algorithms.

Feature   Description                      Example Operations
Math      Complex mathematical functions   sine, cosine, power
Branch    Flow control operators           if/else, loop/endloop
Vector    Vectorial computation            dot product, vector multiply

Table 3.3: A Categorisation of Arithmetic Operations (based on [27])

Colantoni proposes four algorithm features for analysing graphics processor performance [27]. The four features are complex mathematical functions, branching constructs, vectorial computation and memory access. An importance of null, low, medium or high is attached to each feature. The arithmetic complexity characterisation used in this work is based on the first three features of Colantoni's scheme and is summarised in Table 3.3. Complex mathematical functions (the Math feature) are operations which require increased implementation effort with respect to the standard multiply, divide, add and subtract operators. A complex function may indeed be approximated with standard operators, for example, the Taylor series expansion of the sine and exponentiation operations. The Branch feature is termed flow control operators to emphasise the distinction between hardware and software analogies of branching. For software, the implementation of flow control involves a physical jump, or branch, between instructions. For hardware, a specialised datapath is implemented to 'hard code' flow control operations. Vectorial computation (the Vector feature) is included in Colantoni's feature scheme to highlight how well an algorithm can be mapped to the graphics processor's 4-vector based instruction set [45]. For reconfigurable logic an equivalent challenge is the mapping of an algorithm to compute resources. For example, vector operations can be efficiently mapped to multipliers and adders in a Xilinx XtremeDSP slice [131]. The presence of vectorial computation therefore illustrates how well an algorithm maps to the graphics processor instruction set and, for reconfigurable logic, the potential to utilise coarse-grained macro blocks. The impact of each of the three features is characterised by the qualitative null, low, medium or high importance used by Colantoni [27]. This enables an assessment, prior to implementation, of the arithmetic complexity of an algorithm.

3.3.2 Axis 2: Memory Access Requirements

Colantoni's fourth performance analysis feature is a qualitative measure of the number of memory accesses required by an algorithm [27]. Although important, a measure of the number of accesses alone does not provide a full picture of the nature of an algorithm's memory access requirement. A quad of number, memory intensity, pattern and reuse potential is

proposed as the memory access³ requirement characterisation. The number of memory accesses is specified as the quantity of memory requests from an input frame (or frames) that are independently required for each output. The term output is used because the output of an algorithm is not always an array of pixels. For example, the output of motion vector estimation is an array of motion vectors. The arithmetic intensity [35] of an algorithm, shown in Equation 3.15, is the ratio of instruction count to memory bandwidth requirement. This ratio has an increasing importance for microprocessors, for which processing power is increasing at a faster rate than the available communication bandwidth [108].

\text{Arithmetic Intensity} = \frac{\text{Number of Operations}}{\text{Words Transferred}} \qquad (3.15)

The instruction count of an algorithm is unavailable until after implementation; in addition, arithmetic intensity is specific to processors. To provide a pre-implementation metric, applicable also to reconfigurable logic, a qualitative memory intensity measure of low, medium or high importance is introduced, with reciprocal meaning to arithmetic intensity. A high memory intensity indicates a large number of memory access requests relative to arithmetic operations. The memory intensity characteristic abstracts away the effect of a specific choice of target platform. An algorithm's memory access pattern can be grouped into three classes: predictable, locally random or random. Predictable access patterns are those which are well defined pre-implementation. An access pattern which is dependent on input frame data but remains in the same spatial locality is termed locally random. For example, the motion vector estimation case study requires locally random memory accesses. A data dependent memory access shift of 2^{4-step}, zero or -2^{4-step} in the x and y dimensions is made between each step. This can be considered random in a localised area. Random access patterns over the entire video frame are infrequent in video processing. More common are random accesses to lookup tables, for example, in the final stage of histogram equalisation. The reuse potential of an algorithm represents the amount of data reuse that is possible between the processing stages of neighbouring outputs. Practically, this is implemented in a cache for a processor architecture or in on-chip buffers for reconfigurable logic. An algorithm's reuse potential is quoted as the maximum number of pixels that may be reused between outputs separated by a Manhattan distance of one, for example, the input frame pixels shared between the computation of output frame locations F_{out}(x, y) and F_{out}(x + 1, y)

(equivalently between F_{out}(x, y) and F_{out}(x, y + 1)). The reuse potential is expressed as a fraction of the total number of memory reads. To exemplify, the 2D convolution algorithm

³The term 'memory accesses' refers to the algorithmic requirements for each pixel and not the post-implementation off-chip bandwidth requirements, which are subject to caching or buffering behaviour.

in Equation 3.6 has a reuse potential of \frac{N(N-1)}{N^2}, where N is the convolution window size. The benefit of this characteristic is that it measures how much of the previously accessed input data can be reused through the use of on-chip memory. It is observed that data reuse is in fact iterative: data can be reused between two outputs with a Manhattan distance separation of more than one. Despite the iterative nature of data reuse, the reuse potential provides a suitable pre-implementation measure of possible data reuse for a given algorithm.

3.3.3 Axis 3: Data Dependence

The term data dependence is familiar in the computer science community and is used in three capacities: dependencies between instructions (true dependency), dependencies affecting outputs (output dependency) or memory access dependencies for cache coherence (for example, read-after-write dependencies). True and output dependencies are handled by compiler tools for graphics processors. The synonymous concepts of scheduling and pipelining are used to handle true and output dependence for reconfigurable logic. Memory access dependencies are handled (for processors) through memory space restrictions or cache coherence strategies. For reconfigurable logic, memory access dependencies are solved with similar solutions in datapath and memory hierarchy design. In this characterisation, data dependence refers to that which poses restrictions on the parallelisation of computation, which is directly related to the time instance at which pixel values accessed from memory can be operated on. To explain the context of data dependence used here, the histogram equalisation case study algorithm from Section 3.2.4 is used. Two terms, intra-output and inter-output data dependence, are introduced. The calculation of the cumulative histogram for histogram equalisation requires an intra-output data dependence: to calculate one histogram bin value, contributions from all pixels must be accumulated. In addition, it is necessary to calculate the entire cumulative histogram before it can be applied to each pixel to equalise its intensity. Two passes over input frame data are required, to calculate the cumulative histogram and then to equalise the intensity of each pixel. This is an inter-output data dependence. Inter-output and intra-output represent dependence between neighbouring output calculations and within a single output calculation respectively. The data dependence characteristic highlights memory-access-related restrictions on algorithm parallelisation and is classified as null, low, medium or high relative to the severity of the dependence. In terms of Amdahl's Law in Equation 3.4, data dependence determines the proportion of an algorithm which is parallel and which algorithm operations must be performed sequentially. For both the graphics processor and reconfigurable logic, the extraction of parallelism from an algorithm is an integral part of their benefit as a target architecture.

[Table 3.4 is rotated in the original layout. It characterises each case study algorithm (primary colour correction with its input correction, histogram correct and colour balance stages; 2D convolution; video frame resizing; histogram equalisation in its generation and intensity equalisation stages; and 3-step non-full-search motion vector estimation) against the three axes: arithmetic complexity (Math, Branch and Vector features), memory access requirements (number per output†, memory intensity, pattern and reuse potential) and data dependence. The individual entries are discussed in Section 3.3.4.]

Table 3.4: The Characterisation of Case Study Algorithms
† The number of outputs for each case study algorithm is clarified in Table 3.5.
‡ The product XY is the input frame resolution.
§ A single memory access from a 256 location lookup table.
] σ indicates a dependence on the variance of the intensity when accessing the 256 location lookup table mentioned in note §.

Algorithm                                          Inputs                       Outputs
Primary Colour Correction                          X × Y                        X × Y
2D Convolution                                     X × Y                        X × Y
Video Frame Resizing                               X × Y                        (X/s_x) × (Y/s_y)
Histogram Equalisation: Generation (256 bins)      X × Y                        256
Histogram Equalisation: Intensity Equalisation     X × Y, 256                   X × Y
3-Step Non-full-Search Motion Vector Estimation    X × Y (F_S), X × Y (F_R)     (X/16) × (Y/16)

Table 3.5: A Summary of the Number of Inputs and Outputs for each Case Study

3.3.4 Case Study Algorithm Characterisation

The characterisation is applied to each of the five case study algorithms as shown in Table 3.4. The benefits of each algorithm and its characteristics, for exemplifying the implementation challenges and performance of the target platforms, are explored in this Section. As an aide-memoire, the number of inputs and outputs required for each case study algorithm is summarised in Table 3.5. For the primary colour correction algorithm and its constituent blocks, the most prominent characteristic is the arithmetic complexity, for which a variety of math, branch and vector features are exhibited. The primary colour correction algorithm exemplifies the sensitivity of a target platform to different ratios of arithmetic operations. A platform's affinity to exploit memory access reuse potential, and its behaviour for an increasing number of memory accesses, can be explored with the 2D convolution algorithm. The number of memory accesses in 2D convolution increases at a rate of N^2, with a high reuse potential of \frac{N(N-1)}{N^2}, for each convolution size N. The reuse potential of video frame resizing is classified as variable, with a value which ranges from \frac{0}{16} to \frac{16}{16}. This is representative of the variety of video frame resizing ratios present in video processing systems. Reuse potential (RP) is formulated, for varying horizontal resizing factor s_x, in Equation 3.16, where \lceil . \rceil is the ceiling operator. The same formulae apply for the vertical dimension if s_x is substituted with s_y.

RP = \begin{cases} \frac{16}{16} \text{ to } \frac{12}{16}, & s_x > 1 \\ \frac{12}{16}, & s_x = 1 \\ \frac{4(4 - \lceil s_x^{-1} \rceil + 1)}{16} \text{ to } \frac{4(4 - \lceil s_x^{-1} \rceil)}{16}, & 0.25 \le s_x < 1 \\ \frac{0}{16}, & s_x < 0.25 \end{cases} \qquad (3.16)

Video frame resizing can be used as a test vehicle to show the flexibility of a target platform to support variable data reuse, with a range of no data reuse potential (\frac{0}{16}) to full data reuse potential (\frac{16}{16}) between neighbouring outputs, as shown in Table 3.4. The histogram equalisation algorithm is divided into two stages: first, the generation of 256 cumulative histogram bin values; second, the equalisation of the intensity spectrum using the cumulative histogram. The histogram generation stage requires the entire frame (the product X × Y) to be accessed for each output, and each memory access can be reused between all outputs. This is represented by a total of XY memory accesses and a reuse potential of \frac{XY}{XY}. Histogram generation also has the characteristic of high data dependence (as explained in Section 3.3.3). Note that the total number of outputs is low (256). This presents an issue for graphics processor implementation and is discussed further in Section 3.4.1. In the second stage of histogram equalisation, the calculation of each output pixel requires the following: the original intensity is read from an input frame; a new intensity value is mapped from a lookup table; and the new intensity value is written to an output frame. It is observed that only the lookup table values may be reused between output pixels. The amount of reuse that can be exploited between neighbouring outputs is therefore dependent on the distribution (σ from Table 3.4 note ]) of image intensity in the input frame (F_{in}). The high correlation between neighbouring pixels in an input frame means that the probability of memory reuse between neighbouring pixels is high. Motion vector estimation has the most complex memory access requirements of the five case studies. These requirements are explained in detail below. First, the memory access requirements are summarised. A window of size 16 × 16 pixels is required from the search frame for each motion vector (output). From the reference frame, a total of 16^2(9 + 8 + 8) memory accesses is required per motion vector: nine pixel windows of size 16 × 16 are required for step one, plus eight for each of steps two and three. To calculate the data reuse potential, consider the following. Search windows are non-overlapping, so there is no reuse potential between neighbouring outputs (\frac{0}{16^2}). A possible range of 44 × 44 reference frame pixels may be required for each motion vector. The value 44 arises from the location frame window offsets of 8, 4 and 2 for steps 1, 2 and 3 respectively. Each reference frame window is of size 16 × 16 pixels and the offsets can be negative or positive, therefore the range is 16 + 2(8 + 4 + 2).

[Figure: the five case study algorithms plotted in the space of the three axes (arithmetic complexity, memory access requirements, data dependence), shown with an accompanying side view.]

Figure 3.4: A Population of The Three Axes of Algorithm Characterisation with the Case Study Algorithms

The reference frame pixels shared between two output motion vectors with a Manhattan distance of one span a shift of 16 pixel locations in either the x or y dimension. Therefore, the reuse potential is \frac{44(44-16)}{44^2}. Motion vector estimation provides an example where there is a large degree of data dependence in the calculation of each output (motion vector), as explained below. In the pseudo-code in Figure 3.2, intra-output dependence is observed in the search for the location of the minimum sum of absolute differences (on line 9) and in the update of the reference window location (L_Rx, L_Ry) (on line 10). All SAD calculations may be computed in parallel, but the results of all SADs are required simultaneously to calculate the 'best match' reference window location on line 9. Similarly, the reference window location must be updated, on line 10, for one step before it can be used for the SAD calculations in the next step. This will be shown in Section 3.4 to present a particular challenge for a graphics processor implementation. In conclusion, the case study algorithms populate the 3D space of the three axes of algorithm characterisation. A pictorial representation of the contributing factors of each axis to the algorithms is shown in Figure 3.4.

3.4 Implementation Challenges

The implementation of an algorithm on a target platform is a complex mapping problem. An algorithm has a predefined functionality and input-output requirement which need to be efficiently mapped onto the available resources of the target platform. For the graphics processor, the challenge is to make efficient use of resources which are historically optimised for graphics rendering. The resources include a fixed datapath, a fixed number of parallel processing units and a high bandwidth memory hierarchy with limited on-chip cache space [52, 94]. The challenges faced by a reconfigurable logic implementation are: to design a datapath; to decide how many parallel processing units (the parallelism) to use; and to choose a suitable memory hierarchy. These three resource allocation choices are constrained by the proportions of a reconfigurable array of slice logic and coarse-grained macro blocks. It is observed that both the graphics processor and reconfigurable logic provide a designer with parallel processing elements and on-chip memory. These are two features which are key to high performance video processing algorithm implementations [141]. In this Section, the aim is to survey the implementation challenges and associated optimisation techniques for graphics processors and reconfigurable logic. 'The Three Axes of Algorithm Characterisation' from Figure 3.3 is used to generalise the findings. Implementation challenges for each platform choice are demonstrated through reference to examples from the case study algorithms. A number of the characteristics may not be mutually exclusive in the challenges they pose for implementation. For example, in a reconfigurable logic implementation the data dependence properties of an algorithm can have a large impact on the choice of data reuse strategy. Despite this overlap, the characteristics can be used to generalise the implementation challenges faced by a designer.

3.4.1 Efficient Graphics Processor Implementations

Vertex and fragment processing are two consecutive stages of programmable computation available in a graphics processor. Vertex processing involves the transformation of vertices in 2D space. Example applications for video processing are coarse-grained coordinate transformations for video effects and video frame rotation. The focus of the implementation challenges presented here is on the second stage, fragment processing. In fragment processing, each output pixel value is calculated in isolation. A summary of the programming model for the fragment processing stage is shown in Figure 3.5. Note the unidirectional movement of input arrays and output arrays, and also the N parallel pipelines. Each 'fragment pipeline' has two computational 'shader' units and a memory access 'texture' unit.

[Figure: N parallel fragment pipelines, each containing computation units 1 and 2 and a memory access unit; a fixed number of input arrays (textures) feed the pipelines, data reuse is supported through a shared cache, and results are written to a fixed number of output arrays (render targets).]

Figure 3.5: The Fragment Processing Pipelines in a Graphics Processor [108]

An architectural limitation is that only computation unit 1 or the memory access unit may be in use at any one time. Data reuse and memory accesses for all pipelines are supported through a shared on-chip cache. Optimising a graphics processor implementation involves the efficient utilisation of the computation units, memory hierarchy and parallel pipelines. Each of these factors is now described. The mapping of arithmetic operations to a computation unit involves making efficient use of its instruction set architecture (ISA). A designer must present high-level code descriptions to the compiler [97] in a format conducive to efficient ISA utilisation. Of particular importance is that the ISA is designed for 4-vector operations. Besides the transformation of an algorithm to contain the maximum proportion of vector instructions, there are two architectural features which can be exploited: co-issue and dual-issue [108]. In co-issue, a number of scalar, 2-vector or 3-vector operations can be presented to a computation unit as a combined 4-vector instruction to perform separate calculations in parallel. Dual-issue is where separate instructions are issued to different parts of the graphics processor pipeline simultaneously. The three-component nature of the colour space representation of video frames results in frequently occurring opportunities to exploit ISA vector instructions. However, this is often with only 75% utilisation.
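The following sketch illustrates the 4-vector packing described above; float4 is a plain C++ stand-in for the GPU's native 4-vector type, and the example of packing an independent scalar update into the spare lane (co-issue) is illustrative.

    // One 4-vector multiply-add: an RGB operation occupies lanes x..z
    // (75% utilisation); co-issue fills lane w with unrelated scalar work.
    struct float4 { float x, y, z, w; };

    float4 madd4(const float4& a, const float4& b, const float4& c) {
      return { a.x * b.x + c.x,    // R
               a.y * b.y + c.y,    // G
               a.z * b.z + c.z,    // B
               a.w * b.w + c.w };  // spare lane: independent scalar operation
    }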

Example 3.1: Efficient Mapping of Operations to the Vector ISA

To exemplify the benefit of optimising the ISA utilisation, the primary colour correction case study algorithm is considered. Through the astute use of vector optimisations on un-optimised code [123], the instruction count is experimentally reduced from 378 to 274 instructions. The transformation to perform hue, saturation and luminance modifications in the c_x c_y L colour space, as detailed in Section 3.2.1, further reduces the instruction count to 162 instructions. A 57% gross reduction in instruction count is observed from math and vector operation optimisations. This highlights the significance of efficient exploitation of the vector based ISA and of considering alternative geometric forms of vector operations. ¤³

For the nVidia GeForce 6 Series graphics processor, a maximum of six clock cycles of overhead is required for flow control operations [108]. Additionally, conditional branching may result in all branches being computed if neighbouring outputs take separate branches [108]. This is because the program counter for all pipelines in Figure 3.5 is the same (the programming model is single-program multiple-data (SPMD)). Flow control, particularly when conditional, is therefore a limitation of the graphics processor ISA. The z-cull technique [108] can be used to overcome the branching limitation through conditional output calculation. Alternatively, loops with a predetermined iteration size may be removed by loop unrolling. This results in a trade-off between instruction cache size and clock cycle overhead. An example of where this trade-off occurs is for the increasing loop size N of the nested accumulation loops required for the 2D convolution case study. The optimisations available to the designer, with respect to the memory access requirements of an algorithm, are limited by the fixed memory hierarchy of the graphics processor. This is further compounded by the absence of information on the cache behaviour, sizing and architecture of the memory hierarchy. A number of efforts have been made to experimentally determine the features of the memory hierarchy [52, 94], as discussed in Section 2.4.2, and thus to provide guidelines on how to exploit an algorithm's memory access requirement. The consensus, consistent with work in this thesis, is that the locality of memory accesses from neighbouring outputs must be kept as high as possible, ideally only steps of one location. In addition to cache efficiency, the number of accesses required, and particularly the memory intensity, are vital to high graphics processor performance. A total of 8 to 10 arithmetic operations is required to amortise each memory access [108]. Qualitatively, a low memory intensity is desired. Although the total number of accesses often cannot be reduced, multiple rendering passes can be used to reduce the number of accesses

³The symbol ¤ is used throughout this Chapter to indicate the end of an example.

and improve the data reuse potential of each individual pass [103]. The cost of this method is a higher number of rendering passes, with increased pipeline setup overheads [79]. The method is exemplified for motion vector estimation in Example 3.2, specifically Equation 3.19. A graphics processor is designed to support memory access patterns which are localised and either random or predictable. This is a legacy of texture mapping in graphics rendering, which exhibits this memory access behaviour. There is therefore no support for non-localised memory access patterns. Fortunately, few video processing algorithms require an access pattern that is non-localised. In the example case of random accesses to lookup tables for histogram equalisation, the lookup table is often small enough (e.g. 256 values) that the entire table may be cached simultaneously. A small cache space (1 KByte in this example) is required if each lookup table location is stored in floating point format. To optimise for data dependence on a graphics processor, one must consider two goals: to exploit parallelism within non-data-dependent min-terms; and to minimise the required number of rendering passes. The restriction on graphics processor memory space, which states that a memory location may only be read from or written to in one rendering pass, is both beneficial and problematic for supporting data dependence. A benefit of the memory space restriction is that, for inter-output data dependence, a different rendering pass can be used for each level of dependence. This results in a very distinct separation of data dependence levels in the programming model. The memory space restriction is a limitation when attempting to extract parallelism from algorithms with intra-output dependence. To parallelise an algorithm relative to an intra-output dependence, consider the following. First, use one rendering pass to compute the non-dependent min-terms in parallel, the results of which are stored in off-chip memory. The subsequent rendering pass is then used to compute the data dependent calculation from the results of the prior pass. In summary, inter-output dependence, in the form of the memory space restriction, is artificially introduced to parallelise intra-output dependencies. Examples 3.2 and 3.3 demonstrate this process, and a code sketch of the idiom follows below.
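The two-pass idiom just described can be pictured with the following CPU sketch, in which a separate output array stands in for the off-chip memory written between rendering passes; the grouping factor and names are illustrative.

    #include <numeric>
    #include <vector>

    // Pass one: each group's partial sum is an independent 'min-term',
    // computed in parallel on the GPU and written to off-chip memory.
    std::vector<int> pass_one(const std::vector<int>& in, int group) {
      std::vector<int> partial(in.size() / group, 0);
      for (std::size_t g = 0; g < partial.size(); ++g)
        for (int i = 0; i < group; ++i)
          partial[g] += in[g * group + i];
      return partial;
    }

    // Pass two: the data dependent accumulation over the partial results.
    int pass_two(const std::vector<int>& partial) {
      return std::accumulate(partial.begin(), partial.end(), 0);
    }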

Example 3.2: 3-Step Non-full-Search Motion Vector Estimation

The non-full-search motion vector estimation algorithm, outlined in Figure 3.2, can be expressed in the simplified form of Equation 3.17. The Greek symbol Digamma (ϝ) signifies a ‘for loop’, an (x, y) tuple represents one motion vector and square brackets ([.]) highlight a data dependence. A level of intra-output data dependence is as follows.

All SAD calculations for a motion vector (x, y) must be computed for step s before step s+1 can be computed. There is, however, no inter-output data dependence between motion vectors, therefore Equation 3.17 can be rearranged as shown in Equation 3.18.

Large non-overlapping windows, for example the search window ($F_S$) accesses, are cited to be inefficient for graphics processor implementation [103]. A popular solution is to subdivide the summation [103] as shown in Equation 3.19, where $k = 4k' + k''$ and $l = 4l' + l''$. Finally this can be rearranged as shown in Equation 3.20 to compute all inner summations $k''$, $l''$ in parallel. Another level of inter-output data dependence is introduced here because all inner summations must be computed before they can be accumulated in the outer summation over $k'$ and $l'$.

" Ã !# 15 15 X Y X X 16 16 3 2 2 zx=0zy=0zs=1 zi=0zj=0 SAD(i, j) = |(FS(..) − FR(..))| (3.17) " Ã k=0 l=0 !# 15 15 X Y X X 3 16 16 2 2 zs=1 zx=0zy=0zi=0zj=0 SAD(i, j) = |(FS(..) − FR(..)| (3.18) " Ã k=0 l=0 !# 3 3 3 3 X Y X X X X 3 16 16 2 2 zs=1 zx=0zy=0zi=0zj=0 SAD(i, j) = |(FS(..) − FR(..)| (3.19) 0 0 00 00 " " Ã k =0 l =0 k =0 l =0 !## 3 3 3 3 X X X Y X X 3 16 16 2 2 zs=1 zx=0zy=0zi=0zj=0 SAD(i, j) = |(FS(..) − FR(..)| (3.20) k0=0 l0=0 k00=0 l00=0

The implementation in Equation 3.20 is efficient for graphics processor implementation because the term inside the innermost square bracket exhibits a large amount of non-data dependent parallelism and a low data reuse potential. ¤
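As a concrete reference for the innermost term of Equation 3.17, the following C++ routine computes one SAD value for a 16 × 16 block. The row-major frame layout, the ‘stride’ width parameter and the argument names are illustrative assumptions rather than details taken from the case study code.

#include <cstdint>
#include <cstdlib>

// SAD between a 16 x 16 reference block and one candidate block in the
// search frame. (bx, by) locates the block; (i, j) is the candidate offset
// tested at the current search step.
int sad16x16(const std::uint8_t *fs, const std::uint8_t *fr, int stride,
             int bx, int by, int i, int j) {
    int sad = 0;
    for (int k = 0; k < 16; ++k)
        for (int l = 0; l < 16; ++l)
            sad += std::abs(int(fs[(by + j + k) * stride + (bx + i + l)])
                          - int(fr[(by + k) * stride + (bx + l)]));
    return sad;
}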

Example 3.3: Histogram Equalisation

The cumulative histogram calculation in Equation 3.13 can be presented in a simplified form as shown in Equation 3.21. In a similar manner to Example 3.2, inter-output data dependence can be introduced as shown in Equation 3.22, where k is the bin number and $k = 16k' + k''$.

$$\digamma_{k=0}^{256}\left[\sum_{x=0}^{X-1}\sum_{y=0}^{Y-1}(..)\right] \qquad (3.21)$$

$$\digamma_{k'=0}^{15}\,\digamma_{k''=0}^{15}\left[\sum_{x_0=0}^{\frac{X}{d}-1}\sum_{y_0=0}^{\frac{Y}{d}-1}\left[\sum_{x_1=0}^{\frac{X}{d}-1}\sum_{y_1=0}^{\frac{Y}{d}-1}\left[\;\cdots\;\sum_{x_{d-1}=0}^{f(x_0,x_1 \ldots x_{d-2})}\;\sum_{y_{d-1}=0}^{f(y_0,y_1 \ldots y_{d-2})}\left[(..)\right]\right]\right]\right] \qquad (3.22)$$

The cumulative histogram calculation is decomposed by a factor d into the calculation of smaller histograms (of size d × d). At each decomposition level, a level of inter-output data dependence is introduced. This is indicative of the accumulation that is required between the smaller histograms. The histogram bins are calculated in a 2D array, where $k = 16k' + k''$, to make optimal use of the 2D nature of graphics processing. Note that the nature of histogram equalisation means that histograms can be accumulated in arbitrary order. This is further explained alongside the performance results in Section 4.7.2. ¤

3.4.2 Reconfigurable Logic Toolbox

The target for all reconfigurable logic designs is to make optimal use of available resources whilst achieving a desired clock rate. These two goals conflict: optimisation techniques which increase the clock rate of a design typically also increase the design size (resource usage).
A higher clock rate is achieved by minimising the capacitive load of fan-out and inserting regular pipeline stages. The ‘cost’ of pipelining and reducing fan-out is an increased latency and a higher resource usage. For video processing, latency is amortised by the large number of pixels computed per frame. The increase in resource usage is traded off against clock rate subject to a target throughput requirement and resource constraint4.
The flexibility of reconfigurable logic presents a large design space for video processing algorithms. Generalised challenges and optimisation techniques for the implementation of the algorithm characteristics from Section 3.3 are explained in this Section. The optimisation goals are minimising resource usage and achieving a desired clock rate.
The implementation of trigonometric, hyperbolic and logarithmic math operations on reconfigurable logic can be approached using the COordinate Rotation DIgital Computer (CORDIC) algorithm [12]. The CORDIC algorithm can also be used for vector rotation. CORDIC implements an iterative approximation method based on the geometric properties of math operations. A trade-off between resource usage and clock rate is possible by choosing an iterative or fully unrolled ‘online’ CORDIC implementation [12]. To exemplify the implementation process, consider the vector rotation in Equation 3.5 about the angle ∆H for the primary colour correction case study. An n-stage fully unrolled CORDIC implementation may be considered to be beneficial over an iterative approach because a single pixel per clock cycle throughput can be maintained. The number of stages determines the minimum angle which can be resolved (the numerical precision). For the primary colour correction example, a precision of less than half a least significant bit, for 8-bit per channel RGB input, is possible with an 18 stage implementation. A software sketch of the CORDIC iteration follows below.
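The following C++ fragment is a behavioural sketch of the iterative CORDIC rotation described above. A hardware implementation would be a fixed-point VHDL pipeline with bit widths chosen as in [18]; this floating point model is provided for illustration only.

#include <cmath>

// Rotate the vector (x, y) by 'theta' radians using n CORDIC micro-rotations.
// Convergence holds for |theta| up to approximately 1.74 radians.
void cordic_rotate(double &x, double &y, double theta, int n) {
    // Gain of n micro-rotations: K = prod(1 / sqrt(1 + 2^-2i)), removed at the end.
    double k = 1.0;
    for (int i = 0; i < n; ++i)
        k /= std::sqrt(1.0 + std::ldexp(1.0, -2 * i));

    double angle = 0.0;
    for (int i = 0; i < n; ++i) {
        double d = (theta > angle) ? 1.0 : -1.0;      // direction of micro-rotation
        double xi = x - d * std::ldexp(y, -i);        // shift-add operations only:
        double yi = y + d * std::ldexp(x, -i);        // 2^-i is a right shift in hardware
        angle += d * std::atan(std::ldexp(1.0, -i));  // elementary angle table entry
        x = xi; y = yi;
    }
    x *= k;                                           // compensate the CORDIC gain
    y *= k;
}

With n = 18 the residual angle error is below atan(2^-17), consistent with the half-LSB precision quoted above.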

4Power is a further, increasingly important, optimisation goal which is omitted here.


Figure 3.6: A Multiply-Addition Tree and a Rearrangement Utilising XtremeDSP Slices for 2D Convolution with a Window Size of 2 × 2 Pixels

There are a number of approaches to implement flow control in reconfigurable logic datapaths: for ‘branch’ operations, all results are computed and selected through a multiplexor; a small ‘loop’ of fixed size may be expanded; and conditional or large ‘loops’ require the implementation of a state machine or processor [74]. Example applications of these three approaches are in the primary colour correction, 2D convolution and motion vector estimation case studies respectively.
Vector operation based computation has the property that many arithmetic operations, the most common being multiply and addition, are grouped together in a regular arrangement. A filter operation is an example of such a computation. This means that vector operations can be implemented in such a manner as to make efficient use of coarse grain reconfigurable logic blocks, for example to utilise the multipliers and adders in a Xilinx XtremeDSP block [151]. It is important to consider the precision for these blocks and the fine grained circuitry to optimise a design [18].
The discrepancy between the required clock rate (typically 100 MHz) and the maximum clock rate of coarse grain components (500 MHz for an XtremeDSP block [151]) presents a time-wise option for optimisation. The coarse grain component may be clocked at a multiple of the required global clock rate and used many times per global clock period. This is referred to as Time Division Multiplexing [106, 131]. Resource usage optimisation techniques in [18, 106, 131] are vital for algorithms with a large proportion of vector operations. Example 3.4 provides a working exemplification.


Figure 3.7: An Example Application of Time Division Multiplexing for 2D Convolution (using an XtremeDSP Slice Clocked at Two Times the Global Clock Rate)

Example 3.4: XtremeDSP Slice Utilisation for 2D Convolution

Consider the increasing multiply-add requirement of 2D convolution. A conventional adder tree arrangement for the case of N = 2 is shown in Figure 3.6(a), where ‘In’ is the input pixel window. Vertical dashed lines represent time-wise pipeline stages. In general, for each output pixel calculation, $N^2$ multiplies and $N^2 - 1$ additions are required. Figure 3.6(b) shows an arrangement in which $N^2$ XtremeDSP slices may be used to compute the kernel.
For large N, the scenario in Figure 3.6(b) may be undesirable because $O(N^2)$ coarse grain blocks are required. The following is considered. If the required

clock rate is $C_{req}$ and the maximum rate of an XtremeDSP slice is $C_{max}$, a total of only $N^2 \frac{C_{req}}{C_{max}}$ XtremeDSP slices is required. For $\frac{C_{req}}{C_{max}} = \frac{1}{2}$ the circuit in Figure 3.7 may be used in place of the XtremeDSP blocks in Figure 3.6(b). In this example the XtremeDSP slice and input multiplexors are clocked at twice the global clock rate. A typical value of $\frac{C_{req}}{C_{max}}$ is $\frac{1}{5}$, for $C_{req}$ equal to 100 MHz; in this case a five times reduction in XtremeDSP slices is possible.
For increased N the total XtremeDSP count in the target reconfigurable logic platform may still be exceeded. A further optimisation method is filter decomposition [18] to implement a fraction of the computation in slice logic at a lower numerical precision. In the best case scenario $O(2N)$ multipliers are required for a filter which can be separated into two 1D kernels. This is a property of bi-cubic resizing as shown in Equation 3.7. ¤
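As an illustrative instance of this saving (the figures here are assumptions, not results from the text), consider N = 9 with $\frac{C_{req}}{C_{max}} = \frac{1}{5}$:

$$N^2\,\frac{C_{req}}{C_{max}} = 81 \times \frac{1}{5} = 16.2 \;\Rightarrow\; 17 \text{ XtremeDSP slices},$$

compared with the $N^2 = 81$ slices required by the arrangement of Figure 3.6(b).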

A reconfigurable logic device’s support of the total number of memory accesses re- quired by an algorithm is dependent on the available on and off-chip bandwidth. On-chip bandwidth can be considered as almost unrestricted due to the availability of configurable reconfigurable slice logic [50]. Off-chip bandwidth is restricted by the number of external interconnects which is dependent on the package pin count and system specific. For video processing algorithms, the memory access pattern is often well defined pre-execution. This property is exploited in reconfigurable logic designs with raster scan 3.4 Implementation Challenges 75


(a) 2D Kernel Window Function (b) Separable Kernel with Variable Reuse

Figure 3.8: Data Reuse Strategies for Reconfigurable Logic Implementations

This property is exploited in reconfigurable logic designs with raster scan processing patterns. Horizontal (row by row) raster scanning is the most common because it mimics the progressive scanning of video frames. Vertical and striped patterns are alternative desirable patterns [57]. For all case study algorithms, with the exception of motion vector estimation, horizontal raster is the desired processing pattern.
An algorithm's memory intensity can be interpreted in reconfigurable logic design as the proportional requirement of on-chip memory resource or arithmetic units implemented in either slice logic or coarse grain blocks. For algorithms with a bias towards arithmetic operations, resource swapping can be used to balance resource usage. Resource swapping involves the substitution of arithmetic functionality with a lookup table approach (memory) implemented in embedded Block Select RAM (BRAM) [93].
The memory reuse options for a reconfigurable logic implementation are often an integral part of the datapath design. Example 3.5 shows a popular line buffer datapath.

Example 3.5: Line Buffer Datapaths for Data Reuse

Consider the examples of the window operations of 2D convolution and video frame resizing. A specialised datapath comprised of line buffers, which have a systolic data driven operation, is often used. For convolution the reuse potential is predefined for a given window size N. The reuse datapath takes the form of Figure 3.8(a). A non-separable 2D kernel is assumed, although in general this may be arbitrarily decomposed [18].

The second example of resizing requires a more complex reuse datapath to support variable data reuse potential. This can be summarised in Figure 3.8(b), for a separable 2D window. The crossbar switch and the shift of the horizontal 1D mask are controlled by a data reuse heuristic. For the resizing example, window size N equals four and the data reuse heuristic is determined by the resizing ratio in x and y dimensions. ¤
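A behavioural C++ model of the line buffer structure in Figure 3.8(a) is sketched below. The class name and interface are illustrative, and edge handling for the first N−1 rows and columns is deliberately simplified; in hardware each line buffer would be a BRAM-based shift register.

#include <vector>
#include <cstddef>

// N x N window generation over a horizontal raster-scan pixel stream.
template <int N>
class WindowBuffer {
    std::vector<std::vector<float>> lines; // N-1 delayed lines plus the current line
    std::size_t width, col = 0;
public:
    explicit WindowBuffer(std::size_t w)
        : lines(N, std::vector<float>(w, 0.0f)), width(w) {}

    // Push one pixel; on return 'window' holds the N x N neighbourhood ending
    // at the current pixel. Returns true once enough columns have been seen
    // (warm-up of the first N-1 rows is ignored in this simplified sketch).
    bool push(float pixel, float window[N][N]) {
        // Vertical shift: the oldest line's sample falls out of this column.
        for (int r = 0; r < N - 1; ++r) lines[r][col] = lines[r + 1][col];
        lines[N - 1][col] = pixel;

        bool valid = (col >= N - 1);
        if (valid)
            for (int r = 0; r < N; ++r)
                for (int c = 0; c < N; ++c)
                    window[r][c] = lines[r][col - (N - 1) + c];

        col = (col + 1) % width;
        return valid;
    }
};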

For more complex reuse strategies, such as that required for the reference frame in motion vector estimation, a hierarchy of reuse levels is used. The buffer size at each reuse level, and the number of levels, depend on a trade-off between on-chip memory usage, in slice registers and BRAMs, and the reduction of off-chip memory bandwidth [78]. For a given reconfigurable logic platform, off-chip bandwidth is often the scarce resource. The result is that on-chip resources are maximally utilised. In summary, the technique leverages the on-chip bandwidth to minimise the required off-chip memory accesses.
In addition to the memory reuse potential, data dependencies have a large impact on the choice of datapath. The large design space of possible datapath choices results in a situation where an application specific datapath is often redesigned for each algorithm and implementation scenario. Two techniques, which are used to combat algorithm data dependence, are explained in Examples 3.6 and 3.7.

Example 3.6: Platform Based Design

The optimisation techniques presented above can be grouped as operation level parallelism techniques: a number of operations for one output are performed in parallel. Another degree of parallelism can be extracted by computing a number of outputs in parallel. This is known as iteration (or system) level parallelism. Algorithms with high data dependence require an astute use of both operation and iteration level parallelism to extract speed-up. One technique which facilitates both levels of parallelism is platform based design [114]. A particular example of the platform based design technique is the Sonic-on-chip architecture [114]. Sonic-on-chip is a plug-in architecture where multiple identical or unique processing elements can be implemented in parallel, with a restricted multi-dimensional communication hierarchy. Architectural flexibility is traded for ease of algorithm implementation. An example use of the Sonic-on-chip architecture, of relevance to this thesis, is for the 3-step non-full-search motion vector estimation case study [114]. Platform based design may be preferred over a specialised datapath design to reduce the design space whilst still extracting operation and iteration level parallelism. ¤

Example 3.7: Specialised Datapath

Histogram Equalisation represents a special case of data dependence. A contribution from all pixels must be accumulated together to calculate each output value. The extraction of parallelism in such a scenario requires a specialised datapath. The small number of outputs required for histogram generation, 256 bins in the case study example, means that each output can be computed in parallel. A lookup table can be used to de-multiplex increment commands to a vector of 256 accumulators [42]. This approach maximises both iteration and operation level parallelism. A further level of parallelism, which tackles the inter-output data dependence restriction, is that the generation and equalisation stages may be performed in parallel on successive frames. ¤
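The following C++ sketch models the accumulator array behaviourally. In the reconfigurable logic design all 256 accumulators update concurrently within a clock cycle; this software loop is sequential and serves only to illustrate the structure.

#include <array>
#include <cstdint>
#include <vector>

// Histogram generation: each 8-bit pixel value de-multiplexes an increment
// command to one of 256 parallel accumulators.
std::array<std::uint32_t, 256> histogram(const std::vector<std::uint8_t> &frame) {
    std::array<std::uint32_t, 256> bins{};  // vector of 256 accumulators
    for (std::uint8_t p : frame)
        ++bins[p];                          // de-multiplexed increment
    return bins;
}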

Table 3.6 summarises the implementation challenges and optimisation techniques for each target architecture relative to ‘The Three Axes of Algorithm Characterisation’.

3.4.3 The Implications for Video Processing Acceleration

A common observation in comparing reconfigurable logic to general purpose processor implementations is that a serially computed stream5 of operations and memory accesses is expanded into pipelined and parallel specialised datapaths. It is noticed that a graphics processor also exploits a large degree of parallelism in a specialised datapath, albeit in an architecture optimised for graphics rendering. The parallelism, datapath and memory access features of both the reconfigurable logic and graphics processor implementations pose exciting prospects for the acceleration of video processing algorithms.
The graphics processor and reconfigurable logic both support a wide spectrum of arithmetic operations [18, 45, 131, 151]. A graphics processor has a fixed level of parallelism determined by the instruction set architecture and number of fragment processor pipelines. For reconfigurable logic, the maximum degree of parallelism is dependent on the optimisations which can be applied [18, 131]. The parallelism is fundamentally limited by the available slices and coarse grain macro blocks. It is observed that both platforms have a constrained degree of compute resource which must be optimally utilised. As arithmetic complexity increases, the relevance of optimisations is more prominent and restraining. The performance capability with respect to changing arithmetic complexity is an important metric to determine the suitability of both platforms to algorithm acceleration.
An algorithm's memory access requirements can be addressed with on-chip storage and datapath organisation. For graphics processors an algorithm is transformed to best

5A small degree of parallelism is present for general purpose processors, in the form of MMX instructions and multi-core architectures; however, this is often negligible compared to reconfigurable logic designs.

represent the memory access requirements of graphics rendering [52, 103]. Inevitably the designer is limited by the graphics rendering optimised caching constraints [52, 94]. For reconfigurable logic a large design space of optimisations is possible to support the memory access requirements of an algorithm [18, 57, 78, 93]. It is desirable to explore the performance differences of the two platforms for differing memory access requirements so as to identify the strengths and weaknesses of the fixed graphics processor memory hierarchy.
The flexibility of the reconfigurable logic platform results in an approach to data dependence which aims to maximally introduce operation and iteration level parallelism [42, 114]. For the graphics processor, additional levels of inter-output dependence are introduced to exploit the parallelism which is available. The memory address space restrictions on a graphics processor provide an intuitive model for designing for data dependence. To explore how these architectural restrictions affect the performance of the graphics processor with respect to reconfigurable logic is an appropriate target.
Exploitation of iteration and operation level parallelism in reconfigurable logic and a graphics processor can be considered under a synonymous constraint. The product of iteration and operation level parallelism in reconfigurable logic is limited by fixed resource constraints. A graphics processor has a predetermined amount of parallelism in each dimension. For both platforms, an algorithm must be implemented in such a way as to efficiently utilise iteration and operation level parallelism. The programming model and architectural features used to extract parallelism for both platforms evidently have a large impact on overall performance. These parallelism capabilities may be compared with respect to Amdahl's Law to determine the maximum possible speed-up for each platform. This topic is continued in Section 4.9.2 as part of a summary of the architectural comparison in Chapter 4.

Characteristic | Graphics Processor | Reconfigurable Logic
Arithmetic Complexity (Math) | Vector Form of ISA Mathematics [45]; Coissue, Dual Issue [108] | CORDIC [12]
Arithmetic Complexity (Vector) | Vector Operations [108] | XtremeDSP Slice [151], Bit Width Optimisation [18], Time Division Multiplexing [106, 131]
Arithmetic Complexity (Branch) | Distributed Computation, Z-cull [45], Use Conditional ISA Operations Sparingly | Multiplex, Expand Loop, State Machine or Soft/Hard Core Processor (e.g. Xilinx PowerPC, Altera NIOS) [74]
Memory Access (Pattern) | Only Optimal for Localised Accesses | Raster Scan Pattern [57]
Memory Access (Number) | Multiple Rendering Passes [108], Multiple Input Textures / MRTs [108] | Leverage On-chip to Minimise Off-chip Bandwidth
Memory Access (Intensity) | Hierarchical Summation [103] | Optimised Resource Mapping [18, 93]
Memory Access (Reuse Potential) | Cache Blocking [52] | Data Reuse Exploration [78], Line Buffers
Data Dependence | Parallelise Non-dependent Min-terms | Customised Datapath [42], Platform Based Design [114]

Table 3.6: Techniques for Efficient Implementations on the Graphics Processor and Reconfigurable Logic

3.5 Summary

An algorithm characterisation scheme, comprising the three characteristics of arithmetic complexity, memory access requirements and data dependence, has been proposed. The scheme is named ‘The Three Axes of Algorithm Characterisation’, with the conceptual connotation that any algorithm resides in 3D space with contributions from each characteristic. It is observed that the range of proposed characteristics is well represented within a set of five case study algorithms. Conversely, the characterisation is seen to represent the characteristics of each algorithm as set out in Table 3.4. The combination of ‘The Three Axes of Algorithm Characterisation’ and the case study algorithms provides an attractive platform from which to explore the implementation challenges and performance of the graphics processor and reconfigurable logic.
A survey of the implementation challenges for the graphics processor and reconfigurable logic has shown an abundance of techniques and supporting literature, as summarised in Table 3.6. It is observed that although both platforms have extensive compute power, the support for varying memory access requirements and data dependence is distinctly different. The following observations and questions are noted to motivate the performance comparison in Chapter 4.

• The comparison of the graphics processor ISA with compute operations implemented in reconfigurable logic is considered to be fairer than a comparison of either ‘accelerator’ architecture with a general purpose processor. It is interesting to analyse the effect of different arithmetic complexity characteristics on the relative performance benefit for each architecture over a general purpose processor.

• A graphics processor has a restricted memory system designed specifically for graphics rendering. The investigation of the limitations of this memory system, for the variable memory access requirements of video processing applications, is desirable. In comparison to reconfigurable logic, one may ask whether any of the graphics processor memory system ‘restrictions’ provide it with a significant performance advantage for a particular set of algorithm characteristics.

• The presence of high data dependence in an algorithm poses significant implementation challenges for each architecture. Two well defined techniques of platform based design [114] and a specialised datapath [42] can be used in reconfigurable logic implementations. For a graphics processor, the approach is not well defined in the literature. It is intriguing to contrast the author's approach, as summarised in Section 4.7.1, with the strategy for reconfigurable logic in Examples 3.6 and 3.7.

Intentionally, the observations and questions are grouped by ‘The Three Axes of Algorithm Characterisation’. This provides structure to the comparison in Chapter 4.

Chapter 4

Reconfigurable Logic and Graphics Processor Comparison

In Chapter 3, ‘The Three Axes of Algorithm Characterisation’ scheme was presented and applied to five case study algorithms from the video processing application domain. The aim of this Chapter is to use this characterisation as part of a structured approach to the comparison of reconfigurable logic and the graphics processor. To demonstrate the approach, the case study algorithms are used as benchmarks.
The work is motivated by the following goal: “To determine a structured approach, based on algorithm characterisation, to choosing an architecture which is optimal for a given application”. The desired process is via, or alternatively in the absence of, an assessment of the implementation challenges presented in Section 3.4. It is hoped that this work will provide a useful insight into the challenges posed by, and make significant steps towards, achieving this goal.
A practical application of the comparison results is to provide a heuristic by which one may choose between alternate accelerators for a target system and application domain. An alternative application is to provide a theoretical basis for the separation of a set of algorithms for selective implementation on either device in a co-design environment equivalent to [138].
The Chapter is arranged in the following manner. In Section 4.1, an overview of throughput performance for five case study algorithms is presented. The performance comparison challenge and formulation is explained in Section 4.2. Section 4.3 contains a description of the experimental setup. The comparison is executed in Sections 4.4 to 4.7 (a description of the structure of this comparison is provided at the end of Section 4.1). In Section 4.8, the validity of the comparison for wider application is justified. The key findings from the comparison are summarised in Section 4.9.

The key contributions of this Chapter are as follows:

1. The first comparison of the graphics processor and reconfigurable logic for the implementation of video processing algorithms. To the author's knowledge, the work presented here, and with co-authors in [32], is chronologically the first comparison of the two devices for any application domain.

2. A quantitative formulation of the comparison challenge based on three throughput drivers of clock speed, iteration level parallelism and cycles per output (Section 4.2). The formulation is an adaptation of work by Guo et al. [53] to include the specific challenges and interpretations of the comparison of a graphics processor and reconfigurable logic. This enables an analysis of the intrinsic performance benefits of each device relative to the three throughput drivers in Section 4.4.

3. The demonstration of the performance impact of video processing algorithm characteristics, from ‘The Three Axes of Algorithm Characterisation’, on device performance. Each axis is considered in turn, with the following contributing points.

(a) A comparison of the factors which result in two orders of magnitude speed-up, for each device, over a general purpose processor for computationally intensive algorithms (Section 4.5).
(b) The on-chip and off-chip memory access requirements for each device are compared and the limitations quantified in Section 4.6.
(c) A novel strategy for handling data dependence in algorithms, for implementation on the fixed graphics processor architecture, is presented in Section 4.7.

4. The formulation from contribution 2 and the format of the approach taken in contribution 3 are together a structured approach to the comparison of reconfigurable logic and the graphics processor. To the author's knowledge this work also presents the first structured comparison approach for these architectures.

As a product of the above analysis of performance challenges, the graphics processor implementations of a number of the case study algorithms are optimised. The key contributory implementation achievements are:

1. The performance of the primary colour correction case study is improved by 2.6 times over unoptimised code from [123].

2. Histogram equalisation and motion vector estimation (MVE) case study implementations are both novel contributions. Real-time performance for the 1080p30 and 720p24 video formats is achieved for optimised implementations of each algorithm respectively. For MVE the performance was improved by over ten times, from 119 ms (direct implementation) to just 11 ms (optimised) to process one 720p video frame.

[Figure 4.1 is a bar chart of throughput, in millions of pixels per second (MP/s), for the Spartan-3 400, Virtex-II Pro, Virtex-4, GeForce 6800 GT and GeForce 7900 GTX devices across the case studies: primary colour correction; 2D convolution (5×5, 9×9 and 11×11); video frame resizing (interpolation and decimation); histogram equalisation; and 3-step NFS-MVE.]

Figure 4.1: A Summary of Throughput Performance for Sample Graphics Processor and Reconfigurable Logic Devices

4.1 Setting the Scene: A Performance Summary

A summary of the throughput performance of the five case study algorithms from Chapter 3 is shown in Figure 4.1. The performance is quoted as the millions of pixels per second output from each implementation. An exception is the 3-step non-full-search motion vector estimation (NFS-MVE) case study, where performance is quoted relative to the number of input search frame pixels processed. This is due to the coarse granularity of the number of output motion vectors relative to the output frame sizes of the other case studies.
The graphics processor exhibits a large degree of variance in throughput performance over the case study algorithms. In contrast, reconfigurable logic implementations have a more uniform performance trend. This is due to two factors: firstly, the exploitation of arbitrary datapath pipelining and parallelism in reconfigurable logic implementations; secondly, the throughput rate of a graphics processor is fundamentally constrained by the temporal processing nature of software implementations. Other factors include the behaviour of the memory subsystem and the clock speed of each device.
For the primary colour correction case study, high performance is achieved for both devices. This is because the algorithm is amenable to the exploitation of large degrees of parallelism. A low memory access requirement, relative to arithmetic operations, and the absence of coarse grained data dependence, make it reasonable to classify primary colour correction as a ‘compute bound’ algorithm.
Graphics processor performance for 2D convolution degrades with window size due to an increasing instruction count and a specific memory system limitation. This limitation

is that no local data reuse, between output pixels, is facilitated within a single fragment processing pipeline. The result is a fixed clock cycle ‘cost’ for each memory access. A reconfigurable logic implementation has a consistent throughput rate due to the increasing exploitation of spatial parallelism. This is at a ‘cost’ of increased computational and on-chip memory resource requirements. The performance results in Figure 4.1 indicate that the graphics processor implementation of 2D convolution is memory bound.
For video frame resizing, the performance results in Figure 4.1 are an average over sample interpolation and decimation ratios from Section 4.6.4. The graphics processor has a superior performance of up to three times over reconfigurable logic implementations. Graphics processors are suited to algorithms with arbitrary, but localised, data reuse patterns over a small window size because this arrangement mimics graphics rendering. A 2D bi-cubic resizing algorithm requires a window of 4 × 4 pixels per output with arbitrary localisation (overlap). The algorithm appears more, or indeed less, like a graphics rendering task for changing resizing ratios. As a result performance also varies with this ratio. Decimation throughput for the graphics processor is, on average, 1.6 times lower than interpolation throughput due to this factor.
A reconfigurable logic implementation of resizing is performance limited by the datapath delay of the complex heuristic which is required to implement a flexible data reuse strategy. In the case of decimation, performance is further bound by the ratio of the input to output bandwidth requirement, as discussed in Section 4.6.4.
The Virtex-4 reconfigurable logic implementation of histogram equalisation (HE) outperforms the GeForce 7900 GTX graphics processor by 1.5 times. The difference is due to the large memory access and processing overhead of the complex multi-pass graphics processor implementation. A larger performance difference is not observed due to the large degree of parallelism and high off-chip bandwidth of the GeForce 7900 GTX. The older GeForce 6800 GT shows significantly lower performance, up to 4 times lower, relative to similar generations (Virtex-II Pro/Spartan-3) of reconfigurable logic devices.
Whilst the graphics processor implementation of histogram equalisation rivals the performance of reconfigurable logic, graphics processor performance for motion vector estimation is a factor of three times worse for the GeForce 7900 GTX and up to ten times worse for the GeForce 6800 GT. The performance difference is shown in this work to be correlated with the data dependence properties of the motion vector estimation case study.
The Spartan-3 device is considered a budget reconfigurable logic device from Xilinx Inc. It is observed that this device rivals the performance of the older GeForce 6800 GT graphics processor, but is surpassed for all but the motion vector estimation algorithm by the GeForce 7900 GTX. The ability to implement a customised datapath in reconfigurable logic is observed to be inherently beneficial, over the high speed but fixed graphics processor pipeline, for motion vector estimation.

Further analysis in this Chapter is concerned with exploration and quantification of the underlying factors behind the performance values in Figure 4.1. The throughput analysis focuses primarily on the Virtex-4 and GeForce 7900 GTX. These are from a similar time-wise generation of 90nm technology devices (as shown in Figure 1.1) and are used as test vehicles for a general comparison of the benefits of each architecture. From this initial performance comparison the following questions motivate further investigation. These questions are algorithm specific examples from the observations made in Section 3.5, and are used in Section 4.9.3 to assess the success of the comparison.

1. What is the quantitative relationship between factors which result in a high throughput for graphics processor and reconfigurable logic implementations of the ‘compute bound’ primary colour correction algorithm? (This question is concurrent with observation 1 in Section 3.5.)

2. The graphics processor performance degrades sharply with window size for 2D convolution. For a memory bound algorithm, what is the clock cycle overhead of each memory access? In addition, how does this overhead compare to the scenario for reconfigurable logic of increased resource requirements? (This question and question 3 collectively concur with observation 2 in Section 3.5.)

3. What factors affect variability in throughput performance for resizing algorithms? Put more generally, how does variable data reuse affect performance for each architecture? It is desired to establish the degree to which these variations affect graphics processor cache performance, and further, to consider how similar challenges are approached in a specially designed reconfigurable logic datapath.

4. A graphics processor implementation of motion vector estimation has a noticeably low performance. What are the properties of the motion vector estimation algorithm which result in this low performance? Conversely, what different characteristics does the histogram equalisation case study have which result in a good performance? (This question is concurrent with observation 3 in Section 3.5).

To address the above questions the Chapter continues as follows. A quantified speed-up representation is proposed in Section 4.2.2. Throughput rate drivers are quantified in Section 4.4. In Section 4.5, arithmetic complexity is considered to answer question 1. Questions 2 and 3 are addressed through an exploration of memory access requirements in Section 4.6. In Section 4.7, question 4 is solved through considering the performance implications of data dependence. In summary, the proposed structured approach is: the general architectural features are considered in Section 4.4; and in Sections 4.5, 4.6 and 4.7 each of ‘The Three Axes of Algorithm Characterisation’ is separately analysed.

4.2 Formulating the Performance Comparison

To formalise the performance comparison this Section is arranged as follows. The key drivers which affect throughput are explained in Section 4.2.1. Section 4.2.2 expresses a quantitative relationship of these ‘drivers’ for the comparison of alternate devices. A disassembly of the graphics processor instructions which are required to implement the case study algorithms is discussed in Section 4.2.3 and is referred to throughout the Chapter.

4.2.1 Throughput Rate Drivers

The peak throughput rate of a video processing system was defined in Section 3.1.1 as the maximum number of pixels that can be input, output and processed per second. To understand the architectural features which influence the peak throughput rate for a target device, the relationship in Equation 4.1 can be used. The peak throughput rate ($T_{peak}$) is proportional to three throughput drivers: iteration level parallelism ($ItLP$), clock rate ($C$) and the inverse of cycles per output ($CPO$). Cycles per output is the average number of clock cycles between each output for a single, iteration level, pipeline.

$$T_{peak} \propto \frac{ItLP \times C}{CPO} \qquad (4.1)$$

General purpose processors are designed with a very high clock rate (C) to implement a large instruction set of operations in an approximately serial manner. A clock rate six to fifteen times faster than reconfigurable logic implementations [53], and six to eight times faster than a graphics processor [100], is typical of a general purpose processor. For the case study algorithms presented in this Chapter, graphics processors have a clock rate of the order of two to six times higher than the achieved clock speed for reconfigurable logic implementations. Clock rate is compromised for flexibility in reconfigurable logic.
Graphics processor iteration level parallelism is equal to the number of parallel fragment processing pipelines. For example, the nVidia GeForce 6800 GT and 7900 GTX cores have 16 and 24 parallel pipelines respectively. In comparison, a general purpose processor has an iteration level parallelism of one1. A reconfigurable logic implementation is a trade-off between iteration level parallelism and operation level parallelism to suit algorithmic requirements within a fixed resource constraint.
The number of cycles per output in a reconfigurable logic implementation is typically one to two orders of magnitude lower than that for a processor [53]. There are two reasons for this: reconfigurable logic requires no instruction fetch and decode overheads; and operation level parallelism can be arbitrarily exploited (within a resource constraint). A worked example of Equation 4.1 follows below.
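As an illustration of Equation 4.1, with assumed round-number values rather than measurements: a reconfigurable logic design with $ItLP = 1$, $C = 100\,$MHz and $CPO = 1$ yields

$$T_{peak} \propto \frac{1 \times 100 \times 10^6}{1} = 100\,\mathrm{MP/s},$$

while a graphics processor with $ItLP = 24$, $C = 650\,$MHz and $CPO = 156$ yields $\frac{24 \times 650 \times 10^6}{156} = 100\,\mathrm{MP/s}$: the same throughput reached through very different drivers.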

1This comment assumes single core processors. At time of print state of the art general purpose processors have up to four cores; however, this is still small with respect to a graphics processor.

Device | Clock Rate | Cycles per Output | Iteration Level Parallelism
General Purpose Processor | High | Low | Low
Graphics Processor | Medium | Medium | Medium–High
Reconfigurable Logic | Low | High | High (Flexible)

Table 4.1: The Relative Strengths of Target Devices for Throughput Rate Drivers

Reconfigurable logic implementations can achieve a high peak throughput through the astute use of operation level parallelism. This results in a low number of cycles per output. Moderately sized reconfigurable implementations often have the property that the number of cycles per output and the degree of iteration level parallelism both equal one. In this scenario, the peak throughput equals the clock rate. The graphics processor has a lower cycles per output count than a general purpose processor due to a reduced instruction overhead; this is explored further in Section 4.5.2.
Table 4.1 summarises the strengths of the three devices in achieving a high peak throughput. The iteration level parallelism of the graphics processor is quoted as medium–high to identify that although it is large, a reconfigurable logic design may in general exploit a greater degree of iteration level parallelism. This is exemplified in the histogram generation stage of the histogram equalisation case study, as explained in Example 3.7 in Section 3.4.2, in which 256 output values are computed in parallel ($ItLP = 256$). It is observed that the graphics processor provides a trade-off between the general purpose processor and reconfigurable logic with respect to all three factors.

4.2.2 Analysis of the Factors which Affect Speed-up

In Table 4.1 the three throughput performance drivers of a target device are presented. These factors can be used to quantify the relative performance, commonly referred to as speed-up, when using a graphics processor, reconfigurable logic or a general purpose processor. A quantitative representation of the speed-up is developed in this Section. The analysis is an adaptation of work by Guo [53], who presents an analysis of the speed-up factors of reconfigurable logic over processors. This is summarised below.
Guo abstracts the relative clock rates of devices from speed-up analysis to consider clock cycle efficiency in isolation. The total numbers of clock cycles required for a general purpose processor and reconfigurable logic are as shown in Equations 4.2 and 4.3 respectively. Table 4.2 summarises the terms used here and then throughout this Chapter.

Term | Definition
$Cycle_{xx}$ | Number of clock cycles per iteration for device xx
$N_{iter}$ | Number of iterations (outputs)
$Instr_{iter:xx}$ | Total number of instructions per iteration for device xx
$Instr_{sppt:xx}$ | Number of support instructions (program counter (PC), flow control, data load/store) for device xx
$Instr_{oper:xx}$ | Number of calculation related instructions for device xx
$Ineff_{xx}$ | Instruction inefficiency for device xx, as shown in Equation 4.5
$Instr_{flow:gpu}$ | Number of flow control instructions for the graphics processor
$Instr_{mem:gpu}$ | Number of memory access instructions for the graphics processor
$ItLP_{xx}$ | Iteration level parallelism for device xx
$CPI_{xx}$ | Average clock cycles per instruction for device xx
$CPO_{xx}$ | Average clock cycles per output for device xx (for processors $CPO_{xx} = CPI_{xx} \times Instr_{iter:xx}$)
$Speedup_{xx \to yy}$ | Speed-up of device xx over yy

Table 4.2: A Summary of Terms and Definitions for a Qualitative Analysis of the Relative Speed-up of Devices

For a general purpose processor the total number of clock cycles is the product of the number of instructions per iteration, the number of required iterations (outputs), and the average number of clock cycles per instruction. The number of cycles for reconfigurable logic, assuming one input and one output per cycle per parallel iteration, is the number of iterations divided by the iteration level parallelism.

$$Cycle_{cpu} = Instr_{iter:cpu} \times N_{iter} \times CPI_{cpu} \qquad (4.2)$$
$$Cycle_{fpga} = \frac{N_{iter}}{ItLP_{fpga}} \qquad (4.3)$$

A measure for the relative speed-up of reconfigurable logic devices over a general purpose processor is shown in Equation 4.4 as the ratio of the clock cycle requirements.

$$Speedup_{fpga \to cpu} = \frac{Cycle_{cpu}}{Cycle_{fpga}} \qquad (4.4)$$

The general purpose processor instruction count per iteration ($Instr_{iter:cpu}$) is separated into instructions relating to support operations ($Instr_{sppt:cpu}$) and those related directly to computation ($Instr_{oper:cpu}$). Equation 4.5 expresses instruction inefficiency

($Ineff_{cpu}$), which is the fraction of instructions not related directly to computation.

General Purpose Processor | Graphics Processor
Large Overhead in Extracting Parallelism from an Algorithm | Parallelism is Inherent to the Fixed Datapath of Multi-Processors
Instruction Inefficiency (Program Counter (PC), Flow Control, Data Load/Store) | SPMD (shared PC), Increased Flow Control Overheads, Data Load/Store Hidden by Multi-Threads (Sometimes)
Large Instruction Count | Specialised Vector ISA
High Ratio of Cache to FPU‡ (Low Computational Density) | Low Ratio of Cache to FPU‡ (High Computational Density)

‡FPU is an abbreviation of floating point unit

Table 4.3: The Benefits of a Graphics Processor over a General Purpose Processor

Through substitution of Equations 4.2 and 4.3 and the instruction inefficiency factor, the speed-up of reconfigurable logic is given in Equation 4.6.

$$Ineff_{cpu} = \frac{Instr_{iter:cpu}}{Instr_{oper:cpu}} = \frac{Instr_{sppt:cpu} + Instr_{oper:cpu}}{Instr_{oper:cpu}} \qquad (4.5)$$
$$Speedup_{fpga \to cpu} = ItLP_{fpga} \times Instr_{oper:cpu} \times Ineff_{cpu} \times CPI_{cpu} \qquad (4.6)$$
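For example, under assumed instruction counts of $Instr_{sppt:cpu} = 20$ and $Instr_{oper:cpu} = 40$, Equation 4.5 gives $Ineff_{cpu} = \frac{20 + 40}{40} = 1.5$; with $ItLP_{fpga} = 4$ and $CPI_{cpu} = 1$, Equation 4.6 then gives

$$Speedup_{fpga \to cpu} = 4 \times 40 \times 1.5 \times 1 = 240,$$

excluding, as throughout this formulation, the ratio of device clock rates.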

The speed-up measure in Equation 4.6 expresses three advantages of reconfigurable logic. Firstly, the capability to run multiple pipeline instances concurrently, the iteration level parallelism ($ItLP_{fpga}$). Secondly, the number of operation specific instructions

($Instr_{oper:cpu}$) for a general purpose processor is often higher than the required ALU operations in a reconfigurable logic implementation. This is due to the requirement to map an algorithm to the fixed processor ISA, and further because some functions, for example a divide by two operation, require no functional implementation in reconfigurable logic. Thirdly, the relative inefficiency of a general purpose processor due to a need for ‘support’ instructions, which include program counter manipulation and processor flow control.
Guo's analysis provides a useful starting point for the comparison of the graphics processor and reconfigurable logic. However, a number of differences between a general purpose processor and a graphics processor must be addressed. A summary of key general purpose processor performance issues, specific to the implementation of video processing algorithms, is shown in column one of Table 4.3. In the second column, a summary of the differences, mostly beneficial, for a graphics processor architecture is shown.
Instruction inefficiency requires further explanation. The graphics processor's single program multiple data (SPMD) architecture model addresses the overhead of program counter manipulation. However, the SPMD model also introduces increased flow control overheads. Of particular concern is conditional branching due to a large degree of multi-threading. An estimated 1500 output threads are computed in an effectively simultaneous

manner over the processing elements. If pixels which are computed in the same batch require different conditional branches then both branches must be computed.
The data load/store instruction inefficiency overhead for the general purpose processor in Table 4.3 can be reduced on the graphics processor through multi-threading. The latency of a particular data fetch is ‘hidden’ behind the processing or data fetches for other pixels within the same batch. A large proportion of arithmetic to memory access operations is required (memory intensity must be low) for data load overheads to be fully amortised. Accesses must also be localised to minimise cache misses.
From initial consideration of the benefits in Table 4.3, the graphics processor appears a more favourable platform for implementing video processing algorithms than the general purpose processor. In similarity to reconfigurable logic, the graphics processor can exploit a large degree of iteration level parallelism, instruction inefficiency issues are overcome and instruction count ($Instr_{oper:xx}$) is minimised. The flow control overhead of a graphics processor is sympathetic to the pipeline stalls and delays observed when implementing input dependent flow control in reconfigurable logic pipelines.
Equivalent formulations to Equation 4.6, for the factors affecting the speed-up of the graphics processor over a general purpose processor, and of reconfigurable logic over a graphics processor, are now derived. The latter is chosen arbitrarily and indeed in some instances the graphics processor may exceed the performance of reconfigurable logic. In this case the speed-up will be less than one and actually a ‘slowdown’. In similarity to Guo's formulation, the ratio of clock speed is removed. This factor is considered in Section 4.4.1 in isolation from the clock cycle analysis derived below.
First, an expression for the number of cycles per iteration for a graphics processor must be reasoned. A graphics processor can be referred to as a multi-processor system2, so each ‘processor’ is considered a pipeline of iteration level parallelism ($ItLP_{gpu}$). Equation 4.7 shows the representation of the number of cycles for the graphics processor, which includes the iteration level parallelism. The remainder of Equation 4.7 is familiar from the general purpose processor cycle definition in Equation 4.2.

$$Cycle_{gpu} = \frac{Instr_{iter:gpu} \times N_{iter} \times CPI_{gpu}}{ItLP_{gpu}} \qquad (4.7)$$

The interpretation of the iteration level parallelism factor for a graphics processor must be considered with care. For computationally bound algorithms, the regular nature of the graphics processor architecture means that performance can be assumed to scale linearly with iteration level parallelism. For memory bound algorithms,

2The parallel pipelines in the fragment processing stage of a graphics processor are closely coupled, with a synchronised SPMD operation. However, for the purpose of this analysis the fragment pipelines may be considered as individual processors.

the memory hierarchy also affects performance. Despite this limitation the relationship in Equation 4.7 is deemed suitable for this analysis.
The inefficiency factor for the graphics processor is the same as that for the general purpose processor in Equation 4.5, with gpu substituted for cpu. A key difference in the interpretation of inefficiency is that graphics processor support instructions include only memory read and flow control operations. The ‘swizzle’ (MOV) flow control operations, which are included in this factor, transfer data between processor registers and handle the reordering of vector components. Swizzle operations are often ‘free’ (do not increase the number of clock cycles per output) on a graphics processor [108].
A graphics processor's speed-up over general purpose processors is shown in Equation 4.8. This is the ratio of graphics processor ($Cycle_{gpu}$) to general purpose processor

($Cycle_{cpu}$) clock cycles. The graphics processor speed-up factors over a general purpose processor can occur from reducing instruction inefficiency (a fixed processor pipeline), instructions per operation (ISA efficiency) and clock cycles per instruction (increased operation level parallelism), and from increasing iteration level parallelism (multiple fragment pipelines).

$$Speedup_{gpu \to cpu} = ItLP_{gpu} \left(\frac{Ineff_{cpu}}{Ineff_{gpu}}\right) \left(\frac{Instr_{oper:cpu}}{Instr_{oper:gpu}}\right) \left(\frac{CPI_{cpu}}{CPI_{gpu}}\right) \qquad (4.8)$$

Equations 4.9 to 4.11 show stages of the derivation of the speed-up of reconfigurable logic over the graphics processor. A setup where the number of clock cycles per output equals one for reconfigurable logic is initially assumed. In this setup the relationship in Equation 4.11 factors the effects of instruction inefficiency, cycles per instruction, number of instructions and iteration level parallelism in isolation.

$$Speedup_{fpga \to gpu} = \frac{Cycle_{gpu}}{Cycle_{fpga}} \qquad (4.9)$$
$$Speedup_{fpga \to gpu} = Instr_{iter:gpu} \times CPI_{gpu} \times \left(\frac{ItLP_{fpga}}{ItLP_{gpu}}\right) \qquad (4.10)$$
$$Speedup_{fpga \to gpu} = Ineff_{gpu} \times Instr_{oper:gpu} \times CPI_{gpu} \times \left(\frac{ItLP_{fpga}}{ItLP_{gpu}}\right) \qquad (4.11)$$

If the reconfigurable logic implementation has a cycles per output value which is not equal to one, for example in the motion vector estimation case study, the relationship is extended as shown in Equation 4.12. The numerator of the first fraction is the graphics processor clock cycles per output ($CPO_{gpu}$).

$$Speedup_{fpga \to gpu} = \left(\frac{Ineff_{gpu} \times Instr_{oper:gpu} \times CPI_{gpu}}{CPO_{fpga}}\right) \times \left(\frac{ItLP_{fpga}}{ItLP_{gpu}}\right) \qquad (4.12)$$
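A direct C++ transcription of Equation 4.12 is given below; the parameter values in the usage example are hypothetical placeholders chosen only to show the mechanics, not measurements from this Chapter.

#include <cstdio>

// Equation 4.12: clock-rate differences are excluded, as in the formulation above.
double speedup_fpga_over_gpu(double ineff_gpu, double instr_oper_gpu,
                             double cpi_gpu, double cpo_fpga,
                             double itlp_fpga, double itlp_gpu) {
    return (ineff_gpu * instr_oper_gpu * cpi_gpu / cpo_fpga)
         * (itlp_fpga / itlp_gpu);
}

int main() {
    // Hypothetical example: Ineff_gpu = 1.5, 40 operational instructions,
    // CPI_gpu = 1, CPO_fpga = 1, ItLP_fpga = 4 and ItLP_gpu = 24.
    std::printf("speed-up = %.2f\n",
                speedup_fpga_over_gpu(1.5, 40.0, 1.0, 1.0, 4.0, 24.0));
    return 0;
}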

The speed-up factors presented above formulate the performance differences between the graphics processor, reconfigurable logic and the general purpose processor. The primary concern is the performance difference between the graphics processor and reconfigurable logic. Equations 4.11 and 4.12 are therefore the key results. The speed-up factors over a ‘host’ general purpose processor are shown in Equations 4.6 and 4.8. These formulae express the justification for using each device as an accelerator.

4.2.3 Instruction Disassembly for the Graphics Processor

A summary of the instruction breakdown and inefficiency measure for implementations of the five case study algorithms on the graphics processor is shown in Table 4.4. The number of swizzle (MOV) operations is included in the total ($Instr_{iter:gpu}$) and support

($Instr_{sppt:gpu}$) instruction counts. The bi-cubic resizing, histogram equalisation and motion vector estimation case studies require multi-pass algorithms. The breakdown of instructions for each pass is discussed in detail in Sections 4.6 and 4.7.
The balance of instruction count in Table 4.4 approximates the characterisation shown in Table 3.4. Notable deviations are as follows. A difference occurs for primary colour correction. Flow control is hard coded as a conditional set operation in the arithmetic logic unit (ALU) of the graphics processor. This optimisation removes input dependent flow control at the cost of implementing all possible branches. The optimisation is beneficial because the flow control selects between the computation of alternate colour components. The vector nature of the graphics processor ISA results in a minimal additional overhead from computing all branches. It is not possible to characterise architecture dependent algorithm optimisations at the level of abstraction considered in ‘The Three Axes of Algorithm Characterisation’. This is a limitation of a characterisation scheme which is applied pre-implementation.
Motion vector estimation and histogram equalisation have a larger ALU and memory request requirement than predicted in the characterisation in Table 3.4. This is due to the overhead required to satisfy the data dependence in the two algorithms. The effect observed here can be predicted from the characterisation: if coarse grained data dependence features occur, an increased instruction overhead, and a number of memory accesses proportional to the number of extra rendering passes, are required. The reconfigurable logic implementations of the case study algorithms were similarly observed, in Section 3.4.2, to be well matched to the characterisation scheme.

[Table 4.4 lists, for each case study algorithm and each of its rendering passes: the number of memory requests; ALU (vector) operations; math operations; flow control instructions; the total, support and ‘operational’ instruction counts; and the resulting inefficiency measure $Ineff_{gpu}$.]

Table 4.4: Instruction Disassembly for Graphics Processor Implementations of each Case Study Algorithm

4.3 Experimental Setup

The experimental setup for each of the target devices is described below alongside the throughput measurement assumptions which are made.

Reconfigurable Logic Experimental Setup

For reconfigurable logic implementations, Xilinx FPGA devices are used. Solutions are coded using the VHSIC Hardware Description Language (VHDL) and implemented using the Xilinx ISE 8.2i design flow. The post place and route clock speed, which is determined from the maximum datapath delay using the ISE timing analyser tool, is quoted throughout this Chapter. High effort synthesis, mapping and routing options are chosen. Also, the results are verified using a test bench and high definition video samples [124].
Uniform input variables for the case study algorithms are chosen to be in 18-bit fixed point representation, and the design is bit width optimised throughout to less than ½ LSB error for each design block. This setup maximally utilises the functional bit width of embedded multiplier and XtremeDSP slices. It is also observed to provide aesthetically satisfying results over all case study algorithms.

Graphics Processor Experimental Setup

For the graphics processor, commercially available accelerator cards are used to test throughput performance. This setup means that the following assumptions and implementation decisions must be made with respect to timing measurement.

Graphics processor instruction code is compiled using the nVidia cgc compiler version 1.5. The GLSLProgram library, created by Jay Cornwall at Imperial College London, is used to create a layer of abstraction between C++ code and the OpenGL API. To verify the results, comparison is made to a 'golden' C++ model.

Video frames are uploaded as image textures to on-card video memory and the results rendered to 'off screen' render targets in the same video memory. This enables results to be read back to the host processor for verification. The typical 8-bit resolution of video colour data means that a 3-component GLubyte format is used for the uploading, on-card storage and read back of textures containing video frame data. For the storage of intermediate results in multi-pass graphics processor implementations, the GLfloat format is required to satisfy the dynamic range requirement. The storage format not only affects the time taken to upload and download frame data to the card, but also the cache and memory access behaviour between the graphics processor and local video memory.

Begin(Graphics Processor Timing Query)
  Begin(Quad Rendering)
    TextureCoord2f(0,0); VertexCoord2f(-1,-1);
    TextureCoord2f(0,Y); VertexCoord2f(-1,1);
    TextureCoord2f(X,0); VertexCoord2f(1,-1);
    TextureCoord2f(X,Y); VertexCoord2f(1,1);
  End(Quad Rendering)
  Wait Until Rendering Finished
Collect(Timing Query Result)

Figure 4.2: Pseudo-code for Timing Measurement on the Graphics Processor

Four factors, of upload, download, execution and pipeline setup time, contribute to the time taken for graphics processor computation. The upload and download time represent the transfer of textures (frames) to and from video card memory. These are determined by the speed of the bus interface between graphics card memory and the host processor local memory. Upload and download time is discounted from this analysis. The focus is instead on the graphics processor performance for any arbitrary system. The third factor is the 'execution' time of a single rendering pass. This is the primary timing factor of interest in this analysis. The final factor is the pipeline setup time. This combines the time taken by API calls, from a host processor, which include the pipeline reset control between multiple rendering passes. For multi-pass algorithms this factor is considered to indicate the multi-pass overhead. In the single-pass algorithms only the execution loop timing is relevant.

Execution loop timing is taken using an OpenGL timing query extension. The mean of the time taken for 200 rendering passes is calculated to increase the accuracy of performance results. This minimises the standard deviation of results. Figure 4.2 shows the rendering pass and timing setup. The variables X and Y are the number of columns and rows in a video frame respectively. The timing in Figure 4.1 includes execution and pipeline setup.
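The pseudo-code of Figure 4.2 corresponds, in outline, to the C++ fragment below. This is a hedged sketch rather than the thesis code: it assumes the GL_EXT_timer_query extension (the timing query extension of the period), a loader such as GLEW exposing its entry points, and an already-bound shader and render target. The quad vertices are reordered into winding order for GL_QUADS.

#include <GL/glew.h>   // assumes GLEW (or similar) exposes GL_EXT_timer_query

// One timed rendering pass in the style of Figure 4.2. A current OpenGL
// context is assumed; X and Y are the frame columns and rows.
// Returns the elapsed rendering time in nanoseconds.
GLuint64EXT timed_pass(float X, float Y) {
    GLuint query = 0;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED_EXT, query);  // Begin(Timing Query)
    glBegin(GL_QUADS);                         // Begin(Quad Rendering)
    glTexCoord2f(0, 0); glVertex2f(-1, -1);
    glTexCoord2f(X, 0); glVertex2f( 1, -1);
    glTexCoord2f(X, Y); glVertex2f( 1,  1);
    glTexCoord2f(0, Y); glVertex2f(-1,  1);
    glEnd();                                   // End(Quad Rendering)
    glEndQuery(GL_TIME_ELAPSED_EXT);

    glFinish();                                // wait until rendering finished
    GLuint64EXT ns = 0;                        // Collect(Timing Query Result)
    glGetQueryObjectui64vEXT(query, GL_QUERY_RESULT, &ns);
    glDeleteQueries(1, &query);
    return ns;
}

In the experiments, the mean over 200 such passes is taken.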

Host Processor Setup

A 3.0GHz Pentium 4 single core processor with hyper-threading enabled and 1GB of RAM is used as the 'host' processor. For the primary colour correction case study, performance results are compared with implementations on the 'host' processor. Implementations are coded in C++ and utilise the SSE (Streaming SIMD Extensions) instruction set extensions.

                                     Virtex-4   Virtex-II Pro   Spartan-3
Primary Colour Correction              296          200            139
5 x 5 2D Convolution                   347          227            122
9 x 9 2D Convolution                   326          184             80
11 x 11 2D Convolution                 321          199             25
Video Frame Resizing                   206          148            115
Histogram Equalisation                 172          116            105
3-Step NFS-MVE                         143          100             91
Min/Max Ratio to GeForce 7900 GTX   1.87/4.55    2.86/6.5       4.68/26
Min/Max Ratio to GeForce 6800 GT    1.01/2.45    1.54/3.5       2.52/14

Table 4.5: Clock Speed in Mega-Hertz (MHz) for Reconfigurable Logic Implementations of Case Study Implementations on Three Target FPGA Devices

4.4 Quantifying the Throughput Rate Drivers

The range of observed values for the throughput rate drivers for the case study algorithms is summarised in this Section. Analysis is divided into the comparison of the clock speed in Section 4.4.1 and the speed-up factors in Section 4.4.2.

4.4.1 Clock Speed Comparison

For any algorithm, a number of factors contribute to a reconfigurable logic implementation not achieving the vendor quoted maximum clock speed. These factors include fan out, logic and routing delays. Although it is not claimed that the case study implementations reported here are optimised for clock speed performance, sensible design decisions are made with respect to pipelining. A summary of the achieved clock rates is shown in Table 4.5 to exemplify reconfigurable logic performance. The ratio to the clock rate of two sample graphics processors is included.

For the Virtex-4 and GeForce 7900 GTX, up to a 4.55x clock speed advantage for the graphics processor over reconfigurable logic is observed. This occurs on the motion vector estimation case study. Evidently a higher clock rate does not always result in a higher throughput performance. For the remainder of the case study algorithms, more moderate clock speed differences are observed between graphics processors and reconfigurable logic devices.

A large difference in clock speed is observed for 11 x 11 convolution on the 'budget' Spartan-3 device. This occurs because the required multiplier count exceeds resource availability, and 99% slice resource usage is observed on the largest Spartan device. If the Spartan-3 device is neglected, the range of clock speed differences is a factor of 1.01 to 6.5 times higher for the graphics processor with respect to reconfigurable logic

designs. An average of 2.86 times clock speed benefit for the graphics processors over the reconfigurable logic designs is observed for the case study algorithms. This difference is comparable to the ratio observed by Kuon and Rose for 90nm dedicated ASICs over FPGAs, of up to five times [75]. A graphics processor may be considered to be a dedicated ASIC for the purpose of this clock speed comparison. It can be concluded that a graphics processor has a small but noticeable clock speed advantage over reconfigurable logic. This factor provides an intrinsic performance benefit of up to 4.55 times over the case study algorithms.

4.4.2 Cycles per Output and Iteration Level Parallelism

Table 4.6 shows the relationship between the performance of reconfigurable logic and the graphics processor with the clock speed difference from Section 4.4.1 removed. This representation shows the inherent architectural differences between the devices. The speed-up factor is included to show the product of the differences in clock cycles per output and iteration level parallelism. Despite a high degree of iteration level parallelism of the graphics processor

(ItLPgpu = 24 compared to ItLPfpga = 1) and a vectorised ISA, a speed-up of approximately two times is observed for the primary colour correction case study. The origin is the large cycles per output advantage of arbitrary operation level parallelism on a reconfigurable logic device.

For 2D convolution the speed-up for window size n² is approximately 0.07n². Reconfigurable logic presents a speed-up beyond a window size of n equals 3.77. The cycles per output advantage rises sharply due to the fixed memory access overhead in a graphics processor. A reconfigurable logic implementation benefits from a specialised datapath for data reuse and the arbitrary operation level parallelisation of the required multiply-adder tree.

Algorithm                     CPOgpu    CPOfpga   ItLPfpga   ItLPgpu   Speed-up
Primary Colour Correction       44.70       1         1         24       1.86
5 x 5 2D Convolution            36.90       1         1         24       1.54
9 x 9 2D Convolution           143.88       1         1         24       5.99
11 x 11 2D Convolution         230.21       1         1         24       9.59
Interpolation (average)         19.59       1         1         24       0.82
Decimation (average)            41.86     3.75        1         24       0.51
Histogram Equalisation         100.45       1       256,4       24       4.19
3-step NFS MVE               48533.33    1664         4         24       4.86

Table 4.6: A Speed-up (Speedupfpga→gpu) Comparison Between the GeForce 7900 GTX Graphics Processor and the Virtex-4 Reconfigurable Logic Device

A graphics processor implementation is superior for resizing even when the clock speed advantage is removed. For interpolation the speed-up factor is a minimal 1.2 times in favour of the graphics processor. For decimation the speed-up factor is two times in favour of the graphics processor. The discrepancy between input and output bandwidth requirement, for the reconfigurable logic implementation, averages 3.75 times over the decimation ratios considered in Section 4.6.4, i.e. CPOfpga = 3.75.

The iteration level parallelism figure for the reconfigurable logic implementation of histogram equalisation is 256 for the generation of a histogram and 1 for the application stage. The speed-up for the histogram equalisation implementation is four times, which is a result of the ten times iteration level parallelism benefit of reconfigurable logic for the generation step.

For 3-step non-full-search motion vector estimation, the speed-up factor of the reconfigurable logic implementation (taken from the Sonic-on-Chip platform [114]) is 4.86 times that of the graphics processor. Four separate processing element cores calculate motion vectors in parallel (ItLPfpga = 4). Each processing core requires 1664 clock cycles

(CPOfpga) to compute a motion vector. The graphics processor has an iteration level parallelism advantage of six times; however, this is surpassed by the thirty times clock cycle per output advantage of the reconfigurable logic implementation.

The clock cycle per output figures in Table 4.6 for histogram equalisation and motion vector estimation for the graphics processor do not include the multi-pass overhead. If this is included, the cycles per output increase to 153 and 86667 cycles respectively. With this factor included, the speed-up factor for reconfigurable logic increases to 6.4 times for histogram equalisation and 8.7 times for motion vector estimation. The overhead is more significant for motion vector estimation due to its nine pass implementation; histogram equalisation requires only five passes. Pipeline setup overheads have a large impact on performance, of up to a two times reduction for the motion vector estimation case study. Extra rendering passes must therefore be implemented with caution.

A reconfigurable logic implementation has a significant advantage in number of cycles per output of up to 230 times over the case study algorithms. The advantage of a graphics processor is observed to be in an up to 24 times higher iteration level parallelism. For all case studies, with the exception of resizing, a 1.5 to 9.6 times speed-up from reconfigurable logic implementations is observed over the graphics processor. Each axis of 'The Three Axes of Algorithm Characterisation' is now individually considered in the following three Sections.
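As a worked example of how the Table 4.6 entries combine, consider motion vector estimation. A sketch assuming the speed-up is the ratio of per-pipeline cycles per output, consistent with the formulation of Section 4.2.2, recovers the tabulated figure:

\[ \mathrm{Speedup}_{fpga \to gpu} = \frac{CPO_{gpu}/ItLP_{gpu}}{CPO_{fpga}/ItLP_{fpga}} = \frac{48533.33/24}{1664/4} = \frac{2022.2}{416} \approx 4.86 \]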

4.5 Axis 1: Arithmetic Complexity

The capacity of a graphics processor or reconfigurable logic to achieve a 'speed-up' for arithmetic operations, with respect to a general purpose processor, is determined by the operation level parallelism which can be extracted from an algorithm. It is fortunate that video processing algorithms are often amenable to large degrees of parallelisation. There are however constraints for exploiting parallelism on each device.

Reconfigurable logic is constrained by the efficiency with which an algorithm can be mapped onto the available computational resources. If the available resources are sufficiently numerous, a single output can be produced on each clock cycle for one pipeline of iteration level parallelism. Further speed-up is achieved through the implementation of multiple copies of pipelines (ItLPfpga > 1).

For the graphics processor, the fixed instruction set architecture (ISA) must be exploited efficiently. The designer may be viewed as being presented with a maximum number of clock cycles per output which must be efficiently utilised. This is referred to as the clock cycle 'budget.' The graphics processor clock cycle budget for a given number of fragment processing pipelines (ItLPgpu), core clock rate (Cgpu) and required throughput rate (Treq) is shown on the right hand side of Equation 4.13, which is a rearranged form of Equation 4.1.

CPOgpu < ItLPgpu × (Cgpu / Treq)    (4.13)

The clock cycle budget in Equation 4.13 may be considered as an upper bound on the number of clock cycles which are available to dedicate to instruction issue³, for each pixel (or more precisely fragment) in a fragment pipeline. Table 4.7 shows example clock cycle budgets for the GeForce 6800 GT and 7900 GTX graphics processors over sample video formats. It is observed that the GeForce 7900 GTX has a budget almost three times greater than the GeForce 6800 GT. This results from a 1.5x increase in iteration level parallelism and a 1.9x increase in device clock speed.

The arithmetic capabilities of the graphics processor and reconfigurable logic are isolated through considering the arithmetic intensive sub-blocks of the primary colour correction case study. In Section 4.5.1, the relationship between arithmetic operations and graphics processor cycle count is quantified. The factors affecting speed-up, specifically the processor instruction inefficiency issues, are explored in Section 4.5.2. In Section 4.5.3, the implications of this analysis for non-arithmetic intensive algorithms are considered.

³The term instruction issue is used to indicate that internal functional units have multiple pipeline stages, for example, of the order of 7–10 pipeline stages for a multiply-add functional unit [128].

Video      GeForce 7900 GTX   GeForce 6800 GT
Format          (G71)            (NV40-GT)
480p24          2384                856
576p25          1411                506
720p30           564                202
1080p30          251                 90

Table 4.7: Example Clock Cycle per Pixel Budgets for the GeForce 7900 GTX and GeForce 6800 GT Graphics Processors over a Subset of Video Formats
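The budgets in Table 4.7 follow directly from Equation 4.13. The sketch below reproduces the 720p30 entries; the pipeline counts and core clocks assumed here (24 pipelines at 650 MHz for the G71, 16 at 350 MHz for the NV40-GT) are consistent with the tabulated 564 and 202 cycle budgets.

#include <cmath>
#include <cstdio>

// Clock cycle budget per output pixel (Equation 4.13). The budget is an
// upper bound, so the result is floored. T_req is in pixels per second.
double budget(int itlp, double clock_hz, double pixels_per_second) {
    return std::floor(itlp * clock_hz / pixels_per_second);
}

int main() {
    const double t_720p30 = 1280.0 * 720.0 * 30.0;   // required throughput
    // Assumed device parameters: fragment pipelines and core clock rates.
    std::printf("G71:     %.0f cycles\n", budget(24, 650e6, t_720p30)); // 564
    std::printf("NV40-GT: %.0f cycles\n", budget(16, 350e6, t_720p30)); // 202
    return 0;
}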

4.5.1 Deterministic Graphics Processor Performance

It is interesting to consider the clock cycles per output (CPOgpu) for the graphics processor because, for arithmetic intensive algorithms, this factor can be approximated analytically. The nvshaderperf performance analysis tool provided by nVidia [99] automates such analysis. In this tool, only ISA related timing is considered, with other factors in the graphics processor pipeline omitted. This provides an upper bound on performance.

A performance estimate for arithmetic intensive algorithms, which is equivalent to the nvshaderperf tool, can be expressed as shown in Equation 4.14, where operational instructions (Instroper:gpu) are separated into arithmetic logic unit (Instralu) and math

(Instrmath) operations. The result (CPOgpu) can be substituted into Equation 4.13, in place of the maximum cycles per output, to calculate a throughput rate estimate.

CPOgpu = Instralu × CPIalu + Instrmath × CPImath + Instrsppt × CPIsppt (4.14)

From tests conducted with the GPUBench graphics processor benchmark [20], ALU instructions are estimated to consume on average 0.5 and 0.25 cycles (CPIalu) on the GeForce 6800 GT and 7900 GTX graphics processors respectively. Similarly, complex (math) instructions can be issued once per cycle for each graphics processor, i.e. CPImath equals one. Based on these figures, performance estimates are presented alongside actual performance figures for blocks of the primary colour correction case study in Figure 4.3. It is assumed that support instructions, which include swizzle operations and memory access, have a negligible effect on performance (CPIsppt ≈ 0).

The estimate provided by Equation 4.14 approximates the arithmetic intensive blocks satisfactorily. As a 'rule of thumb,' in the GeForce 7 the number of clock cycles per output is proportional to one times math and a quarter times arithmetic (whether vector or scalar) operations. For the GeForce 6, the proportion is one times math and a half times arithmetic operations. Deviations from the actual performance occur from graphics pipeline overheads and opportunities to exploit dual or co-issue instruction issue.
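A minimal sketch of this estimate, per Equation 4.14 and the per-instruction costs quoted above (support instructions taken as free), is as follows; the instruction mix values would be taken from a disassembly such as Table 4.4.

// Analytical cycles-per-output estimate (Equation 4.14), in the style of
// the nvshaderperf tool. Support instructions are assumed free, as above.
struct InstrMix { int alu; int math; int sppt; };

double estimate_cpo(const InstrMix& m, double cpi_alu) {
    const double cpi_math = 1.0;   // one complex instruction per cycle
    const double cpi_sppt = 0.0;   // assumption: swizzle/memory are free
    return m.alu * cpi_alu + m.math * cpi_math + m.sppt * cpi_sppt;
}

// Usage: estimate_cpo(mix, 0.25) for the GeForce 7900 GTX and
//        estimate_cpo(mix, 0.5)  for the GeForce 6800 GT.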

[Figure 4.3 plot: y-axis 'Throughput in Millions of Pixels per Second (MP/s)', 0 to 4000; x-axis blocks: Full Algorithm, Input Correct, Histogram Correct, Colour Balance, RGB to CxCyL, CxCyL to RGB; series: 3.0 GHz Pentium 4, Virtex II Pro, Virtex 4, NV40-GT (GeForce 6800 GT), G71 (GeForce 7900 GTX), G71 (Predicted), NV40-GT (Predicted).]

Figure 4.3: A Breakdown of the Peak Throughput for the Primary Colour Correction Case Study Algorithm Implemented on Sample Graphics Processors, Reconfigurable Logic Devices and a General Purpose Processor

4.5.2 Intrinsic Benefits over the General Purpose Processor ISA

Figure 4.3 also includes, in addition to the graphics processors, the performance of two reconfigurable logic devices and a general purpose processor. The speed-up factors between these three device choices are now discussed. It is desired to highlight two benefits of the graphics processor instruction set: firstly, the impressive speed-up which is observed over the general purpose processor; secondly, a performance which rivals or exceeds that of example reconfigurable logic implementations. The performance relationship is first discussed qualitatively and then quantified using the formulae from Section 4.2.2.

A Qualitative Assessment of Performance Differences

Format conversions between the RGB and cxcyL colour spaces comprise matrix-vector multiplication, vector addition and range limiting. A performance difference occurs between forward and backward transformation on the graphics processor due to inconsistent in-range limiting over the cxcyL colour space. The cx and cy chroma components have a different dynamic range to luminance (L). This is not suited to the instruction set, which is optimised to perform identical vector operations. The throughput of the format conversion blocks is similar for reconfigurable logic and the general purpose processor.

The performance difference between the input correct and colour balance blocks is also the most noticeable in graphics processor implementations. The colour balance block performance is lower due to the extra ALU logic which is required to implement flow control.

                         GPU                                 CPU
             Instroper   CPO    CPI    Ineff    Instroper   CPO    CPI    Ineff
Full             74      44.7  0.402   1.649      42.75     445   1.049   9.925
I Correct        31      14.3  0.432   1.194      18.25     113   1.022   6.058
H Correct        11      10.2  0.57    2.545       4.5      119   1.030  25.674
C Balance        36      17.2  0.35    1.667      20        128   1.022   6.262
RGB to cxcyL      8       5.2  0.58    1.500       7         31   1.065   4.158
cxcyL to RGB      6       4.3  0.61    1.333       5         26   1.022   5.087

Table 4.8: Instruction Set Architecture Efficiency of the nVidia GeForce 7900 GTX Graphics Processor relative to an Intel 3.0GHz Pentium 4 Processor

A general purpose processor implementation of the colour balance block is also slower than that of the input correct block. The difference is minimised due to an expanded implementation of flow control operations. On a general purpose processor, less overhead is required for flow control than on a graphics processor. The reconfigurable logic implementation has similar throughput for both input correct and colour balance blocks due to pipelined flow control.

The results show that the throughput for the colour correction blocks on the graphics processor is consistently the highest. Reconfigurable logic is slower than the graphics processor due to a lower clock speed and a one pixel pipeline implementation, that is

ItLPfpga = 1. The graphics processor and reconfigurable logic implementations both exceed the performance of the general purpose processor.

Graphics Processor Instruction Set Efficiency

To analyse the graphics processor instruction set, the performance of a Pentium 4 processor implementing the MMX/SSE extensions to the x86 instruction set is used as a benchmark. A summary of the instruction set efficiency features of the graphics processor, relative to a general purpose processor, is shown in Table 4.8. The key features are the operational instructions, cycles per instruction and the inefficiency factor of each processor, as defined in Section 4.2.2. Instruction disassembly for the Pentium 4 processor is taken from the Intel VTune 3.0 performance analyser [64].

The number of operational instructions required by the general purpose processor is lower than for the graphics processor. There are two reasons for this. Firstly, four pixels are processed in parallel on the general purpose processor, which fully utilises the 4-vector nature of SSE instructions. On the graphics processor output pixel values are calculated in isolation, and typically only three of the maximum four vector ALU inputs are utilised due to the three component nature of colour spaces.

                 Macro-Block Operations    Instroper:gpu     Instroper:cpu
                 (Reconfigurable Logic)    (Instriter:gpu)   (Instriter:cpu)
Full                      187                74 (122)        42.75 (424.31)
I Correct                  61                31 (37)         18.25 (110.56)
H Correct                  16                11 (28)          4.5 (115.53)
C Balance                  64                36 (60)         20 (125.23)
RGB to cxcyL               17                 8 (12)          7 (29.10)
cxcyL to RGB               29                 6 (8)           5 (25.44)

Table 4.9: Number of Macro-Block Operations versus Processor Instruction Count for Implementations of the Primary Colour Correction Algorithm

The second reason is that the graphics processor instruction set is smaller than that of the general purpose processor, so more than one ALU or math instruction may be required to implement the same operation. For example, when implementing a square root operation on a graphics processor, a reciprocal square root (RSQ) followed by a reciprocal (RCP) instruction is required.

Despite the increase in operational instructions, by approximately 50%, a two times decrease in cycles per instruction (CPI), and a twelve times decrease in cycles per output (CPO), is observed for the graphics processor over a general purpose processor. The difference in cycles per instruction is attributed to the large operation level parallelism of the graphics processor, highlighted in Section 4.5.1. A reduction of up to ten times in the 'Inefficiency' measure contributes the remainder of the difference.

Operations per Output Pixel

It is interesting to compare processor instruction count to the number of 'macro-block' operations required by a reconfigurable logic implementation. This is exemplified for the primary colour correction blocks in Table 4.9. Reconfigurable logic macro-block operations are counted as, for example, single add, multiply and comparison instantiations. Comparative processor operations are quoted as the number of operational instructions. A three to four times higher number of operations is required for reconfigurable logic than for the processors. This is due to the fine granularity of reconfigurable logic macro-blocks with respect to the vectorised ALU and math functional units in the processor architectures.

For a more precise comparison, the total instructions per iteration (Instriter:gpu, Instriter:cpu) are included in Table 4.9. A general purpose processor requires a total number of instructions up to four times higher than the number of reconfigurable logic operations. This is similar to observations made by Guo [53]. The total instruction count for a graphics processor is lower than the number of reconfigurable logic operations. This is due to the graphics processor's low instruction inefficiency, of up to ten times lower than that of the general purpose processor.

                    Speedupfpga→cpu   Speedupgpu→cpu   Speedupfpga→gpu
                                      (less ItLPgpu)   (CPIgpu × Ineffgpu)
Full Algorithm           445            283 (11.8)        1.86 (0.60)
Input Correction         113            190 (7.9)         0.59 (0.46)
Histogram Correct        119            281 (11.7)        0.42 (0.92)
Colour Balance           128            178 (7.4)         0.72 (0.48)
RGB to cxcyL              31            142 (5.9)         0.22 (0.65)
cxcyL to RGB              26            145 (6.0)         0.18 (0.72)

Table 4.10: Speed-up for Primary Colour Correction (ItLPfpga = 1, ItLPgpu = 24)

Speed-up over a General Purpose Processor

A breakdown of the speed-up factors for the primary colour correction case study is shown in Table 4.10. A graphics processor provides up to a 280 times speed-up over a general purpose processor. If the effect of iteration level parallelism is removed, up to twelve times speed-up is achieved. For reconfigurable logic, a peak of 445 times speed-up is observed over a general purpose processor. The speed-up is due to the high instruction inefficiency of the general purpose processor and a lower operation count relative to total processor instructions.

The speed-up of reconfigurable logic over the graphics processor is significantly less than that of each device over the general purpose processor. A range from a two times speed-up for reconfigurable logic to a five times slow down is observed. This is as expected, because both devices provide a large arithmetic operation speed-up over the general purpose processor. A high degree of iteration level parallelism and a low number of cycles per instruction on the graphics processor are a good match to the operation level parallelism of reconfigurable logic.

It is interesting to observe the product of CPIgpu and Ineffgpu, shown in brackets in the fourth column of Table 4.10. This product represents the average number of clock cycles taken per operational instruction (Instroper:gpu). It is observed that for each colour correction block the graphics processor produces up to two computational results per clock cycle per iteration. This is the origin of the impressive performance observed for the graphics processor for implementations of computationally intensive algorithms.
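For example, taking the bracketed input correction figure from Table 4.10, the reciprocal of the cycles per operational instruction gives the rate of useful results:

\[ \frac{1}{CPI_{gpu} \times Ineff_{gpu}} = \frac{1}{0.46} \approx 2.2 \ \text{operational results per cycle, per fragment pipeline} \]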

4.5.3 The Implications for non-Arithmetic Intensive Algorithms

The results presented above focus on the scenario where the support instruction overhead is negligible, i.e. CPIsppt in Equation 4.14 is approximately equal to zero. For the general case this condition does not hold.

                             CPIgpu   Ineffgpu   CPIgpu × Ineffgpu
Primary Colour Correction     0.40      1.65           0.66
2D Convolution (5 x 5)        0.50      1.51           0.76
2D Convolution (9 x 9)        0.59      1.50           0.89
2D Convolution (11 x 11)      0.64      1.50           0.96
Interpolation (average)       0.47      1.24           0.58
Decimation (average)          1.00      1.24           1.24
Histogram Equalisation        0.41      1.77           0.73
Motion Vector Estimation      0.71      1.91           1.36

Table 4.11: A Summary of Number of Cycles per Operational Instruction (CPIgpu × Ineffgpu) for the Graphics Processor for all Case Study Algorithms

The overall cycles per instruction (CPIgpu), inefficiency (Ineffgpu) and cycles per operational instruction (CPIgpu × Ineffgpu) for all case study algorithms are summarised in Table 4.11. Interesting observations are summarised below.

The inefficiency factors for all case study algorithm implementations on the graphics processors are observed in Table 4.11 to be consistently low, with a range of 1.24 to 1.91. This is up to an order of magnitude lower than the inefficiency of 6 to 47 times observed for general purpose processors by Guo [53]. A graphics processor therefore has a 3 to 38 times instruction efficiency benefit over the general purpose processor.

The primary colour correction algorithm requires the lowest clock cycles per graphics processor instruction (CPIgpu) of all five case studies. A reason for this is that it is computationally bound, with minimal instruction inefficiency overhead for memory access relative to the other case study algorithms.

A small, but not insignificant, rise in cycles per instruction (CPIgpu) is also observed with increasing convolution size. With all other factors constant, this translates to a marginal rise in the memory access 'cost'. A more significant difference, of a factor of two times, is observed between interpolation and decimation. This represents a significant rise in the memory access instruction overhead ('cost').

The histogram equalisation and motion vector estimation case studies both contain a large proportion of memory accesses and flow control. This results in a large inefficiency factor. The cycles per operational instruction for motion vector estimation is observed to be two times the value for histogram equalisation. A significant number of memory access and flow control instructions is the determining factor here. Decimation and motion vector estimation have a high cycles per operational instruction count relative to the other algorithms, producing less than one operation result per cycle. The difference from the other case studies (of over two times) is a result of memory access performance. This factor is further investigated in Sections 4.6 and 4.7.

4.6 Axis 2: Memory Access Requirements

In Section 4.5 the impressive speed-up that can be achieved, over general purpose processors, for implementations of compute intensive algorithms on the graphics processor and reconfigurable logic was analysed. In contrast, this Section considers graphics processor and reconfigurable logic performance for memory bound algorithms.

An algorithm's memory access requirements can be considered in terms of on-chip and off-chip accesses. This distinction is necessary to analyse the memory system performance for each device. For a reconfigurable logic implementation, the memory system is part of a specially designed datapath. The memory system for a graphics processor is fixed and optimised for graphics rendering. The challenges associated with on-chip and off-chip memory accesses for each device are described in Sections 4.6.1 and 4.6.2 respectively.

Two specific memory access case study issues are subsequently considered. Graphics processor input bandwidth limitations are explored through the 2D convolution case study in Section 4.6.3. In Section 4.6.4, the effect of variable data reuse on internal and external memory accesses is analysed, using the video frame resizing case study.

4.6.1 On-Chip Memory Accesses

On-chip memory accesses for the graphics processor refer to the number of texture accesses per output pixel in each fragment pipeline. The memory hierarchy includes L1 cache local to texture filter blocks and multiple shared L2 cache blocks. Input memory bandwidth

(IBWgpu) can be estimated from the number of clock cycles per internal memory access (CPIMA), as shown in Equations 4.15 and 4.16. The term input bandwidth is used because properties of off-chip accesses also affect this value.

CPIMAgpu = CPOgpu / Instrmem:gpu    (4.15)

IBWgpu = (1 / CPIMAgpu) × ItLPgpu × Cgpu × 4 Bytes per pixel    (4.16)
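A sketch of Equations 4.15 and 4.16 follows. The 24 pipelines at 650 MHz are the GeForce 7900 GTX parameters assumed throughout this chapter, and the CPIMA of 1.7 is the 2D convolution trend figure derived later in Section 4.6.3; together they reproduce the 36.7 GB/s estimate quoted there.

#include <cstdio>

// Equations 4.15 and 4.16: cycles per internal memory access and the
// implied input bandwidth, assuming 4 bytes transferred per pixel access.
double cpima(double cpo, double mem_instructions) {
    return cpo / mem_instructions;
}

double ibw_bytes_per_second(double cpima_value, int itlp, double clock_hz) {
    return (1.0 / cpima_value) * itlp * clock_hz * 4.0;
}

int main() {
    std::printf("IBW estimate: %.1f GB/s\n",
                ibw_bytes_per_second(1.7, 24, 650e6) / 1e9);  // ~36.7 GB/s
    return 0;
}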

For 4-vector floating point memory accesses, the GPUBench benchmark code provides input bandwidth figures for single, sequential and random internal memory access patterns [20]. The results of 'benchmarking' the devices used in this work are summarised in Table 4.12. Single memory access is the repetitive read of one texture location, in this case the origin (x = 0, y = 0), and thus represents the peak on-chip memory bandwidth. The random access result represents the worst case result of a randomised access pattern. Random access patterns are infrequent in video processing algorithms. The sequential access pattern gives the bandwidth for a single read per output pixel.

                           GeForce 7900 GTX   GeForce 6800 GT
Single (Repeated) Access       68 GB/s           13 GB/s
Sequential Access              26 GB/s           6.5 GB/s
Random Access                   4 GB/s           2.5 GB/s

Table 4.12: Input Bandwidth, in gigabytes per second (GB/s), for the Graphics Processor (Results taken using the GPUBench Benchmark Code [20])

                              Graphics Processor   Reconfigurable Logic
                              On-Chip Reads        On-Chip Reads
Primary Colour Correction     XY                   XY
2D Convolution (n x n)        XYn²                 XYn²
Resizing (2D Mask Version)    16XY/(sxsy)          16XY/(sxsy)
Resizing (1D Mask Version)    4XY(1 + 1/smax)      4XY(1 + 1/smax)
Histogram Equalisation        38XY                 4XY
Motion Vector Estimation      58XY                 27XY

Variables: sx, sy are the resizing factors with minimum value smin = min(sx, sy) and maximum value smax = max(sx, sy). The frame size is XY.

Table 4.13: On-Chip Memory Access Requirements for Graphics Processor and Reconfigurable Logic Implementations of the Case Study Algorithms

In contrast, reconfigurable logic is often considered to have approximately infinite on-chip bandwidth, with an arbitrary availability, within resource constraints, of distributed memory (in slice logic) and on-chip Block Select RAM (BRAM). For pre-determined access patterns a specialised datapath can be designed. The on-chip memory access requirement is considered here as the number of read accesses to on-chip BRAM. This is implemented as 'input buffers' or intermediate storage, and is bandwidth limited in proportion to the resource availability of dual-port BRAMs. The peak bandwidth to BRAM memory alone on the moderately sized XC4VSX25 Virtex-4 device is 288 GB/s. This is a factor of four times higher than the peak on-chip bandwidth of the GeForce 7900 GTX device.

The total on-chip memory access requirement for each device is shown in Table 4.13 for the case study algorithms. Primary colour correction, 2D convolution and resizing have equal on-chip memory access requirements for each device. Differences occur in the implementations of histogram equalisation and motion vector estimation, where approximately ten times and two times the number of on-chip accesses are required for the graphics processor, relative to reconfigurable logic. These overheads arise from the data dependence optimisations which require multiple rendering passes. This overhead is explained further in Section 4.7.
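The 288 GB/s figure can be reconstructed as follows. The 128 block RAM count is the XC4VSX25 provision, while the 36-bit port width and 250 MHz clock are assumptions chosen here to be consistent with the quoted total:

\[ 128 \ \text{BRAMs} \times 2 \ \text{ports} \times 36 \ \text{bits} \times 250 \ \text{MHz} = 2304 \ \text{Gb/s} = 288 \ \text{GB/s} \]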

4.6.2 Off-Chip Memory Accesses

Off-chip memory accesses refer to the memory reads and writes to a large quantity of RAM located 'off-die' from the graphics processor or reconfigurable logic device.

Traditionally a large memory bandwidth is required for the graphics processor to support the computational requirements of graphics rendering. Object vertices, input textures, intermediate fragment information, render target pixels and processor instruction code must be transferred between the graphics processor and external memory. To support this demand, the GeForce 6800 GT and GeForce 7900 GTX have a peak memory bandwidth of 35.2 and 51.2 gigabytes per second (GB/s) respectively [100]. For comparison, this is up to an order of magnitude higher than a general purpose processor, with an example peak off-chip memory bandwidth of 6 GB/s for a 3GHz Pentium 4 CPU [83].

It is the capability of either processor architecture to efficiently utilise off-chip memory that determines the achievable off-chip memory bandwidth. A factor that reduces the achievable bandwidth is non-sequential, out-of-order memory access patterns. For reconfigurable logic implementations, the predictable nature of video processing algorithm memory access patterns is exploited to minimise external memory access requirements through the use of raster-scan access patterns [57]. In this scenario the peak off-chip bandwidth is, in contrast to the graphics processor, the achievable bandwidth.

Consider a scenario in which each input pixel is accessed only once from memory. In addition, one pixel is input and one pixel output, on average, per clock cycle. The input and output frame sizes are equal. In this case the combined input and output (I/O) bandwidth requirement is 2CNcB bytes per second, where C is the global clock rate in megahertz (MHz), Nc is the number of channels per pixel and B is the number of bytes per channel. For the RGB colour space, with 8-bits (1 byte) per channel and a global clock speed of 100MHz, the I/O bandwidth requirement is 600 megabytes per second (MB/s). This is one and two orders of magnitude less than the external bandwidth of a general purpose processor [83] and a graphics processor respectively.

Device off-chip memory access requirements, for the case study algorithms, are shown in Table 4.14. Reconfigurable logic implementations exploit maximal data reuse and no off-chip location is written to or read from more than once. Only the minimum number of off-chip read accesses required by the graphics processor is shown. The actual number of accesses is subject to cache behaviour. This relationship is modelled in Chapter 5. For the primary colour correction, 2D convolution and resizing (2D method) case studies, the off-chip memory access requirement is approximately equal for each device.

A difference in off-chip access requirements occurs for the 1D implementation of resizing. This is due to the intermediate write of computation results to off-chip memory between the two 1D passes. In a reconfigurable logic implementation, this data is stored in on-chip line buffers.

                           Graphics Processor              Reconfigurable Logic
                           Reads (min)       Writes        Reads   Writes
Primary Colour Correction  XY                XY            XY      XY
2D Convolution (n x n)     XY                XY            XY      XY
Resizing (2D)              XY                (1/(sxsy))XY  XY      (1/(sxsy))XY
Resizing (1D)              (1 + 1/smax)XY    s′XY          XY      (1/(sxsy))XY
Histogram Equalisation     ∼ 3XY             ∼ 2XY         2XY     XY
Motion Vector Estimation   ∼ 8XY             ∼ 2XY         2XY     (1/16²)XY

Variables: sx, sy are the resizing factors with minimum value smin = min(sx, sy) and maximum value smax = max(sx, sy), s′ equals (1/smax)(1 + 1/smin) and the frame size is XY.

Table 4.14: Off-Chip Memory Access Requirements for Graphics Processor and Reconfigurable Logic Implementations of the Case Study Algorithms

The multi-pass implementations of histogram equalisation and motion vector estimation also require a large off-chip memory bandwidth overhead, of four to 2 × 16² times for reads and writes respectively. The effect of this overhead is minimised in the performance comparison due to the impressive off-chip memory bandwidth of the graphics processor. For future systems, for example a system on chip, off-chip memory access overhead is an increasingly significant factor.

4.6.3 Case Study: Graphics Processor Input Bandwidth

For the 2D convolution case study, the on-chip and off-chip memory access requirements are equal for both devices. This makes it a desirable case study with which to compare the effect of changing the number of internal memory accesses on device performance. In Table 4.11, it is shown that the graphics processor clock cycles per instruction (CPI) are approximately equal for all 2D convolution kernel sizes. A marginal rise occurs, from a value of 0.50 to 0.63, between 3 x 3 and 11 x 11 sized kernels. The inefficiency measure is equal over all kernel sizes at a value of 1.5. Instruction set inefficiency therefore has no effect on the 0.07n² speed-up for the reconfigurable logic implementation over the graphics processor. To identify the factors which do influence the speed-up, the number of cycles per output is plotted over varying 2D convolution sizes in Figure 4.4.

The graphics processor has a consistent cache performance for 2D convolution over varying sizes and is bounded by the number of memory access requests. From this observation, the graphics processor performance is estimated using the trend lines in Figure 4.4. The deviation from the trend line in Figure 4.4 is due to small variations in clock cycles per instruction. These occur due to a marginal drop in achieved input memory bandwidth with increased convolution size. Performance deviation becomes most apparent

[Figure 4.4 plot: y-axis 'Clock Cycles per Output (CPO)', 0 to 350; x-axis '2D Convolution (Size n×n)', 2×2 to 11×11; series: GeForce 7900 GTX with trend 1.7n², GeForce 6800 GT with trend 2.4n².]

Figure 4.4: Clock Cycles per Output for 2D Convolution Implementations on the Graphics Processor (Determining Graphics Processor Input Bandwidth)

beyond 2D convolution size 8 x 8. This is the predicted size of a graphics processor cache line. In this scenario there is an increased probability that a cache line miss occurs, and that the new cache line is a greater distance from the current cache line position.

A performance trend of 1.7 and 2.4 times n² is observed in Figure 4.4 to match the performance of the GeForce 7900 GTX and GeForce 6800 GT respectively. This value (1.7 or 2.4) is the clock cycles per internal memory access (CPIMAgpu) shown in

Equation 4.15. It follows that the achieved internal bandwidth (IBWgpu) of the GeForce 7900 GTX and GeForce 6800 GT can be estimated, using Equation 4.16, to be 36.7GB/s and 9.33GB/s respectively. A four times increase is observed between generations. The bandwidth estimates exceed the input bandwidth measured by the GPUBench benchmark tool by up to 1.5x. This is because an 8-bit 3-vector input pixel format is used here, in contrast to a 4-vector floating point format for the GPUBench results. The result is a lower on-chip cache requirement, per pixel, and thus a reduction in the on-chip bandwidth requirement.

The results in Figure 4.4 can be shown not to be computationally bound by considering the performance estimation method in Section 4.5.1. Using this analysis, only 0.5n² clock cycles are required per output for the GeForce 7900 GTX and n² for the GeForce 6800 GT. This is apparent because in general 2n² arithmetic operations are required per output, and CPIalu equals 0.25 and 0.5 for each graphics processor respectively. The net clock cycle overhead from memory access operations is therefore 1.2n² for the GeForce 7900 GTX and 1.4n² for the GeForce 6800 GT.

For reconfigurable logic an effectively infinite bandwidth is observed, due to pipelined line buffer reads. That is, there is no clock cycle penalty as n is increased. The clock cycle

cost is in fact an increased latency. The result is a throughput, for reconfigurable logic implementations, which remains approximately constant as n increases, in contrast to the graphics processor. This is due to flexible parallelism, pipelining and streaming of data.

The penalty for the performance benefit of reconfigurable logic is an increasing resource requirement. For maximum data reuse in a reconfigurable logic implementation, pixels are stored in line buffers. For a 720p frame size, 720 × (n − 1) × 24 bits are required (for 8-bit RGB component pixels). The memory resource requirement exceeds the GeForce 7900 GTX predicted cache size (128 KBytes) for convolution kernel sizes greater than 7 x 7. The worst case computational resource requirement (multipliers and addition) for reconfigurable logic increases as a factor of O(n²). However, this factor is in practice reduced due to resource usage optimisations [18, 106, 131].

A special case of sparse but non-separable 2D kernels is considered as follows. The benefit to graphics processor performance is a reduction in the internal memory access requirement by the degree of sparsity. For the GeForce 7900 GTX this is a reduction of approximately 1.7 clock cycles per sparse location. The relationship is approximately linear because accesses for sparse kernels are still highly localised. In the same scenario the reconfigurable logic performance, relative to memory access, is almost unchanged. The benefit for a reconfigurable logic implementation is a reduction in computational resource requirements, equal to the degree of sparsity. Kernel sparsity is therefore beneficial to each device, by a linear relationship to the degree of sparsity.
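The line buffer arrangement behind this 'effectively infinite' bandwidth can be sketched as follows. C++ is used for brevity (the thesis implementations are VHDL): one new pixel enters per cycle, while the remainder of the n x n window is replayed from on-chip storage, so external reads stay at one per output.

#include <vector>

// Line-buffer based window extraction for n x n 2D convolution.
// The previous n-1 rows are held in (n-1) line buffers of one frame
// width each, matching the 720 x (n-1) storage requirement above.
struct WindowBuffer {
    int n, width, col = 0;
    std::vector<std::vector<unsigned char>> lines;   // n-1 line buffers
    std::vector<std::vector<unsigned char>> window;  // n x n shift window

    WindowBuffer(int n_, int width_)
        : n(n_), width(width_),
          lines(n_ - 1, std::vector<unsigned char>(width_, 0)),
          window(n_, std::vector<unsigned char>(n_, 0)) {}

    // Push one raster-scan pixel; the window then holds the n x n
    // neighbourhood ending at the current position.
    void push(unsigned char pixel) {
        for (int r = 0; r < n; ++r)              // shift window left
            for (int c = 0; c + 1 < n; ++c)
                window[r][c] = window[r][c + 1];
        for (int r = 0; r < n - 1; ++r)          // new column from buffers
            window[r][n - 1] = lines[r][col];
        window[n - 1][n - 1] = pixel;
        for (int r = 0; r + 1 < n - 1; ++r)      // age the line buffers
            lines[r][col] = lines[r + 1][col];
        if (n > 1) lines[n - 2][col] = pixel;
        col = (col + 1) % width;
    }
};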

4.6.4 Case Study: The Effect of Variable Data Reuse Potential

The variable data reuse requirements of the resizing case study make it suitable for analysing the on-chip memory flexibility of the graphics processor and reconfigurable logic.

A reconfigurable logic implementation of interpolation is memory access limited by off-chip writes. The required output bandwidth is higher than the required input bandwidth. For decimation this scenario is reversed and the limiting factor is input bandwidth. The setup used here is a single pipeline (ItLPfpga = 1) with a system clock rate fixed at the output bandwidth requirement. For interpolation, one pixel is clocked out per clock cycle. For decimation, 1/(sxsy) pixels are clocked out per clock cycle, where sx and sy are the horizontal and vertical resizing factors respectively.

The on-chip memory access requirements of the resizing case study can be compared to the requirements of 2D convolution. For bi-cubic interpolation, a 4x4 window of pixels is used to produce each output pixel. This is the same requirement as for 2D convolution of size 4 x 4. The difference occurs in the data reuse potential. For 2D convolution this is a fixed shift of Manhattan distance one. For resizing, the shift is dependent on the resizing ratio, which is equal to zero or one for interpolation and greater than one for decimation.

                              2D Method               1D Method
Input   Output   sxsy     CPOgpu    IBWgpu        CPOgpu    IBWgpu
Interpolation
576p    720p     2.08      47.90    20.84 GB/s     21.14    18.95 GB/s
720p    1080p    2.25      47.77    20.90 GB/s     21.96    18.94 GB/s
480p    720p     3.00      47.90    20.84 GB/s     18.83    19.49 GB/s
576p    1080p    4.69      47.32    21.10 GB/s     18.45    18.96 GB/s
480p    1080p    6.75      47.32    21.10 GB/s     17.57    19.16 GB/s
Decimation
1080p   480p     0.15     101.56     9.83 GB/s     61.29    12.91 GB/s
1080p   576p     0.21      83.93    11.90 GB/s     47.14    15.21 GB/s
720p    480p     0.33      68.05    14.67 GB/s     34.53    18.07 GB/s
1080p   720p     0.44      65.85    15.16 GB/s     33.85    18.43 GB/s
720p    576p     0.48      62.41    16.00 GB/s     30.96    18.13 GB/s

Table 4.15: Achieved Input Bandwidth (IBWgpu) for 2D and 1D Separable Resizing Implementation Methods on a GeForce 7900 GTX Graphics Processor

Separability of the resizing algorithm into two 1D passes is shown in Tables 4.13 and 4.14 to result in a variation in on-chip and off-chip memory access requirements. For a reconfigurable logic implementation, the effect of this is a reduction in the computational resource usage of approximately two times. The effect on the input bandwidth of the graphics processor is a significant change, as follows.

A summary of the performance for the 1D and 2D methods of video frame resizing on the graphics processor is shown in Table 4.15. The clock cycles per output of the separable implementation is the average over the two 1D passes. Between differing 2D interpolation ratios, memory access behaviour is linear. This is because the data reuse potential is high. With an increased interpolation factor, the data reuse potential rises marginally (more outputs require the same sixteen pixels). This results in a small increase in performance as one traverses down the table.

For 2D decimation the pattern is different. The lower data reuse potential of decimation, with respect to interpolation, is the reason for this. As the decimation factor falls (moving down Table 4.15) performance improves. This is due to an increase in the average data reuse potential between neighbouring output pixels. A significant 11 GB/s of variation in input bandwidth is observed between the 2D methods of interpolation and decimation. The memory access performance drops by up to a factor of two times.

Two choices are possible for the order in which to compute 1D resizing. The optimum choice, if sx⁻¹ < sy⁻¹, is to resize in the y then the x dimension, and else to resize in the opposite order. The optimisation goal in this context is to minimise the number of on-chip and off-chip reads between alternative resizing orders.

In Table 4.15, up to a 1.6 times improvement in internal memory bandwidth is observed for the 1D method of decimation. This is due to improved cache performance, where cache misses occur due to only one resizing dimension on each pass. The result is up to a 2.5 times reduction in the number of cycles per output. For interpolation, a 1.1 times lower achievable input bandwidth is observed. The reduction is due to the cache performance already being efficient for interpolation. Separation into two passes actually reduces the reuse potential, by necessitating steps of one location between output pixels, in one dimension, on each pass. Despite this, the result is an over two times reduction in cycles per output. A peak 2.5 times improvement is observed from using the 1D resizing method over the 2D method. A limitation of the separation of an algorithm into multiple passes on a graphics processor is the increased overhead from host processor setup and pipeline control. A difference of the order of 1 to 2 milliseconds is observed between the 1D and 2D case studies.

For completeness, a summary of the speed-up between a reconfigurable logic device and the graphics processor for varying resizing ratios is shown in Table 4.16. The multi-pass overhead is ignored in these results. A factor of up to 2.6 times speed-up in favour of the graphics processor is observed between the two devices. The reconfigurable logic performance is bound by a deterministic factor of cycles per output (CPOfpga = 1/(sxsy)) for decimation. The cycles per output for a reconfigurable logic implementation remain up to 21 times lower than for a graphics processor. This is insufficient to overcome the graphics processor's iteration level parallelism of 24. If the clock speed ratio of over three times is included, the graphics processor has up to an 8 times performance improvement over the reconfigurable logic implementation.

Input   Output   sxsy    CPOfpga   CPOgpu   Speedupfpga→gpu
Interpolation
576p    720p     2.08       1       21.14        0.88
720p    1080p    2.25       1       21.96        0.92
480p    720p     3.00       1       18.83        0.78
576p    1080p    4.69       1       18.45        0.77
480p    1080p    6.75       1       17.57        0.73
Decimation
1080p   480p     0.15      6.75     61.29        0.38
1080p   576p     0.21      4.69     47.14        0.42
720p    480p     0.33      3.00     34.53        0.48
1080p   720p     0.44      2.25     33.85        0.63
720p    576p     0.48      2.08     30.96        0.62

Table 4.16: Speed-up of Reconfigurable Logic (Virtex-4) versus the Graphics Processor (GeForce 7900 GTX) for Varying Bi-cubic Resizing Ratios (1D Method)

4.7 Axis 3: Data Dependence

In Sections 4.5 and 4.6, the graphics processor and reconfigurable logic have been compared with respect to the arithmetic and memory access properties of algorithms. Under an ideal scenario both arithmetic computation and memory access operations may be arbitrarily parallelised. The bounding factor which forbids this salubrious exploitation of parallelism is 'data dependence'. In this Section, the performance implications of optimising an implementation in the presence of data dependence are considered. A motivation is to discover the factors which result in up to 9.6 times speed-up, for reconfigurable logic over a graphics processor, for the motion vector estimation case study.

Reconfigurable logic may be considered to have an inherent advantage for achieving speed-up under data dependence constraints. The fine grained architecture results in a scenario where parallelism can be exploited, through the implementation of a specially designed datapath, in the dimension that data dependence does not restrict. Parallelism dimensions are broadly classified as operation and iteration level parallelism. The scenario is particularly noticeable for the histogram equalisation and motion vector estimation case studies. For motion vector estimation, four motion vectors are computed in parallel processing elements. Each processing element computes operations serially. For histogram generation, all 256 intensity bins are calculated in parallel. One accumulator (operation level parallelism) is required for the calculation of each intensity bin. Datapath specialisation provides favourable flexibility for these two case studies.

For a graphics processor, the architecture has a fixed degree of iteration and operation level parallelism determined by the composition of, and number of, fragment pipelines. The fixed memory hierarchy and feed-forward programming model also affect design choice. In addition, a designer must aspire to maintain a large number of outputs from each rendering pass to maximise the multi-threading possibilities on the graphics processor. The design constraints and considerations are evidently numerous.

Despite these limitations, a high performance for histogram equalisation and motion vector estimation is achievable. These are algorithms one may consider 'unmatched' to the graphics processor. For motion vector estimation, an order of magnitude performance improvement is achieved, in Section 4.7.2, over a straightforward implementation⁴.

To describe how these impressive performance improvements are achieved on the graphics processor, this Section is arranged as follows. A strategy for graphics processor performance optimisation in the presence of data dependence is summarised in Section 4.7.1. Section 4.7.2 presents working examples of the execution of this strategy on the histogram equalisation and motion vector estimation case studies. The techniques are compared to the reconfigurable logic 'strategy' in Section 4.7.3.

⁴The straightforward implementation is to directly code the description in Figure 3.2 on the graphics processor. This takes 119ms, in comparison to 11ms for optimised code, to compute a 720p video frame.

4.7.1 A Data Dependence Strategy for the Graphics Processor

The strategy is defined from the implementation challenges in Section 3.4.1 and experience of executing optimisation decisions. It is invoked if one of two scenarios occurs: A) The number of outputs from an algorithm is lower than a few times the multi-threading factor (approximately 1500). This prohibits the exploitation of parallelism on a graphics processor. B) The memory access behaviour is inefficient. This occurs when the amount of instantaneously required on-chip cache space (Breq), as shown in Equation 4.17, is greater than the graphics processor cache size (e.g. 128 KBytes). The variables are the input frame descriptor

(f), number of reads (Nreads), data reuse potential (R) and the multi-threading factor (Tb). Each variable can be determined from ‘The Three Axes of Algorithm Characterisation.’

Breq = Σf (Nreads + (1 − R)Tb)    (4.17)
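A literal coding of Equation 4.17 is sketched below. The interpretation of the per-frame terms (reads per output, and reuse potential R as a fraction between 0 and 1) is an assumption based on the surrounding text, with the multi-threading factor of around 1500 quoted above.

#include <vector>

// Equation 4.17: instantaneous on-chip cache space required, summed over
// the input frame descriptors f. Units are texture elements; interpretation
// of the terms is assumed, not taken verbatim from the thesis.
struct FrameAccess {
    double n_reads;   // reads per output from this input frame (N_reads)
    double reuse;     // data reuse potential R, assumed in [0, 1]
};

double b_req(const std::vector<FrameAccess>& frames, double tb = 1500) {
    double total = 0;
    for (const auto& f : frames)
        total += f.n_reads + (1.0 - f.reuse) * tb;
    return total;     // compare against the cache size, e.g. 128 KBytes
}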

If one of the above two cases occurs, the strategy proceeds as follows. Steps 1 and 2 are exemplified in Example 3.2 from Section 3.4.1.

1. Formulation: Rewrite the algorithm in the simplified algorithmic form described in Example 3.2. Note that square brackets ([.]) indicate data dependence.

2. Rearrangement: If scenario A occurs: Perform algebraic manipulation of the resultant formulae from step 1. The 'optimisation goal' is to introduce further stages of data dependence to maximise the degree of parallelism (size of loops) within the inner-most square bracket. In practice this is introducing multiple rendering passes.

3. Memory access optimisation: If scenario B does not occur skip to step 6.

4. Memory access steps: Identify the outputs from the inner-most square brackets of the formulae from step 1. Next, approximate the memory access ‘jumps’ between neighbouring outputs. This factor is the data reuse potential.

5. Reduce step size: If the memory access jump is high (in practice much greater than four) subdivide the inner-most loop and go to step 4.

6. Further optimisation: Check if the memory access pattern can be reordered with- out changing the result. In this scenario consider invertible mappings which can be applied to the global, or localised sets of, access pattern(s). This is used to further remove steps in memory accesses between outputs.

7. Step out: Step one level of data dependence outward (move outward one bracket in the formulae from step 1) and go to step 4 with the term inside the current bracket as the new system input. If already at the outer-most level (bracket) then finish.

4.7.2 Case Study: Optimising For Data Dependence

To demonstrate the above strategy, the design challenges and performance issues for the histogram equalisation and motion vector estimation case studies are compared. The strategy is followed throughout the Section and conformance to it is then analysed.

In Section 3.4.1, the methodology for graphics processor implementation of the histogram equalisation and motion vector estimation case studies was presented. Of interest here is the degree of separation of the window functions in Equations 3.17 to 3.22. A method 1 and a method 2 scenario are considered for each case study. In method 1 the inner-most window operation is performed in full for each case study. For histogram equalisation, this is an on-chip read over a 16 x 9 window of pixels for each output. For motion vector estimation, the result is a read over a window of 16 x 16 pixels. In method 2, each window operation (named a reduction) is divided into two equal sized stages, as sketched below. Table 4.17 summarises the performance for each method. The reduction factors after the decomposition of the window operations are included in brackets.
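The method 2 decomposition can be illustrated as a two-stage reduction. The host-side C++ analogue below sums a 16 x 9 window as a 4 x 3 stage followed by a second 4 x 3 stage, mirroring the Reduce-A/Reduce-B split; on the graphics processor each call corresponds to one rendering pass. The summation operator stands in for whichever reduction the case study requires.

#include <vector>

// Window reduction: each output accumulates a wx x wy block of inputs,
// with wx * wy reads per output (small memory access jumps per pass).
using Image = std::vector<std::vector<float>>;

Image reduce(const Image& in, int wx, int wy) {
    int ox = static_cast<int>(in.size()) / wx;
    int oy = static_cast<int>(in[0].size()) / wy;
    Image out(ox, std::vector<float>(oy, 0.0f));
    for (int x = 0; x < ox; ++x)
        for (int y = 0; y < oy; ++y)
            for (int i = 0; i < wx; ++i)
                for (int j = 0; j < wy; ++j)
                    out[x][y] += in[x * wx + i][y * wy + j];
    return out;
}

// Method 1: reduce(in, 16, 9) in one pass (large per-output access jumps).
// Method 2: reduce(reduce(in, 4, 3), 4, 3) in two passes.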

The cycles per output count (CPOgpu) for motion vector estimation in Table 4.17 is two to three orders of magnitude higher than CPOgpu for histogram equalisation. This is because the outputs of motion vector estimation are less numerous ‘coarse grained’ motion vectors, in contrast to the ‘fine grained’ pixels of histogram equalisation. It is also observed that the cycles per output (CPO) difference between methods 1 and 2 for motion vector estimation is approximately three times. This is significantly greater than that for histogram equalisation, for which a negligible difference is observed. To discover why, consider the following. For each case study, the total number of instructions per output and the inefficiency factors are approximately equal for methods 1 and 2. The ratio of the operations that are required for each method (in Table 4.4) is also approximately equal. By deduction, it is reasoned that the performance deviation is a product of memory access behaviour. This is reflected directly in the change in cycles per instruction (CPI) and cycles per internal memory access (CPIMA) in columns five and six of Table 4.17. To explore this difference further consider the on- and off-chip memory access requirements for each algorithm. The histogram equalisation case study implementations have an equal number of on-chip reads for each method. The number of off-chip writes rises slightly, from a total of 1849664 to 1926464 writes. For motion vector estimation, the total on-chip reads per output rises between methods 1 and 2. In addition, total off-chip memory writes also increase between methods 1 and 2, from 39000 to 540800 per pass. For each algorithm, the number of off-chip reads shown in Table 4.14 relates to method 2. Method 1 is marginally lower in each case. Despite increases in all memory access requirements it is interesting that the performance improves between methods 1 and 2 for motion vector estimation.

| Algorithm | Outputs | Instructions per Output (Inefficiency) | CPO | CPI | CPIMA (On-chip Reads per Output) |
|---|---|---|---|---|---|
| Histogram Equalisation, Method 1 (4 pass) | | | | | |
| Step 1 | 921600 | 227 (1.45) | 93.30 | 0.41 | 2.92 (32) |
| Reduce (16 × 9) | 6400 | 1049 (1.75) | 1048.10 | 0.99 | 7.28 (144) |
| Bin Calculation | 64 | 525 (2.46) | 14138 | 26.93 | 141.37 (100) |
| Apply | 912600 | 18 (6.00) | 2.44 | 0.14 | 0.49 (5) |
| All | 912600 | 252.3 (1.54) | 102.24 | 0.41 | 2.69 (38) |
| Histogram Equalisation, Method 2 (5 pass) | | | | | |
| Step 1 | 921600 | 227 (1.45) | 93.30 | 0.41 | 2.92 (32) |
| Reduce-A (4 × 3) | 76800 | 58 (1.26) | 55.66 | 0.96 | 4.64 (12) |
| Reduce-B (4 × 3) | 6400 | 33 (1.57) | 63.38 | 1.92 | 5.28 (12) |
| Bin Calculation | 64 | 525 (2.46) | 14138 | 26.93 | 141.37 (100) |
| Apply | 912600 | 18 (6.00) | 2.44 | 0.14 | 0.49 (5) |
| All | 912600 | 250.1 (1.52) | 100.45 | 0.40 | 2.64 (38) |
| 3-step Non-Full-Search MVE, Method 1 (2 step, 3 pass) | | | | | |
| Step 2.1 (16 × 16) | 32400 | 2626 (1.60) | 3755.56 | 1.43 | 7.32 (513) |
| Step 2.2 | 3600 | 65 (1.23) | 95.33 | 1.47 | 9.54 (10) |
| 1× Pass | 3600 | 23699 (2.00) | 41946.67 | 1.77 | 9.07 (4627) |
| 2× Pass | 3600 | 47398 (2.00) | 94033.33 | 1.98 | 10.16 (9254) |
| 3× Pass (All) | 3600 | 71097 (2.00) | 133900.00 | 1.88 | 9.65 (13881) |
| 3-step Non-Full-Search MVE, Method 2 (3 step, 3 pass) | | | | | |
| Step 3.1 (4 × 4) | 518400 | 155 (1.46) | 74.62 | 0.48 | 2.26 (33) |
| Step 3.2 (4 × 4) | 32400 | 39 (1.70) | 147.82 | 3.79 | 9.24 (16) |
| Step 3.3 | 3600 | 65 (1.23) | 95.33 | 1.47 | 9.53 (10) |
| 1× Pass | 3600 | 22736 (1.91) | 15773.33 | 0.69 | 3.22 (4906) |
| 2× Pass | 3600 | 45472 (1.91) | 31156.67 | 0.69 | 3.18 (9812) |
| 3× Pass (All) | 3600 | 68208 (1.91) | 48533.33 | 0.71 | 3.30 (14718) |

Table 4.17: A Performance Summary for the Alternative Reduction Options for Implementing Histogram Equalisation and Non-Full-Search MVE on the GeForce 7900 GTX Graphics Processor (for a 720p Video Frame)

There are two factors to explore. Firstly, why the CPI and CPIMA are lower for histogram equalisation than for MVE. Secondly, why a three times improvement is observed between methods 1 and 2 for MVE, despite increased memory access requirements. To answer these points the algorithm memory access behaviour must be considered, as follows. For histogram equalisation, all outputs require the same input data. That is, the data reuse potential is equal to one. A second condition is that the order in which histogram values are accumulated is irrelevant, so long as all pixel values are ultimately accumulated to each output bin. This second point means that the addressing pattern can arbitrarily be mapped by an invertible function. Performance is improved if the on-chip memory read pattern for the ‘Reduce’ step (in practice the summing of locally generated histograms) is performed in strides of eight locations. Note that the data reuse potential is zero for these summation reduce operations. In this setup the accesses by neighbouring outputs are approximately sequential and cache behaviour is good. This optimisation is the reason why, in Table 4.17, a high performance relative to the MVE algorithm is observed, and why there is no difference between method 1 and 2 performance. The memory access requirements for method 1 motion vector estimation are dominated by step 2.1. This stage is explained in detail below. The range of required memory locations for step 2.1 is summarised, for a 1D case, on the left hand side of Figure 4.5. The 16 pixel memory location boundaries for search and reference frames are highlighted. Each kernel function (output) requires 16 search frame pixels. This is the same set of pixels for 3 neighbouring outputs (e.g. between outputs 1, 2 and 3) but non-overlapping between neighbouring sets of 3 outputs (e.g. between sets 1, 2, 3 and 4, 5, 6). It is noted that the jumps in memory access are large. For the same kernel in Figure 4.5, 16 pixels from a reference frame are required. For three neighbouring outputs an increasing overlap of 8, 12 or 14 pixels occurs for passes 1, 2 and 3 respectively. The overlap for pass 1 is shown in Figure 4.5. However, between sets of 3 outputs the overlap is 16 pixels for step one and variable, with the same average overlap as for step one, for steps two and three. It is important to note the large steps in memory access locations, for reference and search frames, between neighbouring outputs. On the right hand side of Figure 4.5 the access pattern for method 2 is summarised. This is an expansion of outputs 1, 2 and 3 from method 1, which now require a total of 12 outputs. The important feature to notice is that the range of required search and reference frame locations between n outputs is reduced by a factor of three and four for the reference and search frames respectively. The penalty is that these results need to be combined in a later step (step 3.2 in Table 4.17). This tackles the issue of large memory access steps from method 1. The memory access jump between each set of three outputs is now a jump between sets of 12 outputs. The issues associated with the full 2D memory access are an extension of the representation in Figure 4.5.


Figure 4.5: Graphical 1D Representation of the Memory Access Requirement for the Motion Vector Estimation Case Study for Methods 1 and 2 (for Pass 1, Step 1)

In the 2D case the reduction in memory access range, and the increase in the required number of outputs, is the square of the 1D case. It is observed in Table 4.17, when comparing steps 2.1 and 3.1 for the motion vector estimation case study, that the optimisation detailed above provides a three times improvement in memory system performance (CPIMA). The penalty is observed in step 3.2, where the memory access performance is actually reduced. No data reuse is possible between outputs. In the 1D case of Figure 4.5 there is no overlap when accumulating the sets of outputs {A, B, C, D}, {E, F, G, H} and {I, J, K, L}. For the 2D case the reduction is over non-overlapping 4 × 4 windows. Despite this, the required number of reads per output and the instruction count are low, so the performance overhead is amortised by step 3.1. It is important to summarise from this example that when using multiple rendering passes to ‘handle’ data dependence, one must aim to minimise the memory access steps between neighbouring outputs. This can be achieved through implementing extra rendering passes, as observed for the motion vector estimation case study. Where maximum data reuse potential occurs, for example in histogram equalisation, implementing further rendering passes provides minimal extra performance improvement. To verify the strategy, consider firstly the motion vector estimation algorithm. The strategy is invoked due to scenario B. The memory space requirement is indicated by feeding the number of reads and the reuse potential from ‘The Three Axes of Algorithm Characterisation’ into Equation 4.17. Approximately 688 KBytes of instantaneous buffer space is required to store search and reference frames. This exceeds the predicted cache size (128 KBytes) of the GeForce 7900 GTX by over five times. In practice this is observed in a low performance of 119 ms for a direct implementation of the MVE function in Figure 3.2.

Steps 1 and 2 are applied for the motion vector estimation case study in Example 3.2 (Section 3.4.1). The steps in memory accesses are shown graphically in Figure 4.5 (step 3). Memory access stepping is observed to be high for method 1 (step 4). Due to this the window ‘reduction’ operation is separated in method 2 (step 5). On the next pass (steps 4–5), it is observed that the memory access step of method 2 is low enough (less than four). If one steps out through the outer loops (steps 3.2 and 3.3 in Table 4.17), the memory access step is also low so no further optimisation is required. The spatial properties of motion vector estimation mean that no further optimisation can be made in step 6. For histogram equalisation the strategy is invoked due to scenario A. The number of outputs from histogram generation is 256 intensity bins. This is six times less than the predicted multi-threading factor. Similarly to motion vector estimation, steps 1 and 2 are executed in Example 3.3 (Section 3.4.1). From the observation that the data reuse potential is one, progress is made directly to step 6. In step 6 the memory access pattern is re-mapped, as described above, to further improve performance. The manipulation of the memory access patterns is considered further in Chapter 6.

4.7.3 A Comparison to the Reconfigurable Logic ‘Strategy’

It is interesting to consider how similar memory access challenges are overcome in a customised datapath implementation for reconfigurable logic. The key difference is that the content of memory, and the organisation of memory, are fully controllable. This produces a large design space of possible optimisations for the memory hierarchy [78]. Of interest here is that data is presented at the compute elements by some overall control logic, in contrast to relying on the performance of a fixed cache hierarchy. A further benefit is that the degree of parallelism can be controlled. For the motion vector estimation case study, only four outputs are computed in parallel. Despite this a large speed-up over the multi-threaded graphics processor implementation is observed. This is due to a lower instantaneous on-chip memory buffer requirement, analogous to Equation 4.17. The off-chip reads are minimised by reducing the amount of input frame data that is required, and may be later reused, at any one time. In contrast, the graphics processor has a larger iteration level parallelism in the number of fragment pipelines. If one considers the number of concurrently live threads, predicted to be of the order of 1500, the parallelism is even greater. There is in fact too much parallelism on the graphics processor for an efficient direct implementation of motion vector estimation to be achievable. A heuristically controlled memory system and a choice of the degree of iteration level parallelism to implement are inherent architectural benefits of the reconfigurable logic architecture over the graphics processor under data dependence constraints.

4.8 Limitations of the Performance Comparison

A number of questions arise concerning the validity of applying the performance comparison to alternative or wider application domains outside of the presented case study algorithms. The issues can be grouped under three headings: reconfigurable logic resource usage, numerical precision and application scalability.

4.8.1 Reconfigurable Logic Resource Usage

It was reasoned in Section 2.3.1 that the maximum die size of a reconfigurable logic device is of the order of six times larger than that of a graphics processor. If one is considering the alternative use of each device in a system-on-a-chip style architecture, then it may be desirable to limit the die size of the target device to one that is equivalent to a graphics processor. The Xilinx Virtex-4 XC4VSX25 satisfies this constraint. Resource usage and percentage utilisation of each resource on this device are shown in Table 4.18. Motion vector estimation is omitted from Table 4.18; however, the implementation considered here was successfully targeted specifically at the XC4VSX25 device [114]. The critical resource factor is Block Select RAM (BRAM) memory.

For each processing element (level of ItLPfpga), a storage space of 42 × 42 reference and 16 × 16 search frame pixels is required. This is a total of 2020 pixels. At one byte per intensity value this requires four BRAMs per element. Globally, 26 (42 − 16) reference frame columns must be buffered for reuse between horizontally neighbouring motion vector calculations. This requires a further 37 BRAMs for a target 720p video frame. The total usage for input buffering is 53 BRAMs (or 73.6% of the XC4VSX25 device). A minimal number of additional BRAMs are required for other pipeline features [114]. The XC4VSX25 device is a ‘good fit’ for this implementation.

| Case Study | CLBs | XtremeDSP Blocks | BRAMs |
|---|---|---|---|
| Bicubic Resizing | 671 (24.9%) | 36 (75.0%) | 24 (33.3%) |
| 2D Convolution 5 × 5 | 472 (17.6%) | 75¹ (156.3%) | 8 (11.1%) |
| 2D Convolution 9 × 9 | 1444 (53.7%) | 243¹ (506.3%) | 16 (22.2%) |
| 2D Convolution 11 × 11 | 2446 (91.0%) | 363¹ (756.3%) | 20 (27.8%) |
| Primary Colour Correction | 905 (33.7%) | 43 (89.6%) | 0 (0%) |
| Histogram Equalisation (HE) | 3415 (127.0%) | 0 (0%) | 0 (0%) |

¹ Maximum possible utilisation figures quoted

Table 4.18: Slice Count, XtremeDSP and Block Select RAM (BRAM) Usage for Selected Case Study Algorithms on a Virtex-4 Reconfigurable Logic Device, including (in brackets) the Percentage Utilisation of the XC4VSX25 Device

The resizing and primary colour correction algorithm resource utilisation requirements are met by the XC4VSX25 device with a margin of 10% to 15% redundancy of XtremeDSP slices and up to 75% of CLBs. No BRAMs are required for the colour correction algorithm and only 33% utilisation is observed for resizing. For the histogram equalisation case study, the device is over-mapped by 27% in slice logic. However, the utilisation may be reduced by implementing a proportion of the 256 accumulators and the LUT decoder within the XtremeDSP slices and BRAMs. The 2D convolution case study is largely over-mapped in XtremeDSP slices. For this case study, the addition tree is implemented in slice logic, therefore the XtremeDSP slices represent only multiplier logic. Optimisations to reduce the number of multipliers, or to transfer functionality to BRAMs, can reduce this requirement significantly [18, 93]. However, the large multiplication requirement of large kernels may require time division multiplexing to meet resource constraints at a sacrifice of throughput performance. The utilisation of CLBs and BRAMs is within the device constraints. With the exception of 2D convolution, device resource constraints can be met for all algorithms on an FPGA device of equivalent size to the graphics processor. This shows that in a die area based comparison, the graphics processor has superior performance per unit area, of up to three times, for all but the motion vector estimation and histogram equalisation case studies. These two case studies are key examples of where the implementation of a specialised datapath in reconfigurable logic can outperform a graphics processor on a per die area unit basis. It is interesting to compare these results to those by Kuon and Rose, who identify an average 40 times die area advantage for an ASIC over an FPGA device [75]. This factor reduces to a 21 times difference when coarse-grained FPGA resources are considered [75]. The difference observed in this work is significantly lower for two reasons. First, the graphics processor contains a large portion of redundant logic: only 50% of a graphics processor is dedicated to the shader pipeline [54], and some of that is dedicated to vertex manipulation. Secondly, the reconfigurable logic implementations utilise a fixed point number representation, whereas a graphics processor uses floating point representation.

4.8.2 Numerical Precision

The numerical precision adopted for reconfigurable logic in this work is fixed point representation. Through an astute choice of bit-width for the case study implementations, the difference in the aesthetic quality of results from those of a floating point implementation is imperceptible. This is possible because of the low dynamic range which is observed in many video processing algorithms in comparison to scientific computation. A comparison of floating point implementations on each device would have shackled a key benefit of reconfigurable logic implementations: the arbitrary choice of numerical representation.

For this reason floating point representation was not considered in the above work. However, a brief theoretical observation below demonstrates the associated issues. In the general case, where the input and output data are not quantised to 8-bits per channel, the dynamic range of the computation may be increased. This may result in a scenario where full floating point, or indeed double precision, number representation is more suitable. Examples where dynamic range is increased for video processing occur in medical imaging and saliency mapping. An example is considered for the primary colour correction algorithm to assess the impact of using floating point precision in a reconfigurable logic implementation. Under this condition both implementations can be considered as producing, with a small degree of error, the same numerical precision. The primary colour correction case study requires 187 high level numerical operations to implement on the reconfigurable logic device. A moderate 905 CLBs and 43 multipliers are required to implement this algorithm for a fixed internal bit-width of 18-bit fixed point representation. This is less than 15% CLB and 8% XtremeDSP block utilisation of the largest DSP targeted Virtex-4 XCVSX512 device. The required operations decompose into 43 multipliers; the remaining 144 operations are predominantly addition, plus a small number of comparators and two CORDIC blocks. A Xilinx Coregen IP block for a Virtex-4 single-precision floating point multiplier requires five XtremeDSP blocks and 48 CLBs. For an adder, a high speed design requires four DSPs and 86 CLBs. If one counts DSP and CLB utilisation in multipliers and addition alone, the count is 695 DSPs and 12384 CLBs. This exceeds the resources of the largest DSP oriented FPGA device from Xilinx (XCVSX512) in CLBs (201%) and in XtremeDSP slices (136%). This is in the absence of additional resources which are required for routing, rotation (an expensive CORDIC block) and comparators. A two to three times improvement in device density is required to support floating point operations for the primary colour correction case study, on even the largest current device. This highlights that the computational density factor for a graphics processor over reconfigurable logic, in a like for like numerical precision comparison, is in excess of 12 times. This is due to a compound six times larger die area size and an over-mapping of resources, for the reconfigurable logic implementation, of at least two times.
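These counts can be reconstructed as follows, assuming (as the totals imply) that 120 of the 144 remaining operations are adders, with the rest being the comparators and the two CORDIC rotations:

```latex
\begin{align*}
\text{DSPs:} \quad & 43 \times 5 + 120 \times 4 = 215 + 480 = 695 \\
\text{CLBs:} \quad & 43 \times 48 + 120 \times 86 = 2064 + 10320 = 12384
\end{align*}
```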

4.8.3 Application Scalability

For the case study comparisons the 720p video frame format has been the primary target. When considering the reconfigurable logic implementations the performance is almost independent of frame size. The effect of increased frame size is in the size of the line buffers, which scale with the dimensionality of the frame by a factor of O(n).

[Bar chart: throughput in millions of pixels per second (MP/s), from 0 to 450 MP/s, for the primary colour correction, 5×5/9×9/11×11 2D convolution, histogram equalisation and 3-step NFS-MVE case studies at 480p, 576p, 720p and 1080p frame sizes.]

Figure 4.6: Throughput for Case Study Algorithms and Changing Video Frame Sizes (Implementations on the GeForce 7900 GTX (G71))

BRAMs are largely under-utilised for the case studies, so this increase is likely to have an insignificant effect. An additional overhead occurs for extra RAM control logic. A graphics processor implementation exhibits noticeable deviations, as shown in Figure 4.6, for standard and high definition video formats. Changes in performance with video frame size occur for a number of reasons on the graphics processor. At first glance the effects appear to be randomised, however, a number of key features are identified below. A correlation occurs between increasing frame size and the performance of the computationally intensive primary colour correction algorithm. This is due to the nature of the grouping of the processing of output pixels into threads. The pipelined nature of this becomes more efficient as more threads (outputs) are available to ‘batch.’ Scenario A in Section 4.7.1 is an extreme case of this. The performance of 2D convolution is not affected by this feature because it is bound directly by the number of internal memory accesses. The histogram equalisation case study shows a more prominent example, where a different implementation is required per frame format. A five pass algorithm is maintained for each format. For 1080p the performance drops because a non-overlapping reduction over a window of 6 × 5 inputs is required in reduction stage ‘Reduce-B’. Large reductions are poorly supported by the graphics processor. Over all case study algorithms in Figure 4.6, the standard deviation in throughput performance is less than 6.7% of the maximum throughput. This is considered small enough to have a negligible effect on the performance comparison results presented here.

4.9 Summary

The combination of ‘The Three Axes of Algorithm Characterisation’ and the formulation proposed in Section 4.2 presents an attractive method by which to compare architecture performance. To collate the key contributions from the detailed analysis in this Chapter, the following is presented. The structured comparison is summarised in Section 4.9.1. In Section 4.9.2, the scalability of each architecture is discussed. The performance questions from Section 4.1 are answered, and Chapters 5 and 6 are motivated, in Section 4.9.3.

4.9.1 Findings from the Structured Comparison

The structured approach is summarised below by presenting the core contributions at each comparison stage. Although the contributions detailed below relate to the chosen architectures, the approach can be applied to other architectures and application domains.

Throughput Rate Drivers

For each architecture, three factors affect the achievable throughput. The factors are summarised below as the key findings for the graphics processor and reconfigurable logic. The difference in device clock speed amounts to a peak value of 4.55 times, and a minimum of 1.86 times, in favour of the graphics processor for comparable device generations. This represents an intrinsic benefit of the fixed graphics processor architecture with respect to the interconnect flexibility overhead of reconfigurable logic fabric. With regard to the number of cycles per output, reconfigurable logic presents an advantage of up to 230 times, with a minimum of a factor of 11.2 times, over the graphics processor. The innate spatial processing nature of hardware implementations, over the temporal nature of software, is the key driver of this benefit. For a graphics processor, the degree of iteration level parallelism is fixed. In this work it is 24 times higher than that of the mainly single-pipeline reconfigurable logic implementations. Exceptions occur for reconfigurable logic implementations of histogram generation and motion vector estimation, where cycles per output are traded for an increased iteration level parallelism of 256 and 4 times respectively.

1. Arithmetic Complexity

The following observations are made regarding arithmetic computations. The relationship between the performance of the two devices is summarised in Equation 4.18. This expresses the condition which must hold for a graphics processor implementation to have superior performance over reconfigurable logic.

The factor of two on the right-hand side is the minimum clock speed advantage of the graphics processor.

Instroper:gpu × (CPIgpu × Ineffgpu) / ItLPgpu < (2 × CPOfpga) / ItLPfpga        (4.18)

For the results in Section 4.5, this equation states that if the number of operational instructions is less than 80, the graphics processor will have superior performance.
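As an illustrative check, rearranging Equation 4.18 for Instroper:gpu and substituting representative values (CPIgpu ≈ 0.41 and Ineffgpu ≈ 1.45, as in Table 4.17, with ItLPgpu = 24 and CPOfpga = ItLPfpga = 1 for a single-pipeline reconfigurable logic design) recovers this threshold:

```latex
Instr_{oper:gpu} < \frac{2 \times CPO_{fpga}}{ItLP_{fpga}} \times \frac{ItLP_{gpu}}{CPI_{gpu} \times Ineff_{gpu}} = \frac{2 \times 24}{0.41 \times 1.45} \approx 80
```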

2. Memory Access Requirements

The memory access requirement comparison can be summarised as follows.

2.1. Number of on-chip reads: For ‘single-pass’ graphics processor implementations the number of on-chip reads (Nreads) equals the ‘number’ parameter in ‘The Three Axes of Characterisation’. The inequality in Equation 4.19 must hold for real time performance, with parameters input bandwidth (IBWgpu:peak), number of outputs (Noutputs) and desired throughput rate (Treq). A reconfigurable logic implementation is superior for a high ‘number’ of accesses because there is no equivalent on-chip bandwidth constraint.

Nreads × Noutputs × Treq < IBWgpu:peak        (4.19)
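For illustration only (the parameters are assumptions of this sketch, not results from the comparison): a 5 × 5 convolution, reading Nreads = 25 values per output over a 720p frame (Noutputs = 921600) at Treq = 30 frames per second with four bytes per read, requires

```latex
25 \times 921600 \times 30 \times 4 \;\text{bytes/s} \approx 2.8 \;\text{GB/s},
```

which sits comfortably below the peak on-chip input bandwidth; a higher ‘number’ of reads or a higher frame rate pushes the requirement towards the bound.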

2.2. High data reuse: If the inequality in Equation 4.19 holds, consider as follows. For a variable memory access pattern with high data reuse potential, a graphics processor is desirable. This is due to the large datapath delay overhead of implementing the data reuse heuristic in a reconfigurable logic implementation.

2.3. Low data reuse: For low data reuse potential, performance limitations occur on both devices. Input bandwidth for the graphics processor is observed to drop to 9.8 GB/s for a data reuse potential of 0.25 to 0.5. This is 3.5 times lower than the peak value. A reconfigurable logic implementation is inferior in this work due to the mismatch between input and output bandwidth requirements.

2.4. Number of off-chip reads: For single-pass graphics processor implementations, the minimum number of off-chip reads is equal to that for a reconfigurable logic implementation. For multi-pass graphics processor algorithms the off-chip memory access requirement increases by up to 2 × 16². Reconfigurable logic is superior in this aspect.

3. Data Dependence

Data dependence is explored in this work by considering the algorithm optimisation decisions. In Section 4.7.1, the scenarios of a low number of outputs and a high instantaneous buffer requirement (in Equation 4.17) are identified as two reasons to explore algorithm optimisations for a graphics processor.

For reconfigurable logic implementations, these two scenarios are avoided. The number of outputs computed does not affect the overall performance. Similar buffer limitations to those shown in Equation 4.17 are avoided through the choice of a lower value of parallelism (Tb) to optimise resource usage. If off-chip memory bandwidth is limited, or the graphics processor pipeline reset overheads are high, then a reconfigurable logic implementation is superior. Otherwise the strategy in Section 4.7.1 should be traversed. If the two scenarios above are avoided for the new inner-most loop, the graphics processor is a viable architectural choice. A direct implementation of an algorithm where the above two scenarios do not occur is subject only to arithmetic complexity and memory access requirements analysis.

The above strategy covers the key issues in mapping algorithms to architectures: arithmetic complexity, memory access requirements and data dependence. With the addition of a quantification of throughput rate drivers, this meets the objective of a structured approach to comparison.

4.9.2 Future Scalability Implications

The progression of Moore’s Law will inevitably provide increasing parallelism on both the graphics processor and reconfigurable logic. For reconfigurable logic, a key benefit is the choice of the dimension (operation or iteration level) in which to increase parallelism. With regard to Amdahl’s Law in Equation 3.4, this results in arbitrary increases in n. The limitation in achieving speed-up is the design time and tool flow requirements to exploit the available resources. A key difference for the graphics processor is that the degree of operation and iteration level parallelism is fixed. This results in a density advantage, and ultimately a higher value of n, for equivalent die area to reconfigurable logic. However, for high data dependence, where the exploitation of parallelism is restricted, reconfigurable logic is superior. Newer homogeneous multi-processing element (HoMPE) architectures, for example the Cell BE and GeForce 8800 GTX, support increased on-chip memory control. These features alleviate some of the on-chip memory space restrictions observed for the GeForce 7900 GTX. However, a fixed operation versus iteration level parallelism is a fundamental feature of a HoMPE architecture. The Cell BE presents more flexibility with the option to pipeline SPE operations, however, this too is somewhat constrained. It is in this aspect that reconfigurable logic is fundamentally superior to HoMPE architectures. In pure computation power, the graphics processor is observed to be advantageous in comparison to a fixed or floating point reconfigurable logic implementation. For floating point, the graphics processor is superior by up to twelve times.

4.9.3 Performance Question Answers to Motivate Chapters 5 and 6

A summary of the answers to the questions in Section 4.1 is as follows:

Question 1: For compute bound algorithms the impressive speed-ups over a general purpose processor (GPP) occur as follows. A graphics processor provides an order of magnitude improvement in instruction inefficiency, iteration level parallelism and clock cycles per output over a GPP. For reconfigurable logic, the advantage is a two orders of magnitude improvement in cycles per output. In each case, up to two orders of magnitude speed-up is observed when including the clock speed advantage of a GPP.

Question 2: The memory bound 2D convolution algorithm is used to show a fixed cost of 1.7 clock cycles per memory access. As a result the GeForce 7900 GTX is inferior to reconfigurable logic beyond a window size of 4 × 4 pixels. A reconfigurable logic implementation has O(n) memory resource requirements. The required memory size exceeds the cache size on a GeForce 7900 GTX beyond a window size of 7 × 7 pixels.

Question 3: A graphics processor is highly sensitive to the amount of data reuse which can be exploited. This results in an up to two times drop in input bandwidth for decimation. Reconfigurable logic performance is limited, for decimation, by the discrepancy between input and output bandwidths. The presence of a variable data reuse requirement for resizing increases the datapath delay for the reconfigurable logic implementation.

Question 4: The cost of data dependence on a graphics processor is a high number of rendering passes, equal to nine for motion vector estimation (MVE) and five for histogram equalisation. This results in significant pipeline setup overheads. For MVE, a high instruction count to support the multiple rendering passes and memory access optimisations is the cause of low performance. The scenario for histogram equalisation is superior due to greater flexibility in the choice of memory access pattern.

It is observed from the above, and throughout this Chapter, that a graphics processor is highly sensitive to the memory access pattern of a target algorithm. One example is the significant reduction in input bandwidth for small changes in decimation size. Motion vector estimation (MVE) is another example of a poorly matched memory access pattern. In this case the performance could be improved, similarly to decimation, through increasing the number of rendering passes. Although this is one option to increase performance, it results in higher computation and on-chip memory access requirements for MVE. In addition, the pipeline setup overheads increase the total time taken to two times the quoted execution times for histogram equalisation, decimation and MVE. It is therefore necessary to address two points. First, the graphics processor features which result in a poor performance for decimation and MVE. Second, to identify the access patterns which result in low performance and to explore methods of improving memory access performance. These two goals are achieved in Chapters 5 and 6 respectively.

Chapter 5

Design Space Exploration of Homogeneous Multi-Processing Element Architectures

In this Chapter, a systematic process for the exploration of the customisable options of the HoMPE architecture for a given application domain is presented. The process, in addition to a parameterisable system model, forms a novel design space exploration tool. The example taken in this Chapter, and also in Chapter 6, is to consider how the graphics processor architecture, as analysed in Chapter 3, may be customised to the video processing application domain. The motivations for the exploration are as follows. It was shown in Chapter 4 that despite impressive performance for many case studies, the graphics processor is highly sensitive to an algorithm’s memory access pattern. Of note is a reduced performance for algorithms with low data reuse potential, for example decimation. This limitation provides the first motivation for design space exploration. The second motivation is the following projections from the Tera Device [136] and High-Performance Embedded Architecture and Compilation (HiPEAC) [17] road maps.

I. Memory bandwidth and interconnect restrictions between processing elements (PEs) necessitate a revolutionary change in on-chip memory systems [17, 136].

II. It is becoming increasingly important to automate the generation of customisable accelerator architectures from a set of high level descriptors [17, 136].

III. Continuing from II, architecture customisation may be applied at the design, fabrication, computation or runtime stage [136].

The statements above are not mutually exclusive: an answer to statement I may be a customisation from statement II. It is important to note the following key words. First, customisation: this is apparent from points II and III, which between them cover pre-fabrication and post-fabrication architectural customisations. Pre-fabrication customisation is the ‘standard’ approach of determining desirable ‘fixed’ architecture components through design space exploration. Post-fabrication customisation is a choice of reconfigurable logic architectural components (hardware), or a programmable processor control element (software), to facilitate in-field modifications. The most common occurrence of post-fabrication customisation is the choice of a processor processing element. Second, high-level descriptors: the increased complexity of the application and architecture domains necessitates architectural exploration at a suitably high degree of abstraction, with a high-level representation of each domain. Third, a focus on interconnect and memory systems: these factors feature in each road map [17, 136]. It is becoming increasingly challenging to present the required input data and distribute output data to and from processing elements. It is hoped that the exploration tool presented in this Chapter can be used to explore the above observations. The aim is to provide significant insight into some of the associated challenges faced in designing the exploration process and system model. In parallel, another objective is to explore the characteristics identified in Chapter 4 (the first motivation) and ultimately how performance may be improved. To achieve these goals the Chapter is arranged as follows. In Section 5.1 the design space exploration methodology is proposed. A model of the architectural feature set (design space) is described in Section 5.2. Section 5.3 includes an analysis of the architectural trends for the graphics processor. In parallel the model is verified. Post-fabrication customisable options are proposed in Section 5.4. Section 5.5 summarises the findings. The key contributions which are claimed in this Chapter are as follows.

1. A systematic design space exploration methodology to explore the customisable options of homogeneous multi-processing element (HoMPE) architectures. The approach is an extension of the Y-Chart [71] scheme. A key step is the inclusion of the ‘application mapping’ region as a variable design space option. The ethos of the approach is that the application characteristics drive the choice of architecture (Section 5.1).

2. The development of a high-level customisable model of the HoMPE class of architectures, targeted specifically at the graphics processor. To the author’s knowledge, the work presented here and previously published in [31] represents the first such high level model of a graphics processor, or of any architecture in the HoMPE class, which is suitable for the design space exploration approach (Section 5.2).

3. A classification and exploration of the pre-fabrication customisable options for the graphics processor architecture. Within the exploration, architectural critical paths are observed. Post-fabrication customisable options are defined from knowledge of these critical paths. The approach (Section 5.1), exploration (Section 5.3) and proposal of post-fabrication customisable options (Section 5.4) are novel and important steps in exploring the challenges in statements I to III.

4. The system model provides a rapid prototyping platform for the exploration of graphics processor-like architectures. A trade-off between cycle accuracy and simulation time is exploited. This approach also promotes ease of exploration through a set of high level descriptors.

5.1 Design Space Exploration: The Big Picture

In this Section a systematic approach to design space exploration of a template architecture, which can be used to explore architectural customisations, is defined. To motivate the approach, three architectural design questions are posed:

1. What architectural modifications should be made to a graphics processor to support a target application domain (for example, video processing)?

2. How much post-fabrication customisation should be supported within a graphics processor-like architecture? And where should this customisation reside?

3. What is the expected performance of the chosen target architecture?

To explain the design space exploration method, this Section is arranged as follows. In Section 5.1.1 the challenges involved are discussed. The method is summarised in Section 5.1.2. In Section 5.1.3 the tool flow is discussed.

5.1.1 The Challenges

A number of challenges arise in the specification of the design space exploration approach. First, the definition of the architecture and application design spaces. This task is non-trivial because the programming model tightly controls the choice of architecture, and vice versa. For multiple processing element systems this is increasingly important in the quest to exploit the expanding availability of parallelism. The mapping of an algorithm onto a specific architecture is not just conceptual but an integral part of the architectural design space exploration process. Second, how to group and bound the choice of design space feature sets to represent the feasible choices of architecture and application domain options. Third, the support for the exploration of pre- and post-fabrication customisable options. An approach must be high-level to support the flexibility to prototype a large search space whilst maintaining a sufficient level of detail to make the results non-trivial. Fourth, the level of abstraction at which to explore the design space. The simulation time and complexity for exploration using a model with exhaustive detail, such as in [38], makes the approach infeasible for rapid prototyping. Assumptions must be made to simplify a model, as identified in Section 5.2. Fifth, the important performance metrics for analysis must be ascertained. This is determined by the designer defined critical paths and is ultimately influenced by the requirements of the target application domain.

(a) The Proposed Exploration Method (an extension of the ‘Y-Chart’ approach [71])

(b) Evaluation of the Customisable Options for an Architecture

Figure 5.1: A Systematic Approach to Design Space Exploration

5.1.2 A Systematic Approach

The proposed approach is summarised in Figure 5.1. An overall picture of the exploration methodology is shown in Figure 5.1(a) and the process of evaluating customisable options is detailed in Figure 5.1(b). The approach proceeds as follows.

Exploration Process (Figure 5.1(a))

Alternative architecture templates and sets of application characteristics form the entry point to the design space exploration approach. In this work the architecture template is defined with the features described in Section 5.2. The application is characterised using ‘The Three Axes of Algorithm Characterisation’ scheme as defined in Chapter 3. These two problem definition sources initialise the architecture and application design spaces. The architecture design space is explored through considering the pre- and post-fabrication customisable options which can be arbitrarily chosen from the features of the template architecture. This process is explained in more detail in Figure 5.1(b). An application design space is traversed through considering algorithm optimisation techniques. For example, optimisations for a graphics processor are detailed in Section 3.4.1. The choice of application and architecture is not mutually exclusive. To represent the overlap an application mapping region is defined. This is a subset of the choice of architectural features. A distinction from the ‘Y-Chart’ approach [71] is made here. In the Y-Chart approach, although a mapping stage is presented, it is outside of the design space options. The integration of application mapping into design space exploration is particularly important for homogeneous multi-processing element architectures. This is because the application model (more precisely the choice of programming model) substantially impacts the choice of architecture. This is in contrast to the design of a general purpose processor where, by definition, no fixed application domain is considered. Once a set of architecture and application features has been chosen from the respective design spaces, the options are used to parameterise the system model. Where, for example, a reconfigurable logic customisation is proposed, the design may require prototyping at a greater level of detail. A motivation may be to determine the area cost, maximum clock speed or required power. Alternatively, one may require a ‘test run’ on, for example, a current graphics processor to verify an experimentally testable heuristic. The combination of system model and low-level prototyping forms the development environment. At progressively later stages of a design process, increased portions of a proposed architecture are implemented as a low-level prototype. The output from the development environment is a number of traces and performance figures which are used to evaluate the suitability of the proposed architecture against application requirements.

Example requirements are to minimise the clock cycle count or the number of off-chip memory accesses. The process is iterated to alter application and/or architecture feature choices through educated conjecture after performance evaluation.

Evaluation of Customisable Options (Figure 5.1(b))

Figure 5.1(a) shows the overall design space exploration approach. The application of this approach to the evaluation of architectural customisable options is defined in Figure 5.1(b). There are three key stages to the evaluation of customisable options. These are summarised in the list below alongside examples of where these stages are employed.

a. The exploration of pre-fabrication customisable options, for example the number and nature of processing elements and the choice of memory subsystem. Architectural critical path knowledge is output from this stage (Sections 5.2 and 5.3).

b. From the analysis of the ‘critical path knowledge’, post-fabrication customisation options are proposed (Section 5.4). The proposal is to replace an architectural component (or supplement a component) with one that supports reconfiguration. In turn the reconfiguration may be supported by a hardware or software based approach.

c. Once the dynamic components of the system have been identified in stage b, a heuristic based on the application feature set is proposed. The aim is to determine what configuration should be applied for a particular algorithm in the application domain (Chapter 6).

5.1.3 Tool Flow

A tool flow for the design space exploration approach is summarised in Figure 5.2. The IEEE 1666-2005 SystemC class library is chosen as the implementation platform for the system model. As motivated in Section 2.5.2, SystemC is a suitable tool for the requirements of this work. In summary of the motivations, the flexibility of SystemC allows a high-level customisable multi-core model to be created. Panda [105] promotes the multiple levels of abstraction in the SystemC library for implementing functional, transaction, behavioural and, more recently, RTL level models of a system. The approach in this work is transaction level modelling, as discussed in Section 2.5.2. A C++ wrapper encloses the system model to facilitate the parameterisation of the architecture feature set and application characteristics.
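To make the modelling style concrete, the sketch below shows how a single parameterisable processing element might be written in SystemC at transaction level. It is a minimal illustration only: the module, its ports, parameters and timings are inventions of this sketch, not the thesis model.

```cpp
#include <systemc.h>
#include <iostream>

// A minimal transaction-level processing element: it consumes pixel indices
// from the pattern generator, models computation latency with a timed wait,
// and forwards results towards the output buffer.
SC_MODULE(ProcessingElement) {
    sc_fifo_in<int>  pattern_in;  // pixel processing order from pattern generation
    sc_fifo_out<int> result_out;  // computed outputs towards the output buffer

    unsigned instr_per_output;    // parameter: kernel instruction count
    sc_time  cycle;               // parameter: core clock period

    void compute() {
        for (;;) {
            int p = pattern_in.read();        // next output location
            wait(instr_per_output * cycle);   // model the computation time
            result_out.write(p);              // placeholder result value
        }
    }

    SC_CTOR(ProcessingElement)
        : instr_per_output(227),              // illustrative instruction count
          cycle(sc_time(1.54, SC_NS))         // e.g. a 650 MHz core clock
    {
        SC_THREAD(compute);
    }
};

int sc_main(int, char*[]) {
    sc_fifo<int> pattern(16), results(16);
    ProcessingElement pe("pe");
    pe.pattern_in(pattern);
    pe.result_out(results);

    for (int i = 0; i < 4; ++i) pattern.write(i); // four sample pixel indices
    sc_start(sc_time(2, SC_US));

    while (results.num_available() > 0)
        std::cout << "output " << results.read() << std::endl;
    return 0;
}
```

A full model would replicate this element across processing groups and attach MMU and memory models; the value of the abstraction is that such structural changes amount to a few lines of C++.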


Figure 5.2: Tool Flow for the Systematic Design Space Exploration Approach

To enable rapid prototyping of low level modules the VHDL language is used. Open loop tests may alternatively be implemented on a graphics processor, using for example Cg plus the OpenGL API. These tools are vital for proof of concept analysis. For visualisation of the performance metrics output from the prior stages, the ModelSim and MATLAB environments are used. In addition, performance metrics are output to a console window and to system trace files as prescribed by the designer. This facilitates the interrogation of a pre-defined subset of signals within the system model with minimal impact on simulation time. These tools are also vital in the debugging process for the model. The above tool choices represent those used within this work. In practice, alternative substitute tools can be used, including those discussed in Section 2.3.2. For example, higher level HandelC or synthesisable SystemC descriptions may be used in place of the VHDL language.

(a) Feature Set Hierarchy

(b) Dividing the Architectural Design Space

Figure 5.3: The Architectural Design Space for the Homogeneous Multi-Processing Element Class of Architectures

5.2 The Design Space: Architecture Feature Set

In this Section a model to support the design space for the homogeneous multi-processing element (HoMPE) class of architectures is described. To bound the design space and make the work tractable the focus is on the graphics processor architecture and the video processing application domain. The motivation of the model is to support the design space exploration methodology in Section 5.1. Figure 5.3(a) represents the hierarchy of the application and architectural feature sets. Three key groups of design space parameters are identified: the application feature set, application mapping and core architectural features. The application model is based on ‘The Three Axes of Algorithm Characterisation’ and is discussed in Section 5.3. As highlighted in Section 5.1.2, the ‘Y-Chart’ design space exploration approach [71] is augmented with a third model of application mapping to extract architectural features which support the programming model. Core architectural features represent the underlying architecture which is transferable between different application mappings. The core architectural features and application mapping comprise the architecture feature set. In Figure 5.3(b), the architecture feature set is presented against an increasing degree of customisation. The regions of the design space represent the customisable options one may explore. A familiar feature grouping, which includes interconnect, memory system and processing element related features, supplements Figure 5.3(b). This Section is arranged as follows. In Section 5.2.1, the architecture model is described. Optimisation goals are described in Section 5.2.2.

(a) A Single Processing Element (PE)

(b) Extension to Multiple (n) Processing Elements (PEs)

Figure 5.4: High Level Representation of the Design Space Exploration Model

5.2.1 Architecture Model

A high-level system model of the homogeneous multi-processing element class of architectures is shown in Figure 5.4. The single processing element (PE) instance of the model is shown in Figure 5.4(a). In Figure 5.4(b), an extension to n processing elements (PEs) is shown. The arrangement is outlined below. A pattern generation module supplies each processing element (PE) with the order in which to process pixels. The PEs receive input pixel data from off-chip memory through unidirectional on-chip memory. When processing is complete, outputs from the PEs are combined in an output buffer which writes output pixel values to the off-chip memory. In the multiple processing element (PE) case, PEs are arranged in processing groups. Each processing group in the example in Figure 5.4(b) has four PEs. A memory management unit (MMU) arbitrates memory accesses for a processing group through an arbitrary choice of on-chip memory. It is noted that in general a PE need not be a processor based solution. Alternatives include a fixed function ASIC block or a reconfigurable logic block. The latter option is discussed in Sections 5.4 and 6.1. Important model features for the design space exploration are now detailed. Relevant exploration of graphics processor architectural features is included to supplement the model description and provide initial parametrisation. Assumptions to simplify the modelling, whilst maintaining comparable behaviour to the graphics processor, are made throughout. The model is described using video processing system terminology, approximately in order from left to right of the customisable options in Figure 5.3(b).

Figure 5.5: The Z-Pattern Pixel Processing Order and Memory Storage Pattern for the Graphics Processor Memory System (each block represents 2 × 2 pixels)

Processing Pattern

A pixel processing pattern P is distributed from the pattern generation module and can be set to be any arbitrary sequence. This is a simplification of the scenario for the graphics processor, where the processing order is output from the rasterisation pipeline stage. The processing pattern P is specified in Equation 5.1, where P is a vector of video frame locations (xp, yp). For an output frame (Fout) with a pixel iterator p, the processing pattern (P) indicates the order in which output pixel values (Fout(xp, yp)) are calculated. To ensure each output video frame pixel is processed once and only once, a restriction is placed on the vector P: all components {xp, yp} must be unique and the cardinality |P| must equal the target video frame size (Mout × Nout).

P = {{x0, y0}, {x1, y1}, {x2, y2}, ..., {xp, yp}, ...} (5.1)

A popular video memory storage format and rendering rasterisation ordering is the z-pattern [109]. This pattern is shown in Figure 5.5. In general a pixel address {xp, yp} at location Ppat:z[p] is calculated as follows.

Pixel iterator p is represented as a bit-vector (p = pn−1 ... p2p1p0), where location zero is the least significant bit. The xp and yp values are the concatenations of the even and odd bit locations respectively. To put this formally, xp = pn−2 ... p4p2p0

(even bits) and yp = pn−1 ... p5p3p1 (odd bits). For horizontally raster scanned video, an equivalent processing pattern representation to that shown above for the z-pattern is xp = pn/2−1 ... p0 and yp = pn−1 ... pn/2. A raster scan or z-pattern can therefore be generated using an n-bit counter and bit rearrangement.
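As a hedged illustration of this bit rearrangement, the C++ fragment below computes the z-pattern index from a pixel address by placing xp on the even bit locations and yp on the odd ones, exactly as defined above. The helper names are this sketch’s own.

```cpp
#include <cstdint>
#include <cstdio>

// Spread the low 16 bits of v so that they occupy the even bit positions.
static uint32_t spread_even(uint32_t v) {
    v &= 0x0000ffffu;
    v = (v | (v << 8)) & 0x00ff00ffu;
    v = (v | (v << 4)) & 0x0f0f0f0fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// z-pattern index: xp on the even bit locations, yp on the odd ones.
static uint32_t z_index(uint32_t xp, uint32_t yp) {
    return spread_even(xp) | (spread_even(yp) << 1);
}

int main() {
    // The first 2 x 2 block of pixels maps to indices 0 to 3.
    std::printf("(0,0)->%u (1,0)->%u (0,1)->%u (1,1)->%u\n",
                z_index(0, 0), z_index(1, 0), z_index(0, 1), z_index(1, 1));
    return 0;
}
```

The inverse mapping, recovering {xp, yp} from p, is the corresponding bit de-interleave; in hardware both directions are pure wiring.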

It is a sensible design decision to choose identical memory storage and pixel processing patterns. A z-pattern is beneficial due to the inherent spatial locality of memory accesses. This makes it attractive for the rasterisation scheme in graphics processing due to the requirement of arbitrary subsequent accesses in a localised environment. The raster scan pattern often results in trivial data reuse heuristics. In addition a horizontal raster scan pattern matches the progressive scan video format order. The result is a low processing latency of an algorithm-specific number of video frame fields (rows). Minimising frame latency is important in broadcast video processing systems. This is due to the requirement to synchronise with other multimedia data, for example an audio track.

Address Space and Pipeline Restrictions

To simplify the programming model and memory system architecture, the GeForce 6 and 7 series graphics processor generations have separate on-chip and off-chip memory address spaces for read and write operations from and to off-chip memory. The separate off-chip address space holds for a single rendering pass, after which the address spaces may be exchanged [108]. For the architecture model the same philosophy is adopted. Concurrent memory accesses for processing elements (PEs) within one processing group are handled by the memory management unit (MMU). For conflicts between MMUs, concurrent memory accesses are handled by the global interconnect. The round robin scheme is used for arbitration in each case. This is similar to the method patented by nVidia for the texture unit (MMU) and fragment processors (PEs) in graphics processors [77]. At a high-level, the feed through pipeline nature of the model matches the graphics processor stream processing programming model as outlined in Section 2.2.2.
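As an illustration of this arbitration policy, a round-robin grant can be expressed in a few lines of C++. The class below is a sketch of this work’s own framing, not the patented design [77]: the granted requester is rotated to the lowest priority for the next cycle.

```cpp
#include <cstdio>
#include <vector>

// Round-robin arbitration over n requesters (e.g. the PEs of one group).
class RoundRobinArbiter {
    unsigned next_ = 0; // requester with highest priority this cycle
    unsigned n_;
public:
    explicit RoundRobinArbiter(unsigned n) : n_(n) {}

    // Returns the index of the granted requester, or -1 if none request.
    int grant(const std::vector<bool>& request) {
        for (unsigned i = 0; i < n_; ++i) {
            unsigned candidate = (next_ + i) % n_;
            if (request[candidate]) {
                next_ = (candidate + 1) % n_; // rotate the priority
                return static_cast<int>(candidate);
            }
        }
        return -1;
    }
};

int main() {
    RoundRobinArbiter arb(4); // four PEs per processing group
    const std::vector<bool> req = {true, false, true, true};
    for (int cycle = 0; cycle < 4; ++cycle)
        std::printf("cycle %d: grant PE %d\n", cycle, arb.grant(req));
    return 0;
}
```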

Degree of Multi-threading

A graphics processor requires a large degree of multi-threading for two purposes. First, multi-threading is used to hide off-chip memory access latency, which is an increasing factor for DRAM technology [2]. Second, to maintain full computational datapaths, 8-10 latency cycles [128] per computation unit must be covered. The model is implemented such that Tb threads can be computed across the n processing elements. For simplification, Tb/n threads are live on each PE at any one time instance. On the graphics processor a thread may in general be scheduled to different PEs at different points in the computation; however, this arrangement is deemed sufficient. For non-processor PEs an equivalent issue may be the pipeline depth of the datapath. For video processing systems, an alternative to multi-threading for hiding off-chip memory access latency is to use a heuristically controlled memory system, loosely termed

pre-fetching. Pre-fetching is possible because the access pattern is determined pre-execution. In the general case a cache may not be required at all. Pre-fetching hardware is also present in graphics processors [62]. However, this is targeted at the spatially local memory accesses of graphics rendering and is observed in Chapter 4 to be unsatisfactory, for example for the motion vector estimation case study. It is interesting to consider what alternative optimisations can be implemented for the video processing application domain to exploit the scenario that the memory access pattern is often defined pre-execution.

A Prediction of the Graphics Processor Thread Batch Size

A graphics processor’s thread batch size can be determined using a computationally intensive kernel. A large instruction count is required to minimise the effect of cache behaviour and to emphasise the granularity of changes between steps in performance.

Figure 5.6: Estimation of the Number of Execution Threads for Sample Graphics Processors. Rendering time (in microseconds) is plotted against the number of items computed (0 to 10000) for (a) the nVidia GeForce 6800 GT (an estimated 1500 threads per batch) and (b) the nVidia GeForce 7900 GTX (an estimated 1300 threads per batch).

The chosen kernel is the American Put Option financial model [69], which requires 446 computational instructions and one memory access per kernel to implement a 10 node search tree. Estimated cycle counts are 589 and 430 cycles per output for the GeForce 6800 GT and 7900 GTX respectively. Figure 5.6 shows the performance results for the GeForce 6800 GT and 7900 GTX graphics processors for increasing output frame size from 1 to 10000 pixels. It is observed in Figure 5.6 that steps in the time taken for a rendering pass occur at intervals of 1500 outputs computed for the GeForce 6800 GT and 1300 outputs for the GeForce 7900 GTX. This is the predicted thread batch size (Tb). The difference between devices is proportional to the change in the ratio of off-chip bandwidth to core clock rate, which is the number of clock cycles required to 'hide' off-chip memory access latency. The off-chip bandwidth of the GeForce 7900 GTX and 6800 GT is 51 GB/s and 31.2 GB/s respectively. Core clock rates are 650 MHz for the GeForce 7900 GTX and 350 MHz for the GeForce 6800 GT. The ratio of memory bandwidth to core clock rate is 89 for the GeForce 6800 GT and 78 for the 7900 GTX, a factor of 1.14 times, which equals the ratio of the estimated thread counts. The prediction of the number of threads concurrently in flight on a graphics processor can be used for initial parametrisation of the model.
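As an illustrative sketch of this measurement (the helper name and threshold parameter are assumptions, not thesis code), the batch size can be recovered from such timing data by locating the output counts at which the rendering time jumps; their spacing estimates Tb:

    #include <cstddef>
    #include <vector>

    // Return the output counts at which the rendering time series jumps by
    // more than 'threshold' microseconds; step spacing estimates Tb.
    std::vector<std::size_t> batch_steps(const std::vector<double> &time_us,
                                         double threshold) {
        std::vector<std::size_t> steps;
        for (std::size_t i = 1; i < time_us.size(); ++i)
            if (time_us[i] - time_us[i - 1] > threshold)
                steps.push_back(i);
        return steps;
    }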

Processing Element

To motivate flexible exploration, a simplified processor model combining memory access behaviour and the required computation cycle count between accesses is implemented. This is sufficient to model the behaviour of a node in a Kahn process network [70].

Gt = {P [p],P [p + 1], ..., P [p + Tb − 1]} (5.2)

At a time t a group of Tb pixels (threads) is ‘live’ for computation over all n PEs.

Equation 5.2 shows the scenario for a thread group Gt and starting pixel location p. Angle brackets herein refer to an approximate ordering; the actual order is determined by the arbitration methods and the number of PEs (n). A processing group is a subset of the processing pattern (Gt ⊂ P) and of size thread count (|Gt| = Tb). The processing element (PE) application mapping model for a single rendering pass is a function of computation delay (taken from Equation 4.14) and memory access requirements. The pseudo-code for computation on one PE is shown in Figure 5.7. Each PE (enumerated as g) computes a group of Gt:g ⊂ Gt threads where |Gt:g| = Tb/n. The pseudo-code in Figure 5.7 is now summarised. In Lines 0 to 2 a set of output locations (Gt:g[i]) is input from the pattern generation block, and the associated reference (texture) read locations (R[i]) calculated.

0.  For all Threads i = 0 to Tb/n − 1
1.      Reference Read Location R[i] = f1(Gt:g[i])
2.  End Loop
3.  For all Accesses w = 0 to W − 1
4.      For all Threads i = 0 to Tb/n − 1
5.          Request Address Ct:g[i + (Tb/n)w] = f2(R[i], w)
6.      End Loop
7.      Wait until All Read Requests Granted
8.      Increment Cycle Access Count Ω
9.  End Loop
10. Wait for Compute Cycles minus Memory Access Cycles ((Tb/n)CPOgpu − Ω)
11. For all Threads i = 0 to Tb/n − 1
12.     Write to output location Gt:g[i]
13. End Loop

Figure 5.7: A Model of the Behaviour of a Processing Element (which is Equivalent to a Graphics Processor Fragment Pipeline with a Single Input Texture)

The memory access behaviour is modelled in Lines 3 to 9. Memory accesses occur with an outer loop over all accesses W (Line 3). Inside the outer loop, requests are made for each thread i (Line 5). The PE then waits until all requests are satisfied (Line 7) and iterates for the next value of w. Once all read requests are made, the cycle delay of any compute operations which are not 'hidden' by memory accesses is added (Line 10). On Lines 11 to 13, output pixel values are written to an output buffer. The algorithm iterates until the end of the processing pattern is reached.

Functions f1 and f2 are arbitrary linear or non-linear address mappings. For the case study algorithms, the ordering of the W requests is taken from the assembly code output by the nVidia cgc compiler [97].

If more than one texture input is required then multiple reference locations Rr[i] are generated for each thread, and function f2 on Line 5 is defined to access each texture. The order of the modelling of computation and memory access cycles above is not fully consistent with that of the graphics processor. However, this assumption is observed to be sufficient in Section 5.3.4. The code in Figure 5.7 can be extended to intersperse memory access and computation cycles.

Individual memory access patterns Ct:g from each PE (g) and for each time step (t) combine to make the overall on-chip memory access pattern C. The multiple PE setup approximates the fragment processor pipelines in a graphics processor. It is straightforward to change the operation of a PE to support the setup of other architectures.
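A compact C++ reading of the Figure 5.7 behaviour is sketched below; f1, f2 and the wait_granted callback (standing in for the MMU request/grant handshake) are placeholder assumptions, not the thesis model's source:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // One PE processing its thread group Gt:g (Figure 5.7). 'wait_granted'
    // returns the cycles spent waiting for a request (Lines 5, 7 and 8).
    unsigned run_pe(const std::vector<unsigned> &Gtg,   // Tb/n output locations
                    unsigned W, unsigned cpo_gpu,
                    std::function<unsigned(unsigned)> f1,
                    std::function<unsigned(unsigned, unsigned)> f2,
                    std::function<unsigned(unsigned)> wait_granted) {
        unsigned omega = 0;                             // cycle access count
        std::vector<unsigned> R(Gtg.size());
        for (std::size_t i = 0; i < Gtg.size(); ++i)    // Lines 0 to 2
            R[i] = f1(Gtg[i]);
        for (unsigned w = 0; w < W; ++w)                // Lines 3 to 9
            for (std::size_t i = 0; i < Gtg.size(); ++i)
                omega += wait_granted(f2(R[i], w));
        // Line 10: wait for compute cycles not hidden by memory accesses,
        // so the group costs max((Tb/n)CPOgpu, omega) cycles in total.
        unsigned compute = static_cast<unsigned>(Gtg.size()) * cpo_gpu;
        return std::max(compute, omega);                // Lines 11-13 omitted
    }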

Memory System

To describe the memory system, the off-chip and then the on-chip memory choices are described. An increasing trend towards high bandwidth and high latency off-chip DRAM memory devices is observed in [2]. The granularity of access (burst length) is increasing to satisfy user requirements for increased bandwidth. For the model the off-chip memory is assumed to be four dual-ported modules, calibrated to an example Samsung GDDR3 SDRAM device [111]. Per RAM the key parameters are: memory banks (8), burst length (4 × 32-bit words), bank size (8 MByte), number of rows (2^12) and number of columns (2^9). Timing is characterised from the specification [111] in terms of read, write, same-bank and same-row access latencies. The important issues to model here are the features of high latency, high bandwidth, the access granularity and the penalty for non-sequential memory accesses. The memory clock speed is initially set to 2 times the model's core clock rate with double data rate access enabled. This setup is equivalent to that of a GeForce 6800 GT on an accelerator card. To characterise the on-chip memory, a high level model is created. For the design space exploration one can choose an arbitrary memory type. For initial exploration and verification a cache is modelled and calibrated with estimates of graphics processor features from [52, 94]. The parameterisable options are: associativity, size, granularity and replacement heuristic. In the case of a non-cache memory the replacement heuristic is optionally a memory controller. Each model memory management unit (MMU) has direct read-only access to the cache. Within an MMU, memory access latency is hidden through pipelined reads in a channel of length Tb. The initial on-chip memory setup is a 4-way associative 16 KByte cache with 64 pixel cache lines. A full cache line is read when a miss occurs. This is a simplification of the general case where partial cache line reads may occur. The memory system behaviour is characterised formally in Equation 5.3. Vector C is the combined on-chip memory access pattern from all processing elements. A function G(z−1) is applied to vector C to produce the off-chip memory access pattern A. The delay operator z−1 is used to signify that each output depends on prior inputs. The cardinality of vector C is the product of the number of outputs (N × M) and the number of reads per output (W). Each location in vector A represents a memory read of set granularity, for example 64 pixel cache lines in the initial setup.

A = G(z−1)C (5.3)

A final component of the memory system is the output buffer. The output buffer collates outputs into groups of pixels equal to the cache line size and writes them to off-chip memory. This pipelines the processing element outputs. For the results presented here the output buffer has minimal effect on performance.
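For reference, the calibration constants above can be grouped as plain C++ parameter structures; the structure and field names below are illustrative assumptions rather than the model's actual source:

    struct Gddr3Params {          // per off-chip RAM (Samsung GDDR3 [111])
        int banks = 8;
        int burst_words = 4;      // 32-bit words per burst
        int bank_mbytes = 8;
        int rows = 1 << 12;
        int columns = 1 << 9;
        double clock_mult = 2.0;  // memory clock = 2x core clock, DDR enabled
    };

    struct CacheParams {          // initial on-chip memory setup
        int ways = 4;             // associativity
        int size_bytes = 16 * 1024;
        int line_pixels = 64;     // a full line is read on a miss
    };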


Figure 5.8: Memory System Analysis for the GeForce 7900 GTX Graphics Processor for Varying Memory Access Stride Length (a) and Number of Reads (n × n, for n = 1, 2, 4, 8, 16). Cycles per internal memory access (CPIMA) are plotted against stride lengths of 0 to 33.

Memory System Behaviour

To interrogate the graphics processor memory system, the performance of memory accesses of varying stride length (a) and number of reads (in pixel windows of size n × n, as indicated in the legend) is shown in Figure 5.8. Performance is quoted in cycles per internal memory access (CPIMA) as expressed in Equation 4.15. The first notable trend is the sharp increase in CPIMA (decrease in performance) for stride lengths above a = 1. This is symptomatic of the optimisation of the graphics processor memory system for localised memory access: a small loss of locality results in a sharp change in performance. For example, a stride length of one (a = 1) requires an average of 1.4 CPIMA, whilst a stride length of two (a = 2) requires over 6 CPIMA. In general, little decrease in CPIMA is observed from increasing the number of reads (top to bottom of the legend) in Figure 5.8. Reductions occur for up to 4 × 4 reads. This is in part due to increased amortisation of pipeline costs, which are prominent for the one memory read case. The performance is consistent with that expected if threads are scheduled concurrently for each access, as described for the processing elements above. A justification for this is that regular cache performance will occur in this scenario. In contrast, if all or a portion of the reads were performed for each thread in turn, a significant reduction in CPIMA would be expected from increased memory access locality, particularly at large stride lengths.

The peaks in Figure 5.8 also provide useful information. Increasingly prominent peaks are observed at steps of 2, 4 and 8 stride lengths. The largest step (8) verifies the cache line size of 8 × 8 pixels. Further performance degradation beyond a = 8 is due to the unbalanced utilisation of off-chip DRAM banks. A partial cache line read is predicted to be 4 × 4 pixels; this is one justification for the emphasised peaks at stride intervals of 4. The peak at a stride length of 2 represents a 2 × 2 quad of fragment pipelines (PEs) each accessing a different pixel quad (2 × 2 pixels) in parallel. Pixel quads are the minimum unit of access from the graphics processor texture units (MMUs) to cache [77]. Features from this analysis are used to supplement the details in Section 4.6 and estimates in the literature [38, 52] to parameterise the model's on-chip memory system. The parallel interactions between all components in the system model can be efficiently modelled using the SystemC class library. To minimise simulation time the model implements only address passing and handshaking, with no data movement. Each system component described above is implemented as a replicable SystemC module (C++ class).
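As a flavour of this modelling style, the following minimal SystemC sketch (module and channel names are illustrative assumptions, not the thesis model) passes addresses over a FIFO with blocking handshakes and no pixel data:

    #include <systemc.h>

    SC_MODULE(PatternGen) {                    // stands in for pattern generation
        sc_fifo_out<unsigned> addr_out;
        void run() {
            for (unsigned p = 0; p < 256 * 256; ++p)
                addr_out.write(p);             // address only, no data movement
        }
        SC_CTOR(PatternGen) { SC_THREAD(run); }
    };

    SC_MODULE(Mmu) {                           // consumes and 'services' requests
        sc_fifo_in<unsigned> addr_in;
        void run() {
            for (;;) {
                unsigned a = addr_in.read();   // blocking read models handshake
                (void)a;
                wait(10, SC_NS);               // nominal access latency
            }
        }
        SC_CTOR(Mmu) { SC_THREAD(run); }
    };

    int sc_main(int, char *[]) {
        sc_fifo<unsigned> channel(16);
        PatternGen pg("pg");
        Mmu mmu("mmu");
        pg.addr_out(channel);
        mmu.addr_in(channel);
        sc_start();                            // runs until event starvation
        return 0;
    }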

5.2.2 Optimisation Goals

The two key optimisation goals of interest in this work are off-chip memory access and clock cycle count minimisation. Off-chip memory access minimisation takes two forms. First, the minimisation of the number of accesses (|A|); for video processing systems off-chip memory access is often a key bottleneck. Second, the minimisation of the steps between accesses. This minimisation matches the requirement of high-latency off-chip memory to maximise memory access locality [2]. A visual representation of this is a histogram of the difference between consecutive accesses. This visualisation is a generic representation of the potential delay for all choices of off-chip memory. The minimisation functions for the number and distribution of memory accesses are shown in Equations 5.4 and 5.5 respectively.

CostA:num = |A| (5.4)

CostA:dist = hist(A − z−1A) (5.5)
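A direct C++ reading of these cost functions might look as follows; this is an illustrative sketch, not the thesis tool's implementation:

    #include <cstddef>
    #include <map>
    #include <vector>

    // CostA:num is simply A.size(); CostA:dist (Equation 5.5) histograms the
    // differences between consecutive off-chip access addresses.
    std::map<long, std::size_t> cost_dist(const std::vector<long> &A) {
        std::map<long, std::size_t> hist;
        for (std::size_t i = 1; i < A.size(); ++i)
            ++hist[A[i] - A[i - 1]];
        return hist;
    }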

Clock cycle count is a function of the memory and computation cycles as described in Figure 5.7. The model provides a non-cycle-accurate clock cycle estimate to show the trend in architecture behaviour for varying design space options.

5.3 Architectural Trends for the Graphics Processor

In Chapter 4, performance issues arose for the graphics processor under conditions of low data reuse potential; this is therefore the focus of the critical path analysis in this Section. In particular, the 2D implementation of the decimation case study is a prominent example of the low data reuse potential characteristic. In addition to the critical path analysis, it is desired to explore a subset of the pre-fabrication customisable options for the system model. The customisable options are chosen from Figure 5.3(b) to represent specific design space exploration challenges, and to exemplify step (a) in Figure 5.1(b). A byproduct of the analysis is an estimate of the actual number of off-chip memory accesses, in contrast to the minimum numbers calculated in Section 4.6. Motion vector estimation is used as a case study of this in Section 5.3.5. To explore the architectural critical paths and demonstrate the system model, this Section is arranged as follows. In Section 5.3.1, a single processing element (PE) scenario is first considered to set the scene. The customisable options for multiple processing elements are then explored in Section 5.3.2. In Section 5.3.3, the distribution of memory access patterns is analysed. The system model is critiqued in Section 5.3.4. In Section 5.3.5, the portability of the system model is explored. The implications for alternative homogeneous multi-processing element architectures are summarised in Section 5.3.6.

5.3.1 System Model with a Single Processing Element

A system model with the setup of a single processing element, as shown in Figure 5.4(a), and a single execution thread (Tb = 1) is considered in this Section. This represents the trivial case for design space exploration and is used to demonstrate a number of challenges. It is interesting to explore the customisable options of a z-pattern processing order and the raster scan method. For each scenario the processing pattern (P), on-chip memory access pattern (C) and off-chip memory access pattern (A) are shown in Figure 5.9. First note the z-pattern processing pattern (P) in Figure 5.9(a). The z-pattern subdivides an input frame into equal sized quadrants at different hierarchies. This demonstrates the spatially local nature of this pattern, which is vital for its primary purpose of rasterisation. Figure 5.9(b) shows the well known horizontal raster scan pattern. It is observed that for the z-pattern processing order the variation in the required on- and off-chip memory addresses is significantly larger than that for raster scan processing. To quantify this difference for on-chip reads, consider the convolution case study as follows. For the raster scan method the peak distance between reads for a single output is

Figure 5.9: Processing Pattern for Decimation and 2D Convolution Case Studies with a Setup of One Processing Element and a Single Execution Thread (Tb = 1); output frame size is fixed at 256 × 256 pixels. Panels: (a) z-pattern processing order (P); (b) raster scan processing order (P); (c) on-chip access pattern (C) for convolution size 5 × 5; (d) on-chip access pattern (C) for 1080p to 480p decimation; (e) off-chip access pattern (A) for convolution size 5 × 5; (f) off-chip access pattern (A) for 1080p to 480p decimation.

nconv rows of an image, where nconv is the convolution size. This equals ∼ 5 × 256 pixel locations in Figure 5.9(c), as demonstrated by the heavy line. In general the maximum step size is ∼ nconvNin pixels, where Nin is the number of pixels per row of an input frame. In contrast, for the z-pattern the peak range of memory accesses for one output arises from requests of pixels in opposing quadrants of the input video frame. This equals ∼ 2(256/2)^2 pixel locations and is demonstrated in Figure 5.9(c) with steps in excess of 30k pixel locations between accesses. In general the maximum range is ∼ 2(Nin/2)^2 pixels. This is significantly larger than for the raster method for practical choices of nconv. For decimation, Figure 5.9(d) demonstrates a similar scenario, for an input frame size with dimensions sx^−1 × sy^−1 times larger than for convolution, where sx and sy are the horizontal and vertical resizing ratios respectively. The irregular pattern for the z-pattern occurs because input frame dimensions are buffered up to a power of two. The off-chip memory access patterns (A) for each case study are summarised in Figures 5.9(e) and 5.9(f). These patterns approximate those for the on-chip accesses, as would be expected. A two to three order of magnitude reduction from the number of on-chip (|C|) to off-chip (|A|) accesses is observed in all cases. This indicates good cache performance. The raster access pattern requires the lowest number of memory accesses (|A|); this is in fact the optimum value for each case study. For the z-pattern, the number of accesses is within 1.5 times that for a raster scan pattern. The difference is due to the larger degree of variance in the on-chip access pattern C. A greater variation in memory address location also correlates with poor DRAM memory access performance [2]. Due to the large reduction in off-chip memory accesses, the performance for each choice of access pattern is not bound by off-chip memory accesses. The estimated number of cycles is 4.06M for decimation and 2.1M for 5 × 5 convolution in both access pattern scenarios. It is interesting to now consider the extension of these issues to a system model for the case of multiple threads (Tb) and processing elements.

5.3.2 System Model with Multiple Processing Elements

When a system is extended to multiple processing elements (PEs) and multiple concurrently live threads (Tb), the interactions between the memory accesses from each thread are significant. To set the scene, the equivalent off-chip memory access requirements for a setup equivalent to the GeForce 6800 GT are shown in Figure 5.10. The first factor to note is that despite a large degree of multi-threading (Tb = 1500) the memory access behaviour for 2D convolution in Figure 5.10(a) remains largely unchanged. This is attributed to the large degree of data reuse between neighbouring outputs, which minimises capacity cache misses. For the raster scan pattern the number of off-chip accesses (|A|) is, in similarity to Figure 5.9(e), the optimum value. That is a total of

Figure 5.10: Processing Pattern for Decimation and 2D Convolution Case Studies with a Setup of Four Processing Elements and Tb = 1500 Computation Threads (Equivalent to the nVidia GeForce 6800 GT), for a Fixed Output Frame Size of 256 × 256 Pixels. Panels: (a) off-chip access pattern (A) for convolution size 5 × 5; (b) off-chip access pattern (A) for 1080p to 480p decimation.

|A| = 1024 memory accesses. The access requirements for the z-pattern in Figure 5.10(a) mirror those for the single PE case in Figure 5.9(e). An increase is observed in the number of off-chip accesses (|A|), due to conflicting accesses between threads (Tb). A significant change in the off-chip memory access pattern occurs for the decimation case study in Figure 5.10(b). For both processing patterns an increased number of off-chip memory accesses (|A|) is required to satisfy on-chip memory access requirements. The lowest performance is observed for the z-pattern, where the gap between the number of off-chip and on-chip accesses (|C|) is reduced, from the single PE case, to only one order of magnitude. A number of factors influence the increased number of memory accesses, as explained below. First, the decimation case study has an inherently lower data reuse potential with respect to convolution, as indicated in Table 3.4. For the particular decimation case study in Figure 5.10(b), the resizing factor is sx = 3, sy = 2.25. This translates to a reuse potential of 4/16 to 8/16. In comparison, for convolution size 5 × 5, the reuse potential is 20/25. The result is that for decimation a greater number of threads require a different cache line to the previous thread. This scenario increases the likelihood of cache misses. This is less severe for the raster pattern because the effect is only in a 1D sense (the horizontal dimension), in contrast to 2D for the z-pattern. The raster scenario is affected only by the horizontal resizing ratio. In contrast, the z-pattern scenario is affected by both horizontal and vertical resizing ratios, due to the 2D nature of its cache lines. Second, there is the variation in on-chip memory accesses. The choice of a non-power of two

resizing ratio is shown to make this pattern irregular in Figure 5.9(d). This increases the likelihood of conflict cache misses. The nature of this increase is subject to the choice of cache replacement policy; a large multi-threading factor amplifies the effect. Third, the cache replacement policy is also inefficiently utilised in the presence of a non-power of two input frame size. In the case of a power of two input frame size each cache block is utilised approximately equally. A non-power of two input frame makes the utilisation unequal. The increase in the number of off-chip memory accesses is reflected in the number of clock cycles per output (CPOmodel). For convolution the number of cycles per output increases between the raster and z-pattern methods from 58 to 62. The extra latency for the increased number and variance of off-chip accesses, for the z-pattern method, is predominantly hidden through the combination of multi-threading and a large number (5 × 5 for the convolution in Figure 5.10(a)) of spatially local on-chip memory accesses per output. For decimation, the change in cycles per output between the raster and z-pattern methods is more significant. In this case the raster scan and z-pattern scenarios require 92 and 250 cycles per output respectively. The z-pattern method is off-chip memory access bound under these conditions. A raster scan processing pattern is advantageous under the case study scenario of low data reuse potential; this is explored further in Section 5.4. The customisable options for the system model are now explored. First, performance for a subset of single pass case study algorithms is summarised. The memory access requirements of the decimation case study are then explored. From here on, the model implements the z-pattern processing pattern, as used on the graphics processor.

A Performance Summary for Case Study Algorithms

A summary of the off-chip memory access requirements (A) and clock cycles per output (CPO) for nine variants of the case study algorithms is shown in Figure 5.11. The customisable options explored are the number of processing elements (PEs) and the on-chip memory size. For all tests Tb = 1500 computational threads are used, which is equivalent to the GeForce 6800 GT. The key features are summarised below. The cycles per output count for primary colour correction, convolution and interpolation remains consistent across all memory sizes and numbers of processing elements. For primary colour correction the algorithm is computationally bound, so this property is as expected. The number of off-chip accesses (|A|) is the minimum value (1024). The convolution algorithm was shown in Section 4.6.3 to be limited by on-chip memory bandwidth. For the setup in Figure 5.11, each memory management unit (which performs memory accesses for each set of four PEs) has direct access to on-chip memory.

Figure 5.11: Performance in Number of Off-Chip Memory Accesses (|A|) and Clock Cycles per Output (CPOmodel) for Varying Cache Size, Number of Processing Elements (PEs) and Case Studies (tests with a fixed 256 × 256 input pixel frame size). Panels (a)-(f) plot |A| and CPOmodel for cache sizes of 16 KB, 32 KB and 64 KB; each panel covers primary colour correction (PCCR), convolution (5 × 5, 9 × 9, 11 × 11), interpolation (576p-1080p, 720p-1080p) and decimation (1080p-720p, 1080p-576p) for 4 × 4, 6 × 4 and 8 × 4 PEs.

This setup mimics the scenario assumed for the graphics processor in [38]. Therefore the on-chip memory bandwidth scales linearly with the number of PEs. This explains the equal clock cycles per output performance across all variants of the number of PEs. A change in the number of required off-chip accesses for 2D convolution is observed for a cache size of 16 KByte. This is up to a factor of three times variation from 5 × 5 to 11 × 11 sized convolutions. In all cases the number of cycles per output remains unchanged. The increase in the number of off-chip memory accesses is effectively hidden by the multi-threading factor and the large number of memory accesses (nconv × nconv). Interpolation has equivalent or greater data reuse potential (from 12/16 to 16/16) than convolution (fixed at 20/25). The cycles per output count is therefore also equal for each interpolation ratio. This is despite a difference in off-chip memory access requirements of approximately two times. The multi-threaded computation is again successful in hiding off-chip memory access requirements. No increase in the number of off-chip memory accesses, for different cache sizes, is observed for interpolation. The most significant variations in the number of memory accesses and cycles per output occur for the decimation case study. Whilst 1080p to 720p decimation has a consistent number of cycles per output across all variants of cache sizes and numbers of PEs, the 1080p to 576p case shows more significant variations. This is primarily due to a larger scaling factor for the 1080p to 576p case. In this scenario, the number of off-chip memory accesses (cache misses) is significant enough to make the algorithm off-chip memory access bound. This is highlighted by the significant and proportional reductions in the number of off-chip memory accesses and cycles per output as cache size is increased. For 1080p to 576p decimation significant differences also occur between different numbers of PEs. This suggests that the critical path is the arbitration for off-chip memory accesses. In the worst case scenario in Figure 5.11(b), performance degrades linearly with the number of PEs. The off-chip memory access critical path is now explored further.

Case Study: Low Data Reuse Potential in the Decimation Algorithm

To demonstrate the relationship between cache size, number of processing elements (PEs) and computational threads in greater detail, the decimation case study implementation on the system model is further explored, as shown in Figure 5.12. First, varying decimation ratios from Section 4.6.4 are implemented on an initial setup of 4 × 4 PEs and a 16 KByte cache size in Figures 5.12(a) and 5.12(b). As the number of threads per PE is increased, both the number of cycles per output and the number of off-chip memory accesses increase. This effect is most prominent for larger decimation factors. It is also observed that as the resizing factor increases, the clock cycles per output trend adopts increasing similarity to that of the number of memory accesses. This represents

Figure 5.12: Performance in Number of Off-Chip Memory Accesses (|A|) and Clock Cycles per Output (CPOmodel) for Varying Number of Computation Threads per Processing Element (PE), Model Parameters and the Decimation Case Study (tests with a fixed 256 × 256 input pixel frame size). Panels: (a), (b) |A| and CPOmodel for varying decimation ratios (1080p-480p, 1080p-576p, 720p-480p, 1080p-720p, 720p-576p) with 4 × 4 PEs and a 16 KB cache; (c), (d) |A| and CPOmodel for varying cache size (8-64 KB) for 1080p to 480p decimation with 4 × 4 PEs; (e), (f) |A| and CPOmodel for varying number of PEs (1 × 4 to 8 × 4) for 1080p to 480p decimation with a 16 KB cache.

the modelled system becoming increasingly off-chip memory bound. For scenarios where the system model becomes memory bound there are two choices: first, to reduce the number of threads (Tb), or second, to increase the cache size. For the worst performing decimation factor (1080p to 480p) this trade-off is exemplified in Figures 5.12(c) and 5.12(d). It is observed that as the cache size is increased the number of off-chip memory accesses, and ultimately the number of cycles per output, decrease sharply. An approximately linear number of cycles per output is observed when the cache size is increased to 64 KByte. Ultimately, further factors which are not included in this high level model affect the performance as cache size is increased, for example increased cache access latency. An alternative method by which to improve performance is to increase the number of processing elements (PEs). The performance for an increasing number of processing groups is shown in Figures 5.12(e) and 5.12(f). It is reiterated that the cycles per output

(CPOmodel) count is independent of the iteration level parallelism (number of PEs). As the number of computation threads increases in Figure 5.12(e), the off-chip memory access requirement (|A|) also increases, as observed in Figures 5.12(a) and 5.12(c). An interesting observation is, however, the change in clock cycles per output (CPOmodel) in Figure 5.12(f). It is observed that the change is much greater than the proportion of the off-chip memory access increase. This is due to the bottleneck in arbitration for off-chip memory accesses between the memory management units (MMUs) in Figure 5.4(b). The difference becomes increasingly substantial as the number of PEs increases, up to a factor of 1.5 times between 6 and 8 processing groups. This indicates a significant critical path for an increased number of PEs. In this scenario there is a diminishing return in the number of cycles per output as the number of PEs is increased. The results above explain the decrease in the input bandwidth figure in Section 4.6.4. If one refers back to Figure 5.7, as the total number of live threads (Tb) at any instance increases, the number of memory strides taken also increases. For convolution and interpolation this scenario is not an issue because the stride length is 0 to 1 pixels. However, for resizing the stride length is larger. To summarise the relationship between thread count and memory size, consider the instantaneous on-chip memory requirement for Tb threads. The limitation on the instantaneous internal memory space requirement (Breq) can be summarised as shown in Equation 5.6, where R and Nreads are the reuse potential and number of reads from 'The Three Axes of Algorithm Characterisation' scheme, f indexes a sum over all input frames and Tb is the thread count. The number of threads Tb equals the threads per PE multiplied by the number of PEs.

Breq > Σf (Nreads + (1 − R)Tb) (5.6)

Figure 5.13: Histogram Distribution of the Off-Chip Memory Access Patterns for Sample Case Study Algorithms. Occurrences are plotted against the difference between consecutive accesses (A(z) − A(z−1)) for (a) primary colour correction, (b) 1080p to 480p decimation, (c) 480p to 1080p interpolation and (d) convolution size 5 × 5.

If Breq from Equation 5.6 exceeds the on-chip memory size, a large number of off-chip reads can be expected. Also, as Breq approaches the on-chip memory size, off-chip accesses increase subject to the choice of reuse heuristic (for the model this is a 4-way associative cache replacement policy). This effect is exemplified for increasing Tb in Figure 5.12. A shift in the graphs is observed for varying numbers of PEs and cache sizes, as the number of threads and available memory space change respectively.
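As a worked instance of Equation 5.6, with illustrative parameter values drawn from the decimation discussion above:

    #include <cstdio>

    int main() {
        // Single input frame; values follow the decimation case study above.
        const double R = 4.0 / 16.0;   // reuse potential (low reuse)
        const int n_reads = 16;        // reads per output (4 x 4 kernel)
        const int Tb = 1500;           // live threads (GeForce 6800 GT estimate)
        const double b_req = n_reads + (1.0 - R) * Tb;
        std::printf("Breq > %.0f pixels\n", b_req);   // ~1141 pixels
        return 0;
    }

With roughly 1141 pixels live at once, each potentially in a different 64 pixel cache line, the working set can far exceed the 64 lines held by a 16 KByte cache (assuming 4 byte pixels), which is consistent with the cache misses observed above.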

5.3.3 The Distribution of Memory Access Patterns

The distributions of the memory access patterns for sample case study implementations, as formulated in Equation 5.5, are shown in Figure 5.13. These are explained below. First, for primary colour correction in Figure 5.13(a), the distribution is trivially a spike at a difference of 64, where 64 is the memory access granularity (cache line size). The distributions for the remainder of the case study algorithms also show a large peak at a difference value of 64. This indicates that a large proportion of memory accesses are sequential in each case, as is desired.

The locality for decimation is equivalent to convolution, with peaks at equal difference values. These peaks represent consecutive cache misses in different quadrants of the input frame. The peaks closer to the origin represent quadrant misses at lower access granularity. It is noted that the number of accesses is significantly greater for decimation than convolution, as indicated by the occurrences axis. For interpolation, the locality is significantly better than for decimation and convolution. The key factor is the smaller input frame size, as well as a smaller kernel size, compared to the convolution example. In addition, an increased data reuse potential, over each of the decimation and convolution case studies, improves memory access locality. A higher occurrence of larger differences (higher peaks) for the convolution case study is due to the larger kernel size of 5 × 5 pixels, with respect to 4 × 4 pixels for each resizing case study. The greatest differences occur, for convolution, at a value of 32k. This is consistent with the largest variations calculated in Section 5.3.1. Memory access pattern distributions are considered in greater depth in Section 6.4.1.

5.3.4 Limitations of the Model

The analysis of the success of the model takes two parts. First, the results are compared to the graphics processor performance, and the limitations of the model are subsequently discussed. Second, the execution time requirement is analysed.

Comparison with the Graphics Processor Performance

The model performance for the case study algorithms from Figure 5.9 can be compared to results for two sample graphics processors, as shown in Table 5.1. The model setup is 4 × 4 PEs with a 16 KByte cache and thread count Tb = 1500 (a GeForce 6800 GT). Over all case studies the results for the model approximate those for the nVidia GeForce 6800 GT, and follow the trend for the GeForce 7900 GTX. The cycle estimate is higher than that for the graphics processor because the interface between the processing pattern generation unit, PEs and output buffer is not pipelined, as would be likely for equivalent components on a graphics processor. In addition, the memory access of a memory management unit (MMU) is not fully pipelined, as explained in point one below. For the GeForce 6800 GT, the number of clock cycles per internal memory access (CPIMA) for 2D convolution was measured in Section 4.6.3 to be 2.4. The results for the model (CPOmodel) range from 290 to 62 cycles per output for convolution sizes 11 × 11 and 5 × 5 respectively. This results in a value of CPIMA approximately equal to 2.4-2.5. The model captures the on-chip memory access time of the graphics processor well. Model implementations of small decimation sizes are also well matched to the performance of the graphics processors.

                             CPOgf6   CPOgf7   CPOmodel
Primary Colour Correction       60       43        63
2D Convolution (5 × 5)          41       37        62
2D Convolution (9 × 9)         162      144       187
2D Convolution (11 × 11)       263      230       282
Interpolation (576p-1080p)      52       47        61
Interpolation (720p-1080p)      53       48        61
Decimation (1080p-576p)         90       84       187
Decimation (720p-480p)          78       68        86
Decimation (1080p-720p)         69       66        75

Table 5.1: Verification of the Model, Quoting Cycles per Output for the nVidia GeForce 6800 GT (CPOgf6), nVidia GeForce 7900 GTX (CPOgf7) and the System Model (CPOmodel)

A small overhead in the number of cycles per output is again observed over equivalent GeForce 6800 GT implementations. An anomaly occurs for decimation size 1080p to 576p. Four potential reasons for this, and a summary of the key limitations of the model, are as follows.

1. Off-chip memory accesses by multiple threads are not fully pipelined within the memory management unit (MMU). Whilst the MMU hides memory access latency from the PEs under good memory access conditions, if memory locality is poor, which results in a large number of cache misses, the off-chip memory access time is over-emphasised. For large decimation sizes this scenario occurs.

2. The computation model of the PEs does not fully synchronise their execution. Under good memory access behaviour, the PEs are observed to be approximately synchronised due to the locality of accesses between concurrent threads. However, for implementations where inter-PE memory access locality is poor, for example for the decimation case study, more cache misses occur than on a graphics processor.

3. The cache size estimate of 16 KBytes may be incorrect. If double the cache size value is taken, then the cycles per output (CPOmodel) reduces to 74. This would suggest that the cache is larger than originally estimated by [94] for similar generations of graphics processors.

4. Although a latency and delay based model of off-chip DRAM is created, the latency of the entire interface to the DRAM and the finer detail of the DRAM are omitted. The omitted factors include pre-charge time and the latency between graphics processor pins and memory pins. This results in inflated performance figures for a lower number of threads; if this latency were modelled then the modelled performance would be degraded.

Despite the above limitations the model is observed in Table 5.1 to, under correct parametrisation, correlate well with the performance of two sample graphics processors.

Execution Time

The required execution time is sensitive to the choice of algorithm and output frame size. To exemplify, consider the motion vector estimation algorithm, which has the slowest performance on the graphics processor. This can be modelled in one hour and ten minutes for a 720p input frame size with a setup of 6 processors and a 128 KByte cache size. This represents the performance for a significantly large algorithm. For a reduced output frame size of 256 × 256 pixels, the modelling time reduces significantly, to between 1 and 5 minutes over all single pass case study algorithms. This frame size is observed to be sufficient for design space exploration. It is noted that the model is not optimised for speed. For example, the timing results are taken with the C++ wrapper and SystemC model running in the Visual Studio debug mode. The achievable simulation time is therefore expected to be significantly lower than the times observed above. Despite this, even at a large input frame size of 720p the current model is sufficient for workable prototyping of design space options.

5.3.5 Case Study: Motion Vector Estimation

The motion vector estimation (MVE) case study has significantly more complex memory access requirements than the above case study algorithms. To test the portability of the system model, the setup of a GeForce 7900 GTX graphics processor is considered for the optimised MVE Method 2 implementation (from Section 4.7.2), as shown in Table 5.2. Data dependence properties of the MVE algorithm are implemented using an array of pre-computed options from source test data [124]. The setup can be summarised as 24 processing elements, a 128 KByte cache and a thread count of 1300 threads. It is observed that the cycles per output for the model are within a factor of two times those for the graphics processor for step 3.1 of the algorithm, and within 1.1 times for steps 3.2 and 3.3. Overall, a 1.3 times deviation from the graphics processor results is observed. One possible reason for the larger deviation from the graphics processor result is that step 3.1 reads from three separate input textures. Additionally, the limitations mentioned above influence the difference between the model and the graphics processor. The number of off-chip accesses (|A|) for the model can be compared to the estimated values for the graphics processor (|Aest|) from Table 4.13, as shown in the last column of Table 5.2. It is observed that a peak of 1.6 times the minimum number of reads (|Aest|) occurs for step 3.2. The minimum is 1.2 times |Aest| for step 3.1, which has the largest read requirements of all steps.

            CPOmodel      |A|   CPOgf7   |Aest|   |A|/|Aest|
Step 3.1        140     34412       75    28800        1.2
Step 3.2        159     13169      148     8100        1.6
Step 3.3         97       670       95      506        1.3
3x Pass       64509    145032    48533   115200        1.3

Key: numbers of cycles per output for the model (CPOmodel), off-chip memory accesses for the model (|A|), cycles per output for the GeForce 7900 GTX (CPOgf7) and estimated minimum possible memory accesses for the GeForce 7900 GTX (|Aest|)

Table 5.2: Performance Results for the 3-Step Non-Full-Search Motion Vector Estimation Algorithm when Implemented on the System Model

The optimised implementation exhibits a good memory access performance. Step 3.2, which has the worst performance, has a memory access property equivalent to a 4 × 4 decimation. This type of kernel function is explored further in Chapter 6.

5.3.6 Implications for other HoMPE Architectures

The customisable options for a system model which is parameterised to represent a graphics processor have been explored above. In this Section, the implications for alternative homogeneous multi-processing element (HoMPE) architectures are considered. The architecture template for AMD ATI Radeon graphics processors and the GeForce 8 generation of graphics processors is fundamentally similar to the fragment pipeline model shown in Figure 5.4(b). A large number of processing elements arranged in processing groups arbitrate for off-chip memory access through a local memory management unit (MMU). Each MMU in turn accesses off-chip memory through shared on-chip memory (cache). One difference for the Radeon and GeForce 8 graphics processors is that fragment and vertex pipelines are combined in a unified shader. However, the level of abstraction in Figure 5.4(b) could equally represent a unified shader, in contrast to only the fragment pipeline. For 2D video processing, vertex processing requirements can be disregarded because they are trivially the four corner coordinates of the output frame. The processing elements in the GeForce 8 graphics processors are different from prior GeForce generations. For example, the GeForce 8 contains scalar processing elements. An advantage of scalar processors is a reduction in cycle count through increased processor utilisation, over a 4-vector processor performing computation on 3-component video data. This modification is trivial to support in the processing element model in Figure 5.7.

If the implementations from Section 3.4.1 were directly ported to a Radeon or GeForce 8 device, using for example DirectX or OpenGL with the Cg or GLSL programming languages, a similar performance trend would be observed, with variations due to a different trade-off of number of PEs, on-chip memory size and number of threads. However, another option is present, as detailed below. The current state of the art AMD ATI Radeon and nVidia GeForce 8 generations of graphics processors have an enhanced and more flexible 'application mapping' which is programmable through the CTM (close to metal) [8] and CUDA (compute unified device architecture) [98] programming environments respectively. An immediate advantage is that off-chip memory accesses can be reduced for previously multi-pass algorithms through the storage of intermediate results in on-chip memory for later reuse. In addition, the contents of on-chip memory can be controlled. This presents an exciting new domain of algorithm optimisations; for example, the ability to control, within limits, the contents of on-chip memory may enhance the performance for the decimation case study. The Cell BE presents a shift from the model adopted in Figure 5.4(b). In addition to shared global memory, a large memory space is local to each processing group. This can be considered as local to the MMU. However, DMA access can be made between MMUs over the EIB bus. Processing groups also operate independently, which poses further opportunities for algorithm optimisations. In general, for alternative homogeneous multi-processing element (HoMPE) architectures the core architecture features in Figure 5.3(b) are consistent, with minor variations. The key difference is in the application mapping characteristics. These include the choice of address space (local, global and control) and restrictions on PE execution behaviour.

The results in this Section therefore show some of the architectural trends for the core architecture features which are present in all HoMPE architectures. However, for each architecture a different application mapping is chosen. This translates to a new set of algorithm optimisation techniques. Re-parametrisation of the model's application mapping feature set, and a choice of algorithm implementation, is required to support alternative HoMPE architectures.

5.4 Post-Fabrication Customisable Options

From the critical path analysis in Section 5.3, post-fabrication customisable options are suggested below as an example of step (b) in Figure 5.1(b). Although a short example of step (c) is provided in Section 5.4.3, a substantial study of a post-fabrication option (from Section 5.4.1) is left until Chapter 6.

5.4.1 Memory System

A fundamental architectural feature of the system model, and the graphics processor, which results in the decimation case study being off-chip memory bound, is the choice of a high latency memory system with large (8 × 8 pixel) cache lines. This setup is optimal for large numbers of localised memory accesses, as is common in graphics rendering. However, for the case of low data reuse potential, a low percentage of the cache line is utilised before it is displaced from the moderately sized cache. A degree of flexibility within the memory system is one method by which this limitation could be alleviated. An example of such a customisation is proposed in Chapter 6.

5.4.2 Processing Group Flexibility

Two factors promote the choice of a processing group (from Figure 5.4(b)) which supports an increased degree of post-fabrication customisation. First, multi-threading is identified as a critical path in Section 5.3. With a more flexible choice of processing group, the memory fetch latency can be hidden through a customisable pre-fetch heuristic, in substitute of a high thread count. The benefit of a reduced thread count can be observed in Figures 5.12(a) and 5.12(b), where for a 4 times decrease in the number of threads up to a 4 times performance improvement is observed. Second, in current generations of graphics processors the sharing of data between PEs and concurrently computed threads is not supported. A relaxation of this restriction may reduce the fixed on-chip memory access overhead observed in Figure 4.4. One option is for input frame pixel data to be distributed to four PEs in one read operation. Alternatively, a PE could locally reuse data between computation threads. For graphics rendering, the above two suggestions are not feasible. However, for video processing, the property that the memory access pattern is well defined pre-execution enables such optimisations to be considered. A number of generalised options for increasing the flexibility of a processing group using reconfigurable logic are discussed in Section 6.1. This includes, at the end of Section 6.1, an example where the second constraint from above is relaxed.

                          Z-Pattern                  Raster Scan
Cache Size:          16KB     32KB    64KB      16KB     32KB    64KB
1080p to 480p      127319    66984   10223     29828     8910    7596
                    (258)    (146)    (63)      (92)     (62)    (62)
1080p to 576p       87671    15387    6084     26133     8112    5138
                    (180)     (63)    (62)      (82)     (62)    (62)
720p to 480p        29144     6481    6481     15779    12473    3591
                     (79)     (62)    (62)      (68)     (66)    (62)
1080p to 720p       12756     3124    3124     13835     4332    2695
                     (62)     (62)    (62)      (66)     (62)    (62)
720p to 576p        12347     2770    2770     12967     4783    2568
                     (63)     (62)    (62)      (66)     (62)    (62)

Table 5.3: The Number of Off-Chip Memory Accesses and (Clock Cycles per Output) for Raster Scan and Z-Pattern Processing Patterns and the Decimation Case Study

5.4.3 Flexible Processing Pattern

The comparison of the z-pattern and raster scan processing patterns in Figures 5.9 and 5.10 suggests a substantial improvement in graphics processor performance from changing the processing pattern. A summary of the performance of each processing pattern for varying decimation sizes is shown in Table 5.3. It is observed that for large decimation factors (at the top of Table 5.3) up to a four times reduction in both the number of memory accesses and cycles per output is achieved by using the raster scan pattern. As reasoned in Section 5.3.2, the justification is that conflict cache misses only occur for the horizontal resizing factor; for the z-pattern approach, cache lines are in contrast two dimensional. As cache size is increased, the benefit of the raster scan approach diminishes. This is because the algorithm becomes on-chip memory access limited under these conditions. For smaller decimation factors, the z-pattern can be beneficial over the raster scan approach. This occurs when the horizontal resizing factor exceeds the vertical factor. A vertical raster pattern would alleviate this issue. The choice of processing pattern is shown to have a significant effect on a subset of algorithms with low data reuse potential. For graphics rendering the z-pattern is the optimal choice. This therefore presents an option for post-fabrication customisation: to switch between alternative processing patterns depending on the target application domain. The mapping from z-pattern to raster scan, and vice versa, requires bit reordering as explained in Section 5.2.1. In the case of two alternative patterns this is implemented with one multiplexor and a configuration bit. Therefore, this presents an attractive acceleration (four times) per unit area (negligible).
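In software terms the switch amounts to the following C++ sketch (illustrative only; in hardware the same selection is one multiplexor per counter bit, steered by a single configuration bit):

    #include <cstdint>

    // Raster takes counter bit b for xp and bit (n/2 + b) for yp; the
    // z-pattern takes the interleaved bits 2b and 2b+1 instead.
    uint32_t xp_bit(uint32_t p, unsigned b, bool raster) {
        return raster ? (p >> b) & 1u : (p >> (2 * b)) & 1u;
    }

    uint32_t yp_bit(uint32_t p, unsigned b, unsigned half_bits, bool raster) {
        return raster ? (p >> (half_bits + b)) & 1u : (p >> (2 * b + 1)) & 1u;
    }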

5.5 Summary

In this Chapter, a systematic approach to design space exploration and a system model have been presented as a tool to explore the customisable options for homogeneous multi-processing element (HoMPE) architectures. The design space exploration approach is based on the commonly used Y-Chart method [71], with explicit representation of 'application mapping' architectural features. A proposed system model supports the above approach and is parameterised to capture the basic architecture of a graphics processor. In verification, this model tracks graphics processor performance with a marginally higher clock cycles per output count; for example, a value 1.3 times higher for the motion vector estimation case study. Pre-fabrication customisable options are explored using the system model. The explored options include the number of processing elements, the size of on-chip memory and the number of computational threads. The following occurrences of critical paths are highlighted:

Off-Chip Memory Access: For large decimation factors the algorithm is bound by off-chip memory bandwidth. The cycles per output count increases proportionally to the number of PEs for a 16 KByte on-chip memory size (the estimated value for the GeForce 6800 GT).

On-Chip Memory Bandwidth: For algorithms with high data reuse and number of memory accesses (e.g. 2D convolution), on-chip memory bandwidth is the critical path.

Multi-threading: For algorithms with low data reuse potential, the performance is reduced by up to five times due to cache misses caused by a high number of threads.

Processing pattern: In comparison to the raster scan method, the z-pattern is observed to be less localised and results in up to four times lower performance (e.g. for decimation).

From the critical path analysis, the proposed post-fabrication customisable options are to add flexibility to the memory system, the processing groups or the processing pattern. In a case study, a change of processing pattern is observed to provide a four times performance improvement for the negligible area cost of a multiplexor and a configuration bit. An initial motivation for the design space exploration was to choose an architecture from high level descriptors [17, 136]. The customisable options in Figure 5.3(b) are a set of such descriptors from which one can sample the HoMPE architecture design space. In summary, the questions from Section 5.1 are answered as follows. In Section 5.3, the trade-offs for pre-fabrication customisations were considered (question 1). The options for post-fabrication customisation were identified in Section 5.4 (question 2). In Section 5.4.3, the performance of a processing pattern customisation was shown (question 3). A detailed study of a post-fabrication customisable option, as an example of the challenges faced in step (c) from Figure 5.1(b), is now presented in Chapter 6.

Chapter 6

Customising a Graphics Processor with Reconfigurable Logic

IN Chapter 5, a systematic approach and system model were proposed as a novel design space exploration tool. The tool is used in this Chapter to explore the options for the post-fabrication customisation of an architecture using reconfigurable logic. The approach taken conforms to, and is a case study of, stage c of the exploration flow in Figure 5.1(b).

To propose a case study example, the options for customisation are first classified and reviewed. A novel alternative is then presented. The proposal is a small embedded reconfigurable logic core acting as 'glue logic,' to optimise the memory system performance of the graphics processor. The motivation for the proposal is developed below.

The memory access requirements of graphics rendering necessitate that a graphics processor's memory system is optimised for high-bandwidth localised memory access. Whilst this is sufficient for many case study algorithms in Chapter 4, for cases of low data reuse potential a reduced performance is observed. A property of video processing, which is not present in graphics rendering, is that the memory access requirements are well defined pre-execution. This work is the result of an investigation into how this property may be exploited in a modified graphics processor.

The outcome is a reconfigurable engine for data access (REDA), proposed as a modification to a graphics processor memory system. A REDA exploits the pre-determined nature of video processing memory patterns through an algorithm-specific address mapping. It is transparent for access patterns which are currently well matched to the graphics processor memory hierarchy. Reconfigurable logic is shown to provide an attractive platform to support the flexibility, transparency and performance requirements of the REDA, for an acceptable area cost.

A REDA is proposed as a substitute for pre-fabrication customisations, for example a change of cache line size, which could alternatively be chosen to improve performance. The aim is to consider post-fabrication options to improve the performance of current technology. A REDA is also an alternative to any algorithm optimisation techniques which may be applied, for example those presented in Chapters 3 and 4, which often incur high multi-rendering pass, off-chip memory access and instruction count overheads.

It is clarified that the work in this Chapter explores the alternatives for including reconfigurable logic on the same die as a HoMPE architecture to support post-fabrication customisation. In some scenarios, for example class (1) in Figure 6.1, the addition of reconfigurable logic (as an FPGA device) within the same system as, and as a separate device to, a HoMPE architecture is also a feasible alternative.

The Chapter is organised as follows. In Section 6.1 the alternatives for customisation using reconfigurable logic are presented. An analysis of poorly performing graphics processor memory access patterns is presented in Section 6.2. In Section 6.3 the reconfigurable engine for data access (REDA) is proposed. The improvement from using the REDA is analysed in Section 6.4. In Section 6.5 the work is summarised. The contributions of this Chapter are as listed below.

1. A classification of five uses of reconfigurable logic within the architectural design space of a homogeneous multi-processing element (HoMPE) architecture. The classification is applied to review related works which propose classifiable uses of reconfigurable logic. A summary of the challenges posed for using reconfigurable logic for computation within the architecture is additionally provided (Section 6.1).

2. The proposal of a reconfigurable engine for data access (REDA). This is targeted at optimising graphics processor performance for memory access patterns which are not well matched to its memory system (Section 6.3.2).

3. An example application of the REDA is presented as a novel mapping scheme to improve the performance of identified poorly performing memory access patterns. The proposal involves a prototype implementation of the REDA on a Xilinx Virtex-5 FPGA device and the exploration of an embedded reconfigurable logic core. To explore the embedded core, a synthesisable reconfigurable fabric from Wilton [139] is used (Section 6.3).

4. The system model from Chapter 5 is used to explore the expected improvements to the graphics processor memory system. Properties which are explored include the number and variance of the off-chip memory access patterns. An open-loop test of the proposed REDA optimisation on a GeForce 7900 GTX graphics processor is used to verify the proposal (Section 6.4).

[Figure 6.1: Architectural Classes for the use of Reconfigurable Logic (RL) in a Homogeneous Multi-Processing Element (HoMPE) Architecture. Classes (1)–(5) place the RL at different levels of the memory hierarchy. Key: I/O is the input/output interface, FU is the functional unit of a processing element (PE) and Reg is a set of local register files.]

6.1 Reconfigurable Logic in a HoMPE Architecture

In this Section the options for post-fabrication customisation of a HoMPE architecture using reconfigurable logic are classified. The approach is motivated by an equivalent classification presented by Todman [133] targeted at hardware-software co-design, which classifies the ways in which a single processor may be coupled to a reconfigurable logic fabric. The advantageous factors of the post-fabrication customisations suggested in this work, which are shown in Figure 6.1, are as follows.

In contrast to a fixed processor based HoMPE architecture, for example the graphics processor, the inclusion of reconfigurable logic enhances the architecture's flexibility. This makes it adaptable to a wider variety of application domains. For example, it is shown in this Chapter that a small reconfigurable logic block can improve graphics processor memory access performance by an order of magnitude for case study examples.

The key advantage over a single reconfigurable logic platform is that the architectural design space is bound. Put another way, at the core of the architecture is a fixed, or bounded, datapath onto which an algorithm must be mapped. For application domains well suited to the chosen architecture the result is a reduced design time. An alternative reconfigurable platform would typically require the design of a specialised datapath.

Class  Role                   Core Type      Effect on Level of Integration  'Shared' Memory
1      Off-Chip Co-Processor  Heterogeneous  Low                             DRAM
2      On-Chip Co-Processor   Heterogeneous                                  L2 Cache
3      Local Co-Processor     Homogeneous                                    L1 Cache
4      Custom Instruction     Homogeneous                                    Registers
5      Glue Logic             Homogeneous    High                            –

Table 6.1: A Summary of the Roles for Reconfigurable Logic within a Homogeneous Multi-Processing Element Architecture (the level of integration increases from Low in class 1 to High in class 5)

In summary, the key benefits of including reconfigurable logic within a HoMPE architecture are that the application design space is constrained whilst a means for specialisation to an application domain is supported.

This Section is arranged as follows. In Section 6.1.1, the classification of customisable options is explained. Prior work which promotes the classified options is reviewed in Section 6.1.2. In Section 6.1.3, the use of reconfigurable logic for computation is discussed. A reconfigurable glue logic example follows in Sections 6.3 and 6.4.

6.1.1 The Proposed Classification

The proposed classification of customisable options is shown in Figure 6.1. The qualitative level of integration and lowest level of shared memory for each option are summarised in Table 6.1. Key features are discussed below.

For classifications (1) to (4) the level of shared memory is the key identifier. As the level of integration increases, the separation of tasks between a reconfigurable logic (RL) element (co-processor or custom instruction) and the HoMPE architecture becomes finer grained. For class (1), different algorithms and video frames may be computed on the RL co-processor and the HoMPE architecture. In contrast, in class (4) a single instruction from the assembly code of a processor PE may be accelerated on a RL core.

The scenario in classification (5) presents an alternative use of reconfigurable logic to classes (1)–(4). That is, instead of performing computation on a part or whole algorithm, reconfigurable logic is used to optimise an architecture in such a manner as to improve the HoMPE's performance for a given algorithm. This is termed 'glue logic.' An example of this is the small reconfigurable element proposed in Section 6.3 to enhance graphics processor memory access performance. This is an exciting new area in which the use of reconfigurable logic can thrive, more examples of which are detailed in Section 6.1.2.

6.1.2 Prior Reconfigurable Logic Customisations

A number of key prior works which represent choices in the classification shown in Figure 6.1 are discussed below.

The literature contains numerous works which promote multi-chip solutions to using reconfigurable logic and the graphics processor as class (1) co-processors [73, 84, 90, 152]. Moll [90] defines a system called Sepia where an FPGA composites the outputs from multiple graphics processors. Although the FPGA implements a support function, which suggests a class (5) use, it ultimately performs a subsection of the target 3D visualisation algorithm. This makes it a class (1) use of reconfigurable logic. Manzke [84] combines FPGA and graphics processor devices on a PCI bus with a global memory. The goal is to produce a scalable solution of multiple boards which each contain an FPGA, graphics processor and local shared memory. In Manzke's work the FPGA is host to the graphics processor. Conversely, for Sepia the graphics processor output drives the FPGA operation with an additional host CPU [90].

An equivalent setup to [73, 84, 90, 152] for an STI Cell BE is proposed by Schleupen [113]; this is also an example of a class (1) use of reconfigurable logic. The work in [84, 90, 113, 152] can alternatively be considered as a prototype for a single die solution containing a HoMPE core, FPGA core and shared memory (class (2)).

Dale [34] considers a class (4) approach with the proposal of small scale reconfiguration within graphics processor functional units. A functional unit is substituted with a flexible arithmetic unit (FAC) which can be alternately an adder or multiplier. A modest 4.27% computational performance speed-up for a 0.2% area increase is achieved. Although the speed-up is small, this demonstrates the potential of the use of reconfiguration in graphics processors at the lowest denomination of the architecture.

Two class (5) options are presented by Yalamanchili [153]. First, a self-tuning cache hierarchy to match the memory access requirement by either turning off portions of the cache or by rearrangement of the usage heuristic to match functionality. Second, tuning on-chip interconnects to give larger bandwidth capability to critical paths. This is achieved through increased interconnect width or clock speed for that architectural portion.

There are also a number of prior works which present equivalent solutions, to those shown above, for the case of heterogeneous multi-processing element architectures.

Chen et al [25] present the use of reconfigurable logic as a controller in a system-on-chip environment to implement a configurable VLIW instruction set. The reconfigurable logic core is used to make a complex system-on-chip appear as a single co-processor to a host RISC processor. In one instance this may be considered as glue logic as in class (5).

Verbauwhede [137] presents RINGS, a 'reconfigurable interconnect for next generation systems'. Three locations for reconfigurable blocks in a network-on-chip scenario are presented as register mapped (class (4)), memory mapped (class (3)) and network mapped (class (2)). The terminology represents the hierarchy at which the reconfigurable logic core is implemented and, in similarity to Table 6.1, the level of shared memory.

Verbauwhede [137] also presents the options for a reconfigurable interconnect arbitration scheme. This can be extended from the 1D bus case to a 2D network-on-chip arrangement. This is considered to be equivalent to the class (5) scenario.

A run-time management scheme for multi-processor systems on chip is presented by Nollet [96]. It is proposed that reconfigurable logic may be used to implement a flexible hardware management unit. This is also a class (5) use of reconfigurable logic.

6.1.3 Computation using Reconfigurable Logic

The results presented in this Chapter exemplify the argument for using reconfigurable logic as 'glue logic' to aid the operation of a HoMPE architecture. There are also significant motivating factors for using reconfigurable logic for computation in a HoMPE architecture, for example the arguments presented in [84, 90, 113, 152]. The comparison results from Chapter 4 provide further significant steps towards the comparison of the use of a graphics processor or reconfigurable logic core in a class (1) or (2) setup. In particular, how one may separate, between each device, the computation of video processing algorithms as defined by the given 'Three Axes of Algorithm Characterisation'.

A class (3) scenario, for which working examples are sparse, is now considered. The scenario is to substitute or supplement a PE with a reconfigurable logic core. Key challenges, particularly issues of scalability to multiple PEs, are discussed below. For this exploration the reconfigurable logic cores are homogeneously programmed. This provides significant advantages in the programming model at a cost of flexibility.

Processor based HoMPE architectures have a significant density (area) advantage over a reconfigurable logic implementation for pure arithmetic functionality. This is exemplified in Section 4.8.2. To justify the use of a reconfigurable logic core for or within a PE, one must therefore consider how to exploit the flexibility of the reconfigurable logic (RL) core through the relaxation of one or more architectural constraints of an architecture such as a graphics processor.

One of these constraints is that no local data reuse is possible between concurrently processed pixels within a PE. This property is often a key factor in the design of a reconfigurable datapath for video processing applications. To support this relaxation, the regions of output pixels processed by each PE, or processing group of PEs, must be tightly controlled. A processing pattern generation block is suitable for this task. Case study algorithms which may benefit from such a modification are 2D convolution and histogram equalisation.

                            Model  GeForce 6800 GT
Original (four pass)        16.3   10.5
Improvement (single pass)   7.5    5.1

Table 6.2: Clock Cycle Count (×10^7) for the System Model and an nVidia GeForce 6800 GT for an Improved Histogram Equalisation Implementation over an Input Frame Size of 1280 × 720 Pixels

An example of the benefit of localised data reuse in the above manner is explored using the histogram equalisation (HE) case study as follows. Consider distributing the generation stage of HE across multiple PEs. For sixteen PEs this equates to 16 bins in each PE to generate a total of 256 cumulative histogram bins. Each PE must now access the entire frame of pixels for the full histogram to be computed. The result for HE calculation is shown in the bottom-left entry of Table 6.2, in which the system model from Chapter 5 is used with a setup of 16 PEs and a 16 KByte cache. Table 6.2 also shows the implementation of the original four-pass graphics processor technique on the model for comparison. For a fair comparison, a linear PE model of the cycle count requirement on a fragment processor, as described in Section 5.2, is maintained for both model implementations. Therefore the improved method performs conditional operations and bin accumulations in series (a sketch of this distribution follows below).

To verify the improved model implementation, a frame sixteen times larger than the target frame size is rendered on the GeForce 6800 GT. This is used to model memory access and performance by modelling the compute latency of 16 accumulators on the graphics processor. In this setup, each input frame pixel may not be accessed by each fragment pipeline; however, the number of on-chip accesses is accurate and memory access locality is expected to be equivalent to the modelled scenario. The result is shown in Table 6.2.

The modelled implementation demonstrates a two-fold improvement over the original method. This is achieved by removing the multiple passes of the original HE design. All remaining factors are kept constant. The option to exploit parallelism in bin calculation and conditional statements in a real reconfigurable logic implementation would further enhance the performance. These improvements must be traded against any die area increase of a reconfigurable logic core over a processor based PE.

From here on, the focus of the Chapter is on a 'non-computation' glue logic based use of reconfigurable logic within the graphics processor architecture. First, the problem is defined in Section 6.2.
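The sketch below illustrates, in software form, the distributed generation stage just described: 256 bins split across 16 modelled PEs, each of which scans the full frame and accumulates only its own 16 bins, with the ownership test performed in series. It is an illustrative model of the access behaviour under these assumptions, not the thesis implementation; the names and types are invented for the example.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Distributed histogram generation: 16 bins per PE, 256 bins total.
// Each modelled PE reads the entire frame and keeps only its own bins,
// so the multi-pass reduction of the original HE design is removed at
// the cost of repeated frame reads.
std::array<uint32_t, 256> histogramAcrossPEs(const std::vector<uint8_t>& frame) {
    constexpr int kPEs = 16, kBinsPerPE = 256 / kPEs;
    std::array<uint32_t, 256> bins{};
    for (int pe = 0; pe < kPEs; ++pe) {          // each PE in the model...
        const int lo = pe * kBinsPerPE, hi = lo + kBinsPerPE;
        for (uint8_t p : frame)                  // ...accesses the whole frame
            if (p >= lo && p < hi)               // conditional done in series
                ++bins[p];
    }
    return bins;
}
```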

[Figure 6.2: The Memory Access Requirements of the Window Function Kernel — an I × J window of the Nin × Min input frame Fin is read at reference position (xref, yref), with overlaps Ox and Oy between the windows of neighbouring outputs.]

6.2 Problem Definition: Memory System Performance

A frequently occurring memory access pattern, for a kernel of a video processing algorithm, is to access a window of I × J pixels and transform these values in some manner to produce a single output pixel. A generalised form of the memory access for this kernel is shown in Figure 6.2. The mathematical formulation is shown in Equation 6.1. In the terminology of the system model from Section 5.2, the kernel is described as follows. Each output pixel (Fout(x, y)) represents a step in the processing pattern P and the tuple (xref + i, yref + j) an address in the on-chip memory access pattern C. The sizes of the input and output frames Fin and Fout are Nin × Min and Nout × Mout pixels respectively. It follows that the product Nout × Mout equals |P|, where |·| denotes cardinality.

Ideally one would like Nin × Min to equal |A|, where A is the off-chip memory access pattern. In this scenario each input pixel is accessed once and only once.

$$F_{out}(x, y) = \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} F_{in}(x_{ref} + i,\; y_{ref} + j) \qquad (6.1)$$

$$x_{ref} = (I - O_x)\,x, \qquad y_{ref} = (J - O_y)\,y \qquad (6.2)$$

The input frame (Fin) reference location coordinates (xref, yref) are a linear function of the current output pixel location (x, y), as shown in Equation 6.2. For outputs with a Manhattan distance equal to one, the tuple (Ox, Oy) is the overlap in accessed portions of the input frame in the x and y dimensions. This is shown graphically in Figure 6.2.

Two special cases for Equation 6.2 are where xref = x, yref = y (case A) and where xref = Ix, yref = Jy (case B). Case A represents maximal overlap of input frame windows for neighbouring output pixels. Case B is when there is no overlap of input frame windows. Examples of case A, case B and the general case are found in 2D convolution, decimation and motion vector estimation respectively.
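As a quick check, substituting the two overlap extremes into Equation 6.2 recovers both special cases (the x dimension is shown; y is analogous):

$$O_x = I - 1 \;\Rightarrow\; x_{ref} = (I - (I-1))\,x = x \qquad \text{(case A)}$$
$$O_x = 0 \;\Rightarrow\; x_{ref} = I\,x \qquad \text{(case B)}$$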

[Figure 6.3: Graphics Processor Memory System Performance Analysis for a Generalised Window Kernel Function for an Input Frame of Size 4096 × 4096 Pixels — time per frame (ms) is plotted against window size (I, J from 1 × 1 to 16 × 16) and overlap (Ox, Oy from 0 to 15), with the case A (maximum overlap) and case B (no overlap) paths highlighted.]

The performance of the kernel in Equation 6.1 is now analysed to identify the potential for memory access pattern optimisation. The test setup is as described in Section 4.3. A GeForce 7900 GTX graphics processor with 512 MB of video memory is the test hardware and only the execution time is measured. The 3-vector byte format is used, as is common for video data.

To show relative performance for changes in I, J, Ox and Oy, the total number of memory requests by kernels (|C|) is kept constant (|C| = I J Nout Mout). For 2D window symmetry, I and J are set equal, and similarly Ox = Oy. In Figure 6.3 the results of varying overlap and window size are shown. An input frame size of 4096 × 4096 pixels is chosen. This is the maximum texture size of a graphics processor and is used to show memory access variations most prominently. Paths of the special cases A and B are highlighted.

For case A the performance is approximately constant over all window sizes. In contrast, for case B up to a 70 times increase in time per frame is observed with increased window size. The performance for cases A and B is proportional to the number of on-chip (|C|) and off-chip (|A|) memory accesses respectively, as explained below.

To recap from Section 5.2, graphics processor computations are grouped into thread batches (subsets of P of size Tb) over n processors. It is reasonable to assume that each graphics processor kernel instruction is executed on each thread (one thread on each of n processors) in turn to hide memory access and computation latencies. This has the effect that each (i, j) step of Equation 6.1 is executed over Tb threads before the next (i, j) step is executed. This is referred to as simultaneous interleaved multi-threading. It can be assumed that threads are grouped into 2D blocks of size Tb. Under this condition there is a step of I, J in the x and y dimensions for neighbouring threads in case B for a particular value of (i, j). For (i + 1, j) thread block requests are shifted by one location in x.

In Section 5.2, the cache line size was predicted to be 8 × 8 pixels and the cache size S to be 128 KBytes (for a GeForce 7900 GTX). It is seen that for large Tb cache 'thrashing' can occur if the cache size S is less than 3Tb(I × J) bytes. For large Tb (for example 1300) this bound is exceeded. As I, J increase, the cache 'thrashing' effect amplifies, as is shown for case B in Figure 6.3. For window sizes less than that required to break the above bound, reduced performance also occurs. This is due to a decrease in on-chip memory access locality and a greater frequency of occurrence of compulsory cache misses.

For case A, cache 'thrashing' does not occur because the memory requests of neighbouring threads are in consecutive memory locations. Case A is performance limited by the number of memory requests (|C|), a constant in Figure 6.3.

For the general case of Equation 6.2, an interesting feature of Figure 6.3 occurs at the point where the non-overlapped window portion (I − Ox) equals eight. This is where memory access steps are equal to the predicted dimensions (8 × 8 pixels) of graphics processor cache lines. Considering the analysis for case B, this is where each thread requests pixels from a different cache line. This is shown in Figure 6.3 by a peak in time taken per frame. The remainder of the surface shows a trend between the two extremes of cases A and B. The contours where the value of I − Ox is constant have equal performance. Cache performance is approximately consistent under these conditions.

In summary, case A exhibits a memory request pattern which makes optimal use of the graphics processor cache. This is therefore the scenario under which no memory access optimisation is required. Case B represents a situation where cache use on the graphics processor becomes more inefficient with increased window size. Significant degradation in performance, from the case A trend, is also observed in the general case. The focus of memory access optimisations must therefore be on non-overlapping (case B) or partially overlapping (general case) windows from Equation 6.1. This observation is consistent with the initial motivation of an observed reduction in input memory bandwidth performance for the decimation and motion vector estimation case study algorithms.
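A minimal sketch of the thrashing condition above follows, assuming the 3-byte pixels used throughout the tests and reading the working set as 3·Tb·(I × J) bytes; the function name and the exact expression are an illustrative interpretation of the inequality, not a vendor model.

```cpp
#include <cstdint>

// Rough case B thrashing test: with Tb threads interleaved per (i, j) step
// and 3-byte pixels, roughly 3 * Tb * I * J bytes are live between reuses.
// If this exceeds the cache size S, the case B pattern thrashes.
bool caseBThrashes(uint32_t Tb, uint32_t I, uint32_t J,
                   uint32_t cacheBytes = 128 * 1024) {   // estimated 7900 GTX S
    uint64_t workingSet = 3ull * Tb * I * J;             // bytes per (i, j) step
    return workingSet > cacheBytes;
}
```

For example, caseBThrashes(1300, 16, 16) returns true, matching the degradation observed for large case B windows in Figure 6.3.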

[Figure 6.4: A Modified Graphics Processor Memory System: Illustrating Processing (P), Cache Request (C) and Memory Access (A) Patterns [31, 108]. The REDA — an address map (R), reorder buffer (RB) and control — sits between the cache and the memory channel, alongside the address generator (AG) and the other clients arbitrating for host and video memory.]

6.3 A Reconfigurable Engine for Data Access

A conceptual representation of the graphics processor memory system is shown in Fig- ure 6.4. The reconfigurable engine for data access (REDA) is inserted into the memory system as shown in dashed line boxes. It is important to note the similarity between the diagram in Figure 6.4 and the system model in Figure 5.4(b). The representation in Figure 6.4 is equivalent to the system model with a greater emphasis put on memory system features. With small modifications the system model from Figure 5.4(b) can be used to prototype the behaviour of the modified graphics processor memory system in Figure 6.4. To describe the modified graphics processor memory system, this Section is arranged as follows. Graphics processor memory system features necessary to the understanding of the REDA are explained in Section 6.3.1 (for further detail the reader is referred to Section 5.2). In Section 6.3.2 the REDA architecture is described. The proposed memory address mapping is presented in Section 6.3.3. In Section 6.3.4 a reconfigurable logic proof of concept implementation is described. A small reconfigurable logic core, suitable for embedding within a graphics processor, is considered in Section 6.3.5.

6.3.1 The Graphics Processor Memory System

The graphics processor memory system can be interpreted as a set of read and write clients, where each client arbitrates for video memory access through a shared memory channel.

The memory accesses (A) from the graphics processor cache 'read clients' are the focus of this work. A graphics processor has a number of cache blocks equal to the number of processing groups [108]. For the GeForce 7900 GTX, this is six shared cache blocks. The 'other clients' which arbitrate for access to video memory include output buffers, rasterisation and vertex processing pipeline stages. In the setup considered here the REDA module has minimal to no impact on the 'other clients.' The memory access behaviour of the 'other clients' can therefore be omitted from the model because it is a constant factor between the original and modified memory systems.

The order of the memory access pattern (A) output from the cache is determined by four factors: the processing pattern (P), the cache scheme, graphics processor kernel multi-threading (Tb) and the memory request order of a kernel function. Key attributes in the relationship of these factors are summarised below.

A graphics processor kernel is executed in multiple threads across multiple processing elements [108]. This multi-threading serves two purposes: to keep processor execution pipelines full; and to hide off-chip memory access latency. Ultimately the choice of graphics processor kernel and number of threads (Tb = 1300) determines the memory request order (C). In turn the graphics processor cache properties (estimated at 8 × 8 pixel cache lines, 4-way associativity and 128 KByte size for a GeForce 7900 GTX) determine the video memory access pattern (A) from cache requests (C). The processing pattern P is assumed to be a z-pattern [109].

A typical choice of video memory is GDDR3 DRAM, which has a trend of high latency, large bandwidth and decreasing memory access granularity [2]. A consequence is that systems which include DRAM memory are becoming increasingly sensitive to memory access locality. For a typical GDDR3 DRAM module the DRAM burst length B (access granularity) equals four [111]. It is assumed in this work that a cache miss results in the request of an entire cache line with consecutive addresses of granularity burst length (B = 4 in this example). In practice, partial cache line reads may be supported, as observed in Figure 5.8. This scenario would also be adequately supported by the REDA module. Video memory and the Host are the only blocks in Figure 6.4 which are 'off-chip' from the core graphics processor functionality.
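For reference, the sketch below collects these estimates into a single parameter record of the kind a behavioural model such as that of Chapter 5 can be configured with. All values are the text's estimates for a GeForce 7900 GTX, not vendor-published figures, and the structure itself is illustrative.

```cpp
// Estimated GeForce 7900 GTX memory system parameters, as quoted above.
struct MemorySystemParams {
    int cacheBlocks        = 6;          // one shared cache per processing group
    int cacheLineW         = 8;          // cache line of 8 x 8 pixels
    int cacheLineH         = 8;
    int cacheAssociativity = 4;          // 4-way set associative
    int cacheBytes         = 128 * 1024; // estimated cache size S
    int threadBatch        = 1300;       // Tb, threads in flight per batch
    int dramBurstLength    = 4;          // B, GDDR3 access granularity
};
```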

6.3.2 Outline of the REDA Architecture

The REDA is added to a graphics processor memory system as shown in Figure 6.4. A single REDA block is required for each cache block. The key blocks are address mapping, buffering and control. A mapping R must be chosen to optimise the cache behaviour of the graphics processor. If one considers the optimisation goals in Section 5.2.2, the aim is to minimise the number (|A|) and variation of off-chip memory accesses.

[Figure 6.5: REDA Architecture Description — the address mapping (R(A)), reorder buffer (RB) and control blocks of the REDA, placed between the cache (with its address generator, AG) and the memory channel; a texture ID signal selects between alternative address mappings.]

Figure 6.5 shows a more detailed description of the REDA architecture. The buffering and control blocks handle data rearrangement into cache lines and the choice between alternative address mapping schemes for multiple input textures (frames). A proof of concept case study of a single texture input is assumed in this work. Further, a one-to-one address mapping with DRAM burst length granularity reordering is applied. In this scenario the texture ID signal is a constant and only one address mapping is applied. This makes the buffer and control block designs trivial. The implementation detail therefore focuses on the address mapping block, which is the critical path. The key requirements are that the REDA implementation must be:

1. High speed, to match the graphics processor core clock frequency range of 430 to 650 MHz on the GeForce 7 generation of graphics processors [100].

2. Adaptable to different mapping schemes. This includes the mappings specified in Equations 6.3 to 6.6.

3. Transparent for already optimal access patterns.

4. Low latency to have minimal impact on the already large off-chip DRAM latency.

Reconfigurable logic provides an efficient platform to support the flexibility demanded by requirements 2 and 3. Requirement 3 is achieved by default, with a small datapath delay, through the choice of a customised routing of only wire interconnects under the condition of already optimal access patterns. The satisfaction of all of the above requirements is analysed and justified in Section 6.3.4.

Memory access optimisation requires two stages. Firstly, an address mapping (R) is applied between the graphics processor cache and address generator. Secondly, to compensate, the reverse mapping (R⁻¹) is applied to the request pattern of kernels. For data continuity the mapping R must be an invertible function. A choice of R to optimise the poorly performing access patterns from Section 6.2 is now presented in Section 6.3.3.

[Figure 6.6: A Representation of the Address Space for the Proposed Mapping Scheme — (a) Original (Small), (b) Mapping (Small), (c) Original (Large), (d) Mapping (Large).]

6.3.3 Memory Address Mapping

To demonstrate the purpose of the REDA module an address mapping (R) for case B (non-overlapping windows) is proposed. This is then extended to the general case of partially overlapping windows. To avoid repetition only the mapping R is defined. The reverse mapping R⁻¹ is calculated by inverting Equations 6.6 and 6.8.

The proposed mapping is to translate a fixed stride length of memory accesses to consecutive memory addresses. This is performed with respect to the target reduction size (I, J) and a threading factor (T). To maintain the video memory granularity, bursts of √B × √B pixels remain in sequential order.

For the case B memory access pattern the address space is shown in Figure 6.6, where each coloured square represents a block of √B × √B input frame pixels. The address space is shown as would be observed when looking from the graphics processor kernels to the cache in Figure 6.4. A small and a large example of the original (Figures 6.6(a) and 6.6(c)) and mapped (Figures 6.6(b) and 6.6(d)) address spaces are shown. In the small and large examples the accesses for four and sixteen outputs are shown respectively. Each output requires the access of an I × J block of input frame pixels (one pixel of each colour/shade). In the example in Figure 6.6 the threading factor T is equal to four. It is observed that when the number of outputs exceeds the threading factor T, for example in the large example, the pattern for I√T × J√T input frame pixel addresses (the small example) is replicated in the x and y dimensions.

The REDA block mapping from a requested address (xaddr, yaddr) in A to the memory access address (xmap, ymap) in R(A) is defined in Equations 6.3 to 6.6 for xmap, where mod and ⌊·⌋ are the modulo and floor operators respectively. The value √T is the threading factor in the x and y dimensions. Note that the threading factor (T) for the REDA mapping is different to the number of graphics processor threads (Tb) mentioned previously.

$$x_b = x_{addr} \bmod \sqrt{B} \qquad (6.3)$$
$$x' = (x_{addr} - x_b) \bmod I\sqrt{T} \qquad (6.4)$$
$$x'' = \sqrt{B}\left\lfloor \frac{x'}{\sqrt{B}\sqrt{T}} \right\rfloor, \qquad x''' = \left\lfloor \frac{x'}{\sqrt{B}} \right\rfloor \bmod \sqrt{T} \qquad (6.5)$$
$$x_{map} = I\sqrt{T}\left\lfloor \frac{x_{addr} - x_b}{I\sqrt{T}} \right\rfloor + x'' + I\,x''' + x_b \qquad (6.6)$$
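A minimal software sketch of Equations 6.3 to 6.6 for one coordinate follows; the function and parameter names are illustrative, with sqrtT and sqrtB standing for √T and √B, and all quantities assumed non-negative integers.

```cpp
#include <cstdint>

// Case B address mapping of Equations 6.3 to 6.6, x coordinate only
// (the y mapping is identical with J in place of I).
uint32_t mapX(uint32_t xAddr, uint32_t I, uint32_t sqrtT, uint32_t sqrtB) {
    uint32_t xb   = xAddr % sqrtB;                       // Equation 6.3
    uint32_t xp   = (xAddr - xb) % (I * sqrtT);          // Equation 6.4: x'
    uint32_t xpp  = sqrtB * (xp / (sqrtB * sqrtT));      // Equation 6.5: x''
    uint32_t xppp = (xp / sqrtB) % sqrtT;                // Equation 6.5: x'''
    return I * sqrtT * ((xAddr - xb) / (I * sqrtT))      // Equation 6.6
           + xpp + I * xppp + xb;
}
```

As a worked example, with I = 2, √T = 2 and √B = 1, the case B stride-I requests of neighbouring threads at xaddr = 0 and 2 map to the consecutive addresses 0 and 1, which is exactly the stride reduction the scheme intends.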

An equivalent relationship to that from xaddr to xmap, in Equations 6.3 to 6.6, can be formulated for yaddr to ymap by substituting the variables J and yaddr for I and xaddr. The generalised mapping requires small modifications to Equations 6.4 and 6.6, as shown in Equations 6.7 and 6.8. The window dimension I is substituted with I′ = (I − Ox). The mapping for ymap is similarly generalised, where J′ = (J − Oy).

$$x' = (x_{addr} - x_b) \bmod I'\sqrt{T} \qquad (6.7)$$
$$x_{map} = I'\sqrt{T}\left\lfloor \frac{x_{addr} - x_b}{I'\sqrt{T}} \right\rfloor + x'' + I'\,x''' + x_b \qquad (6.8)$$

In both scenarios the tuples (I, J) (case B) and (I′, J′) (general case) are the predicted (x, y) dimension stride lengths between memory accesses from neighbouring threads, as explained in Section 6.2. The mapping presented above attempts to reduce this stride length to a stride of (√B, √B) pixels.

6.3.4 Reconfigurable Logic Proof of Concept

It is assumed that in the address generator (AG) in Figure 6.4 a memory address is formed from an (x, y) coordinate and a texture identifier (ID). Since the maximum texture size is 4096 × 4096 pixels, each x and y coordinate can be represented with 12 bits.

Although the above mapping appears complex, its implementation in reconfigurable logic is efficient. The design can be simplified to bit manipulation for the special case when √T, I, J are powers of two and B is chosen such that B = 2^{2i}, where i is a positive integer. If one considers the x coordinate, the bit rearrangement is shown in Equation 6.9, where m = log2(I) + t, t = log2(√T) and b = log2(√B). The original x coordinate bit vector is shown in Equation 6.10, where x11 is the most significant bit.

$$x_{map} = x_{11} \ldots x_{m}\;\, x_{b+t-1} \ldots x_{b}\;\, x_{m-1} \ldots x_{b+t}\;\, x_{b-1} \ldots x_{0} \qquad (6.9)$$

$$x = x_{11}\,x_{10}\,x_{9}\,x_{8}\,x_{7}\,x_{6}\,x_{5}\,x_{4}\,x_{3}\,x_{2}\,x_{1}\,x_{0} \qquad (6.10)$$
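The sketch below implements this power-of-two fast path in software as a cross-check against Equations 6.3 to 6.6; in the reconfigurable fabric the same rearrangement is pure wiring. It assumes log2I ≥ b (i.e. I/√B is an integer, as in the prototype), and the function and parameter names are illustrative.

```cpp
#include <cstdint>

// Power-of-two fast path of Equation 6.9: two bit fields of the 12-bit x
// coordinate swap position; the bits below b and at or above m are unchanged.
uint32_t mapXBits(uint32_t x, uint32_t b, uint32_t t, uint32_t log2I) {
    uint32_t m    = log2I + t;                    // log2(I * sqrt(T))
    uint32_t xb   = x & ((1u << b) - 1);          // bits b-1..0, unchanged
    uint32_t high = x & ~((1u << m) - 1);         // bits 11..m, unchanged
    uint32_t f1   = (x >> b) & ((1u << t) - 1);                 // x''' field
    uint32_t f2   = (x >> (b + t)) & ((1u << (log2I - b)) - 1); // x''  field
    return high | (f1 << log2I) | (f2 << b) | xb;
}
```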

√B  √T  I   XtremeDSP  Slices  Latency
2   16  6   1          32      8
2   32  10  1          34      8
4   16  16  0          7       1

Table 6.3: Resource Usage and Latency for Variants of the REDA Block when Implemented on a Xilinx Virtex-5 Device

If the window size I, J is not a power of two then the implementation is more complex. The parameter √T can be controlled to be a power of two, and √B is often a power of two because it matches the DRAM burst length. Cases where I/√B is an integer are considered here; however, in general any value of the variables I, J, √T, √B can be implemented.

Prototype mappings implemented on a Xilinx Virtex-5 xc5vlx30 device have the resource usage and latency shown in Table 6.3. It is clarified here that the Virtex-5 is used only as a prototype platform. The proposal is of a small embedded reconfigurable unit to be included in the graphics processor architecture. The latency of the design is 1 cycle for the case in Equation 6.9 and 8 cycles when I, J are non power of two values. The minimum clock period, for non power of two I, J values, is 1.422 ns (taken from post place and route results in Xilinx ISE 9.1i). This suggests that the specified maximum DSP48E speed of 550 MHz is achievable. For a small and specially designed, in-system, reconfigurable unit it is reasonable to predict that a higher clock speed is possible.

6.3.5 The Requirements for the Embedded Core

The Virtex-5 FPGA prototype from above provides an estimate of the achievable speed and resource requirements for implementing the REDA module in reconfigurable logic. However, the prototype can only provide limited detail on the area requirement and practicality of implementing a small scale reconfigurable block within a graphics processor. To explore these additional factors, work by Wilton [139] is considered.

In [139], Wilton proposes a synthesisable datapath-oriented embedded FPGA fabric. This fabric is targeted specifically at small embedded circuits with a pre-specified dataflow; it is therefore a good match to the REDA. The requirements for a design to be mappable to Wilton's fabric are that the input and output bit-widths are known at synthesis time. For the REDA, the I/O is a pre-determined bit-width of 12 bits for the x, y coordinates.

The functional components in Wilton's fabric are implemented as D word-blocks, each with N bit-blocks [139], where N is the fixed bus bit-width. Other parameters include the number of feedback registers required, constant registers (needed for mask operations) and a control block. Each parameter is a pre-fabrication customisable option.

[Figure 6.7: Component Level Diagram of the REDA Module with Non-Trivial Components Labelled A–H (where b = log2(√B) and t = log2(√T)). The datapath computes x'', x''' and xmap from xaddr using shift, mask, add and constant-multiply components.]

Component  Input A        Input B    Notes
A          12 − t         12         assumes t > b, else Input A is 12 − b bits
B          24 − t         12         the prototype uses the DSP48 block adder
C†         t + 1          4          assumes maximum unsigned multiply of 16
D          12 − t         12 − b
E          t + i − b + 1  b          minimises to a concatenation for I > √B
F†         12 − t         4          assumes maximum unsigned multiply of 16
G          t + i + 1      i + b
H          t + i + 1      t + i + 2

† Can also be implemented as one or more adder operations.
Non-constant coefficient components are highlighted in bold type.
Key: i = ⌈log2(I)⌉, t = log2(√T) and b = log2(√B)

Table 6.4: Input Bit-Widths for Key REDA Block Components

A limitation of Wilton's fabric, which is important to this work, is that operations are performed on buses as a unit of N-bit functions. This setup makes bit-wise operations challenging. However, this limitation can be overcome with the use of mask functions. The cost is the consumption of one word-block and one constant register [139].

The REDA block, as implemented in the prototype, is shown in Figure 6.7. The input bit-widths of key components are detailed in Table 6.4. A proposed mapping of the REDA to Wilton's fabric (by word-blocks) augments Figure 6.7. The key attributes of this mapping are explained below. The datapath is assumed to be a fixed bit-width (N) of 12 bits to match the (x, y) coordinate dynamic range. In practice smaller buses may be used for some interconnects, but this setup provides a worst case scenario.

A divide by I operation is the most complex function to be performed. For the prototype design, a multiplier and adder from a Xilinx DSP48 block are utilised for this operation. The adder is used to apply a small and pre-determined dither value (Idither) post multiply to ensure correct rounding at the floor operation stage. For the maximum dynamic range of the input (xaddr) and for case study values of I, a 12-bit constant input to the multiplier and adder is sufficient to maintain full accuracy after a floor operation. This logic is supported in Wilton's fabric as follows. An N-bit word-block is replaced by an N-bit hard core multiplier, which is assumed to be of equivalent die area in [139]. The adder requires two word-blocks because of the (24 − t)-bit input from the multiplier. An extra pipeline delay is potentially required here to maintain a high clock rate.

Components C and F are constant multiply operations. These can be implemented as one, or in some cases two, addition operations with shifted inputs. The requirement is therefore a maximum of four word-blocks for both operations. Alternatively, a further two hard core multipliers could be added in place of two word-blocks. This reduces the flexibility of the core for future mapping schemes but reduces the area requirements.

The four non-constant addition operations (D, E, G and H) each require one word-block to implement. This is because the maximum bit-width of their respective inputs is 12 bits. This reduces further in practical scenarios where t > 0 (or T > 1).

Two more groups of circuit components are highlighted in Figure 6.7 to calculate x'' and x'''. These represent components which require bit-wise operations. It is assumed that each of these operations may be combined within one word-block as a mask operation. This setup also requires an additional input from a constant register, as mentioned above.

In total, 8 word-blocks, 3 multipliers and 2 constant registers are required to implement the circuit in Figure 6.7. An equivalent parametrisation in [139], with key parameters of a fixed bit-width of 16 bits, 12 word-blocks, 4 multipliers and 2 constant registers, would over-satisfy these requirements. The die area requirement for this core in a 130 nm process is 0.363 mm² [139]. For the GeForce 7900 GTX graphics processor, the process technology is 90 nm, so this figure is an overestimate by up to 1.4 times. Regardless, the required area as a percentage of the graphics processor die size (220 mm²) is 0.18%. For two blocks (required for the x and y coordinates) the total is 0.36% for one REDA. If there are six cache blocks (as in the GeForce 7900 GTX) the total requirement is 2.16% of graphics processor die area.

A smaller die area is expected if optimisations to the above liberally chosen architecture were made. However, the estimates presented here are sufficient to justify that, for the order of magnitude performance improvement observed in Section 6.4, an area penalty of approximately 2% for the REDA is a reasonable trade-off to make.

6.4 The Performance Gain from the REDA Module

In this Section the performance gain from including the REDA in a graphics processor memory system is analysed. First, the system model described in Section 5.2.1, with the enhancements presented in Section 6.3, is used to model the improvement in off-chip memory access behaviour (Section 6.4.1). Second, an ‘open loop test’ graphics processor implementation is analysed (Section 6.4.2). This verifies the performance gain.

6.4.1 Analysis using the System Model

For the analysis presented in this Section a 16 processing element system model is implemented. An input frame size of 512 × 512 pixels is used, with the model parameters Tb and S chosen, as detailed below, to provide a representative example of a graphics processor system. The model can be tailored to a specific graphics processor architecture through a given set of model parameters. From here, references to memory reads without further clarification refer to 'off-chip' memory accesses.

Choice of Model Parameters

For the memory access patterns in Section 6.2 the off-chip memory access behaviour of the graphics processor is determined by the relationship between the number of threads (Tb) and the cache size (S). This relationship is shown for sample values in Table 6.5. It is observed that the cache space required to reduce the number of off-chip memory reads increases approximately linearly with the number of threads (Tb). This scenario is equivalent to that observed in Equation 5.6.

For the results below a cache size of 16 KBytes and a number of threads equal to Tb = 128 are chosen. Both factors are approximately 10 times less than those predicted for the nVidia GeForce 7900 GTX. Due to the linear relationship above, this setup can be considered to be representative of the performance issues associated with the graphics processor.

S \ Tb     16   32     64     128    256
16 KByte   256  16384  16384  16385  16385
32 KByte   256  256    16384  16385  16385
64 KByte   256  256    256    15344  16384
128 KByte  256  256    256    256    16384

Table 6.5: Number of Memory Reads |A| (in KPixels) for Number of Threads (Tb) versus Cache Size (S) for a Fixed Window Size, where I, J = 16 and Ox, Oy = 0

[Figure 6.8: Number of Memory Reads |A| (solid lines) and Optimal Number of Memory Reads (dashed lines) versus Overlap (Ox, Oy), for window sizes from 2 × 2 to 16 × 16 pixels; the y-axis shows memory reads in KPixels on a logarithmic scale.]

By fixing the number of threads at Tb = 128 it is ensured that the number of threads being concurrently processed at one time instance does not exceed the target frame size (512/I × 512/J pixels) for window sizes (I × J) up to 16 × 16 pixels.

Analysing the Number of Memory Accesses

To analyse the number of off-chip memory accesses the model is firstly implemented without the REDA module. This setup is used to explore the performance issues for the original system as mentioned in Section 6.2. Secondly, results when including the REDA module are shown to highlight the benefits over the original system.

Figure 6.8 shows the change in the number of memory accesses |A| for varying window size and overlap. An optimum scenario of perfect cache performance is shown as a dashed line for each window size. It is observed that the optimal number of memory accesses decreases with increasing overlap size. This is because, with increased overlap, neighbouring windows share a greater portion of the input frame, so fewer unique input pixels need be read for an equal number of on-chip memory accesses (|C|). For window sizes below 6 × 6 pixels, and larger windows with a degree of overlap equal to or greater than (I − 3) × (J − 3) pixels, the number of off-chip reads follows the trend of the optimal value. In the case of the 12 × 12 and 16 × 16 window sizes, the decrease in the number of off-chip memory accesses only occurs beyond an overlap of 8 × 8 pixels. This is expected due to a cache line size of 8 × 8 pixels.

[Figure 6.9: Number of Off-Chip Memory Reads |A| for Threading Factor (T) versus Window Size for a Memory Access Granularity of one (B = 1).]

The pattern for the reduction in memory reads for the 16 × 16 pixel window size beyond an overlap of 8 × 8 pixels is identical to that for an 8 × 8 pixel window size. This is because the memory step (I − Ox) between threads, explained in Section 6.2, is equal. An identical trend is observed for the 12 × 12 window size. Fragments of the 8 × 8 window trend are observed for window sizes less than 8 × 8 pixels. This behaviour is consistent with the equal contour values (where I − Ox is constant) in Figure 6.3.

The system model including the REDA is now considered. For the results that follow only the special case B of non-overlapping windows is considered. This setup is a valid representation of all the memory access behaviour because of the following observation. Graphics processor performance results in Figure 6.8, and later in Figure 6.13, show that the case of a variable degree of overlap exhibits equivalent performance figures to a non-overlapping window of (I − Ox) × (J − Oy) pixels.

The key factors to explore are the effects of varying the threading factor (√T) and DRAM burst length (B) on memory access behaviour for case study window functions. In Figure 6.9, the relationship between threading factor and window size is shown. The best case is a 60 times reduction in memory accesses for the 16 × 16 pixel window size. Even for a small threading factor of T = 8, a large reduction in the number of off-chip accesses (improvement in cache performance) is observed.

The results in Figure 6.9 consider a perfect scenario where the memory access granularity (burst length) is equal to one. In the general case (and for actual graphics processors) the granularity is greater than one. The relationship between threading factor (T) and memory access granularity (B) is exemplified in Table 6.6. It is observed that as B increases the required threading factor, to reduce the number of off-chip memory accesses, increases.

√T  √B = 1  √B = 2  √B = 4  √B = 8
1   16385   16385   16385   16385
2   4097    4097    4097    16385
4   1027    1027    4097    16385
8   272     260     257     16385
16  272     260     257     16385

Table 6.6: Number of Off-Chip Memory Reads |A| (in KPixels) for Threading Factor (T) versus Memory Access Granularity (B) for a Fixed Window Size of 16 × 16 Pixels

The table therefore shows the proportions of √B and √T that are required to overcome cache 'thrashing'. For √B = 8 no improvement is possible. This is the scenario where the granularity equals the cache line size, and therefore no optimisation is possible by default. In practice one makes a choice of off-chip memory based on a trade-off between memory access granularity B (flexibility) and peak bandwidth (speed).

An Analysis of the Variation in Memory Accesses

It is observed above that the total number of memory reads is reduced by the inclusion of the REDA module. However, the memory access granularity (burst length) is also reduced, from the cache line size of 8 × 8 pixels (in the original system setup) to a factor of B when the REDA is included. This is a potential limitation of the REDA, because it should translate to a reduction in the locality of memory accesses. In this Section the effect of changing the burst length (B) and threading factor (T) on the distribution of memory accesses is analysed.

Figures 6.10 and 6.11 show the relationship for varying B and T respectively. In both cases, the window size is fixed at the worst case scenario of 16 × 16 pixel windows, and the non-varied parameter (T or B respectively) is held constant (at a value of T = 256 or B = 4 respectively).

Figure 6.10 shows the distribution of the original system, in Figure 6.10(a), compared to variations of √B, in Figures 6.10(b) to 6.10(d). In all cases a histogram is taken of the difference between consecutive memory reads (in the set A), as detailed in Equation 5.5. Each histogram is shown as a fraction of the total number of memory accesses |A| in the given system setup. This arrangement represents the percentage of each difference in accesses. A large peak at the origin (zero address difference), which exceeds the range of the y-axis, is approximately equal to one in all cases. This setup enables the small variations in off-chip accesses to be represented clearly. It is reasonable to assume that the variations shown here correlate to the average off-chip memory access performance.

The first observation from Figure 6.10 is that the distribution for the original system shows significantly higher peaks than for the REDA optimised block.

[Figure 6.10: The Memory Access Behaviour of the System Model for Fixed Threading Factor (T = 256) and Window Size (16 × 16 Pixels). Subplots (a)–(d) show the distribution Adist for the original system and for the REDA with √B = 1, 2 and 4; subplots (e)–(h) show the corresponding memory access patterns (address versus access number).]

This would initially be interpreted as contravening the expectation that the distribution should increase when including the REDA. The justification for the larger deviation in the original system case is shown through plotting the respective access pattern for each distribution in Figures 6.10(e) to 6.10(h). A cache thrashing behaviour is evident for the original system in Figure 6.10(e). This justifies the large peaks in Figure 6.10(a). For the REDA implementations the distribution becomes less varied as the value of B is increased. This is intuitive because the memory access granularity, and hence the number of memory accesses which occur consecutively, increases with B. It is observed that for minimal memory access distribution it is optimum to set the granularity at B = 16.

A downside of the REDA is observed in Figures 6.10(f) to 6.10(h). Memory steps greater than in the original system case (of the order of five times in the example shown) occur infrequently. This is represented in the distribution plots by a greater range of the x-axis and very small peaks beyond a difference value of −6 × 10^4. These occurrences are a very small fraction of the total number of memory accesses, and are expected to be amortised by the otherwise good memory access performance.

It is also interesting to consider the effect of increasing the value of the threading factor (T) on the memory access distribution. This scenario is explored in Figure 6.11. Again the distribution plots, in Figures 6.11(a) to 6.11(d), can be compared to the memory access patterns, in Figures 6.11(e) to 6.11(h).

For the scenario where T = 4 the distribution follows that of the original system in Figure 6.10(a). Under this scenario the cache thrashing behaviour remains apparent in Figure 6.11(e). The total number of memory accesses is reduced by a minimal factor of three times (from Figure 6.10(e) to Figure 6.11(e)) in this case.

As T is increased in Figures 6.11(f) to 6.11(h) the cache thrashing behaviour is diminished. Beyond T = 64, the number of memory accesses does not reduce further. At this point the maximum variance is also within the range of the original system in Figure 6.10(a). Larger peaks occur away from the origin in Figure 6.11(c) than in Figure 6.10(a). This behaviour is due to a reduction in the total number of accesses, which makes the cache misses between quadrants of the input frame more apparent.

The memory access variance behaviour of the system including the REDA can be improved further with additional increases in T beyond T = 64. For a value of T = 256, in Figure 6.11(d), the large peaks observed away from the origin for T = 64 do not appear, and the occurrence of all other peaks is diminished. The cost is that the maximum variance is increased. As reasoned above for the variable B case, this is expected to have minimal impact on overall performance. Whilst a value of T = 64 is observed to be sufficient for decreasing the number of memory accesses, when one considers the variance in memory accesses an increase to larger values of T, for example T = 256, is beneficial.

[Figure 6.11: panels (a)–(d) plot Adist for the REDA with T = 4, 16, 64 and 256 (occurrences, as a fraction of the total, against the difference A(z) − A(z−1)); panels (e)–(h) plot the corresponding access patterns (address against access number).]
Figure 6.11: The Memory Access Behaviour of the System Model for Fixed Memory Access Granularity (B = 4) and Window Size (16 × 16 Pixels). Results Show Variations in Distribution Adist ((a)–(d)), Pattern ((e)–(h)) and Threading Factor (T)

[Figure 6.12: surface plot of time per frame (ms) against threading factor (T) and window size (I, J).]

Figure 6.12: Graphics Processor Performance (in milli-seconds) for Threading Factor (T) versus Window Size (I × J) for Fixed Memory Access Granularity B = 1

6.4.2 Enhanced Graphics Processor Performance

Whilst the above results predict impressive performance gains for the proposed system, it is desirable to verify the improvement with a prototype on a graphics processor. For an open loop test, the address mapping (R) can be pre-computed and stored in video memory. For fixed variables (I, J, Ox, Oy, T and B) the graphics processor on-chip cache behaviour will then be equivalent to that of a system which includes the REDA module.

The reverse mapping R−1(.) is implemented in graphics processor kernels. The processor cycles required to compute the reverse mapping may be considered as free due to the high ratio of memory accesses (I × J) to compute instructions. This setup provides a proof of concept test for the mapping R and an upper bound on the performance benefits. The results are an upper bound because, although the number of memory accesses |A| is equal, the access pattern of the proof of concept test has a greater number of sequential accesses. This is because requests still occur as full or partial cache line reads of 64-pixel cache blocks, in contrast to B-pixel blocks in the proposed system.

The experimental setup is as described in Section 6.2. The scenario of non-overlapping window functions is first explored; this verifies the performance gains predicted in Section 6.4.1. The similarity in the performance for overlapping windows to the non-overlapping case is then shown.
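To make the open-loop setup concrete, the sketch below pre-computes a candidate mapping table R on the host: for successive groups of T windows, the T lock-step accesses to each window-pixel position are laid out at consecutive source addresses. This is a hypothetical layout written purely for illustration (the precise mapping R is that defined in Section 6.3), and the frame dimensions are assumed to divide exactly by the window size.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical pre-computation of a mapping table R for the open-loop test.
    // Element n of R holds the source-frame address read by the n-th access of
    // the remapped stream, so that T concurrently processed windows touch
    // consecutive addresses. Illustrative only; not the thesis's exact mapping.
    std::vector<std::uint32_t> precomputeMapping(std::uint32_t width,
                                                 std::uint32_t height,
                                                 std::uint32_t I, std::uint32_t J,
                                                 std::uint32_t T) {
        const std::uint32_t windowsX   = width / I;               // windows per row
        const std::uint32_t numWindows = windowsX * (height / J);
        std::vector<std::uint32_t> R;
        R.reserve(static_cast<std::size_t>(width) * height);
        for (std::uint32_t g = 0; g < numWindows; g += T) {       // groups of T windows
            for (std::uint32_t p = 0; p < I * J; ++p) {           // pixel index in window
                for (std::uint32_t t = 0; t < T && g + t < numWindows; ++t) {
                    const std::uint32_t wx = (g + t) % windowsX;  // window coordinates
                    const std::uint32_t wy = (g + t) / windowsX;
                    const std::uint32_t x  = wx * I + p % I;      // source pixel coordinates
                    const std::uint32_t y  = wy * J + p / I;
                    R.push_back(y * width + x);                   // source address
                }
            }
        }
        return R;  // uploaded to video memory; kernels evaluate the reverse mapping
    }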

Non-Overlapping Window Functions

Figure 6.12 shows results for varying window size I, J and threading factor T, with no overlap (i.e. Ox, Oy = 0) and a memory access granularity of one (B = 1). For an increased threading factor T, the time taken per frame reduces sharply; even for a small value of T = 16, the performance improvement is significant.

Local minima are observed, for all I, J, where the root of the threading factor is a power of two (i.e. √T = 2ⁿ). These are choices of T for which cache performance is superior. This observation justifies the choice of √T as a power of two in Section 6.3.4.

The optimal choice of threading factor T varies with window size I, J. For example, for I, J equal to 16 and 8 the optimum values of T are 64 and 256 respectively. In the proposed system the optimum value of T is likely to vary from the values shown here, subject to the variance in memory access patterns mentioned above.

The maximum performance improvement observed in Figure 6.12 is twenty times, for a window size of I, J = 16. For window sizes down to 4 × 4 pixels the performance improvement remains significant, at a value of 2 to 12 times.

For the results shown above the memory access granularity is assumed to be B = 1. In practice this value is higher (B = 4 for a GDDR3 RAM module [111]). Graphics processor performance for increasing memory access granularity B is shown in Table 6.7. As the granularity B is increased, the degree to which performance can be optimised with the proposed mapping scheme reduces. For √B = 2 (current DRAM granularity) this reduction is marginal and a twelve times performance improvement is observed. This is encouraging for the proposed REDA optimisation scheme.

The reasoning for the reduction in the performance improvement as B is increased is as follows. Although the number and variance of off-chip accesses is reduced for increased B, the on-chip memory access behaviour is compromised. The steps between on-chip memory accesses, under the optimised scheme, increase with the granularity B. This is currently an unavoidable limitation of graphics processor technology, which is optimised for both on- and off-chip memory access locality.

√B \ √T     1      2      4      8      16     32
 1         70.5   45.5   16.1    5.8    4.0    3.4
 2         71.8   36.4   13.8    6.9    6.7    6.2
 4         69.9   36.1   19.5   15.1   13.4   13.4
 8         69.3   63.7   41.7   40.6   39.1   38.8

Table 6.7: Graphics Processor Performance (milli-seconds) for Threading Factor (T) versus Memory Access Granularity (B) for a Fixed Window Size of 16 × 16 Pixels
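As a worked check on the quoted improvements, reading Table 6.7 with rows √B and columns √T: for √B = 1 the frame time falls from 70.5 ms at √T = 1 to 3.4 ms at √T = 32, a factor of 70.5/3.4 ≈ 20.7 (the twenty times improvement above); for √B = 2 it falls from 71.8 ms to 6.2 ms, a factor of 71.8/6.2 ≈ 11.6 (the twelve times improvement).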

[Figure 6.13: surface plot of time per frame (ms) against threading factor (T) and overlap (Ox, Oy).]

Figure 6.13: Graphics Processor Performance (in milli-seconds) for Threading Factor (T) versus Overlap (Ox × Oy) for Fixed Memory Access Granularity B = 1 and Window Size (16 × 16)

To improve on-chip cache performance, a more complex mapping scheme is conjectured. In this scenario the cache lines are reordered, post-read from off-chip memory, in the REDA buffer block. An exploration of this possibility would require more information on the nature of partial cache line reads than is currently provided by graphics processor vendors. This is, however, one viable extension to the use of the REDA block.
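As an illustration only, the conjectured reordering might be prototyped along the following lines: a fetched 64-pixel cache line is split into B-pixel chunks, which are written into the REDA buffer in the order the processing elements will consume them. This is a speculative C++ sketch; the permutation serveOrder, the data types and the fixed line width are assumptions introduced purely for this explanation.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Speculative sketch of post-read cache line reordering in the REDA buffer.
    // 'fetched' is one cache line; 'serveOrder' is a hypothetical run-time
    // configured permutation of its chunk indices (serveOrder.size() must equal
    // fetched.size() / B); 'B' is the pixels-per-chunk granularity, e.g. 4.
    std::vector<std::uint8_t> reorderLine(const std::vector<std::uint8_t>& fetched,
                                          const std::vector<std::size_t>& serveOrder,
                                          std::size_t B) {
        std::vector<std::uint8_t> buffered(fetched.size());
        for (std::size_t c = 0; c < serveOrder.size(); ++c)   // destination chunk slot
            for (std::size_t p = 0; p < B; ++p)               // copy one B-pixel chunk
                buffered[c * B + p] = fetched[serveOrder[c] * B + p];
        return buffered;  // chunks now leave the buffer in consumption order
    }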

Variable Degrees of Window Overlap

To verify the portability of the proposed mapping to the general case of variable window overlap (Ox, Oy), the performance for a fixed window size of 16 × 16 pixels, variable overlap and increasing threading factor (T) is shown in Figure 6.13. The performance mirrors that of the non-overlapping window case in Figure 6.12 for cases where the non-overlapping I equals the overlapping I − Ox. The optimal values of threading factor T are, in similarity to the non-overlapping case, observed to occur at powers of two.

6.5 Summary

It has been shown that a small reconfigurable logic block can be used as system glue logic to improve the performance of a graphics processor by an order of magnitude for a small die area cost. This is exemplary of the wider implications of such use of reconfigurable logic in homogeneous multi-processing element (HoMPE) architectures.

To present the alternative options for customising a HoMPE architecture with reconfigurable logic, a classification has been proposed in Figure 6.1. The classification includes five distinct categories with an increasing 'level of integration.' At the extremes, reconfigurable logic is integrated at a 'high' system level or at a 'low' glue logic level to optimise the behaviour of the HoMPE architecture. The related works, including contributions made in Chapter 4, are applicable to the use of reconfigurable logic for computation (classes (1) to (4) in Figure 6.1). Glue logic uses are less prominent and, for the graphics processor, are not present outside of the author's work. To demonstrate the viability of one glue logic option the following has been presented.

A modified graphics processor system was proposed which includes a reconfigurable engine for data access (REDA) to optimise video processing-specific memory access patterns. In this work the REDA was targeted at optimising poorly performing non-overlapping or partially overlapping window functions, as present in video processing. As proof of concept, an FPGA device implementation of the REDA was shown to perform at over 550 MHz for a clock cycle latency cost of up to 8 cycles. This is acceptable for practical uses of the REDA. To further explore the challenges of an embedded REDA block a synthesisable fabric [139] was explored. Subsequently, a conservative die area penalty of 2% of the graphics processor, for an equal latency, was estimated for the REDA. This is for a technology node one generation behind that of a GeForce 7900 GTX.

Due to the REDA, a sixty times reduction in the number of off-chip memory accesses, over initial conditions, is observed. The effect on memory access locality has also been shown, under correct parametrisation, to be an improvement on the original scenario. An open loop test verified the improvement on a graphics processor. Up to a twenty times speed-up was observed for the worst case, and up to twelve times for alternate scenarios.

The impact of memory access granularity (DRAM burst length) on the REDA has also been explored. A higher granularity is shown to negatively impact the on-chip memory performance of a graphics processor. This is identified as one area where the REDA work may be extended, given the required data from device vendors.

To summarise, the performance of a HoMPE architecture can be improved by the addition of a small reconfigurable logic block. This utilises the datapath flexibility of reconfigurable logic and the deterministic nature of the memory access patterns found in video processing algorithms.

Chapter 7

Conclusion

7.1 Summary of the Thesis

The contribution of this thesis can be summarised as: 'the presentation of the processes and characterisations which are necessary to explore the substitutive or collaborative use of reconfigurable logic and graphics processors for video processing acceleration'.

There are two coalesced parts to this work: first, the comparison of current technologies; second, the exploration of new architectures. Through using the results of a comparison study to drive architecture exploration, an embedded reconfigurable logic block for implementation in a graphics processor has been proposed. This demonstrates one option for the use of reconfigurable logic at a very low level of integration with a graphics processor. The key results and findings of the work are concluded below.

For the success of any comparison study it is vital to clearly define the problem, assumptions, and approach taken. 'The Three Axes of Algorithm Characterisation' scheme, presented in Chapter 3, has been shown to be representative of the video processing application domain. This is a clear definition of the problem. In both Chapters 2 and 4, the comparison assumptions and setup have been presented in meticulous detail to justify a 'fair test.' The impact of numerical representation and device area are two areas where additional analysis was necessary and performed. All factors promote the application of the study to varied implementation scenarios.

It is observed in the comparison study that the graphics processor is sensitive to the localisation of memory accesses between neighbouring outputs. This is particularly evident for the decimation and motion vector estimation case studies. The low performance of motion vector estimation is also due to the presence of a large degree of data dependencies. These prohibit an efficient mapping of motion vector estimation onto a graphics processor.

The key contribution to take from the comparison study is the exemplification of the first clearly defined and structured approach to the comparison of the graphics processor and reconfigurable logic technologies.

To explore potential improvements to a graphics processor, a system model is presented in Chapter 5 as part of a design space exploration tool. The main contribution of the design space exploration tool is to define the processes, and a system model at a suitable level of abstraction, to enable the exploration of the pre- and post-fabrication customisable options for homogeneous multi-processing element (HoMPE) architectures. In Chapter 6, the application of the model is testimony to its purpose.

To demonstrate the system model, it is used to explore the trends of current graphics processor technology in Chapter 5. This exploration is driven by the results from the architectural comparison and therefore focuses on the algorithms with low memory access locality. The post-fabrication customisable options are identified as the processing pattern, the nature of the processing element and the memory system. In Chapter 6, the last of these options is explored.

A reconfigurable engine for data access (REDA) is proposed in Chapter 6 to optimise graphics processor memory accesses for video processing algorithms. The benefits and limitations of the proposed REDA module are explored through the system model from Chapter 5. It is shown that for algorithms which previously performed badly on the graphics processor, up to an order of magnitude performance improvement is possible. The cost of a REDA module is estimated at 2% of current graphics processor die area.

The contribution from the exploration of the REDA is wider than the observed order of magnitude performance improvement. In addition, this work demonstrates the viability of the use of reconfigurable logic as 'glue logic' in a graphics processor architecture.

To summarise the tangible findings further, the thesis questions from Section 1.2 are answered in Section 7.2. In Section 7.3 future work is summarised.

7.2 Answers to the Thesis Question

Q. 1. Which architectural features make a graphics processor suitable for providing real- time performance for current and future video processing systems?

For all case study algorithms, real-time performance was achieved for, at a minimum, a 720p high-definition video frame at 30 frames per second. An instruction inefficiency an order of magnitude lower than that of a general purpose processor is one intrinsic benefit of the graphics processor. In addition, an iteration level parallelism of 24 times and an input bandwidth of 68 GB/s, for an example graphics processor, are significant factors which contribute to its high performance for the case study algorithms.

Q. 2. Under what conditions is the graphics processor a superior technology to reconfigurable logic for the task of video processing acceleration? Conversely, when is a graphics processor an inferior technology?

For algorithms with a variable data reuse heuristic (e.g. interpolation), or which are arithmetic intensive (e.g. primary colour correction), a graphics processor is the superior technology. The graphics processor is inferior to reconfigurable logic if an algorithm exhibits a high degree of data dependence, in particular that which restricts iteration level parallelism (e.g. motion vector estimation). Alternatively, it is inferior if a large number of memory accesses per computation kernel is required (e.g. 2D convolution).

A graphics processor is observed, from the author's case studies and related works, to be a very significant rival to reconfigurable logic. The performance difference between the two technologies is within, and often significantly less than, an order of magnitude.

Q. 3. Given the desired freedom, how could one improve the graphics processor architecture to better its performance for video processing algorithms?

In addition to improvements from increased cache size or number of processors, enhancements to the following can improve graphics processor performance: the choice of processing pattern; and the inclusion of a small, prudently placed, reconfigurable logic block.

Q. 4. What are the options for, and is there a viable use for, an architecture which includes both reconfigurable logic and a graphics processor?

Five options for the use of reconfigurable logic in a graphics processor architecture, with varying levels of integration, are classified in Figure 6.1. A glue logic option, namely a reconfigurable engine for data access (REDA), provides up to an order of magnitude improvement in graphics processor performance, for case study memory access patterns.

Q. 5. What positive and negative features of the graphics processor architecture should system architects include and avoid in future video processing systems?

Positive: The factors which result in impressive performance for the graphics processor are high memory bandwidth and large compute power. The latter is drawn from a low instruction inefficiency and a fixed architecture which promotes multi-processor scalability.

Negative: Two factors which are specialised to graphics rendering and unfavourable for video processing are a fixed z-pattern processing order (from the nature of rasterisation) and a fixed on-chip memory system. The synchronisation of processing across all processing elements is also a restriction for algorithms with high data dependencies.

7.3 Future Work

To extend the work in this thesis the following directions could be taken.

Algorithm Characterisation and Choice

The case study algorithms presented in this thesis are carefully chosen to be representative of the broadcast video processing application domain. One extension of this work is to include these algorithms as part of new or existing benchmark sets, for example those from EEMBC, SPEC and Graphics Gems. The DENBench 1.0 benchmark from EEMBC is of particular relevance to this work.

In addition, 'The Three Axes of Algorithm Characterisation' scheme is a sufficient starting point to provide a uniform description of individual benchmarks and to describe the application domain abstractly. One envisaged scenario for the latter is as follows: the characterisation scheme is used to benchmark an architecture based on a generalised application domain description, in substitution for a specific set of algorithm benchmarks. A benefit of this approach is high-level culling of the architecture design space. For this purpose, the characterisation may be developed into a high-level application domain descriptor, as discussed at the end of this section.

To promote their use outside of this work, the efficient implementations of the case study algorithms may be included as part of a toolbox of IP cores for both reconfigurable logic and graphics processor solutions. In particular, the optimised histogram equalisation and motion vector estimation graphics processor implementations are challenging and would be useful IP cores. Similarly, the reconfigurable logic implementations of primary colour correction and video frame resizing are significantly challenging.

Further Comparison Work

For the comparison in Chapter 4, five case studies are sufficient to provide a representative study of the graphics processor and reconfigurable logic. It is however appealing to continue to apply the comparison methodology to a greater number of algorithms, for example from the benchmark sets mentioned above, or the related works in Section 2.4.

Further comparison work would provide the opportunity for additional population of 'The Three Axes of Algorithm Characterisation' application design space shown in Figure 3.4. A consequence of this is to highlight additional conditions under which performance differences occur between the graphics processor and reconfigurable logic. Of particular interest, from the observation of related work [13, 52, 129, 142], is to explore the performance for implementations of Fourier and wavelet transformation algorithms.

In a different context, the comparison work may be used as a basis for a co-design environment, whereby the graphics processor and a reconfigurable logic core are co-processors.

A suitable starting point is to extend doctoral work by Wiangtong [138] on hardware-software co-design for reconfigurable computing systems.

Enhancements to the Design Space Exploration Approach

At present the design space exploration tool in Chapter 5 is fully user driven; however, it would be possible to automate Stage a in Figure 5.1(b). An example of how this may be achieved can be found in work by Shen on the automated generation of SystemC transaction level models [121]. In addition to Shen's work, a method for culling the design space, to choose which model(s) to explore, is necessary. One approach to design space culling is to use multi-objective evolutionary algorithms (MOEAs) [23]. A relevant advantage of a MOEA is that it targets a region of the design space rather than a single solution.

The remainder of the process in Figure 5.1(b) currently requires significant designer intervention, for example, to determine where in an architecture post-fabrication customisation should be introduced. Currently, educated conjecture based on critical path knowledge is necessary for this task. However, given an increased portfolio of post-fabrication customisable options, from future research activities, a quantifiable heuristic by which to make this decision could be developed. These enhancements would ultimately enable the automation of all or the majority of the exploration process in Figure 5.1.

In Chapter 5, the exploration tool is used to consider only a subset of the architecture design space in Figure 5.3(b). A natural extension is to explore further regions of this design space. For example, the choice of on-chip interconnect is cited as a significant architecture design challenge in the TeraDevice [136] and HiPEAC [17] road maps.

A graphical front end for the system model, in the current non-automated format of the exploration tool, would also promote the ease of design space exploration. At present the model is explored using 'non-user friendly' C++ directives.

Further Applications of the REDA Module

The reconfigurable engine for data access (REDA), as presented in Chapter 6, is one significant example of the combined use of reconfigurable logic and graphics processor technologies: in this case, to improve graphics processor memory system performance for a subset of memory access patterns. It would be interesting to explore the further use of the REDA for different memory access patterns, from the video processing and alternative application domains. One interesting area to explore is how the REDA can be enhanced to cope with varying memory access stride lengths.

A perhaps grander enhancement is to extend the operation of the REDA from being prompted by memory requests from the graphics processor memory system to autonomously fetching data based on a run-time defined heuristic. In the extreme case, a graphics processor cache-based memory system may be replaced altogether with a heuristic buffer-based memory system.

As alluded to above, it is rational to explore further options for customising the graphics processor using reconfigurable logic. One possibility conjectured by the author is to reconfigure the interconnects between processing groups in Figure 2.3, so that in different arrangements it appears like a Cell/B.E., graphics processor, or other HoMPE architecture. This option may have a significant impact on an architecture’s suitability for implementing algorithms with high data dependence. A justification for this is that the configurable interconnect may be arranged to suit the data dependencies of an algorithm.

General Comment

The implementations, comparisons, system model and design space exploration approach are all targeted at the graphics processor architecture. It would be straightforward to extend the presented processes and analysis to other homogeneous multi-processing element (HoMPE) architectures, as defined in Section 2.2.3. With greater modifications, the work may be applied to emerging technologies including the Tilera Tile64 network on chip architecture. This device is observed to loosely fit into the HoMPE class.

The application of the above techniques to a wider spectrum of video processing algorithms is one interesting dimension in which to extend the comparison studies and design space exploration. To further broaden the applicability of this work, one may consider similar analysis and comparative work for application domains outside of video processing. For example, the domains of medical imaging and financial engineering are currently attracting great interest in new and novel uses of the graphics processor and reconfigurable logic. While all the work presented in this thesis is directly applicable to the wider scope of video processing algorithms, with a reasonable degree of modification it is also applicable to other application domains.

At a macro level, significant steps are made throughout this thesis which contribute towards the following goal: "To determine, from high-level descriptors, the type of architecture which would support a given application domain." The characterisation presented in Chapter 3, and the system model in Chapter 5, form two high-level descriptions of the application and architecture design spaces respectively. Further developments to broaden the scope of this work would promote suitable high-level descriptors towards achieving this goal. The scenario of benchmarking an architecture based on application characteristics (mentioned above) is one direction that could be pursued in this area. Further progress towards this goal may be made by porting the presented processes and characterisations to other application domains and architectures.

As a closing remark, consider the following: in the search for new accelerator technologies, it is vital to clearly define the systematic processes and models for application characterisation, performance comparison and the exploration of customisable architectural alternatives. We have followed this remark in this thesis, and hope that our work will stimulate further research in this direction.

Appendix A

Glossary

Related terminologies and abbreviations are summarised below.

Acceleration: The concept of a piece of hardware performing a given operation at a superior throughput rate to a general purpose processor.

Fragment Processing Pipeline: A programmable section of the graphics processor dedicated to applying lighting, texture and colour to a rendered scene. This is the primary part of a graphics processor which attracts interest for general purpose computation.

Graphics Processor: A device which performs acceleration of the compute and memory intensive texture mapping and rendering operations which are common to video games.

Latency: The time between a particular system input being incident and the output which is stimulated from that input being observable.

Reconfigurable Logic: A generic term used for a collection of fine- and coarse-grained configurable and hard-core blocks connected by a configurable interconnect. The most well known examples are heterogeneous FPGA devices from Altera Co. and Xilinx Inc.

Rendering pass: A single execution instance on a graphics processor with an input of a set of input vertices and an output of an update to all visible pixel values in a scene.

Round-robin: A popular arbitration scheme for system interconnects whereby each client is offered or given explicit use of the interface in a pre-defined order.

Shader: The instruction code used to program a fragment or vertex processing pipeline.

Throughput: The rate at which outputs are produced from a system. In the case of video processing this is often quoted in millions of pixels per second (MP/s).

Vertex Processing Pipeline: A programmable part of a graphics processor dedicated to transforming vertices within a target object space.

Video Processing: The process whereby a video stream is transformed to enhance its aesthetic properties to a viewer, or to extract useful information for further processing.

ALU Arithmetic Logic Unit
API Application Programming Interface
ASIC Application-Specific Integrated Circuit
BRAM Block select RAM, a coarse-grained memory block in a Xilinx FPGA
Cg C for graphics, the programming language for graphics shaders
CLB Configurable Logic Block, the fundamental logic block in a Xilinx FPGA
CMOS Complementary Metal-Oxide Semiconductor
CORDIC CO-ordinate Rotation DIgital Computer
CPIMA Clock Cycles per Internal Memory Access, see Section 4.6.1
CPLD Complex Programmable Logic Device
CPU Central Processing Unit, the GPP which controls a computer system
DDR Double Data Rate, a type of memory whereby data is transferred on each clock edge
DRAM Dynamic Random Access Memory
DSP Digital Signal Processor (or Processing), as clarified by context
DWT Discrete Wavelet Transform, a time to frequency domain transformation
EEPROM Electronically Erasable Programmable Read Only Memory
FIFO First In First Out, a queued buffering methodology
FFT Fast Fourier Transform, a time to frequency domain transformation
FPGA Field Programmable Gate Array, see term 'reconfigurable logic' above
fps frames per second, refers to the refresh rate of a video sequence
GB/s GigaBytes (billions of bytes) per second
GFLOPs One thousand million Floating Point OPerations per second (FLOPs)
GPGPU General Purpose computation on Graphics Processing Units, a common interest topic [30]
GPP General Purpose Processor, reference in this work is to high-end CPUs
GPU Graphics Processing Unit, see term 'graphics processor' above
HSL Hue, Saturation and Luminance, a colour space consistent with how the human eye perceives colour
HoMPE Homogeneous Multi-Processing Element architecture, a classification defined in Section 2.2.3 and inclusive of multi-processor devices
ISA Instruction Set Architecture
MP/s Millions of Pixels per second
ms thousandths of a second (milliseconds)
NFS-MVE Non-Full-Search Motion Vector Estimation, in reference to Section 3.2.5
ns billionths of a second (nanoseconds)
REDA Reconfigurable Engine for Data Access, a small reconfigurable logic block embedded in a graphics processor memory system to optimise its memory access behaviour, see Chapter 6
RGB Red, Green and Blue, an additive colour space model
RAM Random Access Memory
SAD Sum of Absolute Difference
SIMD Single Instruction Multiple Data
SPMD Single Program Multiple Data
SRAM Static Random Access Memory
SystemC A system level design library instantiated in C++ code
VHDL Very High Speed Integrated Circuit Hardware Description Language
VLIW Very-Long Instruction Word, a type of ISA
Ycxcy Luminance and polar(H,L), a colour space defined in Section 3.1.2

Bibliography

[1] Miron Abramovici, Paul Bradley, Kumar Dwarakanath, Peter Levin, Gerard Memmi and Dave Miller, A Reconfigurable Design-for-Debug Infrastructure for SoCs, In Proceedings of the 43rd Conference on Design Automation, pp 7–12, July 2006.

[2] Jung Ho Ahn, Mattan Erez and William J. Dally, The Design Space of Data-Parallel Memory Systems, In Proceedings of the ACM/IEEE Conference on Supercomputing, pp 80–92, November 2006.

[3] Altera Corp., Stratix Device Handbook: Volume 2: Chapter 6, [Available from]. http://www.altera.com/literature/hb/stx/ch 6 vol 2.pdf, June 2006.

[4] Altera Corp., Altera Web Article: FPGA Design Flow, http://www.altera.com/products/software/flows/fpga/flo-fpga.html, 2007.

[5] Altera Corp., HardCopy III Structured ASICs: Technology for Business, http://www.altera.com/products/devices/hardcopy3/hc3-index.jsp, 2007.

[6] Altera Corp., Official Press Release: Altera Announces FPGA-Based Accelerator Support for Intel Front Side Bus, http://www.altera.com/corporate/news room/ releases/products/nr-intelfsb.html, 2007.

[7] Altera Corp., Stratix III Device Datasheet, [Available from]. http://www.altera.com/literature/hb/stx3/stx3 siii52001.pdf, November 2007.

[8] AMD Inc., ATI CTM Reference Guide, http://ati.amd.com/companyinfo/researcher/ documents/ATI CTM Guide.pdf, 2006.

[9] AMD Inc., Published product information on company web page, http://ati.amd.com/products/mobilityradeonhd2600xt/specs.html, 2007.

[10] AMD Inc., AMD New Solutions: Accelerating Innovation, http://enterprise.amd.com/Downloads/Accelerate/Accelerating Innovation.pdf, 2008.

[11] Gene M. Amdahl, Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities, In Proceedings of the AFIPS Conference, Vol. 30, pp 483–485, April 1967.

[12] Ray Andraka, A Survey of CORDIC Algorithms for FPGA Based Computers, In Proceedings of the 6th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp 191–200, February 1998.

[13] Maria E. Angelopoulou, Konstantinos Masselos, Peter Y. K. Cheung and Yiannis Andreopoulos, Implementation and Comparison of the 5/3 Lifting 2D Discrete Wavelet Transform Computation Schedules on FPGAs, The Journal of VLSI Signal Processing (to appear), 2007.

[14] AutoESL Design Technologies Inc., AutoPilot FPGA: C,C++,SystemC to VHDL Compiler, http://www.autoesl.com, 2007.

[15] Zachary K. Baker, Maya B. Gokhale and Justin L. Tripp, Matched Filter Computation on FPGA, Cell and GPU, In Proceedings of the 15th IEEE Symposium on Field-Programmable Custom Computing Machines, pp 207–218, April 2007.

[16] Luca Benini, Davide Bertozzi, Alessandro Bogliolo, Francesco Menichelli and Mauro Olivieri, MPARM: Exploring the Multi-Processor SoC Design Space with SystemC, The Journal of VLSI Signal Processing, Vol. 41, No. 2, pp 169–182, 2005.

[17] Koen De Bosschere, Wayne Luk, Xavier Martorell, Nacho Navarro, Mike O'Boyle, Dionisios Pnevmatikatos, Alex Ramirez, Pascal Sainrat, André Seznec, Per Stenström and Olivier Temam, High-Performance Embedded Architecture and Compilation Roadmap, Transactions on High Performance and Embedded Architecture and Compilation (HiPEAC) LNCS 4050, pp 5–29, 2007.

[18] Christos-Savvas Bouganis, George A. Constantinides and Peter Y.K. Cheung, A Novel 2D Filter Design Methodology For Heterogeneous Devices, In Proceedings of the 13th IEEE Symposium on Field-Programmable Custom Computing Machines, pp 13–22, April 2005.

[19] D. Brooks, P. Bose, V. Srinivasan, M. K. Gschwind, P. G. Emma and M. G. Rosenfield, New Methodology for Early-stage, Microarchitecture-level Power-performance Analysis of Microprocessors, IBM Journal of Research and Development: Power-efficient Computer Technologies, Vol. 47, No. 5/6, pp 653–670, 2003.

[20] Ian Buck, Kayvon Fatahalian and Pat Hanrahan, GPUBench: Evaluating GPU Performance for Numerical and Scientific Applications, In Proceedings of the 2004 ACM Workshop on General-Purpose Computing on Graphics Processors (Poster), pp C–20, August 2004.

[21] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston and Pat Hanrahan, Brook for GPUs: Stream Computing on Graphics Hardware, ACM Transactions on Graphics Special Issue: Proceedings of the 2004 SIGGRAPH Conference, Vol. 23, No. 3, pp 777–786, 2004.

[22] Nicola Campregher, Peter Y.K. Cheung, George A. Constantinides and Milan Vasilko, Analysis of Yield Loss Due to Random Photolithographic Defects in the Interconnect Structure of FPGAs, In Proceedings of the 13th ACM/SIGDA International Symposium on Field-Programmable Gate Array, pp 138–148, February 2005.

[23] Çağkan Erbaş, System-Level Modeling and Design Space Exploration for Multiprocessor Embedded System-on-Chip Architectures, PhD thesis, Amsterdam University, November 2006.

[24] Celoxica Ltd., HandelC Behavioural Language for FPGA Design, http://www.celoxica.com, 2007.

[25] Tien-Fu Chen, Chia-Ming Hsu and Sun-Rise Wu, Flexible Heterogeneous Multi-core Architectures for Versatile Media Processing Via Customized Long Instruction Words, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 5, pp 659–672, 2005.

[26] Clearspeed, Clearspeed Whitepaper: CSX Processor Architecture, http://www.clearspeed.com/.

[27] Philippe Colantoni, Nabil Boukala and Jérôme Da Rugna, Fast and Accurate Color Image Processing Using 3D Graphics Cards, In Proceedings of the 8th Fall Workshop Vision, Modeling, and Visualization, pp 383–390, November 2003.

[28] Katherine Compton and Scott Hauck, Reconfigurable Computing: A Survey of Systems and Software, ACM Computing Surveys (CSUR), Vol. 34, No. 2, pp 171–210, June 2002.

[29] George A. Constantinides, Peter Y.K. Cheung and Wayne Luk, Wordlength Optimization for Linear Digital Signal Processing, IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems, Vol. 22, No. 10, pp 1432–1442, 2003.

[30] Forum Contributors, General-Purpose Computation Using Graphics Hardware [Forum], http://www.gpgpu.org/.

[31] Ben Cope, Peter Y.K. Cheung and Wayne Luk, Bridging the Gap between FPGAs and Multi-Processor Architectures: A Video Processing Perspective, In Proceedings of the 18th IEEE International Conference on Application-specific Systems, Architectures and Processors, pp 308–313, July 2007.

[32] Ben Cope, Peter Y.K. Cheung, Wayne Luk and Sarah Witt, Have GPUs made FPGAs Redundant in the Field of Video Processing?, In Proceedings of the IEEE International Conference on Field-Programmable Technology, pp 111–118, December 2005.

[33] Jay Cornwall, Olav Beckmann and Paul Kelly, Accelerating a C++ Image Processing Library with a GPU, POHLL 2006: Workshop on Performance Optimization for High-Level Languages and Libraries (collocated with IPDPS06, Rhodes), April 2006.

[34] K. Dale, J. Sheaffer, V. Vijay Kumar, D. Luebke, G. Humphreys and K. Skadron, A Scalable and Reconfigurable Shared-memory Graphics Architecture, In Proceedings of the 2nd International Workshop on Applied Reconfigurable Computing (ARC2006) LNCS 3985, pp 99–108, March 2006.

[35] William Dally, Patrick Hanrahan, Mattan Erez, Timothy J. Knight, Francois Labonte, Jung-Ho Ahn, Newan Jayasena, Ujval J. Kapasi, Abhishek Das, Jayanth Gummaraju and Ian Buck, Merrimac: Supercomputing with Streams, In Proceedings of the ACM/IEEE Conference on Supercomputing, pp 35–42, November 2003.

[36] William Dally, Patrick Hanrahan, Mattan Erez, Timothy J. Knight, Francois Labonte, Jung-Ho Ahn, Newan Jayasena, Ujval J. Kapasi, Abhishek Das, Jayanth Gummaraju and Ian Buck, Low-Voltage Swing Logic Circuits for a Pentium 4 Processor Integer Core, In Proceedings of the 41st IEEE Design Automation Conference, pp 678–680, June 2004.

[37] Andre DeHon, The Density Advantage of Configurable Computing, Computer, Vol. 33, No. 4, pp 41–49, 2000.

[38] Victor Moya del Barrio, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa, ATTILA: A Cycle-level Execution-driven Simulator for Modern GPU Architectures, In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, pp 231–241, March 2006.

[39] Adam Donlin, Axel Braun and Adam Rose, SystemC for the Design and Modeling of Programmable Systems, In Proceedings of the 14th IEEE Conference on Field Programmable Logic and Applications LNCS 3203, pp 811–820, August 2004.

[40] J.N. England, A System for Interactive Modeling of Physical Curved Surface Objects, ACM SIGGRAPH Computer Graphics, Vol. 12, No. 3, pp 336–340, 1978.

[41] Gerald Estrin, Organization of Computer Systems - The Fixed Plus Variable Structure Computer, In Proceedings of the Western Joint Computer Conference, 1960.

[42] Suhaib A. Fahmy, Peter Y.K. Cheung and Wayne Luk, Novel FPGA-Based Implementation of Median and Weighted Median Filters for Image Processing, In Proceedings of the 15th IEEE International Conference on Field-Programmable Logic and Applications, pp 142–147, August 2005.

[43] Jean-Philippe Farrugia, Patrick Horain, Erwan Guehenneux and Yannick Alusse, GPUCV: A Framework for Image Processing Acceleration with Graphics Processors, In Proceedings of the IEEE Conference on Multimedia and Expo, pp 585–588, July 2006.

[44] K. Fatahalian, J. Sugerman and P. Hanarahan, Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication, In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pp 133–137, August 2004.

[45] Randima Fernando and Mark J. Kilgard, The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics, Addison Wesley, 2003.

[46] B. Flachs, S. Asano, S.H. Dhong, P. Hofstee, G. Gervais, R. Kim, T. Le, P. Liu, J. Leenstra, J. Liberty, B. Michael, H. Oh, S.M. Mueller, O. Takahashi, A. Hatakeyama, Y. Watanabe and N. Yano, A Streaming Processing Unit for a CELL Processor, In Proceedings of the IEEE International Conference on Solid-State Circuits, pp 134–135, February 2005.

[47] Oliver Fluck, Shmuel Aharon, Daniel Cremers and Mikael Rousson, GPU Histogram Computation, In Proceedings of the ACM/SIGGRAPH International Conference on Computer Graphics and Interactive Techniques (Research Posters), 2006.

[48] M.J. Flynn, Some Computer Organizations and Their Effectiveness, IEEE Trans- actions on Computers, Vol. C-21, No. 9, pp 948–960, 1972.

[49] Altaf A. Gaffar, Jonathan A. Clarke and George A. Constantinides, PowerBit - Power Aware Arithmetic Bit-Width Optimization, In Proceedings of the IEEE International Conference on Field Programmable Technology, pp 289–292, December 2006.

[50] Dominik Göddeke, Robert Strzodka and Stefan Turek, Performance and Accuracy of Hardware-oriented Native- and Emulated- and Mixed-precision Solvers in FEM Simulations, The International Journal of Parallel, Emergent and Distributed Systems, Vol. 22, No. 4, pp 221–256, August 2007.

[51] Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing: Second Edition, Prentice Hall, 2002.

[52] Naga K. Govindaraju, S. Larsen, J. Gray and D. Manocha, A Memory Model for Scientific Algorithms on Graphics Processors, In Proceedings of the ACM/IEEE Conference on Supercomputing, pp 89–98, November 2006.

[53] Zhi Guo, Walid Najjar, Frank Vahid and Kees Vissers, A Quantitative Analysis of the Speedup Factors of FPGAs over Processors, In Proceedings of the 12th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp 162–170, February 2004.

[54] Mark Harris, Information on the proportion of shader hardware die area in a GeForce 7800 GTX graphics processor, Personal Correspondence with nVidia Staff, 2006.

[55] Christian Haubelt, Joachim Falk, Joachim Keinert, Thomas Schlichter, Martin Streubühr, Andreas Deyhle, Andreas Hadert and Jürgen Teich, A SystemC-Based Design Methodology for Digital Signal Processing Systems, EURASIP Journal on Embedded Systems, Vol. 2007, No. 1, pp 15–37, 2007.

[56] Simon D. Haynes and Peter Y.K. Cheung, A Reconfigurable Multiplier Array for Video Image Processing Tasks, Suitable for Embedding In An FPGA Structure, In Proceedings of the 6th IEEE Symposium on Field-Programmable Custom Computing Machines, pp 226–234, April 1998.

[57] Simon D. Haynes, Peter Y.K. Cheung, Wayne Luk and John Stone, SONIC - A Plug-In Architecture for Video Processing, In Proceedings of the 7th IEEE Symposium on Field-Programmable Custom Computing Machines (LNCS 1673), pp 21–30, April 1999.

[58] Simon D. Haynes, Antonio B. Ferrari and Peter Y.K. Cheung, Flexible Reconfigurable Multiplier Blocks Suitable for Enhancing the Architecture of FPGAs, In Proceedings of the IEEE Custom Integrated Circuit Conference, pp 191–194, May 1999.

[59] K. Scott Hemmert and Keith D. Underwood, An Analysis of the Double-Precision Floating-Point FFT on FPGAs, In Proceedings of the 13th IEEE Symposium on Field-Programmable Custom Computing Machines, pp 702–712, April 2005.

[60] Lee Howes, Olav Beckmann, Oskar Mencer, Oliver Pell and P. Price, Comparing FPGAs to Graphics Accelerators and the Playstation 2 using a Unified Source Description, In Proceedings of the 16th IEEE International Conference on Field-Programmable Logic and Application, pp 119–124, August 2006.

[61] Greg Humphreys, Mike Houston, Ren Ng, Randall Frank, Sean Ahern, Peter D. Kirchner and James T. Klosowski, Chromium: A Stream-processing Framework for Interactive Rendering on Clusters, ACM Transactions on Graphics, Vol. 21, No. 3, pp 693–702, 2002.

[62] Homan Igehy, Matthew Eldridge and Kekoa Proudfoot, Prefetching in a Texture Cache Architecture, In Proceedings of the 1998 Eurographics/SIGGRAPH Workshop on Graphics Hardware, pp 133–142, August 1998.

[63] Impulse: accelerated technologies, ImpulseC: C to VHDL Compiler, http://www.impulsec.com, 2007.

[64] Intel Corp., Intel VTune™ Performance Analyzer for Windows, http://www.intel.com/support/performancetools/vtune/, 2007.

[65] Intel Corp., Intel QuickAssist Technology FSB-FPGA Accelerator Architecture, https://intel.wingateweb.com/us/published/QATS002/QATS002 100b.pdf, 2008.

[66] Intel Corp. Developers Network for PCI Express Architecture, Why PCI Express Architectures for Graphics?, [web-article] http://www.intel.com/technology/pciexpress/devnet/docs/PCIExpressGraphics1.pdf, 2004.

[67] ITRS Working Group, ITRS Report on Process Integration, Devices & Structures, http://www.itrs.net/reports.html, 2003–2006.

[68] ITRS Working Group, ITRS Report on System Drivers, http://www.itrs.net/reports.html, 2005-2006.

[69] Qiwei Jin, David Thomas, Wayne Luk and Benjamin Cope, Exploring Reconfigurable Architectures for Financial Computation, In Proceedings of the 4th International Workshop on Applied Reconfigurable Computing (to Appear) LNCS 4943, March 2008.

[70] Gilles Kahn, The Semantics of a Simple Language for Parallel Programming, In Proceedings of the IFIP Congress, 1974.

[71] Bart Kienhuis, Ed Deprettere, Kees Vissers and Pieter van der Wolf, An Approach for Quantitative Analysis of Application-Specific Dataflow Architectures, In Proceedings of the 8th IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp 338–350, July 1997.

[72] Francis Kelly and Anil Kokaram, Fast Image Interpolation for Motion Estimation Using Graphics Hardware, In Proceedings of the 16th IS&T/SPIE Symposium on Electronic Imaging: Real-Time Imaging VIII, pp 184–194, January 2004.

[73] EJ Kelmelis, JR Humphrey, JP Durbano and FE Ortiz, High-Performance Computing with Desktop , WSEAS Transactions on Mathematics, Vol. 6, No. 1, pp 54–59, January 2007.

[74] Chidamber Kulkarni and Gordon Brebner, Micro-coded Datapaths: Populating the Space Between Finite State Machine and Processor, In Proceedings of the 16th IEEE International Conference on Field-Programmable Logic and Applications, pp 295–300, August 2006.

[75] Ian Kuon and Jonathan Rose, Measuring the Gap between FPGAs and ASICs, In Proceedings of the 14th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp 21–30, February 2006.

[76] Paul Lieverse, Pieter Van Der Wolf, Kees Vissers and Ed Deprettere, A Methodology for Architecture Exploration of Heterogeneous Signal Processing Systems, Journal of VLSI Signal Processing, Vol. 29, No. 3, pp 197–207, 2001.

[77] J.E. Lindholm, J.R. Nickolls, S.S. Moy and B.W. Coon, Register Based Queuing for Texture Requests, United States Patent No. US 7,027,062 B2, 2004.

[78] Qiang Liu, Konstantinos Masselos and George A. Constantinides, Data Reuse Exploration for FPGA Based Platforms Applied to the Full Search Motion Estimation Algorithm, In Proceedings of the 16th IEEE International Conference on Field Programmable Logic and Applications, pp 389–394, August 2006.

[79] Weiguo Liu, Bertil Schmidt and Wolfgang Müller-Wittig, Performance Analysis of General-Purpose Computation on Commodity Hardware: A Case Study Using Bioinformatics, Journal of VLSI Signal Processing, Vol. 48, No. 3, pp 209–221, 2007.

[80] W. Luk, P. Andreou, A. Derbyshire, F. Dupont-De-Dinechin, J. Rice, N. Shirazi and D. Siganos, A Reconfigurable Engine for Real-Time Video Processing, In Proceedings of the 8th International Workshop on Field-Programmable Logic and Applications, From FPGAs to Computing Paradigm (LNCS 1482), pp 169–178, August 1998.

[81] Michael Macedonia, The GPU Enters Computing's Mainstream, Computer, Vol. 36, No. 10, pp 106–108, October 2003.

[82] Tsugio Makimoto, The Rising Wave of Field Programmability, In Proceedings of the 10th International Workshop on Field-Programmable Logic and Applications: The Roadmap to Reconfigurable Computing, (LNCS 1896), Vol. 1986, pp 1–6, August 2000.

[83] Dinesh Manocha, General Purpose Computations Using Graphics Processors, Computer, Vol. 38, No. 8, pp 85–87, 2005.

[84] Michael Manzke, Ross Brennan, Keith O'Conor, John Dingliana and Carol O'Sullivan, A Scalable and Reconfigurable Shared-memory Graphics Architecture, In Proceedings of the 33rd ACM/SIGGRAPH International Conference on Computer Graphics and Interactive Techniques (Sketches, Article No. 182), August 2006.

[85] William R. Mark, R. Steven Glanville, Kurt Akeley and Mark J. Kilgard, Cg: A System for Programming Graphics Hardware in a C-like Language, In Proceedings of the 30th ACM/SIGGRAPH International Conference on Computer Graphics and Interactive Techniques, pp 896–907, August 2003.

[86] MathStar, Field Programmable Object Arrays: Architecture, http://www.mathstar.com/Architecture.php, 2008.

[87] Michael D. McCool, Data-Parallel Programming on the Cell BE and the GPU using the RapidMind Development Platform, Presented at the GSPx Multicore Applications Conference (url: http://rapidmind.net/dprm.pdf), October 2006.

[88] Wim Melis, Peter Y.K. Cheung and Wayne Luk, Autonomous Memory Block for Reconfigurable Computing, In Proceedings of the 37th IEEE International Symposium on Circuits and Systems, Vol. 2, pp 581–584, May 2004.

[89] Microsoft, DirectX 10: Application Programming Interface, http://www.microsoft.com/directx, 2007.

[90] Laurent Moll, Alan Heirich and Mark Shand, Sepia: Scalable 3D Compositing Using PCI Pamette, In Proceedings of the 7th IEEE Symposium on Field-Programmable Custom Computing Machines, pp 146–155, April 1999.

[91] K. Moreland and E. Angel, The FFT on a GPU, In Proceedings of the 18th ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, Vol. 1, pp 112–136, July 2003.

[92] Gareth Morris and Matt Aubury, Design Space Exploration of the European Option Benchmark using Hyperstreams, In Proceedings of the 17th IEEE International Conference on Field-Programmable Logic and Applications, pp 5–10, August 2007.

[93] Gareth W. Morris, George A. Constantinides and Peter Y.K. Cheung, Migrating Functionality from ROMs to Embedded Multipliers, In Proceedings of the 12th IEEE Symposium on Field-Programmable Custom Computing Machines, pp 287–288, April 2004.

[94] Victor Moya del Barrio, Carlos González and Roger Espasa, Shader Performance Analysis on a Modern GPU Architecture, In Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, pp 355–364, November 2005.

[95] Klaus Mueller, Fang Xu and Neophytos Neophytou, Why do Commodity Graphics Hardware Boards (GPUs) Work so well for Acceleration of Computed Tomography?, In Proceedings of the 19th IS&T/SPIE Symposium on Electronic Imaging (Keynote), pp 1–12, January 2007.

[96] Vincent Nollet, Diederik Verkest and Henk Corporaal, A Quick Safari Through the MPSoC Run-Time Management Jungle, In Proceedings of the 5th IEEE/ACM/IFIP Workshop on Embedded Systems for Real-Time Multimedia, pp 41–46, October 2007.

[97] nVidia Corp., Cg Toolkit 1.5, http://developer.download.nvidia.com/cg/Cg 1.5/ 1.5.0023/Cg-1.5 Sep2007 Setup.exe, 2007.

[98] nVidia Corp., nVidia Compute Unified Device Architecture: Programming Guide, http://developer.download.nvidia.com/compute/cuda/1 0/ nVidia CUDA Programming Guide 1.0.pdf, 2007.

[99] nVidia Corp., nVidia nvshaderperf 1.8 Performance Analysis Tool, [Download at] http://developer.nvidia.com/object/nvshaderperf home.html, 2007.

[100] nVidia Corp., Published Technology Size and Release Dates on Company Web page, http://www.nvidia.com, 2007.

[101] Open SystemC Initiative (OSCI), IEEE 1666-2005 SystemC Class Library, http://www.systemc.org, 2005.

[102] Open SystemC Initiative (OSCI), IEEE 1800™ SystemVerilog, http://www.systemverilog.org, 2006.

[103] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn and Timothy J. Purcell, A Survey of General-Purpose Computation on Graphics Hardware, In Proceedings of Eurographics 2005, State of the Art Reports, pp 21–51, August 2005.

[104] Elana Moscu Panainte, Koen Bertels and Stamatis Vassiliadis, Multimedia Reconfigurable Hardware Design Space Exploration, In Proceedings of the 16th IASTED International Conference on Parallel and Distributed Computing and Systems, pp 398–403, November 2004.

[105] Preeti Ranjan Panda, SystemC - A Modeling Platform Supporting Multiple Design Abstractions, In Proceedings of the 14th International Symposium on System Synthesis, pp 75–80, October 2001.

[106] Steve Perry, Design Patterns for Programmable Computing Platforms, Altera: Tech- nical Presentation, 2005.

[107] D. Pham, S. Asano, M. Bolliger, M.N. Day, H.P. Hofstee, C. Johns, J. Kahle, A. Kaneyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki and K. Yazawa, The Design and Implementation of a First-Generation CELL Processor, In Proceedings of the IEEE International Conference on Solid-State Circuits, pp 184–185, February 2005.

[108] Matt Pharr and Randima Fernando, GPU Gems 2, Addison Wesley, 2005.

[109] C. Priem, G. Solanki and D. Kirk, Texture Cache for a Computer Graphics Accelerator, United States Patent No. US 7,136,068 B1, 1998.

[110] Tero Rissa, Adam Donlin and Wayne Luk, Evaluation of SystemC Modelling of Reconfigurable Embedded Systems, In Proceedings of the 8th IEEE Conference on Design, Automation and Test in Europe, Vol. 3, pp 253–258, March 2005.

[111] Samsung, [DataSheet] GDDR3-K4J52324QC, www.samsung.com, 2006.

[112] Patrick Schaumont, Ingrid Verbauwhede, Kurt Keutzer and Majid Sarrafzadeh, A Quick Safari Through the Reconfiguration Jungle, In Proceedings of the 38th ACM/IEEE Conference on Design Automation, pp 172–177, June 2001.

[113] Kai Schleupen, Scott Lekuch, Ryan Mannion, Zhi Guo, Walid Najjar and Frank Vahid, Dynamic Partial FPGA Reconfiguration in a Prototype Microprocessor System, In Proceedings of the 12th IEEE International Conference on Field Programmable Logic and Applications, pp 533–536, August 2007.

[114] Nicolas Peter Sedcole, Reconfigurable Platform-Based Design in FPGAs for Video Image Processing, PhD thesis, Imperial College London, January 2006.

[115] sgi, techpubs library: Appendix D. Colour-Space Conversions (document number: 007-3524-004), http://techpubs.sgi.com/library/dynaweb docs/0650/ SGI Developer/books/DIVO OG/sgi html/figures/RGBcube.gif, 2001.

[116] Jeremy W. Sheaffer, David P. Luebke and Kevin Skadron, A Flexible Simulation Framework for Graphics Architectures, In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pp 85–94, August 2004.

[117] Jeremy W. Sheaffer, David P. Luebke and Kevin Skadron, A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors, In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pp 55–64, August 2007.

[118] Jeremy W. Sheaffer, Kevin Skadron and David P. Luebke, Fine-Grained Graphics Architectural Simulation with QSilver, In Proceedings of the 32nd ACM SIGGRAPH International Conference on Computer Graphics and Interactive Techniques (Poster, Article No. 118), August 2005.

[119] Jeremy W. Sheaffer, Kevin Skadron and David P. Luebke, Studying Thermal Management for Graphics-Processor Architectures, In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, pp 54–65, March 2005.

[120] Yun Q. Shi and Huifang Sun, Image and Video Compression for Multimedia Engineering: Fundamentals, Algorithms and Standards, CRC Press, 1999.

[121] Dongwan Shin, Andreas Gerstlauer, Junyu Peng, Daniel D. Gajski and Rainer Dömer, Automatic Generation of Transaction Level Models for Rapid Design Space Exploration, In Proceedings of the 4th IEEE International Conference on Hardware/Software Codesign and System Synthesis, pp 64–69, October 2006.

[122] Silicon Graphics, OpenGL: Application Programming Interface, http://www.opengl.org, 2007.

[123] Sony Broadcast and Professional Research Labs Europe, Primary Colour Correction Algorithm, Provided in Unoptimised Cg Code, 2005.

[124] Sony Broadcast and Professional Research Labs Europe, Sample High Definition Video Files, Provided in the 720p and 1080i Video Frame Formats, 2005.

[125] PricewaterhouseCoopers Staff, Entertainment & Media Outlook 2007-2011, www.pwc.com/outlook/, 2007.

[126] Texas Instruments Staff, [Datasheet] Digital Media System-on-Chip (DMSoC): TMS320DM6467, http://focus.ti.com/lit/ds/symlink/tms320dm6467.pdf, 2007.

[127] Robert Strzodka and Christoph Garbe, Real-Time Motion Estimation and Visualization on Graphics Cards, In Proceedings of the 15th ACM/IEEE Conference on Visualization, pp 545–552, October 2004.

[128] Ming Y. Sui and Stuart F. Oberman, Multipurpose Functional Unit with Multiply-Add and Logical Test Pipeline, Patent: US 2006/0101243, Assignee: nVidia Corp., 2006.

[129] Sundance, Vendor Published Specification of IP Core, http://www.sundance.com/docs/FC100_user_manual.pdf, 2007.

[130] David Tarditi, Sidd Puri and Jose Oglesby, Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses, Microsoft Technical Report: MSR-TR-2005-184, 2005.

[131] Reed P. Tidwell, Alpha Blending Two Data Streams Using a DSP48 DDR Technique (Xilinx Application Note: XAPP706 (v1.0)), http://www.xilinx.com/publications/solguides/be_01/xc_pdf/p22-25_be1-alpha.pdf, 2005.

[132] Tilera, Product Brief: Tile64™ Processor, http://www.tilera.com/pdf/ProBrief_Tile64_Web.pdf, 2008.

[133] Tim J. Todman, George A. Constantinides, Steve J.E. Wilton, Oskar Mencer, Wayne Luk and Peter Y.K. Cheung, Reconfigurable Computing: Architectures and Design Methods, IEE Proceedings - Computers and Digital Techniques, Vol. 152, No. 2, pp 193–207, 2005.

[134] Keith Underwood, FPGAs vs. CPUs: Trends in Peak Floating-Point Performance, In Proceedings of the 12th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp 171–180, February 2004.

[135] Alexandre Van de Sande, Image from Wikipedia Article: HSL Colour Space, http://en.wikipedia.org/wiki/Image:Color_cones.png, 2005.

[136] Stamatis Vassiliadis et al., Tera-Device Computing and Beyond: Thematic Group 7.

[137] Ingrid Verbauwhede and Patrick Schaumont, The Happy Marriage of Architecture and Application in Next-generation Reconfigurable Systems, In Proceedings of the 1st ACM Conference on Computing Frontiers, pp 363–376, April 2004.

[138] Theerayod Wiangtong, Hardware/Software Partitioning and Scheduling for Reconfigurable Systems, PhD thesis, Imperial College London, February 2004.

[139] Steve J.E. Wilton, C.H. Ho, Philip H.W. Leong, Wayne Luk and Brad Quinton, A Synthesizable Datapath-Oriented Embedded FPGA Fabric, In Proceedings of the 15th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp 33–41, February 2007.

[140] Steve J.E. Wilton, Noha Kafafi, James C.H. Wu, Kimberly A. Bozman, Victor O. Aken'Ova and Resve Saleh, Design Considerations for Soft Embedded Programmable Logic Cores, IEEE Journal of Solid-State Circuits, Vol. 40, No. 2, pp 485–497, 2005.

[141] Wayne Wolf, Alternative Architectures for Video Signal Processing, In Proceedings of the IEEE Computer Society Workshop on VLSI, pp 5–8, April 2000.

[142] T. T. Wong, C. S. Leung, P. A. Heng and J. Wang, Discrete Wavelet Transform on Consumer-Level Graphics Hardware, IEEE Transactions on Multimedia, Vol. 9, No. 3, pp 668–673, 2007.

[143] Xilinx Inc., Celebrating 20 Years of Innovation, Xilinx Xcell Journal, Vol. 48, pp 14–16, 2004.

[144] Xilinx Inc., Virtex-4 Family Overview (DS112 (V1.4)), [Available from] http://direct.xilinx.com/bvdocs/publications/ds112.pdf, June 2005.

[145] Xilinx Inc., Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet (DS083 (V4.4)), [Available from] http://direct.xilinx.com/bvdocs/publications/ds083.pdf, September 2005.

[146] Xilinx Inc., Published Technology Size and Release Dates on Company Web Page, http://www.xilinx.com, 2007.

[147] Xilinx Inc., System Generator for DSP Getting Started Guide, http://www.xilinx.com/support/sw_manuals/sysgen_bklist.pdf, 2007.

[148] Xilinx Inc., Virtex-5 EasyPath Solutions: FPGAs Below Structured ASIC Total Cost, http://www.xilinx.com/company/press/kits/easy_path/V5_EasyPath_SS.pdf, 2007.

[149] Xilinx Inc., Virtex-5 Family Overview (DS100 (V3.4)), [Available from] http://www.xilinx.com/support/documentation/data_sheets/ds100.pdf, December 2007.

[150] Xilinx Inc., Xilinx ISE Help Documentation: FPGA Design Flow Overview, http://toolbox.xilinx.com/docsan/xilinx7/help/iseguide/html/ise_fpga_design_flow_overview.htm, 2007.

[151] Xilinx Inc., XtremeDSP for Virtex-4 FPGAs User Guide UG073 (v2.5), [Available from] http://direct.xilinx.com/bvdocs/userguides/ug073.pdf, 2007.

[152] Xinwei Xue, Arvi Cheryauka and David Tubbs, Acceleration of Fluoro-CT Reconstruction for a Mobile C-Arm on GPU and FPGA Hardware: A Simulation Study, In Proceedings of SPIE Medical Imaging 2006: Physics of Medical Imaging, Vol. 6142, No. 1, pp 1494–1501, 2006.

[153] Sudhakar Yalamanchili, Subramanian Ramaswamy and Gregory Diamos, From Adaptive to Self-Tuned Systems, In The Future of Computing: Essays in Memory of Stamatis Vassiliadis, pp 137–143, TU Delft, September 2007.

[154] Andy Yan and Steve J.E. Wilton, Product-term-based Synthesizable Embedded Programmable Logic Cores, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 14, No. 5, pp 474–488, May 2006.