
Universidad de Córdoba
Departamento de Arquitectura de Computadores, Electrónica y Tecnología Electrónica

Doctoral Thesis

Programming issues for video analysis on Graphics Processing Units

Juan Gómez Luna

Córdoba, February 2012

To my parents, for showing me the way
To Virginia and Nicolás, for walking it with me

Acknowledgements

With the satisfaction of having reached the end, and after all the effort made, it is time to look back and thank everyone who made it possible to reach this goal.

First, to my advisors, Dr. José María González Linares, Dr. José Ignacio Benavides and Dr. Nicolás Guil, for guiding me so well. I feel fortunate to work under such demanding and rigorous conditions.

To my colleagues at the Departamento de Arquitectura de Computadores, Electrónica y Tecnología Electrónica of the Universidad de Córdoba, especially Edmundo Sáez for his help in the early days. Also to my colleagues at the Departamento de Arquitectura de Computadores of the Universidad de Málaga, particularly Juan Lucena, for his skill in solving hardware problems, and Fran, for his guidance in preparing this document.

To the Vicerrectorado de Política Científica and the Vicerrectorado de Estudios de Posgrado y Formación Continua of the Universidad de Córdoba, for all the help provided.

To Professor Walter Stechele, of the Technical University of Munich, for giving me the opportunity to do my research stay there and for easing all the paperwork required for the European Doctorate. Of course, to Holger Endt, of BMW Forschung und Technik. Also to Professors Hans-Joachim Bungartz and Noel O'Connor, for writing the reports on my thesis.

To Nacho, Marisa and Miguel Ángel, for welcoming me so warmly in Málaga.
To my father, Pedro, for being my mentor in the thesis and in everything else; to my mother, Mercedes, for always listening to me and doing everything to help me; and to my brothers, Fernando and Pedro, because I like that we are three different implementations of the same architecture. And, of course, to Loli, for how well she packs my bundle every time I stop by home.

Finally, to Virginia, for being the motivation and the reward, for sharing the good and the bad unconditionally ("Beneath the title will be my name, which translated will mean yours"). And to my little Nicolás, for giving meaning to everything.

Contents

List of Figures
List of Tables

1.- Video analysis on Graphics Processing Units
  1.1 Introduction
    1.1.1 Parallelism as the key for improving computer performance
    1.1.2 Recent evolution of parallel hardware
    1.1.3 Parallel programming models
  1.2 Programming GPUs for general-purpose processing
    1.2.1 A few words on CUDA
    1.2.2 Conditions and bottlenecks for GPU performance
    1.2.3 Generic optimization techniques on GPUs
  1.3 Towards video processing optimization on GPU
    1.3.1 State of the art of video and image processing on GPU
    1.3.2 Efficient mapping of video analysis applications on GPU
    1.3.3 Stream processing paradigm for video analysis on GPU
    1.3.4 Aims of this work
  1.4 Structure of this document

2.- An introduction to GPU computing with CUDA
  2.1 Graphics processing units as general-purpose processors
  2.2 CUDA-enabled devices
  2.3 CUDA programming model
    2.3.1 Thread hierarchy
    2.3.2 Memory hierarchy
  2.4 Hardware implementation
    2.4.1 SIMT architecture and multithreading
    2.4.2 Streaming multiprocessors
    2.4.3 Memory spaces

3.- Target applications
  3.1 Introduction
  3.2 Histogram calculation
    3.2.1 Discussion
  3.3 Egomotion compensation and moving objects detection algorithm
    3.3.1 Discussion
  3.4 The Generalized Hough Transform
    3.4.1 Discussion
  3.5 Conclusions

4.- Highly optimized histogram calculation on GPU
  4.1 Introduction
  4.2 Related work
  4.3 A microbenchmark-based study of the shared memory
    4.3.1 Methodology and initial observations
    4.3.2 Warp access patterns
    4.3.3 Non-atomic access
    4.3.4 Atomic access
  4.4 An optimized approach to histogram generation in shared memory
    4.4.1 Replication
    4.4.2 Padding
    4.4.3 Interleaved read access
  4.5 Experimental evaluation
    4.5.1 Evaluation of the optimization techniques
    4.5.2 Thorough evaluation of our approach and comparison to related works
    4.5.3 Histogram-based kernels for color images
    4.5.4 Discussion
    4.5.5 Evaluation of the R-per-block approach on older GPU generations
  4.6 Experiences with replication in global memory
  4.7 Conclusions

5.- Efficient work distribution
  5.1 Introduction
  5.2 Dealing with sequential parts
    5.2.1 SISD and SIMD computing on the GPU
    5.2.2 Experimental evaluation
  5.3 Re-organizing the workload
    5.3.1 Reducing memory accesses and executed instructions through compaction
    5.3.2 Minimizing warp divergence through sorting
    5.3.3 Experimental evaluation
  5.4 Load balancing versus occupancy maximization
    5.4.1 Applying compaction and sorting to the GHT
    5.4.2 Work distribution among blocks and threads
    5.4.3 Application of the mechanisms
    5.4.4 Experimental evaluation
  5.5 Conclusions

6.- Stream processing on GPU with CUDA streams
  6.1 Introduction
  6.2 CUDA streams
  6.3 Characterizing the behavior of CUDA streams
    6.3.1 A thorough observation of CUDA streams
    6.3.2 CUDA streams performance models
  6.4 Testing the streams with SDK-based applications
    6.4.1 Matrix multiplication
    6.4.2 256-bins histogram
    6.4.3 RGB to grayscale conversion
  6.5 Optimized stream processing with CUDA streams
    6.5.1 Adaptation to variable kernel computation time
  6.6 Conclusions

7.- Conclusions
  7.1 Conclusions and main contributions
  7.2 Publications related to this dissertation
    7.2.1 Publications in conference proceedings
    7.2.2 Publications in journals
    7.2.3 Technical reports
    7.2.4 Articles under review
  7.3 Future research

A.- Summary of the doctoral thesis in Spanish
  A.1 Efficient parallelization of video applications on GPU
  A.2 Stream processing for video analysis on GPU
  A.3 Main contributions
  A.4 Conclusions and future work

Bibliography

List of Figures

1.1 Schematic of heterogeneous architectures
1.2 Blocking/tiling in shared memory
1.3 Thread Level Parallelism vs. Instruction Level Parallelism
1.4 Scatter and gather parallelization
1.5 Global memory organization and addresses
1.6 SISD, SIMD and stream processing
1.7 Programming issues tackled in this dissertation
2.1 Comparison of CPU and GPU architectures
2.2 CUDA programming model
2.3 Streaming multiprocessor in c.c. 1.x
2.4 Streaming multiprocessor in c.c. 2.0
3.1 Parallel histogram calculation
3.2