"Programming Issues for Video Analysis on Graphics Processing

UNIVERSIDAD DE CORDOBA´ Departamento de Arquitectura de Computadores, Electronica´ y Tecnolog´ıa Electronica´ TESIS DOCTORAL Programming issues for video analysis on Graphics Processing Units Juan Gomez´ Luna Cordoba,´ Febrero de 2012 A mis padres, por ensenarme˜ el camino A Virginia y a Nicolas,´ por recorrerlo conmigo Agradecimientos Con la satisfacción de haber llegado al final y después del esfuerzo realizado, hay que mirar atrás para agradecer su apoyo a todos los que han hecho posible la consecución de esta meta. En primer lugar, a mis directores Dr. JoséMar´ıa González Linares, Dr. JoséIgnacio Benavides y Dr. Nicolás Guil, por haberme orientado tan certeramente. Me siento afortunado por trabajar en condiciones de exigencia y rigor. A mis compañeros del Departamento de Arquitectura de Computadores, Electrónica y Tecnolog´ıa Electrónica de la Universidad de Córdoba, en especial a Edmundo Sáez por su ayuda en los inicios. También a los compañeros del Departamento de Arquitectura de Computadores de la Universidad de Málaga, particularmente a Juan Lucena, por su habilidad resolviendo problemas hardware, y a Fran, por sus orientaciones en la preparación de este documento. Al Vicerrectorado de Pol´ıtica Cient´ıfica y al Vicerrectorado de Estudios de Posgrado y Formación Continua de la Universidad de Córdoba, por toda la ayuda prestada. Al profesor Walter Stechele, de la Universidad Técnica de Munich, por darme la posibilidad de hacer la estancia all´ıy por facilitarme todos los trámites para la consecución del Doctorado Europeo. Por supuesto, a Holger Endt, de BMW Forschung und Technik. También, a los profesores Hans- Joachim Bungartz y Noel O’Connor por elaborar los informes sobre mi tesis. A Nacho, Marisa y Miguel Angel,´ por acogerme tan bien en Málaga. A mi padre, Pedro, por ser mi mentor en la tesis y en todo lo demás; a mi madre, Mercedes, por escucharme siempre y por hacer todo por ayudarme; y a mis hermanos, Fernando y Pedro, porque me gusta que seamos tres diferentes implementaciones de la misma arquitectura. Y, por supuesto, a Loli, por lo bien que me prepara el hatillo cada vez que paso por casa. Por último, a Virginia por ser la motivación y la recompensa, por compartir lo bueno y lo malo incondicionalmente (”Bajo el t´ıtulo estarámi nombre, que traducido significaráel tuyo”). Y a mi pequeño Nicolás por darle sentido a todo. Contents List of Figures vii List of Tables ix 1.- Video analysis on Graphics Processing Units 1 1.1 Introduction ....................................... 1 1.1.1 Parallelism as the key for improving computer performance ......... 2 1.1.2 Recent evolution of parallel hardware ..................... 2 1.1.3 Parallel programming models ......................... 4 1.2 Programming GPUs for general-purpose processing .................. 5 1.2.1 A few words on CUDA ............................. 6 1.2.2 Conditions and bottlenecks for GPU performance ............... 6 1.2.3 Generic optimization techniques on GPUs ................... 7 1.3 Towards video processing optimization on GPU .................... 10 1.3.1 State of the art of video and image processing on GPU ............ 11 1.3.2 Efficient mapping of video analysis applications on GPU ........... 12 1.3.3 Stream processing paradigm for video analysis on GPU ............ 13 1.3.4 Aims of this work ................................ 16 1.4 Structure of this document ............................... 16 2.- An introduction to GPU computing with CUDA 19 2.1 Graphics processing units as general-purpose processors ............... 19 2.2 CUDA-enabled devices ................................. 21 2.3 CUDA programming model .............................. 21 2.3.1 Thread hierarchy ................................ 21 2.3.2 Memory hierarchy ............................... 23 2.4 Hardware implementation ............................... 23 i CONTENTS 2.4.1 SIMT architecture and multithreading ..................... 24 2.4.2 Streaming multiprocessors ........................... 24 2.4.3 Memory spaces ................................. 26 3.- Target applications 31 3.1 Introduction ....................................... 31 3.2 Histogram calculation .................................. 32 3.2.1 Discussion ................................... 32 3.3 Egomotion compensation and moving objects detection algorithm .......... 33 3.3.1 Discussion ................................... 38 3.4 The Generalized Hough Transform ........................... 39 3.4.1 Discussion ................................... 42 3.5 Conclusions ....................................... 44 4.- Highly optimized histogram calculation on GPU 47 4.1 Introduction ....................................... 47 4.2 Related work ...................................... 49 4.3 A microbenchmark-based study of the shared memory ................ 50 4.3.1 Methodology and initial observations ..................... 51 4.3.2 Warp access patterns .............................. 52 4.3.3 Non-atomic access ............................... 53 4.3.4 Atomic access ................................. 54 4.4 An optimized approach to histogram generation in shared memory .......... 65 4.4.1 Replication ................................... 66 4.4.2 Padding ..................................... 68 4.4.3 Interleaved read access ............................. 68 4.5 Experimental evaluation ................................ 69 4.5.1 Evaluation of the optimization techniques ................... 70 4.5.2 Thorough evaluation of our approach and comparison to related works . 73 4.5.3 Histogram-based kernels for color images ................... 74 4.5.4 Discussion ................................... 75 4.5.5 Evaluation of the R -per-block approach on older GPU generations ...... 76 4.6 Experiences with replication in global memory .................... 77 4.7 Conclusions ....................................... 77 ii Universidad de Córdoba CONTENTS 5.- Efficient work distribution 81 5.1 Introduction ....................................... 81 5.2 Dealing with sequential parts .............................. 82 5.2.1 SISD and SIMD computing on the GPU .................... 83 5.2.2 Experimental evaluation ............................ 84 5.3 Re-organizing the workload ............................... 85 5.3.1 Reducing memory accesses and executed instructions through compaction . 86 5.3.2 Minimizing warp divergence through sorting ................. 87 5.3.3 Experimental evaluation ............................ 87 5.4 Load balancing versus occupancy maximization .................... 89 5.4.1 Applying compaction and sorting to the GHT ................. 89 5.4.2 Work distribution among blocks and threads .................. 92 5.4.3 Application of the mechanisms ......................... 94 5.4.4 Experimental evaluation ............................ 96 5.5 Conclusions ....................................... 100 6.- Stream processing on GPU with CUDA streams 101 6.1 Introduction ....................................... 101 6.2 CUDA streams ..................................... 103 6.3 Characterizing the behavior of CUDA streams ..................... 105 6.3.1 A thorough observation of CUDA streams ................... 106 6.3.2 CUDA streams performance models ...................... 110 6.4 Testing the streams with SDK-based applications ................... 112 6.4.1 Matrix multiplication .............................. 113 6.4.2 256-bins histogram ............................... 115 6.4.3 RGB to grayscale conversion .......................... 117 6.5 Optimized stream processing with CUDA streams ................... 118 6.5.1 Adaptation to variable kernel computation time ................ 118 6.6 Conclusions ....................................... 120 7.- Conclusions 123 7.1 Conclusions and main contributions .......................... 123 7.2 Publications related to this dissertation ......................... 125 7.2.1 Publications in conference proceedings .................... 126 7.2.2 Publications in journals ............................. 126 Programming issues for video analysis on Graphics Processing Units iii CONTENTS 7.2.3 Technical reports ................................ 127 7.2.4 Articles under review .............................. 127 7.3 Future research ..................................... 128 A.- Resumen de la tesis doctoral en castellano 131 A.1 Paralelización eficiente de las aplicaciones de v´ıdeo en GPU ............. 131 A.2 Stream processing para análisis de v´ıdeo en GPU ................... 133 A.3 Principales aportaciones ................................ 133 A.4 Conclusiones y trabajos futuros ............................. 134 Bibliography 135 iv Universidad de Córdoba List of Figures 1.1 Schematic of heterogeneous architectures ....................... 3 1.2 Blocking/tiling in shared memory ........................... 8 1.3 Thread Level Parallelism vs. Instruction Level Parallelism .............. 8 1.4 Scatter and gather parallelization ............................ 9 1.5 Global memory organization and addresses ...................... 10 1.6 SISD, SIMD and stream processing .......................... 14 1.7 Programming issues tackled in this dissertation .................... 17 2.1 Comparison of CPU and GPU architectures ...................... 20 2.2 CUDA programming model .............................. 23 2.3 Streaming multiprocessor in c.c. 1.x .......................... 25 2.4 Streaming multiprocessor in c.c. 2.0 .......................... 26 3.1 Parallel histogram calculation .............................. 33 3.2

Load more